Message ID: 16f8ae40-c0ae-dd57-346e-f9cacea55038@linux.ibm.com
State: New
Series: IVOPTs consider step cost for different forms when unrolling
Could you go into more detail about this choice of cost calculation? It looks like we first calculate per-group flags, which are true only if the unrolled offsets are valid for all uses in the group. Then we create per-candidate flags when associating candidates with groups.

Instead, couldn't we take this into account in get_address_cost, which calculates the cost of an address use for a given candidate? E.g. after the main if-else at the start of the function, perhaps it would make sense to add the worst-case offset to the address in “parts”, check whether that too is a valid address, and if not, increase var_cost by the cost of one add instruction.

I guess there are two main sources of inexactness if we do that:

(1) It might underestimate the cost because it assumes that vuse[0] stands for all vuses in the group.

(2) It might overestimate the cost because it treats all unrolled iterations as having the cost of the final unrolled iteration.

(1) could perhaps be avoided by adding a flag to the iv_use to say whether it wants this treatment. I think the flag approach suffers from (2) too, and I'd be surprised if it makes a difference in practice.

Thanks,
Richard
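To make the get_address_cost suggestion above concrete, here is a minimal standalone sketch of the idea; it is not GCC's actual get_address_cost code, and the displacement range and add cost below are assumed placeholder values, not target data:

#include <stdbool.h>
#include <stdint.h>

#define ADD_COST 4  /* assumed cost of one integer add */

/* Stand-in for the target's "is this displacement encodable?" check;
   the 16-bit signed range here is only an example.  */
static bool
offset_valid_p (int64_t off)
{
  return off >= -32768 && off <= 32767;
}

/* Model of the suggestion: take the cost already computed for the use,
   then charge one extra add if the worst-case unrolled displacement no
   longer fits the addressing mode.  */
static unsigned
address_use_cost (unsigned var_cost, int64_t cur_off, int64_t step,
                  unsigned unroll_factor)
{
  int64_t worst_off = cur_off + (int64_t) (unroll_factor - 1) * step;
  if (!offset_valid_p (worst_off))
    var_cost += ADD_COST;
  return var_cost;
}

The point of the design is that the penalty is attached to the use/candidate pair whose addressing actually breaks down after unrolling, rather than to the candidate globally.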
Hi Richard,

Thanks for the comments!

on 2020/6/2 1:59 AM, Richard Sandiford wrote:
> Could you go into more detail about this choice of cost calculation?
> It looks like we first calculate per-group flags, which are true only if
> the unrolled offsets are valid for all uses in the group. Then we create
> per-candidate flags when associating candidates with groups.
>

Sure. It checks every address type IV group to determine whether the group is valid to use reg offset addressing mode. Here we only need to check the first use and the last one, since the intermediates should have been handled by split_address_groups. With unrolling, the displacement of the address can be offset by (UF-1)*step, so we check whether the address is still valid with this maximal offset. If the check finds it's valid to use reg offset mode for the whole group, we flag this group. Later, when we create an IV candidate for a flagged address group, we flag the candidate as well. This flag is mainly for iv cand costing: we don't need to scale up the iv cand's step cost for this kind of candidate.

Imagining this loop being unrolled, all the statements will be duplicated UF times. For the cost modeling against an iv group, we scale up the cost by UF (here I simply excluded the compare type since in most cases it is for the loop-ending check). For the cost modeling against an iv candidate, the focus is on step costs: for an iv candidate we flagged before, it's taken as a single step cost; for the others, we scale up the step cost since the unrolling makes the step calculation happen UF times.

This cost modeling is trying to simulate the cost change after the unrolling, scaling up the costs accordingly. There are some things to be improved, like distinguishing the loop-ending compare from the rest, or whether we need to tweak the other costs somehow since the scaling up could unbalance the existing cost framework, but during benchmarking I didn't find these matter, so I kept it as simple as possible for now.

> Instead, couldn't we take this into account in get_address_cost,
> which calculates the cost of an address use for a given candidate?
> E.g. after the main if-else at the start of the function,
> perhaps it would make sense to add the worst-case offset to
> the address in “parts”, check whether that too is a valid address,
> and if not, increase var_cost by the cost of one add instruction.
>

IIUC, what you suggest is to tweak the iv group cost: if we find one address group is valid for reg offset mode, we price the pairs between this group and the other non address-based iv cands higher. The question is how we decide this add-on cost. For the test case I was working on initially, adding one cost (of add) doesn't work, the normal iv still won. We can price it higher, like two, but what's the justification for that value, heuristics?

> I guess there are two main sources of inexactness if we do that:
>
> (1) It might underestimate the cost because it assumes that vuse[0]
> stands for all vuses in the group.
>

Do you mean we don't need one check function like mark_reg_offset_groups? Without it, vuse[0] might not be enough since we can't ensure the others are fine with the additional displacement from unrolling. If we still have it, I think it's fine to just use vuse[0].

> (2) It might overestimate the cost because it treats all unrolled
> iterations as having the cost of the final unrolled iteration.
>
> (1) could perhaps be avoided by adding a flag to the iv_use to say
> whether it wants this treatment. I think the flag approach suffers
> from (2) too, and I'd be surprised if it makes a difference in practice.
>

Sorry, I don't have the whole picture of how to deal with uf in your proposal. But the flag approach considers uf in the iv group cost calculation as well as in the iv cand step cost calculation.

BR,
Kewen

> Thanks,
> Richard
>
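To show what the group check described above amounts to, here is a standalone model of it (not the patch's mark_reg_offset_groups itself; the displacement range is an assumed example rather than real target data):

#include <stdbool.h>
#include <stdint.h>

/* Stand-in for addr_offset_valid_p; the range is an assumed example.  */
static bool
disp_ok (int64_t off)
{
  return off >= -32768 && off <= 32767;
}

/* delta_last is the last use's offset minus the first use's offset;
   intermediate uses are covered because split_address_groups already
   normalized the group.  With unrolling by UF, every displacement can
   grow by up to (UF-1)*step, so both ends are checked with that extra
   amount.  */
static bool
group_reg_offset_ok (int64_t delta_last, int64_t step, unsigned uf)
{
  int64_t max_increase = (int64_t) (uf - 1) * step;
  return disp_ok (max_increase) && disp_ok (delta_last + max_increase);
}

For instance, with a 16-byte step (one V2DF access) and UF = 8, max_increase is 112 bytes, which easily fits a displacement field like the one above, so such a group would be flagged.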
"Kewen.Lin" <linkw@linux.ibm.com> writes: > Hi Richard, > > Thanks for the comments! > > on 2020/6/2 上午1:59, Richard Sandiford wrote: >> Could you go into more detail about this choice of cost calculation? >> It looks like we first calculate per-group flags, which are true only if >> the unrolled offsets are valid for all uses in the group. Then we create >> per-candidate flags when associating candidates with groups. >> > > Sure. It checks every address type IV group to determine whether this > group is valid to use reg offset addressing mode. Here we only need to > check the first one and the last one, since the intermediates should > have been handled by split_address_groups. With unrolling the > displacement of the address can be offset-ed by (UF-1)*step, check the > address with this max offset whether still valid. If the check finds > it's valid to use reg offset mode for the whole group, we flag this > group. Later, when we create IV candidate for address group flagged, > we flag the candidate further. This flag is mainly for iv cand > costing, we don't need to scale up iv cand's step cost for this kind > of candidate. But AIUI, this is calculating whether the uses in their original form support all unrolled offsets. For ivopts, I think the question is really whether the uses support all unrolled offsets when based on a given IV candidate (which might not be the original IV). E.g. there might be another IV candidate at a constant offset from the original one, and the offsets might all be in range for that offset too. > Imagining this loop is being unrolled, all the statements will be > duplicated by UF. For the cost modeling against iv group, it's > scaling up the cost by UF (here I simply excluded the compare_type > since in most cases it for loop ending check). For the cost modeling > against iv candidate, it's to focus on step costs, for an iv candidate > we flagged before, it's taken as one time step cost, for the others, > it's scaling up the step cost since the unrolling make step > calculation become UF times. > > This cost modeling is trying to simulate cost change after the > unrolling, scaling up the costs accordingly. There are somethings > to be improved like distinguish the loop ending compare or else, > whether need to tweak the other costs somehow since the scaling up > probably cause existing cost framework imbalance, but during > benchmarking I didn't find these matter, so take it as simple as > possible for now. > > >> Instead, couldn't we take this into account in get_address_cost, >> which calculates the cost of an address use for a given candidate? >> E.g. after the main if-else at the start of the function, >> perhaps it would make sense to add the worst-case offset to >> the address in “parts”, check whether that too is a valid address, >> and if not, increase var_cost by the cost of one add instruction. >> > > IIUC, what you suggest is to tweak the iv group cost, if we find > one address group is valid for reg offset mode, we price more on > the pairs between this group and other non address-based iv cands. > The question is how do we decide this add-on cost. For the test > case I was working on initially, adding one cost (of add) doesn't > work, the normal iv still wined. We can price it more like two > but what's the justification on this value, by heuristics? Yeah, I was thinking of adding one instance of add_cost. If that doesn't work, it'd be interesting to know why in more detail. 
>> I guess there are two main sources of inexactness if we do that: >> >> (1) It might underestimate the cost because it assumes that vuse[0] >> stands for all vuses in the group. >> > > Do you mean we don't need one check function like mark_reg_offset_groups? > If without it, vuse[0] might be not enough since we can't ensure the > others are fine with additional displacement from unrolling. If we still > have it, I think it's fine to just use vuse[0]. > >> (2) It might overestimates the cost because it treats all unrolled >> iterations as having the cost of the final unrolled iteration. >> >> (1) could perhaps be avoided by adding a flag to the iv_use to say >> whether it wants this treatment. I think the flag approach suffers >> from (2) too, and I'd be surprised if it makes a difference in practice. >> > > Sorry, I didn't have the whole picture how to deal with uf for your proposal. > But the flag approach considers uf in iv group cost calculation as well as > iv cand step cost calculation. > > BR, > Kewen > >> Thanks, >> Richard >>
Hi Richard, on 2020/6/2 下午3:14, Richard Sandiford wrote: > "Kewen.Lin" <linkw@linux.ibm.com> writes: >> Hi Richard, >> >> Thanks for the comments! >> >> on 2020/6/2 上午1:59, Richard Sandiford wrote: >>> Could you go into more detail about this choice of cost calculation? >>> It looks like we first calculate per-group flags, which are true only if >>> the unrolled offsets are valid for all uses in the group. Then we create >>> per-candidate flags when associating candidates with groups. >>> >> >> Sure. It checks every address type IV group to determine whether this >> group is valid to use reg offset addressing mode. Here we only need to >> check the first one and the last one, since the intermediates should >> have been handled by split_address_groups. With unrolling the >> displacement of the address can be offset-ed by (UF-1)*step, check the >> address with this max offset whether still valid. If the check finds >> it's valid to use reg offset mode for the whole group, we flag this >> group. Later, when we create IV candidate for address group flagged, >> we flag the candidate further. This flag is mainly for iv cand >> costing, we don't need to scale up iv cand's step cost for this kind >> of candidate. > > But AIUI, this is calculating whether the uses in their original form > support all unrolled offsets. For ivopts, I think the question is really > whether the uses support all unrolled offsets when based on a given IV > candidate (which might not be the original IV). > Good point! Indeed, the patch only flags the IV cands derived from the address group flagged with reg_offset_p, it has the possibility that we miss some other candidates with same basic but different offset which can satisfy addr_offset_valid_p. How about to update the current approach to: instead of flag the derived iv cand, when we determine the cost for iv cand, we check whether this iv cand has the same basic object as the one of any reg_offset_p vuse[0] (both should be stripped their offsets), then further check the offset can satisfy addr_offset_valid_p, if all checks expected, update the step cost without uf consideration. I would expect this kind of address based iv cand will mainly be used for address type group and compare type group, for the address type group, it can be only applied for those with same basic object, in most cases they are put in the same address group. If it's used for generic much, the step cost tweaking might not be fixed at the beginning but varies according to the iv set members. It looks too much for the existing framework. > E.g. there might be another IV candidate at a constant offset > from the original one, and the offsets might all be in range > for that offset too. > >> Imagining this loop is being unrolled, all the statements will be >> duplicated by UF. For the cost modeling against iv group, it's >> scaling up the cost by UF (here I simply excluded the compare_type >> since in most cases it for loop ending check). For the cost modeling >> against iv candidate, it's to focus on step costs, for an iv candidate >> we flagged before, it's taken as one time step cost, for the others, >> it's scaling up the step cost since the unrolling make step >> calculation become UF times. >> >> This cost modeling is trying to simulate cost change after the >> unrolling, scaling up the costs accordingly. 
>> There are some things to be improved, like distinguishing the loop-ending
>> compare from the rest, or whether we need to tweak the other costs somehow
>> since the scaling up could unbalance the existing cost framework, but
>> during benchmarking I didn't find these matter, so I kept it as simple as
>> possible for now.
>>
>>> Instead, couldn't we take this into account in get_address_cost,
>>> which calculates the cost of an address use for a given candidate?
>>> E.g. after the main if-else at the start of the function,
>>> perhaps it would make sense to add the worst-case offset to
>>> the address in “parts”, check whether that too is a valid address,
>>> and if not, increase var_cost by the cost of one add instruction.
>>>
>>
>> IIUC, what you suggest is to tweak the iv group cost: if we find
>> one address group is valid for reg offset mode, we price the pairs
>> between this group and the other non address-based iv cands higher.
>> The question is how we decide this add-on cost. For the test
>> case I was working on initially, adding one cost (of add) doesn't
>> work, the normal iv still won. We can price it higher, like two,
>> but what's the justification for that value, heuristics?
>
> Yeah, I was thinking of adding one instance of add_cost. If that
> doesn't work, it'd be interesting to know why in more detail.
>

The case is like:

  for (i = 0; i < SIZE; i++)
    y[i] = a * x[i] + z[i];

It has three array accesses in the loop body; after vectorization it looks like:

  vect__1.7_15 = MEM <vector(2) double> [(double *)vectp_x.5_20];
  vect__3.8_13 = vect__1.7_15 * vect_cst__14;
  vect__4.11_19 = MEM <vector(2) double> [(double *)vectp_z.9_7];
  vect__5.12_23 = vect__3.8_13 + vect__4.11_19;
  MEM <vector(2) double> [(double *)vectp_y.13_24] = vect__5.12_23;

We expect to use reg_offset for those vector loads/stores when the unrolling factor is big, but without unrolling, or with a smaller unrolling factor like 2, reg_index should outperform reg_offset.

IIUC, the proposed costing change would look like the following (the left is before the change, the right is after it). The zero costs are for the iv cands based on address objects. Groups 0, 1, 2 are address groups, group 3 is the compare group (omitted). One insn add cost is counted as 4 on ppc64le.

<Group-candidate Costs>:                    \  <Group-candidate Costs>:
Group 0:                                    \  Group 0:
  cand  cost  compl.  inv.expr.  inv.vars   \    cand  cost  compl.  inv.expr.  inv.vars
  1     5     1       1;         NIL;       \    1     9     1       1;         NIL;
  4     1     1       1;         NIL;       \    4     5     1       1;         NIL;
  7     0     1       NIL;       NIL;       \    7     0     1       NIL;       NIL;
  8     0     0       NIL;       NIL;       \    8     0     0       NIL;       NIL;
  9     0     0       NIL;       NIL;       \    9     0     0       NIL;       NIL;
  13    5     1       1;         NIL;       \    13    9     1       1;         NIL;
  14    5     1       3;         NIL;       \    14    9     1       3;         NIL;
                                            \
Group 1:                                    \  Group 1:
  cand  cost  compl.  inv.expr.  inv.vars   \    cand  cost  compl.  inv.expr.  inv.vars
  1     5     1       4;         NIL;       \    1     9     1       4;         NIL;
  3     0     1       NIL;       NIL;       \    3     0     1       NIL;       NIL;
  4     1     1       4;         NIL;       \    4     5     1       4;         NIL;
  5     0     0       NIL;       NIL;       \    5     0     0       NIL;       NIL;
  6     0     0       NIL;       NIL;       \    6     0     0       NIL;       NIL;
  13    5     1       4;         NIL;       \    13    9     1       4;         NIL;
  14    5     1       6;         NIL;       \    14    9     1       6;         NIL;
                                            \
Group 2:                                    \  Group 2:
  cand  cost  compl.  inv.expr.  inv.vars   \    cand  cost  compl.  inv.expr.  inv.vars
  1     5     1       7;         NIL;       \    1     9     1       7;         NIL;
  4     1     1       7;         NIL;       \    4     5     1       7;         NIL;
  10    0     0       NIL;       NIL;       \    10    4     0       NIL;       NIL;
  11    0     0       NIL;       NIL;       \    11    4     0       NIL;       NIL;
  12    0     1       NIL;       NIL;       \    12    4     1       NIL;       NIL;
  13    5     1       7;         NIL;       \    13    9     1       7;         NIL;
  14    5     1       9;         NIL;       \    14    9     1       9;         NIL;

Initial set of candidates:                  \  Initial set of candidates:
  cost: 13 (complexity 3)                   \    cost: 25 (complexity 0)
  reg_cost: 5                               \    reg_cost: 6
  cand_cost: 5                              \    cand_cost: 15
  cand_group_cost: 3 (complexity 3)         \    cand_group_cost: 4 (complexity 0)
  candidates: 4                             \    candidates: 5, 8, 10
   group:0 --> iv_cand:4, cost=(1,1)        \     group:0 --> iv_cand:8, cost=(0,0)
   group:1 --> iv_cand:4, cost=(1,1)        \     group:1 --> iv_cand:5, cost=(0,0)
   group:2 --> iv_cand:4, cost=(1,1)        \     group:2 --> iv_cand:10, cost=(4,0)
   group:3 --> iv_cand:4, cost=(0,0)        \     group:3 --> iv_cand:5, cost=(0,0)
  invariant variables:                      \    invariant variables:
  invariant expressions: 1, 4, 7            \    invariant expressions:
                                            \
Original cost 21 (complexity 0)             \  Original cost 25 (complexity 0)
                                            \
Final cost 13 (complexity 3)                \  Final cost 25 (complexity 0)

When the unrolling factor is 8, we expect to use reg offset modes for groups 0, 1, 2, and this proposal works well. But when the unrolling factor is 2, we expect to still use reg index modes for groups 0, 1, 2 (with iv_cand 4 as before), while this proposal looks stuck with iv_cand 8, 5, 10. It looks like we would over-blame the other non reg offset candidates.

Sorry that I have no idea how to leverage uf in this proposal, but it shifts the focus to the iv_group cost, which seems not easy to scale up or down with good justification.

BR,
Kewen

>>> I guess there are two main sources of inexactness if we do that:
>>>
>>> (1) It might underestimate the cost because it assumes that vuse[0]
>>> stands for all vuses in the group.
>>>
>>
>> Do you mean we don't need one check function like mark_reg_offset_groups?
>> Without it, vuse[0] might not be enough since we can't ensure the others
>> are fine with the additional displacement from unrolling. If we still
>> have it, I think it's fine to just use vuse[0].
>>
>>> (2) It might overestimate the cost because it treats all unrolled
>>> iterations as having the cost of the final unrolled iteration.
>>>
>>> (1) could perhaps be avoided by adding a flag to the iv_use to say
>>> whether it wants this treatment. I think the flag approach suffers
>>> from (2) too, and I'd be surprised if it makes a difference in practice.
>>>
>>
>> Sorry, I don't have the whole picture of how to deal with uf in your
>> proposal. But the flag approach considers uf in the iv group cost
>> calculation as well as in the iv cand step cost calculation.
>>
>> BR,
>> Kewen
>>
>>> Thanks,
>>> Richard
>>>
Hi Kewen, Sorry for the late reply. The patch's most important change is below cost computation: > @@ -5890,6 +5973,10 @@ determine_iv_cost (struct ivopts_data *data, struct iv_cand *cand) > cost_step = add_cost (data->speed, TYPE_MODE (TREE_TYPE (base))); > cost = cost_step + adjust_setup_cost (data, cost_base.cost); > > + /* Consider additional step updates during unrolling. */ > + if (data->consider_reg_offset_for_unroll_p && !cand->reg_offset_p) > + cost += (data->current_loop->estimated_unroll - 1) * cost_step; This is a bit strange, to me the add instructions are additional computation caused by unrolling+addressing_mode, rather than a native part in candidate itself. Specifically, an additional cost is needed if candidates (without reg_offset_p) are chosen for the address type group/uses. > + > /* Prefer the original ivs unless we may gain something by replacing it. > The reason is to make debugging simpler; so this is not relevant for > artificial ivs created by other optimization passes. */ > > @@ -3654,6 +3729,14 @@ set_group_iv_cost (struct ivopts_data *data, > return; > } > > + /* Since we priced more on non reg_offset IV cand step cost, we should scale > + up the appropriate IV group costs. Simply consider USE_COMPARE at the > + loop exit, FIXME if multiple exits supported or no loop exit comparisons > + matter. */ > + if (data->consider_reg_offset_for_unroll_p > + && group->vuses[0]->type != USE_COMPARE) > + cost *= (HOST_WIDE_INT) data->current_loop->estimated_unroll; Not quite follow here, giving "pricing more on on-reg_offset IV cand" doesn't make much sense to me. Also why generic type uses are not skipped? We want to model the cost required for address computation, however, for generic type uses there is no way to save the computation in "address expression". Once unrolled, the computation is always there? And what's the impact on targets supporting [base + index + offset] addressing mode? Given the patch is not likely to harm because rtl loop unrolling is (or was?) by default disabled, so I am OK once the above comments are addressed. I wonder if it's possible to get 10% of (all which should be unrolled) loops unrolled (conservatively) on gimple and enable it by default at O3, rather than teaching ivopts to model a future pass which not likely be used outside of benchmarks? Thanks, bin > + > if (data->consider_all_candidates) > { > group->cost_map[cand->id].cand = cand; On Thu, May 28, 2020 at 8:24 PM Kewen.Lin <linkw@linux.ibm.com> wrote: > > > gcc/ChangeLog > > 2020-MM-DD Kewen Lin <linkw@gcc.gnu.org> > > * tree-ssa-loop-ivopts.c (struct iv_group): New field reg_offset_p. > (struct iv_cand): New field reg_offset_p. > (struct ivopts_data): New field consider_reg_offset_for_unroll_p. > (dump_groups): Dump group with reg_offset_p. > (record_group): Initialize reg_offset_p. > (mark_reg_offset_groups): New function. > (find_interesting_uses): Call mark_reg_offset_groups. > (add_candidate_1): Update reg_offset_p if derived from reg_offset_p group. > (set_group_iv_cost): Scale up group cost with estimate_unroll_factor if > consider_reg_offset_for_unroll_p. > (determine_iv_cost): Increase step cost with estimate_unroll_factor if > consider_reg_offset_for_unroll_p. > (tree_ssa_iv_optimize_loop): Call estimate_unroll_factor, update > consider_reg_offset_for_unroll_p. > > ----
Hi Bin, Thanks for the review!! on 2020/8/8 下午4:01, Bin.Cheng wrote: > Hi Kewen, > Sorry for the late reply. > The patch's most important change is below cost computation: > >> @@ -5890,6 +5973,10 @@ determine_iv_cost (struct ivopts_data *data, struct iv_cand *cand) >> cost_step = add_cost (data->speed, TYPE_MODE (TREE_TYPE (base))); >> cost = cost_step + adjust_setup_cost (data, cost_base.cost); >> >> + /* Consider additional step updates during unrolling. */ >> + if (data->consider_reg_offset_for_unroll_p && !cand->reg_offset_p) >> + cost += (data->current_loop->estimated_unroll - 1) * cost_step; > This is a bit strange, to me the add instructions are additional > computation caused by unrolling+addressing_mode, rather than a native > part in candidate itself. Specifically, an additional cost is needed > if candidates (without reg_offset_p) are chosen for the address type > group/uses. Good point, ideally it should be one additional cost for each cand set, when we select one cand for one group, we need to check this pair need more (estimated_unroll - 1) step costs, we probably need to care about this during remove/replace etc. IIUC the current IVOPTs cost framework doesn't support this and it could increase the selection complexity and time. I hesitated to do it and put it to cand step cost initially instead. I was thinking those candidates with reg_offset_p should be only used for those reg_offset_p groups in most cases (very limited) meanwhile the others are simply scaled up like before. But indeed this can cover some similar cases like one cand is only used for the compare type group which is for loop closing, then it doesn't need more step costs for unrolling. Do you prefer me to improve the current cost framework? >> + >> /* Prefer the original ivs unless we may gain something by replacing it. >> The reason is to make debugging simpler; so this is not relevant for >> artificial ivs created by other optimization passes. */ >> > >> @@ -3654,6 +3729,14 @@ set_group_iv_cost (struct ivopts_data *data, >> return; >> } >> >> + /* Since we priced more on non reg_offset IV cand step cost, we should scale >> + up the appropriate IV group costs. Simply consider USE_COMPARE at the >> + loop exit, FIXME if multiple exits supported or no loop exit comparisons >> + matter. */ >> + if (data->consider_reg_offset_for_unroll_p >> + && group->vuses[0]->type != USE_COMPARE) >> + cost *= (HOST_WIDE_INT) data->current_loop->estimated_unroll; > Not quite follow here, giving "pricing more on on-reg_offset IV cand" > doesn't make much sense to me. Also why generic type uses are not > skipped? We want to model the cost required for address computation, > however, for generic type uses there is no way to save the computation > in "address expression". Once unrolled, the computation is always > there? > The main intention is to scale up the group/cand cost for unrolling since we have scaled up the step costs. The assumption is that the original costing (without this patch) can be viewed as either for all unrolled iterations or just one single iteration. Since IVOPTs doesn't support fractional costing, I interpreted it as single iterations, to emulate unrolling scenario based on the previous step cost scaling, we need to scale up the cost for all computation. In most cases, the compare type use is for loop closing, there is still only one computation even unrolling happens, so I excluded it here. 
As "FIXME", if we find some cases are off, we can further restrict it to those USE_COMPARE uses which is exactly for loop closing. > And what's the impact on targets supporting [base + index + offset] > addressing mode? Good question, I didn't notice it since power doesn't support it. I noticed the comments of function addr_offset_valid_p only mentioning [base + offset], I guess it excludes [base + index + offset]? But I guess the address-based IV can work for this mode? > > Given the patch is not likely to harm because rtl loop unrolling is > (or was?) by default disabled, so I am OK once the above comments are > addressed. > Yes, it needs explicit unrolling options, excepting for some targets wants to enable it for some cases with specific loop_unroll_adjust checks. > I wonder if it's possible to get 10% of (all which should be unrolled) > loops unrolled (conservatively) on gimple and enable it by default at > O3, rather than teaching ivopts to model a future pass which not > likely be used outside of benchmarks? > Yeah, it would be nice if the unrolling happen before IVOPTs and won't have any future unrollings to get it off. PR[1] seems to have some discussion on gimple unrolling. Richi suggested driven-by-need gimple unrolling in the previous discussion[2] on the RFC of this. [1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760 [2] https://gcc.gnu.org/pipermail/gcc-patches/2020-January/537645.html BR, Kewen
On Mon, Aug 10, 2020 at 12:27 PM Kewen.Lin <linkw@linux.ibm.com> wrote: > > Hi Bin, > > Thanks for the review!! > > on 2020/8/8 下午4:01, Bin.Cheng wrote: > > Hi Kewen, > > Sorry for the late reply. > > The patch's most important change is below cost computation: > > > >> @@ -5890,6 +5973,10 @@ determine_iv_cost (struct ivopts_data *data, struct iv_cand *cand) > >> cost_step = add_cost (data->speed, TYPE_MODE (TREE_TYPE (base))); > >> cost = cost_step + adjust_setup_cost (data, cost_base.cost); > >> > >> + /* Consider additional step updates during unrolling. */ > >> + if (data->consider_reg_offset_for_unroll_p && !cand->reg_offset_p) > >> + cost += (data->current_loop->estimated_unroll - 1) * cost_step; > > This is a bit strange, to me the add instructions are additional > > computation caused by unrolling+addressing_mode, rather than a native > > part in candidate itself. Specifically, an additional cost is needed > > if candidates (without reg_offset_p) are chosen for the address type > > group/uses. > > Good point, ideally it should be one additional cost for each cand set, > when we select one cand for one group, we need to check this pair need > more (estimated_unroll - 1) step costs, we probably need to care about > this during remove/replace etc. IIUC the current IVOPTs cost framework > doesn't support this and it could increase the selection complexity and > time. I hesitated to do it and put it to cand step cost initially instead. > > I was thinking those candidates with reg_offset_p should be only used for > those reg_offset_p groups in most cases (very limited) meanwhile the others > are simply scaled up like before. But indeed this can cover some similar > cases like one cand is only used for the compare type group which is for > loop closing, then it doesn't need more step costs for unrolling. > > Do you prefer me to improve the current cost framework? No, I don't think it's relevant to the candidate selecting algorithm. I was thinking about adjusting cost somehow in determine_group_iv_cost_address. Given we don't expose selected addressing mode in this function, you may need to do it in get_address_cost, either way. > > >> + > >> /* Prefer the original ivs unless we may gain something by replacing it. > >> The reason is to make debugging simpler; so this is not relevant for > >> artificial ivs created by other optimization passes. */ > >> > > > >> @@ -3654,6 +3729,14 @@ set_group_iv_cost (struct ivopts_data *data, > >> return; > >> } > >> > >> + /* Since we priced more on non reg_offset IV cand step cost, we should scale > >> + up the appropriate IV group costs. Simply consider USE_COMPARE at the > >> + loop exit, FIXME if multiple exits supported or no loop exit comparisons > >> + matter. */ > >> + if (data->consider_reg_offset_for_unroll_p > >> + && group->vuses[0]->type != USE_COMPARE) > >> + cost *= (HOST_WIDE_INT) data->current_loop->estimated_unroll; > > Not quite follow here, giving "pricing more on on-reg_offset IV cand" > > doesn't make much sense to me. Also why generic type uses are not > > skipped? We want to model the cost required for address computation, > > however, for generic type uses there is no way to save the computation > > in "address expression". Once unrolled, the computation is always > > there? > > > > The main intention is to scale up the group/cand cost for unrolling since > we have scaled up the step costs. The assumption is that the original If we adjust cost appropriately in function *group_iv_cost_address, this would become unnecessary, right? 
And naturally. > costing (without this patch) can be viewed as either for all unrolled > iterations or just one single iteration. Since IVOPTs doesn't support > fractional costing, I interpreted it as single iterations, to emulate > unrolling scenario based on the previous step cost scaling, we need to > scale up the cost for all computation. > > In most cases, the compare type use is for loop closing, there is still > only one computation even unrolling happens, so I excluded it here. > As "FIXME", if we find some cases are off, we can further restrict it to > those USE_COMPARE uses which is exactly for loop closing. > > > And what's the impact on targets supporting [base + index + offset] > > addressing mode? > > Good question, I didn't notice it since power doesn't support it. > I noticed the comments of function addr_offset_valid_p only mentioning > [base + offset], I guess it excludes [base + index + offset]? > But I guess the address-based IV can work for this mode? No, addr_offset_valid_p is only used to split address use groups. See get_address_cost and struct mem_address. I forgot to ask, what about target only supports [base + offset] addressing mode like RISC-V? I would expect it's not affected at all. > > > > > Given the patch is not likely to harm because rtl loop unrolling is > > (or was?) by default disabled, so I am OK once the above comments are > > addressed. > > > > Yes, it needs explicit unrolling options, excepting for some targets > wants to enable it for some cases with specific loop_unroll_adjust checks. > > > I wonder if it's possible to get 10% of (all which should be unrolled) > > loops unrolled (conservatively) on gimple and enable it by default at > > O3, rather than teaching ivopts to model a future pass which not > > likely be used outside of benchmarks? > > > > Yeah, it would be nice if the unrolling happen before IVOPTs and won't > have any future unrollings to get it off. PR[1] seems to have some > discussion on gimple unrolling. Thanks for directing me to the discussion. I am on Wilco's side on this problem, IMHO, It might be useful getting small loops unrolled (at O3 by default) by simply saving induction variable stepping and exit condition check, which can be modeled on gimple. Especially for RISC-V, it doesn't support index addressing mode, which means there will be as many induction variables as distinct arrays. Also interleaving after unrolling is not that important, it's at high chance that small loops eligible for interleaving are handled by vectorizer already. Thanks, bin > > Richi suggested driven-by-need gimple unrolling in the previous discussion[2] > on the RFC of this. > > [1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760 > [2] https://gcc.gnu.org/pipermail/gcc-patches/2020-January/537645.html > > BR, > Kewen
Hi Bin, on 2020/8/10 下午8:38, Bin.Cheng wrote: > On Mon, Aug 10, 2020 at 12:27 PM Kewen.Lin <linkw@linux.ibm.com> wrote: >> >> Hi Bin, >> >> Thanks for the review!! >> >> on 2020/8/8 下午4:01, Bin.Cheng wrote: >>> Hi Kewen, >>> Sorry for the late reply. >>> The patch's most important change is below cost computation: >>> >>>> @@ -5890,6 +5973,10 @@ determine_iv_cost (struct ivopts_data *data, struct iv_cand *cand) >>>> cost_step = add_cost (data->speed, TYPE_MODE (TREE_TYPE (base))); >>>> cost = cost_step + adjust_setup_cost (data, cost_base.cost); >>>> >>>> + /* Consider additional step updates during unrolling. */ >>>> + if (data->consider_reg_offset_for_unroll_p && !cand->reg_offset_p) >>>> + cost += (data->current_loop->estimated_unroll - 1) * cost_step; >>> This is a bit strange, to me the add instructions are additional >>> computation caused by unrolling+addressing_mode, rather than a native >>> part in candidate itself. Specifically, an additional cost is needed >>> if candidates (without reg_offset_p) are chosen for the address type >>> group/uses. >> >> Good point, ideally it should be one additional cost for each cand set, >> when we select one cand for one group, we need to check this pair need >> more (estimated_unroll - 1) step costs, we probably need to care about >> this during remove/replace etc. IIUC the current IVOPTs cost framework >> doesn't support this and it could increase the selection complexity and >> time. I hesitated to do it and put it to cand step cost initially instead. >> >> I was thinking those candidates with reg_offset_p should be only used for >> those reg_offset_p groups in most cases (very limited) meanwhile the others >> are simply scaled up like before. But indeed this can cover some similar >> cases like one cand is only used for the compare type group which is for >> loop closing, then it doesn't need more step costs for unrolling. >> >> Do you prefer me to improve the current cost framework? > No, I don't think it's relevant to the candidate selecting algorithm. > I was thinking about adjusting cost somehow in > determine_group_iv_cost_address. Given we don't expose selected > addressing mode in this function, you may need to do it in > get_address_cost, either way. > Thanks for your suggestion! Sorry, I may miss something, but I still think the additional cost is per candidate. The justification is that we miss to model the iv candidate step well in the context of unrolling, the step cost is part of candidate cost, which is per candidate. To initialize it in determine_iv_cost isn't perfect as you pointed out, ideally we should check any uses of the candidate requires iv update after each replicated iteration, and take extra step costs into account if at least one needs, meanwhile scaling up all the computation cost to reflect unrolling cost nature. Besides, the reg_offset desirable pair already takes zero cost for cand/group cost, IIRC negative cost isn't preferred in IVOPTs, are you suggesting increasing the cost for non reg_offset pairs? If so and per pair, the extra cost looks possible to be computed several times unexpectedly. >> >>>> + >>>> /* Prefer the original ivs unless we may gain something by replacing it. >>>> The reason is to make debugging simpler; so this is not relevant for >>>> artificial ivs created by other optimization passes. 
*/ >>>> >>> >>>> @@ -3654,6 +3729,14 @@ set_group_iv_cost (struct ivopts_data *data, >>>> return; >>>> } >>>> >>>> + /* Since we priced more on non reg_offset IV cand step cost, we should scale >>>> + up the appropriate IV group costs. Simply consider USE_COMPARE at the >>>> + loop exit, FIXME if multiple exits supported or no loop exit comparisons >>>> + matter. */ >>>> + if (data->consider_reg_offset_for_unroll_p >>>> + && group->vuses[0]->type != USE_COMPARE) >>>> + cost *= (HOST_WIDE_INT) data->current_loop->estimated_unroll; >>> Not quite follow here, giving "pricing more on on-reg_offset IV cand" >>> doesn't make much sense to me. Also why generic type uses are not >>> skipped? We want to model the cost required for address computation, >>> however, for generic type uses there is no way to save the computation >>> in "address expression". Once unrolled, the computation is always >>> there? >>> >> >> The main intention is to scale up the group/cand cost for unrolling since >> we have scaled up the step costs. The assumption is that the original > If we adjust cost appropriately in function *group_iv_cost_address, > this would become unnecessary, right? And naturally. >> costing (without this patch) can be viewed as either for all unrolled >> iterations or just one single iteration. Since IVOPTs doesn't support >> fractional costing, I interpreted it as single iterations, to emulate >> unrolling scenario based on the previous step cost scaling, we need to >> scale up the cost for all computation. >> >> In most cases, the compare type use is for loop closing, there is still >> only one computation even unrolling happens, so I excluded it here. >> As "FIXME", if we find some cases are off, we can further restrict it to >> those USE_COMPARE uses which is exactly for loop closing. >> >>> And what's the impact on targets supporting [base + index + offset] >>> addressing mode? >> >> Good question, I didn't notice it since power doesn't support it. >> I noticed the comments of function addr_offset_valid_p only mentioning >> [base + offset], I guess it excludes [base + index + offset]? >> But I guess the address-based IV can work for this mode? > No, addr_offset_valid_p is only used to split address use groups. See > get_address_cost and struct mem_address. > I forgot to ask, what about target only supports [base + offset] > addressing mode like RISC-V? I would expect it's not affected at all. > addr_offset_valid_p is also used in this patch as Richard S. and Segher suggested to check the offset after unrolling (like: offset+(uf-1)*step) is still valid for the target. If address-based IV can not work for [base + index + offset], it won't affect [base + index + offset]. It can help all targets which supports [base + offset], so I'm afraid it can affect RISC-V too, but I would expect it's positive. Or do you happen to see some potential issues? or have some concerns? And as Richard S. suggested before, it has one parameter to control, target can simply disable this if it dislikes. >> >>> >>> Given the patch is not likely to harm because rtl loop unrolling is >>> (or was?) by default disabled, so I am OK once the above comments are >>> addressed. >>> >> >> Yes, it needs explicit unrolling options, excepting for some targets >> wants to enable it for some cases with specific loop_unroll_adjust checks. 
>> >>> I wonder if it's possible to get 10% of (all which should be unrolled) >>> loops unrolled (conservatively) on gimple and enable it by default at >>> O3, rather than teaching ivopts to model a future pass which not >>> likely be used outside of benchmarks? >>> >> >> Yeah, it would be nice if the unrolling happen before IVOPTs and won't >> have any future unrollings to get it off. PR[1] seems to have some >> discussion on gimple unrolling. > Thanks for directing me to the discussion. I am on Wilco's side on > this problem, IMHO, It might be useful getting small loops unrolled > (at O3 by default) by simply saving induction variable stepping and > exit condition check, which can be modeled on gimple. Especially for > RISC-V, it doesn't support index addressing mode, which means there > will be as many induction variables as distinct arrays. Also > interleaving after unrolling is not that important, it's at high > chance that small loops eligible for interleaving are handled by > vectorizer already. > Good idea, CC Jeff since I think Jeff (Jiufu) has been working to see whether we can bring in one gimple unrolling pass. Regardless we have/don't have it, if the RTL unrolling is there, I guess this patch set is still beneficial? If so, I would take it separately. BR, Kewen
On Mon, Aug 10, 2020 at 10:41 PM Kewen.Lin <linkw@linux.ibm.com> wrote: > > Hi Bin, > > on 2020/8/10 下午8:38, Bin.Cheng wrote: > > On Mon, Aug 10, 2020 at 12:27 PM Kewen.Lin <linkw@linux.ibm.com> wrote: > >> > >> Hi Bin, > >> > >> Thanks for the review!! > >> > >> on 2020/8/8 下午4:01, Bin.Cheng wrote: > >>> Hi Kewen, > >>> Sorry for the late reply. > >>> The patch's most important change is below cost computation: > >>> > >>>> @@ -5890,6 +5973,10 @@ determine_iv_cost (struct ivopts_data *data, struct iv_cand *cand) > >>>> cost_step = add_cost (data->speed, TYPE_MODE (TREE_TYPE (base))); > >>>> cost = cost_step + adjust_setup_cost (data, cost_base.cost); > >>>> > >>>> + /* Consider additional step updates during unrolling. */ > >>>> + if (data->consider_reg_offset_for_unroll_p && !cand->reg_offset_p) > >>>> + cost += (data->current_loop->estimated_unroll - 1) * cost_step; > >>> This is a bit strange, to me the add instructions are additional > >>> computation caused by unrolling+addressing_mode, rather than a native > >>> part in candidate itself. Specifically, an additional cost is needed > >>> if candidates (without reg_offset_p) are chosen for the address type > >>> group/uses. > >> > >> Good point, ideally it should be one additional cost for each cand set, > >> when we select one cand for one group, we need to check this pair need > >> more (estimated_unroll - 1) step costs, we probably need to care about > >> this during remove/replace etc. IIUC the current IVOPTs cost framework > >> doesn't support this and it could increase the selection complexity and > >> time. I hesitated to do it and put it to cand step cost initially instead. > >> > >> I was thinking those candidates with reg_offset_p should be only used for > >> those reg_offset_p groups in most cases (very limited) meanwhile the others > >> are simply scaled up like before. But indeed this can cover some similar > >> cases like one cand is only used for the compare type group which is for > >> loop closing, then it doesn't need more step costs for unrolling. > >> > >> Do you prefer me to improve the current cost framework? > > No, I don't think it's relevant to the candidate selecting algorithm. > > I was thinking about adjusting cost somehow in > > determine_group_iv_cost_address. Given we don't expose selected > > addressing mode in this function, you may need to do it in > > get_address_cost, either way. > > > > Thanks for your suggestion! > > Sorry, I may miss something, but I still think the additional cost is > per candidate. The justification is that we miss to model the iv > candidate step well in the context of unrolling, the step cost is part > of candidate cost, which is per candidate. > > To initialize it in determine_iv_cost isn't perfect as you pointed out, > ideally we should check any uses of the candidate requires iv update > after each replicated iteration, and take extra step costs into account > if at least one needs, meanwhile scaling up all the computation cost to > reflect unrolling cost nature. I see, it's similar to the auto-increment case where cost should be recorded only once. So this is okay given 1) fine predicting rtl-unroll is likely impossible here; 2) the patch has very limited impact. Thanks, bin > > Besides, the reg_offset desirable pair already takes zero cost for > cand/group cost, IIRC negative cost isn't preferred in IVOPTs, are you > suggesting increasing the cost for non reg_offset pairs? If so and per > pair, the extra cost looks possible to be computed several times > unexpectedly. 
> > >> > >>>> + > >>>> /* Prefer the original ivs unless we may gain something by replacing it. > >>>> The reason is to make debugging simpler; so this is not relevant for > >>>> artificial ivs created by other optimization passes. */ > >>>> > >>> > >>>> @@ -3654,6 +3729,14 @@ set_group_iv_cost (struct ivopts_data *data, > >>>> return; > >>>> } > >>>> > >>>> + /* Since we priced more on non reg_offset IV cand step cost, we should scale > >>>> + up the appropriate IV group costs. Simply consider USE_COMPARE at the > >>>> + loop exit, FIXME if multiple exits supported or no loop exit comparisons > >>>> + matter. */ > >>>> + if (data->consider_reg_offset_for_unroll_p > >>>> + && group->vuses[0]->type != USE_COMPARE) > >>>> + cost *= (HOST_WIDE_INT) data->current_loop->estimated_unroll; > >>> Not quite follow here, giving "pricing more on on-reg_offset IV cand" > >>> doesn't make much sense to me. Also why generic type uses are not > >>> skipped? We want to model the cost required for address computation, > >>> however, for generic type uses there is no way to save the computation > >>> in "address expression". Once unrolled, the computation is always > >>> there? > >>> > >> > >> The main intention is to scale up the group/cand cost for unrolling since > >> we have scaled up the step costs. The assumption is that the original > > If we adjust cost appropriately in function *group_iv_cost_address, > > this would become unnecessary, right? And naturally. > >> costing (without this patch) can be viewed as either for all unrolled > >> iterations or just one single iteration. Since IVOPTs doesn't support > >> fractional costing, I interpreted it as single iterations, to emulate > >> unrolling scenario based on the previous step cost scaling, we need to > >> scale up the cost for all computation. > >> > >> In most cases, the compare type use is for loop closing, there is still > >> only one computation even unrolling happens, so I excluded it here. > >> As "FIXME", if we find some cases are off, we can further restrict it to > >> those USE_COMPARE uses which is exactly for loop closing. > >> > >>> And what's the impact on targets supporting [base + index + offset] > >>> addressing mode? > >> > >> Good question, I didn't notice it since power doesn't support it. > >> I noticed the comments of function addr_offset_valid_p only mentioning > >> [base + offset], I guess it excludes [base + index + offset]? > >> But I guess the address-based IV can work for this mode? > > No, addr_offset_valid_p is only used to split address use groups. See > > get_address_cost and struct mem_address. > > I forgot to ask, what about target only supports [base + offset] > > addressing mode like RISC-V? I would expect it's not affected at all. > > > addr_offset_valid_p is also used in this patch as Richard S. and Segher > suggested to check the offset after unrolling (like: offset+(uf-1)*step) > is still valid for the target. If address-based IV can not work for > [base + index + offset], it won't affect [base + index + offset]. > > It can help all targets which supports [base + offset], so I'm afraid > it can affect RISC-V too, but I would expect it's positive. Or do you > happen to see some potential issues? or have some concerns? > > And as Richard S. suggested before, it has one parameter to control, > target can simply disable this if it dislikes. > > >> > >>> > >>> Given the patch is not likely to harm because rtl loop unrolling is > >>> (or was?) 
by default disabled, so I am OK once the above comments are > >>> addressed. > >>> > >> > >> Yes, it needs explicit unrolling options, excepting for some targets > >> wants to enable it for some cases with specific loop_unroll_adjust checks. > >> > >>> I wonder if it's possible to get 10% of (all which should be unrolled) > >>> loops unrolled (conservatively) on gimple and enable it by default at > >>> O3, rather than teaching ivopts to model a future pass which not > >>> likely be used outside of benchmarks? > >>> > >> > >> Yeah, it would be nice if the unrolling happen before IVOPTs and won't > >> have any future unrollings to get it off. PR[1] seems to have some > >> discussion on gimple unrolling. > > Thanks for directing me to the discussion. I am on Wilco's side on > > this problem, IMHO, It might be useful getting small loops unrolled > > (at O3 by default) by simply saving induction variable stepping and > > exit condition check, which can be modeled on gimple. Especially for > > RISC-V, it doesn't support index addressing mode, which means there > > will be as many induction variables as distinct arrays. Also > > interleaving after unrolling is not that important, it's at high > > chance that small loops eligible for interleaving are handled by > > vectorizer already. > > > > Good idea, CC Jeff since I think Jeff (Jiufu) has been working to see > whether we can bring in one gimple unrolling pass. Regardless we > have/don't have it, if the RTL unrolling is there, I guess this patch > set is still beneficial? If so, I would take it separately. > > BR, > Kewen
diff --git a/gcc/tree-ssa-loop-ivopts.c b/gcc/tree-ssa-loop-ivopts.c index 1d2697ae1ba..1b7e4621f37 100644 --- a/gcc/tree-ssa-loop-ivopts.c +++ b/gcc/tree-ssa-loop-ivopts.c @@ -432,6 +432,8 @@ struct iv_group struct iv_cand *selected; /* To indicate this is a doloop use group. */ bool doloop_p; + /* To indicate this group is reg_offset valid. */ + bool reg_offset_p; /* Uses in the group. */ vec<struct iv_use *> vuses; }; @@ -473,6 +475,7 @@ struct iv_cand struct iv *orig_iv; /* The original iv if this cand is added from biv with smaller type. */ bool doloop_p; /* Whether this is a doloop candidate. */ + bool reg_offset_p; /* Derived from one reg_offset valid group. */ }; /* Hashtable entry for common candidate derived from iv uses. */ @@ -653,6 +656,10 @@ struct ivopts_data /* Whether the loop has doloop comparison use. */ bool doloop_use_p; + + /* Whether need to consider register offset addressing mode for the loop with + upcoming unrolling by estimated unroll factor. */ + bool consider_reg_offset_for_unroll_p; }; /* An assignment of iv candidates to uses. */ @@ -840,6 +847,11 @@ dump_groups (FILE *file, struct ivopts_data *data) gcc_assert (group->type == USE_COMPARE); fprintf (file, " Type:\tCOMPARE\n"); } + if (group->reg_offset_p) + { + gcc_assert (address_p (group->type)); + fprintf (file, " reg_offset_p: true\n"); + } for (j = 0; j < group->vuses.length (); j++) dump_use (file, group->vuses[j]); } @@ -1582,6 +1594,7 @@ record_group (struct ivopts_data *data, enum use_type type) group->related_cands = BITMAP_ALLOC (NULL); group->vuses.create (1); group->doloop_p = false; + group->reg_offset_p = false; data->vgroups.safe_push (group); return group; @@ -2731,6 +2744,60 @@ split_address_groups (struct ivopts_data *data) } } +/* Go through all address type groups, check and mark reg_offset addressing mode + valid groups. */ + +static void +mark_reg_offset_groups (struct ivopts_data *data) +{ + class loop *loop = data->current_loop; + gcc_assert (data->current_loop->estimated_unroll > 1); + bool any_reg_offset_p = false; + + for (unsigned i = 0; i < data->vgroups.length (); i++) + { + struct iv_group *group = data->vgroups[i]; + if (address_p (group->type)) + { + struct iv_use *head_use = group->vuses[0]; + if (!tree_fits_poly_int64_p (head_use->iv->step)) + continue; + + bool found = true; + poly_int64 step = tree_to_poly_int64 (head_use->iv->step); + /* Max extra offset to fill for head of group. */ + poly_int64 max_increase = (loop->estimated_unroll - 1) * step; + /* Check whether this increment still valid. */ + if (!addr_offset_valid_p (head_use, max_increase)) + found = false; + + unsigned group_size = group->vuses.length (); + /* Check the whole group further. */ + if (group_size > 1) + { + /* Only need to check the last one in the group, both the head and + the last is valid, the others should be fine. */ + struct iv_use *last_use = group->vuses[group_size - 1]; + poly_int64 max_delta + = last_use->addr_offset - head_use->addr_offset; + poly_int64 max_offset = max_delta + max_increase; + if (maybe_ne (max_delta, 0) + && !addr_offset_valid_p (head_use, max_offset)) + found = false; + } + + if (found) + { + group->reg_offset_p = true; + any_reg_offset_p = true; + } + } + } + + if (!any_reg_offset_p) + data->consider_reg_offset_for_unroll_p = false; +} + /* Finds uses of the induction variables that are interesting. 
*/ static void @@ -2762,6 +2829,9 @@ find_interesting_uses (struct ivopts_data *data) split_address_groups (data); + if (data->consider_reg_offset_for_unroll_p) + mark_reg_offset_groups (data); + if (dump_file && (dump_flags & TDF_DETAILS)) { fprintf (dump_file, "\n<IV Groups>:\n"); @@ -3147,6 +3217,7 @@ add_candidate_1 (struct ivopts_data *data, tree base, tree step, bool important, cand->important = important; cand->incremented_at = incremented_at; cand->doloop_p = doloop; + cand->reg_offset_p = false; data->vcands.safe_push (cand); if (!poly_int_tree_p (step)) @@ -3183,7 +3254,11 @@ add_candidate_1 (struct ivopts_data *data, tree base, tree step, bool important, /* Relate candidate to the group for which it is added. */ if (use) - bitmap_set_bit (data->vgroups[use->group_id]->related_cands, i); + { + bitmap_set_bit (data->vgroups[use->group_id]->related_cands, i); + if (data->vgroups[use->group_id]->reg_offset_p) + cand->reg_offset_p = true; + } return cand; } @@ -3654,6 +3729,14 @@ set_group_iv_cost (struct ivopts_data *data, return; } + /* Since we priced more on non reg_offset IV cand step cost, we should scale + up the appropriate IV group costs. Simply consider USE_COMPARE at the + loop exit, FIXME if multiple exits supported or no loop exit comparisons + matter. */ + if (data->consider_reg_offset_for_unroll_p + && group->vuses[0]->type != USE_COMPARE) + cost *= (HOST_WIDE_INT) data->current_loop->estimated_unroll; + if (data->consider_all_candidates) { group->cost_map[cand->id].cand = cand; @@ -5890,6 +5973,10 @@ determine_iv_cost (struct ivopts_data *data, struct iv_cand *cand) cost_step = add_cost (data->speed, TYPE_MODE (TREE_TYPE (base))); cost = cost_step + adjust_setup_cost (data, cost_base.cost); + /* Consider additional step updates during unrolling. */ + if (data->consider_reg_offset_for_unroll_p && !cand->reg_offset_p) + cost += (data->current_loop->estimated_unroll - 1) * cost_step; + /* Prefer the original ivs unless we may gain something by replacing it. The reason is to make debugging simpler; so this is not relevant for artificial ivs created by other optimization passes. */ @@ -7976,6 +8063,7 @@ tree_ssa_iv_optimize_loop (struct ivopts_data *data, class loop *loop, data->current_loop = loop; data->loop_loc = find_loop_location (loop).get_location_t (); data->speed = optimize_loop_for_speed_p (loop); + data->consider_reg_offset_for_unroll_p = false; if (dump_file && (dump_flags & TDF_DETAILS)) { @@ -8008,6 +8096,16 @@ tree_ssa_iv_optimize_loop (struct ivopts_data *data, class loop *loop, if (!find_induction_variables (data)) goto finish; + if (param_iv_consider_reg_offset_for_unroll != 0 && exit) + { + tree_niter_desc *desc = niter_for_exit (data, exit); + estimate_unroll_factor (loop, desc); + data->consider_reg_offset_for_unroll_p = loop->estimated_unroll > 1; + if (dump_file && (dump_flags & TDF_DETAILS) + && data->consider_reg_offset_for_unroll_p) + fprintf (dump_file, "estimated_unroll:%u\n", loop->estimated_unroll); + } + /* Finds interesting uses (item 1). */ find_interesting_uses (data); if (data->vgroups.length () > MAX_CONSIDERED_GROUPS)
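For reference, a loop of the shape discussed in the thread that could be used to exercise the patch. The hunks above only show the gating on param_iv_consider_reg_offset_for_unroll, not the corresponding --param name, so the options mentioned in the comment are an assumption about how the series would be driven:

/* Assuming the series wires up a --param for
   param_iv_consider_reg_offset_for_unroll, compiling with something like
   -O2 -ftree-vectorize -funroll-loops -fdump-tree-ivopts-details on a
   suitable target should show the new "estimated_unroll:" and
   "reg_offset_p: true" lines in the ivopts dump.  */
#define SIZE 1024
double x[SIZE], y[SIZE], z[SIZE];

void
daxpy_like (double a)
{
  for (int i = 0; i < SIZE; i++)
    y[i] = a * x[i] + z[i];
}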