Message ID | a2a9189d-d114-7d07-3b3e-1a4d13613ef1@linux.ibm.com |
---|---|
Headers | show |
Series | IVOPTs consider step cost for different forms when unrolling | expand |
On Thu, 28 May 2020, Kewen.Lin wrote: > Hi, > > This is one repost and you can refer to the original series > via https://gcc.gnu.org/pipermail/gcc-patches/2020-January/538360.html. > > As we discussed in the thread > https://gcc.gnu.org/ml/gcc-patches/2020-01/msg00196.html > Original: https://gcc.gnu.org/ml/gcc-patches/2020-01/msg00104.html, > I'm working to teach IVOPTs to consider D-form group access during unrolling. > The difference on D-form and other forms during unrolling is we can put the > stride into displacement field to avoid additional step increment. eg: > > With X-form (uf step increment): > ... > LD A = baseA, X > LD B = baseB, X > ST C = baseC, X > X = X + stride > LD A = baseA, X > LD B = baseB, X > ST C = baseC, X > X = X + stride > LD A = baseA, X > LD B = baseB, X > ST C = baseC, X > X = X + stride > ... > > With D-form (one step increment for each base): > ... > LD A = baseA, OFF > LD B = baseB, OFF > ST C = baseC, OFF > LD A = baseA, OFF+stride > LD B = baseB, OFF+stride > ST C = baseC, OFF+stride > LD A = baseA, OFF+2*stride > LD B = baseB, OFF+2*stride > ST C = baseC, OFF+2*stride > ... > baseA += stride * uf > baseB += stride * uf > baseC += stride * uf > > Imagining that if the loop get unrolled by 8 times, then 3 step updates with > D-form vs. 8 step updates with X-form. Here we only need to check stride > meet D-form field requirement, since if OFF doesn't meet, we can construct > baseA' with baseA + OFF. I'd just mention there are other targets that have the choice between the above forms. Since IVOPTs itself does not perform the unrolling the IL it produces is the same, correct? Richard. > This patch set consists four parts: > > [PATCH 1/4] unroll: Add middle-end unroll factor estimation > > Add unroll factor estimation in middle-end. It mainly refers to current > RTL unroll factor determination in function decide_unrolling and its > sub calls. As Richi suggested, we probably can force unroll factor > with this and avoid duplicate unroll factor calculation, but I think it > need more benchmarking work and should be handled separately. > > [PATCH 2/4] param: Introduce one param to control unroll factor > > As Richard and Segher's suggestion, I used addr_offset_valid_p for the > addressing mode, rather than one target hook. As Richard's suggestion, > it introduces one parameter to control this IVOPTs consideration and > further tweaking [3/4] on top of unroll factor estimation [1/4]. > > [PATCH 3/4] ivopts: Consider cost_step on different forms during unrolling > > Teach IVOPTs to mark the IV cand as reg_offset_p which is derived from > one address IV type group where the whole group is valid to use reg_offset > mode. Then scaling up the IV cand step cost by (uf - 1) for no > reg_offset_p IV cands, here the uf is one estimated unroll factor [1/4]. > > [PATCH 4/4] rs6000: P9 D-form test cases > > Add some test cases, mainly copied from Kelvin's patch. This is approved > by Segher if the whole series is fine. > > > Many thanks to Richard and Segher on previous version reviews. > > Bootstrapped and regress tested on powerpc64le-linux-gnu. > > Any comments are highly appreciated! Thanks in advance! > > > BR, > Kewen > > ------- > > gcc/cfgloop.h | 3 ++ > gcc/config/i386/i386-options.c | 6 +++ > gcc/config/s390/s390.c | 6 +++ > gcc/doc/invoke.texi | 9 +++++ > gcc/params.opt | 4 ++ > gcc/tree-ssa-loop-ivopts.c | 100 ++++++++++++++++++++++++++++++++++++++++++++++- > gcc/tree-ssa-loop-manip.c | 253 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ > gcc/tree-ssa-loop-manip.h | 3 +- > gcc/tree-ssa-loop.c | 33 ++++++++++++++++ > gcc/tree-ssa-loop.h | 2 + > 10 files changed, 416 insertions(+), 3 deletions(-) > >
Hi Richi, on 2020/6/2 下午7:38, Richard Biener wrote: > On Thu, 28 May 2020, Kewen.Lin wrote: > >> Hi, >> >> This is one repost and you can refer to the original series >> via https://gcc.gnu.org/pipermail/gcc-patches/2020-January/538360.html. >> >> As we discussed in the thread >> https://gcc.gnu.org/ml/gcc-patches/2020-01/msg00196.html >> Original: https://gcc.gnu.org/ml/gcc-patches/2020-01/msg00104.html, >> I'm working to teach IVOPTs to consider D-form group access during unrolling. >> The difference on D-form and other forms during unrolling is we can put the >> stride into displacement field to avoid additional step increment. eg: >> >> With X-form (uf step increment): >> ... >> LD A = baseA, X >> LD B = baseB, X >> ST C = baseC, X >> X = X + stride >> LD A = baseA, X >> LD B = baseB, X >> ST C = baseC, X >> X = X + stride >> LD A = baseA, X >> LD B = baseB, X >> ST C = baseC, X >> X = X + stride >> ... >> >> With D-form (one step increment for each base): >> ... >> LD A = baseA, OFF >> LD B = baseB, OFF >> ST C = baseC, OFF >> LD A = baseA, OFF+stride >> LD B = baseB, OFF+stride >> ST C = baseC, OFF+stride >> LD A = baseA, OFF+2*stride >> LD B = baseB, OFF+2*stride >> ST C = baseC, OFF+2*stride >> ... >> baseA += stride * uf >> baseB += stride * uf >> baseC += stride * uf >> >> Imagining that if the loop get unrolled by 8 times, then 3 step updates with >> D-form vs. 8 step updates with X-form. Here we only need to check stride >> meet D-form field requirement, since if OFF doesn't meet, we can construct >> baseA' with baseA + OFF. > > I'd just mention there are other targets that have the choice between > the above forms. Since IVOPTs itself does not perform the unrolling > the IL it produces is the same, correct? > Yes. Before this patch, IVOPTs doesn't consider the unrolling impacts, it only models things based on what it sees. We can assume it thinks later RTL unrolling won't perform. With this patch, since the IV choice probably changes, the IL can probably change. The typical difference with this patch is: vect__1.7_15 = MEM[symbol: x, index: ivtmp.19_22, offset: 0B]; vs. vect__1.7_15 = MEM[base: _29, offset: 0B]; BR, Kewen > Richard. >
On Wed, 3 Jun 2020, Kewen.Lin wrote: > Hi Richi, > > on 2020/6/2 下午7:38, Richard Biener wrote: > > On Thu, 28 May 2020, Kewen.Lin wrote: > > > >> Hi, > >> > >> This is one repost and you can refer to the original series > >> via https://gcc.gnu.org/pipermail/gcc-patches/2020-January/538360.html. > >> > >> As we discussed in the thread > >> https://gcc.gnu.org/ml/gcc-patches/2020-01/msg00196.html > >> Original: https://gcc.gnu.org/ml/gcc-patches/2020-01/msg00104.html, > >> I'm working to teach IVOPTs to consider D-form group access during unrolling. > >> The difference on D-form and other forms during unrolling is we can put the > >> stride into displacement field to avoid additional step increment. eg: > >> > >> With X-form (uf step increment): > >> ... > >> LD A = baseA, X > >> LD B = baseB, X > >> ST C = baseC, X > >> X = X + stride > >> LD A = baseA, X > >> LD B = baseB, X > >> ST C = baseC, X > >> X = X + stride > >> LD A = baseA, X > >> LD B = baseB, X > >> ST C = baseC, X > >> X = X + stride > >> ... > >> > >> With D-form (one step increment for each base): > >> ... > >> LD A = baseA, OFF > >> LD B = baseB, OFF > >> ST C = baseC, OFF > >> LD A = baseA, OFF+stride > >> LD B = baseB, OFF+stride > >> ST C = baseC, OFF+stride > >> LD A = baseA, OFF+2*stride > >> LD B = baseB, OFF+2*stride > >> ST C = baseC, OFF+2*stride > >> ... > >> baseA += stride * uf > >> baseB += stride * uf > >> baseC += stride * uf > >> > >> Imagining that if the loop get unrolled by 8 times, then 3 step updates with > >> D-form vs. 8 step updates with X-form. Here we only need to check stride > >> meet D-form field requirement, since if OFF doesn't meet, we can construct > >> baseA' with baseA + OFF. > > > > I'd just mention there are other targets that have the choice between > > the above forms. Since IVOPTs itself does not perform the unrolling > > the IL it produces is the same, correct? > > > Yes. Before this patch, IVOPTs doesn't consider the unrolling impacts, > it only models things based on what it sees. We can assume it thinks > later RTL unrolling won't perform. > > With this patch, since the IV choice probably changes, the IL can probably > change. The typical difference with this patch is: > > vect__1.7_15 = MEM[symbol: x, index: ivtmp.19_22, offset: 0B]; > vs. > vect__1.7_15 = MEM[base: _29, offset: 0B]; So we're asking IVOPTS "if we were unrolling this loop would you make a different IV choice?" thus I wonder why we need so much complexity here? That is, if we can classify the loop as being possibly unrolled we could evaluate IVOPTs IV choice (and overall cost) on the original loop and in a second run on the original loop with fake IV uses added with extra offset. If the overall IV cost is similar we'll take the unroll friendly choice if the costs are way different (I wouldn't expect this to be the case ever?) I'd side with the IV choice when not unrolling (and mark the loop as to be not unrolled). Thus I'd err on the side of not unrolling but leave the ultimate choice of whether to unroll to RTL unless IV cost makes that prohibitive. Even without X- or D- form addressing modes the IV choice may differ and I think we don't need extra knobs for the unroller but instead can decide to set the existing n_unroll to zero (force not unroll) when costs say it would be bad? Richard. > BR, > Kewen > > > Richard. > > >
on 2020/6/3 下午3:07, Richard Biener wrote: > On Wed, 3 Jun 2020, Kewen.Lin wrote: > >> Hi Richi, >> >> on 2020/6/2 下午7:38, Richard Biener wrote: >>> On Thu, 28 May 2020, Kewen.Lin wrote: >>> >>>> Hi, >>>> >>>> This is one repost and you can refer to the original series >>>> via https://gcc.gnu.org/pipermail/gcc-patches/2020-January/538360.html. >>>> >>>> As we discussed in the thread >>>> https://gcc.gnu.org/ml/gcc-patches/2020-01/msg00196.html >>>> Original: https://gcc.gnu.org/ml/gcc-patches/2020-01/msg00104.html, >>>> I'm working to teach IVOPTs to consider D-form group access during unrolling. >>>> The difference on D-form and other forms during unrolling is we can put the >>>> stride into displacement field to avoid additional step increment. eg: >>>> >>>> With X-form (uf step increment): >>>> ... >>>> LD A = baseA, X >>>> LD B = baseB, X >>>> ST C = baseC, X >>>> X = X + stride >>>> LD A = baseA, X >>>> LD B = baseB, X >>>> ST C = baseC, X >>>> X = X + stride >>>> LD A = baseA, X >>>> LD B = baseB, X >>>> ST C = baseC, X >>>> X = X + stride >>>> ... >>>> >>>> With D-form (one step increment for each base): >>>> ... >>>> LD A = baseA, OFF >>>> LD B = baseB, OFF >>>> ST C = baseC, OFF >>>> LD A = baseA, OFF+stride >>>> LD B = baseB, OFF+stride >>>> ST C = baseC, OFF+stride >>>> LD A = baseA, OFF+2*stride >>>> LD B = baseB, OFF+2*stride >>>> ST C = baseC, OFF+2*stride >>>> ... >>>> baseA += stride * uf >>>> baseB += stride * uf >>>> baseC += stride * uf >>>> >>>> Imagining that if the loop get unrolled by 8 times, then 3 step updates with >>>> D-form vs. 8 step updates with X-form. Here we only need to check stride >>>> meet D-form field requirement, since if OFF doesn't meet, we can construct >>>> baseA' with baseA + OFF. >>> >>> I'd just mention there are other targets that have the choice between >>> the above forms. Since IVOPTs itself does not perform the unrolling >>> the IL it produces is the same, correct? >>> >> Yes. Before this patch, IVOPTs doesn't consider the unrolling impacts, >> it only models things based on what it sees. We can assume it thinks >> later RTL unrolling won't perform. >> >> With this patch, since the IV choice probably changes, the IL can probably >> change. The typical difference with this patch is: >> >> vect__1.7_15 = MEM[symbol: x, index: ivtmp.19_22, offset: 0B]; >> vs. >> vect__1.7_15 = MEM[base: _29, offset: 0B]; > > So we're asking IVOPTS "if we were unrolling this loop would you make > a different IV choice?" thus I wonder why we need so much complexity > here? I would describe it more like "we are going to unroll this loop with unroll factor uf in RTL, would you consider this variable when modeling?" In most cases, one single iteration is representative for the unrolled body, so it doesn't matter considering unrolling or not. But for the case here, it's not true, expected reg_offset iv cand can make iv cand step cost reduced, it leads the difference. > That is, if we can classify the loop as being possibly unrolled > we could evaluate IVOPTs IV choice (and overall cost) on the original > loop and in a second run on the original loop with fake IV uses > added with extra offset. If the overall IV cost is similar we'll > take the unroll friendly choice if the costs are way different > (I wouldn't expect this to be the case ever?) I'd side with the > IV choice when not unrolling (and mark the loop as to be not unrolled). > Could you elaborate it a bit? I guess it won't estimate the unroll factor here, just guess it's to be unrolled or not? The second run with fake IV uses added with extra offset sounds like scaling up the iv group cost by uf. > Thus I'd err on the side of not unrolling but leave the ultimate choice > of whether to unroll to RTL unless IV cost makes that prohibitive. > > Even without X- or D- form addressing modes the IV choice may differ > and I think we don't need extra knobs for the unroller but instead > can decide to set the existing n_unroll to zero (force not unroll) > when costs say it would be bad? Yes, even without x- or d- form addressing, the difference probably comes from compare type IV use for loop ending, maybe more cases which I am not aware of. But I don't see people care about it, probably the impact is small. IIUC what you stated here looks like to use ivopts information for unrolling factor decision, I think this is a separate direction, do we have this kind of case where ivopts costs can foresee the unrolling? Now the unroll factor estimation can be used for other optimization passes if they are wondering future unrolling factor decision, as discussed it sounds a good idea to override the n_unroll with some benchmarking. BR, Kewen > > Richard. > >> BR, >> Kewen >> >>> Richard. >>> >> >
On Wed, 3 Jun 2020, Kewen.Lin wrote: > on 2020/6/3 下午3:07, Richard Biener wrote: > > On Wed, 3 Jun 2020, Kewen.Lin wrote: > > > >> Hi Richi, > >> > >> on 2020/6/2 下午7:38, Richard Biener wrote: > >>> On Thu, 28 May 2020, Kewen.Lin wrote: > >>> > >>>> Hi, > >>>> > >>>> This is one repost and you can refer to the original series > >>>> via https://gcc.gnu.org/pipermail/gcc-patches/2020-January/538360.html. > >>>> > >>>> As we discussed in the thread > >>>> https://gcc.gnu.org/ml/gcc-patches/2020-01/msg00196.html > >>>> Original: https://gcc.gnu.org/ml/gcc-patches/2020-01/msg00104.html, > >>>> I'm working to teach IVOPTs to consider D-form group access during unrolling. > >>>> The difference on D-form and other forms during unrolling is we can put the > >>>> stride into displacement field to avoid additional step increment. eg: > >>>> > >>>> With X-form (uf step increment): > >>>> ... > >>>> LD A = baseA, X > >>>> LD B = baseB, X > >>>> ST C = baseC, X > >>>> X = X + stride > >>>> LD A = baseA, X > >>>> LD B = baseB, X > >>>> ST C = baseC, X > >>>> X = X + stride > >>>> LD A = baseA, X > >>>> LD B = baseB, X > >>>> ST C = baseC, X > >>>> X = X + stride > >>>> ... > >>>> > >>>> With D-form (one step increment for each base): > >>>> ... > >>>> LD A = baseA, OFF > >>>> LD B = baseB, OFF > >>>> ST C = baseC, OFF > >>>> LD A = baseA, OFF+stride > >>>> LD B = baseB, OFF+stride > >>>> ST C = baseC, OFF+stride > >>>> LD A = baseA, OFF+2*stride > >>>> LD B = baseB, OFF+2*stride > >>>> ST C = baseC, OFF+2*stride > >>>> ... > >>>> baseA += stride * uf > >>>> baseB += stride * uf > >>>> baseC += stride * uf > >>>> > >>>> Imagining that if the loop get unrolled by 8 times, then 3 step updates with > >>>> D-form vs. 8 step updates with X-form. Here we only need to check stride > >>>> meet D-form field requirement, since if OFF doesn't meet, we can construct > >>>> baseA' with baseA + OFF. > >>> > >>> I'd just mention there are other targets that have the choice between > >>> the above forms. Since IVOPTs itself does not perform the unrolling > >>> the IL it produces is the same, correct? > >>> > >> Yes. Before this patch, IVOPTs doesn't consider the unrolling impacts, > >> it only models things based on what it sees. We can assume it thinks > >> later RTL unrolling won't perform. > >> > >> With this patch, since the IV choice probably changes, the IL can probably > >> change. The typical difference with this patch is: > >> > >> vect__1.7_15 = MEM[symbol: x, index: ivtmp.19_22, offset: 0B]; > >> vs. > >> vect__1.7_15 = MEM[base: _29, offset: 0B]; > > > > So we're asking IVOPTS "if we were unrolling this loop would you make > > a different IV choice?" thus I wonder why we need so much complexity > > here? > > I would describe it more like "we are going to unroll this loop with > unroll factor uf in RTL, would you consider this variable when modeling?" > > In most cases, one single iteration is representative for the unrolled > body, so it doesn't matter considering unrolling or not. But for the > case here, it's not true, expected reg_offset iv cand can make iv cand > step cost reduced, it leads the difference. > > > That is, if we can classify the loop as being possibly unrolled > > we could evaluate IVOPTs IV choice (and overall cost) on the original > > loop and in a second run on the original loop with fake IV uses > > added with extra offset. If the overall IV cost is similar we'll > > take the unroll friendly choice if the costs are way different > > (I wouldn't expect this to be the case ever?) I'd side with the > > IV choice when not unrolling (and mark the loop as to be not unrolled). > > > > Could you elaborate it a bit? I guess it won't estimate the unroll > factor here, just guess it's to be unrolled or not? The second run > with fake IV uses added with extra offset sounds like scaling up the > iv group cost by uf. From your example above the D-form (MEM[symbol: x, index: ivtmp.19_22, offset: 0B]) is preferable since in the unrolled variant we have the same addres but with a different constant offset for the unroll copies while the second form would have to update the 'base' IV. Thus I think the difference in IV cost and decision should already show up if we, for each USE add a USE with an added constant offset. This might be what your patch does with that extra flag on the USEs, I was suggesting to model the USEs more explicitely, simulating a 2-way unroll. I think in the end I'll defer to Bin here who knows the code best. > > Thus I'd err on the side of not unrolling but leave the ultimate choice > > of whether to unroll to RTL unless IV cost makes that prohibitive. > > > > Even without X- or D- form addressing modes the IV choice may differ > > and I think we don't need extra knobs for the unroller but instead > > can decide to set the existing n_unroll to zero (force not unroll) > > when costs say it would be bad? > > Yes, even without x- or d- form addressing, the difference probably comes > from compare type IV use for loop ending, maybe more cases which I am not > aware of. But I don't see people care about it, probably the impact is > small. > > IIUC what you stated here looks like to use ivopts information for unrolling > factor decision, I think this is a separate direction, do we have this > kind of case where ivopts costs can foresee the unrolling? > > Now the unroll factor estimation can be used for other optimization passes > if they are wondering future unrolling factor decision, as discussed it > sounds a good idea to override the n_unroll with some benchmarking. I didnt' suggest to use IVOPTs to determine the unroll factor. In fact your patch looks like it does this? Instead I wanted to make IVOPTs choose a set of IVs that is best for a blend of both worlds - use D-form when it doesn't hurt the not unrolled code [much], and X-form when the D-form is way worse (for whatever reason) and signal that to the unroller (but we could chose to not do that). The real issue is of course we're applying IV decision to a not final loop. > BR, > Kewen > > > > > Richard. > > > >> BR, > >> Kewen > >> > >>> Richard. > >>> > >> > > >
on 2020/6/3 下午5:27, Richard Biener wrote: > On Wed, 3 Jun 2020, Kewen.Lin wrote: > >> on 2020/6/3 下午3:07, Richard Biener wrote: >>> On Wed, 3 Jun 2020, Kewen.Lin wrote: >>> >>>> Hi Richi, >>>> snip ... >>>>> >>>>> I'd just mention there are other targets that have the choice between >>>>> the above forms. Since IVOPTs itself does not perform the unrolling >>>>> the IL it produces is the same, correct? >>>>> >>>> Yes. Before this patch, IVOPTs doesn't consider the unrolling impacts, >>>> it only models things based on what it sees. We can assume it thinks >>>> later RTL unrolling won't perform. >>>> >>>> With this patch, since the IV choice probably changes, the IL can probably >>>> change. The typical difference with this patch is: >>>> >>>> vect__1.7_15 = MEM[symbol: x, index: ivtmp.19_22, offset: 0B]; >>>> vs. >>>> vect__1.7_15 = MEM[base: _29, offset: 0B]; >>> >>> So we're asking IVOPTS "if we were unrolling this loop would you make >>> a different IV choice?" thus I wonder why we need so much complexity >>> here? >> >> I would describe it more like "we are going to unroll this loop with >> unroll factor uf in RTL, would you consider this variable when modeling?" >> >> In most cases, one single iteration is representative for the unrolled >> body, so it doesn't matter considering unrolling or not. But for the >> case here, it's not true, expected reg_offset iv cand can make iv cand >> step cost reduced, it leads the difference. >> >>> That is, if we can classify the loop as being possibly unrolled >>> we could evaluate IVOPTs IV choice (and overall cost) on the original >>> loop and in a second run on the original loop with fake IV uses >>> added with extra offset. If the overall IV cost is similar we'll >>> take the unroll friendly choice if the costs are way different >>> (I wouldn't expect this to be the case ever?) I'd side with the >>> IV choice when not unrolling (and mark the loop as to be not unrolled). >>> >> >> Could you elaborate it a bit? I guess it won't estimate the unroll >> factor here, just guess it's to be unrolled or not? The second run >> with fake IV uses added with extra offset sounds like scaling up the >> iv group cost by uf. > > From your example above the D-form (MEM[symbol: x, index: ivtmp.19_22, > offset: 0B]) is preferable since in the unrolled variant we have > the same addres but with a different constant offset for the unroll > copies while the second form would have to update the 'base' IV. > > Thus I think the difference in IV cost and decision should already > show up if we, for each USE add a USE with an added constant offset. > This might be what your patch does with that extra flag on the USEs, > I was suggesting to model the USEs more explicitely, simulating a > 2-way unroll. I think in the end I'll defer to Bin here who knows > the code best. > Thanks for your further explanation! As your proposal we introduce more iv use groups with step added. Take the example here https://gcc.gnu.org/pipermail/gcc-patches/2020-June/547128.html Imagining initially the cand iv 4 leading to x-form wins, it's the original iv, has the iv-group cost 1 against the address group. Although we introduce one more group (2-way unrolling), the iv still wins since pulling the address iv in takes 5 (15 for three). Probably we can introduce more groups according to uf here. OK. Looking forward to Bin's comments. >>> Thus I'd err on the side of not unrolling but leave the ultimate choice >>> of whether to unroll to RTL unless IV cost makes that prohibitive. >>> >>> Even without X- or D- form addressing modes the IV choice may differ >>> and I think we don't need extra knobs for the unroller but instead >>> can decide to set the existing n_unroll to zero (force not unroll) >>> when costs say it would be bad? >> >> Yes, even without x- or d- form addressing, the difference probably comes >> from compare type IV use for loop ending, maybe more cases which I am not >> aware of. But I don't see people care about it, probably the impact is >> small. >> >> IIUC what you stated here looks like to use ivopts information for unrolling >> factor decision, I think this is a separate direction, do we have this >> kind of case where ivopts costs can foresee the unrolling? >> >> Now the unroll factor estimation can be used for other optimization passes >> if they are wondering future unrolling factor decision, as discussed it >> sounds a good idea to override the n_unroll with some benchmarking. > > I didnt' suggest to use IVOPTs to determine the unroll factor. In > fact your patch looks like it does this? Instead I wanted to make > IVOPTs choose a set of IVs that is best for a blend of both worlds - use > D-form when it doesn't hurt the not unrolled code [much], and X-form > when the D-form is way worse (for whatever reason) and signal that > to the unroller (but we could chose to not do that). > Sorry for my weak comprehension! Nice, we are on the same direction. :) > The real issue is of course we're applying IV decision to a not final > loop. > Exactly. BR, Kewen
"Kewen.Lin" <linkw@linux.ibm.com> writes: > on 2020/6/3 下午5:27, Richard Biener wrote: >> On Wed, 3 Jun 2020, Kewen.Lin wrote: >> >>> on 2020/6/3 下午3:07, Richard Biener wrote: >>>> On Wed, 3 Jun 2020, Kewen.Lin wrote: >>>> >>>>> Hi Richi, >>>>> > > snip ... > >>>>>> >>>>>> I'd just mention there are other targets that have the choice between >>>>>> the above forms. Since IVOPTs itself does not perform the unrolling >>>>>> the IL it produces is the same, correct? >>>>>> >>>>> Yes. Before this patch, IVOPTs doesn't consider the unrolling impacts, >>>>> it only models things based on what it sees. We can assume it thinks >>>>> later RTL unrolling won't perform. >>>>> >>>>> With this patch, since the IV choice probably changes, the IL can probably >>>>> change. The typical difference with this patch is: >>>>> >>>>> vect__1.7_15 = MEM[symbol: x, index: ivtmp.19_22, offset: 0B]; >>>>> vs. >>>>> vect__1.7_15 = MEM[base: _29, offset: 0B]; >>>> >>>> So we're asking IVOPTS "if we were unrolling this loop would you make >>>> a different IV choice?" thus I wonder why we need so much complexity >>>> here? >>> >>> I would describe it more like "we are going to unroll this loop with >>> unroll factor uf in RTL, would you consider this variable when modeling?" >>> >>> In most cases, one single iteration is representative for the unrolled >>> body, so it doesn't matter considering unrolling or not. But for the >>> case here, it's not true, expected reg_offset iv cand can make iv cand >>> step cost reduced, it leads the difference. >>> >>>> That is, if we can classify the loop as being possibly unrolled >>>> we could evaluate IVOPTs IV choice (and overall cost) on the original >>>> loop and in a second run on the original loop with fake IV uses >>>> added with extra offset. If the overall IV cost is similar we'll >>>> take the unroll friendly choice if the costs are way different >>>> (I wouldn't expect this to be the case ever?) I'd side with the >>>> IV choice when not unrolling (and mark the loop as to be not unrolled). >>>> >>> >>> Could you elaborate it a bit? I guess it won't estimate the unroll >>> factor here, just guess it's to be unrolled or not? The second run >>> with fake IV uses added with extra offset sounds like scaling up the >>> iv group cost by uf. >> >> From your example above the D-form (MEM[symbol: x, index: ivtmp.19_22, >> offset: 0B]) is preferable since in the unrolled variant we have >> the same addres but with a different constant offset for the unroll >> copies while the second form would have to update the 'base' IV. >> >> Thus I think the difference in IV cost and decision should already >> show up if we, for each USE add a USE with an added constant offset. >> This might be what your patch does with that extra flag on the USEs, >> I was suggesting to model the USEs more explicitely, simulating a >> 2-way unroll. I think in the end I'll defer to Bin here who knows >> the code best. >> > > Thanks for your further explanation! As your proposal we introduce more > iv use groups with step added. Take the example here > https://gcc.gnu.org/pipermail/gcc-patches/2020-June/547128.html > Imagining initially the cand iv 4 leading to x-form wins, it's the > original iv, has the iv-group cost 1 against the address group. > Although we introduce one more group (2-way unrolling), the iv still > wins since pulling the address iv in takes 5 (15 for three). Probably > we can introduce more groups according to uf here. Yeah, to summarise that thread: the idea there was that we would continue to cost each use once, but base the cost on the kind of address seen in the unrolled iterations. I guess this tends to over-estimate the cost of index IVs to some extent, but I too was aiming for something simple that doesn't depend on a specific unroll factor. Kewen's point there was that that approach works for high unroll factors, but not for small unroll factors like 2. For: LD A = baseA, X LD B = baseB, X ST C = baseC, X X = X + stride LD A = baseA, X LD B = baseB, X ST C = baseC, X X = X + stride using X as an IV is still preferred. It's only once the unroll factor exceeds the number of pointer IVs that using pointer IVs becomes better. So like Kewen says, using 2 USEs (the original one and an unrolled one) would have the opposite problem: it would still prefer index IVs and not consider the benefit of pointer IVs at higher unroll factors. But I agree that trying to guess what a much later pass will do doesn't feel very clean either... Thanks, Richard