Message ID | 28b81a90-250e-feb0-8c39-c3c1b62449e7@arm.com |
---|---|
State | New |
Headers | show |
Series | [RFC,tree-vect] PR 88915: Further vectorize second loop when versioning | expand |
On Thu, 11 Jul 2019, Andre Vieira (lists) wrote: > Hi Richard(s), > > I am trying to tackle PR88915 and get GCC to further vectorize the "fallback" > loop when doing loop versioning as it does when loop versioning is not > required. > > I have a prototype patch that needs further testing, but at first glance it > seems to be achieving the desired outcome. > I was wondering whether you had any specific concerns with my current > approach. > > On top of this change I am looking at the iterations and alias checks > generated for every "vectorized-version". I.e. with the above patch I see: > if (iterations_check_VF_0 () && alias_check_VF_0 ()) > vectorized_for_VF_0 (); > else if (iterations_check_VF_1 () && alias_check_VF_1 ()) > vectorized_for_VF_1 (); > ... > else > scalar_loop (); > > The alias checks are not always short and may cause unnecessary performance > hits. Instead I am now trying to change the checks to produce the following > form: > > if (iterations_check_VF_0 ()) > { > if (alias_check_VF_0 ()) > { > vectorized_for_VF_0 (); > } > else > goto VF_1_check; // or scalar_loop > } > else if (iterations_check_VF_1 ()) > { > VF_1_check: > > if (alias_check_VF_1 ()) > vectorized_for_VF_1 (); > else > goto goto_VF_2_check; // or scalar_loop > } > ... > else > scalar_loop (); I think for code-size reason it would make sense to do it like if (iterations_check_for_lowest_VF ()) { if (alias_check_for_highest_VF ()) { vectorized_for_highest_VF (); vectorized epilogues; } } and make the vectorized_for_highest_VF loop skipped, falling through to the vectorized epilogues, when the number of iterations isn't enough to hit it. The advantage is that this would just use the epilogue vectorization code and it would avoid excessive code growth if you have many VFs to consider (on x86 we now have 8 byte, 16 byte, 32 byte and 64 byte vectors...). The disadvantage is of course that a small number of loops will not enter the vector code at all - namely those that would pass the alias check for lowest_VF but not the one for highest_VF. I'm sure this isn't a common situation and in quite a number of cases we formulate the alias check in a way that it isn't dependent on the VF anyways. There's also possibly an extra branch for the case the highest_VF loop isn't entered (unless there already was a prologue loop). > I am not yet sure whether to try the next VF after an alias check fail or to > just fall back to scalar instead. If you don't then there's no advantage to doing what I suggested? Richard. > > Cheers, > Andre >
On 12/07/2019 11:19, Richard Biener wrote: > On Thu, 11 Jul 2019, Andre Vieira (lists) wrote: > > > I think for code-size reason it would make sense to do it like > > if (iterations_check_for_lowest_VF ()) > { > if (alias_check_for_highest_VF ()) > { > vectorized_for_highest_VF (); > vectorized epilogues; > } > } > > and make the vectorized_for_highest_VF loop skipped, falling through > to the vectorized epilogues, when the number of iterations isn't > enough to hit it. Are you suggesting we only make the distinction between highest and lowest VF? Why not something like: if (alias_check_for_highest_VF ()) { if (iterations_check_VF_0 ()) goto VF_0; else if (iterations_check_VF_1 ()) goto VF_1; else if (iterations_check_VF_2 ()) goto VF_2; ... VF_0: vectorized_for_vf_0(); VF_1: vectorized_for_vf_1(); VF_2: vectorized_for_vf_2(); ... } else { goto scalar_loop; } I'll go have a look at how to best do this. The benefit of the earlier approach is it was able to use a lot of the existing vectorizer code to get it done. I have code that can split the condition and alias checks in 'vect_loop_versioning'. For this approach I am considering keeping that bit of code and seeing if I can patch up the checks after vectorizing the epilogue further. I think initially I will just go with a "hacked up" way of passing down the bb with the iteration check and split the false edge every time we vectorize it further. Will keep you posted on progress. If you have any pointers/tips they are most welcome! > > The advantage is that this would just use the epilogue vectorization > code and it would avoid excessive code growth if you have many > VFs to consider (on x86 we now have 8 byte, 16 byte, 32 byte and > 64 byte vectors...). The disadvantage is of course that a small > number of loops will not enter the vector code at all - namely > those that would pass the alias check for lowest_VF but not the > one for highest_VF. I'm sure this isn't a common situation and > in quite a number of cases we formulate the alias check in a way > that it isn't dependent on the VF anyways. The code growth is indeed a factor and I can see the argument for choosing this approach over the other. Cases of such specific overlaps are most likely oddities rather than the common situation. > There's also possibly > an extra branch for the case the highest_VF loop isn't entered > (unless there already was a prologue loop). I don't understand this one, can you elaborate? Cheers, Andre
On Mon, 15 Jul 2019, Andre Vieira (lists) wrote: > > > On 12/07/2019 11:19, Richard Biener wrote: > > On Thu, 11 Jul 2019, Andre Vieira (lists) wrote: > > > > > > I think for code-size reason it would make sense to do it like > > > > if (iterations_check_for_lowest_VF ()) > > { > > if (alias_check_for_highest_VF ()) > > { > > vectorized_for_highest_VF (); > > vectorized epilogues; > > } > > } > > > > and make the vectorized_for_highest_VF loop skipped, falling through > > to the vectorized epilogues, when the number of iterations isn't > > enough to hit it. > > Are you suggesting we only make the distinction between highest and lowest VF? > Why not something like: > > if (alias_check_for_highest_VF ()) > { > if (iterations_check_VF_0 ()) > goto VF_0; > else if (iterations_check_VF_1 ()) > goto VF_1; > else if (iterations_check_VF_2 ()) > goto VF_2; > ... > VF_0: > vectorized_for_vf_0(); > VF_1: > vectorized_for_vf_1(); > VF_2: > vectorized_for_vf_2(); > ... > } > else > { > goto scalar_loop; > } I think it will actually do it this way via the epilogue vectorization path. > I'll go have a look at how to best do this. The benefit of the earlier > approach is it was able to use a lot of the existing vectorizer code to get it > done. > > I have code that can split the condition and alias checks in > 'vect_loop_versioning'. For this approach I am considering keeping that bit of > code and seeing if I can patch up the checks after vectorizing the epilogue > further. I think initially I will just go with a "hacked up" way of passing > down the bb with the iteration check and split the false edge every time we > vectorize it further. Will keep you posted on progress. If you have any > pointers/tips they are most welcome! I thought to somehow force the idea that we have a prologue loop to the vectorizer so it creates the number-of-vectorized iterations check and branch around the main (highest VF) vectorized loop. > > > > The advantage is that this would just use the epilogue vectorization > > code and it would avoid excessive code growth if you have many > > VFs to consider (on x86 we now have 8 byte, 16 byte, 32 byte and > > 64 byte vectors...). The disadvantage is of course that a small > > number of loops will not enter the vector code at all - namely > > those that would pass the alias check for lowest_VF but not the > > one for highest_VF. I'm sure this isn't a common situation and > > in quite a number of cases we formulate the alias check in a way > > that it isn't dependent on the VF anyways. > > The code growth is indeed a factor and I can see the argument for choosing > this approach over the other. Cases of such specific overlaps are most likely > oddities rather than the common situation. Yeah, it also looks simplest to me (and a motivation to enable epilogue vectorization by default). > > There's also possibly > > an extra branch for the case the highest_VF loop isn't entered > > (unless there already was a prologue loop). > I don't understand this one, can you elaborate? The branch around the main vectorized loop I talked about above. So I'd fool the versioning condition to use the lowest VF for the iteration count checking and use the code that handles zero-trip iteration count for the vector loop unconditionally. In some way this makes checking the niter condition on the version check pointless (at least if we have a really low lowest VF like on x64 where it will likely be 2), so we may want to elide that completely? For the check to be "correct" we'd also need to compute the lowest VF a vectorized epilogue is still profitable (on x86 those will run once or never, but we can also end up with say main AVX512 vectorization, and a single vectorized epilogue with SSE2 if we somehow figure AVX256 vectorization isn't profitable for it - we can also end up with non-vectorizable epilogue). So with the current setup how we vectorize epilogues we maybe want to have a location of the version niter check we can "patch up" later after (not) vectorizing the epilogue(s). Richard.
On 15/07/2019 11:54, Richard Biener wrote: > On Mon, 15 Jul 2019, Andre Vieira (lists) wrote: > >> >> >> On 12/07/2019 11:19, Richard Biener wrote: >>> On Thu, 11 Jul 2019, Andre Vieira (lists) wrote: >> >> >> I have code that can split the condition and alias checks in >> 'vect_loop_versioning'. For this approach I am considering keeping that bit of >> code and seeing if I can patch up the checks after vectorizing the epilogue >> further. I think initially I will just go with a "hacked up" way of passing >> down the bb with the iteration check and split the false edge every time we >> vectorize it further. Will keep you posted on progress. If you have any >> pointers/tips they are most welc ome! > > I thought to somehow force the idea that we have a prologue loop > to the vectorizer so it creates the number-of-vectorized iterations > check and branch around the main (highest VF) vectorized loop. > Hmm I think I may have skimmed over this earlier. I am reading it now and am not entirely sure what you mean by this. Specifically the "number of vectorized iterations" or how a prologue loop plays a role in this. >>> >>> The advantage is that this would just use the epilogue vectorization >>> code and it would avoid excessive code growth if you have many >>> VFs to consider (on x86 we now have 8 byte, 16 byte, 32 byte and >>> 64 byte vectors...). The disadvantage is of course that a small >>> number of loops will not enter the vector code at all - namely >>> those that would pass the alias check for lowest_VF but not the >>> one for highest_VF. I'm sure this isn't a common situation and >>> in quite a number of cases we formulate the alias check in a way >>> that it isn't dependent on the VF anyways. >> >> The code growth is indeed a factor and I can see the argument for choosing >> this approach over the other. Cases of such specific overlaps are most likely >> oddities rather than the common situation. > > Yeah, it also looks simplest to me (and a motivation to enable > epilogue vectorization by default). > >>> There's also possibly >>> an extra branch for the case the highest_VF loop isn't entered >>> (unless there already was a prologue loop). >> I don't understand this one, can you elaborate? > > The branch around the main vectorized loop I talked about above. > So I'd fool the versioning condition to use the lowest VF for > the iteration count checking and use the code that handles > zero-trip iteration count for the vector loop unconditionally. I guess this sheds some light on the comment above. And it definitely implies we would need to know the lowest VF when creating this condition. Which is tricky. > > In some way this makes checking the niter condition on the version > check pointless (at least if we have a really low lowest VF like > on x64 where it will likely be 2), so we may want to elide that > completely? For the check to be "correct" we'd also need to > compute the lowest VF a vectorized epilogue is still profitable > (on x86 those will run once or never, but we can also end up > with say main AVX512 vectorization, and a single vectorized > epilogue with SSE2 if we somehow figure AVX256 vectorization > isn't profitable for it - we can also end up with non-vectorizable > epilogue). So with the current setup how we vectorize epilogues > we maybe want to have a location of the version niter check we > can "patch up" later after (not) vectorizing the epilogue(s). I think you come to the same conclusion here as I mentioned above. Somehow I wish I had understood this better when I first read it ... but eh such is life :) I went on and continued hacking around the approach of splitting the niter and alias check I had earlier. I got it to work with a single loop. However, when dealing with nested loops I run into the problem that I'd need to sink the niter checks. Otherwise you could end up with an alias check and niter checks outside the outer loop. Where the 2nd and consequent VF niter checks point to the corresponding epilogues in the inner loop. However, once those are done and it iterates over the outer-loop, it will go through the higher VF's first, leading to wrong behavior. To illustrate what I mean, here is a very simplistic illustration of what is happening: BB1: Alias check BB2: niter check VF 32 BB3: niter check VF 16 BB4: Vectorized loop VF32 BB5: Vectorized loop VF16 BB6: Remaining epilogue scalar loop BB7: Outer loop iteration (updates IV's and DRs of inner loop) BB8: Scalar inner&outer loop With edges: BB1 -T-> BB2 BB1 -F-> BB8 BB2 -T-> BB4 BB2 -F-> BB3 BB3 -T-> BB5 BB3 -F-> BB8 BB4 -> BB5 BB5 -> BB6 BB6 -> BB7 BB7 -> BB4 Where -T-> is a True edge and -F-> is a False edge So my first thought to solve this is to sink BB2 and BB3 into the loop for which BB7 is the latch. I.e. make BB7 -> BB2 But then I would argue, it would be good to introduce a BB9: BB1 -T-> BB9 BB9 -T-> BB2 BB9 -F-> BB8 Where BB9 checks that niter is at least the lowest VF. Sorry if I am repeating what you were telling me to do all along :') Cheers, Andre PS: I often find myself having to patch the DOMINATOR information, sometimes its easy to, but sometimes it can get pretty complicated. I wonder whether it would make sense to write something that traverses a loop and corrects this, if it doesn't exist already. > > Richard. >
On Fri, 19 Jul 2019, Andre Vieira (lists) wrote: > > > On 15/07/2019 11:54, Richard Biener wrote: > > On Mon, 15 Jul 2019, Andre Vieira (lists) wrote: > > > > > > > > > > > On 12/07/2019 11:19, Richard Biener wrote: > > > > On Thu, 11 Jul 2019, Andre Vieira (lists) wrote: > > > > > > > > > I have code that can split the condition and alias checks in > > > 'vect_loop_versioning'. For this approach I am considering keeping that > > > bit of > > > code and seeing if I can patch up the checks after vectorizing the > > > epilogue > > > further. I think initially I will just go with a "hacked up" way of > > > passing > > > down the bb with the iteration check and split the false edge every time > > > we > > > vectorize it further. Will keep you posted on progress. If you have any > > > pointers/tips they are most welc ome! > > > > I thought to somehow force the idea that we have a prologue loop > > to the vectorizer so it creates the number-of-vectorized iterations > > check and branch around the main (highest VF) vectorized loop. > > > > Hmm I think I may have skimmed over this earlier. I am reading it now and am > not entirely sure what you mean by this. Specifically the "number of > vectorized iterations" or how a prologue loop plays a role in this. When there is no prologue then the versioning condition currently ensures we always enter the vector loop. Only if there is a prologue is there a check and branch eventually skipping right to the epilogue. If the versioning condition (now using a lower VF) doesn't guarantee this we have to add this jump-around. > > > > > > > > > The advantage is that this would just use the epilogue vectorization > > > > code and it would avoid excessive code growth if you have many > > > > VFs to consider (on x86 we now have 8 byte, 16 byte, 32 byte and > > > > 64 byte vectors...). The disadvantage is of course that a small > > > > number of loops will not enter the vector code at all - namely > > > > those that would pass the alias check for lowest_VF but not the > > > > one for highest_VF. I'm sure this isn't a common situation and > > > > in quite a number of cases we formulate the alias check in a way > > > > that it isn't dependent on the VF anyways. > > > > > > The code growth is indeed a factor and I can see the argument for choosing > > > this approach over the other. Cases of such specific overlaps are most > > > likely > > > oddities rather than the common situation. > > > > Yeah, it also looks simplest to me (and a motivation to enable > > epilogue vectorization by default). > > > > > > There's also possibly > > > > an extra branch for the case the highest_VF loop isn't entered > > > > (unless there already was a prologue loop). > > > I don't understand this one, can you elaborate? > > > > The branch around the main vectorized loop I talked about above. > > So I'd fool the versioning condition to use the lowest VF for > > the iteration count checking and use the code that handles > > zero-trip iteration count for the vector loop unconditionally. > > I guess this sheds some light on the comment above. And it definitely implies > we would need to know the lowest VF when creating this condition. Which is > tricky. We can simply use the smallest vector size supported by the target to derive it from the actual VF, no? > > > > In some way this makes checking the niter condition on the version > > check pointless (at least if we have a really low lowest VF like > > on x64 where it will likely be 2), so we may want to elide that > > completely? For the check to be "correct" we'd also need to > > compute the lowest VF a vectorized epilogue is still profitable > > (on x86 those will run once or never, but we can also end up > > with say main AVX512 vectorization, and a single vectorized > > epilogue with SSE2 if we somehow figure AVX256 vectorization > > isn't profitable for it - we can also end up with non-vectorizable > > epilogue). So with the current setup how we vectorize epilogues > > we maybe want to have a location of the version niter check we > > can "patch up" later after (not) vectorizing the epilogue(s). > > I think you come to the same conclusion here as I mentioned above. Somehow I > wish I had understood this better when I first read it ... but eh such is life > :) > > I went on and continued hacking around the approach of splitting the niter and > alias check I had earlier. I got it to work with a single loop. However, when > dealing with nested loops I run into the problem that I'd need to sink the > niter checks. Otherwise you could end up with an alias check and niter checks > outside the outer loop. Where the 2nd and consequent VF niter checks point to > the corresponding epilogues in the inner loop. However, once those are done > and it iterates over the outer-loop, it will go through the higher VF's first, > leading to wrong behavior. > > To illustrate what I mean, here is a very simplistic illustration of what is > happening: > > BB1: Alias check > BB2: niter check VF 32 > BB3: niter check VF 16 > BB4: Vectorized loop VF32 > BB5: Vectorized loop VF16 > BB6: Remaining epilogue scalar loop > BB7: Outer loop iteration (updates IV's and DRs of inner loop) > BB8: Scalar inner&outer loop > > With edges: > BB1 -T-> BB2 > BB1 -F-> BB8 > BB2 -T-> BB4 > BB2 -F-> BB3 > BB3 -T-> BB5 > BB3 -F-> BB8 > BB4 -> BB5 > BB5 -> BB6 > BB6 -> BB7 > BB7 -> BB4 > > Where -T-> is a True edge and -F-> is a False edge > > So my first thought to solve this is to sink BB2 and BB3 into the loop for > which BB7 is the latch. > > I.e. make BB7 -> BB2 > > But then I would argue, it would be good to introduce a BB9: > BB1 -T-> BB9 > BB9 -T-> BB2 > BB9 -F-> BB8 > > Where BB9 checks that niter is at least the lowest VF. > > Sorry if I am repeating what you were telling me to do all along :') I'm not sure I understand - why would you have any check not inside the outer loop? Yes, we now eventually hoist versioning checks but the VF checks for the individual variants should be around the vectorized loop itself (so not really part of the versioning check). > Cheers, > Andre > > PS: I often find myself having to patch the DOMINATOR information, sometimes > its easy to, but sometimes it can get pretty complicated. I wonder whether it > would make sense to write something that traverses a loop and corrects this, > if it doesn't exist already. There's iterate-fix-dominators, but unless you create new edges/blocks manually rather than doing split-block/redirect-edge which should do dominator updating for you. Richard. > > > > > Richard. > > >
On 19/07/2019 12:35, Richard Biener wrote: > On Fri, 19 Jul 2019, Andre Vieira (lists) wrote: > >> >> >> On 15/07/2019 11:54, Richard Biener wrote: >>> On Mon, 15 Jul 2019, Andre Vieira (lists) wrote: >>> >>>> >>>> >>>> On 12/07/2019 11:19, Richard Biener wrote: >>>>> On Thu, 11 Jul 2019, Andre Vieira (lists) wrote: >>>> >>>> >>>> I have code that can split the condition and alias checks in >>>> 'vect_loop_versioning'. For this approach I am considering keeping that >>>> bit of >>>> code and seeing if I can patch up the checks after vectorizing the >>>> epilogue >>>> further. I think initially I will just go with a "hacked up" way of >>>> passing >>>> down the bb with the iteration check and split the false edge every time >>>> we >>>> vectorize it further. Will keep you posted on progress. If you have any >>>> pointers/tips they are most welc ome! >>> >>> I thought to somehow force the idea that we have a prologue loop >>> to the vectorizer so it creates the number-of-vectorized iterations >>> check and branch around the main (highest VF) vectorized loop. >>> >> >> Hmm I think I may have skimmed over this earlier. I am reading it now and am >> not entirely sure what you mean by this. Specifically the "number of >> vectorized iterations" or how a prologue loop plays a role in this. > > When there is no prologue then the versioning condition currently > ensures we always enter the vector loop. Only if there is a prologue > is there a check and branch eventually skipping right to the epilogue. > If the versioning condition (now using a lower VF) doesn't guarantee > this we have to add this jump-around. Right, I haven't looked at the prologue path yet. I had a quick look and can't find where this branch skipping to the epilogue is constructed. I will take a better look after I got my current example to work. >> >> I guess this sheds some light on the comment above. And it definitely implies >> we would need to know the lowest VF when creating this condition. Which is >> tricky. > > We can simply use the smallest vector size supported by the target to > derive it from the actual VF, no? So I could wait to introduce this check after all epilogue vectorization is done, back track to the last niter check and duplicate that in the outer loop. What I didn't want to do was use the smallest possible vector size for the target because I was under the impression that does not necessarily correspond to the smallest VF we may have for a loop, as the vectorizer may have decided not to vectorize for that vector size because of costs? If it I can assume this never happens, that once it starts to vectorize epilogues that it will keep vectorizing them for any vector size it knows off then yeah I can use that. > I'm not sure I understand - why would you have any check not inside > the outer loop? Yes, we now eventually hoist versioning checks > but the VF checks for the individual variants should be around > the vectorized loop itself (so not really part of the versioning check). Yeah I agree. I was just explaining what I had done wrong now. > >> Cheers, >> Andre >> >> PS: I often find myself having to patch the DOMINATOR information, sometimes >> its easy to, but sometimes it can get pretty complicated. I wonder whether it >> would make sense to write something that traverses a loop and corrects this, >> if it doesn't exist already. > > There's iterate-fix-dominators, but unless you create new edges/blocks > manually rather than doing split-block/redirect-edge which should do > dominator updating for you. Ah I was doing everything manually after having some bad experiences with lv_add_condition_to_bb. I will have a look at those thanks! Cheers, Andre > > Richard. > >> >>> >>> Richard. >>> >> >
Hi Richard, I believe this is in line with what you were explaining to me earlier. The one thing I haven't quite done here is the jump around for if there is no prolog peeling. Though I think skip_vectors introduces the jumping we need. The main gist of it is: 1) if '--param vect-epilogues-nomask=1' is passed we refrain from adding the versioning threshold check when versioning a loop at first, 2) later we allow the vectorizer to use skip_vectors to basically check niters to be able to skip the vectorized loop on to the right epilogue, 3) after vectorizing the epilogue, we check whether this was a versioned loop and whether we skipped adding the versioning threshold, if so we add a threshold based on the last VF There is a caveat here, if we don't vectorize the epilogue, because say our architecture doesn't know how, we will end up with a regression. Let's say we have a loop for which our target can vectorize for 32x but not 16x, we get: if (alias & threshold_for_16x ()) { if (niters > 31) vectorized_31 (); else scalar (); } else scalar (); Imagine our iteration count is 18, all we hit is the scalar loop, but now we go through 1x alias and 2x niters. Whereas potentially we could have done with just 1x niters. A mitigation for this could be to only add the threshold check if we actually vectorized the last loop, I believe a: 'if (LOOP_VINFO_VECTORIZABLE_P (loop_vinfo)) return NULL;' inside that new "else" in vect_transform_loop would do the trick. Though then we will still have that extra alias check... I am going to hack around in the back-end to "fake" a situation like this and see what happens. Another potential issue arises when the profitability threshold obtained from the cost model isn't guaranteed to be either LE or EQ to the versioning threshold. In that case there are two separate checks, which now we no longer will attempt to fold. I am trying to find a way to test and benchmark these changes. Unfortunately I am having trouble doing this for AVX512 as I find that the old '--param vect_epilogues_nomask=1' already causes wrong codegen in SPEC2017 for the gcc and perlbench benchmarks. Cheers, Andre On 19/07/2019 13:38, Andre Vieira (lists) wrote: > > > On 19/07/2019 12:35, Richard Biener wrote: >> On Fri, 19 Jul 2019, Andre Vieira (lists) wrote: >> >>> >>> >>> On 15/07/2019 11:54, Richard Biener wrote: >>>> On Mon, 15 Jul 2019, Andre Vieira (lists) wrote: >>>> >>>>> >>>>> >>>>> On 12/07/2019 11:19, Richard Biener wrote: >>>>>> On Thu, 11 Jul 2019, Andre Vieira (lists) wrote: >>>>> >>>>> >>>>> I have code that can split the condition and alias checks in >>>>> 'vect_loop_versioning'. For this approach I am considering keeping >>>>> that >>>>> bit of >>>>> code and seeing if I can patch up the checks after vectorizing the >>>>> epilogue >>>>> further. I think initially I will just go with a "hacked up" way of >>>>> passing >>>>> down the bb with the iteration check and split the false edge every >>>>> time >>>>> we >>>>> vectorize it further. Will keep you posted on progress. If you have >>>>> any >>>>> pointers/tips they are most welc ome! >>>> >>>> I thought to somehow force the idea that we have a prologue loop >>>> to the vectorizer so it creates the number-of-vectorized iterations >>>> check and branch around the main (highest VF) vectorized loop. >>>> >>> >>> Hmm I think I may have skimmed over this earlier. I am reading it now >>> and am >>> not entirely sure what you mean by this. Specifically the "number of >>> vectorized iterations" or how a prologue loop plays a role in this. >> >> When there is no prologue then the versioning condition currently >> ensures we always enter the vector loop. Only if there is a prologue >> is there a check and branch eventually skipping right to the epilogue. >> If the versioning condition (now using a lower VF) doesn't guarantee >> this we have to add this jump-around. > > Right, I haven't looked at the prologue path yet. I had a quick look and > can't find where this branch skipping to the epilogue is constructed. I > will take a better look after I got my current example to work. > >>> >>> I guess this sheds some light on the comment above. And it definitely >>> implies >>> we would need to know the lowest VF when creating this condition. >>> Which is >>> tricky. >> >> We can simply use the smallest vector size supported by the target to >> derive it from the actual VF, no? > > So I could wait to introduce this check after all epilogue vectorization > is done, back track to the last niter check and duplicate that in the > outer loop. > > What I didn't want to do was use the smallest possible vector size for > the target because I was under the impression that does not necessarily > correspond to the smallest VF we may have for a loop, as the vectorizer > may have decided not to vectorize for that vector size because of costs? > If it I can assume this never happens, that once it starts to vectorize > epilogues that it will keep vectorizing them for any vector size it > knows off then yeah I can use that. > > >> I'm not sure I understand - why would you have any check not inside >> the outer loop? Yes, we now eventually hoist versioning checks >> but the VF checks for the individual variants should be around >> the vectorized loop itself (so not really part of the versioning check). > > Yeah I agree. I was just explaining what I had done wrong now. >> >>> Cheers, >>> Andre >>> >>> PS: I often find myself having to patch the DOMINATOR information, >>> sometimes >>> its easy to, but sometimes it can get pretty complicated. I wonder >>> whether it >>> would make sense to write something that traverses a loop and >>> corrects this, >>> if it doesn't exist already. >> >> There's iterate-fix-dominators, but unless you create new edges/blocks >> manually rather than doing split-block/redirect-edge which should do >> dominator updating for you. > > Ah I was doing everything manually after having some bad experiences > with lv_add_condition_to_bb. I will have a look at those thanks! > > Cheers, > Andre > >> >> Richard. >> >>> >>>> >>>> Richard. >>>> >>> >>
On 30/07/2019 13:16, Andre Vieira (lists) wrote: > Hi Richard, > > I believe this is in line with what you were explaining to me earlier. > The one thing I haven't quite done here is the jump around for if there > is no prolog peeling. Though I think skip_vectors introduces the jumping > we need. > > The main gist of it is: > 1) if '--param vect-epilogues-nomask=1' is passed we refrain from adding > the versioning threshold check when versioning a loop at first, > 2) later we allow the vectorizer to use skip_vectors to basically check > niters to be able to skip the vectorized loop on to the right epilogue, > 3) after vectorizing the epilogue, we check whether this was a versioned > loop and whether we skipped adding the versioning threshold, if so we > add a threshold based on the last VF > > There is a caveat here, if we don't vectorize the epilogue, because say > our architecture doesn't know how, we will end up with a regression. > Let's say we have a loop for which our target can vectorize for 32x but > not 16x, we get: > > if (alias & threshold_for_16x ()) > { > if (niters > 31) > vectorized_31 (); > else > scalar (); > } > else > scalar (); > > Imagine our iteration count is 18, all we hit is the scalar loop, but > now we go through 1x alias and 2x niters. Whereas potentially we could > have done with just 1x niters. > > A mitigation for this could be to only add the threshold check if we > actually vectorized the last loop, I believe a: > 'if (LOOP_VINFO_VECTORIZABLE_P (loop_vinfo)) return NULL;' inside that > new "else" in vect_transform_loop would do the trick. Though then we > will still have that extra alias check... > > I am going to hack around in the back-end to "fake" a situation like > this and see what happens. Right should have quickly tried this before sending the email, what happens is actually vect_transform_loop never gets called because try_vectorize_loop_1 will recognize it can't vectorize the epilogue. So we end up with the "mitigation" result I suggested, where we simply don't have a versioning threshold which might still not be ideal. > > Another potential issue arises when the profitability threshold obtained > from the cost model isn't guaranteed to be either LE or EQ to the > versioning threshold. In that case there are two separate checks, which > now we no longer will attempt to fold. And I just realized I don't take the cost model threshold into account when creating the new threshold check, I guess if it is ordered_p we should again take the max there between the new threshold check and the cost threshold for the new check. > > I am trying to find a way to test and benchmark these changes. > Unfortunately I am having trouble doing this for AVX512 as I find that > the old '--param vect_epilogues_nomask=1' already causes wrong codegen > in SPEC2017 for the gcc and perlbench benchmarks. > > > Cheers, > Andre > > On 19/07/2019 13:38, Andre Vieira (lists) wrote: >> >> >> On 19/07/2019 12:35, Richard Biener wrote: >>> On Fri, 19 Jul 2019, Andre Vieira (lists) wrote: >>> >>>> >>>> >>>> On 15/07/2019 11:54, Richard Biener wrote: >>>>> On Mon, 15 Jul 2019, Andre Vieira (lists) wrote: >>>>> >>>>>> >>>>>> >>>>>> On 12/07/2019 11:19, Richard Biener wrote: >>>>>>> On Thu, 11 Jul 2019, Andre Vieira (lists) wrote: >>>>>> >>>>>> >>>>>> I have code that can split the condition and alias checks in >>>>>> 'vect_loop_versioning'. For this approach I am considering keeping >>>>>> that >>>>>> bit of >>>>>> code and seeing if I can patch up the checks after vectorizing the >>>>>> epilogue >>>>>> further. I think initially I will just go with a "hacked up" way of >>>>>> passing >>>>>> down the bb with the iteration check and split the false edge >>>>>> every time >>>>>> we >>>>>> vectorize it further. Will keep you posted on progress. If you >>>>>> have any >>>>>> pointers/tips they are most welc ome! >>>>> >>>>> I thought to somehow force the idea that we have a prologue loop >>>>> to the vectorizer so it creates the number-of-vectorized iterations >>>>> check and branch around the main (highest VF) vectorized loop. >>>>> >>>> >>>> Hmm I think I may have skimmed over this earlier. I am reading it >>>> now and am >>>> not entirely sure what you mean by this. Specifically the "number of >>>> vectorized iterations" or how a prologue loop plays a role in this. >>> >>> When there is no prologue then the versioning condition currently >>> ensures we always enter the vector loop. Only if there is a prologue >>> is there a check and branch eventually skipping right to the epilogue. >>> If the versioning condition (now using a lower VF) doesn't guarantee >>> this we have to add this jump-around. >> >> Right, I haven't looked at the prologue path yet. I had a quick look >> and can't find where this branch skipping to the epilogue is >> constructed. I will take a better look after I got my current example >> to work. >> >>>> >>>> I guess this sheds some light on the comment above. And it >>>> definitely implies >>>> we would need to know the lowest VF when creating this condition. >>>> Which is >>>> tricky. >>> >>> We can simply use the smallest vector size supported by the target to >>> derive it from the actual VF, no? >> >> So I could wait to introduce this check after all epilogue >> vectorization is done, back track to the last niter check and >> duplicate that in the outer loop. >> >> What I didn't want to do was use the smallest possible vector size for >> the target because I was under the impression that does not >> necessarily correspond to the smallest VF we may have for a loop, as >> the vectorizer may have decided not to vectorize for that vector size >> because of costs? If it I can assume this never happens, that once it >> starts to vectorize epilogues that it will keep vectorizing them for >> any vector size it knows off then yeah I can use that. >> >> >>> I'm not sure I understand - why would you have any check not inside >>> the outer loop? Yes, we now eventually hoist versioning checks >>> but the VF checks for the individual variants should be around >>> the vectorized loop itself (so not really part of the versioning check). >> >> Yeah I agree. I was just explaining what I had done wrong now. >>> >>>> Cheers, >>>> Andre >>>> >>>> PS: I often find myself having to patch the DOMINATOR information, >>>> sometimes >>>> its easy to, but sometimes it can get pretty complicated. I wonder >>>> whether it >>>> would make sense to write something that traverses a loop and >>>> corrects this, >>>> if it doesn't exist already. >>> >>> There's iterate-fix-dominators, but unless you create new edges/blocks >>> manually rather than doing split-block/redirect-edge which should do >>> dominator updating for you. >> >> Ah I was doing everything manually after having some bad experiences >> with lv_add_condition_to_bb. I will have a look at those thanks! >> >> Cheers, >> Andre >> >>> >>> Richard. >>> >>>> >>>>> >>>>> Richard. >>>>> >>>> >>>
On Tue, 30 Jul 2019, Andre Vieira (lists) wrote: > > > On 30/07/2019 13:16, Andre Vieira (lists) wrote: > > Hi Richard, > > > > I believe this is in line with what you were explaining to me earlier. The > > one thing I haven't quite done here is the jump around for if there is no > > prolog peeling. Though I think skip_vectors introduces the jumping we need. > > > > The main gist of it is: > > 1) if '--param vect-epilogues-nomask=1' is passed we refrain from adding the > > versioning threshold check when versioning a loop at first, > > 2) later we allow the vectorizer to use skip_vectors to basically check > > niters to be able to skip the vectorized loop on to the right epilogue, > > 3) after vectorizing the epilogue, we check whether this was a versioned > > loop and whether we skipped adding the versioning threshold, if so we add a > > threshold based on the last VF > > > > There is a caveat here, if we don't vectorize the epilogue, because say our > > architecture doesn't know how, we will end up with a regression. > > Let's say we have a loop for which our target can vectorize for 32x but not > > 16x, we get: > > > > if (alias & threshold_for_16x ()) > > { > > if (niters > 31) > > vectorized_31 (); > > else > > scalar (); > > } > > else > > scalar (); > > > > Imagine our iteration count is 18, all we hit is the scalar loop, but now we > > go through 1x alias and 2x niters. Whereas potentially we could have done > > with just 1x niters. True. Note we should swap the checks to if (threshold_for_16x && alias) > > A mitigation for this could be to only add the threshold check if we > > actually vectorized the last loop, I believe a: > > 'if (LOOP_VINFO_VECTORIZABLE_P (loop_vinfo)) return NULL;' inside that new > > "else" in vect_transform_loop would do the trick. Though then we will still > > have that extra alias check... > > > I am going to hack around in the back-end to "fake" a situation like this > > and see what happens. > > Right should have quickly tried this before sending the email, what happens is > actually vect_transform_loop never gets called because try_vectorize_loop_1 > will recognize it can't vectorize the epilogue. So we end up with the > "mitigation" result I suggested, where we simply don't have a versioning > threshold which might still not be ideal. I think the main issue is how we have implemented epilogue vectorization. Ideally when vectorizing the loop () we'd recognize all VFs we can handle and thus know beforehand. I think that's already done for the sake of openmp simd now so doing this when epilogue vectorization is enabled as well wouldn't be too bad - so then we know, at the time we do the versioning, what and how many vectorized epilogues we create. See vect_analyze_loop and the loop over vector sizes. > > > > Another potential issue arises when the profitability threshold obtained > > from the cost model isn't guaranteed to be either LE or EQ to the versioning > > threshold. In that case there are two separate checks, which now we no > > longer will attempt to fold. > > And I just realized I don't take the cost model threshold into account when > creating the new threshold check, I guess if it is ordered_p we should again > take the max there between the new threshold check and the cost threshold for > the new check. Also see https://gcc.gnu.org/ml/gcc-patches/2018-06/msg01397.html for a patch that computes costs of each possible VF, with that we could compute a better combined estimate for minimum number of iters (also considering the extra jumps to reach the main VF/2 loop). > > > > I am trying to find a way to test and benchmark these changes. Unfortunately > > I am having trouble doing this for AVX512 as I find that the old '--param > > vect_epilogues_nomask=1' already causes wrong codegen in SPEC2017 for the > > gcc and perlbench benchmarks. There's a reason it is not enabled by default. But I'd like to fix bugs it has so can you possibly reduce testcases and file bugs for it? Thanks, Richard. > > > > > > Cheers, > > Andre > > > > On 19/07/2019 13:38, Andre Vieira (lists) wrote: > > > > > > > > > On 19/07/2019 12:35, Richard Biener wrote: > > > > On Fri, 19 Jul 2019, Andre Vieira (lists) wrote: > > > > > > > > > > > > > > > > > > > On 15/07/2019 11:54, Richard Biener wrote: > > > > > > On Mon, 15 Jul 2019, Andre Vieira (lists) wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > On 12/07/2019 11:19, Richard Biener wrote: > > > > > > > > On Thu, 11 Jul 2019, Andre Vieira (lists) wrote: > > > > > > > > > > > > > > > > > > > > > I have code that can split the condition and alias checks in > > > > > > > 'vect_loop_versioning'. For this approach I am considering keeping > > > > > > > that > > > > > > > bit of > > > > > > > code and seeing if I can patch up the checks after vectorizing the > > > > > > > epilogue > > > > > > > further. I think initially I will just go with a "hacked up" way > > > > > > > of > > > > > > > passing > > > > > > > down the bb with the iteration check and split the false edge > > > > > > > every time > > > > > > > we > > > > > > > vectorize it further. Will keep you posted on progress. If you > > > > > > > have any > > > > > > > pointers/tips they are most welc ome! > > > > > > > > > > > > I thought to somehow force the idea that we have a prologue loop > > > > > > to the vectorizer so it creates the number-of-vectorized iterations > > > > > > check and branch around the main (highest VF) vectorized loop. > > > > > > > > > > > > > > > > Hmm I think I may have skimmed over this earlier. I am reading it now > > > > > and am > > > > > not entirely sure what you mean by this. Specifically the "number of > > > > > vectorized iterations" or how a prologue loop plays a role in this. > > > > > > > > When there is no prologue then the versioning condition currently > > > > ensures we always enter the vector loop. Only if there is a prologue > > > > is there a check and branch eventually skipping right to the epilogue. > > > > If the versioning condition (now using a lower VF) doesn't guarantee > > > > this we have to add this jump-around. > > > > > > Right, I haven't looked at the prologue path yet. I had a quick look and > > > can't find where this branch skipping to the epilogue is constructed. I > > > will take a better look after I got my current example to work. > > > > > > > > > > > > > I guess this sheds some light on the comment above. And it definitely > > > > > implies > > > > > we would need to know the lowest VF when creating this condition. > > > > > Which is > > > > > tricky. > > > > > > > > We can simply use the smallest vector size supported by the target to > > > > derive it from the actual VF, no? > > > > > > So I could wait to introduce this check after all epilogue vectorization > > > is done, back track to the last niter check and duplicate that in the > > > outer loop. > > > > > > What I didn't want to do was use the smallest possible vector size for the > > > target because I was under the impression that does not necessarily > > > correspond to the smallest VF we may have for a loop, as the vectorizer > > > may have decided not to vectorize for that vector size because of costs? > > > If it I can assume this never happens, that once it starts to vectorize > > > epilogues that it will keep vectorizing them for any vector size it knows > > > off then yeah I can use that. > > > > > > > > > > I'm not sure I understand - why would you have any check not inside > > > > the outer loop? Yes, we now eventually hoist versioning checks > > > > but the VF checks for the individual variants should be around > > > > the vectorized loop itself (so not really part of the versioning check). > > > > > > Yeah I agree. I was just explaining what I had done wrong now. > > > > > > > > > Cheers, > > > > > Andre > > > > > > > > > > PS: I often find myself having to patch the DOMINATOR information, > > > > > sometimes > > > > > its easy to, but sometimes it can get pretty complicated. I wonder > > > > > whether it > > > > > would make sense to write something that traverses a loop and corrects > > > > > this, > > > > > if it doesn't exist already. > > > > > > > > There's iterate-fix-dominators, but unless you create new edges/blocks > > > > manually rather than doing split-block/redirect-edge which should do > > > > dominator updating for you. > > > > > > Ah I was doing everything manually after having some bad experiences with > > > lv_add_condition_to_bb. I will have a look at those thanks! > > > > > > Cheers, > > > Andre > > > > > > > > > > > Richard. > > > > > > > > > > > > > > > > > > > > > Richard. > > > > > > > > > > > > > > > >
Hi Richard, Thanks for the feedback! See comments inline. On 01/08/2019 16:26, Richard Biener wrote: > On Tue, 30 Jul 2019, Andre Vieira (lists) wrote: > >> >> >> On 30/07/2019 13:16, Andre Vieira (lists) wrote: >>> Hi Richard, >>> >>> I believe this is in line with what you were explaining to me earlier. The >>> one thing I haven't quite done here is the jump around for if there is no >>> prolog peeling. Though I think skip_vectors introduces the jumping we need. >>> >>> The main gist of it is: >>> 1) if '--param vect-epilogues-nomask=1' is passed we refrain from adding the >>> versioning threshold check when versioning a loop at first, >>> 2) later we allow the vectorizer to use skip_vectors to basically check >>> niters to be able to skip the vectorized loop on to the right epilogue, >>> 3) after vectorizing the epilogue, we check whether this was a versioned >>> loop and whether we skipped adding the versioning threshold, if so we add a >>> threshold based on the last VF >>> >>> There is a caveat here, if we don't vectorize the epilogue, because say our >>> architecture doesn't know how, we will end up with a regression. >>> Let's say we have a loop for which our target can vectorize for 32x but not >>> 16x, we get: >>> >>> if (alias & threshold_for_16x ()) >>> { >>> if (niters > 31) >>> vectorized_31 (); >>> else >>> scalar (); >>> } >>> else >>> scalar (); >>> >>> Imagine our iteration count is 18, all we hit is the scalar loop, but now we >>> go through 1x alias and 2x niters. Whereas potentially we could have done >>> with just 1x niters. > > True. Note we should swap the checks to > > if (threshold_for_16x && alias) I agree, I just haven't figured out what the best way of doing it. I am trying TRUTH_ANDIF_EXPR now, but fold_build2 turns that into a TRUTH_AND_EXPR. Can I assume it does that for aarch64 because it can use conditional compare to merge the two checks? To be fair the code generated for aarch64 for the two checks I am looking at doesn't look too bad. I am pondering figuring out why fold decides to transform it back into a TRUTH_AND_EXPR. Otherwise I think I might just try splitting the basic block. Though I am not entirely sure adding an extra jump in there is always a good thing. > >>> A mitigation for this could be to only add the threshold check if we >>> actually vectorized the last loop, I believe a: >>> 'if (LOOP_VINFO_VECTORIZABLE_P (loop_vinfo)) return NULL;' inside that new >>> "else" in vect_transform_loop would do the trick. Though then we will still >>> have that extra alias check... > >>> I am going to hack around in the back-end to "fake" a situation like this >>> and see what happens. >> >> Right should have quickly tried this before sending the email, what happens is >> actually vect_transform_loop never gets called because try_vectorize_loop_1 >> will recognize it can't vectorize the epilogue. So we end up with the >> "mitigation" result I suggested, where we simply don't have a versioning >> threshold which might still not be ideal. > > I think the main issue is how we have implemented epilogue vectorization. > Ideally when vectorizing the loop () we'd recognize all VFs we can handle > and thus know beforehand. I think that's already done for the sake > of openmp simd now so doing this when epilogue vectorization is enabled > as well wouldn't be too bad - so then we know, at the time we do the > versioning, what and how many vectorized epilogues we create. See > vect_analyze_loop and the loop over vector sizes. > I think I managed to do this by passing a boolean initialized to the PARAM_VALUE (PARAM_VECT_EPILOGUES_NOMASK) to vect_analyze_loop and also changing the looping over vector_sizes to count the number of vectorizable loops whilst using the same code path as loop->simdlen if vect_epilogues_nomask is true. Then I set vect_epilogues_nomask to false if the number of vectorized loops isn't higher than 1. I'll follow up with a new patch for you to see, which will most likely make more sense than what I just wrote up there ;) >>> >>> Another potential issue arises when the profitability threshold obtained >>> from the cost model isn't guaranteed to be either LE or EQ to the versioning >>> threshold. In that case there are two separate checks, which now we no >>> longer will attempt to fold. >> >> And I just realized I don't take the cost model threshold into account when >> creating the new threshold check, I guess if it is ordered_p we should again >> take the max there between the new threshold check and the cost threshold for >> the new check. > > Also see https://gcc.gnu.org/ml/gcc-patches/2018-06/msg01397.html for > a patch that computes costs of each possible VF, with that we could > compute a better combined estimate for minimum number of iters > (also considering the extra jumps to reach the main VF/2 loop). > Started to look at this. Just to check something, this patch enables picking a lower VF loop if it determines that its single iteration cost is more than N times lower than the higher VF loop, where N is the ration between higher VF and lower VF no? Sounds like a good check to have, but I don't see how this is related to my current changes. I mean other than that it could turn off epilogue vectorization altogether since it could decide to only vectorize once. Did you mean to show me this as a way to incorporate costs for the extra jumps somehow? For instance, taking the ordered max of current versioning threshold and the SINGLE_ITERATION_COST + 1 (for the extra jump)? I also noticed we purposefully only vectorize the epilogue once, with vect-epilogues-nomask=1, I am guessing that's just because nobody got around to it yet? Specially since even vectorizing it once goes wrong sometimes. >>> >>> I am trying to find a way to test and benchmark these changes. Unfortunately >>> I am having trouble doing this for AVX512 as I find that the old '--param >>> vect_epilogues_nomask=1' already causes wrong codegen in SPEC2017 for the >>> gcc and perlbench benchmarks. > > There's a reason it is not enabled by default. But I'd like to > fix bugs it has so can you possibly reduce testcases and file > bugs for it? > Yeah that makes sense, also options that are off by default tend to bitrot fast! I did have an initial look at it, but the gcc benchmark isn't an easy one to dissect. I will give it another go sometime this week, see if I can figure out at least what loop is being miscompiled. Do you see any use in running any set of tests in the testsuite with the parameter for smaller testcases? > Thanks, > Richard. > >>> >>> >>> Cheers, >>> Andre >>> >>> On 19/07/2019 13:38, Andre Vieira (lists) wrote: >>>> >>>> >>>> On 19/07/2019 12:35, Richard Biener wrote: >>>>> On Fri, 19 Jul 2019, Andre Vieira (lists) wrote: >>>>> >>>>>> >>>>>> >>>>>> On 15/07/2019 11:54, Richard Biener wrote: >>>>>>> On Mon, 15 Jul 2019, Andre Vieira (lists) wrote: >>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> On 12/07/2019 11:19, Richard Biener wrote: >>>>>>>>> On Thu, 11 Jul 2019, Andre Vieira (lists) wrote: >>>>>>>> >>>>>>>> >>>>>>>> I have code that can split the condition and alias checks in >>>>>>>> 'vect_loop_versioning'. For this approach I am considering keeping >>>>>>>> that >>>>>>>> bit of >>>>>>>> code and seeing if I can patch up the checks after vectorizing the >>>>>>>> epilogue >>>>>>>> further. I think initially I will just go with a "hacked up" way >>>>>>>> of >>>>>>>> passing >>>>>>>> down the bb with the iteration check and split the false edge >>>>>>>> every time >>>>>>>> we >>>>>>>> vectorize it further. Will keep you posted on progress. If you >>>>>>>> have any >>>>>>>> pointers/tips they are most welc ome! >>>>>>> >>>>>>> I thought to somehow force the idea that we have a prologue loop >>>>>>> to the vectorizer so it creates the number-of-vectorized iterations >>>>>>> check and branch around the main (highest VF) vectorized loop. >>>>>>> >>>>>> >>>>>> Hmm I think I may have skimmed over this earlier. I am reading it now >>>>>> and am >>>>>> not entirely sure what you mean by this. Specifically the "number of >>>>>> vectorized iterations" or how a prologue loop plays a role in this. >>>>> >>>>> When there is no prologue then the versioning condition currently >>>>> ensures we always enter the vector loop. Only if there is a prologue >>>>> is there a check and branch eventually skipping right to the epilogue. >>>>> If the versioning condition (now using a lower VF) doesn't guarantee >>>>> this we have to add this jump-around. >>>> >>>> Right, I haven't looked at the prologue path yet. I had a quick look and >>>> can't find where this branch skipping to the epilogue is constructed. I >>>> will take a better look after I got my current example to work. >>>> >>>>>> >>>>>> I guess this sheds some light on the comment above. And it definitely >>>>>> implies >>>>>> we would need to know the lowest VF when creating this condition. >>>>>> Which is >>>>>> tricky. >>>>> >>>>> We can simply use the smallest vector size supported by the target to >>>>> derive it from the actual VF, no? >>>> >>>> So I could wait to introduce this check after all epilogue vectorization >>>> is done, back track to the last niter check and duplicate that in the >>>> outer loop. >>>> >>>> What I didn't want to do was use the smallest possible vector size for the >>>> target because I was under the impression that does not necessarily >>>> correspond to the smallest VF we may have for a loop, as the vectorizer >>>> may have decided not to vectorize for that vector size because of costs? >>>> If it I can assume this never happens, that once it starts to vectorize >>>> epilogues that it will keep vectorizing them for any vector size it knows >>>> off then yeah I can use that. >>>> >>>> >>>>> I'm not sure I understand - why would you have any check not inside >>>>> the outer loop? Yes, we now eventually hoist versioning checks >>>>> but the VF checks for the individual variants should be around >>>>> the vectorized loop itself (so not really part of the versioning check). >>>> >>>> Yeah I agree. I was just explaining what I had done wrong now. >>>>> >>>>>> Cheers, >>>>>> Andre >>>>>> >>>>>> PS: I often find myself having to patch the DOMINATOR information, >>>>>> sometimes >>>>>> its easy to, but sometimes it can get pretty complicated. I wonder >>>>>> whether it >>>>>> would make sense to write something that traverses a loop and corrects >>>>>> this, >>>>>> if it doesn't exist already. >>>>> >>>>> There's iterate-fix-dominators, but unless you create new edges/blocks >>>>> manually rather than doing split-block/redirect-edge which should do >>>>> dominator updating for you. >>>> >>>> Ah I was doing everything manually after having some bad experiences with >>>> lv_add_condition_to_bb. I will have a look at those thanks! >>>> >>>> Cheers, >>>> Andre >>>> >>>>> >>>>> Richard. >>>>> >>>>>> >>>>>>> >>>>>>> Richard. >>>>>>> >>>>>> >>>>> >> >
diff --git a/gcc/cfgloop.h b/gcc/cfgloop.h index 2f8ab106d03a8927087ee8038e08a825f6e1e237..04d874b70ddfb8a3f5175dcddf00fef6d33f3219 100644 --- a/gcc/cfgloop.h +++ b/gcc/cfgloop.h @@ -266,6 +266,8 @@ struct GTY ((chain_next ("%h.next"))) loop { the basic-block from being collected but its index can still be reused. */ basic_block former_header; + + unsigned long max_vf_limit; }; /* Set if the loop is known to be infinite. */ diff --git a/gcc/cfgloop.c b/gcc/cfgloop.c index f64326b944e630075ced7035937f4601a1cb6c66..07d633b678b52943d3ab82e8d61b80cd712431ac 100644 --- a/gcc/cfgloop.c +++ b/gcc/cfgloop.c @@ -355,6 +355,7 @@ alloc_loop (void) loop->nb_iterations_upper_bound = 0; loop->nb_iterations_likely_upper_bound = 0; loop->nb_iterations_estimate = 0; + loop->max_vf_limit = MAX_VECTORIZATION_FACTOR; return loop; } diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c index bd8fffb1704787d0a611fc02ee29054422596cbb..89529138b9cefb7f822bca72da06df519eff1a28 100644 --- a/gcc/tree-vect-loop-manip.c +++ b/gcc/tree-vect-loop-manip.c @@ -2968,7 +2968,8 @@ vect_create_cond_for_alias_checks (loop_vec_info loop_vinfo, tree * cond_expr) struct loop * vect_loop_versioning (loop_vec_info loop_vinfo, unsigned int th, bool check_profitability, - poly_uint64 versioning_threshold) + poly_uint64 versioning_threshold, + vec<loop_p> &more_loops) { struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo), *nloop; struct loop *scalar_loop = LOOP_VINFO_SCALAR_LOOP (loop_vinfo); @@ -3143,6 +3144,19 @@ vect_loop_versioning (loop_vec_info loop_vinfo, nloop = loop_version (loop_to_version, cond_expr, &condition_bb, prob, prob.invert (), prob, prob.invert (), true); gcc_assert (nloop); + + if (!LOOP_VINFO_FULLY_MASKED_P (loop_vinfo)) + { + + /* Add the scalar fallback loop to the MORE_LOOPS vector to be looked + at later. Also make sure it is never vectorized for the original + vf by setting the limit of the maximum vf to the original vf minus + one. */ + nloop->max_vf_limit + = constant_lower_bound (LOOP_VINFO_VECT_FACTOR (loop_vinfo)) - 1; + more_loops.safe_push (nloop); + } + nloop = get_loop_copy (loop); /* Kill off IFN_LOOP_VECTORIZED_CALL in the copy, nobody will diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c index b49ab152012a5c7fe9cc0564e58d296447f9ffb1..081885c378200661237ef22d2b011fc775e21218 100644 --- a/gcc/tree-vect-loop.c +++ b/gcc/tree-vect-loop.c @@ -1862,7 +1862,7 @@ vect_analyze_loop_2 (loop_vec_info loop_vinfo, bool &fatal, unsigned *n_stmts) { opt_result ok = opt_result::success (); int res; - unsigned int max_vf = MAX_VECTORIZATION_FACTOR; + unsigned int max_vf = LOOP_VINFO_LOOP (loop_vinfo)->max_vf_limit; poly_uint64 min_vf = 2; /* The first group of checks is independent of the vector size. */ @@ -8468,7 +8468,7 @@ vect_transform_loop_stmt (loop_vec_info loop_vinfo, stmt_vec_info stmt_info, Returns scalar epilogue loop if any. */ struct loop * -vect_transform_loop (loop_vec_info loop_vinfo) +vect_transform_loop (loop_vec_info loop_vinfo, vec<loop_p> &more_loops) { struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo); struct loop *epilogue = NULL; @@ -8530,7 +8530,8 @@ vect_transform_loop (loop_vec_info loop_vinfo) } struct loop *sloop = vect_loop_versioning (loop_vinfo, th, check_profitability, - versioning_threshold); + versioning_threshold, more_loops); + sloop->force_vectorize = false; check_profitability = false; } diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h index f7432f0584762fd28d54f2978dc59f2df443e991..53d66b72d3ba6e15681209153b57736630e40e3b 100644 --- a/gcc/tree-vectorizer.h +++ b/gcc/tree-vectorizer.h @@ -1089,8 +1089,6 @@ STMT_VINFO_BB_VINFO (stmt_vec_info stmt_vinfo) conversion. */ #define MAX_INTERM_CVT_STEPS 3 -#define MAX_VECTORIZATION_FACTOR INT_MAX - /* Nonzero if TYPE represents a (scalar) boolean type or type in the middle-end compatible with it (unsigned precision 1 integral types). Used to determine which types should be vectorized as @@ -1473,7 +1471,7 @@ extern bool slpeel_can_duplicate_loop_p (const struct loop *, const_edge); struct loop *slpeel_tree_duplicate_loop_to_edge_cfg (struct loop *, struct loop *, edge); struct loop *vect_loop_versioning (loop_vec_info, unsigned int, bool, - poly_uint64); + poly_uint64, vec<loop_p> &); extern struct loop *vect_do_peeling (loop_vec_info, tree, tree, tree *, tree *, tree *, int, bool, bool); extern void vect_prepare_for_masked_peels (loop_vec_info); @@ -1614,7 +1612,7 @@ extern tree vect_get_loop_mask (gimple_stmt_iterator *, vec_loop_masks *, unsigned int, tree, unsigned int); /* Drive for loop transformation stage. */ -extern struct loop *vect_transform_loop (loop_vec_info); +extern struct loop *vect_transform_loop (loop_vec_info, vec<loop_p> &); extern opt_loop_vec_info vect_analyze_loop_form (struct loop *, vec_info_shared *); extern bool vectorizable_live_operation (stmt_vec_info, gimple_stmt_iterator *, diff --git a/gcc/tree-vectorizer.c b/gcc/tree-vectorizer.c index 325ef58722d21a65ab896a9358677b07111b060b..d63d532d5fe474904ff84b23912a2ed9cfd6194a 100644 --- a/gcc/tree-vectorizer.c +++ b/gcc/tree-vectorizer.c @@ -868,7 +868,8 @@ try_vectorize_loop_1 (hash_table<simduid_to_vf> *&simduid_to_vf_htab, unsigned *num_vectorized_loops, loop_p loop, loop_vec_info orig_loop_vinfo, gimple *loop_vectorized_call, - gimple *loop_dist_alias_call) + gimple *loop_dist_alias_call, + vec<loop_p> &more_loops) { unsigned ret = 0; vec_info_shared shared; @@ -979,7 +980,7 @@ try_vectorize_loop_1 (hash_table<simduid_to_vf> *&simduid_to_vf_htab, "loop vectorized using variable length vectors\n"); } - loop_p new_loop = vect_transform_loop (loop_vinfo); + loop_p new_loop = vect_transform_loop (loop_vinfo, more_loops); (*num_vectorized_loops)++; /* Now that the loop has been vectorized, allow it to be unrolled etc. */ @@ -1013,7 +1014,7 @@ try_vectorize_loop_1 (hash_table<simduid_to_vf> *&simduid_to_vf_htab, /* Epilogue of vectorized loop must be vectorized too. */ if (new_loop) ret |= try_vectorize_loop_1 (simduid_to_vf_htab, num_vectorized_loops, - new_loop, loop_vinfo, NULL, NULL); + new_loop, loop_vinfo, NULL, NULL, more_loops); return ret; } @@ -1022,7 +1023,8 @@ try_vectorize_loop_1 (hash_table<simduid_to_vf> *&simduid_to_vf_htab, static unsigned try_vectorize_loop (hash_table<simduid_to_vf> *&simduid_to_vf_htab, - unsigned *num_vectorized_loops, loop_p loop) + unsigned *num_vectorized_loops, loop_p loop, + vec<loop_p> &more_loops) { if (!((flag_tree_loop_vectorize && optimize_loop_nest_for_speed_p (loop)) @@ -1032,7 +1034,8 @@ try_vectorize_loop (hash_table<simduid_to_vf> *&simduid_to_vf_htab, return try_vectorize_loop_1 (simduid_to_vf_htab, num_vectorized_loops, loop, NULL, vect_loop_vectorized_call (loop), - vect_loop_dist_alias_call (loop)); + vect_loop_dist_alias_call (loop), + more_loops); } @@ -1051,6 +1054,7 @@ vectorize_loops (void) hash_table<simd_array_to_simduid> *simd_array_to_simduid_htab = NULL; bool any_ifcvt_loops = false; unsigned ret = 0; + auto_vec<loop_p> more_loops; vect_loops_num = number_of_loops (cfun); @@ -1105,14 +1109,19 @@ vectorize_loops (void) vector_loop->dont_vectorize = true; ret |= try_vectorize_loop (simduid_to_vf_htab, &num_vectorized_loops, - vector_loop); + vector_loop, + more_loops); } } } } else ret |= try_vectorize_loop (simduid_to_vf_htab, &num_vectorized_loops, - loop); + loop, more_loops); + + while (!more_loops.is_empty ()) + try_vectorize_loop (simduid_to_vf_htab, &num_vectorized_loops, + more_loops.pop (), more_loops); vect_location = dump_user_location_t (); diff --git a/gcc/tree.h b/gcc/tree.h index 3dce602dfbaca03f568e1c3638d56dfe3a3fd01c..b1c41131e9d1637784a1024d5c301252a06f89e1 100644 --- a/gcc/tree.h +++ b/gcc/tree.h @@ -22,6 +22,8 @@ along with GCC; see the file COPYING3. If not see #include "tree-core.h" +#define MAX_VECTORIZATION_FACTOR INT_MAX + /* Convert a target-independent built-in function code to a combined_fn. */ inline combined_fn