Message ID | 20160421160937.GB7047@msticlxl57.ims.intel.com |
---|---|
State | New |
Headers | show |
On Thu, Apr 21, 2016 at 6:09 PM, Ilya Enkovich <enkovich.gnu@gmail.com> wrote: > Hi, > > Currently when loop is vectorized we adjust its nb_iterations_upper_bound > by dividing it by VF. This is incorrect since nb_iterations_upper_bound > is upper bound for (<number of loop iterations> - 1) and therefore simple > dividing it by VF in many cases gives us bounds greater than a real one. > Correct value would be ((nb_iterations_upper_bound + 1) / VF - 1). Yeah, that seems correct. > Also decrement due to peeling for gaps should happen before we scale it > by VF because peeling applies to a scalar loop, not vectorized one. That's not true - PEELING_FOR_GAPs is so that the last _vector_ iteration is peeled as scalar operations. We do not account for the amount of known prologue peeling (if peeling for alignment and the misalignment is known at compile-time) - that would be peeling of scalar iterations. But it would be interesting to know why we need the != 0 check - static cost modelling should have disabled vectorization if the vectorized body isn't run. > This patch modifies nb_iterations_upper_bound computation to resolve > these issues. You do not adjust the ->nb_iterations_estimate accordingly. > Running regression testing I got one fail due to optimized loop. Heres > is a loop: > > foo (signed char s) > { > signed char i; > for (i = 0; i < s; i++) > yy[i] = (signed int) i; > } > > Here we vectorize for AVX512 using VF=64. Original loop has max 127 > iterations and therefore vectorized loop may be executed only once. > With the patch applied compiler detects it and transforms loop into > BB with just stores of constants vectors into yy. Test was adjusted > to increase number of possible iterations. A copy of test was added > to check we can optimize out the original loop. > > Bootstrapped and regtested on x86_64-pc-linux-gnu. OK for trunk? I'd like to see testcases covering the corner-cases - have them have upper bound estimates by adjusting known array sizes and also cover the case of peeling for gaps. Richard. > Thanks, > Ilya > -- > gcc/ > > 2016-04-21 Ilya Enkovich <ilya.enkovich@intel.com> > > * tree-vect-loop.c (vect_transform_loop): Fix > nb_iterations_upper_bound computation for vectorized loop. > > gcc/testsuite/ > > 2016-04-21 Ilya Enkovich <ilya.enkovich@intel.com> > > * gcc.target/i386/vect-unpack-2.c (avx512bw_test): Avoid > optimization of vector loop. > * gcc.target/i386/vect-unpack-3.c: New test. > > > diff --git a/gcc/testsuite/gcc.target/i386/vect-unpack-2.c b/gcc/testsuite/gcc.target/i386/vect-unpack-2.c > index 4825248..51c518e 100644 > --- a/gcc/testsuite/gcc.target/i386/vect-unpack-2.c > +++ b/gcc/testsuite/gcc.target/i386/vect-unpack-2.c > @@ -6,19 +6,22 @@ > > #define N 120 > signed int yy[10000]; > +signed char zz[10000]; > > void > -__attribute__ ((noinline)) foo (signed char s) > +__attribute__ ((noinline,noclone)) foo (int s) > { > - signed char i; > + int i; > for (i = 0; i < s; i++) > - yy[i] = (signed int) i; > + yy[i] = zz[i]; > } > > void > avx512bw_test () > { > signed char i; > + for (i = 0; i < N; i++) > + zz[i] = i; > foo (N); > for (i = 0; i < N; i++) > if ( (signed int)i != yy [i] ) > diff --git a/gcc/testsuite/gcc.target/i386/vect-unpack-3.c b/gcc/testsuite/gcc.target/i386/vect-unpack-3.c > new file mode 100644 > index 0000000..eb8a93e > --- /dev/null > +++ b/gcc/testsuite/gcc.target/i386/vect-unpack-3.c > @@ -0,0 +1,29 @@ > +/* { dg-do run } */ > +/* { dg-options "-O2 -fdump-tree-vect-details -ftree-vectorize -ffast-math -mavx512bw -save-temps" } */ > +/* { dg-require-effective-target avx512bw } */ > + > +#include "avx512bw-check.h" > + > +#define N 120 > +signed int yy[10000]; > + > +void > +__attribute__ ((noinline)) foo (signed char s) > +{ > + signed char i; > + for (i = 0; i < s; i++) > + yy[i] = (signed int) i; > +} > + > +void > +avx512bw_test () > +{ > + signed char i; > + foo (N); > + for (i = 0; i < N; i++) > + if ( (signed int)i != yy [i] ) > + abort (); > +} > + > +/* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */ > +/* { dg-final { scan-assembler-not "vpmovsxbw\[ \\t\]+\[^\n\]*%zmm" } } */ > diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c > index d813b86..da98211 100644 > --- a/gcc/tree-vect-loop.c > +++ b/gcc/tree-vect-loop.c > @@ -6921,11 +6921,13 @@ vect_transform_loop (loop_vec_info loop_vinfo) > /* Reduce loop iterations by the vectorization factor. */ > scale_loop_profile (loop, GCOV_COMPUTE_SCALE (1, vectorization_factor), > expected_iterations / vectorization_factor); > - loop->nb_iterations_upper_bound > - = wi::udiv_floor (loop->nb_iterations_upper_bound, vectorization_factor); > if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo) > && loop->nb_iterations_upper_bound != 0) > loop->nb_iterations_upper_bound = loop->nb_iterations_upper_bound - 1; > + loop->nb_iterations_upper_bound > + = wi::udiv_floor (loop->nb_iterations_upper_bound + 1, > + vectorization_factor) - 1; > + > if (loop->any_estimate) > { > loop->nb_iterations_estimate
2016-04-22 10:13 GMT+03:00 Richard Biener <richard.guenther@gmail.com>: > On Thu, Apr 21, 2016 at 6:09 PM, Ilya Enkovich <enkovich.gnu@gmail.com> wrote: >> Hi, >> >> Currently when loop is vectorized we adjust its nb_iterations_upper_bound >> by dividing it by VF. This is incorrect since nb_iterations_upper_bound >> is upper bound for (<number of loop iterations> - 1) and therefore simple >> dividing it by VF in many cases gives us bounds greater than a real one. >> Correct value would be ((nb_iterations_upper_bound + 1) / VF - 1). > > Yeah, that seems correct. > >> Also decrement due to peeling for gaps should happen before we scale it >> by VF because peeling applies to a scalar loop, not vectorized one. > > That's not true - PEELING_FOR_GAPs is so that the last _vector_ iteration > is peeled as scalar operations. We do not account for the amount > of known prologue peeling (if peeling for alignment and the misalignment > is known at compile-time) - that would be peeling of scalar iterations. My initial patch didn't change anything for PEELING_FOR_GAP and it caused a runfail for one of SPEC2006 benchmarks. My investigation showed number of vector iterations calculation doesn't match nb_iterations_upper_bound adjustment in a way PEELING_FOR_GAP is accounted. Looking into vect_generate_tmps_on_preheader I see: /* If epilogue loop is required because of data accesses with gaps, we subtract one iteration from the total number of iterations here for correct calculation of RATIO. */ And then we decrement loop counter before dividing it by VF to compute ratio and ratio_mult_vf. This doesn't match nb_iterations_upper_bound update and that's why I fixed it. This resolved runfail for me. Thus ratio_mult_vf computation conflicts with your statement we peel a vector iteration. > > But it would be interesting to know why we need the != 0 check - static > cost modelling should have disabled vectorization if the vectorized body > isn't run. > >> This patch modifies nb_iterations_upper_bound computation to resolve >> these issues. > > You do not adjust the ->nb_iterations_estimate accordingly. > >> Running regression testing I got one fail due to optimized loop. Heres >> is a loop: >> >> foo (signed char s) >> { >> signed char i; >> for (i = 0; i < s; i++) >> yy[i] = (signed int) i; >> } >> >> Here we vectorize for AVX512 using VF=64. Original loop has max 127 >> iterations and therefore vectorized loop may be executed only once. >> With the patch applied compiler detects it and transforms loop into >> BB with just stores of constants vectors into yy. Test was adjusted >> to increase number of possible iterations. A copy of test was added >> to check we can optimize out the original loop. >> >> Bootstrapped and regtested on x86_64-pc-linux-gnu. OK for trunk? > > I'd like to see testcases covering the corner-cases - have them have > upper bound estimates by adjusting known array sizes and also cover > the case of peeling for gaps. OK, I'll make more tests. Thanks, Ilya > > Richard. >
On Tue, Apr 26, 2016 at 2:29 PM, Ilya Enkovich <enkovich.gnu@gmail.com> wrote: > 2016-04-22 10:13 GMT+03:00 Richard Biener <richard.guenther@gmail.com>: >> On Thu, Apr 21, 2016 at 6:09 PM, Ilya Enkovich <enkovich.gnu@gmail.com> wrote: >>> Hi, >>> >>> Currently when loop is vectorized we adjust its nb_iterations_upper_bound >>> by dividing it by VF. This is incorrect since nb_iterations_upper_bound >>> is upper bound for (<number of loop iterations> - 1) and therefore simple >>> dividing it by VF in many cases gives us bounds greater than a real one. >>> Correct value would be ((nb_iterations_upper_bound + 1) / VF - 1). >> >> Yeah, that seems correct. >> >>> Also decrement due to peeling for gaps should happen before we scale it >>> by VF because peeling applies to a scalar loop, not vectorized one. >> >> That's not true - PEELING_FOR_GAPs is so that the last _vector_ iteration >> is peeled as scalar operations. We do not account for the amount >> of known prologue peeling (if peeling for alignment and the misalignment >> is known at compile-time) - that would be peeling of scalar iterations. > > My initial patch didn't change anything for PEELING_FOR_GAP and it caused > a runfail for one of SPEC2006 benchmarks. My investigation showed number > of vector iterations calculation doesn't match nb_iterations_upper_bound > adjustment in a way PEELING_FOR_GAP is accounted. > > Looking into vect_generate_tmps_on_preheader I see: > > /* If epilogue loop is required because of data accesses with gaps, we > subtract one iteration from the total number of iterations here for > correct calculation of RATIO. */ > > And then we decrement loop counter before dividing it by VF to compute > ratio and ratio_mult_vf. This doesn't match nb_iterations_upper_bound > update and that's why I fixed it. This resolved runfail for me. > > Thus ratio_mult_vf computation conflicts with your statement we peel a > vector iteration. Hum. I stand corrected. So yes, we remove the last vector iteration if there are not already epilogue iterations. >> >> But it would be interesting to know why we need the != 0 check - static >> cost modelling should have disabled vectorization if the vectorized body >> isn't run. >> >>> This patch modifies nb_iterations_upper_bound computation to resolve >>> these issues. >> >> You do not adjust the ->nb_iterations_estimate accordingly. >> >>> Running regression testing I got one fail due to optimized loop. Heres >>> is a loop: >>> >>> foo (signed char s) >>> { >>> signed char i; >>> for (i = 0; i < s; i++) >>> yy[i] = (signed int) i; >>> } >>> >>> Here we vectorize for AVX512 using VF=64. Original loop has max 127 >>> iterations and therefore vectorized loop may be executed only once. >>> With the patch applied compiler detects it and transforms loop into >>> BB with just stores of constants vectors into yy. Test was adjusted >>> to increase number of possible iterations. A copy of test was added >>> to check we can optimize out the original loop. >>> >>> Bootstrapped and regtested on x86_64-pc-linux-gnu. OK for trunk? >> >> I'd like to see testcases covering the corner-cases - have them have >> upper bound estimates by adjusting known array sizes and also cover >> the case of peeling for gaps. > > OK, I'll make more tests. Thanks, Richard. > Thanks, > Ilya > >> >> Richard. >>
diff --git a/gcc/testsuite/gcc.target/i386/vect-unpack-2.c b/gcc/testsuite/gcc.target/i386/vect-unpack-2.c index 4825248..51c518e 100644 --- a/gcc/testsuite/gcc.target/i386/vect-unpack-2.c +++ b/gcc/testsuite/gcc.target/i386/vect-unpack-2.c @@ -6,19 +6,22 @@ #define N 120 signed int yy[10000]; +signed char zz[10000]; void -__attribute__ ((noinline)) foo (signed char s) +__attribute__ ((noinline,noclone)) foo (int s) { - signed char i; + int i; for (i = 0; i < s; i++) - yy[i] = (signed int) i; + yy[i] = zz[i]; } void avx512bw_test () { signed char i; + for (i = 0; i < N; i++) + zz[i] = i; foo (N); for (i = 0; i < N; i++) if ( (signed int)i != yy [i] ) diff --git a/gcc/testsuite/gcc.target/i386/vect-unpack-3.c b/gcc/testsuite/gcc.target/i386/vect-unpack-3.c new file mode 100644 index 0000000..eb8a93e --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/vect-unpack-3.c @@ -0,0 +1,29 @@ +/* { dg-do run } */ +/* { dg-options "-O2 -fdump-tree-vect-details -ftree-vectorize -ffast-math -mavx512bw -save-temps" } */ +/* { dg-require-effective-target avx512bw } */ + +#include "avx512bw-check.h" + +#define N 120 +signed int yy[10000]; + +void +__attribute__ ((noinline)) foo (signed char s) +{ + signed char i; + for (i = 0; i < s; i++) + yy[i] = (signed int) i; +} + +void +avx512bw_test () +{ + signed char i; + foo (N); + for (i = 0; i < N; i++) + if ( (signed int)i != yy [i] ) + abort (); +} + +/* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */ +/* { dg-final { scan-assembler-not "vpmovsxbw\[ \\t\]+\[^\n\]*%zmm" } } */ diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c index d813b86..da98211 100644 --- a/gcc/tree-vect-loop.c +++ b/gcc/tree-vect-loop.c @@ -6921,11 +6921,13 @@ vect_transform_loop (loop_vec_info loop_vinfo) /* Reduce loop iterations by the vectorization factor. */ scale_loop_profile (loop, GCOV_COMPUTE_SCALE (1, vectorization_factor), expected_iterations / vectorization_factor); - loop->nb_iterations_upper_bound - = wi::udiv_floor (loop->nb_iterations_upper_bound, vectorization_factor); if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo) && loop->nb_iterations_upper_bound != 0) loop->nb_iterations_upper_bound = loop->nb_iterations_upper_bound - 1; + loop->nb_iterations_upper_bound + = wi::udiv_floor (loop->nb_iterations_upper_bound + 1, + vectorization_factor) - 1; + if (loop->any_estimate) { loop->nb_iterations_estimate