Message ID | 5bb6a2ad-2096-b888-6f31-331de44a2ef7@arm.com |
---|---|
State | New |
Series | [vect] Use main loop's thresholds and vectorization factor to narrow upper_bound of epilogue |
Hi Andre,

on 2021/5/24 2:17 PM, Andre Vieira (lists) via Gcc-patches wrote:
> Hi,
>
> When vectorizing with --param vect-partial-vector-usage=1 the vectorizer uses an unpredicated (all-true predicate for SVE) main loop and a predicated tail loop. The way this was implemented seems to mean it re-uses the same vector-mode for both loops, which means the tail loop isn't an actual loop but only executes one iteration.
>
> This patch uses the knowledge of the conditions to enter an epilogue loop to help come up with a potentially more restrictive upper bound.
>
> Regression tested on aarch64-linux-gnu and also ran the testsuite using '--param vect-partial-vector-usage=1' detecting no ICEs and no execution failures.
>
> Would be good to have this tested for PPC too as I believe they are the main users of the --param vect-partial-vector-usage=1 option. Can someone help me test (and maybe even benchmark?) this on a PPC target?
>

Thanks for doing this!  I can test it on Power10 which enables this parameter
by default, also evaluate its impact on SPEC2017 Ofast/unroll.

Do you have any preference for the baseline commit?  I'll use r12-0 if it's fine.

BR,
Kewen

"Andre Vieira (lists)" <andre.simoesdiasvieira@arm.com> writes:
> Hi,
>
> When vectorizing with --param vect-partial-vector-usage=1 the vectorizer
> uses an unpredicated (all-true predicate for SVE) main loop and a
> predicated tail loop. The way this was implemented seems to mean it
> re-uses the same vector-mode for both loops, which means the tail loop
> isn't an actual loop but only executes one iteration.
>
> This patch uses the knowledge of the conditions to enter an epilogue
> loop to help come up with a potentially more restrictive upper bound.
>
> Regression tested on aarch64-linux-gnu and also ran the testsuite using
> '--param vect-partial-vector-usage=1' detecting no ICEs and no execution
> failures.
>
> Would be good to have this tested for PPC too as I believe they are the
> main users of the --param vect-partial-vector-usage=1 option. Can
> someone help me test (and maybe even benchmark?) this on a PPC target?
>
> Kind regards,
> Andre

LGTM.  OK if no objections and if the Power testing comes back clean.

Thanks,
Richard

> gcc/ChangeLog:
>
>         * tree-vect-loop.c (vect_transform_loop): Use main loop's various
>         thresholds to narrow the upper bound on epilogue iterations.
>
> gcc/testsuite/ChangeLog:
>
>         * gcc.target/aarch64/sve/part_vect_single_iter_epilog.c: New test.
>
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/part_vect_single_iter_epilog.c b/gcc/testsuite/gcc.target/aarch64/sve/part_vect_single_iter_epilog.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..a03229eb55585f637ebd5288fb4c00f8f921d44c
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/part_vect_single_iter_epilog.c
> @@ -0,0 +1,11 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O3 --param vect-partial-vector-usage=1" } */
> +
> +void
> +foo (short * __restrict__ a, short * __restrict__ b, short * __restrict__ c, int n)
> +{
> +  for (int i = 0; i < n; ++i)
> +    c[i] = a[i] + b[i];
> +}
> +
> +/* { dg-final { scan-assembler-times {\twhilelo\tp[0-9]+.h, wzr, [xw][0-9]+} 1 } } */
> diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
> index 3e973e774af8f9205be893e01ad9263281116885..81e9c5cc42415a0a92b765bc46640105670c4e6b 100644
> --- a/gcc/tree-vect-loop.c
> +++ b/gcc/tree-vect-loop.c
> @@ -9723,12 +9723,31 @@ vect_transform_loop (loop_vec_info loop_vinfo, gimple *loop_vectorized_call)
>    /* In these calculations the "- 1" converts loop iteration counts
>       back to latch counts.  */
>    if (loop->any_upper_bound)
> -    loop->nb_iterations_upper_bound
> -      = (final_iter_may_be_partial
> -         ? wi::udiv_ceil (loop->nb_iterations_upper_bound + bias_for_lowest,
> -                          lowest_vf) - 1
> -         : wi::udiv_floor (loop->nb_iterations_upper_bound + bias_for_lowest,
> -                           lowest_vf) - 1);
> +    {
> +      loop_vec_info main_vinfo = LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo);
> +      loop->nb_iterations_upper_bound
> +        = (final_iter_may_be_partial
> +           ? wi::udiv_ceil (loop->nb_iterations_upper_bound + bias_for_lowest,
> +                            lowest_vf) - 1
> +           : wi::udiv_floor (loop->nb_iterations_upper_bound + bias_for_lowest,
> +                             lowest_vf) - 1);
> +      if (main_vinfo)
> +        {
> +          unsigned int bound;
> +          poly_uint64 main_iters
> +            = upper_bound (LOOP_VINFO_VECT_FACTOR (main_vinfo),
> +                           LOOP_VINFO_COST_MODEL_THRESHOLD (main_vinfo));
> +          main_iters
> +            = upper_bound (main_iters,
> +                           LOOP_VINFO_VERSIONING_THRESHOLD (main_vinfo));
> +          if (can_div_away_from_zero_p (main_iters,
> +                                        LOOP_VINFO_VECT_FACTOR (loop_vinfo),
> +                                        &bound))
> +            loop->nb_iterations_upper_bound
> +              = wi::umin ((widest_int) (bound - 1),
> +                          loop->nb_iterations_upper_bound);
> +        }
> +    }
>    if (loop->any_likely_upper_bound)
>      loop->nb_iterations_likely_upper_bound
>        = (final_iter_may_be_partial
on 2021/5/24 3:21 PM, Kewen.Lin via Gcc-patches wrote:
> Hi Andre,
>
> on 2021/5/24 2:17 PM, Andre Vieira (lists) via Gcc-patches wrote:
>> Hi,
>>
>> When vectorizing with --param vect-partial-vector-usage=1 the vectorizer uses an unpredicated (all-true predicate for SVE) main loop and a predicated tail loop. The way this was implemented seems to mean it re-uses the same vector-mode for both loops, which means the tail loop isn't an actual loop but only executes one iteration.
>>
>> This patch uses the knowledge of the conditions to enter an epilogue loop to help come up with a potentially more restrictive upper bound.
>>
>> Regression tested on aarch64-linux-gnu and also ran the testsuite using '--param vect-partial-vector-usage=1' detecting no ICEs and no execution failures.
>>
>> Would be good to have this tested for PPC too as I believe they are the main users of the --param vect-partial-vector-usage=1 option. Can someone help me test (and maybe even benchmark?) this on a PPC target?
>>
>
> Thanks for doing this!  I can test it on Power10 which enables this parameter
> by default, also evaluate its impact on SPEC2017 Ofast/unroll.
>

Bootstrapped/regtested on powerpc64le-linux-gnu Power10.
SPEC2017 run didn't show any remarkable improvement/degradation.

BR,
Kewen
Thank you Kewen!!

I will apply this now.

BR,
Andre

On 25/05/2021 09:42, Kewen.Lin wrote:
> on 2021/5/24 3:21 PM, Kewen.Lin via Gcc-patches wrote:
>> Hi Andre,
>>
>> on 2021/5/24 2:17 PM, Andre Vieira (lists) via Gcc-patches wrote:
>>> Hi,
>>>
>>> When vectorizing with --param vect-partial-vector-usage=1 the vectorizer uses an unpredicated (all-true predicate for SVE) main loop and a predicated tail loop. The way this was implemented seems to mean it re-uses the same vector-mode for both loops, which means the tail loop isn't an actual loop but only executes one iteration.
>>>
>>> This patch uses the knowledge of the conditions to enter an epilogue loop to help come up with a potentially more restrictive upper bound.
>>>
>>> Regression tested on aarch64-linux-gnu and also ran the testsuite using '--param vect-partial-vector-usage=1' detecting no ICEs and no execution failures.
>>>
>>> Would be good to have this tested for PPC too as I believe they are the main users of the --param vect-partial-vector-usage=1 option. Can someone help me test (and maybe even benchmark?) this on a PPC target?
>>>
>>
>> Thanks for doing this!  I can test it on Power10 which enables this parameter
>> by default, also evaluate its impact on SPEC2017 Ofast/unroll.
>>
> Bootstrapped/regtested on powerpc64le-linux-gnu Power10.
> SPEC2017 run didn't show any remarkable improvement/degradation.
>
> BR,
> Kewen
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/part_vect_single_iter_epilog.c b/gcc/testsuite/gcc.target/aarch64/sve/part_vect_single_iter_epilog.c
new file mode 100644
index 0000000000000000000000000000000000000000..a03229eb55585f637ebd5288fb4c00f8f921d44c
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/sve/part_vect_single_iter_epilog.c
@@ -0,0 +1,11 @@
+/* { dg-do compile } */
+/* { dg-options "-O3 --param vect-partial-vector-usage=1" } */
+
+void
+foo (short * __restrict__ a, short * __restrict__ b, short * __restrict__ c, int n)
+{
+  for (int i = 0; i < n; ++i)
+    c[i] = a[i] + b[i];
+}
+
+/* { dg-final { scan-assembler-times {\twhilelo\tp[0-9]+.h, wzr, [xw][0-9]+} 1 } } */
diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index 3e973e774af8f9205be893e01ad9263281116885..81e9c5cc42415a0a92b765bc46640105670c4e6b 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -9723,12 +9723,31 @@ vect_transform_loop (loop_vec_info loop_vinfo, gimple *loop_vectorized_call)
   /* In these calculations the "- 1" converts loop iteration counts
      back to latch counts.  */
   if (loop->any_upper_bound)
-    loop->nb_iterations_upper_bound
-      = (final_iter_may_be_partial
-         ? wi::udiv_ceil (loop->nb_iterations_upper_bound + bias_for_lowest,
-                          lowest_vf) - 1
-         : wi::udiv_floor (loop->nb_iterations_upper_bound + bias_for_lowest,
-                           lowest_vf) - 1);
+    {
+      loop_vec_info main_vinfo = LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo);
+      loop->nb_iterations_upper_bound
+        = (final_iter_may_be_partial
+           ? wi::udiv_ceil (loop->nb_iterations_upper_bound + bias_for_lowest,
+                            lowest_vf) - 1
+           : wi::udiv_floor (loop->nb_iterations_upper_bound + bias_for_lowest,
+                             lowest_vf) - 1);
+      if (main_vinfo)
+        {
+          unsigned int bound;
+          poly_uint64 main_iters
+            = upper_bound (LOOP_VINFO_VECT_FACTOR (main_vinfo),
+                           LOOP_VINFO_COST_MODEL_THRESHOLD (main_vinfo));
+          main_iters
+            = upper_bound (main_iters,
+                           LOOP_VINFO_VERSIONING_THRESHOLD (main_vinfo));
+          if (can_div_away_from_zero_p (main_iters,
+                                        LOOP_VINFO_VECT_FACTOR (loop_vinfo),
+                                        &bound))
+            loop->nb_iterations_upper_bound
+              = wi::umin ((widest_int) (bound - 1),
+                          loop->nb_iterations_upper_bound);
+        }
+    }
   if (loop->any_likely_upper_bound)
     loop->nb_iterations_likely_upper_bound
       = (final_iter_may_be_partial
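
For readers following the arithmetic in the tree-vect-loop.c hunk, here is a rough sketch of the narrowing step in plain C. It assumes every quantity is a compile-time constant, so the poly_uint64 upper_bound reduces to a maximum and can_div_away_from_zero_p reduces to a round-up division that always succeeds. The function name narrow_epilogue_latch_bound and its parameter names are hypothetical and do not exist in GCC.

```c
/* Maximum of two values; stands in for upper_bound on constant operands.  */
static unsigned long
max_ul (unsigned long a, unsigned long b)
{
  return a > b ? a : b;
}

/* Minimum of two values; stands in for wi::umin.  */
static unsigned long
min_ul (unsigned long a, unsigned long b)
{
  return a < b ? a : b;
}

/* Hypothetical helper, not a GCC function.  The epilogue is only entered
   with fewer scalar iterations left than the largest of the main loop's
   vectorization factor, cost-model threshold and versioning threshold.
   Dividing that limit by the epilogue's vectorization factor, rounding up,
   gives a cap on epilogue iterations; the "- 1" converts it to a latch
   count, as in the patch.  Assumes main_vf >= 1 and epilogue_vf >= 1.  */
unsigned long
narrow_epilogue_latch_bound (unsigned long main_vf,
                             unsigned long cost_threshold,
                             unsigned long versioning_threshold,
                             unsigned long epilogue_vf,
                             unsigned long current_latch_bound)
{
  unsigned long main_iters
    = max_ul (max_ul (main_vf, cost_threshold), versioning_threshold);
  unsigned long bound = (main_iters + epilogue_vf - 1) / epilogue_vf;
  return min_ul (bound - 1, current_latch_bound);
}
```

With main_vf = 8, both thresholds 0 and epilogue_vf = 8, the result is a latch bound of 0, which matches the case described in the cover letter where the predicated tail loop executes only one iteration.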