Message ID: 20240911021637.3759883-1-hongtao.liu@intel.com
State: New
Series: [RFC] Enable vectorization for unknown tripcount in very cheap cost model but disable epilog vectorization.
On Wed, Sep 11, 2024 at 4:17 AM liuhongt <hongtao.liu@intel.com> wrote:
>
> GCC12 enables vectorization at O2 with the very cheap cost model, which is
> restricted to constant tripcounts. The vectorization capacity is very
> limited there, in consideration of the codesize impact.
>
> The patch extends the very cheap cost model a little bit to support variable
> tripcounts, but still disables peeling for gaps/alignment, runtime aliasing
> checks and epilogue vectorization, again in consideration of codesize.
>
> So there are at most 2 versions of a loop for O2 vectorization: one
> vectorized main loop and one scalar/remainder loop.
>
> i.e.
>
> void
> foo1 (int* __restrict a, int* b, int* c, int n)
> {
>   for (int i = 0; i != n; i++)
>     a[i] = b[i] + c[i];
> }
>
> compiled with -O2 -march=x86-64-v3 is vectorized to
>
> .L10:
>         vmovdqu (%r8,%rax), %ymm0
>         vpaddd  (%rsi,%rax), %ymm0, %ymm0
>         vmovdqu %ymm0, (%rdi,%rax)
>         addq    $32, %rax
>         cmpq    %rdx, %rax
>         jne     .L10
>         movl    %ecx, %eax
>         andl    $-8, %eax
>         cmpl    %eax, %ecx
>         je      .L21
>         vzeroupper
> .L12:
>         movl    (%r8,%rax,4), %edx
>         addl    (%rsi,%rax,4), %edx
>         movl    %edx, (%rdi,%rax,4)
>         addq    $1, %rax
>         cmpl    %eax, %ecx
>         jne     .L12
>
> As measured with SPEC2017 on EMR, the patch (N-Iter) improves performance
> by 4.11% with an extra 2.8% codesize, and the cheap cost model improves
> performance by 5.74% with an extra 8.88% codesize. The details are below.

I'm confused by this: are the N-Iter numbers on top of the cheap cost
model numbers?
> Performance measured with -march=x86-64-v3 -O2 on EMR
>
>                     N-Iter    cheap cost model
> 500.perlbench_r     -0.12%    -0.12%
> 502.gcc_r            0.44%    -0.11%
> 505.mcf_r            0.17%     4.46%
> 520.omnetpp_r        0.28%    -0.27%
> 523.xalancbmk_r      0.00%     5.93%
> 525.x264_r          -0.09%    23.53%
> 531.deepsjeng_r      0.19%     0.00%
> 541.leela_r          0.22%     0.00%
> 548.exchange2_r    -11.54%   -22.34%
> 557.xz_r             0.74%     0.49%
> GEOMEAN INT         -1.04%     0.60%
>
> 503.bwaves_r         3.13%     4.72%
> 507.cactuBSSN_r      1.17%     0.29%
> 508.namd_r           0.39%     6.87%
> 510.parest_r         3.14%     8.52%
> 511.povray_r         0.10%    -0.20%
> 519.lbm_r           -0.68%    10.14%
> 521.wrf_r           68.20%    76.73%

So this seems to regress as well?

> 526.blender_r        0.12%     0.12%
> 527.cam4_r          19.67%    23.21%
> 538.imagick_r        0.12%     0.24%
> 544.nab_r            0.63%     0.53%
> 549.fotonik3d_r     14.44%     9.43%
> 554.roms_r          12.39%     0.00%
> GEOMEAN FP           8.26%     9.41%
> GEOMEAN ALL          4.11%     5.74%
>
> Code size impact
>
>                     N-Iter    cheap cost model
> 500.perlbench_r      0.22%     1.03%
> 502.gcc_r            0.25%     0.60%
> 505.mcf_r            0.00%    32.07%
> 520.omnetpp_r        0.09%     0.31%
> 523.xalancbmk_r      0.08%     1.86%
> 525.x264_r           0.75%     7.96%
> 531.deepsjeng_r      0.72%     3.28%
> 541.leela_r          0.18%     0.75%
> 548.exchange2_r      8.29%    12.19%
> 557.xz_r             0.40%     0.60%
> GEOMEAN INT          1.07%     5.71%
>
> 503.bwaves_r        12.89%    21.59%
> 507.cactuBSSN_r      0.90%    20.19%
> 508.namd_r           0.77%    14.75%
> 510.parest_r         0.91%     3.91%
> 511.povray_r         0.45%     4.08%
> 519.lbm_r            0.00%     0.00%
> 521.wrf_r            5.97%    12.79%
> 526.blender_r        0.49%     3.84%
> 527.cam4_r           1.39%     3.28%
> 538.imagick_r        1.86%     7.78%
> 544.nab_r            0.41%     3.00%
> 549.fotonik3d_r     25.50%    47.47%
> 554.roms_r           5.17%    13.01%
> GEOMEAN FP           4.14%    11.38%
> GEOMEAN ALL          2.80%     8.88%
>
> The only regression is in 548.exchange2_r: vectorization of the inner loop
> in each layer of the 9-level loop nest increases register pressure and
> causes more spills.
>
> - block(rnext:9, 1, i1) = block(rnext:9, 1, i1) + 10
> - block(rnext:9, 2, i2) = block(rnext:9, 2, i2) + 10
> .....
> - block(rnext:9, 9, i9) = block(rnext:9, 9, i9) + 10
> ...
> - block(rnext:9, 2, i2) = block(rnext:9, 2, i2) + 10
> - block(rnext:9, 1, i1) = block(rnext:9, 1, i1) + 10
>
> It looks like aarch64 doesn't have this issue because aarch64 has 32 GPRs,
> while x86 only has 16. I have an extra patch that prevents loop
> vectorization in deep loop nests for the x86 backend, which can bring the
> performance back.
>
> For 503.bwaves_r/505.mcf_r/507.cactuBSSN_r/508.namd_r, the cheap cost
> model increases codesize a lot but doesn't improve performance at all;
> N-Iter is much better there for codesize.
>
> Any comments?
>
> gcc/ChangeLog:
>
>         * tree-vect-loop.cc (vect_analyze_loop_costing): Enable
>         vectorization for LOOP_VINFO_PEELING_FOR_NITER in very cheap
>         cost model.
>         (vect_analyze_loop): Disable epilogue vectorization in very
>         cheap cost model.
> ---
>  gcc/tree-vect-loop.cc | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
> index 242d5e2d916..06afd8cae79 100644
> --- a/gcc/tree-vect-loop.cc
> +++ b/gcc/tree-vect-loop.cc
> @@ -2356,8 +2356,7 @@ vect_analyze_loop_costing (loop_vec_info loop_vinfo,
>       a copy of the scalar code (even if we might be able to vectorize it).  */
>    if (loop_cost_model (loop) == VECT_COST_MODEL_VERY_CHEAP
>        && (LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo)
> -         || LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
> -         || LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo)))
> +         || LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)))

I notice that we should probably not call vect_enhance_data_refs_alignment,
because when alignment peeling is optional we should avoid it rather than
disabling the vectorization completely.

Also, if you allow peeling for niter then there's no good reason not to
allow peeling for gaps (or any other epilogue peeling). The extra cost for
niter peeling is a runtime check before the loop, which would also happen
(plus keeping the scalar copy) when there's a runtime cost check.
That also means versioning for alias/alignment could be allowed if it shares
the scalar loop with the epilogue (I don't remember the constraints we set
in place for the sharing).

Richard.

>      {
>        if (dump_enabled_p ())
>          dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> @@ -3638,7 +3637,8 @@ vect_analyze_loop (class loop *loop, vec_info_shared *shared)
>             /* No code motion support for multiple epilogues so for now
>                not supported when multiple exits.  */
>          && !LOOP_VINFO_EARLY_BREAKS (first_loop_vinfo)
> -        && !loop->simduid);
> +        && !loop->simduid
> +        && loop_cost_model (loop) > VECT_COST_MODEL_VERY_CHEAP);
>    if (!vect_epilogues)
>      return first_loop_vinfo;
>
> --
> 2.31.1
On Wed, Sep 11, 2024 at 4:04 PM Richard Biener
<richard.guenther@gmail.com> wrote:
>
> On Wed, Sep 11, 2024 at 4:17 AM liuhongt <hongtao.liu@intel.com> wrote:
> >
> > [...]
> >
> > As measured with SPEC2017 on EMR, the patch (N-Iter) improves performance
> > by 4.11% with an extra 2.8% codesize, and the cheap cost model improves
> > performance by 5.74% with an extra 8.88% codesize. The details are below.
>
> I'm confused by this: are the N-Iter numbers on top of the cheap cost
> model numbers?

No, it's N-Iter vs. base (the very cheap cost model), and cheap vs. base.
> > [...]
> > 521.wrf_r           68.20%    76.73%
>
> So this seems to regress as well?

N-Iter increases performance less than the cheap cost model; that's
expected, it is not a regression.

> > [...]
> > [...]
>
> I notice that we should probably not call vect_enhance_data_refs_alignment,
> because when alignment peeling is optional we should avoid it rather than
> disabling the vectorization completely.
> Also, if you allow peeling for niter then there's no good reason not to
> allow peeling for gaps (or any other epilogue peeling).

Maybe; I just want to be conservative.

> The extra cost for niter peeling is a runtime check before the loop, which
> would also happen (plus keeping the scalar copy) when there's a runtime
> cost check. That also means versioning for alias/alignment could be
> allowed if it shares the scalar loop with the epilogue (I don't remember
> the constraints we set in place for the sharing).

Yes, but in current GCC a runtime alias check creates a separate scalar
loop: https://godbolt.org/z/9seoWePKK
Enabling runtime alias checks could therefore increase codesize a lot
without any performance improvement.

> Richard.
>
> > [...]
On Wed, Sep 11, 2024 at 4:21 PM Hongtao Liu <crazylht@gmail.com> wrote:
>
> On Wed, Sep 11, 2024 at 4:04 PM Richard Biener
> <richard.guenther@gmail.com> wrote:
> >
> > On Wed, Sep 11, 2024 at 4:17 AM liuhongt <hongtao.liu@intel.com> wrote:
> > >
> > > [...]
> > > [...]
> > > GEOMEAN FP           8.26%     9.41%
> > > GEOMEAN ALL          4.11%     5.74%

I've tested the patch on aarch64; it shows a similar improvement with
little codesize increase.
I haven't tested it on other backends, but I would expect similarly good
improvements there.

> > > [...]
> > > [...]

--
BR,
Hongtao
On Thu, Sep 12, 2024 at 4:50 PM Hongtao Liu <crazylht@gmail.com> wrote:
>
> On Wed, Sep 11, 2024 at 4:21 PM Hongtao Liu <crazylht@gmail.com> wrote:
> >
> > [...]
> > > > [...]
>
> I've tested the patch on aarch64; it shows a similar improvement with
> little codesize increase. I haven't tested it on other backends, but I
> would expect similarly good improvements there.

I think overall this is expected, since a constant niter divisible by
the VF isn't a common situation. So the question is mostly whether we
want to pay the size penalty or not.

Looking only at the docs, the proposed change would make the very-cheap
cost model nearly(?) equivalent to the cheap one, so maybe the answer is
to default to cheap rather than very-cheap? One difference seems to be
that cheap allows alias versioning.

Richard.
> [...]
Richard Biener <richard.guenther@gmail.com> writes:
> On Thu, Sep 12, 2024 at 4:50 PM Hongtao Liu <crazylht@gmail.com> wrote:
>>
>> On Wed, Sep 11, 2024 at 4:21 PM Hongtao Liu <crazylht@gmail.com> wrote:
>> >
>> > [...]
>> [...]
>
> I think overall this is expected since a constant niter divisible by
> the VF isn't a common situation. So the question is mostly whether
> we want to pay the size penalty or not.
>
> Looking only at the docs, the proposed change would make the very-cheap
> cost model nearly(?) equivalent to the cheap one, so maybe the answer
> is to default to cheap rather than very-cheap?
One difference seems to > be that cheap allows alias versioning. I remember seeing cases in the past where we could generate an excessive number of alias checks. The cost model didn't account for them very well, since the checks often became a fixed overhead for all paths (both scalar and vector), especially if the checks were fully if-converted, with one branch at the end. The relevant comparison is then between the original pre-vectorisation scalar code and the code with alias checks, rather than between post-vectorisation scalar code and post-vectorisation vector code. Things might be better now though. FTR, I don't object to relaxing the -O2 model. It was deliberately conservative, for a time when enabling vectorisation at -O2 was somewhat controversial. It was also heavily influenced by SVE, where variable trip counts are not an issue. The proposal would also make GCC's behaviour more similar to Clang's. Thanks, Richard
On Wed, Sep 18, 2024 at 7:55 PM Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> Richard Biener <richard.guenther@gmail.com> writes:
> > On Thu, Sep 12, 2024 at 4:50 PM Hongtao Liu <crazylht@gmail.com> wrote:
> [...]
> > I think overall this is expected since a constant niter divisible by
> > the VF isn't a common situation.  So the question is mostly whether
> > we want to pay the size penalty or not.
> >
> > Looking only at docs the proposed change would make the very-cheap
> > cost model nearly(?) equivalent to the cheap one so maybe the answer
> > is to default to cheap rather than very-cheap?  One difference seems to
> > be that cheap allows alias versioning.
>
> I remember seeing cases in the past where we could generate an
> excessive number of alias checks.  The cost model didn't account
> for them very well, since the checks often became a fixed overhead
> for all paths (both scalar and vector), especially if the checks
> were fully if-converted, with one branch at the end.  The relevant
> comparison is then between the original pre-vectorisation scalar code
> and the code with alias checks, rather than between post-vectorisation
> scalar code and post-vectorisation vector code.  Things might be better
> now though.

Yes, the cost model (aka niter) check should now be before the alias
check, not if-converted, but of course the alias-checking overhead can
still be quite big.

> FTR, I don't object to relaxing the -O2 model.  It was deliberately
> conservative, for a time when enabling vectorisation at -O2 was
> somewhat controversial.  It was also heavily influenced by SVE,
> where variable trip counts are not an issue.

I agree - I think we can try for GCC 15.  Note since we disallow epilogue
vectorization with cheap we might want to prefer smaller vector sizes
which means the target might want to adjust its vector_modes hook.

> The proposal would also make GCC's behaviour more similar to Clang's.

So should we adjust very-cheap to allow niter peeling as proposed or
should we switch the default at -O2 to cheap?

Richard.

> Thanks,
> Richard
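The "one vectorized main loop plus one scalar remainder, no vectorized epilogue" shape under discussion can be sketched in plain C.  This mirrors the x86-64-v3 assembly earlier in the thread (the `andl $-8, %eax` masking and the scalar `.L12` loop); VF = 8 ints per 256-bit vector, and `foo1_lowered` is a made-up name:

```c
#include <assert.h>

/* Scalar sketch of the two-version loop layout the patch produces at
   -O2 for foo1: a main loop over n & -VF iterations (the vectorized
   part) plus a scalar remainder of at most VF-1 iterations.  */
void
foo1_lowered (int *restrict a, int *b, int *c, int n)
{
  const int vf = 8;
  int main_n = n & -vf;		     /* like "andl $-8, %eax" above   */
  for (int i = 0; i < main_n; i += vf)
    for (int j = 0; j < vf; j++)     /* stands in for one vpaddd      */
      a[i + j] = b[i + j] + c[i + j];
  for (int i = main_n; i < n; i++)   /* scalar remainder loop (.L12)  */
    a[i] = b[i] + c[i];
}
```

Only two copies of the loop body exist, which is what bounds the codesize cost relative to the cheap model's additional vectorized epilogue.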
On Thu, Sep 19, 2024 at 2:08 PM Richard Biener
<richard.guenther@gmail.com> wrote:
>
> On Wed, Sep 18, 2024 at 7:55 PM Richard Sandiford
> <richard.sandiford@arm.com> wrote:
> [...]
> > FTR, I don't object to relaxing the -O2 model.  It was deliberately
> > conservative, for a time when enabling vectorisation at -O2 was
> > somewhat controversial.  It was also heavily influenced by SVE,
> > where variable trip counts are not an issue.
>
> I agree - I think we can try for GCC 15.  Note since we disallow epilogue
> vectorization with cheap we might want to prefer smaller vector sizes
> which means the target might want to adjust its vector_modes hook.
>
> > The proposal would also make GCC's behaviour more similar to Clang's.
>
> So should we adjust very-cheap to allow niter peeling as proposed or
> should we switch the default at -O2 to cheap?

Any thoughts from other backend maintainers?

> Richard.
>
> > Thanks,
> > Richard

--
BR,
Hongtao
> > So should we adjust very-cheap to allow niter peeling as proposed or
> > should we switch the default at -O2 to cheap?
>
> Any thoughts from other backend maintainers?

No preference from RISC-V since it is a variable-length vector flavor,
so there is no epilogue for those cases; I mean, it's already
vectorizable on RISC-V with -O2 :P

https://godbolt.org/z/v5z8WxdjT
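The RISC-V (and SVE) situation can be sketched in scalar C: with a predicate or an active vector length (RVV's `vsetvli`), each trip processes min(VF, n - i) elements, so a single vector loop covers any trip count and no separate epilogue or remainder loop is needed.  VF = 8 is an arbitrary stand-in for the hardware vector length, and `foo1_vla_style` is a made-up name:

```c
#include <assert.h>
#include <stddef.h>

/* Strip-mined sketch of a length-agnostic (SVE/RVV-style) vector loop:
   the last trip simply runs with a shorter active length.  */
void
foo1_vla_style (int *restrict a, int *b, int *c, size_t n)
{
  const size_t vf = 8;
  for (size_t i = 0; i < n;)
    {
      size_t vl = (n - i < vf) ? n - i : vf; /* active length this trip */
      for (size_t j = 0; j < vl; j++)	     /* one predicated vector op */
	a[i + j] = b[i + j] + c[i + j];
      i += vl;
    }
}
```

This is why the very-cheap model's restriction on variable trip counts, which exists to avoid peeling and remainder-loop codesize, never bites on these targets.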
diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index 242d5e2d916..06afd8cae79 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -2356,8 +2356,7 @@ vect_analyze_loop_costing (loop_vec_info loop_vinfo,
      a copy of the scalar code (even if we might be able to vectorize it).  */
   if (loop_cost_model (loop) == VECT_COST_MODEL_VERY_CHEAP
       && (LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo)
-	  || LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
-	  || LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo)))
+	  || LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)))
     {
       if (dump_enabled_p ())
	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
@@ -3638,7 +3637,8 @@ vect_analyze_loop (class loop *loop, vec_info_shared *shared)
	 /* No code motion support for multiple epilogues so for now
	    not supported when multiple exits.  */
	 && !LOOP_VINFO_EARLY_BREAKS (first_loop_vinfo)
-	 && !loop->simduid);
+	 && !loop->simduid
+	 && loop_cost_model (loop) > VECT_COST_MODEL_VERY_CHEAP);

   if (!vect_epilogues)
     return first_loop_vinfo;