
[RFC] Enable vectorization for unknown tripcount in very cheap cost model but disable epilogue vectorization.

Message ID 20240911021637.3759883-1-hongtao.liu@intel.com
State New
Series [RFC] Enable vectorization for unknown tripcount in very cheap cost model but disable epilog vectorization.

Commit Message

liuhongt Sept. 11, 2024, 2:16 a.m. UTC
GCC 12 enables vectorization at O2 with the very cheap cost model, which is
restricted to loops with a constant tripcount. The vectorization capability is
very limited, with codesize impact taken into consideration.

The patch extends the very cheap cost model a little bit to support a variable
tripcount, but still disables peeling for gaps/alignment, runtime alias
checking, and epilogue vectorization, again with codesize in mind.

So there are at most two versions of a loop for O2 vectorization: one
vectorized main loop and one scalar/remainder loop.

For example:

void
foo1 (int* __restrict a, int* b, int* c, int n)
{
 for (int i = 0; i != n; i++)
  a[i] = b[i] + c[i];
}

With -O2 -march=x86-64-v3, this is vectorized to:

.L10:
        vmovdqu (%r8,%rax), %ymm0
        vpaddd  (%rsi,%rax), %ymm0, %ymm0
        vmovdqu %ymm0, (%rdi,%rax)
        addq    $32, %rax
        cmpq    %rdx, %rax
        jne     .L10
        movl    %ecx, %eax
        andl    $-8, %eax
        cmpl    %eax, %ecx
        je      .L21
        vzeroupper
.L12:
        movl    (%r8,%rax,4), %edx
        addl    (%rsi,%rax,4), %edx
        movl    %edx, (%rdi,%rax,4)
        addq    $1, %rax
        cmpl    %eax, %ecx
        jne     .L12
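For illustration, the two-loop structure can be sketched in plain C (a
hand-written sketch, not compiler output; VF = 8 matches the YMM case above,
and the function name is made up):

```c
#include <assert.h>

/* Hand-written sketch (not compiler output) of the two-loop structure
   the vectorizer emits: one main loop handling VF elements per
   iteration plus one scalar remainder loop.  */
enum { VF = 8 };

void
foo1_shape (int *restrict a, int *b, int *c, int n)
{
  int i = 0;
  /* Trip count rounded down to a multiple of VF, as done by the
     `andl $-8, %eax` above.  */
  int main_iters = n & -VF;

  /* Vectorized main loop: a real compiler uses one vpaddd per
     iteration; the fixed-count inner loop stands in for one vector op.  */
  for (; i < main_iters; i += VF)
    for (int j = 0; j < VF; j++)
      a[i + j] = b[i + j] + c[i + j];

  /* Scalar remainder loop (.L12 above) for the last n % VF elements.  */
  for (; i < n; i++)
    a[i] = b[i] + c[i];
}
```

The only extra runtime cost over the constant-tripcount case is computing the
rounded-down trip count and the compare that decides whether the remainder
loop runs.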

As measured with SPEC2017 on EMR, the patch (N-Iter) improves performance by 4.11%
with an extra 2.8% codesize, and the cheap cost model improves performance by 5.74%
with an extra 8.88% codesize. The details are below.

Performance measured with -march=x86-64-v3 -O2 on EMR

    	     	    N-Iter	cheap cost model
500.perlbench_r	    -0.12%	-0.12%
502.gcc_r	    0.44%	-0.11%	
505.mcf_r	    0.17%	4.46%
520.omnetpp_r	    0.28%	-0.27%
523.xalancbmk_r	    0.00%	5.93%
525.x264_r	    -0.09%	23.53%
531.deepsjeng_r	    0.19%	0.00%
541.leela_r	    0.22%	0.00%
548.exchange2_r	    -11.54%	-22.34%
557.xz_r	    0.74%	0.49%
GEOMEAN INT	    -1.04%	0.60%

503.bwaves_r	    3.13%	4.72%
507.cactuBSSN_r	    1.17%	0.29%
508.namd_r	    0.39%	6.87%
510.parest_r	    3.14%	8.52%
511.povray_r	    0.10%	-0.20%
519.lbm_r	    -0.68%	10.14%
521.wrf_r	    68.20%	76.73%
526.blender_r	    0.12%	0.12%
527.cam4_r	    19.67%	23.21%
538.imagick_r	    0.12%	0.24%
544.nab_r	    0.63%	0.53%
549.fotonik3d_r	    14.44%	9.43%
554.roms_r	    12.39%	0.00%
GEOMEAN FP	    8.26%	9.41%
GEOMEAN ALL	    4.11%	5.74%

Codesize impact
    	     	    N-Iter	cheap cost model
500.perlbench_r	    0.22%	1.03%
502.gcc_r	    0.25%	0.60%	
505.mcf_r	    0.00%	32.07%
520.omnetpp_r	    0.09%	0.31%
523.xalancbmk_r	    0.08%	1.86%
525.x264_r	    0.75%	7.96%
531.deepsjeng_r	    0.72%	3.28%
541.leela_r	    0.18%	0.75%
548.exchange2_r	    8.29%	12.19%
557.xz_r	    0.40%	0.60%
GEOMEAN INT	    1.07%	5.71%

503.bwaves_r	    12.89%	21.59%
507.cactuBSSN_r	    0.90%	20.19%
508.namd_r	    0.77%	14.75%
510.parest_r	    0.91%	3.91%
511.povray_r	    0.45%	4.08%
519.lbm_r	    0.00%	0.00%
521.wrf_r	    5.97%	12.79%
526.blender_r	    0.49%	3.84%
527.cam4_r	    1.39%	3.28%
538.imagick_r	    1.86%	7.78%
544.nab_r	    0.41%	3.00%
549.fotonik3d_r	    25.50%	47.47%
554.roms_r	    5.17%	13.01%
GEOMEAN FP	    4.14%	11.38%
GEOMEAN ALL	    2.80%	8.88%


The only regression is in 548.exchange2_r: vectorizing the inner loop in each layer
of the 9-level loop nest increases register pressure and causes more spills.
- block(rnext:9, 1, i1) = block(rnext:9, 1, i1) + 10
  - block(rnext:9, 2, i2) = block(rnext:9, 2, i2) + 10
    .....
	- block(rnext:9, 9, i9) = block(rnext:9, 9, i9) + 10
    ...
- block(rnext:9, 2, i2) = block(rnext:9, 2, i2) + 10
- block(rnext:9, 1, i1) = block(rnext:9, 1, i1) + 10

It looks like aarch64 doesn't have the issue because aarch64 has 32 GPRs, but x86
only has 16. I have an extra patch that prevents loop vectorization in deeply
nested loops for the x86 backend, which can bring the performance back.

For 503.bwaves_r/505.mcf_r/507.cactuBSSN_r/508.namd_r, the cheap cost model
increases codesize a lot but doesn't improve performance; N-Iter is much better
there for codesize.


Any comments?


gcc/ChangeLog:

	* tree-vect-loop.cc (vect_analyze_loop_costing): Enable
	vectorization for LOOP_VINFO_PEELING_FOR_NITER in very cheap
	cost model.
	(vect_analyze_loop): Disable epilogue vectorization in very
	cheap cost model.
---
 gcc/tree-vect-loop.cc | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

Comments

Richard Biener Sept. 11, 2024, 8:04 a.m. UTC | #1
On Wed, Sep 11, 2024 at 4:17 AM liuhongt <hongtao.liu@intel.com> wrote:
>
> GCC12 enables vectorization for O2 with very cheap cost model which is restricted
> to constant tripcount. The vectorization capacity is very limited w/ consideration
> of codesize impact.
>
> The patch extends the very cheap cost model a little bit to support variable tripcount.
> But still disable peeling for gaps/alignment, runtime aliasing checking and epilogue
> vectorization with the consideration of codesize.
>
> So there're at most 2 versions of loop for O2 vectorization, one vectorized main loop
> , one scalar/remainder loop.
>
> For example:
>
> void
> foo1 (int* __restrict a, int* b, int* c, int n)
> {
>  for (int i = 0; i != n; i++)
>   a[i] = b[i] + c[i];
> }
>
> with -O2 -march=x86-64-v3, will be vectorized to
>
> .L10:
>         vmovdqu (%r8,%rax), %ymm0
>         vpaddd  (%rsi,%rax), %ymm0, %ymm0
>         vmovdqu %ymm0, (%rdi,%rax)
>         addq    $32, %rax
>         cmpq    %rdx, %rax
>         jne     .L10
>         movl    %ecx, %eax
>         andl    $-8, %eax
>         cmpl    %eax, %ecx
>         je      .L21
>         vzeroupper
> .L12:
>         movl    (%r8,%rax,4), %edx
>         addl    (%rsi,%rax,4), %edx
>         movl    %edx, (%rdi,%rax,4)
>         addq    $1, %rax
>         cmpl    %eax, %ecx
>         jne     .L12
>
> As measured with SPEC2017 on EMR, the patch(N-Iter) improves performance by 4.11%
> with extra 2.8% codesize, and cheap cost model improves performance by 5.74% with
> extra 8.88% codesize. The details are as below

I'm confused by this: are the N-Iter numbers on top of the cheap cost
model numbers?

> Performance measured with -march=x86-64-v3 -O2 on EMR
>
>                     N-Iter      cheap cost model
> 500.perlbench_r     -0.12%      -0.12%
> 502.gcc_r           0.44%       -0.11%
> 505.mcf_r           0.17%       4.46%
> 520.omnetpp_r       0.28%       -0.27%
> 523.xalancbmk_r     0.00%       5.93%
> 525.x264_r          -0.09%      23.53%
> 531.deepsjeng_r     0.19%       0.00%
> 541.leela_r         0.22%       0.00%
> 548.exchange2_r     -11.54%     -22.34%
> 557.xz_r            0.74%       0.49%
> GEOMEAN INT         -1.04%      0.60%
>
> 503.bwaves_r        3.13%       4.72%
> 507.cactuBSSN_r     1.17%       0.29%
> 508.namd_r          0.39%       6.87%
> 510.parest_r        3.14%       8.52%
> 511.povray_r        0.10%       -0.20%
> 519.lbm_r           -0.68%      10.14%
> 521.wrf_r           68.20%      76.73%

So this seems to regress as well?

> 526.blender_r       0.12%       0.12%
> 527.cam4_r          19.67%      23.21%
> 538.imagick_r       0.12%       0.24%
> 544.nab_r           0.63%       0.53%
> 549.fotonik3d_r     14.44%      9.43%
> 554.roms_r          12.39%      0.00%
> GEOMEAN FP          8.26%       9.41%
> GEOMEAN ALL         4.11%       5.74%
>
> Codesize impact
>                     N-Iter      cheap cost model
> 500.perlbench_r     0.22%       1.03%
> 502.gcc_r           0.25%       0.60%
> 505.mcf_r           0.00%       32.07%
> 520.omnetpp_r       0.09%       0.31%
> 523.xalancbmk_r     0.08%       1.86%
> 525.x264_r          0.75%       7.96%
> 531.deepsjeng_r     0.72%       3.28%
> 541.leela_r         0.18%       0.75%
> 548.exchange2_r     8.29%       12.19%
> 557.xz_r            0.40%       0.60%
> GEOMEAN INT         1.07%       5.71%
>
> 503.bwaves_r        12.89%      21.59%
> 507.cactuBSSN_r     0.90%       20.19%
> 508.namd_r          0.77%       14.75%
> 510.parest_r        0.91%       3.91%
> 511.povray_r        0.45%       4.08%
> 519.lbm_r           0.00%       0.00%
> 521.wrf_r           5.97%       12.79%
> 526.blender_r       0.49%       3.84%
> 527.cam4_r          1.39%       3.28%
> 538.imagick_r       1.86%       7.78%
> 544.nab_r           0.41%       3.00%
> 549.fotonik3d_r     25.50%      47.47%
> 554.roms_r          5.17%       13.01%
> GEOMEAN FP          4.14%       11.38%
> GEOMEAN ALL         2.80%       8.88%
>
>
> The only regression is from 548.exchange2_r, the vectorization for inner loop in each layer
> of the 9-layer loops increases register pressure and causes more spill.
> - block(rnext:9, 1, i1) = block(rnext:9, 1, i1) + 10
>   - block(rnext:9, 2, i2) = block(rnext:9, 2, i2) + 10
>     .....
>         - block(rnext:9, 9, i9) = block(rnext:9, 9, i9) + 10
>     ...
> - block(rnext:9, 2, i2) = block(rnext:9, 2, i2) + 10
> - block(rnext:9, 1, i1) = block(rnext:9, 1, i1) + 10
>
> Looks like aarch64 doesn't have the issue because aarch64 has 32 gprs, but x86 only has 16.
> I have an extra patch to prevent loop vectorization in deep-depth loop for x86 backend which can
> bring the performance back.
>
> For 503.bwaves_r/505.mcf_r/507.cactuBSSN_r/508.namd_r, cheap cost model increases codesize
> a lot but don't improve any performance. And N-iter is much better for that for codesize.
>
>
> Any comments?
>
>
> gcc/ChangeLog:
>
>         * tree-vect-loop.cc (vect_analyze_loop_costing): Enable
>         vectorization for LOOP_VINFO_PEELING_FOR_NITER in very cheap
>         cost model.
>         (vect_analyze_loop): Disable epilogue vectorization in very
>         cheap cost model.
> ---
>  gcc/tree-vect-loop.cc | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
> index 242d5e2d916..06afd8cae79 100644
> --- a/gcc/tree-vect-loop.cc
> +++ b/gcc/tree-vect-loop.cc
> @@ -2356,8 +2356,7 @@ vect_analyze_loop_costing (loop_vec_info loop_vinfo,
>       a copy of the scalar code (even if we might be able to vectorize it).  */
>    if (loop_cost_model (loop) == VECT_COST_MODEL_VERY_CHEAP
>        && (LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo)
> -         || LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
> -         || LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo)))
> +         || LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)))

I notice that we should probably not call
vect_enhance_data_refs_alignment because
when alignment peeling is optional we should avoid it rather than disabling the
vectorization completely.

Also if you allow peeling for niter then there's no good reason to not
allow peeling
for gaps (or any other epilogue peeling).

The extra cost for niter peeling is a runtime check before the loop
which would also
happen (plus keeping the scalar copy) when there's a runtime cost check.  That
also means versioning for alias/alignment could be allowed if it
shares the scalar
loop with the epilogue (I don't remember the constraints we set in place for the
sharing).

Richard.

>      {
>        if (dump_enabled_p ())
>         dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> @@ -3638,7 +3637,8 @@ vect_analyze_loop (class loop *loop, vec_info_shared *shared)
>                            /* No code motion support for multiple epilogues so for now
>                               not supported when multiple exits.  */
>                          && !LOOP_VINFO_EARLY_BREAKS (first_loop_vinfo)
> -                        && !loop->simduid);
> +                        && !loop->simduid
> +                        && loop_cost_model (loop) > VECT_COST_MODEL_VERY_CHEAP);
>    if (!vect_epilogues)
>      return first_loop_vinfo;
>
> --
> 2.31.1
>
Hongtao Liu Sept. 11, 2024, 8:21 a.m. UTC | #2
On Wed, Sep 11, 2024 at 4:04 PM Richard Biener
<richard.guenther@gmail.com> wrote:
>
> On Wed, Sep 11, 2024 at 4:17 AM liuhongt <hongtao.liu@intel.com> wrote:
> >
> > GCC12 enables vectorization for O2 with very cheap cost model which is restricted
> > to constant tripcount. The vectorization capacity is very limited w/ consideration
> > of codesize impact.
> >
> > The patch extends the very cheap cost model a little bit to support variable tripcount.
> > But still disable peeling for gaps/alignment, runtime aliasing checking and epilogue
> > vectorization with the consideration of codesize.
> >
> > So there're at most 2 versions of loop for O2 vectorization, one vectorized main loop
> > , one scalar/remainder loop.
> >
> > For example:
> >
> > void
> > foo1 (int* __restrict a, int* b, int* c, int n)
> > {
> >  for (int i = 0; i != n; i++)
> >   a[i] = b[i] + c[i];
> > }
> >
> > with -O2 -march=x86-64-v3, will be vectorized to
> >
> > .L10:
> >         vmovdqu (%r8,%rax), %ymm0
> >         vpaddd  (%rsi,%rax), %ymm0, %ymm0
> >         vmovdqu %ymm0, (%rdi,%rax)
> >         addq    $32, %rax
> >         cmpq    %rdx, %rax
> >         jne     .L10
> >         movl    %ecx, %eax
> >         andl    $-8, %eax
> >         cmpl    %eax, %ecx
> >         je      .L21
> >         vzeroupper
> > .L12:
> >         movl    (%r8,%rax,4), %edx
> >         addl    (%rsi,%rax,4), %edx
> >         movl    %edx, (%rdi,%rax,4)
> >         addq    $1, %rax
> >         cmpl    %eax, %ecx
> >         jne     .L12
> >
> > As measured with SPEC2017 on EMR, the patch(N-Iter) improves performance by 4.11%
> > with extra 2.8% codesize, and cheap cost model improves performance by 5.74% with
> > extra 8.88% codesize. The details are as below
>
> I'm confused by this: are the N-Iter numbers on top of the cheap cost
> model numbers?
No, it's N-Iter vs base (very cheap cost model), and cheap vs base.
>
> > Performance measured with -march=x86-64-v3 -O2 on EMR
> >
> >                     N-Iter      cheap cost model
> > 500.perlbench_r     -0.12%      -0.12%
> > 502.gcc_r           0.44%       -0.11%
> > 505.mcf_r           0.17%       4.46%
> > 520.omnetpp_r       0.28%       -0.27%
> > 523.xalancbmk_r     0.00%       5.93%
> > 525.x264_r          -0.09%      23.53%
> > 531.deepsjeng_r     0.19%       0.00%
> > 541.leela_r         0.22%       0.00%
> > 548.exchange2_r     -11.54%     -22.34%
> > 557.xz_r            0.74%       0.49%
> > GEOMEAN INT         -1.04%      0.60%
> >
> > 503.bwaves_r        3.13%       4.72%
> > 507.cactuBSSN_r     1.17%       0.29%
> > 508.namd_r          0.39%       6.87%
> > 510.parest_r        3.14%       8.52%
> > 511.povray_r        0.10%       -0.20%
> > 519.lbm_r           -0.68%      10.14%
> > 521.wrf_r           68.20%      76.73%
>
> So this seems to regress as well?
N-Iter increases performance less than the cheap cost model; that's
expected, it is not a regression.
>
> > 526.blender_r       0.12%       0.12%
> > 527.cam4_r          19.67%      23.21%
> > 538.imagick_r       0.12%       0.24%
> > 544.nab_r           0.63%       0.53%
> > 549.fotonik3d_r     14.44%      9.43%
> > 554.roms_r          12.39%      0.00%
> > GEOMEAN FP          8.26%       9.41%
> > GEOMEAN ALL         4.11%       5.74%
> >
> > Codesize impact
> >                     N-Iter      cheap cost model
> > 500.perlbench_r     0.22%       1.03%
> > 502.gcc_r           0.25%       0.60%
> > 505.mcf_r           0.00%       32.07%
> > 520.omnetpp_r       0.09%       0.31%
> > 523.xalancbmk_r     0.08%       1.86%
> > 525.x264_r          0.75%       7.96%
> > 531.deepsjeng_r     0.72%       3.28%
> > 541.leela_r         0.18%       0.75%
> > 548.exchange2_r     8.29%       12.19%
> > 557.xz_r            0.40%       0.60%
> > GEOMEAN INT         1.07%       5.71%
> >
> > 503.bwaves_r        12.89%      21.59%
> > 507.cactuBSSN_r     0.90%       20.19%
> > 508.namd_r          0.77%       14.75%
> > 510.parest_r        0.91%       3.91%
> > 511.povray_r        0.45%       4.08%
> > 519.lbm_r           0.00%       0.00%
> > 521.wrf_r           5.97%       12.79%
> > 526.blender_r       0.49%       3.84%
> > 527.cam4_r          1.39%       3.28%
> > 538.imagick_r       1.86%       7.78%
> > 544.nab_r           0.41%       3.00%
> > 549.fotonik3d_r     25.50%      47.47%
> > 554.roms_r          5.17%       13.01%
> > GEOMEAN FP          4.14%       11.38%
> > GEOMEAN ALL         2.80%       8.88%
> >
> >
> > The only regression is from 548.exchange2_r, the vectorization for inner loop in each layer
> > of the 9-layer loops increases register pressure and causes more spill.
> > - block(rnext:9, 1, i1) = block(rnext:9, 1, i1) + 10
> >   - block(rnext:9, 2, i2) = block(rnext:9, 2, i2) + 10
> >     .....
> >         - block(rnext:9, 9, i9) = block(rnext:9, 9, i9) + 10
> >     ...
> > - block(rnext:9, 2, i2) = block(rnext:9, 2, i2) + 10
> > - block(rnext:9, 1, i1) = block(rnext:9, 1, i1) + 10
> >
> > Looks like aarch64 doesn't have the issue because aarch64 has 32 gprs, but x86 only has 16.
> > I have an extra patch to prevent loop vectorization in deep-depth loop for x86 backend which can
> > bring the performance back.
> >
> > For 503.bwaves_r/505.mcf_r/507.cactuBSSN_r/508.namd_r, cheap cost model increases codesize
> > a lot but don't improve any performance. And N-iter is much better for that for codesize.
> >
> >
> > Any comments?
> >
> >
> > gcc/ChangeLog:
> >
> >         * tree-vect-loop.cc (vect_analyze_loop_costing): Enable
> >         vectorization for LOOP_VINFO_PEELING_FOR_NITER in very cheap
> >         cost model.
> >         (vect_analyze_loop): Disable epilogue vectorization in very
> >         cheap cost model.
> > ---
> >  gcc/tree-vect-loop.cc | 6 +++---
> >  1 file changed, 3 insertions(+), 3 deletions(-)
> >
> > diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
> > index 242d5e2d916..06afd8cae79 100644
> > --- a/gcc/tree-vect-loop.cc
> > +++ b/gcc/tree-vect-loop.cc
> > @@ -2356,8 +2356,7 @@ vect_analyze_loop_costing (loop_vec_info loop_vinfo,
> >       a copy of the scalar code (even if we might be able to vectorize it).  */
> >    if (loop_cost_model (loop) == VECT_COST_MODEL_VERY_CHEAP
> >        && (LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo)
> > -         || LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
> > -         || LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo)))
> > +         || LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)))
>
> I notice that we should probably not call
> vect_enhance_data_refs_alignment because
> when alignment peeling is optional we should avoid it rather than disabling the
> vectorization completely.
>
> Also if you allow peeling for niter then there's no good reason to not
> allow peeling
> for gaps (or any other epilogue peeling).
Maybe, I just want to be conservative.
>
> The extra cost for niter peeling is a runtime check before the loop
> which would also
> happen (plus keeping the scalar copy) when there's a runtime cost check.  That
> also means versioning for alias/alignment could be allowed if it
> shares the scalar
> loop with the epilogue (I don't remember the constraints we set in place for the
> sharing).
Yes, but in current GCC a runtime alias check creates a separate scalar loop:
https://godbolt.org/z/9seoWePKK
Enabling the runtime alias check could increase codesize too much
without any performance improvement.
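To make the point concrete, the versioning shape can be sketched in C roughly
as follows (illustrative only; the names and the exact overlap test are not
GCC internals, and the pointer comparisons are just a model of the check the
compiler generates internally):

```c
#include <assert.h>

enum { VF = 8 };

/* Rough model of loop versioning for a runtime alias check: the
   no-overlap guard selects between the vectorized path and a full
   scalar copy of the loop.  That scalar copy exists in addition to
   the vector path's remainder loop, which is where the extra
   codesize comes from.  */
void
add_versioned (int *a, int *b, int *c, int n)
{
  /* Runtime alias check: the vector path is only safe when neither
     input overlaps the first n elements of the output.  */
  int no_overlap = (b + n <= a || a + n <= b)
                   && (c + n <= a || a + n <= c);
  if (no_overlap)
    {
      int i = 0;
      for (; i + VF <= n; i += VF)   /* vectorized main loop */
        for (int j = 0; j < VF; j++)
          a[i + j] = b[i + j] + c[i + j];
      for (; i < n; i++)             /* remainder of the vector path */
        a[i] = b[i] + c[i];
    }
  else
    {
      /* Separate scalar version of the whole loop.  */
      for (int i = 0; i < n; i++)
        a[i] = b[i] + c[i];
    }
}
```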

>
> Richard.
>
> >      {
> >        if (dump_enabled_p ())
> >         dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> > @@ -3638,7 +3637,8 @@ vect_analyze_loop (class loop *loop, vec_info_shared *shared)
> >                            /* No code motion support for multiple epilogues so for now
> >                               not supported when multiple exits.  */
> >                          && !LOOP_VINFO_EARLY_BREAKS (first_loop_vinfo)
> > -                        && !loop->simduid);
> > +                        && !loop->simduid
> > +                        && loop_cost_model (loop) > VECT_COST_MODEL_VERY_CHEAP);
> >    if (!vect_epilogues)
> >      return first_loop_vinfo;
> >
> > --
> > 2.31.1
> >
Hongtao Liu Sept. 12, 2024, 2:50 p.m. UTC | #3
On Wed, Sep 11, 2024 at 4:21 PM Hongtao Liu <crazylht@gmail.com> wrote:
>
> On Wed, Sep 11, 2024 at 4:04 PM Richard Biener
> <richard.guenther@gmail.com> wrote:
> >
> > On Wed, Sep 11, 2024 at 4:17 AM liuhongt <hongtao.liu@intel.com> wrote:
> > >
> > > GCC12 enables vectorization for O2 with very cheap cost model which is restricted
> > > to constant tripcount. The vectorization capacity is very limited w/ consideration
> > > of codesize impact.
> > >
> > > The patch extends the very cheap cost model a little bit to support variable tripcount.
> > > But still disable peeling for gaps/alignment, runtime aliasing checking and epilogue
> > > vectorization with the consideration of codesize.
> > >
> > > So there're at most 2 versions of loop for O2 vectorization, one vectorized main loop
> > > , one scalar/remainder loop.
> > >
> > > For example:
> > >
> > > void
> > > foo1 (int* __restrict a, int* b, int* c, int n)
> > > {
> > >  for (int i = 0; i != n; i++)
> > >   a[i] = b[i] + c[i];
> > > }
> > >
> > > with -O2 -march=x86-64-v3, will be vectorized to
> > >
> > > .L10:
> > >         vmovdqu (%r8,%rax), %ymm0
> > >         vpaddd  (%rsi,%rax), %ymm0, %ymm0
> > >         vmovdqu %ymm0, (%rdi,%rax)
> > >         addq    $32, %rax
> > >         cmpq    %rdx, %rax
> > >         jne     .L10
> > >         movl    %ecx, %eax
> > >         andl    $-8, %eax
> > >         cmpl    %eax, %ecx
> > >         je      .L21
> > >         vzeroupper
> > > .L12:
> > >         movl    (%r8,%rax,4), %edx
> > >         addl    (%rsi,%rax,4), %edx
> > >         movl    %edx, (%rdi,%rax,4)
> > >         addq    $1, %rax
> > >         cmpl    %eax, %ecx
> > >         jne     .L12
> > >
> > > As measured with SPEC2017 on EMR, the patch(N-Iter) improves performance by 4.11%
> > > with extra 2.8% codesize, and cheap cost model improves performance by 5.74% with
> > > extra 8.88% codesize. The details are as below
> >
> > I'm confused by this: are the N-Iter numbers on top of the cheap cost
> > model numbers?
> No, it's N-Iter vs base (very cheap cost model), and cheap vs base.
> >
> > > Performance measured with -march=x86-64-v3 -O2 on EMR
> > >
> > >                     N-Iter      cheap cost model
> > > 500.perlbench_r     -0.12%      -0.12%
> > > 502.gcc_r           0.44%       -0.11%
> > > 505.mcf_r           0.17%       4.46%
> > > 520.omnetpp_r       0.28%       -0.27%
> > > 523.xalancbmk_r     0.00%       5.93%
> > > 525.x264_r          -0.09%      23.53%
> > > 531.deepsjeng_r     0.19%       0.00%
> > > 541.leela_r         0.22%       0.00%
> > > 548.exchange2_r     -11.54%     -22.34%
> > > 557.xz_r            0.74%       0.49%
> > > GEOMEAN INT         -1.04%      0.60%
> > >
> > > 503.bwaves_r        3.13%       4.72%
> > > 507.cactuBSSN_r     1.17%       0.29%
> > > 508.namd_r          0.39%       6.87%
> > > 510.parest_r        3.14%       8.52%
> > > 511.povray_r        0.10%       -0.20%
> > > 519.lbm_r           -0.68%      10.14%
> > > 521.wrf_r           68.20%      76.73%
> >
> > So this seems to regress as well?
> N-Iter increases performance less than the cheap cost model; that's
> expected, it is not a regression.
> >
> > > 526.blender_r       0.12%       0.12%
> > > 527.cam4_r          19.67%      23.21%
> > > 538.imagick_r       0.12%       0.24%
> > > 544.nab_r           0.63%       0.53%
> > > 549.fotonik3d_r     14.44%      9.43%
> > > 554.roms_r          12.39%      0.00%
> > > GEOMEAN FP          8.26%       9.41%
> > > GEOMEAN ALL         4.11%       5.74%

I've tested the patch on aarch64; it shows a similar improvement with
little codesize increase.
I haven't tested it on other backends, but I think it would show
similarly good improvements.
> > >
> > > Codesize impact
> > >                     N-Iter      cheap cost model
> > > 500.perlbench_r     0.22%       1.03%
> > > 502.gcc_r           0.25%       0.60%
> > > 505.mcf_r           0.00%       32.07%
> > > 520.omnetpp_r       0.09%       0.31%
> > > 523.xalancbmk_r     0.08%       1.86%
> > > 525.x264_r          0.75%       7.96%
> > > 531.deepsjeng_r     0.72%       3.28%
> > > 541.leela_r         0.18%       0.75%
> > > 548.exchange2_r     8.29%       12.19%
> > > 557.xz_r            0.40%       0.60%
> > > GEOMEAN INT         1.07%       5.71%
> > >
> > > 503.bwaves_r        12.89%      21.59%
> > > 507.cactuBSSN_r     0.90%       20.19%
> > > 508.namd_r          0.77%       14.75%
> > > 510.parest_r        0.91%       3.91%
> > > 511.povray_r        0.45%       4.08%
> > > 519.lbm_r           0.00%       0.00%
> > > 521.wrf_r           5.97%       12.79%
> > > 526.blender_r       0.49%       3.84%
> > > 527.cam4_r          1.39%       3.28%
> > > 538.imagick_r       1.86%       7.78%
> > > 544.nab_r           0.41%       3.00%
> > > 549.fotonik3d_r     25.50%      47.47%
> > > 554.roms_r          5.17%       13.01%
> > > GEOMEAN FP          4.14%       11.38%
> > > GEOMEAN ALL         2.80%       8.88%
> > >
> > >
> > > The only regression is from 548.exchange2_r, the vectorization for inner loop in each layer
> > > of the 9-layer loops increases register pressure and causes more spill.
> > > - block(rnext:9, 1, i1) = block(rnext:9, 1, i1) + 10
> > >   - block(rnext:9, 2, i2) = block(rnext:9, 2, i2) + 10
> > >     .....
> > >         - block(rnext:9, 9, i9) = block(rnext:9, 9, i9) + 10
> > >     ...
> > > - block(rnext:9, 2, i2) = block(rnext:9, 2, i2) + 10
> > > - block(rnext:9, 1, i1) = block(rnext:9, 1, i1) + 10
> > >
> > > Looks like aarch64 doesn't have the issue because aarch64 has 32 gprs, but x86 only has 16.
> > > I have an extra patch to prevent loop vectorization in deep-depth loop for x86 backend which can
> > > bring the performance back.
> > >
> > > For 503.bwaves_r/505.mcf_r/507.cactuBSSN_r/508.namd_r, cheap cost model increases codesize
> > > a lot but don't improve any performance. And N-iter is much better for that for codesize.
> > >
> > >
> > > Any comments?
> > >
> > >
> > > gcc/ChangeLog:
> > >
> > >         * tree-vect-loop.cc (vect_analyze_loop_costing): Enable
> > >         vectorization for LOOP_VINFO_PEELING_FOR_NITER in very cheap
> > >         cost model.
> > >         (vect_analyze_loop): Disable epilogue vectorization in very
> > >         cheap cost model.
> > > ---
> > >  gcc/tree-vect-loop.cc | 6 +++---
> > >  1 file changed, 3 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
> > > index 242d5e2d916..06afd8cae79 100644
> > > --- a/gcc/tree-vect-loop.cc
> > > +++ b/gcc/tree-vect-loop.cc
> > > @@ -2356,8 +2356,7 @@ vect_analyze_loop_costing (loop_vec_info loop_vinfo,
> > >       a copy of the scalar code (even if we might be able to vectorize it).  */
> > >    if (loop_cost_model (loop) == VECT_COST_MODEL_VERY_CHEAP
> > >        && (LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo)
> > > -         || LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
> > > -         || LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo)))
> > > +         || LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)))
> >
> > I notice that we should probably not call
> > vect_enhance_data_refs_alignment because
> > when alignment peeling is optional we should avoid it rather than disabling the
> > vectorization completely.
> >
> > Also if you allow peeling for niter then there's no good reason to not
> > allow peeling
> > for gaps (or any other epilogue peeling).
> Maybe, I just want to be conservative.
> >
> > The extra cost for niter peeling is a runtime check before the loop
> > which would also
> > happen (plus keeping the scalar copy) when there's a runtime cost check.  That
> > also means versioning for alias/alignment could be allowed if it
> > shares the scalar
> > loop with the epilogue (I don't remember the constraints we set in place for the
> > sharing).
> Yes, but for current GCC, alias run-time check creates a separate scalar loop
> https://godbolt.org/z/9seoWePKK
> And enabling alias runtime check could increase too much codesize but
> w/o any performance improvement.
>
> >
> > Richard.
> >
> > >      {
> > >        if (dump_enabled_p ())
> > >         dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> > > @@ -3638,7 +3637,8 @@ vect_analyze_loop (class loop *loop, vec_info_shared *shared)
> > >                            /* No code motion support for multiple epilogues so for now
> > >                               not supported when multiple exits.  */
> > >                          && !LOOP_VINFO_EARLY_BREAKS (first_loop_vinfo)
> > > -                        && !loop->simduid);
> > > +                        && !loop->simduid
> > > +                        && loop_cost_model (loop) > VECT_COST_MODEL_VERY_CHEAP);
> > >    if (!vect_epilogues)
> > >      return first_loop_vinfo;
> > >
> > > --
> > > 2.31.1
> > >
>
>
>
> --
> BR,
> Hongtao



--
BR,
Hongtao
Richard Biener Sept. 13, 2024, 11:12 a.m. UTC | #4
On Thu, Sep 12, 2024 at 4:50 PM Hongtao Liu <crazylht@gmail.com> wrote:
>
> On Wed, Sep 11, 2024 at 4:21 PM Hongtao Liu <crazylht@gmail.com> wrote:
> >
> > On Wed, Sep 11, 2024 at 4:04 PM Richard Biener
> > <richard.guenther@gmail.com> wrote:
> > >
> > > On Wed, Sep 11, 2024 at 4:17 AM liuhongt <hongtao.liu@intel.com> wrote:
> > > >
> > > > GCC12 enables vectorization for O2 with very cheap cost model which is restricted
> > > > to constant tripcount. The vectorization capacity is very limited w/ consideration
> > > > of codesize impact.
> > > >
> > > > The patch extends the very cheap cost model a little bit to support variable tripcount.
> > > > But still disable peeling for gaps/alignment, runtime aliasing checking and epilogue
> > > > vectorization with the consideration of codesize.
> > > >
> > > > So there're at most 2 versions of loop for O2 vectorization, one vectorized main loop
> > > > , one scalar/remainder loop.
> > > >
> > > > For example:
> > > >
> > > > void
> > > > foo1 (int* __restrict a, int* b, int* c, int n)
> > > > {
> > > >  for (int i = 0; i != n; i++)
> > > >   a[i] = b[i] + c[i];
> > > > }
> > > >
> > > > with -O2 -march=x86-64-v3, will be vectorized to
> > > >
> > > > .L10:
> > > >         vmovdqu (%r8,%rax), %ymm0
> > > >         vpaddd  (%rsi,%rax), %ymm0, %ymm0
> > > >         vmovdqu %ymm0, (%rdi,%rax)
> > > >         addq    $32, %rax
> > > >         cmpq    %rdx, %rax
> > > >         jne     .L10
> > > >         movl    %ecx, %eax
> > > >         andl    $-8, %eax
> > > >         cmpl    %eax, %ecx
> > > >         je      .L21
> > > >         vzeroupper
> > > > .L12:
> > > >         movl    (%r8,%rax,4), %edx
> > > >         addl    (%rsi,%rax,4), %edx
> > > >         movl    %edx, (%rdi,%rax,4)
> > > >         addq    $1, %rax
> > > >         cmpl    %eax, %ecx
> > > >         jne     .L12
> > > >
> > > > As measured with SPEC2017 on EMR, the patch(N-Iter) improves performance by 4.11%
> > > > with extra 2.8% codesize, and cheap cost model improves performance by 5.74% with
> > > > extra 8.88% codesize. The details are as below
> > >
> > > I'm confused by this: are the N-Iter numbers on top of the cheap cost
> > > model numbers?
> > No, it's N-Iter vs base (very cheap cost model), and cheap vs base.
> > >
> > > > Performance measured with -march=x86-64-v3 -O2 on EMR
> > > >
> > > >                     N-Iter      cheap cost model
> > > > 500.perlbench_r     -0.12%      -0.12%
> > > > 502.gcc_r           0.44%       -0.11%
> > > > 505.mcf_r           0.17%       4.46%
> > > > 520.omnetpp_r       0.28%       -0.27%
> > > > 523.xalancbmk_r     0.00%       5.93%
> > > > 525.x264_r          -0.09%      23.53%
> > > > 531.deepsjeng_r     0.19%       0.00%
> > > > 541.leela_r         0.22%       0.00%
> > > > 548.exchange2_r     -11.54%     -22.34%
> > > > 557.xz_r            0.74%       0.49%
> > > > GEOMEAN INT         -1.04%      0.60%
> > > >
> > > > 503.bwaves_r        3.13%       4.72%
> > > > 507.cactuBSSN_r     1.17%       0.29%
> > > > 508.namd_r          0.39%       6.87%
> > > > 510.parest_r        3.14%       8.52%
> > > > 511.povray_r        0.10%       -0.20%
> > > > 519.lbm_r           -0.68%      10.14%
> > > > 521.wrf_r           68.20%      76.73%
> > >
> > > So this seems to regress as well?
> > N-Iter improves performance less than the cheap cost model; that's
> > expected, not a regression.
> > >
> > > > 526.blender_r       0.12%       0.12%
> > > > 527.cam4_r          19.67%      23.21%
> > > > 538.imagick_r       0.12%       0.24%
> > > > 544.nab_r           0.63%       0.53%
> > > > 549.fotonik3d_r     14.44%      9.43%
> > > > 554.roms_r          12.39%      0.00%
> > > > GEOMEAN FP          8.26%       9.41%
> > > > GEOMEAN ALL         4.11%       5.74%
>
> I've tested the patch on aarch64; it shows a similar improvement with
> little codesize increase.  I haven't tested it on other backends, but
> I think it would show similarly good improvements.

I think overall this is expected since a constant niter divisible by
the VF isn't a common situation.  So the question is mostly whether
we want to pay the size penalty or not.

Looking only at docs the proposed change would make the very-cheap
cost model nearly(?) equivalent to the cheap one so maybe the answer
is to default to cheap rather than very-cheap?  One difference seems to
be that cheap allows alias versioning.

Richard.
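For comparison, the loop without `__restrict` is where cheap and very-cheap diverge: vectorizing it needs a runtime overlap test. Below is a minimal hand-written sketch of that test (`foo2` and `can_vectorize` are made-up names, and GCC's actual versioning condition differs in detail):

```c
/* foo1 without __restrict: the compiler cannot prove a, b and c do not
   overlap, so the cheap model versions the loop on a runtime check
   while the very-cheap model gives up.  */
void
foo2 (int *a, int *b, int *c, int n)
{
  for (int i = 0; i != n; i++)
    a[i] = b[i] + c[i];
}

/* Hand-written sketch of the overlap test: the vector version may run
   only if neither source range overlaps the destination range.  */
int
can_vectorize (int *a, int *b, int *c, int n)
{
  return (b + n <= a || a + n <= b)
         && (c + n <= a || a + n <= c);
}
```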

> > > >
> > > > Code size impact
> > > >                     N-Iter      cheap cost model
> > > > 500.perlbench_r     0.22%       1.03%
> > > > 502.gcc_r           0.25%       0.60%
> > > > 505.mcf_r           0.00%       32.07%
> > > > 520.omnetpp_r       0.09%       0.31%
> > > > 523.xalancbmk_r     0.08%       1.86%
> > > > 525.x264_r          0.75%       7.96%
> > > > 531.deepsjeng_r     0.72%       3.28%
> > > > 541.leela_r         0.18%       0.75%
> > > > 548.exchange2_r     8.29%       12.19%
> > > > 557.xz_r            0.40%       0.60%
> > > > GEOMEAN INT         1.07%       5.71%
> > > >
> > > > 503.bwaves_r        12.89%      21.59%
> > > > 507.cactuBSSN_r     0.90%       20.19%
> > > > 508.namd_r          0.77%       14.75%
> > > > 510.parest_r        0.91%       3.91%
> > > > 511.povray_r        0.45%       4.08%
> > > > 519.lbm_r           0.00%       0.00%
> > > > 521.wrf_r           5.97%       12.79%
> > > > 526.blender_r       0.49%       3.84%
> > > > 527.cam4_r          1.39%       3.28%
> > > > 538.imagick_r       1.86%       7.78%
> > > > 544.nab_r           0.41%       3.00%
> > > > 549.fotonik3d_r     25.50%      47.47%
> > > > 554.roms_r          5.17%       13.01%
> > > > GEOMEAN FP          4.14%       11.38%
> > > > GEOMEAN ALL         2.80%       8.88%
> > > >
> > > >
> > > > The only regression is from 548.exchange2_r: vectorizing the inner loop in
> > > > each layer of the 9-layer loop nest increases register pressure and causes
> > > > more spills.
> > > > - block(rnext:9, 1, i1) = block(rnext:9, 1, i1) + 10
> > > >   - block(rnext:9, 2, i2) = block(rnext:9, 2, i2) + 10
> > > >     .....
> > > >         - block(rnext:9, 9, i9) = block(rnext:9, 9, i9) + 10
> > > >     ...
> > > > - block(rnext:9, 2, i2) = block(rnext:9, 2, i2) + 10
> > > > - block(rnext:9, 1, i1) = block(rnext:9, 1, i1) + 10
> > > >
> > > > Looks like aarch64 doesn't have the issue because aarch64 has 32 GPRs, but x86
> > > > only has 16.  I have an extra patch to prevent loop vectorization in deeply
> > > > nested loops for the x86 backend, which can bring the performance back.
> > > >
> > > > For 503.bwaves_r/505.mcf_r/507.cactuBSSN_r/508.namd_r, the cheap cost model
> > > > increases codesize a lot but doesn't improve performance; N-Iter is much
> > > > better there for codesize.
> > > >
> > > >
> > > > Any comments?
> > > >
> > > >
> > > > gcc/ChangeLog:
> > > >
> > > >         * tree-vect-loop.cc (vect_analyze_loop_costing): Enable
> > > >         vectorization for LOOP_VINFO_PEELING_FOR_NITER in very cheap
> > > >         cost model.
> > > >         (vect_analyze_loop): Disable epilogue vectorization in very
> > > >         cheap cost model.
> > > > ---
> > > >  gcc/tree-vect-loop.cc | 6 +++---
> > > >  1 file changed, 3 insertions(+), 3 deletions(-)
> > > >
> > > > diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
> > > > index 242d5e2d916..06afd8cae79 100644
> > > > --- a/gcc/tree-vect-loop.cc
> > > > +++ b/gcc/tree-vect-loop.cc
> > > > @@ -2356,8 +2356,7 @@ vect_analyze_loop_costing (loop_vec_info loop_vinfo,
> > > >       a copy of the scalar code (even if we might be able to vectorize it).  */
> > > >    if (loop_cost_model (loop) == VECT_COST_MODEL_VERY_CHEAP
> > > >        && (LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo)
> > > > -         || LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
> > > > -         || LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo)))
> > > > +         || LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)))
> > >
> > > I notice that we should probably not call
> > > vect_enhance_data_refs_alignment because
> > > when alignment peeling is optional we should avoid it rather than disabling the
> > > vectorization completely.
> > >
> > > Also if you allow peeling for niter then there's no good reason to not
> > > allow peeling
> > > for gaps (or any other epilogue peeling).
> > Maybe, I just want to be conservative.
> > >
> > > The extra cost for niter peeling is a runtime check before the loop
> > > which would also
> > > happen (plus keeping the scalar copy) when there's a runtime cost check.  That
> > > also means versioning for alias/alignment could be allowed if it
> > > shares the scalar
> > > loop with the epilogue (I don't remember the constraints we set in place for the
> > > sharing).
> > Yes, but for current GCC the runtime alias check creates a separate scalar loop:
> > https://godbolt.org/z/9seoWePKK
> > Enabling the runtime alias check could increase codesize too much without
> > any performance improvement.
> >
> > >
> > > Richard.
> > >
> > > >      {
> > > >        if (dump_enabled_p ())
> > > >         dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> > > > @@ -3638,7 +3637,8 @@ vect_analyze_loop (class loop *loop, vec_info_shared *shared)
> > > >                            /* No code motion support for multiple epilogues so for now
> > > >                               not supported when multiple exits.  */
> > > >                          && !LOOP_VINFO_EARLY_BREAKS (first_loop_vinfo)
> > > > -                        && !loop->simduid);
> > > > +                        && !loop->simduid
> > > > +                        && loop_cost_model (loop) > VECT_COST_MODEL_VERY_CHEAP);
> > > >    if (!vect_epilogues)
> > > >      return first_loop_vinfo;
> > > >
> > > > --
> > > > 2.31.1
> > > >
> >
> >
> >
> > --
> > BR,
> > Hongtao
>
>
>
> --
> BR,
> Hongtao
Richard Sandiford Sept. 18, 2024, 5:55 p.m. UTC | #5
Richard Biener <richard.guenther@gmail.com> writes:
> On Thu, Sep 12, 2024 at 4:50 PM Hongtao Liu <crazylht@gmail.com> wrote:
>> [...]
>>
>> I've tested the patch on aarch64; it shows a similar improvement with
>> little codesize increase.  I haven't tested it on other backends, but
>> I think it would show similarly good improvements.
>
> I think overall this is expected since a constant niter divisible by
> the VF isn't a common situation.  So the question is mostly whether
> we want to pay the size penalty or not.
>
> Looking only at docs the proposed change would make the very-cheap
> cost model nearly(?) equivalent to the cheap one so maybe the answer
> is to default to cheap rather than very-cheap?  One difference seems to
> be that cheap allows alias versioning.

I remember seeing cases in the past where we could generate an
excessive number of alias checks.  The cost model didn't account
for them very well, since the checks often became a fixed overhead
for all paths (both scalar and vector), especially if the checks
were fully if-converted, with one branch at the end.  The relevant
comparison is then between the original pre-vectorisation scalar code
and the code with alias checks, rather than between post-vectorisation
scalar code and post-vectorisation vector code.  Things might be better
now though.
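The shape described above can be sketched as follows (hand-written; `foo_versioned` is a made-up name and GCC's real versioning test differs): when the checks are fully if-converted, they run unconditionally, so the scalar path pays for them too.

```c
/* Sketch of fully if-converted alias checks: every call evaluates all
   the checks up front and takes a single branch at the end, so the
   checks are a fixed overhead even when the scalar loop is taken.  */
void
foo_versioned (int *a, int *b, int *c, int n)
{
  /* Both overlap tests computed unconditionally ('|' and '&' instead
     of short-circuiting '||' and '&&').  */
  int disjoint = ((b + n <= a) | (a + n <= b))
                 & ((c + n <= a) | (a + n <= c));
  if (disjoint)
    for (int i = 0; i < n; i++)  /* stands in for the vector loop */
      a[i] = b[i] + c[i];
  else
    for (int i = 0; i < n; i++)  /* scalar fallback */
      a[i] = b[i] + c[i];
}
```

The relevant comparison is then between the plain scalar loop and this versioned form, since the check cost is paid on every entry.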

FTR, I don't object to relaxing the -O2 model.  It was deliberately
conservative, for a time when enabling vectorisation at -O2 was
somewhat controversial.  It was also heavily influenced by SVE,
where variable trip counts are not an issue.

The proposal would also make GCC's behaviour more similar to Clang's.

Thanks,
Richard
Richard Biener Sept. 19, 2024, 6:08 a.m. UTC | #6
On Wed, Sep 18, 2024 at 7:55 PM Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> Richard Biener <richard.guenther@gmail.com> writes:
> > [...]
> >
> > I think overall this is expected since a constant niter divisible by
> > the VF isn't a common situation.  So the question is mostly whether
> > we want to pay the size penalty or not.
> >
> > Looking only at docs the proposed change would make the very-cheap
> > cost model nearly(?) equivalent to the cheap one so maybe the answer
> > is to default to cheap rather than very-cheap?  One difference seems to
> > be that cheap allows alias versioning.
>
> I remember seeing cases in the past where we could generate an
> excessive number of alias checks.  The cost model didn't account
> for them very well, since the checks often became a fixed overhead
> for all paths (both scalar and vector), especially if the checks
> were fully if-converted, with one branch at the end.  The relevant
> comparison is then between the original pre-vectorisation scalar code
> and the code with alias checks, rather than between post-vectorisation
> scalar code and post-vectorisation vector code.  Things might be better
> now though.

Yes, the cost model (aka niter) check should now be before the alias check, not
if-converted, but of course the alias-checking overhead can still be quite big.

> FTR, I don't object to relaxing the -O2 model.  It was deliberately
> conservative, for a time when enabling vectorisation at -O2 was
> somewhat controversial.  It was also heavily influenced by SVE,
> where variable trip counts are not an issue.

I agree - I think we can try for GCC 15.  Note that since we disallow
epilogue vectorization with cheap, we might want to prefer smaller
vector sizes, which means the target might want to adjust its
vector_modes hook.

> The proposal would also make GCC's behaviour more similar to Clang's.

So should we adjust very-cheap to allow niter peeling as proposed, or
should we switch the default at -O2 to cheap?

Richard.

> Thanks,
> Richard
Hongtao Liu Sept. 24, 2024, 7:11 a.m. UTC | #7
On Thu, Sep 19, 2024 at 2:08 PM Richard Biener

<richard.guenther@gmail.com> wrote:
>
> On Wed, Sep 18, 2024 at 7:55 PM Richard Sandiford
> <richard.sandiford@arm.com> wrote:
> >
> > Richard Biener <richard.guenther@gmail.com> writes:
> > [...]
>
> So should we adjust very-cheap to allow niter peeling as proposed, or
> should we switch the default at -O2 to cheap?

Any thoughts from other backend maintainers?

>
> Richard.
>
> > Thanks,
> > Richard



--
BR,
Hongtao
Kito Cheng Sept. 27, 2024, 10:39 a.m. UTC | #8
> > So should we adjust very-cheap to allow niter peeling as proposed or
> > should we switch
> > the default at -O2 to cheap?
>
> Any thoughts from other backend maintainers?

No preference from RISC-V: since it is a variable-length vector
flavor there is no epilogue in those cases; I mean, it's already
vectorizable on RISC-V with -O2 :P

https://godbolt.org/z/v5z8WxdjT
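The reason variable-length vectors need no epilogue can be sketched in scalar C (hand-written; `foo1_vla` is a made-up name, and the inner loop loosely models RVV's vsetvli / SVE's predication):

```c
/* Sketch of a length-agnostic loop as RVV or SVE would execute it:
   the last iteration simply uses a shorter active vector length, so
   no separate scalar epilogue loop is needed.  */
void
foo1_vla (int *restrict a, int *b, int *c, int n)
{
  int vlmax = 8;  /* assumed hardware vector length in ints */
  for (int i = 0; i < n; )
    {
      int vl = (n - i < vlmax) ? n - i : vlmax;  /* models vsetvli */
      for (int j = 0; j < vl; j++)  /* one predicated vector op */
        a[i + j] = b[i + j] + c[i + j];
      i += vl;
    }
}
```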
Patch

diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index 242d5e2d916..06afd8cae79 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -2356,8 +2356,7 @@  vect_analyze_loop_costing (loop_vec_info loop_vinfo,
      a copy of the scalar code (even if we might be able to vectorize it).  */
   if (loop_cost_model (loop) == VECT_COST_MODEL_VERY_CHEAP
       && (LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo)
-	  || LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
-	  || LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo)))
+	  || LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)))
     {
       if (dump_enabled_p ())
 	dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
@@ -3638,7 +3637,8 @@  vect_analyze_loop (class loop *loop, vec_info_shared *shared)
 			   /* No code motion support for multiple epilogues so for now
 			      not supported when multiple exits.  */
 			 && !LOOP_VINFO_EARLY_BREAKS (first_loop_vinfo)
-			 && !loop->simduid);
+			 && !loop->simduid
+			 && loop_cost_model (loop) > VECT_COST_MODEL_VERY_CHEAP);
   if (!vect_epilogues)
     return first_loop_vinfo;