Message ID | 20231204053208.908533-1-hongtao.liu@intel.com |
---|---|
State | New |
Headers | show |
Series | Don't vectorize when vector stmts are only vec_contruct and stores | expand |
On Mon, Dec 4, 2023 at 6:32 AM liuhongt <hongtao.liu@intel.com> wrote: > > .i.e. for below cases. > a[0] = b1; > a[1] = b2; > .. > a[n] = bn; > > There're extra dependences when contructing the vector, but not for > scalar store. According to experiments, it's generally worse. > > The patch adds an cut-off heuristic when vec_stmt is just > vec_construct and vector store. It improves SPEC2017 a little bit. > > BenchMarks Ratio > 500.perlbench_r 2.60% > 502.gcc_r 0.30% > 505.mcf_r 0.40% > 520.omnetpp_r -1.00% > 523.xalancbmk_r 0.90% > 525.x264_r 0.00% > 531.deepsjeng_r 0.30% > 541.leela_r 0.90% > 548.exchange2_r 3.20% > 557.xz_r 1.40% > 503.bwaves_r 0.00% > 507.cactuBSSN_r 0.00% > 508.namd_r 0.30% > 510.parest_r 0.00% > 511.povray_r 0.20% > 519.lbm_r SAME BIN > 521.wrf_r -0.30% > 526.blender_r -1.20% > 527.cam4_r -0.20% > 538.imagick_r 4.00% > 544.nab_r 0.40% > 549.fotonik3d_r 0.00% > 554.roms_r 0.00% > Geomean-int 0.90% > Geomean-fp 0.30% > Geomean-all 0.50% > > And > Regressed testcases: > > gcc.target/i386/part-vect-absneghf.c > gcc.target/i386/part-vect-copysignhf.c > gcc.target/i386/part-vect-xorsignhf.c > > Regressed under -m32 since it generates 2 vector > .ABS/NEG/XORSIGN/COPYSIGN vs original 1 64-bit vec_construct. The > original testcases are used to test vectorization capability for > .ABS/NEG/XORG/COPYSIGN, so just restrict testcase to TARGET_64BIT. > > gcc.target/i386/pr111023-2.c > gcc.target/i386/pr111023.c > Regressed under -m32 > > testcase as below > > void > v8hi_v8qi (v8hi *dst, v16qi src) > { > short tem[8]; > tem[0] = src[0]; > tem[1] = src[1]; > tem[2] = src[2]; > tem[3] = src[3]; > tem[4] = src[4]; > tem[5] = src[5]; > tem[6] = src[6]; > tem[7] = src[7]; > dst[0] = *(v8hi *) tem; > } > > under 64-bit target, vectorizer realize it's just permutation of > original src vector, but under -m32, vectorizer relies on > vec_construct for vectorization. I think optimziation for this case > under 32-bit target maynot impact much, so just add > -fno-vect-cost-model. > > gcc.target/i386/pr91446.c: This testcase is guard for cost model of > vector store, not vectorization capability, so just adjust testcase. > > gcc.target/i386/pr108938-3.c: This testcase relies on vec_construct to > optimize for bswap, like other optimziation vectorizer can't realize > optimization after it. So the current solution is add > -fno-vect-cost-model to the testcase. > > costmodel-pr104582-1.c > costmodel-pr104582-2.c > costmodel-pr104582-4.c > > Failed since it's now not vectorized, looked at the PR, it's exactly > what's wanted, so adjust testcase to scan-tree-dump-not. > > > Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. > Ok for trunk? So the original motivation to not more aggressively prune store-from-CTOR vectorization in the vectorizer itself is that the vector store is possibly better for STLF (larger stores are good, larger loads eventually problematic). I'd also expect the costs to play out to not make those profitable. OTOH, if you have a series of 'double' stores you can convert to a series of V2DF stores you _may_ be faster if this reduces pressure on the store unit. Esp. V2DF is cheap to construct with one movhpd. So I don't think we want to try to pattern match it this way? In fact the SLP vectorization cases could all arrive with an SLP node specified (vectorizable_store would have to be changed here), which means you could check for an vect_external_def child instead? But as said, I would hope that we can arrive at a better way assessing the CONSTRUCTOR cost. IMHO one big issue is that load and store cost are comparatively high compared to simple stmt ops so it's very hard to offset saving many stores with "ops". That's because we generally think of 'cost' to model latency but as you say stores don't really have latency - we only have store bandwidth of the store unit and of course issue width (but that's true for other ops as well). I wonder what happens if we set both scalar and vector store cost to zero? Or maybe one (to count one issue slot)? Richard. > gcc/ChangeLog: > > PR target/99881 > PR target/104582 > * config/i386/i386.cc (ix86_vector_costs::add_stmt_cost): > Check if kind is vec_construct or vector store. > (ix86_vector_costs::finish_cost): Don't do vectorization when > vector stmts are only vec_construct and stores. > (ix86_vector_costs::ix86_vect_construct_store_only_p): New > function. > (ix86_vector_costs::ix86_vect_cut_off): Ditto. > > gcc/testsuite/ChangeLog: > > * gcc.target/i386/part-vect-absneghf.c: Restrict testcase to > TARGET_64BIT. > * gcc.target/i386/part-vect-copysignhf.c: Ditto. > * gcc.target/i386/part-vect-xorsignhf.c: Ditto. > * gcc.target/i386/pr91446.c: Adjust testcase. > * gcc.target/i386/pr108938-3.c: Add -fno-vect-cost-model. > * gcc.target/i386/pr111023-2.c: Ditto. > * gcc.target/i386/pr111023.c: Ditto. > * gcc.target/i386/pr99881.c: Remove xfail. > * gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-1.c: Changed > to Scan-tree-dump-not. > * gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-3.c: Ditto. > * gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-4.c: Ditto. > --- > gcc/config/i386/i386.cc | 81 ++++++++++++++++++- > .../costmodel/x86_64/costmodel-pr104582-1.c | 2 +- > .../costmodel/x86_64/costmodel-pr104582-3.c | 2 +- > .../costmodel/x86_64/costmodel-pr104582-4.c | 2 +- > .../gcc.target/i386/part-vect-absneghf.c | 4 +- > .../gcc.target/i386/part-vect-copysignhf.c | 4 +- > .../gcc.target/i386/part-vect-xorsignhf.c | 4 +- > gcc/testsuite/gcc.target/i386/pr108938-3.c | 2 +- > gcc/testsuite/gcc.target/i386/pr111023-2.c | 2 +- > gcc/testsuite/gcc.target/i386/pr111023.c | 2 +- > gcc/testsuite/gcc.target/i386/pr91446.c | 14 ++-- > gcc/testsuite/gcc.target/i386/pr99881.c | 2 +- > 12 files changed, 99 insertions(+), 22 deletions(-) > > diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc > index dcaea6c2096..a4b23e29eba 100644 > --- a/gcc/config/i386/i386.cc > +++ b/gcc/config/i386/i386.cc > @@ -24573,6 +24573,10 @@ public: > > private: > > + /* Don't do vectorization for certain patterns. */ > + void ix86_vect_cut_off (); > + > + bool ix86_vect_construct_store_only_p (vect_cost_for_stmt, stmt_vec_info); > /* Estimate register pressure of the vectorized code. */ > void ix86_vect_estimate_reg_pressure (); > /* Number of GENERAL_REGS/SSE_REGS used in the vectorizer, it's used for > @@ -24581,12 +24585,15 @@ private: > where we know it's not loaded from memory. */ > unsigned m_num_gpr_needed[3]; > unsigned m_num_sse_needed[3]; > + > + bool m_vec_construct_store_only; > }; > > ix86_vector_costs::ix86_vector_costs (vec_info* vinfo, bool costing_for_scalar) > : vector_costs (vinfo, costing_for_scalar), > m_num_gpr_needed (), > - m_num_sse_needed () > + m_num_sse_needed (), > + m_vec_construct_store_only (true) > { > } > > @@ -24609,6 +24616,10 @@ ix86_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind, > = (kind == scalar_stmt || kind == scalar_load || kind == scalar_store); > int stmt_cost = - 1; > > + if (m_vec_construct_store_only > + && !ix86_vect_construct_store_only_p (kind, stmt_info)) > + m_vec_construct_store_only = false; > + > bool fp = false; > machine_mode mode = scalar_p ? SImode : TImode; > > @@ -24865,8 +24876,45 @@ ix86_vector_costs::ix86_vect_estimate_reg_pressure () > } > } > > +/* Return true if KIND and STMT_INFO is either vec_construct or vector > + store, stmt_info could be promote/demote between vec_construct and > + vector store, also count that. */ > +bool > +ix86_vector_costs::ix86_vect_construct_store_only_p (vect_cost_for_stmt kind, > + stmt_vec_info stmt_info) > +{ > + switch (kind) > + { > + case vec_construct: > + case vector_store: > + case unaligned_store: > + case vector_scatter_store: > + case vec_promote_demote: > + return true; > + > + /* Vectorizer will try VEC_PACK_TRUNK_EXPR for things likes > + char* a; > + short b1, b2, b3, b4; > + a[0] = b1; > + a[1] = b2; > + a[2] = b3; > + a[3] = b4; > + Also don't vectorized it. */ > + case vector_stmt: > + if (stmt_info && stmt_info->stmt > + && gimple_code (stmt_info->stmt) == GIMPLE_ASSIGN > + && gimple_assign_rhs_code (stmt_info->stmt) == NOP_EXPR) > + return true; > + > + default: > + break; > + } > + > + return false; > +} > + > void > -ix86_vector_costs::finish_cost (const vector_costs *scalar_costs) > +ix86_vector_costs::ix86_vect_cut_off () > { > loop_vec_info loop_vinfo = dyn_cast<loop_vec_info> (m_vinfo); > if (loop_vinfo && !m_costing_for_scalar) > @@ -24885,10 +24933,39 @@ ix86_vector_costs::finish_cost (const vector_costs *scalar_costs) > && (exact_log2 (LOOP_VINFO_VECT_FACTOR (loop_vinfo).to_constant ()) > > ceil_log2 (LOOP_VINFO_INT_NITERS (loop_vinfo)))) > m_costs[vect_body] = INT_MAX; > + return; > + } > + > + /* Don't do vectorization for things like > + > + a[0] = b1; > + a[1] = b2; > + .. > + a[n] = bn; > + > + There're extra dependences when contructing the vector, but not for > + scalar store. According to experiments, it's generally worse. */ > + if (m_vec_construct_store_only) > + { > + m_costs[0] = m_costs[1] = m_costs[2] = INT_MAX; > + if (dump_enabled_p ()) > + dump_printf_loc (MSG_NOTE, vect_location, > + "Skip vectorization for stmts which only contains" > + " vec_construct and vector store.\n"); > + return; > } > > + return; > +} > + > +void > +ix86_vector_costs::finish_cost (const vector_costs *scalar_costs) > +{ > + > ix86_vect_estimate_reg_pressure (); > > + ix86_vect_cut_off (); > + > vector_costs::finish_cost (scalar_costs); > } > > diff --git a/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-1.c b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-1.c > index 992a845ad7a..f940af70b72 100644 > --- a/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-1.c > +++ b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-1.c > @@ -12,4 +12,4 @@ foo (unsigned long *a, unsigned long *b) > s.b = b_; > } > > -/* { dg-final { scan-tree-dump "basic block part vectorized" "slp2" } } */ > +/* { dg-final { scan-tree-dump-not "basic block part vectorized" "slp2" } } */ > diff --git a/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-3.c b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-3.c > index 999c4905708..eff60b2a82a 100644 > --- a/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-3.c > +++ b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-3.c > @@ -10,4 +10,4 @@ foo (double a, double b) > s.b = b; > } > > -/* { dg-final { scan-tree-dump "basic block part vectorized" "slp2" } } */ > +/* { dg-final { scan-tree-dump-not "basic block part vectorized" "slp2" } } */ > diff --git a/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-4.c b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-4.c > index cc471e1ed73..2f354338e96 100644 > --- a/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-4.c > +++ b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-4.c > @@ -12,4 +12,4 @@ foo (signed long *a, unsigned long *b) > s.b = b_; > } > > -/* { dg-final { scan-tree-dump "basic block part vectorized" "slp2" } } */ > +/* { dg-final { scan-tree-dump-not "basic block part vectorized" "slp2" } } */ > diff --git a/gcc/testsuite/gcc.target/i386/part-vect-absneghf.c b/gcc/testsuite/gcc.target/i386/part-vect-absneghf.c > index 48aed14d604..4052210ec39 100644 > --- a/gcc/testsuite/gcc.target/i386/part-vect-absneghf.c > +++ b/gcc/testsuite/gcc.target/i386/part-vect-absneghf.c > @@ -85,7 +85,7 @@ do_test (void) > abort (); > } > > -/* { dg-final { scan-tree-dump-times "vectorized using 8 byte vectors" 2 "slp2" } } */ > -/* { dg-final { scan-tree-dump-times "vectorized using 4 byte vectors" 2 "slp2" } } */ > +/* { dg-final { scan-tree-dump-times "vectorized using 8 byte vectors" 2 "slp2" { target { ! ia32 } } } } */ > +/* { dg-final { scan-tree-dump-times "vectorized using 4 byte vectors" 2 "slp2" { target { ! ia32 } } } } */ > /* { dg-final { scan-tree-dump-times {(?n)ABS_EXPR <vect} 2 "optimized" { target { ! ia32 } } } } */ > /* { dg-final { scan-tree-dump-times {(?n)= -vect} 2 "optimized" { target { ! ia32 } } } } */ > diff --git a/gcc/testsuite/gcc.target/i386/part-vect-copysignhf.c b/gcc/testsuite/gcc.target/i386/part-vect-copysignhf.c > index 811617bc3dd..006941f6651 100644 > --- a/gcc/testsuite/gcc.target/i386/part-vect-copysignhf.c > +++ b/gcc/testsuite/gcc.target/i386/part-vect-copysignhf.c > @@ -55,6 +55,6 @@ do_test (void) > abort (); > } > > -/* { dg-final { scan-tree-dump-times "vectorized using 8 byte vectors" 1 "slp2" } } */ > -/* { dg-final { scan-tree-dump-times "vectorized using 4 byte vectors" 1 "slp2" } } */ > +/* { dg-final { scan-tree-dump-times "vectorized using 8 byte vectors" 1 "slp2" { target { ! ia32 } } } } */ > +/* { dg-final { scan-tree-dump-times "vectorized using 4 byte vectors" 1 "slp2" { target { ! ia32 } } } } */ > /* { dg-final { scan-tree-dump-times ".COPYSIGN" 2 "optimized" { target { ! ia32 } } } } */ > diff --git a/gcc/testsuite/gcc.target/i386/part-vect-xorsignhf.c b/gcc/testsuite/gcc.target/i386/part-vect-xorsignhf.c > index a8ec60a088a..a57dc5aba23 100644 > --- a/gcc/testsuite/gcc.target/i386/part-vect-xorsignhf.c > +++ b/gcc/testsuite/gcc.target/i386/part-vect-xorsignhf.c > @@ -55,6 +55,6 @@ do_test (void) > abort (); > } > > -/* { dg-final { scan-tree-dump-times "vectorized using 8 byte vectors" 1 "slp2" } } */ > -/* { dg-final { scan-tree-dump-times "vectorized using 4 byte vectors" 1 "slp2" } } */ > +/* { dg-final { scan-tree-dump-times "vectorized using 8 byte vectors" 1 "slp2" { target { ! ia32 } } } } */ > +/* { dg-final { scan-tree-dump-times "vectorized using 4 byte vectors" 1 "slp2" { target { ! ia32 } } } } */ > /* { dg-final { scan-tree-dump-times ".XORSIGN" 2 "optimized" { target { ! ia32 } } } } */ > diff --git a/gcc/testsuite/gcc.target/i386/pr108938-3.c b/gcc/testsuite/gcc.target/i386/pr108938-3.c > index 32ac544c7ed..24725a9ab1d 100644 > --- a/gcc/testsuite/gcc.target/i386/pr108938-3.c > +++ b/gcc/testsuite/gcc.target/i386/pr108938-3.c > @@ -1,5 +1,5 @@ > /* { dg-do compile } */ > -/* { dg-options "-O2 -ftree-vectorize -mno-movbe" } */ > +/* { dg-options "-O2 -ftree-vectorize -mno-movbe -fno-vect-cost-model" } */ > /* { dg-final { scan-assembler-times "bswap\[\t ]+" 2 { target { ! ia32 } } } } */ > /* { dg-final { scan-assembler-times "bswap\[\t ]+" 3 { target ia32 } } } */ > > diff --git a/gcc/testsuite/gcc.target/i386/pr111023-2.c b/gcc/testsuite/gcc.target/i386/pr111023-2.c > index 6c69f947544..db25668a9ae 100644 > --- a/gcc/testsuite/gcc.target/i386/pr111023-2.c > +++ b/gcc/testsuite/gcc.target/i386/pr111023-2.c > @@ -1,6 +1,6 @@ > /* PR target/111023 */ > /* { dg-do compile } */ > -/* { dg-options "-O2 -mtune=icelake-server -ftree-vectorize -msse2 -mno-sse4.1" } */ > +/* { dg-options "-O2 -mtune=icelake-server -ftree-vectorize -msse2 -mno-sse4.1 -fno-vect-cost-model" } */ > > typedef char v16qi __attribute__((vector_size (16))); > typedef short v8hi __attribute__((vector_size (16))); > diff --git a/gcc/testsuite/gcc.target/i386/pr111023.c b/gcc/testsuite/gcc.target/i386/pr111023.c > index 6144c371f32..18cd579d937 100644 > --- a/gcc/testsuite/gcc.target/i386/pr111023.c > +++ b/gcc/testsuite/gcc.target/i386/pr111023.c > @@ -1,6 +1,6 @@ > /* PR target/111023 */ > /* { dg-do compile } */ > -/* { dg-options "-O2 -mtune=icelake-server -ftree-vectorize -msse2 -mno-sse4.1" } */ > +/* { dg-options "-O2 -mtune=icelake-server -ftree-vectorize -msse2 -mno-sse4.1 -fno-vect-cost-model" } */ > > typedef unsigned char v16qi __attribute__((vector_size (16))); > typedef unsigned short v8hi __attribute__((vector_size (16))); > diff --git a/gcc/testsuite/gcc.target/i386/pr91446.c b/gcc/testsuite/gcc.target/i386/pr91446.c > index 0243ca3ea68..3e0f8a4b315 100644 > --- a/gcc/testsuite/gcc.target/i386/pr91446.c > +++ b/gcc/testsuite/gcc.target/i386/pr91446.c > @@ -10,15 +10,15 @@ typedef struct > extern void bar (info *); > > void > -foo (unsigned long long width, unsigned long long height, > - long long x, long long y) > +foo (unsigned long long* width, > + long long* x) > { > info t; > - t.width = width; > - t.height = height; > - t.x = x; > - t.y = y; > + t.width = width[0]; > + t.height = width[1]; > + t.x = x[0]; > + t.y = x[1]; > bar (&t); > } > > -/* { dg-final { scan-assembler-times "vmovdqa\[^\n\r\]*xmm\[0-9\]" 2 } } */ > +/* { dg-final { scan-assembler-times "vmovdq\[au\]\[^\n\r\]*xmm\[0-9\]" 4 } } */ > diff --git a/gcc/testsuite/gcc.target/i386/pr99881.c b/gcc/testsuite/gcc.target/i386/pr99881.c > index 3e087eb2ed7..1a59325ac1c 100644 > --- a/gcc/testsuite/gcc.target/i386/pr99881.c > +++ b/gcc/testsuite/gcc.target/i386/pr99881.c > @@ -1,7 +1,7 @@ > /* PR target/99881. */ > /* { dg-do compile { target { ! ia32 } } } */ > /* { dg-options "-Ofast -march=skylake" } */ > -/* { dg-final { scan-assembler-not "xmm\[0-9\]" { xfail *-*-* } } } */ > +/* { dg-final { scan-assembler-not "xmm\[0-9\]" } } */ > > void > foo (int* __restrict a, int n, int c) > -- > 2.31.1 >
On Mon, Dec 4, 2023 at 10:10 PM Richard Biener <richard.guenther@gmail.com> wrote: > > On Mon, Dec 4, 2023 at 6:32 AM liuhongt <hongtao.liu@intel.com> wrote: > > > > .i.e. for below cases. > > a[0] = b1; > > a[1] = b2; > > .. > > a[n] = bn; > > > > There're extra dependences when contructing the vector, but not for > > scalar store. According to experiments, it's generally worse. > > > > The patch adds an cut-off heuristic when vec_stmt is just > > vec_construct and vector store. It improves SPEC2017 a little bit. > > > > BenchMarks Ratio > > 500.perlbench_r 2.60% > > 502.gcc_r 0.30% > > 505.mcf_r 0.40% > > 520.omnetpp_r -1.00% > > 523.xalancbmk_r 0.90% > > 525.x264_r 0.00% > > 531.deepsjeng_r 0.30% > > 541.leela_r 0.90% > > 548.exchange2_r 3.20% > > 557.xz_r 1.40% > > 503.bwaves_r 0.00% > > 507.cactuBSSN_r 0.00% > > 508.namd_r 0.30% > > 510.parest_r 0.00% > > 511.povray_r 0.20% > > 519.lbm_r SAME BIN > > 521.wrf_r -0.30% > > 526.blender_r -1.20% > > 527.cam4_r -0.20% > > 538.imagick_r 4.00% > > 544.nab_r 0.40% > > 549.fotonik3d_r 0.00% > > 554.roms_r 0.00% > > Geomean-int 0.90% > > Geomean-fp 0.30% > > Geomean-all 0.50% > > > > And > > Regressed testcases: > > > > gcc.target/i386/part-vect-absneghf.c > > gcc.target/i386/part-vect-copysignhf.c > > gcc.target/i386/part-vect-xorsignhf.c > > > > Regressed under -m32 since it generates 2 vector > > .ABS/NEG/XORSIGN/COPYSIGN vs original 1 64-bit vec_construct. The > > original testcases are used to test vectorization capability for > > .ABS/NEG/XORG/COPYSIGN, so just restrict testcase to TARGET_64BIT. > > > > gcc.target/i386/pr111023-2.c > > gcc.target/i386/pr111023.c > > Regressed under -m32 > > > > testcase as below > > > > void > > v8hi_v8qi (v8hi *dst, v16qi src) > > { > > short tem[8]; > > tem[0] = src[0]; > > tem[1] = src[1]; > > tem[2] = src[2]; > > tem[3] = src[3]; > > tem[4] = src[4]; > > tem[5] = src[5]; > > tem[6] = src[6]; > > tem[7] = src[7]; > > dst[0] = *(v8hi *) tem; > > } > > > > under 64-bit target, vectorizer realize it's just permutation of > > original src vector, but under -m32, vectorizer relies on > > vec_construct for vectorization. I think optimziation for this case > > under 32-bit target maynot impact much, so just add > > -fno-vect-cost-model. > > > > gcc.target/i386/pr91446.c: This testcase is guard for cost model of > > vector store, not vectorization capability, so just adjust testcase. > > > > gcc.target/i386/pr108938-3.c: This testcase relies on vec_construct to > > optimize for bswap, like other optimziation vectorizer can't realize > > optimization after it. So the current solution is add > > -fno-vect-cost-model to the testcase. > > > > costmodel-pr104582-1.c > > costmodel-pr104582-2.c > > costmodel-pr104582-4.c > > > > Failed since it's now not vectorized, looked at the PR, it's exactly > > what's wanted, so adjust testcase to scan-tree-dump-not. > > > > > > Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. > > Ok for trunk? > > So the original motivation to not more aggressively prune > store-from-CTOR vectorization in the vectorizer itself is that > the vector store is possibly better for STLF (larger stores are > good, larger loads eventually problematic). That's exactly what I worried about, and I didn't observe any STLF stall in SPEC2017, I'll try with more benchmarks. But on the other hand, the cost model is not suitable for solving this problem, at best it only circumvents part of this. > > I'd also expect the costs to play out to not make those profitable. > > OTOH, if you have a series of 'double' stores you can convert to > a series of V2DF stores you _may_ be faster if this reduces > pressure on the store unit. Esp. V2DF is cheap to construct > with one movhpd. > > So I don't think we want to try to pattern match it this way? > > In fact the SLP vectorization cases could all arrive with an > SLP node specified (vectorizable_store would have to be > changed here), which means you could check for an > vect_external_def child instead? > > But as said, I would hope that we can arrive at a better way > assessing the CONSTRUCTOR cost. IMHO one big issue > is that load and store cost are comparatively high compared > to simple stmt ops so it's very hard to offset saving many > stores with "ops". That's because we generally think of > 'cost' to model latency but as you say stores don't really > have latency - we only have store bandwidth of the store Yes. > unit and of course issue width (but that's true for other ops > as well). I wonder what happens if we set both scalar and > vector store cost to zero? Or maybe one (to count one > issue slot)? I tried to reduce the cost of the scalar store, but it regressed in other cases. But I haven't tried with zero/one, it sounds like a good idea, Let me try that. > > Richard. > > > > gcc/ChangeLog: > > > > PR target/99881 > > PR target/104582 > > * config/i386/i386.cc (ix86_vector_costs::add_stmt_cost): > > Check if kind is vec_construct or vector store. > > (ix86_vector_costs::finish_cost): Don't do vectorization when > > vector stmts are only vec_construct and stores. > > (ix86_vector_costs::ix86_vect_construct_store_only_p): New > > function. > > (ix86_vector_costs::ix86_vect_cut_off): Ditto. > > > > gcc/testsuite/ChangeLog: > > > > * gcc.target/i386/part-vect-absneghf.c: Restrict testcase to > > TARGET_64BIT. > > * gcc.target/i386/part-vect-copysignhf.c: Ditto. > > * gcc.target/i386/part-vect-xorsignhf.c: Ditto. > > * gcc.target/i386/pr91446.c: Adjust testcase. > > * gcc.target/i386/pr108938-3.c: Add -fno-vect-cost-model. > > * gcc.target/i386/pr111023-2.c: Ditto. > > * gcc.target/i386/pr111023.c: Ditto. > > * gcc.target/i386/pr99881.c: Remove xfail. > > * gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-1.c: Changed > > to Scan-tree-dump-not. > > * gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-3.c: Ditto. > > * gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-4.c: Ditto. > > --- > > gcc/config/i386/i386.cc | 81 ++++++++++++++++++- > > .../costmodel/x86_64/costmodel-pr104582-1.c | 2 +- > > .../costmodel/x86_64/costmodel-pr104582-3.c | 2 +- > > .../costmodel/x86_64/costmodel-pr104582-4.c | 2 +- > > .../gcc.target/i386/part-vect-absneghf.c | 4 +- > > .../gcc.target/i386/part-vect-copysignhf.c | 4 +- > > .../gcc.target/i386/part-vect-xorsignhf.c | 4 +- > > gcc/testsuite/gcc.target/i386/pr108938-3.c | 2 +- > > gcc/testsuite/gcc.target/i386/pr111023-2.c | 2 +- > > gcc/testsuite/gcc.target/i386/pr111023.c | 2 +- > > gcc/testsuite/gcc.target/i386/pr91446.c | 14 ++-- > > gcc/testsuite/gcc.target/i386/pr99881.c | 2 +- > > 12 files changed, 99 insertions(+), 22 deletions(-) > > > > diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc > > index dcaea6c2096..a4b23e29eba 100644 > > --- a/gcc/config/i386/i386.cc > > +++ b/gcc/config/i386/i386.cc > > @@ -24573,6 +24573,10 @@ public: > > > > private: > > > > + /* Don't do vectorization for certain patterns. */ > > + void ix86_vect_cut_off (); > > + > > + bool ix86_vect_construct_store_only_p (vect_cost_for_stmt, stmt_vec_info); > > /* Estimate register pressure of the vectorized code. */ > > void ix86_vect_estimate_reg_pressure (); > > /* Number of GENERAL_REGS/SSE_REGS used in the vectorizer, it's used for > > @@ -24581,12 +24585,15 @@ private: > > where we know it's not loaded from memory. */ > > unsigned m_num_gpr_needed[3]; > > unsigned m_num_sse_needed[3]; > > + > > + bool m_vec_construct_store_only; > > }; > > > > ix86_vector_costs::ix86_vector_costs (vec_info* vinfo, bool costing_for_scalar) > > : vector_costs (vinfo, costing_for_scalar), > > m_num_gpr_needed (), > > - m_num_sse_needed () > > + m_num_sse_needed (), > > + m_vec_construct_store_only (true) > > { > > } > > > > @@ -24609,6 +24616,10 @@ ix86_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind, > > = (kind == scalar_stmt || kind == scalar_load || kind == scalar_store); > > int stmt_cost = - 1; > > > > + if (m_vec_construct_store_only > > + && !ix86_vect_construct_store_only_p (kind, stmt_info)) > > + m_vec_construct_store_only = false; > > + > > bool fp = false; > > machine_mode mode = scalar_p ? SImode : TImode; > > > > @@ -24865,8 +24876,45 @@ ix86_vector_costs::ix86_vect_estimate_reg_pressure () > > } > > } > > > > +/* Return true if KIND and STMT_INFO is either vec_construct or vector > > + store, stmt_info could be promote/demote between vec_construct and > > + vector store, also count that. */ > > +bool > > +ix86_vector_costs::ix86_vect_construct_store_only_p (vect_cost_for_stmt kind, > > + stmt_vec_info stmt_info) > > +{ > > + switch (kind) > > + { > > + case vec_construct: > > + case vector_store: > > + case unaligned_store: > > + case vector_scatter_store: > > + case vec_promote_demote: > > + return true; > > + > > + /* Vectorizer will try VEC_PACK_TRUNK_EXPR for things likes > > + char* a; > > + short b1, b2, b3, b4; > > + a[0] = b1; > > + a[1] = b2; > > + a[2] = b3; > > + a[3] = b4; > > + Also don't vectorized it. */ > > + case vector_stmt: > > + if (stmt_info && stmt_info->stmt > > + && gimple_code (stmt_info->stmt) == GIMPLE_ASSIGN > > + && gimple_assign_rhs_code (stmt_info->stmt) == NOP_EXPR) > > + return true; > > + > > + default: > > + break; > > + } > > + > > + return false; > > +} > > + > > void > > -ix86_vector_costs::finish_cost (const vector_costs *scalar_costs) > > +ix86_vector_costs::ix86_vect_cut_off () > > { > > loop_vec_info loop_vinfo = dyn_cast<loop_vec_info> (m_vinfo); > > if (loop_vinfo && !m_costing_for_scalar) > > @@ -24885,10 +24933,39 @@ ix86_vector_costs::finish_cost (const vector_costs *scalar_costs) > > && (exact_log2 (LOOP_VINFO_VECT_FACTOR (loop_vinfo).to_constant ()) > > > ceil_log2 (LOOP_VINFO_INT_NITERS (loop_vinfo)))) > > m_costs[vect_body] = INT_MAX; > > + return; > > + } > > + > > + /* Don't do vectorization for things like > > + > > + a[0] = b1; > > + a[1] = b2; > > + .. > > + a[n] = bn; > > + > > + There're extra dependences when contructing the vector, but not for > > + scalar store. According to experiments, it's generally worse. */ > > + if (m_vec_construct_store_only) > > + { > > + m_costs[0] = m_costs[1] = m_costs[2] = INT_MAX; > > + if (dump_enabled_p ()) > > + dump_printf_loc (MSG_NOTE, vect_location, > > + "Skip vectorization for stmts which only contains" > > + " vec_construct and vector store.\n"); > > + return; > > } > > > > + return; > > +} > > + > > +void > > +ix86_vector_costs::finish_cost (const vector_costs *scalar_costs) > > +{ > > + > > ix86_vect_estimate_reg_pressure (); > > > > + ix86_vect_cut_off (); > > + > > vector_costs::finish_cost (scalar_costs); > > } > > > > diff --git a/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-1.c b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-1.c > > index 992a845ad7a..f940af70b72 100644 > > --- a/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-1.c > > +++ b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-1.c > > @@ -12,4 +12,4 @@ foo (unsigned long *a, unsigned long *b) > > s.b = b_; > > } > > > > -/* { dg-final { scan-tree-dump "basic block part vectorized" "slp2" } } */ > > +/* { dg-final { scan-tree-dump-not "basic block part vectorized" "slp2" } } */ > > diff --git a/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-3.c b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-3.c > > index 999c4905708..eff60b2a82a 100644 > > --- a/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-3.c > > +++ b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-3.c > > @@ -10,4 +10,4 @@ foo (double a, double b) > > s.b = b; > > } > > > > -/* { dg-final { scan-tree-dump "basic block part vectorized" "slp2" } } */ > > +/* { dg-final { scan-tree-dump-not "basic block part vectorized" "slp2" } } */ > > diff --git a/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-4.c b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-4.c > > index cc471e1ed73..2f354338e96 100644 > > --- a/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-4.c > > +++ b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-4.c > > @@ -12,4 +12,4 @@ foo (signed long *a, unsigned long *b) > > s.b = b_; > > } > > > > -/* { dg-final { scan-tree-dump "basic block part vectorized" "slp2" } } */ > > +/* { dg-final { scan-tree-dump-not "basic block part vectorized" "slp2" } } */ > > diff --git a/gcc/testsuite/gcc.target/i386/part-vect-absneghf.c b/gcc/testsuite/gcc.target/i386/part-vect-absneghf.c > > index 48aed14d604..4052210ec39 100644 > > --- a/gcc/testsuite/gcc.target/i386/part-vect-absneghf.c > > +++ b/gcc/testsuite/gcc.target/i386/part-vect-absneghf.c > > @@ -85,7 +85,7 @@ do_test (void) > > abort (); > > } > > > > -/* { dg-final { scan-tree-dump-times "vectorized using 8 byte vectors" 2 "slp2" } } */ > > -/* { dg-final { scan-tree-dump-times "vectorized using 4 byte vectors" 2 "slp2" } } */ > > +/* { dg-final { scan-tree-dump-times "vectorized using 8 byte vectors" 2 "slp2" { target { ! ia32 } } } } */ > > +/* { dg-final { scan-tree-dump-times "vectorized using 4 byte vectors" 2 "slp2" { target { ! ia32 } } } } */ > > /* { dg-final { scan-tree-dump-times {(?n)ABS_EXPR <vect} 2 "optimized" { target { ! ia32 } } } } */ > > /* { dg-final { scan-tree-dump-times {(?n)= -vect} 2 "optimized" { target { ! ia32 } } } } */ > > diff --git a/gcc/testsuite/gcc.target/i386/part-vect-copysignhf.c b/gcc/testsuite/gcc.target/i386/part-vect-copysignhf.c > > index 811617bc3dd..006941f6651 100644 > > --- a/gcc/testsuite/gcc.target/i386/part-vect-copysignhf.c > > +++ b/gcc/testsuite/gcc.target/i386/part-vect-copysignhf.c > > @@ -55,6 +55,6 @@ do_test (void) > > abort (); > > } > > > > -/* { dg-final { scan-tree-dump-times "vectorized using 8 byte vectors" 1 "slp2" } } */ > > -/* { dg-final { scan-tree-dump-times "vectorized using 4 byte vectors" 1 "slp2" } } */ > > +/* { dg-final { scan-tree-dump-times "vectorized using 8 byte vectors" 1 "slp2" { target { ! ia32 } } } } */ > > +/* { dg-final { scan-tree-dump-times "vectorized using 4 byte vectors" 1 "slp2" { target { ! ia32 } } } } */ > > /* { dg-final { scan-tree-dump-times ".COPYSIGN" 2 "optimized" { target { ! ia32 } } } } */ > > diff --git a/gcc/testsuite/gcc.target/i386/part-vect-xorsignhf.c b/gcc/testsuite/gcc.target/i386/part-vect-xorsignhf.c > > index a8ec60a088a..a57dc5aba23 100644 > > --- a/gcc/testsuite/gcc.target/i386/part-vect-xorsignhf.c > > +++ b/gcc/testsuite/gcc.target/i386/part-vect-xorsignhf.c > > @@ -55,6 +55,6 @@ do_test (void) > > abort (); > > } > > > > -/* { dg-final { scan-tree-dump-times "vectorized using 8 byte vectors" 1 "slp2" } } */ > > -/* { dg-final { scan-tree-dump-times "vectorized using 4 byte vectors" 1 "slp2" } } */ > > +/* { dg-final { scan-tree-dump-times "vectorized using 8 byte vectors" 1 "slp2" { target { ! ia32 } } } } */ > > +/* { dg-final { scan-tree-dump-times "vectorized using 4 byte vectors" 1 "slp2" { target { ! ia32 } } } } */ > > /* { dg-final { scan-tree-dump-times ".XORSIGN" 2 "optimized" { target { ! ia32 } } } } */ > > diff --git a/gcc/testsuite/gcc.target/i386/pr108938-3.c b/gcc/testsuite/gcc.target/i386/pr108938-3.c > > index 32ac544c7ed..24725a9ab1d 100644 > > --- a/gcc/testsuite/gcc.target/i386/pr108938-3.c > > +++ b/gcc/testsuite/gcc.target/i386/pr108938-3.c > > @@ -1,5 +1,5 @@ > > /* { dg-do compile } */ > > -/* { dg-options "-O2 -ftree-vectorize -mno-movbe" } */ > > +/* { dg-options "-O2 -ftree-vectorize -mno-movbe -fno-vect-cost-model" } */ > > /* { dg-final { scan-assembler-times "bswap\[\t ]+" 2 { target { ! ia32 } } } } */ > > /* { dg-final { scan-assembler-times "bswap\[\t ]+" 3 { target ia32 } } } */ > > > > diff --git a/gcc/testsuite/gcc.target/i386/pr111023-2.c b/gcc/testsuite/gcc.target/i386/pr111023-2.c > > index 6c69f947544..db25668a9ae 100644 > > --- a/gcc/testsuite/gcc.target/i386/pr111023-2.c > > +++ b/gcc/testsuite/gcc.target/i386/pr111023-2.c > > @@ -1,6 +1,6 @@ > > /* PR target/111023 */ > > /* { dg-do compile } */ > > -/* { dg-options "-O2 -mtune=icelake-server -ftree-vectorize -msse2 -mno-sse4.1" } */ > > +/* { dg-options "-O2 -mtune=icelake-server -ftree-vectorize -msse2 -mno-sse4.1 -fno-vect-cost-model" } */ > > > > typedef char v16qi __attribute__((vector_size (16))); > > typedef short v8hi __attribute__((vector_size (16))); > > diff --git a/gcc/testsuite/gcc.target/i386/pr111023.c b/gcc/testsuite/gcc.target/i386/pr111023.c > > index 6144c371f32..18cd579d937 100644 > > --- a/gcc/testsuite/gcc.target/i386/pr111023.c > > +++ b/gcc/testsuite/gcc.target/i386/pr111023.c > > @@ -1,6 +1,6 @@ > > /* PR target/111023 */ > > /* { dg-do compile } */ > > -/* { dg-options "-O2 -mtune=icelake-server -ftree-vectorize -msse2 -mno-sse4.1" } */ > > +/* { dg-options "-O2 -mtune=icelake-server -ftree-vectorize -msse2 -mno-sse4.1 -fno-vect-cost-model" } */ > > > > typedef unsigned char v16qi __attribute__((vector_size (16))); > > typedef unsigned short v8hi __attribute__((vector_size (16))); > > diff --git a/gcc/testsuite/gcc.target/i386/pr91446.c b/gcc/testsuite/gcc.target/i386/pr91446.c > > index 0243ca3ea68..3e0f8a4b315 100644 > > --- a/gcc/testsuite/gcc.target/i386/pr91446.c > > +++ b/gcc/testsuite/gcc.target/i386/pr91446.c > > @@ -10,15 +10,15 @@ typedef struct > > extern void bar (info *); > > > > void > > -foo (unsigned long long width, unsigned long long height, > > - long long x, long long y) > > +foo (unsigned long long* width, > > + long long* x) > > { > > info t; > > - t.width = width; > > - t.height = height; > > - t.x = x; > > - t.y = y; > > + t.width = width[0]; > > + t.height = width[1]; > > + t.x = x[0]; > > + t.y = x[1]; > > bar (&t); > > } > > > > -/* { dg-final { scan-assembler-times "vmovdqa\[^\n\r\]*xmm\[0-9\]" 2 } } */ > > +/* { dg-final { scan-assembler-times "vmovdq\[au\]\[^\n\r\]*xmm\[0-9\]" 4 } } */ > > diff --git a/gcc/testsuite/gcc.target/i386/pr99881.c b/gcc/testsuite/gcc.target/i386/pr99881.c > > index 3e087eb2ed7..1a59325ac1c 100644 > > --- a/gcc/testsuite/gcc.target/i386/pr99881.c > > +++ b/gcc/testsuite/gcc.target/i386/pr99881.c > > @@ -1,7 +1,7 @@ > > /* PR target/99881. */ > > /* { dg-do compile { target { ! ia32 } } } */ > > /* { dg-options "-Ofast -march=skylake" } */ > > -/* { dg-final { scan-assembler-not "xmm\[0-9\]" { xfail *-*-* } } } */ > > +/* { dg-final { scan-assembler-not "xmm\[0-9\]" } } */ > > > > void > > foo (int* __restrict a, int n, int c) > > -- > > 2.31.1 > >
diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc index dcaea6c2096..a4b23e29eba 100644 --- a/gcc/config/i386/i386.cc +++ b/gcc/config/i386/i386.cc @@ -24573,6 +24573,10 @@ public: private: + /* Don't do vectorization for certain patterns. */ + void ix86_vect_cut_off (); + + bool ix86_vect_construct_store_only_p (vect_cost_for_stmt, stmt_vec_info); /* Estimate register pressure of the vectorized code. */ void ix86_vect_estimate_reg_pressure (); /* Number of GENERAL_REGS/SSE_REGS used in the vectorizer, it's used for @@ -24581,12 +24585,15 @@ private: where we know it's not loaded from memory. */ unsigned m_num_gpr_needed[3]; unsigned m_num_sse_needed[3]; + + bool m_vec_construct_store_only; }; ix86_vector_costs::ix86_vector_costs (vec_info* vinfo, bool costing_for_scalar) : vector_costs (vinfo, costing_for_scalar), m_num_gpr_needed (), - m_num_sse_needed () + m_num_sse_needed (), + m_vec_construct_store_only (true) { } @@ -24609,6 +24616,10 @@ ix86_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind, = (kind == scalar_stmt || kind == scalar_load || kind == scalar_store); int stmt_cost = - 1; + if (m_vec_construct_store_only + && !ix86_vect_construct_store_only_p (kind, stmt_info)) + m_vec_construct_store_only = false; + bool fp = false; machine_mode mode = scalar_p ? SImode : TImode; @@ -24865,8 +24876,45 @@ ix86_vector_costs::ix86_vect_estimate_reg_pressure () } } +/* Return true if KIND and STMT_INFO is either vec_construct or vector + store, stmt_info could be promote/demote between vec_construct and + vector store, also count that. */ +bool +ix86_vector_costs::ix86_vect_construct_store_only_p (vect_cost_for_stmt kind, + stmt_vec_info stmt_info) +{ + switch (kind) + { + case vec_construct: + case vector_store: + case unaligned_store: + case vector_scatter_store: + case vec_promote_demote: + return true; + + /* Vectorizer will try VEC_PACK_TRUNK_EXPR for things likes + char* a; + short b1, b2, b3, b4; + a[0] = b1; + a[1] = b2; + a[2] = b3; + a[3] = b4; + Also don't vectorized it. */ + case vector_stmt: + if (stmt_info && stmt_info->stmt + && gimple_code (stmt_info->stmt) == GIMPLE_ASSIGN + && gimple_assign_rhs_code (stmt_info->stmt) == NOP_EXPR) + return true; + + default: + break; + } + + return false; +} + void -ix86_vector_costs::finish_cost (const vector_costs *scalar_costs) +ix86_vector_costs::ix86_vect_cut_off () { loop_vec_info loop_vinfo = dyn_cast<loop_vec_info> (m_vinfo); if (loop_vinfo && !m_costing_for_scalar) @@ -24885,10 +24933,39 @@ ix86_vector_costs::finish_cost (const vector_costs *scalar_costs) && (exact_log2 (LOOP_VINFO_VECT_FACTOR (loop_vinfo).to_constant ()) > ceil_log2 (LOOP_VINFO_INT_NITERS (loop_vinfo)))) m_costs[vect_body] = INT_MAX; + return; + } + + /* Don't do vectorization for things like + + a[0] = b1; + a[1] = b2; + .. + a[n] = bn; + + There're extra dependences when contructing the vector, but not for + scalar store. According to experiments, it's generally worse. */ + if (m_vec_construct_store_only) + { + m_costs[0] = m_costs[1] = m_costs[2] = INT_MAX; + if (dump_enabled_p ()) + dump_printf_loc (MSG_NOTE, vect_location, + "Skip vectorization for stmts which only contains" + " vec_construct and vector store.\n"); + return; } + return; +} + +void +ix86_vector_costs::finish_cost (const vector_costs *scalar_costs) +{ + ix86_vect_estimate_reg_pressure (); + ix86_vect_cut_off (); + vector_costs::finish_cost (scalar_costs); } diff --git a/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-1.c b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-1.c index 992a845ad7a..f940af70b72 100644 --- a/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-1.c +++ b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-1.c @@ -12,4 +12,4 @@ foo (unsigned long *a, unsigned long *b) s.b = b_; } -/* { dg-final { scan-tree-dump "basic block part vectorized" "slp2" } } */ +/* { dg-final { scan-tree-dump-not "basic block part vectorized" "slp2" } } */ diff --git a/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-3.c b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-3.c index 999c4905708..eff60b2a82a 100644 --- a/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-3.c +++ b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-3.c @@ -10,4 +10,4 @@ foo (double a, double b) s.b = b; } -/* { dg-final { scan-tree-dump "basic block part vectorized" "slp2" } } */ +/* { dg-final { scan-tree-dump-not "basic block part vectorized" "slp2" } } */ diff --git a/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-4.c b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-4.c index cc471e1ed73..2f354338e96 100644 --- a/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-4.c +++ b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr104582-4.c @@ -12,4 +12,4 @@ foo (signed long *a, unsigned long *b) s.b = b_; } -/* { dg-final { scan-tree-dump "basic block part vectorized" "slp2" } } */ +/* { dg-final { scan-tree-dump-not "basic block part vectorized" "slp2" } } */ diff --git a/gcc/testsuite/gcc.target/i386/part-vect-absneghf.c b/gcc/testsuite/gcc.target/i386/part-vect-absneghf.c index 48aed14d604..4052210ec39 100644 --- a/gcc/testsuite/gcc.target/i386/part-vect-absneghf.c +++ b/gcc/testsuite/gcc.target/i386/part-vect-absneghf.c @@ -85,7 +85,7 @@ do_test (void) abort (); } -/* { dg-final { scan-tree-dump-times "vectorized using 8 byte vectors" 2 "slp2" } } */ -/* { dg-final { scan-tree-dump-times "vectorized using 4 byte vectors" 2 "slp2" } } */ +/* { dg-final { scan-tree-dump-times "vectorized using 8 byte vectors" 2 "slp2" { target { ! ia32 } } } } */ +/* { dg-final { scan-tree-dump-times "vectorized using 4 byte vectors" 2 "slp2" { target { ! ia32 } } } } */ /* { dg-final { scan-tree-dump-times {(?n)ABS_EXPR <vect} 2 "optimized" { target { ! ia32 } } } } */ /* { dg-final { scan-tree-dump-times {(?n)= -vect} 2 "optimized" { target { ! ia32 } } } } */ diff --git a/gcc/testsuite/gcc.target/i386/part-vect-copysignhf.c b/gcc/testsuite/gcc.target/i386/part-vect-copysignhf.c index 811617bc3dd..006941f6651 100644 --- a/gcc/testsuite/gcc.target/i386/part-vect-copysignhf.c +++ b/gcc/testsuite/gcc.target/i386/part-vect-copysignhf.c @@ -55,6 +55,6 @@ do_test (void) abort (); } -/* { dg-final { scan-tree-dump-times "vectorized using 8 byte vectors" 1 "slp2" } } */ -/* { dg-final { scan-tree-dump-times "vectorized using 4 byte vectors" 1 "slp2" } } */ +/* { dg-final { scan-tree-dump-times "vectorized using 8 byte vectors" 1 "slp2" { target { ! ia32 } } } } */ +/* { dg-final { scan-tree-dump-times "vectorized using 4 byte vectors" 1 "slp2" { target { ! ia32 } } } } */ /* { dg-final { scan-tree-dump-times ".COPYSIGN" 2 "optimized" { target { ! ia32 } } } } */ diff --git a/gcc/testsuite/gcc.target/i386/part-vect-xorsignhf.c b/gcc/testsuite/gcc.target/i386/part-vect-xorsignhf.c index a8ec60a088a..a57dc5aba23 100644 --- a/gcc/testsuite/gcc.target/i386/part-vect-xorsignhf.c +++ b/gcc/testsuite/gcc.target/i386/part-vect-xorsignhf.c @@ -55,6 +55,6 @@ do_test (void) abort (); } -/* { dg-final { scan-tree-dump-times "vectorized using 8 byte vectors" 1 "slp2" } } */ -/* { dg-final { scan-tree-dump-times "vectorized using 4 byte vectors" 1 "slp2" } } */ +/* { dg-final { scan-tree-dump-times "vectorized using 8 byte vectors" 1 "slp2" { target { ! ia32 } } } } */ +/* { dg-final { scan-tree-dump-times "vectorized using 4 byte vectors" 1 "slp2" { target { ! ia32 } } } } */ /* { dg-final { scan-tree-dump-times ".XORSIGN" 2 "optimized" { target { ! ia32 } } } } */ diff --git a/gcc/testsuite/gcc.target/i386/pr108938-3.c b/gcc/testsuite/gcc.target/i386/pr108938-3.c index 32ac544c7ed..24725a9ab1d 100644 --- a/gcc/testsuite/gcc.target/i386/pr108938-3.c +++ b/gcc/testsuite/gcc.target/i386/pr108938-3.c @@ -1,5 +1,5 @@ /* { dg-do compile } */ -/* { dg-options "-O2 -ftree-vectorize -mno-movbe" } */ +/* { dg-options "-O2 -ftree-vectorize -mno-movbe -fno-vect-cost-model" } */ /* { dg-final { scan-assembler-times "bswap\[\t ]+" 2 { target { ! ia32 } } } } */ /* { dg-final { scan-assembler-times "bswap\[\t ]+" 3 { target ia32 } } } */ diff --git a/gcc/testsuite/gcc.target/i386/pr111023-2.c b/gcc/testsuite/gcc.target/i386/pr111023-2.c index 6c69f947544..db25668a9ae 100644 --- a/gcc/testsuite/gcc.target/i386/pr111023-2.c +++ b/gcc/testsuite/gcc.target/i386/pr111023-2.c @@ -1,6 +1,6 @@ /* PR target/111023 */ /* { dg-do compile } */ -/* { dg-options "-O2 -mtune=icelake-server -ftree-vectorize -msse2 -mno-sse4.1" } */ +/* { dg-options "-O2 -mtune=icelake-server -ftree-vectorize -msse2 -mno-sse4.1 -fno-vect-cost-model" } */ typedef char v16qi __attribute__((vector_size (16))); typedef short v8hi __attribute__((vector_size (16))); diff --git a/gcc/testsuite/gcc.target/i386/pr111023.c b/gcc/testsuite/gcc.target/i386/pr111023.c index 6144c371f32..18cd579d937 100644 --- a/gcc/testsuite/gcc.target/i386/pr111023.c +++ b/gcc/testsuite/gcc.target/i386/pr111023.c @@ -1,6 +1,6 @@ /* PR target/111023 */ /* { dg-do compile } */ -/* { dg-options "-O2 -mtune=icelake-server -ftree-vectorize -msse2 -mno-sse4.1" } */ +/* { dg-options "-O2 -mtune=icelake-server -ftree-vectorize -msse2 -mno-sse4.1 -fno-vect-cost-model" } */ typedef unsigned char v16qi __attribute__((vector_size (16))); typedef unsigned short v8hi __attribute__((vector_size (16))); diff --git a/gcc/testsuite/gcc.target/i386/pr91446.c b/gcc/testsuite/gcc.target/i386/pr91446.c index 0243ca3ea68..3e0f8a4b315 100644 --- a/gcc/testsuite/gcc.target/i386/pr91446.c +++ b/gcc/testsuite/gcc.target/i386/pr91446.c @@ -10,15 +10,15 @@ typedef struct extern void bar (info *); void -foo (unsigned long long width, unsigned long long height, - long long x, long long y) +foo (unsigned long long* width, + long long* x) { info t; - t.width = width; - t.height = height; - t.x = x; - t.y = y; + t.width = width[0]; + t.height = width[1]; + t.x = x[0]; + t.y = x[1]; bar (&t); } -/* { dg-final { scan-assembler-times "vmovdqa\[^\n\r\]*xmm\[0-9\]" 2 } } */ +/* { dg-final { scan-assembler-times "vmovdq\[au\]\[^\n\r\]*xmm\[0-9\]" 4 } } */ diff --git a/gcc/testsuite/gcc.target/i386/pr99881.c b/gcc/testsuite/gcc.target/i386/pr99881.c index 3e087eb2ed7..1a59325ac1c 100644 --- a/gcc/testsuite/gcc.target/i386/pr99881.c +++ b/gcc/testsuite/gcc.target/i386/pr99881.c @@ -1,7 +1,7 @@ /* PR target/99881. */ /* { dg-do compile { target { ! ia32 } } } */ /* { dg-options "-Ofast -march=skylake" } */ -/* { dg-final { scan-assembler-not "xmm\[0-9\]" { xfail *-*-* } } } */ +/* { dg-final { scan-assembler-not "xmm\[0-9\]" } } */ void foo (int* __restrict a, int n, int c)