| Message ID | patch-18115-tamar@arm.com |
| --- | --- |
| State | New |
| Series | AArch64 Update costing for vector conversions [PR110625] |
Tamar Christina <tamar.christina@arm.com> writes:
> Hi All,
>
> In gimple the operation
>
> short _8;
> double _9;
> _9 = (double) _8;
>
> denotes two operations.  First we have to widen from short to long and
> then convert this integer to a double.

Think it's worth saying "two operations on AArch64".  Some targets can
do int->double directly.  Saying that would explain...

> Currently however we only count the widen/truncate operations:
>
> (double) _5 6 times vec_promote_demote costs 12 in body
> (double) _5 12 times vec_promote_demote costs 24 in body
>
> but not the actual conversion operation, which needs an additional 12
> instructions in the attached testcase.  Without this the attached
> testcase ends up incorrectly thinking that it's beneficial to vectorize
> the loop at a very high VF = 8 (4x unrolled).
>
> Because we can't change the mid-end to account for this, the costing
> code in the

...why we can't do this.

> backend now keeps track of whether the previous operation was a
> promotion/demotion and adjusts the expected number of instructions to:
>
> 1. If it's the first FLOAT_EXPR and the precision of the lhs and rhs
>    are different, double it, since we need to convert and promote.
> 2. If the previous operation was a demotion/promotion then reduce the
>    cost of the current operation by the amount we added extra in the
>    last.
>
> With the patch we get:
>
> (double) _5 6 times vec_promote_demote costs 24 in body
> (double) _5 12 times vec_promote_demote costs 36 in body
>
> which correctly accounts for 30 operations.
>
> This fixes the regression reported on Neoverse N2 and with the new
> generic Armv9-a cost model.
>
> Bootstrapped and regtested on aarch64-none-linux-gnu with no issues.
>
> Ok for master?
>
> Thanks,
> Tamar
>
> gcc/ChangeLog:
>
> 	PR target/110625
> 	* config/aarch64/aarch64.cc (aarch64_vector_costs::add_stmt_cost):
> 	Adjust throughput and latency calculations for vector conversions.
> 	(class aarch64_vector_costs): Add m_num_last_promote_demote.
>
> gcc/testsuite/ChangeLog:
>
> 	PR target/110625
> 	* gcc.target/aarch64/pr110625_4.c: New test.
> 	* gcc.target/aarch64/sve/unpack_fcvt_signed_1.c: Add
> 	--param aarch64-sve-compare-costs=0.
> 	* gcc.target/aarch64/sve/unpack_fcvt_unsigned_1.c: Likewise.
>
> --- inline copy of patch --
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index f9850320f61c5ddccf47e6583d304e5f405a484f..5622221413e52717974b96f79cc83008f237c536 100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -16077,6 +16077,15 @@ private:
>       leaving a vectorization of { elts }.  */
>    bool m_stores_to_vector_load_decl = false;
>
> +  /* Non-zero if the last operation we costed is a vector promotion
> +     or demotion.  In this case the value is the number of insn in the
> +     last operation.

s/insn/insns/

OK with those changes.  Thanks for tracking this down and working out
what was missing.

Richard

> +
> +     On AArch64 vector promotion and demotions require us to first
> +     widen or narrow the input and only after that emit conversion
> +     instructions.  For costing this means we need to emit the cost of
> +     the final conversions as well.  */
> +  unsigned int m_num_last_promote_demote = 0;
> +
>    /* - If M_VEC_FLAGS is zero then we're costing the original scalar code.
>       - If M_VEC_FLAGS & VEC_ADVSIMD is nonzero then we're costing Advanced
> 	 SIMD code.
> @@ -17132,6 +17141,29 @@ aarch64_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
>        stmt_cost = aarch64_sve_adjust_stmt_cost (m_vinfo, kind, stmt_info,
> 						  vectype, stmt_cost);
>
> +  /* Vector promotion and demotion requires us to widen the operation
> +     first and only after that perform the conversion.  Unfortunately
> +     the mid-end expects this to be doable as a single operation and
> +     doesn't pass on enough context here for us to tell which operation
> +     is happening.  To account for this we count every promote-demote
> +     operation twice and if the previously costed operation was also a
> +     promote-demote we reduce the cost of the currently being costed
> +     operation to simulate the final conversion cost.  Note that for
> +     SVE we can do better here if the converted value comes from a load
> +     since the widening load would consume the widening operations.
> +     However since we're in stage 3 we can't change the helper
> +     vect_is_extending_load and duplicating the code seems not
> +     useful.  */
> +  gassign *assign = NULL;
> +  if (kind == vec_promote_demote
> +      && (assign = dyn_cast <gassign *> (STMT_VINFO_STMT (stmt_info)))
> +      && gimple_assign_rhs_code (assign) == FLOAT_EXPR)
> +    {
> +      auto new_count = count * 2 - m_num_last_promote_demote;
> +      m_num_last_promote_demote = count;
> +      count = new_count;
> +    }
> +  else
> +    m_num_last_promote_demote = 0;
> +
>    if (stmt_info && aarch64_use_new_vector_costs_p ())
>      {
>        /* Account for any extra "embedded" costs that apply additively
> diff --git a/gcc/testsuite/gcc.target/aarch64/pr110625_4.c b/gcc/testsuite/gcc.target/aarch64/pr110625_4.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..34dac19d81a85d63706d54f4cb0c738ce592d5d7
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/pr110625_4.c
> @@ -0,0 +1,18 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O3 -mcpu=neoverse-n2 -fdump-tree-vect-details" } */
> +
> +typedef struct {
> +  short blue, green, red, opacity;
> +} Pixel;
> +
> +double foo (long n, double *k, Pixel *k_pixels) {
> +  double result_2, result_1, result_0;
> +  for (; n; n++, k--) {
> +    result_0 += *k * k_pixels[n].red;
> +    result_1 += *k * k_pixels[n].green;
> +    result_2 += *k * k_pixels[n].blue;
> +  }
> +  return result_0 + result_1 + result_2;
> +}
> +
> +/* { dg-final { scan-tree-dump-not "LOOP VECTORIZED" "vect" } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/unpack_fcvt_signed_1.c b/gcc/testsuite/gcc.target/aarch64/sve/unpack_fcvt_signed_1.c
> index 0f96dc2ff007340541c2ba7d51e1ccfa0f3f2d39..4c5e88657408f61156035012212ed542fac45efb 100644
> --- a/gcc/testsuite/gcc.target/aarch64/sve/unpack_fcvt_signed_1.c
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/unpack_fcvt_signed_1.c
> @@ -1,5 +1,5 @@
>  /* { dg-do compile } */
> -/* { dg-options "-O2 -ftree-vectorize -fno-inline" } */
> +/* { dg-options "-O2 -ftree-vectorize -fno-inline --param aarch64-sve-compare-costs=0" } */
>
>  #include <stdint.h>
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/unpack_fcvt_unsigned_1.c b/gcc/testsuite/gcc.target/aarch64/sve/unpack_fcvt_unsigned_1.c
> index 70465f91eba4f80140b2059481eb8f06bbc9ace7..3ff2bd127756b2ff08095513b09325db4779ba02 100644
> --- a/gcc/testsuite/gcc.target/aarch64/sve/unpack_fcvt_unsigned_1.c
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/unpack_fcvt_unsigned_1.c
> @@ -1,5 +1,5 @@
>  /* { dg-do compile } */
> -/* { dg-options "-O2 -ftree-vectorize" } */
> +/* { dg-options "-O2 -ftree-vectorize --param aarch64-sve-compare-costs=0" } */
>
>  #include <stdint.h>
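The pattern at the heart of the PR is easy to reproduce in isolation.  The
sketch below is illustrative only, not part of the patch: each short element
must be widened from short to long before it can be converted, so on AArch64
the vectorised loop emits widening steps (e.g. SXTL) followed by the actual
integer-to-double conversion (SCVTF), rather than a single promote/demote
instruction:

```c
/* Illustrative only -- not part of the patch.  Each short element is
   widened from short to long first; only then can the integer value be
   converted to double, which is the trailing conversion the patch now
   costs.  */
void
widen_and_convert (const short *src, double *dst, long n)
{
  for (long i = 0; i < n; i++)
    dst[i] = (double) src[i];	/* gimple: _9 = (double) _8 */
}
```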
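The adjustment arithmetic can also be checked standalone.  Below is a minimal
sketch of the double-counting scheme from the patch -- the helper name
`adjust_promote_demote_count` is hypothetical, not GCC code -- reproducing the
30 operations from the cover letter:

```c
#include <assert.h>

/* Hypothetical standalone model of the adjustment in
   aarch64_vector_costs::add_stmt_cost: each FLOAT_EXPR promote/demote
   is counted twice (widen + convert), minus the extra conversions
   already charged to the previous promote/demote.  */
static unsigned int
adjust_promote_demote_count (unsigned int count,
			     unsigned int *num_last_promote_demote)
{
  unsigned int new_count = count * 2 - *num_last_promote_demote;
  *num_last_promote_demote = count;
  return new_count;
}

int
main (void)
{
  unsigned int last = 0;

  /* Mirrors the vect dumps quoted above: the first group of 6
     statements is counted as 12 (widen + convert); the next group of
     12 as 12 * 2 - 6 = 18, since the extra added for the previous
     promote/demote is subtracted from the current one.  */
  unsigned int first = adjust_promote_demote_count (6, &last);
  unsigned int second = adjust_promote_demote_count (12, &last);
  assert (first == 12 && second == 18);
  assert (first + second == 30);  /* the 30 operations noted above */
  return 0;
}
```

With the per-statement cost of 2 seen in the dumps, the adjusted counts give
exactly the `costs 24` and `costs 36` lines shown after the patch.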