Message ID | ZqNqt64zgkm0/EFQ@arm.com
---|---
State | New
Series | [1/8] AArch64: Update Neoverse V2 cost model to release costs
Hi Tamar,

> On 26 Jul 2024, at 11:21, Tamar Christina <tamar.christina@arm.com> wrote:
>
> External email: Use caution opening links or attachments
>
> Hi All,
>
> Gathers and scatters are not usually beneficial when the loop count is small.
> This is because there's not only a cost to their execution within the loop but
> there is also some cost to enter loops with them.

That makes sense, and the benchmark numbers back it up, so I'm sympathetic to the idea.

> As such this patch models this overhead.  For generic tuning we however still
> prefer gathers/scatters when the loop costs work out.

I don't have a strong preference either way about the generic option, but I'm okay with it.

> Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
> This improves performance of Exchange in SPECCPU 2017 by 3% with SVE enabled.
>
> Ok for master?
>
> Thanks,
> Tamar
>
> gcc/ChangeLog:
>
>     * config/aarch64/aarch64-protos.h (struct sve_vec_cost): Add
>     gather_load_x32_init_cost and gather_load_x64_init_cost.
>     * config/aarch64/aarch64.cc (aarch64_vector_costs): Add
>     m_sve_gather_scatter_x32 and m_sve_gather_scatter_x64.
>     (aarch64_vector_costs::add_stmt_cost): Use them.
>     (aarch64_vector_costs::finish_cost): Likewise.
>     * config/aarch64/tuning_models/a64fx.h: Update.
>     * config/aarch64/tuning_models/cortexx925.h: Update.
>     * config/aarch64/tuning_models/generic.h: Update.
>     * config/aarch64/tuning_models/generic_armv8_a.h: Update.
>     * config/aarch64/tuning_models/generic_armv9_a.h: Update.
>     * config/aarch64/tuning_models/neoverse512tvb.h: Update.
>     * config/aarch64/tuning_models/neoversen2.h: Update.
>     * config/aarch64/tuning_models/neoversen3.h: Update.
>     * config/aarch64/tuning_models/neoversev1.h: Update.
>     * config/aarch64/tuning_models/neoversev2.h: Update.
>     * config/aarch64/tuning_models/neoversev3.h: Update.
>     * config/aarch64/tuning_models/neoversev3ae.h: Update.
>
> ---
> diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
> index 42639e9efcf1e0f9362f759ae63a31b8eeb0d581..16eb8edab4d9fdfc6e3672c56ef5c9f6962d0c0b 100644
> --- a/gcc/config/aarch64/aarch64-protos.h
> +++ b/gcc/config/aarch64/aarch64-protos.h
> @@ -262,6 +262,8 @@ struct sve_vec_cost : simd_vec_cost
>      unsigned int fadda_f64_cost,
>      unsigned int gather_load_x32_cost,
>      unsigned int gather_load_x64_cost,
> +    unsigned int gather_load_x32_init_cost,
> +    unsigned int gather_load_x64_init_cost,
>      unsigned int scatter_store_elt_cost)
>    : simd_vec_cost (base),
>      clast_cost (clast_cost),
> @@ -270,6 +272,8 @@ struct sve_vec_cost : simd_vec_cost
>      fadda_f64_cost (fadda_f64_cost),
>      gather_load_x32_cost (gather_load_x32_cost),
>      gather_load_x64_cost (gather_load_x64_cost),
> +    gather_load_x32_init_cost (gather_load_x32_init_cost),
> +    gather_load_x64_init_cost (gather_load_x64_init_cost),
>      scatter_store_elt_cost (scatter_store_elt_cost)
>    {}
>
> @@ -289,6 +293,12 @@ struct sve_vec_cost : simd_vec_cost
>    const int gather_load_x32_cost;
>    const int gather_load_x64_cost;
>
> +  /* Additional loop initialization cost of using a gather load instruction. The x32
> +     value is for loads of 32-bit elements and the x64 value is for loads of
> +     64-bit elements. */
> +  const int gather_load_x32_init_cost;
> +  const int gather_load_x64_init_cost;
> +
>    /* The per-element cost of a scatter store. */
>    const int scatter_store_elt_cost;
>  };
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index eafa377cb095f49408d8a926fb49ce13e2155ba2..1e14c3c0d24b449d404724e436ba57e1996ec062 100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -16227,6 +16227,12 @@ private:
>       supported by Advanced SIMD and SVE2. */
>    bool m_has_avg = false;
>
> +  /* This loop uses an SVE 32-bit element gather or scatter operation. */
> +  bool m_sve_gather_scatter_x32 = false;
> +
> +  /* This loop uses an SVE 64-bit element gather or scatter operation. */
> +  bool m_sve_gather_scatter_x64 = false;
> +
>    /* True if the vector body contains a store to a decl and if the
>       function is known to have a vld1 from the same decl.
>
> @@ -17291,6 +17297,17 @@ aarch64_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
>     stmt_cost = aarch64_detect_vector_stmt_subtype (m_vinfo, kind,
>                                                     stmt_info, vectype,
>                                                     where, stmt_cost);
> +
> +      /* Check if we've seen an SVE gather/scatter operation and which size. */
> +      if (kind == scalar_load
> +          && aarch64_sve_mode_p (TYPE_MODE (vectype))
> +          && STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_GATHER_SCATTER)
> +        {
> +          if (GET_MODE_UNIT_BITSIZE (TYPE_MODE (vectype)) == 64)
> +            m_sve_gather_scatter_x64 = true;
> +          else
> +            m_sve_gather_scatter_x32 = true;

This is a bit academic at this stage, but SVE2.1 adds quadword gather loads.  I know we're not vectorizing for those yet, but maybe it's worth explicitly checking for 32-bit size and gcc_unreachable () otherwise?

> +        }
>      }
>
>    /* Do any SVE-specific adjustments to the cost. */
> @@ -17676,6 +17693,18 @@ aarch64_vector_costs::finish_cost (const vector_costs *uncast_scalar_costs)
>        m_costs[vect_body] = adjust_body_cost (loop_vinfo, scalar_costs,
>                                               m_costs[vect_body]);
>        m_suggested_unroll_factor = determine_suggested_unroll_factor ();
> +
> +      /* For gather and scatters there's an additional overhead for the first
> +         iteration.  For low count loops they're not beneficial so model the
> +         overhead as loop prologue costs. */
> +      if (m_sve_gather_scatter_x32 || m_sve_gather_scatter_x64)
> +        {
> +          const sve_vec_cost *sve_costs = aarch64_tune_params.vec_costs->sve;
> +          if (m_sve_gather_scatter_x32)
> +            m_costs[vect_prologue] += sve_costs->gather_load_x32_init_cost;
> +          else
> +            m_costs[vect_prologue] += sve_costs->gather_load_x64_init_cost;

Shouldn't this be not an else but rather:

  if (m_sve_gather_scatter_x64)
    m_costs[vect_prologue] += sve_costs->gather_load_x64_init_cost;

in case the loop has both 32-bit and 64-bit gathers/scatters?

> +        }
>      }
>
>    /* Apply the heuristic described above m_stp_sequence_cost.
Prefer > diff --git a/gcc/config/aarch64/tuning_models/a64fx.h b/gcc/config/aarch64/tuning_models/a64fx.h > index 6091289d4c3c66f01d7e4dbf97a85c1f8c40bb0b..378a1b3889ee265859786c1ff6525fce2305b615 100644 > --- a/gcc/config/aarch64/tuning_models/a64fx.h > +++ b/gcc/config/aarch64/tuning_models/a64fx.h > @@ -104,6 +104,8 @@ static const sve_vec_cost a64fx_sve_vector_cost = > 13, /* fadda_f64_cost */ > 64, /* gather_load_x32_cost */ > 32, /* gather_load_x64_cost */ > + 0, /* gather_load_x32_init_cost */ > + 0, /* gather_load_x64_init_cost */ > 1 /* scatter_store_elt_cost */ > }; > > diff --git a/gcc/config/aarch64/tuning_models/cortexx925.h b/gcc/config/aarch64/tuning_models/cortexx925.h > index fb95e87526985b02410d54a5a3ec8539c1b0ba6d..c4206018a3ff707f89ff3300700ec7dc2a5bc6b0 100644 > --- a/gcc/config/aarch64/tuning_models/cortexx925.h > +++ b/gcc/config/aarch64/tuning_models/cortexx925.h > @@ -135,6 +135,8 @@ static const sve_vec_cost cortexx925_sve_vector_cost = > operation more than a 64-bit gather. */ > 14, /* gather_load_x32_cost */ > 12, /* gather_load_x64_cost */ > + 42, /* gather_load_x32_init_cost */ > + 24, /* gather_load_x64_init_cost */ Can you comment on how these numbers are derived? Thanks, Kyrill > 1 /* scatter_store_elt_cost */ > }; > > diff --git a/gcc/config/aarch64/tuning_models/generic.h b/gcc/config/aarch64/tuning_models/generic.h > index 2b1f68b3052117814161a32f426422736ad6462b..101969bdbb9ccf7eafbd9a1cd6e25f0b584fb261 100644 > --- a/gcc/config/aarch64/tuning_models/generic.h > +++ b/gcc/config/aarch64/tuning_models/generic.h > @@ -105,6 +105,8 @@ static const sve_vec_cost generic_sve_vector_cost = > 2, /* fadda_f64_cost */ > 4, /* gather_load_x32_cost */ > 2, /* gather_load_x64_cost */ > + 12, /* gather_load_x32_init_cost */ > + 4, /* gather_load_x64_init_cost */ > 1 /* scatter_store_elt_cost */ > }; > > diff --git a/gcc/config/aarch64/tuning_models/generic_armv8_a.h b/gcc/config/aarch64/tuning_models/generic_armv8_a.h > index b38b9a8c5cad7d12aa38afdb610a14a25e755010..b5088afe068aa4be7f9dd614cfdd2a51fa96e524 100644 > --- a/gcc/config/aarch64/tuning_models/generic_armv8_a.h > +++ b/gcc/config/aarch64/tuning_models/generic_armv8_a.h > @@ -106,6 +106,8 @@ static const sve_vec_cost generic_armv8_a_sve_vector_cost = > 2, /* fadda_f64_cost */ > 4, /* gather_load_x32_cost */ > 2, /* gather_load_x64_cost */ > + 12, /* gather_load_x32_init_cost */ > + 4, /* gather_load_x64_init_cost */ > 1 /* scatter_store_elt_cost */ > }; > > diff --git a/gcc/config/aarch64/tuning_models/generic_armv9_a.h b/gcc/config/aarch64/tuning_models/generic_armv9_a.h > index b39a0c73db910888168790888d24ddf4406bf1ee..fd72de542862909ccb9a9260a16bb01935d97f36 100644 > --- a/gcc/config/aarch64/tuning_models/generic_armv9_a.h > +++ b/gcc/config/aarch64/tuning_models/generic_armv9_a.h > @@ -136,6 +136,8 @@ static const sve_vec_cost generic_armv9_a_sve_vector_cost = > operation more than a 64-bit gather. 
*/ > 14, /* gather_load_x32_cost */ > 12, /* gather_load_x64_cost */ > + 42, /* gather_load_x32_init_cost */ > + 24, /* gather_load_x64_init_cost */ > 3 /* scatter_store_elt_cost */ > }; > > diff --git a/gcc/config/aarch64/tuning_models/neoverse512tvb.h b/gcc/config/aarch64/tuning_models/neoverse512tvb.h > index 825c6a64990b72cda3641737957dc94d75db1509..d2a0b647791de8fca6d7684849d2ab1e9104b045 100644 > --- a/gcc/config/aarch64/tuning_models/neoverse512tvb.h > +++ b/gcc/config/aarch64/tuning_models/neoverse512tvb.h > @@ -79,6 +79,8 @@ static const sve_vec_cost neoverse512tvb_sve_vector_cost = > operation more than a 64-bit gather. */ > 14, /* gather_load_x32_cost */ > 12, /* gather_load_x64_cost */ > + 42, /* gather_load_x32_init_cost */ > + 24, /* gather_load_x64_init_cost */ > 3 /* scatter_store_elt_cost */ > }; > > diff --git a/gcc/config/aarch64/tuning_models/neoversen2.h b/gcc/config/aarch64/tuning_models/neoversen2.h > index 3430eb9c06819e00ab38966bb960bd6525ff2b5c..00d2c12e739ffd371dd4720826894e980d577ca7 100644 > --- a/gcc/config/aarch64/tuning_models/neoversen2.h > +++ b/gcc/config/aarch64/tuning_models/neoversen2.h > @@ -135,6 +135,8 @@ static const sve_vec_cost neoversen2_sve_vector_cost = > operation more than a 64-bit gather. */ > 14, /* gather_load_x32_cost */ > 12, /* gather_load_x64_cost */ > + 42, /* gather_load_x32_init_cost */ > + 24, /* gather_load_x64_init_cost */ > 3 /* scatter_store_elt_cost */ > }; > > diff --git a/gcc/config/aarch64/tuning_models/neoversen3.h b/gcc/config/aarch64/tuning_models/neoversen3.h > index 7438e39a4bbe43de624b63fdd20d3fde9dfb6fc9..fc4333ffdeaef0115ac162e2da9d8d548bacf576 100644 > --- a/gcc/config/aarch64/tuning_models/neoversen3.h > +++ b/gcc/config/aarch64/tuning_models/neoversen3.h > @@ -135,6 +135,8 @@ static const sve_vec_cost neoversen3_sve_vector_cost = > operation more than a 64-bit gather. */ > 14, /* gather_load_x32_cost */ > 12, /* gather_load_x64_cost */ > + 42, /* gather_load_x32_init_cost */ > + 24, /* gather_load_x64_init_cost */ > 1 /* scatter_store_elt_cost */ > }; > > diff --git a/gcc/config/aarch64/tuning_models/neoversev1.h b/gcc/config/aarch64/tuning_models/neoversev1.h > index 0fc41ce6a41b3135fa06d2bda1f517fdf4f8dbcf..705ed025730f6683109a4796c6eefa55b437cec9 100644 > --- a/gcc/config/aarch64/tuning_models/neoversev1.h > +++ b/gcc/config/aarch64/tuning_models/neoversev1.h > @@ -126,6 +126,8 @@ static const sve_vec_cost neoversev1_sve_vector_cost = > 8, /* fadda_f64_cost */ > 32, /* gather_load_x32_cost */ > 16, /* gather_load_x64_cost */ > + 96, /* gather_load_x32_init_cost */ > + 32, /* gather_load_x64_init_cost */ > 3 /* scatter_store_elt_cost */ > }; > > diff --git a/gcc/config/aarch64/tuning_models/neoversev2.h b/gcc/config/aarch64/tuning_models/neoversev2.h > index cca459e32c1384f57f8345d86b42b7814ae44115..680feeb9e4ee7bf21d5a258d83e522e079fdc156 100644 > --- a/gcc/config/aarch64/tuning_models/neoversev2.h > +++ b/gcc/config/aarch64/tuning_models/neoversev2.h > @@ -135,6 +135,8 @@ static const sve_vec_cost neoversev2_sve_vector_cost = > operation more than a 64-bit gather. 
*/ > 14, /* gather_load_x32_cost */ > 12, /* gather_load_x64_cost */ > + 42, /* gather_load_x32_init_cost */ > + 24, /* gather_load_x64_init_cost */ > 3 /* scatter_store_elt_cost */ > }; > > diff --git a/gcc/config/aarch64/tuning_models/neoversev3.h b/gcc/config/aarch64/tuning_models/neoversev3.h > index 3daa3d2365c817d03c6c0d5e66fe832620d8fb2c..812c6ad304e8d4c503dcd444437bf6528d6f3176 100644 > --- a/gcc/config/aarch64/tuning_models/neoversev3.h > +++ b/gcc/config/aarch64/tuning_models/neoversev3.h > @@ -135,6 +135,8 @@ static const sve_vec_cost neoversev3_sve_vector_cost = > operation more than a 64-bit gather. */ > 14, /* gather_load_x32_cost */ > 12, /* gather_load_x64_cost */ > + 42, /* gather_load_x32_init_cost */ > + 24, /* gather_load_x64_init_cost */ > 1 /* scatter_store_elt_cost */ > }; > > diff --git a/gcc/config/aarch64/tuning_models/neoversev3ae.h b/gcc/config/aarch64/tuning_models/neoversev3ae.h > index 29c6f22e941b26ee333c87b9fac22aea86625e97..280b5abb27d3c9f404d5f96f14d0cba1e13b9bd1 100644 > --- a/gcc/config/aarch64/tuning_models/neoversev3ae.h > +++ b/gcc/config/aarch64/tuning_models/neoversev3ae.h > @@ -135,6 +135,8 @@ static const sve_vec_cost neoversev3ae_sve_vector_cost = > operation more than a 64-bit gather. */ > 14, /* gather_load_x32_cost */ > 12, /* gather_load_x64_cost */ > + 42, /* gather_load_x32_init_cost */ > + 24, /* gather_load_x64_init_cost */ > 1 /* scatter_store_elt_cost */ > }; > > > > > > -- > <rb18671.patch>
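For readers skimming the thread, the kind of loop being costed looks like the following.  This is an illustrative sketch, not a test taken from the patch; the function name, arrays, and exact codegen are assumptions:

  /* Illustrative only: with SVE enabled the indirect load below is a
     candidate for a 32-bit element gather (a vector of offsets feeding
     one gather load).  The patch charges a one-off prologue cost for
     entering such a loop, so very short trip counts may stay scalar.  */
  void
  gather_f32 (float *restrict dst, const float *src, const int *idx, int n)
  {
    for (int i = 0; i < n; i++)
      dst[i] = src[idx[i]];
  }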
Hi Kyrill,

> > /* True if the vector body contains a store to a decl and if the
> >    function is known to have a vld1 from the same decl.
> >
> > @@ -17291,6 +17297,17 @@ aarch64_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
> >     stmt_cost = aarch64_detect_vector_stmt_subtype (m_vinfo, kind,
> >                                                     stmt_info, vectype,
> >                                                     where, stmt_cost);
> > +
> > +      /* Check if we've seen an SVE gather/scatter operation and which size. */
> > +      if (kind == scalar_load
> > +          && aarch64_sve_mode_p (TYPE_MODE (vectype))
> > +          && STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_GATHER_SCATTER)
> > +        {
> > +          if (GET_MODE_UNIT_BITSIZE (TYPE_MODE (vectype)) == 64)
> > +            m_sve_gather_scatter_x64 = true;
> > +          else
> > +            m_sve_gather_scatter_x32 = true;
>
> This is a bit academic at this stage, but SVE2.1 adds quadword gather loads.  I know
> we're not vectorizing for those yet, but maybe it's worth explicitly checking for
> 32-bit size and gcc_unreachable () otherwise?

To be honest I'm not quite sure how to detect it.  Is it just GET_MODE_UNIT_BITSIZE () == 128?
But do we want an assert in the cost model?  Happy to do so, though maybe a debug print is
more appropriate?  i.e. make it a missed optimization?

> > +        }
> >      }
> >
> >    /* Do any SVE-specific adjustments to the cost. */
> > @@ -17676,6 +17693,18 @@ aarch64_vector_costs::finish_cost (const vector_costs *uncast_scalar_costs)
> >        m_costs[vect_body] = adjust_body_cost (loop_vinfo, scalar_costs,
> >                                               m_costs[vect_body]);
> >        m_suggested_unroll_factor = determine_suggested_unroll_factor ();
> > +
> > +      /* For gather and scatters there's an additional overhead for the first
> > +         iteration.  For low count loops they're not beneficial so model the
> > +         overhead as loop prologue costs. */
> > +      if (m_sve_gather_scatter_x32 || m_sve_gather_scatter_x64)
> > +        {
> > +          const sve_vec_cost *sve_costs = aarch64_tune_params.vec_costs->sve;
> > +          if (m_sve_gather_scatter_x32)
> > +            m_costs[vect_prologue] += sve_costs->gather_load_x32_init_cost;
> > +          else
> > +            m_costs[vect_prologue] += sve_costs->gather_load_x64_init_cost;
>
> Shouldn't this be not an else but rather:
>
>   if (m_sve_gather_scatter_x64)
>     m_costs[vect_prologue] += sve_costs->gather_load_x64_init_cost;
>
> in case the loop has both 32-bit and 64-bit gathers/scatters?

This was an interesting comment.  After some discussion and more benchmarking
we've changed it to be an additive cost.

> > +        }
> >      }
> >
> >    /* Apply the heuristic described above m_stp_sequence_cost.  Prefer
> > diff --git a/gcc/config/aarch64/tuning_models/a64fx.h b/gcc/config/aarch64/tuning_models/a64fx.h
> > index 6091289d4c3c66f01d7e4dbf97a85c1f8c40bb0b..378a1b3889ee265859786c1ff6525fce2305b615 100644
> > --- a/gcc/config/aarch64/tuning_models/a64fx.h
> > +++ b/gcc/config/aarch64/tuning_models/a64fx.h
> > @@ -104,6 +104,8 @@ static const sve_vec_cost a64fx_sve_vector_cost =
> >    13, /* fadda_f64_cost */
> >    64, /* gather_load_x32_cost */
> >    32, /* gather_load_x64_cost */
> > +  0, /* gather_load_x32_init_cost */
> > +  0, /* gather_load_x64_init_cost */
> >    1 /* scatter_store_elt_cost */
> >  };
> >
> > diff --git a/gcc/config/aarch64/tuning_models/cortexx925.h b/gcc/config/aarch64/tuning_models/cortexx925.h
> > index fb95e87526985b02410d54a5a3ec8539c1b0ba6d..c4206018a3ff707f89ff3300700ec7dc2a5bc6b0 100644
> > --- a/gcc/config/aarch64/tuning_models/cortexx925.h
> > +++ b/gcc/config/aarch64/tuning_models/cortexx925.h
> > @@ -135,6 +135,8 @@ static const sve_vec_cost cortexx925_sve_vector_cost =
> >       operation more than a 64-bit gather. */
> >    14, /* gather_load_x32_cost */
> >    12, /* gather_load_x64_cost */
> > +  42, /* gather_load_x32_init_cost */
> > +  24, /* gather_load_x64_init_cost */
>
> Can you comment on how these numbers are derived?

They were derived essentially from benchmarking.  I did a bunch of runs over various cores
to determine at which iteration count they become profitable.  From that, as you can
probably tell, the costs are a multiple of the cost of the operations for the specific core.
This is because that cost already keeps in mind things like VL differences.

Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

	* config/aarch64/aarch64-protos.h (struct sve_vec_cost): Add
	gather_load_x32_init_cost and gather_load_x64_init_cost.
	* config/aarch64/aarch64.cc (aarch64_vector_costs): Add
	m_sve_gather_scatter_init_cost.
	(aarch64_vector_costs::add_stmt_cost): Use them.
	(aarch64_vector_costs::finish_cost): Likewise.
	* config/aarch64/tuning_models/a64fx.h: Update.
	* config/aarch64/tuning_models/cortexx925.h: Update.
	* config/aarch64/tuning_models/generic.h: Update.
	* config/aarch64/tuning_models/generic_armv8_a.h: Update.
	* config/aarch64/tuning_models/generic_armv9_a.h: Update.
	* config/aarch64/tuning_models/neoverse512tvb.h: Update.
	* config/aarch64/tuning_models/neoversen2.h: Update.
	* config/aarch64/tuning_models/neoversen3.h: Update.
	* config/aarch64/tuning_models/neoversev1.h: Update.
	* config/aarch64/tuning_models/neoversev2.h: Update.
	* config/aarch64/tuning_models/neoversev3.h: Update.
	* config/aarch64/tuning_models/neoversev3ae.h: Update.
-- inline copy of patch -- diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h index 42639e9efcf1e0f9362f759ae63a31b8eeb0d581..16eb8edab4d9fdfc6e3672c56ef5c9f6962d0c0b 100644 --- a/gcc/config/aarch64/aarch64-protos.h +++ b/gcc/config/aarch64/aarch64-protos.h @@ -262,6 +262,8 @@ struct sve_vec_cost : simd_vec_cost unsigned int fadda_f64_cost, unsigned int gather_load_x32_cost, unsigned int gather_load_x64_cost, + unsigned int gather_load_x32_init_cost, + unsigned int gather_load_x64_init_cost, unsigned int scatter_store_elt_cost) : simd_vec_cost (base), clast_cost (clast_cost), @@ -270,6 +272,8 @@ struct sve_vec_cost : simd_vec_cost fadda_f64_cost (fadda_f64_cost), gather_load_x32_cost (gather_load_x32_cost), gather_load_x64_cost (gather_load_x64_cost), + gather_load_x32_init_cost (gather_load_x32_init_cost), + gather_load_x64_init_cost (gather_load_x64_init_cost), scatter_store_elt_cost (scatter_store_elt_cost) {} @@ -289,6 +293,12 @@ struct sve_vec_cost : simd_vec_cost const int gather_load_x32_cost; const int gather_load_x64_cost; + /* Additional loop initialization cost of using a gather load instruction. The x32 + value is for loads of 32-bit elements and the x64 value is for loads of + 64-bit elements. */ + const int gather_load_x32_init_cost; + const int gather_load_x64_init_cost; + /* The per-element cost of a scatter store. */ const int scatter_store_elt_cost; }; diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc index eafa377cb095f49408d8a926fb49ce13e2155ba2..da2feb54ddad9b39db92e0a9ec7c4e40cfa3e4e2 100644 --- a/gcc/config/aarch64/aarch64.cc +++ b/gcc/config/aarch64/aarch64.cc @@ -16227,6 +16227,10 @@ private: supported by Advanced SIMD and SVE2. */ bool m_has_avg = false; + /* Additional initialization costs for using gather or scatter operation in + the current loop. */ + unsigned int m_sve_gather_scatter_init_cost = 0; + /* True if the vector body contains a store to a decl and if the function is known to have a vld1 from the same decl. @@ -17291,6 +17295,20 @@ aarch64_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind, stmt_cost = aarch64_detect_vector_stmt_subtype (m_vinfo, kind, stmt_info, vectype, where, stmt_cost); + + /* Check if we've seen an SVE gather/scatter operation and which size. */ + if (kind == scalar_load + && aarch64_sve_mode_p (TYPE_MODE (vectype)) + && STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_GATHER_SCATTER) + { + const sve_vec_cost *sve_costs = aarch64_tune_params.vec_costs->sve; + if (GET_MODE_UNIT_BITSIZE (TYPE_MODE (vectype)) == 64) + m_sve_gather_scatter_init_cost + += sve_costs->gather_load_x64_init_cost; + else + m_sve_gather_scatter_init_cost + += sve_costs->gather_load_x32_init_cost; + } } /* Do any SVE-specific adjustments to the cost. */ @@ -17676,6 +17694,12 @@ aarch64_vector_costs::finish_cost (const vector_costs *uncast_scalar_costs) m_costs[vect_body] = adjust_body_cost (loop_vinfo, scalar_costs, m_costs[vect_body]); m_suggested_unroll_factor = determine_suggested_unroll_factor (); + + /* For gather and scatters there's an additional overhead for the first + iteration. For low count loops they're not beneficial so model the + overhead as loop prologue costs. */ + if (m_sve_gather_scatter_init_cost) + m_costs[vect_prologue] += m_sve_gather_scatter_init_cost; } /* Apply the heuristic described above m_stp_sequence_cost. 
Prefer diff --git a/gcc/config/aarch64/tuning_models/a64fx.h b/gcc/config/aarch64/tuning_models/a64fx.h index 6091289d4c3c66f01d7e4dbf97a85c1f8c40bb0b..378a1b3889ee265859786c1ff6525fce2305b615 100644 --- a/gcc/config/aarch64/tuning_models/a64fx.h +++ b/gcc/config/aarch64/tuning_models/a64fx.h @@ -104,6 +104,8 @@ static const sve_vec_cost a64fx_sve_vector_cost = 13, /* fadda_f64_cost */ 64, /* gather_load_x32_cost */ 32, /* gather_load_x64_cost */ + 0, /* gather_load_x32_init_cost */ + 0, /* gather_load_x64_init_cost */ 1 /* scatter_store_elt_cost */ }; diff --git a/gcc/config/aarch64/tuning_models/cortexx925.h b/gcc/config/aarch64/tuning_models/cortexx925.h index 6cae5b7de5ca7ffad8a0f683e1285039bb55d159..b509cae758419a415d9067ec751ef1e6528eb09a 100644 --- a/gcc/config/aarch64/tuning_models/cortexx925.h +++ b/gcc/config/aarch64/tuning_models/cortexx925.h @@ -135,6 +135,8 @@ static const sve_vec_cost cortexx925_sve_vector_cost = operation more than a 64-bit gather. */ 14, /* gather_load_x32_cost */ 12, /* gather_load_x64_cost */ + 42, /* gather_load_x32_init_cost */ + 24, /* gather_load_x64_init_cost */ 1 /* scatter_store_elt_cost */ }; diff --git a/gcc/config/aarch64/tuning_models/generic.h b/gcc/config/aarch64/tuning_models/generic.h index 2b1f68b3052117814161a32f426422736ad6462b..101969bdbb9ccf7eafbd9a1cd6e25f0b584fb261 100644 --- a/gcc/config/aarch64/tuning_models/generic.h +++ b/gcc/config/aarch64/tuning_models/generic.h @@ -105,6 +105,8 @@ static const sve_vec_cost generic_sve_vector_cost = 2, /* fadda_f64_cost */ 4, /* gather_load_x32_cost */ 2, /* gather_load_x64_cost */ + 12, /* gather_load_x32_init_cost */ + 4, /* gather_load_x64_init_cost */ 1 /* scatter_store_elt_cost */ }; diff --git a/gcc/config/aarch64/tuning_models/generic_armv8_a.h b/gcc/config/aarch64/tuning_models/generic_armv8_a.h index b38b9a8c5cad7d12aa38afdb610a14a25e755010..b5088afe068aa4be7f9dd614cfdd2a51fa96e524 100644 --- a/gcc/config/aarch64/tuning_models/generic_armv8_a.h +++ b/gcc/config/aarch64/tuning_models/generic_armv8_a.h @@ -106,6 +106,8 @@ static const sve_vec_cost generic_armv8_a_sve_vector_cost = 2, /* fadda_f64_cost */ 4, /* gather_load_x32_cost */ 2, /* gather_load_x64_cost */ + 12, /* gather_load_x32_init_cost */ + 4, /* gather_load_x64_init_cost */ 1 /* scatter_store_elt_cost */ }; diff --git a/gcc/config/aarch64/tuning_models/generic_armv9_a.h b/gcc/config/aarch64/tuning_models/generic_armv9_a.h index 7156dbe5787e831bc4343deb7d7b88e9823fc1bc..999985ed40f694f2681779d940bdb282f289b8e3 100644 --- a/gcc/config/aarch64/tuning_models/generic_armv9_a.h +++ b/gcc/config/aarch64/tuning_models/generic_armv9_a.h @@ -136,6 +136,8 @@ static const sve_vec_cost generic_armv9_a_sve_vector_cost = operation more than a 64-bit gather. */ 14, /* gather_load_x32_cost */ 12, /* gather_load_x64_cost */ + 42, /* gather_load_x32_init_cost */ + 24, /* gather_load_x64_init_cost */ 3 /* scatter_store_elt_cost */ }; diff --git a/gcc/config/aarch64/tuning_models/neoverse512tvb.h b/gcc/config/aarch64/tuning_models/neoverse512tvb.h index 825c6a64990b72cda3641737957dc94d75db1509..d2a0b647791de8fca6d7684849d2ab1e9104b045 100644 --- a/gcc/config/aarch64/tuning_models/neoverse512tvb.h +++ b/gcc/config/aarch64/tuning_models/neoverse512tvb.h @@ -79,6 +79,8 @@ static const sve_vec_cost neoverse512tvb_sve_vector_cost = operation more than a 64-bit gather. 
*/ 14, /* gather_load_x32_cost */ 12, /* gather_load_x64_cost */ + 42, /* gather_load_x32_init_cost */ + 24, /* gather_load_x64_init_cost */ 3 /* scatter_store_elt_cost */ }; diff --git a/gcc/config/aarch64/tuning_models/neoversen2.h b/gcc/config/aarch64/tuning_models/neoversen2.h index d41e714aa045266ecae62a36ed02dfbfb7597c3a..1a5b66901b5c3fb78f87fee40236957139644585 100644 --- a/gcc/config/aarch64/tuning_models/neoversen2.h +++ b/gcc/config/aarch64/tuning_models/neoversen2.h @@ -135,6 +135,8 @@ static const sve_vec_cost neoversen2_sve_vector_cost = operation more than a 64-bit gather. */ 14, /* gather_load_x32_cost */ 12, /* gather_load_x64_cost */ + 42, /* gather_load_x32_init_cost */ + 24, /* gather_load_x64_init_cost */ 3 /* scatter_store_elt_cost */ }; diff --git a/gcc/config/aarch64/tuning_models/neoversen3.h b/gcc/config/aarch64/tuning_models/neoversen3.h index 36f770c0a14fc127c75a60cd37048d46c3b069c7..cfd5060e6b64a0433de41b03cde886da119d9a1c 100644 --- a/gcc/config/aarch64/tuning_models/neoversen3.h +++ b/gcc/config/aarch64/tuning_models/neoversen3.h @@ -135,6 +135,8 @@ static const sve_vec_cost neoversen3_sve_vector_cost = operation more than a 64-bit gather. */ 14, /* gather_load_x32_cost */ 12, /* gather_load_x64_cost */ + 42, /* gather_load_x32_init_cost */ + 24, /* gather_load_x64_init_cost */ 1 /* scatter_store_elt_cost */ }; diff --git a/gcc/config/aarch64/tuning_models/neoversev1.h b/gcc/config/aarch64/tuning_models/neoversev1.h index 0fc41ce6a41b3135fa06d2bda1f517fdf4f8dbcf..705ed025730f6683109a4796c6eefa55b437cec9 100644 --- a/gcc/config/aarch64/tuning_models/neoversev1.h +++ b/gcc/config/aarch64/tuning_models/neoversev1.h @@ -126,6 +126,8 @@ static const sve_vec_cost neoversev1_sve_vector_cost = 8, /* fadda_f64_cost */ 32, /* gather_load_x32_cost */ 16, /* gather_load_x64_cost */ + 96, /* gather_load_x32_init_cost */ + 32, /* gather_load_x64_init_cost */ 3 /* scatter_store_elt_cost */ }; diff --git a/gcc/config/aarch64/tuning_models/neoversev2.h b/gcc/config/aarch64/tuning_models/neoversev2.h index c9c3019dd01a98bc20a76e8455fb59ff24a9ff6c..47908636b0f4c3eadd5848b590fd079c1c04aa10 100644 --- a/gcc/config/aarch64/tuning_models/neoversev2.h +++ b/gcc/config/aarch64/tuning_models/neoversev2.h @@ -135,6 +135,8 @@ static const sve_vec_cost neoversev2_sve_vector_cost = operation more than a 64-bit gather. */ 14, /* gather_load_x32_cost */ 12, /* gather_load_x64_cost */ + 42, /* gather_load_x32_init_cost */ + 24, /* gather_load_x64_init_cost */ 3 /* scatter_store_elt_cost */ }; diff --git a/gcc/config/aarch64/tuning_models/neoversev3.h b/gcc/config/aarch64/tuning_models/neoversev3.h index c602d067c7116cf6b081caeae8d36f9969e06d8d..c91e8c829532f9236de0102770e5c6b94e83da9a 100644 --- a/gcc/config/aarch64/tuning_models/neoversev3.h +++ b/gcc/config/aarch64/tuning_models/neoversev3.h @@ -135,6 +135,8 @@ static const sve_vec_cost neoversev3_sve_vector_cost = operation more than a 64-bit gather. 
*/ 14, /* gather_load_x32_cost */ 12, /* gather_load_x64_cost */ + 42, /* gather_load_x32_init_cost */ + 24, /* gather_load_x64_init_cost */ 1 /* scatter_store_elt_cost */ }; diff --git a/gcc/config/aarch64/tuning_models/neoversev3ae.h b/gcc/config/aarch64/tuning_models/neoversev3ae.h index 96d7ccf03cd96056d09676d908c63a25e3da6765..61e439326eb6f983abf8574e657cfbb0c2f9bb33 100644 --- a/gcc/config/aarch64/tuning_models/neoversev3ae.h +++ b/gcc/config/aarch64/tuning_models/neoversev3ae.h @@ -135,6 +135,8 @@ static const sve_vec_cost neoversev3ae_sve_vector_cost = operation more than a 64-bit gather. */ 14, /* gather_load_x32_cost */ 12, /* gather_load_x64_cost */ + 42, /* gather_load_x32_init_cost */ + 24, /* gather_load_x64_init_cost */ 1 /* scatter_store_elt_cost */ };
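To illustrate why v2 accumulates the two init costs rather than picking one: a single loop can contain both a 32-bit and a 64-bit element gather, and with the additive scheme the prologue is charged gather_load_x32_init_cost plus gather_load_x64_init_cost.  A hypothetical example (the function and data layout are invented for illustration):

  /* Hypothetical: one loop with both gather widths.  Under the v2
     additive accounting, both init costs land in the vector prologue.  */
  void
  mixed_gathers (float *restrict a, double *restrict b,
                 const float *xs, const double *ys, const int *idx, int n)
  {
    for (int i = 0; i < n; i++)
      {
        a[i] = xs[idx[i]];   /* 32-bit element gather.  */
        b[i] = ys[idx[i]];   /* 64-bit element gather.  */
      }
  }

On the derivation question, the tables themselves read as simple multiples of the per-gather body costs: 42 = 3 × 14 and 24 = 2 × 12 for the Neoverse-class entries, 12 = 3 × 4 and 4 = 2 × 2 for generic, and 96 = 3 × 32 and 32 = 2 × 16 for Neoverse V1, consistent with the "multiple of the cost of the operations" description above.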
Tamar Christina <Tamar.Christina@arm.com> writes:
> @@ -289,6 +293,12 @@ struct sve_vec_cost : simd_vec_cost
>    const int gather_load_x32_cost;
>    const int gather_load_x64_cost;
>
> +  /* Additional loop initialization cost of using a gather load instruction. The x32

Sorry for the trivia, but: long line.

> +     value is for loads of 32-bit elements and the x64 value is for loads of
> +     64-bit elements. */
> +  const int gather_load_x32_init_cost;
> +  const int gather_load_x64_init_cost;
> +
>    /* The per-element cost of a scatter store. */
>    const int scatter_store_elt_cost;
>  };
> [...]
> @@ -17291,6 +17295,20 @@ aarch64_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
>     stmt_cost = aarch64_detect_vector_stmt_subtype (m_vinfo, kind,
>                                                     stmt_info, vectype,
>                                                     where, stmt_cost);
> +
> +      /* Check if we've seen an SVE gather/scatter operation and which size. */
> +      if (kind == scalar_load
> +          && aarch64_sve_mode_p (TYPE_MODE (vectype))
> +          && STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_GATHER_SCATTER)
> +        {
> +          const sve_vec_cost *sve_costs = aarch64_tune_params.vec_costs->sve;

I think we need to check whether this is nonnull, since not all tuning
targets provide SVE costs.

> +          if (GET_MODE_UNIT_BITSIZE (TYPE_MODE (vectype)) == 64)
> +            m_sve_gather_scatter_init_cost
> +              += sve_costs->gather_load_x64_init_cost;
> +          else
> +            m_sve_gather_scatter_init_cost
> +              += sve_costs->gather_load_x32_init_cost;
> +        }
>      }
>
>    /* Do any SVE-specific adjustments to the cost. */
> @@ -17676,6 +17694,12 @@ aarch64_vector_costs::finish_cost (const vector_costs *uncast_scalar_costs)
>        m_costs[vect_body] = adjust_body_cost (loop_vinfo, scalar_costs,
>                                               m_costs[vect_body]);
>        m_suggested_unroll_factor = determine_suggested_unroll_factor ();
> +
> +  /* For gather and scatters there's an additional overhead for the first
> +     iteration.  For low count loops they're not beneficial so model the
> +     overhead as loop prologue costs. */
> +  if (m_sve_gather_scatter_init_cost)
> +    m_costs[vect_prologue] += m_sve_gather_scatter_init_cost;

Might as well make this unconditional now.

LGTM with those changes, but please wait for Kyrill's review too.

Thanks,
Richard

> }
>
> /* Apply the heuristic described above m_stp_sequence_cost.  Prefer
> diff --git a/gcc/config/aarch64/tuning_models/a64fx.h b/gcc/config/aarch64/tuning_models/a64fx.h
> index 6091289d4c3c66f01d7e4dbf97a85c1f8c40bb0b..378a1b3889ee265859786c1ff6525fce2305b615 100644
> --- a/gcc/config/aarch64/tuning_models/a64fx.h
> +++ b/gcc/config/aarch64/tuning_models/a64fx.h
> @@ -104,6 +104,8 @@ static const sve_vec_cost a64fx_sve_vector_cost =
>    13, /* fadda_f64_cost */
>    64, /* gather_load_x32_cost */
>    32, /* gather_load_x64_cost */
> +  0, /* gather_load_x32_init_cost */
> +  0, /* gather_load_x64_init_cost */
>    1 /* scatter_store_elt_cost */
>  };
>
> diff --git a/gcc/config/aarch64/tuning_models/cortexx925.h b/gcc/config/aarch64/tuning_models/cortexx925.h
> index 6cae5b7de5ca7ffad8a0f683e1285039bb55d159..b509cae758419a415d9067ec751ef1e6528eb09a 100644
> --- a/gcc/config/aarch64/tuning_models/cortexx925.h
> +++ b/gcc/config/aarch64/tuning_models/cortexx925.h
> @@ -135,6 +135,8 @@ static const sve_vec_cost cortexx925_sve_vector_cost =
>       operation more than a 64-bit gather.
*/ > 14, /* gather_load_x32_cost */ > 12, /* gather_load_x64_cost */ > + 42, /* gather_load_x32_init_cost */ > + 24, /* gather_load_x64_init_cost */ > 1 /* scatter_store_elt_cost */ > }; > > diff --git a/gcc/config/aarch64/tuning_models/generic.h b/gcc/config/aarch64/tuning_models/generic.h > index 2b1f68b3052117814161a32f426422736ad6462b..101969bdbb9ccf7eafbd9a1cd6e25f0b584fb261 100644 > --- a/gcc/config/aarch64/tuning_models/generic.h > +++ b/gcc/config/aarch64/tuning_models/generic.h > @@ -105,6 +105,8 @@ static const sve_vec_cost generic_sve_vector_cost = > 2, /* fadda_f64_cost */ > 4, /* gather_load_x32_cost */ > 2, /* gather_load_x64_cost */ > + 12, /* gather_load_x32_init_cost */ > + 4, /* gather_load_x64_init_cost */ > 1 /* scatter_store_elt_cost */ > }; > > diff --git a/gcc/config/aarch64/tuning_models/generic_armv8_a.h b/gcc/config/aarch64/tuning_models/generic_armv8_a.h > index b38b9a8c5cad7d12aa38afdb610a14a25e755010..b5088afe068aa4be7f9dd614cfdd2a51fa96e524 100644 > --- a/gcc/config/aarch64/tuning_models/generic_armv8_a.h > +++ b/gcc/config/aarch64/tuning_models/generic_armv8_a.h > @@ -106,6 +106,8 @@ static const sve_vec_cost generic_armv8_a_sve_vector_cost = > 2, /* fadda_f64_cost */ > 4, /* gather_load_x32_cost */ > 2, /* gather_load_x64_cost */ > + 12, /* gather_load_x32_init_cost */ > + 4, /* gather_load_x64_init_cost */ > 1 /* scatter_store_elt_cost */ > }; > > diff --git a/gcc/config/aarch64/tuning_models/generic_armv9_a.h b/gcc/config/aarch64/tuning_models/generic_armv9_a.h > index 7156dbe5787e831bc4343deb7d7b88e9823fc1bc..999985ed40f694f2681779d940bdb282f289b8e3 100644 > --- a/gcc/config/aarch64/tuning_models/generic_armv9_a.h > +++ b/gcc/config/aarch64/tuning_models/generic_armv9_a.h > @@ -136,6 +136,8 @@ static const sve_vec_cost generic_armv9_a_sve_vector_cost = > operation more than a 64-bit gather. */ > 14, /* gather_load_x32_cost */ > 12, /* gather_load_x64_cost */ > + 42, /* gather_load_x32_init_cost */ > + 24, /* gather_load_x64_init_cost */ > 3 /* scatter_store_elt_cost */ > }; > > diff --git a/gcc/config/aarch64/tuning_models/neoverse512tvb.h b/gcc/config/aarch64/tuning_models/neoverse512tvb.h > index 825c6a64990b72cda3641737957dc94d75db1509..d2a0b647791de8fca6d7684849d2ab1e9104b045 100644 > --- a/gcc/config/aarch64/tuning_models/neoverse512tvb.h > +++ b/gcc/config/aarch64/tuning_models/neoverse512tvb.h > @@ -79,6 +79,8 @@ static const sve_vec_cost neoverse512tvb_sve_vector_cost = > operation more than a 64-bit gather. */ > 14, /* gather_load_x32_cost */ > 12, /* gather_load_x64_cost */ > + 42, /* gather_load_x32_init_cost */ > + 24, /* gather_load_x64_init_cost */ > 3 /* scatter_store_elt_cost */ > }; > > diff --git a/gcc/config/aarch64/tuning_models/neoversen2.h b/gcc/config/aarch64/tuning_models/neoversen2.h > index d41e714aa045266ecae62a36ed02dfbfb7597c3a..1a5b66901b5c3fb78f87fee40236957139644585 100644 > --- a/gcc/config/aarch64/tuning_models/neoversen2.h > +++ b/gcc/config/aarch64/tuning_models/neoversen2.h > @@ -135,6 +135,8 @@ static const sve_vec_cost neoversen2_sve_vector_cost = > operation more than a 64-bit gather. 
*/ > 14, /* gather_load_x32_cost */ > 12, /* gather_load_x64_cost */ > + 42, /* gather_load_x32_init_cost */ > + 24, /* gather_load_x64_init_cost */ > 3 /* scatter_store_elt_cost */ > }; > > diff --git a/gcc/config/aarch64/tuning_models/neoversen3.h b/gcc/config/aarch64/tuning_models/neoversen3.h > index 36f770c0a14fc127c75a60cd37048d46c3b069c7..cfd5060e6b64a0433de41b03cde886da119d9a1c 100644 > --- a/gcc/config/aarch64/tuning_models/neoversen3.h > +++ b/gcc/config/aarch64/tuning_models/neoversen3.h > @@ -135,6 +135,8 @@ static const sve_vec_cost neoversen3_sve_vector_cost = > operation more than a 64-bit gather. */ > 14, /* gather_load_x32_cost */ > 12, /* gather_load_x64_cost */ > + 42, /* gather_load_x32_init_cost */ > + 24, /* gather_load_x64_init_cost */ > 1 /* scatter_store_elt_cost */ > }; > > diff --git a/gcc/config/aarch64/tuning_models/neoversev1.h b/gcc/config/aarch64/tuning_models/neoversev1.h > index 0fc41ce6a41b3135fa06d2bda1f517fdf4f8dbcf..705ed025730f6683109a4796c6eefa55b437cec9 100644 > --- a/gcc/config/aarch64/tuning_models/neoversev1.h > +++ b/gcc/config/aarch64/tuning_models/neoversev1.h > @@ -126,6 +126,8 @@ static const sve_vec_cost neoversev1_sve_vector_cost = > 8, /* fadda_f64_cost */ > 32, /* gather_load_x32_cost */ > 16, /* gather_load_x64_cost */ > + 96, /* gather_load_x32_init_cost */ > + 32, /* gather_load_x64_init_cost */ > 3 /* scatter_store_elt_cost */ > }; > > diff --git a/gcc/config/aarch64/tuning_models/neoversev2.h b/gcc/config/aarch64/tuning_models/neoversev2.h > index c9c3019dd01a98bc20a76e8455fb59ff24a9ff6c..47908636b0f4c3eadd5848b590fd079c1c04aa10 100644 > --- a/gcc/config/aarch64/tuning_models/neoversev2.h > +++ b/gcc/config/aarch64/tuning_models/neoversev2.h > @@ -135,6 +135,8 @@ static const sve_vec_cost neoversev2_sve_vector_cost = > operation more than a 64-bit gather. */ > 14, /* gather_load_x32_cost */ > 12, /* gather_load_x64_cost */ > + 42, /* gather_load_x32_init_cost */ > + 24, /* gather_load_x64_init_cost */ > 3 /* scatter_store_elt_cost */ > }; > > diff --git a/gcc/config/aarch64/tuning_models/neoversev3.h b/gcc/config/aarch64/tuning_models/neoversev3.h > index c602d067c7116cf6b081caeae8d36f9969e06d8d..c91e8c829532f9236de0102770e5c6b94e83da9a 100644 > --- a/gcc/config/aarch64/tuning_models/neoversev3.h > +++ b/gcc/config/aarch64/tuning_models/neoversev3.h > @@ -135,6 +135,8 @@ static const sve_vec_cost neoversev3_sve_vector_cost = > operation more than a 64-bit gather. */ > 14, /* gather_load_x32_cost */ > 12, /* gather_load_x64_cost */ > + 42, /* gather_load_x32_init_cost */ > + 24, /* gather_load_x64_init_cost */ > 1 /* scatter_store_elt_cost */ > }; > > diff --git a/gcc/config/aarch64/tuning_models/neoversev3ae.h b/gcc/config/aarch64/tuning_models/neoversev3ae.h > index 96d7ccf03cd96056d09676d908c63a25e3da6765..61e439326eb6f983abf8574e657cfbb0c2f9bb33 100644 > --- a/gcc/config/aarch64/tuning_models/neoversev3ae.h > +++ b/gcc/config/aarch64/tuning_models/neoversev3ae.h > @@ -135,6 +135,8 @@ static const sve_vec_cost neoversev3ae_sve_vector_cost = > operation more than a 64-bit gather. */ > 14, /* gather_load_x32_cost */ > 12, /* gather_load_x64_cost */ > + 42, /* gather_load_x32_init_cost */ > + 24, /* gather_load_x64_init_cost */ > 1 /* scatter_store_elt_cost */ > };
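A minimal sketch of the nonnull check Richard asks for, layered on the v2 hunk quoted above; the exact committed form may differ:

  /* Sketch only: skip the init-cost accounting when the tuning target
     provides no SVE cost table.  */
  if (kind == scalar_load
      && aarch64_sve_mode_p (TYPE_MODE (vectype))
      && STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_GATHER_SCATTER)
    {
      const sve_vec_cost *sve_costs = aarch64_tune_params.vec_costs->sve;
      if (sve_costs)
        {
          if (GET_MODE_UNIT_BITSIZE (TYPE_MODE (vectype)) == 64)
            m_sve_gather_scatter_init_cost
              += sve_costs->gather_load_x64_init_cost;
          else
            m_sve_gather_scatter_init_cost
              += sve_costs->gather_load_x32_init_cost;
        }
    }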
Hi Tamar,

> On 31 Jul 2024, at 18:46, Tamar Christina <Tamar.Christina@arm.com> wrote:
>
> External email: Use caution opening links or attachments
>
> Hi Kyrill,
>
>>> /* True if the vector body contains a store to a decl and if the
>>>    function is known to have a vld1 from the same decl.
>>>
>>> @@ -17291,6 +17297,17 @@ aarch64_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
>>>     stmt_cost = aarch64_detect_vector_stmt_subtype (m_vinfo, kind,
>>>                                                     stmt_info, vectype,
>>>                                                     where, stmt_cost);
>>> +
>>> +      /* Check if we've seen an SVE gather/scatter operation and which size. */
>>> +      if (kind == scalar_load
>>> +          && aarch64_sve_mode_p (TYPE_MODE (vectype))
>>> +          && STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_GATHER_SCATTER)
>>> +        {
>>> +          if (GET_MODE_UNIT_BITSIZE (TYPE_MODE (vectype)) == 64)
>>> +            m_sve_gather_scatter_x64 = true;
>>> +          else
>>> +            m_sve_gather_scatter_x32 = true;
>>
>> This is a bit academic at this stage, but SVE2.1 adds quadword gather loads.  I know
>> we're not vectorizing for those yet, but maybe it's worth explicitly checking for
>> 32-bit size and gcc_unreachable () otherwise?
>
> To be honest I'm not quite sure how to detect it.  Is it just GET_MODE_UNIT_BITSIZE () == 128?
> But do we want an assert in the cost model?  Happy to do so, though maybe a debug print is
> more appropriate?  i.e. make it a missed optimization?

I don't feel strongly about it.  I'd drop my comment here.  If and when we add
vectorization for 128-bit chunks I'm sure we'll be revisiting this cost model anyway.

>>> +        }
>>>      }
>>>
>>>    /* Do any SVE-specific adjustments to the cost. */
>>> @@ -17676,6 +17693,18 @@ aarch64_vector_costs::finish_cost (const vector_costs *uncast_scalar_costs)
>>>        m_costs[vect_body] = adjust_body_cost (loop_vinfo, scalar_costs,
>>>                                               m_costs[vect_body]);
>>>        m_suggested_unroll_factor = determine_suggested_unroll_factor ();
>>> +
>>> +      /* For gather and scatters there's an additional overhead for the first
>>> +         iteration.  For low count loops they're not beneficial so model the
>>> +         overhead as loop prologue costs. */
>>> +      if (m_sve_gather_scatter_x32 || m_sve_gather_scatter_x64)
>>> +        {
>>> +          const sve_vec_cost *sve_costs = aarch64_tune_params.vec_costs->sve;
>>> +          if (m_sve_gather_scatter_x32)
>>> +            m_costs[vect_prologue] += sve_costs->gather_load_x32_init_cost;
>>> +          else
>>> +            m_costs[vect_prologue] += sve_costs->gather_load_x64_init_cost;
>>
>> Shouldn't this be not an else but rather:
>>
>>   if (m_sve_gather_scatter_x64)
>>     m_costs[vect_prologue] += sve_costs->gather_load_x64_init_cost;
>>
>> in case the loop has both 32-bit and 64-bit gathers/scatters?
>
> This was an interesting comment.  After some discussion and more benchmarking
> we've changed it to be an additive cost.

Ok.

>>> +        }
>>>      }
>>>
>>>    /* Apply the heuristic described above m_stp_sequence_cost.
Prefer >>> diff --git a/gcc/config/aarch64/tuning_models/a64fx.h >> b/gcc/config/aarch64/tuning_models/a64fx.h >>> index >> 6091289d4c3c66f01d7e4dbf97a85c1f8c40bb0b..378a1b3889ee265859786c1 >> ff6525fce2305b615 100644 >>> --- a/gcc/config/aarch64/tuning_models/a64fx.h >>> +++ b/gcc/config/aarch64/tuning_models/a64fx.h >>> @@ -104,6 +104,8 @@ static const sve_vec_cost a64fx_sve_vector_cost = >>> 13, /* fadda_f64_cost */ >>> 64, /* gather_load_x32_cost */ >>> 32, /* gather_load_x64_cost */ >>> + 0, /* gather_load_x32_init_cost */ >>> + 0, /* gather_load_x64_init_cost */ >>> 1 /* scatter_store_elt_cost */ >>> }; >>> >>> diff --git a/gcc/config/aarch64/tuning_models/cortexx925.h >> b/gcc/config/aarch64/tuning_models/cortexx925.h >>> index >> fb95e87526985b02410d54a5a3ec8539c1b0ba6d..c4206018a3ff707f89ff33007 >> 00ec7dc2a5bc6b0 100644 >>> --- a/gcc/config/aarch64/tuning_models/cortexx925.h >>> +++ b/gcc/config/aarch64/tuning_models/cortexx925.h >>> @@ -135,6 +135,8 @@ static const sve_vec_cost cortexx925_sve_vector_cost = >>> operation more than a 64-bit gather. */ >>> 14, /* gather_load_x32_cost */ >>> 12, /* gather_load_x64_cost */ >>> + 42, /* gather_load_x32_init_cost */ >>> + 24, /* gather_load_x64_init_cost */ >> >> >> Can you comment on how these numbers are derived? > > They were derived essentially from benchmarking. I did a bunch of runs over various cores > to determine at which iteration count they become profitable. From that as you can > probably tell the costs are a multiple of the cost of the operations for the specific core. > > This because that cost already keeps in mind things like VL differences. > > Bootstrapped Regtested on aarch64-none-linux-gnu and no issues. > > Ok for master? Ok with Richard’s comments addressed. Thanks, Kyrill > > Thanks, > Tamar > > gcc/ChangeLog: > > * config/aarch64/aarch64-protos.h (struct sve_vec_cost): Add > gather_load_x32_init_cost and gather_load_x64_init_cost. > * config/aarch64/aarch64.cc (aarch64_vector_costs): Add > m_sve_gather_scatter_init_cost. > (aarch64_vector_costs::add_stmt_cost): Use them. > (aarch64_vector_costs::finish_cost): Likewise. > * config/aarch64/tuning_models/a64fx.h: Update. > * config/aarch64/tuning_models/cortexx925.h: Update. > * config/aarch64/tuning_models/generic.h: Update. > * config/aarch64/tuning_models/generic_armv8_a.h: Update. > * config/aarch64/tuning_models/generic_armv9_a.h: Update. > * config/aarch64/tuning_models/neoverse512tvb.h: Update. > * config/aarch64/tuning_models/neoversen2.h: Update. > * config/aarch64/tuning_models/neoversen3.h: Update. > * config/aarch64/tuning_models/neoversev1.h: Update. > * config/aarch64/tuning_models/neoversev2.h: Update. > * config/aarch64/tuning_models/neoversev3.h: Update. > * config/aarch64/tuning_models/neoversev3ae.h: Update. 
> > -- inline copy of patch -- > > diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h > index 42639e9efcf1e0f9362f759ae63a31b8eeb0d581..16eb8edab4d9fdfc6e3672c56ef5c9f6962d0c0b 100644 > --- a/gcc/config/aarch64/aarch64-protos.h > +++ b/gcc/config/aarch64/aarch64-protos.h > @@ -262,6 +262,8 @@ struct sve_vec_cost : simd_vec_cost > unsigned int fadda_f64_cost, > unsigned int gather_load_x32_cost, > unsigned int gather_load_x64_cost, > + unsigned int gather_load_x32_init_cost, > + unsigned int gather_load_x64_init_cost, > unsigned int scatter_store_elt_cost) > : simd_vec_cost (base), > clast_cost (clast_cost), > @@ -270,6 +272,8 @@ struct sve_vec_cost : simd_vec_cost > fadda_f64_cost (fadda_f64_cost), > gather_load_x32_cost (gather_load_x32_cost), > gather_load_x64_cost (gather_load_x64_cost), > + gather_load_x32_init_cost (gather_load_x32_init_cost), > + gather_load_x64_init_cost (gather_load_x64_init_cost), > scatter_store_elt_cost (scatter_store_elt_cost) > {} > > @@ -289,6 +293,12 @@ struct sve_vec_cost : simd_vec_cost > const int gather_load_x32_cost; > const int gather_load_x64_cost; > > + /* Additional loop initialization cost of using a gather load instruction. The x32 > + value is for loads of 32-bit elements and the x64 value is for loads of > + 64-bit elements. */ > + const int gather_load_x32_init_cost; > + const int gather_load_x64_init_cost; > + > /* The per-element cost of a scatter store. */ > const int scatter_store_elt_cost; > }; > diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc > index eafa377cb095f49408d8a926fb49ce13e2155ba2..da2feb54ddad9b39db92e0a9ec7c4e40cfa3e4e2 100644 > --- a/gcc/config/aarch64/aarch64.cc > +++ b/gcc/config/aarch64/aarch64.cc > @@ -16227,6 +16227,10 @@ private: > supported by Advanced SIMD and SVE2. */ > bool m_has_avg = false; > > + /* Additional initialization costs for using gather or scatter operation in > + the current loop. */ > + unsigned int m_sve_gather_scatter_init_cost = 0; > + > /* True if the vector body contains a store to a decl and if the > function is known to have a vld1 from the same decl. > > @@ -17291,6 +17295,20 @@ aarch64_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind, > stmt_cost = aarch64_detect_vector_stmt_subtype (m_vinfo, kind, > stmt_info, vectype, > where, stmt_cost); > + > + /* Check if we've seen an SVE gather/scatter operation and which size. */ > + if (kind == scalar_load > + && aarch64_sve_mode_p (TYPE_MODE (vectype)) > + && STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_GATHER_SCATTER) > + { > + const sve_vec_cost *sve_costs = aarch64_tune_params.vec_costs->sve; > + if (GET_MODE_UNIT_BITSIZE (TYPE_MODE (vectype)) == 64) > + m_sve_gather_scatter_init_cost > + += sve_costs->gather_load_x64_init_cost; > + else > + m_sve_gather_scatter_init_cost > + += sve_costs->gather_load_x32_init_cost; > + } > } > > /* Do any SVE-specific adjustments to the cost. */ > @@ -17676,6 +17694,12 @@ aarch64_vector_costs::finish_cost (const vector_costs *uncast_scalar_costs) > m_costs[vect_body] = adjust_body_cost (loop_vinfo, scalar_costs, > m_costs[vect_body]); > m_suggested_unroll_factor = determine_suggested_unroll_factor (); > + > + /* For gather and scatters there's an additional overhead for the first > + iteration. For low count loops they're not beneficial so model the > + overhead as loop prologue costs. 
*/ > + if (m_sve_gather_scatter_init_cost) > + m_costs[vect_prologue] += m_sve_gather_scatter_init_cost; > } > > /* Apply the heuristic described above m_stp_sequence_cost. Prefer > diff --git a/gcc/config/aarch64/tuning_models/a64fx.h b/gcc/config/aarch64/tuning_models/a64fx.h > index 6091289d4c3c66f01d7e4dbf97a85c1f8c40bb0b..378a1b3889ee265859786c1ff6525fce2305b615 100644 > --- a/gcc/config/aarch64/tuning_models/a64fx.h > +++ b/gcc/config/aarch64/tuning_models/a64fx.h > @@ -104,6 +104,8 @@ static const sve_vec_cost a64fx_sve_vector_cost = > 13, /* fadda_f64_cost */ > 64, /* gather_load_x32_cost */ > 32, /* gather_load_x64_cost */ > + 0, /* gather_load_x32_init_cost */ > + 0, /* gather_load_x64_init_cost */ > 1 /* scatter_store_elt_cost */ > }; > > diff --git a/gcc/config/aarch64/tuning_models/cortexx925.h b/gcc/config/aarch64/tuning_models/cortexx925.h > index 6cae5b7de5ca7ffad8a0f683e1285039bb55d159..b509cae758419a415d9067ec751ef1e6528eb09a 100644 > --- a/gcc/config/aarch64/tuning_models/cortexx925.h > +++ b/gcc/config/aarch64/tuning_models/cortexx925.h > @@ -135,6 +135,8 @@ static const sve_vec_cost cortexx925_sve_vector_cost = > operation more than a 64-bit gather. */ > 14, /* gather_load_x32_cost */ > 12, /* gather_load_x64_cost */ > + 42, /* gather_load_x32_init_cost */ > + 24, /* gather_load_x64_init_cost */ > 1 /* scatter_store_elt_cost */ > }; > > diff --git a/gcc/config/aarch64/tuning_models/generic.h b/gcc/config/aarch64/tuning_models/generic.h > index 2b1f68b3052117814161a32f426422736ad6462b..101969bdbb9ccf7eafbd9a1cd6e25f0b584fb261 100644 > --- a/gcc/config/aarch64/tuning_models/generic.h > +++ b/gcc/config/aarch64/tuning_models/generic.h > @@ -105,6 +105,8 @@ static const sve_vec_cost generic_sve_vector_cost = > 2, /* fadda_f64_cost */ > 4, /* gather_load_x32_cost */ > 2, /* gather_load_x64_cost */ > + 12, /* gather_load_x32_init_cost */ > + 4, /* gather_load_x64_init_cost */ > 1 /* scatter_store_elt_cost */ > }; > > diff --git a/gcc/config/aarch64/tuning_models/generic_armv8_a.h b/gcc/config/aarch64/tuning_models/generic_armv8_a.h > index b38b9a8c5cad7d12aa38afdb610a14a25e755010..b5088afe068aa4be7f9dd614cfdd2a51fa96e524 100644 > --- a/gcc/config/aarch64/tuning_models/generic_armv8_a.h > +++ b/gcc/config/aarch64/tuning_models/generic_armv8_a.h > @@ -106,6 +106,8 @@ static const sve_vec_cost generic_armv8_a_sve_vector_cost = > 2, /* fadda_f64_cost */ > 4, /* gather_load_x32_cost */ > 2, /* gather_load_x64_cost */ > + 12, /* gather_load_x32_init_cost */ > + 4, /* gather_load_x64_init_cost */ > 1 /* scatter_store_elt_cost */ > }; > > diff --git a/gcc/config/aarch64/tuning_models/generic_armv9_a.h b/gcc/config/aarch64/tuning_models/generic_armv9_a.h > index 7156dbe5787e831bc4343deb7d7b88e9823fc1bc..999985ed40f694f2681779d940bdb282f289b8e3 100644 > --- a/gcc/config/aarch64/tuning_models/generic_armv9_a.h > +++ b/gcc/config/aarch64/tuning_models/generic_armv9_a.h > @@ -136,6 +136,8 @@ static const sve_vec_cost generic_armv9_a_sve_vector_cost = > operation more than a 64-bit gather. 
*/ > 14, /* gather_load_x32_cost */ > 12, /* gather_load_x64_cost */ > + 42, /* gather_load_x32_init_cost */ > + 24, /* gather_load_x64_init_cost */ > 3 /* scatter_store_elt_cost */ > }; > > diff --git a/gcc/config/aarch64/tuning_models/neoverse512tvb.h b/gcc/config/aarch64/tuning_models/neoverse512tvb.h > index 825c6a64990b72cda3641737957dc94d75db1509..d2a0b647791de8fca6d7684849d2ab1e9104b045 100644 > --- a/gcc/config/aarch64/tuning_models/neoverse512tvb.h > +++ b/gcc/config/aarch64/tuning_models/neoverse512tvb.h > @@ -79,6 +79,8 @@ static const sve_vec_cost neoverse512tvb_sve_vector_cost = > operation more than a 64-bit gather. */ > 14, /* gather_load_x32_cost */ > 12, /* gather_load_x64_cost */ > + 42, /* gather_load_x32_init_cost */ > + 24, /* gather_load_x64_init_cost */ > 3 /* scatter_store_elt_cost */ > }; > > diff --git a/gcc/config/aarch64/tuning_models/neoversen2.h b/gcc/config/aarch64/tuning_models/neoversen2.h > index d41e714aa045266ecae62a36ed02dfbfb7597c3a..1a5b66901b5c3fb78f87fee40236957139644585 100644 > --- a/gcc/config/aarch64/tuning_models/neoversen2.h > +++ b/gcc/config/aarch64/tuning_models/neoversen2.h > @@ -135,6 +135,8 @@ static const sve_vec_cost neoversen2_sve_vector_cost = > operation more than a 64-bit gather. */ > 14, /* gather_load_x32_cost */ > 12, /* gather_load_x64_cost */ > + 42, /* gather_load_x32_init_cost */ > + 24, /* gather_load_x64_init_cost */ > 3 /* scatter_store_elt_cost */ > }; > > diff --git a/gcc/config/aarch64/tuning_models/neoversen3.h b/gcc/config/aarch64/tuning_models/neoversen3.h > index 36f770c0a14fc127c75a60cd37048d46c3b069c7..cfd5060e6b64a0433de41b03cde886da119d9a1c 100644 > --- a/gcc/config/aarch64/tuning_models/neoversen3.h > +++ b/gcc/config/aarch64/tuning_models/neoversen3.h > @@ -135,6 +135,8 @@ static const sve_vec_cost neoversen3_sve_vector_cost = > operation more than a 64-bit gather. */ > 14, /* gather_load_x32_cost */ > 12, /* gather_load_x64_cost */ > + 42, /* gather_load_x32_init_cost */ > + 24, /* gather_load_x64_init_cost */ > 1 /* scatter_store_elt_cost */ > }; > > diff --git a/gcc/config/aarch64/tuning_models/neoversev1.h b/gcc/config/aarch64/tuning_models/neoversev1.h > index 0fc41ce6a41b3135fa06d2bda1f517fdf4f8dbcf..705ed025730f6683109a4796c6eefa55b437cec9 100644 > --- a/gcc/config/aarch64/tuning_models/neoversev1.h > +++ b/gcc/config/aarch64/tuning_models/neoversev1.h > @@ -126,6 +126,8 @@ static const sve_vec_cost neoversev1_sve_vector_cost = > 8, /* fadda_f64_cost */ > 32, /* gather_load_x32_cost */ > 16, /* gather_load_x64_cost */ > + 96, /* gather_load_x32_init_cost */ > + 32, /* gather_load_x64_init_cost */ > 3 /* scatter_store_elt_cost */ > }; > > diff --git a/gcc/config/aarch64/tuning_models/neoversev2.h b/gcc/config/aarch64/tuning_models/neoversev2.h > index c9c3019dd01a98bc20a76e8455fb59ff24a9ff6c..47908636b0f4c3eadd5848b590fd079c1c04aa10 100644 > --- a/gcc/config/aarch64/tuning_models/neoversev2.h > +++ b/gcc/config/aarch64/tuning_models/neoversev2.h > @@ -135,6 +135,8 @@ static const sve_vec_cost neoversev2_sve_vector_cost = > operation more than a 64-bit gather. 
*/ > 14, /* gather_load_x32_cost */ > 12, /* gather_load_x64_cost */ > + 42, /* gather_load_x32_init_cost */ > + 24, /* gather_load_x64_init_cost */ > 3 /* scatter_store_elt_cost */ > }; > > diff --git a/gcc/config/aarch64/tuning_models/neoversev3.h b/gcc/config/aarch64/tuning_models/neoversev3.h > index c602d067c7116cf6b081caeae8d36f9969e06d8d..c91e8c829532f9236de0102770e5c6b94e83da9a 100644 > --- a/gcc/config/aarch64/tuning_models/neoversev3.h > +++ b/gcc/config/aarch64/tuning_models/neoversev3.h > @@ -135,6 +135,8 @@ static const sve_vec_cost neoversev3_sve_vector_cost = > operation more than a 64-bit gather. */ > 14, /* gather_load_x32_cost */ > 12, /* gather_load_x64_cost */ > + 42, /* gather_load_x32_init_cost */ > + 24, /* gather_load_x64_init_cost */ > 1 /* scatter_store_elt_cost */ > }; > > diff --git a/gcc/config/aarch64/tuning_models/neoversev3ae.h b/gcc/config/aarch64/tuning_models/neoversev3ae.h > index 96d7ccf03cd96056d09676d908c63a25e3da6765..61e439326eb6f983abf8574e657cfbb0c2f9bb33 100644 > --- a/gcc/config/aarch64/tuning_models/neoversev3ae.h > +++ b/gcc/config/aarch64/tuning_models/neoversev3ae.h > @@ -135,6 +135,8 @@ static const sve_vec_cost neoversev3ae_sve_vector_cost = > operation more than a 64-bit gather. */ > 14, /* gather_load_x32_cost */ > 12, /* gather_load_x64_cost */ > + 42, /* gather_load_x32_init_cost */ > + 24, /* gather_load_x64_init_cost */ > 1 /* scatter_store_elt_cost */ > }; > > <rb18671.patch>
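Although the quadword question was dropped above, for completeness, one hypothetical shape the dispatch could take if SVE2.1 128-bit element gathers were ever costed here; the 128-bit arm and its handling are invented, not part of the patch:

  /* Hypothetical: an explicit element-size dispatch.  A real 128-bit
     bucket would need a new tuning field; absent that, leave it uncosted.  */
  switch (GET_MODE_UNIT_BITSIZE (TYPE_MODE (vectype)))
    {
    case 32:
      m_sve_gather_scatter_init_cost += sve_costs->gather_load_x32_init_cost;
      break;
    case 64:
      m_sve_gather_scatter_init_cost += sve_costs->gather_load_x64_init_cost;
      break;
    default:
      /* e.g. SVE2.1 quadword gathers: no init-cost data yet.  */
      break;
    }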
> -----Original Message-----
> From: Richard Sandiford <richard.sandiford@arm.com>
> Sent: Wednesday, July 31, 2024 7:17 PM
> To: Tamar Christina <Tamar.Christina@arm.com>
> Cc: Kyrylo Tkachov <ktkachov@nvidia.com>; gcc-patches@gcc.gnu.org; nd
> <nd@arm.com>; Richard Earnshaw <Richard.Earnshaw@arm.com>; Marcus
> Shawcroft <Marcus.Shawcroft@arm.com>; ktkachov@gcc.gnu.org
> Subject: Re: [PATCH 8/8]AArch64: take gather/scatter decode overhead into
> account
>
> Tamar Christina <Tamar.Christina@arm.com> writes:
> > @@ -289,6 +293,12 @@ struct sve_vec_cost : simd_vec_cost
> >    const int gather_load_x32_cost;
> >    const int gather_load_x64_cost;
> >
> > +  /* Additional loop initialization cost of using a gather load instruction. The x32
>
> Sorry for the trivia, but: long line.

Yeah, noticed it after I sent out the patch 😊

> > +     value is for loads of 32-bit elements and the x64 value is for loads of
> > +     64-bit elements. */
> > +  const int gather_load_x32_init_cost;
> > +  const int gather_load_x64_init_cost;
> > +
> >    /* The per-element cost of a scatter store. */
> >    const int scatter_store_elt_cost;
> >  };
> > [...]
> > @@ -17291,6 +17295,20 @@ aarch64_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
> >     stmt_cost = aarch64_detect_vector_stmt_subtype (m_vinfo, kind,
> >                                                     stmt_info, vectype,
> >                                                     where, stmt_cost);
> > +
> > +      /* Check if we've seen an SVE gather/scatter operation and which size. */
> > +      if (kind == scalar_load
> > +          && aarch64_sve_mode_p (TYPE_MODE (vectype))
> > +          && STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_GATHER_SCATTER)
> > +        {
> > +          const sve_vec_cost *sve_costs = aarch64_tune_params.vec_costs->sve;
>
> I think we need to check whether this is nonnull, since not all tuning
> targets provide SVE costs.

Will do, but I thought that since this was in a block that checked
aarch64_use_new_vector_costs_p () the SVE costs were required.
What does that predicate mean then?

Cheers,
Tamar

> > +          if (GET_MODE_UNIT_BITSIZE (TYPE_MODE (vectype)) == 64)
> > +            m_sve_gather_scatter_init_cost
> > +              += sve_costs->gather_load_x64_init_cost;
> > +          else
> > +            m_sve_gather_scatter_init_cost
> > +              += sve_costs->gather_load_x32_init_cost;
> > +        }
> >      }
> >
> >    /* Do any SVE-specific adjustments to the cost. */
> > @@ -17676,6 +17694,12 @@ aarch64_vector_costs::finish_cost (const vector_costs *uncast_scalar_costs)
> >        m_costs[vect_body] = adjust_body_cost (loop_vinfo, scalar_costs,
> >                                               m_costs[vect_body]);
> >        m_suggested_unroll_factor = determine_suggested_unroll_factor ();
> > +
> > +  /* For gather and scatters there's an additional overhead for the first
> > +     iteration.  For low count loops they're not beneficial so model the
> > +     overhead as loop prologue costs. */
> > +  if (m_sve_gather_scatter_init_cost)
> > +    m_costs[vect_prologue] += m_sve_gather_scatter_init_cost;
>
> Might as well make this unconditional now.
>
> LGTM with those changes, but please wait for Kyrill's review too.
>
> Thanks,
> Richard
>
> > }
> >
> > /* Apply the heuristic described above m_stp_sequence_cost.
> > [...]
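For reference, the nonnull guard being requested amounts to skipping the
bookkeeping when the active tuning model provides no SVE cost table.  A
minimal sketch of one plausible shape of the fix, based on the v2
add_stmt_cost hunk quoted above (only the "if (sve_costs)" check is new,
everything else is from the patch):

/* Sketch only: the v2 hunk plus the requested nonnull check.  Tuning
   models without SVE costs leave aarch64_tune_params.vec_costs->sve
   null, so skip the accumulation entirely in that case.  */
if (kind == scalar_load
    && aarch64_sve_mode_p (TYPE_MODE (vectype))
    && STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_GATHER_SCATTER)
  {
    const sve_vec_cost *sve_costs = aarch64_tune_params.vec_costs->sve;
    if (sve_costs)
      {
	if (GET_MODE_UNIT_BITSIZE (TYPE_MODE (vectype)) == 64)
	  m_sve_gather_scatter_init_cost
	    += sve_costs->gather_load_x64_init_cost;
	else
	  m_sve_gather_scatter_init_cost
	    += sve_costs->gather_load_x32_init_cost;
      }
  }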
> -----Original Message-----
> From: Tamar Christina <Tamar.Christina@arm.com>
> Sent: Thursday, August 1, 2024 9:51 AM
> To: Richard Sandiford <Richard.Sandiford@arm.com>
> Cc: Kyrylo Tkachov <ktkachov@nvidia.com>; gcc-patches@gcc.gnu.org; nd
> <nd@arm.com>; Richard Earnshaw <Richard.Earnshaw@arm.com>; Marcus
> Shawcroft <Marcus.Shawcroft@arm.com>; ktkachov@gcc.gnu.org
> Subject: RE: [PATCH 8/8]AArch64: take gather/scatter decode overhead into
> account
>
> [...]
>
> > > +      /* Check if we've seen an SVE gather/scatter operation and which size.  */
> > > +      if (kind == scalar_load
> > > +	  && aarch64_sve_mode_p (TYPE_MODE (vectype))
> > > +	  && STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_GATHER_SCATTER)
> > > +	{
> > > +	  const sve_vec_cost *sve_costs = aarch64_tune_params.vec_costs->sve;
> >
> > I think we need to check whether this is nonnull, since not all tuning
> > targets provide SVE costs.
>
> Will do, but I thought since this was in a block that checked
> aarch64_use_new_vector_costs_p () that the SVE costs were required.
> What does that predicate mean then?

Ah nevermind, just realized it just means to apply the new vector cost
routines, I hadn't realized that we supported new costing for non-SVE
models as well.

Will add the check.

Thanks,
Tamar

> [...]
Tamar Christina <Tamar.Christina@arm.com> writes:
> [...]
>> > I think we need to check whether this is nonnull, since not all tuning
>> > targets provide SVE costs.
>>
>> Will do, but I thought since this was in a block that checked
>> aarch64_use_new_vector_costs_p () that the SVE costs were required.
>> What does that predicate mean then?
>
> Ah nevermind, just realized it just means to apply the new vector cost
> routines, I hadn't realized that we supported new costing for non-SVE
> models as well.
>
> Will add the check.

Yeah, the eventual plan was to remove aarch64_use_new_vector_costs_p
and make everything use the same path (without any changes to the tuning
structures being needed).  But it's never bubbled high enough up the
to-do list.

The reason for adding aarch64_use_new_vector_costs_p originally was
because the associated cost changes were made late in a release cycle,
and I didn't want to change the code for anything that didn't actively
need the new costs.

Thanks,
Richard
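For readers following the predicate discussion: in the GCC sources of
this period, aarch64_use_new_vector_costs_p was, roughly, a one-line
test of an extra tuning flag, so it only selects the newer costing code
paths and implies nothing about which cost tables a tuning model
supplies.  Approximately (paraphrased from memory, not part of this
patch):

/* Paraphrased sketch of the predicate as it existed around this time;
   it tests a tuning flag only and does not guarantee that
   aarch64_tune_params.vec_costs->sve is nonnull.  */
static bool
aarch64_use_new_vector_costs_p ()
{
  return (aarch64_tune_params.extra_tuning_flags
	  & AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS);
}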
diff --git a/gcc/config/aarch64/aarch64-protos.h b/gcc/config/aarch64/aarch64-protos.h
index 42639e9efcf1e0f9362f759ae63a31b8eeb0d581..16eb8edab4d9fdfc6e3672c56ef5c9f6962d0c0b 100644
--- a/gcc/config/aarch64/aarch64-protos.h
+++ b/gcc/config/aarch64/aarch64-protos.h
@@ -262,6 +262,8 @@ struct sve_vec_cost : simd_vec_cost
 		unsigned int fadda_f64_cost,
 		unsigned int gather_load_x32_cost,
 		unsigned int gather_load_x64_cost,
+		unsigned int gather_load_x32_init_cost,
+		unsigned int gather_load_x64_init_cost,
 		unsigned int scatter_store_elt_cost)
     : simd_vec_cost (base),
       clast_cost (clast_cost),
@@ -270,6 +272,8 @@ struct sve_vec_cost : simd_vec_cost
       fadda_f64_cost (fadda_f64_cost),
       gather_load_x32_cost (gather_load_x32_cost),
       gather_load_x64_cost (gather_load_x64_cost),
+      gather_load_x32_init_cost (gather_load_x32_init_cost),
+      gather_load_x64_init_cost (gather_load_x64_init_cost),
       scatter_store_elt_cost (scatter_store_elt_cost)
   {}

@@ -289,6 +293,12 @@ struct sve_vec_cost : simd_vec_cost
   const int gather_load_x32_cost;
   const int gather_load_x64_cost;

+  /* Additional loop initialization cost of using a gather load instruction.  The x32
+     value is for loads of 32-bit elements and the x64 value is for loads of
+     64-bit elements.  */
+  const int gather_load_x32_init_cost;
+  const int gather_load_x64_init_cost;
+
   /* The per-element cost of a scatter store.  */
   const int scatter_store_elt_cost;
 };
diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index eafa377cb095f49408d8a926fb49ce13e2155ba2..1e14c3c0d24b449d404724e436ba57e1996ec062 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -16227,6 +16227,12 @@ private:
      supported by Advanced SIMD and SVE2.  */
   bool m_has_avg = false;

+  /* This loop uses an SVE 32-bit element gather or scatter operation.  */
+  bool m_sve_gather_scatter_x32 = false;
+
+  /* This loop uses an SVE 64-bit element gather or scatter operation.  */
+  bool m_sve_gather_scatter_x64 = false;
+
   /* True if the vector body contains a store to a decl and if the
      function is known to have a vld1 from the same decl.
@@ -17291,6 +17297,17 @@ aarch64_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
 	stmt_cost = aarch64_detect_vector_stmt_subtype (m_vinfo, kind,
 							stmt_info, vectype,
 							where, stmt_cost);
+
+      /* Check if we've seen an SVE gather/scatter operation and which size.  */
+      if (kind == scalar_load
+	  && aarch64_sve_mode_p (TYPE_MODE (vectype))
+	  && STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_GATHER_SCATTER)
+	{
+	  if (GET_MODE_UNIT_BITSIZE (TYPE_MODE (vectype)) == 64)
+	    m_sve_gather_scatter_x64 = true;
+	  else
+	    m_sve_gather_scatter_x32 = true;
+	}
     }

   /* Do any SVE-specific adjustments to the cost.  */
@@ -17676,6 +17693,18 @@ aarch64_vector_costs::finish_cost (const vector_costs *uncast_scalar_costs)
       m_costs[vect_body] = adjust_body_cost (loop_vinfo, scalar_costs,
					      m_costs[vect_body]);
       m_suggested_unroll_factor = determine_suggested_unroll_factor ();
+
+      /* For gather and scatters there's an additional overhead for the first
+	 iteration.  For low count loops they're not beneficial so model the
+	 overhead as loop prologue costs.  */
+      if (m_sve_gather_scatter_x32 || m_sve_gather_scatter_x64)
+	{
+	  const sve_vec_cost *sve_costs = aarch64_tune_params.vec_costs->sve;
+	  if (m_sve_gather_scatter_x32)
+	    m_costs[vect_prologue] += sve_costs->gather_load_x32_init_cost;
+	  else
+	    m_costs[vect_prologue] += sve_costs->gather_load_x64_init_cost;
+	}
     }

   /* Apply the heuristic described above m_stp_sequence_cost.  Prefer
diff --git a/gcc/config/aarch64/tuning_models/a64fx.h b/gcc/config/aarch64/tuning_models/a64fx.h
index 6091289d4c3c66f01d7e4dbf97a85c1f8c40bb0b..378a1b3889ee265859786c1ff6525fce2305b615 100644
--- a/gcc/config/aarch64/tuning_models/a64fx.h
+++ b/gcc/config/aarch64/tuning_models/a64fx.h
@@ -104,6 +104,8 @@ static const sve_vec_cost a64fx_sve_vector_cost =
   13, /* fadda_f64_cost */
   64, /* gather_load_x32_cost */
   32, /* gather_load_x64_cost */
+  0, /* gather_load_x32_init_cost */
+  0, /* gather_load_x64_init_cost */
   1 /* scatter_store_elt_cost */
 };

diff --git a/gcc/config/aarch64/tuning_models/cortexx925.h b/gcc/config/aarch64/tuning_models/cortexx925.h
index fb95e87526985b02410d54a5a3ec8539c1b0ba6d..c4206018a3ff707f89ff3300700ec7dc2a5bc6b0 100644
--- a/gcc/config/aarch64/tuning_models/cortexx925.h
+++ b/gcc/config/aarch64/tuning_models/cortexx925.h
@@ -135,6 +135,8 @@ static const sve_vec_cost cortexx925_sve_vector_cost =
      operation more than a 64-bit gather.  */
   14, /* gather_load_x32_cost */
   12, /* gather_load_x64_cost */
+  42, /* gather_load_x32_init_cost */
+  24, /* gather_load_x64_init_cost */
   1 /* scatter_store_elt_cost */
 };

diff --git a/gcc/config/aarch64/tuning_models/generic.h b/gcc/config/aarch64/tuning_models/generic.h
index 2b1f68b3052117814161a32f426422736ad6462b..101969bdbb9ccf7eafbd9a1cd6e25f0b584fb261 100644
--- a/gcc/config/aarch64/tuning_models/generic.h
+++ b/gcc/config/aarch64/tuning_models/generic.h
@@ -105,6 +105,8 @@ static const sve_vec_cost generic_sve_vector_cost =
   2, /* fadda_f64_cost */
   4, /* gather_load_x32_cost */
   2, /* gather_load_x64_cost */
+  12, /* gather_load_x32_init_cost */
+  4, /* gather_load_x64_init_cost */
   1 /* scatter_store_elt_cost */
 };

diff --git a/gcc/config/aarch64/tuning_models/generic_armv8_a.h b/gcc/config/aarch64/tuning_models/generic_armv8_a.h
index b38b9a8c5cad7d12aa38afdb610a14a25e755010..b5088afe068aa4be7f9dd614cfdd2a51fa96e524 100644
--- a/gcc/config/aarch64/tuning_models/generic_armv8_a.h
+++ b/gcc/config/aarch64/tuning_models/generic_armv8_a.h
@@ -106,6 +106,8 @@ static const sve_vec_cost generic_armv8_a_sve_vector_cost =
   2, /* fadda_f64_cost */
   4, /* gather_load_x32_cost */
   2, /* gather_load_x64_cost */
+  12, /* gather_load_x32_init_cost */
+  4, /* gather_load_x64_init_cost */
   1 /* scatter_store_elt_cost */
 };

diff --git a/gcc/config/aarch64/tuning_models/generic_armv9_a.h b/gcc/config/aarch64/tuning_models/generic_armv9_a.h
index b39a0c73db910888168790888d24ddf4406bf1ee..fd72de542862909ccb9a9260a16bb01935d97f36 100644
--- a/gcc/config/aarch64/tuning_models/generic_armv9_a.h
+++ b/gcc/config/aarch64/tuning_models/generic_armv9_a.h
@@ -136,6 +136,8 @@ static const sve_vec_cost generic_armv9_a_sve_vector_cost =
      operation more than a 64-bit gather.  */
   14, /* gather_load_x32_cost */
   12, /* gather_load_x64_cost */
+  42, /* gather_load_x32_init_cost */
+  24, /* gather_load_x64_init_cost */
   3 /* scatter_store_elt_cost */
 };

diff --git a/gcc/config/aarch64/tuning_models/neoverse512tvb.h b/gcc/config/aarch64/tuning_models/neoverse512tvb.h
index 825c6a64990b72cda3641737957dc94d75db1509..d2a0b647791de8fca6d7684849d2ab1e9104b045 100644
--- a/gcc/config/aarch64/tuning_models/neoverse512tvb.h
+++ b/gcc/config/aarch64/tuning_models/neoverse512tvb.h
@@ -79,6 +79,8 @@ static const sve_vec_cost neoverse512tvb_sve_vector_cost =
      operation more than a 64-bit gather.  */
   14, /* gather_load_x32_cost */
   12, /* gather_load_x64_cost */
+  42, /* gather_load_x32_init_cost */
+  24, /* gather_load_x64_init_cost */
   3 /* scatter_store_elt_cost */
 };

diff --git a/gcc/config/aarch64/tuning_models/neoversen2.h b/gcc/config/aarch64/tuning_models/neoversen2.h
index 3430eb9c06819e00ab38966bb960bd6525ff2b5c..00d2c12e739ffd371dd4720826894e980d577ca7 100644
--- a/gcc/config/aarch64/tuning_models/neoversen2.h
+++ b/gcc/config/aarch64/tuning_models/neoversen2.h
@@ -135,6 +135,8 @@ static const sve_vec_cost neoversen2_sve_vector_cost =
      operation more than a 64-bit gather.  */
   14, /* gather_load_x32_cost */
   12, /* gather_load_x64_cost */
+  42, /* gather_load_x32_init_cost */
+  24, /* gather_load_x64_init_cost */
   3 /* scatter_store_elt_cost */
 };

diff --git a/gcc/config/aarch64/tuning_models/neoversen3.h b/gcc/config/aarch64/tuning_models/neoversen3.h
index 7438e39a4bbe43de624b63fdd20d3fde9dfb6fc9..fc4333ffdeaef0115ac162e2da9d8d548bacf576 100644
--- a/gcc/config/aarch64/tuning_models/neoversen3.h
+++ b/gcc/config/aarch64/tuning_models/neoversen3.h
@@ -135,6 +135,8 @@ static const sve_vec_cost neoversen3_sve_vector_cost =
      operation more than a 64-bit gather.  */
   14, /* gather_load_x32_cost */
   12, /* gather_load_x64_cost */
+  42, /* gather_load_x32_init_cost */
+  24, /* gather_load_x64_init_cost */
   1 /* scatter_store_elt_cost */
 };

diff --git a/gcc/config/aarch64/tuning_models/neoversev1.h b/gcc/config/aarch64/tuning_models/neoversev1.h
index 0fc41ce6a41b3135fa06d2bda1f517fdf4f8dbcf..705ed025730f6683109a4796c6eefa55b437cec9 100644
--- a/gcc/config/aarch64/tuning_models/neoversev1.h
+++ b/gcc/config/aarch64/tuning_models/neoversev1.h
@@ -126,6 +126,8 @@ static const sve_vec_cost neoversev1_sve_vector_cost =
   8, /* fadda_f64_cost */
   32, /* gather_load_x32_cost */
   16, /* gather_load_x64_cost */
+  96, /* gather_load_x32_init_cost */
+  32, /* gather_load_x64_init_cost */
   3 /* scatter_store_elt_cost */
 };

diff --git a/gcc/config/aarch64/tuning_models/neoversev2.h b/gcc/config/aarch64/tuning_models/neoversev2.h
index cca459e32c1384f57f8345d86b42b7814ae44115..680feeb9e4ee7bf21d5a258d83e522e079fdc156 100644
--- a/gcc/config/aarch64/tuning_models/neoversev2.h
+++ b/gcc/config/aarch64/tuning_models/neoversev2.h
@@ -135,6 +135,8 @@ static const sve_vec_cost neoversev2_sve_vector_cost =
      operation more than a 64-bit gather.  */
   14, /* gather_load_x32_cost */
   12, /* gather_load_x64_cost */
+  42, /* gather_load_x32_init_cost */
+  24, /* gather_load_x64_init_cost */
   3 /* scatter_store_elt_cost */
 };

diff --git a/gcc/config/aarch64/tuning_models/neoversev3.h b/gcc/config/aarch64/tuning_models/neoversev3.h
index 3daa3d2365c817d03c6c0d5e66fe832620d8fb2c..812c6ad304e8d4c503dcd444437bf6528d6f3176 100644
--- a/gcc/config/aarch64/tuning_models/neoversev3.h
+++ b/gcc/config/aarch64/tuning_models/neoversev3.h
@@ -135,6 +135,8 @@ static const sve_vec_cost neoversev3_sve_vector_cost =
      operation more than a 64-bit gather.  */
   14, /* gather_load_x32_cost */
   12, /* gather_load_x64_cost */
+  42, /* gather_load_x32_init_cost */
+  24, /* gather_load_x64_init_cost */
   1 /* scatter_store_elt_cost */
 };

diff --git a/gcc/config/aarch64/tuning_models/neoversev3ae.h b/gcc/config/aarch64/tuning_models/neoversev3ae.h
index 29c6f22e941b26ee333c87b9fac22aea86625e97..280b5abb27d3c9f404d5f96f14d0cba1e13b9bd1 100644
--- a/gcc/config/aarch64/tuning_models/neoversev3ae.h
+++ b/gcc/config/aarch64/tuning_models/neoversev3ae.h
@@ -135,6 +135,8 @@ static const sve_vec_cost neoversev3ae_sve_vector_cost =
      operation more than a 64-bit gather.  */
   14, /* gather_load_x32_cost */
   12, /* gather_load_x64_cost */
+  42, /* gather_load_x32_init_cost */
+  24, /* gather_load_x64_init_cost */
   1 /* scatter_store_elt_cost */
 };
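To make the new entries concrete: an indexed-load loop of the shape
below is the kind of code this patch targets.  The in[idx[i]] access
typically vectorizes to an SVE 32-bit element gather (an ld1w with a
vector of indices), and with e.g. the generic Armv9-A table above the
vectorizer now charges an extra 42 units of one-off prologue cost for
it, so loops with very low trip counts are costed away from gathers.
Illustrative example only, not taken from the patch:

/* Illustrative: with SVE enabled (e.g. -O3 -mcpu=neoverse-v2) the
   in[idx[i]] load becomes a gather; the new *_init_cost entries model
   its one-time decode/setup overhead as loop prologue cost, making
   vectorization unprofitable for very short trip counts.  */
void
f (int *restrict out, int *restrict in, int *restrict idx, int n)
{
  for (int i = 0; i < n; i++)
    out[i] = in[idx[i]];
}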