Message ID | ZtcJ_poYfV4kYjv1@kam.mff.cuni.cz |
---|---|
State | New |
Series | Zen5 tuning part 2: disable gather and scatter |
On Tue, Sep 3, 2024 at 3:07 PM Jan Hubicka <hubicka@ucw.cz> wrote:
>
> Hi,
> We disable gathers for zen4. It seems that gather has improved a bit compared
> to zen4 and Zen5 optimization manual suggests "Avoid GATHER instructions when
> the indices are known ahead of time. Vector loads followed by shuffles result
> in a higher load bandwidth." however the situation seems to be more
> complicated.
>
> gather is 5-10% loss on parest benchmark as well as 30% loss on sparse dot
> products in TSVC. Curiously enough breaking these out into microbenchmark
> reversed the situation and it turns out that the performance depends on
> how indices are distributed. gather is loss if indices are sequential,
> neutral if they are random and win for some strides (4, 8).
>
> This seems to be similar to earlier zens, so I think (especially for
> backporting znver5 support) that it makes sense to be consistent and disable
> gather unless we work out a good heuristics on when to use it. Since we
> typically do not know the indices in advance, I don't see how that can be done.
>
> I opened PR116582 with some examples of wins and losses

Note there's no way to emulate masked gathers (well, other than emitting
control flow), so they remain the choice when AVX512 is enabled and you
have conditional loads. Similarly for stores and scatter, though there
performance may well be abysmal - something for the cost model to resolve.

Note I think x86 doesn't yet expose AVX512 masked gather/scatter - the
builtin target hook doesn't support it and the backend doesn't have any
mask_gather_load or mask_scatter_store optabs to go the now preferred
internal-fn way.

Open-coding 8-way gather is also heavy in code size and thus might affect
ucode re-use for large loops (OTOH gathers may take up much space in the
ucode cache or be not there at all).

Richard.

> Bootstrapped/regtested x86_64-linux, committed.
>
>
> gcc/ChangeLog:
>
>         * config/i386/x86-tune.def (X86_TUNE_USE_GATHER_2PARTS): Disable for
>         ZNVER5.
>         (X86_TUNE_USE_SCATTER_2PARTS): Disable for ZNVER5.
>         (X86_TUNE_USE_GATHER_4PARTS): Disable for ZNVER5.
>         (X86_TUNE_USE_SCATTER_4PARTS): Disable for ZNVER5.
>         (X86_TUNE_USE_GATHER_8PARTS): Disable for ZNVER5.
>         (X86_TUNE_USE_SCATTER_8PARTS): Disable for ZNVER5.
>
> diff --git a/gcc/config/i386/x86-tune.def b/gcc/config/i386/x86-tune.def
> index da1a3d6a3c6..ed26136faee 100644
> --- a/gcc/config/i386/x86-tune.def
> +++ b/gcc/config/i386/x86-tune.def
> @@ -476,35 +476,35 @@ DEF_TUNE (X86_TUNE_AVOID_4BYTE_PREFIXES, "avoid_4byte_prefixes",
>  /* X86_TUNE_USE_GATHER_2PARTS: Use gather instructions for vectors with 2
>     elements. */
>  DEF_TUNE (X86_TUNE_USE_GATHER_2PARTS, "use_gather_2parts",
> -          ~(m_ZNVER1 | m_ZNVER2 | m_ZNVER3 | m_ZNVER4 | m_CORE_HYBRID
> +          ~(m_ZNVER | m_CORE_HYBRID
>              | m_YONGFENG | m_SHIJIDADAO | m_CORE_ATOM | m_GENERIC | m_GDS))
>
>  /* X86_TUNE_USE_SCATTER_2PARTS: Use scater instructions for vectors with 2
>     elements. */
>  DEF_TUNE (X86_TUNE_USE_SCATTER_2PARTS, "use_scatter_2parts",
> -          ~(m_ZNVER4))
> +          ~(m_ZNVER4 | m_ZNVER5))
>
>  /* X86_TUNE_USE_GATHER_4PARTS: Use gather instructions for vectors with 4
>     elements. */
>  DEF_TUNE (X86_TUNE_USE_GATHER_4PARTS, "use_gather_4parts",
> -          ~(m_ZNVER1 | m_ZNVER2 | m_ZNVER3 | m_ZNVER4 | m_CORE_HYBRID
> +          ~(m_ZNVER | m_CORE_HYBRID
>              | m_YONGFENG | m_SHIJIDADAO | m_CORE_ATOM | m_GENERIC | m_GDS))
>
>  /* X86_TUNE_USE_SCATTER_4PARTS: Use scater instructions for vectors with 4
>     elements. */
>  DEF_TUNE (X86_TUNE_USE_SCATTER_4PARTS, "use_scatter_4parts",
> -          ~(m_ZNVER4))
> +          ~(m_ZNVER4 | m_ZNVER5))
>
>  /* X86_TUNE_USE_GATHER: Use gather instructions for vectors with 8 or more
>     elements. */
>  DEF_TUNE (X86_TUNE_USE_GATHER_8PARTS, "use_gather_8parts",
> -          ~(m_ZNVER1 | m_ZNVER2 | m_ZNVER4 | m_CORE_HYBRID | m_CORE_ATOM
> +          ~(m_ZNVER | m_CORE_HYBRID | m_CORE_ATOM
>              | m_YONGFENG | m_SHIJIDADAO | m_GENERIC | m_GDS))
>
>  /* X86_TUNE_USE_SCATTER: Use scater instructions for vectors with 8 or more
>     elements. */
>  DEF_TUNE (X86_TUNE_USE_SCATTER_8PARTS, "use_scatter_8parts",
> -          ~(m_ZNVER4))
> +          ~(m_ZNVER4 | m_ZNVER5))
>
>  /* X86_TUNE_AVOID_128FMA_CHAINS: Avoid creating loops with tight 128bit or
>     smaller FMA chain. */
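For readers who want the access pattern in concrete form, here is a minimal sketch of a sparse dot product with an indexed load, the kind of loop the patch description refers to. The function name `sparse_dot` and its signature are made up for illustration; this is not code from TSVC, parest, or PR116582. Whether GCC emits a gather instruction for `x[idx[i]]` or open-codes the element loads is what the `use_gather_*` tuning flags in this patch influence.

```c
#include <stddef.h>

/* Illustrative only: val[] and idx[] hold the nonzero values and column
   indices of one sparse row, x[] is a dense vector.  The load x[idx[i]]
   is the indexed ("gather") access.  */
double sparse_dot(const double *val, const int *idx, const double *x,
                  size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += val[i] * x[idx[i]];
    return sum;
}
```

Calling it with `val = {1, 2, 3}`, `idx = {2, 0, 1}` and `x = {10, 20, 30}` multiplies each value by the element its index selects and sums the products.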
On 9/3/24 15:07, Jan Hubicka wrote:

> Hi,
> We disable gathers for zen4. It seems that gather has improved a bit compared
> to zen4 and Zen5 optimization manual suggests "Avoid GATHER instructions when
> the indices are known ahead of time. Vector loads followed by shuffles result
> in a higher load bandwidth." however the situation seems to be more
> complicated.

A small bit of "real world" experience (but for zen3):

Recently I switched to gfortran 14.2 for my weather forecasting.
A year ago I had changed "-march=native -mtune=native" (on my zen3 system)
to "-march=native -mtune=znver2" while using gfortran 13 - it had only a
small effect (but positive).

Last Monday I switched back to "-march=native -mtune=native", but that
consistently made a 12 hour computation around 6 minutes slower (i.e., about
1/120th, or 0.8 %). The most computationally intensive part of the code needs
gather (either instructions or inline expansions of them).

Hope this helps,
> On 9/3/24 15:07, Jan Hubicka wrote:
>
> > Hi,
> > We disable gathers for zen4. It seems that gather has improved a bit compared
> > to zen4 and Zen5 optimization manual suggests "Avoid GATHER instructions when
> > the indices are known ahead of time. Vector loads followed by shuffles result
> > in a higher load bandwidth." however the situation seems to be more
> > complicated.
>
> A small bit of "real world" experience (but for zen3):
>
> Recently I switched to gfortran 14.2 for my weather forecasting.
> A year ago I had changed "-march=native -mtune=native" (on my zen3 system)
> to "-march=native -mtune=znver2" while using gfortran 13 - it had only a
> small effect (but positive).
>
> Last Monday I switched back to "-march=native -mtune=native", but that
> consistently made a 12 hour computation around 6 minutes slower (i.e., about
> 1/120th, or 0.8 %). The most computationally intensive part of the code needs
> gather (either instructions or inline expansions of them).

It would be nice to know what is causing this. Gathers can be enabled
using -mtune-ctrl=use_gather and I would be happy to know about real
world situations where they help.

I am still looking into this. IMO disabling gather like on other zens
makes sense especially for backporting. For trunk
it probably makes sense to look for heuristics carefully enabling
gathers. It is not clear to me how to benchmark them or how to set up
heuristics. Spec2017 has very small coverage for loops requiring
gathers and so does tsvc. I did some micro-benchmarks but their
behaviour is, well, puzzling. Having additional data would be great.

As Richard mentioned, it probably makes sense to enable masked gathers,
since the open coded version needs conditionals and otherwise we would not
vectorize at all. I am not sure if we can do that with current APIs.
I will cook up some micro-benchmarks for that.

Concerning code size, I am not sure how much that applies in practice
since gathers are used relatively sporadically and vectorizer blows up
the code a lot anyways, but certainly one can construct an example with
very many loops needing gather...

My guess is that array prefetching data is annotated to the instruction
cache and since gather produces a lot of loads, probably data simply does
not fit. Opencoding the gather makes extra space for this info...

Honza
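Since the micro-benchmark results reportedly depend on how the indices are distributed (the patch description mentions sequential, random, and strided indices), one way to set up such an experiment is to generate the index array separately from the timed loop. The sketch below is only an assumption about how that could look; `fill_indices`, `enum idx_pattern`, and the parameters are hypothetical names and this is not the code attached to PR116582.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical index distributions for a gather micro-benchmark.  */
enum idx_pattern { IDX_SEQUENTIAL, IDX_STRIDED, IDX_RANDOM };

/* Fill idx[0..n-1] with indices in [0, range).  range must be nonzero
   and small enough to fit in an int; stride is only used for
   IDX_STRIDED.  */
void fill_indices(int *idx, size_t n, size_t range,
                  enum idx_pattern pattern, size_t stride)
{
    for (size_t i = 0; i < n; i++) {
        switch (pattern) {
        case IDX_SEQUENTIAL:
            idx[i] = (int)(i % range);
            break;
        case IDX_STRIDED:
            idx[i] = (int)((i * stride) % range);
            break;
        case IDX_RANDOM:
            idx[i] = (int)((size_t)rand() % range);
            break;
        }
    }
}
```

The same indexed-load loop can then be timed once per pattern, for example with strides 4 and 8, with the gather tuning flags on and off.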
On Wed, Sep 4, 2024 at 12:56 PM Jan Hubicka <hubicka@ucw.cz> wrote:
>
> > On 9/3/24 15:07, Jan Hubicka wrote:
> >
> > > Hi,
> > > We disable gathers for zen4. It seems that gather has improved a bit compared
> > > to zen4 and Zen5 optimization manual suggests "Avoid GATHER instructions when
> > > the indices are known ahead of time. Vector loads followed by shuffles result
> > > in a higher load bandwidth." however the situation seems to be more
> > > complicated.
> >
> > A small bit of "real world" experience (but for zen3):
> >
> > Recently I switched to gfortran 14.2 for my weather forecasting.
> > A year ago I had changed "-march=native -mtune=native" (on my zen3 system)
> > to "-march=native -mtune=znver2" while using gfortran 13 - it had only a
> > small effect (but positive).
> >
> > Last Monday I switched back to "-march=native -mtune=native", but that
> > consistently made a 12 hour computation around 6 minutes slower (i.e., about
> > 1/120th, or 0.8 %). The most computationally intensive part of the code needs
> > gather (either instructions or inline expansions of them).
>
> It would be nice to know what is causing this. Gathers can be enabled
> using -mtune-ctrl=use_gather and I would be happy to know about real
> world situations where they help.
>
> I am still looking into this. IMO disabling gather like on other zens
> makes sense especially for backporting. For trunk
> it probably makes sense to look for heuristics carefully enabling
> gathers. It is not clear to me how to benchmark them or how to set up
> heuristics. Spec2017 has very small coverage for loops requiring
> gathers and so does tsvc. I did some micro-benchmarks but their
> behaviour is, well, puzzling. Having additional data would be great.
>
> As Richard mentioned, it probably makes sense to enable masked gathers,
> since the open coded version needs conditionals and otherwise we would not
> vectorize at all. I am not sure if we can do that with current APIs.
> I will cook up some micro-benchmarks for that.

See also https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85919; the
targetm.vectorize.builtin_gather/targetm.vectorize.builtin_scatter
interface is legacy and does not support masking at all.

> Concerning code size, I am not sure how much that applies in practice
> since gathers are used relatively sporadically and vectorizer blows up
> the code a lot anyways, but certainly one can construct an example with
> very many loops needing gather...
>
> My guess is that array prefetching data is annotated to the instruction
> cache and since gather produces a lot of loads, probably data simply does
> not fit. Opencoding the gather makes extra space for this info...
>
> Honza
>
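To make the masked-gather case concrete, the following is a small sketch of a loop with a conditional indexed load. The function name `cond_gather_sum` and its parameters are invented for this illustration. The load `x[idx[i]]` only happens when `mask[i]` is set, so an open-coded vector version cannot simply load every lane unconditionally; it needs per-element control flow, which is the situation where a masked gather would be the natural fit.

```c
#include <stddef.h>

/* Illustrative only: sum x[idx[i]] over the elements whose mask byte is
   nonzero.  idx[i] is not dereferenced into x[] when mask[i] is zero,
   so it may be out of range for those elements.  */
double cond_gather_sum(const double *x, const int *idx,
                       const unsigned char *mask, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        if (mask[i])
            sum += x[idx[i]];
    return sum;
}
```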
On 9/4/24 12:55, Jan Hubicka wrote:

>> On 9/3/24 15:07, Jan Hubicka wrote:
>>
>>> Hi,
>>> We disable gathers for zen4. It seems that gather has improved a bit compared
>>> to zen4 and Zen5 optimization manual suggests "Avoid GATHER instructions when
>>> the indices are known ahead of time. Vector loads followed by shuffles result
>>> in a higher load bandwidth." however the situation seems to be more
>>> complicated.
>>
>> A small bit of "real world" experience (but for zen3):
>>
>> Recently I switched to gfortran 14.2 for my weather forecasting.
>> A year ago I had changed "-march=native -mtune=native" (on my zen3 system)
>> to "-march=native -mtune=znver2" while using gfortran 13 - it had only a
>> small effect (but positive).
>>
>> Last Monday I switched back to "-march=native -mtune=native", but that
>> consistently made a 12 hour computation around 6 minutes slower (i.e., about
>> 1/120th, or 0.8 %). The most computationally intensive part of the code needs
>> gather (either instructions or inline expansions of them).
>
> It would be nice to know what is causing this. Gathers can be enabled
> using -mtune-ctrl=use_gather and I would be happy to know about real
> world situations where they help.

Ah - one detail that I forgot to mention: our code is "special" in the
sense that it uses 32-bit floats while it runs on 64-bit address space.
So its use of gather instructions is rather suboptimal, needing 2 gather
instructions for each actual "gather operation".

Hope this helps,
diff --git a/gcc/config/i386/x86-tune.def b/gcc/config/i386/x86-tune.def
index da1a3d6a3c6..ed26136faee 100644
--- a/gcc/config/i386/x86-tune.def
+++ b/gcc/config/i386/x86-tune.def
@@ -476,35 +476,35 @@ DEF_TUNE (X86_TUNE_AVOID_4BYTE_PREFIXES, "avoid_4byte_prefixes",
 /* X86_TUNE_USE_GATHER_2PARTS: Use gather instructions for vectors with 2
    elements. */
 DEF_TUNE (X86_TUNE_USE_GATHER_2PARTS, "use_gather_2parts",
-          ~(m_ZNVER1 | m_ZNVER2 | m_ZNVER3 | m_ZNVER4 | m_CORE_HYBRID
+          ~(m_ZNVER | m_CORE_HYBRID
             | m_YONGFENG | m_SHIJIDADAO | m_CORE_ATOM | m_GENERIC | m_GDS))
 
 /* X86_TUNE_USE_SCATTER_2PARTS: Use scater instructions for vectors with 2
    elements. */
 DEF_TUNE (X86_TUNE_USE_SCATTER_2PARTS, "use_scatter_2parts",
-          ~(m_ZNVER4))
+          ~(m_ZNVER4 | m_ZNVER5))
 
 /* X86_TUNE_USE_GATHER_4PARTS: Use gather instructions for vectors with 4
    elements. */
 DEF_TUNE (X86_TUNE_USE_GATHER_4PARTS, "use_gather_4parts",
-          ~(m_ZNVER1 | m_ZNVER2 | m_ZNVER3 | m_ZNVER4 | m_CORE_HYBRID
+          ~(m_ZNVER | m_CORE_HYBRID
             | m_YONGFENG | m_SHIJIDADAO | m_CORE_ATOM | m_GENERIC | m_GDS))
 
 /* X86_TUNE_USE_SCATTER_4PARTS: Use scater instructions for vectors with 4
    elements. */
 DEF_TUNE (X86_TUNE_USE_SCATTER_4PARTS, "use_scatter_4parts",
-          ~(m_ZNVER4))
+          ~(m_ZNVER4 | m_ZNVER5))
 
 /* X86_TUNE_USE_GATHER: Use gather instructions for vectors with 8 or more
    elements. */
 DEF_TUNE (X86_TUNE_USE_GATHER_8PARTS, "use_gather_8parts",
-          ~(m_ZNVER1 | m_ZNVER2 | m_ZNVER4 | m_CORE_HYBRID | m_CORE_ATOM
+          ~(m_ZNVER | m_CORE_HYBRID | m_CORE_ATOM
             | m_YONGFENG | m_SHIJIDADAO | m_GENERIC | m_GDS))
 
 /* X86_TUNE_USE_SCATTER: Use scater instructions for vectors with 8 or more
    elements. */
 DEF_TUNE (X86_TUNE_USE_SCATTER_8PARTS, "use_scatter_8parts",
-          ~(m_ZNVER4))
+          ~(m_ZNVER4 | m_ZNVER5))
 
 /* X86_TUNE_AVOID_128FMA_CHAINS: Avoid creating loops with tight 128bit or
    smaller FMA chain. */
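
As a reading aid for the `~(...)` expressions in the patch: assuming the last argument of DEF_TUNE is a bitmask of the processors for which the tuning is enabled, negating a set of processor bits means "enabled everywhere except on these". The sketch below models that with made-up bit values; `M_ZNVER4`, `M_ZNVER5`, `M_OTHER` and `tune_enabled` are hypothetical stand-ins, not GCC's actual `m_*` definitions or lookup code.

```c
/* Hypothetical processor bits; GCC defines its real m_* masks elsewhere
   in the i386 backend.  */
#define M_ZNVER4 (1u << 0)
#define M_ZNVER5 (1u << 1)
#define M_OTHER  (1u << 2)

/* Selector masks modelled on use_scatter_8parts before and after the
   patch.  */
#define USE_SCATTER_8PARTS_OLD (~(M_ZNVER4))
#define USE_SCATTER_8PARTS_NEW (~(M_ZNVER4 | M_ZNVER5))

/* A tuning is on for a processor if that processor's bit is set in the
   selector mask.  */
int tune_enabled(unsigned selector_mask, unsigned cpu_bit)
{
    return (selector_mask & cpu_bit) != 0;
}
```

Under this model, `tune_enabled(USE_SCATTER_8PARTS_OLD, M_ZNVER5)` is true while `tune_enabled(USE_SCATTER_8PARTS_NEW, M_ZNVER5)` is false, which is the change the ChangeLog describes as "Disable for ZNVER5".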