
Zen5 tuning part 2: disable gather and scatter

Message ID ZtcJ_poYfV4kYjv1@kam.mff.cuni.cz
State New
Series Zen5 tuning part 2: disable gather and scatter

Commit Message

Jan Hubicka Sept. 3, 2024, 1:07 p.m. UTC
Hi,
We disable gathers for zen4.  Gather seems to have improved a bit on Zen5
compared to zen4, and the Zen5 optimization manual suggests "Avoid GATHER
instructions when the indices are known ahead of time. Vector loads followed
by shuffles result in a higher load bandwidth."  However, the situation seems
to be more complicated.

Gather is a 5-10% loss on the parest benchmark as well as a 30% loss on sparse
dot products in TSVC.  Curiously enough, breaking these out into
microbenchmarks reversed the situation, and it turns out that the performance
depends on how the indices are distributed.  Gather is a loss if the indices
are sequential, neutral if they are random, and a win for some strides (4, 8).
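
For illustration only, here is a minimal C sketch of the kind of kernel and
index patterns described above.  It is not the code attached to the PR; the
names sparse_dot, fill_strided and fill_random are hypothetical, and the build
flags mentioned in the comments are assumptions.

```c
#include <stddef.h>
#include <stdint.h>

/* Sparse dot product.  The indexed load x[idx[i]] is the part the
   vectorizer implements either with a gather instruction or with
   scalar loads that are then assembled into a vector.
   Assumed build (not from the PR): -O3 -ffast-math, since a
   floating-point reduction typically needs reassociation to be
   vectorized.  */
double
sparse_dot (const double *val, const double *x, const int32_t *idx, size_t n)
{
  double sum = 0.0;
  for (size_t i = 0; i < n; i++)
    sum += val[i] * x[idx[i]];
  return sum;
}

/* idx[i] = (i * stride) % range.  stride == 1 gives sequential indices;
   strides such as 4 or 8 give the strided patterns.  */
void
fill_strided (int32_t *idx, size_t n, int32_t stride, int32_t range)
{
  for (size_t i = 0; i < n; i++)
    idx[i] = (int32_t) ((i * (size_t) stride) % (size_t) range);
}

/* Pseudo-random indices in [0, range) from a simple linear congruential
   generator, so that runs are reproducible.  */
void
fill_random (int32_t *idx, size_t n, int32_t range, uint32_t seed)
{
  uint32_t s = seed;
  for (size_t i = 0; i < n; i++)
    {
      s = s * 1664525u + 1013904223u;
      idx[i] = (int32_t) ((s >> 8) % (uint32_t) range);
    }
}
```

Timing sparse_dot over large arrays with each index pattern, once with gather
enabled and once with it disabled, is one way to compare the cases; the sketch
deliberately leaves the timing harness out.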

This seems to be similar to earlier zens, so I think (especially for
backporting znver5 support) that it makes sense to be consistent and disable
gather unless we work out good heuristics on when to use it.  Since we
typically do not know the indices in advance, I don't see how that can be done.

I opened PR116582 with some examples of wins and losses.

Bootstrapped/regtested x86_64-linux, committed.


gcc/ChangeLog:

	* config/i386/x86-tune.def (X86_TUNE_USE_GATHER_2PARTS): Disable for
	ZNVER5.
	(X86_TUNE_USE_SCATTER_2PARTS): Disable for ZNVER5.
	(X86_TUNE_USE_GATHER_4PARTS): Disable for ZNVER5.
	(X86_TUNE_USE_SCATTER_4PARTS): Disable for ZNVER5.
	(X86_TUNE_USE_GATHER_8PARTS): Disable for ZNVER5.
	(X86_TUNE_USE_SCATTER_8PARTS): Disable for ZNVER5.

Comments

Richard Biener Sept. 3, 2024, 1:17 p.m. UTC | #1
On Tue, Sep 3, 2024 at 3:07 PM Jan Hubicka <hubicka@ucw.cz> wrote:
>
> Hi,
> We disable gathers for zen4.  It seems that gather has improved a bit compared
> to zen4 and Zen5 optimization manual suggests "Avoid GATHER instructions when
> the indices are known ahead of time. Vector loads followed by shuffles result
> in a higher load bandwidth." however the situation seems to be more
> complicated.
>
> gather is 5-10% loss on parest benchmark as well as 30% loss on sparse dot
> products in TSVC. Curiously enough breaking these out into microbenchmark
> reversed the situation and it turns out that the performance depends on
> how indices are distributed.  gather is loss if indices are sequential,
> neutral if they are random and win for some strides (4, 8).
>
> This seems to be similar to earlier zens, so I think (especially for
> backporting znver5 support) that it makes sense to be conistent and disable
> gather unless we work out a good heuristics on when to use it. Since we
> typically do not know the indices in advance, I don't see how that can be done.
>
> I opened PR116582 with some examples of wins and loses

Note there's no way to emulate masked gathers (well - emit control flow), so
they remain the choice when AVX512 is enabled and you have conditional
loads.  Similarly for stores and scatter, though there performance may well be
abysmal - something for the cost model to resolve.  Note I think x86 doesn't
yet expose AVX512 masked gather/scatter - the builtin target hook doesn't
support it and the backend doesn't have any mask_gather_load or
mask_scatter_store optabs to go the now preferred internal-fn way.
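
As a hedged illustration of the conditional-load case, a loop of the
following shape is what is meant: the indexed load may only happen when the
condition holds, so it cannot simply be done unconditionally and blended
afterwards.  The function name cond_sparse_sum is hypothetical and the code
is only a sketch, not taken from any testcase.

```c
#include <stddef.h>
#include <stdint.h>

/* Conditional indexed load.  x[idx[i]] is read only when cond[i] is
   nonzero; idx[i] may be out of range for inactive elements, so the
   load must not be speculated.  Vectorizing this needs either a masked
   gather or control flow around each element's load.  */
double
cond_sparse_sum (const double *x, const int32_t *idx,
                 const unsigned char *cond, size_t n)
{
  double sum = 0.0;
  for (size_t i = 0; i < n; i++)
    if (cond[i])
      sum += x[idx[i]];
  return sum;
}
```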

Open-coding 8-way gather is also heavy in code size and thus might affect
ucode re-use for large loops (OTOH gathers may take up much space in the
ucode cache or not be there at all).

Richard.

> Bootstrapped/regtested x86_64-linux, committed.
>
>
> gcc/ChangeLog:
>
>         * config/i386/x86-tune.def (X86_TUNE_USE_GATHER_2PARTS): Disable for
>         ZNVER5.
>         (X86_TUNE_USE_SCATTER_2PARTS): Disable for ZNVER5.
>         (X86_TUNE_USE_GATHER_4PARTS): Disable for ZNVER5.
>         (X86_TUNE_USE_SCATTER_4PARTS): Disable for ZNVER5.
>         (X86_TUNE_USE_GATHER_8PARTS): Disable for ZNVER5.
>         (X86_TUNE_USE_SCATTER_8PARTS): Disable for ZNVER5.
>
> diff --git a/gcc/config/i386/x86-tune.def b/gcc/config/i386/x86-tune.def
> index da1a3d6a3c6..ed26136faee 100644
> --- a/gcc/config/i386/x86-tune.def
> +++ b/gcc/config/i386/x86-tune.def
> @@ -476,35 +476,35 @@ DEF_TUNE (X86_TUNE_AVOID_4BYTE_PREFIXES, "avoid_4byte_prefixes",
>  /* X86_TUNE_USE_GATHER_2PARTS: Use gather instructions for vectors with 2
>     elements.  */
>  DEF_TUNE (X86_TUNE_USE_GATHER_2PARTS, "use_gather_2parts",
> -         ~(m_ZNVER1 | m_ZNVER2 | m_ZNVER3 | m_ZNVER4 | m_CORE_HYBRID
> +         ~(m_ZNVER | m_CORE_HYBRID
>             | m_YONGFENG | m_SHIJIDADAO | m_CORE_ATOM | m_GENERIC | m_GDS))
>
>  /* X86_TUNE_USE_SCATTER_2PARTS: Use scater instructions for vectors with 2
>     elements.  */
>  DEF_TUNE (X86_TUNE_USE_SCATTER_2PARTS, "use_scatter_2parts",
> -         ~(m_ZNVER4))
> +         ~(m_ZNVER4 | m_ZNVER5))
>
>  /* X86_TUNE_USE_GATHER_4PARTS: Use gather instructions for vectors with 4
>     elements.  */
>  DEF_TUNE (X86_TUNE_USE_GATHER_4PARTS, "use_gather_4parts",
> -         ~(m_ZNVER1 | m_ZNVER2 | m_ZNVER3 | m_ZNVER4 | m_CORE_HYBRID
> +         ~(m_ZNVER | m_CORE_HYBRID
>             | m_YONGFENG | m_SHIJIDADAO | m_CORE_ATOM | m_GENERIC | m_GDS))
>
>  /* X86_TUNE_USE_SCATTER_4PARTS: Use scater instructions for vectors with 4
>     elements.  */
>  DEF_TUNE (X86_TUNE_USE_SCATTER_4PARTS, "use_scatter_4parts",
> -         ~(m_ZNVER4))
> +         ~(m_ZNVER4 | m_ZNVER5))
>
>  /* X86_TUNE_USE_GATHER: Use gather instructions for vectors with 8 or more
>     elements.  */
>  DEF_TUNE (X86_TUNE_USE_GATHER_8PARTS, "use_gather_8parts",
> -         ~(m_ZNVER1 | m_ZNVER2 | m_ZNVER4 | m_CORE_HYBRID | m_CORE_ATOM
> +         ~(m_ZNVER | m_CORE_HYBRID | m_CORE_ATOM
>             | m_YONGFENG | m_SHIJIDADAO | m_GENERIC | m_GDS))
>
>  /* X86_TUNE_USE_SCATTER: Use scater instructions for vectors with 8 or more
>     elements.  */
>  DEF_TUNE (X86_TUNE_USE_SCATTER_8PARTS, "use_scatter_8parts",
> -         ~(m_ZNVER4))
> +         ~(m_ZNVER4 | m_ZNVER5))
>
>  /* X86_TUNE_AVOID_128FMA_CHAINS: Avoid creating loops with tight 128bit or
>     smaller FMA chain.  */
Toon Moene Sept. 4, 2024, 8:39 a.m. UTC | #2
On 9/3/24 15:07, Jan Hubicka wrote:

> Hi,
> We disable gathers for zen4.  It seems that gather has improved a bit compared
> to zen4 and Zen5 optimization manual suggests "Avoid GATHER instructions when
> the indices are known ahead of time. Vector loads followed by shuffles result
> in a higher load bandwidth." however the situation seems to be more
> complicated.

A small bit of "real world" experience (but for zen3):

Recently I switched to gfortran 14.2 for my weather forecasting.
A year ago I had changed "-march=native -mtune=native" (on my zen3 
system) to "-march=native -mtune=znver2" while using gfortran 13 - it 
had only a small effect (but positive).

Last Monday I switched back to "-march=native -mtune=native", but that 
consistently made a 12 hour computation around 6 minutes slower (i.e., 
about 1/120th, or 0.8 %). The most computationally intensive part of the
code needs gather (either instructions or inline expansions of them).

Hope this helps,
Jan Hubicka Sept. 4, 2024, 10:55 a.m. UTC | #3
> On 9/3/24 15:07, Jan Hubicka wrote:
> 
> > Hi,
> > We disable gathers for zen4.  It seems that gather has improved a bit compared
> > to zen4 and Zen5 optimization manual suggests "Avoid GATHER instructions when
> > the indices are known ahead of time. Vector loads followed by shuffles result
> > in a higher load bandwidth." however the situation seems to be more
> > complicated.
> 
> A small bit of "real world" experience (but for zen3):
> 
> Recently I switched to gfortran 14.2 for my weather forecasting.
> A year ago I had changed "-march=native -mtune=native" (on my zen3 system)
> to "-march=native -mtune=znver2" while using gfortran 13 - it had only a
> small effect (but positive).
> 
> Last Monday I switched back to "-march=native -mtune=native", but that
> consistently made a 12 hour computation around 6 minutes slower (i.e., about
> 1/120th, or 0.8 %). The most computational intensive part of the code needs
> gather (either instructions or inline expansions of them).

It would be nice to know what is causing this. Gathers can be enabled
using -mtune-ctrl=use_gather and I would be happy to know about real
world situations where they help.

I am still looking into this.  IMO disabling gather like on other zens
makes sense, especially for backporting.  For trunk it probably makes sense
to look for heuristics that carefully enable gathers.  It is not clear to me
how to benchmark them or how to set up the heuristics.  Spec2017 has very
small coverage for loops requiring gathers and so does tsvc.  I did some
micro-benchmarks but their behaviour is, well, puzzling.  Having additional
data would be great.

As Richard mentioned, it probably makes sense to enable masked gathers,
since the open-coded version needs conditionals and we would not
vectorize at all.  I am not sure if we can do that with current APIs.
I will cook up a micro-benchmark for that.

Concerning code size, I am not sure how much that applies in practice
since gathers are used relatively sporadically and the vectorizer blows up
the code a lot anyway, but certainly one can construct an example with
very many loops needing gather...

My guess is that array prefetching data is annotated to the instruction
cache and since gather produces a lot of loads, the data probably simply does
not fit.  Open-coding the gather makes extra space for this info...

Honza
Richard Biener Sept. 4, 2024, 11:16 a.m. UTC | #4
On Wed, Sep 4, 2024 at 12:56 PM Jan Hubicka <hubicka@ucw.cz> wrote:
>
> > On 9/3/24 15:07, Jan Hubicka wrote:
> >
> > > Hi,
> > > We disable gathers for zen4.  It seems that gather has improved a bit compared
> > > to zen4 and Zen5 optimization manual suggests "Avoid GATHER instructions when
> > > the indices are known ahead of time. Vector loads followed by shuffles result
> > > in a higher load bandwidth." however the situation seems to be more
> > > complicated.
> >
> > A small bit of "real world" experience (but for zen3):
> >
> > Recently I switched to gfortran 14.2 for my weather forecasting.
> > A year ago I had changed "-march=native -mtune=native" (on my zen3 system)
> > to "-march=native -mtune=znver2" while using gfortran 13 - it had only a
> > small effect (but positive).
> >
> > Last Monday I switched back to "-march=native -mtune=native", but that
> > consistently made a 12 hour computation around 6 minutes slower (i.e., about
> > 1/120th, or 0.8 %). The most computational intensive part of the code needs
> > gather (either instructions or inline expansions of them).
>
> It would be nice to know what is causing this. Gathers can be enabled
> using -mtune-ctrl=use_gather and I would be happy to know about real
> world situations where they help.
>
> I am still looking into this.  IMO disabling gather like on other zens
> makes sense especially for backporting. For trunk
> it probably makes sense to look for heuristics carefully enabling
> gathers.  It is not clear to me how to benchmark them or how to set up
> heuristics.  Spec2017 has very small coverage for loops requiring
> gathers and so does tsvc. I did some micro-benchmarks but their
> behaviour is, well, puzzling. Having additional data would be great.
>
> As Richard mentioned, it probably makes sense to enable masked gathers,
> since the open coded version needs condiitonals and we would not
> vectorize at all.  I am not sure if we can do that with current APIs.
> I will cook up a micro-benchmarks for that.

See also https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85919, the
targetm.vectorize.builtin_gather/targetm.vectorize.builtin_scatter interface
is legacy and does not support masking at all.

> Concerning code size, I am not sure how much that applies in practice
> since gathers are used relatively sporadically and vectorizer blows up
> the code a lot anyways, but certainly one can construct example with
> very many loops needing gather...
>
> My guess is that array prefetching data is annotated to the instructoin
> cache and since gather produces a lot of loads, probably data simply does
> not fit. Opencoding the gather makes extra space for this info...
>
> Honza
>
Toon Moene Sept. 4, 2024, 2:57 p.m. UTC | #5
On 9/4/24 12:55, Jan Hubicka wrote:

>> On 9/3/24 15:07, Jan Hubicka wrote:
>>
>>> Hi,
>>> We disable gathers for zen4.  It seems that gather has improved a bit compared
>>> to zen4 and Zen5 optimization manual suggests "Avoid GATHER instructions when
>>> the indices are known ahead of time. Vector loads followed by shuffles result
>>> in a higher load bandwidth." however the situation seems to be more
>>> complicated.
>>
>> A small bit of "real world" experience (but for zen3):
>>
>> Recently I switched to gfortran 14.2 for my weather forecasting.
>> A year ago I had changed "-march=native -mtune=native" (on my zen3 system)
>> to "-march=native -mtune=znver2" while using gfortran 13 - it had only a
>> small effect (but positive).
>>
>> Last Monday I switched back to "-march=native -mtune=native", but that
>> consistently made a 12 hour computation around 6 minutes slower (i.e., about
>> 1/120th, or 0.8 %). The most computational intensive part of the code needs
>> gather (either instructions or inline expansions of them).
> 
> It would be nice to know what is causing this. Gathers can be enabled
> using -mtune-ctrl=use_gather and I would be happy to know about real
> world situations where they help.

Ah - one detail that I forgot to mention: our code is "special" in the 
sense that it uses 32-bit floats while it runs on 64-bit address space.

So its use of gather instructions is rather suboptimal, needing 2 gather 
instructions for each actual "gather operation".

Hope this helps,

Patch

diff --git a/gcc/config/i386/x86-tune.def b/gcc/config/i386/x86-tune.def
index da1a3d6a3c6..ed26136faee 100644
--- a/gcc/config/i386/x86-tune.def
+++ b/gcc/config/i386/x86-tune.def
@@ -476,35 +476,35 @@  DEF_TUNE (X86_TUNE_AVOID_4BYTE_PREFIXES, "avoid_4byte_prefixes",
 /* X86_TUNE_USE_GATHER_2PARTS: Use gather instructions for vectors with 2
    elements.  */
 DEF_TUNE (X86_TUNE_USE_GATHER_2PARTS, "use_gather_2parts",
-	  ~(m_ZNVER1 | m_ZNVER2 | m_ZNVER3 | m_ZNVER4 | m_CORE_HYBRID
+	  ~(m_ZNVER | m_CORE_HYBRID
 	    | m_YONGFENG | m_SHIJIDADAO | m_CORE_ATOM | m_GENERIC | m_GDS))
 
 /* X86_TUNE_USE_SCATTER_2PARTS: Use scater instructions for vectors with 2
    elements.  */
 DEF_TUNE (X86_TUNE_USE_SCATTER_2PARTS, "use_scatter_2parts",
-	  ~(m_ZNVER4))
+	  ~(m_ZNVER4 | m_ZNVER5))
 
 /* X86_TUNE_USE_GATHER_4PARTS: Use gather instructions for vectors with 4
    elements.  */
 DEF_TUNE (X86_TUNE_USE_GATHER_4PARTS, "use_gather_4parts",
-	  ~(m_ZNVER1 | m_ZNVER2 | m_ZNVER3 | m_ZNVER4 | m_CORE_HYBRID
+	  ~(m_ZNVER | m_CORE_HYBRID
 	    | m_YONGFENG | m_SHIJIDADAO | m_CORE_ATOM | m_GENERIC | m_GDS))
 
 /* X86_TUNE_USE_SCATTER_4PARTS: Use scater instructions for vectors with 4
    elements.  */
 DEF_TUNE (X86_TUNE_USE_SCATTER_4PARTS, "use_scatter_4parts",
-	  ~(m_ZNVER4))
+	  ~(m_ZNVER4 | m_ZNVER5))
 
 /* X86_TUNE_USE_GATHER: Use gather instructions for vectors with 8 or more
    elements.  */
 DEF_TUNE (X86_TUNE_USE_GATHER_8PARTS, "use_gather_8parts",
-	  ~(m_ZNVER1 | m_ZNVER2 | m_ZNVER4 | m_CORE_HYBRID | m_CORE_ATOM
+	  ~(m_ZNVER | m_CORE_HYBRID | m_CORE_ATOM
 	    | m_YONGFENG | m_SHIJIDADAO | m_GENERIC | m_GDS))
 
 /* X86_TUNE_USE_SCATTER: Use scater instructions for vectors with 8 or more
    elements.  */
 DEF_TUNE (X86_TUNE_USE_SCATTER_8PARTS, "use_scatter_8parts",
-	  ~(m_ZNVER4))
+	  ~(m_ZNVER4 | m_ZNVER5))
 
 /* X86_TUNE_AVOID_128FMA_CHAINS: Avoid creating loops with tight 128bit or
    smaller FMA chain.  */