
[v10,3/3] x86: Make the divisor in setting `non_temporal_threshold` cpu specific

Message ID 20230527184632.694761-3-goldstein.w.n@gmail.com
State New
Series [v10,1/3] x86: Increase `non_temporal_threshold` to roughly `sizeof_L3 / 4`

Commit Message

Noah Goldstein May 27, 2023, 6:46 p.m. UTC
Different systems prefer different divisors.

From the benchmarks [1] run so far, the following divisors have been found:
    ICX     : 2
    SKX     : 2
    BWD     : 8

For Intel, we are generalizing that BWD and older prefer 8 as a
divisor, and SKL and newer prefer 2. These divisors can be further tuned
as more benchmarks are run.

[1]: https://github.com/goldsteinn/memcpy-nt-benchmarks
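
As a rough sketch (illustrative only; the function name here is made up and
the real logic lives in sysdeps/x86/dl-cacheinfo.h and
sysdeps/x86/cpu-features.c), the divisor is consumed like this when deriving
the threshold:

```c
/* Illustrative sketch only -- not the actual glibc code.  `shared' is the
   detected L3 cache size and `divisor' is the microarch-specific
   cachesize_non_temporal_divisor set during CPU feature initialization.  */
static unsigned long int
nt_threshold_from_divisor (unsigned long int shared,
                           unsigned long int divisor)
{
  if (divisor == 0)
    divisor = 4;              /* Generic default when no override is set.  */
  return shared / divisor;    /* e.g. SKX/ICX: /2, BWD and older: /8.  */
}
```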
---
 sysdeps/x86/cpu-features.c         | 31 ++++++++++++++++++++---------
 sysdeps/x86/dl-cacheinfo.h         | 32 ++++++++++++++++++------------
 sysdeps/x86/dl-diagnostics-cpu.c   | 11 ++++++----
 sysdeps/x86/include/cpu-features.h |  3 +++
 4 files changed, 51 insertions(+), 26 deletions(-)

Comments

DJ Delorie May 31, 2023, 2:33 a.m. UTC | #1
LGTM.
Reviewed-by: DJ Delorie <dj@redhat.com>
Sajan Karumanchi July 10, 2023, 5:23 a.m. UTC | #2
Noah,
I verified your patches on the master branch that impact the non-temporal
threshold parameter on x86 CPUs. This patch changes the non-temporal
threshold value from 24MB (3/4th of L3$) to 8MB (1/4th of L3$) on ZEN4.
From the Glibc benchmarks, we saw a significant performance drop ranging
from 15% to 99% for size ranges of 8MB to 16MB.
I also ran the new tool you developed on all Zen architectures, and the
results indicate that 3/4th of the L3 size holds good on AMD CPUs.
Hence the current patch degrades the performance of AMD CPUs.
We strongly recommend restricting this change to Intel CPUs only.

Thanks,
Sajan K.
Noah Goldstein July 10, 2023, 3:58 p.m. UTC | #3
On Mon, Jul 10, 2023 at 12:23 AM Sajan Karumanchi
<sajan.karumanchi@gmail.com> wrote:
>
> Noah,
> I verified your patches on the master branch that impacts the non-threshold
>  parameter on x86 CPUs. This patch modifies the non-temporal threshold value
> from 24MB(3/4th of L3$) to 8MB(1/4th of L3$) on ZEN4.
> From the Glibc benchmarks, we saw a significant performance drop ranging
> from 15% to 99% for size ranges of 8MB to 16MB.
> I also ran the new tool developed by you on all Zen architectures and the
> results conclude that 3/4th L3 size holds good on AMD CPUs.
> Hence the current patch degrades the performance of AMD CPUs.
> We strongly recommend marking this change to Intel CPUs only.
>

So it shouldn't actually go down. I think what is missing is:
```
get_common_cache_info (&shared, &shared_per_thread, &threads, core);
```

In the AMD case shared == shared_per_thread, which shouldn't really
be the case.

The intended new calculation is: Total_L3_Size / Scale
as opposed to: (L3_Size / NThread) / Scale
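
As a made-up illustration: with a 32 MiB shared L3, 16 threads, and Scale = 4,
the intended form gives 32 MiB / 4 = 8 MiB, whereas the per-thread form gives
(32 MiB / 16) / 4 = 0.5 MiB.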

Before just going with default for AMD, maybe try out the following patch?

```
---
 sysdeps/x86/dl-cacheinfo.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/sysdeps/x86/dl-cacheinfo.h b/sysdeps/x86/dl-cacheinfo.h
index c98fa57a7b..c1866ca898 100644
--- a/sysdeps/x86/dl-cacheinfo.h
+++ b/sysdeps/x86/dl-cacheinfo.h
@@ -717,6 +717,7 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
       level3_cache_assoc = handle_amd (_SC_LEVEL3_CACHE_ASSOC);
       level3_cache_linesize = handle_amd (_SC_LEVEL3_CACHE_LINESIZE);

+      get_common_cache_info (&shared, &shared_per_thread, &threads, core);
       if (shared <= 0)
         /* No shared L3 cache.  All we have is the L2 cache.  */
  shared = core;
--
2.34.1
```
Noah Goldstein July 14, 2023, 2:21 a.m. UTC | #4
On Mon, Jul 10, 2023 at 10:58 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>
> On Mon, Jul 10, 2023 at 12:23 AM Sajan Karumanchi
> <sajan.karumanchi@gmail.com> wrote:
> >
> > Noah,
> > I verified your patches on the master branch that impacts the non-threshold
> >  parameter on x86 CPUs. This patch modifies the non-temporal threshold value
> > from 24MB(3/4th of L3$) to 8MB(1/4th of L3$) on ZEN4.
> > From the Glibc benchmarks, we saw a significant performance drop ranging
> > from 15% to 99% for size ranges of 8MB to 16MB.
> > I also ran the new tool developed by you on all Zen architectures and the
> > results conclude that 3/4th L3 size holds good on AMD CPUs.
> > Hence the current patch degrades the performance of AMD CPUs.
> > We strongly recommend marking this change to Intel CPUs only.
> >
>
> So it shouldn't actually go down. I think what is missing is:
> ```
> get_common_cache_info (&shared, &shared_per_thread, &threads, core);
> ```
>
> In the AMD case shared == shared_per_thread which shouldn't really
> be the case.
>
> The intended new calculation is: Total_L3_Size / Scale
> as opposed to: (L3_Size / NThread) / Scale"
>
> Before just going with default for AMD, maybe try out the following patch?
>
> ```
> ---
>  sysdeps/x86/dl-cacheinfo.h | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/sysdeps/x86/dl-cacheinfo.h b/sysdeps/x86/dl-cacheinfo.h
> index c98fa57a7b..c1866ca898 100644
> --- a/sysdeps/x86/dl-cacheinfo.h
> +++ b/sysdeps/x86/dl-cacheinfo.h
> @@ -717,6 +717,7 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
>        level3_cache_assoc = handle_amd (_SC_LEVEL3_CACHE_ASSOC);
>        level3_cache_linesize = handle_amd (_SC_LEVEL3_CACHE_LINESIZE);
>
> +      get_common_cache_info (&shared, &shared_per_thread, &threads, core);
>        if (shared <= 0)
>          /* No shared L3 cache.  All we have is the L2 cache.  */
>   shared = core;
> --
> 2.34.1
> ```
> > Thanks,
> > Sajan K.
> >

Ping. 2.38 is approaching, and I expect you want to get any fixes in before
that.
Sajan Karumanchi July 14, 2023, 7:39 a.m. UTC | #5
Noah,
On Mon, Jul 10, 2023 at 9:28 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>
> On Mon, Jul 10, 2023 at 12:23 AM Sajan Karumanchi
> <sajan.karumanchi@gmail.com> wrote:
> >
> > Noah,
> > I verified your patches on the master branch that impacts the non-threshold
> >  parameter on x86 CPUs. This patch modifies the non-temporal threshold value
> > from 24MB(3/4th of L3$) to 8MB(1/4th of L3$) on ZEN4.
> > From the Glibc benchmarks, we saw a significant performance drop ranging
> > from 15% to 99% for size ranges of 8MB to 16MB.
> > I also ran the new tool developed by you on all Zen architectures and the
> > results conclude that 3/4th L3 size holds good on AMD CPUs.
> > Hence the current patch degrades the performance of AMD CPUs.
> > We strongly recommend marking this change to Intel CPUs only.
> >
>
> So it shouldn't actually go down. I think what is missing is:
> ```
> get_common_cache_info (&shared, &shared_per_thread, &threads, core);
> ```
>
The cache info of AMD CPUs is spread across the CPUID registers
0x80000005, 0x80000006, and 0x8000001D.
But 'get_common_cache_info(...)' uses CPUID register 0x00000004 to
enumerate cache details, which leads to an infinite loop in the
initialization stage when enumerating the cache details on AMD CPUs.
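
For context, here is a rough sketch (illustrative only, not glibc's
handle_amd code; it assumes TOPOEXT is available so leaf 0x8000001D is
valid) of how the AMD cache hierarchy can be walked, analogous to how
leaf 0x4 is walked on Intel:

```c
#include <cpuid.h>

/* Illustrative only: walk CPUID leaf 0x8000001d subleaves (AMD, requires
   TOPOEXT) and return the size in bytes of the cache at the requested
   level, or 0 if no such cache is reported.  */
static unsigned long int
amd_cache_size_8000001d (unsigned int wanted_level)
{
  for (unsigned int subleaf = 0; ; ++subleaf)
    {
      unsigned int eax, ebx, ecx, edx;
      __cpuid_count (0x8000001d, subleaf, eax, ebx, ecx, edx);

      unsigned int cache_type = eax & 0x1f;
      if (cache_type == 0)              /* No more caches.  */
        return 0;

      unsigned int level = (eax >> 5) & 0x7;
      if (level != wanted_level)
        continue;

      /* Fields are encoded as value - 1, same layout as leaf 0x4.  */
      unsigned long int line_size  = (ebx & 0xfff) + 1;
      unsigned long int partitions = ((ebx >> 12) & 0x3ff) + 1;
      unsigned long int ways       = ((ebx >> 22) & 0x3ff) + 1;
      unsigned long int sets       = (unsigned long int) ecx + 1;
      return ways * partitions * line_size * sets;
    }
}
```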

> In the AMD case shared == shared_per_thread which shouldn't really
> be the case.
>
> The intended new calculation is: Total_L3_Size / Scale
> as opposed to: (L3_Size / NThread) / Scale"
>
AMD Zen CPUs are chiplet-based, so we consider only the L3 per CCX for
computing the nt_threshold.
* handle_amd(_SC_LEVEL3_CACHE_SIZE) initializes the 'shared' variable with
'l3_cache_per_ccx' for Zen architectures and 'l3_cache_per_thread' for
pre-Zen architectures.
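For instance, on a Zen 4 part with 32 MiB of L3 per CCX, 'shared' is 32 MiB,
so the old 3/4 factor gives 24 MiB while the new default divisor of 4 gives
8 MiB, matching the drop reported above.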

> Before just going with default for AMD, maybe try out the following patch?
>
Since the cache info registers and the approach to computing the cache
details on AMD are different from Intel's, we cannot use the patch
below.
> ```
> ---
>  sysdeps/x86/dl-cacheinfo.h | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/sysdeps/x86/dl-cacheinfo.h b/sysdeps/x86/dl-cacheinfo.h
> index c98fa57a7b..c1866ca898 100644
> --- a/sysdeps/x86/dl-cacheinfo.h
> +++ b/sysdeps/x86/dl-cacheinfo.h
> @@ -717,6 +717,7 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
>        level3_cache_assoc = handle_amd (_SC_LEVEL3_CACHE_ASSOC);
>        level3_cache_linesize = handle_amd (_SC_LEVEL3_CACHE_LINESIZE);
>
> +      get_common_cache_info (&shared, &shared_per_thread, &threads, core);
>        if (shared <= 0)
>          /* No shared L3 cache.  All we have is the L2 cache.  */
>   shared = core;
> --
> 2.34.1
> ```
> > Thanks,
> > Sajan K.
> >

Patch

diff --git a/sysdeps/x86/cpu-features.c b/sysdeps/x86/cpu-features.c
index 1b6e00c88f..325ec2b825 100644
--- a/sysdeps/x86/cpu-features.c
+++ b/sysdeps/x86/cpu-features.c
@@ -636,6 +636,7 @@  init_cpu_features (struct cpu_features *cpu_features)
   unsigned int stepping = 0;
   enum cpu_features_kind kind;
 
+  cpu_features->cachesize_non_temporal_divisor = 4;
 #if !HAS_CPUID
   if (__get_cpuid_max (0, 0) == 0)
     {
@@ -716,13 +717,13 @@  init_cpu_features (struct cpu_features *cpu_features)
 
 	      /* Bigcore/Default Tuning.  */
 	    default:
+	    default_tuning:
 	      /* Unknown family 0x06 processors.  Assuming this is one
 		 of Core i3/i5/i7 processors if AVX is available.  */
 	      if (!CPU_FEATURES_CPU_P (cpu_features, AVX))
 		break;
-	      /* Fall through.  */
-	    case INTEL_BIGCORE_NEHALEM:
-	    case INTEL_BIGCORE_WESTMERE:
+
+	    enable_modern_features:
 	      /* Rep string instructions, unaligned load, unaligned copy,
 		 and pminub are fast on Intel Core i3, i5 and i7.  */
 	      cpu_features->preferred[index_arch_Fast_Rep_String]
@@ -732,12 +733,23 @@  init_cpu_features (struct cpu_features *cpu_features)
 		      | bit_arch_Prefer_PMINUB_for_stringop);
 	      break;
 
-	   /*
-	    Default tuned Bigcore microarch.
+	    case INTEL_BIGCORE_NEHALEM:
+	    case INTEL_BIGCORE_WESTMERE:
+	      /* Older CPUs prefer non-temporal stores at lower threshold.  */
+	      cpu_features->cachesize_non_temporal_divisor = 8;
+	      goto enable_modern_features;
+
+	      /* Older Bigcore microarch (smaller non-temporal store
+		 threshold).  */
 	    case INTEL_BIGCORE_SANDYBRIDGE:
 	    case INTEL_BIGCORE_IVYBRIDGE:
 	    case INTEL_BIGCORE_HASWELL:
 	    case INTEL_BIGCORE_BROADWELL:
+	      cpu_features->cachesize_non_temporal_divisor = 8;
+	      goto default_tuning;
+
+	      /* Newer Bigcore microarch (larger non-temporal store
+		 threshold).  */
 	    case INTEL_BIGCORE_SKYLAKE:
 	    case INTEL_BIGCORE_KABYLAKE:
 	    case INTEL_BIGCORE_COMETLAKE:
@@ -753,13 +765,14 @@  init_cpu_features (struct cpu_features *cpu_features)
 	    case INTEL_BIGCORE_SAPPHIRERAPIDS:
 	    case INTEL_BIGCORE_EMERALDRAPIDS:
 	    case INTEL_BIGCORE_GRANITERAPIDS:
-	    */
+	      cpu_features->cachesize_non_temporal_divisor = 2;
+	      goto default_tuning;
 
-	   /*
-	    Default tuned Mixed (bigcore + atom SOC).
+	      /* Default tuned Mixed (bigcore + atom SOC). */
 	    case INTEL_MIXED_LAKEFIELD:
 	    case INTEL_MIXED_ALDERLAKE:
-	    */
+	      cpu_features->cachesize_non_temporal_divisor = 2;
+	      goto default_tuning;
 	    }
 
 	      /* Disable TSX on some processors to avoid TSX on kernels that
diff --git a/sysdeps/x86/dl-cacheinfo.h b/sysdeps/x86/dl-cacheinfo.h
index 4a1a5423ff..8292a4a50d 100644
--- a/sysdeps/x86/dl-cacheinfo.h
+++ b/sysdeps/x86/dl-cacheinfo.h
@@ -738,19 +738,25 @@  dl_init_cacheinfo (struct cpu_features *cpu_features)
   cpu_features->level3_cache_linesize = level3_cache_linesize;
   cpu_features->level4_cache_size = level4_cache_size;
 
-  /* The default setting for the non_temporal threshold is 1/4 of size
-     of the chip's cache. For most Intel and AMD processors with an
-     initial release date between 2017 and 2023, a thread's typical
-     share of the cache is from 18-64MB. Using the 1/4 L3 is meant to
-     estimate the point where non-temporal stores begin outcompeting
-     REP MOVSB. As well the point where the fact that non-temporal
-     stores are forced back to main memory would already occurred to the
-     majority of the lines in the copy. Note, concerns about the
-     entire L3 cache being evicted by the copy are mostly alleviated
-     by the fact that modern HW detects streaming patterns and
-     provides proper LRU hints so that the maximum thrashing
-     capped at 1/associativity. */
-  unsigned long int non_temporal_threshold = shared / 4;
+  unsigned long int cachesize_non_temporal_divisor
+      = cpu_features->cachesize_non_temporal_divisor;
+  if (cachesize_non_temporal_divisor <= 0)
+    cachesize_non_temporal_divisor = 4;
+
+  /* The default setting for the non_temporal threshold is [1/8, 1/2] of size
+     of the chip's cache (depending on `cachesize_non_temporal_divisor` which
+     is microarch specific. The default is 1/4). For most Intel and AMD
+     processors with an initial release date between 2017 and 2023, a thread's
+     typical share of the cache is from 18-64MB. Using a reasonable size
+     fraction of L3 is meant to estimate the point where non-temporal stores
+     begin outcompeting REP MOVSB. As well the point where the fact that
+     non-temporal stores are forced back to main memory would already occurred
+     to the majority of the lines in the copy. Note, concerns about the entire
+     L3 cache being evicted by the copy are mostly alleviated by the fact that
+     modern HW detects streaming patterns and provides proper LRU hints so that
+     the maximum thrashing capped at 1/associativity. */
+  unsigned long int non_temporal_threshold
+      = shared / cachesize_non_temporal_divisor;
   /* If no ERMS, we use the per-thread L3 chunking. Normal cacheable stores run
      a higher risk of actually thrashing the cache as they don't have a HW LRU
      hint. As well, there performance in highly parallel situations is
diff --git a/sysdeps/x86/dl-diagnostics-cpu.c b/sysdeps/x86/dl-diagnostics-cpu.c
index a1578e4665..5aab63e532 100644
--- a/sysdeps/x86/dl-diagnostics-cpu.c
+++ b/sysdeps/x86/dl-diagnostics-cpu.c
@@ -113,8 +113,11 @@  _dl_diagnostics_cpu (void)
                             cpu_features->level3_cache_linesize);
   print_cpu_features_value ("level4_cache_size",
                             cpu_features->level4_cache_size);
-  _Static_assert (offsetof (struct cpu_features, level4_cache_size)
-                  + sizeof (cpu_features->level4_cache_size)
-                  == sizeof (*cpu_features),
-                  "last cpu_features field has been printed");
+  print_cpu_features_value ("cachesize_non_temporal_divisor",
+			    cpu_features->cachesize_non_temporal_divisor);
+  _Static_assert (
+      offsetof (struct cpu_features, cachesize_non_temporal_divisor)
+	      + sizeof (cpu_features->cachesize_non_temporal_divisor)
+	  == sizeof (*cpu_features),
+      "last cpu_features field has been printed");
 }
diff --git a/sysdeps/x86/include/cpu-features.h b/sysdeps/x86/include/cpu-features.h
index 40b8129d6a..c740e1a5fc 100644
--- a/sysdeps/x86/include/cpu-features.h
+++ b/sysdeps/x86/include/cpu-features.h
@@ -945,6 +945,9 @@  struct cpu_features
   unsigned long int level3_cache_linesize;
   /* /_SC_LEVEL4_CACHE_SIZE.  */
   unsigned long int level4_cache_size;
+  /* When no user non_temporal_threshold is specified, we default to
+     cachesize / cachesize_non_temporal_divisor.  */
+  unsigned long int cachesize_non_temporal_divisor;
 };
 
 /* Get a pointer to the CPU features structure.  */