Message ID | 20230527184632.694761-3-goldstein.w.n@gmail.com |
---|---|
State | New |
Series | [v10,1/3] x86: Increase `non_temporal_threshold` to roughly `sizeof_L3 / 4` |
LGTM.
Reviewed-by: DJ Delorie <dj@redhat.com>
Noah,

I verified your patches on the master branch that impact the non-temporal
threshold parameter on x86 CPUs. This patch modifies the non-temporal
threshold value from 24MB (3/4th of L3$) to 8MB (1/4th of L3$) on ZEN4.
From the Glibc benchmarks, we saw a significant performance drop ranging
from 15% to 99% for size ranges of 8MB to 16MB.
I also ran the new tool developed by you on all Zen architectures and the
results conclude that 3/4th of the L3 size holds good on AMD CPUs.
Hence the current patch degrades the performance of AMD CPUs.
We strongly recommend limiting this change to Intel CPUs only.

Thanks,
Sajan K.
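For context on the numbers quoted above: they are consistent with a 32 MiB L3 slice (one Zen 4 CCX), where 3/4 of the slice is 24 MiB and 1/4 is 8 MiB. The short C sketch below is illustration only, not glibc code, and the 32 MiB figure is an assumption used purely for the arithmetic:

```
/* Illustration only -- not glibc code.  Assumes a 32 MiB shared L3 slice
   (one Zen 4 CCX, assumed) to reproduce the 24 MiB -> 8 MiB change above.  */
#include <stdio.h>

int
main (void)
{
  unsigned long l3_bytes = 32UL << 20;   /* 32 MiB (assumed example value) */
  printf ("old bound (3/4 of L3): %lu MiB\n", (l3_bytes / 4 * 3) >> 20); /* 24 */
  printf ("new bound (1/4 of L3): %lu MiB\n", (l3_bytes / 4) >> 20);     /*  8 */
  return 0;
}
```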
On Mon, Jul 10, 2023 at 12:23 AM Sajan Karumanchi
<sajan.karumanchi@gmail.com> wrote:
>
> Noah,
> I verified your patches on the master branch that impact the non-temporal
> threshold parameter on x86 CPUs. This patch modifies the non-temporal
> threshold value from 24MB (3/4th of L3$) to 8MB (1/4th of L3$) on ZEN4.
> From the Glibc benchmarks, we saw a significant performance drop ranging
> from 15% to 99% for size ranges of 8MB to 16MB.
> I also ran the new tool developed by you on all Zen architectures and the
> results conclude that 3/4th of the L3 size holds good on AMD CPUs.
> Hence the current patch degrades the performance of AMD CPUs.
> We strongly recommend limiting this change to Intel CPUs only.
>

So it shouldn't actually go down. I think what is missing is:
```
get_common_cache_info (&shared, &shared_per_thread, &threads, core);
```

In the AMD case shared == shared_per_thread, which shouldn't really
be the case.

The intended new calculation is: Total_L3_Size / Scale
as opposed to: (L3_Size / NThread) / Scale

Before just going with default for AMD, maybe try out the following patch?
```
---
 sysdeps/x86/dl-cacheinfo.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/sysdeps/x86/dl-cacheinfo.h b/sysdeps/x86/dl-cacheinfo.h
index c98fa57a7b..c1866ca898 100644
--- a/sysdeps/x86/dl-cacheinfo.h
+++ b/sysdeps/x86/dl-cacheinfo.h
@@ -717,6 +717,7 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
   level3_cache_assoc = handle_amd (_SC_LEVEL3_CACHE_ASSOC);
   level3_cache_linesize = handle_amd (_SC_LEVEL3_CACHE_LINESIZE);

+  get_common_cache_info (&shared, &shared_per_thread, &threads, core);
   if (shared <= 0)
     /* No shared L3 cache.  All we have is the L2 cache.  */
     shared = core;
--
2.34.1
```

> Thanks,
> Sajan K.
>
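To make the two formulas above concrete, here is a small standalone C sketch, not glibc code; the 64 MiB L3 and 16-thread figures are made-up example values chosen only to show how far apart the two calculations land:

```
/* Sketch only: contrasts "Total_L3_Size / Scale" with
   "(L3_Size / NThread) / Scale" for one made-up machine.  */
#include <stdio.h>

int
main (void)
{
  unsigned long total_l3 = 64UL << 20;   /* 64 MiB shared L3 (example value) */
  unsigned long nthreads = 16;           /* threads sharing that L3 (example) */
  unsigned long scale = 4;               /* divisor used by the patch */

  unsigned long intended = total_l3 / scale;                 /* 16 MiB */
  unsigned long per_thread = (total_l3 / nthreads) / scale;  /*  1 MiB */

  printf ("Total_L3_Size / Scale       = %lu MiB\n", intended >> 20);
  printf ("(L3_Size / NThread) / Scale = %lu MiB\n", per_thread >> 20);
  return 0;
}
```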
On Mon, Jul 10, 2023 at 10:58 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>
> On Mon, Jul 10, 2023 at 12:23 AM Sajan Karumanchi
> <sajan.karumanchi@gmail.com> wrote:
> >
> > Noah,
> > I verified your patches on the master branch that impact the non-temporal
> > threshold parameter on x86 CPUs. This patch modifies the non-temporal
> > threshold value from 24MB (3/4th of L3$) to 8MB (1/4th of L3$) on ZEN4.
> > From the Glibc benchmarks, we saw a significant performance drop ranging
> > from 15% to 99% for size ranges of 8MB to 16MB.
> > I also ran the new tool developed by you on all Zen architectures and the
> > results conclude that 3/4th of the L3 size holds good on AMD CPUs.
> > Hence the current patch degrades the performance of AMD CPUs.
> > We strongly recommend limiting this change to Intel CPUs only.
> >
>
> So it shouldn't actually go down. I think what is missing is:
> ```
> get_common_cache_info (&shared, &shared_per_thread, &threads, core);
> ```
>
> In the AMD case shared == shared_per_thread, which shouldn't really
> be the case.
>
> The intended new calculation is: Total_L3_Size / Scale
> as opposed to: (L3_Size / NThread) / Scale
>
> Before just going with default for AMD, maybe try out the following patch?
> ```
> ---
>  sysdeps/x86/dl-cacheinfo.h | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/sysdeps/x86/dl-cacheinfo.h b/sysdeps/x86/dl-cacheinfo.h
> index c98fa57a7b..c1866ca898 100644
> --- a/sysdeps/x86/dl-cacheinfo.h
> +++ b/sysdeps/x86/dl-cacheinfo.h
> @@ -717,6 +717,7 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
>    level3_cache_assoc = handle_amd (_SC_LEVEL3_CACHE_ASSOC);
>    level3_cache_linesize = handle_amd (_SC_LEVEL3_CACHE_LINESIZE);
>
> +  get_common_cache_info (&shared, &shared_per_thread, &threads, core);
>    if (shared <= 0)
>      /* No shared L3 cache.  All we have is the L2 cache.  */
>      shared = core;
> --
> 2.34.1
> ```
> > Thanks,
> > Sajan K.
> >

ping. 2.38 is approaching and I expect you want to get any fixes in before that.
Noah,

On Mon, Jul 10, 2023 at 9:28 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>
> On Mon, Jul 10, 2023 at 12:23 AM Sajan Karumanchi
> <sajan.karumanchi@gmail.com> wrote:
> >
> > Noah,
> > I verified your patches on the master branch that impact the non-temporal
> > threshold parameter on x86 CPUs. This patch modifies the non-temporal
> > threshold value from 24MB (3/4th of L3$) to 8MB (1/4th of L3$) on ZEN4.
> > From the Glibc benchmarks, we saw a significant performance drop ranging
> > from 15% to 99% for size ranges of 8MB to 16MB.
> > I also ran the new tool developed by you on all Zen architectures and the
> > results conclude that 3/4th of the L3 size holds good on AMD CPUs.
> > Hence the current patch degrades the performance of AMD CPUs.
> > We strongly recommend limiting this change to Intel CPUs only.
> >
>
> So it shouldn't actually go down. I think what is missing is:
> ```
> get_common_cache_info (&shared, &shared_per_thread, &threads, core);
> ```
>

The cache info of AMD CPUs is spread across the CPUID registers 0x80000005,
0x80000006, and 0x8000001D, but 'get_common_cache_info(...)' uses CPUID
register 0x00000004 for enumerating cache details. This leads to an infinite
loop in the initialization stage while enumerating the cache details on AMD
CPUs.

> In the AMD case shared == shared_per_thread, which shouldn't really
> be the case.
>
> The intended new calculation is: Total_L3_Size / Scale
> as opposed to: (L3_Size / NThread) / Scale
>

AMD Zen CPUs are chiplet based, so we consider only L3/CCX for computing the
nt_threshold.
* handle_amd(_SC_LEVEL3_CACHE_SIZE) initializes the 'shared' variable with
  'l3_cache_per_ccx' for Zen architectures and 'l3_cache_per_thread' for
  pre-Zen architectures.

> Before just going with default for AMD, maybe try out the following patch?
>

Since the cache info registers and the approach to compute the cache details
on AMD are different from Intel's, we cannot use the patch below.

> ```
> ---
>  sysdeps/x86/dl-cacheinfo.h | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/sysdeps/x86/dl-cacheinfo.h b/sysdeps/x86/dl-cacheinfo.h
> index c98fa57a7b..c1866ca898 100644
> --- a/sysdeps/x86/dl-cacheinfo.h
> +++ b/sysdeps/x86/dl-cacheinfo.h
> @@ -717,6 +717,7 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
>    level3_cache_assoc = handle_amd (_SC_LEVEL3_CACHE_ASSOC);
>    level3_cache_linesize = handle_amd (_SC_LEVEL3_CACHE_LINESIZE);
>
> +  get_common_cache_info (&shared, &shared_per_thread, &threads, core);
>    if (shared <= 0)
>      /* No shared L3 cache.  All we have is the L2 cache.  */
>      shared = core;
> --
> 2.34.1
> ```
> > Thanks,
> > Sajan K.
> >
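As background on the CPUID leaves mentioned above, the following is a minimal standalone sketch, not the glibc implementation, of walking AMD's cache-topology leaf 0x8000001D with GCC's <cpuid.h> helpers; it only illustrates why AMD cache enumeration goes through the 0x8000xxxx leaves rather than leaf 0x00000004:

```
/* Sketch only: enumerate cache levels via AMD's leaf 0x8000001d
   (the AMD counterpart of the leaf 0x4 used by get_common_cache_info).  */
#include <cpuid.h>
#include <stdio.h>

int
main (void)
{
  unsigned int eax, ebx, ecx, edx;
  for (unsigned int i = 0; ; i++)
    {
      /* Returns 0 if the extended leaf is not supported (e.g. non-AMD).  */
      if (!__get_cpuid_count (0x8000001d, i, &eax, &ebx, &ecx, &edx))
        break;
      unsigned int type = eax & 0x1f;          /* 0 = no more caches */
      if (type == 0)
        break;
      unsigned int level = (eax >> 5) & 0x7;
      unsigned long line = (ebx & 0xfff) + 1;
      unsigned long parts = ((ebx >> 12) & 0x3ff) + 1;
      unsigned long ways = ((ebx >> 22) & 0x3ff) + 1;
      unsigned long sets = (unsigned long) ecx + 1;
      printf ("L%u: %lu bytes\n", level, line * parts * ways * sets);
    }
  return 0;
}
```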
```
diff --git a/sysdeps/x86/cpu-features.c b/sysdeps/x86/cpu-features.c
index 1b6e00c88f..325ec2b825 100644
--- a/sysdeps/x86/cpu-features.c
+++ b/sysdeps/x86/cpu-features.c
@@ -636,6 +636,7 @@ init_cpu_features (struct cpu_features *cpu_features)
   unsigned int stepping = 0;
   enum cpu_features_kind kind;
 
+  cpu_features->cachesize_non_temporal_divisor = 4;
 #if !HAS_CPUID
   if (__get_cpuid_max (0, 0) == 0)
     {
@@ -716,13 +717,13 @@ init_cpu_features (struct cpu_features *cpu_features)
 
           /* Bigcore/Default Tuning.  */
         default:
+        default_tuning:
           /* Unknown family 0x06 processors.  Assuming this is one
              of Core i3/i5/i7 processors if AVX is available.  */
           if (!CPU_FEATURES_CPU_P (cpu_features, AVX))
             break;
-          /* Fall through.  */
-        case INTEL_BIGCORE_NEHALEM:
-        case INTEL_BIGCORE_WESTMERE:
+
+        enable_modern_features:
           /* Rep string instructions, unaligned load, unaligned copy,
              and pminub are fast on Intel Core i3, i5 and i7.  */
           cpu_features->preferred[index_arch_Fast_Rep_String]
@@ -732,12 +733,23 @@ init_cpu_features (struct cpu_features *cpu_features)
               | bit_arch_Prefer_PMINUB_for_stringop);
           break;
 
-          /*
-           Default tuned Bigcore microarch.
+        case INTEL_BIGCORE_NEHALEM:
+        case INTEL_BIGCORE_WESTMERE:
+          /* Older CPUs prefer non-temporal stores at lower threshold.  */
+          cpu_features->cachesize_non_temporal_divisor = 8;
+          goto enable_modern_features;
+
+          /* Older Bigcore microarch (smaller non-temporal store
+             threshold).  */
         case INTEL_BIGCORE_SANDYBRIDGE:
         case INTEL_BIGCORE_IVYBRIDGE:
         case INTEL_BIGCORE_HASWELL:
         case INTEL_BIGCORE_BROADWELL:
+          cpu_features->cachesize_non_temporal_divisor = 8;
+          goto default_tuning;
+
+          /* Newer Bigcore microarch (larger non-temporal store
+             threshold).  */
         case INTEL_BIGCORE_SKYLAKE:
         case INTEL_BIGCORE_KABYLAKE:
         case INTEL_BIGCORE_COMETLAKE:
@@ -753,13 +765,14 @@ init_cpu_features (struct cpu_features *cpu_features)
         case INTEL_BIGCORE_SAPPHIRERAPIDS:
         case INTEL_BIGCORE_EMERALDRAPIDS:
         case INTEL_BIGCORE_GRANITERAPIDS:
-          */
+          cpu_features->cachesize_non_temporal_divisor = 2;
+          goto default_tuning;
 
-          /*
-           Default tuned Mixed (bigcore + atom SOC).
+          /* Default tuned Mixed (bigcore + atom SOC).  */
         case INTEL_MIXED_LAKEFIELD:
         case INTEL_MIXED_ALDERLAKE:
-          */
+          cpu_features->cachesize_non_temporal_divisor = 2;
+          goto default_tuning;
         }
 
       /* Disable TSX on some processors to avoid TSX on kernels that
diff --git a/sysdeps/x86/dl-cacheinfo.h b/sysdeps/x86/dl-cacheinfo.h
index 4a1a5423ff..8292a4a50d 100644
--- a/sysdeps/x86/dl-cacheinfo.h
+++ b/sysdeps/x86/dl-cacheinfo.h
@@ -738,19 +738,25 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
   cpu_features->level3_cache_linesize = level3_cache_linesize;
   cpu_features->level4_cache_size = level4_cache_size;
 
-  /* The default setting for the non_temporal threshold is 1/4 of size
-     of the chip's cache. For most Intel and AMD processors with an
-     initial release date between 2017 and 2023, a thread's typical
-     share of the cache is from 18-64MB. Using the 1/4 L3 is meant to
-     estimate the point where non-temporal stores begin outcompeting
-     REP MOVSB. As well the point where the fact that non-temporal
-     stores are forced back to main memory would already occurred to the
-     majority of the lines in the copy. Note, concerns about the
-     entire L3 cache being evicted by the copy are mostly alleviated
-     by the fact that modern HW detects streaming patterns and
-     provides proper LRU hints so that the maximum thrashing
-     capped at 1/associativity. */
-  unsigned long int non_temporal_threshold = shared / 4;
+  unsigned long int cachesize_non_temporal_divisor
+      = cpu_features->cachesize_non_temporal_divisor;
+  if (cachesize_non_temporal_divisor <= 0)
+    cachesize_non_temporal_divisor = 4;
+
+  /* The default setting for the non_temporal threshold is [1/8, 1/2] of size
+     of the chip's cache (depending on `cachesize_non_temporal_divisor` which
+     is microarch specific. The defeault is 1/4). For most Intel and AMD
+     processors with an initial release date between 2017 and 2023, a thread's
+     typical share of the cache is from 18-64MB. Using a reasonable size
+     fraction of L3 is meant to estimate the point where non-temporal stores
+     begin outcompeting REP MOVSB. As well the point where the fact that
+     non-temporal stores are forced back to main memory would already occurred
+     to the majority of the lines in the copy. Note, concerns about the entire
+     L3 cache being evicted by the copy are mostly alleviated by the fact that
+     modern HW detects streaming patterns and provides proper LRU hints so that
+     the maximum thrashing capped at 1/associativity. */
+  unsigned long int non_temporal_threshold
+      = shared / cachesize_non_temporal_divisor;
   /* If no ERMS, we use the per-thread L3 chunking. Normal cacheable stores run
      a higher risk of actually thrashing the cache as they don't have a HW LRU
      hint. As well, there performance in highly parallel situations is
diff --git a/sysdeps/x86/dl-diagnostics-cpu.c b/sysdeps/x86/dl-diagnostics-cpu.c
index a1578e4665..5aab63e532 100644
--- a/sysdeps/x86/dl-diagnostics-cpu.c
+++ b/sysdeps/x86/dl-diagnostics-cpu.c
@@ -113,8 +113,11 @@ _dl_diagnostics_cpu (void)
                             cpu_features->level3_cache_linesize);
   print_cpu_features_value ("level4_cache_size",
                             cpu_features->level4_cache_size);
-  _Static_assert (offsetof (struct cpu_features, level4_cache_size)
-                  + sizeof (cpu_features->level4_cache_size)
-                  == sizeof (*cpu_features),
-                  "last cpu_features field has been printed");
+  print_cpu_features_value ("cachesize_non_temporal_divisor",
+                            cpu_features->cachesize_non_temporal_divisor);
+  _Static_assert (
+      offsetof (struct cpu_features, cachesize_non_temporal_divisor)
+          + sizeof (cpu_features->cachesize_non_temporal_divisor)
+      == sizeof (*cpu_features),
+      "last cpu_features field has been printed");
 }
diff --git a/sysdeps/x86/include/cpu-features.h b/sysdeps/x86/include/cpu-features.h
index 40b8129d6a..c740e1a5fc 100644
--- a/sysdeps/x86/include/cpu-features.h
+++ b/sysdeps/x86/include/cpu-features.h
@@ -945,6 +945,9 @@ struct cpu_features
   unsigned long int level3_cache_linesize;
   /* /_SC_LEVEL4_CACHE_SIZE.  */
   unsigned long int level4_cache_size;
+  /* When no user non_temporal_threshold is specified. We default to
+     cachesize / cachesize_non_temporal_divisor.  */
+  unsigned long int cachesize_non_temporal_divisor;
 };
 
 /* Get a pointer to the CPU features structure.  */
```
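For a rough sense of what the per-microarch divisors in this patch translate to, here is a small standalone C sketch, not part of the patch, that mirrors the fall-back-to-4 logic and prints the threshold each divisor yields for a hypothetical 32 MiB shared L3:

```
/* Sketch only: mirrors the "default to 4 if unset" fallback from the patch
   and shows the threshold each divisor yields for a hypothetical 32 MiB L3.  */
#include <stdio.h>

static unsigned long
nt_threshold (unsigned long shared, unsigned long divisor)
{
  if (divisor == 0)
    divisor = 4;                /* same fallback as the patch */
  return shared / divisor;
}

int
main (void)
{
  unsigned long shared = 32UL << 20;          /* 32 MiB shared L3 (example) */
  unsigned long divisors[] = { 2, 4, 8, 0 };  /* 0 exercises the fallback */
  for (int i = 0; i < 4; i++)
    printf ("divisor %lu -> %lu MiB\n", divisors[i],
            nt_threshold (shared, divisors[i]) >> 20);
  return 0;
}
```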