Message ID | 20230513051906.1287611-3-goldstein.w.n@gmail.com |
---|---|
State | New |
Series | [v9,1/3] x86: Increase `non_temporal_threshold` to roughly `sizeof_L3 / 4` |
One question about upgradability, one comment nit that I don't care
about but include for completeness.

Noah Goldstein via Libc-alpha <libc-alpha@sourceware.org> writes:
> Different systems prefer different divisors.
>
> From benchmarks[1] so far the following divisors have been found:
>  ICX : 2
>  SKX : 2
>  BWD : 8
>
> For Intel, we are generalizing that BWD and older prefers 8 as a
> divisor, and SKL and newer prefers 2. This number can be further tuned
> as benchmarks are run.
>
> [1]: https://github.com/goldsteinn/memcpy-nt-benchmarks
> ---
>  sysdeps/x86/cpu-features.c         | 27 +++++++++++++++++--------
>  sysdeps/x86/dl-cacheinfo.h         | 32 ++++++++++++++++++------------
>  sysdeps/x86/include/cpu-features.h |  3 +++
>  3 files changed, 41 insertions(+), 21 deletions(-)
>
> diff --git a/sysdeps/x86/include/cpu-features.h b/sysdeps/x86/include/cpu-features.h
> index 40b8129d6a..f5b9dd54fe 100644
> --- a/sysdeps/x86/include/cpu-features.h
> +++ b/sysdeps/x86/include/cpu-features.h
> @@ -915,6 +915,9 @@ struct cpu_features
>    unsigned long int shared_cache_size;
>    /* Threshold to use non temporal store.  */
>    unsigned long int non_temporal_threshold;
> +  /* When no user non_temporal_threshold is specified. We default to
> +     cachesize / cachesize_non_temporal_divisor.  */
> +  unsigned long int cachesize_non_temporal_divisor;
>    /* Threshold to use "rep movsb".  */
>    unsigned long int rep_movsb_threshold;
>    /* Threshold to stop using "rep movsb".  */

This adds a new field to "struct cpu_features".  Is this structure
something that is shared between ld.so and libc.so?  I.e. tunables
related?  If so, does this field need to be added to the end of the
struct, so as to not cause problems during an upgrade (when we have an
old ld.so and a new libc.so)?

> diff --git a/sysdeps/x86/dl-cacheinfo.h b/sysdeps/x86/dl-cacheinfo.h
> index 4a1a5423ff..864b00a521 100644
> --- a/sysdeps/x86/dl-cacheinfo.h
> +++ b/sysdeps/x86/dl-cacheinfo.h
> @@ -738,19 +738,25 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
>    cpu_features->level3_cache_linesize = level3_cache_linesize;
>    cpu_features->level4_cache_size = level4_cache_size;
>
> -  /* The default setting for the non_temporal threshold is 1/4 of size
> -     of the chip's cache. For most Intel and AMD processors with an
> -     initial release date between 2017 and 2023, a thread's typical
> -     share of the cache is from 18-64MB. Using the 1/4 L3 is meant to
> -     estimate the point where non-temporal stores begin outcompeting
> -     REP MOVSB. As well the point where the fact that non-temporal
> -     stores are forced back to main memory would already occurred to the
> -     majority of the lines in the copy. Note, concerns about the
> -     entire L3 cache being evicted by the copy are mostly alleviated
> -     by the fact that modern HW detects streaming patterns and
> -     provides proper LRU hints so that the maximum thrashing
> -     capped at 1/associativity. */
> -  unsigned long int non_temporal_threshold = shared / 4;
> +  unsigned long int cachesize_non_temporal_divisor
> +      = cpu_features->cachesize_non_temporal_divisor;
> +  if (cachesize_non_temporal_divisor <= 0)
> +    cachesize_non_temporal_divisor = 4;
> +
> +  /* The default setting for the non_temporal threshold is [1/2, 1/8] of size

FYI this range is backwards ;-)

> +     of the chip's cache (depending on `cachesize_non_temporal_divisor` which
> +     is microarch specific. The defeault is 1/4). For most Intel and AMD
> +     processors with an initial release date between 2017 and 2023, a thread's
> +     typical share of the cache is from 18-64MB. Using a reasonable size
> +     fraction of L3 is meant to estimate the point where non-temporal stores
> +     begin outcompeting REP MOVSB. As well the point where the fact that
> +     non-temporal stores are forced back to main memory would already occurred
> +     to the majority of the lines in the copy. Note, concerns about the entire
> +     L3 cache being evicted by the copy are mostly alleviated by the fact that
> +     modern HW detects streaming patterns and provides proper LRU hints so that
> +     the maximum thrashing capped at 1/associativity. */
> +  unsigned long int non_temporal_threshold
> +      = shared / cachesize_non_temporal_divisor;
>    /* If no ERMS, we use the per-thread L3 chunking. Normal cacheable stores run
>       a higher risk of actually thrashing the cache as they don't have a HW LRU
>       hint. As well, there performance in highly parallel situations is

Ok, defaults to the same behavior.

> diff --git a/sysdeps/x86/cpu-features.c b/sysdeps/x86/cpu-features.c
> index 29b8c8c133..ba789d6fc1 100644
> --- a/sysdeps/x86/cpu-features.c
> +++ b/sysdeps/x86/cpu-features.c
> @@ -635,6 +635,7 @@ init_cpu_features (struct cpu_features *cpu_features)
>    unsigned int stepping = 0;
>    enum cpu_features_kind kind;
>
> +  cpu_features->cachesize_non_temporal_divisor = 4;

Ok.

> @@ -714,12 +715,13 @@ init_cpu_features (struct cpu_features *cpu_features)
>
>        /* Bigcore/Default Tuning.  */
>      default:
> +    default_tuning:
>        /* Unknown family 0x06 processors.  Assuming this is one
>           of Core i3/i5/i7 processors if AVX is available.  */
>        if (!CPU_FEATURES_CPU_P (cpu_features, AVX))
>          break;

Ok.

> -    case INTEL_BIGCORE_NEHALEM:
> -    case INTEL_BIGCORE_WESTMERE:
> +
> +    enable_modern_features:

Ok.

>        /* Rep string instructions, unaligned load, unaligned copy,
>           and pminub are fast on Intel Core i3, i5 and i7.  */
>        cpu_features->preferred[index_arch_Fast_Rep_String]
> @@ -728,12 +730,20 @@ init_cpu_features (struct cpu_features *cpu_features)
>           | bit_arch_Prefer_PMINUB_for_stringop);
>        break;
>
> -    /*
> -     Default tuned Bigcore microarch.

Note comment begin removed here...

> +    case INTEL_BIGCORE_NEHALEM:
> +    case INTEL_BIGCORE_WESTMERE:
> +      /* Older CPUs prefer non-temporal stores at lower threshold.  */
> +      cpu_features->cachesize_non_temporal_divisor = 8;
> +      goto enable_modern_features;
> +
> +      /* Default tuned Bigcore microarch.  */

Ok.

>      case INTEL_BIGCORE_SANDYBRIDGE:
>      case INTEL_BIGCORE_IVYBRIDGE:
>      case INTEL_BIGCORE_HASWELL:
>      case INTEL_BIGCORE_BROADWELL:
> +      cpu_features->cachesize_non_temporal_divisor = 8;
> +      goto default_tuning;
> +

Ok.

>      case INTEL_BIGCORE_SKYLAKE:
>      case INTEL_BIGCORE_KABYLAKE:
>      case INTEL_BIGCORE_COMETLAKE:

Note nothing but more cases here, ok.

>      case INTEL_BIGCORE_SAPPHIRERAPIDS:
>      case INTEL_BIGCORE_EMERALDRAPIDS:
>      case INTEL_BIGCORE_GRANITERAPIDS:
> -    */

... and comment end removed here.  Ok.

> +      cpu_features->cachesize_non_temporal_divisor = 2;
> +      goto default_tuning;

Ok.

> -    /*
> -     Default tuned Mixed (bigcore + atom SOC).
> +      /* Default tuned Mixed (bigcore + atom SOC).  */
>      case INTEL_MIXED_LAKEFIELD:
>      case INTEL_MIXED_ALDERLAKE:
> -    */
> +      cpu_features->cachesize_non_temporal_divisor = 2;
> +      goto default_tuning;
>      }

Ok.
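[Editor's illustration] To make the layout question above concrete: when an old ld.so and a new libc.so are built against different definitions of a shared structure, a field inserted mid-struct shifts the offset of every member after it, while an appended field leaves existing offsets intact. A minimal standalone sketch (hypothetical struct names, not the real struct cpu_features):

#include <stddef.h>
#include <stdio.h>

/* Layout an old ld.so was built against.  */
struct features_v1
{
  unsigned long int non_temporal_threshold;
  unsigned long int rep_movsb_threshold;
};

/* New field inserted in the middle: rep_movsb_threshold moves.  */
struct features_v2_mid
{
  unsigned long int non_temporal_threshold;
  unsigned long int cachesize_non_temporal_divisor;
  unsigned long int rep_movsb_threshold;
};

/* New field appended at the end: existing offsets are unchanged.  */
struct features_v2_end
{
  unsigned long int non_temporal_threshold;
  unsigned long int rep_movsb_threshold;
  unsigned long int cachesize_non_temporal_divisor;
};

int
main (void)
{
  printf ("rep_movsb_threshold offset: v1=%zu mid=%zu end=%zu\n",
          offsetof (struct features_v1, rep_movsb_threshold),
          offsetof (struct features_v2_mid, rep_movsb_threshold),
          offsetof (struct features_v2_end, rep_movsb_threshold));
  /* On LP64 this prints v1=8 mid=16 end=8: with the mid-struct layout,
     an old ld.so would read the wrong field.  */
  return 0;
}

Whether struct cpu_features is actually shared across the ld.so/libc.so boundary is exactly the open question in the review; the sketch only shows why appending is the safe default.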
On Thu, May 25, 2023 at 10:34 PM DJ Delorie <dj@redhat.com> wrote:
>
> One question about upgradability, one comment nit that I don't care
> about but include for completeness.
>
> [...]
>
> This adds a new field to "struct cpu_features".  Is this structure
> something that is shared between ld.so and libc.so?  I.e. tunables
> related?  If so, does this field need to be added to the end of the
> struct, so as to not cause problems during an upgrade (when we have an
> old ld.so and a new libc.so)?

Not sure.  HJ, do you know?
But moved for now as a kind of "why not".

> > +  /* The default setting for the non_temporal threshold is [1/2, 1/8] of size
>
> FYI this range is backwards ;-)

Fixed.

> [...]
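[Editor's illustration] The arithmetic being tuned here is simply the shared cache size divided by a per-microarch divisor (2 for SKL and newer, 8 for BWD and older, 4 as the baseline), with 4 as the fallback when the field is unset. A small standalone sketch of that computation (nt_threshold is a made-up helper name, not the glibc code path):

#include <stdio.h>

/* Mirror of the dl-cacheinfo.h logic: a divisor of 0 (unset) falls
   back to the historical default of 4, i.e. 1/4 of the shared cache.  */
static unsigned long int
nt_threshold (unsigned long int shared, unsigned long int divisor)
{
  if (divisor == 0)
    divisor = 4;
  return shared / divisor;
}

int
main (void)
{
  unsigned long int shared = 32UL * 1024 * 1024;  /* e.g. a 32 MiB L3 */
  printf ("SKL and newer (divisor 2):   %lu MiB\n",
          nt_threshold (shared, 2) >> 20);   /* 16 MiB */
  printf ("BWD and older (divisor 8):   %lu MiB\n",
          nt_threshold (shared, 8) >> 20);   /* 4 MiB */
  printf ("unset, fallback (divisor 4): %lu MiB\n",
          nt_threshold (shared, 0) >> 20);   /* 8 MiB */
  return 0;
}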
diff --git a/sysdeps/x86/cpu-features.c b/sysdeps/x86/cpu-features.c
index 29b8c8c133..ba789d6fc1 100644
--- a/sysdeps/x86/cpu-features.c
+++ b/sysdeps/x86/cpu-features.c
@@ -635,6 +635,7 @@ init_cpu_features (struct cpu_features *cpu_features)
   unsigned int stepping = 0;
   enum cpu_features_kind kind;
 
+  cpu_features->cachesize_non_temporal_divisor = 4;
 #if !HAS_CPUID
   if (__get_cpuid_max (0, 0) == 0)
     {
@@ -714,12 +715,13 @@ init_cpu_features (struct cpu_features *cpu_features)
 
       /* Bigcore/Default Tuning.  */
     default:
+    default_tuning:
       /* Unknown family 0x06 processors.  Assuming this is one
          of Core i3/i5/i7 processors if AVX is available.  */
       if (!CPU_FEATURES_CPU_P (cpu_features, AVX))
         break;
-    case INTEL_BIGCORE_NEHALEM:
-    case INTEL_BIGCORE_WESTMERE:
+
+    enable_modern_features:
       /* Rep string instructions, unaligned load, unaligned copy,
          and pminub are fast on Intel Core i3, i5 and i7.  */
       cpu_features->preferred[index_arch_Fast_Rep_String]
@@ -728,12 +730,20 @@ init_cpu_features (struct cpu_features *cpu_features)
          | bit_arch_Prefer_PMINUB_for_stringop);
       break;
 
-    /*
-     Default tuned Bigcore microarch.
+    case INTEL_BIGCORE_NEHALEM:
+    case INTEL_BIGCORE_WESTMERE:
+      /* Older CPUs prefer non-temporal stores at lower threshold.  */
+      cpu_features->cachesize_non_temporal_divisor = 8;
+      goto enable_modern_features;
+
+      /* Default tuned Bigcore microarch.  */
     case INTEL_BIGCORE_SANDYBRIDGE:
     case INTEL_BIGCORE_IVYBRIDGE:
     case INTEL_BIGCORE_HASWELL:
     case INTEL_BIGCORE_BROADWELL:
+      cpu_features->cachesize_non_temporal_divisor = 8;
+      goto default_tuning;
+
     case INTEL_BIGCORE_SKYLAKE:
     case INTEL_BIGCORE_KABYLAKE:
     case INTEL_BIGCORE_COMETLAKE:
@@ -749,13 +759,14 @@ init_cpu_features (struct cpu_features *cpu_features)
     case INTEL_BIGCORE_SAPPHIRERAPIDS:
     case INTEL_BIGCORE_EMERALDRAPIDS:
     case INTEL_BIGCORE_GRANITERAPIDS:
-    */
+      cpu_features->cachesize_non_temporal_divisor = 2;
+      goto default_tuning;
 
-    /*
-     Default tuned Mixed (bigcore + atom SOC).
+      /* Default tuned Mixed (bigcore + atom SOC).  */
     case INTEL_MIXED_LAKEFIELD:
     case INTEL_MIXED_ALDERLAKE:
-    */
+      cpu_features->cachesize_non_temporal_divisor = 2;
+      goto default_tuning;
     }
 
   /* Disable TSX on some processors to avoid TSX on kernels that
diff --git a/sysdeps/x86/dl-cacheinfo.h b/sysdeps/x86/dl-cacheinfo.h
index 4a1a5423ff..864b00a521 100644
--- a/sysdeps/x86/dl-cacheinfo.h
+++ b/sysdeps/x86/dl-cacheinfo.h
@@ -738,19 +738,25 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
   cpu_features->level3_cache_linesize = level3_cache_linesize;
   cpu_features->level4_cache_size = level4_cache_size;
 
-  /* The default setting for the non_temporal threshold is 1/4 of size
-     of the chip's cache. For most Intel and AMD processors with an
-     initial release date between 2017 and 2023, a thread's typical
-     share of the cache is from 18-64MB. Using the 1/4 L3 is meant to
-     estimate the point where non-temporal stores begin outcompeting
-     REP MOVSB. As well the point where the fact that non-temporal
-     stores are forced back to main memory would already occurred to the
-     majority of the lines in the copy. Note, concerns about the
-     entire L3 cache being evicted by the copy are mostly alleviated
-     by the fact that modern HW detects streaming patterns and
-     provides proper LRU hints so that the maximum thrashing
-     capped at 1/associativity. */
-  unsigned long int non_temporal_threshold = shared / 4;
+  unsigned long int cachesize_non_temporal_divisor
+      = cpu_features->cachesize_non_temporal_divisor;
+  if (cachesize_non_temporal_divisor <= 0)
+    cachesize_non_temporal_divisor = 4;
+
+  /* The default setting for the non_temporal threshold is [1/2, 1/8] of size
+     of the chip's cache (depending on `cachesize_non_temporal_divisor` which
+     is microarch specific. The defeault is 1/4). For most Intel and AMD
+     processors with an initial release date between 2017 and 2023, a thread's
+     typical share of the cache is from 18-64MB. Using a reasonable size
+     fraction of L3 is meant to estimate the point where non-temporal stores
+     begin outcompeting REP MOVSB. As well the point where the fact that
+     non-temporal stores are forced back to main memory would already occurred
+     to the majority of the lines in the copy. Note, concerns about the entire
+     L3 cache being evicted by the copy are mostly alleviated by the fact that
+     modern HW detects streaming patterns and provides proper LRU hints so that
+     the maximum thrashing capped at 1/associativity. */
+  unsigned long int non_temporal_threshold
+      = shared / cachesize_non_temporal_divisor;
   /* If no ERMS, we use the per-thread L3 chunking. Normal cacheable stores run
      a higher risk of actually thrashing the cache as they don't have a HW LRU
      hint. As well, there performance in highly parallel situations is
diff --git a/sysdeps/x86/include/cpu-features.h b/sysdeps/x86/include/cpu-features.h
index 40b8129d6a..f5b9dd54fe 100644
--- a/sysdeps/x86/include/cpu-features.h
+++ b/sysdeps/x86/include/cpu-features.h
@@ -915,6 +915,9 @@ struct cpu_features
   unsigned long int shared_cache_size;
   /* Threshold to use non temporal store.  */
   unsigned long int non_temporal_threshold;
+  /* When no user non_temporal_threshold is specified. We default to
+     cachesize / cachesize_non_temporal_divisor.  */
+  unsigned long int cachesize_non_temporal_divisor;
   /* Threshold to use "rep movsb".  */
   unsigned long int rep_movsb_threshold;
   /* Threshold to stop using "rep movsb".  */
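[Editor's illustration] A note on the control flow in the cpu-features.c hunks: `default_tuning:` and `enable_modern_features:` are ordinary labels placed inside the switch, so each microarch case can set its divisor and then jump into the shared tuning code instead of duplicating it. A reduced sketch of the pattern (hypothetical enum values, not the real dispatch):

#include <stdio.h>

enum uarch { UARCH_UNKNOWN, UARCH_NEHALEM_ERA, UARCH_SKYLAKE_ERA };

static void
tune (enum uarch u)
{
  unsigned long int divisor = 4;  /* baseline, set before dispatching */

  switch (u)
    {
    default:
    default_tuning:
      /* Shared default tuning: runs for unknown parts and for every
         case that jumps here after setting its divisor.  */
      printf ("divisor = %lu\n", divisor);
      break;

    case UARCH_NEHALEM_ERA:
      /* Older cores prefer non-temporal stores at a lower threshold.  */
      divisor = 8;
      goto default_tuning;

    case UARCH_SKYLAKE_ERA:
      divisor = 2;
      goto default_tuning;
    }
}

int
main (void)
{
  tune (UARCH_UNKNOWN);      /* divisor = 4 */
  tune (UARCH_NEHALEM_ERA);  /* divisor = 8 */
  tune (UARCH_SKYLAKE_ERA);  /* divisor = 2 */
  return 0;
}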