Message ID | 20230714151459.3357038-1-goldstein.w.n@gmail.com |
---|---|
State | New |
Series | [v1] x86: Use `3/4*sizeof(per-thread-L3)` as low bound for NT threshold. |
On Fri, Jul 14, 2023 at 10:15 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>
> On some machines we end up with incomplete cache information. This can
> make the new calculation of `sizeof(total-L3)/custom-divisor` end up
> lower than intended (and lower than the prior value). So reintroduce
> the old bound as a lower bound to avoid potentially regressing code
> where we don't have complete information to make the decision.
> ---
>  sysdeps/x86/dl-cacheinfo.h | 11 ++++++++++-
>  1 file changed, 10 insertions(+), 1 deletion(-)
>
> diff --git a/sysdeps/x86/dl-cacheinfo.h b/sysdeps/x86/dl-cacheinfo.h
> index c98fa57a7b..0436ffb349 100644
> --- a/sysdeps/x86/dl-cacheinfo.h
> +++ b/sysdeps/x86/dl-cacheinfo.h
> @@ -757,12 +757,21 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
>       the maximum thrashing capped at 1/associativity. */
>    unsigned long int non_temporal_threshold
>        = shared / cachesize_non_temporal_divisor;
> +
> +  /* If the computed non_temporal_threshold <= 3/4 * per-thread L3, we most
> +     likely have incorrect/incomplete cache info in which case, default to
> +     3/4 * per-thread L3 to avoid regressions. */
> +  unsigned long int non_temporal_threshold_lowbound
> +      = shared_per_thread * 3 / 4;
> +  if (non_temporal_threshold < non_temporal_threshold_lowbound)
> +    non_temporal_threshold = non_temporal_threshold_lowbound;
> +
>    /* If no ERMS, we use the per-thread L3 chunking. Normal cacheable stores run
>       a higher risk of actually thrashing the cache as they don't have a HW LRU
>       hint. As well, their performance in highly parallel situations is
>       noticeably worse. */
>    if (!CPU_FEATURE_USABLE_P (cpu_features, ERMS))
> -    non_temporal_threshold = shared_per_thread * 3 / 4;
> +    non_temporal_threshold = non_temporal_threshold_lowbound;
>    /* SIZE_MAX >> 4 because memmove-vec-unaligned-erms right-shifts the value of
>       'x86_non_temporal_threshold' by `LOG_4X_MEMCPY_THRESH` (4) and it is best
>       if that operation cannot overflow. Minimum of 0x4040 (16448) because the
> --
> 2.34.1
>
Sajan, can you test this out?
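For readers skimming the thread, the hunk above boils down to clamping the divisor-based value from below. The following is an illustrative standalone sketch of that logic, not the glibc code itself; the variable names follow the patch, but the cache sizes and the `main` driver are made up for the example:

```c
/* Illustrative sketch (not glibc code): the lower-bound clamp the patch
   adds, with made-up cache sizes.  With complete cache info the
   divisor-based value is already above the bound, so nothing changes;
   the bound only fires when the reported sizes look implausibly small.  */
#include <stdio.h>

int
main (void)
{
  unsigned long int shared = 256UL << 20;            /* total L3, example */
  unsigned long int shared_per_thread = 16UL << 20;  /* per-thread share, example */
  unsigned long int cachesize_non_temporal_divisor = 4;

  unsigned long int non_temporal_threshold
      = shared / cachesize_non_temporal_divisor;      /* 64 MiB */

  unsigned long int non_temporal_threshold_lowbound
      = shared_per_thread * 3 / 4;                    /* 12 MiB */
  if (non_temporal_threshold < non_temporal_threshold_lowbound)
    non_temporal_threshold = non_temporal_threshold_lowbound;

  printf ("non_temporal_threshold = %lu MiB\n", non_temporal_threshold >> 20);
  return 0;
}
```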
On Fri, Jul 14, 2023 at 10:15 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>
> On Fri, Jul 14, 2023 at 10:15 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> >
> > On some machines we end up with incomplete cache information. This can
> > make the new calculation of `sizeof(total-L3)/custom-divisor` end up
> > lower than intended (and lower than the prior value). So reintroduce
> > the old bound as a lower bound to avoid potentially regressing code
> > where we don't have complete information to make the decision.
> > [...]
>
> Sajan, can you test this out?

ping @Sajan Karumanchi
release is quickly approaching and I assume you want this in.
*Noah,
On Mon, Jul 17, 2023 at 10:09 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>
> On Fri, Jul 14, 2023 at 10:15 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> >
> > On Fri, Jul 14, 2023 at 10:15 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> > >
> > > On some machines we end up with incomplete cache information. This can
> > > make the new calculation of `sizeof(total-L3)/custom-divisor` end up
> > > lower than intended (and lower than the prior value). So reintroduce
> > > the old bound as a lower bound to avoid potentially regressing code
> > > where we don't have complete information to make the decision.
> > > [...]
> >
> > Sajan, can you test this out?
> ping @Sajan Karumanchi
> release is quickly approaching and I assume you want this in.

I tested the above patch, which tries to revert the 'nt_threshold' value for
AMD CPUs to the actual value '3/4 of the shared cache'. However, your comments
on a thread's typical share of the cache and the non_temporal default ranges
do not apply to AMD CPUs. The assignment of 'shared' to 'shared_per_thread'
doesn't make sense for AMD Zen CPUs. Maybe the simplest way to handle AMD CPUs
is to configure the default non-temporal threshold to 3/4 of the shared cache.
I think the patch below would simplify the threshold computation for both
architectures.
+++ b/sysdeps/x86/dl-cacheinfo.h
@@ -866,7 +866,6 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
       data = handle_amd (_SC_LEVEL1_DCACHE_SIZE);
       core = handle_amd (_SC_LEVEL2_CACHE_SIZE);
       shared = handle_amd (_SC_LEVEL3_CACHE_SIZE);
-      shared_per_thread = shared;

       level1_icache_size = handle_amd (_SC_LEVEL1_ICACHE_SIZE);
       level1_icache_linesize = handle_amd (_SC_LEVEL1_ICACHE_LINESIZE);
@@ -906,6 +905,11 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
   cpu_features->level3_cache_linesize = level3_cache_linesize;
   cpu_features->level4_cache_size = level4_cache_size;

+
+  unsigned long int non_temporal_threshold = shared * 3 / 4;
+
+  if (cpu_features->basic.kind != arch_kind_amd)
+  {
   unsigned long int cachesize_non_temporal_divisor
       = cpu_features->cachesize_non_temporal_divisor;
   if (cachesize_non_temporal_divisor <= 0)
@@ -913,7 +917,7 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)

   /* The default setting for the non_temporal threshold is [1/8, 1/2] of size
      of the chip's cache (depending on `cachesize_non_temporal_divisor` which
-     is microarch specific. The default is 1/4). For most Intel and AMD
+     is microarch specific. The default is 1/4). For most Intel
      processors with an initial release date between 2017 and 2023, a thread's
      typical share of the cache is from 18-64MB. Using a reasonable size
      fraction of L3 is meant to estimate the point where non-temporal stores
@@ -923,7 +927,7 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
      L3 cache being evicted by the copy are mostly alleviated by the fact that
      modern HW detects streaming patterns and provides proper LRU hints so that
      the maximum thrashing capped at 1/associativity. */
-  unsigned long int non_temporal_threshold
+  non_temporal_threshold
       = shared / cachesize_non_temporal_divisor;
   /* If no ERMS, we use the per-thread L3 chunking. Normal cacheable stores run
      a higher risk of actually thrashing the cache as they don't have a HW LRU
@@ -931,6 +935,8 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
      noticeably worse. */
   if (!CPU_FEATURE_USABLE_P (cpu_features, ERMS))
     non_temporal_threshold = shared_per_thread * 3 / 4;
+  }
+

Thanks,
Sajan K.
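Boiled down, the control flow Sajan's diff proposes looks roughly like the function below. This is an editorial paraphrase for readability only, not the literal patch; `proposed_threshold`, `is_amd`, and `has_erms` are invented names standing in for the glibc `cpu_features` checks:

```c
/* Paraphrase of the flow in Sajan's diff; is_amd / has_erms stand in
   for the cpu_features checks in the real code.  */
static unsigned long int
proposed_threshold (unsigned long int shared,
                    unsigned long int shared_per_thread,
                    unsigned long int divisor,
                    int is_amd, int has_erms)
{
  /* AMD: a fixed 3/4 of the shared (per-CCX) L3.  */
  unsigned long int non_temporal_threshold = shared * 3 / 4;

  if (!is_amd)
    {
      /* Everyone else keeps the divisor-based computation.  */
      if (divisor == 0)
        divisor = 4;
      non_temporal_threshold = shared / divisor;

      /* Without ERMS, fall back to per-thread L3 chunking.  */
      if (!has_erms)
        non_temporal_threshold = shared_per_thread * 3 / 4;
    }
  return non_temporal_threshold;
}
```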
On Mon, Jul 17, 2023 at 1:42 PM sajan karumanchi <sajan.karumanchi@gmail.com> wrote:
>
> *Noah,
> On Mon, Jul 17, 2023 at 10:09 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> >
> > [...]
> >
> > Sajan, can you test this out?
> > ping @Sajan Karumanchi
> > release is quickly approaching and I assume you want this in.
>
> I tested the above patch, which tries to revert the 'nt_threshold'
> value for AMD CPUs to the actual value '3/4 of the shared cache'.

How about just updating the comment?

> However, your comments on a thread's typical share of the cache and the
> non_temporal default ranges do not apply to AMD CPUs.
> The assignment of 'shared' to 'shared_per_thread' doesn't make sense
> for AMD Zen CPUs.

`shared_per_thread` is equal to what `shared` was prior to the patch.

> Maybe the simplest way to handle AMD CPUs is to configure the
> default non-temporal threshold to 3/4 of the shared cache.
> I think the patch below would simplify the threshold computation for
> both architectures.

I somewhat prefer using 3/4*shared_per_thread as a lowbound as
a catchall for misconfigurations.

> [...]
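To make the "catchall" concrete, here is the arithmetic for a hypothetical Zen configuration where, as Sajan's diff shows, `shared_per_thread` is simply set equal to `shared` (the per-CCX L3). The 32 MiB figure and the divisor of 4 are illustrative, not measured values from any specific part:

```c
/* Illustrative numbers only (hypothetical 32 MiB per-CCX L3, default
   divisor of 4):

     shared / cachesize_non_temporal_divisor = 32 MiB / 4      =  8 MiB
     shared_per_thread * 3 / 4               = 32 MiB * 3 / 4  = 24 MiB

   The lower bound wins, so on such a configuration the patch effectively
   restores the pre-series 3/4-of-L3 threshold that Sajan is asking for,
   while machines with complete cache info keep the divisor-based value.  */
```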
On Mon, Jul 17, 2023 at 4:25 PM DJ Delorie <dj@redhat.com> wrote:
>
> sajan karumanchi writes:
> > I tested the above patch, which tries to revert the 'nt_threshold'
> > value for AMD CPUs to the actual value '3/4 of the shared cache'.
> > However, your comments on a thread's typical share of the cache and the
> > non_temporal default ranges do not apply to AMD CPUs.
> > The assignment of 'shared' to 'shared_per_thread' doesn't make sense
> > for AMD Zen CPUs.
> > Maybe the simplest way to handle AMD CPUs is to configure the
> > default non-temporal threshold to 3/4 of the shared cache.
> > I think the patch below would simplify the threshold computation for
> > both architectures.
>
> Rather than put AMD-specific code in dl-cacheinfo.h, why not add another
> field to cpu_features so that AMD can per-family-tune their cache
> preferences in cpu-features.c, and set defaults there?
>
> It seems to me that the threshold has three parts: a numerator (to avoid
> floating point math), a denominator, and whether the memory depends on
> how many cores there are, which can be applied to the other two.
>
> A field for cachesize_non_temporal_numerator allows for non-floating
> point math, and cpu-features.c could adjust either the numerator or
> denominator according to whether the cache is per-core or total. It
> also means that all tuning happens in one place, instead of spreading it
> around.

I think the issue is that we don't have the same pipeline for computing all
of the cache info for each family. So for AMD we would need the denominator to
essentially be 4 * threads-per-socket, but AFAIK there's nowhere earlier to set
that. It might be something we can revisit / clean up in the future, but I think
this is important for 2.38, so I would like to get a baseline fix in.
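For concreteness, DJ's suggestion might look something like the sketch below. Of the fields shown, only `cachesize_non_temporal_divisor` exists in glibc today (it appears in the quoted hunks); the numerator field, the struct name, and the helper are invented here purely to illustrate the idea:

```c
/* Hypothetical sketch of DJ's numerator/denominator idea.  Only
   cachesize_non_temporal_divisor exists in glibc today; the numerator
   field and this helper are invented for illustration.  */
struct nt_fraction_sketch
{
  unsigned long int cachesize_non_temporal_numerator; /* e.g. 3, or 1 */
  unsigned long int cachesize_non_temporal_divisor;   /* e.g. 4, or 4 * threads-per-socket */
};

static unsigned long int
nt_threshold_from_fraction (unsigned long int shared,
                            const struct nt_fraction_sketch *f)
{
  /* Integer math only, as DJ notes, to avoid floating point in ld.so.  */
  return shared / f->cachesize_non_temporal_divisor
         * f->cachesize_non_temporal_numerator;
}
```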
* Noah
On Tue, Jul 18, 2023 at 1:38 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>
> On Mon, Jul 17, 2023 at 1:42 PM sajan karumanchi
> <sajan.karumanchi@gmail.com> wrote:
> >
> > [...]
> >
> > However, your comments on a thread's typical share of the cache and the
> > non_temporal default ranges do not apply to AMD CPUs.
> > The assignment of 'shared' to 'shared_per_thread' doesn't make sense
> > for AMD Zen CPUs.
> `shared_per_thread` is equal to what `shared` was prior to the patch.
>
This 'shared_per_thread' was introduced under this series of patches.
I suggest removing 'shared_per_thread' to avoid misinterpreting the
per-CCX shared cache as a per-thread share.

> > Maybe the simplest way to handle AMD CPUs is to configure the
> > default non-temporal threshold to 3/4 of the shared cache.
> > I think the patch below would simplify the threshold computation for
> > both architectures.
>
> I somewhat prefer using 3/4*shared_per_thread as a lowbound as
> a catchall for misconfigurations.

I think you can use the existing 'minimum_non_temporal_threshold' to
handle the misconfigurations.

> > [...]
On Tue, Jul 18, 2023 at 8:35 AM sajan karumanchi <sajan.karumanchi@gmail.com> wrote:
>
> * Noah
> On Tue, Jul 18, 2023 at 1:38 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> >
> > [...]
> >
> > `shared_per_thread` is equal to what `shared` was prior to the patch.
> >
> This 'shared_per_thread' was introduced under this series of patches.
> I suggest removing 'shared_per_thread' to avoid misinterpreting the
> per-CCX shared cache as a per-thread share.

Agreed, already split it, just hadn't updated this.
But fixed.

> > > Maybe the simplest way to handle AMD CPUs is to configure the
> > > default non-temporal threshold to 3/4 of the shared cache.
> > > I think the patch below would simplify the threshold computation for
> > > both architectures.
> >
> > I somewhat prefer using 3/4*shared_per_thread as a lowbound as
> > a catchall for misconfigurations.
> I think you can use the existing 'minimum_non_temporal_threshold' to
> handle the misconfigurations.
>
That's a correctness bound.
This is more a sanity check for "we are making the decision with
correct configuration knowledge" catchall.

> [...]
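Putting the two kinds of bound side by side may help. The comment in the quoted hunk mentions a hard floor of 0x4040 (16448) and a SIZE_MAX >> 4 cap; the new 3/4 * shared_per_thread bound logically sits before those. The function below is the editor's reading of that ordering, sketched for clarity rather than copied from glibc, and the exact clamping code in dl-cacheinfo.h may differ:

```c
/* Editor's sketch of how the bounds relate; the 0x4040 floor and the
   SIZE_MAX >> 4 cap come from the comment in the quoted hunk.  */
#include <stdint.h>

static unsigned long int
clamp_nt_threshold (unsigned long int divisor_based,
                    unsigned long int shared_per_thread)
{
  unsigned long int t = divisor_based;

  /* Sanity check: distrust implausibly small divisor-based values
     (the "catchall for misconfigurations" in this thread).  */
  unsigned long int lowbound = shared_per_thread * 3 / 4;
  if (t < lowbound)
    t = lowbound;

  /* Correctness bounds: never below 0x4040, never high enough that the
     later >> 4 in memmove-vec-unaligned-erms could overflow.  */
  if (t < 0x4040)
    t = 0x4040;
  if (t > (SIZE_MAX >> 4))
    t = SIZE_MAX >> 4;

  return t;
}
```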
On Tue, Jul 18, 2023 at 10:27 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>
> On Tue, Jul 18, 2023 at 8:35 AM sajan karumanchi
> <sajan.karumanchi@gmail.com> wrote:
> >
> > [...]
> >
> > > > Maybe the simplest way to handle AMD CPUs is to configure the
> > > > default non-temporal threshold to 3/4 of the shared cache.
> > > > I think the patch below would simplify the threshold computation for
> > > > both architectures.
> > >
> > > I somewhat prefer using 3/4*shared_per_thread as a lowbound as
> > > a catchall for misconfigurations.
> > I think you can use the existing 'minimum_non_temporal_threshold' to
> > handle the misconfigurations.
> >
> That's a correctness bound.
> This is more a sanity check for "we are making the decision with
> correct configuration knowledge" catchall.
>
The main point being: other x86 systems might be affected, not just AMD.

> [...]
diff --git a/sysdeps/x86/dl-cacheinfo.h b/sysdeps/x86/dl-cacheinfo.h
index c98fa57a7b..0436ffb349 100644
--- a/sysdeps/x86/dl-cacheinfo.h
+++ b/sysdeps/x86/dl-cacheinfo.h
@@ -757,12 +757,21 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
      the maximum thrashing capped at 1/associativity. */
   unsigned long int non_temporal_threshold
       = shared / cachesize_non_temporal_divisor;
+
+  /* If the computed non_temporal_threshold <= 3/4 * per-thread L3, we most
+     likely have incorrect/incomplete cache info in which case, default to
+     3/4 * per-thread L3 to avoid regressions. */
+  unsigned long int non_temporal_threshold_lowbound
+      = shared_per_thread * 3 / 4;
+  if (non_temporal_threshold < non_temporal_threshold_lowbound)
+    non_temporal_threshold = non_temporal_threshold_lowbound;
+
   /* If no ERMS, we use the per-thread L3 chunking. Normal cacheable stores run
      a higher risk of actually thrashing the cache as they don't have a HW LRU
      hint. As well, their performance in highly parallel situations is
      noticeably worse. */
   if (!CPU_FEATURE_USABLE_P (cpu_features, ERMS))
-    non_temporal_threshold = shared_per_thread * 3 / 4;
+    non_temporal_threshold = non_temporal_threshold_lowbound;
   /* SIZE_MAX >> 4 because memmove-vec-unaligned-erms right-shifts the value of
      'x86_non_temporal_threshold' by `LOG_4X_MEMCPY_THRESH` (4) and it is best
      if that operation cannot overflow. Minimum of 0x4040 (16448) because the
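For anyone who wants to gauge the effect of a threshold change on an affected machine without rebuilding glibc, recent glibc also exposes the value as the glibc.cpu.x86_non_temporal_threshold tunable, so a simple copy benchmark run under different GLIBC_TUNABLES settings can approximate both sides of the bound. The program below is an editorial micro-benchmark, not part of the patch; the 256 MiB buffer size and 16 iterations are arbitrary choices:

```c
/* Illustrative copy micro-benchmark; run it with e.g.
   GLIBC_TUNABLES=glibc.cpu.x86_non_temporal_threshold=<bytes> ./a.out
   to compare settings on either side of the computed bound.  */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

int
main (void)
{
  size_t size = 256UL << 20;      /* 256 MiB, arbitrary */
  char *src = malloc (size);
  char *dst = malloc (size);
  if (src == NULL || dst == NULL)
    return 1;
  memset (src, 1, size);

  struct timespec start, end;
  clock_gettime (CLOCK_MONOTONIC, &start);
  for (int i = 0; i < 16; i++)
    memcpy (dst, src, size);
  clock_gettime (CLOCK_MONOTONIC, &end);

  double secs = (end.tv_sec - start.tv_sec)
                + (end.tv_nsec - start.tv_nsec) * 1e-9;
  printf ("%.2f GB/s\n", 16.0 * size / secs / 1e9);

  free (src);
  free (dst);
  return 0;
}
```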