Message ID | 20230718152759.219727-1-goldstein.w.n@gmail.com |
---|---|
State | New |
Series | [v3] x86: Use `3/4*sizeof(per-thread-L3)` as low bound for NT threshold. |
On Tue, Jul 18, 2023 at 8:51 PM DJ Delorie <dj@redhat.com> wrote:
>
> Noah Goldstein via Libc-alpha <libc-alpha@sourceware.org> writes:
>
> Way-back-when we stored the cache size in "shared" ("core" or
> "core/threads_l2") the math was "shared * 3/4".  Now the "core or
> core/l2" value is in "shared_per_thread"
>
> >   unsigned long int non_temporal_threshold
> >       = shared / cachesize_non_temporal_divisor;
>
> This is the "new" Intel ERMS value.
>
> > +  /* If the computed non_temporal_threshold <= 3/4 * per-thread L3, we most
> > +     likely have incorrect/incomplete cache info in which case, default to
> > +     3/4 * per-thread L3 to avoid regressions.  */
> > +  unsigned long int non_temporal_threshold_lowbound
> > +      = shared_per_thread * 3 / 4;
> > +  if (non_temporal_threshold < non_temporal_threshold_lowbound)
> > +    non_temporal_threshold = non_temporal_threshold_lowbound;
>
> So non_temporal_threshold might be increased to the 3/4 mark *even for
> ERMS* when we know better.
>
> >   /* If no ERMS, we use the per-thread L3 chunking.  Normal cacheable stores run
> >      a higher risk of actually thrashing the cache as they don't have a HW LRU
> >      hint.  As well, their performance in highly parallel situations is
> >      noticeably worse.  */
> >   if (!CPU_FEATURE_USABLE_P (cpu_features, ERMS))
> > -    non_temporal_threshold = shared_per_thread * 3 / 4;
> > +    non_temporal_threshold = non_temporal_threshold_lowbound;
>
> This only does something when the ERMS value was larger than the lower
> bound.
>
> We could end up with an ERMS-enabled core but with a default threshold,
> but I think the only time that ever happens is with only one thread per
> core.  I suspect the logic could be better, but at the moment, it could
> have been worse too ;-)
>
> I note Sajan's comment "I think you can use the existing
> 'minimum_non_temporal_threshold' to handle the misconfigurations." but
> the source disagrees:
>
>   /* If `non_temporal_threshold` less than `minimum_non_temporal_threshold`
>      it most likely means we failed to detect the cache info. We don't want
>      to default to `minimum_non_temporal_threshold` as such a small value,
>      while correct, has bad performance. We default to 64MB as reasonable
>      default bound. 64MB is likely conservative in that most/all systems would
>      choose a lower value so it should never forcing non-temporal stores when
>      they otherwise wouldn't be used. */
>   if (non_temporal_threshold < minimum_non_temporal_threshold)
>     non_temporal_threshold = 64 * 1024 * 1024;
>   else if (non_temporal_threshold > maximum_non_temporal_threshold)
>     non_temporal_threshold = maximum_non_temporal_threshold;
>
> The minimum can't be reset as it's used later for tunables.

Ah, good point.

> LGTM
> Reviewed-by: DJ Delorie <dj@redhat.com>

Thank you for the review DJ.  I'm going to push this shortly.  After
that I think all related changes for the memcpy fixes will be done for
2.38.
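[To make the reviewed logic concrete, here is a minimal standalone sketch
(not glibc code) of the selection being discussed.  The divisor of 4 matches
the documented default of `cachesize_non_temporal_divisor`; the cache sizes
and the helper name `pick_nt_threshold` are hypothetical.]

/* Sketch of the non-temporal threshold selection after this patch.  */
#include <stdbool.h>
#include <stdio.h>

static unsigned long
pick_nt_threshold (unsigned long shared, unsigned long shared_per_thread,
                   bool have_erms)
{
  unsigned long threshold = shared / 4;                /* Default divisor.  */
  unsigned long lowbound = shared_per_thread * 3 / 4;  /* New lower bound.  */
  if (threshold < lowbound)
    threshold = lowbound;   /* Raised even when ERMS is usable.  */
  if (!have_erms)
    threshold = lowbound;   /* No-ERMS path: per-thread L3 chunking.  */
  return threshold;
}

int
main (void)
{
  /* 32 MiB L3 shared by 16 threads: 32/4 = 8 MiB beats the 1.5 MiB bound.  */
  printf ("16 threads/L3, ERMS: %lu MiB\n",
          pick_nt_threshold (32UL << 20, 2UL << 20, true) >> 20);
  /* Same L3 with a single thread sharing it: the 3/4 bound (24 MiB) wins,
     i.e. the "even for ERMS" case DJ points out.  */
  printf (" 1 thread/L3,  ERMS: %lu MiB\n",
          pick_nt_threshold (32UL << 20, 32UL << 20, true) >> 20);
  return 0;
}

[The sketch prints 8 MiB for the 16-thread case and 24 MiB for the
single-thread case, which is exactly the situation where the lower bound
overrides the divisor-based ERMS value.]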
diff --git a/sysdeps/x86/dl-cacheinfo.h b/sysdeps/x86/dl-cacheinfo.h
index c98fa57a7b..2586ff0e31 100644
--- a/sysdeps/x86/dl-cacheinfo.h
+++ b/sysdeps/x86/dl-cacheinfo.h
@@ -745,8 +745,8 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
 
   /* The default setting for the non_temporal threshold is [1/8, 1/2] of size
      of the chip's cache (depending on `cachesize_non_temporal_divisor` which
-     is microarch specific. The default is 1/4). For most Intel and AMD
-     processors with an initial release date between 2017 and 2023, a thread's
+     is microarch specific. The default is 1/4). For most Intel processors
+     with an initial release date between 2017 and 2023, a thread's
      typical share of the cache is from 18-64MB. Using a reasonable size
      fraction of L3 is meant to estimate the point where non-temporal stores
      begin out-competing REP MOVSB. As well the point where the fact that
@@ -757,12 +757,21 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
      the maximum thrashing capped at 1/associativity. */
   unsigned long int non_temporal_threshold
       = shared / cachesize_non_temporal_divisor;
+
+  /* If the computed non_temporal_threshold <= 3/4 * per-thread L3, we most
+     likely have incorrect/incomplete cache info in which case, default to
+     3/4 * per-thread L3 to avoid regressions.  */
+  unsigned long int non_temporal_threshold_lowbound
+      = shared_per_thread * 3 / 4;
+  if (non_temporal_threshold < non_temporal_threshold_lowbound)
+    non_temporal_threshold = non_temporal_threshold_lowbound;
+
   /* If no ERMS, we use the per-thread L3 chunking. Normal cacheable stores run
      a higher risk of actually thrashing the cache as they don't have a HW LRU
      hint. As well, their performance in highly parallel situations is
      noticeably worse. */
   if (!CPU_FEATURE_USABLE_P (cpu_features, ERMS))
-    non_temporal_threshold = shared_per_thread * 3 / 4;
+    non_temporal_threshold = non_temporal_threshold_lowbound;
   /* SIZE_MAX >> 4 because memmove-vec-unaligned-erms right-shifts the value of
      'x86_non_temporal_threshold' by `LOG_4X_MEMCPY_THRESH` (4) and it is best
      if that operation cannot overflow. Minimum of 0x4040 (16448) because the
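[Usage note for anyone measuring this change: the value computed in
dl_init_cacheinfo seeds the glibc.cpu.x86_non_temporal_threshold tunable, so
the resulting default can be inspected with `ld.so --list-tunables` and pinned
for an A/B comparison via GLIBC_TUNABLES, e.g.

  GLIBC_TUNABLES=glibc.cpu.x86_non_temporal_threshold=16777216 ./bench-memcpy

where the 16 MiB value and the benchmark name are only illustrative.]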