Message ID | 20240629035828.4145216-3-MayShao-oc@zhaoxin.com |
---|---|
State | New |
Series | [v2,1/3] x86: Set preferred CPU features on the KH-40000 and KX-7000 Zhaoxin processors |
On Saturday, June 29, 2024, MayShao-oc <MayShao-oc@zhaoxin.com> wrote:

> Current 'non_temporal_threshold' set to 'non_temporal_threshold_lowbound'
> on Zhaoxin processors without ERMS. The default
> 'non_temporal_threshold_lowbound' is too small for the KH-40000 and
> KX-7000
> Zhaoxin processors, this patch updates the value to
> 'shared / cachesize_non_temporal_divisor'.
> ---
> sysdeps/x86/cpu-features.c | 1 +
> sysdeps/x86/dl-cacheinfo.h | 6 ++++--
> 2 files changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/sysdeps/x86/cpu-features.c b/sysdeps/x86/cpu-features.c
> index 1927f65699..e501e084ef 100644
> --- a/sysdeps/x86/cpu-features.c
> +++ b/sysdeps/x86/cpu-features.c
> @@ -1065,6 +1065,7 @@ https://www.intel.com/content/
> www/us/en/support/articles/000059422/processors.ht
>
>      /* Yongfeng and Shijidadao mircoarch tuning. */
>      case 0x5b:
> +      cpu_features->cachesize_non_temporal_divisor = 2;
>      case 0x6b:
>        cpu_features->preferred[index_arch_AVX_Fast_Unaligned_Load]
>          &= ~bit_arch_AVX_Fast_Unaligned_Load;
> diff --git a/sysdeps/x86/dl-cacheinfo.h b/sysdeps/x86/dl-cacheinfo.h
> index 3a6ec4ef9f..5e77345a6e 100644
> --- a/sysdeps/x86/dl-cacheinfo.h
> +++ b/sysdeps/x86/dl-cacheinfo.h
> @@ -934,8 +934,10 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
>    /* If no ERMS, we use the per-thread L3 chunking. Normal cacheable
> stores run
>       a higher risk of actually thrashing the cache as they don't have a
> HW LRU
>       hint. As well, their performance in highly parallel situations is
> -     noticeably worse. */
> -  if (!CPU_FEATURE_USABLE_P (cpu_features, ERMS))
> +     noticeably worse. Zhaoxin processors are an exception, the lowbound
> is not
> +     suitable for them based on actual test data. */
> +  if (!CPU_FEATURE_USABLE_P (cpu_features, ERMS)
> +      && cpu_features->basic.kind != arch_kind_zhaoxin)
>      non_temporal_threshold = non_temporal_threshold_lowbound;
>    /* SIZE_MAX >> 4 because memmove-vec-unaligned-erms right-shifts the
> value of
>       'x86_non_temporal_threshold' by `LOG_4X_MEMCPY_THRESH` (4) and it is
> best
> --
> 2.34.1
>
>

LGTM.

Reviewed-by: Noah Goldstein <goldstein.w.n@gmail.com>
diff --git a/sysdeps/x86/cpu-features.c b/sysdeps/x86/cpu-features.c
index 1927f65699..e501e084ef 100644
--- a/sysdeps/x86/cpu-features.c
+++ b/sysdeps/x86/cpu-features.c
@@ -1065,6 +1065,7 @@ https://www.intel.com/content/www/us/en/support/articles/000059422/processors.ht
 
      /* Yongfeng and Shijidadao mircoarch tuning. */
      case 0x5b:
+      cpu_features->cachesize_non_temporal_divisor = 2;
      case 0x6b:
        cpu_features->preferred[index_arch_AVX_Fast_Unaligned_Load]
          &= ~bit_arch_AVX_Fast_Unaligned_Load;
diff --git a/sysdeps/x86/dl-cacheinfo.h b/sysdeps/x86/dl-cacheinfo.h
index 3a6ec4ef9f..5e77345a6e 100644
--- a/sysdeps/x86/dl-cacheinfo.h
+++ b/sysdeps/x86/dl-cacheinfo.h
@@ -934,8 +934,10 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
   /* If no ERMS, we use the per-thread L3 chunking. Normal cacheable stores run
      a higher risk of actually thrashing the cache as they don't have a HW LRU
      hint. As well, their performance in highly parallel situations is
-     noticeably worse. */
-  if (!CPU_FEATURE_USABLE_P (cpu_features, ERMS))
+     noticeably worse. Zhaoxin processors are an exception, the lowbound is not
+     suitable for them based on actual test data. */
+  if (!CPU_FEATURE_USABLE_P (cpu_features, ERMS)
+      && cpu_features->basic.kind != arch_kind_zhaoxin)
     non_temporal_threshold = non_temporal_threshold_lowbound;
   /* SIZE_MAX >> 4 because memmove-vec-unaligned-erms right-shifts the value of
      'x86_non_temporal_threshold' by `LOG_4X_MEMCPY_THRESH` (4) and it is best
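
For readers following the series, the effect of the second hunk can be sketched outside glibc roughly as below. This is only an illustration of the selection logic the patch describes, not the code in dl-cacheinfo.h: `struct threshold_inputs` and `pick_non_temporal_threshold` are hypothetical names, the lowbound is taken as a plain input rather than computed, and the other adjustments dl_init_cacheinfo applies (tunables, upper and lower clamps) are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the values dl_init_cacheinfo works with.
   None of these names exist in glibc.  */
struct threshold_inputs
{
  size_t shared;    /* Shared cache size in bytes.  */
  size_t divisor;   /* Plays the role of cachesize_non_temporal_divisor.  */
  size_t lowbound;  /* Plays the role of non_temporal_threshold_lowbound.  */
  bool has_erms;    /* ERMS usable on this CPU.  */
  bool is_zhaoxin;  /* basic.kind == arch_kind_zhaoxin.  */
};

/* Simplified sketch: start from shared / divisor, then fall back to the
   lowbound only when ERMS is missing and the CPU is not a Zhaoxin part.  */
size_t
pick_non_temporal_threshold (const struct threshold_inputs *in)
{
  assert (in->divisor != 0);
  size_t threshold = in->shared / in->divisor;
  if (!in->has_erms && !in->is_zhaoxin)
    threshold = in->lowbound;
  return threshold;
}
```

With made-up inputs of a 32 MiB shared cache, a divisor of 2, and no ERMS, a Zhaoxin CPU keeps the 16 MiB value from `shared / divisor`, while a non-Zhaoxin CPU with the same inputs still drops to whatever lowbound is passed in.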