Message ID | CAMe9rOpO_q8hMt+U4xUB8HoR9i7E0LO92SCiU_=v_306XdtjJQ@mail.gmail.com |
---|---|
State | New |
Headers | show |
I definitely think increasing the size in the case of processors with a large number of cores makes sense. Hopefully with some testing we can confirm it is a net win and/or find a more empirical number. Thanks for that patch with the tunable support. I've just put a similar patch in review for sharing right now. It adds support in the case that HAVE_TUNABLES isn't defined like the similar code in arena.c and also makes a minor change that turns init_cacheinfo into a init_cacheinfo_impl (a hidden callable). init_cacheinfo is now a constructor that just calls the impl and passes the cpu_features struct. This is useful in that it makes the code a bit more modular (something that we'll need to be able to test this internally). On Mon, May 22, 2017 at 12:17 PM, H.J. Lu <hjl.tools@gmail.com> wrote: > On Thu, May 18, 2017 at 1:59 PM, Erich Elsen <eriche@google.com> wrote: >> Hi H.J., >> >> I was on vacation, sorry for the slow reply. The updated benchmark >> still shows the same behavior, thanks. >> >> I'll try my hand at creating a patch that makes that variable >> __x86_shared_non_temporal_threshold a tunable. It will be necessary >> to do internal experiments anyway. >> > > __x86_shared_non_temporal_threshold was set to 6 times of per-core > shared cache size, based on the large memcpy micro benchmark in glibc > on a 8-core processor. For a processor with more than 8 cores, the > threshold is too low. Set __x86_shared_non_temporal_threshold to the > 3/4 of the total shared cache size so that it is unchanged on 8-core > processors. On processors with less than 8 cores, the threshold is > lower. > > Any comments? > > -- > H.J.
On Mon, May 22, 2017 at 6:23 PM, Erich Elsen <eriche@google.com> wrote: > I definitely think increasing the size in the case of processors with > a large number of cores makes sense. Hopefully with some testing we > can confirm it is a net win and/or find a more empirical number. > > Thanks for that patch with the tunable support. I've just put a > similar patch in review for sharing right now. It adds support in the > case that HAVE_TUNABLES isn't defined like the similar code in arena.c > and also makes a minor change that turns init_cacheinfo into a > init_cacheinfo_impl (a hidden callable). init_cacheinfo is now a > constructor that just calls the impl and passes the cpu_features > struct. This is useful in that it makes the code a bit more modular > (something that we'll need to be able to test this internally). This sounds a good idea. I'd also like to add tunable support in init_cpu_features to turn on/off CPU features. non_temporal_threshold will be one of them.
From bfb716e07b77f0ed8e0c2689d5cd01e2c8251fc5 Mon Sep 17 00:00:00 2001 From: "H.J. Lu" <hjl.tools@gmail.com> Date: Fri, 12 May 2017 13:38:04 -0700 Subject: [PATCH] x86: Update __x86_shared_non_temporal_threshold __x86_shared_non_temporal_threshold was set to 6 times of per-core shared cache size, based on the large memcpy micro benchmark in glibc on a 8-core processor. For a processor with more than 8 cores, the threshold is too low. Set __x86_shared_non_temporal_threshold to the 3/4 of the total shared cache size so that it is unchanged on 8-core processors. On processors with less than 8 cores, the threshold is lower. * sysdeps/x86/cacheinfo.c (__x86_shared_non_temporal_threshold): Set to the 3/4 of the total shared cache size. --- sysdeps/x86/cacheinfo.c | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/sysdeps/x86/cacheinfo.c b/sysdeps/x86/cacheinfo.c index 1ccbe41..3434d97 100644 --- a/sysdeps/x86/cacheinfo.c +++ b/sysdeps/x86/cacheinfo.c @@ -766,6 +766,8 @@ intel_bug_no_cache_info: /* The large memcpy micro benchmark in glibc shows that 6 times of shared cache size is the approximate value above which non-temporal - store becomes faster. */ - __x86_shared_non_temporal_threshold = __x86_shared_cache_size * 6; + store becomes faster on a 8-core processor. This is the 3/4 of the + total shared cache size. */ + __x86_shared_non_temporal_threshold + = __x86_shared_cache_size * threads * 3 / 4; } -- 2.9.4