Message ID | 1600891781-9272-1-git-send-email-patrick.mcgehearty@oracle.com |
---|---|
State | New |
Series | [v2] Reversing calculation of __x86_shared_non_temporal_threshold |
On Wed, Sep 23, 2020 at 1:10 PM Patrick McGehearty via Libc-alpha <libc-alpha@sourceware.org> wrote:
>
> The __x86_shared_non_temporal_threshold determines when memcpy on x86
> uses non_temporal stores to avoid pushing other data out of the last
> level cache.
>
> This patch proposes to revert the calculation change made by H.J. Lu's
> patch of June 2, 2017.
>
> H.J. Lu's patch selected a threshold suitable for a single thread
> getting maximum performance. It was tuned using the single threaded
> large memcpy micro benchmark on an 8 core processor. The last change
> changes the threshold from using 3/4 of one thread's share of the
> cache to using 3/4 of the entire cache of a multi-threaded system
> before switching to non-temporal stores. Multi-threaded systems with
> more than a few threads are server-class and typically have many
> active threads. If one thread consumes 3/4 of the available cache for
> all threads, it will cause other active threads to have data removed
> from the cache. Two examples show the range of the effect. John
> McCalpin's widely parallel Stream benchmark, which runs in parallel
> and fetches data sequentially, saw a 20% slowdown with this patch on
> an internal system test of 128 threads. This regression was discovered
> when comparing OL8 performance to OL7. An example that compares
> normal stores to non-temporal stores may be found at
> https://vgatherps.github.io/2018-09-02-nontemporal/. A simple test
> shows performance loss of 400 to 500% due to a failure to use
> nontemporal stores. These performance losses are most likely to occur
> when the system load is heaviest and good performance is critical.
>
> The tunable x86_non_temporal_threshold can be used to override the
> default for the knowledgable user who really wants maximum cache
> allocation to a single thread in a multi-threaded system.
> The manual entry for the tunable has been expanded to provide
> more information about its purpose.
>
> modified: sysdeps/x86/cacheinfo.c
> modified: manual/tunables.texi
> ---
>  manual/tunables.texi    |  6 +++++-
>  sysdeps/x86/cacheinfo.c | 12 +++++++-----
>  2 files changed, 12 insertions(+), 6 deletions(-)
>
> diff --git a/manual/tunables.texi b/manual/tunables.texi
> index b6bb54d..94d4fbd 100644
> --- a/manual/tunables.texi
> +++ b/manual/tunables.texi
> @@ -364,7 +364,11 @@ set shared cache size in bytes for use in memory and string routines.
>
>  @deftp Tunable glibc.tune.x86_non_temporal_threshold
>  The @code{glibc.tune.x86_non_temporal_threshold} tunable allows the user
> -to set threshold in bytes for non temporal store.
> +to set threshold in bytes for non temporal store. Non temporal stores
> +give a hint to the hardware to move data directly to memory without
> +displacing other data from the cache. This tunable is used by some
> +platforms to determine when to use non temporal stores in operations
> +like memmove and memcpy.
>
>  This tunable is specific to i386 and x86-64.
>  @end deftp
> diff --git a/sysdeps/x86/cacheinfo.c b/sysdeps/x86/cacheinfo.c
> index b9444dd..c6767d9 100644
> --- a/sysdeps/x86/cacheinfo.c
> +++ b/sysdeps/x86/cacheinfo.c
> @@ -778,14 +778,16 @@ intel_bug_no_cache_info:
>        __x86_shared_cache_size = shared;
>      }
>
> -  /* The large memcpy micro benchmark in glibc shows that 6 times of
> -     shared cache size is the approximate value above which non-temporal
> -     store becomes faster on a 8-core processor. This is the 3/4 of the
> -     total shared cache size. */
> +  /* The default setting for the non_temporal threshold is 3/4
> +     of one thread's share of the chip's cache. While higher
> +     single thread performance may be observed with a higher
> +     threshold, having a single thread use more than it's share
> +     of the cache will negatively impact the performance of
> +     other threads running on the chip. */
>    __x86_shared_non_temporal_threshold
>      = (cpu_features->non_temporal_threshold != 0
>         ? cpu_features->non_temporal_threshold
> -       : __x86_shared_cache_size * threads * 3 / 4);
> +       : __x86_shared_cache_size * 3 / 4);
>  }
>

Can we tune it with the number of threads and/or total cache size?
On 9/23/2020 3:23 PM, H.J. Lu wrote:
> Can we tune it with the number of threads and/or total cache
> size?

When you say "total cache size", is that different from
shared_cache_size * threads?

I see a fundamental conflict of optimization goals:
1) Provide best single thread performance (current code)
2) Provide best overall system performance under full load (proposed patch)
I don't know of any way to have default behavior meet both goals
without knowledge of the system size/usage/requirements.

Consider a hypothetical single chip system with 64 threads and 128 MB
of total cache on the chip. That won't be uncommon in the coming years
on server class systems, especially in large databases or HPC
environments (think vision processing or weather modeling, for
example). Suppose a single app owns the whole chip and is running a
multi-threaded application, but needs to memcpy a really large block of
data when one phase of computation finishes, before moving to the next
phase. A common practice would be to have 64 parallel calls to memcpy.
The Stream benchmark demonstrates with OpenMP that current compilers
handle that with no trouble.

In the example, the per thread share of the cache is 2 MB and the
proposed formula will set the threshold at 1.5 Mbytes. If the total
copy size is 96 Mbytes or less (64 threads at 1.5 Mbytes each), all
threads comfortably fit in cache. If the total copy size is over that,
then non-temporal stores are used and all is well there too.

The current formula would set the threshold at 96 Mbytes for each
thread. Only when the total copy size reached 64*96 Mbytes = 6 GBytes
would non-temporal stores be used. We'd like to switch to non-temporal
stores much sooner, as we will be thrashing all the threads' caches.

In practical terms, I've had access to typical memcpy copy lengths for
a variety of commercial applications while studying memcpy on Solaris
over the years. The vast majority of copies are for 64 Kbytes or less.
Most modern chips have much more than 64 Kbytes of cache per thread,
allowing in-cache copies for the common case, even without borrowing
cache from other threads. The occasional really large copies tend to
occur when an application is passing a block of data to prepare for a
new phase of computation or as a shared memory communication to
another thread. In these cases, having the data remain in cache is
usually not relevant, and using non-temporal stores even when they are
not strictly required does not have a negative effect on performance.

A downside of tuning for a single thread comes in cloud computing
environments, where having neighboring threads being cache hogs, even
if relatively isolated in virtual machines, is a "bad thing" for
having stable system performance. Whatever we can do to provide
consistent, reasonable performance whatever the neighboring threads
might be doing is a "good thing".

- patrick
On Wed, Sep 23, 2020 at 1:57 PM Patrick McGehearty <patrick.mcgehearty@oracle.com> wrote:
> A downside of tuning for a single thread comes in cloud computing
> environments, where having neighboring threads being cache hogs,
> even if relatively isolated in virtual machines, is a "bad thing"
> for having stable system performance. Whatever we can do to provide
> consistent, reasonable performance whatever the neighboring threads
> might be doing is a "good thing".
>

Have you tried the full __x86_shared_cache_size instead of 3 / 4?
On 9/23/2020 4:37 PM, H.J. Lu wrote:
> Have you tried the full __x86_shared_cache_size instead of 3 / 4?

I have not tested larger thresholds. I'd be more comfortable with a
smaller one. We could construct specific tests to show either an
advantage or a disadvantage to shifting from 3/4 to all of cache,
depending on what data access was used between memcpy operations.

I consider pushing the limit on cache usage to be a risky approach.
Few applications work on only a single block of data. If all threads
are doing a shared copy and they use all the available cache, then
after the memcpy returns, any other active data will have been pushed
out of the cache. That's likely to cause severe performance loss in
more cases than the modest performance gains for the few cases where
the application is only concerned with using the data that was just
copied.

To give a more detailed example where large copies are not followed
by use of the data, consider garbage collection followed by
compression. With a multi-age garbage collector, stable data that is
active and has survived several garbage collections is in an 'old'
region. It does not need to be copied. The current 'new' region is
full but has both referenced and unreferenced data. After the marking
phase, the individual elements of the referenced data are copied to
the base of the 'new' region. When complete, the rest of the 'new'
region becomes the new free pool. The total amount copied may far
exceed the processor cache. Then the application exits garbage
collection and resumes active use of mostly the stable data, with some
accesses to the just moved new data and fresh allocations. If we
under-use non-temporal stores, we clear the cache and the whole
application runs slower than otherwise.

Individual memcpy benchmarks are useful in isolation testing and
comparing code patterns, but can mislead about overall application
performance in the context of potential for cache abuse. I fell into
that tarpit once while tuning memcpy for Solaris and finding that my
new, wonderfully fast copy code (ok, maybe 5% faster for in-cache
data) caused a major customer application to run slower because my
new code abused the cache. I modified my code to only use the new
"in-cache fast copy" for copies less than a threshold (64 Kbytes or
128 Kbytes if I remember right) and all was well.

- patrick
On Wed, Sep 23, 2020 at 3:39 PM Patrick McGehearty <patrick.mcgehearty@oracle.com> wrote:
> >> In practical terms, I've had access to typical memcpy copy lengths for a
> >> variety of commercial applications while studying memcpy on Solaris over
> >> the years. The vast majority of copies are for 64Kbytes or less. Most
> >> modern chips have much more than 64Kbytes of cache per thread, allowing
> >> in-cache copies for the common case, even without borrowing cache from
> >> other threads. The occasional really large copies tend to be when an
> >> application is passing a block of data to prepare for a new phase of
> >> computation or as a shared memory communication to another thread.
In these cases, having the data remain > >> in cache is usually > >> not relevant and using non-temporal stores even when they are not > >> strictly required does > >> not have a negative affect on performance. > >> > >> A downside of tuning for a single thread comes in cloud computing > >> environments, where > >> having neighboring threads being cache hogs, even if relatively isolated > >> in virtual machines, > >> is a "bad thing" for having stable system performance. Whatever we can > >> do to provide consistent, > >> reasonable performance whatever the neighboring threads might be doing > >> is a "good thing". > >> > > Have you tried the full __x86_shared_cache_size instead of 3 / 4? > > > > I have not tested larger thresholds. I'd be more comfortable with a > smaller one. > We could construct specific tests to show either advantage or disadvantage > to shifting from 3/4 to all of cache depending on what data access was used > between memcpy operations. > > I consider pushing the limit on cache usage to be a risky approach. Few > applications > only work on a single block of data. If all threads are doing a shared > copy and > they use all the available cache, then after the memcpy returns, any other > active data would have been pushed out of the cache. That's likely to cost > severe performance loss in more cases than the modest performance gains for > a few cases where the application only is concerned with using the data that > was just copied. > > Just to give a more detailed example where large copies are not followed > by using > the data. Consider garbage collection followed by compression. With a > multi-age > garbage collector, stable data that is active and survived several > garbage collections > is in a 'old' region. It does not need to be copied. The current 'new' > region is full > but has both referenced and unreferenced data. After the marking phase, > the individual elements of the referenced data is copied to the base of > the 'new' region. 
> When complete, the rest of the 'new' region becomes the new free pool. > The total amount copied may far exceed the processor cache. Then the > application > exits garbage collection and resumes active use of mostly the stable > data with > some accesses to the just moved new data and fresh allocations. If we > under-use > non-temporal stores, we clear the cache and the whole application runs > slower > than otherwise. > > Individual memcpy benchmarks are useful in isolation testing and comparing > code patterns but can mislead about overall application performance in the > context of potential for cache abuse. I fell into that tarpit once while > tuning > memcpy for Solaris and finding my new, wonderfully fast copy code (ok, maybe > 5% faster for in-cache data) caused a major customer application to run > slower > because my new code abused the cache. I modified my code to only use the > new "in-cache fast copy" for copies less than a threshold (64Kbytes or > 128Kbytes if I remember right) and all was well. > The new threshold can be substantially smaller with large core count. Are you saying that even 3 / 4 may be too big? Is there a reasonable fixed threshold?
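The 64-thread, 128 MB example quoted above can be checked with a few lines of C. This is only a sketch of the arithmetic in the discussion: the helper names are hypothetical and are not glibc functions. The real calculation lives in sysdeps/x86/cacheinfo.c, as shown in the diff. It assumes, as the thread does, that `__x86_shared_cache_size` holds one thread's share of the shared cache.

```c
#include <stdint.h>

#define MIB (UINT64_C(1) << 20)

/* Hypothetical helper: one thread's share of the chip's shared cache. */
static uint64_t per_thread_share(uint64_t total_cache, uint64_t threads)
{
  return total_cache / threads;
}

/* Formula from the proposed patch: 3/4 of one thread's share. */
static uint64_t proposed_threshold(uint64_t share)
{
  return share * 3 / 4;
}

/* Formula the patch reverts: 3/4 of the entire shared cache. */
static uint64_t current_threshold(uint64_t share, uint64_t threads)
{
  return share * threads * 3 / 4;
}
```

With 128 MiB of cache and 64 threads, the share is 2 MiB, the proposed threshold is 1.5 MiB per thread (96 MiB across all 64 threads), and the current threshold is 96 MiB per thread (6 GiB across all 64 threads), matching the figures quoted in the thread.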
On 9/23/2020 6:13 PM, H.J. Lu wrote:
> The new threshold can be substantially smaller with large core count.
> Are you saying that even 3 / 4 may be too big? Is there a reasonable
> fixed threshold?

I don't have any evidence to say 3/4 is too big for typical applications and environments. In 2012, the default for memcpy was set to 1/2 the shared_cache_size, which is the current default for Oracle el7 and Red Hat el7.

Given the typically larger caches per thread today than 8 years ago, 3/4 may work out well, since the remaining 1/4 of today's larger cache is often greater than 1/2 of yesteryear's smaller cache.

- patrick
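The headroom argument can be made concrete with a small calculation. The sketch below is illustrative only: the helper name is hypothetical, and the 1 MiB and 2 MiB per-thread cache sizes are assumed figures chosen for the arithmetic, not measurements from this thread.

```c
#include <stdint.h>

#define MIB (UINT64_C(1) << 20)

/* Hypothetical helper: bytes of a thread's cache share left for other
   data when a copy may use num/den of the share before memcpy
   switches to non-temporal stores. */
static uint64_t headroom(uint64_t share, uint64_t num, uint64_t den)
{
  return share - share * num / den;
}
```

Under these assumptions, a 3/4 threshold on a 2 MiB share leaves 512 KiB, the same as a 1/2 threshold on a 1 MiB share. The 3/4 setting therefore leaves at least as much headroom as the 2012 default whenever the per-thread cache has at least doubled.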
On Thu, Sep 24, 2020 at 2:49 PM Patrick McGehearty <patrick.mcgehearty@oracle.com> wrote:
> I don't have any evidence to say 3/4 is too big for typical applications
> and environments. In 2012, the default for memcpy was set to 1/2 the
> shared_cache_size, which is the current default for Oracle el7 and
> Red Hat el7.
>
> Given the typically larger caches per thread today than 8 years ago,
> 3/4 may work out well, since the remaining 1/4 of today's larger cache
> is often greater than 1/2 of yesteryear's smaller cache.

Please update the comment with your rationale for 3/4. Don't use today or current. Use 2020 instead. Thanks.
On 9/24/2020 4:54 PM, H.J. Lu wrote: > On Thu, Sep 24, 2020 at 2:49 PM Patrick McGehearty > <patrick.mcgehearty@oracle.com> wrote: >> >> >> On 9/23/2020 6:13 PM, H.J. Lu wrote: >>> On Wed, Sep 23, 2020 at 3:39 PM Patrick McGehearty >>> <patrick.mcgehearty@oracle.com> wrote: >>>> >>>> On 9/23/2020 4:37 PM, H.J. Lu wrote: >>>>> On Wed, Sep 23, 2020 at 1:57 PM Patrick McGehearty >>>>> <patrick.mcgehearty@oracle.com> wrote: >>>>>> On 9/23/2020 3:23 PM, H.J. Lu wrote: >>>>>>> On Wed, Sep 23, 2020 at 1:10 PM Patrick McGehearty via Libc-alpha >>>>>>> <libc-alpha@sourceware.org> wrote: >>>>>>>> The __x86_shared_non_temporal_threshold determines when memcpy on x86 >>>>>>>> uses non_temporal stores to avoid pushing other data out of the last >>>>>>>> level cache. >>>>>>>> >>>>>>>> This patch proposes to revert the calculation change made by H.J. Lu's >>>>>>>> patch of June 2, 2017. >>>>>>>> >>>>>>>> H.J. Lu's patch selected a threshold suitable for a single thread >>>>>>>> getting maximum performance. It was tuned using the single threaded >>>>>>>> large memcpy micro benchmark on an 8 core processor. The last change >>>>>>>> changes the threshold from using 3/4 of one thread's share of the >>>>>>>> cache to using 3/4 of the entire cache of a multi-threaded system >>>>>>>> before switching to non-temporal stores. Multi-threaded systems with >>>>>>>> more than a few threads are server-class and typically have many >>>>>>>> active threads. If one thread consumes 3/4 of the available cache for >>>>>>>> all threads, it will cause other active threads to have data removed >>>>>>>> from the cache. Two examples show the range of the effect. John >>>>>>>> McCalpin's widely parallel Stream benchmark, which runs in parallel >>>>>>>> and fetches data sequentially, saw a 20% slowdown with this patch on >>>>>>>> an internal system test of 128 threads. This regression was discovered >>>>>>>> when comparing OL8 performance to OL7. 
An example that compares >>>>>>>> normal stores to non-temporal stores may be found at >>>>>>>> https://urldefense.com/v3/__https://vgatherps.github.io/2018-09-02-nontemporal/__;!!GqivPVa7Brio!IK1RH6wG0bg4U3NNMDpXf50VgsV9CFOEUaG0kGy6YYtq1G1Ca5VSz5szAxG0Zkiqdl8-IWc$ . A simple test >>>>>>>> shows performance loss of 400 to 500% due to a failure to use >>>>>>>> nontemporal stores. These performance losses are most likely to occur >>>>>>>> when the system load is heaviest and good performance is critical. >>>>>>>> >>>>>>>> The tunable x86_non_temporal_threshold can be used to override the >>>>>>>> default for the knowledgable user who really wants maximum cache >>>>>>>> allocation to a single thread in a multi-threaded system. >>>>>>>> The manual entry for the tunable has been expanded to provide >>>>>>>> more information about its purpose. >>>>>>>> >>>>>>>> modified: sysdeps/x86/cacheinfo.c >>>>>>>> modified: manual/tunables.texi >>>>>>>> --- >>>>>>>> manual/tunables.texi | 6 +++++- >>>>>>>> sysdeps/x86/cacheinfo.c | 12 +++++++----- >>>>>>>> 2 files changed, 12 insertions(+), 6 deletions(-) >>>>>>>> >>>>>>>> diff --git a/manual/tunables.texi b/manual/tunables.texi >>>>>>>> index b6bb54d..94d4fbd 100644 >>>>>>>> --- a/manual/tunables.texi >>>>>>>> +++ b/manual/tunables.texi >>>>>>>> @@ -364,7 +364,11 @@ set shared cache size in bytes for use in memory and string routines. >>>>>>>> >>>>>>>> @deftp Tunable glibc.tune.x86_non_temporal_threshold >>>>>>>> The @code{glibc.tune.x86_non_temporal_threshold} tunable allows the user >>>>>>>> -to set threshold in bytes for non temporal store. >>>>>>>> +to set threshold in bytes for non temporal store. Non temporal stores >>>>>>>> +give a hint to the hardware to move data directly to memory without >>>>>>>> +displacing other data from the cache. This tunable is used by some >>>>>>>> +platforms to determine when to use non temporal stores in operations >>>>>>>> +like memmove and memcpy. 
>>>>>>>> >>>>>>>> This tunable is specific to i386 and x86-64. >>>>>>>> @end deftp >>>>>>>> diff --git a/sysdeps/x86/cacheinfo.c b/sysdeps/x86/cacheinfo.c >>>>>>>> index b9444dd..c6767d9 100644 >>>>>>>> --- a/sysdeps/x86/cacheinfo.c >>>>>>>> +++ b/sysdeps/x86/cacheinfo.c >>>>>>>> @@ -778,14 +778,16 @@ intel_bug_no_cache_info: >>>>>>>> __x86_shared_cache_size = shared; >>>>>>>> } >>>>>>>> >>>>>>>> - /* The large memcpy micro benchmark in glibc shows that 6 times of >>>>>>>> - shared cache size is the approximate value above which non-temporal >>>>>>>> - store becomes faster on a 8-core processor. This is the 3/4 of the >>>>>>>> - total shared cache size. */ >>>>>>>> + /* The default setting for the non_temporal threshold is 3/4 >>>>>>>> + of one thread's share of the chip's cache. While higher >>>>>>>> + single thread performance may be observed with a higher >>>>>>>> + threshold, having a single thread use more than it's share >>>>>>>> + of the cache will negatively impact the performance of >>>>>>>> + other threads running on the chip. */ >>>>>>>> __x86_shared_non_temporal_threshold >>>>>>>> = (cpu_features->non_temporal_threshold != 0 >>>>>>>> ? cpu_features->non_temporal_threshold >>>>>>>> - : __x86_shared_cache_size * threads * 3 / 4); >>>>>>>> + : __x86_shared_cache_size * 3 / 4); >>>>>>>> } >>>>>>>> >>>>>>> Can we tune it with the number of threads and/or total cache >>>>>>> size? >>>>>>> >>>>>> When you say "total cache size", is that different from >>>>>> shared_cache_size * threads? >>>>>> >>>>>> I see a fundamental conflict of optimization goals: >>>>>> 1) Provide best single thread performance (current code) >>>>>> 2) Provide best overall system performance under full load (proposed patch) >>>>>> I don't know of any way to have default behavior meet both goals without >>>>>> knowledge >>>>>> of the system size/usage/requirements. >>>>>> >>>>>> Consider a hypothetical single chip system with 64 threads and 128 MB of >>>>>> total cache on the chip. 
>>>>>> That won't be uncommon in the coming years on server class systems, >>>>>> especially >>>>>> in large databases or HPC environments (think vision processing or >>>>>> weather modeling for example). >>>>>> If a single app owns the whole chip and is running a multi-threaded >>>>>> application but needs >>>>>> to memcpy a really large block of data when one phase of computation >>>>>> finished >>>>>> before moving to the next phase. A common practice would be to have 64 >>>>>> parallel calls >>>>>> to memcpy. The Stream benchmark demonstrates with OpenMP that current >>>>>> compilers >>>>>> handle that with no trouble. >>>>>> >>>>>> In the example, the per thread share of the cache is 2 MB and the >>>>>> proposed formula will set >>>>>> the threshold at 1.5 Mbytes. If the total copy size is 96 Mbytes or >>>>>> less, all threads comfortably >>>>>> fit in cache. If the total copy size is over that, then non-temporal >>>>>> stores are used and all is well there too. >>>>>> >>>>>> The current formula would set the threshold at 96 Mbytes for each >>>>>> thread. Only when the total >>>>>> copy size was 64*96 Mbytes = 6 GBytes would non-temporal stores be used. >>>>>> We'd like >>>>>> to switch to non-temporal stores much sooner as we will be thrashing all >>>>>> the threads caches. >>>>>> >>>>>> In practical terms, I've had access to typical memcpy copy lengths for a >>>>>> variety of commerical >>>>>> applications while studying memcpy on Solaris over the years. The vast >>>>>> majority of copies >>>>>> are for 64Kbytes or less. Most modern chips have much more than 64Kbytes >>>>>> of cache >>>>>> per thread, allowing in-cache copies for the common case, even without >>>>>> borrowing >>>>>> cache from other threads. The occasional really large copies tend to be >>>>>> when an application >>>>>> is passing a block of data to prepare for a new phase of computation or >>>>>> as a shared memory >>>>>> communication to another thread. 
In these cases, having the data remain >>>>>> in cache is usually >>>>>> not relevant and using non-temporal stores even when they are not >>>>>> strictly required does >>>>>> not have a negative affect on performance. >>>>>> >>>>>> A downside of tuning for a single thread comes in cloud computing >>>>>> environments, where >>>>>> having neighboring threads being cache hogs, even if relatively isolated >>>>>> in virtual machines, >>>>>> is a "bad thing" for having stable system performance. Whatever we can >>>>>> do to provide consistent, >>>>>> reasonable performance whatever the neighboring threads might be doing >>>>>> is a "good thing". >>>>>> >>>>> Have you tried the full __x86_shared_cache_size instead of 3 / 4? >>>>> >>>> I have not tested larger thresholds. I'd be more comfortable with a >>>> smaller one. >>>> We could construct specific tests to show either advantage or disadvantage >>>> to shifting from 3/4 to all of cache depending on what data access was used >>>> between memcpy operations. >>>> >>>> I consider pushing the limit on cache usage to be a risky approach. Few >>>> applications >>>> only work on a single block of data. If all threads are doing a shared >>>> copy and >>>> they use all the available cache, then after the memcpy returns, any other >>>> active data would have been pushed out of the cache. That's likely to cost >>>> severe performance loss in more cases than the modest performance gains for >>>> a few cases where the application only is concerned with using the data that >>>> was just copied. >>>> >>>> Just to give a more detailed example where large copies are not followed >>>> by using >>>> the data. Consider garbage collection followed by compression. With a >>>> multi-age >>>> garbage collector, stable data that is active and survived several >>>> garbage collections >>>> is in a 'old' region. It does not need to be copied. The current 'new' >>>> region is full >>>> but has both referenced and unreferenced data. 
After the marking phase, >>>> the individual elements of the referenced data is copied to the base of >>>> the 'new' region. >>>> When complete, the rest of the 'new' region becomes the new free pool. >>>> The total amount copied may far exceed the processor cache. Then the >>>> application >>>> exits garbage collection and resumes active use of mostly the stable >>>> data with >>>> some accesses to the just moved new data and fresh allocations. If we >>>> under-use >>>> non-temporal stores, we clear the cache and the whole application runs >>>> slower >>>> than otherwise. >>>> >>>> Individual memcpy benchmarks are useful in isolation testing and comparing >>>> code patterns but can mislead about overall application performance in the >>>> context of potential for cache abuse. I fell into that tarpit once while >>>> tuning >>>> memcpy for Solaris and finding my new, wonderfully fast copy code (ok, maybe >>>> 5% faster for in-cache data) caused a major customer application to run >>>> slower >>>> because my new code abused the cache. I modified my code to only use the >>>> new "in-cache fast copy" for copies less than a threshold (64Kbytes or >>>> 128Kbytes if I remember right) and all was well. >>>> >>> The new threshold can be substantially smaller with large core count. >>> Are you saying that even 3 / 4 may be too big? Is there a reasonable >>> fixed threshold? >>> >> I don't have any evidence to say 3/4 is too big for typical applications >> and environments. >> In 2012, the default for memcpy was set to 1/2 the shared_cache_size >> which is what is >> the current default for Oracle el7 and Red Hat el7. >> >> Given the typically larger sized caches/thread today than 8 years, 3/4 >> may work out well >> since the remaining 1/4 of today's larger cache is often greater than >> 1/2 of yesteryear's smaller cache. >> > Please update the comment with your rationale for 3/4. Don't use > today or current. Use 2020 instead. > > Thanks. 
>
I'm unsure about what needs to change in the comment, which does not currently mention any dates. I'm assuming you are referring to the following comment in cacheinfo.c:

/* The default setting for the non_temporal threshold is 3/4
   of one thread's share of the chip's cache. While higher
   single thread performance may be observed with a higher
   threshold, having a single thread use more than it's share
   of the cache will negatively impact the performance of
   other threads running on the chip. */

While I could add a comment on why 3/4 rather than 1/2 is the best choice, I don't have hard data to back it up. I'd be comfortable with either 3/4 or 1/2. I selected 3/4 because it was closer to the formula you chose in 2017 than to the formula you chose in 2012.

- patrick
On Thu, Sep 24, 2020 at 4:22 PM Patrick McGehearty <patrick.mcgehearty@oracle.com> wrote: > > > > On 9/24/2020 4:54 PM, H.J. Lu wrote: > > On Thu, Sep 24, 2020 at 2:49 PM Patrick McGehearty > > <patrick.mcgehearty@oracle.com> wrote: > >> > >> > >> On 9/23/2020 6:13 PM, H.J. Lu wrote: > >>> On Wed, Sep 23, 2020 at 3:39 PM Patrick McGehearty > >>> <patrick.mcgehearty@oracle.com> wrote: > >>>> > >>>> On 9/23/2020 4:37 PM, H.J. Lu wrote: > >>>>> On Wed, Sep 23, 2020 at 1:57 PM Patrick McGehearty > >>>>> <patrick.mcgehearty@oracle.com> wrote: > >>>>>> On 9/23/2020 3:23 PM, H.J. Lu wrote: > >>>>>>> On Wed, Sep 23, 2020 at 1:10 PM Patrick McGehearty via Libc-alpha > >>>>>>> <libc-alpha@sourceware.org> wrote: > >>>>>>>> The __x86_shared_non_temporal_threshold determines when memcpy on x86 > >>>>>>>> uses non_temporal stores to avoid pushing other data out of the last > >>>>>>>> level cache. > >>>>>>>> > >>>>>>>> This patch proposes to revert the calculation change made by H.J. Lu's > >>>>>>>> patch of June 2, 2017. > >>>>>>>> > >>>>>>>> H.J. Lu's patch selected a threshold suitable for a single thread > >>>>>>>> getting maximum performance. It was tuned using the single threaded > >>>>>>>> large memcpy micro benchmark on an 8 core processor. The last change > >>>>>>>> changes the threshold from using 3/4 of one thread's share of the > >>>>>>>> cache to using 3/4 of the entire cache of a multi-threaded system > >>>>>>>> before switching to non-temporal stores. Multi-threaded systems with > >>>>>>>> more than a few threads are server-class and typically have many > >>>>>>>> active threads. If one thread consumes 3/4 of the available cache for > >>>>>>>> all threads, it will cause other active threads to have data removed > >>>>>>>> from the cache. Two examples show the range of the effect. 
John > >>>>>>>> McCalpin's widely parallel Stream benchmark, which runs in parallel > >>>>>>>> and fetches data sequentially, saw a 20% slowdown with this patch on > >>>>>>>> an internal system test of 128 threads. This regression was discovered > >>>>>>>> when comparing OL8 performance to OL7. An example that compares > >>>>>>>> normal stores to non-temporal stores may be found at > >>>>>>>> https://urldefense.com/v3/__https://vgatherps.github.io/2018-09-02-nontemporal/__;!!GqivPVa7Brio!IK1RH6wG0bg4U3NNMDpXf50VgsV9CFOEUaG0kGy6YYtq1G1Ca5VSz5szAxG0Zkiqdl8-IWc$ . A simple test > >>>>>>>> shows performance loss of 400 to 500% due to a failure to use > >>>>>>>> nontemporal stores. These performance losses are most likely to occur > >>>>>>>> when the system load is heaviest and good performance is critical. > >>>>>>>> > >>>>>>>> The tunable x86_non_temporal_threshold can be used to override the > >>>>>>>> default for the knowledgable user who really wants maximum cache > >>>>>>>> allocation to a single thread in a multi-threaded system. > >>>>>>>> The manual entry for the tunable has been expanded to provide > >>>>>>>> more information about its purpose. > >>>>>>>> > >>>>>>>> modified: sysdeps/x86/cacheinfo.c > >>>>>>>> modified: manual/tunables.texi > >>>>>>>> --- > >>>>>>>> manual/tunables.texi | 6 +++++- > >>>>>>>> sysdeps/x86/cacheinfo.c | 12 +++++++----- > >>>>>>>> 2 files changed, 12 insertions(+), 6 deletions(-) > >>>>>>>> > >>>>>>>> diff --git a/manual/tunables.texi b/manual/tunables.texi > >>>>>>>> index b6bb54d..94d4fbd 100644 > >>>>>>>> --- a/manual/tunables.texi > >>>>>>>> +++ b/manual/tunables.texi > >>>>>>>> @@ -364,7 +364,11 @@ set shared cache size in bytes for use in memory and string routines. > >>>>>>>> > >>>>>>>> @deftp Tunable glibc.tune.x86_non_temporal_threshold > >>>>>>>> The @code{glibc.tune.x86_non_temporal_threshold} tunable allows the user > >>>>>>>> -to set threshold in bytes for non temporal store. 
> >>>>>>>> +to set threshold in bytes for non temporal store. Non temporal stores > >>>>>>>> +give a hint to the hardware to move data directly to memory without > >>>>>>>> +displacing other data from the cache. This tunable is used by some > >>>>>>>> +platforms to determine when to use non temporal stores in operations > >>>>>>>> +like memmove and memcpy. > >>>>>>>> > >>>>>>>> This tunable is specific to i386 and x86-64. > >>>>>>>> @end deftp > >>>>>>>> diff --git a/sysdeps/x86/cacheinfo.c b/sysdeps/x86/cacheinfo.c > >>>>>>>> index b9444dd..c6767d9 100644 > >>>>>>>> --- a/sysdeps/x86/cacheinfo.c > >>>>>>>> +++ b/sysdeps/x86/cacheinfo.c > >>>>>>>> @@ -778,14 +778,16 @@ intel_bug_no_cache_info: > >>>>>>>> __x86_shared_cache_size = shared; > >>>>>>>> } > >>>>>>>> > >>>>>>>> - /* The large memcpy micro benchmark in glibc shows that 6 times of > >>>>>>>> - shared cache size is the approximate value above which non-temporal > >>>>>>>> - store becomes faster on a 8-core processor. This is the 3/4 of the > >>>>>>>> - total shared cache size. */ > >>>>>>>> + /* The default setting for the non_temporal threshold is 3/4 > >>>>>>>> + of one thread's share of the chip's cache. While higher > >>>>>>>> + single thread performance may be observed with a higher > >>>>>>>> + threshold, having a single thread use more than it's share > >>>>>>>> + of the cache will negatively impact the performance of > >>>>>>>> + other threads running on the chip. */ > >>>>>>>> __x86_shared_non_temporal_threshold > >>>>>>>> = (cpu_features->non_temporal_threshold != 0 > >>>>>>>> ? cpu_features->non_temporal_threshold > >>>>>>>> - : __x86_shared_cache_size * threads * 3 / 4); > >>>>>>>> + : __x86_shared_cache_size * 3 / 4); > >>>>>>>> } > >>>>>>>> > >>>>>>> Can we tune it with the number of threads and/or total cache > >>>>>>> size? > >>>>>>> > >>>>>> When you say "total cache size", is that different from > >>>>>> shared_cache_size * threads? 
> >>>>>> > >>>>>> I see a fundamental conflict of optimization goals: > >>>>>> 1) Provide best single thread performance (current code) > >>>>>> 2) Provide best overall system performance under full load (proposed patch) > >>>>>> I don't know of any way to have default behavior meet both goals without > >>>>>> knowledge > >>>>>> of the system size/usage/requirements. > >>>>>> > >>>>>> Consider a hypothetical single chip system with 64 threads and 128 MB of > >>>>>> total cache on the chip. > >>>>>> That won't be uncommon in the coming years on server class systems, > >>>>>> especially > >>>>>> in large databases or HPC environments (think vision processing or > >>>>>> weather modeling for example). > >>>>>> If a single app owns the whole chip and is running a multi-threaded > >>>>>> application but needs > >>>>>> to memcpy a really large block of data when one phase of computation > >>>>>> finished > >>>>>> before moving to the next phase. A common practice would be to have 64 > >>>>>> parallel calls > >>>>>> to memcpy. The Stream benchmark demonstrates with OpenMP that current > >>>>>> compilers > >>>>>> handle that with no trouble. > >>>>>> > >>>>>> In the example, the per thread share of the cache is 2 MB and the > >>>>>> proposed formula will set > >>>>>> the threshold at 1.5 Mbytes. If the total copy size is 96 Mbytes or > >>>>>> less, all threads comfortably > >>>>>> fit in cache. If the total copy size is over that, then non-temporal > >>>>>> stores are used and all is well there too. > >>>>>> > >>>>>> The current formula would set the threshold at 96 Mbytes for each > >>>>>> thread. Only when the total > >>>>>> copy size was 64*96 Mbytes = 6 GBytes would non-temporal stores be used. > >>>>>> We'd like > >>>>>> to switch to non-temporal stores much sooner as we will be thrashing all > >>>>>> the threads caches. 
> >>>>>> > >>>>>> In practical terms, I've had access to typical memcpy copy lengths for a > >>>>>> variety of commerical > >>>>>> applications while studying memcpy on Solaris over the years. The vast > >>>>>> majority of copies > >>>>>> are for 64Kbytes or less. Most modern chips have much more than 64Kbytes > >>>>>> of cache > >>>>>> per thread, allowing in-cache copies for the common case, even without > >>>>>> borrowing > >>>>>> cache from other threads. The occasional really large copies tend to be > >>>>>> when an application > >>>>>> is passing a block of data to prepare for a new phase of computation or > >>>>>> as a shared memory > >>>>>> communication to another thread. In these cases, having the data remain > >>>>>> in cache is usually > >>>>>> not relevant and using non-temporal stores even when they are not > >>>>>> strictly required does > >>>>>> not have a negative affect on performance. > >>>>>> > >>>>>> A downside of tuning for a single thread comes in cloud computing > >>>>>> environments, where > >>>>>> having neighboring threads being cache hogs, even if relatively isolated > >>>>>> in virtual machines, > >>>>>> is a "bad thing" for having stable system performance. Whatever we can > >>>>>> do to provide consistent, > >>>>>> reasonable performance whatever the neighboring threads might be doing > >>>>>> is a "good thing". > >>>>>> > >>>>> Have you tried the full __x86_shared_cache_size instead of 3 / 4? > >>>>> > >>>> I have not tested larger thresholds. I'd be more comfortable with a > >>>> smaller one. > >>>> We could construct specific tests to show either advantage or disadvantage > >>>> to shifting from 3/4 to all of cache depending on what data access was used > >>>> between memcpy operations. > >>>> > >>>> I consider pushing the limit on cache usage to be a risky approach. Few > >>>> applications > >>>> only work on a single block of data. 
If all threads are doing a shared > >>>> copy and > >>>> they use all the available cache, then after the memcpy returns, any other > >>>> active data would have been pushed out of the cache. That's likely to cost > >>>> severe performance loss in more cases than the modest performance gains for > >>>> a few cases where the application only is concerned with using the data that > >>>> was just copied. > >>>> > >>>> Just to give a more detailed example where large copies are not followed > >>>> by using > >>>> the data. Consider garbage collection followed by compression. With a > >>>> multi-age > >>>> garbage collector, stable data that is active and survived several > >>>> garbage collections > >>>> is in a 'old' region. It does not need to be copied. The current 'new' > >>>> region is full > >>>> but has both referenced and unreferenced data. After the marking phase, > >>>> the individual elements of the referenced data is copied to the base of > >>>> the 'new' region. > >>>> When complete, the rest of the 'new' region becomes the new free pool. > >>>> The total amount copied may far exceed the processor cache. Then the > >>>> application > >>>> exits garbage collection and resumes active use of mostly the stable > >>>> data with > >>>> some accesses to the just moved new data and fresh allocations. If we > >>>> under-use > >>>> non-temporal stores, we clear the cache and the whole application runs > >>>> slower > >>>> than otherwise. > >>>> > >>>> Individual memcpy benchmarks are useful in isolation testing and comparing > >>>> code patterns but can mislead about overall application performance in the > >>>> context of potential for cache abuse. I fell into that tarpit once while > >>>> tuning > >>>> memcpy for Solaris and finding my new, wonderfully fast copy code (ok, maybe > >>>> 5% faster for in-cache data) caused a major customer application to run > >>>> slower > >>>> because my new code abused the cache. 
I modified my code to only use the > >>>> new "in-cache fast copy" for copies less than a threshold (64Kbytes or > >>>> 128Kbytes if I remember right) and all was well. > >>>> > >>> The new threshold can be substantially smaller with large core count. > >>> Are you saying that even 3 / 4 may be too big? Is there a reasonable > >>> fixed threshold? > >>> > >> I don't have any evidence to say 3/4 is too big for typical applications > >> and environments. > >> In 2012, the default for memcpy was set to 1/2 the shared_cache_size > >> which is what is > >> the current default for Oracle el7 and Red Hat el7. > >> > >> Given the typically larger sized caches/thread today than 8 years, 3/4 > >> may work out well > >> since the remaining 1/4 of today's larger cache is often greater than > >> 1/2 of yesteryear's smaller cache. > >> > > Please update the comment with your rationale for 3/4. Don't use > > today or current. Use 2020 instead. > > > > Thanks. > > > I'm unsure about what needs to change in the comment which does not mention > any dates currently. I'm assuming you are referring to the following > comment in cacheinfo.c > > /* The default setting for the non_temporal threshold is 3/4 > of one thread's share of the chip's cache. While higher > single thread performance may be observed with a higher > threshold, having a single thread use more than it's share > of the cache will negatively impact the performance of > other threads running on the chip. */ > > While I could add a comment on why 3/4 vs 1/2 is the best choice, I > don't have hard > data to back it up. I'd be comfortable with either 3/4 or 1/2. I > selected 3/4 as it > was closer to the formula you chose in 2017 instead of the formula you > chose in 2012. The comment is for readers 5 years from now who may be wondering where 3/4 came from. Just add something close to what you have said above.
On 9/24/2020 6:57 PM, H.J. Lu wrote: > On Thu, Sep 24, 2020 at 4:22 PM Patrick McGehearty > <patrick.mcgehearty@oracle.com> wrote: >> >> >> On 9/24/2020 4:54 PM, H.J. Lu wrote: >>> On Thu, Sep 24, 2020 at 2:49 PM Patrick McGehearty >>> <patrick.mcgehearty@oracle.com> wrote: >>>> >>>> On 9/23/2020 6:13 PM, H.J. Lu wrote: >>>>> On Wed, Sep 23, 2020 at 3:39 PM Patrick McGehearty >>>>> <patrick.mcgehearty@oracle.com> wrote: >>>>>> On 9/23/2020 4:37 PM, H.J. Lu wrote: >>>>>>> On Wed, Sep 23, 2020 at 1:57 PM Patrick McGehearty >>>>>>> <patrick.mcgehearty@oracle.com> wrote: >>>>>>>> On 9/23/2020 3:23 PM, H.J. Lu wrote: >>>>>>>>> On Wed, Sep 23, 2020 at 1:10 PM Patrick McGehearty via Libc-alpha >>>>>>>>> <libc-alpha@sourceware.org> wrote: >>>>>>>>>> The __x86_shared_non_temporal_threshold determines when memcpy on x86 >>>>>>>>>> uses non_temporal stores to avoid pushing other data out of the last >>>>>>>>>> level cache. >>>>>>>>>> >>>>>>>>>> This patch proposes to revert the calculation change made by H.J. Lu's >>>>>>>>>> patch of June 2, 2017. >>>>>>>>>> >>>>>>>>>> H.J. Lu's patch selected a threshold suitable for a single thread >>>>>>>>>> getting maximum performance. It was tuned using the single threaded >>>>>>>>>> large memcpy micro benchmark on an 8 core processor. The last change >>>>>>>>>> changes the threshold from using 3/4 of one thread's share of the >>>>>>>>>> cache to using 3/4 of the entire cache of a multi-threaded system >>>>>>>>>> before switching to non-temporal stores. Multi-threaded systems with >>>>>>>>>> more than a few threads are server-class and typically have many >>>>>>>>>> active threads. If one thread consumes 3/4 of the available cache for >>>>>>>>>> all threads, it will cause other active threads to have data removed >>>>>>>>>> from the cache. Two examples show the range of the effect. 
John >>>>>>>>>> McCalpin's widely parallel Stream benchmark, which runs in parallel >>>>>>>>>> and fetches data sequentially, saw a 20% slowdown with this patch on >>>>>>>>>> an internal system test of 128 threads. This regression was discovered >>>>>>>>>> when comparing OL8 performance to OL7. An example that compares >>>>>>>>>> normal stores to non-temporal stores may be found at >>>>>>>>>> https://urldefense.com/v3/__https://vgatherps.github.io/2018-09-02-nontemporal/__;!!GqivPVa7Brio!IK1RH6wG0bg4U3NNMDpXf50VgsV9CFOEUaG0kGy6YYtq1G1Ca5VSz5szAxG0Zkiqdl8-IWc$ . A simple test >>>>>>>>>> shows performance loss of 400 to 500% due to a failure to use >>>>>>>>>> nontemporal stores. These performance losses are most likely to occur >>>>>>>>>> when the system load is heaviest and good performance is critical. >>>>>>>>>> >>>>>>>>>> The tunable x86_non_temporal_threshold can be used to override the >>>>>>>>>> default for the knowledgable user who really wants maximum cache >>>>>>>>>> allocation to a single thread in a multi-threaded system. >>>>>>>>>> The manual entry for the tunable has been expanded to provide >>>>>>>>>> more information about its purpose. >>>>>>>>>> >>>>>>>>>> modified: sysdeps/x86/cacheinfo.c >>>>>>>>>> modified: manual/tunables.texi >>>>>>>>>> --- >>>>>>>>>> manual/tunables.texi | 6 +++++- >>>>>>>>>> sysdeps/x86/cacheinfo.c | 12 +++++++----- >>>>>>>>>> 2 files changed, 12 insertions(+), 6 deletions(-) >>>>>>>>>> >>>>>>>>>> diff --git a/manual/tunables.texi b/manual/tunables.texi >>>>>>>>>> index b6bb54d..94d4fbd 100644 >>>>>>>>>> --- a/manual/tunables.texi >>>>>>>>>> +++ b/manual/tunables.texi >>>>>>>>>> @@ -364,7 +364,11 @@ set shared cache size in bytes for use in memory and string routines. >>>>>>>>>> >>>>>>>>>> @deftp Tunable glibc.tune.x86_non_temporal_threshold >>>>>>>>>> The @code{glibc.tune.x86_non_temporal_threshold} tunable allows the user >>>>>>>>>> -to set threshold in bytes for non temporal store. 
>>>>>>>>>> +to set threshold in bytes for non temporal store. Non temporal stores >>>>>>>>>> +give a hint to the hardware to move data directly to memory without >>>>>>>>>> +displacing other data from the cache. This tunable is used by some >>>>>>>>>> +platforms to determine when to use non temporal stores in operations >>>>>>>>>> +like memmove and memcpy. >>>>>>>>>> >>>>>>>>>> This tunable is specific to i386 and x86-64. >>>>>>>>>> @end deftp >>>>>>>>>> diff --git a/sysdeps/x86/cacheinfo.c b/sysdeps/x86/cacheinfo.c >>>>>>>>>> index b9444dd..c6767d9 100644 >>>>>>>>>> --- a/sysdeps/x86/cacheinfo.c >>>>>>>>>> +++ b/sysdeps/x86/cacheinfo.c >>>>>>>>>> @@ -778,14 +778,16 @@ intel_bug_no_cache_info: >>>>>>>>>> __x86_shared_cache_size = shared; >>>>>>>>>> } >>>>>>>>>> >>>>>>>>>> - /* The large memcpy micro benchmark in glibc shows that 6 times of >>>>>>>>>> - shared cache size is the approximate value above which non-temporal >>>>>>>>>> - store becomes faster on a 8-core processor. This is the 3/4 of the >>>>>>>>>> - total shared cache size. */ >>>>>>>>>> + /* The default setting for the non_temporal threshold is 3/4 >>>>>>>>>> + of one thread's share of the chip's cache. While higher >>>>>>>>>> + single thread performance may be observed with a higher >>>>>>>>>> + threshold, having a single thread use more than it's share >>>>>>>>>> + of the cache will negatively impact the performance of >>>>>>>>>> + other threads running on the chip. */ >>>>>>>>>> __x86_shared_non_temporal_threshold >>>>>>>>>> = (cpu_features->non_temporal_threshold != 0 >>>>>>>>>> ? cpu_features->non_temporal_threshold >>>>>>>>>> - : __x86_shared_cache_size * threads * 3 / 4); >>>>>>>>>> + : __x86_shared_cache_size * 3 / 4); >>>>>>>>>> } >>>>>>>>>> >>>>>>>>> Can we tune it with the number of threads and/or total cache >>>>>>>>> size? >>>>>>>>> >>>>>>>> When you say "total cache size", is that different from >>>>>>>> shared_cache_size * threads? 
>>>>>>>> >>>>>>>> I see a fundamental conflict of optimization goals: >>>>>>>> 1) Provide best single thread performance (current code) >>>>>>>> 2) Provide best overall system performance under full load (proposed patch) >>>>>>>> I don't know of any way to have default behavior meet both goals without >>>>>>>> knowledge >>>>>>>> of the system size/usage/requirements. >>>>>>>> >>>>>>>> Consider a hypothetical single chip system with 64 threads and 128 MB of >>>>>>>> total cache on the chip. >>>>>>>> That won't be uncommon in the coming years on server class systems, >>>>>>>> especially >>>>>>>> in large databases or HPC environments (think vision processing or >>>>>>>> weather modeling for example). >>>>>>>> If a single app owns the whole chip and is running a multi-threaded >>>>>>>> application but needs >>>>>>>> to memcpy a really large block of data when one phase of computation >>>>>>>> finished >>>>>>>> before moving to the next phase. A common practice would be to have 64 >>>>>>>> parallel calls >>>>>>>> to memcpy. The Stream benchmark demonstrates with OpenMP that current >>>>>>>> compilers >>>>>>>> handle that with no trouble. >>>>>>>> >>>>>>>> In the example, the per thread share of the cache is 2 MB and the >>>>>>>> proposed formula will set >>>>>>>> the threshold at 1.5 Mbytes. If the total copy size is 96 Mbytes or >>>>>>>> less, all threads comfortably >>>>>>>> fit in cache. If the total copy size is over that, then non-temporal >>>>>>>> stores are used and all is well there too. >>>>>>>> >>>>>>>> The current formula would set the threshold at 96 Mbytes for each >>>>>>>> thread. Only when the total >>>>>>>> copy size was 64*96 Mbytes = 6 GBytes would non-temporal stores be used. >>>>>>>> We'd like >>>>>>>> to switch to non-temporal stores much sooner as we will be thrashing all >>>>>>>> the threads caches. 
>>>>>>>> >>>>>>>> In practical terms, I've had access to typical memcpy copy lengths for a >>>>>>>> variety of commerical >>>>>>>> applications while studying memcpy on Solaris over the years. The vast >>>>>>>> majority of copies >>>>>>>> are for 64Kbytes or less. Most modern chips have much more than 64Kbytes >>>>>>>> of cache >>>>>>>> per thread, allowing in-cache copies for the common case, even without >>>>>>>> borrowing >>>>>>>> cache from other threads. The occasional really large copies tend to be >>>>>>>> when an application >>>>>>>> is passing a block of data to prepare for a new phase of computation or >>>>>>>> as a shared memory >>>>>>>> communication to another thread. In these cases, having the data remain >>>>>>>> in cache is usually >>>>>>>> not relevant and using non-temporal stores even when they are not >>>>>>>> strictly required does >>>>>>>> not have a negative affect on performance. >>>>>>>> >>>>>>>> A downside of tuning for a single thread comes in cloud computing >>>>>>>> environments, where >>>>>>>> having neighboring threads being cache hogs, even if relatively isolated >>>>>>>> in virtual machines, >>>>>>>> is a "bad thing" for having stable system performance. Whatever we can >>>>>>>> do to provide consistent, >>>>>>>> reasonable performance whatever the neighboring threads might be doing >>>>>>>> is a "good thing". >>>>>>>> >>>>>>> Have you tried the full __x86_shared_cache_size instead of 3 / 4? >>>>>>> >>>>>> I have not tested larger thresholds. I'd be more comfortable with a >>>>>> smaller one. >>>>>> We could construct specific tests to show either advantage or disadvantage >>>>>> to shifting from 3/4 to all of cache depending on what data access was used >>>>>> between memcpy operations. >>>>>> >>>>>> I consider pushing the limit on cache usage to be a risky approach. Few >>>>>> applications >>>>>> only work on a single block of data. 
If all threads are doing a shared >>>>>> copy and >>>>>> they use all the available cache, then after the memcpy returns, any other >>>>>> active data would have been pushed out of the cache. That's likely to cost >>>>>> severe performance loss in more cases than the modest performance gains for >>>>>> a few cases where the application only is concerned with using the data that >>>>>> was just copied. >>>>>> >>>>>> Just to give a more detailed example where large copies are not followed >>>>>> by using >>>>>> the data. Consider garbage collection followed by compression. With a >>>>>> multi-age >>>>>> garbage collector, stable data that is active and survived several >>>>>> garbage collections >>>>>> is in a 'old' region. It does not need to be copied. The current 'new' >>>>>> region is full >>>>>> but has both referenced and unreferenced data. After the marking phase, >>>>>> the individual elements of the referenced data is copied to the base of >>>>>> the 'new' region. >>>>>> When complete, the rest of the 'new' region becomes the new free pool. >>>>>> The total amount copied may far exceed the processor cache. Then the >>>>>> application >>>>>> exits garbage collection and resumes active use of mostly the stable >>>>>> data with >>>>>> some accesses to the just moved new data and fresh allocations. If we >>>>>> under-use >>>>>> non-temporal stores, we clear the cache and the whole application runs >>>>>> slower >>>>>> than otherwise. >>>>>> >>>>>> Individual memcpy benchmarks are useful in isolation testing and comparing >>>>>> code patterns but can mislead about overall application performance in the >>>>>> context of potential for cache abuse. I fell into that tarpit once while >>>>>> tuning >>>>>> memcpy for Solaris and finding my new, wonderfully fast copy code (ok, maybe >>>>>> 5% faster for in-cache data) caused a major customer application to run >>>>>> slower >>>>>> because my new code abused the cache. 
I modified my code to only use the
>>>>>> new "in-cache fast copy" for copies less than a threshold (64Kbytes or
>>>>>> 128Kbytes if I remember right) and all was well.
>>>>>>
>>>>> The new threshold can be substantially smaller with large core count.
>>>>> Are you saying that even 3 / 4 may be too big? Is there a reasonable
>>>>> fixed threshold?
>>>>>
>>>> I don't have any evidence to say 3/4 is too big for typical applications
>>>> and environments.
>>>> In 2012, the default for memcpy was set to 1/2 the shared_cache_size
>>>> which is what is
>>>> the current default for Oracle el7 and Red Hat el7.
>>>>
>>>> Given the typically larger sized caches/thread today than 8 years, 3/4
>>>> may work out well
>>>> since the remaining 1/4 of today's larger cache is often greater than
>>>> 1/2 of yesteryear's smaller cache.
>>>>
>>> Please update the comment with your rationale for 3/4. Don't use
>>> today or current. Use 2020 instead.
>>>
>>> Thanks.
>>>
>> I'm unsure about what needs to change in the comment which does not mention
>> any dates currently. I'm assuming you are referring to the following
>> comment in cacheinfo.c
>>
>> /* The default setting for the non_temporal threshold is 3/4
>> of one thread's share of the chip's cache. While higher
>> single thread performance may be observed with a higher
>> threshold, having a single thread use more than it's share
>> of the cache will negatively impact the performance of
>> other threads running on the chip. */
>>
>> While I could add a comment on why 3/4 vs 1/2 is the best choice, I
>> don't have hard
>> data to back it up. I'd be comfortable with either 3/4 or 1/2. I
>> selected 3/4 as it
>> was closer to the formula you chose in 2017 instead of the formula you
>> chose in 2012.
> The comment is for readers 5 years from now who may be wondering
> where 3/4 came from. Just add something close to what you have said above.
>
Before I redo the commit and resubmit the whole patch, I thought I'd present
a revised comment for review. The value of 500KB to 2MB/thread is based
on a quick review of the Wikipedia entries for Intel and AMD processors
released since 2017. There may be a few outliers, but the vast majority
fit that range for L3/thread. I tried to give a sense of the situation
without diving too deeply into application-specific details.

Comment in v2:
/* The default setting for the non_temporal threshold is 3/4
   of one thread's share of the chip's cache. While higher
   single thread performance may be observed with a higher
   threshold, having a single thread use more than it's share
   of the cache will negatively impact the performance of
   other threads running on the chip. */

Proposed comment for v3:
/* The default setting for the non_temporal threshold is 3/4 of one
   thread's share of the chip's cache. For most Intel and AMD processors
   with an initial release date between 2017 and 2020, a thread's typical
   share of the cache is from 500 KBytes to 2 MBytes. Using the 3/4
   threshold leaves 125 KBytes to 500 KBytes of the thread's data
   in cache after a maximum temporal copy, which will maintain
   in cache a reasonable portion of the thread's stack and other
   active data. If the threshold is set higher than one thread's
   share of the cache, it has a substantial risk of negatively
   impacting the performance of other threads running on the chip. */
On Fri, Sep 25, 2020 at 1:53 PM Patrick McGehearty <patrick.mcgehearty@oracle.com> wrote: > > > > On 9/24/2020 6:57 PM, H.J. Lu wrote: > > On Thu, Sep 24, 2020 at 4:22 PM Patrick McGehearty > > <patrick.mcgehearty@oracle.com> wrote: > >> > >> > >> On 9/24/2020 4:54 PM, H.J. Lu wrote: > >>> On Thu, Sep 24, 2020 at 2:49 PM Patrick McGehearty > >>> <patrick.mcgehearty@oracle.com> wrote: > >>>> > >>>> On 9/23/2020 6:13 PM, H.J. Lu wrote: > >>>>> On Wed, Sep 23, 2020 at 3:39 PM Patrick McGehearty > >>>>> <patrick.mcgehearty@oracle.com> wrote: > >>>>>> On 9/23/2020 4:37 PM, H.J. Lu wrote: > >>>>>>> On Wed, Sep 23, 2020 at 1:57 PM Patrick McGehearty > >>>>>>> <patrick.mcgehearty@oracle.com> wrote: > >>>>>>>> On 9/23/2020 3:23 PM, H.J. Lu wrote: > >>>>>>>>> On Wed, Sep 23, 2020 at 1:10 PM Patrick McGehearty via Libc-alpha > >>>>>>>>> <libc-alpha@sourceware.org> wrote: > >>>>>>>>>> The __x86_shared_non_temporal_threshold determines when memcpy on x86 > >>>>>>>>>> uses non_temporal stores to avoid pushing other data out of the last > >>>>>>>>>> level cache. > >>>>>>>>>> > >>>>>>>>>> This patch proposes to revert the calculation change made by H.J. Lu's > >>>>>>>>>> patch of June 2, 2017. > >>>>>>>>>> > >>>>>>>>>> H.J. Lu's patch selected a threshold suitable for a single thread > >>>>>>>>>> getting maximum performance. It was tuned using the single threaded > >>>>>>>>>> large memcpy micro benchmark on an 8 core processor. The last change > >>>>>>>>>> changes the threshold from using 3/4 of one thread's share of the > >>>>>>>>>> cache to using 3/4 of the entire cache of a multi-threaded system > >>>>>>>>>> before switching to non-temporal stores. Multi-threaded systems with > >>>>>>>>>> more than a few threads are server-class and typically have many > >>>>>>>>>> active threads. If one thread consumes 3/4 of the available cache for > >>>>>>>>>> all threads, it will cause other active threads to have data removed > >>>>>>>>>> from the cache. 
Two examples show the range of the effect. John
> >>>>>>>>>> McCalpin's widely parallel Stream benchmark, which runs in parallel
> >>>>>>>>>> and fetches data sequentially, saw a 20% slowdown with this patch on
> >>>>>>>>>> an internal system test of 128 threads. This regression was discovered
> >>>>>>>>>> when comparing OL8 performance to OL7. An example that compares
> >>>>>>>>>> normal stores to non-temporal stores may be found at
> >>>>>>>>>> https://vgatherps.github.io/2018-09-02-nontemporal/. A simple test
> >>>>>>>>>> shows performance loss of 400 to 500% due to a failure to use
> >>>>>>>>>> nontemporal stores. These performance losses are most likely to occur
> >>>>>>>>>> when the system load is heaviest and good performance is critical.
> >>>>>>>>>>
> >>>>>>>>>> The tunable x86_non_temporal_threshold can be used to override the
> >>>>>>>>>> default for the knowledgable user who really wants maximum cache
> >>>>>>>>>> allocation to a single thread in a multi-threaded system.
> >>>>>>>>>> The manual entry for the tunable has been expanded to provide
> >>>>>>>>>> more information about its purpose.
> >>>>>>>>>>
> >>>>>>>>>> modified: sysdeps/x86/cacheinfo.c
> >>>>>>>>>> modified: manual/tunables.texi
> >>>>>>>>>> ---
> >>>>>>>>>> manual/tunables.texi | 6 +++++-
> >>>>>>>>>> sysdeps/x86/cacheinfo.c | 12 +++++++-----
> >>>>>>>>>> 2 files changed, 12 insertions(+), 6 deletions(-)
> >>>>>>>>>>
> >>>>>>>>>> diff --git a/manual/tunables.texi b/manual/tunables.texi
> >>>>>>>>>> index b6bb54d..94d4fbd 100644
> >>>>>>>>>> --- a/manual/tunables.texi
> >>>>>>>>>> +++ b/manual/tunables.texi
> >>>>>>>>>> @@ -364,7 +364,11 @@ set shared cache size in bytes for use in memory and string routines.
> >>>>>>>>>> > >>>>>>>>>> @deftp Tunable glibc.tune.x86_non_temporal_threshold > >>>>>>>>>> The @code{glibc.tune.x86_non_temporal_threshold} tunable allows the user > >>>>>>>>>> -to set threshold in bytes for non temporal store. > >>>>>>>>>> +to set threshold in bytes for non temporal store. Non temporal stores > >>>>>>>>>> +give a hint to the hardware to move data directly to memory without > >>>>>>>>>> +displacing other data from the cache. This tunable is used by some > >>>>>>>>>> +platforms to determine when to use non temporal stores in operations > >>>>>>>>>> +like memmove and memcpy. > >>>>>>>>>> > >>>>>>>>>> This tunable is specific to i386 and x86-64. > >>>>>>>>>> @end deftp > >>>>>>>>>> diff --git a/sysdeps/x86/cacheinfo.c b/sysdeps/x86/cacheinfo.c > >>>>>>>>>> index b9444dd..c6767d9 100644 > >>>>>>>>>> --- a/sysdeps/x86/cacheinfo.c > >>>>>>>>>> +++ b/sysdeps/x86/cacheinfo.c > >>>>>>>>>> @@ -778,14 +778,16 @@ intel_bug_no_cache_info: > >>>>>>>>>> __x86_shared_cache_size = shared; > >>>>>>>>>> } > >>>>>>>>>> > >>>>>>>>>> - /* The large memcpy micro benchmark in glibc shows that 6 times of > >>>>>>>>>> - shared cache size is the approximate value above which non-temporal > >>>>>>>>>> - store becomes faster on a 8-core processor. This is the 3/4 of the > >>>>>>>>>> - total shared cache size. */ > >>>>>>>>>> + /* The default setting for the non_temporal threshold is 3/4 > >>>>>>>>>> + of one thread's share of the chip's cache. While higher > >>>>>>>>>> + single thread performance may be observed with a higher > >>>>>>>>>> + threshold, having a single thread use more than it's share > >>>>>>>>>> + of the cache will negatively impact the performance of > >>>>>>>>>> + other threads running on the chip. */ > >>>>>>>>>> __x86_shared_non_temporal_threshold > >>>>>>>>>> = (cpu_features->non_temporal_threshold != 0 > >>>>>>>>>> ? 
cpu_features->non_temporal_threshold > >>>>>>>>>> - : __x86_shared_cache_size * threads * 3 / 4); > >>>>>>>>>> + : __x86_shared_cache_size * 3 / 4); > >>>>>>>>>> } > >>>>>>>>>> > >>>>>>>>> Can we tune it with the number of threads and/or total cache > >>>>>>>>> size? > >>>>>>>>> > >>>>>>>> When you say "total cache size", is that different from > >>>>>>>> shared_cache_size * threads? > >>>>>>>> > >>>>>>>> I see a fundamental conflict of optimization goals: > >>>>>>>> 1) Provide best single thread performance (current code) > >>>>>>>> 2) Provide best overall system performance under full load (proposed patch) > >>>>>>>> I don't know of any way to have default behavior meet both goals without > >>>>>>>> knowledge > >>>>>>>> of the system size/usage/requirements. > >>>>>>>> > >>>>>>>> Consider a hypothetical single chip system with 64 threads and 128 MB of > >>>>>>>> total cache on the chip. > >>>>>>>> That won't be uncommon in the coming years on server class systems, > >>>>>>>> especially > >>>>>>>> in large databases or HPC environments (think vision processing or > >>>>>>>> weather modeling for example). > >>>>>>>> If a single app owns the whole chip and is running a multi-threaded > >>>>>>>> application but needs > >>>>>>>> to memcpy a really large block of data when one phase of computation > >>>>>>>> finished > >>>>>>>> before moving to the next phase. A common practice would be to have 64 > >>>>>>>> parallel calls > >>>>>>>> to memcpy. The Stream benchmark demonstrates with OpenMP that current > >>>>>>>> compilers > >>>>>>>> handle that with no trouble. > >>>>>>>> > >>>>>>>> In the example, the per thread share of the cache is 2 MB and the > >>>>>>>> proposed formula will set > >>>>>>>> the threshold at 1.5 Mbytes. If the total copy size is 96 Mbytes or > >>>>>>>> less, all threads comfortably > >>>>>>>> fit in cache. If the total copy size is over that, then non-temporal > >>>>>>>> stores are used and all is well there too. 
> >>>>>>>> > >>>>>>>> The current formula would set the threshold at 96 Mbytes for each > >>>>>>>> thread. Only when the total > >>>>>>>> copy size was 64*96 Mbytes = 6 GBytes would non-temporal stores be used. > >>>>>>>> We'd like > >>>>>>>> to switch to non-temporal stores much sooner as we will be thrashing all > >>>>>>>> the threads caches. > >>>>>>>> > >>>>>>>> In practical terms, I've had access to typical memcpy copy lengths for a > >>>>>>>> variety of commerical > >>>>>>>> applications while studying memcpy on Solaris over the years. The vast > >>>>>>>> majority of copies > >>>>>>>> are for 64Kbytes or less. Most modern chips have much more than 64Kbytes > >>>>>>>> of cache > >>>>>>>> per thread, allowing in-cache copies for the common case, even without > >>>>>>>> borrowing > >>>>>>>> cache from other threads. The occasional really large copies tend to be > >>>>>>>> when an application > >>>>>>>> is passing a block of data to prepare for a new phase of computation or > >>>>>>>> as a shared memory > >>>>>>>> communication to another thread. In these cases, having the data remain > >>>>>>>> in cache is usually > >>>>>>>> not relevant and using non-temporal stores even when they are not > >>>>>>>> strictly required does > >>>>>>>> not have a negative affect on performance. > >>>>>>>> > >>>>>>>> A downside of tuning for a single thread comes in cloud computing > >>>>>>>> environments, where > >>>>>>>> having neighboring threads being cache hogs, even if relatively isolated > >>>>>>>> in virtual machines, > >>>>>>>> is a "bad thing" for having stable system performance. Whatever we can > >>>>>>>> do to provide consistent, > >>>>>>>> reasonable performance whatever the neighboring threads might be doing > >>>>>>>> is a "good thing". > >>>>>>>> > >>>>>>> Have you tried the full __x86_shared_cache_size instead of 3 / 4? > >>>>>>> > >>>>>> I have not tested larger thresholds. I'd be more comfortable with a > >>>>>> smaller one. 
> >>>>>> We could construct specific tests to show either advantage or disadvantage > >>>>>> to shifting from 3/4 to all of cache depending on what data access was used > >>>>>> between memcpy operations. > >>>>>> > >>>>>> I consider pushing the limit on cache usage to be a risky approach. Few > >>>>>> applications > >>>>>> only work on a single block of data. If all threads are doing a shared > >>>>>> copy and > >>>>>> they use all the available cache, then after the memcpy returns, any other > >>>>>> active data would have been pushed out of the cache. That's likely to cost > >>>>>> severe performance loss in more cases than the modest performance gains for > >>>>>> a few cases where the application only is concerned with using the data that > >>>>>> was just copied. > >>>>>> > >>>>>> Just to give a more detailed example where large copies are not followed > >>>>>> by using > >>>>>> the data. Consider garbage collection followed by compression. With a > >>>>>> multi-age > >>>>>> garbage collector, stable data that is active and survived several > >>>>>> garbage collections > >>>>>> is in a 'old' region. It does not need to be copied. The current 'new' > >>>>>> region is full > >>>>>> but has both referenced and unreferenced data. After the marking phase, > >>>>>> the individual elements of the referenced data is copied to the base of > >>>>>> the 'new' region. > >>>>>> When complete, the rest of the 'new' region becomes the new free pool. > >>>>>> The total amount copied may far exceed the processor cache. Then the > >>>>>> application > >>>>>> exits garbage collection and resumes active use of mostly the stable > >>>>>> data with > >>>>>> some accesses to the just moved new data and fresh allocations. If we > >>>>>> under-use > >>>>>> non-temporal stores, we clear the cache and the whole application runs > >>>>>> slower > >>>>>> than otherwise. 
> >>>>>> > >>>>>> Individual memcpy benchmarks are useful in isolation testing and comparing > >>>>>> code patterns but can mislead about overall application performance in the > >>>>>> context of potential for cache abuse. I fell into that tarpit once while > >>>>>> tuning > >>>>>> memcpy for Solaris and finding my new, wonderfully fast copy code (ok, maybe > >>>>>> 5% faster for in-cache data) caused a major customer application to run > >>>>>> slower > >>>>>> because my new code abused the cache. I modified my code to only use the > >>>>>> new "in-cache fast copy" for copies less than a threshold (64Kbytes or > >>>>>> 128Kbytes if I remember right) and all was well. > >>>>>> > >>>>> The new threshold can be substantially smaller with large core count. > >>>>> Are you saying that even 3 / 4 may be too big? Is there a reasonable > >>>>> fixed threshold? > >>>>> > >>>> I don't have any evidence to say 3/4 is too big for typical applications > >>>> and environments. > >>>> In 2012, the default for memcpy was set to 1/2 the shared_cache_size > >>>> which is what is > >>>> the current default for Oracle el7 and Red Hat el7. > >>>> > >>>> Given the typically larger sized caches/thread today than 8 years, 3/4 > >>>> may work out well > >>>> since the remaining 1/4 of today's larger cache is often greater than > >>>> 1/2 of yesteryear's smaller cache. > >>>> > >>> Please update the comment with your rationale for 3/4. Don't use > >>> today or current. Use 2020 instead. > >>> > >>> Thanks. > >>> > >> I'm unsure about what needs to change in the comment which does not mention > >> any dates currently. I'm assuming you are referring to the following > >> comment in cacheinfo.c > >> > >> /* The default setting for the non_temporal threshold is 3/4 > >> of one thread's share of the chip's cache. 
While higher > >> single thread performance may be observed with a higher > >> threshold, having a single thread use more than it's share > >> of the cache will negatively impact the performance of > >> other threads running on the chip. */ > >> > >> While I could add a comment on why 3/4 vs 1/2 is the best choice, I > >> don't have hard > >> data to back it up. I'd be comfortable with either 3/4 or 1/2. I > >> selected 3/4 as it > >> was closer to the formula you chose in 2017 instead of the formula you > >> chose in 2012. > > The comment is for readers 5 years from now who may be wondering > > where 3/4 came from. Just add something close to what you have said above. > > > Before I redo the commit and resubmit the whole patch. I thought I'd present > a revised comment for review.The value of 500KB to 2MB/thread is based > on a quick review of the wikipedia entries for Intel and AMD processors > released since 2017. There may be a few outliers, but the vast majority > fit that range for L3/thread. I tried to balance giving a sense of the > situation without diving too deeply into application specific details. > > > Comment in v2: > /* The default setting for the non_temporal threshold is 3/4 > of one thread's share of the chip's cache. While higher > single thread performance may be observed with a higher > threshold, having a single thread use more than it's share > of the cache will negatively impact the performance of > other threads running on the chip. */ > > Proposed comment for v3: > /* The default setting for the non_temporal threshold is 3/4 of one > thread's share of the chip's cache. For most Intel and AMD processors > with an initial release date between 2017 and 2020,a thread's typical > share ofthe cache is from 500 KBytes to 2 MBytes. Using the 3/4 > threshold leaves 125 KBytes to 500 KBytes of the thread'sdata > in cache after a maximum temporal copy, which will maintain > in cache a reasonable portion of the thread's stack and other > active data. 
If the threshold is set higher than one thread's
> share of the cache, it has a substantial risk of negatively
> impacting the performance of other threads running on the chip. */
>

Comments look good. Please submit the patch with the updated comment.

Thanks.
diff --git a/manual/tunables.texi b/manual/tunables.texi
index b6bb54d..94d4fbd 100644
--- a/manual/tunables.texi
+++ b/manual/tunables.texi
@@ -364,7 +364,11 @@ set shared cache size in bytes for use in memory and string routines.
 
 @deftp Tunable glibc.tune.x86_non_temporal_threshold
 The @code{glibc.tune.x86_non_temporal_threshold} tunable allows the user
-to set threshold in bytes for non temporal store.
+to set threshold in bytes for non temporal store. Non temporal stores
+give a hint to the hardware to move data directly to memory without
+displacing other data from the cache. This tunable is used by some
+platforms to determine when to use non temporal stores in operations
+like memmove and memcpy.
 
 This tunable is specific to i386 and x86-64.
 @end deftp
diff --git a/sysdeps/x86/cacheinfo.c b/sysdeps/x86/cacheinfo.c
index b9444dd..c6767d9 100644
--- a/sysdeps/x86/cacheinfo.c
+++ b/sysdeps/x86/cacheinfo.c
@@ -778,14 +778,16 @@ intel_bug_no_cache_info:
       __x86_shared_cache_size = shared;
     }
 
-  /* The large memcpy micro benchmark in glibc shows that 6 times of
-     shared cache size is the approximate value above which non-temporal
-     store becomes faster on a 8-core processor. This is the 3/4 of the
-     total shared cache size. */
+  /* The default setting for the non_temporal threshold is 3/4
+     of one thread's share of the chip's cache. While higher
+     single thread performance may be observed with a higher
+     threshold, having a single thread use more than it's share
+     of the cache will negatively impact the performance of
+     other threads running on the chip. */
   __x86_shared_non_temporal_threshold
     = (cpu_features->non_temporal_threshold != 0
        ? cpu_features->non_temporal_threshold
-       : __x86_shared_cache_size * threads * 3 / 4);
+       : __x86_shared_cache_size * 3 / 4);
 }
 
 #endif