Message ID: 20240524173851.2483952-1-goldstein.w.n@gmail.com
State: New
Series: [v2,1/2] x86: Improve large memset perf with non-temporal stores [RHEL-29312]
On Fri, May 24, 2024 at 10:38 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>
> Previously we used `rep stosb` for all medium/large memsets. This is
> notably worse than non-temporal stores for large (above a few MBs)
> memsets. See:
> https://docs.google.com/spreadsheets/d/1opzukzvum4n6-RUVHTGddV6RjAEil4P2uMjjQGLbLcU/edit?usp=sharing
> for data using different strategies for large memset on ICX and SKX.
>
> Using non-temporal stores can be up to 3x faster on ICX and 2x faster
> on SKX. Historically, these numbers would not have been so good
> because of the zero-over-zero writeback optimization that `rep stosb`
> is able to do. But the zero-over-zero writeback optimization has been
> removed as a potential side-channel attack, so there is no longer any
> good reason to rely only on `rep stosb` for large memsets. On the flip
> side, non-temporal writes avoid the data read in their RFO requests,
> saving memory bandwidth.
>
> All of the other changes to the file re-organize the code blocks to
> maintain "good" alignment given the new code added in the
> `L(stosb_local)` case.
>
> The results from running the GLIBC memset benchmarks on TGL-client for
> N=20 runs:
>
> Geometric Mean across the suite New / Old EVEX256: 0.979
> Geometric Mean across the suite New / Old EVEX512: 0.979
> Geometric Mean across the suite New / Old AVX2   : 0.986
> Geometric Mean across the suite New / Old SSE2   : 0.979
>
> Most of the cases are essentially unchanged; this is mostly to show
> that adding the non-temporal case didn't add any regressions to the
> other cases.
>
> The results on the memset-large benchmark suite on TGL-client for N=20
> runs:
>
> Geometric Mean across the suite New / Old EVEX256: 0.926
> Geometric Mean across the suite New / Old EVEX512: 0.925
> Geometric Mean across the suite New / Old AVX2   : 0.928
> Geometric Mean across the suite New / Old SSE2   : 0.924
>
> So roughly a 7.5% speedup. This is lower than what we see on servers
> (likely because clients typically have faster single-core bandwidth,
> so saving bandwidth on RFOs is less impactful), but still advantageous.
>
> Full test-suite passes on x86_64 w/ and w/o multiarch.
> ---
>  .../multiarch/memset-vec-unaligned-erms.S | 149 +++++++++++-------
>  1 file changed, 91 insertions(+), 58 deletions(-)
>
> [quoted patch body omitted here; the full diff appears at the end of this page]

LGTM.

Reviewed-by: H.J. Lu <hjl.tools@gmail.com>

Thanks.
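For readers less familiar with the assembly, the dispatch described in steps 5-7 of the updated comment can be modeled roughly in C as below. This is only an illustrative sketch, not the actual implementation: the two threshold values are placeholders for the glibc-internal __x86_rep_stosb_threshold and __x86_shared_non_temporal_threshold, and the non-temporal branch is stubbed out (a sketch of the L(nt_memset) loop itself follows the raw diff at the end of the page).

    /* Hedged C model of the new size-based dispatch; not the real code.  */
    #include <stddef.h>
    #include <string.h>

    /* Placeholder values; in glibc these come from the x86 cpu-features
       code and can be overridden with tunables.  */
    static const size_t rep_stosb_threshold = 2048;           /* assumed */
    static const size_t non_temporal_threshold = 64UL << 20;  /* assumed */

    static void *memset_dispatch_model(void *dst, int c, size_t n)
    {
        if (n < rep_stosb_threshold) {
            /* Steps 1-5: masked/1x/2x/4x VEC stores and the 4x VEC loop,
               modeled here by plain memset.  */
            return memset(dst, c, n);
        }
        if (n < non_temporal_threshold) {
            /* Step 6: medium/large sizes on ERMS machines use rep stosb.
               Assumes the direction flag is clear (System V ABI default).  */
            void *ret = dst;
            size_t count = n;
            __asm__ volatile ("rep stosb"
                              : "+D"(dst), "+c"(count)
                              : "a"((unsigned char) c)
                              : "memory");
            return ret;
        }
        /* Step 7: sizes at or above the non-temporal threshold use movnt
           stores, which avoid the RFO read traffic; see the L(nt_memset)
           sketch after the raw diff.  Stubbed with memset here.  */
        return memset(dst, c, n);
    }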
On Fri, Jul 12, 2024 at 4:56 AM DJ Delorie <dj@redhat.com> wrote:
>
> Noah Goldstein <goldstein.w.n@gmail.com> writes:
> > Using non-temporal stores can be up to 3x faster on ICX and 2x faster
> > on SKX. [...]
>
> I'm actually working on RHEL-29312 that you reference in the subject
> (thanks for the reference, but please don't reference downstream tickets
> in upstream patches). In that ticket was shown a 50% slowdown in memset
> for the Xeon 4215. I've reproduced that slowdown locally, but this
> patch doesn't seem to have any effect on it (current git glibc still has
> the slow code). I can get the faster results with tunables[1]. Will
> there be further tuning of the default thresholds? Or do we expect
> users to use tunables to get around these regressions, for the specific
> chips that are not optimized?
>
> [1] specifically, libc.cpu.x86_rep_stosb_threshold=0x1000000000 - none
> of the others have any effect.

Okay, I messed up. The SKX machine I used for the SKX benchmarks was
mislabeled, and was in fact an ICX machine.

I have some older SKX/BWD data:
https://docs.google.com/spreadsheets/d/1rcRIWpU8Tcq0FMH9iGPBUViPvioUmHtA9qx3h-UzPa0/edit?usp=sharing
and agree we need to increase the stosb threshold on SKX. I didn't see
the huge threshold (100MB) you saw there, but am re-running benchmarks.

An unfortunate note is that BWD seems to prefer ERMS in the same range
SKX doesn't, so we can't just make the decision based on the feature.
On the bright side, it does look like the non-temporal decision (this
patch) still makes sense on SKX.

I am re-running benchmarks now and will see what we can do about raising
the stosb threshold on SKX.

I am deeply sorry for the confusion I have created here.
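The tunable experiment DJ describes can be reproduced with a small harness like the sketch below. The tunable is spelled glibc.cpu.x86_rep_stosb_threshold when set through the GLIBC_TUNABLES environment variable, and glibc.cpu.x86_non_temporal_threshold controls the non-temporal cutoff this patch starts using; the buffer size and iteration count here are arbitrary illustrative choices.

    /* Build: gcc -O2 memset-bench.c -o memset-bench
       Run, e.g.:
         GLIBC_TUNABLES=glibc.cpu.x86_rep_stosb_threshold=0x1000000000 \
             ./memset-bench 0x4000000   */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    int main(int argc, char **argv)
    {
        size_t len = argc > 1 ? (size_t) strtoull(argv[1], NULL, 0)
                              : (64UL << 20);
        int iters = 64;
        char *buf = malloc(len);
        if (buf == NULL)
            return 1;
        memset(buf, 1, len);          /* fault the pages in before timing */

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < iters; i++)
            memset(buf, i & 0xff, len);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double sec = (t1.tv_sec - t0.tv_sec)
                     + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
        printf("%zu bytes: %.2f GB/s\n", len, (double) iters * len / sec / 1e9);
        free(buf);
        return 0;
    }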
On Fri, Jul 12, 2024 at 3:04 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>
> [earlier quoting of DJ's report snipped]
>
> I am re-running benchmarks now and will see what we can do about raising
> the stosb threshold on SKX.

I've re-run some benchmarks and updated the google sheet:
https://docs.google.com/spreadsheets/d/1opzukzvum4n6-RUVHTGddV6RjAEil4P2uMjjQGLbLcU/edit?usp=sharing
See SKX/SKX-2.

Regarding ERMS:
What I am measuring so far indicates we want temporal writes when the
majority of the set fits in L1. Past that, ERMS becomes better,
especially as thread count increases.

Regarding NT Stores:
non-temporal doesn't look good in these benchmarks (particularly SKX-2).
I am running some additional tests to see exactly what is going on.
I am going to post a patch to disable non-temporal stores on Skylake.
If that is considered too risky given the 2.40 release, I think we should
revert this patch (and the follow-up AMD ones) to avoid a regression on
SKX and try to get them into 2.41 instead with proper tuning.
On Mon, Jul 15, 2024, 3:36 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>
> [earlier quoting snipped]
>
> Regarding NT Stores:
> non-temporal doesn't look good in these benchmarks (particularly SKX-2).
> I am running some additional tests to see exactly what is going on.
> I am going to post a patch to disable non-temporal stores on Skylake,

Skylake clients have very different NT behaviors. Please make sure that
NT isn't disabled on clients.

> if that is considered too risky given the 2.40 release I think we should
> revert this patch (and the follow up AMD ones) to avoid a regression on
> SKX and try to get them into 2.41 instead with proper tuning.

H.J.
On Mon, Jul 15, 2024 at 3:45 PM H.J. Lu <hjl.tools@gmail.com> wrote:
>
> [earlier quoting snipped]
>
> Skylake clients have very different NT behaviors.
> Please make sure that NT isn't disabled
> on clients.

I guess my thinking right now is it's better to ensure we don't introduce
a new regression. The NT store code is new to 2.40, so disabling it won't
cause any regressions.

> >> if that is considered too risky given the 2.40 release I think we should
> >> revert this patch (and the follow up AMD ones) to avoid a regression on
> >> SKX and try to get them into 2.41 instead with proper tuning.
>
> [remainder of quote snipped]
On Mon, Jul 15, 2024 at 3:48 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>
> [earlier quoting snipped]
>
> >> if that is considered too risky given the 2.40 release I think we should
> >> revert this patch (and the follow up AMD ones) to avoid a regression on
> >> SKX and try to get them into 2.41 instead with proper tuning.

Patch posted. I disabled it for SKX only (I'm convinced enough that it's a
function of SKX's unique memory system).
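Since the follow-up keys the behavior off the specific microarchitecture rather than a CPUID feature bit, here is a hedged illustration of how Skylake Server parts can be identified by family/model. This is only a sketch of the detection idea, not the actual glibc change (which lives in the x86 cpu-features/tunables code and is not reproduced here); the function name is hypothetical.

    #include <cpuid.h>
    #include <stdbool.h>

    /* Family 6, model 0x55 covers Skylake-SP as well as Cascade Lake and
       Cooper Lake; telling those apart would additionally need the
       stepping.  */
    bool is_skylake_server(void)
    {
        unsigned int eax, ebx, ecx, edx;
        if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
            return false;
        unsigned int family = (eax >> 8) & 0xf;
        unsigned int model = ((eax >> 4) & 0xf) | (((eax >> 16) & 0xf) << 4);
        return family == 6 && model == 0x55;
    }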
diff --git a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
index 97839a2248..637caadb40 100644
--- a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
+++ b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
@@ -21,10 +21,13 @@
    2. If size is less than VEC, use integer register stores.
    3. If size is from VEC_SIZE to 2 * VEC_SIZE, use 2 VEC stores.
    4. If size is from 2 * VEC_SIZE to 4 * VEC_SIZE, use 4 VEC stores.
-   5. On machines ERMS feature, if size is greater or equal than
-      __x86_rep_stosb_threshold then REP STOSB will be used.
-   6. If size is more to 4 * VEC_SIZE, align to 4 * VEC_SIZE with
-      4 VEC stores and store 4 * VEC at a time until done.  */
+   5. If size is more to 4 * VEC_SIZE, align to 1 * VEC_SIZE with
+      4 VEC stores and store 4 * VEC at a time until done.
+   6. On machines ERMS feature, if size is range
+      [__x86_rep_stosb_threshold, __x86_shared_non_temporal_threshold)
+      then REP STOSB will be used.
+   7. If size >= __x86_shared_non_temporal_threshold, use a
+      non-temporal stores.  */

 #include <sysdep.h>

@@ -147,6 +150,41 @@ L(entry_from_wmemset):
 	VMOVU	%VMM(0), -VEC_SIZE(%rdi,%rdx)
 	VMOVU	%VMM(0), (%rdi)
 	VZEROUPPER_RETURN
+
+	/* If have AVX512 mask instructions put L(less_vec) close to
+	   entry as it doesn't take much space and is likely a hot target. */
+#ifdef USE_LESS_VEC_MASK_STORE
+	/* Align to ensure the L(less_vec) logic all fits in 1x cache lines. */
+	.p2align 6,, 47
+	.p2align 4
+L(less_vec):
+L(less_vec_from_wmemset):
+	/* Less than 1 VEC. */
+# if VEC_SIZE != 16 && VEC_SIZE != 32 && VEC_SIZE != 64
+#  error Unsupported VEC_SIZE!
+# endif
+	/* Clear high bits from edi. Only keeping bits relevant to page
+	   cross check. Note that we are using rax which is set in
+	   MEMSET_VDUP_TO_VEC0_AND_SET_RETURN as ptr from here on out. */
+	andl	$(PAGE_SIZE - 1), %edi
+	/* Check if VEC_SIZE store cross page. Mask stores suffer
+	   serious performance degradation when it has to fault suppress. */
+	cmpl	$(PAGE_SIZE - VEC_SIZE), %edi
+	/* This is generally considered a cold target. */
+	ja	L(cross_page)
+# if VEC_SIZE > 32
+	movq	$-1, %rcx
+	bzhiq	%rdx, %rcx, %rcx
+	kmovq	%rcx, %k1
+# else
+	movl	$-1, %ecx
+	bzhil	%edx, %ecx, %ecx
+	kmovd	%ecx, %k1
+# endif
+	vmovdqu8 %VMM(0), (%rax){%k1}
+	VZEROUPPER_RETURN
+#endif
+
 #if defined USE_MULTIARCH && IS_IN (libc)
 END (MEMSET_SYMBOL (__memset, unaligned))

@@ -185,54 +223,6 @@ L(last_2x_vec):
 #endif
 	VZEROUPPER_RETURN

-	/* If have AVX512 mask instructions put L(less_vec) close to
-	   entry as it doesn't take much space and is likely a hot target.
-	 */
-#ifdef USE_LESS_VEC_MASK_STORE
-	.p2align 4,, 10
-L(less_vec):
-L(less_vec_from_wmemset):
-	/* Less than 1 VEC. */
-# if VEC_SIZE != 16 && VEC_SIZE != 32 && VEC_SIZE != 64
-#  error Unsupported VEC_SIZE!
-# endif
-	/* Clear high bits from edi. Only keeping bits relevant to page
-	   cross check. Note that we are using rax which is set in
-	   MEMSET_VDUP_TO_VEC0_AND_SET_RETURN as ptr from here on out. */
-	andl	$(PAGE_SIZE - 1), %edi
-	/* Check if VEC_SIZE store cross page. Mask stores suffer
-	   serious performance degradation when it has to fault suppress.
-	 */
-	cmpl	$(PAGE_SIZE - VEC_SIZE), %edi
-	/* This is generally considered a cold target. */
-	ja	L(cross_page)
-# if VEC_SIZE > 32
-	movq	$-1, %rcx
-	bzhiq	%rdx, %rcx, %rcx
-	kmovq	%rcx, %k1
-# else
-	movl	$-1, %ecx
-	bzhil	%edx, %ecx, %ecx
-	kmovd	%ecx, %k1
-# endif
-	vmovdqu8 %VMM(0), (%rax){%k1}
-	VZEROUPPER_RETURN
-
-# if defined USE_MULTIARCH && IS_IN (libc)
-	/* Include L(stosb_local) here if including L(less_vec) between
-	   L(stosb_more_2x_vec) and ENTRY. This is to cache align the
-	   L(stosb_more_2x_vec) target. */
-	.p2align 4,, 10
-L(stosb_local):
-	movzbl	%sil, %eax
-	mov	%RDX_LP, %RCX_LP
-	mov	%RDI_LP, %RDX_LP
-	rep	stosb
-	mov	%RDX_LP, %RAX_LP
-	VZEROUPPER_RETURN
-# endif
-#endif
-
 #if defined USE_MULTIARCH && IS_IN (libc)
 	.p2align 4
 L(stosb_more_2x_vec):
@@ -318,21 +308,33 @@ L(return_vzeroupper):
 	ret
 #endif

-	.p2align 4,, 10
-#ifndef USE_LESS_VEC_MASK_STORE
-# if defined USE_MULTIARCH && IS_IN (libc)
+#ifdef USE_WITH_AVX2
+	.p2align 4
+#else
+	.p2align 4,, 4
+#endif
+
+#if defined USE_MULTIARCH && IS_IN (libc)
 	/* If no USE_LESS_VEC_MASK put L(stosb_local) here. Will be in
 	   range for 2-byte jump encoding. */
 L(stosb_local):
+	cmp	__x86_shared_non_temporal_threshold(%rip), %RDX_LP
+	jae	L(nt_memset)
 	movzbl	%sil, %eax
 	mov	%RDX_LP, %RCX_LP
 	mov	%RDI_LP, %RDX_LP
 	rep	stosb
+# if (defined USE_WITH_SSE2) || (defined USE_WITH_AVX512)
+	/* Use xchg to save 1-byte (this helps align targets below). */
+	xchg	%RDX_LP, %RAX_LP
+# else
 	mov	%RDX_LP, %RAX_LP
-	VZEROUPPER_RETURN
 # endif
+	VZEROUPPER_RETURN
+#endif
+#ifndef USE_LESS_VEC_MASK_STORE
 	/* Define L(less_vec) only if not otherwise defined. */
-	.p2align 4
+	.p2align 4,, 12
 L(less_vec):
 	/* Broadcast esi to partial register (i.e VEC_SIZE == 32 broadcast to
 	   xmm). This is only does anything for AVX2. */
@@ -423,4 +425,35 @@ L(between_2_3):
 	movb	%SET_REG8, -1(%LESS_VEC_REG, %rdx)
 #endif
 	ret
-END (MEMSET_SYMBOL (__memset, unaligned_erms))
+
+#if defined USE_MULTIARCH && IS_IN (libc)
+# ifdef USE_WITH_AVX512
+	/* Force align so the loop doesn't cross a cache-line. */
+	.p2align 4
+# endif
+	.p2align 4,, 7
+	/* Memset using non-temporal stores. */
+L(nt_memset):
+	VMOVU	%VMM(0), (VEC_SIZE * 0)(%rdi)
+	leaq	(VEC_SIZE * -4)(%rdi, %rdx), %rdx
+	/* Align DST. */
+	orq	$(VEC_SIZE * 1 - 1), %rdi
+	incq	%rdi
+	.p2align 4,, 7
+L(nt_loop):
+	VMOVNT	%VMM(0), (VEC_SIZE * 0)(%rdi)
+	VMOVNT	%VMM(0), (VEC_SIZE * 1)(%rdi)
+	VMOVNT	%VMM(0), (VEC_SIZE * 2)(%rdi)
+	VMOVNT	%VMM(0), (VEC_SIZE * 3)(%rdi)
+	subq	$(VEC_SIZE * -4), %rdi
+	cmpq	%rdx, %rdi
+	jb	L(nt_loop)
+	sfence
+	VMOVU	%VMM(0), (VEC_SIZE * 0)(%rdx)
+	VMOVU	%VMM(0), (VEC_SIZE * 1)(%rdx)
+	VMOVU	%VMM(0), (VEC_SIZE * 2)(%rdx)
+	VMOVU	%VMM(0), (VEC_SIZE * 3)(%rdx)
+	VZEROUPPER_RETURN
+#endif
+
+END(MEMSET_SYMBOL(__memset, unaligned_erms))
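A note on the new L(nt_memset) block above: the tail is handled by overlap rather than a scalar loop. %rdx is set to dst + len - 4 * VEC_SIZE before the loop, the destination pointer is bumped to the next VEC_SIZE boundary with the orq/incq pair, and once the loop passes %rdx the final four unaligned VMOVU stores at %rdx cover whatever remains (possibly rewriting a few already-set bytes). Below is a rough C rendering of that structure, assuming VEC_SIZE == 32 (AVX2) purely for illustration; it is a sketch, not the implementation.

    /* Build with: gcc -O2 -mavx2 -c nt-model.c
       n is assumed large; this path is only reached above the non-temporal
       threshold, so the loop always runs.  */
    #include <immintrin.h>
    #include <stdint.h>
    #include <stddef.h>

    #define VEC 32  /* assumed AVX2 vector size, purely for this sketch */

    void nt_memset_model(unsigned char *dst, int c, size_t n)
    {
        __m256i v = _mm256_set1_epi8((char) c);

        /* Head store, then bump to the next VEC boundary (the orq/incq pair). */
        _mm256_storeu_si256((__m256i *) dst, v);
        unsigned char *end4 = dst + n - 4 * VEC;   /* plays the role of %rdx */
        unsigned char *p = (unsigned char *) (((uintptr_t) dst | (VEC - 1)) + 1);

        while (p < end4) {
            _mm256_stream_si256((__m256i *) (p + 0 * VEC), v);
            _mm256_stream_si256((__m256i *) (p + 1 * VEC), v);
            _mm256_stream_si256((__m256i *) (p + 2 * VEC), v);
            _mm256_stream_si256((__m256i *) (p + 3 * VEC), v);
            p += 4 * VEC;
        }
        _mm_sfence();  /* order the weakly-ordered NT stores before returning */

        /* Final four unaligned stores at end4 overlap what the loop already
           wrote and cover the remaining (n mod 4 * VEC) bytes, so no scalar
           tail loop is needed.  */
        _mm256_storeu_si256((__m256i *) (end4 + 0 * VEC), v);
        _mm256_storeu_si256((__m256i *) (end4 + 1 * VEC), v);
        _mm256_storeu_si256((__m256i *) (end4 + 2 * VEC), v);
        _mm256_storeu_si256((__m256i *) (end4 + 3 * VEC), v);
    }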