Message ID: 20240519004347.2759850-1-goldstein.w.n@gmail.com
State: New
Series: [v1] x86: Improve large memset perf with non-temporal stores [RHEL-29312]
On Sat, May 18, 2024 at 5:44 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote: > > Previously we used `rep stosb` for all medium/large memsets. This is > notably worse than non-temporal stores for large (above a > few MBs) memsets. > See: > https://docs.google.com/spreadsheets/d/1opzukzvum4n6-RUVHTGddV6RjAEil4P2uMjjQGLbLcU/edit?usp=sharing > for data using different strategies for large memset on ICX and SKX. > > Using non-temporal stores can be up to 3x faster on ICX and 2x faster > on SKX. Historically, these numbers would not have been so good > because of the zero-over-zero writeback optimization that `rep stosb` > is able to do. But, the zero-over-zero writeback optimization has been > removed as a potential side-channel attack, so there is no longer any > good reason to only rely on `rep stosb` for large memsets. On the flip > side, non-temporal writes avoid the RFO (read-for-ownership) data reads, saving > memory bandwidth. > > All of the other changes to the file are to re-organize the > code-blocks to maintain "good" alignment given the new code added in > the `L(stosb_local)` case. > > The results from running the GLIBC memset benchmarks on TGL-client for > N=20 runs: > > Geometric Mean across the suite New / Old EVEX256: 0.979 > Geometric Mean across the suite New / Old EVEX512: 0.979 > Geometric Mean across the suite New / Old AVX2 : 0.986 > Geometric Mean across the suite New / Old SSE2 : 0.979 > > Most of the cases are essentially unchanged; this is mostly to show > that adding the non-temporal case didn't add any regressions to the > other cases. > > The results on the memset-large benchmark suite on TGL-client for N=20 > runs: > > Geometric Mean across the suite New / Old EVEX256: 0.926 > Geometric Mean across the suite New / Old EVEX512: 0.925 > Geometric Mean across the suite New / Old AVX2 : 0.928 > Geometric Mean across the suite New / Old SSE2 : 0.924 > > So roughly a 7.5% speedup. This is lower than what we see on servers > (likely because clients typically have faster single-core bandwidth so > saving bandwidth on RFOs is less impactful), but still advantageous. > > Full test-suite passes on x86_64 w/ and w/o multiarch. > --- > .../multiarch/memset-vec-unaligned-erms.S | 149 +++++++++++------- > 1 file changed, 91 insertions(+), 58 deletions(-) > > diff --git a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S > index 97839a2248..637caadb40 100644 > --- a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S > +++ b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S > @@ -21,10 +21,13 @@ > 2. If size is less than VEC, use integer register stores. > 3. If size is from VEC_SIZE to 2 * VEC_SIZE, use 2 VEC stores. > 4. If size is from 2 * VEC_SIZE to 4 * VEC_SIZE, use 4 VEC stores. > - 5. On machines ERMS feature, if size is greater or equal than > - __x86_rep_stosb_threshold then REP STOSB will be used. > - 6. If size is more to 4 * VEC_SIZE, align to 4 * VEC_SIZE with > - 4 VEC stores and store 4 * VEC at a time until done. */ > + 5. If size is more to 4 * VEC_SIZE, align to 1 * VEC_SIZE with > + 4 VEC stores and store 4 * VEC at a time until done. > + 6. On machines ERMS feature, if size is range > + [__x86_rep_stosb_threshold, __x86_shared_non_temporal_threshold) > + then REP STOSB will be used. > + 7. If size >= __x86_shared_non_temporal_threshold, use a > + non-temporal stores. 
*/ > > #include <sysdep.h> > > @@ -147,6 +150,41 @@ L(entry_from_wmemset): > VMOVU %VMM(0), -VEC_SIZE(%rdi,%rdx) > VMOVU %VMM(0), (%rdi) > VZEROUPPER_RETURN > + > + /* If have AVX512 mask instructions put L(less_vec) close to > + entry as it doesn't take much space and is likely a hot target. */ > +#ifdef USE_LESS_VEC_MASK_STORE > + /* Align to ensure the L(less_vec) logic all fits in 1x cache lines. */ > + .p2align 6,, 47 > + .p2align 4 > +L(less_vec): > +L(less_vec_from_wmemset): > + /* Less than 1 VEC. */ > +# if VEC_SIZE != 16 && VEC_SIZE != 32 && VEC_SIZE != 64 > +# error Unsupported VEC_SIZE! > +# endif > + /* Clear high bits from edi. Only keeping bits relevant to page > + cross check. Note that we are using rax which is set in > + MEMSET_VDUP_TO_VEC0_AND_SET_RETURN as ptr from here on out. */ > + andl $(PAGE_SIZE - 1), %edi > + /* Check if VEC_SIZE store cross page. Mask stores suffer > + serious performance degradation when it has to fault suppress. */ > + cmpl $(PAGE_SIZE - VEC_SIZE), %edi > + /* This is generally considered a cold target. */ > + ja L(cross_page) > +# if VEC_SIZE > 32 > + movq $-1, %rcx > + bzhiq %rdx, %rcx, %rcx > + kmovq %rcx, %k1 > +# else > + movl $-1, %ecx > + bzhil %edx, %ecx, %ecx > + kmovd %ecx, %k1 > +# endif > + vmovdqu8 %VMM(0), (%rax){%k1} > + VZEROUPPER_RETURN > +#endif > + > #if defined USE_MULTIARCH && IS_IN (libc) > END (MEMSET_SYMBOL (__memset, unaligned)) > > @@ -185,54 +223,6 @@ L(last_2x_vec): > #endif > VZEROUPPER_RETURN > > - /* If have AVX512 mask instructions put L(less_vec) close to > - entry as it doesn't take much space and is likely a hot target. > - */ > -#ifdef USE_LESS_VEC_MASK_STORE > - .p2align 4,, 10 > -L(less_vec): > -L(less_vec_from_wmemset): > - /* Less than 1 VEC. */ > -# if VEC_SIZE != 16 && VEC_SIZE != 32 && VEC_SIZE != 64 > -# error Unsupported VEC_SIZE! > -# endif > - /* Clear high bits from edi. Only keeping bits relevant to page > - cross check. Note that we are using rax which is set in > - MEMSET_VDUP_TO_VEC0_AND_SET_RETURN as ptr from here on out. */ > - andl $(PAGE_SIZE - 1), %edi > - /* Check if VEC_SIZE store cross page. Mask stores suffer > - serious performance degradation when it has to fault suppress. > - */ > - cmpl $(PAGE_SIZE - VEC_SIZE), %edi > - /* This is generally considered a cold target. */ > - ja L(cross_page) > -# if VEC_SIZE > 32 > - movq $-1, %rcx > - bzhiq %rdx, %rcx, %rcx > - kmovq %rcx, %k1 > -# else > - movl $-1, %ecx > - bzhil %edx, %ecx, %ecx > - kmovd %ecx, %k1 > -# endif > - vmovdqu8 %VMM(0), (%rax){%k1} > - VZEROUPPER_RETURN > - > -# if defined USE_MULTIARCH && IS_IN (libc) > - /* Include L(stosb_local) here if including L(less_vec) between > - L(stosb_more_2x_vec) and ENTRY. This is to cache align the > - L(stosb_more_2x_vec) target. */ > - .p2align 4,, 10 > -L(stosb_local): > - movzbl %sil, %eax > - mov %RDX_LP, %RCX_LP > - mov %RDI_LP, %RDX_LP > - rep stosb > - mov %RDX_LP, %RAX_LP > - VZEROUPPER_RETURN > -# endif > -#endif > - > #if defined USE_MULTIARCH && IS_IN (libc) > .p2align 4 > L(stosb_more_2x_vec): > @@ -318,21 +308,33 @@ L(return_vzeroupper): > ret > #endif > > - .p2align 4,, 10 > -#ifndef USE_LESS_VEC_MASK_STORE > -# if defined USE_MULTIARCH && IS_IN (libc) > +#ifdef USE_WITH_AVX2 > + .p2align 4 > +#else > + .p2align 4,, 4 > +#endif > + > +#if defined USE_MULTIARCH && IS_IN (libc) > /* If no USE_LESS_VEC_MASK put L(stosb_local) here. Will be in > range for 2-byte jump encoding. 
*/ > L(stosb_local): > + cmp __x86_shared_non_temporal_threshold(%rip), %RDX_LP > + jae L(nt_memset) > movzbl %sil, %eax > mov %RDX_LP, %RCX_LP > mov %RDI_LP, %RDX_LP > rep stosb > +# if (defined USE_WITH_SSE2) || (defined USE_WITH_AVX512) > + /* Use xchg to save 1-byte (this helps align targets below). */ > + xchg %RDX_LP, %RAX_LP > +# else > mov %RDX_LP, %RAX_LP > - VZEROUPPER_RETURN > # endif > + VZEROUPPER_RETURN > +#endif > +#ifndef USE_LESS_VEC_MASK_STORE > /* Define L(less_vec) only if not otherwise defined. */ > - .p2align 4 > + .p2align 4,, 12 > L(less_vec): > /* Broadcast esi to partial register (i.e VEC_SIZE == 32 broadcast to > xmm). This is only does anything for AVX2. */ > @@ -423,4 +425,35 @@ L(between_2_3): > movb %SET_REG8, -1(%LESS_VEC_REG, %rdx) > #endif > ret > -END (MEMSET_SYMBOL (__memset, unaligned_erms)) > + > +#if defined USE_MULTIARCH && IS_IN (libc) > +# ifdef USE_WITH_AVX512 > + /* Force align so the loop doesn't cross a cache-line. */ > + .p2align 4 > +# endif > + .p2align 4,, 7 > + /* Memset using non-temporal stores. */ > +L(nt_memset): > + VMOVU %VMM(0), (VEC_SIZE * 0)(%rdi) > + leaq (VEC_SIZE * -4)(%rdi, %rdx), %rdx > + /* Align DST. */ > + orq $(VEC_SIZE * 1 - 1), %rdi > + incq %rdi > + .p2align 4,, 7 > +L(nt_loop): > + VMOVNT %VMM(0), (VEC_SIZE * 0)(%rdi) > + VMOVNT %VMM(0), (VEC_SIZE * 1)(%rdi) > + VMOVNT %VMM(0), (VEC_SIZE * 2)(%rdi) > + VMOVNT %VMM(0), (VEC_SIZE * 3)(%rdi) > + subq $(VEC_SIZE * -4), %rdi > + cmpq %rdx, %rdi > + jb L(nt_loop) > + sfence > + VMOVU %VMM(0), (VEC_SIZE * 0)(%rdx) > + VMOVU %VMM(0), (VEC_SIZE * 1)(%rdx) > + VMOVU %VMM(0), (VEC_SIZE * 2)(%rdx) > + VMOVU %VMM(0), (VEC_SIZE * 3)(%rdx) > + VZEROUPPER_RETURN > +#endif > + > +END(MEMSET_SYMBOL(__memset, unaligned_erms)) > -- > 2.34.1 > LGTM. Reviewed-by: H.J. Lu <hjl.tools@gmail.com> Thanks.
On 18/05/24 21:43, Noah Goldstein wrote: > Previously we use `rep stosb` for all medium/large memsets. This is > notably worse than non-temporal stores for large (above a > few MBs) memsets. > See: > https://docs.google.com/spreadsheets/d/1opzukzvum4n6-RUVHTGddV6RjAEil4P2uMjjQGLbLcU/edit?usp=sharing > For data using different stategies for large memset on ICX and SKX. Btw this datasheet is not accessible without extra steps. > > Using non-temporal stores can be up to 3x faster on ICX and 2x faster > on SKX. Historically, these numbers would not have been so good > because of the zero-over-zero writeback optimization that `rep stosb` > is able to do. But, the zero-over-zero writeback optimization has been > removed as a potential side-channel attack, so there is no longer any > good reason to only rely on `rep stosb` for large memsets. On the flip > size, non-temporal writes can avoid data in their RFO requests saving > memory bandwidth. Any chance on how this play in newer AMD chips? I am trying to avoid another regression like BZ#30994 and BZ#30995 (this one I would like to fix, however I don't have access to a Zen4 machine to check for results). > > All of the other changes to the file are to re-organize the > code-blocks to maintain "good" alignment given the new code added in > the `L(stosb_local)` case. > > The results from running the GLIBC memset benchmarks on TGL-client for > N=20 runs: > > Geometric Mean across the suite New / Old EXEX256: 0.979 > Geometric Mean across the suite New / Old EXEX512: 0.979 > Geometric Mean across the suite New / Old AVX2 : 0.986 > Geometric Mean across the suite New / Old SSE2 : 0.979 > > Most of the cases are essentially unchanged, this is mostly to show > that adding the non-temporal case didn't add any regressions to the > other cases. > > The results on the memset-large benchmark suite on TGL-client for N=20 > runs: > > Geometric Mean across the suite New / Old EXEX256: 0.926 > Geometric Mean across the suite New / Old EXEX512: 0.925 > Geometric Mean across the suite New / Old AVX2 : 0.928 > Geometric Mean across the suite New / Old SSE2 : 0.924 > > So roughly a 7.5% speedup. This is lower than what we see on servers > (likely because clients typically have faster single-core bandwidth so > saving bandwidth on RFOs is less impactful), but still advantageous. > > Full test-suite passes on x86_64 w/ and w/o multiarch. > --- > .../multiarch/memset-vec-unaligned-erms.S | 149 +++++++++++------- > 1 file changed, 91 insertions(+), 58 deletions(-) > > diff --git a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S > index 97839a2248..637caadb40 100644 > --- a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S > +++ b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S > @@ -21,10 +21,13 @@ > 2. If size is less than VEC, use integer register stores. > 3. If size is from VEC_SIZE to 2 * VEC_SIZE, use 2 VEC stores. > 4. If size is from 2 * VEC_SIZE to 4 * VEC_SIZE, use 4 VEC stores. > - 5. On machines ERMS feature, if size is greater or equal than > - __x86_rep_stosb_threshold then REP STOSB will be used. > - 6. If size is more to 4 * VEC_SIZE, align to 4 * VEC_SIZE with > - 4 VEC stores and store 4 * VEC at a time until done. */ > + 5. If size is more to 4 * VEC_SIZE, align to 1 * VEC_SIZE with > + 4 VEC stores and store 4 * VEC at a time until done. > + 6. 
On machines ERMS feature, if size is range > + [__x86_rep_stosb_threshold, __x86_shared_non_temporal_threshold) > + then REP STOSB will be used. > + 7. If size >= __x86_shared_non_temporal_threshold, use a > + non-temporal stores. */ > > #include <sysdep.h> > > @@ -147,6 +150,41 @@ L(entry_from_wmemset): > VMOVU %VMM(0), -VEC_SIZE(%rdi,%rdx) > VMOVU %VMM(0), (%rdi) > VZEROUPPER_RETURN > + > + /* If have AVX512 mask instructions put L(less_vec) close to > + entry as it doesn't take much space and is likely a hot target. */ > +#ifdef USE_LESS_VEC_MASK_STORE > + /* Align to ensure the L(less_vec) logic all fits in 1x cache lines. */ > + .p2align 6,, 47 > + .p2align 4 > +L(less_vec): > +L(less_vec_from_wmemset): > + /* Less than 1 VEC. */ > +# if VEC_SIZE != 16 && VEC_SIZE != 32 && VEC_SIZE != 64 > +# error Unsupported VEC_SIZE! > +# endif > + /* Clear high bits from edi. Only keeping bits relevant to page > + cross check. Note that we are using rax which is set in > + MEMSET_VDUP_TO_VEC0_AND_SET_RETURN as ptr from here on out. */ > + andl $(PAGE_SIZE - 1), %edi > + /* Check if VEC_SIZE store cross page. Mask stores suffer > + serious performance degradation when it has to fault suppress. */ > + cmpl $(PAGE_SIZE - VEC_SIZE), %edi > + /* This is generally considered a cold target. */ > + ja L(cross_page) > +# if VEC_SIZE > 32 > + movq $-1, %rcx > + bzhiq %rdx, %rcx, %rcx > + kmovq %rcx, %k1 > +# else > + movl $-1, %ecx > + bzhil %edx, %ecx, %ecx > + kmovd %ecx, %k1 > +# endif > + vmovdqu8 %VMM(0), (%rax){%k1} > + VZEROUPPER_RETURN > +#endif > + > #if defined USE_MULTIARCH && IS_IN (libc) > END (MEMSET_SYMBOL (__memset, unaligned)) > > @@ -185,54 +223,6 @@ L(last_2x_vec): > #endif > VZEROUPPER_RETURN > > - /* If have AVX512 mask instructions put L(less_vec) close to > - entry as it doesn't take much space and is likely a hot target. > - */ > -#ifdef USE_LESS_VEC_MASK_STORE > - .p2align 4,, 10 > -L(less_vec): > -L(less_vec_from_wmemset): > - /* Less than 1 VEC. */ > -# if VEC_SIZE != 16 && VEC_SIZE != 32 && VEC_SIZE != 64 > -# error Unsupported VEC_SIZE! > -# endif > - /* Clear high bits from edi. Only keeping bits relevant to page > - cross check. Note that we are using rax which is set in > - MEMSET_VDUP_TO_VEC0_AND_SET_RETURN as ptr from here on out. */ > - andl $(PAGE_SIZE - 1), %edi > - /* Check if VEC_SIZE store cross page. Mask stores suffer > - serious performance degradation when it has to fault suppress. > - */ > - cmpl $(PAGE_SIZE - VEC_SIZE), %edi > - /* This is generally considered a cold target. */ > - ja L(cross_page) > -# if VEC_SIZE > 32 > - movq $-1, %rcx > - bzhiq %rdx, %rcx, %rcx > - kmovq %rcx, %k1 > -# else > - movl $-1, %ecx > - bzhil %edx, %ecx, %ecx > - kmovd %ecx, %k1 > -# endif > - vmovdqu8 %VMM(0), (%rax){%k1} > - VZEROUPPER_RETURN > - > -# if defined USE_MULTIARCH && IS_IN (libc) > - /* Include L(stosb_local) here if including L(less_vec) between > - L(stosb_more_2x_vec) and ENTRY. This is to cache align the > - L(stosb_more_2x_vec) target. 
*/ > - .p2align 4,, 10 > -L(stosb_local): > - movzbl %sil, %eax > - mov %RDX_LP, %RCX_LP > - mov %RDI_LP, %RDX_LP > - rep stosb > - mov %RDX_LP, %RAX_LP > - VZEROUPPER_RETURN > -# endif > -#endif > - > #if defined USE_MULTIARCH && IS_IN (libc) > .p2align 4 > L(stosb_more_2x_vec): > @@ -318,21 +308,33 @@ L(return_vzeroupper): > ret > #endif > > - .p2align 4,, 10 > -#ifndef USE_LESS_VEC_MASK_STORE > -# if defined USE_MULTIARCH && IS_IN (libc) > +#ifdef USE_WITH_AVX2 > + .p2align 4 > +#else > + .p2align 4,, 4 > +#endif > + > +#if defined USE_MULTIARCH && IS_IN (libc) > /* If no USE_LESS_VEC_MASK put L(stosb_local) here. Will be in > range for 2-byte jump encoding. */ > L(stosb_local): > + cmp __x86_shared_non_temporal_threshold(%rip), %RDX_LP > + jae L(nt_memset) > movzbl %sil, %eax > mov %RDX_LP, %RCX_LP > mov %RDI_LP, %RDX_LP > rep stosb > +# if (defined USE_WITH_SSE2) || (defined USE_WITH_AVX512) > + /* Use xchg to save 1-byte (this helps align targets below). */ > + xchg %RDX_LP, %RAX_LP > +# else > mov %RDX_LP, %RAX_LP > - VZEROUPPER_RETURN > # endif > + VZEROUPPER_RETURN > +#endif > +#ifndef USE_LESS_VEC_MASK_STORE > /* Define L(less_vec) only if not otherwise defined. */ > - .p2align 4 > + .p2align 4,, 12 > L(less_vec): > /* Broadcast esi to partial register (i.e VEC_SIZE == 32 broadcast to > xmm). This is only does anything for AVX2. */ > @@ -423,4 +425,35 @@ L(between_2_3): > movb %SET_REG8, -1(%LESS_VEC_REG, %rdx) > #endif > ret > -END (MEMSET_SYMBOL (__memset, unaligned_erms)) > + > +#if defined USE_MULTIARCH && IS_IN (libc) > +# ifdef USE_WITH_AVX512 > + /* Force align so the loop doesn't cross a cache-line. */ > + .p2align 4 > +# endif > + .p2align 4,, 7 > + /* Memset using non-temporal stores. */ > +L(nt_memset): > + VMOVU %VMM(0), (VEC_SIZE * 0)(%rdi) > + leaq (VEC_SIZE * -4)(%rdi, %rdx), %rdx > + /* Align DST. */ > + orq $(VEC_SIZE * 1 - 1), %rdi > + incq %rdi > + .p2align 4,, 7 > +L(nt_loop): > + VMOVNT %VMM(0), (VEC_SIZE * 0)(%rdi) > + VMOVNT %VMM(0), (VEC_SIZE * 1)(%rdi) > + VMOVNT %VMM(0), (VEC_SIZE * 2)(%rdi) > + VMOVNT %VMM(0), (VEC_SIZE * 3)(%rdi) > + subq $(VEC_SIZE * -4), %rdi > + cmpq %rdx, %rdi > + jb L(nt_loop) > + sfence > + VMOVU %VMM(0), (VEC_SIZE * 0)(%rdx) > + VMOVU %VMM(0), (VEC_SIZE * 1)(%rdx) > + VMOVU %VMM(0), (VEC_SIZE * 2)(%rdx) > + VMOVU %VMM(0), (VEC_SIZE * 3)(%rdx) > + VZEROUPPER_RETURN > +#endif > + > +END(MEMSET_SYMBOL(__memset, unaligned_erms))
On Sun, May 19, 2024, at 12:43 AM, Noah Goldstein wrote: > Previously we use `rep stosb` for all medium/large memsets. This is > notably worse than non-temporal stores for large (above a few MBs) > memsets. See: > https://docs.google.com/spreadsheets/d/1opzukzvum4n6-RUVHTGddV6RjAEil4P2uMjjQGLbLcU/edit?usp=sharing > For data using different stategies for large memset on ICX and SKX. Did your tests read the region that was memset afterward, or did they just do the memset? It *probably* doesn't matter at "above a few MBs" but I'm worried that this new implementation might look good on microbenchmarks but make real programs *slower* because it is no longer priming the cache for the memset region. zw
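Zack's concern is about measurement methodology. A hypothetical harness in the spirit of the question (not glibc's benchtests and not the memset-benchmarks repo) would time memset of a large buffer both alone and followed by a read pass over the same region, so that any cache-priming benefit of `rep stosb` shows up in the second number:

/* Hypothetical harness: time memset of a large buffer with and without
   reading the region back afterwards.  Buffer size, fill byte, and
   iteration count are arbitrary choices for illustration.  */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

static double now_sec (void)
{
  struct timespec ts;
  clock_gettime (CLOCK_MONOTONIC, &ts);
  return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main (void)
{
  size_t len = 64u << 20;               /* 64 MiB, above the NT threshold */
  unsigned char *buf = malloc (len);
  volatile uint64_t sink = 0;
  if (buf == NULL)
    return 1;

  for (int reused = 0; reused <= 1; reused++)
    {
      double best = 1e30;
      for (int i = 0; i < 20; i++)
        {
          double t0 = now_sec ();
          memset (buf, 0x5a, len);
          if (reused)
            /* Read one byte per cache line so the consumer's cache
               misses (or hits) are part of the measured time.  */
            for (size_t j = 0; j < len; j += 64)
              sink += buf[j];
          double dt = now_sec () - t0;
          if (dt < best)
            best = dt;
        }
      printf ("reused=%d best=%.6f s\n", reused, best);
    }
  free (buf);
  return (int) (sink & 1);
}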
On Mon, May 20, 2024 at 2:16 PM Adhemerval Zanella Netto <adhemerval.zanella@linaro.org> wrote: > > > > On 18/05/24 21:43, Noah Goldstein wrote: > > Previously we use `rep stosb` for all medium/large memsets. This is > > notably worse than non-temporal stores for large (above a > > few MBs) memsets. > > See: > > https://docs.google.com/spreadsheets/d/1opzukzvum4n6-RUVHTGddV6RjAEil4P2uMjjQGLbLcU/edit?usp=sharing > > For data using different stategies for large memset on ICX and SKX. > > Btw this datasheet is not accessible without extra steps. > Bh sorry, put the benchmark data in a repo. See data here: https://github.com/goldsteinn/memset-benchmarks/tree/master/results The ICX.html and SKX.html have the data from the above link. > > > > Using non-temporal stores can be up to 3x faster on ICX and 2x faster > > on SKX. Historically, these numbers would not have been so good > > because of the zero-over-zero writeback optimization that `rep stosb` > > is able to do. But, the zero-over-zero writeback optimization has been > > removed as a potential side-channel attack, so there is no longer any > > good reason to only rely on `rep stosb` for large memsets. On the flip > > size, non-temporal writes can avoid data in their RFO requests saving > > memory bandwidth. > > Any chance on how this play in newer AMD chips? I am trying to avoid > another regression like BZ#30994 and BZ#30995 (this one I would like > to fix, however I don't have access to a Zen4 machine to check for > results). I didn't and don't have access to any of the newer AMD chips. The benchmark code here: https://github.com/goldsteinn/memset-benchmarks/ has a README w/ steps if anyone wants to test it. > > > > > All of the other changes to the file are to re-organize the > > code-blocks to maintain "good" alignment given the new code added in > > the `L(stosb_local)` case. > > > > The results from running the GLIBC memset benchmarks on TGL-client for > > N=20 runs: > > > > Geometric Mean across the suite New / Old EXEX256: 0.979 > > Geometric Mean across the suite New / Old EXEX512: 0.979 > > Geometric Mean across the suite New / Old AVX2 : 0.986 > > Geometric Mean across the suite New / Old SSE2 : 0.979 > > > > Most of the cases are essentially unchanged, this is mostly to show > > that adding the non-temporal case didn't add any regressions to the > > other cases. > > > > The results on the memset-large benchmark suite on TGL-client for N=20 > > runs: > > > > Geometric Mean across the suite New / Old EXEX256: 0.926 > > Geometric Mean across the suite New / Old EXEX512: 0.925 > > Geometric Mean across the suite New / Old AVX2 : 0.928 > > Geometric Mean across the suite New / Old SSE2 : 0.924 > > > > So roughly a 7.5% speedup. This is lower than what we see on servers > > (likely because clients typically have faster single-core bandwidth so > > saving bandwidth on RFOs is less impactful), but still advantageous. > > > > Full test-suite passes on x86_64 w/ and w/o multiarch. > > --- > > .../multiarch/memset-vec-unaligned-erms.S | 149 +++++++++++------- > > 1 file changed, 91 insertions(+), 58 deletions(-) > > > > diff --git a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S > > index 97839a2248..637caadb40 100644 > > --- a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S > > +++ b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S > > @@ -21,10 +21,13 @@ > > 2. If size is less than VEC, use integer register stores. > > 3. 
If size is from VEC_SIZE to 2 * VEC_SIZE, use 2 VEC stores. > > 4. If size is from 2 * VEC_SIZE to 4 * VEC_SIZE, use 4 VEC stores. > > - 5. On machines ERMS feature, if size is greater or equal than > > - __x86_rep_stosb_threshold then REP STOSB will be used. > > - 6. If size is more to 4 * VEC_SIZE, align to 4 * VEC_SIZE with > > - 4 VEC stores and store 4 * VEC at a time until done. */ > > + 5. If size is more to 4 * VEC_SIZE, align to 1 * VEC_SIZE with > > + 4 VEC stores and store 4 * VEC at a time until done. > > + 6. On machines ERMS feature, if size is range > > + [__x86_rep_stosb_threshold, __x86_shared_non_temporal_threshold) > > + then REP STOSB will be used. > > + 7. If size >= __x86_shared_non_temporal_threshold, use a > > + non-temporal stores. */ > > > > #include <sysdep.h> > > > > @@ -147,6 +150,41 @@ L(entry_from_wmemset): > > VMOVU %VMM(0), -VEC_SIZE(%rdi,%rdx) > > VMOVU %VMM(0), (%rdi) > > VZEROUPPER_RETURN > > + > > + /* If have AVX512 mask instructions put L(less_vec) close to > > + entry as it doesn't take much space and is likely a hot target. */ > > +#ifdef USE_LESS_VEC_MASK_STORE > > + /* Align to ensure the L(less_vec) logic all fits in 1x cache lines. */ > > + .p2align 6,, 47 > > + .p2align 4 > > +L(less_vec): > > +L(less_vec_from_wmemset): > > + /* Less than 1 VEC. */ > > +# if VEC_SIZE != 16 && VEC_SIZE != 32 && VEC_SIZE != 64 > > +# error Unsupported VEC_SIZE! > > +# endif > > + /* Clear high bits from edi. Only keeping bits relevant to page > > + cross check. Note that we are using rax which is set in > > + MEMSET_VDUP_TO_VEC0_AND_SET_RETURN as ptr from here on out. */ > > + andl $(PAGE_SIZE - 1), %edi > > + /* Check if VEC_SIZE store cross page. Mask stores suffer > > + serious performance degradation when it has to fault suppress. */ > > + cmpl $(PAGE_SIZE - VEC_SIZE), %edi > > + /* This is generally considered a cold target. */ > > + ja L(cross_page) > > +# if VEC_SIZE > 32 > > + movq $-1, %rcx > > + bzhiq %rdx, %rcx, %rcx > > + kmovq %rcx, %k1 > > +# else > > + movl $-1, %ecx > > + bzhil %edx, %ecx, %ecx > > + kmovd %ecx, %k1 > > +# endif > > + vmovdqu8 %VMM(0), (%rax){%k1} > > + VZEROUPPER_RETURN > > +#endif > > + > > #if defined USE_MULTIARCH && IS_IN (libc) > > END (MEMSET_SYMBOL (__memset, unaligned)) > > > > @@ -185,54 +223,6 @@ L(last_2x_vec): > > #endif > > VZEROUPPER_RETURN > > > > - /* If have AVX512 mask instructions put L(less_vec) close to > > - entry as it doesn't take much space and is likely a hot target. > > - */ > > -#ifdef USE_LESS_VEC_MASK_STORE > > - .p2align 4,, 10 > > -L(less_vec): > > -L(less_vec_from_wmemset): > > - /* Less than 1 VEC. */ > > -# if VEC_SIZE != 16 && VEC_SIZE != 32 && VEC_SIZE != 64 > > -# error Unsupported VEC_SIZE! > > -# endif > > - /* Clear high bits from edi. Only keeping bits relevant to page > > - cross check. Note that we are using rax which is set in > > - MEMSET_VDUP_TO_VEC0_AND_SET_RETURN as ptr from here on out. */ > > - andl $(PAGE_SIZE - 1), %edi > > - /* Check if VEC_SIZE store cross page. Mask stores suffer > > - serious performance degradation when it has to fault suppress. > > - */ > > - cmpl $(PAGE_SIZE - VEC_SIZE), %edi > > - /* This is generally considered a cold target. 
*/ > > - ja L(cross_page) > > -# if VEC_SIZE > 32 > > - movq $-1, %rcx > > - bzhiq %rdx, %rcx, %rcx > > - kmovq %rcx, %k1 > > -# else > > - movl $-1, %ecx > > - bzhil %edx, %ecx, %ecx > > - kmovd %ecx, %k1 > > -# endif > > - vmovdqu8 %VMM(0), (%rax){%k1} > > - VZEROUPPER_RETURN > > - > > -# if defined USE_MULTIARCH && IS_IN (libc) > > - /* Include L(stosb_local) here if including L(less_vec) between > > - L(stosb_more_2x_vec) and ENTRY. This is to cache align the > > - L(stosb_more_2x_vec) target. */ > > - .p2align 4,, 10 > > -L(stosb_local): > > - movzbl %sil, %eax > > - mov %RDX_LP, %RCX_LP > > - mov %RDI_LP, %RDX_LP > > - rep stosb > > - mov %RDX_LP, %RAX_LP > > - VZEROUPPER_RETURN > > -# endif > > -#endif > > - > > #if defined USE_MULTIARCH && IS_IN (libc) > > .p2align 4 > > L(stosb_more_2x_vec): > > @@ -318,21 +308,33 @@ L(return_vzeroupper): > > ret > > #endif > > > > - .p2align 4,, 10 > > -#ifndef USE_LESS_VEC_MASK_STORE > > -# if defined USE_MULTIARCH && IS_IN (libc) > > +#ifdef USE_WITH_AVX2 > > + .p2align 4 > > +#else > > + .p2align 4,, 4 > > +#endif > > + > > +#if defined USE_MULTIARCH && IS_IN (libc) > > /* If no USE_LESS_VEC_MASK put L(stosb_local) here. Will be in > > range for 2-byte jump encoding. */ > > L(stosb_local): > > + cmp __x86_shared_non_temporal_threshold(%rip), %RDX_LP > > + jae L(nt_memset) > > movzbl %sil, %eax > > mov %RDX_LP, %RCX_LP > > mov %RDI_LP, %RDX_LP > > rep stosb > > +# if (defined USE_WITH_SSE2) || (defined USE_WITH_AVX512) > > + /* Use xchg to save 1-byte (this helps align targets below). */ > > + xchg %RDX_LP, %RAX_LP > > +# else > > mov %RDX_LP, %RAX_LP > > - VZEROUPPER_RETURN > > # endif > > + VZEROUPPER_RETURN > > +#endif > > +#ifndef USE_LESS_VEC_MASK_STORE > > /* Define L(less_vec) only if not otherwise defined. */ > > - .p2align 4 > > + .p2align 4,, 12 > > L(less_vec): > > /* Broadcast esi to partial register (i.e VEC_SIZE == 32 broadcast to > > xmm). This is only does anything for AVX2. */ > > @@ -423,4 +425,35 @@ L(between_2_3): > > movb %SET_REG8, -1(%LESS_VEC_REG, %rdx) > > #endif > > ret > > -END (MEMSET_SYMBOL (__memset, unaligned_erms)) > > + > > +#if defined USE_MULTIARCH && IS_IN (libc) > > +# ifdef USE_WITH_AVX512 > > + /* Force align so the loop doesn't cross a cache-line. */ > > + .p2align 4 > > +# endif > > + .p2align 4,, 7 > > + /* Memset using non-temporal stores. */ > > +L(nt_memset): > > + VMOVU %VMM(0), (VEC_SIZE * 0)(%rdi) > > + leaq (VEC_SIZE * -4)(%rdi, %rdx), %rdx > > + /* Align DST. */ > > + orq $(VEC_SIZE * 1 - 1), %rdi > > + incq %rdi > > + .p2align 4,, 7 > > +L(nt_loop): > > + VMOVNT %VMM(0), (VEC_SIZE * 0)(%rdi) > > + VMOVNT %VMM(0), (VEC_SIZE * 1)(%rdi) > > + VMOVNT %VMM(0), (VEC_SIZE * 2)(%rdi) > > + VMOVNT %VMM(0), (VEC_SIZE * 3)(%rdi) > > + subq $(VEC_SIZE * -4), %rdi > > + cmpq %rdx, %rdi > > + jb L(nt_loop) > > + sfence > > + VMOVU %VMM(0), (VEC_SIZE * 0)(%rdx) > > + VMOVU %VMM(0), (VEC_SIZE * 1)(%rdx) > > + VMOVU %VMM(0), (VEC_SIZE * 2)(%rdx) > > + VMOVU %VMM(0), (VEC_SIZE * 3)(%rdx) > > + VZEROUPPER_RETURN > > +#endif > > + > > +END(MEMSET_SYMBOL(__memset, unaligned_erms))
On Mon, May 20, 2024 at 2:37 PM Zack Weinberg <zack@owlfolio.org> wrote: > > On Sun, May 19, 2024, at 12:43 AM, Noah Goldstein wrote: > > Previously we use `rep stosb` for all medium/large memsets. This is > > notably worse than non-temporal stores for large (above a few MBs) > > memsets. See: > > https://docs.google.com/spreadsheets/d/1opzukzvum4n6-RUVHTGddV6RjAEil4P2uMjjQGLbLcU/edit?usp=sharing > > For data using different stategies for large memset on ICX and SKX. > > Did your tests read the region that was memset afterward, or did they > just do the memset? It *probably* doesn't matter at "above a few MBs" > but I'm worried that this new implementation might look good on > microbenchmarks but make real programs *slower* because it is no longer > priming the cache for the memset region. Yes, See all results tagged with `reused=1`. Results are still good. Data available at: https://github.com/goldsteinn/memset-benchmarks/blob/master/results/SKX.html and/or https://github.com/goldsteinn/memset-benchmarks/blob/master/results/ICX.html > > zw
On Mon, May 20, 2024 at 5:47 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote: > > On Mon, May 20, 2024 at 2:16 PM Adhemerval Zanella Netto > <adhemerval.zanella@linaro.org> wrote: > > > > > > > > On 18/05/24 21:43, Noah Goldstein wrote: > > > Previously we use `rep stosb` for all medium/large memsets. This is > > > notably worse than non-temporal stores for large (above a > > > few MBs) memsets. > > > See: > > > https://docs.google.com/spreadsheets/d/1opzukzvum4n6-RUVHTGddV6RjAEil4P2uMjjQGLbLcU/edit?usp=sharing > > > For data using different stategies for large memset on ICX and SKX. > > > > Btw this datasheet is not accessible without extra steps. > > > > Bh sorry, put the benchmark data in a repo. See data here: > https://github.com/goldsteinn/memset-benchmarks/tree/master/results > > The ICX.html and SKX.html have the data from the above link. > > > > > > Using non-temporal stores can be up to 3x faster on ICX and 2x faster > > > on SKX. Historically, these numbers would not have been so good > > > because of the zero-over-zero writeback optimization that `rep stosb` > > > is able to do. But, the zero-over-zero writeback optimization has been > > > removed as a potential side-channel attack, so there is no longer any > > > good reason to only rely on `rep stosb` for large memsets. On the flip > > > size, non-temporal writes can avoid data in their RFO requests saving > > > memory bandwidth. > > > > Any chance on how this play in newer AMD chips? I am trying to avoid > > another regression like BZ#30994 and BZ#30995 (this one I would like > > to fix, however I don't have access to a Zen4 machine to check for > > results). > > I didn't and don't have access to any of the newer AMD chips. > > The benchmark code here: https://github.com/goldsteinn/memset-benchmarks/ > has a README w/ steps if anyone wants to test it. What we could do is use a seperate tunable for memset NT threshold and just make it SIZE_MAX for AMD. > > > > > > > > All of the other changes to the file are to re-organize the > > > code-blocks to maintain "good" alignment given the new code added in > > > the `L(stosb_local)` case. > > > > > > The results from running the GLIBC memset benchmarks on TGL-client for > > > N=20 runs: > > > > > > Geometric Mean across the suite New / Old EXEX256: 0.979 > > > Geometric Mean across the suite New / Old EXEX512: 0.979 > > > Geometric Mean across the suite New / Old AVX2 : 0.986 > > > Geometric Mean across the suite New / Old SSE2 : 0.979 > > > > > > Most of the cases are essentially unchanged, this is mostly to show > > > that adding the non-temporal case didn't add any regressions to the > > > other cases. > > > > > > The results on the memset-large benchmark suite on TGL-client for N=20 > > > runs: > > > > > > Geometric Mean across the suite New / Old EXEX256: 0.926 > > > Geometric Mean across the suite New / Old EXEX512: 0.925 > > > Geometric Mean across the suite New / Old AVX2 : 0.928 > > > Geometric Mean across the suite New / Old SSE2 : 0.924 > > > > > > So roughly a 7.5% speedup. This is lower than what we see on servers > > > (likely because clients typically have faster single-core bandwidth so > > > saving bandwidth on RFOs is less impactful), but still advantageous. > > > > > > Full test-suite passes on x86_64 w/ and w/o multiarch. 
> > > --- > > > .../multiarch/memset-vec-unaligned-erms.S | 149 +++++++++++------- > > > 1 file changed, 91 insertions(+), 58 deletions(-) > > > > > > diff --git a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S > > > index 97839a2248..637caadb40 100644 > > > --- a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S > > > +++ b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S > > > @@ -21,10 +21,13 @@ > > > 2. If size is less than VEC, use integer register stores. > > > 3. If size is from VEC_SIZE to 2 * VEC_SIZE, use 2 VEC stores. > > > 4. If size is from 2 * VEC_SIZE to 4 * VEC_SIZE, use 4 VEC stores. > > > - 5. On machines ERMS feature, if size is greater or equal than > > > - __x86_rep_stosb_threshold then REP STOSB will be used. > > > - 6. If size is more to 4 * VEC_SIZE, align to 4 * VEC_SIZE with > > > - 4 VEC stores and store 4 * VEC at a time until done. */ > > > + 5. If size is more to 4 * VEC_SIZE, align to 1 * VEC_SIZE with > > > + 4 VEC stores and store 4 * VEC at a time until done. > > > + 6. On machines ERMS feature, if size is range > > > + [__x86_rep_stosb_threshold, __x86_shared_non_temporal_threshold) > > > + then REP STOSB will be used. > > > + 7. If size >= __x86_shared_non_temporal_threshold, use a > > > + non-temporal stores. */ > > > > > > #include <sysdep.h> > > > > > > @@ -147,6 +150,41 @@ L(entry_from_wmemset): > > > VMOVU %VMM(0), -VEC_SIZE(%rdi,%rdx) > > > VMOVU %VMM(0), (%rdi) > > > VZEROUPPER_RETURN > > > + > > > + /* If have AVX512 mask instructions put L(less_vec) close to > > > + entry as it doesn't take much space and is likely a hot target. */ > > > +#ifdef USE_LESS_VEC_MASK_STORE > > > + /* Align to ensure the L(less_vec) logic all fits in 1x cache lines. */ > > > + .p2align 6,, 47 > > > + .p2align 4 > > > +L(less_vec): > > > +L(less_vec_from_wmemset): > > > + /* Less than 1 VEC. */ > > > +# if VEC_SIZE != 16 && VEC_SIZE != 32 && VEC_SIZE != 64 > > > +# error Unsupported VEC_SIZE! > > > +# endif > > > + /* Clear high bits from edi. Only keeping bits relevant to page > > > + cross check. Note that we are using rax which is set in > > > + MEMSET_VDUP_TO_VEC0_AND_SET_RETURN as ptr from here on out. */ > > > + andl $(PAGE_SIZE - 1), %edi > > > + /* Check if VEC_SIZE store cross page. Mask stores suffer > > > + serious performance degradation when it has to fault suppress. */ > > > + cmpl $(PAGE_SIZE - VEC_SIZE), %edi > > > + /* This is generally considered a cold target. */ > > > + ja L(cross_page) > > > +# if VEC_SIZE > 32 > > > + movq $-1, %rcx > > > + bzhiq %rdx, %rcx, %rcx > > > + kmovq %rcx, %k1 > > > +# else > > > + movl $-1, %ecx > > > + bzhil %edx, %ecx, %ecx > > > + kmovd %ecx, %k1 > > > +# endif > > > + vmovdqu8 %VMM(0), (%rax){%k1} > > > + VZEROUPPER_RETURN > > > +#endif > > > + > > > #if defined USE_MULTIARCH && IS_IN (libc) > > > END (MEMSET_SYMBOL (__memset, unaligned)) > > > > > > @@ -185,54 +223,6 @@ L(last_2x_vec): > > > #endif > > > VZEROUPPER_RETURN > > > > > > - /* If have AVX512 mask instructions put L(less_vec) close to > > > - entry as it doesn't take much space and is likely a hot target. > > > - */ > > > -#ifdef USE_LESS_VEC_MASK_STORE > > > - .p2align 4,, 10 > > > -L(less_vec): > > > -L(less_vec_from_wmemset): > > > - /* Less than 1 VEC. */ > > > -# if VEC_SIZE != 16 && VEC_SIZE != 32 && VEC_SIZE != 64 > > > -# error Unsupported VEC_SIZE! > > > -# endif > > > - /* Clear high bits from edi. Only keeping bits relevant to page > > > - cross check. 
Note that we are using rax which is set in > > > - MEMSET_VDUP_TO_VEC0_AND_SET_RETURN as ptr from here on out. */ > > > - andl $(PAGE_SIZE - 1), %edi > > > - /* Check if VEC_SIZE store cross page. Mask stores suffer > > > - serious performance degradation when it has to fault suppress. > > > - */ > > > - cmpl $(PAGE_SIZE - VEC_SIZE), %edi > > > - /* This is generally considered a cold target. */ > > > - ja L(cross_page) > > > -# if VEC_SIZE > 32 > > > - movq $-1, %rcx > > > - bzhiq %rdx, %rcx, %rcx > > > - kmovq %rcx, %k1 > > > -# else > > > - movl $-1, %ecx > > > - bzhil %edx, %ecx, %ecx > > > - kmovd %ecx, %k1 > > > -# endif > > > - vmovdqu8 %VMM(0), (%rax){%k1} > > > - VZEROUPPER_RETURN > > > - > > > -# if defined USE_MULTIARCH && IS_IN (libc) > > > - /* Include L(stosb_local) here if including L(less_vec) between > > > - L(stosb_more_2x_vec) and ENTRY. This is to cache align the > > > - L(stosb_more_2x_vec) target. */ > > > - .p2align 4,, 10 > > > -L(stosb_local): > > > - movzbl %sil, %eax > > > - mov %RDX_LP, %RCX_LP > > > - mov %RDI_LP, %RDX_LP > > > - rep stosb > > > - mov %RDX_LP, %RAX_LP > > > - VZEROUPPER_RETURN > > > -# endif > > > -#endif > > > - > > > #if defined USE_MULTIARCH && IS_IN (libc) > > > .p2align 4 > > > L(stosb_more_2x_vec): > > > @@ -318,21 +308,33 @@ L(return_vzeroupper): > > > ret > > > #endif > > > > > > - .p2align 4,, 10 > > > -#ifndef USE_LESS_VEC_MASK_STORE > > > -# if defined USE_MULTIARCH && IS_IN (libc) > > > +#ifdef USE_WITH_AVX2 > > > + .p2align 4 > > > +#else > > > + .p2align 4,, 4 > > > +#endif > > > + > > > +#if defined USE_MULTIARCH && IS_IN (libc) > > > /* If no USE_LESS_VEC_MASK put L(stosb_local) here. Will be in > > > range for 2-byte jump encoding. */ > > > L(stosb_local): > > > + cmp __x86_shared_non_temporal_threshold(%rip), %RDX_LP > > > + jae L(nt_memset) > > > movzbl %sil, %eax > > > mov %RDX_LP, %RCX_LP > > > mov %RDI_LP, %RDX_LP > > > rep stosb > > > +# if (defined USE_WITH_SSE2) || (defined USE_WITH_AVX512) > > > + /* Use xchg to save 1-byte (this helps align targets below). */ > > > + xchg %RDX_LP, %RAX_LP > > > +# else > > > mov %RDX_LP, %RAX_LP > > > - VZEROUPPER_RETURN > > > # endif > > > + VZEROUPPER_RETURN > > > +#endif > > > +#ifndef USE_LESS_VEC_MASK_STORE > > > /* Define L(less_vec) only if not otherwise defined. */ > > > - .p2align 4 > > > + .p2align 4,, 12 > > > L(less_vec): > > > /* Broadcast esi to partial register (i.e VEC_SIZE == 32 broadcast to > > > xmm). This is only does anything for AVX2. */ > > > @@ -423,4 +425,35 @@ L(between_2_3): > > > movb %SET_REG8, -1(%LESS_VEC_REG, %rdx) > > > #endif > > > ret > > > -END (MEMSET_SYMBOL (__memset, unaligned_erms)) > > > + > > > +#if defined USE_MULTIARCH && IS_IN (libc) > > > +# ifdef USE_WITH_AVX512 > > > + /* Force align so the loop doesn't cross a cache-line. */ > > > + .p2align 4 > > > +# endif > > > + .p2align 4,, 7 > > > + /* Memset using non-temporal stores. */ > > > +L(nt_memset): > > > + VMOVU %VMM(0), (VEC_SIZE * 0)(%rdi) > > > + leaq (VEC_SIZE * -4)(%rdi, %rdx), %rdx > > > + /* Align DST. 
*/ > > > + orq $(VEC_SIZE * 1 - 1), %rdi > > > + incq %rdi > > > + .p2align 4,, 7 > > > +L(nt_loop): > > > + VMOVNT %VMM(0), (VEC_SIZE * 0)(%rdi) > > > + VMOVNT %VMM(0), (VEC_SIZE * 1)(%rdi) > > > + VMOVNT %VMM(0), (VEC_SIZE * 2)(%rdi) > > > + VMOVNT %VMM(0), (VEC_SIZE * 3)(%rdi) > > > + subq $(VEC_SIZE * -4), %rdi > > > + cmpq %rdx, %rdi > > > + jb L(nt_loop) > > > + sfence > > > + VMOVU %VMM(0), (VEC_SIZE * 0)(%rdx) > > > + VMOVU %VMM(0), (VEC_SIZE * 1)(%rdx) > > > + VMOVU %VMM(0), (VEC_SIZE * 2)(%rdx) > > > + VMOVU %VMM(0), (VEC_SIZE * 3)(%rdx) > > > + VZEROUPPER_RETURN > > > +#endif > > > + > > > +END(MEMSET_SYMBOL(__memset, unaligned_erms))
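The "separate tunable" idea above could look roughly like the following sketch (hypothetical names throughout: __x86_memset_non_temporal_threshold and the init helper are not existing glibc symbols): keep __x86_shared_non_temporal_threshold for memcpy, add a memset-specific threshold, and have CPU-features setup push it to SIZE_MAX on vendors where the non-temporal memset path should stay disabled.

/* Sketch of a memset-specific NT threshold (hypothetical names, not
   glibc code).  SIZE_MAX effectively disables the non-temporal path.  */
#include <stddef.h>
#include <stdint.h>

static size_t __x86_memset_non_temporal_threshold;   /* hypothetical */

enum x86_vendor { X86_VENDOR_INTEL, X86_VENDOR_AMD, X86_VENDOR_OTHER };

static void
init_memset_nt_threshold (enum x86_vendor vendor, size_t shared_nt_threshold)
{
  if (vendor == X86_VENDOR_INTEL)
    /* Reuse the shared heuristic by default.  */
    __x86_memset_non_temporal_threshold = shared_nt_threshold;
  else
    /* A real length never reaches SIZE_MAX, so L(nt_memset) is never
       taken on these vendors.  */
    __x86_memset_non_temporal_threshold = SIZE_MAX;
}

/* The check in L(stosb_local) would then compare against this variable
   instead of __x86_shared_non_temporal_threshold.  */
static int
use_nt_memset (size_t len)
{
  return len >= __x86_memset_non_temporal_threshold;
}

Such a threshold could also be exposed for overriding at startup, alongside the existing glibc.cpu.x86_non_temporal_threshold tunable.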
On Mon, May 20, 2024 at 5:50 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote: > > On Mon, May 20, 2024 at 5:47 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote: > > > > On Mon, May 20, 2024 at 2:16 PM Adhemerval Zanella Netto > > <adhemerval.zanella@linaro.org> wrote: > > > > > > > > > > > > On 18/05/24 21:43, Noah Goldstein wrote: > > > > Previously we use `rep stosb` for all medium/large memsets. This is > > > > notably worse than non-temporal stores for large (above a > > > > few MBs) memsets. > > > > See: > > > > https://docs.google.com/spreadsheets/d/1opzukzvum4n6-RUVHTGddV6RjAEil4P2uMjjQGLbLcU/edit?usp=sharing > > > > For data using different stategies for large memset on ICX and SKX. > > > > > > Btw this datasheet is not accessible without extra steps. > > > > > > > Bh sorry, put the benchmark data in a repo. See data here: > > https://github.com/goldsteinn/memset-benchmarks/tree/master/results > > > > The ICX.html and SKX.html have the data from the above link. > > > > > > > > Using non-temporal stores can be up to 3x faster on ICX and 2x faster > > > > on SKX. Historically, these numbers would not have been so good > > > > because of the zero-over-zero writeback optimization that `rep stosb` > > > > is able to do. But, the zero-over-zero writeback optimization has been > > > > removed as a potential side-channel attack, so there is no longer any > > > > good reason to only rely on `rep stosb` for large memsets. On the flip > > > > size, non-temporal writes can avoid data in their RFO requests saving > > > > memory bandwidth. > > > > > > Any chance on how this play in newer AMD chips? I am trying to avoid > > > another regression like BZ#30994 and BZ#30995 (this one I would like > > > to fix, however I don't have access to a Zen4 machine to check for > > > results). > > > > I didn't and don't have access to any of the newer AMD chips. > > > > The benchmark code here: https://github.com/goldsteinn/memset-benchmarks/ > > has a README w/ steps if anyone wants to test it. > > What we could do is use a seperate tunable for memset NT threshold and just > make it SIZE_MAX for AMD. Adhemerval, this patch okay to get in like that, or do you want to keep in Intel specific? > > > > > > > > > > > All of the other changes to the file are to re-organize the > > > > code-blocks to maintain "good" alignment given the new code added in > > > > the `L(stosb_local)` case. > > > > > > > > The results from running the GLIBC memset benchmarks on TGL-client for > > > > N=20 runs: > > > > > > > > Geometric Mean across the suite New / Old EXEX256: 0.979 > > > > Geometric Mean across the suite New / Old EXEX512: 0.979 > > > > Geometric Mean across the suite New / Old AVX2 : 0.986 > > > > Geometric Mean across the suite New / Old SSE2 : 0.979 > > > > > > > > Most of the cases are essentially unchanged, this is mostly to show > > > > that adding the non-temporal case didn't add any regressions to the > > > > other cases. > > > > > > > > The results on the memset-large benchmark suite on TGL-client for N=20 > > > > runs: > > > > > > > > Geometric Mean across the suite New / Old EXEX256: 0.926 > > > > Geometric Mean across the suite New / Old EXEX512: 0.925 > > > > Geometric Mean across the suite New / Old AVX2 : 0.928 > > > > Geometric Mean across the suite New / Old SSE2 : 0.924 > > > > > > > > So roughly a 7.5% speedup. 
This is lower than what we see on servers > > > > (likely because clients typically have faster single-core bandwidth so > > > > saving bandwidth on RFOs is less impactful), but still advantageous. > > > > > > > > Full test-suite passes on x86_64 w/ and w/o multiarch. > > > > --- > > > > .../multiarch/memset-vec-unaligned-erms.S | 149 +++++++++++------- > > > > 1 file changed, 91 insertions(+), 58 deletions(-) > > > > > > > > diff --git a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S > > > > index 97839a2248..637caadb40 100644 > > > > --- a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S > > > > +++ b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S > > > > @@ -21,10 +21,13 @@ > > > > 2. If size is less than VEC, use integer register stores. > > > > 3. If size is from VEC_SIZE to 2 * VEC_SIZE, use 2 VEC stores. > > > > 4. If size is from 2 * VEC_SIZE to 4 * VEC_SIZE, use 4 VEC stores. > > > > - 5. On machines ERMS feature, if size is greater or equal than > > > > - __x86_rep_stosb_threshold then REP STOSB will be used. > > > > - 6. If size is more to 4 * VEC_SIZE, align to 4 * VEC_SIZE with > > > > - 4 VEC stores and store 4 * VEC at a time until done. */ > > > > + 5. If size is more to 4 * VEC_SIZE, align to 1 * VEC_SIZE with > > > > + 4 VEC stores and store 4 * VEC at a time until done. > > > > + 6. On machines ERMS feature, if size is range > > > > + [__x86_rep_stosb_threshold, __x86_shared_non_temporal_threshold) > > > > + then REP STOSB will be used. > > > > + 7. If size >= __x86_shared_non_temporal_threshold, use a > > > > + non-temporal stores. */ > > > > > > > > #include <sysdep.h> > > > > > > > > @@ -147,6 +150,41 @@ L(entry_from_wmemset): > > > > VMOVU %VMM(0), -VEC_SIZE(%rdi,%rdx) > > > > VMOVU %VMM(0), (%rdi) > > > > VZEROUPPER_RETURN > > > > + > > > > + /* If have AVX512 mask instructions put L(less_vec) close to > > > > + entry as it doesn't take much space and is likely a hot target. */ > > > > +#ifdef USE_LESS_VEC_MASK_STORE > > > > + /* Align to ensure the L(less_vec) logic all fits in 1x cache lines. */ > > > > + .p2align 6,, 47 > > > > + .p2align 4 > > > > +L(less_vec): > > > > +L(less_vec_from_wmemset): > > > > + /* Less than 1 VEC. */ > > > > +# if VEC_SIZE != 16 && VEC_SIZE != 32 && VEC_SIZE != 64 > > > > +# error Unsupported VEC_SIZE! > > > > +# endif > > > > + /* Clear high bits from edi. Only keeping bits relevant to page > > > > + cross check. Note that we are using rax which is set in > > > > + MEMSET_VDUP_TO_VEC0_AND_SET_RETURN as ptr from here on out. */ > > > > + andl $(PAGE_SIZE - 1), %edi > > > > + /* Check if VEC_SIZE store cross page. Mask stores suffer > > > > + serious performance degradation when it has to fault suppress. */ > > > > + cmpl $(PAGE_SIZE - VEC_SIZE), %edi > > > > + /* This is generally considered a cold target. 
*/ > > > > + ja L(cross_page) > > > > +# if VEC_SIZE > 32 > > > > + movq $-1, %rcx > > > > + bzhiq %rdx, %rcx, %rcx > > > > + kmovq %rcx, %k1 > > > > +# else > > > > + movl $-1, %ecx > > > > + bzhil %edx, %ecx, %ecx > > > > + kmovd %ecx, %k1 > > > > +# endif > > > > + vmovdqu8 %VMM(0), (%rax){%k1} > > > > + VZEROUPPER_RETURN > > > > +#endif > > > > + > > > > #if defined USE_MULTIARCH && IS_IN (libc) > > > > END (MEMSET_SYMBOL (__memset, unaligned)) > > > > > > > > @@ -185,54 +223,6 @@ L(last_2x_vec): > > > > #endif > > > > VZEROUPPER_RETURN > > > > > > > > - /* If have AVX512 mask instructions put L(less_vec) close to > > > > - entry as it doesn't take much space and is likely a hot target. > > > > - */ > > > > -#ifdef USE_LESS_VEC_MASK_STORE > > > > - .p2align 4,, 10 > > > > -L(less_vec): > > > > -L(less_vec_from_wmemset): > > > > - /* Less than 1 VEC. */ > > > > -# if VEC_SIZE != 16 && VEC_SIZE != 32 && VEC_SIZE != 64 > > > > -# error Unsupported VEC_SIZE! > > > > -# endif > > > > - /* Clear high bits from edi. Only keeping bits relevant to page > > > > - cross check. Note that we are using rax which is set in > > > > - MEMSET_VDUP_TO_VEC0_AND_SET_RETURN as ptr from here on out. */ > > > > - andl $(PAGE_SIZE - 1), %edi > > > > - /* Check if VEC_SIZE store cross page. Mask stores suffer > > > > - serious performance degradation when it has to fault suppress. > > > > - */ > > > > - cmpl $(PAGE_SIZE - VEC_SIZE), %edi > > > > - /* This is generally considered a cold target. */ > > > > - ja L(cross_page) > > > > -# if VEC_SIZE > 32 > > > > - movq $-1, %rcx > > > > - bzhiq %rdx, %rcx, %rcx > > > > - kmovq %rcx, %k1 > > > > -# else > > > > - movl $-1, %ecx > > > > - bzhil %edx, %ecx, %ecx > > > > - kmovd %ecx, %k1 > > > > -# endif > > > > - vmovdqu8 %VMM(0), (%rax){%k1} > > > > - VZEROUPPER_RETURN > > > > - > > > > -# if defined USE_MULTIARCH && IS_IN (libc) > > > > - /* Include L(stosb_local) here if including L(less_vec) between > > > > - L(stosb_more_2x_vec) and ENTRY. This is to cache align the > > > > - L(stosb_more_2x_vec) target. */ > > > > - .p2align 4,, 10 > > > > -L(stosb_local): > > > > - movzbl %sil, %eax > > > > - mov %RDX_LP, %RCX_LP > > > > - mov %RDI_LP, %RDX_LP > > > > - rep stosb > > > > - mov %RDX_LP, %RAX_LP > > > > - VZEROUPPER_RETURN > > > > -# endif > > > > -#endif > > > > - > > > > #if defined USE_MULTIARCH && IS_IN (libc) > > > > .p2align 4 > > > > L(stosb_more_2x_vec): > > > > @@ -318,21 +308,33 @@ L(return_vzeroupper): > > > > ret > > > > #endif > > > > > > > > - .p2align 4,, 10 > > > > -#ifndef USE_LESS_VEC_MASK_STORE > > > > -# if defined USE_MULTIARCH && IS_IN (libc) > > > > +#ifdef USE_WITH_AVX2 > > > > + .p2align 4 > > > > +#else > > > > + .p2align 4,, 4 > > > > +#endif > > > > + > > > > +#if defined USE_MULTIARCH && IS_IN (libc) > > > > /* If no USE_LESS_VEC_MASK put L(stosb_local) here. Will be in > > > > range for 2-byte jump encoding. */ > > > > L(stosb_local): > > > > + cmp __x86_shared_non_temporal_threshold(%rip), %RDX_LP > > > > + jae L(nt_memset) > > > > movzbl %sil, %eax > > > > mov %RDX_LP, %RCX_LP > > > > mov %RDI_LP, %RDX_LP > > > > rep stosb > > > > +# if (defined USE_WITH_SSE2) || (defined USE_WITH_AVX512) > > > > + /* Use xchg to save 1-byte (this helps align targets below). */ > > > > + xchg %RDX_LP, %RAX_LP > > > > +# else > > > > mov %RDX_LP, %RAX_LP > > > > - VZEROUPPER_RETURN > > > > # endif > > > > + VZEROUPPER_RETURN > > > > +#endif > > > > +#ifndef USE_LESS_VEC_MASK_STORE > > > > /* Define L(less_vec) only if not otherwise defined. 
*/ > > > > - .p2align 4 > > > > + .p2align 4,, 12 > > > > L(less_vec): > > > > /* Broadcast esi to partial register (i.e VEC_SIZE == 32 broadcast to > > > > xmm). This is only does anything for AVX2. */ > > > > @@ -423,4 +425,35 @@ L(between_2_3): > > > > movb %SET_REG8, -1(%LESS_VEC_REG, %rdx) > > > > #endif > > > > ret > > > > -END (MEMSET_SYMBOL (__memset, unaligned_erms)) > > > > + > > > > +#if defined USE_MULTIARCH && IS_IN (libc) > > > > +# ifdef USE_WITH_AVX512 > > > > + /* Force align so the loop doesn't cross a cache-line. */ > > > > + .p2align 4 > > > > +# endif > > > > + .p2align 4,, 7 > > > > + /* Memset using non-temporal stores. */ > > > > +L(nt_memset): > > > > + VMOVU %VMM(0), (VEC_SIZE * 0)(%rdi) > > > > + leaq (VEC_SIZE * -4)(%rdi, %rdx), %rdx > > > > + /* Align DST. */ > > > > + orq $(VEC_SIZE * 1 - 1), %rdi > > > > + incq %rdi > > > > + .p2align 4,, 7 > > > > +L(nt_loop): > > > > + VMOVNT %VMM(0), (VEC_SIZE * 0)(%rdi) > > > > + VMOVNT %VMM(0), (VEC_SIZE * 1)(%rdi) > > > > + VMOVNT %VMM(0), (VEC_SIZE * 2)(%rdi) > > > > + VMOVNT %VMM(0), (VEC_SIZE * 3)(%rdi) > > > > + subq $(VEC_SIZE * -4), %rdi > > > > + cmpq %rdx, %rdi > > > > + jb L(nt_loop) > > > > + sfence > > > > + VMOVU %VMM(0), (VEC_SIZE * 0)(%rdx) > > > > + VMOVU %VMM(0), (VEC_SIZE * 1)(%rdx) > > > > + VMOVU %VMM(0), (VEC_SIZE * 2)(%rdx) > > > > + VMOVU %VMM(0), (VEC_SIZE * 3)(%rdx) > > > > + VZEROUPPER_RETURN > > > > +#endif > > > > + > > > > +END(MEMSET_SYMBOL(__memset, unaligned_erms))
On Thu, May 23, 2024 at 9:46 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote: > > On Mon, May 20, 2024 at 5:50 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote: > > > > On Mon, May 20, 2024 at 5:47 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote: > > > > > > On Mon, May 20, 2024 at 2:16 PM Adhemerval Zanella Netto > > > <adhemerval.zanella@linaro.org> wrote: > > > > > > > > > > > > > > > > On 18/05/24 21:43, Noah Goldstein wrote: > > > > > Previously we use `rep stosb` for all medium/large memsets. This is > > > > > notably worse than non-temporal stores for large (above a > > > > > few MBs) memsets. > > > > > See: > > > > > https://docs.google.com/spreadsheets/d/1opzukzvum4n6-RUVHTGddV6RjAEil4P2uMjjQGLbLcU/edit?usp=sharing > > > > > For data using different stategies for large memset on ICX and SKX. > > > > > > > > Btw this datasheet is not accessible without extra steps. > > > > > > > > > > Bh sorry, put the benchmark data in a repo. See data here: > > > https://github.com/goldsteinn/memset-benchmarks/tree/master/results > > > > > > The ICX.html and SKX.html have the data from the above link. > > > > > > > > > > Using non-temporal stores can be up to 3x faster on ICX and 2x faster > > > > > on SKX. Historically, these numbers would not have been so good > > > > > because of the zero-over-zero writeback optimization that `rep stosb` > > > > > is able to do. But, the zero-over-zero writeback optimization has been > > > > > removed as a potential side-channel attack, so there is no longer any > > > > > good reason to only rely on `rep stosb` for large memsets. On the flip > > > > > size, non-temporal writes can avoid data in their RFO requests saving > > > > > memory bandwidth. > > > > > > > > Any chance on how this play in newer AMD chips? I am trying to avoid > > > > another regression like BZ#30994 and BZ#30995 (this one I would like > > > > to fix, however I don't have access to a Zen4 machine to check for > > > > results). > > > > > > I didn't and don't have access to any of the newer AMD chips. > > > > > > The benchmark code here: https://github.com/goldsteinn/memset-benchmarks/ > > > has a README w/ steps if anyone wants to test it. > > > > What we could do is use a seperate tunable for memset NT threshold and just > > make it SIZE_MAX for AMD. > > Adhemerval, this patch okay to get in like that, or do you want to keep in Intel > specific? Please make it Intel specific. Thanks. > > > > > > > > > > > > > > All of the other changes to the file are to re-organize the > > > > > code-blocks to maintain "good" alignment given the new code added in > > > > > the `L(stosb_local)` case. > > > > > > > > > > The results from running the GLIBC memset benchmarks on TGL-client for > > > > > N=20 runs: > > > > > > > > > > Geometric Mean across the suite New / Old EXEX256: 0.979 > > > > > Geometric Mean across the suite New / Old EXEX512: 0.979 > > > > > Geometric Mean across the suite New / Old AVX2 : 0.986 > > > > > Geometric Mean across the suite New / Old SSE2 : 0.979 > > > > > > > > > > Most of the cases are essentially unchanged, this is mostly to show > > > > > that adding the non-temporal case didn't add any regressions to the > > > > > other cases. 
On 23/05/24 13:49, H.J. Lu wrote:
> On Thu, May 23, 2024 at 9:46 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>>
>> On Mon, May 20, 2024 at 5:50 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>>>
>>> On Mon, May 20, 2024 at 5:47 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
>>>>
>>>> On Mon, May 20, 2024 at 2:16 PM Adhemerval Zanella Netto
>>>> <adhemerval.zanella@linaro.org> wrote:
>>>>>
>>>>>
>>>>> On 18/05/24 21:43, Noah Goldstein wrote:
>>>>>> Previously we use `rep stosb` for all medium/large memsets. This is
>>>>>> notably worse than non-temporal stores for large (above a
>>>>>> few MBs) memsets.
>>>>>> See:
>>>>>> https://docs.google.com/spreadsheets/d/1opzukzvum4n6-RUVHTGddV6RjAEil4P2uMjjQGLbLcU/edit?usp=sharing
>>>>>> For data using different stategies for large memset on ICX and SKX.
>>>>>
>>>>> Btw this datasheet is not accessible without extra steps.
>>>>>
>>>>
>>>> Bh sorry, put the benchmark data in a repo. See data here:
>>>> https://github.com/goldsteinn/memset-benchmarks/tree/master/results
>>>>
>>>> The ICX.html and SKX.html have the data from the above link.
>>>>>>
>>>>>> Using non-temporal stores can be up to 3x faster on ICX and 2x faster
>>>>>> on SKX. Historically, these numbers would not have been so good
>>>>>> because of the zero-over-zero writeback optimization that `rep stosb`
>>>>>> is able to do. But, the zero-over-zero writeback optimization has been
>>>>>> removed as a potential side-channel attack, so there is no longer any
>>>>>> good reason to only rely on `rep stosb` for large memsets. On the flip
>>>>>> size, non-temporal writes can avoid data in their RFO requests saving
>>>>>> memory bandwidth.
>>>>>
>>>>> Any chance on how this play in newer AMD chips? I am trying to avoid
>>>>> another regression like BZ#30994 and BZ#30995 (this one I would like
>>>>> to fix, however I don't have access to a Zen4 machine to check for
>>>>> results).
>>>>
>>>> I didn't and don't have access to any of the newer AMD chips.
>>>>
>>>> The benchmark code here: https://github.com/goldsteinn/memset-benchmarks/
>>>> has a README w/ steps if anyone wants to test it.
>>>
>>> What we could do is use a seperate tunable for memset NT threshold and just
>>> make it SIZE_MAX for AMD.
>>
>> Adhemerval, this patch okay to get in like that, or do you want to keep in Intel
>> specific?
>
> Please make it Intel specific.
>
> Thanks.

Yes, please make it Intel specific for now. I will try to take a look at least
with a Zen3 core that I have access
On Thu, May 23, 2024 at 12:18 PM Adhemerval Zanella Netto
<adhemerval.zanella@linaro.org> wrote:
>
>
>
> On 23/05/24 13:49, H.J. Lu wrote:
> > On Thu, May 23, 2024 at 9:46 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> >>
> >> On Mon, May 20, 2024 at 5:50 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> >>>
> >>> On Mon, May 20, 2024 at 5:47 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> >>>>
> >>>> On Mon, May 20, 2024 at 2:16 PM Adhemerval Zanella Netto
> >>>> <adhemerval.zanella@linaro.org> wrote:
> >>>>>
> >>>>>
> >>>>> On 18/05/24 21:43, Noah Goldstein wrote:
> >>>>>> Previously we use `rep stosb` for all medium/large memsets. This is
> >>>>>> notably worse than non-temporal stores for large (above a
> >>>>>> few MBs) memsets.
> >>>>>> See:
> >>>>>> https://docs.google.com/spreadsheets/d/1opzukzvum4n6-RUVHTGddV6RjAEil4P2uMjjQGLbLcU/edit?usp=sharing
> >>>>>> For data using different stategies for large memset on ICX and SKX.
> >>>>>
> >>>>> Btw this datasheet is not accessible without extra steps.
> >>>>>
> >>>>
> >>>> Bh sorry, put the benchmark data in a repo. See data here:
> >>>> https://github.com/goldsteinn/memset-benchmarks/tree/master/results
> >>>>
> >>>> The ICX.html and SKX.html have the data from the above link.
> >>>>>>
> >>>>>> Using non-temporal stores can be up to 3x faster on ICX and 2x faster
> >>>>>> on SKX. Historically, these numbers would not have been so good
> >>>>>> because of the zero-over-zero writeback optimization that `rep stosb`
> >>>>>> is able to do. But, the zero-over-zero writeback optimization has been
> >>>>>> removed as a potential side-channel attack, so there is no longer any
> >>>>>> good reason to only rely on `rep stosb` for large memsets. On the flip
> >>>>>> size, non-temporal writes can avoid data in their RFO requests saving
> >>>>>> memory bandwidth.
> >>>>>
> >>>>> Any chance on how this play in newer AMD chips? I am trying to avoid
> >>>>> another regression like BZ#30994 and BZ#30995 (this one I would like
> >>>>> to fix, however I don't have access to a Zen4 machine to check for
> >>>>> results).
> >>>>
> >>>> I didn't and don't have access to any of the newer AMD chips.
> >>>>
> >>>> The benchmark code here: https://github.com/goldsteinn/memset-benchmarks/
> >>>> has a README w/ steps if anyone wants to test it.
> >>>
> >>> What we could do is use a seperate tunable for memset NT threshold and just
> >>> make it SIZE_MAX for AMD.
> >>
> >> Adhemerval, this patch okay to get in like that, or do you want to keep in Intel
> >> specific?
> >
> > Please make it Intel specific.
> >
> > Thanks.
>
> Yes, please make it Intel specific for now. I will try to take a look at least
> with a Zen3 core that I have access

Done, split into two patches, the tunable one is the second in the series.
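The follow-up tunable patch is not included on this page, so the following is only a sketch of the idea under discussion; the struct, field, and function names are assumptions for illustration, not the real cpu_features layout. The point is simply a memset-specific NT threshold that defaults to "never" on CPUs that have not been measured:

#include <stddef.h>
#include <stdint.h>

/* Sketch only: names are made up for illustration.  */
struct cpu_features_sketch
{
  int is_intel;                           /* stand-in for the vendor check  */
  size_t shared_non_temporal_threshold;   /* existing shared-cache bound    */
  size_t memset_non_temporal_threshold;   /* new, memset-specific bound     */
};

static void
init_memset_nt_threshold (struct cpu_features_sketch *cf)
{
  if (cf->is_intel)
    /* Reuse the shared NT threshold on Intel, where the benchmarks above
       show a clear win for non-temporal stores.  */
    cf->memset_non_temporal_threshold = cf->shared_non_temporal_threshold;
  else
    /* SIZE_MAX means the NT path is never taken, i.e. keep rep stosb for
       large sizes until other vendors are benchmarked.  */
    cf->memset_non_temporal_threshold = SIZE_MAX;
}

With a default of SIZE_MAX the non-Intel behavior is unchanged, and a distro or user could still opt in through the tunable once AMD numbers are available.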
On Thu, May 23, 2024 at 12:18 PM Adhemerval Zanella Netto
<adhemerval.zanella@linaro.org> wrote:
>
>
>
> On 23/05/24 13:49, H.J. Lu wrote:
> > On Thu, May 23, 2024 at 9:46 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> >>
> >> On Mon, May 20, 2024 at 5:50 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> >>>
> >>> On Mon, May 20, 2024 at 5:47 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> >>>>
> >>>> On Mon, May 20, 2024 at 2:16 PM Adhemerval Zanella Netto
> >>>> <adhemerval.zanella@linaro.org> wrote:
> >>>>>
> >>>>>
> >>>>> On 18/05/24 21:43, Noah Goldstein wrote:
> >>>>>> Previously we use `rep stosb` for all medium/large memsets. This is
> >>>>>> notably worse than non-temporal stores for large (above a
> >>>>>> few MBs) memsets.
> >>>>>> See:
> >>>>>> https://docs.google.com/spreadsheets/d/1opzukzvum4n6-RUVHTGddV6RjAEil4P2uMjjQGLbLcU/edit?usp=sharing
> >>>>>> For data using different stategies for large memset on ICX and SKX.
> >>>>>
> >>>>> Btw this datasheet is not accessible without extra steps.
> >>>>>
> >>>>
> >>>> Bh sorry, put the benchmark data in a repo. See data here:
> >>>> https://github.com/goldsteinn/memset-benchmarks/tree/master/results
> >>>>
> >>>> The ICX.html and SKX.html have the data from the above link.
> >>>>>>
> >>>>>> Using non-temporal stores can be up to 3x faster on ICX and 2x faster
> >>>>>> on SKX. Historically, these numbers would not have been so good
> >>>>>> because of the zero-over-zero writeback optimization that `rep stosb`
> >>>>>> is able to do. But, the zero-over-zero writeback optimization has been
> >>>>>> removed as a potential side-channel attack, so there is no longer any
> >>>>>> good reason to only rely on `rep stosb` for large memsets. On the flip
> >>>>>> size, non-temporal writes can avoid data in their RFO requests saving
> >>>>>> memory bandwidth.
> >>>>>
> >>>>> Any chance on how this play in newer AMD chips? I am trying to avoid
> >>>>> another regression like BZ#30994 and BZ#30995 (this one I would like
> >>>>> to fix, however I don't have access to a Zen4 machine to check for
> >>>>> results).
> >>>>
> >>>> I didn't and don't have access to any of the newer AMD chips.
> >>>>
> >>>> The benchmark code here: https://github.com/goldsteinn/memset-benchmarks/
> >>>> has a README w/ steps if anyone wants to test it.
> >>>
> >>> What we could do is use a seperate tunable for memset NT threshold and just
> >>> make it SIZE_MAX for AMD.
> >>
> >> Adhemerval, this patch okay to get in like that, or do you want to keep in Intel
> >> specific?
> >
> > Please make it Intel specific.
> >
> > Thanks.
>
> Yes, please make it Intel specific for now. I will try to take a look at least
> with a Zen3 core that I have access

Adding Joe into the thread. He has access to some AMD machines and will run
the benchmarks for us.
diff --git a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
index 97839a2248..637caadb40 100644
--- a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
+++ b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
@@ -21,10 +21,13 @@
    2. If size is less than VEC, use integer register stores.
    3. If size is from VEC_SIZE to 2 * VEC_SIZE, use 2 VEC stores.
    4. If size is from 2 * VEC_SIZE to 4 * VEC_SIZE, use 4 VEC stores.
-   5. On machines ERMS feature, if size is greater or equal than
-      __x86_rep_stosb_threshold then REP STOSB will be used.
-   6. If size is more to 4 * VEC_SIZE, align to 4 * VEC_SIZE with
-      4 VEC stores and store 4 * VEC at a time until done.  */
+   5. If size is more to 4 * VEC_SIZE, align to 1 * VEC_SIZE with
+      4 VEC stores and store 4 * VEC at a time until done.
+   6. On machines ERMS feature, if size is range
+      [__x86_rep_stosb_threshold, __x86_shared_non_temporal_threshold)
+      then REP STOSB will be used.
+   7. If size >= __x86_shared_non_temporal_threshold, use a
+      non-temporal stores.  */

 #include <sysdep.h>

@@ -147,6 +150,41 @@ L(entry_from_wmemset):
 	VMOVU	%VMM(0), -VEC_SIZE(%rdi,%rdx)
 	VMOVU	%VMM(0), (%rdi)
 	VZEROUPPER_RETURN
+
+	/* If have AVX512 mask instructions put L(less_vec) close to
+	   entry as it doesn't take much space and is likely a hot target.  */
+#ifdef USE_LESS_VEC_MASK_STORE
+	/* Align to ensure the L(less_vec) logic all fits in 1x cache lines.  */
+	.p2align 6,, 47
+	.p2align 4
+L(less_vec):
+L(less_vec_from_wmemset):
+	/* Less than 1 VEC.  */
+# if VEC_SIZE != 16 && VEC_SIZE != 32 && VEC_SIZE != 64
+#  error Unsupported VEC_SIZE!
+# endif
+	/* Clear high bits from edi. Only keeping bits relevant to page
+	   cross check. Note that we are using rax which is set in
+	   MEMSET_VDUP_TO_VEC0_AND_SET_RETURN as ptr from here on out.  */
+	andl	$(PAGE_SIZE - 1), %edi
+	/* Check if VEC_SIZE store cross page. Mask stores suffer
+	   serious performance degradation when it has to fault suppress.  */
+	cmpl	$(PAGE_SIZE - VEC_SIZE), %edi
+	/* This is generally considered a cold target.  */
+	ja	L(cross_page)
+# if VEC_SIZE > 32
+	movq	$-1, %rcx
+	bzhiq	%rdx, %rcx, %rcx
+	kmovq	%rcx, %k1
+# else
+	movl	$-1, %ecx
+	bzhil	%edx, %ecx, %ecx
+	kmovd	%ecx, %k1
+# endif
+	vmovdqu8 %VMM(0), (%rax){%k1}
+	VZEROUPPER_RETURN
+#endif
+
 #if defined USE_MULTIARCH && IS_IN (libc)
 END (MEMSET_SYMBOL (__memset, unaligned))

@@ -185,54 +223,6 @@ L(last_2x_vec):
 #endif
 	VZEROUPPER_RETURN

-	/* If have AVX512 mask instructions put L(less_vec) close to
-	   entry as it doesn't take much space and is likely a hot target.
-	 */
-#ifdef USE_LESS_VEC_MASK_STORE
-	.p2align 4,, 10
-L(less_vec):
-L(less_vec_from_wmemset):
-	/* Less than 1 VEC.  */
-# if VEC_SIZE != 16 && VEC_SIZE != 32 && VEC_SIZE != 64
-#  error Unsupported VEC_SIZE!
-# endif
-	/* Clear high bits from edi. Only keeping bits relevant to page
-	   cross check. Note that we are using rax which is set in
-	   MEMSET_VDUP_TO_VEC0_AND_SET_RETURN as ptr from here on out.  */
-	andl	$(PAGE_SIZE - 1), %edi
-	/* Check if VEC_SIZE store cross page. Mask stores suffer
-	   serious performance degradation when it has to fault suppress.
-	 */
-	cmpl	$(PAGE_SIZE - VEC_SIZE), %edi
-	/* This is generally considered a cold target.  */
-	ja	L(cross_page)
-# if VEC_SIZE > 32
-	movq	$-1, %rcx
-	bzhiq	%rdx, %rcx, %rcx
-	kmovq	%rcx, %k1
-# else
-	movl	$-1, %ecx
-	bzhil	%edx, %ecx, %ecx
-	kmovd	%ecx, %k1
-# endif
-	vmovdqu8 %VMM(0), (%rax){%k1}
-	VZEROUPPER_RETURN
-
-# if defined USE_MULTIARCH && IS_IN (libc)
-	/* Include L(stosb_local) here if including L(less_vec) between
-	   L(stosb_more_2x_vec) and ENTRY. This is to cache align the
-	   L(stosb_more_2x_vec) target.  */
-	.p2align 4,, 10
-L(stosb_local):
-	movzbl	%sil, %eax
-	mov	%RDX_LP, %RCX_LP
-	mov	%RDI_LP, %RDX_LP
-	rep	stosb
-	mov	%RDX_LP, %RAX_LP
-	VZEROUPPER_RETURN
-# endif
-#endif
-
 #if defined USE_MULTIARCH && IS_IN (libc)
 	.p2align 4
 L(stosb_more_2x_vec):
@@ -318,21 +308,33 @@ L(return_vzeroupper):
 	ret
 #endif

-	.p2align 4,, 10
-#ifndef USE_LESS_VEC_MASK_STORE
-# if defined USE_MULTIARCH && IS_IN (libc)
+#ifdef USE_WITH_AVX2
+	.p2align 4
+#else
+	.p2align 4,, 4
+#endif
+
+#if defined USE_MULTIARCH && IS_IN (libc)
 	/* If no USE_LESS_VEC_MASK put L(stosb_local) here. Will be in
 	   range for 2-byte jump encoding.  */
 L(stosb_local):
+	cmp	__x86_shared_non_temporal_threshold(%rip), %RDX_LP
+	jae	L(nt_memset)
 	movzbl	%sil, %eax
 	mov	%RDX_LP, %RCX_LP
 	mov	%RDI_LP, %RDX_LP
 	rep	stosb
+# if (defined USE_WITH_SSE2) || (defined USE_WITH_AVX512)
+	/* Use xchg to save 1-byte (this helps align targets below).  */
+	xchg	%RDX_LP, %RAX_LP
+# else
 	mov	%RDX_LP, %RAX_LP
-	VZEROUPPER_RETURN
 # endif
+	VZEROUPPER_RETURN
+#endif
+#ifndef USE_LESS_VEC_MASK_STORE
 	/* Define L(less_vec) only if not otherwise defined.  */
-	.p2align 4
+	.p2align 4,, 12
 L(less_vec):
 	/* Broadcast esi to partial register (i.e VEC_SIZE == 32 broadcast to
 	   xmm). This is only does anything for AVX2.  */
@@ -423,4 +425,35 @@ L(between_2_3):
 	movb	%SET_REG8, -1(%LESS_VEC_REG, %rdx)
 #endif
 	ret
-END (MEMSET_SYMBOL (__memset, unaligned_erms))
+
+#if defined USE_MULTIARCH && IS_IN (libc)
+# ifdef USE_WITH_AVX512
+	/* Force align so the loop doesn't cross a cache-line.  */
+	.p2align 4
+# endif
+	.p2align 4,, 7
+	/* Memset using non-temporal stores.  */
+L(nt_memset):
+	VMOVU	%VMM(0), (VEC_SIZE * 0)(%rdi)
+	leaq	(VEC_SIZE * -4)(%rdi, %rdx), %rdx
+	/* Align DST.  */
+	orq	$(VEC_SIZE * 1 - 1), %rdi
+	incq	%rdi
+	.p2align 4,, 7
+L(nt_loop):
+	VMOVNT	%VMM(0), (VEC_SIZE * 0)(%rdi)
+	VMOVNT	%VMM(0), (VEC_SIZE * 1)(%rdi)
+	VMOVNT	%VMM(0), (VEC_SIZE * 2)(%rdi)
+	VMOVNT	%VMM(0), (VEC_SIZE * 3)(%rdi)
+	subq	$(VEC_SIZE * -4), %rdi
+	cmpq	%rdx, %rdi
+	jb	L(nt_loop)
+	sfence
+	VMOVU	%VMM(0), (VEC_SIZE * 0)(%rdx)
+	VMOVU	%VMM(0), (VEC_SIZE * 1)(%rdx)
+	VMOVU	%VMM(0), (VEC_SIZE * 2)(%rdx)
+	VMOVU	%VMM(0), (VEC_SIZE * 3)(%rdx)
+	VZEROUPPER_RETURN
+#endif
+
+END(MEMSET_SYMBOL(__memset, unaligned_erms))
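For reference, here is a C/AVX2 intrinsics rendering of the L(nt_memset) path added above. It is a sketch, assuming VEC_SIZE == 32 and n >= 4 * VEC_SIZE (which holds whenever the assembly reaches this label, since the size is at least the NT threshold); the real implementation is the VMOVNT loop in the patch.

#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

#define VEC_SIZE 32

void
nt_memset_avx2 (void *dst, int c, size_t n)
{
  __m256i v = _mm256_set1_epi8 ((char) c);
  char *d = dst;
  char *end = d + n - 4 * VEC_SIZE;        /* leaq (VEC_SIZE * -4)(%rdi, %rdx), %rdx */

  _mm256_storeu_si256 ((__m256i *) d, v);  /* head: unaligned VMOVU */
  /* orq $(VEC_SIZE - 1), %rdi; incq %rdi: round up to the next VEC boundary.  */
  d = (char *) (((uintptr_t) d | (VEC_SIZE - 1)) + 1);

  while (d < end)                          /* L(nt_loop): 4 NT stores per pass */
    {
      _mm256_stream_si256 ((__m256i *) (d + 0 * VEC_SIZE), v);
      _mm256_stream_si256 ((__m256i *) (d + 1 * VEC_SIZE), v);
      _mm256_stream_si256 ((__m256i *) (d + 2 * VEC_SIZE), v);
      _mm256_stream_si256 ((__m256i *) (d + 3 * VEC_SIZE), v);
      d += 4 * VEC_SIZE;
    }
  _mm_sfence ();                           /* order the weakly-ordered NT stores */

  /* Tail: four unaligned stores ending exactly at dst + n.  */
  _mm256_storeu_si256 ((__m256i *) (end + 0 * VEC_SIZE), v);
  _mm256_storeu_si256 ((__m256i *) (end + 1 * VEC_SIZE), v);
  _mm256_storeu_si256 ((__m256i *) (end + 2 * VEC_SIZE), v);
  _mm256_storeu_si256 ((__m256i *) (end + 3 * VEC_SIZE), v);
}

The sfence mirrors the one in the patch: non-temporal stores are weakly ordered, so they must be fenced before memset returns to keep the usual memory-ordering guarantees.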