Message ID | 20210520184404.2901975-1-goldstein.w.n@gmail.com |
---|---|
State | New |
Headers | show |
Series | [v1] x86: Improve memset-vec-unaligned-erms.S | expand |
On Thu, May 20, 2021 at 2:45 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote: > > No bug. This commit makes a few small improvements to > memset-vec-unaligned-erms.S. The changes are 1) only aligning to 64 > instead of 128. Either alignment will perform equally well in a loop > and 128 just increases the odds of having to do an extra iteration > which can be significant overhead for small values. 2) Align some > targets and the loop. 3) Remove an ALU from the alignment process. 4) > Reorder the last 4x VEC so that they are stored after the loop. 5) > Move the condition for leq 8x VEC to before the alignment > process. test-memset and test-wmemset are both passing. > > Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com> > --- > Tests where run on the following CPUs: > > Skylake: https://ark.intel.com/content/www/us/en/ark/products/149091/intel-core-i7-8565u-processor-8m-cache-up-to-4-60-ghz.html > > Icelake: https://ark.intel.com/content/www/us/en/ark/products/196597/intel-core-i7-1065g7-processor-8m-cache-up-to-3-90-ghz.html > > Tigerlake: https://ark.intel.com/content/www/us/en/ark/products/208921/intel-core-i7-1165g7-processor-12m-cache-up-to-4-70-ghz-with-ipu.html > > All times are the geometric mean of N=50. The unit of time is > seconds. > > "Cur" refers to the current implementation > "New" refers to this patches implementation > > Performance data attached in memset-data.pdf > > Some notes on the numbers: > > I only included numbers for the proper VEC_SIZE for the corresponding > cpu. > > skl -> avx2 > icl -> evex > tgl -> evex > > The changes only affect sizes > 2 * VEC_SIZE. The performance > differences in the size <= 2 * VEC_SIZE come from changes in alignment > after linking (i.e ENTRY aligns to 16, but performance can be affected > by alignment % 64 or alignment % 4096) and generally affects > throughput only, not latency (i.e with an lfence to the benchmark loop > the deviations go away). Generally I think they can be ignored (both > positive and negative affects). > > The interesting part of the data is in the medium size range [128, > 1024] where the new implementation has a reasonable speedup. This is > especially pronounced when the more conservative alignment saves a > full loop iteration. The only significant exception is > skylake-avx2-erms case for size = 416, alignment = 416 where the > current implementation is meaningfully faster. I am unsure of the root > cause for this. The skylake-avx2 case only performs a bit worse in > this case which makes me think part of it is code alignment related, > though comparative to the speedup in other size/alignment > configurations it is still a trough. Despite this, I still think the > numbers are overall an improvement. > > As well due to aligning the loop (and possibly slightly more efficient > DSB behavior with the replacement of addq 4 * VEC_SIZE in the loop > with subq -4 * VEC_SIZE) in the non-erms cases there is often a slight > improvement to the main loop for large sizes. > > .../multiarch/memset-vec-unaligned-erms.S | 50 +++++++++++-------- > 1 file changed, 28 insertions(+), 22 deletions(-) > > diff --git a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S > index 08cfa49bd1..ff196844a0 100644 > --- a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S > +++ b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S > @@ -173,17 +173,22 @@ ENTRY (MEMSET_SYMBOL (__memset, unaligned_erms)) > VMOVU %VEC(0), (%rdi) > VZEROUPPER_RETURN > > + .p2align 4 > L(stosb_more_2x_vec): > cmp __x86_rep_stosb_threshold(%rip), %RDX_LP > ja L(stosb) > +#else > + .p2align 4 > #endif > L(more_2x_vec): > - cmpq $(VEC_SIZE * 4), %rdx > - ja L(loop_start) > + /* Stores to first 2x VEC before cmp as any path forward will > + require it. */ > VMOVU %VEC(0), (%rdi) > VMOVU %VEC(0), VEC_SIZE(%rdi) > - VMOVU %VEC(0), -VEC_SIZE(%rdi,%rdx) > + cmpq $(VEC_SIZE * 4), %rdx > + ja L(loop_start) > VMOVU %VEC(0), -(VEC_SIZE * 2)(%rdi,%rdx) > + VMOVU %VEC(0), -VEC_SIZE(%rdi,%rdx) > L(return): > #if VEC_SIZE > 16 > ZERO_UPPER_VEC_REGISTERS_RETURN > @@ -192,28 +197,29 @@ L(return): > #endif > > L(loop_start): > - leaq (VEC_SIZE * 4)(%rdi), %rcx > - VMOVU %VEC(0), (%rdi) > - andq $-(VEC_SIZE * 4), %rcx > - VMOVU %VEC(0), -VEC_SIZE(%rdi,%rdx) > - VMOVU %VEC(0), VEC_SIZE(%rdi) > - VMOVU %VEC(0), -(VEC_SIZE * 2)(%rdi,%rdx) > VMOVU %VEC(0), (VEC_SIZE * 2)(%rdi) > - VMOVU %VEC(0), -(VEC_SIZE * 3)(%rdi,%rdx) > VMOVU %VEC(0), (VEC_SIZE * 3)(%rdi) > - VMOVU %VEC(0), -(VEC_SIZE * 4)(%rdi,%rdx) > - addq %rdi, %rdx > - andq $-(VEC_SIZE * 4), %rdx > - cmpq %rdx, %rcx > - je L(return) > + cmpq $(VEC_SIZE * 8), %rdx > + jbe L(loop_end) > + andq $-(VEC_SIZE * 2), %rdi > + subq $-(VEC_SIZE * 4), %rdi > + leaq -(VEC_SIZE * 4)(%rax, %rdx), %rcx > + .p2align 4 > L(loop): > - VMOVA %VEC(0), (%rcx) > - VMOVA %VEC(0), VEC_SIZE(%rcx) > - VMOVA %VEC(0), (VEC_SIZE * 2)(%rcx) > - VMOVA %VEC(0), (VEC_SIZE * 3)(%rcx) > - addq $(VEC_SIZE * 4), %rcx > - cmpq %rcx, %rdx > - jne L(loop) > + VMOVA %VEC(0), (%rdi) > + VMOVA %VEC(0), VEC_SIZE(%rdi) > + VMOVA %VEC(0), (VEC_SIZE * 2)(%rdi) > + VMOVA %VEC(0), (VEC_SIZE * 3)(%rdi) > + subq $-(VEC_SIZE * 4), %rdi > + cmpq %rcx, %rdi > + jb L(loop) > +L(loop_end): > + /* NB: rax is set as ptr in MEMSET_VDUP_TO_VEC0_AND_SET_RETURN. > + rdx as length is also unchanged. */ > + VMOVU %VEC(0), -(VEC_SIZE * 4)(%rax, %rdx) > + VMOVU %VEC(0), -(VEC_SIZE * 3)(%rax, %rdx) > + VMOVU %VEC(0), -(VEC_SIZE * 2)(%rax, %rdx) > + VMOVU %VEC(0), -VEC_SIZE(%rax, %rdx) > VZEROUPPER_SHORT_RETURN > > .p2align 4 > -- > 2.25.1 >
On Thu, May 20, 2021 at 11:45 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote: > > No bug. This commit makes a few small improvements to > memset-vec-unaligned-erms.S. The changes are 1) only aligning to 64 > instead of 128. Either alignment will perform equally well in a loop > and 128 just increases the odds of having to do an extra iteration > which can be significant overhead for small values. 2) Align some > targets and the loop. 3) Remove an ALU from the alignment process. 4) > Reorder the last 4x VEC so that they are stored after the loop. 5) > Move the condition for leq 8x VEC to before the alignment > process. test-memset and test-wmemset are both passing. > > Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com> > --- > Tests where run on the following CPUs: > > Skylake: https://ark.intel.com/content/www/us/en/ark/products/149091/intel-core-i7-8565u-processor-8m-cache-up-to-4-60-ghz.html > > Icelake: https://ark.intel.com/content/www/us/en/ark/products/196597/intel-core-i7-1065g7-processor-8m-cache-up-to-3-90-ghz.html > > Tigerlake: https://ark.intel.com/content/www/us/en/ark/products/208921/intel-core-i7-1165g7-processor-12m-cache-up-to-4-70-ghz-with-ipu.html > > All times are the geometric mean of N=50. The unit of time is > seconds. > > "Cur" refers to the current implementation > "New" refers to this patches implementation > > Performance data attached in memset-data.pdf > > Some notes on the numbers: > > I only included numbers for the proper VEC_SIZE for the corresponding > cpu. > > skl -> avx2 > icl -> evex > tgl -> evex > > The changes only affect sizes > 2 * VEC_SIZE. The performance > differences in the size <= 2 * VEC_SIZE come from changes in alignment > after linking (i.e ENTRY aligns to 16, but performance can be affected > by alignment % 64 or alignment % 4096) and generally affects > throughput only, not latency (i.e with an lfence to the benchmark loop > the deviations go away). Generally I think they can be ignored (both > positive and negative affects). > > The interesting part of the data is in the medium size range [128, > 1024] where the new implementation has a reasonable speedup. This is > especially pronounced when the more conservative alignment saves a > full loop iteration. The only significant exception is > skylake-avx2-erms case for size = 416, alignment = 416 where the > current implementation is meaningfully faster. I am unsure of the root > cause for this. The skylake-avx2 case only performs a bit worse in > this case which makes me think part of it is code alignment related, > though comparative to the speedup in other size/alignment > configurations it is still a trough. Despite this, I still think the > numbers are overall an improvement. > > As well due to aligning the loop (and possibly slightly more efficient > DSB behavior with the replacement of addq 4 * VEC_SIZE in the loop > with subq -4 * VEC_SIZE) in the non-erms cases there is often a slight > improvement to the main loop for large sizes. > > .../multiarch/memset-vec-unaligned-erms.S | 50 +++++++++++-------- > 1 file changed, 28 insertions(+), 22 deletions(-) > > diff --git a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S > index 08cfa49bd1..ff196844a0 100644 > --- a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S > +++ b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S > @@ -173,17 +173,22 @@ ENTRY (MEMSET_SYMBOL (__memset, unaligned_erms)) > VMOVU %VEC(0), (%rdi) > VZEROUPPER_RETURN > > + .p2align 4 > L(stosb_more_2x_vec): > cmp __x86_rep_stosb_threshold(%rip), %RDX_LP > ja L(stosb) > +#else > + .p2align 4 > #endif > L(more_2x_vec): > - cmpq $(VEC_SIZE * 4), %rdx > - ja L(loop_start) > + /* Stores to first 2x VEC before cmp as any path forward will > + require it. */ > VMOVU %VEC(0), (%rdi) > VMOVU %VEC(0), VEC_SIZE(%rdi) > - VMOVU %VEC(0), -VEC_SIZE(%rdi,%rdx) > + cmpq $(VEC_SIZE * 4), %rdx > + ja L(loop_start) > VMOVU %VEC(0), -(VEC_SIZE * 2)(%rdi,%rdx) > + VMOVU %VEC(0), -VEC_SIZE(%rdi,%rdx) > L(return): > #if VEC_SIZE > 16 > ZERO_UPPER_VEC_REGISTERS_RETURN > @@ -192,28 +197,29 @@ L(return): > #endif > > L(loop_start): > - leaq (VEC_SIZE * 4)(%rdi), %rcx > - VMOVU %VEC(0), (%rdi) > - andq $-(VEC_SIZE * 4), %rcx > - VMOVU %VEC(0), -VEC_SIZE(%rdi,%rdx) > - VMOVU %VEC(0), VEC_SIZE(%rdi) > - VMOVU %VEC(0), -(VEC_SIZE * 2)(%rdi,%rdx) > VMOVU %VEC(0), (VEC_SIZE * 2)(%rdi) > - VMOVU %VEC(0), -(VEC_SIZE * 3)(%rdi,%rdx) > VMOVU %VEC(0), (VEC_SIZE * 3)(%rdi) > - VMOVU %VEC(0), -(VEC_SIZE * 4)(%rdi,%rdx) > - addq %rdi, %rdx > - andq $-(VEC_SIZE * 4), %rdx > - cmpq %rdx, %rcx > - je L(return) > + cmpq $(VEC_SIZE * 8), %rdx > + jbe L(loop_end) > + andq $-(VEC_SIZE * 2), %rdi > + subq $-(VEC_SIZE * 4), %rdi > + leaq -(VEC_SIZE * 4)(%rax, %rdx), %rcx > + .p2align 4 > L(loop): > - VMOVA %VEC(0), (%rcx) > - VMOVA %VEC(0), VEC_SIZE(%rcx) > - VMOVA %VEC(0), (VEC_SIZE * 2)(%rcx) > - VMOVA %VEC(0), (VEC_SIZE * 3)(%rcx) > - addq $(VEC_SIZE * 4), %rcx > - cmpq %rcx, %rdx > - jne L(loop) > + VMOVA %VEC(0), (%rdi) > + VMOVA %VEC(0), VEC_SIZE(%rdi) > + VMOVA %VEC(0), (VEC_SIZE * 2)(%rdi) > + VMOVA %VEC(0), (VEC_SIZE * 3)(%rdi) > + subq $-(VEC_SIZE * 4), %rdi > + cmpq %rcx, %rdi > + jb L(loop) > +L(loop_end): > + /* NB: rax is set as ptr in MEMSET_VDUP_TO_VEC0_AND_SET_RETURN. > + rdx as length is also unchanged. */ > + VMOVU %VEC(0), -(VEC_SIZE * 4)(%rax, %rdx) > + VMOVU %VEC(0), -(VEC_SIZE * 3)(%rax, %rdx) > + VMOVU %VEC(0), -(VEC_SIZE * 2)(%rax, %rdx) > + VMOVU %VEC(0), -VEC_SIZE(%rax, %rdx) > VZEROUPPER_SHORT_RETURN > > .p2align 4 > -- > 2.25.1 > LGTM. Reviewed-by: H.J. Lu <hjl.tools@gmail.com> Thanks.
On Thu, May 20, 2021 at 3:40 PM H.J. Lu <hjl.tools@gmail.com> wrote: > > On Thu, May 20, 2021 at 11:45 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote: > > > > No bug. This commit makes a few small improvements to > > memset-vec-unaligned-erms.S. The changes are 1) only aligning to 64 > > instead of 128. Either alignment will perform equally well in a loop > > and 128 just increases the odds of having to do an extra iteration > > which can be significant overhead for small values. 2) Align some > > targets and the loop. 3) Remove an ALU from the alignment process. 4) > > Reorder the last 4x VEC so that they are stored after the loop. 5) > > Move the condition for leq 8x VEC to before the alignment > > process. test-memset and test-wmemset are both passing. > > > > Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com> > > --- > > Tests where run on the following CPUs: > > > > Skylake: https://ark.intel.com/content/www/us/en/ark/products/149091/intel-core-i7-8565u-processor-8m-cache-up-to-4-60-ghz.html > > > > Icelake: https://ark.intel.com/content/www/us/en/ark/products/196597/intel-core-i7-1065g7-processor-8m-cache-up-to-3-90-ghz.html > > > > Tigerlake: https://ark.intel.com/content/www/us/en/ark/products/208921/intel-core-i7-1165g7-processor-12m-cache-up-to-4-70-ghz-with-ipu.html > > > > All times are the geometric mean of N=50. The unit of time is > > seconds. > > > > "Cur" refers to the current implementation > > "New" refers to this patches implementation > > > > Performance data attached in memset-data.pdf > > > > Some notes on the numbers: > > > > I only included numbers for the proper VEC_SIZE for the corresponding > > cpu. > > > > skl -> avx2 > > icl -> evex > > tgl -> evex > > > > The changes only affect sizes > 2 * VEC_SIZE. The performance > > differences in the size <= 2 * VEC_SIZE come from changes in alignment > > after linking (i.e ENTRY aligns to 16, but performance can be affected > > by alignment % 64 or alignment % 4096) and generally affects > > throughput only, not latency (i.e with an lfence to the benchmark loop > > the deviations go away). Generally I think they can be ignored (both > > positive and negative affects). > > > > The interesting part of the data is in the medium size range [128, > > 1024] where the new implementation has a reasonable speedup. This is > > especially pronounced when the more conservative alignment saves a > > full loop iteration. The only significant exception is > > skylake-avx2-erms case for size = 416, alignment = 416 where the > > current implementation is meaningfully faster. I am unsure of the root > > cause for this. The skylake-avx2 case only performs a bit worse in > > this case which makes me think part of it is code alignment related, > > though comparative to the speedup in other size/alignment > > configurations it is still a trough. Despite this, I still think the > > numbers are overall an improvement. > > > > As well due to aligning the loop (and possibly slightly more efficient > > DSB behavior with the replacement of addq 4 * VEC_SIZE in the loop > > with subq -4 * VEC_SIZE) in the non-erms cases there is often a slight > > improvement to the main loop for large sizes. > > > > .../multiarch/memset-vec-unaligned-erms.S | 50 +++++++++++-------- > > 1 file changed, 28 insertions(+), 22 deletions(-) > > > > diff --git a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S > > index 08cfa49bd1..ff196844a0 100644 > > --- a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S > > +++ b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S > > @@ -173,17 +173,22 @@ ENTRY (MEMSET_SYMBOL (__memset, unaligned_erms)) > > VMOVU %VEC(0), (%rdi) > > VZEROUPPER_RETURN > > > > + .p2align 4 > > L(stosb_more_2x_vec): > > cmp __x86_rep_stosb_threshold(%rip), %RDX_LP > > ja L(stosb) > > +#else > > + .p2align 4 > > #endif > > L(more_2x_vec): > > - cmpq $(VEC_SIZE * 4), %rdx > > - ja L(loop_start) > > + /* Stores to first 2x VEC before cmp as any path forward will > > + require it. */ > > VMOVU %VEC(0), (%rdi) > > VMOVU %VEC(0), VEC_SIZE(%rdi) > > - VMOVU %VEC(0), -VEC_SIZE(%rdi,%rdx) > > + cmpq $(VEC_SIZE * 4), %rdx > > + ja L(loop_start) > > VMOVU %VEC(0), -(VEC_SIZE * 2)(%rdi,%rdx) > > + VMOVU %VEC(0), -VEC_SIZE(%rdi,%rdx) > > L(return): > > #if VEC_SIZE > 16 > > ZERO_UPPER_VEC_REGISTERS_RETURN > > @@ -192,28 +197,29 @@ L(return): > > #endif > > > > L(loop_start): > > - leaq (VEC_SIZE * 4)(%rdi), %rcx > > - VMOVU %VEC(0), (%rdi) > > - andq $-(VEC_SIZE * 4), %rcx > > - VMOVU %VEC(0), -VEC_SIZE(%rdi,%rdx) > > - VMOVU %VEC(0), VEC_SIZE(%rdi) > > - VMOVU %VEC(0), -(VEC_SIZE * 2)(%rdi,%rdx) > > VMOVU %VEC(0), (VEC_SIZE * 2)(%rdi) > > - VMOVU %VEC(0), -(VEC_SIZE * 3)(%rdi,%rdx) > > VMOVU %VEC(0), (VEC_SIZE * 3)(%rdi) > > - VMOVU %VEC(0), -(VEC_SIZE * 4)(%rdi,%rdx) > > - addq %rdi, %rdx > > - andq $-(VEC_SIZE * 4), %rdx > > - cmpq %rdx, %rcx > > - je L(return) > > + cmpq $(VEC_SIZE * 8), %rdx > > + jbe L(loop_end) > > + andq $-(VEC_SIZE * 2), %rdi > > + subq $-(VEC_SIZE * 4), %rdi > > + leaq -(VEC_SIZE * 4)(%rax, %rdx), %rcx > > + .p2align 4 > > L(loop): > > - VMOVA %VEC(0), (%rcx) > > - VMOVA %VEC(0), VEC_SIZE(%rcx) > > - VMOVA %VEC(0), (VEC_SIZE * 2)(%rcx) > > - VMOVA %VEC(0), (VEC_SIZE * 3)(%rcx) > > - addq $(VEC_SIZE * 4), %rcx > > - cmpq %rcx, %rdx > > - jne L(loop) > > + VMOVA %VEC(0), (%rdi) > > + VMOVA %VEC(0), VEC_SIZE(%rdi) > > + VMOVA %VEC(0), (VEC_SIZE * 2)(%rdi) > > + VMOVA %VEC(0), (VEC_SIZE * 3)(%rdi) > > + subq $-(VEC_SIZE * 4), %rdi > > + cmpq %rcx, %rdi > > + jb L(loop) > > +L(loop_end): > > + /* NB: rax is set as ptr in MEMSET_VDUP_TO_VEC0_AND_SET_RETURN. > > + rdx as length is also unchanged. */ > > + VMOVU %VEC(0), -(VEC_SIZE * 4)(%rax, %rdx) > > + VMOVU %VEC(0), -(VEC_SIZE * 3)(%rax, %rdx) > > + VMOVU %VEC(0), -(VEC_SIZE * 2)(%rax, %rdx) > > + VMOVU %VEC(0), -VEC_SIZE(%rax, %rdx) > > VZEROUPPER_SHORT_RETURN > > > > .p2align 4 > > -- > > 2.25.1 > > > > LGTM. Awesome! For future patches do you prefer performance numbers like this or raw text? Or some other alternative? > > Reviewed-by: H.J. Lu <hjl.tools@gmail.com> > > Thanks. > > -- > H.J.
On Thu, May 20, 2021 at 1:03 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote: > > On Thu, May 20, 2021 at 3:40 PM H.J. Lu <hjl.tools@gmail.com> wrote: > > > > On Thu, May 20, 2021 at 11:45 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote: > > > > > > No bug. This commit makes a few small improvements to > > > memset-vec-unaligned-erms.S. The changes are 1) only aligning to 64 > > > instead of 128. Either alignment will perform equally well in a loop > > > and 128 just increases the odds of having to do an extra iteration > > > which can be significant overhead for small values. 2) Align some > > > targets and the loop. 3) Remove an ALU from the alignment process. 4) > > > Reorder the last 4x VEC so that they are stored after the loop. 5) > > > Move the condition for leq 8x VEC to before the alignment > > > process. test-memset and test-wmemset are both passing. > > > > > > Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com> > > > --- > > > Tests where run on the following CPUs: > > > > > > Skylake: https://ark.intel.com/content/www/us/en/ark/products/149091/intel-core-i7-8565u-processor-8m-cache-up-to-4-60-ghz.html > > > > > > Icelake: https://ark.intel.com/content/www/us/en/ark/products/196597/intel-core-i7-1065g7-processor-8m-cache-up-to-3-90-ghz.html > > > > > > Tigerlake: https://ark.intel.com/content/www/us/en/ark/products/208921/intel-core-i7-1165g7-processor-12m-cache-up-to-4-70-ghz-with-ipu.html > > > > > > All times are the geometric mean of N=50. The unit of time is > > > seconds. > > > > > > "Cur" refers to the current implementation > > > "New" refers to this patches implementation > > > > > > Performance data attached in memset-data.pdf > > > > > > Some notes on the numbers: > > > > > > I only included numbers for the proper VEC_SIZE for the corresponding > > > cpu. > > > > > > skl -> avx2 > > > icl -> evex > > > tgl -> evex > > > > > > The changes only affect sizes > 2 * VEC_SIZE. The performance > > > differences in the size <= 2 * VEC_SIZE come from changes in alignment > > > after linking (i.e ENTRY aligns to 16, but performance can be affected > > > by alignment % 64 or alignment % 4096) and generally affects > > > throughput only, not latency (i.e with an lfence to the benchmark loop > > > the deviations go away). Generally I think they can be ignored (both > > > positive and negative affects). > > > > > > The interesting part of the data is in the medium size range [128, > > > 1024] where the new implementation has a reasonable speedup. This is > > > especially pronounced when the more conservative alignment saves a > > > full loop iteration. The only significant exception is > > > skylake-avx2-erms case for size = 416, alignment = 416 where the > > > current implementation is meaningfully faster. I am unsure of the root > > > cause for this. The skylake-avx2 case only performs a bit worse in > > > this case which makes me think part of it is code alignment related, > > > though comparative to the speedup in other size/alignment > > > configurations it is still a trough. Despite this, I still think the > > > numbers are overall an improvement. > > > > > > As well due to aligning the loop (and possibly slightly more efficient > > > DSB behavior with the replacement of addq 4 * VEC_SIZE in the loop > > > with subq -4 * VEC_SIZE) in the non-erms cases there is often a slight > > > improvement to the main loop for large sizes. > > > > > > .../multiarch/memset-vec-unaligned-erms.S | 50 +++++++++++-------- > > > 1 file changed, 28 insertions(+), 22 deletions(-) > > > > > > diff --git a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S > > > index 08cfa49bd1..ff196844a0 100644 > > > --- a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S > > > +++ b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S > > > @@ -173,17 +173,22 @@ ENTRY (MEMSET_SYMBOL (__memset, unaligned_erms)) > > > VMOVU %VEC(0), (%rdi) > > > VZEROUPPER_RETURN > > > > > > + .p2align 4 > > > L(stosb_more_2x_vec): > > > cmp __x86_rep_stosb_threshold(%rip), %RDX_LP > > > ja L(stosb) > > > +#else > > > + .p2align 4 > > > #endif > > > L(more_2x_vec): > > > - cmpq $(VEC_SIZE * 4), %rdx > > > - ja L(loop_start) > > > + /* Stores to first 2x VEC before cmp as any path forward will > > > + require it. */ > > > VMOVU %VEC(0), (%rdi) > > > VMOVU %VEC(0), VEC_SIZE(%rdi) > > > - VMOVU %VEC(0), -VEC_SIZE(%rdi,%rdx) > > > + cmpq $(VEC_SIZE * 4), %rdx > > > + ja L(loop_start) > > > VMOVU %VEC(0), -(VEC_SIZE * 2)(%rdi,%rdx) > > > + VMOVU %VEC(0), -VEC_SIZE(%rdi,%rdx) > > > L(return): > > > #if VEC_SIZE > 16 > > > ZERO_UPPER_VEC_REGISTERS_RETURN > > > @@ -192,28 +197,29 @@ L(return): > > > #endif > > > > > > L(loop_start): > > > - leaq (VEC_SIZE * 4)(%rdi), %rcx > > > - VMOVU %VEC(0), (%rdi) > > > - andq $-(VEC_SIZE * 4), %rcx > > > - VMOVU %VEC(0), -VEC_SIZE(%rdi,%rdx) > > > - VMOVU %VEC(0), VEC_SIZE(%rdi) > > > - VMOVU %VEC(0), -(VEC_SIZE * 2)(%rdi,%rdx) > > > VMOVU %VEC(0), (VEC_SIZE * 2)(%rdi) > > > - VMOVU %VEC(0), -(VEC_SIZE * 3)(%rdi,%rdx) > > > VMOVU %VEC(0), (VEC_SIZE * 3)(%rdi) > > > - VMOVU %VEC(0), -(VEC_SIZE * 4)(%rdi,%rdx) > > > - addq %rdi, %rdx > > > - andq $-(VEC_SIZE * 4), %rdx > > > - cmpq %rdx, %rcx > > > - je L(return) > > > + cmpq $(VEC_SIZE * 8), %rdx > > > + jbe L(loop_end) > > > + andq $-(VEC_SIZE * 2), %rdi > > > + subq $-(VEC_SIZE * 4), %rdi > > > + leaq -(VEC_SIZE * 4)(%rax, %rdx), %rcx > > > + .p2align 4 > > > L(loop): > > > - VMOVA %VEC(0), (%rcx) > > > - VMOVA %VEC(0), VEC_SIZE(%rcx) > > > - VMOVA %VEC(0), (VEC_SIZE * 2)(%rcx) > > > - VMOVA %VEC(0), (VEC_SIZE * 3)(%rcx) > > > - addq $(VEC_SIZE * 4), %rcx > > > - cmpq %rcx, %rdx > > > - jne L(loop) > > > + VMOVA %VEC(0), (%rdi) > > > + VMOVA %VEC(0), VEC_SIZE(%rdi) > > > + VMOVA %VEC(0), (VEC_SIZE * 2)(%rdi) > > > + VMOVA %VEC(0), (VEC_SIZE * 3)(%rdi) > > > + subq $-(VEC_SIZE * 4), %rdi > > > + cmpq %rcx, %rdi > > > + jb L(loop) > > > +L(loop_end): > > > + /* NB: rax is set as ptr in MEMSET_VDUP_TO_VEC0_AND_SET_RETURN. > > > + rdx as length is also unchanged. */ > > > + VMOVU %VEC(0), -(VEC_SIZE * 4)(%rax, %rdx) > > > + VMOVU %VEC(0), -(VEC_SIZE * 3)(%rax, %rdx) > > > + VMOVU %VEC(0), -(VEC_SIZE * 2)(%rax, %rdx) > > > + VMOVU %VEC(0), -VEC_SIZE(%rax, %rdx) > > > VZEROUPPER_SHORT_RETURN > > > > > > .p2align 4 > > > -- > > > 2.25.1 > > > > > > > LGTM. > > Awesome! > > For future patches do you prefer performance numbers like this or > raw text? Or some other alternative? The current data format is fine. Thanks. > > > > Reviewed-by: H.J. Lu <hjl.tools@gmail.com> > > > > Thanks. > > > > -- > > H.J.
On Thu, May 20, 2021 at 4:45 PM H.J. Lu <hjl.tools@gmail.com> wrote: > On Thu, May 20, 2021 at 1:03 PM Noah Goldstein <goldstein.w.n@gmail.com> > wrote: > > > > On Thu, May 20, 2021 at 3:40 PM H.J. Lu <hjl.tools@gmail.com> wrote: > > > > > > On Thu, May 20, 2021 at 11:45 AM Noah Goldstein < > goldstein.w.n@gmail.com> wrote: > > > > > > > > No bug. This commit makes a few small improvements to > > > > memset-vec-unaligned-erms.S. The changes are 1) only aligning to 64 > > > > instead of 128. Either alignment will perform equally well in a loop > > > > and 128 just increases the odds of having to do an extra iteration > > > > which can be significant overhead for small values. 2) Align some > > > > targets and the loop. 3) Remove an ALU from the alignment process. 4) > > > > Reorder the last 4x VEC so that they are stored after the loop. 5) > > > > Move the condition for leq 8x VEC to before the alignment > > > > process. test-memset and test-wmemset are both passing. > > > > > > > > Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com> > > > > --- > > > > Tests where run on the following CPUs: > > > > > > > > Skylake: > https://ark.intel.com/content/www/us/en/ark/products/149091/intel-core-i7-8565u-processor-8m-cache-up-to-4-60-ghz.html > > > > > > > > Icelake: > https://ark.intel.com/content/www/us/en/ark/products/196597/intel-core-i7-1065g7-processor-8m-cache-up-to-3-90-ghz.html > > > > > > > > Tigerlake: > https://ark.intel.com/content/www/us/en/ark/products/208921/intel-core-i7-1165g7-processor-12m-cache-up-to-4-70-ghz-with-ipu.html > > > > > > > > All times are the geometric mean of N=50. The unit of time is > > > > seconds. > > > > > > > > "Cur" refers to the current implementation > > > > "New" refers to this patches implementation > > > > > > > > Performance data attached in memset-data.pdf > > > > > > > > Some notes on the numbers: > > > > > > > > I only included numbers for the proper VEC_SIZE for the corresponding > > > > cpu. > > > > > > > > skl -> avx2 > > > > icl -> evex > > > > tgl -> evex > > > > > > > > The changes only affect sizes > 2 * VEC_SIZE. The performance > > > > differences in the size <= 2 * VEC_SIZE come from changes in > alignment > > > > after linking (i.e ENTRY aligns to 16, but performance can be > affected > > > > by alignment % 64 or alignment % 4096) and generally affects > > > > throughput only, not latency (i.e with an lfence to the benchmark > loop > > > > the deviations go away). Generally I think they can be ignored (both > > > > positive and negative affects). > > > > > > > > The interesting part of the data is in the medium size range [128, > > > > 1024] where the new implementation has a reasonable speedup. This is > > > > especially pronounced when the more conservative alignment saves a > > > > full loop iteration. The only significant exception is > > > > skylake-avx2-erms case for size = 416, alignment = 416 where the > > > > current implementation is meaningfully faster. I am unsure of the > root > > > > cause for this. The skylake-avx2 case only performs a bit worse in > > > > this case which makes me think part of it is code alignment related, > > > > though comparative to the speedup in other size/alignment > > > > configurations it is still a trough. Despite this, I still think the > > > > numbers are overall an improvement. > > > > > > > > As well due to aligning the loop (and possibly slightly more > efficient > > > > DSB behavior with the replacement of addq 4 * VEC_SIZE in the loop > > > > with subq -4 * VEC_SIZE) in the non-erms cases there is often a > slight > > > > improvement to the main loop for large sizes. > > > > > > > > .../multiarch/memset-vec-unaligned-erms.S | 50 > +++++++++++-------- > > > > 1 file changed, 28 insertions(+), 22 deletions(-) > > > > > > > > diff --git a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S > b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S > > > > index 08cfa49bd1..ff196844a0 100644 > > > > --- a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S > > > > +++ b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S > > > > @@ -173,17 +173,22 @@ ENTRY (MEMSET_SYMBOL (__memset, > unaligned_erms)) > > > > VMOVU %VEC(0), (%rdi) > > > > VZEROUPPER_RETURN > > > > > > > > + .p2align 4 > > > > L(stosb_more_2x_vec): > > > > cmp __x86_rep_stosb_threshold(%rip), %RDX_LP > > > > ja L(stosb) > > > > +#else > > > > + .p2align 4 > > > > #endif > > > > L(more_2x_vec): > > > > - cmpq $(VEC_SIZE * 4), %rdx > > > > - ja L(loop_start) > > > > + /* Stores to first 2x VEC before cmp as any path forward will > > > > + require it. */ > > > > VMOVU %VEC(0), (%rdi) > > > > VMOVU %VEC(0), VEC_SIZE(%rdi) > > > > - VMOVU %VEC(0), -VEC_SIZE(%rdi,%rdx) > > > > + cmpq $(VEC_SIZE * 4), %rdx > > > > + ja L(loop_start) > > > > VMOVU %VEC(0), -(VEC_SIZE * 2)(%rdi,%rdx) > > > > + VMOVU %VEC(0), -VEC_SIZE(%rdi,%rdx) > > > > L(return): > > > > #if VEC_SIZE > 16 > > > > ZERO_UPPER_VEC_REGISTERS_RETURN > > > > @@ -192,28 +197,29 @@ L(return): > > > > #endif > > > > > > > > L(loop_start): > > > > - leaq (VEC_SIZE * 4)(%rdi), %rcx > > > > - VMOVU %VEC(0), (%rdi) > > > > - andq $-(VEC_SIZE * 4), %rcx > > > > - VMOVU %VEC(0), -VEC_SIZE(%rdi,%rdx) > > > > - VMOVU %VEC(0), VEC_SIZE(%rdi) > > > > - VMOVU %VEC(0), -(VEC_SIZE * 2)(%rdi,%rdx) > > > > VMOVU %VEC(0), (VEC_SIZE * 2)(%rdi) > > > > - VMOVU %VEC(0), -(VEC_SIZE * 3)(%rdi,%rdx) > > > > VMOVU %VEC(0), (VEC_SIZE * 3)(%rdi) > > > > - VMOVU %VEC(0), -(VEC_SIZE * 4)(%rdi,%rdx) > > > > - addq %rdi, %rdx > > > > - andq $-(VEC_SIZE * 4), %rdx > > > > - cmpq %rdx, %rcx > > > > - je L(return) > > > > + cmpq $(VEC_SIZE * 8), %rdx > > > > + jbe L(loop_end) > > > > + andq $-(VEC_SIZE * 2), %rdi > > > > + subq $-(VEC_SIZE * 4), %rdi > > > > + leaq -(VEC_SIZE * 4)(%rax, %rdx), %rcx If this overflows loop will exit first iteration. Is that an issue? If so the following commits from me have the same bug: https://sourceware.org/git/?p=glibc.git;a=commit;h=6abf27980a947f9b6e514d6b33b83059d39566ae https://sourceware.org/git/?p=glibc.git;a=commit;h=4ad473e97acdc5f6d811755b67c09f2128a644ce https://sourceware.org/git/?p=glibc.git;a=commit;h=16d12015c57701b08d7bbed6ec536641bcafb428 > > > > + .p2align 4 > > > > L(loop): > > > > - VMOVA %VEC(0), (%rcx) > > > > - VMOVA %VEC(0), VEC_SIZE(%rcx) > > > > - VMOVA %VEC(0), (VEC_SIZE * 2)(%rcx) > > > > - VMOVA %VEC(0), (VEC_SIZE * 3)(%rcx) > > > > - addq $(VEC_SIZE * 4), %rcx > > > > - cmpq %rcx, %rdx > > > > - jne L(loop) > > > > + VMOVA %VEC(0), (%rdi) > > > > + VMOVA %VEC(0), VEC_SIZE(%rdi) > > > > + VMOVA %VEC(0), (VEC_SIZE * 2)(%rdi) > > > > + VMOVA %VEC(0), (VEC_SIZE * 3)(%rdi) > > > > + subq $-(VEC_SIZE * 4), %rdi > > > > + cmpq %rcx, %rdi > > > > + jb L(loop) Issue because %rdi will not be below %rcx here. > > > > +L(loop_end): > > > > + /* NB: rax is set as ptr in > MEMSET_VDUP_TO_VEC0_AND_SET_RETURN. > > > > + rdx as length is also unchanged. */ > > > > + VMOVU %VEC(0), -(VEC_SIZE * 4)(%rax, %rdx) > > > > + VMOVU %VEC(0), -(VEC_SIZE * 3)(%rax, %rdx) > > > > + VMOVU %VEC(0), -(VEC_SIZE * 2)(%rax, %rdx) > > > > + VMOVU %VEC(0), -VEC_SIZE(%rax, %rdx) > > > > VZEROUPPER_SHORT_RETURN > > > > > > > > .p2align 4 > > > > -- > > > > 2.25.1 > > > > > > > > > > LGTM. > > > > Awesome! > > > > For future patches do you prefer performance numbers like this or > > raw text? Or some other alternative? > > The current data format is fine. Thanks. > > > > > > > Reviewed-by: H.J. Lu <hjl.tools@gmail.com> > > > > > > Thanks. > > > > > > -- > > > H.J. > > > > -- > H.J. >
On Sun, Jun 6, 2021 at 7:35 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote: > > > > On Thu, May 20, 2021 at 4:45 PM H.J. Lu <hjl.tools@gmail.com> wrote: >> >> On Thu, May 20, 2021 at 1:03 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote: >> > >> > On Thu, May 20, 2021 at 3:40 PM H.J. Lu <hjl.tools@gmail.com> wrote: >> > > >> > > On Thu, May 20, 2021 at 11:45 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote: >> > > > >> > > > No bug. This commit makes a few small improvements to >> > > > memset-vec-unaligned-erms.S. The changes are 1) only aligning to 64 >> > > > instead of 128. Either alignment will perform equally well in a loop >> > > > and 128 just increases the odds of having to do an extra iteration >> > > > which can be significant overhead for small values. 2) Align some >> > > > targets and the loop. 3) Remove an ALU from the alignment process. 4) >> > > > Reorder the last 4x VEC so that they are stored after the loop. 5) >> > > > Move the condition for leq 8x VEC to before the alignment >> > > > process. test-memset and test-wmemset are both passing. >> > > > >> > > > Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com> >> > > > --- >> > > > Tests where run on the following CPUs: >> > > > >> > > > Skylake: https://ark.intel.com/content/www/us/en/ark/products/149091/intel-core-i7-8565u-processor-8m-cache-up-to-4-60-ghz.html >> > > > >> > > > Icelake: https://ark.intel.com/content/www/us/en/ark/products/196597/intel-core-i7-1065g7-processor-8m-cache-up-to-3-90-ghz.html >> > > > >> > > > Tigerlake: https://ark.intel.com/content/www/us/en/ark/products/208921/intel-core-i7-1165g7-processor-12m-cache-up-to-4-70-ghz-with-ipu.html >> > > > >> > > > All times are the geometric mean of N=50. The unit of time is >> > > > seconds. >> > > > >> > > > "Cur" refers to the current implementation >> > > > "New" refers to this patches implementation >> > > > >> > > > Performance data attached in memset-data.pdf >> > > > >> > > > Some notes on the numbers: >> > > > >> > > > I only included numbers for the proper VEC_SIZE for the corresponding >> > > > cpu. >> > > > >> > > > skl -> avx2 >> > > > icl -> evex >> > > > tgl -> evex >> > > > >> > > > The changes only affect sizes > 2 * VEC_SIZE. The performance >> > > > differences in the size <= 2 * VEC_SIZE come from changes in alignment >> > > > after linking (i.e ENTRY aligns to 16, but performance can be affected >> > > > by alignment % 64 or alignment % 4096) and generally affects >> > > > throughput only, not latency (i.e with an lfence to the benchmark loop >> > > > the deviations go away). Generally I think they can be ignored (both >> > > > positive and negative affects). >> > > > >> > > > The interesting part of the data is in the medium size range [128, >> > > > 1024] where the new implementation has a reasonable speedup. This is >> > > > especially pronounced when the more conservative alignment saves a >> > > > full loop iteration. The only significant exception is >> > > > skylake-avx2-erms case for size = 416, alignment = 416 where the >> > > > current implementation is meaningfully faster. I am unsure of the root >> > > > cause for this. The skylake-avx2 case only performs a bit worse in >> > > > this case which makes me think part of it is code alignment related, >> > > > though comparative to the speedup in other size/alignment >> > > > configurations it is still a trough. Despite this, I still think the >> > > > numbers are overall an improvement. >> > > > >> > > > As well due to aligning the loop (and possibly slightly more efficient >> > > > DSB behavior with the replacement of addq 4 * VEC_SIZE in the loop >> > > > with subq -4 * VEC_SIZE) in the non-erms cases there is often a slight >> > > > improvement to the main loop for large sizes. >> > > > >> > > > .../multiarch/memset-vec-unaligned-erms.S | 50 +++++++++++-------- >> > > > 1 file changed, 28 insertions(+), 22 deletions(-) >> > > > >> > > > diff --git a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S >> > > > index 08cfa49bd1..ff196844a0 100644 >> > > > --- a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S >> > > > +++ b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S >> > > > @@ -173,17 +173,22 @@ ENTRY (MEMSET_SYMBOL (__memset, unaligned_erms)) >> > > > VMOVU %VEC(0), (%rdi) >> > > > VZEROUPPER_RETURN >> > > > >> > > > + .p2align 4 >> > > > L(stosb_more_2x_vec): >> > > > cmp __x86_rep_stosb_threshold(%rip), %RDX_LP >> > > > ja L(stosb) >> > > > +#else >> > > > + .p2align 4 >> > > > #endif >> > > > L(more_2x_vec): >> > > > - cmpq $(VEC_SIZE * 4), %rdx >> > > > - ja L(loop_start) >> > > > + /* Stores to first 2x VEC before cmp as any path forward will >> > > > + require it. */ >> > > > VMOVU %VEC(0), (%rdi) >> > > > VMOVU %VEC(0), VEC_SIZE(%rdi) >> > > > - VMOVU %VEC(0), -VEC_SIZE(%rdi,%rdx) >> > > > + cmpq $(VEC_SIZE * 4), %rdx >> > > > + ja L(loop_start) >> > > > VMOVU %VEC(0), -(VEC_SIZE * 2)(%rdi,%rdx) >> > > > + VMOVU %VEC(0), -VEC_SIZE(%rdi,%rdx) >> > > > L(return): >> > > > #if VEC_SIZE > 16 >> > > > ZERO_UPPER_VEC_REGISTERS_RETURN >> > > > @@ -192,28 +197,29 @@ L(return): >> > > > #endif >> > > > >> > > > L(loop_start): >> > > > - leaq (VEC_SIZE * 4)(%rdi), %rcx >> > > > - VMOVU %VEC(0), (%rdi) >> > > > - andq $-(VEC_SIZE * 4), %rcx >> > > > - VMOVU %VEC(0), -VEC_SIZE(%rdi,%rdx) >> > > > - VMOVU %VEC(0), VEC_SIZE(%rdi) >> > > > - VMOVU %VEC(0), -(VEC_SIZE * 2)(%rdi,%rdx) >> > > > VMOVU %VEC(0), (VEC_SIZE * 2)(%rdi) >> > > > - VMOVU %VEC(0), -(VEC_SIZE * 3)(%rdi,%rdx) >> > > > VMOVU %VEC(0), (VEC_SIZE * 3)(%rdi) >> > > > - VMOVU %VEC(0), -(VEC_SIZE * 4)(%rdi,%rdx) >> > > > - addq %rdi, %rdx >> > > > - andq $-(VEC_SIZE * 4), %rdx >> > > > - cmpq %rdx, %rcx >> > > > - je L(return) >> > > > + cmpq $(VEC_SIZE * 8), %rdx >> > > > + jbe L(loop_end) >> > > > + andq $-(VEC_SIZE * 2), %rdi >> > > > + subq $-(VEC_SIZE * 4), %rdi >> > > > + leaq -(VEC_SIZE * 4)(%rax, %rdx), %rcx > > If this overflows loop will exit first iteration. Is that an issue? Please do following: 1. Update memset assembly codes with check conditions for underwrite/overwrite. if true then branch to the HLT instruction. 2. Update and run memset test to verify the test coverage for the condition. 3. Update memset assembly codes to cover such conditions. Thanks. > If so the following commits from me have the same bug: > https://sourceware.org/git/?p=glibc.git;a=commit;h=6abf27980a947f9b6e514d6b33b83059d39566ae > https://sourceware.org/git/?p=glibc.git;a=commit;h=4ad473e97acdc5f6d811755b67c09f2128a644ce > https://sourceware.org/git/?p=glibc.git;a=commit;h=16d12015c57701b08d7bbed6ec536641bcafb428 >> >> >> >> > > > + .p2align 4 >> > > > L(loop): >> > > > - VMOVA %VEC(0), (%rcx) >> > > > - VMOVA %VEC(0), VEC_SIZE(%rcx) >> > > > - VMOVA %VEC(0), (VEC_SIZE * 2)(%rcx) >> > > > - VMOVA %VEC(0), (VEC_SIZE * 3)(%rcx) >> > > > - addq $(VEC_SIZE * 4), %rcx >> > > > - cmpq %rcx, %rdx >> > > > - jne L(loop) >> > > > + VMOVA %VEC(0), (%rdi) >> > > > + VMOVA %VEC(0), VEC_SIZE(%rdi) >> > > > + VMOVA %VEC(0), (VEC_SIZE * 2)(%rdi) >> > > > + VMOVA %VEC(0), (VEC_SIZE * 3)(%rdi) >> > > > + subq $-(VEC_SIZE * 4), %rdi >> > > > + cmpq %rcx, %rdi >> > > > + jb L(loop) > > Issue because %rdi will not be below %rcx here. >> >> > > > +L(loop_end): >> > > > + /* NB: rax is set as ptr in MEMSET_VDUP_TO_VEC0_AND_SET_RETURN. >> > > > + rdx as length is also unchanged. */ >> > > > + VMOVU %VEC(0), -(VEC_SIZE * 4)(%rax, %rdx) >> > > > + VMOVU %VEC(0), -(VEC_SIZE * 3)(%rax, %rdx) >> > > > + VMOVU %VEC(0), -(VEC_SIZE * 2)(%rax, %rdx) >> > > > + VMOVU %VEC(0), -VEC_SIZE(%rax, %rdx) >> > > > VZEROUPPER_SHORT_RETURN >> > > > >> > > > .p2align 4 >> > > > -- >> > > > 2.25.1 >> > > > >> > > >> > > LGTM. >> > >> > Awesome! >> > >> > For future patches do you prefer performance numbers like this or >> > raw text? Or some other alternative? >> >> The current data format is fine. Thanks. >> >> > > >> > > Reviewed-by: H.J. Lu <hjl.tools@gmail.com> >> > > >> > > Thanks. >> > > >> > > -- >> > > H.J. >> >> >> >> -- >> H.J.
On Sun, Jun 6, 2021 at 10:48 PM H.J. Lu <hjl.tools@gmail.com> wrote: > On Sun, Jun 6, 2021 at 7:35 PM Noah Goldstein <goldstein.w.n@gmail.com> > wrote: > > > > > > > > On Thu, May 20, 2021 at 4:45 PM H.J. Lu <hjl.tools@gmail.com> wrote: > >> > >> On Thu, May 20, 2021 at 1:03 PM Noah Goldstein <goldstein.w.n@gmail.com> > wrote: > >> > > >> > On Thu, May 20, 2021 at 3:40 PM H.J. Lu <hjl.tools@gmail.com> wrote: > >> > > > >> > > On Thu, May 20, 2021 at 11:45 AM Noah Goldstein < > goldstein.w.n@gmail.com> wrote: > >> > > > > >> > > > No bug. This commit makes a few small improvements to > >> > > > memset-vec-unaligned-erms.S. The changes are 1) only aligning to > 64 > >> > > > instead of 128. Either alignment will perform equally well in a > loop > >> > > > and 128 just increases the odds of having to do an extra iteration > >> > > > which can be significant overhead for small values. 2) Align some > >> > > > targets and the loop. 3) Remove an ALU from the alignment > process. 4) > >> > > > Reorder the last 4x VEC so that they are stored after the loop. 5) > >> > > > Move the condition for leq 8x VEC to before the alignment > >> > > > process. test-memset and test-wmemset are both passing. > >> > > > > >> > > > Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com> > >> > > > --- > >> > > > Tests where run on the following CPUs: > >> > > > > >> > > > Skylake: > https://ark.intel.com/content/www/us/en/ark/products/149091/intel-core-i7-8565u-processor-8m-cache-up-to-4-60-ghz.html > >> > > > > >> > > > Icelake: > https://ark.intel.com/content/www/us/en/ark/products/196597/intel-core-i7-1065g7-processor-8m-cache-up-to-3-90-ghz.html > >> > > > > >> > > > Tigerlake: > https://ark.intel.com/content/www/us/en/ark/products/208921/intel-core-i7-1165g7-processor-12m-cache-up-to-4-70-ghz-with-ipu.html > >> > > > > >> > > > All times are the geometric mean of N=50. The unit of time is > >> > > > seconds. > >> > > > > >> > > > "Cur" refers to the current implementation > >> > > > "New" refers to this patches implementation > >> > > > > >> > > > Performance data attached in memset-data.pdf > >> > > > > >> > > > Some notes on the numbers: > >> > > > > >> > > > I only included numbers for the proper VEC_SIZE for the > corresponding > >> > > > cpu. > >> > > > > >> > > > skl -> avx2 > >> > > > icl -> evex > >> > > > tgl -> evex > >> > > > > >> > > > The changes only affect sizes > 2 * VEC_SIZE. The performance > >> > > > differences in the size <= 2 * VEC_SIZE come from changes in > alignment > >> > > > after linking (i.e ENTRY aligns to 16, but performance can be > affected > >> > > > by alignment % 64 or alignment % 4096) and generally affects > >> > > > throughput only, not latency (i.e with an lfence to the benchmark > loop > >> > > > the deviations go away). Generally I think they can be ignored > (both > >> > > > positive and negative affects). > >> > > > > >> > > > The interesting part of the data is in the medium size range [128, > >> > > > 1024] where the new implementation has a reasonable speedup. This > is > >> > > > especially pronounced when the more conservative alignment saves a > >> > > > full loop iteration. The only significant exception is > >> > > > skylake-avx2-erms case for size = 416, alignment = 416 where the > >> > > > current implementation is meaningfully faster. I am unsure of the > root > >> > > > cause for this. The skylake-avx2 case only performs a bit worse in > >> > > > this case which makes me think part of it is code alignment > related, > >> > > > though comparative to the speedup in other size/alignment > >> > > > configurations it is still a trough. Despite this, I still think > the > >> > > > numbers are overall an improvement. > >> > > > > >> > > > As well due to aligning the loop (and possibly slightly more > efficient > >> > > > DSB behavior with the replacement of addq 4 * VEC_SIZE in the loop > >> > > > with subq -4 * VEC_SIZE) in the non-erms cases there is often a > slight > >> > > > improvement to the main loop for large sizes. > >> > > > > >> > > > .../multiarch/memset-vec-unaligned-erms.S | 50 > +++++++++++-------- > >> > > > 1 file changed, 28 insertions(+), 22 deletions(-) > >> > > > > >> > > > diff --git a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S > b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S > >> > > > index 08cfa49bd1..ff196844a0 100644 > >> > > > --- a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S > >> > > > +++ b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S > >> > > > @@ -173,17 +173,22 @@ ENTRY (MEMSET_SYMBOL (__memset, > unaligned_erms)) > >> > > > VMOVU %VEC(0), (%rdi) > >> > > > VZEROUPPER_RETURN > >> > > > > >> > > > + .p2align 4 > >> > > > L(stosb_more_2x_vec): > >> > > > cmp __x86_rep_stosb_threshold(%rip), %RDX_LP > >> > > > ja L(stosb) > >> > > > +#else > >> > > > + .p2align 4 > >> > > > #endif > >> > > > L(more_2x_vec): > >> > > > - cmpq $(VEC_SIZE * 4), %rdx > >> > > > - ja L(loop_start) > >> > > > + /* Stores to first 2x VEC before cmp as any path forward > will > >> > > > + require it. */ > >> > > > VMOVU %VEC(0), (%rdi) > >> > > > VMOVU %VEC(0), VEC_SIZE(%rdi) > >> > > > - VMOVU %VEC(0), -VEC_SIZE(%rdi,%rdx) > >> > > > + cmpq $(VEC_SIZE * 4), %rdx > >> > > > + ja L(loop_start) > >> > > > VMOVU %VEC(0), -(VEC_SIZE * 2)(%rdi,%rdx) > >> > > > + VMOVU %VEC(0), -VEC_SIZE(%rdi,%rdx) > >> > > > L(return): > >> > > > #if VEC_SIZE > 16 > >> > > > ZERO_UPPER_VEC_REGISTERS_RETURN > >> > > > @@ -192,28 +197,29 @@ L(return): > >> > > > #endif > >> > > > > >> > > > L(loop_start): > >> > > > - leaq (VEC_SIZE * 4)(%rdi), %rcx > >> > > > - VMOVU %VEC(0), (%rdi) > >> > > > - andq $-(VEC_SIZE * 4), %rcx > >> > > > - VMOVU %VEC(0), -VEC_SIZE(%rdi,%rdx) > >> > > > - VMOVU %VEC(0), VEC_SIZE(%rdi) > >> > > > - VMOVU %VEC(0), -(VEC_SIZE * 2)(%rdi,%rdx) > >> > > > VMOVU %VEC(0), (VEC_SIZE * 2)(%rdi) > >> > > > - VMOVU %VEC(0), -(VEC_SIZE * 3)(%rdi,%rdx) > >> > > > VMOVU %VEC(0), (VEC_SIZE * 3)(%rdi) > >> > > > - VMOVU %VEC(0), -(VEC_SIZE * 4)(%rdi,%rdx) > >> > > > - addq %rdi, %rdx > >> > > > - andq $-(VEC_SIZE * 4), %rdx > >> > > > - cmpq %rdx, %rcx > >> > > > - je L(return) > >> > > > + cmpq $(VEC_SIZE * 8), %rdx > >> > > > + jbe L(loop_end) > >> > > > + andq $-(VEC_SIZE * 2), %rdi > >> > > > + subq $-(VEC_SIZE * 4), %rdi > >> > > > + leaq -(VEC_SIZE * 4)(%rax, %rdx), %rcx > > > > If this overflows loop will exit first iteration. Is that an issue? > > Please do following: > > 1. Update memset assembly codes with > check conditions for underwrite/overwrite. > if true then branch to the HLT instruction. > codes? Just this file or others as well? > 2. Update and run memset test to verify the test coverage for the > condition. > What is the desired result? Segfault? > 3. Update memset assembly codes to cover such conditions. > Memcmp as well? What about wmemset? Currently (and with previous versions as well) a value like 2^63 would cause similar behavior after the `salq $2, %rdx` > > Thanks. > > > If so the following commits from me have the same bug: > > > https://sourceware.org/git/?p=glibc.git;a=commit;h=6abf27980a947f9b6e514d6b33b83059d39566ae > > > https://sourceware.org/git/?p=glibc.git;a=commit;h=4ad473e97acdc5f6d811755b67c09f2128a644ce > > > https://sourceware.org/git/?p=glibc.git;a=commit;h=16d12015c57701b08d7bbed6ec536641bcafb428 > >> > >> > >> > >> > > > + .p2align 4 > >> > > > L(loop): > >> > > > - VMOVA %VEC(0), (%rcx) > >> > > > - VMOVA %VEC(0), VEC_SIZE(%rcx) > >> > > > - VMOVA %VEC(0), (VEC_SIZE * 2)(%rcx) > >> > > > - VMOVA %VEC(0), (VEC_SIZE * 3)(%rcx) > >> > > > - addq $(VEC_SIZE * 4), %rcx > >> > > > - cmpq %rcx, %rdx > >> > > > - jne L(loop) > >> > > > + VMOVA %VEC(0), (%rdi) > >> > > > + VMOVA %VEC(0), VEC_SIZE(%rdi) > >> > > > + VMOVA %VEC(0), (VEC_SIZE * 2)(%rdi) > >> > > > + VMOVA %VEC(0), (VEC_SIZE * 3)(%rdi) > >> > > > + subq $-(VEC_SIZE * 4), %rdi > >> > > > + cmpq %rcx, %rdi > >> > > > + jb L(loop) > > > > Issue because %rdi will not be below %rcx here. > >> > >> > > > +L(loop_end): > >> > > > + /* NB: rax is set as ptr in > MEMSET_VDUP_TO_VEC0_AND_SET_RETURN. > >> > > > + rdx as length is also unchanged. */ > >> > > > + VMOVU %VEC(0), -(VEC_SIZE * 4)(%rax, %rdx) > >> > > > + VMOVU %VEC(0), -(VEC_SIZE * 3)(%rax, %rdx) > >> > > > + VMOVU %VEC(0), -(VEC_SIZE * 2)(%rax, %rdx) > >> > > > + VMOVU %VEC(0), -VEC_SIZE(%rax, %rdx) > >> > > > VZEROUPPER_SHORT_RETURN > >> > > > > >> > > > .p2align 4 > >> > > > -- > >> > > > 2.25.1 > >> > > > > >> > > > >> > > LGTM. > >> > > >> > Awesome! > >> > > >> > For future patches do you prefer performance numbers like this or > >> > raw text? Or some other alternative? > >> > >> The current data format is fine. Thanks. > >> > >> > > > >> > > Reviewed-by: H.J. Lu <hjl.tools@gmail.com> > >> > > > >> > > Thanks. > >> > > > >> > > -- > >> > > H.J. > >> > >> > >> > >> -- > >> H.J. > > > > -- > H.J. >
On Sun, Jun 6, 2021 at 8:06 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote: > > > > On Sun, Jun 6, 2021 at 10:48 PM H.J. Lu <hjl.tools@gmail.com> wrote: >> >> On Sun, Jun 6, 2021 at 7:35 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote: >> > >> > >> > >> > On Thu, May 20, 2021 at 4:45 PM H.J. Lu <hjl.tools@gmail.com> wrote: >> >> >> >> On Thu, May 20, 2021 at 1:03 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote: >> >> > >> >> > On Thu, May 20, 2021 at 3:40 PM H.J. Lu <hjl.tools@gmail.com> wrote: >> >> > > >> >> > > On Thu, May 20, 2021 at 11:45 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote: >> >> > > > >> >> > > > No bug. This commit makes a few small improvements to >> >> > > > memset-vec-unaligned-erms.S. The changes are 1) only aligning to 64 >> >> > > > instead of 128. Either alignment will perform equally well in a loop >> >> > > > and 128 just increases the odds of having to do an extra iteration >> >> > > > which can be significant overhead for small values. 2) Align some >> >> > > > targets and the loop. 3) Remove an ALU from the alignment process. 4) >> >> > > > Reorder the last 4x VEC so that they are stored after the loop. 5) >> >> > > > Move the condition for leq 8x VEC to before the alignment >> >> > > > process. test-memset and test-wmemset are both passing. >> >> > > > >> >> > > > Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com> >> >> > > > --- >> >> > > > Tests where run on the following CPUs: >> >> > > > >> >> > > > Skylake: https://ark.intel.com/content/www/us/en/ark/products/149091/intel-core-i7-8565u-processor-8m-cache-up-to-4-60-ghz.html >> >> > > > >> >> > > > Icelake: https://ark.intel.com/content/www/us/en/ark/products/196597/intel-core-i7-1065g7-processor-8m-cache-up-to-3-90-ghz.html >> >> > > > >> >> > > > Tigerlake: https://ark.intel.com/content/www/us/en/ark/products/208921/intel-core-i7-1165g7-processor-12m-cache-up-to-4-70-ghz-with-ipu.html >> >> > > > >> >> > > > All times are the geometric mean of N=50. The unit of time is >> >> > > > seconds. >> >> > > > >> >> > > > "Cur" refers to the current implementation >> >> > > > "New" refers to this patches implementation >> >> > > > >> >> > > > Performance data attached in memset-data.pdf >> >> > > > >> >> > > > Some notes on the numbers: >> >> > > > >> >> > > > I only included numbers for the proper VEC_SIZE for the corresponding >> >> > > > cpu. >> >> > > > >> >> > > > skl -> avx2 >> >> > > > icl -> evex >> >> > > > tgl -> evex >> >> > > > >> >> > > > The changes only affect sizes > 2 * VEC_SIZE. The performance >> >> > > > differences in the size <= 2 * VEC_SIZE come from changes in alignment >> >> > > > after linking (i.e ENTRY aligns to 16, but performance can be affected >> >> > > > by alignment % 64 or alignment % 4096) and generally affects >> >> > > > throughput only, not latency (i.e with an lfence to the benchmark loop >> >> > > > the deviations go away). Generally I think they can be ignored (both >> >> > > > positive and negative affects). >> >> > > > >> >> > > > The interesting part of the data is in the medium size range [128, >> >> > > > 1024] where the new implementation has a reasonable speedup. This is >> >> > > > especially pronounced when the more conservative alignment saves a >> >> > > > full loop iteration. The only significant exception is >> >> > > > skylake-avx2-erms case for size = 416, alignment = 416 where the >> >> > > > current implementation is meaningfully faster. I am unsure of the root >> >> > > > cause for this. The skylake-avx2 case only performs a bit worse in >> >> > > > this case which makes me think part of it is code alignment related, >> >> > > > though comparative to the speedup in other size/alignment >> >> > > > configurations it is still a trough. Despite this, I still think the >> >> > > > numbers are overall an improvement. >> >> > > > >> >> > > > As well due to aligning the loop (and possibly slightly more efficient >> >> > > > DSB behavior with the replacement of addq 4 * VEC_SIZE in the loop >> >> > > > with subq -4 * VEC_SIZE) in the non-erms cases there is often a slight >> >> > > > improvement to the main loop for large sizes. >> >> > > > >> >> > > > .../multiarch/memset-vec-unaligned-erms.S | 50 +++++++++++-------- >> >> > > > 1 file changed, 28 insertions(+), 22 deletions(-) >> >> > > > >> >> > > > diff --git a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S >> >> > > > index 08cfa49bd1..ff196844a0 100644 >> >> > > > --- a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S >> >> > > > +++ b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S >> >> > > > @@ -173,17 +173,22 @@ ENTRY (MEMSET_SYMBOL (__memset, unaligned_erms)) >> >> > > > VMOVU %VEC(0), (%rdi) >> >> > > > VZEROUPPER_RETURN >> >> > > > >> >> > > > + .p2align 4 >> >> > > > L(stosb_more_2x_vec): >> >> > > > cmp __x86_rep_stosb_threshold(%rip), %RDX_LP >> >> > > > ja L(stosb) >> >> > > > +#else >> >> > > > + .p2align 4 >> >> > > > #endif >> >> > > > L(more_2x_vec): >> >> > > > - cmpq $(VEC_SIZE * 4), %rdx >> >> > > > - ja L(loop_start) >> >> > > > + /* Stores to first 2x VEC before cmp as any path forward will >> >> > > > + require it. */ >> >> > > > VMOVU %VEC(0), (%rdi) >> >> > > > VMOVU %VEC(0), VEC_SIZE(%rdi) >> >> > > > - VMOVU %VEC(0), -VEC_SIZE(%rdi,%rdx) >> >> > > > + cmpq $(VEC_SIZE * 4), %rdx >> >> > > > + ja L(loop_start) >> >> > > > VMOVU %VEC(0), -(VEC_SIZE * 2)(%rdi,%rdx) >> >> > > > + VMOVU %VEC(0), -VEC_SIZE(%rdi,%rdx) >> >> > > > L(return): >> >> > > > #if VEC_SIZE > 16 >> >> > > > ZERO_UPPER_VEC_REGISTERS_RETURN >> >> > > > @@ -192,28 +197,29 @@ L(return): >> >> > > > #endif >> >> > > > >> >> > > > L(loop_start): >> >> > > > - leaq (VEC_SIZE * 4)(%rdi), %rcx >> >> > > > - VMOVU %VEC(0), (%rdi) >> >> > > > - andq $-(VEC_SIZE * 4), %rcx >> >> > > > - VMOVU %VEC(0), -VEC_SIZE(%rdi,%rdx) >> >> > > > - VMOVU %VEC(0), VEC_SIZE(%rdi) >> >> > > > - VMOVU %VEC(0), -(VEC_SIZE * 2)(%rdi,%rdx) >> >> > > > VMOVU %VEC(0), (VEC_SIZE * 2)(%rdi) >> >> > > > - VMOVU %VEC(0), -(VEC_SIZE * 3)(%rdi,%rdx) >> >> > > > VMOVU %VEC(0), (VEC_SIZE * 3)(%rdi) >> >> > > > - VMOVU %VEC(0), -(VEC_SIZE * 4)(%rdi,%rdx) >> >> > > > - addq %rdi, %rdx >> >> > > > - andq $-(VEC_SIZE * 4), %rdx >> >> > > > - cmpq %rdx, %rcx >> >> > > > - je L(return) >> >> > > > + cmpq $(VEC_SIZE * 8), %rdx >> >> > > > + jbe L(loop_end) >> >> > > > + andq $-(VEC_SIZE * 2), %rdi >> >> > > > + subq $-(VEC_SIZE * 4), %rdi >> >> > > > + leaq -(VEC_SIZE * 4)(%rax, %rdx), %rcx >> > >> > If this overflows loop will exit first iteration. Is that an issue? >> >> Please do following: >> >> 1. Update memset assembly codes with >> check conditions for underwrite/overwrite. >> if true then branch to the HLT instruction. > > codes? Just this file or others as well? All string/memory functions should work for all valid inputs. >> >> 2. Update and run memset test to verify the test coverage for the condition. > > What is the desired result? Segfault? For invalid inputs, anything can happen, including segfault. >> >> 3. Update memset assembly codes to cover such conditions. > > Memcmp as well? What about wmemset? Currently (and with previous versions as well) > a value like 2^63 would cause similar behavior after the `salq $2, %rdx` Is this condition a valid input? If not, there is nothing to do. >> >> >> Thanks. >> >> > If so the following commits from me have the same bug: >> > https://sourceware.org/git/?p=glibc.git;a=commit;h=6abf27980a947f9b6e514d6b33b83059d39566ae >> > https://sourceware.org/git/?p=glibc.git;a=commit;h=4ad473e97acdc5f6d811755b67c09f2128a644ce >> > https://sourceware.org/git/?p=glibc.git;a=commit;h=16d12015c57701b08d7bbed6ec536641bcafb428 >> >> >> >> >> >> >> >> > > > + .p2align 4 >> >> > > > L(loop): >> >> > > > - VMOVA %VEC(0), (%rcx) >> >> > > > - VMOVA %VEC(0), VEC_SIZE(%rcx) >> >> > > > - VMOVA %VEC(0), (VEC_SIZE * 2)(%rcx) >> >> > > > - VMOVA %VEC(0), (VEC_SIZE * 3)(%rcx) >> >> > > > - addq $(VEC_SIZE * 4), %rcx >> >> > > > - cmpq %rcx, %rdx >> >> > > > - jne L(loop) >> >> > > > + VMOVA %VEC(0), (%rdi) >> >> > > > + VMOVA %VEC(0), VEC_SIZE(%rdi) >> >> > > > + VMOVA %VEC(0), (VEC_SIZE * 2)(%rdi) >> >> > > > + VMOVA %VEC(0), (VEC_SIZE * 3)(%rdi) >> >> > > > + subq $-(VEC_SIZE * 4), %rdi >> >> > > > + cmpq %rcx, %rdi >> >> > > > + jb L(loop) >> > >> > Issue because %rdi will not be below %rcx here. >> >> >> >> > > > +L(loop_end): >> >> > > > + /* NB: rax is set as ptr in MEMSET_VDUP_TO_VEC0_AND_SET_RETURN. >> >> > > > + rdx as length is also unchanged. */ >> >> > > > + VMOVU %VEC(0), -(VEC_SIZE * 4)(%rax, %rdx) >> >> > > > + VMOVU %VEC(0), -(VEC_SIZE * 3)(%rax, %rdx) >> >> > > > + VMOVU %VEC(0), -(VEC_SIZE * 2)(%rax, %rdx) >> >> > > > + VMOVU %VEC(0), -VEC_SIZE(%rax, %rdx) >> >> > > > VZEROUPPER_SHORT_RETURN >> >> > > > >> >> > > > .p2align 4 >> >> > > > -- >> >> > > > 2.25.1 >> >> > > > >> >> > > >> >> > > LGTM. >> >> > >> >> > Awesome! >> >> > >> >> > For future patches do you prefer performance numbers like this or >> >> > raw text? Or some other alternative? >> >> >> >> The current data format is fine. Thanks. >> >> >> >> > > >> >> > > Reviewed-by: H.J. Lu <hjl.tools@gmail.com> >> >> > > >> >> > > Thanks. >> >> > > >> >> > > -- >> >> > > H.J. >> >> >> >> >> >> >> >> -- >> >> H.J. >> >> >> >> -- >> H.J.
On Sun, Jun 6, 2021 at 11:21 PM H.J. Lu <hjl.tools@gmail.com> wrote: > On Sun, Jun 6, 2021 at 8:06 PM Noah Goldstein <goldstein.w.n@gmail.com> > wrote: > > > > > > > > On Sun, Jun 6, 2021 at 10:48 PM H.J. Lu <hjl.tools@gmail.com> wrote: > >> > >> On Sun, Jun 6, 2021 at 7:35 PM Noah Goldstein <goldstein.w.n@gmail.com> > wrote: > >> > > >> > > >> > > >> > On Thu, May 20, 2021 at 4:45 PM H.J. Lu <hjl.tools@gmail.com> wrote: > >> >> > >> >> On Thu, May 20, 2021 at 1:03 PM Noah Goldstein < > goldstein.w.n@gmail.com> wrote: > >> >> > > >> >> > On Thu, May 20, 2021 at 3:40 PM H.J. Lu <hjl.tools@gmail.com> > wrote: > >> >> > > > >> >> > > On Thu, May 20, 2021 at 11:45 AM Noah Goldstein < > goldstein.w.n@gmail.com> wrote: > >> >> > > > > >> >> > > > No bug. This commit makes a few small improvements to > >> >> > > > memset-vec-unaligned-erms.S. The changes are 1) only aligning > to 64 > >> >> > > > instead of 128. Either alignment will perform equally well in > a loop > >> >> > > > and 128 just increases the odds of having to do an extra > iteration > >> >> > > > which can be significant overhead for small values. 2) Align > some > >> >> > > > targets and the loop. 3) Remove an ALU from the alignment > process. 4) > >> >> > > > Reorder the last 4x VEC so that they are stored after the > loop. 5) > >> >> > > > Move the condition for leq 8x VEC to before the alignment > >> >> > > > process. test-memset and test-wmemset are both passing. > >> >> > > > > >> >> > > > Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com> > >> >> > > > --- > >> >> > > > Tests where run on the following CPUs: > >> >> > > > > >> >> > > > Skylake: > https://ark.intel.com/content/www/us/en/ark/products/149091/intel-core-i7-8565u-processor-8m-cache-up-to-4-60-ghz.html > >> >> > > > > >> >> > > > Icelake: > https://ark.intel.com/content/www/us/en/ark/products/196597/intel-core-i7-1065g7-processor-8m-cache-up-to-3-90-ghz.html > >> >> > > > > >> >> > > > Tigerlake: > https://ark.intel.com/content/www/us/en/ark/products/208921/intel-core-i7-1165g7-processor-12m-cache-up-to-4-70-ghz-with-ipu.html > >> >> > > > > >> >> > > > All times are the geometric mean of N=50. The unit of time is > >> >> > > > seconds. > >> >> > > > > >> >> > > > "Cur" refers to the current implementation > >> >> > > > "New" refers to this patches implementation > >> >> > > > > >> >> > > > Performance data attached in memset-data.pdf > >> >> > > > > >> >> > > > Some notes on the numbers: > >> >> > > > > >> >> > > > I only included numbers for the proper VEC_SIZE for the > corresponding > >> >> > > > cpu. > >> >> > > > > >> >> > > > skl -> avx2 > >> >> > > > icl -> evex > >> >> > > > tgl -> evex > >> >> > > > > >> >> > > > The changes only affect sizes > 2 * VEC_SIZE. The performance > >> >> > > > differences in the size <= 2 * VEC_SIZE come from changes in > alignment > >> >> > > > after linking (i.e ENTRY aligns to 16, but performance can be > affected > >> >> > > > by alignment % 64 or alignment % 4096) and generally affects > >> >> > > > throughput only, not latency (i.e with an lfence to the > benchmark loop > >> >> > > > the deviations go away). Generally I think they can be ignored > (both > >> >> > > > positive and negative affects). > >> >> > > > > >> >> > > > The interesting part of the data is in the medium size range > [128, > >> >> > > > 1024] where the new implementation has a reasonable speedup. > This is > >> >> > > > especially pronounced when the more conservative alignment > saves a > >> >> > > > full loop iteration. The only significant exception is > >> >> > > > skylake-avx2-erms case for size = 416, alignment = 416 where > the > >> >> > > > current implementation is meaningfully faster. I am unsure of > the root > >> >> > > > cause for this. The skylake-avx2 case only performs a bit > worse in > >> >> > > > this case which makes me think part of it is code alignment > related, > >> >> > > > though comparative to the speedup in other size/alignment > >> >> > > > configurations it is still a trough. Despite this, I still > think the > >> >> > > > numbers are overall an improvement. > >> >> > > > > >> >> > > > As well due to aligning the loop (and possibly slightly more > efficient > >> >> > > > DSB behavior with the replacement of addq 4 * VEC_SIZE in the > loop > >> >> > > > with subq -4 * VEC_SIZE) in the non-erms cases there is often > a slight > >> >> > > > improvement to the main loop for large sizes. > >> >> > > > > >> >> > > > .../multiarch/memset-vec-unaligned-erms.S | 50 > +++++++++++-------- > >> >> > > > 1 file changed, 28 insertions(+), 22 deletions(-) > >> >> > > > > >> >> > > > diff --git > a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S > b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S > >> >> > > > index 08cfa49bd1..ff196844a0 100644 > >> >> > > > --- a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S > >> >> > > > +++ b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S > >> >> > > > @@ -173,17 +173,22 @@ ENTRY (MEMSET_SYMBOL (__memset, > unaligned_erms)) > >> >> > > > VMOVU %VEC(0), (%rdi) > >> >> > > > VZEROUPPER_RETURN > >> >> > > > > >> >> > > > + .p2align 4 > >> >> > > > L(stosb_more_2x_vec): > >> >> > > > cmp __x86_rep_stosb_threshold(%rip), %RDX_LP > >> >> > > > ja L(stosb) > >> >> > > > +#else > >> >> > > > + .p2align 4 > >> >> > > > #endif > >> >> > > > L(more_2x_vec): > >> >> > > > - cmpq $(VEC_SIZE * 4), %rdx > >> >> > > > - ja L(loop_start) > >> >> > > > + /* Stores to first 2x VEC before cmp as any path > forward will > >> >> > > > + require it. */ > >> >> > > > VMOVU %VEC(0), (%rdi) > >> >> > > > VMOVU %VEC(0), VEC_SIZE(%rdi) > >> >> > > > - VMOVU %VEC(0), -VEC_SIZE(%rdi,%rdx) > >> >> > > > + cmpq $(VEC_SIZE * 4), %rdx > >> >> > > > + ja L(loop_start) > >> >> > > > VMOVU %VEC(0), -(VEC_SIZE * 2)(%rdi,%rdx) > >> >> > > > + VMOVU %VEC(0), -VEC_SIZE(%rdi,%rdx) > >> >> > > > L(return): > >> >> > > > #if VEC_SIZE > 16 > >> >> > > > ZERO_UPPER_VEC_REGISTERS_RETURN > >> >> > > > @@ -192,28 +197,29 @@ L(return): > >> >> > > > #endif > >> >> > > > > >> >> > > > L(loop_start): > >> >> > > > - leaq (VEC_SIZE * 4)(%rdi), %rcx > >> >> > > > - VMOVU %VEC(0), (%rdi) > >> >> > > > - andq $-(VEC_SIZE * 4), %rcx > >> >> > > > - VMOVU %VEC(0), -VEC_SIZE(%rdi,%rdx) > >> >> > > > - VMOVU %VEC(0), VEC_SIZE(%rdi) > >> >> > > > - VMOVU %VEC(0), -(VEC_SIZE * 2)(%rdi,%rdx) > >> >> > > > VMOVU %VEC(0), (VEC_SIZE * 2)(%rdi) > >> >> > > > - VMOVU %VEC(0), -(VEC_SIZE * 3)(%rdi,%rdx) > >> >> > > > VMOVU %VEC(0), (VEC_SIZE * 3)(%rdi) > >> >> > > > - VMOVU %VEC(0), -(VEC_SIZE * 4)(%rdi,%rdx) > >> >> > > > - addq %rdi, %rdx > >> >> > > > - andq $-(VEC_SIZE * 4), %rdx > >> >> > > > - cmpq %rdx, %rcx > >> >> > > > - je L(return) > >> >> > > > + cmpq $(VEC_SIZE * 8), %rdx > >> >> > > > + jbe L(loop_end) > >> >> > > > + andq $-(VEC_SIZE * 2), %rdi > >> >> > > > + subq $-(VEC_SIZE * 4), %rdi > >> >> > > > + leaq -(VEC_SIZE * 4)(%rax, %rdx), %rcx > >> > > >> > If this overflows loop will exit first iteration. Is that an issue? > >> > >> Please do following: > >> > >> 1. Update memset assembly codes with > >> check conditions for underwrite/overwrite. > >> if true then branch to the HLT instruction. > > > > codes? Just this file or others as well? > > All string/memory functions should work for all valid inputs. > > >> > >> 2. Update and run memset test to verify the test coverage for the > condition. > > > > What is the desired result? Segfault? > > For invalid inputs, anything can happen, including segfault. > > >> > >> 3. Update memset assembly codes to cover such conditions. > > > > Memcmp as well? What about wmemset? Currently (and with previous > versions as well) > > a value like 2^63 would cause similar behavior after the `salq $2, %rdx` > > Is this condition a valid input? If not, there is nothing to do. > AFAIK and value length [0, SIZE_MAX] for either is a valid input for any string/memory function from the perspective of the standard. But I don't know if it has any qualifiers. As well I don't know what is meant to happen if the machine/OS is unable to perform the necessary operations. Normally you would see segfault. The difference from the 3 commits below is essentially just that it won't segfault. That is already case, and has been for a while, certain inputs for many of the wcsmbs will have roughly the same behavior from `salq $2, %rdx`. For example wmemset(ptr, 0, 2^62 + 1) will currently set 1 wchar then return. > > >> > >> > >> Thanks. > >> > >> > If so the following commits from me have the same bug: > >> > > https://sourceware.org/git/?p=glibc.git;a=commit;h=6abf27980a947f9b6e514d6b33b83059d39566ae > >> > > https://sourceware.org/git/?p=glibc.git;a=commit;h=4ad473e97acdc5f6d811755b67c09f2128a644ce > >> > > https://sourceware.org/git/?p=glibc.git;a=commit;h=16d12015c57701b08d7bbed6ec536641bcafb428 > >> >> > >> >> > >> >> > >> >> > > > + .p2align 4 > >> >> > > > L(loop): > >> >> > > > - VMOVA %VEC(0), (%rcx) > >> >> > > > - VMOVA %VEC(0), VEC_SIZE(%rcx) > >> >> > > > - VMOVA %VEC(0), (VEC_SIZE * 2)(%rcx) > >> >> > > > - VMOVA %VEC(0), (VEC_SIZE * 3)(%rcx) > >> >> > > > - addq $(VEC_SIZE * 4), %rcx > >> >> > > > - cmpq %rcx, %rdx > >> >> > > > - jne L(loop) > >> >> > > > + VMOVA %VEC(0), (%rdi) > >> >> > > > + VMOVA %VEC(0), VEC_SIZE(%rdi) > >> >> > > > + VMOVA %VEC(0), (VEC_SIZE * 2)(%rdi) > >> >> > > > + VMOVA %VEC(0), (VEC_SIZE * 3)(%rdi) > >> >> > > > + subq $-(VEC_SIZE * 4), %rdi > >> >> > > > + cmpq %rcx, %rdi > >> >> > > > + jb L(loop) > >> > > >> > Issue because %rdi will not be below %rcx here. > >> >> > >> >> > > > +L(loop_end): > >> >> > > > + /* NB: rax is set as ptr in > MEMSET_VDUP_TO_VEC0_AND_SET_RETURN. > >> >> > > > + rdx as length is also unchanged. */ > >> >> > > > + VMOVU %VEC(0), -(VEC_SIZE * 4)(%rax, %rdx) > >> >> > > > + VMOVU %VEC(0), -(VEC_SIZE * 3)(%rax, %rdx) > >> >> > > > + VMOVU %VEC(0), -(VEC_SIZE * 2)(%rax, %rdx) > >> >> > > > + VMOVU %VEC(0), -VEC_SIZE(%rax, %rdx) > >> >> > > > VZEROUPPER_SHORT_RETURN > >> >> > > > > >> >> > > > .p2align 4 > >> >> > > > -- > >> >> > > > 2.25.1 > >> >> > > > > >> >> > > > >> >> > > LGTM. > >> >> > > >> >> > Awesome! > >> >> > > >> >> > For future patches do you prefer performance numbers like this or > >> >> > raw text? Or some other alternative? > >> >> > >> >> The current data format is fine. Thanks. > >> >> > >> >> > > > >> >> > > Reviewed-by: H.J. Lu <hjl.tools@gmail.com> > >> >> > > > >> >> > > Thanks. > >> >> > > > >> >> > > -- > >> >> > > H.J. > >> >> > >> >> > >> >> > >> >> -- > >> >> H.J. > >> > >> > >> > >> -- > >> H.J. > > > > -- > H.J. >
On Sun, Jun 6, 2021 at 8:39 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote: > > > > On Sun, Jun 6, 2021 at 11:21 PM H.J. Lu <hjl.tools@gmail.com> wrote: >> >> On Sun, Jun 6, 2021 at 8:06 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote: >> > >> > >> > >> > On Sun, Jun 6, 2021 at 10:48 PM H.J. Lu <hjl.tools@gmail.com> wrote: >> >> >> >> On Sun, Jun 6, 2021 at 7:35 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote: >> >> > >> >> > >> >> > >> >> > On Thu, May 20, 2021 at 4:45 PM H.J. Lu <hjl.tools@gmail.com> wrote: >> >> >> >> >> >> On Thu, May 20, 2021 at 1:03 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote: >> >> >> > >> >> >> > On Thu, May 20, 2021 at 3:40 PM H.J. Lu <hjl.tools@gmail.com> wrote: >> >> >> > > >> >> >> > > On Thu, May 20, 2021 at 11:45 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote: >> >> >> > > > >> >> >> > > > No bug. This commit makes a few small improvements to >> >> >> > > > memset-vec-unaligned-erms.S. The changes are 1) only aligning to 64 >> >> >> > > > instead of 128. Either alignment will perform equally well in a loop >> >> >> > > > and 128 just increases the odds of having to do an extra iteration >> >> >> > > > which can be significant overhead for small values. 2) Align some >> >> >> > > > targets and the loop. 3) Remove an ALU from the alignment process. 4) >> >> >> > > > Reorder the last 4x VEC so that they are stored after the loop. 5) >> >> >> > > > Move the condition for leq 8x VEC to before the alignment >> >> >> > > > process. test-memset and test-wmemset are both passing. >> >> >> > > > >> >> >> > > > Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com> >> >> >> > > > --- >> >> >> > > > Tests where run on the following CPUs: >> >> >> > > > >> >> >> > > > Skylake: https://ark.intel.com/content/www/us/en/ark/products/149091/intel-core-i7-8565u-processor-8m-cache-up-to-4-60-ghz.html >> >> >> > > > >> >> >> > > > Icelake: https://ark.intel.com/content/www/us/en/ark/products/196597/intel-core-i7-1065g7-processor-8m-cache-up-to-3-90-ghz.html >> >> >> > > > >> >> >> > > > Tigerlake: https://ark.intel.com/content/www/us/en/ark/products/208921/intel-core-i7-1165g7-processor-12m-cache-up-to-4-70-ghz-with-ipu.html >> >> >> > > > >> >> >> > > > All times are the geometric mean of N=50. The unit of time is >> >> >> > > > seconds. >> >> >> > > > >> >> >> > > > "Cur" refers to the current implementation >> >> >> > > > "New" refers to this patches implementation >> >> >> > > > >> >> >> > > > Performance data attached in memset-data.pdf >> >> >> > > > >> >> >> > > > Some notes on the numbers: >> >> >> > > > >> >> >> > > > I only included numbers for the proper VEC_SIZE for the corresponding >> >> >> > > > cpu. >> >> >> > > > >> >> >> > > > skl -> avx2 >> >> >> > > > icl -> evex >> >> >> > > > tgl -> evex >> >> >> > > > >> >> >> > > > The changes only affect sizes > 2 * VEC_SIZE. The performance >> >> >> > > > differences in the size <= 2 * VEC_SIZE come from changes in alignment >> >> >> > > > after linking (i.e ENTRY aligns to 16, but performance can be affected >> >> >> > > > by alignment % 64 or alignment % 4096) and generally affects >> >> >> > > > throughput only, not latency (i.e with an lfence to the benchmark loop >> >> >> > > > the deviations go away). Generally I think they can be ignored (both >> >> >> > > > positive and negative affects). >> >> >> > > > >> >> >> > > > The interesting part of the data is in the medium size range [128, >> >> >> > > > 1024] where the new implementation has a reasonable speedup. This is >> >> >> > > > especially pronounced when the more conservative alignment saves a >> >> >> > > > full loop iteration. The only significant exception is >> >> >> > > > skylake-avx2-erms case for size = 416, alignment = 416 where the >> >> >> > > > current implementation is meaningfully faster. I am unsure of the root >> >> >> > > > cause for this. The skylake-avx2 case only performs a bit worse in >> >> >> > > > this case which makes me think part of it is code alignment related, >> >> >> > > > though comparative to the speedup in other size/alignment >> >> >> > > > configurations it is still a trough. Despite this, I still think the >> >> >> > > > numbers are overall an improvement. >> >> >> > > > >> >> >> > > > As well due to aligning the loop (and possibly slightly more efficient >> >> >> > > > DSB behavior with the replacement of addq 4 * VEC_SIZE in the loop >> >> >> > > > with subq -4 * VEC_SIZE) in the non-erms cases there is often a slight >> >> >> > > > improvement to the main loop for large sizes. >> >> >> > > > >> >> >> > > > .../multiarch/memset-vec-unaligned-erms.S | 50 +++++++++++-------- >> >> >> > > > 1 file changed, 28 insertions(+), 22 deletions(-) >> >> >> > > > >> >> >> > > > diff --git a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S >> >> >> > > > index 08cfa49bd1..ff196844a0 100644 >> >> >> > > > --- a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S >> >> >> > > > +++ b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S >> >> >> > > > @@ -173,17 +173,22 @@ ENTRY (MEMSET_SYMBOL (__memset, unaligned_erms)) >> >> >> > > > VMOVU %VEC(0), (%rdi) >> >> >> > > > VZEROUPPER_RETURN >> >> >> > > > >> >> >> > > > + .p2align 4 >> >> >> > > > L(stosb_more_2x_vec): >> >> >> > > > cmp __x86_rep_stosb_threshold(%rip), %RDX_LP >> >> >> > > > ja L(stosb) >> >> >> > > > +#else >> >> >> > > > + .p2align 4 >> >> >> > > > #endif >> >> >> > > > L(more_2x_vec): >> >> >> > > > - cmpq $(VEC_SIZE * 4), %rdx >> >> >> > > > - ja L(loop_start) >> >> >> > > > + /* Stores to first 2x VEC before cmp as any path forward will >> >> >> > > > + require it. */ >> >> >> > > > VMOVU %VEC(0), (%rdi) >> >> >> > > > VMOVU %VEC(0), VEC_SIZE(%rdi) >> >> >> > > > - VMOVU %VEC(0), -VEC_SIZE(%rdi,%rdx) >> >> >> > > > + cmpq $(VEC_SIZE * 4), %rdx >> >> >> > > > + ja L(loop_start) >> >> >> > > > VMOVU %VEC(0), -(VEC_SIZE * 2)(%rdi,%rdx) >> >> >> > > > + VMOVU %VEC(0), -VEC_SIZE(%rdi,%rdx) >> >> >> > > > L(return): >> >> >> > > > #if VEC_SIZE > 16 >> >> >> > > > ZERO_UPPER_VEC_REGISTERS_RETURN >> >> >> > > > @@ -192,28 +197,29 @@ L(return): >> >> >> > > > #endif >> >> >> > > > >> >> >> > > > L(loop_start): >> >> >> > > > - leaq (VEC_SIZE * 4)(%rdi), %rcx >> >> >> > > > - VMOVU %VEC(0), (%rdi) >> >> >> > > > - andq $-(VEC_SIZE * 4), %rcx >> >> >> > > > - VMOVU %VEC(0), -VEC_SIZE(%rdi,%rdx) >> >> >> > > > - VMOVU %VEC(0), VEC_SIZE(%rdi) >> >> >> > > > - VMOVU %VEC(0), -(VEC_SIZE * 2)(%rdi,%rdx) >> >> >> > > > VMOVU %VEC(0), (VEC_SIZE * 2)(%rdi) >> >> >> > > > - VMOVU %VEC(0), -(VEC_SIZE * 3)(%rdi,%rdx) >> >> >> > > > VMOVU %VEC(0), (VEC_SIZE * 3)(%rdi) >> >> >> > > > - VMOVU %VEC(0), -(VEC_SIZE * 4)(%rdi,%rdx) >> >> >> > > > - addq %rdi, %rdx >> >> >> > > > - andq $-(VEC_SIZE * 4), %rdx >> >> >> > > > - cmpq %rdx, %rcx >> >> >> > > > - je L(return) >> >> >> > > > + cmpq $(VEC_SIZE * 8), %rdx >> >> >> > > > + jbe L(loop_end) >> >> >> > > > + andq $-(VEC_SIZE * 2), %rdi >> >> >> > > > + subq $-(VEC_SIZE * 4), %rdi >> >> >> > > > + leaq -(VEC_SIZE * 4)(%rax, %rdx), %rcx >> >> > >> >> > If this overflows loop will exit first iteration. Is that an issue? >> >> >> >> Please do following: >> >> >> >> 1. Update memset assembly codes with >> >> check conditions for underwrite/overwrite. >> >> if true then branch to the HLT instruction. >> > >> > codes? Just this file or others as well? >> >> All string/memory functions should work for all valid inputs. >> >> >> >> >> 2. Update and run memset test to verify the test coverage for the condition. >> > >> > What is the desired result? Segfault? >> >> For invalid inputs, anything can happen, including segfault. >> >> >> >> >> 3. Update memset assembly codes to cover such conditions. >> > >> > Memcmp as well? What about wmemset? Currently (and with previous versions as well) >> > a value like 2^63 would cause similar behavior after the `salq $2, %rdx` >> >> Is this condition a valid input? If not, there is nothing to do. > > > AFAIK and value length [0, SIZE_MAX] for either is a valid input for any string/memory function > from the perspective of the standard. But I don't know if it has any qualifiers. > > As well I don't know what is meant to happen if the machine/OS is unable to perform the necessary operations. > Normally you would see segfault. The difference from the 3 commits below is essentially just that it won't > segfault. > > That is already case, and has been for a while, certain inputs for many of the wcsmbs will have roughly the > same behavior from `salq $2, %rdx`. For example wmemset(ptr, 0, 2^62 + 1) will currently set 1 wchar then > return. Please construct a testcase which will return normally instead of segfault if not fixed. So on x86-64, the expected behavior should be segfault. >> >> >> >> >> >> >> >> Thanks. >> >> >> >> > If so the following commits from me have the same bug: >> >> > https://sourceware.org/git/?p=glibc.git;a=commit;h=6abf27980a947f9b6e514d6b33b83059d39566ae >> >> > https://sourceware.org/git/?p=glibc.git;a=commit;h=4ad473e97acdc5f6d811755b67c09f2128a644ce >> >> > https://sourceware.org/git/?p=glibc.git;a=commit;h=16d12015c57701b08d7bbed6ec536641bcafb428 >> >> >> >> >> >> >> >> >> >> >> >> > > > + .p2align 4 >> >> >> > > > L(loop): >> >> >> > > > - VMOVA %VEC(0), (%rcx) >> >> >> > > > - VMOVA %VEC(0), VEC_SIZE(%rcx) >> >> >> > > > - VMOVA %VEC(0), (VEC_SIZE * 2)(%rcx) >> >> >> > > > - VMOVA %VEC(0), (VEC_SIZE * 3)(%rcx) >> >> >> > > > - addq $(VEC_SIZE * 4), %rcx >> >> >> > > > - cmpq %rcx, %rdx >> >> >> > > > - jne L(loop) >> >> >> > > > + VMOVA %VEC(0), (%rdi) >> >> >> > > > + VMOVA %VEC(0), VEC_SIZE(%rdi) >> >> >> > > > + VMOVA %VEC(0), (VEC_SIZE * 2)(%rdi) >> >> >> > > > + VMOVA %VEC(0), (VEC_SIZE * 3)(%rdi) >> >> >> > > > + subq $-(VEC_SIZE * 4), %rdi >> >> >> > > > + cmpq %rcx, %rdi >> >> >> > > > + jb L(loop) >> >> > >> >> > Issue because %rdi will not be below %rcx here. >> >> >> >> >> >> > > > +L(loop_end): >> >> >> > > > + /* NB: rax is set as ptr in MEMSET_VDUP_TO_VEC0_AND_SET_RETURN. >> >> >> > > > + rdx as length is also unchanged. */ >> >> >> > > > + VMOVU %VEC(0), -(VEC_SIZE * 4)(%rax, %rdx) >> >> >> > > > + VMOVU %VEC(0), -(VEC_SIZE * 3)(%rax, %rdx) >> >> >> > > > + VMOVU %VEC(0), -(VEC_SIZE * 2)(%rax, %rdx) >> >> >> > > > + VMOVU %VEC(0), -VEC_SIZE(%rax, %rdx) >> >> >> > > > VZEROUPPER_SHORT_RETURN >> >> >> > > > >> >> >> > > > .p2align 4 >> >> >> > > > -- >> >> >> > > > 2.25.1 >> >> >> > > > >> >> >> > > >> >> >> > > LGTM. >> >> >> > >> >> >> > Awesome! >> >> >> > >> >> >> > For future patches do you prefer performance numbers like this or >> >> >> > raw text? Or some other alternative? >> >> >> >> >> >> The current data format is fine. Thanks. >> >> >> >> >> >> > > >> >> >> > > Reviewed-by: H.J. Lu <hjl.tools@gmail.com> >> >> >> > > >> >> >> > > Thanks. >> >> >> > > >> >> >> > > -- >> >> >> > > H.J. >> >> >> >> >> >> >> >> >> >> >> >> -- >> >> >> H.J. >> >> >> >> >> >> >> >> -- >> >> H.J. >> >> >> >> -- >> H.J.
On Mon, Jun 7, 2021 at 12:34 AM H.J. Lu <hjl.tools@gmail.com> wrote: > On Sun, Jun 6, 2021 at 8:39 PM Noah Goldstein <goldstein.w.n@gmail.com> > wrote: > > > > > > > > On Sun, Jun 6, 2021 at 11:21 PM H.J. Lu <hjl.tools@gmail.com> wrote: > >> > >> On Sun, Jun 6, 2021 at 8:06 PM Noah Goldstein <goldstein.w.n@gmail.com> > wrote: > >> > > >> > > >> > > >> > On Sun, Jun 6, 2021 at 10:48 PM H.J. Lu <hjl.tools@gmail.com> wrote: > >> >> > >> >> On Sun, Jun 6, 2021 at 7:35 PM Noah Goldstein < > goldstein.w.n@gmail.com> wrote: > >> >> > > >> >> > > >> >> > > >> >> > On Thu, May 20, 2021 at 4:45 PM H.J. Lu <hjl.tools@gmail.com> > wrote: > >> >> >> > >> >> >> On Thu, May 20, 2021 at 1:03 PM Noah Goldstein < > goldstein.w.n@gmail.com> wrote: > >> >> >> > > >> >> >> > On Thu, May 20, 2021 at 3:40 PM H.J. Lu <hjl.tools@gmail.com> > wrote: > >> >> >> > > > >> >> >> > > On Thu, May 20, 2021 at 11:45 AM Noah Goldstein < > goldstein.w.n@gmail.com> wrote: > >> >> >> > > > > >> >> >> > > > No bug. This commit makes a few small improvements to > >> >> >> > > > memset-vec-unaligned-erms.S. The changes are 1) only > aligning to 64 > >> >> >> > > > instead of 128. Either alignment will perform equally well > in a loop > >> >> >> > > > and 128 just increases the odds of having to do an extra > iteration > >> >> >> > > > which can be significant overhead for small values. 2) > Align some > >> >> >> > > > targets and the loop. 3) Remove an ALU from the alignment > process. 4) > >> >> >> > > > Reorder the last 4x VEC so that they are stored after the > loop. 5) > >> >> >> > > > Move the condition for leq 8x VEC to before the alignment > >> >> >> > > > process. test-memset and test-wmemset are both passing. > >> >> >> > > > > >> >> >> > > > Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com> > >> >> >> > > > --- > >> >> >> > > > Tests where run on the following CPUs: > >> >> >> > > > > >> >> >> > > > Skylake: > https://ark.intel.com/content/www/us/en/ark/products/149091/intel-core-i7-8565u-processor-8m-cache-up-to-4-60-ghz.html > >> >> >> > > > > >> >> >> > > > Icelake: > https://ark.intel.com/content/www/us/en/ark/products/196597/intel-core-i7-1065g7-processor-8m-cache-up-to-3-90-ghz.html > >> >> >> > > > > >> >> >> > > > Tigerlake: > https://ark.intel.com/content/www/us/en/ark/products/208921/intel-core-i7-1165g7-processor-12m-cache-up-to-4-70-ghz-with-ipu.html > >> >> >> > > > > >> >> >> > > > All times are the geometric mean of N=50. The unit of time > is > >> >> >> > > > seconds. > >> >> >> > > > > >> >> >> > > > "Cur" refers to the current implementation > >> >> >> > > > "New" refers to this patches implementation > >> >> >> > > > > >> >> >> > > > Performance data attached in memset-data.pdf > >> >> >> > > > > >> >> >> > > > Some notes on the numbers: > >> >> >> > > > > >> >> >> > > > I only included numbers for the proper VEC_SIZE for the > corresponding > >> >> >> > > > cpu. > >> >> >> > > > > >> >> >> > > > skl -> avx2 > >> >> >> > > > icl -> evex > >> >> >> > > > tgl -> evex > >> >> >> > > > > >> >> >> > > > The changes only affect sizes > 2 * VEC_SIZE. The > performance > >> >> >> > > > differences in the size <= 2 * VEC_SIZE come from changes > in alignment > >> >> >> > > > after linking (i.e ENTRY aligns to 16, but performance can > be affected > >> >> >> > > > by alignment % 64 or alignment % 4096) and generally affects > >> >> >> > > > throughput only, not latency (i.e with an lfence to the > benchmark loop > >> >> >> > > > the deviations go away). Generally I think they can be > ignored (both > >> >> >> > > > positive and negative affects). > >> >> >> > > > > >> >> >> > > > The interesting part of the data is in the medium size > range [128, > >> >> >> > > > 1024] where the new implementation has a reasonable > speedup. This is > >> >> >> > > > especially pronounced when the more conservative alignment > saves a > >> >> >> > > > full loop iteration. The only significant exception is > >> >> >> > > > skylake-avx2-erms case for size = 416, alignment = 416 > where the > >> >> >> > > > current implementation is meaningfully faster. I am unsure > of the root > >> >> >> > > > cause for this. The skylake-avx2 case only performs a bit > worse in > >> >> >> > > > this case which makes me think part of it is code alignment > related, > >> >> >> > > > though comparative to the speedup in other size/alignment > >> >> >> > > > configurations it is still a trough. Despite this, I still > think the > >> >> >> > > > numbers are overall an improvement. > >> >> >> > > > > >> >> >> > > > As well due to aligning the loop (and possibly slightly > more efficient > >> >> >> > > > DSB behavior with the replacement of addq 4 * VEC_SIZE in > the loop > >> >> >> > > > with subq -4 * VEC_SIZE) in the non-erms cases there is > often a slight > >> >> >> > > > improvement to the main loop for large sizes. > >> >> >> > > > > >> >> >> > > > .../multiarch/memset-vec-unaligned-erms.S | 50 > +++++++++++-------- > >> >> >> > > > 1 file changed, 28 insertions(+), 22 deletions(-) > >> >> >> > > > > >> >> >> > > > diff --git > a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S > b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S > >> >> >> > > > index 08cfa49bd1..ff196844a0 100644 > >> >> >> > > > --- a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S > >> >> >> > > > +++ b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S > >> >> >> > > > @@ -173,17 +173,22 @@ ENTRY (MEMSET_SYMBOL (__memset, > unaligned_erms)) > >> >> >> > > > VMOVU %VEC(0), (%rdi) > >> >> >> > > > VZEROUPPER_RETURN > >> >> >> > > > > >> >> >> > > > + .p2align 4 > >> >> >> > > > L(stosb_more_2x_vec): > >> >> >> > > > cmp __x86_rep_stosb_threshold(%rip), %RDX_LP > >> >> >> > > > ja L(stosb) > >> >> >> > > > +#else > >> >> >> > > > + .p2align 4 > >> >> >> > > > #endif > >> >> >> > > > L(more_2x_vec): > >> >> >> > > > - cmpq $(VEC_SIZE * 4), %rdx > >> >> >> > > > - ja L(loop_start) > >> >> >> > > > + /* Stores to first 2x VEC before cmp as any path > forward will > >> >> >> > > > + require it. */ > >> >> >> > > > VMOVU %VEC(0), (%rdi) > >> >> >> > > > VMOVU %VEC(0), VEC_SIZE(%rdi) > >> >> >> > > > - VMOVU %VEC(0), -VEC_SIZE(%rdi,%rdx) > >> >> >> > > > + cmpq $(VEC_SIZE * 4), %rdx > >> >> >> > > > + ja L(loop_start) > >> >> >> > > > VMOVU %VEC(0), -(VEC_SIZE * 2)(%rdi,%rdx) > >> >> >> > > > + VMOVU %VEC(0), -VEC_SIZE(%rdi,%rdx) > >> >> >> > > > L(return): > >> >> >> > > > #if VEC_SIZE > 16 > >> >> >> > > > ZERO_UPPER_VEC_REGISTERS_RETURN > >> >> >> > > > @@ -192,28 +197,29 @@ L(return): > >> >> >> > > > #endif > >> >> >> > > > > >> >> >> > > > L(loop_start): > >> >> >> > > > - leaq (VEC_SIZE * 4)(%rdi), %rcx > >> >> >> > > > - VMOVU %VEC(0), (%rdi) > >> >> >> > > > - andq $-(VEC_SIZE * 4), %rcx > >> >> >> > > > - VMOVU %VEC(0), -VEC_SIZE(%rdi,%rdx) > >> >> >> > > > - VMOVU %VEC(0), VEC_SIZE(%rdi) > >> >> >> > > > - VMOVU %VEC(0), -(VEC_SIZE * 2)(%rdi,%rdx) > >> >> >> > > > VMOVU %VEC(0), (VEC_SIZE * 2)(%rdi) > >> >> >> > > > - VMOVU %VEC(0), -(VEC_SIZE * 3)(%rdi,%rdx) > >> >> >> > > > VMOVU %VEC(0), (VEC_SIZE * 3)(%rdi) > >> >> >> > > > - VMOVU %VEC(0), -(VEC_SIZE * 4)(%rdi,%rdx) > >> >> >> > > > - addq %rdi, %rdx > >> >> >> > > > - andq $-(VEC_SIZE * 4), %rdx > >> >> >> > > > - cmpq %rdx, %rcx > >> >> >> > > > - je L(return) > >> >> >> > > > + cmpq $(VEC_SIZE * 8), %rdx > >> >> >> > > > + jbe L(loop_end) > >> >> >> > > > + andq $-(VEC_SIZE * 2), %rdi > >> >> >> > > > + subq $-(VEC_SIZE * 4), %rdi > >> >> >> > > > + leaq -(VEC_SIZE * 4)(%rax, %rdx), %rcx > >> >> > > >> >> > If this overflows loop will exit first iteration. Is that an issue? > >> >> > >> >> Please do following: > >> >> > >> >> 1. Update memset assembly codes with > >> >> check conditions for underwrite/overwrite. > >> >> if true then branch to the HLT instruction. > >> > > >> > codes? Just this file or others as well? > >> > >> All string/memory functions should work for all valid inputs. > >> > >> >> > >> >> 2. Update and run memset test to verify the test coverage for the > condition. > >> > > >> > What is the desired result? Segfault? > >> > >> For invalid inputs, anything can happen, including segfault. > >> > >> >> > >> >> 3. Update memset assembly codes to cover such conditions. > >> > > >> > Memcmp as well? What about wmemset? Currently (and with previous > versions as well) > >> > a value like 2^63 would cause similar behavior after the `salq $2, > %rdx` > >> > >> Is this condition a valid input? If not, there is nothing to do. > > > > > > AFAIK and value length [0, SIZE_MAX] for either is a valid input for any > string/memory function > > from the perspective of the standard. But I don't know if it has any > qualifiers. > > > > As well I don't know what is meant to happen if the machine/OS is unable > to perform the necessary operations. > > Normally you would see segfault. The difference from the 3 commits below > is essentially just that it won't > > segfault. > > > > That is already case, and has been for a while, certain inputs for many > of the wcsmbs will have roughly the > > same behavior from `salq $2, %rdx`. For example wmemset(ptr, 0, 2^62 + > 1) will currently set 1 wchar then > > return. > > Please construct a testcase which will return normally instead of > segfault if not fixed. So on > x86-64, the expected behavior should be segfault. > memcmp / test for it as well? > > >> > >> > >> >> > >> >> > >> >> Thanks. > >> >> > >> >> > If so the following commits from me have the same bug: > >> >> > > https://sourceware.org/git/?p=glibc.git;a=commit;h=6abf27980a947f9b6e514d6b33b83059d39566ae > >> >> > > https://sourceware.org/git/?p=glibc.git;a=commit;h=4ad473e97acdc5f6d811755b67c09f2128a644ce > >> >> > > https://sourceware.org/git/?p=glibc.git;a=commit;h=16d12015c57701b08d7bbed6ec536641bcafb428 > >> >> >> > >> >> >> > >> >> >> > >> >> >> > > > + .p2align 4 > >> >> >> > > > L(loop): > >> >> >> > > > - VMOVA %VEC(0), (%rcx) > >> >> >> > > > - VMOVA %VEC(0), VEC_SIZE(%rcx) > >> >> >> > > > - VMOVA %VEC(0), (VEC_SIZE * 2)(%rcx) > >> >> >> > > > - VMOVA %VEC(0), (VEC_SIZE * 3)(%rcx) > >> >> >> > > > - addq $(VEC_SIZE * 4), %rcx > >> >> >> > > > - cmpq %rcx, %rdx > >> >> >> > > > - jne L(loop) > >> >> >> > > > + VMOVA %VEC(0), (%rdi) > >> >> >> > > > + VMOVA %VEC(0), VEC_SIZE(%rdi) > >> >> >> > > > + VMOVA %VEC(0), (VEC_SIZE * 2)(%rdi) > >> >> >> > > > + VMOVA %VEC(0), (VEC_SIZE * 3)(%rdi) > >> >> >> > > > + subq $-(VEC_SIZE * 4), %rdi > >> >> >> > > > + cmpq %rcx, %rdi > >> >> >> > > > + jb L(loop) > >> >> > > >> >> > Issue because %rdi will not be below %rcx here. > >> >> >> > >> >> >> > > > +L(loop_end): > >> >> >> > > > + /* NB: rax is set as ptr in > MEMSET_VDUP_TO_VEC0_AND_SET_RETURN. > >> >> >> > > > + rdx as length is also unchanged. */ > >> >> >> > > > + VMOVU %VEC(0), -(VEC_SIZE * 4)(%rax, %rdx) > >> >> >> > > > + VMOVU %VEC(0), -(VEC_SIZE * 3)(%rax, %rdx) > >> >> >> > > > + VMOVU %VEC(0), -(VEC_SIZE * 2)(%rax, %rdx) > >> >> >> > > > + VMOVU %VEC(0), -VEC_SIZE(%rax, %rdx) > >> >> >> > > > VZEROUPPER_SHORT_RETURN > >> >> >> > > > > >> >> >> > > > .p2align 4 > >> >> >> > > > -- > >> >> >> > > > 2.25.1 > >> >> >> > > > > >> >> >> > > > >> >> >> > > LGTM. > >> >> >> > > >> >> >> > Awesome! > >> >> >> > > >> >> >> > For future patches do you prefer performance numbers like this > or > >> >> >> > raw text? Or some other alternative? > >> >> >> > >> >> >> The current data format is fine. Thanks. > >> >> >> > >> >> >> > > > >> >> >> > > Reviewed-by: H.J. Lu <hjl.tools@gmail.com> > >> >> >> > > > >> >> >> > > Thanks. > >> >> >> > > > >> >> >> > > -- > >> >> >> > > H.J. > >> >> >> > >> >> >> > >> >> >> > >> >> >> -- > >> >> >> H.J. > >> >> > >> >> > >> >> > >> >> -- > >> >> H.J. > >> > >> > >> > >> -- > >> H.J. > > > > -- > H.J. >
On Mon, Jun 7, 2021 at 12:12 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote: > > > > On Mon, Jun 7, 2021 at 12:34 AM H.J. Lu <hjl.tools@gmail.com> wrote: >> >> On Sun, Jun 6, 2021 at 8:39 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote: >> > >> > >> > >> > On Sun, Jun 6, 2021 at 11:21 PM H.J. Lu <hjl.tools@gmail.com> wrote: >> >> >> >> On Sun, Jun 6, 2021 at 8:06 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote: >> >> > >> >> > >> >> > >> >> > On Sun, Jun 6, 2021 at 10:48 PM H.J. Lu <hjl.tools@gmail.com> wrote: >> >> >> >> >> >> On Sun, Jun 6, 2021 at 7:35 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote: >> >> >> > >> >> >> > >> >> >> > >> >> >> > On Thu, May 20, 2021 at 4:45 PM H.J. Lu <hjl.tools@gmail.com> wrote: >> >> >> >> >> >> >> >> On Thu, May 20, 2021 at 1:03 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote: >> >> >> >> > >> >> >> >> > On Thu, May 20, 2021 at 3:40 PM H.J. Lu <hjl.tools@gmail.com> wrote: >> >> >> >> > > >> >> >> >> > > On Thu, May 20, 2021 at 11:45 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote: >> >> >> >> > > > >> >> >> >> > > > No bug. This commit makes a few small improvements to >> >> >> >> > > > memset-vec-unaligned-erms.S. The changes are 1) only aligning to 64 >> >> >> >> > > > instead of 128. Either alignment will perform equally well in a loop >> >> >> >> > > > and 128 just increases the odds of having to do an extra iteration >> >> >> >> > > > which can be significant overhead for small values. 2) Align some >> >> >> >> > > > targets and the loop. 3) Remove an ALU from the alignment process. 4) >> >> >> >> > > > Reorder the last 4x VEC so that they are stored after the loop. 5) >> >> >> >> > > > Move the condition for leq 8x VEC to before the alignment >> >> >> >> > > > process. test-memset and test-wmemset are both passing. >> >> >> >> > > > >> >> >> >> > > > Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com> >> >> >> >> > > > --- >> >> >> >> > > > Tests where run on the following CPUs: >> >> >> >> > > > >> >> >> >> > > > Skylake: https://ark.intel.com/content/www/us/en/ark/products/149091/intel-core-i7-8565u-processor-8m-cache-up-to-4-60-ghz.html >> >> >> >> > > > >> >> >> >> > > > Icelake: https://ark.intel.com/content/www/us/en/ark/products/196597/intel-core-i7-1065g7-processor-8m-cache-up-to-3-90-ghz.html >> >> >> >> > > > >> >> >> >> > > > Tigerlake: https://ark.intel.com/content/www/us/en/ark/products/208921/intel-core-i7-1165g7-processor-12m-cache-up-to-4-70-ghz-with-ipu.html >> >> >> >> > > > >> >> >> >> > > > All times are the geometric mean of N=50. The unit of time is >> >> >> >> > > > seconds. >> >> >> >> > > > >> >> >> >> > > > "Cur" refers to the current implementation >> >> >> >> > > > "New" refers to this patches implementation >> >> >> >> > > > >> >> >> >> > > > Performance data attached in memset-data.pdf >> >> >> >> > > > >> >> >> >> > > > Some notes on the numbers: >> >> >> >> > > > >> >> >> >> > > > I only included numbers for the proper VEC_SIZE for the corresponding >> >> >> >> > > > cpu. >> >> >> >> > > > >> >> >> >> > > > skl -> avx2 >> >> >> >> > > > icl -> evex >> >> >> >> > > > tgl -> evex >> >> >> >> > > > >> >> >> >> > > > The changes only affect sizes > 2 * VEC_SIZE. The performance >> >> >> >> > > > differences in the size <= 2 * VEC_SIZE come from changes in alignment >> >> >> >> > > > after linking (i.e ENTRY aligns to 16, but performance can be affected >> >> >> >> > > > by alignment % 64 or alignment % 4096) and generally affects >> >> >> >> > > > throughput only, not latency (i.e with an lfence to the benchmark loop >> >> >> >> > > > the deviations go away). Generally I think they can be ignored (both >> >> >> >> > > > positive and negative affects). >> >> >> >> > > > >> >> >> >> > > > The interesting part of the data is in the medium size range [128, >> >> >> >> > > > 1024] where the new implementation has a reasonable speedup. This is >> >> >> >> > > > especially pronounced when the more conservative alignment saves a >> >> >> >> > > > full loop iteration. The only significant exception is >> >> >> >> > > > skylake-avx2-erms case for size = 416, alignment = 416 where the >> >> >> >> > > > current implementation is meaningfully faster. I am unsure of the root >> >> >> >> > > > cause for this. The skylake-avx2 case only performs a bit worse in >> >> >> >> > > > this case which makes me think part of it is code alignment related, >> >> >> >> > > > though comparative to the speedup in other size/alignment >> >> >> >> > > > configurations it is still a trough. Despite this, I still think the >> >> >> >> > > > numbers are overall an improvement. >> >> >> >> > > > >> >> >> >> > > > As well due to aligning the loop (and possibly slightly more efficient >> >> >> >> > > > DSB behavior with the replacement of addq 4 * VEC_SIZE in the loop >> >> >> >> > > > with subq -4 * VEC_SIZE) in the non-erms cases there is often a slight >> >> >> >> > > > improvement to the main loop for large sizes. >> >> >> >> > > > >> >> >> >> > > > .../multiarch/memset-vec-unaligned-erms.S | 50 +++++++++++-------- >> >> >> >> > > > 1 file changed, 28 insertions(+), 22 deletions(-) >> >> >> >> > > > >> >> >> >> > > > diff --git a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S >> >> >> >> > > > index 08cfa49bd1..ff196844a0 100644 >> >> >> >> > > > --- a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S >> >> >> >> > > > +++ b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S >> >> >> >> > > > @@ -173,17 +173,22 @@ ENTRY (MEMSET_SYMBOL (__memset, unaligned_erms)) >> >> >> >> > > > VMOVU %VEC(0), (%rdi) >> >> >> >> > > > VZEROUPPER_RETURN >> >> >> >> > > > >> >> >> >> > > > + .p2align 4 >> >> >> >> > > > L(stosb_more_2x_vec): >> >> >> >> > > > cmp __x86_rep_stosb_threshold(%rip), %RDX_LP >> >> >> >> > > > ja L(stosb) >> >> >> >> > > > +#else >> >> >> >> > > > + .p2align 4 >> >> >> >> > > > #endif >> >> >> >> > > > L(more_2x_vec): >> >> >> >> > > > - cmpq $(VEC_SIZE * 4), %rdx >> >> >> >> > > > - ja L(loop_start) >> >> >> >> > > > + /* Stores to first 2x VEC before cmp as any path forward will >> >> >> >> > > > + require it. */ >> >> >> >> > > > VMOVU %VEC(0), (%rdi) >> >> >> >> > > > VMOVU %VEC(0), VEC_SIZE(%rdi) >> >> >> >> > > > - VMOVU %VEC(0), -VEC_SIZE(%rdi,%rdx) >> >> >> >> > > > + cmpq $(VEC_SIZE * 4), %rdx >> >> >> >> > > > + ja L(loop_start) >> >> >> >> > > > VMOVU %VEC(0), -(VEC_SIZE * 2)(%rdi,%rdx) >> >> >> >> > > > + VMOVU %VEC(0), -VEC_SIZE(%rdi,%rdx) >> >> >> >> > > > L(return): >> >> >> >> > > > #if VEC_SIZE > 16 >> >> >> >> > > > ZERO_UPPER_VEC_REGISTERS_RETURN >> >> >> >> > > > @@ -192,28 +197,29 @@ L(return): >> >> >> >> > > > #endif >> >> >> >> > > > >> >> >> >> > > > L(loop_start): >> >> >> >> > > > - leaq (VEC_SIZE * 4)(%rdi), %rcx >> >> >> >> > > > - VMOVU %VEC(0), (%rdi) >> >> >> >> > > > - andq $-(VEC_SIZE * 4), %rcx >> >> >> >> > > > - VMOVU %VEC(0), -VEC_SIZE(%rdi,%rdx) >> >> >> >> > > > - VMOVU %VEC(0), VEC_SIZE(%rdi) >> >> >> >> > > > - VMOVU %VEC(0), -(VEC_SIZE * 2)(%rdi,%rdx) >> >> >> >> > > > VMOVU %VEC(0), (VEC_SIZE * 2)(%rdi) >> >> >> >> > > > - VMOVU %VEC(0), -(VEC_SIZE * 3)(%rdi,%rdx) >> >> >> >> > > > VMOVU %VEC(0), (VEC_SIZE * 3)(%rdi) >> >> >> >> > > > - VMOVU %VEC(0), -(VEC_SIZE * 4)(%rdi,%rdx) >> >> >> >> > > > - addq %rdi, %rdx >> >> >> >> > > > - andq $-(VEC_SIZE * 4), %rdx >> >> >> >> > > > - cmpq %rdx, %rcx >> >> >> >> > > > - je L(return) >> >> >> >> > > > + cmpq $(VEC_SIZE * 8), %rdx >> >> >> >> > > > + jbe L(loop_end) >> >> >> >> > > > + andq $-(VEC_SIZE * 2), %rdi >> >> >> >> > > > + subq $-(VEC_SIZE * 4), %rdi >> >> >> >> > > > + leaq -(VEC_SIZE * 4)(%rax, %rdx), %rcx >> >> >> > >> >> >> > If this overflows loop will exit first iteration. Is that an issue? >> >> >> >> >> >> Please do following: >> >> >> >> >> >> 1. Update memset assembly codes with >> >> >> check conditions for underwrite/overwrite. >> >> >> if true then branch to the HLT instruction. >> >> > >> >> > codes? Just this file or others as well? >> >> >> >> All string/memory functions should work for all valid inputs. >> >> >> >> >> >> >> >> 2. Update and run memset test to verify the test coverage for the condition. >> >> > >> >> > What is the desired result? Segfault? >> >> >> >> For invalid inputs, anything can happen, including segfault. >> >> >> >> >> >> >> >> 3. Update memset assembly codes to cover such conditions. >> >> > >> >> > Memcmp as well? What about wmemset? Currently (and with previous versions as well) >> >> > a value like 2^63 would cause similar behavior after the `salq $2, %rdx` >> >> >> >> Is this condition a valid input? If not, there is nothing to do. >> > >> > >> > AFAIK and value length [0, SIZE_MAX] for either is a valid input for any string/memory function >> > from the perspective of the standard. But I don't know if it has any qualifiers. >> > >> > As well I don't know what is meant to happen if the machine/OS is unable to perform the necessary operations. >> > Normally you would see segfault. The difference from the 3 commits below is essentially just that it won't >> > segfault. >> > >> > That is already case, and has been for a while, certain inputs for many of the wcsmbs will have roughly the >> > same behavior from `salq $2, %rdx`. For example wmemset(ptr, 0, 2^62 + 1) will currently set 1 wchar then >> > return. >> >> Please construct a testcase which will return normally instead of >> segfault if not fixed. So on >> x86-64, the expected behavior should be segfault. > > > memcmp / test for it as well? Yes. >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> Thanks. >> >> >> >> >> >> > If so the following commits from me have the same bug: >> >> >> > https://sourceware.org/git/?p=glibc.git;a=commit;h=6abf27980a947f9b6e514d6b33b83059d39566ae >> >> >> > https://sourceware.org/git/?p=glibc.git;a=commit;h=4ad473e97acdc5f6d811755b67c09f2128a644ce >> >> >> > https://sourceware.org/git/?p=glibc.git;a=commit;h=16d12015c57701b08d7bbed6ec536641bcafb428 >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> > > > + .p2align 4 >> >> >> >> > > > L(loop): >> >> >> >> > > > - VMOVA %VEC(0), (%rcx) >> >> >> >> > > > - VMOVA %VEC(0), VEC_SIZE(%rcx) >> >> >> >> > > > - VMOVA %VEC(0), (VEC_SIZE * 2)(%rcx) >> >> >> >> > > > - VMOVA %VEC(0), (VEC_SIZE * 3)(%rcx) >> >> >> >> > > > - addq $(VEC_SIZE * 4), %rcx >> >> >> >> > > > - cmpq %rcx, %rdx >> >> >> >> > > > - jne L(loop) >> >> >> >> > > > + VMOVA %VEC(0), (%rdi) >> >> >> >> > > > + VMOVA %VEC(0), VEC_SIZE(%rdi) >> >> >> >> > > > + VMOVA %VEC(0), (VEC_SIZE * 2)(%rdi) >> >> >> >> > > > + VMOVA %VEC(0), (VEC_SIZE * 3)(%rdi) >> >> >> >> > > > + subq $-(VEC_SIZE * 4), %rdi >> >> >> >> > > > + cmpq %rcx, %rdi >> >> >> >> > > > + jb L(loop) >> >> >> > >> >> >> > Issue because %rdi will not be below %rcx here. >> >> >> >> >> >> >> >> > > > +L(loop_end): >> >> >> >> > > > + /* NB: rax is set as ptr in MEMSET_VDUP_TO_VEC0_AND_SET_RETURN. >> >> >> >> > > > + rdx as length is also unchanged. */ >> >> >> >> > > > + VMOVU %VEC(0), -(VEC_SIZE * 4)(%rax, %rdx) >> >> >> >> > > > + VMOVU %VEC(0), -(VEC_SIZE * 3)(%rax, %rdx) >> >> >> >> > > > + VMOVU %VEC(0), -(VEC_SIZE * 2)(%rax, %rdx) >> >> >> >> > > > + VMOVU %VEC(0), -VEC_SIZE(%rax, %rdx) >> >> >> >> > > > VZEROUPPER_SHORT_RETURN >> >> >> >> > > > >> >> >> >> > > > .p2align 4 >> >> >> >> > > > -- >> >> >> >> > > > 2.25.1 >> >> >> >> > > > >> >> >> >> > > >> >> >> >> > > LGTM. >> >> >> >> > >> >> >> >> > Awesome! >> >> >> >> > >> >> >> >> > For future patches do you prefer performance numbers like this or >> >> >> >> > raw text? Or some other alternative? >> >> >> >> >> >> >> >> The current data format is fine. Thanks. >> >> >> >> >> >> >> >> > > >> >> >> >> > > Reviewed-by: H.J. Lu <hjl.tools@gmail.com> >> >> >> >> > > >> >> >> >> > > Thanks. >> >> >> >> > > >> >> >> >> > > -- >> >> >> >> > > H.J. >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> -- >> >> >> >> H.J. >> >> >> >> >> >> >> >> >> >> >> >> -- >> >> >> H.J. >> >> >> >> >> >> >> >> -- >> >> H.J. >> >> >> >> -- >> H.J.
On Mon, Jun 7, 2021 at 5:55 AM H.J. Lu via Libc-alpha <libc-alpha@sourceware.org> wrote: > > On Mon, Jun 7, 2021 at 12:12 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote: > > > > > > > > On Mon, Jun 7, 2021 at 12:34 AM H.J. Lu <hjl.tools@gmail.com> wrote: > >> > >> On Sun, Jun 6, 2021 at 8:39 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote: > >> > > >> > > >> > > >> > On Sun, Jun 6, 2021 at 11:21 PM H.J. Lu <hjl.tools@gmail.com> wrote: > >> >> > >> >> On Sun, Jun 6, 2021 at 8:06 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote: > >> >> > > >> >> > > >> >> > > >> >> > On Sun, Jun 6, 2021 at 10:48 PM H.J. Lu <hjl.tools@gmail.com> wrote: > >> >> >> > >> >> >> On Sun, Jun 6, 2021 at 7:35 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote: > >> >> >> > > >> >> >> > > >> >> >> > > >> >> >> > On Thu, May 20, 2021 at 4:45 PM H.J. Lu <hjl.tools@gmail.com> wrote: > >> >> >> >> > >> >> >> >> On Thu, May 20, 2021 at 1:03 PM Noah Goldstein <goldstein.w.n@gmail.com> wrote: > >> >> >> >> > > >> >> >> >> > On Thu, May 20, 2021 at 3:40 PM H.J. Lu <hjl.tools@gmail.com> wrote: > >> >> >> >> > > > >> >> >> >> > > On Thu, May 20, 2021 at 11:45 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote: > >> >> >> >> > > > > >> >> >> >> > > > No bug. This commit makes a few small improvements to > >> >> >> >> > > > memset-vec-unaligned-erms.S. The changes are 1) only aligning to 64 > >> >> >> >> > > > instead of 128. Either alignment will perform equally well in a loop > >> >> >> >> > > > and 128 just increases the odds of having to do an extra iteration > >> >> >> >> > > > which can be significant overhead for small values. 2) Align some > >> >> >> >> > > > targets and the loop. 3) Remove an ALU from the alignment process. 4) > >> >> >> >> > > > Reorder the last 4x VEC so that they are stored after the loop. 5) > >> >> >> >> > > > Move the condition for leq 8x VEC to before the alignment > >> >> >> >> > > > process. test-memset and test-wmemset are both passing. > >> >> >> >> > > > > >> >> >> >> > > > Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com> > >> >> >> >> > > > --- > >> >> >> >> > > > Tests where run on the following CPUs: > >> >> >> >> > > > > >> >> >> >> > > > Skylake: https://ark.intel.com/content/www/us/en/ark/products/149091/intel-core-i7-8565u-processor-8m-cache-up-to-4-60-ghz.html > >> >> >> >> > > > > >> >> >> >> > > > Icelake: https://ark.intel.com/content/www/us/en/ark/products/196597/intel-core-i7-1065g7-processor-8m-cache-up-to-3-90-ghz.html > >> >> >> >> > > > > >> >> >> >> > > > Tigerlake: https://ark.intel.com/content/www/us/en/ark/products/208921/intel-core-i7-1165g7-processor-12m-cache-up-to-4-70-ghz-with-ipu.html > >> >> >> >> > > > > >> >> >> >> > > > All times are the geometric mean of N=50. The unit of time is > >> >> >> >> > > > seconds. > >> >> >> >> > > > > >> >> >> >> > > > "Cur" refers to the current implementation > >> >> >> >> > > > "New" refers to this patches implementation > >> >> >> >> > > > > >> >> >> >> > > > Performance data attached in memset-data.pdf > >> >> >> >> > > > > >> >> >> >> > > > Some notes on the numbers: > >> >> >> >> > > > > >> >> >> >> > > > I only included numbers for the proper VEC_SIZE for the corresponding > >> >> >> >> > > > cpu. > >> >> >> >> > > > > >> >> >> >> > > > skl -> avx2 > >> >> >> >> > > > icl -> evex > >> >> >> >> > > > tgl -> evex > >> >> >> >> > > > > >> >> >> >> > > > The changes only affect sizes > 2 * VEC_SIZE. The performance > >> >> >> >> > > > differences in the size <= 2 * VEC_SIZE come from changes in alignment > >> >> >> >> > > > after linking (i.e ENTRY aligns to 16, but performance can be affected > >> >> >> >> > > > by alignment % 64 or alignment % 4096) and generally affects > >> >> >> >> > > > throughput only, not latency (i.e with an lfence to the benchmark loop > >> >> >> >> > > > the deviations go away). Generally I think they can be ignored (both > >> >> >> >> > > > positive and negative affects). > >> >> >> >> > > > > >> >> >> >> > > > The interesting part of the data is in the medium size range [128, > >> >> >> >> > > > 1024] where the new implementation has a reasonable speedup. This is > >> >> >> >> > > > especially pronounced when the more conservative alignment saves a > >> >> >> >> > > > full loop iteration. The only significant exception is > >> >> >> >> > > > skylake-avx2-erms case for size = 416, alignment = 416 where the > >> >> >> >> > > > current implementation is meaningfully faster. I am unsure of the root > >> >> >> >> > > > cause for this. The skylake-avx2 case only performs a bit worse in > >> >> >> >> > > > this case which makes me think part of it is code alignment related, > >> >> >> >> > > > though comparative to the speedup in other size/alignment > >> >> >> >> > > > configurations it is still a trough. Despite this, I still think the > >> >> >> >> > > > numbers are overall an improvement. > >> >> >> >> > > > > >> >> >> >> > > > As well due to aligning the loop (and possibly slightly more efficient > >> >> >> >> > > > DSB behavior with the replacement of addq 4 * VEC_SIZE in the loop > >> >> >> >> > > > with subq -4 * VEC_SIZE) in the non-erms cases there is often a slight > >> >> >> >> > > > improvement to the main loop for large sizes. > >> >> >> >> > > > > >> >> >> >> > > > .../multiarch/memset-vec-unaligned-erms.S | 50 +++++++++++-------- > >> >> >> >> > > > 1 file changed, 28 insertions(+), 22 deletions(-) > >> >> >> >> > > > > >> >> >> >> > > > diff --git a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S > >> >> >> >> > > > index 08cfa49bd1..ff196844a0 100644 > >> >> >> >> > > > --- a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S > >> >> >> >> > > > +++ b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S > >> >> >> >> > > > @@ -173,17 +173,22 @@ ENTRY (MEMSET_SYMBOL (__memset, unaligned_erms)) > >> >> >> >> > > > VMOVU %VEC(0), (%rdi) > >> >> >> >> > > > VZEROUPPER_RETURN > >> >> >> >> > > > > >> >> >> >> > > > + .p2align 4 > >> >> >> >> > > > L(stosb_more_2x_vec): > >> >> >> >> > > > cmp __x86_rep_stosb_threshold(%rip), %RDX_LP > >> >> >> >> > > > ja L(stosb) > >> >> >> >> > > > +#else > >> >> >> >> > > > + .p2align 4 > >> >> >> >> > > > #endif > >> >> >> >> > > > L(more_2x_vec): > >> >> >> >> > > > - cmpq $(VEC_SIZE * 4), %rdx > >> >> >> >> > > > - ja L(loop_start) > >> >> >> >> > > > + /* Stores to first 2x VEC before cmp as any path forward will > >> >> >> >> > > > + require it. */ > >> >> >> >> > > > VMOVU %VEC(0), (%rdi) > >> >> >> >> > > > VMOVU %VEC(0), VEC_SIZE(%rdi) > >> >> >> >> > > > - VMOVU %VEC(0), -VEC_SIZE(%rdi,%rdx) > >> >> >> >> > > > + cmpq $(VEC_SIZE * 4), %rdx > >> >> >> >> > > > + ja L(loop_start) > >> >> >> >> > > > VMOVU %VEC(0), -(VEC_SIZE * 2)(%rdi,%rdx) > >> >> >> >> > > > + VMOVU %VEC(0), -VEC_SIZE(%rdi,%rdx) > >> >> >> >> > > > L(return): > >> >> >> >> > > > #if VEC_SIZE > 16 > >> >> >> >> > > > ZERO_UPPER_VEC_REGISTERS_RETURN > >> >> >> >> > > > @@ -192,28 +197,29 @@ L(return): > >> >> >> >> > > > #endif > >> >> >> >> > > > > >> >> >> >> > > > L(loop_start): > >> >> >> >> > > > - leaq (VEC_SIZE * 4)(%rdi), %rcx > >> >> >> >> > > > - VMOVU %VEC(0), (%rdi) > >> >> >> >> > > > - andq $-(VEC_SIZE * 4), %rcx > >> >> >> >> > > > - VMOVU %VEC(0), -VEC_SIZE(%rdi,%rdx) > >> >> >> >> > > > - VMOVU %VEC(0), VEC_SIZE(%rdi) > >> >> >> >> > > > - VMOVU %VEC(0), -(VEC_SIZE * 2)(%rdi,%rdx) > >> >> >> >> > > > VMOVU %VEC(0), (VEC_SIZE * 2)(%rdi) > >> >> >> >> > > > - VMOVU %VEC(0), -(VEC_SIZE * 3)(%rdi,%rdx) > >> >> >> >> > > > VMOVU %VEC(0), (VEC_SIZE * 3)(%rdi) > >> >> >> >> > > > - VMOVU %VEC(0), -(VEC_SIZE * 4)(%rdi,%rdx) > >> >> >> >> > > > - addq %rdi, %rdx > >> >> >> >> > > > - andq $-(VEC_SIZE * 4), %rdx > >> >> >> >> > > > - cmpq %rdx, %rcx > >> >> >> >> > > > - je L(return) > >> >> >> >> > > > + cmpq $(VEC_SIZE * 8), %rdx > >> >> >> >> > > > + jbe L(loop_end) > >> >> >> >> > > > + andq $-(VEC_SIZE * 2), %rdi > >> >> >> >> > > > + subq $-(VEC_SIZE * 4), %rdi > >> >> >> >> > > > + leaq -(VEC_SIZE * 4)(%rax, %rdx), %rcx > >> >> >> > > >> >> >> > If this overflows loop will exit first iteration. Is that an issue? > >> >> >> > >> >> >> Please do following: > >> >> >> > >> >> >> 1. Update memset assembly codes with > >> >> >> check conditions for underwrite/overwrite. > >> >> >> if true then branch to the HLT instruction. > >> >> > > >> >> > codes? Just this file or others as well? > >> >> > >> >> All string/memory functions should work for all valid inputs. > >> >> > >> >> >> > >> >> >> 2. Update and run memset test to verify the test coverage for the condition. > >> >> > > >> >> > What is the desired result? Segfault? > >> >> > >> >> For invalid inputs, anything can happen, including segfault. > >> >> > >> >> >> > >> >> >> 3. Update memset assembly codes to cover such conditions. > >> >> > > >> >> > Memcmp as well? What about wmemset? Currently (and with previous versions as well) > >> >> > a value like 2^63 would cause similar behavior after the `salq $2, %rdx` > >> >> > >> >> Is this condition a valid input? If not, there is nothing to do. > >> > > >> > > >> > AFAIK and value length [0, SIZE_MAX] for either is a valid input for any string/memory function > >> > from the perspective of the standard. But I don't know if it has any qualifiers. > >> > > >> > As well I don't know what is meant to happen if the machine/OS is unable to perform the necessary operations. > >> > Normally you would see segfault. The difference from the 3 commits below is essentially just that it won't > >> > segfault. > >> > > >> > That is already case, and has been for a while, certain inputs for many of the wcsmbs will have roughly the > >> > same behavior from `salq $2, %rdx`. For example wmemset(ptr, 0, 2^62 + 1) will currently set 1 wchar then > >> > return. > >> > >> Please construct a testcase which will return normally instead of > >> segfault if not fixed. So on > >> x86-64, the expected behavior should be segfault. > > > > > > memcmp / test for it as well? > > Yes. > > >> > >> > >> >> > >> >> > >> >> >> > >> >> >> > >> >> >> Thanks. > >> >> >> > >> >> >> > If so the following commits from me have the same bug: > >> >> >> > https://sourceware.org/git/?p=glibc.git;a=commit;h=6abf27980a947f9b6e514d6b33b83059d39566ae > >> >> >> > https://sourceware.org/git/?p=glibc.git;a=commit;h=4ad473e97acdc5f6d811755b67c09f2128a644ce > >> >> >> > https://sourceware.org/git/?p=glibc.git;a=commit;h=16d12015c57701b08d7bbed6ec536641bcafb428 > >> >> >> >> > >> >> >> >> > >> >> >> >> > >> >> >> >> > > > + .p2align 4 > >> >> >> >> > > > L(loop): > >> >> >> >> > > > - VMOVA %VEC(0), (%rcx) > >> >> >> >> > > > - VMOVA %VEC(0), VEC_SIZE(%rcx) > >> >> >> >> > > > - VMOVA %VEC(0), (VEC_SIZE * 2)(%rcx) > >> >> >> >> > > > - VMOVA %VEC(0), (VEC_SIZE * 3)(%rcx) > >> >> >> >> > > > - addq $(VEC_SIZE * 4), %rcx > >> >> >> >> > > > - cmpq %rcx, %rdx > >> >> >> >> > > > - jne L(loop) > >> >> >> >> > > > + VMOVA %VEC(0), (%rdi) > >> >> >> >> > > > + VMOVA %VEC(0), VEC_SIZE(%rdi) > >> >> >> >> > > > + VMOVA %VEC(0), (VEC_SIZE * 2)(%rdi) > >> >> >> >> > > > + VMOVA %VEC(0), (VEC_SIZE * 3)(%rdi) > >> >> >> >> > > > + subq $-(VEC_SIZE * 4), %rdi > >> >> >> >> > > > + cmpq %rcx, %rdi > >> >> >> >> > > > + jb L(loop) > >> >> >> > > >> >> >> > Issue because %rdi will not be below %rcx here. > >> >> >> >> > >> >> >> >> > > > +L(loop_end): > >> >> >> >> > > > + /* NB: rax is set as ptr in MEMSET_VDUP_TO_VEC0_AND_SET_RETURN. > >> >> >> >> > > > + rdx as length is also unchanged. */ > >> >> >> >> > > > + VMOVU %VEC(0), -(VEC_SIZE * 4)(%rax, %rdx) > >> >> >> >> > > > + VMOVU %VEC(0), -(VEC_SIZE * 3)(%rax, %rdx) > >> >> >> >> > > > + VMOVU %VEC(0), -(VEC_SIZE * 2)(%rax, %rdx) > >> >> >> >> > > > + VMOVU %VEC(0), -VEC_SIZE(%rax, %rdx) > >> >> >> >> > > > VZEROUPPER_SHORT_RETURN > >> >> >> >> > > > > >> >> >> >> > > > .p2align 4 > >> >> >> >> > > > -- > >> >> >> >> > > > 2.25.1 > >> >> >> >> > > > > >> >> >> >> > > > >> >> >> >> > > LGTM. > >> >> >> >> > > >> >> >> >> > Awesome! > >> >> >> >> > > >> >> >> >> > For future patches do you prefer performance numbers like this or > >> >> >> >> > raw text? Or some other alternative? > >> >> >> >> > >> >> >> >> The current data format is fine. Thanks. > >> >> >> >> > >> >> >> >> > > > >> >> >> >> > > Reviewed-by: H.J. Lu <hjl.tools@gmail.com> > >> >> >> >> > > > >> >> >> >> > > Thanks. > >> >> >> >> > > > >> >> >> >> > > -- > >> >> >> >> > > H.J. > >> >> >> >> > >> >> >> >> > >> >> >> >> > >> >> >> >> -- > >> >> >> >> H.J. > >> >> >> > >> >> >> > >> >> >> > >> >> >> -- > >> >> >> H.J. > >> >> > >> >> > >> >> > >> >> -- > >> >> H.J. > >> > >> > >> > >> -- > >> H.J. > > > > -- > H.J. I would like to backport this patch to release branches. Any comments or objections? --Sunil
diff --git a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S index 08cfa49bd1..ff196844a0 100644 --- a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S +++ b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S @@ -173,17 +173,22 @@ ENTRY (MEMSET_SYMBOL (__memset, unaligned_erms)) VMOVU %VEC(0), (%rdi) VZEROUPPER_RETURN + .p2align 4 L(stosb_more_2x_vec): cmp __x86_rep_stosb_threshold(%rip), %RDX_LP ja L(stosb) +#else + .p2align 4 #endif L(more_2x_vec): - cmpq $(VEC_SIZE * 4), %rdx - ja L(loop_start) + /* Stores to first 2x VEC before cmp as any path forward will + require it. */ VMOVU %VEC(0), (%rdi) VMOVU %VEC(0), VEC_SIZE(%rdi) - VMOVU %VEC(0), -VEC_SIZE(%rdi,%rdx) + cmpq $(VEC_SIZE * 4), %rdx + ja L(loop_start) VMOVU %VEC(0), -(VEC_SIZE * 2)(%rdi,%rdx) + VMOVU %VEC(0), -VEC_SIZE(%rdi,%rdx) L(return): #if VEC_SIZE > 16 ZERO_UPPER_VEC_REGISTERS_RETURN @@ -192,28 +197,29 @@ L(return): #endif L(loop_start): - leaq (VEC_SIZE * 4)(%rdi), %rcx - VMOVU %VEC(0), (%rdi) - andq $-(VEC_SIZE * 4), %rcx - VMOVU %VEC(0), -VEC_SIZE(%rdi,%rdx) - VMOVU %VEC(0), VEC_SIZE(%rdi) - VMOVU %VEC(0), -(VEC_SIZE * 2)(%rdi,%rdx) VMOVU %VEC(0), (VEC_SIZE * 2)(%rdi) - VMOVU %VEC(0), -(VEC_SIZE * 3)(%rdi,%rdx) VMOVU %VEC(0), (VEC_SIZE * 3)(%rdi) - VMOVU %VEC(0), -(VEC_SIZE * 4)(%rdi,%rdx) - addq %rdi, %rdx - andq $-(VEC_SIZE * 4), %rdx - cmpq %rdx, %rcx - je L(return) + cmpq $(VEC_SIZE * 8), %rdx + jbe L(loop_end) + andq $-(VEC_SIZE * 2), %rdi + subq $-(VEC_SIZE * 4), %rdi + leaq -(VEC_SIZE * 4)(%rax, %rdx), %rcx + .p2align 4 L(loop): - VMOVA %VEC(0), (%rcx) - VMOVA %VEC(0), VEC_SIZE(%rcx) - VMOVA %VEC(0), (VEC_SIZE * 2)(%rcx) - VMOVA %VEC(0), (VEC_SIZE * 3)(%rcx) - addq $(VEC_SIZE * 4), %rcx - cmpq %rcx, %rdx - jne L(loop) + VMOVA %VEC(0), (%rdi) + VMOVA %VEC(0), VEC_SIZE(%rdi) + VMOVA %VEC(0), (VEC_SIZE * 2)(%rdi) + VMOVA %VEC(0), (VEC_SIZE * 3)(%rdi) + subq $-(VEC_SIZE * 4), %rdi + cmpq %rcx, %rdi + jb L(loop) +L(loop_end): + /* NB: rax is set as ptr in MEMSET_VDUP_TO_VEC0_AND_SET_RETURN. + rdx as length is also unchanged. */ + VMOVU %VEC(0), -(VEC_SIZE * 4)(%rax, %rdx) + VMOVU %VEC(0), -(VEC_SIZE * 3)(%rax, %rdx) + VMOVU %VEC(0), -(VEC_SIZE * 2)(%rax, %rdx) + VMOVU %VEC(0), -VEC_SIZE(%rax, %rdx) VZEROUPPER_SHORT_RETURN .p2align 4
No bug. This commit makes a few small improvements to memset-vec-unaligned-erms.S. The changes are 1) only aligning to 64 instead of 128. Either alignment will perform equally well in a loop and 128 just increases the odds of having to do an extra iteration which can be significant overhead for small values. 2) Align some targets and the loop. 3) Remove an ALU from the alignment process. 4) Reorder the last 4x VEC so that they are stored after the loop. 5) Move the condition for leq 8x VEC to before the alignment process. test-memset and test-wmemset are both passing. Signed-off-by: Noah Goldstein <goldstein.w.n@gmail.com> --- Tests where run on the following CPUs: Skylake: https://ark.intel.com/content/www/us/en/ark/products/149091/intel-core-i7-8565u-processor-8m-cache-up-to-4-60-ghz.html Icelake: https://ark.intel.com/content/www/us/en/ark/products/196597/intel-core-i7-1065g7-processor-8m-cache-up-to-3-90-ghz.html Tigerlake: https://ark.intel.com/content/www/us/en/ark/products/208921/intel-core-i7-1165g7-processor-12m-cache-up-to-4-70-ghz-with-ipu.html All times are the geometric mean of N=50. The unit of time is seconds. "Cur" refers to the current implementation "New" refers to this patches implementation Performance data attached in memset-data.pdf Some notes on the numbers: I only included numbers for the proper VEC_SIZE for the corresponding cpu. skl -> avx2 icl -> evex tgl -> evex The changes only affect sizes > 2 * VEC_SIZE. The performance differences in the size <= 2 * VEC_SIZE come from changes in alignment after linking (i.e ENTRY aligns to 16, but performance can be affected by alignment % 64 or alignment % 4096) and generally affects throughput only, not latency (i.e with an lfence to the benchmark loop the deviations go away). Generally I think they can be ignored (both positive and negative affects). The interesting part of the data is in the medium size range [128, 1024] where the new implementation has a reasonable speedup. This is especially pronounced when the more conservative alignment saves a full loop iteration. The only significant exception is skylake-avx2-erms case for size = 416, alignment = 416 where the current implementation is meaningfully faster. I am unsure of the root cause for this. The skylake-avx2 case only performs a bit worse in this case which makes me think part of it is code alignment related, though comparative to the speedup in other size/alignment configurations it is still a trough. Despite this, I still think the numbers are overall an improvement. As well due to aligning the loop (and possibly slightly more efficient DSB behavior with the replacement of addq 4 * VEC_SIZE in the loop with subq -4 * VEC_SIZE) in the non-erms cases there is often a slight improvement to the main loop for large sizes. .../multiarch/memset-vec-unaligned-erms.S | 50 +++++++++++-------- 1 file changed, 28 insertions(+), 22 deletions(-)