Message ID | 20210824082753.3356637-5-goldstein.w.n@gmail.com
---|---
State | New
Series | [1/5] string: Make tests birdirectional test-memcpy.c
On Tue, Aug 24, 2021 at 4:29 AM Noah Goldstein <goldstein.w.n@gmail.com> wrote:
> No bug. This commit optimizes memmove-vec-unaligned.S.
>
> The optimizations are, in descending order of importance, to
> L(less_vec), L(movsb), the 8x forward/backward loops, and various
> target alignments that have minimal code size impact.
>
> The L(less_vec) optimizations are to:
>
> 1. Readjust the branch order to either give hotter paths a fall
>    through case or put fewer branches in their way.
> 2. Moderately change the size classes to make hot branches hotter
>    and thus increase predictability.
> 3. Try to minimize branch aliasing to avoid misses caused by BPU
>    thrashing.
> 4. 64-byte align the prior function entry. This is to avoid cases
>    where seemingly unrelated changes end up having severe negative
>    performance impacts.
>
> The L(movsb) optimizations are to:
>
> 1. Reduce the number of taken branches needed to determine if
>    movsb should be used.
> 2. 64-byte align dst if the CPU has FSRM or if dst and src do not
>    4k alias.
> 3. 64-byte align src if the CPU does not have FSRM and dst and src
>    do 4k alias.
>
> The 8x forward/backward loop optimizations are to:
>
> 1. Reduce the instructions needed for aligning to VEC_SIZE.
> 2. Reduce the uops and code size of the loops.
>
> All tests in string/ pass.
> ---
> See performance data attached.
> Included benchmarks: memcpy-random, memcpy, memmove, memcpy-walk,
> memmove-walk, memcpy-large
>
> The first page is a summary with the ifunc selection version for
> erms/non-erms for each computer. The following four sheets contain
> all the numbers for sse2 and avx for Skylake and sse2, avx2, evex,
> and avx512 for Tigerlake.
>
> Benchmark CPUs:
>
> Skylake:
> https://ark.intel.com/content/www/us/en/ark/products/149091/intel-core-i7-8565u-processor-8m-cache-up-to-4-60-ghz.html
>
> Tigerlake:
> https://ark.intel.com/content/www/us/en/ark/products/208921/intel-core-i7-1165g7-processor-12m-cache-up-to-4-70-ghz-with-ipu.html
>
> All times are the geometric mean of N=30 runs.
>
> "Cur" refers to the current implementation; "New" refers to this
> patch's implementation.
>
> Score refers to new/cur (low means improvement, high means
> degradation). Scores are color coded: the more green the better, the
> more red the worse.
>
> Some notes on the numbers:
>
> In my opinion most of the benchmarks where the src/dst alignments are
> in [0, 64] have some unpredictable and unfortunate noise from
> non-obvious false dependencies between stores to dst and the next
> iteration's loads from src. For example, in the 8x forward case the
> store of VEC(4) will end up stalling the next iteration's load queue,
> so if the size was large enough that the beginning of dst was flushed
> from L1 this can have a seemingly random but significant impact on
> the benchmark result.
>
> There are significant performance improvements/degradations in the
> [0, VEC_SIZE] range. I didn't treat these as important, as I think in
> this size range the branch pattern indicated by the random tests
> matters more. On the random tests the new implementation performs
> significantly better.
>
> I also added logic to align before L(movsb). As the new random
> benchmarks with fixed size show, this leads to roughly a 10-20%
> performance improvement for some hot sizes. I am not 100% convinced
> this is needed, as copies large enough to go to movsb are generally
> already aligned, but even in the fixed-loop cases, especially on
> Skylake without FSRM, it seems aligning before movsb pays off. Let me
> know if you think this is unnecessary.
>
> There are occasional performance degradations at odd spots throughout
> the medium-range sizes in the fixed memcpy benchmarks. I think there
> is generally more good than harm here, and at the moment I don't have
> an explanation for why these particular configurations seem to
> perform worse. On the plus side, however, there also seem to be
> unexplained improvements of the same magnitude patterned with the
> degradations (and both are sparse), so I ultimately believe it should
> be acceptable. If this is not the case, let me know.
>
> The memmove benchmarks look a bit worse, especially for the erms
> case. Part of this is from the nop cases, which I didn't treat as
> important. But part of it is also because, to optimize for what I
> expect to be the common case of no overlap, the overlap case has
> extra branches and overhead. I think this is inevitable when
> implementing memmove and memcpy in the same file, but if this is
> unacceptable let me know.
>
> Note: I benchmarked before two changes that made it into the final
> version:
>
> -#if !defined USE_MULTIARCH || !IS_IN (libc)
> -L(nop):
> -	ret
> -#else
> +	VMOVU	%VEC(1), -VEC_SIZE(%rdi, %rdx)
> 	VZEROUPPER_RETURN
> -#endif
>
> And
>
> +	testl	$X86_STRING_CONTROL_AVOID_SHORT_DISTANCE_REP_MOVSB, __x86_string_control(%rip)
> -	andl	$X86_STRING_CONTROL_AVOID_SHORT_DISTANCE_REP_MOVSB, __x86_string_control(%rip)
>
> I don't think either of these should have any impact.
>
> I made the former change because I think it was a bug that could
> cause use of avx2 without vzeroupper, and the latter because I think
> it could cause issues on multicore platforms.
>
>  sysdeps/x86/sysdep.h                          |  13 +-
>  .../multiarch/memmove-vec-unaligned-erms.S    | 484 +++++++++++-------
>  2 files changed, 317 insertions(+), 180 deletions(-)
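As a side note for readers following the movsb changes: the destination-alignment step described above boils down to rounding dst up to a 64-byte boundary, covering the head bytes separately, and handing the rest to the bulk copy. Below is a minimal C sketch of that arithmetic, not the glibc assembly itself. The helper name is made up, memcpy stands in for both the vector head copy and rep movsb, and the non-overlapping case is assumed (overlapping copies take the backward path before ever reaching movsb).

#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MOVSB_ALIGN_TO 64	/* Matches the patch's 64-byte goal.  */

/* Sketch of the "align dst before the bulk copy" arithmetic.  */
static void
copy_align_dst_first (char *dst, const char *src, size_t len)
{
  if (len <= 2 * MOVSB_ALIGN_TO)
    {
      /* Small copies never reach the movsb path; just copy.  */
      memcpy (dst, src, len);
      return;
    }
  /* Round dst up to the next 64-byte boundary: (dst + 63) & ~63,
     the same arithmetic as the addq/andq pair in the patch.  */
  uintptr_t d = (uintptr_t) dst;
  char *dst_aligned = (char *) ((d + MOVSB_ALIGN_TO - 1)
				& ~(uintptr_t) (MOVSB_ALIGN_TO - 1));
  size_t head = (size_t) (dst_aligned - dst);	/* 0 <= head < 64.  */

  /* The assembly loads the head from src into vector registers up
     front and stores it after rep movsb; copying it first is
     equivalent when the buffers do not overlap.  */
  memcpy (dst, src, head);
  /* Bulk copy with an aligned destination (rep movsb in the real
     code).  */
  memcpy (dst_aligned, src + head, len - head);
}

int
main (void)
{
  char src[512], dst[512];
  for (int i = 0; i < 512; i++)
    src[i] = (char) i;
  copy_align_dst_first (dst + 3, src + 5, 300);
  assert (memcmp (dst + 3, src + 5, 300) == 0);
  return 0;
}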
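The non-FSRM movsb path in the patch below decides which pointer to align with a single test of the distance between dst and src: testl $(PAGE_SIZE - 512), %ecx. The test is zero exactly when the low 12 bits of dst - src are below 512, which the patch treats as dst and src 4k aliasing; in that case it aligns src, otherwise it aligns dst. Here is a small standalone C sketch of that predicate, assuming a 4096-byte page size (the function name is illustrative, not glibc code).

#include <stdint.h>
#include <stdio.h>

#define PAGE_SIZE 4096

/* Nonzero when dst's page offset is at most 511 bytes above src's,
   i.e. (dst - src) % 4096 < 512: the 4k-alias case.  */
static int
aliases_4k (const void *dst, const void *src)
{
  uint32_t diff = (uint32_t) ((uintptr_t) dst - (uintptr_t) src);
  return (diff & (PAGE_SIZE - 512)) == 0;
}

int
main (void)
{
  static char buf[2 * PAGE_SIZE];
  /* Same page offset, one page apart: aliases (prints 1).  */
  printf ("%d\n", aliases_4k (buf + PAGE_SIZE + 100, buf + 100));
  /* Offsets half a page apart: no alias (prints 0).  */
  printf ("%d\n", aliases_4k (buf + PAGE_SIZE / 2, buf));
  return 0;
}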
diff --git a/sysdeps/x86/sysdep.h b/sysdeps/x86/sysdep.h index cac1d762fb..9226d2c6c9 100644 --- a/sysdeps/x86/sysdep.h +++ b/sysdeps/x86/sysdep.h @@ -78,15 +78,18 @@ enum cf_protection_level #define ASM_SIZE_DIRECTIVE(name) .size name,.-name; /* Define an entry point visible from C. */ -#define ENTRY(name) \ - .globl C_SYMBOL_NAME(name); \ - .type C_SYMBOL_NAME(name),@function; \ - .align ALIGNARG(4); \ +#define P2ALIGN_ENTRY(name, alignment) \ + .globl C_SYMBOL_NAME(name); \ + .type C_SYMBOL_NAME(name),@function; \ + .align ALIGNARG(alignment); \ C_LABEL(name) \ cfi_startproc; \ - _CET_ENDBR; \ + _CET_ENDBR; \ CALL_MCOUNT +#define ENTRY(name) P2ALIGN_ENTRY(name, 4) + + #undef END #define END(name) \ cfi_endproc; \ diff --git a/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S index 9f02624375..75b6efe969 100644 --- a/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S +++ b/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S @@ -165,6 +165,32 @@ # error Invalid LARGE_LOAD_SIZE #endif +/* Whether to align before movsb. Ultimately we want 64 byte align + and not worth it to load 4x VEC for VEC_SIZE == 16. */ +#define ALIGN_MOVSB (VEC_SIZE > 16) + +/* Number of VECs to align movsb to. */ +#if VEC_SIZE == 64 +# define MOVSB_ALIGN_TO (VEC_SIZE) +#else +# define MOVSB_ALIGN_TO (VEC_SIZE * 2) +#endif + +/* Macro for copying inclusive power of 2 range with two register + loads. */ +#define COPY_BLOCK(mov_inst, src_reg, dst_reg, size_reg, len, tmp_reg0, tmp_reg1) \ + mov_inst (%src_reg), %tmp_reg0; \ + mov_inst -(len)(%src_reg, %size_reg), %tmp_reg1; \ + mov_inst %tmp_reg0, (%dst_reg); \ + mov_inst %tmp_reg1, -(len)(%dst_reg, %size_reg); + +/* Define all copies used by L(less_vec) for VEC_SIZE of 16, 32, or + 64. */ +#define COPY_4_8 COPY_BLOCK(movl, rsi, rdi, rdx, 4, ecx, esi) +#define COPY_8_16 COPY_BLOCK(movq, rsi, rdi, rdx, 8, rcx, rsi) +#define COPY_16_32 COPY_BLOCK(vmovdqu, rsi, rdi, rdx, 16, xmm0, xmm1) +#define COPY_32_64 COPY_BLOCK(vmovdqu64, rsi, rdi, rdx, 32, ymm16, ymm17) + #ifndef SECTION # error SECTION is not defined! #endif @@ -198,7 +224,13 @@ L(start): movl %edx, %edx # endif cmp $VEC_SIZE, %RDX_LP + /* Based on SPEC2017 distribution both 16 and 32 memcpy calls are + really hot so we want them to take the same branch path. */ +#if VEC_SIZE > 16 + jbe L(less_vec) +#else jb L(less_vec) +#endif cmp $(VEC_SIZE * 2), %RDX_LP ja L(more_2x_vec) #if !defined USE_MULTIARCH || !IS_IN (libc) @@ -206,15 +238,10 @@ L(last_2x_vec): #endif /* From VEC and to 2 * VEC. No branch when size == VEC_SIZE. */ VMOVU (%rsi), %VEC(0) - VMOVU -VEC_SIZE(%rsi,%rdx), %VEC(1) + VMOVU -VEC_SIZE(%rsi, %rdx), %VEC(1) VMOVU %VEC(0), (%rdi) - VMOVU %VEC(1), -VEC_SIZE(%rdi,%rdx) -#if !defined USE_MULTIARCH || !IS_IN (libc) -L(nop): - ret -#else + VMOVU %VEC(1), -VEC_SIZE(%rdi, %rdx) VZEROUPPER_RETURN -#endif #if defined USE_MULTIARCH && IS_IN (libc) END (MEMMOVE_SYMBOL (__memmove, unaligned)) @@ -289,7 +316,9 @@ ENTRY (MEMMOVE_CHK_SYMBOL (__memmove_chk, unaligned_erms)) END (MEMMOVE_CHK_SYMBOL (__memmove_chk, unaligned_erms)) # endif -ENTRY (MEMMOVE_SYMBOL (__memmove, unaligned_erms)) +/* Cache align entry so that branch heavy L(less_vec) maintains good + alignment. 
*/ +P2ALIGN_ENTRY (MEMMOVE_SYMBOL (__memmove, unaligned_erms), 6) movq %rdi, %rax L(start_erms): # ifdef __ILP32__ @@ -297,123 +326,217 @@ L(start_erms): movl %edx, %edx # endif cmp $VEC_SIZE, %RDX_LP + /* Based on SPEC2017 distribution both 16 and 32 memcpy calls are + really hot so we want them to take the same branch path. */ +# if VEC_SIZE > 16 + jbe L(less_vec) +# else jb L(less_vec) +# endif cmp $(VEC_SIZE * 2), %RDX_LP ja L(movsb_more_2x_vec) L(last_2x_vec): - /* From VEC and to 2 * VEC. No branch when size == VEC_SIZE. */ + /* From VEC and to 2 * VEC. No branch when size == VEC_SIZE. */ VMOVU (%rsi), %VEC(0) - VMOVU -VEC_SIZE(%rsi,%rdx), %VEC(1) + VMOVU -VEC_SIZE(%rsi, %rdx), %VEC(1) VMOVU %VEC(0), (%rdi) - VMOVU %VEC(1), -VEC_SIZE(%rdi,%rdx) + VMOVU %VEC(1), -VEC_SIZE(%rdi, %rdx) L(return): -#if VEC_SIZE > 16 +# if VEC_SIZE > 16 ZERO_UPPER_VEC_REGISTERS_RETURN -#else +# else ret +# endif #endif +#if VEC_SIZE == 64 +L(copy_8_15): + COPY_8_16 + ret -L(movsb): - cmp __x86_rep_movsb_stop_threshold(%rip), %RDX_LP - jae L(more_8x_vec) - cmpq %rsi, %rdi - jb 1f - /* Source == destination is less common. */ - je L(nop) - leaq (%rsi,%rdx), %r9 - cmpq %r9, %rdi - /* Avoid slow backward REP MOVSB. */ - jb L(more_8x_vec_backward) -# if AVOID_SHORT_DISTANCE_REP_MOVSB - andl $X86_STRING_CONTROL_AVOID_SHORT_DISTANCE_REP_MOVSB, __x86_string_control(%rip) - jz 3f - movq %rdi, %rcx - subq %rsi, %rcx - jmp 2f -# endif -1: -# if AVOID_SHORT_DISTANCE_REP_MOVSB - andl $X86_STRING_CONTROL_AVOID_SHORT_DISTANCE_REP_MOVSB, __x86_string_control(%rip) - jz 3f - movq %rsi, %rcx - subq %rdi, %rcx -2: -/* Avoid "rep movsb" if RCX, the distance between source and destination, - is N*4GB + [1..63] with N >= 0. */ - cmpl $63, %ecx - jbe L(more_2x_vec) /* Avoid "rep movsb" if ECX <= 63. */ -3: -# endif - mov %RDX_LP, %RCX_LP - rep movsb -L(nop): +L(copy_33_63): + COPY_32_64 ret #endif - + /* Only worth aligning if near end of 16 byte block and won't get + first branch in first decode after jump. */ + .p2align 4,, 6 L(less_vec): - /* Less than 1 VEC. */ #if VEC_SIZE != 16 && VEC_SIZE != 32 && VEC_SIZE != 64 # error Unsupported VEC_SIZE! #endif -#if VEC_SIZE > 32 - cmpb $32, %dl - jae L(between_32_63) + /* Second set of branches for smallest copies. */ + cmpl $(VEC_SIZE / 4), %edx + jb L(less_quarter_vec) + + cmpl $(VEC_SIZE / 2), %edx +#if VEC_SIZE == 64 + /* We branch to [33, 63] instead of [16, 32] to give [16, 32] fall + through path as [16, 32] is hotter. */ + ja L(copy_33_63) + COPY_16_32 +#elif VEC_SIZE == 32 + /* Branch to [8, 15]. Fall through to [16, 32]. */ + jb L(copy_8_15) + COPY_16_32 +#else + /* Branch to [4, 7]. Fall through to [8, 15]. */ + jb L(copy_4_7) + COPY_8_16 #endif -#if VEC_SIZE > 16 - cmpb $16, %dl - jae L(between_16_31) -#endif - cmpb $8, %dl - jae L(between_8_15) - cmpb $4, %dl - jae L(between_4_7) - cmpb $1, %dl - ja L(between_2_3) - jb 1f + ret + /* Align if won't cost too many bytes. */ + .p2align 4,, 6 +L(copy_4_7): + COPY_4_8 + ret + + /* Cold target. No need to align. */ +L(copy_1): movzbl (%rsi), %ecx movb %cl, (%rdi) -1: ret + + /* Colder copy case for [0, VEC_SIZE / 4 - 1]. */ +L(less_quarter_vec): #if VEC_SIZE > 32 -L(between_32_63): - /* From 32 to 63. No branch when size == 32. */ - VMOVU (%rsi), %YMM0 - VMOVU -32(%rsi,%rdx), %YMM1 - VMOVU %YMM0, (%rdi) - VMOVU %YMM1, -32(%rdi,%rdx) - VZEROUPPER_RETURN + cmpl $8, %edx + jae L(copy_8_15) #endif #if VEC_SIZE > 16 - /* From 16 to 31. No branch when size == 16. 
*/ -L(between_16_31): - VMOVU (%rsi), %XMM0 - VMOVU -16(%rsi,%rdx), %XMM1 - VMOVU %XMM0, (%rdi) - VMOVU %XMM1, -16(%rdi,%rdx) - VZEROUPPER_RETURN -#endif -L(between_8_15): - /* From 8 to 15. No branch when size == 8. */ - movq -8(%rsi,%rdx), %rcx - movq (%rsi), %rsi - movq %rcx, -8(%rdi,%rdx) - movq %rsi, (%rdi) - ret -L(between_4_7): - /* From 4 to 7. No branch when size == 4. */ - movl -4(%rsi,%rdx), %ecx - movl (%rsi), %esi - movl %ecx, -4(%rdi,%rdx) - movl %esi, (%rdi) + cmpl $4, %edx + jae L(copy_4_7) +#endif + cmpl $1, %edx + je L(copy_1) + jb L(copy_0) + /* Fall through into copy [2, 3] as it is more common than [0, 1]. + */ + movzwl (%rsi), %ecx + movzbl -1(%rsi, %rdx), %esi + movw %cx, (%rdi) + movb %sil, -1(%rdi, %rdx) +L(copy_0): ret -L(between_2_3): - /* From 2 to 3. No branch when size == 2. */ - movzwl -2(%rsi,%rdx), %ecx - movzwl (%rsi), %esi - movw %cx, -2(%rdi,%rdx) - movw %si, (%rdi) + + .p2align 4 +#if VEC_SIZE == 32 +L(copy_8_15): + COPY_8_16 ret + /* COPY_8_16 is exactly 17 bytes so don't want to p2align after as + it wastes 15 bytes of code and 1 byte off is fine. */ +#endif + +#if defined USE_MULTIARCH && IS_IN (libc) +L(movsb): + movq %rdi, %rcx + subq %rsi, %rcx + /* Go to backwards temporal copy if overlap no matter what as + backward movsb is slow. */ + cmpq %rdx, %rcx + /* L(more_8x_vec_backward_check_nop) checks for src == dst. */ + jb L(more_8x_vec_backward_check_nop) + /* If above __x86_rep_movsb_stop_threshold most likely is candidate + for NT moves aswell. */ + cmp __x86_rep_movsb_stop_threshold(%rip), %RDX_LP + jae L(large_memcpy_2x_check) +# if ALIGN_MOVSB + VMOVU (%rsi), %VEC(0) +# if MOVSB_ALIGN_TO > VEC_SIZE + VMOVU VEC_SIZE(%rsi), %VEC(1) +# endif +# if MOVSB_ALIGN_TO > (VEC_SIZE * 2) +# error Unsupported MOVSB_ALIGN_TO +# endif + /* Store dst for use after rep movsb. */ + movq %rdi, %r8 +# endif +# if AVOID_SHORT_DISTANCE_REP_MOVSB + /* Only avoid short movsb if CPU has FSRM. */ + testl $X86_STRING_CONTROL_AVOID_SHORT_DISTANCE_REP_MOVSB, __x86_string_control(%rip) + jz L(skip_short_movsb_check) + /* Avoid "rep movsb" if RCX, the distance between source and + destination, is N*4GB + [1..63] with N >= 0. */ + + /* ecx contains dst - src. Early check for backward copy conditions + means only case of slow movsb with src = dst + [0, 63] is ecx in + [-63, 0]. Use unsigned comparison with -64 check for that case. */ + cmpl $-64, %ecx + ja L(more_8x_vec_forward) +# endif +# if ALIGN_MOVSB + /* Fall through means cpu has FSRM. In that case exclusively align + destination. */ + + /* Subtract dst from src. Add back after dst aligned. */ + subq %rdi, %rsi + /* Add dst to len. Subtract back after dst aligned. */ + leaq (%rdi, %rdx), %rcx + /* Exclusively align dst to MOVSB_ALIGN_TO (64). */ + addq $(MOVSB_ALIGN_TO - 1), %rdi + andq $-(MOVSB_ALIGN_TO), %rdi + /* Restore src and len adjusted with new values for aligned dst. */ + addq %rdi, %rsi + subq %rdi, %rcx + rep movsb + VMOVU %VEC(0), (%r8) +# if MOVSB_ALIGN_TO > VEC_SIZE + VMOVU %VEC(1), VEC_SIZE(%r8) +# endif + VZEROUPPER_RETURN +L(movsb_align_dst): + /* Subtract dst from src. Add back after dst aligned. */ + subq %rdi, %rsi + /* Add dst to len. Subtract back after dst aligned. -1 because dst + is initially aligned to MOVSB_ALIGN_TO - 1. */ + leaq -(1)(%rdi, %rdx), %rcx + /* Inclusively align dst to MOVSB_ALIGN_TO - 1. */ + orq $(MOVSB_ALIGN_TO - 1), %rdi + leaq 1(%rdi, %rsi), %rsi + /* Restore src and len adjusted with new values for aligned dst. */ + subq %rdi, %rcx + /* Finish aligning dst. 
*/ + incq %rdi + rep movsb + VMOVU %VEC(0), (%r8) +# if MOVSB_ALIGN_TO > VEC_SIZE + VMOVU %VEC(1), VEC_SIZE(%r8) +# endif + VZEROUPPER_RETURN + +L(skip_short_movsb_check): + /* If CPU does not have FSRM two options for aligning. Align src if + dst and src 4k alias. Otherwise align dst. */ + testl $(PAGE_SIZE - 512), %ecx + jnz L(movsb_align_dst) + /* rcx already has dst - src. */ + movq %rcx, %r9 + /* Add src to len. Subtract back after src aligned. -1 because src + is initially aligned to MOVSB_ALIGN_TO - 1. */ + leaq -(1)(%rsi, %rdx), %rcx + /* Inclusively align src to MOVSB_ALIGN_TO - 1. */ + orq $(MOVSB_ALIGN_TO - 1), %rsi + /* Restore dst and len adjusted with new values for aligned dst. */ + leaq 1(%rsi, %r9), %rdi + subq %rsi, %rcx + /* Finish aligning src. */ + incq %rsi + rep movsb + VMOVU %VEC(0), (%r8) +# if MOVSB_ALIGN_TO > VEC_SIZE + VMOVU %VEC(1), VEC_SIZE(%r8) +# endif + VZEROUPPER_RETURN +# else + /* Not alignined rep movsb so just copy. */ + mov %RDX_LP, %RCX_LP + rep movsb + ret +# endif +#endif + /* Align if doesn't cost too many bytes. */ + .p2align 4,, 6 #if defined USE_MULTIARCH && IS_IN (libc) L(movsb_more_2x_vec): cmp __x86_rep_movsb_threshold(%rip), %RDX_LP @@ -426,50 +549,60 @@ L(more_2x_vec): ja L(more_8x_vec) cmpq $(VEC_SIZE * 4), %rdx jbe L(last_4x_vec) - /* Copy from 4 * VEC + 1 to 8 * VEC, inclusively. */ + /* Copy from 4 * VEC + 1 to 8 * VEC, inclusively. */ VMOVU (%rsi), %VEC(0) VMOVU VEC_SIZE(%rsi), %VEC(1) VMOVU (VEC_SIZE * 2)(%rsi), %VEC(2) VMOVU (VEC_SIZE * 3)(%rsi), %VEC(3) - VMOVU -VEC_SIZE(%rsi,%rdx), %VEC(4) - VMOVU -(VEC_SIZE * 2)(%rsi,%rdx), %VEC(5) - VMOVU -(VEC_SIZE * 3)(%rsi,%rdx), %VEC(6) - VMOVU -(VEC_SIZE * 4)(%rsi,%rdx), %VEC(7) + VMOVU -VEC_SIZE(%rsi, %rdx), %VEC(4) + VMOVU -(VEC_SIZE * 2)(%rsi, %rdx), %VEC(5) + VMOVU -(VEC_SIZE * 3)(%rsi, %rdx), %VEC(6) + VMOVU -(VEC_SIZE * 4)(%rsi, %rdx), %VEC(7) VMOVU %VEC(0), (%rdi) VMOVU %VEC(1), VEC_SIZE(%rdi) VMOVU %VEC(2), (VEC_SIZE * 2)(%rdi) VMOVU %VEC(3), (VEC_SIZE * 3)(%rdi) - VMOVU %VEC(4), -VEC_SIZE(%rdi,%rdx) - VMOVU %VEC(5), -(VEC_SIZE * 2)(%rdi,%rdx) - VMOVU %VEC(6), -(VEC_SIZE * 3)(%rdi,%rdx) - VMOVU %VEC(7), -(VEC_SIZE * 4)(%rdi,%rdx) + VMOVU %VEC(4), -VEC_SIZE(%rdi, %rdx) + VMOVU %VEC(5), -(VEC_SIZE * 2)(%rdi, %rdx) + VMOVU %VEC(6), -(VEC_SIZE * 3)(%rdi, %rdx) + VMOVU %VEC(7), -(VEC_SIZE * 4)(%rdi, %rdx) VZEROUPPER_RETURN + /* Align if doesn't cost too much code size. 6 bytes so that after + jump to target a full mov instruction will always be able to be + fetched. */ + .p2align 4,, 6 L(last_4x_vec): - /* Copy from 2 * VEC + 1 to 4 * VEC, inclusively. */ + /* Copy from 2 * VEC + 1 to 4 * VEC, inclusively. */ VMOVU (%rsi), %VEC(0) VMOVU VEC_SIZE(%rsi), %VEC(1) - VMOVU -VEC_SIZE(%rsi,%rdx), %VEC(2) - VMOVU -(VEC_SIZE * 2)(%rsi,%rdx), %VEC(3) + VMOVU -VEC_SIZE(%rsi, %rdx), %VEC(2) + VMOVU -(VEC_SIZE * 2)(%rsi, %rdx), %VEC(3) VMOVU %VEC(0), (%rdi) VMOVU %VEC(1), VEC_SIZE(%rdi) - VMOVU %VEC(2), -VEC_SIZE(%rdi,%rdx) - VMOVU %VEC(3), -(VEC_SIZE * 2)(%rdi,%rdx) + VMOVU %VEC(2), -VEC_SIZE(%rdi, %rdx) + VMOVU %VEC(3), -(VEC_SIZE * 2)(%rdi, %rdx) + /* Keep nop target close to jmp for 2-byte encoding. */ +L(nop): VZEROUPPER_RETURN - + /* Align if doesn't cost too much code size. */ + .p2align 4,, 10 L(more_8x_vec): /* Check if non-temporal move candidate. */ #if (defined USE_MULTIARCH || VEC_SIZE == 16) && IS_IN (libc) /* Check non-temporal store threshold. 
*/ - cmp __x86_shared_non_temporal_threshold(%rip), %RDX_LP + cmp __x86_shared_non_temporal_threshold(%rip), %RDX_LP ja L(large_memcpy_2x) #endif - /* Entry if rdx is greater than non-temporal threshold but there - is overlap. */ + /* Entry if rdx is greater than non-temporal threshold but there is + overlap. */ L(more_8x_vec_check): cmpq %rsi, %rdi ja L(more_8x_vec_backward) /* Source == destination is less common. */ je L(nop) + /* Entry if rdx is greater than movsb or stop movsb threshold but + there is overlap with dst > src. */ +L(more_8x_vec_forward): /* Load the first VEC and last 4 * VEC to support overlapping addresses. */ VMOVU (%rsi), %VEC(4) @@ -477,22 +610,18 @@ L(more_8x_vec_check): VMOVU -(VEC_SIZE * 2)(%rsi, %rdx), %VEC(6) VMOVU -(VEC_SIZE * 3)(%rsi, %rdx), %VEC(7) VMOVU -(VEC_SIZE * 4)(%rsi, %rdx), %VEC(8) - /* Save start and stop of the destination buffer. */ - movq %rdi, %r11 - leaq -VEC_SIZE(%rdi, %rdx), %rcx - /* Align destination for aligned stores in the loop. Compute - how much destination is misaligned. */ - movq %rdi, %r8 - andq $(VEC_SIZE - 1), %r8 - /* Get the negative of offset for alignment. */ - subq $VEC_SIZE, %r8 - /* Adjust source. */ - subq %r8, %rsi - /* Adjust destination which should be aligned now. */ - subq %r8, %rdi - /* Adjust length. */ - addq %r8, %rdx - + /* Subtract dst from src. Add back after dst aligned. */ + subq %rdi, %rsi + /* Store end of buffer minus tail in rdx. */ + leaq (VEC_SIZE * -4)(%rdi, %rdx), %rdx + /* Save begining of dst. */ + movq %rdi, %rcx + /* Align dst to VEC_SIZE - 1. */ + orq $(VEC_SIZE - 1), %rdi + /* Restore src adjusted with new value for aligned dst. */ + leaq 1(%rdi, %rsi), %rsi + /* Finish aligning dst. */ + incq %rdi .p2align 4 L(loop_4x_vec_forward): /* Copy 4 * VEC a time forward. */ @@ -501,23 +630,27 @@ L(loop_4x_vec_forward): VMOVU (VEC_SIZE * 2)(%rsi), %VEC(2) VMOVU (VEC_SIZE * 3)(%rsi), %VEC(3) subq $-(VEC_SIZE * 4), %rsi - addq $-(VEC_SIZE * 4), %rdx VMOVA %VEC(0), (%rdi) VMOVA %VEC(1), VEC_SIZE(%rdi) VMOVA %VEC(2), (VEC_SIZE * 2)(%rdi) VMOVA %VEC(3), (VEC_SIZE * 3)(%rdi) subq $-(VEC_SIZE * 4), %rdi - cmpq $(VEC_SIZE * 4), %rdx + cmpq %rdi, %rdx ja L(loop_4x_vec_forward) /* Store the last 4 * VEC. */ - VMOVU %VEC(5), (%rcx) - VMOVU %VEC(6), -VEC_SIZE(%rcx) - VMOVU %VEC(7), -(VEC_SIZE * 2)(%rcx) - VMOVU %VEC(8), -(VEC_SIZE * 3)(%rcx) + VMOVU %VEC(5), (VEC_SIZE * 3)(%rdx) + VMOVU %VEC(6), (VEC_SIZE * 2)(%rdx) + VMOVU %VEC(7), VEC_SIZE(%rdx) + VMOVU %VEC(8), (%rdx) /* Store the first VEC. */ - VMOVU %VEC(4), (%r11) + VMOVU %VEC(4), (%rcx) + /* Keep nop target close to jmp for 2-byte encoding. */ +L(nop2): VZEROUPPER_RETURN - + /* Entry from fail movsb. Need to test if dst - src == 0 still. */ +L(more_8x_vec_backward_check_nop): + testq %rcx, %rcx + jz L(nop2) L(more_8x_vec_backward): /* Load the first 4 * VEC and last VEC to support overlapping addresses. */ @@ -525,49 +658,50 @@ L(more_8x_vec_backward): VMOVU VEC_SIZE(%rsi), %VEC(5) VMOVU (VEC_SIZE * 2)(%rsi), %VEC(6) VMOVU (VEC_SIZE * 3)(%rsi), %VEC(7) - VMOVU -VEC_SIZE(%rsi,%rdx), %VEC(8) - /* Save stop of the destination buffer. */ - leaq -VEC_SIZE(%rdi, %rdx), %r11 - /* Align destination end for aligned stores in the loop. Compute - how much destination end is misaligned. */ - leaq -VEC_SIZE(%rsi, %rdx), %rcx - movq %r11, %r9 - movq %r11, %r8 - andq $(VEC_SIZE - 1), %r8 - /* Adjust source. */ - subq %r8, %rcx - /* Adjust the end of destination which should be aligned now. */ - subq %r8, %r9 - /* Adjust length. 
*/ - subq %r8, %rdx - - .p2align 4 + VMOVU -VEC_SIZE(%rsi, %rdx), %VEC(8) + /* Subtract dst from src. Add back after dst aligned. */ + subq %rdi, %rsi + /* Save begining of buffer. */ + movq %rdi, %rcx + /* Set dst to begining of region to copy. -1 for inclusive + alignment. */ + leaq (VEC_SIZE * -4 + -1)(%rdi, %rdx), %rdi + /* Align dst. */ + andq $-(VEC_SIZE), %rdi + /* Restore src. */ + addq %rdi, %rsi + /* Don't use multi-byte nop to align. */ + .p2align 4,, 11 L(loop_4x_vec_backward): /* Copy 4 * VEC a time backward. */ - VMOVU (%rcx), %VEC(0) - VMOVU -VEC_SIZE(%rcx), %VEC(1) - VMOVU -(VEC_SIZE * 2)(%rcx), %VEC(2) - VMOVU -(VEC_SIZE * 3)(%rcx), %VEC(3) - addq $-(VEC_SIZE * 4), %rcx - addq $-(VEC_SIZE * 4), %rdx - VMOVA %VEC(0), (%r9) - VMOVA %VEC(1), -VEC_SIZE(%r9) - VMOVA %VEC(2), -(VEC_SIZE * 2)(%r9) - VMOVA %VEC(3), -(VEC_SIZE * 3)(%r9) - addq $-(VEC_SIZE * 4), %r9 - cmpq $(VEC_SIZE * 4), %rdx - ja L(loop_4x_vec_backward) + VMOVU (VEC_SIZE * 3)(%rsi), %VEC(0) + VMOVU (VEC_SIZE * 2)(%rsi), %VEC(1) + VMOVU (VEC_SIZE * 1)(%rsi), %VEC(2) + VMOVU (VEC_SIZE * 0)(%rsi), %VEC(3) + addq $(VEC_SIZE * -4), %rsi + VMOVA %VEC(0), (VEC_SIZE * 3)(%rdi) + VMOVA %VEC(1), (VEC_SIZE * 2)(%rdi) + VMOVA %VEC(2), (VEC_SIZE * 1)(%rdi) + VMOVA %VEC(3), (VEC_SIZE * 0)(%rdi) + addq $(VEC_SIZE * -4), %rdi + cmpq %rdi, %rcx + jb L(loop_4x_vec_backward) /* Store the first 4 * VEC. */ - VMOVU %VEC(4), (%rdi) - VMOVU %VEC(5), VEC_SIZE(%rdi) - VMOVU %VEC(6), (VEC_SIZE * 2)(%rdi) - VMOVU %VEC(7), (VEC_SIZE * 3)(%rdi) + VMOVU %VEC(4), (%rcx) + VMOVU %VEC(5), VEC_SIZE(%rcx) + VMOVU %VEC(6), (VEC_SIZE * 2)(%rcx) + VMOVU %VEC(7), (VEC_SIZE * 3)(%rcx) /* Store the last VEC. */ - VMOVU %VEC(8), (%r11) + VMOVU %VEC(8), -VEC_SIZE(%rdx, %rcx) VZEROUPPER_RETURN #if (defined USE_MULTIARCH || VEC_SIZE == 16) && IS_IN (libc) .p2align 4 + /* Entry if dst > stop movsb threshold (usually set to non-temporal + threshold). */ +L(large_memcpy_2x_check): + cmp __x86_shared_non_temporal_threshold(%rip), %RDX_LP + jb L(more_8x_vec_forward) L(large_memcpy_2x): /* Compute absolute value of difference between source and destination. */
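For reference, the COPY_BLOCK macro introduced above is the usual "first chunk plus last chunk" trick: any length in [N, 2N] is copied with two fixed-width loads followed by two fixed-width stores, letting the chunks simply overlap in the middle so no length-dependent branch is needed. Below is a standalone C sketch of the 8-to-16-byte class; the fixed-size memcpy calls compile to single 64-bit moves and stand in for the movq pair, and the function name is illustrative rather than glibc's. Because both loads happen before either store, the pattern also tolerates overlap between src and dst within the copied range, which is what lets memmove share these paths.

#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Copy LEN bytes, 8 <= LEN <= 16, with two 8-byte loads and two
   8-byte stores; the chunks overlap when LEN < 16.  */
static void
copy_8_16 (void *dst, const void *src, size_t len)
{
  uint64_t head, tail;
  memcpy (&head, src, 8);				/* First 8 bytes.  */
  memcpy (&tail, (const char *) src + len - 8, 8);	/* Last 8 bytes.  */
  memcpy (dst, &head, 8);
  memcpy ((char *) dst + len - 8, &tail, 8);
}

int
main (void)
{
  char src[16], dst[16];
  for (int i = 0; i < 16; i++)
    src[i] = (char) ('a' + i);
  for (size_t len = 8; len <= 16; len++)
    {
      memset (dst, 0, sizeof dst);
      copy_8_16 (dst, src, len);
      assert (memcmp (dst, src, len) == 0);
    }
  return 0;
}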