Message ID: 20150617180105.GA26497@domone
State: New
On Wed, Jun 17, 2015 at 08:01:05PM +0200, Ondřej Bílka wrote:
> Hi,
>
> I wrote a new strcpy for x64; for some reason I thought that I had
> committed it, and I forgot to ping it.
>
> As there are other routines that I could improve, I will use the branch
> neleai/string-x64 to collect them.
>
> Here is a revised version of what I submitted in 2013. The main change
> is that I now target the i7 instead of the Core 2. That simplifies
> things, as unaligned loads are cheap there, instead of a bit slower than
> aligned ones as on the Core 2. That mainly concerns the header: on the
> Core 2 you could get better performance by aligning loads or stores to
> 16 bytes after the first bytes were read. I do not know which is better;
> I would need to test it.
>
> That also makes support for an SSSE3 variant less important. I could
> send one, but it was an item on my TODO list that has now probably lost
> importance. The problem is that on x64, to align with SSSE3 (or SSE2
> with shifts) you need 16 loops, one per alignment, as there is no
> variable shift. It also needs to use a jump table, which is very
> expensive. For strcpy that is dubious, as it increases instruction-cache
> pressure and most copies are small. You would need to switch from
> unaligned loads to aligning ones, and I would need profiling to select
> the correct threshold.
>
> If somebody is interested in optimizing old Pentium 4 or Athlon 64
> machines, I will provide an SSSE3 variant that is also 50% faster than
> the current one. That is also the reason why I omitted the current SSSE3
> implementation from the performance graphs.
>
> In this version the header first checks 128 bytes with unaligned loads,
> unless they would cross a page boundary. That allows a more effective
> loop, since at the end of the loop we can simply write the last 64 bytes
> instead of special-casing to avoid writing before the start of the
> destination.
>
> I tried several variants of the header. As we first read 16 bytes into
> the xmm0 register, the question is whether they can be reused. I used an
> evolver to select the best variant; there was almost no difference in
> performance between these.
>
> Now I do checks for bytes 0-15, then 16-31, then 32-63, then 64-127.
> It may be possible to gain some cycles with a different grouping; I will
> post an improvement later if I find something.
>
> The first problem was reading ahead. Rereading 8 bytes looked a bit
> faster than a move from an xmm register.
>
> Then I tried deciding when to reuse versus reread. In the 4-7 byte case
> it was faster to reread than to use bit shifts to get the second half.
> For 1-3 bytes I use the following copy, with s[0] and s[1] taken from
> the rdx register with byte shifts.
>
> Test a branch versus this branchless code that works for i = 0, 1, 2:
>
> d[i] = 0;
> d[i/2] = s[1];
> d[0] = s[0];
>
> I also added an AVX2 loop. The reason I should not use AVX2 in the
> header is its high latency. I could test whether using it for bytes
> 64-127 would give a speedup.
>
> As for technical issues, I needed to move the old strcpy_sse2_unaligned
> implementation into strncpy_sse2_unaligned, as strncpy is a function
> that should be optimized for size, not performance. For now I will keep
> these unchanged.
>
> As for performance, these are 15%-30% faster than the current one for
> a gcc workload on Haswell and Ivy Bridge.
>
> The AVX2 version currently gains 6% on this workload, mainly because it
> is bash and has a lot of large loads, so the AVX2 loop helps.
>
> I used my profiler to show the improvement, see here:
>
> http://kam.mff.cuni.cz/~ondra/benchmark_string/strcpy_profile.html
>
> and the source is here:
>
> http://kam.mff.cuni.cz/~ondra/benchmark_string/strcpy_profile170615.tar.bz2
>
> Comments?
>
> 	* sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list):
> 	Add __strcpy_avx2 and __stpcpy_avx2.
> 	* sysdeps/x86_64/multiarch/Makefile (routines): Add stpcpy_avx2.S
> 	and strcpy_avx2.S.
> 	* sysdeps/x86_64/multiarch/stpcpy-avx2.S: New file.
> 	* sysdeps/x86_64/multiarch/strcpy-avx2.S: Likewise.
> 	* sysdeps/x86_64/multiarch/stpcpy-sse2-unaligned.S: Refactored
> 	implementation.
> 	* sysdeps/x86_64/multiarch/strcpy.S: Updated ifunc.
> 	* sysdeps/x86_64/multiarch/strncpy.S: Moved from strcpy.S.
> * sysdeps/x86_64/multiarch/strncpy-sse2-unaligned.S: Moved > strcpy-sse2-unaligned.S here. > * sysdeps/x86_64/multiarch/strcpy-sse2-unaligned.S: Likewise. > * sysdeps/x86_64/multiarch/stpncpy-sse2-unaligned.S: Redirect > from strcpy-sse2-unaligned.S to strncpy-sse2-unaligned.S > * sysdeps/x86_64/multiarch/stpncpy.S: Likewise. > * sysdeps/x86_64/multiarch/strcat-sse2-unaligned.S: Likewise. > > --- > sysdeps/x86_64/multiarch/Makefile | 2 +- > sysdeps/x86_64/multiarch/ifunc-impl-list.c | 2 + > sysdeps/x86_64/multiarch/stpcpy-avx2.S | 3 + > sysdeps/x86_64/multiarch/stpcpy-sse2-unaligned.S | 439 ++++- > sysdeps/x86_64/multiarch/stpncpy-sse2-unaligned.S | 3 +- > sysdeps/x86_64/multiarch/stpncpy.S | 5 +- > sysdeps/x86_64/multiarch/strcat-sse2-unaligned.S | 2 +- > sysdeps/x86_64/multiarch/strcpy-avx2.S | 4 + > sysdeps/x86_64/multiarch/strcpy-sse2-unaligned.S | 1890 +------------------- > sysdeps/x86_64/multiarch/strcpy.S | 22 +- > sysdeps/x86_64/multiarch/strncpy-sse2-unaligned.S | 1891 ++++++++++++++++++++- > sysdeps/x86_64/multiarch/strncpy.S | 88 +- > 14 files changed, 2435 insertions(+), 1921 deletions(-) > create mode 100644 sysdeps/x86_64/multiarch/stpcpy-avx2.S > create mode 100644 sysdeps/x86_64/multiarch/strcpy-avx2.S > > > diff --git a/sysdeps/x86_64/multiarch/Makefile b/sysdeps/x86_64/multiarch/Makefile > index d7002a9..c573744 100644 > --- a/sysdeps/x86_64/multiarch/Makefile > +++ b/sysdeps/x86_64/multiarch/Makefile > @@ -29,7 +29,7 @@ CFLAGS-strspn-c.c += -msse4 > endif > > ifeq (yes,$(config-cflags-avx2)) > -sysdep_routines += memset-avx2 > +sysdep_routines += memset-avx2 strcpy-avx2 stpcpy-avx2 > endif > endif > > diff --git a/sysdeps/x86_64/multiarch/ifunc-impl-list.c b/sysdeps/x86_64/multiarch/ifunc-impl-list.c > index b64e4f1..d398e43 100644 > --- a/sysdeps/x86_64/multiarch/ifunc-impl-list.c > +++ b/sysdeps/x86_64/multiarch/ifunc-impl-list.c > @@ -88,6 +88,7 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array, > > /* Support 
sysdeps/x86_64/multiarch/stpcpy.S. */ > IFUNC_IMPL (i, name, stpcpy, > + IFUNC_IMPL_ADD (array, i, strcpy, HAS_AVX2, __stpcpy_avx2) > IFUNC_IMPL_ADD (array, i, stpcpy, HAS_SSSE3, __stpcpy_ssse3) > IFUNC_IMPL_ADD (array, i, stpcpy, 1, __stpcpy_sse2_unaligned) > IFUNC_IMPL_ADD (array, i, stpcpy, 1, __stpcpy_sse2)) > @@ -137,6 +138,7 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array, > > /* Support sysdeps/x86_64/multiarch/strcpy.S. */ > IFUNC_IMPL (i, name, strcpy, > + IFUNC_IMPL_ADD (array, i, strcpy, HAS_AVX2, __strcpy_avx2) > IFUNC_IMPL_ADD (array, i, strcpy, HAS_SSSE3, __strcpy_ssse3) > IFUNC_IMPL_ADD (array, i, strcpy, 1, __strcpy_sse2_unaligned) > IFUNC_IMPL_ADD (array, i, strcpy, 1, __strcpy_sse2)) > diff --git a/sysdeps/x86_64/multiarch/stpcpy-avx2.S b/sysdeps/x86_64/multiarch/stpcpy-avx2.S > new file mode 100644 > index 0000000..bd30ef6 > --- /dev/null > +++ b/sysdeps/x86_64/multiarch/stpcpy-avx2.S > @@ -0,0 +1,3 @@ > +#define USE_AVX2 > +#define STPCPY __stpcpy_avx2 > +#include "stpcpy-sse2-unaligned.S" > diff --git a/sysdeps/x86_64/multiarch/stpcpy-sse2-unaligned.S b/sysdeps/x86_64/multiarch/stpcpy-sse2-unaligned.S > index 34231f8..695a236 100644 > --- a/sysdeps/x86_64/multiarch/stpcpy-sse2-unaligned.S > +++ b/sysdeps/x86_64/multiarch/stpcpy-sse2-unaligned.S > @@ -1,3 +1,436 @@ > -#define USE_AS_STPCPY > -#define STRCPY __stpcpy_sse2_unaligned > -#include "strcpy-sse2-unaligned.S" > +/* stpcpy with SSE2 and unaligned load > + Copyright (C) 2015 Free Software Foundation, Inc. > + This file is part of the GNU C Library. > + > + The GNU C Library is free software; you can redistribute it and/or > + modify it under the terms of the GNU Lesser General Public > + License as published by the Free Software Foundation; either > + version 2.1 of the License, or (at your option) any later version. 
> + > + The GNU C Library is distributed in the hope that it will be useful, > + but WITHOUT ANY WARRANTY; without even the implied warranty of > + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU > + Lesser General Public License for more details. > + > + You should have received a copy of the GNU Lesser General Public > + License along with the GNU C Library; if not, see > + <http://www.gnu.org/licenses/>. */ > + > +#include <sysdep.h> > +#ifndef STPCPY > +# define STPCPY __stpcpy_sse2_unaligned > +#endif > + > +ENTRY(STPCPY) > + mov %esi, %edx > +#ifdef AS_STRCPY > + movq %rdi, %rax > +#endif > + pxor %xmm4, %xmm4 > + pxor %xmm5, %xmm5 > + andl $4095, %edx > + cmp $3968, %edx > + ja L(cross_page) > + > + movdqu (%rsi), %xmm0 > + pcmpeqb %xmm0, %xmm4 > + pmovmskb %xmm4, %edx > + testl %edx, %edx > + je L(more16bytes) > + bsf %edx, %ecx > +#ifndef AS_STRCPY > + lea (%rdi, %rcx), %rax > +#endif > + cmp $7, %ecx > + movq (%rsi), %rdx > + jb L(less_8_bytesb) > +L(8bytes_from_cross): > + movq -7(%rsi, %rcx), %rsi > + movq %rdx, (%rdi) > +#ifdef AS_STRCPY > + movq %rsi, -7(%rdi, %rcx) > +#else > + movq %rsi, -7(%rax) > +#endif > + ret > + > + .p2align 4 > +L(less_8_bytesb): > + cmp $2, %ecx > + jbe L(less_4_bytes) > +L(4bytes_from_cross): > + mov -3(%rsi, %rcx), %esi > + mov %edx, (%rdi) > +#ifdef AS_STRCPY > + mov %esi, -3(%rdi, %rcx) > +#else > + mov %esi, -3(%rax) > +#endif > + ret > + > +.p2align 4 > + L(less_4_bytes): > + /* > + Test branch vs this branchless that works for i 0,1,2 > + d[i] = 0; > + d[i/2] = s[1]; > + d[0] = s[0]; > + */ > +#ifdef AS_STRCPY > + movb $0, (%rdi, %rcx) > +#endif > + > + shr $1, %ecx > + mov %edx, %esi > + shr $8, %edx > + movb %dl, (%rdi, %rcx) > +#ifndef AS_STRCPY > + movb $0, (%rax) > +#endif > + movb %sil, (%rdi) > + ret > + > + > + > + > + > + .p2align 4 > +L(more16bytes): > + pxor %xmm6, %xmm6 > + movdqu 16(%rsi), %xmm1 > + pxor %xmm7, %xmm7 > + pcmpeqb %xmm1, %xmm5 > + pmovmskb %xmm5, %edx > + testl %edx, %edx > 
+ je L(more32bytes) > + bsf %edx, %edx > +#ifdef AS_STRCPY > + movdqu 1(%rsi, %rdx), %xmm1 > + movdqu %xmm0, (%rdi) > + movdqu %xmm1, 1(%rdi, %rdx) > +#else > + lea 16(%rdi, %rdx), %rax > + movdqu 1(%rsi, %rdx), %xmm1 > + movdqu %xmm0, (%rdi) > + movdqu %xmm1, -15(%rax) > +#endif > + ret > + > + .p2align 4 > +L(more32bytes): > + movdqu 32(%rsi), %xmm2 > + movdqu 48(%rsi), %xmm3 > + > + pcmpeqb %xmm2, %xmm6 > + pcmpeqb %xmm3, %xmm7 > + pmovmskb %xmm7, %edx > + shl $16, %edx > + pmovmskb %xmm6, %ecx > + or %ecx, %edx > + je L(more64bytes) > + bsf %edx, %edx > +#ifndef AS_STRCPY > + lea 32(%rdi, %rdx), %rax > +#endif > + movdqu 1(%rsi, %rdx), %xmm2 > + movdqu 17(%rsi, %rdx), %xmm3 > + movdqu %xmm0, (%rdi) > + movdqu %xmm1, 16(%rdi) > +#ifdef AS_STRCPY > + movdqu %xmm2, 1(%rdi, %rdx) > + movdqu %xmm3, 17(%rdi, %rdx) > +#else > + movdqu %xmm2, -31(%rax) > + movdqu %xmm3, -15(%rax) > +#endif > + ret > + > + .p2align 4 > +L(more64bytes): > + movdqu %xmm0, (%rdi) > + movdqu %xmm1, 16(%rdi) > + movdqu %xmm2, 32(%rdi) > + movdqu %xmm3, 48(%rdi) > + movdqu 64(%rsi), %xmm0 > + movdqu 80(%rsi), %xmm1 > + movdqu 96(%rsi), %xmm2 > + movdqu 112(%rsi), %xmm3 > + > + pcmpeqb %xmm0, %xmm4 > + pcmpeqb %xmm1, %xmm5 > + pcmpeqb %xmm2, %xmm6 > + pcmpeqb %xmm3, %xmm7 > + pmovmskb %xmm4, %ecx > + pmovmskb %xmm5, %edx > + pmovmskb %xmm6, %r8d > + pmovmskb %xmm7, %r9d > + shl $16, %edx > + or %ecx, %edx > + shl $32, %r8 > + shl $48, %r9 > + or %r8, %rdx > + or %r9, %rdx > + test %rdx, %rdx > + je L(prepare_loop) > + bsf %rdx, %rdx > +#ifndef AS_STRCPY > + lea 64(%rdi, %rdx), %rax > +#endif > + movdqu 1(%rsi, %rdx), %xmm0 > + movdqu 17(%rsi, %rdx), %xmm1 > + movdqu 33(%rsi, %rdx), %xmm2 > + movdqu 49(%rsi, %rdx), %xmm3 > +#ifdef AS_STRCPY > + movdqu %xmm0, 1(%rdi, %rdx) > + movdqu %xmm1, 17(%rdi, %rdx) > + movdqu %xmm2, 33(%rdi, %rdx) > + movdqu %xmm3, 49(%rdi, %rdx) > +#else > + movdqu %xmm0, -63(%rax) > + movdqu %xmm1, -47(%rax) > + movdqu %xmm2, -31(%rax) > + movdqu %xmm3, -15(%rax) > 
+#endif > + ret > + > + > + .p2align 4 > +L(prepare_loop): > + movdqu %xmm0, 64(%rdi) > + movdqu %xmm1, 80(%rdi) > + movdqu %xmm2, 96(%rdi) > + movdqu %xmm3, 112(%rdi) > + > + subq %rsi, %rdi > + add $64, %rsi > + andq $-64, %rsi > + addq %rsi, %rdi > + jmp L(loop_entry) > + > +#ifdef USE_AVX2 > + .p2align 4 > +L(loop): > + vmovdqu %ymm1, (%rdi) > + vmovdqu %ymm3, 32(%rdi) > +L(loop_entry): > + vmovdqa 96(%rsi), %ymm3 > + vmovdqa 64(%rsi), %ymm1 > + vpminub %ymm3, %ymm1, %ymm2 > + addq $64, %rsi > + addq $64, %rdi > + vpcmpeqb %ymm5, %ymm2, %ymm0 > + vpmovmskb %ymm0, %edx > + test %edx, %edx > + je L(loop) > + salq $32, %rdx > + vpcmpeqb %ymm5, %ymm1, %ymm4 > + vpmovmskb %ymm4, %ecx > + or %rcx, %rdx > + bsfq %rdx, %rdx > +#ifndef AS_STRCPY > + lea (%rdi, %rdx), %rax > +#endif > + vmovdqu -63(%rsi, %rdx), %ymm0 > + vmovdqu -31(%rsi, %rdx), %ymm2 > +#ifdef AS_STRCPY > + vmovdqu %ymm0, -63(%rdi, %rdx) > + vmovdqu %ymm2, -31(%rdi, %rdx) > +#else > + vmovdqu %ymm0, -63(%rax) > + vmovdqu %ymm2, -31(%rax) > +#endif > + vzeroupper > + ret > +#else > + .p2align 4 > +L(loop): > + movdqu %xmm1, (%rdi) > + movdqu %xmm2, 16(%rdi) > + movdqu %xmm3, 32(%rdi) > + movdqu %xmm4, 48(%rdi) > +L(loop_entry): > + movdqa 96(%rsi), %xmm3 > + movdqa 112(%rsi), %xmm4 > + movdqa %xmm3, %xmm0 > + movdqa 80(%rsi), %xmm2 > + pminub %xmm4, %xmm0 > + movdqa 64(%rsi), %xmm1 > + pminub %xmm2, %xmm0 > + pminub %xmm1, %xmm0 > + addq $64, %rsi > + addq $64, %rdi > + pcmpeqb %xmm5, %xmm0 > + pmovmskb %xmm0, %edx > + test %edx, %edx > + je L(loop) > + salq $48, %rdx > + pcmpeqb %xmm1, %xmm5 > + pcmpeqb %xmm2, %xmm6 > + pmovmskb %xmm5, %ecx > +#ifdef AS_STRCPY > + pmovmskb %xmm6, %r8d > + pcmpeqb %xmm3, %xmm7 > + pmovmskb %xmm7, %r9d > + sal $16, %r8d > + or %r8d, %ecx > +#else > + pmovmskb %xmm6, %eax > + pcmpeqb %xmm3, %xmm7 > + pmovmskb %xmm7, %r9d > + sal $16, %eax > + or %eax, %ecx > +#endif > + salq $32, %r9 > + orq %rcx, %rdx > + orq %r9, %rdx > + bsfq %rdx, %rdx > +#ifndef AS_STRCPY > + lea 
(%rdi, %rdx), %rax > +#endif > + movdqu -63(%rsi, %rdx), %xmm0 > + movdqu -47(%rsi, %rdx), %xmm1 > + movdqu -31(%rsi, %rdx), %xmm2 > + movdqu -15(%rsi, %rdx), %xmm3 > +#ifdef AS_STRCPY > + movdqu %xmm0, -63(%rdi, %rdx) > + movdqu %xmm1, -47(%rdi, %rdx) > + movdqu %xmm2, -31(%rdi, %rdx) > + movdqu %xmm3, -15(%rdi, %rdx) > +#else > + movdqu %xmm0, -63(%rax) > + movdqu %xmm1, -47(%rax) > + movdqu %xmm2, -31(%rax) > + movdqu %xmm3, -15(%rax) > +#endif > + ret > +#endif > + > + .p2align 4 > +L(cross_page): > + movq %rsi, %rcx > + pxor %xmm0, %xmm0 > + and $15, %ecx > + movq %rsi, %r9 > + movq %rdi, %r10 > + subq %rcx, %rsi > + subq %rcx, %rdi > + movdqa (%rsi), %xmm1 > + pcmpeqb %xmm0, %xmm1 > + pmovmskb %xmm1, %edx > + shr %cl, %edx > + shl %cl, %edx > + test %edx, %edx > + jne L(less_32_cross) > + > + addq $16, %rsi > + addq $16, %rdi > + movdqa (%rsi), %xmm1 > + pcmpeqb %xmm1, %xmm0 > + pmovmskb %xmm0, %edx > + test %edx, %edx > + jne L(less_32_cross) > + movdqu %xmm1, (%rdi) > + > + movdqu (%r9), %xmm0 > + movdqu %xmm0, (%r10) > + > + mov $8, %rcx > +L(cross_loop): > + addq $16, %rsi > + addq $16, %rdi > + pxor %xmm0, %xmm0 > + movdqa (%rsi), %xmm1 > + pcmpeqb %xmm1, %xmm0 > + pmovmskb %xmm0, %edx > + test %edx, %edx > + jne L(return_cross) > + movdqu %xmm1, (%rdi) > + sub $1, %rcx > + ja L(cross_loop) > + > + pxor %xmm5, %xmm5 > + pxor %xmm6, %xmm6 > + pxor %xmm7, %xmm7 > + > + lea -64(%rsi), %rdx > + andq $-64, %rdx > + addq %rdx, %rdi > + subq %rsi, %rdi > + movq %rdx, %rsi > + jmp L(loop_entry) > + > + .p2align 4 > +L(return_cross): > + bsf %edx, %edx > +#ifdef AS_STRCPY > + movdqu -15(%rsi, %rdx), %xmm0 > + movdqu %xmm0, -15(%rdi, %rdx) > +#else > + lea (%rdi, %rdx), %rax > + movdqu -15(%rsi, %rdx), %xmm0 > + movdqu %xmm0, -15(%rax) > +#endif > + ret > + > + .p2align 4 > +L(less_32_cross): > + bsf %rdx, %rdx > + lea (%rdi, %rdx), %rcx > +#ifndef AS_STRCPY > + mov %rcx, %rax > +#endif > + mov %r9, %rsi > + mov %r10, %rdi > + sub %rdi, %rcx > + cmp $15, %ecx > + 
jb L(less_16_cross) > + movdqu (%rsi), %xmm0 > + movdqu -15(%rsi, %rcx), %xmm1 > + movdqu %xmm0, (%rdi) > +#ifdef AS_STRCPY > + movdqu %xmm1, -15(%rdi, %rcx) > +#else > + movdqu %xmm1, -15(%rax) > +#endif > + ret > + > +L(less_16_cross): > + cmp $7, %ecx > + jb L(less_8_bytes_cross) > + movq (%rsi), %rdx > + jmp L(8bytes_from_cross) > + > +L(less_8_bytes_cross): > + cmp $2, %ecx > + jbe L(3_bytes_cross) > + mov (%rsi), %edx > + jmp L(4bytes_from_cross) > + > +L(3_bytes_cross): > + jb L(1_2bytes_cross) > + movzwl (%rsi), %edx > + jmp L(_3_bytesb) > + > +L(1_2bytes_cross): > + movb (%rsi), %dl > + jmp L(0_2bytes_from_cross) > + > + .p2align 4 > +L(less_4_bytesb): > + je L(_3_bytesb) > +L(0_2bytes_from_cross): > + movb %dl, (%rdi) > +#ifdef AS_STRCPY > + movb $0, (%rdi, %rcx) > +#else > + movb $0, (%rax) > +#endif > + ret > + > + .p2align 4 > +L(_3_bytesb): > + movw %dx, (%rdi) > + movb $0, 2(%rdi) > + ret > + > +END(STPCPY) > diff --git a/sysdeps/x86_64/multiarch/stpncpy-sse2-unaligned.S b/sysdeps/x86_64/multiarch/stpncpy-sse2-unaligned.S > index 658520f..3f35068 100644 > --- a/sysdeps/x86_64/multiarch/stpncpy-sse2-unaligned.S > +++ b/sysdeps/x86_64/multiarch/stpncpy-sse2-unaligned.S > @@ -1,4 +1,3 @@ > #define USE_AS_STPCPY > -#define USE_AS_STRNCPY > #define STRCPY __stpncpy_sse2_unaligned > -#include "strcpy-sse2-unaligned.S" > +#include "strncpy-sse2-unaligned.S" > diff --git a/sysdeps/x86_64/multiarch/stpncpy.S b/sysdeps/x86_64/multiarch/stpncpy.S > index 2698ca6..159604a 100644 > --- a/sysdeps/x86_64/multiarch/stpncpy.S > +++ b/sysdeps/x86_64/multiarch/stpncpy.S > @@ -1,8 +1,7 @@ > /* Multiple versions of stpncpy > All versions must be listed in ifunc-impl-list.c. 
*/ > -#define STRCPY __stpncpy > +#define STRNCPY __stpncpy > #define USE_AS_STPCPY > -#define USE_AS_STRNCPY > -#include "strcpy.S" > +#include "strncpy.S" > > weak_alias (__stpncpy, stpncpy) > diff --git a/sysdeps/x86_64/multiarch/strcat-sse2-unaligned.S b/sysdeps/x86_64/multiarch/strcat-sse2-unaligned.S > index 81f1b40..1faa49d 100644 > --- a/sysdeps/x86_64/multiarch/strcat-sse2-unaligned.S > +++ b/sysdeps/x86_64/multiarch/strcat-sse2-unaligned.S > @@ -275,5 +275,5 @@ L(StartStrcpyPart): > # define USE_AS_STRNCPY > # endif > > -# include "strcpy-sse2-unaligned.S" > +# include "strncpy-sse2-unaligned.S" > #endif > diff --git a/sysdeps/x86_64/multiarch/strcpy-avx2.S b/sysdeps/x86_64/multiarch/strcpy-avx2.S > new file mode 100644 > index 0000000..a3133a4 > --- /dev/null > +++ b/sysdeps/x86_64/multiarch/strcpy-avx2.S > @@ -0,0 +1,4 @@ > +#define USE_AVX2 > +#define AS_STRCPY > +#define STPCPY __strcpy_avx2 > +#include "stpcpy-sse2-unaligned.S" > diff --git a/sysdeps/x86_64/multiarch/strcpy-sse2-unaligned.S b/sysdeps/x86_64/multiarch/strcpy-sse2-unaligned.S > index 8f03d1d..310e4fa 100644 > --- a/sysdeps/x86_64/multiarch/strcpy-sse2-unaligned.S > +++ b/sysdeps/x86_64/multiarch/strcpy-sse2-unaligned.S > @@ -1,1887 +1,3 @@ > -/* strcpy with SSE2 and unaligned load > - Copyright (C) 2011-2015 Free Software Foundation, Inc. > - Contributed by Intel Corporation. > - This file is part of the GNU C Library. > - > - The GNU C Library is free software; you can redistribute it and/or > - modify it under the terms of the GNU Lesser General Public > - License as published by the Free Software Foundation; either > - version 2.1 of the License, or (at your option) any later version. > - > - The GNU C Library is distributed in the hope that it will be useful, > - but WITHOUT ANY WARRANTY; without even the implied warranty of > - MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU > - Lesser General Public License for more details. 
> - > - You should have received a copy of the GNU Lesser General Public > - License along with the GNU C Library; if not, see > - <http://www.gnu.org/licenses/>. */ > - > -#if IS_IN (libc) > - > -# ifndef USE_AS_STRCAT > -# include <sysdep.h> > - > -# ifndef STRCPY > -# define STRCPY __strcpy_sse2_unaligned > -# endif > - > -# endif > - > -# define JMPTBL(I, B) I - B > -# define BRANCH_TO_JMPTBL_ENTRY(TABLE, INDEX, SCALE) \ > - lea TABLE(%rip), %r11; \ > - movslq (%r11, INDEX, SCALE), %rcx; \ > - lea (%r11, %rcx), %rcx; \ > - jmp *%rcx > - > -# ifndef USE_AS_STRCAT > - > -.text > -ENTRY (STRCPY) > -# ifdef USE_AS_STRNCPY > - mov %rdx, %r8 > - test %r8, %r8 > - jz L(ExitZero) > -# endif > - mov %rsi, %rcx > -# ifndef USE_AS_STPCPY > - mov %rdi, %rax /* save result */ > -# endif > - > -# endif > - > - and $63, %rcx > - cmp $32, %rcx > - jbe L(SourceStringAlignmentLess32) > - > - and $-16, %rsi > - and $15, %rcx > - pxor %xmm0, %xmm0 > - pxor %xmm1, %xmm1 > - > - pcmpeqb (%rsi), %xmm1 > - pmovmskb %xmm1, %rdx > - shr %cl, %rdx > - > -# ifdef USE_AS_STRNCPY > -# if defined USE_AS_STPCPY || defined USE_AS_STRCAT > - mov $16, %r10 > - sub %rcx, %r10 > - cmp %r10, %r8 > -# else > - mov $17, %r10 > - sub %rcx, %r10 > - cmp %r10, %r8 > -# endif > - jbe L(CopyFrom1To16BytesTailCase2OrCase3) > -# endif > - test %rdx, %rdx > - jnz L(CopyFrom1To16BytesTail) > - > - pcmpeqb 16(%rsi), %xmm0 > - pmovmskb %xmm0, %rdx > - > -# ifdef USE_AS_STRNCPY > - add $16, %r10 > - cmp %r10, %r8 > - jbe L(CopyFrom1To32BytesCase2OrCase3) > -# endif > - test %rdx, %rdx > - jnz L(CopyFrom1To32Bytes) > - > - movdqu (%rsi, %rcx), %xmm1 /* copy 16 bytes */ > - movdqu %xmm1, (%rdi) > - > -/* If source address alignment != destination address alignment */ > - .p2align 4 > -L(Unalign16Both): > - sub %rcx, %rdi > -# ifdef USE_AS_STRNCPY > - add %rcx, %r8 > -# endif > - mov $16, %rcx > - movdqa (%rsi, %rcx), %xmm1 > - movaps 16(%rsi, %rcx), %xmm2 > - movdqu %xmm1, (%rdi, %rcx) > - pcmpeqb %xmm2, %xmm0 > - 
pmovmskb %xmm0, %rdx > - add $16, %rcx > -# ifdef USE_AS_STRNCPY > - sub $48, %r8 > - jbe L(CopyFrom1To16BytesCase2OrCase3) > -# endif > - test %rdx, %rdx > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT > - jnz L(CopyFrom1To16BytesUnalignedXmm2) > -# else > - jnz L(CopyFrom1To16Bytes) > -# endif > - > - movaps 16(%rsi, %rcx), %xmm3 > - movdqu %xmm2, (%rdi, %rcx) > - pcmpeqb %xmm3, %xmm0 > - pmovmskb %xmm0, %rdx > - add $16, %rcx > -# ifdef USE_AS_STRNCPY > - sub $16, %r8 > - jbe L(CopyFrom1To16BytesCase2OrCase3) > -# endif > - test %rdx, %rdx > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT > - jnz L(CopyFrom1To16BytesUnalignedXmm3) > -# else > - jnz L(CopyFrom1To16Bytes) > -# endif > - > - movaps 16(%rsi, %rcx), %xmm4 > - movdqu %xmm3, (%rdi, %rcx) > - pcmpeqb %xmm4, %xmm0 > - pmovmskb %xmm0, %rdx > - add $16, %rcx > -# ifdef USE_AS_STRNCPY > - sub $16, %r8 > - jbe L(CopyFrom1To16BytesCase2OrCase3) > -# endif > - test %rdx, %rdx > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT > - jnz L(CopyFrom1To16BytesUnalignedXmm4) > -# else > - jnz L(CopyFrom1To16Bytes) > -# endif > - > - movaps 16(%rsi, %rcx), %xmm1 > - movdqu %xmm4, (%rdi, %rcx) > - pcmpeqb %xmm1, %xmm0 > - pmovmskb %xmm0, %rdx > - add $16, %rcx > -# ifdef USE_AS_STRNCPY > - sub $16, %r8 > - jbe L(CopyFrom1To16BytesCase2OrCase3) > -# endif > - test %rdx, %rdx > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT > - jnz L(CopyFrom1To16BytesUnalignedXmm1) > -# else > - jnz L(CopyFrom1To16Bytes) > -# endif > - > - movaps 16(%rsi, %rcx), %xmm2 > - movdqu %xmm1, (%rdi, %rcx) > - pcmpeqb %xmm2, %xmm0 > - pmovmskb %xmm0, %rdx > - add $16, %rcx > -# ifdef USE_AS_STRNCPY > - sub $16, %r8 > - jbe L(CopyFrom1To16BytesCase2OrCase3) > -# endif > - test %rdx, %rdx > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT > - jnz L(CopyFrom1To16BytesUnalignedXmm2) > -# else > - jnz L(CopyFrom1To16Bytes) > -# endif > - > - movaps 16(%rsi, %rcx), %xmm3 > - movdqu %xmm2, (%rdi, %rcx) > - pcmpeqb 
%xmm3, %xmm0 > - pmovmskb %xmm0, %rdx > - add $16, %rcx > -# ifdef USE_AS_STRNCPY > - sub $16, %r8 > - jbe L(CopyFrom1To16BytesCase2OrCase3) > -# endif > - test %rdx, %rdx > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT > - jnz L(CopyFrom1To16BytesUnalignedXmm3) > -# else > - jnz L(CopyFrom1To16Bytes) > -# endif > - > - movdqu %xmm3, (%rdi, %rcx) > - mov %rsi, %rdx > - lea 16(%rsi, %rcx), %rsi > - and $-0x40, %rsi > - sub %rsi, %rdx > - sub %rdx, %rdi > -# ifdef USE_AS_STRNCPY > - lea 128(%r8, %rdx), %r8 > -# endif > -L(Unaligned64Loop): > - movaps (%rsi), %xmm2 > - movaps %xmm2, %xmm4 > - movaps 16(%rsi), %xmm5 > - movaps 32(%rsi), %xmm3 > - movaps %xmm3, %xmm6 > - movaps 48(%rsi), %xmm7 > - pminub %xmm5, %xmm2 > - pminub %xmm7, %xmm3 > - pminub %xmm2, %xmm3 > - pcmpeqb %xmm0, %xmm3 > - pmovmskb %xmm3, %rdx > -# ifdef USE_AS_STRNCPY > - sub $64, %r8 > - jbe L(UnalignedLeaveCase2OrCase3) > -# endif > - test %rdx, %rdx > - jnz L(Unaligned64Leave) > - > -L(Unaligned64Loop_start): > - add $64, %rdi > - add $64, %rsi > - movdqu %xmm4, -64(%rdi) > - movaps (%rsi), %xmm2 > - movdqa %xmm2, %xmm4 > - movdqu %xmm5, -48(%rdi) > - movaps 16(%rsi), %xmm5 > - pminub %xmm5, %xmm2 > - movaps 32(%rsi), %xmm3 > - movdqu %xmm6, -32(%rdi) > - movaps %xmm3, %xmm6 > - movdqu %xmm7, -16(%rdi) > - movaps 48(%rsi), %xmm7 > - pminub %xmm7, %xmm3 > - pminub %xmm2, %xmm3 > - pcmpeqb %xmm0, %xmm3 > - pmovmskb %xmm3, %rdx > -# ifdef USE_AS_STRNCPY > - sub $64, %r8 > - jbe L(UnalignedLeaveCase2OrCase3) > -# endif > - test %rdx, %rdx > - jz L(Unaligned64Loop_start) > - > -L(Unaligned64Leave): > - pxor %xmm1, %xmm1 > - > - pcmpeqb %xmm4, %xmm0 > - pcmpeqb %xmm5, %xmm1 > - pmovmskb %xmm0, %rdx > - pmovmskb %xmm1, %rcx > - test %rdx, %rdx > - jnz L(CopyFrom1To16BytesUnaligned_0) > - test %rcx, %rcx > - jnz L(CopyFrom1To16BytesUnaligned_16) > - > - pcmpeqb %xmm6, %xmm0 > - pcmpeqb %xmm7, %xmm1 > - pmovmskb %xmm0, %rdx > - pmovmskb %xmm1, %rcx > - test %rdx, %rdx > - jnz 
L(CopyFrom1To16BytesUnaligned_32) > - > - bsf %rcx, %rdx > - movdqu %xmm4, (%rdi) > - movdqu %xmm5, 16(%rdi) > - movdqu %xmm6, 32(%rdi) > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT > -# ifdef USE_AS_STPCPY > - lea 48(%rdi, %rdx), %rax > -# endif > - movdqu %xmm7, 48(%rdi) > - add $15, %r8 > - sub %rdx, %r8 > - lea 49(%rdi, %rdx), %rdi > - jmp L(StrncpyFillTailWithZero) > -# else > - add $48, %rsi > - add $48, %rdi > - BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4) > -# endif > - > -/* If source address alignment == destination address alignment */ > - > -L(SourceStringAlignmentLess32): > - pxor %xmm0, %xmm0 > - movdqu (%rsi), %xmm1 > - movdqu 16(%rsi), %xmm2 > - pcmpeqb %xmm1, %xmm0 > - pmovmskb %xmm0, %rdx > - > -# ifdef USE_AS_STRNCPY > -# if defined USE_AS_STPCPY || defined USE_AS_STRCAT > - cmp $16, %r8 > -# else > - cmp $17, %r8 > -# endif > - jbe L(CopyFrom1To16BytesTail1Case2OrCase3) > -# endif > - test %rdx, %rdx > - jnz L(CopyFrom1To16BytesTail1) > - > - pcmpeqb %xmm2, %xmm0 > - movdqu %xmm1, (%rdi) > - pmovmskb %xmm0, %rdx > - > -# ifdef USE_AS_STRNCPY > -# if defined USE_AS_STPCPY || defined USE_AS_STRCAT > - cmp $32, %r8 > -# else > - cmp $33, %r8 > -# endif > - jbe L(CopyFrom1To32Bytes1Case2OrCase3) > -# endif > - test %rdx, %rdx > - jnz L(CopyFrom1To32Bytes1) > - > - and $-16, %rsi > - and $15, %rcx > - jmp L(Unalign16Both) > - > -/*------End of main part with loops---------------------*/ > - > -/* Case1 */ > - > -# if (!defined USE_AS_STRNCPY) || (defined USE_AS_STRCAT) > - .p2align 4 > -L(CopyFrom1To16Bytes): > - add %rcx, %rdi > - add %rcx, %rsi > - bsf %rdx, %rdx > - BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4) > -# endif > - .p2align 4 > -L(CopyFrom1To16BytesTail): > - add %rcx, %rsi > - bsf %rdx, %rdx > - BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4) > - > - .p2align 4 > -L(CopyFrom1To32Bytes1): > - add $16, %rsi > - add $16, %rdi > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT > - sub $16, %r8 > -# endif > 
-L(CopyFrom1To16BytesTail1): > - bsf %rdx, %rdx > - BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4) > - > - .p2align 4 > -L(CopyFrom1To32Bytes): > - bsf %rdx, %rdx > - add %rcx, %rsi > - add $16, %rdx > - sub %rcx, %rdx > - BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4) > - > - .p2align 4 > -L(CopyFrom1To16BytesUnaligned_0): > - bsf %rdx, %rdx > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT > -# ifdef USE_AS_STPCPY > - lea (%rdi, %rdx), %rax > -# endif > - movdqu %xmm4, (%rdi) > - add $63, %r8 > - sub %rdx, %r8 > - lea 1(%rdi, %rdx), %rdi > - jmp L(StrncpyFillTailWithZero) > -# else > - BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4) > -# endif > - > - .p2align 4 > -L(CopyFrom1To16BytesUnaligned_16): > - bsf %rcx, %rdx > - movdqu %xmm4, (%rdi) > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT > -# ifdef USE_AS_STPCPY > - lea 16(%rdi, %rdx), %rax > -# endif > - movdqu %xmm5, 16(%rdi) > - add $47, %r8 > - sub %rdx, %r8 > - lea 17(%rdi, %rdx), %rdi > - jmp L(StrncpyFillTailWithZero) > -# else > - add $16, %rsi > - add $16, %rdi > - BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4) > -# endif > - > - .p2align 4 > -L(CopyFrom1To16BytesUnaligned_32): > - bsf %rdx, %rdx > - movdqu %xmm4, (%rdi) > - movdqu %xmm5, 16(%rdi) > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT > -# ifdef USE_AS_STPCPY > - lea 32(%rdi, %rdx), %rax > -# endif > - movdqu %xmm6, 32(%rdi) > - add $31, %r8 > - sub %rdx, %r8 > - lea 33(%rdi, %rdx), %rdi > - jmp L(StrncpyFillTailWithZero) > -# else > - add $32, %rsi > - add $32, %rdi > - BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4) > -# endif > - > -# ifdef USE_AS_STRNCPY > -# ifndef USE_AS_STRCAT > - .p2align 4 > -L(CopyFrom1To16BytesUnalignedXmm6): > - movdqu %xmm6, (%rdi, %rcx) > - jmp L(CopyFrom1To16BytesXmmExit) > - > - .p2align 4 > -L(CopyFrom1To16BytesUnalignedXmm5): > - movdqu %xmm5, (%rdi, %rcx) > - jmp L(CopyFrom1To16BytesXmmExit) > - > - .p2align 4 > -L(CopyFrom1To16BytesUnalignedXmm4): > - movdqu %xmm4, (%rdi, %rcx) > - 
jmp L(CopyFrom1To16BytesXmmExit) > - > - .p2align 4 > -L(CopyFrom1To16BytesUnalignedXmm3): > - movdqu %xmm3, (%rdi, %rcx) > - jmp L(CopyFrom1To16BytesXmmExit) > - > - .p2align 4 > -L(CopyFrom1To16BytesUnalignedXmm1): > - movdqu %xmm1, (%rdi, %rcx) > - jmp L(CopyFrom1To16BytesXmmExit) > -# endif > - > - .p2align 4 > -L(CopyFrom1To16BytesExit): > - BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4) > - > -/* Case2 */ > - > - .p2align 4 > -L(CopyFrom1To16BytesCase2): > - add $16, %r8 > - add %rcx, %rdi > - add %rcx, %rsi > - bsf %rdx, %rdx > - cmp %r8, %rdx > - jb L(CopyFrom1To16BytesExit) > - BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4) > - > - .p2align 4 > -L(CopyFrom1To32BytesCase2): > - add %rcx, %rsi > - bsf %rdx, %rdx > - add $16, %rdx > - sub %rcx, %rdx > - cmp %r8, %rdx > - jb L(CopyFrom1To16BytesExit) > - BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4) > - > -L(CopyFrom1To16BytesTailCase2): > - add %rcx, %rsi > - bsf %rdx, %rdx > - cmp %r8, %rdx > - jb L(CopyFrom1To16BytesExit) > - BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4) > - > -L(CopyFrom1To16BytesTail1Case2): > - bsf %rdx, %rdx > - cmp %r8, %rdx > - jb L(CopyFrom1To16BytesExit) > - BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4) > - > -/* Case2 or Case3, Case3 */ > - > - .p2align 4 > -L(CopyFrom1To16BytesCase2OrCase3): > - test %rdx, %rdx > - jnz L(CopyFrom1To16BytesCase2) > -L(CopyFrom1To16BytesCase3): > - add $16, %r8 > - add %rcx, %rdi > - add %rcx, %rsi > - BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4) > - > - .p2align 4 > -L(CopyFrom1To32BytesCase2OrCase3): > - test %rdx, %rdx > - jnz L(CopyFrom1To32BytesCase2) > - add %rcx, %rsi > - BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4) > - > - .p2align 4 > -L(CopyFrom1To16BytesTailCase2OrCase3): > - test %rdx, %rdx > - jnz L(CopyFrom1To16BytesTailCase2) > - add %rcx, %rsi > - BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4) > - > - .p2align 4 > -L(CopyFrom1To32Bytes1Case2OrCase3): > - add $16, %rdi > - add $16, %rsi 
> - sub $16, %r8 > -L(CopyFrom1To16BytesTail1Case2OrCase3): > - test %rdx, %rdx > - jnz L(CopyFrom1To16BytesTail1Case2) > - BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4) > - > -# endif > - > -/*------------End labels regarding with copying 1-16 bytes--and 1-32 bytes----*/ > - > - .p2align 4 > -L(Exit1): > - mov %dh, (%rdi) > -# ifdef USE_AS_STPCPY > - lea (%rdi), %rax > -# endif > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT > - sub $1, %r8 > - lea 1(%rdi), %rdi > - jnz L(StrncpyFillTailWithZero) > -# endif > - ret > - > - .p2align 4 > -L(Exit2): > - mov (%rsi), %dx > - mov %dx, (%rdi) > -# ifdef USE_AS_STPCPY > - lea 1(%rdi), %rax > -# endif > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT > - sub $2, %r8 > - lea 2(%rdi), %rdi > - jnz L(StrncpyFillTailWithZero) > -# endif > - ret > - > - .p2align 4 > -L(Exit3): > - mov (%rsi), %cx > - mov %cx, (%rdi) > - mov %dh, 2(%rdi) > -# ifdef USE_AS_STPCPY > - lea 2(%rdi), %rax > -# endif > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT > - sub $3, %r8 > - lea 3(%rdi), %rdi > - jnz L(StrncpyFillTailWithZero) > -# endif > - ret > - > - .p2align 4 > -L(Exit4): > - mov (%rsi), %edx > - mov %edx, (%rdi) > -# ifdef USE_AS_STPCPY > - lea 3(%rdi), %rax > -# endif > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT > - sub $4, %r8 > - lea 4(%rdi), %rdi > - jnz L(StrncpyFillTailWithZero) > -# endif > - ret > - > - .p2align 4 > -L(Exit5): > - mov (%rsi), %ecx > - mov %dh, 4(%rdi) > - mov %ecx, (%rdi) > -# ifdef USE_AS_STPCPY > - lea 4(%rdi), %rax > -# endif > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT > - sub $5, %r8 > - lea 5(%rdi), %rdi > - jnz L(StrncpyFillTailWithZero) > -# endif > - ret > - > - .p2align 4 > -L(Exit6): > - mov (%rsi), %ecx > - mov 4(%rsi), %dx > - mov %ecx, (%rdi) > - mov %dx, 4(%rdi) > -# ifdef USE_AS_STPCPY > - lea 5(%rdi), %rax > -# endif > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT > - sub $6, %r8 > - lea 6(%rdi), %rdi > - jnz 
L(StrncpyFillTailWithZero) > -# endif > - ret > - > - .p2align 4 > -L(Exit7): > - mov (%rsi), %ecx > - mov 3(%rsi), %edx > - mov %ecx, (%rdi) > - mov %edx, 3(%rdi) > -# ifdef USE_AS_STPCPY > - lea 6(%rdi), %rax > -# endif > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT > - sub $7, %r8 > - lea 7(%rdi), %rdi > - jnz L(StrncpyFillTailWithZero) > -# endif > - ret > - > - .p2align 4 > -L(Exit8): > - mov (%rsi), %rdx > - mov %rdx, (%rdi) > -# ifdef USE_AS_STPCPY > - lea 7(%rdi), %rax > -# endif > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT > - sub $8, %r8 > - lea 8(%rdi), %rdi > - jnz L(StrncpyFillTailWithZero) > -# endif > - ret > - > - .p2align 4 > -L(Exit9): > - mov (%rsi), %rcx > - mov %dh, 8(%rdi) > - mov %rcx, (%rdi) > -# ifdef USE_AS_STPCPY > - lea 8(%rdi), %rax > -# endif > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT > - sub $9, %r8 > - lea 9(%rdi), %rdi > - jnz L(StrncpyFillTailWithZero) > -# endif > - ret > - > - .p2align 4 > -L(Exit10): > - mov (%rsi), %rcx > - mov 8(%rsi), %dx > - mov %rcx, (%rdi) > - mov %dx, 8(%rdi) > -# ifdef USE_AS_STPCPY > - lea 9(%rdi), %rax > -# endif > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT > - sub $10, %r8 > - lea 10(%rdi), %rdi > - jnz L(StrncpyFillTailWithZero) > -# endif > - ret > - > - .p2align 4 > -L(Exit11): > - mov (%rsi), %rcx > - mov 7(%rsi), %edx > - mov %rcx, (%rdi) > - mov %edx, 7(%rdi) > -# ifdef USE_AS_STPCPY > - lea 10(%rdi), %rax > -# endif > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT > - sub $11, %r8 > - lea 11(%rdi), %rdi > - jnz L(StrncpyFillTailWithZero) > -# endif > - ret > - > - .p2align 4 > -L(Exit12): > - mov (%rsi), %rcx > - mov 8(%rsi), %edx > - mov %rcx, (%rdi) > - mov %edx, 8(%rdi) > -# ifdef USE_AS_STPCPY > - lea 11(%rdi), %rax > -# endif > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT > - sub $12, %r8 > - lea 12(%rdi), %rdi > - jnz L(StrncpyFillTailWithZero) > -# endif > - ret > - > - .p2align 4 > -L(Exit13): > - mov (%rsi), %rcx > - mov 
5(%rsi), %rdx > - mov %rcx, (%rdi) > - mov %rdx, 5(%rdi) > -# ifdef USE_AS_STPCPY > - lea 12(%rdi), %rax > -# endif > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT > - sub $13, %r8 > - lea 13(%rdi), %rdi > - jnz L(StrncpyFillTailWithZero) > -# endif > - ret > - > - .p2align 4 > -L(Exit14): > - mov (%rsi), %rcx > - mov 6(%rsi), %rdx > - mov %rcx, (%rdi) > - mov %rdx, 6(%rdi) > -# ifdef USE_AS_STPCPY > - lea 13(%rdi), %rax > -# endif > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT > - sub $14, %r8 > - lea 14(%rdi), %rdi > - jnz L(StrncpyFillTailWithZero) > -# endif > - ret > - > - .p2align 4 > -L(Exit15): > - mov (%rsi), %rcx > - mov 7(%rsi), %rdx > - mov %rcx, (%rdi) > - mov %rdx, 7(%rdi) > -# ifdef USE_AS_STPCPY > - lea 14(%rdi), %rax > -# endif > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT > - sub $15, %r8 > - lea 15(%rdi), %rdi > - jnz L(StrncpyFillTailWithZero) > -# endif > - ret > - > - .p2align 4 > -L(Exit16): > - movdqu (%rsi), %xmm0 > - movdqu %xmm0, (%rdi) > -# ifdef USE_AS_STPCPY > - lea 15(%rdi), %rax > -# endif > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT > - sub $16, %r8 > - lea 16(%rdi), %rdi > - jnz L(StrncpyFillTailWithZero) > -# endif > - ret > - > - .p2align 4 > -L(Exit17): > - movdqu (%rsi), %xmm0 > - movdqu %xmm0, (%rdi) > - mov %dh, 16(%rdi) > -# ifdef USE_AS_STPCPY > - lea 16(%rdi), %rax > -# endif > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT > - sub $17, %r8 > - lea 17(%rdi), %rdi > - jnz L(StrncpyFillTailWithZero) > -# endif > - ret > - > - .p2align 4 > -L(Exit18): > - movdqu (%rsi), %xmm0 > - mov 16(%rsi), %cx > - movdqu %xmm0, (%rdi) > - mov %cx, 16(%rdi) > -# ifdef USE_AS_STPCPY > - lea 17(%rdi), %rax > -# endif > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT > - sub $18, %r8 > - lea 18(%rdi), %rdi > - jnz L(StrncpyFillTailWithZero) > -# endif > - ret > - > - .p2align 4 > -L(Exit19): > - movdqu (%rsi), %xmm0 > - mov 15(%rsi), %ecx > - movdqu %xmm0, (%rdi) > - mov %ecx, 15(%rdi) 
> -# ifdef USE_AS_STPCPY > - lea 18(%rdi), %rax > -# endif > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT > - sub $19, %r8 > - lea 19(%rdi), %rdi > - jnz L(StrncpyFillTailWithZero) > -# endif > - ret > - > - .p2align 4 > -L(Exit20): > - movdqu (%rsi), %xmm0 > - mov 16(%rsi), %ecx > - movdqu %xmm0, (%rdi) > - mov %ecx, 16(%rdi) > -# ifdef USE_AS_STPCPY > - lea 19(%rdi), %rax > -# endif > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT > - sub $20, %r8 > - lea 20(%rdi), %rdi > - jnz L(StrncpyFillTailWithZero) > -# endif > - ret > - > - .p2align 4 > -L(Exit21): > - movdqu (%rsi), %xmm0 > - mov 16(%rsi), %ecx > - movdqu %xmm0, (%rdi) > - mov %ecx, 16(%rdi) > - mov %dh, 20(%rdi) > -# ifdef USE_AS_STPCPY > - lea 20(%rdi), %rax > -# endif > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT > - sub $21, %r8 > - lea 21(%rdi), %rdi > - jnz L(StrncpyFillTailWithZero) > -# endif > - ret > - > - .p2align 4 > -L(Exit22): > - movdqu (%rsi), %xmm0 > - mov 14(%rsi), %rcx > - movdqu %xmm0, (%rdi) > - mov %rcx, 14(%rdi) > -# ifdef USE_AS_STPCPY > - lea 21(%rdi), %rax > -# endif > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT > - sub $22, %r8 > - lea 22(%rdi), %rdi > - jnz L(StrncpyFillTailWithZero) > -# endif > - ret > - > - .p2align 4 > -L(Exit23): > - movdqu (%rsi), %xmm0 > - mov 15(%rsi), %rcx > - movdqu %xmm0, (%rdi) > - mov %rcx, 15(%rdi) > -# ifdef USE_AS_STPCPY > - lea 22(%rdi), %rax > -# endif > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT > - sub $23, %r8 > - lea 23(%rdi), %rdi > - jnz L(StrncpyFillTailWithZero) > -# endif > - ret > - > - .p2align 4 > -L(Exit24): > - movdqu (%rsi), %xmm0 > - mov 16(%rsi), %rcx > - movdqu %xmm0, (%rdi) > - mov %rcx, 16(%rdi) > -# ifdef USE_AS_STPCPY > - lea 23(%rdi), %rax > -# endif > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT > - sub $24, %r8 > - lea 24(%rdi), %rdi > - jnz L(StrncpyFillTailWithZero) > -# endif > - ret > - > - .p2align 4 > -L(Exit25): > - movdqu (%rsi), %xmm0 > - mov 
16(%rsi), %rcx > - movdqu %xmm0, (%rdi) > - mov %rcx, 16(%rdi) > - mov %dh, 24(%rdi) > -# ifdef USE_AS_STPCPY > - lea 24(%rdi), %rax > -# endif > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT > - sub $25, %r8 > - lea 25(%rdi), %rdi > - jnz L(StrncpyFillTailWithZero) > -# endif > - ret > - > - .p2align 4 > -L(Exit26): > - movdqu (%rsi), %xmm0 > - mov 16(%rsi), %rdx > - mov 24(%rsi), %cx > - movdqu %xmm0, (%rdi) > - mov %rdx, 16(%rdi) > - mov %cx, 24(%rdi) > -# ifdef USE_AS_STPCPY > - lea 25(%rdi), %rax > -# endif > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT > - sub $26, %r8 > - lea 26(%rdi), %rdi > - jnz L(StrncpyFillTailWithZero) > -# endif > - ret > - > - .p2align 4 > -L(Exit27): > - movdqu (%rsi), %xmm0 > - mov 16(%rsi), %rdx > - mov 23(%rsi), %ecx > - movdqu %xmm0, (%rdi) > - mov %rdx, 16(%rdi) > - mov %ecx, 23(%rdi) > -# ifdef USE_AS_STPCPY > - lea 26(%rdi), %rax > -# endif > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT > - sub $27, %r8 > - lea 27(%rdi), %rdi > - jnz L(StrncpyFillTailWithZero) > -# endif > - ret > - > - .p2align 4 > -L(Exit28): > - movdqu (%rsi), %xmm0 > - mov 16(%rsi), %rdx > - mov 24(%rsi), %ecx > - movdqu %xmm0, (%rdi) > - mov %rdx, 16(%rdi) > - mov %ecx, 24(%rdi) > -# ifdef USE_AS_STPCPY > - lea 27(%rdi), %rax > -# endif > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT > - sub $28, %r8 > - lea 28(%rdi), %rdi > - jnz L(StrncpyFillTailWithZero) > -# endif > - ret > - > - .p2align 4 > -L(Exit29): > - movdqu (%rsi), %xmm0 > - movdqu 13(%rsi), %xmm2 > - movdqu %xmm0, (%rdi) > - movdqu %xmm2, 13(%rdi) > -# ifdef USE_AS_STPCPY > - lea 28(%rdi), %rax > -# endif > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT > - sub $29, %r8 > - lea 29(%rdi), %rdi > - jnz L(StrncpyFillTailWithZero) > -# endif > - ret > - > - .p2align 4 > -L(Exit30): > - movdqu (%rsi), %xmm0 > - movdqu 14(%rsi), %xmm2 > - movdqu %xmm0, (%rdi) > - movdqu %xmm2, 14(%rdi) > -# ifdef USE_AS_STPCPY > - lea 29(%rdi), %rax > -# endif > -# if 
defined USE_AS_STRNCPY && !defined USE_AS_STRCAT > - sub $30, %r8 > - lea 30(%rdi), %rdi > - jnz L(StrncpyFillTailWithZero) > -# endif > - ret > - > - .p2align 4 > -L(Exit31): > - movdqu (%rsi), %xmm0 > - movdqu 15(%rsi), %xmm2 > - movdqu %xmm0, (%rdi) > - movdqu %xmm2, 15(%rdi) > -# ifdef USE_AS_STPCPY > - lea 30(%rdi), %rax > -# endif > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT > - sub $31, %r8 > - lea 31(%rdi), %rdi > - jnz L(StrncpyFillTailWithZero) > -# endif > - ret > - > - .p2align 4 > -L(Exit32): > - movdqu (%rsi), %xmm0 > - movdqu 16(%rsi), %xmm2 > - movdqu %xmm0, (%rdi) > - movdqu %xmm2, 16(%rdi) > -# ifdef USE_AS_STPCPY > - lea 31(%rdi), %rax > -# endif > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT > - sub $32, %r8 > - lea 32(%rdi), %rdi > - jnz L(StrncpyFillTailWithZero) > -# endif > - ret > - > -# ifdef USE_AS_STRNCPY > - > - .p2align 4 > -L(StrncpyExit0): > -# ifdef USE_AS_STPCPY > - mov %rdi, %rax > -# endif > -# ifdef USE_AS_STRCAT > - xor %ch, %ch > - movb %ch, (%rdi) > -# endif > - ret > - > - .p2align 4 > -L(StrncpyExit1): > - mov (%rsi), %dl > - mov %dl, (%rdi) > -# ifdef USE_AS_STPCPY > - lea 1(%rdi), %rax > -# endif > -# ifdef USE_AS_STRCAT > - xor %ch, %ch > - movb %ch, 1(%rdi) > -# endif > - ret > - > - .p2align 4 > -L(StrncpyExit2): > - mov (%rsi), %dx > - mov %dx, (%rdi) > -# ifdef USE_AS_STPCPY > - lea 2(%rdi), %rax > -# endif > -# ifdef USE_AS_STRCAT > - xor %ch, %ch > - movb %ch, 2(%rdi) > -# endif > - ret > - > - .p2align 4 > -L(StrncpyExit3): > - mov (%rsi), %cx > - mov 2(%rsi), %dl > - mov %cx, (%rdi) > - mov %dl, 2(%rdi) > -# ifdef USE_AS_STPCPY > - lea 3(%rdi), %rax > -# endif > -# ifdef USE_AS_STRCAT > - xor %ch, %ch > - movb %ch, 3(%rdi) > -# endif > - ret > - > - .p2align 4 > -L(StrncpyExit4): > - mov (%rsi), %edx > - mov %edx, (%rdi) > -# ifdef USE_AS_STPCPY > - lea 4(%rdi), %rax > -# endif > -# ifdef USE_AS_STRCAT > - xor %ch, %ch > - movb %ch, 4(%rdi) > -# endif > - ret > - > - .p2align 4 > 
-L(StrncpyExit5): > - mov (%rsi), %ecx > - mov 4(%rsi), %dl > - mov %ecx, (%rdi) > - mov %dl, 4(%rdi) > -# ifdef USE_AS_STPCPY > - lea 5(%rdi), %rax > -# endif > -# ifdef USE_AS_STRCAT > - xor %ch, %ch > - movb %ch, 5(%rdi) > -# endif > - ret > - > - .p2align 4 > -L(StrncpyExit6): > - mov (%rsi), %ecx > - mov 4(%rsi), %dx > - mov %ecx, (%rdi) > - mov %dx, 4(%rdi) > -# ifdef USE_AS_STPCPY > - lea 6(%rdi), %rax > -# endif > -# ifdef USE_AS_STRCAT > - xor %ch, %ch > - movb %ch, 6(%rdi) > -# endif > - ret > - > - .p2align 4 > -L(StrncpyExit7): > - mov (%rsi), %ecx > - mov 3(%rsi), %edx > - mov %ecx, (%rdi) > - mov %edx, 3(%rdi) > -# ifdef USE_AS_STPCPY > - lea 7(%rdi), %rax > -# endif > -# ifdef USE_AS_STRCAT > - xor %ch, %ch > - movb %ch, 7(%rdi) > -# endif > - ret > - > - .p2align 4 > -L(StrncpyExit8): > - mov (%rsi), %rdx > - mov %rdx, (%rdi) > -# ifdef USE_AS_STPCPY > - lea 8(%rdi), %rax > -# endif > -# ifdef USE_AS_STRCAT > - xor %ch, %ch > - movb %ch, 8(%rdi) > -# endif > - ret > - > - .p2align 4 > -L(StrncpyExit9): > - mov (%rsi), %rcx > - mov 8(%rsi), %dl > - mov %rcx, (%rdi) > - mov %dl, 8(%rdi) > -# ifdef USE_AS_STPCPY > - lea 9(%rdi), %rax > -# endif > -# ifdef USE_AS_STRCAT > - xor %ch, %ch > - movb %ch, 9(%rdi) > -# endif > - ret > - > - .p2align 4 > -L(StrncpyExit10): > - mov (%rsi), %rcx > - mov 8(%rsi), %dx > - mov %rcx, (%rdi) > - mov %dx, 8(%rdi) > -# ifdef USE_AS_STPCPY > - lea 10(%rdi), %rax > -# endif > -# ifdef USE_AS_STRCAT > - xor %ch, %ch > - movb %ch, 10(%rdi) > -# endif > - ret > - > - .p2align 4 > -L(StrncpyExit11): > - mov (%rsi), %rcx > - mov 7(%rsi), %edx > - mov %rcx, (%rdi) > - mov %edx, 7(%rdi) > -# ifdef USE_AS_STPCPY > - lea 11(%rdi), %rax > -# endif > -# ifdef USE_AS_STRCAT > - xor %ch, %ch > - movb %ch, 11(%rdi) > -# endif > - ret > - > - .p2align 4 > -L(StrncpyExit12): > - mov (%rsi), %rcx > - mov 8(%rsi), %edx > - mov %rcx, (%rdi) > - mov %edx, 8(%rdi) > -# ifdef USE_AS_STPCPY > - lea 12(%rdi), %rax > -# endif > -# ifdef 
USE_AS_STRCAT > - xor %ch, %ch > - movb %ch, 12(%rdi) > -# endif > - ret > - > - .p2align 4 > -L(StrncpyExit13): > - mov (%rsi), %rcx > - mov 5(%rsi), %rdx > - mov %rcx, (%rdi) > - mov %rdx, 5(%rdi) > -# ifdef USE_AS_STPCPY > - lea 13(%rdi), %rax > -# endif > -# ifdef USE_AS_STRCAT > - xor %ch, %ch > - movb %ch, 13(%rdi) > -# endif > - ret > - > - .p2align 4 > -L(StrncpyExit14): > - mov (%rsi), %rcx > - mov 6(%rsi), %rdx > - mov %rcx, (%rdi) > - mov %rdx, 6(%rdi) > -# ifdef USE_AS_STPCPY > - lea 14(%rdi), %rax > -# endif > -# ifdef USE_AS_STRCAT > - xor %ch, %ch > - movb %ch, 14(%rdi) > -# endif > - ret > - > - .p2align 4 > -L(StrncpyExit15): > - mov (%rsi), %rcx > - mov 7(%rsi), %rdx > - mov %rcx, (%rdi) > - mov %rdx, 7(%rdi) > -# ifdef USE_AS_STPCPY > - lea 15(%rdi), %rax > -# endif > -# ifdef USE_AS_STRCAT > - xor %ch, %ch > - movb %ch, 15(%rdi) > -# endif > - ret > - > - .p2align 4 > -L(StrncpyExit16): > - movdqu (%rsi), %xmm0 > - movdqu %xmm0, (%rdi) > -# ifdef USE_AS_STPCPY > - lea 16(%rdi), %rax > -# endif > -# ifdef USE_AS_STRCAT > - xor %ch, %ch > - movb %ch, 16(%rdi) > -# endif > - ret > - > - .p2align 4 > -L(StrncpyExit17): > - movdqu (%rsi), %xmm0 > - mov 16(%rsi), %cl > - movdqu %xmm0, (%rdi) > - mov %cl, 16(%rdi) > -# ifdef USE_AS_STPCPY > - lea 17(%rdi), %rax > -# endif > -# ifdef USE_AS_STRCAT > - xor %ch, %ch > - movb %ch, 17(%rdi) > -# endif > - ret > - > - .p2align 4 > -L(StrncpyExit18): > - movdqu (%rsi), %xmm0 > - mov 16(%rsi), %cx > - movdqu %xmm0, (%rdi) > - mov %cx, 16(%rdi) > -# ifdef USE_AS_STPCPY > - lea 18(%rdi), %rax > -# endif > -# ifdef USE_AS_STRCAT > - xor %ch, %ch > - movb %ch, 18(%rdi) > -# endif > - ret > - > - .p2align 4 > -L(StrncpyExit19): > - movdqu (%rsi), %xmm0 > - mov 15(%rsi), %ecx > - movdqu %xmm0, (%rdi) > - mov %ecx, 15(%rdi) > -# ifdef USE_AS_STPCPY > - lea 19(%rdi), %rax > -# endif > -# ifdef USE_AS_STRCAT > - xor %ch, %ch > - movb %ch, 19(%rdi) > -# endif > - ret > - > - .p2align 4 > -L(StrncpyExit20): > - movdqu 
(%rsi), %xmm0 > - mov 16(%rsi), %ecx > - movdqu %xmm0, (%rdi) > - mov %ecx, 16(%rdi) > -# ifdef USE_AS_STPCPY > - lea 20(%rdi), %rax > -# endif > -# ifdef USE_AS_STRCAT > - xor %ch, %ch > - movb %ch, 20(%rdi) > -# endif > - ret > - > - .p2align 4 > -L(StrncpyExit21): > - movdqu (%rsi), %xmm0 > - mov 16(%rsi), %ecx > - mov 20(%rsi), %dl > - movdqu %xmm0, (%rdi) > - mov %ecx, 16(%rdi) > - mov %dl, 20(%rdi) > -# ifdef USE_AS_STPCPY > - lea 21(%rdi), %rax > -# endif > -# ifdef USE_AS_STRCAT > - xor %ch, %ch > - movb %ch, 21(%rdi) > -# endif > - ret > - > - .p2align 4 > -L(StrncpyExit22): > - movdqu (%rsi), %xmm0 > - mov 14(%rsi), %rcx > - movdqu %xmm0, (%rdi) > - mov %rcx, 14(%rdi) > -# ifdef USE_AS_STPCPY > - lea 22(%rdi), %rax > -# endif > -# ifdef USE_AS_STRCAT > - xor %ch, %ch > - movb %ch, 22(%rdi) > -# endif > - ret > - > - .p2align 4 > -L(StrncpyExit23): > - movdqu (%rsi), %xmm0 > - mov 15(%rsi), %rcx > - movdqu %xmm0, (%rdi) > - mov %rcx, 15(%rdi) > -# ifdef USE_AS_STPCPY > - lea 23(%rdi), %rax > -# endif > -# ifdef USE_AS_STRCAT > - xor %ch, %ch > - movb %ch, 23(%rdi) > -# endif > - ret > - > - .p2align 4 > -L(StrncpyExit24): > - movdqu (%rsi), %xmm0 > - mov 16(%rsi), %rcx > - movdqu %xmm0, (%rdi) > - mov %rcx, 16(%rdi) > -# ifdef USE_AS_STPCPY > - lea 24(%rdi), %rax > -# endif > -# ifdef USE_AS_STRCAT > - xor %ch, %ch > - movb %ch, 24(%rdi) > -# endif > - ret > - > - .p2align 4 > -L(StrncpyExit25): > - movdqu (%rsi), %xmm0 > - mov 16(%rsi), %rdx > - mov 24(%rsi), %cl > - movdqu %xmm0, (%rdi) > - mov %rdx, 16(%rdi) > - mov %cl, 24(%rdi) > -# ifdef USE_AS_STPCPY > - lea 25(%rdi), %rax > -# endif > -# ifdef USE_AS_STRCAT > - xor %ch, %ch > - movb %ch, 25(%rdi) > -# endif > - ret > - > - .p2align 4 > -L(StrncpyExit26): > - movdqu (%rsi), %xmm0 > - mov 16(%rsi), %rdx > - mov 24(%rsi), %cx > - movdqu %xmm0, (%rdi) > - mov %rdx, 16(%rdi) > - mov %cx, 24(%rdi) > -# ifdef USE_AS_STPCPY > - lea 26(%rdi), %rax > -# endif > -# ifdef USE_AS_STRCAT > - xor %ch, %ch > - 
movb %ch, 26(%rdi) > -# endif > - ret > - > - .p2align 4 > -L(StrncpyExit27): > - movdqu (%rsi), %xmm0 > - mov 16(%rsi), %rdx > - mov 23(%rsi), %ecx > - movdqu %xmm0, (%rdi) > - mov %rdx, 16(%rdi) > - mov %ecx, 23(%rdi) > -# ifdef USE_AS_STPCPY > - lea 27(%rdi), %rax > -# endif > -# ifdef USE_AS_STRCAT > - xor %ch, %ch > - movb %ch, 27(%rdi) > -# endif > - ret > - > - .p2align 4 > -L(StrncpyExit28): > - movdqu (%rsi), %xmm0 > - mov 16(%rsi), %rdx > - mov 24(%rsi), %ecx > - movdqu %xmm0, (%rdi) > - mov %rdx, 16(%rdi) > - mov %ecx, 24(%rdi) > -# ifdef USE_AS_STPCPY > - lea 28(%rdi), %rax > -# endif > -# ifdef USE_AS_STRCAT > - xor %ch, %ch > - movb %ch, 28(%rdi) > -# endif > - ret > - > - .p2align 4 > -L(StrncpyExit29): > - movdqu (%rsi), %xmm0 > - movdqu 13(%rsi), %xmm2 > - movdqu %xmm0, (%rdi) > - movdqu %xmm2, 13(%rdi) > -# ifdef USE_AS_STPCPY > - lea 29(%rdi), %rax > -# endif > -# ifdef USE_AS_STRCAT > - xor %ch, %ch > - movb %ch, 29(%rdi) > -# endif > - ret > - > - .p2align 4 > -L(StrncpyExit30): > - movdqu (%rsi), %xmm0 > - movdqu 14(%rsi), %xmm2 > - movdqu %xmm0, (%rdi) > - movdqu %xmm2, 14(%rdi) > -# ifdef USE_AS_STPCPY > - lea 30(%rdi), %rax > -# endif > -# ifdef USE_AS_STRCAT > - xor %ch, %ch > - movb %ch, 30(%rdi) > -# endif > - ret > - > - .p2align 4 > -L(StrncpyExit31): > - movdqu (%rsi), %xmm0 > - movdqu 15(%rsi), %xmm2 > - movdqu %xmm0, (%rdi) > - movdqu %xmm2, 15(%rdi) > -# ifdef USE_AS_STPCPY > - lea 31(%rdi), %rax > -# endif > -# ifdef USE_AS_STRCAT > - xor %ch, %ch > - movb %ch, 31(%rdi) > -# endif > - ret > - > - .p2align 4 > -L(StrncpyExit32): > - movdqu (%rsi), %xmm0 > - movdqu 16(%rsi), %xmm2 > - movdqu %xmm0, (%rdi) > - movdqu %xmm2, 16(%rdi) > -# ifdef USE_AS_STPCPY > - lea 32(%rdi), %rax > -# endif > -# ifdef USE_AS_STRCAT > - xor %ch, %ch > - movb %ch, 32(%rdi) > -# endif > - ret > - > - .p2align 4 > -L(StrncpyExit33): > - movdqu (%rsi), %xmm0 > - movdqu 16(%rsi), %xmm2 > - mov 32(%rsi), %cl > - movdqu %xmm0, (%rdi) > - movdqu %xmm2, 
16(%rdi) > - mov %cl, 32(%rdi) > -# ifdef USE_AS_STRCAT > - xor %ch, %ch > - movb %ch, 33(%rdi) > -# endif > - ret > - > -# ifndef USE_AS_STRCAT > - > - .p2align 4 > -L(Fill0): > - ret > - > - .p2align 4 > -L(Fill1): > - mov %dl, (%rdi) > - ret > - > - .p2align 4 > -L(Fill2): > - mov %dx, (%rdi) > - ret > - > - .p2align 4 > -L(Fill3): > - mov %edx, -1(%rdi) > - ret > - > - .p2align 4 > -L(Fill4): > - mov %edx, (%rdi) > - ret > - > - .p2align 4 > -L(Fill5): > - mov %edx, (%rdi) > - mov %dl, 4(%rdi) > - ret > - > - .p2align 4 > -L(Fill6): > - mov %edx, (%rdi) > - mov %dx, 4(%rdi) > - ret > - > - .p2align 4 > -L(Fill7): > - mov %rdx, -1(%rdi) > - ret > - > - .p2align 4 > -L(Fill8): > - mov %rdx, (%rdi) > - ret > - > - .p2align 4 > -L(Fill9): > - mov %rdx, (%rdi) > - mov %dl, 8(%rdi) > - ret > - > - .p2align 4 > -L(Fill10): > - mov %rdx, (%rdi) > - mov %dx, 8(%rdi) > - ret > - > - .p2align 4 > -L(Fill11): > - mov %rdx, (%rdi) > - mov %edx, 7(%rdi) > - ret > - > - .p2align 4 > -L(Fill12): > - mov %rdx, (%rdi) > - mov %edx, 8(%rdi) > - ret > - > - .p2align 4 > -L(Fill13): > - mov %rdx, (%rdi) > - mov %rdx, 5(%rdi) > - ret > - > - .p2align 4 > -L(Fill14): > - mov %rdx, (%rdi) > - mov %rdx, 6(%rdi) > - ret > - > - .p2align 4 > -L(Fill15): > - movdqu %xmm0, -1(%rdi) > - ret > - > - .p2align 4 > -L(Fill16): > - movdqu %xmm0, (%rdi) > - ret > - > - .p2align 4 > -L(CopyFrom1To16BytesUnalignedXmm2): > - movdqu %xmm2, (%rdi, %rcx) > - > - .p2align 4 > -L(CopyFrom1To16BytesXmmExit): > - bsf %rdx, %rdx > - add $15, %r8 > - add %rcx, %rdi > -# ifdef USE_AS_STPCPY > - lea (%rdi, %rdx), %rax > -# endif > - sub %rdx, %r8 > - lea 1(%rdi, %rdx), %rdi > - > - .p2align 4 > -L(StrncpyFillTailWithZero): > - pxor %xmm0, %xmm0 > - xor %rdx, %rdx > - sub $16, %r8 > - jbe L(StrncpyFillExit) > - > - movdqu %xmm0, (%rdi) > - add $16, %rdi > - > - mov %rdi, %rsi > - and $0xf, %rsi > - sub %rsi, %rdi > - add %rsi, %r8 > - sub $64, %r8 > - jb L(StrncpyFillLess64) > - > -L(StrncpyFillLoopMovdqa): > - 
movdqa %xmm0, (%rdi) > - movdqa %xmm0, 16(%rdi) > - movdqa %xmm0, 32(%rdi) > - movdqa %xmm0, 48(%rdi) > - add $64, %rdi > - sub $64, %r8 > - jae L(StrncpyFillLoopMovdqa) > - > -L(StrncpyFillLess64): > - add $32, %r8 > - jl L(StrncpyFillLess32) > - movdqa %xmm0, (%rdi) > - movdqa %xmm0, 16(%rdi) > - add $32, %rdi > - sub $16, %r8 > - jl L(StrncpyFillExit) > - movdqa %xmm0, (%rdi) > - add $16, %rdi > - BRANCH_TO_JMPTBL_ENTRY (L(FillTable), %r8, 4) > - > -L(StrncpyFillLess32): > - add $16, %r8 > - jl L(StrncpyFillExit) > - movdqa %xmm0, (%rdi) > - add $16, %rdi > - BRANCH_TO_JMPTBL_ENTRY (L(FillTable), %r8, 4) > - > -L(StrncpyFillExit): > - add $16, %r8 > - BRANCH_TO_JMPTBL_ENTRY (L(FillTable), %r8, 4) > - > -/* end of ifndef USE_AS_STRCAT */ > -# endif > - > - .p2align 4 > -L(UnalignedLeaveCase2OrCase3): > - test %rdx, %rdx > - jnz L(Unaligned64LeaveCase2) > -L(Unaligned64LeaveCase3): > - lea 64(%r8), %rcx > - and $-16, %rcx > - add $48, %r8 > - jl L(CopyFrom1To16BytesCase3) > - movdqu %xmm4, (%rdi) > - sub $16, %r8 > - jb L(CopyFrom1To16BytesCase3) > - movdqu %xmm5, 16(%rdi) > - sub $16, %r8 > - jb L(CopyFrom1To16BytesCase3) > - movdqu %xmm6, 32(%rdi) > - sub $16, %r8 > - jb L(CopyFrom1To16BytesCase3) > - movdqu %xmm7, 48(%rdi) > -# ifdef USE_AS_STPCPY > - lea 64(%rdi), %rax > -# endif > -# ifdef USE_AS_STRCAT > - xor %ch, %ch > - movb %ch, 64(%rdi) > -# endif > - ret > - > - .p2align 4 > -L(Unaligned64LeaveCase2): > - xor %rcx, %rcx > - pcmpeqb %xmm4, %xmm0 > - pmovmskb %xmm0, %rdx > - add $48, %r8 > - jle L(CopyFrom1To16BytesCase2OrCase3) > - test %rdx, %rdx > -# ifndef USE_AS_STRCAT > - jnz L(CopyFrom1To16BytesUnalignedXmm4) > -# else > - jnz L(CopyFrom1To16Bytes) > -# endif > - pcmpeqb %xmm5, %xmm0 > - pmovmskb %xmm0, %rdx > - movdqu %xmm4, (%rdi) > - add $16, %rcx > - sub $16, %r8 > - jbe L(CopyFrom1To16BytesCase2OrCase3) > - test %rdx, %rdx > -# ifndef USE_AS_STRCAT > - jnz L(CopyFrom1To16BytesUnalignedXmm5) > -# else > - jnz L(CopyFrom1To16Bytes) > -# endif > 
- > - pcmpeqb %xmm6, %xmm0 > - pmovmskb %xmm0, %rdx > - movdqu %xmm5, 16(%rdi) > - add $16, %rcx > - sub $16, %r8 > - jbe L(CopyFrom1To16BytesCase2OrCase3) > - test %rdx, %rdx > -# ifndef USE_AS_STRCAT > - jnz L(CopyFrom1To16BytesUnalignedXmm6) > -# else > - jnz L(CopyFrom1To16Bytes) > -# endif > - > - pcmpeqb %xmm7, %xmm0 > - pmovmskb %xmm0, %rdx > - movdqu %xmm6, 32(%rdi) > - lea 16(%rdi, %rcx), %rdi > - lea 16(%rsi, %rcx), %rsi > - bsf %rdx, %rdx > - cmp %r8, %rdx > - jb L(CopyFrom1To16BytesExit) > - BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4) > - > - .p2align 4 > -L(ExitZero): > -# ifndef USE_AS_STRCAT > - mov %rdi, %rax > -# endif > - ret > - > -# endif > - > -# ifndef USE_AS_STRCAT > -END (STRCPY) > -# else > -END (STRCAT) > -# endif > - .p2align 4 > - .section .rodata > -L(ExitTable): > - .int JMPTBL(L(Exit1), L(ExitTable)) > - .int JMPTBL(L(Exit2), L(ExitTable)) > - .int JMPTBL(L(Exit3), L(ExitTable)) > - .int JMPTBL(L(Exit4), L(ExitTable)) > - .int JMPTBL(L(Exit5), L(ExitTable)) > - .int JMPTBL(L(Exit6), L(ExitTable)) > - .int JMPTBL(L(Exit7), L(ExitTable)) > - .int JMPTBL(L(Exit8), L(ExitTable)) > - .int JMPTBL(L(Exit9), L(ExitTable)) > - .int JMPTBL(L(Exit10), L(ExitTable)) > - .int JMPTBL(L(Exit11), L(ExitTable)) > - .int JMPTBL(L(Exit12), L(ExitTable)) > - .int JMPTBL(L(Exit13), L(ExitTable)) > - .int JMPTBL(L(Exit14), L(ExitTable)) > - .int JMPTBL(L(Exit15), L(ExitTable)) > - .int JMPTBL(L(Exit16), L(ExitTable)) > - .int JMPTBL(L(Exit17), L(ExitTable)) > - .int JMPTBL(L(Exit18), L(ExitTable)) > - .int JMPTBL(L(Exit19), L(ExitTable)) > - .int JMPTBL(L(Exit20), L(ExitTable)) > - .int JMPTBL(L(Exit21), L(ExitTable)) > - .int JMPTBL(L(Exit22), L(ExitTable)) > - .int JMPTBL(L(Exit23), L(ExitTable)) > - .int JMPTBL(L(Exit24), L(ExitTable)) > - .int JMPTBL(L(Exit25), L(ExitTable)) > - .int JMPTBL(L(Exit26), L(ExitTable)) > - .int JMPTBL(L(Exit27), L(ExitTable)) > - .int JMPTBL(L(Exit28), L(ExitTable)) > - .int JMPTBL(L(Exit29), L(ExitTable)) > - 
.int JMPTBL(L(Exit30), L(ExitTable)) > - .int JMPTBL(L(Exit31), L(ExitTable)) > - .int JMPTBL(L(Exit32), L(ExitTable)) > -# ifdef USE_AS_STRNCPY > -L(ExitStrncpyTable): > - .int JMPTBL(L(StrncpyExit0), L(ExitStrncpyTable)) > - .int JMPTBL(L(StrncpyExit1), L(ExitStrncpyTable)) > - .int JMPTBL(L(StrncpyExit2), L(ExitStrncpyTable)) > - .int JMPTBL(L(StrncpyExit3), L(ExitStrncpyTable)) > - .int JMPTBL(L(StrncpyExit4), L(ExitStrncpyTable)) > - .int JMPTBL(L(StrncpyExit5), L(ExitStrncpyTable)) > - .int JMPTBL(L(StrncpyExit6), L(ExitStrncpyTable)) > - .int JMPTBL(L(StrncpyExit7), L(ExitStrncpyTable)) > - .int JMPTBL(L(StrncpyExit8), L(ExitStrncpyTable)) > - .int JMPTBL(L(StrncpyExit9), L(ExitStrncpyTable)) > - .int JMPTBL(L(StrncpyExit10), L(ExitStrncpyTable)) > - .int JMPTBL(L(StrncpyExit11), L(ExitStrncpyTable)) > - .int JMPTBL(L(StrncpyExit12), L(ExitStrncpyTable)) > - .int JMPTBL(L(StrncpyExit13), L(ExitStrncpyTable)) > - .int JMPTBL(L(StrncpyExit14), L(ExitStrncpyTable)) > - .int JMPTBL(L(StrncpyExit15), L(ExitStrncpyTable)) > - .int JMPTBL(L(StrncpyExit16), L(ExitStrncpyTable)) > - .int JMPTBL(L(StrncpyExit17), L(ExitStrncpyTable)) > - .int JMPTBL(L(StrncpyExit18), L(ExitStrncpyTable)) > - .int JMPTBL(L(StrncpyExit19), L(ExitStrncpyTable)) > - .int JMPTBL(L(StrncpyExit20), L(ExitStrncpyTable)) > - .int JMPTBL(L(StrncpyExit21), L(ExitStrncpyTable)) > - .int JMPTBL(L(StrncpyExit22), L(ExitStrncpyTable)) > - .int JMPTBL(L(StrncpyExit23), L(ExitStrncpyTable)) > - .int JMPTBL(L(StrncpyExit24), L(ExitStrncpyTable)) > - .int JMPTBL(L(StrncpyExit25), L(ExitStrncpyTable)) > - .int JMPTBL(L(StrncpyExit26), L(ExitStrncpyTable)) > - .int JMPTBL(L(StrncpyExit27), L(ExitStrncpyTable)) > - .int JMPTBL(L(StrncpyExit28), L(ExitStrncpyTable)) > - .int JMPTBL(L(StrncpyExit29), L(ExitStrncpyTable)) > - .int JMPTBL(L(StrncpyExit30), L(ExitStrncpyTable)) > - .int JMPTBL(L(StrncpyExit31), L(ExitStrncpyTable)) > - .int JMPTBL(L(StrncpyExit32), L(ExitStrncpyTable)) > - .int 
JMPTBL(L(StrncpyExit33), L(ExitStrncpyTable)) > -# ifndef USE_AS_STRCAT > - .p2align 4 > -L(FillTable): > - .int JMPTBL(L(Fill0), L(FillTable)) > - .int JMPTBL(L(Fill1), L(FillTable)) > - .int JMPTBL(L(Fill2), L(FillTable)) > - .int JMPTBL(L(Fill3), L(FillTable)) > - .int JMPTBL(L(Fill4), L(FillTable)) > - .int JMPTBL(L(Fill5), L(FillTable)) > - .int JMPTBL(L(Fill6), L(FillTable)) > - .int JMPTBL(L(Fill7), L(FillTable)) > - .int JMPTBL(L(Fill8), L(FillTable)) > - .int JMPTBL(L(Fill9), L(FillTable)) > - .int JMPTBL(L(Fill10), L(FillTable)) > - .int JMPTBL(L(Fill11), L(FillTable)) > - .int JMPTBL(L(Fill12), L(FillTable)) > - .int JMPTBL(L(Fill13), L(FillTable)) > - .int JMPTBL(L(Fill14), L(FillTable)) > - .int JMPTBL(L(Fill15), L(FillTable)) > - .int JMPTBL(L(Fill16), L(FillTable)) > -# endif > -# endif > -#endif > +#define AS_STRCPY > +#define STPCPY __strcpy_sse2_unaligned > +#include "stpcpy-sse2-unaligned.S" > diff --git a/sysdeps/x86_64/multiarch/strcpy.S b/sysdeps/x86_64/multiarch/strcpy.S > index 9464ee8..92be04c 100644 > --- a/sysdeps/x86_64/multiarch/strcpy.S > +++ b/sysdeps/x86_64/multiarch/strcpy.S > @@ -28,31 +28,18 @@ > #endif > > #ifdef USE_AS_STPCPY > -# ifdef USE_AS_STRNCPY > -# define STRCPY_SSSE3 __stpncpy_ssse3 > -# define STRCPY_SSE2 __stpncpy_sse2 > -# define STRCPY_SSE2_UNALIGNED __stpncpy_sse2_unaligned > -# define __GI_STRCPY __GI_stpncpy > -# define __GI___STRCPY __GI___stpncpy > -# else > # define STRCPY_SSSE3 __stpcpy_ssse3 > # define STRCPY_SSE2 __stpcpy_sse2 > +# define STRCPY_AVX2 __stpcpy_avx2 > # define STRCPY_SSE2_UNALIGNED __stpcpy_sse2_unaligned > # define __GI_STRCPY __GI_stpcpy > # define __GI___STRCPY __GI___stpcpy > -# endif > #else > -# ifdef USE_AS_STRNCPY > -# define STRCPY_SSSE3 __strncpy_ssse3 > -# define STRCPY_SSE2 __strncpy_sse2 > -# define STRCPY_SSE2_UNALIGNED __strncpy_sse2_unaligned > -# define __GI_STRCPY __GI_strncpy > -# else > # define STRCPY_SSSE3 __strcpy_ssse3 > +# define STRCPY_AVX2 __strcpy_avx2 > # define 
STRCPY_SSE2 __strcpy_sse2 > # define STRCPY_SSE2_UNALIGNED __strcpy_sse2_unaligned > # define __GI_STRCPY __GI_strcpy > -# endif > #endif > > > @@ -64,7 +51,10 @@ ENTRY(STRCPY) > cmpl $0, __cpu_features+KIND_OFFSET(%rip) > jne 1f > call __init_cpu_features > -1: leaq STRCPY_SSE2_UNALIGNED(%rip), %rax > +1: leaq STRCPY_AVX2(%rip), %rax > + testl $bit_AVX_Fast_Unaligned_Load, __cpu_features+FEATURE_OFFSET+index_AVX_Fast_Unaligned_Load(%rip) > + jnz 2f > + leaq STRCPY_SSE2_UNALIGNED(%rip), %rax > testl $bit_Fast_Unaligned_Load, __cpu_features+FEATURE_OFFSET+index_Fast_Unaligned_Load(%rip) > jnz 2f > leaq STRCPY_SSE2(%rip), %rax > diff --git a/sysdeps/x86_64/multiarch/strncpy-sse2-unaligned.S b/sysdeps/x86_64/multiarch/strncpy-sse2-unaligned.S > index fcc23a7..e4c98e7 100644 > --- a/sysdeps/x86_64/multiarch/strncpy-sse2-unaligned.S > +++ b/sysdeps/x86_64/multiarch/strncpy-sse2-unaligned.S > @@ -1,3 +1,1888 @@ > -#define USE_AS_STRNCPY > -#define STRCPY __strncpy_sse2_unaligned > -#include "strcpy-sse2-unaligned.S" > +/* strcpy with SSE2 and unaligned load > + Copyright (C) 2011-2015 Free Software Foundation, Inc. > + Contributed by Intel Corporation. > + This file is part of the GNU C Library. > + > + The GNU C Library is free software; you can redistribute it and/or > + modify it under the terms of the GNU Lesser General Public > + License as published by the Free Software Foundation; either > + version 2.1 of the License, or (at your option) any later version. > + > + The GNU C Library is distributed in the hope that it will be useful, > + but WITHOUT ANY WARRANTY; without even the implied warranty of > + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU > + Lesser General Public License for more details. > + > + You should have received a copy of the GNU Lesser General Public > + License along with the GNU C Library; if not, see > + <http://www.gnu.org/licenses/>. 
*/ > + > +#if IS_IN (libc) > + > +# ifndef USE_AS_STRCAT > +# include <sysdep.h> > + > +# ifndef STRCPY > +# define STRCPY __strncpy_sse2_unaligned > +# endif > + > +# define USE_AS_STRNCPY > +# endif > + > +# define JMPTBL(I, B) I - B > +# define BRANCH_TO_JMPTBL_ENTRY(TABLE, INDEX, SCALE) \ > + lea TABLE(%rip), %r11; \ > + movslq (%r11, INDEX, SCALE), %rcx; \ > + lea (%r11, %rcx), %rcx; \ > + jmp *%rcx > + > +# ifndef USE_AS_STRCAT > + > +.text > +ENTRY (STRCPY) > +# ifdef USE_AS_STRNCPY > + mov %rdx, %r8 > + test %r8, %r8 > + jz L(ExitZero) > +# endif > + mov %rsi, %rcx > +# ifndef USE_AS_STPCPY > + mov %rdi, %rax /* save result */ > +# endif > + > +# endif > + > + and $63, %rcx > + cmp $32, %rcx > + jbe L(SourceStringAlignmentLess32) > + > + and $-16, %rsi > + and $15, %rcx > + pxor %xmm0, %xmm0 > + pxor %xmm1, %xmm1 > + > + pcmpeqb (%rsi), %xmm1 > + pmovmskb %xmm1, %rdx > + shr %cl, %rdx > + > +# ifdef USE_AS_STRNCPY > +# if defined USE_AS_STPCPY || defined USE_AS_STRCAT > + mov $16, %r10 > + sub %rcx, %r10 > + cmp %r10, %r8 > +# else > + mov $17, %r10 > + sub %rcx, %r10 > + cmp %r10, %r8 > +# endif > + jbe L(CopyFrom1To16BytesTailCase2OrCase3) > +# endif > + test %rdx, %rdx > + jnz L(CopyFrom1To16BytesTail) > + > + pcmpeqb 16(%rsi), %xmm0 > + pmovmskb %xmm0, %rdx > + > +# ifdef USE_AS_STRNCPY > + add $16, %r10 > + cmp %r10, %r8 > + jbe L(CopyFrom1To32BytesCase2OrCase3) > +# endif > + test %rdx, %rdx > + jnz L(CopyFrom1To32Bytes) > + > + movdqu (%rsi, %rcx), %xmm1 /* copy 16 bytes */ > + movdqu %xmm1, (%rdi) > + > +/* If source address alignment != destination address alignment */ > + .p2align 4 > +L(Unalign16Both): > + sub %rcx, %rdi > +# ifdef USE_AS_STRNCPY > + add %rcx, %r8 > +# endif > + mov $16, %rcx > + movdqa (%rsi, %rcx), %xmm1 > + movaps 16(%rsi, %rcx), %xmm2 > + movdqu %xmm1, (%rdi, %rcx) > + pcmpeqb %xmm2, %xmm0 > + pmovmskb %xmm0, %rdx > + add $16, %rcx > +# ifdef USE_AS_STRNCPY > + sub $48, %r8 > + jbe L(CopyFrom1To16BytesCase2OrCase3) > +# endif 
> + test %rdx, %rdx
> +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> + jnz L(CopyFrom1To16BytesUnalignedXmm2)
> +# else
> + jnz L(CopyFrom1To16Bytes)
> +# endif
> +
> + movaps 16(%rsi, %rcx), %xmm3
> + movdqu %xmm2, (%rdi, %rcx)
> + pcmpeqb %xmm3, %xmm0
> + pmovmskb %xmm0, %rdx
> + add $16, %rcx
> +# ifdef USE_AS_STRNCPY
> + sub $16, %r8
> + jbe L(CopyFrom1To16BytesCase2OrCase3)
> +# endif
> + test %rdx, %rdx
> +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> + jnz L(CopyFrom1To16BytesUnalignedXmm3)
> +# else
> + jnz L(CopyFrom1To16Bytes)
> +# endif
> +
> + movaps 16(%rsi, %rcx), %xmm4
> + movdqu %xmm3, (%rdi, %rcx)
> + pcmpeqb %xmm4, %xmm0
> + pmovmskb %xmm0, %rdx
> + add $16, %rcx
> +# ifdef USE_AS_STRNCPY
> + sub $16, %r8
> + jbe L(CopyFrom1To16BytesCase2OrCase3)
> +# endif
> + test %rdx, %rdx
> +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> + jnz L(CopyFrom1To16BytesUnalignedXmm4)
> +# else
> + jnz L(CopyFrom1To16Bytes)
> +# endif
> +
> + movaps 16(%rsi, %rcx), %xmm1
> + movdqu %xmm4, (%rdi, %rcx)
> + pcmpeqb %xmm1, %xmm0
> + pmovmskb %xmm0, %rdx
> + add $16, %rcx
> +# ifdef USE_AS_STRNCPY
> + sub $16, %r8
> + jbe L(CopyFrom1To16BytesCase2OrCase3)
> +# endif
> + test %rdx, %rdx
> +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> + jnz L(CopyFrom1To16BytesUnalignedXmm1)
> +# else
> + jnz L(CopyFrom1To16Bytes)
> +# endif
> +
> + movaps 16(%rsi, %rcx), %xmm2
> + movdqu %xmm1, (%rdi, %rcx)
> + pcmpeqb %xmm2, %xmm0
> + pmovmskb %xmm0, %rdx
> + add $16, %rcx
> +# ifdef USE_AS_STRNCPY
> + sub $16, %r8
> + jbe L(CopyFrom1To16BytesCase2OrCase3)
> +# endif
> + test %rdx, %rdx
> +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> + jnz L(CopyFrom1To16BytesUnalignedXmm2)
> +# else
> + jnz L(CopyFrom1To16Bytes)
> +# endif
> +
> + movaps 16(%rsi, %rcx), %xmm3
> + movdqu %xmm2, (%rdi, %rcx)
> + pcmpeqb %xmm3, %xmm0
> + pmovmskb %xmm0, %rdx
> + add $16, %rcx
> +# ifdef USE_AS_STRNCPY
> + sub $16, %r8
> + jbe L(CopyFrom1To16BytesCase2OrCase3)
> +# endif
> + test %rdx, %rdx
> +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> + jnz L(CopyFrom1To16BytesUnalignedXmm3)
> +# else
> + jnz L(CopyFrom1To16Bytes)
> +# endif
> +
> + movdqu %xmm3, (%rdi, %rcx)
> + mov %rsi, %rdx
> + lea 16(%rsi, %rcx), %rsi
> + and $-0x40, %rsi
> + sub %rsi, %rdx
> + sub %rdx, %rdi
> +# ifdef USE_AS_STRNCPY
> + lea 128(%r8, %rdx), %r8
> +# endif
> +L(Unaligned64Loop):
> + movaps (%rsi), %xmm2
> + movaps %xmm2, %xmm4
> + movaps 16(%rsi), %xmm5
> + movaps 32(%rsi), %xmm3
> + movaps %xmm3, %xmm6
> + movaps 48(%rsi), %xmm7
> + pminub %xmm5, %xmm2
> + pminub %xmm7, %xmm3
> + pminub %xmm2, %xmm3
> + pcmpeqb %xmm0, %xmm3
> + pmovmskb %xmm3, %rdx
> +# ifdef USE_AS_STRNCPY
> + sub $64, %r8
> + jbe L(UnalignedLeaveCase2OrCase3)
> +# endif
> + test %rdx, %rdx
> + jnz L(Unaligned64Leave)
> +
> +L(Unaligned64Loop_start):
> + add $64, %rdi
> + add $64, %rsi
> + movdqu %xmm4, -64(%rdi)
> + movaps (%rsi), %xmm2
> + movdqa %xmm2, %xmm4
> + movdqu %xmm5, -48(%rdi)
> + movaps 16(%rsi), %xmm5
> + pminub %xmm5, %xmm2
> + movaps 32(%rsi), %xmm3
> + movdqu %xmm6, -32(%rdi)
> + movaps %xmm3, %xmm6
> + movdqu %xmm7, -16(%rdi)
> + movaps 48(%rsi), %xmm7
> + pminub %xmm7, %xmm3
> + pminub %xmm2, %xmm3
> + pcmpeqb %xmm0, %xmm3
> + pmovmskb %xmm3, %rdx
> +# ifdef USE_AS_STRNCPY
> + sub $64, %r8
> + jbe L(UnalignedLeaveCase2OrCase3)
> +# endif
> + test %rdx, %rdx
> + jz L(Unaligned64Loop_start)
> +
> +L(Unaligned64Leave):
> + pxor %xmm1, %xmm1
> +
> + pcmpeqb %xmm4, %xmm0
> + pcmpeqb %xmm5, %xmm1
> + pmovmskb %xmm0, %rdx
> + pmovmskb %xmm1, %rcx
> + test %rdx, %rdx
> + jnz L(CopyFrom1To16BytesUnaligned_0)
> + test %rcx, %rcx
> + jnz L(CopyFrom1To16BytesUnaligned_16)
> +
> + pcmpeqb %xmm6, %xmm0
> + pcmpeqb %xmm7, %xmm1
> + pmovmskb %xmm0, %rdx
> + pmovmskb %xmm1, %rcx
> + test %rdx, %rdx
> + jnz L(CopyFrom1To16BytesUnaligned_32)
> +
> + bsf %rcx, %rdx
> + movdqu %xmm4, (%rdi)
> + movdqu %xmm5, 16(%rdi)
> + movdqu %xmm6, 32(%rdi)
> +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> +# ifdef USE_AS_STPCPY
> + lea 48(%rdi, %rdx), %rax
> +# endif
> + movdqu %xmm7, 48(%rdi)
> + add $15, %r8
> + sub %rdx, %r8
> + lea 49(%rdi, %rdx), %rdi
> + jmp L(StrncpyFillTailWithZero)
> +# else
> + add $48, %rsi
> + add $48, %rdi
> + BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4)
> +# endif
> +
> +/* If source address alignment == destination address alignment */
> +
> +L(SourceStringAlignmentLess32):
> + pxor %xmm0, %xmm0
> + movdqu (%rsi), %xmm1
> + movdqu 16(%rsi), %xmm2
> + pcmpeqb %xmm1, %xmm0
> + pmovmskb %xmm0, %rdx
> +
> +# ifdef USE_AS_STRNCPY
> +# if defined USE_AS_STPCPY || defined USE_AS_STRCAT
> + cmp $16, %r8
> +# else
> + cmp $17, %r8
> +# endif
> + jbe L(CopyFrom1To16BytesTail1Case2OrCase3)
> +# endif
> + test %rdx, %rdx
> + jnz L(CopyFrom1To16BytesTail1)
> +
> + pcmpeqb %xmm2, %xmm0
> + movdqu %xmm1, (%rdi)
> + pmovmskb %xmm0, %rdx
> +
> +# ifdef USE_AS_STRNCPY
> +# if defined USE_AS_STPCPY || defined USE_AS_STRCAT
> + cmp $32, %r8
> +# else
> + cmp $33, %r8
> +# endif
> + jbe L(CopyFrom1To32Bytes1Case2OrCase3)
> +# endif
> + test %rdx, %rdx
> + jnz L(CopyFrom1To32Bytes1)
> +
> + and $-16, %rsi
> + and $15, %rcx
> + jmp L(Unalign16Both)
> +
> +/*------End of main part with loops---------------------*/
> +
> +/* Case1 */
> +
> +# if (!defined USE_AS_STRNCPY) || (defined USE_AS_STRCAT)
> + .p2align 4
> +L(CopyFrom1To16Bytes):
> + add %rcx, %rdi
> + add %rcx, %rsi
> + bsf %rdx, %rdx
> + BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4)
> +# endif
> + .p2align 4
> +L(CopyFrom1To16BytesTail):
> + add %rcx, %rsi
> + bsf %rdx, %rdx
> + BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4)
> +
> + .p2align 4
> +L(CopyFrom1To32Bytes1):
> + add $16, %rsi
> + add $16, %rdi
> +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> + sub $16, %r8
> +# endif
> +L(CopyFrom1To16BytesTail1):
> + bsf %rdx, %rdx
> + BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4)
> +
> + .p2align 4
> +L(CopyFrom1To32Bytes):
> + bsf %rdx, %rdx
> + add %rcx, %rsi
> + add $16, %rdx
> + sub %rcx, %rdx
> + BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4)
> +
> + .p2align 4
> +L(CopyFrom1To16BytesUnaligned_0):
> + bsf %rdx, %rdx
> +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> +# ifdef USE_AS_STPCPY
> + lea (%rdi, %rdx), %rax
> +# endif
> + movdqu %xmm4, (%rdi)
> + add $63, %r8
> + sub %rdx, %r8
> + lea 1(%rdi, %rdx), %rdi
> + jmp L(StrncpyFillTailWithZero)
> +# else
> + BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4)
> +# endif
> +
> + .p2align 4
> +L(CopyFrom1To16BytesUnaligned_16):
> + bsf %rcx, %rdx
> + movdqu %xmm4, (%rdi)
> +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> +# ifdef USE_AS_STPCPY
> + lea 16(%rdi, %rdx), %rax
> +# endif
> + movdqu %xmm5, 16(%rdi)
> + add $47, %r8
> + sub %rdx, %r8
> + lea 17(%rdi, %rdx), %rdi
> + jmp L(StrncpyFillTailWithZero)
> +# else
> + add $16, %rsi
> + add $16, %rdi
> + BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4)
> +# endif
> +
> + .p2align 4
> +L(CopyFrom1To16BytesUnaligned_32):
> + bsf %rdx, %rdx
> + movdqu %xmm4, (%rdi)
> + movdqu %xmm5, 16(%rdi)
> +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> +# ifdef USE_AS_STPCPY
> + lea 32(%rdi, %rdx), %rax
> +# endif
> + movdqu %xmm6, 32(%rdi)
> + add $31, %r8
> + sub %rdx, %r8
> + lea 33(%rdi, %rdx), %rdi
> + jmp L(StrncpyFillTailWithZero)
> +# else
> + add $32, %rsi
> + add $32, %rdi
> + BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4)
> +# endif
> +
> +# ifdef USE_AS_STRNCPY
> +# ifndef USE_AS_STRCAT
> + .p2align 4
> +L(CopyFrom1To16BytesUnalignedXmm6):
> + movdqu %xmm6, (%rdi, %rcx)
> + jmp L(CopyFrom1To16BytesXmmExit)
> +
> + .p2align 4
> +L(CopyFrom1To16BytesUnalignedXmm5):
> + movdqu %xmm5, (%rdi, %rcx)
> + jmp L(CopyFrom1To16BytesXmmExit)
> +
> + .p2align 4
> +L(CopyFrom1To16BytesUnalignedXmm4):
> + movdqu %xmm4, (%rdi, %rcx)
> + jmp L(CopyFrom1To16BytesXmmExit)
> +
> + .p2align 4
> +L(CopyFrom1To16BytesUnalignedXmm3):
> + movdqu %xmm3, (%rdi, %rcx)
> + jmp L(CopyFrom1To16BytesXmmExit)
> +
> + .p2align 4
> +L(CopyFrom1To16BytesUnalignedXmm1):
> + movdqu %xmm1, (%rdi, %rcx)
> + jmp L(CopyFrom1To16BytesXmmExit)
> +# endif
> +
> + .p2align 4
> +L(CopyFrom1To16BytesExit):
> + BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4)
> +
> +/* Case2 */
> +
> + .p2align 4
> +L(CopyFrom1To16BytesCase2):
> + add $16, %r8
> + add %rcx, %rdi
> + add %rcx, %rsi
> + bsf %rdx, %rdx
> + cmp %r8, %rdx
> + jb L(CopyFrom1To16BytesExit)
> + BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4)
> +
> + .p2align 4
> +L(CopyFrom1To32BytesCase2):
> + add %rcx, %rsi
> + bsf %rdx, %rdx
> + add $16, %rdx
> + sub %rcx, %rdx
> + cmp %r8, %rdx
> + jb L(CopyFrom1To16BytesExit)
> + BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4)
> +
> +L(CopyFrom1To16BytesTailCase2):
> + add %rcx, %rsi
> + bsf %rdx, %rdx
> + cmp %r8, %rdx
> + jb L(CopyFrom1To16BytesExit)
> + BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4)
> +
> +L(CopyFrom1To16BytesTail1Case2):
> + bsf %rdx, %rdx
> + cmp %r8, %rdx
> + jb L(CopyFrom1To16BytesExit)
> + BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4)
> +
> +/* Case2 or Case3, Case3 */
> +
> + .p2align 4
> +L(CopyFrom1To16BytesCase2OrCase3):
> + test %rdx, %rdx
> + jnz L(CopyFrom1To16BytesCase2)
> +L(CopyFrom1To16BytesCase3):
> + add $16, %r8
> + add %rcx, %rdi
> + add %rcx, %rsi
> + BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4)
> +
> + .p2align 4
> +L(CopyFrom1To32BytesCase2OrCase3):
> + test %rdx, %rdx
> + jnz L(CopyFrom1To32BytesCase2)
> + add %rcx, %rsi
> + BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4)
> +
> + .p2align 4
> +L(CopyFrom1To16BytesTailCase2OrCase3):
> + test %rdx, %rdx
> + jnz L(CopyFrom1To16BytesTailCase2)
> + add %rcx, %rsi
> + BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4)
> +
> + .p2align 4
> +L(CopyFrom1To32Bytes1Case2OrCase3):
> + add $16, %rdi
> + add $16, %rsi
> + sub $16, %r8
> +L(CopyFrom1To16BytesTail1Case2OrCase3):
> + test %rdx, %rdx
> + jnz L(CopyFrom1To16BytesTail1Case2)
> + BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4)
> +
> +# endif
> +
> +/*------------End labels regarding with copying 1-16 bytes--and 1-32 bytes----*/
> +
> + .p2align 4
> +L(Exit1):
> + mov %dh, (%rdi)
> +# ifdef USE_AS_STPCPY
> + lea (%rdi), %rax
> +# endif
> +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> + sub $1, %r8
> + lea 1(%rdi), %rdi
> + jnz L(StrncpyFillTailWithZero)
> +# endif
> + ret
> +
> + .p2align 4
> +L(Exit2):
> + mov (%rsi), %dx
> + mov %dx, (%rdi)
> +# ifdef USE_AS_STPCPY
> + lea 1(%rdi), %rax
> +# endif
> +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> + sub $2, %r8
> + lea 2(%rdi), %rdi
> + jnz L(StrncpyFillTailWithZero)
> +# endif
> + ret
> +
> + .p2align 4
> +L(Exit3):
> + mov (%rsi), %cx
> + mov %cx, (%rdi)
> + mov %dh, 2(%rdi)
> +# ifdef USE_AS_STPCPY
> + lea 2(%rdi), %rax
> +# endif
> +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> + sub $3, %r8
> + lea 3(%rdi), %rdi
> + jnz L(StrncpyFillTailWithZero)
> +# endif
> + ret
> +
> + .p2align 4
> +L(Exit4):
> + mov (%rsi), %edx
> + mov %edx, (%rdi)
> +# ifdef USE_AS_STPCPY
> + lea 3(%rdi), %rax
> +# endif
> +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> + sub $4, %r8
> + lea 4(%rdi), %rdi
> + jnz L(StrncpyFillTailWithZero)
> +# endif
> + ret
> +
> + .p2align 4
> +L(Exit5):
> + mov (%rsi), %ecx
> + mov %dh, 4(%rdi)
> + mov %ecx, (%rdi)
> +# ifdef USE_AS_STPCPY
> + lea 4(%rdi), %rax
> +# endif
> +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> + sub $5, %r8
> + lea 5(%rdi), %rdi
> + jnz L(StrncpyFillTailWithZero)
> +# endif
> + ret
> +
> + .p2align 4
> +L(Exit6):
> + mov (%rsi), %ecx
> + mov 4(%rsi), %dx
> + mov %ecx, (%rdi)
> + mov %dx, 4(%rdi)
> +# ifdef USE_AS_STPCPY
> + lea 5(%rdi), %rax
> +# endif
> +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> + sub $6, %r8
> + lea 6(%rdi), %rdi
> + jnz L(StrncpyFillTailWithZero)
> +# endif
> + ret
> +
> + .p2align 4
> +L(Exit7):
> + mov (%rsi), %ecx
> + mov 3(%rsi), %edx
> + mov %ecx, (%rdi)
> + mov %edx, 3(%rdi)
> +# ifdef USE_AS_STPCPY
> + lea 6(%rdi), %rax
> +# endif
> +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> + sub $7, %r8
> + lea 7(%rdi), %rdi
> + jnz L(StrncpyFillTailWithZero)
> +# endif
> + ret
> +
> + .p2align 4
> +L(Exit8):
> + mov (%rsi), %rdx
> + mov %rdx, (%rdi)
> +# ifdef USE_AS_STPCPY
> + lea 7(%rdi), %rax
> +# endif
> +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> + sub $8, %r8
> + lea 8(%rdi), %rdi
> + jnz L(StrncpyFillTailWithZero)
> +# endif
> + ret
> +
> + .p2align 4
> +L(Exit9):
> + mov (%rsi), %rcx
> + mov %dh, 8(%rdi)
> + mov %rcx, (%rdi)
> +# ifdef USE_AS_STPCPY
> + lea 8(%rdi), %rax
> +# endif
> +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> + sub $9, %r8
> + lea 9(%rdi), %rdi
> + jnz L(StrncpyFillTailWithZero)
> +# endif
> + ret
> +
> + .p2align 4
> +L(Exit10):
> + mov (%rsi), %rcx
> + mov 8(%rsi), %dx
> + mov %rcx, (%rdi)
> + mov %dx, 8(%rdi)
> +# ifdef USE_AS_STPCPY
> + lea 9(%rdi), %rax
> +# endif
> +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> + sub $10, %r8
> + lea 10(%rdi), %rdi
> + jnz L(StrncpyFillTailWithZero)
> +# endif
> + ret
> +
> + .p2align 4
> +L(Exit11):
> + mov (%rsi), %rcx
> + mov 7(%rsi), %edx
> + mov %rcx, (%rdi)
> + mov %edx, 7(%rdi)
> +# ifdef USE_AS_STPCPY
> + lea 10(%rdi), %rax
> +# endif
> +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> + sub $11, %r8
> + lea 11(%rdi), %rdi
> + jnz L(StrncpyFillTailWithZero)
> +# endif
> + ret
> +
> + .p2align 4
> +L(Exit12):
> + mov (%rsi), %rcx
> + mov 8(%rsi), %edx
> + mov %rcx, (%rdi)
> + mov %edx, 8(%rdi)
> +# ifdef USE_AS_STPCPY
> + lea 11(%rdi), %rax
> +# endif
> +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> + sub $12, %r8
> + lea 12(%rdi), %rdi
> + jnz L(StrncpyFillTailWithZero)
> +# endif
> + ret
> +
> + .p2align 4
> +L(Exit13):
> + mov (%rsi), %rcx
> + mov 5(%rsi), %rdx
> + mov %rcx, (%rdi)
> + mov %rdx, 5(%rdi)
> +# ifdef USE_AS_STPCPY
> + lea 12(%rdi), %rax
> +# endif
> +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> + sub $13, %r8
> + lea 13(%rdi), %rdi
> + jnz L(StrncpyFillTailWithZero)
> +# endif
> + ret
> +
> + .p2align 4
> +L(Exit14):
> + mov (%rsi), %rcx
> + mov 6(%rsi), %rdx
> + mov %rcx, (%rdi)
> + mov %rdx, 6(%rdi)
> +# ifdef USE_AS_STPCPY
> + lea 13(%rdi), %rax
> +# endif
> +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> + sub $14, %r8
> + lea 14(%rdi), %rdi
> + jnz L(StrncpyFillTailWithZero)
> +# endif
> + ret
> +
> + .p2align 4
> +L(Exit15):
> + mov (%rsi), %rcx
> + mov 7(%rsi), %rdx
> + mov %rcx, (%rdi)
> + mov %rdx, 7(%rdi)
> +# ifdef USE_AS_STPCPY
> + lea 14(%rdi), %rax
> +# endif
> +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> + sub $15, %r8
> + lea 15(%rdi), %rdi
> + jnz L(StrncpyFillTailWithZero)
> +# endif
> + ret
> +
> + .p2align 4
> +L(Exit16):
> + movdqu (%rsi), %xmm0
> + movdqu %xmm0, (%rdi)
> +# ifdef USE_AS_STPCPY
> + lea 15(%rdi), %rax
> +# endif
> +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> + sub $16, %r8
> + lea 16(%rdi), %rdi
> + jnz L(StrncpyFillTailWithZero)
> +# endif
> + ret
> +
> + .p2align 4
> +L(Exit17):
> + movdqu (%rsi), %xmm0
> + movdqu %xmm0, (%rdi)
> + mov %dh, 16(%rdi)
> +# ifdef USE_AS_STPCPY
> + lea 16(%rdi), %rax
> +# endif
> +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> + sub $17, %r8
> + lea 17(%rdi), %rdi
> + jnz L(StrncpyFillTailWithZero)
> +# endif
> + ret
> +
> + .p2align 4
> +L(Exit18):
> + movdqu (%rsi), %xmm0
> + mov 16(%rsi), %cx
> + movdqu %xmm0, (%rdi)
> + mov %cx, 16(%rdi)
> +# ifdef USE_AS_STPCPY
> + lea 17(%rdi), %rax
> +# endif
> +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> + sub $18, %r8
> + lea 18(%rdi), %rdi
> + jnz L(StrncpyFillTailWithZero)
> +# endif
> + ret
> +
> + .p2align 4
> +L(Exit19):
> + movdqu (%rsi), %xmm0
> + mov 15(%rsi), %ecx
> + movdqu %xmm0, (%rdi)
> + mov %ecx, 15(%rdi)
> +# ifdef USE_AS_STPCPY
> + lea 18(%rdi), %rax
> +# endif
> +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> + sub $19, %r8
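[Editor's note: the Exit1..Exit32 blocks avoid byte loops by covering each length with at most two possibly overlapping stores; e.g. L(Exit19) writes bytes 0..15 with one 16-byte store and bytes 15..18 with one 4-byte store. A C sketch of that overlapping-store pattern (illustrative only, not from the patch; `copy19` is a hypothetical name):]

```c
#include <string.h>

/* Exit19-style tail copy: one 16-byte chunk plus one 4-byte chunk
   overlapping it by a single byte (offsets 0..15 and 15..18), so a
   19-byte copy needs no per-byte loop.  A compiler lowers these
   fixed-size memcpy calls to plain loads and stores.  */
static void copy19(char *dst, const char *src)
{
    memcpy(dst, src, 16);            /* movdqu (%rsi) -> (%rdi)   */
    memcpy(dst + 15, src + 15, 4);   /* mov 15(%rsi) -> 15(%rdi)  */
}
```

The overlap is harmless because both stores copy the same source bytes to the same destination offsets.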
> + lea 19(%rdi), %rdi
> + jnz L(StrncpyFillTailWithZero)
> +# endif
> + ret
> +
> + .p2align 4
> +L(Exit20):
> + movdqu (%rsi), %xmm0
> + mov 16(%rsi), %ecx
> + movdqu %xmm0, (%rdi)
> + mov %ecx, 16(%rdi)
> +# ifdef USE_AS_STPCPY
> + lea 19(%rdi), %rax
> +# endif
> +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> + sub $20, %r8
> + lea 20(%rdi), %rdi
> + jnz L(StrncpyFillTailWithZero)
> +# endif
> + ret
> +
> + .p2align 4
> +L(Exit21):
> + movdqu (%rsi), %xmm0
> + mov 16(%rsi), %ecx
> + movdqu %xmm0, (%rdi)
> + mov %ecx, 16(%rdi)
> + mov %dh, 20(%rdi)
> +# ifdef USE_AS_STPCPY
> + lea 20(%rdi), %rax
> +# endif
> +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> + sub $21, %r8
> + lea 21(%rdi), %rdi
> + jnz L(StrncpyFillTailWithZero)
> +# endif
> + ret
> +
> + .p2align 4
> +L(Exit22):
> + movdqu (%rsi), %xmm0
> + mov 14(%rsi), %rcx
> + movdqu %xmm0, (%rdi)
> + mov %rcx, 14(%rdi)
> +# ifdef USE_AS_STPCPY
> + lea 21(%rdi), %rax
> +# endif
> +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> + sub $22, %r8
> + lea 22(%rdi), %rdi
> + jnz L(StrncpyFillTailWithZero)
> +# endif
> + ret
> +
> + .p2align 4
> +L(Exit23):
> + movdqu (%rsi), %xmm0
> + mov 15(%rsi), %rcx
> + movdqu %xmm0, (%rdi)
> + mov %rcx, 15(%rdi)
> +# ifdef USE_AS_STPCPY
> + lea 22(%rdi), %rax
> +# endif
> +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> + sub $23, %r8
> + lea 23(%rdi), %rdi
> + jnz L(StrncpyFillTailWithZero)
> +# endif
> + ret
> +
> + .p2align 4
> +L(Exit24):
> + movdqu (%rsi), %xmm0
> + mov 16(%rsi), %rcx
> + movdqu %xmm0, (%rdi)
> + mov %rcx, 16(%rdi)
> +# ifdef USE_AS_STPCPY
> + lea 23(%rdi), %rax
> +# endif
> +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> + sub $24, %r8
> + lea 24(%rdi), %rdi
> + jnz L(StrncpyFillTailWithZero)
> +# endif
> + ret
> +
> + .p2align 4
> +L(Exit25):
> + movdqu (%rsi), %xmm0
> + mov 16(%rsi), %rcx
> + movdqu %xmm0, (%rdi)
> + mov %rcx, 16(%rdi)
> + mov %dh, 24(%rdi)
> +# ifdef USE_AS_STPCPY
> + lea 24(%rdi), %rax
> +# endif
> +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> + sub $25, %r8
> + lea 25(%rdi), %rdi
> + jnz L(StrncpyFillTailWithZero)
> +# endif
> + ret
> +
> + .p2align 4
> +L(Exit26):
> + movdqu (%rsi), %xmm0
> + mov 16(%rsi), %rdx
> + mov 24(%rsi), %cx
> + movdqu %xmm0, (%rdi)
> + mov %rdx, 16(%rdi)
> + mov %cx, 24(%rdi)
> +# ifdef USE_AS_STPCPY
> + lea 25(%rdi), %rax
> +# endif
> +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> + sub $26, %r8
> + lea 26(%rdi), %rdi
> + jnz L(StrncpyFillTailWithZero)
> +# endif
> + ret
> +
> + .p2align 4
> +L(Exit27):
> + movdqu (%rsi), %xmm0
> + mov 16(%rsi), %rdx
> + mov 23(%rsi), %ecx
> + movdqu %xmm0, (%rdi)
> + mov %rdx, 16(%rdi)
> + mov %ecx, 23(%rdi)
> +# ifdef USE_AS_STPCPY
> + lea 26(%rdi), %rax
> +# endif
> +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> + sub $27, %r8
> + lea 27(%rdi), %rdi
> + jnz L(StrncpyFillTailWithZero)
> +# endif
> + ret
> +
> + .p2align 4
> +L(Exit28):
> + movdqu (%rsi), %xmm0
> + mov 16(%rsi), %rdx
> + mov 24(%rsi), %ecx
> + movdqu %xmm0, (%rdi)
> + mov %rdx, 16(%rdi)
> + mov %ecx, 24(%rdi)
> +# ifdef USE_AS_STPCPY
> + lea 27(%rdi), %rax
> +# endif
> +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> + sub $28, %r8
> + lea 28(%rdi), %rdi
> + jnz L(StrncpyFillTailWithZero)
> +# endif
> + ret
> +
> + .p2align 4
> +L(Exit29):
> + movdqu (%rsi), %xmm0
> + movdqu 13(%rsi), %xmm2
> + movdqu %xmm0, (%rdi)
> + movdqu %xmm2, 13(%rdi)
> +# ifdef USE_AS_STPCPY
> + lea 28(%rdi), %rax
> +# endif
> +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> + sub $29, %r8
> + lea 29(%rdi), %rdi
> + jnz L(StrncpyFillTailWithZero)
> +# endif
> + ret
> +
> + .p2align 4
> +L(Exit30):
> + movdqu (%rsi), %xmm0
> + movdqu 14(%rsi), %xmm2
> + movdqu %xmm0, (%rdi)
> + movdqu %xmm2, 14(%rdi)
> +# ifdef USE_AS_STPCPY
> + lea 29(%rdi), %rax
> +# endif
> +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> + sub $30, %r8
> + lea 30(%rdi), %rdi
> + jnz L(StrncpyFillTailWithZero)
> +# endif
> + ret
> +
> + .p2align 4
> +L(Exit31):
> + movdqu (%rsi), %xmm0
> + movdqu 15(%rsi), %xmm2
> + movdqu %xmm0, (%rdi)
> + movdqu %xmm2, 15(%rdi)
> +# ifdef USE_AS_STPCPY
> + lea 30(%rdi), %rax
> +# endif
> +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> + sub $31, %r8
> + lea 31(%rdi), %rdi
> + jnz L(StrncpyFillTailWithZero)
> +# endif
> + ret
> +
> + .p2align 4
> +L(Exit32):
> + movdqu (%rsi), %xmm0
> + movdqu 16(%rsi), %xmm2
> + movdqu %xmm0, (%rdi)
> + movdqu %xmm2, 16(%rdi)
> +# ifdef USE_AS_STPCPY
> + lea 31(%rdi), %rax
> +# endif
> +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> + sub $32, %r8
> + lea 32(%rdi), %rdi
> + jnz L(StrncpyFillTailWithZero)
> +# endif
> + ret
> +
> +# ifdef USE_AS_STRNCPY
> +
> + .p2align 4
> +L(StrncpyExit0):
> +# ifdef USE_AS_STPCPY
> + mov %rdi, %rax
> +# endif
> +# ifdef USE_AS_STRCAT
> + xor %ch, %ch
> + movb %ch, (%rdi)
> +# endif
> + ret
> +
> + .p2align 4
> +L(StrncpyExit1):
> + mov (%rsi), %dl
> + mov %dl, (%rdi)
> +# ifdef USE_AS_STPCPY
> + lea 1(%rdi), %rax
> +# endif
> +# ifdef USE_AS_STRCAT
> + xor %ch, %ch
> + movb %ch, 1(%rdi)
> +# endif
> + ret
> +
> + .p2align 4
> +L(StrncpyExit2):
> + mov (%rsi), %dx
> + mov %dx, (%rdi)
> +# ifdef USE_AS_STPCPY
> + lea 2(%rdi), %rax
> +# endif
> +# ifdef USE_AS_STRCAT
> + xor %ch, %ch
> + movb %ch, 2(%rdi)
> +# endif
> + ret
> +
> + .p2align 4
> +L(StrncpyExit3):
> + mov (%rsi), %cx
> + mov 2(%rsi), %dl
> + mov %cx, (%rdi)
> + mov %dl, 2(%rdi)
> +# ifdef USE_AS_STPCPY
> + lea 3(%rdi), %rax
> +# endif
> +# ifdef USE_AS_STRCAT
> + xor %ch, %ch
> + movb %ch, 3(%rdi)
> +# endif
> + ret
> +
> + .p2align 4
> +L(StrncpyExit4):
> + mov (%rsi), %edx
> + mov %edx, (%rdi)
> +# ifdef USE_AS_STPCPY
> + lea 4(%rdi), %rax
> +# endif
> +# ifdef USE_AS_STRCAT
> + xor %ch, %ch
> + movb %ch, 4(%rdi)
> +# endif
> + ret
> +
> + .p2align 4
> +L(StrncpyExit5):
> + mov (%rsi), %ecx
> + mov 4(%rsi), %dl
> + mov %ecx, (%rdi)
> + mov %dl, 4(%rdi)
> +# ifdef USE_AS_STPCPY
> + lea 5(%rdi), %rax
> +# endif
> +# ifdef USE_AS_STRCAT
> + xor %ch, %ch
> + movb %ch, 5(%rdi)
> +# endif
> + ret
> +
> + .p2align 4
> +L(StrncpyExit6):
> + mov (%rsi), %ecx
> + mov 4(%rsi), %dx
> + mov %ecx, (%rdi)
> + mov %dx, 4(%rdi)
> +# ifdef USE_AS_STPCPY
> + lea 6(%rdi), %rax
> +# endif
> +# ifdef USE_AS_STRCAT
> + xor %ch, %ch
> + movb %ch, 6(%rdi)
> +# endif
> + ret
> +
> + .p2align 4
> +L(StrncpyExit7):
> + mov (%rsi), %ecx
> + mov 3(%rsi), %edx
> + mov %ecx, (%rdi)
> + mov %edx, 3(%rdi)
> +# ifdef USE_AS_STPCPY
> + lea 7(%rdi), %rax
> +# endif
> +# ifdef USE_AS_STRCAT
> + xor %ch, %ch
> + movb %ch, 7(%rdi)
> +# endif
> + ret
> +
> + .p2align 4
> +L(StrncpyExit8):
> + mov (%rsi), %rdx
> + mov %rdx, (%rdi)
> +# ifdef USE_AS_STPCPY
> + lea 8(%rdi), %rax
> +# endif
> +# ifdef USE_AS_STRCAT
> + xor %ch, %ch
> + movb %ch, 8(%rdi)
> +# endif
> + ret
> +
> + .p2align 4
> +L(StrncpyExit9):
> + mov (%rsi), %rcx
> + mov 8(%rsi), %dl
> + mov %rcx, (%rdi)
> + mov %dl, 8(%rdi)
> +# ifdef USE_AS_STPCPY
> + lea 9(%rdi), %rax
> +# endif
> +# ifdef USE_AS_STRCAT
> + xor %ch, %ch
> + movb %ch, 9(%rdi)
> +# endif
> + ret
> +
> + .p2align 4
> +L(StrncpyExit10):
> + mov (%rsi), %rcx
> + mov 8(%rsi), %dx
> + mov %rcx, (%rdi)
> + mov %dx, 8(%rdi)
> +# ifdef USE_AS_STPCPY
> + lea 10(%rdi), %rax
> +# endif
> +# ifdef USE_AS_STRCAT
> + xor %ch, %ch
> + movb %ch, 10(%rdi)
> +# endif
> + ret
> +
> + .p2align 4
> +L(StrncpyExit11):
> + mov (%rsi), %rcx
> + mov 7(%rsi), %edx
> + mov %rcx, (%rdi)
> + mov %edx, 7(%rdi)
> +# ifdef USE_AS_STPCPY
> + lea 11(%rdi), %rax
> +# endif
> +# ifdef USE_AS_STRCAT
> + xor %ch, %ch
> + movb %ch, 11(%rdi)
> +# endif
> + ret
> +
> + .p2align 4
> +L(StrncpyExit12):
> + mov (%rsi), %rcx
> + mov 8(%rsi), %edx
> + mov %rcx, (%rdi)
> + mov %edx, 8(%rdi)
> +# ifdef USE_AS_STPCPY
> + lea 12(%rdi), %rax
> +# endif
> +# ifdef USE_AS_STRCAT
> + xor %ch, %ch
> + movb %ch, 12(%rdi)
> +# endif
> + ret
> +
> + .p2align 4
> +L(StrncpyExit13):
> + mov (%rsi), %rcx
> + mov 5(%rsi), %rdx
> + mov %rcx, (%rdi)
> + mov %rdx, 5(%rdi)
> +# ifdef USE_AS_STPCPY
> + lea 13(%rdi), %rax
> +# endif
> +# ifdef USE_AS_STRCAT
> + xor %ch, %ch
> + movb %ch, 13(%rdi)
> +# endif
> + ret
> +
> + .p2align 4
> +L(StrncpyExit14):
> + mov (%rsi), %rcx
> + mov 6(%rsi), %rdx
> + mov %rcx, (%rdi)
> + mov %rdx, 6(%rdi)
> +# ifdef USE_AS_STPCPY
> + lea 14(%rdi), %rax
> +# endif
> +# ifdef USE_AS_STRCAT
> + xor %ch, %ch
> + movb %ch, 14(%rdi)
> +# endif
> + ret
> +
> + .p2align 4
> +L(StrncpyExit15):
> + mov (%rsi), %rcx
> + mov 7(%rsi), %rdx
> + mov %rcx, (%rdi)
> + mov %rdx, 7(%rdi)
> +# ifdef USE_AS_STPCPY
> + lea 15(%rdi), %rax
> +# endif
> +# ifdef USE_AS_STRCAT
> + xor %ch, %ch
> + movb %ch, 15(%rdi)
> +# endif
> + ret
> +
> + .p2align 4
> +L(StrncpyExit16):
> + movdqu (%rsi), %xmm0
> + movdqu %xmm0, (%rdi)
> +# ifdef USE_AS_STPCPY
> + lea 16(%rdi), %rax
> +# endif
> +# ifdef USE_AS_STRCAT
> + xor %ch, %ch
> + movb %ch, 16(%rdi)
> +# endif
> + ret
> +
> + .p2align 4
> +L(StrncpyExit17):
> + movdqu (%rsi), %xmm0
> + mov 16(%rsi), %cl
> + movdqu %xmm0, (%rdi)
> + mov %cl, 16(%rdi)
> +# ifdef USE_AS_STPCPY
> + lea 17(%rdi), %rax
> +# endif
> +# ifdef USE_AS_STRCAT
> + xor %ch, %ch
> + movb %ch, 17(%rdi)
> +# endif
> + ret
> +
> + .p2align 4
> +L(StrncpyExit18):
> + movdqu (%rsi), %xmm0
> + mov 16(%rsi), %cx
> + movdqu %xmm0, (%rdi)
> + mov %cx, 16(%rdi)
> +# ifdef USE_AS_STPCPY
> + lea 18(%rdi), %rax
> +# endif
> +# ifdef USE_AS_STRCAT
> + xor %ch, %ch
> + movb %ch, 18(%rdi)
> +# endif
> + ret
> +
> + .p2align 4
> +L(StrncpyExit19):
> + movdqu (%rsi), %xmm0
> + mov 15(%rsi), %ecx
> + movdqu %xmm0, (%rdi)
> + mov %ecx, 15(%rdi)
> +# ifdef USE_AS_STPCPY
> + lea 19(%rdi), %rax
> +# endif
> +# ifdef USE_AS_STRCAT
> + xor %ch, %ch
> + movb %ch, 19(%rdi)
> +# endif
> + ret
> +
> + .p2align 4
> +L(StrncpyExit20):
> + movdqu (%rsi), %xmm0
> + mov 16(%rsi), %ecx
> + movdqu %xmm0, (%rdi)
> + mov %ecx, 16(%rdi)
> +# ifdef USE_AS_STPCPY
> + lea 20(%rdi), %rax
> +# endif
> +# ifdef USE_AS_STRCAT
> + xor %ch, %ch
> + movb %ch, 20(%rdi)
> +# endif
> + ret
> +
> + .p2align 4
> +L(StrncpyExit21):
> + movdqu (%rsi), %xmm0
> + mov 16(%rsi), %ecx
> + mov 20(%rsi), %dl
> + movdqu %xmm0, (%rdi)
> + mov %ecx, 16(%rdi)
> + mov %dl, 20(%rdi)
> +# ifdef USE_AS_STPCPY
> + lea 21(%rdi), %rax
> +# endif
> +# ifdef USE_AS_STRCAT
> + xor %ch, %ch
> + movb %ch, 21(%rdi)
> +# endif
> + ret
> +
> + .p2align 4
> +L(StrncpyExit22):
> + movdqu (%rsi), %xmm0
> + mov 14(%rsi), %rcx
> + movdqu %xmm0, (%rdi)
> + mov %rcx, 14(%rdi)
> +# ifdef USE_AS_STPCPY
> + lea 22(%rdi), %rax
> +# endif
> +# ifdef USE_AS_STRCAT
> + xor %ch, %ch
> + movb %ch, 22(%rdi)
> +# endif
> + ret
> +
> + .p2align 4
> +L(StrncpyExit23):
> + movdqu (%rsi), %xmm0
> + mov 15(%rsi), %rcx
> + movdqu %xmm0, (%rdi)
> + mov %rcx, 15(%rdi)
> +# ifdef USE_AS_STPCPY
> + lea 23(%rdi), %rax
> +# endif
> +# ifdef USE_AS_STRCAT
> + xor %ch, %ch
> + movb %ch, 23(%rdi)
> +# endif
> + ret
> +
> + .p2align 4
> +L(StrncpyExit24):
> + movdqu (%rsi), %xmm0
> + mov 16(%rsi), %rcx
> + movdqu %xmm0, (%rdi)
> + mov %rcx, 16(%rdi)
> +# ifdef USE_AS_STPCPY
> + lea 24(%rdi), %rax
> +# endif
> +# ifdef USE_AS_STRCAT
> + xor %ch, %ch
> + movb %ch, 24(%rdi)
> +# endif
> + ret
> +
> + .p2align 4
> +L(StrncpyExit25):
> + movdqu (%rsi), %xmm0
> + mov 16(%rsi), %rdx
> + mov 24(%rsi), %cl
> + movdqu %xmm0, (%rdi)
> + mov %rdx, 16(%rdi)
> + mov %cl, 24(%rdi)
> +# ifdef USE_AS_STPCPY
> + lea 25(%rdi), %rax
> +# endif
> +# ifdef USE_AS_STRCAT
> + xor %ch, %ch
> + movb %ch, 25(%rdi)
> +# endif
> + ret
> +
> + .p2align 4
> +L(StrncpyExit26):
> + movdqu (%rsi), %xmm0
> + mov 16(%rsi), %rdx
> + mov 24(%rsi), %cx
> + movdqu %xmm0, (%rdi)
> + mov %rdx, 16(%rdi)
> + mov %cx, 24(%rdi)
> +# ifdef USE_AS_STPCPY
> + lea 26(%rdi), %rax
> +# endif
> +# ifdef USE_AS_STRCAT
> + xor %ch, %ch
> + movb %ch, 26(%rdi)
> +# endif
> + ret
> +
> + .p2align 4
> +L(StrncpyExit27):
> + movdqu (%rsi), %xmm0
> + mov 16(%rsi), %rdx
> + mov 23(%rsi), %ecx
> + movdqu %xmm0, (%rdi)
> + mov %rdx, 16(%rdi)
> + mov %ecx, 23(%rdi)
> +# ifdef USE_AS_STPCPY
> + lea 27(%rdi), %rax
> +# endif
> +# ifdef USE_AS_STRCAT
> + xor %ch, %ch
> + movb %ch, 27(%rdi)
> +# endif
> + ret
> +
> + .p2align 4
> +L(StrncpyExit28):
> + movdqu (%rsi), %xmm0
> + mov 16(%rsi), %rdx
> + mov 24(%rsi), %ecx
> + movdqu %xmm0, (%rdi)
> + mov %rdx, 16(%rdi)
> + mov %ecx, 24(%rdi)
> +# ifdef USE_AS_STPCPY
> + lea 28(%rdi), %rax
> +# endif
> +# ifdef USE_AS_STRCAT
> + xor %ch, %ch
> + movb %ch, 28(%rdi)
> +# endif
> + ret
> +
> + .p2align 4
> +L(StrncpyExit29):
> + movdqu (%rsi), %xmm0
> + movdqu 13(%rsi), %xmm2
> + movdqu %xmm0, (%rdi)
> + movdqu %xmm2, 13(%rdi)
> +# ifdef USE_AS_STPCPY
> + lea 29(%rdi), %rax
> +# endif
> +# ifdef USE_AS_STRCAT
> + xor %ch, %ch
> + movb %ch, 29(%rdi)
> +# endif
> + ret
> +
> + .p2align 4
> +L(StrncpyExit30):
> + movdqu (%rsi), %xmm0
> + movdqu 14(%rsi), %xmm2
> + movdqu %xmm0, (%rdi)
> + movdqu %xmm2, 14(%rdi)
> +# ifdef USE_AS_STPCPY
> + lea 30(%rdi), %rax
> +# endif
> +# ifdef USE_AS_STRCAT
> + xor %ch, %ch
> + movb %ch, 30(%rdi)
> +# endif
> + ret
> +
> + .p2align 4
> +L(StrncpyExit31):
> + movdqu (%rsi), %xmm0
> + movdqu 15(%rsi), %xmm2
> + movdqu %xmm0, (%rdi)
> + movdqu %xmm2, 15(%rdi)
> +# ifdef USE_AS_STPCPY
> + lea 31(%rdi), %rax
> +# endif
> +# ifdef USE_AS_STRCAT
> + xor %ch, %ch
> + movb %ch, 31(%rdi)
> +# endif
> + ret
> +
> + .p2align 4
> +L(StrncpyExit32):
> + movdqu (%rsi), %xmm0
> + movdqu 16(%rsi), %xmm2
> + movdqu %xmm0, (%rdi)
> + movdqu %xmm2, 16(%rdi)
> +# ifdef USE_AS_STPCPY
> + lea 32(%rdi), %rax
> +# endif
> +# ifdef USE_AS_STRCAT
> + xor %ch, %ch
> + movb %ch, 32(%rdi)
> +# endif
> + ret
> +
> + .p2align 4
> +L(StrncpyExit33):
> + movdqu (%rsi), %xmm0
> + movdqu 16(%rsi), %xmm2
> + mov 32(%rsi), %cl
> + movdqu %xmm0, (%rdi)
> + movdqu %xmm2, 16(%rdi)
> + mov %cl, 32(%rdi)
> +# ifdef USE_AS_STRCAT
> + xor %ch, %ch
> + movb %ch, 33(%rdi)
> +# endif
> + ret
> +
> +# ifndef USE_AS_STRCAT
> +
> + .p2align 4
> +L(Fill0):
> + ret
> +
> + .p2align 4
> +L(Fill1):
> + mov %dl, (%rdi)
> + ret
> +
> + .p2align 4
> +L(Fill2):
> + mov %dx, (%rdi)
> + ret
> +
> + .p2align 4
> +L(Fill3):
> + mov %edx, -1(%rdi)
> + ret
> +
> + .p2align 4
> +L(Fill4):
> + mov %edx, (%rdi)
> + ret
> +
> + .p2align 4
> +L(Fill5):
> + mov %edx, (%rdi)
> + mov %dl, 4(%rdi)
> + ret
> +
> + .p2align 4
> +L(Fill6):
> + mov %edx, (%rdi)
> + mov %dx, 4(%rdi)
> + ret
> +
> + .p2align 4
> +L(Fill7):
> + mov %rdx, -1(%rdi)
> + ret
> +
> + .p2align 4
> +L(Fill8):
> + mov %rdx, (%rdi)
> + ret
> +
> + .p2align 4
> +L(Fill9):
> + mov %rdx, (%rdi)
> + mov %dl, 8(%rdi)
> + ret
> +
> + .p2align 4
> +L(Fill10):
> + mov %rdx, (%rdi)
> + mov %dx, 8(%rdi)
> + ret
> +
> + .p2align 4
> +L(Fill11):
> + mov %rdx, (%rdi)
> + mov %edx, 7(%rdi)
> + ret
> +
> + .p2align 4
> +L(Fill12):
> + mov %rdx, (%rdi)
> + mov %edx, 8(%rdi)
> + ret
> +
> + .p2align 4
> +L(Fill13):
> + mov %rdx, (%rdi)
> + mov %rdx, 5(%rdi)
> + ret
> +
> + .p2align 4
> +L(Fill14):
> + mov %rdx, (%rdi)
> + mov %rdx, 6(%rdi)
> + ret
> +
> + .p2align 4
> +L(Fill15):
> + movdqu %xmm0, -1(%rdi)
> + ret
> +
> + .p2align 4
> +L(Fill16):
> + movdqu %xmm0, (%rdi)
> + ret
> +
> + .p2align 4
> +L(CopyFrom1To16BytesUnalignedXmm2):
> + movdqu %xmm2, (%rdi, %rcx)
> +
> + .p2align 4
> +L(CopyFrom1To16BytesXmmExit):
> + bsf %rdx, %rdx
> + add $15, %r8
> + add %rcx, %rdi
> +# ifdef USE_AS_STPCPY
> + lea (%rdi, %rdx), %rax
> +# endif
> + sub %rdx, %r8
> + lea 1(%rdi, %rdx), %rdi
> +
> + .p2align 4
> +L(StrncpyFillTailWithZero):
> + pxor %xmm0, %xmm0
> + xor %rdx, %rdx
> + sub $16, %r8
> + jbe L(StrncpyFillExit)
> +
> + movdqu %xmm0, (%rdi)
> + add $16, %rdi
> +
> + mov %rdi, %rsi
> + and $0xf, %rsi
> + sub %rsi, %rdi
> + add %rsi, %r8
> + sub $64, %r8
> + jb L(StrncpyFillLess64)
> +
> +L(StrncpyFillLoopMovdqa):
> + movdqa %xmm0, (%rdi)
> + movdqa %xmm0, 16(%rdi)
> + movdqa %xmm0, 32(%rdi)
> + movdqa %xmm0, 48(%rdi)
> + add $64, %rdi
> + sub $64, %r8
> + jae L(StrncpyFillLoopMovdqa)
> +
> +L(StrncpyFillLess64):
> + add $32, %r8
> + jl L(StrncpyFillLess32)
> + movdqa %xmm0, (%rdi)
> + movdqa %xmm0, 16(%rdi)
> + add $32, %rdi
> + sub $16, %r8
> + jl L(StrncpyFillExit)
> + movdqa %xmm0, (%rdi)
> + add $16, %rdi
> + BRANCH_TO_JMPTBL_ENTRY (L(FillTable), %r8, 4)
> +
> +L(StrncpyFillLess32):
> + add $16, %r8
> + jl L(StrncpyFillExit)
> + movdqa %xmm0, (%rdi)
> + add $16, %rdi
> + BRANCH_TO_JMPTBL_ENTRY (L(FillTable), %r8, 4)
> +
> +L(StrncpyFillExit):
> + add $16, %r8
> + BRANCH_TO_JMPTBL_ENTRY (L(FillTable), %r8, 4)
> +
> +/* end of ifndef USE_AS_STRCAT */
> +# endif
> +
> + .p2align 4
> +L(UnalignedLeaveCase2OrCase3):
> + test %rdx, %rdx
> + jnz L(Unaligned64LeaveCase2)
> +L(Unaligned64LeaveCase3):
> + lea 64(%r8), %rcx
> + and $-16, %rcx
> + add $48, %r8
> + jl L(CopyFrom1To16BytesCase3)
> + movdqu %xmm4, (%rdi)
> + sub $16, %r8
> + jb L(CopyFrom1To16BytesCase3)
> + movdqu %xmm5, 16(%rdi)
> + sub $16, %r8
> + jb L(CopyFrom1To16BytesCase3)
> + movdqu %xmm6, 32(%rdi)
> + sub $16, %r8
> + jb L(CopyFrom1To16BytesCase3)
> + movdqu %xmm7, 48(%rdi)
> +# ifdef USE_AS_STPCPY
> + lea 64(%rdi), %rax
> +# endif
> +# ifdef USE_AS_STRCAT
> + xor %ch, %ch
> + movb %ch, 64(%rdi)
> +# endif
> + ret
> +
> + .p2align 4
> +L(Unaligned64LeaveCase2):
> + xor %rcx, %rcx
> + pcmpeqb %xmm4, %xmm0
> + pmovmskb %xmm0, %rdx
> + add $48, %r8
> + jle L(CopyFrom1To16BytesCase2OrCase3)
> + test %rdx, %rdx
> +# ifndef USE_AS_STRCAT
> + jnz L(CopyFrom1To16BytesUnalignedXmm4)
> +# else
> + jnz L(CopyFrom1To16Bytes)
> +# endif
> + pcmpeqb %xmm5, %xmm0
> + pmovmskb %xmm0, %rdx
> + movdqu %xmm4, (%rdi)
> + add $16, %rcx
> + sub $16, %r8
> + jbe L(CopyFrom1To16BytesCase2OrCase3)
> + test %rdx, %rdx
> +# ifndef USE_AS_STRCAT
> + jnz L(CopyFrom1To16BytesUnalignedXmm5)
> +# else
> + jnz L(CopyFrom1To16Bytes)
> +# endif
> +
> + pcmpeqb %xmm6, %xmm0
> + pmovmskb %xmm0, %rdx
> + movdqu %xmm5, 16(%rdi)
> + add $16, %rcx
> + sub $16, %r8
> + jbe L(CopyFrom1To16BytesCase2OrCase3)
> + test %rdx, %rdx
> +# ifndef USE_AS_STRCAT
> + jnz L(CopyFrom1To16BytesUnalignedXmm6)
> +# else
> + jnz L(CopyFrom1To16Bytes)
> +# endif
> +
> + pcmpeqb %xmm7, %xmm0
> + pmovmskb %xmm0, %rdx
> + movdqu %xmm6, 32(%rdi)
> + lea 16(%rdi, %rcx), %rdi
> + lea 16(%rsi, %rcx), %rsi
> + bsf %rdx, %rdx
> + cmp %r8, %rdx
> + jb L(CopyFrom1To16BytesExit)
> + BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4)
> +
> + .p2align 4
> +L(ExitZero):
> +# ifndef USE_AS_STRCAT
> + mov %rdi, %rax
> +# endif
> + ret
> +
> +# endif
> +
> +# ifndef USE_AS_STRCAT
> +END (STRCPY)
> +# else
> +END (STRCAT)
> +# endif
> + .p2align 4
> + .section .rodata
> +L(ExitTable):
> + .int JMPTBL(L(Exit1), L(ExitTable))
> + .int JMPTBL(L(Exit2), L(ExitTable))
> + .int JMPTBL(L(Exit3), L(ExitTable))
> + .int JMPTBL(L(Exit4), L(ExitTable))
> + .int JMPTBL(L(Exit5), L(ExitTable))
> + .int JMPTBL(L(Exit6), L(ExitTable))
> + .int JMPTBL(L(Exit7), L(ExitTable))
> + .int JMPTBL(L(Exit8), L(ExitTable))
> + .int JMPTBL(L(Exit9), L(ExitTable))
> + .int JMPTBL(L(Exit10), L(ExitTable))
> + .int JMPTBL(L(Exit11), L(ExitTable))
> + .int JMPTBL(L(Exit12), L(ExitTable))
> + .int JMPTBL(L(Exit13), L(ExitTable))
> + .int JMPTBL(L(Exit14), L(ExitTable))
> + .int JMPTBL(L(Exit15), L(ExitTable))
> + .int JMPTBL(L(Exit16), L(ExitTable))
> + .int JMPTBL(L(Exit17), L(ExitTable))
> + .int JMPTBL(L(Exit18), L(ExitTable))
> + .int JMPTBL(L(Exit19), L(ExitTable))
> + .int JMPTBL(L(Exit20), L(ExitTable))
> + .int JMPTBL(L(Exit21), L(ExitTable))
> + .int JMPTBL(L(Exit22), L(ExitTable))
> + .int JMPTBL(L(Exit23), L(ExitTable))
> + .int JMPTBL(L(Exit24), L(ExitTable))
> + .int JMPTBL(L(Exit25), L(ExitTable))
> + .int JMPTBL(L(Exit26), L(ExitTable))
> + .int JMPTBL(L(Exit27), L(ExitTable))
> + .int JMPTBL(L(Exit28), L(ExitTable))
> + .int JMPTBL(L(Exit29), L(ExitTable))
> + .int JMPTBL(L(Exit30), L(ExitTable))
> + .int JMPTBL(L(Exit31), L(ExitTable))
> + .int JMPTBL(L(Exit32), L(ExitTable))
> +# ifdef USE_AS_STRNCPY
> +L(ExitStrncpyTable):
> + .int JMPTBL(L(StrncpyExit0), L(ExitStrncpyTable))
> + .int JMPTBL(L(StrncpyExit1), L(ExitStrncpyTable))
> + .int JMPTBL(L(StrncpyExit2), L(ExitStrncpyTable))
> + .int JMPTBL(L(StrncpyExit3), L(ExitStrncpyTable))
> + .int JMPTBL(L(StrncpyExit4), L(ExitStrncpyTable))
> + .int JMPTBL(L(StrncpyExit5), L(ExitStrncpyTable))
> + .int JMPTBL(L(StrncpyExit6), L(ExitStrncpyTable))
> + .int JMPTBL(L(StrncpyExit7), L(ExitStrncpyTable))
> + .int JMPTBL(L(StrncpyExit8), L(ExitStrncpyTable))
> + .int JMPTBL(L(StrncpyExit9), L(ExitStrncpyTable))
> + .int JMPTBL(L(StrncpyExit10), L(ExitStrncpyTable))
> + .int JMPTBL(L(StrncpyExit11), L(ExitStrncpyTable))
> + .int JMPTBL(L(StrncpyExit12), L(ExitStrncpyTable))
> + .int JMPTBL(L(StrncpyExit13), L(ExitStrncpyTable))
> + .int JMPTBL(L(StrncpyExit14), L(ExitStrncpyTable))
> + .int JMPTBL(L(StrncpyExit15), L(ExitStrncpyTable))
> + .int JMPTBL(L(StrncpyExit16), L(ExitStrncpyTable))
> + .int JMPTBL(L(StrncpyExit17), L(ExitStrncpyTable))
> + .int JMPTBL(L(StrncpyExit18), L(ExitStrncpyTable))
> + .int JMPTBL(L(StrncpyExit19), L(ExitStrncpyTable))
> + .int JMPTBL(L(StrncpyExit20), L(ExitStrncpyTable))
> + .int JMPTBL(L(StrncpyExit21), L(ExitStrncpyTable))
> + .int JMPTBL(L(StrncpyExit22), L(ExitStrncpyTable))
> + .int JMPTBL(L(StrncpyExit23), L(ExitStrncpyTable))
> + .int JMPTBL(L(StrncpyExit24), L(ExitStrncpyTable))
> + .int JMPTBL(L(StrncpyExit25), L(ExitStrncpyTable))
> + .int JMPTBL(L(StrncpyExit26), L(ExitStrncpyTable))
> + .int JMPTBL(L(StrncpyExit27), L(ExitStrncpyTable))
> + .int JMPTBL(L(StrncpyExit28), L(ExitStrncpyTable))
> + .int JMPTBL(L(StrncpyExit29), L(ExitStrncpyTable))
> + .int JMPTBL(L(StrncpyExit30), L(ExitStrncpyTable))
> + .int JMPTBL(L(StrncpyExit31), L(ExitStrncpyTable))
> + .int JMPTBL(L(StrncpyExit32), L(ExitStrncpyTable))
> + .int JMPTBL(L(StrncpyExit33), L(ExitStrncpyTable))
> +# ifndef USE_AS_STRCAT
> + .p2align 4
> +L(FillTable):
> + .int JMPTBL(L(Fill0), L(FillTable))
> + .int JMPTBL(L(Fill1), L(FillTable))
> + .int JMPTBL(L(Fill2), L(FillTable))
> + .int JMPTBL(L(Fill3), L(FillTable))
> + .int JMPTBL(L(Fill4), L(FillTable))
> + .int JMPTBL(L(Fill5), L(FillTable))
> + .int JMPTBL(L(Fill6), L(FillTable))
> + .int JMPTBL(L(Fill7), L(FillTable))
> + .int JMPTBL(L(Fill8), L(FillTable))
> + .int JMPTBL(L(Fill9), L(FillTable))
> + .int JMPTBL(L(Fill10), L(FillTable))
> + .int JMPTBL(L(Fill11), L(FillTable))
> + .int JMPTBL(L(Fill12), L(FillTable))
> + .int JMPTBL(L(Fill13), L(FillTable))
> + .int JMPTBL(L(Fill14), L(FillTable))
> + .int JMPTBL(L(Fill15), L(FillTable))
> + .int JMPTBL(L(Fill16), L(FillTable))
> +# endif
> +# endif
> +#endif
> diff --git a/sysdeps/x86_64/multiarch/strncpy.S b/sysdeps/x86_64/multiarch/strncpy.S
> index 6d87a0b..afbd870 100644
> --- a/sysdeps/x86_64/multiarch/strncpy.S
> +++ b/sysdeps/x86_64/multiarch/strncpy.S
> @@ -1,5 +1,85 @@
> -/* Multiple versions of strncpy
> - All versions must be listed in ifunc-impl-list.c. */
> -#define STRCPY strncpy
> +/* Multiple versions of strcpy
> + All versions must be listed in ifunc-impl-list.c.
> + Copyright (C) 2009-2015 Free Software Foundation, Inc.
> + Contributed by Intel Corporation.
> + This file is part of the GNU C Library.
> +
> + The GNU C Library is free software; you can redistribute it and/or
> + modify it under the terms of the GNU Lesser General Public
> + License as published by the Free Software Foundation; either
> + version 2.1 of the License, or (at your option) any later version.
> +
> + The GNU C Library is distributed in the hope that it will be useful,
> + but WITHOUT ANY WARRANTY; without even the implied warranty of
> + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
> + Lesser General Public License for more details.
> +
> + You should have received a copy of the GNU Lesser General Public
> + License along with the GNU C Library; if not, see
> + <http://www.gnu.org/licenses/>.
*/ > + > +#include <sysdep.h> > +#include <init-arch.h> > + > #define USE_AS_STRNCPY > -#include "strcpy.S" > +#ifndef STRNCPY > +#define STRNCPY strncpy > +#endif > + > +#ifdef USE_AS_STPCPY > +# define STRNCPY_SSSE3 __stpncpy_ssse3 > +# define STRNCPY_SSE2 __stpncpy_sse2 > +# define STRNCPY_SSE2_UNALIGNED __stpncpy_sse2_unaligned > +# define __GI_STRNCPY __GI_stpncpy > +# define __GI___STRNCPY __GI___stpncpy > +#else > +# define STRNCPY_SSSE3 __strncpy_ssse3 > +# define STRNCPY_SSE2 __strncpy_sse2 > +# define STRNCPY_SSE2_UNALIGNED __strncpy_sse2_unaligned > +# define __GI_STRNCPY __GI_strncpy > +#endif > + > + > +/* Define multiple versions only for the definition in libc. */ > +#if IS_IN (libc) > + .text > +ENTRY(STRNCPY) > + .type STRNCPY, @gnu_indirect_function > + cmpl $0, __cpu_features+KIND_OFFSET(%rip) > + jne 1f > + call __init_cpu_features > +1: leaq STRNCPY_SSE2_UNALIGNED(%rip), %rax > + testl $bit_Fast_Unaligned_Load, __cpu_features+FEATURE_OFFSET+index_Fast_Unaligned_Load(%rip) > + jnz 2f > + leaq STRNCPY_SSE2(%rip), %rax > + testl $bit_SSSE3, __cpu_features+CPUID_OFFSET+index_SSSE3(%rip) > + jz 2f > + leaq STRNCPY_SSSE3(%rip), %rax > +2: ret > +END(STRNCPY) > + > +# undef ENTRY > +# define ENTRY(name) \ > + .type STRNCPY_SSE2, @function; \ > + .align 16; \ > + .globl STRNCPY_SSE2; \ > + .hidden STRNCPY_SSE2; \ > + STRNCPY_SSE2: cfi_startproc; \ > + CALL_MCOUNT > +# undef END > +# define END(name) \ > + cfi_endproc; .size STRNCPY_SSE2, .-STRNCPY_SSE2 > +# undef libc_hidden_builtin_def > +/* It doesn't make sense to send libc-internal strcpy calls through a PLT. > + The speedup we get from using SSSE3 instruction is likely eaten away > + by the indirect call in the PLT. 
*/ > +# define libc_hidden_builtin_def(name) \ > + .globl __GI_STRNCPY; __GI_STRNCPY = STRNCPY_SSE2 > +# undef libc_hidden_def > +# define libc_hidden_def(name) \ > + .globl __GI___STRNCPY; __GI___STRNCPY = STRNCPY_SSE2 > +#endif > + > +#ifndef USE_AS_STRNCPY > +#include "../strcpy.S" > +#endif > -- > 1.8.4.rc3
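The ifunc resolver in the strncpy.S hunk above prefers the sse2_unaligned variant whenever the Fast_Unaligned_Load bit is set, and only consults the SSSE3 bit otherwise. A minimal C sketch of that decision order (the feature-bit names are hypothetical stand-ins, not the real `__cpu_features` layout):

```c
#include <assert.h>
#include <string.h>

/* Hypothetical CPU feature bits; the real resolver reads
   __cpu_features fields set up by __init_cpu_features.  */
enum cpu_feature {
    FEAT_FAST_UNALIGNED_LOAD = 1 << 0,
    FEAT_SSSE3               = 1 << 1,
};

/* Mirror of the resolver's order: fast unaligned loads win outright,
   even when SSSE3 is also available; otherwise plain SSE2 is the
   default, upgraded to SSSE3 when present.  */
static const char *select_strncpy(unsigned features)
{
    if (features & FEAT_FAST_UNALIGNED_LOAD)
        return "__strncpy_sse2_unaligned";
    if (features & FEAT_SSSE3)
        return "__strncpy_ssse3";
    return "__strncpy_sse2";
}
```

This matches the assembly, which loads the sse2_unaligned address first and returns immediately on Fast_Unaligned_Load, falling through to the SSE2/SSSE3 choice only when that bit is clear.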
On Wed, Jun 24, 2015 at 10:13:31AM +0200, Ondřej Bílka wrote: > On Wed, Jun 17, 2015 at 08:01:05PM +0200, Ondřej Bílka wrote: > > Hi, > > > > I wrote a new strcpy on x64 and for some reason I thought that I had > > committed it and forgot to ping it. > > > > As there are other routines that I could improve, I will use the branch > > neleai/string-x64 to collect these. > > > > Here is a revised version of what I submitted in 2013. The main change is that > > I now target i7 instead of core2. That simplifies things, as unaligned loads > > are cheap rather than a bit slower than aligned ones as on core2. That mainly > > concerns the header, as on core2 you could get better performance by > > aligning loads or stores to 16 bytes after the first bytes were read. I do > > not know what's better; I would need to test it. > > > > That also makes support of an ssse3 variant less important. I could send it, > > but it was one item on my TODO list that has now probably lost > > importance. The problem is that on x64, to align by ssse3 or sse2 with > > shifts you need to make 16 loops, one for each alignment, as you don't have > > a variable shift. It also needs to use a jump table, which is very expensive. > > For strcpy that's dubious, as it increases instruction cache pressure > > and most copies are small. You would need to switch from unaligned > > loads to aligning; I needed to do profiling to select the correct threshold. > > > > If somebody is interested in optimizing old Pentium 4 or Athlon 64 chips, I will > > provide an ssse3 variant that is also 50% faster than the current one. > > That is also the reason why I omitted drawing the current ssse3 implementation's > > performance. > > > > > > In this version the header first checks 128 bytes unaligned unless they > > cross a page boundary. That allows a more effective loop, as at the end of the > > loop we can simply write the last 64 bytes instead of special-casing to avoid > > writing before the start. 
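The header's page-cross guard quoted above shows up in the patch as `andl $4095, %edx; cmp $3968, %edx; ja L(cross_page)`: a 128-byte unaligned read is taken only when the source's offset inside its 4096-byte page leaves at least 128 bytes before the page end. A small sketch of that test (assuming the 4096-byte page size the patch hard-codes):

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the patch's page-cross check: reading 128 bytes starting at
   SRC cannot cross a page when (src mod 4096) <= 4096 - 128 = 3968,
   i.e. the "cmp $3968 / ja L(cross_page)" sequence.  */
static int can_read_128_unaligned(uintptr_t src)
{
    return (src & 4095) <= 4096 - 128;
}
```

Offsets up to 3968 keep the whole 128-byte window inside one page; anything larger must take the slow cross-page path, since the bytes past the page end may be unmapped.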
> > > > I tried several variants of the header; as we first read 16 bytes into the xmm0 > > register, the question is whether they could be reused. I used an evolver to select the > > best variant; there was almost no difference in performance between > > these. > > > > Now I do checks for bytes 0-15, then 16-31, then 32-63, then 64-128. > > There is a possibility to gain some cycles with a different grouping; I will > > post an improvement later if I find something. > > > > > > The first problem was reading ahead. Rereading 8 bytes looked a bit faster > > than a move from xmm. > > > > Then I tried when to reuse/reread. In the 4-7 byte case it was faster to reread > > than to use bit shifts to get the second half. For 1-3 bytes I use the following > > copy, with s[0] and s[1] taken from the rdx register with byte shifts. > > > > Test a branch against this branchless version that works for i = 0, 1, 2: > > d[i] = 0; > > d[i/2] = s[1]; > > d[0] = s[0]; > > > > I also added an avx2 loop. The reason I shouldn't use it in the header > > was high latency. I could test whether using it for bytes 64-128 would give a > > speedup. > > > > As technical issues go, I needed to move the old strcpy_sse_unaligned > > implementation into strncpy_sse2_unaligned, as strncpy is a function that > > should be optimized for size, not performance. For now I will keep > > these unchanged. > > > > As for performance, these are 15%-30% faster than the current one for a gcc workload on > > Haswell and Ivy Bridge. > > > > As for the avx2 version, it currently gains 6% on this workload, mainly because it's bash and > > has a lot of large loads, so the avx2 loop helps. > > > > I used my profiler to show the improvement, see here > > > > http://kam.mff.cuni.cz/~ondra/benchmark_string/strcpy_profile.html > > > > and the source is here > > > > http://kam.mff.cuni.cz/~ondra/benchmark_string/strcpy_profile170615.tar.bz2 > > > > Comments? 
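The branchless 1-3 byte copy quoted above can be checked directly in C. For a terminating NUL at index i (i = 0, 1 or 2), three unconditional stores produce the copied string; note it reads s[1] even when i == 0, which is safe in the original context because the header has already fetched 16 bytes from the source:

```c
#include <assert.h>
#include <stddef.h>

/* The mail's branchless short copy: NUL is at s[i], i in {0, 1, 2}.  */
static void copy_short(char *d, const char *s, size_t i)
{
    d[i] = 0;        /* place the terminator first */
    d[i / 2] = s[1]; /* lands on d[0] for i < 2, on d[1] for i == 2 */
    d[0] = s[0];     /* always correct; may overwrite the previous store */
}
```

Tracing the three cases shows why it works: for i == 2 the stores hit d[2], d[1], d[0] in turn; for i == 1 the second store to d[0] is immediately overwritten by the correct s[0]; for i == 0 all three stores land on d[0] and the last one writes the NUL.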
> > > > * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list): > > Add __strcpy_avx2 and __stpcpy_avx2 > > * sysdeps/x86_64/multiarch/Makefile (routines): Add stpcpy_avx2.S and > > strcpy_avx2.S > > * sysdeps/x86_64/multiarch/stpcpy-avx2.S: New file > > * sysdeps/x86_64/multiarch/strcpy-avx2.S: Likewise. > > * sysdeps/x86_64/multiarch/stpcpy-sse2-unaligned.S: Refactored > > implementation. > > * sysdeps/x86_64/multiarch/strcpy.S: Updated ifunc. > > * sysdeps/x86_64/multiarch/strncpy.S: Moved from strcpy.S. > > * sysdeps/x86_64/multiarch/strncpy-sse2-unaligned.S: Moved > > strcpy-sse2-unaligned.S here. > > * sysdeps/x86_64/multiarch/strcpy-sse2-unaligned.S: Likewise. > > * sysdeps/x86_64/multiarch/stpncpy-sse2-unaligned.S: Redirect > > from strcpy-sse2-unaligned.S to strncpy-sse2-unaligned.S > > * sysdeps/x86_64/multiarch/stpncpy.S: Likewise. > > * sysdeps/x86_64/multiarch/strcat-sse2-unaligned.S: Likewise. > > > > --- > > sysdeps/x86_64/multiarch/Makefile | 2 +- > > sysdeps/x86_64/multiarch/ifunc-impl-list.c | 2 + > > sysdeps/x86_64/multiarch/stpcpy-avx2.S | 3 + > > sysdeps/x86_64/multiarch/stpcpy-sse2-unaligned.S | 439 ++++- > > sysdeps/x86_64/multiarch/stpncpy-sse2-unaligned.S | 3 +- > > sysdeps/x86_64/multiarch/stpncpy.S | 5 +- > > sysdeps/x86_64/multiarch/strcat-sse2-unaligned.S | 2 +- > > sysdeps/x86_64/multiarch/strcpy-avx2.S | 4 + > > sysdeps/x86_64/multiarch/strcpy-sse2-unaligned.S | 1890 +------------------- > > sysdeps/x86_64/multiarch/strcpy.S | 22 +- > > sysdeps/x86_64/multiarch/strncpy-sse2-unaligned.S | 1891 ++++++++++++++++++++- > > sysdeps/x86_64/multiarch/strncpy.S | 88 +- > > 14 files changed, 2435 insertions(+), 1921 deletions(-) > > create mode 100644 sysdeps/x86_64/multiarch/stpcpy-avx2.S > > create mode 100644 sysdeps/x86_64/multiarch/strcpy-avx2.S > > > > > > diff --git a/sysdeps/x86_64/multiarch/Makefile b/sysdeps/x86_64/multiarch/Makefile > > index d7002a9..c573744 100644 > > --- a/sysdeps/x86_64/multiarch/Makefile > > +++ 
b/sysdeps/x86_64/multiarch/Makefile > > @@ -29,7 +29,7 @@ CFLAGS-strspn-c.c += -msse4 > > endif > > > > ifeq (yes,$(config-cflags-avx2)) > > -sysdep_routines += memset-avx2 > > +sysdep_routines += memset-avx2 strcpy-avx2 stpcpy-avx2 > > endif > > endif > > > > diff --git a/sysdeps/x86_64/multiarch/ifunc-impl-list.c b/sysdeps/x86_64/multiarch/ifunc-impl-list.c > > index b64e4f1..d398e43 100644 > > --- a/sysdeps/x86_64/multiarch/ifunc-impl-list.c > > +++ b/sysdeps/x86_64/multiarch/ifunc-impl-list.c > > @@ -88,6 +88,7 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array, > > > > /* Support sysdeps/x86_64/multiarch/stpcpy.S. */ > > IFUNC_IMPL (i, name, stpcpy, > > + IFUNC_IMPL_ADD (array, i, stpcpy, HAS_AVX2, __stpcpy_avx2) > > IFUNC_IMPL_ADD (array, i, stpcpy, HAS_SSSE3, __stpcpy_ssse3) > > IFUNC_IMPL_ADD (array, i, stpcpy, 1, __stpcpy_sse2_unaligned) > > IFUNC_IMPL_ADD (array, i, stpcpy, 1, __stpcpy_sse2)) > > @@ -137,6 +138,7 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array, > > > > /* Support sysdeps/x86_64/multiarch/strcpy.S. 
*/ > > IFUNC_IMPL (i, name, strcpy, > > + IFUNC_IMPL_ADD (array, i, strcpy, HAS_AVX2, __strcpy_avx2) > > IFUNC_IMPL_ADD (array, i, strcpy, HAS_SSSE3, __strcpy_ssse3) > > IFUNC_IMPL_ADD (array, i, strcpy, 1, __strcpy_sse2_unaligned) > > IFUNC_IMPL_ADD (array, i, strcpy, 1, __strcpy_sse2)) > > diff --git a/sysdeps/x86_64/multiarch/stpcpy-avx2.S b/sysdeps/x86_64/multiarch/stpcpy-avx2.S > > new file mode 100644 > > index 0000000..bd30ef6 > > --- /dev/null > > +++ b/sysdeps/x86_64/multiarch/stpcpy-avx2.S > > @@ -0,0 +1,3 @@ > > +#define USE_AVX2 > > +#define STPCPY __stpcpy_avx2 > > +#include "stpcpy-sse2-unaligned.S" > > diff --git a/sysdeps/x86_64/multiarch/stpcpy-sse2-unaligned.S b/sysdeps/x86_64/multiarch/stpcpy-sse2-unaligned.S > > index 34231f8..695a236 100644 > > --- a/sysdeps/x86_64/multiarch/stpcpy-sse2-unaligned.S > > +++ b/sysdeps/x86_64/multiarch/stpcpy-sse2-unaligned.S > > @@ -1,3 +1,436 @@ > > -#define USE_AS_STPCPY > > -#define STRCPY __stpcpy_sse2_unaligned > > -#include "strcpy-sse2-unaligned.S" > > +/* stpcpy with SSE2 and unaligned load > > + Copyright (C) 2015 Free Software Foundation, Inc. > > + This file is part of the GNU C Library. > > + > > + The GNU C Library is free software; you can redistribute it and/or > > + modify it under the terms of the GNU Lesser General Public > > + License as published by the Free Software Foundation; either > > + version 2.1 of the License, or (at your option) any later version. > > + > > + The GNU C Library is distributed in the hope that it will be useful, > > + but WITHOUT ANY WARRANTY; without even the implied warranty of > > + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU > > + Lesser General Public License for more details. > > + > > + You should have received a copy of the GNU Lesser General Public > > + License along with the GNU C Library; if not, see > > + <http://www.gnu.org/licenses/>. 
*/ > > + > > +#include <sysdep.h> > > +#ifndef STPCPY > > +# define STPCPY __stpcpy_sse2_unaligned > > +#endif > > + > > +ENTRY(STPCPY) > > + mov %esi, %edx > > +#ifdef AS_STRCPY > > + movq %rdi, %rax > > +#endif > > + pxor %xmm4, %xmm4 > > + pxor %xmm5, %xmm5 > > + andl $4095, %edx > > + cmp $3968, %edx > > + ja L(cross_page) > > + > > + movdqu (%rsi), %xmm0 > > + pcmpeqb %xmm0, %xmm4 > > + pmovmskb %xmm4, %edx > > + testl %edx, %edx > > + je L(more16bytes) > > + bsf %edx, %ecx > > +#ifndef AS_STRCPY > > + lea (%rdi, %rcx), %rax > > +#endif > > + cmp $7, %ecx > > + movq (%rsi), %rdx > > + jb L(less_8_bytesb) > > +L(8bytes_from_cross): > > + movq -7(%rsi, %rcx), %rsi > > + movq %rdx, (%rdi) > > +#ifdef AS_STRCPY > > + movq %rsi, -7(%rdi, %rcx) > > +#else > > + movq %rsi, -7(%rax) > > +#endif > > + ret > > + > > + .p2align 4 > > +L(less_8_bytesb): > > + cmp $2, %ecx > > + jbe L(less_4_bytes) > > +L(4bytes_from_cross): > > + mov -3(%rsi, %rcx), %esi > > + mov %edx, (%rdi) > > +#ifdef AS_STRCPY > > + mov %esi, -3(%rdi, %rcx) > > +#else > > + mov %esi, -3(%rax) > > +#endif > > + ret > > + > > +.p2align 4 > > + L(less_4_bytes): > > + /* > > + Test branch vs this branchless that works for i 0,1,2 > > + d[i] = 0; > > + d[i/2] = s[1]; > > + d[0] = s[0]; > > + */ > > +#ifdef AS_STRCPY > > + movb $0, (%rdi, %rcx) > > +#endif > > + > > + shr $1, %ecx > > + mov %edx, %esi > > + shr $8, %edx > > + movb %dl, (%rdi, %rcx) > > +#ifndef AS_STRCPY > > + movb $0, (%rax) > > +#endif > > + movb %sil, (%rdi) > > + ret > > + > > + > > + > > + > > + > > + .p2align 4 > > +L(more16bytes): > > + pxor %xmm6, %xmm6 > > + movdqu 16(%rsi), %xmm1 > > + pxor %xmm7, %xmm7 > > + pcmpeqb %xmm1, %xmm5 > > + pmovmskb %xmm5, %edx > > + testl %edx, %edx > > + je L(more32bytes) > > + bsf %edx, %edx > > +#ifdef AS_STRCPY > > + movdqu 1(%rsi, %rdx), %xmm1 > > + movdqu %xmm0, (%rdi) > > + movdqu %xmm1, 1(%rdi, %rdx) > > +#else > > + lea 16(%rdi, %rdx), %rax > > + movdqu 1(%rsi, %rdx), %xmm1 > > + movdqu 
%xmm0, (%rdi) > > + movdqu %xmm1, -15(%rax) > > +#endif > > + ret > > + > > + .p2align 4 > > +L(more32bytes): > > + movdqu 32(%rsi), %xmm2 > > + movdqu 48(%rsi), %xmm3 > > + > > + pcmpeqb %xmm2, %xmm6 > > + pcmpeqb %xmm3, %xmm7 > > + pmovmskb %xmm7, %edx > > + shl $16, %edx > > + pmovmskb %xmm6, %ecx > > + or %ecx, %edx > > + je L(more64bytes) > > + bsf %edx, %edx > > +#ifndef AS_STRCPY > > + lea 32(%rdi, %rdx), %rax > > +#endif > > + movdqu 1(%rsi, %rdx), %xmm2 > > + movdqu 17(%rsi, %rdx), %xmm3 > > + movdqu %xmm0, (%rdi) > > + movdqu %xmm1, 16(%rdi) > > +#ifdef AS_STRCPY > > + movdqu %xmm2, 1(%rdi, %rdx) > > + movdqu %xmm3, 17(%rdi, %rdx) > > +#else > > + movdqu %xmm2, -31(%rax) > > + movdqu %xmm3, -15(%rax) > > +#endif > > + ret > > + > > + .p2align 4 > > +L(more64bytes): > > + movdqu %xmm0, (%rdi) > > + movdqu %xmm1, 16(%rdi) > > + movdqu %xmm2, 32(%rdi) > > + movdqu %xmm3, 48(%rdi) > > + movdqu 64(%rsi), %xmm0 > > + movdqu 80(%rsi), %xmm1 > > + movdqu 96(%rsi), %xmm2 > > + movdqu 112(%rsi), %xmm3 > > + > > + pcmpeqb %xmm0, %xmm4 > > + pcmpeqb %xmm1, %xmm5 > > + pcmpeqb %xmm2, %xmm6 > > + pcmpeqb %xmm3, %xmm7 > > + pmovmskb %xmm4, %ecx > > + pmovmskb %xmm5, %edx > > + pmovmskb %xmm6, %r8d > > + pmovmskb %xmm7, %r9d > > + shl $16, %edx > > + or %ecx, %edx > > + shl $32, %r8 > > + shl $48, %r9 > > + or %r8, %rdx > > + or %r9, %rdx > > + test %rdx, %rdx > > + je L(prepare_loop) > > + bsf %rdx, %rdx > > +#ifndef AS_STRCPY > > + lea 64(%rdi, %rdx), %rax > > +#endif > > + movdqu 1(%rsi, %rdx), %xmm0 > > + movdqu 17(%rsi, %rdx), %xmm1 > > + movdqu 33(%rsi, %rdx), %xmm2 > > + movdqu 49(%rsi, %rdx), %xmm3 > > +#ifdef AS_STRCPY > > + movdqu %xmm0, 1(%rdi, %rdx) > > + movdqu %xmm1, 17(%rdi, %rdx) > > + movdqu %xmm2, 33(%rdi, %rdx) > > + movdqu %xmm3, 49(%rdi, %rdx) > > +#else > > + movdqu %xmm0, -63(%rax) > > + movdqu %xmm1, -47(%rax) > > + movdqu %xmm2, -31(%rax) > > + movdqu %xmm3, -15(%rax) > > +#endif > > + ret > > + > > + > > + .p2align 4 > > +L(prepare_loop): > > + 
movdqu %xmm0, 64(%rdi) > > + movdqu %xmm1, 80(%rdi) > > + movdqu %xmm2, 96(%rdi) > > + movdqu %xmm3, 112(%rdi) > > + > > + subq %rsi, %rdi > > + add $64, %rsi > > + andq $-64, %rsi > > + addq %rsi, %rdi > > + jmp L(loop_entry) > > + > > +#ifdef USE_AVX2 > > + .p2align 4 > > +L(loop): > > + vmovdqu %ymm1, (%rdi) > > + vmovdqu %ymm3, 32(%rdi) > > +L(loop_entry): > > + vmovdqa 96(%rsi), %ymm3 > > + vmovdqa 64(%rsi), %ymm1 > > + vpminub %ymm3, %ymm1, %ymm2 > > + addq $64, %rsi > > + addq $64, %rdi > > + vpcmpeqb %ymm5, %ymm2, %ymm0 > > + vpmovmskb %ymm0, %edx > > + test %edx, %edx > > + je L(loop) > > + salq $32, %rdx > > + vpcmpeqb %ymm5, %ymm1, %ymm4 > > + vpmovmskb %ymm4, %ecx > > + or %rcx, %rdx > > + bsfq %rdx, %rdx > > +#ifndef AS_STRCPY > > + lea (%rdi, %rdx), %rax > > +#endif > > + vmovdqu -63(%rsi, %rdx), %ymm0 > > + vmovdqu -31(%rsi, %rdx), %ymm2 > > +#ifdef AS_STRCPY > > + vmovdqu %ymm0, -63(%rdi, %rdx) > > + vmovdqu %ymm2, -31(%rdi, %rdx) > > +#else > > + vmovdqu %ymm0, -63(%rax) > > + vmovdqu %ymm2, -31(%rax) > > +#endif > > + vzeroupper > > + ret > > +#else > > + .p2align 4 > > +L(loop): > > + movdqu %xmm1, (%rdi) > > + movdqu %xmm2, 16(%rdi) > > + movdqu %xmm3, 32(%rdi) > > + movdqu %xmm4, 48(%rdi) > > +L(loop_entry): > > + movdqa 96(%rsi), %xmm3 > > + movdqa 112(%rsi), %xmm4 > > + movdqa %xmm3, %xmm0 > > + movdqa 80(%rsi), %xmm2 > > + pminub %xmm4, %xmm0 > > + movdqa 64(%rsi), %xmm1 > > + pminub %xmm2, %xmm0 > > + pminub %xmm1, %xmm0 > > + addq $64, %rsi > > + addq $64, %rdi > > + pcmpeqb %xmm5, %xmm0 > > + pmovmskb %xmm0, %edx > > + test %edx, %edx > > + je L(loop) > > + salq $48, %rdx > > + pcmpeqb %xmm1, %xmm5 > > + pcmpeqb %xmm2, %xmm6 > > + pmovmskb %xmm5, %ecx > > +#ifdef AS_STRCPY > > + pmovmskb %xmm6, %r8d > > + pcmpeqb %xmm3, %xmm7 > > + pmovmskb %xmm7, %r9d > > + sal $16, %r8d > > + or %r8d, %ecx > > +#else > > + pmovmskb %xmm6, %eax > > + pcmpeqb %xmm3, %xmm7 > > + pmovmskb %xmm7, %r9d > > + sal $16, %eax > > + or %eax, %ecx > > +#endif > > + 
salq $32, %r9 > > + orq %rcx, %rdx > > + orq %r9, %rdx > > + bsfq %rdx, %rdx > > +#ifndef AS_STRCPY > > + lea (%rdi, %rdx), %rax > > +#endif > > + movdqu -63(%rsi, %rdx), %xmm0 > > + movdqu -47(%rsi, %rdx), %xmm1 > > + movdqu -31(%rsi, %rdx), %xmm2 > > + movdqu -15(%rsi, %rdx), %xmm3 > > +#ifdef AS_STRCPY > > + movdqu %xmm0, -63(%rdi, %rdx) > > + movdqu %xmm1, -47(%rdi, %rdx) > > + movdqu %xmm2, -31(%rdi, %rdx) > > + movdqu %xmm3, -15(%rdi, %rdx) > > +#else > > + movdqu %xmm0, -63(%rax) > > + movdqu %xmm1, -47(%rax) > > + movdqu %xmm2, -31(%rax) > > + movdqu %xmm3, -15(%rax) > > +#endif > > + ret > > +#endif > > + > > + .p2align 4 > > +L(cross_page): > > + movq %rsi, %rcx > > + pxor %xmm0, %xmm0 > > + and $15, %ecx > > + movq %rsi, %r9 > > + movq %rdi, %r10 > > + subq %rcx, %rsi > > + subq %rcx, %rdi > > + movdqa (%rsi), %xmm1 > > + pcmpeqb %xmm0, %xmm1 > > + pmovmskb %xmm1, %edx > > + shr %cl, %edx > > + shl %cl, %edx > > + test %edx, %edx > > + jne L(less_32_cross) > > + > > + addq $16, %rsi > > + addq $16, %rdi > > + movdqa (%rsi), %xmm1 > > + pcmpeqb %xmm1, %xmm0 > > + pmovmskb %xmm0, %edx > > + test %edx, %edx > > + jne L(less_32_cross) > > + movdqu %xmm1, (%rdi) > > + > > + movdqu (%r9), %xmm0 > > + movdqu %xmm0, (%r10) > > + > > + mov $8, %rcx > > +L(cross_loop): > > + addq $16, %rsi > > + addq $16, %rdi > > + pxor %xmm0, %xmm0 > > + movdqa (%rsi), %xmm1 > > + pcmpeqb %xmm1, %xmm0 > > + pmovmskb %xmm0, %edx > > + test %edx, %edx > > + jne L(return_cross) > > + movdqu %xmm1, (%rdi) > > + sub $1, %rcx > > + ja L(cross_loop) > > + > > + pxor %xmm5, %xmm5 > > + pxor %xmm6, %xmm6 > > + pxor %xmm7, %xmm7 > > + > > + lea -64(%rsi), %rdx > > + andq $-64, %rdx > > + addq %rdx, %rdi > > + subq %rsi, %rdi > > + movq %rdx, %rsi > > + jmp L(loop_entry) > > + > > + .p2align 4 > > +L(return_cross): > > + bsf %edx, %edx > > +#ifdef AS_STRCPY > > + movdqu -15(%rsi, %rdx), %xmm0 > > + movdqu %xmm0, -15(%rdi, %rdx) > > +#else > > + lea (%rdi, %rdx), %rax > > + movdqu -15(%rsi, 
%rdx), %xmm0 > > + movdqu %xmm0, -15(%rax) > > +#endif > > + ret > > + > > + .p2align 4 > > +L(less_32_cross): > > + bsf %rdx, %rdx > > + lea (%rdi, %rdx), %rcx > > +#ifndef AS_STRCPY > > + mov %rcx, %rax > > +#endif > > + mov %r9, %rsi > > + mov %r10, %rdi > > + sub %rdi, %rcx > > + cmp $15, %ecx > > + jb L(less_16_cross) > > + movdqu (%rsi), %xmm0 > > + movdqu -15(%rsi, %rcx), %xmm1 > > + movdqu %xmm0, (%rdi) > > +#ifdef AS_STRCPY > > + movdqu %xmm1, -15(%rdi, %rcx) > > +#else > > + movdqu %xmm1, -15(%rax) > > +#endif > > + ret > > + > > +L(less_16_cross): > > + cmp $7, %ecx > > + jb L(less_8_bytes_cross) > > + movq (%rsi), %rdx > > + jmp L(8bytes_from_cross) > > + > > +L(less_8_bytes_cross): > > + cmp $2, %ecx > > + jbe L(3_bytes_cross) > > + mov (%rsi), %edx > > + jmp L(4bytes_from_cross) > > + > > +L(3_bytes_cross): > > + jb L(1_2bytes_cross) > > + movzwl (%rsi), %edx > > + jmp L(_3_bytesb) > > + > > +L(1_2bytes_cross): > > + movb (%rsi), %dl > > + jmp L(0_2bytes_from_cross) > > + > > + .p2align 4 > > +L(less_4_bytesb): > > + je L(_3_bytesb) > > +L(0_2bytes_from_cross): > > + movb %dl, (%rdi) > > +#ifdef AS_STRCPY > > + movb $0, (%rdi, %rcx) > > +#else > > + movb $0, (%rax) > > +#endif > > + ret > > + > > + .p2align 4 > > +L(_3_bytesb): > > + movw %dx, (%rdi) > > + movb $0, 2(%rdi) > > + ret > > + > > +END(STPCPY) > > diff --git a/sysdeps/x86_64/multiarch/stpncpy-sse2-unaligned.S b/sysdeps/x86_64/multiarch/stpncpy-sse2-unaligned.S > > index 658520f..3f35068 100644 > > --- a/sysdeps/x86_64/multiarch/stpncpy-sse2-unaligned.S > > +++ b/sysdeps/x86_64/multiarch/stpncpy-sse2-unaligned.S > > @@ -1,4 +1,3 @@ > > #define USE_AS_STPCPY > > -#define USE_AS_STRNCPY > > #define STRCPY __stpncpy_sse2_unaligned > > -#include "strcpy-sse2-unaligned.S" > > +#include "strncpy-sse2-unaligned.S" > > diff --git a/sysdeps/x86_64/multiarch/stpncpy.S b/sysdeps/x86_64/multiarch/stpncpy.S > > index 2698ca6..159604a 100644 > > --- a/sysdeps/x86_64/multiarch/stpncpy.S > > +++ 
b/sysdeps/x86_64/multiarch/stpncpy.S > > @@ -1,8 +1,7 @@ > > /* Multiple versions of stpncpy > > All versions must be listed in ifunc-impl-list.c. */ > > -#define STRCPY __stpncpy > > +#define STRNCPY __stpncpy > > #define USE_AS_STPCPY > > -#define USE_AS_STRNCPY > > -#include "strcpy.S" > > +#include "strncpy.S" > > > > weak_alias (__stpncpy, stpncpy) > > diff --git a/sysdeps/x86_64/multiarch/strcat-sse2-unaligned.S b/sysdeps/x86_64/multiarch/strcat-sse2-unaligned.S > > index 81f1b40..1faa49d 100644 > > --- a/sysdeps/x86_64/multiarch/strcat-sse2-unaligned.S > > +++ b/sysdeps/x86_64/multiarch/strcat-sse2-unaligned.S > > @@ -275,5 +275,5 @@ L(StartStrcpyPart): > > # define USE_AS_STRNCPY > > # endif > > > > -# include "strcpy-sse2-unaligned.S" > > +# include "strncpy-sse2-unaligned.S" > > #endif > > diff --git a/sysdeps/x86_64/multiarch/strcpy-avx2.S b/sysdeps/x86_64/multiarch/strcpy-avx2.S > > new file mode 100644 > > index 0000000..a3133a4 > > --- /dev/null > > +++ b/sysdeps/x86_64/multiarch/strcpy-avx2.S > > @@ -0,0 +1,4 @@ > > +#define USE_AVX2 > > +#define AS_STRCPY > > +#define STPCPY __strcpy_avx2 > > +#include "stpcpy-sse2-unaligned.S" > > diff --git a/sysdeps/x86_64/multiarch/strcpy-sse2-unaligned.S b/sysdeps/x86_64/multiarch/strcpy-sse2-unaligned.S > > index 8f03d1d..310e4fa 100644 > > --- a/sysdeps/x86_64/multiarch/strcpy-sse2-unaligned.S > > +++ b/sysdeps/x86_64/multiarch/strcpy-sse2-unaligned.S > > @@ -1,1887 +1,3 @@ > > -/* strcpy with SSE2 and unaligned load > > - Copyright (C) 2011-2015 Free Software Foundation, Inc. > > - Contributed by Intel Corporation. > > - This file is part of the GNU C Library. > > - > > - The GNU C Library is free software; you can redistribute it and/or > > - modify it under the terms of the GNU Lesser General Public > > - License as published by the Free Software Foundation; either > > - version 2.1 of the License, or (at your option) any later version. 
> > - > > - The GNU C Library is distributed in the hope that it will be useful, > > - but WITHOUT ANY WARRANTY; without even the implied warranty of > > - MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU > > - Lesser General Public License for more details. > > - > > - You should have received a copy of the GNU Lesser General Public > > - License along with the GNU C Library; if not, see > > - <http://www.gnu.org/licenses/>. */ > > - > > -#if IS_IN (libc) > > - > > -# ifndef USE_AS_STRCAT > > -# include <sysdep.h> > > - > > -# ifndef STRCPY > > -# define STRCPY __strcpy_sse2_unaligned > > -# endif > > - > > -# endif > > - > > -# define JMPTBL(I, B) I - B > > -# define BRANCH_TO_JMPTBL_ENTRY(TABLE, INDEX, SCALE) \ > > - lea TABLE(%rip), %r11; \ > > - movslq (%r11, INDEX, SCALE), %rcx; \ > > - lea (%r11, %rcx), %rcx; \ > > - jmp *%rcx > > - > > -# ifndef USE_AS_STRCAT > > - > > -.text > > -ENTRY (STRCPY) > > -# ifdef USE_AS_STRNCPY > > - mov %rdx, %r8 > > - test %r8, %r8 > > - jz L(ExitZero) > > -# endif > > - mov %rsi, %rcx > > -# ifndef USE_AS_STPCPY > > - mov %rdi, %rax /* save result */ > > -# endif > > - > > -# endif > > - > > - and $63, %rcx > > - cmp $32, %rcx > > - jbe L(SourceStringAlignmentLess32) > > - > > - and $-16, %rsi > > - and $15, %rcx > > - pxor %xmm0, %xmm0 > > - pxor %xmm1, %xmm1 > > - > > - pcmpeqb (%rsi), %xmm1 > > - pmovmskb %xmm1, %rdx > > - shr %cl, %rdx > > - > > -# ifdef USE_AS_STRNCPY > > -# if defined USE_AS_STPCPY || defined USE_AS_STRCAT > > - mov $16, %r10 > > - sub %rcx, %r10 > > - cmp %r10, %r8 > > -# else > > - mov $17, %r10 > > - sub %rcx, %r10 > > - cmp %r10, %r8 > > -# endif > > - jbe L(CopyFrom1To16BytesTailCase2OrCase3) > > -# endif > > - test %rdx, %rdx > > - jnz L(CopyFrom1To16BytesTail) > > - > > - pcmpeqb 16(%rsi), %xmm0 > > - pmovmskb %xmm0, %rdx > > - > > -# ifdef USE_AS_STRNCPY > > - add $16, %r10 > > - cmp %r10, %r8 > > - jbe L(CopyFrom1To32BytesCase2OrCase3) > > -# endif > > - test %rdx, %rdx > > - 
jnz	L(CopyFrom1To32Bytes)
> > -
> > -	movdqu	(%rsi, %rcx), %xmm1   /* copy 16 bytes */
> > -	movdqu	%xmm1, (%rdi)
> > -
> > -/* If source address alignment != destination address alignment */
> > -	.p2align 4
> > -L(Unalign16Both):
> > -	sub	%rcx, %rdi
> > -# ifdef USE_AS_STRNCPY
> > -	add	%rcx, %r8
> > -# endif
> > -	mov	$16, %rcx
> > -	movdqa	(%rsi, %rcx), %xmm1
> > -	movaps	16(%rsi, %rcx), %xmm2
> > -	movdqu	%xmm1, (%rdi, %rcx)
> > -	pcmpeqb	%xmm2, %xmm0
> > -	pmovmskb %xmm0, %rdx
> > -	add	$16, %rcx
> > -# ifdef USE_AS_STRNCPY
> > -	sub	$48, %r8
> > -	jbe	L(CopyFrom1To16BytesCase2OrCase3)
> > -# endif
> > -	test	%rdx, %rdx
> > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > -	jnz	L(CopyFrom1To16BytesUnalignedXmm2)
> > -# else
> > -	jnz	L(CopyFrom1To16Bytes)
> > -# endif
> > -
> > -	movaps	16(%rsi, %rcx), %xmm3
> > -	movdqu	%xmm2, (%rdi, %rcx)
> > -	pcmpeqb	%xmm3, %xmm0
> > -	pmovmskb %xmm0, %rdx
> > -	add	$16, %rcx
> > -# ifdef USE_AS_STRNCPY
> > -	sub	$16, %r8
> > -	jbe	L(CopyFrom1To16BytesCase2OrCase3)
> > -# endif
> > -	test	%rdx, %rdx
> > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > -	jnz	L(CopyFrom1To16BytesUnalignedXmm3)
> > -# else
> > -	jnz	L(CopyFrom1To16Bytes)
> > -# endif
> > -
> > -	movaps	16(%rsi, %rcx), %xmm4
> > -	movdqu	%xmm3, (%rdi, %rcx)
> > -	pcmpeqb	%xmm4, %xmm0
> > -	pmovmskb %xmm0, %rdx
> > -	add	$16, %rcx
> > -# ifdef USE_AS_STRNCPY
> > -	sub	$16, %r8
> > -	jbe	L(CopyFrom1To16BytesCase2OrCase3)
> > -# endif
> > -	test	%rdx, %rdx
> > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > -	jnz	L(CopyFrom1To16BytesUnalignedXmm4)
> > -# else
> > -	jnz	L(CopyFrom1To16Bytes)
> > -# endif
> > -
> > -	movaps	16(%rsi, %rcx), %xmm1
> > -	movdqu	%xmm4, (%rdi, %rcx)
> > -	pcmpeqb	%xmm1, %xmm0
> > -	pmovmskb %xmm0, %rdx
> > -	add	$16, %rcx
> > -# ifdef USE_AS_STRNCPY
> > -	sub	$16, %r8
> > -	jbe	L(CopyFrom1To16BytesCase2OrCase3)
> > -# endif
> > -	test	%rdx, %rdx
> > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > -	jnz	L(CopyFrom1To16BytesUnalignedXmm1)
> > -# else
> > -	jnz	L(CopyFrom1To16Bytes)
> > -# endif
> > -
> > -	movaps	16(%rsi, %rcx), %xmm2
> > -	movdqu	%xmm1, (%rdi, %rcx)
> > -	pcmpeqb	%xmm2, %xmm0
> > -	pmovmskb %xmm0, %rdx
> > -	add	$16, %rcx
> > -# ifdef USE_AS_STRNCPY
> > -	sub	$16, %r8
> > -	jbe	L(CopyFrom1To16BytesCase2OrCase3)
> > -# endif
> > -	test	%rdx, %rdx
> > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > -	jnz	L(CopyFrom1To16BytesUnalignedXmm2)
> > -# else
> > -	jnz	L(CopyFrom1To16Bytes)
> > -# endif
> > -
> > -	movaps	16(%rsi, %rcx), %xmm3
> > -	movdqu	%xmm2, (%rdi, %rcx)
> > -	pcmpeqb	%xmm3, %xmm0
> > -	pmovmskb %xmm0, %rdx
> > -	add	$16, %rcx
> > -# ifdef USE_AS_STRNCPY
> > -	sub	$16, %r8
> > -	jbe	L(CopyFrom1To16BytesCase2OrCase3)
> > -# endif
> > -	test	%rdx, %rdx
> > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > -	jnz	L(CopyFrom1To16BytesUnalignedXmm3)
> > -# else
> > -	jnz	L(CopyFrom1To16Bytes)
> > -# endif
> > -
> > -	movdqu	%xmm3, (%rdi, %rcx)
> > -	mov	%rsi, %rdx
> > -	lea	16(%rsi, %rcx), %rsi
> > -	and	$-0x40, %rsi
> > -	sub	%rsi, %rdx
> > -	sub	%rdx, %rdi
> > -# ifdef USE_AS_STRNCPY
> > -	lea	128(%r8, %rdx), %r8
> > -# endif
> > -L(Unaligned64Loop):
> > -	movaps	(%rsi), %xmm2
> > -	movaps	%xmm2, %xmm4
> > -	movaps	16(%rsi), %xmm5
> > -	movaps	32(%rsi), %xmm3
> > -	movaps	%xmm3, %xmm6
> > -	movaps	48(%rsi), %xmm7
> > -	pminub	%xmm5, %xmm2
> > -	pminub	%xmm7, %xmm3
> > -	pminub	%xmm2, %xmm3
> > -	pcmpeqb	%xmm0, %xmm3
> > -	pmovmskb %xmm3, %rdx
> > -# ifdef USE_AS_STRNCPY
> > -	sub	$64, %r8
> > -	jbe	L(UnalignedLeaveCase2OrCase3)
> > -# endif
> > -	test	%rdx, %rdx
> > -	jnz	L(Unaligned64Leave)
> > -
> > -L(Unaligned64Loop_start):
> > -	add	$64, %rdi
> > -	add	$64, %rsi
> > -	movdqu	%xmm4, -64(%rdi)
> > -	movaps	(%rsi), %xmm2
> > -	movdqa	%xmm2, %xmm4
> > -	movdqu	%xmm5, -48(%rdi)
> > -	movaps	16(%rsi), %xmm5
> > -	pminub	%xmm5, %xmm2
> > -	movaps	32(%rsi), %xmm3
> > -	movdqu	%xmm6, -32(%rdi)
> > -	movaps	%xmm3, %xmm6
> > -	movdqu	%xmm7, -16(%rdi)
> > -	movaps	48(%rsi), %xmm7
> > -	pminub	%xmm7, %xmm3
> > -	pminub	%xmm2, %xmm3
> > -	pcmpeqb	%xmm0, %xmm3
> > -	pmovmskb %xmm3, %rdx
> > -# ifdef USE_AS_STRNCPY
> > -	sub	$64, %r8
> > -	jbe	L(UnalignedLeaveCase2OrCase3)
> > -# endif
> > -	test	%rdx, %rdx
> > -	jz	L(Unaligned64Loop_start)
> > -
> > -L(Unaligned64Leave):
> > -	pxor	%xmm1, %xmm1
> > -
> > -	pcmpeqb	%xmm4, %xmm0
> > -	pcmpeqb	%xmm5, %xmm1
> > -	pmovmskb %xmm0, %rdx
> > -	pmovmskb %xmm1, %rcx
> > -	test	%rdx, %rdx
> > -	jnz	L(CopyFrom1To16BytesUnaligned_0)
> > -	test	%rcx, %rcx
> > -	jnz	L(CopyFrom1To16BytesUnaligned_16)
> > -
> > -	pcmpeqb	%xmm6, %xmm0
> > -	pcmpeqb	%xmm7, %xmm1
> > -	pmovmskb %xmm0, %rdx
> > -	pmovmskb %xmm1, %rcx
> > -	test	%rdx, %rdx
> > -	jnz	L(CopyFrom1To16BytesUnaligned_32)
> > -
> > -	bsf	%rcx, %rdx
> > -	movdqu	%xmm4, (%rdi)
> > -	movdqu	%xmm5, 16(%rdi)
> > -	movdqu	%xmm6, 32(%rdi)
> > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > -#  ifdef USE_AS_STPCPY
> > -	lea	48(%rdi, %rdx), %rax
> > -#  endif
> > -	movdqu	%xmm7, 48(%rdi)
> > -	add	$15, %r8
> > -	sub	%rdx, %r8
> > -	lea	49(%rdi, %rdx), %rdi
> > -	jmp	L(StrncpyFillTailWithZero)
> > -# else
> > -	add	$48, %rsi
> > -	add	$48, %rdi
> > -	BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4)
> > -# endif
> > -
> > -/* If source address alignment == destination address alignment */
> > -
> > -L(SourceStringAlignmentLess32):
> > -	pxor	%xmm0, %xmm0
> > -	movdqu	(%rsi), %xmm1
> > -	movdqu	16(%rsi), %xmm2
> > -	pcmpeqb	%xmm1, %xmm0
> > -	pmovmskb %xmm0, %rdx
> > -
> > -# ifdef USE_AS_STRNCPY
> > -#  if defined USE_AS_STPCPY || defined USE_AS_STRCAT
> > -	cmp	$16, %r8
> > -#  else
> > -	cmp	$17, %r8
> > -#  endif
> > -	jbe	L(CopyFrom1To16BytesTail1Case2OrCase3)
> > -# endif
> > -	test	%rdx, %rdx
> > -	jnz	L(CopyFrom1To16BytesTail1)
> > -
> > -	pcmpeqb	%xmm2, %xmm0
> > -	movdqu	%xmm1, (%rdi)
> > -	pmovmskb %xmm0, %rdx
> > -
> > -# ifdef USE_AS_STRNCPY
> > -#  if defined USE_AS_STPCPY || defined USE_AS_STRCAT
> > -	cmp	$32, %r8
> > -#  else
> > -	cmp	$33, %r8
> > -#  endif
> > -	jbe	L(CopyFrom1To32Bytes1Case2OrCase3)
> > -# endif
> > -	test	%rdx, %rdx
> > -	jnz	L(CopyFrom1To32Bytes1)
> > -
> > -	and	$-16, %rsi
> > -	and	$15, %rcx
> > -	jmp	L(Unalign16Both)
> > -
> > -/*------End of main part with loops---------------------*/
> > -
> > -/* Case1 */
> > -
> > -# if (!defined USE_AS_STRNCPY) || (defined USE_AS_STRCAT)
> > -	.p2align 4
> > -L(CopyFrom1To16Bytes):
> > -	add	%rcx, %rdi
> > -	add	%rcx, %rsi
> > -	bsf	%rdx, %rdx
> > -	BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4)
> > -# endif
> > -	.p2align 4
> > -L(CopyFrom1To16BytesTail):
> > -	add	%rcx, %rsi
> > -	bsf	%rdx, %rdx
> > -	BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4)
> > -
> > -	.p2align 4
> > -L(CopyFrom1To32Bytes1):
> > -	add	$16, %rsi
> > -	add	$16, %rdi
> > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > -	sub	$16, %r8
> > -# endif
> > -L(CopyFrom1To16BytesTail1):
> > -	bsf	%rdx, %rdx
> > -	BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4)
> > -
> > -	.p2align 4
> > -L(CopyFrom1To32Bytes):
> > -	bsf	%rdx, %rdx
> > -	add	%rcx, %rsi
> > -	add	$16, %rdx
> > -	sub	%rcx, %rdx
> > -	BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4)
> > -
> > -	.p2align 4
> > -L(CopyFrom1To16BytesUnaligned_0):
> > -	bsf	%rdx, %rdx
> > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > -#  ifdef USE_AS_STPCPY
> > -	lea	(%rdi, %rdx), %rax
> > -#  endif
> > -	movdqu	%xmm4, (%rdi)
> > -	add	$63, %r8
> > -	sub	%rdx, %r8
> > -	lea	1(%rdi, %rdx), %rdi
> > -	jmp	L(StrncpyFillTailWithZero)
> > -# else
> > -	BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4)
> > -# endif
> > -
> > -	.p2align 4
> > -L(CopyFrom1To16BytesUnaligned_16):
> > -	bsf	%rcx, %rdx
> > -	movdqu	%xmm4, (%rdi)
> > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > -#  ifdef USE_AS_STPCPY
> > -	lea	16(%rdi, %rdx), %rax
> > -#  endif
> > -	movdqu	%xmm5, 16(%rdi)
> > -	add	$47, %r8
> > -	sub	%rdx, %r8
> > -	lea	17(%rdi, %rdx), %rdi
> > -	jmp	L(StrncpyFillTailWithZero)
> > -# else
> > -	add	$16, %rsi
> > -	add	$16, %rdi
> > -	BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4)
> > -# endif
> > -
> > -	.p2align 4
> > -L(CopyFrom1To16BytesUnaligned_32):
> > -	bsf	%rdx, %rdx
> > -	movdqu	%xmm4, (%rdi)
> > -	movdqu	%xmm5, 16(%rdi)
> > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > -#  ifdef USE_AS_STPCPY
> > -	lea	32(%rdi, %rdx), %rax
> > -#  endif
> > -	movdqu	%xmm6, 32(%rdi)
> > -	add	$31, %r8
> > -	sub	%rdx, %r8
> > -	lea	33(%rdi, %rdx), %rdi
> > -	jmp	L(StrncpyFillTailWithZero)
> > -# else
> > -	add	$32, %rsi
> > -	add	$32, %rdi
> > -	BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4)
> > -# endif
> > -
> > -# ifdef USE_AS_STRNCPY
> > -#  ifndef USE_AS_STRCAT
> > -	.p2align 4
> > -L(CopyFrom1To16BytesUnalignedXmm6):
> > -	movdqu	%xmm6, (%rdi, %rcx)
> > -	jmp	L(CopyFrom1To16BytesXmmExit)
> > -
> > -	.p2align 4
> > -L(CopyFrom1To16BytesUnalignedXmm5):
> > -	movdqu	%xmm5, (%rdi, %rcx)
> > -	jmp	L(CopyFrom1To16BytesXmmExit)
> > -
> > -	.p2align 4
> > -L(CopyFrom1To16BytesUnalignedXmm4):
> > -	movdqu	%xmm4, (%rdi, %rcx)
> > -	jmp	L(CopyFrom1To16BytesXmmExit)
> > -
> > -	.p2align 4
> > -L(CopyFrom1To16BytesUnalignedXmm3):
> > -	movdqu	%xmm3, (%rdi, %rcx)
> > -	jmp	L(CopyFrom1To16BytesXmmExit)
> > -
> > -	.p2align 4
> > -L(CopyFrom1To16BytesUnalignedXmm1):
> > -	movdqu	%xmm1, (%rdi, %rcx)
> > -	jmp	L(CopyFrom1To16BytesXmmExit)
> > -#  endif
> > -
> > -	.p2align 4
> > -L(CopyFrom1To16BytesExit):
> > -	BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4)
> > -
> > -/* Case2 */
> > -
> > -	.p2align 4
> > -L(CopyFrom1To16BytesCase2):
> > -	add	$16, %r8
> > -	add	%rcx, %rdi
> > -	add	%rcx, %rsi
> > -	bsf	%rdx, %rdx
> > -	cmp	%r8, %rdx
> > -	jb	L(CopyFrom1To16BytesExit)
> > -	BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4)
> > -
> > -	.p2align 4
> > -L(CopyFrom1To32BytesCase2):
> > -	add	%rcx, %rsi
> > -	bsf	%rdx, %rdx
> > -	add	$16, %rdx
> > -	sub	%rcx, %rdx
> > -	cmp	%r8, %rdx
> > -	jb	L(CopyFrom1To16BytesExit)
> > -	BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4)
> > -
> > -L(CopyFrom1To16BytesTailCase2):
> > -	add	%rcx, %rsi
> > -	bsf	%rdx, %rdx
> > -	cmp	%r8, %rdx
> > -	jb	L(CopyFrom1To16BytesExit)
> > -	BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4)
> > -
> > -L(CopyFrom1To16BytesTail1Case2):
> > -	bsf	%rdx, %rdx
> > -	cmp	%r8, %rdx
> > -	jb	L(CopyFrom1To16BytesExit)
> > -	BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4)
> > -
> > -/* Case2 or Case3, Case3 */
> > -
> > -	.p2align 4
> > -L(CopyFrom1To16BytesCase2OrCase3):
> > -	test	%rdx, %rdx
> > -	jnz	L(CopyFrom1To16BytesCase2)
> > -L(CopyFrom1To16BytesCase3):
> > -	add	$16, %r8
> > -	add	%rcx, %rdi
> > -	add	%rcx, %rsi
> > -	BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4)
> > -
> > -	.p2align 4
> > -L(CopyFrom1To32BytesCase2OrCase3):
> > -	test	%rdx, %rdx
> > -	jnz	L(CopyFrom1To32BytesCase2)
> > -	add	%rcx, %rsi
> > -	BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4)
> > -
> > -	.p2align 4
> > -L(CopyFrom1To16BytesTailCase2OrCase3):
> > -	test	%rdx, %rdx
> > -	jnz	L(CopyFrom1To16BytesTailCase2)
> > -	add	%rcx, %rsi
> > -	BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4)
> > -
> > -	.p2align 4
> > -L(CopyFrom1To32Bytes1Case2OrCase3):
> > -	add	$16, %rdi
> > -	add	$16, %rsi
> > -	sub	$16, %r8
> > -L(CopyFrom1To16BytesTail1Case2OrCase3):
> > -	test	%rdx, %rdx
> > -	jnz	L(CopyFrom1To16BytesTail1Case2)
> > -	BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4)
> > -
> > -# endif
> > -
> > -/*------------End labels regarding with copying 1-16 bytes--and 1-32 bytes----*/
> > -
> > -	.p2align 4
> > -L(Exit1):
> > -	mov	%dh, (%rdi)
> > -# ifdef USE_AS_STPCPY
> > -	lea	(%rdi), %rax
> > -# endif
> > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > -	sub	$1, %r8
> > -	lea	1(%rdi), %rdi
> > -	jnz	L(StrncpyFillTailWithZero)
> > -# endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(Exit2):
> > -	mov	(%rsi), %dx
> > -	mov	%dx, (%rdi)
> > -# ifdef USE_AS_STPCPY
> > -	lea	1(%rdi), %rax
> > -# endif
> > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > -	sub	$2, %r8
> > -	lea	2(%rdi), %rdi
> > -	jnz	L(StrncpyFillTailWithZero)
> > -# endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(Exit3):
> > -	mov	(%rsi), %cx
> > -	mov	%cx, (%rdi)
> > -	mov	%dh, 2(%rdi)
> > -# ifdef USE_AS_STPCPY
> > -	lea	2(%rdi), %rax
> > -# endif
> > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > -	sub	$3, %r8
> > -	lea	3(%rdi), %rdi
> > -	jnz	L(StrncpyFillTailWithZero)
> > -# endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(Exit4):
> > -	mov	(%rsi), %edx
> > -	mov	%edx, (%rdi)
> > -# ifdef USE_AS_STPCPY
> > -	lea	3(%rdi), %rax
> > -# endif
> > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > -	sub	$4, %r8
> > -	lea	4(%rdi), %rdi
> > -	jnz	L(StrncpyFillTailWithZero)
> > -# endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(Exit5):
> > -	mov	(%rsi), %ecx
> > -	mov	%dh, 4(%rdi)
> > -	mov	%ecx, (%rdi)
> > -# ifdef USE_AS_STPCPY
> > -	lea	4(%rdi), %rax
> > -# endif
> > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > -	sub	$5, %r8
> > -	lea	5(%rdi), %rdi
> > -	jnz	L(StrncpyFillTailWithZero)
> > -# endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(Exit6):
> > -	mov	(%rsi), %ecx
> > -	mov	4(%rsi), %dx
> > -	mov	%ecx, (%rdi)
> > -	mov	%dx, 4(%rdi)
> > -# ifdef USE_AS_STPCPY
> > -	lea	5(%rdi), %rax
> > -# endif
> > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > -	sub	$6, %r8
> > -	lea	6(%rdi), %rdi
> > -	jnz	L(StrncpyFillTailWithZero)
> > -# endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(Exit7):
> > -	mov	(%rsi), %ecx
> > -	mov	3(%rsi), %edx
> > -	mov	%ecx, (%rdi)
> > -	mov	%edx, 3(%rdi)
> > -# ifdef USE_AS_STPCPY
> > -	lea	6(%rdi), %rax
> > -# endif
> > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > -	sub	$7, %r8
> > -	lea	7(%rdi), %rdi
> > -	jnz	L(StrncpyFillTailWithZero)
> > -# endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(Exit8):
> > -	mov	(%rsi), %rdx
> > -	mov	%rdx, (%rdi)
> > -# ifdef USE_AS_STPCPY
> > -	lea	7(%rdi), %rax
> > -# endif
> > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > -	sub	$8, %r8
> > -	lea	8(%rdi), %rdi
> > -	jnz	L(StrncpyFillTailWithZero)
> > -# endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(Exit9):
> > -	mov	(%rsi), %rcx
> > -	mov	%dh, 8(%rdi)
> > -	mov	%rcx, (%rdi)
> > -# ifdef USE_AS_STPCPY
> > -	lea	8(%rdi), %rax
> > -# endif
> > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > -	sub	$9, %r8
> > -	lea	9(%rdi), %rdi
> > -	jnz	L(StrncpyFillTailWithZero)
> > -# endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(Exit10):
> > -	mov	(%rsi), %rcx
> > -	mov	8(%rsi), %dx
> > -	mov	%rcx, (%rdi)
> > -	mov	%dx, 8(%rdi)
> > -# ifdef USE_AS_STPCPY
> > -	lea	9(%rdi), %rax
> > -# endif
> > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > -	sub	$10, %r8
> > -	lea	10(%rdi), %rdi
> > -	jnz	L(StrncpyFillTailWithZero)
> > -# endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(Exit11):
> > -	mov	(%rsi), %rcx
> > -	mov	7(%rsi), %edx
> > -	mov	%rcx, (%rdi)
> > -	mov	%edx, 7(%rdi)
> > -# ifdef USE_AS_STPCPY
> > -	lea	10(%rdi), %rax
> > -# endif
> > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > -	sub	$11, %r8
> > -	lea	11(%rdi), %rdi
> > -	jnz	L(StrncpyFillTailWithZero)
> > -# endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(Exit12):
> > -	mov	(%rsi), %rcx
> > -	mov	8(%rsi), %edx
> > -	mov	%rcx, (%rdi)
> > -	mov	%edx, 8(%rdi)
> > -# ifdef USE_AS_STPCPY
> > -	lea	11(%rdi), %rax
> > -# endif
> > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > -	sub	$12, %r8
> > -	lea	12(%rdi), %rdi
> > -	jnz	L(StrncpyFillTailWithZero)
> > -# endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(Exit13):
> > -	mov	(%rsi), %rcx
> > -	mov	5(%rsi), %rdx
> > -	mov	%rcx, (%rdi)
> > -	mov	%rdx, 5(%rdi)
> > -# ifdef USE_AS_STPCPY
> > -	lea	12(%rdi), %rax
> > -# endif
> > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > -	sub	$13, %r8
> > -	lea	13(%rdi), %rdi
> > -	jnz	L(StrncpyFillTailWithZero)
> > -# endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(Exit14):
> > -	mov	(%rsi), %rcx
> > -	mov	6(%rsi), %rdx
> > -	mov	%rcx, (%rdi)
> > -	mov	%rdx, 6(%rdi)
> > -# ifdef USE_AS_STPCPY
> > -	lea	13(%rdi), %rax
> > -# endif
> > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > -	sub	$14, %r8
> > -	lea	14(%rdi), %rdi
> > -	jnz	L(StrncpyFillTailWithZero)
> > -# endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(Exit15):
> > -	mov	(%rsi), %rcx
> > -	mov	7(%rsi), %rdx
> > -	mov	%rcx, (%rdi)
> > -	mov	%rdx, 7(%rdi)
> > -# ifdef USE_AS_STPCPY
> > -	lea	14(%rdi), %rax
> > -# endif
> > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > -	sub	$15, %r8
> > -	lea	15(%rdi), %rdi
> > -	jnz	L(StrncpyFillTailWithZero)
> > -# endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(Exit16):
> > -	movdqu	(%rsi), %xmm0
> > -	movdqu	%xmm0, (%rdi)
> > -# ifdef USE_AS_STPCPY
> > -	lea	15(%rdi), %rax
> > -# endif
> > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > -	sub	$16, %r8
> > -	lea	16(%rdi), %rdi
> > -	jnz	L(StrncpyFillTailWithZero)
> > -# endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(Exit17):
> > -	movdqu	(%rsi), %xmm0
> > -	movdqu	%xmm0, (%rdi)
> > -	mov	%dh, 16(%rdi)
> > -# ifdef USE_AS_STPCPY
> > -	lea	16(%rdi), %rax
> > -# endif
> > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > -	sub	$17, %r8
> > -	lea	17(%rdi), %rdi
> > -	jnz	L(StrncpyFillTailWithZero)
> > -# endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(Exit18):
> > -	movdqu	(%rsi), %xmm0
> > -	mov	16(%rsi), %cx
> > -	movdqu	%xmm0, (%rdi)
> > -	mov	%cx, 16(%rdi)
> > -# ifdef USE_AS_STPCPY
> > -	lea	17(%rdi), %rax
> > -# endif
> > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > -	sub	$18, %r8
> > -	lea	18(%rdi), %rdi
> > -	jnz	L(StrncpyFillTailWithZero)
> > -# endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(Exit19):
> > -	movdqu	(%rsi), %xmm0
> > -	mov	15(%rsi), %ecx
> > -	movdqu	%xmm0, (%rdi)
> > -	mov	%ecx, 15(%rdi)
> > -# ifdef USE_AS_STPCPY
> > -	lea	18(%rdi), %rax
> > -# endif
> > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > -	sub	$19, %r8
> > -	lea	19(%rdi), %rdi
> > -	jnz	L(StrncpyFillTailWithZero)
> > -# endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(Exit20):
> > -	movdqu	(%rsi), %xmm0
> > -	mov	16(%rsi), %ecx
> > -	movdqu	%xmm0, (%rdi)
> > -	mov	%ecx, 16(%rdi)
> > -# ifdef USE_AS_STPCPY
> > -	lea	19(%rdi), %rax
> > -# endif
> > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > -	sub	$20, %r8
> > -	lea	20(%rdi), %rdi
> > -	jnz	L(StrncpyFillTailWithZero)
> > -# endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(Exit21):
> > -	movdqu	(%rsi), %xmm0
> > -	mov	16(%rsi), %ecx
> > -	movdqu	%xmm0, (%rdi)
> > -	mov	%ecx, 16(%rdi)
> > -	mov	%dh, 20(%rdi)
> > -# ifdef USE_AS_STPCPY
> > -	lea	20(%rdi), %rax
> > -# endif
> > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > -	sub	$21, %r8
> > -	lea	21(%rdi), %rdi
> > -	jnz	L(StrncpyFillTailWithZero)
> > -# endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(Exit22):
> > -	movdqu	(%rsi), %xmm0
> > -	mov	14(%rsi), %rcx
> > -	movdqu	%xmm0, (%rdi)
> > -	mov	%rcx, 14(%rdi)
> > -# ifdef USE_AS_STPCPY
> > -	lea	21(%rdi), %rax
> > -# endif
> > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > -	sub	$22, %r8
> > -	lea	22(%rdi), %rdi
> > -	jnz	L(StrncpyFillTailWithZero)
> > -# endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(Exit23):
> > -	movdqu	(%rsi), %xmm0
> > -	mov	15(%rsi), %rcx
> > -	movdqu	%xmm0, (%rdi)
> > -	mov	%rcx, 15(%rdi)
> > -# ifdef USE_AS_STPCPY
> > -	lea	22(%rdi), %rax
> > -# endif
> > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > -	sub	$23, %r8
> > -	lea	23(%rdi), %rdi
> > -	jnz	L(StrncpyFillTailWithZero)
> > -# endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(Exit24):
> > -	movdqu	(%rsi), %xmm0
> > -	mov	16(%rsi), %rcx
> > -	movdqu	%xmm0, (%rdi)
> > -	mov	%rcx, 16(%rdi)
> > -# ifdef USE_AS_STPCPY
> > -	lea	23(%rdi), %rax
> > -# endif
> > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > -	sub	$24, %r8
> > -	lea	24(%rdi), %rdi
> > -	jnz	L(StrncpyFillTailWithZero)
> > -# endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(Exit25):
> > -	movdqu	(%rsi), %xmm0
> > -	mov	16(%rsi), %rcx
> > -	movdqu	%xmm0, (%rdi)
> > -	mov	%rcx, 16(%rdi)
> > -	mov	%dh, 24(%rdi)
> > -# ifdef USE_AS_STPCPY
> > -	lea	24(%rdi), %rax
> > -# endif
> > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > -	sub	$25, %r8
> > -	lea	25(%rdi), %rdi
> > -	jnz	L(StrncpyFillTailWithZero)
> > -# endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(Exit26):
> > -	movdqu	(%rsi), %xmm0
> > -	mov	16(%rsi), %rdx
> > -	mov	24(%rsi), %cx
> > -	movdqu	%xmm0, (%rdi)
> > -	mov	%rdx, 16(%rdi)
> > -	mov	%cx, 24(%rdi)
> > -# ifdef USE_AS_STPCPY
> > -	lea	25(%rdi), %rax
> > -# endif
> > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > -	sub	$26, %r8
> > -	lea	26(%rdi), %rdi
> > -	jnz	L(StrncpyFillTailWithZero)
> > -# endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(Exit27):
> > -	movdqu	(%rsi), %xmm0
> > -	mov	16(%rsi), %rdx
> > -	mov	23(%rsi), %ecx
> > -	movdqu	%xmm0, (%rdi)
> > -	mov	%rdx, 16(%rdi)
> > -	mov	%ecx, 23(%rdi)
> > -# ifdef USE_AS_STPCPY
> > -	lea	26(%rdi), %rax
> > -# endif
> > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > -	sub	$27, %r8
> > -	lea	27(%rdi), %rdi
> > -	jnz	L(StrncpyFillTailWithZero)
> > -# endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(Exit28):
> > -	movdqu	(%rsi), %xmm0
> > -	mov	16(%rsi), %rdx
> > -	mov	24(%rsi), %ecx
> > -	movdqu	%xmm0, (%rdi)
> > -	mov	%rdx, 16(%rdi)
> > -	mov	%ecx, 24(%rdi)
> > -# ifdef USE_AS_STPCPY
> > -	lea	27(%rdi), %rax
> > -# endif
> > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > -	sub	$28, %r8
> > -	lea	28(%rdi), %rdi
> > -	jnz	L(StrncpyFillTailWithZero)
> > -# endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(Exit29):
> > -	movdqu	(%rsi), %xmm0
> > -	movdqu	13(%rsi), %xmm2
> > -	movdqu	%xmm0, (%rdi)
> > -	movdqu	%xmm2, 13(%rdi)
> > -# ifdef USE_AS_STPCPY
> > -	lea	28(%rdi), %rax
> > -# endif
> > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > -	sub	$29, %r8
> > -	lea	29(%rdi), %rdi
> > -	jnz	L(StrncpyFillTailWithZero)
> > -# endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(Exit30):
> > -	movdqu	(%rsi), %xmm0
> > -	movdqu	14(%rsi), %xmm2
> > -	movdqu	%xmm0, (%rdi)
> > -	movdqu	%xmm2, 14(%rdi)
> > -# ifdef USE_AS_STPCPY
> > -	lea	29(%rdi), %rax
> > -# endif
> > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > -	sub	$30, %r8
> > -	lea	30(%rdi), %rdi
> > -	jnz	L(StrncpyFillTailWithZero)
> > -# endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(Exit31):
> > -	movdqu	(%rsi), %xmm0
> > -	movdqu	15(%rsi), %xmm2
> > -	movdqu	%xmm0, (%rdi)
> > -	movdqu	%xmm2, 15(%rdi)
> > -# ifdef USE_AS_STPCPY
> > -	lea	30(%rdi), %rax
> > -# endif
> > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > -	sub	$31, %r8
> > -	lea	31(%rdi), %rdi
> > -	jnz	L(StrncpyFillTailWithZero)
> > -# endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(Exit32):
> > -	movdqu	(%rsi), %xmm0
> > -	movdqu	16(%rsi), %xmm2
> > -	movdqu	%xmm0, (%rdi)
> > -	movdqu	%xmm2, 16(%rdi)
> > -# ifdef USE_AS_STPCPY
> > -	lea	31(%rdi), %rax
> > -# endif
> > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > -	sub	$32, %r8
> > -	lea	32(%rdi), %rdi
> > -	jnz	L(StrncpyFillTailWithZero)
> > -# endif
> > -	ret
> > -
> > -# ifdef USE_AS_STRNCPY
> > -
> > -	.p2align 4
> > -L(StrncpyExit0):
> > -#  ifdef USE_AS_STPCPY
> > -	mov	%rdi, %rax
> > -#  endif
> > -#  ifdef USE_AS_STRCAT
> > -	xor	%ch, %ch
> > -	movb	%ch, (%rdi)
> > -#  endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(StrncpyExit1):
> > -	mov	(%rsi), %dl
> > -	mov	%dl, (%rdi)
> > -#  ifdef USE_AS_STPCPY
> > -	lea	1(%rdi), %rax
> > -#  endif
> > -#  ifdef USE_AS_STRCAT
> > -	xor	%ch, %ch
> > -	movb	%ch, 1(%rdi)
> > -#  endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(StrncpyExit2):
> > -	mov	(%rsi), %dx
> > -	mov	%dx, (%rdi)
> > -#  ifdef USE_AS_STPCPY
> > -	lea	2(%rdi), %rax
> > -#  endif
> > -#  ifdef USE_AS_STRCAT
> > -	xor	%ch, %ch
> > -	movb	%ch, 2(%rdi)
> > -#  endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(StrncpyExit3):
> > -	mov	(%rsi), %cx
> > -	mov	2(%rsi), %dl
> > -	mov	%cx, (%rdi)
> > -	mov	%dl, 2(%rdi)
> > -#  ifdef USE_AS_STPCPY
> > -	lea	3(%rdi), %rax
> > -#  endif
> > -#  ifdef USE_AS_STRCAT
> > -	xor	%ch, %ch
> > -	movb	%ch, 3(%rdi)
> > -#  endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(StrncpyExit4):
> > -	mov	(%rsi), %edx
> > -	mov	%edx, (%rdi)
> > -#  ifdef USE_AS_STPCPY
> > -	lea	4(%rdi), %rax
> > -#  endif
> > -#  ifdef USE_AS_STRCAT
> > -	xor	%ch, %ch
> > -	movb	%ch, 4(%rdi)
> > -#  endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(StrncpyExit5):
> > -	mov	(%rsi), %ecx
> > -	mov	4(%rsi), %dl
> > -	mov	%ecx, (%rdi)
> > -	mov	%dl, 4(%rdi)
> > -#  ifdef USE_AS_STPCPY
> > -	lea	5(%rdi), %rax
> > -#  endif
> > -#  ifdef USE_AS_STRCAT
> > -	xor	%ch, %ch
> > -	movb	%ch, 5(%rdi)
> > -#  endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(StrncpyExit6):
> > -	mov	(%rsi), %ecx
> > -	mov	4(%rsi), %dx
> > -	mov	%ecx, (%rdi)
> > -	mov	%dx, 4(%rdi)
> > -#  ifdef USE_AS_STPCPY
> > -	lea	6(%rdi), %rax
> > -#  endif
> > -#  ifdef USE_AS_STRCAT
> > -	xor	%ch, %ch
> > -	movb	%ch, 6(%rdi)
> > -#  endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(StrncpyExit7):
> > -	mov	(%rsi), %ecx
> > -	mov	3(%rsi), %edx
> > -	mov	%ecx, (%rdi)
> > -	mov	%edx, 3(%rdi)
> > -#  ifdef USE_AS_STPCPY
> > -	lea	7(%rdi), %rax
> > -#  endif
> > -#  ifdef USE_AS_STRCAT
> > -	xor	%ch, %ch
> > -	movb	%ch, 7(%rdi)
> > -#  endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(StrncpyExit8):
> > -	mov	(%rsi), %rdx
> > -	mov	%rdx, (%rdi)
> > -#  ifdef USE_AS_STPCPY
> > -	lea	8(%rdi), %rax
> > -#  endif
> > -#  ifdef USE_AS_STRCAT
> > -	xor	%ch, %ch
> > -	movb	%ch, 8(%rdi)
> > -#  endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(StrncpyExit9):
> > -	mov	(%rsi), %rcx
> > -	mov	8(%rsi), %dl
> > -	mov	%rcx, (%rdi)
> > -	mov	%dl, 8(%rdi)
> > -#  ifdef USE_AS_STPCPY
> > -	lea	9(%rdi), %rax
> > -#  endif
> > -#  ifdef USE_AS_STRCAT
> > -	xor	%ch, %ch
> > -	movb	%ch, 9(%rdi)
> > -#  endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(StrncpyExit10):
> > -	mov	(%rsi), %rcx
> > -	mov	8(%rsi), %dx
> > -	mov	%rcx, (%rdi)
> > -	mov	%dx, 8(%rdi)
> > -#  ifdef USE_AS_STPCPY
> > -	lea	10(%rdi), %rax
> > -#  endif
> > -#  ifdef USE_AS_STRCAT
> > -	xor	%ch, %ch
> > -	movb	%ch, 10(%rdi)
> > -#  endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(StrncpyExit11):
> > -	mov	(%rsi), %rcx
> > -	mov	7(%rsi), %edx
> > -	mov	%rcx, (%rdi)
> > -	mov	%edx, 7(%rdi)
> > -#  ifdef USE_AS_STPCPY
> > -	lea	11(%rdi), %rax
> > -#  endif
> > -#  ifdef USE_AS_STRCAT
> > -	xor	%ch, %ch
> > -	movb	%ch, 11(%rdi)
> > -#  endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(StrncpyExit12):
> > -	mov	(%rsi), %rcx
> > -	mov	8(%rsi), %edx
> > -	mov	%rcx, (%rdi)
> > -	mov	%edx, 8(%rdi)
> > -#  ifdef USE_AS_STPCPY
> > -	lea	12(%rdi), %rax
> > -#  endif
> > -#  ifdef USE_AS_STRCAT
> > -	xor	%ch, %ch
> > -	movb	%ch, 12(%rdi)
> > -#  endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(StrncpyExit13):
> > -	mov	(%rsi), %rcx
> > -	mov	5(%rsi), %rdx
> > -	mov	%rcx, (%rdi)
> > -	mov	%rdx, 5(%rdi)
> > -#  ifdef USE_AS_STPCPY
> > -	lea	13(%rdi), %rax
> > -#  endif
> > -#  ifdef USE_AS_STRCAT
> > -	xor	%ch, %ch
> > -	movb	%ch, 13(%rdi)
> > -#  endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(StrncpyExit14):
> > -	mov	(%rsi), %rcx
> > -	mov	6(%rsi), %rdx
> > -	mov	%rcx, (%rdi)
> > -	mov	%rdx, 6(%rdi)
> > -#  ifdef USE_AS_STPCPY
> > -	lea	14(%rdi), %rax
> > -#  endif
> > -#  ifdef USE_AS_STRCAT
> > -	xor	%ch, %ch
> > -	movb	%ch, 14(%rdi)
> > -#  endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(StrncpyExit15):
> > -	mov	(%rsi), %rcx
> > -	mov	7(%rsi), %rdx
> > -	mov	%rcx, (%rdi)
> > -	mov	%rdx, 7(%rdi)
> > -#  ifdef USE_AS_STPCPY
> > -	lea	15(%rdi), %rax
> > -#  endif
> > -#  ifdef USE_AS_STRCAT
> > -	xor	%ch, %ch
> > -	movb	%ch, 15(%rdi)
> > -#  endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(StrncpyExit16):
> > -	movdqu	(%rsi), %xmm0
> > -	movdqu	%xmm0, (%rdi)
> > -#  ifdef USE_AS_STPCPY
> > -	lea	16(%rdi), %rax
> > -#  endif
> > -#  ifdef USE_AS_STRCAT
> > -	xor	%ch, %ch
> > -	movb	%ch, 16(%rdi)
> > -#  endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(StrncpyExit17):
> > -	movdqu	(%rsi), %xmm0
> > -	mov	16(%rsi), %cl
> > -	movdqu	%xmm0, (%rdi)
> > -	mov	%cl, 16(%rdi)
> > -#  ifdef USE_AS_STPCPY
> > -	lea	17(%rdi), %rax
> > -#  endif
> > -#  ifdef USE_AS_STRCAT
> > -	xor	%ch, %ch
> > -	movb	%ch, 17(%rdi)
> > -#  endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(StrncpyExit18):
> > -	movdqu	(%rsi), %xmm0
> > -	mov	16(%rsi), %cx
> > -	movdqu	%xmm0, (%rdi)
> > -	mov	%cx, 16(%rdi)
> > -#  ifdef USE_AS_STPCPY
> > -	lea	18(%rdi), %rax
> > -#  endif
> > -#  ifdef USE_AS_STRCAT
> > -	xor	%ch, %ch
> > -	movb	%ch, 18(%rdi)
> > -#  endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(StrncpyExit19):
> > -	movdqu	(%rsi), %xmm0
> > -	mov	15(%rsi), %ecx
> > -	movdqu	%xmm0, (%rdi)
> > -	mov	%ecx, 15(%rdi)
> > -#  ifdef USE_AS_STPCPY
> > -	lea	19(%rdi), %rax
> > -#  endif
> > -#  ifdef USE_AS_STRCAT
> > -	xor	%ch, %ch
> > -	movb	%ch, 19(%rdi)
> > -#  endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(StrncpyExit20):
> > -	movdqu	(%rsi), %xmm0
> > -	mov	16(%rsi), %ecx
> > -	movdqu	%xmm0, (%rdi)
> > -	mov	%ecx, 16(%rdi)
> > -#  ifdef USE_AS_STPCPY
> > -	lea	20(%rdi), %rax
> > -#  endif
> > -#  ifdef USE_AS_STRCAT
> > -	xor	%ch, %ch
> > -	movb	%ch, 20(%rdi)
> > -#  endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(StrncpyExit21):
> > -	movdqu	(%rsi), %xmm0
> > -	mov	16(%rsi), %ecx
> > -	mov	20(%rsi), %dl
> > -	movdqu	%xmm0, (%rdi)
> > -	mov	%ecx, 16(%rdi)
> > -	mov	%dl, 20(%rdi)
> > -#  ifdef USE_AS_STPCPY
> > -	lea	21(%rdi), %rax
> > -#  endif
> > -#  ifdef USE_AS_STRCAT
> > -	xor	%ch, %ch
> > -	movb	%ch, 21(%rdi)
> > -#  endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(StrncpyExit22):
> > -	movdqu	(%rsi), %xmm0
> > -	mov	14(%rsi), %rcx
> > -	movdqu	%xmm0, (%rdi)
> > -	mov	%rcx, 14(%rdi)
> > -#  ifdef USE_AS_STPCPY
> > -	lea	22(%rdi), %rax
> > -#  endif
> > -#  ifdef USE_AS_STRCAT
> > -	xor	%ch, %ch
> > -	movb	%ch, 22(%rdi)
> > -#  endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(StrncpyExit23):
> > -	movdqu	(%rsi), %xmm0
> > -	mov	15(%rsi), %rcx
> > -	movdqu	%xmm0, (%rdi)
> > -	mov	%rcx, 15(%rdi)
> > -#  ifdef USE_AS_STPCPY
> > -	lea	23(%rdi), %rax
> > -#  endif
> > -#  ifdef USE_AS_STRCAT
> > -	xor	%ch, %ch
> > -	movb	%ch, 23(%rdi)
> > -#  endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(StrncpyExit24):
> > -	movdqu	(%rsi), %xmm0
> > -	mov	16(%rsi), %rcx
> > -	movdqu	%xmm0, (%rdi)
> > -	mov	%rcx, 16(%rdi)
> > -#  ifdef USE_AS_STPCPY
> > -	lea	24(%rdi), %rax
> > -#  endif
> > -#  ifdef USE_AS_STRCAT
> > -	xor	%ch, %ch
> > -	movb	%ch, 24(%rdi)
> > -#  endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(StrncpyExit25):
> > -	movdqu	(%rsi), %xmm0
> > -	mov	16(%rsi), %rdx
> > -	mov	24(%rsi), %cl
> > -	movdqu	%xmm0, (%rdi)
> > -	mov	%rdx, 16(%rdi)
> > -	mov	%cl, 24(%rdi)
> > -#  ifdef USE_AS_STPCPY
> > -	lea	25(%rdi), %rax
> > -#  endif
> > -#  ifdef USE_AS_STRCAT
> > -	xor	%ch, %ch
> > -	movb	%ch, 25(%rdi)
> > -#  endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(StrncpyExit26):
> > -	movdqu	(%rsi), %xmm0
> > -	mov	16(%rsi), %rdx
> > -	mov	24(%rsi), %cx
> > -	movdqu	%xmm0, (%rdi)
> > -	mov	%rdx, 16(%rdi)
> > -	mov	%cx, 24(%rdi)
> > -#  ifdef USE_AS_STPCPY
> > -	lea	26(%rdi), %rax
> > -#  endif
> > -#  ifdef USE_AS_STRCAT
> > -	xor	%ch, %ch
> > -	movb	%ch, 26(%rdi)
> > -#  endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(StrncpyExit27):
> > -	movdqu	(%rsi), %xmm0
> > -	mov	16(%rsi), %rdx
> > -	mov	23(%rsi), %ecx
> > -	movdqu	%xmm0, (%rdi)
> > -	mov	%rdx, 16(%rdi)
> > -	mov	%ecx, 23(%rdi)
> > -#  ifdef USE_AS_STPCPY
> > -	lea	27(%rdi), %rax
> > -#  endif
> > -#  ifdef USE_AS_STRCAT
> > -	xor	%ch, %ch
> > -	movb	%ch, 27(%rdi)
> > -#  endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(StrncpyExit28):
> > -	movdqu	(%rsi), %xmm0
> > -	mov	16(%rsi), %rdx
> > -	mov	24(%rsi), %ecx
> > -	movdqu	%xmm0, (%rdi)
> > -	mov	%rdx, 16(%rdi)
> > -	mov	%ecx, 24(%rdi)
> > -#  ifdef USE_AS_STPCPY
> > -	lea	28(%rdi), %rax
> > -#  endif
> > -#  ifdef USE_AS_STRCAT
> > -	xor	%ch, %ch
> > -	movb	%ch, 28(%rdi)
> > -#  endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(StrncpyExit29):
> > -	movdqu	(%rsi), %xmm0
> > -	movdqu	13(%rsi), %xmm2
> > -	movdqu	%xmm0, (%rdi)
> > -	movdqu	%xmm2, 13(%rdi)
> > -#  ifdef USE_AS_STPCPY
> > -	lea	29(%rdi), %rax
> > -#  endif
> > -#  ifdef USE_AS_STRCAT
> > -	xor	%ch, %ch
> > -	movb	%ch, 29(%rdi)
> > -#  endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(StrncpyExit30):
> > -	movdqu	(%rsi), %xmm0
> > -	movdqu	14(%rsi), %xmm2
> > -	movdqu	%xmm0, (%rdi)
> > -	movdqu	%xmm2, 14(%rdi)
> > -#  ifdef USE_AS_STPCPY
> > -	lea	30(%rdi), %rax
> > -#  endif
> > -#  ifdef USE_AS_STRCAT
> > -	xor	%ch, %ch
> > -	movb	%ch, 30(%rdi)
> > -#  endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(StrncpyExit31):
> > -	movdqu	(%rsi), %xmm0
> > -	movdqu	15(%rsi), %xmm2
> > -	movdqu	%xmm0, (%rdi)
> > -	movdqu	%xmm2, 15(%rdi)
> > -#  ifdef USE_AS_STPCPY
> > -	lea	31(%rdi), %rax
> > -#  endif
> > -#  ifdef USE_AS_STRCAT
> > -	xor	%ch, %ch
> > -	movb	%ch, 31(%rdi)
> > -#  endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(StrncpyExit32):
> > -	movdqu	(%rsi), %xmm0
> > -	movdqu	16(%rsi), %xmm2
> > -	movdqu	%xmm0, (%rdi)
> > -	movdqu	%xmm2, 16(%rdi)
> > -#  ifdef USE_AS_STPCPY
> > -	lea	32(%rdi), %rax
> > -#  endif
> > -#  ifdef USE_AS_STRCAT
> > -	xor	%ch, %ch
> > -	movb	%ch, 32(%rdi)
> > -#  endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(StrncpyExit33):
> > -	movdqu	(%rsi), %xmm0
> > -	movdqu	16(%rsi), %xmm2
> > -	mov	32(%rsi), %cl
> > -	movdqu	%xmm0, (%rdi)
> > -	movdqu	%xmm2, 16(%rdi)
> > -	mov	%cl, 32(%rdi)
> > -#  ifdef USE_AS_STRCAT
> > -	xor	%ch, %ch
> > -	movb	%ch, 33(%rdi)
> > -#  endif
> > -	ret
> > -
> > -#  ifndef USE_AS_STRCAT
> > -
> > -	.p2align 4
> > -L(Fill0):
> > -	ret
> > -
> > -	.p2align 4
> > -L(Fill1):
> > -	mov	%dl, (%rdi)
> > -	ret
> > -
> > -	.p2align 4
> > -L(Fill2):
> > -	mov	%dx, (%rdi)
> > -	ret
> > -
> > -	.p2align 4
> > -L(Fill3):
> > -	mov	%edx, -1(%rdi)
> > -	ret
> > -
> > -	.p2align 4
> > -L(Fill4):
> > -	mov	%edx, (%rdi)
> > -	ret
> > -
> > -	.p2align 4
> > -L(Fill5):
> > -	mov	%edx, (%rdi)
> > -	mov	%dl, 4(%rdi)
> > -	ret
> > -
> > -	.p2align 4
> > -L(Fill6):
> > -	mov	%edx, (%rdi)
> > -	mov	%dx, 4(%rdi)
> > -	ret
> > -
> > -	.p2align 4
> > -L(Fill7):
> > -	mov	%rdx, -1(%rdi)
> > -	ret
> > -
> > -	.p2align 4
> > -L(Fill8):
> > -	mov	%rdx, (%rdi)
> > -	ret
> > -
> > -	.p2align 4
> > -L(Fill9):
> > -	mov	%rdx, (%rdi)
> > -	mov	%dl, 8(%rdi)
> > -	ret
> > -
> > -	.p2align 4
> > -L(Fill10):
> > -	mov	%rdx, (%rdi)
> > -	mov	%dx, 8(%rdi)
> > -	ret
> > -
> > -	.p2align 4
> > -L(Fill11):
> > -	mov	%rdx, (%rdi)
> > -	mov	%edx, 7(%rdi)
> > -	ret
> > -
> > -	.p2align 4
> > -L(Fill12):
> > -	mov	%rdx, (%rdi)
> > -	mov	%edx, 8(%rdi)
> > -	ret
> > -
> > -	.p2align 4
> > -L(Fill13):
> > -	mov	%rdx, (%rdi)
> > -	mov	%rdx, 5(%rdi)
> > -	ret
> > -
> > -	.p2align 4
> > -L(Fill14):
> > -	mov	%rdx, (%rdi)
> > -	mov	%rdx, 6(%rdi)
> > -	ret
> > -
> > -	.p2align 4
> > -L(Fill15):
> > -	movdqu	%xmm0, -1(%rdi)
> > -	ret
> > -
> > -	.p2align 4
> > -L(Fill16):
> > -	movdqu	%xmm0, (%rdi)
> > -	ret
> > -
> > -	.p2align 4
> > -L(CopyFrom1To16BytesUnalignedXmm2):
> > -	movdqu	%xmm2, (%rdi, %rcx)
> > -
> > -	.p2align 4
> > -L(CopyFrom1To16BytesXmmExit):
> > -	bsf	%rdx, %rdx
> > -	add	$15, %r8
> > -	add	%rcx, %rdi
> > -#  ifdef USE_AS_STPCPY
> > -	lea	(%rdi, %rdx), %rax
> > -#  endif
> > -	sub	%rdx, %r8
> > -	lea	1(%rdi, %rdx), %rdi
> > -
> > -	.p2align 4
> > -L(StrncpyFillTailWithZero):
> > -	pxor	%xmm0, %xmm0
> > -	xor	%rdx, %rdx
> > -	sub	$16, %r8
> > -	jbe	L(StrncpyFillExit)
> > -
> > -	movdqu	%xmm0, (%rdi)
> > -	add	$16, %rdi
> > -
> > -	mov	%rdi, %rsi
> > -	and	$0xf, %rsi
> > -	sub	%rsi, %rdi
> > -	add	%rsi, %r8
> > -	sub	$64, %r8
> > -	jb	L(StrncpyFillLess64)
> > -
> > -L(StrncpyFillLoopMovdqa):
> > -	movdqa	%xmm0, (%rdi)
> > -	movdqa	%xmm0, 16(%rdi)
> > -	movdqa	%xmm0, 32(%rdi)
> > -	movdqa	%xmm0, 48(%rdi)
> > -	add	$64, %rdi
> > -	sub	$64, %r8
> > -	jae	L(StrncpyFillLoopMovdqa)
> > -
> > -L(StrncpyFillLess64):
> > -	add	$32, %r8
> > -	jl	L(StrncpyFillLess32)
> > -	movdqa	%xmm0, (%rdi)
> > -	movdqa	%xmm0, 16(%rdi)
> > -	add	$32, %rdi
> > -	sub	$16, %r8
> > -	jl	L(StrncpyFillExit)
> > -	movdqa	%xmm0, (%rdi)
> > -	add	$16, %rdi
> > -	BRANCH_TO_JMPTBL_ENTRY (L(FillTable), %r8, 4)
> > -
> > -L(StrncpyFillLess32):
> > -	add	$16, %r8
> > -	jl	L(StrncpyFillExit)
> > -	movdqa	%xmm0, (%rdi)
> > -	add	$16, %rdi
> > -	BRANCH_TO_JMPTBL_ENTRY (L(FillTable), %r8, 4)
> > -
> > -L(StrncpyFillExit):
> > -	add	$16, %r8
> > -	BRANCH_TO_JMPTBL_ENTRY (L(FillTable), %r8, 4)
> > -
> > -/* end of ifndef USE_AS_STRCAT */
> > -#  endif
> > -
> > -	.p2align 4
> > -L(UnalignedLeaveCase2OrCase3):
> > -	test	%rdx, %rdx
> > -	jnz	L(Unaligned64LeaveCase2)
> > -L(Unaligned64LeaveCase3):
> > -	lea	64(%r8), %rcx
> > -	and	$-16, %rcx
> > -	add	$48, %r8
> > -	jl	L(CopyFrom1To16BytesCase3)
> > -	movdqu	%xmm4, (%rdi)
> > -	sub	$16, %r8
> > -	jb	L(CopyFrom1To16BytesCase3)
> > -	movdqu	%xmm5, 16(%rdi)
> > -	sub	$16, %r8
> > -	jb	L(CopyFrom1To16BytesCase3)
> > -	movdqu	%xmm6, 32(%rdi)
> > -	sub	$16, %r8
> > -	jb	L(CopyFrom1To16BytesCase3)
> > -	movdqu	%xmm7, 48(%rdi)
> > -#  ifdef USE_AS_STPCPY
> > -	lea	64(%rdi), %rax
> > -#  endif
> > -#  ifdef USE_AS_STRCAT
> > -	xor	%ch, %ch
> > -	movb	%ch, 64(%rdi)
> > -#  endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(Unaligned64LeaveCase2):
> > -	xor	%rcx, %rcx
> > -	pcmpeqb	%xmm4, %xmm0
> > -	pmovmskb %xmm0, %rdx
> > -	add	$48, %r8
> > -	jle	L(CopyFrom1To16BytesCase2OrCase3)
> > -	test	%rdx, %rdx
> > -#  ifndef USE_AS_STRCAT
> > -	jnz	L(CopyFrom1To16BytesUnalignedXmm4)
> > -#  else
> > -	jnz	L(CopyFrom1To16Bytes)
> > -#  endif
> > -	pcmpeqb	%xmm5, %xmm0
> > -	pmovmskb %xmm0, %rdx
> > -	movdqu	%xmm4, (%rdi)
> > -	add	$16, %rcx
> > -	sub	$16, %r8
> > -	jbe	L(CopyFrom1To16BytesCase2OrCase3)
> > -	test	%rdx, %rdx
> > -#  ifndef USE_AS_STRCAT
> > -	jnz	L(CopyFrom1To16BytesUnalignedXmm5)
> > -#  else
> > -	jnz	L(CopyFrom1To16Bytes)
> > -#  endif
> > -
> > -	pcmpeqb	%xmm6, %xmm0
> > -	pmovmskb %xmm0, %rdx
> > -	movdqu	%xmm5, 16(%rdi)
> > -	add	$16, %rcx
> > -	sub	$16, %r8
> > -	jbe	L(CopyFrom1To16BytesCase2OrCase3)
> > -	test	%rdx, %rdx
> > -#  ifndef USE_AS_STRCAT
> > -	jnz	L(CopyFrom1To16BytesUnalignedXmm6)
> > -#  else
> > -	jnz	L(CopyFrom1To16Bytes)
> > -#  endif
> > -
> > -	pcmpeqb	%xmm7, %xmm0
> > -	pmovmskb %xmm0, %rdx
> > -	movdqu	%xmm6, 32(%rdi)
> > -	lea	16(%rdi, %rcx), %rdi
> > -	lea	16(%rsi, %rcx), %rsi
> > -	bsf	%rdx, %rdx
> > -	cmp	%r8, %rdx
> > -	jb	L(CopyFrom1To16BytesExit)
> > -	BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4)
> > -
> > -	.p2align 4
> > -L(ExitZero):
> > -#  ifndef USE_AS_STRCAT
> > -	mov	%rdi, %rax
> > -#  endif
> > -	ret
> > -
> > -# endif
> > -
> > -# ifndef USE_AS_STRCAT
> > -END (STRCPY)
> > -# else
> > -END (STRCAT)
> > -# endif
> > -	.p2align 4
> > -	.section .rodata
> > -L(ExitTable):
> > -	.int	JMPTBL(L(Exit1), L(ExitTable))
> > -	.int	JMPTBL(L(Exit2), L(ExitTable))
> > -	.int	JMPTBL(L(Exit3), L(ExitTable))
> > -	.int	JMPTBL(L(Exit4), L(ExitTable))
> > -	.int	JMPTBL(L(Exit5), L(ExitTable))
> > -	.int	JMPTBL(L(Exit6), L(ExitTable))
> > -	.int	JMPTBL(L(Exit7), L(ExitTable))
> > -	.int	JMPTBL(L(Exit8), L(ExitTable))
> > -	.int	JMPTBL(L(Exit9), L(ExitTable))
> > -	.int	JMPTBL(L(Exit10), L(ExitTable))
> > -	.int	JMPTBL(L(Exit11), L(ExitTable))
> > -	.int	JMPTBL(L(Exit12), L(ExitTable))
> > -	.int	JMPTBL(L(Exit13), L(ExitTable))
> > -	.int	JMPTBL(L(Exit14), L(ExitTable))
> > -	.int	JMPTBL(L(Exit15), L(ExitTable))
> > -	.int	JMPTBL(L(Exit16), L(ExitTable))
> > -	.int	JMPTBL(L(Exit17), L(ExitTable))
> > -	.int	JMPTBL(L(Exit18), L(ExitTable))
> > -	.int	JMPTBL(L(Exit19), L(ExitTable))
> > -	.int	JMPTBL(L(Exit20), L(ExitTable))
> > -	.int	JMPTBL(L(Exit21), L(ExitTable))
> > -	.int	JMPTBL(L(Exit22), L(ExitTable))
> > -	.int	JMPTBL(L(Exit23), L(ExitTable))
> > -	.int	JMPTBL(L(Exit24), L(ExitTable))
> > -	.int	JMPTBL(L(Exit25), L(ExitTable))
> > -	.int	JMPTBL(L(Exit26), L(ExitTable))
> > -	.int	JMPTBL(L(Exit27), L(ExitTable))
> > -	.int	JMPTBL(L(Exit28), L(ExitTable))
> > -	.int	JMPTBL(L(Exit29), L(ExitTable))
> > -	.int	JMPTBL(L(Exit30), L(ExitTable))
> > -	.int	JMPTBL(L(Exit31), L(ExitTable))
> > -	.int	JMPTBL(L(Exit32), L(ExitTable))
> > -# ifdef USE_AS_STRNCPY
> > -L(ExitStrncpyTable):
> > -	.int	JMPTBL(L(StrncpyExit0), L(ExitStrncpyTable))
> > -	.int	JMPTBL(L(StrncpyExit1), L(ExitStrncpyTable))
> > -	.int	JMPTBL(L(StrncpyExit2), L(ExitStrncpyTable))
> > -	.int	JMPTBL(L(StrncpyExit3), L(ExitStrncpyTable))
> > -	.int	JMPTBL(L(StrncpyExit4), L(ExitStrncpyTable))
> > -	.int	JMPTBL(L(StrncpyExit5), L(ExitStrncpyTable))
> > -	.int	JMPTBL(L(StrncpyExit6), L(ExitStrncpyTable))
> > -	.int	JMPTBL(L(StrncpyExit7), L(ExitStrncpyTable))
> > -	.int	JMPTBL(L(StrncpyExit8), L(ExitStrncpyTable))
> > -	.int	JMPTBL(L(StrncpyExit9), L(ExitStrncpyTable))
> > -	.int	JMPTBL(L(StrncpyExit10), L(ExitStrncpyTable))
> > -	.int	JMPTBL(L(StrncpyExit11), L(ExitStrncpyTable))
> > -	.int	JMPTBL(L(StrncpyExit12), L(ExitStrncpyTable))
> > -	.int	JMPTBL(L(StrncpyExit13), L(ExitStrncpyTable))
> > -	.int	JMPTBL(L(StrncpyExit14), L(ExitStrncpyTable))
> > -	.int	JMPTBL(L(StrncpyExit15), L(ExitStrncpyTable))
> > -	.int	JMPTBL(L(StrncpyExit16), L(ExitStrncpyTable))
> > -	.int	JMPTBL(L(StrncpyExit17), L(ExitStrncpyTable))
> > -	.int	JMPTBL(L(StrncpyExit18), L(ExitStrncpyTable))
> > -	.int	JMPTBL(L(StrncpyExit19), L(ExitStrncpyTable))
> > -	.int	JMPTBL(L(StrncpyExit20), L(ExitStrncpyTable))
> > -	.int	JMPTBL(L(StrncpyExit21), L(ExitStrncpyTable))
> > -	.int	JMPTBL(L(StrncpyExit22), L(ExitStrncpyTable))
> > -	.int	JMPTBL(L(StrncpyExit23), L(ExitStrncpyTable))
> > -	.int	JMPTBL(L(StrncpyExit24), L(ExitStrncpyTable))
> > -	.int	JMPTBL(L(StrncpyExit25), L(ExitStrncpyTable))
> > -	.int	JMPTBL(L(StrncpyExit26), L(ExitStrncpyTable))
> > -	.int	JMPTBL(L(StrncpyExit27), L(ExitStrncpyTable))
> > -	.int	JMPTBL(L(StrncpyExit28), L(ExitStrncpyTable))
> > -	.int	JMPTBL(L(StrncpyExit29), L(ExitStrncpyTable))
> > -	.int	JMPTBL(L(StrncpyExit30), L(ExitStrncpyTable))
> > -	.int	JMPTBL(L(StrncpyExit31), L(ExitStrncpyTable))
> > -	.int	JMPTBL(L(StrncpyExit32), L(ExitStrncpyTable))
> > -	.int	JMPTBL(L(StrncpyExit33), L(ExitStrncpyTable))
> > -# ifndef
USE_AS_STRCAT > > - .p2align 4 > > -L(FillTable): > > - .int JMPTBL(L(Fill0), L(FillTable)) > > - .int JMPTBL(L(Fill1), L(FillTable)) > > - .int JMPTBL(L(Fill2), L(FillTable)) > > - .int JMPTBL(L(Fill3), L(FillTable)) > > - .int JMPTBL(L(Fill4), L(FillTable)) > > - .int JMPTBL(L(Fill5), L(FillTable)) > > - .int JMPTBL(L(Fill6), L(FillTable)) > > - .int JMPTBL(L(Fill7), L(FillTable)) > > - .int JMPTBL(L(Fill8), L(FillTable)) > > - .int JMPTBL(L(Fill9), L(FillTable)) > > - .int JMPTBL(L(Fill10), L(FillTable)) > > - .int JMPTBL(L(Fill11), L(FillTable)) > > - .int JMPTBL(L(Fill12), L(FillTable)) > > - .int JMPTBL(L(Fill13), L(FillTable)) > > - .int JMPTBL(L(Fill14), L(FillTable)) > > - .int JMPTBL(L(Fill15), L(FillTable)) > > - .int JMPTBL(L(Fill16), L(FillTable)) > > -# endif > > -# endif > > -#endif > > +#define AS_STRCPY > > +#define STPCPY __strcpy_sse2_unaligned > > +#include "stpcpy-sse2-unaligned.S" > > diff --git a/sysdeps/x86_64/multiarch/strcpy.S b/sysdeps/x86_64/multiarch/strcpy.S > > index 9464ee8..92be04c 100644 > > --- a/sysdeps/x86_64/multiarch/strcpy.S > > +++ b/sysdeps/x86_64/multiarch/strcpy.S > > @@ -28,31 +28,18 @@ > > #endif > > > > #ifdef USE_AS_STPCPY > > -# ifdef USE_AS_STRNCPY > > -# define STRCPY_SSSE3 __stpncpy_ssse3 > > -# define STRCPY_SSE2 __stpncpy_sse2 > > -# define STRCPY_SSE2_UNALIGNED __stpncpy_sse2_unaligned > > -# define __GI_STRCPY __GI_stpncpy > > -# define __GI___STRCPY __GI___stpncpy > > -# else > > # define STRCPY_SSSE3 __stpcpy_ssse3 > > # define STRCPY_SSE2 __stpcpy_sse2 > > +# define STRCPY_AVX2 __stpcpy_avx2 > > # define STRCPY_SSE2_UNALIGNED __stpcpy_sse2_unaligned > > # define __GI_STRCPY __GI_stpcpy > > # define __GI___STRCPY __GI___stpcpy > > -# endif > > #else > > -# ifdef USE_AS_STRNCPY > > -# define STRCPY_SSSE3 __strncpy_ssse3 > > -# define STRCPY_SSE2 __strncpy_sse2 > > -# define STRCPY_SSE2_UNALIGNED __strncpy_sse2_unaligned > > -# define __GI_STRCPY __GI_strncpy > > -# else > > # define STRCPY_SSSE3 
__strcpy_ssse3 > > +# define STRCPY_AVX2 __strcpy_avx2 > > # define STRCPY_SSE2 __strcpy_sse2 > > # define STRCPY_SSE2_UNALIGNED __strcpy_sse2_unaligned > > # define __GI_STRCPY __GI_strcpy > > -# endif > > #endif > > > > > > @@ -64,7 +51,10 @@ ENTRY(STRCPY) > > cmpl $0, __cpu_features+KIND_OFFSET(%rip) > > jne 1f > > call __init_cpu_features > > -1: leaq STRCPY_SSE2_UNALIGNED(%rip), %rax > > +1: leaq STRCPY_AVX2(%rip), %rax > > + testl $bit_AVX_Fast_Unaligned_Load, __cpu_features+FEATURE_OFFSET+index_AVX_Fast_Unaligned_Load(%rip) > > + jnz 2f > > + leaq STRCPY_SSE2_UNALIGNED(%rip), %rax > > testl $bit_Fast_Unaligned_Load, __cpu_features+FEATURE_OFFSET+index_Fast_Unaligned_Load(%rip) > > jnz 2f > > leaq STRCPY_SSE2(%rip), %rax > > diff --git a/sysdeps/x86_64/multiarch/strncpy-sse2-unaligned.S b/sysdeps/x86_64/multiarch/strncpy-sse2-unaligned.S > > index fcc23a7..e4c98e7 100644 > > --- a/sysdeps/x86_64/multiarch/strncpy-sse2-unaligned.S > > +++ b/sysdeps/x86_64/multiarch/strncpy-sse2-unaligned.S > > @@ -1,3 +1,1888 @@ > > -#define USE_AS_STRNCPY > > -#define STRCPY __strncpy_sse2_unaligned > > -#include "strcpy-sse2-unaligned.S" > > +/* strcpy with SSE2 and unaligned load > > + Copyright (C) 2011-2015 Free Software Foundation, Inc. > > + Contributed by Intel Corporation. > > + This file is part of the GNU C Library. > > + > > + The GNU C Library is free software; you can redistribute it and/or > > + modify it under the terms of the GNU Lesser General Public > > + License as published by the Free Software Foundation; either > > + version 2.1 of the License, or (at your option) any later version. > > + > > + The GNU C Library is distributed in the hope that it will be useful, > > + but WITHOUT ANY WARRANTY; without even the implied warranty of > > + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU > > + Lesser General Public License for more details. 
> > + > > + You should have received a copy of the GNU Lesser General Public > > + License along with the GNU C Library; if not, see > > + <http://www.gnu.org/licenses/>. */ > > + > > +#if IS_IN (libc) > > + > > +# ifndef USE_AS_STRCAT > > +# include <sysdep.h> > > + > > +# ifndef STRCPY > > +# define STRCPY __strncpy_sse2_unaligned > > +# endif > > + > > +# define USE_AS_STRNCPY > > +# endif > > + > > +# define JMPTBL(I, B) I - B > > +# define BRANCH_TO_JMPTBL_ENTRY(TABLE, INDEX, SCALE) \ > > + lea TABLE(%rip), %r11; \ > > + movslq (%r11, INDEX, SCALE), %rcx; \ > > + lea (%r11, %rcx), %rcx; \ > > + jmp *%rcx > > + > > +# ifndef USE_AS_STRCAT > > + > > +.text > > +ENTRY (STRCPY) > > +# ifdef USE_AS_STRNCPY > > + mov %rdx, %r8 > > + test %r8, %r8 > > + jz L(ExitZero) > > +# endif > > + mov %rsi, %rcx > > +# ifndef USE_AS_STPCPY > > + mov %rdi, %rax /* save result */ > > +# endif > > + > > +# endif > > + > > + and $63, %rcx > > + cmp $32, %rcx > > + jbe L(SourceStringAlignmentLess32) > > + > > + and $-16, %rsi > > + and $15, %rcx > > + pxor %xmm0, %xmm0 > > + pxor %xmm1, %xmm1 > > + > > + pcmpeqb (%rsi), %xmm1 > > + pmovmskb %xmm1, %rdx > > + shr %cl, %rdx > > + > > +# ifdef USE_AS_STRNCPY > > +# if defined USE_AS_STPCPY || defined USE_AS_STRCAT > > + mov $16, %r10 > > + sub %rcx, %r10 > > + cmp %r10, %r8 > > +# else > > + mov $17, %r10 > > + sub %rcx, %r10 > > + cmp %r10, %r8 > > +# endif > > + jbe L(CopyFrom1To16BytesTailCase2OrCase3) > > +# endif > > + test %rdx, %rdx > > + jnz L(CopyFrom1To16BytesTail) > > + > > + pcmpeqb 16(%rsi), %xmm0 > > + pmovmskb %xmm0, %rdx > > + > > +# ifdef USE_AS_STRNCPY > > + add $16, %r10 > > + cmp %r10, %r8 > > + jbe L(CopyFrom1To32BytesCase2OrCase3) > > +# endif > > + test %rdx, %rdx > > + jnz L(CopyFrom1To32Bytes) > > + > > + movdqu (%rsi, %rcx), %xmm1 /* copy 16 bytes */ > > + movdqu %xmm1, (%rdi) > > + > > +/* If source address alignment != destination address alignment */ > > + .p2align 4 > > +L(Unalign16Both): > > + sub %rcx, 
%rdi > > +# ifdef USE_AS_STRNCPY > > + add %rcx, %r8 > > +# endif > > + mov $16, %rcx > > + movdqa (%rsi, %rcx), %xmm1 > > + movaps 16(%rsi, %rcx), %xmm2 > > + movdqu %xmm1, (%rdi, %rcx) > > + pcmpeqb %xmm2, %xmm0 > > + pmovmskb %xmm0, %rdx > > + add $16, %rcx > > +# ifdef USE_AS_STRNCPY > > + sub $48, %r8 > > + jbe L(CopyFrom1To16BytesCase2OrCase3) > > +# endif > > + test %rdx, %rdx > > +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT > > + jnz L(CopyFrom1To16BytesUnalignedXmm2) > > +# else > > + jnz L(CopyFrom1To16Bytes) > > +# endif > > + > > + movaps 16(%rsi, %rcx), %xmm3 > > + movdqu %xmm2, (%rdi, %rcx) > > + pcmpeqb %xmm3, %xmm0 > > + pmovmskb %xmm0, %rdx > > + add $16, %rcx > > +# ifdef USE_AS_STRNCPY > > + sub $16, %r8 > > + jbe L(CopyFrom1To16BytesCase2OrCase3) > > +# endif > > + test %rdx, %rdx > > +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT > > + jnz L(CopyFrom1To16BytesUnalignedXmm3) > > +# else > > + jnz L(CopyFrom1To16Bytes) > > +# endif > > + > > + movaps 16(%rsi, %rcx), %xmm4 > > + movdqu %xmm3, (%rdi, %rcx) > > + pcmpeqb %xmm4, %xmm0 > > + pmovmskb %xmm0, %rdx > > + add $16, %rcx > > +# ifdef USE_AS_STRNCPY > > + sub $16, %r8 > > + jbe L(CopyFrom1To16BytesCase2OrCase3) > > +# endif > > + test %rdx, %rdx > > +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT > > + jnz L(CopyFrom1To16BytesUnalignedXmm4) > > +# else > > + jnz L(CopyFrom1To16Bytes) > > +# endif > > + > > + movaps 16(%rsi, %rcx), %xmm1 > > + movdqu %xmm4, (%rdi, %rcx) > > + pcmpeqb %xmm1, %xmm0 > > + pmovmskb %xmm0, %rdx > > + add $16, %rcx > > +# ifdef USE_AS_STRNCPY > > + sub $16, %r8 > > + jbe L(CopyFrom1To16BytesCase2OrCase3) > > +# endif > > + test %rdx, %rdx > > +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT > > + jnz L(CopyFrom1To16BytesUnalignedXmm1) > > +# else > > + jnz L(CopyFrom1To16Bytes) > > +# endif > > + > > + movaps 16(%rsi, %rcx), %xmm2 > > + movdqu %xmm1, (%rdi, %rcx) > > + pcmpeqb %xmm2, %xmm0 > > + pmovmskb %xmm0, %rdx > > + add $16, 
%rcx > > +# ifdef USE_AS_STRNCPY > > + sub $16, %r8 > > + jbe L(CopyFrom1To16BytesCase2OrCase3) > > +# endif > > + test %rdx, %rdx > > +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT > > + jnz L(CopyFrom1To16BytesUnalignedXmm2) > > +# else > > + jnz L(CopyFrom1To16Bytes) > > +# endif > > + > > + movaps 16(%rsi, %rcx), %xmm3 > > + movdqu %xmm2, (%rdi, %rcx) > > + pcmpeqb %xmm3, %xmm0 > > + pmovmskb %xmm0, %rdx > > + add $16, %rcx > > +# ifdef USE_AS_STRNCPY > > + sub $16, %r8 > > + jbe L(CopyFrom1To16BytesCase2OrCase3) > > +# endif > > + test %rdx, %rdx > > +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT > > + jnz L(CopyFrom1To16BytesUnalignedXmm3) > > +# else > > + jnz L(CopyFrom1To16Bytes) > > +# endif > > + > > + movdqu %xmm3, (%rdi, %rcx) > > + mov %rsi, %rdx > > + lea 16(%rsi, %rcx), %rsi > > + and $-0x40, %rsi > > + sub %rsi, %rdx > > + sub %rdx, %rdi > > +# ifdef USE_AS_STRNCPY > > + lea 128(%r8, %rdx), %r8 > > +# endif > > +L(Unaligned64Loop): > > + movaps (%rsi), %xmm2 > > + movaps %xmm2, %xmm4 > > + movaps 16(%rsi), %xmm5 > > + movaps 32(%rsi), %xmm3 > > + movaps %xmm3, %xmm6 > > + movaps 48(%rsi), %xmm7 > > + pminub %xmm5, %xmm2 > > + pminub %xmm7, %xmm3 > > + pminub %xmm2, %xmm3 > > + pcmpeqb %xmm0, %xmm3 > > + pmovmskb %xmm3, %rdx > > +# ifdef USE_AS_STRNCPY > > + sub $64, %r8 > > + jbe L(UnalignedLeaveCase2OrCase3) > > +# endif > > + test %rdx, %rdx > > + jnz L(Unaligned64Leave) > > + > > +L(Unaligned64Loop_start): > > + add $64, %rdi > > + add $64, %rsi > > + movdqu %xmm4, -64(%rdi) > > + movaps (%rsi), %xmm2 > > + movdqa %xmm2, %xmm4 > > + movdqu %xmm5, -48(%rdi) > > + movaps 16(%rsi), %xmm5 > > + pminub %xmm5, %xmm2 > > + movaps 32(%rsi), %xmm3 > > + movdqu %xmm6, -32(%rdi) > > + movaps %xmm3, %xmm6 > > + movdqu %xmm7, -16(%rdi) > > + movaps 48(%rsi), %xmm7 > > + pminub %xmm7, %xmm3 > > + pminub %xmm2, %xmm3 > > + pcmpeqb %xmm0, %xmm3 > > + pmovmskb %xmm3, %rdx > > +# ifdef USE_AS_STRNCPY > > + sub $64, %r8 > > + jbe 
L(UnalignedLeaveCase2OrCase3) > > +# endif > > + test %rdx, %rdx > > + jz L(Unaligned64Loop_start) > > + > > +L(Unaligned64Leave): > > + pxor %xmm1, %xmm1 > > + > > + pcmpeqb %xmm4, %xmm0 > > + pcmpeqb %xmm5, %xmm1 > > + pmovmskb %xmm0, %rdx > > + pmovmskb %xmm1, %rcx > > + test %rdx, %rdx > > + jnz L(CopyFrom1To16BytesUnaligned_0) > > + test %rcx, %rcx > > + jnz L(CopyFrom1To16BytesUnaligned_16) > > + > > + pcmpeqb %xmm6, %xmm0 > > + pcmpeqb %xmm7, %xmm1 > > + pmovmskb %xmm0, %rdx > > + pmovmskb %xmm1, %rcx > > + test %rdx, %rdx > > + jnz L(CopyFrom1To16BytesUnaligned_32) > > + > > + bsf %rcx, %rdx > > + movdqu %xmm4, (%rdi) > > + movdqu %xmm5, 16(%rdi) > > + movdqu %xmm6, 32(%rdi) > > +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT > > +# ifdef USE_AS_STPCPY > > + lea 48(%rdi, %rdx), %rax > > +# endif > > + movdqu %xmm7, 48(%rdi) > > + add $15, %r8 > > + sub %rdx, %r8 > > + lea 49(%rdi, %rdx), %rdi > > + jmp L(StrncpyFillTailWithZero) > > +# else > > + add $48, %rsi > > + add $48, %rdi > > + BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4) > > +# endif > > + > > +/* If source address alignment == destination address alignment */ > > + > > +L(SourceStringAlignmentLess32): > > + pxor %xmm0, %xmm0 > > + movdqu (%rsi), %xmm1 > > + movdqu 16(%rsi), %xmm2 > > + pcmpeqb %xmm1, %xmm0 > > + pmovmskb %xmm0, %rdx > > + > > +# ifdef USE_AS_STRNCPY > > +# if defined USE_AS_STPCPY || defined USE_AS_STRCAT > > + cmp $16, %r8 > > +# else > > + cmp $17, %r8 > > +# endif > > + jbe L(CopyFrom1To16BytesTail1Case2OrCase3) > > +# endif > > + test %rdx, %rdx > > + jnz L(CopyFrom1To16BytesTail1) > > + > > + pcmpeqb %xmm2, %xmm0 > > + movdqu %xmm1, (%rdi) > > + pmovmskb %xmm0, %rdx > > + > > +# ifdef USE_AS_STRNCPY > > +# if defined USE_AS_STPCPY || defined USE_AS_STRCAT > > + cmp $32, %r8 > > +# else > > + cmp $33, %r8 > > +# endif > > + jbe L(CopyFrom1To32Bytes1Case2OrCase3) > > +# endif > > + test %rdx, %rdx > > + jnz L(CopyFrom1To32Bytes1) > > + > > + and $-16, %rsi > > + and 
$15, %rcx > > + jmp L(Unalign16Both) > > + > > +/*------End of main part with loops---------------------*/ > > + > > +/* Case1 */ > > + > > +# if (!defined USE_AS_STRNCPY) || (defined USE_AS_STRCAT) > > + .p2align 4 > > +L(CopyFrom1To16Bytes): > > + add %rcx, %rdi > > + add %rcx, %rsi > > + bsf %rdx, %rdx > > + BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4) > > +# endif > > + .p2align 4 > > +L(CopyFrom1To16BytesTail): > > + add %rcx, %rsi > > + bsf %rdx, %rdx > > + BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4) > > + > > + .p2align 4 > > +L(CopyFrom1To32Bytes1): > > + add $16, %rsi > > + add $16, %rdi > > +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT > > + sub $16, %r8 > > +# endif > > +L(CopyFrom1To16BytesTail1): > > + bsf %rdx, %rdx > > + BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4) > > + > > + .p2align 4 > > +L(CopyFrom1To32Bytes): > > + bsf %rdx, %rdx > > + add %rcx, %rsi > > + add $16, %rdx > > + sub %rcx, %rdx > > + BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4) > > + > > + .p2align 4 > > +L(CopyFrom1To16BytesUnaligned_0): > > + bsf %rdx, %rdx > > +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT > > +# ifdef USE_AS_STPCPY > > + lea (%rdi, %rdx), %rax > > +# endif > > + movdqu %xmm4, (%rdi) > > + add $63, %r8 > > + sub %rdx, %r8 > > + lea 1(%rdi, %rdx), %rdi > > + jmp L(StrncpyFillTailWithZero) > > +# else > > + BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4) > > +# endif > > + > > + .p2align 4 > > +L(CopyFrom1To16BytesUnaligned_16): > > + bsf %rcx, %rdx > > + movdqu %xmm4, (%rdi) > > +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT > > +# ifdef USE_AS_STPCPY > > + lea 16(%rdi, %rdx), %rax > > +# endif > > + movdqu %xmm5, 16(%rdi) > > + add $47, %r8 > > + sub %rdx, %r8 > > + lea 17(%rdi, %rdx), %rdi > > + jmp L(StrncpyFillTailWithZero) > > +# else > > + add $16, %rsi > > + add $16, %rdi > > + BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4) > > +# endif > > + > > + .p2align 4 > > +L(CopyFrom1To16BytesUnaligned_32): > > + bsf %rdx, %rdx > > 
+ movdqu %xmm4, (%rdi) > > + movdqu %xmm5, 16(%rdi) > > +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT > > +# ifdef USE_AS_STPCPY > > + lea 32(%rdi, %rdx), %rax > > +# endif > > + movdqu %xmm6, 32(%rdi) > > + add $31, %r8 > > + sub %rdx, %r8 > > + lea 33(%rdi, %rdx), %rdi > > + jmp L(StrncpyFillTailWithZero) > > +# else > > + add $32, %rsi > > + add $32, %rdi > > + BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4) > > +# endif > > + > > +# ifdef USE_AS_STRNCPY > > +# ifndef USE_AS_STRCAT > > + .p2align 4 > > +L(CopyFrom1To16BytesUnalignedXmm6): > > + movdqu %xmm6, (%rdi, %rcx) > > + jmp L(CopyFrom1To16BytesXmmExit) > > + > > + .p2align 4 > > +L(CopyFrom1To16BytesUnalignedXmm5): > > + movdqu %xmm5, (%rdi, %rcx) > > + jmp L(CopyFrom1To16BytesXmmExit) > > + > > + .p2align 4 > > +L(CopyFrom1To16BytesUnalignedXmm4): > > + movdqu %xmm4, (%rdi, %rcx) > > + jmp L(CopyFrom1To16BytesXmmExit) > > + > > + .p2align 4 > > +L(CopyFrom1To16BytesUnalignedXmm3): > > + movdqu %xmm3, (%rdi, %rcx) > > + jmp L(CopyFrom1To16BytesXmmExit) > > + > > + .p2align 4 > > +L(CopyFrom1To16BytesUnalignedXmm1): > > + movdqu %xmm1, (%rdi, %rcx) > > + jmp L(CopyFrom1To16BytesXmmExit) > > +# endif > > + > > + .p2align 4 > > +L(CopyFrom1To16BytesExit): > > + BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4) > > + > > +/* Case2 */ > > + > > + .p2align 4 > > +L(CopyFrom1To16BytesCase2): > > + add $16, %r8 > > + add %rcx, %rdi > > + add %rcx, %rsi > > + bsf %rdx, %rdx > > + cmp %r8, %rdx > > + jb L(CopyFrom1To16BytesExit) > > + BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4) > > + > > + .p2align 4 > > +L(CopyFrom1To32BytesCase2): > > + add %rcx, %rsi > > + bsf %rdx, %rdx > > + add $16, %rdx > > + sub %rcx, %rdx > > + cmp %r8, %rdx > > + jb L(CopyFrom1To16BytesExit) > > + BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4) > > + > > +L(CopyFrom1To16BytesTailCase2): > > + add %rcx, %rsi > > + bsf %rdx, %rdx > > + cmp %r8, %rdx > > + jb L(CopyFrom1To16BytesExit) > > + BRANCH_TO_JMPTBL_ENTRY 
(L(ExitStrncpyTable), %r8, 4) > > + > > +L(CopyFrom1To16BytesTail1Case2): > > + bsf %rdx, %rdx > > + cmp %r8, %rdx > > + jb L(CopyFrom1To16BytesExit) > > + BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4) > > + > > +/* Case2 or Case3, Case3 */ > > + > > + .p2align 4 > > +L(CopyFrom1To16BytesCase2OrCase3): > > + test %rdx, %rdx > > + jnz L(CopyFrom1To16BytesCase2) > > +L(CopyFrom1To16BytesCase3): > > + add $16, %r8 > > + add %rcx, %rdi > > + add %rcx, %rsi > > + BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4) > > + > > + .p2align 4 > > +L(CopyFrom1To32BytesCase2OrCase3): > > + test %rdx, %rdx > > + jnz L(CopyFrom1To32BytesCase2) > > + add %rcx, %rsi > > + BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4) > > + > > + .p2align 4 > > +L(CopyFrom1To16BytesTailCase2OrCase3): > > + test %rdx, %rdx > > + jnz L(CopyFrom1To16BytesTailCase2) > > + add %rcx, %rsi > > + BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4) > > + > > + .p2align 4 > > +L(CopyFrom1To32Bytes1Case2OrCase3): > > + add $16, %rdi > > + add $16, %rsi > > + sub $16, %r8 > > +L(CopyFrom1To16BytesTail1Case2OrCase3): > > + test %rdx, %rdx > > + jnz L(CopyFrom1To16BytesTail1Case2) > > + BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4) > > + > > +# endif > > + > > +/*------------End labels regarding with copying 1-16 bytes--and 1-32 bytes----*/ > > + > > + .p2align 4 > > +L(Exit1): > > + mov %dh, (%rdi) > > +# ifdef USE_AS_STPCPY > > + lea (%rdi), %rax > > +# endif > > +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT > > + sub $1, %r8 > > + lea 1(%rdi), %rdi > > + jnz L(StrncpyFillTailWithZero) > > +# endif > > + ret > > + > > + .p2align 4 > > +L(Exit2): > > + mov (%rsi), %dx > > + mov %dx, (%rdi) > > +# ifdef USE_AS_STPCPY > > + lea 1(%rdi), %rax > > +# endif > > +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT > > + sub $2, %r8 > > + lea 2(%rdi), %rdi > > + jnz L(StrncpyFillTailWithZero) > > +# endif > > + ret > > + > > + .p2align 4 > > +L(Exit3): > > + mov (%rsi), %cx > > + mov 
%cx, (%rdi) > > + mov %dh, 2(%rdi) > > +# ifdef USE_AS_STPCPY > > + lea 2(%rdi), %rax > > +# endif > > +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT > > + sub $3, %r8 > > + lea 3(%rdi), %rdi > > + jnz L(StrncpyFillTailWithZero) > > +# endif > > + ret > > + > > + .p2align 4 > > +L(Exit4): > > + mov (%rsi), %edx > > + mov %edx, (%rdi) > > +# ifdef USE_AS_STPCPY > > + lea 3(%rdi), %rax > > +# endif > > +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT > > + sub $4, %r8 > > + lea 4(%rdi), %rdi > > + jnz L(StrncpyFillTailWithZero) > > +# endif > > + ret > > + > > + .p2align 4 > > +L(Exit5): > > + mov (%rsi), %ecx > > + mov %dh, 4(%rdi) > > + mov %ecx, (%rdi) > > +# ifdef USE_AS_STPCPY > > + lea 4(%rdi), %rax > > +# endif > > +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT > > + sub $5, %r8 > > + lea 5(%rdi), %rdi > > + jnz L(StrncpyFillTailWithZero) > > +# endif > > + ret > > + > > + .p2align 4 > > +L(Exit6): > > + mov (%rsi), %ecx > > + mov 4(%rsi), %dx > > + mov %ecx, (%rdi) > > + mov %dx, 4(%rdi) > > +# ifdef USE_AS_STPCPY > > + lea 5(%rdi), %rax > > +# endif > > +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT > > + sub $6, %r8 > > + lea 6(%rdi), %rdi > > + jnz L(StrncpyFillTailWithZero) > > +# endif > > + ret > > + > > + .p2align 4 > > +L(Exit7): > > + mov (%rsi), %ecx > > + mov 3(%rsi), %edx > > + mov %ecx, (%rdi) > > + mov %edx, 3(%rdi) > > +# ifdef USE_AS_STPCPY > > + lea 6(%rdi), %rax > > +# endif > > +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT > > + sub $7, %r8 > > + lea 7(%rdi), %rdi > > + jnz L(StrncpyFillTailWithZero) > > +# endif > > + ret > > + > > + .p2align 4 > > +L(Exit8): > > + mov (%rsi), %rdx > > + mov %rdx, (%rdi) > > +# ifdef USE_AS_STPCPY > > + lea 7(%rdi), %rax > > +# endif > > +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT > > + sub $8, %r8 > > + lea 8(%rdi), %rdi > > + jnz L(StrncpyFillTailWithZero) > > +# endif > > + ret > > + > > + .p2align 4 > > +L(Exit9): > > + mov (%rsi), %rcx > > + mov %dh, 
8(%rdi) > > + mov %rcx, (%rdi) > > +# ifdef USE_AS_STPCPY > > + lea 8(%rdi), %rax > > +# endif > > +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT > > + sub $9, %r8 > > + lea 9(%rdi), %rdi > > + jnz L(StrncpyFillTailWithZero) > > +# endif > > + ret > > + > > + .p2align 4 > > +L(Exit10): > > + mov (%rsi), %rcx > > + mov 8(%rsi), %dx > > + mov %rcx, (%rdi) > > + mov %dx, 8(%rdi) > > +# ifdef USE_AS_STPCPY > > + lea 9(%rdi), %rax > > +# endif > > +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT > > + sub $10, %r8 > > + lea 10(%rdi), %rdi > > + jnz L(StrncpyFillTailWithZero) > > +# endif > > + ret > > + > > + .p2align 4 > > +L(Exit11): > > + mov (%rsi), %rcx > > + mov 7(%rsi), %edx > > + mov %rcx, (%rdi) > > + mov %edx, 7(%rdi) > > +# ifdef USE_AS_STPCPY > > + lea 10(%rdi), %rax > > +# endif > > +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT > > + sub $11, %r8 > > + lea 11(%rdi), %rdi > > + jnz L(StrncpyFillTailWithZero) > > +# endif > > + ret > > + > > + .p2align 4 > > +L(Exit12): > > + mov (%rsi), %rcx > > + mov 8(%rsi), %edx > > + mov %rcx, (%rdi) > > + mov %edx, 8(%rdi) > > +# ifdef USE_AS_STPCPY > > + lea 11(%rdi), %rax > > +# endif > > +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT > > + sub $12, %r8 > > + lea 12(%rdi), %rdi > > + jnz L(StrncpyFillTailWithZero) > > +# endif > > + ret > > + > > + .p2align 4 > > +L(Exit13): > > + mov (%rsi), %rcx > > + mov 5(%rsi), %rdx > > + mov %rcx, (%rdi) > > + mov %rdx, 5(%rdi) > > +# ifdef USE_AS_STPCPY > > + lea 12(%rdi), %rax > > +# endif > > +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT > > + sub $13, %r8 > > + lea 13(%rdi), %rdi > > + jnz L(StrncpyFillTailWithZero) > > +# endif > > + ret > > + > > + .p2align 4 > > +L(Exit14): > > + mov (%rsi), %rcx > > + mov 6(%rsi), %rdx > > + mov %rcx, (%rdi) > > + mov %rdx, 6(%rdi) > > +# ifdef USE_AS_STPCPY > > + lea 13(%rdi), %rax > > +# endif > > +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT > > + sub $14, %r8 > > + lea 14(%rdi), %rdi > 
> + jnz L(StrncpyFillTailWithZero) > > +# endif > > + ret > > + > > + .p2align 4 > > +L(Exit15): > > + mov (%rsi), %rcx > > + mov 7(%rsi), %rdx > > + mov %rcx, (%rdi) > > + mov %rdx, 7(%rdi) > > +# ifdef USE_AS_STPCPY > > + lea 14(%rdi), %rax > > +# endif > > +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT > > + sub $15, %r8 > > + lea 15(%rdi), %rdi > > + jnz L(StrncpyFillTailWithZero) > > +# endif > > + ret > > + > > + .p2align 4 > > +L(Exit16): > > + movdqu (%rsi), %xmm0 > > + movdqu %xmm0, (%rdi) > > +# ifdef USE_AS_STPCPY > > + lea 15(%rdi), %rax > > +# endif > > +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT > > + sub $16, %r8 > > + lea 16(%rdi), %rdi > > + jnz L(StrncpyFillTailWithZero) > > +# endif > > + ret > > + > > + .p2align 4 > > +L(Exit17): > > + movdqu (%rsi), %xmm0 > > + movdqu %xmm0, (%rdi) > > + mov %dh, 16(%rdi) > > +# ifdef USE_AS_STPCPY > > + lea 16(%rdi), %rax > > +# endif > > +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT > > + sub $17, %r8 > > + lea 17(%rdi), %rdi > > + jnz L(StrncpyFillTailWithZero) > > +# endif > > + ret > > + > > + .p2align 4 > > +L(Exit18): > > + movdqu (%rsi), %xmm0 > > + mov 16(%rsi), %cx > > + movdqu %xmm0, (%rdi) > > + mov %cx, 16(%rdi) > > +# ifdef USE_AS_STPCPY > > + lea 17(%rdi), %rax > > +# endif > > +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT > > + sub $18, %r8 > > + lea 18(%rdi), %rdi > > + jnz L(StrncpyFillTailWithZero) > > +# endif > > + ret > > + > > + .p2align 4 > > +L(Exit19): > > + movdqu (%rsi), %xmm0 > > + mov 15(%rsi), %ecx > > + movdqu %xmm0, (%rdi) > > + mov %ecx, 15(%rdi) > > +# ifdef USE_AS_STPCPY > > + lea 18(%rdi), %rax > > +# endif > > +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT > > + sub $19, %r8 > > + lea 19(%rdi), %rdi > > + jnz L(StrncpyFillTailWithZero) > > +# endif > > + ret > > + > > + .p2align 4 > > +L(Exit20): > > + movdqu (%rsi), %xmm0 > > + mov 16(%rsi), %ecx > > + movdqu %xmm0, (%rdi) > > + mov %ecx, 16(%rdi) > > +# ifdef USE_AS_STPCPY > > 
+ lea 19(%rdi), %rax > > +# endif > > +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT > > + sub $20, %r8 > > + lea 20(%rdi), %rdi > > + jnz L(StrncpyFillTailWithZero) > > +# endif > > + ret > > + > > + .p2align 4 > > +L(Exit21): > > + movdqu (%rsi), %xmm0 > > + mov 16(%rsi), %ecx > > + movdqu %xmm0, (%rdi) > > + mov %ecx, 16(%rdi) > > + mov %dh, 20(%rdi) > > +# ifdef USE_AS_STPCPY > > + lea 20(%rdi), %rax > > +# endif > > +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT > > + sub $21, %r8 > > + lea 21(%rdi), %rdi > > + jnz L(StrncpyFillTailWithZero) > > +# endif > > + ret > > + > > + .p2align 4 > > +L(Exit22): > > + movdqu (%rsi), %xmm0 > > + mov 14(%rsi), %rcx > > + movdqu %xmm0, (%rdi) > > + mov %rcx, 14(%rdi) > > +# ifdef USE_AS_STPCPY > > + lea 21(%rdi), %rax > > +# endif > > +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT > > + sub $22, %r8 > > + lea 22(%rdi), %rdi > > + jnz L(StrncpyFillTailWithZero) > > +# endif > > + ret > > + > > + .p2align 4 > > +L(Exit23): > > + movdqu (%rsi), %xmm0 > > + mov 15(%rsi), %rcx > > + movdqu %xmm0, (%rdi) > > + mov %rcx, 15(%rdi) > > +# ifdef USE_AS_STPCPY > > + lea 22(%rdi), %rax > > +# endif > > +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT > > + sub $23, %r8 > > + lea 23(%rdi), %rdi > > + jnz L(StrncpyFillTailWithZero) > > +# endif > > + ret > > + > > + .p2align 4 > > +L(Exit24): > > + movdqu (%rsi), %xmm0 > > + mov 16(%rsi), %rcx > > + movdqu %xmm0, (%rdi) > > + mov %rcx, 16(%rdi) > > +# ifdef USE_AS_STPCPY > > + lea 23(%rdi), %rax > > +# endif > > +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT > > + sub $24, %r8 > > + lea 24(%rdi), %rdi > > + jnz L(StrncpyFillTailWithZero) > > +# endif > > + ret > > + > > + .p2align 4 > > +L(Exit25): > > + movdqu (%rsi), %xmm0 > > + mov 16(%rsi), %rcx > > + movdqu %xmm0, (%rdi) > > + mov %rcx, 16(%rdi) > > + mov %dh, 24(%rdi) > > +# ifdef USE_AS_STPCPY > > + lea 24(%rdi), %rax > > +# endif > > +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT > > 
+ sub $25, %r8 > > + lea 25(%rdi), %rdi > > + jnz L(StrncpyFillTailWithZero) > > +# endif > > + ret > > + > > + .p2align 4 > > +L(Exit26): > > + movdqu (%rsi), %xmm0 > > + mov 16(%rsi), %rdx > > + mov 24(%rsi), %cx > > + movdqu %xmm0, (%rdi) > > + mov %rdx, 16(%rdi) > > + mov %cx, 24(%rdi) > > +# ifdef USE_AS_STPCPY > > + lea 25(%rdi), %rax > > +# endif > > +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT > > + sub $26, %r8 > > + lea 26(%rdi), %rdi > > + jnz L(StrncpyFillTailWithZero) > > +# endif > > + ret > > + > > + .p2align 4 > > +L(Exit27): > > + movdqu (%rsi), %xmm0 > > + mov 16(%rsi), %rdx > > + mov 23(%rsi), %ecx > > + movdqu %xmm0, (%rdi) > > + mov %rdx, 16(%rdi) > > + mov %ecx, 23(%rdi) > > +# ifdef USE_AS_STPCPY > > + lea 26(%rdi), %rax > > +# endif > > +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT > > + sub $27, %r8 > > + lea 27(%rdi), %rdi > > + jnz L(StrncpyFillTailWithZero) > > +# endif > > + ret > > + > > + .p2align 4 > > +L(Exit28): > > + movdqu (%rsi), %xmm0 > > + mov 16(%rsi), %rdx > > + mov 24(%rsi), %ecx > > + movdqu %xmm0, (%rdi) > > + mov %rdx, 16(%rdi) > > + mov %ecx, 24(%rdi) > > +# ifdef USE_AS_STPCPY > > + lea 27(%rdi), %rax > > +# endif > > +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT > > + sub $28, %r8 > > + lea 28(%rdi), %rdi > > + jnz L(StrncpyFillTailWithZero) > > +# endif > > + ret > > + > > + .p2align 4 > > +L(Exit29): > > + movdqu (%rsi), %xmm0 > > + movdqu 13(%rsi), %xmm2 > > + movdqu %xmm0, (%rdi) > > + movdqu %xmm2, 13(%rdi) > > +# ifdef USE_AS_STPCPY > > + lea 28(%rdi), %rax > > +# endif > > +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT > > + sub $29, %r8 > > + lea 29(%rdi), %rdi > > + jnz L(StrncpyFillTailWithZero) > > +# endif > > + ret > > + > > + .p2align 4 > > +L(Exit30): > > + movdqu (%rsi), %xmm0 > > + movdqu 14(%rsi), %xmm2 > > + movdqu %xmm0, (%rdi) > > + movdqu %xmm2, 14(%rdi) > > +# ifdef USE_AS_STPCPY > > + lea 29(%rdi), %rax > > +# endif > > +# if defined USE_AS_STRNCPY && 
!defined USE_AS_STRCAT > > + sub $30, %r8 > > + lea 30(%rdi), %rdi > > + jnz L(StrncpyFillTailWithZero) > > +# endif > > + ret > > + > > + .p2align 4 > > +L(Exit31): > > + movdqu (%rsi), %xmm0 > > + movdqu 15(%rsi), %xmm2 > > + movdqu %xmm0, (%rdi) > > + movdqu %xmm2, 15(%rdi) > > +# ifdef USE_AS_STPCPY > > + lea 30(%rdi), %rax > > +# endif > > +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT > > + sub $31, %r8 > > + lea 31(%rdi), %rdi > > + jnz L(StrncpyFillTailWithZero) > > +# endif > > + ret > > + > > + .p2align 4 > > +L(Exit32): > > + movdqu (%rsi), %xmm0 > > + movdqu 16(%rsi), %xmm2 > > + movdqu %xmm0, (%rdi) > > + movdqu %xmm2, 16(%rdi) > > +# ifdef USE_AS_STPCPY > > + lea 31(%rdi), %rax > > +# endif > > +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT > > + sub $32, %r8 > > + lea 32(%rdi), %rdi > > + jnz L(StrncpyFillTailWithZero) > > +# endif > > + ret > > + > > +# ifdef USE_AS_STRNCPY > > + > > + .p2align 4 > > +L(StrncpyExit0): > > +# ifdef USE_AS_STPCPY > > + mov %rdi, %rax > > +# endif > > +# ifdef USE_AS_STRCAT > > + xor %ch, %ch > > + movb %ch, (%rdi) > > +# endif > > + ret > > + > > + .p2align 4 > > +L(StrncpyExit1): > > + mov (%rsi), %dl > > + mov %dl, (%rdi) > > +# ifdef USE_AS_STPCPY > > + lea 1(%rdi), %rax > > +# endif > > +# ifdef USE_AS_STRCAT > > + xor %ch, %ch > > + movb %ch, 1(%rdi) > > +# endif > > + ret > > + > > + .p2align 4 > > +L(StrncpyExit2): > > + mov (%rsi), %dx > > + mov %dx, (%rdi) > > +# ifdef USE_AS_STPCPY > > + lea 2(%rdi), %rax > > +# endif > > +# ifdef USE_AS_STRCAT > > + xor %ch, %ch > > + movb %ch, 2(%rdi) > > +# endif > > + ret > > + > > + .p2align 4 > > +L(StrncpyExit3): > > + mov (%rsi), %cx > > + mov 2(%rsi), %dl > > + mov %cx, (%rdi) > > + mov %dl, 2(%rdi) > > +# ifdef USE_AS_STPCPY > > + lea 3(%rdi), %rax > > +# endif > > +# ifdef USE_AS_STRCAT > > + xor %ch, %ch > > + movb %ch, 3(%rdi) > > +# endif > > + ret > > + > > + .p2align 4 > > +L(StrncpyExit4): > > + mov (%rsi), %edx > > + mov %edx, (%rdi) > > +# 
ifdef USE_AS_STPCPY > > + lea 4(%rdi), %rax > > +# endif > > +# ifdef USE_AS_STRCAT > > + xor %ch, %ch > > + movb %ch, 4(%rdi) > > +# endif > > + ret > > + > > + .p2align 4 > > +L(StrncpyExit5): > > + mov (%rsi), %ecx > > + mov 4(%rsi), %dl > > + mov %ecx, (%rdi) > > + mov %dl, 4(%rdi) > > +# ifdef USE_AS_STPCPY > > + lea 5(%rdi), %rax > > +# endif > > +# ifdef USE_AS_STRCAT > > + xor %ch, %ch > > + movb %ch, 5(%rdi) > > +# endif > > + ret > > + > > + .p2align 4 > > +L(StrncpyExit6): > > + mov (%rsi), %ecx > > + mov 4(%rsi), %dx > > + mov %ecx, (%rdi) > > + mov %dx, 4(%rdi) > > +# ifdef USE_AS_STPCPY > > + lea 6(%rdi), %rax > > +# endif > > +# ifdef USE_AS_STRCAT > > + xor %ch, %ch > > + movb %ch, 6(%rdi) > > +# endif > > + ret > > + > > + .p2align 4 > > +L(StrncpyExit7): > > + mov (%rsi), %ecx > > + mov 3(%rsi), %edx > > + mov %ecx, (%rdi) > > + mov %edx, 3(%rdi) > > +# ifdef USE_AS_STPCPY > > + lea 7(%rdi), %rax > > +# endif > > +# ifdef USE_AS_STRCAT > > + xor %ch, %ch > > + movb %ch, 7(%rdi) > > +# endif > > + ret > > + > > + .p2align 4 > > +L(StrncpyExit8): > > + mov (%rsi), %rdx > > + mov %rdx, (%rdi) > > +# ifdef USE_AS_STPCPY > > + lea 8(%rdi), %rax > > +# endif > > +# ifdef USE_AS_STRCAT > > + xor %ch, %ch > > + movb %ch, 8(%rdi) > > +# endif > > + ret > > + > > + .p2align 4 > > +L(StrncpyExit9): > > + mov (%rsi), %rcx > > + mov 8(%rsi), %dl > > + mov %rcx, (%rdi) > > + mov %dl, 8(%rdi) > > +# ifdef USE_AS_STPCPY > > + lea 9(%rdi), %rax > > +# endif > > +# ifdef USE_AS_STRCAT > > + xor %ch, %ch > > + movb %ch, 9(%rdi) > > +# endif > > + ret > > + > > + .p2align 4 > > +L(StrncpyExit10): > > + mov (%rsi), %rcx > > + mov 8(%rsi), %dx > > + mov %rcx, (%rdi) > > + mov %dx, 8(%rdi) > > +# ifdef USE_AS_STPCPY > > + lea 10(%rdi), %rax > > +# endif > > +# ifdef USE_AS_STRCAT > > + xor %ch, %ch > > + movb %ch, 10(%rdi) > > +# endif > > + ret > > + > > + .p2align 4 > > +L(StrncpyExit11): > > + mov (%rsi), %rcx > > + mov 7(%rsi), %edx > > + mov %rcx, (%rdi) > > + mov 
%edx, 7(%rdi) > > +# ifdef USE_AS_STPCPY > > + lea 11(%rdi), %rax > > +# endif > > +# ifdef USE_AS_STRCAT > > + xor %ch, %ch > > + movb %ch, 11(%rdi) > > +# endif > > + ret > > + > > + .p2align 4 > > +L(StrncpyExit12): > > + mov (%rsi), %rcx > > + mov 8(%rsi), %edx > > + mov %rcx, (%rdi) > > + mov %edx, 8(%rdi) > > +# ifdef USE_AS_STPCPY > > + lea 12(%rdi), %rax > > +# endif > > +# ifdef USE_AS_STRCAT > > + xor %ch, %ch > > + movb %ch, 12(%rdi) > > +# endif > > + ret > > + > > + .p2align 4 > > +L(StrncpyExit13): > > + mov (%rsi), %rcx > > + mov 5(%rsi), %rdx > > + mov %rcx, (%rdi) > > + mov %rdx, 5(%rdi) > > +# ifdef USE_AS_STPCPY > > + lea 13(%rdi), %rax > > +# endif > > +# ifdef USE_AS_STRCAT > > + xor %ch, %ch > > + movb %ch, 13(%rdi) > > +# endif > > + ret > > + > > + .p2align 4 > > +L(StrncpyExit14): > > + mov (%rsi), %rcx > > + mov 6(%rsi), %rdx > > + mov %rcx, (%rdi) > > + mov %rdx, 6(%rdi) > > +# ifdef USE_AS_STPCPY > > + lea 14(%rdi), %rax > > +# endif > > +# ifdef USE_AS_STRCAT > > + xor %ch, %ch > > + movb %ch, 14(%rdi) > > +# endif > > + ret > > + > > + .p2align 4 > > +L(StrncpyExit15): > > + mov (%rsi), %rcx > > + mov 7(%rsi), %rdx > > + mov %rcx, (%rdi) > > + mov %rdx, 7(%rdi) > > +# ifdef USE_AS_STPCPY > > + lea 15(%rdi), %rax > > +# endif > > +# ifdef USE_AS_STRCAT > > + xor %ch, %ch > > + movb %ch, 15(%rdi) > > +# endif > > + ret > > + > > + .p2align 4 > > +L(StrncpyExit16): > > + movdqu (%rsi), %xmm0 > > + movdqu %xmm0, (%rdi) > > +# ifdef USE_AS_STPCPY > > + lea 16(%rdi), %rax > > +# endif > > +# ifdef USE_AS_STRCAT > > + xor %ch, %ch > > + movb %ch, 16(%rdi) > > +# endif > > + ret > > + > > + .p2align 4 > > +L(StrncpyExit17): > > + movdqu (%rsi), %xmm0 > > + mov 16(%rsi), %cl > > + movdqu %xmm0, (%rdi) > > + mov %cl, 16(%rdi) > > +# ifdef USE_AS_STPCPY > > + lea 17(%rdi), %rax > > +# endif > > +# ifdef USE_AS_STRCAT > > + xor %ch, %ch > > + movb %ch, 17(%rdi) > > +# endif > > + ret > > + > > + .p2align 4 > > +L(StrncpyExit18): > > + movdqu 
(%rsi), %xmm0 > > + mov 16(%rsi), %cx > > + movdqu %xmm0, (%rdi) > > + mov %cx, 16(%rdi) > > +# ifdef USE_AS_STPCPY > > + lea 18(%rdi), %rax > > +# endif > > +# ifdef USE_AS_STRCAT > > + xor %ch, %ch > > + movb %ch, 18(%rdi) > > +# endif > > + ret > > + > > + .p2align 4 > > +L(StrncpyExit19): > > + movdqu (%rsi), %xmm0 > > + mov 15(%rsi), %ecx > > + movdqu %xmm0, (%rdi) > > + mov %ecx, 15(%rdi) > > +# ifdef USE_AS_STPCPY > > + lea 19(%rdi), %rax > > +# endif > > +# ifdef USE_AS_STRCAT > > + xor %ch, %ch > > + movb %ch, 19(%rdi) > > +# endif > > + ret > > + > > + .p2align 4 > > +L(StrncpyExit20): > > + movdqu (%rsi), %xmm0 > > + mov 16(%rsi), %ecx > > + movdqu %xmm0, (%rdi) > > + mov %ecx, 16(%rdi) > > +# ifdef USE_AS_STPCPY > > + lea 20(%rdi), %rax > > +# endif > > +# ifdef USE_AS_STRCAT > > + xor %ch, %ch > > + movb %ch, 20(%rdi) > > +# endif > > + ret > > + > > + .p2align 4 > > +L(StrncpyExit21): > > + movdqu (%rsi), %xmm0 > > + mov 16(%rsi), %ecx > > + mov 20(%rsi), %dl > > + movdqu %xmm0, (%rdi) > > + mov %ecx, 16(%rdi) > > + mov %dl, 20(%rdi) > > +# ifdef USE_AS_STPCPY > > + lea 21(%rdi), %rax > > +# endif > > +# ifdef USE_AS_STRCAT > > + xor %ch, %ch > > + movb %ch, 21(%rdi) > > +# endif > > + ret > > + > > + .p2align 4 > > +L(StrncpyExit22): > > + movdqu (%rsi), %xmm0 > > + mov 14(%rsi), %rcx > > + movdqu %xmm0, (%rdi) > > + mov %rcx, 14(%rdi) > > +# ifdef USE_AS_STPCPY > > + lea 22(%rdi), %rax > > +# endif > > +# ifdef USE_AS_STRCAT > > + xor %ch, %ch > > + movb %ch, 22(%rdi) > > +# endif > > + ret > > + > > + .p2align 4 > > +L(StrncpyExit23): > > + movdqu (%rsi), %xmm0 > > + mov 15(%rsi), %rcx > > + movdqu %xmm0, (%rdi) > > + mov %rcx, 15(%rdi) > > +# ifdef USE_AS_STPCPY > > + lea 23(%rdi), %rax > > +# endif > > +# ifdef USE_AS_STRCAT > > + xor %ch, %ch > > + movb %ch, 23(%rdi) > > +# endif > > + ret > > + > > + .p2align 4 > > +L(StrncpyExit24): > > + movdqu (%rsi), %xmm0 > > + mov 16(%rsi), %rcx > > + movdqu %xmm0, (%rdi) > > + mov %rcx, 16(%rdi) > > +# 
ifdef USE_AS_STPCPY > > + lea 24(%rdi), %rax > > +# endif > > +# ifdef USE_AS_STRCAT > > + xor %ch, %ch > > + movb %ch, 24(%rdi) > > +# endif > > + ret > > + > > + .p2align 4 > > +L(StrncpyExit25): > > + movdqu (%rsi), %xmm0 > > + mov 16(%rsi), %rdx > > + mov 24(%rsi), %cl > > + movdqu %xmm0, (%rdi) > > + mov %rdx, 16(%rdi) > > + mov %cl, 24(%rdi) > > +# ifdef USE_AS_STPCPY > > + lea 25(%rdi), %rax > > +# endif > > +# ifdef USE_AS_STRCAT > > + xor %ch, %ch > > + movb %ch, 25(%rdi) > > +# endif > > + ret > > + > > + .p2align 4 > > +L(StrncpyExit26): > > + movdqu (%rsi), %xmm0 > > + mov 16(%rsi), %rdx > > + mov 24(%rsi), %cx > > + movdqu %xmm0, (%rdi) > > + mov %rdx, 16(%rdi) > > + mov %cx, 24(%rdi) > > +# ifdef USE_AS_STPCPY > > + lea 26(%rdi), %rax > > +# endif > > +# ifdef USE_AS_STRCAT > > + xor %ch, %ch > > + movb %ch, 26(%rdi) > > +# endif > > + ret > > + > > + .p2align 4 > > +L(StrncpyExit27): > > + movdqu (%rsi), %xmm0 > > + mov 16(%rsi), %rdx > > + mov 23(%rsi), %ecx > > + movdqu %xmm0, (%rdi) > > + mov %rdx, 16(%rdi) > > + mov %ecx, 23(%rdi) > > +# ifdef USE_AS_STPCPY > > + lea 27(%rdi), %rax > > +# endif > > +# ifdef USE_AS_STRCAT > > + xor %ch, %ch > > + movb %ch, 27(%rdi) > > +# endif > > + ret > > + > > + .p2align 4 > > +L(StrncpyExit28): > > + movdqu (%rsi), %xmm0 > > + mov 16(%rsi), %rdx > > + mov 24(%rsi), %ecx > > + movdqu %xmm0, (%rdi) > > + mov %rdx, 16(%rdi) > > + mov %ecx, 24(%rdi) > > +# ifdef USE_AS_STPCPY > > + lea 28(%rdi), %rax > > +# endif > > +# ifdef USE_AS_STRCAT > > + xor %ch, %ch > > + movb %ch, 28(%rdi) > > +# endif > > + ret > > + > > + .p2align 4 > > +L(StrncpyExit29): > > + movdqu (%rsi), %xmm0 > > + movdqu 13(%rsi), %xmm2 > > + movdqu %xmm0, (%rdi) > > + movdqu %xmm2, 13(%rdi) > > +# ifdef USE_AS_STPCPY > > + lea 29(%rdi), %rax > > +# endif > > +# ifdef USE_AS_STRCAT > > + xor %ch, %ch > > + movb %ch, 29(%rdi) > > +# endif > > + ret > > + > > + .p2align 4 > > +L(StrncpyExit30): > > + movdqu (%rsi), %xmm0 > > + movdqu 14(%rsi), 
%xmm2 > > + movdqu %xmm0, (%rdi) > > + movdqu %xmm2, 14(%rdi) > > +# ifdef USE_AS_STPCPY > > + lea 30(%rdi), %rax > > +# endif > > +# ifdef USE_AS_STRCAT > > + xor %ch, %ch > > + movb %ch, 30(%rdi) > > +# endif > > + ret > > + > > + .p2align 4 > > +L(StrncpyExit31): > > + movdqu (%rsi), %xmm0 > > + movdqu 15(%rsi), %xmm2 > > + movdqu %xmm0, (%rdi) > > + movdqu %xmm2, 15(%rdi) > > +# ifdef USE_AS_STPCPY > > + lea 31(%rdi), %rax > > +# endif > > +# ifdef USE_AS_STRCAT > > + xor %ch, %ch > > + movb %ch, 31(%rdi) > > +# endif > > + ret > > + > > + .p2align 4 > > +L(StrncpyExit32): > > + movdqu (%rsi), %xmm0 > > + movdqu 16(%rsi), %xmm2 > > + movdqu %xmm0, (%rdi) > > + movdqu %xmm2, 16(%rdi) > > +# ifdef USE_AS_STPCPY > > + lea 32(%rdi), %rax > > +# endif > > +# ifdef USE_AS_STRCAT > > + xor %ch, %ch > > + movb %ch, 32(%rdi) > > +# endif > > + ret > > + > > + .p2align 4 > > +L(StrncpyExit33): > > + movdqu (%rsi), %xmm0 > > + movdqu 16(%rsi), %xmm2 > > + mov 32(%rsi), %cl > > + movdqu %xmm0, (%rdi) > > + movdqu %xmm2, 16(%rdi) > > + mov %cl, 32(%rdi) > > +# ifdef USE_AS_STRCAT > > + xor %ch, %ch > > + movb %ch, 33(%rdi) > > +# endif > > + ret > > + > > +# ifndef USE_AS_STRCAT > > + > > + .p2align 4 > > +L(Fill0): > > + ret > > + > > + .p2align 4 > > +L(Fill1): > > + mov %dl, (%rdi) > > + ret > > + > > + .p2align 4 > > +L(Fill2): > > + mov %dx, (%rdi) > > + ret > > + > > + .p2align 4 > > +L(Fill3): > > + mov %edx, -1(%rdi) > > + ret > > + > > + .p2align 4 > > +L(Fill4): > > + mov %edx, (%rdi) > > + ret > > + > > + .p2align 4 > > +L(Fill5): > > + mov %edx, (%rdi) > > + mov %dl, 4(%rdi) > > + ret > > + > > + .p2align 4 > > +L(Fill6): > > + mov %edx, (%rdi) > > + mov %dx, 4(%rdi) > > + ret > > + > > + .p2align 4 > > +L(Fill7): > > + mov %rdx, -1(%rdi) > > + ret > > + > > + .p2align 4 > > +L(Fill8): > > + mov %rdx, (%rdi) > > + ret > > + > > + .p2align 4 > > +L(Fill9): > > + mov %rdx, (%rdi) > > + mov %dl, 8(%rdi) > > + ret > > + > > + .p2align 4 > > +L(Fill10): > > + mov 
%rdx, (%rdi) > > + mov %dx, 8(%rdi) > > + ret > > + > > + .p2align 4 > > +L(Fill11): > > + mov %rdx, (%rdi) > > + mov %edx, 7(%rdi) > > + ret > > + > > + .p2align 4 > > +L(Fill12): > > + mov %rdx, (%rdi) > > + mov %edx, 8(%rdi) > > + ret > > + > > + .p2align 4 > > +L(Fill13): > > + mov %rdx, (%rdi) > > + mov %rdx, 5(%rdi) > > + ret > > + > > + .p2align 4 > > +L(Fill14): > > + mov %rdx, (%rdi) > > + mov %rdx, 6(%rdi) > > + ret > > + > > + .p2align 4 > > +L(Fill15): > > + movdqu %xmm0, -1(%rdi) > > + ret > > + > > + .p2align 4 > > +L(Fill16): > > + movdqu %xmm0, (%rdi) > > + ret > > + > > + .p2align 4 > > +L(CopyFrom1To16BytesUnalignedXmm2): > > + movdqu %xmm2, (%rdi, %rcx) > > + > > + .p2align 4 > > +L(CopyFrom1To16BytesXmmExit): > > + bsf %rdx, %rdx > > + add $15, %r8 > > + add %rcx, %rdi > > +# ifdef USE_AS_STPCPY > > + lea (%rdi, %rdx), %rax > > +# endif > > + sub %rdx, %r8 > > + lea 1(%rdi, %rdx), %rdi > > + > > + .p2align 4 > > +L(StrncpyFillTailWithZero): > > + pxor %xmm0, %xmm0 > > + xor %rdx, %rdx > > + sub $16, %r8 > > + jbe L(StrncpyFillExit) > > + > > + movdqu %xmm0, (%rdi) > > + add $16, %rdi > > + > > + mov %rdi, %rsi > > + and $0xf, %rsi > > + sub %rsi, %rdi > > + add %rsi, %r8 > > + sub $64, %r8 > > + jb L(StrncpyFillLess64) > > + > > +L(StrncpyFillLoopMovdqa): > > + movdqa %xmm0, (%rdi) > > + movdqa %xmm0, 16(%rdi) > > + movdqa %xmm0, 32(%rdi) > > + movdqa %xmm0, 48(%rdi) > > + add $64, %rdi > > + sub $64, %r8 > > + jae L(StrncpyFillLoopMovdqa) > > + > > +L(StrncpyFillLess64): > > + add $32, %r8 > > + jl L(StrncpyFillLess32) > > + movdqa %xmm0, (%rdi) > > + movdqa %xmm0, 16(%rdi) > > + add $32, %rdi > > + sub $16, %r8 > > + jl L(StrncpyFillExit) > > + movdqa %xmm0, (%rdi) > > + add $16, %rdi > > + BRANCH_TO_JMPTBL_ENTRY (L(FillTable), %r8, 4) > > + > > +L(StrncpyFillLess32): > > + add $16, %r8 > > + jl L(StrncpyFillExit) > > + movdqa %xmm0, (%rdi) > > + add $16, %rdi > > + BRANCH_TO_JMPTBL_ENTRY (L(FillTable), %r8, 4) > > + > > +L(StrncpyFillExit): > 
> + add $16, %r8 > > + BRANCH_TO_JMPTBL_ENTRY (L(FillTable), %r8, 4) > > + > > +/* end of ifndef USE_AS_STRCAT */ > > +# endif > > + > > + .p2align 4 > > +L(UnalignedLeaveCase2OrCase3): > > + test %rdx, %rdx > > + jnz L(Unaligned64LeaveCase2) > > +L(Unaligned64LeaveCase3): > > + lea 64(%r8), %rcx > > + and $-16, %rcx > > + add $48, %r8 > > + jl L(CopyFrom1To16BytesCase3) > > + movdqu %xmm4, (%rdi) > > + sub $16, %r8 > > + jb L(CopyFrom1To16BytesCase3) > > + movdqu %xmm5, 16(%rdi) > > + sub $16, %r8 > > + jb L(CopyFrom1To16BytesCase3) > > + movdqu %xmm6, 32(%rdi) > > + sub $16, %r8 > > + jb L(CopyFrom1To16BytesCase3) > > + movdqu %xmm7, 48(%rdi) > > +# ifdef USE_AS_STPCPY > > + lea 64(%rdi), %rax > > +# endif > > +# ifdef USE_AS_STRCAT > > + xor %ch, %ch > > + movb %ch, 64(%rdi) > > +# endif > > + ret > > + > > + .p2align 4 > > +L(Unaligned64LeaveCase2): > > + xor %rcx, %rcx > > + pcmpeqb %xmm4, %xmm0 > > + pmovmskb %xmm0, %rdx > > + add $48, %r8 > > + jle L(CopyFrom1To16BytesCase2OrCase3) > > + test %rdx, %rdx > > +# ifndef USE_AS_STRCAT > > + jnz L(CopyFrom1To16BytesUnalignedXmm4) > > +# else > > + jnz L(CopyFrom1To16Bytes) > > +# endif > > + pcmpeqb %xmm5, %xmm0 > > + pmovmskb %xmm0, %rdx > > + movdqu %xmm4, (%rdi) > > + add $16, %rcx > > + sub $16, %r8 > > + jbe L(CopyFrom1To16BytesCase2OrCase3) > > + test %rdx, %rdx > > +# ifndef USE_AS_STRCAT > > + jnz L(CopyFrom1To16BytesUnalignedXmm5) > > +# else > > + jnz L(CopyFrom1To16Bytes) > > +# endif > > + > > + pcmpeqb %xmm6, %xmm0 > > + pmovmskb %xmm0, %rdx > > + movdqu %xmm5, 16(%rdi) > > + add $16, %rcx > > + sub $16, %r8 > > + jbe L(CopyFrom1To16BytesCase2OrCase3) > > + test %rdx, %rdx > > +# ifndef USE_AS_STRCAT > > + jnz L(CopyFrom1To16BytesUnalignedXmm6) > > +# else > > + jnz L(CopyFrom1To16Bytes) > > +# endif > > + > > + pcmpeqb %xmm7, %xmm0 > > + pmovmskb %xmm0, %rdx > > + movdqu %xmm6, 32(%rdi) > > + lea 16(%rdi, %rcx), %rdi > > + lea 16(%rsi, %rcx), %rsi > > + bsf %rdx, %rdx > > + cmp %r8, %rdx > > + jb 
L(CopyFrom1To16BytesExit) > > + BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4) > > + > > + .p2align 4 > > +L(ExitZero): > > +# ifndef USE_AS_STRCAT > > + mov %rdi, %rax > > +# endif > > + ret > > + > > +# endif > > + > > +# ifndef USE_AS_STRCAT > > +END (STRCPY) > > +# else > > +END (STRCAT) > > +# endif > > + .p2align 4 > > + .section .rodata > > +L(ExitTable): > > + .int JMPTBL(L(Exit1), L(ExitTable)) > > + .int JMPTBL(L(Exit2), L(ExitTable)) > > + .int JMPTBL(L(Exit3), L(ExitTable)) > > + .int JMPTBL(L(Exit4), L(ExitTable)) > > + .int JMPTBL(L(Exit5), L(ExitTable)) > > + .int JMPTBL(L(Exit6), L(ExitTable)) > > + .int JMPTBL(L(Exit7), L(ExitTable)) > > + .int JMPTBL(L(Exit8), L(ExitTable)) > > + .int JMPTBL(L(Exit9), L(ExitTable)) > > + .int JMPTBL(L(Exit10), L(ExitTable)) > > + .int JMPTBL(L(Exit11), L(ExitTable)) > > + .int JMPTBL(L(Exit12), L(ExitTable)) > > + .int JMPTBL(L(Exit13), L(ExitTable)) > > + .int JMPTBL(L(Exit14), L(ExitTable)) > > + .int JMPTBL(L(Exit15), L(ExitTable)) > > + .int JMPTBL(L(Exit16), L(ExitTable)) > > + .int JMPTBL(L(Exit17), L(ExitTable)) > > + .int JMPTBL(L(Exit18), L(ExitTable)) > > + .int JMPTBL(L(Exit19), L(ExitTable)) > > + .int JMPTBL(L(Exit20), L(ExitTable)) > > + .int JMPTBL(L(Exit21), L(ExitTable)) > > + .int JMPTBL(L(Exit22), L(ExitTable)) > > + .int JMPTBL(L(Exit23), L(ExitTable)) > > + .int JMPTBL(L(Exit24), L(ExitTable)) > > + .int JMPTBL(L(Exit25), L(ExitTable)) > > + .int JMPTBL(L(Exit26), L(ExitTable)) > > + .int JMPTBL(L(Exit27), L(ExitTable)) > > + .int JMPTBL(L(Exit28), L(ExitTable)) > > + .int JMPTBL(L(Exit29), L(ExitTable)) > > + .int JMPTBL(L(Exit30), L(ExitTable)) > > + .int JMPTBL(L(Exit31), L(ExitTable)) > > + .int JMPTBL(L(Exit32), L(ExitTable)) > > +# ifdef USE_AS_STRNCPY > > +L(ExitStrncpyTable): > > + .int JMPTBL(L(StrncpyExit0), L(ExitStrncpyTable)) > > + .int JMPTBL(L(StrncpyExit1), L(ExitStrncpyTable)) > > + .int JMPTBL(L(StrncpyExit2), L(ExitStrncpyTable)) > > + .int JMPTBL(L(StrncpyExit3), 
L(ExitStrncpyTable)) > > + .int JMPTBL(L(StrncpyExit4), L(ExitStrncpyTable)) > > + .int JMPTBL(L(StrncpyExit5), L(ExitStrncpyTable)) > > + .int JMPTBL(L(StrncpyExit6), L(ExitStrncpyTable)) > > + .int JMPTBL(L(StrncpyExit7), L(ExitStrncpyTable)) > > + .int JMPTBL(L(StrncpyExit8), L(ExitStrncpyTable)) > > + .int JMPTBL(L(StrncpyExit9), L(ExitStrncpyTable)) > > + .int JMPTBL(L(StrncpyExit10), L(ExitStrncpyTable)) > > + .int JMPTBL(L(StrncpyExit11), L(ExitStrncpyTable)) > > + .int JMPTBL(L(StrncpyExit12), L(ExitStrncpyTable)) > > + .int JMPTBL(L(StrncpyExit13), L(ExitStrncpyTable)) > > + .int JMPTBL(L(StrncpyExit14), L(ExitStrncpyTable)) > > + .int JMPTBL(L(StrncpyExit15), L(ExitStrncpyTable)) > > + .int JMPTBL(L(StrncpyExit16), L(ExitStrncpyTable)) > > + .int JMPTBL(L(StrncpyExit17), L(ExitStrncpyTable)) > > + .int JMPTBL(L(StrncpyExit18), L(ExitStrncpyTable)) > > + .int JMPTBL(L(StrncpyExit19), L(ExitStrncpyTable)) > > + .int JMPTBL(L(StrncpyExit20), L(ExitStrncpyTable)) > > + .int JMPTBL(L(StrncpyExit21), L(ExitStrncpyTable)) > > + .int JMPTBL(L(StrncpyExit22), L(ExitStrncpyTable)) > > + .int JMPTBL(L(StrncpyExit23), L(ExitStrncpyTable)) > > + .int JMPTBL(L(StrncpyExit24), L(ExitStrncpyTable)) > > + .int JMPTBL(L(StrncpyExit25), L(ExitStrncpyTable)) > > + .int JMPTBL(L(StrncpyExit26), L(ExitStrncpyTable)) > > + .int JMPTBL(L(StrncpyExit27), L(ExitStrncpyTable)) > > + .int JMPTBL(L(StrncpyExit28), L(ExitStrncpyTable)) > > + .int JMPTBL(L(StrncpyExit29), L(ExitStrncpyTable)) > > + .int JMPTBL(L(StrncpyExit30), L(ExitStrncpyTable)) > > + .int JMPTBL(L(StrncpyExit31), L(ExitStrncpyTable)) > > + .int JMPTBL(L(StrncpyExit32), L(ExitStrncpyTable)) > > + .int JMPTBL(L(StrncpyExit33), L(ExitStrncpyTable)) > > +# ifndef USE_AS_STRCAT > > + .p2align 4 > > +L(FillTable): > > + .int JMPTBL(L(Fill0), L(FillTable)) > > + .int JMPTBL(L(Fill1), L(FillTable)) > > + .int JMPTBL(L(Fill2), L(FillTable)) > > + .int JMPTBL(L(Fill3), L(FillTable)) > > + .int JMPTBL(L(Fill4), L(FillTable)) 
> > +	.int	JMPTBL(L(Fill5), L(FillTable))
> > +	.int	JMPTBL(L(Fill6), L(FillTable))
> > +	.int	JMPTBL(L(Fill7), L(FillTable))
> > +	.int	JMPTBL(L(Fill8), L(FillTable))
> > +	.int	JMPTBL(L(Fill9), L(FillTable))
> > +	.int	JMPTBL(L(Fill10), L(FillTable))
> > +	.int	JMPTBL(L(Fill11), L(FillTable))
> > +	.int	JMPTBL(L(Fill12), L(FillTable))
> > +	.int	JMPTBL(L(Fill13), L(FillTable))
> > +	.int	JMPTBL(L(Fill14), L(FillTable))
> > +	.int	JMPTBL(L(Fill15), L(FillTable))
> > +	.int	JMPTBL(L(Fill16), L(FillTable))
> > +# endif
> > +# endif
> > +#endif
> > diff --git a/sysdeps/x86_64/multiarch/strncpy.S b/sysdeps/x86_64/multiarch/strncpy.S
> > index 6d87a0b..afbd870 100644
> > --- a/sysdeps/x86_64/multiarch/strncpy.S
> > +++ b/sysdeps/x86_64/multiarch/strncpy.S
> > @@ -1,5 +1,85 @@
> > -/* Multiple versions of strncpy
> > -   All versions must be listed in ifunc-impl-list.c.  */
> > -#define STRCPY strncpy
> > +/* Multiple versions of strncpy
> > +   All versions must be listed in ifunc-impl-list.c.
> > +   Copyright (C) 2009-2015 Free Software Foundation, Inc.
> > +   Contributed by Intel Corporation.
> > +   This file is part of the GNU C Library.
> > +
> > +   The GNU C Library is free software; you can redistribute it and/or
> > +   modify it under the terms of the GNU Lesser General Public
> > +   License as published by the Free Software Foundation; either
> > +   version 2.1 of the License, or (at your option) any later version.
> > +
> > +   The GNU C Library is distributed in the hope that it will be useful,
> > +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> > +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> > +   Lesser General Public License for more details.
> > +
> > +   You should have received a copy of the GNU Lesser General Public
> > +   License along with the GNU C Library; if not, see
> > +   <http://www.gnu.org/licenses/>.  */
> > +
> > +#include <sysdep.h>
> > +#include <init-arch.h>
> > +
> >  #define USE_AS_STRNCPY
> > -#include "strcpy.S"
> > +#ifndef STRNCPY
> > +#define STRNCPY strncpy
> > +#endif
> > +
> > +#ifdef USE_AS_STPCPY
> > +# define STRNCPY_SSSE3 __stpncpy_ssse3
> > +# define STRNCPY_SSE2 __stpncpy_sse2
> > +# define STRNCPY_SSE2_UNALIGNED __stpncpy_sse2_unaligned
> > +# define __GI_STRNCPY __GI_stpncpy
> > +# define __GI___STRNCPY __GI___stpncpy
> > +#else
> > +# define STRNCPY_SSSE3 __strncpy_ssse3
> > +# define STRNCPY_SSE2 __strncpy_sse2
> > +# define STRNCPY_SSE2_UNALIGNED __strncpy_sse2_unaligned
> > +# define __GI_STRNCPY __GI_strncpy
> > +#endif
> > +
> > +
> > +/* Define multiple versions only for the definition in libc.  */
> > +#if IS_IN (libc)
> > +	.text
> > +ENTRY(STRNCPY)
> > +	.type	STRNCPY, @gnu_indirect_function
> > +	cmpl	$0, __cpu_features+KIND_OFFSET(%rip)
> > +	jne	1f
> > +	call	__init_cpu_features
> > +1:	leaq	STRNCPY_SSE2_UNALIGNED(%rip), %rax
> > +	testl	$bit_Fast_Unaligned_Load, __cpu_features+FEATURE_OFFSET+index_Fast_Unaligned_Load(%rip)
> > +	jnz	2f
> > +	leaq	STRNCPY_SSE2(%rip), %rax
> > +	testl	$bit_SSSE3, __cpu_features+CPUID_OFFSET+index_SSSE3(%rip)
> > +	jz	2f
> > +	leaq	STRNCPY_SSSE3(%rip), %rax
> > +2:	ret
> > +END(STRNCPY)
> > +
> > +# undef ENTRY
> > +# define ENTRY(name) \
> > +	.type STRNCPY_SSE2, @function; \
> > +	.align 16; \
> > +	.globl STRNCPY_SSE2; \
> > +	.hidden STRNCPY_SSE2; \
> > +	STRNCPY_SSE2: cfi_startproc; \
> > +	CALL_MCOUNT
> > +# undef END
> > +# define END(name) \
> > +	cfi_endproc; .size STRNCPY_SSE2, .-STRNCPY_SSE2
> > +# undef libc_hidden_builtin_def
> > +/* It doesn't make sense to send libc-internal strcpy calls through a PLT.
> > +   The speedup we get from using SSSE3 instruction is likely eaten away
> > +   by the indirect call in the PLT.  */
> > +# define libc_hidden_builtin_def(name) \
> > +	.globl __GI_STRNCPY; __GI_STRNCPY = STRNCPY_SSE2
> > +# undef libc_hidden_def
> > +# define libc_hidden_def(name) \
> > +	.globl __GI___STRNCPY; __GI___STRNCPY = STRNCPY_SSE2
> > +#endif
> > +
> > +#ifndef USE_AS_STRNCPY
> > +#include "../strcpy.S"
> > +#endif
> > --
> > 1.8.4.rc3
> > --
> > Communications satellite used by the military for star wars.
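The IFUNC resolver in the quoted strncpy.S above picks one of three implementations at load time, in a fixed priority order: the unaligned-load SSE2 version on CPUs flagged with Fast_Unaligned_Load, otherwise SSSE3 if available, otherwise plain SSE2. The branch order can be sketched in C; the struct and function names below are illustrative stand-ins for glibc's `__cpu_features` bits and the real implementations, not the actual glibc API.

```c
/* Hypothetical CPU feature flags mirroring the two bits the resolver
   tests (bit_Fast_Unaligned_Load and bit_SSSE3).  */
struct cpu_features
{
  int fast_unaligned_load;
  int ssse3;
};

typedef char *(*strncpy_fn) (char *, const char *, unsigned long);

/* Stand-ins for the three assembly implementations.  */
static char *strncpy_sse2 (char *d, const char *s, unsigned long n)
{ return d; }
static char *strncpy_ssse3 (char *d, const char *s, unsigned long n)
{ return d; }
static char *strncpy_sse2_unaligned (char *d, const char *s, unsigned long n)
{ return d; }

/* Mirrors the branch order of the assembly resolver: the unaligned-load
   variant wins when unaligned loads are fast; SSSE3 is only a fallback
   candidate, and plain SSE2 is the default.  */
static strncpy_fn
strncpy_resolver (const struct cpu_features *cpu)
{
  if (cpu->fast_unaligned_load)
    return strncpy_sse2_unaligned;
  if (cpu->ssse3)
    return strncpy_ssse3;
  return strncpy_sse2;
}
```

Note that on a CPU with both flags set the unaligned-load version is chosen, matching the patch's point that targeting i7-class hardware (cheap unaligned loads) makes the SSSE3 variant much less important.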
diff --git a/sysdeps/x86_64/multiarch/Makefile b/sysdeps/x86_64/multiarch/Makefile
index d7002a9..c573744 100644
--- a/sysdeps/x86_64/multiarch/Makefile
+++ b/sysdeps/x86_64/multiarch/Makefile
@@ -29,7 +29,7 @@ CFLAGS-strspn-c.c += -msse4
 endif
 
 ifeq (yes,$(config-cflags-avx2))
-sysdep_routines += memset-avx2
+sysdep_routines += memset-avx2 strcpy-avx2 stpcpy-avx2
 endif
 endif
diff --git a/sysdeps/x86_64/multiarch/ifunc-impl-list.c b/sysdeps/x86_64/multiarch/ifunc-impl-list.c
index b64e4f1..d398e43 100644
--- a/sysdeps/x86_64/multiarch/ifunc-impl-list.c
+++ b/sysdeps/x86_64/multiarch/ifunc-impl-list.c
@@ -88,6 +88,7 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
   /* Support sysdeps/x86_64/multiarch/stpcpy.S.  */
   IFUNC_IMPL (i, name, stpcpy,
+	      IFUNC_IMPL_ADD (array, i, stpcpy, HAS_AVX2, __stpcpy_avx2)
	      IFUNC_IMPL_ADD (array, i, stpcpy, HAS_SSSE3, __stpcpy_ssse3)
	      IFUNC_IMPL_ADD (array, i, stpcpy, 1, __stpcpy_sse2_unaligned)
	      IFUNC_IMPL_ADD (array, i, stpcpy, 1, __stpcpy_sse2))
@@ -137,6 +138,7 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
   /* Support sysdeps/x86_64/multiarch/strcpy.S.  */
   IFUNC_IMPL (i, name, strcpy,
+	      IFUNC_IMPL_ADD (array, i, strcpy, HAS_AVX2, __strcpy_avx2)
	      IFUNC_IMPL_ADD (array, i, strcpy, HAS_SSSE3, __strcpy_ssse3)
	      IFUNC_IMPL_ADD (array, i, strcpy, 1, __strcpy_sse2_unaligned)
	      IFUNC_IMPL_ADD (array, i, strcpy, 1, __strcpy_sse2))
diff --git a/sysdeps/x86_64/multiarch/stpcpy-avx2.S b/sysdeps/x86_64/multiarch/stpcpy-avx2.S
new file mode 100644
index 0000000..bd30ef6
--- /dev/null
+++ b/sysdeps/x86_64/multiarch/stpcpy-avx2.S
@@ -0,0 +1,3 @@
+#define USE_AVX2
+#define STPCPY __stpcpy_avx2
+#include "stpcpy-sse2-unaligned.S"
diff --git a/sysdeps/x86_64/multiarch/stpcpy-sse2-unaligned.S b/sysdeps/x86_64/multiarch/stpcpy-sse2-unaligned.S
index 34231f8..695a236 100644
--- a/sysdeps/x86_64/multiarch/stpcpy-sse2-unaligned.S
+++ b/sysdeps/x86_64/multiarch/stpcpy-sse2-unaligned.S
@@ -1,3 +1,436 @@
-#define USE_AS_STPCPY
-#define STRCPY __stpcpy_sse2_unaligned
-#include "strcpy-sse2-unaligned.S"
+/* stpcpy with SSE2 and unaligned load
+   Copyright (C) 2015 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>. 
*/ + +#include <sysdep.h> +#ifndef STPCPY +# define STPCPY __stpcpy_sse2_unaligned +#endif + +ENTRY(STPCPY) + mov %esi, %edx +#ifdef AS_STRCPY + movq %rdi, %rax +#endif + pxor %xmm4, %xmm4 + pxor %xmm5, %xmm5 + andl $4095, %edx + cmp $3968, %edx + ja L(cross_page) + + movdqu (%rsi), %xmm0 + pcmpeqb %xmm0, %xmm4 + pmovmskb %xmm4, %edx + testl %edx, %edx + je L(more16bytes) + bsf %edx, %ecx +#ifndef AS_STRCPY + lea (%rdi, %rcx), %rax +#endif + cmp $7, %ecx + movq (%rsi), %rdx + jb L(less_8_bytesb) +L(8bytes_from_cross): + movq -7(%rsi, %rcx), %rsi + movq %rdx, (%rdi) +#ifdef AS_STRCPY + movq %rsi, -7(%rdi, %rcx) +#else + movq %rsi, -7(%rax) +#endif + ret + + .p2align 4 +L(less_8_bytesb): + cmp $2, %ecx + jbe L(less_4_bytes) +L(4bytes_from_cross): + mov -3(%rsi, %rcx), %esi + mov %edx, (%rdi) +#ifdef AS_STRCPY + mov %esi, -3(%rdi, %rcx) +#else + mov %esi, -3(%rax) +#endif + ret + +.p2align 4 + L(less_4_bytes): + /* + Test branch vs this branchless that works for i 0,1,2 + d[i] = 0; + d[i/2] = s[1]; + d[0] = s[0]; + */ +#ifdef AS_STRCPY + movb $0, (%rdi, %rcx) +#endif + + shr $1, %ecx + mov %edx, %esi + shr $8, %edx + movb %dl, (%rdi, %rcx) +#ifndef AS_STRCPY + movb $0, (%rax) +#endif + movb %sil, (%rdi) + ret + + + + + + .p2align 4 +L(more16bytes): + pxor %xmm6, %xmm6 + movdqu 16(%rsi), %xmm1 + pxor %xmm7, %xmm7 + pcmpeqb %xmm1, %xmm5 + pmovmskb %xmm5, %edx + testl %edx, %edx + je L(more32bytes) + bsf %edx, %edx +#ifdef AS_STRCPY + movdqu 1(%rsi, %rdx), %xmm1 + movdqu %xmm0, (%rdi) + movdqu %xmm1, 1(%rdi, %rdx) +#else + lea 16(%rdi, %rdx), %rax + movdqu 1(%rsi, %rdx), %xmm1 + movdqu %xmm0, (%rdi) + movdqu %xmm1, -15(%rax) +#endif + ret + + .p2align 4 +L(more32bytes): + movdqu 32(%rsi), %xmm2 + movdqu 48(%rsi), %xmm3 + + pcmpeqb %xmm2, %xmm6 + pcmpeqb %xmm3, %xmm7 + pmovmskb %xmm7, %edx + shl $16, %edx + pmovmskb %xmm6, %ecx + or %ecx, %edx + je L(more64bytes) + bsf %edx, %edx +#ifndef AS_STRCPY + lea 32(%rdi, %rdx), %rax +#endif + movdqu 1(%rsi, %rdx), %xmm2 + movdqu 
17(%rsi, %rdx), %xmm3 + movdqu %xmm0, (%rdi) + movdqu %xmm1, 16(%rdi) +#ifdef AS_STRCPY + movdqu %xmm2, 1(%rdi, %rdx) + movdqu %xmm3, 17(%rdi, %rdx) +#else + movdqu %xmm2, -31(%rax) + movdqu %xmm3, -15(%rax) +#endif + ret + + .p2align 4 +L(more64bytes): + movdqu %xmm0, (%rdi) + movdqu %xmm1, 16(%rdi) + movdqu %xmm2, 32(%rdi) + movdqu %xmm3, 48(%rdi) + movdqu 64(%rsi), %xmm0 + movdqu 80(%rsi), %xmm1 + movdqu 96(%rsi), %xmm2 + movdqu 112(%rsi), %xmm3 + + pcmpeqb %xmm0, %xmm4 + pcmpeqb %xmm1, %xmm5 + pcmpeqb %xmm2, %xmm6 + pcmpeqb %xmm3, %xmm7 + pmovmskb %xmm4, %ecx + pmovmskb %xmm5, %edx + pmovmskb %xmm6, %r8d + pmovmskb %xmm7, %r9d + shl $16, %edx + or %ecx, %edx + shl $32, %r8 + shl $48, %r9 + or %r8, %rdx + or %r9, %rdx + test %rdx, %rdx + je L(prepare_loop) + bsf %rdx, %rdx +#ifndef AS_STRCPY + lea 64(%rdi, %rdx), %rax +#endif + movdqu 1(%rsi, %rdx), %xmm0 + movdqu 17(%rsi, %rdx), %xmm1 + movdqu 33(%rsi, %rdx), %xmm2 + movdqu 49(%rsi, %rdx), %xmm3 +#ifdef AS_STRCPY + movdqu %xmm0, 1(%rdi, %rdx) + movdqu %xmm1, 17(%rdi, %rdx) + movdqu %xmm2, 33(%rdi, %rdx) + movdqu %xmm3, 49(%rdi, %rdx) +#else + movdqu %xmm0, -63(%rax) + movdqu %xmm1, -47(%rax) + movdqu %xmm2, -31(%rax) + movdqu %xmm3, -15(%rax) +#endif + ret + + + .p2align 4 +L(prepare_loop): + movdqu %xmm0, 64(%rdi) + movdqu %xmm1, 80(%rdi) + movdqu %xmm2, 96(%rdi) + movdqu %xmm3, 112(%rdi) + + subq %rsi, %rdi + add $64, %rsi + andq $-64, %rsi + addq %rsi, %rdi + jmp L(loop_entry) + +#ifdef USE_AVX2 + .p2align 4 +L(loop): + vmovdqu %ymm1, (%rdi) + vmovdqu %ymm3, 32(%rdi) +L(loop_entry): + vmovdqa 96(%rsi), %ymm3 + vmovdqa 64(%rsi), %ymm1 + vpminub %ymm3, %ymm1, %ymm2 + addq $64, %rsi + addq $64, %rdi + vpcmpeqb %ymm5, %ymm2, %ymm0 + vpmovmskb %ymm0, %edx + test %edx, %edx + je L(loop) + salq $32, %rdx + vpcmpeqb %ymm5, %ymm1, %ymm4 + vpmovmskb %ymm4, %ecx + or %rcx, %rdx + bsfq %rdx, %rdx +#ifndef AS_STRCPY + lea (%rdi, %rdx), %rax +#endif + vmovdqu -63(%rsi, %rdx), %ymm0 + vmovdqu -31(%rsi, %rdx), %ymm2 +#ifdef 
AS_STRCPY + vmovdqu %ymm0, -63(%rdi, %rdx) + vmovdqu %ymm2, -31(%rdi, %rdx) +#else + vmovdqu %ymm0, -63(%rax) + vmovdqu %ymm2, -31(%rax) +#endif + vzeroupper + ret +#else + .p2align 4 +L(loop): + movdqu %xmm1, (%rdi) + movdqu %xmm2, 16(%rdi) + movdqu %xmm3, 32(%rdi) + movdqu %xmm4, 48(%rdi) +L(loop_entry): + movdqa 96(%rsi), %xmm3 + movdqa 112(%rsi), %xmm4 + movdqa %xmm3, %xmm0 + movdqa 80(%rsi), %xmm2 + pminub %xmm4, %xmm0 + movdqa 64(%rsi), %xmm1 + pminub %xmm2, %xmm0 + pminub %xmm1, %xmm0 + addq $64, %rsi + addq $64, %rdi + pcmpeqb %xmm5, %xmm0 + pmovmskb %xmm0, %edx + test %edx, %edx + je L(loop) + salq $48, %rdx + pcmpeqb %xmm1, %xmm5 + pcmpeqb %xmm2, %xmm6 + pmovmskb %xmm5, %ecx +#ifdef AS_STRCPY + pmovmskb %xmm6, %r8d + pcmpeqb %xmm3, %xmm7 + pmovmskb %xmm7, %r9d + sal $16, %r8d + or %r8d, %ecx +#else + pmovmskb %xmm6, %eax + pcmpeqb %xmm3, %xmm7 + pmovmskb %xmm7, %r9d + sal $16, %eax + or %eax, %ecx +#endif + salq $32, %r9 + orq %rcx, %rdx + orq %r9, %rdx + bsfq %rdx, %rdx +#ifndef AS_STRCPY + lea (%rdi, %rdx), %rax +#endif + movdqu -63(%rsi, %rdx), %xmm0 + movdqu -47(%rsi, %rdx), %xmm1 + movdqu -31(%rsi, %rdx), %xmm2 + movdqu -15(%rsi, %rdx), %xmm3 +#ifdef AS_STRCPY + movdqu %xmm0, -63(%rdi, %rdx) + movdqu %xmm1, -47(%rdi, %rdx) + movdqu %xmm2, -31(%rdi, %rdx) + movdqu %xmm3, -15(%rdi, %rdx) +#else + movdqu %xmm0, -63(%rax) + movdqu %xmm1, -47(%rax) + movdqu %xmm2, -31(%rax) + movdqu %xmm3, -15(%rax) +#endif + ret +#endif + + .p2align 4 +L(cross_page): + movq %rsi, %rcx + pxor %xmm0, %xmm0 + and $15, %ecx + movq %rsi, %r9 + movq %rdi, %r10 + subq %rcx, %rsi + subq %rcx, %rdi + movdqa (%rsi), %xmm1 + pcmpeqb %xmm0, %xmm1 + pmovmskb %xmm1, %edx + shr %cl, %edx + shl %cl, %edx + test %edx, %edx + jne L(less_32_cross) + + addq $16, %rsi + addq $16, %rdi + movdqa (%rsi), %xmm1 + pcmpeqb %xmm1, %xmm0 + pmovmskb %xmm0, %edx + test %edx, %edx + jne L(less_32_cross) + movdqu %xmm1, (%rdi) + + movdqu (%r9), %xmm0 + movdqu %xmm0, (%r10) + + mov $8, %rcx 
+L(cross_loop): + addq $16, %rsi + addq $16, %rdi + pxor %xmm0, %xmm0 + movdqa (%rsi), %xmm1 + pcmpeqb %xmm1, %xmm0 + pmovmskb %xmm0, %edx + test %edx, %edx + jne L(return_cross) + movdqu %xmm1, (%rdi) + sub $1, %rcx + ja L(cross_loop) + + pxor %xmm5, %xmm5 + pxor %xmm6, %xmm6 + pxor %xmm7, %xmm7 + + lea -64(%rsi), %rdx + andq $-64, %rdx + addq %rdx, %rdi + subq %rsi, %rdi + movq %rdx, %rsi + jmp L(loop_entry) + + .p2align 4 +L(return_cross): + bsf %edx, %edx +#ifdef AS_STRCPY + movdqu -15(%rsi, %rdx), %xmm0 + movdqu %xmm0, -15(%rdi, %rdx) +#else + lea (%rdi, %rdx), %rax + movdqu -15(%rsi, %rdx), %xmm0 + movdqu %xmm0, -15(%rax) +#endif + ret + + .p2align 4 +L(less_32_cross): + bsf %rdx, %rdx + lea (%rdi, %rdx), %rcx +#ifndef AS_STRCPY + mov %rcx, %rax +#endif + mov %r9, %rsi + mov %r10, %rdi + sub %rdi, %rcx + cmp $15, %ecx + jb L(less_16_cross) + movdqu (%rsi), %xmm0 + movdqu -15(%rsi, %rcx), %xmm1 + movdqu %xmm0, (%rdi) +#ifdef AS_STRCPY + movdqu %xmm1, -15(%rdi, %rcx) +#else + movdqu %xmm1, -15(%rax) +#endif + ret + +L(less_16_cross): + cmp $7, %ecx + jb L(less_8_bytes_cross) + movq (%rsi), %rdx + jmp L(8bytes_from_cross) + +L(less_8_bytes_cross): + cmp $2, %ecx + jbe L(3_bytes_cross) + mov (%rsi), %edx + jmp L(4bytes_from_cross) + +L(3_bytes_cross): + jb L(1_2bytes_cross) + movzwl (%rsi), %edx + jmp L(_3_bytesb) + +L(1_2bytes_cross): + movb (%rsi), %dl + jmp L(0_2bytes_from_cross) + + .p2align 4 +L(less_4_bytesb): + je L(_3_bytesb) +L(0_2bytes_from_cross): + movb %dl, (%rdi) +#ifdef AS_STRCPY + movb $0, (%rdi, %rcx) +#else + movb $0, (%rax) +#endif + ret + + .p2align 4 +L(_3_bytesb): + movw %dx, (%rdi) + movb $0, 2(%rdi) + ret + +END(STPCPY) diff --git a/sysdeps/x86_64/multiarch/stpncpy-sse2-unaligned.S b/sysdeps/x86_64/multiarch/stpncpy-sse2-unaligned.S index 658520f..3f35068 100644 --- a/sysdeps/x86_64/multiarch/stpncpy-sse2-unaligned.S +++ b/sysdeps/x86_64/multiarch/stpncpy-sse2-unaligned.S @@ -1,4 +1,3 @@ #define USE_AS_STPCPY -#define USE_AS_STRNCPY 
 #define STRCPY __stpncpy_sse2_unaligned
-#include "strcpy-sse2-unaligned.S"
+#include "strncpy-sse2-unaligned.S"
diff --git a/sysdeps/x86_64/multiarch/stpncpy.S b/sysdeps/x86_64/multiarch/stpncpy.S
index 2698ca6..159604a 100644
--- a/sysdeps/x86_64/multiarch/stpncpy.S
+++ b/sysdeps/x86_64/multiarch/stpncpy.S
@@ -1,8 +1,7 @@
 /* Multiple versions of stpncpy
    All versions must be listed in ifunc-impl-list.c. */
-#define STRCPY __stpncpy
+#define STRNCPY __stpncpy
 #define USE_AS_STPCPY
-#define USE_AS_STRNCPY
-#include "strcpy.S"
+#include "strncpy.S"
 weak_alias (__stpncpy, stpncpy)
diff --git a/sysdeps/x86_64/multiarch/strcat-sse2-unaligned.S b/sysdeps/x86_64/multiarch/strcat-sse2-unaligned.S
index 81f1b40..1faa49d 100644
--- a/sysdeps/x86_64/multiarch/strcat-sse2-unaligned.S
+++ b/sysdeps/x86_64/multiarch/strcat-sse2-unaligned.S
@@ -275,5 +275,5 @@ L(StartStrcpyPart):
 # define USE_AS_STRNCPY
 # endif
-# include "strcpy-sse2-unaligned.S"
+# include "strncpy-sse2-unaligned.S"
 #endif
diff --git a/sysdeps/x86_64/multiarch/strcpy-avx2.S b/sysdeps/x86_64/multiarch/strcpy-avx2.S
new file mode 100644
index 0000000..a3133a4
--- /dev/null
+++ b/sysdeps/x86_64/multiarch/strcpy-avx2.S
@@ -0,0 +1,4 @@
+#define USE_AVX2
+#define AS_STRCPY
+#define STPCPY __strcpy_avx2
+#include "stpcpy-sse2-unaligned.S"
diff --git a/sysdeps/x86_64/multiarch/strcpy-sse2-unaligned.S b/sysdeps/x86_64/multiarch/strcpy-sse2-unaligned.S
index 8f03d1d..310e4fa 100644
--- a/sysdeps/x86_64/multiarch/strcpy-sse2-unaligned.S
+++ b/sysdeps/x86_64/multiarch/strcpy-sse2-unaligned.S
@@ -1,1887 +1,3 @@
-/* strcpy with SSE2 and unaligned load
-   Copyright (C) 2011-2015 Free Software Foundation, Inc.
-   Contributed by Intel Corporation.
-   This file is part of the GNU C Library.
- - The GNU C Library is free software; you can redistribute it and/or - modify it under the terms of the GNU Lesser General Public - License as published by the Free Software Foundation; either - version 2.1 of the License, or (at your option) any later version. - - The GNU C Library is distributed in the hope that it will be useful, - but WITHOUT ANY WARRANTY; without even the implied warranty of - MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU - Lesser General Public License for more details. - - You should have received a copy of the GNU Lesser General Public - License along with the GNU C Library; if not, see - <http://www.gnu.org/licenses/>. */ - -#if IS_IN (libc) - -# ifndef USE_AS_STRCAT -# include <sysdep.h> - -# ifndef STRCPY -# define STRCPY __strcpy_sse2_unaligned -# endif - -# endif - -# define JMPTBL(I, B) I - B -# define BRANCH_TO_JMPTBL_ENTRY(TABLE, INDEX, SCALE) \ - lea TABLE(%rip), %r11; \ - movslq (%r11, INDEX, SCALE), %rcx; \ - lea (%r11, %rcx), %rcx; \ - jmp *%rcx - -# ifndef USE_AS_STRCAT - -.text -ENTRY (STRCPY) -# ifdef USE_AS_STRNCPY - mov %rdx, %r8 - test %r8, %r8 - jz L(ExitZero) -# endif - mov %rsi, %rcx -# ifndef USE_AS_STPCPY - mov %rdi, %rax /* save result */ -# endif - -# endif - - and $63, %rcx - cmp $32, %rcx - jbe L(SourceStringAlignmentLess32) - - and $-16, %rsi - and $15, %rcx - pxor %xmm0, %xmm0 - pxor %xmm1, %xmm1 - - pcmpeqb (%rsi), %xmm1 - pmovmskb %xmm1, %rdx - shr %cl, %rdx - -# ifdef USE_AS_STRNCPY -# if defined USE_AS_STPCPY || defined USE_AS_STRCAT - mov $16, %r10 - sub %rcx, %r10 - cmp %r10, %r8 -# else - mov $17, %r10 - sub %rcx, %r10 - cmp %r10, %r8 -# endif - jbe L(CopyFrom1To16BytesTailCase2OrCase3) -# endif - test %rdx, %rdx - jnz L(CopyFrom1To16BytesTail) - - pcmpeqb 16(%rsi), %xmm0 - pmovmskb %xmm0, %rdx - -# ifdef USE_AS_STRNCPY - add $16, %r10 - cmp %r10, %r8 - jbe L(CopyFrom1To32BytesCase2OrCase3) -# endif - test %rdx, %rdx - jnz L(CopyFrom1To32Bytes) - - movdqu (%rsi, %rcx), %xmm1 /* copy 
16 bytes */ - movdqu %xmm1, (%rdi) - -/* If source address alignment != destination address alignment */ - .p2align 4 -L(Unalign16Both): - sub %rcx, %rdi -# ifdef USE_AS_STRNCPY - add %rcx, %r8 -# endif - mov $16, %rcx - movdqa (%rsi, %rcx), %xmm1 - movaps 16(%rsi, %rcx), %xmm2 - movdqu %xmm1, (%rdi, %rcx) - pcmpeqb %xmm2, %xmm0 - pmovmskb %xmm0, %rdx - add $16, %rcx -# ifdef USE_AS_STRNCPY - sub $48, %r8 - jbe L(CopyFrom1To16BytesCase2OrCase3) -# endif - test %rdx, %rdx -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT - jnz L(CopyFrom1To16BytesUnalignedXmm2) -# else - jnz L(CopyFrom1To16Bytes) -# endif - - movaps 16(%rsi, %rcx), %xmm3 - movdqu %xmm2, (%rdi, %rcx) - pcmpeqb %xmm3, %xmm0 - pmovmskb %xmm0, %rdx - add $16, %rcx -# ifdef USE_AS_STRNCPY - sub $16, %r8 - jbe L(CopyFrom1To16BytesCase2OrCase3) -# endif - test %rdx, %rdx -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT - jnz L(CopyFrom1To16BytesUnalignedXmm3) -# else - jnz L(CopyFrom1To16Bytes) -# endif - - movaps 16(%rsi, %rcx), %xmm4 - movdqu %xmm3, (%rdi, %rcx) - pcmpeqb %xmm4, %xmm0 - pmovmskb %xmm0, %rdx - add $16, %rcx -# ifdef USE_AS_STRNCPY - sub $16, %r8 - jbe L(CopyFrom1To16BytesCase2OrCase3) -# endif - test %rdx, %rdx -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT - jnz L(CopyFrom1To16BytesUnalignedXmm4) -# else - jnz L(CopyFrom1To16Bytes) -# endif - - movaps 16(%rsi, %rcx), %xmm1 - movdqu %xmm4, (%rdi, %rcx) - pcmpeqb %xmm1, %xmm0 - pmovmskb %xmm0, %rdx - add $16, %rcx -# ifdef USE_AS_STRNCPY - sub $16, %r8 - jbe L(CopyFrom1To16BytesCase2OrCase3) -# endif - test %rdx, %rdx -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT - jnz L(CopyFrom1To16BytesUnalignedXmm1) -# else - jnz L(CopyFrom1To16Bytes) -# endif - - movaps 16(%rsi, %rcx), %xmm2 - movdqu %xmm1, (%rdi, %rcx) - pcmpeqb %xmm2, %xmm0 - pmovmskb %xmm0, %rdx - add $16, %rcx -# ifdef USE_AS_STRNCPY - sub $16, %r8 - jbe L(CopyFrom1To16BytesCase2OrCase3) -# endif - test %rdx, %rdx -# if defined USE_AS_STRNCPY && 
!defined USE_AS_STRCAT - jnz L(CopyFrom1To16BytesUnalignedXmm2) -# else - jnz L(CopyFrom1To16Bytes) -# endif - - movaps 16(%rsi, %rcx), %xmm3 - movdqu %xmm2, (%rdi, %rcx) - pcmpeqb %xmm3, %xmm0 - pmovmskb %xmm0, %rdx - add $16, %rcx -# ifdef USE_AS_STRNCPY - sub $16, %r8 - jbe L(CopyFrom1To16BytesCase2OrCase3) -# endif - test %rdx, %rdx -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT - jnz L(CopyFrom1To16BytesUnalignedXmm3) -# else - jnz L(CopyFrom1To16Bytes) -# endif - - movdqu %xmm3, (%rdi, %rcx) - mov %rsi, %rdx - lea 16(%rsi, %rcx), %rsi - and $-0x40, %rsi - sub %rsi, %rdx - sub %rdx, %rdi -# ifdef USE_AS_STRNCPY - lea 128(%r8, %rdx), %r8 -# endif -L(Unaligned64Loop): - movaps (%rsi), %xmm2 - movaps %xmm2, %xmm4 - movaps 16(%rsi), %xmm5 - movaps 32(%rsi), %xmm3 - movaps %xmm3, %xmm6 - movaps 48(%rsi), %xmm7 - pminub %xmm5, %xmm2 - pminub %xmm7, %xmm3 - pminub %xmm2, %xmm3 - pcmpeqb %xmm0, %xmm3 - pmovmskb %xmm3, %rdx -# ifdef USE_AS_STRNCPY - sub $64, %r8 - jbe L(UnalignedLeaveCase2OrCase3) -# endif - test %rdx, %rdx - jnz L(Unaligned64Leave) - -L(Unaligned64Loop_start): - add $64, %rdi - add $64, %rsi - movdqu %xmm4, -64(%rdi) - movaps (%rsi), %xmm2 - movdqa %xmm2, %xmm4 - movdqu %xmm5, -48(%rdi) - movaps 16(%rsi), %xmm5 - pminub %xmm5, %xmm2 - movaps 32(%rsi), %xmm3 - movdqu %xmm6, -32(%rdi) - movaps %xmm3, %xmm6 - movdqu %xmm7, -16(%rdi) - movaps 48(%rsi), %xmm7 - pminub %xmm7, %xmm3 - pminub %xmm2, %xmm3 - pcmpeqb %xmm0, %xmm3 - pmovmskb %xmm3, %rdx -# ifdef USE_AS_STRNCPY - sub $64, %r8 - jbe L(UnalignedLeaveCase2OrCase3) -# endif - test %rdx, %rdx - jz L(Unaligned64Loop_start) - -L(Unaligned64Leave): - pxor %xmm1, %xmm1 - - pcmpeqb %xmm4, %xmm0 - pcmpeqb %xmm5, %xmm1 - pmovmskb %xmm0, %rdx - pmovmskb %xmm1, %rcx - test %rdx, %rdx - jnz L(CopyFrom1To16BytesUnaligned_0) - test %rcx, %rcx - jnz L(CopyFrom1To16BytesUnaligned_16) - - pcmpeqb %xmm6, %xmm0 - pcmpeqb %xmm7, %xmm1 - pmovmskb %xmm0, %rdx - pmovmskb %xmm1, %rcx - test %rdx, %rdx - jnz 
L(CopyFrom1To16BytesUnaligned_32) - - bsf %rcx, %rdx - movdqu %xmm4, (%rdi) - movdqu %xmm5, 16(%rdi) - movdqu %xmm6, 32(%rdi) -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT -# ifdef USE_AS_STPCPY - lea 48(%rdi, %rdx), %rax -# endif - movdqu %xmm7, 48(%rdi) - add $15, %r8 - sub %rdx, %r8 - lea 49(%rdi, %rdx), %rdi - jmp L(StrncpyFillTailWithZero) -# else - add $48, %rsi - add $48, %rdi - BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4) -# endif - -/* If source address alignment == destination address alignment */ - -L(SourceStringAlignmentLess32): - pxor %xmm0, %xmm0 - movdqu (%rsi), %xmm1 - movdqu 16(%rsi), %xmm2 - pcmpeqb %xmm1, %xmm0 - pmovmskb %xmm0, %rdx - -# ifdef USE_AS_STRNCPY -# if defined USE_AS_STPCPY || defined USE_AS_STRCAT - cmp $16, %r8 -# else - cmp $17, %r8 -# endif - jbe L(CopyFrom1To16BytesTail1Case2OrCase3) -# endif - test %rdx, %rdx - jnz L(CopyFrom1To16BytesTail1) - - pcmpeqb %xmm2, %xmm0 - movdqu %xmm1, (%rdi) - pmovmskb %xmm0, %rdx - -# ifdef USE_AS_STRNCPY -# if defined USE_AS_STPCPY || defined USE_AS_STRCAT - cmp $32, %r8 -# else - cmp $33, %r8 -# endif - jbe L(CopyFrom1To32Bytes1Case2OrCase3) -# endif - test %rdx, %rdx - jnz L(CopyFrom1To32Bytes1) - - and $-16, %rsi - and $15, %rcx - jmp L(Unalign16Both) - -/*------End of main part with loops---------------------*/ - -/* Case1 */ - -# if (!defined USE_AS_STRNCPY) || (defined USE_AS_STRCAT) - .p2align 4 -L(CopyFrom1To16Bytes): - add %rcx, %rdi - add %rcx, %rsi - bsf %rdx, %rdx - BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4) -# endif - .p2align 4 -L(CopyFrom1To16BytesTail): - add %rcx, %rsi - bsf %rdx, %rdx - BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4) - - .p2align 4 -L(CopyFrom1To32Bytes1): - add $16, %rsi - add $16, %rdi -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT - sub $16, %r8 -# endif -L(CopyFrom1To16BytesTail1): - bsf %rdx, %rdx - BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4) - - .p2align 4 -L(CopyFrom1To32Bytes): - bsf %rdx, %rdx - add %rcx, %rsi - add $16, %rdx 
- sub %rcx, %rdx - BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4) - - .p2align 4 -L(CopyFrom1To16BytesUnaligned_0): - bsf %rdx, %rdx -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT -# ifdef USE_AS_STPCPY - lea (%rdi, %rdx), %rax -# endif - movdqu %xmm4, (%rdi) - add $63, %r8 - sub %rdx, %r8 - lea 1(%rdi, %rdx), %rdi - jmp L(StrncpyFillTailWithZero) -# else - BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4) -# endif - - .p2align 4 -L(CopyFrom1To16BytesUnaligned_16): - bsf %rcx, %rdx - movdqu %xmm4, (%rdi) -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT -# ifdef USE_AS_STPCPY - lea 16(%rdi, %rdx), %rax -# endif - movdqu %xmm5, 16(%rdi) - add $47, %r8 - sub %rdx, %r8 - lea 17(%rdi, %rdx), %rdi - jmp L(StrncpyFillTailWithZero) -# else - add $16, %rsi - add $16, %rdi - BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4) -# endif - - .p2align 4 -L(CopyFrom1To16BytesUnaligned_32): - bsf %rdx, %rdx - movdqu %xmm4, (%rdi) - movdqu %xmm5, 16(%rdi) -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT -# ifdef USE_AS_STPCPY - lea 32(%rdi, %rdx), %rax -# endif - movdqu %xmm6, 32(%rdi) - add $31, %r8 - sub %rdx, %r8 - lea 33(%rdi, %rdx), %rdi - jmp L(StrncpyFillTailWithZero) -# else - add $32, %rsi - add $32, %rdi - BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4) -# endif - -# ifdef USE_AS_STRNCPY -# ifndef USE_AS_STRCAT - .p2align 4 -L(CopyFrom1To16BytesUnalignedXmm6): - movdqu %xmm6, (%rdi, %rcx) - jmp L(CopyFrom1To16BytesXmmExit) - - .p2align 4 -L(CopyFrom1To16BytesUnalignedXmm5): - movdqu %xmm5, (%rdi, %rcx) - jmp L(CopyFrom1To16BytesXmmExit) - - .p2align 4 -L(CopyFrom1To16BytesUnalignedXmm4): - movdqu %xmm4, (%rdi, %rcx) - jmp L(CopyFrom1To16BytesXmmExit) - - .p2align 4 -L(CopyFrom1To16BytesUnalignedXmm3): - movdqu %xmm3, (%rdi, %rcx) - jmp L(CopyFrom1To16BytesXmmExit) - - .p2align 4 -L(CopyFrom1To16BytesUnalignedXmm1): - movdqu %xmm1, (%rdi, %rcx) - jmp L(CopyFrom1To16BytesXmmExit) -# endif - - .p2align 4 -L(CopyFrom1To16BytesExit): - BRANCH_TO_JMPTBL_ENTRY 
(L(ExitTable), %rdx, 4) - -/* Case2 */ - - .p2align 4 -L(CopyFrom1To16BytesCase2): - add $16, %r8 - add %rcx, %rdi - add %rcx, %rsi - bsf %rdx, %rdx - cmp %r8, %rdx - jb L(CopyFrom1To16BytesExit) - BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4) - - .p2align 4 -L(CopyFrom1To32BytesCase2): - add %rcx, %rsi - bsf %rdx, %rdx - add $16, %rdx - sub %rcx, %rdx - cmp %r8, %rdx - jb L(CopyFrom1To16BytesExit) - BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4) - -L(CopyFrom1To16BytesTailCase2): - add %rcx, %rsi - bsf %rdx, %rdx - cmp %r8, %rdx - jb L(CopyFrom1To16BytesExit) - BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4) - -L(CopyFrom1To16BytesTail1Case2): - bsf %rdx, %rdx - cmp %r8, %rdx - jb L(CopyFrom1To16BytesExit) - BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4) - -/* Case2 or Case3, Case3 */ - - .p2align 4 -L(CopyFrom1To16BytesCase2OrCase3): - test %rdx, %rdx - jnz L(CopyFrom1To16BytesCase2) -L(CopyFrom1To16BytesCase3): - add $16, %r8 - add %rcx, %rdi - add %rcx, %rsi - BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4) - - .p2align 4 -L(CopyFrom1To32BytesCase2OrCase3): - test %rdx, %rdx - jnz L(CopyFrom1To32BytesCase2) - add %rcx, %rsi - BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4) - - .p2align 4 -L(CopyFrom1To16BytesTailCase2OrCase3): - test %rdx, %rdx - jnz L(CopyFrom1To16BytesTailCase2) - add %rcx, %rsi - BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4) - - .p2align 4 -L(CopyFrom1To32Bytes1Case2OrCase3): - add $16, %rdi - add $16, %rsi - sub $16, %r8 -L(CopyFrom1To16BytesTail1Case2OrCase3): - test %rdx, %rdx - jnz L(CopyFrom1To16BytesTail1Case2) - BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4) - -# endif - -/*------------End labels regarding with copying 1-16 bytes--and 1-32 bytes----*/ - - .p2align 4 -L(Exit1): - mov %dh, (%rdi) -# ifdef USE_AS_STPCPY - lea (%rdi), %rax -# endif -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT - sub $1, %r8 - lea 1(%rdi), %rdi - jnz L(StrncpyFillTailWithZero) -# endif - ret - - 
.p2align 4 -L(Exit2): - mov (%rsi), %dx - mov %dx, (%rdi) -# ifdef USE_AS_STPCPY - lea 1(%rdi), %rax -# endif -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT - sub $2, %r8 - lea 2(%rdi), %rdi - jnz L(StrncpyFillTailWithZero) -# endif - ret - - .p2align 4 -L(Exit3): - mov (%rsi), %cx - mov %cx, (%rdi) - mov %dh, 2(%rdi) -# ifdef USE_AS_STPCPY - lea 2(%rdi), %rax -# endif -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT - sub $3, %r8 - lea 3(%rdi), %rdi - jnz L(StrncpyFillTailWithZero) -# endif - ret - - .p2align 4 -L(Exit4): - mov (%rsi), %edx - mov %edx, (%rdi) -# ifdef USE_AS_STPCPY - lea 3(%rdi), %rax -# endif -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT - sub $4, %r8 - lea 4(%rdi), %rdi - jnz L(StrncpyFillTailWithZero) -# endif - ret - - .p2align 4 -L(Exit5): - mov (%rsi), %ecx - mov %dh, 4(%rdi) - mov %ecx, (%rdi) -# ifdef USE_AS_STPCPY - lea 4(%rdi), %rax -# endif -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT - sub $5, %r8 - lea 5(%rdi), %rdi - jnz L(StrncpyFillTailWithZero) -# endif - ret - - .p2align 4 -L(Exit6): - mov (%rsi), %ecx - mov 4(%rsi), %dx - mov %ecx, (%rdi) - mov %dx, 4(%rdi) -# ifdef USE_AS_STPCPY - lea 5(%rdi), %rax -# endif -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT - sub $6, %r8 - lea 6(%rdi), %rdi - jnz L(StrncpyFillTailWithZero) -# endif - ret - - .p2align 4 -L(Exit7): - mov (%rsi), %ecx - mov 3(%rsi), %edx - mov %ecx, (%rdi) - mov %edx, 3(%rdi) -# ifdef USE_AS_STPCPY - lea 6(%rdi), %rax -# endif -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT - sub $7, %r8 - lea 7(%rdi), %rdi - jnz L(StrncpyFillTailWithZero) -# endif - ret - - .p2align 4 -L(Exit8): - mov (%rsi), %rdx - mov %rdx, (%rdi) -# ifdef USE_AS_STPCPY - lea 7(%rdi), %rax -# endif -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT - sub $8, %r8 - lea 8(%rdi), %rdi - jnz L(StrncpyFillTailWithZero) -# endif - ret - - .p2align 4 -L(Exit9): - mov (%rsi), %rcx - mov %dh, 8(%rdi) - mov %rcx, (%rdi) -# ifdef USE_AS_STPCPY - lea 8(%rdi), 
%rax -# endif -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT - sub $9, %r8 - lea 9(%rdi), %rdi - jnz L(StrncpyFillTailWithZero) -# endif - ret - - .p2align 4 -L(Exit10): - mov (%rsi), %rcx - mov 8(%rsi), %dx - mov %rcx, (%rdi) - mov %dx, 8(%rdi) -# ifdef USE_AS_STPCPY - lea 9(%rdi), %rax -# endif -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT - sub $10, %r8 - lea 10(%rdi), %rdi - jnz L(StrncpyFillTailWithZero) -# endif - ret - - .p2align 4 -L(Exit11): - mov (%rsi), %rcx - mov 7(%rsi), %edx - mov %rcx, (%rdi) - mov %edx, 7(%rdi) -# ifdef USE_AS_STPCPY - lea 10(%rdi), %rax -# endif -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT - sub $11, %r8 - lea 11(%rdi), %rdi - jnz L(StrncpyFillTailWithZero) -# endif - ret - - .p2align 4 -L(Exit12): - mov (%rsi), %rcx - mov 8(%rsi), %edx - mov %rcx, (%rdi) - mov %edx, 8(%rdi) -# ifdef USE_AS_STPCPY - lea 11(%rdi), %rax -# endif -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT - sub $12, %r8 - lea 12(%rdi), %rdi - jnz L(StrncpyFillTailWithZero) -# endif - ret - - .p2align 4 -L(Exit13): - mov (%rsi), %rcx - mov 5(%rsi), %rdx - mov %rcx, (%rdi) - mov %rdx, 5(%rdi) -# ifdef USE_AS_STPCPY - lea 12(%rdi), %rax -# endif -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT - sub $13, %r8 - lea 13(%rdi), %rdi - jnz L(StrncpyFillTailWithZero) -# endif - ret - - .p2align 4 -L(Exit14): - mov (%rsi), %rcx - mov 6(%rsi), %rdx - mov %rcx, (%rdi) - mov %rdx, 6(%rdi) -# ifdef USE_AS_STPCPY - lea 13(%rdi), %rax -# endif -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT - sub $14, %r8 - lea 14(%rdi), %rdi - jnz L(StrncpyFillTailWithZero) -# endif - ret - - .p2align 4 -L(Exit15): - mov (%rsi), %rcx - mov 7(%rsi), %rdx - mov %rcx, (%rdi) - mov %rdx, 7(%rdi) -# ifdef USE_AS_STPCPY - lea 14(%rdi), %rax -# endif -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT - sub $15, %r8 - lea 15(%rdi), %rdi - jnz L(StrncpyFillTailWithZero) -# endif - ret - - .p2align 4 -L(Exit16): - movdqu (%rsi), %xmm0 - movdqu %xmm0, 
(%rdi) -# ifdef USE_AS_STPCPY - lea 15(%rdi), %rax -# endif -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT - sub $16, %r8 - lea 16(%rdi), %rdi - jnz L(StrncpyFillTailWithZero) -# endif - ret - - .p2align 4 -L(Exit17): - movdqu (%rsi), %xmm0 - movdqu %xmm0, (%rdi) - mov %dh, 16(%rdi) -# ifdef USE_AS_STPCPY - lea 16(%rdi), %rax -# endif -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT - sub $17, %r8 - lea 17(%rdi), %rdi - jnz L(StrncpyFillTailWithZero) -# endif - ret - - .p2align 4 -L(Exit18): - movdqu (%rsi), %xmm0 - mov 16(%rsi), %cx - movdqu %xmm0, (%rdi) - mov %cx, 16(%rdi) -# ifdef USE_AS_STPCPY - lea 17(%rdi), %rax -# endif -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT - sub $18, %r8 - lea 18(%rdi), %rdi - jnz L(StrncpyFillTailWithZero) -# endif - ret - - .p2align 4 -L(Exit19): - movdqu (%rsi), %xmm0 - mov 15(%rsi), %ecx - movdqu %xmm0, (%rdi) - mov %ecx, 15(%rdi) -# ifdef USE_AS_STPCPY - lea 18(%rdi), %rax -# endif -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT - sub $19, %r8 - lea 19(%rdi), %rdi - jnz L(StrncpyFillTailWithZero) -# endif - ret - - .p2align 4 -L(Exit20): - movdqu (%rsi), %xmm0 - mov 16(%rsi), %ecx - movdqu %xmm0, (%rdi) - mov %ecx, 16(%rdi) -# ifdef USE_AS_STPCPY - lea 19(%rdi), %rax -# endif -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT - sub $20, %r8 - lea 20(%rdi), %rdi - jnz L(StrncpyFillTailWithZero) -# endif - ret - - .p2align 4 -L(Exit21): - movdqu (%rsi), %xmm0 - mov 16(%rsi), %ecx - movdqu %xmm0, (%rdi) - mov %ecx, 16(%rdi) - mov %dh, 20(%rdi) -# ifdef USE_AS_STPCPY - lea 20(%rdi), %rax -# endif -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT - sub $21, %r8 - lea 21(%rdi), %rdi - jnz L(StrncpyFillTailWithZero) -# endif - ret - - .p2align 4 -L(Exit22): - movdqu (%rsi), %xmm0 - mov 14(%rsi), %rcx - movdqu %xmm0, (%rdi) - mov %rcx, 14(%rdi) -# ifdef USE_AS_STPCPY - lea 21(%rdi), %rax -# endif -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT - sub $22, %r8 - lea 22(%rdi), %rdi - jnz 
L(StrncpyFillTailWithZero) -# endif - ret - - .p2align 4 -L(Exit23): - movdqu (%rsi), %xmm0 - mov 15(%rsi), %rcx - movdqu %xmm0, (%rdi) - mov %rcx, 15(%rdi) -# ifdef USE_AS_STPCPY - lea 22(%rdi), %rax -# endif -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT - sub $23, %r8 - lea 23(%rdi), %rdi - jnz L(StrncpyFillTailWithZero) -# endif - ret - - .p2align 4 -L(Exit24): - movdqu (%rsi), %xmm0 - mov 16(%rsi), %rcx - movdqu %xmm0, (%rdi) - mov %rcx, 16(%rdi) -# ifdef USE_AS_STPCPY - lea 23(%rdi), %rax -# endif -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT - sub $24, %r8 - lea 24(%rdi), %rdi - jnz L(StrncpyFillTailWithZero) -# endif - ret - - .p2align 4 -L(Exit25): - movdqu (%rsi), %xmm0 - mov 16(%rsi), %rcx - movdqu %xmm0, (%rdi) - mov %rcx, 16(%rdi) - mov %dh, 24(%rdi) -# ifdef USE_AS_STPCPY - lea 24(%rdi), %rax -# endif -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT - sub $25, %r8 - lea 25(%rdi), %rdi - jnz L(StrncpyFillTailWithZero) -# endif - ret - - .p2align 4 -L(Exit26): - movdqu (%rsi), %xmm0 - mov 16(%rsi), %rdx - mov 24(%rsi), %cx - movdqu %xmm0, (%rdi) - mov %rdx, 16(%rdi) - mov %cx, 24(%rdi) -# ifdef USE_AS_STPCPY - lea 25(%rdi), %rax -# endif -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT - sub $26, %r8 - lea 26(%rdi), %rdi - jnz L(StrncpyFillTailWithZero) -# endif - ret - - .p2align 4 -L(Exit27): - movdqu (%rsi), %xmm0 - mov 16(%rsi), %rdx - mov 23(%rsi), %ecx - movdqu %xmm0, (%rdi) - mov %rdx, 16(%rdi) - mov %ecx, 23(%rdi) -# ifdef USE_AS_STPCPY - lea 26(%rdi), %rax -# endif -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT - sub $27, %r8 - lea 27(%rdi), %rdi - jnz L(StrncpyFillTailWithZero) -# endif - ret - - .p2align 4 -L(Exit28): - movdqu (%rsi), %xmm0 - mov 16(%rsi), %rdx - mov 24(%rsi), %ecx - movdqu %xmm0, (%rdi) - mov %rdx, 16(%rdi) - mov %ecx, 24(%rdi) -# ifdef USE_AS_STPCPY - lea 27(%rdi), %rax -# endif -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT - sub $28, %r8 - lea 28(%rdi), %rdi - jnz 
L(StrncpyFillTailWithZero) -# endif - ret - - .p2align 4 -L(Exit29): - movdqu (%rsi), %xmm0 - movdqu 13(%rsi), %xmm2 - movdqu %xmm0, (%rdi) - movdqu %xmm2, 13(%rdi) -# ifdef USE_AS_STPCPY - lea 28(%rdi), %rax -# endif -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT - sub $29, %r8 - lea 29(%rdi), %rdi - jnz L(StrncpyFillTailWithZero) -# endif - ret - - .p2align 4 -L(Exit30): - movdqu (%rsi), %xmm0 - movdqu 14(%rsi), %xmm2 - movdqu %xmm0, (%rdi) - movdqu %xmm2, 14(%rdi) -# ifdef USE_AS_STPCPY - lea 29(%rdi), %rax -# endif -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT - sub $30, %r8 - lea 30(%rdi), %rdi - jnz L(StrncpyFillTailWithZero) -# endif - ret - - .p2align 4 -L(Exit31): - movdqu (%rsi), %xmm0 - movdqu 15(%rsi), %xmm2 - movdqu %xmm0, (%rdi) - movdqu %xmm2, 15(%rdi) -# ifdef USE_AS_STPCPY - lea 30(%rdi), %rax -# endif -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT - sub $31, %r8 - lea 31(%rdi), %rdi - jnz L(StrncpyFillTailWithZero) -# endif - ret - - .p2align 4 -L(Exit32): - movdqu (%rsi), %xmm0 - movdqu 16(%rsi), %xmm2 - movdqu %xmm0, (%rdi) - movdqu %xmm2, 16(%rdi) -# ifdef USE_AS_STPCPY - lea 31(%rdi), %rax -# endif -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT - sub $32, %r8 - lea 32(%rdi), %rdi - jnz L(StrncpyFillTailWithZero) -# endif - ret - -# ifdef USE_AS_STRNCPY - - .p2align 4 -L(StrncpyExit0): -# ifdef USE_AS_STPCPY - mov %rdi, %rax -# endif -# ifdef USE_AS_STRCAT - xor %ch, %ch - movb %ch, (%rdi) -# endif - ret - - .p2align 4 -L(StrncpyExit1): - mov (%rsi), %dl - mov %dl, (%rdi) -# ifdef USE_AS_STPCPY - lea 1(%rdi), %rax -# endif -# ifdef USE_AS_STRCAT - xor %ch, %ch - movb %ch, 1(%rdi) -# endif - ret - - .p2align 4 -L(StrncpyExit2): - mov (%rsi), %dx - mov %dx, (%rdi) -# ifdef USE_AS_STPCPY - lea 2(%rdi), %rax -# endif -# ifdef USE_AS_STRCAT - xor %ch, %ch - movb %ch, 2(%rdi) -# endif - ret - - .p2align 4 -L(StrncpyExit3): - mov (%rsi), %cx - mov 2(%rsi), %dl - mov %cx, (%rdi) - mov %dl, 2(%rdi) -# ifdef USE_AS_STPCPY 
- lea 3(%rdi), %rax -# endif -# ifdef USE_AS_STRCAT - xor %ch, %ch - movb %ch, 3(%rdi) -# endif - ret - - .p2align 4 -L(StrncpyExit4): - mov (%rsi), %edx - mov %edx, (%rdi) -# ifdef USE_AS_STPCPY - lea 4(%rdi), %rax -# endif -# ifdef USE_AS_STRCAT - xor %ch, %ch - movb %ch, 4(%rdi) -# endif - ret - - .p2align 4 -L(StrncpyExit5): - mov (%rsi), %ecx - mov 4(%rsi), %dl - mov %ecx, (%rdi) - mov %dl, 4(%rdi) -# ifdef USE_AS_STPCPY - lea 5(%rdi), %rax -# endif -# ifdef USE_AS_STRCAT - xor %ch, %ch - movb %ch, 5(%rdi) -# endif - ret - - .p2align 4 -L(StrncpyExit6): - mov (%rsi), %ecx - mov 4(%rsi), %dx - mov %ecx, (%rdi) - mov %dx, 4(%rdi) -# ifdef USE_AS_STPCPY - lea 6(%rdi), %rax -# endif -# ifdef USE_AS_STRCAT - xor %ch, %ch - movb %ch, 6(%rdi) -# endif - ret - - .p2align 4 -L(StrncpyExit7): - mov (%rsi), %ecx - mov 3(%rsi), %edx - mov %ecx, (%rdi) - mov %edx, 3(%rdi) -# ifdef USE_AS_STPCPY - lea 7(%rdi), %rax -# endif -# ifdef USE_AS_STRCAT - xor %ch, %ch - movb %ch, 7(%rdi) -# endif - ret - - .p2align 4 -L(StrncpyExit8): - mov (%rsi), %rdx - mov %rdx, (%rdi) -# ifdef USE_AS_STPCPY - lea 8(%rdi), %rax -# endif -# ifdef USE_AS_STRCAT - xor %ch, %ch - movb %ch, 8(%rdi) -# endif - ret - - .p2align 4 -L(StrncpyExit9): - mov (%rsi), %rcx - mov 8(%rsi), %dl - mov %rcx, (%rdi) - mov %dl, 8(%rdi) -# ifdef USE_AS_STPCPY - lea 9(%rdi), %rax -# endif -# ifdef USE_AS_STRCAT - xor %ch, %ch - movb %ch, 9(%rdi) -# endif - ret - - .p2align 4 -L(StrncpyExit10): - mov (%rsi), %rcx - mov 8(%rsi), %dx - mov %rcx, (%rdi) - mov %dx, 8(%rdi) -# ifdef USE_AS_STPCPY - lea 10(%rdi), %rax -# endif -# ifdef USE_AS_STRCAT - xor %ch, %ch - movb %ch, 10(%rdi) -# endif - ret - - .p2align 4 -L(StrncpyExit11): - mov (%rsi), %rcx - mov 7(%rsi), %edx - mov %rcx, (%rdi) - mov %edx, 7(%rdi) -# ifdef USE_AS_STPCPY - lea 11(%rdi), %rax -# endif -# ifdef USE_AS_STRCAT - xor %ch, %ch - movb %ch, 11(%rdi) -# endif - ret - - .p2align 4 -L(StrncpyExit12): - mov (%rsi), %rcx - mov 8(%rsi), %edx - mov %rcx, (%rdi) 
- mov %edx, 8(%rdi) -# ifdef USE_AS_STPCPY - lea 12(%rdi), %rax -# endif -# ifdef USE_AS_STRCAT - xor %ch, %ch - movb %ch, 12(%rdi) -# endif - ret - - .p2align 4 -L(StrncpyExit13): - mov (%rsi), %rcx - mov 5(%rsi), %rdx - mov %rcx, (%rdi) - mov %rdx, 5(%rdi) -# ifdef USE_AS_STPCPY - lea 13(%rdi), %rax -# endif -# ifdef USE_AS_STRCAT - xor %ch, %ch - movb %ch, 13(%rdi) -# endif - ret - - .p2align 4 -L(StrncpyExit14): - mov (%rsi), %rcx - mov 6(%rsi), %rdx - mov %rcx, (%rdi) - mov %rdx, 6(%rdi) -# ifdef USE_AS_STPCPY - lea 14(%rdi), %rax -# endif -# ifdef USE_AS_STRCAT - xor %ch, %ch - movb %ch, 14(%rdi) -# endif - ret - - .p2align 4 -L(StrncpyExit15): - mov (%rsi), %rcx - mov 7(%rsi), %rdx - mov %rcx, (%rdi) - mov %rdx, 7(%rdi) -# ifdef USE_AS_STPCPY - lea 15(%rdi), %rax -# endif -# ifdef USE_AS_STRCAT - xor %ch, %ch - movb %ch, 15(%rdi) -# endif - ret - - .p2align 4 -L(StrncpyExit16): - movdqu (%rsi), %xmm0 - movdqu %xmm0, (%rdi) -# ifdef USE_AS_STPCPY - lea 16(%rdi), %rax -# endif -# ifdef USE_AS_STRCAT - xor %ch, %ch - movb %ch, 16(%rdi) -# endif - ret - - .p2align 4 -L(StrncpyExit17): - movdqu (%rsi), %xmm0 - mov 16(%rsi), %cl - movdqu %xmm0, (%rdi) - mov %cl, 16(%rdi) -# ifdef USE_AS_STPCPY - lea 17(%rdi), %rax -# endif -# ifdef USE_AS_STRCAT - xor %ch, %ch - movb %ch, 17(%rdi) -# endif - ret - - .p2align 4 -L(StrncpyExit18): - movdqu (%rsi), %xmm0 - mov 16(%rsi), %cx - movdqu %xmm0, (%rdi) - mov %cx, 16(%rdi) -# ifdef USE_AS_STPCPY - lea 18(%rdi), %rax -# endif -# ifdef USE_AS_STRCAT - xor %ch, %ch - movb %ch, 18(%rdi) -# endif - ret - - .p2align 4 -L(StrncpyExit19): - movdqu (%rsi), %xmm0 - mov 15(%rsi), %ecx - movdqu %xmm0, (%rdi) - mov %ecx, 15(%rdi) -# ifdef USE_AS_STPCPY - lea 19(%rdi), %rax -# endif -# ifdef USE_AS_STRCAT - xor %ch, %ch - movb %ch, 19(%rdi) -# endif - ret - - .p2align 4 -L(StrncpyExit20): - movdqu (%rsi), %xmm0 - mov 16(%rsi), %ecx - movdqu %xmm0, (%rdi) - mov %ecx, 16(%rdi) -# ifdef USE_AS_STPCPY - lea 20(%rdi), %rax -# endif -# ifdef 
USE_AS_STRCAT - xor %ch, %ch - movb %ch, 20(%rdi) -# endif - ret - - .p2align 4 -L(StrncpyExit21): - movdqu (%rsi), %xmm0 - mov 16(%rsi), %ecx - mov 20(%rsi), %dl - movdqu %xmm0, (%rdi) - mov %ecx, 16(%rdi) - mov %dl, 20(%rdi) -# ifdef USE_AS_STPCPY - lea 21(%rdi), %rax -# endif -# ifdef USE_AS_STRCAT - xor %ch, %ch - movb %ch, 21(%rdi) -# endif - ret - - .p2align 4 -L(StrncpyExit22): - movdqu (%rsi), %xmm0 - mov 14(%rsi), %rcx - movdqu %xmm0, (%rdi) - mov %rcx, 14(%rdi) -# ifdef USE_AS_STPCPY - lea 22(%rdi), %rax -# endif -# ifdef USE_AS_STRCAT - xor %ch, %ch - movb %ch, 22(%rdi) -# endif - ret - - .p2align 4 -L(StrncpyExit23): - movdqu (%rsi), %xmm0 - mov 15(%rsi), %rcx - movdqu %xmm0, (%rdi) - mov %rcx, 15(%rdi) -# ifdef USE_AS_STPCPY - lea 23(%rdi), %rax -# endif -# ifdef USE_AS_STRCAT - xor %ch, %ch - movb %ch, 23(%rdi) -# endif - ret - - .p2align 4 -L(StrncpyExit24): - movdqu (%rsi), %xmm0 - mov 16(%rsi), %rcx - movdqu %xmm0, (%rdi) - mov %rcx, 16(%rdi) -# ifdef USE_AS_STPCPY - lea 24(%rdi), %rax -# endif -# ifdef USE_AS_STRCAT - xor %ch, %ch - movb %ch, 24(%rdi) -# endif - ret - - .p2align 4 -L(StrncpyExit25): - movdqu (%rsi), %xmm0 - mov 16(%rsi), %rdx - mov 24(%rsi), %cl - movdqu %xmm0, (%rdi) - mov %rdx, 16(%rdi) - mov %cl, 24(%rdi) -# ifdef USE_AS_STPCPY - lea 25(%rdi), %rax -# endif -# ifdef USE_AS_STRCAT - xor %ch, %ch - movb %ch, 25(%rdi) -# endif - ret - - .p2align 4 -L(StrncpyExit26): - movdqu (%rsi), %xmm0 - mov 16(%rsi), %rdx - mov 24(%rsi), %cx - movdqu %xmm0, (%rdi) - mov %rdx, 16(%rdi) - mov %cx, 24(%rdi) -# ifdef USE_AS_STPCPY - lea 26(%rdi), %rax -# endif -# ifdef USE_AS_STRCAT - xor %ch, %ch - movb %ch, 26(%rdi) -# endif - ret - - .p2align 4 -L(StrncpyExit27): - movdqu (%rsi), %xmm0 - mov 16(%rsi), %rdx - mov 23(%rsi), %ecx - movdqu %xmm0, (%rdi) - mov %rdx, 16(%rdi) - mov %ecx, 23(%rdi) -# ifdef USE_AS_STPCPY - lea 27(%rdi), %rax -# endif -# ifdef USE_AS_STRCAT - xor %ch, %ch - movb %ch, 27(%rdi) -# endif - ret - - .p2align 4 
-L(StrncpyExit28):
-	movdqu	(%rsi), %xmm0
-	mov	16(%rsi), %rdx
-	mov	24(%rsi), %ecx
-	movdqu	%xmm0, (%rdi)
-	mov	%rdx, 16(%rdi)
-	mov	%ecx, 24(%rdi)
-# ifdef USE_AS_STPCPY
-	lea	28(%rdi), %rax
-# endif
-# ifdef USE_AS_STRCAT
-	xor	%ch, %ch
-	movb	%ch, 28(%rdi)
-# endif
-	ret
-
-	.p2align 4
-L(StrncpyExit29):
-	movdqu	(%rsi), %xmm0
-	movdqu	13(%rsi), %xmm2
-	movdqu	%xmm0, (%rdi)
-	movdqu	%xmm2, 13(%rdi)
-# ifdef USE_AS_STPCPY
-	lea	29(%rdi), %rax
-# endif
-# ifdef USE_AS_STRCAT
-	xor	%ch, %ch
-	movb	%ch, 29(%rdi)
-# endif
-	ret
-
-	.p2align 4
-L(StrncpyExit30):
-	movdqu	(%rsi), %xmm0
-	movdqu	14(%rsi), %xmm2
-	movdqu	%xmm0, (%rdi)
-	movdqu	%xmm2, 14(%rdi)
-# ifdef USE_AS_STPCPY
-	lea	30(%rdi), %rax
-# endif
-# ifdef USE_AS_STRCAT
-	xor	%ch, %ch
-	movb	%ch, 30(%rdi)
-# endif
-	ret
-
-	.p2align 4
-L(StrncpyExit31):
-	movdqu	(%rsi), %xmm0
-	movdqu	15(%rsi), %xmm2
-	movdqu	%xmm0, (%rdi)
-	movdqu	%xmm2, 15(%rdi)
-# ifdef USE_AS_STPCPY
-	lea	31(%rdi), %rax
-# endif
-# ifdef USE_AS_STRCAT
-	xor	%ch, %ch
-	movb	%ch, 31(%rdi)
-# endif
-	ret
-
-	.p2align 4
-L(StrncpyExit32):
-	movdqu	(%rsi), %xmm0
-	movdqu	16(%rsi), %xmm2
-	movdqu	%xmm0, (%rdi)
-	movdqu	%xmm2, 16(%rdi)
-# ifdef USE_AS_STPCPY
-	lea	32(%rdi), %rax
-# endif
-# ifdef USE_AS_STRCAT
-	xor	%ch, %ch
-	movb	%ch, 32(%rdi)
-# endif
-	ret
-
-	.p2align 4
-L(StrncpyExit33):
-	movdqu	(%rsi), %xmm0
-	movdqu	16(%rsi), %xmm2
-	mov	32(%rsi), %cl
-	movdqu	%xmm0, (%rdi)
-	movdqu	%xmm2, 16(%rdi)
-	mov	%cl, 32(%rdi)
-# ifdef USE_AS_STRCAT
-	xor	%ch, %ch
-	movb	%ch, 33(%rdi)
-# endif
-	ret
-
-# ifndef USE_AS_STRCAT
-
-	.p2align 4
-L(Fill0):
-	ret
-
-	.p2align 4
-L(Fill1):
-	mov	%dl, (%rdi)
-	ret
-
-	.p2align 4
-L(Fill2):
-	mov	%dx, (%rdi)
-	ret
-
-	.p2align 4
-L(Fill3):
-	mov	%edx, -1(%rdi)
-	ret
-
-	.p2align 4
-L(Fill4):
-	mov	%edx, (%rdi)
-	ret
-
-	.p2align 4
-L(Fill5):
-	mov	%edx, (%rdi)
-	mov	%dl, 4(%rdi)
-	ret
-
-	.p2align 4
-L(Fill6):
-	mov	%edx, (%rdi)
-	mov	%dx, 4(%rdi)
-	ret
-
-	.p2align 4
-L(Fill7):
-	mov	%rdx, -1(%rdi)
-	ret
-
-	.p2align 4
-L(Fill8):
-	mov	%rdx, (%rdi)
-	ret
-
-	.p2align 4
-L(Fill9):
-	mov	%rdx, (%rdi)
-	mov	%dl, 8(%rdi)
-	ret
-
-	.p2align 4
-L(Fill10):
-	mov	%rdx, (%rdi)
-	mov	%dx, 8(%rdi)
-	ret
-
-	.p2align 4
-L(Fill11):
-	mov	%rdx, (%rdi)
-	mov	%edx, 7(%rdi)
-	ret
-
-	.p2align 4
-L(Fill12):
-	mov	%rdx, (%rdi)
-	mov	%edx, 8(%rdi)
-	ret
-
-	.p2align 4
-L(Fill13):
-	mov	%rdx, (%rdi)
-	mov	%rdx, 5(%rdi)
-	ret
-
-	.p2align 4
-L(Fill14):
-	mov	%rdx, (%rdi)
-	mov	%rdx, 6(%rdi)
-	ret
-
-	.p2align 4
-L(Fill15):
-	movdqu	%xmm0, -1(%rdi)
-	ret
-
-	.p2align 4
-L(Fill16):
-	movdqu	%xmm0, (%rdi)
-	ret
-
-	.p2align 4
-L(CopyFrom1To16BytesUnalignedXmm2):
-	movdqu	%xmm2, (%rdi, %rcx)
-
-	.p2align 4
-L(CopyFrom1To16BytesXmmExit):
-	bsf	%rdx, %rdx
-	add	$15, %r8
-	add	%rcx, %rdi
-# ifdef USE_AS_STPCPY
-	lea	(%rdi, %rdx), %rax
-# endif
-	sub	%rdx, %r8
-	lea	1(%rdi, %rdx), %rdi
-
-	.p2align 4
-L(StrncpyFillTailWithZero):
-	pxor	%xmm0, %xmm0
-	xor	%rdx, %rdx
-	sub	$16, %r8
-	jbe	L(StrncpyFillExit)
-
-	movdqu	%xmm0, (%rdi)
-	add	$16, %rdi
-
-	mov	%rdi, %rsi
-	and	$0xf, %rsi
-	sub	%rsi, %rdi
-	add	%rsi, %r8
-	sub	$64, %r8
-	jb	L(StrncpyFillLess64)
-
-L(StrncpyFillLoopMovdqa):
-	movdqa	%xmm0, (%rdi)
-	movdqa	%xmm0, 16(%rdi)
-	movdqa	%xmm0, 32(%rdi)
-	movdqa	%xmm0, 48(%rdi)
-	add	$64, %rdi
-	sub	$64, %r8
-	jae	L(StrncpyFillLoopMovdqa)
-
-L(StrncpyFillLess64):
-	add	$32, %r8
-	jl	L(StrncpyFillLess32)
-	movdqa	%xmm0, (%rdi)
-	movdqa	%xmm0, 16(%rdi)
-	add	$32, %rdi
-	sub	$16, %r8
-	jl	L(StrncpyFillExit)
-	movdqa	%xmm0, (%rdi)
-	add	$16, %rdi
-	BRANCH_TO_JMPTBL_ENTRY (L(FillTable), %r8, 4)
-
-L(StrncpyFillLess32):
-	add	$16, %r8
-	jl	L(StrncpyFillExit)
-	movdqa	%xmm0, (%rdi)
-	add	$16, %rdi
-	BRANCH_TO_JMPTBL_ENTRY (L(FillTable), %r8, 4)
-
-L(StrncpyFillExit):
-	add	$16, %r8
-	BRANCH_TO_JMPTBL_ENTRY (L(FillTable), %r8, 4)
-
-/* end of ifndef USE_AS_STRCAT */
-# endif
-
-	.p2align 4
-L(UnalignedLeaveCase2OrCase3):
-	test	%rdx, %rdx
-	jnz	L(Unaligned64LeaveCase2)
-L(Unaligned64LeaveCase3):
-	lea	64(%r8), %rcx
-	and	$-16, %rcx
-	add	$48, %r8
-	jl	L(CopyFrom1To16BytesCase3)
-	movdqu	%xmm4, (%rdi)
-	sub	$16, %r8
-	jb	L(CopyFrom1To16BytesCase3)
-	movdqu	%xmm5, 16(%rdi)
-	sub	$16, %r8
-	jb	L(CopyFrom1To16BytesCase3)
-	movdqu	%xmm6, 32(%rdi)
-	sub	$16, %r8
-	jb	L(CopyFrom1To16BytesCase3)
-	movdqu	%xmm7, 48(%rdi)
-# ifdef USE_AS_STPCPY
-	lea	64(%rdi), %rax
-# endif
-# ifdef USE_AS_STRCAT
-	xor	%ch, %ch
-	movb	%ch, 64(%rdi)
-# endif
-	ret
-
-	.p2align 4
-L(Unaligned64LeaveCase2):
-	xor	%rcx, %rcx
-	pcmpeqb	%xmm4, %xmm0
-	pmovmskb %xmm0, %rdx
-	add	$48, %r8
-	jle	L(CopyFrom1To16BytesCase2OrCase3)
-	test	%rdx, %rdx
-# ifndef USE_AS_STRCAT
-	jnz	L(CopyFrom1To16BytesUnalignedXmm4)
-# else
-	jnz	L(CopyFrom1To16Bytes)
-# endif
-	pcmpeqb	%xmm5, %xmm0
-	pmovmskb %xmm0, %rdx
-	movdqu	%xmm4, (%rdi)
-	add	$16, %rcx
-	sub	$16, %r8
-	jbe	L(CopyFrom1To16BytesCase2OrCase3)
-	test	%rdx, %rdx
-# ifndef USE_AS_STRCAT
-	jnz	L(CopyFrom1To16BytesUnalignedXmm5)
-# else
-	jnz	L(CopyFrom1To16Bytes)
-# endif
-
-	pcmpeqb	%xmm6, %xmm0
-	pmovmskb %xmm0, %rdx
-	movdqu	%xmm5, 16(%rdi)
-	add	$16, %rcx
-	sub	$16, %r8
-	jbe	L(CopyFrom1To16BytesCase2OrCase3)
-	test	%rdx, %rdx
-# ifndef USE_AS_STRCAT
-	jnz	L(CopyFrom1To16BytesUnalignedXmm6)
-# else
-	jnz	L(CopyFrom1To16Bytes)
-# endif
-
-	pcmpeqb	%xmm7, %xmm0
-	pmovmskb %xmm0, %rdx
-	movdqu	%xmm6, 32(%rdi)
-	lea	16(%rdi, %rcx), %rdi
-	lea	16(%rsi, %rcx), %rsi
-	bsf	%rdx, %rdx
-	cmp	%r8, %rdx
-	jb	L(CopyFrom1To16BytesExit)
-	BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4)
-
-	.p2align 4
-L(ExitZero):
-# ifndef USE_AS_STRCAT
-	mov	%rdi, %rax
-# endif
-	ret
-
-# endif
-
-# ifndef USE_AS_STRCAT
-END (STRCPY)
-# else
-END (STRCAT)
-# endif
-	.p2align 4
-	.section .rodata
-L(ExitTable):
-	.int	JMPTBL(L(Exit1), L(ExitTable))
-	.int	JMPTBL(L(Exit2), L(ExitTable))
-	.int	JMPTBL(L(Exit3), L(ExitTable))
-	.int	JMPTBL(L(Exit4), L(ExitTable))
-	.int	JMPTBL(L(Exit5), L(ExitTable))
-	.int	JMPTBL(L(Exit6), L(ExitTable))
-	.int	JMPTBL(L(Exit7), L(ExitTable))
-	.int	JMPTBL(L(Exit8), L(ExitTable))
-	.int	JMPTBL(L(Exit9), L(ExitTable))
-	.int	JMPTBL(L(Exit10), L(ExitTable))
-	.int	JMPTBL(L(Exit11), L(ExitTable))
-	.int	JMPTBL(L(Exit12), L(ExitTable))
-	.int	JMPTBL(L(Exit13), L(ExitTable))
-	.int	JMPTBL(L(Exit14), L(ExitTable))
-	.int	JMPTBL(L(Exit15), L(ExitTable))
-	.int	JMPTBL(L(Exit16), L(ExitTable))
-	.int	JMPTBL(L(Exit17), L(ExitTable))
-	.int	JMPTBL(L(Exit18), L(ExitTable))
-	.int	JMPTBL(L(Exit19), L(ExitTable))
-	.int	JMPTBL(L(Exit20), L(ExitTable))
-	.int	JMPTBL(L(Exit21), L(ExitTable))
-	.int	JMPTBL(L(Exit22), L(ExitTable))
-	.int	JMPTBL(L(Exit23), L(ExitTable))
-	.int	JMPTBL(L(Exit24), L(ExitTable))
-	.int	JMPTBL(L(Exit25), L(ExitTable))
-	.int	JMPTBL(L(Exit26), L(ExitTable))
-	.int	JMPTBL(L(Exit27), L(ExitTable))
-	.int	JMPTBL(L(Exit28), L(ExitTable))
-	.int	JMPTBL(L(Exit29), L(ExitTable))
-	.int	JMPTBL(L(Exit30), L(ExitTable))
-	.int	JMPTBL(L(Exit31), L(ExitTable))
-	.int	JMPTBL(L(Exit32), L(ExitTable))
-# ifdef USE_AS_STRNCPY
-L(ExitStrncpyTable):
-	.int	JMPTBL(L(StrncpyExit0), L(ExitStrncpyTable))
-	.int	JMPTBL(L(StrncpyExit1), L(ExitStrncpyTable))
-	.int	JMPTBL(L(StrncpyExit2), L(ExitStrncpyTable))
-	.int	JMPTBL(L(StrncpyExit3), L(ExitStrncpyTable))
-	.int	JMPTBL(L(StrncpyExit4), L(ExitStrncpyTable))
-	.int	JMPTBL(L(StrncpyExit5), L(ExitStrncpyTable))
-	.int	JMPTBL(L(StrncpyExit6), L(ExitStrncpyTable))
-	.int	JMPTBL(L(StrncpyExit7), L(ExitStrncpyTable))
-	.int	JMPTBL(L(StrncpyExit8), L(ExitStrncpyTable))
-	.int	JMPTBL(L(StrncpyExit9), L(ExitStrncpyTable))
-	.int	JMPTBL(L(StrncpyExit10), L(ExitStrncpyTable))
-	.int	JMPTBL(L(StrncpyExit11), L(ExitStrncpyTable))
-	.int	JMPTBL(L(StrncpyExit12), L(ExitStrncpyTable))
-	.int	JMPTBL(L(StrncpyExit13), L(ExitStrncpyTable))
-	.int	JMPTBL(L(StrncpyExit14), L(ExitStrncpyTable))
-	.int	JMPTBL(L(StrncpyExit15), L(ExitStrncpyTable))
-	.int	JMPTBL(L(StrncpyExit16), L(ExitStrncpyTable))
-	.int	JMPTBL(L(StrncpyExit17), L(ExitStrncpyTable))
-	.int	JMPTBL(L(StrncpyExit18), L(ExitStrncpyTable))
-	.int	JMPTBL(L(StrncpyExit19), L(ExitStrncpyTable))
-	.int	JMPTBL(L(StrncpyExit20), L(ExitStrncpyTable))
-	.int	JMPTBL(L(StrncpyExit21), L(ExitStrncpyTable))
-	.int	JMPTBL(L(StrncpyExit22), L(ExitStrncpyTable))
-	.int	JMPTBL(L(StrncpyExit23), L(ExitStrncpyTable))
-	.int	JMPTBL(L(StrncpyExit24), L(ExitStrncpyTable))
-	.int	JMPTBL(L(StrncpyExit25), L(ExitStrncpyTable))
-	.int	JMPTBL(L(StrncpyExit26), L(ExitStrncpyTable))
-	.int	JMPTBL(L(StrncpyExit27), L(ExitStrncpyTable))
-	.int	JMPTBL(L(StrncpyExit28), L(ExitStrncpyTable))
-	.int	JMPTBL(L(StrncpyExit29), L(ExitStrncpyTable))
-	.int	JMPTBL(L(StrncpyExit30), L(ExitStrncpyTable))
-	.int	JMPTBL(L(StrncpyExit31), L(ExitStrncpyTable))
-	.int	JMPTBL(L(StrncpyExit32), L(ExitStrncpyTable))
-	.int	JMPTBL(L(StrncpyExit33), L(ExitStrncpyTable))
-# ifndef USE_AS_STRCAT
-	.p2align 4
-L(FillTable):
-	.int	JMPTBL(L(Fill0), L(FillTable))
-	.int	JMPTBL(L(Fill1), L(FillTable))
-	.int	JMPTBL(L(Fill2), L(FillTable))
-	.int	JMPTBL(L(Fill3), L(FillTable))
-	.int	JMPTBL(L(Fill4), L(FillTable))
-	.int	JMPTBL(L(Fill5), L(FillTable))
-	.int	JMPTBL(L(Fill6), L(FillTable))
-	.int	JMPTBL(L(Fill7), L(FillTable))
-	.int	JMPTBL(L(Fill8), L(FillTable))
-	.int	JMPTBL(L(Fill9), L(FillTable))
-	.int	JMPTBL(L(Fill10), L(FillTable))
-	.int	JMPTBL(L(Fill11), L(FillTable))
-	.int	JMPTBL(L(Fill12), L(FillTable))
-	.int	JMPTBL(L(Fill13), L(FillTable))
-	.int	JMPTBL(L(Fill14), L(FillTable))
-	.int	JMPTBL(L(Fill15), L(FillTable))
-	.int	JMPTBL(L(Fill16), L(FillTable))
-# endif
-# endif
-#endif
+#define AS_STRCPY
+#define STPCPY __strcpy_sse2_unaligned
+#include "stpcpy-sse2-unaligned.S"
diff --git a/sysdeps/x86_64/multiarch/strcpy.S b/sysdeps/x86_64/multiarch/strcpy.S
index 9464ee8..92be04c 100644
--- a/sysdeps/x86_64/multiarch/strcpy.S
+++ b/sysdeps/x86_64/multiarch/strcpy.S
@@ -28,31 +28,18 @@
 #endif
 #ifdef USE_AS_STPCPY
-# ifdef USE_AS_STRNCPY
-# define STRCPY_SSSE3 __stpncpy_ssse3
-# define STRCPY_SSE2 __stpncpy_sse2
-# define STRCPY_SSE2_UNALIGNED __stpncpy_sse2_unaligned
-# define __GI_STRCPY __GI_stpncpy
-# define __GI___STRCPY __GI___stpncpy
-# else
 # define STRCPY_SSSE3 __stpcpy_ssse3
 # define STRCPY_SSE2 __stpcpy_sse2
+# define STRCPY_AVX2 __stpcpy_avx2
 # define STRCPY_SSE2_UNALIGNED __stpcpy_sse2_unaligned
 # define __GI_STRCPY __GI_stpcpy
 # define __GI___STRCPY __GI___stpcpy
-# endif
 #else
-# ifdef USE_AS_STRNCPY
-# define STRCPY_SSSE3 __strncpy_ssse3
-# define STRCPY_SSE2 __strncpy_sse2
-# define STRCPY_SSE2_UNALIGNED __strncpy_sse2_unaligned
-# define __GI_STRCPY __GI_strncpy
-# else
 # define STRCPY_SSSE3 __strcpy_ssse3
+# define STRCPY_AVX2 __strcpy_avx2
 # define STRCPY_SSE2 __strcpy_sse2
 # define STRCPY_SSE2_UNALIGNED __strcpy_sse2_unaligned
 # define __GI_STRCPY __GI_strcpy
-# endif
 #endif
@@ -64,7 +51,10 @@ ENTRY(STRCPY)
 	cmpl	$0, __cpu_features+KIND_OFFSET(%rip)
 	jne	1f
 	call	__init_cpu_features
-1:	leaq	STRCPY_SSE2_UNALIGNED(%rip), %rax
+1:	leaq	STRCPY_AVX2(%rip), %rax
+	testl	$bit_AVX_Fast_Unaligned_Load, __cpu_features+FEATURE_OFFSET+index_AVX_Fast_Unaligned_Load(%rip)
+	jnz	2f
+	leaq	STRCPY_SSE2_UNALIGNED(%rip), %rax
 	testl	$bit_Fast_Unaligned_Load, __cpu_features+FEATURE_OFFSET+index_Fast_Unaligned_Load(%rip)
 	jnz	2f
 	leaq	STRCPY_SSE2(%rip), %rax
diff --git a/sysdeps/x86_64/multiarch/strncpy-sse2-unaligned.S b/sysdeps/x86_64/multiarch/strncpy-sse2-unaligned.S
index fcc23a7..e4c98e7 100644
--- a/sysdeps/x86_64/multiarch/strncpy-sse2-unaligned.S
+++ b/sysdeps/x86_64/multiarch/strncpy-sse2-unaligned.S
@@ -1,3 +1,1888 @@
-#define USE_AS_STRNCPY
-#define STRCPY __strncpy_sse2_unaligned
-#include "strcpy-sse2-unaligned.S"
+/* strcpy with SSE2 and unaligned load
+   Copyright (C) 2011-2015 Free Software Foundation, Inc.
+   Contributed by Intel Corporation.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#if IS_IN (libc)
+
+# ifndef USE_AS_STRCAT
+# include <sysdep.h>
+
+# ifndef STRCPY
+# define STRCPY __strncpy_sse2_unaligned
+# endif
+
+# define USE_AS_STRNCPY
+# endif
+
+# define JMPTBL(I, B)	I - B
+# define BRANCH_TO_JMPTBL_ENTRY(TABLE, INDEX, SCALE) \
+	lea	TABLE(%rip), %r11;	\
+	movslq	(%r11, INDEX, SCALE), %rcx;	\
+	lea	(%r11, %rcx), %rcx;	\
+	jmp	*%rcx
+
+# ifndef USE_AS_STRCAT
+
+.text
+ENTRY (STRCPY)
+# ifdef USE_AS_STRNCPY
+	mov	%rdx, %r8
+	test	%r8, %r8
+	jz	L(ExitZero)
+# endif
+	mov	%rsi, %rcx
+# ifndef USE_AS_STPCPY
+	mov	%rdi, %rax	/* save result */
+# endif
+
+# endif
+
+	and	$63, %rcx
+	cmp	$32, %rcx
+	jbe	L(SourceStringAlignmentLess32)
+
+	and	$-16, %rsi
+	and	$15, %rcx
+	pxor	%xmm0, %xmm0
+	pxor	%xmm1, %xmm1
+
+	pcmpeqb	(%rsi), %xmm1
+	pmovmskb %xmm1, %rdx
+	shr	%cl, %rdx
+
+# ifdef USE_AS_STRNCPY
+# if defined USE_AS_STPCPY || defined USE_AS_STRCAT
+	mov	$16, %r10
+	sub	%rcx, %r10
+	cmp	%r10, %r8
+# else
+	mov	$17, %r10
+	sub	%rcx, %r10
+	cmp	%r10, %r8
+# endif
+	jbe	L(CopyFrom1To16BytesTailCase2OrCase3)
+# endif
+	test	%rdx, %rdx
+	jnz	L(CopyFrom1To16BytesTail)
+
+	pcmpeqb	16(%rsi), %xmm0
+	pmovmskb %xmm0, %rdx
+
+# ifdef USE_AS_STRNCPY
+	add	$16, %r10
+	cmp	%r10, %r8
+	jbe	L(CopyFrom1To32BytesCase2OrCase3)
+# endif
+	test	%rdx, %rdx
+	jnz	L(CopyFrom1To32Bytes)
+
+	movdqu	(%rsi, %rcx), %xmm1	/* copy 16 bytes */
+	movdqu	%xmm1, (%rdi)
+
+/* If source address alignment != destination address alignment */
+	.p2align 4
+L(Unalign16Both):
+	sub	%rcx, %rdi
+# ifdef USE_AS_STRNCPY
+	add	%rcx, %r8
+# endif
+	mov	$16, %rcx
+	movdqa	(%rsi, %rcx), %xmm1
+	movaps	16(%rsi, %rcx), %xmm2
+	movdqu	%xmm1, (%rdi, %rcx)
+	pcmpeqb	%xmm2, %xmm0
+	pmovmskb %xmm0, %rdx
+	add	$16, %rcx
+# ifdef USE_AS_STRNCPY
+	sub	$48, %r8
+	jbe	L(CopyFrom1To16BytesCase2OrCase3)
+# endif
+	test	%rdx, %rdx
+# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
+	jnz	L(CopyFrom1To16BytesUnalignedXmm2)
+# else
+	jnz	L(CopyFrom1To16Bytes)
+# endif
+
+	movaps	16(%rsi, %rcx), %xmm3
+	movdqu	%xmm2, (%rdi, %rcx)
+	pcmpeqb	%xmm3, %xmm0
+	pmovmskb %xmm0, %rdx
+	add	$16, %rcx
+# ifdef USE_AS_STRNCPY
+	sub	$16, %r8
+	jbe	L(CopyFrom1To16BytesCase2OrCase3)
+# endif
+	test	%rdx, %rdx
+# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
+	jnz	L(CopyFrom1To16BytesUnalignedXmm3)
+# else
+	jnz	L(CopyFrom1To16Bytes)
+# endif
+
+	movaps	16(%rsi, %rcx), %xmm4
+	movdqu	%xmm3, (%rdi, %rcx)
+	pcmpeqb	%xmm4, %xmm0
+	pmovmskb %xmm0, %rdx
+	add	$16, %rcx
+# ifdef USE_AS_STRNCPY
+	sub	$16, %r8
+	jbe	L(CopyFrom1To16BytesCase2OrCase3)
+# endif
+	test	%rdx, %rdx
+# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
+	jnz	L(CopyFrom1To16BytesUnalignedXmm4)
+# else
+	jnz	L(CopyFrom1To16Bytes)
+# endif
+
+	movaps	16(%rsi, %rcx), %xmm1
+	movdqu	%xmm4, (%rdi, %rcx)
+	pcmpeqb	%xmm1, %xmm0
+	pmovmskb %xmm0, %rdx
+	add	$16, %rcx
+# ifdef USE_AS_STRNCPY
+	sub	$16, %r8
+	jbe	L(CopyFrom1To16BytesCase2OrCase3)
+# endif
+	test	%rdx, %rdx
+# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
+	jnz	L(CopyFrom1To16BytesUnalignedXmm1)
+# else
+	jnz	L(CopyFrom1To16Bytes)
+# endif
+
+	movaps	16(%rsi, %rcx), %xmm2
+	movdqu	%xmm1, (%rdi, %rcx)
+	pcmpeqb	%xmm2, %xmm0
+	pmovmskb %xmm0, %rdx
+	add	$16, %rcx
+# ifdef USE_AS_STRNCPY
+	sub	$16, %r8
+	jbe	L(CopyFrom1To16BytesCase2OrCase3)
+# endif
+	test	%rdx, %rdx
+# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
+	jnz	L(CopyFrom1To16BytesUnalignedXmm2)
+# else
+	jnz	L(CopyFrom1To16Bytes)
+# endif
+
+	movaps	16(%rsi, %rcx), %xmm3
+	movdqu	%xmm2, (%rdi, %rcx)
+	pcmpeqb	%xmm3, %xmm0
+	pmovmskb %xmm0, %rdx
+	add	$16, %rcx
+# ifdef USE_AS_STRNCPY
+	sub	$16, %r8
+	jbe	L(CopyFrom1To16BytesCase2OrCase3)
+# endif
+	test	%rdx, %rdx
+# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
+	jnz	L(CopyFrom1To16BytesUnalignedXmm3)
+# else
+	jnz	L(CopyFrom1To16Bytes)
+# endif
+
+	movdqu	%xmm3, (%rdi, %rcx)
+	mov	%rsi, %rdx
+	lea	16(%rsi, %rcx), %rsi
+	and	$-0x40, %rsi
+	sub	%rsi, %rdx
+	sub	%rdx, %rdi
+# ifdef USE_AS_STRNCPY
+	lea	128(%r8, %rdx), %r8
+# endif
+L(Unaligned64Loop):
+	movaps	(%rsi), %xmm2
+	movaps	%xmm2, %xmm4
+	movaps	16(%rsi), %xmm5
+	movaps	32(%rsi), %xmm3
+	movaps	%xmm3, %xmm6
+	movaps	48(%rsi), %xmm7
+	pminub	%xmm5, %xmm2
+	pminub	%xmm7, %xmm3
+	pminub	%xmm2, %xmm3
+	pcmpeqb	%xmm0, %xmm3
+	pmovmskb %xmm3, %rdx
+# ifdef USE_AS_STRNCPY
+	sub	$64, %r8
+	jbe	L(UnalignedLeaveCase2OrCase3)
+# endif
+	test	%rdx, %rdx
+	jnz	L(Unaligned64Leave)
+
+L(Unaligned64Loop_start):
+	add	$64, %rdi
+	add	$64, %rsi
+	movdqu	%xmm4, -64(%rdi)
+	movaps	(%rsi), %xmm2
+	movdqa	%xmm2, %xmm4
+	movdqu	%xmm5, -48(%rdi)
+	movaps	16(%rsi), %xmm5
+	pminub	%xmm5, %xmm2
+	movaps	32(%rsi), %xmm3
+	movdqu	%xmm6, -32(%rdi)
+	movaps	%xmm3, %xmm6
+	movdqu	%xmm7, -16(%rdi)
+	movaps	48(%rsi), %xmm7
+	pminub	%xmm7, %xmm3
+	pminub	%xmm2, %xmm3
+	pcmpeqb	%xmm0, %xmm3
+	pmovmskb %xmm3, %rdx
+# ifdef USE_AS_STRNCPY
+	sub	$64, %r8
+	jbe	L(UnalignedLeaveCase2OrCase3)
+# endif
+	test	%rdx, %rdx
+	jz	L(Unaligned64Loop_start)
+
+L(Unaligned64Leave):
+	pxor	%xmm1, %xmm1
+
+	pcmpeqb	%xmm4, %xmm0
+	pcmpeqb	%xmm5, %xmm1
+	pmovmskb %xmm0, %rdx
+	pmovmskb %xmm1, %rcx
+	test	%rdx, %rdx
+	jnz	L(CopyFrom1To16BytesUnaligned_0)
+	test	%rcx, %rcx
+	jnz	L(CopyFrom1To16BytesUnaligned_16)
+
+	pcmpeqb	%xmm6, %xmm0
+	pcmpeqb	%xmm7, %xmm1
+	pmovmskb %xmm0, %rdx
+	pmovmskb %xmm1, %rcx
+	test	%rdx, %rdx
+	jnz	L(CopyFrom1To16BytesUnaligned_32)
+
+	bsf	%rcx, %rdx
+	movdqu	%xmm4, (%rdi)
+	movdqu	%xmm5, 16(%rdi)
+	movdqu	%xmm6, 32(%rdi)
+# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
+# ifdef USE_AS_STPCPY
+	lea	48(%rdi, %rdx), %rax
+# endif
+	movdqu	%xmm7, 48(%rdi)
+	add	$15, %r8
+	sub	%rdx, %r8
+	lea	49(%rdi, %rdx), %rdi
+	jmp	L(StrncpyFillTailWithZero)
+# else
+	add	$48, %rsi
+	add	$48, %rdi
+	BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4)
+# endif
+
+/* If source address alignment == destination address alignment */
+
+L(SourceStringAlignmentLess32):
+	pxor	%xmm0, %xmm0
+	movdqu	(%rsi), %xmm1
+	movdqu	16(%rsi), %xmm2
+	pcmpeqb	%xmm1, %xmm0
+	pmovmskb %xmm0, %rdx
+
+# ifdef USE_AS_STRNCPY
+# if defined USE_AS_STPCPY || defined USE_AS_STRCAT
+	cmp	$16, %r8
+# else
+	cmp	$17, %r8
+# endif
+	jbe	L(CopyFrom1To16BytesTail1Case2OrCase3)
+# endif
+	test	%rdx, %rdx
+	jnz	L(CopyFrom1To16BytesTail1)
+
+	pcmpeqb	%xmm2, %xmm0
+	movdqu	%xmm1, (%rdi)
+	pmovmskb %xmm0, %rdx
+
+# ifdef USE_AS_STRNCPY
+# if defined USE_AS_STPCPY || defined USE_AS_STRCAT
+	cmp	$32, %r8
+# else
+	cmp	$33, %r8
+# endif
+	jbe	L(CopyFrom1To32Bytes1Case2OrCase3)
+# endif
+	test	%rdx, %rdx
+	jnz	L(CopyFrom1To32Bytes1)
+
+	and	$-16, %rsi
+	and	$15, %rcx
+	jmp	L(Unalign16Both)
+
+/*------End of main part with loops---------------------*/
+
+/* Case1 */
+
+# if (!defined USE_AS_STRNCPY) || (defined USE_AS_STRCAT)
+	.p2align 4
+L(CopyFrom1To16Bytes):
+	add	%rcx, %rdi
+	add	%rcx, %rsi
+	bsf	%rdx, %rdx
+	BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4)
+# endif
+	.p2align 4
+L(CopyFrom1To16BytesTail):
+	add	%rcx, %rsi
+	bsf	%rdx, %rdx
+	BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4)
+
+	.p2align 4
+L(CopyFrom1To32Bytes1):
+	add	$16, %rsi
+	add	$16, %rdi
+# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
+	sub	$16, %r8
+# endif
+L(CopyFrom1To16BytesTail1):
+	bsf	%rdx, %rdx
+	BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4)
+
+	.p2align 4
+L(CopyFrom1To32Bytes):
+	bsf	%rdx, %rdx
+	add	%rcx, %rsi
+	add	$16, %rdx
+	sub	%rcx, %rdx
+	BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4)
+
+	.p2align 4
+L(CopyFrom1To16BytesUnaligned_0):
+	bsf	%rdx, %rdx
+# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
+# ifdef USE_AS_STPCPY
+	lea	(%rdi, %rdx), %rax
+# endif
+	movdqu	%xmm4, (%rdi)
+	add	$63, %r8
+	sub	%rdx, %r8
+	lea	1(%rdi, %rdx), %rdi
+	jmp	L(StrncpyFillTailWithZero)
+# else
+	BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4)
+# endif
+
+	.p2align 4
+L(CopyFrom1To16BytesUnaligned_16):
+	bsf	%rcx, %rdx
+	movdqu	%xmm4, (%rdi)
+# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
+# ifdef USE_AS_STPCPY
+	lea	16(%rdi, %rdx), %rax
+# endif
+	movdqu	%xmm5, 16(%rdi)
+	add	$47, %r8
+	sub	%rdx, %r8
+	lea	17(%rdi, %rdx), %rdi
+	jmp	L(StrncpyFillTailWithZero)
+# else
+	add	$16, %rsi
+	add	$16, %rdi
+	BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4)
+# endif
+
+	.p2align 4
+L(CopyFrom1To16BytesUnaligned_32):
+	bsf	%rdx, %rdx
+	movdqu	%xmm4, (%rdi)
+	movdqu	%xmm5, 16(%rdi)
+# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
+# ifdef USE_AS_STPCPY
+	lea	32(%rdi, %rdx), %rax
+# endif
+	movdqu	%xmm6, 32(%rdi)
+	add	$31, %r8
+	sub	%rdx, %r8
+	lea	33(%rdi, %rdx), %rdi
+	jmp	L(StrncpyFillTailWithZero)
+# else
+	add	$32, %rsi
+	add	$32, %rdi
+	BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4)
+# endif
+
+# ifdef USE_AS_STRNCPY
+# ifndef USE_AS_STRCAT
+	.p2align 4
+L(CopyFrom1To16BytesUnalignedXmm6):
+	movdqu	%xmm6, (%rdi, %rcx)
+	jmp	L(CopyFrom1To16BytesXmmExit)
+
+	.p2align 4
+L(CopyFrom1To16BytesUnalignedXmm5):
+	movdqu	%xmm5, (%rdi, %rcx)
+	jmp	L(CopyFrom1To16BytesXmmExit)
+
+	.p2align 4
+L(CopyFrom1To16BytesUnalignedXmm4):
+	movdqu	%xmm4, (%rdi, %rcx)
+	jmp	L(CopyFrom1To16BytesXmmExit)
+
+	.p2align 4
+L(CopyFrom1To16BytesUnalignedXmm3):
+	movdqu	%xmm3, (%rdi, %rcx)
+	jmp	L(CopyFrom1To16BytesXmmExit)
+
+	.p2align 4
+L(CopyFrom1To16BytesUnalignedXmm1):
+	movdqu	%xmm1, (%rdi, %rcx)
+	jmp	L(CopyFrom1To16BytesXmmExit)
+# endif
+
+	.p2align 4
+L(CopyFrom1To16BytesExit):
+	BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4)
+
+/* Case2 */
+
+	.p2align 4
+L(CopyFrom1To16BytesCase2):
+	add	$16, %r8
+	add	%rcx, %rdi
+	add	%rcx, %rsi
+	bsf	%rdx, %rdx
+	cmp	%r8, %rdx
+	jb	L(CopyFrom1To16BytesExit)
+	BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4)
+
+	.p2align 4
+L(CopyFrom1To32BytesCase2):
+	add	%rcx, %rsi
+	bsf	%rdx, %rdx
+	add	$16, %rdx
+	sub	%rcx, %rdx
+	cmp	%r8, %rdx
+	jb	L(CopyFrom1To16BytesExit)
+	BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4)
+
+L(CopyFrom1To16BytesTailCase2):
+	add	%rcx, %rsi
+	bsf	%rdx, %rdx
+	cmp	%r8, %rdx
+	jb	L(CopyFrom1To16BytesExit)
+	BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4)
+
+L(CopyFrom1To16BytesTail1Case2):
+	bsf	%rdx, %rdx
+	cmp	%r8, %rdx
+	jb	L(CopyFrom1To16BytesExit)
+	BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4)
+
+/* Case2 or Case3,  Case3 */
+
+	.p2align 4
+L(CopyFrom1To16BytesCase2OrCase3):
+	test	%rdx, %rdx
+	jnz	L(CopyFrom1To16BytesCase2)
+L(CopyFrom1To16BytesCase3):
+	add	$16, %r8
+	add	%rcx, %rdi
+	add	%rcx, %rsi
+	BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4)
+
+	.p2align 4
+L(CopyFrom1To32BytesCase2OrCase3):
+	test	%rdx, %rdx
+	jnz	L(CopyFrom1To32BytesCase2)
+	add	%rcx, %rsi
+	BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4)
+
+	.p2align 4
+L(CopyFrom1To16BytesTailCase2OrCase3):
+	test	%rdx, %rdx
+	jnz	L(CopyFrom1To16BytesTailCase2)
+	add	%rcx, %rsi
+	BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4)
+
+	.p2align 4
+L(CopyFrom1To32Bytes1Case2OrCase3):
+	add	$16, %rdi
+	add	$16, %rsi
+	sub	$16, %r8
+L(CopyFrom1To16BytesTail1Case2OrCase3):
+	test	%rdx, %rdx
+	jnz	L(CopyFrom1To16BytesTail1Case2)
+	BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4)
+
+# endif
+
+/*------------End labels regarding with copying 1-16 bytes--and 1-32 bytes----*/
+
+	.p2align 4
+L(Exit1):
+	mov	%dh, (%rdi)
+# ifdef USE_AS_STPCPY
+	lea	(%rdi), %rax
+# endif
+# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
+	sub	$1, %r8
+	lea	1(%rdi), %rdi
+	jnz	L(StrncpyFillTailWithZero)
+# endif
+	ret
+
+	.p2align 4
+L(Exit2):
+	mov	(%rsi), %dx
+	mov	%dx, (%rdi)
+# ifdef USE_AS_STPCPY
+	lea	1(%rdi), %rax
+# endif
+# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
+	sub	$2, %r8
+	lea	2(%rdi), %rdi
+	jnz	L(StrncpyFillTailWithZero)
+# endif
+	ret
+
+	.p2align 4
+L(Exit3):
+	mov	(%rsi), %cx
+	mov	%cx, (%rdi)
+	mov	%dh, 2(%rdi)
+# ifdef USE_AS_STPCPY
+	lea	2(%rdi), %rax
+# endif
+# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
+	sub	$3, %r8
+	lea	3(%rdi), %rdi
+	jnz	L(StrncpyFillTailWithZero)
+# endif
+	ret
+
+	.p2align 4
+L(Exit4):
+	mov	(%rsi), %edx
+	mov	%edx, (%rdi)
+# ifdef USE_AS_STPCPY
+	lea	3(%rdi), %rax
+# endif
+# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
+	sub	$4, %r8
+	lea	4(%rdi), %rdi
+	jnz	L(StrncpyFillTailWithZero)
+# endif
+	ret
+
+	.p2align 4
+L(Exit5):
+	mov	(%rsi), %ecx
+	mov	%dh, 4(%rdi)
+	mov	%ecx, (%rdi)
+# ifdef USE_AS_STPCPY
+	lea	4(%rdi), %rax
+# endif
+# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
+	sub	$5, %r8
+	lea	5(%rdi), %rdi
+	jnz	L(StrncpyFillTailWithZero)
+# endif
+	ret
+
+	.p2align 4
+L(Exit6):
+	mov	(%rsi), %ecx
+	mov	4(%rsi), %dx
+	mov	%ecx, (%rdi)
+	mov	%dx, 4(%rdi)
+# ifdef USE_AS_STPCPY
+	lea	5(%rdi), %rax
+# endif
+# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
+	sub	$6, %r8
+	lea	6(%rdi), %rdi
+	jnz	L(StrncpyFillTailWithZero)
+# endif
+	ret
+
+	.p2align 4
+L(Exit7):
+	mov	(%rsi), %ecx
+	mov	3(%rsi), %edx
+	mov	%ecx, (%rdi)
+	mov	%edx, 3(%rdi)
+# ifdef USE_AS_STPCPY
+	lea	6(%rdi), %rax
+# endif
+# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
+	sub	$7, %r8
+	lea	7(%rdi), %rdi
+	jnz	L(StrncpyFillTailWithZero)
+# endif
+	ret
+
+	.p2align 4
+L(Exit8):
+	mov	(%rsi), %rdx
+	mov	%rdx, (%rdi)
+# ifdef USE_AS_STPCPY
+	lea	7(%rdi), %rax
+# endif
+# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
+	sub	$8, %r8
+	lea	8(%rdi), %rdi
+	jnz	L(StrncpyFillTailWithZero)
+# endif
+	ret
+
+	.p2align 4
+L(Exit9):
+	mov	(%rsi), %rcx
+	mov	%dh, 8(%rdi)
+	mov	%rcx, (%rdi)
+# ifdef USE_AS_STPCPY
+	lea	8(%rdi), %rax
+# endif
+# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
+	sub	$9, %r8
+	lea	9(%rdi), %rdi
+	jnz	L(StrncpyFillTailWithZero)
+# endif
+	ret
+
+	.p2align 4
+L(Exit10):
+	mov	(%rsi), %rcx
+	mov	8(%rsi), %dx
+	mov	%rcx, (%rdi)
+	mov	%dx, 8(%rdi)
+# ifdef USE_AS_STPCPY
+	lea	9(%rdi), %rax
+# endif
+# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
+	sub	$10, %r8
+	lea	10(%rdi), %rdi
+	jnz	L(StrncpyFillTailWithZero)
+# endif
+	ret
+
+	.p2align 4
+L(Exit11):
+	mov	(%rsi), %rcx
+	mov	7(%rsi), %edx
+	mov	%rcx, (%rdi)
+	mov	%edx, 7(%rdi)
+# ifdef USE_AS_STPCPY
+	lea	10(%rdi), %rax
+# endif
+# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
+	sub	$11, %r8
+	lea	11(%rdi), %rdi
+	jnz	L(StrncpyFillTailWithZero)
+# endif
+	ret
+
+	.p2align 4
+L(Exit12):
+	mov	(%rsi), %rcx
+	mov	8(%rsi), %edx
+	mov	%rcx, (%rdi)
+	mov	%edx, 8(%rdi)
+# ifdef USE_AS_STPCPY
+	lea	11(%rdi), %rax
+# endif
+# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
+	sub	$12, %r8
+	lea	12(%rdi), %rdi
+	jnz	L(StrncpyFillTailWithZero)
+# endif
+	ret
+
+	.p2align 4
+L(Exit13):
+	mov	(%rsi), %rcx
+	mov	5(%rsi), %rdx
+	mov	%rcx, (%rdi)
+	mov	%rdx, 5(%rdi)
+# ifdef USE_AS_STPCPY
+	lea	12(%rdi), %rax
+# endif
+# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
+	sub	$13, %r8
+	lea	13(%rdi), %rdi
+	jnz	L(StrncpyFillTailWithZero)
+# endif
+	ret
+
+	.p2align 4
+L(Exit14):
+	mov	(%rsi), %rcx
+	mov	6(%rsi), %rdx
+	mov	%rcx, (%rdi)
+	mov	%rdx, 6(%rdi)
+# ifdef USE_AS_STPCPY
+	lea	13(%rdi), %rax
+# endif
+# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
+	sub	$14, %r8
+	lea	14(%rdi), %rdi
+	jnz	L(StrncpyFillTailWithZero)
+# endif
+	ret
+
+	.p2align 4
+L(Exit15):
+	mov	(%rsi), %rcx
+	mov	7(%rsi), %rdx
+	mov	%rcx, (%rdi)
+	mov	%rdx, 7(%rdi)
+# ifdef USE_AS_STPCPY
+	lea	14(%rdi), %rax
+# endif
+# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
+	sub	$15, %r8
+	lea	15(%rdi), %rdi
+	jnz	L(StrncpyFillTailWithZero)
+# endif
+	ret
+
+	.p2align 4
+L(Exit16):
+	movdqu	(%rsi), %xmm0
+	movdqu	%xmm0, (%rdi)
+# ifdef USE_AS_STPCPY
+	lea	15(%rdi), %rax
+# endif
+# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
+	sub	$16, %r8
+	lea	16(%rdi), %rdi
+	jnz	L(StrncpyFillTailWithZero)
+# endif
+	ret
+
+	.p2align 4
+L(Exit17):
+	movdqu	(%rsi), %xmm0
+	movdqu	%xmm0, (%rdi)
+	mov	%dh, 16(%rdi)
+# ifdef USE_AS_STPCPY
+	lea	16(%rdi), %rax
+# endif
+# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
+	sub	$17, %r8
+	lea	17(%rdi), %rdi
+	jnz	L(StrncpyFillTailWithZero)
+# endif
+	ret
+
+	.p2align 4
+L(Exit18):
+	movdqu	(%rsi), %xmm0
+	mov	16(%rsi), %cx
+	movdqu	%xmm0, (%rdi)
+	mov	%cx, 16(%rdi)
+# ifdef USE_AS_STPCPY
+	lea	17(%rdi), %rax
+# endif
+# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
+	sub	$18, %r8
+	lea	18(%rdi), %rdi
+	jnz	L(StrncpyFillTailWithZero)
+# endif
+	ret
+
+	.p2align 4
+L(Exit19):
+	movdqu	(%rsi), %xmm0
+	mov	15(%rsi), %ecx
+	movdqu	%xmm0, (%rdi)
+	mov	%ecx, 15(%rdi)
+# ifdef USE_AS_STPCPY
+	lea	18(%rdi), %rax
+# endif
+# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
+	sub	$19, %r8
+	lea	19(%rdi), %rdi
+	jnz	L(StrncpyFillTailWithZero)
+# endif
+	ret
+
+	.p2align 4
+L(Exit20):
+	movdqu	(%rsi), %xmm0
+	mov	16(%rsi), %ecx
+	movdqu	%xmm0, (%rdi)
+	mov	%ecx, 16(%rdi)
+# ifdef USE_AS_STPCPY
+	lea	19(%rdi), %rax
+# endif
+# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
+	sub	$20, %r8
+	lea	20(%rdi), %rdi
+	jnz	L(StrncpyFillTailWithZero)
+# endif
+	ret
+
+	.p2align 4
+L(Exit21):
+	movdqu	(%rsi), %xmm0
+	mov	16(%rsi), %ecx
+	movdqu	%xmm0, (%rdi)
+	mov	%ecx, 16(%rdi)
+	mov	%dh, 20(%rdi)
+# ifdef USE_AS_STPCPY
+	lea	20(%rdi), %rax
+# endif
+# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
+	sub	$21, %r8
+	lea	21(%rdi), %rdi
+	jnz	L(StrncpyFillTailWithZero)
+# endif
+	ret
+
+	.p2align 4
+L(Exit22):
+	movdqu	(%rsi), %xmm0
+	mov	14(%rsi), %rcx
+	movdqu	%xmm0, (%rdi)
+	mov	%rcx, 14(%rdi)
+# ifdef USE_AS_STPCPY
+	lea	21(%rdi), %rax
+# endif
+# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
+	sub	$22, %r8
+	lea	22(%rdi), %rdi
+	jnz	L(StrncpyFillTailWithZero)
+# endif
+	ret
+
+	.p2align 4
+L(Exit23):
+	movdqu	(%rsi), %xmm0
+	mov	15(%rsi), %rcx
+	movdqu	%xmm0, (%rdi)
+	mov	%rcx, 15(%rdi)
+# ifdef USE_AS_STPCPY
+	lea	22(%rdi), %rax
+# endif
+# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
+	sub	$23, %r8
+	lea	23(%rdi), %rdi
+	jnz	L(StrncpyFillTailWithZero)
+# endif
+	ret
+
+	.p2align 4
+L(Exit24):
+	movdqu	(%rsi), %xmm0
+	mov	16(%rsi), %rcx
+	movdqu	%xmm0, (%rdi)
+	mov	%rcx, 16(%rdi)
+# ifdef USE_AS_STPCPY
+	lea	23(%rdi), %rax
+# endif
+# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
+	sub	$24, %r8
+	lea	24(%rdi), %rdi
+	jnz	L(StrncpyFillTailWithZero)
+# endif
+	ret
+
+	.p2align 4
+L(Exit25):
+	movdqu	(%rsi), %xmm0
+	mov	16(%rsi), %rcx
+	movdqu	%xmm0, (%rdi)
+	mov	%rcx, 16(%rdi)
+	mov	%dh, 24(%rdi)
+# ifdef USE_AS_STPCPY
+	lea	24(%rdi), %rax
+# endif
+# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
+	sub	$25, %r8
+	lea	25(%rdi), %rdi
+	jnz	L(StrncpyFillTailWithZero)
+# endif
+	ret
+
+	.p2align 4
+L(Exit26):
+	movdqu	(%rsi), %xmm0
+	mov	16(%rsi), %rdx
+	mov	24(%rsi), %cx
+	movdqu	%xmm0, (%rdi)
+	mov	%rdx, 16(%rdi)
+	mov	%cx, 24(%rdi)
+# ifdef USE_AS_STPCPY
+	lea	25(%rdi), %rax
+# endif
+# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
+	sub	$26, %r8
+	lea	26(%rdi), %rdi
+	jnz	L(StrncpyFillTailWithZero)
+# endif
+	ret
+
+	.p2align 4
+L(Exit27):
+	movdqu	(%rsi), %xmm0
+	mov	16(%rsi), %rdx
+	mov	23(%rsi), %ecx
+	movdqu	%xmm0, (%rdi)
+	mov	%rdx, 16(%rdi)
+	mov	%ecx, 23(%rdi)
+# ifdef USE_AS_STPCPY
+	lea	26(%rdi), %rax
+# endif
+# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
+	sub	$27, %r8
+	lea	27(%rdi), %rdi
+	jnz	L(StrncpyFillTailWithZero)
+# endif
+	ret
+
+	.p2align 4
+L(Exit28):
+	movdqu	(%rsi), %xmm0
+	mov	16(%rsi), %rdx
+	mov	24(%rsi), %ecx
+	movdqu	%xmm0, (%rdi)
+	mov	%rdx, 16(%rdi)
+	mov	%ecx, 24(%rdi)
+# ifdef USE_AS_STPCPY
+	lea	27(%rdi), %rax
+# endif
+# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
+	sub	$28, %r8
+	lea	28(%rdi), %rdi
+	jnz	L(StrncpyFillTailWithZero)
+# endif
+	ret
+
+	.p2align 4
+L(Exit29):
+	movdqu	(%rsi), %xmm0
+	movdqu	13(%rsi), %xmm2
+	movdqu	%xmm0, (%rdi)
+	movdqu	%xmm2, 13(%rdi)
+# ifdef USE_AS_STPCPY
+	lea	28(%rdi), %rax
+# endif
+# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
+	sub	$29, %r8
+	lea	29(%rdi), %rdi
+	jnz	L(StrncpyFillTailWithZero)
+# endif
+	ret
+
+	.p2align 4
+L(Exit30):
+	movdqu	(%rsi), %xmm0
+	movdqu	14(%rsi), %xmm2
+	movdqu	%xmm0, (%rdi)
+	movdqu	%xmm2, 14(%rdi)
+# ifdef USE_AS_STPCPY
+	lea	29(%rdi), %rax
+# endif
+# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
+	sub	$30, %r8
+	lea	30(%rdi), %rdi
+	jnz	L(StrncpyFillTailWithZero)
+# endif
+	ret
+
+	.p2align 4
+L(Exit31):
+	movdqu	(%rsi), %xmm0
+	movdqu	15(%rsi), %xmm2
+	movdqu	%xmm0, (%rdi)
+	movdqu	%xmm2, 15(%rdi)
+# ifdef USE_AS_STPCPY
+	lea	30(%rdi), %rax
+# endif
+# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
+	sub	$31, %r8
+	lea	31(%rdi), %rdi
+	jnz	L(StrncpyFillTailWithZero)
+# endif
+	ret
+
+	.p2align 4
+L(Exit32):
+	movdqu	(%rsi), %xmm0
+	movdqu	16(%rsi), %xmm2
+	movdqu	%xmm0, (%rdi)
+	movdqu	%xmm2, 16(%rdi)
+# ifdef USE_AS_STPCPY
+	lea	31(%rdi), %rax
+# endif
+# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
+	sub	$32, %r8
+	lea	32(%rdi), %rdi
+	jnz	L(StrncpyFillTailWithZero)
+# endif
+	ret
+
+# ifdef USE_AS_STRNCPY
+
+	.p2align 4
+L(StrncpyExit0):
+# ifdef USE_AS_STPCPY
+	mov	%rdi, %rax
+# endif
+# ifdef USE_AS_STRCAT
+	xor	%ch, %ch
+	movb	%ch, (%rdi)
+# endif
+	ret
+
+	.p2align 4
+L(StrncpyExit1):
+	mov	(%rsi), %dl
+	mov	%dl, (%rdi)
+# ifdef USE_AS_STPCPY
+	lea	1(%rdi), %rax
+# endif
+# ifdef USE_AS_STRCAT
+	xor	%ch, %ch
+	movb	%ch, 1(%rdi)
+# endif
+	ret
+
+	.p2align 4
+L(StrncpyExit2):
+	mov	(%rsi), %dx
+	mov	%dx, (%rdi)
+# ifdef USE_AS_STPCPY
+	lea	2(%rdi), %rax
+# endif
+# ifdef USE_AS_STRCAT
+	xor	%ch, %ch
+	movb	%ch, 2(%rdi)
+# endif
+	ret
+
+	.p2align 4
+L(StrncpyExit3):
+	mov	(%rsi), %cx
+	mov	2(%rsi), %dl
+	mov	%cx, (%rdi)
+	mov	%dl, 2(%rdi)
+# ifdef USE_AS_STPCPY
+	lea	3(%rdi), %rax
+# endif
+# ifdef USE_AS_STRCAT
+	xor	%ch, %ch
+	movb	%ch, 3(%rdi)
+# endif
+	ret
+
+	.p2align 4
+L(StrncpyExit4):
+	mov	(%rsi), %edx
+	mov	%edx, (%rdi)
+# ifdef USE_AS_STPCPY
+	lea	4(%rdi), %rax
+# endif
+# ifdef USE_AS_STRCAT
+	xor	%ch, %ch
+	movb	%ch, 4(%rdi)
+# endif
+	ret
+
+	.p2align 4
+L(StrncpyExit5):
+	mov	(%rsi), %ecx
+	mov	4(%rsi), %dl
+	mov	%ecx, (%rdi)
+	mov	%dl, 4(%rdi)
+# ifdef USE_AS_STPCPY
+	lea	5(%rdi), %rax
+# endif
+# ifdef USE_AS_STRCAT
+	xor	%ch, %ch
+	movb	%ch, 5(%rdi)
+# endif
+	ret
+
+	.p2align 4
+L(StrncpyExit6):
+	mov	(%rsi), %ecx
+	mov	4(%rsi), %dx
+	mov	%ecx, (%rdi)
+	mov	%dx, 4(%rdi)
+# ifdef USE_AS_STPCPY
+	lea	6(%rdi), %rax
+# endif
+# ifdef USE_AS_STRCAT
+	xor	%ch, %ch
+	movb	%ch, 6(%rdi)
+# endif
+	ret
+
+	.p2align 4
+L(StrncpyExit7):
+	mov	(%rsi), %ecx
+	mov	3(%rsi), %edx
+	mov	%ecx, (%rdi)
+	mov	%edx, 3(%rdi)
+# ifdef USE_AS_STPCPY
+	lea	7(%rdi), %rax
+# endif
+# ifdef USE_AS_STRCAT
+	xor	%ch, %ch
+	movb	%ch, 7(%rdi)
+# endif
+	ret
+
+	.p2align 4
+L(StrncpyExit8):
+	mov	(%rsi), %rdx
+	mov	%rdx, (%rdi)
+# ifdef USE_AS_STPCPY
+	lea	8(%rdi), %rax
+# endif
+# ifdef USE_AS_STRCAT
+	xor	%ch, %ch
+	movb	%ch, 8(%rdi)
+# endif
+	ret
+
+	.p2align 4
+L(StrncpyExit9):
+	mov	(%rsi), %rcx
+	mov	8(%rsi), %dl
+	mov	%rcx, (%rdi)
+	mov	%dl, 8(%rdi)
+# ifdef USE_AS_STPCPY
+	lea	9(%rdi), %rax
+# endif
+# ifdef USE_AS_STRCAT
+	xor	%ch, %ch
+	movb	%ch, 9(%rdi)
+# endif
+	ret
+
+	.p2align 4
+L(StrncpyExit10):
+	mov	(%rsi), %rcx
+	mov	8(%rsi), %dx
+	mov	%rcx, (%rdi)
+	mov	%dx, 8(%rdi)
+# ifdef USE_AS_STPCPY
+	lea	10(%rdi), %rax
+# endif
+# ifdef USE_AS_STRCAT
+	xor	%ch, %ch
+	movb	%ch, 10(%rdi)
+# endif
+	ret
+
+	.p2align 4
+L(StrncpyExit11):
+	mov	(%rsi), %rcx
+	mov	7(%rsi), %edx
+	mov	%rcx, (%rdi)
+	mov	%edx, 7(%rdi)
+# ifdef USE_AS_STPCPY
+	lea	11(%rdi), %rax
+# endif
+# ifdef USE_AS_STRCAT
+	xor	%ch, %ch
+	movb	%ch, 11(%rdi)
+# endif
+	ret
+
+	.p2align 4
+L(StrncpyExit12):
+	mov	(%rsi), %rcx
+	mov	8(%rsi), %edx
+	mov	%rcx, (%rdi)
+	mov	%edx, 8(%rdi)
+# ifdef USE_AS_STPCPY
+	lea	12(%rdi), %rax
+# endif
+# ifdef USE_AS_STRCAT
+	xor	%ch, %ch
+	movb	%ch, 12(%rdi)
+# endif
+	ret
+
+	.p2align 4
+L(StrncpyExit13):
+	mov	(%rsi), %rcx
+	mov	5(%rsi), %rdx
+	mov	%rcx, (%rdi)
+	mov	%rdx, 5(%rdi)
+# ifdef USE_AS_STPCPY
+	lea	13(%rdi), %rax
+# endif
+# ifdef USE_AS_STRCAT
+	xor	%ch, %ch
+	movb	%ch, 13(%rdi)
+# endif
+	ret
+
+	.p2align 4
+L(StrncpyExit14):
+	mov	(%rsi), %rcx
+	mov	6(%rsi), %rdx
+	mov	%rcx, (%rdi)
+	mov	%rdx, 6(%rdi)
+# ifdef USE_AS_STPCPY
+	lea	14(%rdi), %rax
+# endif
+# ifdef USE_AS_STRCAT
+	xor	%ch, %ch
+	movb	%ch, 14(%rdi)
+# endif
+	ret
+
+	.p2align 4
+L(StrncpyExit15):
+	mov	(%rsi), %rcx
+	mov	7(%rsi), %rdx
+	mov	%rcx, (%rdi)
+	mov	%rdx, 7(%rdi)
+# ifdef USE_AS_STPCPY
+	lea	15(%rdi), %rax
+# endif
+# ifdef USE_AS_STRCAT
+	xor	%ch, %ch
+	movb	%ch, 15(%rdi)
+# endif
+	ret
+
+	.p2align 4
+L(StrncpyExit16):
+	movdqu	(%rsi), %xmm0
+	movdqu	%xmm0, (%rdi)
+# ifdef USE_AS_STPCPY
+	lea	16(%rdi), %rax
+# endif
+# ifdef USE_AS_STRCAT
+	xor	%ch, %ch
+	movb	%ch, 16(%rdi)
+# endif
+	ret
+
+	.p2align 4
+L(StrncpyExit17):
+	movdqu	(%rsi), %xmm0
+	mov	16(%rsi), %cl
+	movdqu	%xmm0, (%rdi)
+	mov	%cl, 16(%rdi)
+# ifdef USE_AS_STPCPY
+	lea	17(%rdi), %rax
+# endif
+# ifdef USE_AS_STRCAT
+	xor	%ch, %ch
+	movb	%ch, 17(%rdi)
+# endif
+	ret
+
+	.p2align 4
+L(StrncpyExit18):
+	movdqu	(%rsi), %xmm0
+	mov	16(%rsi), %cx
+	movdqu	%xmm0, (%rdi)
+	mov	%cx, 16(%rdi)
+# ifdef USE_AS_STPCPY
+	lea	18(%rdi), %rax
+# endif
+# ifdef USE_AS_STRCAT
+	xor	%ch, %ch
+	movb	%ch, 18(%rdi)
+# endif
+	ret
+
+	.p2align 4
+L(StrncpyExit19):
+	movdqu	(%rsi), %xmm0
+	mov	15(%rsi), %ecx
+	movdqu	%xmm0, (%rdi)
+	mov	%ecx, 15(%rdi)
+# ifdef USE_AS_STPCPY
+	lea	19(%rdi), %rax
+# endif
+# ifdef USE_AS_STRCAT
+	xor	%ch, %ch
+	movb	%ch, 19(%rdi)
+# endif
+	ret
+
+	.p2align 4
+L(StrncpyExit20):
+	movdqu	(%rsi), %xmm0
+	mov	16(%rsi), %ecx
+	movdqu	%xmm0, (%rdi)
+	mov	%ecx, 16(%rdi)
+# ifdef USE_AS_STPCPY
+	lea	20(%rdi), %rax
+# endif
+# ifdef USE_AS_STRCAT
+	xor	%ch, %ch
+	movb	%ch, 20(%rdi)
+# endif
+	ret
+
+	.p2align 4
+L(StrncpyExit21):
+	movdqu	(%rsi), %xmm0
+	mov	16(%rsi), %ecx
+	mov	20(%rsi), %dl
+	movdqu	%xmm0, (%rdi)
+	mov	%ecx, 16(%rdi)
+	mov	%dl, 20(%rdi)
+# ifdef USE_AS_STPCPY
+	lea	21(%rdi), %rax
+# endif
+# ifdef USE_AS_STRCAT
+	xor	%ch, %ch
+	movb	%ch, 21(%rdi)
+# endif
+	ret
+
+	.p2align 4
+L(StrncpyExit22):
+	movdqu	(%rsi), %xmm0
+	mov	14(%rsi), %rcx
+	movdqu	%xmm0, (%rdi)
+	mov	%rcx, 14(%rdi)
+# ifdef USE_AS_STPCPY
+	lea	22(%rdi), %rax
+# endif
+# ifdef USE_AS_STRCAT
+	xor	%ch, %ch
+	movb	%ch, 22(%rdi)
+# endif
+	ret
+
+	.p2align 4
+L(StrncpyExit23):
+	movdqu	(%rsi), %xmm0
+	mov	15(%rsi), %rcx
+	movdqu	%xmm0, (%rdi)
+	mov	%rcx, 15(%rdi)
+# ifdef USE_AS_STPCPY
+	lea	23(%rdi), %rax
+# endif
+# ifdef USE_AS_STRCAT
+	xor	%ch, %ch
+	movb	%ch, 23(%rdi)
+# endif
+	ret
+
+	.p2align 4
+L(StrncpyExit24):
+	movdqu	(%rsi), %xmm0
+	mov	16(%rsi), %rcx
+	movdqu	%xmm0, (%rdi)
+	mov	%rcx, 16(%rdi)
+# ifdef USE_AS_STPCPY
+	lea	24(%rdi), %rax
+# endif
+# ifdef USE_AS_STRCAT
+	xor	%ch, %ch
+	movb	%ch, 24(%rdi)
+# endif
+	ret
+
+	.p2align 4
+L(StrncpyExit25):
+	movdqu	(%rsi), %xmm0
+	mov	16(%rsi), %rdx
+	mov	24(%rsi), %cl
+	movdqu	%xmm0, (%rdi)
+	mov	%rdx, 16(%rdi)
+	mov	%cl, 24(%rdi)
+# ifdef USE_AS_STPCPY
+	lea	25(%rdi), %rax
+# endif
+# ifdef USE_AS_STRCAT
+	xor	%ch, %ch
+	movb	%ch, 25(%rdi)
+# endif
+	ret
+
+	.p2align 4
+L(StrncpyExit26):
+	movdqu	(%rsi), %xmm0
+	mov	16(%rsi), %rdx
+	mov	24(%rsi), %cx
+	movdqu	%xmm0, (%rdi)
+	mov	%rdx, 16(%rdi)
+	mov	%cx, 24(%rdi)
+# ifdef USE_AS_STPCPY
+	lea	26(%rdi), %rax
+# endif
+# ifdef USE_AS_STRCAT
+	xor	%ch, %ch
+	movb	%ch, 26(%rdi)
+# endif
+	ret
+
+	.p2align 4
+L(StrncpyExit27):
+	movdqu	(%rsi), %xmm0
+	mov	16(%rsi), %rdx
+	mov	23(%rsi), %ecx
+	movdqu	%xmm0, (%rdi)
+	mov	%rdx, 16(%rdi)
+	mov	%ecx, 23(%rdi)
+# ifdef USE_AS_STPCPY
+	lea	27(%rdi), %rax
+# endif
+# ifdef USE_AS_STRCAT
+	xor	%ch, %ch
+	movb	%ch, 27(%rdi)
+# endif
+	ret
+
+	.p2align 4
+L(StrncpyExit28):
+	movdqu	(%rsi), %xmm0
+	mov	16(%rsi), %rdx
+	mov	24(%rsi), %ecx
+	movdqu	%xmm0, (%rdi)
+	mov	%rdx, 16(%rdi)
+	mov	%ecx, 24(%rdi)
+# ifdef USE_AS_STPCPY
+	lea	28(%rdi), %rax
+# endif
+# ifdef USE_AS_STRCAT
+	xor	%ch, %ch
+	movb	%ch, 28(%rdi)
+# endif
+	ret
+
+	.p2align 4
+L(StrncpyExit29):
+	movdqu	(%rsi), %xmm0
+	movdqu	13(%rsi), %xmm2
+	movdqu	%xmm0, (%rdi)
+	movdqu	%xmm2, 13(%rdi)
+# ifdef USE_AS_STPCPY
+	lea	29(%rdi), %rax
+# endif
+# ifdef USE_AS_STRCAT
+	xor	%ch, %ch
+	movb	%ch, 29(%rdi)
+# endif
+	ret
+
+	.p2align 4
+L(StrncpyExit30):
+	movdqu	(%rsi), %xmm0
+	movdqu	14(%rsi), %xmm2
+	movdqu	%xmm0, (%rdi)
+	movdqu	%xmm2, 14(%rdi)
+# ifdef USE_AS_STPCPY
+	lea	30(%rdi), %rax
+# endif
+# ifdef USE_AS_STRCAT
+	xor	%ch, %ch
+	movb	%ch, 30(%rdi)
+# endif
+	ret
+
+	.p2align 4
+L(StrncpyExit31):
+	movdqu	(%rsi), %xmm0
+	movdqu	15(%rsi), %xmm2
+	movdqu	%xmm0, (%rdi)
+	movdqu	%xmm2, 15(%rdi)
+# ifdef USE_AS_STPCPY
+	lea	31(%rdi), %rax
+# endif
+# ifdef USE_AS_STRCAT
+	xor	%ch, %ch
+	movb	%ch, 31(%rdi)
+# endif
+	ret
+
+	.p2align 4
+L(StrncpyExit32):
+	movdqu	(%rsi), %xmm0
+	movdqu	16(%rsi), %xmm2
+	movdqu	%xmm0, (%rdi)
+	movdqu	%xmm2, 16(%rdi)
+# ifdef USE_AS_STPCPY
+	lea	32(%rdi), %rax
+# endif
+# ifdef USE_AS_STRCAT
+	xor	%ch, %ch
+	movb	%ch, 32(%rdi)
+# endif
+	ret
+
+	.p2align 4
+L(StrncpyExit33):
+	movdqu	(%rsi), %xmm0
+	movdqu	16(%rsi), %xmm2
+	mov	32(%rsi), %cl
+	movdqu	%xmm0, (%rdi)
+	movdqu	%xmm2, 16(%rdi)
+	mov	%cl, 32(%rdi)
+# ifdef USE_AS_STRCAT
+	xor	%ch, %ch
+	movb	%ch, 33(%rdi)
+# endif
+	ret
+
+# ifndef USE_AS_STRCAT
+
+	.p2align 4
+L(Fill0):
+	ret
+
+	.p2align 4
+L(Fill1):
+	mov	%dl, (%rdi)
+	ret
+
+	.p2align 4
+L(Fill2):
+	mov	%dx, (%rdi)
+	ret
+
+	.p2align 4
+L(Fill3):
+	mov	%edx, -1(%rdi)
+	ret
+
+	.p2align 4
+L(Fill4):
+	mov	%edx, (%rdi)
+	ret
+
+	.p2align 4
+L(Fill5):
+	mov	%edx, (%rdi)
+	mov	%dl, 4(%rdi)
+	ret
+
+	.p2align 4
+L(Fill6):
+	mov	%edx, (%rdi)
+	mov	%dx, 4(%rdi)
+	ret
+
+	.p2align 4
+L(Fill7):
+	mov	%rdx, -1(%rdi)
+	ret
+
+	.p2align 4
+L(Fill8):
+	mov	%rdx, (%rdi)
+	ret
+
+	.p2align 4
+L(Fill9):
+	mov	%rdx, (%rdi)
+	mov	%dl, 8(%rdi)
+	ret
+
+	.p2align 4
+L(Fill10):
+	mov	%rdx, (%rdi)
+	mov	%dx, 8(%rdi)
+	ret
+
+	.p2align 4
+L(Fill11):
+	mov	%rdx, (%rdi)
+	mov	%edx, 7(%rdi)
+	ret
+
+	.p2align 4
+L(Fill12):
+	mov	%rdx, (%rdi)
+	mov	%edx, 8(%rdi)
+	ret
+
+	.p2align 4
+L(Fill13):
+	mov	%rdx, (%rdi)
+	mov	%rdx, 5(%rdi)
+	ret
+
+	.p2align 4
+L(Fill14):
+	mov	%rdx, (%rdi)
+	mov	%rdx, 6(%rdi)
+	ret
+
+	.p2align 4
+L(Fill15):
+	movdqu	%xmm0, -1(%rdi)
+	ret
+
+	.p2align 4
+L(Fill16):
+	movdqu	%xmm0, (%rdi)
+	ret
+
+	.p2align 4
+L(CopyFrom1To16BytesUnalignedXmm2):
+	movdqu	%xmm2, (%rdi, %rcx)
+
+	.p2align 4
+L(CopyFrom1To16BytesXmmExit):
+	bsf	%rdx, %rdx
+	add	$15, %r8
+	add	%rcx, %rdi
+# ifdef USE_AS_STPCPY
+	lea	(%rdi, %rdx), %rax
+# endif
+	sub	%rdx, %r8
+	lea	1(%rdi, %rdx), %rdi
+
+	.p2align 4
+L(StrncpyFillTailWithZero):
+	pxor	%xmm0, %xmm0
+	xor	%rdx, %rdx
+	sub	$16, %r8
+	jbe	L(StrncpyFillExit)
+
+	movdqu	%xmm0, (%rdi)
+	add	$16, %rdi
+
+	mov	%rdi, %rsi
+	and	$0xf, %rsi
+	sub	%rsi, %rdi
+	add	%rsi, %r8
+	sub	$64, %r8
+	jb	L(StrncpyFillLess64)
+
+L(StrncpyFillLoopMovdqa):
+	movdqa	%xmm0, (%rdi)
+	movdqa	%xmm0, 16(%rdi)
+	movdqa	%xmm0, 32(%rdi)
+	movdqa	%xmm0, 48(%rdi)
+	add	$64, %rdi
+	sub	$64, %r8
+	jae	L(StrncpyFillLoopMovdqa)
+
+L(StrncpyFillLess64):
+	add	$32, %r8
+	jl	L(StrncpyFillLess32)
+	movdqa	%xmm0, (%rdi)
+	movdqa	%xmm0, 16(%rdi)
+	add	$32, %rdi
+	sub	$16, %r8
+	jl	L(StrncpyFillExit)
+	movdqa	%xmm0, (%rdi)
+	add	$16, %rdi
+	BRANCH_TO_JMPTBL_ENTRY (L(FillTable), %r8, 4)
+
+L(StrncpyFillLess32):
+	add	$16, %r8
+	jl	L(StrncpyFillExit)
+	movdqa	%xmm0, (%rdi)
+	add	$16, %rdi
+	BRANCH_TO_JMPTBL_ENTRY (L(FillTable), %r8, 4)
+
+L(StrncpyFillExit):
+	add	$16, %r8
+	BRANCH_TO_JMPTBL_ENTRY (L(FillTable), %r8, 4)
+
+/* end of ifndef USE_AS_STRCAT */
+# endif
+
+	.p2align 4
+L(UnalignedLeaveCase2OrCase3):
+	test	%rdx, %rdx
+	jnz	L(Unaligned64LeaveCase2)
+L(Unaligned64LeaveCase3):
+	lea	64(%r8), %rcx
+	and	$-16, %rcx
+	add	$48, %r8
+	jl	L(CopyFrom1To16BytesCase3)
+	movdqu	%xmm4, (%rdi)
+	sub	$16, %r8
+	jb	L(CopyFrom1To16BytesCase3)
+	movdqu	%xmm5, 16(%rdi)
+	sub	$16, %r8
+	jb	L(CopyFrom1To16BytesCase3)
+	movdqu	%xmm6, 32(%rdi)
+	sub	$16, %r8
+	jb	L(CopyFrom1To16BytesCase3)
+	movdqu	%xmm7, 48(%rdi)
+# ifdef USE_AS_STPCPY
+	lea	64(%rdi), %rax
+# endif
+# ifdef USE_AS_STRCAT
+	xor	%ch, %ch
+	movb	%ch, 64(%rdi)
+# endif
+	ret
+
+	.p2align 4
+L(Unaligned64LeaveCase2):
+	xor	%rcx, %rcx
+	pcmpeqb	%xmm4, %xmm0
+	pmovmskb %xmm0, %rdx
+	add	$48, %r8
+	jle	L(CopyFrom1To16BytesCase2OrCase3)
+	test	%rdx, %rdx
+# ifndef USE_AS_STRCAT
+	jnz	L(CopyFrom1To16BytesUnalignedXmm4)
+# else
+	jnz	L(CopyFrom1To16Bytes)
+# endif
+	pcmpeqb	%xmm5, %xmm0
+	pmovmskb %xmm0, %rdx
+	movdqu	%xmm4, (%rdi)
+	add	$16, %rcx
+	sub	$16, %r8
+	jbe	L(CopyFrom1To16BytesCase2OrCase3)
+	test	%rdx, %rdx
+# ifndef USE_AS_STRCAT
+	jnz	L(CopyFrom1To16BytesUnalignedXmm5)
+# else
+	jnz	L(CopyFrom1To16Bytes)
+# endif
+
+	pcmpeqb	%xmm6, %xmm0
+	pmovmskb %xmm0, %rdx
+	movdqu	%xmm5, 16(%rdi)
+	add	$16, %rcx
+	sub	$16, %r8
+	jbe	L(CopyFrom1To16BytesCase2OrCase3)
+	test	%rdx, %rdx
+# ifndef USE_AS_STRCAT
+	jnz	L(CopyFrom1To16BytesUnalignedXmm6)
+# else
+	jnz	L(CopyFrom1To16Bytes)
+# endif
+
+	pcmpeqb	%xmm7, %xmm0
+	pmovmskb %xmm0, %rdx
+	movdqu	%xmm6, 32(%rdi)
+	lea	16(%rdi, %rcx), %rdi
+	lea	16(%rsi, %rcx), %rsi
+	bsf	%rdx, %rdx
+	cmp	%r8, %rdx
+	jb	L(CopyFrom1To16BytesExit)
+	BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4)
+
+	.p2align 4
+L(ExitZero):
+# ifndef USE_AS_STRCAT
+	mov	%rdi, %rax
+# endif
+	ret
+
+# endif
+
+# ifndef USE_AS_STRCAT
+END (STRCPY)
+# else
+END (STRCAT)
+# endif
+	.p2align 4
+	.section .rodata
+L(ExitTable):
+	.int	JMPTBL(L(Exit1), L(ExitTable))
+	.int	JMPTBL(L(Exit2), L(ExitTable))
+	.int	JMPTBL(L(Exit3), L(ExitTable))
+	.int	JMPTBL(L(Exit4), L(ExitTable))
+	.int	JMPTBL(L(Exit5), L(ExitTable))
+	.int	JMPTBL(L(Exit6), L(ExitTable))
+	.int	JMPTBL(L(Exit7), L(ExitTable))
+	.int	JMPTBL(L(Exit8), L(ExitTable))
+	.int	JMPTBL(L(Exit9), L(ExitTable))
+	.int	JMPTBL(L(Exit10), L(ExitTable))
+	.int	JMPTBL(L(Exit11), L(ExitTable))
+	.int	JMPTBL(L(Exit12), L(ExitTable))
+	.int	JMPTBL(L(Exit13), L(ExitTable))
+	.int	JMPTBL(L(Exit14), L(ExitTable))
+	.int	JMPTBL(L(Exit15), L(ExitTable))
+	.int	JMPTBL(L(Exit16), L(ExitTable))
+	.int	JMPTBL(L(Exit17), L(ExitTable))
+	.int	JMPTBL(L(Exit18), L(ExitTable))
+	.int	JMPTBL(L(Exit19), L(ExitTable))
+	.int	JMPTBL(L(Exit20), L(ExitTable))
+	.int	JMPTBL(L(Exit21), L(ExitTable))
+	.int	JMPTBL(L(Exit22), L(ExitTable))
+	.int	JMPTBL(L(Exit23), L(ExitTable))
+	.int	JMPTBL(L(Exit24), L(ExitTable))
+	.int	JMPTBL(L(Exit25), L(ExitTable))
+	.int	JMPTBL(L(Exit26), L(ExitTable))
+	.int	JMPTBL(L(Exit27), L(ExitTable))
+	.int	JMPTBL(L(Exit28), L(ExitTable))
+	.int	JMPTBL(L(Exit29), L(ExitTable))
+	.int	JMPTBL(L(Exit30), L(ExitTable))
+	.int	JMPTBL(L(Exit31), L(ExitTable))
+	.int	JMPTBL(L(Exit32), L(ExitTable))
+# ifdef USE_AS_STRNCPY
+L(ExitStrncpyTable):
+	.int	JMPTBL(L(StrncpyExit0), L(ExitStrncpyTable))
+	.int	JMPTBL(L(StrncpyExit1), L(ExitStrncpyTable))
+	.int	JMPTBL(L(StrncpyExit2), L(ExitStrncpyTable))
+	.int	JMPTBL(L(StrncpyExit3), L(ExitStrncpyTable))
+	.int	JMPTBL(L(StrncpyExit4), L(ExitStrncpyTable))
+	.int	JMPTBL(L(StrncpyExit5), L(ExitStrncpyTable))
+	.int	JMPTBL(L(StrncpyExit6), L(ExitStrncpyTable))
+	.int	JMPTBL(L(StrncpyExit7), L(ExitStrncpyTable))
+	.int	JMPTBL(L(StrncpyExit8), L(ExitStrncpyTable))
+	.int	JMPTBL(L(StrncpyExit9), L(ExitStrncpyTable))
+	.int	JMPTBL(L(StrncpyExit10), L(ExitStrncpyTable))
+	.int	JMPTBL(L(StrncpyExit11), L(ExitStrncpyTable))
+	.int	JMPTBL(L(StrncpyExit12), L(ExitStrncpyTable))
+	.int	JMPTBL(L(StrncpyExit13), L(ExitStrncpyTable))
+	.int	JMPTBL(L(StrncpyExit14), L(ExitStrncpyTable))
+	.int	JMPTBL(L(StrncpyExit15), L(ExitStrncpyTable))
+	.int	JMPTBL(L(StrncpyExit16), L(ExitStrncpyTable))
+	.int	JMPTBL(L(StrncpyExit17), L(ExitStrncpyTable))
+	.int	JMPTBL(L(StrncpyExit18), L(ExitStrncpyTable))
+	.int	JMPTBL(L(StrncpyExit19), L(ExitStrncpyTable))
+	.int	JMPTBL(L(StrncpyExit20), L(ExitStrncpyTable))
+	.int	JMPTBL(L(StrncpyExit21), L(ExitStrncpyTable))
+	.int	JMPTBL(L(StrncpyExit22), L(ExitStrncpyTable))
+	.int	JMPTBL(L(StrncpyExit23), L(ExitStrncpyTable))
+	.int	JMPTBL(L(StrncpyExit24), L(ExitStrncpyTable))
+	.int	JMPTBL(L(StrncpyExit25), L(ExitStrncpyTable))
+	.int	JMPTBL(L(StrncpyExit26), L(ExitStrncpyTable))
+	.int	JMPTBL(L(StrncpyExit27), L(ExitStrncpyTable))
+	.int	JMPTBL(L(StrncpyExit28), L(ExitStrncpyTable))
+	.int	JMPTBL(L(StrncpyExit29), L(ExitStrncpyTable))
+	.int	JMPTBL(L(StrncpyExit30), L(ExitStrncpyTable))
+	.int	JMPTBL(L(StrncpyExit31), L(ExitStrncpyTable))
+	.int	JMPTBL(L(StrncpyExit32), L(ExitStrncpyTable))
+	.int	JMPTBL(L(StrncpyExit33), L(ExitStrncpyTable))
+# ifndef USE_AS_STRCAT
+	.p2align 4
+L(FillTable):
+	.int	JMPTBL(L(Fill0), L(FillTable))
+	.int	JMPTBL(L(Fill1), L(FillTable))
+	.int	JMPTBL(L(Fill2), L(FillTable))
+	.int	JMPTBL(L(Fill3), L(FillTable))
+	.int	JMPTBL(L(Fill4), L(FillTable))
+	.int	JMPTBL(L(Fill5), L(FillTable))
+	.int	JMPTBL(L(Fill6), L(FillTable))
+	.int	JMPTBL(L(Fill7), L(FillTable))
+	.int	JMPTBL(L(Fill8), L(FillTable))
+	.int	JMPTBL(L(Fill9), L(FillTable))
+	.int	JMPTBL(L(Fill10), L(FillTable))
+	.int	JMPTBL(L(Fill11), L(FillTable))
+	.int	JMPTBL(L(Fill12), L(FillTable))
+	.int	JMPTBL(L(Fill13), L(FillTable))
+	.int	JMPTBL(L(Fill14), L(FillTable))
+	.int	JMPTBL(L(Fill15), L(FillTable))
+	.int	JMPTBL(L(Fill16), L(FillTable))
+# endif
+# endif
+#endif
diff --git a/sysdeps/x86_64/multiarch/strncpy.S b/sysdeps/x86_64/multiarch/strncpy.S
index 6d87a0b..afbd870 100644
--- a/sysdeps/x86_64/multiarch/strncpy.S
+++ b/sysdeps/x86_64/multiarch/strncpy.S
@@ -1,5 +1,85 @@
-/* Multiple versions of strncpy
-   All versions must be listed in ifunc-impl-list.c.  */
-#define STRCPY strncpy
+/* Multiple versions of strcpy
+   All versions must be listed in ifunc-impl-list.c.
+   Copyright (C) 2009-2015 Free Software Foundation, Inc.
+   Contributed by Intel Corporation.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#include <sysdep.h>
+#include <init-arch.h>
+
 #define USE_AS_STRNCPY
-#include "strcpy.S"
+#ifndef STRNCPY
+#define STRNCPY strncpy
+#endif
+
+#ifdef USE_AS_STPCPY
+# define STRNCPY_SSSE3	__stpncpy_ssse3
+# define STRNCPY_SSE2	__stpncpy_sse2
+# define STRNCPY_SSE2_UNALIGNED	__stpncpy_sse2_unaligned
+# define __GI_STRNCPY	__GI_stpncpy
+# define __GI___STRNCPY	__GI___stpncpy
+#else
+# define STRNCPY_SSSE3	__strncpy_ssse3
+# define STRNCPY_SSE2	__strncpy_sse2
+# define STRNCPY_SSE2_UNALIGNED	__strncpy_sse2_unaligned
+# define __GI_STRNCPY	__GI_strncpy
+#endif
+
+
+/* Define multiple versions only for the definition in libc.  */
+#if IS_IN (libc)
+	.text
+ENTRY(STRNCPY)
+	.type	STRNCPY, @gnu_indirect_function
+	cmpl	$0, __cpu_features+KIND_OFFSET(%rip)
+	jne	1f
+	call	__init_cpu_features
+1:	leaq	STRNCPY_SSE2_UNALIGNED(%rip), %rax
+	testl	$bit_Fast_Unaligned_Load, __cpu_features+FEATURE_OFFSET+index_Fast_Unaligned_Load(%rip)
+	jnz	2f
+	leaq	STRNCPY_SSE2(%rip), %rax
+	testl	$bit_SSSE3, __cpu_features+CPUID_OFFSET+index_SSSE3(%rip)
+	jz	2f
+	leaq	STRNCPY_SSSE3(%rip), %rax
+2:	ret
+END(STRNCPY)
+
+# undef ENTRY
+# define ENTRY(name) \
+	.type STRNCPY_SSE2, @function; \
+	.align 16; \
+	.globl STRNCPY_SSE2; \
+	.hidden STRNCPY_SSE2; \
+	STRNCPY_SSE2: cfi_startproc; \
+	CALL_MCOUNT
+# undef END
+# define END(name) \
+	cfi_endproc; .size STRNCPY_SSE2, .-STRNCPY_SSE2
+# undef libc_hidden_builtin_def
+/* It doesn't make sense to send libc-internal strcpy calls through a PLT.
+   The speedup we get from using SSSE3 instruction is likely eaten away
+   by the indirect call in the PLT.  */
+# define libc_hidden_builtin_def(name) \
+	.globl __GI_STRNCPY; __GI_STRNCPY = STRNCPY_SSE2
+# undef libc_hidden_def
+# define libc_hidden_def(name) \
+	.globl __GI___STRNCPY; __GI___STRNCPY = STRNCPY_SSE2
+#endif
+
+#ifndef USE_AS_STRNCPY
+#include "../strcpy.S"
+#endif