diff mbox

[neleai/string-x64] Improve strcpy sse2 and avx2 implementation

Message ID 20150617180105.GA26497@domone
State New
Headers show

Commit Message

Ondřej Bílka June 17, 2015, 6:01 p.m. UTC
Hi,

I wrote new strcpy on x64 and for some reason I thought that I had
commited it and forgot to ping it.

As there are other routines that I could improve I will use branch
neleai/string-x64 to collect these.

Here is revised version of what I sumbitted in 2013. Main change is that
I now target i7 instead core2 That simplifies things as unaligned loads
are cheap instead bit slower than aligned ones on core2. That mainly
concerns header as for core2 you could get better performance by
aligning loads or stores to 16 bytes after first bytes were read. I do
not know whats better I would need to test it.

That also makes less important support of ssse3 variant. I could send it
but it was one of my list on TODO list that now probably lost
importance. Problem is that on x64 for aligning by ssse3 or sse2 with
shifts you need to make 16 loops for each alignment as you don't have
variable shift. Also it needs to use jump table thats very expensive
For strcpy thats dubious as it increases instruction cache pressure 
and most copies are small. You would need to do switching from unaligned 
loads to aligning. I needed to do profiling to select correct treshold.

If somebody is interested in optimizing old pentiums4, athlon64 I will
provide a ssse3 variant that is also 50% faster than current one.
That is also reason why I omitted drawing current ssse3 implementation
performance.


In this version header first checks 128 bytes unaligned unless they
cross page boundary. That allows more effective loop as then at end of
loop we could simply write last 64 bytes instead specialcasing to avoid
writing before start.

I tried several variants of header, as we first read 16 bytes to xmm0
register question is if they could be reused. I used evolver to select
best variant, there was almost no difference in performance between
these.

Now I do checks for bytes 0-15, then 16-31, then 32-63, then 64-128.
There is possibility to get some cycles with different grouping, I will
post later improvement if I could find something. 


First problem was reading ahead. A rereading 8 bytes looked bit faster
than move from xmm.

Then I tried when to reuse/reread. In 4-7 byte case it was faster reread
than using bit shifts to get second half. For 1-3 bytes I use following
copy with s[0] and s[1] from rdx register with byte shifts.

  Test branch vs this branchless that works for i 0,1,2
   d[i] = 0;
   d[i/2] = s[1];
   d[0] = s[0];

I also added a avx2 loop. Problem why I shouldn't use them in headers
was high latency. I could test if using them for bytes 64-128 would give
speedup.

As technical issues go I needed to move old strcpy_sse_unaligned
implementation into strncpy_sse2_unaligned as strncpy is function that
should be optimized for size, not performance. For now I this will keep
these unchanged.

As performance these are 15%-30% faster than current one for gcc workload on
haswell and ivy bridge. 

As avx2 version its currently 6% on this workload mainly as its bash and
has lot of large loads so avx2 loop helps.

I used my profiler to show improvement, see here

http://kam.mff.cuni.cz/~ondra/benchmark_string/strcpy_profile.html

and source is here

http://kam.mff.cuni.cz/~ondra/benchmark_string/strcpy_profile170615.tar.bz2

Comments?

        * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list):
	Add __strcpy_avx2 and __stpcpy_avx2
        * sysdeps/x86_64/multiarch/Makefile (routines): Add stpcpy_avx2.S and 
	strcpy_avx2.S
        * sysdeps/x86_64/multiarch/stpcpy-avx2.S: New file
        * sysdeps/x86_64/multiarch/strcpy-avx2.S: Likewise.
        * sysdeps/x86_64/multiarch/stpcpy-sse2-unaligned.S: Refactored
	implementation.
        * sysdeps/x86_64/multiarch/strcpy.S: Updated ifunc.
        * sysdeps/x86_64/multiarch/strncpy.S: Moved from strcpy.S.
        * sysdeps/x86_64/multiarch/strncpy-sse2-unaligned.S: Moved
	strcpy-sse2-unaligned.S here.
        * sysdeps/x86_64/multiarch/strcpy-sse2-unaligned.S: Likewise.
        * sysdeps/x86_64/multiarch/stpncpy-sse2-unaligned.S: Redirect
	from strcpy-sse2-unaligned.S to strncpy-sse2-unaligned.S 
        * sysdeps/x86_64/multiarch/stpncpy.S: Likewise.
        * sysdeps/x86_64/multiarch/strcat-sse2-unaligned.S: Likewise.

---
 sysdeps/x86_64/multiarch/Makefile                 |    2 +-
 sysdeps/x86_64/multiarch/ifunc-impl-list.c        |    2 +
 sysdeps/x86_64/multiarch/stpcpy-avx2.S            |    3 +
 sysdeps/x86_64/multiarch/stpcpy-sse2-unaligned.S  |  439 ++++-
 sysdeps/x86_64/multiarch/stpncpy-sse2-unaligned.S |    3 +-
 sysdeps/x86_64/multiarch/stpncpy.S                |    5 +-
 sysdeps/x86_64/multiarch/strcat-sse2-unaligned.S  |    2 +-
 sysdeps/x86_64/multiarch/strcpy-avx2.S            |    4 +
 sysdeps/x86_64/multiarch/strcpy-sse2-unaligned.S  | 1890 +-------------------
 sysdeps/x86_64/multiarch/strcpy.S                 |   22 +-
 sysdeps/x86_64/multiarch/strncpy-sse2-unaligned.S | 1891 ++++++++++++++++++++-
 sysdeps/x86_64/multiarch/strncpy.S                |   88 +-
 14 files changed, 2435 insertions(+), 1921 deletions(-)
 create mode 100644 sysdeps/x86_64/multiarch/stpcpy-avx2.S
 create mode 100644 sysdeps/x86_64/multiarch/strcpy-avx2.S

Comments

Ondřej Bílka June 24, 2015, 8:13 a.m. UTC | #1
On Wed, Jun 17, 2015 at 08:01:05PM +0200, Ondřej Bílka wrote:
> Hi,
> 
> I wrote new strcpy on x64 and for some reason I thought that I had
> commited it and forgot to ping it.
> 
> As there are other routines that I could improve I will use branch
> neleai/string-x64 to collect these.
> 
> Here is revised version of what I sumbitted in 2013. Main change is that
> I now target i7 instead core2 That simplifies things as unaligned loads
> are cheap instead bit slower than aligned ones on core2. That mainly
> concerns header as for core2 you could get better performance by
> aligning loads or stores to 16 bytes after first bytes were read. I do
> not know whats better I would need to test it.
> 
> That also makes less important support of ssse3 variant. I could send it
> but it was one of my list on TODO list that now probably lost
> importance. Problem is that on x64 for aligning by ssse3 or sse2 with
> shifts you need to make 16 loops for each alignment as you don't have
> variable shift. Also it needs to use jump table thats very expensive
> For strcpy thats dubious as it increases instruction cache pressure 
> and most copies are small. You would need to do switching from unaligned 
> loads to aligning. I needed to do profiling to select correct treshold.
> 
> If somebody is interested in optimizing old pentiums4, athlon64 I will
> provide a ssse3 variant that is also 50% faster than current one.
> That is also reason why I omitted drawing current ssse3 implementation
> performance.
> 
> 
> In this version header first checks 128 bytes unaligned unless they
> cross page boundary. That allows more effective loop as then at end of
> loop we could simply write last 64 bytes instead specialcasing to avoid
> writing before start.
> 
> I tried several variants of header, as we first read 16 bytes to xmm0
> register question is if they could be reused. I used evolver to select
> best variant, there was almost no difference in performance between
> these.
> 
> Now I do checks for bytes 0-15, then 16-31, then 32-63, then 64-128.
> There is possibility to get some cycles with different grouping, I will
> post later improvement if I could find something. 
> 
> 
> First problem was reading ahead. A rereading 8 bytes looked bit faster
> than move from xmm.
> 
> Then I tried when to reuse/reread. In 4-7 byte case it was faster reread
> than using bit shifts to get second half. For 1-3 bytes I use following
> copy with s[0] and s[1] from rdx register with byte shifts.
> 
>   Test branch vs this branchless that works for i 0,1,2
>    d[i] = 0;
>    d[i/2] = s[1];
>    d[0] = s[0];
> 
> I also added a avx2 loop. Problem why I shouldn't use them in headers
> was high latency. I could test if using them for bytes 64-128 would give
> speedup.
> 
> As technical issues go I needed to move old strcpy_sse_unaligned
> implementation into strncpy_sse2_unaligned as strncpy is function that
> should be optimized for size, not performance. For now I this will keep
> these unchanged.
> 
> As performance these are 15%-30% faster than current one for gcc workload on
> haswell and ivy bridge. 
> 
> As avx2 version its currently 6% on this workload mainly as its bash and
> has lot of large loads so avx2 loop helps.
> 
> I used my profiler to show improvement, see here
> 
> http://kam.mff.cuni.cz/~ondra/benchmark_string/strcpy_profile.html
> 
> and source is here
> 
> http://kam.mff.cuni.cz/~ondra/benchmark_string/strcpy_profile170615.tar.bz2
> 
> Comments?
> 
>         * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list):
> 	Add __strcpy_avx2 and __stpcpy_avx2
>         * sysdeps/x86_64/multiarch/Makefile (routines): Add stpcpy_avx2.S and 
> 	strcpy_avx2.S
>         * sysdeps/x86_64/multiarch/stpcpy-avx2.S: New file
>         * sysdeps/x86_64/multiarch/strcpy-avx2.S: Likewise.
>         * sysdeps/x86_64/multiarch/stpcpy-sse2-unaligned.S: Refactored
> 	implementation.
>         * sysdeps/x86_64/multiarch/strcpy.S: Updated ifunc.
>         * sysdeps/x86_64/multiarch/strncpy.S: Moved from strcpy.S.
>         * sysdeps/x86_64/multiarch/strncpy-sse2-unaligned.S: Moved
> 	strcpy-sse2-unaligned.S here.
>         * sysdeps/x86_64/multiarch/strcpy-sse2-unaligned.S: Likewise.
>         * sysdeps/x86_64/multiarch/stpncpy-sse2-unaligned.S: Redirect
> 	from strcpy-sse2-unaligned.S to strncpy-sse2-unaligned.S 
>         * sysdeps/x86_64/multiarch/stpncpy.S: Likewise.
>         * sysdeps/x86_64/multiarch/strcat-sse2-unaligned.S: Likewise.
> 
> ---
>  sysdeps/x86_64/multiarch/Makefile                 |    2 +-
>  sysdeps/x86_64/multiarch/ifunc-impl-list.c        |    2 +
>  sysdeps/x86_64/multiarch/stpcpy-avx2.S            |    3 +
>  sysdeps/x86_64/multiarch/stpcpy-sse2-unaligned.S  |  439 ++++-
>  sysdeps/x86_64/multiarch/stpncpy-sse2-unaligned.S |    3 +-
>  sysdeps/x86_64/multiarch/stpncpy.S                |    5 +-
>  sysdeps/x86_64/multiarch/strcat-sse2-unaligned.S  |    2 +-
>  sysdeps/x86_64/multiarch/strcpy-avx2.S            |    4 +
>  sysdeps/x86_64/multiarch/strcpy-sse2-unaligned.S  | 1890 +-------------------
>  sysdeps/x86_64/multiarch/strcpy.S                 |   22 +-
>  sysdeps/x86_64/multiarch/strncpy-sse2-unaligned.S | 1891 ++++++++++++++++++++-
>  sysdeps/x86_64/multiarch/strncpy.S                |   88 +-
>  14 files changed, 2435 insertions(+), 1921 deletions(-)
>  create mode 100644 sysdeps/x86_64/multiarch/stpcpy-avx2.S
>  create mode 100644 sysdeps/x86_64/multiarch/strcpy-avx2.S
> 
> 
> diff --git a/sysdeps/x86_64/multiarch/Makefile b/sysdeps/x86_64/multiarch/Makefile
> index d7002a9..c573744 100644
> --- a/sysdeps/x86_64/multiarch/Makefile
> +++ b/sysdeps/x86_64/multiarch/Makefile
> @@ -29,7 +29,7 @@ CFLAGS-strspn-c.c += -msse4
>  endif
>  
>  ifeq (yes,$(config-cflags-avx2))
> -sysdep_routines += memset-avx2
> +sysdep_routines += memset-avx2 strcpy-avx2 stpcpy-avx2
>  endif
>  endif
>  
> diff --git a/sysdeps/x86_64/multiarch/ifunc-impl-list.c b/sysdeps/x86_64/multiarch/ifunc-impl-list.c
> index b64e4f1..d398e43 100644
> --- a/sysdeps/x86_64/multiarch/ifunc-impl-list.c
> +++ b/sysdeps/x86_64/multiarch/ifunc-impl-list.c
> @@ -88,6 +88,7 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
>  
>    /* Support sysdeps/x86_64/multiarch/stpcpy.S.  */
>    IFUNC_IMPL (i, name, stpcpy,
> +	      IFUNC_IMPL_ADD (array, i, strcpy, HAS_AVX2, __stpcpy_avx2)
>  	      IFUNC_IMPL_ADD (array, i, stpcpy, HAS_SSSE3, __stpcpy_ssse3)
>  	      IFUNC_IMPL_ADD (array, i, stpcpy, 1, __stpcpy_sse2_unaligned)
>  	      IFUNC_IMPL_ADD (array, i, stpcpy, 1, __stpcpy_sse2))
> @@ -137,6 +138,7 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
>  
>    /* Support sysdeps/x86_64/multiarch/strcpy.S.  */
>    IFUNC_IMPL (i, name, strcpy,
> +	      IFUNC_IMPL_ADD (array, i, strcpy, HAS_AVX2, __strcpy_avx2)
>  	      IFUNC_IMPL_ADD (array, i, strcpy, HAS_SSSE3, __strcpy_ssse3)
>  	      IFUNC_IMPL_ADD (array, i, strcpy, 1, __strcpy_sse2_unaligned)
>  	      IFUNC_IMPL_ADD (array, i, strcpy, 1, __strcpy_sse2))
> diff --git a/sysdeps/x86_64/multiarch/stpcpy-avx2.S b/sysdeps/x86_64/multiarch/stpcpy-avx2.S
> new file mode 100644
> index 0000000..bd30ef6
> --- /dev/null
> +++ b/sysdeps/x86_64/multiarch/stpcpy-avx2.S
> @@ -0,0 +1,3 @@
> +#define USE_AVX2
> +#define STPCPY __stpcpy_avx2
> +#include "stpcpy-sse2-unaligned.S"
> diff --git a/sysdeps/x86_64/multiarch/stpcpy-sse2-unaligned.S b/sysdeps/x86_64/multiarch/stpcpy-sse2-unaligned.S
> index 34231f8..695a236 100644
> --- a/sysdeps/x86_64/multiarch/stpcpy-sse2-unaligned.S
> +++ b/sysdeps/x86_64/multiarch/stpcpy-sse2-unaligned.S
> @@ -1,3 +1,436 @@
> -#define USE_AS_STPCPY
> -#define STRCPY __stpcpy_sse2_unaligned
> -#include "strcpy-sse2-unaligned.S"
> +/* stpcpy with SSE2 and unaligned load
> +   Copyright (C) 2015 Free Software Foundation, Inc.
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <http://www.gnu.org/licenses/>.  */
> +
> +#include <sysdep.h>
> +#ifndef STPCPY
> +# define STPCPY __stpcpy_sse2_unaligned
> +#endif
> +
> +ENTRY(STPCPY)
> +	mov	%esi, %edx
> +#ifdef AS_STRCPY
> +	movq    %rdi, %rax
> +#endif
> +	pxor	%xmm4, %xmm4
> +	pxor	%xmm5, %xmm5
> +	andl	$4095, %edx
> +	cmp	$3968, %edx
> +	ja	L(cross_page)
> +
> +	movdqu	(%rsi), %xmm0
> +	pcmpeqb	%xmm0, %xmm4
> +	pmovmskb %xmm4, %edx
> +	testl	%edx, %edx
> +	je	L(more16bytes)
> +	bsf	%edx, %ecx
> +#ifndef AS_STRCPY
> +	lea	(%rdi, %rcx), %rax
> +#endif
> +	cmp	$7, %ecx
> +	movq	(%rsi), %rdx
> +	jb	L(less_8_bytesb)
> +L(8bytes_from_cross):
> +	movq	-7(%rsi, %rcx), %rsi
> +	movq	%rdx, (%rdi)
> +#ifdef AS_STRCPY
> +	movq    %rsi, -7(%rdi, %rcx)
> +#else
> +	movq	%rsi, -7(%rax)
> +#endif
> +	ret
> +
> +	.p2align 4
> +L(less_8_bytesb):
> +	cmp	$2, %ecx
> +	jbe	L(less_4_bytes)
> +L(4bytes_from_cross):
> +	mov	-3(%rsi, %rcx), %esi
> +	mov	%edx, (%rdi)
> +#ifdef AS_STRCPY
> +        mov     %esi, -3(%rdi, %rcx)
> +#else
> +	mov	%esi, -3(%rax)
> +#endif
> +	ret
> +
> +.p2align 4
> + L(less_4_bytes):
> + /*
> +  Test branch vs this branchless that works for i 0,1,2
> +   d[i] = 0;
> +   d[i/2] = s[1];
> +   d[0] = s[0];
> +  */
> +#ifdef AS_STRCPY
> +	movb	$0, (%rdi, %rcx)
> +#endif
> +
> +	shr	$1, %ecx
> +	mov	%edx, %esi
> +	shr	$8, %edx
> +	movb	%dl, (%rdi, %rcx)
> +#ifndef AS_STRCPY
> +	movb	$0, (%rax)
> +#endif
> +	movb	%sil, (%rdi)
> +	ret
> +
> +
> +
> +
> +
> +	.p2align 4
> +L(more16bytes):
> +	pxor	%xmm6, %xmm6
> +	movdqu	16(%rsi), %xmm1
> +	pxor	%xmm7, %xmm7
> +	pcmpeqb	%xmm1, %xmm5
> +	pmovmskb %xmm5, %edx
> +	testl	%edx, %edx
> +	je	L(more32bytes)
> +	bsf	%edx, %edx
> +#ifdef AS_STRCPY
> +        movdqu  1(%rsi, %rdx), %xmm1
> +        movdqu  %xmm0, (%rdi)
> +	movdqu  %xmm1, 1(%rdi, %rdx)
> +#else
> +	lea	16(%rdi, %rdx), %rax
> +	movdqu	1(%rsi, %rdx), %xmm1
> +	movdqu	%xmm0, (%rdi)
> +	movdqu	%xmm1, -15(%rax)
> +#endif
> +	ret
> +
> +	.p2align 4
> +L(more32bytes):
> +	movdqu	32(%rsi), %xmm2
> +	movdqu	48(%rsi), %xmm3
> +
> +	pcmpeqb	%xmm2, %xmm6
> +	pcmpeqb	%xmm3, %xmm7
> +	pmovmskb %xmm7, %edx
> +	shl	$16, %edx
> +	pmovmskb %xmm6, %ecx
> +	or	%ecx, %edx
> +	je	L(more64bytes)
> +	bsf	%edx, %edx
> +#ifndef AS_STRCPY
> +	lea	32(%rdi, %rdx), %rax
> +#endif
> +	movdqu	1(%rsi, %rdx), %xmm2
> +	movdqu	17(%rsi, %rdx), %xmm3
> +	movdqu	%xmm0, (%rdi)
> +	movdqu	%xmm1, 16(%rdi)
> +#ifdef AS_STRCPY
> +        movdqu  %xmm2, 1(%rdi, %rdx)
> +        movdqu  %xmm3, 17(%rdi, %rdx)
> +#else
> +	movdqu	%xmm2, -31(%rax)
> +	movdqu	%xmm3, -15(%rax)
> +#endif
> +	ret
> +
> +	.p2align 4
> +L(more64bytes):
> +	movdqu	%xmm0, (%rdi)
> +	movdqu	%xmm1, 16(%rdi)
> +	movdqu	%xmm2, 32(%rdi)
> +	movdqu	%xmm3, 48(%rdi)
> +	movdqu	64(%rsi), %xmm0
> +	movdqu	80(%rsi), %xmm1
> +	movdqu	96(%rsi), %xmm2
> +	movdqu	112(%rsi), %xmm3
> +
> +	pcmpeqb	%xmm0, %xmm4
> +	pcmpeqb	%xmm1, %xmm5
> +	pcmpeqb	%xmm2, %xmm6
> +	pcmpeqb	%xmm3, %xmm7
> +	pmovmskb %xmm4, %ecx
> +	pmovmskb %xmm5, %edx
> +	pmovmskb %xmm6, %r8d
> +	pmovmskb %xmm7, %r9d
> +	shl	$16, %edx
> +	or	%ecx, %edx
> +	shl	$32, %r8
> +	shl	$48, %r9
> +	or	%r8, %rdx
> +	or	%r9, %rdx
> +	test	%rdx, %rdx
> +	je	L(prepare_loop)
> +	bsf	%rdx, %rdx
> +#ifndef AS_STRCPY
> +	lea	64(%rdi, %rdx), %rax
> +#endif
> +	movdqu	1(%rsi, %rdx), %xmm0
> +	movdqu	17(%rsi, %rdx), %xmm1
> +	movdqu	33(%rsi, %rdx), %xmm2
> +	movdqu	49(%rsi, %rdx), %xmm3
> +#ifdef AS_STRCPY
> +        movdqu  %xmm0, 1(%rdi, %rdx)
> +        movdqu  %xmm1, 17(%rdi, %rdx)
> +        movdqu  %xmm2, 33(%rdi, %rdx)
> +        movdqu  %xmm3, 49(%rdi, %rdx)
> +#else
> +	movdqu	%xmm0, -63(%rax)
> +	movdqu	%xmm1, -47(%rax)
> +	movdqu	%xmm2, -31(%rax)
> +	movdqu	%xmm3, -15(%rax)
> +#endif
> +	ret
> +
> +
> +	.p2align 4
> +L(prepare_loop):
> +	movdqu	%xmm0, 64(%rdi)
> +	movdqu	%xmm1, 80(%rdi)
> +	movdqu	%xmm2, 96(%rdi)
> +	movdqu	%xmm3, 112(%rdi)
> +
> +	subq	%rsi, %rdi
> +	add	$64, %rsi
> +	andq	$-64, %rsi
> +	addq	%rsi, %rdi
> +	jmp	L(loop_entry)
> +
> +#ifdef USE_AVX2
> +	.p2align 4
> +L(loop):
> +	vmovdqu	%ymm1, (%rdi)
> +	vmovdqu	%ymm3, 32(%rdi)
> +L(loop_entry):
> +	vmovdqa	96(%rsi), %ymm3
> +	vmovdqa	64(%rsi), %ymm1
> +	vpminub	%ymm3, %ymm1, %ymm2
> +	addq	$64, %rsi
> +	addq	$64, %rdi
> +	vpcmpeqb %ymm5, %ymm2, %ymm0
> +	vpmovmskb %ymm0, %edx
> +	test	%edx, %edx
> +	je	L(loop)
> +	salq	$32, %rdx
> +	vpcmpeqb %ymm5, %ymm1, %ymm4
> +	vpmovmskb %ymm4, %ecx
> +	or	%rcx, %rdx
> +	bsfq	%rdx, %rdx
> +#ifndef AS_STRCPY
> +	lea	(%rdi, %rdx), %rax
> +#endif
> +	vmovdqu	-63(%rsi, %rdx), %ymm0
> +	vmovdqu	-31(%rsi, %rdx), %ymm2
> +#ifdef AS_STRCPY
> +        vmovdqu  %ymm0, -63(%rdi, %rdx)
> +        vmovdqu  %ymm2, -31(%rdi, %rdx)
> +#else
> +	vmovdqu	%ymm0, -63(%rax)
> +	vmovdqu	%ymm2, -31(%rax)
> +#endif
> +	vzeroupper
> +	ret
> +#else
> +	.p2align 4
> +L(loop):
> +	movdqu	%xmm1, (%rdi)
> +	movdqu	%xmm2, 16(%rdi)
> +	movdqu	%xmm3, 32(%rdi)
> +	movdqu	%xmm4, 48(%rdi)
> +L(loop_entry):
> +	movdqa	96(%rsi), %xmm3
> +	movdqa	112(%rsi), %xmm4
> +	movdqa	%xmm3, %xmm0
> +	movdqa	80(%rsi), %xmm2
> +	pminub	%xmm4, %xmm0
> +	movdqa	64(%rsi), %xmm1
> +	pminub	%xmm2, %xmm0
> +	pminub	%xmm1, %xmm0
> +	addq	$64, %rsi
> +	addq	$64, %rdi
> +	pcmpeqb	%xmm5, %xmm0
> +	pmovmskb %xmm0, %edx
> +	test	%edx, %edx
> +	je	L(loop)
> +	salq	$48, %rdx
> +	pcmpeqb	%xmm1, %xmm5
> +	pcmpeqb	%xmm2, %xmm6
> +	pmovmskb %xmm5, %ecx
> +#ifdef AS_STRCPY
> +	pmovmskb %xmm6, %r8d
> +	pcmpeqb	%xmm3, %xmm7
> +	pmovmskb %xmm7, %r9d
> +	sal	$16, %r8d
> +	or	%r8d, %ecx
> +#else
> +	pmovmskb %xmm6, %eax
> +	pcmpeqb	%xmm3, %xmm7
> +	pmovmskb %xmm7, %r9d
> +	sal	$16, %eax
> +	or	%eax, %ecx
> +#endif
> +	salq	$32, %r9
> +	orq	%rcx, %rdx
> +	orq	%r9, %rdx
> +	bsfq	%rdx, %rdx
> +#ifndef AS_STRCPY
> +	lea	(%rdi, %rdx), %rax
> +#endif
> +	movdqu	-63(%rsi, %rdx), %xmm0
> +	movdqu	-47(%rsi, %rdx), %xmm1
> +	movdqu	-31(%rsi, %rdx), %xmm2
> +	movdqu	-15(%rsi, %rdx), %xmm3
> +#ifdef AS_STRCPY
> +        movdqu  %xmm0, -63(%rdi, %rdx)
> +        movdqu  %xmm1, -47(%rdi, %rdx)
> +        movdqu  %xmm2, -31(%rdi, %rdx)
> +        movdqu  %xmm3, -15(%rdi, %rdx)
> +#else
> +	movdqu	%xmm0, -63(%rax)
> +	movdqu	%xmm1, -47(%rax)
> +	movdqu	%xmm2, -31(%rax)
> +	movdqu	%xmm3, -15(%rax)
> +#endif
> +	ret
> +#endif
> +
> +	.p2align 4
> +L(cross_page):
> +	movq	%rsi, %rcx
> +	pxor	%xmm0, %xmm0
> +	and	$15, %ecx
> +	movq	%rsi, %r9
> +	movq	%rdi, %r10
> +	subq	%rcx, %rsi
> +	subq	%rcx, %rdi
> +	movdqa	(%rsi), %xmm1
> +	pcmpeqb	%xmm0, %xmm1
> +	pmovmskb %xmm1, %edx
> +	shr	%cl, %edx
> +	shl	%cl, %edx
> +	test	%edx, %edx
> +	jne	L(less_32_cross)
> +
> +	addq	$16, %rsi
> +	addq	$16, %rdi
> +	movdqa	(%rsi), %xmm1
> +	pcmpeqb	%xmm1, %xmm0
> +	pmovmskb %xmm0, %edx
> +	test	%edx, %edx
> +	jne	L(less_32_cross)
> +	movdqu	%xmm1, (%rdi)
> +
> +	movdqu	(%r9), %xmm0
> +	movdqu	%xmm0, (%r10)
> +
> +	mov	$8, %rcx
> +L(cross_loop):
> +	addq	$16, %rsi
> +	addq	$16, %rdi
> +	pxor	%xmm0, %xmm0
> +	movdqa	(%rsi), %xmm1
> +	pcmpeqb	%xmm1, %xmm0
> +	pmovmskb %xmm0, %edx
> +	test	%edx, %edx
> +	jne	L(return_cross)
> +	movdqu	%xmm1, (%rdi)
> +	sub	$1, %rcx
> +	ja	L(cross_loop)
> +
> +	pxor	%xmm5, %xmm5
> +	pxor	%xmm6, %xmm6
> +	pxor	%xmm7, %xmm7
> +
> +	lea	-64(%rsi), %rdx
> +	andq	$-64, %rdx
> +	addq	%rdx, %rdi
> +	subq	%rsi, %rdi
> +	movq	%rdx, %rsi
> +	jmp	L(loop_entry)
> +
> +	.p2align 4
> +L(return_cross):
> +	bsf	%edx, %edx
> +#ifdef AS_STRCPY
> +        movdqu  -15(%rsi, %rdx), %xmm0
> +        movdqu  %xmm0, -15(%rdi, %rdx)
> +#else
> +	lea	(%rdi, %rdx), %rax
> +	movdqu	-15(%rsi, %rdx), %xmm0
> +	movdqu	%xmm0, -15(%rax)
> +#endif
> +	ret
> +
> +	.p2align 4
> +L(less_32_cross):
> +	bsf	%rdx, %rdx
> +	lea	(%rdi, %rdx), %rcx
> +#ifndef AS_STRCPY
> +	mov	%rcx, %rax
> +#endif
> +	mov	%r9, %rsi
> +	mov	%r10, %rdi
> +	sub	%rdi, %rcx
> +	cmp	$15, %ecx
> +	jb	L(less_16_cross)
> +	movdqu	(%rsi), %xmm0
> +	movdqu	-15(%rsi, %rcx), %xmm1
> +	movdqu	%xmm0, (%rdi)
> +#ifdef AS_STRCPY
> +	movdqu  %xmm1, -15(%rdi, %rcx)
> +#else
> +	movdqu	%xmm1, -15(%rax)
> +#endif
> +	ret
> +
> +L(less_16_cross):
> +	cmp	$7, %ecx
> +	jb	L(less_8_bytes_cross)
> +	movq	(%rsi), %rdx
> +	jmp	L(8bytes_from_cross)
> +
> +L(less_8_bytes_cross):
> +	cmp	$2, %ecx
> +	jbe	L(3_bytes_cross)
> +	mov	(%rsi), %edx
> +	jmp	L(4bytes_from_cross)
> +
> +L(3_bytes_cross):
> +	jb	L(1_2bytes_cross)
> +	movzwl	(%rsi), %edx
> +	jmp	L(_3_bytesb)
> +
> +L(1_2bytes_cross):
> +	movb	(%rsi), %dl
> +	jmp	L(0_2bytes_from_cross)
> +
> +	.p2align 4
> +L(less_4_bytesb):
> +	je	L(_3_bytesb)
> +L(0_2bytes_from_cross):
> +	movb	%dl, (%rdi)
> +#ifdef AS_STRCPY
> +	movb    $0, (%rdi, %rcx)
> +#else
> +	movb	$0, (%rax)
> +#endif
> +	ret
> +
> +	.p2align 4
> +L(_3_bytesb):
> +	movw	%dx, (%rdi)
> +	movb	$0, 2(%rdi)
> +	ret
> +
> +END(STPCPY)
> diff --git a/sysdeps/x86_64/multiarch/stpncpy-sse2-unaligned.S b/sysdeps/x86_64/multiarch/stpncpy-sse2-unaligned.S
> index 658520f..3f35068 100644
> --- a/sysdeps/x86_64/multiarch/stpncpy-sse2-unaligned.S
> +++ b/sysdeps/x86_64/multiarch/stpncpy-sse2-unaligned.S
> @@ -1,4 +1,3 @@
>  #define USE_AS_STPCPY
> -#define USE_AS_STRNCPY
>  #define STRCPY __stpncpy_sse2_unaligned
> -#include "strcpy-sse2-unaligned.S"
> +#include "strncpy-sse2-unaligned.S"
> diff --git a/sysdeps/x86_64/multiarch/stpncpy.S b/sysdeps/x86_64/multiarch/stpncpy.S
> index 2698ca6..159604a 100644
> --- a/sysdeps/x86_64/multiarch/stpncpy.S
> +++ b/sysdeps/x86_64/multiarch/stpncpy.S
> @@ -1,8 +1,7 @@
>  /* Multiple versions of stpncpy
>     All versions must be listed in ifunc-impl-list.c.  */
> -#define STRCPY __stpncpy
> +#define STRNCPY __stpncpy
>  #define USE_AS_STPCPY
> -#define USE_AS_STRNCPY
> -#include "strcpy.S"
> +#include "strncpy.S"
>  
>  weak_alias (__stpncpy, stpncpy)
> diff --git a/sysdeps/x86_64/multiarch/strcat-sse2-unaligned.S b/sysdeps/x86_64/multiarch/strcat-sse2-unaligned.S
> index 81f1b40..1faa49d 100644
> --- a/sysdeps/x86_64/multiarch/strcat-sse2-unaligned.S
> +++ b/sysdeps/x86_64/multiarch/strcat-sse2-unaligned.S
> @@ -275,5 +275,5 @@ L(StartStrcpyPart):
>  #  define USE_AS_STRNCPY
>  # endif
>  
> -# include "strcpy-sse2-unaligned.S"
> +# include "strncpy-sse2-unaligned.S"
>  #endif
> diff --git a/sysdeps/x86_64/multiarch/strcpy-avx2.S b/sysdeps/x86_64/multiarch/strcpy-avx2.S
> new file mode 100644
> index 0000000..a3133a4
> --- /dev/null
> +++ b/sysdeps/x86_64/multiarch/strcpy-avx2.S
> @@ -0,0 +1,4 @@
> +#define USE_AVX2
> +#define AS_STRCPY
> +#define STPCPY __strcpy_avx2
> +#include "stpcpy-sse2-unaligned.S"
> diff --git a/sysdeps/x86_64/multiarch/strcpy-sse2-unaligned.S b/sysdeps/x86_64/multiarch/strcpy-sse2-unaligned.S
> index 8f03d1d..310e4fa 100644
> --- a/sysdeps/x86_64/multiarch/strcpy-sse2-unaligned.S
> +++ b/sysdeps/x86_64/multiarch/strcpy-sse2-unaligned.S
> @@ -1,1887 +1,3 @@
> -/* strcpy with SSE2 and unaligned load
> -   Copyright (C) 2011-2015 Free Software Foundation, Inc.
> -   Contributed by Intel Corporation.
> -   This file is part of the GNU C Library.
> -
> -   The GNU C Library is free software; you can redistribute it and/or
> -   modify it under the terms of the GNU Lesser General Public
> -   License as published by the Free Software Foundation; either
> -   version 2.1 of the License, or (at your option) any later version.
> -
> -   The GNU C Library is distributed in the hope that it will be useful,
> -   but WITHOUT ANY WARRANTY; without even the implied warranty of
> -   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> -   Lesser General Public License for more details.
> -
> -   You should have received a copy of the GNU Lesser General Public
> -   License along with the GNU C Library; if not, see
> -   <http://www.gnu.org/licenses/>.  */
> -
> -#if IS_IN (libc)
> -
> -# ifndef USE_AS_STRCAT
> -#  include <sysdep.h>
> -
> -#  ifndef STRCPY
> -#   define STRCPY  __strcpy_sse2_unaligned
> -#  endif
> -
> -# endif
> -
> -# define JMPTBL(I, B)	I - B
> -# define BRANCH_TO_JMPTBL_ENTRY(TABLE, INDEX, SCALE)             \
> -	lea	TABLE(%rip), %r11;                              \
> -	movslq	(%r11, INDEX, SCALE), %rcx;                     \
> -	lea	(%r11, %rcx), %rcx;                             \
> -	jmp	*%rcx
> -
> -# ifndef USE_AS_STRCAT
> -
> -.text
> -ENTRY (STRCPY)
> -#  ifdef USE_AS_STRNCPY
> -	mov	%rdx, %r8
> -	test	%r8, %r8
> -	jz	L(ExitZero)
> -#  endif
> -	mov	%rsi, %rcx
> -#  ifndef USE_AS_STPCPY
> -	mov	%rdi, %rax      /* save result */
> -#  endif
> -
> -# endif
> -
> -	and	$63, %rcx
> -	cmp	$32, %rcx
> -	jbe	L(SourceStringAlignmentLess32)
> -
> -	and	$-16, %rsi
> -	and	$15, %rcx
> -	pxor	%xmm0, %xmm0
> -	pxor	%xmm1, %xmm1
> -
> -	pcmpeqb	(%rsi), %xmm1
> -	pmovmskb %xmm1, %rdx
> -	shr	%cl, %rdx
> -
> -# ifdef USE_AS_STRNCPY
> -#  if defined USE_AS_STPCPY || defined USE_AS_STRCAT
> -	mov	$16, %r10
> -	sub	%rcx, %r10
> -	cmp	%r10, %r8
> -#  else
> -	mov	$17, %r10
> -	sub	%rcx, %r10
> -	cmp	%r10, %r8
> -#  endif
> -	jbe	L(CopyFrom1To16BytesTailCase2OrCase3)
> -# endif
> -	test	%rdx, %rdx
> -	jnz	L(CopyFrom1To16BytesTail)
> -
> -	pcmpeqb	16(%rsi), %xmm0
> -	pmovmskb %xmm0, %rdx
> -
> -# ifdef USE_AS_STRNCPY
> -	add	$16, %r10
> -	cmp	%r10, %r8
> -	jbe	L(CopyFrom1To32BytesCase2OrCase3)
> -# endif
> -	test	%rdx, %rdx
> -	jnz	L(CopyFrom1To32Bytes)
> -
> -	movdqu	(%rsi, %rcx), %xmm1   /* copy 16 bytes */
> -	movdqu	%xmm1, (%rdi)
> -
> -/* If source address alignment != destination address alignment */
> -	.p2align 4
> -L(Unalign16Both):
> -	sub	%rcx, %rdi
> -# ifdef USE_AS_STRNCPY
> -	add	%rcx, %r8
> -# endif
> -	mov	$16, %rcx
> -	movdqa	(%rsi, %rcx), %xmm1
> -	movaps	16(%rsi, %rcx), %xmm2
> -	movdqu	%xmm1, (%rdi, %rcx)
> -	pcmpeqb	%xmm2, %xmm0
> -	pmovmskb %xmm0, %rdx
> -	add	$16, %rcx
> -# ifdef USE_AS_STRNCPY
> -	sub	$48, %r8
> -	jbe	L(CopyFrom1To16BytesCase2OrCase3)
> -# endif
> -	test	%rdx, %rdx
> -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> -	jnz	L(CopyFrom1To16BytesUnalignedXmm2)
> -# else
> -	jnz	L(CopyFrom1To16Bytes)
> -# endif
> -
> -	movaps	16(%rsi, %rcx), %xmm3
> -	movdqu	%xmm2, (%rdi, %rcx)
> -	pcmpeqb	%xmm3, %xmm0
> -	pmovmskb %xmm0, %rdx
> -	add	$16, %rcx
> -# ifdef USE_AS_STRNCPY
> -	sub	$16, %r8
> -	jbe	L(CopyFrom1To16BytesCase2OrCase3)
> -# endif
> -	test	%rdx, %rdx
> -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> -	jnz	L(CopyFrom1To16BytesUnalignedXmm3)
> -# else
> -	jnz	L(CopyFrom1To16Bytes)
> -# endif
> -
> -	movaps	16(%rsi, %rcx), %xmm4
> -	movdqu	%xmm3, (%rdi, %rcx)
> -	pcmpeqb	%xmm4, %xmm0
> -	pmovmskb %xmm0, %rdx
> -	add	$16, %rcx
> -# ifdef USE_AS_STRNCPY
> -	sub	$16, %r8
> -	jbe	L(CopyFrom1To16BytesCase2OrCase3)
> -# endif
> -	test	%rdx, %rdx
> -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> -	jnz	L(CopyFrom1To16BytesUnalignedXmm4)
> -# else
> -	jnz	L(CopyFrom1To16Bytes)
> -# endif
> -
> -	movaps	16(%rsi, %rcx), %xmm1
> -	movdqu	%xmm4, (%rdi, %rcx)
> -	pcmpeqb	%xmm1, %xmm0
> -	pmovmskb %xmm0, %rdx
> -	add	$16, %rcx
> -# ifdef USE_AS_STRNCPY
> -	sub	$16, %r8
> -	jbe	L(CopyFrom1To16BytesCase2OrCase3)
> -# endif
> -	test	%rdx, %rdx
> -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> -	jnz	L(CopyFrom1To16BytesUnalignedXmm1)
> -# else
> -	jnz	L(CopyFrom1To16Bytes)
> -# endif
> -
> -	movaps	16(%rsi, %rcx), %xmm2
> -	movdqu	%xmm1, (%rdi, %rcx)
> -	pcmpeqb	%xmm2, %xmm0
> -	pmovmskb %xmm0, %rdx
> -	add	$16, %rcx
> -# ifdef USE_AS_STRNCPY
> -	sub	$16, %r8
> -	jbe	L(CopyFrom1To16BytesCase2OrCase3)
> -# endif
> -	test	%rdx, %rdx
> -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> -	jnz	L(CopyFrom1To16BytesUnalignedXmm2)
> -# else
> -	jnz	L(CopyFrom1To16Bytes)
> -# endif
> -
> -	movaps	16(%rsi, %rcx), %xmm3
> -	movdqu	%xmm2, (%rdi, %rcx)
> -	pcmpeqb	%xmm3, %xmm0
> -	pmovmskb %xmm0, %rdx
> -	add	$16, %rcx
> -# ifdef USE_AS_STRNCPY
> -	sub	$16, %r8
> -	jbe	L(CopyFrom1To16BytesCase2OrCase3)
> -# endif
> -	test	%rdx, %rdx
> -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> -	jnz	L(CopyFrom1To16BytesUnalignedXmm3)
> -# else
> -	jnz	L(CopyFrom1To16Bytes)
> -# endif
> -
> -	movdqu	%xmm3, (%rdi, %rcx)
> -	mov	%rsi, %rdx
> -	lea	16(%rsi, %rcx), %rsi
> -	and	$-0x40, %rsi
> -	sub	%rsi, %rdx
> -	sub	%rdx, %rdi
> -# ifdef USE_AS_STRNCPY
> -	lea	128(%r8, %rdx), %r8
> -# endif
> -L(Unaligned64Loop):
> -	movaps	(%rsi), %xmm2
> -	movaps	%xmm2, %xmm4
> -	movaps	16(%rsi), %xmm5
> -	movaps	32(%rsi), %xmm3
> -	movaps	%xmm3, %xmm6
> -	movaps	48(%rsi), %xmm7
> -	pminub	%xmm5, %xmm2
> -	pminub	%xmm7, %xmm3
> -	pminub	%xmm2, %xmm3
> -	pcmpeqb	%xmm0, %xmm3
> -	pmovmskb %xmm3, %rdx
> -# ifdef USE_AS_STRNCPY
> -	sub	$64, %r8
> -	jbe	L(UnalignedLeaveCase2OrCase3)
> -# endif
> -	test	%rdx, %rdx
> -	jnz	L(Unaligned64Leave)
> -
> -L(Unaligned64Loop_start):
> -	add	$64, %rdi
> -	add	$64, %rsi
> -	movdqu	%xmm4, -64(%rdi)
> -	movaps	(%rsi), %xmm2
> -	movdqa	%xmm2, %xmm4
> -	movdqu	%xmm5, -48(%rdi)
> -	movaps	16(%rsi), %xmm5
> -	pminub	%xmm5, %xmm2
> -	movaps	32(%rsi), %xmm3
> -	movdqu	%xmm6, -32(%rdi)
> -	movaps	%xmm3, %xmm6
> -	movdqu	%xmm7, -16(%rdi)
> -	movaps	48(%rsi), %xmm7
> -	pminub	%xmm7, %xmm3
> -	pminub	%xmm2, %xmm3
> -	pcmpeqb	%xmm0, %xmm3
> -	pmovmskb %xmm3, %rdx
> -# ifdef USE_AS_STRNCPY
> -	sub	$64, %r8
> -	jbe	L(UnalignedLeaveCase2OrCase3)
> -# endif
> -	test	%rdx, %rdx
> -	jz	L(Unaligned64Loop_start)
> -
> -L(Unaligned64Leave):
> -	pxor	%xmm1, %xmm1
> -
> -	pcmpeqb	%xmm4, %xmm0
> -	pcmpeqb	%xmm5, %xmm1
> -	pmovmskb %xmm0, %rdx
> -	pmovmskb %xmm1, %rcx
> -	test	%rdx, %rdx
> -	jnz	L(CopyFrom1To16BytesUnaligned_0)
> -	test	%rcx, %rcx
> -	jnz	L(CopyFrom1To16BytesUnaligned_16)
> -
> -	pcmpeqb	%xmm6, %xmm0
> -	pcmpeqb	%xmm7, %xmm1
> -	pmovmskb %xmm0, %rdx
> -	pmovmskb %xmm1, %rcx
> -	test	%rdx, %rdx
> -	jnz	L(CopyFrom1To16BytesUnaligned_32)
> -
> -	bsf	%rcx, %rdx
> -	movdqu	%xmm4, (%rdi)
> -	movdqu	%xmm5, 16(%rdi)
> -	movdqu	%xmm6, 32(%rdi)
> -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> -# ifdef USE_AS_STPCPY
> -	lea	48(%rdi, %rdx), %rax
> -# endif
> -	movdqu	%xmm7, 48(%rdi)
> -	add	$15, %r8
> -	sub	%rdx, %r8
> -	lea	49(%rdi, %rdx), %rdi
> -	jmp	L(StrncpyFillTailWithZero)
> -# else
> -	add	$48, %rsi
> -	add	$48, %rdi
> -	BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4)
> -# endif
> -
> -/* If source address alignment == destination address alignment */
> -
> -L(SourceStringAlignmentLess32):
> -	pxor	%xmm0, %xmm0
> -	movdqu	(%rsi), %xmm1
> -	movdqu	16(%rsi), %xmm2
> -	pcmpeqb	%xmm1, %xmm0
> -	pmovmskb %xmm0, %rdx
> -
> -# ifdef USE_AS_STRNCPY
> -#  if defined USE_AS_STPCPY || defined USE_AS_STRCAT
> -	cmp	$16, %r8
> -#  else
> -	cmp	$17, %r8
> -#  endif
> -	jbe	L(CopyFrom1To16BytesTail1Case2OrCase3)
> -# endif
> -	test	%rdx, %rdx
> -	jnz	L(CopyFrom1To16BytesTail1)
> -
> -	pcmpeqb	%xmm2, %xmm0
> -	movdqu	%xmm1, (%rdi)
> -	pmovmskb %xmm0, %rdx
> -
> -# ifdef USE_AS_STRNCPY
> -#  if defined USE_AS_STPCPY || defined USE_AS_STRCAT
> -	cmp	$32, %r8
> -#  else
> -	cmp	$33, %r8
> -#  endif
> -	jbe	L(CopyFrom1To32Bytes1Case2OrCase3)
> -# endif
> -	test	%rdx, %rdx
> -	jnz	L(CopyFrom1To32Bytes1)
> -
> -	and	$-16, %rsi
> -	and	$15, %rcx
> -	jmp	L(Unalign16Both)
> -
> -/*------End of main part with loops---------------------*/
> -
> -/* Case1 */
> -
> -# if (!defined USE_AS_STRNCPY) || (defined USE_AS_STRCAT)
> -	.p2align 4
> -L(CopyFrom1To16Bytes):
> -	add	%rcx, %rdi
> -	add	%rcx, %rsi
> -	bsf	%rdx, %rdx
> -	BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4)
> -# endif
> -	.p2align 4
> -L(CopyFrom1To16BytesTail):
> -	add	%rcx, %rsi
> -	bsf	%rdx, %rdx
> -	BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4)
> -
> -	.p2align 4
> -L(CopyFrom1To32Bytes1):
> -	add	$16, %rsi
> -	add	$16, %rdi
> -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> -	sub	$16, %r8
> -# endif
> -L(CopyFrom1To16BytesTail1):
> -	bsf	%rdx, %rdx
> -	BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4)
> -
> -	.p2align 4
> -L(CopyFrom1To32Bytes):
> -	bsf	%rdx, %rdx
> -	add	%rcx, %rsi
> -	add	$16, %rdx
> -	sub	%rcx, %rdx
> -	BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4)
> -
> -	.p2align 4
> -L(CopyFrom1To16BytesUnaligned_0):
> -	bsf	%rdx, %rdx
> -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> -# ifdef USE_AS_STPCPY
> -	lea	(%rdi, %rdx), %rax
> -# endif
> -	movdqu	%xmm4, (%rdi)
> -	add	$63, %r8
> -	sub	%rdx, %r8
> -	lea	1(%rdi, %rdx), %rdi
> -	jmp	L(StrncpyFillTailWithZero)
> -# else
> -	BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4)
> -# endif
> -
> -	.p2align 4
> -L(CopyFrom1To16BytesUnaligned_16):
> -	bsf	%rcx, %rdx
> -	movdqu	%xmm4, (%rdi)
> -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> -# ifdef USE_AS_STPCPY
> -	lea	16(%rdi, %rdx), %rax
> -# endif
> -	movdqu	%xmm5, 16(%rdi)
> -	add	$47, %r8
> -	sub	%rdx, %r8
> -	lea	17(%rdi, %rdx), %rdi
> -	jmp	L(StrncpyFillTailWithZero)
> -# else
> -	add	$16, %rsi
> -	add	$16, %rdi
> -	BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4)
> -# endif
> -
> -	.p2align 4
> -L(CopyFrom1To16BytesUnaligned_32):
> -	bsf	%rdx, %rdx
> -	movdqu	%xmm4, (%rdi)
> -	movdqu	%xmm5, 16(%rdi)
> -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> -# ifdef USE_AS_STPCPY
> -	lea	32(%rdi, %rdx), %rax
> -# endif
> -	movdqu	%xmm6, 32(%rdi)
> -	add	$31, %r8
> -	sub	%rdx, %r8
> -	lea	33(%rdi, %rdx), %rdi
> -	jmp	L(StrncpyFillTailWithZero)
> -# else
> -	add	$32, %rsi
> -	add	$32, %rdi
> -	BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4)
> -# endif
> -
> -# ifdef USE_AS_STRNCPY
> -#  ifndef USE_AS_STRCAT
> -	.p2align 4
> -L(CopyFrom1To16BytesUnalignedXmm6):
> -	movdqu	%xmm6, (%rdi, %rcx)
> -	jmp	L(CopyFrom1To16BytesXmmExit)
> -
> -	.p2align 4
> -L(CopyFrom1To16BytesUnalignedXmm5):
> -	movdqu	%xmm5, (%rdi, %rcx)
> -	jmp	L(CopyFrom1To16BytesXmmExit)
> -
> -	.p2align 4
> -L(CopyFrom1To16BytesUnalignedXmm4):
> -	movdqu	%xmm4, (%rdi, %rcx)
> -	jmp	L(CopyFrom1To16BytesXmmExit)
> -
> -	.p2align 4
> -L(CopyFrom1To16BytesUnalignedXmm3):
> -	movdqu	%xmm3, (%rdi, %rcx)
> -	jmp	L(CopyFrom1To16BytesXmmExit)
> -
> -	.p2align 4
> -L(CopyFrom1To16BytesUnalignedXmm1):
> -	movdqu	%xmm1, (%rdi, %rcx)
> -	jmp	L(CopyFrom1To16BytesXmmExit)
> -#  endif
> -
> -	.p2align 4
> -L(CopyFrom1To16BytesExit):
> -	BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4)
> -
> -/* Case2 */
> -
> -	.p2align 4
> -L(CopyFrom1To16BytesCase2):
> -	add	$16, %r8
> -	add	%rcx, %rdi
> -	add	%rcx, %rsi
> -	bsf	%rdx, %rdx
> -	cmp	%r8, %rdx
> -	jb	L(CopyFrom1To16BytesExit)
> -	BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4)
> -
> -	.p2align 4
> -L(CopyFrom1To32BytesCase2):
> -	add	%rcx, %rsi
> -	bsf	%rdx, %rdx
> -	add	$16, %rdx
> -	sub	%rcx, %rdx
> -	cmp	%r8, %rdx
> -	jb	L(CopyFrom1To16BytesExit)
> -	BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4)
> -
> -L(CopyFrom1To16BytesTailCase2):
> -	add	%rcx, %rsi
> -	bsf	%rdx, %rdx
> -	cmp	%r8, %rdx
> -	jb	L(CopyFrom1To16BytesExit)
> -	BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4)
> -
> -L(CopyFrom1To16BytesTail1Case2):
> -	bsf	%rdx, %rdx
> -	cmp	%r8, %rdx
> -	jb	L(CopyFrom1To16BytesExit)
> -	BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4)
> -
> -/* Case2 or Case3,  Case3 */
> -
> -	.p2align 4
> -L(CopyFrom1To16BytesCase2OrCase3):
> -	test	%rdx, %rdx
> -	jnz	L(CopyFrom1To16BytesCase2)
> -L(CopyFrom1To16BytesCase3):
> -	add	$16, %r8
> -	add	%rcx, %rdi
> -	add	%rcx, %rsi
> -	BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4)
> -
> -	.p2align 4
> -L(CopyFrom1To32BytesCase2OrCase3):
> -	test	%rdx, %rdx
> -	jnz	L(CopyFrom1To32BytesCase2)
> -	add	%rcx, %rsi
> -	BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4)
> -
> -	.p2align 4
> -L(CopyFrom1To16BytesTailCase2OrCase3):
> -	test	%rdx, %rdx
> -	jnz	L(CopyFrom1To16BytesTailCase2)
> -	add	%rcx, %rsi
> -	BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4)
> -
> -	.p2align 4
> -L(CopyFrom1To32Bytes1Case2OrCase3):
> -	add	$16, %rdi
> -	add	$16, %rsi
> -	sub	$16, %r8
> -L(CopyFrom1To16BytesTail1Case2OrCase3):
> -	test	%rdx, %rdx
> -	jnz	L(CopyFrom1To16BytesTail1Case2)
> -	BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4)
> -
> -# endif
> -
> -/*------------End labels regarding with copying 1-16 bytes--and 1-32 bytes----*/
> -
> -	.p2align 4
> -L(Exit1):
> -	mov	%dh, (%rdi)
> -# ifdef USE_AS_STPCPY
> -	lea	(%rdi), %rax
> -# endif
> -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> -	sub	$1, %r8
> -	lea	1(%rdi), %rdi
> -	jnz	L(StrncpyFillTailWithZero)
> -# endif
> -	ret
> -
> -	.p2align 4
> -L(Exit2):
> -	mov	(%rsi), %dx
> -	mov	%dx, (%rdi)
> -# ifdef USE_AS_STPCPY
> -	lea	1(%rdi), %rax
> -# endif
> -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> -	sub	$2, %r8
> -	lea	2(%rdi), %rdi
> -	jnz	L(StrncpyFillTailWithZero)
> -# endif
> -	ret
> -
> -	.p2align 4
> -L(Exit3):
> -	mov	(%rsi), %cx
> -	mov	%cx, (%rdi)
> -	mov	%dh, 2(%rdi)
> -# ifdef USE_AS_STPCPY
> -	lea	2(%rdi), %rax
> -# endif
> -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> -	sub	$3, %r8
> -	lea	3(%rdi), %rdi
> -	jnz	L(StrncpyFillTailWithZero)
> -# endif
> -	ret
> -
> -	.p2align 4
> -L(Exit4):
> -	mov	(%rsi), %edx
> -	mov	%edx, (%rdi)
> -# ifdef USE_AS_STPCPY
> -	lea	3(%rdi), %rax
> -# endif
> -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> -	sub	$4, %r8
> -	lea	4(%rdi), %rdi
> -	jnz	L(StrncpyFillTailWithZero)
> -# endif
> -	ret
> -
> -	.p2align 4
> -L(Exit5):
> -	mov	(%rsi), %ecx
> -	mov	%dh, 4(%rdi)
> -	mov	%ecx, (%rdi)
> -# ifdef USE_AS_STPCPY
> -	lea	4(%rdi), %rax
> -# endif
> -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> -	sub	$5, %r8
> -	lea	5(%rdi), %rdi
> -	jnz	L(StrncpyFillTailWithZero)
> -# endif
> -	ret
> -
> -	.p2align 4
> -L(Exit6):
> -	mov	(%rsi), %ecx
> -	mov	4(%rsi), %dx
> -	mov	%ecx, (%rdi)
> -	mov	%dx, 4(%rdi)
> -# ifdef USE_AS_STPCPY
> -	lea	5(%rdi), %rax
> -# endif
> -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> -	sub	$6, %r8
> -	lea	6(%rdi), %rdi
> -	jnz	L(StrncpyFillTailWithZero)
> -# endif
> -	ret
> -
> -	.p2align 4
> -L(Exit7):
> -	mov	(%rsi), %ecx
> -	mov	3(%rsi), %edx
> -	mov	%ecx, (%rdi)
> -	mov	%edx, 3(%rdi)
> -# ifdef USE_AS_STPCPY
> -	lea	6(%rdi), %rax
> -# endif
> -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> -	sub	$7, %r8
> -	lea	7(%rdi), %rdi
> -	jnz	L(StrncpyFillTailWithZero)
> -# endif
> -	ret
> -
> -	.p2align 4
> -L(Exit8):
> -	mov	(%rsi), %rdx
> -	mov	%rdx, (%rdi)
> -# ifdef USE_AS_STPCPY
> -	lea	7(%rdi), %rax
> -# endif
> -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> -	sub	$8, %r8
> -	lea	8(%rdi), %rdi
> -	jnz	L(StrncpyFillTailWithZero)
> -# endif
> -	ret
> -
> -	.p2align 4
> -L(Exit9):
> -	mov	(%rsi), %rcx
> -	mov	%dh, 8(%rdi)
> -	mov	%rcx, (%rdi)
> -# ifdef USE_AS_STPCPY
> -	lea	8(%rdi), %rax
> -# endif
> -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> -	sub	$9, %r8
> -	lea	9(%rdi), %rdi
> -	jnz	L(StrncpyFillTailWithZero)
> -# endif
> -	ret
> -
> -	.p2align 4
> -L(Exit10):
> -	mov	(%rsi), %rcx
> -	mov	8(%rsi), %dx
> -	mov	%rcx, (%rdi)
> -	mov	%dx, 8(%rdi)
> -# ifdef USE_AS_STPCPY
> -	lea	9(%rdi), %rax
> -# endif
> -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> -	sub	$10, %r8
> -	lea	10(%rdi), %rdi
> -	jnz	L(StrncpyFillTailWithZero)
> -# endif
> -	ret
> -
> -	.p2align 4
> -L(Exit11):
> -	mov	(%rsi), %rcx
> -	mov	7(%rsi), %edx
> -	mov	%rcx, (%rdi)
> -	mov	%edx, 7(%rdi)
> -# ifdef USE_AS_STPCPY
> -	lea	10(%rdi), %rax
> -# endif
> -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> -	sub	$11, %r8
> -	lea	11(%rdi), %rdi
> -	jnz	L(StrncpyFillTailWithZero)
> -# endif
> -	ret
> -
> -	.p2align 4
> -L(Exit12):
> -	mov	(%rsi), %rcx
> -	mov	8(%rsi), %edx
> -	mov	%rcx, (%rdi)
> -	mov	%edx, 8(%rdi)
> -# ifdef USE_AS_STPCPY
> -	lea	11(%rdi), %rax
> -# endif
> -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> -	sub	$12, %r8
> -	lea	12(%rdi), %rdi
> -	jnz	L(StrncpyFillTailWithZero)
> -# endif
> -	ret
> -
> -	.p2align 4
> -L(Exit13):
> -	mov	(%rsi), %rcx
> -	mov	5(%rsi), %rdx
> -	mov	%rcx, (%rdi)
> -	mov	%rdx, 5(%rdi)
> -# ifdef USE_AS_STPCPY
> -	lea	12(%rdi), %rax
> -# endif
> -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> -	sub	$13, %r8
> -	lea	13(%rdi), %rdi
> -	jnz	L(StrncpyFillTailWithZero)
> -# endif
> -	ret
> -
> -	.p2align 4
> -L(Exit14):
> -	mov	(%rsi), %rcx
> -	mov	6(%rsi), %rdx
> -	mov	%rcx, (%rdi)
> -	mov	%rdx, 6(%rdi)
> -# ifdef USE_AS_STPCPY
> -	lea	13(%rdi), %rax
> -# endif
> -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> -	sub	$14, %r8
> -	lea	14(%rdi), %rdi
> -	jnz	L(StrncpyFillTailWithZero)
> -# endif
> -	ret
> -
> -	.p2align 4
> -L(Exit15):
> -	mov	(%rsi), %rcx
> -	mov	7(%rsi), %rdx
> -	mov	%rcx, (%rdi)
> -	mov	%rdx, 7(%rdi)
> -# ifdef USE_AS_STPCPY
> -	lea	14(%rdi), %rax
> -# endif
> -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> -	sub	$15, %r8
> -	lea	15(%rdi), %rdi
> -	jnz	L(StrncpyFillTailWithZero)
> -# endif
> -	ret
> -
> -	.p2align 4
> -L(Exit16):
> -	movdqu	(%rsi), %xmm0
> -	movdqu	%xmm0, (%rdi)
> -# ifdef USE_AS_STPCPY
> -	lea	15(%rdi), %rax
> -# endif
> -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> -	sub	$16, %r8
> -	lea	16(%rdi), %rdi
> -	jnz	L(StrncpyFillTailWithZero)
> -# endif
> -	ret
> -
> -	.p2align 4
> -L(Exit17):
> -	movdqu	(%rsi), %xmm0
> -	movdqu	%xmm0, (%rdi)
> -	mov	%dh, 16(%rdi)
> -# ifdef USE_AS_STPCPY
> -	lea	16(%rdi), %rax
> -# endif
> -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> -	sub	$17, %r8
> -	lea	17(%rdi), %rdi
> -	jnz	L(StrncpyFillTailWithZero)
> -# endif
> -	ret
> -
> -	.p2align 4
> -L(Exit18):
> -	movdqu	(%rsi), %xmm0
> -	mov	16(%rsi), %cx
> -	movdqu	%xmm0, (%rdi)
> -	mov	%cx, 16(%rdi)
> -# ifdef USE_AS_STPCPY
> -	lea	17(%rdi), %rax
> -# endif
> -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> -	sub	$18, %r8
> -	lea	18(%rdi), %rdi
> -	jnz	L(StrncpyFillTailWithZero)
> -# endif
> -	ret
> -
> -	.p2align 4
> -L(Exit19):
> -	movdqu	(%rsi), %xmm0
> -	mov	15(%rsi), %ecx
> -	movdqu	%xmm0, (%rdi)
> -	mov	%ecx, 15(%rdi)
> -# ifdef USE_AS_STPCPY
> -	lea	18(%rdi), %rax
> -# endif
> -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> -	sub	$19, %r8
> -	lea	19(%rdi), %rdi
> -	jnz	L(StrncpyFillTailWithZero)
> -# endif
> -	ret
> -
> -	.p2align 4
> -L(Exit20):
> -	movdqu	(%rsi), %xmm0
> -	mov	16(%rsi), %ecx
> -	movdqu	%xmm0, (%rdi)
> -	mov	%ecx, 16(%rdi)
> -# ifdef USE_AS_STPCPY
> -	lea	19(%rdi), %rax
> -# endif
> -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> -	sub	$20, %r8
> -	lea	20(%rdi), %rdi
> -	jnz	L(StrncpyFillTailWithZero)
> -# endif
> -	ret
> -
> -	.p2align 4
> -L(Exit21):
> -	movdqu	(%rsi), %xmm0
> -	mov	16(%rsi), %ecx
> -	movdqu	%xmm0, (%rdi)
> -	mov	%ecx, 16(%rdi)
> -	mov	%dh, 20(%rdi)
> -# ifdef USE_AS_STPCPY
> -	lea	20(%rdi), %rax
> -# endif
> -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> -	sub	$21, %r8
> -	lea	21(%rdi), %rdi
> -	jnz	L(StrncpyFillTailWithZero)
> -# endif
> -	ret
> -
> -	.p2align 4
> -L(Exit22):
> -	movdqu	(%rsi), %xmm0
> -	mov	14(%rsi), %rcx
> -	movdqu	%xmm0, (%rdi)
> -	mov	%rcx, 14(%rdi)
> -# ifdef USE_AS_STPCPY
> -	lea	21(%rdi), %rax
> -# endif
> -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> -	sub	$22, %r8
> -	lea	22(%rdi), %rdi
> -	jnz	L(StrncpyFillTailWithZero)
> -# endif
> -	ret
> -
> -	.p2align 4
> -L(Exit23):
> -	movdqu	(%rsi), %xmm0
> -	mov	15(%rsi), %rcx
> -	movdqu	%xmm0, (%rdi)
> -	mov	%rcx, 15(%rdi)
> -# ifdef USE_AS_STPCPY
> -	lea	22(%rdi), %rax
> -# endif
> -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> -	sub	$23, %r8
> -	lea	23(%rdi), %rdi
> -	jnz	L(StrncpyFillTailWithZero)
> -# endif
> -	ret
> -
> -	.p2align 4
> -L(Exit24):
> -	movdqu	(%rsi), %xmm0
> -	mov	16(%rsi), %rcx
> -	movdqu	%xmm0, (%rdi)
> -	mov	%rcx, 16(%rdi)
> -# ifdef USE_AS_STPCPY
> -	lea	23(%rdi), %rax
> -# endif
> -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> -	sub	$24, %r8
> -	lea	24(%rdi), %rdi
> -	jnz	L(StrncpyFillTailWithZero)
> -# endif
> -	ret
> -
> -	.p2align 4
> -L(Exit25):
> -	movdqu	(%rsi), %xmm0
> -	mov	16(%rsi), %rcx
> -	movdqu	%xmm0, (%rdi)
> -	mov	%rcx, 16(%rdi)
> -	mov	%dh, 24(%rdi)
> -# ifdef USE_AS_STPCPY
> -	lea	24(%rdi), %rax
> -# endif
> -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> -	sub	$25, %r8
> -	lea	25(%rdi), %rdi
> -	jnz	L(StrncpyFillTailWithZero)
> -# endif
> -	ret
> -
> -	.p2align 4
> -L(Exit26):
> -	movdqu	(%rsi), %xmm0
> -	mov	16(%rsi), %rdx
> -	mov	24(%rsi), %cx
> -	movdqu	%xmm0, (%rdi)
> -	mov	%rdx, 16(%rdi)
> -	mov	%cx, 24(%rdi)
> -# ifdef USE_AS_STPCPY
> -	lea	25(%rdi), %rax
> -# endif
> -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> -	sub	$26, %r8
> -	lea	26(%rdi), %rdi
> -	jnz	L(StrncpyFillTailWithZero)
> -# endif
> -	ret
> -
> -	.p2align 4
> -L(Exit27):
> -	movdqu	(%rsi), %xmm0
> -	mov	16(%rsi), %rdx
> -	mov	23(%rsi), %ecx
> -	movdqu	%xmm0, (%rdi)
> -	mov	%rdx, 16(%rdi)
> -	mov	%ecx, 23(%rdi)
> -# ifdef USE_AS_STPCPY
> -	lea	26(%rdi), %rax
> -# endif
> -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> -	sub	$27, %r8
> -	lea	27(%rdi), %rdi
> -	jnz	L(StrncpyFillTailWithZero)
> -# endif
> -	ret
> -
> -	.p2align 4
> -L(Exit28):
> -	movdqu	(%rsi), %xmm0
> -	mov	16(%rsi), %rdx
> -	mov	24(%rsi), %ecx
> -	movdqu	%xmm0, (%rdi)
> -	mov	%rdx, 16(%rdi)
> -	mov	%ecx, 24(%rdi)
> -# ifdef USE_AS_STPCPY
> -	lea	27(%rdi), %rax
> -# endif
> -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> -	sub	$28, %r8
> -	lea	28(%rdi), %rdi
> -	jnz	L(StrncpyFillTailWithZero)
> -# endif
> -	ret
> -
> -	.p2align 4
> -L(Exit29):
> -	movdqu	(%rsi), %xmm0
> -	movdqu	13(%rsi), %xmm2
> -	movdqu	%xmm0, (%rdi)
> -	movdqu	%xmm2, 13(%rdi)
> -# ifdef USE_AS_STPCPY
> -	lea	28(%rdi), %rax
> -# endif
> -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> -	sub	$29, %r8
> -	lea	29(%rdi), %rdi
> -	jnz	L(StrncpyFillTailWithZero)
> -# endif
> -	ret
> -
> -	.p2align 4
> -L(Exit30):
> -	movdqu	(%rsi), %xmm0
> -	movdqu	14(%rsi), %xmm2
> -	movdqu	%xmm0, (%rdi)
> -	movdqu	%xmm2, 14(%rdi)
> -# ifdef USE_AS_STPCPY
> -	lea	29(%rdi), %rax
> -# endif
> -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> -	sub	$30, %r8
> -	lea	30(%rdi), %rdi
> -	jnz	L(StrncpyFillTailWithZero)
> -# endif
> -	ret
> -
> -	.p2align 4
> -L(Exit31):
> -	movdqu	(%rsi), %xmm0
> -	movdqu	15(%rsi), %xmm2
> -	movdqu	%xmm0, (%rdi)
> -	movdqu	%xmm2, 15(%rdi)
> -# ifdef USE_AS_STPCPY
> -	lea	30(%rdi), %rax
> -# endif
> -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> -	sub	$31, %r8
> -	lea	31(%rdi), %rdi
> -	jnz	L(StrncpyFillTailWithZero)
> -# endif
> -	ret
> -
> -	.p2align 4
> -L(Exit32):
> -	movdqu	(%rsi), %xmm0
> -	movdqu	16(%rsi), %xmm2
> -	movdqu	%xmm0, (%rdi)
> -	movdqu	%xmm2, 16(%rdi)
> -# ifdef USE_AS_STPCPY
> -	lea	31(%rdi), %rax
> -# endif
> -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> -	sub	$32, %r8
> -	lea	32(%rdi), %rdi
> -	jnz	L(StrncpyFillTailWithZero)
> -# endif
> -	ret
> -
> -# ifdef USE_AS_STRNCPY
> -
> -	.p2align 4
> -L(StrncpyExit0):
> -#  ifdef USE_AS_STPCPY
> -	mov	%rdi, %rax
> -#  endif
> -#  ifdef USE_AS_STRCAT
> -	xor	%ch, %ch
> -	movb	%ch, (%rdi)
> -#  endif
> -	ret
> -
> -	.p2align 4
> -L(StrncpyExit1):
> -	mov	(%rsi), %dl
> -	mov	%dl, (%rdi)
> -#  ifdef USE_AS_STPCPY
> -	lea	1(%rdi), %rax
> -#  endif
> -#  ifdef USE_AS_STRCAT
> -	xor	%ch, %ch
> -	movb	%ch, 1(%rdi)
> -#  endif
> -	ret
> -
> -	.p2align 4
> -L(StrncpyExit2):
> -	mov	(%rsi), %dx
> -	mov	%dx, (%rdi)
> -#  ifdef USE_AS_STPCPY
> -	lea	2(%rdi), %rax
> -#  endif
> -#  ifdef USE_AS_STRCAT
> -	xor	%ch, %ch
> -	movb	%ch, 2(%rdi)
> -#  endif
> -	ret
> -
> -	.p2align 4
> -L(StrncpyExit3):
> -	mov	(%rsi), %cx
> -	mov	2(%rsi), %dl
> -	mov	%cx, (%rdi)
> -	mov	%dl, 2(%rdi)
> -#  ifdef USE_AS_STPCPY
> -	lea	3(%rdi), %rax
> -#  endif
> -#  ifdef USE_AS_STRCAT
> -	xor	%ch, %ch
> -	movb	%ch, 3(%rdi)
> -#  endif
> -	ret
> -
> -	.p2align 4
> -L(StrncpyExit4):
> -	mov	(%rsi), %edx
> -	mov	%edx, (%rdi)
> -#  ifdef USE_AS_STPCPY
> -	lea	4(%rdi), %rax
> -#  endif
> -#  ifdef USE_AS_STRCAT
> -	xor	%ch, %ch
> -	movb	%ch, 4(%rdi)
> -#  endif
> -	ret
> -
> -	.p2align 4
> -L(StrncpyExit5):
> -	mov	(%rsi), %ecx
> -	mov	4(%rsi), %dl
> -	mov	%ecx, (%rdi)
> -	mov	%dl, 4(%rdi)
> -#  ifdef USE_AS_STPCPY
> -	lea	5(%rdi), %rax
> -#  endif
> -#  ifdef USE_AS_STRCAT
> -	xor	%ch, %ch
> -	movb	%ch, 5(%rdi)
> -#  endif
> -	ret
> -
> -	.p2align 4
> -L(StrncpyExit6):
> -	mov	(%rsi), %ecx
> -	mov	4(%rsi), %dx
> -	mov	%ecx, (%rdi)
> -	mov	%dx, 4(%rdi)
> -#  ifdef USE_AS_STPCPY
> -	lea	6(%rdi), %rax
> -#  endif
> -#  ifdef USE_AS_STRCAT
> -	xor	%ch, %ch
> -	movb	%ch, 6(%rdi)
> -#  endif
> -	ret
> -
> -	.p2align 4
> -L(StrncpyExit7):
> -	mov	(%rsi), %ecx
> -	mov	3(%rsi), %edx
> -	mov	%ecx, (%rdi)
> -	mov	%edx, 3(%rdi)
> -#  ifdef USE_AS_STPCPY
> -	lea	7(%rdi), %rax
> -#  endif
> -#  ifdef USE_AS_STRCAT
> -	xor	%ch, %ch
> -	movb	%ch, 7(%rdi)
> -#  endif
> -	ret
> -
> -	.p2align 4
> -L(StrncpyExit8):
> -	mov	(%rsi), %rdx
> -	mov	%rdx, (%rdi)
> -#  ifdef USE_AS_STPCPY
> -	lea	8(%rdi), %rax
> -#  endif
> -#  ifdef USE_AS_STRCAT
> -	xor	%ch, %ch
> -	movb	%ch, 8(%rdi)
> -#  endif
> -	ret
> -
> -	.p2align 4
> -L(StrncpyExit9):
> -	mov	(%rsi), %rcx
> -	mov	8(%rsi), %dl
> -	mov	%rcx, (%rdi)
> -	mov	%dl, 8(%rdi)
> -#  ifdef USE_AS_STPCPY
> -	lea	9(%rdi), %rax
> -#  endif
> -#  ifdef USE_AS_STRCAT
> -	xor	%ch, %ch
> -	movb	%ch, 9(%rdi)
> -#  endif
> -	ret
> -
> -	.p2align 4
> -L(StrncpyExit10):
> -	mov	(%rsi), %rcx
> -	mov	8(%rsi), %dx
> -	mov	%rcx, (%rdi)
> -	mov	%dx, 8(%rdi)
> -#  ifdef USE_AS_STPCPY
> -	lea	10(%rdi), %rax
> -#  endif
> -#  ifdef USE_AS_STRCAT
> -	xor	%ch, %ch
> -	movb	%ch, 10(%rdi)
> -#  endif
> -	ret
> -
> -	.p2align 4
> -L(StrncpyExit11):
> -	mov	(%rsi), %rcx
> -	mov	7(%rsi), %edx
> -	mov	%rcx, (%rdi)
> -	mov	%edx, 7(%rdi)
> -#  ifdef USE_AS_STPCPY
> -	lea	11(%rdi), %rax
> -#  endif
> -#  ifdef USE_AS_STRCAT
> -	xor	%ch, %ch
> -	movb	%ch, 11(%rdi)
> -#  endif
> -	ret
> -
> -	.p2align 4
> -L(StrncpyExit12):
> -	mov	(%rsi), %rcx
> -	mov	8(%rsi), %edx
> -	mov	%rcx, (%rdi)
> -	mov	%edx, 8(%rdi)
> -#  ifdef USE_AS_STPCPY
> -	lea	12(%rdi), %rax
> -#  endif
> -#  ifdef USE_AS_STRCAT
> -	xor	%ch, %ch
> -	movb	%ch, 12(%rdi)
> -#  endif
> -	ret
> -
> -	.p2align 4
> -L(StrncpyExit13):
> -	mov	(%rsi), %rcx
> -	mov	5(%rsi), %rdx
> -	mov	%rcx, (%rdi)
> -	mov	%rdx, 5(%rdi)
> -#  ifdef USE_AS_STPCPY
> -	lea	13(%rdi), %rax
> -#  endif
> -#  ifdef USE_AS_STRCAT
> -	xor	%ch, %ch
> -	movb	%ch, 13(%rdi)
> -#  endif
> -	ret
> -
> -	.p2align 4
> -L(StrncpyExit14):
> -	mov	(%rsi), %rcx
> -	mov	6(%rsi), %rdx
> -	mov	%rcx, (%rdi)
> -	mov	%rdx, 6(%rdi)
> -#  ifdef USE_AS_STPCPY
> -	lea	14(%rdi), %rax
> -#  endif
> -#  ifdef USE_AS_STRCAT
> -	xor	%ch, %ch
> -	movb	%ch, 14(%rdi)
> -#  endif
> -	ret
> -
> -	.p2align 4
> -L(StrncpyExit15):
> -	mov	(%rsi), %rcx
> -	mov	7(%rsi), %rdx
> -	mov	%rcx, (%rdi)
> -	mov	%rdx, 7(%rdi)
> -#  ifdef USE_AS_STPCPY
> -	lea	15(%rdi), %rax
> -#  endif
> -#  ifdef USE_AS_STRCAT
> -	xor	%ch, %ch
> -	movb	%ch, 15(%rdi)
> -#  endif
> -	ret
> -
> -	.p2align 4
> -L(StrncpyExit16):
> -	movdqu	(%rsi), %xmm0
> -	movdqu	%xmm0, (%rdi)
> -#  ifdef USE_AS_STPCPY
> -	lea	16(%rdi), %rax
> -#  endif
> -#  ifdef USE_AS_STRCAT
> -	xor	%ch, %ch
> -	movb	%ch, 16(%rdi)
> -#  endif
> -	ret
> -
> -	.p2align 4
> -L(StrncpyExit17):
> -	movdqu	(%rsi), %xmm0
> -	mov	16(%rsi), %cl
> -	movdqu	%xmm0, (%rdi)
> -	mov	%cl, 16(%rdi)
> -#  ifdef USE_AS_STPCPY
> -	lea	17(%rdi), %rax
> -#  endif
> -#  ifdef USE_AS_STRCAT
> -	xor	%ch, %ch
> -	movb	%ch, 17(%rdi)
> -#  endif
> -	ret
> -
> -	.p2align 4
> -L(StrncpyExit18):
> -	movdqu	(%rsi), %xmm0
> -	mov	16(%rsi), %cx
> -	movdqu	%xmm0, (%rdi)
> -	mov	%cx, 16(%rdi)
> -#  ifdef USE_AS_STPCPY
> -	lea	18(%rdi), %rax
> -#  endif
> -#  ifdef USE_AS_STRCAT
> -	xor	%ch, %ch
> -	movb	%ch, 18(%rdi)
> -#  endif
> -	ret
> -
> -	.p2align 4
> -L(StrncpyExit19):
> -	movdqu	(%rsi), %xmm0
> -	mov	15(%rsi), %ecx
> -	movdqu	%xmm0, (%rdi)
> -	mov	%ecx, 15(%rdi)
> -#  ifdef USE_AS_STPCPY
> -	lea	19(%rdi), %rax
> -#  endif
> -#  ifdef USE_AS_STRCAT
> -	xor	%ch, %ch
> -	movb	%ch, 19(%rdi)
> -#  endif
> -	ret
> -
> -	.p2align 4
> -L(StrncpyExit20):
> -	movdqu	(%rsi), %xmm0
> -	mov	16(%rsi), %ecx
> -	movdqu	%xmm0, (%rdi)
> -	mov	%ecx, 16(%rdi)
> -#  ifdef USE_AS_STPCPY
> -	lea	20(%rdi), %rax
> -#  endif
> -#  ifdef USE_AS_STRCAT
> -	xor	%ch, %ch
> -	movb	%ch, 20(%rdi)
> -#  endif
> -	ret
> -
> -	.p2align 4
> -L(StrncpyExit21):
> -	movdqu	(%rsi), %xmm0
> -	mov	16(%rsi), %ecx
> -	mov	20(%rsi), %dl
> -	movdqu	%xmm0, (%rdi)
> -	mov	%ecx, 16(%rdi)
> -	mov	%dl, 20(%rdi)
> -#  ifdef USE_AS_STPCPY
> -	lea	21(%rdi), %rax
> -#  endif
> -#  ifdef USE_AS_STRCAT
> -	xor	%ch, %ch
> -	movb	%ch, 21(%rdi)
> -#  endif
> -	ret
> -
> -	.p2align 4
> -L(StrncpyExit22):
> -	movdqu	(%rsi), %xmm0
> -	mov	14(%rsi), %rcx
> -	movdqu	%xmm0, (%rdi)
> -	mov	%rcx, 14(%rdi)
> -#  ifdef USE_AS_STPCPY
> -	lea	22(%rdi), %rax
> -#  endif
> -#  ifdef USE_AS_STRCAT
> -	xor	%ch, %ch
> -	movb	%ch, 22(%rdi)
> -#  endif
> -	ret
> -
> -	.p2align 4
> -L(StrncpyExit23):
> -	movdqu	(%rsi), %xmm0
> -	mov	15(%rsi), %rcx
> -	movdqu	%xmm0, (%rdi)
> -	mov	%rcx, 15(%rdi)
> -#  ifdef USE_AS_STPCPY
> -	lea	23(%rdi), %rax
> -#  endif
> -#  ifdef USE_AS_STRCAT
> -	xor	%ch, %ch
> -	movb	%ch, 23(%rdi)
> -#  endif
> -	ret
> -
> -	.p2align 4
> -L(StrncpyExit24):
> -	movdqu	(%rsi), %xmm0
> -	mov	16(%rsi), %rcx
> -	movdqu	%xmm0, (%rdi)
> -	mov	%rcx, 16(%rdi)
> -#  ifdef USE_AS_STPCPY
> -	lea	24(%rdi), %rax
> -#  endif
> -#  ifdef USE_AS_STRCAT
> -	xor	%ch, %ch
> -	movb	%ch, 24(%rdi)
> -#  endif
> -	ret
> -
> -	.p2align 4
> -L(StrncpyExit25):
> -	movdqu	(%rsi), %xmm0
> -	mov	16(%rsi), %rdx
> -	mov	24(%rsi), %cl
> -	movdqu	%xmm0, (%rdi)
> -	mov	%rdx, 16(%rdi)
> -	mov	%cl, 24(%rdi)
> -#  ifdef USE_AS_STPCPY
> -	lea	25(%rdi), %rax
> -#  endif
> -#  ifdef USE_AS_STRCAT
> -	xor	%ch, %ch
> -	movb	%ch, 25(%rdi)
> -#  endif
> -	ret
> -
> -	.p2align 4
> -L(StrncpyExit26):
> -	movdqu	(%rsi), %xmm0
> -	mov	16(%rsi), %rdx
> -	mov	24(%rsi), %cx
> -	movdqu	%xmm0, (%rdi)
> -	mov	%rdx, 16(%rdi)
> -	mov	%cx, 24(%rdi)
> -#  ifdef USE_AS_STPCPY
> -	lea	26(%rdi), %rax
> -#  endif
> -#  ifdef USE_AS_STRCAT
> -	xor	%ch, %ch
> -	movb	%ch, 26(%rdi)
> -#  endif
> -	ret
> -
> -	.p2align 4
> -L(StrncpyExit27):
> -	movdqu	(%rsi), %xmm0
> -	mov	16(%rsi), %rdx
> -	mov	23(%rsi), %ecx
> -	movdqu	%xmm0, (%rdi)
> -	mov	%rdx, 16(%rdi)
> -	mov	%ecx, 23(%rdi)
> -#  ifdef USE_AS_STPCPY
> -	lea	27(%rdi), %rax
> -#  endif
> -#  ifdef USE_AS_STRCAT
> -	xor	%ch, %ch
> -	movb	%ch, 27(%rdi)
> -#  endif
> -	ret
> -
> -	.p2align 4
> -L(StrncpyExit28):
> -	movdqu	(%rsi), %xmm0
> -	mov	16(%rsi), %rdx
> -	mov	24(%rsi), %ecx
> -	movdqu	%xmm0, (%rdi)
> -	mov	%rdx, 16(%rdi)
> -	mov	%ecx, 24(%rdi)
> -#  ifdef USE_AS_STPCPY
> -	lea	28(%rdi), %rax
> -#  endif
> -#  ifdef USE_AS_STRCAT
> -	xor	%ch, %ch
> -	movb	%ch, 28(%rdi)
> -#  endif
> -	ret
> -
> -	.p2align 4
> -L(StrncpyExit29):
> -	movdqu	(%rsi), %xmm0
> -	movdqu	13(%rsi), %xmm2
> -	movdqu	%xmm0, (%rdi)
> -	movdqu	%xmm2, 13(%rdi)
> -#  ifdef USE_AS_STPCPY
> -	lea	29(%rdi), %rax
> -#  endif
> -#  ifdef USE_AS_STRCAT
> -	xor	%ch, %ch
> -	movb	%ch, 29(%rdi)
> -#  endif
> -	ret
> -
> -	.p2align 4
> -L(StrncpyExit30):
> -	movdqu	(%rsi), %xmm0
> -	movdqu	14(%rsi), %xmm2
> -	movdqu	%xmm0, (%rdi)
> -	movdqu	%xmm2, 14(%rdi)
> -#  ifdef USE_AS_STPCPY
> -	lea	30(%rdi), %rax
> -#  endif
> -#  ifdef USE_AS_STRCAT
> -	xor	%ch, %ch
> -	movb	%ch, 30(%rdi)
> -#  endif
> -	ret
> -
> -	.p2align 4
> -L(StrncpyExit31):
> -	movdqu	(%rsi), %xmm0
> -	movdqu	15(%rsi), %xmm2
> -	movdqu	%xmm0, (%rdi)
> -	movdqu	%xmm2, 15(%rdi)
> -#  ifdef USE_AS_STPCPY
> -	lea	31(%rdi), %rax
> -#  endif
> -#  ifdef USE_AS_STRCAT
> -	xor	%ch, %ch
> -	movb	%ch, 31(%rdi)
> -#  endif
> -	ret
> -
> -	.p2align 4
> -L(StrncpyExit32):
> -	movdqu	(%rsi), %xmm0
> -	movdqu	16(%rsi), %xmm2
> -	movdqu	%xmm0, (%rdi)
> -	movdqu	%xmm2, 16(%rdi)
> -#  ifdef USE_AS_STPCPY
> -	lea	32(%rdi), %rax
> -#  endif
> -#  ifdef USE_AS_STRCAT
> -	xor	%ch, %ch
> -	movb	%ch, 32(%rdi)
> -#  endif
> -	ret
> -
> -	.p2align 4
> -L(StrncpyExit33):
> -	movdqu	(%rsi), %xmm0
> -	movdqu	16(%rsi), %xmm2
> -	mov	32(%rsi), %cl
> -	movdqu	%xmm0, (%rdi)
> -	movdqu	%xmm2, 16(%rdi)
> -	mov	%cl, 32(%rdi)
> -#  ifdef USE_AS_STRCAT
> -	xor	%ch, %ch
> -	movb	%ch, 33(%rdi)
> -#  endif
> -	ret
> -
> -#  ifndef USE_AS_STRCAT
> -
> -	.p2align 4
> -L(Fill0):
> -	ret
> -
> -	.p2align 4
> -L(Fill1):
> -	mov	%dl, (%rdi)
> -	ret
> -
> -	.p2align 4
> -L(Fill2):
> -	mov	%dx, (%rdi)
> -	ret
> -
> -	.p2align 4
> -L(Fill3):
> -	mov	%edx, -1(%rdi)
> -	ret
> -
> -	.p2align 4
> -L(Fill4):
> -	mov	%edx, (%rdi)
> -	ret
> -
> -	.p2align 4
> -L(Fill5):
> -	mov	%edx, (%rdi)
> -	mov	%dl, 4(%rdi)
> -	ret
> -
> -	.p2align 4
> -L(Fill6):
> -	mov	%edx, (%rdi)
> -	mov	%dx, 4(%rdi)
> -	ret
> -
> -	.p2align 4
> -L(Fill7):
> -	mov	%rdx, -1(%rdi)
> -	ret
> -
> -	.p2align 4
> -L(Fill8):
> -	mov	%rdx, (%rdi)
> -	ret
> -
> -	.p2align 4
> -L(Fill9):
> -	mov	%rdx, (%rdi)
> -	mov	%dl, 8(%rdi)
> -	ret
> -
> -	.p2align 4
> -L(Fill10):
> -	mov	%rdx, (%rdi)
> -	mov	%dx, 8(%rdi)
> -	ret
> -
> -	.p2align 4
> -L(Fill11):
> -	mov	%rdx, (%rdi)
> -	mov	%edx, 7(%rdi)
> -	ret
> -
> -	.p2align 4
> -L(Fill12):
> -	mov	%rdx, (%rdi)
> -	mov	%edx, 8(%rdi)
> -	ret
> -
> -	.p2align 4
> -L(Fill13):
> -	mov	%rdx, (%rdi)
> -	mov	%rdx, 5(%rdi)
> -	ret
> -
> -	.p2align 4
> -L(Fill14):
> -	mov	%rdx, (%rdi)
> -	mov	%rdx, 6(%rdi)
> -	ret
> -
> -	.p2align 4
> -L(Fill15):
> -	movdqu	%xmm0, -1(%rdi)
> -	ret
> -
> -	.p2align 4
> -L(Fill16):
> -	movdqu	%xmm0, (%rdi)
> -	ret
> -
> -	.p2align 4
> -L(CopyFrom1To16BytesUnalignedXmm2):
> -	movdqu	%xmm2, (%rdi, %rcx)
> -
> -	.p2align 4
> -L(CopyFrom1To16BytesXmmExit):
> -	bsf	%rdx, %rdx
> -	add	$15, %r8
> -	add	%rcx, %rdi
> -#   ifdef USE_AS_STPCPY
> -	lea	(%rdi, %rdx), %rax
> -#   endif
> -	sub	%rdx, %r8
> -	lea	1(%rdi, %rdx), %rdi
> -
> -	.p2align 4
> -L(StrncpyFillTailWithZero):
> -	pxor	%xmm0, %xmm0
> -	xor	%rdx, %rdx
> -	sub	$16, %r8
> -	jbe	L(StrncpyFillExit)
> -
> -	movdqu	%xmm0, (%rdi)
> -	add	$16, %rdi
> -
> -	mov	%rdi, %rsi
> -	and	$0xf, %rsi
> -	sub	%rsi, %rdi
> -	add	%rsi, %r8
> -	sub	$64, %r8
> -	jb	L(StrncpyFillLess64)
> -
> -L(StrncpyFillLoopMovdqa):
> -	movdqa	%xmm0, (%rdi)
> -	movdqa	%xmm0, 16(%rdi)
> -	movdqa	%xmm0, 32(%rdi)
> -	movdqa	%xmm0, 48(%rdi)
> -	add	$64, %rdi
> -	sub	$64, %r8
> -	jae	L(StrncpyFillLoopMovdqa)
> -
> -L(StrncpyFillLess64):
> -	add	$32, %r8
> -	jl	L(StrncpyFillLess32)
> -	movdqa	%xmm0, (%rdi)
> -	movdqa	%xmm0, 16(%rdi)
> -	add	$32, %rdi
> -	sub	$16, %r8
> -	jl	L(StrncpyFillExit)
> -	movdqa	%xmm0, (%rdi)
> -	add	$16, %rdi
> -	BRANCH_TO_JMPTBL_ENTRY (L(FillTable), %r8, 4)
> -
> -L(StrncpyFillLess32):
> -	add	$16, %r8
> -	jl	L(StrncpyFillExit)
> -	movdqa	%xmm0, (%rdi)
> -	add	$16, %rdi
> -	BRANCH_TO_JMPTBL_ENTRY (L(FillTable), %r8, 4)
> -
> -L(StrncpyFillExit):
> -	add	$16, %r8
> -	BRANCH_TO_JMPTBL_ENTRY (L(FillTable), %r8, 4)
> -
> -/* end of ifndef USE_AS_STRCAT */
> -#  endif
> -
> -	.p2align 4
> -L(UnalignedLeaveCase2OrCase3):
> -	test	%rdx, %rdx
> -	jnz	L(Unaligned64LeaveCase2)
> -L(Unaligned64LeaveCase3):
> -	lea	64(%r8), %rcx
> -	and	$-16, %rcx
> -	add	$48, %r8
> -	jl	L(CopyFrom1To16BytesCase3)
> -	movdqu	%xmm4, (%rdi)
> -	sub	$16, %r8
> -	jb	L(CopyFrom1To16BytesCase3)
> -	movdqu	%xmm5, 16(%rdi)
> -	sub	$16, %r8
> -	jb	L(CopyFrom1To16BytesCase3)
> -	movdqu	%xmm6, 32(%rdi)
> -	sub	$16, %r8
> -	jb	L(CopyFrom1To16BytesCase3)
> -	movdqu	%xmm7, 48(%rdi)
> -#  ifdef USE_AS_STPCPY
> -	lea	64(%rdi), %rax
> -#  endif
> -#  ifdef USE_AS_STRCAT
> -	xor	%ch, %ch
> -	movb	%ch, 64(%rdi)
> -#  endif
> -	ret
> -
> -	.p2align 4
> -L(Unaligned64LeaveCase2):
> -	xor	%rcx, %rcx
> -	pcmpeqb	%xmm4, %xmm0
> -	pmovmskb %xmm0, %rdx
> -	add	$48, %r8
> -	jle	L(CopyFrom1To16BytesCase2OrCase3)
> -	test	%rdx, %rdx
> -#  ifndef USE_AS_STRCAT
> -	jnz	L(CopyFrom1To16BytesUnalignedXmm4)
> -#  else
> -	jnz	L(CopyFrom1To16Bytes)
> -#  endif
> -	pcmpeqb	%xmm5, %xmm0
> -	pmovmskb %xmm0, %rdx
> -	movdqu	%xmm4, (%rdi)
> -	add	$16, %rcx
> -	sub	$16, %r8
> -	jbe	L(CopyFrom1To16BytesCase2OrCase3)
> -	test	%rdx, %rdx
> -#  ifndef USE_AS_STRCAT
> -	jnz	L(CopyFrom1To16BytesUnalignedXmm5)
> -#  else
> -	jnz	L(CopyFrom1To16Bytes)
> -#  endif
> -
> -	pcmpeqb	%xmm6, %xmm0
> -	pmovmskb %xmm0, %rdx
> -	movdqu	%xmm5, 16(%rdi)
> -	add	$16, %rcx
> -	sub	$16, %r8
> -	jbe	L(CopyFrom1To16BytesCase2OrCase3)
> -	test	%rdx, %rdx
> -#  ifndef USE_AS_STRCAT
> -	jnz	L(CopyFrom1To16BytesUnalignedXmm6)
> -#  else
> -	jnz	L(CopyFrom1To16Bytes)
> -#  endif
> -
> -	pcmpeqb	%xmm7, %xmm0
> -	pmovmskb %xmm0, %rdx
> -	movdqu	%xmm6, 32(%rdi)
> -	lea	16(%rdi, %rcx), %rdi
> -	lea	16(%rsi, %rcx), %rsi
> -	bsf	%rdx, %rdx
> -	cmp	%r8, %rdx
> -	jb	L(CopyFrom1To16BytesExit)
> -	BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4)
> -
> -	.p2align 4
> -L(ExitZero):
> -#  ifndef USE_AS_STRCAT
> -	mov	%rdi, %rax
> -#  endif
> -	ret
> -
> -# endif
> -
> -# ifndef USE_AS_STRCAT
> -END (STRCPY)
> -# else
> -END (STRCAT)
> -# endif
> -	.p2align 4
> -	.section .rodata
> -L(ExitTable):
> -	.int	JMPTBL(L(Exit1), L(ExitTable))
> -	.int	JMPTBL(L(Exit2), L(ExitTable))
> -	.int	JMPTBL(L(Exit3), L(ExitTable))
> -	.int	JMPTBL(L(Exit4), L(ExitTable))
> -	.int	JMPTBL(L(Exit5), L(ExitTable))
> -	.int	JMPTBL(L(Exit6), L(ExitTable))
> -	.int	JMPTBL(L(Exit7), L(ExitTable))
> -	.int	JMPTBL(L(Exit8), L(ExitTable))
> -	.int	JMPTBL(L(Exit9), L(ExitTable))
> -	.int	JMPTBL(L(Exit10), L(ExitTable))
> -	.int	JMPTBL(L(Exit11), L(ExitTable))
> -	.int	JMPTBL(L(Exit12), L(ExitTable))
> -	.int	JMPTBL(L(Exit13), L(ExitTable))
> -	.int	JMPTBL(L(Exit14), L(ExitTable))
> -	.int	JMPTBL(L(Exit15), L(ExitTable))
> -	.int	JMPTBL(L(Exit16), L(ExitTable))
> -	.int	JMPTBL(L(Exit17), L(ExitTable))
> -	.int	JMPTBL(L(Exit18), L(ExitTable))
> -	.int	JMPTBL(L(Exit19), L(ExitTable))
> -	.int	JMPTBL(L(Exit20), L(ExitTable))
> -	.int	JMPTBL(L(Exit21), L(ExitTable))
> -	.int	JMPTBL(L(Exit22), L(ExitTable))
> -	.int    JMPTBL(L(Exit23), L(ExitTable))
> -	.int	JMPTBL(L(Exit24), L(ExitTable))
> -	.int	JMPTBL(L(Exit25), L(ExitTable))
> -	.int	JMPTBL(L(Exit26), L(ExitTable))
> -	.int	JMPTBL(L(Exit27), L(ExitTable))
> -	.int	JMPTBL(L(Exit28), L(ExitTable))
> -	.int	JMPTBL(L(Exit29), L(ExitTable))
> -	.int	JMPTBL(L(Exit30), L(ExitTable))
> -	.int	JMPTBL(L(Exit31), L(ExitTable))
> -	.int	JMPTBL(L(Exit32), L(ExitTable))
> -# ifdef USE_AS_STRNCPY
> -L(ExitStrncpyTable):
> -	.int	JMPTBL(L(StrncpyExit0), L(ExitStrncpyTable))
> -	.int	JMPTBL(L(StrncpyExit1), L(ExitStrncpyTable))
> -	.int	JMPTBL(L(StrncpyExit2), L(ExitStrncpyTable))
> -	.int	JMPTBL(L(StrncpyExit3), L(ExitStrncpyTable))
> -	.int	JMPTBL(L(StrncpyExit4), L(ExitStrncpyTable))
> -	.int	JMPTBL(L(StrncpyExit5), L(ExitStrncpyTable))
> -	.int	JMPTBL(L(StrncpyExit6), L(ExitStrncpyTable))
> -	.int	JMPTBL(L(StrncpyExit7), L(ExitStrncpyTable))
> -	.int	JMPTBL(L(StrncpyExit8), L(ExitStrncpyTable))
> -	.int	JMPTBL(L(StrncpyExit9), L(ExitStrncpyTable))
> -	.int	JMPTBL(L(StrncpyExit10), L(ExitStrncpyTable))
> -	.int	JMPTBL(L(StrncpyExit11), L(ExitStrncpyTable))
> -	.int	JMPTBL(L(StrncpyExit12), L(ExitStrncpyTable))
> -	.int	JMPTBL(L(StrncpyExit13), L(ExitStrncpyTable))
> -	.int	JMPTBL(L(StrncpyExit14), L(ExitStrncpyTable))
> -	.int	JMPTBL(L(StrncpyExit15), L(ExitStrncpyTable))
> -	.int	JMPTBL(L(StrncpyExit16), L(ExitStrncpyTable))
> -	.int	JMPTBL(L(StrncpyExit17), L(ExitStrncpyTable))
> -	.int	JMPTBL(L(StrncpyExit18), L(ExitStrncpyTable))
> -	.int	JMPTBL(L(StrncpyExit19), L(ExitStrncpyTable))
> -	.int	JMPTBL(L(StrncpyExit20), L(ExitStrncpyTable))
> -	.int	JMPTBL(L(StrncpyExit21), L(ExitStrncpyTable))
> -	.int	JMPTBL(L(StrncpyExit22), L(ExitStrncpyTable))
> -	.int    JMPTBL(L(StrncpyExit23), L(ExitStrncpyTable))
> -	.int	JMPTBL(L(StrncpyExit24), L(ExitStrncpyTable))
> -	.int	JMPTBL(L(StrncpyExit25), L(ExitStrncpyTable))
> -	.int	JMPTBL(L(StrncpyExit26), L(ExitStrncpyTable))
> -	.int	JMPTBL(L(StrncpyExit27), L(ExitStrncpyTable))
> -	.int	JMPTBL(L(StrncpyExit28), L(ExitStrncpyTable))
> -	.int	JMPTBL(L(StrncpyExit29), L(ExitStrncpyTable))
> -	.int	JMPTBL(L(StrncpyExit30), L(ExitStrncpyTable))
> -	.int	JMPTBL(L(StrncpyExit31), L(ExitStrncpyTable))
> -	.int	JMPTBL(L(StrncpyExit32), L(ExitStrncpyTable))
> -	.int	JMPTBL(L(StrncpyExit33), L(ExitStrncpyTable))
> -#  ifndef USE_AS_STRCAT
> -	.p2align 4
> -L(FillTable):
> -	.int	JMPTBL(L(Fill0), L(FillTable))
> -	.int	JMPTBL(L(Fill1), L(FillTable))
> -	.int	JMPTBL(L(Fill2), L(FillTable))
> -	.int	JMPTBL(L(Fill3), L(FillTable))
> -	.int	JMPTBL(L(Fill4), L(FillTable))
> -	.int	JMPTBL(L(Fill5), L(FillTable))
> -	.int	JMPTBL(L(Fill6), L(FillTable))
> -	.int	JMPTBL(L(Fill7), L(FillTable))
> -	.int	JMPTBL(L(Fill8), L(FillTable))
> -	.int	JMPTBL(L(Fill9), L(FillTable))
> -	.int	JMPTBL(L(Fill10), L(FillTable))
> -	.int	JMPTBL(L(Fill11), L(FillTable))
> -	.int	JMPTBL(L(Fill12), L(FillTable))
> -	.int	JMPTBL(L(Fill13), L(FillTable))
> -	.int	JMPTBL(L(Fill14), L(FillTable))
> -	.int	JMPTBL(L(Fill15), L(FillTable))
> -	.int	JMPTBL(L(Fill16), L(FillTable))
> -#  endif
> -# endif
> -#endif
> +#define AS_STRCPY
> +#define STPCPY __strcpy_sse2_unaligned
> +#include "stpcpy-sse2-unaligned.S"
> diff --git a/sysdeps/x86_64/multiarch/strcpy.S b/sysdeps/x86_64/multiarch/strcpy.S
> index 9464ee8..92be04c 100644
> --- a/sysdeps/x86_64/multiarch/strcpy.S
> +++ b/sysdeps/x86_64/multiarch/strcpy.S
> @@ -28,31 +28,18 @@
>  #endif
>  
>  #ifdef USE_AS_STPCPY
> -# ifdef USE_AS_STRNCPY
> -#  define STRCPY_SSSE3		__stpncpy_ssse3
> -#  define STRCPY_SSE2		__stpncpy_sse2
> -#  define STRCPY_SSE2_UNALIGNED __stpncpy_sse2_unaligned
> -#  define __GI_STRCPY		__GI_stpncpy
> -#  define __GI___STRCPY		__GI___stpncpy
> -# else
>  #  define STRCPY_SSSE3		__stpcpy_ssse3
>  #  define STRCPY_SSE2		__stpcpy_sse2
> +#  define STRCPY_AVX2		__stpcpy_avx2
>  #  define STRCPY_SSE2_UNALIGNED	__stpcpy_sse2_unaligned
>  #  define __GI_STRCPY		__GI_stpcpy
>  #  define __GI___STRCPY		__GI___stpcpy
> -# endif
>  #else
> -# ifdef USE_AS_STRNCPY
> -#  define STRCPY_SSSE3		__strncpy_ssse3
> -#  define STRCPY_SSE2		__strncpy_sse2
> -#  define STRCPY_SSE2_UNALIGNED	__strncpy_sse2_unaligned
> -#  define __GI_STRCPY		__GI_strncpy
> -# else
>  #  define STRCPY_SSSE3		__strcpy_ssse3
> +#  define STRCPY_AVX2		__strcpy_avx2
>  #  define STRCPY_SSE2		__strcpy_sse2
>  #  define STRCPY_SSE2_UNALIGNED	__strcpy_sse2_unaligned
>  #  define __GI_STRCPY		__GI_strcpy
> -# endif
>  #endif
>  
>  
> @@ -64,7 +51,10 @@ ENTRY(STRCPY)
>  	cmpl	$0, __cpu_features+KIND_OFFSET(%rip)
>  	jne	1f
>  	call	__init_cpu_features
> -1:	leaq	STRCPY_SSE2_UNALIGNED(%rip), %rax
> +1:	leaq	STRCPY_AVX2(%rip), %rax
> +	testl   $bit_AVX_Fast_Unaligned_Load, __cpu_features+FEATURE_OFFSET+index_AVX_Fast_Unaligned_Load(%rip)
> +	jnz	2f
> +	leaq	STRCPY_SSE2_UNALIGNED(%rip), %rax
>  	testl	$bit_Fast_Unaligned_Load, __cpu_features+FEATURE_OFFSET+index_Fast_Unaligned_Load(%rip)
>  	jnz	2f
>  	leaq	STRCPY_SSE2(%rip), %rax
> diff --git a/sysdeps/x86_64/multiarch/strncpy-sse2-unaligned.S b/sysdeps/x86_64/multiarch/strncpy-sse2-unaligned.S
> index fcc23a7..e4c98e7 100644
> --- a/sysdeps/x86_64/multiarch/strncpy-sse2-unaligned.S
> +++ b/sysdeps/x86_64/multiarch/strncpy-sse2-unaligned.S
> @@ -1,3 +1,1888 @@
> -#define USE_AS_STRNCPY
> -#define STRCPY __strncpy_sse2_unaligned
> -#include "strcpy-sse2-unaligned.S"
> +/* strcpy with SSE2 and unaligned load
> +   Copyright (C) 2011-2015 Free Software Foundation, Inc.
> +   Contributed by Intel Corporation.
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <http://www.gnu.org/licenses/>.  */
> +
> +#if IS_IN (libc)
> +
> +# ifndef USE_AS_STRCAT
> +#  include <sysdep.h>
> +
> +#  ifndef STRCPY
> +#   define STRCPY  __strncpy_sse2_unaligned
> +#  endif
> +
> +# define USE_AS_STRNCPY
> +# endif
> +
> +# define JMPTBL(I, B)	I - B
> +# define BRANCH_TO_JMPTBL_ENTRY(TABLE, INDEX, SCALE)             \
> +	lea	TABLE(%rip), %r11;                              \
> +	movslq	(%r11, INDEX, SCALE), %rcx;                     \
> +	lea	(%r11, %rcx), %rcx;                             \
> +	jmp	*%rcx
> +
> +# ifndef USE_AS_STRCAT
> +
> +.text
> +ENTRY (STRCPY)
> +#  ifdef USE_AS_STRNCPY
> +	mov	%rdx, %r8
> +	test	%r8, %r8
> +	jz	L(ExitZero)
> +#  endif
> +	mov	%rsi, %rcx
> +#  ifndef USE_AS_STPCPY
> +	mov	%rdi, %rax      /* save result */
> +#  endif
> +
> +# endif
> +
> +	and	$63, %rcx
> +	cmp	$32, %rcx
> +	jbe	L(SourceStringAlignmentLess32)
> +
> +	and	$-16, %rsi
> +	and	$15, %rcx
> +	pxor	%xmm0, %xmm0
> +	pxor	%xmm1, %xmm1
> +
> +	pcmpeqb	(%rsi), %xmm1
> +	pmovmskb %xmm1, %rdx
> +	shr	%cl, %rdx
> +
> +# ifdef USE_AS_STRNCPY
> +#  if defined USE_AS_STPCPY || defined USE_AS_STRCAT
> +	mov	$16, %r10
> +	sub	%rcx, %r10
> +	cmp	%r10, %r8
> +#  else
> +	mov	$17, %r10
> +	sub	%rcx, %r10
> +	cmp	%r10, %r8
> +#  endif
> +	jbe	L(CopyFrom1To16BytesTailCase2OrCase3)
> +# endif
> +	test	%rdx, %rdx
> +	jnz	L(CopyFrom1To16BytesTail)
> +
> +	pcmpeqb	16(%rsi), %xmm0
> +	pmovmskb %xmm0, %rdx
> +
> +# ifdef USE_AS_STRNCPY
> +	add	$16, %r10
> +	cmp	%r10, %r8
> +	jbe	L(CopyFrom1To32BytesCase2OrCase3)
> +# endif
> +	test	%rdx, %rdx
> +	jnz	L(CopyFrom1To32Bytes)
> +
> +	movdqu	(%rsi, %rcx), %xmm1   /* copy 16 bytes */
> +	movdqu	%xmm1, (%rdi)
> +
> +/* If source address alignment != destination address alignment */
> +	.p2align 4
> +L(Unalign16Both):
> +	sub	%rcx, %rdi
> +# ifdef USE_AS_STRNCPY
> +	add	%rcx, %r8
> +# endif
> +	mov	$16, %rcx
> +	movdqa	(%rsi, %rcx), %xmm1
> +	movaps	16(%rsi, %rcx), %xmm2
> +	movdqu	%xmm1, (%rdi, %rcx)
> +	pcmpeqb	%xmm2, %xmm0
> +	pmovmskb %xmm0, %rdx
> +	add	$16, %rcx
> +# ifdef USE_AS_STRNCPY
> +	sub	$48, %r8
> +	jbe	L(CopyFrom1To16BytesCase2OrCase3)
> +# endif
> +	test	%rdx, %rdx
> +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> +	jnz	L(CopyFrom1To16BytesUnalignedXmm2)
> +# else
> +	jnz	L(CopyFrom1To16Bytes)
> +# endif
> +
> +	movaps	16(%rsi, %rcx), %xmm3
> +	movdqu	%xmm2, (%rdi, %rcx)
> +	pcmpeqb	%xmm3, %xmm0
> +	pmovmskb %xmm0, %rdx
> +	add	$16, %rcx
> +# ifdef USE_AS_STRNCPY
> +	sub	$16, %r8
> +	jbe	L(CopyFrom1To16BytesCase2OrCase3)
> +# endif
> +	test	%rdx, %rdx
> +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> +	jnz	L(CopyFrom1To16BytesUnalignedXmm3)
> +# else
> +	jnz	L(CopyFrom1To16Bytes)
> +# endif
> +
> +	movaps	16(%rsi, %rcx), %xmm4
> +	movdqu	%xmm3, (%rdi, %rcx)
> +	pcmpeqb	%xmm4, %xmm0
> +	pmovmskb %xmm0, %rdx
> +	add	$16, %rcx
> +# ifdef USE_AS_STRNCPY
> +	sub	$16, %r8
> +	jbe	L(CopyFrom1To16BytesCase2OrCase3)
> +# endif
> +	test	%rdx, %rdx
> +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> +	jnz	L(CopyFrom1To16BytesUnalignedXmm4)
> +# else
> +	jnz	L(CopyFrom1To16Bytes)
> +# endif
> +
> +	movaps	16(%rsi, %rcx), %xmm1
> +	movdqu	%xmm4, (%rdi, %rcx)
> +	pcmpeqb	%xmm1, %xmm0
> +	pmovmskb %xmm0, %rdx
> +	add	$16, %rcx
> +# ifdef USE_AS_STRNCPY
> +	sub	$16, %r8
> +	jbe	L(CopyFrom1To16BytesCase2OrCase3)
> +# endif
> +	test	%rdx, %rdx
> +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> +	jnz	L(CopyFrom1To16BytesUnalignedXmm1)
> +# else
> +	jnz	L(CopyFrom1To16Bytes)
> +# endif
> +
> +	movaps	16(%rsi, %rcx), %xmm2
> +	movdqu	%xmm1, (%rdi, %rcx)
> +	pcmpeqb	%xmm2, %xmm0
> +	pmovmskb %xmm0, %rdx
> +	add	$16, %rcx
> +# ifdef USE_AS_STRNCPY
> +	sub	$16, %r8
> +	jbe	L(CopyFrom1To16BytesCase2OrCase3)
> +# endif
> +	test	%rdx, %rdx
> +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> +	jnz	L(CopyFrom1To16BytesUnalignedXmm2)
> +# else
> +	jnz	L(CopyFrom1To16Bytes)
> +# endif
> +
> +	movaps	16(%rsi, %rcx), %xmm3
> +	movdqu	%xmm2, (%rdi, %rcx)
> +	pcmpeqb	%xmm3, %xmm0
> +	pmovmskb %xmm0, %rdx
> +	add	$16, %rcx
> +# ifdef USE_AS_STRNCPY
> +	sub	$16, %r8
> +	jbe	L(CopyFrom1To16BytesCase2OrCase3)
> +# endif
> +	test	%rdx, %rdx
> +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> +	jnz	L(CopyFrom1To16BytesUnalignedXmm3)
> +# else
> +	jnz	L(CopyFrom1To16Bytes)
> +# endif
> +
> +	movdqu	%xmm3, (%rdi, %rcx)
> +	mov	%rsi, %rdx
> +	lea	16(%rsi, %rcx), %rsi
> +	and	$-0x40, %rsi
> +	sub	%rsi, %rdx
> +	sub	%rdx, %rdi
> +# ifdef USE_AS_STRNCPY
> +	lea	128(%r8, %rdx), %r8
> +# endif
> +L(Unaligned64Loop):
> +	movaps	(%rsi), %xmm2
> +	movaps	%xmm2, %xmm4
> +	movaps	16(%rsi), %xmm5
> +	movaps	32(%rsi), %xmm3
> +	movaps	%xmm3, %xmm6
> +	movaps	48(%rsi), %xmm7
> +	pminub	%xmm5, %xmm2
> +	pminub	%xmm7, %xmm3
> +	pminub	%xmm2, %xmm3
> +	pcmpeqb	%xmm0, %xmm3
> +	pmovmskb %xmm3, %rdx
> +# ifdef USE_AS_STRNCPY
> +	sub	$64, %r8
> +	jbe	L(UnalignedLeaveCase2OrCase3)
> +# endif
> +	test	%rdx, %rdx
> +	jnz	L(Unaligned64Leave)
> +
> +L(Unaligned64Loop_start):
> +	add	$64, %rdi
> +	add	$64, %rsi
> +	movdqu	%xmm4, -64(%rdi)
> +	movaps	(%rsi), %xmm2
> +	movdqa	%xmm2, %xmm4
> +	movdqu	%xmm5, -48(%rdi)
> +	movaps	16(%rsi), %xmm5
> +	pminub	%xmm5, %xmm2
> +	movaps	32(%rsi), %xmm3
> +	movdqu	%xmm6, -32(%rdi)
> +	movaps	%xmm3, %xmm6
> +	movdqu	%xmm7, -16(%rdi)
> +	movaps	48(%rsi), %xmm7
> +	pminub	%xmm7, %xmm3
> +	pminub	%xmm2, %xmm3
> +	pcmpeqb	%xmm0, %xmm3
> +	pmovmskb %xmm3, %rdx
> +# ifdef USE_AS_STRNCPY
> +	sub	$64, %r8
> +	jbe	L(UnalignedLeaveCase2OrCase3)
> +# endif
> +	test	%rdx, %rdx
> +	jz	L(Unaligned64Loop_start)
> +
> +L(Unaligned64Leave):
> +	pxor	%xmm1, %xmm1
> +
> +	pcmpeqb	%xmm4, %xmm0
> +	pcmpeqb	%xmm5, %xmm1
> +	pmovmskb %xmm0, %rdx
> +	pmovmskb %xmm1, %rcx
> +	test	%rdx, %rdx
> +	jnz	L(CopyFrom1To16BytesUnaligned_0)
> +	test	%rcx, %rcx
> +	jnz	L(CopyFrom1To16BytesUnaligned_16)
> +
> +	pcmpeqb	%xmm6, %xmm0
> +	pcmpeqb	%xmm7, %xmm1
> +	pmovmskb %xmm0, %rdx
> +	pmovmskb %xmm1, %rcx
> +	test	%rdx, %rdx
> +	jnz	L(CopyFrom1To16BytesUnaligned_32)
> +
> +	bsf	%rcx, %rdx
> +	movdqu	%xmm4, (%rdi)
> +	movdqu	%xmm5, 16(%rdi)
> +	movdqu	%xmm6, 32(%rdi)
> +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> +# ifdef USE_AS_STPCPY
> +	lea	48(%rdi, %rdx), %rax
> +# endif
> +	movdqu	%xmm7, 48(%rdi)
> +	add	$15, %r8
> +	sub	%rdx, %r8
> +	lea	49(%rdi, %rdx), %rdi
> +	jmp	L(StrncpyFillTailWithZero)
> +# else
> +	add	$48, %rsi
> +	add	$48, %rdi
> +	BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4)
> +# endif
> +
> +/* If source address alignment == destination address alignment */
> +
> +L(SourceStringAlignmentLess32):
> +	pxor	%xmm0, %xmm0
> +	movdqu	(%rsi), %xmm1
> +	movdqu	16(%rsi), %xmm2
> +	pcmpeqb	%xmm1, %xmm0
> +	pmovmskb %xmm0, %rdx
> +
> +# ifdef USE_AS_STRNCPY
> +#  if defined USE_AS_STPCPY || defined USE_AS_STRCAT
> +	cmp	$16, %r8
> +#  else
> +	cmp	$17, %r8
> +#  endif
> +	jbe	L(CopyFrom1To16BytesTail1Case2OrCase3)
> +# endif
> +	test	%rdx, %rdx
> +	jnz	L(CopyFrom1To16BytesTail1)
> +
> +	pcmpeqb	%xmm2, %xmm0
> +	movdqu	%xmm1, (%rdi)
> +	pmovmskb %xmm0, %rdx
> +
> +# ifdef USE_AS_STRNCPY
> +#  if defined USE_AS_STPCPY || defined USE_AS_STRCAT
> +	cmp	$32, %r8
> +#  else
> +	cmp	$33, %r8
> +#  endif
> +	jbe	L(CopyFrom1To32Bytes1Case2OrCase3)
> +# endif
> +	test	%rdx, %rdx
> +	jnz	L(CopyFrom1To32Bytes1)
> +
> +	and	$-16, %rsi
> +	and	$15, %rcx
> +	jmp	L(Unalign16Both)
> +
> +/*------End of main part with loops---------------------*/
> +
> +/* Case1 */
> +
> +# if (!defined USE_AS_STRNCPY) || (defined USE_AS_STRCAT)
> +	.p2align 4
> +L(CopyFrom1To16Bytes):
> +	add	%rcx, %rdi
> +	add	%rcx, %rsi
> +	bsf	%rdx, %rdx
> +	BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4)
> +# endif
> +	.p2align 4
> +L(CopyFrom1To16BytesTail):
> +	add	%rcx, %rsi
> +	bsf	%rdx, %rdx
> +	BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4)
> +
> +	.p2align 4
> +L(CopyFrom1To32Bytes1):
> +	add	$16, %rsi
> +	add	$16, %rdi
> +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> +	sub	$16, %r8
> +# endif
> +L(CopyFrom1To16BytesTail1):
> +	bsf	%rdx, %rdx
> +	BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4)
> +
> +	.p2align 4
> +L(CopyFrom1To32Bytes):
> +	bsf	%rdx, %rdx
> +	add	%rcx, %rsi
> +	add	$16, %rdx
> +	sub	%rcx, %rdx
> +	BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4)
> +
> +	.p2align 4
> +L(CopyFrom1To16BytesUnaligned_0):
> +	bsf	%rdx, %rdx
> +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> +# ifdef USE_AS_STPCPY
> +	lea	(%rdi, %rdx), %rax
> +# endif
> +	movdqu	%xmm4, (%rdi)
> +	add	$63, %r8
> +	sub	%rdx, %r8
> +	lea	1(%rdi, %rdx), %rdi
> +	jmp	L(StrncpyFillTailWithZero)
> +# else
> +	BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4)
> +# endif
> +
> +	.p2align 4
> +L(CopyFrom1To16BytesUnaligned_16):
> +	bsf	%rcx, %rdx
> +	movdqu	%xmm4, (%rdi)
> +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> +# ifdef USE_AS_STPCPY
> +	lea	16(%rdi, %rdx), %rax
> +# endif
> +	movdqu	%xmm5, 16(%rdi)
> +	add	$47, %r8
> +	sub	%rdx, %r8
> +	lea	17(%rdi, %rdx), %rdi
> +	jmp	L(StrncpyFillTailWithZero)
> +# else
> +	add	$16, %rsi
> +	add	$16, %rdi
> +	BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4)
> +# endif
> +
> +	.p2align 4
> +L(CopyFrom1To16BytesUnaligned_32):
> +	bsf	%rdx, %rdx
> +	movdqu	%xmm4, (%rdi)
> +	movdqu	%xmm5, 16(%rdi)
> +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> +# ifdef USE_AS_STPCPY
> +	lea	32(%rdi, %rdx), %rax
> +# endif
> +	movdqu	%xmm6, 32(%rdi)
> +	add	$31, %r8
> +	sub	%rdx, %r8
> +	lea	33(%rdi, %rdx), %rdi
> +	jmp	L(StrncpyFillTailWithZero)
> +# else
> +	add	$32, %rsi
> +	add	$32, %rdi
> +	BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4)
> +# endif
> +
> +# ifdef USE_AS_STRNCPY
> +#  ifndef USE_AS_STRCAT
> +	.p2align 4
> +L(CopyFrom1To16BytesUnalignedXmm6):
> +	movdqu	%xmm6, (%rdi, %rcx)
> +	jmp	L(CopyFrom1To16BytesXmmExit)
> +
> +	.p2align 4
> +L(CopyFrom1To16BytesUnalignedXmm5):
> +	movdqu	%xmm5, (%rdi, %rcx)
> +	jmp	L(CopyFrom1To16BytesXmmExit)
> +
> +	.p2align 4
> +L(CopyFrom1To16BytesUnalignedXmm4):
> +	movdqu	%xmm4, (%rdi, %rcx)
> +	jmp	L(CopyFrom1To16BytesXmmExit)
> +
> +	.p2align 4
> +L(CopyFrom1To16BytesUnalignedXmm3):
> +	movdqu	%xmm3, (%rdi, %rcx)
> +	jmp	L(CopyFrom1To16BytesXmmExit)
> +
> +	.p2align 4
> +L(CopyFrom1To16BytesUnalignedXmm1):
> +	movdqu	%xmm1, (%rdi, %rcx)
> +	jmp	L(CopyFrom1To16BytesXmmExit)
> +#  endif
> +
> +	.p2align 4
> +L(CopyFrom1To16BytesExit):
> +	BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4)
> +
> +/* Case2 */
> +
> +	.p2align 4
> +L(CopyFrom1To16BytesCase2):
> +	add	$16, %r8
> +	add	%rcx, %rdi
> +	add	%rcx, %rsi
> +	bsf	%rdx, %rdx
> +	cmp	%r8, %rdx
> +	jb	L(CopyFrom1To16BytesExit)
> +	BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4)
> +
> +	.p2align 4
> +L(CopyFrom1To32BytesCase2):
> +	add	%rcx, %rsi
> +	bsf	%rdx, %rdx
> +	add	$16, %rdx
> +	sub	%rcx, %rdx
> +	cmp	%r8, %rdx
> +	jb	L(CopyFrom1To16BytesExit)
> +	BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4)
> +
> +L(CopyFrom1To16BytesTailCase2):
> +	add	%rcx, %rsi
> +	bsf	%rdx, %rdx
> +	cmp	%r8, %rdx
> +	jb	L(CopyFrom1To16BytesExit)
> +	BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4)
> +
> +L(CopyFrom1To16BytesTail1Case2):
> +	bsf	%rdx, %rdx
> +	cmp	%r8, %rdx
> +	jb	L(CopyFrom1To16BytesExit)
> +	BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4)
> +
> +/* Case2 or Case3,  Case3 */
> +
> +	.p2align 4
> +L(CopyFrom1To16BytesCase2OrCase3):
> +	test	%rdx, %rdx
> +	jnz	L(CopyFrom1To16BytesCase2)
> +L(CopyFrom1To16BytesCase3):
> +	add	$16, %r8
> +	add	%rcx, %rdi
> +	add	%rcx, %rsi
> +	BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4)
> +
> +	.p2align 4
> +L(CopyFrom1To32BytesCase2OrCase3):
> +	test	%rdx, %rdx
> +	jnz	L(CopyFrom1To32BytesCase2)
> +	add	%rcx, %rsi
> +	BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4)
> +
> +	.p2align 4
> +L(CopyFrom1To16BytesTailCase2OrCase3):
> +	test	%rdx, %rdx
> +	jnz	L(CopyFrom1To16BytesTailCase2)
> +	add	%rcx, %rsi
> +	BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4)
> +
> +	.p2align 4
> +L(CopyFrom1To32Bytes1Case2OrCase3):
> +	add	$16, %rdi
> +	add	$16, %rsi
> +	sub	$16, %r8
> +L(CopyFrom1To16BytesTail1Case2OrCase3):
> +	test	%rdx, %rdx
> +	jnz	L(CopyFrom1To16BytesTail1Case2)
> +	BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4)
> +
> +# endif
> +
> +/*------------End labels regarding with copying 1-16 bytes--and 1-32 bytes----*/
> +
> +	.p2align 4
> +L(Exit1):
> +	mov	%dh, (%rdi)
> +# ifdef USE_AS_STPCPY
> +	lea	(%rdi), %rax
> +# endif
> +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> +	sub	$1, %r8
> +	lea	1(%rdi), %rdi
> +	jnz	L(StrncpyFillTailWithZero)
> +# endif
> +	ret
> +
> +	.p2align 4
> +L(Exit2):
> +	mov	(%rsi), %dx
> +	mov	%dx, (%rdi)
> +# ifdef USE_AS_STPCPY
> +	lea	1(%rdi), %rax
> +# endif
> +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> +	sub	$2, %r8
> +	lea	2(%rdi), %rdi
> +	jnz	L(StrncpyFillTailWithZero)
> +# endif
> +	ret
> +
> +	.p2align 4
> +L(Exit3):
> +	mov	(%rsi), %cx
> +	mov	%cx, (%rdi)
> +	mov	%dh, 2(%rdi)
> +# ifdef USE_AS_STPCPY
> +	lea	2(%rdi), %rax
> +# endif
> +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> +	sub	$3, %r8
> +	lea	3(%rdi), %rdi
> +	jnz	L(StrncpyFillTailWithZero)
> +# endif
> +	ret
> +
> +	.p2align 4
> +L(Exit4):
> +	mov	(%rsi), %edx
> +	mov	%edx, (%rdi)
> +# ifdef USE_AS_STPCPY
> +	lea	3(%rdi), %rax
> +# endif
> +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> +	sub	$4, %r8
> +	lea	4(%rdi), %rdi
> +	jnz	L(StrncpyFillTailWithZero)
> +# endif
> +	ret
> +
> +	.p2align 4
> +L(Exit5):
> +	mov	(%rsi), %ecx
> +	mov	%dh, 4(%rdi)
> +	mov	%ecx, (%rdi)
> +# ifdef USE_AS_STPCPY
> +	lea	4(%rdi), %rax
> +# endif
> +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> +	sub	$5, %r8
> +	lea	5(%rdi), %rdi
> +	jnz	L(StrncpyFillTailWithZero)
> +# endif
> +	ret
> +
> +	.p2align 4
> +L(Exit6):
> +	mov	(%rsi), %ecx
> +	mov	4(%rsi), %dx
> +	mov	%ecx, (%rdi)
> +	mov	%dx, 4(%rdi)
> +# ifdef USE_AS_STPCPY
> +	lea	5(%rdi), %rax
> +# endif
> +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> +	sub	$6, %r8
> +	lea	6(%rdi), %rdi
> +	jnz	L(StrncpyFillTailWithZero)
> +# endif
> +	ret
> +
> +	.p2align 4
> +L(Exit7):
> +	mov	(%rsi), %ecx
> +	mov	3(%rsi), %edx
> +	mov	%ecx, (%rdi)
> +	mov	%edx, 3(%rdi)
> +# ifdef USE_AS_STPCPY
> +	lea	6(%rdi), %rax
> +# endif
> +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> +	sub	$7, %r8
> +	lea	7(%rdi), %rdi
> +	jnz	L(StrncpyFillTailWithZero)
> +# endif
> +	ret
> +
> +	.p2align 4
> +L(Exit8):
> +	mov	(%rsi), %rdx
> +	mov	%rdx, (%rdi)
> +# ifdef USE_AS_STPCPY
> +	lea	7(%rdi), %rax
> +# endif
> +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> +	sub	$8, %r8
> +	lea	8(%rdi), %rdi
> +	jnz	L(StrncpyFillTailWithZero)
> +# endif
> +	ret
> +
> +	.p2align 4
> +L(Exit9):
> +	mov	(%rsi), %rcx
> +	mov	%dh, 8(%rdi)
> +	mov	%rcx, (%rdi)
> +# ifdef USE_AS_STPCPY
> +	lea	8(%rdi), %rax
> +# endif
> +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> +	sub	$9, %r8
> +	lea	9(%rdi), %rdi
> +	jnz	L(StrncpyFillTailWithZero)
> +# endif
> +	ret
> +
> +	.p2align 4
> +L(Exit10):
> +	mov	(%rsi), %rcx
> +	mov	8(%rsi), %dx
> +	mov	%rcx, (%rdi)
> +	mov	%dx, 8(%rdi)
> +# ifdef USE_AS_STPCPY
> +	lea	9(%rdi), %rax
> +# endif
> +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> +	sub	$10, %r8
> +	lea	10(%rdi), %rdi
> +	jnz	L(StrncpyFillTailWithZero)
> +# endif
> +	ret
> +
> +	.p2align 4
> +L(Exit11):
> +	mov	(%rsi), %rcx
> +	mov	7(%rsi), %edx
> +	mov	%rcx, (%rdi)
> +	mov	%edx, 7(%rdi)
> +# ifdef USE_AS_STPCPY
> +	lea	10(%rdi), %rax
> +# endif
> +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> +	sub	$11, %r8
> +	lea	11(%rdi), %rdi
> +	jnz	L(StrncpyFillTailWithZero)
> +# endif
> +	ret
> +
> +	.p2align 4
> +L(Exit12):
> +	mov	(%rsi), %rcx
> +	mov	8(%rsi), %edx
> +	mov	%rcx, (%rdi)
> +	mov	%edx, 8(%rdi)
> +# ifdef USE_AS_STPCPY
> +	lea	11(%rdi), %rax
> +# endif
> +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> +	sub	$12, %r8
> +	lea	12(%rdi), %rdi
> +	jnz	L(StrncpyFillTailWithZero)
> +# endif
> +	ret
> +
> +	.p2align 4
> +L(Exit13):
> +	mov	(%rsi), %rcx
> +	mov	5(%rsi), %rdx
> +	mov	%rcx, (%rdi)
> +	mov	%rdx, 5(%rdi)
> +# ifdef USE_AS_STPCPY
> +	lea	12(%rdi), %rax
> +# endif
> +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> +	sub	$13, %r8
> +	lea	13(%rdi), %rdi
> +	jnz	L(StrncpyFillTailWithZero)
> +# endif
> +	ret
> +
> +	.p2align 4
> +L(Exit14):
> +	mov	(%rsi), %rcx
> +	mov	6(%rsi), %rdx
> +	mov	%rcx, (%rdi)
> +	mov	%rdx, 6(%rdi)
> +# ifdef USE_AS_STPCPY
> +	lea	13(%rdi), %rax
> +# endif
> +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> +	sub	$14, %r8
> +	lea	14(%rdi), %rdi
> +	jnz	L(StrncpyFillTailWithZero)
> +# endif
> +	ret
> +
> +	.p2align 4
> +L(Exit15):
> +	mov	(%rsi), %rcx
> +	mov	7(%rsi), %rdx
> +	mov	%rcx, (%rdi)
> +	mov	%rdx, 7(%rdi)
> +# ifdef USE_AS_STPCPY
> +	lea	14(%rdi), %rax
> +# endif
> +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> +	sub	$15, %r8
> +	lea	15(%rdi), %rdi
> +	jnz	L(StrncpyFillTailWithZero)
> +# endif
> +	ret
> +
> +	.p2align 4
> +L(Exit16):
> +	movdqu	(%rsi), %xmm0
> +	movdqu	%xmm0, (%rdi)
> +# ifdef USE_AS_STPCPY
> +	lea	15(%rdi), %rax
> +# endif
> +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> +	sub	$16, %r8
> +	lea	16(%rdi), %rdi
> +	jnz	L(StrncpyFillTailWithZero)
> +# endif
> +	ret
> +
> +	.p2align 4
> +L(Exit17):
> +	movdqu	(%rsi), %xmm0
> +	movdqu	%xmm0, (%rdi)
> +	mov	%dh, 16(%rdi)
> +# ifdef USE_AS_STPCPY
> +	lea	16(%rdi), %rax
> +# endif
> +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> +	sub	$17, %r8
> +	lea	17(%rdi), %rdi
> +	jnz	L(StrncpyFillTailWithZero)
> +# endif
> +	ret
> +
> +	.p2align 4
> +L(Exit18):
> +	movdqu	(%rsi), %xmm0
> +	mov	16(%rsi), %cx
> +	movdqu	%xmm0, (%rdi)
> +	mov	%cx, 16(%rdi)
> +# ifdef USE_AS_STPCPY
> +	lea	17(%rdi), %rax
> +# endif
> +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> +	sub	$18, %r8
> +	lea	18(%rdi), %rdi
> +	jnz	L(StrncpyFillTailWithZero)
> +# endif
> +	ret
> +
> +	.p2align 4
> +L(Exit19):
> +	movdqu	(%rsi), %xmm0
> +	mov	15(%rsi), %ecx
> +	movdqu	%xmm0, (%rdi)
> +	mov	%ecx, 15(%rdi)
> +# ifdef USE_AS_STPCPY
> +	lea	18(%rdi), %rax
> +# endif
> +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> +	sub	$19, %r8
> +	lea	19(%rdi), %rdi
> +	jnz	L(StrncpyFillTailWithZero)
> +# endif
> +	ret
> +
> +	.p2align 4
> +L(Exit20):
> +	movdqu	(%rsi), %xmm0
> +	mov	16(%rsi), %ecx
> +	movdqu	%xmm0, (%rdi)
> +	mov	%ecx, 16(%rdi)
> +# ifdef USE_AS_STPCPY
> +	lea	19(%rdi), %rax
> +# endif
> +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> +	sub	$20, %r8
> +	lea	20(%rdi), %rdi
> +	jnz	L(StrncpyFillTailWithZero)
> +# endif
> +	ret
> +
> +	.p2align 4
> +L(Exit21):
> +	movdqu	(%rsi), %xmm0
> +	mov	16(%rsi), %ecx
> +	movdqu	%xmm0, (%rdi)
> +	mov	%ecx, 16(%rdi)
> +	mov	%dh, 20(%rdi)
> +# ifdef USE_AS_STPCPY
> +	lea	20(%rdi), %rax
> +# endif
> +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> +	sub	$21, %r8
> +	lea	21(%rdi), %rdi
> +	jnz	L(StrncpyFillTailWithZero)
> +# endif
> +	ret
> +
> +	.p2align 4
> +L(Exit22):
> +	movdqu	(%rsi), %xmm0
> +	mov	14(%rsi), %rcx
> +	movdqu	%xmm0, (%rdi)
> +	mov	%rcx, 14(%rdi)
> +# ifdef USE_AS_STPCPY
> +	lea	21(%rdi), %rax
> +# endif
> +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> +	sub	$22, %r8
> +	lea	22(%rdi), %rdi
> +	jnz	L(StrncpyFillTailWithZero)
> +# endif
> +	ret
> +
> +	.p2align 4
> +L(Exit23):
> +	movdqu	(%rsi), %xmm0
> +	mov	15(%rsi), %rcx
> +	movdqu	%xmm0, (%rdi)
> +	mov	%rcx, 15(%rdi)
> +# ifdef USE_AS_STPCPY
> +	lea	22(%rdi), %rax
> +# endif
> +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> +	sub	$23, %r8
> +	lea	23(%rdi), %rdi
> +	jnz	L(StrncpyFillTailWithZero)
> +# endif
> +	ret
> +
> +	.p2align 4
> +L(Exit24):
> +	movdqu	(%rsi), %xmm0
> +	mov	16(%rsi), %rcx
> +	movdqu	%xmm0, (%rdi)
> +	mov	%rcx, 16(%rdi)
> +# ifdef USE_AS_STPCPY
> +	lea	23(%rdi), %rax
> +# endif
> +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> +	sub	$24, %r8
> +	lea	24(%rdi), %rdi
> +	jnz	L(StrncpyFillTailWithZero)
> +# endif
> +	ret
> +
> +	.p2align 4
> +L(Exit25):
> +	movdqu	(%rsi), %xmm0
> +	mov	16(%rsi), %rcx
> +	movdqu	%xmm0, (%rdi)
> +	mov	%rcx, 16(%rdi)
> +	mov	%dh, 24(%rdi)
> +# ifdef USE_AS_STPCPY
> +	lea	24(%rdi), %rax
> +# endif
> +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> +	sub	$25, %r8
> +	lea	25(%rdi), %rdi
> +	jnz	L(StrncpyFillTailWithZero)
> +# endif
> +	ret
> +
> +	.p2align 4
> +L(Exit26):
> +	movdqu	(%rsi), %xmm0
> +	mov	16(%rsi), %rdx
> +	mov	24(%rsi), %cx
> +	movdqu	%xmm0, (%rdi)
> +	mov	%rdx, 16(%rdi)
> +	mov	%cx, 24(%rdi)
> +# ifdef USE_AS_STPCPY
> +	lea	25(%rdi), %rax
> +# endif
> +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> +	sub	$26, %r8
> +	lea	26(%rdi), %rdi
> +	jnz	L(StrncpyFillTailWithZero)
> +# endif
> +	ret
> +
> +	.p2align 4
> +L(Exit27):
> +	movdqu	(%rsi), %xmm0
> +	mov	16(%rsi), %rdx
> +	mov	23(%rsi), %ecx
> +	movdqu	%xmm0, (%rdi)
> +	mov	%rdx, 16(%rdi)
> +	mov	%ecx, 23(%rdi)
> +# ifdef USE_AS_STPCPY
> +	lea	26(%rdi), %rax
> +# endif
> +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> +	sub	$27, %r8
> +	lea	27(%rdi), %rdi
> +	jnz	L(StrncpyFillTailWithZero)
> +# endif
> +	ret
> +
> +	.p2align 4
> +L(Exit28):
> +	movdqu	(%rsi), %xmm0
> +	mov	16(%rsi), %rdx
> +	mov	24(%rsi), %ecx
> +	movdqu	%xmm0, (%rdi)
> +	mov	%rdx, 16(%rdi)
> +	mov	%ecx, 24(%rdi)
> +# ifdef USE_AS_STPCPY
> +	lea	27(%rdi), %rax
> +# endif
> +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> +	sub	$28, %r8
> +	lea	28(%rdi), %rdi
> +	jnz	L(StrncpyFillTailWithZero)
> +# endif
> +	ret
> +
> +	.p2align 4
> +L(Exit29):
> +	movdqu	(%rsi), %xmm0
> +	movdqu	13(%rsi), %xmm2
> +	movdqu	%xmm0, (%rdi)
> +	movdqu	%xmm2, 13(%rdi)
> +# ifdef USE_AS_STPCPY
> +	lea	28(%rdi), %rax
> +# endif
> +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> +	sub	$29, %r8
> +	lea	29(%rdi), %rdi
> +	jnz	L(StrncpyFillTailWithZero)
> +# endif
> +	ret
> +
> +	.p2align 4
> +L(Exit30):
> +	movdqu	(%rsi), %xmm0
> +	movdqu	14(%rsi), %xmm2
> +	movdqu	%xmm0, (%rdi)
> +	movdqu	%xmm2, 14(%rdi)
> +# ifdef USE_AS_STPCPY
> +	lea	29(%rdi), %rax
> +# endif
> +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> +	sub	$30, %r8
> +	lea	30(%rdi), %rdi
> +	jnz	L(StrncpyFillTailWithZero)
> +# endif
> +	ret
> +
> +	.p2align 4
> +L(Exit31):
> +	movdqu	(%rsi), %xmm0
> +	movdqu	15(%rsi), %xmm2
> +	movdqu	%xmm0, (%rdi)
> +	movdqu	%xmm2, 15(%rdi)
> +# ifdef USE_AS_STPCPY
> +	lea	30(%rdi), %rax
> +# endif
> +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> +	sub	$31, %r8
> +	lea	31(%rdi), %rdi
> +	jnz	L(StrncpyFillTailWithZero)
> +# endif
> +	ret
> +
> +	.p2align 4
> +L(Exit32):
> +	movdqu	(%rsi), %xmm0
> +	movdqu	16(%rsi), %xmm2
> +	movdqu	%xmm0, (%rdi)
> +	movdqu	%xmm2, 16(%rdi)
> +# ifdef USE_AS_STPCPY
> +	lea	31(%rdi), %rax
> +# endif
> +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> +	sub	$32, %r8
> +	lea	32(%rdi), %rdi
> +	jnz	L(StrncpyFillTailWithZero)
> +# endif
> +	ret
> +
> +# ifdef USE_AS_STRNCPY
> +
> +	.p2align 4
> +L(StrncpyExit0):
> +#  ifdef USE_AS_STPCPY
> +	mov	%rdi, %rax
> +#  endif
> +#  ifdef USE_AS_STRCAT
> +	xor	%ch, %ch
> +	movb	%ch, (%rdi)
> +#  endif
> +	ret
> +
> +	.p2align 4
> +L(StrncpyExit1):
> +	mov	(%rsi), %dl
> +	mov	%dl, (%rdi)
> +#  ifdef USE_AS_STPCPY
> +	lea	1(%rdi), %rax
> +#  endif
> +#  ifdef USE_AS_STRCAT
> +	xor	%ch, %ch
> +	movb	%ch, 1(%rdi)
> +#  endif
> +	ret
> +
> +	.p2align 4
> +L(StrncpyExit2):
> +	mov	(%rsi), %dx
> +	mov	%dx, (%rdi)
> +#  ifdef USE_AS_STPCPY
> +	lea	2(%rdi), %rax
> +#  endif
> +#  ifdef USE_AS_STRCAT
> +	xor	%ch, %ch
> +	movb	%ch, 2(%rdi)
> +#  endif
> +	ret
> +
> +	.p2align 4
> +L(StrncpyExit3):
> +	mov	(%rsi), %cx
> +	mov	2(%rsi), %dl
> +	mov	%cx, (%rdi)
> +	mov	%dl, 2(%rdi)
> +#  ifdef USE_AS_STPCPY
> +	lea	3(%rdi), %rax
> +#  endif
> +#  ifdef USE_AS_STRCAT
> +	xor	%ch, %ch
> +	movb	%ch, 3(%rdi)
> +#  endif
> +	ret
> +
> +	.p2align 4
> +L(StrncpyExit4):
> +	mov	(%rsi), %edx
> +	mov	%edx, (%rdi)
> +#  ifdef USE_AS_STPCPY
> +	lea	4(%rdi), %rax
> +#  endif
> +#  ifdef USE_AS_STRCAT
> +	xor	%ch, %ch
> +	movb	%ch, 4(%rdi)
> +#  endif
> +	ret
> +
> +	.p2align 4
> +L(StrncpyExit5):
> +	mov	(%rsi), %ecx
> +	mov	4(%rsi), %dl
> +	mov	%ecx, (%rdi)
> +	mov	%dl, 4(%rdi)
> +#  ifdef USE_AS_STPCPY
> +	lea	5(%rdi), %rax
> +#  endif
> +#  ifdef USE_AS_STRCAT
> +	xor	%ch, %ch
> +	movb	%ch, 5(%rdi)
> +#  endif
> +	ret
> +
> +	.p2align 4
> +L(StrncpyExit6):
> +	mov	(%rsi), %ecx
> +	mov	4(%rsi), %dx
> +	mov	%ecx, (%rdi)
> +	mov	%dx, 4(%rdi)
> +#  ifdef USE_AS_STPCPY
> +	lea	6(%rdi), %rax
> +#  endif
> +#  ifdef USE_AS_STRCAT
> +	xor	%ch, %ch
> +	movb	%ch, 6(%rdi)
> +#  endif
> +	ret
> +
> +	.p2align 4
> +L(StrncpyExit7):
> +	mov	(%rsi), %ecx
> +	mov	3(%rsi), %edx
> +	mov	%ecx, (%rdi)
> +	mov	%edx, 3(%rdi)
> +#  ifdef USE_AS_STPCPY
> +	lea	7(%rdi), %rax
> +#  endif
> +#  ifdef USE_AS_STRCAT
> +	xor	%ch, %ch
> +	movb	%ch, 7(%rdi)
> +#  endif
> +	ret
> +
> +	.p2align 4
> +L(StrncpyExit8):
> +	mov	(%rsi), %rdx
> +	mov	%rdx, (%rdi)
> +#  ifdef USE_AS_STPCPY
> +	lea	8(%rdi), %rax
> +#  endif
> +#  ifdef USE_AS_STRCAT
> +	xor	%ch, %ch
> +	movb	%ch, 8(%rdi)
> +#  endif
> +	ret
> +
> +	.p2align 4
> +L(StrncpyExit9):
> +	mov	(%rsi), %rcx
> +	mov	8(%rsi), %dl
> +	mov	%rcx, (%rdi)
> +	mov	%dl, 8(%rdi)
> +#  ifdef USE_AS_STPCPY
> +	lea	9(%rdi), %rax
> +#  endif
> +#  ifdef USE_AS_STRCAT
> +	xor	%ch, %ch
> +	movb	%ch, 9(%rdi)
> +#  endif
> +	ret
> +
> +	.p2align 4
> +L(StrncpyExit10):
> +	mov	(%rsi), %rcx
> +	mov	8(%rsi), %dx
> +	mov	%rcx, (%rdi)
> +	mov	%dx, 8(%rdi)
> +#  ifdef USE_AS_STPCPY
> +	lea	10(%rdi), %rax
> +#  endif
> +#  ifdef USE_AS_STRCAT
> +	xor	%ch, %ch
> +	movb	%ch, 10(%rdi)
> +#  endif
> +	ret
> +
> +	.p2align 4
> +L(StrncpyExit11):
> +	mov	(%rsi), %rcx
> +	mov	7(%rsi), %edx
> +	mov	%rcx, (%rdi)
> +	mov	%edx, 7(%rdi)
> +#  ifdef USE_AS_STPCPY
> +	lea	11(%rdi), %rax
> +#  endif
> +#  ifdef USE_AS_STRCAT
> +	xor	%ch, %ch
> +	movb	%ch, 11(%rdi)
> +#  endif
> +	ret
> +
> +	.p2align 4
> +L(StrncpyExit12):
> +	mov	(%rsi), %rcx
> +	mov	8(%rsi), %edx
> +	mov	%rcx, (%rdi)
> +	mov	%edx, 8(%rdi)
> +#  ifdef USE_AS_STPCPY
> +	lea	12(%rdi), %rax
> +#  endif
> +#  ifdef USE_AS_STRCAT
> +	xor	%ch, %ch
> +	movb	%ch, 12(%rdi)
> +#  endif
> +	ret
> +
> +	.p2align 4
> +L(StrncpyExit13):
> +	mov	(%rsi), %rcx
> +	mov	5(%rsi), %rdx
> +	mov	%rcx, (%rdi)
> +	mov	%rdx, 5(%rdi)
> +#  ifdef USE_AS_STPCPY
> +	lea	13(%rdi), %rax
> +#  endif
> +#  ifdef USE_AS_STRCAT
> +	xor	%ch, %ch
> +	movb	%ch, 13(%rdi)
> +#  endif
> +	ret
> +
> +	.p2align 4
> +L(StrncpyExit14):
> +	mov	(%rsi), %rcx
> +	mov	6(%rsi), %rdx
> +	mov	%rcx, (%rdi)
> +	mov	%rdx, 6(%rdi)
> +#  ifdef USE_AS_STPCPY
> +	lea	14(%rdi), %rax
> +#  endif
> +#  ifdef USE_AS_STRCAT
> +	xor	%ch, %ch
> +	movb	%ch, 14(%rdi)
> +#  endif
> +	ret
> +
> +	.p2align 4
> +L(StrncpyExit15):
> +	mov	(%rsi), %rcx
> +	mov	7(%rsi), %rdx
> +	mov	%rcx, (%rdi)
> +	mov	%rdx, 7(%rdi)
> +#  ifdef USE_AS_STPCPY
> +	lea	15(%rdi), %rax
> +#  endif
> +#  ifdef USE_AS_STRCAT
> +	xor	%ch, %ch
> +	movb	%ch, 15(%rdi)
> +#  endif
> +	ret
> +
> +	.p2align 4
> +L(StrncpyExit16):
> +	movdqu	(%rsi), %xmm0
> +	movdqu	%xmm0, (%rdi)
> +#  ifdef USE_AS_STPCPY
> +	lea	16(%rdi), %rax
> +#  endif
> +#  ifdef USE_AS_STRCAT
> +	xor	%ch, %ch
> +	movb	%ch, 16(%rdi)
> +#  endif
> +	ret
> +
> +	.p2align 4
> +L(StrncpyExit17):
> +	movdqu	(%rsi), %xmm0
> +	mov	16(%rsi), %cl
> +	movdqu	%xmm0, (%rdi)
> +	mov	%cl, 16(%rdi)
> +#  ifdef USE_AS_STPCPY
> +	lea	17(%rdi), %rax
> +#  endif
> +#  ifdef USE_AS_STRCAT
> +	xor	%ch, %ch
> +	movb	%ch, 17(%rdi)
> +#  endif
> +	ret
> +
> +	.p2align 4
> +L(StrncpyExit18):
> +	movdqu	(%rsi), %xmm0
> +	mov	16(%rsi), %cx
> +	movdqu	%xmm0, (%rdi)
> +	mov	%cx, 16(%rdi)
> +#  ifdef USE_AS_STPCPY
> +	lea	18(%rdi), %rax
> +#  endif
> +#  ifdef USE_AS_STRCAT
> +	xor	%ch, %ch
> +	movb	%ch, 18(%rdi)
> +#  endif
> +	ret
> +
> +	.p2align 4
> +L(StrncpyExit19):
> +	movdqu	(%rsi), %xmm0
> +	mov	15(%rsi), %ecx
> +	movdqu	%xmm0, (%rdi)
> +	mov	%ecx, 15(%rdi)
> +#  ifdef USE_AS_STPCPY
> +	lea	19(%rdi), %rax
> +#  endif
> +#  ifdef USE_AS_STRCAT
> +	xor	%ch, %ch
> +	movb	%ch, 19(%rdi)
> +#  endif
> +	ret
> +
> +	.p2align 4
> +L(StrncpyExit20):
> +	movdqu	(%rsi), %xmm0
> +	mov	16(%rsi), %ecx
> +	movdqu	%xmm0, (%rdi)
> +	mov	%ecx, 16(%rdi)
> +#  ifdef USE_AS_STPCPY
> +	lea	20(%rdi), %rax
> +#  endif
> +#  ifdef USE_AS_STRCAT
> +	xor	%ch, %ch
> +	movb	%ch, 20(%rdi)
> +#  endif
> +	ret
> +
> +	.p2align 4
> +L(StrncpyExit21):
> +	movdqu	(%rsi), %xmm0
> +	mov	16(%rsi), %ecx
> +	mov	20(%rsi), %dl
> +	movdqu	%xmm0, (%rdi)
> +	mov	%ecx, 16(%rdi)
> +	mov	%dl, 20(%rdi)
> +#  ifdef USE_AS_STPCPY
> +	lea	21(%rdi), %rax
> +#  endif
> +#  ifdef USE_AS_STRCAT
> +	xor	%ch, %ch
> +	movb	%ch, 21(%rdi)
> +#  endif
> +	ret
> +
> +	.p2align 4
> +L(StrncpyExit22):
> +	movdqu	(%rsi), %xmm0
> +	mov	14(%rsi), %rcx
> +	movdqu	%xmm0, (%rdi)
> +	mov	%rcx, 14(%rdi)
> +#  ifdef USE_AS_STPCPY
> +	lea	22(%rdi), %rax
> +#  endif
> +#  ifdef USE_AS_STRCAT
> +	xor	%ch, %ch
> +	movb	%ch, 22(%rdi)
> +#  endif
> +	ret
> +
> +	.p2align 4
> +L(StrncpyExit23):
> +	movdqu	(%rsi), %xmm0
> +	mov	15(%rsi), %rcx
> +	movdqu	%xmm0, (%rdi)
> +	mov	%rcx, 15(%rdi)
> +#  ifdef USE_AS_STPCPY
> +	lea	23(%rdi), %rax
> +#  endif
> +#  ifdef USE_AS_STRCAT
> +	xor	%ch, %ch
> +	movb	%ch, 23(%rdi)
> +#  endif
> +	ret
> +
> +	.p2align 4
> +L(StrncpyExit24):
> +	movdqu	(%rsi), %xmm0
> +	mov	16(%rsi), %rcx
> +	movdqu	%xmm0, (%rdi)
> +	mov	%rcx, 16(%rdi)
> +#  ifdef USE_AS_STPCPY
> +	lea	24(%rdi), %rax
> +#  endif
> +#  ifdef USE_AS_STRCAT
> +	xor	%ch, %ch
> +	movb	%ch, 24(%rdi)
> +#  endif
> +	ret
> +
> +	.p2align 4
> +L(StrncpyExit25):
> +	movdqu	(%rsi), %xmm0
> +	mov	16(%rsi), %rdx
> +	mov	24(%rsi), %cl
> +	movdqu	%xmm0, (%rdi)
> +	mov	%rdx, 16(%rdi)
> +	mov	%cl, 24(%rdi)
> +#  ifdef USE_AS_STPCPY
> +	lea	25(%rdi), %rax
> +#  endif
> +#  ifdef USE_AS_STRCAT
> +	xor	%ch, %ch
> +	movb	%ch, 25(%rdi)
> +#  endif
> +	ret
> +
> +	.p2align 4
> +L(StrncpyExit26):
> +	movdqu	(%rsi), %xmm0
> +	mov	16(%rsi), %rdx
> +	mov	24(%rsi), %cx
> +	movdqu	%xmm0, (%rdi)
> +	mov	%rdx, 16(%rdi)
> +	mov	%cx, 24(%rdi)
> +#  ifdef USE_AS_STPCPY
> +	lea	26(%rdi), %rax
> +#  endif
> +#  ifdef USE_AS_STRCAT
> +	xor	%ch, %ch
> +	movb	%ch, 26(%rdi)
> +#  endif
> +	ret
> +
> +	.p2align 4
> +L(StrncpyExit27):
> +	movdqu	(%rsi), %xmm0
> +	mov	16(%rsi), %rdx
> +	mov	23(%rsi), %ecx
> +	movdqu	%xmm0, (%rdi)
> +	mov	%rdx, 16(%rdi)
> +	mov	%ecx, 23(%rdi)
> +#  ifdef USE_AS_STPCPY
> +	lea	27(%rdi), %rax
> +#  endif
> +#  ifdef USE_AS_STRCAT
> +	xor	%ch, %ch
> +	movb	%ch, 27(%rdi)
> +#  endif
> +	ret
> +
> +	.p2align 4
> +L(StrncpyExit28):
> +	movdqu	(%rsi), %xmm0
> +	mov	16(%rsi), %rdx
> +	mov	24(%rsi), %ecx
> +	movdqu	%xmm0, (%rdi)
> +	mov	%rdx, 16(%rdi)
> +	mov	%ecx, 24(%rdi)
> +#  ifdef USE_AS_STPCPY
> +	lea	28(%rdi), %rax
> +#  endif
> +#  ifdef USE_AS_STRCAT
> +	xor	%ch, %ch
> +	movb	%ch, 28(%rdi)
> +#  endif
> +	ret
> +
> +	.p2align 4
> +L(StrncpyExit29):
> +	movdqu	(%rsi), %xmm0
> +	movdqu	13(%rsi), %xmm2
> +	movdqu	%xmm0, (%rdi)
> +	movdqu	%xmm2, 13(%rdi)
> +#  ifdef USE_AS_STPCPY
> +	lea	29(%rdi), %rax
> +#  endif
> +#  ifdef USE_AS_STRCAT
> +	xor	%ch, %ch
> +	movb	%ch, 29(%rdi)
> +#  endif
> +	ret
> +
> +	.p2align 4
> +L(StrncpyExit30):
> +	movdqu	(%rsi), %xmm0
> +	movdqu	14(%rsi), %xmm2
> +	movdqu	%xmm0, (%rdi)
> +	movdqu	%xmm2, 14(%rdi)
> +#  ifdef USE_AS_STPCPY
> +	lea	30(%rdi), %rax
> +#  endif
> +#  ifdef USE_AS_STRCAT
> +	xor	%ch, %ch
> +	movb	%ch, 30(%rdi)
> +#  endif
> +	ret
> +
> +	.p2align 4
> +L(StrncpyExit31):
> +	movdqu	(%rsi), %xmm0
> +	movdqu	15(%rsi), %xmm2
> +	movdqu	%xmm0, (%rdi)
> +	movdqu	%xmm2, 15(%rdi)
> +#  ifdef USE_AS_STPCPY
> +	lea	31(%rdi), %rax
> +#  endif
> +#  ifdef USE_AS_STRCAT
> +	xor	%ch, %ch
> +	movb	%ch, 31(%rdi)
> +#  endif
> +	ret
> +
> +	.p2align 4
> +L(StrncpyExit32):
> +	movdqu	(%rsi), %xmm0
> +	movdqu	16(%rsi), %xmm2
> +	movdqu	%xmm0, (%rdi)
> +	movdqu	%xmm2, 16(%rdi)
> +#  ifdef USE_AS_STPCPY
> +	lea	32(%rdi), %rax
> +#  endif
> +#  ifdef USE_AS_STRCAT
> +	xor	%ch, %ch
> +	movb	%ch, 32(%rdi)
> +#  endif
> +	ret
> +
> +	.p2align 4
> +L(StrncpyExit33):
> +	movdqu	(%rsi), %xmm0
> +	movdqu	16(%rsi), %xmm2
> +	mov	32(%rsi), %cl
> +	movdqu	%xmm0, (%rdi)
> +	movdqu	%xmm2, 16(%rdi)
> +	mov	%cl, 32(%rdi)
> +#  ifdef USE_AS_STRCAT
> +	xor	%ch, %ch
> +	movb	%ch, 33(%rdi)
> +#  endif
> +	ret
> +
> +#  ifndef USE_AS_STRCAT
> +
> +	.p2align 4
> +L(Fill0):
> +	ret
> +
> +	.p2align 4
> +L(Fill1):
> +	mov	%dl, (%rdi)
> +	ret
> +
> +	.p2align 4
> +L(Fill2):
> +	mov	%dx, (%rdi)
> +	ret
> +
> +	.p2align 4
> +L(Fill3):
> +	mov	%edx, -1(%rdi)
> +	ret
> +
> +	.p2align 4
> +L(Fill4):
> +	mov	%edx, (%rdi)
> +	ret
> +
> +	.p2align 4
> +L(Fill5):
> +	mov	%edx, (%rdi)
> +	mov	%dl, 4(%rdi)
> +	ret
> +
> +	.p2align 4
> +L(Fill6):
> +	mov	%edx, (%rdi)
> +	mov	%dx, 4(%rdi)
> +	ret
> +
> +	.p2align 4
> +L(Fill7):
> +	mov	%rdx, -1(%rdi)
> +	ret
> +
> +	.p2align 4
> +L(Fill8):
> +	mov	%rdx, (%rdi)
> +	ret
> +
> +	.p2align 4
> +L(Fill9):
> +	mov	%rdx, (%rdi)
> +	mov	%dl, 8(%rdi)
> +	ret
> +
> +	.p2align 4
> +L(Fill10):
> +	mov	%rdx, (%rdi)
> +	mov	%dx, 8(%rdi)
> +	ret
> +
> +	.p2align 4
> +L(Fill11):
> +	mov	%rdx, (%rdi)
> +	mov	%edx, 7(%rdi)
> +	ret
> +
> +	.p2align 4
> +L(Fill12):
> +	mov	%rdx, (%rdi)
> +	mov	%edx, 8(%rdi)
> +	ret
> +
> +	.p2align 4
> +L(Fill13):
> +	mov	%rdx, (%rdi)
> +	mov	%rdx, 5(%rdi)
> +	ret
> +
> +	.p2align 4
> +L(Fill14):
> +	mov	%rdx, (%rdi)
> +	mov	%rdx, 6(%rdi)
> +	ret
> +
> +	.p2align 4
> +L(Fill15):
> +	movdqu	%xmm0, -1(%rdi)
> +	ret
> +
> +	.p2align 4
> +L(Fill16):
> +	movdqu	%xmm0, (%rdi)
> +	ret
> +
> +	.p2align 4
> +L(CopyFrom1To16BytesUnalignedXmm2):
> +	movdqu	%xmm2, (%rdi, %rcx)
> +
> +	.p2align 4
> +L(CopyFrom1To16BytesXmmExit):
> +	bsf	%rdx, %rdx
> +	add	$15, %r8
> +	add	%rcx, %rdi
> +#   ifdef USE_AS_STPCPY
> +	lea	(%rdi, %rdx), %rax
> +#   endif
> +	sub	%rdx, %r8
> +	lea	1(%rdi, %rdx), %rdi
> +
> +	.p2align 4
> +L(StrncpyFillTailWithZero):
> +	pxor	%xmm0, %xmm0
> +	xor	%rdx, %rdx
> +	sub	$16, %r8
> +	jbe	L(StrncpyFillExit)
> +
> +	movdqu	%xmm0, (%rdi)
> +	add	$16, %rdi
> +
> +	mov	%rdi, %rsi
> +	and	$0xf, %rsi
> +	sub	%rsi, %rdi
> +	add	%rsi, %r8
> +	sub	$64, %r8
> +	jb	L(StrncpyFillLess64)
> +
> +L(StrncpyFillLoopMovdqa):
> +	movdqa	%xmm0, (%rdi)
> +	movdqa	%xmm0, 16(%rdi)
> +	movdqa	%xmm0, 32(%rdi)
> +	movdqa	%xmm0, 48(%rdi)
> +	add	$64, %rdi
> +	sub	$64, %r8
> +	jae	L(StrncpyFillLoopMovdqa)
> +
> +L(StrncpyFillLess64):
> +	add	$32, %r8
> +	jl	L(StrncpyFillLess32)
> +	movdqa	%xmm0, (%rdi)
> +	movdqa	%xmm0, 16(%rdi)
> +	add	$32, %rdi
> +	sub	$16, %r8
> +	jl	L(StrncpyFillExit)
> +	movdqa	%xmm0, (%rdi)
> +	add	$16, %rdi
> +	BRANCH_TO_JMPTBL_ENTRY (L(FillTable), %r8, 4)
> +
> +L(StrncpyFillLess32):
> +	add	$16, %r8
> +	jl	L(StrncpyFillExit)
> +	movdqa	%xmm0, (%rdi)
> +	add	$16, %rdi
> +	BRANCH_TO_JMPTBL_ENTRY (L(FillTable), %r8, 4)
> +
> +L(StrncpyFillExit):
> +	add	$16, %r8
> +	BRANCH_TO_JMPTBL_ENTRY (L(FillTable), %r8, 4)
> +
> +/* end of ifndef USE_AS_STRCAT */
> +#  endif
> +
> +	.p2align 4
> +L(UnalignedLeaveCase2OrCase3):
> +	test	%rdx, %rdx
> +	jnz	L(Unaligned64LeaveCase2)
> +L(Unaligned64LeaveCase3):
> +	lea	64(%r8), %rcx
> +	and	$-16, %rcx
> +	add	$48, %r8
> +	jl	L(CopyFrom1To16BytesCase3)
> +	movdqu	%xmm4, (%rdi)
> +	sub	$16, %r8
> +	jb	L(CopyFrom1To16BytesCase3)
> +	movdqu	%xmm5, 16(%rdi)
> +	sub	$16, %r8
> +	jb	L(CopyFrom1To16BytesCase3)
> +	movdqu	%xmm6, 32(%rdi)
> +	sub	$16, %r8
> +	jb	L(CopyFrom1To16BytesCase3)
> +	movdqu	%xmm7, 48(%rdi)
> +#  ifdef USE_AS_STPCPY
> +	lea	64(%rdi), %rax
> +#  endif
> +#  ifdef USE_AS_STRCAT
> +	xor	%ch, %ch
> +	movb	%ch, 64(%rdi)
> +#  endif
> +	ret
> +
> +	.p2align 4
> +L(Unaligned64LeaveCase2):
> +	xor	%rcx, %rcx
> +	pcmpeqb	%xmm4, %xmm0
> +	pmovmskb %xmm0, %rdx
> +	add	$48, %r8
> +	jle	L(CopyFrom1To16BytesCase2OrCase3)
> +	test	%rdx, %rdx
> +#  ifndef USE_AS_STRCAT
> +	jnz	L(CopyFrom1To16BytesUnalignedXmm4)
> +#  else
> +	jnz	L(CopyFrom1To16Bytes)
> +#  endif
> +	pcmpeqb	%xmm5, %xmm0
> +	pmovmskb %xmm0, %rdx
> +	movdqu	%xmm4, (%rdi)
> +	add	$16, %rcx
> +	sub	$16, %r8
> +	jbe	L(CopyFrom1To16BytesCase2OrCase3)
> +	test	%rdx, %rdx
> +#  ifndef USE_AS_STRCAT
> +	jnz	L(CopyFrom1To16BytesUnalignedXmm5)
> +#  else
> +	jnz	L(CopyFrom1To16Bytes)
> +#  endif
> +
> +	pcmpeqb	%xmm6, %xmm0
> +	pmovmskb %xmm0, %rdx
> +	movdqu	%xmm5, 16(%rdi)
> +	add	$16, %rcx
> +	sub	$16, %r8
> +	jbe	L(CopyFrom1To16BytesCase2OrCase3)
> +	test	%rdx, %rdx
> +#  ifndef USE_AS_STRCAT
> +	jnz	L(CopyFrom1To16BytesUnalignedXmm6)
> +#  else
> +	jnz	L(CopyFrom1To16Bytes)
> +#  endif
> +
> +	pcmpeqb	%xmm7, %xmm0
> +	pmovmskb %xmm0, %rdx
> +	movdqu	%xmm6, 32(%rdi)
> +	lea	16(%rdi, %rcx), %rdi
> +	lea	16(%rsi, %rcx), %rsi
> +	bsf	%rdx, %rdx
> +	cmp	%r8, %rdx
> +	jb	L(CopyFrom1To16BytesExit)
> +	BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4)
> +
> +	.p2align 4
> +L(ExitZero):
> +#  ifndef USE_AS_STRCAT
> +	mov	%rdi, %rax
> +#  endif
> +	ret
> +
> +# endif
> +
> +# ifndef USE_AS_STRCAT
> +END (STRCPY)
> +# else
> +END (STRCAT)
> +# endif
> +	.p2align 4
> +	.section .rodata
> +L(ExitTable):
> +	.int	JMPTBL(L(Exit1), L(ExitTable))
> +	.int	JMPTBL(L(Exit2), L(ExitTable))
> +	.int	JMPTBL(L(Exit3), L(ExitTable))
> +	.int	JMPTBL(L(Exit4), L(ExitTable))
> +	.int	JMPTBL(L(Exit5), L(ExitTable))
> +	.int	JMPTBL(L(Exit6), L(ExitTable))
> +	.int	JMPTBL(L(Exit7), L(ExitTable))
> +	.int	JMPTBL(L(Exit8), L(ExitTable))
> +	.int	JMPTBL(L(Exit9), L(ExitTable))
> +	.int	JMPTBL(L(Exit10), L(ExitTable))
> +	.int	JMPTBL(L(Exit11), L(ExitTable))
> +	.int	JMPTBL(L(Exit12), L(ExitTable))
> +	.int	JMPTBL(L(Exit13), L(ExitTable))
> +	.int	JMPTBL(L(Exit14), L(ExitTable))
> +	.int	JMPTBL(L(Exit15), L(ExitTable))
> +	.int	JMPTBL(L(Exit16), L(ExitTable))
> +	.int	JMPTBL(L(Exit17), L(ExitTable))
> +	.int	JMPTBL(L(Exit18), L(ExitTable))
> +	.int	JMPTBL(L(Exit19), L(ExitTable))
> +	.int	JMPTBL(L(Exit20), L(ExitTable))
> +	.int	JMPTBL(L(Exit21), L(ExitTable))
> +	.int	JMPTBL(L(Exit22), L(ExitTable))
> +	.int    JMPTBL(L(Exit23), L(ExitTable))
> +	.int	JMPTBL(L(Exit24), L(ExitTable))
> +	.int	JMPTBL(L(Exit25), L(ExitTable))
> +	.int	JMPTBL(L(Exit26), L(ExitTable))
> +	.int	JMPTBL(L(Exit27), L(ExitTable))
> +	.int	JMPTBL(L(Exit28), L(ExitTable))
> +	.int	JMPTBL(L(Exit29), L(ExitTable))
> +	.int	JMPTBL(L(Exit30), L(ExitTable))
> +	.int	JMPTBL(L(Exit31), L(ExitTable))
> +	.int	JMPTBL(L(Exit32), L(ExitTable))
> +# ifdef USE_AS_STRNCPY
> +L(ExitStrncpyTable):
> +	.int	JMPTBL(L(StrncpyExit0), L(ExitStrncpyTable))
> +	.int	JMPTBL(L(StrncpyExit1), L(ExitStrncpyTable))
> +	.int	JMPTBL(L(StrncpyExit2), L(ExitStrncpyTable))
> +	.int	JMPTBL(L(StrncpyExit3), L(ExitStrncpyTable))
> +	.int	JMPTBL(L(StrncpyExit4), L(ExitStrncpyTable))
> +	.int	JMPTBL(L(StrncpyExit5), L(ExitStrncpyTable))
> +	.int	JMPTBL(L(StrncpyExit6), L(ExitStrncpyTable))
> +	.int	JMPTBL(L(StrncpyExit7), L(ExitStrncpyTable))
> +	.int	JMPTBL(L(StrncpyExit8), L(ExitStrncpyTable))
> +	.int	JMPTBL(L(StrncpyExit9), L(ExitStrncpyTable))
> +	.int	JMPTBL(L(StrncpyExit10), L(ExitStrncpyTable))
> +	.int	JMPTBL(L(StrncpyExit11), L(ExitStrncpyTable))
> +	.int	JMPTBL(L(StrncpyExit12), L(ExitStrncpyTable))
> +	.int	JMPTBL(L(StrncpyExit13), L(ExitStrncpyTable))
> +	.int	JMPTBL(L(StrncpyExit14), L(ExitStrncpyTable))
> +	.int	JMPTBL(L(StrncpyExit15), L(ExitStrncpyTable))
> +	.int	JMPTBL(L(StrncpyExit16), L(ExitStrncpyTable))
> +	.int	JMPTBL(L(StrncpyExit17), L(ExitStrncpyTable))
> +	.int	JMPTBL(L(StrncpyExit18), L(ExitStrncpyTable))
> +	.int	JMPTBL(L(StrncpyExit19), L(ExitStrncpyTable))
> +	.int	JMPTBL(L(StrncpyExit20), L(ExitStrncpyTable))
> +	.int	JMPTBL(L(StrncpyExit21), L(ExitStrncpyTable))
> +	.int	JMPTBL(L(StrncpyExit22), L(ExitStrncpyTable))
> +	.int    JMPTBL(L(StrncpyExit23), L(ExitStrncpyTable))
> +	.int	JMPTBL(L(StrncpyExit24), L(ExitStrncpyTable))
> +	.int	JMPTBL(L(StrncpyExit25), L(ExitStrncpyTable))
> +	.int	JMPTBL(L(StrncpyExit26), L(ExitStrncpyTable))
> +	.int	JMPTBL(L(StrncpyExit27), L(ExitStrncpyTable))
> +	.int	JMPTBL(L(StrncpyExit28), L(ExitStrncpyTable))
> +	.int	JMPTBL(L(StrncpyExit29), L(ExitStrncpyTable))
> +	.int	JMPTBL(L(StrncpyExit30), L(ExitStrncpyTable))
> +	.int	JMPTBL(L(StrncpyExit31), L(ExitStrncpyTable))
> +	.int	JMPTBL(L(StrncpyExit32), L(ExitStrncpyTable))
> +	.int	JMPTBL(L(StrncpyExit33), L(ExitStrncpyTable))
> +#  ifndef USE_AS_STRCAT
> +	.p2align 4
> +L(FillTable):
> +	.int	JMPTBL(L(Fill0), L(FillTable))
> +	.int	JMPTBL(L(Fill1), L(FillTable))
> +	.int	JMPTBL(L(Fill2), L(FillTable))
> +	.int	JMPTBL(L(Fill3), L(FillTable))
> +	.int	JMPTBL(L(Fill4), L(FillTable))
> +	.int	JMPTBL(L(Fill5), L(FillTable))
> +	.int	JMPTBL(L(Fill6), L(FillTable))
> +	.int	JMPTBL(L(Fill7), L(FillTable))
> +	.int	JMPTBL(L(Fill8), L(FillTable))
> +	.int	JMPTBL(L(Fill9), L(FillTable))
> +	.int	JMPTBL(L(Fill10), L(FillTable))
> +	.int	JMPTBL(L(Fill11), L(FillTable))
> +	.int	JMPTBL(L(Fill12), L(FillTable))
> +	.int	JMPTBL(L(Fill13), L(FillTable))
> +	.int	JMPTBL(L(Fill14), L(FillTable))
> +	.int	JMPTBL(L(Fill15), L(FillTable))
> +	.int	JMPTBL(L(Fill16), L(FillTable))
> +#  endif
> +# endif
> +#endif
> diff --git a/sysdeps/x86_64/multiarch/strncpy.S b/sysdeps/x86_64/multiarch/strncpy.S
> index 6d87a0b..afbd870 100644
> --- a/sysdeps/x86_64/multiarch/strncpy.S
> +++ b/sysdeps/x86_64/multiarch/strncpy.S
> @@ -1,5 +1,85 @@
> -/* Multiple versions of strncpy
> -   All versions must be listed in ifunc-impl-list.c.  */
> -#define STRCPY strncpy
> +/* Multiple versions of strcpy
> +   All versions must be listed in ifunc-impl-list.c.
> +   Copyright (C) 2009-2015 Free Software Foundation, Inc.
> +   Contributed by Intel Corporation.
> +   This file is part of the GNU C Library.
> +
> +   The GNU C Library is free software; you can redistribute it and/or
> +   modify it under the terms of the GNU Lesser General Public
> +   License as published by the Free Software Foundation; either
> +   version 2.1 of the License, or (at your option) any later version.
> +
> +   The GNU C Library is distributed in the hope that it will be useful,
> +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   Lesser General Public License for more details.
> +
> +   You should have received a copy of the GNU Lesser General Public
> +   License along with the GNU C Library; if not, see
> +   <http://www.gnu.org/licenses/>.  */
> +
> +#include <sysdep.h>
> +#include <init-arch.h>
> +
>  #define USE_AS_STRNCPY
> -#include "strcpy.S"
> +#ifndef STRNCPY
> +#define STRNCPY strncpy
> +#endif
> +
> +#ifdef USE_AS_STPCPY
> +#  define STRNCPY_SSSE3		__stpncpy_ssse3
> +#  define STRNCPY_SSE2		__stpncpy_sse2
> +#  define STRNCPY_SSE2_UNALIGNED __stpncpy_sse2_unaligned
> +#  define __GI_STRNCPY		__GI_stpncpy
> +#  define __GI___STRNCPY		__GI___stpncpy
> +#else
> +#  define STRNCPY_SSSE3		__strncpy_ssse3
> +#  define STRNCPY_SSE2		__strncpy_sse2
> +#  define STRNCPY_SSE2_UNALIGNED	__strncpy_sse2_unaligned
> +#  define __GI_STRNCPY		__GI_strncpy
> +#endif
> +
> +
> +/* Define multiple versions only for the definition in libc.  */
> +#if IS_IN (libc)
> +	.text
> +ENTRY(STRNCPY)
> +	.type	STRNCPY, @gnu_indirect_function
> +	cmpl	$0, __cpu_features+KIND_OFFSET(%rip)
> +	jne	1f
> +	call	__init_cpu_features
> +1:	leaq	STRNCPY_SSE2_UNALIGNED(%rip), %rax
> +	testl	$bit_Fast_Unaligned_Load, __cpu_features+FEATURE_OFFSET+index_Fast_Unaligned_Load(%rip)
> +	jnz	2f
> +	leaq	STRNCPY_SSE2(%rip), %rax
> +	testl	$bit_SSSE3, __cpu_features+CPUID_OFFSET+index_SSSE3(%rip)
> +	jz	2f
> +	leaq	STRNCPY_SSSE3(%rip), %rax
> +2:	ret
> +END(STRNCPY)
> +
> +# undef ENTRY
> +# define ENTRY(name) \
> +	.type STRNCPY_SSE2, @function; \
> +	.align 16; \
> +	.globl STRNCPY_SSE2; \
> +	.hidden STRNCPY_SSE2; \
> +	STRNCPY_SSE2: cfi_startproc; \
> +	CALL_MCOUNT
> +# undef END
> +# define END(name) \
> +	cfi_endproc; .size STRNCPY_SSE2, .-STRNCPY_SSE2
> +# undef libc_hidden_builtin_def
> +/* It doesn't make sense to send libc-internal strcpy calls through a PLT.
> +   The speedup we get from using SSSE3 instruction is likely eaten away
> +   by the indirect call in the PLT.  */
> +# define libc_hidden_builtin_def(name) \
> +	.globl __GI_STRNCPY; __GI_STRNCPY = STRNCPY_SSE2
> +# undef libc_hidden_def
> +# define libc_hidden_def(name) \
> +	.globl __GI___STRNCPY; __GI___STRNCPY = STRNCPY_SSE2
> +#endif
> +
> +#ifndef USE_AS_STRNCPY
> +#include "../strcpy.S"
> +#endif
> -- 
> 1.8.4.rc3
Ondřej Bílka July 3, 2015, 7:46 a.m. UTC | #2
On Wed, Jun 24, 2015 at 10:13:31AM +0200, Ondřej Bílka wrote:
> On Wed, Jun 17, 2015 at 08:01:05PM +0200, Ondřej Bílka wrote:
> > Hi,
> > 
> > I wrote new strcpy on x64 and for some reason I thought that I had
> > commited it and forgot to ping it.
> > 
> > As there are other routines that I could improve I will use branch
> > neleai/string-x64 to collect these.
> > 
> > Here is revised version of what I sumbitted in 2013. Main change is that
> > I now target i7 instead core2 That simplifies things as unaligned loads
> > are cheap instead bit slower than aligned ones on core2. That mainly
> > concerns header as for core2 you could get better performance by
> > aligning loads or stores to 16 bytes after first bytes were read. I do
> > not know whats better I would need to test it.
> > 
> > That also makes less important support of ssse3 variant. I could send it
> > but it was one of my list on TODO list that now probably lost
> > importance. Problem is that on x64 for aligning by ssse3 or sse2 with
> > shifts you need to make 16 loops for each alignment as you don't have
> > variable shift. Also it needs to use jump table thats very expensive
> > For strcpy thats dubious as it increases instruction cache pressure 
> > and most copies are small. You would need to do switching from unaligned 
> > loads to aligning. I needed to do profiling to select correct treshold.
> > 
> > If somebody is interested in optimizing old pentiums4, athlon64 I will
> > provide a ssse3 variant that is also 50% faster than current one.
> > That is also reason why I omitted drawing current ssse3 implementation
> > performance.
> > 
> > 
> > In this version header first checks 128 bytes unaligned unless they
> > cross page boundary. That allows more effective loop as then at end of
> > loop we could simply write last 64 bytes instead specialcasing to avoid
> > writing before start.
> > 
> > I tried several variants of header, as we first read 16 bytes to xmm0
> > register question is if they could be reused. I used evolver to select
> > best variant, there was almost no difference in performance between
> > these.
> > 
> > Now I do checks for bytes 0-15, then 16-31, then 32-63, then 64-128.
> > There is possibility to get some cycles with different grouping, I will
> > post later improvement if I could find something. 
> > 
> > 
> > First problem was reading ahead. A rereading 8 bytes looked bit faster
> > than move from xmm.
> > 
> > Then I tried when to reuse/reread. In 4-7 byte case it was faster reread
> > than using bit shifts to get second half. For 1-3 bytes I use following
> > copy with s[0] and s[1] from rdx register with byte shifts.
> > 
> >   Test branch vs this branchless that works for i 0,1,2
> >    d[i] = 0;
> >    d[i/2] = s[1];
> >    d[0] = s[0];
> > 
> > I also added a avx2 loop. Problem why I shouldn't use them in headers
> > was high latency. I could test if using them for bytes 64-128 would give
> > speedup.
> > 
> > As technical issues go I needed to move old strcpy_sse_unaligned
> > implementation into strncpy_sse2_unaligned as strncpy is function that
> > should be optimized for size, not performance. For now I this will keep
> > these unchanged.
> > 
> > As performance these are 15%-30% faster than current one for gcc workload on
> > haswell and ivy bridge. 
> > 
> > As avx2 version its currently 6% on this workload mainly as its bash and
> > has lot of large loads so avx2 loop helps.
> > 
> > I used my profiler to show improvement, see here
> > 
> > http://kam.mff.cuni.cz/~ondra/benchmark_string/strcpy_profile.html
> > 
> > and source is here
> > 
> > http://kam.mff.cuni.cz/~ondra/benchmark_string/strcpy_profile170615.tar.bz2
> > 
> > Comments?
> > 
> >         * sysdeps/x86_64/multiarch/ifunc-impl-list.c (__libc_ifunc_impl_list):
> > 	Add __strcpy_avx2 and __stpcpy_avx2
> >         * sysdeps/x86_64/multiarch/Makefile (routines): Add stpcpy_avx2.S and 
> > 	strcpy_avx2.S
> >         * sysdeps/x86_64/multiarch/stpcpy-avx2.S: New file
> >         * sysdeps/x86_64/multiarch/strcpy-avx2.S: Likewise.
> >         * sysdeps/x86_64/multiarch/stpcpy-sse2-unaligned.S: Refactored
> > 	implementation.
> >         * sysdeps/x86_64/multiarch/strcpy.S: Updated ifunc.
> >         * sysdeps/x86_64/multiarch/strncpy.S: Moved from strcpy.S.
> >         * sysdeps/x86_64/multiarch/strncpy-sse2-unaligned.S: Moved
> > 	strcpy-sse2-unaligned.S here.
> >         * sysdeps/x86_64/multiarch/strcpy-sse2-unaligned.S: Likewise.
> >         * sysdeps/x86_64/multiarch/stpncpy-sse2-unaligned.S: Redirect
> > 	from strcpy-sse2-unaligned.S to strncpy-sse2-unaligned.S 
> >         * sysdeps/x86_64/multiarch/stpncpy.S: Likewise.
> >         * sysdeps/x86_64/multiarch/strcat-sse2-unaligned.S: Likewise.
> > 
> > ---
> >  sysdeps/x86_64/multiarch/Makefile                 |    2 +-
> >  sysdeps/x86_64/multiarch/ifunc-impl-list.c        |    2 +
> >  sysdeps/x86_64/multiarch/stpcpy-avx2.S            |    3 +
> >  sysdeps/x86_64/multiarch/stpcpy-sse2-unaligned.S  |  439 ++++-
> >  sysdeps/x86_64/multiarch/stpncpy-sse2-unaligned.S |    3 +-
> >  sysdeps/x86_64/multiarch/stpncpy.S                |    5 +-
> >  sysdeps/x86_64/multiarch/strcat-sse2-unaligned.S  |    2 +-
> >  sysdeps/x86_64/multiarch/strcpy-avx2.S            |    4 +
> >  sysdeps/x86_64/multiarch/strcpy-sse2-unaligned.S  | 1890 +-------------------
> >  sysdeps/x86_64/multiarch/strcpy.S                 |   22 +-
> >  sysdeps/x86_64/multiarch/strncpy-sse2-unaligned.S | 1891 ++++++++++++++++++++-
> >  sysdeps/x86_64/multiarch/strncpy.S                |   88 +-
> >  14 files changed, 2435 insertions(+), 1921 deletions(-)
> >  create mode 100644 sysdeps/x86_64/multiarch/stpcpy-avx2.S
> >  create mode 100644 sysdeps/x86_64/multiarch/strcpy-avx2.S
> > 
> > 
> > diff --git a/sysdeps/x86_64/multiarch/Makefile b/sysdeps/x86_64/multiarch/Makefile
> > index d7002a9..c573744 100644
> > --- a/sysdeps/x86_64/multiarch/Makefile
> > +++ b/sysdeps/x86_64/multiarch/Makefile
> > @@ -29,7 +29,7 @@ CFLAGS-strspn-c.c += -msse4
> >  endif
> >  
> >  ifeq (yes,$(config-cflags-avx2))
> > -sysdep_routines += memset-avx2
> > +sysdep_routines += memset-avx2 strcpy-avx2 stpcpy-avx2
> >  endif
> >  endif
> >  
> > diff --git a/sysdeps/x86_64/multiarch/ifunc-impl-list.c b/sysdeps/x86_64/multiarch/ifunc-impl-list.c
> > index b64e4f1..d398e43 100644
> > --- a/sysdeps/x86_64/multiarch/ifunc-impl-list.c
> > +++ b/sysdeps/x86_64/multiarch/ifunc-impl-list.c
> > @@ -88,6 +88,7 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
> >  
> >    /* Support sysdeps/x86_64/multiarch/stpcpy.S.  */
> >    IFUNC_IMPL (i, name, stpcpy,
> > +	      IFUNC_IMPL_ADD (array, i, strcpy, HAS_AVX2, __stpcpy_avx2)
> >  	      IFUNC_IMPL_ADD (array, i, stpcpy, HAS_SSSE3, __stpcpy_ssse3)
> >  	      IFUNC_IMPL_ADD (array, i, stpcpy, 1, __stpcpy_sse2_unaligned)
> >  	      IFUNC_IMPL_ADD (array, i, stpcpy, 1, __stpcpy_sse2))
> > @@ -137,6 +138,7 @@ __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
> >  
> >    /* Support sysdeps/x86_64/multiarch/strcpy.S.  */
> >    IFUNC_IMPL (i, name, strcpy,
> > +	      IFUNC_IMPL_ADD (array, i, strcpy, HAS_AVX2, __strcpy_avx2)
> >  	      IFUNC_IMPL_ADD (array, i, strcpy, HAS_SSSE3, __strcpy_ssse3)
> >  	      IFUNC_IMPL_ADD (array, i, strcpy, 1, __strcpy_sse2_unaligned)
> >  	      IFUNC_IMPL_ADD (array, i, strcpy, 1, __strcpy_sse2))
> > diff --git a/sysdeps/x86_64/multiarch/stpcpy-avx2.S b/sysdeps/x86_64/multiarch/stpcpy-avx2.S
> > new file mode 100644
> > index 0000000..bd30ef6
> > --- /dev/null
> > +++ b/sysdeps/x86_64/multiarch/stpcpy-avx2.S
> > @@ -0,0 +1,3 @@
> > +#define USE_AVX2
> > +#define STPCPY __stpcpy_avx2
> > +#include "stpcpy-sse2-unaligned.S"
> > diff --git a/sysdeps/x86_64/multiarch/stpcpy-sse2-unaligned.S b/sysdeps/x86_64/multiarch/stpcpy-sse2-unaligned.S
> > index 34231f8..695a236 100644
> > --- a/sysdeps/x86_64/multiarch/stpcpy-sse2-unaligned.S
> > +++ b/sysdeps/x86_64/multiarch/stpcpy-sse2-unaligned.S
> > @@ -1,3 +1,436 @@
> > -#define USE_AS_STPCPY
> > -#define STRCPY __stpcpy_sse2_unaligned
> > -#include "strcpy-sse2-unaligned.S"
> > +/* stpcpy with SSE2 and unaligned load
> > +   Copyright (C) 2015 Free Software Foundation, Inc.
> > +   This file is part of the GNU C Library.
> > +
> > +   The GNU C Library is free software; you can redistribute it and/or
> > +   modify it under the terms of the GNU Lesser General Public
> > +   License as published by the Free Software Foundation; either
> > +   version 2.1 of the License, or (at your option) any later version.
> > +
> > +   The GNU C Library is distributed in the hope that it will be useful,
> > +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> > +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> > +   Lesser General Public License for more details.
> > +
> > +   You should have received a copy of the GNU Lesser General Public
> > +   License along with the GNU C Library; if not, see
> > +   <http://www.gnu.org/licenses/>.  */
> > +
> > +#include <sysdep.h>
> > +#ifndef STPCPY
> > +# define STPCPY __stpcpy_sse2_unaligned
> > +#endif
> > +
> > +ENTRY(STPCPY)
> > +	mov	%esi, %edx
> > +#ifdef AS_STRCPY
> > +	movq    %rdi, %rax
> > +#endif
> > +	pxor	%xmm4, %xmm4
> > +	pxor	%xmm5, %xmm5
> > +	andl	$4095, %edx
> > +	cmp	$3968, %edx
> > +	ja	L(cross_page)
> > +
> > +	movdqu	(%rsi), %xmm0
> > +	pcmpeqb	%xmm0, %xmm4
> > +	pmovmskb %xmm4, %edx
> > +	testl	%edx, %edx
> > +	je	L(more16bytes)
> > +	bsf	%edx, %ecx
> > +#ifndef AS_STRCPY
> > +	lea	(%rdi, %rcx), %rax
> > +#endif
> > +	cmp	$7, %ecx
> > +	movq	(%rsi), %rdx
> > +	jb	L(less_8_bytesb)
> > +L(8bytes_from_cross):
> > +	movq	-7(%rsi, %rcx), %rsi
> > +	movq	%rdx, (%rdi)
> > +#ifdef AS_STRCPY
> > +	movq    %rsi, -7(%rdi, %rcx)
> > +#else
> > +	movq	%rsi, -7(%rax)
> > +#endif
> > +	ret
> > +
> > +	.p2align 4
> > +L(less_8_bytesb):
> > +	cmp	$2, %ecx
> > +	jbe	L(less_4_bytes)
> > +L(4bytes_from_cross):
> > +	mov	-3(%rsi, %rcx), %esi
> > +	mov	%edx, (%rdi)
> > +#ifdef AS_STRCPY
> > +        mov     %esi, -3(%rdi, %rcx)
> > +#else
> > +	mov	%esi, -3(%rax)
> > +#endif
> > +	ret
> > +
> > +.p2align 4
> > + L(less_4_bytes):
> > + /*
> > +  Test branch vs this branchless that works for i 0,1,2
> > +   d[i] = 0;
> > +   d[i/2] = s[1];
> > +   d[0] = s[0];
> > +  */
> > +#ifdef AS_STRCPY
> > +	movb	$0, (%rdi, %rcx)
> > +#endif
> > +
> > +	shr	$1, %ecx
> > +	mov	%edx, %esi
> > +	shr	$8, %edx
> > +	movb	%dl, (%rdi, %rcx)
> > +#ifndef AS_STRCPY
> > +	movb	$0, (%rax)
> > +#endif
> > +	movb	%sil, (%rdi)
> > +	ret
> > +
> > +
> > +
> > +
> > +
> > +	.p2align 4
> > +L(more16bytes):
> > +	pxor	%xmm6, %xmm6
> > +	movdqu	16(%rsi), %xmm1
> > +	pxor	%xmm7, %xmm7
> > +	pcmpeqb	%xmm1, %xmm5
> > +	pmovmskb %xmm5, %edx
> > +	testl	%edx, %edx
> > +	je	L(more32bytes)
> > +	bsf	%edx, %edx
> > +#ifdef AS_STRCPY
> > +        movdqu  1(%rsi, %rdx), %xmm1
> > +        movdqu  %xmm0, (%rdi)
> > +	movdqu  %xmm1, 1(%rdi, %rdx)
> > +#else
> > +	lea	16(%rdi, %rdx), %rax
> > +	movdqu	1(%rsi, %rdx), %xmm1
> > +	movdqu	%xmm0, (%rdi)
> > +	movdqu	%xmm1, -15(%rax)
> > +#endif
> > +	ret
> > +
> > +	.p2align 4
> > +L(more32bytes):
> > +	movdqu	32(%rsi), %xmm2
> > +	movdqu	48(%rsi), %xmm3
> > +
> > +	pcmpeqb	%xmm2, %xmm6
> > +	pcmpeqb	%xmm3, %xmm7
> > +	pmovmskb %xmm7, %edx
> > +	shl	$16, %edx
> > +	pmovmskb %xmm6, %ecx
> > +	or	%ecx, %edx
> > +	je	L(more64bytes)
> > +	bsf	%edx, %edx
> > +#ifndef AS_STRCPY
> > +	lea	32(%rdi, %rdx), %rax
> > +#endif
> > +	movdqu	1(%rsi, %rdx), %xmm2
> > +	movdqu	17(%rsi, %rdx), %xmm3
> > +	movdqu	%xmm0, (%rdi)
> > +	movdqu	%xmm1, 16(%rdi)
> > +#ifdef AS_STRCPY
> > +        movdqu  %xmm2, 1(%rdi, %rdx)
> > +        movdqu  %xmm3, 17(%rdi, %rdx)
> > +#else
> > +	movdqu	%xmm2, -31(%rax)
> > +	movdqu	%xmm3, -15(%rax)
> > +#endif
> > +	ret
> > +
> > +	.p2align 4
> > +L(more64bytes):
> > +	movdqu	%xmm0, (%rdi)
> > +	movdqu	%xmm1, 16(%rdi)
> > +	movdqu	%xmm2, 32(%rdi)
> > +	movdqu	%xmm3, 48(%rdi)
> > +	movdqu	64(%rsi), %xmm0
> > +	movdqu	80(%rsi), %xmm1
> > +	movdqu	96(%rsi), %xmm2
> > +	movdqu	112(%rsi), %xmm3
> > +
> > +	pcmpeqb	%xmm0, %xmm4
> > +	pcmpeqb	%xmm1, %xmm5
> > +	pcmpeqb	%xmm2, %xmm6
> > +	pcmpeqb	%xmm3, %xmm7
> > +	pmovmskb %xmm4, %ecx
> > +	pmovmskb %xmm5, %edx
> > +	pmovmskb %xmm6, %r8d
> > +	pmovmskb %xmm7, %r9d
> > +	shl	$16, %edx
> > +	or	%ecx, %edx
> > +	shl	$32, %r8
> > +	shl	$48, %r9
> > +	or	%r8, %rdx
> > +	or	%r9, %rdx
> > +	test	%rdx, %rdx
> > +	je	L(prepare_loop)
> > +	bsf	%rdx, %rdx
> > +#ifndef AS_STRCPY
> > +	lea	64(%rdi, %rdx), %rax
> > +#endif
> > +	movdqu	1(%rsi, %rdx), %xmm0
> > +	movdqu	17(%rsi, %rdx), %xmm1
> > +	movdqu	33(%rsi, %rdx), %xmm2
> > +	movdqu	49(%rsi, %rdx), %xmm3
> > +#ifdef AS_STRCPY
> > +        movdqu  %xmm0, 1(%rdi, %rdx)
> > +        movdqu  %xmm1, 17(%rdi, %rdx)
> > +        movdqu  %xmm2, 33(%rdi, %rdx)
> > +        movdqu  %xmm3, 49(%rdi, %rdx)
> > +#else
> > +	movdqu	%xmm0, -63(%rax)
> > +	movdqu	%xmm1, -47(%rax)
> > +	movdqu	%xmm2, -31(%rax)
> > +	movdqu	%xmm3, -15(%rax)
> > +#endif
> > +	ret
> > +
> > +
> > +	.p2align 4
> > +L(prepare_loop):
> > +	movdqu	%xmm0, 64(%rdi)
> > +	movdqu	%xmm1, 80(%rdi)
> > +	movdqu	%xmm2, 96(%rdi)
> > +	movdqu	%xmm3, 112(%rdi)
> > +
> > +	subq	%rsi, %rdi
> > +	add	$64, %rsi
> > +	andq	$-64, %rsi
> > +	addq	%rsi, %rdi
> > +	jmp	L(loop_entry)
> > +
> > +#ifdef USE_AVX2
> > +	.p2align 4
> > +L(loop):
> > +	vmovdqu	%ymm1, (%rdi)
> > +	vmovdqu	%ymm3, 32(%rdi)
> > +L(loop_entry):
> > +	vmovdqa	96(%rsi), %ymm3
> > +	vmovdqa	64(%rsi), %ymm1
> > +	vpminub	%ymm3, %ymm1, %ymm2
> > +	addq	$64, %rsi
> > +	addq	$64, %rdi
> > +	vpcmpeqb %ymm5, %ymm2, %ymm0
> > +	vpmovmskb %ymm0, %edx
> > +	test	%edx, %edx
> > +	je	L(loop)
> > +	salq	$32, %rdx
> > +	vpcmpeqb %ymm5, %ymm1, %ymm4
> > +	vpmovmskb %ymm4, %ecx
> > +	or	%rcx, %rdx
> > +	bsfq	%rdx, %rdx
> > +#ifndef AS_STRCPY
> > +	lea	(%rdi, %rdx), %rax
> > +#endif
> > +	vmovdqu	-63(%rsi, %rdx), %ymm0
> > +	vmovdqu	-31(%rsi, %rdx), %ymm2
> > +#ifdef AS_STRCPY
> > +        vmovdqu  %ymm0, -63(%rdi, %rdx)
> > +        vmovdqu  %ymm2, -31(%rdi, %rdx)
> > +#else
> > +	vmovdqu	%ymm0, -63(%rax)
> > +	vmovdqu	%ymm2, -31(%rax)
> > +#endif
> > +	vzeroupper
> > +	ret
> > +#else
> > +	.p2align 4
> > +L(loop):
> > +	movdqu	%xmm1, (%rdi)
> > +	movdqu	%xmm2, 16(%rdi)
> > +	movdqu	%xmm3, 32(%rdi)
> > +	movdqu	%xmm4, 48(%rdi)
> > +L(loop_entry):
> > +	movdqa	96(%rsi), %xmm3
> > +	movdqa	112(%rsi), %xmm4
> > +	movdqa	%xmm3, %xmm0
> > +	movdqa	80(%rsi), %xmm2
> > +	pminub	%xmm4, %xmm0
> > +	movdqa	64(%rsi), %xmm1
> > +	pminub	%xmm2, %xmm0
> > +	pminub	%xmm1, %xmm0
> > +	addq	$64, %rsi
> > +	addq	$64, %rdi
> > +	pcmpeqb	%xmm5, %xmm0
> > +	pmovmskb %xmm0, %edx
> > +	test	%edx, %edx
> > +	je	L(loop)
> > +	salq	$48, %rdx
> > +	pcmpeqb	%xmm1, %xmm5
> > +	pcmpeqb	%xmm2, %xmm6
> > +	pmovmskb %xmm5, %ecx
> > +#ifdef AS_STRCPY
> > +	pmovmskb %xmm6, %r8d
> > +	pcmpeqb	%xmm3, %xmm7
> > +	pmovmskb %xmm7, %r9d
> > +	sal	$16, %r8d
> > +	or	%r8d, %ecx
> > +#else
> > +	pmovmskb %xmm6, %eax
> > +	pcmpeqb	%xmm3, %xmm7
> > +	pmovmskb %xmm7, %r9d
> > +	sal	$16, %eax
> > +	or	%eax, %ecx
> > +#endif
> > +	salq	$32, %r9
> > +	orq	%rcx, %rdx
> > +	orq	%r9, %rdx
> > +	bsfq	%rdx, %rdx
> > +#ifndef AS_STRCPY
> > +	lea	(%rdi, %rdx), %rax
> > +#endif
> > +	movdqu	-63(%rsi, %rdx), %xmm0
> > +	movdqu	-47(%rsi, %rdx), %xmm1
> > +	movdqu	-31(%rsi, %rdx), %xmm2
> > +	movdqu	-15(%rsi, %rdx), %xmm3
> > +#ifdef AS_STRCPY
> > +        movdqu  %xmm0, -63(%rdi, %rdx)
> > +        movdqu  %xmm1, -47(%rdi, %rdx)
> > +        movdqu  %xmm2, -31(%rdi, %rdx)
> > +        movdqu  %xmm3, -15(%rdi, %rdx)
> > +#else
> > +	movdqu	%xmm0, -63(%rax)
> > +	movdqu	%xmm1, -47(%rax)
> > +	movdqu	%xmm2, -31(%rax)
> > +	movdqu	%xmm3, -15(%rax)
> > +#endif
> > +	ret
> > +#endif
> > +
> > +	.p2align 4
> > +L(cross_page):
> > +	movq	%rsi, %rcx
> > +	pxor	%xmm0, %xmm0
> > +	and	$15, %ecx
> > +	movq	%rsi, %r9
> > +	movq	%rdi, %r10
> > +	subq	%rcx, %rsi
> > +	subq	%rcx, %rdi
> > +	movdqa	(%rsi), %xmm1
> > +	pcmpeqb	%xmm0, %xmm1
> > +	pmovmskb %xmm1, %edx
> > +	shr	%cl, %edx
> > +	shl	%cl, %edx
> > +	test	%edx, %edx
> > +	jne	L(less_32_cross)
> > +
> > +	addq	$16, %rsi
> > +	addq	$16, %rdi
> > +	movdqa	(%rsi), %xmm1
> > +	pcmpeqb	%xmm1, %xmm0
> > +	pmovmskb %xmm0, %edx
> > +	test	%edx, %edx
> > +	jne	L(less_32_cross)
> > +	movdqu	%xmm1, (%rdi)
> > +
> > +	movdqu	(%r9), %xmm0
> > +	movdqu	%xmm0, (%r10)
> > +
> > +	mov	$8, %rcx
> > +L(cross_loop):
> > +	addq	$16, %rsi
> > +	addq	$16, %rdi
> > +	pxor	%xmm0, %xmm0
> > +	movdqa	(%rsi), %xmm1
> > +	pcmpeqb	%xmm1, %xmm0
> > +	pmovmskb %xmm0, %edx
> > +	test	%edx, %edx
> > +	jne	L(return_cross)
> > +	movdqu	%xmm1, (%rdi)
> > +	sub	$1, %rcx
> > +	ja	L(cross_loop)
> > +
> > +	pxor	%xmm5, %xmm5
> > +	pxor	%xmm6, %xmm6
> > +	pxor	%xmm7, %xmm7
> > +
> > +	lea	-64(%rsi), %rdx
> > +	andq	$-64, %rdx
> > +	addq	%rdx, %rdi
> > +	subq	%rsi, %rdi
> > +	movq	%rdx, %rsi
> > +	jmp	L(loop_entry)
> > +
> > +	.p2align 4
> > +L(return_cross):
> > +	bsf	%edx, %edx
> > +#ifdef AS_STRCPY
> > +        movdqu  -15(%rsi, %rdx), %xmm0
> > +        movdqu  %xmm0, -15(%rdi, %rdx)
> > +#else
> > +	lea	(%rdi, %rdx), %rax
> > +	movdqu	-15(%rsi, %rdx), %xmm0
> > +	movdqu	%xmm0, -15(%rax)
> > +#endif
> > +	ret
> > +
> > +	.p2align 4
> > +L(less_32_cross):
> > +	bsf	%rdx, %rdx
> > +	lea	(%rdi, %rdx), %rcx
> > +#ifndef AS_STRCPY
> > +	mov	%rcx, %rax
> > +#endif
> > +	mov	%r9, %rsi
> > +	mov	%r10, %rdi
> > +	sub	%rdi, %rcx
> > +	cmp	$15, %ecx
> > +	jb	L(less_16_cross)
> > +	movdqu	(%rsi), %xmm0
> > +	movdqu	-15(%rsi, %rcx), %xmm1
> > +	movdqu	%xmm0, (%rdi)
> > +#ifdef AS_STRCPY
> > +	movdqu  %xmm1, -15(%rdi, %rcx)
> > +#else
> > +	movdqu	%xmm1, -15(%rax)
> > +#endif
> > +	ret
> > +
> > +L(less_16_cross):
> > +	cmp	$7, %ecx
> > +	jb	L(less_8_bytes_cross)
> > +	movq	(%rsi), %rdx
> > +	jmp	L(8bytes_from_cross)
> > +
> > +L(less_8_bytes_cross):
> > +	cmp	$2, %ecx
> > +	jbe	L(3_bytes_cross)
> > +	mov	(%rsi), %edx
> > +	jmp	L(4bytes_from_cross)
> > +
> > +L(3_bytes_cross):
> > +	jb	L(1_2bytes_cross)
> > +	movzwl	(%rsi), %edx
> > +	jmp	L(_3_bytesb)
> > +
> > +L(1_2bytes_cross):
> > +	movb	(%rsi), %dl
> > +	jmp	L(0_2bytes_from_cross)
> > +
> > +	.p2align 4
> > +L(less_4_bytesb):
> > +	je	L(_3_bytesb)
> > +L(0_2bytes_from_cross):
> > +	movb	%dl, (%rdi)
> > +#ifdef AS_STRCPY
> > +	movb    $0, (%rdi, %rcx)
> > +#else
> > +	movb	$0, (%rax)
> > +#endif
> > +	ret
> > +
> > +	.p2align 4
> > +L(_3_bytesb):
> > +	movw	%dx, (%rdi)
> > +	movb	$0, 2(%rdi)
> > +	ret
> > +
> > +END(STPCPY)
> > diff --git a/sysdeps/x86_64/multiarch/stpncpy-sse2-unaligned.S b/sysdeps/x86_64/multiarch/stpncpy-sse2-unaligned.S
> > index 658520f..3f35068 100644
> > --- a/sysdeps/x86_64/multiarch/stpncpy-sse2-unaligned.S
> > +++ b/sysdeps/x86_64/multiarch/stpncpy-sse2-unaligned.S
> > @@ -1,4 +1,3 @@
> >  #define USE_AS_STPCPY
> > -#define USE_AS_STRNCPY
> >  #define STRCPY __stpncpy_sse2_unaligned
> > -#include "strcpy-sse2-unaligned.S"
> > +#include "strncpy-sse2-unaligned.S"
> > diff --git a/sysdeps/x86_64/multiarch/stpncpy.S b/sysdeps/x86_64/multiarch/stpncpy.S
> > index 2698ca6..159604a 100644
> > --- a/sysdeps/x86_64/multiarch/stpncpy.S
> > +++ b/sysdeps/x86_64/multiarch/stpncpy.S
> > @@ -1,8 +1,7 @@
> >  /* Multiple versions of stpncpy
> >     All versions must be listed in ifunc-impl-list.c.  */
> > -#define STRCPY __stpncpy
> > +#define STRNCPY __stpncpy
> >  #define USE_AS_STPCPY
> > -#define USE_AS_STRNCPY
> > -#include "strcpy.S"
> > +#include "strncpy.S"
> >  
> >  weak_alias (__stpncpy, stpncpy)
> > diff --git a/sysdeps/x86_64/multiarch/strcat-sse2-unaligned.S b/sysdeps/x86_64/multiarch/strcat-sse2-unaligned.S
> > index 81f1b40..1faa49d 100644
> > --- a/sysdeps/x86_64/multiarch/strcat-sse2-unaligned.S
> > +++ b/sysdeps/x86_64/multiarch/strcat-sse2-unaligned.S
> > @@ -275,5 +275,5 @@ L(StartStrcpyPart):
> >  #  define USE_AS_STRNCPY
> >  # endif
> >  
> > -# include "strcpy-sse2-unaligned.S"
> > +# include "strncpy-sse2-unaligned.S"
> >  #endif
> > diff --git a/sysdeps/x86_64/multiarch/strcpy-avx2.S b/sysdeps/x86_64/multiarch/strcpy-avx2.S
> > new file mode 100644
> > index 0000000..a3133a4
> > --- /dev/null
> > +++ b/sysdeps/x86_64/multiarch/strcpy-avx2.S
> > @@ -0,0 +1,4 @@
> > +#define USE_AVX2
> > +#define AS_STRCPY
> > +#define STPCPY __strcpy_avx2
> > +#include "stpcpy-sse2-unaligned.S"
> > diff --git a/sysdeps/x86_64/multiarch/strcpy-sse2-unaligned.S b/sysdeps/x86_64/multiarch/strcpy-sse2-unaligned.S
> > index 8f03d1d..310e4fa 100644
> > --- a/sysdeps/x86_64/multiarch/strcpy-sse2-unaligned.S
> > +++ b/sysdeps/x86_64/multiarch/strcpy-sse2-unaligned.S
> > @@ -1,1887 +1,3 @@
> > -/* strcpy with SSE2 and unaligned load
> > -   Copyright (C) 2011-2015 Free Software Foundation, Inc.
> > -   Contributed by Intel Corporation.
> > -   This file is part of the GNU C Library.
> > -
> > -   The GNU C Library is free software; you can redistribute it and/or
> > -   modify it under the terms of the GNU Lesser General Public
> > -   License as published by the Free Software Foundation; either
> > -   version 2.1 of the License, or (at your option) any later version.
> > -
> > -   The GNU C Library is distributed in the hope that it will be useful,
> > -   but WITHOUT ANY WARRANTY; without even the implied warranty of
> > -   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> > -   Lesser General Public License for more details.
> > -
> > -   You should have received a copy of the GNU Lesser General Public
> > -   License along with the GNU C Library; if not, see
> > -   <http://www.gnu.org/licenses/>.  */
> > -
> > -#if IS_IN (libc)
> > -
> > -# ifndef USE_AS_STRCAT
> > -#  include <sysdep.h>
> > -
> > -#  ifndef STRCPY
> > -#   define STRCPY  __strcpy_sse2_unaligned
> > -#  endif
> > -
> > -# endif
> > -
> > -# define JMPTBL(I, B)	I - B
> > -# define BRANCH_TO_JMPTBL_ENTRY(TABLE, INDEX, SCALE)             \
> > -	lea	TABLE(%rip), %r11;                              \
> > -	movslq	(%r11, INDEX, SCALE), %rcx;                     \
> > -	lea	(%r11, %rcx), %rcx;                             \
> > -	jmp	*%rcx
> > -
> > -# ifndef USE_AS_STRCAT
> > -
> > -.text
> > -ENTRY (STRCPY)
> > -#  ifdef USE_AS_STRNCPY
> > -	mov	%rdx, %r8
> > -	test	%r8, %r8
> > -	jz	L(ExitZero)
> > -#  endif
> > -	mov	%rsi, %rcx
> > -#  ifndef USE_AS_STPCPY
> > -	mov	%rdi, %rax      /* save result */
> > -#  endif
> > -
> > -# endif
> > -
> > -	and	$63, %rcx
> > -	cmp	$32, %rcx
> > -	jbe	L(SourceStringAlignmentLess32)
> > -
> > -	and	$-16, %rsi
> > -	and	$15, %rcx
> > -	pxor	%xmm0, %xmm0
> > -	pxor	%xmm1, %xmm1
> > -
> > -	pcmpeqb	(%rsi), %xmm1
> > -	pmovmskb %xmm1, %rdx
> > -	shr	%cl, %rdx
> > -
> > -# ifdef USE_AS_STRNCPY
> > -#  if defined USE_AS_STPCPY || defined USE_AS_STRCAT
> > -	mov	$16, %r10
> > -	sub	%rcx, %r10
> > -	cmp	%r10, %r8
> > -#  else
> > -	mov	$17, %r10
> > -	sub	%rcx, %r10
> > -	cmp	%r10, %r8
> > -#  endif
> > -	jbe	L(CopyFrom1To16BytesTailCase2OrCase3)
> > -# endif
> > -	test	%rdx, %rdx
> > -	jnz	L(CopyFrom1To16BytesTail)
> > -
> > -	pcmpeqb	16(%rsi), %xmm0
> > -	pmovmskb %xmm0, %rdx
> > -
> > -# ifdef USE_AS_STRNCPY
> > -	add	$16, %r10
> > -	cmp	%r10, %r8
> > -	jbe	L(CopyFrom1To32BytesCase2OrCase3)
> > -# endif
> > -	test	%rdx, %rdx
> > -	jnz	L(CopyFrom1To32Bytes)
> > -
> > -	movdqu	(%rsi, %rcx), %xmm1   /* copy 16 bytes */
> > -	movdqu	%xmm1, (%rdi)
> > -
> > -/* If source address alignment != destination address alignment */
> > -	.p2align 4
> > -L(Unalign16Both):
> > -	sub	%rcx, %rdi
> > -# ifdef USE_AS_STRNCPY
> > -	add	%rcx, %r8
> > -# endif
> > -	mov	$16, %rcx
> > -	movdqa	(%rsi, %rcx), %xmm1
> > -	movaps	16(%rsi, %rcx), %xmm2
> > -	movdqu	%xmm1, (%rdi, %rcx)
> > -	pcmpeqb	%xmm2, %xmm0
> > -	pmovmskb %xmm0, %rdx
> > -	add	$16, %rcx
> > -# ifdef USE_AS_STRNCPY
> > -	sub	$48, %r8
> > -	jbe	L(CopyFrom1To16BytesCase2OrCase3)
> > -# endif
> > -	test	%rdx, %rdx
> > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > -	jnz	L(CopyFrom1To16BytesUnalignedXmm2)
> > -# else
> > -	jnz	L(CopyFrom1To16Bytes)
> > -# endif
> > -
> > -	movaps	16(%rsi, %rcx), %xmm3
> > -	movdqu	%xmm2, (%rdi, %rcx)
> > -	pcmpeqb	%xmm3, %xmm0
> > -	pmovmskb %xmm0, %rdx
> > -	add	$16, %rcx
> > -# ifdef USE_AS_STRNCPY
> > -	sub	$16, %r8
> > -	jbe	L(CopyFrom1To16BytesCase2OrCase3)
> > -# endif
> > -	test	%rdx, %rdx
> > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > -	jnz	L(CopyFrom1To16BytesUnalignedXmm3)
> > -# else
> > -	jnz	L(CopyFrom1To16Bytes)
> > -# endif
> > -
> > -	movaps	16(%rsi, %rcx), %xmm4
> > -	movdqu	%xmm3, (%rdi, %rcx)
> > -	pcmpeqb	%xmm4, %xmm0
> > -	pmovmskb %xmm0, %rdx
> > -	add	$16, %rcx
> > -# ifdef USE_AS_STRNCPY
> > -	sub	$16, %r8
> > -	jbe	L(CopyFrom1To16BytesCase2OrCase3)
> > -# endif
> > -	test	%rdx, %rdx
> > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > -	jnz	L(CopyFrom1To16BytesUnalignedXmm4)
> > -# else
> > -	jnz	L(CopyFrom1To16Bytes)
> > -# endif
> > -
> > -	movaps	16(%rsi, %rcx), %xmm1
> > -	movdqu	%xmm4, (%rdi, %rcx)
> > -	pcmpeqb	%xmm1, %xmm0
> > -	pmovmskb %xmm0, %rdx
> > -	add	$16, %rcx
> > -# ifdef USE_AS_STRNCPY
> > -	sub	$16, %r8
> > -	jbe	L(CopyFrom1To16BytesCase2OrCase3)
> > -# endif
> > -	test	%rdx, %rdx
> > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > -	jnz	L(CopyFrom1To16BytesUnalignedXmm1)
> > -# else
> > -	jnz	L(CopyFrom1To16Bytes)
> > -# endif
> > -
> > -	movaps	16(%rsi, %rcx), %xmm2
> > -	movdqu	%xmm1, (%rdi, %rcx)
> > -	pcmpeqb	%xmm2, %xmm0
> > -	pmovmskb %xmm0, %rdx
> > -	add	$16, %rcx
> > -# ifdef USE_AS_STRNCPY
> > -	sub	$16, %r8
> > -	jbe	L(CopyFrom1To16BytesCase2OrCase3)
> > -# endif
> > -	test	%rdx, %rdx
> > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > -	jnz	L(CopyFrom1To16BytesUnalignedXmm2)
> > -# else
> > -	jnz	L(CopyFrom1To16Bytes)
> > -# endif
> > -
> > -	movaps	16(%rsi, %rcx), %xmm3
> > -	movdqu	%xmm2, (%rdi, %rcx)
> > -	pcmpeqb	%xmm3, %xmm0
> > -	pmovmskb %xmm0, %rdx
> > -	add	$16, %rcx
> > -# ifdef USE_AS_STRNCPY
> > -	sub	$16, %r8
> > -	jbe	L(CopyFrom1To16BytesCase2OrCase3)
> > -# endif
> > -	test	%rdx, %rdx
> > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > -	jnz	L(CopyFrom1To16BytesUnalignedXmm3)
> > -# else
> > -	jnz	L(CopyFrom1To16Bytes)
> > -# endif
> > -
> > -	movdqu	%xmm3, (%rdi, %rcx)
> > -	mov	%rsi, %rdx
> > -	lea	16(%rsi, %rcx), %rsi
> > -	and	$-0x40, %rsi
> > -	sub	%rsi, %rdx
> > -	sub	%rdx, %rdi
> > -# ifdef USE_AS_STRNCPY
> > -	lea	128(%r8, %rdx), %r8
> > -# endif
> > -L(Unaligned64Loop):
> > -	movaps	(%rsi), %xmm2
> > -	movaps	%xmm2, %xmm4
> > -	movaps	16(%rsi), %xmm5
> > -	movaps	32(%rsi), %xmm3
> > -	movaps	%xmm3, %xmm6
> > -	movaps	48(%rsi), %xmm7
> > -	pminub	%xmm5, %xmm2
> > -	pminub	%xmm7, %xmm3
> > -	pminub	%xmm2, %xmm3
> > -	pcmpeqb	%xmm0, %xmm3
> > -	pmovmskb %xmm3, %rdx
> > -# ifdef USE_AS_STRNCPY
> > -	sub	$64, %r8
> > -	jbe	L(UnalignedLeaveCase2OrCase3)
> > -# endif
> > -	test	%rdx, %rdx
> > -	jnz	L(Unaligned64Leave)
> > -
> > -L(Unaligned64Loop_start):
> > -	add	$64, %rdi
> > -	add	$64, %rsi
> > -	movdqu	%xmm4, -64(%rdi)
> > -	movaps	(%rsi), %xmm2
> > -	movdqa	%xmm2, %xmm4
> > -	movdqu	%xmm5, -48(%rdi)
> > -	movaps	16(%rsi), %xmm5
> > -	pminub	%xmm5, %xmm2
> > -	movaps	32(%rsi), %xmm3
> > -	movdqu	%xmm6, -32(%rdi)
> > -	movaps	%xmm3, %xmm6
> > -	movdqu	%xmm7, -16(%rdi)
> > -	movaps	48(%rsi), %xmm7
> > -	pminub	%xmm7, %xmm3
> > -	pminub	%xmm2, %xmm3
> > -	pcmpeqb	%xmm0, %xmm3
> > -	pmovmskb %xmm3, %rdx
> > -# ifdef USE_AS_STRNCPY
> > -	sub	$64, %r8
> > -	jbe	L(UnalignedLeaveCase2OrCase3)
> > -# endif
> > -	test	%rdx, %rdx
> > -	jz	L(Unaligned64Loop_start)
> > -
> > -L(Unaligned64Leave):
> > -	pxor	%xmm1, %xmm1
> > -
> > -	pcmpeqb	%xmm4, %xmm0
> > -	pcmpeqb	%xmm5, %xmm1
> > -	pmovmskb %xmm0, %rdx
> > -	pmovmskb %xmm1, %rcx
> > -	test	%rdx, %rdx
> > -	jnz	L(CopyFrom1To16BytesUnaligned_0)
> > -	test	%rcx, %rcx
> > -	jnz	L(CopyFrom1To16BytesUnaligned_16)
> > -
> > -	pcmpeqb	%xmm6, %xmm0
> > -	pcmpeqb	%xmm7, %xmm1
> > -	pmovmskb %xmm0, %rdx
> > -	pmovmskb %xmm1, %rcx
> > -	test	%rdx, %rdx
> > -	jnz	L(CopyFrom1To16BytesUnaligned_32)
> > -
> > -	bsf	%rcx, %rdx
> > -	movdqu	%xmm4, (%rdi)
> > -	movdqu	%xmm5, 16(%rdi)
> > -	movdqu	%xmm6, 32(%rdi)
> > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > -# ifdef USE_AS_STPCPY
> > -	lea	48(%rdi, %rdx), %rax
> > -# endif
> > -	movdqu	%xmm7, 48(%rdi)
> > -	add	$15, %r8
> > -	sub	%rdx, %r8
> > -	lea	49(%rdi, %rdx), %rdi
> > -	jmp	L(StrncpyFillTailWithZero)
> > -# else
> > -	add	$48, %rsi
> > -	add	$48, %rdi
> > -	BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4)
> > -# endif
> > -
> > -/* If source address alignment == destination address alignment */
> > -
> > -L(SourceStringAlignmentLess32):
> > -	pxor	%xmm0, %xmm0
> > -	movdqu	(%rsi), %xmm1
> > -	movdqu	16(%rsi), %xmm2
> > -	pcmpeqb	%xmm1, %xmm0
> > -	pmovmskb %xmm0, %rdx
> > -
> > -# ifdef USE_AS_STRNCPY
> > -#  if defined USE_AS_STPCPY || defined USE_AS_STRCAT
> > -	cmp	$16, %r8
> > -#  else
> > -	cmp	$17, %r8
> > -#  endif
> > -	jbe	L(CopyFrom1To16BytesTail1Case2OrCase3)
> > -# endif
> > -	test	%rdx, %rdx
> > -	jnz	L(CopyFrom1To16BytesTail1)
> > -
> > -	pcmpeqb	%xmm2, %xmm0
> > -	movdqu	%xmm1, (%rdi)
> > -	pmovmskb %xmm0, %rdx
> > -
> > -# ifdef USE_AS_STRNCPY
> > -#  if defined USE_AS_STPCPY || defined USE_AS_STRCAT
> > -	cmp	$32, %r8
> > -#  else
> > -	cmp	$33, %r8
> > -#  endif
> > -	jbe	L(CopyFrom1To32Bytes1Case2OrCase3)
> > -# endif
> > -	test	%rdx, %rdx
> > -	jnz	L(CopyFrom1To32Bytes1)
> > -
> > -	and	$-16, %rsi
> > -	and	$15, %rcx
> > -	jmp	L(Unalign16Both)
> > -
> > -/*------End of main part with loops---------------------*/
> > -
> > -/* Case1 */
> > -
> > -# if (!defined USE_AS_STRNCPY) || (defined USE_AS_STRCAT)
> > -	.p2align 4
> > -L(CopyFrom1To16Bytes):
> > -	add	%rcx, %rdi
> > -	add	%rcx, %rsi
> > -	bsf	%rdx, %rdx
> > -	BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4)
> > -# endif
> > -	.p2align 4
> > -L(CopyFrom1To16BytesTail):
> > -	add	%rcx, %rsi
> > -	bsf	%rdx, %rdx
> > -	BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4)
> > -
> > -	.p2align 4
> > -L(CopyFrom1To32Bytes1):
> > -	add	$16, %rsi
> > -	add	$16, %rdi
> > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > -	sub	$16, %r8
> > -# endif
> > -L(CopyFrom1To16BytesTail1):
> > -	bsf	%rdx, %rdx
> > -	BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4)
> > -
> > -	.p2align 4
> > -L(CopyFrom1To32Bytes):
> > -	bsf	%rdx, %rdx
> > -	add	%rcx, %rsi
> > -	add	$16, %rdx
> > -	sub	%rcx, %rdx
> > -	BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4)
> > -
> > -	.p2align 4
> > -L(CopyFrom1To16BytesUnaligned_0):
> > -	bsf	%rdx, %rdx
> > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > -# ifdef USE_AS_STPCPY
> > -	lea	(%rdi, %rdx), %rax
> > -# endif
> > -	movdqu	%xmm4, (%rdi)
> > -	add	$63, %r8
> > -	sub	%rdx, %r8
> > -	lea	1(%rdi, %rdx), %rdi
> > -	jmp	L(StrncpyFillTailWithZero)
> > -# else
> > -	BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4)
> > -# endif
> > -
> > -	.p2align 4
> > -L(CopyFrom1To16BytesUnaligned_16):
> > -	bsf	%rcx, %rdx
> > -	movdqu	%xmm4, (%rdi)
> > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > -# ifdef USE_AS_STPCPY
> > -	lea	16(%rdi, %rdx), %rax
> > -# endif
> > -	movdqu	%xmm5, 16(%rdi)
> > -	add	$47, %r8
> > -	sub	%rdx, %r8
> > -	lea	17(%rdi, %rdx), %rdi
> > -	jmp	L(StrncpyFillTailWithZero)
> > -# else
> > -	add	$16, %rsi
> > -	add	$16, %rdi
> > -	BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4)
> > -# endif
> > -
> > -	.p2align 4
> > -L(CopyFrom1To16BytesUnaligned_32):
> > -	bsf	%rdx, %rdx
> > -	movdqu	%xmm4, (%rdi)
> > -	movdqu	%xmm5, 16(%rdi)
> > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > -# ifdef USE_AS_STPCPY
> > -	lea	32(%rdi, %rdx), %rax
> > -# endif
> > -	movdqu	%xmm6, 32(%rdi)
> > -	add	$31, %r8
> > -	sub	%rdx, %r8
> > -	lea	33(%rdi, %rdx), %rdi
> > -	jmp	L(StrncpyFillTailWithZero)
> > -# else
> > -	add	$32, %rsi
> > -	add	$32, %rdi
> > -	BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4)
> > -# endif
> > -
> > -# ifdef USE_AS_STRNCPY
> > -#  ifndef USE_AS_STRCAT
> > -	.p2align 4
> > -L(CopyFrom1To16BytesUnalignedXmm6):
> > -	movdqu	%xmm6, (%rdi, %rcx)
> > -	jmp	L(CopyFrom1To16BytesXmmExit)
> > -
> > -	.p2align 4
> > -L(CopyFrom1To16BytesUnalignedXmm5):
> > -	movdqu	%xmm5, (%rdi, %rcx)
> > -	jmp	L(CopyFrom1To16BytesXmmExit)
> > -
> > -	.p2align 4
> > -L(CopyFrom1To16BytesUnalignedXmm4):
> > -	movdqu	%xmm4, (%rdi, %rcx)
> > -	jmp	L(CopyFrom1To16BytesXmmExit)
> > -
> > -	.p2align 4
> > -L(CopyFrom1To16BytesUnalignedXmm3):
> > -	movdqu	%xmm3, (%rdi, %rcx)
> > -	jmp	L(CopyFrom1To16BytesXmmExit)
> > -
> > -	.p2align 4
> > -L(CopyFrom1To16BytesUnalignedXmm1):
> > -	movdqu	%xmm1, (%rdi, %rcx)
> > -	jmp	L(CopyFrom1To16BytesXmmExit)
> > -#  endif
> > -
> > -	.p2align 4
> > -L(CopyFrom1To16BytesExit):
> > -	BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4)
> > -
> > -/* Case2 */
> > -
> > -	.p2align 4
> > -L(CopyFrom1To16BytesCase2):
> > -	add	$16, %r8
> > -	add	%rcx, %rdi
> > -	add	%rcx, %rsi
> > -	bsf	%rdx, %rdx
> > -	cmp	%r8, %rdx
> > -	jb	L(CopyFrom1To16BytesExit)
> > -	BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4)
> > -
> > -	.p2align 4
> > -L(CopyFrom1To32BytesCase2):
> > -	add	%rcx, %rsi
> > -	bsf	%rdx, %rdx
> > -	add	$16, %rdx
> > -	sub	%rcx, %rdx
> > -	cmp	%r8, %rdx
> > -	jb	L(CopyFrom1To16BytesExit)
> > -	BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4)
> > -
> > -L(CopyFrom1To16BytesTailCase2):
> > -	add	%rcx, %rsi
> > -	bsf	%rdx, %rdx
> > -	cmp	%r8, %rdx
> > -	jb	L(CopyFrom1To16BytesExit)
> > -	BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4)
> > -
> > -L(CopyFrom1To16BytesTail1Case2):
> > -	bsf	%rdx, %rdx
> > -	cmp	%r8, %rdx
> > -	jb	L(CopyFrom1To16BytesExit)
> > -	BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4)
> > -
> > -/* Case2 or Case3,  Case3 */
> > -
> > -	.p2align 4
> > -L(CopyFrom1To16BytesCase2OrCase3):
> > -	test	%rdx, %rdx
> > -	jnz	L(CopyFrom1To16BytesCase2)
> > -L(CopyFrom1To16BytesCase3):
> > -	add	$16, %r8
> > -	add	%rcx, %rdi
> > -	add	%rcx, %rsi
> > -	BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4)
> > -
> > -	.p2align 4
> > -L(CopyFrom1To32BytesCase2OrCase3):
> > -	test	%rdx, %rdx
> > -	jnz	L(CopyFrom1To32BytesCase2)
> > -	add	%rcx, %rsi
> > -	BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4)
> > -
> > -	.p2align 4
> > -L(CopyFrom1To16BytesTailCase2OrCase3):
> > -	test	%rdx, %rdx
> > -	jnz	L(CopyFrom1To16BytesTailCase2)
> > -	add	%rcx, %rsi
> > -	BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4)
> > -
> > -	.p2align 4
> > -L(CopyFrom1To32Bytes1Case2OrCase3):
> > -	add	$16, %rdi
> > -	add	$16, %rsi
> > -	sub	$16, %r8
> > -L(CopyFrom1To16BytesTail1Case2OrCase3):
> > -	test	%rdx, %rdx
> > -	jnz	L(CopyFrom1To16BytesTail1Case2)
> > -	BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4)
> > -
> > -# endif
> > -
> > -/*------------End labels regarding with copying 1-16 bytes--and 1-32 bytes----*/
> > -
> > -	.p2align 4
> > -L(Exit1):
> > -	mov	%dh, (%rdi)
> > -# ifdef USE_AS_STPCPY
> > -	lea	(%rdi), %rax
> > -# endif
> > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > -	sub	$1, %r8
> > -	lea	1(%rdi), %rdi
> > -	jnz	L(StrncpyFillTailWithZero)
> > -# endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(Exit2):
> > -	mov	(%rsi), %dx
> > -	mov	%dx, (%rdi)
> > -# ifdef USE_AS_STPCPY
> > -	lea	1(%rdi), %rax
> > -# endif
> > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > -	sub	$2, %r8
> > -	lea	2(%rdi), %rdi
> > -	jnz	L(StrncpyFillTailWithZero)
> > -# endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(Exit3):
> > -	mov	(%rsi), %cx
> > -	mov	%cx, (%rdi)
> > -	mov	%dh, 2(%rdi)
> > -# ifdef USE_AS_STPCPY
> > -	lea	2(%rdi), %rax
> > -# endif
> > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > -	sub	$3, %r8
> > -	lea	3(%rdi), %rdi
> > -	jnz	L(StrncpyFillTailWithZero)
> > -# endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(Exit4):
> > -	mov	(%rsi), %edx
> > -	mov	%edx, (%rdi)
> > -# ifdef USE_AS_STPCPY
> > -	lea	3(%rdi), %rax
> > -# endif
> > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > -	sub	$4, %r8
> > -	lea	4(%rdi), %rdi
> > -	jnz	L(StrncpyFillTailWithZero)
> > -# endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(Exit5):
> > -	mov	(%rsi), %ecx
> > -	mov	%dh, 4(%rdi)
> > -	mov	%ecx, (%rdi)
> > -# ifdef USE_AS_STPCPY
> > -	lea	4(%rdi), %rax
> > -# endif
> > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > -	sub	$5, %r8
> > -	lea	5(%rdi), %rdi
> > -	jnz	L(StrncpyFillTailWithZero)
> > -# endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(Exit6):
> > -	mov	(%rsi), %ecx
> > -	mov	4(%rsi), %dx
> > -	mov	%ecx, (%rdi)
> > -	mov	%dx, 4(%rdi)
> > -# ifdef USE_AS_STPCPY
> > -	lea	5(%rdi), %rax
> > -# endif
> > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > -	sub	$6, %r8
> > -	lea	6(%rdi), %rdi
> > -	jnz	L(StrncpyFillTailWithZero)
> > -# endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(Exit7):
> > -	mov	(%rsi), %ecx
> > -	mov	3(%rsi), %edx
> > -	mov	%ecx, (%rdi)
> > -	mov	%edx, 3(%rdi)
> > -# ifdef USE_AS_STPCPY
> > -	lea	6(%rdi), %rax
> > -# endif
> > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > -	sub	$7, %r8
> > -	lea	7(%rdi), %rdi
> > -	jnz	L(StrncpyFillTailWithZero)
> > -# endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(Exit8):
> > -	mov	(%rsi), %rdx
> > -	mov	%rdx, (%rdi)
> > -# ifdef USE_AS_STPCPY
> > -	lea	7(%rdi), %rax
> > -# endif
> > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > -	sub	$8, %r8
> > -	lea	8(%rdi), %rdi
> > -	jnz	L(StrncpyFillTailWithZero)
> > -# endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(Exit9):
> > -	mov	(%rsi), %rcx
> > -	mov	%dh, 8(%rdi)
> > -	mov	%rcx, (%rdi)
> > -# ifdef USE_AS_STPCPY
> > -	lea	8(%rdi), %rax
> > -# endif
> > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > -	sub	$9, %r8
> > -	lea	9(%rdi), %rdi
> > -	jnz	L(StrncpyFillTailWithZero)
> > -# endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(Exit10):
> > -	mov	(%rsi), %rcx
> > -	mov	8(%rsi), %dx
> > -	mov	%rcx, (%rdi)
> > -	mov	%dx, 8(%rdi)
> > -# ifdef USE_AS_STPCPY
> > -	lea	9(%rdi), %rax
> > -# endif
> > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > -	sub	$10, %r8
> > -	lea	10(%rdi), %rdi
> > -	jnz	L(StrncpyFillTailWithZero)
> > -# endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(Exit11):
> > -	mov	(%rsi), %rcx
> > -	mov	7(%rsi), %edx
> > -	mov	%rcx, (%rdi)
> > -	mov	%edx, 7(%rdi)
> > -# ifdef USE_AS_STPCPY
> > -	lea	10(%rdi), %rax
> > -# endif
> > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > -	sub	$11, %r8
> > -	lea	11(%rdi), %rdi
> > -	jnz	L(StrncpyFillTailWithZero)
> > -# endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(Exit12):
> > -	mov	(%rsi), %rcx
> > -	mov	8(%rsi), %edx
> > -	mov	%rcx, (%rdi)
> > -	mov	%edx, 8(%rdi)
> > -# ifdef USE_AS_STPCPY
> > -	lea	11(%rdi), %rax
> > -# endif
> > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > -	sub	$12, %r8
> > -	lea	12(%rdi), %rdi
> > -	jnz	L(StrncpyFillTailWithZero)
> > -# endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(Exit13):
> > -	mov	(%rsi), %rcx
> > -	mov	5(%rsi), %rdx
> > -	mov	%rcx, (%rdi)
> > -	mov	%rdx, 5(%rdi)
> > -# ifdef USE_AS_STPCPY
> > -	lea	12(%rdi), %rax
> > -# endif
> > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > -	sub	$13, %r8
> > -	lea	13(%rdi), %rdi
> > -	jnz	L(StrncpyFillTailWithZero)
> > -# endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(Exit14):
> > -	mov	(%rsi), %rcx
> > -	mov	6(%rsi), %rdx
> > -	mov	%rcx, (%rdi)
> > -	mov	%rdx, 6(%rdi)
> > -# ifdef USE_AS_STPCPY
> > -	lea	13(%rdi), %rax
> > -# endif
> > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > -	sub	$14, %r8
> > -	lea	14(%rdi), %rdi
> > -	jnz	L(StrncpyFillTailWithZero)
> > -# endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(Exit15):
> > -	mov	(%rsi), %rcx
> > -	mov	7(%rsi), %rdx
> > -	mov	%rcx, (%rdi)
> > -	mov	%rdx, 7(%rdi)
> > -# ifdef USE_AS_STPCPY
> > -	lea	14(%rdi), %rax
> > -# endif
> > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > -	sub	$15, %r8
> > -	lea	15(%rdi), %rdi
> > -	jnz	L(StrncpyFillTailWithZero)
> > -# endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(Exit16):
> > -	movdqu	(%rsi), %xmm0
> > -	movdqu	%xmm0, (%rdi)
> > -# ifdef USE_AS_STPCPY
> > -	lea	15(%rdi), %rax
> > -# endif
> > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > -	sub	$16, %r8
> > -	lea	16(%rdi), %rdi
> > -	jnz	L(StrncpyFillTailWithZero)
> > -# endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(Exit17):
> > -	movdqu	(%rsi), %xmm0
> > -	movdqu	%xmm0, (%rdi)
> > -	mov	%dh, 16(%rdi)
> > -# ifdef USE_AS_STPCPY
> > -	lea	16(%rdi), %rax
> > -# endif
> > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > -	sub	$17, %r8
> > -	lea	17(%rdi), %rdi
> > -	jnz	L(StrncpyFillTailWithZero)
> > -# endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(Exit18):
> > -	movdqu	(%rsi), %xmm0
> > -	mov	16(%rsi), %cx
> > -	movdqu	%xmm0, (%rdi)
> > -	mov	%cx, 16(%rdi)
> > -# ifdef USE_AS_STPCPY
> > -	lea	17(%rdi), %rax
> > -# endif
> > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > -	sub	$18, %r8
> > -	lea	18(%rdi), %rdi
> > -	jnz	L(StrncpyFillTailWithZero)
> > -# endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(Exit19):
> > -	movdqu	(%rsi), %xmm0
> > -	mov	15(%rsi), %ecx
> > -	movdqu	%xmm0, (%rdi)
> > -	mov	%ecx, 15(%rdi)
> > -# ifdef USE_AS_STPCPY
> > -	lea	18(%rdi), %rax
> > -# endif
> > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > -	sub	$19, %r8
> > -	lea	19(%rdi), %rdi
> > -	jnz	L(StrncpyFillTailWithZero)
> > -# endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(Exit20):
> > -	movdqu	(%rsi), %xmm0
> > -	mov	16(%rsi), %ecx
> > -	movdqu	%xmm0, (%rdi)
> > -	mov	%ecx, 16(%rdi)
> > -# ifdef USE_AS_STPCPY
> > -	lea	19(%rdi), %rax
> > -# endif
> > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > -	sub	$20, %r8
> > -	lea	20(%rdi), %rdi
> > -	jnz	L(StrncpyFillTailWithZero)
> > -# endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(Exit21):
> > -	movdqu	(%rsi), %xmm0
> > -	mov	16(%rsi), %ecx
> > -	movdqu	%xmm0, (%rdi)
> > -	mov	%ecx, 16(%rdi)
> > -	mov	%dh, 20(%rdi)
> > -# ifdef USE_AS_STPCPY
> > -	lea	20(%rdi), %rax
> > -# endif
> > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > -	sub	$21, %r8
> > -	lea	21(%rdi), %rdi
> > -	jnz	L(StrncpyFillTailWithZero)
> > -# endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(Exit22):
> > -	movdqu	(%rsi), %xmm0
> > -	mov	14(%rsi), %rcx
> > -	movdqu	%xmm0, (%rdi)
> > -	mov	%rcx, 14(%rdi)
> > -# ifdef USE_AS_STPCPY
> > -	lea	21(%rdi), %rax
> > -# endif
> > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > -	sub	$22, %r8
> > -	lea	22(%rdi), %rdi
> > -	jnz	L(StrncpyFillTailWithZero)
> > -# endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(Exit23):
> > -	movdqu	(%rsi), %xmm0
> > -	mov	15(%rsi), %rcx
> > -	movdqu	%xmm0, (%rdi)
> > -	mov	%rcx, 15(%rdi)
> > -# ifdef USE_AS_STPCPY
> > -	lea	22(%rdi), %rax
> > -# endif
> > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > -	sub	$23, %r8
> > -	lea	23(%rdi), %rdi
> > -	jnz	L(StrncpyFillTailWithZero)
> > -# endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(Exit24):
> > -	movdqu	(%rsi), %xmm0
> > -	mov	16(%rsi), %rcx
> > -	movdqu	%xmm0, (%rdi)
> > -	mov	%rcx, 16(%rdi)
> > -# ifdef USE_AS_STPCPY
> > -	lea	23(%rdi), %rax
> > -# endif
> > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > -	sub	$24, %r8
> > -	lea	24(%rdi), %rdi
> > -	jnz	L(StrncpyFillTailWithZero)
> > -# endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(Exit25):
> > -	movdqu	(%rsi), %xmm0
> > -	mov	16(%rsi), %rcx
> > -	movdqu	%xmm0, (%rdi)
> > -	mov	%rcx, 16(%rdi)
> > -	mov	%dh, 24(%rdi)
> > -# ifdef USE_AS_STPCPY
> > -	lea	24(%rdi), %rax
> > -# endif
> > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > -	sub	$25, %r8
> > -	lea	25(%rdi), %rdi
> > -	jnz	L(StrncpyFillTailWithZero)
> > -# endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(Exit26):
> > -	movdqu	(%rsi), %xmm0
> > -	mov	16(%rsi), %rdx
> > -	mov	24(%rsi), %cx
> > -	movdqu	%xmm0, (%rdi)
> > -	mov	%rdx, 16(%rdi)
> > -	mov	%cx, 24(%rdi)
> > -# ifdef USE_AS_STPCPY
> > -	lea	25(%rdi), %rax
> > -# endif
> > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > -	sub	$26, %r8
> > -	lea	26(%rdi), %rdi
> > -	jnz	L(StrncpyFillTailWithZero)
> > -# endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(Exit27):
> > -	movdqu	(%rsi), %xmm0
> > -	mov	16(%rsi), %rdx
> > -	mov	23(%rsi), %ecx
> > -	movdqu	%xmm0, (%rdi)
> > -	mov	%rdx, 16(%rdi)
> > -	mov	%ecx, 23(%rdi)
> > -# ifdef USE_AS_STPCPY
> > -	lea	26(%rdi), %rax
> > -# endif
> > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > -	sub	$27, %r8
> > -	lea	27(%rdi), %rdi
> > -	jnz	L(StrncpyFillTailWithZero)
> > -# endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(Exit28):
> > -	movdqu	(%rsi), %xmm0
> > -	mov	16(%rsi), %rdx
> > -	mov	24(%rsi), %ecx
> > -	movdqu	%xmm0, (%rdi)
> > -	mov	%rdx, 16(%rdi)
> > -	mov	%ecx, 24(%rdi)
> > -# ifdef USE_AS_STPCPY
> > -	lea	27(%rdi), %rax
> > -# endif
> > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > -	sub	$28, %r8
> > -	lea	28(%rdi), %rdi
> > -	jnz	L(StrncpyFillTailWithZero)
> > -# endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(Exit29):
> > -	movdqu	(%rsi), %xmm0
> > -	movdqu	13(%rsi), %xmm2
> > -	movdqu	%xmm0, (%rdi)
> > -	movdqu	%xmm2, 13(%rdi)
> > -# ifdef USE_AS_STPCPY
> > -	lea	28(%rdi), %rax
> > -# endif
> > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > -	sub	$29, %r8
> > -	lea	29(%rdi), %rdi
> > -	jnz	L(StrncpyFillTailWithZero)
> > -# endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(Exit30):
> > -	movdqu	(%rsi), %xmm0
> > -	movdqu	14(%rsi), %xmm2
> > -	movdqu	%xmm0, (%rdi)
> > -	movdqu	%xmm2, 14(%rdi)
> > -# ifdef USE_AS_STPCPY
> > -	lea	29(%rdi), %rax
> > -# endif
> > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > -	sub	$30, %r8
> > -	lea	30(%rdi), %rdi
> > -	jnz	L(StrncpyFillTailWithZero)
> > -# endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(Exit31):
> > -	movdqu	(%rsi), %xmm0
> > -	movdqu	15(%rsi), %xmm2
> > -	movdqu	%xmm0, (%rdi)
> > -	movdqu	%xmm2, 15(%rdi)
> > -# ifdef USE_AS_STPCPY
> > -	lea	30(%rdi), %rax
> > -# endif
> > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > -	sub	$31, %r8
> > -	lea	31(%rdi), %rdi
> > -	jnz	L(StrncpyFillTailWithZero)
> > -# endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(Exit32):
> > -	movdqu	(%rsi), %xmm0
> > -	movdqu	16(%rsi), %xmm2
> > -	movdqu	%xmm0, (%rdi)
> > -	movdqu	%xmm2, 16(%rdi)
> > -# ifdef USE_AS_STPCPY
> > -	lea	31(%rdi), %rax
> > -# endif
> > -# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > -	sub	$32, %r8
> > -	lea	32(%rdi), %rdi
> > -	jnz	L(StrncpyFillTailWithZero)
> > -# endif
> > -	ret
> > -
> > -# ifdef USE_AS_STRNCPY
> > -
> > -	.p2align 4
> > -L(StrncpyExit0):
> > -#  ifdef USE_AS_STPCPY
> > -	mov	%rdi, %rax
> > -#  endif
> > -#  ifdef USE_AS_STRCAT
> > -	xor	%ch, %ch
> > -	movb	%ch, (%rdi)
> > -#  endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(StrncpyExit1):
> > -	mov	(%rsi), %dl
> > -	mov	%dl, (%rdi)
> > -#  ifdef USE_AS_STPCPY
> > -	lea	1(%rdi), %rax
> > -#  endif
> > -#  ifdef USE_AS_STRCAT
> > -	xor	%ch, %ch
> > -	movb	%ch, 1(%rdi)
> > -#  endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(StrncpyExit2):
> > -	mov	(%rsi), %dx
> > -	mov	%dx, (%rdi)
> > -#  ifdef USE_AS_STPCPY
> > -	lea	2(%rdi), %rax
> > -#  endif
> > -#  ifdef USE_AS_STRCAT
> > -	xor	%ch, %ch
> > -	movb	%ch, 2(%rdi)
> > -#  endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(StrncpyExit3):
> > -	mov	(%rsi), %cx
> > -	mov	2(%rsi), %dl
> > -	mov	%cx, (%rdi)
> > -	mov	%dl, 2(%rdi)
> > -#  ifdef USE_AS_STPCPY
> > -	lea	3(%rdi), %rax
> > -#  endif
> > -#  ifdef USE_AS_STRCAT
> > -	xor	%ch, %ch
> > -	movb	%ch, 3(%rdi)
> > -#  endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(StrncpyExit4):
> > -	mov	(%rsi), %edx
> > -	mov	%edx, (%rdi)
> > -#  ifdef USE_AS_STPCPY
> > -	lea	4(%rdi), %rax
> > -#  endif
> > -#  ifdef USE_AS_STRCAT
> > -	xor	%ch, %ch
> > -	movb	%ch, 4(%rdi)
> > -#  endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(StrncpyExit5):
> > -	mov	(%rsi), %ecx
> > -	mov	4(%rsi), %dl
> > -	mov	%ecx, (%rdi)
> > -	mov	%dl, 4(%rdi)
> > -#  ifdef USE_AS_STPCPY
> > -	lea	5(%rdi), %rax
> > -#  endif
> > -#  ifdef USE_AS_STRCAT
> > -	xor	%ch, %ch
> > -	movb	%ch, 5(%rdi)
> > -#  endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(StrncpyExit6):
> > -	mov	(%rsi), %ecx
> > -	mov	4(%rsi), %dx
> > -	mov	%ecx, (%rdi)
> > -	mov	%dx, 4(%rdi)
> > -#  ifdef USE_AS_STPCPY
> > -	lea	6(%rdi), %rax
> > -#  endif
> > -#  ifdef USE_AS_STRCAT
> > -	xor	%ch, %ch
> > -	movb	%ch, 6(%rdi)
> > -#  endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(StrncpyExit7):
> > -	mov	(%rsi), %ecx
> > -	mov	3(%rsi), %edx
> > -	mov	%ecx, (%rdi)
> > -	mov	%edx, 3(%rdi)
> > -#  ifdef USE_AS_STPCPY
> > -	lea	7(%rdi), %rax
> > -#  endif
> > -#  ifdef USE_AS_STRCAT
> > -	xor	%ch, %ch
> > -	movb	%ch, 7(%rdi)
> > -#  endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(StrncpyExit8):
> > -	mov	(%rsi), %rdx
> > -	mov	%rdx, (%rdi)
> > -#  ifdef USE_AS_STPCPY
> > -	lea	8(%rdi), %rax
> > -#  endif
> > -#  ifdef USE_AS_STRCAT
> > -	xor	%ch, %ch
> > -	movb	%ch, 8(%rdi)
> > -#  endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(StrncpyExit9):
> > -	mov	(%rsi), %rcx
> > -	mov	8(%rsi), %dl
> > -	mov	%rcx, (%rdi)
> > -	mov	%dl, 8(%rdi)
> > -#  ifdef USE_AS_STPCPY
> > -	lea	9(%rdi), %rax
> > -#  endif
> > -#  ifdef USE_AS_STRCAT
> > -	xor	%ch, %ch
> > -	movb	%ch, 9(%rdi)
> > -#  endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(StrncpyExit10):
> > -	mov	(%rsi), %rcx
> > -	mov	8(%rsi), %dx
> > -	mov	%rcx, (%rdi)
> > -	mov	%dx, 8(%rdi)
> > -#  ifdef USE_AS_STPCPY
> > -	lea	10(%rdi), %rax
> > -#  endif
> > -#  ifdef USE_AS_STRCAT
> > -	xor	%ch, %ch
> > -	movb	%ch, 10(%rdi)
> > -#  endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(StrncpyExit11):
> > -	mov	(%rsi), %rcx
> > -	mov	7(%rsi), %edx
> > -	mov	%rcx, (%rdi)
> > -	mov	%edx, 7(%rdi)
> > -#  ifdef USE_AS_STPCPY
> > -	lea	11(%rdi), %rax
> > -#  endif
> > -#  ifdef USE_AS_STRCAT
> > -	xor	%ch, %ch
> > -	movb	%ch, 11(%rdi)
> > -#  endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(StrncpyExit12):
> > -	mov	(%rsi), %rcx
> > -	mov	8(%rsi), %edx
> > -	mov	%rcx, (%rdi)
> > -	mov	%edx, 8(%rdi)
> > -#  ifdef USE_AS_STPCPY
> > -	lea	12(%rdi), %rax
> > -#  endif
> > -#  ifdef USE_AS_STRCAT
> > -	xor	%ch, %ch
> > -	movb	%ch, 12(%rdi)
> > -#  endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(StrncpyExit13):
> > -	mov	(%rsi), %rcx
> > -	mov	5(%rsi), %rdx
> > -	mov	%rcx, (%rdi)
> > -	mov	%rdx, 5(%rdi)
> > -#  ifdef USE_AS_STPCPY
> > -	lea	13(%rdi), %rax
> > -#  endif
> > -#  ifdef USE_AS_STRCAT
> > -	xor	%ch, %ch
> > -	movb	%ch, 13(%rdi)
> > -#  endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(StrncpyExit14):
> > -	mov	(%rsi), %rcx
> > -	mov	6(%rsi), %rdx
> > -	mov	%rcx, (%rdi)
> > -	mov	%rdx, 6(%rdi)
> > -#  ifdef USE_AS_STPCPY
> > -	lea	14(%rdi), %rax
> > -#  endif
> > -#  ifdef USE_AS_STRCAT
> > -	xor	%ch, %ch
> > -	movb	%ch, 14(%rdi)
> > -#  endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(StrncpyExit15):
> > -	mov	(%rsi), %rcx
> > -	mov	7(%rsi), %rdx
> > -	mov	%rcx, (%rdi)
> > -	mov	%rdx, 7(%rdi)
> > -#  ifdef USE_AS_STPCPY
> > -	lea	15(%rdi), %rax
> > -#  endif
> > -#  ifdef USE_AS_STRCAT
> > -	xor	%ch, %ch
> > -	movb	%ch, 15(%rdi)
> > -#  endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(StrncpyExit16):
> > -	movdqu	(%rsi), %xmm0
> > -	movdqu	%xmm0, (%rdi)
> > -#  ifdef USE_AS_STPCPY
> > -	lea	16(%rdi), %rax
> > -#  endif
> > -#  ifdef USE_AS_STRCAT
> > -	xor	%ch, %ch
> > -	movb	%ch, 16(%rdi)
> > -#  endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(StrncpyExit17):
> > -	movdqu	(%rsi), %xmm0
> > -	mov	16(%rsi), %cl
> > -	movdqu	%xmm0, (%rdi)
> > -	mov	%cl, 16(%rdi)
> > -#  ifdef USE_AS_STPCPY
> > -	lea	17(%rdi), %rax
> > -#  endif
> > -#  ifdef USE_AS_STRCAT
> > -	xor	%ch, %ch
> > -	movb	%ch, 17(%rdi)
> > -#  endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(StrncpyExit18):
> > -	movdqu	(%rsi), %xmm0
> > -	mov	16(%rsi), %cx
> > -	movdqu	%xmm0, (%rdi)
> > -	mov	%cx, 16(%rdi)
> > -#  ifdef USE_AS_STPCPY
> > -	lea	18(%rdi), %rax
> > -#  endif
> > -#  ifdef USE_AS_STRCAT
> > -	xor	%ch, %ch
> > -	movb	%ch, 18(%rdi)
> > -#  endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(StrncpyExit19):
> > -	movdqu	(%rsi), %xmm0
> > -	mov	15(%rsi), %ecx
> > -	movdqu	%xmm0, (%rdi)
> > -	mov	%ecx, 15(%rdi)
> > -#  ifdef USE_AS_STPCPY
> > -	lea	19(%rdi), %rax
> > -#  endif
> > -#  ifdef USE_AS_STRCAT
> > -	xor	%ch, %ch
> > -	movb	%ch, 19(%rdi)
> > -#  endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(StrncpyExit20):
> > -	movdqu	(%rsi), %xmm0
> > -	mov	16(%rsi), %ecx
> > -	movdqu	%xmm0, (%rdi)
> > -	mov	%ecx, 16(%rdi)
> > -#  ifdef USE_AS_STPCPY
> > -	lea	20(%rdi), %rax
> > -#  endif
> > -#  ifdef USE_AS_STRCAT
> > -	xor	%ch, %ch
> > -	movb	%ch, 20(%rdi)
> > -#  endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(StrncpyExit21):
> > -	movdqu	(%rsi), %xmm0
> > -	mov	16(%rsi), %ecx
> > -	mov	20(%rsi), %dl
> > -	movdqu	%xmm0, (%rdi)
> > -	mov	%ecx, 16(%rdi)
> > -	mov	%dl, 20(%rdi)
> > -#  ifdef USE_AS_STPCPY
> > -	lea	21(%rdi), %rax
> > -#  endif
> > -#  ifdef USE_AS_STRCAT
> > -	xor	%ch, %ch
> > -	movb	%ch, 21(%rdi)
> > -#  endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(StrncpyExit22):
> > -	movdqu	(%rsi), %xmm0
> > -	mov	14(%rsi), %rcx
> > -	movdqu	%xmm0, (%rdi)
> > -	mov	%rcx, 14(%rdi)
> > -#  ifdef USE_AS_STPCPY
> > -	lea	22(%rdi), %rax
> > -#  endif
> > -#  ifdef USE_AS_STRCAT
> > -	xor	%ch, %ch
> > -	movb	%ch, 22(%rdi)
> > -#  endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(StrncpyExit23):
> > -	movdqu	(%rsi), %xmm0
> > -	mov	15(%rsi), %rcx
> > -	movdqu	%xmm0, (%rdi)
> > -	mov	%rcx, 15(%rdi)
> > -#  ifdef USE_AS_STPCPY
> > -	lea	23(%rdi), %rax
> > -#  endif
> > -#  ifdef USE_AS_STRCAT
> > -	xor	%ch, %ch
> > -	movb	%ch, 23(%rdi)
> > -#  endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(StrncpyExit24):
> > -	movdqu	(%rsi), %xmm0
> > -	mov	16(%rsi), %rcx
> > -	movdqu	%xmm0, (%rdi)
> > -	mov	%rcx, 16(%rdi)
> > -#  ifdef USE_AS_STPCPY
> > -	lea	24(%rdi), %rax
> > -#  endif
> > -#  ifdef USE_AS_STRCAT
> > -	xor	%ch, %ch
> > -	movb	%ch, 24(%rdi)
> > -#  endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(StrncpyExit25):
> > -	movdqu	(%rsi), %xmm0
> > -	mov	16(%rsi), %rdx
> > -	mov	24(%rsi), %cl
> > -	movdqu	%xmm0, (%rdi)
> > -	mov	%rdx, 16(%rdi)
> > -	mov	%cl, 24(%rdi)
> > -#  ifdef USE_AS_STPCPY
> > -	lea	25(%rdi), %rax
> > -#  endif
> > -#  ifdef USE_AS_STRCAT
> > -	xor	%ch, %ch
> > -	movb	%ch, 25(%rdi)
> > -#  endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(StrncpyExit26):
> > -	movdqu	(%rsi), %xmm0
> > -	mov	16(%rsi), %rdx
> > -	mov	24(%rsi), %cx
> > -	movdqu	%xmm0, (%rdi)
> > -	mov	%rdx, 16(%rdi)
> > -	mov	%cx, 24(%rdi)
> > -#  ifdef USE_AS_STPCPY
> > -	lea	26(%rdi), %rax
> > -#  endif
> > -#  ifdef USE_AS_STRCAT
> > -	xor	%ch, %ch
> > -	movb	%ch, 26(%rdi)
> > -#  endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(StrncpyExit27):
> > -	movdqu	(%rsi), %xmm0
> > -	mov	16(%rsi), %rdx
> > -	mov	23(%rsi), %ecx
> > -	movdqu	%xmm0, (%rdi)
> > -	mov	%rdx, 16(%rdi)
> > -	mov	%ecx, 23(%rdi)
> > -#  ifdef USE_AS_STPCPY
> > -	lea	27(%rdi), %rax
> > -#  endif
> > -#  ifdef USE_AS_STRCAT
> > -	xor	%ch, %ch
> > -	movb	%ch, 27(%rdi)
> > -#  endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(StrncpyExit28):
> > -	movdqu	(%rsi), %xmm0
> > -	mov	16(%rsi), %rdx
> > -	mov	24(%rsi), %ecx
> > -	movdqu	%xmm0, (%rdi)
> > -	mov	%rdx, 16(%rdi)
> > -	mov	%ecx, 24(%rdi)
> > -#  ifdef USE_AS_STPCPY
> > -	lea	28(%rdi), %rax
> > -#  endif
> > -#  ifdef USE_AS_STRCAT
> > -	xor	%ch, %ch
> > -	movb	%ch, 28(%rdi)
> > -#  endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(StrncpyExit29):
> > -	movdqu	(%rsi), %xmm0
> > -	movdqu	13(%rsi), %xmm2
> > -	movdqu	%xmm0, (%rdi)
> > -	movdqu	%xmm2, 13(%rdi)
> > -#  ifdef USE_AS_STPCPY
> > -	lea	29(%rdi), %rax
> > -#  endif
> > -#  ifdef USE_AS_STRCAT
> > -	xor	%ch, %ch
> > -	movb	%ch, 29(%rdi)
> > -#  endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(StrncpyExit30):
> > -	movdqu	(%rsi), %xmm0
> > -	movdqu	14(%rsi), %xmm2
> > -	movdqu	%xmm0, (%rdi)
> > -	movdqu	%xmm2, 14(%rdi)
> > -#  ifdef USE_AS_STPCPY
> > -	lea	30(%rdi), %rax
> > -#  endif
> > -#  ifdef USE_AS_STRCAT
> > -	xor	%ch, %ch
> > -	movb	%ch, 30(%rdi)
> > -#  endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(StrncpyExit31):
> > -	movdqu	(%rsi), %xmm0
> > -	movdqu	15(%rsi), %xmm2
> > -	movdqu	%xmm0, (%rdi)
> > -	movdqu	%xmm2, 15(%rdi)
> > -#  ifdef USE_AS_STPCPY
> > -	lea	31(%rdi), %rax
> > -#  endif
> > -#  ifdef USE_AS_STRCAT
> > -	xor	%ch, %ch
> > -	movb	%ch, 31(%rdi)
> > -#  endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(StrncpyExit32):
> > -	movdqu	(%rsi), %xmm0
> > -	movdqu	16(%rsi), %xmm2
> > -	movdqu	%xmm0, (%rdi)
> > -	movdqu	%xmm2, 16(%rdi)
> > -#  ifdef USE_AS_STPCPY
> > -	lea	32(%rdi), %rax
> > -#  endif
> > -#  ifdef USE_AS_STRCAT
> > -	xor	%ch, %ch
> > -	movb	%ch, 32(%rdi)
> > -#  endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(StrncpyExit33):
> > -	movdqu	(%rsi), %xmm0
> > -	movdqu	16(%rsi), %xmm2
> > -	mov	32(%rsi), %cl
> > -	movdqu	%xmm0, (%rdi)
> > -	movdqu	%xmm2, 16(%rdi)
> > -	mov	%cl, 32(%rdi)
> > -#  ifdef USE_AS_STRCAT
> > -	xor	%ch, %ch
> > -	movb	%ch, 33(%rdi)
> > -#  endif
> > -	ret
> > -
> > -#  ifndef USE_AS_STRCAT
> > -
> > -	.p2align 4
> > -L(Fill0):
> > -	ret
> > -
> > -	.p2align 4
> > -L(Fill1):
> > -	mov	%dl, (%rdi)
> > -	ret
> > -
> > -	.p2align 4
> > -L(Fill2):
> > -	mov	%dx, (%rdi)
> > -	ret
> > -
> > -	.p2align 4
> > -L(Fill3):
> > -	mov	%edx, -1(%rdi)
> > -	ret
> > -
> > -	.p2align 4
> > -L(Fill4):
> > -	mov	%edx, (%rdi)
> > -	ret
> > -
> > -	.p2align 4
> > -L(Fill5):
> > -	mov	%edx, (%rdi)
> > -	mov	%dl, 4(%rdi)
> > -	ret
> > -
> > -	.p2align 4
> > -L(Fill6):
> > -	mov	%edx, (%rdi)
> > -	mov	%dx, 4(%rdi)
> > -	ret
> > -
> > -	.p2align 4
> > -L(Fill7):
> > -	mov	%rdx, -1(%rdi)
> > -	ret
> > -
> > -	.p2align 4
> > -L(Fill8):
> > -	mov	%rdx, (%rdi)
> > -	ret
> > -
> > -	.p2align 4
> > -L(Fill9):
> > -	mov	%rdx, (%rdi)
> > -	mov	%dl, 8(%rdi)
> > -	ret
> > -
> > -	.p2align 4
> > -L(Fill10):
> > -	mov	%rdx, (%rdi)
> > -	mov	%dx, 8(%rdi)
> > -	ret
> > -
> > -	.p2align 4
> > -L(Fill11):
> > -	mov	%rdx, (%rdi)
> > -	mov	%edx, 7(%rdi)
> > -	ret
> > -
> > -	.p2align 4
> > -L(Fill12):
> > -	mov	%rdx, (%rdi)
> > -	mov	%edx, 8(%rdi)
> > -	ret
> > -
> > -	.p2align 4
> > -L(Fill13):
> > -	mov	%rdx, (%rdi)
> > -	mov	%rdx, 5(%rdi)
> > -	ret
> > -
> > -	.p2align 4
> > -L(Fill14):
> > -	mov	%rdx, (%rdi)
> > -	mov	%rdx, 6(%rdi)
> > -	ret
> > -
> > -	.p2align 4
> > -L(Fill15):
> > -	movdqu	%xmm0, -1(%rdi)
> > -	ret
> > -
> > -	.p2align 4
> > -L(Fill16):
> > -	movdqu	%xmm0, (%rdi)
> > -	ret
> > -
> > -	.p2align 4
> > -L(CopyFrom1To16BytesUnalignedXmm2):
> > -	movdqu	%xmm2, (%rdi, %rcx)
> > -
> > -	.p2align 4
> > -L(CopyFrom1To16BytesXmmExit):
> > -	bsf	%rdx, %rdx
> > -	add	$15, %r8
> > -	add	%rcx, %rdi
> > -#   ifdef USE_AS_STPCPY
> > -	lea	(%rdi, %rdx), %rax
> > -#   endif
> > -	sub	%rdx, %r8
> > -	lea	1(%rdi, %rdx), %rdi
> > -
> > -	.p2align 4
> > -L(StrncpyFillTailWithZero):
> > -	pxor	%xmm0, %xmm0
> > -	xor	%rdx, %rdx
> > -	sub	$16, %r8
> > -	jbe	L(StrncpyFillExit)
> > -
> > -	movdqu	%xmm0, (%rdi)
> > -	add	$16, %rdi
> > -
> > -	mov	%rdi, %rsi
> > -	and	$0xf, %rsi
> > -	sub	%rsi, %rdi
> > -	add	%rsi, %r8
> > -	sub	$64, %r8
> > -	jb	L(StrncpyFillLess64)
> > -
> > -L(StrncpyFillLoopMovdqa):
> > -	movdqa	%xmm0, (%rdi)
> > -	movdqa	%xmm0, 16(%rdi)
> > -	movdqa	%xmm0, 32(%rdi)
> > -	movdqa	%xmm0, 48(%rdi)
> > -	add	$64, %rdi
> > -	sub	$64, %r8
> > -	jae	L(StrncpyFillLoopMovdqa)
> > -
> > -L(StrncpyFillLess64):
> > -	add	$32, %r8
> > -	jl	L(StrncpyFillLess32)
> > -	movdqa	%xmm0, (%rdi)
> > -	movdqa	%xmm0, 16(%rdi)
> > -	add	$32, %rdi
> > -	sub	$16, %r8
> > -	jl	L(StrncpyFillExit)
> > -	movdqa	%xmm0, (%rdi)
> > -	add	$16, %rdi
> > -	BRANCH_TO_JMPTBL_ENTRY (L(FillTable), %r8, 4)
> > -
> > -L(StrncpyFillLess32):
> > -	add	$16, %r8
> > -	jl	L(StrncpyFillExit)
> > -	movdqa	%xmm0, (%rdi)
> > -	add	$16, %rdi
> > -	BRANCH_TO_JMPTBL_ENTRY (L(FillTable), %r8, 4)
> > -
> > -L(StrncpyFillExit):
> > -	add	$16, %r8
> > -	BRANCH_TO_JMPTBL_ENTRY (L(FillTable), %r8, 4)
> > -
> > -/* end of ifndef USE_AS_STRCAT */
> > -#  endif
> > -
> > -	.p2align 4
> > -L(UnalignedLeaveCase2OrCase3):
> > -	test	%rdx, %rdx
> > -	jnz	L(Unaligned64LeaveCase2)
> > -L(Unaligned64LeaveCase3):
> > -	lea	64(%r8), %rcx
> > -	and	$-16, %rcx
> > -	add	$48, %r8
> > -	jl	L(CopyFrom1To16BytesCase3)
> > -	movdqu	%xmm4, (%rdi)
> > -	sub	$16, %r8
> > -	jb	L(CopyFrom1To16BytesCase3)
> > -	movdqu	%xmm5, 16(%rdi)
> > -	sub	$16, %r8
> > -	jb	L(CopyFrom1To16BytesCase3)
> > -	movdqu	%xmm6, 32(%rdi)
> > -	sub	$16, %r8
> > -	jb	L(CopyFrom1To16BytesCase3)
> > -	movdqu	%xmm7, 48(%rdi)
> > -#  ifdef USE_AS_STPCPY
> > -	lea	64(%rdi), %rax
> > -#  endif
> > -#  ifdef USE_AS_STRCAT
> > -	xor	%ch, %ch
> > -	movb	%ch, 64(%rdi)
> > -#  endif
> > -	ret
> > -
> > -	.p2align 4
> > -L(Unaligned64LeaveCase2):
> > -	xor	%rcx, %rcx
> > -	pcmpeqb	%xmm4, %xmm0
> > -	pmovmskb %xmm0, %rdx
> > -	add	$48, %r8
> > -	jle	L(CopyFrom1To16BytesCase2OrCase3)
> > -	test	%rdx, %rdx
> > -#  ifndef USE_AS_STRCAT
> > -	jnz	L(CopyFrom1To16BytesUnalignedXmm4)
> > -#  else
> > -	jnz	L(CopyFrom1To16Bytes)
> > -#  endif
> > -	pcmpeqb	%xmm5, %xmm0
> > -	pmovmskb %xmm0, %rdx
> > -	movdqu	%xmm4, (%rdi)
> > -	add	$16, %rcx
> > -	sub	$16, %r8
> > -	jbe	L(CopyFrom1To16BytesCase2OrCase3)
> > -	test	%rdx, %rdx
> > -#  ifndef USE_AS_STRCAT
> > -	jnz	L(CopyFrom1To16BytesUnalignedXmm5)
> > -#  else
> > -	jnz	L(CopyFrom1To16Bytes)
> > -#  endif
> > -
> > -	pcmpeqb	%xmm6, %xmm0
> > -	pmovmskb %xmm0, %rdx
> > -	movdqu	%xmm5, 16(%rdi)
> > -	add	$16, %rcx
> > -	sub	$16, %r8
> > -	jbe	L(CopyFrom1To16BytesCase2OrCase3)
> > -	test	%rdx, %rdx
> > -#  ifndef USE_AS_STRCAT
> > -	jnz	L(CopyFrom1To16BytesUnalignedXmm6)
> > -#  else
> > -	jnz	L(CopyFrom1To16Bytes)
> > -#  endif
> > -
> > -	pcmpeqb	%xmm7, %xmm0
> > -	pmovmskb %xmm0, %rdx
> > -	movdqu	%xmm6, 32(%rdi)
> > -	lea	16(%rdi, %rcx), %rdi
> > -	lea	16(%rsi, %rcx), %rsi
> > -	bsf	%rdx, %rdx
> > -	cmp	%r8, %rdx
> > -	jb	L(CopyFrom1To16BytesExit)
> > -	BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4)
> > -
> > -	.p2align 4
> > -L(ExitZero):
> > -#  ifndef USE_AS_STRCAT
> > -	mov	%rdi, %rax
> > -#  endif
> > -	ret
> > -
> > -# endif
> > -
> > -# ifndef USE_AS_STRCAT
> > -END (STRCPY)
> > -# else
> > -END (STRCAT)
> > -# endif
> > -	.p2align 4
> > -	.section .rodata
> > -L(ExitTable):
> > -	.int	JMPTBL(L(Exit1), L(ExitTable))
> > -	.int	JMPTBL(L(Exit2), L(ExitTable))
> > -	.int	JMPTBL(L(Exit3), L(ExitTable))
> > -	.int	JMPTBL(L(Exit4), L(ExitTable))
> > -	.int	JMPTBL(L(Exit5), L(ExitTable))
> > -	.int	JMPTBL(L(Exit6), L(ExitTable))
> > -	.int	JMPTBL(L(Exit7), L(ExitTable))
> > -	.int	JMPTBL(L(Exit8), L(ExitTable))
> > -	.int	JMPTBL(L(Exit9), L(ExitTable))
> > -	.int	JMPTBL(L(Exit10), L(ExitTable))
> > -	.int	JMPTBL(L(Exit11), L(ExitTable))
> > -	.int	JMPTBL(L(Exit12), L(ExitTable))
> > -	.int	JMPTBL(L(Exit13), L(ExitTable))
> > -	.int	JMPTBL(L(Exit14), L(ExitTable))
> > -	.int	JMPTBL(L(Exit15), L(ExitTable))
> > -	.int	JMPTBL(L(Exit16), L(ExitTable))
> > -	.int	JMPTBL(L(Exit17), L(ExitTable))
> > -	.int	JMPTBL(L(Exit18), L(ExitTable))
> > -	.int	JMPTBL(L(Exit19), L(ExitTable))
> > -	.int	JMPTBL(L(Exit20), L(ExitTable))
> > -	.int	JMPTBL(L(Exit21), L(ExitTable))
> > -	.int	JMPTBL(L(Exit22), L(ExitTable))
> > -	.int    JMPTBL(L(Exit23), L(ExitTable))
> > -	.int	JMPTBL(L(Exit24), L(ExitTable))
> > -	.int	JMPTBL(L(Exit25), L(ExitTable))
> > -	.int	JMPTBL(L(Exit26), L(ExitTable))
> > -	.int	JMPTBL(L(Exit27), L(ExitTable))
> > -	.int	JMPTBL(L(Exit28), L(ExitTable))
> > -	.int	JMPTBL(L(Exit29), L(ExitTable))
> > -	.int	JMPTBL(L(Exit30), L(ExitTable))
> > -	.int	JMPTBL(L(Exit31), L(ExitTable))
> > -	.int	JMPTBL(L(Exit32), L(ExitTable))
> > -# ifdef USE_AS_STRNCPY
> > -L(ExitStrncpyTable):
> > -	.int	JMPTBL(L(StrncpyExit0), L(ExitStrncpyTable))
> > -	.int	JMPTBL(L(StrncpyExit1), L(ExitStrncpyTable))
> > -	.int	JMPTBL(L(StrncpyExit2), L(ExitStrncpyTable))
> > -	.int	JMPTBL(L(StrncpyExit3), L(ExitStrncpyTable))
> > -	.int	JMPTBL(L(StrncpyExit4), L(ExitStrncpyTable))
> > -	.int	JMPTBL(L(StrncpyExit5), L(ExitStrncpyTable))
> > -	.int	JMPTBL(L(StrncpyExit6), L(ExitStrncpyTable))
> > -	.int	JMPTBL(L(StrncpyExit7), L(ExitStrncpyTable))
> > -	.int	JMPTBL(L(StrncpyExit8), L(ExitStrncpyTable))
> > -	.int	JMPTBL(L(StrncpyExit9), L(ExitStrncpyTable))
> > -	.int	JMPTBL(L(StrncpyExit10), L(ExitStrncpyTable))
> > -	.int	JMPTBL(L(StrncpyExit11), L(ExitStrncpyTable))
> > -	.int	JMPTBL(L(StrncpyExit12), L(ExitStrncpyTable))
> > -	.int	JMPTBL(L(StrncpyExit13), L(ExitStrncpyTable))
> > -	.int	JMPTBL(L(StrncpyExit14), L(ExitStrncpyTable))
> > -	.int	JMPTBL(L(StrncpyExit15), L(ExitStrncpyTable))
> > -	.int	JMPTBL(L(StrncpyExit16), L(ExitStrncpyTable))
> > -	.int	JMPTBL(L(StrncpyExit17), L(ExitStrncpyTable))
> > -	.int	JMPTBL(L(StrncpyExit18), L(ExitStrncpyTable))
> > -	.int	JMPTBL(L(StrncpyExit19), L(ExitStrncpyTable))
> > -	.int	JMPTBL(L(StrncpyExit20), L(ExitStrncpyTable))
> > -	.int	JMPTBL(L(StrncpyExit21), L(ExitStrncpyTable))
> > -	.int	JMPTBL(L(StrncpyExit22), L(ExitStrncpyTable))
> > -	.int    JMPTBL(L(StrncpyExit23), L(ExitStrncpyTable))
> > -	.int	JMPTBL(L(StrncpyExit24), L(ExitStrncpyTable))
> > -	.int	JMPTBL(L(StrncpyExit25), L(ExitStrncpyTable))
> > -	.int	JMPTBL(L(StrncpyExit26), L(ExitStrncpyTable))
> > -	.int	JMPTBL(L(StrncpyExit27), L(ExitStrncpyTable))
> > -	.int	JMPTBL(L(StrncpyExit28), L(ExitStrncpyTable))
> > -	.int	JMPTBL(L(StrncpyExit29), L(ExitStrncpyTable))
> > -	.int	JMPTBL(L(StrncpyExit30), L(ExitStrncpyTable))
> > -	.int	JMPTBL(L(StrncpyExit31), L(ExitStrncpyTable))
> > -	.int	JMPTBL(L(StrncpyExit32), L(ExitStrncpyTable))
> > -	.int	JMPTBL(L(StrncpyExit33), L(ExitStrncpyTable))
> > -#  ifndef USE_AS_STRCAT
> > -	.p2align 4
> > -L(FillTable):
> > -	.int	JMPTBL(L(Fill0), L(FillTable))
> > -	.int	JMPTBL(L(Fill1), L(FillTable))
> > -	.int	JMPTBL(L(Fill2), L(FillTable))
> > -	.int	JMPTBL(L(Fill3), L(FillTable))
> > -	.int	JMPTBL(L(Fill4), L(FillTable))
> > -	.int	JMPTBL(L(Fill5), L(FillTable))
> > -	.int	JMPTBL(L(Fill6), L(FillTable))
> > -	.int	JMPTBL(L(Fill7), L(FillTable))
> > -	.int	JMPTBL(L(Fill8), L(FillTable))
> > -	.int	JMPTBL(L(Fill9), L(FillTable))
> > -	.int	JMPTBL(L(Fill10), L(FillTable))
> > -	.int	JMPTBL(L(Fill11), L(FillTable))
> > -	.int	JMPTBL(L(Fill12), L(FillTable))
> > -	.int	JMPTBL(L(Fill13), L(FillTable))
> > -	.int	JMPTBL(L(Fill14), L(FillTable))
> > -	.int	JMPTBL(L(Fill15), L(FillTable))
> > -	.int	JMPTBL(L(Fill16), L(FillTable))
> > -#  endif
> > -# endif
> > -#endif
> > +#define AS_STRCPY
> > +#define STPCPY __strcpy_sse2_unaligned
> > +#include "stpcpy-sse2-unaligned.S"
> > diff --git a/sysdeps/x86_64/multiarch/strcpy.S b/sysdeps/x86_64/multiarch/strcpy.S
> > index 9464ee8..92be04c 100644
> > --- a/sysdeps/x86_64/multiarch/strcpy.S
> > +++ b/sysdeps/x86_64/multiarch/strcpy.S
> > @@ -28,31 +28,18 @@
> >  #endif
> >  
> >  #ifdef USE_AS_STPCPY
> > -# ifdef USE_AS_STRNCPY
> > -#  define STRCPY_SSSE3		__stpncpy_ssse3
> > -#  define STRCPY_SSE2		__stpncpy_sse2
> > -#  define STRCPY_SSE2_UNALIGNED __stpncpy_sse2_unaligned
> > -#  define __GI_STRCPY		__GI_stpncpy
> > -#  define __GI___STRCPY		__GI___stpncpy
> > -# else
> >  #  define STRCPY_SSSE3		__stpcpy_ssse3
> >  #  define STRCPY_SSE2		__stpcpy_sse2
> > +#  define STRCPY_AVX2		__stpcpy_avx2
> >  #  define STRCPY_SSE2_UNALIGNED	__stpcpy_sse2_unaligned
> >  #  define __GI_STRCPY		__GI_stpcpy
> >  #  define __GI___STRCPY		__GI___stpcpy
> > -# endif
> >  #else
> > -# ifdef USE_AS_STRNCPY
> > -#  define STRCPY_SSSE3		__strncpy_ssse3
> > -#  define STRCPY_SSE2		__strncpy_sse2
> > -#  define STRCPY_SSE2_UNALIGNED	__strncpy_sse2_unaligned
> > -#  define __GI_STRCPY		__GI_strncpy
> > -# else
> >  #  define STRCPY_SSSE3		__strcpy_ssse3
> > +#  define STRCPY_AVX2		__strcpy_avx2
> >  #  define STRCPY_SSE2		__strcpy_sse2
> >  #  define STRCPY_SSE2_UNALIGNED	__strcpy_sse2_unaligned
> >  #  define __GI_STRCPY		__GI_strcpy
> > -# endif
> >  #endif
> >  
> >  
> > @@ -64,7 +51,10 @@ ENTRY(STRCPY)
> >  	cmpl	$0, __cpu_features+KIND_OFFSET(%rip)
> >  	jne	1f
> >  	call	__init_cpu_features
> > -1:	leaq	STRCPY_SSE2_UNALIGNED(%rip), %rax
> > +1:	leaq	STRCPY_AVX2(%rip), %rax
> > +	testl   $bit_AVX_Fast_Unaligned_Load, __cpu_features+FEATURE_OFFSET+index_AVX_Fast_Unaligned_Load(%rip)
> > +	jnz	2f
> > +	leaq	STRCPY_SSE2_UNALIGNED(%rip), %rax
> >  	testl	$bit_Fast_Unaligned_Load, __cpu_features+FEATURE_OFFSET+index_Fast_Unaligned_Load(%rip)
> >  	jnz	2f
> >  	leaq	STRCPY_SSE2(%rip), %rax
> > diff --git a/sysdeps/x86_64/multiarch/strncpy-sse2-unaligned.S b/sysdeps/x86_64/multiarch/strncpy-sse2-unaligned.S
> > index fcc23a7..e4c98e7 100644
> > --- a/sysdeps/x86_64/multiarch/strncpy-sse2-unaligned.S
> > +++ b/sysdeps/x86_64/multiarch/strncpy-sse2-unaligned.S
> > @@ -1,3 +1,1888 @@
> > -#define USE_AS_STRNCPY
> > -#define STRCPY __strncpy_sse2_unaligned
> > -#include "strcpy-sse2-unaligned.S"
> > +/* strcpy with SSE2 and unaligned load
> > +   Copyright (C) 2011-2015 Free Software Foundation, Inc.
> > +   Contributed by Intel Corporation.
> > +   This file is part of the GNU C Library.
> > +
> > +   The GNU C Library is free software; you can redistribute it and/or
> > +   modify it under the terms of the GNU Lesser General Public
> > +   License as published by the Free Software Foundation; either
> > +   version 2.1 of the License, or (at your option) any later version.
> > +
> > +   The GNU C Library is distributed in the hope that it will be useful,
> > +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> > +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> > +   Lesser General Public License for more details.
> > +
> > +   You should have received a copy of the GNU Lesser General Public
> > +   License along with the GNU C Library; if not, see
> > +   <http://www.gnu.org/licenses/>.  */
> > +
> > +#if IS_IN (libc)
> > +
> > +# ifndef USE_AS_STRCAT
> > +#  include <sysdep.h>
> > +
> > +#  ifndef STRCPY
> > +#   define STRCPY  __strncpy_sse2_unaligned
> > +#  endif
> > +
> > +# define USE_AS_STRNCPY
> > +# endif
> > +
> > +# define JMPTBL(I, B)	I - B
> > +# define BRANCH_TO_JMPTBL_ENTRY(TABLE, INDEX, SCALE)             \
> > +	lea	TABLE(%rip), %r11;                              \
> > +	movslq	(%r11, INDEX, SCALE), %rcx;                     \
> > +	lea	(%r11, %rcx), %rcx;                             \
> > +	jmp	*%rcx
> > +
> > +# ifndef USE_AS_STRCAT
> > +
> > +.text
> > +ENTRY (STRCPY)
> > +#  ifdef USE_AS_STRNCPY
> > +	mov	%rdx, %r8
> > +	test	%r8, %r8
> > +	jz	L(ExitZero)
> > +#  endif
> > +	mov	%rsi, %rcx
> > +#  ifndef USE_AS_STPCPY
> > +	mov	%rdi, %rax      /* save result */
> > +#  endif
> > +
> > +# endif
> > +
> > +	and	$63, %rcx
> > +	cmp	$32, %rcx
> > +	jbe	L(SourceStringAlignmentLess32)
> > +
> > +	and	$-16, %rsi
> > +	and	$15, %rcx
> > +	pxor	%xmm0, %xmm0
> > +	pxor	%xmm1, %xmm1
> > +
> > +	pcmpeqb	(%rsi), %xmm1
> > +	pmovmskb %xmm1, %rdx
> > +	shr	%cl, %rdx
> > +
> > +# ifdef USE_AS_STRNCPY
> > +#  if defined USE_AS_STPCPY || defined USE_AS_STRCAT
> > +	mov	$16, %r10
> > +	sub	%rcx, %r10
> > +	cmp	%r10, %r8
> > +#  else
> > +	mov	$17, %r10
> > +	sub	%rcx, %r10
> > +	cmp	%r10, %r8
> > +#  endif
> > +	jbe	L(CopyFrom1To16BytesTailCase2OrCase3)
> > +# endif
> > +	test	%rdx, %rdx
> > +	jnz	L(CopyFrom1To16BytesTail)
> > +
> > +	pcmpeqb	16(%rsi), %xmm0
> > +	pmovmskb %xmm0, %rdx
> > +
> > +# ifdef USE_AS_STRNCPY
> > +	add	$16, %r10
> > +	cmp	%r10, %r8
> > +	jbe	L(CopyFrom1To32BytesCase2OrCase3)
> > +# endif
> > +	test	%rdx, %rdx
> > +	jnz	L(CopyFrom1To32Bytes)
> > +
> > +	movdqu	(%rsi, %rcx), %xmm1   /* copy 16 bytes */
> > +	movdqu	%xmm1, (%rdi)
> > +
> > +/* If source address alignment != destination address alignment */
> > +	.p2align 4
> > +L(Unalign16Both):
> > +	sub	%rcx, %rdi
> > +# ifdef USE_AS_STRNCPY
> > +	add	%rcx, %r8
> > +# endif
> > +	mov	$16, %rcx
> > +	movdqa	(%rsi, %rcx), %xmm1
> > +	movaps	16(%rsi, %rcx), %xmm2
> > +	movdqu	%xmm1, (%rdi, %rcx)
> > +	pcmpeqb	%xmm2, %xmm0
> > +	pmovmskb %xmm0, %rdx
> > +	add	$16, %rcx
> > +# ifdef USE_AS_STRNCPY
> > +	sub	$48, %r8
> > +	jbe	L(CopyFrom1To16BytesCase2OrCase3)
> > +# endif
> > +	test	%rdx, %rdx
> > +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > +	jnz	L(CopyFrom1To16BytesUnalignedXmm2)
> > +# else
> > +	jnz	L(CopyFrom1To16Bytes)
> > +# endif
> > +
> > +	movaps	16(%rsi, %rcx), %xmm3
> > +	movdqu	%xmm2, (%rdi, %rcx)
> > +	pcmpeqb	%xmm3, %xmm0
> > +	pmovmskb %xmm0, %rdx
> > +	add	$16, %rcx
> > +# ifdef USE_AS_STRNCPY
> > +	sub	$16, %r8
> > +	jbe	L(CopyFrom1To16BytesCase2OrCase3)
> > +# endif
> > +	test	%rdx, %rdx
> > +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > +	jnz	L(CopyFrom1To16BytesUnalignedXmm3)
> > +# else
> > +	jnz	L(CopyFrom1To16Bytes)
> > +# endif
> > +
> > +	movaps	16(%rsi, %rcx), %xmm4
> > +	movdqu	%xmm3, (%rdi, %rcx)
> > +	pcmpeqb	%xmm4, %xmm0
> > +	pmovmskb %xmm0, %rdx
> > +	add	$16, %rcx
> > +# ifdef USE_AS_STRNCPY
> > +	sub	$16, %r8
> > +	jbe	L(CopyFrom1To16BytesCase2OrCase3)
> > +# endif
> > +	test	%rdx, %rdx
> > +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > +	jnz	L(CopyFrom1To16BytesUnalignedXmm4)
> > +# else
> > +	jnz	L(CopyFrom1To16Bytes)
> > +# endif
> > +
> > +	movaps	16(%rsi, %rcx), %xmm1
> > +	movdqu	%xmm4, (%rdi, %rcx)
> > +	pcmpeqb	%xmm1, %xmm0
> > +	pmovmskb %xmm0, %rdx
> > +	add	$16, %rcx
> > +# ifdef USE_AS_STRNCPY
> > +	sub	$16, %r8
> > +	jbe	L(CopyFrom1To16BytesCase2OrCase3)
> > +# endif
> > +	test	%rdx, %rdx
> > +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > +	jnz	L(CopyFrom1To16BytesUnalignedXmm1)
> > +# else
> > +	jnz	L(CopyFrom1To16Bytes)
> > +# endif
> > +
> > +	movaps	16(%rsi, %rcx), %xmm2
> > +	movdqu	%xmm1, (%rdi, %rcx)
> > +	pcmpeqb	%xmm2, %xmm0
> > +	pmovmskb %xmm0, %rdx
> > +	add	$16, %rcx
> > +# ifdef USE_AS_STRNCPY
> > +	sub	$16, %r8
> > +	jbe	L(CopyFrom1To16BytesCase2OrCase3)
> > +# endif
> > +	test	%rdx, %rdx
> > +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > +	jnz	L(CopyFrom1To16BytesUnalignedXmm2)
> > +# else
> > +	jnz	L(CopyFrom1To16Bytes)
> > +# endif
> > +
> > +	movaps	16(%rsi, %rcx), %xmm3
> > +	movdqu	%xmm2, (%rdi, %rcx)
> > +	pcmpeqb	%xmm3, %xmm0
> > +	pmovmskb %xmm0, %rdx
> > +	add	$16, %rcx
> > +# ifdef USE_AS_STRNCPY
> > +	sub	$16, %r8
> > +	jbe	L(CopyFrom1To16BytesCase2OrCase3)
> > +# endif
> > +	test	%rdx, %rdx
> > +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > +	jnz	L(CopyFrom1To16BytesUnalignedXmm3)
> > +# else
> > +	jnz	L(CopyFrom1To16Bytes)
> > +# endif
> > +
> > +	movdqu	%xmm3, (%rdi, %rcx)
> > +	mov	%rsi, %rdx
> > +	lea	16(%rsi, %rcx), %rsi
> > +	and	$-0x40, %rsi
> > +	sub	%rsi, %rdx
> > +	sub	%rdx, %rdi
> > +# ifdef USE_AS_STRNCPY
> > +	lea	128(%r8, %rdx), %r8
> > +# endif
> > +L(Unaligned64Loop):
> > +	movaps	(%rsi), %xmm2
> > +	movaps	%xmm2, %xmm4
> > +	movaps	16(%rsi), %xmm5
> > +	movaps	32(%rsi), %xmm3
> > +	movaps	%xmm3, %xmm6
> > +	movaps	48(%rsi), %xmm7
> > +	pminub	%xmm5, %xmm2
> > +	pminub	%xmm7, %xmm3
> > +	pminub	%xmm2, %xmm3
> > +	pcmpeqb	%xmm0, %xmm3
> > +	pmovmskb %xmm3, %rdx
> > +# ifdef USE_AS_STRNCPY
> > +	sub	$64, %r8
> > +	jbe	L(UnalignedLeaveCase2OrCase3)
> > +# endif
> > +	test	%rdx, %rdx
> > +	jnz	L(Unaligned64Leave)
> > +
> > +L(Unaligned64Loop_start):
> > +	add	$64, %rdi
> > +	add	$64, %rsi
> > +	movdqu	%xmm4, -64(%rdi)
> > +	movaps	(%rsi), %xmm2
> > +	movdqa	%xmm2, %xmm4
> > +	movdqu	%xmm5, -48(%rdi)
> > +	movaps	16(%rsi), %xmm5
> > +	pminub	%xmm5, %xmm2
> > +	movaps	32(%rsi), %xmm3
> > +	movdqu	%xmm6, -32(%rdi)
> > +	movaps	%xmm3, %xmm6
> > +	movdqu	%xmm7, -16(%rdi)
> > +	movaps	48(%rsi), %xmm7
> > +	pminub	%xmm7, %xmm3
> > +	pminub	%xmm2, %xmm3
> > +	pcmpeqb	%xmm0, %xmm3
> > +	pmovmskb %xmm3, %rdx
> > +# ifdef USE_AS_STRNCPY
> > +	sub	$64, %r8
> > +	jbe	L(UnalignedLeaveCase2OrCase3)
> > +# endif
> > +	test	%rdx, %rdx
> > +	jz	L(Unaligned64Loop_start)
> > +
> > +L(Unaligned64Leave):
> > +	pxor	%xmm1, %xmm1
> > +
> > +	pcmpeqb	%xmm4, %xmm0
> > +	pcmpeqb	%xmm5, %xmm1
> > +	pmovmskb %xmm0, %rdx
> > +	pmovmskb %xmm1, %rcx
> > +	test	%rdx, %rdx
> > +	jnz	L(CopyFrom1To16BytesUnaligned_0)
> > +	test	%rcx, %rcx
> > +	jnz	L(CopyFrom1To16BytesUnaligned_16)
> > +
> > +	pcmpeqb	%xmm6, %xmm0
> > +	pcmpeqb	%xmm7, %xmm1
> > +	pmovmskb %xmm0, %rdx
> > +	pmovmskb %xmm1, %rcx
> > +	test	%rdx, %rdx
> > +	jnz	L(CopyFrom1To16BytesUnaligned_32)
> > +
> > +	bsf	%rcx, %rdx
> > +	movdqu	%xmm4, (%rdi)
> > +	movdqu	%xmm5, 16(%rdi)
> > +	movdqu	%xmm6, 32(%rdi)
> > +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > +# ifdef USE_AS_STPCPY
> > +	lea	48(%rdi, %rdx), %rax
> > +# endif
> > +	movdqu	%xmm7, 48(%rdi)
> > +	add	$15, %r8
> > +	sub	%rdx, %r8
> > +	lea	49(%rdi, %rdx), %rdi
> > +	jmp	L(StrncpyFillTailWithZero)
> > +# else
> > +	add	$48, %rsi
> > +	add	$48, %rdi
> > +	BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4)
> > +# endif
> > +
> > +/* If source address alignment == destination address alignment */
> > +
> > +L(SourceStringAlignmentLess32):
> > +	pxor	%xmm0, %xmm0
> > +	movdqu	(%rsi), %xmm1
> > +	movdqu	16(%rsi), %xmm2
> > +	pcmpeqb	%xmm1, %xmm0
> > +	pmovmskb %xmm0, %rdx
> > +
> > +# ifdef USE_AS_STRNCPY
> > +#  if defined USE_AS_STPCPY || defined USE_AS_STRCAT
> > +	cmp	$16, %r8
> > +#  else
> > +	cmp	$17, %r8
> > +#  endif
> > +	jbe	L(CopyFrom1To16BytesTail1Case2OrCase3)
> > +# endif
> > +	test	%rdx, %rdx
> > +	jnz	L(CopyFrom1To16BytesTail1)
> > +
> > +	pcmpeqb	%xmm2, %xmm0
> > +	movdqu	%xmm1, (%rdi)
> > +	pmovmskb %xmm0, %rdx
> > +
> > +# ifdef USE_AS_STRNCPY
> > +#  if defined USE_AS_STPCPY || defined USE_AS_STRCAT
> > +	cmp	$32, %r8
> > +#  else
> > +	cmp	$33, %r8
> > +#  endif
> > +	jbe	L(CopyFrom1To32Bytes1Case2OrCase3)
> > +# endif
> > +	test	%rdx, %rdx
> > +	jnz	L(CopyFrom1To32Bytes1)
> > +
> > +	and	$-16, %rsi
> > +	and	$15, %rcx
> > +	jmp	L(Unalign16Both)
> > +
> > +/*------End of main part with loops---------------------*/
> > +
> > +/* Case1 */
> > +
> > +# if (!defined USE_AS_STRNCPY) || (defined USE_AS_STRCAT)
> > +	.p2align 4
> > +L(CopyFrom1To16Bytes):
> > +	add	%rcx, %rdi
> > +	add	%rcx, %rsi
> > +	bsf	%rdx, %rdx
> > +	BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4)
> > +# endif
> > +	.p2align 4
> > +L(CopyFrom1To16BytesTail):
> > +	add	%rcx, %rsi
> > +	bsf	%rdx, %rdx
> > +	BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4)
> > +
> > +	.p2align 4
> > +L(CopyFrom1To32Bytes1):
> > +	add	$16, %rsi
> > +	add	$16, %rdi
> > +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > +	sub	$16, %r8
> > +# endif
> > +L(CopyFrom1To16BytesTail1):
> > +	bsf	%rdx, %rdx
> > +	BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4)
> > +
> > +	.p2align 4
> > +L(CopyFrom1To32Bytes):
> > +	bsf	%rdx, %rdx
> > +	add	%rcx, %rsi
> > +	add	$16, %rdx
> > +	sub	%rcx, %rdx
> > +	BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4)
> > +
> > +	.p2align 4
> > +L(CopyFrom1To16BytesUnaligned_0):
> > +	bsf	%rdx, %rdx
> > +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > +# ifdef USE_AS_STPCPY
> > +	lea	(%rdi, %rdx), %rax
> > +# endif
> > +	movdqu	%xmm4, (%rdi)
> > +	add	$63, %r8
> > +	sub	%rdx, %r8
> > +	lea	1(%rdi, %rdx), %rdi
> > +	jmp	L(StrncpyFillTailWithZero)
> > +# else
> > +	BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4)
> > +# endif
> > +
> > +	.p2align 4
> > +L(CopyFrom1To16BytesUnaligned_16):
> > +	bsf	%rcx, %rdx
> > +	movdqu	%xmm4, (%rdi)
> > +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > +# ifdef USE_AS_STPCPY
> > +	lea	16(%rdi, %rdx), %rax
> > +# endif
> > +	movdqu	%xmm5, 16(%rdi)
> > +	add	$47, %r8
> > +	sub	%rdx, %r8
> > +	lea	17(%rdi, %rdx), %rdi
> > +	jmp	L(StrncpyFillTailWithZero)
> > +# else
> > +	add	$16, %rsi
> > +	add	$16, %rdi
> > +	BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4)
> > +# endif
> > +
> > +	.p2align 4
> > +L(CopyFrom1To16BytesUnaligned_32):
> > +	bsf	%rdx, %rdx
> > +	movdqu	%xmm4, (%rdi)
> > +	movdqu	%xmm5, 16(%rdi)
> > +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > +# ifdef USE_AS_STPCPY
> > +	lea	32(%rdi, %rdx), %rax
> > +# endif
> > +	movdqu	%xmm6, 32(%rdi)
> > +	add	$31, %r8
> > +	sub	%rdx, %r8
> > +	lea	33(%rdi, %rdx), %rdi
> > +	jmp	L(StrncpyFillTailWithZero)
> > +# else
> > +	add	$32, %rsi
> > +	add	$32, %rdi
> > +	BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4)
> > +# endif
> > +
> > +# ifdef USE_AS_STRNCPY
> > +#  ifndef USE_AS_STRCAT
> > +	.p2align 4
> > +L(CopyFrom1To16BytesUnalignedXmm6):
> > +	movdqu	%xmm6, (%rdi, %rcx)
> > +	jmp	L(CopyFrom1To16BytesXmmExit)
> > +
> > +	.p2align 4
> > +L(CopyFrom1To16BytesUnalignedXmm5):
> > +	movdqu	%xmm5, (%rdi, %rcx)
> > +	jmp	L(CopyFrom1To16BytesXmmExit)
> > +
> > +	.p2align 4
> > +L(CopyFrom1To16BytesUnalignedXmm4):
> > +	movdqu	%xmm4, (%rdi, %rcx)
> > +	jmp	L(CopyFrom1To16BytesXmmExit)
> > +
> > +	.p2align 4
> > +L(CopyFrom1To16BytesUnalignedXmm3):
> > +	movdqu	%xmm3, (%rdi, %rcx)
> > +	jmp	L(CopyFrom1To16BytesXmmExit)
> > +
> > +	.p2align 4
> > +L(CopyFrom1To16BytesUnalignedXmm1):
> > +	movdqu	%xmm1, (%rdi, %rcx)
> > +	jmp	L(CopyFrom1To16BytesXmmExit)
> > +#  endif
> > +
> > +	.p2align 4
> > +L(CopyFrom1To16BytesExit):
> > +	BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4)
> > +
> > +/* Case2 */
> > +
> > +	.p2align 4
> > +L(CopyFrom1To16BytesCase2):
> > +	add	$16, %r8
> > +	add	%rcx, %rdi
> > +	add	%rcx, %rsi
> > +	bsf	%rdx, %rdx
> > +	cmp	%r8, %rdx
> > +	jb	L(CopyFrom1To16BytesExit)
> > +	BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4)
> > +
> > +	.p2align 4
> > +L(CopyFrom1To32BytesCase2):
> > +	add	%rcx, %rsi
> > +	bsf	%rdx, %rdx
> > +	add	$16, %rdx
> > +	sub	%rcx, %rdx
> > +	cmp	%r8, %rdx
> > +	jb	L(CopyFrom1To16BytesExit)
> > +	BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4)
> > +
> > +L(CopyFrom1To16BytesTailCase2):
> > +	add	%rcx, %rsi
> > +	bsf	%rdx, %rdx
> > +	cmp	%r8, %rdx
> > +	jb	L(CopyFrom1To16BytesExit)
> > +	BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4)
> > +
> > +L(CopyFrom1To16BytesTail1Case2):
> > +	bsf	%rdx, %rdx
> > +	cmp	%r8, %rdx
> > +	jb	L(CopyFrom1To16BytesExit)
> > +	BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4)
> > +
> > +/* Case2 or Case3,  Case3 */
> > +
> > +	.p2align 4
> > +L(CopyFrom1To16BytesCase2OrCase3):
> > +	test	%rdx, %rdx
> > +	jnz	L(CopyFrom1To16BytesCase2)
> > +L(CopyFrom1To16BytesCase3):
> > +	add	$16, %r8
> > +	add	%rcx, %rdi
> > +	add	%rcx, %rsi
> > +	BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4)
> > +
> > +	.p2align 4
> > +L(CopyFrom1To32BytesCase2OrCase3):
> > +	test	%rdx, %rdx
> > +	jnz	L(CopyFrom1To32BytesCase2)
> > +	add	%rcx, %rsi
> > +	BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4)
> > +
> > +	.p2align 4
> > +L(CopyFrom1To16BytesTailCase2OrCase3):
> > +	test	%rdx, %rdx
> > +	jnz	L(CopyFrom1To16BytesTailCase2)
> > +	add	%rcx, %rsi
> > +	BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4)
> > +
> > +	.p2align 4
> > +L(CopyFrom1To32Bytes1Case2OrCase3):
> > +	add	$16, %rdi
> > +	add	$16, %rsi
> > +	sub	$16, %r8
> > +L(CopyFrom1To16BytesTail1Case2OrCase3):
> > +	test	%rdx, %rdx
> > +	jnz	L(CopyFrom1To16BytesTail1Case2)
> > +	BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4)
> > +
> > +# endif
> > +
> > +/*------------End labels regarding with copying 1-16 bytes--and 1-32 bytes----*/
> > +
> > +	.p2align 4
> > +L(Exit1):
> > +	mov	%dh, (%rdi)
> > +# ifdef USE_AS_STPCPY
> > +	lea	(%rdi), %rax
> > +# endif
> > +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > +	sub	$1, %r8
> > +	lea	1(%rdi), %rdi
> > +	jnz	L(StrncpyFillTailWithZero)
> > +# endif
> > +	ret
> > +
> > +	.p2align 4
> > +L(Exit2):
> > +	mov	(%rsi), %dx
> > +	mov	%dx, (%rdi)
> > +# ifdef USE_AS_STPCPY
> > +	lea	1(%rdi), %rax
> > +# endif
> > +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > +	sub	$2, %r8
> > +	lea	2(%rdi), %rdi
> > +	jnz	L(StrncpyFillTailWithZero)
> > +# endif
> > +	ret
> > +
> > +	.p2align 4
> > +L(Exit3):
> > +	mov	(%rsi), %cx
> > +	mov	%cx, (%rdi)
> > +	mov	%dh, 2(%rdi)
> > +# ifdef USE_AS_STPCPY
> > +	lea	2(%rdi), %rax
> > +# endif
> > +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > +	sub	$3, %r8
> > +	lea	3(%rdi), %rdi
> > +	jnz	L(StrncpyFillTailWithZero)
> > +# endif
> > +	ret
> > +
> > +	.p2align 4
> > +L(Exit4):
> > +	mov	(%rsi), %edx
> > +	mov	%edx, (%rdi)
> > +# ifdef USE_AS_STPCPY
> > +	lea	3(%rdi), %rax
> > +# endif
> > +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > +	sub	$4, %r8
> > +	lea	4(%rdi), %rdi
> > +	jnz	L(StrncpyFillTailWithZero)
> > +# endif
> > +	ret
> > +
> > +	.p2align 4
> > +L(Exit5):
> > +	mov	(%rsi), %ecx
> > +	mov	%dh, 4(%rdi)
> > +	mov	%ecx, (%rdi)
> > +# ifdef USE_AS_STPCPY
> > +	lea	4(%rdi), %rax
> > +# endif
> > +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > +	sub	$5, %r8
> > +	lea	5(%rdi), %rdi
> > +	jnz	L(StrncpyFillTailWithZero)
> > +# endif
> > +	ret
> > +
> > +	.p2align 4
> > +L(Exit6):
> > +	mov	(%rsi), %ecx
> > +	mov	4(%rsi), %dx
> > +	mov	%ecx, (%rdi)
> > +	mov	%dx, 4(%rdi)
> > +# ifdef USE_AS_STPCPY
> > +	lea	5(%rdi), %rax
> > +# endif
> > +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > +	sub	$6, %r8
> > +	lea	6(%rdi), %rdi
> > +	jnz	L(StrncpyFillTailWithZero)
> > +# endif
> > +	ret
> > +
> > +	.p2align 4
> > +L(Exit7):
> > +	mov	(%rsi), %ecx
> > +	mov	3(%rsi), %edx
> > +	mov	%ecx, (%rdi)
> > +	mov	%edx, 3(%rdi)
> > +# ifdef USE_AS_STPCPY
> > +	lea	6(%rdi), %rax
> > +# endif
> > +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > +	sub	$7, %r8
> > +	lea	7(%rdi), %rdi
> > +	jnz	L(StrncpyFillTailWithZero)
> > +# endif
> > +	ret
> > +
> > +	.p2align 4
> > +L(Exit8):
> > +	mov	(%rsi), %rdx
> > +	mov	%rdx, (%rdi)
> > +# ifdef USE_AS_STPCPY
> > +	lea	7(%rdi), %rax
> > +# endif
> > +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > +	sub	$8, %r8
> > +	lea	8(%rdi), %rdi
> > +	jnz	L(StrncpyFillTailWithZero)
> > +# endif
> > +	ret
> > +
> > +	.p2align 4
> > +L(Exit9):
> > +	mov	(%rsi), %rcx
> > +	mov	%dh, 8(%rdi)
> > +	mov	%rcx, (%rdi)
> > +# ifdef USE_AS_STPCPY
> > +	lea	8(%rdi), %rax
> > +# endif
> > +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > +	sub	$9, %r8
> > +	lea	9(%rdi), %rdi
> > +	jnz	L(StrncpyFillTailWithZero)
> > +# endif
> > +	ret
> > +
> > +	.p2align 4
> > +L(Exit10):
> > +	mov	(%rsi), %rcx
> > +	mov	8(%rsi), %dx
> > +	mov	%rcx, (%rdi)
> > +	mov	%dx, 8(%rdi)
> > +# ifdef USE_AS_STPCPY
> > +	lea	9(%rdi), %rax
> > +# endif
> > +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > +	sub	$10, %r8
> > +	lea	10(%rdi), %rdi
> > +	jnz	L(StrncpyFillTailWithZero)
> > +# endif
> > +	ret
> > +
> > +	.p2align 4
> > +L(Exit11):
> > +	mov	(%rsi), %rcx
> > +	mov	7(%rsi), %edx
> > +	mov	%rcx, (%rdi)
> > +	mov	%edx, 7(%rdi)
> > +# ifdef USE_AS_STPCPY
> > +	lea	10(%rdi), %rax
> > +# endif
> > +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > +	sub	$11, %r8
> > +	lea	11(%rdi), %rdi
> > +	jnz	L(StrncpyFillTailWithZero)
> > +# endif
> > +	ret
> > +
> > +	.p2align 4
> > +L(Exit12):
> > +	mov	(%rsi), %rcx
> > +	mov	8(%rsi), %edx
> > +	mov	%rcx, (%rdi)
> > +	mov	%edx, 8(%rdi)
> > +# ifdef USE_AS_STPCPY
> > +	lea	11(%rdi), %rax
> > +# endif
> > +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > +	sub	$12, %r8
> > +	lea	12(%rdi), %rdi
> > +	jnz	L(StrncpyFillTailWithZero)
> > +# endif
> > +	ret
> > +
> > +	.p2align 4
> > +L(Exit13):
> > +	mov	(%rsi), %rcx
> > +	mov	5(%rsi), %rdx
> > +	mov	%rcx, (%rdi)
> > +	mov	%rdx, 5(%rdi)
> > +# ifdef USE_AS_STPCPY
> > +	lea	12(%rdi), %rax
> > +# endif
> > +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > +	sub	$13, %r8
> > +	lea	13(%rdi), %rdi
> > +	jnz	L(StrncpyFillTailWithZero)
> > +# endif
> > +	ret
> > +
> > +	.p2align 4
> > +L(Exit14):
> > +	mov	(%rsi), %rcx
> > +	mov	6(%rsi), %rdx
> > +	mov	%rcx, (%rdi)
> > +	mov	%rdx, 6(%rdi)
> > +# ifdef USE_AS_STPCPY
> > +	lea	13(%rdi), %rax
> > +# endif
> > +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > +	sub	$14, %r8
> > +	lea	14(%rdi), %rdi
> > +	jnz	L(StrncpyFillTailWithZero)
> > +# endif
> > +	ret
> > +
> > +	.p2align 4
> > +L(Exit15):
> > +	mov	(%rsi), %rcx
> > +	mov	7(%rsi), %rdx
> > +	mov	%rcx, (%rdi)
> > +	mov	%rdx, 7(%rdi)
> > +# ifdef USE_AS_STPCPY
> > +	lea	14(%rdi), %rax
> > +# endif
> > +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > +	sub	$15, %r8
> > +	lea	15(%rdi), %rdi
> > +	jnz	L(StrncpyFillTailWithZero)
> > +# endif
> > +	ret
> > +
> > +	.p2align 4
> > +L(Exit16):
> > +	movdqu	(%rsi), %xmm0
> > +	movdqu	%xmm0, (%rdi)
> > +# ifdef USE_AS_STPCPY
> > +	lea	15(%rdi), %rax
> > +# endif
> > +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > +	sub	$16, %r8
> > +	lea	16(%rdi), %rdi
> > +	jnz	L(StrncpyFillTailWithZero)
> > +# endif
> > +	ret
> > +
> > +	.p2align 4
> > +L(Exit17):
> > +	movdqu	(%rsi), %xmm0
> > +	movdqu	%xmm0, (%rdi)
> > +	mov	%dh, 16(%rdi)
> > +# ifdef USE_AS_STPCPY
> > +	lea	16(%rdi), %rax
> > +# endif
> > +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > +	sub	$17, %r8
> > +	lea	17(%rdi), %rdi
> > +	jnz	L(StrncpyFillTailWithZero)
> > +# endif
> > +	ret
> > +
> > +	.p2align 4
> > +L(Exit18):
> > +	movdqu	(%rsi), %xmm0
> > +	mov	16(%rsi), %cx
> > +	movdqu	%xmm0, (%rdi)
> > +	mov	%cx, 16(%rdi)
> > +# ifdef USE_AS_STPCPY
> > +	lea	17(%rdi), %rax
> > +# endif
> > +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > +	sub	$18, %r8
> > +	lea	18(%rdi), %rdi
> > +	jnz	L(StrncpyFillTailWithZero)
> > +# endif
> > +	ret
> > +
> > +	.p2align 4
> > +L(Exit19):
> > +	movdqu	(%rsi), %xmm0
> > +	mov	15(%rsi), %ecx
> > +	movdqu	%xmm0, (%rdi)
> > +	mov	%ecx, 15(%rdi)
> > +# ifdef USE_AS_STPCPY
> > +	lea	18(%rdi), %rax
> > +# endif
> > +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > +	sub	$19, %r8
> > +	lea	19(%rdi), %rdi
> > +	jnz	L(StrncpyFillTailWithZero)
> > +# endif
> > +	ret
> > +
> > +	.p2align 4
> > +L(Exit20):
> > +	movdqu	(%rsi), %xmm0
> > +	mov	16(%rsi), %ecx
> > +	movdqu	%xmm0, (%rdi)
> > +	mov	%ecx, 16(%rdi)
> > +# ifdef USE_AS_STPCPY
> > +	lea	19(%rdi), %rax
> > +# endif
> > +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > +	sub	$20, %r8
> > +	lea	20(%rdi), %rdi
> > +	jnz	L(StrncpyFillTailWithZero)
> > +# endif
> > +	ret
> > +
> > +	.p2align 4
> > +L(Exit21):
> > +	movdqu	(%rsi), %xmm0
> > +	mov	16(%rsi), %ecx
> > +	movdqu	%xmm0, (%rdi)
> > +	mov	%ecx, 16(%rdi)
> > +	mov	%dh, 20(%rdi)
> > +# ifdef USE_AS_STPCPY
> > +	lea	20(%rdi), %rax
> > +# endif
> > +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > +	sub	$21, %r8
> > +	lea	21(%rdi), %rdi
> > +	jnz	L(StrncpyFillTailWithZero)
> > +# endif
> > +	ret
> > +
> > +	.p2align 4
> > +L(Exit22):
> > +	movdqu	(%rsi), %xmm0
> > +	mov	14(%rsi), %rcx
> > +	movdqu	%xmm0, (%rdi)
> > +	mov	%rcx, 14(%rdi)
> > +# ifdef USE_AS_STPCPY
> > +	lea	21(%rdi), %rax
> > +# endif
> > +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > +	sub	$22, %r8
> > +	lea	22(%rdi), %rdi
> > +	jnz	L(StrncpyFillTailWithZero)
> > +# endif
> > +	ret
> > +
> > +	.p2align 4
> > +L(Exit23):
> > +	movdqu	(%rsi), %xmm0
> > +	mov	15(%rsi), %rcx
> > +	movdqu	%xmm0, (%rdi)
> > +	mov	%rcx, 15(%rdi)
> > +# ifdef USE_AS_STPCPY
> > +	lea	22(%rdi), %rax
> > +# endif
> > +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > +	sub	$23, %r8
> > +	lea	23(%rdi), %rdi
> > +	jnz	L(StrncpyFillTailWithZero)
> > +# endif
> > +	ret
> > +
> > +	.p2align 4
> > +L(Exit24):
> > +	movdqu	(%rsi), %xmm0
> > +	mov	16(%rsi), %rcx
> > +	movdqu	%xmm0, (%rdi)
> > +	mov	%rcx, 16(%rdi)
> > +# ifdef USE_AS_STPCPY
> > +	lea	23(%rdi), %rax
> > +# endif
> > +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > +	sub	$24, %r8
> > +	lea	24(%rdi), %rdi
> > +	jnz	L(StrncpyFillTailWithZero)
> > +# endif
> > +	ret
> > +
> > +	.p2align 4
> > +L(Exit25):
> > +	movdqu	(%rsi), %xmm0
> > +	mov	16(%rsi), %rcx
> > +	movdqu	%xmm0, (%rdi)
> > +	mov	%rcx, 16(%rdi)
> > +	mov	%dh, 24(%rdi)
> > +# ifdef USE_AS_STPCPY
> > +	lea	24(%rdi), %rax
> > +# endif
> > +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > +	sub	$25, %r8
> > +	lea	25(%rdi), %rdi
> > +	jnz	L(StrncpyFillTailWithZero)
> > +# endif
> > +	ret
> > +
> > +	.p2align 4
> > +L(Exit26):
> > +	movdqu	(%rsi), %xmm0
> > +	mov	16(%rsi), %rdx
> > +	mov	24(%rsi), %cx
> > +	movdqu	%xmm0, (%rdi)
> > +	mov	%rdx, 16(%rdi)
> > +	mov	%cx, 24(%rdi)
> > +# ifdef USE_AS_STPCPY
> > +	lea	25(%rdi), %rax
> > +# endif
> > +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > +	sub	$26, %r8
> > +	lea	26(%rdi), %rdi
> > +	jnz	L(StrncpyFillTailWithZero)
> > +# endif
> > +	ret
> > +
> > +	.p2align 4
> > +L(Exit27):
> > +	movdqu	(%rsi), %xmm0
> > +	mov	16(%rsi), %rdx
> > +	mov	23(%rsi), %ecx
> > +	movdqu	%xmm0, (%rdi)
> > +	mov	%rdx, 16(%rdi)
> > +	mov	%ecx, 23(%rdi)
> > +# ifdef USE_AS_STPCPY
> > +	lea	26(%rdi), %rax
> > +# endif
> > +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > +	sub	$27, %r8
> > +	lea	27(%rdi), %rdi
> > +	jnz	L(StrncpyFillTailWithZero)
> > +# endif
> > +	ret
> > +
> > +	.p2align 4
> > +L(Exit28):
> > +	movdqu	(%rsi), %xmm0
> > +	mov	16(%rsi), %rdx
> > +	mov	24(%rsi), %ecx
> > +	movdqu	%xmm0, (%rdi)
> > +	mov	%rdx, 16(%rdi)
> > +	mov	%ecx, 24(%rdi)
> > +# ifdef USE_AS_STPCPY
> > +	lea	27(%rdi), %rax
> > +# endif
> > +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > +	sub	$28, %r8
> > +	lea	28(%rdi), %rdi
> > +	jnz	L(StrncpyFillTailWithZero)
> > +# endif
> > +	ret
> > +
> > +	.p2align 4
> > +L(Exit29):
> > +	movdqu	(%rsi), %xmm0
> > +	movdqu	13(%rsi), %xmm2
> > +	movdqu	%xmm0, (%rdi)
> > +	movdqu	%xmm2, 13(%rdi)
> > +# ifdef USE_AS_STPCPY
> > +	lea	28(%rdi), %rax
> > +# endif
> > +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > +	sub	$29, %r8
> > +	lea	29(%rdi), %rdi
> > +	jnz	L(StrncpyFillTailWithZero)
> > +# endif
> > +	ret
> > +
> > +	.p2align 4
> > +L(Exit30):
> > +	movdqu	(%rsi), %xmm0
> > +	movdqu	14(%rsi), %xmm2
> > +	movdqu	%xmm0, (%rdi)
> > +	movdqu	%xmm2, 14(%rdi)
> > +# ifdef USE_AS_STPCPY
> > +	lea	29(%rdi), %rax
> > +# endif
> > +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > +	sub	$30, %r8
> > +	lea	30(%rdi), %rdi
> > +	jnz	L(StrncpyFillTailWithZero)
> > +# endif
> > +	ret
> > +
> > +	.p2align 4
> > +L(Exit31):
> > +	movdqu	(%rsi), %xmm0
> > +	movdqu	15(%rsi), %xmm2
> > +	movdqu	%xmm0, (%rdi)
> > +	movdqu	%xmm2, 15(%rdi)
> > +# ifdef USE_AS_STPCPY
> > +	lea	30(%rdi), %rax
> > +# endif
> > +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > +	sub	$31, %r8
> > +	lea	31(%rdi), %rdi
> > +	jnz	L(StrncpyFillTailWithZero)
> > +# endif
> > +	ret
> > +
> > +	.p2align 4
> > +L(Exit32):
> > +	movdqu	(%rsi), %xmm0
> > +	movdqu	16(%rsi), %xmm2
> > +	movdqu	%xmm0, (%rdi)
> > +	movdqu	%xmm2, 16(%rdi)
> > +# ifdef USE_AS_STPCPY
> > +	lea	31(%rdi), %rax
> > +# endif
> > +# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
> > +	sub	$32, %r8
> > +	lea	32(%rdi), %rdi
> > +	jnz	L(StrncpyFillTailWithZero)
> > +# endif
> > +	ret
> > +
> > +# ifdef USE_AS_STRNCPY
> > +
> > +	.p2align 4
> > +L(StrncpyExit0):
> > +#  ifdef USE_AS_STPCPY
> > +	mov	%rdi, %rax
> > +#  endif
> > +#  ifdef USE_AS_STRCAT
> > +	xor	%ch, %ch
> > +	movb	%ch, (%rdi)
> > +#  endif
> > +	ret
> > +
> > +	.p2align 4
> > +L(StrncpyExit1):
> > +	mov	(%rsi), %dl
> > +	mov	%dl, (%rdi)
> > +#  ifdef USE_AS_STPCPY
> > +	lea	1(%rdi), %rax
> > +#  endif
> > +#  ifdef USE_AS_STRCAT
> > +	xor	%ch, %ch
> > +	movb	%ch, 1(%rdi)
> > +#  endif
> > +	ret
> > +
> > +	.p2align 4
> > +L(StrncpyExit2):
> > +	mov	(%rsi), %dx
> > +	mov	%dx, (%rdi)
> > +#  ifdef USE_AS_STPCPY
> > +	lea	2(%rdi), %rax
> > +#  endif
> > +#  ifdef USE_AS_STRCAT
> > +	xor	%ch, %ch
> > +	movb	%ch, 2(%rdi)
> > +#  endif
> > +	ret
> > +
> > +	.p2align 4
> > +L(StrncpyExit3):
> > +	mov	(%rsi), %cx
> > +	mov	2(%rsi), %dl
> > +	mov	%cx, (%rdi)
> > +	mov	%dl, 2(%rdi)
> > +#  ifdef USE_AS_STPCPY
> > +	lea	3(%rdi), %rax
> > +#  endif
> > +#  ifdef USE_AS_STRCAT
> > +	xor	%ch, %ch
> > +	movb	%ch, 3(%rdi)
> > +#  endif
> > +	ret
> > +
> > +	.p2align 4
> > +L(StrncpyExit4):
> > +	mov	(%rsi), %edx
> > +	mov	%edx, (%rdi)
> > +#  ifdef USE_AS_STPCPY
> > +	lea	4(%rdi), %rax
> > +#  endif
> > +#  ifdef USE_AS_STRCAT
> > +	xor	%ch, %ch
> > +	movb	%ch, 4(%rdi)
> > +#  endif
> > +	ret
> > +
> > +	.p2align 4
> > +L(StrncpyExit5):
> > +	mov	(%rsi), %ecx
> > +	mov	4(%rsi), %dl
> > +	mov	%ecx, (%rdi)
> > +	mov	%dl, 4(%rdi)
> > +#  ifdef USE_AS_STPCPY
> > +	lea	5(%rdi), %rax
> > +#  endif
> > +#  ifdef USE_AS_STRCAT
> > +	xor	%ch, %ch
> > +	movb	%ch, 5(%rdi)
> > +#  endif
> > +	ret
> > +
> > +	.p2align 4
> > +L(StrncpyExit6):
> > +	mov	(%rsi), %ecx
> > +	mov	4(%rsi), %dx
> > +	mov	%ecx, (%rdi)
> > +	mov	%dx, 4(%rdi)
> > +#  ifdef USE_AS_STPCPY
> > +	lea	6(%rdi), %rax
> > +#  endif
> > +#  ifdef USE_AS_STRCAT
> > +	xor	%ch, %ch
> > +	movb	%ch, 6(%rdi)
> > +#  endif
> > +	ret
> > +
> > +	.p2align 4
> > +L(StrncpyExit7):
> > +	mov	(%rsi), %ecx
> > +	mov	3(%rsi), %edx
> > +	mov	%ecx, (%rdi)
> > +	mov	%edx, 3(%rdi)
> > +#  ifdef USE_AS_STPCPY
> > +	lea	7(%rdi), %rax
> > +#  endif
> > +#  ifdef USE_AS_STRCAT
> > +	xor	%ch, %ch
> > +	movb	%ch, 7(%rdi)
> > +#  endif
> > +	ret
> > +
> > +	.p2align 4
> > +L(StrncpyExit8):
> > +	mov	(%rsi), %rdx
> > +	mov	%rdx, (%rdi)
> > +#  ifdef USE_AS_STPCPY
> > +	lea	8(%rdi), %rax
> > +#  endif
> > +#  ifdef USE_AS_STRCAT
> > +	xor	%ch, %ch
> > +	movb	%ch, 8(%rdi)
> > +#  endif
> > +	ret
> > +
> > +	.p2align 4
> > +L(StrncpyExit9):
> > +	mov	(%rsi), %rcx
> > +	mov	8(%rsi), %dl
> > +	mov	%rcx, (%rdi)
> > +	mov	%dl, 8(%rdi)
> > +#  ifdef USE_AS_STPCPY
> > +	lea	9(%rdi), %rax
> > +#  endif
> > +#  ifdef USE_AS_STRCAT
> > +	xor	%ch, %ch
> > +	movb	%ch, 9(%rdi)
> > +#  endif
> > +	ret
> > +
> > +	.p2align 4
> > +L(StrncpyExit10):
> > +	mov	(%rsi), %rcx
> > +	mov	8(%rsi), %dx
> > +	mov	%rcx, (%rdi)
> > +	mov	%dx, 8(%rdi)
> > +#  ifdef USE_AS_STPCPY
> > +	lea	10(%rdi), %rax
> > +#  endif
> > +#  ifdef USE_AS_STRCAT
> > +	xor	%ch, %ch
> > +	movb	%ch, 10(%rdi)
> > +#  endif
> > +	ret
> > +
> > +	.p2align 4
> > +L(StrncpyExit11):
> > +	mov	(%rsi), %rcx
> > +	mov	7(%rsi), %edx
> > +	mov	%rcx, (%rdi)
> > +	mov	%edx, 7(%rdi)
> > +#  ifdef USE_AS_STPCPY
> > +	lea	11(%rdi), %rax
> > +#  endif
> > +#  ifdef USE_AS_STRCAT
> > +	xor	%ch, %ch
> > +	movb	%ch, 11(%rdi)
> > +#  endif
> > +	ret
> > +
> > +	.p2align 4
> > +L(StrncpyExit12):
> > +	mov	(%rsi), %rcx
> > +	mov	8(%rsi), %edx
> > +	mov	%rcx, (%rdi)
> > +	mov	%edx, 8(%rdi)
> > +#  ifdef USE_AS_STPCPY
> > +	lea	12(%rdi), %rax
> > +#  endif
> > +#  ifdef USE_AS_STRCAT
> > +	xor	%ch, %ch
> > +	movb	%ch, 12(%rdi)
> > +#  endif
> > +	ret
> > +
> > +	.p2align 4
> > +L(StrncpyExit13):
> > +	mov	(%rsi), %rcx
> > +	mov	5(%rsi), %rdx
> > +	mov	%rcx, (%rdi)
> > +	mov	%rdx, 5(%rdi)
> > +#  ifdef USE_AS_STPCPY
> > +	lea	13(%rdi), %rax
> > +#  endif
> > +#  ifdef USE_AS_STRCAT
> > +	xor	%ch, %ch
> > +	movb	%ch, 13(%rdi)
> > +#  endif
> > +	ret
> > +
> > +	.p2align 4
> > +L(StrncpyExit14):
> > +	mov	(%rsi), %rcx
> > +	mov	6(%rsi), %rdx
> > +	mov	%rcx, (%rdi)
> > +	mov	%rdx, 6(%rdi)
> > +#  ifdef USE_AS_STPCPY
> > +	lea	14(%rdi), %rax
> > +#  endif
> > +#  ifdef USE_AS_STRCAT
> > +	xor	%ch, %ch
> > +	movb	%ch, 14(%rdi)
> > +#  endif
> > +	ret
> > +
> > +	.p2align 4
> > +L(StrncpyExit15):
> > +	mov	(%rsi), %rcx
> > +	mov	7(%rsi), %rdx
> > +	mov	%rcx, (%rdi)
> > +	mov	%rdx, 7(%rdi)
> > +#  ifdef USE_AS_STPCPY
> > +	lea	15(%rdi), %rax
> > +#  endif
> > +#  ifdef USE_AS_STRCAT
> > +	xor	%ch, %ch
> > +	movb	%ch, 15(%rdi)
> > +#  endif
> > +	ret
> > +
> > +	.p2align 4
> > +L(StrncpyExit16):
> > +	movdqu	(%rsi), %xmm0
> > +	movdqu	%xmm0, (%rdi)
> > +#  ifdef USE_AS_STPCPY
> > +	lea	16(%rdi), %rax
> > +#  endif
> > +#  ifdef USE_AS_STRCAT
> > +	xor	%ch, %ch
> > +	movb	%ch, 16(%rdi)
> > +#  endif
> > +	ret
> > +
> > +	.p2align 4
> > +L(StrncpyExit17):
> > +	movdqu	(%rsi), %xmm0
> > +	mov	16(%rsi), %cl
> > +	movdqu	%xmm0, (%rdi)
> > +	mov	%cl, 16(%rdi)
> > +#  ifdef USE_AS_STPCPY
> > +	lea	17(%rdi), %rax
> > +#  endif
> > +#  ifdef USE_AS_STRCAT
> > +	xor	%ch, %ch
> > +	movb	%ch, 17(%rdi)
> > +#  endif
> > +	ret
> > +
> > +	.p2align 4
> > +L(StrncpyExit18):
> > +	movdqu	(%rsi), %xmm0
> > +	mov	16(%rsi), %cx
> > +	movdqu	%xmm0, (%rdi)
> > +	mov	%cx, 16(%rdi)
> > +#  ifdef USE_AS_STPCPY
> > +	lea	18(%rdi), %rax
> > +#  endif
> > +#  ifdef USE_AS_STRCAT
> > +	xor	%ch, %ch
> > +	movb	%ch, 18(%rdi)
> > +#  endif
> > +	ret
> > +
> > +	.p2align 4
> > +L(StrncpyExit19):
> > +	movdqu	(%rsi), %xmm0
> > +	mov	15(%rsi), %ecx
> > +	movdqu	%xmm0, (%rdi)
> > +	mov	%ecx, 15(%rdi)
> > +#  ifdef USE_AS_STPCPY
> > +	lea	19(%rdi), %rax
> > +#  endif
> > +#  ifdef USE_AS_STRCAT
> > +	xor	%ch, %ch
> > +	movb	%ch, 19(%rdi)
> > +#  endif
> > +	ret
> > +
> > +	.p2align 4
> > +L(StrncpyExit20):
> > +	movdqu	(%rsi), %xmm0
> > +	mov	16(%rsi), %ecx
> > +	movdqu	%xmm0, (%rdi)
> > +	mov	%ecx, 16(%rdi)
> > +#  ifdef USE_AS_STPCPY
> > +	lea	20(%rdi), %rax
> > +#  endif
> > +#  ifdef USE_AS_STRCAT
> > +	xor	%ch, %ch
> > +	movb	%ch, 20(%rdi)
> > +#  endif
> > +	ret
> > +
> > +	.p2align 4
> > +L(StrncpyExit21):
> > +	movdqu	(%rsi), %xmm0
> > +	mov	16(%rsi), %ecx
> > +	mov	20(%rsi), %dl
> > +	movdqu	%xmm0, (%rdi)
> > +	mov	%ecx, 16(%rdi)
> > +	mov	%dl, 20(%rdi)
> > +#  ifdef USE_AS_STPCPY
> > +	lea	21(%rdi), %rax
> > +#  endif
> > +#  ifdef USE_AS_STRCAT
> > +	xor	%ch, %ch
> > +	movb	%ch, 21(%rdi)
> > +#  endif
> > +	ret
> > +
> > +	.p2align 4
> > +L(StrncpyExit22):
> > +	movdqu	(%rsi), %xmm0
> > +	mov	14(%rsi), %rcx
> > +	movdqu	%xmm0, (%rdi)
> > +	mov	%rcx, 14(%rdi)
> > +#  ifdef USE_AS_STPCPY
> > +	lea	22(%rdi), %rax
> > +#  endif
> > +#  ifdef USE_AS_STRCAT
> > +	xor	%ch, %ch
> > +	movb	%ch, 22(%rdi)
> > +#  endif
> > +	ret
> > +
> > +	.p2align 4
> > +L(StrncpyExit23):
> > +	movdqu	(%rsi), %xmm0
> > +	mov	15(%rsi), %rcx
> > +	movdqu	%xmm0, (%rdi)
> > +	mov	%rcx, 15(%rdi)
> > +#  ifdef USE_AS_STPCPY
> > +	lea	23(%rdi), %rax
> > +#  endif
> > +#  ifdef USE_AS_STRCAT
> > +	xor	%ch, %ch
> > +	movb	%ch, 23(%rdi)
> > +#  endif
> > +	ret
> > +
> > +	.p2align 4
> > +L(StrncpyExit24):
> > +	movdqu	(%rsi), %xmm0
> > +	mov	16(%rsi), %rcx
> > +	movdqu	%xmm0, (%rdi)
> > +	mov	%rcx, 16(%rdi)
> > +#  ifdef USE_AS_STPCPY
> > +	lea	24(%rdi), %rax
> > +#  endif
> > +#  ifdef USE_AS_STRCAT
> > +	xor	%ch, %ch
> > +	movb	%ch, 24(%rdi)
> > +#  endif
> > +	ret
> > +
> > +	.p2align 4
> > +L(StrncpyExit25):
> > +	movdqu	(%rsi), %xmm0
> > +	mov	16(%rsi), %rdx
> > +	mov	24(%rsi), %cl
> > +	movdqu	%xmm0, (%rdi)
> > +	mov	%rdx, 16(%rdi)
> > +	mov	%cl, 24(%rdi)
> > +#  ifdef USE_AS_STPCPY
> > +	lea	25(%rdi), %rax
> > +#  endif
> > +#  ifdef USE_AS_STRCAT
> > +	xor	%ch, %ch
> > +	movb	%ch, 25(%rdi)
> > +#  endif
> > +	ret
> > +
> > +	.p2align 4
> > +L(StrncpyExit26):
> > +	movdqu	(%rsi), %xmm0
> > +	mov	16(%rsi), %rdx
> > +	mov	24(%rsi), %cx
> > +	movdqu	%xmm0, (%rdi)
> > +	mov	%rdx, 16(%rdi)
> > +	mov	%cx, 24(%rdi)
> > +#  ifdef USE_AS_STPCPY
> > +	lea	26(%rdi), %rax
> > +#  endif
> > +#  ifdef USE_AS_STRCAT
> > +	xor	%ch, %ch
> > +	movb	%ch, 26(%rdi)
> > +#  endif
> > +	ret
> > +
> > +	.p2align 4
> > +L(StrncpyExit27):
> > +	movdqu	(%rsi), %xmm0
> > +	mov	16(%rsi), %rdx
> > +	mov	23(%rsi), %ecx
> > +	movdqu	%xmm0, (%rdi)
> > +	mov	%rdx, 16(%rdi)
> > +	mov	%ecx, 23(%rdi)
> > +#  ifdef USE_AS_STPCPY
> > +	lea	27(%rdi), %rax
> > +#  endif
> > +#  ifdef USE_AS_STRCAT
> > +	xor	%ch, %ch
> > +	movb	%ch, 27(%rdi)
> > +#  endif
> > +	ret
> > +
> > +	.p2align 4
> > +L(StrncpyExit28):
> > +	movdqu	(%rsi), %xmm0
> > +	mov	16(%rsi), %rdx
> > +	mov	24(%rsi), %ecx
> > +	movdqu	%xmm0, (%rdi)
> > +	mov	%rdx, 16(%rdi)
> > +	mov	%ecx, 24(%rdi)
> > +#  ifdef USE_AS_STPCPY
> > +	lea	28(%rdi), %rax
> > +#  endif
> > +#  ifdef USE_AS_STRCAT
> > +	xor	%ch, %ch
> > +	movb	%ch, 28(%rdi)
> > +#  endif
> > +	ret
> > +
> > +	.p2align 4
> > +L(StrncpyExit29):
> > +	movdqu	(%rsi), %xmm0
> > +	movdqu	13(%rsi), %xmm2
> > +	movdqu	%xmm0, (%rdi)
> > +	movdqu	%xmm2, 13(%rdi)
> > +#  ifdef USE_AS_STPCPY
> > +	lea	29(%rdi), %rax
> > +#  endif
> > +#  ifdef USE_AS_STRCAT
> > +	xor	%ch, %ch
> > +	movb	%ch, 29(%rdi)
> > +#  endif
> > +	ret
> > +
> > +	.p2align 4
> > +L(StrncpyExit30):
> > +	movdqu	(%rsi), %xmm0
> > +	movdqu	14(%rsi), %xmm2
> > +	movdqu	%xmm0, (%rdi)
> > +	movdqu	%xmm2, 14(%rdi)
> > +#  ifdef USE_AS_STPCPY
> > +	lea	30(%rdi), %rax
> > +#  endif
> > +#  ifdef USE_AS_STRCAT
> > +	xor	%ch, %ch
> > +	movb	%ch, 30(%rdi)
> > +#  endif
> > +	ret
> > +
> > +	.p2align 4
> > +L(StrncpyExit31):
> > +	movdqu	(%rsi), %xmm0
> > +	movdqu	15(%rsi), %xmm2
> > +	movdqu	%xmm0, (%rdi)
> > +	movdqu	%xmm2, 15(%rdi)
> > +#  ifdef USE_AS_STPCPY
> > +	lea	31(%rdi), %rax
> > +#  endif
> > +#  ifdef USE_AS_STRCAT
> > +	xor	%ch, %ch
> > +	movb	%ch, 31(%rdi)
> > +#  endif
> > +	ret
> > +
> > +	.p2align 4
> > +L(StrncpyExit32):
> > +	movdqu	(%rsi), %xmm0
> > +	movdqu	16(%rsi), %xmm2
> > +	movdqu	%xmm0, (%rdi)
> > +	movdqu	%xmm2, 16(%rdi)
> > +#  ifdef USE_AS_STPCPY
> > +	lea	32(%rdi), %rax
> > +#  endif
> > +#  ifdef USE_AS_STRCAT
> > +	xor	%ch, %ch
> > +	movb	%ch, 32(%rdi)
> > +#  endif
> > +	ret
> > +
> > +	.p2align 4
> > +L(StrncpyExit33):
> > +	movdqu	(%rsi), %xmm0
> > +	movdqu	16(%rsi), %xmm2
> > +	mov	32(%rsi), %cl
> > +	movdqu	%xmm0, (%rdi)
> > +	movdqu	%xmm2, 16(%rdi)
> > +	mov	%cl, 32(%rdi)
> > +#  ifdef USE_AS_STRCAT
> > +	xor	%ch, %ch
> > +	movb	%ch, 33(%rdi)
> > +#  endif
> > +	ret
> > +
> > +#  ifndef USE_AS_STRCAT
> > +
> > +	.p2align 4
> > +L(Fill0):
> > +	ret
> > +
> > +	.p2align 4
> > +L(Fill1):
> > +	mov	%dl, (%rdi)
> > +	ret
> > +
> > +	.p2align 4
> > +L(Fill2):
> > +	mov	%dx, (%rdi)
> > +	ret
> > +
> > +	.p2align 4
> > +L(Fill3):
> > +	mov	%edx, -1(%rdi)
> > +	ret
> > +
> > +	.p2align 4
> > +L(Fill4):
> > +	mov	%edx, (%rdi)
> > +	ret
> > +
> > +	.p2align 4
> > +L(Fill5):
> > +	mov	%edx, (%rdi)
> > +	mov	%dl, 4(%rdi)
> > +	ret
> > +
> > +	.p2align 4
> > +L(Fill6):
> > +	mov	%edx, (%rdi)
> > +	mov	%dx, 4(%rdi)
> > +	ret
> > +
> > +	.p2align 4
> > +L(Fill7):
> > +	mov	%rdx, -1(%rdi)
> > +	ret
> > +
> > +	.p2align 4
> > +L(Fill8):
> > +	mov	%rdx, (%rdi)
> > +	ret
> > +
> > +	.p2align 4
> > +L(Fill9):
> > +	mov	%rdx, (%rdi)
> > +	mov	%dl, 8(%rdi)
> > +	ret
> > +
> > +	.p2align 4
> > +L(Fill10):
> > +	mov	%rdx, (%rdi)
> > +	mov	%dx, 8(%rdi)
> > +	ret
> > +
> > +	.p2align 4
> > +L(Fill11):
> > +	mov	%rdx, (%rdi)
> > +	mov	%edx, 7(%rdi)
> > +	ret
> > +
> > +	.p2align 4
> > +L(Fill12):
> > +	mov	%rdx, (%rdi)
> > +	mov	%edx, 8(%rdi)
> > +	ret
> > +
> > +	.p2align 4
> > +L(Fill13):
> > +	mov	%rdx, (%rdi)
> > +	mov	%rdx, 5(%rdi)
> > +	ret
> > +
> > +	.p2align 4
> > +L(Fill14):
> > +	mov	%rdx, (%rdi)
> > +	mov	%rdx, 6(%rdi)
> > +	ret
> > +
> > +	.p2align 4
> > +L(Fill15):
> > +	movdqu	%xmm0, -1(%rdi)
> > +	ret
> > +
> > +	.p2align 4
> > +L(Fill16):
> > +	movdqu	%xmm0, (%rdi)
> > +	ret
> > +
> > +	.p2align 4
> > +L(CopyFrom1To16BytesUnalignedXmm2):
> > +	movdqu	%xmm2, (%rdi, %rcx)
> > +
> > +	.p2align 4
> > +L(CopyFrom1To16BytesXmmExit):
> > +	bsf	%rdx, %rdx
> > +	add	$15, %r8
> > +	add	%rcx, %rdi
> > +#   ifdef USE_AS_STPCPY
> > +	lea	(%rdi, %rdx), %rax
> > +#   endif
> > +	sub	%rdx, %r8
> > +	lea	1(%rdi, %rdx), %rdi
> > +
> > +	.p2align 4
> > +L(StrncpyFillTailWithZero):
> > +	pxor	%xmm0, %xmm0
> > +	xor	%rdx, %rdx
> > +	sub	$16, %r8
> > +	jbe	L(StrncpyFillExit)
> > +
> > +	movdqu	%xmm0, (%rdi)
> > +	add	$16, %rdi
> > +
> > +	mov	%rdi, %rsi
> > +	and	$0xf, %rsi
> > +	sub	%rsi, %rdi
> > +	add	%rsi, %r8
> > +	sub	$64, %r8
> > +	jb	L(StrncpyFillLess64)
> > +
> > +L(StrncpyFillLoopMovdqa):
> > +	movdqa	%xmm0, (%rdi)
> > +	movdqa	%xmm0, 16(%rdi)
> > +	movdqa	%xmm0, 32(%rdi)
> > +	movdqa	%xmm0, 48(%rdi)
> > +	add	$64, %rdi
> > +	sub	$64, %r8
> > +	jae	L(StrncpyFillLoopMovdqa)
> > +
> > +L(StrncpyFillLess64):
> > +	add	$32, %r8
> > +	jl	L(StrncpyFillLess32)
> > +	movdqa	%xmm0, (%rdi)
> > +	movdqa	%xmm0, 16(%rdi)
> > +	add	$32, %rdi
> > +	sub	$16, %r8
> > +	jl	L(StrncpyFillExit)
> > +	movdqa	%xmm0, (%rdi)
> > +	add	$16, %rdi
> > +	BRANCH_TO_JMPTBL_ENTRY (L(FillTable), %r8, 4)
> > +
> > +L(StrncpyFillLess32):
> > +	add	$16, %r8
> > +	jl	L(StrncpyFillExit)
> > +	movdqa	%xmm0, (%rdi)
> > +	add	$16, %rdi
> > +	BRANCH_TO_JMPTBL_ENTRY (L(FillTable), %r8, 4)
> > +
> > +L(StrncpyFillExit):
> > +	add	$16, %r8
> > +	BRANCH_TO_JMPTBL_ENTRY (L(FillTable), %r8, 4)
> > +
> > +/* end of ifndef USE_AS_STRCAT */
> > +#  endif
> > +
> > +	.p2align 4
> > +L(UnalignedLeaveCase2OrCase3):
> > +	test	%rdx, %rdx
> > +	jnz	L(Unaligned64LeaveCase2)
> > +L(Unaligned64LeaveCase3):
> > +	lea	64(%r8), %rcx
> > +	and	$-16, %rcx
> > +	add	$48, %r8
> > +	jl	L(CopyFrom1To16BytesCase3)
> > +	movdqu	%xmm4, (%rdi)
> > +	sub	$16, %r8
> > +	jb	L(CopyFrom1To16BytesCase3)
> > +	movdqu	%xmm5, 16(%rdi)
> > +	sub	$16, %r8
> > +	jb	L(CopyFrom1To16BytesCase3)
> > +	movdqu	%xmm6, 32(%rdi)
> > +	sub	$16, %r8
> > +	jb	L(CopyFrom1To16BytesCase3)
> > +	movdqu	%xmm7, 48(%rdi)
> > +#  ifdef USE_AS_STPCPY
> > +	lea	64(%rdi), %rax
> > +#  endif
> > +#  ifdef USE_AS_STRCAT
> > +	xor	%ch, %ch
> > +	movb	%ch, 64(%rdi)
> > +#  endif
> > +	ret
> > +
> > +	.p2align 4
> > +L(Unaligned64LeaveCase2):
> > +	xor	%rcx, %rcx
> > +	pcmpeqb	%xmm4, %xmm0
> > +	pmovmskb %xmm0, %rdx
> > +	add	$48, %r8
> > +	jle	L(CopyFrom1To16BytesCase2OrCase3)
> > +	test	%rdx, %rdx
> > +#  ifndef USE_AS_STRCAT
> > +	jnz	L(CopyFrom1To16BytesUnalignedXmm4)
> > +#  else
> > +	jnz	L(CopyFrom1To16Bytes)
> > +#  endif
> > +	pcmpeqb	%xmm5, %xmm0
> > +	pmovmskb %xmm0, %rdx
> > +	movdqu	%xmm4, (%rdi)
> > +	add	$16, %rcx
> > +	sub	$16, %r8
> > +	jbe	L(CopyFrom1To16BytesCase2OrCase3)
> > +	test	%rdx, %rdx
> > +#  ifndef USE_AS_STRCAT
> > +	jnz	L(CopyFrom1To16BytesUnalignedXmm5)
> > +#  else
> > +	jnz	L(CopyFrom1To16Bytes)
> > +#  endif
> > +
> > +	pcmpeqb	%xmm6, %xmm0
> > +	pmovmskb %xmm0, %rdx
> > +	movdqu	%xmm5, 16(%rdi)
> > +	add	$16, %rcx
> > +	sub	$16, %r8
> > +	jbe	L(CopyFrom1To16BytesCase2OrCase3)
> > +	test	%rdx, %rdx
> > +#  ifndef USE_AS_STRCAT
> > +	jnz	L(CopyFrom1To16BytesUnalignedXmm6)
> > +#  else
> > +	jnz	L(CopyFrom1To16Bytes)
> > +#  endif
> > +
> > +	pcmpeqb	%xmm7, %xmm0
> > +	pmovmskb %xmm0, %rdx
> > +	movdqu	%xmm6, 32(%rdi)
> > +	lea	16(%rdi, %rcx), %rdi
> > +	lea	16(%rsi, %rcx), %rsi
> > +	bsf	%rdx, %rdx
> > +	cmp	%r8, %rdx
> > +	jb	L(CopyFrom1To16BytesExit)
> > +	BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4)
> > +
> > +	.p2align 4
> > +L(ExitZero):
> > +#  ifndef USE_AS_STRCAT
> > +	mov	%rdi, %rax
> > +#  endif
> > +	ret
> > +
> > +# endif
> > +
> > +# ifndef USE_AS_STRCAT
> > +END (STRCPY)
> > +# else
> > +END (STRCAT)
> > +# endif
> > +	.p2align 4
> > +	.section .rodata
> > +L(ExitTable):
> > +	.int	JMPTBL(L(Exit1), L(ExitTable))
> > +	.int	JMPTBL(L(Exit2), L(ExitTable))
> > +	.int	JMPTBL(L(Exit3), L(ExitTable))
> > +	.int	JMPTBL(L(Exit4), L(ExitTable))
> > +	.int	JMPTBL(L(Exit5), L(ExitTable))
> > +	.int	JMPTBL(L(Exit6), L(ExitTable))
> > +	.int	JMPTBL(L(Exit7), L(ExitTable))
> > +	.int	JMPTBL(L(Exit8), L(ExitTable))
> > +	.int	JMPTBL(L(Exit9), L(ExitTable))
> > +	.int	JMPTBL(L(Exit10), L(ExitTable))
> > +	.int	JMPTBL(L(Exit11), L(ExitTable))
> > +	.int	JMPTBL(L(Exit12), L(ExitTable))
> > +	.int	JMPTBL(L(Exit13), L(ExitTable))
> > +	.int	JMPTBL(L(Exit14), L(ExitTable))
> > +	.int	JMPTBL(L(Exit15), L(ExitTable))
> > +	.int	JMPTBL(L(Exit16), L(ExitTable))
> > +	.int	JMPTBL(L(Exit17), L(ExitTable))
> > +	.int	JMPTBL(L(Exit18), L(ExitTable))
> > +	.int	JMPTBL(L(Exit19), L(ExitTable))
> > +	.int	JMPTBL(L(Exit20), L(ExitTable))
> > +	.int	JMPTBL(L(Exit21), L(ExitTable))
> > +	.int	JMPTBL(L(Exit22), L(ExitTable))
> > +	.int    JMPTBL(L(Exit23), L(ExitTable))
> > +	.int	JMPTBL(L(Exit24), L(ExitTable))
> > +	.int	JMPTBL(L(Exit25), L(ExitTable))
> > +	.int	JMPTBL(L(Exit26), L(ExitTable))
> > +	.int	JMPTBL(L(Exit27), L(ExitTable))
> > +	.int	JMPTBL(L(Exit28), L(ExitTable))
> > +	.int	JMPTBL(L(Exit29), L(ExitTable))
> > +	.int	JMPTBL(L(Exit30), L(ExitTable))
> > +	.int	JMPTBL(L(Exit31), L(ExitTable))
> > +	.int	JMPTBL(L(Exit32), L(ExitTable))
> > +# ifdef USE_AS_STRNCPY
> > +L(ExitStrncpyTable):
> > +	.int	JMPTBL(L(StrncpyExit0), L(ExitStrncpyTable))
> > +	.int	JMPTBL(L(StrncpyExit1), L(ExitStrncpyTable))
> > +	.int	JMPTBL(L(StrncpyExit2), L(ExitStrncpyTable))
> > +	.int	JMPTBL(L(StrncpyExit3), L(ExitStrncpyTable))
> > +	.int	JMPTBL(L(StrncpyExit4), L(ExitStrncpyTable))
> > +	.int	JMPTBL(L(StrncpyExit5), L(ExitStrncpyTable))
> > +	.int	JMPTBL(L(StrncpyExit6), L(ExitStrncpyTable))
> > +	.int	JMPTBL(L(StrncpyExit7), L(ExitStrncpyTable))
> > +	.int	JMPTBL(L(StrncpyExit8), L(ExitStrncpyTable))
> > +	.int	JMPTBL(L(StrncpyExit9), L(ExitStrncpyTable))
> > +	.int	JMPTBL(L(StrncpyExit10), L(ExitStrncpyTable))
> > +	.int	JMPTBL(L(StrncpyExit11), L(ExitStrncpyTable))
> > +	.int	JMPTBL(L(StrncpyExit12), L(ExitStrncpyTable))
> > +	.int	JMPTBL(L(StrncpyExit13), L(ExitStrncpyTable))
> > +	.int	JMPTBL(L(StrncpyExit14), L(ExitStrncpyTable))
> > +	.int	JMPTBL(L(StrncpyExit15), L(ExitStrncpyTable))
> > +	.int	JMPTBL(L(StrncpyExit16), L(ExitStrncpyTable))
> > +	.int	JMPTBL(L(StrncpyExit17), L(ExitStrncpyTable))
> > +	.int	JMPTBL(L(StrncpyExit18), L(ExitStrncpyTable))
> > +	.int	JMPTBL(L(StrncpyExit19), L(ExitStrncpyTable))
> > +	.int	JMPTBL(L(StrncpyExit20), L(ExitStrncpyTable))
> > +	.int	JMPTBL(L(StrncpyExit21), L(ExitStrncpyTable))
> > +	.int	JMPTBL(L(StrncpyExit22), L(ExitStrncpyTable))
> > +	.int    JMPTBL(L(StrncpyExit23), L(ExitStrncpyTable))
> > +	.int	JMPTBL(L(StrncpyExit24), L(ExitStrncpyTable))
> > +	.int	JMPTBL(L(StrncpyExit25), L(ExitStrncpyTable))
> > +	.int	JMPTBL(L(StrncpyExit26), L(ExitStrncpyTable))
> > +	.int	JMPTBL(L(StrncpyExit27), L(ExitStrncpyTable))
> > +	.int	JMPTBL(L(StrncpyExit28), L(ExitStrncpyTable))
> > +	.int	JMPTBL(L(StrncpyExit29), L(ExitStrncpyTable))
> > +	.int	JMPTBL(L(StrncpyExit30), L(ExitStrncpyTable))
> > +	.int	JMPTBL(L(StrncpyExit31), L(ExitStrncpyTable))
> > +	.int	JMPTBL(L(StrncpyExit32), L(ExitStrncpyTable))
> > +	.int	JMPTBL(L(StrncpyExit33), L(ExitStrncpyTable))
> > +#  ifndef USE_AS_STRCAT
> > +	.p2align 4
> > +L(FillTable):
> > +	.int	JMPTBL(L(Fill0), L(FillTable))
> > +	.int	JMPTBL(L(Fill1), L(FillTable))
> > +	.int	JMPTBL(L(Fill2), L(FillTable))
> > +	.int	JMPTBL(L(Fill3), L(FillTable))
> > +	.int	JMPTBL(L(Fill4), L(FillTable))
> > +	.int	JMPTBL(L(Fill5), L(FillTable))
> > +	.int	JMPTBL(L(Fill6), L(FillTable))
> > +	.int	JMPTBL(L(Fill7), L(FillTable))
> > +	.int	JMPTBL(L(Fill8), L(FillTable))
> > +	.int	JMPTBL(L(Fill9), L(FillTable))
> > +	.int	JMPTBL(L(Fill10), L(FillTable))
> > +	.int	JMPTBL(L(Fill11), L(FillTable))
> > +	.int	JMPTBL(L(Fill12), L(FillTable))
> > +	.int	JMPTBL(L(Fill13), L(FillTable))
> > +	.int	JMPTBL(L(Fill14), L(FillTable))
> > +	.int	JMPTBL(L(Fill15), L(FillTable))
> > +	.int	JMPTBL(L(Fill16), L(FillTable))
> > +#  endif
> > +# endif
> > +#endif
> > diff --git a/sysdeps/x86_64/multiarch/strncpy.S b/sysdeps/x86_64/multiarch/strncpy.S
> > index 6d87a0b..afbd870 100644
> > --- a/sysdeps/x86_64/multiarch/strncpy.S
> > +++ b/sysdeps/x86_64/multiarch/strncpy.S
> > @@ -1,5 +1,85 @@
> > -/* Multiple versions of strncpy
> > -   All versions must be listed in ifunc-impl-list.c.  */
> > -#define STRCPY strncpy
> > +/* Multiple versions of strcpy
> > +   All versions must be listed in ifunc-impl-list.c.
> > +   Copyright (C) 2009-2015 Free Software Foundation, Inc.
> > +   Contributed by Intel Corporation.
> > +   This file is part of the GNU C Library.
> > +
> > +   The GNU C Library is free software; you can redistribute it and/or
> > +   modify it under the terms of the GNU Lesser General Public
> > +   License as published by the Free Software Foundation; either
> > +   version 2.1 of the License, or (at your option) any later version.
> > +
> > +   The GNU C Library is distributed in the hope that it will be useful,
> > +   but WITHOUT ANY WARRANTY; without even the implied warranty of
> > +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> > +   Lesser General Public License for more details.
> > +
> > +   You should have received a copy of the GNU Lesser General Public
> > +   License along with the GNU C Library; if not, see
> > +   <http://www.gnu.org/licenses/>.  */
> > +
> > +#include <sysdep.h>
> > +#include <init-arch.h>
> > +
> >  #define USE_AS_STRNCPY
> > -#include "strcpy.S"
> > +#ifndef STRNCPY
> > +#define STRNCPY strncpy
> > +#endif
> > +
> > +#ifdef USE_AS_STPCPY
> > +#  define STRNCPY_SSSE3		__stpncpy_ssse3
> > +#  define STRNCPY_SSE2		__stpncpy_sse2
> > +#  define STRNCPY_SSE2_UNALIGNED __stpncpy_sse2_unaligned
> > +#  define __GI_STRNCPY		__GI_stpncpy
> > +#  define __GI___STRNCPY		__GI___stpncpy
> > +#else
> > +#  define STRNCPY_SSSE3		__strncpy_ssse3
> > +#  define STRNCPY_SSE2		__strncpy_sse2
> > +#  define STRNCPY_SSE2_UNALIGNED	__strncpy_sse2_unaligned
> > +#  define __GI_STRNCPY		__GI_strncpy
> > +#endif
> > +
> > +
> > +/* Define multiple versions only for the definition in libc.  */
> > +#if IS_IN (libc)
> > +	.text
> > +ENTRY(STRNCPY)
> > +	.type	STRNCPY, @gnu_indirect_function
> > +	cmpl	$0, __cpu_features+KIND_OFFSET(%rip)
> > +	jne	1f
> > +	call	__init_cpu_features
> > +1:	leaq	STRNCPY_SSE2_UNALIGNED(%rip), %rax
> > +	testl	$bit_Fast_Unaligned_Load, __cpu_features+FEATURE_OFFSET+index_Fast_Unaligned_Load(%rip)
> > +	jnz	2f
> > +	leaq	STRNCPY_SSE2(%rip), %rax
> > +	testl	$bit_SSSE3, __cpu_features+CPUID_OFFSET+index_SSSE3(%rip)
> > +	jz	2f
> > +	leaq	STRNCPY_SSSE3(%rip), %rax
> > +2:	ret
> > +END(STRNCPY)
> > +
> > +# undef ENTRY
> > +# define ENTRY(name) \
> > +	.type STRNCPY_SSE2, @function; \
> > +	.align 16; \
> > +	.globl STRNCPY_SSE2; \
> > +	.hidden STRNCPY_SSE2; \
> > +	STRNCPY_SSE2: cfi_startproc; \
> > +	CALL_MCOUNT
> > +# undef END
> > +# define END(name) \
> > +	cfi_endproc; .size STRNCPY_SSE2, .-STRNCPY_SSE2
> > +# undef libc_hidden_builtin_def
> > +/* It doesn't make sense to send libc-internal strcpy calls through a PLT.
> > +   The speedup we get from using SSSE3 instruction is likely eaten away
> > +   by the indirect call in the PLT.  */
> > +# define libc_hidden_builtin_def(name) \
> > +	.globl __GI_STRNCPY; __GI_STRNCPY = STRNCPY_SSE2
> > +# undef libc_hidden_def
> > +# define libc_hidden_def(name) \
> > +	.globl __GI___STRNCPY; __GI___STRNCPY = STRNCPY_SSE2
> > +#endif
> > +
> > +#ifndef USE_AS_STRNCPY
> > +#include "../strcpy.S"
> > +#endif
> > -- 
> > 1.8.4.rc3
> 
> -- 
> 
> Communications satellite used by the military for star wars.
diff mbox

Patch

diff --git a/sysdeps/x86_64/multiarch/Makefile b/sysdeps/x86_64/multiarch/Makefile
index d7002a9..c573744 100644
--- a/sysdeps/x86_64/multiarch/Makefile
+++ b/sysdeps/x86_64/multiarch/Makefile
@@ -29,7 +29,7 @@  CFLAGS-strspn-c.c += -msse4
 endif
 
 ifeq (yes,$(config-cflags-avx2))
-sysdep_routines += memset-avx2
+sysdep_routines += memset-avx2 strcpy-avx2 stpcpy-avx2
 endif
 endif
 
diff --git a/sysdeps/x86_64/multiarch/ifunc-impl-list.c b/sysdeps/x86_64/multiarch/ifunc-impl-list.c
index b64e4f1..d398e43 100644
--- a/sysdeps/x86_64/multiarch/ifunc-impl-list.c
+++ b/sysdeps/x86_64/multiarch/ifunc-impl-list.c
@@ -88,6 +88,7 @@  __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
 
   /* Support sysdeps/x86_64/multiarch/stpcpy.S.  */
   IFUNC_IMPL (i, name, stpcpy,
+	      IFUNC_IMPL_ADD (array, i, strcpy, HAS_AVX2, __stpcpy_avx2)
 	      IFUNC_IMPL_ADD (array, i, stpcpy, HAS_SSSE3, __stpcpy_ssse3)
 	      IFUNC_IMPL_ADD (array, i, stpcpy, 1, __stpcpy_sse2_unaligned)
 	      IFUNC_IMPL_ADD (array, i, stpcpy, 1, __stpcpy_sse2))
@@ -137,6 +138,7 @@  __libc_ifunc_impl_list (const char *name, struct libc_ifunc_impl *array,
 
   /* Support sysdeps/x86_64/multiarch/strcpy.S.  */
   IFUNC_IMPL (i, name, strcpy,
+	      IFUNC_IMPL_ADD (array, i, strcpy, HAS_AVX2, __strcpy_avx2)
 	      IFUNC_IMPL_ADD (array, i, strcpy, HAS_SSSE3, __strcpy_ssse3)
 	      IFUNC_IMPL_ADD (array, i, strcpy, 1, __strcpy_sse2_unaligned)
 	      IFUNC_IMPL_ADD (array, i, strcpy, 1, __strcpy_sse2))
diff --git a/sysdeps/x86_64/multiarch/stpcpy-avx2.S b/sysdeps/x86_64/multiarch/stpcpy-avx2.S
new file mode 100644
index 0000000..bd30ef6
--- /dev/null
+++ b/sysdeps/x86_64/multiarch/stpcpy-avx2.S
@@ -0,0 +1,3 @@ 
+#define USE_AVX2
+#define STPCPY __stpcpy_avx2
+#include "stpcpy-sse2-unaligned.S"
diff --git a/sysdeps/x86_64/multiarch/stpcpy-sse2-unaligned.S b/sysdeps/x86_64/multiarch/stpcpy-sse2-unaligned.S
index 34231f8..695a236 100644
--- a/sysdeps/x86_64/multiarch/stpcpy-sse2-unaligned.S
+++ b/sysdeps/x86_64/multiarch/stpcpy-sse2-unaligned.S
@@ -1,3 +1,436 @@ 
-#define USE_AS_STPCPY
-#define STRCPY __stpcpy_sse2_unaligned
-#include "strcpy-sse2-unaligned.S"
+/* stpcpy with SSE2 and unaligned load
+   Copyright (C) 2015 Free Software Foundation, Inc.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#include <sysdep.h>
+#ifndef STPCPY
+# define STPCPY __stpcpy_sse2_unaligned
+#endif
+
+ENTRY(STPCPY)
+	mov	%esi, %edx
+#ifdef AS_STRCPY
+	movq    %rdi, %rax
+#endif
+	pxor	%xmm4, %xmm4
+	pxor	%xmm5, %xmm5
+	andl	$4095, %edx
+	cmp	$3968, %edx
+	ja	L(cross_page)
+
+	movdqu	(%rsi), %xmm0
+	pcmpeqb	%xmm0, %xmm4
+	pmovmskb %xmm4, %edx
+	testl	%edx, %edx
+	je	L(more16bytes)
+	bsf	%edx, %ecx
+#ifndef AS_STRCPY
+	lea	(%rdi, %rcx), %rax
+#endif
+	cmp	$7, %ecx
+	movq	(%rsi), %rdx
+	jb	L(less_8_bytesb)
+L(8bytes_from_cross):
+	movq	-7(%rsi, %rcx), %rsi
+	movq	%rdx, (%rdi)
+#ifdef AS_STRCPY
+	movq    %rsi, -7(%rdi, %rcx)
+#else
+	movq	%rsi, -7(%rax)
+#endif
+	ret
+
+	.p2align 4
+L(less_8_bytesb):
+	cmp	$2, %ecx
+	jbe	L(less_4_bytes)
+L(4bytes_from_cross):
+	mov	-3(%rsi, %rcx), %esi
+	mov	%edx, (%rdi)
+#ifdef AS_STRCPY
+        mov     %esi, -3(%rdi, %rcx)
+#else
+	mov	%esi, -3(%rax)
+#endif
+	ret
+
+.p2align 4
+ L(less_4_bytes):
+ /*
+  Test branch vs this branchless that works for i 0,1,2
+   d[i] = 0;
+   d[i/2] = s[1];
+   d[0] = s[0];
+  */
+#ifdef AS_STRCPY
+	movb	$0, (%rdi, %rcx)
+#endif
+
+	shr	$1, %ecx
+	mov	%edx, %esi
+	shr	$8, %edx
+	movb	%dl, (%rdi, %rcx)
+#ifndef AS_STRCPY
+	movb	$0, (%rax)
+#endif
+	movb	%sil, (%rdi)
+	ret
+
+
+
+
+
+	.p2align 4
+L(more16bytes):
+	pxor	%xmm6, %xmm6
+	movdqu	16(%rsi), %xmm1
+	pxor	%xmm7, %xmm7
+	pcmpeqb	%xmm1, %xmm5
+	pmovmskb %xmm5, %edx
+	testl	%edx, %edx
+	je	L(more32bytes)
+	bsf	%edx, %edx
+#ifdef AS_STRCPY
+        movdqu  1(%rsi, %rdx), %xmm1
+        movdqu  %xmm0, (%rdi)
+	movdqu  %xmm1, 1(%rdi, %rdx)
+#else
+	lea	16(%rdi, %rdx), %rax
+	movdqu	1(%rsi, %rdx), %xmm1
+	movdqu	%xmm0, (%rdi)
+	movdqu	%xmm1, -15(%rax)
+#endif
+	ret
+
+	.p2align 4
+L(more32bytes):
+	movdqu	32(%rsi), %xmm2
+	movdqu	48(%rsi), %xmm3
+
+	pcmpeqb	%xmm2, %xmm6
+	pcmpeqb	%xmm3, %xmm7
+	pmovmskb %xmm7, %edx
+	shl	$16, %edx
+	pmovmskb %xmm6, %ecx
+	or	%ecx, %edx
+	je	L(more64bytes)
+	bsf	%edx, %edx
+#ifndef AS_STRCPY
+	lea	32(%rdi, %rdx), %rax
+#endif
+	movdqu	1(%rsi, %rdx), %xmm2
+	movdqu	17(%rsi, %rdx), %xmm3
+	movdqu	%xmm0, (%rdi)
+	movdqu	%xmm1, 16(%rdi)
+#ifdef AS_STRCPY
+        movdqu  %xmm2, 1(%rdi, %rdx)
+        movdqu  %xmm3, 17(%rdi, %rdx)
+#else
+	movdqu	%xmm2, -31(%rax)
+	movdqu	%xmm3, -15(%rax)
+#endif
+	ret
+
+	.p2align 4
+L(more64bytes):
+	movdqu	%xmm0, (%rdi)
+	movdqu	%xmm1, 16(%rdi)
+	movdqu	%xmm2, 32(%rdi)
+	movdqu	%xmm3, 48(%rdi)
+	movdqu	64(%rsi), %xmm0
+	movdqu	80(%rsi), %xmm1
+	movdqu	96(%rsi), %xmm2
+	movdqu	112(%rsi), %xmm3
+
+	pcmpeqb	%xmm0, %xmm4
+	pcmpeqb	%xmm1, %xmm5
+	pcmpeqb	%xmm2, %xmm6
+	pcmpeqb	%xmm3, %xmm7
+	pmovmskb %xmm4, %ecx
+	pmovmskb %xmm5, %edx
+	pmovmskb %xmm6, %r8d
+	pmovmskb %xmm7, %r9d
+	shl	$16, %edx
+	or	%ecx, %edx
+	shl	$32, %r8
+	shl	$48, %r9
+	or	%r8, %rdx
+	or	%r9, %rdx
+	test	%rdx, %rdx
+	je	L(prepare_loop)
+	bsf	%rdx, %rdx
+#ifndef AS_STRCPY
+	lea	64(%rdi, %rdx), %rax
+#endif
+	movdqu	1(%rsi, %rdx), %xmm0
+	movdqu	17(%rsi, %rdx), %xmm1
+	movdqu	33(%rsi, %rdx), %xmm2
+	movdqu	49(%rsi, %rdx), %xmm3
+#ifdef AS_STRCPY
+        movdqu  %xmm0, 1(%rdi, %rdx)
+        movdqu  %xmm1, 17(%rdi, %rdx)
+        movdqu  %xmm2, 33(%rdi, %rdx)
+        movdqu  %xmm3, 49(%rdi, %rdx)
+#else
+	movdqu	%xmm0, -63(%rax)
+	movdqu	%xmm1, -47(%rax)
+	movdqu	%xmm2, -31(%rax)
+	movdqu	%xmm3, -15(%rax)
+#endif
+	ret
+
+
+	.p2align 4
+L(prepare_loop):
+	movdqu	%xmm0, 64(%rdi)
+	movdqu	%xmm1, 80(%rdi)
+	movdqu	%xmm2, 96(%rdi)
+	movdqu	%xmm3, 112(%rdi)
+
+	subq	%rsi, %rdi
+	add	$64, %rsi
+	andq	$-64, %rsi
+	addq	%rsi, %rdi
+	jmp	L(loop_entry)
+
+#ifdef USE_AVX2
+	.p2align 4
+L(loop):
+	vmovdqu	%ymm1, (%rdi)
+	vmovdqu	%ymm3, 32(%rdi)
+L(loop_entry):
+	vmovdqa	96(%rsi), %ymm3
+	vmovdqa	64(%rsi), %ymm1
+	vpminub	%ymm3, %ymm1, %ymm2
+	addq	$64, %rsi
+	addq	$64, %rdi
+	vpcmpeqb %ymm5, %ymm2, %ymm0
+	vpmovmskb %ymm0, %edx
+	test	%edx, %edx
+	je	L(loop)
+	salq	$32, %rdx
+	vpcmpeqb %ymm5, %ymm1, %ymm4
+	vpmovmskb %ymm4, %ecx
+	or	%rcx, %rdx
+	bsfq	%rdx, %rdx
+#ifndef AS_STRCPY
+	lea	(%rdi, %rdx), %rax
+#endif
+	vmovdqu	-63(%rsi, %rdx), %ymm0
+	vmovdqu	-31(%rsi, %rdx), %ymm2
+#ifdef AS_STRCPY
+        vmovdqu  %ymm0, -63(%rdi, %rdx)
+        vmovdqu  %ymm2, -31(%rdi, %rdx)
+#else
+	vmovdqu	%ymm0, -63(%rax)
+	vmovdqu	%ymm2, -31(%rax)
+#endif
+	vzeroupper
+	ret
+#else
+	.p2align 4
+L(loop):
+	movdqu	%xmm1, (%rdi)
+	movdqu	%xmm2, 16(%rdi)
+	movdqu	%xmm3, 32(%rdi)
+	movdqu	%xmm4, 48(%rdi)
+L(loop_entry):
+	movdqa	96(%rsi), %xmm3
+	movdqa	112(%rsi), %xmm4
+	movdqa	%xmm3, %xmm0
+	movdqa	80(%rsi), %xmm2
+	pminub	%xmm4, %xmm0
+	movdqa	64(%rsi), %xmm1
+	pminub	%xmm2, %xmm0
+	pminub	%xmm1, %xmm0
+	addq	$64, %rsi
+	addq	$64, %rdi
+	pcmpeqb	%xmm5, %xmm0
+	pmovmskb %xmm0, %edx
+	test	%edx, %edx
+	je	L(loop)
+	salq	$48, %rdx
+	pcmpeqb	%xmm1, %xmm5
+	pcmpeqb	%xmm2, %xmm6
+	pmovmskb %xmm5, %ecx
+#ifdef AS_STRCPY
+	pmovmskb %xmm6, %r8d
+	pcmpeqb	%xmm3, %xmm7
+	pmovmskb %xmm7, %r9d
+	sal	$16, %r8d
+	or	%r8d, %ecx
+#else
+	pmovmskb %xmm6, %eax
+	pcmpeqb	%xmm3, %xmm7
+	pmovmskb %xmm7, %r9d
+	sal	$16, %eax
+	or	%eax, %ecx
+#endif
+	salq	$32, %r9
+	orq	%rcx, %rdx
+	orq	%r9, %rdx
+	bsfq	%rdx, %rdx
+#ifndef AS_STRCPY
+	lea	(%rdi, %rdx), %rax
+#endif
+	movdqu	-63(%rsi, %rdx), %xmm0
+	movdqu	-47(%rsi, %rdx), %xmm1
+	movdqu	-31(%rsi, %rdx), %xmm2
+	movdqu	-15(%rsi, %rdx), %xmm3
+#ifdef AS_STRCPY
+        movdqu  %xmm0, -63(%rdi, %rdx)
+        movdqu  %xmm1, -47(%rdi, %rdx)
+        movdqu  %xmm2, -31(%rdi, %rdx)
+        movdqu  %xmm3, -15(%rdi, %rdx)
+#else
+	movdqu	%xmm0, -63(%rax)
+	movdqu	%xmm1, -47(%rax)
+	movdqu	%xmm2, -31(%rax)
+	movdqu	%xmm3, -15(%rax)
+#endif
+	ret
+#endif
+
+	.p2align 4
+L(cross_page):
+	movq	%rsi, %rcx
+	pxor	%xmm0, %xmm0
+	and	$15, %ecx
+	movq	%rsi, %r9
+	movq	%rdi, %r10
+	subq	%rcx, %rsi
+	subq	%rcx, %rdi
+	movdqa	(%rsi), %xmm1
+	pcmpeqb	%xmm0, %xmm1
+	pmovmskb %xmm1, %edx
+	shr	%cl, %edx
+	shl	%cl, %edx
+	test	%edx, %edx
+	jne	L(less_32_cross)
+
+	addq	$16, %rsi
+	addq	$16, %rdi
+	movdqa	(%rsi), %xmm1
+	pcmpeqb	%xmm1, %xmm0
+	pmovmskb %xmm0, %edx
+	test	%edx, %edx
+	jne	L(less_32_cross)
+	movdqu	%xmm1, (%rdi)
+
+	movdqu	(%r9), %xmm0
+	movdqu	%xmm0, (%r10)
+
+	mov	$8, %rcx
+L(cross_loop):
+	addq	$16, %rsi
+	addq	$16, %rdi
+	pxor	%xmm0, %xmm0
+	movdqa	(%rsi), %xmm1
+	pcmpeqb	%xmm1, %xmm0
+	pmovmskb %xmm0, %edx
+	test	%edx, %edx
+	jne	L(return_cross)
+	movdqu	%xmm1, (%rdi)
+	sub	$1, %rcx
+	ja	L(cross_loop)
+
+	pxor	%xmm5, %xmm5
+	pxor	%xmm6, %xmm6
+	pxor	%xmm7, %xmm7
+
+	lea	-64(%rsi), %rdx
+	andq	$-64, %rdx
+	addq	%rdx, %rdi
+	subq	%rsi, %rdi
+	movq	%rdx, %rsi
+	jmp	L(loop_entry)
+
+	.p2align 4
+L(return_cross):
+	bsf	%edx, %edx
+#ifdef AS_STRCPY
+        movdqu  -15(%rsi, %rdx), %xmm0
+        movdqu  %xmm0, -15(%rdi, %rdx)
+#else
+	lea	(%rdi, %rdx), %rax
+	movdqu	-15(%rsi, %rdx), %xmm0
+	movdqu	%xmm0, -15(%rax)
+#endif
+	ret
+
+	.p2align 4
+L(less_32_cross):
+	bsf	%rdx, %rdx
+	lea	(%rdi, %rdx), %rcx
+#ifndef AS_STRCPY
+	mov	%rcx, %rax
+#endif
+	mov	%r9, %rsi
+	mov	%r10, %rdi
+	sub	%rdi, %rcx
+	cmp	$15, %ecx
+	jb	L(less_16_cross)
+	movdqu	(%rsi), %xmm0
+	movdqu	-15(%rsi, %rcx), %xmm1
+	movdqu	%xmm0, (%rdi)
+#ifdef AS_STRCPY
+	movdqu  %xmm1, -15(%rdi, %rcx)
+#else
+	movdqu	%xmm1, -15(%rax)
+#endif
+	ret
+
+L(less_16_cross):
+	cmp	$7, %ecx
+	jb	L(less_8_bytes_cross)
+	movq	(%rsi), %rdx
+	jmp	L(8bytes_from_cross)
+
+L(less_8_bytes_cross):
+	cmp	$2, %ecx
+	jbe	L(3_bytes_cross)
+	mov	(%rsi), %edx
+	jmp	L(4bytes_from_cross)
+
+L(3_bytes_cross):
+	jb	L(1_2bytes_cross)
+	movzwl	(%rsi), %edx
+	jmp	L(_3_bytesb)
+
+L(1_2bytes_cross):
+	movb	(%rsi), %dl
+	jmp	L(0_2bytes_from_cross)
+
+	.p2align 4
+L(less_4_bytesb):
+	je	L(_3_bytesb)
+L(0_2bytes_from_cross):
+	movb	%dl, (%rdi)
+#ifdef AS_STRCPY
+	movb    $0, (%rdi, %rcx)
+#else
+	movb	$0, (%rax)
+#endif
+	ret
+
+	.p2align 4
+L(_3_bytesb):
+	movw	%dx, (%rdi)
+	movb	$0, 2(%rdi)
+	ret
+
+END(STPCPY)
diff --git a/sysdeps/x86_64/multiarch/stpncpy-sse2-unaligned.S b/sysdeps/x86_64/multiarch/stpncpy-sse2-unaligned.S
index 658520f..3f35068 100644
--- a/sysdeps/x86_64/multiarch/stpncpy-sse2-unaligned.S
+++ b/sysdeps/x86_64/multiarch/stpncpy-sse2-unaligned.S
@@ -1,4 +1,3 @@ 
 #define USE_AS_STPCPY
-#define USE_AS_STRNCPY
 #define STRCPY __stpncpy_sse2_unaligned
-#include "strcpy-sse2-unaligned.S"
+#include "strncpy-sse2-unaligned.S"
diff --git a/sysdeps/x86_64/multiarch/stpncpy.S b/sysdeps/x86_64/multiarch/stpncpy.S
index 2698ca6..159604a 100644
--- a/sysdeps/x86_64/multiarch/stpncpy.S
+++ b/sysdeps/x86_64/multiarch/stpncpy.S
@@ -1,8 +1,7 @@ 
 /* Multiple versions of stpncpy
    All versions must be listed in ifunc-impl-list.c.  */
-#define STRCPY __stpncpy
+#define STRNCPY __stpncpy
 #define USE_AS_STPCPY
-#define USE_AS_STRNCPY
-#include "strcpy.S"
+#include "strncpy.S"
 
 weak_alias (__stpncpy, stpncpy)
diff --git a/sysdeps/x86_64/multiarch/strcat-sse2-unaligned.S b/sysdeps/x86_64/multiarch/strcat-sse2-unaligned.S
index 81f1b40..1faa49d 100644
--- a/sysdeps/x86_64/multiarch/strcat-sse2-unaligned.S
+++ b/sysdeps/x86_64/multiarch/strcat-sse2-unaligned.S
@@ -275,5 +275,5 @@  L(StartStrcpyPart):
 #  define USE_AS_STRNCPY
 # endif
 
-# include "strcpy-sse2-unaligned.S"
+# include "strncpy-sse2-unaligned.S"
 #endif
diff --git a/sysdeps/x86_64/multiarch/strcpy-avx2.S b/sysdeps/x86_64/multiarch/strcpy-avx2.S
new file mode 100644
index 0000000..a3133a4
--- /dev/null
+++ b/sysdeps/x86_64/multiarch/strcpy-avx2.S
@@ -0,0 +1,4 @@ 
+#define USE_AVX2
+#define AS_STRCPY
+#define STPCPY __strcpy_avx2
+#include "stpcpy-sse2-unaligned.S"
diff --git a/sysdeps/x86_64/multiarch/strcpy-sse2-unaligned.S b/sysdeps/x86_64/multiarch/strcpy-sse2-unaligned.S
index 8f03d1d..310e4fa 100644
--- a/sysdeps/x86_64/multiarch/strcpy-sse2-unaligned.S
+++ b/sysdeps/x86_64/multiarch/strcpy-sse2-unaligned.S
@@ -1,1887 +1,3 @@ 
-/* strcpy with SSE2 and unaligned load
-   Copyright (C) 2011-2015 Free Software Foundation, Inc.
-   Contributed by Intel Corporation.
-   This file is part of the GNU C Library.
-
-   The GNU C Library is free software; you can redistribute it and/or
-   modify it under the terms of the GNU Lesser General Public
-   License as published by the Free Software Foundation; either
-   version 2.1 of the License, or (at your option) any later version.
-
-   The GNU C Library is distributed in the hope that it will be useful,
-   but WITHOUT ANY WARRANTY; without even the implied warranty of
-   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
-   Lesser General Public License for more details.
-
-   You should have received a copy of the GNU Lesser General Public
-   License along with the GNU C Library; if not, see
-   <http://www.gnu.org/licenses/>.  */
-
-#if IS_IN (libc)
-
-# ifndef USE_AS_STRCAT
-#  include <sysdep.h>
-
-#  ifndef STRCPY
-#   define STRCPY  __strcpy_sse2_unaligned
-#  endif
-
-# endif
-
-# define JMPTBL(I, B)	I - B
-# define BRANCH_TO_JMPTBL_ENTRY(TABLE, INDEX, SCALE)             \
-	lea	TABLE(%rip), %r11;                              \
-	movslq	(%r11, INDEX, SCALE), %rcx;                     \
-	lea	(%r11, %rcx), %rcx;                             \
-	jmp	*%rcx
-
-# ifndef USE_AS_STRCAT
-
-.text
-ENTRY (STRCPY)
-#  ifdef USE_AS_STRNCPY
-	mov	%rdx, %r8
-	test	%r8, %r8
-	jz	L(ExitZero)
-#  endif
-	mov	%rsi, %rcx
-#  ifndef USE_AS_STPCPY
-	mov	%rdi, %rax      /* save result */
-#  endif
-
-# endif
-
-	and	$63, %rcx
-	cmp	$32, %rcx
-	jbe	L(SourceStringAlignmentLess32)
-
-	and	$-16, %rsi
-	and	$15, %rcx
-	pxor	%xmm0, %xmm0
-	pxor	%xmm1, %xmm1
-
-	pcmpeqb	(%rsi), %xmm1
-	pmovmskb %xmm1, %rdx
-	shr	%cl, %rdx
-
-# ifdef USE_AS_STRNCPY
-#  if defined USE_AS_STPCPY || defined USE_AS_STRCAT
-	mov	$16, %r10
-	sub	%rcx, %r10
-	cmp	%r10, %r8
-#  else
-	mov	$17, %r10
-	sub	%rcx, %r10
-	cmp	%r10, %r8
-#  endif
-	jbe	L(CopyFrom1To16BytesTailCase2OrCase3)
-# endif
-	test	%rdx, %rdx
-	jnz	L(CopyFrom1To16BytesTail)
-
-	pcmpeqb	16(%rsi), %xmm0
-	pmovmskb %xmm0, %rdx
-
-# ifdef USE_AS_STRNCPY
-	add	$16, %r10
-	cmp	%r10, %r8
-	jbe	L(CopyFrom1To32BytesCase2OrCase3)
-# endif
-	test	%rdx, %rdx
-	jnz	L(CopyFrom1To32Bytes)
-
-	movdqu	(%rsi, %rcx), %xmm1   /* copy 16 bytes */
-	movdqu	%xmm1, (%rdi)
-
-/* If source address alignment != destination address alignment */
-	.p2align 4
-L(Unalign16Both):
-	sub	%rcx, %rdi
-# ifdef USE_AS_STRNCPY
-	add	%rcx, %r8
-# endif
-	mov	$16, %rcx
-	movdqa	(%rsi, %rcx), %xmm1
-	movaps	16(%rsi, %rcx), %xmm2
-	movdqu	%xmm1, (%rdi, %rcx)
-	pcmpeqb	%xmm2, %xmm0
-	pmovmskb %xmm0, %rdx
-	add	$16, %rcx
-# ifdef USE_AS_STRNCPY
-	sub	$48, %r8
-	jbe	L(CopyFrom1To16BytesCase2OrCase3)
-# endif
-	test	%rdx, %rdx
-# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
-	jnz	L(CopyFrom1To16BytesUnalignedXmm2)
-# else
-	jnz	L(CopyFrom1To16Bytes)
-# endif
-
-	movaps	16(%rsi, %rcx), %xmm3
-	movdqu	%xmm2, (%rdi, %rcx)
-	pcmpeqb	%xmm3, %xmm0
-	pmovmskb %xmm0, %rdx
-	add	$16, %rcx
-# ifdef USE_AS_STRNCPY
-	sub	$16, %r8
-	jbe	L(CopyFrom1To16BytesCase2OrCase3)
-# endif
-	test	%rdx, %rdx
-# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
-	jnz	L(CopyFrom1To16BytesUnalignedXmm3)
-# else
-	jnz	L(CopyFrom1To16Bytes)
-# endif
-
-	movaps	16(%rsi, %rcx), %xmm4
-	movdqu	%xmm3, (%rdi, %rcx)
-	pcmpeqb	%xmm4, %xmm0
-	pmovmskb %xmm0, %rdx
-	add	$16, %rcx
-# ifdef USE_AS_STRNCPY
-	sub	$16, %r8
-	jbe	L(CopyFrom1To16BytesCase2OrCase3)
-# endif
-	test	%rdx, %rdx
-# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
-	jnz	L(CopyFrom1To16BytesUnalignedXmm4)
-# else
-	jnz	L(CopyFrom1To16Bytes)
-# endif
-
-	movaps	16(%rsi, %rcx), %xmm1
-	movdqu	%xmm4, (%rdi, %rcx)
-	pcmpeqb	%xmm1, %xmm0
-	pmovmskb %xmm0, %rdx
-	add	$16, %rcx
-# ifdef USE_AS_STRNCPY
-	sub	$16, %r8
-	jbe	L(CopyFrom1To16BytesCase2OrCase3)
-# endif
-	test	%rdx, %rdx
-# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
-	jnz	L(CopyFrom1To16BytesUnalignedXmm1)
-# else
-	jnz	L(CopyFrom1To16Bytes)
-# endif
-
-	movaps	16(%rsi, %rcx), %xmm2
-	movdqu	%xmm1, (%rdi, %rcx)
-	pcmpeqb	%xmm2, %xmm0
-	pmovmskb %xmm0, %rdx
-	add	$16, %rcx
-# ifdef USE_AS_STRNCPY
-	sub	$16, %r8
-	jbe	L(CopyFrom1To16BytesCase2OrCase3)
-# endif
-	test	%rdx, %rdx
-# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
-	jnz	L(CopyFrom1To16BytesUnalignedXmm2)
-# else
-	jnz	L(CopyFrom1To16Bytes)
-# endif
-
-	movaps	16(%rsi, %rcx), %xmm3
-	movdqu	%xmm2, (%rdi, %rcx)
-	pcmpeqb	%xmm3, %xmm0
-	pmovmskb %xmm0, %rdx
-	add	$16, %rcx
-# ifdef USE_AS_STRNCPY
-	sub	$16, %r8
-	jbe	L(CopyFrom1To16BytesCase2OrCase3)
-# endif
-	test	%rdx, %rdx
-# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
-	jnz	L(CopyFrom1To16BytesUnalignedXmm3)
-# else
-	jnz	L(CopyFrom1To16Bytes)
-# endif
-
-	movdqu	%xmm3, (%rdi, %rcx)
-	mov	%rsi, %rdx
-	lea	16(%rsi, %rcx), %rsi
-	and	$-0x40, %rsi
-	sub	%rsi, %rdx
-	sub	%rdx, %rdi
-# ifdef USE_AS_STRNCPY
-	lea	128(%r8, %rdx), %r8
-# endif
-L(Unaligned64Loop):
-	movaps	(%rsi), %xmm2
-	movaps	%xmm2, %xmm4
-	movaps	16(%rsi), %xmm5
-	movaps	32(%rsi), %xmm3
-	movaps	%xmm3, %xmm6
-	movaps	48(%rsi), %xmm7
-	pminub	%xmm5, %xmm2
-	pminub	%xmm7, %xmm3
-	pminub	%xmm2, %xmm3
-	pcmpeqb	%xmm0, %xmm3
-	pmovmskb %xmm3, %rdx
-# ifdef USE_AS_STRNCPY
-	sub	$64, %r8
-	jbe	L(UnalignedLeaveCase2OrCase3)
-# endif
-	test	%rdx, %rdx
-	jnz	L(Unaligned64Leave)
-
-L(Unaligned64Loop_start):
-	add	$64, %rdi
-	add	$64, %rsi
-	movdqu	%xmm4, -64(%rdi)
-	movaps	(%rsi), %xmm2
-	movdqa	%xmm2, %xmm4
-	movdqu	%xmm5, -48(%rdi)
-	movaps	16(%rsi), %xmm5
-	pminub	%xmm5, %xmm2
-	movaps	32(%rsi), %xmm3
-	movdqu	%xmm6, -32(%rdi)
-	movaps	%xmm3, %xmm6
-	movdqu	%xmm7, -16(%rdi)
-	movaps	48(%rsi), %xmm7
-	pminub	%xmm7, %xmm3
-	pminub	%xmm2, %xmm3
-	pcmpeqb	%xmm0, %xmm3
-	pmovmskb %xmm3, %rdx
-# ifdef USE_AS_STRNCPY
-	sub	$64, %r8
-	jbe	L(UnalignedLeaveCase2OrCase3)
-# endif
-	test	%rdx, %rdx
-	jz	L(Unaligned64Loop_start)
-
-L(Unaligned64Leave):
-	pxor	%xmm1, %xmm1
-
-	pcmpeqb	%xmm4, %xmm0
-	pcmpeqb	%xmm5, %xmm1
-	pmovmskb %xmm0, %rdx
-	pmovmskb %xmm1, %rcx
-	test	%rdx, %rdx
-	jnz	L(CopyFrom1To16BytesUnaligned_0)
-	test	%rcx, %rcx
-	jnz	L(CopyFrom1To16BytesUnaligned_16)
-
-	pcmpeqb	%xmm6, %xmm0
-	pcmpeqb	%xmm7, %xmm1
-	pmovmskb %xmm0, %rdx
-	pmovmskb %xmm1, %rcx
-	test	%rdx, %rdx
-	jnz	L(CopyFrom1To16BytesUnaligned_32)
-
-	bsf	%rcx, %rdx
-	movdqu	%xmm4, (%rdi)
-	movdqu	%xmm5, 16(%rdi)
-	movdqu	%xmm6, 32(%rdi)
-# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
-# ifdef USE_AS_STPCPY
-	lea	48(%rdi, %rdx), %rax
-# endif
-	movdqu	%xmm7, 48(%rdi)
-	add	$15, %r8
-	sub	%rdx, %r8
-	lea	49(%rdi, %rdx), %rdi
-	jmp	L(StrncpyFillTailWithZero)
-# else
-	add	$48, %rsi
-	add	$48, %rdi
-	BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4)
-# endif
-
-/* If source address alignment == destination address alignment */
-
-L(SourceStringAlignmentLess32):
-	pxor	%xmm0, %xmm0
-	movdqu	(%rsi), %xmm1
-	movdqu	16(%rsi), %xmm2
-	pcmpeqb	%xmm1, %xmm0
-	pmovmskb %xmm0, %rdx
-
-# ifdef USE_AS_STRNCPY
-#  if defined USE_AS_STPCPY || defined USE_AS_STRCAT
-	cmp	$16, %r8
-#  else
-	cmp	$17, %r8
-#  endif
-	jbe	L(CopyFrom1To16BytesTail1Case2OrCase3)
-# endif
-	test	%rdx, %rdx
-	jnz	L(CopyFrom1To16BytesTail1)
-
-	pcmpeqb	%xmm2, %xmm0
-	movdqu	%xmm1, (%rdi)
-	pmovmskb %xmm0, %rdx
-
-# ifdef USE_AS_STRNCPY
-#  if defined USE_AS_STPCPY || defined USE_AS_STRCAT
-	cmp	$32, %r8
-#  else
-	cmp	$33, %r8
-#  endif
-	jbe	L(CopyFrom1To32Bytes1Case2OrCase3)
-# endif
-	test	%rdx, %rdx
-	jnz	L(CopyFrom1To32Bytes1)
-
-	and	$-16, %rsi
-	and	$15, %rcx
-	jmp	L(Unalign16Both)
-
-/*------End of main part with loops---------------------*/
-
-/* Case1 */
-
-# if (!defined USE_AS_STRNCPY) || (defined USE_AS_STRCAT)
-	.p2align 4
-L(CopyFrom1To16Bytes):
-	add	%rcx, %rdi
-	add	%rcx, %rsi
-	bsf	%rdx, %rdx
-	BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4)
-# endif
-	.p2align 4
-L(CopyFrom1To16BytesTail):
-	add	%rcx, %rsi
-	bsf	%rdx, %rdx
-	BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4)
-
-	.p2align 4
-L(CopyFrom1To32Bytes1):
-	add	$16, %rsi
-	add	$16, %rdi
-# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
-	sub	$16, %r8
-# endif
-L(CopyFrom1To16BytesTail1):
-	bsf	%rdx, %rdx
-	BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4)
-
-	.p2align 4
-L(CopyFrom1To32Bytes):
-	bsf	%rdx, %rdx
-	add	%rcx, %rsi
-	add	$16, %rdx
-	sub	%rcx, %rdx
-	BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4)
-
-	.p2align 4
-L(CopyFrom1To16BytesUnaligned_0):
-	bsf	%rdx, %rdx
-# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
-# ifdef USE_AS_STPCPY
-	lea	(%rdi, %rdx), %rax
-# endif
-	movdqu	%xmm4, (%rdi)
-	add	$63, %r8
-	sub	%rdx, %r8
-	lea	1(%rdi, %rdx), %rdi
-	jmp	L(StrncpyFillTailWithZero)
-# else
-	BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4)
-# endif
-
-	.p2align 4
-L(CopyFrom1To16BytesUnaligned_16):
-	bsf	%rcx, %rdx
-	movdqu	%xmm4, (%rdi)
-# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
-# ifdef USE_AS_STPCPY
-	lea	16(%rdi, %rdx), %rax
-# endif
-	movdqu	%xmm5, 16(%rdi)
-	add	$47, %r8
-	sub	%rdx, %r8
-	lea	17(%rdi, %rdx), %rdi
-	jmp	L(StrncpyFillTailWithZero)
-# else
-	add	$16, %rsi
-	add	$16, %rdi
-	BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4)
-# endif
-
-	.p2align 4
-L(CopyFrom1To16BytesUnaligned_32):
-	bsf	%rdx, %rdx
-	movdqu	%xmm4, (%rdi)
-	movdqu	%xmm5, 16(%rdi)
-# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
-# ifdef USE_AS_STPCPY
-	lea	32(%rdi, %rdx), %rax
-# endif
-	movdqu	%xmm6, 32(%rdi)
-	add	$31, %r8
-	sub	%rdx, %r8
-	lea	33(%rdi, %rdx), %rdi
-	jmp	L(StrncpyFillTailWithZero)
-# else
-	add	$32, %rsi
-	add	$32, %rdi
-	BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4)
-# endif
-
-# ifdef USE_AS_STRNCPY
-#  ifndef USE_AS_STRCAT
-	.p2align 4
-L(CopyFrom1To16BytesUnalignedXmm6):
-	movdqu	%xmm6, (%rdi, %rcx)
-	jmp	L(CopyFrom1To16BytesXmmExit)
-
-	.p2align 4
-L(CopyFrom1To16BytesUnalignedXmm5):
-	movdqu	%xmm5, (%rdi, %rcx)
-	jmp	L(CopyFrom1To16BytesXmmExit)
-
-	.p2align 4
-L(CopyFrom1To16BytesUnalignedXmm4):
-	movdqu	%xmm4, (%rdi, %rcx)
-	jmp	L(CopyFrom1To16BytesXmmExit)
-
-	.p2align 4
-L(CopyFrom1To16BytesUnalignedXmm3):
-	movdqu	%xmm3, (%rdi, %rcx)
-	jmp	L(CopyFrom1To16BytesXmmExit)
-
-	.p2align 4
-L(CopyFrom1To16BytesUnalignedXmm1):
-	movdqu	%xmm1, (%rdi, %rcx)
-	jmp	L(CopyFrom1To16BytesXmmExit)
-#  endif
-
-	.p2align 4
-L(CopyFrom1To16BytesExit):
-	BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4)
-
-/* Case2 */
-
-	.p2align 4
-L(CopyFrom1To16BytesCase2):
-	add	$16, %r8
-	add	%rcx, %rdi
-	add	%rcx, %rsi
-	bsf	%rdx, %rdx
-	cmp	%r8, %rdx
-	jb	L(CopyFrom1To16BytesExit)
-	BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4)
-
-	.p2align 4
-L(CopyFrom1To32BytesCase2):
-	add	%rcx, %rsi
-	bsf	%rdx, %rdx
-	add	$16, %rdx
-	sub	%rcx, %rdx
-	cmp	%r8, %rdx
-	jb	L(CopyFrom1To16BytesExit)
-	BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4)
-
-L(CopyFrom1To16BytesTailCase2):
-	add	%rcx, %rsi
-	bsf	%rdx, %rdx
-	cmp	%r8, %rdx
-	jb	L(CopyFrom1To16BytesExit)
-	BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4)
-
-L(CopyFrom1To16BytesTail1Case2):
-	bsf	%rdx, %rdx
-	cmp	%r8, %rdx
-	jb	L(CopyFrom1To16BytesExit)
-	BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4)
-
-/* Case2 or Case3,  Case3 */
-
-	.p2align 4
-L(CopyFrom1To16BytesCase2OrCase3):
-	test	%rdx, %rdx
-	jnz	L(CopyFrom1To16BytesCase2)
-L(CopyFrom1To16BytesCase3):
-	add	$16, %r8
-	add	%rcx, %rdi
-	add	%rcx, %rsi
-	BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4)
-
-	.p2align 4
-L(CopyFrom1To32BytesCase2OrCase3):
-	test	%rdx, %rdx
-	jnz	L(CopyFrom1To32BytesCase2)
-	add	%rcx, %rsi
-	BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4)
-
-	.p2align 4
-L(CopyFrom1To16BytesTailCase2OrCase3):
-	test	%rdx, %rdx
-	jnz	L(CopyFrom1To16BytesTailCase2)
-	add	%rcx, %rsi
-	BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4)
-
-	.p2align 4
-L(CopyFrom1To32Bytes1Case2OrCase3):
-	add	$16, %rdi
-	add	$16, %rsi
-	sub	$16, %r8
-L(CopyFrom1To16BytesTail1Case2OrCase3):
-	test	%rdx, %rdx
-	jnz	L(CopyFrom1To16BytesTail1Case2)
-	BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4)
-
-# endif
-
-/*------------End labels regarding with copying 1-16 bytes--and 1-32 bytes----*/
-
-	.p2align 4
-L(Exit1):
-	mov	%dh, (%rdi)
-# ifdef USE_AS_STPCPY
-	lea	(%rdi), %rax
-# endif
-# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
-	sub	$1, %r8
-	lea	1(%rdi), %rdi
-	jnz	L(StrncpyFillTailWithZero)
-# endif
-	ret
-
-	.p2align 4
-L(Exit2):
-	mov	(%rsi), %dx
-	mov	%dx, (%rdi)
-# ifdef USE_AS_STPCPY
-	lea	1(%rdi), %rax
-# endif
-# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
-	sub	$2, %r8
-	lea	2(%rdi), %rdi
-	jnz	L(StrncpyFillTailWithZero)
-# endif
-	ret
-
-	.p2align 4
-L(Exit3):
-	mov	(%rsi), %cx
-	mov	%cx, (%rdi)
-	mov	%dh, 2(%rdi)
-# ifdef USE_AS_STPCPY
-	lea	2(%rdi), %rax
-# endif
-# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
-	sub	$3, %r8
-	lea	3(%rdi), %rdi
-	jnz	L(StrncpyFillTailWithZero)
-# endif
-	ret
-
-	.p2align 4
-L(Exit4):
-	mov	(%rsi), %edx
-	mov	%edx, (%rdi)
-# ifdef USE_AS_STPCPY
-	lea	3(%rdi), %rax
-# endif
-# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
-	sub	$4, %r8
-	lea	4(%rdi), %rdi
-	jnz	L(StrncpyFillTailWithZero)
-# endif
-	ret
-
-	.p2align 4
-L(Exit5):
-	mov	(%rsi), %ecx
-	mov	%dh, 4(%rdi)
-	mov	%ecx, (%rdi)
-# ifdef USE_AS_STPCPY
-	lea	4(%rdi), %rax
-# endif
-# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
-	sub	$5, %r8
-	lea	5(%rdi), %rdi
-	jnz	L(StrncpyFillTailWithZero)
-# endif
-	ret
-
-	.p2align 4
-L(Exit6):
-	mov	(%rsi), %ecx
-	mov	4(%rsi), %dx
-	mov	%ecx, (%rdi)
-	mov	%dx, 4(%rdi)
-# ifdef USE_AS_STPCPY
-	lea	5(%rdi), %rax
-# endif
-# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
-	sub	$6, %r8
-	lea	6(%rdi), %rdi
-	jnz	L(StrncpyFillTailWithZero)
-# endif
-	ret
-
-	.p2align 4
-L(Exit7):
-	mov	(%rsi), %ecx
-	mov	3(%rsi), %edx
-	mov	%ecx, (%rdi)
-	mov	%edx, 3(%rdi)
-# ifdef USE_AS_STPCPY
-	lea	6(%rdi), %rax
-# endif
-# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
-	sub	$7, %r8
-	lea	7(%rdi), %rdi
-	jnz	L(StrncpyFillTailWithZero)
-# endif
-	ret
-
-	.p2align 4
-L(Exit8):
-	mov	(%rsi), %rdx
-	mov	%rdx, (%rdi)
-# ifdef USE_AS_STPCPY
-	lea	7(%rdi), %rax
-# endif
-# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
-	sub	$8, %r8
-	lea	8(%rdi), %rdi
-	jnz	L(StrncpyFillTailWithZero)
-# endif
-	ret
-
-	.p2align 4
-L(Exit9):
-	mov	(%rsi), %rcx
-	mov	%dh, 8(%rdi)
-	mov	%rcx, (%rdi)
-# ifdef USE_AS_STPCPY
-	lea	8(%rdi), %rax
-# endif
-# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
-	sub	$9, %r8
-	lea	9(%rdi), %rdi
-	jnz	L(StrncpyFillTailWithZero)
-# endif
-	ret
-
-	.p2align 4
-L(Exit10):
-	mov	(%rsi), %rcx
-	mov	8(%rsi), %dx
-	mov	%rcx, (%rdi)
-	mov	%dx, 8(%rdi)
-# ifdef USE_AS_STPCPY
-	lea	9(%rdi), %rax
-# endif
-# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
-	sub	$10, %r8
-	lea	10(%rdi), %rdi
-	jnz	L(StrncpyFillTailWithZero)
-# endif
-	ret
-
-	.p2align 4
-L(Exit11):
-	mov	(%rsi), %rcx
-	mov	7(%rsi), %edx
-	mov	%rcx, (%rdi)
-	mov	%edx, 7(%rdi)
-# ifdef USE_AS_STPCPY
-	lea	10(%rdi), %rax
-# endif
-# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
-	sub	$11, %r8
-	lea	11(%rdi), %rdi
-	jnz	L(StrncpyFillTailWithZero)
-# endif
-	ret
-
-	.p2align 4
-L(Exit12):
-	mov	(%rsi), %rcx
-	mov	8(%rsi), %edx
-	mov	%rcx, (%rdi)
-	mov	%edx, 8(%rdi)
-# ifdef USE_AS_STPCPY
-	lea	11(%rdi), %rax
-# endif
-# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
-	sub	$12, %r8
-	lea	12(%rdi), %rdi
-	jnz	L(StrncpyFillTailWithZero)
-# endif
-	ret
-
-	.p2align 4
-L(Exit13):
-	mov	(%rsi), %rcx
-	mov	5(%rsi), %rdx
-	mov	%rcx, (%rdi)
-	mov	%rdx, 5(%rdi)
-# ifdef USE_AS_STPCPY
-	lea	12(%rdi), %rax
-# endif
-# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
-	sub	$13, %r8
-	lea	13(%rdi), %rdi
-	jnz	L(StrncpyFillTailWithZero)
-# endif
-	ret
-
-	.p2align 4
-L(Exit14):
-	mov	(%rsi), %rcx
-	mov	6(%rsi), %rdx
-	mov	%rcx, (%rdi)
-	mov	%rdx, 6(%rdi)
-# ifdef USE_AS_STPCPY
-	lea	13(%rdi), %rax
-# endif
-# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
-	sub	$14, %r8
-	lea	14(%rdi), %rdi
-	jnz	L(StrncpyFillTailWithZero)
-# endif
-	ret
-
-	.p2align 4
-L(Exit15):
-	mov	(%rsi), %rcx
-	mov	7(%rsi), %rdx
-	mov	%rcx, (%rdi)
-	mov	%rdx, 7(%rdi)
-# ifdef USE_AS_STPCPY
-	lea	14(%rdi), %rax
-# endif
-# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
-	sub	$15, %r8
-	lea	15(%rdi), %rdi
-	jnz	L(StrncpyFillTailWithZero)
-# endif
-	ret
-
-	.p2align 4
-L(Exit16):
-	movdqu	(%rsi), %xmm0
-	movdqu	%xmm0, (%rdi)
-# ifdef USE_AS_STPCPY
-	lea	15(%rdi), %rax
-# endif
-# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
-	sub	$16, %r8
-	lea	16(%rdi), %rdi
-	jnz	L(StrncpyFillTailWithZero)
-# endif
-	ret
-
-	.p2align 4
-L(Exit17):
-	movdqu	(%rsi), %xmm0
-	movdqu	%xmm0, (%rdi)
-	mov	%dh, 16(%rdi)
-# ifdef USE_AS_STPCPY
-	lea	16(%rdi), %rax
-# endif
-# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
-	sub	$17, %r8
-	lea	17(%rdi), %rdi
-	jnz	L(StrncpyFillTailWithZero)
-# endif
-	ret
-
-	.p2align 4
-L(Exit18):
-	movdqu	(%rsi), %xmm0
-	mov	16(%rsi), %cx
-	movdqu	%xmm0, (%rdi)
-	mov	%cx, 16(%rdi)
-# ifdef USE_AS_STPCPY
-	lea	17(%rdi), %rax
-# endif
-# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
-	sub	$18, %r8
-	lea	18(%rdi), %rdi
-	jnz	L(StrncpyFillTailWithZero)
-# endif
-	ret
-
-	.p2align 4
-L(Exit19):
-	movdqu	(%rsi), %xmm0
-	mov	15(%rsi), %ecx
-	movdqu	%xmm0, (%rdi)
-	mov	%ecx, 15(%rdi)
-# ifdef USE_AS_STPCPY
-	lea	18(%rdi), %rax
-# endif
-# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
-	sub	$19, %r8
-	lea	19(%rdi), %rdi
-	jnz	L(StrncpyFillTailWithZero)
-# endif
-	ret
-
-	.p2align 4
-L(Exit20):
-	movdqu	(%rsi), %xmm0
-	mov	16(%rsi), %ecx
-	movdqu	%xmm0, (%rdi)
-	mov	%ecx, 16(%rdi)
-# ifdef USE_AS_STPCPY
-	lea	19(%rdi), %rax
-# endif
-# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
-	sub	$20, %r8
-	lea	20(%rdi), %rdi
-	jnz	L(StrncpyFillTailWithZero)
-# endif
-	ret
-
-	.p2align 4
-L(Exit21):
-	movdqu	(%rsi), %xmm0
-	mov	16(%rsi), %ecx
-	movdqu	%xmm0, (%rdi)
-	mov	%ecx, 16(%rdi)
-	mov	%dh, 20(%rdi)
-# ifdef USE_AS_STPCPY
-	lea	20(%rdi), %rax
-# endif
-# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
-	sub	$21, %r8
-	lea	21(%rdi), %rdi
-	jnz	L(StrncpyFillTailWithZero)
-# endif
-	ret
-
-	.p2align 4
-L(Exit22):
-	movdqu	(%rsi), %xmm0
-	mov	14(%rsi), %rcx
-	movdqu	%xmm0, (%rdi)
-	mov	%rcx, 14(%rdi)
-# ifdef USE_AS_STPCPY
-	lea	21(%rdi), %rax
-# endif
-# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
-	sub	$22, %r8
-	lea	22(%rdi), %rdi
-	jnz	L(StrncpyFillTailWithZero)
-# endif
-	ret
-
-	.p2align 4
-L(Exit23):
-	movdqu	(%rsi), %xmm0
-	mov	15(%rsi), %rcx
-	movdqu	%xmm0, (%rdi)
-	mov	%rcx, 15(%rdi)
-# ifdef USE_AS_STPCPY
-	lea	22(%rdi), %rax
-# endif
-# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
-	sub	$23, %r8
-	lea	23(%rdi), %rdi
-	jnz	L(StrncpyFillTailWithZero)
-# endif
-	ret
-
-	.p2align 4
-L(Exit24):
-	movdqu	(%rsi), %xmm0
-	mov	16(%rsi), %rcx
-	movdqu	%xmm0, (%rdi)
-	mov	%rcx, 16(%rdi)
-# ifdef USE_AS_STPCPY
-	lea	23(%rdi), %rax
-# endif
-# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
-	sub	$24, %r8
-	lea	24(%rdi), %rdi
-	jnz	L(StrncpyFillTailWithZero)
-# endif
-	ret
-
-	.p2align 4
-L(Exit25):
-	movdqu	(%rsi), %xmm0
-	mov	16(%rsi), %rcx
-	movdqu	%xmm0, (%rdi)
-	mov	%rcx, 16(%rdi)
-	mov	%dh, 24(%rdi)
-# ifdef USE_AS_STPCPY
-	lea	24(%rdi), %rax
-# endif
-# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
-	sub	$25, %r8
-	lea	25(%rdi), %rdi
-	jnz	L(StrncpyFillTailWithZero)
-# endif
-	ret
-
-	.p2align 4
-L(Exit26):
-	movdqu	(%rsi), %xmm0
-	mov	16(%rsi), %rdx
-	mov	24(%rsi), %cx
-	movdqu	%xmm0, (%rdi)
-	mov	%rdx, 16(%rdi)
-	mov	%cx, 24(%rdi)
-# ifdef USE_AS_STPCPY
-	lea	25(%rdi), %rax
-# endif
-# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
-	sub	$26, %r8
-	lea	26(%rdi), %rdi
-	jnz	L(StrncpyFillTailWithZero)
-# endif
-	ret
-
-	.p2align 4
-L(Exit27):
-	movdqu	(%rsi), %xmm0
-	mov	16(%rsi), %rdx
-	mov	23(%rsi), %ecx
-	movdqu	%xmm0, (%rdi)
-	mov	%rdx, 16(%rdi)
-	mov	%ecx, 23(%rdi)
-# ifdef USE_AS_STPCPY
-	lea	26(%rdi), %rax
-# endif
-# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
-	sub	$27, %r8
-	lea	27(%rdi), %rdi
-	jnz	L(StrncpyFillTailWithZero)
-# endif
-	ret
-
-	.p2align 4
-L(Exit28):
-	movdqu	(%rsi), %xmm0
-	mov	16(%rsi), %rdx
-	mov	24(%rsi), %ecx
-	movdqu	%xmm0, (%rdi)
-	mov	%rdx, 16(%rdi)
-	mov	%ecx, 24(%rdi)
-# ifdef USE_AS_STPCPY
-	lea	27(%rdi), %rax
-# endif
-# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
-	sub	$28, %r8
-	lea	28(%rdi), %rdi
-	jnz	L(StrncpyFillTailWithZero)
-# endif
-	ret
-
-	.p2align 4
-L(Exit29):
-	movdqu	(%rsi), %xmm0
-	movdqu	13(%rsi), %xmm2
-	movdqu	%xmm0, (%rdi)
-	movdqu	%xmm2, 13(%rdi)
-# ifdef USE_AS_STPCPY
-	lea	28(%rdi), %rax
-# endif
-# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
-	sub	$29, %r8
-	lea	29(%rdi), %rdi
-	jnz	L(StrncpyFillTailWithZero)
-# endif
-	ret
-
-	.p2align 4
-L(Exit30):
-	movdqu	(%rsi), %xmm0
-	movdqu	14(%rsi), %xmm2
-	movdqu	%xmm0, (%rdi)
-	movdqu	%xmm2, 14(%rdi)
-# ifdef USE_AS_STPCPY
-	lea	29(%rdi), %rax
-# endif
-# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
-	sub	$30, %r8
-	lea	30(%rdi), %rdi
-	jnz	L(StrncpyFillTailWithZero)
-# endif
-	ret
-
-	.p2align 4
-L(Exit31):
-	movdqu	(%rsi), %xmm0
-	movdqu	15(%rsi), %xmm2
-	movdqu	%xmm0, (%rdi)
-	movdqu	%xmm2, 15(%rdi)
-# ifdef USE_AS_STPCPY
-	lea	30(%rdi), %rax
-# endif
-# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
-	sub	$31, %r8
-	lea	31(%rdi), %rdi
-	jnz	L(StrncpyFillTailWithZero)
-# endif
-	ret
-
-	.p2align 4
-L(Exit32):
-	movdqu	(%rsi), %xmm0
-	movdqu	16(%rsi), %xmm2
-	movdqu	%xmm0, (%rdi)
-	movdqu	%xmm2, 16(%rdi)
-# ifdef USE_AS_STPCPY
-	lea	31(%rdi), %rax
-# endif
-# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
-	sub	$32, %r8
-	lea	32(%rdi), %rdi
-	jnz	L(StrncpyFillTailWithZero)
-# endif
-	ret
-
-# ifdef USE_AS_STRNCPY
-
-	.p2align 4
-L(StrncpyExit0):
-#  ifdef USE_AS_STPCPY
-	mov	%rdi, %rax
-#  endif
-#  ifdef USE_AS_STRCAT
-	xor	%ch, %ch
-	movb	%ch, (%rdi)
-#  endif
-	ret
-
-	.p2align 4
-L(StrncpyExit1):
-	mov	(%rsi), %dl
-	mov	%dl, (%rdi)
-#  ifdef USE_AS_STPCPY
-	lea	1(%rdi), %rax
-#  endif
-#  ifdef USE_AS_STRCAT
-	xor	%ch, %ch
-	movb	%ch, 1(%rdi)
-#  endif
-	ret
-
-	.p2align 4
-L(StrncpyExit2):
-	mov	(%rsi), %dx
-	mov	%dx, (%rdi)
-#  ifdef USE_AS_STPCPY
-	lea	2(%rdi), %rax
-#  endif
-#  ifdef USE_AS_STRCAT
-	xor	%ch, %ch
-	movb	%ch, 2(%rdi)
-#  endif
-	ret
-
-	.p2align 4
-L(StrncpyExit3):
-	mov	(%rsi), %cx
-	mov	2(%rsi), %dl
-	mov	%cx, (%rdi)
-	mov	%dl, 2(%rdi)
-#  ifdef USE_AS_STPCPY
-	lea	3(%rdi), %rax
-#  endif
-#  ifdef USE_AS_STRCAT
-	xor	%ch, %ch
-	movb	%ch, 3(%rdi)
-#  endif
-	ret
-
-	.p2align 4
-L(StrncpyExit4):
-	mov	(%rsi), %edx
-	mov	%edx, (%rdi)
-#  ifdef USE_AS_STPCPY
-	lea	4(%rdi), %rax
-#  endif
-#  ifdef USE_AS_STRCAT
-	xor	%ch, %ch
-	movb	%ch, 4(%rdi)
-#  endif
-	ret
-
-	.p2align 4
-L(StrncpyExit5):
-	mov	(%rsi), %ecx
-	mov	4(%rsi), %dl
-	mov	%ecx, (%rdi)
-	mov	%dl, 4(%rdi)
-#  ifdef USE_AS_STPCPY
-	lea	5(%rdi), %rax
-#  endif
-#  ifdef USE_AS_STRCAT
-	xor	%ch, %ch
-	movb	%ch, 5(%rdi)
-#  endif
-	ret
-
-	.p2align 4
-L(StrncpyExit6):
-	mov	(%rsi), %ecx
-	mov	4(%rsi), %dx
-	mov	%ecx, (%rdi)
-	mov	%dx, 4(%rdi)
-#  ifdef USE_AS_STPCPY
-	lea	6(%rdi), %rax
-#  endif
-#  ifdef USE_AS_STRCAT
-	xor	%ch, %ch
-	movb	%ch, 6(%rdi)
-#  endif
-	ret
-
-	.p2align 4
-L(StrncpyExit7):
-	mov	(%rsi), %ecx
-	mov	3(%rsi), %edx
-	mov	%ecx, (%rdi)
-	mov	%edx, 3(%rdi)
-#  ifdef USE_AS_STPCPY
-	lea	7(%rdi), %rax
-#  endif
-#  ifdef USE_AS_STRCAT
-	xor	%ch, %ch
-	movb	%ch, 7(%rdi)
-#  endif
-	ret
-
-	.p2align 4
-L(StrncpyExit8):
-	mov	(%rsi), %rdx
-	mov	%rdx, (%rdi)
-#  ifdef USE_AS_STPCPY
-	lea	8(%rdi), %rax
-#  endif
-#  ifdef USE_AS_STRCAT
-	xor	%ch, %ch
-	movb	%ch, 8(%rdi)
-#  endif
-	ret
-
-	.p2align 4
-L(StrncpyExit9):
-	mov	(%rsi), %rcx
-	mov	8(%rsi), %dl
-	mov	%rcx, (%rdi)
-	mov	%dl, 8(%rdi)
-#  ifdef USE_AS_STPCPY
-	lea	9(%rdi), %rax
-#  endif
-#  ifdef USE_AS_STRCAT
-	xor	%ch, %ch
-	movb	%ch, 9(%rdi)
-#  endif
-	ret
-
-	.p2align 4
-L(StrncpyExit10):
-	mov	(%rsi), %rcx
-	mov	8(%rsi), %dx
-	mov	%rcx, (%rdi)
-	mov	%dx, 8(%rdi)
-#  ifdef USE_AS_STPCPY
-	lea	10(%rdi), %rax
-#  endif
-#  ifdef USE_AS_STRCAT
-	xor	%ch, %ch
-	movb	%ch, 10(%rdi)
-#  endif
-	ret
-
-	.p2align 4
-L(StrncpyExit11):
-	mov	(%rsi), %rcx
-	mov	7(%rsi), %edx
-	mov	%rcx, (%rdi)
-	mov	%edx, 7(%rdi)
-#  ifdef USE_AS_STPCPY
-	lea	11(%rdi), %rax
-#  endif
-#  ifdef USE_AS_STRCAT
-	xor	%ch, %ch
-	movb	%ch, 11(%rdi)
-#  endif
-	ret
-
-	.p2align 4
-L(StrncpyExit12):
-	mov	(%rsi), %rcx
-	mov	8(%rsi), %edx
-	mov	%rcx, (%rdi)
-	mov	%edx, 8(%rdi)
-#  ifdef USE_AS_STPCPY
-	lea	12(%rdi), %rax
-#  endif
-#  ifdef USE_AS_STRCAT
-	xor	%ch, %ch
-	movb	%ch, 12(%rdi)
-#  endif
-	ret
-
-	.p2align 4
-L(StrncpyExit13):
-	mov	(%rsi), %rcx
-	mov	5(%rsi), %rdx
-	mov	%rcx, (%rdi)
-	mov	%rdx, 5(%rdi)
-#  ifdef USE_AS_STPCPY
-	lea	13(%rdi), %rax
-#  endif
-#  ifdef USE_AS_STRCAT
-	xor	%ch, %ch
-	movb	%ch, 13(%rdi)
-#  endif
-	ret
-
-	.p2align 4
-L(StrncpyExit14):
-	mov	(%rsi), %rcx
-	mov	6(%rsi), %rdx
-	mov	%rcx, (%rdi)
-	mov	%rdx, 6(%rdi)
-#  ifdef USE_AS_STPCPY
-	lea	14(%rdi), %rax
-#  endif
-#  ifdef USE_AS_STRCAT
-	xor	%ch, %ch
-	movb	%ch, 14(%rdi)
-#  endif
-	ret
-
-	.p2align 4
-L(StrncpyExit15):
-	mov	(%rsi), %rcx
-	mov	7(%rsi), %rdx
-	mov	%rcx, (%rdi)
-	mov	%rdx, 7(%rdi)
-#  ifdef USE_AS_STPCPY
-	lea	15(%rdi), %rax
-#  endif
-#  ifdef USE_AS_STRCAT
-	xor	%ch, %ch
-	movb	%ch, 15(%rdi)
-#  endif
-	ret
-
-	.p2align 4
-L(StrncpyExit16):
-	movdqu	(%rsi), %xmm0
-	movdqu	%xmm0, (%rdi)
-#  ifdef USE_AS_STPCPY
-	lea	16(%rdi), %rax
-#  endif
-#  ifdef USE_AS_STRCAT
-	xor	%ch, %ch
-	movb	%ch, 16(%rdi)
-#  endif
-	ret
-
-	.p2align 4
-L(StrncpyExit17):
-	movdqu	(%rsi), %xmm0
-	mov	16(%rsi), %cl
-	movdqu	%xmm0, (%rdi)
-	mov	%cl, 16(%rdi)
-#  ifdef USE_AS_STPCPY
-	lea	17(%rdi), %rax
-#  endif
-#  ifdef USE_AS_STRCAT
-	xor	%ch, %ch
-	movb	%ch, 17(%rdi)
-#  endif
-	ret
-
-	.p2align 4
-L(StrncpyExit18):
-	movdqu	(%rsi), %xmm0
-	mov	16(%rsi), %cx
-	movdqu	%xmm0, (%rdi)
-	mov	%cx, 16(%rdi)
-#  ifdef USE_AS_STPCPY
-	lea	18(%rdi), %rax
-#  endif
-#  ifdef USE_AS_STRCAT
-	xor	%ch, %ch
-	movb	%ch, 18(%rdi)
-#  endif
-	ret
-
-	.p2align 4
-L(StrncpyExit19):
-	movdqu	(%rsi), %xmm0
-	mov	15(%rsi), %ecx
-	movdqu	%xmm0, (%rdi)
-	mov	%ecx, 15(%rdi)
-#  ifdef USE_AS_STPCPY
-	lea	19(%rdi), %rax
-#  endif
-#  ifdef USE_AS_STRCAT
-	xor	%ch, %ch
-	movb	%ch, 19(%rdi)
-#  endif
-	ret
-
-	.p2align 4
-L(StrncpyExit20):
-	movdqu	(%rsi), %xmm0
-	mov	16(%rsi), %ecx
-	movdqu	%xmm0, (%rdi)
-	mov	%ecx, 16(%rdi)
-#  ifdef USE_AS_STPCPY
-	lea	20(%rdi), %rax
-#  endif
-#  ifdef USE_AS_STRCAT
-	xor	%ch, %ch
-	movb	%ch, 20(%rdi)
-#  endif
-	ret
-
-	.p2align 4
-L(StrncpyExit21):
-	movdqu	(%rsi), %xmm0
-	mov	16(%rsi), %ecx
-	mov	20(%rsi), %dl
-	movdqu	%xmm0, (%rdi)
-	mov	%ecx, 16(%rdi)
-	mov	%dl, 20(%rdi)
-#  ifdef USE_AS_STPCPY
-	lea	21(%rdi), %rax
-#  endif
-#  ifdef USE_AS_STRCAT
-	xor	%ch, %ch
-	movb	%ch, 21(%rdi)
-#  endif
-	ret
-
-	.p2align 4
-L(StrncpyExit22):
-	movdqu	(%rsi), %xmm0
-	mov	14(%rsi), %rcx
-	movdqu	%xmm0, (%rdi)
-	mov	%rcx, 14(%rdi)
-#  ifdef USE_AS_STPCPY
-	lea	22(%rdi), %rax
-#  endif
-#  ifdef USE_AS_STRCAT
-	xor	%ch, %ch
-	movb	%ch, 22(%rdi)
-#  endif
-	ret
-
-	.p2align 4
-L(StrncpyExit23):
-	movdqu	(%rsi), %xmm0
-	mov	15(%rsi), %rcx
-	movdqu	%xmm0, (%rdi)
-	mov	%rcx, 15(%rdi)
-#  ifdef USE_AS_STPCPY
-	lea	23(%rdi), %rax
-#  endif
-#  ifdef USE_AS_STRCAT
-	xor	%ch, %ch
-	movb	%ch, 23(%rdi)
-#  endif
-	ret
-
-	.p2align 4
-L(StrncpyExit24):
-	movdqu	(%rsi), %xmm0
-	mov	16(%rsi), %rcx
-	movdqu	%xmm0, (%rdi)
-	mov	%rcx, 16(%rdi)
-#  ifdef USE_AS_STPCPY
-	lea	24(%rdi), %rax
-#  endif
-#  ifdef USE_AS_STRCAT
-	xor	%ch, %ch
-	movb	%ch, 24(%rdi)
-#  endif
-	ret
-
-	.p2align 4
-L(StrncpyExit25):
-	movdqu	(%rsi), %xmm0
-	mov	16(%rsi), %rdx
-	mov	24(%rsi), %cl
-	movdqu	%xmm0, (%rdi)
-	mov	%rdx, 16(%rdi)
-	mov	%cl, 24(%rdi)
-#  ifdef USE_AS_STPCPY
-	lea	25(%rdi), %rax
-#  endif
-#  ifdef USE_AS_STRCAT
-	xor	%ch, %ch
-	movb	%ch, 25(%rdi)
-#  endif
-	ret
-
-	.p2align 4
-L(StrncpyExit26):
-	movdqu	(%rsi), %xmm0
-	mov	16(%rsi), %rdx
-	mov	24(%rsi), %cx
-	movdqu	%xmm0, (%rdi)
-	mov	%rdx, 16(%rdi)
-	mov	%cx, 24(%rdi)
-#  ifdef USE_AS_STPCPY
-	lea	26(%rdi), %rax
-#  endif
-#  ifdef USE_AS_STRCAT
-	xor	%ch, %ch
-	movb	%ch, 26(%rdi)
-#  endif
-	ret
-
-	.p2align 4
-L(StrncpyExit27):
-	movdqu	(%rsi), %xmm0
-	mov	16(%rsi), %rdx
-	mov	23(%rsi), %ecx
-	movdqu	%xmm0, (%rdi)
-	mov	%rdx, 16(%rdi)
-	mov	%ecx, 23(%rdi)
-#  ifdef USE_AS_STPCPY
-	lea	27(%rdi), %rax
-#  endif
-#  ifdef USE_AS_STRCAT
-	xor	%ch, %ch
-	movb	%ch, 27(%rdi)
-#  endif
-	ret
-
-	.p2align 4
-L(StrncpyExit28):
-	movdqu	(%rsi), %xmm0
-	mov	16(%rsi), %rdx
-	mov	24(%rsi), %ecx
-	movdqu	%xmm0, (%rdi)
-	mov	%rdx, 16(%rdi)
-	mov	%ecx, 24(%rdi)
-#  ifdef USE_AS_STPCPY
-	lea	28(%rdi), %rax
-#  endif
-#  ifdef USE_AS_STRCAT
-	xor	%ch, %ch
-	movb	%ch, 28(%rdi)
-#  endif
-	ret
-
-	.p2align 4
-L(StrncpyExit29):
-	movdqu	(%rsi), %xmm0
-	movdqu	13(%rsi), %xmm2
-	movdqu	%xmm0, (%rdi)
-	movdqu	%xmm2, 13(%rdi)
-#  ifdef USE_AS_STPCPY
-	lea	29(%rdi), %rax
-#  endif
-#  ifdef USE_AS_STRCAT
-	xor	%ch, %ch
-	movb	%ch, 29(%rdi)
-#  endif
-	ret
-
-	.p2align 4
-L(StrncpyExit30):
-	movdqu	(%rsi), %xmm0
-	movdqu	14(%rsi), %xmm2
-	movdqu	%xmm0, (%rdi)
-	movdqu	%xmm2, 14(%rdi)
-#  ifdef USE_AS_STPCPY
-	lea	30(%rdi), %rax
-#  endif
-#  ifdef USE_AS_STRCAT
-	xor	%ch, %ch
-	movb	%ch, 30(%rdi)
-#  endif
-	ret
-
-	.p2align 4
-L(StrncpyExit31):
-	movdqu	(%rsi), %xmm0
-	movdqu	15(%rsi), %xmm2
-	movdqu	%xmm0, (%rdi)
-	movdqu	%xmm2, 15(%rdi)
-#  ifdef USE_AS_STPCPY
-	lea	31(%rdi), %rax
-#  endif
-#  ifdef USE_AS_STRCAT
-	xor	%ch, %ch
-	movb	%ch, 31(%rdi)
-#  endif
-	ret
-
-	.p2align 4
-L(StrncpyExit32):
-	movdqu	(%rsi), %xmm0
-	movdqu	16(%rsi), %xmm2
-	movdqu	%xmm0, (%rdi)
-	movdqu	%xmm2, 16(%rdi)
-#  ifdef USE_AS_STPCPY
-	lea	32(%rdi), %rax
-#  endif
-#  ifdef USE_AS_STRCAT
-	xor	%ch, %ch
-	movb	%ch, 32(%rdi)
-#  endif
-	ret
-
-	.p2align 4
-L(StrncpyExit33):
-	movdqu	(%rsi), %xmm0
-	movdqu	16(%rsi), %xmm2
-	mov	32(%rsi), %cl
-	movdqu	%xmm0, (%rdi)
-	movdqu	%xmm2, 16(%rdi)
-	mov	%cl, 32(%rdi)
-#  ifdef USE_AS_STRCAT
-	xor	%ch, %ch
-	movb	%ch, 33(%rdi)
-#  endif
-	ret
-
-#  ifndef USE_AS_STRCAT
-
-	.p2align 4
-L(Fill0):
-	ret
-
-	.p2align 4
-L(Fill1):
-	mov	%dl, (%rdi)
-	ret
-
-	.p2align 4
-L(Fill2):
-	mov	%dx, (%rdi)
-	ret
-
-	.p2align 4
-L(Fill3):
-	mov	%edx, -1(%rdi)
-	ret
-
-	.p2align 4
-L(Fill4):
-	mov	%edx, (%rdi)
-	ret
-
-	.p2align 4
-L(Fill5):
-	mov	%edx, (%rdi)
-	mov	%dl, 4(%rdi)
-	ret
-
-	.p2align 4
-L(Fill6):
-	mov	%edx, (%rdi)
-	mov	%dx, 4(%rdi)
-	ret
-
-	.p2align 4
-L(Fill7):
-	mov	%rdx, -1(%rdi)
-	ret
-
-	.p2align 4
-L(Fill8):
-	mov	%rdx, (%rdi)
-	ret
-
-	.p2align 4
-L(Fill9):
-	mov	%rdx, (%rdi)
-	mov	%dl, 8(%rdi)
-	ret
-
-	.p2align 4
-L(Fill10):
-	mov	%rdx, (%rdi)
-	mov	%dx, 8(%rdi)
-	ret
-
-	.p2align 4
-L(Fill11):
-	mov	%rdx, (%rdi)
-	mov	%edx, 7(%rdi)
-	ret
-
-	.p2align 4
-L(Fill12):
-	mov	%rdx, (%rdi)
-	mov	%edx, 8(%rdi)
-	ret
-
-	.p2align 4
-L(Fill13):
-	mov	%rdx, (%rdi)
-	mov	%rdx, 5(%rdi)
-	ret
-
-	.p2align 4
-L(Fill14):
-	mov	%rdx, (%rdi)
-	mov	%rdx, 6(%rdi)
-	ret
-
-	.p2align 4
-L(Fill15):
-	movdqu	%xmm0, -1(%rdi)
-	ret
-
-	.p2align 4
-L(Fill16):
-	movdqu	%xmm0, (%rdi)
-	ret
-
-	.p2align 4
-L(CopyFrom1To16BytesUnalignedXmm2):
-	movdqu	%xmm2, (%rdi, %rcx)
-
-	.p2align 4
-L(CopyFrom1To16BytesXmmExit):
-	bsf	%rdx, %rdx
-	add	$15, %r8
-	add	%rcx, %rdi
-#   ifdef USE_AS_STPCPY
-	lea	(%rdi, %rdx), %rax
-#   endif
-	sub	%rdx, %r8
-	lea	1(%rdi, %rdx), %rdi
-
-	.p2align 4
-L(StrncpyFillTailWithZero):
-	pxor	%xmm0, %xmm0
-	xor	%rdx, %rdx
-	sub	$16, %r8
-	jbe	L(StrncpyFillExit)
-
-	movdqu	%xmm0, (%rdi)
-	add	$16, %rdi
-
-	mov	%rdi, %rsi
-	and	$0xf, %rsi
-	sub	%rsi, %rdi
-	add	%rsi, %r8
-	sub	$64, %r8
-	jb	L(StrncpyFillLess64)
-
-L(StrncpyFillLoopMovdqa):
-	movdqa	%xmm0, (%rdi)
-	movdqa	%xmm0, 16(%rdi)
-	movdqa	%xmm0, 32(%rdi)
-	movdqa	%xmm0, 48(%rdi)
-	add	$64, %rdi
-	sub	$64, %r8
-	jae	L(StrncpyFillLoopMovdqa)
-
-L(StrncpyFillLess64):
-	add	$32, %r8
-	jl	L(StrncpyFillLess32)
-	movdqa	%xmm0, (%rdi)
-	movdqa	%xmm0, 16(%rdi)
-	add	$32, %rdi
-	sub	$16, %r8
-	jl	L(StrncpyFillExit)
-	movdqa	%xmm0, (%rdi)
-	add	$16, %rdi
-	BRANCH_TO_JMPTBL_ENTRY (L(FillTable), %r8, 4)
-
-L(StrncpyFillLess32):
-	add	$16, %r8
-	jl	L(StrncpyFillExit)
-	movdqa	%xmm0, (%rdi)
-	add	$16, %rdi
-	BRANCH_TO_JMPTBL_ENTRY (L(FillTable), %r8, 4)
-
-L(StrncpyFillExit):
-	add	$16, %r8
-	BRANCH_TO_JMPTBL_ENTRY (L(FillTable), %r8, 4)
-
-/* end of ifndef USE_AS_STRCAT */
-#  endif
-
-	.p2align 4
-L(UnalignedLeaveCase2OrCase3):
-	test	%rdx, %rdx
-	jnz	L(Unaligned64LeaveCase2)
-L(Unaligned64LeaveCase3):
-	lea	64(%r8), %rcx
-	and	$-16, %rcx
-	add	$48, %r8
-	jl	L(CopyFrom1To16BytesCase3)
-	movdqu	%xmm4, (%rdi)
-	sub	$16, %r8
-	jb	L(CopyFrom1To16BytesCase3)
-	movdqu	%xmm5, 16(%rdi)
-	sub	$16, %r8
-	jb	L(CopyFrom1To16BytesCase3)
-	movdqu	%xmm6, 32(%rdi)
-	sub	$16, %r8
-	jb	L(CopyFrom1To16BytesCase3)
-	movdqu	%xmm7, 48(%rdi)
-#  ifdef USE_AS_STPCPY
-	lea	64(%rdi), %rax
-#  endif
-#  ifdef USE_AS_STRCAT
-	xor	%ch, %ch
-	movb	%ch, 64(%rdi)
-#  endif
-	ret
-
-	.p2align 4
-L(Unaligned64LeaveCase2):
-	xor	%rcx, %rcx
-	pcmpeqb	%xmm4, %xmm0
-	pmovmskb %xmm0, %rdx
-	add	$48, %r8
-	jle	L(CopyFrom1To16BytesCase2OrCase3)
-	test	%rdx, %rdx
-#  ifndef USE_AS_STRCAT
-	jnz	L(CopyFrom1To16BytesUnalignedXmm4)
-#  else
-	jnz	L(CopyFrom1To16Bytes)
-#  endif
-	pcmpeqb	%xmm5, %xmm0
-	pmovmskb %xmm0, %rdx
-	movdqu	%xmm4, (%rdi)
-	add	$16, %rcx
-	sub	$16, %r8
-	jbe	L(CopyFrom1To16BytesCase2OrCase3)
-	test	%rdx, %rdx
-#  ifndef USE_AS_STRCAT
-	jnz	L(CopyFrom1To16BytesUnalignedXmm5)
-#  else
-	jnz	L(CopyFrom1To16Bytes)
-#  endif
-
-	pcmpeqb	%xmm6, %xmm0
-	pmovmskb %xmm0, %rdx
-	movdqu	%xmm5, 16(%rdi)
-	add	$16, %rcx
-	sub	$16, %r8
-	jbe	L(CopyFrom1To16BytesCase2OrCase3)
-	test	%rdx, %rdx
-#  ifndef USE_AS_STRCAT
-	jnz	L(CopyFrom1To16BytesUnalignedXmm6)
-#  else
-	jnz	L(CopyFrom1To16Bytes)
-#  endif
-
-	pcmpeqb	%xmm7, %xmm0
-	pmovmskb %xmm0, %rdx
-	movdqu	%xmm6, 32(%rdi)
-	lea	16(%rdi, %rcx), %rdi
-	lea	16(%rsi, %rcx), %rsi
-	bsf	%rdx, %rdx
-	cmp	%r8, %rdx
-	jb	L(CopyFrom1To16BytesExit)
-	BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4)
-
-	.p2align 4
-L(ExitZero):
-#  ifndef USE_AS_STRCAT
-	mov	%rdi, %rax
-#  endif
-	ret
-
-# endif
-
-# ifndef USE_AS_STRCAT
-END (STRCPY)
-# else
-END (STRCAT)
-# endif
-	.p2align 4
-	.section .rodata
-L(ExitTable):
-	.int	JMPTBL(L(Exit1), L(ExitTable))
-	.int	JMPTBL(L(Exit2), L(ExitTable))
-	.int	JMPTBL(L(Exit3), L(ExitTable))
-	.int	JMPTBL(L(Exit4), L(ExitTable))
-	.int	JMPTBL(L(Exit5), L(ExitTable))
-	.int	JMPTBL(L(Exit6), L(ExitTable))
-	.int	JMPTBL(L(Exit7), L(ExitTable))
-	.int	JMPTBL(L(Exit8), L(ExitTable))
-	.int	JMPTBL(L(Exit9), L(ExitTable))
-	.int	JMPTBL(L(Exit10), L(ExitTable))
-	.int	JMPTBL(L(Exit11), L(ExitTable))
-	.int	JMPTBL(L(Exit12), L(ExitTable))
-	.int	JMPTBL(L(Exit13), L(ExitTable))
-	.int	JMPTBL(L(Exit14), L(ExitTable))
-	.int	JMPTBL(L(Exit15), L(ExitTable))
-	.int	JMPTBL(L(Exit16), L(ExitTable))
-	.int	JMPTBL(L(Exit17), L(ExitTable))
-	.int	JMPTBL(L(Exit18), L(ExitTable))
-	.int	JMPTBL(L(Exit19), L(ExitTable))
-	.int	JMPTBL(L(Exit20), L(ExitTable))
-	.int	JMPTBL(L(Exit21), L(ExitTable))
-	.int	JMPTBL(L(Exit22), L(ExitTable))
-	.int    JMPTBL(L(Exit23), L(ExitTable))
-	.int	JMPTBL(L(Exit24), L(ExitTable))
-	.int	JMPTBL(L(Exit25), L(ExitTable))
-	.int	JMPTBL(L(Exit26), L(ExitTable))
-	.int	JMPTBL(L(Exit27), L(ExitTable))
-	.int	JMPTBL(L(Exit28), L(ExitTable))
-	.int	JMPTBL(L(Exit29), L(ExitTable))
-	.int	JMPTBL(L(Exit30), L(ExitTable))
-	.int	JMPTBL(L(Exit31), L(ExitTable))
-	.int	JMPTBL(L(Exit32), L(ExitTable))
-# ifdef USE_AS_STRNCPY
-L(ExitStrncpyTable):
-	.int	JMPTBL(L(StrncpyExit0), L(ExitStrncpyTable))
-	.int	JMPTBL(L(StrncpyExit1), L(ExitStrncpyTable))
-	.int	JMPTBL(L(StrncpyExit2), L(ExitStrncpyTable))
-	.int	JMPTBL(L(StrncpyExit3), L(ExitStrncpyTable))
-	.int	JMPTBL(L(StrncpyExit4), L(ExitStrncpyTable))
-	.int	JMPTBL(L(StrncpyExit5), L(ExitStrncpyTable))
-	.int	JMPTBL(L(StrncpyExit6), L(ExitStrncpyTable))
-	.int	JMPTBL(L(StrncpyExit7), L(ExitStrncpyTable))
-	.int	JMPTBL(L(StrncpyExit8), L(ExitStrncpyTable))
-	.int	JMPTBL(L(StrncpyExit9), L(ExitStrncpyTable))
-	.int	JMPTBL(L(StrncpyExit10), L(ExitStrncpyTable))
-	.int	JMPTBL(L(StrncpyExit11), L(ExitStrncpyTable))
-	.int	JMPTBL(L(StrncpyExit12), L(ExitStrncpyTable))
-	.int	JMPTBL(L(StrncpyExit13), L(ExitStrncpyTable))
-	.int	JMPTBL(L(StrncpyExit14), L(ExitStrncpyTable))
-	.int	JMPTBL(L(StrncpyExit15), L(ExitStrncpyTable))
-	.int	JMPTBL(L(StrncpyExit16), L(ExitStrncpyTable))
-	.int	JMPTBL(L(StrncpyExit17), L(ExitStrncpyTable))
-	.int	JMPTBL(L(StrncpyExit18), L(ExitStrncpyTable))
-	.int	JMPTBL(L(StrncpyExit19), L(ExitStrncpyTable))
-	.int	JMPTBL(L(StrncpyExit20), L(ExitStrncpyTable))
-	.int	JMPTBL(L(StrncpyExit21), L(ExitStrncpyTable))
-	.int	JMPTBL(L(StrncpyExit22), L(ExitStrncpyTable))
-	.int    JMPTBL(L(StrncpyExit23), L(ExitStrncpyTable))
-	.int	JMPTBL(L(StrncpyExit24), L(ExitStrncpyTable))
-	.int	JMPTBL(L(StrncpyExit25), L(ExitStrncpyTable))
-	.int	JMPTBL(L(StrncpyExit26), L(ExitStrncpyTable))
-	.int	JMPTBL(L(StrncpyExit27), L(ExitStrncpyTable))
-	.int	JMPTBL(L(StrncpyExit28), L(ExitStrncpyTable))
-	.int	JMPTBL(L(StrncpyExit29), L(ExitStrncpyTable))
-	.int	JMPTBL(L(StrncpyExit30), L(ExitStrncpyTable))
-	.int	JMPTBL(L(StrncpyExit31), L(ExitStrncpyTable))
-	.int	JMPTBL(L(StrncpyExit32), L(ExitStrncpyTable))
-	.int	JMPTBL(L(StrncpyExit33), L(ExitStrncpyTable))
-#  ifndef USE_AS_STRCAT
-	.p2align 4
-L(FillTable):
-	.int	JMPTBL(L(Fill0), L(FillTable))
-	.int	JMPTBL(L(Fill1), L(FillTable))
-	.int	JMPTBL(L(Fill2), L(FillTable))
-	.int	JMPTBL(L(Fill3), L(FillTable))
-	.int	JMPTBL(L(Fill4), L(FillTable))
-	.int	JMPTBL(L(Fill5), L(FillTable))
-	.int	JMPTBL(L(Fill6), L(FillTable))
-	.int	JMPTBL(L(Fill7), L(FillTable))
-	.int	JMPTBL(L(Fill8), L(FillTable))
-	.int	JMPTBL(L(Fill9), L(FillTable))
-	.int	JMPTBL(L(Fill10), L(FillTable))
-	.int	JMPTBL(L(Fill11), L(FillTable))
-	.int	JMPTBL(L(Fill12), L(FillTable))
-	.int	JMPTBL(L(Fill13), L(FillTable))
-	.int	JMPTBL(L(Fill14), L(FillTable))
-	.int	JMPTBL(L(Fill15), L(FillTable))
-	.int	JMPTBL(L(Fill16), L(FillTable))
-#  endif
-# endif
-#endif
+#define AS_STRCPY
+#define STPCPY __strcpy_sse2_unaligned
+#include "stpcpy-sse2-unaligned.S"
diff --git a/sysdeps/x86_64/multiarch/strcpy.S b/sysdeps/x86_64/multiarch/strcpy.S
index 9464ee8..92be04c 100644
--- a/sysdeps/x86_64/multiarch/strcpy.S
+++ b/sysdeps/x86_64/multiarch/strcpy.S
@@ -28,31 +28,18 @@ 
 #endif
 
 #ifdef USE_AS_STPCPY
-# ifdef USE_AS_STRNCPY
-#  define STRCPY_SSSE3		__stpncpy_ssse3
-#  define STRCPY_SSE2		__stpncpy_sse2
-#  define STRCPY_SSE2_UNALIGNED __stpncpy_sse2_unaligned
-#  define __GI_STRCPY		__GI_stpncpy
-#  define __GI___STRCPY		__GI___stpncpy
-# else
 #  define STRCPY_SSSE3		__stpcpy_ssse3
 #  define STRCPY_SSE2		__stpcpy_sse2
+#  define STRCPY_AVX2		__stpcpy_avx2
 #  define STRCPY_SSE2_UNALIGNED	__stpcpy_sse2_unaligned
 #  define __GI_STRCPY		__GI_stpcpy
 #  define __GI___STRCPY		__GI___stpcpy
-# endif
 #else
-# ifdef USE_AS_STRNCPY
-#  define STRCPY_SSSE3		__strncpy_ssse3
-#  define STRCPY_SSE2		__strncpy_sse2
-#  define STRCPY_SSE2_UNALIGNED	__strncpy_sse2_unaligned
-#  define __GI_STRCPY		__GI_strncpy
-# else
 #  define STRCPY_SSSE3		__strcpy_ssse3
+#  define STRCPY_AVX2		__strcpy_avx2
 #  define STRCPY_SSE2		__strcpy_sse2
 #  define STRCPY_SSE2_UNALIGNED	__strcpy_sse2_unaligned
 #  define __GI_STRCPY		__GI_strcpy
-# endif
 #endif
 
 
@@ -64,7 +51,10 @@  ENTRY(STRCPY)
 	cmpl	$0, __cpu_features+KIND_OFFSET(%rip)
 	jne	1f
 	call	__init_cpu_features
-1:	leaq	STRCPY_SSE2_UNALIGNED(%rip), %rax
+1:	leaq	STRCPY_AVX2(%rip), %rax
+	testl   $bit_AVX_Fast_Unaligned_Load, __cpu_features+FEATURE_OFFSET+index_AVX_Fast_Unaligned_Load(%rip)
+	jnz	2f
+	leaq	STRCPY_SSE2_UNALIGNED(%rip), %rax
 	testl	$bit_Fast_Unaligned_Load, __cpu_features+FEATURE_OFFSET+index_Fast_Unaligned_Load(%rip)
 	jnz	2f
 	leaq	STRCPY_SSE2(%rip), %rax
diff --git a/sysdeps/x86_64/multiarch/strncpy-sse2-unaligned.S b/sysdeps/x86_64/multiarch/strncpy-sse2-unaligned.S
index fcc23a7..e4c98e7 100644
--- a/sysdeps/x86_64/multiarch/strncpy-sse2-unaligned.S
+++ b/sysdeps/x86_64/multiarch/strncpy-sse2-unaligned.S
@@ -1,3 +1,1888 @@ 
-#define USE_AS_STRNCPY
-#define STRCPY __strncpy_sse2_unaligned
-#include "strcpy-sse2-unaligned.S"
+/* strcpy with SSE2 and unaligned load
+   Copyright (C) 2011-2015 Free Software Foundation, Inc.
+   Contributed by Intel Corporation.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#if IS_IN (libc)
+
+# ifndef USE_AS_STRCAT
+#  include <sysdep.h>
+
+#  ifndef STRCPY
+#   define STRCPY  __strncpy_sse2_unaligned
+#  endif
+
+# define USE_AS_STRNCPY
+# endif
+
+# define JMPTBL(I, B)	I - B
+# define BRANCH_TO_JMPTBL_ENTRY(TABLE, INDEX, SCALE)             \
+	lea	TABLE(%rip), %r11;                              \
+	movslq	(%r11, INDEX, SCALE), %rcx;                     \
+	lea	(%r11, %rcx), %rcx;                             \
+	jmp	*%rcx
+
+# ifndef USE_AS_STRCAT
+
+.text
+ENTRY (STRCPY)
+#  ifdef USE_AS_STRNCPY
+	mov	%rdx, %r8
+	test	%r8, %r8
+	jz	L(ExitZero)
+#  endif
+	mov	%rsi, %rcx
+#  ifndef USE_AS_STPCPY
+	mov	%rdi, %rax      /* save result */
+#  endif
+
+# endif
+
+	and	$63, %rcx
+	cmp	$32, %rcx
+	jbe	L(SourceStringAlignmentLess32)
+
+	and	$-16, %rsi
+	and	$15, %rcx
+	pxor	%xmm0, %xmm0
+	pxor	%xmm1, %xmm1
+
+	pcmpeqb	(%rsi), %xmm1
+	pmovmskb %xmm1, %rdx
+	shr	%cl, %rdx
+
+# ifdef USE_AS_STRNCPY
+#  if defined USE_AS_STPCPY || defined USE_AS_STRCAT
+	mov	$16, %r10
+	sub	%rcx, %r10
+	cmp	%r10, %r8
+#  else
+	mov	$17, %r10
+	sub	%rcx, %r10
+	cmp	%r10, %r8
+#  endif
+	jbe	L(CopyFrom1To16BytesTailCase2OrCase3)
+# endif
+	test	%rdx, %rdx
+	jnz	L(CopyFrom1To16BytesTail)
+
+	pcmpeqb	16(%rsi), %xmm0
+	pmovmskb %xmm0, %rdx
+
+# ifdef USE_AS_STRNCPY
+	add	$16, %r10
+	cmp	%r10, %r8
+	jbe	L(CopyFrom1To32BytesCase2OrCase3)
+# endif
+	test	%rdx, %rdx
+	jnz	L(CopyFrom1To32Bytes)
+
+	movdqu	(%rsi, %rcx), %xmm1   /* copy 16 bytes */
+	movdqu	%xmm1, (%rdi)
+
+/* If source address alignment != destination address alignment */
+	.p2align 4
+L(Unalign16Both):
+	sub	%rcx, %rdi
+# ifdef USE_AS_STRNCPY
+	add	%rcx, %r8
+# endif
+	mov	$16, %rcx
+	movdqa	(%rsi, %rcx), %xmm1
+	movaps	16(%rsi, %rcx), %xmm2
+	movdqu	%xmm1, (%rdi, %rcx)
+	pcmpeqb	%xmm2, %xmm0
+	pmovmskb %xmm0, %rdx
+	add	$16, %rcx
+# ifdef USE_AS_STRNCPY
+	sub	$48, %r8
+	jbe	L(CopyFrom1To16BytesCase2OrCase3)
+# endif
+	test	%rdx, %rdx
+# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
+	jnz	L(CopyFrom1To16BytesUnalignedXmm2)
+# else
+	jnz	L(CopyFrom1To16Bytes)
+# endif
+
+	movaps	16(%rsi, %rcx), %xmm3
+	movdqu	%xmm2, (%rdi, %rcx)
+	pcmpeqb	%xmm3, %xmm0
+	pmovmskb %xmm0, %rdx
+	add	$16, %rcx
+# ifdef USE_AS_STRNCPY
+	sub	$16, %r8
+	jbe	L(CopyFrom1To16BytesCase2OrCase3)
+# endif
+	test	%rdx, %rdx
+# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
+	jnz	L(CopyFrom1To16BytesUnalignedXmm3)
+# else
+	jnz	L(CopyFrom1To16Bytes)
+# endif
+
+	movaps	16(%rsi, %rcx), %xmm4
+	movdqu	%xmm3, (%rdi, %rcx)
+	pcmpeqb	%xmm4, %xmm0
+	pmovmskb %xmm0, %rdx
+	add	$16, %rcx
+# ifdef USE_AS_STRNCPY
+	sub	$16, %r8
+	jbe	L(CopyFrom1To16BytesCase2OrCase3)
+# endif
+	test	%rdx, %rdx
+# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
+	jnz	L(CopyFrom1To16BytesUnalignedXmm4)
+# else
+	jnz	L(CopyFrom1To16Bytes)
+# endif
+
+	movaps	16(%rsi, %rcx), %xmm1
+	movdqu	%xmm4, (%rdi, %rcx)
+	pcmpeqb	%xmm1, %xmm0
+	pmovmskb %xmm0, %rdx
+	add	$16, %rcx
+# ifdef USE_AS_STRNCPY
+	sub	$16, %r8
+	jbe	L(CopyFrom1To16BytesCase2OrCase3)
+# endif
+	test	%rdx, %rdx
+# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
+	jnz	L(CopyFrom1To16BytesUnalignedXmm1)
+# else
+	jnz	L(CopyFrom1To16Bytes)
+# endif
+
+	movaps	16(%rsi, %rcx), %xmm2
+	movdqu	%xmm1, (%rdi, %rcx)
+	pcmpeqb	%xmm2, %xmm0
+	pmovmskb %xmm0, %rdx
+	add	$16, %rcx
+# ifdef USE_AS_STRNCPY
+	sub	$16, %r8
+	jbe	L(CopyFrom1To16BytesCase2OrCase3)
+# endif
+	test	%rdx, %rdx
+# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
+	jnz	L(CopyFrom1To16BytesUnalignedXmm2)
+# else
+	jnz	L(CopyFrom1To16Bytes)
+# endif
+
+	movaps	16(%rsi, %rcx), %xmm3
+	movdqu	%xmm2, (%rdi, %rcx)
+	pcmpeqb	%xmm3, %xmm0
+	pmovmskb %xmm0, %rdx
+	add	$16, %rcx
+# ifdef USE_AS_STRNCPY
+	sub	$16, %r8
+	jbe	L(CopyFrom1To16BytesCase2OrCase3)
+# endif
+	test	%rdx, %rdx
+# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
+	jnz	L(CopyFrom1To16BytesUnalignedXmm3)
+# else
+	jnz	L(CopyFrom1To16Bytes)
+# endif
+
+	movdqu	%xmm3, (%rdi, %rcx)
+	mov	%rsi, %rdx
+	lea	16(%rsi, %rcx), %rsi
+	and	$-0x40, %rsi
+	sub	%rsi, %rdx
+	sub	%rdx, %rdi
+# ifdef USE_AS_STRNCPY
+	lea	128(%r8, %rdx), %r8
+# endif
+L(Unaligned64Loop):
+	movaps	(%rsi), %xmm2
+	movaps	%xmm2, %xmm4
+	movaps	16(%rsi), %xmm5
+	movaps	32(%rsi), %xmm3
+	movaps	%xmm3, %xmm6
+	movaps	48(%rsi), %xmm7
+	pminub	%xmm5, %xmm2
+	pminub	%xmm7, %xmm3
+	pminub	%xmm2, %xmm3
+	pcmpeqb	%xmm0, %xmm3
+	pmovmskb %xmm3, %rdx
+# ifdef USE_AS_STRNCPY
+	sub	$64, %r8
+	jbe	L(UnalignedLeaveCase2OrCase3)
+# endif
+	test	%rdx, %rdx
+	jnz	L(Unaligned64Leave)
+
+L(Unaligned64Loop_start):
+	add	$64, %rdi
+	add	$64, %rsi
+	movdqu	%xmm4, -64(%rdi)
+	movaps	(%rsi), %xmm2
+	movdqa	%xmm2, %xmm4
+	movdqu	%xmm5, -48(%rdi)
+	movaps	16(%rsi), %xmm5
+	pminub	%xmm5, %xmm2
+	movaps	32(%rsi), %xmm3
+	movdqu	%xmm6, -32(%rdi)
+	movaps	%xmm3, %xmm6
+	movdqu	%xmm7, -16(%rdi)
+	movaps	48(%rsi), %xmm7
+	pminub	%xmm7, %xmm3
+	pminub	%xmm2, %xmm3
+	pcmpeqb	%xmm0, %xmm3
+	pmovmskb %xmm3, %rdx
+# ifdef USE_AS_STRNCPY
+	sub	$64, %r8
+	jbe	L(UnalignedLeaveCase2OrCase3)
+# endif
+	test	%rdx, %rdx
+	jz	L(Unaligned64Loop_start)
+
+L(Unaligned64Leave):
+	pxor	%xmm1, %xmm1
+
+	pcmpeqb	%xmm4, %xmm0
+	pcmpeqb	%xmm5, %xmm1
+	pmovmskb %xmm0, %rdx
+	pmovmskb %xmm1, %rcx
+	test	%rdx, %rdx
+	jnz	L(CopyFrom1To16BytesUnaligned_0)
+	test	%rcx, %rcx
+	jnz	L(CopyFrom1To16BytesUnaligned_16)
+
+	pcmpeqb	%xmm6, %xmm0
+	pcmpeqb	%xmm7, %xmm1
+	pmovmskb %xmm0, %rdx
+	pmovmskb %xmm1, %rcx
+	test	%rdx, %rdx
+	jnz	L(CopyFrom1To16BytesUnaligned_32)
+
+	bsf	%rcx, %rdx
+	movdqu	%xmm4, (%rdi)
+	movdqu	%xmm5, 16(%rdi)
+	movdqu	%xmm6, 32(%rdi)
+# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
+# ifdef USE_AS_STPCPY
+	lea	48(%rdi, %rdx), %rax
+# endif
+	movdqu	%xmm7, 48(%rdi)
+	add	$15, %r8
+	sub	%rdx, %r8
+	lea	49(%rdi, %rdx), %rdi
+	jmp	L(StrncpyFillTailWithZero)
+# else
+	add	$48, %rsi
+	add	$48, %rdi
+	BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4)
+# endif
+
+/* If source address alignment == destination address alignment */
+
+L(SourceStringAlignmentLess32):
+	pxor	%xmm0, %xmm0
+	movdqu	(%rsi), %xmm1
+	movdqu	16(%rsi), %xmm2
+	pcmpeqb	%xmm1, %xmm0
+	pmovmskb %xmm0, %rdx
+
+# ifdef USE_AS_STRNCPY
+#  if defined USE_AS_STPCPY || defined USE_AS_STRCAT
+	cmp	$16, %r8
+#  else
+	cmp	$17, %r8
+#  endif
+	jbe	L(CopyFrom1To16BytesTail1Case2OrCase3)
+# endif
+	test	%rdx, %rdx
+	jnz	L(CopyFrom1To16BytesTail1)
+
+	pcmpeqb	%xmm2, %xmm0
+	movdqu	%xmm1, (%rdi)
+	pmovmskb %xmm0, %rdx
+
+# ifdef USE_AS_STRNCPY
+#  if defined USE_AS_STPCPY || defined USE_AS_STRCAT
+	cmp	$32, %r8
+#  else
+	cmp	$33, %r8
+#  endif
+	jbe	L(CopyFrom1To32Bytes1Case2OrCase3)
+# endif
+	test	%rdx, %rdx
+	jnz	L(CopyFrom1To32Bytes1)
+
+	and	$-16, %rsi
+	and	$15, %rcx
+	jmp	L(Unalign16Both)
+
+/*------End of main part with loops---------------------*/
+
+/* Case1 */
+
+# if (!defined USE_AS_STRNCPY) || (defined USE_AS_STRCAT)
+	.p2align 4
+L(CopyFrom1To16Bytes):
+	add	%rcx, %rdi
+	add	%rcx, %rsi
+	bsf	%rdx, %rdx
+	BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4)
+# endif
+	.p2align 4
+L(CopyFrom1To16BytesTail):
+	add	%rcx, %rsi
+	bsf	%rdx, %rdx
+	BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4)
+
+	.p2align 4
+L(CopyFrom1To32Bytes1):
+	add	$16, %rsi
+	add	$16, %rdi
+# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
+	sub	$16, %r8
+# endif
+L(CopyFrom1To16BytesTail1):
+	bsf	%rdx, %rdx
+	BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4)
+
+	.p2align 4
+L(CopyFrom1To32Bytes):
+	bsf	%rdx, %rdx
+	add	%rcx, %rsi
+	add	$16, %rdx
+	sub	%rcx, %rdx
+	BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4)
+
+	.p2align 4
+L(CopyFrom1To16BytesUnaligned_0):
+	bsf	%rdx, %rdx
+# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
+# ifdef USE_AS_STPCPY
+	lea	(%rdi, %rdx), %rax
+# endif
+	movdqu	%xmm4, (%rdi)
+	add	$63, %r8
+	sub	%rdx, %r8
+	lea	1(%rdi, %rdx), %rdi
+	jmp	L(StrncpyFillTailWithZero)
+# else
+	BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4)
+# endif
+
+	.p2align 4
+L(CopyFrom1To16BytesUnaligned_16):
+	bsf	%rcx, %rdx
+	movdqu	%xmm4, (%rdi)
+# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
+# ifdef USE_AS_STPCPY
+	lea	16(%rdi, %rdx), %rax
+# endif
+	movdqu	%xmm5, 16(%rdi)
+	add	$47, %r8
+	sub	%rdx, %r8
+	lea	17(%rdi, %rdx), %rdi
+	jmp	L(StrncpyFillTailWithZero)
+# else
+	add	$16, %rsi
+	add	$16, %rdi
+	BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4)
+# endif
+
+	.p2align 4
+L(CopyFrom1To16BytesUnaligned_32):
+	bsf	%rdx, %rdx
+	movdqu	%xmm4, (%rdi)
+	movdqu	%xmm5, 16(%rdi)
+# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
+# ifdef USE_AS_STPCPY
+	lea	32(%rdi, %rdx), %rax
+# endif
+	movdqu	%xmm6, 32(%rdi)
+	add	$31, %r8
+	sub	%rdx, %r8
+	lea	33(%rdi, %rdx), %rdi
+	jmp	L(StrncpyFillTailWithZero)
+# else
+	add	$32, %rsi
+	add	$32, %rdi
+	BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4)
+# endif
+
+# ifdef USE_AS_STRNCPY
+#  ifndef USE_AS_STRCAT
+	.p2align 4
+L(CopyFrom1To16BytesUnalignedXmm6):
+	movdqu	%xmm6, (%rdi, %rcx)
+	jmp	L(CopyFrom1To16BytesXmmExit)
+
+	.p2align 4
+L(CopyFrom1To16BytesUnalignedXmm5):
+	movdqu	%xmm5, (%rdi, %rcx)
+	jmp	L(CopyFrom1To16BytesXmmExit)
+
+	.p2align 4
+L(CopyFrom1To16BytesUnalignedXmm4):
+	movdqu	%xmm4, (%rdi, %rcx)
+	jmp	L(CopyFrom1To16BytesXmmExit)
+
+	.p2align 4
+L(CopyFrom1To16BytesUnalignedXmm3):
+	movdqu	%xmm3, (%rdi, %rcx)
+	jmp	L(CopyFrom1To16BytesXmmExit)
+
+	.p2align 4
+L(CopyFrom1To16BytesUnalignedXmm1):
+	movdqu	%xmm1, (%rdi, %rcx)
+	jmp	L(CopyFrom1To16BytesXmmExit)
+#  endif
+
+	.p2align 4
+L(CopyFrom1To16BytesExit):
+	BRANCH_TO_JMPTBL_ENTRY (L(ExitTable), %rdx, 4)
+
+/* Case2 */
+
+	.p2align 4
+L(CopyFrom1To16BytesCase2):
+	add	$16, %r8
+	add	%rcx, %rdi
+	add	%rcx, %rsi
+	bsf	%rdx, %rdx
+	cmp	%r8, %rdx
+	jb	L(CopyFrom1To16BytesExit)
+	BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4)
+
+	.p2align 4
+L(CopyFrom1To32BytesCase2):
+	add	%rcx, %rsi
+	bsf	%rdx, %rdx
+	add	$16, %rdx
+	sub	%rcx, %rdx
+	cmp	%r8, %rdx
+	jb	L(CopyFrom1To16BytesExit)
+	BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4)
+
+L(CopyFrom1To16BytesTailCase2):
+	add	%rcx, %rsi
+	bsf	%rdx, %rdx
+	cmp	%r8, %rdx
+	jb	L(CopyFrom1To16BytesExit)
+	BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4)
+
+L(CopyFrom1To16BytesTail1Case2):
+	bsf	%rdx, %rdx
+	cmp	%r8, %rdx
+	jb	L(CopyFrom1To16BytesExit)
+	BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4)
+
+/* Case2 or Case3,  Case3 */
+
+	.p2align 4
+L(CopyFrom1To16BytesCase2OrCase3):
+	test	%rdx, %rdx
+	jnz	L(CopyFrom1To16BytesCase2)
+L(CopyFrom1To16BytesCase3):
+	add	$16, %r8
+	add	%rcx, %rdi
+	add	%rcx, %rsi
+	BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4)
+
+	.p2align 4
+L(CopyFrom1To32BytesCase2OrCase3):
+	test	%rdx, %rdx
+	jnz	L(CopyFrom1To32BytesCase2)
+	add	%rcx, %rsi
+	BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4)
+
+	.p2align 4
+L(CopyFrom1To16BytesTailCase2OrCase3):
+	test	%rdx, %rdx
+	jnz	L(CopyFrom1To16BytesTailCase2)
+	add	%rcx, %rsi
+	BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4)
+
+	.p2align 4
+L(CopyFrom1To32Bytes1Case2OrCase3):
+	add	$16, %rdi
+	add	$16, %rsi
+	sub	$16, %r8
+L(CopyFrom1To16BytesTail1Case2OrCase3):
+	test	%rdx, %rdx
+	jnz	L(CopyFrom1To16BytesTail1Case2)
+	BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4)
+
+# endif
+
+/*------------End labels regarding with copying 1-16 bytes--and 1-32 bytes----*/
+
+	.p2align 4
+L(Exit1):
+	mov	%dh, (%rdi)
+# ifdef USE_AS_STPCPY
+	lea	(%rdi), %rax
+# endif
+# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
+	sub	$1, %r8
+	lea	1(%rdi), %rdi
+	jnz	L(StrncpyFillTailWithZero)
+# endif
+	ret
+
+	.p2align 4
+L(Exit2):
+	mov	(%rsi), %dx
+	mov	%dx, (%rdi)
+# ifdef USE_AS_STPCPY
+	lea	1(%rdi), %rax
+# endif
+# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
+	sub	$2, %r8
+	lea	2(%rdi), %rdi
+	jnz	L(StrncpyFillTailWithZero)
+# endif
+	ret
+
+	.p2align 4
+L(Exit3):
+	mov	(%rsi), %cx
+	mov	%cx, (%rdi)
+	mov	%dh, 2(%rdi)
+# ifdef USE_AS_STPCPY
+	lea	2(%rdi), %rax
+# endif
+# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
+	sub	$3, %r8
+	lea	3(%rdi), %rdi
+	jnz	L(StrncpyFillTailWithZero)
+# endif
+	ret
+
+	.p2align 4
+L(Exit4):
+	mov	(%rsi), %edx
+	mov	%edx, (%rdi)
+# ifdef USE_AS_STPCPY
+	lea	3(%rdi), %rax
+# endif
+# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
+	sub	$4, %r8
+	lea	4(%rdi), %rdi
+	jnz	L(StrncpyFillTailWithZero)
+# endif
+	ret
+
+	.p2align 4
+L(Exit5):
+	mov	(%rsi), %ecx
+	mov	%dh, 4(%rdi)
+	mov	%ecx, (%rdi)
+# ifdef USE_AS_STPCPY
+	lea	4(%rdi), %rax
+# endif
+# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
+	sub	$5, %r8
+	lea	5(%rdi), %rdi
+	jnz	L(StrncpyFillTailWithZero)
+# endif
+	ret
+
+	.p2align 4
+L(Exit6):
+	mov	(%rsi), %ecx
+	mov	4(%rsi), %dx
+	mov	%ecx, (%rdi)
+	mov	%dx, 4(%rdi)
+# ifdef USE_AS_STPCPY
+	lea	5(%rdi), %rax
+# endif
+# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
+	sub	$6, %r8
+	lea	6(%rdi), %rdi
+	jnz	L(StrncpyFillTailWithZero)
+# endif
+	ret
+
+	.p2align 4
+L(Exit7):
+	mov	(%rsi), %ecx
+	mov	3(%rsi), %edx
+	mov	%ecx, (%rdi)
+	mov	%edx, 3(%rdi)
+# ifdef USE_AS_STPCPY
+	lea	6(%rdi), %rax
+# endif
+# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
+	sub	$7, %r8
+	lea	7(%rdi), %rdi
+	jnz	L(StrncpyFillTailWithZero)
+# endif
+	ret
+
+	.p2align 4
+L(Exit8):
+	mov	(%rsi), %rdx
+	mov	%rdx, (%rdi)
+# ifdef USE_AS_STPCPY
+	lea	7(%rdi), %rax
+# endif
+# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
+	sub	$8, %r8
+	lea	8(%rdi), %rdi
+	jnz	L(StrncpyFillTailWithZero)
+# endif
+	ret
+
+	.p2align 4
+L(Exit9):
+	mov	(%rsi), %rcx
+	mov	%dh, 8(%rdi)
+	mov	%rcx, (%rdi)
+# ifdef USE_AS_STPCPY
+	lea	8(%rdi), %rax
+# endif
+# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
+	sub	$9, %r8
+	lea	9(%rdi), %rdi
+	jnz	L(StrncpyFillTailWithZero)
+# endif
+	ret
+
+	.p2align 4
+L(Exit10):
+	mov	(%rsi), %rcx
+	mov	8(%rsi), %dx
+	mov	%rcx, (%rdi)
+	mov	%dx, 8(%rdi)
+# ifdef USE_AS_STPCPY
+	lea	9(%rdi), %rax
+# endif
+# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
+	sub	$10, %r8
+	lea	10(%rdi), %rdi
+	jnz	L(StrncpyFillTailWithZero)
+# endif
+	ret
+
+	.p2align 4
+L(Exit11):
+	mov	(%rsi), %rcx
+	mov	7(%rsi), %edx
+	mov	%rcx, (%rdi)
+	mov	%edx, 7(%rdi)
+# ifdef USE_AS_STPCPY
+	lea	10(%rdi), %rax
+# endif
+# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
+	sub	$11, %r8
+	lea	11(%rdi), %rdi
+	jnz	L(StrncpyFillTailWithZero)
+# endif
+	ret
+
+	.p2align 4
+L(Exit12):
+	mov	(%rsi), %rcx
+	mov	8(%rsi), %edx
+	mov	%rcx, (%rdi)
+	mov	%edx, 8(%rdi)
+# ifdef USE_AS_STPCPY
+	lea	11(%rdi), %rax
+# endif
+# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
+	sub	$12, %r8
+	lea	12(%rdi), %rdi
+	jnz	L(StrncpyFillTailWithZero)
+# endif
+	ret
+
+	.p2align 4
+L(Exit13):
+	mov	(%rsi), %rcx
+	mov	5(%rsi), %rdx
+	mov	%rcx, (%rdi)
+	mov	%rdx, 5(%rdi)
+# ifdef USE_AS_STPCPY
+	lea	12(%rdi), %rax
+# endif
+# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
+	sub	$13, %r8
+	lea	13(%rdi), %rdi
+	jnz	L(StrncpyFillTailWithZero)
+# endif
+	ret
+
+	.p2align 4
+L(Exit14):
+	mov	(%rsi), %rcx
+	mov	6(%rsi), %rdx
+	mov	%rcx, (%rdi)
+	mov	%rdx, 6(%rdi)
+# ifdef USE_AS_STPCPY
+	lea	13(%rdi), %rax
+# endif
+# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
+	sub	$14, %r8
+	lea	14(%rdi), %rdi
+	jnz	L(StrncpyFillTailWithZero)
+# endif
+	ret
+
+	.p2align 4
+L(Exit15):
+	mov	(%rsi), %rcx
+	mov	7(%rsi), %rdx
+	mov	%rcx, (%rdi)
+	mov	%rdx, 7(%rdi)
+# ifdef USE_AS_STPCPY
+	lea	14(%rdi), %rax
+# endif
+# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
+	sub	$15, %r8
+	lea	15(%rdi), %rdi
+	jnz	L(StrncpyFillTailWithZero)
+# endif
+	ret
+
+	.p2align 4
+L(Exit16):
+	movdqu	(%rsi), %xmm0
+	movdqu	%xmm0, (%rdi)
+# ifdef USE_AS_STPCPY
+	lea	15(%rdi), %rax
+# endif
+# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
+	sub	$16, %r8
+	lea	16(%rdi), %rdi
+	jnz	L(StrncpyFillTailWithZero)
+# endif
+	ret
+
+	.p2align 4
+L(Exit17):
+	movdqu	(%rsi), %xmm0
+	movdqu	%xmm0, (%rdi)
+	mov	%dh, 16(%rdi)
+# ifdef USE_AS_STPCPY
+	lea	16(%rdi), %rax
+# endif
+# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
+	sub	$17, %r8
+	lea	17(%rdi), %rdi
+	jnz	L(StrncpyFillTailWithZero)
+# endif
+	ret
+
+	.p2align 4
+L(Exit18):
+	movdqu	(%rsi), %xmm0
+	mov	16(%rsi), %cx
+	movdqu	%xmm0, (%rdi)
+	mov	%cx, 16(%rdi)
+# ifdef USE_AS_STPCPY
+	lea	17(%rdi), %rax
+# endif
+# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
+	sub	$18, %r8
+	lea	18(%rdi), %rdi
+	jnz	L(StrncpyFillTailWithZero)
+# endif
+	ret
+
+	.p2align 4
+L(Exit19):
+	movdqu	(%rsi), %xmm0
+	mov	15(%rsi), %ecx
+	movdqu	%xmm0, (%rdi)
+	mov	%ecx, 15(%rdi)
+# ifdef USE_AS_STPCPY
+	lea	18(%rdi), %rax
+# endif
+# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
+	sub	$19, %r8
+	lea	19(%rdi), %rdi
+	jnz	L(StrncpyFillTailWithZero)
+# endif
+	ret
+
+	.p2align 4
+L(Exit20):
+	movdqu	(%rsi), %xmm0
+	mov	16(%rsi), %ecx
+	movdqu	%xmm0, (%rdi)
+	mov	%ecx, 16(%rdi)
+# ifdef USE_AS_STPCPY
+	lea	19(%rdi), %rax
+# endif
+# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
+	sub	$20, %r8
+	lea	20(%rdi), %rdi
+	jnz	L(StrncpyFillTailWithZero)
+# endif
+	ret
+
+	.p2align 4
+L(Exit21):
+	movdqu	(%rsi), %xmm0
+	mov	16(%rsi), %ecx
+	movdqu	%xmm0, (%rdi)
+	mov	%ecx, 16(%rdi)
+	mov	%dh, 20(%rdi)
+# ifdef USE_AS_STPCPY
+	lea	20(%rdi), %rax
+# endif
+# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
+	sub	$21, %r8
+	lea	21(%rdi), %rdi
+	jnz	L(StrncpyFillTailWithZero)
+# endif
+	ret
+
+	.p2align 4
+L(Exit22):
+	movdqu	(%rsi), %xmm0
+	mov	14(%rsi), %rcx
+	movdqu	%xmm0, (%rdi)
+	mov	%rcx, 14(%rdi)
+# ifdef USE_AS_STPCPY
+	lea	21(%rdi), %rax
+# endif
+# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
+	sub	$22, %r8
+	lea	22(%rdi), %rdi
+	jnz	L(StrncpyFillTailWithZero)
+# endif
+	ret
+
+	.p2align 4
+L(Exit23):
+	movdqu	(%rsi), %xmm0
+	mov	15(%rsi), %rcx
+	movdqu	%xmm0, (%rdi)
+	mov	%rcx, 15(%rdi)
+# ifdef USE_AS_STPCPY
+	lea	22(%rdi), %rax
+# endif
+# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
+	sub	$23, %r8
+	lea	23(%rdi), %rdi
+	jnz	L(StrncpyFillTailWithZero)
+# endif
+	ret
+
+	.p2align 4
+L(Exit24):
+	movdqu	(%rsi), %xmm0
+	mov	16(%rsi), %rcx
+	movdqu	%xmm0, (%rdi)
+	mov	%rcx, 16(%rdi)
+# ifdef USE_AS_STPCPY
+	lea	23(%rdi), %rax
+# endif
+# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
+	sub	$24, %r8
+	lea	24(%rdi), %rdi
+	jnz	L(StrncpyFillTailWithZero)
+# endif
+	ret
+
+	.p2align 4
+L(Exit25):
+	movdqu	(%rsi), %xmm0
+	mov	16(%rsi), %rcx
+	movdqu	%xmm0, (%rdi)
+	mov	%rcx, 16(%rdi)
+	mov	%dh, 24(%rdi)
+# ifdef USE_AS_STPCPY
+	lea	24(%rdi), %rax
+# endif
+# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
+	sub	$25, %r8
+	lea	25(%rdi), %rdi
+	jnz	L(StrncpyFillTailWithZero)
+# endif
+	ret
+
+	.p2align 4
+L(Exit26):
+	movdqu	(%rsi), %xmm0
+	mov	16(%rsi), %rdx
+	mov	24(%rsi), %cx
+	movdqu	%xmm0, (%rdi)
+	mov	%rdx, 16(%rdi)
+	mov	%cx, 24(%rdi)
+# ifdef USE_AS_STPCPY
+	lea	25(%rdi), %rax
+# endif
+# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
+	sub	$26, %r8
+	lea	26(%rdi), %rdi
+	jnz	L(StrncpyFillTailWithZero)
+# endif
+	ret
+
+	.p2align 4
+L(Exit27):
+	movdqu	(%rsi), %xmm0
+	mov	16(%rsi), %rdx
+	mov	23(%rsi), %ecx
+	movdqu	%xmm0, (%rdi)
+	mov	%rdx, 16(%rdi)
+	mov	%ecx, 23(%rdi)
+# ifdef USE_AS_STPCPY
+	lea	26(%rdi), %rax
+# endif
+# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
+	sub	$27, %r8
+	lea	27(%rdi), %rdi
+	jnz	L(StrncpyFillTailWithZero)
+# endif
+	ret
+
+	.p2align 4
+L(Exit28):
+	movdqu	(%rsi), %xmm0
+	mov	16(%rsi), %rdx
+	mov	24(%rsi), %ecx
+	movdqu	%xmm0, (%rdi)
+	mov	%rdx, 16(%rdi)
+	mov	%ecx, 24(%rdi)
+# ifdef USE_AS_STPCPY
+	lea	27(%rdi), %rax
+# endif
+# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
+	sub	$28, %r8
+	lea	28(%rdi), %rdi
+	jnz	L(StrncpyFillTailWithZero)
+# endif
+	ret
+
+	.p2align 4
+L(Exit29):
+	movdqu	(%rsi), %xmm0
+	movdqu	13(%rsi), %xmm2
+	movdqu	%xmm0, (%rdi)
+	movdqu	%xmm2, 13(%rdi)
+# ifdef USE_AS_STPCPY
+	lea	28(%rdi), %rax
+# endif
+# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
+	sub	$29, %r8
+	lea	29(%rdi), %rdi
+	jnz	L(StrncpyFillTailWithZero)
+# endif
+	ret
+
+	.p2align 4
+L(Exit30):
+	movdqu	(%rsi), %xmm0
+	movdqu	14(%rsi), %xmm2
+	movdqu	%xmm0, (%rdi)
+	movdqu	%xmm2, 14(%rdi)
+# ifdef USE_AS_STPCPY
+	lea	29(%rdi), %rax
+# endif
+# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
+	sub	$30, %r8
+	lea	30(%rdi), %rdi
+	jnz	L(StrncpyFillTailWithZero)
+# endif
+	ret
+
+	.p2align 4
+L(Exit31):
+	movdqu	(%rsi), %xmm0
+	movdqu	15(%rsi), %xmm2
+	movdqu	%xmm0, (%rdi)
+	movdqu	%xmm2, 15(%rdi)
+# ifdef USE_AS_STPCPY
+	lea	30(%rdi), %rax
+# endif
+# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
+	sub	$31, %r8
+	lea	31(%rdi), %rdi
+	jnz	L(StrncpyFillTailWithZero)
+# endif
+	ret
+
+	.p2align 4
+L(Exit32):
+	movdqu	(%rsi), %xmm0
+	movdqu	16(%rsi), %xmm2
+	movdqu	%xmm0, (%rdi)
+	movdqu	%xmm2, 16(%rdi)
+# ifdef USE_AS_STPCPY
+	lea	31(%rdi), %rax
+# endif
+# if defined USE_AS_STRNCPY && !defined USE_AS_STRCAT
+	sub	$32, %r8
+	lea	32(%rdi), %rdi
+	jnz	L(StrncpyFillTailWithZero)
+# endif
+	ret
+
+# ifdef USE_AS_STRNCPY
+
+	.p2align 4
+L(StrncpyExit0):
+#  ifdef USE_AS_STPCPY
+	mov	%rdi, %rax
+#  endif
+#  ifdef USE_AS_STRCAT
+	xor	%ch, %ch
+	movb	%ch, (%rdi)
+#  endif
+	ret
+
+	.p2align 4
+L(StrncpyExit1):
+	mov	(%rsi), %dl
+	mov	%dl, (%rdi)
+#  ifdef USE_AS_STPCPY
+	lea	1(%rdi), %rax
+#  endif
+#  ifdef USE_AS_STRCAT
+	xor	%ch, %ch
+	movb	%ch, 1(%rdi)
+#  endif
+	ret
+
+	.p2align 4
+L(StrncpyExit2):
+	mov	(%rsi), %dx
+	mov	%dx, (%rdi)
+#  ifdef USE_AS_STPCPY
+	lea	2(%rdi), %rax
+#  endif
+#  ifdef USE_AS_STRCAT
+	xor	%ch, %ch
+	movb	%ch, 2(%rdi)
+#  endif
+	ret
+
+	.p2align 4
+L(StrncpyExit3):
+	mov	(%rsi), %cx
+	mov	2(%rsi), %dl
+	mov	%cx, (%rdi)
+	mov	%dl, 2(%rdi)
+#  ifdef USE_AS_STPCPY
+	lea	3(%rdi), %rax
+#  endif
+#  ifdef USE_AS_STRCAT
+	xor	%ch, %ch
+	movb	%ch, 3(%rdi)
+#  endif
+	ret
+
+	.p2align 4
+L(StrncpyExit4):
+	mov	(%rsi), %edx
+	mov	%edx, (%rdi)
+#  ifdef USE_AS_STPCPY
+	lea	4(%rdi), %rax
+#  endif
+#  ifdef USE_AS_STRCAT
+	xor	%ch, %ch
+	movb	%ch, 4(%rdi)
+#  endif
+	ret
+
+	.p2align 4
+L(StrncpyExit5):
+	mov	(%rsi), %ecx
+	mov	4(%rsi), %dl
+	mov	%ecx, (%rdi)
+	mov	%dl, 4(%rdi)
+#  ifdef USE_AS_STPCPY
+	lea	5(%rdi), %rax
+#  endif
+#  ifdef USE_AS_STRCAT
+	xor	%ch, %ch
+	movb	%ch, 5(%rdi)
+#  endif
+	ret
+
+	.p2align 4
+L(StrncpyExit6):
+	mov	(%rsi), %ecx
+	mov	4(%rsi), %dx
+	mov	%ecx, (%rdi)
+	mov	%dx, 4(%rdi)
+#  ifdef USE_AS_STPCPY
+	lea	6(%rdi), %rax
+#  endif
+#  ifdef USE_AS_STRCAT
+	xor	%ch, %ch
+	movb	%ch, 6(%rdi)
+#  endif
+	ret
+
+	.p2align 4
+L(StrncpyExit7):
+	mov	(%rsi), %ecx
+	mov	3(%rsi), %edx
+	mov	%ecx, (%rdi)
+	mov	%edx, 3(%rdi)
+#  ifdef USE_AS_STPCPY
+	lea	7(%rdi), %rax
+#  endif
+#  ifdef USE_AS_STRCAT
+	xor	%ch, %ch
+	movb	%ch, 7(%rdi)
+#  endif
+	ret
+
+	.p2align 4
+L(StrncpyExit8):
+	mov	(%rsi), %rdx
+	mov	%rdx, (%rdi)
+#  ifdef USE_AS_STPCPY
+	lea	8(%rdi), %rax
+#  endif
+#  ifdef USE_AS_STRCAT
+	xor	%ch, %ch
+	movb	%ch, 8(%rdi)
+#  endif
+	ret
+
+	.p2align 4
+L(StrncpyExit9):
+	mov	(%rsi), %rcx
+	mov	8(%rsi), %dl
+	mov	%rcx, (%rdi)
+	mov	%dl, 8(%rdi)
+#  ifdef USE_AS_STPCPY
+	lea	9(%rdi), %rax
+#  endif
+#  ifdef USE_AS_STRCAT
+	xor	%ch, %ch
+	movb	%ch, 9(%rdi)
+#  endif
+	ret
+
+	.p2align 4
+L(StrncpyExit10):
+	mov	(%rsi), %rcx
+	mov	8(%rsi), %dx
+	mov	%rcx, (%rdi)
+	mov	%dx, 8(%rdi)
+#  ifdef USE_AS_STPCPY
+	lea	10(%rdi), %rax
+#  endif
+#  ifdef USE_AS_STRCAT
+	xor	%ch, %ch
+	movb	%ch, 10(%rdi)
+#  endif
+	ret
+
+	.p2align 4
+L(StrncpyExit11):
+	mov	(%rsi), %rcx
+	mov	7(%rsi), %edx
+	mov	%rcx, (%rdi)
+	mov	%edx, 7(%rdi)
+#  ifdef USE_AS_STPCPY
+	lea	11(%rdi), %rax
+#  endif
+#  ifdef USE_AS_STRCAT
+	xor	%ch, %ch
+	movb	%ch, 11(%rdi)
+#  endif
+	ret
+
+	.p2align 4
+L(StrncpyExit12):
+	mov	(%rsi), %rcx
+	mov	8(%rsi), %edx
+	mov	%rcx, (%rdi)
+	mov	%edx, 8(%rdi)
+#  ifdef USE_AS_STPCPY
+	lea	12(%rdi), %rax
+#  endif
+#  ifdef USE_AS_STRCAT
+	xor	%ch, %ch
+	movb	%ch, 12(%rdi)
+#  endif
+	ret
+
+	.p2align 4
+L(StrncpyExit13):
+	mov	(%rsi), %rcx
+	mov	5(%rsi), %rdx
+	mov	%rcx, (%rdi)
+	mov	%rdx, 5(%rdi)
+#  ifdef USE_AS_STPCPY
+	lea	13(%rdi), %rax
+#  endif
+#  ifdef USE_AS_STRCAT
+	xor	%ch, %ch
+	movb	%ch, 13(%rdi)
+#  endif
+	ret
+
+	.p2align 4
+L(StrncpyExit14):
+	mov	(%rsi), %rcx
+	mov	6(%rsi), %rdx
+	mov	%rcx, (%rdi)
+	mov	%rdx, 6(%rdi)
+#  ifdef USE_AS_STPCPY
+	lea	14(%rdi), %rax
+#  endif
+#  ifdef USE_AS_STRCAT
+	xor	%ch, %ch
+	movb	%ch, 14(%rdi)
+#  endif
+	ret
+
+	.p2align 4
+L(StrncpyExit15):
+	mov	(%rsi), %rcx
+	mov	7(%rsi), %rdx
+	mov	%rcx, (%rdi)
+	mov	%rdx, 7(%rdi)
+#  ifdef USE_AS_STPCPY
+	lea	15(%rdi), %rax
+#  endif
+#  ifdef USE_AS_STRCAT
+	xor	%ch, %ch
+	movb	%ch, 15(%rdi)
+#  endif
+	ret
+
+	.p2align 4
+L(StrncpyExit16):
+	movdqu	(%rsi), %xmm0
+	movdqu	%xmm0, (%rdi)
+#  ifdef USE_AS_STPCPY
+	lea	16(%rdi), %rax
+#  endif
+#  ifdef USE_AS_STRCAT
+	xor	%ch, %ch
+	movb	%ch, 16(%rdi)
+#  endif
+	ret
+
+	.p2align 4
+L(StrncpyExit17):
+	movdqu	(%rsi), %xmm0
+	mov	16(%rsi), %cl
+	movdqu	%xmm0, (%rdi)
+	mov	%cl, 16(%rdi)
+#  ifdef USE_AS_STPCPY
+	lea	17(%rdi), %rax
+#  endif
+#  ifdef USE_AS_STRCAT
+	xor	%ch, %ch
+	movb	%ch, 17(%rdi)
+#  endif
+	ret
+
+	.p2align 4
+L(StrncpyExit18):
+	movdqu	(%rsi), %xmm0
+	mov	16(%rsi), %cx
+	movdqu	%xmm0, (%rdi)
+	mov	%cx, 16(%rdi)
+#  ifdef USE_AS_STPCPY
+	lea	18(%rdi), %rax
+#  endif
+#  ifdef USE_AS_STRCAT
+	xor	%ch, %ch
+	movb	%ch, 18(%rdi)
+#  endif
+	ret
+
+	.p2align 4
+L(StrncpyExit19):
+	movdqu	(%rsi), %xmm0
+	mov	15(%rsi), %ecx
+	movdqu	%xmm0, (%rdi)
+	mov	%ecx, 15(%rdi)
+#  ifdef USE_AS_STPCPY
+	lea	19(%rdi), %rax
+#  endif
+#  ifdef USE_AS_STRCAT
+	xor	%ch, %ch
+	movb	%ch, 19(%rdi)
+#  endif
+	ret
+
+	.p2align 4
+L(StrncpyExit20):
+	movdqu	(%rsi), %xmm0
+	mov	16(%rsi), %ecx
+	movdqu	%xmm0, (%rdi)
+	mov	%ecx, 16(%rdi)
+#  ifdef USE_AS_STPCPY
+	lea	20(%rdi), %rax
+#  endif
+#  ifdef USE_AS_STRCAT
+	xor	%ch, %ch
+	movb	%ch, 20(%rdi)
+#  endif
+	ret
+
+	.p2align 4
+L(StrncpyExit21):
+	movdqu	(%rsi), %xmm0
+	mov	16(%rsi), %ecx
+	mov	20(%rsi), %dl
+	movdqu	%xmm0, (%rdi)
+	mov	%ecx, 16(%rdi)
+	mov	%dl, 20(%rdi)
+#  ifdef USE_AS_STPCPY
+	lea	21(%rdi), %rax
+#  endif
+#  ifdef USE_AS_STRCAT
+	xor	%ch, %ch
+	movb	%ch, 21(%rdi)
+#  endif
+	ret
+
+	.p2align 4
+L(StrncpyExit22):
+	movdqu	(%rsi), %xmm0
+	mov	14(%rsi), %rcx
+	movdqu	%xmm0, (%rdi)
+	mov	%rcx, 14(%rdi)
+#  ifdef USE_AS_STPCPY
+	lea	22(%rdi), %rax
+#  endif
+#  ifdef USE_AS_STRCAT
+	xor	%ch, %ch
+	movb	%ch, 22(%rdi)
+#  endif
+	ret
+
+	.p2align 4
+L(StrncpyExit23):
+	movdqu	(%rsi), %xmm0
+	mov	15(%rsi), %rcx
+	movdqu	%xmm0, (%rdi)
+	mov	%rcx, 15(%rdi)
+#  ifdef USE_AS_STPCPY
+	lea	23(%rdi), %rax
+#  endif
+#  ifdef USE_AS_STRCAT
+	xor	%ch, %ch
+	movb	%ch, 23(%rdi)
+#  endif
+	ret
+
+	.p2align 4
+L(StrncpyExit24):
+	movdqu	(%rsi), %xmm0
+	mov	16(%rsi), %rcx
+	movdqu	%xmm0, (%rdi)
+	mov	%rcx, 16(%rdi)
+#  ifdef USE_AS_STPCPY
+	lea	24(%rdi), %rax
+#  endif
+#  ifdef USE_AS_STRCAT
+	xor	%ch, %ch
+	movb	%ch, 24(%rdi)
+#  endif
+	ret
+
+	.p2align 4
+L(StrncpyExit25):
+	movdqu	(%rsi), %xmm0
+	mov	16(%rsi), %rdx
+	mov	24(%rsi), %cl
+	movdqu	%xmm0, (%rdi)
+	mov	%rdx, 16(%rdi)
+	mov	%cl, 24(%rdi)
+#  ifdef USE_AS_STPCPY
+	lea	25(%rdi), %rax
+#  endif
+#  ifdef USE_AS_STRCAT
+	xor	%ch, %ch
+	movb	%ch, 25(%rdi)
+#  endif
+	ret
+
+	.p2align 4
+L(StrncpyExit26):
+	movdqu	(%rsi), %xmm0
+	mov	16(%rsi), %rdx
+	mov	24(%rsi), %cx
+	movdqu	%xmm0, (%rdi)
+	mov	%rdx, 16(%rdi)
+	mov	%cx, 24(%rdi)
+#  ifdef USE_AS_STPCPY
+	lea	26(%rdi), %rax
+#  endif
+#  ifdef USE_AS_STRCAT
+	xor	%ch, %ch
+	movb	%ch, 26(%rdi)
+#  endif
+	ret
+
+	.p2align 4
+L(StrncpyExit27):
+	movdqu	(%rsi), %xmm0
+	mov	16(%rsi), %rdx
+	mov	23(%rsi), %ecx
+	movdqu	%xmm0, (%rdi)
+	mov	%rdx, 16(%rdi)
+	mov	%ecx, 23(%rdi)
+#  ifdef USE_AS_STPCPY
+	lea	27(%rdi), %rax
+#  endif
+#  ifdef USE_AS_STRCAT
+	xor	%ch, %ch
+	movb	%ch, 27(%rdi)
+#  endif
+	ret
+
+	.p2align 4
+L(StrncpyExit28):
+	movdqu	(%rsi), %xmm0
+	mov	16(%rsi), %rdx
+	mov	24(%rsi), %ecx
+	movdqu	%xmm0, (%rdi)
+	mov	%rdx, 16(%rdi)
+	mov	%ecx, 24(%rdi)
+#  ifdef USE_AS_STPCPY
+	lea	28(%rdi), %rax
+#  endif
+#  ifdef USE_AS_STRCAT
+	xor	%ch, %ch
+	movb	%ch, 28(%rdi)
+#  endif
+	ret
+
+	.p2align 4
+L(StrncpyExit29):
+	movdqu	(%rsi), %xmm0
+	movdqu	13(%rsi), %xmm2
+	movdqu	%xmm0, (%rdi)
+	movdqu	%xmm2, 13(%rdi)
+#  ifdef USE_AS_STPCPY
+	lea	29(%rdi), %rax
+#  endif
+#  ifdef USE_AS_STRCAT
+	xor	%ch, %ch
+	movb	%ch, 29(%rdi)
+#  endif
+	ret
+
+	.p2align 4
+L(StrncpyExit30):
+	movdqu	(%rsi), %xmm0
+	movdqu	14(%rsi), %xmm2
+	movdqu	%xmm0, (%rdi)
+	movdqu	%xmm2, 14(%rdi)
+#  ifdef USE_AS_STPCPY
+	lea	30(%rdi), %rax
+#  endif
+#  ifdef USE_AS_STRCAT
+	xor	%ch, %ch
+	movb	%ch, 30(%rdi)
+#  endif
+	ret
+
+	.p2align 4
+L(StrncpyExit31):
+	movdqu	(%rsi), %xmm0
+	movdqu	15(%rsi), %xmm2
+	movdqu	%xmm0, (%rdi)
+	movdqu	%xmm2, 15(%rdi)
+#  ifdef USE_AS_STPCPY
+	lea	31(%rdi), %rax
+#  endif
+#  ifdef USE_AS_STRCAT
+	xor	%ch, %ch
+	movb	%ch, 31(%rdi)
+#  endif
+	ret
+
+	.p2align 4
+L(StrncpyExit32):
+	movdqu	(%rsi), %xmm0
+	movdqu	16(%rsi), %xmm2
+	movdqu	%xmm0, (%rdi)
+	movdqu	%xmm2, 16(%rdi)
+#  ifdef USE_AS_STPCPY
+	lea	32(%rdi), %rax
+#  endif
+#  ifdef USE_AS_STRCAT
+	xor	%ch, %ch
+	movb	%ch, 32(%rdi)
+#  endif
+	ret
+
+	.p2align 4
+L(StrncpyExit33):
+	movdqu	(%rsi), %xmm0
+	movdqu	16(%rsi), %xmm2
+	mov	32(%rsi), %cl
+	movdqu	%xmm0, (%rdi)
+	movdqu	%xmm2, 16(%rdi)
+	mov	%cl, 32(%rdi)
+#  ifdef USE_AS_STRCAT
+	xor	%ch, %ch
+	movb	%ch, 33(%rdi)
+#  endif
+	ret
+
+#  ifndef USE_AS_STRCAT
+
+	.p2align 4
+L(Fill0):
+	ret
+
+	.p2align 4
+L(Fill1):
+	mov	%dl, (%rdi)
+	ret
+
+	.p2align 4
+L(Fill2):
+	mov	%dx, (%rdi)
+	ret
+
+	.p2align 4
+L(Fill3):
+	mov	%edx, -1(%rdi)
+	ret
+
+	.p2align 4
+L(Fill4):
+	mov	%edx, (%rdi)
+	ret
+
+	.p2align 4
+L(Fill5):
+	mov	%edx, (%rdi)
+	mov	%dl, 4(%rdi)
+	ret
+
+	.p2align 4
+L(Fill6):
+	mov	%edx, (%rdi)
+	mov	%dx, 4(%rdi)
+	ret
+
+	.p2align 4
+L(Fill7):
+	mov	%rdx, -1(%rdi)
+	ret
+
+	.p2align 4
+L(Fill8):
+	mov	%rdx, (%rdi)
+	ret
+
+	.p2align 4
+L(Fill9):
+	mov	%rdx, (%rdi)
+	mov	%dl, 8(%rdi)
+	ret
+
+	.p2align 4
+L(Fill10):
+	mov	%rdx, (%rdi)
+	mov	%dx, 8(%rdi)
+	ret
+
+	.p2align 4
+L(Fill11):
+	mov	%rdx, (%rdi)
+	mov	%edx, 7(%rdi)
+	ret
+
+	.p2align 4
+L(Fill12):
+	mov	%rdx, (%rdi)
+	mov	%edx, 8(%rdi)
+	ret
+
+	.p2align 4
+L(Fill13):
+	mov	%rdx, (%rdi)
+	mov	%rdx, 5(%rdi)
+	ret
+
+	.p2align 4
+L(Fill14):
+	mov	%rdx, (%rdi)
+	mov	%rdx, 6(%rdi)
+	ret
+
+	.p2align 4
+L(Fill15):
+	movdqu	%xmm0, -1(%rdi)
+	ret
+
+	.p2align 4
+L(Fill16):
+	movdqu	%xmm0, (%rdi)
+	ret
+
+	.p2align 4
+L(CopyFrom1To16BytesUnalignedXmm2):
+	movdqu	%xmm2, (%rdi, %rcx)
+
+	.p2align 4
+L(CopyFrom1To16BytesXmmExit):
+	bsf	%rdx, %rdx
+	add	$15, %r8
+	add	%rcx, %rdi
+#   ifdef USE_AS_STPCPY
+	lea	(%rdi, %rdx), %rax
+#   endif
+	sub	%rdx, %r8
+	lea	1(%rdi, %rdx), %rdi
+
+	.p2align 4
+L(StrncpyFillTailWithZero):
+	pxor	%xmm0, %xmm0
+	xor	%rdx, %rdx
+	sub	$16, %r8
+	jbe	L(StrncpyFillExit)
+
+	movdqu	%xmm0, (%rdi)
+	add	$16, %rdi
+
+	mov	%rdi, %rsi
+	and	$0xf, %rsi
+	sub	%rsi, %rdi
+	add	%rsi, %r8
+	sub	$64, %r8
+	jb	L(StrncpyFillLess64)
+
+L(StrncpyFillLoopMovdqa):
+	movdqa	%xmm0, (%rdi)
+	movdqa	%xmm0, 16(%rdi)
+	movdqa	%xmm0, 32(%rdi)
+	movdqa	%xmm0, 48(%rdi)
+	add	$64, %rdi
+	sub	$64, %r8
+	jae	L(StrncpyFillLoopMovdqa)
+
+L(StrncpyFillLess64):
+	add	$32, %r8
+	jl	L(StrncpyFillLess32)
+	movdqa	%xmm0, (%rdi)
+	movdqa	%xmm0, 16(%rdi)
+	add	$32, %rdi
+	sub	$16, %r8
+	jl	L(StrncpyFillExit)
+	movdqa	%xmm0, (%rdi)
+	add	$16, %rdi
+	BRANCH_TO_JMPTBL_ENTRY (L(FillTable), %r8, 4)
+
+L(StrncpyFillLess32):
+	add	$16, %r8
+	jl	L(StrncpyFillExit)
+	movdqa	%xmm0, (%rdi)
+	add	$16, %rdi
+	BRANCH_TO_JMPTBL_ENTRY (L(FillTable), %r8, 4)
+
+L(StrncpyFillExit):
+	add	$16, %r8
+	BRANCH_TO_JMPTBL_ENTRY (L(FillTable), %r8, 4)
+
+/* end of ifndef USE_AS_STRCAT */
+#  endif
+
+	.p2align 4
+L(UnalignedLeaveCase2OrCase3):
+	test	%rdx, %rdx
+	jnz	L(Unaligned64LeaveCase2)
+L(Unaligned64LeaveCase3):
+	lea	64(%r8), %rcx
+	and	$-16, %rcx
+	add	$48, %r8
+	jl	L(CopyFrom1To16BytesCase3)
+	movdqu	%xmm4, (%rdi)
+	sub	$16, %r8
+	jb	L(CopyFrom1To16BytesCase3)
+	movdqu	%xmm5, 16(%rdi)
+	sub	$16, %r8
+	jb	L(CopyFrom1To16BytesCase3)
+	movdqu	%xmm6, 32(%rdi)
+	sub	$16, %r8
+	jb	L(CopyFrom1To16BytesCase3)
+	movdqu	%xmm7, 48(%rdi)
+#  ifdef USE_AS_STPCPY
+	lea	64(%rdi), %rax
+#  endif
+#  ifdef USE_AS_STRCAT
+	xor	%ch, %ch
+	movb	%ch, 64(%rdi)
+#  endif
+	ret
+
+	.p2align 4
+L(Unaligned64LeaveCase2):
+	xor	%rcx, %rcx
+	pcmpeqb	%xmm4, %xmm0
+	pmovmskb %xmm0, %rdx
+	add	$48, %r8
+	jle	L(CopyFrom1To16BytesCase2OrCase3)
+	test	%rdx, %rdx
+#  ifndef USE_AS_STRCAT
+	jnz	L(CopyFrom1To16BytesUnalignedXmm4)
+#  else
+	jnz	L(CopyFrom1To16Bytes)
+#  endif
+	pcmpeqb	%xmm5, %xmm0
+	pmovmskb %xmm0, %rdx
+	movdqu	%xmm4, (%rdi)
+	add	$16, %rcx
+	sub	$16, %r8
+	jbe	L(CopyFrom1To16BytesCase2OrCase3)
+	test	%rdx, %rdx
+#  ifndef USE_AS_STRCAT
+	jnz	L(CopyFrom1To16BytesUnalignedXmm5)
+#  else
+	jnz	L(CopyFrom1To16Bytes)
+#  endif
+
+	pcmpeqb	%xmm6, %xmm0
+	pmovmskb %xmm0, %rdx
+	movdqu	%xmm5, 16(%rdi)
+	add	$16, %rcx
+	sub	$16, %r8
+	jbe	L(CopyFrom1To16BytesCase2OrCase3)
+	test	%rdx, %rdx
+#  ifndef USE_AS_STRCAT
+	jnz	L(CopyFrom1To16BytesUnalignedXmm6)
+#  else
+	jnz	L(CopyFrom1To16Bytes)
+#  endif
+
+	pcmpeqb	%xmm7, %xmm0
+	pmovmskb %xmm0, %rdx
+	movdqu	%xmm6, 32(%rdi)
+	lea	16(%rdi, %rcx), %rdi
+	lea	16(%rsi, %rcx), %rsi
+	bsf	%rdx, %rdx
+	cmp	%r8, %rdx
+	jb	L(CopyFrom1To16BytesExit)
+	BRANCH_TO_JMPTBL_ENTRY (L(ExitStrncpyTable), %r8, 4)
+
+	.p2align 4
+L(ExitZero):
+#  ifndef USE_AS_STRCAT
+	mov	%rdi, %rax
+#  endif
+	ret
+
+# endif
+
+# ifndef USE_AS_STRCAT
+END (STRCPY)
+# else
+END (STRCAT)
+# endif
+	.p2align 4
+	.section .rodata
+L(ExitTable):
+	.int	JMPTBL(L(Exit1), L(ExitTable))
+	.int	JMPTBL(L(Exit2), L(ExitTable))
+	.int	JMPTBL(L(Exit3), L(ExitTable))
+	.int	JMPTBL(L(Exit4), L(ExitTable))
+	.int	JMPTBL(L(Exit5), L(ExitTable))
+	.int	JMPTBL(L(Exit6), L(ExitTable))
+	.int	JMPTBL(L(Exit7), L(ExitTable))
+	.int	JMPTBL(L(Exit8), L(ExitTable))
+	.int	JMPTBL(L(Exit9), L(ExitTable))
+	.int	JMPTBL(L(Exit10), L(ExitTable))
+	.int	JMPTBL(L(Exit11), L(ExitTable))
+	.int	JMPTBL(L(Exit12), L(ExitTable))
+	.int	JMPTBL(L(Exit13), L(ExitTable))
+	.int	JMPTBL(L(Exit14), L(ExitTable))
+	.int	JMPTBL(L(Exit15), L(ExitTable))
+	.int	JMPTBL(L(Exit16), L(ExitTable))
+	.int	JMPTBL(L(Exit17), L(ExitTable))
+	.int	JMPTBL(L(Exit18), L(ExitTable))
+	.int	JMPTBL(L(Exit19), L(ExitTable))
+	.int	JMPTBL(L(Exit20), L(ExitTable))
+	.int	JMPTBL(L(Exit21), L(ExitTable))
+	.int	JMPTBL(L(Exit22), L(ExitTable))
+	.int    JMPTBL(L(Exit23), L(ExitTable))
+	.int	JMPTBL(L(Exit24), L(ExitTable))
+	.int	JMPTBL(L(Exit25), L(ExitTable))
+	.int	JMPTBL(L(Exit26), L(ExitTable))
+	.int	JMPTBL(L(Exit27), L(ExitTable))
+	.int	JMPTBL(L(Exit28), L(ExitTable))
+	.int	JMPTBL(L(Exit29), L(ExitTable))
+	.int	JMPTBL(L(Exit30), L(ExitTable))
+	.int	JMPTBL(L(Exit31), L(ExitTable))
+	.int	JMPTBL(L(Exit32), L(ExitTable))
+# ifdef USE_AS_STRNCPY
+L(ExitStrncpyTable):
+	.int	JMPTBL(L(StrncpyExit0), L(ExitStrncpyTable))
+	.int	JMPTBL(L(StrncpyExit1), L(ExitStrncpyTable))
+	.int	JMPTBL(L(StrncpyExit2), L(ExitStrncpyTable))
+	.int	JMPTBL(L(StrncpyExit3), L(ExitStrncpyTable))
+	.int	JMPTBL(L(StrncpyExit4), L(ExitStrncpyTable))
+	.int	JMPTBL(L(StrncpyExit5), L(ExitStrncpyTable))
+	.int	JMPTBL(L(StrncpyExit6), L(ExitStrncpyTable))
+	.int	JMPTBL(L(StrncpyExit7), L(ExitStrncpyTable))
+	.int	JMPTBL(L(StrncpyExit8), L(ExitStrncpyTable))
+	.int	JMPTBL(L(StrncpyExit9), L(ExitStrncpyTable))
+	.int	JMPTBL(L(StrncpyExit10), L(ExitStrncpyTable))
+	.int	JMPTBL(L(StrncpyExit11), L(ExitStrncpyTable))
+	.int	JMPTBL(L(StrncpyExit12), L(ExitStrncpyTable))
+	.int	JMPTBL(L(StrncpyExit13), L(ExitStrncpyTable))
+	.int	JMPTBL(L(StrncpyExit14), L(ExitStrncpyTable))
+	.int	JMPTBL(L(StrncpyExit15), L(ExitStrncpyTable))
+	.int	JMPTBL(L(StrncpyExit16), L(ExitStrncpyTable))
+	.int	JMPTBL(L(StrncpyExit17), L(ExitStrncpyTable))
+	.int	JMPTBL(L(StrncpyExit18), L(ExitStrncpyTable))
+	.int	JMPTBL(L(StrncpyExit19), L(ExitStrncpyTable))
+	.int	JMPTBL(L(StrncpyExit20), L(ExitStrncpyTable))
+	.int	JMPTBL(L(StrncpyExit21), L(ExitStrncpyTable))
+	.int	JMPTBL(L(StrncpyExit22), L(ExitStrncpyTable))
+	.int    JMPTBL(L(StrncpyExit23), L(ExitStrncpyTable))
+	.int	JMPTBL(L(StrncpyExit24), L(ExitStrncpyTable))
+	.int	JMPTBL(L(StrncpyExit25), L(ExitStrncpyTable))
+	.int	JMPTBL(L(StrncpyExit26), L(ExitStrncpyTable))
+	.int	JMPTBL(L(StrncpyExit27), L(ExitStrncpyTable))
+	.int	JMPTBL(L(StrncpyExit28), L(ExitStrncpyTable))
+	.int	JMPTBL(L(StrncpyExit29), L(ExitStrncpyTable))
+	.int	JMPTBL(L(StrncpyExit30), L(ExitStrncpyTable))
+	.int	JMPTBL(L(StrncpyExit31), L(ExitStrncpyTable))
+	.int	JMPTBL(L(StrncpyExit32), L(ExitStrncpyTable))
+	.int	JMPTBL(L(StrncpyExit33), L(ExitStrncpyTable))
+#  ifndef USE_AS_STRCAT
+	.p2align 4
+L(FillTable):
+	.int	JMPTBL(L(Fill0), L(FillTable))
+	.int	JMPTBL(L(Fill1), L(FillTable))
+	.int	JMPTBL(L(Fill2), L(FillTable))
+	.int	JMPTBL(L(Fill3), L(FillTable))
+	.int	JMPTBL(L(Fill4), L(FillTable))
+	.int	JMPTBL(L(Fill5), L(FillTable))
+	.int	JMPTBL(L(Fill6), L(FillTable))
+	.int	JMPTBL(L(Fill7), L(FillTable))
+	.int	JMPTBL(L(Fill8), L(FillTable))
+	.int	JMPTBL(L(Fill9), L(FillTable))
+	.int	JMPTBL(L(Fill10), L(FillTable))
+	.int	JMPTBL(L(Fill11), L(FillTable))
+	.int	JMPTBL(L(Fill12), L(FillTable))
+	.int	JMPTBL(L(Fill13), L(FillTable))
+	.int	JMPTBL(L(Fill14), L(FillTable))
+	.int	JMPTBL(L(Fill15), L(FillTable))
+	.int	JMPTBL(L(Fill16), L(FillTable))
+#  endif
+# endif
+#endif
diff --git a/sysdeps/x86_64/multiarch/strncpy.S b/sysdeps/x86_64/multiarch/strncpy.S
index 6d87a0b..afbd870 100644
--- a/sysdeps/x86_64/multiarch/strncpy.S
+++ b/sysdeps/x86_64/multiarch/strncpy.S
@@ -1,5 +1,85 @@ 
-/* Multiple versions of strncpy
-   All versions must be listed in ifunc-impl-list.c.  */
-#define STRCPY strncpy
+/* Multiple versions of strcpy
+   All versions must be listed in ifunc-impl-list.c.
+   Copyright (C) 2009-2015 Free Software Foundation, Inc.
+   Contributed by Intel Corporation.
+   This file is part of the GNU C Library.
+
+   The GNU C Library is free software; you can redistribute it and/or
+   modify it under the terms of the GNU Lesser General Public
+   License as published by the Free Software Foundation; either
+   version 2.1 of the License, or (at your option) any later version.
+
+   The GNU C Library is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   Lesser General Public License for more details.
+
+   You should have received a copy of the GNU Lesser General Public
+   License along with the GNU C Library; if not, see
+   <http://www.gnu.org/licenses/>.  */
+
+#include <sysdep.h>
+#include <init-arch.h>
+
 #define USE_AS_STRNCPY
-#include "strcpy.S"
+#ifndef STRNCPY
+#define STRNCPY strncpy
+#endif
+
+#ifdef USE_AS_STPCPY
+#  define STRNCPY_SSSE3		__stpncpy_ssse3
+#  define STRNCPY_SSE2		__stpncpy_sse2
+#  define STRNCPY_SSE2_UNALIGNED __stpncpy_sse2_unaligned
+#  define __GI_STRNCPY		__GI_stpncpy
+#  define __GI___STRNCPY		__GI___stpncpy
+#else
+#  define STRNCPY_SSSE3		__strncpy_ssse3
+#  define STRNCPY_SSE2		__strncpy_sse2
+#  define STRNCPY_SSE2_UNALIGNED	__strncpy_sse2_unaligned
+#  define __GI_STRNCPY		__GI_strncpy
+#endif
+
+
+/* Define multiple versions only for the definition in libc.  */
+#if IS_IN (libc)
+	.text
+ENTRY(STRNCPY)
+	.type	STRNCPY, @gnu_indirect_function
+	cmpl	$0, __cpu_features+KIND_OFFSET(%rip)
+	jne	1f
+	call	__init_cpu_features
+1:	leaq	STRNCPY_SSE2_UNALIGNED(%rip), %rax
+	testl	$bit_Fast_Unaligned_Load, __cpu_features+FEATURE_OFFSET+index_Fast_Unaligned_Load(%rip)
+	jnz	2f
+	leaq	STRNCPY_SSE2(%rip), %rax
+	testl	$bit_SSSE3, __cpu_features+CPUID_OFFSET+index_SSSE3(%rip)
+	jz	2f
+	leaq	STRNCPY_SSSE3(%rip), %rax
+2:	ret
+END(STRNCPY)
+
+# undef ENTRY
+# define ENTRY(name) \
+	.type STRNCPY_SSE2, @function; \
+	.align 16; \
+	.globl STRNCPY_SSE2; \
+	.hidden STRNCPY_SSE2; \
+	STRNCPY_SSE2: cfi_startproc; \
+	CALL_MCOUNT
+# undef END
+# define END(name) \
+	cfi_endproc; .size STRNCPY_SSE2, .-STRNCPY_SSE2
+# undef libc_hidden_builtin_def
+/* It doesn't make sense to send libc-internal strcpy calls through a PLT.
+   The speedup we get from using SSSE3 instruction is likely eaten away
+   by the indirect call in the PLT.  */
+# define libc_hidden_builtin_def(name) \
+	.globl __GI_STRNCPY; __GI_STRNCPY = STRNCPY_SSE2
+# undef libc_hidden_def
+# define libc_hidden_def(name) \
+	.globl __GI___STRNCPY; __GI___STRNCPY = STRNCPY_SSE2
+#endif
+
+#ifndef USE_AS_STRNCPY
+#include "../strcpy.S"
+#endif