Message ID | 20150130145656.GA26219@gmail.com
---|---
State | New
On Fri, Jan 30, 2015 at 6:56 AM, H.J. Lu <hjl.tools@gmail.com> wrote:
> On Tue, Jan 06, 2015 at 06:54:50AM -0800, H.J. Lu wrote:
>> On Tue, Jan 6, 2015 at 6:29 AM, Ondřej Bílka <neleai@seznam.cz> wrote:
>> > H. J., in this commit a performance regression slipped through review.
>> >
>> > commit 05f3633da4f9df870d04dd77336e793746e57ed4
>> > Author: Ling Ma <ling.ml@alibaba-inc.com>
>> > Date:   Mon Jul 14 00:02:52 2014 -0400
>> >
>> >     Improve 64bit memcpy performance for Haswell CPU with AVX
>> >     instruction
>> >
>> > I seem to recall that I mentioned something about avx being a typo
>> > that should be avx2, but did not look into it further.
>> >
>> > As I assumed it was avx2-only, I was ok with that and with Haswell-specific
>> > optimizations like using rep movsq.  However, the ifunc checks for avx,
>> > which is bad, as we already know that avx loads/stores are slow on Sandy
>> > Bridge.
>> >
>> > Testing on the affected architectures would also reveal that, especially
>> > AMD Bulldozer, where it is five times slower on the 2kb-16kb range because
>> > movsb is slow; see
>> > http://kam.mff.cuni.cz/~ondra/benchmark_string/fx10/memcpy_profile_avx/results_rand/result.html
>> >
>> > On Sandy Bridge it is only a 20% regression on the same range:
>> > http://kam.mff.cuni.cz/~ondra/benchmark_string/i7_ivy_bridge/memcpy_profile_avx/results_rand/result.html
>> >
>> > The avx loop for 128-2024 bytes is also slower there, so there is no point
>> > in using it.
>> >
>> > What about the following change?
>> >
>> >     * sysdeps/x86_64/multiarch/memcpy.S: Fix performance regression.
>> >
>> > diff --git a/sysdeps/x86_64/multiarch/memcpy.S b/sysdeps/x86_64/multiarch/memcpy.S
>> > index 992e40d..27f89e4 100644
>> > --- a/sysdeps/x86_64/multiarch/memcpy.S
>> > +++ b/sysdeps/x86_64/multiarch/memcpy.S
>> > @@ -32,10 +32,13 @@ ENTRY(__new_memcpy)
>> >  	cmpl	$0, KIND_OFFSET+__cpu_features(%rip)
>> >  	jne	1f
>> >  	call	__init_cpu_features
>> > +#ifdef HAVE_AVX2_SUPPORT
>> >  1:	leaq	__memcpy_avx_unaligned(%rip), %rax
>> > -	testl	$bit_AVX_Usable, __cpu_features+FEATURE_OFFSET+index_AVX_Usable(%rip)
>> > +	testl	$bit_AVX2_Usable, __cpu_features+FEATURE_OFFSET+index_AVX2_Usable(%rip)
>> > +
>> >  	jz	1f
>> >  	ret
>> > +#endif
>> >  1:	leaq	__memcpy_sse2(%rip), %rax
>> >  	testl	$bit_Slow_BSF, __cpu_features+FEATURE_OFFSET+index_Slow_BSF(%rip)
>> >  	jnz	2f
>>
>> Please add a new feature bit, bit_Fast_AVX_Unaligned_Load, and turn it
>> on together with bit_AVX2_Usable.
>>
>
> I know we are in freeze.  But I'd like to fix this regression in 2.21.
> OK for master?

Since this is a serious performance regression, I will check it in
before the end of the day unless I am told otherwise.

> Thanks.
>
> H.J.
> ---
> From 56d25c11b64a97255a115901d136d753c86de24e Mon Sep 17 00:00:00 2001
> From: "H.J. Lu" <hjl.tools@gmail.com>
> Date: Fri, 30 Jan 2015 06:50:20 -0800
> Subject: [PATCH] Use AVX unaligned memcpy only if AVX2 is available
>
> memcpy with unaligned 256-bit AVX register loads/stores is slow on older
> processors like Sandy Bridge.  This patch adds bit_AVX_Fast_Unaligned_Load
> and sets it only when AVX2 is available.
>
> 	[BZ #17801]
> 	* sysdeps/x86_64/multiarch/init-arch.c (__init_cpu_features):
> 	Set the bit_AVX_Fast_Unaligned_Load bit for AVX2.
> 	* sysdeps/x86_64/multiarch/init-arch.h (bit_AVX_Fast_Unaligned_Load):
> 	New.
> 	(index_AVX_Fast_Unaligned_Load): Likewise.
> 	(HAS_AVX_FAST_UNALIGNED_LOAD): Likewise.
> 	* sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Check the
> 	bit_AVX_Fast_Unaligned_Load bit instead of the bit_AVX_Usable bit.
> 	* sysdeps/x86_64/multiarch/memcpy_chk.S (__memcpy_chk): Likewise.
> 	* sysdeps/x86_64/multiarch/mempcpy.S (__mempcpy): Likewise.
> 	* sysdeps/x86_64/multiarch/mempcpy_chk.S (__mempcpy_chk): Likewise.
> 	* sysdeps/x86_64/multiarch/memmove.c (__libc_memmove): Replace
> 	HAS_AVX with HAS_AVX_FAST_UNALIGNED_LOAD.
> 	* sysdeps/x86_64/multiarch/memmove_chk.c (__memmove_chk): Likewise.
> ---
>  ChangeLog                              | 18 ++++++++++++++++++
>  sysdeps/x86_64/multiarch/init-arch.c   |  9 +++++++--
>  sysdeps/x86_64/multiarch/init-arch.h   |  4 ++++
>  sysdeps/x86_64/multiarch/memcpy.S      |  2 +-
>  sysdeps/x86_64/multiarch/memcpy_chk.S  |  2 +-
>  sysdeps/x86_64/multiarch/memmove.c     |  2 +-
>  sysdeps/x86_64/multiarch/memmove_chk.c |  2 +-
>  sysdeps/x86_64/multiarch/mempcpy.S     |  2 +-
>  sysdeps/x86_64/multiarch/mempcpy_chk.S |  2 +-
>  9 files changed, 35 insertions(+), 8 deletions(-)
>
> diff --git a/ChangeLog b/ChangeLog
> index 26f7f3f..a696e39 100644
> --- a/ChangeLog
> +++ b/ChangeLog
> @@ -1,3 +1,21 @@
> +2015-01-30  H.J. Lu  <hongjiu.lu@intel.com>
> +
> +	[BZ #17801]
> +	* sysdeps/x86_64/multiarch/init-arch.c (__init_cpu_features):
> +	Set the bit_AVX_Fast_Unaligned_Load bit for AVX2.
> +	* sysdeps/x86_64/multiarch/init-arch.h (bit_AVX_Fast_Unaligned_Load):
> +	New.
> +	(index_AVX_Fast_Unaligned_Load): Likewise.
> +	(HAS_AVX_FAST_UNALIGNED_LOAD): Likewise.
> +	* sysdeps/x86_64/multiarch/memcpy.S (__new_memcpy): Check the
> +	bit_AVX_Fast_Unaligned_Load bit instead of the bit_AVX_Usable bit.
> +	* sysdeps/x86_64/multiarch/memcpy_chk.S (__memcpy_chk): Likewise.
> +	* sysdeps/x86_64/multiarch/mempcpy.S (__mempcpy): Likewise.
> +	* sysdeps/x86_64/multiarch/mempcpy_chk.S (__mempcpy_chk): Likewise.
> +	* sysdeps/x86_64/multiarch/memmove.c (__libc_memmove): Replace
> +	HAS_AVX with HAS_AVX_FAST_UNALIGNED_LOAD.
> +	* sysdeps/x86_64/multiarch/memmove_chk.c (__memmove_chk): Likewise.
> +
>  2015-01-29  Andreas Schwab  <schwab@suse.de>
>
>  	* sysdeps/nptl/allocrtsig.c: Include <signal.h>.
> diff --git a/sysdeps/x86_64/multiarch/init-arch.c b/sysdeps/x86_64/multiarch/init-arch.c
> index 9299360..7dec218 100644
> --- a/sysdeps/x86_64/multiarch/init-arch.c
> +++ b/sysdeps/x86_64/multiarch/init-arch.c
> @@ -171,9 +171,14 @@ __init_cpu_features (void)
>        /* Determine if AVX is usable.  */
>        if (CPUID_AVX)
>  	__cpu_features.feature[index_AVX_Usable] |= bit_AVX_Usable;
> -      /* Determine if AVX2 is usable.  */
> +#if index_AVX2_Usable != index_AVX_Fast_Unaligned_Load
> +# error index_AVX2_Usable != index_AVX_Fast_Unaligned_Load
> +#endif
> +      /* Determine if AVX2 is usable.  Unaligned load with 256-bit
> +	 AVX registers are faster on processors with AVX2.  */
>        if (CPUID_AVX2)
> -	__cpu_features.feature[index_AVX2_Usable] |= bit_AVX2_Usable;
> +	__cpu_features.feature[index_AVX2_Usable]
> +	  |= bit_AVX2_Usable | bit_AVX_Fast_Unaligned_Load;
>        /* Determine if FMA is usable.  */
>        if (CPUID_FMA)
>  	__cpu_features.feature[index_FMA_Usable] |= bit_FMA_Usable;
> diff --git a/sysdeps/x86_64/multiarch/init-arch.h b/sysdeps/x86_64/multiarch/init-arch.h
> index 55f1c5b..e6b5ba5 100644
> --- a/sysdeps/x86_64/multiarch/init-arch.h
> +++ b/sysdeps/x86_64/multiarch/init-arch.h
> @@ -25,6 +25,7 @@
>  #define bit_FMA4_Usable			(1 << 8)
>  #define bit_Slow_SSE4_2			(1 << 9)
>  #define bit_AVX2_Usable			(1 << 10)
> +#define bit_AVX_Fast_Unaligned_Load	(1 << 11)
>
>  /* CPUID Feature flags.  */
>
> @@ -74,6 +75,7 @@
>  # define index_FMA4_Usable		FEATURE_INDEX_1*FEATURE_SIZE
>  # define index_Slow_SSE4_2		FEATURE_INDEX_1*FEATURE_SIZE
>  # define index_AVX2_Usable		FEATURE_INDEX_1*FEATURE_SIZE
> +# define index_AVX_Fast_Unaligned_Load	FEATURE_INDEX_1*FEATURE_SIZE
>
>  #else	/* __ASSEMBLER__ */
>
> @@ -169,6 +171,7 @@ extern const struct cpu_features *__get_cpu_features (void)
>  # define index_FMA4_Usable		FEATURE_INDEX_1
>  # define index_Slow_SSE4_2		FEATURE_INDEX_1
>  # define index_AVX2_Usable		FEATURE_INDEX_1
> +# define index_AVX_Fast_Unaligned_Load	FEATURE_INDEX_1
>
>  # define HAS_ARCH_FEATURE(name) \
>    ((__get_cpu_features ()->feature[index_##name] & (bit_##name)) != 0)
> @@ -181,5 +184,6 @@ extern const struct cpu_features *__get_cpu_features (void)
>  # define HAS_AVX2			HAS_ARCH_FEATURE (AVX2_Usable)
>  # define HAS_FMA			HAS_ARCH_FEATURE (FMA_Usable)
>  # define HAS_FMA4			HAS_ARCH_FEATURE (FMA4_Usable)
> +# define HAS_AVX_FAST_UNALIGNED_LOAD	HAS_ARCH_FEATURE (AVX_Fast_Unaligned_Load)
>
>  #endif	/* __ASSEMBLER__ */
> diff --git a/sysdeps/x86_64/multiarch/memcpy.S b/sysdeps/x86_64/multiarch/memcpy.S
> index 992e40d..4e18cd3 100644
> --- a/sysdeps/x86_64/multiarch/memcpy.S
> +++ b/sysdeps/x86_64/multiarch/memcpy.S
> @@ -33,7 +33,7 @@ ENTRY(__new_memcpy)
>  	jne	1f
>  	call	__init_cpu_features
>  1:	leaq	__memcpy_avx_unaligned(%rip), %rax
> -	testl	$bit_AVX_Usable, __cpu_features+FEATURE_OFFSET+index_AVX_Usable(%rip)
> +	testl	$bit_AVX_Fast_Unaligned_Load, __cpu_features+FEATURE_OFFSET+index_AVX_Fast_Unaligned_Load(%rip)
>  	jz	1f
>  	ret
>  1:	leaq	__memcpy_sse2(%rip), %rax
> diff --git a/sysdeps/x86_64/multiarch/memcpy_chk.S b/sysdeps/x86_64/multiarch/memcpy_chk.S
> index 5e9cf00..1e756ea 100644
> --- a/sysdeps/x86_64/multiarch/memcpy_chk.S
> +++ b/sysdeps/x86_64/multiarch/memcpy_chk.S
> @@ -39,7 +39,7 @@ ENTRY(__memcpy_chk)
>  	testl	$bit_Fast_Copy_Backward, __cpu_features+FEATURE_OFFSET+index_Fast_Copy_Backward(%rip)
>  	jz	2f
>  	leaq	__memcpy_chk_ssse3_back(%rip), %rax
> -	testl	$bit_AVX_Usable, __cpu_features+FEATURE_OFFSET+index_AVX_Usable(%rip)
> +	testl	$bit_AVX_Fast_Unaligned_Load, __cpu_features+FEATURE_OFFSET+index_AVX_Fast_Unaligned_Load(%rip)
>  	jz	2f
>  	leaq	__memcpy_chk_avx_unaligned(%rip), %rax
>  2:	ret
> diff --git a/sysdeps/x86_64/multiarch/memmove.c b/sysdeps/x86_64/multiarch/memmove.c
> index d93bfd0..dd153a3 100644
> --- a/sysdeps/x86_64/multiarch/memmove.c
> +++ b/sysdeps/x86_64/multiarch/memmove.c
> @@ -49,7 +49,7 @@ extern __typeof (__redirect_memmove) __memmove_avx_unaligned attribute_hidden;
>     ifunc symbol properly.  */
>  extern __typeof (__redirect_memmove) __libc_memmove;
>  libc_ifunc (__libc_memmove,
> -	    HAS_AVX
> +	    HAS_AVX_FAST_UNALIGNED_LOAD
>  	    ? __memmove_avx_unaligned
>  	    : (HAS_SSSE3
>  	       ? (HAS_FAST_COPY_BACKWARD
> diff --git a/sysdeps/x86_64/multiarch/memmove_chk.c b/sysdeps/x86_64/multiarch/memmove_chk.c
> index 743ca2a..8b12d00 100644
> --- a/sysdeps/x86_64/multiarch/memmove_chk.c
> +++ b/sysdeps/x86_64/multiarch/memmove_chk.c
> @@ -30,7 +30,7 @@ extern __typeof (__memmove_chk) __memmove_chk_avx_unaligned attribute_hidden;
>  #include "debug/memmove_chk.c"
>
>  libc_ifunc (__memmove_chk,
> -	    HAS_AVX ? __memmove_chk_avx_unaligned :
> +	    HAS_AVX_FAST_UNALIGNED_LOAD ? __memmove_chk_avx_unaligned :
>  	    (HAS_SSSE3
>  	     ? (HAS_FAST_COPY_BACKWARD
>  	     ? __memmove_chk_ssse3_back : __memmove_chk_ssse3)
> diff --git a/sysdeps/x86_64/multiarch/mempcpy.S b/sysdeps/x86_64/multiarch/mempcpy.S
> index cdf1dab..2eaacdf 100644
> --- a/sysdeps/x86_64/multiarch/mempcpy.S
> +++ b/sysdeps/x86_64/multiarch/mempcpy.S
> @@ -37,7 +37,7 @@ ENTRY(__mempcpy)
>  	testl	$bit_Fast_Copy_Backward, __cpu_features+FEATURE_OFFSET+index_Fast_Copy_Backward(%rip)
>  	jz	2f
>  	leaq	__mempcpy_ssse3_back(%rip), %rax
> -	testl	$bit_AVX_Usable, __cpu_features+FEATURE_OFFSET+index_AVX_Usable(%rip)
> +	testl	$bit_AVX_Fast_Unaligned_Load, __cpu_features+FEATURE_OFFSET+index_AVX_Fast_Unaligned_Load(%rip)
>  	jz	2f
>  	leaq	__mempcpy_avx_unaligned(%rip), %rax
>  2:	ret
> diff --git a/sysdeps/x86_64/multiarch/mempcpy_chk.S b/sysdeps/x86_64/multiarch/mempcpy_chk.S
> index b7f9e89..17b8470 100644
> --- a/sysdeps/x86_64/multiarch/mempcpy_chk.S
> +++ b/sysdeps/x86_64/multiarch/mempcpy_chk.S
> @@ -39,7 +39,7 @@ ENTRY(__mempcpy_chk)
>  	testl	$bit_Fast_Copy_Backward, __cpu_features+FEATURE_OFFSET+index_Fast_Copy_Backward(%rip)
>  	jz	2f
>  	leaq	__mempcpy_chk_ssse3_back(%rip), %rax
> -	testl	$bit_AVX_Usable, __cpu_features+FEATURE_OFFSET+index_AVX_Usable(%rip)
> +	testl	$bit_AVX_Fast_Unaligned_Load, __cpu_features+FEATURE_OFFSET+index_AVX_Fast_Unaligned_Load(%rip)
>  	jz	2f
>  	leaq	__mempcpy_chk_avx_unaligned(%rip), %rax
>  2:	ret
> --
> 2.1.0
>
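For readers less familiar with glibc's multiarch machinery: the files touched above are IFUNC selectors, small resolvers that run once when the symbol is bound and pick a memcpy/memmove implementation based on the bits set by __init_cpu_features. The snippet below is not glibc code; it is a minimal stand-alone sketch of the same dispatch pattern using GCC's generic `ifunc` attribute, with hypothetical `my_*`/`memcpy_dispatch` names and AVX2 support standing in for the patch's bit_AVX_Fast_Unaligned_Load condition.

```c
/* Minimal stand-alone sketch of IFUNC dispatch (assumes GCC on x86-64
   Linux).  All names here are hypothetical; glibc's real selectors are the
   memcpy.S/memmove.c files in the patch above.  */
#include <stdio.h>
#include <string.h>

static void *my_memcpy_avx_unaligned (void *dst, const void *src, size_t n)
{ return memcpy (dst, src, n); }      /* stand-in for the AVX variant */

static void *my_memcpy_sse2 (void *dst, const void *src, size_t n)
{ return memcpy (dst, src, n); }      /* stand-in for the SSE2 variant */

/* Resolver: runs once at symbol-binding time and returns the implementation
   the call should be bound to.  AVX2 support approximates the patch's
   bit_AVX_Fast_Unaligned_Load condition.  */
static void *(*resolve_memcpy (void)) (void *, const void *, size_t)
{
  __builtin_cpu_init ();
  return __builtin_cpu_supports ("avx2")
	 ? my_memcpy_avx_unaligned
	 : my_memcpy_sse2;
}

void *memcpy_dispatch (void *, const void *, size_t)
     __attribute__ ((ifunc ("resolve_memcpy")));

int main (void)
{
  char buf[8];
  memcpy_dispatch (buf, "hello", 6);
  puts (buf);                         /* prints "hello" on either path */
  return 0;
}
```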
On 01/30/2015 03:04 PM, H.J. Lu wrote:
>>>
>>> Please add a new feature bit, bit_Fast_AVX_Unaligned_Load, and turn it
>>> on together with bit_AVX2_Usable.
>>>
>>
>> I know we are in freeze.  But I'd like to fix this regression in 2.21.
>> OK for master?
>
> Since this is a serious performance regression, I will check it in
> before the end of the day unless I am told otherwise.

In the future please TO: me so that I have high visibility on this change
as the release manager.  I'm testing each of the changes to make sure
things are in good shape for the release.

Could you explain in detail why this is needed?

+#if index_AVX2_Usable != index_AVX_Fast_Unaligned_Load
+# error index_AVX2_Usable != index_AVX_Fast_Unaligned_Load
+#endif

Why do they have to be on the same index in the feature array of bits?
I don't see anywhere that checks them both simultaneously.  At the very
least please add a detailed comment why the error condition exists and
how to fix it in the future if another author needs to fix it.

Cheers,
Carlos.
On Fri, Jan 30, 2015 at 10:50 PM, Carlos O'Donell <carlos@redhat.com> wrote:
> On 01/30/2015 03:04 PM, H.J. Lu wrote:
>>>>
>>>> Please add a new feature bit, bit_Fast_AVX_Unaligned_Load, and turn it
>>>> on together with bit_AVX2_Usable.
>>>>
>>>
>>> I know we are in freeze.  But I'd like to fix this regression in 2.21.
>>> OK for master?
>>
>> Since this is a serious performance regression, I will check it in
>> before the end of the day unless I am told otherwise.
>
> In the future please TO: me so that I have high visibility on this change
> as the release manager.  I'm testing each of the changes to make sure
> things are in good shape for the release.
>
> Could you explain in detail why this is needed?
>
> +#if index_AVX2_Usable != index_AVX_Fast_Unaligned_Load
> +# error index_AVX2_Usable != index_AVX_Fast_Unaligned_Load
> +#endif
>
> Why do they have to be on the same index in the feature array of bits?
> I don't see anywhere that checks them both simultaneously.  At the very
> least please add a detailed comment why the error condition exists and
> how to fix it in the future if another author needs to fix it.
>

There are already

  /* Unaligned load versions are faster than SSSE3 on Silvermont.  */
#if index_Fast_Unaligned_Load != index_Prefer_PMINUB_for_stringop
# error index_Fast_Unaligned_Load != index_Prefer_PMINUB_for_stringop
#endif
#if index_Fast_Unaligned_Load != index_Slow_SSE4_2
# error index_Fast_Unaligned_Load != index_Slow_SSE4_2
#endif
	  __cpu_features.feature[index_Fast_Unaligned_Load]
	    |= (bit_Fast_Unaligned_Load
		| bit_Prefer_PMINUB_for_stringop
		| bit_Slow_SSE4_2);

and

#if index_Fast_Rep_String != index_Fast_Copy_Backward
# error index_Fast_Rep_String != index_Fast_Copy_Backward
#endif
#if index_Fast_Rep_String != index_Fast_Unaligned_Load
# error index_Fast_Rep_String != index_Fast_Unaligned_Load
#endif
#if index_Fast_Rep_String != index_Prefer_PMINUB_for_stringop
# error index_Fast_Rep_String != index_Prefer_PMINUB_for_stringop
#endif
	  __cpu_features.feature[index_Fast_Rep_String]
	    |= (bit_Fast_Rep_String
		| bit_Fast_Copy_Backward
		| bit_Fast_Unaligned_Load
		| bit_Prefer_PMINUB_for_stringop);

before this change.  Since we have

extern struct cpu_features
{
  enum cpu_features_kind
    {
      arch_kind_unknown = 0,
      arch_kind_intel,
      arch_kind_amd,
      arch_kind_other
    } kind;
  int max_cpuid;
  struct cpuid_registers
  {
    unsigned int eax;
    unsigned int ebx;
    unsigned int ecx;
    unsigned int edx;
  } cpuid[COMMON_CPUID_INDEX_MAX];
  unsigned int family;
  unsigned int model;
  unsigned int feature[FEATURE_INDEX_MAX];
} __cpu_features attribute_hidden;

each feature element can hold up to 32 features.  We use

# define index_Fast_Rep_String		FEATURE_INDEX_1
# define index_Fast_Copy_Backward	FEATURE_INDEX_1
# define index_Slow_BSF			FEATURE_INDEX_1
# define index_Fast_Unaligned_Load	FEATURE_INDEX_1
# define index_Prefer_PMINUB_for_stringop FEATURE_INDEX_1
# define index_AVX_Usable		FEATURE_INDEX_1
# define index_FMA_Usable		FEATURE_INDEX_1
# define index_FMA4_Usable		FEATURE_INDEX_1
# define index_Slow_SSE4_2		FEATURE_INDEX_1
# define index_AVX2_Usable		FEATURE_INDEX_1
# define index_AVX_Fast_Unaligned_Load	FEATURE_INDEX_1

to indicate which element the feature bit is in, and use a single statement

#if index_AVX2_Usable != index_AVX_Fast_Unaligned_Load
# error index_AVX2_Usable != index_AVX_Fast_Unaligned_Load
#endif
	  /* Determine if AVX2 is usable.  Unaligned load with 256-bit
	     AVX registers are faster on processors with AVX2.  */
	  if (CPUID_AVX2)
	    __cpu_features.feature[index_AVX2_Usable]
	      |= bit_AVX2_Usable | bit_AVX_Fast_Unaligned_Load;

to update 2 features.  It works only if they have the same index_XXX.
We need this check when we update more than one feature bit
in a single statement.
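To make Carlos's question concrete: the combined `|=` picks the array element using only one of the two index_* macros, so the #error guard is what keeps the second bit from silently landing in the wrong word if the two flags ever move to different elements. The stand-alone snippet below (stand-in constants, not glibc code) shows that failure mode with the guard removed.

```c
/* Illustration (not glibc code; constants are stand-ins) of why the indices
   must match when two feature bits are OR-ed in one statement.  */
#include <assert.h>

#define FEATURE_INDEX_1 0
#define FEATURE_INDEX_2 1

#define bit_AVX2_Usable             (1u << 10)
#define bit_AVX_Fast_Unaligned_Load (1u << 11)

#define index_AVX2_Usable             FEATURE_INDEX_1
/* Imagine a future edit moved this flag to the second word...  */
#define index_AVX_Fast_Unaligned_Load FEATURE_INDEX_2

static unsigned int feature[2];

int main (void)
{
  /* ...then the combined update still writes only feature[FEATURE_INDEX_1]: */
  feature[index_AVX2_Usable] |= bit_AVX2_Usable | bit_AVX_Fast_Unaligned_Load;

  /* bit_AVX_Fast_Unaligned_Load ended up in the wrong element, so a reader
     that checks feature[index_AVX_Fast_Unaligned_Load] sees it as clear.
     The #if/#error guard turns this silent bug into a build failure.  */
  assert ((feature[index_AVX_Fast_Unaligned_Load]
	   & bit_AVX_Fast_Unaligned_Load) == 0);
  return 0;
}
```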
"H.J. Lu" <hjl.tools@gmail.com> writes: > #if index_AVX2_Usable != index_AVX_Fast_Unaligned_Load > # error index_AVX2_Usable != index_AVX_Fast_Unaligned_Load > #endif > /* Determine if AVX2 is usable. Unaligned load with 256-bit > AVX registers are faster on processors with AVX2. */ > if (CPUID_AVX2) > __cpu_features.feature[index_AVX2_Usable] > |= bit_AVX2_Usable | bit_AVX_Fast_Unaligned_Load; > > to update 2 features. It works only if they have the same index_XXX. > We need this check when we update more than one feature bit > in a single statement. You can use two statements, and the compiler will be able to combine them. Andreas.