Message ID | 20150106142939.GB5835@domone |
---|---|
State | New |
On Tue, Jan 6, 2015 at 6:29 AM, Ondřej Bílka <neleai@seznam.cz> wrote:
> H.J., in this commit a performance regression slipped through review:
>
> commit 05f3633da4f9df870d04dd77336e793746e57ed4
> Author: Ling Ma <ling.ml@alibaba-inc.com>
> Date:   Mon Jul 14 00:02:52 2014 -0400
>
>     Improve 64bit memcpy performance for Haswell CPU with AVX
>     instruction
>
> I seem to recall mentioning that "avx" was a typo and should be
> "avx2", but I did not look into it further.
>
> As I assumed it was AVX2-only, I was OK with that and with the
> Haswell-specific optimizations like using rep movsq.  However, the
> ifunc checks for AVX, which is bad, as we already know that AVX
> loads/stores are slow on Sandy Bridge.
>
> Testing on the affected architectures would also reveal that,
> especially AMD Bulldozer, where it is five times slower on the
> 2kb-16kb range because movsb is slow; see
> http://kam.mff.cuni.cz/~ondra/benchmark_string/fx10/memcpy_profile_avx/results_rand/result.html
>
> On Sandy Bridge it is only a 20% regression on the same range:
> http://kam.mff.cuni.cz/~ondra/benchmark_string/i7_ivy_bridge/memcpy_profile_avx/results_rand/result.html
>
> Also, the AVX loop for 128-2024 bytes is slower there, so there is no
> point using it.
>
> What about the following change?
>
> 	* sysdeps/x86_64/multiarch/memcpy.S: Fix performance regression.
>
> diff --git a/sysdeps/x86_64/multiarch/memcpy.S b/sysdeps/x86_64/multiarch/memcpy.S
> index 992e40d..27f89e4 100644
> --- a/sysdeps/x86_64/multiarch/memcpy.S
> +++ b/sysdeps/x86_64/multiarch/memcpy.S
> @@ -32,10 +32,13 @@ ENTRY(__new_memcpy)
>  	cmpl	$0, KIND_OFFSET+__cpu_features(%rip)
>  	jne	1f
>  	call	__init_cpu_features
> +#ifdef HAVE_AVX2_SUPPORT
>  1:	leaq	__memcpy_avx_unaligned(%rip), %rax
> -	testl	$bit_AVX_Usable, __cpu_features+FEATURE_OFFSET+index_AVX_Usable(%rip)
> +	testl	$bit_AVX2_Usable, __cpu_features+FEATURE_OFFSET+index_AVX2_Usable(%rip)
> +
>  	jz	1f
>  	ret
> +#endif
>  1:	leaq	__memcpy_sse2(%rip), %rax
>  	testl	$bit_Slow_BSF, __cpu_features+FEATURE_OFFSET+index_Slow_BSF(%rip)
>  	jnz	2f

Please add a new feature bit, bit_Fast_AVX_Unaligned_Load, and turn it
on together with bit_AVX2_Usable.  Thanks.

--
H.J.
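To make the request concrete, here is a minimal self-contained C sketch of the mechanism H.J. is describing: define a bit_Fast_AVX_Unaligned_Load next to bit_AVX2_Usable and turn both on from the same AVX2 check, so the memcpy ifunc can test a single bit that means "unaligned 256-bit AVX loads are fast here" rather than the too-broad bit_AVX_Usable. The flat `feature` word, the chosen bit positions, and the <cpuid.h> probe are illustrative assumptions; glibc's real __init_cpu_features uses its __cpu_features.feature[] array and additionally checks OS support for the YMM state via XGETBV.

```c
/* Sketch only: models the feature-bit scheme from the diff above.
   Bit positions and the cpuid probe are placeholders, not glibc's
   actual init-arch.c.  */
#include <cpuid.h>   /* __get_cpuid_count, bit_AVX2 (leaf 7 EBX bit 5) */
#include <stdio.h>

/* Our own feature bits, distinct from the raw CPUID bits in <cpuid.h>.  */
#define bit_AVX2_Usable			(1 << 0)
#define bit_Fast_AVX_Unaligned_Load	(1 << 1)

static unsigned int feature;	/* stands in for __cpu_features.feature[]  */

static void
init_cpu_features (void)
{
  unsigned int eax, ebx, ecx, edx;

  /* CPUID leaf 7, subleaf 0: EBX bit 5 reports AVX2.  (The real code
     must also verify OS support for the YMM state; omitted here.)  */
  if (__get_cpuid_count (7, 0, &eax, &ebx, &ecx, &edx)
      && (ebx & bit_AVX2))
    /* Turn the new bit on together with bit_AVX2_Usable, as requested:
       unaligned AVX loads are only known fast on AVX2 machines.  */
    feature |= bit_AVX2_Usable | bit_Fast_AVX_Unaligned_Load;
}

int
main (void)
{
  init_cpu_features ();
  /* The ifunc selector would test the new bit, not bit_AVX_Usable.  */
  puts ((feature & bit_Fast_AVX_Unaligned_Load)
	? "__memcpy_avx_unaligned" : "__memcpy_sse2");
  return 0;
}
```

An ifunc testing such a bit leaves __memcpy_avx_unaligned unselected on Sandy Bridge and Bulldozer at run time, which is exactly the behavior the benchmark results above call for, while still picking the AVX path on Haswell.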