| Message ID | CAMe9rOrcRMnSXViue=mcpFX-1hnQYZ-foqPyfkHRQ-S6oravKg@mail.gmail.com |
| --- | --- |
| State | New |
On Fri, Jun 23, 2017 at 2:00 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
> On Tue, Jun 20, 2017 at 11:16 AM, H.J. Lu <hjl.tools@gmail.com> wrote:
>> On Sat, Jun 17, 2017 at 3:44 AM, Andrew Senkevich
>> <andrew.n.senkevich@gmail.com> wrote:
>>> 2017-06-16 4:15 GMT+02:00 H.J. Lu <hjl.tools@gmail.com>:
>>>> On Thu, Jun 15, 2017 at 5:34 AM, Ondřej Bílka <neleai@seznam.cz> wrote:
>>>>> On Thu, Jun 01, 2017 at 08:45:19AM -0700, H.J. Lu wrote:
>>>>>> Optimize x86-64 memcmp/wmemcmp with AVX2.  It uses vector compare as
>>>>>> much as possible.  It is as fast as SSE4 memcmp for size <= 16 bytes
>>>>>> and up to 2X faster for size > 16 bytes on Haswell and Skylake.  Select
>>>>>> AVX2 memcmp/wmemcmp on AVX2 machines where vzeroupper is preferred and
>>>>>> AVX unaligned load is fast.
>>>>>>
>>>>>> Key features:
>>>>>>
>>>>>> 1. Use overlapping compare to avoid branch.
>>>>>> 2. Use vector compare when size >= 4 bytes for memcmp or size >= 8
>>>>>>    bytes for wmemcmp.
>>>>>> 3. If size is 8 * VEC_SIZE or less, unroll the loop.
>>>>>> 4. Compare 4 * VEC_SIZE at a time with the aligned first memory area.
>>>>>> 5. Use 2 vector compares when size is 2 * VEC_SIZE or less.
>>>>>> 6. Use 4 vector compares when size is 4 * VEC_SIZE or less.
>>>>>> 7. Use 8 vector compares when size is 8 * VEC_SIZE or less.
>>>>>>
>>>>>> Any comments?
>>>>>>
>>>>> I have some comments; it is similar to one of my previous patches.
>>>>>
>>>>>> +	cmpq	$(VEC_SIZE * 2), %rdx
>>>>>> +	ja	L(more_2x_vec)
>>>>>> +
>>>>> This is an unnecessary branch; it is likely that there is a difference
>>>>> in the first 16 bytes regardless of size.  Move the test about sizes...
>>>>>> +L(last_2x_vec):
>>>>>> +	/* From VEC to 2 * VEC.  No branch when size == VEC_SIZE.  */
>>>>>> +	vmovdqu	(%rsi), %ymm2
>>>>>> +	VPCMPEQ	(%rdi), %ymm2, %ymm2
>>>>>> +	vpmovmskb %ymm2, %eax
>>>>>> +	subl	$VEC_MASK, %eax
>>>>>> +	jnz	L(first_vec)
>>>>> here.
>>>>>
>>>>
>>>> If we do that, the size check will be redundant from
>>>>
>>>>	/* Less than 4 * VEC.  */
>>>>	cmpq	$VEC_SIZE, %rdx
>>>>	jbe	L(last_vec)
>>>>	cmpq	$(VEC_SIZE * 2), %rdx
>>>>	jbe	L(last_2x_vec)
>>>>
>>>> L(last_4x_vec):
>>>>
>>>> Of course, we can duplicate these blocks to avoid the size check.
>>>>>
>>>>>> +L(first_vec):
>>>>>> +	/* A byte or int32 is different within 16 or 32 bytes.  */
>>>>>> +	bsfl	%eax, %ecx
>>>>>> +# ifdef USE_AS_WMEMCMP
>>>>>> +	xorl	%eax, %eax
>>>>>> +	movl	(%rdi, %rcx), %edx
>>>>>> +	cmpl	(%rsi, %rcx), %edx
>>>>>> +L(wmemcmp_return):
>>>>>> +	setl	%al
>>>>>> +	negl	%eax
>>>>>> +	orl	$1, %eax
>>>>>> +# else
>>>>>> +	movzbl	(%rdi, %rcx), %eax
>>>>>> +	movzbl	(%rsi, %rcx), %edx
>>>>>> +	sub	%edx, %eax
>>>>>> +# endif
>>>>>> +	VZEROUPPER
>>>>>> +	ret
>>>>>> +
>>>>>
>>>>> Loading bytes depending on the result of bsf is slow; an alternative is
>>>>> to find that from the vector tests.  I could avoid it using tests like
>>>>> this, but I didn't measure performance/test it yet.
>>>>>
>>>>> vmovdqu (%rdi), %ymm3
>>>>>
>>>>> VPCMPGTQ %ymm2, %ymm3, %ymm4
>>>>> VPCMPGTQ %ymm3, %ymm2, %ymm5
>>>>> vpmovmskb %ymm4, %eax
>>>>> vpmovmskb %ymm5, %edx
>>>>> neg %eax
>>>>> neg %edx
>>>>> lzcnt %eax, %eax
>>>>> lzcnt %edx, %edx
>>>>> sub %edx, %eax
>>>>> ret
>>>>
>>>> Andrew, can you give it a try?
>>>
>>> Hi Ondrej, could you send a patch with your proposal?
>>> I have tried with the following change and got many test-memcmp wrong results:
>>
>> We can't use VPCMPGT for memcmp since it performs signed
>> comparison, but memcmp requires unsigned comparison.
>>
>> H.J.
>
> Here is a patch.  Any comments?
> I will check it in this week.
>
> --
> H.J.
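The signed-vs-unsigned point above is easy to reproduce in plain C.  The
following minimal standalone sketch is illustrative only (it is not code from
the thread); it shows why a VPCMPGT-style signed byte compare orders 0x80 and
0x01 the opposite way from memcmp, which compares unsigned chars:

#include <stdio.h>
#include <string.h>

int
main (void)
{
  unsigned char a[1] = { 0x80 };	/* 128 as unsigned, -128 as signed.  */
  unsigned char b[1] = { 0x01 };

  /* memcmp compares unsigned chars, so 0x80 > 0x01 and the result is > 0.  */
  int r_memcmp = memcmp (a, b, 1);
  /* A signed byte compare sees -128 < 1 and gives the opposite sign.  */
  int r_signed = (signed char) a[0] - (signed char) b[0];

  printf ("memcmp: %d, signed compare: %d\n", r_memcmp, r_signed);
  return 0;
}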
> ----
> Check the first 32 bytes before checking size when size >= 32 bytes
> to avoid unnecessary branch if the difference is in the first 32 bytes.
> Replace vpmovmskb/subl/jnz with vptest/jnc.
>
> On Haswell, the new version is as fast as the previous one.  On Skylake,
> the new version is a little bit faster.
>
> 	* sysdeps/x86_64/multiarch/memcmp-avx2-movbe.S (MEMCMP): Check
> 	the first 32 bytes before checking size when size >= 32 bytes.
> 	Replace vpmovmskb/subl/jnz with vptest/jnc.
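The vpmovmskb/subl/jnz -> vptest/jnc change described above amounts to two
ways of asking "is the VPCMPEQ result all ones, i.e. did all 32 bytes match?".
The following AVX2-intrinsics sketch is illustrative only (the function names
are made up and it is not code from the patch); compile with -mavx2:

#include <immintrin.h>
#include <stdio.h>

/* Old test: extract a 32-bit byte mask and compare it against all ones
   (the VEC_MASK constant in the assembly), i.e. vpmovmskb/subl/jnz.  */
static int
all_equal_movemask (const void *s1, const void *s2)
{
  __m256i a = _mm256_loadu_si256 ((const __m256i *) s1);
  __m256i b = _mm256_loadu_si256 ((const __m256i *) s2);
  __m256i eq = _mm256_cmpeq_epi8 (a, b);	/* VPCMPEQ */
  return _mm256_movemask_epi8 (eq) == -1;	/* vpmovmskb; subl $VEC_MASK */
}

/* New test: vptest against an all-ones register.  CF is set only when the
   comparison result is all ones, so jnc branches to the mismatch path.  */
static int
all_equal_vptest (const void *s1, const void *s2)
{
  __m256i a = _mm256_loadu_si256 ((const __m256i *) s1);
  __m256i b = _mm256_loadu_si256 ((const __m256i *) s2);
  __m256i eq = _mm256_cmpeq_epi8 (a, b);	/* VPCMPEQ */
  __m256i ones = _mm256_set1_epi8 (-1);		/* VPCMPEQ %ymm0, %ymm0, %ymm0 */
  return _mm256_testc_si256 (eq, ones);		/* vptest %ymm0, %ymm5 */
}

int
main (void)
{
  char x[32] = "0123456789abcdef0123456789abcde";
  char y[32] = "0123456789abcdef0123456789abcde";
  printf ("%d %d\n", all_equal_movemask (x, y), all_equal_vptest (x, y));
  y[20] ^= 1;
  printf ("%d %d\n", all_equal_movemask (x, y), all_equal_vptest (x, y));
  return 0;
}

The vptest form checks the vector result directly from the flags, so the byte
mask never has to be moved into a general-purpose register and compared
against VEC_MASK.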
From ebd5d69692463adc3733716fa1d608970d037bd6 Mon Sep 17 00:00:00 2001
From: "H.J. Lu" <hjl.tools@gmail.com>
Date: Tue, 20 Jun 2017 04:05:28 -0700
Subject: [PATCH] x86-64: Optimize memcmp-avx2-movbe.S for short difference

Check the first 32 bytes before checking size when size >= 32 bytes
to avoid unnecessary branch if the difference is in the first 32 bytes.
Replace vpmovmskb/subl/jnz with vptest/jnc.

On Haswell, the new version is as fast as the previous one.  On Skylake,
the new version is a little bit faster.

	* sysdeps/x86_64/multiarch/memcmp-avx2-movbe.S (MEMCMP): Check
	the first 32 bytes before checking size when size >= 32 bytes.
	Replace vpmovmskb/subl/jnz with vptest/jnc.
---
 sysdeps/x86_64/multiarch/memcmp-avx2-movbe.S | 118 ++++++++++++++-------------
 1 file changed, 62 insertions(+), 56 deletions(-)

diff --git a/sysdeps/x86_64/multiarch/memcmp-avx2-movbe.S b/sysdeps/x86_64/multiarch/memcmp-avx2-movbe.S
index abcc61c..16f4630 100644
--- a/sysdeps/x86_64/multiarch/memcmp-avx2-movbe.S
+++ b/sysdeps/x86_64/multiarch/memcmp-avx2-movbe.S
@@ -62,9 +62,68 @@ ENTRY (MEMCMP)
 # endif
 	cmpq	$VEC_SIZE, %rdx
 	jb	L(less_vec)
+
+	/* From VEC to 2 * VEC.  No branch when size == VEC_SIZE.  */
+	vmovdqu	(%rsi), %ymm2
+	VPCMPEQ	(%rdi), %ymm2, %ymm2
+	vpmovmskb %ymm2, %eax
+	subl	$VEC_MASK, %eax
+	jnz	L(first_vec)
+
 	cmpq	$(VEC_SIZE * 2), %rdx
-	ja	L(more_2x_vec)
+	jbe	L(last_vec)
 
+	VPCMPEQ	%ymm0, %ymm0, %ymm0
+	/* More than 2 * VEC.  */
+	cmpq	$(VEC_SIZE * 8), %rdx
+	ja	L(more_8x_vec)
+	cmpq	$(VEC_SIZE * 4), %rdx
+	jb	L(last_4x_vec)
+
+	/* From 4 * VEC to 8 * VEC, inclusively. */
+	vmovdqu	(%rsi), %ymm1
+	VPCMPEQ	(%rdi), %ymm1, %ymm1
+
+	vmovdqu	VEC_SIZE(%rsi), %ymm2
+	VPCMPEQ	VEC_SIZE(%rdi), %ymm2, %ymm2
+
+	vmovdqu	(VEC_SIZE * 2)(%rsi), %ymm3
+	VPCMPEQ	(VEC_SIZE * 2)(%rdi), %ymm3, %ymm3
+
+	vmovdqu	(VEC_SIZE * 3)(%rsi), %ymm4
+	VPCMPEQ	(VEC_SIZE * 3)(%rdi), %ymm4, %ymm4
+
+	vpand	%ymm1, %ymm2, %ymm5
+	vpand	%ymm3, %ymm4, %ymm6
+	vpand	%ymm5, %ymm6, %ymm5
+
+	vptest	%ymm0, %ymm5
+	jnc	L(4x_vec_end)
+
+	leaq	-(4 * VEC_SIZE)(%rdi, %rdx), %rdi
+	leaq	-(4 * VEC_SIZE)(%rsi, %rdx), %rsi
+	vmovdqu	(%rsi), %ymm1
+	VPCMPEQ	(%rdi), %ymm1, %ymm1
+
+	vmovdqu	VEC_SIZE(%rsi), %ymm2
+	VPCMPEQ	VEC_SIZE(%rdi), %ymm2, %ymm2
+	vpand	%ymm2, %ymm1, %ymm5
+
+	vmovdqu	(VEC_SIZE * 2)(%rsi), %ymm3
+	VPCMPEQ	(VEC_SIZE * 2)(%rdi), %ymm3, %ymm3
+	vpand	%ymm3, %ymm5, %ymm5
+
+	vmovdqu	(VEC_SIZE * 3)(%rsi), %ymm4
+	VPCMPEQ	(VEC_SIZE * 3)(%rdi), %ymm4, %ymm4
+	vpand	%ymm4, %ymm5, %ymm5
+
+	vptest	%ymm0, %ymm5
+	jnc	L(4x_vec_end)
+	xorl	%eax, %eax
+	VZEROUPPER
+	ret
+
+	.p2align 4
 L(last_2x_vec):
 	/* From VEC to 2 * VEC.  No branch when size == VEC_SIZE.  */
 	vmovdqu	(%rsi), %ymm2
@@ -219,58 +278,6 @@ L(between_16_31):
 	ret
 
 	.p2align 4
-L(more_2x_vec):
-	/* More than 2 * VEC.  */
-	cmpq	$(VEC_SIZE * 8), %rdx
-	ja	L(more_8x_vec)
-	cmpq	$(VEC_SIZE * 4), %rdx
-	jb	L(last_4x_vec)
-
-	/* From 4 * VEC to 8 * VEC, inclusively. */
-	vmovdqu	(%rsi), %ymm1
-	VPCMPEQ	(%rdi), %ymm1, %ymm1
-
-	vmovdqu	VEC_SIZE(%rsi), %ymm2
-	VPCMPEQ	VEC_SIZE(%rdi), %ymm2, %ymm2
-
-	vmovdqu	(VEC_SIZE * 2)(%rsi), %ymm3
-	VPCMPEQ	(VEC_SIZE * 2)(%rdi), %ymm3, %ymm3
-
-	vmovdqu	(VEC_SIZE * 3)(%rsi), %ymm4
-	VPCMPEQ	(VEC_SIZE * 3)(%rdi), %ymm4, %ymm4
-
-	vpand	%ymm1, %ymm2, %ymm5
-	vpand	%ymm3, %ymm4, %ymm6
-	vpand	%ymm5, %ymm6, %ymm5
-
-	vpmovmskb %ymm5, %eax
-	subl	$VEC_MASK, %eax
-	jnz	L(4x_vec_end)
-
-	leaq	-(4 * VEC_SIZE)(%rdi, %rdx), %rdi
-	leaq	-(4 * VEC_SIZE)(%rsi, %rdx), %rsi
-	vmovdqu	(%rsi), %ymm1
-	VPCMPEQ	(%rdi), %ymm1, %ymm1
-
-	vmovdqu	VEC_SIZE(%rsi), %ymm2
-	VPCMPEQ	VEC_SIZE(%rdi), %ymm2, %ymm2
-	vpand	%ymm2, %ymm1, %ymm5
-
-	vmovdqu	(VEC_SIZE * 2)(%rsi), %ymm3
-	VPCMPEQ	(VEC_SIZE * 2)(%rdi), %ymm3, %ymm3
-	vpand	%ymm3, %ymm5, %ymm5
-
-	vmovdqu	(VEC_SIZE * 3)(%rsi), %ymm4
-	VPCMPEQ	(VEC_SIZE * 3)(%rdi), %ymm4, %ymm4
-	vpand	%ymm4, %ymm5, %ymm5
-
-	vpmovmskb %ymm5, %eax
-	subl	$VEC_MASK, %eax
-	jnz	L(4x_vec_end)
-	VZEROUPPER
-	ret
-
-	.p2align 4
 L(more_8x_vec):
 	/* More than 8 * VEC.  Check the first VEC.  */
 	vmovdqu	(%rsi), %ymm2
@@ -309,9 +316,8 @@ L(loop_4x_vec):
 	VPCMPEQ	(VEC_SIZE * 3)(%rdi), %ymm4, %ymm4
 	vpand	%ymm4, %ymm5, %ymm5
 
-	vpmovmskb %ymm5, %eax
-	subl	$VEC_MASK, %eax
-	jnz	L(4x_vec_end)
+	vptest	%ymm0, %ymm5
+	jnc	L(4x_vec_end)
 
 	addq	$(VEC_SIZE * 4), %rdi
 	addq	$(VEC_SIZE * 4), %rsi
-- 
2.9.4
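The leaq -(4 * VEC_SIZE)(%rdi, %rdx) rebasing in the patch is the
"overlapping compare to avoid branch" idea from the original description.
The following C sketch covers only the equality test and is illustrative, not
code from the patch; it assumes VEC_SIZE == 32 and ignores the sign of the
result, which the real routine also computes.  Compile with -mavx2:

#include <immintrin.h>
#include <stddef.h>

#define VEC_SIZE 32

/* Nonzero when the 4 * VEC_SIZE bytes at s1 and s2 are identical: four
   VPCMPEQ results are combined with vpand and checked with vptest.  */
static int
equal_4x_vec (const unsigned char *s1, const unsigned char *s2)
{
  __m256i m = _mm256_set1_epi8 (-1);
  for (int i = 0; i < 4; i++)
    {
      __m256i a = _mm256_loadu_si256 ((const __m256i *) (s1 + i * VEC_SIZE));
      __m256i b = _mm256_loadu_si256 ((const __m256i *) (s2 + i * VEC_SIZE));
      m = _mm256_and_si256 (m, _mm256_cmpeq_epi8 (a, b));
    }
  return _mm256_testc_si256 (m, _mm256_set1_epi8 (-1));
}

/* Handle 4 * VEC_SIZE < n <= 8 * VEC_SIZE without a scalar tail: check the
   first 4 * VEC_SIZE bytes, then rebase to the last 4 * VEC_SIZE bytes.  The
   two regions overlap when n < 8 * VEC_SIZE, so no extra size branches are
   needed.  */
int
equal_4x_to_8x_vec (const unsigned char *s1, const unsigned char *s2, size_t n)
{
  if (!equal_4x_vec (s1, s2))
    return 0;
  return equal_4x_vec (s1 + n - 4 * VEC_SIZE, s2 + n - 4 * VEC_SIZE);
}

For example, equal_4x_to_8x_vec (p, q, 200) covers a 200-byte buffer with two
128-byte passes at offsets 0 and 72 that overlap by 56 bytes, so no byte loop
or additional size checks are needed for the tail.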