From patchwork Fri Jun 23 21:00:52 2017
X-Patchwork-Submitter: "H.J. Lu"
X-Patchwork-Id: 780277
From: "H.J. Lu"
Date: Fri, 23 Jun 2017 14:00:52 -0700
Subject: Re: [PATCH] x86-64: Add memcmp/wmemcmp optimized with AVX2
To: Andrew Senkevich
Cc: Ondřej Bílka, GNU C Library
References: <20170601154519.GB14526@lucon.org>
 <20170615123415.GA22917@domone.kolej.mff.cuni.cz>

On Tue, Jun 20, 2017 at 11:16 AM, H.J. Lu wrote:
> On Sat, Jun 17, 2017 at 3:44 AM, Andrew Senkevich wrote:
>> 2017-06-16 4:15 GMT+02:00 H.J. Lu:
>>> On Thu, Jun 15, 2017 at 5:34 AM, Ondřej Bílka wrote:
>>>> On Thu, Jun 01, 2017 at 08:45:19AM -0700, H.J. Lu wrote:
>>>>> Optimize x86-64 memcmp/wmemcmp with AVX2.  It uses vector compare as
>>>>> much as possible.  It is as fast as SSE4 memcmp for size <= 16 bytes
>>>>> and up to 2X faster for size > 16 bytes on Haswell and Skylake.  Select
>>>>> AVX2 memcmp/wmemcmp on AVX2 machines where vzeroupper is preferred and
>>>>> AVX unaligned load is fast.
>>>>>
>>>>> Key features:
>>>>>
>>>>> 1. Use overlapping compare to avoid branch.
>>>>> 2. Use vector compare when size >= 4 bytes for memcmp or size >= 8
>>>>>    bytes for wmemcmp.
>>>>> 3. If size is 8 * VEC_SIZE or less, unroll the loop.
>>>>> 4. Compare 4 * VEC_SIZE at a time with the aligned first memory area.
>>>>> 5. Use 2 vector compares when size is 2 * VEC_SIZE or less.
>>>>> 6. Use 4 vector compares when size is 4 * VEC_SIZE or less.
>>>>> 7. Use 8 vector compares when size is 8 * VEC_SIZE or less.
>>>>>
>>>>> Any comments?
>>>>>
>>>> I have some comments; it's similar to one of my previous patches.
>>>>
>>>>> +        cmpq    $(VEC_SIZE * 2), %rdx
>>>>> +        ja      L(more_2x_vec)
>>>>> +
>>>> This is an unnecessary branch; it's likely that there is a difference in
>>>> the first 16 bytes regardless of size.  Move the size test...
>>>>> +L(last_2x_vec):
>>>>> +        /* From VEC to 2 * VEC.  No branch when size == VEC_SIZE.  */
>>>>> +        vmovdqu (%rsi), %ymm2
>>>>> +        VPCMPEQ (%rdi), %ymm2, %ymm2
>>>>> +        vpmovmskb %ymm2, %eax
>>>>> +        subl    $VEC_MASK, %eax
>>>>> +        jnz     L(first_vec)
>>>> here.
>>>>
>>>
>>> If we do that, the size check will be redundant from
>>>
>>>         /* Less than 4 * VEC.  */
>>>         cmpq    $VEC_SIZE, %rdx
>>>         jbe     L(last_vec)
>>>         cmpq    $(VEC_SIZE * 2), %rdx
>>>         jbe     L(last_2x_vec)
>>>
>>> L(last_4x_vec):
>>>
>>> Of course, we can duplicate these blocks to avoid the size check.
>>>
>>>>
>>>>> +L(first_vec):
>>>>> +        /* A byte or int32 is different within 16 or 32 bytes.  */
>>>>> +        bsfl    %eax, %ecx
>>>>> +# ifdef USE_AS_WMEMCMP
>>>>> +        xorl    %eax, %eax
>>>>> +        movl    (%rdi, %rcx), %edx
>>>>> +        cmpl    (%rsi, %rcx), %edx
>>>>> +L(wmemcmp_return):
>>>>> +        setl    %al
>>>>> +        negl    %eax
>>>>> +        orl     $1, %eax
>>>>> +# else
>>>>> +        movzbl  (%rdi, %rcx), %eax
>>>>> +        movzbl  (%rsi, %rcx), %edx
>>>>> +        sub     %edx, %eax
>>>>> +# endif
>>>>> +        VZEROUPPER
>>>>> +        ret
>>>>> +
>>>>
>>>> Loading bytes depending on the result of bsf is slow; an alternative is
>>>> to find that from vector tests.  I could avoid it using tests like this,
>>>> but I haven't measured performance or tested it yet.
>>>>
>>>> vmovdqu (%rdi), %ymm3
>>>>
>>>> VPCMPGTQ %ymm2, %ymm3, %ymm4
>>>> VPCMPGTQ %ymm3, %ymm2, %ymm5
>>>> vpmovmskb %ymm4, %eax
>>>> vpmovmskb %ymm5, %edx
>>>> neg %eax
>>>> neg %edx
>>>> lzcnt %eax, %eax
>>>> lzcnt %edx, %edx
>>>> sub %edx, %eax
>>>> ret
>>>
>>> Andrew, can you give it a try?
>>
>> Hi Ondrej, could you send a patch with your proposal?
>> I have tried the following change and got many wrong results from
>> test-memcmp:
>
> We can't use VPCMPGT for memcmp since it performs signed
> comparison, but memcmp requires unsigned comparison.
>
> H.J.

Here is a patch.  Any comments?
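To make the signed-vs-unsigned point concrete, here is a small standalone C
sketch (illustration only, not part of the patch; it uses AVX2 intrinsics
rather than the .S code).  _mm256_cmpgt_epi8 (VPCMPGTB) treats bytes as
signed, so a byte like 0x80, which memcmp must rank above 0x10, compares as
smaller.  Flipping the sign bit of both inputs is one common workaround, but
the patch below simply stays with VPCMPEQ.

/* Standalone sketch (not from the patch): VPCMPGTB compares bytes as
   signed values, so 0x80 ("large" for memcmp) looks smaller than 0x10.  */
#include <stdio.h>
#include <immintrin.h>

int
main (void)
{
  __m256i a = _mm256_set1_epi8 ((char) 0x80);  /* 128 unsigned, -128 signed */
  __m256i b = _mm256_set1_epi8 (0x10);         /*  16 unsigned,   16 signed */

  /* Signed compare (vpcmpgtb): claims a > b is false in every byte.  */
  int signed_gt = _mm256_movemask_epi8 (_mm256_cmpgt_epi8 (a, b));

  /* XOR both inputs with 0x80 to turn it into an unsigned compare.  */
  __m256i bias = _mm256_set1_epi8 ((char) 0x80);
  int unsigned_gt = _mm256_movemask_epi8
    (_mm256_cmpgt_epi8 (_mm256_xor_si256 (a, bias),
                        _mm256_xor_si256 (b, bias)));

  printf ("signed mask %#x, unsigned mask %#x\n", signed_gt, unsigned_gt);
  /* Prints: signed mask 0, unsigned mask 0xffffffff.  */
  return 0;
}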
From ebd5d69692463adc3733716fa1d608970d037bd6 Mon Sep 17 00:00:00 2001
From: "H.J. Lu"
Date: Tue, 20 Jun 2017 04:05:28 -0700
Subject: [PATCH] x86-64: Optimize memcmp-avx2-movbe.S for short difference

Check the first 32 bytes before checking size when size >= 32 bytes to
avoid an unnecessary branch if the difference is in the first 32 bytes.
Replace vpmovmskb/subl/jnz with vptest/jnc.

On Haswell, the new version is as fast as the previous one.  On Skylake,
the new version is a little bit faster.

	* sysdeps/x86_64/multiarch/memcmp-avx2-movbe.S (MEMCMP): Check
	the first 32 bytes before checking size when size >= 32 bytes.
	Replace vpmovmskb/subl/jnz with vptest/jnc.
---
 sysdeps/x86_64/multiarch/memcmp-avx2-movbe.S | 118 ++++++++++++++-------------
 1 file changed, 62 insertions(+), 56 deletions(-)

diff --git a/sysdeps/x86_64/multiarch/memcmp-avx2-movbe.S b/sysdeps/x86_64/multiarch/memcmp-avx2-movbe.S
index abcc61c..16f4630 100644
--- a/sysdeps/x86_64/multiarch/memcmp-avx2-movbe.S
+++ b/sysdeps/x86_64/multiarch/memcmp-avx2-movbe.S
@@ -62,9 +62,68 @@ ENTRY (MEMCMP)
 # endif
         cmpq    $VEC_SIZE, %rdx
         jb      L(less_vec)
+
+        /* From VEC to 2 * VEC.  No branch when size == VEC_SIZE.  */
+        vmovdqu (%rsi), %ymm2
+        VPCMPEQ (%rdi), %ymm2, %ymm2
+        vpmovmskb %ymm2, %eax
+        subl    $VEC_MASK, %eax
+        jnz     L(first_vec)
+
         cmpq    $(VEC_SIZE * 2), %rdx
-        ja      L(more_2x_vec)
+        jbe     L(last_vec)
+
+        VPCMPEQ %ymm0, %ymm0, %ymm0
+        /* More than 2 * VEC.  */
+        cmpq    $(VEC_SIZE * 8), %rdx
+        ja      L(more_8x_vec)
+        cmpq    $(VEC_SIZE * 4), %rdx
+        jb      L(last_4x_vec)
+
+        /* From 4 * VEC to 8 * VEC, inclusively.  */
+        vmovdqu (%rsi), %ymm1
+        VPCMPEQ (%rdi), %ymm1, %ymm1
+
+        vmovdqu VEC_SIZE(%rsi), %ymm2
+        VPCMPEQ VEC_SIZE(%rdi), %ymm2, %ymm2
+
+        vmovdqu (VEC_SIZE * 2)(%rsi), %ymm3
+        VPCMPEQ (VEC_SIZE * 2)(%rdi), %ymm3, %ymm3
+
+        vmovdqu (VEC_SIZE * 3)(%rsi), %ymm4
+        VPCMPEQ (VEC_SIZE * 3)(%rdi), %ymm4, %ymm4
+
+        vpand   %ymm1, %ymm2, %ymm5
+        vpand   %ymm3, %ymm4, %ymm6
+        vpand   %ymm5, %ymm6, %ymm5
+
+        vptest  %ymm0, %ymm5
+        jnc     L(4x_vec_end)
+
+        leaq    -(4 * VEC_SIZE)(%rdi, %rdx), %rdi
+        leaq    -(4 * VEC_SIZE)(%rsi, %rdx), %rsi
+        vmovdqu (%rsi), %ymm1
+        VPCMPEQ (%rdi), %ymm1, %ymm1
+
+        vmovdqu VEC_SIZE(%rsi), %ymm2
+        VPCMPEQ VEC_SIZE(%rdi), %ymm2, %ymm2
+        vpand   %ymm2, %ymm1, %ymm5
+
+        vmovdqu (VEC_SIZE * 2)(%rsi), %ymm3
+        VPCMPEQ (VEC_SIZE * 2)(%rdi), %ymm3, %ymm3
+        vpand   %ymm3, %ymm5, %ymm5
+        vmovdqu (VEC_SIZE * 3)(%rsi), %ymm4
+        VPCMPEQ (VEC_SIZE * 3)(%rdi), %ymm4, %ymm4
+        vpand   %ymm4, %ymm5, %ymm5
+
+        vptest  %ymm0, %ymm5
+        jnc     L(4x_vec_end)
+        xorl    %eax, %eax
+        VZEROUPPER
+        ret
+
+        .p2align 4
 L(last_2x_vec):
         /* From VEC to 2 * VEC.  No branch when size == VEC_SIZE.  */
         vmovdqu (%rsi), %ymm2
@@ -219,58 +278,6 @@ L(between_16_31):
         ret
 
         .p2align 4
-L(more_2x_vec):
-        /* More than 2 * VEC.  */
-        cmpq    $(VEC_SIZE * 8), %rdx
-        ja      L(more_8x_vec)
-        cmpq    $(VEC_SIZE * 4), %rdx
-        jb      L(last_4x_vec)
-
-        /* From 4 * VEC to 8 * VEC, inclusively.  */
-        vmovdqu (%rsi), %ymm1
-        VPCMPEQ (%rdi), %ymm1, %ymm1
-
-        vmovdqu VEC_SIZE(%rsi), %ymm2
-        VPCMPEQ VEC_SIZE(%rdi), %ymm2, %ymm2
-
-        vmovdqu (VEC_SIZE * 2)(%rsi), %ymm3
-        VPCMPEQ (VEC_SIZE * 2)(%rdi), %ymm3, %ymm3
-
-        vmovdqu (VEC_SIZE * 3)(%rsi), %ymm4
-        VPCMPEQ (VEC_SIZE * 3)(%rdi), %ymm4, %ymm4
-
-        vpand   %ymm1, %ymm2, %ymm5
-        vpand   %ymm3, %ymm4, %ymm6
-        vpand   %ymm5, %ymm6, %ymm5
-
-        vpmovmskb %ymm5, %eax
-        subl    $VEC_MASK, %eax
-        jnz     L(4x_vec_end)
-
-        leaq    -(4 * VEC_SIZE)(%rdi, %rdx), %rdi
-        leaq    -(4 * VEC_SIZE)(%rsi, %rdx), %rsi
-        vmovdqu (%rsi), %ymm1
-        VPCMPEQ (%rdi), %ymm1, %ymm1
-
-        vmovdqu VEC_SIZE(%rsi), %ymm2
-        VPCMPEQ VEC_SIZE(%rdi), %ymm2, %ymm2
-        vpand   %ymm2, %ymm1, %ymm5
-
-        vmovdqu (VEC_SIZE * 2)(%rsi), %ymm3
-        VPCMPEQ (VEC_SIZE * 2)(%rdi), %ymm3, %ymm3
-        vpand   %ymm3, %ymm5, %ymm5
-
-        vmovdqu (VEC_SIZE * 3)(%rsi), %ymm4
-        VPCMPEQ (VEC_SIZE * 3)(%rdi), %ymm4, %ymm4
-        vpand   %ymm4, %ymm5, %ymm5
-
-        vpmovmskb %ymm5, %eax
-        subl    $VEC_MASK, %eax
-        jnz     L(4x_vec_end)
-        VZEROUPPER
-        ret
-
-        .p2align 4
 L(more_8x_vec):
         /* More than 8 * VEC.  Check the first VEC.  */
         vmovdqu (%rsi), %ymm2
@@ -309,9 +316,8 @@ L(loop_4x_vec):
         VPCMPEQ (VEC_SIZE * 3)(%rdi), %ymm4, %ymm4
         vpand   %ymm4, %ymm5, %ymm5
 
-        vpmovmskb %ymm5, %eax
-        subl    $VEC_MASK, %eax
-        jnz     L(4x_vec_end)
+        vptest  %ymm0, %ymm5
+        jnc     L(4x_vec_end)
 
         addq    $(VEC_SIZE * 4), %rdi
         addq    $(VEC_SIZE * 4), %rsi
-- 
2.9.4
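For readers following the vptest/jnc change, here is a rough intrinsics
sketch of the two equivalent checks (illustration only, not the glibc code;
the helper names are invented).  After VPCMPEQ and the vpand chain, %ymm5 is
all ones exactly when every compared byte matched; vptest against the
all-ones register %ymm0 sets CF in that case, so jnc branches out on any
mismatch without the vpmovmskb/subl $VEC_MASK/jnz sequence.

/* Rough intrinsics sketch (illustration only, not the glibc code):
   two equivalent ways to test "are these 32-byte blocks equal?".  */
#include <stdbool.h>
#include <immintrin.h>

static bool
equal32_movemask (const void *s1, const void *s2)
{
  __m256i eq = _mm256_cmpeq_epi8 (_mm256_loadu_si256 ((const __m256i *) s1),
                                  _mm256_loadu_si256 ((const __m256i *) s2));
  /* vpmovmskb, then compare the 32-bit mask against all ones (VEC_MASK).  */
  return _mm256_movemask_epi8 (eq) == -1;
}

static bool
equal32_vptest (const void *s1, const void *s2)
{
  __m256i eq = _mm256_cmpeq_epi8 (_mm256_loadu_si256 ((const __m256i *) s1),
                                  _mm256_loadu_si256 ((const __m256i *) s2));
  __m256i ones = _mm256_set1_epi32 (-1);   /* plays the role of %ymm0 */
  /* vptest: CF = ((~eq & ones) == 0), i.e. CF is set iff eq is all ones.  */
  return _mm256_testc_si256 (eq, ones) != 0;
}

Both helpers test the same condition; the vptest form avoids moving the mask
into a general-purpose register and the subtract, which is presumably where
the small Skylake improvement reported above comes from.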
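The "overlapping compare to avoid branch" idea from the original patch
description can also be sketched in plain C (a simplified illustration under
the assumption of a 4-to-8-byte input; the functions are hypothetical, not
the glibc code): compare the first four bytes and the last four bytes,
letting the two loads overlap for sizes below eight, so no branch on the
exact size is needed.  Loading big-endian (what a movbe load does) makes the
unsigned integer comparison agree with memcmp's byte order.

/* Simplified sketch (not the glibc code): handle any size in [4, 8] with
   two overlapping 4-byte compares and no branch on the exact size.  */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

static uint32_t
load32_be (const unsigned char *p)
{
  uint32_t v;
  memcpy (&v, p, 4);
  return __builtin_bswap32 (v);     /* byte-swapping load, like movbe */
}

static int
memcmp_4_to_8 (const unsigned char *s1, const unsigned char *s2, size_t n)
{
  /* First 4 bytes.  */
  uint32_t a = load32_be (s1);
  uint32_t b = load32_be (s2);
  if (a != b)
    return a > b ? 1 : -1;

  /* Last 4 bytes; overlaps the first load whenever n < 8.  */
  a = load32_be (s1 + n - 4);
  b = load32_be (s2 + n - 4);
  if (a != b)
    return a > b ? 1 : -1;

  return 0;
}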