From patchwork Wed May 16 08:34:18 2018
X-Patchwork-Submitter: Simon Guo
X-Patchwork-Id: 914265
From: wei.guo.simon@gmail.com
To: linuxppc-dev@lists.ozlabs.org
Subject: [PATCH v4 1/4] powerpc/64: Align bytes before fall back to .Lshort in powerpc64 memcmp()
Date: Wed, 16 May 2018 16:34:18 +0800
Message-Id: <1526459661-17323-2-git-send-email-wei.guo.simon@gmail.com>
In-Reply-To: <1526459661-17323-1-git-send-email-wei.guo.simon@gmail.com>
References: <1526459661-17323-1-git-send-email-wei.guo.simon@gmail.com>
Cc: "Naveen N. Rao", Simon Guo, Cyril Bur

From: Simon Guo

Currently the powerpc64 memcmp() falls back to .Lshort (per-byte
compare mode) if either the src or dst address is not 8-byte aligned.
This can be optimized in two situations:

1) If both addresses have the same offset from an 8-byte boundary:
memcmp() can first compare the unaligned bytes up to the 8-byte
boundary and then compare the remaining 8-byte-aligned content in
.Llong mode.

2) If the src/dst addresses have different offsets from an 8-byte
boundary: memcmp() can align the src address to 8 bytes, increment the
dst address accordingly, then load src with aligned loads and dst with
unaligned loads.

This patch optimizes memcmp() for both situations.

Tested with both little and big endian; the performance results below
were taken on little endian. They cover the case where src/dst have
the same offset (a similar result was observed when src/dst have
different offsets):

(1) 256 bytes
Test with the existing tools/testing/selftests/powerpc/stringloops/memcmp:
- without patch
	29.773018302 seconds time elapsed   ( +- 0.09% )
- with patch
	16.485568173 seconds time elapsed   ( +- 0.02% )
-> There is ~80% improvement

(2) 32 bytes
To observe the performance impact on < 32 bytes, modify
tools/testing/selftests/powerpc/stringloops/memcmp.c as follows:
-------
 #include
 #include "utils.h"

-#define SIZE 256
+#define SIZE 32
 #define ITERATIONS 10000

 int test_memcmp(const void *s1, const void *s2, size_t n);
-------
- without patch
	0.244746482 seconds time elapsed   ( +- 0.36% )
- with patch
	0.215069477 seconds time elapsed   ( +- 0.51% )
-> There is ~13% improvement

(3) 0~8 bytes
To observe the < 8 bytes performance impact, modify
tools/testing/selftests/powerpc/stringloops/memcmp.c as follows:
-------
 #include
 #include "utils.h"

-#define SIZE 256
-#define ITERATIONS 10000
+#define SIZE 8
+#define ITERATIONS 1000000

 int test_memcmp(const void *s1, const void *s2, size_t n);
-------
- without patch
	1.845642503 seconds time elapsed   ( +- 0.12% )
- with patch
	1.849767135 seconds time elapsed   ( +- 0.26% )
-> They are nearly the same (-0.2%)
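To make the assembly easier to follow, below is a minimal C sketch of
the new dispatch logic. The names (memcmp_sketch, cmp_bytewise) are
illustrative only, not part of this patch, and the byte-wise helper
stands in for both optimized paths so the sketch stays self-contained.
C sketches of the two alignment tricks used inside the diff follow
after the patch.
-------
#include <stddef.h>
#include <stdint.h>

/* Byte-wise fallback, the C equivalent of the .Lshort loop. */
static int cmp_bytewise(const unsigned char *a, const unsigned char *b,
			size_t n)
{
	while (n--) {
		int d = *a++ - *b++;
		if (d)
			return d;
	}
	return 0;
}

/* Dispatch sketch. XOR keeps only the bits in which the two addresses
 * differ, so zero low 3 bits mean "same offset within an 8-byte
 * doubleword" even when neither address is itself aligned; the old
 * code OR'ed the addresses, which required both to be fully 8-byte
 * aligned to take the fast path.
 */
int memcmp_sketch(const void *s1, const void *s2, size_t n)
{
	const unsigned char *a = s1, *b = s2;

	if (n == 0)
		return 0;			/* beq cr1,.Lzero */
	if (n <= 7)
		return cmp_bytewise(a, b, n);	/* .Lshort */
	if ((((uintptr_t)a ^ (uintptr_t)b) & 7) == 0)
		return cmp_bytewise(a, b, n);	/* stands in for .Lsameoffset_... */
	return cmp_bytewise(a, b, n);		/* stands in for .Ldiffoffset_... */
}
-------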
Signed-off-by: Simon Guo
---
 arch/powerpc/lib/memcmp_64.S | 143 ++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 136 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/lib/memcmp_64.S b/arch/powerpc/lib/memcmp_64.S
index d75d18b..f20e883 100644
--- a/arch/powerpc/lib/memcmp_64.S
+++ b/arch/powerpc/lib/memcmp_64.S
@@ -24,28 +24,41 @@
 #define rH	r31
 
 #ifdef __LITTLE_ENDIAN__
+#define LH	lhbrx
+#define LW	lwbrx
 #define LD	ldbrx
 #else
+#define LH	lhzx
+#define LW	lwzx
 #define LD	ldx
 #endif
 
+/*
+ * There are 2 categories for memcmp:
+ * 1) src/dst has the same offset from the 8-byte boundary. The handlers
+ *    are named like .Lsameoffset_xxxx
+ * 2) src/dst has a different offset from the 8-byte boundary. The handlers
+ *    are named like .Ldiffoffset_xxxx
+ */
_GLOBAL(memcmp)
 	cmpdi	cr1,r5,0
 
-	/* Use the short loop if both strings are not 8B aligned */
-	or	r6,r3,r4
+	/* Use the short loop if the src/dst addresses do not have
+	 * the same offset from an 8-byte alignment boundary.
+	 */
+	xor	r6,r3,r4
 	andi.	r6,r6,7
 
-	/* Use the short loop if length is less than 32B */
-	cmpdi	cr6,r5,31
+	/* Fall back to the short loop when comparing aligned addresses
+	 * with fewer than 8 bytes.
+	 */
+	cmpdi	cr6,r5,7
 
 	beq	cr1,.Lzero
-	bne	.Lshort
-	bgt	cr6,.Llong
+	bgt	cr6,.Lno_short
 
 .Lshort:
 	mtctr	r5
-
 1:	lbz	rA,0(r3)
 	lbz	rB,0(r4)
 	subf.	rC,rB,rA
@@ -78,11 +91,90 @@ _GLOBAL(memcmp)
 	li	r3,0
 	blr
 
+.Lno_short:
+	dcbt	0,r3
+	dcbt	0,r4
+	bne	.Ldiffoffset_8bytes_make_align_start
+
+
+.Lsameoffset_8bytes_make_align_start:
+	/* attempt to compare bytes not aligned with 8 bytes so that
+	 * the rest of the comparison can run based on 8-byte alignment.
+	 */
+	andi.	r6,r3,7
+
+	/* Try to compare the first double word which is not 8-byte aligned:
+	 * load the first double word at (src & ~7UL) and shift left the
+	 * appropriate bits before the comparison.
+	 */
+	clrlwi	r6,r3,29
+	rlwinm	r6,r6,3,0,28
+	beq	.Lsameoffset_8bytes_aligned
+	clrrdi	r3,r3,3
+	clrrdi	r4,r4,3
+	LD	rA,0,r3
+	LD	rB,0,r4
+	sld	rA,rA,r6
+	sld	rB,rB,r6
+	cmpld	cr0,rA,rB
+	srwi	r6,r6,3
+	bne	cr0,.LcmpAB_lightweight
+	subfic	r6,r6,8
+	subfc.	r5,r6,r5
+	addi	r3,r3,8
+	addi	r4,r4,8
+	beq	.Lzero
+
+.Lsameoffset_8bytes_aligned:
+	/* now we are aligned with 8 bytes.
+	 * Use the .Llong loop if 32 or more bytes are left to compare.
+	 */
+	cmpdi	cr6,r5,31
+	bgt	cr6,.Llong
+
+.Lcmp_lt32bytes:
+	/* compare 1 ~ 32 bytes; at least the r3 addr is 8-byte aligned now */
+	cmpdi	cr5,r5,7
+	srdi	r0,r5,3
+	ble	cr5,.Lcmp_rest_lt8bytes
+
+	/* handle 8 ~ 31 bytes */
+	clrldi	r5,r5,61
+	mtctr	r0
+2:
+	LD	rA,0,r3
+	LD	rB,0,r4
+	cmpld	cr0,rA,rB
+	addi	r3,r3,8
+	addi	r4,r4,8
+	bne	cr0,.LcmpAB_lightweight
+	bdnz	2b
+
+	cmpwi	r5,0
+	beq	.Lzero
+
+.Lcmp_rest_lt8bytes:
+	/* Here we have fewer than 8 bytes left to compare, and at least
+	 * the s1 address is aligned with 8 bytes.
+	 * The next double words are loaded and shifted right by the
+	 * appropriate number of bits.
+	 */
+	subfic	r6,r5,8
+	rlwinm	r6,r6,3,0,28
+	LD	rA,0,r3
+	LD	rB,0,r4
+	srd	rA,rA,r6
+	srd	rB,rB,r6
+	cmpld	cr0,rA,rB
+	bne	cr0,.LcmpAB_lightweight
+	b	.Lzero
+
 .Lnon_zero:
 	mr	r3,rC
 	blr
 
 .Llong:
+	/* At least the s1 addr is aligned with 8 bytes */
 	li	off8,8
 	li	off16,16
 	li	off24,24
@@ -232,4 +324,41 @@ _GLOBAL(memcmp)
 	ld	r28,-32(r1)
 	ld	r27,-40(r1)
 	blr
+
+.LcmpAB_lightweight:	/* skip NV GPRS restore */
+	li	r3,1
+	bgt	cr0,8f
+	li	r3,-1
+8:
+	blr
+
+.Ldiffoffset_8bytes_make_align_start:
+	/* now try to align s1 with 8 bytes */
+	andi.	r6,r3,0x7
+	rlwinm	r6,r6,3,0,28
+	beq	.Ldiffoffset_align_s1_8bytes
+
+	clrrdi	r3,r3,3
+	LD	rA,0,r3
+	LD	rB,0,r4		/* unaligned load */
+	sld	rA,rA,r6
+	srd	rA,rA,r6
+	srd	rB,rB,r6
+	cmpld	cr0,rA,rB
+	srwi	r6,r6,3
+	bne	cr0,.LcmpAB_lightweight
+
+	subfic	r6,r6,8
+	subfc.	r5,r6,r5
+	addi	r3,r3,8
+	add	r4,r4,r6
+
+	beq	.Lzero
+
+.Ldiffoffset_align_s1_8bytes:
+	/* now s1 is aligned with 8 bytes. */
+	cmpdi	cr5,r5,31
+	ble	cr5,.Lcmp_lt32bytes
+	b	.Llong
+
 EXPORT_SYMBOL(memcmp)
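For reference, here is a hedged C rendering of the shift-left trick in
.Lsameoffset_8bytes_make_align_start above. load_be64() is an
illustrative stand-in for the LD macro (ldbrx on little endian, ldx on
big endian): it loads 8 bytes so that an integer comparison ranks them
the way memcmp() does. The sketch assumes n > 7 (the dispatch
guarantees it) and that the rounded-down aligned doubleword is
readable; the load in the patch is safe because an 8-byte aligned
doubleword load never crosses a page boundary.
-------
#include <stddef.h>
#include <stdint.h>

/* Illustrative stand-in for the LD macro: load 8 bytes in big-endian
 * significance order so an integer compare orders them like memcmp(). */
static uint64_t load_be64(const unsigned char *p)
{
	uint64_t v = 0;

	for (int i = 0; i < 8; i++)
		v = (v << 8) | p[i];
	return v;
}

/* C rendering of .Lsameoffset_8bytes_make_align_start: both pointers
 * share the same offset from an 8-byte boundary, so round them down,
 * load one aligned doubleword from each side, and shift left so the
 * bytes that precede the buffers drop out of the comparison. Returns
 * like memcmp() on a mismatch, 0 if the caller should continue with
 * aligned compares (or has consumed all n bytes). */
static int cmp_prefix_same_offset(const unsigned char **s1,
				  const unsigned char **s2, size_t *n)
{
	unsigned int off = (uintptr_t)*s1 & 7;	/* andi. r6,r3,7 */
	unsigned int shift = off * 8;		/* rlwinm r6,r6,3,0,28 */
	const unsigned char *a, *b;
	uint64_t va, vb;

	if (off == 0)
		return 0;	/* beq .Lsameoffset_8bytes_aligned */

	a = (const unsigned char *)((uintptr_t)*s1 & ~(uintptr_t)7);
	b = (const unsigned char *)((uintptr_t)*s2 & ~(uintptr_t)7);

	va = load_be64(a) << shift;	/* sld rA,rA,r6 */
	vb = load_be64(b) << shift;	/* sld rB,rB,r6 */
	if (va != vb)
		return va > vb ? 1 : -1;	/* .LcmpAB_lightweight */

	*n -= 8 - off;	/* subfic r6,r6,8; subfc. r5,r6,r5 */
	*s1 = a + 8;	/* addi r3,r3,8 */
	*s2 = b + 8;	/* addi r4,r4,8 */
	return 0;
}
-------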
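.Lcmp_rest_lt8bytes uses the mirror image of the same trick, shifting
right instead of left so the bytes past the end of the compare region
drop out. A C rendering, reusing load_be64() from the sketch above and
assuming 8 readable bytes at both pointers (the assembly relies on the
aligned s1 load staying within the page):
-------
/* C rendering of .Lcmp_rest_lt8bytes: with 1..7 bytes left and s1
 * 8-byte aligned, load a full doubleword from each side and shift
 * right so only the first n bytes participate in the comparison. */
static int cmp_tail_lt8(const unsigned char *s1, const unsigned char *s2,
			size_t n)
{
	unsigned int shift = (8 - n) * 8;	/* subfic + rlwinm */
	uint64_t va = load_be64(s1) >> shift;	/* srd rA,rA,r6 */
	uint64_t vb = load_be64(s2) >> shift;	/* srd rB,rB,r6 */

	if (va == vb)
		return 0;			/* b .Lzero */
	return va > vb ? 1 : -1;		/* .LcmpAB_lightweight */
}
-------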