From patchwork Wed May 23 06:48:02 2018
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Simon Guo <wei.guo.simon@gmail.com>
X-Patchwork-Id: 918802
Return-Path: 
 <linuxppc-dev-bounces+patchwork-incoming=ozlabs.org@lists.ozlabs.org>
X-Original-To: patchwork-incoming@ozlabs.org
Delivered-To: patchwork-incoming@ozlabs.org
Received: from lists.ozlabs.org (lists.ozlabs.org [203.11.71.2])
	(using TLSv1.2 with cipher ADH-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by ozlabs.org (Postfix) with ESMTPS id 40rPcc4Cnwz9s1d
	for <patchwork-incoming@ozlabs.org>;
	Wed, 23 May 2018 17:42:40 +1000 (AEST)
Authentication-Results: ozlabs.org;
	dmarc=fail (p=none dis=none) header.from=gmail.com
Authentication-Results: ozlabs.org;
	dkim=fail reason="signature verification failed" (2048-bit key;
	unprotected) header.d=gmail.com header.i=@gmail.com
	header.b="bX4C6CQO"; dkim-atps=neutral
Received: from lists.ozlabs.org (lists.ozlabs.org [IPv6:2401:3900:2:1::3])
	by lists.ozlabs.org (Postfix) with ESMTP id 40rPcc2dHkzDrq6
	for <patchwork-incoming@ozlabs.org>;
	Wed, 23 May 2018 17:42:40 +1000 (AEST)
Authentication-Results: lists.ozlabs.org;
	dmarc=pass (p=none dis=none) header.from=gmail.com
Authentication-Results: lists.ozlabs.org;
	dkim=fail reason="signature verification failed" (2048-bit key;
	unprotected) header.d=gmail.com header.i=@gmail.com
	header.b="bX4C6CQO"; dkim-atps=neutral
X-Original-To: linuxppc-dev@lists.ozlabs.org
Delivered-To: linuxppc-dev@lists.ozlabs.org
Authentication-Results: lists.ozlabs.org;
	spf=pass (mailfrom) smtp.mailfrom=gmail.com
	(client-ip=2607:f8b0:400e:c05::244; helo=mail-pg0-x244.google.com;
	envelope-from=wei.guo.simon@gmail.com; receiver=<UNKNOWN>)
Authentication-Results: lists.ozlabs.org;
	dmarc=pass (p=none dis=none) header.from=gmail.com
Authentication-Results: lists.ozlabs.org; dkim=pass (2048-bit key;
	unprotected) header.d=gmail.com header.i=@gmail.com
	header.b="bX4C6CQO"; dkim-atps=neutral
Received: from mail-pg0-x244.google.com (mail-pg0-x244.google.com
	[IPv6:2607:f8b0:400e:c05::244])
	(using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128
	bits)) (No client certificate requested)
	by lists.ozlabs.org (Postfix) with ESMTPS id 40rPPh48RXzF0S0
	for <linuxppc-dev@lists.ozlabs.org>;
	Wed, 23 May 2018 17:33:12 +1000 (AEST)
Received: by mail-pg0-x244.google.com with SMTP id w3-v6so9011160pgv.12
	for <linuxppc-dev@lists.ozlabs.org>;
	Wed, 23 May 2018 00:33:12 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025;
	h=from:to:cc:subject:date:message-id:in-reply-to:references;
	bh=mTm89+kn18KWEh5H1Fy1Nj1gpUI11m+z53aOikaQLvs=;
	b=bX4C6CQOY0sTx792u1ehmaYOtyBGW7DzXkZ5Ypm3FykiYN3p5tkjsvnPa8g7fDboMt
	eqB3+lVJ/f6ELO+T3lCCp6hxW1W6dYxJQd0wp3cyYG/8wYaLwo5WmOMmPJgbVlVUJs0C
	VLLiSbgZXcXW/GMhdgXaUazLN7YHUQCNgd65MD7nZa9c9GpPwNuZps/O779zibThtX6f
	nTjh0s1laMc//AkdWlpo+vItbiU0xALDTRj5apx+mXcEh628lm96OylAkGit87JSKKk7
	Ab4AuvX4A+vlv3chyQWW1xYJml9CVsNmwQmh4mSbUYWiEXQtvw/N8ULgQMb5B8z4/7B7
	lMPQ==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
	d=1e100.net; s=20161025;
	h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to
	:references;
	bh=mTm89+kn18KWEh5H1Fy1Nj1gpUI11m+z53aOikaQLvs=;
	b=DJFCxMp7uYMsTwj2r+Wmi4te3odOefmyD6r4Qc6xnHjAXLeU3G5xoh+G1lK+wXADX3
	nbAq0zhX64x1Enzu6shvsIZty7ECZzxFdMWK5Br3tUjjh5ODh699qvb1SojZASJhshra
	DWxezNzlMpV8PRo3ii4i2v6Q1JYtR64w1M8Ic9Xf99C+cmcXJurmXrynDMhZHL49jSXW
	cyrdB5YI/ys/TyvdeFH1SZd4HHaS2VzX49AlXnAadqOOObLIc8W1Ol3/qezXmLhyYAJ9
	O4L70CCRMsciUXClerUcT1JwQaJRVlRiAuZIX1AWlSIdz3sqbdBYpQ7MAfoAoi77ROan
	n+xg==
X-Gm-Message-State: ALKqPweTA+EO+0ADyDYKM6zHpDus8kp8H/ld+G7hjv9J8RaIP2BP+8da
	3zchQAfLGt89uM2RigClT73cKw==
X-Google-Smtp-Source: 
 AB8JxZobDI0aCWoNsda7GZ2oD9t3UGd7wsKfj7NvTUY7990gst/YH8X3HDePW7p7OrMH+rw2CMvciw==
X-Received: by 2002:a62:8785:: with SMTP id
	i127-v6mr1763940pfe.201.1527060790594;
	Wed, 23 May 2018 00:33:10 -0700 (PDT)
Received: from simonLocalRHEL7.cn.ibm.com ([112.73.0.86])
	by smtp.gmail.com with ESMTPSA id
	a7-v6sm28650637pgc.68.2018.05.23.00.33.07
	(version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128);
	Wed, 23 May 2018 00:33:09 -0700 (PDT)
From: wei.guo.simon@gmail.com
To: linuxppc-dev@lists.ozlabs.org
Subject: [PATCH v5 3/4] powerpc/64: add 32 bytes prechecking before using VMX
	optimization on memcmp()
Date: Wed, 23 May 2018 14:48:02 +0800
Message-Id: <1527058083-6998-4-git-send-email-wei.guo.simon@gmail.com>
X-Mailer: git-send-email 1.8.3.1
In-Reply-To: <1527058083-6998-1-git-send-email-wei.guo.simon@gmail.com>
References: <1527058083-6998-1-git-send-email-wei.guo.simon@gmail.com>
X-BeenThere: linuxppc-dev@lists.ozlabs.org
X-Mailman-Version: 2.1.26
Precedence: list
List-Id: Linux on PowerPC Developers Mail List
	<linuxppc-dev.lists.ozlabs.org>
List-Unsubscribe: <https://lists.ozlabs.org/options/linuxppc-dev>,
	<mailto:linuxppc-dev-request@lists.ozlabs.org?subject=unsubscribe>
List-Archive: <http://lists.ozlabs.org/pipermail/linuxppc-dev/>
List-Post: <mailto:linuxppc-dev@lists.ozlabs.org>
List-Help: <mailto:linuxppc-dev-request@lists.ozlabs.org?subject=help>
List-Subscribe: <https://lists.ozlabs.org/listinfo/linuxppc-dev>,
	<mailto:linuxppc-dev-request@lists.ozlabs.org?subject=subscribe>
Cc: "Naveen N.  Rao" <naveen.n.rao@linux.vnet.ibm.com>,
	Simon Guo <wei.guo.simon@gmail.com>, Cyril Bur <cyrilbur@gmail.com>
Errors-To: linuxppc-dev-bounces+patchwork-incoming=ozlabs.org@lists.ozlabs.org
Sender: "Linuxppc-dev"
	<linuxppc-dev-bounces+patchwork-incoming=ozlabs.org@lists.ozlabs.org>

From: Simon Guo <wei.guo.simon@gmail.com>

This patch is based on the previous VMX patch on memcmp().

Optimizing ppc64 memcmp() with VMX instructions carries a penalty: when
the kernel uses VMX instructions, it must save and restore the current
thread's VMX registers. PPC has 32 VMX registers of 128 bits each, which
means 32 x 16 = 512 bytes to store and load.

The main in-kernel user that is sensitive to memcmp() performance is
KSM, which calls memcmp() frequently to merge identical pages. So it
makes sense to look at KSM's usage to see whether any improvement can
be made here.  Cyril Bur pointed out in the following mail that memcmp()
calls from KSM are likely to fail (mismatch) within the first few bytes:
	https://patchwork.ozlabs.org/patch/817322/#1773629
This patch follows up on that observation.

Testing shows that KSM memcmp() calls tend to fail within the first 32
bytes.  More specifically:
    - 76% of cases fail/mismatch before 16 bytes;
    - 83% of cases fail/mismatch before 32 bytes;
    - 84% of cases fail/mismatch before 64 bytes;
Going from 16 to 32 bytes catches another 7% of cases, while going from
32 to 64 bytes adds only 1%, so 32 bytes looks like the best choice for
pre-checking.

Early failure also holds for memcmp() in the non-KSM case. With a
non-typical call load, ~73% of cases fail within the first 32 bytes.

This patch adds a 32-byte pre-check before jumping into VMX operations,
to avoid the unnecessary VMX penalty. It is not limited to the KSM
case. Testing shows a ~20% improvement in average memcmp() execution
time with this patch.

Note that the 32B pre-check is only performed when the compare size is
large enough (>= 4K currently) to allow VMX operations.
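
For readers who do not follow the assembly, the control flow described
above can be sketched in C roughly as follows. This is only an
illustration, not the code in this patch: memcmp_prechk(), cmp_vmx(),
vmx_calls and the VMX_THRESH/PRECHECK_BYTES macros are hypothetical
names, the VMX path is replaced by a plain memcmp() that counts how
often it is reached, and each 8-byte step uses memcmp() instead of a
doubleword load and compare.

```c
#include <stddef.h>
#include <string.h>

#define VMX_THRESH	4096	/* assumed to mirror VMX_OPS_THRES (4K) */
#define PRECHECK_BYTES	32	/* size of the pre-check window */

/* Hypothetical stand-in for the VMX compare path; it only counts how
 * often the expensive path is entered.
 */
static int vmx_calls;

static int cmp_vmx(const unsigned char *a, const unsigned char *b, size_t n)
{
	vmx_calls++;
	return memcmp(a, b, n);
}

static int memcmp_prechk(const void *s1, const void *s2, size_t n)
{
	const unsigned char *a = s1;
	const unsigned char *b = s2;
	int i, ret;

	/* Short compares never use VMX, so no pre-check is needed. */
	if (n < VMX_THRESH)
		return memcmp(a, b, n);

	/* Compare the first 32 bytes in four 8-byte steps and return
	 * on the first mismatch, before any VMX state is touched.
	 */
	for (i = 0; i < PRECHECK_BYTES / 8; i++) {
		ret = memcmp(a, b, 8);
		if (ret)
			return ret;
		a += 8;
		b += 8;
		n -= 8;
	}

	/* The first 32 bytes match: pay the VMX cost for the rest. */
	return cmp_vmx(a, b, n);
}
```

With two 8K buffers that differ at byte 5, the sketch returns from the
pre-check loop and never reaches cmp_vmx(); if they differ only at byte
100, the loop completes and cmp_vmx() is called once.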

The detailed data and analysis are at:
https://github.com/justdoitqd/publicFiles/blob/master/memcmp/README.md

Signed-off-by: Simon Guo <wei.guo.simon@gmail.com>
---
 arch/powerpc/lib/memcmp_64.S | 50 +++++++++++++++++++++++++++++++++++++-------
 1 file changed, 42 insertions(+), 8 deletions(-)
diff --git a/arch/powerpc/lib/memcmp_64.S b/arch/powerpc/lib/memcmp_64.S
index 6303bbf..ee45348 100644
--- a/arch/powerpc/lib/memcmp_64.S
+++ b/arch/powerpc/lib/memcmp_64.S
@@ -403,8 +403,27 @@ _GLOBAL(memcmp)
 #ifdef CONFIG_ALTIVEC
 .Lsameoffset_vmx_cmp:
 	/* Enter with src/dst addrs has the same offset with 8 bytes
-	 * align boundary
+	 * align boundary.
+	 *
+	 * There is an optimization based on the following fact: memcmp()
+	 * is prone to fail early, within the first 32 bytes.
+	 * Before applying VMX instructions, which incur a 32x128-bit
+	 * VMX regs save/restore penalty, we compare the first 32 bytes
+	 * so that we can catch the ~80% of cases that fail there.
 	 */
+
+	li	r0,4
+	mtctr	r0
+.Lsameoffset_prechk_32B_loop:
+	LD	rA,0,r3
+	LD	rB,0,r4
+	cmpld	cr0,rA,rB
+	addi	r3,r3,8
+	addi	r4,r4,8
+	bne     cr0,.LcmpAB_lightweight
+	addi	r5,r5,-8
+	bdnz	.Lsameoffset_prechk_32B_loop
+
 	ENTER_VMX_OPS
 	beq     cr1,.Llong_novmx_cmp
 
@@ -481,13 +500,6 @@ _GLOBAL(memcmp)
 #endif
 
 .Ldiffoffset_8bytes_make_align_start:
-#ifdef CONFIG_ALTIVEC
-	/* only do vmx ops when the size exceeds 4K bytes */
-	cmpdi	cr5,r5,VMX_OPS_THRES
-	bge	cr5,.Ldiffoffset_vmx_cmp
-.Ldiffoffset_novmx_cmp:
-#endif
-
 	/* now try to align s1 with 8 bytes */
 	andi.   r6,r3,0x7
 	rlwinm  r6,r6,3,0,28
@@ -512,6 +524,13 @@ _GLOBAL(memcmp)
 
 .Ldiffoffset_align_s1_8bytes:
 	/* now s1 is aligned with 8 bytes. */
+#ifdef CONFIG_ALTIVEC
+	/* only do vmx ops when the size exceeds 4K bytes */
+	cmpdi	cr5,r5,VMX_OPS_THRES
+	bge	cr5,.Ldiffoffset_vmx_cmp
+.Ldiffoffset_novmx_cmp:
+#endif
+
 	cmpdi   cr5,r5,31
 	ble	cr5,.Lcmp_lt32bytes
 
@@ -523,6 +542,21 @@ _GLOBAL(memcmp)
 
 #ifdef CONFIG_ALTIVEC
 .Ldiffoffset_vmx_cmp:
+	/* perform a 32-byte pre-check before
+	 * enabling VMX operations.
+	 */
+	li	r0,4
+	mtctr	r0
+.Ldiffoffset_prechk_32B_loop:
+	LD	rA,0,r3
+	LD	rB,0,r4
+	cmpld	cr0,rA,rB
+	addi	r3,r3,8
+	addi	r4,r4,8
+	bne     cr0,.LcmpAB_lightweight
+	addi	r5,r5,-8
+	bdnz	.Ldiffoffset_prechk_32B_loop
+
 	ENTER_VMX_OPS
 	beq     cr1,.Ldiffoffset_novmx_cmp