From patchwork Fri Sep 27 22:50:10 2024
X-Patchwork-Submitter: Noah Goldstein
X-Patchwork-Id: 1990448
From: Noah Goldstein
To: libc-alpha@sourceware.org
Cc: goldstein.w.n@gmail.com, hjl.tools@gmail.com
Subject: [PATCH v1] x86/string: Fixup alignment of main loop in str{n}cmp-evex [BZ #32212]
Date: Fri, 27 Sep 2024 15:50:10 -0700
Message-Id: <20240927225010.2846563-1-goldstein.w.n@gmail.com>
X-Mailer: git-send-email 2.34.1

The loop should be aligned to 32 bytes so that it can ideally run out
of the DSB.  This is particularly important on Skylake-Server, where
deficiencies in its DSB implementation make it prone to not being able
to run loops out of the DSB.

For example, running strcmp-evex on a 200MB string:

32-byte aligned loop:
    - 43,399,578,766 idq.dsb_uops
not 32-byte aligned loop:
    - 6,060,139,704 idq.dsb_uops

This results in a 25% performance degradation for the non-aligned
version.

The fix is to just ensure the code layout is such that the loop is
aligned.  (This was previously the case but was accidentally dropped
in 84e7c46df).

Times are reported as the ratio Time_With_Patch / Time_Without_Patch.
Lower is better.  The values reported are the geometric mean of the
ratio across all tests in bench-strcmp and bench-strncmp.

Note this patch is only attempting to improve the Skylake-Server
strcmp for long strings.  The rest of the numbers are only to test for
regressions.

Tigerlake Results Strings <= 512:
    strcmp : 1.026
    strncmp: 0.949

Tigerlake Results Strings > 512:
    strcmp : 0.994
    strncmp: 0.998

Skylake-Server Results Strings <= 512:
    strcmp : 0.945
    strncmp: 0.943

Skylake-Server Results Strings > 512:
    strcmp : 0.778
    strncmp: 1.000

The 2.6% regression on TGL-strcmp is due to slowdowns caused by changes
in alignment of code handling small sizes (mostly in the page-cross
logic).
These should be safe to ignore because 1) we previously only 16-byte
aligned the function, so this behavior is not new and was essentially
up to chance before this patch, and 2) this type of alignment-related
regression on small sizes really only comes up in tight
micro-benchmark loops and is unlikely to have any effect on real-world
performance.

Reviewed-by: H.J. Lu
---
 sysdeps/x86_64/multiarch/strcmp-evex.S | 26 +++++++++++++-------------
 1 file changed, 13 insertions(+), 13 deletions(-)

diff --git a/sysdeps/x86_64/multiarch/strcmp-evex.S b/sysdeps/x86_64/multiarch/strcmp-evex.S
index 06730ab2a1..cea034f394 100644
--- a/sysdeps/x86_64/multiarch/strcmp-evex.S
+++ b/sysdeps/x86_64/multiarch/strcmp-evex.S
@@ -209,7 +209,9 @@
 	   returned.  */
 
 	.section SECTION(.text), "ax", @progbits
-	.align	16
+	/* Align 64 bytes here. This is to get the L(loop) block ideally
+	   aligned for the DSB.  */
+	.align	64
 	.type	STRCMP, @function
 	.globl	STRCMP
 # ifdef USE_AS_STRCASECMP_L
@@ -509,9 +511,7 @@ L(ret4):
 	ret
 # endif
 
-	/* 32 byte align here ensures the main loop is ideally aligned
-	   for DSB.  */
-	.p2align 5
+	.p2align 4,, 4
 L(more_3x_vec):
 	/* Safe to compare 4x vectors.  */
 	VMOVU	(VEC_SIZE)(%rdi), %VMM(0)
@@ -1426,10 +1426,9 @@ L(less_32_till_page):
 L(ret_zero_page_cross_slow_case0):
 	xorl	%eax, %eax
 	ret
-# endif
-
-
+# else
 	.p2align 4,, 10
+# endif
 L(less_16_till_page):
 	cmpl	$((VEC_SIZE - 8) / SIZE_OF_CHAR), %eax
 	ja	L(less_8_till_page)
@@ -1482,8 +1481,12 @@ L(less_16_till_page):
 # endif
 	jmp	L(prepare_loop_aligned)
 
-
-
+# ifndef USE_AS_STRNCMP
+	/* Fits in aligning bytes.  */
+L(ret_zero_4_loop):
+	xorl	%eax, %eax
+	ret
+# endif
 	.p2align 4,, 10
 
 L(less_8_till_page):
@@ -1554,6 +1557,7 @@ L(ret_less_8_wcs):
 
 # ifdef USE_AS_STRNCMP
 	.p2align 4,, 2
+L(ret_zero_4_loop):
 L(ret_zero_page_cross_slow_case1):
 	xorl	%eax, %eax
 	ret
@@ -1586,10 +1590,6 @@ L(less_4_loop):
 	subq	$-(CHAR_PER_VEC * 4), %rdx
 # endif
 	jmp	L(prepare_loop_aligned)
-
-L(ret_zero_4_loop):
-	xorl	%eax, %eax
-	ret
 L(ret_less_4_loop):
 	xorl	%r8d, %eax
 	subl	%r8d, %eax
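
For readers less familiar with the GAS directives involved, below is a
minimal standalone sketch (hypothetical file name, not part of glibc or
of this patch) of the alignment semantics the change relies on:
.align 64 forces the function entry onto a 64-byte boundary so the
offsets of later blocks are predictable, while .p2align N,, MAX only
pads to a 2^N boundary when at most MAX bytes of padding are needed.
The idq.dsb_uops numbers above can be collected with something along
the lines of "perf stat -e idq.dsb_uops <benchmark>".

	/* align-demo.S -- hypothetical standalone example, not part of
	   the patch.  Build and inspect label addresses with:
	     gcc -c align-demo.S && objdump -d align-demo.o  */
	.text

	/* Force a 64-byte (cache-line / DSB-window) boundary at the
	   function entry so later offsets are fixed by code layout.  */
	.align	64
	.globl	align_demo
	.type	align_demo, @function
align_demo:
	xorl	%eax, %eax
	ret

	/* Unconditional padding to a 32-byte boundary: what the old
	   ".p2align 5" before L(more_3x_vec) did.  */
	.p2align 5
block_32B_aligned:
	ret

	/* Conditional padding: align to 16 bytes only if it costs at
	   most 4 bytes of padding, otherwise emit nothing.  This is the
	   ".p2align 4,, 4" form the patch switches to.  */
	.p2align 4,, 4
block_maybe_aligned:
	ret
	.size	align_demo, .-align_demo

With the function entry 64-byte aligned, the main loop lands on a
32-byte boundary by layout alone, so the unconditional 32-byte padding
is no longer needed.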