From patchwork Fri May 24 17:38:50 2024
X-Patchwork-Submitter: Noah Goldstein
X-Patchwork-Id: 1939050
From: Noah Goldstein <goldstein.w.n@gmail.com>
To: libc-alpha@sourceware.org
Cc: goldstein.w.n@gmail.com, hjl.tools@gmail.com, carlos@systemhalted.org
Subject: [PATCH v2 1/2] x86: Improve large memset perf with non-temporal stores [RHEL-29312]
Date: Fri, 24 May 2024 12:38:50 -0500
Message-Id: <20240524173851.2483952-1-goldstein.w.n@gmail.com>
In-Reply-To: <20240519004347.2759850-1-goldstein.w.n@gmail.com>
References: <20240519004347.2759850-1-goldstein.w.n@gmail.com>

Previously we used `rep stosb` for all medium/large memsets. This is
notably worse than non-temporal stores for large (above a few MBs)
memsets. See:
https://docs.google.com/spreadsheets/d/1opzukzvum4n6-RUVHTGddV6RjAEil4P2uMjjQGLbLcU/edit?usp=sharing
for data comparing different strategies for large memset on ICX and SKX.

Using non-temporal stores can be up to 3x faster on ICX and 2x faster
on SKX. Historically, these numbers would not have been so good
because of the zero-over-zero writeback optimization that `rep stosb`
is able to do, but that optimization has been removed as a potential
side-channel attack, so there is no longer any good reason to rely
solely on `rep stosb` for large memsets. On the flip side,
non-temporal writes can avoid fetching data in their RFO requests,
saving memory bandwidth.

All of the other changes to the file re-organize the code blocks to
maintain "good" alignment given the new code added in the
`L(stosb_local)` case.
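To make the approach concrete, here is a minimal C sketch of the same
idea (one plain store for the unaligned head, streaming stores for the
body, a plain store for the tail, then `sfence`). This is an
illustration, not the glibc code: the real implementation is the
assembly in the diff below and dispatches on a size threshold;
`nt_memset`, the SSE2 16-byte vector width, and the `n >= 16`
assumption are all illustrative.

    #include <emmintrin.h> /* SSE2: _mm_set1_epi8, _mm_stream_si128 */
    #include <stdint.h>
    #include <stddef.h>

    /* Illustrative non-temporal memset.  Assumes n >= 16; in practice
       this path would only be taken above a large size threshold.  */
    static void
    nt_memset (void *dst, int c, size_t n)
    {
      __m128i v = _mm_set1_epi8 ((char) c);
      char *p = dst;
      char *end = p + n;

      /* Head: one unaligned plain store, then round P up to the next
         16-byte boundary.  */
      _mm_storeu_si128 ((__m128i *) p, v);
      p = (char *) (((uintptr_t) p | 15) + 1);

      /* Body: streaming stores write around the cache, avoiding the
         read-for-ownership (RFO) of the destination lines.  */
      while (p + 16 <= end)
        {
          _mm_stream_si128 ((__m128i *) p, v);
          p += 16;
        }

      /* Tail: unaligned plain store over the last 16 bytes (may
         overlap the last streaming store).  */
      _mm_storeu_si128 ((__m128i *) (end - 16), v);

      /* Make the weakly-ordered streaming stores globally visible
         before returning.  */
      _mm_sfence ();
    }

The `((p | 15) + 1)` rounding is the same trick as the
`orq $(VEC_SIZE * 1 - 1), %rdi; incq %rdi` pair in the assembly below.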
The results from running the GLIBC memset benchmarks on TGL-client
for N=20 runs:

Geometric Mean across the suite New / Old EVEX256: 0.979
Geometric Mean across the suite New / Old EVEX512: 0.979
Geometric Mean across the suite New / Old AVX2   : 0.986
Geometric Mean across the suite New / Old SSE2   : 0.979

Most of the cases are essentially unchanged; this is mostly to show
that adding the non-temporal case didn't introduce regressions in the
other cases.

The results on the memset-large benchmark suite on TGL-client for
N=20 runs:

Geometric Mean across the suite New / Old EVEX256: 0.926
Geometric Mean across the suite New / Old EVEX512: 0.925
Geometric Mean across the suite New / Old AVX2   : 0.928
Geometric Mean across the suite New / Old SSE2   : 0.924

So roughly a 7.5% speedup. This is lower than what we see on servers
(likely because clients typically have faster single-core bandwidth,
so saving bandwidth on RFOs is less impactful), but still advantageous.

Full test-suite passes on x86_64 w/ and w/o multiarch.
Reviewed-by: H.J. Lu
---
 .../multiarch/memset-vec-unaligned-erms.S | 149 +++++++++++-------
 1 file changed, 91 insertions(+), 58 deletions(-)

diff --git a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
index 97839a2248..637caadb40 100644
--- a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
+++ b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
@@ -21,10 +21,13 @@
    2. If size is less than VEC, use integer register stores.
    3. If size is from VEC_SIZE to 2 * VEC_SIZE, use 2 VEC stores.
    4. If size is from 2 * VEC_SIZE to 4 * VEC_SIZE, use 4 VEC stores.
-   5. On machines ERMS feature, if size is greater or equal than
-      __x86_rep_stosb_threshold then REP STOSB will be used.
-   6. If size is more to 4 * VEC_SIZE, align to 4 * VEC_SIZE with
-      4 VEC stores and store 4 * VEC at a time until done.  */
+   5. If size is more to 4 * VEC_SIZE, align to 1 * VEC_SIZE with
+      4 VEC stores and store 4 * VEC at a time until done.
+   6. On machines ERMS feature, if size is range
+      [__x86_rep_stosb_threshold, __x86_shared_non_temporal_threshold)
+      then REP STOSB will be used.
+   7. If size >= __x86_shared_non_temporal_threshold, use a
+      non-temporal stores.  */

 #include <sysdep.h>

@@ -147,6 +150,41 @@ L(entry_from_wmemset):
 	VMOVU	%VMM(0), -VEC_SIZE(%rdi,%rdx)
 	VMOVU	%VMM(0), (%rdi)
 	VZEROUPPER_RETURN
+
+	/* If have AVX512 mask instructions put L(less_vec) close to
+	   entry as it doesn't take much space and is likely a hot target.  */
+#ifdef USE_LESS_VEC_MASK_STORE
+	/* Align to ensure the L(less_vec) logic all fits in 1x cache lines.  */
+	.p2align 6,, 47
+	.p2align 4
+L(less_vec):
+L(less_vec_from_wmemset):
+	/* Less than 1 VEC.  */
+# if VEC_SIZE != 16 && VEC_SIZE != 32 && VEC_SIZE != 64
+#  error Unsupported VEC_SIZE!
+# endif
+	/* Clear high bits from edi. Only keeping bits relevant to page
+	   cross check. Note that we are using rax which is set in
+	   MEMSET_VDUP_TO_VEC0_AND_SET_RETURN as ptr from here on out.  */
+	andl	$(PAGE_SIZE - 1), %edi
+	/* Check if VEC_SIZE store cross page. Mask stores suffer
+	   serious performance degradation when it has to fault suppress.  */
+	cmpl	$(PAGE_SIZE - VEC_SIZE), %edi
+	/* This is generally considered a cold target.  */
+	ja	L(cross_page)
+# if VEC_SIZE > 32
+	movq	$-1, %rcx
+	bzhiq	%rdx, %rcx, %rcx
+	kmovq	%rcx, %k1
+# else
+	movl	$-1, %ecx
+	bzhil	%edx, %ecx, %ecx
+	kmovd	%ecx, %k1
+# endif
+	vmovdqu8 %VMM(0), (%rax){%k1}
+	VZEROUPPER_RETURN
+#endif
+
 #if defined USE_MULTIARCH && IS_IN (libc)
 END (MEMSET_SYMBOL (__memset, unaligned))

@@ -185,54 +223,6 @@ L(last_2x_vec):
 #endif
 	VZEROUPPER_RETURN

-	/* If have AVX512 mask instructions put L(less_vec) close to
-	   entry as it doesn't take much space and is likely a hot target.
-	 */
-#ifdef USE_LESS_VEC_MASK_STORE
-	.p2align 4,, 10
-L(less_vec):
-L(less_vec_from_wmemset):
-	/* Less than 1 VEC.  */
-# if VEC_SIZE != 16 && VEC_SIZE != 32 && VEC_SIZE != 64
-#  error Unsupported VEC_SIZE!
-# endif
-	/* Clear high bits from edi. Only keeping bits relevant to page
-	   cross check. Note that we are using rax which is set in
-	   MEMSET_VDUP_TO_VEC0_AND_SET_RETURN as ptr from here on out.  */
-	andl	$(PAGE_SIZE - 1), %edi
-	/* Check if VEC_SIZE store cross page. Mask stores suffer
-	   serious performance degradation when it has to fault suppress.
-	 */
-	cmpl	$(PAGE_SIZE - VEC_SIZE), %edi
-	/* This is generally considered a cold target.  */
-	ja	L(cross_page)
-# if VEC_SIZE > 32
-	movq	$-1, %rcx
-	bzhiq	%rdx, %rcx, %rcx
-	kmovq	%rcx, %k1
-# else
-	movl	$-1, %ecx
-	bzhil	%edx, %ecx, %ecx
-	kmovd	%ecx, %k1
-# endif
-	vmovdqu8 %VMM(0), (%rax){%k1}
-	VZEROUPPER_RETURN
-
-# if defined USE_MULTIARCH && IS_IN (libc)
-	/* Include L(stosb_local) here if including L(less_vec) between
-	   L(stosb_more_2x_vec) and ENTRY. This is to cache align the
-	   L(stosb_more_2x_vec) target.  */
-	.p2align 4,, 10
-L(stosb_local):
-	movzbl	%sil, %eax
-	mov	%RDX_LP, %RCX_LP
-	mov	%RDI_LP, %RDX_LP
-	rep	stosb
-	mov	%RDX_LP, %RAX_LP
-	VZEROUPPER_RETURN
-# endif
-#endif
-
 #if defined USE_MULTIARCH && IS_IN (libc)
 	.p2align 4
 L(stosb_more_2x_vec):
@@ -318,21 +308,33 @@ L(return_vzeroupper):
 	ret
 #endif

-	.p2align 4,, 10
-#ifndef USE_LESS_VEC_MASK_STORE
-# if defined USE_MULTIARCH && IS_IN (libc)
+#ifdef USE_WITH_AVX2
+	.p2align 4
+#else
+	.p2align 4,, 4
+#endif
+
+#if defined USE_MULTIARCH && IS_IN (libc)
 	/* If no USE_LESS_VEC_MASK put L(stosb_local) here. Will be in
 	   range for 2-byte jump encoding.  */
 L(stosb_local):
+	cmp	__x86_shared_non_temporal_threshold(%rip), %RDX_LP
+	jae	L(nt_memset)
 	movzbl	%sil, %eax
 	mov	%RDX_LP, %RCX_LP
 	mov	%RDI_LP, %RDX_LP
 	rep	stosb
+# if (defined USE_WITH_SSE2) || (defined USE_WITH_AVX512)
+	/* Use xchg to save 1-byte (this helps align targets below).  */
+	xchg	%RDX_LP, %RAX_LP
+# else
 	mov	%RDX_LP, %RAX_LP
-	VZEROUPPER_RETURN
 # endif
+	VZEROUPPER_RETURN
+#endif
+
+#ifndef USE_LESS_VEC_MASK_STORE
 	/* Define L(less_vec) only if not otherwise defined.  */
-	.p2align 4
+	.p2align 4,, 12
 L(less_vec):
 	/* Broadcast esi to partial register (i.e VEC_SIZE == 32 broadcast to
 	   xmm). This is only does anything for AVX2.  */
@@ -423,4 +425,35 @@ L(between_2_3):
 	movb	%SET_REG8, -1(%LESS_VEC_REG, %rdx)
 #endif
 	ret
-END (MEMSET_SYMBOL (__memset, unaligned_erms))
+
+#if defined USE_MULTIARCH && IS_IN (libc)
+# ifdef USE_WITH_AVX512
+	/* Force align so the loop doesn't cross a cache-line.  */
+	.p2align 4
+# endif
+	.p2align 4,, 7
+	/* Memset using non-temporal stores.  */
+L(nt_memset):
+	VMOVU	%VMM(0), (VEC_SIZE * 0)(%rdi)
+	leaq	(VEC_SIZE * -4)(%rdi, %rdx), %rdx
+	/* Align DST.  */
+	orq	$(VEC_SIZE * 1 - 1), %rdi
+	incq	%rdi
+	.p2align 4,, 7
+L(nt_loop):
+	VMOVNT	%VMM(0), (VEC_SIZE * 0)(%rdi)
+	VMOVNT	%VMM(0), (VEC_SIZE * 1)(%rdi)
+	VMOVNT	%VMM(0), (VEC_SIZE * 2)(%rdi)
+	VMOVNT	%VMM(0), (VEC_SIZE * 3)(%rdi)
+	subq	$(VEC_SIZE * -4), %rdi
+	cmpq	%rdx, %rdi
+	jb	L(nt_loop)
+	sfence
+	VMOVU	%VMM(0), (VEC_SIZE * 0)(%rdx)
+	VMOVU	%VMM(0), (VEC_SIZE * 1)(%rdx)
+	VMOVU	%VMM(0), (VEC_SIZE * 2)(%rdx)
+	VMOVU	%VMM(0), (VEC_SIZE * 3)(%rdx)
+	VZEROUPPER_RETURN
+#endif
+
+END(MEMSET_SYMBOL(__memset, unaligned_erms))

From patchwork Fri May 24 17:38:51 2024
X-Patchwork-Submitter: Noah Goldstein
X-Patchwork-Id: 1939051
From: Noah Goldstein <goldstein.w.n@gmail.com>
To: libc-alpha@sourceware.org
Cc: goldstein.w.n@gmail.com, hjl.tools@gmail.com, carlos@systemhalted.org
Subject: [PATCH v2 2/2] x86: Add separate non-temporal tunable for memset
Date: Fri, 24 May 2024 12:38:51 -0500
Message-Id: <20240524173851.2483952-2-goldstein.w.n@gmail.com>
In-Reply-To: <20240524173851.2483952-1-goldstein.w.n@gmail.com>
References: <20240519004347.2759850-1-goldstein.w.n@gmail.com>
 <20240524173851.2483952-1-goldstein.w.n@gmail.com>

The tuning of non-temporal stores for memset vs. memcpy is not always
the same. This includes both the exact threshold value and whether
non-temporal stores are profitable at all for a given arch.

This patch adds `x86_memset_non_temporal_threshold`. Currently we
disable non-temporal stores for non-Intel vendors, as the only
benchmarks showing their benefit have been on Intel hardware.

Reviewed-by: H.J. Lu
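As a usage note (not part of the patch): like the existing glibc.cpu
tunables documented in manual/tunables.texi, the new threshold would be
set through the GLIBC_TUNABLES environment variable. A hypothetical
invocation, with an illustrative value and program name:

    # Force non-temporal memset stores only above 16 MiB for this run
    # (0x1000000 and ./app are placeholders).
    GLIBC_TUNABLES=glibc.cpu.x86_memset_non_temporal_threshold=0x1000000 ./app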
---
 manual/tunables.texi                             | 16 +++++++++++++++-
 sysdeps/x86/cacheinfo.h                          |  8 +++++++-
 sysdeps/x86/dl-cacheinfo.h                       | 16 ++++++++++++++++
 sysdeps/x86/dl-diagnostics-cpu.c                 |  2 ++
 sysdeps/x86/dl-tunables.list                     |  3 +++
 sysdeps/x86/include/cpu-features.h               |  4 +++-
 .../x86_64/multiarch/memset-vec-unaligned-erms.S |  6 +++---
 7 files changed, 49 insertions(+), 6 deletions(-)

diff --git a/manual/tunables.texi b/manual/tunables.texi
index baaf751721..8dd02d8149 100644
--- a/manual/tunables.texi
+++ b/manual/tunables.texi
@@ -52,6 +52,7 @@ glibc.elision.skip_lock_busy: 3 (min: 0, max: 2147483647)
 glibc.malloc.top_pad: 0x20000 (min: 0x0, max: 0xffffffffffffffff)
 glibc.cpu.x86_rep_stosb_threshold: 0x800 (min: 0x1, max: 0xffffffffffffffff)
 glibc.cpu.x86_non_temporal_threshold: 0xc0000 (min: 0x4040, max: 0xfffffffffffffff)
+glibc.cpu.x86_memset_non_temporal_threshold: 0xc0000 (min: 0x4040, max: 0xfffffffffffffff)
 glibc.cpu.x86_shstk:
 glibc.pthread.stack_cache_size: 0x2800000 (min: 0x0, max: 0xffffffffffffffff)
 glibc.cpu.hwcap_mask: 0x6 (min: 0x0, max: 0xffffffffffffffff)
@@ -495,7 +496,8 @@ thread stack originally backup by Huge Pages to default pages.
 @cindex shared_cache_size tunables
 @cindex tunables, shared_cache_size
 @cindex non_temporal_threshold tunables
-@cindex tunables, non_temporal_threshold
+@cindex memset_non_temporal_threshold tunables
+@cindex tunables, non_temporal_threshold, memset_non_temporal_threshold

 @deftp {Tunable namespace} glibc.cpu
 Behavior of @theglibc{} can be tuned to assume specific hardware capabilities
@@ -574,6 +576,18 @@ like memmove and memcpy.
 This tunable is specific to i386 and x86-64.
 @end deftp

+@deftp Tunable glibc.cpu.x86_memset_non_temporal_threshold
+The @code{glibc.cpu.x86_memset_non_temporal_threshold} tunable allows
+the user to set threshold in bytes for non temporal store in
+memset. Non temporal stores give a hint to the hardware to move data
+directly to memory without displacing other data from the cache. This
+tunable is used by some platforms to determine when to use non
+temporal stores memset.
+
+This tunable is specific to i386 and x86-64.
+@end deftp
+
+
 @deftp Tunable glibc.cpu.x86_rep_movsb_threshold
 The @code{glibc.cpu.x86_rep_movsb_threshold} tunable allows the user to
 set threshold in bytes to start using "rep movsb". The value must be
diff --git a/sysdeps/x86/cacheinfo.h b/sysdeps/x86/cacheinfo.h
index ab73556772..83491607c7 100644
--- a/sysdeps/x86/cacheinfo.h
+++ b/sysdeps/x86/cacheinfo.h
@@ -35,9 +35,12 @@ long int __x86_data_cache_size attribute_hidden = 32 * 1024;
 long int __x86_shared_cache_size_half attribute_hidden = 1024 * 1024 / 2;
 long int __x86_shared_cache_size attribute_hidden = 1024 * 1024;

-/* Threshold to use non temporal store.  */
+/* Threshold to use non temporal store in memmove.  */
 long int __x86_shared_non_temporal_threshold attribute_hidden;

+/* Threshold to use non temporal store in memset.  */
+long int __x86_memset_non_temporal_threshold attribute_hidden;
+
 /* Threshold to use Enhanced REP MOVSB.  */
 long int __x86_rep_movsb_threshold attribute_hidden = 2048;

@@ -77,6 +80,9 @@ init_cacheinfo (void)
   __x86_shared_non_temporal_threshold
     = cpu_features->non_temporal_threshold;

+  __x86_memset_non_temporal_threshold
+    = cpu_features->memset_non_temporal_threshold;
+
   __x86_rep_movsb_threshold = cpu_features->rep_movsb_threshold;
   __x86_rep_stosb_threshold = cpu_features->rep_stosb_threshold;
   __x86_rep_movsb_stop_threshold = cpu_features->rep_movsb_stop_threshold;
diff --git a/sysdeps/x86/dl-cacheinfo.h b/sysdeps/x86/dl-cacheinfo.h
index 5a98f70364..d375a7cba6 100644
--- a/sysdeps/x86/dl-cacheinfo.h
+++ b/sysdeps/x86/dl-cacheinfo.h
@@ -986,6 +986,13 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
   if (CPU_FEATURE_USABLE_P (cpu_features, FSRM))
     rep_movsb_threshold = 2112;

+  /* Non-temporal stores in memset have only been tested on Intel hardware.
+     Until we benchmark data on other x86 processor, disable non-temporal
+     stores in memset.  */
+  unsigned long int memset_non_temporal_threshold = SIZE_MAX;
+  if (cpu_features->basic.kind == arch_kind_intel)
+    memset_non_temporal_threshold = non_temporal_threshold;
+
   /* For AMD CPUs that support ERMS (Zen3+), REP MOVSB is in a lot of
      cases slower than the vectorized path (and for some alignments,
      it is really slow, check BZ #30994).  */
@@ -1012,6 +1019,11 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
       && tunable_size <= maximum_non_temporal_threshold)
     non_temporal_threshold = tunable_size;

+  tunable_size = TUNABLE_GET (x86_memset_non_temporal_threshold, long int, NULL);
+  if (tunable_size > minimum_non_temporal_threshold
+      && tunable_size <= maximum_non_temporal_threshold)
+    memset_non_temporal_threshold = tunable_size;
+
   tunable_size = TUNABLE_GET (x86_rep_movsb_threshold, long int, NULL);
   if (tunable_size > minimum_rep_movsb_threshold)
     rep_movsb_threshold = tunable_size;
@@ -1032,6 +1044,9 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
   TUNABLE_SET_WITH_BOUNDS (x86_non_temporal_threshold, non_temporal_threshold,
			   minimum_non_temporal_threshold,
			   maximum_non_temporal_threshold);
+  TUNABLE_SET_WITH_BOUNDS (
+      x86_memset_non_temporal_threshold, memset_non_temporal_threshold,
+      minimum_non_temporal_threshold, maximum_non_temporal_threshold);
   TUNABLE_SET_WITH_BOUNDS (x86_rep_movsb_threshold, rep_movsb_threshold,
			   minimum_rep_movsb_threshold, SIZE_MAX);
   TUNABLE_SET_WITH_BOUNDS (x86_rep_stosb_threshold, rep_stosb_threshold, 1,
@@ -1045,6 +1060,7 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
   cpu_features->data_cache_size = data;
   cpu_features->shared_cache_size = shared;
   cpu_features->non_temporal_threshold = non_temporal_threshold;
+  cpu_features->memset_non_temporal_threshold = memset_non_temporal_threshold;
   cpu_features->rep_movsb_threshold = rep_movsb_threshold;
   cpu_features->rep_stosb_threshold = rep_stosb_threshold;
   cpu_features->rep_movsb_stop_threshold = rep_movsb_stop_threshold;
diff --git a/sysdeps/x86/dl-diagnostics-cpu.c b/sysdeps/x86/dl-diagnostics-cpu.c
index ceafde9481..49eeb5f70a 100644
--- a/sysdeps/x86/dl-diagnostics-cpu.c
+++ b/sysdeps/x86/dl-diagnostics-cpu.c
@@ -94,6 +94,8 @@ _dl_diagnostics_cpu (void)
                             cpu_features->shared_cache_size);
   print_cpu_features_value ("non_temporal_threshold",
                             cpu_features->non_temporal_threshold);
+  print_cpu_features_value ("memset_non_temporal_threshold",
+                            cpu_features->memset_non_temporal_threshold);
   print_cpu_features_value ("rep_movsb_threshold",
                             cpu_features->rep_movsb_threshold);
   print_cpu_features_value ("rep_movsb_stop_threshold",
diff --git a/sysdeps/x86/dl-tunables.list b/sysdeps/x86/dl-tunables.list
index 7d82da0dec..a0a1299592 100644
--- a/sysdeps/x86/dl-tunables.list
+++ b/sysdeps/x86/dl-tunables.list
@@ -30,6 +30,9 @@ glibc {
     x86_non_temporal_threshold {
       type: SIZE_T
     }
+    x86_memset_non_temporal_threshold {
+      type: SIZE_T
+    }
     x86_rep_movsb_threshold {
       type: SIZE_T
       # Since there is overhead to set up REP MOVSB operation, REP
diff --git a/sysdeps/x86/include/cpu-features.h b/sysdeps/x86/include/cpu-features.h
index cd7bd27cf3..aaae44f0e1 100644
--- a/sysdeps/x86/include/cpu-features.h
+++ b/sysdeps/x86/include/cpu-features.h
@@ -944,8 +944,10 @@ struct cpu_features
   /* Shared cache size for use in memory and string routines, typically
      L2 or L3 size.  */
   unsigned long int shared_cache_size;
-  /* Threshold to use non temporal store.  */
+  /* Threshold to use non temporal store in memmove.  */
   unsigned long int non_temporal_threshold;
+  /* Threshold to use non temporal store in memset.  */
+  unsigned long int memset_non_temporal_threshold;
   /* Threshold to use "rep movsb".  */
   unsigned long int rep_movsb_threshold;
   /* Threshold to stop using "rep movsb".  */
diff --git a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
index 637caadb40..88bf08e4f4 100644
--- a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
+++ b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
@@ -24,9 +24,9 @@
    5. If size is more to 4 * VEC_SIZE, align to 1 * VEC_SIZE with
       4 VEC stores and store 4 * VEC at a time until done.
    6. On machines ERMS feature, if size is range
-      [__x86_rep_stosb_threshold, __x86_shared_non_temporal_threshold)
+      [__x86_rep_stosb_threshold, __x86_memset_non_temporal_threshold)
       then REP STOSB will be used.
-   7. If size >= __x86_shared_non_temporal_threshold, use a
+   7. If size >= __x86_memset_non_temporal_threshold, use a
       non-temporal stores.  */

 #include <sysdep.h>
@@ -318,7 +318,7 @@ L(return_vzeroupper):
 	/* If no USE_LESS_VEC_MASK put L(stosb_local) here. Will be in
 	   range for 2-byte jump encoding.  */
 L(stosb_local):
-	cmp	__x86_shared_non_temporal_threshold(%rip), %RDX_LP
+	cmp	__x86_memset_non_temporal_threshold(%rip), %RDX_LP
 	jae	L(nt_memset)
 	movzbl	%sil, %eax
 	mov	%RDX_LP, %RCX_LP