From patchwork Tue Jul 23 06:38:21 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Noah Goldstein <goldstein.w.n@gmail.com>
X-Patchwork-Id: 1963544
From: Noah Goldstein <goldstein.w.n@gmail.com>
To: libc-alpha@sourceware.org
Cc: goldstein.w.n@gmail.com,
	hjl.tools@gmail.com
Subject: x86: Increase default `rep stosb` threshold for SKX [BZ #32009]
Date: Tue, 23 Jul 2024 14:38:21 +0800
Message-Id: <20240723063821.3460385-1-goldstein.w.n@gmail.com>
X-Mailer: git-send-email 2.34.1
MIME-Version: 1.0

Benchmarks indicate that `2048` (the prior value) is far too low a
threshold for using `rep stosb` on SKX. A value around `1048576` is
preferable:

https://docs.google.com/spreadsheets/d/1opzukzvum4n6-RUVHTGddV6RjAEil4P2uMjjQGLbLcU/edit?usp=sharing

(See: SKX-1/SKX-1.1/SKX-2/SKX-2.2)

The `1048576` threshold was tested on multiple SKX machines and,
surprisingly, seemed to hold regardless of cache hierarchy.

Also note that in highly parallel settings a smaller value is
preferable, but the difference does not seem large enough to justify a
worse threshold with low/moderate parallelism.

Tested new threshold using qemu on all x86 systems mutually supported
by GLIBC and qemu.
---
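Note for reviewers: the precedence that the dl-cacheinfo.h hunk below
sets up can be sketched in isolation as follows. This is only an
illustration, not glibc code: `enum cpu_kind`, `struct features_sketch`
and `select_rep_stosb_threshold` are hypothetical stand-ins, and the
tunable value and its "initialized" state are passed in as plain
arguments instead of going through TUNABLE_GET and
TUNABLE_IS_INITIALIZED.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for illustration only; these are not the
   glibc types.  */
enum cpu_kind { KIND_INTEL, KIND_AMD, KIND_OTHER };

struct features_sketch
{
  enum cpu_kind kind;
  /* 0 means "no machine-specific preference".  */
  unsigned long int machine_rep_stosb_threshold;
};

/* Pick the `rep stosb` threshold with the same precedence as the
   patched code:
     1. a value the user set through the tunable always wins;
     2. otherwise a non-zero machine-specific value is used;
     3. otherwise AMD gets SIZE_MAX (never use `rep stosb`);
     4. otherwise the tunable's built-in default is kept.  */
static unsigned long int
select_rep_stosb_threshold (const struct features_sketch *f,
                            unsigned long int tunable_value,
                            bool tunable_is_initialized)
{
  unsigned long int threshold = tunable_value;

  if (!tunable_is_initialized)
    {
      if (f->machine_rep_stosb_threshold != 0)
        threshold = f->machine_rep_stosb_threshold;
      else if (f->kind == KIND_AMD)
        threshold = SIZE_MAX;
    }

  return threshold;
}
```

With a machine-specific value of 1048576 and no user setting, the
sketch returns 1048576. With a user setting of 4096 it returns 4096
regardless of the machine value, so an explicit
GLIBC_TUNABLES=glibc.cpu.x86_rep_stosb_threshold=... still overrides
the new SKX default.
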
 sysdeps/x86/cpu-features.c         | 11 +++++++++++
 sysdeps/x86/dl-cacheinfo.h         | 21 +++++++++++----------
 sysdeps/x86/dl-diagnostics-cpu.c   |  6 ++++--
 sysdeps/x86/include/cpu-features.h |  2 ++
 4 files changed, 28 insertions(+), 12 deletions(-)

diff --git a/sysdeps/x86/cpu-features.c b/sysdeps/x86/cpu-features.c
index c096dd390a..c5429422c0 100644
--- a/sysdeps/x86/cpu-features.c
+++ b/sysdeps/x86/cpu-features.c
@@ -757,6 +757,7 @@ init_cpu_features (struct cpu_features *cpu_features)
   enum cpu_features_kind kind;
 
   cpu_features->cachesize_non_temporal_divisor = 4;
+  cpu_features->machine_rep_stosb_threshold = 0;
 #if !HAS_CPUID
   if (__get_cpuid_max (0, 0) == 0)
     {
@@ -879,6 +880,16 @@ init_cpu_features (struct cpu_features *cpu_features)
 		     non-temporal on all Skylake servers. */
 	      cpu_features->preferred[index_arch_Avoid_Non_Temporal_Memset]
 		  |= bit_arch_Avoid_Non_Temporal_Memset;
+	      /* SKX prefers temporal stores for a while. Surprisingly, tests
+	         across multiple SKX systems with different cache sizes all
+	         indicate the threshold at which `rep stosb` becomes preferable
+	         is 1048576, rather than some function of the sizes of the
+	         various components of the cache hierarchy. Worth noting,
+	         although not taken into account here, is that `rep stosb`
+	         fares better with higher degrees of parallelism, i.e., if all
+	         cores are simultaneously setting memory, a lower threshold
+	         would be preferable.  */
+	      cpu_features->machine_rep_stosb_threshold = 1048576;
 	    case INTEL_BIGCORE_COMETLAKE:
 	    case INTEL_BIGCORE_SKYLAKE:
 	    case INTEL_BIGCORE_KABYLAKE:
diff --git a/sysdeps/x86/dl-cacheinfo.h b/sysdeps/x86/dl-cacheinfo.h
index a1c03b8903..c646b57508 100644
--- a/sysdeps/x86/dl-cacheinfo.h
+++ b/sysdeps/x86/dl-cacheinfo.h
@@ -1002,9 +1002,6 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
   if (cpu_features->basic.kind == arch_kind_amd)
     rep_movsb_threshold = non_temporal_threshold;
 
-  /* The default threshold to use Enhanced REP STOSB.  */
-  unsigned long int rep_stosb_threshold = 2048;
-
   long int tunable_size;
 
   tunable_size = TUNABLE_GET (x86_data_cache_size, long int, NULL);
@@ -1034,13 +1031,17 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
   /* NB: The default value of the x86_rep_stosb_threshold tunable is the
      same as the default value of __x86_rep_stosb_threshold and the
      minimum value is fixed.  */
-  rep_stosb_threshold = TUNABLE_GET (x86_rep_stosb_threshold,
-				     long int, NULL);
-  if (cpu_features->basic.kind == arch_kind_amd
-      && !TUNABLE_IS_INITIALIZED (x86_rep_stosb_threshold))
-    /* For AMD Zen3+ architecture, the performance of the vectorized loop is
-       slightly better than ERMS.  */
-    rep_stosb_threshold = SIZE_MAX;
+  unsigned long int rep_stosb_threshold
+      = TUNABLE_GET (x86_rep_stosb_threshold, long int, NULL);
+  if (!TUNABLE_IS_INITIALIZED (x86_rep_stosb_threshold))
+    {
+      if (cpu_features->machine_rep_stosb_threshold != 0)
+	rep_stosb_threshold = cpu_features->machine_rep_stosb_threshold;
+      /* For AMD Zen3+ architecture, the performance of the vectorized loop
+	 is slightly better than ERMS.  */
+      else if (cpu_features->basic.kind == arch_kind_amd)
+	rep_stosb_threshold = SIZE_MAX;
+    }
 
   TUNABLE_SET_WITH_BOUNDS (x86_data_cache_size, data, 0, SIZE_MAX);
   TUNABLE_SET_WITH_BOUNDS (x86_shared_cache_size, shared, 0, SIZE_MAX);
diff --git a/sysdeps/x86/dl-diagnostics-cpu.c b/sysdeps/x86/dl-diagnostics-cpu.c
index 49eeb5f70a..b84903a294 100644
--- a/sysdeps/x86/dl-diagnostics-cpu.c
+++ b/sysdeps/x86/dl-diagnostics-cpu.c
@@ -128,9 +128,11 @@ _dl_diagnostics_cpu (void)
                             cpu_features->level4_cache_size);
   print_cpu_features_value ("cachesize_non_temporal_divisor",
 			    cpu_features->cachesize_non_temporal_divisor);
+  print_cpu_features_value ("machine_rep_stosb_threshold",
+			    cpu_features->machine_rep_stosb_threshold);
   _Static_assert (
-      offsetof (struct cpu_features, cachesize_non_temporal_divisor)
-	      + sizeof (cpu_features->cachesize_non_temporal_divisor)
+      offsetof (struct cpu_features, machine_rep_stosb_threshold)
+	      + sizeof (cpu_features->machine_rep_stosb_threshold)
 	  == sizeof (*cpu_features),
       "last cpu_features field has been printed");
 
diff --git a/sysdeps/x86/include/cpu-features.h b/sysdeps/x86/include/cpu-features.h
index aaae44f0e1..5f2b5b93ac 100644
--- a/sysdeps/x86/include/cpu-features.h
+++ b/sysdeps/x86/include/cpu-features.h
@@ -981,6 +981,8 @@ struct cpu_features
   /* When no user non_temporal_threshold is specified. We default to
      cachesize / cachesize_non_temporal_divisor.  */
   unsigned long int cachesize_non_temporal_divisor;
+  /* Default rep stosb threshold (if 0, use default in dl-machine.h).  */
+  unsigned long int machine_rep_stosb_threshold;
 };
 
 /* Get a pointer to the CPU features structure.  */