From patchwork Tue Jul 23 06:38:21 2024
X-Patchwork-Submitter: Noah Goldstein
X-Patchwork-Id: 1963544
From: Noah Goldstein
To: libc-alpha@sourceware.org
Cc: goldstein.w.n@gmail.com, hjl.tools@gmail.com
Subject: x86: Increase default `rep stosb` threshold for SKX [BZ #32009]
Date: Tue, 23 Jul 2024 14:38:21 +0800
Message-Id: <20240723063821.3460385-1-goldstein.w.n@gmail.com>
X-Mailer: git-send-email 2.34.1
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: 7bit

Benchmarks indicate that `2048` (prior value) is far too low a
threshold for using `rep stosb`.
Rather something around `1048576` is preferable:
https://docs.google.com/spreadsheets/d/1opzukzvum4n6-RUVHTGddV6RjAEil4P2uMjjQGLbLcU/edit?usp=sharing
(See: SKX-1/SKX-1.1/SKX-2/SKX-2.2)

The `1048576` threshold was tested on multiple SKX machines and,
surprisingly, seemed to hold regardless of cache hierarchy.

Also note that in highly parallel settings a smaller value is
preferable, but the difference does not seem extreme enough to justify
a worse threshold with low/moderate parallelism.

Tested new threshold using qemu on all x86 systems mutually supported
by GLIBC and qemu.
---
 sysdeps/x86/cpu-features.c         | 11 +++++++++++
 sysdeps/x86/dl-cacheinfo.h         | 21 +++++++++++----------
 sysdeps/x86/dl-diagnostics-cpu.c   |  6 ++++--
 sysdeps/x86/include/cpu-features.h |  2 ++
 4 files changed, 28 insertions(+), 12 deletions(-)

diff --git a/sysdeps/x86/cpu-features.c b/sysdeps/x86/cpu-features.c
index c096dd390a..c5429422c0 100644
--- a/sysdeps/x86/cpu-features.c
+++ b/sysdeps/x86/cpu-features.c
@@ -757,6 +757,7 @@ init_cpu_features (struct cpu_features *cpu_features)
   enum cpu_features_kind kind;

   cpu_features->cachesize_non_temporal_divisor = 4;
+  cpu_features->machine_rep_stosb_threshold = 0;
 #if !HAS_CPUID
   if (__get_cpuid_max (0, 0) == 0)
     {
@@ -879,6 +880,16 @@ init_cpu_features (struct cpu_features *cpu_features)
                  non-temporal on all Skylake servers. */
               cpu_features->preferred[index_arch_Avoid_Non_Temporal_Memset]
                   |= bit_arch_Avoid_Non_Temporal_Memset;
+              /* SKX prefers temporal stores for a while.  Confusingly, tests
+                 across multiple SKX systems with different cache sizes all
+                 indicate the threshold for when `rep stosb` becomes preferable
+                 is 1048576 as opposed to some function of the size of the
+                 various components of the cache hierarchy.  Worth noting,
+                 although not taken into account here, is that `rep stosb` is
+                 more preferable with higher degrees of parallelism, i.e if all
+                 cores are simultaneously setting memory a lower threshold
+                 would be preferable.  */
+              cpu_features->machine_rep_stosb_threshold = 1048576;
             case INTEL_BIGCORE_COMETLAKE:
             case INTEL_BIGCORE_SKYLAKE:
             case INTEL_BIGCORE_KABYLAKE:
diff --git a/sysdeps/x86/dl-cacheinfo.h b/sysdeps/x86/dl-cacheinfo.h
index a1c03b8903..c646b57508 100644
--- a/sysdeps/x86/dl-cacheinfo.h
+++ b/sysdeps/x86/dl-cacheinfo.h
@@ -1002,9 +1002,6 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
   if (cpu_features->basic.kind == arch_kind_amd)
     rep_movsb_threshold = non_temporal_threshold;

-  /* The default threshold to use Enhanced REP STOSB.  */
-  unsigned long int rep_stosb_threshold = 2048;
-
   long int tunable_size;

   tunable_size = TUNABLE_GET (x86_data_cache_size, long int, NULL);
@@ -1034,13 +1031,17 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
   /* NB: The default value of the x86_rep_stosb_threshold tunable is the
      same as the default value of __x86_rep_stosb_threshold and the
      minimum value is fixed.  */
-  rep_stosb_threshold = TUNABLE_GET (x86_rep_stosb_threshold,
-                                     long int, NULL);
-  if (cpu_features->basic.kind == arch_kind_amd
-      && !TUNABLE_IS_INITIALIZED (x86_rep_stosb_threshold))
-    /* For AMD Zen3+ architecture, the performance of the vectorized loop is
-       slightly better than ERMS.  */
-    rep_stosb_threshold = SIZE_MAX;
+  unsigned long int rep_stosb_threshold
+      = TUNABLE_GET (x86_rep_stosb_threshold, long int, NULL);
+  if (!TUNABLE_IS_INITIALIZED (x86_rep_stosb_threshold))
+    {
+      if (cpu_features->machine_rep_stosb_threshold != 0)
+        rep_stosb_threshold = cpu_features->machine_rep_stosb_threshold;
+      /* For AMD Zen3+ architecture, the performance of the vectorized loop
+         is slightly better than ERMS.  */
+      else if (cpu_features->basic.kind == arch_kind_amd)
+        rep_stosb_threshold = SIZE_MAX;
+    }

   TUNABLE_SET_WITH_BOUNDS (x86_data_cache_size, data, 0, SIZE_MAX);
   TUNABLE_SET_WITH_BOUNDS (x86_shared_cache_size, shared, 0, SIZE_MAX);
diff --git a/sysdeps/x86/dl-diagnostics-cpu.c b/sysdeps/x86/dl-diagnostics-cpu.c
index 49eeb5f70a..b84903a294 100644
--- a/sysdeps/x86/dl-diagnostics-cpu.c
+++ b/sysdeps/x86/dl-diagnostics-cpu.c
@@ -128,9 +128,11 @@ _dl_diagnostics_cpu (void)
                             cpu_features->level4_cache_size);
   print_cpu_features_value ("cachesize_non_temporal_divisor",
                             cpu_features->cachesize_non_temporal_divisor);
+  print_cpu_features_value ("machine_rep_stosb_threshold",
+                            cpu_features->machine_rep_stosb_threshold);

   _Static_assert (
-    offsetof (struct cpu_features, cachesize_non_temporal_divisor)
-      + sizeof (cpu_features->cachesize_non_temporal_divisor)
+    offsetof (struct cpu_features, machine_rep_stosb_threshold)
+      + sizeof (cpu_features->machine_rep_stosb_threshold)
       == sizeof (*cpu_features),
     "last cpu_features field has been printed");
diff --git a/sysdeps/x86/include/cpu-features.h b/sysdeps/x86/include/cpu-features.h
index aaae44f0e1..5f2b5b93ac 100644
--- a/sysdeps/x86/include/cpu-features.h
+++ b/sysdeps/x86/include/cpu-features.h
@@ -981,6 +981,8 @@ struct cpu_features
   /* When no user non_temporal_threshold is specified. We default to
      cachesize / cachesize_non_temporal_divisor.  */
   unsigned long int cachesize_non_temporal_divisor;
+  /* Default rep stosb threshold (if 0, use default in dl-machine.h).  */
+  unsigned long int machine_rep_stosb_threshold;
 };

 /* Get a pointer to the CPU features structure.  */
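
Note for readers: the precedence that the dl-cacheinfo.h hunk appears to
establish can be sketched as a standalone function. This is only an
illustration, not glibc code. The struct `threshold_inputs`, the enum
`vendor_kind`, and the function `select_rep_stosb_threshold` are
hypothetical names invented here, and the tunable machinery
(TUNABLE_GET, TUNABLE_IS_INITIALIZED) is modelled by two plain fields.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for cpu_features->basic.kind.  */
enum vendor_kind { VENDOR_OTHER, VENDOR_AMD };

/* Hypothetical bundle of the inputs the patched code consults.  */
struct threshold_inputs
{
  bool tunable_set_by_user;        /* models TUNABLE_IS_INITIALIZED */
  unsigned long int tunable_value; /* models TUNABLE_GET: user value,
                                      or the generic default */
  unsigned long int machine_threshold; /* 0 means no per-machine value */
  enum vendor_kind vendor;
};

/* Order of precedence, as read from the patch:
   1. an explicit user tunable always wins;
   2. otherwise a nonzero per-machine threshold (e.g. SKX) is used;
   3. otherwise AMD gets SIZE_MAX, i.e. `rep stosb` is never chosen;
   4. otherwise the generic tunable default is kept.  */
static unsigned long int
select_rep_stosb_threshold (const struct threshold_inputs *in)
{
  unsigned long int threshold = in->tunable_value;
  if (!in->tunable_set_by_user)
    {
      if (in->machine_threshold != 0)
        threshold = in->machine_threshold;
      else if (in->vendor == VENDOR_AMD)
        threshold = SIZE_MAX;
    }
  return threshold;
}
```

Under this reading, an SKX machine with no user tunable would end up with
1048576, while a user who sets the tunable keeps their value on any
machine.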