From patchwork Tue May 9 03:13:13 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Noah Goldstein X-Patchwork-Id: 1778735 Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@legolas.ozlabs.org Authentication-Results: legolas.ozlabs.org; spf=pass (sender SPF authorized) smtp.mailfrom=sourceware.org (client-ip=2620:52:3:1:0:246e:9693:128c; helo=sourceware.org; envelope-from=libc-alpha-bounces+incoming=patchwork.ozlabs.org@sourceware.org; receiver=) Authentication-Results: legolas.ozlabs.org; dkim=pass (1024-bit key; secure) header.d=sourceware.org header.i=@sourceware.org header.a=rsa-sha256 header.s=default header.b=FQ5DaCVL; dkim-atps=neutral Received: from sourceware.org (server2.sourceware.org [IPv6:2620:52:3:1:0:246e:9693:128c]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature ECDSA (P-384) server-digest SHA384) (No client certificate requested) by legolas.ozlabs.org (Postfix) with ESMTPS id 4QFjts4tKYz214M for ; Tue, 9 May 2023 13:14:29 +1000 (AEST) Received: from server2.sourceware.org (localhost [IPv6:::1]) by sourceware.org (Postfix) with ESMTP id 8B6A43858C66 for ; Tue, 9 May 2023 03:14:27 +0000 (GMT) DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 8B6A43858C66 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=sourceware.org; s=default; t=1683602067; bh=w0PaJCWIPxoUzXJnWJ8jHbg5flCW4vWEZ+6wblrf5Tc=; h=To:Cc:Subject:Date:In-Reply-To:References:List-Id: List-Unsubscribe:List-Archive:List-Post:List-Help:List-Subscribe: From:Reply-To:From; b=FQ5DaCVLh1hC7nDkmFMrLUj3/li1VcAXlnTspheR9MzgcNArsPTAdzfoZEhDCTaGk u44m19EW6wPTmLKNk5m9mZAVE1D0eAFhnFwrZCBWss7YWY6h9hpYhjIWEgNZCzjc+Z UfUn78m7g8ojLixB4WEtuYw66hJ4XhR/B5he/0ls= X-Original-To: libc-alpha@sourceware.org Delivered-To: libc-alpha@sourceware.org Received: from mail-pl1-x62f.google.com (mail-pl1-x62f.google.com [IPv6:2607:f8b0:4864:20::62f]) by sourceware.org (Postfix) with ESMTPS id D89393858288 for ; Tue, 9 May 2023 03:13:29 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org D89393858288 Received: by mail-pl1-x62f.google.com with SMTP id d9443c01a7336-1aae46e62e9so38153015ad.2 for ; Mon, 08 May 2023 20:13:29 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20221208; t=1683602008; x=1686194008; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=w0PaJCWIPxoUzXJnWJ8jHbg5flCW4vWEZ+6wblrf5Tc=; b=JkhE884zGqWR0oejG8U09wSM1fUljaJ4+A/+zC/hvX/Bjw3E7aSwMVn/F4SA8fFu64 4OQMRLS/AF9RAGvuxiZeYa2ie+rkkH+3xkER+q56nzJyxgE91NnqdU3c0UT8gyKKW5Kv Aw0Q6U6dqCF250FJPs+SkwVubS6sbJ6weyvplD3gOKC3CP0y91NarUoqQYfTV9qWvBzo 2cLcZJXtkAR59faRNHKgOiujl6Q+UUvAgineOAH4ROrmm/UjYbqcgny/KaDuZoGZw5xU qsfPWoHTrZZhmWLP02OaQp6Jsr9r7YhV5a+fUaKyoJ0CzxNweLXLN6FUNFllY5jsO0g3 tS9w== X-Gm-Message-State: AC+VfDybkihVkgt5RP+gOGTPlAmldhmDlCrGy3ICPXqXBVS/VaniBhI9 8Tq6KTf++SB/Al4iulMoxiyK+YbrXbg= X-Google-Smtp-Source: ACHHUZ7xbhq6kvR/cWcsVic5Tcc/VJxql8kD+Q5XNofcnrivPoGW5/CMJwARZR4WgACcR6AW8LJGkA== X-Received: by 2002:a17:902:d483:b0:1ac:896f:f655 with SMTP id c3-20020a170902d48300b001ac896ff655mr4096053plg.50.1683602008575; Mon, 08 May 2023 20:13:28 -0700 (PDT) Received: from noahgold-desk.. ([192.55.60.44]) by smtp.gmail.com with ESMTPSA id r15-20020a170903020f00b001aae625e422sm252962plh.37.2023.05.08.20.13.26 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 08 May 2023 20:13:27 -0700 (PDT) To: libc-alpha@sourceware.org Cc: goldstein.w.n@gmail.com, hjl.tools@gmail.com, carlos@systemhalted.org Subject: [PATCH v5 3/3] x86: Make the divisor in setting `non_temporal_threshold` cpu specific Date: Mon, 8 May 2023 22:13:13 -0500 Message-Id: <20230509031313.3497001-3-goldstein.w.n@gmail.com> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20230509031313.3497001-1-goldstein.w.n@gmail.com> References: <20230424050329.1501348-1-goldstein.w.n@gmail.com> <20230509031313.3497001-1-goldstein.w.n@gmail.com> MIME-Version: 1.0 X-Spam-Status: No, score=-12.1 required=5.0 tests=BAYES_00, DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, FREEMAIL_FROM, GIT_PATCH_0, RCVD_IN_DNSWL_NONE, SPF_HELO_NONE, SPF_PASS, TXREP, T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org X-BeenThere: libc-alpha@sourceware.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Libc-alpha mailing list List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-Patchwork-Original-From: Noah Goldstein via Libc-alpha From: Noah Goldstein Reply-To: Noah Goldstein Errors-To: libc-alpha-bounces+incoming=patchwork.ozlabs.org@sourceware.org Sender: "Libc-alpha" Different systems prefer a different divisors. From benchmarks[1] so far the following divisors have been found: ICX : 2 SKX : 2 BWD : 8 For Intel, we are generalizing that BWD and older prefers 8 as a divisor, and SKL and newer prefers 2. This number can be further tuned as benchmarks are run. [1]: https://github.com/goldsteinn/memcpy-nt-benchmarks --- sysdeps/x86/cpu-features.c | 11 ++++++---- sysdeps/x86/dl-cacheinfo.h | 32 ++++++++++++++++++------------ sysdeps/x86/include/cpu-features.h | 3 +++ 3 files changed, 29 insertions(+), 17 deletions(-) diff --git a/sysdeps/x86/cpu-features.c b/sysdeps/x86/cpu-features.c index bec70c3c49..3c1a77906a 100644 --- a/sysdeps/x86/cpu-features.c +++ b/sysdeps/x86/cpu-features.c @@ -637,6 +637,7 @@ init_cpu_features (struct cpu_features *cpu_features) unsigned int stepping = 0; enum cpu_features_kind kind; + cpu_features->cachesize_non_temporal_divisor = 4; #if !HAS_CPUID if (__get_cpuid_max (0, 0) == 0) { @@ -720,6 +721,8 @@ init_cpu_features (struct cpu_features *cpu_features) break; case INTEL_BIGCORE_NEHALEM: case INTEL_BIGCORE_WESTMERE: + /* Older CPUs prefer non-temporal stores at lower threshold. */ + cpu_features->cachesize_non_temporal_divisor = 8; /* Rep string instructions, unaligned load, unaligned copy, and pminub are fast on Intel Core i3, i5 and i7. */ cpu_features->preferred[index_arch_Fast_Rep_String] @@ -728,11 +731,12 @@ init_cpu_features (struct cpu_features *cpu_features) | bit_arch_Prefer_PMINUB_for_stringop); break; - /* Untuned Bigcore microarch. */ case INTEL_BIGCORE_SANDYBRIDGE: case INTEL_BIGCORE_IVYBRIDGE: case INTEL_BIGCORE_HASWELL: case INTEL_BIGCORE_BROADWELL: + cpu_features->cachesize_non_temporal_divisor = 8; + break; case INTEL_BIGCORE_SKYLAKE: case INTEL_BIGCORE_AMBERLAKE: case INTEL_BIGCORE_COFFEELAKE: @@ -753,11 +757,10 @@ init_cpu_features (struct cpu_features *cpu_features) case INTEL_BIGCORE_SAPPHIRERAPIDS: case INTEL_BIGCORE_EMERALDRAPIDS: case INTEL_BIGCORE_GRANITERAPIDS: - break; - - /* Untuned Mixed (bigcore + atom SOC). */ + /* Mixed (bigcore + atom SOC). */ case INTEL_MIXED_LAKEFIELD: case INTEL_MIXED_ALDERLAKE: + cpu_features->cachesize_non_temporal_divisor = 2; break; } diff --git a/sysdeps/x86/dl-cacheinfo.h b/sysdeps/x86/dl-cacheinfo.h index c7e41029fa..6225c852f6 100644 --- a/sysdeps/x86/dl-cacheinfo.h +++ b/sysdeps/x86/dl-cacheinfo.h @@ -738,19 +738,25 @@ dl_init_cacheinfo (struct cpu_features *cpu_features) cpu_features->level3_cache_linesize = level3_cache_linesize; cpu_features->level4_cache_size = level4_cache_size; - /* The default setting for the non_temporal threshold is 1/4 of size - of the chip's cache. For most Intel and AMD processors with an - initial release date between 2017 and 2023, a thread's typical - share of the cache is from 18-64MB. Using the 1/4 L3 is meant to - estimate the point where non-temporal stores begin outcompeting - REP MOVSB. As well the point where the fact that non-temporal - stores are forced back to main memory would already occurred to the - majority of the lines in the copy. Note, concerns about the - entire L3 cache being evicted by the copy are mostly alleviated - by the fact that modern HW detects streaming patterns and - provides proper LRU hints so that the maximum thrashing - capped at 1/associativity. */ - unsigned long int non_temporal_threshold = shared / 4; + unsigned long int cachesize_non_temporal_divisor + = cpu_features->cachesize_non_temporal_divisor; + if (cachesize_non_temporal_divisor <= 0) + cachesize_non_temporal_divisor = 4; + + /* The default setting for the non_temporal threshold is [1/2, 1/8] of size + of the chip's cache (depending on `cachesize_non_temporal_divisor` which + is microarch specific. The defeault is 1/4). For most Intel and AMD + processors with an initial release date between 2017 and 2023, a thread's + typical share of the cache is from 18-64MB. Using a reasonable size + fraction of L3 is meant to estimate the point where non-temporal stores + begin outcompeting REP MOVSB. As well the point where the fact that + non-temporal stores are forced back to main memory would already occurred + to the majority of the lines in the copy. Note, concerns about the entire + L3 cache being evicted by the copy are mostly alleviated by the fact that + modern HW detects streaming patterns and provides proper LRU hints so that + the maximum thrashing capped at 1/associativity. */ + unsigned long int non_temporal_threshold + = shared / cachesize_non_temporal_divisor; /* If no ERMS, we use the per-thread L3 chunking. Normal cacheable stores run a higher risk of actually thrashing the cache as they don't have a HW LRU hint. As well, there performance in highly parallel situations is diff --git a/sysdeps/x86/include/cpu-features.h b/sysdeps/x86/include/cpu-features.h index 40b8129d6a..f5b9dd54fe 100644 --- a/sysdeps/x86/include/cpu-features.h +++ b/sysdeps/x86/include/cpu-features.h @@ -915,6 +915,9 @@ struct cpu_features unsigned long int shared_cache_size; /* Threshold to use non temporal store. */ unsigned long int non_temporal_threshold; + /* When no user non_temporal_threshold is specified. We default to + cachesize / cachesize_non_temporal_divisor. */ + unsigned long int cachesize_non_temporal_divisor; /* Threshold to use "rep movsb". */ unsigned long int rep_movsb_threshold; /* Threshold to stop using "rep movsb". */