From patchwork Fri May 24 17:38:50 2024
X-Patchwork-Submitter: Noah Goldstein
X-Patchwork-Id: 1939050
From: Noah Goldstein <goldstein.w.n@gmail.com>
To: libc-alpha@sourceware.org
Cc: goldstein.w.n@gmail.com, hjl.tools@gmail.com, carlos@systemhalted.org
Subject: [PATCH v2 1/2] x86: Improve large memset perf with non-temporal stores [RHEL-29312]
Date: Fri, 24 May 2024 12:38:50 -0500
Message-Id: <20240524173851.2483952-1-goldstein.w.n@gmail.com>
In-Reply-To: <20240519004347.2759850-1-goldstein.w.n@gmail.com>
References: <20240519004347.2759850-1-goldstein.w.n@gmail.com>

Previously we used `rep stosb` for all medium/large memsets. This is
notably worse than non-temporal stores for large (above a few MBs)
memsets. See:
https://docs.google.com/spreadsheets/d/1opzukzvum4n6-RUVHTGddV6RjAEil4P2uMjjQGLbLcU/edit?usp=sharing
for data comparing different strategies for large memset on ICX and SKX.

Using non-temporal stores can be up to 3x faster on ICX and 2x faster
on SKX. Historically, these numbers would not have been so good
because of the zero-over-zero writeback optimization that `rep stosb`
is able to do, but that optimization has been removed as a potential
side-channel attack, so there is no longer any good reason to rely
solely on `rep stosb` for large memsets. On the flip side,
non-temporal writes can avoid fetching data in their RFO requests,
saving memory bandwidth.

All of the other changes to the file re-organize the code blocks to
maintain "good" alignment given the new code added in the
`L(stosb_local)` case.
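To make the approach concrete, here is a minimal C sketch of the same
idea (one plain store for the unaligned head, streaming stores for the
body, a plain store for the tail, then `sfence`). This is an
illustration, not the glibc code: the real implementation is the
assembly in the diff below and dispatches on a size threshold;
`nt_memset`, the SSE2 16-byte vector width, and the `n >= 16`
assumption are all illustrative.

    #include <emmintrin.h> /* SSE2: _mm_set1_epi8, _mm_stream_si128 */
    #include <stdint.h>
    #include <stddef.h>

    /* Illustrative non-temporal memset.  Assumes n >= 16; in practice
       this path would only be taken above a large size threshold.  */
    static void
    nt_memset (void *dst, int c, size_t n)
    {
      __m128i v = _mm_set1_epi8 ((char) c);
      char *p = dst;
      char *end = p + n;

      /* Head: one unaligned plain store, then round P up to the next
         16-byte boundary.  */
      _mm_storeu_si128 ((__m128i *) p, v);
      p = (char *) (((uintptr_t) p | 15) + 1);

      /* Body: streaming stores write around the cache, avoiding the
         read-for-ownership (RFO) of the destination lines.  */
      while (p + 16 <= end)
        {
          _mm_stream_si128 ((__m128i *) p, v);
          p += 16;
        }

      /* Tail: unaligned plain store over the last 16 bytes (may
         overlap the last streaming store).  */
      _mm_storeu_si128 ((__m128i *) (end - 16), v);

      /* Make the weakly-ordered streaming stores globally visible
         before returning.  */
      _mm_sfence ();
    }

The `((p | 15) + 1)` rounding is the same trick as the
`orq $(VEC_SIZE * 1 - 1), %rdi; incq %rdi` pair in the assembly below.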
The results from running the GLIBC memset benchmarks on TGL-client
for N=20 runs:

Geometric Mean across the suite New / Old EVEX256: 0.979
Geometric Mean across the suite New / Old EVEX512: 0.979
Geometric Mean across the suite New / Old AVX2   : 0.986
Geometric Mean across the suite New / Old SSE2   : 0.979

Most of the cases are essentially unchanged; this is mostly to show
that adding the non-temporal case didn't introduce regressions in the
other cases.

The results on the memset-large benchmark suite on TGL-client for
N=20 runs:

Geometric Mean across the suite New / Old EVEX256: 0.926
Geometric Mean across the suite New / Old EVEX512: 0.925
Geometric Mean across the suite New / Old AVX2   : 0.928
Geometric Mean across the suite New / Old SSE2   : 0.924

So roughly a 7.5% speedup. This is lower than what we see on servers
(likely because clients typically have faster single-core bandwidth,
so saving bandwidth on RFOs is less impactful), but still advantageous.

Full test-suite passes on x86_64 w/ and w/o multiarch.
Reviewed-by: H.J. Lu
---
 .../multiarch/memset-vec-unaligned-erms.S | 149 +++++++++++-------
 1 file changed, 91 insertions(+), 58 deletions(-)

diff --git a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
index 97839a2248..637caadb40 100644
--- a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
+++ b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
@@ -21,10 +21,13 @@
    2. If size is less than VEC, use integer register stores.
    3. If size is from VEC_SIZE to 2 * VEC_SIZE, use 2 VEC stores.
    4. If size is from 2 * VEC_SIZE to 4 * VEC_SIZE, use 4 VEC stores.
-   5. On machines ERMS feature, if size is greater or equal than
-      __x86_rep_stosb_threshold then REP STOSB will be used.
-   6. If size is more to 4 * VEC_SIZE, align to 4 * VEC_SIZE with
-      4 VEC stores and store 4 * VEC at a time until done.  */
+   5. If size is more to 4 * VEC_SIZE, align to 1 * VEC_SIZE with
+      4 VEC stores and store 4 * VEC at a time until done.
+   6. On machines ERMS feature, if size is range
+      [__x86_rep_stosb_threshold, __x86_shared_non_temporal_threshold)
+      then REP STOSB will be used.
+   7. If size >= __x86_shared_non_temporal_threshold, use a
+      non-temporal stores.  */

 #include <sysdep.h>

@@ -147,6 +150,41 @@ L(entry_from_wmemset):
 	VMOVU	%VMM(0), -VEC_SIZE(%rdi,%rdx)
 	VMOVU	%VMM(0), (%rdi)
 	VZEROUPPER_RETURN
+
+	/* If have AVX512 mask instructions put L(less_vec) close to
+	   entry as it doesn't take much space and is likely a hot target.  */
+#ifdef USE_LESS_VEC_MASK_STORE
+	/* Align to ensure the L(less_vec) logic all fits in 1x cache lines.  */
+	.p2align 6,, 47
+	.p2align 4
+L(less_vec):
+L(less_vec_from_wmemset):
+	/* Less than 1 VEC.  */
+# if VEC_SIZE != 16 && VEC_SIZE != 32 && VEC_SIZE != 64
+#  error Unsupported VEC_SIZE!
+# endif
+	/* Clear high bits from edi. Only keeping bits relevant to page
+	   cross check. Note that we are using rax which is set in
+	   MEMSET_VDUP_TO_VEC0_AND_SET_RETURN as ptr from here on out.  */
+	andl	$(PAGE_SIZE - 1), %edi
+	/* Check if VEC_SIZE store cross page. Mask stores suffer
+	   serious performance degradation when it has to fault suppress.  */
+	cmpl	$(PAGE_SIZE - VEC_SIZE), %edi
+	/* This is generally considered a cold target.  */
+	ja	L(cross_page)
+# if VEC_SIZE > 32
+	movq	$-1, %rcx
+	bzhiq	%rdx, %rcx, %rcx
+	kmovq	%rcx, %k1
+# else
+	movl	$-1, %ecx
+	bzhil	%edx, %ecx, %ecx
+	kmovd	%ecx, %k1
+# endif
+	vmovdqu8 %VMM(0), (%rax){%k1}
+	VZEROUPPER_RETURN
+#endif
+
 #if defined USE_MULTIARCH && IS_IN (libc)
 END (MEMSET_SYMBOL (__memset, unaligned))

@@ -185,54 +223,6 @@ L(last_2x_vec):
 #endif
 	VZEROUPPER_RETURN

-	/* If have AVX512 mask instructions put L(less_vec) close to
-	   entry as it doesn't take much space and is likely a hot target.
-	 */
-#ifdef USE_LESS_VEC_MASK_STORE
-	.p2align 4,, 10
-L(less_vec):
-L(less_vec_from_wmemset):
-	/* Less than 1 VEC.  */
-# if VEC_SIZE != 16 && VEC_SIZE != 32 && VEC_SIZE != 64
-#  error Unsupported VEC_SIZE!
-# endif
-	/* Clear high bits from edi. Only keeping bits relevant to page
-	   cross check. Note that we are using rax which is set in
-	   MEMSET_VDUP_TO_VEC0_AND_SET_RETURN as ptr from here on out.  */
-	andl	$(PAGE_SIZE - 1), %edi
-	/* Check if VEC_SIZE store cross page. Mask stores suffer
-	   serious performance degradation when it has to fault suppress.
-	 */
-	cmpl	$(PAGE_SIZE - VEC_SIZE), %edi
-	/* This is generally considered a cold target.  */
-	ja	L(cross_page)
-# if VEC_SIZE > 32
-	movq	$-1, %rcx
-	bzhiq	%rdx, %rcx, %rcx
-	kmovq	%rcx, %k1
-# else
-	movl	$-1, %ecx
-	bzhil	%edx, %ecx, %ecx
-	kmovd	%ecx, %k1
-# endif
-	vmovdqu8 %VMM(0), (%rax){%k1}
-	VZEROUPPER_RETURN
-
-# if defined USE_MULTIARCH && IS_IN (libc)
-	/* Include L(stosb_local) here if including L(less_vec) between
-	   L(stosb_more_2x_vec) and ENTRY. This is to cache align the
-	   L(stosb_more_2x_vec) target.  */
-	.p2align 4,, 10
-L(stosb_local):
-	movzbl	%sil, %eax
-	mov	%RDX_LP, %RCX_LP
-	mov	%RDI_LP, %RDX_LP
-	rep	stosb
-	mov	%RDX_LP, %RAX_LP
-	VZEROUPPER_RETURN
-# endif
-#endif
-
 #if defined USE_MULTIARCH && IS_IN (libc)
 	.p2align 4
 L(stosb_more_2x_vec):
@@ -318,21 +308,33 @@ L(return_vzeroupper):
 	ret
 #endif

-	.p2align 4,, 10
-#ifndef USE_LESS_VEC_MASK_STORE
-# if defined USE_MULTIARCH && IS_IN (libc)
+#ifdef USE_WITH_AVX2
+	.p2align 4
+#else
+	.p2align 4,, 4
+#endif
+
+#if defined USE_MULTIARCH && IS_IN (libc)
 	/* If no USE_LESS_VEC_MASK put L(stosb_local) here. Will be in
 	   range for 2-byte jump encoding.  */
 L(stosb_local):
+	cmp	__x86_shared_non_temporal_threshold(%rip), %RDX_LP
+	jae	L(nt_memset)
 	movzbl	%sil, %eax
 	mov	%RDX_LP, %RCX_LP
 	mov	%RDI_LP, %RDX_LP
 	rep	stosb
+# if (defined USE_WITH_SSE2) || (defined USE_WITH_AVX512)
+	/* Use xchg to save 1-byte (this helps align targets below).  */
+	xchg	%RDX_LP, %RAX_LP
+# else
 	mov	%RDX_LP, %RAX_LP
-	VZEROUPPER_RETURN
 # endif
+	VZEROUPPER_RETURN
+#endif
+
+#ifndef USE_LESS_VEC_MASK_STORE
 	/* Define L(less_vec) only if not otherwise defined.  */
-	.p2align 4
+	.p2align 4,, 12
 L(less_vec):
 	/* Broadcast esi to partial register (i.e VEC_SIZE == 32 broadcast to
 	   xmm). This is only does anything for AVX2.  */
@@ -423,4 +425,35 @@ L(between_2_3):
 	movb	%SET_REG8, -1(%LESS_VEC_REG, %rdx)
 #endif
 	ret
-END (MEMSET_SYMBOL (__memset, unaligned_erms))
+
+#if defined USE_MULTIARCH && IS_IN (libc)
+# ifdef USE_WITH_AVX512
+	/* Force align so the loop doesn't cross a cache-line.  */
+	.p2align 4
+# endif
+	.p2align 4,, 7
+	/* Memset using non-temporal stores.  */
+L(nt_memset):
+	VMOVU	%VMM(0), (VEC_SIZE * 0)(%rdi)
+	leaq	(VEC_SIZE * -4)(%rdi, %rdx), %rdx
+	/* Align DST.  */
+	orq	$(VEC_SIZE * 1 - 1), %rdi
+	incq	%rdi
+	.p2align 4,, 7
+L(nt_loop):
+	VMOVNT	%VMM(0), (VEC_SIZE * 0)(%rdi)
+	VMOVNT	%VMM(0), (VEC_SIZE * 1)(%rdi)
+	VMOVNT	%VMM(0), (VEC_SIZE * 2)(%rdi)
+	VMOVNT	%VMM(0), (VEC_SIZE * 3)(%rdi)
+	subq	$(VEC_SIZE * -4), %rdi
+	cmpq	%rdx, %rdi
+	jb	L(nt_loop)
+	sfence
+	VMOVU	%VMM(0), (VEC_SIZE * 0)(%rdx)
+	VMOVU	%VMM(0), (VEC_SIZE * 1)(%rdx)
+	VMOVU	%VMM(0), (VEC_SIZE * 2)(%rdx)
+	VMOVU	%VMM(0), (VEC_SIZE * 3)(%rdx)
+	VZEROUPPER_RETURN
+#endif
+
+END(MEMSET_SYMBOL(__memset, unaligned_erms))

From patchwork Fri May 24 17:38:51 2024
X-Patchwork-Submitter: Noah Goldstein
X-Patchwork-Id: 1939051
From: Noah Goldstein <goldstein.w.n@gmail.com>
To: libc-alpha@sourceware.org
Cc: goldstein.w.n@gmail.com, hjl.tools@gmail.com, carlos@systemhalted.org
Subject: [PATCH v2 2/2] x86: Add separate non-temporal tunable for memset
Date: Fri, 24 May 2024 12:38:51 -0500
Message-Id: <20240524173851.2483952-2-goldstein.w.n@gmail.com>
In-Reply-To: <20240524173851.2483952-1-goldstein.w.n@gmail.com>
References: <20240519004347.2759850-1-goldstein.w.n@gmail.com>
 <20240524173851.2483952-1-goldstein.w.n@gmail.com>

The tuning of non-temporal stores for memset vs. memcpy is not always
the same. This includes both the exact threshold value and whether
non-temporal stores are profitable at all for a given arch.

This patch adds `x86_memset_non_temporal_threshold`. Currently we
disable non-temporal stores for non-Intel vendors, as the only
benchmarks showing their benefit have been on Intel hardware.

Reviewed-by: H.J. Lu
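As a usage note (not part of the patch): like the existing glibc.cpu
tunables documented in manual/tunables.texi, the new threshold would be
set through the GLIBC_TUNABLES environment variable. A hypothetical
invocation, with an illustrative value and program name:

    # Force non-temporal memset stores only above 16 MiB for this run
    # (0x1000000 and ./app are placeholders).
    GLIBC_TUNABLES=glibc.cpu.x86_memset_non_temporal_threshold=0x1000000 ./app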
---
 manual/tunables.texi                             | 16 +++++++++++++++-
 sysdeps/x86/cacheinfo.h                          |  8 +++++++-
 sysdeps/x86/dl-cacheinfo.h                       | 16 ++++++++++++++++
 sysdeps/x86/dl-diagnostics-cpu.c                 |  2 ++
 sysdeps/x86/dl-tunables.list                     |  3 +++
 sysdeps/x86/include/cpu-features.h               |  4 +++-
 .../x86_64/multiarch/memset-vec-unaligned-erms.S |  6 +++---
 7 files changed, 49 insertions(+), 6 deletions(-)

diff --git a/manual/tunables.texi b/manual/tunables.texi
index baaf751721..8dd02d8149 100644
--- a/manual/tunables.texi
+++ b/manual/tunables.texi
@@ -52,6 +52,7 @@ glibc.elision.skip_lock_busy: 3 (min: 0, max: 2147483647)
 glibc.malloc.top_pad: 0x20000 (min: 0x0, max: 0xffffffffffffffff)
 glibc.cpu.x86_rep_stosb_threshold: 0x800 (min: 0x1, max: 0xffffffffffffffff)
 glibc.cpu.x86_non_temporal_threshold: 0xc0000 (min: 0x4040, max: 0xfffffffffffffff)
+glibc.cpu.x86_memset_non_temporal_threshold: 0xc0000 (min: 0x4040, max: 0xfffffffffffffff)
 glibc.cpu.x86_shstk:
 glibc.pthread.stack_cache_size: 0x2800000 (min: 0x0, max: 0xffffffffffffffff)
 glibc.cpu.hwcap_mask: 0x6 (min: 0x0, max: 0xffffffffffffffff)
@@ -495,7 +496,8 @@ thread stack originally backup by Huge Pages to default pages.
 @cindex shared_cache_size tunables
 @cindex tunables, shared_cache_size
 @cindex non_temporal_threshold tunables
-@cindex tunables, non_temporal_threshold
+@cindex memset_non_temporal_threshold tunables
+@cindex tunables, non_temporal_threshold, memset_non_temporal_threshold

 @deftp {Tunable namespace} glibc.cpu
 Behavior of @theglibc{} can be tuned to assume specific hardware capabilities
@@ -574,6 +576,18 @@ like memmove and memcpy.
 This tunable is specific to i386 and x86-64.
 @end deftp

+@deftp Tunable glibc.cpu.x86_memset_non_temporal_threshold
+The @code{glibc.cpu.x86_memset_non_temporal_threshold} tunable allows
+the user to set threshold in bytes for non temporal store in
+memset. Non temporal stores give a hint to the hardware to move data
+directly to memory without displacing other data from the cache. This
+tunable is used by some platforms to determine when to use non
+temporal stores memset.
+
+This tunable is specific to i386 and x86-64.
+@end deftp
+
+
 @deftp Tunable glibc.cpu.x86_rep_movsb_threshold
 The @code{glibc.cpu.x86_rep_movsb_threshold} tunable allows the user to
 set threshold in bytes to start using "rep movsb". The value must be
diff --git a/sysdeps/x86/cacheinfo.h b/sysdeps/x86/cacheinfo.h
index ab73556772..83491607c7 100644
--- a/sysdeps/x86/cacheinfo.h
+++ b/sysdeps/x86/cacheinfo.h
@@ -35,9 +35,12 @@ long int __x86_data_cache_size attribute_hidden = 32 * 1024;
 long int __x86_shared_cache_size_half attribute_hidden = 1024 * 1024 / 2;
 long int __x86_shared_cache_size attribute_hidden = 1024 * 1024;

-/* Threshold to use non temporal store.  */
+/* Threshold to use non temporal store in memmove.  */
 long int __x86_shared_non_temporal_threshold attribute_hidden;

+/* Threshold to use non temporal store in memset.  */
+long int __x86_memset_non_temporal_threshold attribute_hidden;
+
 /* Threshold to use Enhanced REP MOVSB.  */
 long int __x86_rep_movsb_threshold attribute_hidden = 2048;

@@ -77,6 +80,9 @@ init_cacheinfo (void)
   __x86_shared_non_temporal_threshold
     = cpu_features->non_temporal_threshold;

+  __x86_memset_non_temporal_threshold
+    = cpu_features->memset_non_temporal_threshold;
+
   __x86_rep_movsb_threshold = cpu_features->rep_movsb_threshold;
   __x86_rep_stosb_threshold = cpu_features->rep_stosb_threshold;
   __x86_rep_movsb_stop_threshold = cpu_features->rep_movsb_stop_threshold;
diff --git a/sysdeps/x86/dl-cacheinfo.h b/sysdeps/x86/dl-cacheinfo.h
index 5a98f70364..d375a7cba6 100644
--- a/sysdeps/x86/dl-cacheinfo.h
+++ b/sysdeps/x86/dl-cacheinfo.h
@@ -986,6 +986,13 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
   if (CPU_FEATURE_USABLE_P (cpu_features, FSRM))
     rep_movsb_threshold = 2112;

+  /* Non-temporal stores in memset have only been tested on Intel hardware.
+     Until we benchmark data on other x86 processor, disable non-temporal
+     stores in memset.  */
+  unsigned long int memset_non_temporal_threshold = SIZE_MAX;
+  if (cpu_features->basic.kind == arch_kind_intel)
+    memset_non_temporal_threshold = non_temporal_threshold;
+
   /* For AMD CPUs that support ERMS (Zen3+), REP MOVSB is in a lot of
      cases slower than the vectorized path (and for some alignments,
      it is really slow, check BZ #30994).  */
@@ -1012,6 +1019,11 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
       && tunable_size <= maximum_non_temporal_threshold)
     non_temporal_threshold = tunable_size;

+  tunable_size = TUNABLE_GET (x86_memset_non_temporal_threshold, long int, NULL);
+  if (tunable_size > minimum_non_temporal_threshold
+      && tunable_size <= maximum_non_temporal_threshold)
+    memset_non_temporal_threshold = tunable_size;
+
   tunable_size = TUNABLE_GET (x86_rep_movsb_threshold, long int, NULL);
   if (tunable_size > minimum_rep_movsb_threshold)
     rep_movsb_threshold = tunable_size;
@@ -1032,6 +1044,9 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
   TUNABLE_SET_WITH_BOUNDS (x86_non_temporal_threshold, non_temporal_threshold,
			   minimum_non_temporal_threshold,
			   maximum_non_temporal_threshold);
+  TUNABLE_SET_WITH_BOUNDS (
+      x86_memset_non_temporal_threshold, memset_non_temporal_threshold,
+      minimum_non_temporal_threshold, maximum_non_temporal_threshold);
   TUNABLE_SET_WITH_BOUNDS (x86_rep_movsb_threshold, rep_movsb_threshold,
			   minimum_rep_movsb_threshold, SIZE_MAX);
   TUNABLE_SET_WITH_BOUNDS (x86_rep_stosb_threshold, rep_stosb_threshold, 1,
@@ -1045,6 +1060,7 @@ dl_init_cacheinfo (struct cpu_features *cpu_features)
   cpu_features->data_cache_size = data;
   cpu_features->shared_cache_size = shared;
   cpu_features->non_temporal_threshold = non_temporal_threshold;
+  cpu_features->memset_non_temporal_threshold = memset_non_temporal_threshold;
   cpu_features->rep_movsb_threshold = rep_movsb_threshold;
   cpu_features->rep_stosb_threshold = rep_stosb_threshold;
   cpu_features->rep_movsb_stop_threshold = rep_movsb_stop_threshold;
diff --git a/sysdeps/x86/dl-diagnostics-cpu.c b/sysdeps/x86/dl-diagnostics-cpu.c
index ceafde9481..49eeb5f70a 100644
--- a/sysdeps/x86/dl-diagnostics-cpu.c
+++ b/sysdeps/x86/dl-diagnostics-cpu.c
@@ -94,6 +94,8 @@ _dl_diagnostics_cpu (void)
                             cpu_features->shared_cache_size);
   print_cpu_features_value ("non_temporal_threshold",
                             cpu_features->non_temporal_threshold);
+  print_cpu_features_value ("memset_non_temporal_threshold",
+                            cpu_features->memset_non_temporal_threshold);
   print_cpu_features_value ("rep_movsb_threshold",
                             cpu_features->rep_movsb_threshold);
   print_cpu_features_value ("rep_movsb_stop_threshold",
diff --git a/sysdeps/x86/dl-tunables.list b/sysdeps/x86/dl-tunables.list
index 7d82da0dec..a0a1299592 100644
--- a/sysdeps/x86/dl-tunables.list
+++ b/sysdeps/x86/dl-tunables.list
@@ -30,6 +30,9 @@ glibc {
     x86_non_temporal_threshold {
       type: SIZE_T
     }
+    x86_memset_non_temporal_threshold {
+      type: SIZE_T
+    }
     x86_rep_movsb_threshold {
       type: SIZE_T
       # Since there is overhead to set up REP MOVSB operation, REP
diff --git a/sysdeps/x86/include/cpu-features.h b/sysdeps/x86/include/cpu-features.h
index cd7bd27cf3..aaae44f0e1 100644
--- a/sysdeps/x86/include/cpu-features.h
+++ b/sysdeps/x86/include/cpu-features.h
@@ -944,8 +944,10 @@ struct cpu_features
   /* Shared cache size for use in memory and string routines, typically
      L2 or L3 size.  */
   unsigned long int shared_cache_size;
-  /* Threshold to use non temporal store.  */
+  /* Threshold to use non temporal store in memmove.  */
   unsigned long int non_temporal_threshold;
+  /* Threshold to use non temporal store in memset.  */
+  unsigned long int memset_non_temporal_threshold;
   /* Threshold to use "rep movsb".  */
   unsigned long int rep_movsb_threshold;
   /* Threshold to stop using "rep movsb".  */
diff --git a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
index 637caadb40..88bf08e4f4 100644
--- a/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
+++ b/sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S
@@ -24,9 +24,9 @@
    5. If size is more to 4 * VEC_SIZE, align to 1 * VEC_SIZE with
       4 VEC stores and store 4 * VEC at a time until done.
    6. On machines ERMS feature, if size is range
-      [__x86_rep_stosb_threshold, __x86_shared_non_temporal_threshold)
+      [__x86_rep_stosb_threshold, __x86_memset_non_temporal_threshold)
       then REP STOSB will be used.
-   7. If size >= __x86_shared_non_temporal_threshold, use a
+   7. If size >= __x86_memset_non_temporal_threshold, use a
       non-temporal stores.  */

 #include <sysdep.h>
@@ -318,7 +318,7 @@ L(return_vzeroupper):
 	/* If no USE_LESS_VEC_MASK put L(stosb_local) here. Will be in
 	   range for 2-byte jump encoding.  */
 L(stosb_local):
-	cmp	__x86_shared_non_temporal_threshold(%rip), %RDX_LP
+	cmp	__x86_memset_non_temporal_threshold(%rip), %RDX_LP
 	jae	L(nt_memset)
 	movzbl	%sil, %eax
 	mov	%RDX_LP, %RCX_LP