From patchwork Tue Oct 18 02:48:55 2022
X-Patchwork-Submitter: Noah Goldstein
X-Patchwork-Id: 1691323
Received: from noah-tgl.lan (2603-8080-1301-76c6-02dd-0570-1640-b39b.res6.spectrum.com.
[2603:8080:1301:76c6:2dd:570:1640:b39b]) by smtp.gmail.com with ESMTPSA id r10-20020a4a964a000000b00435a59fba01sm4957260ooi.47.2022.10.17.19.49.03 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 17 Oct 2022 19:49:04 -0700 (PDT) To: libc-alpha@sourceware.org Subject: [PATCH v1 1/7] x86: Optimize memchr-evex.S and implement with VMM headers Date: Mon, 17 Oct 2022 19:48:55 -0700 Message-Id: <20221018024901.3381469-1-goldstein.w.n@gmail.com> X-Mailer: git-send-email 2.34.1 MIME-Version: 1.0 X-Spam-Status: No, score=-12.1 required=5.0 tests=BAYES_00, DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, FREEMAIL_FROM, GIT_PATCH_0, KAM_SHORT, RCVD_IN_DNSWL_NONE, SPF_HELO_NONE, SPF_PASS, TXREP autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org X-BeenThere: libc-alpha@sourceware.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Libc-alpha mailing list List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-Patchwork-Original-From: Noah Goldstein via Libc-alpha From: Noah Goldstein Reply-To: Noah Goldstein Errors-To: libc-alpha-bounces+incoming=patchwork.ozlabs.org@sourceware.org Sender: "Libc-alpha" Optimizations are: 1. Use the fact that tzcnt(0) -> VEC_SIZE for memchr to save a branch in short string case. 2. Restructure code so that small strings are given the hot path. - This is a net-zero on the benchmark suite but in general makes sense as smaller sizes are far more common. 3. Use more code-size efficient instructions. - tzcnt ... -> bsf ... - vpcmpb $0 ... -> vpcmpeq ... 4. Align labels less aggressively, especially if it doesn't save fetch blocks / causes the basic-block to span extra cache-lines. The optimizations (especially for point 2) make the memchr and rawmemchr code essentially incompatible so split rawmemchr-evex to a new file. Code Size Changes: memchr-evex.S : -107 bytes rawmemchr-evex.S : -53 bytes Net perf changes: Reported as geometric mean of all improvements / regressions from N=10 runs of the benchtests. Value as New Time / Old Time so < 1.0 is improvement and 1.0 is regression. memchr-evex.S : 0.928 rawmemchr-evex.S : 0.986 (Less targets cross cache lines) Full results attached in email. Full check passes on x86-64. --- sysdeps/x86_64/multiarch/memchr-evex.S | 939 ++++++++++-------- sysdeps/x86_64/multiarch/rawmemchr-evex-rtm.S | 9 +- sysdeps/x86_64/multiarch/rawmemchr-evex.S | 313 +++++- 3 files changed, 851 insertions(+), 410 deletions(-) diff --git a/sysdeps/x86_64/multiarch/memchr-evex.S b/sysdeps/x86_64/multiarch/memchr-evex.S index 0dd4f1dcce..23a1c0018e 100644 --- a/sysdeps/x86_64/multiarch/memchr-evex.S +++ b/sysdeps/x86_64/multiarch/memchr-evex.S @@ -21,17 +21,27 @@ #if ISA_SHOULD_BUILD (4) +# ifndef VEC_SIZE +# include "x86-evex256-vecs.h" +# endif + # ifndef MEMCHR # define MEMCHR __memchr_evex # endif # ifdef USE_AS_WMEMCHR +# define PC_SHIFT_GPR rcx +# define VPTESTN vptestnmd # define VPBROADCAST vpbroadcastd # define VPMINU vpminud # define VPCMP vpcmpd # define VPCMPEQ vpcmpeqd # define CHAR_SIZE 4 + +# define USE_WIDE_CHAR # else +# define PC_SHIFT_GPR rdi +# define VPTESTN vptestnmb # define VPBROADCAST vpbroadcastb # define VPMINU vpminub # define VPCMP vpcmpb @@ -39,534 +49,661 @@ # define CHAR_SIZE 1 # endif - /* In the 4x loop the RTM and non-RTM versions have data pointer - off by VEC_SIZE * 4 with RTM version being VEC_SIZE * 4 greater. - This is represented by BASE_OFFSET. 
As well because the RTM - version uses vpcmp which stores a bit per element compared where - the non-RTM version uses vpcmpeq which stores a bit per byte - compared RET_SCALE of CHAR_SIZE is only relevant for the RTM - version. */ -# ifdef USE_IN_RTM +# include "reg-macros.h" + + +/* If not in an RTM and VEC_SIZE != 64 (the VEC_SIZE = 64 + doesn't have VEX encoding), use VEX encoding in loop so we + can use vpcmpeqb + vptern which is more efficient than the + EVEX alternative. */ +# if defined USE_IN_RTM || VEC_SIZE == 64 +# undef COND_VZEROUPPER +# undef VZEROUPPER_RETURN +# undef VZEROUPPER + +# define COND_VZEROUPPER +# define VZEROUPPER_RETURN ret # define VZEROUPPER -# define BASE_OFFSET (VEC_SIZE * 4) -# define RET_SCALE CHAR_SIZE + +# define USE_TERN_IN_LOOP 0 # else +# define USE_TERN_IN_LOOP 1 +# undef VZEROUPPER # define VZEROUPPER vzeroupper -# define BASE_OFFSET 0 -# define RET_SCALE 1 # endif - /* In the return from 4x loop memchr and rawmemchr versions have - data pointers off by VEC_SIZE * 4 with memchr version being - VEC_SIZE * 4 greater. */ -# ifdef USE_AS_RAWMEMCHR -# define RET_OFFSET (BASE_OFFSET - (VEC_SIZE * 4)) -# define RAW_PTR_REG rcx -# define ALGN_PTR_REG rdi +# if USE_TERN_IN_LOOP + /* Resulting bitmask for vpmovmskb has 4-bits set for each wchar + so we don't want to multiply resulting index. */ +# define TERN_CHAR_MULT 1 + +# ifdef USE_AS_WMEMCHR +# define TEST_END() inc %VRCX +# else +# define TEST_END() add %rdx, %rcx +# endif # else -# define RET_OFFSET BASE_OFFSET -# define RAW_PTR_REG rdi -# define ALGN_PTR_REG rcx +# define TERN_CHAR_MULT CHAR_SIZE +# define TEST_END() KORTEST %k2, %k3 # endif -# define XMMZERO xmm23 -# define YMMZERO ymm23 -# define XMMMATCH xmm16 -# define YMMMATCH ymm16 -# define YMM1 ymm17 -# define YMM2 ymm18 -# define YMM3 ymm19 -# define YMM4 ymm20 -# define YMM5 ymm21 -# define YMM6 ymm22 +# if defined USE_AS_WMEMCHR || !USE_TERN_IN_LOOP +# ifndef USE_AS_WMEMCHR +# define GPR_X0_IS_RET 1 +# else +# define GPR_X0_IS_RET 0 +# endif +# define GPR_X0 rax +# else +# define GPR_X0_IS_RET 0 +# define GPR_X0 rdx +# endif + +# define CHAR_PER_VEC (VEC_SIZE / CHAR_SIZE) -# ifndef SECTION -# define SECTION(p) p##.evex +# if CHAR_PER_VEC == 64 +# define LAST_VEC_OFFSET (VEC_SIZE * 3) +# else +# define LAST_VEC_OFFSET (VEC_SIZE * 2) +# endif +# if CHAR_PER_VEC >= 32 +# define MASK_GPR(...) VGPR(__VA_ARGS__) +# elif CHAR_PER_VEC == 16 +# define MASK_GPR(reg) VGPR_SZ(reg, 16) +# else +# define MASK_GPR(reg) VGPR_SZ(reg, 8) # endif -# define VEC_SIZE 32 -# define CHAR_PER_VEC (VEC_SIZE / CHAR_SIZE) -# define PAGE_SIZE 4096 +# define VMATCH VMM(0) +# define VMATCH_LO VMM_lo(0) - .section SECTION(.text),"ax",@progbits +# define PAGE_SIZE 4096 + + + .section SECTION(.text), "ax", @progbits ENTRY_P2ALIGN (MEMCHR, 6) -# ifndef USE_AS_RAWMEMCHR /* Check for zero length. */ test %RDX_LP, %RDX_LP - jz L(zero) + jz L(zero_0) -# ifdef __ILP32__ +# ifdef __ILP32__ /* Clear the upper 32 bits. */ movl %edx, %edx -# endif # endif - /* Broadcast CHAR to YMMMATCH. */ - VPBROADCAST %esi, %YMMMATCH + VPBROADCAST %esi, %VMATCH /* Check if we may cross page boundary with one vector load. */ movl %edi, %eax andl $(PAGE_SIZE - 1), %eax cmpl $(PAGE_SIZE - VEC_SIZE), %eax - ja L(cross_page_boundary) + ja L(page_cross) + + VPCMPEQ (%rdi), %VMATCH, %k0 + KMOV %k0, %VRAX +# ifndef USE_AS_WMEMCHR + /* If rcx is zero then tzcnt -> CHAR_PER_VEC. NB: there is a + already a dependency between rcx and rsi so no worries about + false-dep here. 
*/ + tzcnt %VRAX, %VRSI + /* If rdx <= rsi then either 1) rcx was non-zero (there was a + match) but it was out of bounds or 2) rcx was zero and rdx + was <= VEC_SIZE so we are done scanning. */ + cmpq %rsi, %rdx + /* NB: Use branch to return zero/non-zero. Common usage will + branch on result of function (if return is null/non-null). + This branch can be used to predict the ensuing one so there + is no reason to extend the data-dependency with cmovcc. */ + jbe L(zero_0) + + /* If rcx is zero then len must be > RDX, otherwise since we + already tested len vs lzcnt(rcx) (in rsi) we are good to + return this match. */ + test %VRAX, %VRAX + jz L(more_1x_vec) + leaq (%rdi, %rsi), %rax +# else - /* Check the first VEC_SIZE bytes. */ - VPCMP $0, (%rdi), %YMMMATCH, %k0 - kmovd %k0, %eax -# ifndef USE_AS_RAWMEMCHR - /* If length < CHAR_PER_VEC handle special. */ + /* We can't use the `tzcnt` trick for wmemchr because CHAR_SIZE + > 1 so if rcx is tzcnt != CHAR_PER_VEC. */ cmpq $CHAR_PER_VEC, %rdx - jbe L(first_vec_x0) -# endif - testl %eax, %eax - jz L(aligned_more) - tzcntl %eax, %eax -# ifdef USE_AS_WMEMCHR - /* NB: Multiply bytes by CHAR_SIZE to get the wchar_t count. */ + ja L(more_1x_vec) + tzcnt %VRAX, %VRAX + cmpl %eax, %edx + jbe L(zero_0) +L(first_vec_x0_ret): leaq (%rdi, %rax, CHAR_SIZE), %rax -# else - addq %rdi, %rax # endif ret -# ifndef USE_AS_RAWMEMCHR -L(zero): - xorl %eax, %eax - ret - - .p2align 4 -L(first_vec_x0): - /* Check if first match was before length. NB: tzcnt has false data- - dependency on destination. eax already had a data-dependency on esi - so this should have no affect here. */ - tzcntl %eax, %esi -# ifdef USE_AS_WMEMCHR - leaq (%rdi, %rsi, CHAR_SIZE), %rdi -# else - addq %rsi, %rdi -# endif + /* Only fits in first cache line for VEC_SIZE == 32. */ +# if VEC_SIZE == 32 + .p2align 4,, 2 +L(zero_0): xorl %eax, %eax - cmpl %esi, %edx - cmovg %rdi, %rax ret # endif - .p2align 4 -L(cross_page_boundary): - /* Save pointer before aligning as its original value is - necessary for computer return address if byte is found or - adjusting length if it is not and this is memchr. */ - movq %rdi, %rcx - /* Align data to VEC_SIZE. ALGN_PTR_REG is rcx for memchr and rdi - for rawmemchr. */ - andq $-VEC_SIZE, %ALGN_PTR_REG - VPCMP $0, (%ALGN_PTR_REG), %YMMMATCH, %k0 - kmovd %k0, %r8d + .p2align 4,, 9 +L(more_1x_vec): # ifdef USE_AS_WMEMCHR - /* NB: Divide shift count by 4 since each bit in K0 represent 4 - bytes. */ - sarl $2, %eax -# endif -# ifndef USE_AS_RAWMEMCHR - movl $(PAGE_SIZE / CHAR_SIZE), %esi - subl %eax, %esi + /* If wmemchr still need to test if there was a match in first + VEC. Use bsf to test here so we can reuse + L(first_vec_x0_ret). */ + bsf %VRAX, %VRAX + jnz L(first_vec_x0_ret) # endif + +L(page_cross_continue): # ifdef USE_AS_WMEMCHR - andl $(CHAR_PER_VEC - 1), %eax -# endif - /* Remove the leading bytes. */ - sarxl %eax, %r8d, %eax -# ifndef USE_AS_RAWMEMCHR - /* Check the end of data. */ - cmpq %rsi, %rdx - jbe L(first_vec_x0) + /* We can't use end of the buffer to re-calculate length for + wmemchr as len * CHAR_SIZE may overflow. */ + leaq -(VEC_SIZE + CHAR_SIZE)(%rdi), %rax + andq $(VEC_SIZE * -1), %rdi + subq %rdi, %rax + sarq $2, %rax + addq %rdx, %rax +# else + leaq -(VEC_SIZE + 1)(%rdx, %rdi), %rax + andq $(VEC_SIZE * -1), %rdi + subq %rdi, %rax # endif - testl %eax, %eax - jz L(cross_page_continue) - tzcntl %eax, %eax + + /* rax contains remaining length - 1. -1 so we can get imm8 + encoding in a few additional places saving code size. 
*/ + + /* Needed regardless of remaining length. */ + VPCMPEQ VEC_SIZE(%rdi), %VMATCH, %k0 + KMOV %k0, %VRDX + + /* We cannot fold the above `sub %rdi, %rax` with the `cmp + $(CHAR_PER_VEC * 2), %rax` because its possible for a very + large length to overflow and cause the subtract to carry + despite length being above CHAR_PER_VEC * 2. */ + cmpq $(CHAR_PER_VEC * 2 - 1), %rax + ja L(more_2x_vec) +L(last_2x_vec): + + test %VRDX, %VRDX + jnz L(first_vec_x1_check) + + /* Check the end of data. NB: use 8-bit operations to save code + size. We no longer need the full-width of eax and will + perform a write-only operation over eax so there will be no + partial-register stalls. */ + subb $(CHAR_PER_VEC * 1 - 1), %al + jle L(zero_0) + + VPCMPEQ (VEC_SIZE * 2)(%rdi), %VMATCH, %k0 + KMOV %k0, %VRCX # ifdef USE_AS_WMEMCHR - /* NB: Multiply bytes by CHAR_SIZE to get the wchar_t count. */ - leaq (%RAW_PTR_REG, %rax, CHAR_SIZE), %rax + /* For wmemchr against we can't take advantage of tzcnt(0) == + VEC_SIZE as CHAR_PER_VEC != VEC_SIZE. */ + test %VRCX, %VRCX + jz L(zero_0) +# endif + tzcnt %VRCX, %VRCX + cmp %cl, %al + + /* Same CFG for VEC_SIZE == 64 and VEC_SIZE == 32. We give + fallthrough to L(zero_0) for VEC_SIZE == 64 here as there is + not enough space before the next cache line to fit the `lea` + for return. */ +# if VEC_SIZE == 64 + ja L(first_vec_x2_ret) +L(zero_0): + xorl %eax, %eax + ret # else - addq %RAW_PTR_REG, %rax + jbe L(zero_0) + leaq (VEC_SIZE * 2)(%rdi, %rcx, CHAR_SIZE), %rax + ret # endif + + .p2align 4,, 5 +L(first_vec_x1_check): + bsf %VRDX, %VRDX + cmpb %dl, %al + jb L(zero_4) + leaq (VEC_SIZE * 1)(%rdi, %rdx, CHAR_SIZE), %rax ret - .p2align 4 -L(first_vec_x1): - tzcntl %eax, %eax - leaq VEC_SIZE(%rdi, %rax, CHAR_SIZE), %rax + /* Fits at the end of the cache line here for VEC_SIZE == 32. + */ +# if VEC_SIZE == 32 +L(zero_4): + xorl %eax, %eax ret +# endif - .p2align 4 + + .p2align 4,, 4 L(first_vec_x2): - tzcntl %eax, %eax - leaq (VEC_SIZE * 2)(%rdi, %rax, CHAR_SIZE), %rax + bsf %VRCX, %VRCX +L(first_vec_x2_ret): + leaq (VEC_SIZE * 2)(%rdi, %rcx, CHAR_SIZE), %rax ret - .p2align 4 -L(first_vec_x3): - tzcntl %eax, %eax - leaq (VEC_SIZE * 3)(%rdi, %rax, CHAR_SIZE), %rax + /* Fits at the end of the cache line here for VEC_SIZE == 64. + */ +# if VEC_SIZE == 64 +L(zero_4): + xorl %eax, %eax ret +# endif - .p2align 4 -L(first_vec_x4): - tzcntl %eax, %eax - leaq (VEC_SIZE * 4)(%rdi, %rax, CHAR_SIZE), %rax + .p2align 4,, 4 +L(first_vec_x1): + bsf %VRDX, %VRDX + leaq (VEC_SIZE * 1)(%rdi, %rdx, CHAR_SIZE), %rax ret - .p2align 5 -L(aligned_more): - /* Check the first 4 * VEC_SIZE. Only one VEC_SIZE at a time - since data is only aligned to VEC_SIZE. */ -# ifndef USE_AS_RAWMEMCHR - /* Align data to VEC_SIZE. */ -L(cross_page_continue): - xorl %ecx, %ecx - subl %edi, %ecx - andq $-VEC_SIZE, %rdi - /* esi is for adjusting length to see if near the end. */ - leal (VEC_SIZE * 5)(%rdi, %rcx), %esi -# ifdef USE_AS_WMEMCHR - /* NB: Divide bytes by 4 to get the wchar_t count. */ - sarl $2, %esi -# endif -# else - andq $-VEC_SIZE, %rdi -L(cross_page_continue): -# endif - /* Load first VEC regardless. */ - VPCMP $0, (VEC_SIZE)(%rdi), %YMMMATCH, %k0 - kmovd %k0, %eax -# ifndef USE_AS_RAWMEMCHR - /* Adjust length. If near end handle specially. */ - subq %rsi, %rdx - jbe L(last_4x_vec_or_less) -# endif - testl %eax, %eax + .p2align 4,, 5 +L(more_2x_vec): + /* Length > VEC_SIZE * 2 so check first 2x VEC before rechecking + length. */ + + + /* Already computed matches for first VEC in rdx. 
*/ + test %VRDX, %VRDX jnz L(first_vec_x1) - VPCMP $0, (VEC_SIZE * 2)(%rdi), %YMMMATCH, %k0 - kmovd %k0, %eax - testl %eax, %eax + + VPCMPEQ (VEC_SIZE * 2)(%rdi), %VMATCH, %k0 + KMOV %k0, %VRCX + test %VRCX, %VRCX jnz L(first_vec_x2) - VPCMP $0, (VEC_SIZE * 3)(%rdi), %YMMMATCH, %k0 - kmovd %k0, %eax - testl %eax, %eax + /* Needed regardless of next length check. */ + VPCMPEQ (VEC_SIZE * 3)(%rdi), %VMATCH, %k0 + KMOV %k0, %VRCX + + /* Check if we are near the end. */ + cmpq $(CHAR_PER_VEC * 4 - 1), %rax + ja L(more_4x_vec) + + test %VRCX, %VRCX + jnz L(first_vec_x3_check) + + /* Use 8-bit instructions to save code size. We won't use full- + width eax again and will perform a write-only operation to + eax so no worries about partial-register stalls. */ + subb $(CHAR_PER_VEC * 3), %al + jb L(zero_2) +L(last_vec_check): + VPCMPEQ (VEC_SIZE * 4)(%rdi), %VMATCH, %k0 + KMOV %k0, %VRCX +# ifdef USE_AS_WMEMCHR + /* For wmemchr against we can't take advantage of tzcnt(0) == + VEC_SIZE as CHAR_PER_VEC != VEC_SIZE. */ + test %VRCX, %VRCX + jz L(zero_2) +# endif + tzcnt %VRCX, %VRCX + cmp %cl, %al + jae L(first_vec_x4_ret) +L(zero_2): + xorl %eax, %eax + ret + + /* Fits at the end of the cache line here for VEC_SIZE == 64. + For VEC_SIZE == 32 we put the return label at the end of + L(first_vec_x4). */ +# if VEC_SIZE == 64 +L(first_vec_x4_ret): + leaq (VEC_SIZE * 4)(%rdi, %rcx, CHAR_SIZE), %rax + ret +# endif + + .p2align 4,, 6 +L(first_vec_x4): + bsf %VRCX, %VRCX +# if VEC_SIZE == 32 + /* Place L(first_vec_x4_ret) here as we can't fit it in the same + cache line as where it is called from so we might as well + save code size by reusing return of L(first_vec_x4). */ +L(first_vec_x4_ret): +# endif + leaq (VEC_SIZE * 4)(%rdi, %rcx, CHAR_SIZE), %rax + ret + + .p2align 4,, 6 +L(first_vec_x3_check): + /* Need to adjust remaining length before checking. */ + addb $-(CHAR_PER_VEC * 2), %al + bsf %VRCX, %VRCX + cmpb %cl, %al + jb L(zero_2) + leaq (VEC_SIZE * 3)(%rdi, %rcx, CHAR_SIZE), %rax + ret + + .p2align 4,, 6 +L(first_vec_x3): + bsf %VRCX, %VRCX + leaq (VEC_SIZE * 3)(%rdi, %rcx, CHAR_SIZE), %rax + ret + + .p2align 4,, 3 +# if !USE_TERN_IN_LOOP + .p2align 4,, 10 +# endif +L(more_4x_vec): + test %VRCX, %VRCX jnz L(first_vec_x3) - VPCMP $0, (VEC_SIZE * 4)(%rdi), %YMMMATCH, %k0 - kmovd %k0, %eax - testl %eax, %eax + VPCMPEQ (VEC_SIZE * 4)(%rdi), %VMATCH, %k0 + KMOV %k0, %VRCX + test %VRCX, %VRCX jnz L(first_vec_x4) + subq $-(VEC_SIZE * 5), %rdi + subq $(CHAR_PER_VEC * 8), %rax + jb L(last_4x_vec) -# ifndef USE_AS_RAWMEMCHR - /* Check if at last CHAR_PER_VEC * 4 length. */ - subq $(CHAR_PER_VEC * 4), %rdx - jbe L(last_4x_vec_or_less_cmpeq) - /* +VEC_SIZE if USE_IN_RTM otherwise +VEC_SIZE * 5. */ - addq $(VEC_SIZE + (VEC_SIZE * 4 - BASE_OFFSET)), %rdi - - /* Align data to VEC_SIZE * 4 for the loop and readjust length. - */ -# ifdef USE_AS_WMEMCHR +# ifdef USE_AS_WMEMCHR movl %edi, %ecx - andq $-(4 * VEC_SIZE), %rdi +# else + addq %rdi, %rax +# endif + + +# if VEC_SIZE == 64 + /* use xorb to do `andq $-(VEC_SIZE * 4), %rdi`. No evex + processor has partial register stalls (all have merging + uop). If that changes this can be removed. */ + xorb %dil, %dil +# else + andq $-(VEC_SIZE * 4), %rdi +# endif + +# ifdef USE_AS_WMEMCHR subl %edi, %ecx - /* NB: Divide bytes by 4 to get the wchar_t count. 
*/ sarl $2, %ecx - addq %rcx, %rdx -# else - addq %rdi, %rdx - andq $-(4 * VEC_SIZE), %rdi - subq %rdi, %rdx -# endif + addq %rcx, %rax # else - addq $(VEC_SIZE + (VEC_SIZE * 4 - BASE_OFFSET)), %rdi - andq $-(4 * VEC_SIZE), %rdi + subq %rdi, %rax # endif -# ifdef USE_IN_RTM - vpxorq %XMMZERO, %XMMZERO, %XMMZERO -# else - /* copy ymmmatch to ymm0 so we can use vpcmpeq which is not - encodable with EVEX registers (ymm16-ymm31). */ - vmovdqa64 %YMMMATCH, %ymm0 + + + +# if USE_TERN_IN_LOOP + /* copy VMATCH to low ymm so we can use vpcmpeq which is not + encodable with EVEX registers. NB: this is VEC_SIZE == 32 + only as there is no way to encode vpcmpeq with zmm0-15. */ + vmovdqa64 %VMATCH, %VMATCH_LO # endif - /* Compare 4 * VEC at a time forward. */ - .p2align 4 + .p2align 4,, 11 L(loop_4x_vec): - /* Two versions of the loop. One that does not require - vzeroupper by not using ymm0-ymm15 and another does that require - vzeroupper because it uses ymm0-ymm15. The reason why ymm0-ymm15 - is used at all is because there is no EVEX encoding vpcmpeq and - with vpcmpeq this loop can be performed more efficiently. The - non-vzeroupper version is safe for RTM while the vzeroupper - version should be prefered if RTM are not supported. */ -# ifdef USE_IN_RTM - /* It would be possible to save some instructions using 4x VPCMP - but bottleneck on port 5 makes it not woth it. */ - VPCMP $4, (VEC_SIZE * 4)(%rdi), %YMMMATCH, %k1 - /* xor will set bytes match esi to zero. */ - vpxorq (VEC_SIZE * 5)(%rdi), %YMMMATCH, %YMM2 - vpxorq (VEC_SIZE * 6)(%rdi), %YMMMATCH, %YMM3 - VPCMP $0, (VEC_SIZE * 7)(%rdi), %YMMMATCH, %k3 - /* Reduce VEC2 / VEC3 with min and VEC1 with zero mask. */ - VPMINU %YMM2, %YMM3, %YMM3{%k1}{z} - VPCMP $0, %YMM3, %YMMZERO, %k2 -# else + /* Two versions of the loop. One that does not require + vzeroupper by not using ymmm0-15 and another does that + require vzeroupper because it uses ymmm0-15. The reason why + ymm0-15 is used at all is because there is no EVEX encoding + vpcmpeq and with vpcmpeq this loop can be performed more + efficiently. The non-vzeroupper version is safe for RTM + while the vzeroupper version should be prefered if RTM are + not supported. Which loop version we use is determined by + USE_TERN_IN_LOOP. */ + +# if USE_TERN_IN_LOOP /* Since vptern can only take 3x vectors fastest to do 1 vec seperately with EVEX vpcmp. */ # ifdef USE_AS_WMEMCHR /* vptern can only accept masks for epi32/epi64 so can only save - instruction using not equals mask on vptern with wmemchr. */ - VPCMP $4, (%rdi), %YMMMATCH, %k1 + instruction using not equals mask on vptern with wmemchr. + */ + VPCMP $4, (VEC_SIZE * 0)(%rdi), %VMATCH, %k1 # else - VPCMP $0, (%rdi), %YMMMATCH, %k1 + VPCMPEQ (VEC_SIZE * 0)(%rdi), %VMATCH, %k1 # endif /* Compare 3x with vpcmpeq and or them all together with vptern. */ - VPCMPEQ VEC_SIZE(%rdi), %ymm0, %ymm2 - VPCMPEQ (VEC_SIZE * 2)(%rdi), %ymm0, %ymm3 - VPCMPEQ (VEC_SIZE * 3)(%rdi), %ymm0, %ymm4 + VPCMPEQ (VEC_SIZE * 1)(%rdi), %VMATCH_LO, %VMM_lo(2) + VPCMPEQ (VEC_SIZE * 2)(%rdi), %VMATCH_LO, %VMM_lo(3) + VPCMPEQ (VEC_SIZE * 3)(%rdi), %VMATCH_LO, %VMM_lo(4) # ifdef USE_AS_WMEMCHR - /* This takes the not of or between ymm2, ymm3, ymm4 as well as - combines result from VEC0 with zero mask. */ - vpternlogd $1, %ymm2, %ymm3, %ymm4{%k1}{z} - vpmovmskb %ymm4, %ecx + /* This takes the not of or between VEC_lo(2), VEC_lo(3), + VEC_lo(4) as well as combines result from VEC(0) with zero + mask. 
*/ + vpternlogd $1, %VMM_lo(2), %VMM_lo(3), %VMM_lo(4){%k1}{z} + vpmovmskb %VMM_lo(4), %VRCX # else - /* 254 is mask for oring ymm2, ymm3, ymm4 into ymm4. */ - vpternlogd $254, %ymm2, %ymm3, %ymm4 - vpmovmskb %ymm4, %ecx - kmovd %k1, %eax + /* 254 is mask for oring VEC_lo(2), VEC_lo(3), VEC_lo(4) into + VEC_lo(4). */ + vpternlogd $254, %VMM_lo(2), %VMM_lo(3), %VMM_lo(4) + vpmovmskb %VMM_lo(4), %VRCX + KMOV %k1, %edx # endif -# endif -# ifdef USE_AS_RAWMEMCHR - subq $-(VEC_SIZE * 4), %rdi -# endif -# ifdef USE_IN_RTM - kortestd %k2, %k3 # else -# ifdef USE_AS_WMEMCHR - /* ecx contains not of matches. All 1s means no matches. incl will - overflow and set zeroflag if that is the case. */ - incl %ecx -# else - /* If either VEC1 (eax) or VEC2-VEC4 (ecx) are not zero. Adding - to ecx is not an issue because if eax is non-zero it will be - used for returning the match. If it is zero the add does - nothing. */ - addq %rax, %rcx -# endif + /* Loop version that uses EVEX encoding. */ + VPCMP $4, (VEC_SIZE * 0)(%rdi), %VMATCH, %k1 + vpxorq (VEC_SIZE * 1)(%rdi), %VMATCH, %VMM(2) + vpxorq (VEC_SIZE * 2)(%rdi), %VMATCH, %VMM(3) + VPCMPEQ (VEC_SIZE * 3)(%rdi), %VMATCH, %k3 + VPMINU %VMM(2), %VMM(3), %VMM(3){%k1}{z} + VPTESTN %VMM(3), %VMM(3), %k2 # endif -# ifdef USE_AS_RAWMEMCHR - jz L(loop_4x_vec) -# else - jnz L(loop_4x_vec_end) + + + TEST_END () + jnz L(loop_vec_ret) subq $-(VEC_SIZE * 4), %rdi - subq $(CHAR_PER_VEC * 4), %rdx - ja L(loop_4x_vec) + subq $(CHAR_PER_VEC * 4), %rax + jae L(loop_4x_vec) - /* Fall through into less than 4 remaining vectors of length case. + /* COND_VZEROUPPER is vzeroupper if we use the VEX encoded loop. */ - VPCMP $0, BASE_OFFSET(%rdi), %YMMMATCH, %k0 - addq $(BASE_OFFSET - VEC_SIZE), %rdi - kmovd %k0, %eax - VZEROUPPER - -L(last_4x_vec_or_less): - /* Check if first VEC contained match. */ - testl %eax, %eax - jnz L(first_vec_x1_check) + COND_VZEROUPPER - /* If remaining length > CHAR_PER_VEC * 2. */ - addl $(CHAR_PER_VEC * 2), %edx - jg L(last_4x_vec) - -L(last_2x_vec): - /* If remaining length < CHAR_PER_VEC. */ - addl $CHAR_PER_VEC, %edx - jle L(zero_end) - - /* Check VEC2 and compare any match with remaining length. */ - VPCMP $0, (VEC_SIZE * 2)(%rdi), %YMMMATCH, %k0 - kmovd %k0, %eax - tzcntl %eax, %eax - cmpl %eax, %edx - jbe L(set_zero_end) - leaq (VEC_SIZE * 2)(%rdi, %rax, CHAR_SIZE), %rax -L(zero_end): - ret + .p2align 4,, 10 +L(last_4x_vec): + /* For CHAR_PER_VEC == 64 we don't need to mask as we use 8-bit + instructions on eax from here on out. */ +# if CHAR_PER_VEC != 64 + andl $(CHAR_PER_VEC * 4 - 1), %eax +# endif + VPCMPEQ (VEC_SIZE * 0)(%rdi), %VMATCH, %k0 + subq $(VEC_SIZE * 1), %rdi + KMOV %k0, %VRDX + cmpb $(CHAR_PER_VEC * 2 - 1), %al + jbe L(last_2x_vec) + test %VRDX, %VRDX + jnz L(last_vec_x1_novzero) + + VPCMPEQ (VEC_SIZE * 2)(%rdi), %VMATCH, %k0 + KMOV %k0, %VRDX + test %VRDX, %VRDX + jnz L(last_vec_x2_novzero) + + VPCMPEQ (VEC_SIZE * 3)(%rdi), %VMATCH, %k0 + KMOV %k0, %VRCX + test %VRCX, %VRCX + jnz L(first_vec_x3_check) + + subb $(CHAR_PER_VEC * 3), %al + jae L(last_vec_check) -L(set_zero_end): xorl %eax, %eax ret - .p2align 4 -L(first_vec_x1_check): - /* eax must be non-zero. Use bsfl to save code size. */ - bsfl %eax, %eax - /* Adjust length. */ - subl $-(CHAR_PER_VEC * 4), %edx - /* Check if match within remaining length. */ - cmpl %eax, %edx - jbe L(set_zero_end) - /* NB: Multiply bytes by CHAR_SIZE to get the wchar_t count. 
*/ - leaq VEC_SIZE(%rdi, %rax, CHAR_SIZE), %rax +# if defined USE_AS_WMEMCHR && USE_TERN_IN_LOOP +L(last_vec_x2_novzero): + addq $VEC_SIZE, %rdi +L(last_vec_x1_novzero): + bsf %VRDX, %VRDX + leaq (VEC_SIZE * 1)(%rdi, %rdx, CHAR_SIZE), %rax ret +# endif - .p2align 4 -L(loop_4x_vec_end): +# if CHAR_PER_VEC == 64 + /* Since we can't combine the last 2x VEC when CHAR_PER_VEC == + 64 it needs a seperate return label. */ + .p2align 4,, 4 +L(last_vec_x2): +L(last_vec_x2_novzero): + bsf %VRDX, %VRDX + leaq (VEC_SIZE * 2)(%rdi, %rdx, TERN_CHAR_MULT), %rax + ret # endif - /* rawmemchr will fall through into this if match was found in - loop. */ -# if defined USE_IN_RTM || defined USE_AS_WMEMCHR - /* k1 has not of matches with VEC1. */ - kmovd %k1, %eax -# ifdef USE_AS_WMEMCHR - subl $((1 << CHAR_PER_VEC) - 1), %eax -# else - incl %eax -# endif + .p2align 4,, 4 +L(loop_vec_ret): +# if defined USE_AS_WMEMCHR || !USE_TERN_IN_LOOP + KMOV %k1, %VRAX + inc %MASK_GPR(rax) # else - /* eax already has matches for VEC1. */ - testl %eax, %eax + test %VRDX, %VRDX # endif - jnz L(last_vec_x1_return) + jnz L(last_vec_x0) -# ifdef USE_IN_RTM - VPCMP $0, %YMM2, %YMMZERO, %k0 - kmovd %k0, %eax + +# if USE_TERN_IN_LOOP + vpmovmskb %VMM_lo(2), %VRDX # else - vpmovmskb %ymm2, %eax + VPTESTN %VMM(2), %VMM(2), %k1 + KMOV %k1, %VRDX # endif - testl %eax, %eax - jnz L(last_vec_x2_return) + test %VRDX, %VRDX + jnz L(last_vec_x1) -# ifdef USE_IN_RTM - kmovd %k2, %eax - testl %eax, %eax - jnz L(last_vec_x3_return) - kmovd %k3, %eax - tzcntl %eax, %eax - leaq (VEC_SIZE * 3 + RET_OFFSET)(%rdi, %rax, CHAR_SIZE), %rax +# if USE_TERN_IN_LOOP + vpmovmskb %VMM_lo(3), %VRDX # else - vpmovmskb %ymm3, %eax - /* Combine matches in VEC3 (eax) with matches in VEC4 (ecx). */ - salq $VEC_SIZE, %rcx - orq %rcx, %rax - tzcntq %rax, %rax - leaq (VEC_SIZE * 2 + RET_OFFSET)(%rdi, %rax), %rax - VZEROUPPER + KMOV %k2, %VRDX # endif - ret - .p2align 4,, 10 -L(last_vec_x1_return): - tzcntl %eax, %eax -# if defined USE_AS_WMEMCHR || RET_OFFSET != 0 - /* NB: Multiply bytes by CHAR_SIZE to get the wchar_t count. */ - leaq RET_OFFSET(%rdi, %rax, CHAR_SIZE), %rax + /* No longer need any of the lo vecs (ymm0-15) so vzeroupper + (only if used VEX encoded loop). */ + COND_VZEROUPPER + + /* Seperate logic for CHAR_PER_VEC == 64 vs the rest. For + CHAR_PER_VEC we test the last 2x VEC seperately, for + CHAR_PER_VEC <= 32 we can combine the results from the 2x + VEC in a single GPR. */ +# if CHAR_PER_VEC == 64 +# if USE_TERN_IN_LOOP +# error "Unsupported" +# endif + + + /* If CHAR_PER_VEC == 64 we can't combine the last two VEC. */ + test %VRDX, %VRDX + jnz L(last_vec_x2) + KMOV %k3, %VRDX # else - addq %rdi, %rax + /* CHAR_PER_VEC <= 32 so we can combine the results from the + last 2x VEC. */ + +# if !USE_TERN_IN_LOOP + KMOV %k3, %VRCX +# endif + salq $(VEC_SIZE / TERN_CHAR_MULT), %rcx + addq %rcx, %rdx +# if !defined USE_AS_WMEMCHR || !USE_TERN_IN_LOOP +L(last_vec_x2_novzero): +# endif # endif - VZEROUPPER + bsf %rdx, %rdx + leaq (LAST_VEC_OFFSET)(%rdi, %rdx, TERN_CHAR_MULT), %rax ret - .p2align 4 -L(last_vec_x2_return): - tzcntl %eax, %eax - /* NB: Multiply bytes by RET_SCALE to get the wchar_t count - if relevant (RET_SCALE = CHAR_SIZE if USE_AS_WMEMCHAR and - USE_IN_RTM are both defined. Otherwise RET_SCALE = 1. 
*/ - leaq (VEC_SIZE + RET_OFFSET)(%rdi, %rax, RET_SCALE), %rax - VZEROUPPER + .p2align 4,, 8 +L(last_vec_x1): + COND_VZEROUPPER +# if !defined USE_AS_WMEMCHR || !USE_TERN_IN_LOOP +L(last_vec_x1_novzero): +# endif + bsf %VRDX, %VRDX + leaq (VEC_SIZE * 1)(%rdi, %rdx, TERN_CHAR_MULT), %rax ret -# ifdef USE_IN_RTM - .p2align 4 -L(last_vec_x3_return): - tzcntl %eax, %eax - /* NB: Multiply bytes by CHAR_SIZE to get the wchar_t count. */ - leaq (VEC_SIZE * 2 + RET_OFFSET)(%rdi, %rax, CHAR_SIZE), %rax + + .p2align 4,, 4 +L(last_vec_x0): + COND_VZEROUPPER + bsf %VGPR(GPR_X0), %VGPR(GPR_X0) +# if GPR_X0_IS_RET + addq %rdi, %rax +# else + leaq (%rdi, %GPR_X0, CHAR_SIZE), %rax +# endif ret + + .p2align 4,, 6 +L(page_cross): + /* Need to preserve eax to compute inbound bytes we are + checking. */ +# ifdef USE_AS_WMEMCHR + movl %eax, %ecx +# else + xorl %ecx, %ecx + subl %eax, %ecx # endif -# ifndef USE_AS_RAWMEMCHR - .p2align 4,, 5 -L(last_4x_vec_or_less_cmpeq): - VPCMP $0, (VEC_SIZE * 5)(%rdi), %YMMMATCH, %k0 - kmovd %k0, %eax - subq $-(VEC_SIZE * 4), %rdi - /* Check first VEC regardless. */ - testl %eax, %eax - jnz L(first_vec_x1_check) + xorq %rdi, %rax + VPCMPEQ (PAGE_SIZE - VEC_SIZE)(%rax), %VMATCH, %k0 + KMOV %k0, %VRAX - /* If remaining length <= CHAR_PER_VEC * 2. */ - addl $(CHAR_PER_VEC * 2), %edx - jle L(last_2x_vec) +# ifdef USE_AS_WMEMCHR + /* NB: Divide by CHAR_SIZE to shift out out of bounds bytes. */ + shrl $2, %ecx + andl $(CHAR_PER_VEC - 1), %ecx +# endif - .p2align 4 -L(last_4x_vec): - VPCMP $0, (VEC_SIZE * 2)(%rdi), %YMMMATCH, %k0 - kmovd %k0, %eax - testl %eax, %eax - jnz L(last_vec_x2) + shrx %VGPR(PC_SHIFT_GPR), %VRAX, %VRAX - VPCMP $0, (VEC_SIZE * 3)(%rdi), %YMMMATCH, %k0 - kmovd %k0, %eax - /* Create mask for possible matches within remaining length. */ -# ifdef USE_AS_WMEMCHR - movl $((1 << (CHAR_PER_VEC * 2)) - 1), %ecx - bzhil %edx, %ecx, %ecx -# else - movq $-1, %rcx - bzhiq %rdx, %rcx, %rcx -# endif - /* Test matches in data against length match. */ - andl %ecx, %eax - jnz L(last_vec_x3) +# ifdef USE_AS_WMEMCHR + negl %ecx +# endif - /* if remaining length <= CHAR_PER_VEC * 3 (Note this is after - remaining length was found to be > CHAR_PER_VEC * 2. */ - subl $CHAR_PER_VEC, %edx - jbe L(zero_end2) + /* mask lower bits from ecx (negative eax) to get bytes till + next VEC. */ + andl $(CHAR_PER_VEC - 1), %ecx + /* Check if VEC is entirely contained in the remainder of the + page. */ + cmpq %rcx, %rdx + jbe L(page_cross_ret) - VPCMP $0, (VEC_SIZE * 4)(%rdi), %YMMMATCH, %k0 - kmovd %k0, %eax - /* Shift remaining length mask for last VEC. */ -# ifdef USE_AS_WMEMCHR - shrl $CHAR_PER_VEC, %ecx -# else - shrq $CHAR_PER_VEC, %rcx -# endif - andl %ecx, %eax - jz L(zero_end2) - bsfl %eax, %eax - leaq (VEC_SIZE * 4)(%rdi, %rax, CHAR_SIZE), %rax -L(zero_end2): - ret + /* Length crosses the page so if rax is zero (no matches) + continue. */ + test %VRAX, %VRAX + jz L(page_cross_continue) -L(last_vec_x2): - tzcntl %eax, %eax - leaq (VEC_SIZE * 2)(%rdi, %rax, CHAR_SIZE), %rax + /* if rdx > rcx then any match here must be in [buf:buf + len]. + */ + tzcnt %VRAX, %VRAX +# ifdef USE_AS_WMEMCHR + leaq (%rdi, %rax, CHAR_SIZE), %rax +# else + addq %rdi, %rax +# endif ret - .p2align 4 -L(last_vec_x3): - tzcntl %eax, %eax - leaq (VEC_SIZE * 3)(%rdi, %rax, CHAR_SIZE), %rax + .p2align 4,, 2 +L(page_cross_zero): + xorl %eax, %eax ret + + .p2align 4,, 4 +L(page_cross_ret): + /* Search is entirely contained in page cross case. 
*/ +# ifdef USE_AS_WMEMCHR + test %VRAX, %VRAX + jz L(page_cross_zero) +# endif + tzcnt %VRAX, %VRAX + cmpl %eax, %edx + jbe L(page_cross_zero) +# ifdef USE_AS_WMEMCHR + leaq (%rdi, %rax, CHAR_SIZE), %rax +# else + addq %rdi, %rax # endif - /* 7 bytes from next cache line. */ + ret END (MEMCHR) #endif diff --git a/sysdeps/x86_64/multiarch/rawmemchr-evex-rtm.S b/sysdeps/x86_64/multiarch/rawmemchr-evex-rtm.S index deda1ca395..2073eaa620 100644 --- a/sysdeps/x86_64/multiarch/rawmemchr-evex-rtm.S +++ b/sysdeps/x86_64/multiarch/rawmemchr-evex-rtm.S @@ -1,3 +1,6 @@ -#define MEMCHR __rawmemchr_evex_rtm -#define USE_AS_RAWMEMCHR 1 -#include "memchr-evex-rtm.S" +#define RAWMEMCHR __rawmemchr_evex_rtm + +#define USE_IN_RTM 1 +#define SECTION(p) p##.evex.rtm + +#include "rawmemchr-evex.S" diff --git a/sysdeps/x86_64/multiarch/rawmemchr-evex.S b/sysdeps/x86_64/multiarch/rawmemchr-evex.S index dc1c450699..dad54def2b 100644 --- a/sysdeps/x86_64/multiarch/rawmemchr-evex.S +++ b/sysdeps/x86_64/multiarch/rawmemchr-evex.S @@ -1,7 +1,308 @@ -#ifndef RAWMEMCHR -# define RAWMEMCHR __rawmemchr_evex -#endif -#define USE_AS_RAWMEMCHR 1 -#define MEMCHR RAWMEMCHR +/* rawmemchr optimized with 256-bit EVEX instructions. + Copyright (C) 2022 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +#include +#include + +#if ISA_SHOULD_BUILD (4) + +# ifndef VEC_SIZE +# include "x86-evex256-vecs.h" +# endif + +# ifndef RAWMEMCHR +# define RAWMEMCHR __rawmemchr_evex +# endif + + +# define PC_SHIFT_GPR rdi +# define REG_WIDTH VEC_SIZE +# define VPTESTN vptestnmb +# define VPBROADCAST vpbroadcastb +# define VPMINU vpminub +# define VPCMP vpcmpb +# define VPCMPEQ vpcmpeqb +# define CHAR_SIZE 1 + +# include "reg-macros.h" + +/* If not in an RTM and VEC_SIZE != 64 (the VEC_SIZE = 64 + doesn't have VEX encoding), use VEX encoding in loop so we + can use vpcmpeqb + vptern which is more efficient than the + EVEX alternative. 
*/ +# if defined USE_IN_RTM || VEC_SIZE == 64 +# undef COND_VZEROUPPER +# undef VZEROUPPER_RETURN +# undef VZEROUPPER + + +# define COND_VZEROUPPER +# define VZEROUPPER_RETURN ret +# define VZEROUPPER + +# define USE_TERN_IN_LOOP 0 +# else +# define USE_TERN_IN_LOOP 1 +# undef VZEROUPPER +# define VZEROUPPER vzeroupper +# endif + +# define CHAR_PER_VEC VEC_SIZE + +# if CHAR_PER_VEC == 64 + +# define TAIL_RETURN_LBL first_vec_x2 +# define TAIL_RETURN_OFFSET (CHAR_PER_VEC * 2) + +# define FALLTHROUGH_RETURN_LBL first_vec_x3 +# define FALLTHROUGH_RETURN_OFFSET (CHAR_PER_VEC * 3) + +# else /* !(CHAR_PER_VEC == 64) */ + +# define TAIL_RETURN_LBL first_vec_x3 +# define TAIL_RETURN_OFFSET (CHAR_PER_VEC * 3) + +# define FALLTHROUGH_RETURN_LBL first_vec_x2 +# define FALLTHROUGH_RETURN_OFFSET (CHAR_PER_VEC * 2) +# endif /* !(CHAR_PER_VEC == 64) */ + + +# define VMATCH VMM(0) +# define VMATCH_LO VMM_lo(0) + +# define PAGE_SIZE 4096 + + .section SECTION(.text), "ax", @progbits +ENTRY_P2ALIGN (RAWMEMCHR, 6) + VPBROADCAST %esi, %VMATCH + /* Check if we may cross page boundary with one vector load. */ + movl %edi, %eax + andl $(PAGE_SIZE - 1), %eax + cmpl $(PAGE_SIZE - VEC_SIZE), %eax + ja L(page_cross) + + VPCMPEQ (%rdi), %VMATCH, %k0 + KMOV %k0, %VRAX + + test %VRAX, %VRAX + jz L(aligned_more) +L(first_vec_x0): + bsf %VRAX, %VRAX + addq %rdi, %rax + ret + + .p2align 4,, 4 +L(first_vec_x4): + bsf %VRAX, %VRAX + leaq (VEC_SIZE * 4)(%rdi, %rax), %rax + ret -#include "memchr-evex.S" + /* For VEC_SIZE == 32 we can fit this in aligning bytes so might + as well place it more locally. For VEC_SIZE == 64 we reuse + return code at the end of loop's return. */ +# if VEC_SIZE == 32 + .p2align 4,, 4 +L(FALLTHROUGH_RETURN_LBL): + bsf %VRAX, %VRAX + leaq (FALLTHROUGH_RETURN_OFFSET)(%rdi, %rax), %rax + ret +# endif + + .p2align 4,, 6 +L(page_cross): + /* eax has lower page-offset bits of rdi so xor will zero them + out. */ + xorq %rdi, %rax + VPCMPEQ (PAGE_SIZE - VEC_SIZE)(%rax), %VMATCH, %k0 + KMOV %k0, %VRAX + + /* Shift out out-of-bounds matches. */ + shrx %VRDI, %VRAX, %VRAX + test %VRAX, %VRAX + jnz L(first_vec_x0) + + .p2align 4,, 10 +L(aligned_more): +L(page_cross_continue): + /* Align pointer. */ + andq $(VEC_SIZE * -1), %rdi + + VPCMPEQ VEC_SIZE(%rdi), %VMATCH, %k0 + KMOV %k0, %VRAX + test %VRAX, %VRAX + jnz L(first_vec_x1) + + VPCMPEQ (VEC_SIZE * 2)(%rdi), %VMATCH, %k0 + KMOV %k0, %VRAX + test %VRAX, %VRAX + jnz L(first_vec_x2) + + VPCMPEQ (VEC_SIZE * 3)(%rdi), %VMATCH, %k0 + KMOV %k0, %VRAX + test %VRAX, %VRAX + jnz L(first_vec_x3) + + VPCMPEQ (VEC_SIZE * 4)(%rdi), %VMATCH, %k0 + KMOV %k0, %VRAX + test %VRAX, %VRAX + jnz L(first_vec_x4) + + subq $-(VEC_SIZE * 1), %rdi +# if VEC_SIZE == 64 + /* Saves code size. No evex512 processor has partial register + stalls. If that change this can be replaced with `andq + $-(VEC_SIZE * 4), %rdi`. */ + xorb %dil, %dil +# else + andq $-(VEC_SIZE * 4), %rdi +# endif + +# if USE_TERN_IN_LOOP + /* copy VMATCH to low ymm so we can use vpcmpeq which is not + encodable with EVEX registers. NB: this is VEC_SIZE == 32 + only as there is no way to encode vpcmpeq with zmm0-15. */ + vmovdqa64 %VMATCH, %VMATCH_LO +# endif + + .p2align 4 +L(loop_4x_vec): + /* Two versions of the loop. One that does not require + vzeroupper by not using ymm0-15 and another does that + require vzeroupper because it uses ymm0-15. The reason why + ymm0-15 is used at all is because there is no EVEX encoding + vpcmpeq and with vpcmpeq this loop can be performed more + efficiently. 
The non-vzeroupper version is safe for RTM + while the vzeroupper version should be prefered if RTM are + not supported. Which loop version we use is determined by + USE_TERN_IN_LOOP. */ + +# if USE_TERN_IN_LOOP + /* Since vptern can only take 3x vectors fastest to do 1 vec + seperately with EVEX vpcmp. */ + VPCMPEQ (VEC_SIZE * 4)(%rdi), %VMATCH, %k1 + /* Compare 3x with vpcmpeq and or them all together with vptern. + */ + + VPCMPEQ (VEC_SIZE * 5)(%rdi), %VMATCH_LO, %VMM_lo(2) + subq $(VEC_SIZE * -4), %rdi + VPCMPEQ (VEC_SIZE * 2)(%rdi), %VMATCH_LO, %VMM_lo(3) + VPCMPEQ (VEC_SIZE * 3)(%rdi), %VMATCH_LO, %VMM_lo(4) + + /* 254 is mask for oring VEC_lo(2), VEC_lo(3), VEC_lo(4) into + VEC_lo(4). */ + vpternlogd $254, %VMM_lo(2), %VMM_lo(3), %VMM_lo(4) + vpmovmskb %VMM_lo(4), %VRCX + + KMOV %k1, %eax + + /* NB: rax has match from first VEC and rcx has matches from + VEC 2-4. If rax is non-zero we will return that match. If + rax is zero adding won't disturb the bits in rcx. */ + add %rax, %rcx +# else + /* Loop version that uses EVEX encoding. */ + VPCMP $4, (VEC_SIZE * 4)(%rdi), %VMATCH, %k1 + vpxorq (VEC_SIZE * 5)(%rdi), %VMATCH, %VMM(2) + vpxorq (VEC_SIZE * 6)(%rdi), %VMATCH, %VMM(3) + VPCMPEQ (VEC_SIZE * 7)(%rdi), %VMATCH, %k3 + VPMINU %VMM(2), %VMM(3), %VMM(3){%k1}{z} + VPTESTN %VMM(3), %VMM(3), %k2 + subq $(VEC_SIZE * -4), %rdi + KORTEST %k2, %k3 +# endif + jz L(loop_4x_vec) + +# if USE_TERN_IN_LOOP + test %VRAX, %VRAX +# else + KMOV %k1, %VRAX + inc %VRAX +# endif + jnz L(last_vec_x0) + + +# if USE_TERN_IN_LOOP + vpmovmskb %VMM_lo(2), %VRAX +# else + VPTESTN %VMM(2), %VMM(2), %k1 + KMOV %k1, %VRAX +# endif + test %VRAX, %VRAX + jnz L(last_vec_x1) + + +# if USE_TERN_IN_LOOP + vpmovmskb %VMM_lo(3), %VRAX +# else + KMOV %k2, %VRAX +# endif + + /* No longer need any of the lo vecs (ymm0-15) so vzeroupper + (only if used VEX encoded loop). */ + COND_VZEROUPPER + + /* Seperate logic for VEC_SIZE == 64 and VEC_SIZE == 32 for + returning last 2x VEC. For VEC_SIZE == 64 we test each VEC + individually, for VEC_SIZE == 32 we combine them in a single + 64-bit GPR. */ +# if CHAR_PER_VEC == 64 +# if USE_TERN_IN_LOOP +# error "Unsupported" +# endif + + + /* If CHAR_PER_VEC == 64 we can't combine the last two VEC. */ + test %VRAX, %VRAX + jnz L(first_vec_x2) + KMOV %k3, %VRAX +L(FALLTHROUGH_RETURN_LBL): +# else + /* CHAR_PER_VEC <= 32 so we can combine the results from the + last 2x VEC. 
*/ +# if !USE_TERN_IN_LOOP + KMOV %k3, %VRCX +# endif + salq $CHAR_PER_VEC, %rcx + addq %rcx, %rax +# endif + bsf %rax, %rax + leaq (FALLTHROUGH_RETURN_OFFSET)(%rdi, %rax), %rax + ret + + .p2align 4,, 8 +L(TAIL_RETURN_LBL): + bsf %rax, %rax + leaq (TAIL_RETURN_OFFSET)(%rdi, %rax), %rax + ret + + .p2align 4,, 8 +L(last_vec_x1): + COND_VZEROUPPER +L(first_vec_x1): + bsf %VRAX, %VRAX + leaq (VEC_SIZE * 1)(%rdi, %rax), %rax + ret + + .p2align 4,, 8 +L(last_vec_x0): + COND_VZEROUPPER + bsf %VRAX, %VRAX + addq %rdi, %rax + ret +END (RAWMEMCHR) +#endif
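To make optimization (1) in the memchr commit message above concrete, here is a rough scalar C sketch of the idea, assuming VEC_SIZE == 32 and hypothetical helpers scan_mask / more_vecs (illustrative names, not glibc identifiers; the real code does this on whole vectors with vpcmpeqb/kmov and a hardware tzcnt):

#include <stddef.h>
#include <stdint.h>

#define VEC_SIZE 32

/* Stand-in for the first vector compare (vpcmpeqb + kmov): bit i is
   set iff s[i] == c.  Bounded by len here so the C model stays free
   of out-of-bounds reads; the asm instead guards its full-vector load
   with the page-cross check.  */
static uint32_t
scan_mask (const unsigned char *s, int c, size_t len)
{
  uint32_t m = 0;
  size_t n = len < VEC_SIZE ? len : VEC_SIZE;
  for (size_t i = 0; i < n; i++)
    m |= (uint32_t) (s[i] == (unsigned char) c) << i;
  return m;
}

/* Stand-in for the 2x/4x tail and the main 4x loop.  */
static const void *
more_vecs (const unsigned char *s, int c, size_t len)
{
  for (size_t i = VEC_SIZE; i < len; i++)
    if (s[i] == (unsigned char) c)
      return s + i;
  return NULL;
}

const void *
memchr_model (const void *src, int c, size_t len)
{
  const unsigned char *s = src;
  if (len == 0)
    return NULL;
  uint32_t m = scan_mask (s, c, len);
  /* Hardware tzcnt of 0 is the operand width (VEC_SIZE here);
     __builtin_ctz(0) is undefined in C, so spell that out.  */
  unsigned int pos = m ? (unsigned int) __builtin_ctz (m) : VEC_SIZE;
  /* One comparison covers both "no match and len <= VEC_SIZE" and
     "first match at or beyond len".  */
  if (len <= pos)
    return NULL;
  if (m == 0)
    return more_vecs (s, c, len);	/* len > VEC_SIZE: keep scanning.  */
  return s + pos;			/* In-bounds match in first vector.  */
}

Branching on the length comparison (rather than extending the data dependency with cmovcc) fits the common case where the caller immediately branches on whether the return value is null, as the comment added in the new assembly notes.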
From patchwork Tue Oct 18 02:48:56 2022
X-Patchwork-Submitter: Noah Goldstein
X-Patchwork-Id: 1691322
To: libc-alpha@sourceware.org
Subject: [PATCH v1 2/7] x86: Shrink / minorly optimize strchr-evex and implement with VMM headers
Date: Mon, 17 Oct 2022 19:48:56 -0700
Message-Id: <20221018024901.3381469-2-goldstein.w.n@gmail.com>
In-Reply-To: <20221018024901.3381469-1-goldstein.w.n@gmail.com>
References: <20221018024901.3381469-1-goldstein.w.n@gmail.com>
From: Noah Goldstein
Reply-To: Noah Goldstein

Size Optimizations:
1. Condense the hot path for better cache-locality.
   - This is most impactful for strchrnul, where the logic for strings
     with len <= VEC_SIZE or with a match in the first VEC now fits
     entirely in the first cache line.
2. Reuse common targets in the first 4x VEC and after the loop.
3. Don't align targets so aggressively if it doesn't change the number
   of fetch blocks it will require, and put more care into avoiding the
   case where targets unnecessarily split cache lines.
4. Align the loop better for the DSB/LSD.
5. Use more code-size efficient instructions.
   - tzcnt ...     -> bsf ...
   - vpcmpb $0 ... -> vpcmpeq ...
6. Align labels less aggressively, especially if it doesn't save fetch
   blocks / causes the basic-block to span extra cache-lines.

Code Size Changes:
strchr-evex.S   : -63 bytes
strchrnul-evex.S: -48 bytes

Net perf changes:

Reported as geometric mean of all improvements / regressions from N=10
runs of the benchtests. Value as New Time / Old Time, so < 1.0 is an
improvement and > 1.0 is a regression.

strchr-evex.S (Fixed) : 0.971
strchr-evex.S (Rand)  : 0.932
strchrnul-evex.S      : 0.965

Full results attached in email.

Full check passes on x86-64.
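The diff below leans on one trick in several places ("Leaves only CHARS matching esi as 0"): xor the input with the broadcast CHAR, take the unsigned byte minimum with the original input, and test for zero bytes. A minimal scalar C sketch of that combined CHAR/null test (strchr_model is an illustrative name, not a glibc identifier):

#include <stddef.h>

/* min (x ^ c, x) is zero exactly when x == c (the xor is zero) or
   x == 0 (the min picks the zero), so one zero-test finds both CHAR
   and the null terminator.  */
static const char *
strchr_model (const char *s, int c_in)
{
  unsigned char c = (unsigned char) c_in;
  for (;; s++)
    {
      unsigned char x = (unsigned char) *s;
      unsigned char t = (unsigned char) (x ^ c);  /* vpxorq  */
      unsigned char m = t < x ? t : x;            /* vpminub */
      if (m == 0)                                 /* vptestnm + branch */
        return x == c ? s : NULL;   /* CHAR found, or hit null first.  */
    }
}

In the vector code the same computation runs VEC_SIZE bytes at a time, VPTESTN turns the zero bytes into a mask register, and KMOV/bsf recover the index, alternating with the VPCMPEQ/VPTESTN "method 2" to balance latency and port contention as the comments in the diff describe.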
--- sysdeps/x86_64/multiarch/strchr-evex.S | 558 +++++++++++++++---------- 1 file changed, 340 insertions(+), 218 deletions(-) diff --git a/sysdeps/x86_64/multiarch/strchr-evex.S b/sysdeps/x86_64/multiarch/strchr-evex.S index a1c15c4419..c2a0d112f7 100644 --- a/sysdeps/x86_64/multiarch/strchr-evex.S +++ b/sysdeps/x86_64/multiarch/strchr-evex.S @@ -26,48 +26,75 @@ # define STRCHR __strchr_evex # endif -# define VMOVU vmovdqu64 -# define VMOVA vmovdqa64 +# ifndef VEC_SIZE +# include "x86-evex256-vecs.h" +# endif # ifdef USE_AS_WCSCHR # define VPBROADCAST vpbroadcastd -# define VPCMP vpcmpd +# define VPCMP vpcmpd +# define VPCMPEQ vpcmpeqd # define VPTESTN vptestnmd +# define VPTEST vptestmd # define VPMINU vpminud # define CHAR_REG esi -# define SHIFT_REG ecx +# define SHIFT_REG rcx # define CHAR_SIZE 4 + +# define USE_WIDE_CHAR # else # define VPBROADCAST vpbroadcastb -# define VPCMP vpcmpb +# define VPCMP vpcmpb +# define VPCMPEQ vpcmpeqb # define VPTESTN vptestnmb +# define VPTEST vptestmb # define VPMINU vpminub # define CHAR_REG sil -# define SHIFT_REG edx +# define SHIFT_REG rdi # define CHAR_SIZE 1 # endif -# define XMMZERO xmm16 - -# define YMMZERO ymm16 -# define YMM0 ymm17 -# define YMM1 ymm18 -# define YMM2 ymm19 -# define YMM3 ymm20 -# define YMM4 ymm21 -# define YMM5 ymm22 -# define YMM6 ymm23 -# define YMM7 ymm24 -# define YMM8 ymm25 - -# define VEC_SIZE 32 -# define PAGE_SIZE 4096 -# define CHAR_PER_VEC (VEC_SIZE / CHAR_SIZE) - - .section .text.evex,"ax",@progbits -ENTRY_P2ALIGN (STRCHR, 5) - /* Broadcast CHAR to YMM0. */ - VPBROADCAST %esi, %YMM0 +# include "reg-macros.h" + +# if VEC_SIZE == 64 +# define MASK_GPR rcx +# define LOOP_REG rax + +# define COND_MASK(k_reg) {%k_reg} +# else +# define MASK_GPR rax +# define LOOP_REG rdi + +# define COND_MASK(k_reg) +# endif + +# define CHAR_PER_VEC (VEC_SIZE / CHAR_SIZE) + + +# if CHAR_PER_VEC == 64 +# define LAST_VEC_OFFSET (VEC_SIZE * 3) +# define TESTZ(reg) incq %VGPR_SZ(reg, 64) +# else + +# if CHAR_PER_VEC == 32 +# define TESTZ(reg) incl %VGPR_SZ(reg, 32) +# elif CHAR_PER_VEC == 16 +# define TESTZ(reg) incw %VGPR_SZ(reg, 16) +# else +# define TESTZ(reg) incb %VGPR_SZ(reg, 8) +# endif + +# define LAST_VEC_OFFSET (VEC_SIZE * 2) +# endif + +# define VMATCH VMM(0) + +# define PAGE_SIZE 4096 + + .section SECTION(.text), "ax", @progbits +ENTRY_P2ALIGN (STRCHR, 6) + /* Broadcast CHAR to VEC_0. */ + VPBROADCAST %esi, %VMATCH movl %edi, %eax andl $(PAGE_SIZE - 1), %eax /* Check if we cross page boundary with one vector load. @@ -75,19 +102,27 @@ ENTRY_P2ALIGN (STRCHR, 5) cmpl $(PAGE_SIZE - VEC_SIZE), %eax ja L(cross_page_boundary) + /* Check the first VEC_SIZE bytes. Search for both CHAR and the null bytes. */ - VMOVU (%rdi), %YMM1 - + VMOVU (%rdi), %VMM(1) /* Leaves only CHARS matching esi as 0. */ - vpxorq %YMM1, %YMM0, %YMM2 - VPMINU %YMM2, %YMM1, %YMM2 - /* Each bit in K0 represents a CHAR or a null byte in YMM1. */ - VPTESTN %YMM2, %YMM2, %k0 - kmovd %k0, %eax - testl %eax, %eax + vpxorq %VMM(1), %VMATCH, %VMM(2) + VPMINU %VMM(2), %VMM(1), %VMM(2) + /* Each bit in K0 represents a CHAR or a null byte in VEC_1. */ + VPTESTN %VMM(2), %VMM(2), %k0 + KMOV %k0, %VRAX +# if VEC_SIZE == 64 && defined USE_AS_STRCHRNUL + /* If VEC_SIZE == 64 && STRCHRNUL use bsf to test condition so + that all logic for match/null in first VEC first in 1x cache + lines. This has a slight cost to larger sizes. 
*/ + bsf %VRAX, %VRAX + jz L(aligned_more) +# else + test %VRAX, %VRAX jz L(aligned_more) - tzcntl %eax, %eax + bsf %VRAX, %VRAX +# endif # ifndef USE_AS_STRCHRNUL /* Found CHAR or the null byte. */ cmp (%rdi, %rax, CHAR_SIZE), %CHAR_REG @@ -109,287 +144,374 @@ ENTRY_P2ALIGN (STRCHR, 5) # endif ret - - - .p2align 4,, 10 -L(first_vec_x4): -# ifndef USE_AS_STRCHRNUL - /* Check to see if first match was CHAR (k0) or null (k1). */ - kmovd %k0, %eax - tzcntl %eax, %eax - kmovd %k1, %ecx - /* bzhil will not be 0 if first match was null. */ - bzhil %eax, %ecx, %ecx - jne L(zero) -# else - /* Combine CHAR and null matches. */ - kord %k0, %k1, %k0 - kmovd %k0, %eax - tzcntl %eax, %eax -# endif - /* NB: Multiply sizeof char type (1 or 4) to get the number of - bytes. */ - leaq (VEC_SIZE * 4)(%rdi, %rax, CHAR_SIZE), %rax - ret - # ifndef USE_AS_STRCHRNUL L(zero): xorl %eax, %eax ret # endif - - .p2align 4 + .p2align 4,, 2 +L(first_vec_x3): + subq $-(VEC_SIZE * 2), %rdi +# if VEC_SIZE == 32 + /* Reuse L(first_vec_x3) for last VEC2 only for VEC_SIZE == 32. + For VEC_SIZE == 64 the registers don't match. */ +L(last_vec_x2): +# endif L(first_vec_x1): /* Use bsf here to save 1-byte keeping keeping the block in 1x fetch block. eax guranteed non-zero. */ - bsfl %eax, %eax + bsf %VRCX, %VRCX # ifndef USE_AS_STRCHRNUL - /* Found CHAR or the null byte. */ - cmp (VEC_SIZE)(%rdi, %rax, CHAR_SIZE), %CHAR_REG + /* Found CHAR or the null byte. */ + cmp (VEC_SIZE)(%rdi, %rcx, CHAR_SIZE), %CHAR_REG jne L(zero) - # endif /* NB: Multiply sizeof char type (1 or 4) to get the number of bytes. */ - leaq (VEC_SIZE)(%rdi, %rax, CHAR_SIZE), %rax + leaq (VEC_SIZE)(%rdi, %rcx, CHAR_SIZE), %rax ret - .p2align 4,, 10 + .p2align 4,, 2 +L(first_vec_x4): + subq $-(VEC_SIZE * 2), %rdi L(first_vec_x2): # ifndef USE_AS_STRCHRNUL /* Check to see if first match was CHAR (k0) or null (k1). */ - kmovd %k0, %eax - tzcntl %eax, %eax - kmovd %k1, %ecx + KMOV %k0, %VRAX + tzcnt %VRAX, %VRAX + KMOV %k1, %VRCX /* bzhil will not be 0 if first match was null. */ - bzhil %eax, %ecx, %ecx + bzhi %VRAX, %VRCX, %VRCX jne L(zero) # else /* Combine CHAR and null matches. */ - kord %k0, %k1, %k0 - kmovd %k0, %eax - tzcntl %eax, %eax + KOR %k0, %k1, %k0 + KMOV %k0, %VRAX + bsf %VRAX, %VRAX # endif /* NB: Multiply sizeof char type (1 or 4) to get the number of bytes. */ leaq (VEC_SIZE * 2)(%rdi, %rax, CHAR_SIZE), %rax ret - .p2align 4,, 10 -L(first_vec_x3): - /* Use bsf here to save 1-byte keeping keeping the block in 1x - fetch block. eax guranteed non-zero. */ - bsfl %eax, %eax -# ifndef USE_AS_STRCHRNUL - /* Found CHAR or the null byte. */ - cmp (VEC_SIZE * 3)(%rdi, %rax, CHAR_SIZE), %CHAR_REG - jne L(zero) +# ifdef USE_AS_STRCHRNUL + /* We use this as a hook to get imm8 encoding for the jmp to + L(page_cross_boundary). This allows the hot case of a + match/null-term in first VEC to fit entirely in 1 cache + line. */ +L(cross_page_boundary): + jmp L(cross_page_boundary_real) # endif - /* NB: Multiply sizeof char type (1 or 4) to get the number of - bytes. */ - leaq (VEC_SIZE * 3)(%rdi, %rax, CHAR_SIZE), %rax - ret .p2align 4 L(aligned_more): +L(cross_page_continue): /* Align data to VEC_SIZE. */ andq $-VEC_SIZE, %rdi -L(cross_page_continue): - /* Check the next 4 * VEC_SIZE. Only one VEC_SIZE at a time since - data is only aligned to VEC_SIZE. Use two alternating methods - for checking VEC to balance latency and port contention. */ - /* This method has higher latency but has better port - distribution. 
*/ - VMOVA (VEC_SIZE)(%rdi), %YMM1 + /* Check the next 4 * VEC_SIZE. Only one VEC_SIZE at a time + since data is only aligned to VEC_SIZE. Use two alternating + methods for checking VEC to balance latency and port + contention. */ + + /* Method(1) with 8c latency: + For VEC_SIZE == 32: + p0 * 1.83, p1 * 0.83, p5 * 1.33 + For VEC_SIZE == 64: + p0 * 2.50, p1 * 0.00, p5 * 1.50 */ + VMOVA (VEC_SIZE)(%rdi), %VMM(1) /* Leaves only CHARS matching esi as 0. */ - vpxorq %YMM1, %YMM0, %YMM2 - VPMINU %YMM2, %YMM1, %YMM2 - /* Each bit in K0 represents a CHAR or a null byte in YMM1. */ - VPTESTN %YMM2, %YMM2, %k0 - kmovd %k0, %eax - testl %eax, %eax + vpxorq %VMM(1), %VMATCH, %VMM(2) + VPMINU %VMM(2), %VMM(1), %VMM(2) + /* Each bit in K0 represents a CHAR or a null byte in VEC_1. */ + VPTESTN %VMM(2), %VMM(2), %k0 + KMOV %k0, %VRCX + test %VRCX, %VRCX jnz L(first_vec_x1) - /* This method has higher latency but has better port - distribution. */ - VMOVA (VEC_SIZE * 2)(%rdi), %YMM1 - /* Each bit in K0 represents a CHAR in YMM1. */ - VPCMP $0, %YMM1, %YMM0, %k0 - /* Each bit in K1 represents a CHAR in YMM1. */ - VPTESTN %YMM1, %YMM1, %k1 - kortestd %k0, %k1 + /* Method(2) with 6c latency: + For VEC_SIZE == 32: + p0 * 1.00, p1 * 0.00, p5 * 2.00 + For VEC_SIZE == 64: + p0 * 1.00, p1 * 0.00, p5 * 2.00 */ + VMOVA (VEC_SIZE * 2)(%rdi), %VMM(1) + /* Each bit in K0 represents a CHAR in VEC_1. */ + VPCMPEQ %VMM(1), %VMATCH, %k0 + /* Each bit in K1 represents a CHAR in VEC_1. */ + VPTESTN %VMM(1), %VMM(1), %k1 + KORTEST %k0, %k1 jnz L(first_vec_x2) - VMOVA (VEC_SIZE * 3)(%rdi), %YMM1 + /* By swapping between Method 1/2 we get more fair port + distrubition and better throughput. */ + + VMOVA (VEC_SIZE * 3)(%rdi), %VMM(1) /* Leaves only CHARS matching esi as 0. */ - vpxorq %YMM1, %YMM0, %YMM2 - VPMINU %YMM2, %YMM1, %YMM2 - /* Each bit in K0 represents a CHAR or a null byte in YMM1. */ - VPTESTN %YMM2, %YMM2, %k0 - kmovd %k0, %eax - testl %eax, %eax + vpxorq %VMM(1), %VMATCH, %VMM(2) + VPMINU %VMM(2), %VMM(1), %VMM(2) + /* Each bit in K0 represents a CHAR or a null byte in VEC_1. */ + VPTESTN %VMM(2), %VMM(2), %k0 + KMOV %k0, %VRCX + test %VRCX, %VRCX jnz L(first_vec_x3) - VMOVA (VEC_SIZE * 4)(%rdi), %YMM1 - /* Each bit in K0 represents a CHAR in YMM1. */ - VPCMP $0, %YMM1, %YMM0, %k0 - /* Each bit in K1 represents a CHAR in YMM1. */ - VPTESTN %YMM1, %YMM1, %k1 - kortestd %k0, %k1 + VMOVA (VEC_SIZE * 4)(%rdi), %VMM(1) + /* Each bit in K0 represents a CHAR in VEC_1. */ + VPCMPEQ %VMM(1), %VMATCH, %k0 + /* Each bit in K1 represents a CHAR in VEC_1. */ + VPTESTN %VMM(1), %VMM(1), %k1 + KORTEST %k0, %k1 jnz L(first_vec_x4) /* Align data to VEC_SIZE * 4 for the loop. */ +# if VEC_SIZE == 64 + /* Use rax for the loop reg as it allows to the loop to fit in + exactly 2-cache-lines. (more efficient imm32 + gpr + encoding). */ + leaq (VEC_SIZE)(%rdi), %rax + /* No partial register stalls on evex512 processors. */ + xorb %al, %al +# else + /* For VEC_SIZE == 32 continue using rdi for loop reg so we can + reuse more code and save space. */ addq $VEC_SIZE, %rdi andq $-(VEC_SIZE * 4), %rdi - +# endif .p2align 4 L(loop_4x_vec): - /* Check 4x VEC at a time. No penalty to imm32 offset with evex - encoding. */ - VMOVA (VEC_SIZE * 4)(%rdi), %YMM1 - VMOVA (VEC_SIZE * 5)(%rdi), %YMM2 - VMOVA (VEC_SIZE * 6)(%rdi), %YMM3 - VMOVA (VEC_SIZE * 7)(%rdi), %YMM4 - - /* For YMM1 and YMM3 use xor to set the CHARs matching esi to + /* Check 4x VEC at a time. No penalty for imm32 offset with evex + encoding (if offset % VEC_SIZE == 0). 
*/ + VMOVA (VEC_SIZE * 4)(%LOOP_REG), %VMM(1) + VMOVA (VEC_SIZE * 5)(%LOOP_REG), %VMM(2) + VMOVA (VEC_SIZE * 6)(%LOOP_REG), %VMM(3) + VMOVA (VEC_SIZE * 7)(%LOOP_REG), %VMM(4) + + /* Collect bits where VEC_1 does NOT match esi. This is later + use to mask of results (getting not matches allows us to + save an instruction on combining). */ + VPCMP $4, %VMATCH, %VMM(1), %k1 + + /* Two methods for loop depending on VEC_SIZE. This is because + with zmm registers VPMINU can only run on p0 (as opposed to + p0/p1 for ymm) so it is less prefered. */ +# if VEC_SIZE == 32 + /* For VEC_2 and VEC_3 use xor to set the CHARs matching esi to zero. */ - vpxorq %YMM1, %YMM0, %YMM5 - /* For YMM2 and YMM4 cmp not equals to CHAR and store result in - k register. Its possible to save either 1 or 2 instructions - using cmp no equals method for either YMM1 or YMM1 and YMM3 - respectively but bottleneck on p5 makes it not worth it. */ - VPCMP $4, %YMM0, %YMM2, %k2 - vpxorq %YMM3, %YMM0, %YMM7 - VPCMP $4, %YMM0, %YMM4, %k4 - - /* Use min to select all zeros from either xor or end of string). - */ - VPMINU %YMM1, %YMM5, %YMM1 - VPMINU %YMM3, %YMM7, %YMM3 + vpxorq %VMM(2), %VMATCH, %VMM(6) + vpxorq %VMM(3), %VMATCH, %VMM(7) - /* Use min + zeromask to select for zeros. Since k2 and k4 will - have 0 as positions that matched with CHAR which will set - zero in the corresponding destination bytes in YMM2 / YMM4. - */ - VPMINU %YMM1, %YMM2, %YMM2{%k2}{z} - VPMINU %YMM3, %YMM4, %YMM4 - VPMINU %YMM2, %YMM4, %YMM4{%k4}{z} - - VPTESTN %YMM4, %YMM4, %k1 - kmovd %k1, %ecx - subq $-(VEC_SIZE * 4), %rdi - testl %ecx, %ecx + /* Find non-matches in VEC_4 while combining with non-matches + from VEC_1. NB: Try and use masked predicate execution on + instructions that have mask result as it has no latency + penalty. */ + VPCMP $4, %VMATCH, %VMM(4), %k4{%k1} + + /* Combined zeros from VEC_1 / VEC_2 (search for null term). */ + VPMINU %VMM(1), %VMM(2), %VMM(2) + + /* Use min to select all zeros from either xor or end of + string). */ + VPMINU %VMM(3), %VMM(7), %VMM(3) + VPMINU %VMM(2), %VMM(6), %VMM(2) + + /* Combined zeros from VEC_2 / VEC_3 (search for null term). */ + VPMINU %VMM(3), %VMM(4), %VMM(4) + + /* Combined zeros from VEC_2 / VEC_4 (this has all null term and + esi matches for VEC_2 / VEC_3). */ + VPMINU %VMM(2), %VMM(4), %VMM(4) +# else + /* Collect non-matches for VEC_2. */ + VPCMP $4, %VMM(2), %VMATCH, %k2 + + /* Combined zeros from VEC_1 / VEC_2 (search for null term). */ + VPMINU %VMM(1), %VMM(2), %VMM(2) + + /* Find non-matches in VEC_3/VEC_4 while combining with non- + matches from VEC_1/VEC_2 respectively. */ + VPCMP $4, %VMM(3), %VMATCH, %k3{%k1} + VPCMP $4, %VMM(4), %VMATCH, %k4{%k2} + + /* Finish combining zeros in all VECs. */ + VPMINU %VMM(3), %VMM(4), %VMM(4) + + /* Combine in esi matches for VEC_3 (if there was a match with + esi, the corresponding bit in %k3 is zero so the + VPMINU_MASKZ will have a zero in the result). NB: This make + the VPMINU 3c latency. The only way to avoid it is to + createa a 12c dependency chain on all the `VPCMP $4, ...` + which has higher total latency. */ + VPMINU %VMM(2), %VMM(4), %VMM(4){%k3}{z} +# endif + VPTEST %VMM(4), %VMM(4), %k0{%k4} + KMOV %k0, %VRDX + subq $-(VEC_SIZE * 4), %LOOP_REG + + /* TESTZ is inc using the proper register width depending on + CHAR_PER_VEC. An esi match or null-term match leaves a zero- + bit in rdx so inc won't overflow and won't be zero. 
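(The inc/TESTZ games played on these "non-match" masks rest on two small two's-complement identities; a self-contained illustration with invented example values:

    #include <stdint.h>
    #include <stdio.h>

    int
    main (void)
    {
      /* m has a 1 for every byte that is neither CHAR nor the null byte.  */
      uint32_t m = 0xffffff0f;   /* example: matches at bit positions 4..7 */

      /* All-ones means nothing was found: inc overflows to 0.  */
      if ((uint32_t) (m + 1) == 0)
        puts ("keep looping");

      /* The trailing 1s carry away, so the lowest set bit of m + 1 is the
         first 0 of m, i.e. the first CHAR/null position.  */
      printf ("first match: %d\n", __builtin_ctz (m + 1));   /* prints 4 */

      /* -(m + 1) == ~m, so bsr of it finds the last CHAR/null position;
         the memrchr patch later in this series uses this neg form.  */
      printf ("last match:  %d\n", 31 - __builtin_clz (- (m + 1)));  /* 7 */
      return 0;
    }

)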
*/ + TESTZ (rdx) jz L(loop_4x_vec) - VPTESTN %YMM1, %YMM1, %k0 - kmovd %k0, %eax - testl %eax, %eax - jnz L(last_vec_x1) + VPTEST %VMM(1), %VMM(1), %k0{%k1} + KMOV %k0, %VGPR(MASK_GPR) + TESTZ (MASK_GPR) +# if VEC_SIZE == 32 + /* We can reuse the return code in page_cross logic for VEC_SIZE + == 32. */ + jnz L(last_vec_x1_vec_size32) +# else + jnz L(last_vec_x1_vec_size64) +# endif + - VPTESTN %YMM2, %YMM2, %k0 - kmovd %k0, %eax - testl %eax, %eax + /* COND_MASK integates the esi matches for VEC_SIZE == 64. For + VEC_SIZE == 32 they are already integrated. */ + VPTEST %VMM(2), %VMM(2), %k0 COND_MASK(k2) + KMOV %k0, %VRCX + TESTZ (rcx) jnz L(last_vec_x2) - VPTESTN %YMM3, %YMM3, %k0 - kmovd %k0, %eax - /* Combine YMM3 matches (eax) with YMM4 matches (ecx). */ -# ifdef USE_AS_WCSCHR - sall $8, %ecx - orl %ecx, %eax - bsfl %eax, %eax + VPTEST %VMM(3), %VMM(3), %k0 COND_MASK(k3) + KMOV %k0, %VRCX +# if CHAR_PER_VEC == 64 + TESTZ (rcx) + jnz L(last_vec_x3) # else - salq $32, %rcx - orq %rcx, %rax - bsfq %rax, %rax + salq $CHAR_PER_VEC, %rdx + TESTZ (rcx) + orq %rcx, %rdx # endif + + bsfq %rdx, %rdx + # ifndef USE_AS_STRCHRNUL /* Check if match was CHAR or null. */ - cmp (VEC_SIZE * 2)(%rdi, %rax, CHAR_SIZE), %CHAR_REG + cmp (LAST_VEC_OFFSET)(%LOOP_REG, %rdx, CHAR_SIZE), %CHAR_REG jne L(zero_end) # endif /* NB: Multiply sizeof char type (1 or 4) to get the number of bytes. */ - leaq (VEC_SIZE * 2)(%rdi, %rax, CHAR_SIZE), %rax + leaq (LAST_VEC_OFFSET)(%LOOP_REG, %rdx, CHAR_SIZE), %rax ret - .p2align 4,, 8 -L(last_vec_x1): - bsfl %eax, %eax -# ifdef USE_AS_WCSCHR - /* NB: Multiply wchar_t count by 4 to get the number of bytes. - */ - leaq (%rdi, %rax, CHAR_SIZE), %rax -# else - addq %rdi, %rax +# ifndef USE_AS_STRCHRNUL +L(zero_end): + xorl %eax, %eax + ret # endif -# ifndef USE_AS_STRCHRNUL + + /* Seperate return label for last VEC1 because for VEC_SIZE == + 32 we can reuse return code in L(page_cross) but VEC_SIZE == + 64 has mismatched registers. */ +# if VEC_SIZE == 64 + .p2align 4,, 8 +L(last_vec_x1_vec_size64): + bsf %VRCX, %VRCX +# ifndef USE_AS_STRCHRNUL /* Check if match was null. */ - cmp (%rax), %CHAR_REG + cmp (%rax, %rcx, CHAR_SIZE), %CHAR_REG jne L(zero_end) -# endif - +# endif +# ifdef USE_AS_WCSCHR + /* NB: Multiply wchar_t count by 4 to get the number of bytes. + */ + leaq (%rax, %rcx, CHAR_SIZE), %rax +# else + addq %rcx, %rax +# endif ret + /* Since we can't combine the last 2x matches for CHAR_PER_VEC + == 64 we need return label for last VEC3. */ +# if CHAR_PER_VEC == 64 .p2align 4,, 8 +L(last_vec_x3): + addq $VEC_SIZE, %LOOP_REG +# endif + + /* Duplicate L(last_vec_x2) for VEC_SIZE == 64 because we can't + reuse L(first_vec_x3) due to register mismatch. */ L(last_vec_x2): - bsfl %eax, %eax -# ifndef USE_AS_STRCHRNUL + bsf %VGPR(MASK_GPR), %VGPR(MASK_GPR) +# ifndef USE_AS_STRCHRNUL /* Check if match was null. */ - cmp (VEC_SIZE)(%rdi, %rax, CHAR_SIZE), %CHAR_REG + cmp (VEC_SIZE * 1)(%LOOP_REG, %MASK_GPR, CHAR_SIZE), %CHAR_REG jne L(zero_end) -# endif +# endif /* NB: Multiply sizeof char type (1 or 4) to get the number of bytes. */ - leaq (VEC_SIZE)(%rdi, %rax, CHAR_SIZE), %rax + leaq (VEC_SIZE * 1)(%LOOP_REG, %MASK_GPR, CHAR_SIZE), %rax ret +# endif - /* Cold case for crossing page with first load. */ - .p2align 4,, 8 + /* Cold case for crossing page with first load. */ + .p2align 4,, 10 +# ifndef USE_AS_STRCHRNUL L(cross_page_boundary): - movq %rdi, %rdx +# endif +L(cross_page_boundary_real): /* Align rdi. 
*/ - andq $-VEC_SIZE, %rdi - VMOVA (%rdi), %YMM1 - /* Leaves only CHARS matching esi as 0. */ - vpxorq %YMM1, %YMM0, %YMM2 - VPMINU %YMM2, %YMM1, %YMM2 - /* Each bit in K0 represents a CHAR or a null byte in YMM1. */ - VPTESTN %YMM2, %YMM2, %k0 - kmovd %k0, %eax + xorq %rdi, %rax + VMOVA (PAGE_SIZE - VEC_SIZE)(%rax), %VMM(1) + /* Use high latency method of getting matches to save code size. + */ + + /* K1 has 1s where VEC(1) does NOT match esi. */ + VPCMP $4, %VMM(1), %VMATCH, %k1 + /* K0 has ones where K1 is 1 (non-match with esi), and non-zero + (null). */ + VPTEST %VMM(1), %VMM(1), %k0{%k1} + KMOV %k0, %VRAX /* Remove the leading bits. */ # ifdef USE_AS_WCSCHR - movl %edx, %SHIFT_REG + movl %edi, %VGPR_SZ(SHIFT_REG, 32) /* NB: Divide shift count by 4 since each bit in K1 represent 4 bytes. */ - sarl $2, %SHIFT_REG - andl $(CHAR_PER_VEC - 1), %SHIFT_REG + sarl $2, %VGPR_SZ(SHIFT_REG, 32) + andl $(CHAR_PER_VEC - 1), %VGPR_SZ(SHIFT_REG, 32) + + /* if wcsrchr we need to reverse matches as we can't rely on + signed shift to bring in ones. There is not sarx for + gpr8/16. Also not we can't use inc here as the lower bits + represent matches out of range so we can't rely on overflow. + */ + xorl $((1 << CHAR_PER_VEC)- 1), %eax +# endif + /* Use arithmatic shift so that leading 1s are filled in. */ + sarx %VGPR(SHIFT_REG), %VRAX, %VRAX + /* If eax is all ones then no matches for esi or NULL. */ + +# ifdef USE_AS_WCSCHR + test %VRAX, %VRAX +# else + inc %VRAX # endif - sarxl %SHIFT_REG, %eax, %eax - /* If eax is zero continue. */ - testl %eax, %eax jz L(cross_page_continue) - bsfl %eax, %eax + .p2align 4,, 10 +L(last_vec_x1_vec_size32): + bsf %VRAX, %VRAX # ifdef USE_AS_WCSCHR - /* NB: Multiply wchar_t count by 4 to get the number of - bytes. */ - leaq (%rdx, %rax, CHAR_SIZE), %rax + /* NB: Multiply wchar_t count by 4 to get the number of bytes. + */ + leaq (%rdi, %rax, CHAR_SIZE), %rax # else - addq %rdx, %rax + addq %rdi, %rax # endif # ifndef USE_AS_STRCHRNUL /* Check to see if match was CHAR or null. 
*/ cmp (%rax), %CHAR_REG - je L(cross_page_ret) -L(zero_end): - xorl %eax, %eax -L(cross_page_ret): + jne L(zero_end_0) # endif ret +# ifndef USE_AS_STRCHRNUL +L(zero_end_0): + xorl %eax, %eax + ret +# endif END (STRCHR) #endif From patchwork Tue Oct 18 02:48:57 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Noah Goldstein X-Patchwork-Id: 1691324 Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@legolas.ozlabs.org Authentication-Results: legolas.ozlabs.org; spf=pass (sender SPF authorized) smtp.mailfrom=sourceware.org (client-ip=8.43.85.97; helo=sourceware.org; envelope-from=libc-alpha-bounces+incoming=patchwork.ozlabs.org@sourceware.org; receiver=) Authentication-Results: legolas.ozlabs.org; dkim=pass (1024-bit key; secure) header.d=sourceware.org header.i=@sourceware.org header.a=rsa-sha256 header.s=default header.b=pT1NcNpZ; dkim-atps=neutral Received: from sourceware.org (server2.sourceware.org [8.43.85.97]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature ECDSA (P-384) server-digest SHA384) (No client certificate requested) by legolas.ozlabs.org (Postfix) with ESMTPS id 4Mryxw2hsdz1ygT for ; Tue, 18 Oct 2022 13:49:40 +1100 (AEDT) Received: from server2.sourceware.org (localhost [IPv6:::1]) by sourceware.org (Postfix) with ESMTP id 53B003857402 for ; Tue, 18 Oct 2022 02:49:38 +0000 (GMT) DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 53B003857402 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=sourceware.org; s=default; t=1666061378; bh=PSQMROoJfqw3Uqfco9Q4iDeddMT2vOOCLzyfEiEff+g=; h=To:Subject:Date:In-Reply-To:References:List-Id:List-Unsubscribe: List-Archive:List-Post:List-Help:List-Subscribe:From:Reply-To: From; b=pT1NcNpZ1N+xkGRcB8QISppkbx7JGlrT8/+4jNXouj3nn9NLGh582DMoUW1cklF3o JDTtPR9gSJdeaHL9sxvhbgcJyltnTOm8co4pvJ99IiuYEMUOZUrrJIOXw+Gmn5qgn4 qwQl+pZ+6AO2rLCrROm0yd/7J6m8ZbKqn8cSJGjI= X-Original-To: libc-alpha@sourceware.org Delivered-To: libc-alpha@sourceware.org Received: from mail-oi1-x233.google.com (mail-oi1-x233.google.com [IPv6:2607:f8b0:4864:20::233]) by sourceware.org (Postfix) with ESMTPS id A8FDB3858C83 for ; Tue, 18 Oct 2022 02:49:08 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org A8FDB3858C83 Received: by mail-oi1-x233.google.com with SMTP id w196so14217319oiw.8 for ; Mon, 17 Oct 2022 19:49:08 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=PSQMROoJfqw3Uqfco9Q4iDeddMT2vOOCLzyfEiEff+g=; b=d0EE8UsOnvPaHy8H7V31LdRhZtjLIjEsA5GtVt1pJM+qMgQ1t/tUia/0KTUzREeFJS EEZRLycByarwkuilmtrI8+YlNfPv0I9DRkukwAceJ7ckanAixokLCKjReuE9uR6avWff oTrlgMAwvcobCVmrjapoWcidMc5ZdcuwP6w0v5JpjN1c4UAj+fPJTMmMNf2+3HtYCc7f 4jhvZU1JRmViSb4RzWQuIH9M0SV/UW1aGN2bPhYA+8JvIB52uXJ/R0HRM0CSRli9XLkV JzhyLAOGDLZz6+3S3EEJ+wy+7sj2AUneiNkHlkrSlr/qpguN1f7E9b+sNov4wepguiwY N4ww== X-Gm-Message-State: ACrzQf0bP3AHjlRPkqukSGdYnhCIMzgQLq7CkLg7tJUqmVhb0RT8RKO/ NTuanAw41wt53qeYxBnRpBLluIzjYmvM9A== X-Google-Smtp-Source: AMsMyM6WDYDMO0MWfIJR32j8HI4VfRBYkpz/WNJ4SklZU5uGde240oHq4HyODQqwQZGLrUfUacPgIA== X-Received: by 2002:aca:a84c:0:b0:355:4262:28ef with SMTP id r73-20020acaa84c000000b00355426228efmr1901819oie.14.1666061346960; Mon, 17 Oct 2022 19:49:06 -0700 (PDT) Received: from noah-tgl.lan 
(2603-8080-1301-76c6-02dd-0570-1640-b39b.res6.spectrum.com. [2603:8080:1301:76c6:2dd:570:1640:b39b]) by smtp.gmail.com with ESMTPSA id r10-20020a4a964a000000b00435a59fba01sm4957260ooi.47.2022.10.17.19.49.05 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 17 Oct 2022 19:49:06 -0700 (PDT) To: libc-alpha@sourceware.org Subject: [PATCH v1 3/7] x86: Optimize strnlen-evex.S and implement with VMM headers Date: Mon, 17 Oct 2022 19:48:57 -0700 Message-Id: <20221018024901.3381469-3-goldstein.w.n@gmail.com> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20221018024901.3381469-1-goldstein.w.n@gmail.com> References: <20221018024901.3381469-1-goldstein.w.n@gmail.com> MIME-Version: 1.0 X-Spam-Status: No, score=-12.1 required=5.0 tests=BAYES_00, DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, FREEMAIL_FROM, GIT_PATCH_0, KAM_SHORT, RCVD_IN_DNSWL_NONE, SPF_HELO_NONE, SPF_PASS, TXREP autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org X-BeenThere: libc-alpha@sourceware.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Libc-alpha mailing list List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-Patchwork-Original-From: Noah Goldstein via Libc-alpha From: Noah Goldstein Reply-To: Noah Goldstein Errors-To: libc-alpha-bounces+incoming=patchwork.ozlabs.org@sourceware.org Sender: "Libc-alpha" Optimizations are: 1. Use the fact that bsf(0) leaves the destination unchanged to save a branch in short string case. 2. Restructure code so that small strings are given the hot path. - This is a net-zero on the benchmark suite but in general makes sense as smaller sizes are far more common. 3. Use more code-size efficient instructions. - tzcnt ... -> bsf ... - vpcmpb $0 ... -> vpcmpeq ... 4. Align labels less aggressively, especially if it doesn't save fetch blocks / causes the basic-block to span extra cache-lines. The optimizations (especially for point 2) make the strnlen and strlen code essentially incompatible so split strnlen-evex to a new file. Code Size Changes: strlen-evex.S : -23 bytes strnlen-evex.S : -167 bytes Net perf changes: Reported as geometric mean of all improvements / regressions from N=10 runs of the benchtests. Value as New Time / Old Time so < 1.0 is improvement and 1.0 is regression. strlen-evex.S : 0.992 (No real change) strnlen-evex.S : 0.947 Full results attached in email. Full check passes on x86-64. 
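(Optimization (1) is worth spelling out: the entry path preloads the result register with maxlen and then runs bsf on the null-byte mask, so the empty-mask case needs no branch at all. A rough C-with-inline-asm rendering of that first-vector step -- x86-64 GCC sketch only, with an invented helper name, and ignoring the fall-through to the longer-string path:

    #include <stddef.h>

    /* mask: one bit per null byte found in the first vector;
       maxlen: the strnlen bound, in characters.  */
    static inline size_t
    first_vec_strnlen (unsigned long mask, size_t maxlen)
    {
      size_t pos = maxlen;            /* default result if mask == 0 */
      /* bsf leaves the destination untouched when the source is zero, so no
         branch is needed to fall back to maxlen ("+r" keeps the old value
         live as an input).  */
      __asm__ ("bsfq %1, %0" : "+r" (pos) : "rm" (mask) : "cc");
      return pos < maxlen ? pos : maxlen;   /* the cmovb in the patch */
    }

)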
--- sysdeps/x86_64/multiarch/strlen-evex.S | 544 +++++++----------------- sysdeps/x86_64/multiarch/strnlen-evex.S | 427 ++++++++++++++++++- sysdeps/x86_64/multiarch/wcsnlen-evex.S | 5 +- 3 files changed, 572 insertions(+), 404 deletions(-) diff --git a/sysdeps/x86_64/multiarch/strlen-evex.S b/sysdeps/x86_64/multiarch/strlen-evex.S index 2109ec2f7a..487846f098 100644 --- a/sysdeps/x86_64/multiarch/strlen-evex.S +++ b/sysdeps/x86_64/multiarch/strlen-evex.S @@ -26,466 +26,220 @@ # define STRLEN __strlen_evex # endif -# define VMOVA vmovdqa64 +# ifndef VEC_SIZE +# include "x86-evex256-vecs.h" +# endif # ifdef USE_AS_WCSLEN -# define VPCMP vpcmpd +# define VPCMPEQ vpcmpeqd +# define VPCMPNEQ vpcmpneqd +# define VPTESTN vptestnmd +# define VPTEST vptestmd # define VPMINU vpminud -# define SHIFT_REG ecx # define CHAR_SIZE 4 +# define CHAR_SIZE_SHIFT_REG(reg) sar $2, %reg # else -# define VPCMP vpcmpb +# define VPCMPEQ vpcmpeqb +# define VPCMPNEQ vpcmpneqb +# define VPTESTN vptestnmb +# define VPTEST vptestmb # define VPMINU vpminub -# define SHIFT_REG edx # define CHAR_SIZE 1 +# define CHAR_SIZE_SHIFT_REG(reg) + +# define REG_WIDTH VEC_SIZE # endif -# define XMMZERO xmm16 -# define YMMZERO ymm16 -# define YMM1 ymm17 -# define YMM2 ymm18 -# define YMM3 ymm19 -# define YMM4 ymm20 -# define YMM5 ymm21 -# define YMM6 ymm22 - -# define VEC_SIZE 32 -# define PAGE_SIZE 4096 -# define CHAR_PER_VEC (VEC_SIZE / CHAR_SIZE) - - .section .text.evex,"ax",@progbits -ENTRY (STRLEN) -# ifdef USE_AS_STRNLEN - /* Check zero length. */ - test %RSI_LP, %RSI_LP - jz L(zero) -# ifdef __ILP32__ - /* Clear the upper 32 bits. */ - movl %esi, %esi -# endif - mov %RSI_LP, %R8_LP +# define CHAR_PER_VEC (VEC_SIZE / CHAR_SIZE) + +# include "reg-macros.h" + +# if CHAR_PER_VEC == 64 + +# define TAIL_RETURN_LBL first_vec_x2 +# define TAIL_RETURN_OFFSET (CHAR_PER_VEC * 2) + +# define FALLTHROUGH_RETURN_LBL first_vec_x3 +# define FALLTHROUGH_RETURN_OFFSET (CHAR_PER_VEC * 3) + +# else + +# define TAIL_RETURN_LBL first_vec_x3 +# define TAIL_RETURN_OFFSET (CHAR_PER_VEC * 3) + +# define FALLTHROUGH_RETURN_LBL first_vec_x2 +# define FALLTHROUGH_RETURN_OFFSET (CHAR_PER_VEC * 2) # endif + +# define XZERO VMM_128(0) +# define VZERO VMM(0) +# define PAGE_SIZE 4096 + + .section SECTION(.text), "ax", @progbits +ENTRY_P2ALIGN (STRLEN, 6) movl %edi, %eax - vpxorq %XMMZERO, %XMMZERO, %XMMZERO - /* Clear high bits from edi. Only keeping bits relevant to page - cross check. */ + vpxorq %XZERO, %XZERO, %XZERO andl $(PAGE_SIZE - 1), %eax - /* Check if we may cross page boundary with one vector load. */ cmpl $(PAGE_SIZE - VEC_SIZE), %eax ja L(cross_page_boundary) /* Check the first VEC_SIZE bytes. Each bit in K0 represents a null byte. */ - VPCMP $0, (%rdi), %YMMZERO, %k0 - kmovd %k0, %eax -# ifdef USE_AS_STRNLEN - /* If length < CHAR_PER_VEC handle special. */ - cmpq $CHAR_PER_VEC, %rsi - jbe L(first_vec_x0) -# endif - testl %eax, %eax + VPCMPEQ (%rdi), %VZERO, %k0 + KMOV %k0, %VRAX + test %VRAX, %VRAX jz L(aligned_more) - tzcntl %eax, %eax - ret -# ifdef USE_AS_STRNLEN -L(zero): - xorl %eax, %eax - ret - - .p2align 4 -L(first_vec_x0): - /* Set bit for max len so that tzcnt will return min of max len - and position of first match. */ - btsq %rsi, %rax - tzcntl %eax, %eax - ret -# endif - - .p2align 4 -L(first_vec_x1): - tzcntl %eax, %eax - /* Safe to use 32 bit instructions as these are only called for - size = [1, 159]. */ -# ifdef USE_AS_STRNLEN - /* Use ecx which was computed earlier to compute correct value. 
- */ - leal -(CHAR_PER_VEC * 4 + 1)(%rcx, %rax), %eax -# else - subl %edx, %edi -# ifdef USE_AS_WCSLEN - /* NB: Divide bytes by 4 to get the wchar_t count. */ - sarl $2, %edi -# endif - leal CHAR_PER_VEC(%rdi, %rax), %eax -# endif - ret - - .p2align 4 -L(first_vec_x2): - tzcntl %eax, %eax - /* Safe to use 32 bit instructions as these are only called for - size = [1, 159]. */ -# ifdef USE_AS_STRNLEN - /* Use ecx which was computed earlier to compute correct value. - */ - leal -(CHAR_PER_VEC * 3 + 1)(%rcx, %rax), %eax -# else - subl %edx, %edi -# ifdef USE_AS_WCSLEN - /* NB: Divide bytes by 4 to get the wchar_t count. */ - sarl $2, %edi -# endif - leal (CHAR_PER_VEC * 2)(%rdi, %rax), %eax -# endif + bsf %VRAX, %VRAX ret - .p2align 4 -L(first_vec_x3): - tzcntl %eax, %eax - /* Safe to use 32 bit instructions as these are only called for - size = [1, 159]. */ -# ifdef USE_AS_STRNLEN - /* Use ecx which was computed earlier to compute correct value. - */ - leal -(CHAR_PER_VEC * 2 + 1)(%rcx, %rax), %eax -# else - subl %edx, %edi -# ifdef USE_AS_WCSLEN - /* NB: Divide bytes by 4 to get the wchar_t count. */ - sarl $2, %edi -# endif - leal (CHAR_PER_VEC * 3)(%rdi, %rax), %eax -# endif - ret - - .p2align 4 + .p2align 4,, 8 L(first_vec_x4): - tzcntl %eax, %eax - /* Safe to use 32 bit instructions as these are only called for - size = [1, 159]. */ -# ifdef USE_AS_STRNLEN - /* Use ecx which was computed earlier to compute correct value. - */ - leal -(CHAR_PER_VEC + 1)(%rcx, %rax), %eax -# else - subl %edx, %edi -# ifdef USE_AS_WCSLEN - /* NB: Divide bytes by 4 to get the wchar_t count. */ - sarl $2, %edi -# endif + bsf %VRAX, %VRAX + subl %ecx, %edi + CHAR_SIZE_SHIFT_REG (edi) leal (CHAR_PER_VEC * 4)(%rdi, %rax), %eax -# endif ret - .p2align 5 + + + /* Aligned more for strnlen compares remaining length vs 2 * + CHAR_PER_VEC, 4 * CHAR_PER_VEC, and 8 * CHAR_PER_VEC before + going to the loop. */ + .p2align 4,, 10 L(aligned_more): - movq %rdi, %rdx - /* Align data to VEC_SIZE. */ - andq $-(VEC_SIZE), %rdi + movq %rdi, %rcx + andq $(VEC_SIZE * -1), %rdi L(cross_page_continue): - /* Check the first 4 * VEC_SIZE. Only one VEC_SIZE at a time - since data is only aligned to VEC_SIZE. */ -# ifdef USE_AS_STRNLEN - /* + CHAR_SIZE because it simplies the logic in - last_4x_vec_or_less. */ - leaq (VEC_SIZE * 5 + CHAR_SIZE)(%rdi), %rcx - subq %rdx, %rcx -# ifdef USE_AS_WCSLEN - /* NB: Divide bytes by 4 to get the wchar_t count. */ - sarl $2, %ecx -# endif -# endif - /* Load first VEC regardless. */ - VPCMP $0, VEC_SIZE(%rdi), %YMMZERO, %k0 -# ifdef USE_AS_STRNLEN - /* Adjust length. If near end handle specially. */ - subq %rcx, %rsi - jb L(last_4x_vec_or_less) -# endif - kmovd %k0, %eax - testl %eax, %eax + /* Remaining length >= 2 * CHAR_PER_VEC so do VEC0/VEC1 without + rechecking bounds. 
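(The return blocks here all follow the same recipe: distance from the original pointer to the aligned base, plus a whole number of vectors, plus the bit index, finally scaled to characters. In plain C -- a sketch, with constants assumed for the byte-string / 256-bit case and an invented name:

    #include <stddef.h>

    #define VEC_SIZE  32    /* 256-bit vectors assumed */
    #define CHAR_SIZE 1     /* 4 for wcslen */

    /* orig: pointer strlen was called with; aligned: orig rounded down to
       VEC_SIZE; n: which vector after 'aligned' held the hit; mask: one bit
       per character, non-zero, set at null positions.  */
    static size_t
    vec_hit_to_length (const char *orig, const char *aligned,
                       unsigned int n, unsigned int mask)
    {
      const char *hit = aligned + (size_t) n * VEC_SIZE
                        + (size_t) __builtin_ctz (mask) * CHAR_SIZE;
      return (size_t) (hit - orig) / CHAR_SIZE;
    }

)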
*/ + VPCMPEQ (VEC_SIZE * 1)(%rdi), %VZERO, %k0 + KMOV %k0, %VRAX + test %VRAX, %VRAX jnz L(first_vec_x1) - VPCMP $0, (VEC_SIZE * 2)(%rdi), %YMMZERO, %k0 - kmovd %k0, %eax - test %eax, %eax + VPCMPEQ (VEC_SIZE * 2)(%rdi), %VZERO, %k0 + KMOV %k0, %VRAX + test %VRAX, %VRAX jnz L(first_vec_x2) - VPCMP $0, (VEC_SIZE * 3)(%rdi), %YMMZERO, %k0 - kmovd %k0, %eax - testl %eax, %eax + VPCMPEQ (VEC_SIZE * 3)(%rdi), %VZERO, %k0 + KMOV %k0, %VRAX + test %VRAX, %VRAX jnz L(first_vec_x3) - VPCMP $0, (VEC_SIZE * 4)(%rdi), %YMMZERO, %k0 - kmovd %k0, %eax - testl %eax, %eax + VPCMPEQ (VEC_SIZE * 4)(%rdi), %VZERO, %k0 + KMOV %k0, %VRAX + test %VRAX, %VRAX jnz L(first_vec_x4) - addq $VEC_SIZE, %rdi -# ifdef USE_AS_STRNLEN - /* Check if at last VEC_SIZE * 4 length. */ - cmpq $(CHAR_PER_VEC * 4 - 1), %rsi - jbe L(last_4x_vec_or_less_load) - movl %edi, %ecx - andl $(VEC_SIZE * 4 - 1), %ecx -# ifdef USE_AS_WCSLEN - /* NB: Divide bytes by 4 to get the wchar_t count. */ - sarl $2, %ecx -# endif - /* Readjust length. */ - addq %rcx, %rsi -# endif - /* Align data to VEC_SIZE * 4. */ + subq $(VEC_SIZE * -1), %rdi + +# if CHAR_PER_VEC == 64 + /* No partial register stalls on processors that we use evex512 + on and this saves code size. */ + xorb %dil, %dil +# else andq $-(VEC_SIZE * 4), %rdi +# endif + + /* Compare 4 * VEC at a time forward. */ .p2align 4 L(loop_4x_vec): - /* Load first VEC regardless. */ - VMOVA (VEC_SIZE * 4)(%rdi), %YMM1 -# ifdef USE_AS_STRNLEN - /* Break if at end of length. */ - subq $(CHAR_PER_VEC * 4), %rsi - jb L(last_4x_vec_or_less_cmpeq) -# endif - /* Save some code size by microfusing VPMINU with the load. Since - the matches in ymm2/ymm4 can only be returned if there where no - matches in ymm1/ymm3 respectively there is no issue with overlap. - */ - VPMINU (VEC_SIZE * 5)(%rdi), %YMM1, %YMM2 - VMOVA (VEC_SIZE * 6)(%rdi), %YMM3 - VPMINU (VEC_SIZE * 7)(%rdi), %YMM3, %YMM4 + VMOVA (VEC_SIZE * 4)(%rdi), %VMM(1) + VPMINU (VEC_SIZE * 5)(%rdi), %VMM(1), %VMM(2) + VMOVA (VEC_SIZE * 6)(%rdi), %VMM(3) + VPMINU (VEC_SIZE * 7)(%rdi), %VMM(3), %VMM(4) + VPTESTN %VMM(2), %VMM(2), %k0 + VPTESTN %VMM(4), %VMM(4), %k2 - VPCMP $0, %YMM2, %YMMZERO, %k0 - VPCMP $0, %YMM4, %YMMZERO, %k1 subq $-(VEC_SIZE * 4), %rdi - kortestd %k0, %k1 + KORTEST %k0, %k2 jz L(loop_4x_vec) - /* Check if end was in first half. */ - kmovd %k0, %eax - subq %rdx, %rdi -# ifdef USE_AS_WCSLEN - shrq $2, %rdi -# endif - testl %eax, %eax - jz L(second_vec_return) + VPTESTN %VMM(1), %VMM(1), %k1 + KMOV %k1, %VRAX + test %VRAX, %VRAX + jnz L(first_vec_x0) - VPCMP $0, %YMM1, %YMMZERO, %k2 - kmovd %k2, %edx - /* Combine VEC1 matches (edx) with VEC2 matches (eax). */ -# ifdef USE_AS_WCSLEN - sall $CHAR_PER_VEC, %eax - orl %edx, %eax - tzcntl %eax, %eax -# else - salq $CHAR_PER_VEC, %rax - orq %rdx, %rax - tzcntq %rax, %rax -# endif - addq %rdi, %rax - ret - - -# ifdef USE_AS_STRNLEN - -L(last_4x_vec_or_less_load): - /* Depending on entry adjust rdi / prepare first VEC in YMM1. */ - VMOVA (VEC_SIZE * 4)(%rdi), %YMM1 -L(last_4x_vec_or_less_cmpeq): - VPCMP $0, %YMM1, %YMMZERO, %k0 - addq $(VEC_SIZE * 3), %rdi -L(last_4x_vec_or_less): - kmovd %k0, %eax - /* If remaining length > VEC_SIZE * 2. This works if esi is off by - VEC_SIZE * 4. */ - testl $(CHAR_PER_VEC * 2), %esi - jnz L(last_4x_vec) - - /* length may have been negative or positive by an offset of - CHAR_PER_VEC * 4 depending on where this was called from. This - fixes that. 
*/ - andl $(CHAR_PER_VEC * 4 - 1), %esi - testl %eax, %eax - jnz L(last_vec_x1_check) + KMOV %k0, %VRAX + test %VRAX, %VRAX + jnz L(first_vec_x1) - /* Check the end of data. */ - subl $CHAR_PER_VEC, %esi - jb L(max) + VPTESTN %VMM(3), %VMM(3), %k0 - VPCMP $0, (VEC_SIZE * 2)(%rdi), %YMMZERO, %k0 - kmovd %k0, %eax - tzcntl %eax, %eax - /* Check the end of data. */ - cmpl %eax, %esi - jb L(max) - - subq %rdx, %rdi -# ifdef USE_AS_WCSLEN - /* NB: Divide bytes by 4 to get the wchar_t count. */ - sarq $2, %rdi -# endif - leaq (CHAR_PER_VEC * 2)(%rdi, %rax), %rax - ret -L(max): - movq %r8, %rax - ret -# endif - - /* Placed here in strnlen so that the jcc L(last_4x_vec_or_less) - in the 4x VEC loop can use 2 byte encoding. */ - .p2align 4 -L(second_vec_return): - VPCMP $0, %YMM3, %YMMZERO, %k0 - /* Combine YMM3 matches (k0) with YMM4 matches (k1). */ -# ifdef USE_AS_WCSLEN - kunpckbw %k0, %k1, %k0 - kmovd %k0, %eax - tzcntl %eax, %eax +# if CHAR_PER_VEC == 64 + KMOV %k0, %VRAX + test %VRAX, %VRAX + jnz L(first_vec_x2) + KMOV %k2, %VRAX # else - kunpckdq %k0, %k1, %k0 - kmovq %k0, %rax - tzcntq %rax, %rax + /* We can only combine last 2x VEC masks if CHAR_PER_VEC <= 32. + */ + kmovd %k2, %edx + kmovd %k0, %eax + salq $CHAR_PER_VEC, %rdx + orq %rdx, %rax # endif - leaq (CHAR_PER_VEC * 2)(%rdi, %rax), %rax - ret - -# ifdef USE_AS_STRNLEN -L(last_vec_x1_check): - tzcntl %eax, %eax - /* Check the end of data. */ - cmpl %eax, %esi - jb L(max) - subq %rdx, %rdi -# ifdef USE_AS_WCSLEN - /* NB: Divide bytes by 4 to get the wchar_t count. */ - sarq $2, %rdi -# endif - leaq (CHAR_PER_VEC)(%rdi, %rax), %rax + /* first_vec_x3 for strlen-ZMM and first_vec_x2 for strlen-YMM. + */ + .p2align 4,, 2 +L(FALLTHROUGH_RETURN_LBL): + bsfq %rax, %rax + subq %rcx, %rdi + CHAR_SIZE_SHIFT_REG (rdi) + leaq (FALLTHROUGH_RETURN_OFFSET)(%rdi, %rax), %rax ret - .p2align 4 -L(last_4x_vec): - /* Test first 2x VEC normally. */ - testl %eax, %eax - jnz L(last_vec_x1) - - VPCMP $0, (VEC_SIZE * 2)(%rdi), %YMMZERO, %k0 - kmovd %k0, %eax - testl %eax, %eax - jnz L(last_vec_x2) - - /* Normalize length. */ - andl $(CHAR_PER_VEC * 4 - 1), %esi - VPCMP $0, (VEC_SIZE * 3)(%rdi), %YMMZERO, %k0 - kmovd %k0, %eax - testl %eax, %eax - jnz L(last_vec_x3) - - /* Check the end of data. */ - subl $(CHAR_PER_VEC * 3), %esi - jb L(max) - - VPCMP $0, (VEC_SIZE * 4)(%rdi), %YMMZERO, %k0 - kmovd %k0, %eax - tzcntl %eax, %eax - /* Check the end of data. */ - cmpl %eax, %esi - jb L(max_end) - - subq %rdx, %rdi -# ifdef USE_AS_WCSLEN - /* NB: Divide bytes by 4 to get the wchar_t count. */ - sarq $2, %rdi -# endif - leaq (CHAR_PER_VEC * 4)(%rdi, %rax), %rax + .p2align 4,, 8 +L(first_vec_x0): + bsf %VRAX, %VRAX + sub %rcx, %rdi + CHAR_SIZE_SHIFT_REG (rdi) + addq %rdi, %rax ret - .p2align 4 -L(last_vec_x1): - tzcntl %eax, %eax - subq %rdx, %rdi -# ifdef USE_AS_WCSLEN - /* NB: Divide bytes by 4 to get the wchar_t count. */ - sarq $2, %rdi -# endif + .p2align 4,, 10 +L(first_vec_x1): + bsf %VRAX, %VRAX + sub %rcx, %rdi + CHAR_SIZE_SHIFT_REG (rdi) leaq (CHAR_PER_VEC)(%rdi, %rax), %rax ret - .p2align 4 -L(last_vec_x2): - tzcntl %eax, %eax - subq %rdx, %rdi -# ifdef USE_AS_WCSLEN - /* NB: Divide bytes by 4 to get the wchar_t count. */ - sarq $2, %rdi -# endif - leaq (CHAR_PER_VEC * 2)(%rdi, %rax), %rax - ret - - .p2align 4 -L(last_vec_x3): - tzcntl %eax, %eax - subl $(CHAR_PER_VEC * 2), %esi - /* Check the end of data. */ - cmpl %eax, %esi - jb L(max_end) - subq %rdx, %rdi -# ifdef USE_AS_WCSLEN - /* NB: Divide bytes by 4 to get the wchar_t count. 
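(When CHAR_PER_VEC <= 32 the masks of the last two vectors are folded into one 64-bit word so that a single bsf scans both, as in the kmovd/salq/orq sequence above. Schematically -- a C sketch with invented names:

    #include <stdint.h>

    /* mask_lo: null positions in the earlier vector, mask_hi: in the later
       one; at least one of them is known to be non-zero here.  */
    static inline unsigned int
    first_nul_of_last_two (uint32_t mask_lo, uint32_t mask_hi)
    {
      uint64_t combined = ((uint64_t) mask_hi << 32) | mask_lo;
      return (unsigned int) __builtin_ctzll (combined);
    }

For 64-char vectors the two masks no longer fit one register, which is why the CHAR_PER_VEC == 64 paths test each vector separately.)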
*/ - sarq $2, %rdi -# endif - leaq (CHAR_PER_VEC * 3)(%rdi, %rax), %rax - ret -L(max_end): - movq %r8, %rax + .p2align 4,, 10 + /* first_vec_x2 for strlen-ZMM and first_vec_x3 for strlen-YMM. + */ +L(TAIL_RETURN_LBL): + bsf %VRAX, %VRAX + sub %VRCX, %VRDI + CHAR_SIZE_SHIFT_REG (VRDI) + lea (TAIL_RETURN_OFFSET)(%rdi, %rax), %VRAX ret -# endif - /* Cold case for crossing page with first load. */ - .p2align 4 + .p2align 4,, 8 L(cross_page_boundary): - movq %rdi, %rdx + movq %rdi, %rcx /* Align data to VEC_SIZE. */ andq $-VEC_SIZE, %rdi - VPCMP $0, (%rdi), %YMMZERO, %k0 - kmovd %k0, %eax - /* Remove the leading bytes. */ + + VPCMPEQ (%rdi), %VZERO, %k0 + + KMOV %k0, %VRAX # ifdef USE_AS_WCSLEN - /* NB: Divide shift count by 4 since each bit in K0 represent 4 - bytes. */ - movl %edx, %ecx - shrl $2, %ecx - andl $(CHAR_PER_VEC - 1), %ecx -# endif - /* SHIFT_REG is ecx for USE_AS_WCSLEN and edx otherwise. */ - sarxl %SHIFT_REG, %eax, %eax + movl %ecx, %edx + shrl $2, %edx + andl $(CHAR_PER_VEC - 1), %edx + shrx %edx, %eax, %eax testl %eax, %eax -# ifndef USE_AS_STRNLEN - jz L(cross_page_continue) - tzcntl %eax, %eax - ret # else - jnz L(cross_page_less_vec) -# ifndef USE_AS_WCSLEN - movl %edx, %ecx - andl $(CHAR_PER_VEC - 1), %ecx -# endif - movl $CHAR_PER_VEC, %eax - subl %ecx, %eax - /* Check the end of data. */ - cmpq %rax, %rsi - ja L(cross_page_continue) - movl %esi, %eax - ret -L(cross_page_less_vec): - tzcntl %eax, %eax - /* Select min of length and position of first null. */ - cmpq %rax, %rsi - cmovb %esi, %eax - ret + shr %cl, %VRAX # endif + jz L(cross_page_continue) + bsf %VRAX, %VRAX + ret END (STRLEN) #endif diff --git a/sysdeps/x86_64/multiarch/strnlen-evex.S b/sysdeps/x86_64/multiarch/strnlen-evex.S index 64a9fc2606..443a32a749 100644 --- a/sysdeps/x86_64/multiarch/strnlen-evex.S +++ b/sysdeps/x86_64/multiarch/strnlen-evex.S @@ -1,8 +1,423 @@ -#ifndef STRNLEN -# define STRNLEN __strnlen_evex -#endif +/* strnlen/wcsnlen optimized with 256-bit EVEX instructions. + Copyright (C) 2022 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . 
*/ + +#include +#include + +#if ISA_SHOULD_BUILD (4) + +# ifndef VEC_SIZE +# include "x86-evex256-vecs.h" +# endif + + +# ifndef STRNLEN +# define STRNLEN __strnlen_evex +# endif + +# ifdef USE_AS_WCSLEN +# define VPCMPEQ vpcmpeqd +# define VPCMPNEQ vpcmpneqd +# define VPTESTN vptestnmd +# define VPTEST vptestmd +# define VPMINU vpminud +# define CHAR_SIZE 4 + +# else +# define VPCMPEQ vpcmpeqb +# define VPCMPNEQ vpcmpneqb +# define VPTESTN vptestnmb +# define VPTEST vptestmb +# define VPMINU vpminub +# define CHAR_SIZE 1 + +# define REG_WIDTH VEC_SIZE +# endif + +# define CHAR_PER_VEC (VEC_SIZE / CHAR_SIZE) + +# include "reg-macros.h" + +# if CHAR_PER_VEC == 32 +# define SUB_SHORT(imm, reg) subb $(imm), %VGPR_SZ(reg, 8) +# else +# define SUB_SHORT(imm, reg) subl $(imm), %VGPR_SZ(reg, 32) +# endif + + + +# if CHAR_PER_VEC == 64 +# define FALLTHROUGH_RETURN_OFFSET (CHAR_PER_VEC * 3) +# else +# define FALLTHROUGH_RETURN_OFFSET (CHAR_PER_VEC * 2) +# endif + + +# define XZERO VMM_128(0) +# define VZERO VMM(0) +# define PAGE_SIZE 4096 + + .section SECTION(.text), "ax", @progbits +ENTRY_P2ALIGN (STRNLEN, 6) + /* Check zero length. */ + test %RSI_LP, %RSI_LP + jz L(zero) +# ifdef __ILP32__ + /* Clear the upper 32 bits. */ + movl %esi, %esi +# endif + + movl %edi, %eax + vpxorq %XZERO, %XZERO, %XZERO + andl $(PAGE_SIZE - 1), %eax + cmpl $(PAGE_SIZE - VEC_SIZE), %eax + ja L(cross_page_boundary) + + /* Check the first VEC_SIZE bytes. Each bit in K0 represents a + null byte. */ + VPCMPEQ (%rdi), %VZERO, %k0 + + KMOV %k0, %VRCX + movq %rsi, %rax + + /* If src (rcx) is zero, bsf does not change the result. NB: + Must use 64-bit bsf here so that upper bits of len are not + cleared. */ + bsfq %rcx, %rax + /* If rax > CHAR_PER_VEC then rcx must have been zero (no null + CHAR) and rsi must be > CHAR_PER_VEC. */ + cmpq $CHAR_PER_VEC, %rax + ja L(more_1x_vec) + /* Check if first match in bounds. */ + cmpq %rax, %rsi + cmovb %esi, %eax + ret + + +# if CHAR_PER_VEC != 32 + .p2align 4,, 2 +L(zero): +L(max_0): + movl %esi, %eax + ret +# endif + + /* Aligned more for strnlen compares remaining length vs 2 * + CHAR_PER_VEC, 4 * CHAR_PER_VEC, and 8 * CHAR_PER_VEC before + going to the loop. */ + .p2align 4,, 10 +L(more_1x_vec): +L(cross_page_continue): + /* Compute number of words checked after aligning. */ +# ifdef USE_AS_WCSLEN + /* Need to compute directly for wcslen as CHAR_SIZE * rsi can + overflow. */ + movq %rdi, %rax + andq $(VEC_SIZE * -1), %rdi + subq %rdi, %rax + sarq $2, %rax + leaq -(CHAR_PER_VEC * 1)(%rax, %rsi), %rax +# else + leaq (VEC_SIZE * -1)(%rsi, %rdi), %rax + andq $(VEC_SIZE * -1), %rdi + subq %rdi, %rax +# endif + + + VPCMPEQ VEC_SIZE(%rdi), %VZERO, %k0 + + cmpq $(CHAR_PER_VEC * 2), %rax + ja L(more_2x_vec) + +L(last_2x_vec_or_less): + KMOV %k0, %VRDX + test %VRDX, %VRDX + jnz L(last_vec_check) + + /* Check the end of data. */ + SUB_SHORT (CHAR_PER_VEC, rax) + jbe L(max_0) + VPCMPEQ (VEC_SIZE * 2)(%rdi), %VZERO, %k0 + KMOV %k0, %VRDX + test %VRDX, %VRDX + jz L(max_0) + /* Best place for LAST_VEC_CHECK if ZMM. 
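(The remark above about CHAR_SIZE * rsi overflowing is easy to skim past: wcsnlen's maxlen is counted in wchar_t units, and converting it to bytes up front can wrap a 64-bit size_t, which is why the remaining length is tracked in characters instead. A short illustration, hypothetical value:

    #include <stddef.h>
    #include <stdint.h>

    static size_t
    maxlen_in_bytes (size_t maxlen)   /* e.g. maxlen == SIZE_MAX is legal */
    {
      return maxlen * sizeof (wchar_t);  /* SIZE_MAX * 4 wraps to SIZE_MAX - 3 */
    }

)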
*/ + .p2align 4,, 8 +L(last_vec_check): + bsf %VRDX, %VRDX + sub %eax, %edx + lea (%rsi, %rdx), %eax + cmovae %esi, %eax + ret + +# if CHAR_PER_VEC == 32 + .p2align 4,, 2 +L(zero): +L(max_0): + movl %esi, %eax + ret +# endif + + .p2align 4,, 8 +L(last_4x_vec_or_less): + addl $(CHAR_PER_VEC * -4), %eax + VPCMPEQ (VEC_SIZE * 5)(%rdi), %VZERO, %k0 + subq $(VEC_SIZE * -4), %rdi + cmpl $(CHAR_PER_VEC * 2), %eax + jbe L(last_2x_vec_or_less) + + .p2align 4,, 6 +L(more_2x_vec): + /* Remaining length >= 2 * CHAR_PER_VEC so do VEC0/VEC1 without + rechecking bounds. */ -#define USE_AS_STRNLEN 1 -#define STRLEN STRNLEN + KMOV %k0, %VRDX -#include "strlen-evex.S" + test %VRDX, %VRDX + jnz L(first_vec_x1) + + VPCMPEQ (VEC_SIZE * 2)(%rdi), %VZERO, %k0 + KMOV %k0, %VRDX + test %VRDX, %VRDX + jnz L(first_vec_x2) + + cmpq $(CHAR_PER_VEC * 4), %rax + ja L(more_4x_vec) + + + VPCMPEQ (VEC_SIZE * 3)(%rdi), %VZERO, %k0 + KMOV %k0, %VRDX + addl $(CHAR_PER_VEC * -2), %eax + test %VRDX, %VRDX + jnz L(last_vec_check) + + subl $(CHAR_PER_VEC), %eax + jbe L(max_1) + + VPCMPEQ (VEC_SIZE * 4)(%rdi), %VZERO, %k0 + KMOV %k0, %VRDX + + test %VRDX, %VRDX + jnz L(last_vec_check) +L(max_1): + movl %esi, %eax + ret + + .p2align 4,, 3 +L(first_vec_x2): +# if VEC_SIZE == 64 + /* If VEC_SIZE == 64 we can fit logic for full return label in + spare bytes before next cache line. */ + bsf %VRDX, %VRDX + sub %eax, %esi + leal (CHAR_PER_VEC * 1)(%rsi, %rdx), %eax + ret + .p2align 4,, 6 +# else + addl $CHAR_PER_VEC, %esi +# endif +L(first_vec_x1): + bsf %VRDX, %VRDX + sub %eax, %esi + leal (CHAR_PER_VEC * 0)(%rsi, %rdx), %eax + ret + + + .p2align 4,, 6 +L(first_vec_x4): +# if VEC_SIZE == 64 + /* If VEC_SIZE == 64 we can fit logic for full return label in + spare bytes before next cache line. */ + bsf %VRDX, %VRDX + sub %eax, %esi + leal (CHAR_PER_VEC * 3)(%rsi, %rdx), %eax + ret + .p2align 4,, 6 +# else + addl $CHAR_PER_VEC, %esi +# endif +L(first_vec_x3): + bsf %VRDX, %VRDX + sub %eax, %esi + leal (CHAR_PER_VEC * 2)(%rsi, %rdx), %eax + ret + + .p2align 4,, 5 +L(more_4x_vec): + VPCMPEQ (VEC_SIZE * 3)(%rdi), %VZERO, %k0 + KMOV %k0, %VRDX + test %VRDX, %VRDX + jnz L(first_vec_x3) + + VPCMPEQ (VEC_SIZE * 4)(%rdi), %VZERO, %k0 + KMOV %k0, %VRDX + test %VRDX, %VRDX + jnz L(first_vec_x4) + + /* Check if at last VEC_SIZE * 4 length before aligning for the + loop. */ + cmpq $(CHAR_PER_VEC * 8), %rax + jbe L(last_4x_vec_or_less) + + + /* Compute number of words checked after aligning. */ +# ifdef USE_AS_WCSLEN + /* Need to compute directly for wcslen as CHAR_SIZE * rsi can + overflow. */ + leaq (VEC_SIZE * -3)(%rdi), %rdx +# else + leaq (VEC_SIZE * -3)(%rdi, %rax), %rax +# endif + + subq $(VEC_SIZE * -1), %rdi + + /* Align data to VEC_SIZE * 4. */ +# if VEC_SIZE == 64 + /* Saves code size. No evex512 processor has partial register + stalls. If that change this can be replaced with `andq + $-(VEC_SIZE * 4), %rdi`. */ + xorb %dil, %dil +# else + andq $-(VEC_SIZE * 4), %rdi +# endif + +# ifdef USE_AS_WCSLEN + subq %rdi, %rdx + sarq $2, %rdx + addq %rdx, %rax +# else + subq %rdi, %rax +# endif + /* Compare 4 * VEC at a time forward. */ + .p2align 4,, 11 +L(loop_4x_vec): + VMOVA (VEC_SIZE * 4)(%rdi), %VMM(1) + VPMINU (VEC_SIZE * 5)(%rdi), %VMM(1), %VMM(2) + VMOVA (VEC_SIZE * 6)(%rdi), %VMM(3) + VPMINU (VEC_SIZE * 7)(%rdi), %VMM(3), %VMM(4) + VPTESTN %VMM(2), %VMM(2), %k0 + VPTESTN %VMM(4), %VMM(4), %k2 + subq $-(VEC_SIZE * 4), %rdi + /* Break if at end of length. 
*/ + subq $(CHAR_PER_VEC * 4), %rax + jbe L(loop_len_end) + + + KORTEST %k0, %k2 + jz L(loop_4x_vec) + + +L(loop_last_4x_vec): + movq %rsi, %rcx + subq %rax, %rsi + VPTESTN %VMM(1), %VMM(1), %k1 + KMOV %k1, %VRDX + test %VRDX, %VRDX + jnz L(last_vec_x0) + + KMOV %k0, %VRDX + test %VRDX, %VRDX + jnz L(last_vec_x1) + + VPTESTN %VMM(3), %VMM(3), %k0 + + /* Seperate logic for VEC_SIZE == 64 and VEC_SIZE == 32 for + returning last 2x VEC. For VEC_SIZE == 64 we test each VEC + individually, for VEC_SIZE == 32 we combine them in a single + 64-bit GPR. */ +# if CHAR_PER_VEC == 64 + KMOV %k0, %VRDX + test %VRDX, %VRDX + jnz L(last_vec_x2) + KMOV %k2, %VRDX +# else + /* We can only combine last 2x VEC masks if CHAR_PER_VEC <= 32. + */ + kmovd %k2, %edx + kmovd %k0, %eax + salq $CHAR_PER_VEC, %rdx + orq %rax, %rdx +# endif + + /* first_vec_x3 for strlen-ZMM and first_vec_x2 for strlen-YMM. + */ + bsfq %rdx, %rdx + leaq (FALLTHROUGH_RETURN_OFFSET - CHAR_PER_VEC * 4)(%rsi, %rdx), %rax + cmpq %rax, %rcx + cmovb %rcx, %rax + ret + + /* Handle last 4x VEC after loop. All VECs have been loaded. */ + .p2align 4,, 4 +L(loop_len_end): + KORTEST %k0, %k2 + jnz L(loop_last_4x_vec) + movq %rsi, %rax + ret + + +# if CHAR_PER_VEC == 64 + /* Since we can't combine the last 2x VEC for VEC_SIZE == 64 + need return label for it. */ + .p2align 4,, 8 +L(last_vec_x2): + bsf %VRDX, %VRDX + leaq (CHAR_PER_VEC * -2)(%rsi, %rdx), %rax + cmpq %rax, %rcx + cmovb %rcx, %rax + ret +# endif + + + .p2align 4,, 10 +L(last_vec_x1): + addq $CHAR_PER_VEC, %rsi +L(last_vec_x0): + bsf %VRDX, %VRDX + leaq (CHAR_PER_VEC * -4)(%rsi, %rdx), %rax + cmpq %rax, %rcx + cmovb %rcx, %rax + ret + + + .p2align 4,, 8 +L(cross_page_boundary): + /* Align data to VEC_SIZE. */ + movq %rdi, %rcx + andq $-VEC_SIZE, %rcx + VPCMPEQ (%rcx), %VZERO, %k0 + + KMOV %k0, %VRCX +# ifdef USE_AS_WCSLEN + shrl $2, %eax + andl $(CHAR_PER_VEC - 1), %eax +# endif + shrx %VRAX, %VRCX, %VRCX + + negl %eax + andl $(CHAR_PER_VEC - 1), %eax + movq %rsi, %rdx + bsf %VRCX, %VRDX + cmpq %rax, %rdx + ja L(cross_page_continue) + movl %edx, %eax + cmpq %rdx, %rsi + cmovb %esi, %eax + ret +END (STRNLEN) +#endif diff --git a/sysdeps/x86_64/multiarch/wcsnlen-evex.S b/sysdeps/x86_64/multiarch/wcsnlen-evex.S index e2aad94c1e..57a7e93fbf 100644 --- a/sysdeps/x86_64/multiarch/wcsnlen-evex.S +++ b/sysdeps/x86_64/multiarch/wcsnlen-evex.S @@ -2,8 +2,7 @@ # define WCSNLEN __wcsnlen_evex #endif -#define STRLEN WCSNLEN +#define STRNLEN WCSNLEN #define USE_AS_WCSLEN 1 -#define USE_AS_STRNLEN 1 -#include "strlen-evex.S" +#include "strnlen-evex.S" From patchwork Tue Oct 18 02:48:58 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Noah Goldstein X-Patchwork-Id: 1691327 Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@legolas.ozlabs.org Authentication-Results: legolas.ozlabs.org; spf=pass (sender SPF authorized) smtp.mailfrom=sourceware.org (client-ip=2620:52:3:1:0:246e:9693:128c; helo=sourceware.org; envelope-from=libc-alpha-bounces+incoming=patchwork.ozlabs.org@sourceware.org; receiver=) Authentication-Results: legolas.ozlabs.org; dkim=pass (1024-bit key; secure) header.d=sourceware.org header.i=@sourceware.org header.a=rsa-sha256 header.s=default header.b=UHxrONAs; dkim-atps=neutral Received: from sourceware.org (server2.sourceware.org [IPv6:2620:52:3:1:0:246e:9693:128c]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature ECDSA 
(P-384) server-digest SHA384) (No client certificate requested) by legolas.ozlabs.org (Postfix) with ESMTPS id 4Mryyn5PvLz23jp for ; Tue, 18 Oct 2022 13:50:25 +1100 (AEDT) Received: from server2.sourceware.org (localhost [IPv6:::1]) by sourceware.org (Postfix) with ESMTP id 9B2AF385800B for ; Tue, 18 Oct 2022 02:50:23 +0000 (GMT) DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 9B2AF385800B DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=sourceware.org; s=default; t=1666061423; bh=0ihs1yTyMDaSh/n1/tlMjLAoc4iM0wM7Zco1H9FQStg=; h=To:Subject:Date:In-Reply-To:References:List-Id:List-Unsubscribe: List-Archive:List-Post:List-Help:List-Subscribe:From:Reply-To: From; b=UHxrONAsg7i5v7nDmIy3hOW5hDefAvS6euyI0XIqFnAK09VsGAPDRZRWraysjdrwt D18Blxd0fe0CUwHSOxlaspjJhZqAh2rb4noHy5yMM/M8zM7sV8RhoRYXakrMcW7VrT Jk/RIHdhND9Z/VbGVXc143s3+xA1Hhtn8KI/o6/Y= X-Original-To: libc-alpha@sourceware.org Delivered-To: libc-alpha@sourceware.org Received: from mail-ot1-x332.google.com (mail-ot1-x332.google.com [IPv6:2607:f8b0:4864:20::332]) by sourceware.org (Postfix) with ESMTPS id C3F343858C50 for ; Tue, 18 Oct 2022 02:49:09 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org C3F343858C50 Received: by mail-ot1-x332.google.com with SMTP id e53-20020a9d01b8000000b006619152f3cdso6901403ote.0 for ; Mon, 17 Oct 2022 19:49:09 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=0ihs1yTyMDaSh/n1/tlMjLAoc4iM0wM7Zco1H9FQStg=; b=jPvn/VYBvtW0fVNeEb4YlmNGYQfMoP8No0BIWefqxVD+b2u8iuD2jD13ookymydJJY 8LtYNfHibK8qo959cG5rEH3fti4rPPnNNXL3OZoGmkBGkDD/AgpbCiwIO2pE6gg9LgMV AxmDsvebBpj7VUGOwtsqbfxvPMVxyc2ZsvLlaSzkX5PmJThoubTQrPlEdLpzq3SxUc1/ j9vzRM9+AT4HYANjWlqdMn4bsjXY/CSNlvixiSf682vthBriZYuTsOdjRxgApinC/64i JxKJV+4Gmop8dBPrMSFa83cVLrjwN5cOPIxXmo64GxY7YhyQGha7ZXX2i50E6zVZnKVv Zk0g== X-Gm-Message-State: ACrzQf2lhR6J5wccTQrNKcxLvRd4eEWf+qE9Tr+zYJnGq8k8879OlO2m cP4i/UIr4fgrAXV4copNTpfgxdRgL6va1Q== X-Google-Smtp-Source: AMsMyM7HWH+G0dxYvHQA0gIMKeSOmubxucQF6MAZvetf6A4T1Wox6k3kNGcOSDJq5NzG6Ego/lxP4Q== X-Received: by 2002:a05:6830:1d8b:b0:661:8cd7:994a with SMTP id y11-20020a0568301d8b00b006618cd7994amr381369oti.119.1666061348414; Mon, 17 Oct 2022 19:49:08 -0700 (PDT) Received: from noah-tgl.lan (2603-8080-1301-76c6-02dd-0570-1640-b39b.res6.spectrum.com. 
[2603:8080:1301:76c6:2dd:570:1640:b39b]) by smtp.gmail.com with ESMTPSA id r10-20020a4a964a000000b00435a59fba01sm4957260ooi.47.2022.10.17.19.49.07 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 17 Oct 2022 19:49:07 -0700 (PDT) To: libc-alpha@sourceware.org Subject: [PATCH v1 4/7] x86: Optimize memrchr-evex.S Date: Mon, 17 Oct 2022 19:48:58 -0700 Message-Id: <20221018024901.3381469-4-goldstein.w.n@gmail.com> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20221018024901.3381469-1-goldstein.w.n@gmail.com> References: <20221018024901.3381469-1-goldstein.w.n@gmail.com> MIME-Version: 1.0 X-Spam-Status: No, score=-12.1 required=5.0 tests=BAYES_00, DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, FREEMAIL_FROM, GIT_PATCH_0, RCVD_IN_DNSWL_NONE, SPF_HELO_NONE, SPF_PASS, TXREP autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org X-BeenThere: libc-alpha@sourceware.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Libc-alpha mailing list List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-Patchwork-Original-From: Noah Goldstein via Libc-alpha From: Noah Goldstein Reply-To: Noah Goldstein Errors-To: libc-alpha-bounces+incoming=patchwork.ozlabs.org@sourceware.org Sender: "Libc-alpha" Optimizations are: 1. Use the fact that lzcnt(0) -> VEC_SIZE for memchr to save a branch in short string case. 2. Save several instructions in len = [VEC_SIZE, 4 * VEC_SIZE] case. 3. Use more code-size efficient instructions. - tzcnt ... -> bsf ... - vpcmpb $0 ... -> vpcmpeq ... Code Size Changes: memrchr-evex.S : -29 bytes Net perf changes: Reported as geometric mean of all improvements / regressions from N=10 runs of the benchtests. Value as New Time / Old Time so < 1.0 is improvement and 1.0 is regression. memrchr-evex.S : 0.949 (Mostly from improvements in small strings) Full results attached in email. Full check passes on x86-64. --- sysdeps/x86_64/multiarch/memrchr-evex.S | 538 ++++++++++++++---------- 1 file changed, 324 insertions(+), 214 deletions(-) diff --git a/sysdeps/x86_64/multiarch/memrchr-evex.S b/sysdeps/x86_64/multiarch/memrchr-evex.S index 550b328c5a..dbcf52808f 100644 --- a/sysdeps/x86_64/multiarch/memrchr-evex.S +++ b/sysdeps/x86_64/multiarch/memrchr-evex.S @@ -21,17 +21,19 @@ #if ISA_SHOULD_BUILD (4) # include -# include "x86-evex256-vecs.h" -# if VEC_SIZE != 32 -# error "VEC_SIZE != 32 unimplemented" + +# ifndef VEC_SIZE +# include "x86-evex256-vecs.h" # endif +# include "reg-macros.h" + # ifndef MEMRCHR -# define MEMRCHR __memrchr_evex +# define MEMRCHR __memrchr_evex # endif -# define PAGE_SIZE 4096 -# define VMMMATCH VMM(0) +# define PAGE_SIZE 4096 +# define VMATCH VMM(0) .section SECTION(.text), "ax", @progbits ENTRY_P2ALIGN(MEMRCHR, 6) @@ -43,294 +45,402 @@ ENTRY_P2ALIGN(MEMRCHR, 6) # endif jz L(zero_0) - /* Get end pointer. Minus one for two reasons. 1) It is necessary for a - correct page cross check and 2) it correctly sets up end ptr to be - subtract by lzcnt aligned. */ + /* Get end pointer. Minus one for three reasons. 1) It is + necessary for a correct page cross check and 2) it correctly + sets up end ptr to be subtract by lzcnt aligned. 3) it is a + necessary step in aligning ptr. */ leaq -1(%rdi, %rdx), %rax - vpbroadcastb %esi, %VMMMATCH + vpbroadcastb %esi, %VMATCH /* Check if we can load 1x VEC without cross a page. 
*/ testl $(PAGE_SIZE - VEC_SIZE), %eax jz L(page_cross) - /* Don't use rax for pointer here because EVEX has better encoding with - offset % VEC_SIZE == 0. */ - vpcmpb $0, -(VEC_SIZE)(%rdi, %rdx), %VMMMATCH, %k0 - kmovd %k0, %ecx - - /* Fall through for rdx (len) <= VEC_SIZE (expect small sizes). */ - cmpq $VEC_SIZE, %rdx - ja L(more_1x_vec) -L(ret_vec_x0_test): - - /* If ecx is zero (no matches) lzcnt will set it 32 (VEC_SIZE) which - will guarantee edx (len) is less than it. */ - lzcntl %ecx, %ecx - cmpl %ecx, %edx - jle L(zero_0) - subq %rcx, %rax + /* Don't use rax for pointer here because EVEX has better + encoding with offset % VEC_SIZE == 0. */ + vpcmpeqb (VEC_SIZE * -1)(%rdi, %rdx), %VMATCH, %k0 + KMOV %k0, %VRCX + + /* If rcx is zero then lzcnt -> VEC_SIZE. NB: there is a + already a dependency between rcx and rsi so no worries about + false-dep here. */ + lzcnt %VRCX, %VRSI + /* If rdx <= rsi then either 1) rcx was non-zero (there was a + match) but it was out of bounds or 2) rcx was zero and rdx + was <= VEC_SIZE so we are done scanning. */ + cmpq %rsi, %rdx + /* NB: Use branch to return zero/non-zero. Common usage will + branch on result of function (if return is null/non-null). + This branch can be used to predict the ensuing one so there + is no reason to extend the data-dependency with cmovcc. */ + jbe L(zero_0) + + /* If rcx is zero then len must be > RDX, otherwise since we + already tested len vs lzcnt(rcx) (in rsi) we are good to + return this match. */ + test %VRCX, %VRCX + jz L(more_1x_vec) + subq %rsi, %rax ret - /* Fits in aligning bytes of first cache line. */ + /* Fits in aligning bytes of first cache line for VEC_SIZE == + 32. */ +# if VEC_SIZE == 32 + .p2align 4,, 2 L(zero_0): xorl %eax, %eax ret - - .p2align 4,, 9 -L(ret_vec_x0_dec): - decq %rax -L(ret_vec_x0): - lzcntl %ecx, %ecx - subq %rcx, %rax - ret +# endif .p2align 4,, 10 L(more_1x_vec): - testl %ecx, %ecx - jnz L(ret_vec_x0) - /* Align rax (pointer to string). */ andq $-VEC_SIZE, %rax - +L(page_cross_continue): /* Recompute length after aligning. */ - movq %rax, %rdx + subq %rdi, %rax - /* Need no matter what. */ - vpcmpb $0, -(VEC_SIZE)(%rax), %VMMMATCH, %k0 - kmovd %k0, %ecx - - subq %rdi, %rdx - - cmpq $(VEC_SIZE * 2), %rdx + cmpq $(VEC_SIZE * 2), %rax ja L(more_2x_vec) + L(last_2x_vec): + vpcmpeqb (VEC_SIZE * -1)(%rdi, %rax), %VMATCH, %k0 + KMOV %k0, %VRCX - /* Must dec rax because L(ret_vec_x0_test) expects it. */ - decq %rax - cmpl $VEC_SIZE, %edx - jbe L(ret_vec_x0_test) + test %VRCX, %VRCX + jnz L(ret_vec_x0_test) - testl %ecx, %ecx - jnz L(ret_vec_x0) + /* If VEC_SIZE == 64 need to subtract because lzcntq won't + implicitly add VEC_SIZE to match position. */ +# if VEC_SIZE == 64 + subl $VEC_SIZE, %eax +# else + cmpb $VEC_SIZE, %al +# endif + jle L(zero_2) - /* Don't use rax for pointer here because EVEX has better encoding with - offset % VEC_SIZE == 0. */ - vpcmpb $0, -(VEC_SIZE * 2)(%rdi, %rdx), %VMMMATCH, %k0 - kmovd %k0, %ecx - /* NB: 64-bit lzcnt. This will naturally add 32 to position. */ + /* We adjusted rax (length) for VEC_SIZE == 64 so need seperate + offsets. */ +# if VEC_SIZE == 64 + vpcmpeqb (VEC_SIZE * -1)(%rdi, %rax), %VMATCH, %k0 +# else + vpcmpeqb (VEC_SIZE * -2)(%rdi, %rax), %VMATCH, %k0 +# endif + KMOV %k0, %VRCX + /* NB: 64-bit lzcnt. This will naturally add 32 to position for + VEC_SIZE == 32. */ lzcntq %rcx, %rcx - cmpl %ecx, %edx - jle L(zero_0) - subq %rcx, %rax - ret - - /* Inexpensive place to put this regarding code size / target alignments - / ICache NLP. 
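(The lzcnt(0) == VEC_SIZE property called out in the commit message is what lets one compare cover both the no-match case and the out-of-bounds case in the short-length path. Roughly, for the hot path where len <= VEC_SIZE -- a C sketch with VEC_SIZE == 32 and invented names assumed:

    #include <immintrin.h>   /* _lzcnt_u32 */
    #include <stddef.h>

    /* mask covers the last 32 bytes [end - 32, end); bit 31 is the byte at
       end - 1.  Only valid for len <= 32, so this vector spans the buffer.  */
    static const char *
    memrchr_small (const char *end, size_t len, unsigned int mask)
    {
      unsigned int back = _lzcnt_u32 (mask);   /* 32 when mask == 0 */
      /* One compare: either there was no match at all, or the last match
         sits before the start of the buffer.  */
      if (len <= back)
        return NULL;
      return end - 1 - back;
    }

)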
Necessary for 2-byte encoding of jump to page cross - case which in turn is necessary for hot path (len <= VEC_SIZE) to fit - in first cache line. */ -L(page_cross): - movq %rax, %rsi - andq $-VEC_SIZE, %rsi - vpcmpb $0, (%rsi), %VMMMATCH, %k0 - kmovd %k0, %r8d - /* Shift out negative alignment (because we are starting from endptr and - working backwards). */ - movl %eax, %ecx - /* notl because eax already has endptr - 1. (-x = ~(x - 1)). */ - notl %ecx - shlxl %ecx, %r8d, %ecx - cmpq %rdi, %rsi - ja L(more_1x_vec) - lzcntl %ecx, %ecx - cmpl %ecx, %edx - jle L(zero_1) - subq %rcx, %rax + subl %ecx, %eax + ja L(first_vec_x1_ret) + /* If VEC_SIZE == 64 put L(zero_0) here as we can't fit in the + first cache line (this is the second cache line). */ +# if VEC_SIZE == 64 +L(zero_0): +# endif +L(zero_2): + xorl %eax, %eax ret - /* Continue creating zero labels that fit in aligning bytes and get - 2-byte encoding / are in the same cache line as condition. */ -L(zero_1): - xorl %eax, %eax + /* NB: Fits in aligning bytes before next cache line for + VEC_SIZE == 32. For VEC_SIZE == 64 this is attached to + L(first_vec_x0_test). */ +# if VEC_SIZE == 32 +L(first_vec_x1_ret): + leaq -1(%rdi, %rax), %rax ret +# endif - .p2align 4,, 8 -L(ret_vec_x1): - /* This will naturally add 32 to position. */ - bsrl %ecx, %ecx - leaq -(VEC_SIZE * 2)(%rcx, %rax), %rax + .p2align 4,, 6 +L(ret_vec_x0_test): + lzcnt %VRCX, %VRCX + subl %ecx, %eax + jle L(zero_2) +# if VEC_SIZE == 64 + /* Reuse code at the end of L(ret_vec_x0_test) as we can't fit + L(first_vec_x1_ret) in the same cache line as its jmp base + so we might as well save code size. */ +L(first_vec_x1_ret): +# endif + leaq -1(%rdi, %rax), %rax ret - .p2align 4,, 8 + .p2align 4,, 6 +L(loop_last_4x_vec): + /* Compute remaining length. */ + subl %edi, %eax +L(last_4x_vec): + cmpl $(VEC_SIZE * 2), %eax + jle L(last_2x_vec) +# if VEC_SIZE == 32 + /* Only align for VEC_SIZE == 32. For VEC_SIZE == 64 we need + the spare bytes to align the loop properly. */ + .p2align 4,, 10 +# endif L(more_2x_vec): - testl %ecx, %ecx - jnz L(ret_vec_x0_dec) - vpcmpb $0, -(VEC_SIZE * 2)(%rax), %VMMMATCH, %k0 - kmovd %k0, %ecx - testl %ecx, %ecx - jnz L(ret_vec_x1) + /* Length > VEC_SIZE * 2 so check the first 2x VEC for match and + return if either hit. */ + vpcmpeqb (VEC_SIZE * -1)(%rdi, %rax), %VMATCH, %k0 + KMOV %k0, %VRCX + + test %VRCX, %VRCX + jnz L(first_vec_x0) + + vpcmpeqb (VEC_SIZE * -2)(%rdi, %rax), %VMATCH, %k0 + KMOV %k0, %VRCX + test %VRCX, %VRCX + jnz L(first_vec_x1) /* Need no matter what. */ - vpcmpb $0, -(VEC_SIZE * 3)(%rax), %VMMMATCH, %k0 - kmovd %k0, %ecx + vpcmpeqb (VEC_SIZE * -3)(%rdi, %rax), %VMATCH, %k0 + KMOV %k0, %VRCX - subq $(VEC_SIZE * 4), %rdx + /* Check if we are near the end. */ + subq $(VEC_SIZE * 4), %rax ja L(more_4x_vec) - cmpl $(VEC_SIZE * -1), %edx - jle L(ret_vec_x2_test) -L(last_vec): - testl %ecx, %ecx - jnz L(ret_vec_x2) + test %VRCX, %VRCX + jnz L(first_vec_x2_test) + /* Adjust length for final check and check if we are at the end. + */ + addl $(VEC_SIZE * 1), %eax + jle L(zero_1) - /* Need no matter what. 
*/ - vpcmpb $0, -(VEC_SIZE * 4)(%rax), %VMMMATCH, %k0 - kmovd %k0, %ecx - lzcntl %ecx, %ecx - subq $(VEC_SIZE * 3 + 1), %rax - subq %rcx, %rax - cmpq %rax, %rdi - ja L(zero_1) + vpcmpeqb (VEC_SIZE * -1)(%rdi, %rax), %VMATCH, %k0 + KMOV %k0, %VRCX + + lzcnt %VRCX, %VRCX + subl %ecx, %eax + ja L(first_vec_x3_ret) +L(zero_1): + xorl %eax, %eax + ret +L(first_vec_x3_ret): + leaq -1(%rdi, %rax), %rax ret - .p2align 4,, 8 -L(ret_vec_x2_test): - lzcntl %ecx, %ecx - subq $(VEC_SIZE * 2 + 1), %rax - subq %rcx, %rax - cmpq %rax, %rdi - ja L(zero_1) + .p2align 4,, 6 +L(first_vec_x2_test): + /* Must adjust length before check. */ + subl $-(VEC_SIZE * 2 - 1), %eax + lzcnt %VRCX, %VRCX + subl %ecx, %eax + jl L(zero_4) + addq %rdi, %rax ret - .p2align 4,, 8 -L(ret_vec_x2): - bsrl %ecx, %ecx - leaq -(VEC_SIZE * 3)(%rcx, %rax), %rax + + .p2align 4,, 10 +L(first_vec_x0): + bsr %VRCX, %VRCX + leaq (VEC_SIZE * -1)(%rdi, %rax), %rax + addq %rcx, %rax ret - .p2align 4,, 8 -L(ret_vec_x3): - bsrl %ecx, %ecx - leaq -(VEC_SIZE * 4)(%rcx, %rax), %rax + /* Fits unobtrusively here. */ +L(zero_4): + xorl %eax, %eax + ret + + .p2align 4,, 10 +L(first_vec_x1): + bsr %VRCX, %VRCX + leaq (VEC_SIZE * -2)(%rdi, %rax), %rax + addq %rcx, %rax ret .p2align 4,, 8 +L(first_vec_x3): + bsr %VRCX, %VRCX + addq %rdi, %rax + addq %rcx, %rax + ret + + .p2align 4,, 6 +L(first_vec_x2): + bsr %VRCX, %VRCX + leaq (VEC_SIZE * 1)(%rdi, %rax), %rax + addq %rcx, %rax + ret + + .p2align 4,, 2 L(more_4x_vec): - testl %ecx, %ecx - jnz L(ret_vec_x2) + test %VRCX, %VRCX + jnz L(first_vec_x2) - vpcmpb $0, -(VEC_SIZE * 4)(%rax), %VMMMATCH, %k0 - kmovd %k0, %ecx + vpcmpeqb (%rdi, %rax), %VMATCH, %k0 + KMOV %k0, %VRCX - testl %ecx, %ecx - jnz L(ret_vec_x3) + test %VRCX, %VRCX + jnz L(first_vec_x3) /* Check if near end before re-aligning (otherwise might do an unnecessary loop iteration). */ - addq $-(VEC_SIZE * 4), %rax - cmpq $(VEC_SIZE * 4), %rdx + cmpq $(VEC_SIZE * 4), %rax jbe L(last_4x_vec) - decq %rax - andq $-(VEC_SIZE * 4), %rax - movq %rdi, %rdx - /* Get endptr for loop in rdx. NB: Can't just do while rax > rdi because - lengths that overflow can be valid and break the comparison. */ - andq $-(VEC_SIZE * 4), %rdx + + /* NB: We setup the loop to NOT use index-address-mode for the + buffer. This costs some instructions & code size but avoids + stalls due to unlaminated micro-fused instructions (as used + in the loop) from being forced to issue in the same group + (essentially narrowing the backend width). */ + + /* Get endptr for loop in rdx. NB: Can't just do while rax > rdi + because lengths that overflow can be valid and break the + comparison. */ +# if VEC_SIZE == 64 + /* Use rdx as intermediate to compute rax, this gets us imm8 + encoding which just allows the L(more_4x_vec) block to fit + in 1 cache-line. */ + leaq (VEC_SIZE * 4)(%rdi), %rdx + leaq (VEC_SIZE * -1)(%rdx, %rax), %rax + + /* No evex machine has partial register stalls. This can be + replaced with: `andq $(VEC_SIZE * -4), %rax/%rdx` if that + changes. */ + xorb %al, %al + xorb %dl, %dl +# else + leaq (VEC_SIZE * 3)(%rdi, %rax), %rax + andq $(VEC_SIZE * -4), %rax + leaq (VEC_SIZE * 4)(%rdi), %rdx + andq $(VEC_SIZE * -4), %rdx +# endif + .p2align 4 L(loop_4x_vec): - /* Store 1 were not-equals and 0 where equals in k1 (used to mask later - on). */ - vpcmpb $4, (VEC_SIZE * 3)(%rax), %VMMMATCH, %k1 + /* NB: We could do the same optimization here as we do for + memchr/rawmemchr by using VEX encoding in the loop for access + to VEX vpcmpeqb + vpternlogd. 
Since memrchr is not as hot as + memchr it may not be worth the extra code size, but if the + need arises it an easy ~15% perf improvement to the loop. */ + + cmpq %rdx, %rax + je L(loop_last_4x_vec) + /* Store 1 were not-equals and 0 where equals in k1 (used to + mask later on). */ + vpcmpb $4, (VEC_SIZE * -1)(%rax), %VMATCH, %k1 /* VEC(2/3) will have zero-byte where we found a CHAR. */ - vpxorq (VEC_SIZE * 2)(%rax), %VMMMATCH, %VMM(2) - vpxorq (VEC_SIZE * 1)(%rax), %VMMMATCH, %VMM(3) - vpcmpb $0, (VEC_SIZE * 0)(%rax), %VMMMATCH, %k4 + vpxorq (VEC_SIZE * -2)(%rax), %VMATCH, %VMM(2) + vpxorq (VEC_SIZE * -3)(%rax), %VMATCH, %VMM(3) + vpcmpeqb (VEC_SIZE * -4)(%rax), %VMATCH, %k4 - /* Combine VEC(2/3) with min and maskz with k1 (k1 has zero bit where - CHAR is found and VEC(2/3) have zero-byte where CHAR is found. */ + /* Combine VEC(2/3) with min and maskz with k1 (k1 has zero bit + where CHAR is found and VEC(2/3) have zero-byte where CHAR + is found. */ vpminub %VMM(2), %VMM(3), %VMM(3){%k1}{z} vptestnmb %VMM(3), %VMM(3), %k2 - /* Any 1s and we found CHAR. */ - kortestd %k2, %k4 - jnz L(loop_end) - addq $-(VEC_SIZE * 4), %rax - cmpq %rdx, %rax - jne L(loop_4x_vec) - /* Need to re-adjust rdx / rax for L(last_4x_vec). */ - subq $-(VEC_SIZE * 4), %rdx - movq %rdx, %rax - subl %edi, %edx -L(last_4x_vec): + /* Any 1s and we found CHAR. */ + KORTEST %k2, %k4 + jz L(loop_4x_vec) + - /* Used no matter what. */ - vpcmpb $0, (VEC_SIZE * -1)(%rax), %VMMMATCH, %k0 - kmovd %k0, %ecx + /* K1 has non-matches for first VEC. inc; jz will overflow rcx + iff all bytes where non-matches. */ + KMOV %k1, %VRCX + inc %VRCX + jnz L(first_vec_x0_end) - cmpl $(VEC_SIZE * 2), %edx - jbe L(last_2x_vec) + vptestnmb %VMM(2), %VMM(2), %k0 + KMOV %k0, %VRCX + test %VRCX, %VRCX + jnz L(first_vec_x1_end) + KMOV %k2, %VRCX + + /* Seperate logic for VEC_SIZE == 64 and VEC_SIZE == 32 for + returning last 2x VEC. For VEC_SIZE == 64 we test each VEC + individually, for VEC_SIZE == 32 we combine them in a single + 64-bit GPR. */ +# if VEC_SIZE == 64 + test %VRCX, %VRCX + jnz L(first_vec_x2_end) + KMOV %k4, %VRCX +# else + /* Combine last 2 VEC matches for VEC_SIZE == 32. If rcx (from + VEC(3)) is zero (no CHAR in VEC(3)) then it won't affect the + result in rsi (from VEC(4)). If rcx is non-zero then CHAR in + VEC(3) and bsrq will use that position. */ + KMOV %k4, %VRSI + salq $32, %rcx + orq %rsi, %rcx +# endif + bsrq %rcx, %rcx + addq %rcx, %rax + ret - testl %ecx, %ecx - jnz L(ret_vec_x0_dec) + .p2align 4,, 4 +L(first_vec_x0_end): + /* rcx has 1s at non-matches so we need to `not` it. We used + `inc` to test if zero so use `neg` to complete the `not` so + the last 1 bit represent a match. NB: (-x + 1 == ~x). */ + neg %VRCX + bsr %VRCX, %VRCX + leaq (VEC_SIZE * 3)(%rcx, %rax), %rax + ret + .p2align 4,, 10 +L(first_vec_x1_end): + bsr %VRCX, %VRCX + leaq (VEC_SIZE * 2)(%rcx, %rax), %rax + ret - vpcmpb $0, (VEC_SIZE * -2)(%rax), %VMMMATCH, %k0 - kmovd %k0, %ecx +# if VEC_SIZE == 64 + /* Since we can't combine the last 2x VEC for VEC_SIZE == 64 + need return label for it. */ + .p2align 4,, 4 +L(first_vec_x2_end): + bsr %VRCX, %VRCX + leaq (VEC_SIZE * 1)(%rcx, %rax), %rax + ret +# endif - testl %ecx, %ecx - jnz L(ret_vec_x1) - /* Used no matter what. */ - vpcmpb $0, (VEC_SIZE * -3)(%rax), %VMMMATCH, %k0 - kmovd %k0, %ecx + .p2align 4,, 4 +L(page_cross): + /* only lower bits of eax[log2(VEC_SIZE):0] are set so we can + use movzbl to get the amount of bytes we are checking here. 
+ */ + movzbl %al, %ecx + andq $-VEC_SIZE, %rax + vpcmpeqb (%rax), %VMATCH, %k0 + KMOV %k0, %VRSI - cmpl $(VEC_SIZE * 3), %edx - ja L(last_vec) + /* eax was comptued as %rdi + %rdx - 1 so need to add back 1 + here. */ + leal 1(%rcx), %r8d - lzcntl %ecx, %ecx - subq $(VEC_SIZE * 2 + 1), %rax - subq %rcx, %rax - cmpq %rax, %rdi - jbe L(ret_1) + /* Invert ecx to get shift count for byte matches out of range. + */ + notl %ecx + shlx %VRCX, %VRSI, %VRSI + + /* if r8 < rdx then the entire [buf, buf + len] is handled in + the page cross case. NB: we can't use the trick here we use + in the non page-cross case because we aren't checking full + VEC_SIZE. */ + cmpq %r8, %rdx + ja L(page_cross_check) + lzcnt %VRSI, %VRSI + subl %esi, %edx + ja L(page_cross_ret) xorl %eax, %eax -L(ret_1): ret - .p2align 4,, 6 -L(loop_end): - kmovd %k1, %ecx - notl %ecx - testl %ecx, %ecx - jnz L(ret_vec_x0_end) +L(page_cross_check): + test %VRSI, %VRSI + jz L(page_cross_continue) - vptestnmb %VMM(2), %VMM(2), %k0 - kmovd %k0, %ecx - testl %ecx, %ecx - jnz L(ret_vec_x1_end) - - kmovd %k2, %ecx - kmovd %k4, %esi - /* Combine last 2 VEC matches. If ecx (VEC3) is zero (no CHAR in VEC3) - then it won't affect the result in esi (VEC4). If ecx is non-zero - then CHAR in VEC3 and bsrq will use that position. */ - salq $32, %rcx - orq %rsi, %rcx - bsrq %rcx, %rcx - addq %rcx, %rax - ret - .p2align 4,, 4 -L(ret_vec_x0_end): - addq $(VEC_SIZE), %rax -L(ret_vec_x1_end): - bsrl %ecx, %ecx - leaq (VEC_SIZE * 2)(%rax, %rcx), %rax + lzcnt %VRSI, %VRSI + subl %esi, %edx +L(page_cross_ret): + leaq -1(%rdi, %rdx), %rax ret - END(MEMRCHR) #endif From patchwork Tue Oct 18 02:48:59 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Noah Goldstein X-Patchwork-Id: 1691325 Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@legolas.ozlabs.org Authentication-Results: legolas.ozlabs.org; spf=pass (sender SPF authorized) smtp.mailfrom=sourceware.org (client-ip=8.43.85.97; helo=sourceware.org; envelope-from=libc-alpha-bounces+incoming=patchwork.ozlabs.org@sourceware.org; receiver=) Authentication-Results: legolas.ozlabs.org; dkim=pass (1024-bit key; secure) header.d=sourceware.org header.i=@sourceware.org header.a=rsa-sha256 header.s=default header.b=ji8pueHC; dkim-atps=neutral Received: from sourceware.org (ip-8-43-85-97.sourceware.org [8.43.85.97]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature ECDSA (P-384) server-digest SHA384) (No client certificate requested) by legolas.ozlabs.org (Postfix) with ESMTPS id 4MryyM5Drfz23jk for ; Tue, 18 Oct 2022 13:50:03 +1100 (AEDT) Received: from server2.sourceware.org (localhost [IPv6:::1]) by sourceware.org (Postfix) with ESMTP id A02CB3857357 for ; Tue, 18 Oct 2022 02:50:01 +0000 (GMT) DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org A02CB3857357 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=sourceware.org; s=default; t=1666061401; bh=oC01aOJqUCRtwCibJyVGnkt6cpo8JFfWkb5OftF4yO0=; h=To:Subject:Date:In-Reply-To:References:List-Id:List-Unsubscribe: List-Archive:List-Post:List-Help:List-Subscribe:From:Reply-To: From; b=ji8pueHC6iNMHUA01ThwBX2a4mEB0dLlu3zNu2g+QIkSHku25fm2XTSPbymqagou4 Q+gq/7ml5o0Eu5BWJfPziK+ravek4oRBSBK/5L2quxbJeN/WzF1EyVPWg6J75GuFT7 V32QMya+abX4aK/el767rlt8RHFVUDxXIumoEIqw= X-Original-To: libc-alpha@sourceware.org Delivered-To: libc-alpha@sourceware.org Received: from mail-oi1-x22d.google.com (mail-oi1-x22d.google.com 
[IPv6:2607:f8b0:4864:20::22d]) by sourceware.org (Postfix) with ESMTPS id D01E53858403 for ; Tue, 18 Oct 2022 02:49:10 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org D01E53858403 Received: by mail-oi1-x22d.google.com with SMTP id p127so13579950oih.9 for ; Mon, 17 Oct 2022 19:49:10 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=oC01aOJqUCRtwCibJyVGnkt6cpo8JFfWkb5OftF4yO0=; b=lCavx4eowfrfgmAAX4gRQfYAQEnpLxMy4S/4s1Q+o2GHGt54GP4zn6DiPbWuqmUZja a6GPHDhCRtwRlw0P872mOIcm9rW0aScExlrD1QGA8wF/7+dtIwgyE3ZE4OReXyPAXObA ivuDlwEryLeuorw0W9phQ8C8a84gLCQWD4m4rzLwPyrPA2t3ydhmUM/tEeZrbUiWCugo zcBiIN6CN36Q7h1/8CrGyNmpwVKdWC5H9pvryEddFJ9UtYI8N7lDNm7i+wbkBUknu91L fp0mCT3bBstTELEXfk5TnnbaBy5AMZCTK2bOXYXeoNRIuZj89J7nDrEvk8cHeyZ0gA6o O2MA== X-Gm-Message-State: ACrzQf3rDJipPE858vmZCtU3MqPYh4PuZ7xz3chzlu6gzjGAYU5KHHLL MrFVm7/QrXrDQG3mD1dYBGGtsw9RioRLOw== X-Google-Smtp-Source: AMsMyM5BTQ5TseqJIxkZ6AtL43pP4lUbZi9SEKm9rzk2E2JFe29xN6mQhoSEEXslQmLVgn6FCTPEiA== X-Received: by 2002:aca:ab42:0:b0:354:93a1:1908 with SMTP id u63-20020acaab42000000b0035493a11908mr412688oie.194.1666061349625; Mon, 17 Oct 2022 19:49:09 -0700 (PDT) Received: from noah-tgl.lan (2603-8080-1301-76c6-02dd-0570-1640-b39b.res6.spectrum.com. [2603:8080:1301:76c6:2dd:570:1640:b39b]) by smtp.gmail.com with ESMTPSA id r10-20020a4a964a000000b00435a59fba01sm4957260ooi.47.2022.10.17.19.49.08 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 17 Oct 2022 19:49:09 -0700 (PDT) To: libc-alpha@sourceware.org Subject: [PATCH v1 5/7] x86: Optimize strrchr-evex.S and implement with VMM headers Date: Mon, 17 Oct 2022 19:48:59 -0700 Message-Id: <20221018024901.3381469-5-goldstein.w.n@gmail.com> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20221018024901.3381469-1-goldstein.w.n@gmail.com> References: <20221018024901.3381469-1-goldstein.w.n@gmail.com> MIME-Version: 1.0 X-Spam-Status: No, score=-12.1 required=5.0 tests=BAYES_00, DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, FREEMAIL_FROM, GIT_PATCH_0, RCVD_IN_DNSWL_NONE, SPF_HELO_NONE, SPF_PASS, TXREP autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org X-BeenThere: libc-alpha@sourceware.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Libc-alpha mailing list List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-Patchwork-Original-From: Noah Goldstein via Libc-alpha From: Noah Goldstein Reply-To: Noah Goldstein Errors-To: libc-alpha-bounces+incoming=patchwork.ozlabs.org@sourceware.org Sender: "Libc-alpha" Optimization is: 1. Cache latest result in "fast path" loop with `vmovdqu` instead of `kunpckdq`. This helps if there are more than one matches. Code Size Changes: strrchr-evex.S : +30 bytes (Same number of cache lines) Net perf changes: Reported as geometric mean of all improvements / regressions from N=10 runs of the benchtests. Value as New Time / Old Time so < 1.0 is improvement and 1.0 is regression. strrchr-evex.S : 0.932 (From cases with higher match frequency) Full results attached in email. Full check passes on x86-64. 
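As a rough illustration of the "cache latest result" idea for readers not following the assembly: the hot loop only needs to remember which block of the string most recently contained the search character; the exact offset inside that block can be recovered once, after the null terminator has been found. The C sketch below is purely illustrative -- the names strrchr_sketch and BLOCK are made up, and it models the EVEX vector loop with byte-wise memchr calls -- it is not the glibc implementation.

#include <stddef.h>
#include <string.h>

/* Illustrative scalar model of the "remember the latest matching block"
   strategy.  The real code keeps the candidate blocks in vector
   registers (VMOVA of VMM(5)/VMM(6)) rather than re-deriving a combined
   mask with kunpck on every iteration.  */
char *
strrchr_sketch (const char *s, int c_in)
{
  enum { BLOCK = 16 };            /* stand-in for VEC_SIZE */
  char c = (char) c_in;
  const char *p = s;              /* assumed readable in BLOCK-sized
                                     chunks, as the aligned vector
                                     loads in the real code are */
  const char *cached = NULL;      /* block holding the latest match */
  size_t cached_len = 0;          /* bytes of it before the terminator */

  for (;;)
    {
      const char *nul = memchr (p, '\0', BLOCK);
      size_t len = (nul != NULL) ? (size_t) (nul - p) : BLOCK;

      /* Cheap per-iteration work: a containment test plus caching the
         block pointer.  No position extraction happens here.  */
      if (len != 0 && memchr (p, c, len) != NULL)
        {
          cached = p;
          cached_len = len;
        }
      if (nul != NULL)
        break;
      p += BLOCK;
    }

  if (cached == NULL)
    return NULL;                  /* (searching for '\0' itself is not
                                     modeled in this sketch) */

  /* Done exactly once, outside the hot loop: find the last occurrence
     inside the cached block.  */
  const char *ret = NULL;
  for (size_t i = 0; i < cached_len; i++)
    if (cached[i] == c)
      ret = cached + i;
  return (char *) ret;
}

In the assembly this caching is done with plain vector moves of the two candidate vectors instead of merging k2/k3 with kunpckdq on every iteration; per the comment added in the diff, kmov/kunpck take port-0 uops that would otherwise sit in the loop, which is where the improvement in the higher-match-frequency cases comes from.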
--- sysdeps/x86_64/multiarch/strrchr-evex.S | 371 +++++++++++++----------- 1 file changed, 200 insertions(+), 171 deletions(-) diff --git a/sysdeps/x86_64/multiarch/strrchr-evex.S b/sysdeps/x86_64/multiarch/strrchr-evex.S index 992b45fb47..45487dc87a 100644 --- a/sysdeps/x86_64/multiarch/strrchr-evex.S +++ b/sysdeps/x86_64/multiarch/strrchr-evex.S @@ -26,25 +26,30 @@ # define STRRCHR __strrchr_evex # endif -# define VMOVU vmovdqu64 -# define VMOVA vmovdqa64 +# include "x86-evex256-vecs.h" # ifdef USE_AS_WCSRCHR -# define SHIFT_REG esi - -# define kunpck kunpckbw +# define RCX_M cl +# define SHIFT_REG rcx +# define VPCOMPRESS vpcompressd +# define kunpck_2x kunpckbw # define kmov_2x kmovd # define maskz_2x ecx # define maskm_2x eax # define CHAR_SIZE 4 # define VPMIN vpminud # define VPTESTN vptestnmd +# define VPTEST vptestmd # define VPBROADCAST vpbroadcastd +# define VPCMPEQ vpcmpeqd # define VPCMP vpcmpd -# else -# define SHIFT_REG edi -# define kunpck kunpckdq +# define USE_WIDE_CHAR +# else +# define RCX_M ecx +# define SHIFT_REG rdi +# define VPCOMPRESS vpcompressb +# define kunpck_2x kunpckdq # define kmov_2x kmovq # define maskz_2x rcx # define maskm_2x rax @@ -52,58 +57,48 @@ # define CHAR_SIZE 1 # define VPMIN vpminub # define VPTESTN vptestnmb +# define VPTEST vptestmb # define VPBROADCAST vpbroadcastb +# define VPCMPEQ vpcmpeqb # define VPCMP vpcmpb # endif -# define XMMZERO xmm16 -# define YMMZERO ymm16 -# define YMMMATCH ymm17 -# define YMMSAVE ymm18 +# include "reg-macros.h" -# define YMM1 ymm19 -# define YMM2 ymm20 -# define YMM3 ymm21 -# define YMM4 ymm22 -# define YMM5 ymm23 -# define YMM6 ymm24 -# define YMM7 ymm25 -# define YMM8 ymm26 - - -# define VEC_SIZE 32 +# define VMATCH VMM(0) +# define CHAR_PER_VEC (VEC_SIZE / CHAR_SIZE) # define PAGE_SIZE 4096 - .section .text.evex, "ax", @progbits -ENTRY(STRRCHR) + + .section SECTION(.text), "ax", @progbits +ENTRY_P2ALIGN(STRRCHR, 6) movl %edi, %eax - /* Broadcast CHAR to YMMMATCH. */ - VPBROADCAST %esi, %YMMMATCH + /* Broadcast CHAR to VMATCH. */ + VPBROADCAST %esi, %VMATCH andl $(PAGE_SIZE - 1), %eax cmpl $(PAGE_SIZE - VEC_SIZE), %eax jg L(cross_page_boundary) -L(page_cross_continue): - VMOVU (%rdi), %YMM1 - /* k0 has a 1 for each zero CHAR in YMM1. */ - VPTESTN %YMM1, %YMM1, %k0 - kmovd %k0, %ecx - testl %ecx, %ecx + VMOVU (%rdi), %VMM(1) + /* k0 has a 1 for each zero CHAR in VEC(1). */ + VPTESTN %VMM(1), %VMM(1), %k0 + KMOV %k0, %VRSI + test %VRSI, %VRSI jz L(aligned_more) /* fallthrough: zero CHAR in first VEC. */ - - /* K1 has a 1 for each search CHAR match in YMM1. */ - VPCMP $0, %YMMMATCH, %YMM1, %k1 - kmovd %k1, %eax +L(page_cross_return): + /* K1 has a 1 for each search CHAR match in VEC(1). */ + VPCMPEQ %VMATCH, %VMM(1), %k1 + KMOV %k1, %VRAX /* Build mask up until first zero CHAR (used to mask of potential search CHAR matches past the end of the string). */ - blsmskl %ecx, %ecx - andl %ecx, %eax + blsmsk %VRSI, %VRSI + and %VRSI, %VRAX jz L(ret0) - /* Get last match (the `andl` removed any out of bounds - matches). */ - bsrl %eax, %eax + /* Get last match (the `and` removed any out of bounds matches). + */ + bsr %VRAX, %VRAX # ifdef USE_AS_WCSRCHR leaq (%rdi, %rax, CHAR_SIZE), %rax # else @@ -116,22 +111,22 @@ L(ret0): search path for earlier matches. */ .p2align 4,, 6 L(first_vec_x1): - VPCMP $0, %YMMMATCH, %YMM2, %k1 - kmovd %k1, %eax - blsmskl %ecx, %ecx + VPCMPEQ %VMATCH, %VMM(2), %k1 + KMOV %k1, %VRAX + blsmsk %VRCX, %VRCX /* eax non-zero if search CHAR in range. 
*/ - andl %ecx, %eax + and %VRCX, %VRAX jnz L(first_vec_x1_return) - /* fallthrough: no match in YMM2 then need to check for earlier - matches (in YMM1). */ + /* fallthrough: no match in VEC(2) then need to check for + earlier matches (in VEC(1)). */ .p2align 4,, 4 L(first_vec_x0_test): - VPCMP $0, %YMMMATCH, %YMM1, %k1 - kmovd %k1, %eax - testl %eax, %eax + VPCMPEQ %VMATCH, %VMM(1), %k1 + KMOV %k1, %VRAX + test %VRAX, %VRAX jz L(ret1) - bsrl %eax, %eax + bsr %VRAX, %VRAX # ifdef USE_AS_WCSRCHR leaq (%rsi, %rax, CHAR_SIZE), %rax # else @@ -142,129 +137,144 @@ L(ret1): .p2align 4,, 10 L(first_vec_x1_or_x2): - VPCMP $0, %YMM3, %YMMMATCH, %k3 - VPCMP $0, %YMM2, %YMMMATCH, %k2 + VPCMPEQ %VMM(3), %VMATCH, %k3 + VPCMPEQ %VMM(2), %VMATCH, %k2 /* K2 and K3 have 1 for any search CHAR match. Test if any - matches between either of them. Otherwise check YMM1. */ - kortestd %k2, %k3 + matches between either of them. Otherwise check VEC(1). */ + KORTEST %k2, %k3 jz L(first_vec_x0_test) - /* Guranteed that YMM2 and YMM3 are within range so merge the - two bitmasks then get last result. */ - kunpck %k2, %k3, %k3 - kmovq %k3, %rax - bsrq %rax, %rax - leaq (VEC_SIZE)(%r8, %rax, CHAR_SIZE), %rax + /* Guranteed that VEC(2) and VEC(3) are within range so merge + the two bitmasks then get last result. */ + kunpck_2x %k2, %k3, %k3 + kmov_2x %k3, %maskm_2x + bsr %maskm_2x, %maskm_2x + leaq (VEC_SIZE * 1)(%r8, %rax, CHAR_SIZE), %rax ret - .p2align 4,, 6 + .p2align 4,, 7 L(first_vec_x3): - VPCMP $0, %YMMMATCH, %YMM4, %k1 - kmovd %k1, %eax - blsmskl %ecx, %ecx - /* If no search CHAR match in range check YMM1/YMM2/YMM3. */ - andl %ecx, %eax + VPCMPEQ %VMATCH, %VMM(4), %k1 + KMOV %k1, %VRAX + blsmsk %VRCX, %VRCX + /* If no search CHAR match in range check VEC(1)/VEC(2)/VEC(3). + */ + and %VRCX, %VRAX jz L(first_vec_x1_or_x2) - bsrl %eax, %eax + bsr %VRAX, %VRAX leaq (VEC_SIZE * 3)(%rdi, %rax, CHAR_SIZE), %rax ret + .p2align 4,, 6 L(first_vec_x0_x1_test): - VPCMP $0, %YMMMATCH, %YMM2, %k1 - kmovd %k1, %eax - /* Check YMM2 for last match first. If no match try YMM1. */ - testl %eax, %eax + VPCMPEQ %VMATCH, %VMM(2), %k1 + KMOV %k1, %VRAX + /* Check VEC(2) for last match first. If no match try VEC(1). + */ + test %VRAX, %VRAX jz L(first_vec_x0_test) .p2align 4,, 4 L(first_vec_x1_return): - bsrl %eax, %eax + bsr %VRAX, %VRAX leaq (VEC_SIZE)(%rdi, %rax, CHAR_SIZE), %rax ret + .p2align 4,, 10 L(first_vec_x2): - VPCMP $0, %YMMMATCH, %YMM3, %k1 - kmovd %k1, %eax - blsmskl %ecx, %ecx - /* Check YMM3 for last match first. If no match try YMM2/YMM1. - */ - andl %ecx, %eax + VPCMPEQ %VMATCH, %VMM(3), %k1 + KMOV %k1, %VRAX + blsmsk %VRCX, %VRCX + /* Check VEC(3) for last match first. If no match try + VEC(2)/VEC(1). */ + and %VRCX, %VRAX jz L(first_vec_x0_x1_test) - bsrl %eax, %eax + bsr %VRAX, %VRAX leaq (VEC_SIZE * 2)(%rdi, %rax, CHAR_SIZE), %rax ret - .p2align 4 + .p2align 4,, 12 L(aligned_more): - /* Need to keep original pointer incase YMM1 has last match. */ +L(page_cross_continue): + /* Need to keep original pointer incase VEC(1) has last match. 
+ */ movq %rdi, %rsi andq $-VEC_SIZE, %rdi - VMOVU VEC_SIZE(%rdi), %YMM2 - VPTESTN %YMM2, %YMM2, %k0 - kmovd %k0, %ecx - testl %ecx, %ecx + + VMOVU VEC_SIZE(%rdi), %VMM(2) + VPTESTN %VMM(2), %VMM(2), %k0 + KMOV %k0, %VRCX + + test %VRCX, %VRCX jnz L(first_vec_x1) - VMOVU (VEC_SIZE * 2)(%rdi), %YMM3 - VPTESTN %YMM3, %YMM3, %k0 - kmovd %k0, %ecx - testl %ecx, %ecx + VMOVU (VEC_SIZE * 2)(%rdi), %VMM(3) + VPTESTN %VMM(3), %VMM(3), %k0 + KMOV %k0, %VRCX + + test %VRCX, %VRCX jnz L(first_vec_x2) - VMOVU (VEC_SIZE * 3)(%rdi), %YMM4 - VPTESTN %YMM4, %YMM4, %k0 - kmovd %k0, %ecx + VMOVU (VEC_SIZE * 3)(%rdi), %VMM(4) + VPTESTN %VMM(4), %VMM(4), %k0 + KMOV %k0, %VRCX movq %rdi, %r8 - testl %ecx, %ecx + test %VRCX, %VRCX jnz L(first_vec_x3) andq $-(VEC_SIZE * 2), %rdi - .p2align 4 + .p2align 4,, 10 L(first_aligned_loop): - /* Preserve YMM1, YMM2, YMM3, and YMM4 until we can gurantee - they don't store a match. */ - VMOVA (VEC_SIZE * 4)(%rdi), %YMM5 - VMOVA (VEC_SIZE * 5)(%rdi), %YMM6 + /* Preserve VEC(1), VEC(2), VEC(3), and VEC(4) until we can + gurantee they don't store a match. */ + VMOVA (VEC_SIZE * 4)(%rdi), %VMM(5) + VMOVA (VEC_SIZE * 5)(%rdi), %VMM(6) - VPCMP $0, %YMM5, %YMMMATCH, %k2 - vpxord %YMM6, %YMMMATCH, %YMM7 + VPCMPEQ %VMM(5), %VMATCH, %k2 + vpxord %VMM(6), %VMATCH, %VMM(7) - VPMIN %YMM5, %YMM6, %YMM8 - VPMIN %YMM8, %YMM7, %YMM7 + VPMIN %VMM(5), %VMM(6), %VMM(8) + VPMIN %VMM(8), %VMM(7), %VMM(7) - VPTESTN %YMM7, %YMM7, %k1 + VPTESTN %VMM(7), %VMM(7), %k1 subq $(VEC_SIZE * -2), %rdi - kortestd %k1, %k2 + KORTEST %k1, %k2 jz L(first_aligned_loop) - VPCMP $0, %YMM6, %YMMMATCH, %k3 - VPTESTN %YMM8, %YMM8, %k1 - ktestd %k1, %k1 + VPCMPEQ %VMM(6), %VMATCH, %k3 + VPTESTN %VMM(8), %VMM(8), %k1 + + /* If k1 is zero, then we found a CHAR match but no null-term. + We can now safely throw out VEC1-4. */ + KTEST %k1, %k1 jz L(second_aligned_loop_prep) - kortestd %k2, %k3 + KORTEST %k2, %k3 jnz L(return_first_aligned_loop) + .p2align 4,, 6 L(first_vec_x1_or_x2_or_x3): - VPCMP $0, %YMM4, %YMMMATCH, %k4 - kmovd %k4, %eax - testl %eax, %eax + VPCMPEQ %VMM(4), %VMATCH, %k4 + KMOV %k4, %VRAX + bsr %VRAX, %VRAX jz L(first_vec_x1_or_x2) - bsrl %eax, %eax leaq (VEC_SIZE * 3)(%r8, %rax, CHAR_SIZE), %rax ret + .p2align 4,, 8 L(return_first_aligned_loop): - VPTESTN %YMM5, %YMM5, %k0 - kunpck %k0, %k1, %k0 + VPTESTN %VMM(5), %VMM(5), %k0 + + /* Combined results from VEC5/6. */ + kunpck_2x %k0, %k1, %k0 kmov_2x %k0, %maskz_2x blsmsk %maskz_2x, %maskz_2x - kunpck %k2, %k3, %k3 + kunpck_2x %k2, %k3, %k3 kmov_2x %k3, %maskm_2x and %maskz_2x, %maskm_2x jz L(first_vec_x1_or_x2_or_x3) @@ -280,47 +290,62 @@ L(return_first_aligned_loop): L(second_aligned_loop_prep): L(second_aligned_loop_set_furthest_match): movq %rdi, %rsi - kunpck %k2, %k3, %k4 - + /* Ideally we would safe k2/k3 but `kmov/kunpck` take uops on + port0 and have noticable overhead in the loop. 
*/ + VMOVA %VMM(5), %VMM(7) + VMOVA %VMM(6), %VMM(8) .p2align 4 L(second_aligned_loop): - VMOVU (VEC_SIZE * 4)(%rdi), %YMM1 - VMOVU (VEC_SIZE * 5)(%rdi), %YMM2 - - VPCMP $0, %YMM1, %YMMMATCH, %k2 - vpxord %YMM2, %YMMMATCH, %YMM3 + VMOVU (VEC_SIZE * 4)(%rdi), %VMM(5) + VMOVU (VEC_SIZE * 5)(%rdi), %VMM(6) + VPCMPEQ %VMM(5), %VMATCH, %k2 + vpxord %VMM(6), %VMATCH, %VMM(3) - VPMIN %YMM1, %YMM2, %YMM4 - VPMIN %YMM3, %YMM4, %YMM3 + VPMIN %VMM(5), %VMM(6), %VMM(4) + VPMIN %VMM(3), %VMM(4), %VMM(3) - VPTESTN %YMM3, %YMM3, %k1 + VPTESTN %VMM(3), %VMM(3), %k1 subq $(VEC_SIZE * -2), %rdi - kortestd %k1, %k2 + KORTEST %k1, %k2 jz L(second_aligned_loop) - - VPCMP $0, %YMM2, %YMMMATCH, %k3 - VPTESTN %YMM4, %YMM4, %k1 - ktestd %k1, %k1 + VPCMPEQ %VMM(6), %VMATCH, %k3 + VPTESTN %VMM(4), %VMM(4), %k1 + KTEST %k1, %k1 jz L(second_aligned_loop_set_furthest_match) - kortestd %k2, %k3 - /* branch here because there is a significant advantage interms - of output dependency chance in using edx. */ + /* branch here because we know we have a match in VEC7/8 but + might not in VEC5/6 so the latter is expected to be less + likely. */ + KORTEST %k2, %k3 jnz L(return_new_match) + L(return_old_match): - kmovq %k4, %rax - bsrq %rax, %rax - leaq (VEC_SIZE * 2)(%rsi, %rax, CHAR_SIZE), %rax + VPCMPEQ %VMM(8), %VMATCH, %k0 + KMOV %k0, %VRCX + bsr %VRCX, %VRCX + jnz L(return_old_match_ret) + + VPCMPEQ %VMM(7), %VMATCH, %k0 + KMOV %k0, %VRCX + bsr %VRCX, %VRCX + subq $VEC_SIZE, %rsi +L(return_old_match_ret): + leaq (VEC_SIZE * 3)(%rsi, %rcx, CHAR_SIZE), %rax ret + .p2align 4,, 10 L(return_new_match): - VPTESTN %YMM1, %YMM1, %k0 - kunpck %k0, %k1, %k0 + VPTESTN %VMM(5), %VMM(5), %k0 + + /* Combined results from VEC5/6. */ + kunpck_2x %k0, %k1, %k0 kmov_2x %k0, %maskz_2x blsmsk %maskz_2x, %maskz_2x - kunpck %k2, %k3, %k3 + kunpck_2x %k2, %k3, %k3 kmov_2x %k3, %maskm_2x + + /* Match at end was out-of-bounds so use last known match. */ and %maskz_2x, %maskm_2x jz L(return_old_match) @@ -328,49 +353,53 @@ L(return_new_match): leaq (VEC_SIZE * 2)(%rdi, %rax, CHAR_SIZE), %rax ret + .p2align 4,, 4 L(cross_page_boundary): - /* eax contains all the page offset bits of src (rdi). `xor rdi, - rax` sets pointer will all page offset bits cleared so - offset of (PAGE_SIZE - VEC_SIZE) will get last aligned VEC - before page cross (guranteed to be safe to read). Doing this - as opposed to `movq %rdi, %rax; andq $-VEC_SIZE, %rax` saves - a bit of code size. */ xorq %rdi, %rax - VMOVU (PAGE_SIZE - VEC_SIZE)(%rax), %YMM1 - VPTESTN %YMM1, %YMM1, %k0 - kmovd %k0, %ecx + mov $-1, %VRDX + VMOVU (PAGE_SIZE - VEC_SIZE)(%rax), %VMM(6) + VPTESTN %VMM(6), %VMM(6), %k0 + KMOV %k0, %VRSI + +# ifdef USE_AS_WCSRCHR + movl %edi, %ecx + and $(VEC_SIZE - 1), %ecx + shrl $2, %ecx +# endif + shlx %VGPR(SHIFT_REG), %VRDX, %VRDX - /* Shift out zero CHAR matches that are before the begining of - src (rdi). */ # ifdef USE_AS_WCSRCHR - movl %edi, %esi - andl $(VEC_SIZE - 1), %esi - shrl $2, %esi + kmovb %edx, %k1 +# else + KMOV %VRDX, %k1 # endif - shrxl %SHIFT_REG, %ecx, %ecx - testl %ecx, %ecx + /* Need to adjust result to VEC(1) so it can be re-used by + L(return_vec_x0_test). The alternative is to collect VEC(1) + will a page cross load which is far more expensive. */ + VPCOMPRESS %VMM(6), %VMM(1){%k1}{z} + + /* We could technically just jmp back after the vpcompress but + it doesn't save any 16-byte blocks. */ + shrx %VGPR(SHIFT_REG), %VRSI, %VRSI + test %VRSI, %VRSI jz L(page_cross_continue) - /* Found zero CHAR so need to test for search CHAR. 
*/ - VPCMP $0, %YMMMATCH, %YMM1, %k1 - kmovd %k1, %eax - /* Shift out search CHAR matches that are before the begining of - src (rdi). */ - shrxl %SHIFT_REG, %eax, %eax - - /* Check if any search CHAR match in range. */ - blsmskl %ecx, %ecx - andl %ecx, %eax - jz L(ret3) - bsrl %eax, %eax + /* Duplicate of return logic from ENTRY. Doesn't cause spill to + next cache line so might as well copy it here. */ + VPCMPEQ %VMATCH, %VMM(1), %k1 + KMOV %k1, %VRAX + blsmsk %VRSI, %VRSI + and %VRSI, %VRAX + jz L(ret_page_cross) + bsr %VRAX, %VRAX # ifdef USE_AS_WCSRCHR leaq (%rdi, %rax, CHAR_SIZE), %rax # else addq %rdi, %rax # endif -L(ret3): +L(ret_page_cross): ret - + /* 1 byte till next cache line. */ END(STRRCHR) #endif From patchwork Tue Oct 18 02:49:00 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Noah Goldstein X-Patchwork-Id: 1691326 Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@legolas.ozlabs.org Authentication-Results: legolas.ozlabs.org; spf=pass (sender SPF authorized) smtp.mailfrom=sourceware.org (client-ip=8.43.85.97; helo=sourceware.org; envelope-from=libc-alpha-bounces+incoming=patchwork.ozlabs.org@sourceware.org; receiver=) Authentication-Results: legolas.ozlabs.org; dkim=pass (1024-bit key; secure) header.d=sourceware.org header.i=@sourceware.org header.a=rsa-sha256 header.s=default header.b=IlPRPSp/; dkim-atps=neutral Received: from sourceware.org (ip-8-43-85-97.sourceware.org [8.43.85.97]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature ECDSA (P-384) server-digest SHA384) (No client certificate requested) by legolas.ozlabs.org (Postfix) with ESMTPS id 4Mryyb3kWDz23jp for ; Tue, 18 Oct 2022 13:50:15 +1100 (AEDT) Received: from server2.sourceware.org (localhost [IPv6:::1]) by sourceware.org (Postfix) with ESMTP id 7FD3F3857B87 for ; Tue, 18 Oct 2022 02:50:13 +0000 (GMT) DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 7FD3F3857B87 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=sourceware.org; s=default; t=1666061413; bh=o1P58Oi8baln6MVt0yU9Hm3PGPb7vM9Cf2yhavvHOuA=; h=To:Subject:Date:In-Reply-To:References:List-Id:List-Unsubscribe: List-Archive:List-Post:List-Help:List-Subscribe:From:Reply-To: From; b=IlPRPSp/0rq5WdZFtQHYoYH/ed3Ws+gYIiUrvuMxwHCuPVskVs6RkWXaYpVTxTPJz ExREgT512l79IdISZS/d8Qyk8kolBvwpOZsZ9pGucWstCJQrz+GdgAGlWQnWi4XHPs o0qQsW86aBSZCANSJsnILB9JUqpDWtIK2RCfg2sw= X-Original-To: libc-alpha@sourceware.org Delivered-To: libc-alpha@sourceware.org Received: from mail-ot1-x32d.google.com (mail-ot1-x32d.google.com [IPv6:2607:f8b0:4864:20::32d]) by sourceware.org (Postfix) with ESMTPS id 821CF3858406 for ; Tue, 18 Oct 2022 02:49:12 +0000 (GMT) DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 821CF3858406 Received: by mail-ot1-x32d.google.com with SMTP id d18-20020a05683025d200b00661c6f1b6a4so6898646otu.1 for ; Mon, 17 Oct 2022 19:49:12 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=o1P58Oi8baln6MVt0yU9Hm3PGPb7vM9Cf2yhavvHOuA=; b=2pnszE4npRKOfO2jvGaXZJXbv6bFKggh+Rnw6VvINW+2bZDg7z5+MPap3Q7cRmo5AG BFHCHgQvIzEJEL9ZinGvlXMF/kZWCrS7jodL7YXFLUJdMzy3mOmnhAuoOPpwfTL3mI6f kDBoQ4wD3uBn/U2Y4wCst0C54sfXmzkC4pHU4w/rPOPQ6+s+wapW/qq4Im43YX4a7aQb 
Z2z8PNbnO9/KUWsZu+997mXuPB1UxYjEBu2Y0ZkepIjafuZveTnHEjGOe6HxumniQkpw yw6jZTUEHhR9CerHmFqKOr4i1Xk+JPP3Mxzc3kbgu8nMmHrJ2/dLNt1cA1ZXuImLnkv4 vfuA== X-Gm-Message-State: ACrzQf0HWa1GcRt2/CQ/W3+b0vj4uo7JLWtZOKyEwLtJEDbFA9h3hy2C OeEeNLLaEYEkLt50lCNtbUZox2k8Xhh9Hw== X-Google-Smtp-Source: AMsMyM7mhPR4QTJNCJaxLH1J6DOCl4Zvm/yuDheSipojDngpiBe8In7cPE/o44TdbTLrYv7xUpGwSw== X-Received: by 2002:a05:6830:d0b:b0:661:9435:5e30 with SMTP id bu11-20020a0568300d0b00b0066194355e30mr426573otb.276.1666061350657; Mon, 17 Oct 2022 19:49:10 -0700 (PDT) Received: from noah-tgl.lan (2603-8080-1301-76c6-02dd-0570-1640-b39b.res6.spectrum.com. [2603:8080:1301:76c6:2dd:570:1640:b39b]) by smtp.gmail.com with ESMTPSA id r10-20020a4a964a000000b00435a59fba01sm4957260ooi.47.2022.10.17.19.49.09 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 17 Oct 2022 19:49:10 -0700 (PDT) To: libc-alpha@sourceware.org Subject: [PATCH v1 6/7] x86: Add support for VEC_SIZE == 64 in strcmp-evex.S impl Date: Mon, 17 Oct 2022 19:49:00 -0700 Message-Id: <20221018024901.3381469-6-goldstein.w.n@gmail.com> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20221018024901.3381469-1-goldstein.w.n@gmail.com> References: <20221018024901.3381469-1-goldstein.w.n@gmail.com> MIME-Version: 1.0 X-Spam-Status: No, score=-12.1 required=5.0 tests=BAYES_00, DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, FREEMAIL_FROM, GIT_PATCH_0, RCVD_IN_DNSWL_NONE, SPF_HELO_NONE, SPF_PASS, TXREP autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on server2.sourceware.org X-BeenThere: libc-alpha@sourceware.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Libc-alpha mailing list List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-Patchwork-Original-From: Noah Goldstein via Libc-alpha From: Noah Goldstein Reply-To: Noah Goldstein Errors-To: libc-alpha-bounces+incoming=patchwork.ozlabs.org@sourceware.org Sender: "Libc-alpha" Unused at the moment, but evex512 strcmp, strncmp, strcasecmp{l}, and strncasecmp{l} functions can be added by including strcmp-evex.S with "x86-evex512-vecs.h" defined. In addition save code size a bit in a few places. 1. tzcnt ... -> bsf ... 2. vpcmp{b|d} $0 ... -> vpcmpeq{b|d} This saves a touch of code size but has minimal net affect. Full check passes on x86-64. --- sysdeps/x86_64/multiarch/strcmp-evex.S | 676 ++++++++++++++++--------- 1 file changed, 430 insertions(+), 246 deletions(-) diff --git a/sysdeps/x86_64/multiarch/strcmp-evex.S b/sysdeps/x86_64/multiarch/strcmp-evex.S index e482d0167f..756a3bb8d6 100644 --- a/sysdeps/x86_64/multiarch/strcmp-evex.S +++ b/sysdeps/x86_64/multiarch/strcmp-evex.S @@ -20,6 +20,10 @@ #if ISA_SHOULD_BUILD (4) +# ifndef VEC_SIZE +# include "x86-evex256-vecs.h" +# endif + # define STRCMP_ISA _evex # include "strcmp-naming.h" @@ -35,41 +39,57 @@ # define PAGE_SIZE 4096 /* VEC_SIZE = Number of bytes in a ymm register. */ -# define VEC_SIZE 32 # define CHAR_PER_VEC (VEC_SIZE / SIZE_OF_CHAR) -# define VMOVU vmovdqu64 -# define VMOVA vmovdqa64 - # ifdef USE_AS_WCSCMP -# define TESTEQ subl $0xff, /* Compare packed dwords. */ # define VPCMP vpcmpd +# define VPCMPEQ vpcmpeqd # define VPMINU vpminud # define VPTESTM vptestmd # define VPTESTNM vptestnmd /* 1 dword char == 4 bytes. */ # define SIZE_OF_CHAR 4 + +# define TESTEQ sub $((1 << CHAR_PER_VEC) - 1), + +# define USE_WIDE_CHAR # else -# define TESTEQ incl /* Compare packed bytes. 
*/ # define VPCMP vpcmpb +# define VPCMPEQ vpcmpeqb # define VPMINU vpminub # define VPTESTM vptestmb # define VPTESTNM vptestnmb /* 1 byte char == 1 byte. */ # define SIZE_OF_CHAR 1 + +# define TESTEQ inc +# endif + +# include "reg-macros.h" + +# if VEC_SIZE == 64 +# define RODATA_SECTION rodata.cst64 +# else +# define RODATA_SECTION rodata.cst32 +# endif + +# if CHAR_PER_VEC == 64 +# define FALLTHROUGH_RETURN_OFFSET (VEC_SIZE * 3) +# else +# define FALLTHROUGH_RETURN_OFFSET (VEC_SIZE * 2) # endif # ifdef USE_AS_STRNCMP -# define LOOP_REG r9d +# define LOOP_REG VR9 # define LOOP_REG64 r9 # define OFFSET_REG8 r9b # define OFFSET_REG r9d # define OFFSET_REG64 r9 # else -# define LOOP_REG edx +# define LOOP_REG VRDX # define LOOP_REG64 rdx # define OFFSET_REG8 dl @@ -83,32 +103,6 @@ # define VEC_OFFSET (-VEC_SIZE) # endif -# define XMM0 xmm17 -# define XMM1 xmm18 - -# define XMM10 xmm27 -# define XMM11 xmm28 -# define XMM12 xmm29 -# define XMM13 xmm30 -# define XMM14 xmm31 - - -# define YMM0 ymm17 -# define YMM1 ymm18 -# define YMM2 ymm19 -# define YMM3 ymm20 -# define YMM4 ymm21 -# define YMM5 ymm22 -# define YMM6 ymm23 -# define YMM7 ymm24 -# define YMM8 ymm25 -# define YMM9 ymm26 -# define YMM10 ymm27 -# define YMM11 ymm28 -# define YMM12 ymm29 -# define YMM13 ymm30 -# define YMM14 ymm31 - # ifdef USE_AS_STRCASECMP_L # define BYTE_LOOP_REG OFFSET_REG # else @@ -125,61 +119,72 @@ # endif # endif -# define LCASE_MIN_YMM %YMM12 -# define LCASE_MAX_YMM %YMM13 -# define CASE_ADD_YMM %YMM14 +# define LCASE_MIN_V VMM(12) +# define LCASE_MAX_V VMM(13) +# define CASE_ADD_V VMM(14) -# define LCASE_MIN_XMM %XMM12 -# define LCASE_MAX_XMM %XMM13 -# define CASE_ADD_XMM %XMM14 +# if VEC_SIZE == 64 +# define LCASE_MIN_YMM VMM_256(12) +# define LCASE_MAX_YMM VMM_256(13) +# define CASE_ADD_YMM VMM_256(14) +# endif + +# define LCASE_MIN_XMM VMM_128(12) +# define LCASE_MAX_XMM VMM_128(13) +# define CASE_ADD_XMM VMM_128(14) /* NB: wcsncmp uses r11 but strcasecmp is never used in conjunction with wcscmp. */ # define TOLOWER_BASE %r11 # ifdef USE_AS_STRCASECMP_L -# define _REG(x, y) x ## y -# define REG(x, y) _REG(x, y) -# define TOLOWER(reg1, reg2, ext) \ - vpsubb REG(LCASE_MIN_, ext), reg1, REG(%ext, 10); \ - vpsubb REG(LCASE_MIN_, ext), reg2, REG(%ext, 11); \ - vpcmpub $1, REG(LCASE_MAX_, ext), REG(%ext, 10), %k5; \ - vpcmpub $1, REG(LCASE_MAX_, ext), REG(%ext, 11), %k6; \ - vpaddb reg1, REG(CASE_ADD_, ext), reg1{%k5}; \ - vpaddb reg2, REG(CASE_ADD_, ext), reg2{%k6} - -# define TOLOWER_gpr(src, dst) movl (TOLOWER_BASE, src, 4), dst -# define TOLOWER_YMM(...) TOLOWER(__VA_ARGS__, YMM) -# define TOLOWER_XMM(...) TOLOWER(__VA_ARGS__, XMM) - -# define CMP_R1_R2(s1_reg, s2_reg, reg_out, ext) \ - TOLOWER (s1_reg, s2_reg, ext); \ - VPCMP $0, s1_reg, s2_reg, reg_out - -# define CMP_R1_S2(s1_reg, s2_mem, s2_reg, reg_out, ext) \ - VMOVU s2_mem, s2_reg; \ - CMP_R1_R2(s1_reg, s2_reg, reg_out, ext) - -# define CMP_R1_R2_YMM(...) CMP_R1_R2(__VA_ARGS__, YMM) -# define CMP_R1_R2_XMM(...) CMP_R1_R2(__VA_ARGS__, XMM) - -# define CMP_R1_S2_YMM(...) CMP_R1_S2(__VA_ARGS__, YMM) -# define CMP_R1_S2_XMM(...) 
CMP_R1_S2(__VA_ARGS__, XMM) +# define _REG(x, y) x ## y +# define REG(x, y) _REG(x, y) +# define TOLOWER(reg1, reg2, ext, vec_macro) \ + vpsubb %REG(LCASE_MIN_, ext), reg1, %vec_macro(10); \ + vpsubb %REG(LCASE_MIN_, ext), reg2, %vec_macro(11); \ + vpcmpub $1, %REG(LCASE_MAX_, ext), %vec_macro(10), %k5; \ + vpcmpub $1, %REG(LCASE_MAX_, ext), %vec_macro(11), %k6; \ + vpaddb reg1, %REG(CASE_ADD_, ext), reg1{%k5}; \ + vpaddb reg2, %REG(CASE_ADD_, ext), reg2{%k6} + +# define TOLOWER_gpr(src, dst) movl (TOLOWER_BASE, src, 4), dst +# define TOLOWER_VMM(...) TOLOWER(__VA_ARGS__, V, VMM) +# define TOLOWER_YMM(...) TOLOWER(__VA_ARGS__, YMM, VMM_256) +# define TOLOWER_XMM(...) TOLOWER(__VA_ARGS__, XMM, VMM_128) + +# define CMP_R1_R2(s1_reg, s2_reg, reg_out, ext, vec_macro) \ + TOLOWER (s1_reg, s2_reg, ext, vec_macro); \ + VPCMPEQ s1_reg, s2_reg, reg_out + +# define CMP_R1_S2(s1_reg, s2_mem, s2_reg, reg_out, ext, vec_macro) \ + VMOVU s2_mem, s2_reg; \ + CMP_R1_R2 (s1_reg, s2_reg, reg_out, ext, vec_macro) + +# define CMP_R1_R2_VMM(...) CMP_R1_R2(__VA_ARGS__, V, VMM) +# define CMP_R1_R2_YMM(...) CMP_R1_R2(__VA_ARGS__, YMM, VMM_256) +# define CMP_R1_R2_XMM(...) CMP_R1_R2(__VA_ARGS__, XMM, VMM_128) + +# define CMP_R1_S2_VMM(...) CMP_R1_S2(__VA_ARGS__, V, VMM) +# define CMP_R1_S2_YMM(...) CMP_R1_S2(__VA_ARGS__, YMM, VMM_256) +# define CMP_R1_S2_XMM(...) CMP_R1_S2(__VA_ARGS__, XMM, VMM_128) # else # define TOLOWER_gpr(...) +# define TOLOWER_VMM(...) # define TOLOWER_YMM(...) # define TOLOWER_XMM(...) -# define CMP_R1_R2_YMM(s1_reg, s2_reg, reg_out) \ - VPCMP $0, s2_reg, s1_reg, reg_out +# define CMP_R1_R2_VMM(s1_reg, s2_reg, reg_out) \ + VPCMPEQ s2_reg, s1_reg, reg_out -# define CMP_R1_R2_XMM(...) CMP_R1_R2_YMM(__VA_ARGS__) +# define CMP_R1_R2_YMM(...) CMP_R1_R2_VMM(__VA_ARGS__) +# define CMP_R1_R2_XMM(...) CMP_R1_R2_VMM(__VA_ARGS__) -# define CMP_R1_S2_YMM(s1_reg, s2_mem, unused, reg_out) \ - VPCMP $0, s2_mem, s1_reg, reg_out - -# define CMP_R1_S2_XMM(...) CMP_R1_S2_YMM(__VA_ARGS__) +# define CMP_R1_S2_VMM(s1_reg, s2_mem, unused, reg_out) \ + VPCMPEQ s2_mem, s1_reg, reg_out +# define CMP_R1_S2_YMM(...) CMP_R1_S2_VMM(__VA_ARGS__) +# define CMP_R1_S2_XMM(...) CMP_R1_S2_VMM(__VA_ARGS__) # endif /* Warning! @@ -203,7 +208,7 @@ the maximum offset is reached before a difference is found, zero is returned. 
*/ - .section .text.evex, "ax", @progbits + .section SECTION(.text), "ax", @progbits .align 16 .type STRCMP, @function .globl STRCMP @@ -232,7 +237,7 @@ STRCMP: # else mov (%LOCALE_REG), %RAX_LP # endif - testl $1, LOCALE_DATA_VALUES + _NL_CTYPE_NONASCII_CASE * SIZEOF_VALUES(%rax) + testb $1, LOCALE_DATA_VALUES + _NL_CTYPE_NONASCII_CASE * SIZEOF_VALUES(%rax) jne STRCASECMP_L_NONASCII leaq _nl_C_LC_CTYPE_tolower + 128 * 4(%rip), TOLOWER_BASE # endif @@ -254,28 +259,46 @@ STRCMP: # endif # if defined USE_AS_STRCASECMP_L - .section .rodata.cst32, "aM", @progbits, 32 - .align 32 + .section RODATA_SECTION, "aM", @progbits, VEC_SIZE + .align VEC_SIZE L(lcase_min): .quad 0x4141414141414141 .quad 0x4141414141414141 .quad 0x4141414141414141 .quad 0x4141414141414141 +# if VEC_SIZE == 64 + .quad 0x4141414141414141 + .quad 0x4141414141414141 + .quad 0x4141414141414141 + .quad 0x4141414141414141 +# endif L(lcase_max): .quad 0x1a1a1a1a1a1a1a1a .quad 0x1a1a1a1a1a1a1a1a .quad 0x1a1a1a1a1a1a1a1a .quad 0x1a1a1a1a1a1a1a1a +# if VEC_SIZE == 64 + .quad 0x1a1a1a1a1a1a1a1a + .quad 0x1a1a1a1a1a1a1a1a + .quad 0x1a1a1a1a1a1a1a1a + .quad 0x1a1a1a1a1a1a1a1a +# endif L(case_add): .quad 0x2020202020202020 .quad 0x2020202020202020 .quad 0x2020202020202020 .quad 0x2020202020202020 +# if VEC_SIZE == 64 + .quad 0x2020202020202020 + .quad 0x2020202020202020 + .quad 0x2020202020202020 + .quad 0x2020202020202020 +# endif .previous - vmovdqa64 L(lcase_min)(%rip), LCASE_MIN_YMM - vmovdqa64 L(lcase_max)(%rip), LCASE_MAX_YMM - vmovdqa64 L(case_add)(%rip), CASE_ADD_YMM + VMOVA L(lcase_min)(%rip), %LCASE_MIN_V + VMOVA L(lcase_max)(%rip), %LCASE_MAX_V + VMOVA L(case_add)(%rip), %CASE_ADD_V # endif movl %edi, %eax @@ -288,12 +311,12 @@ L(case_add): L(no_page_cross): /* Safe to compare 4x vectors. */ - VMOVU (%rdi), %YMM0 - VPTESTM %YMM0, %YMM0, %k2 + VMOVU (%rdi), %VMM(0) + VPTESTM %VMM(0), %VMM(0), %k2 /* Each bit cleared in K1 represents a mismatch or a null CHAR in YMM0 and 32 bytes at (%rsi). */ - CMP_R1_S2_YMM (%YMM0, (%rsi), %YMM1, %k1){%k2} - kmovd %k1, %ecx + CMP_R1_S2_VMM (%VMM(0), (%rsi), %VMM(1), %k1){%k2} + KMOV %k1, %VRCX # ifdef USE_AS_STRNCMP cmpq $CHAR_PER_VEC, %rdx jbe L(vec_0_test_len) @@ -303,14 +326,14 @@ L(no_page_cross): wcscmp/wcsncmp. */ /* All 1s represents all equals. TESTEQ will overflow to zero in - all equals case. Otherwise 1s will carry until position of first - mismatch. */ - TESTEQ %ecx + all equals case. Otherwise 1s will carry until position of + first mismatch. */ + TESTEQ %VRCX jz L(more_3x_vec) .p2align 4,, 4 L(return_vec_0): - tzcntl %ecx, %ecx + bsf %VRCX, %VRCX # ifdef USE_AS_WCSCMP movl (%rdi, %rcx, SIZE_OF_CHAR), %edx xorl %eax, %eax @@ -321,7 +344,16 @@ L(return_vec_0): orl $1, %eax # else movzbl (%rdi, %rcx), %eax + /* For VEC_SIZE == 64 use movb instead of movzbl to save a byte + and keep logic for len <= VEC_SIZE (common) in just the + first cache line. NB: No evex512 processor has partial- + register stalls. If that changes this ifdef can be disabled + without affecting correctness. */ +# if !defined USE_AS_STRNCMP && !defined USE_AS_STRCASECMP_L && VEC_SIZE == 64 + movb (%rsi, %rcx), %cl +# else movzbl (%rsi, %rcx), %ecx +# endif TOLOWER_gpr (%rax, %eax) TOLOWER_gpr (%rcx, %ecx) subl %ecx, %eax @@ -332,8 +364,8 @@ L(ret0): # ifdef USE_AS_STRNCMP .p2align 4,, 4 L(vec_0_test_len): - notl %ecx - bzhil %edx, %ecx, %eax + not %VRCX + bzhi %VRDX, %VRCX, %VRAX jnz L(return_vec_0) /* Align if will cross fetch block. 
*/ .p2align 4,, 2 @@ -372,7 +404,7 @@ L(ret1): .p2align 4,, 10 L(return_vec_1): - tzcntl %ecx, %ecx + bsf %VRCX, %VRCX # ifdef USE_AS_STRNCMP /* rdx must be > CHAR_PER_VEC so its safe to subtract without worrying about underflow. */ @@ -401,24 +433,41 @@ L(ret2): .p2align 4,, 10 # ifdef USE_AS_STRNCMP L(return_vec_3): -# if CHAR_PER_VEC <= 16 +# if CHAR_PER_VEC <= 32 + /* If CHAR_PER_VEC <= 32 reuse code from L(return_vec_3) without + additional branches by adjusting the bit positions from + VEC3. We can't do this for CHAR_PER_VEC == 64. */ +# if CHAR_PER_VEC <= 16 sall $CHAR_PER_VEC, %ecx -# else +# else salq $CHAR_PER_VEC, %rcx +# endif +# else + /* If CHAR_PER_VEC == 64 we can't shift the return GPR so just + check it. */ + bsf %VRCX, %VRCX + addl $(CHAR_PER_VEC), %ecx + cmpq %rcx, %rdx + ja L(ret_vec_3_finish) + xorl %eax, %eax + ret # endif # endif + + /* If CHAR_PER_VEC == 64 we can't combine matches from the last + 2x VEC so need seperate return label. */ L(return_vec_2): # if (CHAR_PER_VEC <= 16) || !(defined USE_AS_STRNCMP) - tzcntl %ecx, %ecx + bsf %VRCX, %VRCX # else - tzcntq %rcx, %rcx + bsfq %rcx, %rcx # endif - # ifdef USE_AS_STRNCMP cmpq %rcx, %rdx jbe L(ret_zero) # endif +L(ret_vec_3_finish): # ifdef USE_AS_WCSCMP movl (VEC_SIZE * 2)(%rdi, %rcx, SIZE_OF_CHAR), %edx xorl %eax, %eax @@ -440,7 +489,7 @@ L(ret3): # ifndef USE_AS_STRNCMP .p2align 4,, 10 L(return_vec_3): - tzcntl %ecx, %ecx + bsf %VRCX, %VRCX # ifdef USE_AS_WCSCMP movl (VEC_SIZE * 3)(%rdi, %rcx, SIZE_OF_CHAR), %edx xorl %eax, %eax @@ -465,11 +514,11 @@ L(ret4): .p2align 5 L(more_3x_vec): /* Safe to compare 4x vectors. */ - VMOVU (VEC_SIZE)(%rdi), %YMM0 - VPTESTM %YMM0, %YMM0, %k2 - CMP_R1_S2_YMM (%YMM0, VEC_SIZE(%rsi), %YMM1, %k1){%k2} - kmovd %k1, %ecx - TESTEQ %ecx + VMOVU (VEC_SIZE)(%rdi), %VMM(0) + VPTESTM %VMM(0), %VMM(0), %k2 + CMP_R1_S2_VMM (%VMM(0), VEC_SIZE(%rsi), %VMM(1), %k1){%k2} + KMOV %k1, %VRCX + TESTEQ %VRCX jnz L(return_vec_1) # ifdef USE_AS_STRNCMP @@ -477,18 +526,18 @@ L(more_3x_vec): jbe L(ret_zero) # endif - VMOVU (VEC_SIZE * 2)(%rdi), %YMM0 - VPTESTM %YMM0, %YMM0, %k2 - CMP_R1_S2_YMM (%YMM0, (VEC_SIZE * 2)(%rsi), %YMM1, %k1){%k2} - kmovd %k1, %ecx - TESTEQ %ecx + VMOVU (VEC_SIZE * 2)(%rdi), %VMM(0) + VPTESTM %VMM(0), %VMM(0), %k2 + CMP_R1_S2_VMM (%VMM(0), (VEC_SIZE * 2)(%rsi), %VMM(1), %k1){%k2} + KMOV %k1, %VRCX + TESTEQ %VRCX jnz L(return_vec_2) - VMOVU (VEC_SIZE * 3)(%rdi), %YMM0 - VPTESTM %YMM0, %YMM0, %k2 - CMP_R1_S2_YMM (%YMM0, (VEC_SIZE * 3)(%rsi), %YMM1, %k1){%k2} - kmovd %k1, %ecx - TESTEQ %ecx + VMOVU (VEC_SIZE * 3)(%rdi), %VMM(0) + VPTESTM %VMM(0), %VMM(0), %k2 + CMP_R1_S2_VMM (%VMM(0), (VEC_SIZE * 3)(%rsi), %VMM(1), %k1){%k2} + KMOV %k1, %VRCX + TESTEQ %VRCX jnz L(return_vec_3) # ifdef USE_AS_STRNCMP @@ -565,110 +614,123 @@ L(loop): /* Loop entry after handling page cross during loop. */ L(loop_skip_page_cross_check): - VMOVA (VEC_SIZE * 0)(%rdi), %YMM0 - VMOVA (VEC_SIZE * 1)(%rdi), %YMM2 - VMOVA (VEC_SIZE * 2)(%rdi), %YMM4 - VMOVA (VEC_SIZE * 3)(%rdi), %YMM6 + VMOVA (VEC_SIZE * 0)(%rdi), %VMM(0) + VMOVA (VEC_SIZE * 1)(%rdi), %VMM(2) + VMOVA (VEC_SIZE * 2)(%rdi), %VMM(4) + VMOVA (VEC_SIZE * 3)(%rdi), %VMM(6) - VPMINU %YMM0, %YMM2, %YMM8 - VPMINU %YMM4, %YMM6, %YMM9 + VPMINU %VMM(0), %VMM(2), %VMM(8) + VPMINU %VMM(4), %VMM(6), %VMM(9) /* A zero CHAR in YMM9 means that there is a null CHAR. */ - VPMINU %YMM8, %YMM9, %YMM9 + VPMINU %VMM(8), %VMM(9), %VMM(9) /* Each bit set in K1 represents a non-null CHAR in YMM9. 
*/ - VPTESTM %YMM9, %YMM9, %k1 + VPTESTM %VMM(9), %VMM(9), %k1 # ifndef USE_AS_STRCASECMP_L - vpxorq (VEC_SIZE * 0)(%rsi), %YMM0, %YMM1 - vpxorq (VEC_SIZE * 1)(%rsi), %YMM2, %YMM3 - vpxorq (VEC_SIZE * 2)(%rsi), %YMM4, %YMM5 + vpxorq (VEC_SIZE * 0)(%rsi), %VMM(0), %VMM(1) + vpxorq (VEC_SIZE * 1)(%rsi), %VMM(2), %VMM(3) + vpxorq (VEC_SIZE * 2)(%rsi), %VMM(4), %VMM(5) /* Ternary logic to xor (VEC_SIZE * 3)(%rsi) with YMM6 while oring with YMM1. Result is stored in YMM6. */ - vpternlogd $0xde, (VEC_SIZE * 3)(%rsi), %YMM1, %YMM6 + vpternlogd $0xde, (VEC_SIZE * 3)(%rsi), %VMM(1), %VMM(6) # else - VMOVU (VEC_SIZE * 0)(%rsi), %YMM1 - TOLOWER_YMM (%YMM0, %YMM1) - VMOVU (VEC_SIZE * 1)(%rsi), %YMM3 - TOLOWER_YMM (%YMM2, %YMM3) - VMOVU (VEC_SIZE * 2)(%rsi), %YMM5 - TOLOWER_YMM (%YMM4, %YMM5) - VMOVU (VEC_SIZE * 3)(%rsi), %YMM7 - TOLOWER_YMM (%YMM6, %YMM7) - vpxorq %YMM0, %YMM1, %YMM1 - vpxorq %YMM2, %YMM3, %YMM3 - vpxorq %YMM4, %YMM5, %YMM5 - vpternlogd $0xde, %YMM7, %YMM1, %YMM6 + VMOVU (VEC_SIZE * 0)(%rsi), %VMM(1) + TOLOWER_VMM (%VMM(0), %VMM(1)) + VMOVU (VEC_SIZE * 1)(%rsi), %VMM(3) + TOLOWER_VMM (%VMM(2), %VMM(3)) + VMOVU (VEC_SIZE * 2)(%rsi), %VMM(5) + TOLOWER_VMM (%VMM(4), %VMM(5)) + VMOVU (VEC_SIZE * 3)(%rsi), %VMM(7) + TOLOWER_VMM (%VMM(6), %VMM(7)) + vpxorq %VMM(0), %VMM(1), %VMM(1) + vpxorq %VMM(2), %VMM(3), %VMM(3) + vpxorq %VMM(4), %VMM(5), %VMM(5) + vpternlogd $0xde, %VMM(7), %VMM(1), %VMM(6) # endif /* Or together YMM3, YMM5, and YMM6. */ - vpternlogd $0xfe, %YMM3, %YMM5, %YMM6 + vpternlogd $0xfe, %VMM(3), %VMM(5), %VMM(6) /* A non-zero CHAR in YMM6 represents a mismatch. */ - VPTESTNM %YMM6, %YMM6, %k0{%k1} - kmovd %k0, %LOOP_REG + VPTESTNM %VMM(6), %VMM(6), %k0{%k1} + KMOV %k0, %LOOP_REG TESTEQ %LOOP_REG jz L(loop) /* Find which VEC has the mismatch of end of string. */ - VPTESTM %YMM0, %YMM0, %k1 - VPTESTNM %YMM1, %YMM1, %k0{%k1} - kmovd %k0, %ecx - TESTEQ %ecx + VPTESTM %VMM(0), %VMM(0), %k1 + VPTESTNM %VMM(1), %VMM(1), %k0{%k1} + KMOV %k0, %VRCX + TESTEQ %VRCX jnz L(return_vec_0_end) - VPTESTM %YMM2, %YMM2, %k1 - VPTESTNM %YMM3, %YMM3, %k0{%k1} - kmovd %k0, %ecx - TESTEQ %ecx + VPTESTM %VMM(2), %VMM(2), %k1 + VPTESTNM %VMM(3), %VMM(3), %k0{%k1} + KMOV %k0, %VRCX + TESTEQ %VRCX jnz L(return_vec_1_end) - /* Handle VEC 2 and 3 without branches. */ + /* Handle VEC 2 and 3 without branches if CHAR_PER_VEC <= 32. + */ L(return_vec_2_3_end): # ifdef USE_AS_STRNCMP subq $(CHAR_PER_VEC * 2), %rdx jbe L(ret_zero_end) # endif - VPTESTM %YMM4, %YMM4, %k1 - VPTESTNM %YMM5, %YMM5, %k0{%k1} - kmovd %k0, %ecx - TESTEQ %ecx + VPTESTM %VMM(4), %VMM(4), %k1 + VPTESTNM %VMM(5), %VMM(5), %k0{%k1} + KMOV %k0, %VRCX + TESTEQ %VRCX # if CHAR_PER_VEC <= 16 sall $CHAR_PER_VEC, %LOOP_REG orl %ecx, %LOOP_REG -# else +# elif CHAR_PER_VEC <= 32 salq $CHAR_PER_VEC, %LOOP_REG64 orq %rcx, %LOOP_REG64 +# else + /* We aren't combining last 2x VEC so branch on second the last. + */ + jnz L(return_vec_2_end) # endif -L(return_vec_3_end): + /* LOOP_REG contains matches for null/mismatch from the loop. If - VEC 0,1,and 2 all have no null and no mismatches then mismatch - must entirely be from VEC 3 which is fully represented by - LOOP_REG. */ + VEC 0,1,and 2 all have no null and no mismatches then + mismatch must entirely be from VEC 3 which is fully + represented by LOOP_REG. 
*/ # if CHAR_PER_VEC <= 16 - tzcntl %LOOP_REG, %LOOP_REG + bsf %LOOP_REG, %LOOP_REG # else - tzcntq %LOOP_REG64, %LOOP_REG64 + bsfq %LOOP_REG64, %LOOP_REG64 # endif # ifdef USE_AS_STRNCMP + + /* If CHAR_PER_VEC == 64 we can't combine last 2x VEC so need to + adj length before last comparison. */ +# if CHAR_PER_VEC == 64 + subq $CHAR_PER_VEC, %rdx + jbe L(ret_zero_end) +# endif + cmpq %LOOP_REG64, %rdx jbe L(ret_zero_end) # endif # ifdef USE_AS_WCSCMP - movl (VEC_SIZE * 2)(%rdi, %LOOP_REG64, SIZE_OF_CHAR), %ecx + movl (FALLTHROUGH_RETURN_OFFSET)(%rdi, %LOOP_REG64, SIZE_OF_CHAR), %ecx xorl %eax, %eax - cmpl (VEC_SIZE * 2)(%rsi, %LOOP_REG64, SIZE_OF_CHAR), %ecx + cmpl (FALLTHROUGH_RETURN_OFFSET)(%rsi, %LOOP_REG64, SIZE_OF_CHAR), %ecx je L(ret5) setl %al negl %eax xorl %r8d, %eax # else - movzbl (VEC_SIZE * 2)(%rdi, %LOOP_REG64), %eax - movzbl (VEC_SIZE * 2)(%rsi, %LOOP_REG64), %ecx + movzbl (FALLTHROUGH_RETURN_OFFSET)(%rdi, %LOOP_REG64), %eax + movzbl (FALLTHROUGH_RETURN_OFFSET)(%rsi, %LOOP_REG64), %ecx TOLOWER_gpr (%rax, %eax) TOLOWER_gpr (%rcx, %ecx) subl %ecx, %eax @@ -686,23 +748,39 @@ L(ret_zero_end): # endif + /* The L(return_vec_N_end) differ from L(return_vec_N) in that - they use the value of `r8` to negate the return value. This is - because the page cross logic can swap `rdi` and `rsi`. */ + they use the value of `r8` to negate the return value. This + is because the page cross logic can swap `rdi` and `rsi`. + */ .p2align 4,, 10 # ifdef USE_AS_STRNCMP L(return_vec_1_end): -# if CHAR_PER_VEC <= 16 +# if CHAR_PER_VEC <= 32 + /* If CHAR_PER_VEC <= 32 reuse code from L(return_vec_0_end) + without additional branches by adjusting the bit positions + from VEC1. We can't do this for CHAR_PER_VEC == 64. */ +# if CHAR_PER_VEC <= 16 sall $CHAR_PER_VEC, %ecx -# else +# else salq $CHAR_PER_VEC, %rcx +# endif +# else + /* If CHAR_PER_VEC == 64 we can't shift the return GPR so just + check it. */ + bsf %VRCX, %VRCX + addl $(CHAR_PER_VEC), %ecx + cmpq %rcx, %rdx + ja L(ret_vec_0_end_finish) + xorl %eax, %eax + ret # endif # endif L(return_vec_0_end): # if (CHAR_PER_VEC <= 16) || !(defined USE_AS_STRNCMP) - tzcntl %ecx, %ecx + bsf %VRCX, %VRCX # else - tzcntq %rcx, %rcx + bsfq %rcx, %rcx # endif # ifdef USE_AS_STRNCMP @@ -710,6 +788,7 @@ L(return_vec_0_end): jbe L(ret_zero_end) # endif +L(ret_vec_0_end_finish): # ifdef USE_AS_WCSCMP movl (%rdi, %rcx, SIZE_OF_CHAR), %edx xorl %eax, %eax @@ -737,7 +816,7 @@ L(ret6): # ifndef USE_AS_STRNCMP .p2align 4,, 10 L(return_vec_1_end): - tzcntl %ecx, %ecx + bsf %VRCX, %VRCX # ifdef USE_AS_WCSCMP movl VEC_SIZE(%rdi, %rcx, SIZE_OF_CHAR), %edx xorl %eax, %eax @@ -760,6 +839,41 @@ L(ret7): # endif + /* If CHAR_PER_VEC == 64 we can't combine matches from the last + 2x VEC so need seperate return label. */ +# if CHAR_PER_VEC == 64 +L(return_vec_2_end): + bsf %VRCX, %VRCX +# ifdef USE_AS_STRNCMP + cmpq %rcx, %rdx + jbe L(ret_zero_end) +# endif +# ifdef USE_AS_WCSCMP + movl (VEC_SIZE * 2)(%rdi, %rcx, SIZE_OF_CHAR), %edx + xorl %eax, %eax + cmpl (VEC_SIZE * 2)(%rsi, %rcx, SIZE_OF_CHAR), %edx + je L(ret31) + setl %al + negl %eax + /* This is the non-zero case for `eax` so just xorl with `r8d` + flip is `rdi` and `rsi` where swapped. */ + xorl %r8d, %eax +# else + movzbl (VEC_SIZE * 2)(%rdi, %rcx), %eax + movzbl (VEC_SIZE * 2)(%rsi, %rcx), %ecx + TOLOWER_gpr (%rax, %eax) + TOLOWER_gpr (%rcx, %ecx) + subl %ecx, %eax + /* Flip `eax` if `rdi` and `rsi` where swapped in page cross + logic. Subtract `r8d` after xor for zero case. 
*/ + xorl %r8d, %eax + subl %r8d, %eax +# endif +L(ret13): + ret +# endif + + /* Page cross in rsi in next 4x VEC. */ /* TODO: Improve logic here. */ @@ -778,11 +892,11 @@ L(page_cross_during_loop): cmpl $-(VEC_SIZE * 3), %eax jle L(less_1x_vec_till_page_cross) - VMOVA (%rdi), %YMM0 - VPTESTM %YMM0, %YMM0, %k2 - CMP_R1_S2_YMM (%YMM0, (%rsi), %YMM1, %k1){%k2} - kmovd %k1, %ecx - TESTEQ %ecx + VMOVA (%rdi), %VMM(0) + VPTESTM %VMM(0), %VMM(0), %k2 + CMP_R1_S2_VMM (%VMM(0), (%rsi), %VMM(1), %k1){%k2} + KMOV %k1, %VRCX + TESTEQ %VRCX jnz L(return_vec_0_end) /* if distance >= 2x VEC then eax > -(VEC_SIZE * 2). */ @@ -799,9 +913,9 @@ L(less_1x_vec_till_page_cross): to read back -VEC_SIZE. If rdi is truly at the start of a page here, it means the previous page (rdi - VEC_SIZE) has already been loaded earlier so must be valid. */ - VMOVU -VEC_SIZE(%rdi, %rax), %YMM0 - VPTESTM %YMM0, %YMM0, %k2 - CMP_R1_S2_YMM (%YMM0, -VEC_SIZE(%rsi, %rax), %YMM1, %k1){%k2} + VMOVU -VEC_SIZE(%rdi, %rax), %VMM(0) + VPTESTM %VMM(0), %VMM(0), %k2 + CMP_R1_S2_VMM (%VMM(0), -VEC_SIZE(%rsi, %rax), %VMM(1), %k1){%k2} /* Mask of potentially valid bits. The lower bits can be out of range comparisons (but safe regarding page crosses). */ @@ -813,12 +927,12 @@ L(less_1x_vec_till_page_cross): shlxl %ecx, %r10d, %ecx movzbl %cl, %r10d # else - movl $-1, %ecx - shlxl %esi, %ecx, %r10d + mov $-1, %VRCX + shlx %VRSI, %VRCX, %VR10 # endif - kmovd %k1, %ecx - notl %ecx + KMOV %k1, %VRCX + not %VRCX # ifdef USE_AS_STRNCMP @@ -838,12 +952,10 @@ L(less_1x_vec_till_page_cross): /* Readjust eax before potentially returning to the loop. */ addl $(PAGE_SIZE - VEC_SIZE * 4), %eax - andl %r10d, %ecx + and %VR10, %VRCX jz L(loop_skip_page_cross_check) - .p2align 4,, 3 -L(return_page_cross_end): - tzcntl %ecx, %ecx + bsf %VRCX, %VRCX # if (defined USE_AS_STRNCMP) || (defined USE_AS_WCSCMP) leal -VEC_SIZE(%OFFSET_REG64, %rcx, SIZE_OF_CHAR), %ecx @@ -874,8 +986,12 @@ L(ret8): # ifdef USE_AS_STRNCMP .p2align 4,, 10 L(return_page_cross_end_check): - andl %r10d, %ecx - tzcntl %ecx, %ecx + and %VR10, %VRCX + /* Need to use tzcnt here as VRCX may be zero. If VRCX is zero + tzcnt(VRCX) will be CHAR_PER and remaining length (edx) is + guranteed to be <= CHAR_PER_VEC so we will only use the return + idx if VRCX was non-zero. */ + tzcnt %VRCX, %VRCX leal -VEC_SIZE(%rax, %rcx, SIZE_OF_CHAR), %ecx # ifdef USE_AS_WCSCMP sall $2, %edx @@ -892,11 +1008,11 @@ L(more_2x_vec_till_page_cross): /* If more 2x vec till cross we will complete a full loop iteration here. */ - VMOVA VEC_SIZE(%rdi), %YMM0 - VPTESTM %YMM0, %YMM0, %k2 - CMP_R1_S2_YMM (%YMM0, VEC_SIZE(%rsi), %YMM1, %k1){%k2} - kmovd %k1, %ecx - TESTEQ %ecx + VMOVA VEC_SIZE(%rdi), %VMM(0) + VPTESTM %VMM(0), %VMM(0), %k2 + CMP_R1_S2_VMM (%VMM(0), VEC_SIZE(%rsi), %VMM(1), %k1){%k2} + KMOV %k1, %VRCX + TESTEQ %VRCX jnz L(return_vec_1_end) # ifdef USE_AS_STRNCMP @@ -907,18 +1023,18 @@ L(more_2x_vec_till_page_cross): subl $-(VEC_SIZE * 4), %eax /* Safe to include comparisons from lower bytes. 
*/ - VMOVU -(VEC_SIZE * 2)(%rdi, %rax), %YMM0 - VPTESTM %YMM0, %YMM0, %k2 - CMP_R1_S2_YMM (%YMM0, -(VEC_SIZE * 2)(%rsi, %rax), %YMM1, %k1){%k2} - kmovd %k1, %ecx - TESTEQ %ecx + VMOVU -(VEC_SIZE * 2)(%rdi, %rax), %VMM(0) + VPTESTM %VMM(0), %VMM(0), %k2 + CMP_R1_S2_VMM (%VMM(0), -(VEC_SIZE * 2)(%rsi, %rax), %VMM(1), %k1){%k2} + KMOV %k1, %VRCX + TESTEQ %VRCX jnz L(return_vec_page_cross_0) - VMOVU -(VEC_SIZE * 1)(%rdi, %rax), %YMM0 - VPTESTM %YMM0, %YMM0, %k2 - CMP_R1_S2_YMM (%YMM0, -(VEC_SIZE * 1)(%rsi, %rax), %YMM1, %k1){%k2} - kmovd %k1, %ecx - TESTEQ %ecx + VMOVU -(VEC_SIZE * 1)(%rdi, %rax), %VMM(0) + VPTESTM %VMM(0), %VMM(0), %k2 + CMP_R1_S2_VMM (%VMM(0), -(VEC_SIZE * 1)(%rsi, %rax), %VMM(1), %k1){%k2} + KMOV %k1, %VRCX + TESTEQ %VRCX jnz L(return_vec_page_cross_1) # ifdef USE_AS_STRNCMP @@ -937,30 +1053,30 @@ L(more_2x_vec_till_page_cross): # endif /* Finish the loop. */ - VMOVA (VEC_SIZE * 2)(%rdi), %YMM4 - VMOVA (VEC_SIZE * 3)(%rdi), %YMM6 - VPMINU %YMM4, %YMM6, %YMM9 - VPTESTM %YMM9, %YMM9, %k1 + VMOVA (VEC_SIZE * 2)(%rdi), %VMM(4) + VMOVA (VEC_SIZE * 3)(%rdi), %VMM(6) + VPMINU %VMM(4), %VMM(6), %VMM(9) + VPTESTM %VMM(9), %VMM(9), %k1 # ifndef USE_AS_STRCASECMP_L - vpxorq (VEC_SIZE * 2)(%rsi), %YMM4, %YMM5 + vpxorq (VEC_SIZE * 2)(%rsi), %VMM(4), %VMM(5) /* YMM6 = YMM5 | ((VEC_SIZE * 3)(%rsi) ^ YMM6). */ - vpternlogd $0xde, (VEC_SIZE * 3)(%rsi), %YMM5, %YMM6 + vpternlogd $0xde, (VEC_SIZE * 3)(%rsi), %VMM(5), %VMM(6) # else - VMOVU (VEC_SIZE * 2)(%rsi), %YMM5 - TOLOWER_YMM (%YMM4, %YMM5) - VMOVU (VEC_SIZE * 3)(%rsi), %YMM7 - TOLOWER_YMM (%YMM6, %YMM7) - vpxorq %YMM4, %YMM5, %YMM5 - vpternlogd $0xde, %YMM7, %YMM5, %YMM6 -# endif - VPTESTNM %YMM6, %YMM6, %k0{%k1} - kmovd %k0, %LOOP_REG + VMOVU (VEC_SIZE * 2)(%rsi), %VMM(5) + TOLOWER_VMM (%VMM(4), %VMM(5)) + VMOVU (VEC_SIZE * 3)(%rsi), %VMM(7) + TOLOWER_VMM (%VMM(6), %VMM(7)) + vpxorq %VMM(4), %VMM(5), %VMM(5) + vpternlogd $0xde, %VMM(7), %VMM(5), %VMM(6) +# endif + VPTESTNM %VMM(6), %VMM(6), %k0{%k1} + KMOV %k0, %LOOP_REG TESTEQ %LOOP_REG jnz L(return_vec_2_3_end) /* Best for code size to include ucond-jmp here. Would be faster - if this case is hot to duplicate the L(return_vec_2_3_end) code - as fall-through and have jump back to loop on mismatch + if this case is hot to duplicate the L(return_vec_2_3_end) + code as fall-through and have jump back to loop on mismatch comparison. */ subq $-(VEC_SIZE * 4), %rdi subq $-(VEC_SIZE * 4), %rsi @@ -980,7 +1096,7 @@ L(ret_zero_in_loop_page_cross): L(return_vec_page_cross_0): addl $-VEC_SIZE, %eax L(return_vec_page_cross_1): - tzcntl %ecx, %ecx + bsf %VRCX, %VRCX # if defined USE_AS_STRNCMP || defined USE_AS_WCSCMP leal -VEC_SIZE(%rax, %rcx, SIZE_OF_CHAR), %ecx # ifdef USE_AS_STRNCMP @@ -1023,8 +1139,8 @@ L(ret9): L(page_cross): # ifndef USE_AS_STRNCMP /* If both are VEC aligned we don't need any special logic here. - Only valid for strcmp where stop condition is guranteed to be - reachable by just reading memory. */ + Only valid for strcmp where stop condition is guranteed to + be reachable by just reading memory. */ testl $((VEC_SIZE - 1) << 20), %eax jz L(no_page_cross) # endif @@ -1065,11 +1181,11 @@ L(page_cross): loadable memory until within 1x VEC of page cross. 
*/ .p2align 4,, 8 L(page_cross_loop): - VMOVU (%rdi, %OFFSET_REG64, SIZE_OF_CHAR), %YMM0 - VPTESTM %YMM0, %YMM0, %k2 - CMP_R1_S2_YMM (%YMM0, (%rsi, %OFFSET_REG64, SIZE_OF_CHAR), %YMM1, %k1){%k2} - kmovd %k1, %ecx - TESTEQ %ecx + VMOVU (%rdi, %OFFSET_REG64, SIZE_OF_CHAR), %VMM(0) + VPTESTM %VMM(0), %VMM(0), %k2 + CMP_R1_S2_VMM (%VMM(0), (%rsi, %OFFSET_REG64, SIZE_OF_CHAR), %VMM(1), %k1){%k2} + KMOV %k1, %VRCX + TESTEQ %VRCX jnz L(check_ret_vec_page_cross) addl $CHAR_PER_VEC, %OFFSET_REG # ifdef USE_AS_STRNCMP @@ -1087,13 +1203,13 @@ L(page_cross_loop): subl %eax, %OFFSET_REG /* OFFSET_REG has distance to page cross - VEC_SIZE. Guranteed to not cross page so is safe to load. Since we have already - loaded at least 1 VEC from rsi it is also guranteed to be safe. - */ - VMOVU (%rdi, %OFFSET_REG64, SIZE_OF_CHAR), %YMM0 - VPTESTM %YMM0, %YMM0, %k2 - CMP_R1_S2_YMM (%YMM0, (%rsi, %OFFSET_REG64, SIZE_OF_CHAR), %YMM1, %k1){%k2} + loaded at least 1 VEC from rsi it is also guranteed to be + safe. */ + VMOVU (%rdi, %OFFSET_REG64, SIZE_OF_CHAR), %VMM(0) + VPTESTM %VMM(0), %VMM(0), %k2 + CMP_R1_S2_VMM (%VMM(0), (%rsi, %OFFSET_REG64, SIZE_OF_CHAR), %VMM(1), %k1){%k2} - kmovd %k1, %ecx + KMOV %k1, %VRCX # ifdef USE_AS_STRNCMP leal CHAR_PER_VEC(%OFFSET_REG64), %eax cmpq %rax, %rdx @@ -1104,7 +1220,7 @@ L(page_cross_loop): addq %rdi, %rdx # endif # endif - TESTEQ %ecx + TESTEQ %VRCX jz L(prepare_loop_no_len) .p2align 4,, 4 @@ -1112,7 +1228,7 @@ L(ret_vec_page_cross): # ifndef USE_AS_STRNCMP L(check_ret_vec_page_cross): # endif - tzcntl %ecx, %ecx + tzcnt %VRCX, %VRCX addl %OFFSET_REG, %ecx L(ret_vec_page_cross_cont): # ifdef USE_AS_WCSCMP @@ -1139,9 +1255,9 @@ L(ret12): # ifdef USE_AS_STRNCMP .p2align 4,, 10 L(check_ret_vec_page_cross2): - TESTEQ %ecx + TESTEQ %VRCX L(check_ret_vec_page_cross): - tzcntl %ecx, %ecx + tzcnt %VRCX, %VRCX addl %OFFSET_REG, %ecx cmpq %rcx, %rdx ja L(ret_vec_page_cross_cont) @@ -1180,8 +1296,71 @@ L(less_1x_vec_till_page): # ifdef USE_AS_WCSCMP shrl $2, %eax # endif + + /* Find largest load size we can use. VEC_SIZE == 64 only check + if we can do a full ymm load. */ +# if VEC_SIZE == 64 + + cmpl $((VEC_SIZE - 32) / SIZE_OF_CHAR), %eax + ja L(less_32_till_page) + + + /* Use 16 byte comparison. */ + VMOVU (%rdi), %VMM_256(0) + VPTESTM %VMM_256(0), %VMM_256(0), %k2 + CMP_R1_S2_YMM (%VMM_256(0), (%rsi), %VMM_256(1), %k1){%k2} + kmovd %k1, %ecx +# ifdef USE_AS_WCSCMP + subl $0xff, %ecx +# else + incl %ecx +# endif + jnz L(check_ret_vec_page_cross) + movl $((VEC_SIZE - 32) / SIZE_OF_CHAR), %OFFSET_REG +# ifdef USE_AS_STRNCMP + cmpq %OFFSET_REG64, %rdx + jbe L(ret_zero_page_cross_slow_case64) + subl %eax, %OFFSET_REG +# else + /* Explicit check for 32 byte alignment. 
*/ + subl %eax, %OFFSET_REG + jz L(prepare_loop) +# endif + VMOVU (%rdi, %OFFSET_REG64, SIZE_OF_CHAR), %VMM_256(0) + VPTESTM %VMM_256(0), %VMM_256(0), %k2 + CMP_R1_S2_YMM (%VMM_256(0), (%rsi, %OFFSET_REG64, SIZE_OF_CHAR), %VMM_256(1), %k1){%k2} + kmovd %k1, %ecx +# ifdef USE_AS_WCSCMP + subl $0xff, %ecx +# else + incl %ecx +# endif + jnz L(check_ret_vec_page_cross) +# ifdef USE_AS_STRNCMP + addl $(32 / SIZE_OF_CHAR), %OFFSET_REG + subq %OFFSET_REG64, %rdx + jbe L(ret_zero_page_cross_slow_case64) + subq $-(CHAR_PER_VEC * 4), %rdx + + leaq -(VEC_SIZE * 4)(%rdi, %OFFSET_REG64, SIZE_OF_CHAR), %rdi + leaq -(VEC_SIZE * 4)(%rsi, %OFFSET_REG64, SIZE_OF_CHAR), %rsi +# else + leaq (32 - VEC_SIZE * 4)(%rdi, %OFFSET_REG64, SIZE_OF_CHAR), %rdi + leaq (32 - VEC_SIZE * 4)(%rsi, %OFFSET_REG64, SIZE_OF_CHAR), %rsi +# endif + jmp L(prepare_loop_aligned) + +# ifdef USE_AS_STRNCMP + .p2align 4,, 2 +L(ret_zero_page_cross_slow_case64): + xorl %eax, %eax + ret +# endif +L(less_32_till_page): +# endif + /* Find largest load size we can use. */ - cmpl $(16 / SIZE_OF_CHAR), %eax + cmpl $((VEC_SIZE - 16) / SIZE_OF_CHAR), %eax ja L(less_16_till_page) /* Use 16 byte comparison. */ @@ -1195,9 +1374,14 @@ L(less_1x_vec_till_page): incw %cx # endif jnz L(check_ret_vec_page_cross) - movl $(16 / SIZE_OF_CHAR), %OFFSET_REG + + movl $((VEC_SIZE - 16) / SIZE_OF_CHAR), %OFFSET_REG # ifdef USE_AS_STRNCMP +# if VEC_SIZE == 32 cmpq %OFFSET_REG64, %rdx +# else + cmpq $(16 / SIZE_OF_CHAR), %rdx +# endif jbe L(ret_zero_page_cross_slow_case0) subl %eax, %OFFSET_REG # else @@ -1239,7 +1423,7 @@ L(ret_zero_page_cross_slow_case0): .p2align 4,, 10 L(less_16_till_page): - cmpl $(24 / SIZE_OF_CHAR), %eax + cmpl $((VEC_SIZE - 8) / SIZE_OF_CHAR), %eax ja L(less_8_till_page) /* Use 8 byte comparison. */ @@ -1260,7 +1444,7 @@ L(less_16_till_page): cmpq $(8 / SIZE_OF_CHAR), %rdx jbe L(ret_zero_page_cross_slow_case0) # endif - movl $(24 / SIZE_OF_CHAR), %OFFSET_REG + movl $((VEC_SIZE - 8) / SIZE_OF_CHAR), %OFFSET_REG subl %eax, %OFFSET_REG vmovq (%rdi, %OFFSET_REG64, SIZE_OF_CHAR), %xmm0 @@ -1320,7 +1504,7 @@ L(ret_less_8_wcs): ret # else - cmpl $28, %eax + cmpl $(VEC_SIZE - 4), %eax ja L(less_4_till_page) vmovd (%rdi), %xmm0 @@ -1335,7 +1519,7 @@ L(ret_less_8_wcs): cmpq $4, %rdx jbe L(ret_zero_page_cross_slow_case1) # endif - movl $(28 / SIZE_OF_CHAR), %OFFSET_REG + movl $((VEC_SIZE - 4) / SIZE_OF_CHAR), %OFFSET_REG subl %eax, %OFFSET_REG vmovd (%rdi, %OFFSET_REG64, SIZE_OF_CHAR), %xmm0 @@ -1386,7 +1570,7 @@ L(less_4_loop): # endif incq %rdi /* end condition is reach page boundary (rdi is aligned). 
*/ - testl $31, %edi + testb $(VEC_SIZE - 1), %dil jnz L(less_4_loop) leaq -(VEC_SIZE * 4)(%rdi, %rsi), %rsi addq $-(VEC_SIZE * 4), %rdi
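The page-cross fallbacks above (L(less_32_till_page), L(less_16_till_page), L(less_8_till_page), L(less_4_till_page)) all apply one rule: when a full-width vector load could read past the current page, use the largest smaller load that still fits in front of the boundary, and only drop to the scalar loop when nothing fits. The following is a rough C sketch of that selection rule only; it is not part of the patch, the helper name is made up, the 4 KiB page size is an assumption, and the real assembly computes the distance differently.

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper (not in glibc): largest power-of-two load, at
   most VEC_SIZE bytes, that can be issued at ADDR without reading
   past the containing page.  Assumes a 4 KiB page.  */
static size_t
largest_safe_load (uintptr_t addr, size_t vec_size)
{
  const size_t page_size = 4096;
  size_t to_page = page_size - (addr & (page_size - 1));
  size_t size = vec_size;       /* 64 bytes for a zmm build, 32 for ymm.  */

  /* Step down 64 -> 32 -> 16 -> 8 -> 4 until the load fits in front
     of the page boundary.  */
  while (size > to_page && size > 4)
    size >>= 1;

  return size <= to_page ? size : 0;    /* 0: fall back to the byte loop.  */
}

int
main (void)
{
  /* 20 bytes before a page boundary: 64- and 32-byte loads would
     cross, so a 16-byte load is chosen.  */
  printf ("%zu\n", largest_safe_load (4096 - 20, 64));  /* prints 16 */
  /* 3 bytes before the boundary: no vector load fits.  */
  printf ("%zu\n", largest_safe_load (4096 - 3, 64));   /* prints 0 */
  return 0;
}

The point of stepping down rather than always using the smallest load is the same as in the assembly: never fault on the unmapped page that may follow, while still comparing as many characters per load as the remaining distance allows.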
From patchwork Tue Oct 18 02:49:01 2022
To: libc-alpha@sourceware.org
Subject: [PATCH v1 7/7] Bench: Improve benchtests for memchr, strchr, strnlen, strrchr
Date: Mon, 17 Oct 2022 19:49:01 -0700
Message-Id: <20221018024901.3381469-7-goldstein.w.n@gmail.com>
In-Reply-To: <20221018024901.3381469-1-goldstein.w.n@gmail.com>
References: <20221018024901.3381469-1-goldstein.w.n@gmail.com>
From: Noah Goldstein

1. Add more complete coverage in the medium size range.
2. In strnlen remove the `1 << i` which was UB (`i` could go beyond 32/64)
3. Add timer for total benchmark runtime (useful for deciding about tradeoff between coverage and runtime).
--- benchtests/bench-memchr.c | 83 +++++++++++++++++++++++++----------- benchtests/bench-rawmemchr.c | 36 ++++++++++++++-- benchtests/bench-strchr.c | 42 +++++++++++++----- benchtests/bench-strnlen.c | 19 ++++++--- benchtests/bench-strrchr.c | 33 +++++++++++++- 5 files changed, 166 insertions(+), 47 deletions(-) diff --git a/benchtests/bench-memchr.c b/benchtests/bench-memchr.c index 0facda2fa0..c4d758ae61 100644 --- a/benchtests/bench-memchr.c +++ b/benchtests/bench-memchr.c @@ -126,9 +126,10 @@ do_test (json_ctx_t *json_ctx, size_t align, size_t pos, size_t len, int test_main (void) { - size_t i; + size_t i, j, al, al_max; int repeats; json_ctx_t json_ctx; + timing_t bench_start, bench_stop, bench_total_time; test_init (); json_init (&json_ctx, 0, stdout); @@ -147,35 +148,47 @@ test_main (void) json_array_begin (&json_ctx, "results"); + TIMING_NOW (bench_start); + al_max = 0; +#ifdef USE_AS_MEMRCHR + al_max = getpagesize () / 2; +#endif + for (repeats = 0; repeats < 2; ++repeats) { - for (i = 1; i < 8; ++i) + for (al = 0; al <= al_max; al += getpagesize () / 2) { - do_test (&json_ctx, 0, 16 << i, 2048, 23, repeats); - do_test (&json_ctx, i, 64, 256, 23, repeats); - do_test (&json_ctx, 0, 16 << i, 2048, 0, repeats); - do_test (&json_ctx, i, 64, 256, 0, repeats); - - do_test (&json_ctx, getpagesize () - 15, 64, 256, 0, repeats); + for (i = 1; i < 8; ++i) + { + do_test (&json_ctx, al, 16 << i, 2048, 23, repeats); + do_test (&json_ctx, al + i, 64, 256, 23, repeats); + do_test (&json_ctx, al, 16 << i, 2048, 0, repeats); + do_test (&json_ctx, al + i, 64, 256, 0, repeats); + + do_test (&json_ctx, al + getpagesize () - 15, 64, 256, 0, + repeats); #ifdef USE_AS_MEMRCHR - /* Also test the position close to the beginning for memrchr.
*/ - do_test (&json_ctx, 0, i, 256, 23, repeats); - do_test (&json_ctx, 0, i, 256, 0, repeats); - do_test (&json_ctx, i, i, 256, 23, repeats); - do_test (&json_ctx, i, i, 256, 0, repeats); + /* Also test the position close to the beginning for memrchr. */ + do_test (&json_ctx, al, i, 256, 23, repeats); + do_test (&json_ctx, al, i, 256, 0, repeats); + do_test (&json_ctx, al + i, i, 256, 23, repeats); + do_test (&json_ctx, al + i, i, 256, 0, repeats); #endif + } + for (i = 1; i < 8; ++i) + { + do_test (&json_ctx, al + i, i << 5, 192, 23, repeats); + do_test (&json_ctx, al + i, i << 5, 192, 0, repeats); + do_test (&json_ctx, al + i, i << 5, 256, 23, repeats); + do_test (&json_ctx, al + i, i << 5, 256, 0, repeats); + do_test (&json_ctx, al + i, i << 5, 512, 23, repeats); + do_test (&json_ctx, al + i, i << 5, 512, 0, repeats); + + do_test (&json_ctx, al + getpagesize () - 15, i << 5, 256, 23, + repeats); + } } - for (i = 1; i < 8; ++i) - { - do_test (&json_ctx, i, i << 5, 192, 23, repeats); - do_test (&json_ctx, i, i << 5, 192, 0, repeats); - do_test (&json_ctx, i, i << 5, 256, 23, repeats); - do_test (&json_ctx, i, i << 5, 256, 0, repeats); - do_test (&json_ctx, i, i << 5, 512, 23, repeats); - do_test (&json_ctx, i, i << 5, 512, 0, repeats); - - do_test (&json_ctx, getpagesize () - 15, i << 5, 256, 23, repeats); - } + for (i = 1; i < 32; ++i) { do_test (&json_ctx, 0, i, i + 1, 23, repeats); @@ -207,11 +220,33 @@ test_main (void) do_test (&json_ctx, 0, 2, i + 1, 0, repeats); #endif } + for (al = 0; al <= al_max; al += getpagesize () / 2) + { + for (i = (16 / sizeof (CHAR)); i <= (8192 / sizeof (CHAR)); i += i) + { + for (j = 0; j <= (384 / sizeof (CHAR)); + j += (32 / sizeof (CHAR))) + { + do_test (&json_ctx, al, i + j, i, 23, repeats); + do_test (&json_ctx, al, i, i + j, 23, repeats); + if (j < i) + { + do_test (&json_ctx, al, i - j, i, 23, repeats); + do_test (&json_ctx, al, i, i - j, 23, repeats); + } + } + } + } + #ifndef USE_AS_MEMRCHR break; #endif } + TIMING_NOW (bench_stop); + TIMING_DIFF (bench_total_time, bench_start, bench_stop); + json_attr_double (&json_ctx, "benchtime", bench_total_time); + json_array_end (&json_ctx); json_attr_object_end (&json_ctx); json_attr_object_end (&json_ctx); diff --git a/benchtests/bench-rawmemchr.c b/benchtests/bench-rawmemchr.c index b1803afc14..667ecd48f9 100644 --- a/benchtests/bench-rawmemchr.c +++ b/benchtests/bench-rawmemchr.c @@ -70,7 +70,7 @@ do_test (json_ctx_t *json_ctx, size_t align, size_t pos, size_t len, int seek_ch size_t i; char *result; - align &= 7; + align &= getpagesize () - 1; if (align + len >= page_size) return; @@ -106,7 +106,7 @@ test_main (void) { json_ctx_t json_ctx; size_t i; - + timing_t bench_start, bench_stop, bench_total_time; test_init (); json_init (&json_ctx, 0, stdout); @@ -120,11 +120,12 @@ test_main (void) json_array_begin (&json_ctx, "ifuncs"); FOR_EACH_IMPL (impl, 0) - json_element_string (&json_ctx, impl->name); + json_element_string (&json_ctx, impl->name); json_array_end (&json_ctx); json_array_begin (&json_ctx, "results"); + TIMING_NOW (bench_start); for (i = 1; i < 7; ++i) { do_test (&json_ctx, 0, 16 << i, 2048, 23); @@ -137,6 +138,35 @@ test_main (void) do_test (&json_ctx, 0, i, i + 1, 23); do_test (&json_ctx, 0, i, i + 1, 0); } + for (; i < 256; i += 32) + { + do_test (&json_ctx, 0, i, i + 1, 23); + do_test (&json_ctx, 0, i - 1, i, 23); + } + for (; i < 512; i += 64) + { + do_test (&json_ctx, 0, i, i + 1, 23); + do_test (&json_ctx, 0, i - 1, i, 23); + } + for (; i < 1024; i += 128) + { + do_test (&json_ctx, 
0, i, i + 1, 23); + do_test (&json_ctx, 0, i - 1, i, 23); + } + for (; i < 2048; i += 256) + { + do_test (&json_ctx, 0, i, i + 1, 23); + do_test (&json_ctx, 0, i - 1, i, 23); + } + for (; i < 4096; i += 512) + { + do_test (&json_ctx, 0, i, i + 1, 23); + do_test (&json_ctx, 0, i - 1, i, 23); + } + + TIMING_NOW (bench_stop); + TIMING_DIFF (bench_total_time, bench_start, bench_stop); + json_attr_double (&json_ctx, "benchtime", bench_total_time); json_array_end (&json_ctx); json_attr_object_end (&json_ctx); diff --git a/benchtests/bench-strchr.c b/benchtests/bench-strchr.c index 54640bde7e..af325806ce 100644 --- a/benchtests/bench-strchr.c +++ b/benchtests/bench-strchr.c @@ -287,8 +287,8 @@ int test_main (void) { json_ctx_t json_ctx; - size_t i; - + size_t i, j; + timing_t bench_start, bench_stop, bench_total_time; test_init (); json_init (&json_ctx, 0, stdout); @@ -307,6 +307,7 @@ test_main (void) json_array_begin (&json_ctx, "results"); + TIMING_NOW (bench_start); for (i = 1; i < 8; ++i) { do_test (&json_ctx, 0, 16 << i, 2048, SMALL_CHAR, MIDDLE_CHAR); @@ -367,15 +368,34 @@ test_main (void) do_test (&json_ctx, 0, i, i + 1, 0, BIG_CHAR); } - DO_RAND_TEST(&json_ctx, 0, 15, 16, 0.0); - DO_RAND_TEST(&json_ctx, 0, 15, 16, 0.1); - DO_RAND_TEST(&json_ctx, 0, 15, 16, 0.25); - DO_RAND_TEST(&json_ctx, 0, 15, 16, 0.33); - DO_RAND_TEST(&json_ctx, 0, 15, 16, 0.5); - DO_RAND_TEST(&json_ctx, 0, 15, 16, 0.66); - DO_RAND_TEST(&json_ctx, 0, 15, 16, 0.75); - DO_RAND_TEST(&json_ctx, 0, 15, 16, 0.9); - DO_RAND_TEST(&json_ctx, 0, 15, 16, 1.0); + for (i = 16 / sizeof (CHAR); i <= 8192 / sizeof (CHAR); i += i) + { + for (j = 32 / sizeof (CHAR); j <= 320 / sizeof (CHAR); + j += 32 / sizeof (CHAR)) + { + do_test (&json_ctx, 0, i, i + j, 0, MIDDLE_CHAR); + do_test (&json_ctx, 0, i + j, i, 0, MIDDLE_CHAR); + if (i > j) + { + do_test (&json_ctx, 0, i, i - j, 0, MIDDLE_CHAR); + do_test (&json_ctx, 0, i - j, i, 0, MIDDLE_CHAR); + } + } + } + + DO_RAND_TEST (&json_ctx, 0, 15, 16, 0.0); + DO_RAND_TEST (&json_ctx, 0, 15, 16, 0.1); + DO_RAND_TEST (&json_ctx, 0, 15, 16, 0.25); + DO_RAND_TEST (&json_ctx, 0, 15, 16, 0.33); + DO_RAND_TEST (&json_ctx, 0, 15, 16, 0.5); + DO_RAND_TEST (&json_ctx, 0, 15, 16, 0.66); + DO_RAND_TEST (&json_ctx, 0, 15, 16, 0.75); + DO_RAND_TEST (&json_ctx, 0, 15, 16, 0.9); + DO_RAND_TEST (&json_ctx, 0, 15, 16, 1.0); + + TIMING_NOW (bench_stop); + TIMING_DIFF (bench_total_time, bench_start, bench_stop); + json_attr_double (&json_ctx, "benchtime", bench_total_time); json_array_end (&json_ctx); json_attr_object_end (&json_ctx); diff --git a/benchtests/bench-strnlen.c b/benchtests/bench-strnlen.c index 13b46b3f57..c6281b6373 100644 --- a/benchtests/bench-strnlen.c +++ b/benchtests/bench-strnlen.c @@ -117,7 +117,7 @@ test_main (void) { size_t i, j; json_ctx_t json_ctx; - + timing_t bench_start, bench_stop, bench_total_time; test_init (); json_init (&json_ctx, 0, stdout); @@ -136,6 +136,7 @@ test_main (void) json_array_begin (&json_ctx, "results"); + TIMING_NOW (bench_start); for (i = 0; i <= 1; ++i) { do_test (&json_ctx, i, 1, 128, MIDDLE_CHAR); @@ -195,23 +196,27 @@ test_main (void) { for (j = 0; j <= (704 / sizeof (CHAR)); j += (32 / sizeof (CHAR))) { - do_test (&json_ctx, 0, 1 << i, (i + j), BIG_CHAR); do_test (&json_ctx, 0, i + j, i, BIG_CHAR); - - do_test (&json_ctx, 64, 1 << i, (i + j), BIG_CHAR); do_test (&json_ctx, 64, i + j, i, BIG_CHAR); + do_test (&json_ctx, 0, i, i + j, BIG_CHAR); + do_test (&json_ctx, 64, i, i + j, BIG_CHAR); + if (j < i) { - do_test (&json_ctx, 0, 1 << i, i - j, BIG_CHAR); 
do_test (&json_ctx, 0, i - j, i, BIG_CHAR); - - do_test (&json_ctx, 64, 1 << i, i - j, BIG_CHAR); do_test (&json_ctx, 64, i - j, i, BIG_CHAR); + + do_test (&json_ctx, 0, i, i - j, BIG_CHAR); + do_test (&json_ctx, 64, i, i - j, BIG_CHAR); } } } + TIMING_NOW (bench_stop); + TIMING_DIFF (bench_total_time, bench_start, bench_stop); + json_attr_double (&json_ctx, "benchtime", bench_total_time); + json_array_end (&json_ctx); json_attr_object_end (&json_ctx); json_attr_object_end (&json_ctx); diff --git a/benchtests/bench-strrchr.c b/benchtests/bench-strrchr.c index 7cd2a15484..e6d8163047 100644 --- a/benchtests/bench-strrchr.c +++ b/benchtests/bench-strrchr.c @@ -151,8 +151,9 @@ int test_main (void) { json_ctx_t json_ctx; - size_t i, j; + size_t i, j, k; int seek; + timing_t bench_start, bench_stop, bench_total_time; test_init (); json_init (&json_ctx, 0, stdout); @@ -171,9 +172,10 @@ test_main (void) json_array_begin (&json_ctx, "results"); + TIMING_NOW (bench_start); for (seek = 0; seek <= 23; seek += 23) { - for (j = 1; j < 32; j += j) + for (j = 1; j <= 256; j = (j * 4)) { for (i = 1; i < 9; ++i) { @@ -197,12 +199,39 @@ test_main (void) do_test (&json_ctx, getpagesize () - i / 2 - 1, i, i + 1, seek, SMALL_CHAR, j); } + + for (i = (16 / sizeof (CHAR)); i <= (288 / sizeof (CHAR)); i += 32) + { + do_test (&json_ctx, 0, i - 16, i, seek, SMALL_CHAR, j); + do_test (&json_ctx, 0, i, i + 16, seek, SMALL_CHAR, j); + } + + for (i = (16 / sizeof (CHAR)); i <= (2048 / sizeof (CHAR)); i += i) + { + for (k = 0; k <= (288 / sizeof (CHAR)); + k += (48 / sizeof (CHAR))) + { + do_test (&json_ctx, 0, k, i, seek, SMALL_CHAR, j); + do_test (&json_ctx, 0, i, i + k, seek, SMALL_CHAR, j); + + if (k < i) + { + do_test (&json_ctx, 0, i - k, i, seek, SMALL_CHAR, j); + do_test (&json_ctx, 0, k, i - k, seek, SMALL_CHAR, j); + do_test (&json_ctx, 0, i, i - k, seek, SMALL_CHAR, j); + } + } + } + if (seek == 0) { break; } } } + TIMING_NOW (bench_stop); + TIMING_DIFF (bench_total_time, bench_start, bench_stop); + json_attr_double (&json_ctx, "benchtime", bench_total_time); json_array_end (&json_ctx); json_attr_object_end (&json_ctx);
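The `1 << i` removal in bench-strnlen.c (item 2 of the commit message) addresses a shift whose count can reach or exceed the width of int, which is undefined behavior in C; per the commit message, `i` could grow past 32/64. Below is a stand-alone C sketch of the hazard and of the shift-free doubling pattern the new medium-size loops use (e.g. `i += i` in bench-memchr.c and bench-strchr.c above). The loop bounds are illustrative only and are not taken from the benchmark sources.

#include <limits.h>
#include <stddef.h>
#include <stdio.h>

int
main (void)
{
  /* Old-style hazard: when the index walks over buffer lengths rather
     than exponents, it soon reaches the bit width of int, at which
     point `1 << i` is undefined behavior.  Guarded here so nothing
     undefined actually executes.  */
  for (size_t i = 16; i <= 704; i += 32)
    {
      if (i < sizeof (int) * CHAR_BIT - 1)
	printf ("1 << %zu = %d\n", i, 1 << (int) i);
      else
	printf ("1 << %zu would be undefined behavior\n", i);
    }

  /* Shift-free pattern: iterate the length itself and let it double,
     as the updated benchtest loops do.  */
  for (size_t len = 16; len <= 8192; len += len)
    printf ("len = %zu\n", len);

  return 0;
}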