From patchwork Tue Oct 18 23:19:32 2022
X-Patchwork-Submitter: Noah Goldstein
X-Patchwork-Id: 1691738
To: libc-alpha@sourceware.org
Subject: [PATCH v2 1/7] x86: Optimize memchr-evex.S and implement with VMM headers
Date: Tue, 18 Oct 2022 16:19:32 -0700
Message-Id: <20221018231938.3621554-1-goldstein.w.n@gmail.com>
In-Reply-To: <20221018024901.3381469-1-goldstein.w.n@gmail.com>
References: <20221018024901.3381469-1-goldstein.w.n@gmail.com>
From: Noah Goldstein

Optimizations are:

1. Use the fact that tzcnt(0) -> VEC_SIZE for memchr to save a branch
   in the short string case.
2. Restructure code so that small strings are given the hot path.
   - This is a net-zero on the benchmark suite, but in general it makes
     sense as smaller sizes are far more common.
3. Use more code-size efficient instructions.
   - tzcnt ...     -> bsf ...
   - vpcmpb $0 ... -> vpcmpeq ...
4. Align labels less aggressively, especially if it doesn't save fetch
   blocks or causes the basic-block to span extra cache-lines.
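The branch-saving trick in point 1 can be sketched in C. This is a hypothetical illustration, not code from the patch (the real routine works on a VEC_SIZE-wide compare mask in a k-register; here a 32-bit mask stands in for it, and the helper names `tzcnt32` / `first_match` are invented). Because `tzcnt` of an all-zero mask is defined to return the operand width, a single `len <= tzcnt(mask)` comparison folds "no match in the first vector" and "first match is past the length" into one branch:

```c
#include <assert.h>
#include <stddef.h>

/* tzcnt semantics: trailing-zero count, defined as the operand width
   (32 here) for a zero input -- unlike __builtin_ctz, which is
   undefined on 0.  */
unsigned tzcnt32 (unsigned mask)
{
  return mask ? (unsigned) __builtin_ctz (mask) : 32;
}

/* Hypothetical short-string tail of a memchr-style routine: MASK has
   one bit set per matching byte of the first 32-byte vector.  Since
   tzcnt32 (0) == 32, the single LEN <= POS test covers both "no match
   at all" and "match out of bounds".  */
const char *first_match (const char *buf, unsigned mask, size_t len)
{
  unsigned pos = tzcnt32 (mask);
  if (len <= pos)		/* One branch instead of two.  */
    return NULL;
  return buf + pos;
}
```

For example, `first_match (buf, 0, 5)` (no match, short length) and `first_match (buf, 1u << 7, 6)` (match beyond length) both return NULL through the same comparison.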
The optimizations (especially for point 2) make the memchr and
rawmemchr code essentially incompatible, so split rawmemchr-evex
into a new file.

Code Size Changes:
memchr-evex.S       : -107 bytes
rawmemchr-evex.S    :  -53 bytes

Net perf changes:

Reported as geometric mean of all improvements / regressions from N=10
runs of the benchtests. Value is New Time / Old Time, so < 1.0 is an
improvement and > 1.0 is a regression.

memchr-evex.S       : 0.928
rawmemchr-evex.S    : 0.986 (fewer targets cross cache lines)

Full results attached in email.

Full check passes on x86-64.
---
 sysdeps/x86_64/multiarch/memchr-evex.S        | 939 ++++++++++--------
 sysdeps/x86_64/multiarch/rawmemchr-evex-rtm.S |   9 +-
 sysdeps/x86_64/multiarch/rawmemchr-evex.S     | 313 +++++-
 3 files changed, 851 insertions(+), 410 deletions(-)

diff --git a/sysdeps/x86_64/multiarch/memchr-evex.S b/sysdeps/x86_64/multiarch/memchr-evex.S
index 0dd4f1dcce..23a1c0018e 100644
--- a/sysdeps/x86_64/multiarch/memchr-evex.S
+++ b/sysdeps/x86_64/multiarch/memchr-evex.S
@@ -21,17 +21,27 @@

 #if ISA_SHOULD_BUILD (4)

+# ifndef VEC_SIZE
+# include "x86-evex256-vecs.h"
+# endif
+
 # ifndef MEMCHR
 # define MEMCHR __memchr_evex
 # endif

 # ifdef USE_AS_WMEMCHR
+# define PC_SHIFT_GPR rcx
+# define VPTESTN vptestnmd
 # define VPBROADCAST vpbroadcastd
 # define VPMINU vpminud
 # define VPCMP vpcmpd
 # define VPCMPEQ vpcmpeqd
 # define CHAR_SIZE 4
+
+# define USE_WIDE_CHAR
 # else
+# define PC_SHIFT_GPR rdi
+# define VPTESTN vptestnmb
 # define VPBROADCAST vpbroadcastb
 # define VPMINU vpminub
 # define VPCMP vpcmpb
@@ -39,534 +49,661 @@
 # define CHAR_SIZE 1
 # endif

- /* In the 4x loop the RTM and non-RTM versions have data pointer
- off by VEC_SIZE * 4 with RTM version being VEC_SIZE * 4 greater.
- This is represented by BASE_OFFSET. As well because the RTM
- version uses vpcmp which stores a bit per element compared where
- the non-RTM version uses vpcmpeq which stores a bit per byte
- compared RET_SCALE of CHAR_SIZE is only relevant for the RTM
- version.
*/ -# ifdef USE_IN_RTM +# include "reg-macros.h" + + +/* If not in an RTM and VEC_SIZE != 64 (the VEC_SIZE = 64 + doesn't have VEX encoding), use VEX encoding in loop so we + can use vpcmpeqb + vptern which is more efficient than the + EVEX alternative. */ +# if defined USE_IN_RTM || VEC_SIZE == 64 +# undef COND_VZEROUPPER +# undef VZEROUPPER_RETURN +# undef VZEROUPPER + +# define COND_VZEROUPPER +# define VZEROUPPER_RETURN ret # define VZEROUPPER -# define BASE_OFFSET (VEC_SIZE * 4) -# define RET_SCALE CHAR_SIZE + +# define USE_TERN_IN_LOOP 0 # else +# define USE_TERN_IN_LOOP 1 +# undef VZEROUPPER # define VZEROUPPER vzeroupper -# define BASE_OFFSET 0 -# define RET_SCALE 1 # endif - /* In the return from 4x loop memchr and rawmemchr versions have - data pointers off by VEC_SIZE * 4 with memchr version being - VEC_SIZE * 4 greater. */ -# ifdef USE_AS_RAWMEMCHR -# define RET_OFFSET (BASE_OFFSET - (VEC_SIZE * 4)) -# define RAW_PTR_REG rcx -# define ALGN_PTR_REG rdi +# if USE_TERN_IN_LOOP + /* Resulting bitmask for vpmovmskb has 4-bits set for each wchar + so we don't want to multiply resulting index. 
*/ +# define TERN_CHAR_MULT 1 + +# ifdef USE_AS_WMEMCHR +# define TEST_END() inc %VRCX +# else +# define TEST_END() add %rdx, %rcx +# endif # else -# define RET_OFFSET BASE_OFFSET -# define RAW_PTR_REG rdi -# define ALGN_PTR_REG rcx +# define TERN_CHAR_MULT CHAR_SIZE +# define TEST_END() KORTEST %k2, %k3 # endif -# define XMMZERO xmm23 -# define YMMZERO ymm23 -# define XMMMATCH xmm16 -# define YMMMATCH ymm16 -# define YMM1 ymm17 -# define YMM2 ymm18 -# define YMM3 ymm19 -# define YMM4 ymm20 -# define YMM5 ymm21 -# define YMM6 ymm22 +# if defined USE_AS_WMEMCHR || !USE_TERN_IN_LOOP +# ifndef USE_AS_WMEMCHR +# define GPR_X0_IS_RET 1 +# else +# define GPR_X0_IS_RET 0 +# endif +# define GPR_X0 rax +# else +# define GPR_X0_IS_RET 0 +# define GPR_X0 rdx +# endif + +# define CHAR_PER_VEC (VEC_SIZE / CHAR_SIZE) -# ifndef SECTION -# define SECTION(p) p##.evex +# if CHAR_PER_VEC == 64 +# define LAST_VEC_OFFSET (VEC_SIZE * 3) +# else +# define LAST_VEC_OFFSET (VEC_SIZE * 2) +# endif +# if CHAR_PER_VEC >= 32 +# define MASK_GPR(...) VGPR(__VA_ARGS__) +# elif CHAR_PER_VEC == 16 +# define MASK_GPR(reg) VGPR_SZ(reg, 16) +# else +# define MASK_GPR(reg) VGPR_SZ(reg, 8) # endif -# define VEC_SIZE 32 -# define CHAR_PER_VEC (VEC_SIZE / CHAR_SIZE) -# define PAGE_SIZE 4096 +# define VMATCH VMM(0) +# define VMATCH_LO VMM_lo(0) - .section SECTION(.text),"ax",@progbits +# define PAGE_SIZE 4096 + + + .section SECTION(.text), "ax", @progbits ENTRY_P2ALIGN (MEMCHR, 6) -# ifndef USE_AS_RAWMEMCHR /* Check for zero length. */ test %RDX_LP, %RDX_LP - jz L(zero) + jz L(zero_0) -# ifdef __ILP32__ +# ifdef __ILP32__ /* Clear the upper 32 bits. */ movl %edx, %edx -# endif # endif - /* Broadcast CHAR to YMMMATCH. */ - VPBROADCAST %esi, %YMMMATCH + VPBROADCAST %esi, %VMATCH /* Check if we may cross page boundary with one vector load. 
*/ movl %edi, %eax andl $(PAGE_SIZE - 1), %eax cmpl $(PAGE_SIZE - VEC_SIZE), %eax - ja L(cross_page_boundary) + ja L(page_cross) + + VPCMPEQ (%rdi), %VMATCH, %k0 + KMOV %k0, %VRAX +# ifndef USE_AS_WMEMCHR + /* If rcx is zero then tzcnt -> CHAR_PER_VEC. NB: there is a + already a dependency between rcx and rsi so no worries about + false-dep here. */ + tzcnt %VRAX, %VRSI + /* If rdx <= rsi then either 1) rcx was non-zero (there was a + match) but it was out of bounds or 2) rcx was zero and rdx + was <= VEC_SIZE so we are done scanning. */ + cmpq %rsi, %rdx + /* NB: Use branch to return zero/non-zero. Common usage will + branch on result of function (if return is null/non-null). + This branch can be used to predict the ensuing one so there + is no reason to extend the data-dependency with cmovcc. */ + jbe L(zero_0) + + /* If rcx is zero then len must be > RDX, otherwise since we + already tested len vs lzcnt(rcx) (in rsi) we are good to + return this match. */ + test %VRAX, %VRAX + jz L(more_1x_vec) + leaq (%rdi, %rsi), %rax +# else - /* Check the first VEC_SIZE bytes. */ - VPCMP $0, (%rdi), %YMMMATCH, %k0 - kmovd %k0, %eax -# ifndef USE_AS_RAWMEMCHR - /* If length < CHAR_PER_VEC handle special. */ + /* We can't use the `tzcnt` trick for wmemchr because CHAR_SIZE + > 1 so if rcx is tzcnt != CHAR_PER_VEC. */ cmpq $CHAR_PER_VEC, %rdx - jbe L(first_vec_x0) -# endif - testl %eax, %eax - jz L(aligned_more) - tzcntl %eax, %eax -# ifdef USE_AS_WMEMCHR - /* NB: Multiply bytes by CHAR_SIZE to get the wchar_t count. */ + ja L(more_1x_vec) + tzcnt %VRAX, %VRAX + cmpl %eax, %edx + jbe L(zero_0) +L(first_vec_x0_ret): leaq (%rdi, %rax, CHAR_SIZE), %rax -# else - addq %rdi, %rax # endif ret -# ifndef USE_AS_RAWMEMCHR -L(zero): - xorl %eax, %eax - ret - - .p2align 4 -L(first_vec_x0): - /* Check if first match was before length. NB: tzcnt has false data- - dependency on destination. eax already had a data-dependency on esi - so this should have no affect here. 
*/ - tzcntl %eax, %esi -# ifdef USE_AS_WMEMCHR - leaq (%rdi, %rsi, CHAR_SIZE), %rdi -# else - addq %rsi, %rdi -# endif + /* Only fits in first cache line for VEC_SIZE == 32. */ +# if VEC_SIZE == 32 + .p2align 4,, 2 +L(zero_0): xorl %eax, %eax - cmpl %esi, %edx - cmovg %rdi, %rax ret # endif - .p2align 4 -L(cross_page_boundary): - /* Save pointer before aligning as its original value is - necessary for computer return address if byte is found or - adjusting length if it is not and this is memchr. */ - movq %rdi, %rcx - /* Align data to VEC_SIZE. ALGN_PTR_REG is rcx for memchr and rdi - for rawmemchr. */ - andq $-VEC_SIZE, %ALGN_PTR_REG - VPCMP $0, (%ALGN_PTR_REG), %YMMMATCH, %k0 - kmovd %k0, %r8d + .p2align 4,, 9 +L(more_1x_vec): # ifdef USE_AS_WMEMCHR - /* NB: Divide shift count by 4 since each bit in K0 represent 4 - bytes. */ - sarl $2, %eax -# endif -# ifndef USE_AS_RAWMEMCHR - movl $(PAGE_SIZE / CHAR_SIZE), %esi - subl %eax, %esi + /* If wmemchr still need to test if there was a match in first + VEC. Use bsf to test here so we can reuse + L(first_vec_x0_ret). */ + bsf %VRAX, %VRAX + jnz L(first_vec_x0_ret) # endif + +L(page_cross_continue): # ifdef USE_AS_WMEMCHR - andl $(CHAR_PER_VEC - 1), %eax -# endif - /* Remove the leading bytes. */ - sarxl %eax, %r8d, %eax -# ifndef USE_AS_RAWMEMCHR - /* Check the end of data. */ - cmpq %rsi, %rdx - jbe L(first_vec_x0) + /* We can't use end of the buffer to re-calculate length for + wmemchr as len * CHAR_SIZE may overflow. */ + leaq -(VEC_SIZE + CHAR_SIZE)(%rdi), %rax + andq $(VEC_SIZE * -1), %rdi + subq %rdi, %rax + sarq $2, %rax + addq %rdx, %rax +# else + leaq -(VEC_SIZE + 1)(%rdx, %rdi), %rax + andq $(VEC_SIZE * -1), %rdi + subq %rdi, %rax # endif - testl %eax, %eax - jz L(cross_page_continue) - tzcntl %eax, %eax + + /* rax contains remaining length - 1. -1 so we can get imm8 + encoding in a few additional places saving code size. */ + + /* Needed regardless of remaining length. 
*/ + VPCMPEQ VEC_SIZE(%rdi), %VMATCH, %k0 + KMOV %k0, %VRDX + + /* We cannot fold the above `sub %rdi, %rax` with the `cmp + $(CHAR_PER_VEC * 2), %rax` because its possible for a very + large length to overflow and cause the subtract to carry + despite length being above CHAR_PER_VEC * 2. */ + cmpq $(CHAR_PER_VEC * 2 - 1), %rax + ja L(more_2x_vec) +L(last_2x_vec): + + test %VRDX, %VRDX + jnz L(first_vec_x1_check) + + /* Check the end of data. NB: use 8-bit operations to save code + size. We no longer need the full-width of eax and will + perform a write-only operation over eax so there will be no + partial-register stalls. */ + subb $(CHAR_PER_VEC * 1 - 1), %al + jle L(zero_0) + + VPCMPEQ (VEC_SIZE * 2)(%rdi), %VMATCH, %k0 + KMOV %k0, %VRCX # ifdef USE_AS_WMEMCHR - /* NB: Multiply bytes by CHAR_SIZE to get the wchar_t count. */ - leaq (%RAW_PTR_REG, %rax, CHAR_SIZE), %rax + /* For wmemchr against we can't take advantage of tzcnt(0) == + VEC_SIZE as CHAR_PER_VEC != VEC_SIZE. */ + test %VRCX, %VRCX + jz L(zero_0) +# endif + tzcnt %VRCX, %VRCX + cmp %cl, %al + + /* Same CFG for VEC_SIZE == 64 and VEC_SIZE == 32. We give + fallthrough to L(zero_0) for VEC_SIZE == 64 here as there is + not enough space before the next cache line to fit the `lea` + for return. */ +# if VEC_SIZE == 64 + ja L(first_vec_x2_ret) +L(zero_0): + xorl %eax, %eax + ret # else - addq %RAW_PTR_REG, %rax + jbe L(zero_0) + leaq (VEC_SIZE * 2)(%rdi, %rcx, CHAR_SIZE), %rax + ret # endif + + .p2align 4,, 5 +L(first_vec_x1_check): + bsf %VRDX, %VRDX + cmpb %dl, %al + jb L(zero_4) + leaq (VEC_SIZE * 1)(%rdi, %rdx, CHAR_SIZE), %rax ret - .p2align 4 -L(first_vec_x1): - tzcntl %eax, %eax - leaq VEC_SIZE(%rdi, %rax, CHAR_SIZE), %rax + /* Fits at the end of the cache line here for VEC_SIZE == 32. 
+ */ +# if VEC_SIZE == 32 +L(zero_4): + xorl %eax, %eax ret +# endif - .p2align 4 + + .p2align 4,, 4 L(first_vec_x2): - tzcntl %eax, %eax - leaq (VEC_SIZE * 2)(%rdi, %rax, CHAR_SIZE), %rax + bsf %VRCX, %VRCX +L(first_vec_x2_ret): + leaq (VEC_SIZE * 2)(%rdi, %rcx, CHAR_SIZE), %rax ret - .p2align 4 -L(first_vec_x3): - tzcntl %eax, %eax - leaq (VEC_SIZE * 3)(%rdi, %rax, CHAR_SIZE), %rax + /* Fits at the end of the cache line here for VEC_SIZE == 64. + */ +# if VEC_SIZE == 64 +L(zero_4): + xorl %eax, %eax ret +# endif - .p2align 4 -L(first_vec_x4): - tzcntl %eax, %eax - leaq (VEC_SIZE * 4)(%rdi, %rax, CHAR_SIZE), %rax + .p2align 4,, 4 +L(first_vec_x1): + bsf %VRDX, %VRDX + leaq (VEC_SIZE * 1)(%rdi, %rdx, CHAR_SIZE), %rax ret - .p2align 5 -L(aligned_more): - /* Check the first 4 * VEC_SIZE. Only one VEC_SIZE at a time - since data is only aligned to VEC_SIZE. */ -# ifndef USE_AS_RAWMEMCHR - /* Align data to VEC_SIZE. */ -L(cross_page_continue): - xorl %ecx, %ecx - subl %edi, %ecx - andq $-VEC_SIZE, %rdi - /* esi is for adjusting length to see if near the end. */ - leal (VEC_SIZE * 5)(%rdi, %rcx), %esi -# ifdef USE_AS_WMEMCHR - /* NB: Divide bytes by 4 to get the wchar_t count. */ - sarl $2, %esi -# endif -# else - andq $-VEC_SIZE, %rdi -L(cross_page_continue): -# endif - /* Load first VEC regardless. */ - VPCMP $0, (VEC_SIZE)(%rdi), %YMMMATCH, %k0 - kmovd %k0, %eax -# ifndef USE_AS_RAWMEMCHR - /* Adjust length. If near end handle specially. */ - subq %rsi, %rdx - jbe L(last_4x_vec_or_less) -# endif - testl %eax, %eax + .p2align 4,, 5 +L(more_2x_vec): + /* Length > VEC_SIZE * 2 so check first 2x VEC before rechecking + length. */ + + + /* Already computed matches for first VEC in rdx. 
*/ + test %VRDX, %VRDX jnz L(first_vec_x1) - VPCMP $0, (VEC_SIZE * 2)(%rdi), %YMMMATCH, %k0 - kmovd %k0, %eax - testl %eax, %eax + + VPCMPEQ (VEC_SIZE * 2)(%rdi), %VMATCH, %k0 + KMOV %k0, %VRCX + test %VRCX, %VRCX jnz L(first_vec_x2) - VPCMP $0, (VEC_SIZE * 3)(%rdi), %YMMMATCH, %k0 - kmovd %k0, %eax - testl %eax, %eax + /* Needed regardless of next length check. */ + VPCMPEQ (VEC_SIZE * 3)(%rdi), %VMATCH, %k0 + KMOV %k0, %VRCX + + /* Check if we are near the end. */ + cmpq $(CHAR_PER_VEC * 4 - 1), %rax + ja L(more_4x_vec) + + test %VRCX, %VRCX + jnz L(first_vec_x3_check) + + /* Use 8-bit instructions to save code size. We won't use full- + width eax again and will perform a write-only operation to + eax so no worries about partial-register stalls. */ + subb $(CHAR_PER_VEC * 3), %al + jb L(zero_2) +L(last_vec_check): + VPCMPEQ (VEC_SIZE * 4)(%rdi), %VMATCH, %k0 + KMOV %k0, %VRCX +# ifdef USE_AS_WMEMCHR + /* For wmemchr against we can't take advantage of tzcnt(0) == + VEC_SIZE as CHAR_PER_VEC != VEC_SIZE. */ + test %VRCX, %VRCX + jz L(zero_2) +# endif + tzcnt %VRCX, %VRCX + cmp %cl, %al + jae L(first_vec_x4_ret) +L(zero_2): + xorl %eax, %eax + ret + + /* Fits at the end of the cache line here for VEC_SIZE == 64. + For VEC_SIZE == 32 we put the return label at the end of + L(first_vec_x4). */ +# if VEC_SIZE == 64 +L(first_vec_x4_ret): + leaq (VEC_SIZE * 4)(%rdi, %rcx, CHAR_SIZE), %rax + ret +# endif + + .p2align 4,, 6 +L(first_vec_x4): + bsf %VRCX, %VRCX +# if VEC_SIZE == 32 + /* Place L(first_vec_x4_ret) here as we can't fit it in the same + cache line as where it is called from so we might as well + save code size by reusing return of L(first_vec_x4). */ +L(first_vec_x4_ret): +# endif + leaq (VEC_SIZE * 4)(%rdi, %rcx, CHAR_SIZE), %rax + ret + + .p2align 4,, 6 +L(first_vec_x3_check): + /* Need to adjust remaining length before checking. 
*/ + addb $-(CHAR_PER_VEC * 2), %al + bsf %VRCX, %VRCX + cmpb %cl, %al + jb L(zero_2) + leaq (VEC_SIZE * 3)(%rdi, %rcx, CHAR_SIZE), %rax + ret + + .p2align 4,, 6 +L(first_vec_x3): + bsf %VRCX, %VRCX + leaq (VEC_SIZE * 3)(%rdi, %rcx, CHAR_SIZE), %rax + ret + + .p2align 4,, 3 +# if !USE_TERN_IN_LOOP + .p2align 4,, 10 +# endif +L(more_4x_vec): + test %VRCX, %VRCX jnz L(first_vec_x3) - VPCMP $0, (VEC_SIZE * 4)(%rdi), %YMMMATCH, %k0 - kmovd %k0, %eax - testl %eax, %eax + VPCMPEQ (VEC_SIZE * 4)(%rdi), %VMATCH, %k0 + KMOV %k0, %VRCX + test %VRCX, %VRCX jnz L(first_vec_x4) + subq $-(VEC_SIZE * 5), %rdi + subq $(CHAR_PER_VEC * 8), %rax + jb L(last_4x_vec) -# ifndef USE_AS_RAWMEMCHR - /* Check if at last CHAR_PER_VEC * 4 length. */ - subq $(CHAR_PER_VEC * 4), %rdx - jbe L(last_4x_vec_or_less_cmpeq) - /* +VEC_SIZE if USE_IN_RTM otherwise +VEC_SIZE * 5. */ - addq $(VEC_SIZE + (VEC_SIZE * 4 - BASE_OFFSET)), %rdi - - /* Align data to VEC_SIZE * 4 for the loop and readjust length. - */ -# ifdef USE_AS_WMEMCHR +# ifdef USE_AS_WMEMCHR movl %edi, %ecx - andq $-(4 * VEC_SIZE), %rdi +# else + addq %rdi, %rax +# endif + + +# if VEC_SIZE == 64 + /* use xorb to do `andq $-(VEC_SIZE * 4), %rdi`. No evex + processor has partial register stalls (all have merging + uop). If that changes this can be removed. */ + xorb %dil, %dil +# else + andq $-(VEC_SIZE * 4), %rdi +# endif + +# ifdef USE_AS_WMEMCHR subl %edi, %ecx - /* NB: Divide bytes by 4 to get the wchar_t count. */ sarl $2, %ecx - addq %rcx, %rdx -# else - addq %rdi, %rdx - andq $-(4 * VEC_SIZE), %rdi - subq %rdi, %rdx -# endif + addq %rcx, %rax # else - addq $(VEC_SIZE + (VEC_SIZE * 4 - BASE_OFFSET)), %rdi - andq $-(4 * VEC_SIZE), %rdi + subq %rdi, %rax # endif -# ifdef USE_IN_RTM - vpxorq %XMMZERO, %XMMZERO, %XMMZERO -# else - /* copy ymmmatch to ymm0 so we can use vpcmpeq which is not - encodable with EVEX registers (ymm16-ymm31). 
*/ - vmovdqa64 %YMMMATCH, %ymm0 + + + +# if USE_TERN_IN_LOOP + /* copy VMATCH to low ymm so we can use vpcmpeq which is not + encodable with EVEX registers. NB: this is VEC_SIZE == 32 + only as there is no way to encode vpcmpeq with zmm0-15. */ + vmovdqa64 %VMATCH, %VMATCH_LO # endif - /* Compare 4 * VEC at a time forward. */ - .p2align 4 + .p2align 4,, 11 L(loop_4x_vec): - /* Two versions of the loop. One that does not require - vzeroupper by not using ymm0-ymm15 and another does that require - vzeroupper because it uses ymm0-ymm15. The reason why ymm0-ymm15 - is used at all is because there is no EVEX encoding vpcmpeq and - with vpcmpeq this loop can be performed more efficiently. The - non-vzeroupper version is safe for RTM while the vzeroupper - version should be prefered if RTM are not supported. */ -# ifdef USE_IN_RTM - /* It would be possible to save some instructions using 4x VPCMP - but bottleneck on port 5 makes it not woth it. */ - VPCMP $4, (VEC_SIZE * 4)(%rdi), %YMMMATCH, %k1 - /* xor will set bytes match esi to zero. */ - vpxorq (VEC_SIZE * 5)(%rdi), %YMMMATCH, %YMM2 - vpxorq (VEC_SIZE * 6)(%rdi), %YMMMATCH, %YMM3 - VPCMP $0, (VEC_SIZE * 7)(%rdi), %YMMMATCH, %k3 - /* Reduce VEC2 / VEC3 with min and VEC1 with zero mask. */ - VPMINU %YMM2, %YMM3, %YMM3{%k1}{z} - VPCMP $0, %YMM3, %YMMZERO, %k2 -# else + /* Two versions of the loop. One that does not require + vzeroupper by not using ymmm0-15 and another does that + require vzeroupper because it uses ymmm0-15. The reason why + ymm0-15 is used at all is because there is no EVEX encoding + vpcmpeq and with vpcmpeq this loop can be performed more + efficiently. The non-vzeroupper version is safe for RTM + while the vzeroupper version should be prefered if RTM are + not supported. Which loop version we use is determined by + USE_TERN_IN_LOOP. */ + +# if USE_TERN_IN_LOOP /* Since vptern can only take 3x vectors fastest to do 1 vec seperately with EVEX vpcmp. 
*/ # ifdef USE_AS_WMEMCHR /* vptern can only accept masks for epi32/epi64 so can only save - instruction using not equals mask on vptern with wmemchr. */ - VPCMP $4, (%rdi), %YMMMATCH, %k1 + instruction using not equals mask on vptern with wmemchr. + */ + VPCMP $4, (VEC_SIZE * 0)(%rdi), %VMATCH, %k1 # else - VPCMP $0, (%rdi), %YMMMATCH, %k1 + VPCMPEQ (VEC_SIZE * 0)(%rdi), %VMATCH, %k1 # endif /* Compare 3x with vpcmpeq and or them all together with vptern. */ - VPCMPEQ VEC_SIZE(%rdi), %ymm0, %ymm2 - VPCMPEQ (VEC_SIZE * 2)(%rdi), %ymm0, %ymm3 - VPCMPEQ (VEC_SIZE * 3)(%rdi), %ymm0, %ymm4 + VPCMPEQ (VEC_SIZE * 1)(%rdi), %VMATCH_LO, %VMM_lo(2) + VPCMPEQ (VEC_SIZE * 2)(%rdi), %VMATCH_LO, %VMM_lo(3) + VPCMPEQ (VEC_SIZE * 3)(%rdi), %VMATCH_LO, %VMM_lo(4) # ifdef USE_AS_WMEMCHR - /* This takes the not of or between ymm2, ymm3, ymm4 as well as - combines result from VEC0 with zero mask. */ - vpternlogd $1, %ymm2, %ymm3, %ymm4{%k1}{z} - vpmovmskb %ymm4, %ecx + /* This takes the not of or between VEC_lo(2), VEC_lo(3), + VEC_lo(4) as well as combines result from VEC(0) with zero + mask. */ + vpternlogd $1, %VMM_lo(2), %VMM_lo(3), %VMM_lo(4){%k1}{z} + vpmovmskb %VMM_lo(4), %VRCX # else - /* 254 is mask for oring ymm2, ymm3, ymm4 into ymm4. */ - vpternlogd $254, %ymm2, %ymm3, %ymm4 - vpmovmskb %ymm4, %ecx - kmovd %k1, %eax + /* 254 is mask for oring VEC_lo(2), VEC_lo(3), VEC_lo(4) into + VEC_lo(4). */ + vpternlogd $254, %VMM_lo(2), %VMM_lo(3), %VMM_lo(4) + vpmovmskb %VMM_lo(4), %VRCX + KMOV %k1, %edx # endif -# endif -# ifdef USE_AS_RAWMEMCHR - subq $-(VEC_SIZE * 4), %rdi -# endif -# ifdef USE_IN_RTM - kortestd %k2, %k3 # else -# ifdef USE_AS_WMEMCHR - /* ecx contains not of matches. All 1s means no matches. incl will - overflow and set zeroflag if that is the case. */ - incl %ecx -# else - /* If either VEC1 (eax) or VEC2-VEC4 (ecx) are not zero. Adding - to ecx is not an issue because if eax is non-zero it will be - used for returning the match. 
If it is zero the add does - nothing. */ - addq %rax, %rcx -# endif + /* Loop version that uses EVEX encoding. */ + VPCMP $4, (VEC_SIZE * 0)(%rdi), %VMATCH, %k1 + vpxorq (VEC_SIZE * 1)(%rdi), %VMATCH, %VMM(2) + vpxorq (VEC_SIZE * 2)(%rdi), %VMATCH, %VMM(3) + VPCMPEQ (VEC_SIZE * 3)(%rdi), %VMATCH, %k3 + VPMINU %VMM(2), %VMM(3), %VMM(3){%k1}{z} + VPTESTN %VMM(3), %VMM(3), %k2 # endif -# ifdef USE_AS_RAWMEMCHR - jz L(loop_4x_vec) -# else - jnz L(loop_4x_vec_end) + + + TEST_END () + jnz L(loop_vec_ret) subq $-(VEC_SIZE * 4), %rdi - subq $(CHAR_PER_VEC * 4), %rdx - ja L(loop_4x_vec) + subq $(CHAR_PER_VEC * 4), %rax + jae L(loop_4x_vec) - /* Fall through into less than 4 remaining vectors of length case. + /* COND_VZEROUPPER is vzeroupper if we use the VEX encoded loop. */ - VPCMP $0, BASE_OFFSET(%rdi), %YMMMATCH, %k0 - addq $(BASE_OFFSET - VEC_SIZE), %rdi - kmovd %k0, %eax - VZEROUPPER - -L(last_4x_vec_or_less): - /* Check if first VEC contained match. */ - testl %eax, %eax - jnz L(first_vec_x1_check) + COND_VZEROUPPER - /* If remaining length > CHAR_PER_VEC * 2. */ - addl $(CHAR_PER_VEC * 2), %edx - jg L(last_4x_vec) - -L(last_2x_vec): - /* If remaining length < CHAR_PER_VEC. */ - addl $CHAR_PER_VEC, %edx - jle L(zero_end) - - /* Check VEC2 and compare any match with remaining length. */ - VPCMP $0, (VEC_SIZE * 2)(%rdi), %YMMMATCH, %k0 - kmovd %k0, %eax - tzcntl %eax, %eax - cmpl %eax, %edx - jbe L(set_zero_end) - leaq (VEC_SIZE * 2)(%rdi, %rax, CHAR_SIZE), %rax -L(zero_end): - ret + .p2align 4,, 10 +L(last_4x_vec): + /* For CHAR_PER_VEC == 64 we don't need to mask as we use 8-bit + instructions on eax from here on out. 
*/ +# if CHAR_PER_VEC != 64 + andl $(CHAR_PER_VEC * 4 - 1), %eax +# endif + VPCMPEQ (VEC_SIZE * 0)(%rdi), %VMATCH, %k0 + subq $(VEC_SIZE * 1), %rdi + KMOV %k0, %VRDX + cmpb $(CHAR_PER_VEC * 2 - 1), %al + jbe L(last_2x_vec) + test %VRDX, %VRDX + jnz L(last_vec_x1_novzero) + + VPCMPEQ (VEC_SIZE * 2)(%rdi), %VMATCH, %k0 + KMOV %k0, %VRDX + test %VRDX, %VRDX + jnz L(last_vec_x2_novzero) + + VPCMPEQ (VEC_SIZE * 3)(%rdi), %VMATCH, %k0 + KMOV %k0, %VRCX + test %VRCX, %VRCX + jnz L(first_vec_x3_check) + + subb $(CHAR_PER_VEC * 3), %al + jae L(last_vec_check) -L(set_zero_end): xorl %eax, %eax ret - .p2align 4 -L(first_vec_x1_check): - /* eax must be non-zero. Use bsfl to save code size. */ - bsfl %eax, %eax - /* Adjust length. */ - subl $-(CHAR_PER_VEC * 4), %edx - /* Check if match within remaining length. */ - cmpl %eax, %edx - jbe L(set_zero_end) - /* NB: Multiply bytes by CHAR_SIZE to get the wchar_t count. */ - leaq VEC_SIZE(%rdi, %rax, CHAR_SIZE), %rax +# if defined USE_AS_WMEMCHR && USE_TERN_IN_LOOP +L(last_vec_x2_novzero): + addq $VEC_SIZE, %rdi +L(last_vec_x1_novzero): + bsf %VRDX, %VRDX + leaq (VEC_SIZE * 1)(%rdi, %rdx, CHAR_SIZE), %rax ret +# endif - .p2align 4 -L(loop_4x_vec_end): +# if CHAR_PER_VEC == 64 + /* Since we can't combine the last 2x VEC when CHAR_PER_VEC == + 64 it needs a seperate return label. */ + .p2align 4,, 4 +L(last_vec_x2): +L(last_vec_x2_novzero): + bsf %VRDX, %VRDX + leaq (VEC_SIZE * 2)(%rdi, %rdx, TERN_CHAR_MULT), %rax + ret # endif - /* rawmemchr will fall through into this if match was found in - loop. */ -# if defined USE_IN_RTM || defined USE_AS_WMEMCHR - /* k1 has not of matches with VEC1. */ - kmovd %k1, %eax -# ifdef USE_AS_WMEMCHR - subl $((1 << CHAR_PER_VEC) - 1), %eax -# else - incl %eax -# endif + .p2align 4,, 4 +L(loop_vec_ret): +# if defined USE_AS_WMEMCHR || !USE_TERN_IN_LOOP + KMOV %k1, %VRAX + inc %MASK_GPR(rax) # else - /* eax already has matches for VEC1. 
*/ - testl %eax, %eax + test %VRDX, %VRDX # endif - jnz L(last_vec_x1_return) + jnz L(last_vec_x0) -# ifdef USE_IN_RTM - VPCMP $0, %YMM2, %YMMZERO, %k0 - kmovd %k0, %eax + +# if USE_TERN_IN_LOOP + vpmovmskb %VMM_lo(2), %VRDX # else - vpmovmskb %ymm2, %eax + VPTESTN %VMM(2), %VMM(2), %k1 + KMOV %k1, %VRDX # endif - testl %eax, %eax - jnz L(last_vec_x2_return) + test %VRDX, %VRDX + jnz L(last_vec_x1) -# ifdef USE_IN_RTM - kmovd %k2, %eax - testl %eax, %eax - jnz L(last_vec_x3_return) - kmovd %k3, %eax - tzcntl %eax, %eax - leaq (VEC_SIZE * 3 + RET_OFFSET)(%rdi, %rax, CHAR_SIZE), %rax +# if USE_TERN_IN_LOOP + vpmovmskb %VMM_lo(3), %VRDX # else - vpmovmskb %ymm3, %eax - /* Combine matches in VEC3 (eax) with matches in VEC4 (ecx). */ - salq $VEC_SIZE, %rcx - orq %rcx, %rax - tzcntq %rax, %rax - leaq (VEC_SIZE * 2 + RET_OFFSET)(%rdi, %rax), %rax - VZEROUPPER + KMOV %k2, %VRDX # endif - ret - .p2align 4,, 10 -L(last_vec_x1_return): - tzcntl %eax, %eax -# if defined USE_AS_WMEMCHR || RET_OFFSET != 0 - /* NB: Multiply bytes by CHAR_SIZE to get the wchar_t count. */ - leaq RET_OFFSET(%rdi, %rax, CHAR_SIZE), %rax + /* No longer need any of the lo vecs (ymm0-15) so vzeroupper + (only if used VEX encoded loop). */ + COND_VZEROUPPER + + /* Seperate logic for CHAR_PER_VEC == 64 vs the rest. For + CHAR_PER_VEC we test the last 2x VEC seperately, for + CHAR_PER_VEC <= 32 we can combine the results from the 2x + VEC in a single GPR. */ +# if CHAR_PER_VEC == 64 +# if USE_TERN_IN_LOOP +# error "Unsupported" +# endif + + + /* If CHAR_PER_VEC == 64 we can't combine the last two VEC. */ + test %VRDX, %VRDX + jnz L(last_vec_x2) + KMOV %k3, %VRDX # else - addq %rdi, %rax + /* CHAR_PER_VEC <= 32 so we can combine the results from the + last 2x VEC. 
*/ + +# if !USE_TERN_IN_LOOP + KMOV %k3, %VRCX +# endif + salq $(VEC_SIZE / TERN_CHAR_MULT), %rcx + addq %rcx, %rdx +# if !defined USE_AS_WMEMCHR || !USE_TERN_IN_LOOP +L(last_vec_x2_novzero): +# endif # endif - VZEROUPPER + bsf %rdx, %rdx + leaq (LAST_VEC_OFFSET)(%rdi, %rdx, TERN_CHAR_MULT), %rax ret - .p2align 4 -L(last_vec_x2_return): - tzcntl %eax, %eax - /* NB: Multiply bytes by RET_SCALE to get the wchar_t count - if relevant (RET_SCALE = CHAR_SIZE if USE_AS_WMEMCHAR and - USE_IN_RTM are both defined. Otherwise RET_SCALE = 1. */ - leaq (VEC_SIZE + RET_OFFSET)(%rdi, %rax, RET_SCALE), %rax - VZEROUPPER + .p2align 4,, 8 +L(last_vec_x1): + COND_VZEROUPPER +# if !defined USE_AS_WMEMCHR || !USE_TERN_IN_LOOP +L(last_vec_x1_novzero): +# endif + bsf %VRDX, %VRDX + leaq (VEC_SIZE * 1)(%rdi, %rdx, TERN_CHAR_MULT), %rax ret -# ifdef USE_IN_RTM - .p2align 4 -L(last_vec_x3_return): - tzcntl %eax, %eax - /* NB: Multiply bytes by CHAR_SIZE to get the wchar_t count. */ - leaq (VEC_SIZE * 2 + RET_OFFSET)(%rdi, %rax, CHAR_SIZE), %rax + + .p2align 4,, 4 +L(last_vec_x0): + COND_VZEROUPPER + bsf %VGPR(GPR_X0), %VGPR(GPR_X0) +# if GPR_X0_IS_RET + addq %rdi, %rax +# else + leaq (%rdi, %GPR_X0, CHAR_SIZE), %rax +# endif ret + + .p2align 4,, 6 +L(page_cross): + /* Need to preserve eax to compute inbound bytes we are + checking. */ +# ifdef USE_AS_WMEMCHR + movl %eax, %ecx +# else + xorl %ecx, %ecx + subl %eax, %ecx # endif -# ifndef USE_AS_RAWMEMCHR - .p2align 4,, 5 -L(last_4x_vec_or_less_cmpeq): - VPCMP $0, (VEC_SIZE * 5)(%rdi), %YMMMATCH, %k0 - kmovd %k0, %eax - subq $-(VEC_SIZE * 4), %rdi - /* Check first VEC regardless. */ - testl %eax, %eax - jnz L(first_vec_x1_check) + xorq %rdi, %rax + VPCMPEQ (PAGE_SIZE - VEC_SIZE)(%rax), %VMATCH, %k0 + KMOV %k0, %VRAX - /* If remaining length <= CHAR_PER_VEC * 2. */ - addl $(CHAR_PER_VEC * 2), %edx - jle L(last_2x_vec) +# ifdef USE_AS_WMEMCHR + /* NB: Divide by CHAR_SIZE to shift out out of bounds bytes. 
*/ + shrl $2, %ecx + andl $(CHAR_PER_VEC - 1), %ecx +# endif - .p2align 4 -L(last_4x_vec): - VPCMP $0, (VEC_SIZE * 2)(%rdi), %YMMMATCH, %k0 - kmovd %k0, %eax - testl %eax, %eax - jnz L(last_vec_x2) + shrx %VGPR(PC_SHIFT_GPR), %VRAX, %VRAX - VPCMP $0, (VEC_SIZE * 3)(%rdi), %YMMMATCH, %k0 - kmovd %k0, %eax - /* Create mask for possible matches within remaining length. */ -# ifdef USE_AS_WMEMCHR - movl $((1 << (CHAR_PER_VEC * 2)) - 1), %ecx - bzhil %edx, %ecx, %ecx -# else - movq $-1, %rcx - bzhiq %rdx, %rcx, %rcx -# endif - /* Test matches in data against length match. */ - andl %ecx, %eax - jnz L(last_vec_x3) +# ifdef USE_AS_WMEMCHR + negl %ecx +# endif - /* if remaining length <= CHAR_PER_VEC * 3 (Note this is after - remaining length was found to be > CHAR_PER_VEC * 2. */ - subl $CHAR_PER_VEC, %edx - jbe L(zero_end2) + /* mask lower bits from ecx (negative eax) to get bytes till + next VEC. */ + andl $(CHAR_PER_VEC - 1), %ecx + /* Check if VEC is entirely contained in the remainder of the + page. */ + cmpq %rcx, %rdx + jbe L(page_cross_ret) - VPCMP $0, (VEC_SIZE * 4)(%rdi), %YMMMATCH, %k0 - kmovd %k0, %eax - /* Shift remaining length mask for last VEC. */ -# ifdef USE_AS_WMEMCHR - shrl $CHAR_PER_VEC, %ecx -# else - shrq $CHAR_PER_VEC, %rcx -# endif - andl %ecx, %eax - jz L(zero_end2) - bsfl %eax, %eax - leaq (VEC_SIZE * 4)(%rdi, %rax, CHAR_SIZE), %rax -L(zero_end2): - ret + /* Length crosses the page so if rax is zero (no matches) + continue. */ + test %VRAX, %VRAX + jz L(page_cross_continue) -L(last_vec_x2): - tzcntl %eax, %eax - leaq (VEC_SIZE * 2)(%rdi, %rax, CHAR_SIZE), %rax + /* if rdx > rcx then any match here must be in [buf:buf + len]. 
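The page-cross path above can be modeled in scalar C: place the vector load so it *ends* at the page boundary (so it never touches the next page), build the compare bitmask, shift out matches that fall before the real start pointer (the `shrx`), and finally reject any match at or beyond `len`. The helper name and the tiny 8-byte "vector" below are illustrative only:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define VEC 8 /* tiny "vector" width so the trick is easy to follow */

/* Scan the first up-to-VEC bytes of `s` without reading past
   `page_end`.  The load is backward-aligned to end exactly at the page
   boundary, guaranteeing it stays on the current page.  */
static long first_byte_match(const unsigned char *s, size_t len,
                             const unsigned char *page_end, int c)
{
    const unsigned char *base = page_end - VEC; /* in-bounds load */
    uint64_t mask = 0;
    for (int i = 0; i < VEC; i++)               /* emulate VPCMPEQ + KMOV */
        if (base[i] == (unsigned char)c)
            mask |= 1ull << i;
    mask >>= (s - base);                        /* drop pre-`s` matches */
    if (mask == 0)
        return -1;
    long idx = (long)__builtin_ctzll(mask);
    return (size_t)idx < len ? idx : -1;        /* bounds check vs len */
}

static const unsigned char page_tail[VEC] =
    { 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h' };
```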
+ */ + tzcnt %VRAX, %VRAX +# ifdef USE_AS_WMEMCHR + leaq (%rdi, %rax, CHAR_SIZE), %rax +# else + addq %rdi, %rax +# endif ret - .p2align 4 -L(last_vec_x3): - tzcntl %eax, %eax - leaq (VEC_SIZE * 3)(%rdi, %rax, CHAR_SIZE), %rax + .p2align 4,, 2 +L(page_cross_zero): + xorl %eax, %eax ret + + .p2align 4,, 4 +L(page_cross_ret): + /* Search is entirely contained in page cross case. */ +# ifdef USE_AS_WMEMCHR + test %VRAX, %VRAX + jz L(page_cross_zero) +# endif + tzcnt %VRAX, %VRAX + cmpl %eax, %edx + jbe L(page_cross_zero) +# ifdef USE_AS_WMEMCHR + leaq (%rdi, %rax, CHAR_SIZE), %rax +# else + addq %rdi, %rax # endif - /* 7 bytes from next cache line. */ + ret END (MEMCHR) #endif diff --git a/sysdeps/x86_64/multiarch/rawmemchr-evex-rtm.S b/sysdeps/x86_64/multiarch/rawmemchr-evex-rtm.S index deda1ca395..2073eaa620 100644 --- a/sysdeps/x86_64/multiarch/rawmemchr-evex-rtm.S +++ b/sysdeps/x86_64/multiarch/rawmemchr-evex-rtm.S @@ -1,3 +1,6 @@ -#define MEMCHR __rawmemchr_evex_rtm -#define USE_AS_RAWMEMCHR 1 -#include "memchr-evex-rtm.S" +#define RAWMEMCHR __rawmemchr_evex_rtm + +#define USE_IN_RTM 1 +#define SECTION(p) p##.evex.rtm + +#include "rawmemchr-evex.S" diff --git a/sysdeps/x86_64/multiarch/rawmemchr-evex.S b/sysdeps/x86_64/multiarch/rawmemchr-evex.S index dc1c450699..dad54def2b 100644 --- a/sysdeps/x86_64/multiarch/rawmemchr-evex.S +++ b/sysdeps/x86_64/multiarch/rawmemchr-evex.S @@ -1,7 +1,308 @@ -#ifndef RAWMEMCHR -# define RAWMEMCHR __rawmemchr_evex -#endif -#define USE_AS_RAWMEMCHR 1 -#define MEMCHR RAWMEMCHR +/* rawmemchr optimized with 256-bit EVEX instructions. + Copyright (C) 2022 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. 
+ + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +#include +#include + +#if ISA_SHOULD_BUILD (4) + +# ifndef VEC_SIZE +# include "x86-evex256-vecs.h" +# endif + +# ifndef RAWMEMCHR +# define RAWMEMCHR __rawmemchr_evex +# endif + + +# define PC_SHIFT_GPR rdi +# define REG_WIDTH VEC_SIZE +# define VPTESTN vptestnmb +# define VPBROADCAST vpbroadcastb +# define VPMINU vpminub +# define VPCMP vpcmpb +# define VPCMPEQ vpcmpeqb +# define CHAR_SIZE 1 + +# include "reg-macros.h" + +/* If not in an RTM and VEC_SIZE != 64 (the VEC_SIZE = 64 + doesn't have VEX encoding), use VEX encoding in loop so we + can use vpcmpeqb + vptern which is more efficient than the + EVEX alternative. 
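The vpcmpeq + vptern combination this file describes hinges on vpternlogd, whose immediate is an 8-entry truth table: bit `(a<<2)|(b<<1)|c` of the imm8 gives the output for inputs (a, b, c) taken from the three source vectors. A small sketch (helper names are ours) showing how an imm8 is derived for any 3-input boolean function, including the `254` used later in this patch for a 3-way OR:

```c
#include <assert.h>

/* Compute the vpternlogd imm8 encoding a 3-input boolean function:
   each of the 8 input combinations selects one bit of the immediate.  */
static unsigned tern_imm8(int (*f)(int, int, int))
{
    unsigned imm = 0;
    for (int a = 0; a < 2; a++)
        for (int b = 0; b < 2; b++)
            for (int c = 0; c < 2; c++)
                if (f(a, b, c))
                    imm |= 1u << ((a << 2) | (b << 1) | c);
    return imm;
}

static int or3(int a, int b, int c)  { return a | b | c; }
static int and3(int a, int b, int c) { return a & b & c; }
```

OR of three vectors is false only for (0,0,0), so every imm8 bit except bit 0 is set: 0xFE = 254, matching the constant used in the vptern loop below.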
*/ +# if defined USE_IN_RTM || VEC_SIZE == 64 +# undef COND_VZEROUPPER +# undef VZEROUPPER_RETURN +# undef VZEROUPPER + + +# define COND_VZEROUPPER +# define VZEROUPPER_RETURN ret +# define VZEROUPPER + +# define USE_TERN_IN_LOOP 0 +# else +# define USE_TERN_IN_LOOP 1 +# undef VZEROUPPER +# define VZEROUPPER vzeroupper +# endif + +# define CHAR_PER_VEC VEC_SIZE + +# if CHAR_PER_VEC == 64 + +# define TAIL_RETURN_LBL first_vec_x2 +# define TAIL_RETURN_OFFSET (CHAR_PER_VEC * 2) + +# define FALLTHROUGH_RETURN_LBL first_vec_x3 +# define FALLTHROUGH_RETURN_OFFSET (CHAR_PER_VEC * 3) + +# else /* !(CHAR_PER_VEC == 64) */ + +# define TAIL_RETURN_LBL first_vec_x3 +# define TAIL_RETURN_OFFSET (CHAR_PER_VEC * 3) + +# define FALLTHROUGH_RETURN_LBL first_vec_x2 +# define FALLTHROUGH_RETURN_OFFSET (CHAR_PER_VEC * 2) +# endif /* !(CHAR_PER_VEC == 64) */ + + +# define VMATCH VMM(0) +# define VMATCH_LO VMM_lo(0) + +# define PAGE_SIZE 4096 + + .section SECTION(.text), "ax", @progbits +ENTRY_P2ALIGN (RAWMEMCHR, 6) + VPBROADCAST %esi, %VMATCH + /* Check if we may cross page boundary with one vector load. */ + movl %edi, %eax + andl $(PAGE_SIZE - 1), %eax + cmpl $(PAGE_SIZE - VEC_SIZE), %eax + ja L(page_cross) + + VPCMPEQ (%rdi), %VMATCH, %k0 + KMOV %k0, %VRAX + + test %VRAX, %VRAX + jz L(aligned_more) +L(first_vec_x0): + bsf %VRAX, %VRAX + addq %rdi, %rax + ret + + .p2align 4,, 4 +L(first_vec_x4): + bsf %VRAX, %VRAX + leaq (VEC_SIZE * 4)(%rdi, %rax), %rax + ret -#include "memchr-evex.S" + /* For VEC_SIZE == 32 we can fit this in aligning bytes so might + as well place it more locally. For VEC_SIZE == 64 we reuse + return code at the end of loop's return. */ +# if VEC_SIZE == 32 + .p2align 4,, 4 +L(FALLTHROUGH_RETURN_LBL): + bsf %VRAX, %VRAX + leaq (FALLTHROUGH_RETURN_OFFSET)(%rdi, %rax), %rax + ret +# endif + + .p2align 4,, 6 +L(page_cross): + /* eax has lower page-offset bits of rdi so xor will zero them + out. 
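The "xor will zero them out" comment above is a trick for recovering the page base without an extra mask instruction or clobbering the pointer: rax already holds the low page-offset bits of rdi, so XORing it back into the pointer clears exactly those bits. In C (PAGE_SIZE as in the file):

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE 4096

/* Recover the start of the page containing p.  Mirrors:
     movl %edi, %eax; andl $(PAGE_SIZE - 1), %eax   (offset)
     xorq %rdi, %rax                                (clear low bits)  */
static uintptr_t page_base(uintptr_t p)
{
    uintptr_t offset = p & (PAGE_SIZE - 1);
    return p ^ offset; /* equivalent to p & ~(PAGE_SIZE - 1) */
}
```

The payoff in the assembly is that the offset in eax is needed anyway for the page-cross check, so the page base comes out of a single extra xor.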
*/ + xorq %rdi, %rax + VPCMPEQ (PAGE_SIZE - VEC_SIZE)(%rax), %VMATCH, %k0 + KMOV %k0, %VRAX + + /* Shift out out-of-bounds matches. */ + shrx %VRDI, %VRAX, %VRAX + test %VRAX, %VRAX + jnz L(first_vec_x0) + + .p2align 4,, 10 +L(aligned_more): +L(page_cross_continue): + /* Align pointer. */ + andq $(VEC_SIZE * -1), %rdi + + VPCMPEQ VEC_SIZE(%rdi), %VMATCH, %k0 + KMOV %k0, %VRAX + test %VRAX, %VRAX + jnz L(first_vec_x1) + + VPCMPEQ (VEC_SIZE * 2)(%rdi), %VMATCH, %k0 + KMOV %k0, %VRAX + test %VRAX, %VRAX + jnz L(first_vec_x2) + + VPCMPEQ (VEC_SIZE * 3)(%rdi), %VMATCH, %k0 + KMOV %k0, %VRAX + test %VRAX, %VRAX + jnz L(first_vec_x3) + + VPCMPEQ (VEC_SIZE * 4)(%rdi), %VMATCH, %k0 + KMOV %k0, %VRAX + test %VRAX, %VRAX + jnz L(first_vec_x4) + + subq $-(VEC_SIZE * 1), %rdi +# if VEC_SIZE == 64 + /* Saves code size. No evex512 processor has partial register + stalls. If that change this can be replaced with `andq + $-(VEC_SIZE * 4), %rdi`. */ + xorb %dil, %dil +# else + andq $-(VEC_SIZE * 4), %rdi +# endif + +# if USE_TERN_IN_LOOP + /* copy VMATCH to low ymm so we can use vpcmpeq which is not + encodable with EVEX registers. NB: this is VEC_SIZE == 32 + only as there is no way to encode vpcmpeq with zmm0-15. */ + vmovdqa64 %VMATCH, %VMATCH_LO +# endif + + .p2align 4 +L(loop_4x_vec): + /* Two versions of the loop. One that does not require + vzeroupper by not using ymm0-15 and another does that + require vzeroupper because it uses ymm0-15. The reason why + ymm0-15 is used at all is because there is no EVEX encoding + vpcmpeq and with vpcmpeq this loop can be performed more + efficiently. The non-vzeroupper version is safe for RTM + while the vzeroupper version should be prefered if RTM are + not supported. Which loop version we use is determined by + USE_TERN_IN_LOOP. */ + +# if USE_TERN_IN_LOOP + /* Since vptern can only take 3x vectors fastest to do 1 vec + seperately with EVEX vpcmp. 
*/ + VPCMPEQ (VEC_SIZE * 4)(%rdi), %VMATCH, %k1 + /* Compare 3x with vpcmpeq and or them all together with vptern. + */ + + VPCMPEQ (VEC_SIZE * 5)(%rdi), %VMATCH_LO, %VMM_lo(2) + subq $(VEC_SIZE * -4), %rdi + VPCMPEQ (VEC_SIZE * 2)(%rdi), %VMATCH_LO, %VMM_lo(3) + VPCMPEQ (VEC_SIZE * 3)(%rdi), %VMATCH_LO, %VMM_lo(4) + + /* 254 is mask for oring VEC_lo(2), VEC_lo(3), VEC_lo(4) into + VEC_lo(4). */ + vpternlogd $254, %VMM_lo(2), %VMM_lo(3), %VMM_lo(4) + vpmovmskb %VMM_lo(4), %VRCX + + KMOV %k1, %eax + + /* NB: rax has match from first VEC and rcx has matches from + VEC 2-4. If rax is non-zero we will return that match. If + rax is zero adding won't disturb the bits in rcx. */ + add %rax, %rcx +# else + /* Loop version that uses EVEX encoding. */ + VPCMP $4, (VEC_SIZE * 4)(%rdi), %VMATCH, %k1 + vpxorq (VEC_SIZE * 5)(%rdi), %VMATCH, %VMM(2) + vpxorq (VEC_SIZE * 6)(%rdi), %VMATCH, %VMM(3) + VPCMPEQ (VEC_SIZE * 7)(%rdi), %VMATCH, %k3 + VPMINU %VMM(2), %VMM(3), %VMM(3){%k1}{z} + VPTESTN %VMM(3), %VMM(3), %k2 + subq $(VEC_SIZE * -4), %rdi + KORTEST %k2, %k3 +# endif + jz L(loop_4x_vec) + +# if USE_TERN_IN_LOOP + test %VRAX, %VRAX +# else + KMOV %k1, %VRAX + inc %VRAX +# endif + jnz L(last_vec_x0) + + +# if USE_TERN_IN_LOOP + vpmovmskb %VMM_lo(2), %VRAX +# else + VPTESTN %VMM(2), %VMM(2), %k1 + KMOV %k1, %VRAX +# endif + test %VRAX, %VRAX + jnz L(last_vec_x1) + + +# if USE_TERN_IN_LOOP + vpmovmskb %VMM_lo(3), %VRAX +# else + KMOV %k2, %VRAX +# endif + + /* No longer need any of the lo vecs (ymm0-15) so vzeroupper + (only if used VEX encoded loop). */ + COND_VZEROUPPER + + /* Seperate logic for VEC_SIZE == 64 and VEC_SIZE == 32 for + returning last 2x VEC. For VEC_SIZE == 64 we test each VEC + individually, for VEC_SIZE == 32 we combine them in a single + 64-bit GPR. */ +# if CHAR_PER_VEC == 64 +# if USE_TERN_IN_LOOP +# error "Unsupported" +# endif + + + /* If CHAR_PER_VEC == 64 we can't combine the last two VEC. 
*/ + test %VRAX, %VRAX + jnz L(first_vec_x2) + KMOV %k3, %VRAX +L(FALLTHROUGH_RETURN_LBL): +# else + /* CHAR_PER_VEC <= 32 so we can combine the results from the + last 2x VEC. */ +# if !USE_TERN_IN_LOOP + KMOV %k3, %VRCX +# endif + salq $CHAR_PER_VEC, %rcx + addq %rcx, %rax +# endif + bsf %rax, %rax + leaq (FALLTHROUGH_RETURN_OFFSET)(%rdi, %rax), %rax + ret + + .p2align 4,, 8 +L(TAIL_RETURN_LBL): + bsf %rax, %rax + leaq (TAIL_RETURN_OFFSET)(%rdi, %rax), %rax + ret + + .p2align 4,, 8 +L(last_vec_x1): + COND_VZEROUPPER +L(first_vec_x1): + bsf %VRAX, %VRAX + leaq (VEC_SIZE * 1)(%rdi, %rax), %rax + ret + + .p2align 4,, 8 +L(last_vec_x0): + COND_VZEROUPPER + bsf %VRAX, %VRAX + addq %rdi, %rax + ret +END (RAWMEMCHR) +#endif From patchwork Tue Oct 18 23:19:33 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Noah Goldstein X-Patchwork-Id: 1691735 Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@legolas.ozlabs.org Authentication-Results: legolas.ozlabs.org; spf=pass (sender SPF authorized) smtp.mailfrom=sourceware.org (client-ip=8.43.85.97; helo=sourceware.org; envelope-from=libc-alpha-bounces+incoming=patchwork.ozlabs.org@sourceware.org; receiver=) Authentication-Results: legolas.ozlabs.org; dkim=pass (1024-bit key; secure) header.d=sourceware.org header.i=@sourceware.org header.a=rsa-sha256 header.s=default header.b=ekxZuPjq; dkim-atps=neutral Received: from sourceware.org (server2.sourceware.org [8.43.85.97]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature ECDSA (P-384) server-digest SHA384) (No client certificate requested) by legolas.ozlabs.org (Postfix) with ESMTPS id 4MsVFh6fg4z23jk for ; Wed, 19 Oct 2022 10:20:08 +1100 (AEDT) Received: from server2.sourceware.org (localhost [IPv6:::1]) by sourceware.org (Postfix) with ESMTP id 0DDCD385828A for ; Tue, 18 Oct 2022 23:20:04 +0000 (GMT) DKIM-Filter: 

To: libc-alpha@sourceware.org
Subject: [PATCH v2 2/7] x86: Shrink / minorly optimize strchr-evex and implement with VMM headers
Date: Tue, 18 Oct 2022 16:19:33 -0700
Message-Id: <20221018231938.3621554-2-goldstein.w.n@gmail.com>
In-Reply-To: <20221018231938.3621554-1-goldstein.w.n@gmail.com>
References: <20221018024901.3381469-1-goldstein.w.n@gmail.com> <20221018231938.3621554-1-goldstein.w.n@gmail.com>
From: Noah Goldstein
Reply-To: Noah Goldstein

Size Optimizations:
1. Condense the hot path for better cache-locality.
   - This matters most for strchrnul, where the logic for strings with
     len <= VEC_SIZE or with a match in the first VEC now fits entirely
     in the first cache line.
2. Reuse common targets in first 4x VEC and after the loop.
3.
Don't align targets so aggressively if it doesn't change the number of fetch blocks it will require and put more care in avoiding the case where targets unnecessarily split cache lines. 4. Align the loop better for DSB/LSD 5. Use more code-size efficient instructions. - tzcnt ... -> bsf ... - vpcmpb $0 ... -> vpcmpeq ... 6. Align labels less aggressively, especially if it doesn't save fetch blocks / causes the basic-block to span extra cache-lines. Code Size Changes: strchr-evex.S : -63 bytes strchrnul-evex.S: -48 bytes Net perf changes: Reported as geometric mean of all improvements / regressions from N=10 runs of the benchtests. Value as New Time / Old Time so < 1.0 is improvement and 1.0 is regression. strchr-evex.S (Fixed) : 0.971 strchr-evex.S (Rand) : 0.932 strchrnul-evex.S : 0.965 Full results attached in email. Full check passes on x86-64. --- sysdeps/x86_64/multiarch/strchr-evex.S | 558 +++++++++++++++---------- 1 file changed, 340 insertions(+), 218 deletions(-) diff --git a/sysdeps/x86_64/multiarch/strchr-evex.S b/sysdeps/x86_64/multiarch/strchr-evex.S index a1c15c4419..c2a0d112f7 100644 --- a/sysdeps/x86_64/multiarch/strchr-evex.S +++ b/sysdeps/x86_64/multiarch/strchr-evex.S @@ -26,48 +26,75 @@ # define STRCHR __strchr_evex # endif -# define VMOVU vmovdqu64 -# define VMOVA vmovdqa64 +# ifndef VEC_SIZE +# include "x86-evex256-vecs.h" +# endif # ifdef USE_AS_WCSCHR # define VPBROADCAST vpbroadcastd -# define VPCMP vpcmpd +# define VPCMP vpcmpd +# define VPCMPEQ vpcmpeqd # define VPTESTN vptestnmd +# define VPTEST vptestmd # define VPMINU vpminud # define CHAR_REG esi -# define SHIFT_REG ecx +# define SHIFT_REG rcx # define CHAR_SIZE 4 + +# define USE_WIDE_CHAR # else # define VPBROADCAST vpbroadcastb -# define VPCMP vpcmpb +# define VPCMP vpcmpb +# define VPCMPEQ vpcmpeqb # define VPTESTN vptestnmb +# define VPTEST vptestmb # define VPMINU vpminub # define CHAR_REG sil -# define SHIFT_REG edx +# define SHIFT_REG rdi # define CHAR_SIZE 1 # endif -# define 
XMMZERO xmm16 - -# define YMMZERO ymm16 -# define YMM0 ymm17 -# define YMM1 ymm18 -# define YMM2 ymm19 -# define YMM3 ymm20 -# define YMM4 ymm21 -# define YMM5 ymm22 -# define YMM6 ymm23 -# define YMM7 ymm24 -# define YMM8 ymm25 - -# define VEC_SIZE 32 -# define PAGE_SIZE 4096 -# define CHAR_PER_VEC (VEC_SIZE / CHAR_SIZE) - - .section .text.evex,"ax",@progbits -ENTRY_P2ALIGN (STRCHR, 5) - /* Broadcast CHAR to YMM0. */ - VPBROADCAST %esi, %YMM0 +# include "reg-macros.h" + +# if VEC_SIZE == 64 +# define MASK_GPR rcx +# define LOOP_REG rax + +# define COND_MASK(k_reg) {%k_reg} +# else +# define MASK_GPR rax +# define LOOP_REG rdi + +# define COND_MASK(k_reg) +# endif + +# define CHAR_PER_VEC (VEC_SIZE / CHAR_SIZE) + + +# if CHAR_PER_VEC == 64 +# define LAST_VEC_OFFSET (VEC_SIZE * 3) +# define TESTZ(reg) incq %VGPR_SZ(reg, 64) +# else + +# if CHAR_PER_VEC == 32 +# define TESTZ(reg) incl %VGPR_SZ(reg, 32) +# elif CHAR_PER_VEC == 16 +# define TESTZ(reg) incw %VGPR_SZ(reg, 16) +# else +# define TESTZ(reg) incb %VGPR_SZ(reg, 8) +# endif + +# define LAST_VEC_OFFSET (VEC_SIZE * 2) +# endif + +# define VMATCH VMM(0) + +# define PAGE_SIZE 4096 + + .section SECTION(.text), "ax", @progbits +ENTRY_P2ALIGN (STRCHR, 6) + /* Broadcast CHAR to VEC_0. */ + VPBROADCAST %esi, %VMATCH movl %edi, %eax andl $(PAGE_SIZE - 1), %eax /* Check if we cross page boundary with one vector load. @@ -75,19 +102,27 @@ ENTRY_P2ALIGN (STRCHR, 5) cmpl $(PAGE_SIZE - VEC_SIZE), %eax ja L(cross_page_boundary) + /* Check the first VEC_SIZE bytes. Search for both CHAR and the null bytes. */ - VMOVU (%rdi), %YMM1 - + VMOVU (%rdi), %VMM(1) /* Leaves only CHARS matching esi as 0. */ - vpxorq %YMM1, %YMM0, %YMM2 - VPMINU %YMM2, %YMM1, %YMM2 - /* Each bit in K0 represents a CHAR or a null byte in YMM1. */ - VPTESTN %YMM2, %YMM2, %k0 - kmovd %k0, %eax - testl %eax, %eax + vpxorq %VMM(1), %VMATCH, %VMM(2) + VPMINU %VMM(2), %VMM(1), %VMM(2) + /* Each bit in K0 represents a CHAR or a null byte in VEC_1. 
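The vpxorq + VPMINU pair above folds the char-match and null-terminator checks into one test: for each element x, min(x, x ^ c) is zero exactly when x == 0 (end of string) or x == c (the searched character), so a single zero-test (VPTESTN into K0) covers both conditions. A scalar model of the trick (illustrative only):

```c
#include <assert.h>

/* Per-element model of: vpxorq VEC, MATCH -> t; vpminu t, VEC -> m;
   then a zero-test on m.  m == 0 iff the element is the searched
   char or the null terminator.  */
static int is_char_or_null(unsigned char x, unsigned char c)
{
    unsigned char t = x ^ c;          /* 0 iff x == c */
    unsigned char m = t < x ? t : x;  /* VPMINU: 0 iff t == 0 or x == 0 */
    return m == 0;
}
```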
*/ + VPTESTN %VMM(2), %VMM(2), %k0 + KMOV %k0, %VRAX +# if VEC_SIZE == 64 && defined USE_AS_STRCHRNUL + /* If VEC_SIZE == 64 && STRCHRNUL use bsf to test condition so + that all logic for match/null in first VEC first in 1x cache + lines. This has a slight cost to larger sizes. */ + bsf %VRAX, %VRAX + jz L(aligned_more) +# else + test %VRAX, %VRAX jz L(aligned_more) - tzcntl %eax, %eax + bsf %VRAX, %VRAX +# endif # ifndef USE_AS_STRCHRNUL /* Found CHAR or the null byte. */ cmp (%rdi, %rax, CHAR_SIZE), %CHAR_REG @@ -109,287 +144,374 @@ ENTRY_P2ALIGN (STRCHR, 5) # endif ret - - - .p2align 4,, 10 -L(first_vec_x4): -# ifndef USE_AS_STRCHRNUL - /* Check to see if first match was CHAR (k0) or null (k1). */ - kmovd %k0, %eax - tzcntl %eax, %eax - kmovd %k1, %ecx - /* bzhil will not be 0 if first match was null. */ - bzhil %eax, %ecx, %ecx - jne L(zero) -# else - /* Combine CHAR and null matches. */ - kord %k0, %k1, %k0 - kmovd %k0, %eax - tzcntl %eax, %eax -# endif - /* NB: Multiply sizeof char type (1 or 4) to get the number of - bytes. */ - leaq (VEC_SIZE * 4)(%rdi, %rax, CHAR_SIZE), %rax - ret - # ifndef USE_AS_STRCHRNUL L(zero): xorl %eax, %eax ret # endif - - .p2align 4 + .p2align 4,, 2 +L(first_vec_x3): + subq $-(VEC_SIZE * 2), %rdi +# if VEC_SIZE == 32 + /* Reuse L(first_vec_x3) for last VEC2 only for VEC_SIZE == 32. + For VEC_SIZE == 64 the registers don't match. */ +L(last_vec_x2): +# endif L(first_vec_x1): /* Use bsf here to save 1-byte keeping keeping the block in 1x fetch block. eax guranteed non-zero. */ - bsfl %eax, %eax + bsf %VRCX, %VRCX # ifndef USE_AS_STRCHRNUL - /* Found CHAR or the null byte. */ - cmp (VEC_SIZE)(%rdi, %rax, CHAR_SIZE), %CHAR_REG + /* Found CHAR or the null byte. */ + cmp (VEC_SIZE)(%rdi, %rcx, CHAR_SIZE), %CHAR_REG jne L(zero) - # endif /* NB: Multiply sizeof char type (1 or 4) to get the number of bytes. 
*/ - leaq (VEC_SIZE)(%rdi, %rax, CHAR_SIZE), %rax + leaq (VEC_SIZE)(%rdi, %rcx, CHAR_SIZE), %rax ret - .p2align 4,, 10 + .p2align 4,, 2 +L(first_vec_x4): + subq $-(VEC_SIZE * 2), %rdi L(first_vec_x2): # ifndef USE_AS_STRCHRNUL /* Check to see if first match was CHAR (k0) or null (k1). */ - kmovd %k0, %eax - tzcntl %eax, %eax - kmovd %k1, %ecx + KMOV %k0, %VRAX + tzcnt %VRAX, %VRAX + KMOV %k1, %VRCX /* bzhil will not be 0 if first match was null. */ - bzhil %eax, %ecx, %ecx + bzhi %VRAX, %VRCX, %VRCX jne L(zero) # else /* Combine CHAR and null matches. */ - kord %k0, %k1, %k0 - kmovd %k0, %eax - tzcntl %eax, %eax + KOR %k0, %k1, %k0 + KMOV %k0, %VRAX + bsf %VRAX, %VRAX # endif /* NB: Multiply sizeof char type (1 or 4) to get the number of bytes. */ leaq (VEC_SIZE * 2)(%rdi, %rax, CHAR_SIZE), %rax ret - .p2align 4,, 10 -L(first_vec_x3): - /* Use bsf here to save 1-byte keeping keeping the block in 1x - fetch block. eax guranteed non-zero. */ - bsfl %eax, %eax -# ifndef USE_AS_STRCHRNUL - /* Found CHAR or the null byte. */ - cmp (VEC_SIZE * 3)(%rdi, %rax, CHAR_SIZE), %CHAR_REG - jne L(zero) +# ifdef USE_AS_STRCHRNUL + /* We use this as a hook to get imm8 encoding for the jmp to + L(page_cross_boundary). This allows the hot case of a + match/null-term in first VEC to fit entirely in 1 cache + line. */ +L(cross_page_boundary): + jmp L(cross_page_boundary_real) # endif - /* NB: Multiply sizeof char type (1 or 4) to get the number of - bytes. */ - leaq (VEC_SIZE * 3)(%rdi, %rax, CHAR_SIZE), %rax - ret .p2align 4 L(aligned_more): +L(cross_page_continue): /* Align data to VEC_SIZE. */ andq $-VEC_SIZE, %rdi -L(cross_page_continue): - /* Check the next 4 * VEC_SIZE. Only one VEC_SIZE at a time since - data is only aligned to VEC_SIZE. Use two alternating methods - for checking VEC to balance latency and port contention. */ - /* This method has higher latency but has better port - distribution. */ - VMOVA (VEC_SIZE)(%rdi), %YMM1 + /* Check the next 4 * VEC_SIZE. 
Only one VEC_SIZE at a time + since data is only aligned to VEC_SIZE. Use two alternating + methods for checking VEC to balance latency and port + contention. */ + + /* Method(1) with 8c latency: + For VEC_SIZE == 32: + p0 * 1.83, p1 * 0.83, p5 * 1.33 + For VEC_SIZE == 64: + p0 * 2.50, p1 * 0.00, p5 * 1.50 */ + VMOVA (VEC_SIZE)(%rdi), %VMM(1) /* Leaves only CHARS matching esi as 0. */ - vpxorq %YMM1, %YMM0, %YMM2 - VPMINU %YMM2, %YMM1, %YMM2 - /* Each bit in K0 represents a CHAR or a null byte in YMM1. */ - VPTESTN %YMM2, %YMM2, %k0 - kmovd %k0, %eax - testl %eax, %eax + vpxorq %VMM(1), %VMATCH, %VMM(2) + VPMINU %VMM(2), %VMM(1), %VMM(2) + /* Each bit in K0 represents a CHAR or a null byte in VEC_1. */ + VPTESTN %VMM(2), %VMM(2), %k0 + KMOV %k0, %VRCX + test %VRCX, %VRCX jnz L(first_vec_x1) - /* This method has higher latency but has better port - distribution. */ - VMOVA (VEC_SIZE * 2)(%rdi), %YMM1 - /* Each bit in K0 represents a CHAR in YMM1. */ - VPCMP $0, %YMM1, %YMM0, %k0 - /* Each bit in K1 represents a CHAR in YMM1. */ - VPTESTN %YMM1, %YMM1, %k1 - kortestd %k0, %k1 + /* Method(2) with 6c latency: + For VEC_SIZE == 32: + p0 * 1.00, p1 * 0.00, p5 * 2.00 + For VEC_SIZE == 64: + p0 * 1.00, p1 * 0.00, p5 * 2.00 */ + VMOVA (VEC_SIZE * 2)(%rdi), %VMM(1) + /* Each bit in K0 represents a CHAR in VEC_1. */ + VPCMPEQ %VMM(1), %VMATCH, %k0 + /* Each bit in K1 represents a CHAR in VEC_1. */ + VPTESTN %VMM(1), %VMM(1), %k1 + KORTEST %k0, %k1 jnz L(first_vec_x2) - VMOVA (VEC_SIZE * 3)(%rdi), %YMM1 + /* By swapping between Method 1/2 we get more fair port + distrubition and better throughput. */ + + VMOVA (VEC_SIZE * 3)(%rdi), %VMM(1) /* Leaves only CHARS matching esi as 0. */ - vpxorq %YMM1, %YMM0, %YMM2 - VPMINU %YMM2, %YMM1, %YMM2 - /* Each bit in K0 represents a CHAR or a null byte in YMM1. 
*/ - VPTESTN %YMM2, %YMM2, %k0 - kmovd %k0, %eax - testl %eax, %eax + vpxorq %VMM(1), %VMATCH, %VMM(2) + VPMINU %VMM(2), %VMM(1), %VMM(2) + /* Each bit in K0 represents a CHAR or a null byte in VEC_1. */ + VPTESTN %VMM(2), %VMM(2), %k0 + KMOV %k0, %VRCX + test %VRCX, %VRCX jnz L(first_vec_x3) - VMOVA (VEC_SIZE * 4)(%rdi), %YMM1 - /* Each bit in K0 represents a CHAR in YMM1. */ - VPCMP $0, %YMM1, %YMM0, %k0 - /* Each bit in K1 represents a CHAR in YMM1. */ - VPTESTN %YMM1, %YMM1, %k1 - kortestd %k0, %k1 + VMOVA (VEC_SIZE * 4)(%rdi), %VMM(1) + /* Each bit in K0 represents a CHAR in VEC_1. */ + VPCMPEQ %VMM(1), %VMATCH, %k0 + /* Each bit in K1 represents a CHAR in VEC_1. */ + VPTESTN %VMM(1), %VMM(1), %k1 + KORTEST %k0, %k1 jnz L(first_vec_x4) /* Align data to VEC_SIZE * 4 for the loop. */ +# if VEC_SIZE == 64 + /* Use rax for the loop reg as it allows to the loop to fit in + exactly 2-cache-lines. (more efficient imm32 + gpr + encoding). */ + leaq (VEC_SIZE)(%rdi), %rax + /* No partial register stalls on evex512 processors. */ + xorb %al, %al +# else + /* For VEC_SIZE == 32 continue using rdi for loop reg so we can + reuse more code and save space. */ addq $VEC_SIZE, %rdi andq $-(VEC_SIZE * 4), %rdi - +# endif .p2align 4 L(loop_4x_vec): - /* Check 4x VEC at a time. No penalty to imm32 offset with evex - encoding. */ - VMOVA (VEC_SIZE * 4)(%rdi), %YMM1 - VMOVA (VEC_SIZE * 5)(%rdi), %YMM2 - VMOVA (VEC_SIZE * 6)(%rdi), %YMM3 - VMOVA (VEC_SIZE * 7)(%rdi), %YMM4 - - /* For YMM1 and YMM3 use xor to set the CHARs matching esi to + /* Check 4x VEC at a time. No penalty for imm32 offset with evex + encoding (if offset % VEC_SIZE == 0). */ + VMOVA (VEC_SIZE * 4)(%LOOP_REG), %VMM(1) + VMOVA (VEC_SIZE * 5)(%LOOP_REG), %VMM(2) + VMOVA (VEC_SIZE * 6)(%LOOP_REG), %VMM(3) + VMOVA (VEC_SIZE * 7)(%LOOP_REG), %VMM(4) + + /* Collect bits where VEC_1 does NOT match esi. This is later + use to mask of results (getting not matches allows us to + save an instruction on combining). 
*/ + VPCMP $4, %VMATCH, %VMM(1), %k1 + + /* Two methods for loop depending on VEC_SIZE. This is because + with zmm registers VPMINU can only run on p0 (as opposed to + p0/p1 for ymm) so it is less prefered. */ +# if VEC_SIZE == 32 + /* For VEC_2 and VEC_3 use xor to set the CHARs matching esi to zero. */ - vpxorq %YMM1, %YMM0, %YMM5 - /* For YMM2 and YMM4 cmp not equals to CHAR and store result in - k register. Its possible to save either 1 or 2 instructions - using cmp no equals method for either YMM1 or YMM1 and YMM3 - respectively but bottleneck on p5 makes it not worth it. */ - VPCMP $4, %YMM0, %YMM2, %k2 - vpxorq %YMM3, %YMM0, %YMM7 - VPCMP $4, %YMM0, %YMM4, %k4 - - /* Use min to select all zeros from either xor or end of string). - */ - VPMINU %YMM1, %YMM5, %YMM1 - VPMINU %YMM3, %YMM7, %YMM3 + vpxorq %VMM(2), %VMATCH, %VMM(6) + vpxorq %VMM(3), %VMATCH, %VMM(7) - /* Use min + zeromask to select for zeros. Since k2 and k4 will - have 0 as positions that matched with CHAR which will set - zero in the corresponding destination bytes in YMM2 / YMM4. - */ - VPMINU %YMM1, %YMM2, %YMM2{%k2}{z} - VPMINU %YMM3, %YMM4, %YMM4 - VPMINU %YMM2, %YMM4, %YMM4{%k4}{z} - - VPTESTN %YMM4, %YMM4, %k1 - kmovd %k1, %ecx - subq $-(VEC_SIZE * 4), %rdi - testl %ecx, %ecx + /* Find non-matches in VEC_4 while combining with non-matches + from VEC_1. NB: Try and use masked predicate execution on + instructions that have mask result as it has no latency + penalty. */ + VPCMP $4, %VMATCH, %VMM(4), %k4{%k1} + + /* Combined zeros from VEC_1 / VEC_2 (search for null term). */ + VPMINU %VMM(1), %VMM(2), %VMM(2) + + /* Use min to select all zeros from either xor or end of + string). */ + VPMINU %VMM(3), %VMM(7), %VMM(3) + VPMINU %VMM(2), %VMM(6), %VMM(2) + + /* Combined zeros from VEC_2 / VEC_3 (search for null term). */ + VPMINU %VMM(3), %VMM(4), %VMM(4) + + /* Combined zeros from VEC_2 / VEC_4 (this has all null term and + esi matches for VEC_2 / VEC_3). 
*/ + VPMINU %VMM(2), %VMM(4), %VMM(4) +# else + /* Collect non-matches for VEC_2. */ + VPCMP $4, %VMM(2), %VMATCH, %k2 + + /* Combined zeros from VEC_1 / VEC_2 (search for null term). */ + VPMINU %VMM(1), %VMM(2), %VMM(2) + + /* Find non-matches in VEC_3/VEC_4 while combining with non- + matches from VEC_1/VEC_2 respectively. */ + VPCMP $4, %VMM(3), %VMATCH, %k3{%k1} + VPCMP $4, %VMM(4), %VMATCH, %k4{%k2} + + /* Finish combining zeros in all VECs. */ + VPMINU %VMM(3), %VMM(4), %VMM(4) + + /* Combine in esi matches for VEC_3 (if there was a match with + esi, the corresponding bit in %k3 is zero so the + VPMINU_MASKZ will have a zero in the result). NB: This make + the VPMINU 3c latency. The only way to avoid it is to + createa a 12c dependency chain on all the `VPCMP $4, ...` + which has higher total latency. */ + VPMINU %VMM(2), %VMM(4), %VMM(4){%k3}{z} +# endif + VPTEST %VMM(4), %VMM(4), %k0{%k4} + KMOV %k0, %VRDX + subq $-(VEC_SIZE * 4), %LOOP_REG + + /* TESTZ is inc using the proper register width depending on + CHAR_PER_VEC. An esi match or null-term match leaves a zero- + bit in rdx so inc won't overflow and won't be zero. */ + TESTZ (rdx) jz L(loop_4x_vec) - VPTESTN %YMM1, %YMM1, %k0 - kmovd %k0, %eax - testl %eax, %eax - jnz L(last_vec_x1) + VPTEST %VMM(1), %VMM(1), %k0{%k1} + KMOV %k0, %VGPR(MASK_GPR) + TESTZ (MASK_GPR) +# if VEC_SIZE == 32 + /* We can reuse the return code in page_cross logic for VEC_SIZE + == 32. */ + jnz L(last_vec_x1_vec_size32) +# else + jnz L(last_vec_x1_vec_size64) +# endif + - VPTESTN %YMM2, %YMM2, %k0 - kmovd %k0, %eax - testl %eax, %eax + /* COND_MASK integates the esi matches for VEC_SIZE == 64. For + VEC_SIZE == 32 they are already integrated. */ + VPTEST %VMM(2), %VMM(2), %k0 COND_MASK(k2) + KMOV %k0, %VRCX + TESTZ (rcx) jnz L(last_vec_x2) - VPTESTN %YMM3, %YMM3, %k0 - kmovd %k0, %eax - /* Combine YMM3 matches (eax) with YMM4 matches (ecx). 
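The TESTZ(reg) macro used in the loop above exploits that the k-mask holds a 1 for every element that did NOT match: "no match anywhere" means the mask is all-ones for its width, so an `inc` of the proper register width wraps to zero exactly in that case, making `inc` + `jz` a width-aware "all bits set?" test. A scalar model for the 16-element case (illustrative):

```c
#include <assert.h>
#include <stdint.h>

/* Model of TESTZ for CHAR_PER_VEC == 16 (incw): the mask of
   non-matches is all-ones iff nothing matched, and only then does
   the increment wrap to zero.  */
static int any_match16(uint16_t not_match_mask)
{
    return (uint16_t)(not_match_mask + 1) != 0; /* incw; jz not taken */
}
```

Using inc instead of a cmp-with-immediate saves code size, and the register width (incb/incw/incl/incq) selects the mask width for free.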
*/ -# ifdef USE_AS_WCSCHR - sall $8, %ecx - orl %ecx, %eax - bsfl %eax, %eax + VPTEST %VMM(3), %VMM(3), %k0 COND_MASK(k3) + KMOV %k0, %VRCX +# if CHAR_PER_VEC == 64 + TESTZ (rcx) + jnz L(last_vec_x3) # else - salq $32, %rcx - orq %rcx, %rax - bsfq %rax, %rax + salq $CHAR_PER_VEC, %rdx + TESTZ (rcx) + orq %rcx, %rdx # endif + + bsfq %rdx, %rdx + # ifndef USE_AS_STRCHRNUL /* Check if match was CHAR or null. */ - cmp (VEC_SIZE * 2)(%rdi, %rax, CHAR_SIZE), %CHAR_REG + cmp (LAST_VEC_OFFSET)(%LOOP_REG, %rdx, CHAR_SIZE), %CHAR_REG jne L(zero_end) # endif /* NB: Multiply sizeof char type (1 or 4) to get the number of bytes. */ - leaq (VEC_SIZE * 2)(%rdi, %rax, CHAR_SIZE), %rax + leaq (LAST_VEC_OFFSET)(%LOOP_REG, %rdx, CHAR_SIZE), %rax ret - .p2align 4,, 8 -L(last_vec_x1): - bsfl %eax, %eax -# ifdef USE_AS_WCSCHR - /* NB: Multiply wchar_t count by 4 to get the number of bytes. - */ - leaq (%rdi, %rax, CHAR_SIZE), %rax -# else - addq %rdi, %rax +# ifndef USE_AS_STRCHRNUL +L(zero_end): + xorl %eax, %eax + ret # endif -# ifndef USE_AS_STRCHRNUL + + /* Seperate return label for last VEC1 because for VEC_SIZE == + 32 we can reuse return code in L(page_cross) but VEC_SIZE == + 64 has mismatched registers. */ +# if VEC_SIZE == 64 + .p2align 4,, 8 +L(last_vec_x1_vec_size64): + bsf %VRCX, %VRCX +# ifndef USE_AS_STRCHRNUL /* Check if match was null. */ - cmp (%rax), %CHAR_REG + cmp (%rax, %rcx, CHAR_SIZE), %CHAR_REG jne L(zero_end) -# endif - +# endif +# ifdef USE_AS_WCSCHR + /* NB: Multiply wchar_t count by 4 to get the number of bytes. + */ + leaq (%rax, %rcx, CHAR_SIZE), %rax +# else + addq %rcx, %rax +# endif ret + /* Since we can't combine the last 2x matches for CHAR_PER_VEC + == 64 we need return label for last VEC3. */ +# if CHAR_PER_VEC == 64 .p2align 4,, 8 +L(last_vec_x3): + addq $VEC_SIZE, %LOOP_REG +# endif + + /* Duplicate L(last_vec_x2) for VEC_SIZE == 64 because we can't + reuse L(first_vec_x3) due to register mismatch. 
*/ L(last_vec_x2): - bsfl %eax, %eax -# ifndef USE_AS_STRCHRNUL + bsf %VGPR(MASK_GPR), %VGPR(MASK_GPR) +# ifndef USE_AS_STRCHRNUL /* Check if match was null. */ - cmp (VEC_SIZE)(%rdi, %rax, CHAR_SIZE), %CHAR_REG + cmp (VEC_SIZE * 1)(%LOOP_REG, %MASK_GPR, CHAR_SIZE), %CHAR_REG jne L(zero_end) -# endif +# endif /* NB: Multiply sizeof char type (1 or 4) to get the number of bytes. */ - leaq (VEC_SIZE)(%rdi, %rax, CHAR_SIZE), %rax + leaq (VEC_SIZE * 1)(%LOOP_REG, %MASK_GPR, CHAR_SIZE), %rax ret +# endif - /* Cold case for crossing page with first load. */ - .p2align 4,, 8 + /* Cold case for crossing page with first load. */ + .p2align 4,, 10 +# ifndef USE_AS_STRCHRNUL L(cross_page_boundary): - movq %rdi, %rdx +# endif +L(cross_page_boundary_real): /* Align rdi. */ - andq $-VEC_SIZE, %rdi - VMOVA (%rdi), %YMM1 - /* Leaves only CHARS matching esi as 0. */ - vpxorq %YMM1, %YMM0, %YMM2 - VPMINU %YMM2, %YMM1, %YMM2 - /* Each bit in K0 represents a CHAR or a null byte in YMM1. */ - VPTESTN %YMM2, %YMM2, %k0 - kmovd %k0, %eax + xorq %rdi, %rax + VMOVA (PAGE_SIZE - VEC_SIZE)(%rax), %VMM(1) + /* Use high latency method of getting matches to save code size. + */ + + /* K1 has 1s where VEC(1) does NOT match esi. */ + VPCMP $4, %VMM(1), %VMATCH, %k1 + /* K0 has ones where K1 is 1 (non-match with esi), and non-zero + (null). */ + VPTEST %VMM(1), %VMM(1), %k0{%k1} + KMOV %k0, %VRAX /* Remove the leading bits. */ # ifdef USE_AS_WCSCHR - movl %edx, %SHIFT_REG + movl %edi, %VGPR_SZ(SHIFT_REG, 32) /* NB: Divide shift count by 4 since each bit in K1 represent 4 bytes. */ - sarl $2, %SHIFT_REG - andl $(CHAR_PER_VEC - 1), %SHIFT_REG + sarl $2, %VGPR_SZ(SHIFT_REG, 32) + andl $(CHAR_PER_VEC - 1), %VGPR_SZ(SHIFT_REG, 32) + + /* if wcschr we need to reverse matches as we can't rely on + signed shift to bring in ones. There is no sarx for + gpr8/16. Also note we can't use inc here as the lower bits + represent matches out of range so we can't rely on overflow.
+ */ + xorl $((1 << CHAR_PER_VEC)- 1), %eax +# endif + /* Use arithmetic shift so that leading 1s are filled in. */ + sarx %VGPR(SHIFT_REG), %VRAX, %VRAX + /* If eax is all ones then no matches for esi or NULL. */ + +# ifdef USE_AS_WCSCHR + test %VRAX, %VRAX +# else + inc %VRAX # endif - sarxl %SHIFT_REG, %eax, %eax - /* If eax is zero continue. */ - testl %eax, %eax jz L(cross_page_continue) - bsfl %eax, %eax + .p2align 4,, 10 +L(last_vec_x1_vec_size32): + bsf %VRAX, %VRAX # ifdef USE_AS_WCSCHR - /* NB: Multiply wchar_t count by 4 to get the number of - bytes. */ - leaq (%rdx, %rax, CHAR_SIZE), %rax + /* NB: Multiply wchar_t count by 4 to get the number of bytes. + */ + leaq (%rdi, %rax, CHAR_SIZE), %rax # else - addq %rdx, %rax + addq %rdi, %rax # endif # ifndef USE_AS_STRCHRNUL /* Check to see if match was CHAR or null. */ cmp (%rax), %CHAR_REG - je L(cross_page_ret) -L(zero_end): - xorl %eax, %eax -L(cross_page_ret): + jne L(zero_end_0) # endif ret +# ifndef USE_AS_STRCHRNUL +L(zero_end_0): + xorl %eax, %eax + ret +# endif END (STRCHR) #endif From patchwork Tue Oct 18 23:19:34 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Noah Goldstein X-Patchwork-Id: 1691739
To: libc-alpha@sourceware.org Subject: [PATCH v2 3/7] x86: Optimize strnlen-evex.S and implement with VMM headers Date: Tue, 18 Oct 2022 16:19:34 -0700 Message-Id: <20221018231938.3621554-3-goldstein.w.n@gmail.com> In-Reply-To: <20221018231938.3621554-1-goldstein.w.n@gmail.com> References: <20221018024901.3381469-1-goldstein.w.n@gmail.com> <20221018231938.3621554-1-goldstein.w.n@gmail.com> From: Noah Goldstein Reply-To: Noah Goldstein Optimizations are: 1.
Use the fact that bsf(0) leaves the destination unchanged to save a branch in the short-string case. 2. Restructure code so that small strings are given the hot path. - This is net-zero on the benchmark suite but in general makes sense, as smaller sizes are far more common. 3. Use more code-size-efficient instructions. - tzcnt ... -> bsf ... - vpcmpb $0 ... -> vpcmpeq ... 4. Align labels less aggressively, especially if doing so doesn't save fetch blocks or causes the basic block to span extra cache lines. The optimizations (especially for point 2) make the strnlen and strlen code essentially incompatible, so strnlen-evex is split out into a new file. Code Size Changes: strlen-evex.S : -23 bytes strnlen-evex.S : -167 bytes Net perf changes: Reported as geometric mean of all improvements / regressions from N=10 runs of the benchtests. Value is New Time / Old Time, so < 1.0 is an improvement and > 1.0 is a regression. strlen-evex.S : 0.992 (No real change) strnlen-evex.S : 0.947 Full results attached in email. Full check passes on x86-64.
--- sysdeps/x86_64/multiarch/strlen-evex.S | 544 +++++++----------------- sysdeps/x86_64/multiarch/strnlen-evex.S | 427 ++++++++++++++++++- sysdeps/x86_64/multiarch/wcsnlen-evex.S | 5 +- 3 files changed, 572 insertions(+), 404 deletions(-) diff --git a/sysdeps/x86_64/multiarch/strlen-evex.S b/sysdeps/x86_64/multiarch/strlen-evex.S index 2109ec2f7a..487846f098 100644 --- a/sysdeps/x86_64/multiarch/strlen-evex.S +++ b/sysdeps/x86_64/multiarch/strlen-evex.S @@ -26,466 +26,220 @@ # define STRLEN __strlen_evex # endif -# define VMOVA vmovdqa64 +# ifndef VEC_SIZE +# include "x86-evex256-vecs.h" +# endif # ifdef USE_AS_WCSLEN -# define VPCMP vpcmpd +# define VPCMPEQ vpcmpeqd +# define VPCMPNEQ vpcmpneqd +# define VPTESTN vptestnmd +# define VPTEST vptestmd # define VPMINU vpminud -# define SHIFT_REG ecx # define CHAR_SIZE 4 +# define CHAR_SIZE_SHIFT_REG(reg) sar $2, %reg # else -# define VPCMP vpcmpb +# define VPCMPEQ vpcmpeqb +# define VPCMPNEQ vpcmpneqb +# define VPTESTN vptestnmb +# define VPTEST vptestmb # define VPMINU vpminub -# define SHIFT_REG edx # define CHAR_SIZE 1 +# define CHAR_SIZE_SHIFT_REG(reg) + +# define REG_WIDTH VEC_SIZE # endif -# define XMMZERO xmm16 -# define YMMZERO ymm16 -# define YMM1 ymm17 -# define YMM2 ymm18 -# define YMM3 ymm19 -# define YMM4 ymm20 -# define YMM5 ymm21 -# define YMM6 ymm22 - -# define VEC_SIZE 32 -# define PAGE_SIZE 4096 -# define CHAR_PER_VEC (VEC_SIZE / CHAR_SIZE) - - .section .text.evex,"ax",@progbits -ENTRY (STRLEN) -# ifdef USE_AS_STRNLEN - /* Check zero length. */ - test %RSI_LP, %RSI_LP - jz L(zero) -# ifdef __ILP32__ - /* Clear the upper 32 bits. 
*/ - movl %esi, %esi -# endif - mov %RSI_LP, %R8_LP +# define CHAR_PER_VEC (VEC_SIZE / CHAR_SIZE) + +# include "reg-macros.h" + +# if CHAR_PER_VEC == 64 + +# define TAIL_RETURN_LBL first_vec_x2 +# define TAIL_RETURN_OFFSET (CHAR_PER_VEC * 2) + +# define FALLTHROUGH_RETURN_LBL first_vec_x3 +# define FALLTHROUGH_RETURN_OFFSET (CHAR_PER_VEC * 3) + +# else + +# define TAIL_RETURN_LBL first_vec_x3 +# define TAIL_RETURN_OFFSET (CHAR_PER_VEC * 3) + +# define FALLTHROUGH_RETURN_LBL first_vec_x2 +# define FALLTHROUGH_RETURN_OFFSET (CHAR_PER_VEC * 2) # endif + +# define XZERO VMM_128(0) +# define VZERO VMM(0) +# define PAGE_SIZE 4096 + + .section SECTION(.text), "ax", @progbits +ENTRY_P2ALIGN (STRLEN, 6) movl %edi, %eax - vpxorq %XMMZERO, %XMMZERO, %XMMZERO - /* Clear high bits from edi. Only keeping bits relevant to page - cross check. */ + vpxorq %XZERO, %XZERO, %XZERO andl $(PAGE_SIZE - 1), %eax - /* Check if we may cross page boundary with one vector load. */ cmpl $(PAGE_SIZE - VEC_SIZE), %eax ja L(cross_page_boundary) /* Check the first VEC_SIZE bytes. Each bit in K0 represents a null byte. */ - VPCMP $0, (%rdi), %YMMZERO, %k0 - kmovd %k0, %eax -# ifdef USE_AS_STRNLEN - /* If length < CHAR_PER_VEC handle special. */ - cmpq $CHAR_PER_VEC, %rsi - jbe L(first_vec_x0) -# endif - testl %eax, %eax + VPCMPEQ (%rdi), %VZERO, %k0 + KMOV %k0, %VRAX + test %VRAX, %VRAX jz L(aligned_more) - tzcntl %eax, %eax - ret -# ifdef USE_AS_STRNLEN -L(zero): - xorl %eax, %eax - ret - - .p2align 4 -L(first_vec_x0): - /* Set bit for max len so that tzcnt will return min of max len - and position of first match. */ - btsq %rsi, %rax - tzcntl %eax, %eax - ret -# endif - - .p2align 4 -L(first_vec_x1): - tzcntl %eax, %eax - /* Safe to use 32 bit instructions as these are only called for - size = [1, 159]. */ -# ifdef USE_AS_STRNLEN - /* Use ecx which was computed earlier to compute correct value. 
- */ - leal -(CHAR_PER_VEC * 4 + 1)(%rcx, %rax), %eax -# else - subl %edx, %edi -# ifdef USE_AS_WCSLEN - /* NB: Divide bytes by 4 to get the wchar_t count. */ - sarl $2, %edi -# endif - leal CHAR_PER_VEC(%rdi, %rax), %eax -# endif - ret - - .p2align 4 -L(first_vec_x2): - tzcntl %eax, %eax - /* Safe to use 32 bit instructions as these are only called for - size = [1, 159]. */ -# ifdef USE_AS_STRNLEN - /* Use ecx which was computed earlier to compute correct value. - */ - leal -(CHAR_PER_VEC * 3 + 1)(%rcx, %rax), %eax -# else - subl %edx, %edi -# ifdef USE_AS_WCSLEN - /* NB: Divide bytes by 4 to get the wchar_t count. */ - sarl $2, %edi -# endif - leal (CHAR_PER_VEC * 2)(%rdi, %rax), %eax -# endif + bsf %VRAX, %VRAX ret - .p2align 4 -L(first_vec_x3): - tzcntl %eax, %eax - /* Safe to use 32 bit instructions as these are only called for - size = [1, 159]. */ -# ifdef USE_AS_STRNLEN - /* Use ecx which was computed earlier to compute correct value. - */ - leal -(CHAR_PER_VEC * 2 + 1)(%rcx, %rax), %eax -# else - subl %edx, %edi -# ifdef USE_AS_WCSLEN - /* NB: Divide bytes by 4 to get the wchar_t count. */ - sarl $2, %edi -# endif - leal (CHAR_PER_VEC * 3)(%rdi, %rax), %eax -# endif - ret - - .p2align 4 + .p2align 4,, 8 L(first_vec_x4): - tzcntl %eax, %eax - /* Safe to use 32 bit instructions as these are only called for - size = [1, 159]. */ -# ifdef USE_AS_STRNLEN - /* Use ecx which was computed earlier to compute correct value. - */ - leal -(CHAR_PER_VEC + 1)(%rcx, %rax), %eax -# else - subl %edx, %edi -# ifdef USE_AS_WCSLEN - /* NB: Divide bytes by 4 to get the wchar_t count. */ - sarl $2, %edi -# endif + bsf %VRAX, %VRAX + subl %ecx, %edi + CHAR_SIZE_SHIFT_REG (edi) leal (CHAR_PER_VEC * 4)(%rdi, %rax), %eax -# endif ret - .p2align 5 + + + /* Aligned more for strnlen compares remaining length vs 2 * + CHAR_PER_VEC, 4 * CHAR_PER_VEC, and 8 * CHAR_PER_VEC before + going to the loop. */ + .p2align 4,, 10 L(aligned_more): - movq %rdi, %rdx - /* Align data to VEC_SIZE. 
*/ - andq $-(VEC_SIZE), %rdi + movq %rdi, %rcx + andq $(VEC_SIZE * -1), %rdi L(cross_page_continue): - /* Check the first 4 * VEC_SIZE. Only one VEC_SIZE at a time - since data is only aligned to VEC_SIZE. */ -# ifdef USE_AS_STRNLEN - /* + CHAR_SIZE because it simplies the logic in - last_4x_vec_or_less. */ - leaq (VEC_SIZE * 5 + CHAR_SIZE)(%rdi), %rcx - subq %rdx, %rcx -# ifdef USE_AS_WCSLEN - /* NB: Divide bytes by 4 to get the wchar_t count. */ - sarl $2, %ecx -# endif -# endif - /* Load first VEC regardless. */ - VPCMP $0, VEC_SIZE(%rdi), %YMMZERO, %k0 -# ifdef USE_AS_STRNLEN - /* Adjust length. If near end handle specially. */ - subq %rcx, %rsi - jb L(last_4x_vec_or_less) -# endif - kmovd %k0, %eax - testl %eax, %eax + /* Remaining length >= 2 * CHAR_PER_VEC so do VEC0/VEC1 without + rechecking bounds. */ + VPCMPEQ (VEC_SIZE * 1)(%rdi), %VZERO, %k0 + KMOV %k0, %VRAX + test %VRAX, %VRAX jnz L(first_vec_x1) - VPCMP $0, (VEC_SIZE * 2)(%rdi), %YMMZERO, %k0 - kmovd %k0, %eax - test %eax, %eax + VPCMPEQ (VEC_SIZE * 2)(%rdi), %VZERO, %k0 + KMOV %k0, %VRAX + test %VRAX, %VRAX jnz L(first_vec_x2) - VPCMP $0, (VEC_SIZE * 3)(%rdi), %YMMZERO, %k0 - kmovd %k0, %eax - testl %eax, %eax + VPCMPEQ (VEC_SIZE * 3)(%rdi), %VZERO, %k0 + KMOV %k0, %VRAX + test %VRAX, %VRAX jnz L(first_vec_x3) - VPCMP $0, (VEC_SIZE * 4)(%rdi), %YMMZERO, %k0 - kmovd %k0, %eax - testl %eax, %eax + VPCMPEQ (VEC_SIZE * 4)(%rdi), %VZERO, %k0 + KMOV %k0, %VRAX + test %VRAX, %VRAX jnz L(first_vec_x4) - addq $VEC_SIZE, %rdi -# ifdef USE_AS_STRNLEN - /* Check if at last VEC_SIZE * 4 length. */ - cmpq $(CHAR_PER_VEC * 4 - 1), %rsi - jbe L(last_4x_vec_or_less_load) - movl %edi, %ecx - andl $(VEC_SIZE * 4 - 1), %ecx -# ifdef USE_AS_WCSLEN - /* NB: Divide bytes by 4 to get the wchar_t count. */ - sarl $2, %ecx -# endif - /* Readjust length. */ - addq %rcx, %rsi -# endif - /* Align data to VEC_SIZE * 4. 
*/ + subq $(VEC_SIZE * -1), %rdi + +# if CHAR_PER_VEC == 64 + /* No partial register stalls on processors that we use evex512 + on and this saves code size. */ + xorb %dil, %dil +# else andq $-(VEC_SIZE * 4), %rdi +# endif + + /* Compare 4 * VEC at a time forward. */ .p2align 4 L(loop_4x_vec): - /* Load first VEC regardless. */ - VMOVA (VEC_SIZE * 4)(%rdi), %YMM1 -# ifdef USE_AS_STRNLEN - /* Break if at end of length. */ - subq $(CHAR_PER_VEC * 4), %rsi - jb L(last_4x_vec_or_less_cmpeq) -# endif - /* Save some code size by microfusing VPMINU with the load. Since - the matches in ymm2/ymm4 can only be returned if there where no - matches in ymm1/ymm3 respectively there is no issue with overlap. - */ - VPMINU (VEC_SIZE * 5)(%rdi), %YMM1, %YMM2 - VMOVA (VEC_SIZE * 6)(%rdi), %YMM3 - VPMINU (VEC_SIZE * 7)(%rdi), %YMM3, %YMM4 + VMOVA (VEC_SIZE * 4)(%rdi), %VMM(1) + VPMINU (VEC_SIZE * 5)(%rdi), %VMM(1), %VMM(2) + VMOVA (VEC_SIZE * 6)(%rdi), %VMM(3) + VPMINU (VEC_SIZE * 7)(%rdi), %VMM(3), %VMM(4) + VPTESTN %VMM(2), %VMM(2), %k0 + VPTESTN %VMM(4), %VMM(4), %k2 - VPCMP $0, %YMM2, %YMMZERO, %k0 - VPCMP $0, %YMM4, %YMMZERO, %k1 subq $-(VEC_SIZE * 4), %rdi - kortestd %k0, %k1 + KORTEST %k0, %k2 jz L(loop_4x_vec) - /* Check if end was in first half. */ - kmovd %k0, %eax - subq %rdx, %rdi -# ifdef USE_AS_WCSLEN - shrq $2, %rdi -# endif - testl %eax, %eax - jz L(second_vec_return) + VPTESTN %VMM(1), %VMM(1), %k1 + KMOV %k1, %VRAX + test %VRAX, %VRAX + jnz L(first_vec_x0) - VPCMP $0, %YMM1, %YMMZERO, %k2 - kmovd %k2, %edx - /* Combine VEC1 matches (edx) with VEC2 matches (eax). */ -# ifdef USE_AS_WCSLEN - sall $CHAR_PER_VEC, %eax - orl %edx, %eax - tzcntl %eax, %eax -# else - salq $CHAR_PER_VEC, %rax - orq %rdx, %rax - tzcntq %rax, %rax -# endif - addq %rdi, %rax - ret - - -# ifdef USE_AS_STRNLEN - -L(last_4x_vec_or_less_load): - /* Depending on entry adjust rdi / prepare first VEC in YMM1. 
*/ - VMOVA (VEC_SIZE * 4)(%rdi), %YMM1 -L(last_4x_vec_or_less_cmpeq): - VPCMP $0, %YMM1, %YMMZERO, %k0 - addq $(VEC_SIZE * 3), %rdi -L(last_4x_vec_or_less): - kmovd %k0, %eax - /* If remaining length > VEC_SIZE * 2. This works if esi is off by - VEC_SIZE * 4. */ - testl $(CHAR_PER_VEC * 2), %esi - jnz L(last_4x_vec) - - /* length may have been negative or positive by an offset of - CHAR_PER_VEC * 4 depending on where this was called from. This - fixes that. */ - andl $(CHAR_PER_VEC * 4 - 1), %esi - testl %eax, %eax - jnz L(last_vec_x1_check) + KMOV %k0, %VRAX + test %VRAX, %VRAX + jnz L(first_vec_x1) - /* Check the end of data. */ - subl $CHAR_PER_VEC, %esi - jb L(max) + VPTESTN %VMM(3), %VMM(3), %k0 - VPCMP $0, (VEC_SIZE * 2)(%rdi), %YMMZERO, %k0 - kmovd %k0, %eax - tzcntl %eax, %eax - /* Check the end of data. */ - cmpl %eax, %esi - jb L(max) - - subq %rdx, %rdi -# ifdef USE_AS_WCSLEN - /* NB: Divide bytes by 4 to get the wchar_t count. */ - sarq $2, %rdi -# endif - leaq (CHAR_PER_VEC * 2)(%rdi, %rax), %rax - ret -L(max): - movq %r8, %rax - ret -# endif - - /* Placed here in strnlen so that the jcc L(last_4x_vec_or_less) - in the 4x VEC loop can use 2 byte encoding. */ - .p2align 4 -L(second_vec_return): - VPCMP $0, %YMM3, %YMMZERO, %k0 - /* Combine YMM3 matches (k0) with YMM4 matches (k1). */ -# ifdef USE_AS_WCSLEN - kunpckbw %k0, %k1, %k0 - kmovd %k0, %eax - tzcntl %eax, %eax +# if CHAR_PER_VEC == 64 + KMOV %k0, %VRAX + test %VRAX, %VRAX + jnz L(first_vec_x2) + KMOV %k2, %VRAX # else - kunpckdq %k0, %k1, %k0 - kmovq %k0, %rax - tzcntq %rax, %rax + /* We can only combine last 2x VEC masks if CHAR_PER_VEC <= 32. + */ + kmovd %k2, %edx + kmovd %k0, %eax + salq $CHAR_PER_VEC, %rdx + orq %rdx, %rax # endif - leaq (CHAR_PER_VEC * 2)(%rdi, %rax), %rax - ret - -# ifdef USE_AS_STRNLEN -L(last_vec_x1_check): - tzcntl %eax, %eax - /* Check the end of data. 
*/ - cmpl %eax, %esi - jb L(max) - subq %rdx, %rdi -# ifdef USE_AS_WCSLEN - /* NB: Divide bytes by 4 to get the wchar_t count. */ - sarq $2, %rdi -# endif - leaq (CHAR_PER_VEC)(%rdi, %rax), %rax + /* first_vec_x3 for strlen-ZMM and first_vec_x2 for strlen-YMM. + */ + .p2align 4,, 2 +L(FALLTHROUGH_RETURN_LBL): + bsfq %rax, %rax + subq %rcx, %rdi + CHAR_SIZE_SHIFT_REG (rdi) + leaq (FALLTHROUGH_RETURN_OFFSET)(%rdi, %rax), %rax ret - .p2align 4 -L(last_4x_vec): - /* Test first 2x VEC normally. */ - testl %eax, %eax - jnz L(last_vec_x1) - - VPCMP $0, (VEC_SIZE * 2)(%rdi), %YMMZERO, %k0 - kmovd %k0, %eax - testl %eax, %eax - jnz L(last_vec_x2) - - /* Normalize length. */ - andl $(CHAR_PER_VEC * 4 - 1), %esi - VPCMP $0, (VEC_SIZE * 3)(%rdi), %YMMZERO, %k0 - kmovd %k0, %eax - testl %eax, %eax - jnz L(last_vec_x3) - - /* Check the end of data. */ - subl $(CHAR_PER_VEC * 3), %esi - jb L(max) - - VPCMP $0, (VEC_SIZE * 4)(%rdi), %YMMZERO, %k0 - kmovd %k0, %eax - tzcntl %eax, %eax - /* Check the end of data. */ - cmpl %eax, %esi - jb L(max_end) - - subq %rdx, %rdi -# ifdef USE_AS_WCSLEN - /* NB: Divide bytes by 4 to get the wchar_t count. */ - sarq $2, %rdi -# endif - leaq (CHAR_PER_VEC * 4)(%rdi, %rax), %rax + .p2align 4,, 8 +L(first_vec_x0): + bsf %VRAX, %VRAX + sub %rcx, %rdi + CHAR_SIZE_SHIFT_REG (rdi) + addq %rdi, %rax ret - .p2align 4 -L(last_vec_x1): - tzcntl %eax, %eax - subq %rdx, %rdi -# ifdef USE_AS_WCSLEN - /* NB: Divide bytes by 4 to get the wchar_t count. */ - sarq $2, %rdi -# endif + .p2align 4,, 10 +L(first_vec_x1): + bsf %VRAX, %VRAX + sub %rcx, %rdi + CHAR_SIZE_SHIFT_REG (rdi) leaq (CHAR_PER_VEC)(%rdi, %rax), %rax ret - .p2align 4 -L(last_vec_x2): - tzcntl %eax, %eax - subq %rdx, %rdi -# ifdef USE_AS_WCSLEN - /* NB: Divide bytes by 4 to get the wchar_t count. */ - sarq $2, %rdi -# endif - leaq (CHAR_PER_VEC * 2)(%rdi, %rax), %rax - ret - - .p2align 4 -L(last_vec_x3): - tzcntl %eax, %eax - subl $(CHAR_PER_VEC * 2), %esi - /* Check the end of data. 
*/ - cmpl %eax, %esi - jb L(max_end) - subq %rdx, %rdi -# ifdef USE_AS_WCSLEN - /* NB: Divide bytes by 4 to get the wchar_t count. */ - sarq $2, %rdi -# endif - leaq (CHAR_PER_VEC * 3)(%rdi, %rax), %rax - ret -L(max_end): - movq %r8, %rax + .p2align 4,, 10 + /* first_vec_x2 for strlen-ZMM and first_vec_x3 for strlen-YMM. + */ +L(TAIL_RETURN_LBL): + bsf %VRAX, %VRAX + sub %VRCX, %VRDI + CHAR_SIZE_SHIFT_REG (VRDI) + lea (TAIL_RETURN_OFFSET)(%rdi, %rax), %VRAX ret -# endif - /* Cold case for crossing page with first load. */ - .p2align 4 + .p2align 4,, 8 L(cross_page_boundary): - movq %rdi, %rdx + movq %rdi, %rcx /* Align data to VEC_SIZE. */ andq $-VEC_SIZE, %rdi - VPCMP $0, (%rdi), %YMMZERO, %k0 - kmovd %k0, %eax - /* Remove the leading bytes. */ + + VPCMPEQ (%rdi), %VZERO, %k0 + + KMOV %k0, %VRAX # ifdef USE_AS_WCSLEN - /* NB: Divide shift count by 4 since each bit in K0 represent 4 - bytes. */ - movl %edx, %ecx - shrl $2, %ecx - andl $(CHAR_PER_VEC - 1), %ecx -# endif - /* SHIFT_REG is ecx for USE_AS_WCSLEN and edx otherwise. */ - sarxl %SHIFT_REG, %eax, %eax + movl %ecx, %edx + shrl $2, %edx + andl $(CHAR_PER_VEC - 1), %edx + shrx %edx, %eax, %eax testl %eax, %eax -# ifndef USE_AS_STRNLEN - jz L(cross_page_continue) - tzcntl %eax, %eax - ret # else - jnz L(cross_page_less_vec) -# ifndef USE_AS_WCSLEN - movl %edx, %ecx - andl $(CHAR_PER_VEC - 1), %ecx -# endif - movl $CHAR_PER_VEC, %eax - subl %ecx, %eax - /* Check the end of data. */ - cmpq %rax, %rsi - ja L(cross_page_continue) - movl %esi, %eax - ret -L(cross_page_less_vec): - tzcntl %eax, %eax - /* Select min of length and position of first null. 
*/ - cmpq %rax, %rsi - cmovb %esi, %eax - ret + shr %cl, %VRAX # endif + jz L(cross_page_continue) + bsf %VRAX, %VRAX + ret END (STRLEN) #endif diff --git a/sysdeps/x86_64/multiarch/strnlen-evex.S b/sysdeps/x86_64/multiarch/strnlen-evex.S index 64a9fc2606..443a32a749 100644 --- a/sysdeps/x86_64/multiarch/strnlen-evex.S +++ b/sysdeps/x86_64/multiarch/strnlen-evex.S @@ -1,8 +1,423 @@ -#ifndef STRNLEN -# define STRNLEN __strnlen_evex -#endif +/* strnlen/wcsnlen optimized with 256-bit EVEX instructions. + Copyright (C) 2022 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . 
*/ + +#include +#include + +#if ISA_SHOULD_BUILD (4) + +# ifndef VEC_SIZE +# include "x86-evex256-vecs.h" +# endif + + +# ifndef STRNLEN +# define STRNLEN __strnlen_evex +# endif + +# ifdef USE_AS_WCSLEN +# define VPCMPEQ vpcmpeqd +# define VPCMPNEQ vpcmpneqd +# define VPTESTN vptestnmd +# define VPTEST vptestmd +# define VPMINU vpminud +# define CHAR_SIZE 4 + +# else +# define VPCMPEQ vpcmpeqb +# define VPCMPNEQ vpcmpneqb +# define VPTESTN vptestnmb +# define VPTEST vptestmb +# define VPMINU vpminub +# define CHAR_SIZE 1 + +# define REG_WIDTH VEC_SIZE +# endif + +# define CHAR_PER_VEC (VEC_SIZE / CHAR_SIZE) + +# include "reg-macros.h" + +# if CHAR_PER_VEC == 32 +# define SUB_SHORT(imm, reg) subb $(imm), %VGPR_SZ(reg, 8) +# else +# define SUB_SHORT(imm, reg) subl $(imm), %VGPR_SZ(reg, 32) +# endif + + + +# if CHAR_PER_VEC == 64 +# define FALLTHROUGH_RETURN_OFFSET (CHAR_PER_VEC * 3) +# else +# define FALLTHROUGH_RETURN_OFFSET (CHAR_PER_VEC * 2) +# endif + + +# define XZERO VMM_128(0) +# define VZERO VMM(0) +# define PAGE_SIZE 4096 + + .section SECTION(.text), "ax", @progbits +ENTRY_P2ALIGN (STRNLEN, 6) + /* Check zero length. */ + test %RSI_LP, %RSI_LP + jz L(zero) +# ifdef __ILP32__ + /* Clear the upper 32 bits. */ + movl %esi, %esi +# endif + + movl %edi, %eax + vpxorq %XZERO, %XZERO, %XZERO + andl $(PAGE_SIZE - 1), %eax + cmpl $(PAGE_SIZE - VEC_SIZE), %eax + ja L(cross_page_boundary) + + /* Check the first VEC_SIZE bytes. Each bit in K0 represents a + null byte. */ + VPCMPEQ (%rdi), %VZERO, %k0 + + KMOV %k0, %VRCX + movq %rsi, %rax + + /* If src (rcx) is zero, bsf does not change the result. NB: + Must use 64-bit bsf here so that upper bits of len are not + cleared. */ + bsfq %rcx, %rax + /* If rax > CHAR_PER_VEC then rcx must have been zero (no null + CHAR) and rsi must be > CHAR_PER_VEC. */ + cmpq $CHAR_PER_VEC, %rax + ja L(more_1x_vec) + /* Check if first match in bounds. 
*/ + cmpq %rax, %rsi + cmovb %esi, %eax + ret + + +# if CHAR_PER_VEC != 32 + .p2align 4,, 2 +L(zero): +L(max_0): + movl %esi, %eax + ret +# endif + + /* Aligned more for strnlen compares remaining length vs 2 * + CHAR_PER_VEC, 4 * CHAR_PER_VEC, and 8 * CHAR_PER_VEC before + going to the loop. */ + .p2align 4,, 10 +L(more_1x_vec): +L(cross_page_continue): + /* Compute number of words checked after aligning. */ +# ifdef USE_AS_WCSLEN + /* Need to compute directly for wcslen as CHAR_SIZE * rsi can + overflow. */ + movq %rdi, %rax + andq $(VEC_SIZE * -1), %rdi + subq %rdi, %rax + sarq $2, %rax + leaq -(CHAR_PER_VEC * 1)(%rax, %rsi), %rax +# else + leaq (VEC_SIZE * -1)(%rsi, %rdi), %rax + andq $(VEC_SIZE * -1), %rdi + subq %rdi, %rax +# endif + + + VPCMPEQ VEC_SIZE(%rdi), %VZERO, %k0 + + cmpq $(CHAR_PER_VEC * 2), %rax + ja L(more_2x_vec) + +L(last_2x_vec_or_less): + KMOV %k0, %VRDX + test %VRDX, %VRDX + jnz L(last_vec_check) + + /* Check the end of data. */ + SUB_SHORT (CHAR_PER_VEC, rax) + jbe L(max_0) + VPCMPEQ (VEC_SIZE * 2)(%rdi), %VZERO, %k0 + KMOV %k0, %VRDX + test %VRDX, %VRDX + jz L(max_0) + /* Best place for LAST_VEC_CHECK if ZMM. */ + .p2align 4,, 8 +L(last_vec_check): + bsf %VRDX, %VRDX + sub %eax, %edx + lea (%rsi, %rdx), %eax + cmovae %esi, %eax + ret + +# if CHAR_PER_VEC == 32 + .p2align 4,, 2 +L(zero): +L(max_0): + movl %esi, %eax + ret +# endif + + .p2align 4,, 8 +L(last_4x_vec_or_less): + addl $(CHAR_PER_VEC * -4), %eax + VPCMPEQ (VEC_SIZE * 5)(%rdi), %VZERO, %k0 + subq $(VEC_SIZE * -4), %rdi + cmpl $(CHAR_PER_VEC * 2), %eax + jbe L(last_2x_vec_or_less) + + .p2align 4,, 6 +L(more_2x_vec): + /* Remaining length >= 2 * CHAR_PER_VEC so do VEC0/VEC1 without + rechecking bounds. 
*/ -#define USE_AS_STRNLEN 1 -#define STRLEN STRNLEN + KMOV %k0, %VRDX -#include "strlen-evex.S" + test %VRDX, %VRDX + jnz L(first_vec_x1) + + VPCMPEQ (VEC_SIZE * 2)(%rdi), %VZERO, %k0 + KMOV %k0, %VRDX + test %VRDX, %VRDX + jnz L(first_vec_x2) + + cmpq $(CHAR_PER_VEC * 4), %rax + ja L(more_4x_vec) + + + VPCMPEQ (VEC_SIZE * 3)(%rdi), %VZERO, %k0 + KMOV %k0, %VRDX + addl $(CHAR_PER_VEC * -2), %eax + test %VRDX, %VRDX + jnz L(last_vec_check) + + subl $(CHAR_PER_VEC), %eax + jbe L(max_1) + + VPCMPEQ (VEC_SIZE * 4)(%rdi), %VZERO, %k0 + KMOV %k0, %VRDX + + test %VRDX, %VRDX + jnz L(last_vec_check) +L(max_1): + movl %esi, %eax + ret + + .p2align 4,, 3 +L(first_vec_x2): +# if VEC_SIZE == 64 + /* If VEC_SIZE == 64 we can fit logic for full return label in + spare bytes before next cache line. */ + bsf %VRDX, %VRDX + sub %eax, %esi + leal (CHAR_PER_VEC * 1)(%rsi, %rdx), %eax + ret + .p2align 4,, 6 +# else + addl $CHAR_PER_VEC, %esi +# endif +L(first_vec_x1): + bsf %VRDX, %VRDX + sub %eax, %esi + leal (CHAR_PER_VEC * 0)(%rsi, %rdx), %eax + ret + + + .p2align 4,, 6 +L(first_vec_x4): +# if VEC_SIZE == 64 + /* If VEC_SIZE == 64 we can fit logic for full return label in + spare bytes before next cache line. */ + bsf %VRDX, %VRDX + sub %eax, %esi + leal (CHAR_PER_VEC * 3)(%rsi, %rdx), %eax + ret + .p2align 4,, 6 +# else + addl $CHAR_PER_VEC, %esi +# endif +L(first_vec_x3): + bsf %VRDX, %VRDX + sub %eax, %esi + leal (CHAR_PER_VEC * 2)(%rsi, %rdx), %eax + ret + + .p2align 4,, 5 +L(more_4x_vec): + VPCMPEQ (VEC_SIZE * 3)(%rdi), %VZERO, %k0 + KMOV %k0, %VRDX + test %VRDX, %VRDX + jnz L(first_vec_x3) + + VPCMPEQ (VEC_SIZE * 4)(%rdi), %VZERO, %k0 + KMOV %k0, %VRDX + test %VRDX, %VRDX + jnz L(first_vec_x4) + + /* Check if at last VEC_SIZE * 4 length before aligning for the + loop. */ + cmpq $(CHAR_PER_VEC * 8), %rax + jbe L(last_4x_vec_or_less) + + + /* Compute number of words checked after aligning. 
+# ifdef USE_AS_WCSLEN + /* Need to compute directly for wcslen as CHAR_SIZE * rsi can + overflow. */ + leaq (VEC_SIZE * -3)(%rdi), %rdx +# else + leaq (VEC_SIZE * -3)(%rdi, %rax), %rax +# endif + + subq $(VEC_SIZE * -1), %rdi + + /* Align data to VEC_SIZE * 4. */ +# if VEC_SIZE == 64 + /* Saves code size. No evex512 processor has partial register + stalls. If that changes, this can be replaced with `andq + $-(VEC_SIZE * 4), %rdi`. */ + xorb %dil, %dil +# else + andq $-(VEC_SIZE * 4), %rdi +# endif + +# ifdef USE_AS_WCSLEN + subq %rdi, %rdx + sarq $2, %rdx + addq %rdx, %rax +# else + subq %rdi, %rax +# endif + /* Compare 4 * VEC at a time forward. */ + .p2align 4,, 11 +L(loop_4x_vec): + VMOVA (VEC_SIZE * 4)(%rdi), %VMM(1) + VPMINU (VEC_SIZE * 5)(%rdi), %VMM(1), %VMM(2) + VMOVA (VEC_SIZE * 6)(%rdi), %VMM(3) + VPMINU (VEC_SIZE * 7)(%rdi), %VMM(3), %VMM(4) + VPTESTN %VMM(2), %VMM(2), %k0 + VPTESTN %VMM(4), %VMM(4), %k2 + subq $-(VEC_SIZE * 4), %rdi + /* Break if at end of length. */ + subq $(CHAR_PER_VEC * 4), %rax + jbe L(loop_len_end) + + + KORTEST %k0, %k2 + jz L(loop_4x_vec) + + +L(loop_last_4x_vec): + movq %rsi, %rcx + subq %rax, %rsi + VPTESTN %VMM(1), %VMM(1), %k1 + KMOV %k1, %VRDX + test %VRDX, %VRDX + jnz L(last_vec_x0) + + KMOV %k0, %VRDX + test %VRDX, %VRDX + jnz L(last_vec_x1) + + VPTESTN %VMM(3), %VMM(3), %k0 + + /* Separate logic for VEC_SIZE == 64 and VEC_SIZE == 32 for + returning last 2x VEC. For VEC_SIZE == 64 we test each VEC + individually, for VEC_SIZE == 32 we combine them in a single + 64-bit GPR. */ +# if CHAR_PER_VEC == 64 + KMOV %k0, %VRDX + test %VRDX, %VRDX + jnz L(last_vec_x2) + KMOV %k2, %VRDX +# else + /* We can only combine last 2x VEC masks if CHAR_PER_VEC <= 32. + */ + kmovd %k2, %edx + kmovd %k0, %eax + salq $CHAR_PER_VEC, %rdx + orq %rax, %rdx +# endif + + /* first_vec_x3 for strlen-ZMM and first_vec_x2 for strlen-YMM.
+ */ + bsfq %rdx, %rdx + leaq (FALLTHROUGH_RETURN_OFFSET - CHAR_PER_VEC * 4)(%rsi, %rdx), %rax + cmpq %rax, %rcx + cmovb %rcx, %rax + ret + + /* Handle last 4x VEC after loop. All VECs have been loaded. */ + .p2align 4,, 4 +L(loop_len_end): + KORTEST %k0, %k2 + jnz L(loop_last_4x_vec) + movq %rsi, %rax + ret + + +# if CHAR_PER_VEC == 64 + /* Since we can't combine the last 2x VEC for VEC_SIZE == 64 + need return label for it. */ + .p2align 4,, 8 +L(last_vec_x2): + bsf %VRDX, %VRDX + leaq (CHAR_PER_VEC * -2)(%rsi, %rdx), %rax + cmpq %rax, %rcx + cmovb %rcx, %rax + ret +# endif + + + .p2align 4,, 10 +L(last_vec_x1): + addq $CHAR_PER_VEC, %rsi +L(last_vec_x0): + bsf %VRDX, %VRDX + leaq (CHAR_PER_VEC * -4)(%rsi, %rdx), %rax + cmpq %rax, %rcx + cmovb %rcx, %rax + ret + + + .p2align 4,, 8 +L(cross_page_boundary): + /* Align data to VEC_SIZE. */ + movq %rdi, %rcx + andq $-VEC_SIZE, %rcx + VPCMPEQ (%rcx), %VZERO, %k0 + + KMOV %k0, %VRCX +# ifdef USE_AS_WCSLEN + shrl $2, %eax + andl $(CHAR_PER_VEC - 1), %eax +# endif + shrx %VRAX, %VRCX, %VRCX + + negl %eax + andl $(CHAR_PER_VEC - 1), %eax + movq %rsi, %rdx + bsf %VRCX, %VRDX + cmpq %rax, %rdx + ja L(cross_page_continue) + movl %edx, %eax + cmpq %rdx, %rsi + cmovb %esi, %eax + ret +END (STRNLEN) +#endif diff --git a/sysdeps/x86_64/multiarch/wcsnlen-evex.S b/sysdeps/x86_64/multiarch/wcsnlen-evex.S index e2aad94c1e..57a7e93fbf 100644 --- a/sysdeps/x86_64/multiarch/wcsnlen-evex.S +++ b/sysdeps/x86_64/multiarch/wcsnlen-evex.S @@ -2,8 +2,7 @@ # define WCSNLEN __wcsnlen_evex #endif -#define STRLEN WCSNLEN +#define STRNLEN WCSNLEN #define USE_AS_WCSLEN 1 -#define USE_AS_STRNLEN 1 -#include "strlen-evex.S" +#include "strnlen-evex.S" From patchwork Tue Oct 18 23:19:35 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Noah Goldstein X-Patchwork-Id: 1691736 Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@legolas.ozlabs.org 
From: Noah Goldstein
To: libc-alpha@sourceware.org
Subject: [PATCH v2 4/7] x86: Optimize memrchr-evex.S
Date: Tue, 18 Oct 2022 16:19:35 -0700
Message-Id: <20221018231938.3621554-4-goldstein.w.n@gmail.com>
In-Reply-To: <20221018231938.3621554-1-goldstein.w.n@gmail.com>
References: <20221018024901.3381469-1-goldstein.w.n@gmail.com> <20221018231938.3621554-1-goldstein.w.n@gmail.com>

Optimizations are:
1. Use the fact that lzcnt(0) -> VEC_SIZE for memchr to save a branch
   in the short-string case.
2. Save several instructions in the len = [VEC_SIZE, 4 * VEC_SIZE] case.
3. Use more code-size-efficient instructions.
	- tzcnt ...     -> bsf ...
	- vpcmpb $0 ... -> vpcmpeq ...

Code Size Changes:
memrchr-evex.S	: -29 bytes

Net perf changes:
Reported as geometric mean of all improvements / regressions from N=10
runs of the benchtests. Value is New Time / Old Time, so < 1.0 is an
improvement and > 1.0 is a regression.

memrchr-evex.S	: 0.949 (mostly from improvements in small strings)

Full results attached in email.
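Optimization (1) above can be modeled in scalar C. This is an illustrative sketch, not the patch's code: the names `lzcnt32` and `tail_match` are invented here, and 32 stands in for VEC_SIZE. The point is that hardware LZCNT (unlike BSR) is defined on a zero input and returns the operand width, so the single comparison `len <= lzcnt(mask)` covers both "no match at all" and "match past the end of the buffer", saving a separate branch on `mask == 0`.

```c
#include <assert.h>
#include <stdint.h>

/* Model of the hardware LZCNT instruction: lzcnt(0) is defined and
   returns the operand width (32 here; VEC_SIZE in the patch).  */
static unsigned lzcnt32(uint32_t x)
{
    return x ? (unsigned)__builtin_clz(x) : 32;
}

/* Sketch of the short-string tail check for a backwards scan:
   `mask` has one bit per byte of the final vector, bit i set iff
   byte i matched, and the buffer covers the last `len` bytes of the
   vector.  Because lzcnt32(0) == 32 >= any valid len, the single
   `len <= lz` test handles both a zero mask and an out-of-bounds
   match.  Returns the byte index of the last in-bounds match, or -1
   if there is none.  */
static int tail_match(uint32_t mask, unsigned len)
{
    unsigned lz = lzcnt32(mask);
    if (len <= lz)
        return -1;              /* zero mask OR match out of bounds */
    return (int)(31 - lz);      /* highest in-bounds set bit */
}
```

The branch saved is exactly the `test %ecx, %ecx; jz ...` that a BSR-based version would need before it could safely take the bit scan.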
Full check passes on x86-64. --- sysdeps/x86_64/multiarch/memrchr-evex.S | 538 ++++++++++++++---------- 1 file changed, 324 insertions(+), 214 deletions(-) diff --git a/sysdeps/x86_64/multiarch/memrchr-evex.S b/sysdeps/x86_64/multiarch/memrchr-evex.S index 550b328c5a..dbcf52808f 100644 --- a/sysdeps/x86_64/multiarch/memrchr-evex.S +++ b/sysdeps/x86_64/multiarch/memrchr-evex.S @@ -21,17 +21,19 @@ #if ISA_SHOULD_BUILD (4) # include -# include "x86-evex256-vecs.h" -# if VEC_SIZE != 32 -# error "VEC_SIZE != 32 unimplemented" + +# ifndef VEC_SIZE +# include "x86-evex256-vecs.h" # endif +# include "reg-macros.h" + # ifndef MEMRCHR -# define MEMRCHR __memrchr_evex +# define MEMRCHR __memrchr_evex # endif -# define PAGE_SIZE 4096 -# define VMMMATCH VMM(0) +# define PAGE_SIZE 4096 +# define VMATCH VMM(0) .section SECTION(.text), "ax", @progbits ENTRY_P2ALIGN(MEMRCHR, 6) @@ -43,294 +45,402 @@ ENTRY_P2ALIGN(MEMRCHR, 6) # endif jz L(zero_0) - /* Get end pointer. Minus one for two reasons. 1) It is necessary for a - correct page cross check and 2) it correctly sets up end ptr to be - subtract by lzcnt aligned. */ + /* Get end pointer. Minus one for three reasons. 1) It is + necessary for a correct page cross check and 2) it correctly + sets up end ptr to be subtract by lzcnt aligned. 3) it is a + necessary step in aligning ptr. */ leaq -1(%rdi, %rdx), %rax - vpbroadcastb %esi, %VMMMATCH + vpbroadcastb %esi, %VMATCH /* Check if we can load 1x VEC without cross a page. */ testl $(PAGE_SIZE - VEC_SIZE), %eax jz L(page_cross) - /* Don't use rax for pointer here because EVEX has better encoding with - offset % VEC_SIZE == 0. */ - vpcmpb $0, -(VEC_SIZE)(%rdi, %rdx), %VMMMATCH, %k0 - kmovd %k0, %ecx - - /* Fall through for rdx (len) <= VEC_SIZE (expect small sizes). */ - cmpq $VEC_SIZE, %rdx - ja L(more_1x_vec) -L(ret_vec_x0_test): - - /* If ecx is zero (no matches) lzcnt will set it 32 (VEC_SIZE) which - will guarantee edx (len) is less than it. 
*/ - lzcntl %ecx, %ecx - cmpl %ecx, %edx - jle L(zero_0) - subq %rcx, %rax + /* Don't use rax for pointer here because EVEX has better + encoding with offset % VEC_SIZE == 0. */ + vpcmpeqb (VEC_SIZE * -1)(%rdi, %rdx), %VMATCH, %k0 + KMOV %k0, %VRCX + + /* If rcx is zero then lzcnt -> VEC_SIZE. NB: there is a + already a dependency between rcx and rsi so no worries about + false-dep here. */ + lzcnt %VRCX, %VRSI + /* If rdx <= rsi then either 1) rcx was non-zero (there was a + match) but it was out of bounds or 2) rcx was zero and rdx + was <= VEC_SIZE so we are done scanning. */ + cmpq %rsi, %rdx + /* NB: Use branch to return zero/non-zero. Common usage will + branch on result of function (if return is null/non-null). + This branch can be used to predict the ensuing one so there + is no reason to extend the data-dependency with cmovcc. */ + jbe L(zero_0) + + /* If rcx is zero then len must be > RDX, otherwise since we + already tested len vs lzcnt(rcx) (in rsi) we are good to + return this match. */ + test %VRCX, %VRCX + jz L(more_1x_vec) + subq %rsi, %rax ret - /* Fits in aligning bytes of first cache line. */ + /* Fits in aligning bytes of first cache line for VEC_SIZE == + 32. */ +# if VEC_SIZE == 32 + .p2align 4,, 2 L(zero_0): xorl %eax, %eax ret - - .p2align 4,, 9 -L(ret_vec_x0_dec): - decq %rax -L(ret_vec_x0): - lzcntl %ecx, %ecx - subq %rcx, %rax - ret +# endif .p2align 4,, 10 L(more_1x_vec): - testl %ecx, %ecx - jnz L(ret_vec_x0) - /* Align rax (pointer to string). */ andq $-VEC_SIZE, %rax - +L(page_cross_continue): /* Recompute length after aligning. */ - movq %rax, %rdx + subq %rdi, %rax - /* Need no matter what. */ - vpcmpb $0, -(VEC_SIZE)(%rax), %VMMMATCH, %k0 - kmovd %k0, %ecx - - subq %rdi, %rdx - - cmpq $(VEC_SIZE * 2), %rdx + cmpq $(VEC_SIZE * 2), %rax ja L(more_2x_vec) + L(last_2x_vec): + vpcmpeqb (VEC_SIZE * -1)(%rdi, %rax), %VMATCH, %k0 + KMOV %k0, %VRCX - /* Must dec rax because L(ret_vec_x0_test) expects it. 
*/ - decq %rax - cmpl $VEC_SIZE, %edx - jbe L(ret_vec_x0_test) + test %VRCX, %VRCX + jnz L(ret_vec_x0_test) - testl %ecx, %ecx - jnz L(ret_vec_x0) + /* If VEC_SIZE == 64 need to subtract because lzcntq won't + implicitly add VEC_SIZE to match position. */ +# if VEC_SIZE == 64 + subl $VEC_SIZE, %eax +# else + cmpb $VEC_SIZE, %al +# endif + jle L(zero_2) - /* Don't use rax for pointer here because EVEX has better encoding with - offset % VEC_SIZE == 0. */ - vpcmpb $0, -(VEC_SIZE * 2)(%rdi, %rdx), %VMMMATCH, %k0 - kmovd %k0, %ecx - /* NB: 64-bit lzcnt. This will naturally add 32 to position. */ + /* We adjusted rax (length) for VEC_SIZE == 64 so need seperate + offsets. */ +# if VEC_SIZE == 64 + vpcmpeqb (VEC_SIZE * -1)(%rdi, %rax), %VMATCH, %k0 +# else + vpcmpeqb (VEC_SIZE * -2)(%rdi, %rax), %VMATCH, %k0 +# endif + KMOV %k0, %VRCX + /* NB: 64-bit lzcnt. This will naturally add 32 to position for + VEC_SIZE == 32. */ lzcntq %rcx, %rcx - cmpl %ecx, %edx - jle L(zero_0) - subq %rcx, %rax - ret - - /* Inexpensive place to put this regarding code size / target alignments - / ICache NLP. Necessary for 2-byte encoding of jump to page cross - case which in turn is necessary for hot path (len <= VEC_SIZE) to fit - in first cache line. */ -L(page_cross): - movq %rax, %rsi - andq $-VEC_SIZE, %rsi - vpcmpb $0, (%rsi), %VMMMATCH, %k0 - kmovd %k0, %r8d - /* Shift out negative alignment (because we are starting from endptr and - working backwards). */ - movl %eax, %ecx - /* notl because eax already has endptr - 1. (-x = ~(x - 1)). */ - notl %ecx - shlxl %ecx, %r8d, %ecx - cmpq %rdi, %rsi - ja L(more_1x_vec) - lzcntl %ecx, %ecx - cmpl %ecx, %edx - jle L(zero_1) - subq %rcx, %rax + subl %ecx, %eax + ja L(first_vec_x1_ret) + /* If VEC_SIZE == 64 put L(zero_0) here as we can't fit in the + first cache line (this is the second cache line). 
*/ +# if VEC_SIZE == 64 +L(zero_0): +# endif +L(zero_2): + xorl %eax, %eax ret - /* Continue creating zero labels that fit in aligning bytes and get - 2-byte encoding / are in the same cache line as condition. */ -L(zero_1): - xorl %eax, %eax + /* NB: Fits in aligning bytes before next cache line for + VEC_SIZE == 32. For VEC_SIZE == 64 this is attached to + L(first_vec_x0_test). */ +# if VEC_SIZE == 32 +L(first_vec_x1_ret): + leaq -1(%rdi, %rax), %rax ret +# endif - .p2align 4,, 8 -L(ret_vec_x1): - /* This will naturally add 32 to position. */ - bsrl %ecx, %ecx - leaq -(VEC_SIZE * 2)(%rcx, %rax), %rax + .p2align 4,, 6 +L(ret_vec_x0_test): + lzcnt %VRCX, %VRCX + subl %ecx, %eax + jle L(zero_2) +# if VEC_SIZE == 64 + /* Reuse code at the end of L(ret_vec_x0_test) as we can't fit + L(first_vec_x1_ret) in the same cache line as its jmp base + so we might as well save code size. */ +L(first_vec_x1_ret): +# endif + leaq -1(%rdi, %rax), %rax ret - .p2align 4,, 8 + .p2align 4,, 6 +L(loop_last_4x_vec): + /* Compute remaining length. */ + subl %edi, %eax +L(last_4x_vec): + cmpl $(VEC_SIZE * 2), %eax + jle L(last_2x_vec) +# if VEC_SIZE == 32 + /* Only align for VEC_SIZE == 32. For VEC_SIZE == 64 we need + the spare bytes to align the loop properly. */ + .p2align 4,, 10 +# endif L(more_2x_vec): - testl %ecx, %ecx - jnz L(ret_vec_x0_dec) - vpcmpb $0, -(VEC_SIZE * 2)(%rax), %VMMMATCH, %k0 - kmovd %k0, %ecx - testl %ecx, %ecx - jnz L(ret_vec_x1) + /* Length > VEC_SIZE * 2 so check the first 2x VEC for match and + return if either hit. */ + vpcmpeqb (VEC_SIZE * -1)(%rdi, %rax), %VMATCH, %k0 + KMOV %k0, %VRCX + + test %VRCX, %VRCX + jnz L(first_vec_x0) + + vpcmpeqb (VEC_SIZE * -2)(%rdi, %rax), %VMATCH, %k0 + KMOV %k0, %VRCX + test %VRCX, %VRCX + jnz L(first_vec_x1) /* Need no matter what. 
*/ - vpcmpb $0, -(VEC_SIZE * 3)(%rax), %VMMMATCH, %k0 - kmovd %k0, %ecx + vpcmpeqb (VEC_SIZE * -3)(%rdi, %rax), %VMATCH, %k0 + KMOV %k0, %VRCX - subq $(VEC_SIZE * 4), %rdx + /* Check if we are near the end. */ + subq $(VEC_SIZE * 4), %rax ja L(more_4x_vec) - cmpl $(VEC_SIZE * -1), %edx - jle L(ret_vec_x2_test) -L(last_vec): - testl %ecx, %ecx - jnz L(ret_vec_x2) + test %VRCX, %VRCX + jnz L(first_vec_x2_test) + /* Adjust length for final check and check if we are at the end. + */ + addl $(VEC_SIZE * 1), %eax + jle L(zero_1) - /* Need no matter what. */ - vpcmpb $0, -(VEC_SIZE * 4)(%rax), %VMMMATCH, %k0 - kmovd %k0, %ecx - lzcntl %ecx, %ecx - subq $(VEC_SIZE * 3 + 1), %rax - subq %rcx, %rax - cmpq %rax, %rdi - ja L(zero_1) + vpcmpeqb (VEC_SIZE * -1)(%rdi, %rax), %VMATCH, %k0 + KMOV %k0, %VRCX + + lzcnt %VRCX, %VRCX + subl %ecx, %eax + ja L(first_vec_x3_ret) +L(zero_1): + xorl %eax, %eax + ret +L(first_vec_x3_ret): + leaq -1(%rdi, %rax), %rax ret - .p2align 4,, 8 -L(ret_vec_x2_test): - lzcntl %ecx, %ecx - subq $(VEC_SIZE * 2 + 1), %rax - subq %rcx, %rax - cmpq %rax, %rdi - ja L(zero_1) + .p2align 4,, 6 +L(first_vec_x2_test): + /* Must adjust length before check. */ + subl $-(VEC_SIZE * 2 - 1), %eax + lzcnt %VRCX, %VRCX + subl %ecx, %eax + jl L(zero_4) + addq %rdi, %rax ret - .p2align 4,, 8 -L(ret_vec_x2): - bsrl %ecx, %ecx - leaq -(VEC_SIZE * 3)(%rcx, %rax), %rax + + .p2align 4,, 10 +L(first_vec_x0): + bsr %VRCX, %VRCX + leaq (VEC_SIZE * -1)(%rdi, %rax), %rax + addq %rcx, %rax ret - .p2align 4,, 8 -L(ret_vec_x3): - bsrl %ecx, %ecx - leaq -(VEC_SIZE * 4)(%rcx, %rax), %rax + /* Fits unobtrusively here. 
*/ +L(zero_4): + xorl %eax, %eax + ret + + .p2align 4,, 10 +L(first_vec_x1): + bsr %VRCX, %VRCX + leaq (VEC_SIZE * -2)(%rdi, %rax), %rax + addq %rcx, %rax ret .p2align 4,, 8 +L(first_vec_x3): + bsr %VRCX, %VRCX + addq %rdi, %rax + addq %rcx, %rax + ret + + .p2align 4,, 6 +L(first_vec_x2): + bsr %VRCX, %VRCX + leaq (VEC_SIZE * 1)(%rdi, %rax), %rax + addq %rcx, %rax + ret + + .p2align 4,, 2 L(more_4x_vec): - testl %ecx, %ecx - jnz L(ret_vec_x2) + test %VRCX, %VRCX + jnz L(first_vec_x2) - vpcmpb $0, -(VEC_SIZE * 4)(%rax), %VMMMATCH, %k0 - kmovd %k0, %ecx + vpcmpeqb (%rdi, %rax), %VMATCH, %k0 + KMOV %k0, %VRCX - testl %ecx, %ecx - jnz L(ret_vec_x3) + test %VRCX, %VRCX + jnz L(first_vec_x3) /* Check if near end before re-aligning (otherwise might do an unnecessary loop iteration). */ - addq $-(VEC_SIZE * 4), %rax - cmpq $(VEC_SIZE * 4), %rdx + cmpq $(VEC_SIZE * 4), %rax jbe L(last_4x_vec) - decq %rax - andq $-(VEC_SIZE * 4), %rax - movq %rdi, %rdx - /* Get endptr for loop in rdx. NB: Can't just do while rax > rdi because - lengths that overflow can be valid and break the comparison. */ - andq $-(VEC_SIZE * 4), %rdx + + /* NB: We setup the loop to NOT use index-address-mode for the + buffer. This costs some instructions & code size but avoids + stalls due to unlaminated micro-fused instructions (as used + in the loop) from being forced to issue in the same group + (essentially narrowing the backend width). */ + + /* Get endptr for loop in rdx. NB: Can't just do while rax > rdi + because lengths that overflow can be valid and break the + comparison. */ +# if VEC_SIZE == 64 + /* Use rdx as intermediate to compute rax, this gets us imm8 + encoding which just allows the L(more_4x_vec) block to fit + in 1 cache-line. */ + leaq (VEC_SIZE * 4)(%rdi), %rdx + leaq (VEC_SIZE * -1)(%rdx, %rax), %rax + + /* No evex machine has partial register stalls. This can be + replaced with: `andq $(VEC_SIZE * -4), %rax/%rdx` if that + changes. 
*/ + xorb %al, %al + xorb %dl, %dl +# else + leaq (VEC_SIZE * 3)(%rdi, %rax), %rax + andq $(VEC_SIZE * -4), %rax + leaq (VEC_SIZE * 4)(%rdi), %rdx + andq $(VEC_SIZE * -4), %rdx +# endif + .p2align 4 L(loop_4x_vec): - /* Store 1 were not-equals and 0 where equals in k1 (used to mask later - on). */ - vpcmpb $4, (VEC_SIZE * 3)(%rax), %VMMMATCH, %k1 + /* NB: We could do the same optimization here as we do for + memchr/rawmemchr by using VEX encoding in the loop for access + to VEX vpcmpeqb + vpternlogd. Since memrchr is not as hot as + memchr it may not be worth the extra code size, but if the + need arises it an easy ~15% perf improvement to the loop. */ + + cmpq %rdx, %rax + je L(loop_last_4x_vec) + /* Store 1 were not-equals and 0 where equals in k1 (used to + mask later on). */ + vpcmpb $4, (VEC_SIZE * -1)(%rax), %VMATCH, %k1 /* VEC(2/3) will have zero-byte where we found a CHAR. */ - vpxorq (VEC_SIZE * 2)(%rax), %VMMMATCH, %VMM(2) - vpxorq (VEC_SIZE * 1)(%rax), %VMMMATCH, %VMM(3) - vpcmpb $0, (VEC_SIZE * 0)(%rax), %VMMMATCH, %k4 + vpxorq (VEC_SIZE * -2)(%rax), %VMATCH, %VMM(2) + vpxorq (VEC_SIZE * -3)(%rax), %VMATCH, %VMM(3) + vpcmpeqb (VEC_SIZE * -4)(%rax), %VMATCH, %k4 - /* Combine VEC(2/3) with min and maskz with k1 (k1 has zero bit where - CHAR is found and VEC(2/3) have zero-byte where CHAR is found. */ + /* Combine VEC(2/3) with min and maskz with k1 (k1 has zero bit + where CHAR is found and VEC(2/3) have zero-byte where CHAR + is found. */ vpminub %VMM(2), %VMM(3), %VMM(3){%k1}{z} vptestnmb %VMM(3), %VMM(3), %k2 - /* Any 1s and we found CHAR. */ - kortestd %k2, %k4 - jnz L(loop_end) - addq $-(VEC_SIZE * 4), %rax - cmpq %rdx, %rax - jne L(loop_4x_vec) - /* Need to re-adjust rdx / rax for L(last_4x_vec). */ - subq $-(VEC_SIZE * 4), %rdx - movq %rdx, %rax - subl %edi, %edx -L(last_4x_vec): + /* Any 1s and we found CHAR. */ + KORTEST %k2, %k4 + jz L(loop_4x_vec) + - /* Used no matter what. 
*/ - vpcmpb $0, (VEC_SIZE * -1)(%rax), %VMMMATCH, %k0 - kmovd %k0, %ecx + /* K1 has non-matches for first VEC. inc; jz will overflow rcx + iff all bytes where non-matches. */ + KMOV %k1, %VRCX + inc %VRCX + jnz L(first_vec_x0_end) - cmpl $(VEC_SIZE * 2), %edx - jbe L(last_2x_vec) + vptestnmb %VMM(2), %VMM(2), %k0 + KMOV %k0, %VRCX + test %VRCX, %VRCX + jnz L(first_vec_x1_end) + KMOV %k2, %VRCX + + /* Seperate logic for VEC_SIZE == 64 and VEC_SIZE == 32 for + returning last 2x VEC. For VEC_SIZE == 64 we test each VEC + individually, for VEC_SIZE == 32 we combine them in a single + 64-bit GPR. */ +# if VEC_SIZE == 64 + test %VRCX, %VRCX + jnz L(first_vec_x2_end) + KMOV %k4, %VRCX +# else + /* Combine last 2 VEC matches for VEC_SIZE == 32. If rcx (from + VEC(3)) is zero (no CHAR in VEC(3)) then it won't affect the + result in rsi (from VEC(4)). If rcx is non-zero then CHAR in + VEC(3) and bsrq will use that position. */ + KMOV %k4, %VRSI + salq $32, %rcx + orq %rsi, %rcx +# endif + bsrq %rcx, %rcx + addq %rcx, %rax + ret - testl %ecx, %ecx - jnz L(ret_vec_x0_dec) + .p2align 4,, 4 +L(first_vec_x0_end): + /* rcx has 1s at non-matches so we need to `not` it. We used + `inc` to test if zero so use `neg` to complete the `not` so + the last 1 bit represent a match. NB: (-x + 1 == ~x). */ + neg %VRCX + bsr %VRCX, %VRCX + leaq (VEC_SIZE * 3)(%rcx, %rax), %rax + ret + .p2align 4,, 10 +L(first_vec_x1_end): + bsr %VRCX, %VRCX + leaq (VEC_SIZE * 2)(%rcx, %rax), %rax + ret - vpcmpb $0, (VEC_SIZE * -2)(%rax), %VMMMATCH, %k0 - kmovd %k0, %ecx +# if VEC_SIZE == 64 + /* Since we can't combine the last 2x VEC for VEC_SIZE == 64 + need return label for it. */ + .p2align 4,, 4 +L(first_vec_x2_end): + bsr %VRCX, %VRCX + leaq (VEC_SIZE * 1)(%rcx, %rax), %rax + ret +# endif - testl %ecx, %ecx - jnz L(ret_vec_x1) - /* Used no matter what. 
*/ - vpcmpb $0, (VEC_SIZE * -3)(%rax), %VMMMATCH, %k0 - kmovd %k0, %ecx + .p2align 4,, 4 +L(page_cross): + /* only lower bits of eax[log2(VEC_SIZE):0] are set so we can + use movzbl to get the amount of bytes we are checking here. + */ + movzbl %al, %ecx + andq $-VEC_SIZE, %rax + vpcmpeqb (%rax), %VMATCH, %k0 + KMOV %k0, %VRSI - cmpl $(VEC_SIZE * 3), %edx - ja L(last_vec) + /* eax was comptued as %rdi + %rdx - 1 so need to add back 1 + here. */ + leal 1(%rcx), %r8d - lzcntl %ecx, %ecx - subq $(VEC_SIZE * 2 + 1), %rax - subq %rcx, %rax - cmpq %rax, %rdi - jbe L(ret_1) + /* Invert ecx to get shift count for byte matches out of range. + */ + notl %ecx + shlx %VRCX, %VRSI, %VRSI + + /* if r8 < rdx then the entire [buf, buf + len] is handled in + the page cross case. NB: we can't use the trick here we use + in the non page-cross case because we aren't checking full + VEC_SIZE. */ + cmpq %r8, %rdx + ja L(page_cross_check) + lzcnt %VRSI, %VRSI + subl %esi, %edx + ja L(page_cross_ret) xorl %eax, %eax -L(ret_1): ret - .p2align 4,, 6 -L(loop_end): - kmovd %k1, %ecx - notl %ecx - testl %ecx, %ecx - jnz L(ret_vec_x0_end) +L(page_cross_check): + test %VRSI, %VRSI + jz L(page_cross_continue) - vptestnmb %VMM(2), %VMM(2), %k0 - kmovd %k0, %ecx - testl %ecx, %ecx - jnz L(ret_vec_x1_end) - - kmovd %k2, %ecx - kmovd %k4, %esi - /* Combine last 2 VEC matches. If ecx (VEC3) is zero (no CHAR in VEC3) - then it won't affect the result in esi (VEC4). If ecx is non-zero - then CHAR in VEC3 and bsrq will use that position. 
*/ - salq $32, %rcx - orq %rsi, %rcx - bsrq %rcx, %rcx - addq %rcx, %rax - ret - .p2align 4,, 4 -L(ret_vec_x0_end): - addq $(VEC_SIZE), %rax -L(ret_vec_x1_end): - bsrl %ecx, %ecx - leaq (VEC_SIZE * 2)(%rax, %rcx), %rax + lzcnt %VRSI, %VRSI + subl %esi, %edx +L(page_cross_ret): + leaq -1(%rdi, %rdx), %rax ret - END(MEMRCHR) #endif From patchwork Tue Oct 18 23:19:36 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Noah Goldstein X-Patchwork-Id: 1691741 Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@legolas.ozlabs.org Authentication-Results: legolas.ozlabs.org; spf=pass (sender SPF authorized) smtp.mailfrom=sourceware.org (client-ip=2620:52:3:1:0:246e:9693:128c; helo=sourceware.org; envelope-from=libc-alpha-bounces+incoming=patchwork.ozlabs.org@sourceware.org; receiver=) Authentication-Results: legolas.ozlabs.org; dkim=pass (1024-bit key; secure) header.d=sourceware.org header.i=@sourceware.org header.a=rsa-sha256 header.s=default header.b=ebb7ZBK7; dkim-atps=neutral Received: from sourceware.org (server2.sourceware.org [IPv6:2620:52:3:1:0:246e:9693:128c]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature ECDSA (P-384) server-digest SHA384) (No client certificate requested) by legolas.ozlabs.org (Postfix) with ESMTPS id 4MsVHg2rBQz23jk for ; Wed, 19 Oct 2022 10:21:51 +1100 (AEDT) Received: from server2.sourceware.org (localhost [IPv6:::1]) by sourceware.org (Postfix) with ESMTP id 4FA3D385701A for ; Tue, 18 Oct 2022 23:21:49 +0000 (GMT) DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 4FA3D385701A DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=sourceware.org; s=default; t=1666135309; bh=oC01aOJqUCRtwCibJyVGnkt6cpo8JFfWkb5OftF4yO0=; h=To:Subject:Date:In-Reply-To:References:List-Id:List-Unsubscribe: List-Archive:List-Post:List-Help:List-Subscribe:From:Reply-To: From; 
From: Noah Goldstein
To: libc-alpha@sourceware.org
Subject: [PATCH v2 5/7] x86: Optimize strrchr-evex.S and implement with VMM headers
Date: Tue, 18 Oct 2022 16:19:36 -0700
Message-Id: <20221018231938.3621554-5-goldstein.w.n@gmail.com>
In-Reply-To: <20221018231938.3621554-1-goldstein.w.n@gmail.com>
References: <20221018024901.3381469-1-goldstein.w.n@gmail.com> <20221018231938.3621554-1-goldstein.w.n@gmail.com>

Optimization is:
1. Cache the latest result in the "fast path" loop with `vmovdqu`
   instead of `kunpckdq`. This helps if there is more than one match.

Code Size Changes:
strrchr-evex.S	: +30 bytes (same number of cache lines)

Net perf changes:
Reported as geometric mean of all improvements / regressions from N=10
runs of the benchtests. Value is New Time / Old Time, so < 1.0 is an
improvement and > 1.0 is a regression.

strrchr-evex.S	: 0.932 (from cases with higher match frequency)

Full results attached in email.

Full check passes on x86-64.
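The caching idea in (1) can be sketched as a scalar C model. This is illustrative only, not the assembly: `model_strrchr` and `BLOCK` are invented names, `BLOCK` stands in for VEC_SIZE, and the model assumes the search character is not NUL. The hot loop only remembers *which* block most recently held a match (as `vmovdqu` caches the whole vector), rather than merging match masks on every iteration (as `kunpckdq` did); the exact last-match position is computed once, on the cold path after the terminator is found.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BLOCK 8  /* stand-in for VEC_SIZE */

/* Scalar model of the strrchr fast-path loop.  Assumes c != '\0'. */
static const char *model_strrchr(const char *s, char c)
{
    const char *last_block = NULL;  /* most recent block with a match */
    size_t i = 0;

    for (;;) {
        int has_nul = 0, has_match = 0;
        for (size_t j = 0; j < BLOCK; j++) {
            if (s[i + j] == c)
                has_match = 1;
            if (s[i + j] == '\0') { has_nul = 1; break; }
        }
        if (has_match)
            last_block = s + i;     /* cheap: cache, don't locate */
        if (has_nul)
            break;
        i += BLOCK;
    }

    if (last_block == NULL)
        return NULL;
    /* Cold path: locate the exact last match inside the cached block,
       stopping at the terminator.  Runs once per call, not per
       iteration.  */
    const char *best = NULL;
    for (size_t j = 0; j < BLOCK; j++) {
        if (last_block[j] == '\0')
            break;
        if (last_block[j] == c)
            best = last_block + j;
    }
    return best;
}
```

The trade-off the commit message describes falls out of this shape: when matches are frequent, the per-iteration work shrinks to a store, and only the final block pays for an exact position computation.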
--- sysdeps/x86_64/multiarch/strrchr-evex.S | 371 +++++++++++++----------- 1 file changed, 200 insertions(+), 171 deletions(-) diff --git a/sysdeps/x86_64/multiarch/strrchr-evex.S b/sysdeps/x86_64/multiarch/strrchr-evex.S index 992b45fb47..45487dc87a 100644 --- a/sysdeps/x86_64/multiarch/strrchr-evex.S +++ b/sysdeps/x86_64/multiarch/strrchr-evex.S @@ -26,25 +26,30 @@ # define STRRCHR __strrchr_evex # endif -# define VMOVU vmovdqu64 -# define VMOVA vmovdqa64 +# include "x86-evex256-vecs.h" # ifdef USE_AS_WCSRCHR -# define SHIFT_REG esi - -# define kunpck kunpckbw +# define RCX_M cl +# define SHIFT_REG rcx +# define VPCOMPRESS vpcompressd +# define kunpck_2x kunpckbw # define kmov_2x kmovd # define maskz_2x ecx # define maskm_2x eax # define CHAR_SIZE 4 # define VPMIN vpminud # define VPTESTN vptestnmd +# define VPTEST vptestmd # define VPBROADCAST vpbroadcastd +# define VPCMPEQ vpcmpeqd # define VPCMP vpcmpd -# else -# define SHIFT_REG edi -# define kunpck kunpckdq +# define USE_WIDE_CHAR +# else +# define RCX_M ecx +# define SHIFT_REG rdi +# define VPCOMPRESS vpcompressb +# define kunpck_2x kunpckdq # define kmov_2x kmovq # define maskz_2x rcx # define maskm_2x rax @@ -52,58 +57,48 @@ # define CHAR_SIZE 1 # define VPMIN vpminub # define VPTESTN vptestnmb +# define VPTEST vptestmb # define VPBROADCAST vpbroadcastb +# define VPCMPEQ vpcmpeqb # define VPCMP vpcmpb # endif -# define XMMZERO xmm16 -# define YMMZERO ymm16 -# define YMMMATCH ymm17 -# define YMMSAVE ymm18 +# include "reg-macros.h" -# define YMM1 ymm19 -# define YMM2 ymm20 -# define YMM3 ymm21 -# define YMM4 ymm22 -# define YMM5 ymm23 -# define YMM6 ymm24 -# define YMM7 ymm25 -# define YMM8 ymm26 - - -# define VEC_SIZE 32 +# define VMATCH VMM(0) +# define CHAR_PER_VEC (VEC_SIZE / CHAR_SIZE) # define PAGE_SIZE 4096 - .section .text.evex, "ax", @progbits -ENTRY(STRRCHR) + + .section SECTION(.text), "ax", @progbits +ENTRY_P2ALIGN(STRRCHR, 6) movl %edi, %eax - /* Broadcast CHAR to YMMMATCH. 
*/ - VPBROADCAST %esi, %YMMMATCH + /* Broadcast CHAR to VMATCH. */ + VPBROADCAST %esi, %VMATCH andl $(PAGE_SIZE - 1), %eax cmpl $(PAGE_SIZE - VEC_SIZE), %eax jg L(cross_page_boundary) -L(page_cross_continue): - VMOVU (%rdi), %YMM1 - /* k0 has a 1 for each zero CHAR in YMM1. */ - VPTESTN %YMM1, %YMM1, %k0 - kmovd %k0, %ecx - testl %ecx, %ecx + VMOVU (%rdi), %VMM(1) + /* k0 has a 1 for each zero CHAR in VEC(1). */ + VPTESTN %VMM(1), %VMM(1), %k0 + KMOV %k0, %VRSI + test %VRSI, %VRSI jz L(aligned_more) /* fallthrough: zero CHAR in first VEC. */ - - /* K1 has a 1 for each search CHAR match in YMM1. */ - VPCMP $0, %YMMMATCH, %YMM1, %k1 - kmovd %k1, %eax +L(page_cross_return): + /* K1 has a 1 for each search CHAR match in VEC(1). */ + VPCMPEQ %VMATCH, %VMM(1), %k1 + KMOV %k1, %VRAX /* Build mask up until first zero CHAR (used to mask of potential search CHAR matches past the end of the string). */ - blsmskl %ecx, %ecx - andl %ecx, %eax + blsmsk %VRSI, %VRSI + and %VRSI, %VRAX jz L(ret0) - /* Get last match (the `andl` removed any out of bounds - matches). */ - bsrl %eax, %eax + /* Get last match (the `and` removed any out of bounds matches). + */ + bsr %VRAX, %VRAX # ifdef USE_AS_WCSRCHR leaq (%rdi, %rax, CHAR_SIZE), %rax # else @@ -116,22 +111,22 @@ L(ret0): search path for earlier matches. */ .p2align 4,, 6 L(first_vec_x1): - VPCMP $0, %YMMMATCH, %YMM2, %k1 - kmovd %k1, %eax - blsmskl %ecx, %ecx + VPCMPEQ %VMATCH, %VMM(2), %k1 + KMOV %k1, %VRAX + blsmsk %VRCX, %VRCX /* eax non-zero if search CHAR in range. */ - andl %ecx, %eax + and %VRCX, %VRAX jnz L(first_vec_x1_return) - /* fallthrough: no match in YMM2 then need to check for earlier - matches (in YMM1). */ + /* fallthrough: no match in VEC(2) then need to check for + earlier matches (in VEC(1)). 
*/ .p2align 4,, 4 L(first_vec_x0_test): - VPCMP $0, %YMMMATCH, %YMM1, %k1 - kmovd %k1, %eax - testl %eax, %eax + VPCMPEQ %VMATCH, %VMM(1), %k1 + KMOV %k1, %VRAX + test %VRAX, %VRAX jz L(ret1) - bsrl %eax, %eax + bsr %VRAX, %VRAX # ifdef USE_AS_WCSRCHR leaq (%rsi, %rax, CHAR_SIZE), %rax # else @@ -142,129 +137,144 @@ L(ret1): .p2align 4,, 10 L(first_vec_x1_or_x2): - VPCMP $0, %YMM3, %YMMMATCH, %k3 - VPCMP $0, %YMM2, %YMMMATCH, %k2 + VPCMPEQ %VMM(3), %VMATCH, %k3 + VPCMPEQ %VMM(2), %VMATCH, %k2 /* K2 and K3 have 1 for any search CHAR match. Test if any - matches between either of them. Otherwise check YMM1. */ - kortestd %k2, %k3 + matches between either of them. Otherwise check VEC(1). */ + KORTEST %k2, %k3 jz L(first_vec_x0_test) - /* Guranteed that YMM2 and YMM3 are within range so merge the - two bitmasks then get last result. */ - kunpck %k2, %k3, %k3 - kmovq %k3, %rax - bsrq %rax, %rax - leaq (VEC_SIZE)(%r8, %rax, CHAR_SIZE), %rax + /* Guranteed that VEC(2) and VEC(3) are within range so merge + the two bitmasks then get last result. */ + kunpck_2x %k2, %k3, %k3 + kmov_2x %k3, %maskm_2x + bsr %maskm_2x, %maskm_2x + leaq (VEC_SIZE * 1)(%r8, %rax, CHAR_SIZE), %rax ret - .p2align 4,, 6 + .p2align 4,, 7 L(first_vec_x3): - VPCMP $0, %YMMMATCH, %YMM4, %k1 - kmovd %k1, %eax - blsmskl %ecx, %ecx - /* If no search CHAR match in range check YMM1/YMM2/YMM3. */ - andl %ecx, %eax + VPCMPEQ %VMATCH, %VMM(4), %k1 + KMOV %k1, %VRAX + blsmsk %VRCX, %VRCX + /* If no search CHAR match in range check VEC(1)/VEC(2)/VEC(3). + */ + and %VRCX, %VRAX jz L(first_vec_x1_or_x2) - bsrl %eax, %eax + bsr %VRAX, %VRAX leaq (VEC_SIZE * 3)(%rdi, %rax, CHAR_SIZE), %rax ret + .p2align 4,, 6 L(first_vec_x0_x1_test): - VPCMP $0, %YMMMATCH, %YMM2, %k1 - kmovd %k1, %eax - /* Check YMM2 for last match first. If no match try YMM1. */ - testl %eax, %eax + VPCMPEQ %VMATCH, %VMM(2), %k1 + KMOV %k1, %VRAX + /* Check VEC(2) for last match first. If no match try VEC(1). 
+ */ + test %VRAX, %VRAX jz L(first_vec_x0_test) .p2align 4,, 4 L(first_vec_x1_return): - bsrl %eax, %eax + bsr %VRAX, %VRAX leaq (VEC_SIZE)(%rdi, %rax, CHAR_SIZE), %rax ret + .p2align 4,, 10 L(first_vec_x2): - VPCMP $0, %YMMMATCH, %YMM3, %k1 - kmovd %k1, %eax - blsmskl %ecx, %ecx - /* Check YMM3 for last match first. If no match try YMM2/YMM1. - */ - andl %ecx, %eax + VPCMPEQ %VMATCH, %VMM(3), %k1 + KMOV %k1, %VRAX + blsmsk %VRCX, %VRCX + /* Check VEC(3) for last match first. If no match try + VEC(2)/VEC(1). */ + and %VRCX, %VRAX jz L(first_vec_x0_x1_test) - bsrl %eax, %eax + bsr %VRAX, %VRAX leaq (VEC_SIZE * 2)(%rdi, %rax, CHAR_SIZE), %rax ret - .p2align 4 + .p2align 4,, 12 L(aligned_more): - /* Need to keep original pointer incase YMM1 has last match. */ +L(page_cross_continue): + /* Need to keep original pointer incase VEC(1) has last match. + */ movq %rdi, %rsi andq $-VEC_SIZE, %rdi - VMOVU VEC_SIZE(%rdi), %YMM2 - VPTESTN %YMM2, %YMM2, %k0 - kmovd %k0, %ecx - testl %ecx, %ecx + + VMOVU VEC_SIZE(%rdi), %VMM(2) + VPTESTN %VMM(2), %VMM(2), %k0 + KMOV %k0, %VRCX + + test %VRCX, %VRCX jnz L(first_vec_x1) - VMOVU (VEC_SIZE * 2)(%rdi), %YMM3 - VPTESTN %YMM3, %YMM3, %k0 - kmovd %k0, %ecx - testl %ecx, %ecx + VMOVU (VEC_SIZE * 2)(%rdi), %VMM(3) + VPTESTN %VMM(3), %VMM(3), %k0 + KMOV %k0, %VRCX + + test %VRCX, %VRCX jnz L(first_vec_x2) - VMOVU (VEC_SIZE * 3)(%rdi), %YMM4 - VPTESTN %YMM4, %YMM4, %k0 - kmovd %k0, %ecx + VMOVU (VEC_SIZE * 3)(%rdi), %VMM(4) + VPTESTN %VMM(4), %VMM(4), %k0 + KMOV %k0, %VRCX movq %rdi, %r8 - testl %ecx, %ecx + test %VRCX, %VRCX jnz L(first_vec_x3) andq $-(VEC_SIZE * 2), %rdi - .p2align 4 + .p2align 4,, 10 L(first_aligned_loop): - /* Preserve YMM1, YMM2, YMM3, and YMM4 until we can gurantee - they don't store a match. */ - VMOVA (VEC_SIZE * 4)(%rdi), %YMM5 - VMOVA (VEC_SIZE * 5)(%rdi), %YMM6 + /* Preserve VEC(1), VEC(2), VEC(3), and VEC(4) until we can + gurantee they don't store a match. 
*/ + VMOVA (VEC_SIZE * 4)(%rdi), %VMM(5) + VMOVA (VEC_SIZE * 5)(%rdi), %VMM(6) - VPCMP $0, %YMM5, %YMMMATCH, %k2 - vpxord %YMM6, %YMMMATCH, %YMM7 + VPCMPEQ %VMM(5), %VMATCH, %k2 + vpxord %VMM(6), %VMATCH, %VMM(7) - VPMIN %YMM5, %YMM6, %YMM8 - VPMIN %YMM8, %YMM7, %YMM7 + VPMIN %VMM(5), %VMM(6), %VMM(8) + VPMIN %VMM(8), %VMM(7), %VMM(7) - VPTESTN %YMM7, %YMM7, %k1 + VPTESTN %VMM(7), %VMM(7), %k1 subq $(VEC_SIZE * -2), %rdi - kortestd %k1, %k2 + KORTEST %k1, %k2 jz L(first_aligned_loop) - VPCMP $0, %YMM6, %YMMMATCH, %k3 - VPTESTN %YMM8, %YMM8, %k1 - ktestd %k1, %k1 + VPCMPEQ %VMM(6), %VMATCH, %k3 + VPTESTN %VMM(8), %VMM(8), %k1 + + /* If k1 is zero, then we found a CHAR match but no null-term. + We can now safely throw out VEC1-4. */ + KTEST %k1, %k1 jz L(second_aligned_loop_prep) - kortestd %k2, %k3 + KORTEST %k2, %k3 jnz L(return_first_aligned_loop) + .p2align 4,, 6 L(first_vec_x1_or_x2_or_x3): - VPCMP $0, %YMM4, %YMMMATCH, %k4 - kmovd %k4, %eax - testl %eax, %eax + VPCMPEQ %VMM(4), %VMATCH, %k4 + KMOV %k4, %VRAX + bsr %VRAX, %VRAX jz L(first_vec_x1_or_x2) - bsrl %eax, %eax leaq (VEC_SIZE * 3)(%r8, %rax, CHAR_SIZE), %rax ret + .p2align 4,, 8 L(return_first_aligned_loop): - VPTESTN %YMM5, %YMM5, %k0 - kunpck %k0, %k1, %k0 + VPTESTN %VMM(5), %VMM(5), %k0 + + /* Combined results from VEC5/6. */ + kunpck_2x %k0, %k1, %k0 kmov_2x %k0, %maskz_2x blsmsk %maskz_2x, %maskz_2x - kunpck %k2, %k3, %k3 + kunpck_2x %k2, %k3, %k3 kmov_2x %k3, %maskm_2x and %maskz_2x, %maskm_2x jz L(first_vec_x1_or_x2_or_x3) @@ -280,47 +290,62 @@ L(return_first_aligned_loop): L(second_aligned_loop_prep): L(second_aligned_loop_set_furthest_match): movq %rdi, %rsi - kunpck %k2, %k3, %k4 - + /* Ideally we would safe k2/k3 but `kmov/kunpck` take uops on + port0 and have noticable overhead in the loop. 
*/ + VMOVA %VMM(5), %VMM(7) + VMOVA %VMM(6), %VMM(8) .p2align 4 L(second_aligned_loop): - VMOVU (VEC_SIZE * 4)(%rdi), %YMM1 - VMOVU (VEC_SIZE * 5)(%rdi), %YMM2 - - VPCMP $0, %YMM1, %YMMMATCH, %k2 - vpxord %YMM2, %YMMMATCH, %YMM3 + VMOVU (VEC_SIZE * 4)(%rdi), %VMM(5) + VMOVU (VEC_SIZE * 5)(%rdi), %VMM(6) + VPCMPEQ %VMM(5), %VMATCH, %k2 + vpxord %VMM(6), %VMATCH, %VMM(3) - VPMIN %YMM1, %YMM2, %YMM4 - VPMIN %YMM3, %YMM4, %YMM3 + VPMIN %VMM(5), %VMM(6), %VMM(4) + VPMIN %VMM(3), %VMM(4), %VMM(3) - VPTESTN %YMM3, %YMM3, %k1 + VPTESTN %VMM(3), %VMM(3), %k1 subq $(VEC_SIZE * -2), %rdi - kortestd %k1, %k2 + KORTEST %k1, %k2 jz L(second_aligned_loop) - - VPCMP $0, %YMM2, %YMMMATCH, %k3 - VPTESTN %YMM4, %YMM4, %k1 - ktestd %k1, %k1 + VPCMPEQ %VMM(6), %VMATCH, %k3 + VPTESTN %VMM(4), %VMM(4), %k1 + KTEST %k1, %k1 jz L(second_aligned_loop_set_furthest_match) - kortestd %k2, %k3 - /* branch here because there is a significant advantage interms - of output dependency chance in using edx. */ + /* branch here because we know we have a match in VEC7/8 but + might not in VEC5/6 so the latter is expected to be less + likely. */ + KORTEST %k2, %k3 jnz L(return_new_match) + L(return_old_match): - kmovq %k4, %rax - bsrq %rax, %rax - leaq (VEC_SIZE * 2)(%rsi, %rax, CHAR_SIZE), %rax + VPCMPEQ %VMM(8), %VMATCH, %k0 + KMOV %k0, %VRCX + bsr %VRCX, %VRCX + jnz L(return_old_match_ret) + + VPCMPEQ %VMM(7), %VMATCH, %k0 + KMOV %k0, %VRCX + bsr %VRCX, %VRCX + subq $VEC_SIZE, %rsi +L(return_old_match_ret): + leaq (VEC_SIZE * 3)(%rsi, %rcx, CHAR_SIZE), %rax ret + .p2align 4,, 10 L(return_new_match): - VPTESTN %YMM1, %YMM1, %k0 - kunpck %k0, %k1, %k0 + VPTESTN %VMM(5), %VMM(5), %k0 + + /* Combined results from VEC5/6. */ + kunpck_2x %k0, %k1, %k0 kmov_2x %k0, %maskz_2x blsmsk %maskz_2x, %maskz_2x - kunpck %k2, %k3, %k3 + kunpck_2x %k2, %k3, %k3 kmov_2x %k3, %maskm_2x + + /* Match at end was out-of-bounds so use last known match. 
*/ and %maskz_2x, %maskm_2x jz L(return_old_match) @@ -328,49 +353,53 @@ L(return_new_match): leaq (VEC_SIZE * 2)(%rdi, %rax, CHAR_SIZE), %rax ret + .p2align 4,, 4 L(cross_page_boundary): - /* eax contains all the page offset bits of src (rdi). `xor rdi, - rax` sets pointer will all page offset bits cleared so - offset of (PAGE_SIZE - VEC_SIZE) will get last aligned VEC - before page cross (guranteed to be safe to read). Doing this - as opposed to `movq %rdi, %rax; andq $-VEC_SIZE, %rax` saves - a bit of code size. */ xorq %rdi, %rax - VMOVU (PAGE_SIZE - VEC_SIZE)(%rax), %YMM1 - VPTESTN %YMM1, %YMM1, %k0 - kmovd %k0, %ecx + mov $-1, %VRDX + VMOVU (PAGE_SIZE - VEC_SIZE)(%rax), %VMM(6) + VPTESTN %VMM(6), %VMM(6), %k0 + KMOV %k0, %VRSI + +# ifdef USE_AS_WCSRCHR + movl %edi, %ecx + and $(VEC_SIZE - 1), %ecx + shrl $2, %ecx +# endif + shlx %VGPR(SHIFT_REG), %VRDX, %VRDX - /* Shift out zero CHAR matches that are before the begining of - src (rdi). */ # ifdef USE_AS_WCSRCHR - movl %edi, %esi - andl $(VEC_SIZE - 1), %esi - shrl $2, %esi + kmovb %edx, %k1 +# else + KMOV %VRDX, %k1 # endif - shrxl %SHIFT_REG, %ecx, %ecx - testl %ecx, %ecx + /* Need to adjust result to VEC(1) so it can be re-used by + L(return_vec_x0_test). The alternative is to collect VEC(1) + will a page cross load which is far more expensive. */ + VPCOMPRESS %VMM(6), %VMM(1){%k1}{z} + + /* We could technically just jmp back after the vpcompress but + it doesn't save any 16-byte blocks. */ + shrx %VGPR(SHIFT_REG), %VRSI, %VRSI + test %VRSI, %VRSI jz L(page_cross_continue) - /* Found zero CHAR so need to test for search CHAR. */ - VPCMP $0, %YMMMATCH, %YMM1, %k1 - kmovd %k1, %eax - /* Shift out search CHAR matches that are before the begining of - src (rdi). */ - shrxl %SHIFT_REG, %eax, %eax - - /* Check if any search CHAR match in range. */ - blsmskl %ecx, %ecx - andl %ecx, %eax - jz L(ret3) - bsrl %eax, %eax + /* Duplicate of return logic from ENTRY. 
Doesn't cause spill to + next cache line so might as well copy it here. */ + VPCMPEQ %VMATCH, %VMM(1), %k1 + KMOV %k1, %VRAX + blsmsk %VRSI, %VRSI + and %VRSI, %VRAX + jz L(ret_page_cross) + bsr %VRAX, %VRAX # ifdef USE_AS_WCSRCHR leaq (%rdi, %rax, CHAR_SIZE), %rax # else addq %rdi, %rax # endif -L(ret3): +L(ret_page_cross): ret - + /* 1 byte till next cache line. */ END(STRRCHR) #endif From patchwork Tue Oct 18 23:19:37 2022 X-Patchwork-Submitter: Noah Goldstein X-Patchwork-Id: 1691737
To: libc-alpha@sourceware.org Subject: [PATCH v2 6/7] x86: Add support for VEC_SIZE == 64 in strcmp-evex.S impl Date: Tue, 18 Oct 2022 16:19:37 -0700 Message-Id: <20221018231938.3621554-6-goldstein.w.n@gmail.com> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20221018231938.3621554-1-goldstein.w.n@gmail.com> References: <20221018024901.3381469-1-goldstein.w.n@gmail.com> <20221018231938.3621554-1-goldstein.w.n@gmail.com> From: Noah Goldstein Unused at the moment, but evex512 strcmp, strncmp, strcasecmp{l}, and strncasecmp{l} functions can be added by including strcmp-evex.S with "x86-evex512-vecs.h" defined. In addition, save a bit of code size in a few places: 1. tzcnt ... -> bsf ... 2. vpcmp{b|d} $0 ... -> vpcmpeq{b|d} This saves a touch of code size but has minimal net effect. Full check passes on x86-64.
--- sysdeps/x86_64/multiarch/strcmp-evex.S | 676 ++++++++++++++++--------- 1 file changed, 430 insertions(+), 246 deletions(-) diff --git a/sysdeps/x86_64/multiarch/strcmp-evex.S b/sysdeps/x86_64/multiarch/strcmp-evex.S index e482d0167f..756a3bb8d6 100644 --- a/sysdeps/x86_64/multiarch/strcmp-evex.S +++ b/sysdeps/x86_64/multiarch/strcmp-evex.S @@ -20,6 +20,10 @@ #if ISA_SHOULD_BUILD (4) +# ifndef VEC_SIZE +# include "x86-evex256-vecs.h" +# endif + # define STRCMP_ISA _evex # include "strcmp-naming.h" @@ -35,41 +39,57 @@ # define PAGE_SIZE 4096 /* VEC_SIZE = Number of bytes in a ymm register. */ -# define VEC_SIZE 32 # define CHAR_PER_VEC (VEC_SIZE / SIZE_OF_CHAR) -# define VMOVU vmovdqu64 -# define VMOVA vmovdqa64 - # ifdef USE_AS_WCSCMP -# define TESTEQ subl $0xff, /* Compare packed dwords. */ # define VPCMP vpcmpd +# define VPCMPEQ vpcmpeqd # define VPMINU vpminud # define VPTESTM vptestmd # define VPTESTNM vptestnmd /* 1 dword char == 4 bytes. */ # define SIZE_OF_CHAR 4 + +# define TESTEQ sub $((1 << CHAR_PER_VEC) - 1), + +# define USE_WIDE_CHAR # else -# define TESTEQ incl /* Compare packed bytes. */ # define VPCMP vpcmpb +# define VPCMPEQ vpcmpeqb # define VPMINU vpminub # define VPTESTM vptestmb # define VPTESTNM vptestnmb /* 1 byte char == 1 byte. 
*/ # define SIZE_OF_CHAR 1 + +# define TESTEQ inc +# endif + +# include "reg-macros.h" + +# if VEC_SIZE == 64 +# define RODATA_SECTION rodata.cst64 +# else +# define RODATA_SECTION rodata.cst32 +# endif + +# if CHAR_PER_VEC == 64 +# define FALLTHROUGH_RETURN_OFFSET (VEC_SIZE * 3) +# else +# define FALLTHROUGH_RETURN_OFFSET (VEC_SIZE * 2) # endif # ifdef USE_AS_STRNCMP -# define LOOP_REG r9d +# define LOOP_REG VR9 # define LOOP_REG64 r9 # define OFFSET_REG8 r9b # define OFFSET_REG r9d # define OFFSET_REG64 r9 # else -# define LOOP_REG edx +# define LOOP_REG VRDX # define LOOP_REG64 rdx # define OFFSET_REG8 dl @@ -83,32 +103,6 @@ # define VEC_OFFSET (-VEC_SIZE) # endif -# define XMM0 xmm17 -# define XMM1 xmm18 - -# define XMM10 xmm27 -# define XMM11 xmm28 -# define XMM12 xmm29 -# define XMM13 xmm30 -# define XMM14 xmm31 - - -# define YMM0 ymm17 -# define YMM1 ymm18 -# define YMM2 ymm19 -# define YMM3 ymm20 -# define YMM4 ymm21 -# define YMM5 ymm22 -# define YMM6 ymm23 -# define YMM7 ymm24 -# define YMM8 ymm25 -# define YMM9 ymm26 -# define YMM10 ymm27 -# define YMM11 ymm28 -# define YMM12 ymm29 -# define YMM13 ymm30 -# define YMM14 ymm31 - # ifdef USE_AS_STRCASECMP_L # define BYTE_LOOP_REG OFFSET_REG # else @@ -125,61 +119,72 @@ # endif # endif -# define LCASE_MIN_YMM %YMM12 -# define LCASE_MAX_YMM %YMM13 -# define CASE_ADD_YMM %YMM14 +# define LCASE_MIN_V VMM(12) +# define LCASE_MAX_V VMM(13) +# define CASE_ADD_V VMM(14) -# define LCASE_MIN_XMM %XMM12 -# define LCASE_MAX_XMM %XMM13 -# define CASE_ADD_XMM %XMM14 +# if VEC_SIZE == 64 +# define LCASE_MIN_YMM VMM_256(12) +# define LCASE_MAX_YMM VMM_256(13) +# define CASE_ADD_YMM VMM_256(14) +# endif + +# define LCASE_MIN_XMM VMM_128(12) +# define LCASE_MAX_XMM VMM_128(13) +# define CASE_ADD_XMM VMM_128(14) /* NB: wcsncmp uses r11 but strcasecmp is never used in conjunction with wcscmp. 
*/ # define TOLOWER_BASE %r11 # ifdef USE_AS_STRCASECMP_L -# define _REG(x, y) x ## y -# define REG(x, y) _REG(x, y) -# define TOLOWER(reg1, reg2, ext) \ - vpsubb REG(LCASE_MIN_, ext), reg1, REG(%ext, 10); \ - vpsubb REG(LCASE_MIN_, ext), reg2, REG(%ext, 11); \ - vpcmpub $1, REG(LCASE_MAX_, ext), REG(%ext, 10), %k5; \ - vpcmpub $1, REG(LCASE_MAX_, ext), REG(%ext, 11), %k6; \ - vpaddb reg1, REG(CASE_ADD_, ext), reg1{%k5}; \ - vpaddb reg2, REG(CASE_ADD_, ext), reg2{%k6} - -# define TOLOWER_gpr(src, dst) movl (TOLOWER_BASE, src, 4), dst -# define TOLOWER_YMM(...) TOLOWER(__VA_ARGS__, YMM) -# define TOLOWER_XMM(...) TOLOWER(__VA_ARGS__, XMM) - -# define CMP_R1_R2(s1_reg, s2_reg, reg_out, ext) \ - TOLOWER (s1_reg, s2_reg, ext); \ - VPCMP $0, s1_reg, s2_reg, reg_out - -# define CMP_R1_S2(s1_reg, s2_mem, s2_reg, reg_out, ext) \ - VMOVU s2_mem, s2_reg; \ - CMP_R1_R2(s1_reg, s2_reg, reg_out, ext) - -# define CMP_R1_R2_YMM(...) CMP_R1_R2(__VA_ARGS__, YMM) -# define CMP_R1_R2_XMM(...) CMP_R1_R2(__VA_ARGS__, XMM) - -# define CMP_R1_S2_YMM(...) CMP_R1_S2(__VA_ARGS__, YMM) -# define CMP_R1_S2_XMM(...) CMP_R1_S2(__VA_ARGS__, XMM) +# define _REG(x, y) x ## y +# define REG(x, y) _REG(x, y) +# define TOLOWER(reg1, reg2, ext, vec_macro) \ + vpsubb %REG(LCASE_MIN_, ext), reg1, %vec_macro(10); \ + vpsubb %REG(LCASE_MIN_, ext), reg2, %vec_macro(11); \ + vpcmpub $1, %REG(LCASE_MAX_, ext), %vec_macro(10), %k5; \ + vpcmpub $1, %REG(LCASE_MAX_, ext), %vec_macro(11), %k6; \ + vpaddb reg1, %REG(CASE_ADD_, ext), reg1{%k5}; \ + vpaddb reg2, %REG(CASE_ADD_, ext), reg2{%k6} + +# define TOLOWER_gpr(src, dst) movl (TOLOWER_BASE, src, 4), dst +# define TOLOWER_VMM(...) TOLOWER(__VA_ARGS__, V, VMM) +# define TOLOWER_YMM(...) TOLOWER(__VA_ARGS__, YMM, VMM_256) +# define TOLOWER_XMM(...) 
TOLOWER(__VA_ARGS__, XMM, VMM_128) + +# define CMP_R1_R2(s1_reg, s2_reg, reg_out, ext, vec_macro) \ + TOLOWER (s1_reg, s2_reg, ext, vec_macro); \ + VPCMPEQ s1_reg, s2_reg, reg_out + +# define CMP_R1_S2(s1_reg, s2_mem, s2_reg, reg_out, ext, vec_macro) \ + VMOVU s2_mem, s2_reg; \ + CMP_R1_R2 (s1_reg, s2_reg, reg_out, ext, vec_macro) + +# define CMP_R1_R2_VMM(...) CMP_R1_R2(__VA_ARGS__, V, VMM) +# define CMP_R1_R2_YMM(...) CMP_R1_R2(__VA_ARGS__, YMM, VMM_256) +# define CMP_R1_R2_XMM(...) CMP_R1_R2(__VA_ARGS__, XMM, VMM_128) + +# define CMP_R1_S2_VMM(...) CMP_R1_S2(__VA_ARGS__, V, VMM) +# define CMP_R1_S2_YMM(...) CMP_R1_S2(__VA_ARGS__, YMM, VMM_256) +# define CMP_R1_S2_XMM(...) CMP_R1_S2(__VA_ARGS__, XMM, VMM_128) # else # define TOLOWER_gpr(...) +# define TOLOWER_VMM(...) # define TOLOWER_YMM(...) # define TOLOWER_XMM(...) -# define CMP_R1_R2_YMM(s1_reg, s2_reg, reg_out) \ - VPCMP $0, s2_reg, s1_reg, reg_out +# define CMP_R1_R2_VMM(s1_reg, s2_reg, reg_out) \ + VPCMPEQ s2_reg, s1_reg, reg_out -# define CMP_R1_R2_XMM(...) CMP_R1_R2_YMM(__VA_ARGS__) +# define CMP_R1_R2_YMM(...) CMP_R1_R2_VMM(__VA_ARGS__) +# define CMP_R1_R2_XMM(...) CMP_R1_R2_VMM(__VA_ARGS__) -# define CMP_R1_S2_YMM(s1_reg, s2_mem, unused, reg_out) \ - VPCMP $0, s2_mem, s1_reg, reg_out - -# define CMP_R1_S2_XMM(...) CMP_R1_S2_YMM(__VA_ARGS__) +# define CMP_R1_S2_VMM(s1_reg, s2_mem, unused, reg_out) \ + VPCMPEQ s2_mem, s1_reg, reg_out +# define CMP_R1_S2_YMM(...) CMP_R1_S2_VMM(__VA_ARGS__) +# define CMP_R1_S2_XMM(...) CMP_R1_S2_VMM(__VA_ARGS__) # endif /* Warning! @@ -203,7 +208,7 @@ the maximum offset is reached before a difference is found, zero is returned. 
*/ - .section .text.evex, "ax", @progbits + .section SECTION(.text), "ax", @progbits .align 16 .type STRCMP, @function .globl STRCMP @@ -232,7 +237,7 @@ STRCMP: # else mov (%LOCALE_REG), %RAX_LP # endif - testl $1, LOCALE_DATA_VALUES + _NL_CTYPE_NONASCII_CASE * SIZEOF_VALUES(%rax) + testb $1, LOCALE_DATA_VALUES + _NL_CTYPE_NONASCII_CASE * SIZEOF_VALUES(%rax) jne STRCASECMP_L_NONASCII leaq _nl_C_LC_CTYPE_tolower + 128 * 4(%rip), TOLOWER_BASE # endif @@ -254,28 +259,46 @@ STRCMP: # endif # if defined USE_AS_STRCASECMP_L - .section .rodata.cst32, "aM", @progbits, 32 - .align 32 + .section RODATA_SECTION, "aM", @progbits, VEC_SIZE + .align VEC_SIZE L(lcase_min): .quad 0x4141414141414141 .quad 0x4141414141414141 .quad 0x4141414141414141 .quad 0x4141414141414141 +# if VEC_SIZE == 64 + .quad 0x4141414141414141 + .quad 0x4141414141414141 + .quad 0x4141414141414141 + .quad 0x4141414141414141 +# endif L(lcase_max): .quad 0x1a1a1a1a1a1a1a1a .quad 0x1a1a1a1a1a1a1a1a .quad 0x1a1a1a1a1a1a1a1a .quad 0x1a1a1a1a1a1a1a1a +# if VEC_SIZE == 64 + .quad 0x1a1a1a1a1a1a1a1a + .quad 0x1a1a1a1a1a1a1a1a + .quad 0x1a1a1a1a1a1a1a1a + .quad 0x1a1a1a1a1a1a1a1a +# endif L(case_add): .quad 0x2020202020202020 .quad 0x2020202020202020 .quad 0x2020202020202020 .quad 0x2020202020202020 +# if VEC_SIZE == 64 + .quad 0x2020202020202020 + .quad 0x2020202020202020 + .quad 0x2020202020202020 + .quad 0x2020202020202020 +# endif .previous - vmovdqa64 L(lcase_min)(%rip), LCASE_MIN_YMM - vmovdqa64 L(lcase_max)(%rip), LCASE_MAX_YMM - vmovdqa64 L(case_add)(%rip), CASE_ADD_YMM + VMOVA L(lcase_min)(%rip), %LCASE_MIN_V + VMOVA L(lcase_max)(%rip), %LCASE_MAX_V + VMOVA L(case_add)(%rip), %CASE_ADD_V # endif movl %edi, %eax @@ -288,12 +311,12 @@ L(case_add): L(no_page_cross): /* Safe to compare 4x vectors. */ - VMOVU (%rdi), %YMM0 - VPTESTM %YMM0, %YMM0, %k2 + VMOVU (%rdi), %VMM(0) + VPTESTM %VMM(0), %VMM(0), %k2 /* Each bit cleared in K1 represents a mismatch or a null CHAR in YMM0 and 32 bytes at (%rsi). 
*/ - CMP_R1_S2_YMM (%YMM0, (%rsi), %YMM1, %k1){%k2} - kmovd %k1, %ecx + CMP_R1_S2_VMM (%VMM(0), (%rsi), %VMM(1), %k1){%k2} + KMOV %k1, %VRCX # ifdef USE_AS_STRNCMP cmpq $CHAR_PER_VEC, %rdx jbe L(vec_0_test_len) @@ -303,14 +326,14 @@ L(no_page_cross): wcscmp/wcsncmp. */ /* All 1s represents all equals. TESTEQ will overflow to zero in - all equals case. Otherwise 1s will carry until position of first - mismatch. */ - TESTEQ %ecx + all equals case. Otherwise 1s will carry until position of + first mismatch. */ + TESTEQ %VRCX jz L(more_3x_vec) .p2align 4,, 4 L(return_vec_0): - tzcntl %ecx, %ecx + bsf %VRCX, %VRCX # ifdef USE_AS_WCSCMP movl (%rdi, %rcx, SIZE_OF_CHAR), %edx xorl %eax, %eax @@ -321,7 +344,16 @@ L(return_vec_0): orl $1, %eax # else movzbl (%rdi, %rcx), %eax + /* For VEC_SIZE == 64 use movb instead of movzbl to save a byte + and keep logic for len <= VEC_SIZE (common) in just the + first cache line. NB: No evex512 processor has partial- + register stalls. If that changes this ifdef can be disabled + without affecting correctness. */ +# if !defined USE_AS_STRNCMP && !defined USE_AS_STRCASECMP_L && VEC_SIZE == 64 + movb (%rsi, %rcx), %cl +# else movzbl (%rsi, %rcx), %ecx +# endif TOLOWER_gpr (%rax, %eax) TOLOWER_gpr (%rcx, %ecx) subl %ecx, %eax @@ -332,8 +364,8 @@ L(ret0): # ifdef USE_AS_STRNCMP .p2align 4,, 4 L(vec_0_test_len): - notl %ecx - bzhil %edx, %ecx, %eax + not %VRCX + bzhi %VRDX, %VRCX, %VRAX jnz L(return_vec_0) /* Align if will cross fetch block. */ .p2align 4,, 2 @@ -372,7 +404,7 @@ L(ret1): .p2align 4,, 10 L(return_vec_1): - tzcntl %ecx, %ecx + bsf %VRCX, %VRCX # ifdef USE_AS_STRNCMP /* rdx must be > CHAR_PER_VEC so its safe to subtract without worrying about underflow. */ @@ -401,24 +433,41 @@ L(ret2): .p2align 4,, 10 # ifdef USE_AS_STRNCMP L(return_vec_3): -# if CHAR_PER_VEC <= 16 +# if CHAR_PER_VEC <= 32 + /* If CHAR_PER_VEC <= 32 reuse code from L(return_vec_3) without + additional branches by adjusting the bit positions from + VEC3. 
We can't do this for CHAR_PER_VEC == 64. */ +# if CHAR_PER_VEC <= 16 sall $CHAR_PER_VEC, %ecx -# else +# else salq $CHAR_PER_VEC, %rcx +# endif +# else + /* If CHAR_PER_VEC == 64 we can't shift the return GPR so just + check it. */ + bsf %VRCX, %VRCX + addl $(CHAR_PER_VEC), %ecx + cmpq %rcx, %rdx + ja L(ret_vec_3_finish) + xorl %eax, %eax + ret # endif # endif + + /* If CHAR_PER_VEC == 64 we can't combine matches from the last + 2x VEC so need seperate return label. */ L(return_vec_2): # if (CHAR_PER_VEC <= 16) || !(defined USE_AS_STRNCMP) - tzcntl %ecx, %ecx + bsf %VRCX, %VRCX # else - tzcntq %rcx, %rcx + bsfq %rcx, %rcx # endif - # ifdef USE_AS_STRNCMP cmpq %rcx, %rdx jbe L(ret_zero) # endif +L(ret_vec_3_finish): # ifdef USE_AS_WCSCMP movl (VEC_SIZE * 2)(%rdi, %rcx, SIZE_OF_CHAR), %edx xorl %eax, %eax @@ -440,7 +489,7 @@ L(ret3): # ifndef USE_AS_STRNCMP .p2align 4,, 10 L(return_vec_3): - tzcntl %ecx, %ecx + bsf %VRCX, %VRCX # ifdef USE_AS_WCSCMP movl (VEC_SIZE * 3)(%rdi, %rcx, SIZE_OF_CHAR), %edx xorl %eax, %eax @@ -465,11 +514,11 @@ L(ret4): .p2align 5 L(more_3x_vec): /* Safe to compare 4x vectors. 
*/ - VMOVU (VEC_SIZE)(%rdi), %YMM0 - VPTESTM %YMM0, %YMM0, %k2 - CMP_R1_S2_YMM (%YMM0, VEC_SIZE(%rsi), %YMM1, %k1){%k2} - kmovd %k1, %ecx - TESTEQ %ecx + VMOVU (VEC_SIZE)(%rdi), %VMM(0) + VPTESTM %VMM(0), %VMM(0), %k2 + CMP_R1_S2_VMM (%VMM(0), VEC_SIZE(%rsi), %VMM(1), %k1){%k2} + KMOV %k1, %VRCX + TESTEQ %VRCX jnz L(return_vec_1) # ifdef USE_AS_STRNCMP @@ -477,18 +526,18 @@ L(more_3x_vec): jbe L(ret_zero) # endif - VMOVU (VEC_SIZE * 2)(%rdi), %YMM0 - VPTESTM %YMM0, %YMM0, %k2 - CMP_R1_S2_YMM (%YMM0, (VEC_SIZE * 2)(%rsi), %YMM1, %k1){%k2} - kmovd %k1, %ecx - TESTEQ %ecx + VMOVU (VEC_SIZE * 2)(%rdi), %VMM(0) + VPTESTM %VMM(0), %VMM(0), %k2 + CMP_R1_S2_VMM (%VMM(0), (VEC_SIZE * 2)(%rsi), %VMM(1), %k1){%k2} + KMOV %k1, %VRCX + TESTEQ %VRCX jnz L(return_vec_2) - VMOVU (VEC_SIZE * 3)(%rdi), %YMM0 - VPTESTM %YMM0, %YMM0, %k2 - CMP_R1_S2_YMM (%YMM0, (VEC_SIZE * 3)(%rsi), %YMM1, %k1){%k2} - kmovd %k1, %ecx - TESTEQ %ecx + VMOVU (VEC_SIZE * 3)(%rdi), %VMM(0) + VPTESTM %VMM(0), %VMM(0), %k2 + CMP_R1_S2_VMM (%VMM(0), (VEC_SIZE * 3)(%rsi), %VMM(1), %k1){%k2} + KMOV %k1, %VRCX + TESTEQ %VRCX jnz L(return_vec_3) # ifdef USE_AS_STRNCMP @@ -565,110 +614,123 @@ L(loop): /* Loop entry after handling page cross during loop. */ L(loop_skip_page_cross_check): - VMOVA (VEC_SIZE * 0)(%rdi), %YMM0 - VMOVA (VEC_SIZE * 1)(%rdi), %YMM2 - VMOVA (VEC_SIZE * 2)(%rdi), %YMM4 - VMOVA (VEC_SIZE * 3)(%rdi), %YMM6 + VMOVA (VEC_SIZE * 0)(%rdi), %VMM(0) + VMOVA (VEC_SIZE * 1)(%rdi), %VMM(2) + VMOVA (VEC_SIZE * 2)(%rdi), %VMM(4) + VMOVA (VEC_SIZE * 3)(%rdi), %VMM(6) - VPMINU %YMM0, %YMM2, %YMM8 - VPMINU %YMM4, %YMM6, %YMM9 + VPMINU %VMM(0), %VMM(2), %VMM(8) + VPMINU %VMM(4), %VMM(6), %VMM(9) /* A zero CHAR in YMM9 means that there is a null CHAR. */ - VPMINU %YMM8, %YMM9, %YMM9 + VPMINU %VMM(8), %VMM(9), %VMM(9) /* Each bit set in K1 represents a non-null CHAR in YMM9. 
*/ - VPTESTM %YMM9, %YMM9, %k1 + VPTESTM %VMM(9), %VMM(9), %k1 # ifndef USE_AS_STRCASECMP_L - vpxorq (VEC_SIZE * 0)(%rsi), %YMM0, %YMM1 - vpxorq (VEC_SIZE * 1)(%rsi), %YMM2, %YMM3 - vpxorq (VEC_SIZE * 2)(%rsi), %YMM4, %YMM5 + vpxorq (VEC_SIZE * 0)(%rsi), %VMM(0), %VMM(1) + vpxorq (VEC_SIZE * 1)(%rsi), %VMM(2), %VMM(3) + vpxorq (VEC_SIZE * 2)(%rsi), %VMM(4), %VMM(5) /* Ternary logic to xor (VEC_SIZE * 3)(%rsi) with YMM6 while oring with YMM1. Result is stored in YMM6. */ - vpternlogd $0xde, (VEC_SIZE * 3)(%rsi), %YMM1, %YMM6 + vpternlogd $0xde, (VEC_SIZE * 3)(%rsi), %VMM(1), %VMM(6) # else - VMOVU (VEC_SIZE * 0)(%rsi), %YMM1 - TOLOWER_YMM (%YMM0, %YMM1) - VMOVU (VEC_SIZE * 1)(%rsi), %YMM3 - TOLOWER_YMM (%YMM2, %YMM3) - VMOVU (VEC_SIZE * 2)(%rsi), %YMM5 - TOLOWER_YMM (%YMM4, %YMM5) - VMOVU (VEC_SIZE * 3)(%rsi), %YMM7 - TOLOWER_YMM (%YMM6, %YMM7) - vpxorq %YMM0, %YMM1, %YMM1 - vpxorq %YMM2, %YMM3, %YMM3 - vpxorq %YMM4, %YMM5, %YMM5 - vpternlogd $0xde, %YMM7, %YMM1, %YMM6 + VMOVU (VEC_SIZE * 0)(%rsi), %VMM(1) + TOLOWER_VMM (%VMM(0), %VMM(1)) + VMOVU (VEC_SIZE * 1)(%rsi), %VMM(3) + TOLOWER_VMM (%VMM(2), %VMM(3)) + VMOVU (VEC_SIZE * 2)(%rsi), %VMM(5) + TOLOWER_VMM (%VMM(4), %VMM(5)) + VMOVU (VEC_SIZE * 3)(%rsi), %VMM(7) + TOLOWER_VMM (%VMM(6), %VMM(7)) + vpxorq %VMM(0), %VMM(1), %VMM(1) + vpxorq %VMM(2), %VMM(3), %VMM(3) + vpxorq %VMM(4), %VMM(5), %VMM(5) + vpternlogd $0xde, %VMM(7), %VMM(1), %VMM(6) # endif /* Or together YMM3, YMM5, and YMM6. */ - vpternlogd $0xfe, %YMM3, %YMM5, %YMM6 + vpternlogd $0xfe, %VMM(3), %VMM(5), %VMM(6) /* A non-zero CHAR in YMM6 represents a mismatch. */ - VPTESTNM %YMM6, %YMM6, %k0{%k1} - kmovd %k0, %LOOP_REG + VPTESTNM %VMM(6), %VMM(6), %k0{%k1} + KMOV %k0, %LOOP_REG TESTEQ %LOOP_REG jz L(loop) /* Find which VEC has the mismatch of end of string. 
*/ - VPTESTM %YMM0, %YMM0, %k1 - VPTESTNM %YMM1, %YMM1, %k0{%k1} - kmovd %k0, %ecx - TESTEQ %ecx + VPTESTM %VMM(0), %VMM(0), %k1 + VPTESTNM %VMM(1), %VMM(1), %k0{%k1} + KMOV %k0, %VRCX + TESTEQ %VRCX jnz L(return_vec_0_end) - VPTESTM %YMM2, %YMM2, %k1 - VPTESTNM %YMM3, %YMM3, %k0{%k1} - kmovd %k0, %ecx - TESTEQ %ecx + VPTESTM %VMM(2), %VMM(2), %k1 + VPTESTNM %VMM(3), %VMM(3), %k0{%k1} + KMOV %k0, %VRCX + TESTEQ %VRCX jnz L(return_vec_1_end) - /* Handle VEC 2 and 3 without branches. */ + /* Handle VEC 2 and 3 without branches if CHAR_PER_VEC <= 32. + */ L(return_vec_2_3_end): # ifdef USE_AS_STRNCMP subq $(CHAR_PER_VEC * 2), %rdx jbe L(ret_zero_end) # endif - VPTESTM %YMM4, %YMM4, %k1 - VPTESTNM %YMM5, %YMM5, %k0{%k1} - kmovd %k0, %ecx - TESTEQ %ecx + VPTESTM %VMM(4), %VMM(4), %k1 + VPTESTNM %VMM(5), %VMM(5), %k0{%k1} + KMOV %k0, %VRCX + TESTEQ %VRCX # if CHAR_PER_VEC <= 16 sall $CHAR_PER_VEC, %LOOP_REG orl %ecx, %LOOP_REG -# else +# elif CHAR_PER_VEC <= 32 salq $CHAR_PER_VEC, %LOOP_REG64 orq %rcx, %LOOP_REG64 +# else + /* We aren't combining last 2x VEC so branch on second the last. + */ + jnz L(return_vec_2_end) # endif -L(return_vec_3_end): + /* LOOP_REG contains matches for null/mismatch from the loop. If - VEC 0,1,and 2 all have no null and no mismatches then mismatch - must entirely be from VEC 3 which is fully represented by - LOOP_REG. */ + VEC 0,1,and 2 all have no null and no mismatches then + mismatch must entirely be from VEC 3 which is fully + represented by LOOP_REG. */ # if CHAR_PER_VEC <= 16 - tzcntl %LOOP_REG, %LOOP_REG + bsf %LOOP_REG, %LOOP_REG # else - tzcntq %LOOP_REG64, %LOOP_REG64 + bsfq %LOOP_REG64, %LOOP_REG64 # endif # ifdef USE_AS_STRNCMP + + /* If CHAR_PER_VEC == 64 we can't combine last 2x VEC so need to + adj length before last comparison. 
*/ +# if CHAR_PER_VEC == 64 + subq $CHAR_PER_VEC, %rdx + jbe L(ret_zero_end) +# endif + cmpq %LOOP_REG64, %rdx jbe L(ret_zero_end) # endif # ifdef USE_AS_WCSCMP - movl (VEC_SIZE * 2)(%rdi, %LOOP_REG64, SIZE_OF_CHAR), %ecx + movl (FALLTHROUGH_RETURN_OFFSET)(%rdi, %LOOP_REG64, SIZE_OF_CHAR), %ecx xorl %eax, %eax - cmpl (VEC_SIZE * 2)(%rsi, %LOOP_REG64, SIZE_OF_CHAR), %ecx + cmpl (FALLTHROUGH_RETURN_OFFSET)(%rsi, %LOOP_REG64, SIZE_OF_CHAR), %ecx je L(ret5) setl %al negl %eax xorl %r8d, %eax # else - movzbl (VEC_SIZE * 2)(%rdi, %LOOP_REG64), %eax - movzbl (VEC_SIZE * 2)(%rsi, %LOOP_REG64), %ecx + movzbl (FALLTHROUGH_RETURN_OFFSET)(%rdi, %LOOP_REG64), %eax + movzbl (FALLTHROUGH_RETURN_OFFSET)(%rsi, %LOOP_REG64), %ecx TOLOWER_gpr (%rax, %eax) TOLOWER_gpr (%rcx, %ecx) subl %ecx, %eax @@ -686,23 +748,39 @@ L(ret_zero_end): # endif + /* The L(return_vec_N_end) differ from L(return_vec_N) in that - they use the value of `r8` to negate the return value. This is - because the page cross logic can swap `rdi` and `rsi`. */ + they use the value of `r8` to negate the return value. This + is because the page cross logic can swap `rdi` and `rsi`. + */ .p2align 4,, 10 # ifdef USE_AS_STRNCMP L(return_vec_1_end): -# if CHAR_PER_VEC <= 16 +# if CHAR_PER_VEC <= 32 + /* If CHAR_PER_VEC <= 32 reuse code from L(return_vec_0_end) + without additional branches by adjusting the bit positions + from VEC1. We can't do this for CHAR_PER_VEC == 64. */ +# if CHAR_PER_VEC <= 16 sall $CHAR_PER_VEC, %ecx -# else +# else salq $CHAR_PER_VEC, %rcx +# endif +# else + /* If CHAR_PER_VEC == 64 we can't shift the return GPR so just + check it. 
*/ + bsf %VRCX, %VRCX + addl $(CHAR_PER_VEC), %ecx + cmpq %rcx, %rdx + ja L(ret_vec_0_end_finish) + xorl %eax, %eax + ret # endif # endif L(return_vec_0_end): # if (CHAR_PER_VEC <= 16) || !(defined USE_AS_STRNCMP) - tzcntl %ecx, %ecx + bsf %VRCX, %VRCX # else - tzcntq %rcx, %rcx + bsfq %rcx, %rcx # endif # ifdef USE_AS_STRNCMP @@ -710,6 +788,7 @@ L(return_vec_0_end): jbe L(ret_zero_end) # endif +L(ret_vec_0_end_finish): # ifdef USE_AS_WCSCMP movl (%rdi, %rcx, SIZE_OF_CHAR), %edx xorl %eax, %eax @@ -737,7 +816,7 @@ L(ret6): # ifndef USE_AS_STRNCMP .p2align 4,, 10 L(return_vec_1_end): - tzcntl %ecx, %ecx + bsf %VRCX, %VRCX # ifdef USE_AS_WCSCMP movl VEC_SIZE(%rdi, %rcx, SIZE_OF_CHAR), %edx xorl %eax, %eax @@ -760,6 +839,41 @@ L(ret7): # endif + /* If CHAR_PER_VEC == 64 we can't combine matches from the last + 2x VEC so we need a separate return label. */ +# if CHAR_PER_VEC == 64 +L(return_vec_2_end): + bsf %VRCX, %VRCX +# ifdef USE_AS_STRNCMP + cmpq %rcx, %rdx + jbe L(ret_zero_end) +# endif +# ifdef USE_AS_WCSCMP + movl (VEC_SIZE * 2)(%rdi, %rcx, SIZE_OF_CHAR), %edx + xorl %eax, %eax + cmpl (VEC_SIZE * 2)(%rsi, %rcx, SIZE_OF_CHAR), %edx + je L(ret31) + setl %al + negl %eax + /* This is the non-zero case for `eax` so just xorl with `r8d` + to flip if `rdi` and `rsi` were swapped. */ + xorl %r8d, %eax +# else + movzbl (VEC_SIZE * 2)(%rdi, %rcx), %eax + movzbl (VEC_SIZE * 2)(%rsi, %rcx), %ecx + TOLOWER_gpr (%rax, %eax) + TOLOWER_gpr (%rcx, %ecx) + subl %ecx, %eax + /* Flip `eax` if `rdi` and `rsi` were swapped in page cross + logic. Subtract `r8d` after xor for zero case. */ + xorl %r8d, %eax + subl %r8d, %eax +# endif +L(ret13): + ret +# endif + + /* Page cross in rsi in next 4x VEC. */ /* TODO: Improve logic here.
*/ @@ -778,11 +892,11 @@ L(page_cross_during_loop): cmpl $-(VEC_SIZE * 3), %eax jle L(less_1x_vec_till_page_cross) - VMOVA (%rdi), %YMM0 - VPTESTM %YMM0, %YMM0, %k2 - CMP_R1_S2_YMM (%YMM0, (%rsi), %YMM1, %k1){%k2} - kmovd %k1, %ecx - TESTEQ %ecx + VMOVA (%rdi), %VMM(0) + VPTESTM %VMM(0), %VMM(0), %k2 + CMP_R1_S2_VMM (%VMM(0), (%rsi), %VMM(1), %k1){%k2} + KMOV %k1, %VRCX + TESTEQ %VRCX jnz L(return_vec_0_end) /* if distance >= 2x VEC then eax > -(VEC_SIZE * 2). */ @@ -799,9 +913,9 @@ L(less_1x_vec_till_page_cross): to read back -VEC_SIZE. If rdi is truly at the start of a page here, it means the previous page (rdi - VEC_SIZE) has already been loaded earlier so must be valid. */ - VMOVU -VEC_SIZE(%rdi, %rax), %YMM0 - VPTESTM %YMM0, %YMM0, %k2 - CMP_R1_S2_YMM (%YMM0, -VEC_SIZE(%rsi, %rax), %YMM1, %k1){%k2} + VMOVU -VEC_SIZE(%rdi, %rax), %VMM(0) + VPTESTM %VMM(0), %VMM(0), %k2 + CMP_R1_S2_VMM (%VMM(0), -VEC_SIZE(%rsi, %rax), %VMM(1), %k1){%k2} /* Mask of potentially valid bits. The lower bits can be out of range comparisons (but safe regarding page crosses). */ @@ -813,12 +927,12 @@ L(less_1x_vec_till_page_cross): shlxl %ecx, %r10d, %ecx movzbl %cl, %r10d # else - movl $-1, %ecx - shlxl %esi, %ecx, %r10d + mov $-1, %VRCX + shlx %VRSI, %VRCX, %VR10 # endif - kmovd %k1, %ecx - notl %ecx + KMOV %k1, %VRCX + not %VRCX # ifdef USE_AS_STRNCMP @@ -838,12 +952,10 @@ L(less_1x_vec_till_page_cross): /* Readjust eax before potentially returning to the loop. */ addl $(PAGE_SIZE - VEC_SIZE * 4), %eax - andl %r10d, %ecx + and %VR10, %VRCX jz L(loop_skip_page_cross_check) - .p2align 4,, 3 -L(return_page_cross_end): - tzcntl %ecx, %ecx + bsf %VRCX, %VRCX # if (defined USE_AS_STRNCMP) || (defined USE_AS_WCSCMP) leal -VEC_SIZE(%OFFSET_REG64, %rcx, SIZE_OF_CHAR), %ecx @@ -874,8 +986,12 @@ L(ret8): # ifdef USE_AS_STRNCMP .p2align 4,, 10 L(return_page_cross_end_check): - andl %r10d, %ecx - tzcntl %ecx, %ecx + and %VR10, %VRCX + /* Need to use tzcnt here as VRCX may be zero. 
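The tzcnt-vs-bsf distinction this comment relies on can be modeled in C: tzcnt is architecturally defined to return the operand width for a zero input, while bsf leaves its destination undefined when the source is zero. A small reference model (hypothetical helper, not part of the patch):

```c
#include <assert.h>
#include <stdint.h>

/* Reference model of 32-bit tzcnt: count trailing zero bits, returning the
   operand width (32) when the input is zero.  This is the property the
   page-cross code depends on, and the reason bsf (result undefined for a
   zero source) cannot be used when the mask may be empty.  */
static unsigned
tzcnt32 (uint32_t x)
{
  unsigned n = 0;
  if (x == 0)
    return 32;			/* tzcnt: zero input yields operand size.  */
  while ((x & 1) == 0)
    {
      x >>= 1;
      n++;
    }
  return n;
}
```

So on a zero mask the computed index is CHAR_PER_VEC, past any valid remaining length, which is why the code can use the result unconditionally.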
If VRCX is zero + tzcnt(VRCX) will be CHAR_PER_VEC and the remaining length (edx) is + guaranteed to be <= CHAR_PER_VEC, so we will only use the return + idx if VRCX was non-zero. */ + tzcnt %VRCX, %VRCX leal -VEC_SIZE(%rax, %rcx, SIZE_OF_CHAR), %ecx # ifdef USE_AS_WCSCMP sall $2, %edx @@ -892,11 +1008,11 @@ L(more_2x_vec_till_page_cross): /* If more 2x vec till cross we will complete a full loop iteration here. */ - VMOVA VEC_SIZE(%rdi), %YMM0 - VPTESTM %YMM0, %YMM0, %k2 - CMP_R1_S2_YMM (%YMM0, VEC_SIZE(%rsi), %YMM1, %k1){%k2} - kmovd %k1, %ecx - TESTEQ %ecx + VMOVA VEC_SIZE(%rdi), %VMM(0) + VPTESTM %VMM(0), %VMM(0), %k2 + CMP_R1_S2_VMM (%VMM(0), VEC_SIZE(%rsi), %VMM(1), %k1){%k2} + KMOV %k1, %VRCX + TESTEQ %VRCX jnz L(return_vec_1_end) # ifdef USE_AS_STRNCMP @@ -907,18 +1023,18 @@ L(more_2x_vec_till_page_cross): subl $-(VEC_SIZE * 4), %eax /* Safe to include comparisons from lower bytes. */ - VMOVU -(VEC_SIZE * 2)(%rdi, %rax), %YMM0 - VPTESTM %YMM0, %YMM0, %k2 - CMP_R1_S2_YMM (%YMM0, -(VEC_SIZE * 2)(%rsi, %rax), %YMM1, %k1){%k2} - kmovd %k1, %ecx - TESTEQ %ecx + VMOVU -(VEC_SIZE * 2)(%rdi, %rax), %VMM(0) + VPTESTM %VMM(0), %VMM(0), %k2 + CMP_R1_S2_VMM (%VMM(0), -(VEC_SIZE * 2)(%rsi, %rax), %VMM(1), %k1){%k2} + KMOV %k1, %VRCX + TESTEQ %VRCX jnz L(return_vec_page_cross_0) - VMOVU -(VEC_SIZE * 1)(%rdi, %rax), %YMM0 - VPTESTM %YMM0, %YMM0, %k2 - CMP_R1_S2_YMM (%YMM0, -(VEC_SIZE * 1)(%rsi, %rax), %YMM1, %k1){%k2} - kmovd %k1, %ecx - TESTEQ %ecx + VMOVU -(VEC_SIZE * 1)(%rdi, %rax), %VMM(0) + VPTESTM %VMM(0), %VMM(0), %k2 + CMP_R1_S2_VMM (%VMM(0), -(VEC_SIZE * 1)(%rsi, %rax), %VMM(1), %k1){%k2} + KMOV %k1, %VRCX + TESTEQ %VRCX jnz L(return_vec_page_cross_1) # ifdef USE_AS_STRNCMP @@ -937,30 +1053,30 @@ L(more_2x_vec_till_page_cross): # endif /* Finish the loop.
*/ - VMOVA (VEC_SIZE * 2)(%rdi), %YMM4 - VMOVA (VEC_SIZE * 3)(%rdi), %YMM6 - VPMINU %YMM4, %YMM6, %YMM9 - VPTESTM %YMM9, %YMM9, %k1 + VMOVA (VEC_SIZE * 2)(%rdi), %VMM(4) + VMOVA (VEC_SIZE * 3)(%rdi), %VMM(6) + VPMINU %VMM(4), %VMM(6), %VMM(9) + VPTESTM %VMM(9), %VMM(9), %k1 # ifndef USE_AS_STRCASECMP_L - vpxorq (VEC_SIZE * 2)(%rsi), %YMM4, %YMM5 + vpxorq (VEC_SIZE * 2)(%rsi), %VMM(4), %VMM(5) /* YMM6 = YMM5 | ((VEC_SIZE * 3)(%rsi) ^ YMM6). */ - vpternlogd $0xde, (VEC_SIZE * 3)(%rsi), %YMM5, %YMM6 + vpternlogd $0xde, (VEC_SIZE * 3)(%rsi), %VMM(5), %VMM(6) # else - VMOVU (VEC_SIZE * 2)(%rsi), %YMM5 - TOLOWER_YMM (%YMM4, %YMM5) - VMOVU (VEC_SIZE * 3)(%rsi), %YMM7 - TOLOWER_YMM (%YMM6, %YMM7) - vpxorq %YMM4, %YMM5, %YMM5 - vpternlogd $0xde, %YMM7, %YMM5, %YMM6 -# endif - VPTESTNM %YMM6, %YMM6, %k0{%k1} - kmovd %k0, %LOOP_REG + VMOVU (VEC_SIZE * 2)(%rsi), %VMM(5) + TOLOWER_VMM (%VMM(4), %VMM(5)) + VMOVU (VEC_SIZE * 3)(%rsi), %VMM(7) + TOLOWER_VMM (%VMM(6), %VMM(7)) + vpxorq %VMM(4), %VMM(5), %VMM(5) + vpternlogd $0xde, %VMM(7), %VMM(5), %VMM(6) +# endif + VPTESTNM %VMM(6), %VMM(6), %k0{%k1} + KMOV %k0, %LOOP_REG TESTEQ %LOOP_REG jnz L(return_vec_2_3_end) /* Best for code size to include ucond-jmp here. Would be faster - if this case is hot to duplicate the L(return_vec_2_3_end) code - as fall-through and have jump back to loop on mismatch + if this case is hot to duplicate the L(return_vec_2_3_end) + code as fall-through and have jump back to loop on mismatch comparison. */ subq $-(VEC_SIZE * 4), %rdi subq $-(VEC_SIZE * 4), %rsi @@ -980,7 +1096,7 @@ L(ret_zero_in_loop_page_cross): L(return_vec_page_cross_0): addl $-VEC_SIZE, %eax L(return_vec_page_cross_1): - tzcntl %ecx, %ecx + bsf %VRCX, %VRCX # if defined USE_AS_STRNCMP || defined USE_AS_WCSCMP leal -VEC_SIZE(%rax, %rcx, SIZE_OF_CHAR), %ecx # ifdef USE_AS_STRNCMP @@ -1023,8 +1139,8 @@ L(ret9): L(page_cross): # ifndef USE_AS_STRNCMP /* If both are VEC aligned we don't need any special logic here. 
- Only valid for strcmp where stop condition is guranteed to be - reachable by just reading memory. */ + Only valid for strcmp where stop condition is guaranteed to + be reachable by just reading memory. */ testl $((VEC_SIZE - 1) << 20), %eax jz L(no_page_cross) # endif @@ -1065,11 +1181,11 @@ L(page_cross): loadable memory until within 1x VEC of page cross. */ .p2align 4,, 8 L(page_cross_loop): - VMOVU (%rdi, %OFFSET_REG64, SIZE_OF_CHAR), %YMM0 - VPTESTM %YMM0, %YMM0, %k2 - CMP_R1_S2_YMM (%YMM0, (%rsi, %OFFSET_REG64, SIZE_OF_CHAR), %YMM1, %k1){%k2} - kmovd %k1, %ecx - TESTEQ %ecx + VMOVU (%rdi, %OFFSET_REG64, SIZE_OF_CHAR), %VMM(0) + VPTESTM %VMM(0), %VMM(0), %k2 + CMP_R1_S2_VMM (%VMM(0), (%rsi, %OFFSET_REG64, SIZE_OF_CHAR), %VMM(1), %k1){%k2} + KMOV %k1, %VRCX + TESTEQ %VRCX jnz L(check_ret_vec_page_cross) addl $CHAR_PER_VEC, %OFFSET_REG # ifdef USE_AS_STRNCMP @@ -1087,13 +1203,13 @@ L(page_cross_loop): subl %eax, %OFFSET_REG /* OFFSET_REG has distance to page cross - VEC_SIZE. Guranteed to not cross page so is safe to load. Since we have already - loaded at least 1 VEC from rsi it is also guranteed to be safe. - */ + loaded at least 1 VEC from rsi it is also guaranteed to be + safe.
*/ + VMOVU (%rdi, %OFFSET_REG64, SIZE_OF_CHAR), %VMM(0) + VPTESTM %VMM(0), %VMM(0), %k2 + CMP_R1_S2_VMM (%VMM(0), (%rsi, %OFFSET_REG64, SIZE_OF_CHAR), %VMM(1), %k1){%k2} - kmovd %k1, %ecx + KMOV %k1, %VRCX # ifdef USE_AS_STRNCMP leal CHAR_PER_VEC(%OFFSET_REG64), %eax cmpq %rax, %rdx @@ -1104,7 +1220,7 @@ L(page_cross_loop): addq %rdi, %rdx # endif # endif - TESTEQ %ecx + TESTEQ %VRCX jz L(prepare_loop_no_len) .p2align 4,, 4 @@ -1112,7 +1228,7 @@ L(ret_vec_page_cross): # ifndef USE_AS_STRNCMP L(check_ret_vec_page_cross): # endif - tzcntl %ecx, %ecx + tzcnt %VRCX, %VRCX addl %OFFSET_REG, %ecx L(ret_vec_page_cross_cont): # ifdef USE_AS_WCSCMP @@ -1139,9 +1255,9 @@ L(ret12): # ifdef USE_AS_STRNCMP .p2align 4,, 10 L(check_ret_vec_page_cross2): - TESTEQ %ecx + TESTEQ %VRCX L(check_ret_vec_page_cross): - tzcntl %ecx, %ecx + tzcnt %VRCX, %VRCX addl %OFFSET_REG, %ecx cmpq %rcx, %rdx ja L(ret_vec_page_cross_cont) @@ -1180,8 +1296,71 @@ L(less_1x_vec_till_page): # ifdef USE_AS_WCSCMP shrl $2, %eax # endif + + /* Find largest load size we can use. For VEC_SIZE == 64 only + check if we can do a full ymm load. */ +# if VEC_SIZE == 64 + + cmpl $((VEC_SIZE - 32) / SIZE_OF_CHAR), %eax + ja L(less_32_till_page) + + + /* Use 32 byte comparison. */ + VMOVU (%rdi), %VMM_256(0) + VPTESTM %VMM_256(0), %VMM_256(0), %k2 + CMP_R1_S2_YMM (%VMM_256(0), (%rsi), %VMM_256(1), %k1){%k2} + kmovd %k1, %ecx +# ifdef USE_AS_WCSCMP + subl $0xff, %ecx +# else + incl %ecx +# endif + jnz L(check_ret_vec_page_cross) + movl $((VEC_SIZE - 32) / SIZE_OF_CHAR), %OFFSET_REG +# ifdef USE_AS_STRNCMP + cmpq %OFFSET_REG64, %rdx + jbe L(ret_zero_page_cross_slow_case64) + subl %eax, %OFFSET_REG +# else + /* Explicit check for 32 byte alignment.
*/ + subl %eax, %OFFSET_REG + jz L(prepare_loop) +# endif + VMOVU (%rdi, %OFFSET_REG64, SIZE_OF_CHAR), %VMM_256(0) + VPTESTM %VMM_256(0), %VMM_256(0), %k2 + CMP_R1_S2_YMM (%VMM_256(0), (%rsi, %OFFSET_REG64, SIZE_OF_CHAR), %VMM_256(1), %k1){%k2} + kmovd %k1, %ecx +# ifdef USE_AS_WCSCMP + subl $0xff, %ecx +# else + incl %ecx +# endif + jnz L(check_ret_vec_page_cross) +# ifdef USE_AS_STRNCMP + addl $(32 / SIZE_OF_CHAR), %OFFSET_REG + subq %OFFSET_REG64, %rdx + jbe L(ret_zero_page_cross_slow_case64) + subq $-(CHAR_PER_VEC * 4), %rdx + + leaq -(VEC_SIZE * 4)(%rdi, %OFFSET_REG64, SIZE_OF_CHAR), %rdi + leaq -(VEC_SIZE * 4)(%rsi, %OFFSET_REG64, SIZE_OF_CHAR), %rsi +# else + leaq (32 - VEC_SIZE * 4)(%rdi, %OFFSET_REG64, SIZE_OF_CHAR), %rdi + leaq (32 - VEC_SIZE * 4)(%rsi, %OFFSET_REG64, SIZE_OF_CHAR), %rsi +# endif + jmp L(prepare_loop_aligned) + +# ifdef USE_AS_STRNCMP + .p2align 4,, 2 +L(ret_zero_page_cross_slow_case64): + xorl %eax, %eax + ret +# endif +L(less_32_till_page): +# endif + /* Find largest load size we can use. */ - cmpl $(16 / SIZE_OF_CHAR), %eax + cmpl $((VEC_SIZE - 16) / SIZE_OF_CHAR), %eax ja L(less_16_till_page) /* Use 16 byte comparison. */ @@ -1195,9 +1374,14 @@ L(less_1x_vec_till_page): incw %cx # endif jnz L(check_ret_vec_page_cross) - movl $(16 / SIZE_OF_CHAR), %OFFSET_REG + + movl $((VEC_SIZE - 16) / SIZE_OF_CHAR), %OFFSET_REG # ifdef USE_AS_STRNCMP +# if VEC_SIZE == 32 cmpq %OFFSET_REG64, %rdx +# else + cmpq $(16 / SIZE_OF_CHAR), %rdx +# endif jbe L(ret_zero_page_cross_slow_case0) subl %eax, %OFFSET_REG # else @@ -1239,7 +1423,7 @@ L(ret_zero_page_cross_slow_case0): .p2align 4,, 10 L(less_16_till_page): - cmpl $(24 / SIZE_OF_CHAR), %eax + cmpl $((VEC_SIZE - 8) / SIZE_OF_CHAR), %eax ja L(less_8_till_page) /* Use 8 byte comparison. 
*/ @@ -1260,7 +1444,7 @@ L(less_16_till_page): cmpq $(8 / SIZE_OF_CHAR), %rdx jbe L(ret_zero_page_cross_slow_case0) # endif - movl $(24 / SIZE_OF_CHAR), %OFFSET_REG + movl $((VEC_SIZE - 8) / SIZE_OF_CHAR), %OFFSET_REG subl %eax, %OFFSET_REG vmovq (%rdi, %OFFSET_REG64, SIZE_OF_CHAR), %xmm0 @@ -1320,7 +1504,7 @@ L(ret_less_8_wcs): ret # else - cmpl $28, %eax + cmpl $(VEC_SIZE - 4), %eax ja L(less_4_till_page) vmovd (%rdi), %xmm0 @@ -1335,7 +1519,7 @@ L(ret_less_8_wcs): cmpq $4, %rdx jbe L(ret_zero_page_cross_slow_case1) # endif - movl $(28 / SIZE_OF_CHAR), %OFFSET_REG + movl $((VEC_SIZE - 4) / SIZE_OF_CHAR), %OFFSET_REG subl %eax, %OFFSET_REG vmovd (%rdi, %OFFSET_REG64, SIZE_OF_CHAR), %xmm0 @@ -1386,7 +1570,7 @@ L(less_4_loop): # endif incq %rdi /* end condition is reach page boundary (rdi is aligned). */ - testl $31, %edi + testb $(VEC_SIZE - 1), %dil jnz L(less_4_loop) leaq -(VEC_SIZE * 4)(%rdi, %rsi), %rsi addq $-(VEC_SIZE * 4), %rdi
From patchwork Tue Oct 18 23:19:38 2022 X-Patchwork-Submitter: Noah Goldstein X-Patchwork-Id: 1691740 To: libc-alpha@sourceware.org Subject: [PATCH v2 7/7] Bench: Improve benchtests for memchr, strchr, strnlen, strrchr Date: Tue, 18 Oct 2022 16:19:38 -0700 Message-Id: <20221018231938.3621554-7-goldstein.w.n@gmail.com> In-Reply-To: <20221018231938.3621554-1-goldstein.w.n@gmail.com> References: <20221018024901.3381469-1-goldstein.w.n@gmail.com> <20221018231938.3621554-1-goldstein.w.n@gmail.com> From: Noah Goldstein Reply-To: Noah Goldstein
1. Add more complete coverage in the medium size range. 2. In strnlen remove the `1 << i` which was UB (`i` could go beyond 32/64) 3.
Add timer for total benchmark runtime (useful for deciding about tradeoff between coverage and runtime). --- benchtests/bench-memchr.c | 77 +++++++++++++++++++++++++----------- benchtests/bench-rawmemchr.c | 30 ++++++++++++-- benchtests/bench-strchr.c | 35 +++++++++++----- benchtests/bench-strnlen.c | 12 +++--- benchtests/bench-strrchr.c | 28 ++++++++++++- 5 files changed, 137 insertions(+), 45 deletions(-) diff --git a/benchtests/bench-memchr.c b/benchtests/bench-memchr.c index 0facda2fa0..2ec9dd86d0 100644 --- a/benchtests/bench-memchr.c +++ b/benchtests/bench-memchr.c @@ -126,7 +126,7 @@ do_test (json_ctx_t *json_ctx, size_t align, size_t pos, size_t len, int test_main (void) { - size_t i; + size_t i, j, al, al_max; int repeats; json_ctx_t json_ctx; test_init (); @@ -147,35 +147,46 @@ test_main (void) json_array_begin (&json_ctx, "results"); + al_max = 0; +#ifdef USE_AS_MEMRCHR + al_max = getpagesize () / 2; +#endif + for (repeats = 0; repeats < 2; ++repeats) { - for (i = 1; i < 8; ++i) + for (al = 0; al <= al_max; al += getpagesize () / 2) { - do_test (&json_ctx, 0, 16 << i, 2048, 23, repeats); - do_test (&json_ctx, i, 64, 256, 23, repeats); - do_test (&json_ctx, 0, 16 << i, 2048, 0, repeats); - do_test (&json_ctx, i, 64, 256, 0, repeats); - - do_test (&json_ctx, getpagesize () - 15, 64, 256, 0, repeats); + for (i = 1; i < 8; ++i) + { + do_test (&json_ctx, al, 16 << i, 2048, 23, repeats); + do_test (&json_ctx, al + i, 64, 256, 23, repeats); + do_test (&json_ctx, al, 16 << i, 2048, 0, repeats); + do_test (&json_ctx, al + i, 64, 256, 0, repeats); + + do_test (&json_ctx, al + getpagesize () - 15, 64, 256, 0, + repeats); #ifdef USE_AS_MEMRCHR - /* Also test the position close to the beginning for memrchr. */ - do_test (&json_ctx, 0, i, 256, 23, repeats); - do_test (&json_ctx, 0, i, 256, 0, repeats); - do_test (&json_ctx, i, i, 256, 23, repeats); - do_test (&json_ctx, i, i, 256, 0, repeats); + /* Also test the position close to the beginning for memrchr. 
*/ + do_test (&json_ctx, al, i, 256, 23, repeats); + do_test (&json_ctx, al, i, 256, 0, repeats); + do_test (&json_ctx, al + i, i, 256, 23, repeats); + do_test (&json_ctx, al + i, i, 256, 0, repeats); #endif + } + for (i = 1; i < 8; ++i) + { + do_test (&json_ctx, al + i, i << 5, 192, 23, repeats); + do_test (&json_ctx, al + i, i << 5, 192, 0, repeats); + do_test (&json_ctx, al + i, i << 5, 256, 23, repeats); + do_test (&json_ctx, al + i, i << 5, 256, 0, repeats); + do_test (&json_ctx, al + i, i << 5, 512, 23, repeats); + do_test (&json_ctx, al + i, i << 5, 512, 0, repeats); + + do_test (&json_ctx, al + getpagesize () - 15, i << 5, 256, 23, + repeats); + } } - for (i = 1; i < 8; ++i) - { - do_test (&json_ctx, i, i << 5, 192, 23, repeats); - do_test (&json_ctx, i, i << 5, 192, 0, repeats); - do_test (&json_ctx, i, i << 5, 256, 23, repeats); - do_test (&json_ctx, i, i << 5, 256, 0, repeats); - do_test (&json_ctx, i, i << 5, 512, 23, repeats); - do_test (&json_ctx, i, i << 5, 512, 0, repeats); - - do_test (&json_ctx, getpagesize () - 15, i << 5, 256, 23, repeats); - } + for (i = 1; i < 32; ++i) { do_test (&json_ctx, 0, i, i + 1, 23, repeats); @@ -207,6 +218,24 @@ test_main (void) do_test (&json_ctx, 0, 2, i + 1, 0, repeats); #endif } + for (al = 0; al <= al_max; al += getpagesize () / 2) + { + for (i = (16 / sizeof (CHAR)); i <= (8192 / sizeof (CHAR)); i += i) + { + for (j = 0; j <= (384 / sizeof (CHAR)); + j += (32 / sizeof (CHAR))) + { + do_test (&json_ctx, al, i + j, i, 23, repeats); + do_test (&json_ctx, al, i, i + j, 23, repeats); + if (j < i) + { + do_test (&json_ctx, al, i - j, i, 23, repeats); + do_test (&json_ctx, al, i, i - j, 23, repeats); + } + } + } + } + #ifndef USE_AS_MEMRCHR break; #endif diff --git a/benchtests/bench-rawmemchr.c b/benchtests/bench-rawmemchr.c index b1803afc14..dab77f3858 100644 --- a/benchtests/bench-rawmemchr.c +++ b/benchtests/bench-rawmemchr.c @@ -70,7 +70,7 @@ do_test (json_ctx_t *json_ctx, size_t align, size_t pos, size_t len, int 
seek_ch size_t i; char *result; - align &= 7; + align &= getpagesize () - 1; if (align + len >= page_size) return; @@ -106,7 +106,6 @@ test_main (void) { json_ctx_t json_ctx; size_t i; - test_init (); json_init (&json_ctx, 0, stdout); @@ -120,7 +119,7 @@ test_main (void) json_array_begin (&json_ctx, "ifuncs"); FOR_EACH_IMPL (impl, 0) - json_element_string (&json_ctx, impl->name); + json_element_string (&json_ctx, impl->name); json_array_end (&json_ctx); json_array_begin (&json_ctx, "results"); @@ -137,6 +136,31 @@ test_main (void) do_test (&json_ctx, 0, i, i + 1, 23); do_test (&json_ctx, 0, i, i + 1, 0); } + for (; i < 256; i += 32) + { + do_test (&json_ctx, 0, i, i + 1, 23); + do_test (&json_ctx, 0, i - 1, i, 23); + } + for (; i < 512; i += 64) + { + do_test (&json_ctx, 0, i, i + 1, 23); + do_test (&json_ctx, 0, i - 1, i, 23); + } + for (; i < 1024; i += 128) + { + do_test (&json_ctx, 0, i, i + 1, 23); + do_test (&json_ctx, 0, i - 1, i, 23); + } + for (; i < 2048; i += 256) + { + do_test (&json_ctx, 0, i, i + 1, 23); + do_test (&json_ctx, 0, i - 1, i, 23); + } + for (; i < 4096; i += 512) + { + do_test (&json_ctx, 0, i, i + 1, 23); + do_test (&json_ctx, 0, i - 1, i, 23); + } json_array_end (&json_ctx); json_attr_object_end (&json_ctx); diff --git a/benchtests/bench-strchr.c b/benchtests/bench-strchr.c index 54640bde7e..aeb882d442 100644 --- a/benchtests/bench-strchr.c +++ b/benchtests/bench-strchr.c @@ -287,8 +287,8 @@ int test_main (void) { json_ctx_t json_ctx; - size_t i; + size_t i, j; test_init (); json_init (&json_ctx, 0, stdout); @@ -367,15 +367,30 @@ test_main (void) do_test (&json_ctx, 0, i, i + 1, 0, BIG_CHAR); } - DO_RAND_TEST(&json_ctx, 0, 15, 16, 0.0); - DO_RAND_TEST(&json_ctx, 0, 15, 16, 0.1); - DO_RAND_TEST(&json_ctx, 0, 15, 16, 0.25); - DO_RAND_TEST(&json_ctx, 0, 15, 16, 0.33); - DO_RAND_TEST(&json_ctx, 0, 15, 16, 0.5); - DO_RAND_TEST(&json_ctx, 0, 15, 16, 0.66); - DO_RAND_TEST(&json_ctx, 0, 15, 16, 0.75); - DO_RAND_TEST(&json_ctx, 0, 15, 16, 0.9); 
- DO_RAND_TEST(&json_ctx, 0, 15, 16, 1.0); + for (i = 16 / sizeof (CHAR); i <= 8192 / sizeof (CHAR); i += i) + { + for (j = 32 / sizeof (CHAR); j <= 320 / sizeof (CHAR); + j += 32 / sizeof (CHAR)) + { + do_test (&json_ctx, 0, i, i + j, 0, MIDDLE_CHAR); + do_test (&json_ctx, 0, i + j, i, 0, MIDDLE_CHAR); + if (i > j) + { + do_test (&json_ctx, 0, i, i - j, 0, MIDDLE_CHAR); + do_test (&json_ctx, 0, i - j, i, 0, MIDDLE_CHAR); + } + } + } + + DO_RAND_TEST (&json_ctx, 0, 15, 16, 0.0); + DO_RAND_TEST (&json_ctx, 0, 15, 16, 0.1); + DO_RAND_TEST (&json_ctx, 0, 15, 16, 0.25); + DO_RAND_TEST (&json_ctx, 0, 15, 16, 0.33); + DO_RAND_TEST (&json_ctx, 0, 15, 16, 0.5); + DO_RAND_TEST (&json_ctx, 0, 15, 16, 0.66); + DO_RAND_TEST (&json_ctx, 0, 15, 16, 0.75); + DO_RAND_TEST (&json_ctx, 0, 15, 16, 0.9); + DO_RAND_TEST (&json_ctx, 0, 15, 16, 1.0); json_array_end (&json_ctx); json_attr_object_end (&json_ctx); diff --git a/benchtests/bench-strnlen.c b/benchtests/bench-strnlen.c index 13b46b3f57..82c02eb6ed 100644 --- a/benchtests/bench-strnlen.c +++ b/benchtests/bench-strnlen.c @@ -195,19 +195,19 @@ test_main (void) { for (j = 0; j <= (704 / sizeof (CHAR)); j += (32 / sizeof (CHAR))) { - do_test (&json_ctx, 0, 1 << i, (i + j), BIG_CHAR); do_test (&json_ctx, 0, i + j, i, BIG_CHAR); - - do_test (&json_ctx, 64, 1 << i, (i + j), BIG_CHAR); do_test (&json_ctx, 64, i + j, i, BIG_CHAR); + do_test (&json_ctx, 0, i, i + j, BIG_CHAR); + do_test (&json_ctx, 64, i, i + j, BIG_CHAR); + if (j < i) { - do_test (&json_ctx, 0, 1 << i, i - j, BIG_CHAR); do_test (&json_ctx, 0, i - j, i, BIG_CHAR); - - do_test (&json_ctx, 64, 1 << i, i - j, BIG_CHAR); do_test (&json_ctx, 64, i - j, i, BIG_CHAR); + + do_test (&json_ctx, 0, i, i - j, BIG_CHAR); + do_test (&json_ctx, 64, i, i - j, BIG_CHAR); } } } diff --git a/benchtests/bench-strrchr.c b/benchtests/bench-strrchr.c index 7cd2a15484..3fcf3f281d 100644 --- a/benchtests/bench-strrchr.c +++ b/benchtests/bench-strrchr.c @@ -151,7 +151,7 @@ int test_main (void) { 
json_ctx_t json_ctx; - size_t i, j; + size_t i, j, k; int seek; test_init (); @@ -173,7 +173,7 @@ test_main (void) for (seek = 0; seek <= 23; seek += 23) { - for (j = 1; j < 32; j += j) + for (j = 1; j <= 256; j = (j * 4)) { for (i = 1; i < 9; ++i) { @@ -197,6 +197,30 @@ test_main (void) do_test (&json_ctx, getpagesize () - i / 2 - 1, i, i + 1, seek, SMALL_CHAR, j); } + + for (i = (16 / sizeof (CHAR)); i <= (288 / sizeof (CHAR)); i += 32) + { + do_test (&json_ctx, 0, i - 16, i, seek, SMALL_CHAR, j); + do_test (&json_ctx, 0, i, i + 16, seek, SMALL_CHAR, j); + } + + for (i = (16 / sizeof (CHAR)); i <= (2048 / sizeof (CHAR)); i += i) + { + for (k = 0; k <= (288 / sizeof (CHAR)); + k += (48 / sizeof (CHAR))) + { + do_test (&json_ctx, 0, k, i, seek, SMALL_CHAR, j); + do_test (&json_ctx, 0, i, i + k, seek, SMALL_CHAR, j); + + if (k < i) + { + do_test (&json_ctx, 0, i - k, i, seek, SMALL_CHAR, j); + do_test (&json_ctx, 0, k, i - k, seek, SMALL_CHAR, j); + do_test (&json_ctx, 0, i, i - k, seek, SMALL_CHAR, j); + } + } + } + if (seek == 0) { break;
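Regarding item 2 of the commit message above (the strnlen benchtest fix): a shift expression `1 << i` is undefined behavior once `i` reaches the bit width of int, and the benchtest loops let `i` grow well past 32. A standalone illustration of the bound being checked (hypothetical helper, not code from the patch):

```c
#include <assert.h>
#include <limits.h>

/* 1 << i is only defined when i is non-negative and strictly less than the
   bit width of int; the old strnlen benchtest loops let i exceed that bound,
   which is why the fixed loops pass the size i directly instead of 1 << i.  */
static int
shift_in_range (unsigned i)
{
  return i < sizeof (int) * CHAR_BIT;
}
```

On the usual 32-bit-int targets this rejects every shift count the old loops could produce beyond 31.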