From patchwork Tue Oct 18 23:19:32 2022
X-Patchwork-Submitter: Noah Goldstein
X-Patchwork-Id: 1691738
To: libc-alpha@sourceware.org
Subject: [PATCH v2 1/7] x86: Optimize memchr-evex.S and implement with VMM headers
Date: Tue, 18 Oct 2022 16:19:32 -0700
Message-Id: <20221018231938.3621554-1-goldstein.w.n@gmail.com>
In-Reply-To: <20221018024901.3381469-1-goldstein.w.n@gmail.com>
References: <20221018024901.3381469-1-goldstein.w.n@gmail.com>
From: Noah Goldstein

Optimizations are:

1. Use the fact that tzcnt(0) -> VEC_SIZE for memchr to save a branch
   in the short string case.
2. Restructure code so that small strings are given the hot path.
   - This is a net-zero on the benchmark suite, but in general it makes
     sense as smaller sizes are far more common.
3. Use more code-size efficient instructions.
   - tzcnt ...     -> bsf ...
   - vpcmpb $0 ... -> vpcmpeq ...
4. Align labels less aggressively, especially if it doesn't save fetch
   blocks or causes the basic-block to span extra cache-lines.
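The branch-saving trick in point 1 can be sketched in C. This is a hypothetical illustration, not code from the patch (the real routine works on a VEC_SIZE-wide compare mask in a k-register; here a 32-bit mask stands in for it, and the helper names `tzcnt32` / `first_match` are invented). Because `tzcnt` of an all-zero mask is defined to return the operand width, a single `len <= tzcnt(mask)` comparison folds "no match in the first vector" and "first match is past the length" into one branch:

```c
#include <assert.h>
#include <stddef.h>

/* tzcnt semantics: trailing-zero count, defined as the operand width
   (32 here) for a zero input -- unlike __builtin_ctz, which is
   undefined on 0.  */
unsigned tzcnt32 (unsigned mask)
{
  return mask ? (unsigned) __builtin_ctz (mask) : 32;
}

/* Hypothetical short-string tail of a memchr-style routine: MASK has
   one bit set per matching byte of the first 32-byte vector.  Since
   tzcnt32 (0) == 32, the single LEN <= POS test covers both "no match
   at all" and "match out of bounds".  */
const char *first_match (const char *buf, unsigned mask, size_t len)
{
  unsigned pos = tzcnt32 (mask);
  if (len <= pos)		/* One branch instead of two.  */
    return NULL;
  return buf + pos;
}
```

For example, `first_match (buf, 0, 5)` (no match, short length) and `first_match (buf, 1u << 7, 6)` (match beyond length) both return NULL through the same comparison.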
The optimizations (especially for point 2) make the memchr and
rawmemchr code essentially incompatible, so split rawmemchr-evex
into a new file.

Code Size Changes:
memchr-evex.S       : -107 bytes
rawmemchr-evex.S    :  -53 bytes

Net perf changes:

Reported as geometric mean of all improvements / regressions from N=10
runs of the benchtests. Value is New Time / Old Time, so < 1.0 is an
improvement and > 1.0 is a regression.

memchr-evex.S       : 0.928
rawmemchr-evex.S    : 0.986 (fewer targets cross cache lines)

Full results attached in email.

Full check passes on x86-64.
---
 sysdeps/x86_64/multiarch/memchr-evex.S        | 939 ++++++++++--------
 sysdeps/x86_64/multiarch/rawmemchr-evex-rtm.S |   9 +-
 sysdeps/x86_64/multiarch/rawmemchr-evex.S     | 313 +++++-
 3 files changed, 851 insertions(+), 410 deletions(-)

diff --git a/sysdeps/x86_64/multiarch/memchr-evex.S b/sysdeps/x86_64/multiarch/memchr-evex.S
index 0dd4f1dcce..23a1c0018e 100644
--- a/sysdeps/x86_64/multiarch/memchr-evex.S
+++ b/sysdeps/x86_64/multiarch/memchr-evex.S
@@ -21,17 +21,27 @@

 #if ISA_SHOULD_BUILD (4)

+# ifndef VEC_SIZE
+# include "x86-evex256-vecs.h"
+# endif
+
 # ifndef MEMCHR
 # define MEMCHR __memchr_evex
 # endif

 # ifdef USE_AS_WMEMCHR
+# define PC_SHIFT_GPR rcx
+# define VPTESTN vptestnmd
 # define VPBROADCAST vpbroadcastd
 # define VPMINU vpminud
 # define VPCMP vpcmpd
 # define VPCMPEQ vpcmpeqd
 # define CHAR_SIZE 4
+
+# define USE_WIDE_CHAR
 # else
+# define PC_SHIFT_GPR rdi
+# define VPTESTN vptestnmb
 # define VPBROADCAST vpbroadcastb
 # define VPMINU vpminub
 # define VPCMP vpcmpb
@@ -39,534 +49,661 @@
 # define CHAR_SIZE 1
 # endif

- /* In the 4x loop the RTM and non-RTM versions have data pointer
- off by VEC_SIZE * 4 with RTM version being VEC_SIZE * 4 greater.
- This is represented by BASE_OFFSET. As well because the RTM
- version uses vpcmp which stores a bit per element compared where
- the non-RTM version uses vpcmpeq which stores a bit per byte
- compared RET_SCALE of CHAR_SIZE is only relevant for the RTM
- version.
*/ -# ifdef USE_IN_RTM +# include "reg-macros.h" + + +/* If not in an RTM and VEC_SIZE != 64 (the VEC_SIZE = 64 + doesn't have VEX encoding), use VEX encoding in loop so we + can use vpcmpeqb + vptern which is more efficient than the + EVEX alternative. */ +# if defined USE_IN_RTM || VEC_SIZE == 64 +# undef COND_VZEROUPPER +# undef VZEROUPPER_RETURN +# undef VZEROUPPER + +# define COND_VZEROUPPER +# define VZEROUPPER_RETURN ret # define VZEROUPPER -# define BASE_OFFSET (VEC_SIZE * 4) -# define RET_SCALE CHAR_SIZE + +# define USE_TERN_IN_LOOP 0 # else +# define USE_TERN_IN_LOOP 1 +# undef VZEROUPPER # define VZEROUPPER vzeroupper -# define BASE_OFFSET 0 -# define RET_SCALE 1 # endif - /* In the return from 4x loop memchr and rawmemchr versions have - data pointers off by VEC_SIZE * 4 with memchr version being - VEC_SIZE * 4 greater. */ -# ifdef USE_AS_RAWMEMCHR -# define RET_OFFSET (BASE_OFFSET - (VEC_SIZE * 4)) -# define RAW_PTR_REG rcx -# define ALGN_PTR_REG rdi +# if USE_TERN_IN_LOOP + /* Resulting bitmask for vpmovmskb has 4-bits set for each wchar + so we don't want to multiply resulting index. 
*/ +# define TERN_CHAR_MULT 1 + +# ifdef USE_AS_WMEMCHR +# define TEST_END() inc %VRCX +# else +# define TEST_END() add %rdx, %rcx +# endif # else -# define RET_OFFSET BASE_OFFSET -# define RAW_PTR_REG rdi -# define ALGN_PTR_REG rcx +# define TERN_CHAR_MULT CHAR_SIZE +# define TEST_END() KORTEST %k2, %k3 # endif -# define XMMZERO xmm23 -# define YMMZERO ymm23 -# define XMMMATCH xmm16 -# define YMMMATCH ymm16 -# define YMM1 ymm17 -# define YMM2 ymm18 -# define YMM3 ymm19 -# define YMM4 ymm20 -# define YMM5 ymm21 -# define YMM6 ymm22 +# if defined USE_AS_WMEMCHR || !USE_TERN_IN_LOOP +# ifndef USE_AS_WMEMCHR +# define GPR_X0_IS_RET 1 +# else +# define GPR_X0_IS_RET 0 +# endif +# define GPR_X0 rax +# else +# define GPR_X0_IS_RET 0 +# define GPR_X0 rdx +# endif + +# define CHAR_PER_VEC (VEC_SIZE / CHAR_SIZE) -# ifndef SECTION -# define SECTION(p) p##.evex +# if CHAR_PER_VEC == 64 +# define LAST_VEC_OFFSET (VEC_SIZE * 3) +# else +# define LAST_VEC_OFFSET (VEC_SIZE * 2) +# endif +# if CHAR_PER_VEC >= 32 +# define MASK_GPR(...) VGPR(__VA_ARGS__) +# elif CHAR_PER_VEC == 16 +# define MASK_GPR(reg) VGPR_SZ(reg, 16) +# else +# define MASK_GPR(reg) VGPR_SZ(reg, 8) # endif -# define VEC_SIZE 32 -# define CHAR_PER_VEC (VEC_SIZE / CHAR_SIZE) -# define PAGE_SIZE 4096 +# define VMATCH VMM(0) +# define VMATCH_LO VMM_lo(0) - .section SECTION(.text),"ax",@progbits +# define PAGE_SIZE 4096 + + + .section SECTION(.text), "ax", @progbits ENTRY_P2ALIGN (MEMCHR, 6) -# ifndef USE_AS_RAWMEMCHR /* Check for zero length. */ test %RDX_LP, %RDX_LP - jz L(zero) + jz L(zero_0) -# ifdef __ILP32__ +# ifdef __ILP32__ /* Clear the upper 32 bits. */ movl %edx, %edx -# endif # endif - /* Broadcast CHAR to YMMMATCH. */ - VPBROADCAST %esi, %YMMMATCH + VPBROADCAST %esi, %VMATCH /* Check if we may cross page boundary with one vector load. 
*/ movl %edi, %eax andl $(PAGE_SIZE - 1), %eax cmpl $(PAGE_SIZE - VEC_SIZE), %eax - ja L(cross_page_boundary) + ja L(page_cross) + + VPCMPEQ (%rdi), %VMATCH, %k0 + KMOV %k0, %VRAX +# ifndef USE_AS_WMEMCHR + /* If rcx is zero then tzcnt -> CHAR_PER_VEC. NB: there is a + already a dependency between rcx and rsi so no worries about + false-dep here. */ + tzcnt %VRAX, %VRSI + /* If rdx <= rsi then either 1) rcx was non-zero (there was a + match) but it was out of bounds or 2) rcx was zero and rdx + was <= VEC_SIZE so we are done scanning. */ + cmpq %rsi, %rdx + /* NB: Use branch to return zero/non-zero. Common usage will + branch on result of function (if return is null/non-null). + This branch can be used to predict the ensuing one so there + is no reason to extend the data-dependency with cmovcc. */ + jbe L(zero_0) + + /* If rcx is zero then len must be > RDX, otherwise since we + already tested len vs lzcnt(rcx) (in rsi) we are good to + return this match. */ + test %VRAX, %VRAX + jz L(more_1x_vec) + leaq (%rdi, %rsi), %rax +# else - /* Check the first VEC_SIZE bytes. */ - VPCMP $0, (%rdi), %YMMMATCH, %k0 - kmovd %k0, %eax -# ifndef USE_AS_RAWMEMCHR - /* If length < CHAR_PER_VEC handle special. */ + /* We can't use the `tzcnt` trick for wmemchr because CHAR_SIZE + > 1 so if rcx is tzcnt != CHAR_PER_VEC. */ cmpq $CHAR_PER_VEC, %rdx - jbe L(first_vec_x0) -# endif - testl %eax, %eax - jz L(aligned_more) - tzcntl %eax, %eax -# ifdef USE_AS_WMEMCHR - /* NB: Multiply bytes by CHAR_SIZE to get the wchar_t count. */ + ja L(more_1x_vec) + tzcnt %VRAX, %VRAX + cmpl %eax, %edx + jbe L(zero_0) +L(first_vec_x0_ret): leaq (%rdi, %rax, CHAR_SIZE), %rax -# else - addq %rdi, %rax # endif ret -# ifndef USE_AS_RAWMEMCHR -L(zero): - xorl %eax, %eax - ret - - .p2align 4 -L(first_vec_x0): - /* Check if first match was before length. NB: tzcnt has false data- - dependency on destination. eax already had a data-dependency on esi - so this should have no affect here. 
*/ - tzcntl %eax, %esi -# ifdef USE_AS_WMEMCHR - leaq (%rdi, %rsi, CHAR_SIZE), %rdi -# else - addq %rsi, %rdi -# endif + /* Only fits in first cache line for VEC_SIZE == 32. */ +# if VEC_SIZE == 32 + .p2align 4,, 2 +L(zero_0): xorl %eax, %eax - cmpl %esi, %edx - cmovg %rdi, %rax ret # endif - .p2align 4 -L(cross_page_boundary): - /* Save pointer before aligning as its original value is - necessary for computer return address if byte is found or - adjusting length if it is not and this is memchr. */ - movq %rdi, %rcx - /* Align data to VEC_SIZE. ALGN_PTR_REG is rcx for memchr and rdi - for rawmemchr. */ - andq $-VEC_SIZE, %ALGN_PTR_REG - VPCMP $0, (%ALGN_PTR_REG), %YMMMATCH, %k0 - kmovd %k0, %r8d + .p2align 4,, 9 +L(more_1x_vec): # ifdef USE_AS_WMEMCHR - /* NB: Divide shift count by 4 since each bit in K0 represent 4 - bytes. */ - sarl $2, %eax -# endif -# ifndef USE_AS_RAWMEMCHR - movl $(PAGE_SIZE / CHAR_SIZE), %esi - subl %eax, %esi + /* If wmemchr still need to test if there was a match in first + VEC. Use bsf to test here so we can reuse + L(first_vec_x0_ret). */ + bsf %VRAX, %VRAX + jnz L(first_vec_x0_ret) # endif + +L(page_cross_continue): # ifdef USE_AS_WMEMCHR - andl $(CHAR_PER_VEC - 1), %eax -# endif - /* Remove the leading bytes. */ - sarxl %eax, %r8d, %eax -# ifndef USE_AS_RAWMEMCHR - /* Check the end of data. */ - cmpq %rsi, %rdx - jbe L(first_vec_x0) + /* We can't use end of the buffer to re-calculate length for + wmemchr as len * CHAR_SIZE may overflow. */ + leaq -(VEC_SIZE + CHAR_SIZE)(%rdi), %rax + andq $(VEC_SIZE * -1), %rdi + subq %rdi, %rax + sarq $2, %rax + addq %rdx, %rax +# else + leaq -(VEC_SIZE + 1)(%rdx, %rdi), %rax + andq $(VEC_SIZE * -1), %rdi + subq %rdi, %rax # endif - testl %eax, %eax - jz L(cross_page_continue) - tzcntl %eax, %eax + + /* rax contains remaining length - 1. -1 so we can get imm8 + encoding in a few additional places saving code size. */ + + /* Needed regardless of remaining length. 
*/ + VPCMPEQ VEC_SIZE(%rdi), %VMATCH, %k0 + KMOV %k0, %VRDX + + /* We cannot fold the above `sub %rdi, %rax` with the `cmp + $(CHAR_PER_VEC * 2), %rax` because its possible for a very + large length to overflow and cause the subtract to carry + despite length being above CHAR_PER_VEC * 2. */ + cmpq $(CHAR_PER_VEC * 2 - 1), %rax + ja L(more_2x_vec) +L(last_2x_vec): + + test %VRDX, %VRDX + jnz L(first_vec_x1_check) + + /* Check the end of data. NB: use 8-bit operations to save code + size. We no longer need the full-width of eax and will + perform a write-only operation over eax so there will be no + partial-register stalls. */ + subb $(CHAR_PER_VEC * 1 - 1), %al + jle L(zero_0) + + VPCMPEQ (VEC_SIZE * 2)(%rdi), %VMATCH, %k0 + KMOV %k0, %VRCX # ifdef USE_AS_WMEMCHR - /* NB: Multiply bytes by CHAR_SIZE to get the wchar_t count. */ - leaq (%RAW_PTR_REG, %rax, CHAR_SIZE), %rax + /* For wmemchr against we can't take advantage of tzcnt(0) == + VEC_SIZE as CHAR_PER_VEC != VEC_SIZE. */ + test %VRCX, %VRCX + jz L(zero_0) +# endif + tzcnt %VRCX, %VRCX + cmp %cl, %al + + /* Same CFG for VEC_SIZE == 64 and VEC_SIZE == 32. We give + fallthrough to L(zero_0) for VEC_SIZE == 64 here as there is + not enough space before the next cache line to fit the `lea` + for return. */ +# if VEC_SIZE == 64 + ja L(first_vec_x2_ret) +L(zero_0): + xorl %eax, %eax + ret # else - addq %RAW_PTR_REG, %rax + jbe L(zero_0) + leaq (VEC_SIZE * 2)(%rdi, %rcx, CHAR_SIZE), %rax + ret # endif + + .p2align 4,, 5 +L(first_vec_x1_check): + bsf %VRDX, %VRDX + cmpb %dl, %al + jb L(zero_4) + leaq (VEC_SIZE * 1)(%rdi, %rdx, CHAR_SIZE), %rax ret - .p2align 4 -L(first_vec_x1): - tzcntl %eax, %eax - leaq VEC_SIZE(%rdi, %rax, CHAR_SIZE), %rax + /* Fits at the end of the cache line here for VEC_SIZE == 32. 
+ */ +# if VEC_SIZE == 32 +L(zero_4): + xorl %eax, %eax ret +# endif - .p2align 4 + + .p2align 4,, 4 L(first_vec_x2): - tzcntl %eax, %eax - leaq (VEC_SIZE * 2)(%rdi, %rax, CHAR_SIZE), %rax + bsf %VRCX, %VRCX +L(first_vec_x2_ret): + leaq (VEC_SIZE * 2)(%rdi, %rcx, CHAR_SIZE), %rax ret - .p2align 4 -L(first_vec_x3): - tzcntl %eax, %eax - leaq (VEC_SIZE * 3)(%rdi, %rax, CHAR_SIZE), %rax + /* Fits at the end of the cache line here for VEC_SIZE == 64. + */ +# if VEC_SIZE == 64 +L(zero_4): + xorl %eax, %eax ret +# endif - .p2align 4 -L(first_vec_x4): - tzcntl %eax, %eax - leaq (VEC_SIZE * 4)(%rdi, %rax, CHAR_SIZE), %rax + .p2align 4,, 4 +L(first_vec_x1): + bsf %VRDX, %VRDX + leaq (VEC_SIZE * 1)(%rdi, %rdx, CHAR_SIZE), %rax ret - .p2align 5 -L(aligned_more): - /* Check the first 4 * VEC_SIZE. Only one VEC_SIZE at a time - since data is only aligned to VEC_SIZE. */ -# ifndef USE_AS_RAWMEMCHR - /* Align data to VEC_SIZE. */ -L(cross_page_continue): - xorl %ecx, %ecx - subl %edi, %ecx - andq $-VEC_SIZE, %rdi - /* esi is for adjusting length to see if near the end. */ - leal (VEC_SIZE * 5)(%rdi, %rcx), %esi -# ifdef USE_AS_WMEMCHR - /* NB: Divide bytes by 4 to get the wchar_t count. */ - sarl $2, %esi -# endif -# else - andq $-VEC_SIZE, %rdi -L(cross_page_continue): -# endif - /* Load first VEC regardless. */ - VPCMP $0, (VEC_SIZE)(%rdi), %YMMMATCH, %k0 - kmovd %k0, %eax -# ifndef USE_AS_RAWMEMCHR - /* Adjust length. If near end handle specially. */ - subq %rsi, %rdx - jbe L(last_4x_vec_or_less) -# endif - testl %eax, %eax + .p2align 4,, 5 +L(more_2x_vec): + /* Length > VEC_SIZE * 2 so check first 2x VEC before rechecking + length. */ + + + /* Already computed matches for first VEC in rdx. 
*/ + test %VRDX, %VRDX jnz L(first_vec_x1) - VPCMP $0, (VEC_SIZE * 2)(%rdi), %YMMMATCH, %k0 - kmovd %k0, %eax - testl %eax, %eax + + VPCMPEQ (VEC_SIZE * 2)(%rdi), %VMATCH, %k0 + KMOV %k0, %VRCX + test %VRCX, %VRCX jnz L(first_vec_x2) - VPCMP $0, (VEC_SIZE * 3)(%rdi), %YMMMATCH, %k0 - kmovd %k0, %eax - testl %eax, %eax + /* Needed regardless of next length check. */ + VPCMPEQ (VEC_SIZE * 3)(%rdi), %VMATCH, %k0 + KMOV %k0, %VRCX + + /* Check if we are near the end. */ + cmpq $(CHAR_PER_VEC * 4 - 1), %rax + ja L(more_4x_vec) + + test %VRCX, %VRCX + jnz L(first_vec_x3_check) + + /* Use 8-bit instructions to save code size. We won't use full- + width eax again and will perform a write-only operation to + eax so no worries about partial-register stalls. */ + subb $(CHAR_PER_VEC * 3), %al + jb L(zero_2) +L(last_vec_check): + VPCMPEQ (VEC_SIZE * 4)(%rdi), %VMATCH, %k0 + KMOV %k0, %VRCX +# ifdef USE_AS_WMEMCHR + /* For wmemchr against we can't take advantage of tzcnt(0) == + VEC_SIZE as CHAR_PER_VEC != VEC_SIZE. */ + test %VRCX, %VRCX + jz L(zero_2) +# endif + tzcnt %VRCX, %VRCX + cmp %cl, %al + jae L(first_vec_x4_ret) +L(zero_2): + xorl %eax, %eax + ret + + /* Fits at the end of the cache line here for VEC_SIZE == 64. + For VEC_SIZE == 32 we put the return label at the end of + L(first_vec_x4). */ +# if VEC_SIZE == 64 +L(first_vec_x4_ret): + leaq (VEC_SIZE * 4)(%rdi, %rcx, CHAR_SIZE), %rax + ret +# endif + + .p2align 4,, 6 +L(first_vec_x4): + bsf %VRCX, %VRCX +# if VEC_SIZE == 32 + /* Place L(first_vec_x4_ret) here as we can't fit it in the same + cache line as where it is called from so we might as well + save code size by reusing return of L(first_vec_x4). */ +L(first_vec_x4_ret): +# endif + leaq (VEC_SIZE * 4)(%rdi, %rcx, CHAR_SIZE), %rax + ret + + .p2align 4,, 6 +L(first_vec_x3_check): + /* Need to adjust remaining length before checking. 
*/ + addb $-(CHAR_PER_VEC * 2), %al + bsf %VRCX, %VRCX + cmpb %cl, %al + jb L(zero_2) + leaq (VEC_SIZE * 3)(%rdi, %rcx, CHAR_SIZE), %rax + ret + + .p2align 4,, 6 +L(first_vec_x3): + bsf %VRCX, %VRCX + leaq (VEC_SIZE * 3)(%rdi, %rcx, CHAR_SIZE), %rax + ret + + .p2align 4,, 3 +# if !USE_TERN_IN_LOOP + .p2align 4,, 10 +# endif +L(more_4x_vec): + test %VRCX, %VRCX jnz L(first_vec_x3) - VPCMP $0, (VEC_SIZE * 4)(%rdi), %YMMMATCH, %k0 - kmovd %k0, %eax - testl %eax, %eax + VPCMPEQ (VEC_SIZE * 4)(%rdi), %VMATCH, %k0 + KMOV %k0, %VRCX + test %VRCX, %VRCX jnz L(first_vec_x4) + subq $-(VEC_SIZE * 5), %rdi + subq $(CHAR_PER_VEC * 8), %rax + jb L(last_4x_vec) -# ifndef USE_AS_RAWMEMCHR - /* Check if at last CHAR_PER_VEC * 4 length. */ - subq $(CHAR_PER_VEC * 4), %rdx - jbe L(last_4x_vec_or_less_cmpeq) - /* +VEC_SIZE if USE_IN_RTM otherwise +VEC_SIZE * 5. */ - addq $(VEC_SIZE + (VEC_SIZE * 4 - BASE_OFFSET)), %rdi - - /* Align data to VEC_SIZE * 4 for the loop and readjust length. - */ -# ifdef USE_AS_WMEMCHR +# ifdef USE_AS_WMEMCHR movl %edi, %ecx - andq $-(4 * VEC_SIZE), %rdi +# else + addq %rdi, %rax +# endif + + +# if VEC_SIZE == 64 + /* use xorb to do `andq $-(VEC_SIZE * 4), %rdi`. No evex + processor has partial register stalls (all have merging + uop). If that changes this can be removed. */ + xorb %dil, %dil +# else + andq $-(VEC_SIZE * 4), %rdi +# endif + +# ifdef USE_AS_WMEMCHR subl %edi, %ecx - /* NB: Divide bytes by 4 to get the wchar_t count. */ sarl $2, %ecx - addq %rcx, %rdx -# else - addq %rdi, %rdx - andq $-(4 * VEC_SIZE), %rdi - subq %rdi, %rdx -# endif + addq %rcx, %rax # else - addq $(VEC_SIZE + (VEC_SIZE * 4 - BASE_OFFSET)), %rdi - andq $-(4 * VEC_SIZE), %rdi + subq %rdi, %rax # endif -# ifdef USE_IN_RTM - vpxorq %XMMZERO, %XMMZERO, %XMMZERO -# else - /* copy ymmmatch to ymm0 so we can use vpcmpeq which is not - encodable with EVEX registers (ymm16-ymm31). 
*/ - vmovdqa64 %YMMMATCH, %ymm0 + + + +# if USE_TERN_IN_LOOP + /* copy VMATCH to low ymm so we can use vpcmpeq which is not + encodable with EVEX registers. NB: this is VEC_SIZE == 32 + only as there is no way to encode vpcmpeq with zmm0-15. */ + vmovdqa64 %VMATCH, %VMATCH_LO # endif - /* Compare 4 * VEC at a time forward. */ - .p2align 4 + .p2align 4,, 11 L(loop_4x_vec): - /* Two versions of the loop. One that does not require - vzeroupper by not using ymm0-ymm15 and another does that require - vzeroupper because it uses ymm0-ymm15. The reason why ymm0-ymm15 - is used at all is because there is no EVEX encoding vpcmpeq and - with vpcmpeq this loop can be performed more efficiently. The - non-vzeroupper version is safe for RTM while the vzeroupper - version should be prefered if RTM are not supported. */ -# ifdef USE_IN_RTM - /* It would be possible to save some instructions using 4x VPCMP - but bottleneck on port 5 makes it not woth it. */ - VPCMP $4, (VEC_SIZE * 4)(%rdi), %YMMMATCH, %k1 - /* xor will set bytes match esi to zero. */ - vpxorq (VEC_SIZE * 5)(%rdi), %YMMMATCH, %YMM2 - vpxorq (VEC_SIZE * 6)(%rdi), %YMMMATCH, %YMM3 - VPCMP $0, (VEC_SIZE * 7)(%rdi), %YMMMATCH, %k3 - /* Reduce VEC2 / VEC3 with min and VEC1 with zero mask. */ - VPMINU %YMM2, %YMM3, %YMM3{%k1}{z} - VPCMP $0, %YMM3, %YMMZERO, %k2 -# else + /* Two versions of the loop. One that does not require + vzeroupper by not using ymmm0-15 and another does that + require vzeroupper because it uses ymmm0-15. The reason why + ymm0-15 is used at all is because there is no EVEX encoding + vpcmpeq and with vpcmpeq this loop can be performed more + efficiently. The non-vzeroupper version is safe for RTM + while the vzeroupper version should be prefered if RTM are + not supported. Which loop version we use is determined by + USE_TERN_IN_LOOP. */ + +# if USE_TERN_IN_LOOP /* Since vptern can only take 3x vectors fastest to do 1 vec seperately with EVEX vpcmp. 
*/ # ifdef USE_AS_WMEMCHR /* vptern can only accept masks for epi32/epi64 so can only save - instruction using not equals mask on vptern with wmemchr. */ - VPCMP $4, (%rdi), %YMMMATCH, %k1 + instruction using not equals mask on vptern with wmemchr. + */ + VPCMP $4, (VEC_SIZE * 0)(%rdi), %VMATCH, %k1 # else - VPCMP $0, (%rdi), %YMMMATCH, %k1 + VPCMPEQ (VEC_SIZE * 0)(%rdi), %VMATCH, %k1 # endif /* Compare 3x with vpcmpeq and or them all together with vptern. */ - VPCMPEQ VEC_SIZE(%rdi), %ymm0, %ymm2 - VPCMPEQ (VEC_SIZE * 2)(%rdi), %ymm0, %ymm3 - VPCMPEQ (VEC_SIZE * 3)(%rdi), %ymm0, %ymm4 + VPCMPEQ (VEC_SIZE * 1)(%rdi), %VMATCH_LO, %VMM_lo(2) + VPCMPEQ (VEC_SIZE * 2)(%rdi), %VMATCH_LO, %VMM_lo(3) + VPCMPEQ (VEC_SIZE * 3)(%rdi), %VMATCH_LO, %VMM_lo(4) # ifdef USE_AS_WMEMCHR - /* This takes the not of or between ymm2, ymm3, ymm4 as well as - combines result from VEC0 with zero mask. */ - vpternlogd $1, %ymm2, %ymm3, %ymm4{%k1}{z} - vpmovmskb %ymm4, %ecx + /* This takes the not of or between VEC_lo(2), VEC_lo(3), + VEC_lo(4) as well as combines result from VEC(0) with zero + mask. */ + vpternlogd $1, %VMM_lo(2), %VMM_lo(3), %VMM_lo(4){%k1}{z} + vpmovmskb %VMM_lo(4), %VRCX # else - /* 254 is mask for oring ymm2, ymm3, ymm4 into ymm4. */ - vpternlogd $254, %ymm2, %ymm3, %ymm4 - vpmovmskb %ymm4, %ecx - kmovd %k1, %eax + /* 254 is mask for oring VEC_lo(2), VEC_lo(3), VEC_lo(4) into + VEC_lo(4). */ + vpternlogd $254, %VMM_lo(2), %VMM_lo(3), %VMM_lo(4) + vpmovmskb %VMM_lo(4), %VRCX + KMOV %k1, %edx # endif -# endif -# ifdef USE_AS_RAWMEMCHR - subq $-(VEC_SIZE * 4), %rdi -# endif -# ifdef USE_IN_RTM - kortestd %k2, %k3 # else -# ifdef USE_AS_WMEMCHR - /* ecx contains not of matches. All 1s means no matches. incl will - overflow and set zeroflag if that is the case. */ - incl %ecx -# else - /* If either VEC1 (eax) or VEC2-VEC4 (ecx) are not zero. Adding - to ecx is not an issue because if eax is non-zero it will be - used for returning the match. 
If it is zero the add does - nothing. */ - addq %rax, %rcx -# endif + /* Loop version that uses EVEX encoding. */ + VPCMP $4, (VEC_SIZE * 0)(%rdi), %VMATCH, %k1 + vpxorq (VEC_SIZE * 1)(%rdi), %VMATCH, %VMM(2) + vpxorq (VEC_SIZE * 2)(%rdi), %VMATCH, %VMM(3) + VPCMPEQ (VEC_SIZE * 3)(%rdi), %VMATCH, %k3 + VPMINU %VMM(2), %VMM(3), %VMM(3){%k1}{z} + VPTESTN %VMM(3), %VMM(3), %k2 # endif -# ifdef USE_AS_RAWMEMCHR - jz L(loop_4x_vec) -# else - jnz L(loop_4x_vec_end) + + + TEST_END () + jnz L(loop_vec_ret) subq $-(VEC_SIZE * 4), %rdi - subq $(CHAR_PER_VEC * 4), %rdx - ja L(loop_4x_vec) + subq $(CHAR_PER_VEC * 4), %rax + jae L(loop_4x_vec) - /* Fall through into less than 4 remaining vectors of length case. + /* COND_VZEROUPPER is vzeroupper if we use the VEX encoded loop. */ - VPCMP $0, BASE_OFFSET(%rdi), %YMMMATCH, %k0 - addq $(BASE_OFFSET - VEC_SIZE), %rdi - kmovd %k0, %eax - VZEROUPPER - -L(last_4x_vec_or_less): - /* Check if first VEC contained match. */ - testl %eax, %eax - jnz L(first_vec_x1_check) + COND_VZEROUPPER - /* If remaining length > CHAR_PER_VEC * 2. */ - addl $(CHAR_PER_VEC * 2), %edx - jg L(last_4x_vec) - -L(last_2x_vec): - /* If remaining length < CHAR_PER_VEC. */ - addl $CHAR_PER_VEC, %edx - jle L(zero_end) - - /* Check VEC2 and compare any match with remaining length. */ - VPCMP $0, (VEC_SIZE * 2)(%rdi), %YMMMATCH, %k0 - kmovd %k0, %eax - tzcntl %eax, %eax - cmpl %eax, %edx - jbe L(set_zero_end) - leaq (VEC_SIZE * 2)(%rdi, %rax, CHAR_SIZE), %rax -L(zero_end): - ret + .p2align 4,, 10 +L(last_4x_vec): + /* For CHAR_PER_VEC == 64 we don't need to mask as we use 8-bit + instructions on eax from here on out. 
*/ +# if CHAR_PER_VEC != 64 + andl $(CHAR_PER_VEC * 4 - 1), %eax +# endif + VPCMPEQ (VEC_SIZE * 0)(%rdi), %VMATCH, %k0 + subq $(VEC_SIZE * 1), %rdi + KMOV %k0, %VRDX + cmpb $(CHAR_PER_VEC * 2 - 1), %al + jbe L(last_2x_vec) + test %VRDX, %VRDX + jnz L(last_vec_x1_novzero) + + VPCMPEQ (VEC_SIZE * 2)(%rdi), %VMATCH, %k0 + KMOV %k0, %VRDX + test %VRDX, %VRDX + jnz L(last_vec_x2_novzero) + + VPCMPEQ (VEC_SIZE * 3)(%rdi), %VMATCH, %k0 + KMOV %k0, %VRCX + test %VRCX, %VRCX + jnz L(first_vec_x3_check) + + subb $(CHAR_PER_VEC * 3), %al + jae L(last_vec_check) -L(set_zero_end): xorl %eax, %eax ret - .p2align 4 -L(first_vec_x1_check): - /* eax must be non-zero. Use bsfl to save code size. */ - bsfl %eax, %eax - /* Adjust length. */ - subl $-(CHAR_PER_VEC * 4), %edx - /* Check if match within remaining length. */ - cmpl %eax, %edx - jbe L(set_zero_end) - /* NB: Multiply bytes by CHAR_SIZE to get the wchar_t count. */ - leaq VEC_SIZE(%rdi, %rax, CHAR_SIZE), %rax +# if defined USE_AS_WMEMCHR && USE_TERN_IN_LOOP +L(last_vec_x2_novzero): + addq $VEC_SIZE, %rdi +L(last_vec_x1_novzero): + bsf %VRDX, %VRDX + leaq (VEC_SIZE * 1)(%rdi, %rdx, CHAR_SIZE), %rax ret +# endif - .p2align 4 -L(loop_4x_vec_end): +# if CHAR_PER_VEC == 64 + /* Since we can't combine the last 2x VEC when CHAR_PER_VEC == + 64 it needs a seperate return label. */ + .p2align 4,, 4 +L(last_vec_x2): +L(last_vec_x2_novzero): + bsf %VRDX, %VRDX + leaq (VEC_SIZE * 2)(%rdi, %rdx, TERN_CHAR_MULT), %rax + ret # endif - /* rawmemchr will fall through into this if match was found in - loop. */ -# if defined USE_IN_RTM || defined USE_AS_WMEMCHR - /* k1 has not of matches with VEC1. */ - kmovd %k1, %eax -# ifdef USE_AS_WMEMCHR - subl $((1 << CHAR_PER_VEC) - 1), %eax -# else - incl %eax -# endif + .p2align 4,, 4 +L(loop_vec_ret): +# if defined USE_AS_WMEMCHR || !USE_TERN_IN_LOOP + KMOV %k1, %VRAX + inc %MASK_GPR(rax) # else - /* eax already has matches for VEC1. 
*/ - testl %eax, %eax + test %VRDX, %VRDX # endif - jnz L(last_vec_x1_return) + jnz L(last_vec_x0) -# ifdef USE_IN_RTM - VPCMP $0, %YMM2, %YMMZERO, %k0 - kmovd %k0, %eax + +# if USE_TERN_IN_LOOP + vpmovmskb %VMM_lo(2), %VRDX # else - vpmovmskb %ymm2, %eax + VPTESTN %VMM(2), %VMM(2), %k1 + KMOV %k1, %VRDX # endif - testl %eax, %eax - jnz L(last_vec_x2_return) + test %VRDX, %VRDX + jnz L(last_vec_x1) -# ifdef USE_IN_RTM - kmovd %k2, %eax - testl %eax, %eax - jnz L(last_vec_x3_return) - kmovd %k3, %eax - tzcntl %eax, %eax - leaq (VEC_SIZE * 3 + RET_OFFSET)(%rdi, %rax, CHAR_SIZE), %rax +# if USE_TERN_IN_LOOP + vpmovmskb %VMM_lo(3), %VRDX # else - vpmovmskb %ymm3, %eax - /* Combine matches in VEC3 (eax) with matches in VEC4 (ecx). */ - salq $VEC_SIZE, %rcx - orq %rcx, %rax - tzcntq %rax, %rax - leaq (VEC_SIZE * 2 + RET_OFFSET)(%rdi, %rax), %rax - VZEROUPPER + KMOV %k2, %VRDX # endif - ret - .p2align 4,, 10 -L(last_vec_x1_return): - tzcntl %eax, %eax -# if defined USE_AS_WMEMCHR || RET_OFFSET != 0 - /* NB: Multiply bytes by CHAR_SIZE to get the wchar_t count. */ - leaq RET_OFFSET(%rdi, %rax, CHAR_SIZE), %rax + /* No longer need any of the lo vecs (ymm0-15) so vzeroupper + (only if used VEX encoded loop). */ + COND_VZEROUPPER + + /* Seperate logic for CHAR_PER_VEC == 64 vs the rest. For + CHAR_PER_VEC we test the last 2x VEC seperately, for + CHAR_PER_VEC <= 32 we can combine the results from the 2x + VEC in a single GPR. */ +# if CHAR_PER_VEC == 64 +# if USE_TERN_IN_LOOP +# error "Unsupported" +# endif + + + /* If CHAR_PER_VEC == 64 we can't combine the last two VEC. */ + test %VRDX, %VRDX + jnz L(last_vec_x2) + KMOV %k3, %VRDX # else - addq %rdi, %rax + /* CHAR_PER_VEC <= 32 so we can combine the results from the + last 2x VEC. 
*/ + +# if !USE_TERN_IN_LOOP + KMOV %k3, %VRCX +# endif + salq $(VEC_SIZE / TERN_CHAR_MULT), %rcx + addq %rcx, %rdx +# if !defined USE_AS_WMEMCHR || !USE_TERN_IN_LOOP +L(last_vec_x2_novzero): +# endif # endif - VZEROUPPER + bsf %rdx, %rdx + leaq (LAST_VEC_OFFSET)(%rdi, %rdx, TERN_CHAR_MULT), %rax ret - .p2align 4 -L(last_vec_x2_return): - tzcntl %eax, %eax - /* NB: Multiply bytes by RET_SCALE to get the wchar_t count - if relevant (RET_SCALE = CHAR_SIZE if USE_AS_WMEMCHAR and - USE_IN_RTM are both defined. Otherwise RET_SCALE = 1. */ - leaq (VEC_SIZE + RET_OFFSET)(%rdi, %rax, RET_SCALE), %rax - VZEROUPPER + .p2align 4,, 8 +L(last_vec_x1): + COND_VZEROUPPER +# if !defined USE_AS_WMEMCHR || !USE_TERN_IN_LOOP +L(last_vec_x1_novzero): +# endif + bsf %VRDX, %VRDX + leaq (VEC_SIZE * 1)(%rdi, %rdx, TERN_CHAR_MULT), %rax ret -# ifdef USE_IN_RTM - .p2align 4 -L(last_vec_x3_return): - tzcntl %eax, %eax - /* NB: Multiply bytes by CHAR_SIZE to get the wchar_t count. */ - leaq (VEC_SIZE * 2 + RET_OFFSET)(%rdi, %rax, CHAR_SIZE), %rax + + .p2align 4,, 4 +L(last_vec_x0): + COND_VZEROUPPER + bsf %VGPR(GPR_X0), %VGPR(GPR_X0) +# if GPR_X0_IS_RET + addq %rdi, %rax +# else + leaq (%rdi, %GPR_X0, CHAR_SIZE), %rax +# endif ret + + .p2align 4,, 6 +L(page_cross): + /* Need to preserve eax to compute inbound bytes we are + checking. */ +# ifdef USE_AS_WMEMCHR + movl %eax, %ecx +# else + xorl %ecx, %ecx + subl %eax, %ecx # endif -# ifndef USE_AS_RAWMEMCHR - .p2align 4,, 5 -L(last_4x_vec_or_less_cmpeq): - VPCMP $0, (VEC_SIZE * 5)(%rdi), %YMMMATCH, %k0 - kmovd %k0, %eax - subq $-(VEC_SIZE * 4), %rdi - /* Check first VEC regardless. */ - testl %eax, %eax - jnz L(first_vec_x1_check) + xorq %rdi, %rax + VPCMPEQ (PAGE_SIZE - VEC_SIZE)(%rax), %VMATCH, %k0 + KMOV %k0, %VRAX - /* If remaining length <= CHAR_PER_VEC * 2. */ - addl $(CHAR_PER_VEC * 2), %edx - jle L(last_2x_vec) +# ifdef USE_AS_WMEMCHR + /* NB: Divide by CHAR_SIZE to shift out out of bounds bytes. 
*/ + shrl $2, %ecx + andl $(CHAR_PER_VEC - 1), %ecx +# endif - .p2align 4 -L(last_4x_vec): - VPCMP $0, (VEC_SIZE * 2)(%rdi), %YMMMATCH, %k0 - kmovd %k0, %eax - testl %eax, %eax - jnz L(last_vec_x2) + shrx %VGPR(PC_SHIFT_GPR), %VRAX, %VRAX - VPCMP $0, (VEC_SIZE * 3)(%rdi), %YMMMATCH, %k0 - kmovd %k0, %eax - /* Create mask for possible matches within remaining length. */ -# ifdef USE_AS_WMEMCHR - movl $((1 << (CHAR_PER_VEC * 2)) - 1), %ecx - bzhil %edx, %ecx, %ecx -# else - movq $-1, %rcx - bzhiq %rdx, %rcx, %rcx -# endif - /* Test matches in data against length match. */ - andl %ecx, %eax - jnz L(last_vec_x3) +# ifdef USE_AS_WMEMCHR + negl %ecx +# endif - /* if remaining length <= CHAR_PER_VEC * 3 (Note this is after - remaining length was found to be > CHAR_PER_VEC * 2. */ - subl $CHAR_PER_VEC, %edx - jbe L(zero_end2) + /* mask lower bits from ecx (negative eax) to get bytes till + next VEC. */ + andl $(CHAR_PER_VEC - 1), %ecx + /* Check if VEC is entirely contained in the remainder of the + page. */ + cmpq %rcx, %rdx + jbe L(page_cross_ret) - VPCMP $0, (VEC_SIZE * 4)(%rdi), %YMMMATCH, %k0 - kmovd %k0, %eax - /* Shift remaining length mask for last VEC. */ -# ifdef USE_AS_WMEMCHR - shrl $CHAR_PER_VEC, %ecx -# else - shrq $CHAR_PER_VEC, %rcx -# endif - andl %ecx, %eax - jz L(zero_end2) - bsfl %eax, %eax - leaq (VEC_SIZE * 4)(%rdi, %rax, CHAR_SIZE), %rax -L(zero_end2): - ret + /* Length crosses the page so if rax is zero (no matches) + continue. */ + test %VRAX, %VRAX + jz L(page_cross_continue) -L(last_vec_x2): - tzcntl %eax, %eax - leaq (VEC_SIZE * 2)(%rdi, %rax, CHAR_SIZE), %rax + /* if rdx > rcx then any match here must be in [buf:buf + len]. 
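The page-cross path above can be modeled in scalar C: place the vector load so it *ends* at the page boundary (so it never touches the next page), build the compare bitmask, shift out matches that fall before the real start pointer (the `shrx`), and finally reject any match at or beyond `len`. The helper name and the tiny 8-byte "vector" below are illustrative only:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define VEC 8 /* tiny "vector" width so the trick is easy to follow */

/* Scan the first up-to-VEC bytes of `s` without reading past
   `page_end`.  The load is backward-aligned to end exactly at the page
   boundary, guaranteeing it stays on the current page.  */
static long first_byte_match(const unsigned char *s, size_t len,
                             const unsigned char *page_end, int c)
{
    const unsigned char *base = page_end - VEC; /* in-bounds load */
    uint64_t mask = 0;
    for (int i = 0; i < VEC; i++)               /* emulate VPCMPEQ + KMOV */
        if (base[i] == (unsigned char)c)
            mask |= 1ull << i;
    mask >>= (s - base);                        /* drop pre-`s` matches */
    if (mask == 0)
        return -1;
    long idx = (long)__builtin_ctzll(mask);
    return (size_t)idx < len ? idx : -1;        /* bounds check vs len */
}

static const unsigned char page_tail[VEC] =
    { 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h' };
```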
+ */ + tzcnt %VRAX, %VRAX +# ifdef USE_AS_WMEMCHR + leaq (%rdi, %rax, CHAR_SIZE), %rax +# else + addq %rdi, %rax +# endif ret - .p2align 4 -L(last_vec_x3): - tzcntl %eax, %eax - leaq (VEC_SIZE * 3)(%rdi, %rax, CHAR_SIZE), %rax + .p2align 4,, 2 +L(page_cross_zero): + xorl %eax, %eax ret + + .p2align 4,, 4 +L(page_cross_ret): + /* Search is entirely contained in page cross case. */ +# ifdef USE_AS_WMEMCHR + test %VRAX, %VRAX + jz L(page_cross_zero) +# endif + tzcnt %VRAX, %VRAX + cmpl %eax, %edx + jbe L(page_cross_zero) +# ifdef USE_AS_WMEMCHR + leaq (%rdi, %rax, CHAR_SIZE), %rax +# else + addq %rdi, %rax # endif - /* 7 bytes from next cache line. */ + ret END (MEMCHR) #endif diff --git a/sysdeps/x86_64/multiarch/rawmemchr-evex-rtm.S b/sysdeps/x86_64/multiarch/rawmemchr-evex-rtm.S index deda1ca395..2073eaa620 100644 --- a/sysdeps/x86_64/multiarch/rawmemchr-evex-rtm.S +++ b/sysdeps/x86_64/multiarch/rawmemchr-evex-rtm.S @@ -1,3 +1,6 @@ -#define MEMCHR __rawmemchr_evex_rtm -#define USE_AS_RAWMEMCHR 1 -#include "memchr-evex-rtm.S" +#define RAWMEMCHR __rawmemchr_evex_rtm + +#define USE_IN_RTM 1 +#define SECTION(p) p##.evex.rtm + +#include "rawmemchr-evex.S" diff --git a/sysdeps/x86_64/multiarch/rawmemchr-evex.S b/sysdeps/x86_64/multiarch/rawmemchr-evex.S index dc1c450699..dad54def2b 100644 --- a/sysdeps/x86_64/multiarch/rawmemchr-evex.S +++ b/sysdeps/x86_64/multiarch/rawmemchr-evex.S @@ -1,7 +1,308 @@ -#ifndef RAWMEMCHR -# define RAWMEMCHR __rawmemchr_evex -#endif -#define USE_AS_RAWMEMCHR 1 -#define MEMCHR RAWMEMCHR +/* rawmemchr optimized with 256-bit EVEX instructions. + Copyright (C) 2022 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. 
+ + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +#include +#include + +#if ISA_SHOULD_BUILD (4) + +# ifndef VEC_SIZE +# include "x86-evex256-vecs.h" +# endif + +# ifndef RAWMEMCHR +# define RAWMEMCHR __rawmemchr_evex +# endif + + +# define PC_SHIFT_GPR rdi +# define REG_WIDTH VEC_SIZE +# define VPTESTN vptestnmb +# define VPBROADCAST vpbroadcastb +# define VPMINU vpminub +# define VPCMP vpcmpb +# define VPCMPEQ vpcmpeqb +# define CHAR_SIZE 1 + +# include "reg-macros.h" + +/* If not in an RTM and VEC_SIZE != 64 (the VEC_SIZE = 64 + doesn't have VEX encoding), use VEX encoding in loop so we + can use vpcmpeqb + vptern which is more efficient than the + EVEX alternative. 
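The vpcmpeq + vptern combination this file describes hinges on vpternlogd, whose immediate is an 8-entry truth table: bit `(a<<2)|(b<<1)|c` of the imm8 gives the output for inputs (a, b, c) taken from the three source vectors. A small sketch (helper names are ours) showing how an imm8 is derived for any 3-input boolean function, including the `254` used later in this patch for a 3-way OR:

```c
#include <assert.h>

/* Compute the vpternlogd imm8 encoding a 3-input boolean function:
   each of the 8 input combinations selects one bit of the immediate.  */
static unsigned tern_imm8(int (*f)(int, int, int))
{
    unsigned imm = 0;
    for (int a = 0; a < 2; a++)
        for (int b = 0; b < 2; b++)
            for (int c = 0; c < 2; c++)
                if (f(a, b, c))
                    imm |= 1u << ((a << 2) | (b << 1) | c);
    return imm;
}

static int or3(int a, int b, int c)  { return a | b | c; }
static int and3(int a, int b, int c) { return a & b & c; }
```

OR of three vectors is false only for (0,0,0), so every imm8 bit except bit 0 is set: 0xFE = 254, matching the constant used in the vptern loop below.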
*/ +# if defined USE_IN_RTM || VEC_SIZE == 64 +# undef COND_VZEROUPPER +# undef VZEROUPPER_RETURN +# undef VZEROUPPER + + +# define COND_VZEROUPPER +# define VZEROUPPER_RETURN ret +# define VZEROUPPER + +# define USE_TERN_IN_LOOP 0 +# else +# define USE_TERN_IN_LOOP 1 +# undef VZEROUPPER +# define VZEROUPPER vzeroupper +# endif + +# define CHAR_PER_VEC VEC_SIZE + +# if CHAR_PER_VEC == 64 + +# define TAIL_RETURN_LBL first_vec_x2 +# define TAIL_RETURN_OFFSET (CHAR_PER_VEC * 2) + +# define FALLTHROUGH_RETURN_LBL first_vec_x3 +# define FALLTHROUGH_RETURN_OFFSET (CHAR_PER_VEC * 3) + +# else /* !(CHAR_PER_VEC == 64) */ + +# define TAIL_RETURN_LBL first_vec_x3 +# define TAIL_RETURN_OFFSET (CHAR_PER_VEC * 3) + +# define FALLTHROUGH_RETURN_LBL first_vec_x2 +# define FALLTHROUGH_RETURN_OFFSET (CHAR_PER_VEC * 2) +# endif /* !(CHAR_PER_VEC == 64) */ + + +# define VMATCH VMM(0) +# define VMATCH_LO VMM_lo(0) + +# define PAGE_SIZE 4096 + + .section SECTION(.text), "ax", @progbits +ENTRY_P2ALIGN (RAWMEMCHR, 6) + VPBROADCAST %esi, %VMATCH + /* Check if we may cross page boundary with one vector load. */ + movl %edi, %eax + andl $(PAGE_SIZE - 1), %eax + cmpl $(PAGE_SIZE - VEC_SIZE), %eax + ja L(page_cross) + + VPCMPEQ (%rdi), %VMATCH, %k0 + KMOV %k0, %VRAX + + test %VRAX, %VRAX + jz L(aligned_more) +L(first_vec_x0): + bsf %VRAX, %VRAX + addq %rdi, %rax + ret + + .p2align 4,, 4 +L(first_vec_x4): + bsf %VRAX, %VRAX + leaq (VEC_SIZE * 4)(%rdi, %rax), %rax + ret -#include "memchr-evex.S" + /* For VEC_SIZE == 32 we can fit this in aligning bytes so might + as well place it more locally. For VEC_SIZE == 64 we reuse + return code at the end of loop's return. */ +# if VEC_SIZE == 32 + .p2align 4,, 4 +L(FALLTHROUGH_RETURN_LBL): + bsf %VRAX, %VRAX + leaq (FALLTHROUGH_RETURN_OFFSET)(%rdi, %rax), %rax + ret +# endif + + .p2align 4,, 6 +L(page_cross): + /* eax has lower page-offset bits of rdi so xor will zero them + out. 
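The "xor will zero them out" comment above is a trick for recovering the page base without an extra mask instruction or clobbering the pointer: rax already holds the low page-offset bits of rdi, so XORing it back into the pointer clears exactly those bits. In C (PAGE_SIZE as in the file):

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE 4096

/* Recover the start of the page containing p.  Mirrors:
     movl %edi, %eax; andl $(PAGE_SIZE - 1), %eax   (offset)
     xorq %rdi, %rax                                (clear low bits)  */
static uintptr_t page_base(uintptr_t p)
{
    uintptr_t offset = p & (PAGE_SIZE - 1);
    return p ^ offset; /* equivalent to p & ~(PAGE_SIZE - 1) */
}
```

The payoff in the assembly is that the offset in eax is needed anyway for the page-cross check, so the page base comes out of a single extra xor.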
*/ + xorq %rdi, %rax + VPCMPEQ (PAGE_SIZE - VEC_SIZE)(%rax), %VMATCH, %k0 + KMOV %k0, %VRAX + + /* Shift out out-of-bounds matches. */ + shrx %VRDI, %VRAX, %VRAX + test %VRAX, %VRAX + jnz L(first_vec_x0) + + .p2align 4,, 10 +L(aligned_more): +L(page_cross_continue): + /* Align pointer. */ + andq $(VEC_SIZE * -1), %rdi + + VPCMPEQ VEC_SIZE(%rdi), %VMATCH, %k0 + KMOV %k0, %VRAX + test %VRAX, %VRAX + jnz L(first_vec_x1) + + VPCMPEQ (VEC_SIZE * 2)(%rdi), %VMATCH, %k0 + KMOV %k0, %VRAX + test %VRAX, %VRAX + jnz L(first_vec_x2) + + VPCMPEQ (VEC_SIZE * 3)(%rdi), %VMATCH, %k0 + KMOV %k0, %VRAX + test %VRAX, %VRAX + jnz L(first_vec_x3) + + VPCMPEQ (VEC_SIZE * 4)(%rdi), %VMATCH, %k0 + KMOV %k0, %VRAX + test %VRAX, %VRAX + jnz L(first_vec_x4) + + subq $-(VEC_SIZE * 1), %rdi +# if VEC_SIZE == 64 + /* Saves code size. No evex512 processor has partial register + stalls. If that change this can be replaced with `andq + $-(VEC_SIZE * 4), %rdi`. */ + xorb %dil, %dil +# else + andq $-(VEC_SIZE * 4), %rdi +# endif + +# if USE_TERN_IN_LOOP + /* copy VMATCH to low ymm so we can use vpcmpeq which is not + encodable with EVEX registers. NB: this is VEC_SIZE == 32 + only as there is no way to encode vpcmpeq with zmm0-15. */ + vmovdqa64 %VMATCH, %VMATCH_LO +# endif + + .p2align 4 +L(loop_4x_vec): + /* Two versions of the loop. One that does not require + vzeroupper by not using ymm0-15 and another does that + require vzeroupper because it uses ymm0-15. The reason why + ymm0-15 is used at all is because there is no EVEX encoding + vpcmpeq and with vpcmpeq this loop can be performed more + efficiently. The non-vzeroupper version is safe for RTM + while the vzeroupper version should be prefered if RTM are + not supported. Which loop version we use is determined by + USE_TERN_IN_LOOP. */ + +# if USE_TERN_IN_LOOP + /* Since vptern can only take 3x vectors fastest to do 1 vec + seperately with EVEX vpcmp. 
*/ + VPCMPEQ (VEC_SIZE * 4)(%rdi), %VMATCH, %k1 + /* Compare 3x with vpcmpeq and or them all together with vptern. + */ + + VPCMPEQ (VEC_SIZE * 5)(%rdi), %VMATCH_LO, %VMM_lo(2) + subq $(VEC_SIZE * -4), %rdi + VPCMPEQ (VEC_SIZE * 2)(%rdi), %VMATCH_LO, %VMM_lo(3) + VPCMPEQ (VEC_SIZE * 3)(%rdi), %VMATCH_LO, %VMM_lo(4) + + /* 254 is mask for oring VEC_lo(2), VEC_lo(3), VEC_lo(4) into + VEC_lo(4). */ + vpternlogd $254, %VMM_lo(2), %VMM_lo(3), %VMM_lo(4) + vpmovmskb %VMM_lo(4), %VRCX + + KMOV %k1, %eax + + /* NB: rax has match from first VEC and rcx has matches from + VEC 2-4. If rax is non-zero we will return that match. If + rax is zero adding won't disturb the bits in rcx. */ + add %rax, %rcx +# else + /* Loop version that uses EVEX encoding. */ + VPCMP $4, (VEC_SIZE * 4)(%rdi), %VMATCH, %k1 + vpxorq (VEC_SIZE * 5)(%rdi), %VMATCH, %VMM(2) + vpxorq (VEC_SIZE * 6)(%rdi), %VMATCH, %VMM(3) + VPCMPEQ (VEC_SIZE * 7)(%rdi), %VMATCH, %k3 + VPMINU %VMM(2), %VMM(3), %VMM(3){%k1}{z} + VPTESTN %VMM(3), %VMM(3), %k2 + subq $(VEC_SIZE * -4), %rdi + KORTEST %k2, %k3 +# endif + jz L(loop_4x_vec) + +# if USE_TERN_IN_LOOP + test %VRAX, %VRAX +# else + KMOV %k1, %VRAX + inc %VRAX +# endif + jnz L(last_vec_x0) + + +# if USE_TERN_IN_LOOP + vpmovmskb %VMM_lo(2), %VRAX +# else + VPTESTN %VMM(2), %VMM(2), %k1 + KMOV %k1, %VRAX +# endif + test %VRAX, %VRAX + jnz L(last_vec_x1) + + +# if USE_TERN_IN_LOOP + vpmovmskb %VMM_lo(3), %VRAX +# else + KMOV %k2, %VRAX +# endif + + /* No longer need any of the lo vecs (ymm0-15) so vzeroupper + (only if used VEX encoded loop). */ + COND_VZEROUPPER + + /* Seperate logic for VEC_SIZE == 64 and VEC_SIZE == 32 for + returning last 2x VEC. For VEC_SIZE == 64 we test each VEC + individually, for VEC_SIZE == 32 we combine them in a single + 64-bit GPR. */ +# if CHAR_PER_VEC == 64 +# if USE_TERN_IN_LOOP +# error "Unsupported" +# endif + + + /* If CHAR_PER_VEC == 64 we can't combine the last two VEC. 
*/ + test %VRAX, %VRAX + jnz L(first_vec_x2) + KMOV %k3, %VRAX +L(FALLTHROUGH_RETURN_LBL): +# else + /* CHAR_PER_VEC <= 32 so we can combine the results from the + last 2x VEC. */ +# if !USE_TERN_IN_LOOP + KMOV %k3, %VRCX +# endif + salq $CHAR_PER_VEC, %rcx + addq %rcx, %rax +# endif + bsf %rax, %rax + leaq (FALLTHROUGH_RETURN_OFFSET)(%rdi, %rax), %rax + ret + + .p2align 4,, 8 +L(TAIL_RETURN_LBL): + bsf %rax, %rax + leaq (TAIL_RETURN_OFFSET)(%rdi, %rax), %rax + ret + + .p2align 4,, 8 +L(last_vec_x1): + COND_VZEROUPPER +L(first_vec_x1): + bsf %VRAX, %VRAX + leaq (VEC_SIZE * 1)(%rdi, %rax), %rax + ret + + .p2align 4,, 8 +L(last_vec_x0): + COND_VZEROUPPER + bsf %VRAX, %VRAX + addq %rdi, %rax + ret +END (RAWMEMCHR) +#endif From patchwork Tue Oct 18 23:19:33 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Noah Goldstein X-Patchwork-Id: 1691735 Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@legolas.ozlabs.org Authentication-Results: legolas.ozlabs.org; spf=pass (sender SPF authorized) smtp.mailfrom=sourceware.org (client-ip=8.43.85.97; helo=sourceware.org; envelope-from=libc-alpha-bounces+incoming=patchwork.ozlabs.org@sourceware.org; receiver=) Authentication-Results: legolas.ozlabs.org; dkim=pass (1024-bit key; secure) header.d=sourceware.org header.i=@sourceware.org header.a=rsa-sha256 header.s=default header.b=ekxZuPjq; dkim-atps=neutral Received: from sourceware.org (server2.sourceware.org [8.43.85.97]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature ECDSA (P-384) server-digest SHA384) (No client certificate requested) by legolas.ozlabs.org (Postfix) with ESMTPS id 4MsVFh6fg4z23jk for ; Wed, 19 Oct 2022 10:20:08 +1100 (AEDT) Received: from server2.sourceware.org (localhost [IPv6:::1]) by sourceware.org (Postfix) with ESMTP id 0DDCD385828A for ; Tue, 18 Oct 2022 23:20:04 +0000 (GMT) DKIM-Filter: 

To: libc-alpha@sourceware.org
Subject: [PATCH v2 2/7] x86: Shrink / minorly optimize strchr-evex and implement with VMM headers
Date: Tue, 18 Oct 2022 16:19:33 -0700
Message-Id: <20221018231938.3621554-2-goldstein.w.n@gmail.com>
In-Reply-To: <20221018231938.3621554-1-goldstein.w.n@gmail.com>
References: <20221018024901.3381469-1-goldstein.w.n@gmail.com> <20221018231938.3621554-1-goldstein.w.n@gmail.com>
From: Noah Goldstein
Reply-To: Noah Goldstein

Size Optimizations:
1. Condense the hot path for better cache-locality.
   - This matters most for strchrnul, where the logic for strings with
     len <= VEC_SIZE or with a match in the first VEC now fits entirely
     in the first cache line.
2. Reuse common targets in first 4x VEC and after the loop.
3.
Don't align targets so aggressively if it doesn't change the number of fetch blocks it will require and put more care in avoiding the case where targets unnecessarily split cache lines. 4. Align the loop better for DSB/LSD 5. Use more code-size efficient instructions. - tzcnt ... -> bsf ... - vpcmpb $0 ... -> vpcmpeq ... 6. Align labels less aggressively, especially if it doesn't save fetch blocks / causes the basic-block to span extra cache-lines. Code Size Changes: strchr-evex.S : -63 bytes strchrnul-evex.S: -48 bytes Net perf changes: Reported as geometric mean of all improvements / regressions from N=10 runs of the benchtests. Value as New Time / Old Time so < 1.0 is improvement and 1.0 is regression. strchr-evex.S (Fixed) : 0.971 strchr-evex.S (Rand) : 0.932 strchrnul-evex.S : 0.965 Full results attached in email. Full check passes on x86-64. --- sysdeps/x86_64/multiarch/strchr-evex.S | 558 +++++++++++++++---------- 1 file changed, 340 insertions(+), 218 deletions(-) diff --git a/sysdeps/x86_64/multiarch/strchr-evex.S b/sysdeps/x86_64/multiarch/strchr-evex.S index a1c15c4419..c2a0d112f7 100644 --- a/sysdeps/x86_64/multiarch/strchr-evex.S +++ b/sysdeps/x86_64/multiarch/strchr-evex.S @@ -26,48 +26,75 @@ # define STRCHR __strchr_evex # endif -# define VMOVU vmovdqu64 -# define VMOVA vmovdqa64 +# ifndef VEC_SIZE +# include "x86-evex256-vecs.h" +# endif # ifdef USE_AS_WCSCHR # define VPBROADCAST vpbroadcastd -# define VPCMP vpcmpd +# define VPCMP vpcmpd +# define VPCMPEQ vpcmpeqd # define VPTESTN vptestnmd +# define VPTEST vptestmd # define VPMINU vpminud # define CHAR_REG esi -# define SHIFT_REG ecx +# define SHIFT_REG rcx # define CHAR_SIZE 4 + +# define USE_WIDE_CHAR # else # define VPBROADCAST vpbroadcastb -# define VPCMP vpcmpb +# define VPCMP vpcmpb +# define VPCMPEQ vpcmpeqb # define VPTESTN vptestnmb +# define VPTEST vptestmb # define VPMINU vpminub # define CHAR_REG sil -# define SHIFT_REG edx +# define SHIFT_REG rdi # define CHAR_SIZE 1 # endif -# define 
XMMZERO xmm16 - -# define YMMZERO ymm16 -# define YMM0 ymm17 -# define YMM1 ymm18 -# define YMM2 ymm19 -# define YMM3 ymm20 -# define YMM4 ymm21 -# define YMM5 ymm22 -# define YMM6 ymm23 -# define YMM7 ymm24 -# define YMM8 ymm25 - -# define VEC_SIZE 32 -# define PAGE_SIZE 4096 -# define CHAR_PER_VEC (VEC_SIZE / CHAR_SIZE) - - .section .text.evex,"ax",@progbits -ENTRY_P2ALIGN (STRCHR, 5) - /* Broadcast CHAR to YMM0. */ - VPBROADCAST %esi, %YMM0 +# include "reg-macros.h" + +# if VEC_SIZE == 64 +# define MASK_GPR rcx +# define LOOP_REG rax + +# define COND_MASK(k_reg) {%k_reg} +# else +# define MASK_GPR rax +# define LOOP_REG rdi + +# define COND_MASK(k_reg) +# endif + +# define CHAR_PER_VEC (VEC_SIZE / CHAR_SIZE) + + +# if CHAR_PER_VEC == 64 +# define LAST_VEC_OFFSET (VEC_SIZE * 3) +# define TESTZ(reg) incq %VGPR_SZ(reg, 64) +# else + +# if CHAR_PER_VEC == 32 +# define TESTZ(reg) incl %VGPR_SZ(reg, 32) +# elif CHAR_PER_VEC == 16 +# define TESTZ(reg) incw %VGPR_SZ(reg, 16) +# else +# define TESTZ(reg) incb %VGPR_SZ(reg, 8) +# endif + +# define LAST_VEC_OFFSET (VEC_SIZE * 2) +# endif + +# define VMATCH VMM(0) + +# define PAGE_SIZE 4096 + + .section SECTION(.text), "ax", @progbits +ENTRY_P2ALIGN (STRCHR, 6) + /* Broadcast CHAR to VEC_0. */ + VPBROADCAST %esi, %VMATCH movl %edi, %eax andl $(PAGE_SIZE - 1), %eax /* Check if we cross page boundary with one vector load. @@ -75,19 +102,27 @@ ENTRY_P2ALIGN (STRCHR, 5) cmpl $(PAGE_SIZE - VEC_SIZE), %eax ja L(cross_page_boundary) + /* Check the first VEC_SIZE bytes. Search for both CHAR and the null bytes. */ - VMOVU (%rdi), %YMM1 - + VMOVU (%rdi), %VMM(1) /* Leaves only CHARS matching esi as 0. */ - vpxorq %YMM1, %YMM0, %YMM2 - VPMINU %YMM2, %YMM1, %YMM2 - /* Each bit in K0 represents a CHAR or a null byte in YMM1. */ - VPTESTN %YMM2, %YMM2, %k0 - kmovd %k0, %eax - testl %eax, %eax + vpxorq %VMM(1), %VMATCH, %VMM(2) + VPMINU %VMM(2), %VMM(1), %VMM(2) + /* Each bit in K0 represents a CHAR or a null byte in VEC_1. 
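The vpxorq + VPMINU pair above folds the char-match and null-terminator checks into one test: for each element x, min(x, x ^ c) is zero exactly when x == 0 (end of string) or x == c (the searched character), so a single zero-test (VPTESTN into K0) covers both conditions. A scalar model of the trick (illustrative only):

```c
#include <assert.h>

/* Per-element model of: vpxorq VEC, MATCH -> t; vpminu t, VEC -> m;
   then a zero-test on m.  m == 0 iff the element is the searched
   char or the null terminator.  */
static int is_char_or_null(unsigned char x, unsigned char c)
{
    unsigned char t = x ^ c;          /* 0 iff x == c */
    unsigned char m = t < x ? t : x;  /* VPMINU: 0 iff t == 0 or x == 0 */
    return m == 0;
}
```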
*/ + VPTESTN %VMM(2), %VMM(2), %k0 + KMOV %k0, %VRAX +# if VEC_SIZE == 64 && defined USE_AS_STRCHRNUL + /* If VEC_SIZE == 64 && STRCHRNUL use bsf to test condition so + that all logic for match/null in first VEC first in 1x cache + lines. This has a slight cost to larger sizes. */ + bsf %VRAX, %VRAX + jz L(aligned_more) +# else + test %VRAX, %VRAX jz L(aligned_more) - tzcntl %eax, %eax + bsf %VRAX, %VRAX +# endif # ifndef USE_AS_STRCHRNUL /* Found CHAR or the null byte. */ cmp (%rdi, %rax, CHAR_SIZE), %CHAR_REG @@ -109,287 +144,374 @@ ENTRY_P2ALIGN (STRCHR, 5) # endif ret - - - .p2align 4,, 10 -L(first_vec_x4): -# ifndef USE_AS_STRCHRNUL - /* Check to see if first match was CHAR (k0) or null (k1). */ - kmovd %k0, %eax - tzcntl %eax, %eax - kmovd %k1, %ecx - /* bzhil will not be 0 if first match was null. */ - bzhil %eax, %ecx, %ecx - jne L(zero) -# else - /* Combine CHAR and null matches. */ - kord %k0, %k1, %k0 - kmovd %k0, %eax - tzcntl %eax, %eax -# endif - /* NB: Multiply sizeof char type (1 or 4) to get the number of - bytes. */ - leaq (VEC_SIZE * 4)(%rdi, %rax, CHAR_SIZE), %rax - ret - # ifndef USE_AS_STRCHRNUL L(zero): xorl %eax, %eax ret # endif - - .p2align 4 + .p2align 4,, 2 +L(first_vec_x3): + subq $-(VEC_SIZE * 2), %rdi +# if VEC_SIZE == 32 + /* Reuse L(first_vec_x3) for last VEC2 only for VEC_SIZE == 32. + For VEC_SIZE == 64 the registers don't match. */ +L(last_vec_x2): +# endif L(first_vec_x1): /* Use bsf here to save 1-byte keeping keeping the block in 1x fetch block. eax guranteed non-zero. */ - bsfl %eax, %eax + bsf %VRCX, %VRCX # ifndef USE_AS_STRCHRNUL - /* Found CHAR or the null byte. */ - cmp (VEC_SIZE)(%rdi, %rax, CHAR_SIZE), %CHAR_REG + /* Found CHAR or the null byte. */ + cmp (VEC_SIZE)(%rdi, %rcx, CHAR_SIZE), %CHAR_REG jne L(zero) - # endif /* NB: Multiply sizeof char type (1 or 4) to get the number of bytes. 
*/ - leaq (VEC_SIZE)(%rdi, %rax, CHAR_SIZE), %rax + leaq (VEC_SIZE)(%rdi, %rcx, CHAR_SIZE), %rax ret - .p2align 4,, 10 + .p2align 4,, 2 +L(first_vec_x4): + subq $-(VEC_SIZE * 2), %rdi L(first_vec_x2): # ifndef USE_AS_STRCHRNUL /* Check to see if first match was CHAR (k0) or null (k1). */ - kmovd %k0, %eax - tzcntl %eax, %eax - kmovd %k1, %ecx + KMOV %k0, %VRAX + tzcnt %VRAX, %VRAX + KMOV %k1, %VRCX /* bzhil will not be 0 if first match was null. */ - bzhil %eax, %ecx, %ecx + bzhi %VRAX, %VRCX, %VRCX jne L(zero) # else /* Combine CHAR and null matches. */ - kord %k0, %k1, %k0 - kmovd %k0, %eax - tzcntl %eax, %eax + KOR %k0, %k1, %k0 + KMOV %k0, %VRAX + bsf %VRAX, %VRAX # endif /* NB: Multiply sizeof char type (1 or 4) to get the number of bytes. */ leaq (VEC_SIZE * 2)(%rdi, %rax, CHAR_SIZE), %rax ret - .p2align 4,, 10 -L(first_vec_x3): - /* Use bsf here to save 1-byte keeping keeping the block in 1x - fetch block. eax guranteed non-zero. */ - bsfl %eax, %eax -# ifndef USE_AS_STRCHRNUL - /* Found CHAR or the null byte. */ - cmp (VEC_SIZE * 3)(%rdi, %rax, CHAR_SIZE), %CHAR_REG - jne L(zero) +# ifdef USE_AS_STRCHRNUL + /* We use this as a hook to get imm8 encoding for the jmp to + L(page_cross_boundary). This allows the hot case of a + match/null-term in first VEC to fit entirely in 1 cache + line. */ +L(cross_page_boundary): + jmp L(cross_page_boundary_real) # endif - /* NB: Multiply sizeof char type (1 or 4) to get the number of - bytes. */ - leaq (VEC_SIZE * 3)(%rdi, %rax, CHAR_SIZE), %rax - ret .p2align 4 L(aligned_more): +L(cross_page_continue): /* Align data to VEC_SIZE. */ andq $-VEC_SIZE, %rdi -L(cross_page_continue): - /* Check the next 4 * VEC_SIZE. Only one VEC_SIZE at a time since - data is only aligned to VEC_SIZE. Use two alternating methods - for checking VEC to balance latency and port contention. */ - /* This method has higher latency but has better port - distribution. */ - VMOVA (VEC_SIZE)(%rdi), %YMM1 + /* Check the next 4 * VEC_SIZE. 
Only one VEC_SIZE at a time + since data is only aligned to VEC_SIZE. Use two alternating + methods for checking VEC to balance latency and port + contention. */ + + /* Method(1) with 8c latency: + For VEC_SIZE == 32: + p0 * 1.83, p1 * 0.83, p5 * 1.33 + For VEC_SIZE == 64: + p0 * 2.50, p1 * 0.00, p5 * 1.50 */ + VMOVA (VEC_SIZE)(%rdi), %VMM(1) /* Leaves only CHARS matching esi as 0. */ - vpxorq %YMM1, %YMM0, %YMM2 - VPMINU %YMM2, %YMM1, %YMM2 - /* Each bit in K0 represents a CHAR or a null byte in YMM1. */ - VPTESTN %YMM2, %YMM2, %k0 - kmovd %k0, %eax - testl %eax, %eax + vpxorq %VMM(1), %VMATCH, %VMM(2) + VPMINU %VMM(2), %VMM(1), %VMM(2) + /* Each bit in K0 represents a CHAR or a null byte in VEC_1. */ + VPTESTN %VMM(2), %VMM(2), %k0 + KMOV %k0, %VRCX + test %VRCX, %VRCX jnz L(first_vec_x1) - /* This method has higher latency but has better port - distribution. */ - VMOVA (VEC_SIZE * 2)(%rdi), %YMM1 - /* Each bit in K0 represents a CHAR in YMM1. */ - VPCMP $0, %YMM1, %YMM0, %k0 - /* Each bit in K1 represents a CHAR in YMM1. */ - VPTESTN %YMM1, %YMM1, %k1 - kortestd %k0, %k1 + /* Method(2) with 6c latency: + For VEC_SIZE == 32: + p0 * 1.00, p1 * 0.00, p5 * 2.00 + For VEC_SIZE == 64: + p0 * 1.00, p1 * 0.00, p5 * 2.00 */ + VMOVA (VEC_SIZE * 2)(%rdi), %VMM(1) + /* Each bit in K0 represents a CHAR in VEC_1. */ + VPCMPEQ %VMM(1), %VMATCH, %k0 + /* Each bit in K1 represents a CHAR in VEC_1. */ + VPTESTN %VMM(1), %VMM(1), %k1 + KORTEST %k0, %k1 jnz L(first_vec_x2) - VMOVA (VEC_SIZE * 3)(%rdi), %YMM1 + /* By swapping between Method 1/2 we get more fair port + distrubition and better throughput. */ + + VMOVA (VEC_SIZE * 3)(%rdi), %VMM(1) /* Leaves only CHARS matching esi as 0. */ - vpxorq %YMM1, %YMM0, %YMM2 - VPMINU %YMM2, %YMM1, %YMM2 - /* Each bit in K0 represents a CHAR or a null byte in YMM1. 
*/ - VPTESTN %YMM2, %YMM2, %k0 - kmovd %k0, %eax - testl %eax, %eax + vpxorq %VMM(1), %VMATCH, %VMM(2) + VPMINU %VMM(2), %VMM(1), %VMM(2) + /* Each bit in K0 represents a CHAR or a null byte in VEC_1. */ + VPTESTN %VMM(2), %VMM(2), %k0 + KMOV %k0, %VRCX + test %VRCX, %VRCX jnz L(first_vec_x3) - VMOVA (VEC_SIZE * 4)(%rdi), %YMM1 - /* Each bit in K0 represents a CHAR in YMM1. */ - VPCMP $0, %YMM1, %YMM0, %k0 - /* Each bit in K1 represents a CHAR in YMM1. */ - VPTESTN %YMM1, %YMM1, %k1 - kortestd %k0, %k1 + VMOVA (VEC_SIZE * 4)(%rdi), %VMM(1) + /* Each bit in K0 represents a CHAR in VEC_1. */ + VPCMPEQ %VMM(1), %VMATCH, %k0 + /* Each bit in K1 represents a CHAR in VEC_1. */ + VPTESTN %VMM(1), %VMM(1), %k1 + KORTEST %k0, %k1 jnz L(first_vec_x4) /* Align data to VEC_SIZE * 4 for the loop. */ +# if VEC_SIZE == 64 + /* Use rax for the loop reg as it allows to the loop to fit in + exactly 2-cache-lines. (more efficient imm32 + gpr + encoding). */ + leaq (VEC_SIZE)(%rdi), %rax + /* No partial register stalls on evex512 processors. */ + xorb %al, %al +# else + /* For VEC_SIZE == 32 continue using rdi for loop reg so we can + reuse more code and save space. */ addq $VEC_SIZE, %rdi andq $-(VEC_SIZE * 4), %rdi - +# endif .p2align 4 L(loop_4x_vec): - /* Check 4x VEC at a time. No penalty to imm32 offset with evex - encoding. */ - VMOVA (VEC_SIZE * 4)(%rdi), %YMM1 - VMOVA (VEC_SIZE * 5)(%rdi), %YMM2 - VMOVA (VEC_SIZE * 6)(%rdi), %YMM3 - VMOVA (VEC_SIZE * 7)(%rdi), %YMM4 - - /* For YMM1 and YMM3 use xor to set the CHARs matching esi to + /* Check 4x VEC at a time. No penalty for imm32 offset with evex + encoding (if offset % VEC_SIZE == 0). */ + VMOVA (VEC_SIZE * 4)(%LOOP_REG), %VMM(1) + VMOVA (VEC_SIZE * 5)(%LOOP_REG), %VMM(2) + VMOVA (VEC_SIZE * 6)(%LOOP_REG), %VMM(3) + VMOVA (VEC_SIZE * 7)(%LOOP_REG), %VMM(4) + + /* Collect bits where VEC_1 does NOT match esi. This is later + use to mask of results (getting not matches allows us to + save an instruction on combining). 
*/ + VPCMP $4, %VMATCH, %VMM(1), %k1 + + /* Two methods for loop depending on VEC_SIZE. This is because + with zmm registers VPMINU can only run on p0 (as opposed to + p0/p1 for ymm) so it is less prefered. */ +# if VEC_SIZE == 32 + /* For VEC_2 and VEC_3 use xor to set the CHARs matching esi to zero. */ - vpxorq %YMM1, %YMM0, %YMM5 - /* For YMM2 and YMM4 cmp not equals to CHAR and store result in - k register. Its possible to save either 1 or 2 instructions - using cmp no equals method for either YMM1 or YMM1 and YMM3 - respectively but bottleneck on p5 makes it not worth it. */ - VPCMP $4, %YMM0, %YMM2, %k2 - vpxorq %YMM3, %YMM0, %YMM7 - VPCMP $4, %YMM0, %YMM4, %k4 - - /* Use min to select all zeros from either xor or end of string). - */ - VPMINU %YMM1, %YMM5, %YMM1 - VPMINU %YMM3, %YMM7, %YMM3 + vpxorq %VMM(2), %VMATCH, %VMM(6) + vpxorq %VMM(3), %VMATCH, %VMM(7) - /* Use min + zeromask to select for zeros. Since k2 and k4 will - have 0 as positions that matched with CHAR which will set - zero in the corresponding destination bytes in YMM2 / YMM4. - */ - VPMINU %YMM1, %YMM2, %YMM2{%k2}{z} - VPMINU %YMM3, %YMM4, %YMM4 - VPMINU %YMM2, %YMM4, %YMM4{%k4}{z} - - VPTESTN %YMM4, %YMM4, %k1 - kmovd %k1, %ecx - subq $-(VEC_SIZE * 4), %rdi - testl %ecx, %ecx + /* Find non-matches in VEC_4 while combining with non-matches + from VEC_1. NB: Try and use masked predicate execution on + instructions that have mask result as it has no latency + penalty. */ + VPCMP $4, %VMATCH, %VMM(4), %k4{%k1} + + /* Combined zeros from VEC_1 / VEC_2 (search for null term). */ + VPMINU %VMM(1), %VMM(2), %VMM(2) + + /* Use min to select all zeros from either xor or end of + string). */ + VPMINU %VMM(3), %VMM(7), %VMM(3) + VPMINU %VMM(2), %VMM(6), %VMM(2) + + /* Combined zeros from VEC_2 / VEC_3 (search for null term). */ + VPMINU %VMM(3), %VMM(4), %VMM(4) + + /* Combined zeros from VEC_2 / VEC_4 (this has all null term and + esi matches for VEC_2 / VEC_3). 
*/ + VPMINU %VMM(2), %VMM(4), %VMM(4) +# else + /* Collect non-matches for VEC_2. */ + VPCMP $4, %VMM(2), %VMATCH, %k2 + + /* Combined zeros from VEC_1 / VEC_2 (search for null term). */ + VPMINU %VMM(1), %VMM(2), %VMM(2) + + /* Find non-matches in VEC_3/VEC_4 while combining with non- + matches from VEC_1/VEC_2 respectively. */ + VPCMP $4, %VMM(3), %VMATCH, %k3{%k1} + VPCMP $4, %VMM(4), %VMATCH, %k4{%k2} + + /* Finish combining zeros in all VECs. */ + VPMINU %VMM(3), %VMM(4), %VMM(4) + + /* Combine in esi matches for VEC_3 (if there was a match with + esi, the corresponding bit in %k3 is zero so the + VPMINU_MASKZ will have a zero in the result). NB: This make + the VPMINU 3c latency. The only way to avoid it is to + createa a 12c dependency chain on all the `VPCMP $4, ...` + which has higher total latency. */ + VPMINU %VMM(2), %VMM(4), %VMM(4){%k3}{z} +# endif + VPTEST %VMM(4), %VMM(4), %k0{%k4} + KMOV %k0, %VRDX + subq $-(VEC_SIZE * 4), %LOOP_REG + + /* TESTZ is inc using the proper register width depending on + CHAR_PER_VEC. An esi match or null-term match leaves a zero- + bit in rdx so inc won't overflow and won't be zero. */ + TESTZ (rdx) jz L(loop_4x_vec) - VPTESTN %YMM1, %YMM1, %k0 - kmovd %k0, %eax - testl %eax, %eax - jnz L(last_vec_x1) + VPTEST %VMM(1), %VMM(1), %k0{%k1} + KMOV %k0, %VGPR(MASK_GPR) + TESTZ (MASK_GPR) +# if VEC_SIZE == 32 + /* We can reuse the return code in page_cross logic for VEC_SIZE + == 32. */ + jnz L(last_vec_x1_vec_size32) +# else + jnz L(last_vec_x1_vec_size64) +# endif + - VPTESTN %YMM2, %YMM2, %k0 - kmovd %k0, %eax - testl %eax, %eax + /* COND_MASK integates the esi matches for VEC_SIZE == 64. For + VEC_SIZE == 32 they are already integrated. */ + VPTEST %VMM(2), %VMM(2), %k0 COND_MASK(k2) + KMOV %k0, %VRCX + TESTZ (rcx) jnz L(last_vec_x2) - VPTESTN %YMM3, %YMM3, %k0 - kmovd %k0, %eax - /* Combine YMM3 matches (eax) with YMM4 matches (ecx). 
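The TESTZ(reg) macro used in the loop above exploits that the k-mask holds a 1 for every element that did NOT match: "no match anywhere" means the mask is all-ones for its width, so an `inc` of the proper register width wraps to zero exactly in that case, making `inc` + `jz` a width-aware "all bits set?" test. A scalar model for the 16-element case (illustrative):

```c
#include <assert.h>
#include <stdint.h>

/* Model of TESTZ for CHAR_PER_VEC == 16 (incw): the mask of
   non-matches is all-ones iff nothing matched, and only then does
   the increment wrap to zero.  */
static int any_match16(uint16_t not_match_mask)
{
    return (uint16_t)(not_match_mask + 1) != 0; /* incw; jz not taken */
}
```

Using inc instead of a cmp-with-immediate saves code size, and the register width (incb/incw/incl/incq) selects the mask width for free.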
*/ -# ifdef USE_AS_WCSCHR - sall $8, %ecx - orl %ecx, %eax - bsfl %eax, %eax + VPTEST %VMM(3), %VMM(3), %k0 COND_MASK(k3) + KMOV %k0, %VRCX +# if CHAR_PER_VEC == 64 + TESTZ (rcx) + jnz L(last_vec_x3) # else - salq $32, %rcx - orq %rcx, %rax - bsfq %rax, %rax + salq $CHAR_PER_VEC, %rdx + TESTZ (rcx) + orq %rcx, %rdx # endif + + bsfq %rdx, %rdx + # ifndef USE_AS_STRCHRNUL /* Check if match was CHAR or null. */ - cmp (VEC_SIZE * 2)(%rdi, %rax, CHAR_SIZE), %CHAR_REG + cmp (LAST_VEC_OFFSET)(%LOOP_REG, %rdx, CHAR_SIZE), %CHAR_REG jne L(zero_end) # endif /* NB: Multiply sizeof char type (1 or 4) to get the number of bytes. */ - leaq (VEC_SIZE * 2)(%rdi, %rax, CHAR_SIZE), %rax + leaq (LAST_VEC_OFFSET)(%LOOP_REG, %rdx, CHAR_SIZE), %rax ret - .p2align 4,, 8 -L(last_vec_x1): - bsfl %eax, %eax -# ifdef USE_AS_WCSCHR - /* NB: Multiply wchar_t count by 4 to get the number of bytes. - */ - leaq (%rdi, %rax, CHAR_SIZE), %rax -# else - addq %rdi, %rax +# ifndef USE_AS_STRCHRNUL +L(zero_end): + xorl %eax, %eax + ret # endif -# ifndef USE_AS_STRCHRNUL + + /* Seperate return label for last VEC1 because for VEC_SIZE == + 32 we can reuse return code in L(page_cross) but VEC_SIZE == + 64 has mismatched registers. */ +# if VEC_SIZE == 64 + .p2align 4,, 8 +L(last_vec_x1_vec_size64): + bsf %VRCX, %VRCX +# ifndef USE_AS_STRCHRNUL /* Check if match was null. */ - cmp (%rax), %CHAR_REG + cmp (%rax, %rcx, CHAR_SIZE), %CHAR_REG jne L(zero_end) -# endif - +# endif +# ifdef USE_AS_WCSCHR + /* NB: Multiply wchar_t count by 4 to get the number of bytes. + */ + leaq (%rax, %rcx, CHAR_SIZE), %rax +# else + addq %rcx, %rax +# endif ret + /* Since we can't combine the last 2x matches for CHAR_PER_VEC + == 64 we need return label for last VEC3. */ +# if CHAR_PER_VEC == 64 .p2align 4,, 8 +L(last_vec_x3): + addq $VEC_SIZE, %LOOP_REG +# endif + + /* Duplicate L(last_vec_x2) for VEC_SIZE == 64 because we can't + reuse L(first_vec_x3) due to register mismatch. 
*/ L(last_vec_x2): - bsfl %eax, %eax -# ifndef USE_AS_STRCHRNUL + bsf %VGPR(MASK_GPR), %VGPR(MASK_GPR) +# ifndef USE_AS_STRCHRNUL /* Check if match was null. */ - cmp (VEC_SIZE)(%rdi, %rax, CHAR_SIZE), %CHAR_REG + cmp (VEC_SIZE * 1)(%LOOP_REG, %MASK_GPR, CHAR_SIZE), %CHAR_REG jne L(zero_end) -# endif +# endif /* NB: Multiply sizeof char type (1 or 4) to get the number of bytes. */ - leaq (VEC_SIZE)(%rdi, %rax, CHAR_SIZE), %rax + leaq (VEC_SIZE * 1)(%LOOP_REG, %MASK_GPR, CHAR_SIZE), %rax ret +# endif - /* Cold case for crossing page with first load. */ - .p2align 4,, 8 + /* Cold case for crossing page with first load. */ + .p2align 4,, 10 +# ifndef USE_AS_STRCHRNUL L(cross_page_boundary): - movq %rdi, %rdx +# endif +L(cross_page_boundary_real): /* Align rdi. */ - andq $-VEC_SIZE, %rdi - VMOVA (%rdi), %YMM1 - /* Leaves only CHARS matching esi as 0. */ - vpxorq %YMM1, %YMM0, %YMM2 - VPMINU %YMM2, %YMM1, %YMM2 - /* Each bit in K0 represents a CHAR or a null byte in YMM1. */ - VPTESTN %YMM2, %YMM2, %k0 - kmovd %k0, %eax + xorq %rdi, %rax + VMOVA (PAGE_SIZE - VEC_SIZE)(%rax), %VMM(1) + /* Use high latency method of getting matches to save code size. + */ + + /* K1 has 1s where VEC(1) does NOT match esi. */ + VPCMP $4, %VMM(1), %VMATCH, %k1 + /* K0 has ones where K1 is 1 (non-match with esi), and non-zero + (null). */ + VPTEST %VMM(1), %VMM(1), %k0{%k1} + KMOV %k0, %VRAX /* Remove the leading bits. */ # ifdef USE_AS_WCSCHR - movl %edx, %SHIFT_REG + movl %edi, %VGPR_SZ(SHIFT_REG, 32) /* NB: Divide shift count by 4 since each bit in K1 represent 4 bytes. */ - sarl $2, %SHIFT_REG - andl $(CHAR_PER_VEC - 1), %SHIFT_REG + sarl $2, %VGPR_SZ(SHIFT_REG, 32) + andl $(CHAR_PER_VEC - 1), %VGPR_SZ(SHIFT_REG, 32) + + /* if wcschr we need to reverse matches as we can't rely on + signed shift to bring in ones. There is no sarx for + gpr8/16. Also note we can't use inc here as the lower bits + represent matches out of range so we can't rely on overflow.
+ */ + xorl $((1 << CHAR_PER_VEC)- 1), %eax +# endif + /* Use arithmetic shift so that leading 1s are filled in. */ + sarx %VGPR(SHIFT_REG), %VRAX, %VRAX + /* If eax is all ones then no matches for esi or NULL. */ + +# ifdef USE_AS_WCSCHR + test %VRAX, %VRAX +# else + inc %VRAX # endif - sarxl %SHIFT_REG, %eax, %eax - /* If eax is zero continue. */ - testl %eax, %eax jz L(cross_page_continue) - bsfl %eax, %eax + .p2align 4,, 10 +L(last_vec_x1_vec_size32): + bsf %VRAX, %VRAX # ifdef USE_AS_WCSCHR - /* NB: Multiply wchar_t count by 4 to get the number of - bytes. */ - leaq (%rdx, %rax, CHAR_SIZE), %rax + /* NB: Multiply wchar_t count by 4 to get the number of bytes. + */ + leaq (%rdi, %rax, CHAR_SIZE), %rax # else - addq %rdx, %rax + addq %rdi, %rax # endif # ifndef USE_AS_STRCHRNUL /* Check to see if match was CHAR or null. */ cmp (%rax), %CHAR_REG - je L(cross_page_ret) -L(zero_end): - xorl %eax, %eax -L(cross_page_ret): + jne L(zero_end_0) # endif ret +# ifndef USE_AS_STRCHRNUL +L(zero_end_0): + xorl %eax, %eax + ret +# endif END (STRCHR) #endif From patchwork Tue Oct 18 23:19:34 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Noah Goldstein X-Patchwork-Id: 1691739
To: libc-alpha@sourceware.org Subject: [PATCH v2 3/7] x86: Optimize strnlen-evex.S and implement with VMM headers Date: Tue, 18 Oct 2022 16:19:34 -0700 Message-Id: <20221018231938.3621554-3-goldstein.w.n@gmail.com> In-Reply-To: <20221018231938.3621554-1-goldstein.w.n@gmail.com> References: <20221018024901.3381469-1-goldstein.w.n@gmail.com> <20221018231938.3621554-1-goldstein.w.n@gmail.com> From: Noah Goldstein Reply-To: Noah Goldstein Optimizations are: 1.
Use the fact that bsf(0) leaves the destination unchanged to save a branch in the short-string case. 2. Restructure code so that small strings are given the hot path. - This is net-zero on the benchmark suite but in general makes sense, as smaller sizes are far more common. 3. Use more code-size-efficient instructions. - tzcnt ... -> bsf ... - vpcmpb $0 ... -> vpcmpeq ... 4. Align labels less aggressively, especially if doing so doesn't save fetch blocks or causes the basic block to span extra cache lines. The optimizations (especially for point 2) make the strnlen and strlen code essentially incompatible, so strnlen-evex is split out into a new file. Code Size Changes: strlen-evex.S : -23 bytes strnlen-evex.S : -167 bytes Net perf changes: Reported as geometric mean of all improvements / regressions from N=10 runs of the benchtests. Value is New Time / Old Time, so < 1.0 is an improvement and > 1.0 is a regression. strlen-evex.S : 0.992 (No real change) strnlen-evex.S : 0.947 Full results attached in email. Full check passes on x86-64.
--- sysdeps/x86_64/multiarch/strlen-evex.S | 544 +++++++----------------- sysdeps/x86_64/multiarch/strnlen-evex.S | 427 ++++++++++++++++++- sysdeps/x86_64/multiarch/wcsnlen-evex.S | 5 +- 3 files changed, 572 insertions(+), 404 deletions(-) diff --git a/sysdeps/x86_64/multiarch/strlen-evex.S b/sysdeps/x86_64/multiarch/strlen-evex.S index 2109ec2f7a..487846f098 100644 --- a/sysdeps/x86_64/multiarch/strlen-evex.S +++ b/sysdeps/x86_64/multiarch/strlen-evex.S @@ -26,466 +26,220 @@ # define STRLEN __strlen_evex # endif -# define VMOVA vmovdqa64 +# ifndef VEC_SIZE +# include "x86-evex256-vecs.h" +# endif # ifdef USE_AS_WCSLEN -# define VPCMP vpcmpd +# define VPCMPEQ vpcmpeqd +# define VPCMPNEQ vpcmpneqd +# define VPTESTN vptestnmd +# define VPTEST vptestmd # define VPMINU vpminud -# define SHIFT_REG ecx # define CHAR_SIZE 4 +# define CHAR_SIZE_SHIFT_REG(reg) sar $2, %reg # else -# define VPCMP vpcmpb +# define VPCMPEQ vpcmpeqb +# define VPCMPNEQ vpcmpneqb +# define VPTESTN vptestnmb +# define VPTEST vptestmb # define VPMINU vpminub -# define SHIFT_REG edx # define CHAR_SIZE 1 +# define CHAR_SIZE_SHIFT_REG(reg) + +# define REG_WIDTH VEC_SIZE # endif -# define XMMZERO xmm16 -# define YMMZERO ymm16 -# define YMM1 ymm17 -# define YMM2 ymm18 -# define YMM3 ymm19 -# define YMM4 ymm20 -# define YMM5 ymm21 -# define YMM6 ymm22 - -# define VEC_SIZE 32 -# define PAGE_SIZE 4096 -# define CHAR_PER_VEC (VEC_SIZE / CHAR_SIZE) - - .section .text.evex,"ax",@progbits -ENTRY (STRLEN) -# ifdef USE_AS_STRNLEN - /* Check zero length. */ - test %RSI_LP, %RSI_LP - jz L(zero) -# ifdef __ILP32__ - /* Clear the upper 32 bits. 
*/ - movl %esi, %esi -# endif - mov %RSI_LP, %R8_LP +# define CHAR_PER_VEC (VEC_SIZE / CHAR_SIZE) + +# include "reg-macros.h" + +# if CHAR_PER_VEC == 64 + +# define TAIL_RETURN_LBL first_vec_x2 +# define TAIL_RETURN_OFFSET (CHAR_PER_VEC * 2) + +# define FALLTHROUGH_RETURN_LBL first_vec_x3 +# define FALLTHROUGH_RETURN_OFFSET (CHAR_PER_VEC * 3) + +# else + +# define TAIL_RETURN_LBL first_vec_x3 +# define TAIL_RETURN_OFFSET (CHAR_PER_VEC * 3) + +# define FALLTHROUGH_RETURN_LBL first_vec_x2 +# define FALLTHROUGH_RETURN_OFFSET (CHAR_PER_VEC * 2) # endif + +# define XZERO VMM_128(0) +# define VZERO VMM(0) +# define PAGE_SIZE 4096 + + .section SECTION(.text), "ax", @progbits +ENTRY_P2ALIGN (STRLEN, 6) movl %edi, %eax - vpxorq %XMMZERO, %XMMZERO, %XMMZERO - /* Clear high bits from edi. Only keeping bits relevant to page - cross check. */ + vpxorq %XZERO, %XZERO, %XZERO andl $(PAGE_SIZE - 1), %eax - /* Check if we may cross page boundary with one vector load. */ cmpl $(PAGE_SIZE - VEC_SIZE), %eax ja L(cross_page_boundary) /* Check the first VEC_SIZE bytes. Each bit in K0 represents a null byte. */ - VPCMP $0, (%rdi), %YMMZERO, %k0 - kmovd %k0, %eax -# ifdef USE_AS_STRNLEN - /* If length < CHAR_PER_VEC handle special. */ - cmpq $CHAR_PER_VEC, %rsi - jbe L(first_vec_x0) -# endif - testl %eax, %eax + VPCMPEQ (%rdi), %VZERO, %k0 + KMOV %k0, %VRAX + test %VRAX, %VRAX jz L(aligned_more) - tzcntl %eax, %eax - ret -# ifdef USE_AS_STRNLEN -L(zero): - xorl %eax, %eax - ret - - .p2align 4 -L(first_vec_x0): - /* Set bit for max len so that tzcnt will return min of max len - and position of first match. */ - btsq %rsi, %rax - tzcntl %eax, %eax - ret -# endif - - .p2align 4 -L(first_vec_x1): - tzcntl %eax, %eax - /* Safe to use 32 bit instructions as these are only called for - size = [1, 159]. */ -# ifdef USE_AS_STRNLEN - /* Use ecx which was computed earlier to compute correct value. 
- */ - leal -(CHAR_PER_VEC * 4 + 1)(%rcx, %rax), %eax -# else - subl %edx, %edi -# ifdef USE_AS_WCSLEN - /* NB: Divide bytes by 4 to get the wchar_t count. */ - sarl $2, %edi -# endif - leal CHAR_PER_VEC(%rdi, %rax), %eax -# endif - ret - - .p2align 4 -L(first_vec_x2): - tzcntl %eax, %eax - /* Safe to use 32 bit instructions as these are only called for - size = [1, 159]. */ -# ifdef USE_AS_STRNLEN - /* Use ecx which was computed earlier to compute correct value. - */ - leal -(CHAR_PER_VEC * 3 + 1)(%rcx, %rax), %eax -# else - subl %edx, %edi -# ifdef USE_AS_WCSLEN - /* NB: Divide bytes by 4 to get the wchar_t count. */ - sarl $2, %edi -# endif - leal (CHAR_PER_VEC * 2)(%rdi, %rax), %eax -# endif + bsf %VRAX, %VRAX ret - .p2align 4 -L(first_vec_x3): - tzcntl %eax, %eax - /* Safe to use 32 bit instructions as these are only called for - size = [1, 159]. */ -# ifdef USE_AS_STRNLEN - /* Use ecx which was computed earlier to compute correct value. - */ - leal -(CHAR_PER_VEC * 2 + 1)(%rcx, %rax), %eax -# else - subl %edx, %edi -# ifdef USE_AS_WCSLEN - /* NB: Divide bytes by 4 to get the wchar_t count. */ - sarl $2, %edi -# endif - leal (CHAR_PER_VEC * 3)(%rdi, %rax), %eax -# endif - ret - - .p2align 4 + .p2align 4,, 8 L(first_vec_x4): - tzcntl %eax, %eax - /* Safe to use 32 bit instructions as these are only called for - size = [1, 159]. */ -# ifdef USE_AS_STRNLEN - /* Use ecx which was computed earlier to compute correct value. - */ - leal -(CHAR_PER_VEC + 1)(%rcx, %rax), %eax -# else - subl %edx, %edi -# ifdef USE_AS_WCSLEN - /* NB: Divide bytes by 4 to get the wchar_t count. */ - sarl $2, %edi -# endif + bsf %VRAX, %VRAX + subl %ecx, %edi + CHAR_SIZE_SHIFT_REG (edi) leal (CHAR_PER_VEC * 4)(%rdi, %rax), %eax -# endif ret - .p2align 5 + + + /* Aligned more for strnlen compares remaining length vs 2 * + CHAR_PER_VEC, 4 * CHAR_PER_VEC, and 8 * CHAR_PER_VEC before + going to the loop. */ + .p2align 4,, 10 L(aligned_more): - movq %rdi, %rdx - /* Align data to VEC_SIZE. 
*/ - andq $-(VEC_SIZE), %rdi + movq %rdi, %rcx + andq $(VEC_SIZE * -1), %rdi L(cross_page_continue): - /* Check the first 4 * VEC_SIZE. Only one VEC_SIZE at a time - since data is only aligned to VEC_SIZE. */ -# ifdef USE_AS_STRNLEN - /* + CHAR_SIZE because it simplies the logic in - last_4x_vec_or_less. */ - leaq (VEC_SIZE * 5 + CHAR_SIZE)(%rdi), %rcx - subq %rdx, %rcx -# ifdef USE_AS_WCSLEN - /* NB: Divide bytes by 4 to get the wchar_t count. */ - sarl $2, %ecx -# endif -# endif - /* Load first VEC regardless. */ - VPCMP $0, VEC_SIZE(%rdi), %YMMZERO, %k0 -# ifdef USE_AS_STRNLEN - /* Adjust length. If near end handle specially. */ - subq %rcx, %rsi - jb L(last_4x_vec_or_less) -# endif - kmovd %k0, %eax - testl %eax, %eax + /* Remaining length >= 2 * CHAR_PER_VEC so do VEC0/VEC1 without + rechecking bounds. */ + VPCMPEQ (VEC_SIZE * 1)(%rdi), %VZERO, %k0 + KMOV %k0, %VRAX + test %VRAX, %VRAX jnz L(first_vec_x1) - VPCMP $0, (VEC_SIZE * 2)(%rdi), %YMMZERO, %k0 - kmovd %k0, %eax - test %eax, %eax + VPCMPEQ (VEC_SIZE * 2)(%rdi), %VZERO, %k0 + KMOV %k0, %VRAX + test %VRAX, %VRAX jnz L(first_vec_x2) - VPCMP $0, (VEC_SIZE * 3)(%rdi), %YMMZERO, %k0 - kmovd %k0, %eax - testl %eax, %eax + VPCMPEQ (VEC_SIZE * 3)(%rdi), %VZERO, %k0 + KMOV %k0, %VRAX + test %VRAX, %VRAX jnz L(first_vec_x3) - VPCMP $0, (VEC_SIZE * 4)(%rdi), %YMMZERO, %k0 - kmovd %k0, %eax - testl %eax, %eax + VPCMPEQ (VEC_SIZE * 4)(%rdi), %VZERO, %k0 + KMOV %k0, %VRAX + test %VRAX, %VRAX jnz L(first_vec_x4) - addq $VEC_SIZE, %rdi -# ifdef USE_AS_STRNLEN - /* Check if at last VEC_SIZE * 4 length. */ - cmpq $(CHAR_PER_VEC * 4 - 1), %rsi - jbe L(last_4x_vec_or_less_load) - movl %edi, %ecx - andl $(VEC_SIZE * 4 - 1), %ecx -# ifdef USE_AS_WCSLEN - /* NB: Divide bytes by 4 to get the wchar_t count. */ - sarl $2, %ecx -# endif - /* Readjust length. */ - addq %rcx, %rsi -# endif - /* Align data to VEC_SIZE * 4. 
*/ + subq $(VEC_SIZE * -1), %rdi + +# if CHAR_PER_VEC == 64 + /* No partial register stalls on processors that we use evex512 + on and this saves code size. */ + xorb %dil, %dil +# else andq $-(VEC_SIZE * 4), %rdi +# endif + + /* Compare 4 * VEC at a time forward. */ .p2align 4 L(loop_4x_vec): - /* Load first VEC regardless. */ - VMOVA (VEC_SIZE * 4)(%rdi), %YMM1 -# ifdef USE_AS_STRNLEN - /* Break if at end of length. */ - subq $(CHAR_PER_VEC * 4), %rsi - jb L(last_4x_vec_or_less_cmpeq) -# endif - /* Save some code size by microfusing VPMINU with the load. Since - the matches in ymm2/ymm4 can only be returned if there where no - matches in ymm1/ymm3 respectively there is no issue with overlap. - */ - VPMINU (VEC_SIZE * 5)(%rdi), %YMM1, %YMM2 - VMOVA (VEC_SIZE * 6)(%rdi), %YMM3 - VPMINU (VEC_SIZE * 7)(%rdi), %YMM3, %YMM4 + VMOVA (VEC_SIZE * 4)(%rdi), %VMM(1) + VPMINU (VEC_SIZE * 5)(%rdi), %VMM(1), %VMM(2) + VMOVA (VEC_SIZE * 6)(%rdi), %VMM(3) + VPMINU (VEC_SIZE * 7)(%rdi), %VMM(3), %VMM(4) + VPTESTN %VMM(2), %VMM(2), %k0 + VPTESTN %VMM(4), %VMM(4), %k2 - VPCMP $0, %YMM2, %YMMZERO, %k0 - VPCMP $0, %YMM4, %YMMZERO, %k1 subq $-(VEC_SIZE * 4), %rdi - kortestd %k0, %k1 + KORTEST %k0, %k2 jz L(loop_4x_vec) - /* Check if end was in first half. */ - kmovd %k0, %eax - subq %rdx, %rdi -# ifdef USE_AS_WCSLEN - shrq $2, %rdi -# endif - testl %eax, %eax - jz L(second_vec_return) + VPTESTN %VMM(1), %VMM(1), %k1 + KMOV %k1, %VRAX + test %VRAX, %VRAX + jnz L(first_vec_x0) - VPCMP $0, %YMM1, %YMMZERO, %k2 - kmovd %k2, %edx - /* Combine VEC1 matches (edx) with VEC2 matches (eax). */ -# ifdef USE_AS_WCSLEN - sall $CHAR_PER_VEC, %eax - orl %edx, %eax - tzcntl %eax, %eax -# else - salq $CHAR_PER_VEC, %rax - orq %rdx, %rax - tzcntq %rax, %rax -# endif - addq %rdi, %rax - ret - - -# ifdef USE_AS_STRNLEN - -L(last_4x_vec_or_less_load): - /* Depending on entry adjust rdi / prepare first VEC in YMM1. 
*/ - VMOVA (VEC_SIZE * 4)(%rdi), %YMM1 -L(last_4x_vec_or_less_cmpeq): - VPCMP $0, %YMM1, %YMMZERO, %k0 - addq $(VEC_SIZE * 3), %rdi -L(last_4x_vec_or_less): - kmovd %k0, %eax - /* If remaining length > VEC_SIZE * 2. This works if esi is off by - VEC_SIZE * 4. */ - testl $(CHAR_PER_VEC * 2), %esi - jnz L(last_4x_vec) - - /* length may have been negative or positive by an offset of - CHAR_PER_VEC * 4 depending on where this was called from. This - fixes that. */ - andl $(CHAR_PER_VEC * 4 - 1), %esi - testl %eax, %eax - jnz L(last_vec_x1_check) + KMOV %k0, %VRAX + test %VRAX, %VRAX + jnz L(first_vec_x1) - /* Check the end of data. */ - subl $CHAR_PER_VEC, %esi - jb L(max) + VPTESTN %VMM(3), %VMM(3), %k0 - VPCMP $0, (VEC_SIZE * 2)(%rdi), %YMMZERO, %k0 - kmovd %k0, %eax - tzcntl %eax, %eax - /* Check the end of data. */ - cmpl %eax, %esi - jb L(max) - - subq %rdx, %rdi -# ifdef USE_AS_WCSLEN - /* NB: Divide bytes by 4 to get the wchar_t count. */ - sarq $2, %rdi -# endif - leaq (CHAR_PER_VEC * 2)(%rdi, %rax), %rax - ret -L(max): - movq %r8, %rax - ret -# endif - - /* Placed here in strnlen so that the jcc L(last_4x_vec_or_less) - in the 4x VEC loop can use 2 byte encoding. */ - .p2align 4 -L(second_vec_return): - VPCMP $0, %YMM3, %YMMZERO, %k0 - /* Combine YMM3 matches (k0) with YMM4 matches (k1). */ -# ifdef USE_AS_WCSLEN - kunpckbw %k0, %k1, %k0 - kmovd %k0, %eax - tzcntl %eax, %eax +# if CHAR_PER_VEC == 64 + KMOV %k0, %VRAX + test %VRAX, %VRAX + jnz L(first_vec_x2) + KMOV %k2, %VRAX # else - kunpckdq %k0, %k1, %k0 - kmovq %k0, %rax - tzcntq %rax, %rax + /* We can only combine last 2x VEC masks if CHAR_PER_VEC <= 32. + */ + kmovd %k2, %edx + kmovd %k0, %eax + salq $CHAR_PER_VEC, %rdx + orq %rdx, %rax # endif - leaq (CHAR_PER_VEC * 2)(%rdi, %rax), %rax - ret - -# ifdef USE_AS_STRNLEN -L(last_vec_x1_check): - tzcntl %eax, %eax - /* Check the end of data. 
*/ - cmpl %eax, %esi - jb L(max) - subq %rdx, %rdi -# ifdef USE_AS_WCSLEN - /* NB: Divide bytes by 4 to get the wchar_t count. */ - sarq $2, %rdi -# endif - leaq (CHAR_PER_VEC)(%rdi, %rax), %rax + /* first_vec_x3 for strlen-ZMM and first_vec_x2 for strlen-YMM. + */ + .p2align 4,, 2 +L(FALLTHROUGH_RETURN_LBL): + bsfq %rax, %rax + subq %rcx, %rdi + CHAR_SIZE_SHIFT_REG (rdi) + leaq (FALLTHROUGH_RETURN_OFFSET)(%rdi, %rax), %rax ret - .p2align 4 -L(last_4x_vec): - /* Test first 2x VEC normally. */ - testl %eax, %eax - jnz L(last_vec_x1) - - VPCMP $0, (VEC_SIZE * 2)(%rdi), %YMMZERO, %k0 - kmovd %k0, %eax - testl %eax, %eax - jnz L(last_vec_x2) - - /* Normalize length. */ - andl $(CHAR_PER_VEC * 4 - 1), %esi - VPCMP $0, (VEC_SIZE * 3)(%rdi), %YMMZERO, %k0 - kmovd %k0, %eax - testl %eax, %eax - jnz L(last_vec_x3) - - /* Check the end of data. */ - subl $(CHAR_PER_VEC * 3), %esi - jb L(max) - - VPCMP $0, (VEC_SIZE * 4)(%rdi), %YMMZERO, %k0 - kmovd %k0, %eax - tzcntl %eax, %eax - /* Check the end of data. */ - cmpl %eax, %esi - jb L(max_end) - - subq %rdx, %rdi -# ifdef USE_AS_WCSLEN - /* NB: Divide bytes by 4 to get the wchar_t count. */ - sarq $2, %rdi -# endif - leaq (CHAR_PER_VEC * 4)(%rdi, %rax), %rax + .p2align 4,, 8 +L(first_vec_x0): + bsf %VRAX, %VRAX + sub %rcx, %rdi + CHAR_SIZE_SHIFT_REG (rdi) + addq %rdi, %rax ret - .p2align 4 -L(last_vec_x1): - tzcntl %eax, %eax - subq %rdx, %rdi -# ifdef USE_AS_WCSLEN - /* NB: Divide bytes by 4 to get the wchar_t count. */ - sarq $2, %rdi -# endif + .p2align 4,, 10 +L(first_vec_x1): + bsf %VRAX, %VRAX + sub %rcx, %rdi + CHAR_SIZE_SHIFT_REG (rdi) leaq (CHAR_PER_VEC)(%rdi, %rax), %rax ret - .p2align 4 -L(last_vec_x2): - tzcntl %eax, %eax - subq %rdx, %rdi -# ifdef USE_AS_WCSLEN - /* NB: Divide bytes by 4 to get the wchar_t count. */ - sarq $2, %rdi -# endif - leaq (CHAR_PER_VEC * 2)(%rdi, %rax), %rax - ret - - .p2align 4 -L(last_vec_x3): - tzcntl %eax, %eax - subl $(CHAR_PER_VEC * 2), %esi - /* Check the end of data. 
*/ - cmpl %eax, %esi - jb L(max_end) - subq %rdx, %rdi -# ifdef USE_AS_WCSLEN - /* NB: Divide bytes by 4 to get the wchar_t count. */ - sarq $2, %rdi -# endif - leaq (CHAR_PER_VEC * 3)(%rdi, %rax), %rax - ret -L(max_end): - movq %r8, %rax + .p2align 4,, 10 + /* first_vec_x2 for strlen-ZMM and first_vec_x3 for strlen-YMM. + */ +L(TAIL_RETURN_LBL): + bsf %VRAX, %VRAX + sub %VRCX, %VRDI + CHAR_SIZE_SHIFT_REG (VRDI) + lea (TAIL_RETURN_OFFSET)(%rdi, %rax), %VRAX ret -# endif - /* Cold case for crossing page with first load. */ - .p2align 4 + .p2align 4,, 8 L(cross_page_boundary): - movq %rdi, %rdx + movq %rdi, %rcx /* Align data to VEC_SIZE. */ andq $-VEC_SIZE, %rdi - VPCMP $0, (%rdi), %YMMZERO, %k0 - kmovd %k0, %eax - /* Remove the leading bytes. */ + + VPCMPEQ (%rdi), %VZERO, %k0 + + KMOV %k0, %VRAX # ifdef USE_AS_WCSLEN - /* NB: Divide shift count by 4 since each bit in K0 represent 4 - bytes. */ - movl %edx, %ecx - shrl $2, %ecx - andl $(CHAR_PER_VEC - 1), %ecx -# endif - /* SHIFT_REG is ecx for USE_AS_WCSLEN and edx otherwise. */ - sarxl %SHIFT_REG, %eax, %eax + movl %ecx, %edx + shrl $2, %edx + andl $(CHAR_PER_VEC - 1), %edx + shrx %edx, %eax, %eax testl %eax, %eax -# ifndef USE_AS_STRNLEN - jz L(cross_page_continue) - tzcntl %eax, %eax - ret # else - jnz L(cross_page_less_vec) -# ifndef USE_AS_WCSLEN - movl %edx, %ecx - andl $(CHAR_PER_VEC - 1), %ecx -# endif - movl $CHAR_PER_VEC, %eax - subl %ecx, %eax - /* Check the end of data. */ - cmpq %rax, %rsi - ja L(cross_page_continue) - movl %esi, %eax - ret -L(cross_page_less_vec): - tzcntl %eax, %eax - /* Select min of length and position of first null. 
*/ - cmpq %rax, %rsi - cmovb %esi, %eax - ret + shr %cl, %VRAX # endif + jz L(cross_page_continue) + bsf %VRAX, %VRAX + ret END (STRLEN) #endif diff --git a/sysdeps/x86_64/multiarch/strnlen-evex.S b/sysdeps/x86_64/multiarch/strnlen-evex.S index 64a9fc2606..443a32a749 100644 --- a/sysdeps/x86_64/multiarch/strnlen-evex.S +++ b/sysdeps/x86_64/multiarch/strnlen-evex.S @@ -1,8 +1,423 @@ -#ifndef STRNLEN -# define STRNLEN __strnlen_evex -#endif +/* strnlen/wcsnlen optimized with 256-bit EVEX instructions. + Copyright (C) 2022 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . 
*/ + +#include +#include + +#if ISA_SHOULD_BUILD (4) + +# ifndef VEC_SIZE +# include "x86-evex256-vecs.h" +# endif + + +# ifndef STRNLEN +# define STRNLEN __strnlen_evex +# endif + +# ifdef USE_AS_WCSLEN +# define VPCMPEQ vpcmpeqd +# define VPCMPNEQ vpcmpneqd +# define VPTESTN vptestnmd +# define VPTEST vptestmd +# define VPMINU vpminud +# define CHAR_SIZE 4 + +# else +# define VPCMPEQ vpcmpeqb +# define VPCMPNEQ vpcmpneqb +# define VPTESTN vptestnmb +# define VPTEST vptestmb +# define VPMINU vpminub +# define CHAR_SIZE 1 + +# define REG_WIDTH VEC_SIZE +# endif + +# define CHAR_PER_VEC (VEC_SIZE / CHAR_SIZE) + +# include "reg-macros.h" + +# if CHAR_PER_VEC == 32 +# define SUB_SHORT(imm, reg) subb $(imm), %VGPR_SZ(reg, 8) +# else +# define SUB_SHORT(imm, reg) subl $(imm), %VGPR_SZ(reg, 32) +# endif + + + +# if CHAR_PER_VEC == 64 +# define FALLTHROUGH_RETURN_OFFSET (CHAR_PER_VEC * 3) +# else +# define FALLTHROUGH_RETURN_OFFSET (CHAR_PER_VEC * 2) +# endif + + +# define XZERO VMM_128(0) +# define VZERO VMM(0) +# define PAGE_SIZE 4096 + + .section SECTION(.text), "ax", @progbits +ENTRY_P2ALIGN (STRNLEN, 6) + /* Check zero length. */ + test %RSI_LP, %RSI_LP + jz L(zero) +# ifdef __ILP32__ + /* Clear the upper 32 bits. */ + movl %esi, %esi +# endif + + movl %edi, %eax + vpxorq %XZERO, %XZERO, %XZERO + andl $(PAGE_SIZE - 1), %eax + cmpl $(PAGE_SIZE - VEC_SIZE), %eax + ja L(cross_page_boundary) + + /* Check the first VEC_SIZE bytes. Each bit in K0 represents a + null byte. */ + VPCMPEQ (%rdi), %VZERO, %k0 + + KMOV %k0, %VRCX + movq %rsi, %rax + + /* If src (rcx) is zero, bsf does not change the result. NB: + Must use 64-bit bsf here so that upper bits of len are not + cleared. */ + bsfq %rcx, %rax + /* If rax > CHAR_PER_VEC then rcx must have been zero (no null + CHAR) and rsi must be > CHAR_PER_VEC. */ + cmpq $CHAR_PER_VEC, %rax + ja L(more_1x_vec) + /* Check if first match in bounds. 
*/ + cmpq %rax, %rsi + cmovb %esi, %eax + ret + + +# if CHAR_PER_VEC != 32 + .p2align 4,, 2 +L(zero): +L(max_0): + movl %esi, %eax + ret +# endif + + /* Aligned more for strnlen compares remaining length vs 2 * + CHAR_PER_VEC, 4 * CHAR_PER_VEC, and 8 * CHAR_PER_VEC before + going to the loop. */ + .p2align 4,, 10 +L(more_1x_vec): +L(cross_page_continue): + /* Compute number of words checked after aligning. */ +# ifdef USE_AS_WCSLEN + /* Need to compute directly for wcslen as CHAR_SIZE * rsi can + overflow. */ + movq %rdi, %rax + andq $(VEC_SIZE * -1), %rdi + subq %rdi, %rax + sarq $2, %rax + leaq -(CHAR_PER_VEC * 1)(%rax, %rsi), %rax +# else + leaq (VEC_SIZE * -1)(%rsi, %rdi), %rax + andq $(VEC_SIZE * -1), %rdi + subq %rdi, %rax +# endif + + + VPCMPEQ VEC_SIZE(%rdi), %VZERO, %k0 + + cmpq $(CHAR_PER_VEC * 2), %rax + ja L(more_2x_vec) + +L(last_2x_vec_or_less): + KMOV %k0, %VRDX + test %VRDX, %VRDX + jnz L(last_vec_check) + + /* Check the end of data. */ + SUB_SHORT (CHAR_PER_VEC, rax) + jbe L(max_0) + VPCMPEQ (VEC_SIZE * 2)(%rdi), %VZERO, %k0 + KMOV %k0, %VRDX + test %VRDX, %VRDX + jz L(max_0) + /* Best place for LAST_VEC_CHECK if ZMM. */ + .p2align 4,, 8 +L(last_vec_check): + bsf %VRDX, %VRDX + sub %eax, %edx + lea (%rsi, %rdx), %eax + cmovae %esi, %eax + ret + +# if CHAR_PER_VEC == 32 + .p2align 4,, 2 +L(zero): +L(max_0): + movl %esi, %eax + ret +# endif + + .p2align 4,, 8 +L(last_4x_vec_or_less): + addl $(CHAR_PER_VEC * -4), %eax + VPCMPEQ (VEC_SIZE * 5)(%rdi), %VZERO, %k0 + subq $(VEC_SIZE * -4), %rdi + cmpl $(CHAR_PER_VEC * 2), %eax + jbe L(last_2x_vec_or_less) + + .p2align 4,, 6 +L(more_2x_vec): + /* Remaining length >= 2 * CHAR_PER_VEC so do VEC0/VEC1 without + rechecking bounds. 
*/ -#define USE_AS_STRNLEN 1 -#define STRLEN STRNLEN + KMOV %k0, %VRDX -#include "strlen-evex.S" + test %VRDX, %VRDX + jnz L(first_vec_x1) + + VPCMPEQ (VEC_SIZE * 2)(%rdi), %VZERO, %k0 + KMOV %k0, %VRDX + test %VRDX, %VRDX + jnz L(first_vec_x2) + + cmpq $(CHAR_PER_VEC * 4), %rax + ja L(more_4x_vec) + + + VPCMPEQ (VEC_SIZE * 3)(%rdi), %VZERO, %k0 + KMOV %k0, %VRDX + addl $(CHAR_PER_VEC * -2), %eax + test %VRDX, %VRDX + jnz L(last_vec_check) + + subl $(CHAR_PER_VEC), %eax + jbe L(max_1) + + VPCMPEQ (VEC_SIZE * 4)(%rdi), %VZERO, %k0 + KMOV %k0, %VRDX + + test %VRDX, %VRDX + jnz L(last_vec_check) +L(max_1): + movl %esi, %eax + ret + + .p2align 4,, 3 +L(first_vec_x2): +# if VEC_SIZE == 64 + /* If VEC_SIZE == 64 we can fit logic for full return label in + spare bytes before next cache line. */ + bsf %VRDX, %VRDX + sub %eax, %esi + leal (CHAR_PER_VEC * 1)(%rsi, %rdx), %eax + ret + .p2align 4,, 6 +# else + addl $CHAR_PER_VEC, %esi +# endif +L(first_vec_x1): + bsf %VRDX, %VRDX + sub %eax, %esi + leal (CHAR_PER_VEC * 0)(%rsi, %rdx), %eax + ret + + + .p2align 4,, 6 +L(first_vec_x4): +# if VEC_SIZE == 64 + /* If VEC_SIZE == 64 we can fit logic for full return label in + spare bytes before next cache line. */ + bsf %VRDX, %VRDX + sub %eax, %esi + leal (CHAR_PER_VEC * 3)(%rsi, %rdx), %eax + ret + .p2align 4,, 6 +# else + addl $CHAR_PER_VEC, %esi +# endif +L(first_vec_x3): + bsf %VRDX, %VRDX + sub %eax, %esi + leal (CHAR_PER_VEC * 2)(%rsi, %rdx), %eax + ret + + .p2align 4,, 5 +L(more_4x_vec): + VPCMPEQ (VEC_SIZE * 3)(%rdi), %VZERO, %k0 + KMOV %k0, %VRDX + test %VRDX, %VRDX + jnz L(first_vec_x3) + + VPCMPEQ (VEC_SIZE * 4)(%rdi), %VZERO, %k0 + KMOV %k0, %VRDX + test %VRDX, %VRDX + jnz L(first_vec_x4) + + /* Check if at last VEC_SIZE * 4 length before aligning for the + loop. */ + cmpq $(CHAR_PER_VEC * 8), %rax + jbe L(last_4x_vec_or_less) + + + /* Compute number of words checked after aligning. 
+# ifdef USE_AS_WCSLEN + /* Need to compute directly for wcslen as CHAR_SIZE * rsi can + overflow. */ + leaq (VEC_SIZE * -3)(%rdi), %rdx +# else + leaq (VEC_SIZE * -3)(%rdi, %rax), %rax +# endif + + subq $(VEC_SIZE * -1), %rdi + + /* Align data to VEC_SIZE * 4. */ +# if VEC_SIZE == 64 + /* Saves code size. No evex512 processor has partial register + stalls. If that changes, this can be replaced with `andq + $-(VEC_SIZE * 4), %rdi`. */ + xorb %dil, %dil +# else + andq $-(VEC_SIZE * 4), %rdi +# endif + +# ifdef USE_AS_WCSLEN + subq %rdi, %rdx + sarq $2, %rdx + addq %rdx, %rax +# else + subq %rdi, %rax +# endif + /* Compare 4 * VEC at a time forward. */ + .p2align 4,, 11 +L(loop_4x_vec): + VMOVA (VEC_SIZE * 4)(%rdi), %VMM(1) + VPMINU (VEC_SIZE * 5)(%rdi), %VMM(1), %VMM(2) + VMOVA (VEC_SIZE * 6)(%rdi), %VMM(3) + VPMINU (VEC_SIZE * 7)(%rdi), %VMM(3), %VMM(4) + VPTESTN %VMM(2), %VMM(2), %k0 + VPTESTN %VMM(4), %VMM(4), %k2 + subq $-(VEC_SIZE * 4), %rdi + /* Break if at end of length. */ + subq $(CHAR_PER_VEC * 4), %rax + jbe L(loop_len_end) + + + KORTEST %k0, %k2 + jz L(loop_4x_vec) + + +L(loop_last_4x_vec): + movq %rsi, %rcx + subq %rax, %rsi + VPTESTN %VMM(1), %VMM(1), %k1 + KMOV %k1, %VRDX + test %VRDX, %VRDX + jnz L(last_vec_x0) + + KMOV %k0, %VRDX + test %VRDX, %VRDX + jnz L(last_vec_x1) + + VPTESTN %VMM(3), %VMM(3), %k0 + + /* Separate logic for VEC_SIZE == 64 and VEC_SIZE == 32 for + returning last 2x VEC. For VEC_SIZE == 64 we test each VEC + individually, for VEC_SIZE == 32 we combine them in a single + 64-bit GPR. */ +# if CHAR_PER_VEC == 64 + KMOV %k0, %VRDX + test %VRDX, %VRDX + jnz L(last_vec_x2) + KMOV %k2, %VRDX +# else + /* We can only combine last 2x VEC masks if CHAR_PER_VEC <= 32. + */ + kmovd %k2, %edx + kmovd %k0, %eax + salq $CHAR_PER_VEC, %rdx + orq %rax, %rdx +# endif + + /* first_vec_x3 for strlen-ZMM and first_vec_x2 for strlen-YMM.
+ */ + bsfq %rdx, %rdx + leaq (FALLTHROUGH_RETURN_OFFSET - CHAR_PER_VEC * 4)(%rsi, %rdx), %rax + cmpq %rax, %rcx + cmovb %rcx, %rax + ret + + /* Handle last 4x VEC after loop. All VECs have been loaded. */ + .p2align 4,, 4 +L(loop_len_end): + KORTEST %k0, %k2 + jnz L(loop_last_4x_vec) + movq %rsi, %rax + ret + + +# if CHAR_PER_VEC == 64 + /* Since we can't combine the last 2x VEC for VEC_SIZE == 64 + need return label for it. */ + .p2align 4,, 8 +L(last_vec_x2): + bsf %VRDX, %VRDX + leaq (CHAR_PER_VEC * -2)(%rsi, %rdx), %rax + cmpq %rax, %rcx + cmovb %rcx, %rax + ret +# endif + + + .p2align 4,, 10 +L(last_vec_x1): + addq $CHAR_PER_VEC, %rsi +L(last_vec_x0): + bsf %VRDX, %VRDX + leaq (CHAR_PER_VEC * -4)(%rsi, %rdx), %rax + cmpq %rax, %rcx + cmovb %rcx, %rax + ret + + + .p2align 4,, 8 +L(cross_page_boundary): + /* Align data to VEC_SIZE. */ + movq %rdi, %rcx + andq $-VEC_SIZE, %rcx + VPCMPEQ (%rcx), %VZERO, %k0 + + KMOV %k0, %VRCX +# ifdef USE_AS_WCSLEN + shrl $2, %eax + andl $(CHAR_PER_VEC - 1), %eax +# endif + shrx %VRAX, %VRCX, %VRCX + + negl %eax + andl $(CHAR_PER_VEC - 1), %eax + movq %rsi, %rdx + bsf %VRCX, %VRDX + cmpq %rax, %rdx + ja L(cross_page_continue) + movl %edx, %eax + cmpq %rdx, %rsi + cmovb %esi, %eax + ret +END (STRNLEN) +#endif diff --git a/sysdeps/x86_64/multiarch/wcsnlen-evex.S b/sysdeps/x86_64/multiarch/wcsnlen-evex.S index e2aad94c1e..57a7e93fbf 100644 --- a/sysdeps/x86_64/multiarch/wcsnlen-evex.S +++ b/sysdeps/x86_64/multiarch/wcsnlen-evex.S @@ -2,8 +2,7 @@ # define WCSNLEN __wcsnlen_evex #endif -#define STRLEN WCSNLEN +#define STRNLEN WCSNLEN #define USE_AS_WCSLEN 1 -#define USE_AS_STRNLEN 1 -#include "strlen-evex.S" +#include "strnlen-evex.S" From patchwork Tue Oct 18 23:19:35 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Noah Goldstein X-Patchwork-Id: 1691736 Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@legolas.ozlabs.org 
From: Noah Goldstein
To: libc-alpha@sourceware.org
Subject: [PATCH v2 4/7] x86: Optimize memrchr-evex.S
Date: Tue, 18 Oct 2022 16:19:35 -0700
Message-Id: <20221018231938.3621554-4-goldstein.w.n@gmail.com>
In-Reply-To: <20221018231938.3621554-1-goldstein.w.n@gmail.com>
References: <20221018024901.3381469-1-goldstein.w.n@gmail.com> <20221018231938.3621554-1-goldstein.w.n@gmail.com>

Optimizations are:
1. Use the fact that lzcnt(0) -> VEC_SIZE for memchr to save a branch
   in the short-string case.
2. Save several instructions in the len = [VEC_SIZE, 4 * VEC_SIZE] case.
3. Use more code-size-efficient instructions.
	- tzcnt ...     -> bsf ...
	- vpcmpb $0 ... -> vpcmpeq ...

Code Size Changes:
memrchr-evex.S	: -29 bytes

Net perf changes:
Reported as geometric mean of all improvements / regressions from N=10
runs of the benchtests. Value is New Time / Old Time, so < 1.0 is an
improvement and > 1.0 is a regression.

memrchr-evex.S	: 0.949 (mostly from improvements in small strings)

Full results attached in email.
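Optimization (1) above can be modeled in scalar C. This is an illustrative sketch, not the patch's code: the names `lzcnt32` and `tail_match` are invented here, and 32 stands in for VEC_SIZE. The point is that hardware LZCNT (unlike BSR) is defined on a zero input and returns the operand width, so the single comparison `len <= lzcnt(mask)` covers both "no match at all" and "match past the end of the buffer", saving a separate branch on `mask == 0`.

```c
#include <assert.h>
#include <stdint.h>

/* Model of the hardware LZCNT instruction: lzcnt(0) is defined and
   returns the operand width (32 here; VEC_SIZE in the patch).  */
static unsigned lzcnt32(uint32_t x)
{
    return x ? (unsigned)__builtin_clz(x) : 32;
}

/* Sketch of the short-string tail check for a backwards scan:
   `mask` has one bit per byte of the final vector, bit i set iff
   byte i matched, and the buffer covers the last `len` bytes of the
   vector.  Because lzcnt32(0) == 32 >= any valid len, the single
   `len <= lz` test handles both a zero mask and an out-of-bounds
   match.  Returns the byte index of the last in-bounds match, or -1
   if there is none.  */
static int tail_match(uint32_t mask, unsigned len)
{
    unsigned lz = lzcnt32(mask);
    if (len <= lz)
        return -1;              /* zero mask OR match out of bounds */
    return (int)(31 - lz);      /* highest in-bounds set bit */
}
```

The branch saved is exactly the `test %ecx, %ecx; jz ...` that a BSR-based version would need before it could safely take the bit scan.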
Full check passes on x86-64. --- sysdeps/x86_64/multiarch/memrchr-evex.S | 538 ++++++++++++++---------- 1 file changed, 324 insertions(+), 214 deletions(-) diff --git a/sysdeps/x86_64/multiarch/memrchr-evex.S b/sysdeps/x86_64/multiarch/memrchr-evex.S index 550b328c5a..dbcf52808f 100644 --- a/sysdeps/x86_64/multiarch/memrchr-evex.S +++ b/sysdeps/x86_64/multiarch/memrchr-evex.S @@ -21,17 +21,19 @@ #if ISA_SHOULD_BUILD (4) # include -# include "x86-evex256-vecs.h" -# if VEC_SIZE != 32 -# error "VEC_SIZE != 32 unimplemented" + +# ifndef VEC_SIZE +# include "x86-evex256-vecs.h" # endif +# include "reg-macros.h" + # ifndef MEMRCHR -# define MEMRCHR __memrchr_evex +# define MEMRCHR __memrchr_evex # endif -# define PAGE_SIZE 4096 -# define VMMMATCH VMM(0) +# define PAGE_SIZE 4096 +# define VMATCH VMM(0) .section SECTION(.text), "ax", @progbits ENTRY_P2ALIGN(MEMRCHR, 6) @@ -43,294 +45,402 @@ ENTRY_P2ALIGN(MEMRCHR, 6) # endif jz L(zero_0) - /* Get end pointer. Minus one for two reasons. 1) It is necessary for a - correct page cross check and 2) it correctly sets up end ptr to be - subtract by lzcnt aligned. */ + /* Get end pointer. Minus one for three reasons. 1) It is + necessary for a correct page cross check and 2) it correctly + sets up end ptr to be subtract by lzcnt aligned. 3) it is a + necessary step in aligning ptr. */ leaq -1(%rdi, %rdx), %rax - vpbroadcastb %esi, %VMMMATCH + vpbroadcastb %esi, %VMATCH /* Check if we can load 1x VEC without cross a page. */ testl $(PAGE_SIZE - VEC_SIZE), %eax jz L(page_cross) - /* Don't use rax for pointer here because EVEX has better encoding with - offset % VEC_SIZE == 0. */ - vpcmpb $0, -(VEC_SIZE)(%rdi, %rdx), %VMMMATCH, %k0 - kmovd %k0, %ecx - - /* Fall through for rdx (len) <= VEC_SIZE (expect small sizes). */ - cmpq $VEC_SIZE, %rdx - ja L(more_1x_vec) -L(ret_vec_x0_test): - - /* If ecx is zero (no matches) lzcnt will set it 32 (VEC_SIZE) which - will guarantee edx (len) is less than it. 
*/ - lzcntl %ecx, %ecx - cmpl %ecx, %edx - jle L(zero_0) - subq %rcx, %rax + /* Don't use rax for pointer here because EVEX has better + encoding with offset % VEC_SIZE == 0. */ + vpcmpeqb (VEC_SIZE * -1)(%rdi, %rdx), %VMATCH, %k0 + KMOV %k0, %VRCX + + /* If rcx is zero then lzcnt -> VEC_SIZE. NB: there is a + already a dependency between rcx and rsi so no worries about + false-dep here. */ + lzcnt %VRCX, %VRSI + /* If rdx <= rsi then either 1) rcx was non-zero (there was a + match) but it was out of bounds or 2) rcx was zero and rdx + was <= VEC_SIZE so we are done scanning. */ + cmpq %rsi, %rdx + /* NB: Use branch to return zero/non-zero. Common usage will + branch on result of function (if return is null/non-null). + This branch can be used to predict the ensuing one so there + is no reason to extend the data-dependency with cmovcc. */ + jbe L(zero_0) + + /* If rcx is zero then len must be > RDX, otherwise since we + already tested len vs lzcnt(rcx) (in rsi) we are good to + return this match. */ + test %VRCX, %VRCX + jz L(more_1x_vec) + subq %rsi, %rax ret - /* Fits in aligning bytes of first cache line. */ + /* Fits in aligning bytes of first cache line for VEC_SIZE == + 32. */ +# if VEC_SIZE == 32 + .p2align 4,, 2 L(zero_0): xorl %eax, %eax ret - - .p2align 4,, 9 -L(ret_vec_x0_dec): - decq %rax -L(ret_vec_x0): - lzcntl %ecx, %ecx - subq %rcx, %rax - ret +# endif .p2align 4,, 10 L(more_1x_vec): - testl %ecx, %ecx - jnz L(ret_vec_x0) - /* Align rax (pointer to string). */ andq $-VEC_SIZE, %rax - +L(page_cross_continue): /* Recompute length after aligning. */ - movq %rax, %rdx + subq %rdi, %rax - /* Need no matter what. */ - vpcmpb $0, -(VEC_SIZE)(%rax), %VMMMATCH, %k0 - kmovd %k0, %ecx - - subq %rdi, %rdx - - cmpq $(VEC_SIZE * 2), %rdx + cmpq $(VEC_SIZE * 2), %rax ja L(more_2x_vec) + L(last_2x_vec): + vpcmpeqb (VEC_SIZE * -1)(%rdi, %rax), %VMATCH, %k0 + KMOV %k0, %VRCX - /* Must dec rax because L(ret_vec_x0_test) expects it. 
*/ - decq %rax - cmpl $VEC_SIZE, %edx - jbe L(ret_vec_x0_test) + test %VRCX, %VRCX + jnz L(ret_vec_x0_test) - testl %ecx, %ecx - jnz L(ret_vec_x0) + /* If VEC_SIZE == 64 need to subtract because lzcntq won't + implicitly add VEC_SIZE to match position. */ +# if VEC_SIZE == 64 + subl $VEC_SIZE, %eax +# else + cmpb $VEC_SIZE, %al +# endif + jle L(zero_2) - /* Don't use rax for pointer here because EVEX has better encoding with - offset % VEC_SIZE == 0. */ - vpcmpb $0, -(VEC_SIZE * 2)(%rdi, %rdx), %VMMMATCH, %k0 - kmovd %k0, %ecx - /* NB: 64-bit lzcnt. This will naturally add 32 to position. */ + /* We adjusted rax (length) for VEC_SIZE == 64 so need seperate + offsets. */ +# if VEC_SIZE == 64 + vpcmpeqb (VEC_SIZE * -1)(%rdi, %rax), %VMATCH, %k0 +# else + vpcmpeqb (VEC_SIZE * -2)(%rdi, %rax), %VMATCH, %k0 +# endif + KMOV %k0, %VRCX + /* NB: 64-bit lzcnt. This will naturally add 32 to position for + VEC_SIZE == 32. */ lzcntq %rcx, %rcx - cmpl %ecx, %edx - jle L(zero_0) - subq %rcx, %rax - ret - - /* Inexpensive place to put this regarding code size / target alignments - / ICache NLP. Necessary for 2-byte encoding of jump to page cross - case which in turn is necessary for hot path (len <= VEC_SIZE) to fit - in first cache line. */ -L(page_cross): - movq %rax, %rsi - andq $-VEC_SIZE, %rsi - vpcmpb $0, (%rsi), %VMMMATCH, %k0 - kmovd %k0, %r8d - /* Shift out negative alignment (because we are starting from endptr and - working backwards). */ - movl %eax, %ecx - /* notl because eax already has endptr - 1. (-x = ~(x - 1)). */ - notl %ecx - shlxl %ecx, %r8d, %ecx - cmpq %rdi, %rsi - ja L(more_1x_vec) - lzcntl %ecx, %ecx - cmpl %ecx, %edx - jle L(zero_1) - subq %rcx, %rax + subl %ecx, %eax + ja L(first_vec_x1_ret) + /* If VEC_SIZE == 64 put L(zero_0) here as we can't fit in the + first cache line (this is the second cache line). 
*/ +# if VEC_SIZE == 64 +L(zero_0): +# endif +L(zero_2): + xorl %eax, %eax ret - /* Continue creating zero labels that fit in aligning bytes and get - 2-byte encoding / are in the same cache line as condition. */ -L(zero_1): - xorl %eax, %eax + /* NB: Fits in aligning bytes before next cache line for + VEC_SIZE == 32. For VEC_SIZE == 64 this is attached to + L(first_vec_x0_test). */ +# if VEC_SIZE == 32 +L(first_vec_x1_ret): + leaq -1(%rdi, %rax), %rax ret +# endif - .p2align 4,, 8 -L(ret_vec_x1): - /* This will naturally add 32 to position. */ - bsrl %ecx, %ecx - leaq -(VEC_SIZE * 2)(%rcx, %rax), %rax + .p2align 4,, 6 +L(ret_vec_x0_test): + lzcnt %VRCX, %VRCX + subl %ecx, %eax + jle L(zero_2) +# if VEC_SIZE == 64 + /* Reuse code at the end of L(ret_vec_x0_test) as we can't fit + L(first_vec_x1_ret) in the same cache line as its jmp base + so we might as well save code size. */ +L(first_vec_x1_ret): +# endif + leaq -1(%rdi, %rax), %rax ret - .p2align 4,, 8 + .p2align 4,, 6 +L(loop_last_4x_vec): + /* Compute remaining length. */ + subl %edi, %eax +L(last_4x_vec): + cmpl $(VEC_SIZE * 2), %eax + jle L(last_2x_vec) +# if VEC_SIZE == 32 + /* Only align for VEC_SIZE == 32. For VEC_SIZE == 64 we need + the spare bytes to align the loop properly. */ + .p2align 4,, 10 +# endif L(more_2x_vec): - testl %ecx, %ecx - jnz L(ret_vec_x0_dec) - vpcmpb $0, -(VEC_SIZE * 2)(%rax), %VMMMATCH, %k0 - kmovd %k0, %ecx - testl %ecx, %ecx - jnz L(ret_vec_x1) + /* Length > VEC_SIZE * 2 so check the first 2x VEC for match and + return if either hit. */ + vpcmpeqb (VEC_SIZE * -1)(%rdi, %rax), %VMATCH, %k0 + KMOV %k0, %VRCX + + test %VRCX, %VRCX + jnz L(first_vec_x0) + + vpcmpeqb (VEC_SIZE * -2)(%rdi, %rax), %VMATCH, %k0 + KMOV %k0, %VRCX + test %VRCX, %VRCX + jnz L(first_vec_x1) /* Need no matter what. 
*/ - vpcmpb $0, -(VEC_SIZE * 3)(%rax), %VMMMATCH, %k0 - kmovd %k0, %ecx + vpcmpeqb (VEC_SIZE * -3)(%rdi, %rax), %VMATCH, %k0 + KMOV %k0, %VRCX - subq $(VEC_SIZE * 4), %rdx + /* Check if we are near the end. */ + subq $(VEC_SIZE * 4), %rax ja L(more_4x_vec) - cmpl $(VEC_SIZE * -1), %edx - jle L(ret_vec_x2_test) -L(last_vec): - testl %ecx, %ecx - jnz L(ret_vec_x2) + test %VRCX, %VRCX + jnz L(first_vec_x2_test) + /* Adjust length for final check and check if we are at the end. + */ + addl $(VEC_SIZE * 1), %eax + jle L(zero_1) - /* Need no matter what. */ - vpcmpb $0, -(VEC_SIZE * 4)(%rax), %VMMMATCH, %k0 - kmovd %k0, %ecx - lzcntl %ecx, %ecx - subq $(VEC_SIZE * 3 + 1), %rax - subq %rcx, %rax - cmpq %rax, %rdi - ja L(zero_1) + vpcmpeqb (VEC_SIZE * -1)(%rdi, %rax), %VMATCH, %k0 + KMOV %k0, %VRCX + + lzcnt %VRCX, %VRCX + subl %ecx, %eax + ja L(first_vec_x3_ret) +L(zero_1): + xorl %eax, %eax + ret +L(first_vec_x3_ret): + leaq -1(%rdi, %rax), %rax ret - .p2align 4,, 8 -L(ret_vec_x2_test): - lzcntl %ecx, %ecx - subq $(VEC_SIZE * 2 + 1), %rax - subq %rcx, %rax - cmpq %rax, %rdi - ja L(zero_1) + .p2align 4,, 6 +L(first_vec_x2_test): + /* Must adjust length before check. */ + subl $-(VEC_SIZE * 2 - 1), %eax + lzcnt %VRCX, %VRCX + subl %ecx, %eax + jl L(zero_4) + addq %rdi, %rax ret - .p2align 4,, 8 -L(ret_vec_x2): - bsrl %ecx, %ecx - leaq -(VEC_SIZE * 3)(%rcx, %rax), %rax + + .p2align 4,, 10 +L(first_vec_x0): + bsr %VRCX, %VRCX + leaq (VEC_SIZE * -1)(%rdi, %rax), %rax + addq %rcx, %rax ret - .p2align 4,, 8 -L(ret_vec_x3): - bsrl %ecx, %ecx - leaq -(VEC_SIZE * 4)(%rcx, %rax), %rax + /* Fits unobtrusively here. 
*/ +L(zero_4): + xorl %eax, %eax + ret + + .p2align 4,, 10 +L(first_vec_x1): + bsr %VRCX, %VRCX + leaq (VEC_SIZE * -2)(%rdi, %rax), %rax + addq %rcx, %rax ret .p2align 4,, 8 +L(first_vec_x3): + bsr %VRCX, %VRCX + addq %rdi, %rax + addq %rcx, %rax + ret + + .p2align 4,, 6 +L(first_vec_x2): + bsr %VRCX, %VRCX + leaq (VEC_SIZE * 1)(%rdi, %rax), %rax + addq %rcx, %rax + ret + + .p2align 4,, 2 L(more_4x_vec): - testl %ecx, %ecx - jnz L(ret_vec_x2) + test %VRCX, %VRCX + jnz L(first_vec_x2) - vpcmpb $0, -(VEC_SIZE * 4)(%rax), %VMMMATCH, %k0 - kmovd %k0, %ecx + vpcmpeqb (%rdi, %rax), %VMATCH, %k0 + KMOV %k0, %VRCX - testl %ecx, %ecx - jnz L(ret_vec_x3) + test %VRCX, %VRCX + jnz L(first_vec_x3) /* Check if near end before re-aligning (otherwise might do an unnecessary loop iteration). */ - addq $-(VEC_SIZE * 4), %rax - cmpq $(VEC_SIZE * 4), %rdx + cmpq $(VEC_SIZE * 4), %rax jbe L(last_4x_vec) - decq %rax - andq $-(VEC_SIZE * 4), %rax - movq %rdi, %rdx - /* Get endptr for loop in rdx. NB: Can't just do while rax > rdi because - lengths that overflow can be valid and break the comparison. */ - andq $-(VEC_SIZE * 4), %rdx + + /* NB: We setup the loop to NOT use index-address-mode for the + buffer. This costs some instructions & code size but avoids + stalls due to unlaminated micro-fused instructions (as used + in the loop) from being forced to issue in the same group + (essentially narrowing the backend width). */ + + /* Get endptr for loop in rdx. NB: Can't just do while rax > rdi + because lengths that overflow can be valid and break the + comparison. */ +# if VEC_SIZE == 64 + /* Use rdx as intermediate to compute rax, this gets us imm8 + encoding which just allows the L(more_4x_vec) block to fit + in 1 cache-line. */ + leaq (VEC_SIZE * 4)(%rdi), %rdx + leaq (VEC_SIZE * -1)(%rdx, %rax), %rax + + /* No evex machine has partial register stalls. This can be + replaced with: `andq $(VEC_SIZE * -4), %rax/%rdx` if that + changes. 
*/ + xorb %al, %al + xorb %dl, %dl +# else + leaq (VEC_SIZE * 3)(%rdi, %rax), %rax + andq $(VEC_SIZE * -4), %rax + leaq (VEC_SIZE * 4)(%rdi), %rdx + andq $(VEC_SIZE * -4), %rdx +# endif + .p2align 4 L(loop_4x_vec): - /* Store 1 were not-equals and 0 where equals in k1 (used to mask later - on). */ - vpcmpb $4, (VEC_SIZE * 3)(%rax), %VMMMATCH, %k1 + /* NB: We could do the same optimization here as we do for + memchr/rawmemchr by using VEX encoding in the loop for access + to VEX vpcmpeqb + vpternlogd. Since memrchr is not as hot as + memchr it may not be worth the extra code size, but if the + need arises it an easy ~15% perf improvement to the loop. */ + + cmpq %rdx, %rax + je L(loop_last_4x_vec) + /* Store 1 were not-equals and 0 where equals in k1 (used to + mask later on). */ + vpcmpb $4, (VEC_SIZE * -1)(%rax), %VMATCH, %k1 /* VEC(2/3) will have zero-byte where we found a CHAR. */ - vpxorq (VEC_SIZE * 2)(%rax), %VMMMATCH, %VMM(2) - vpxorq (VEC_SIZE * 1)(%rax), %VMMMATCH, %VMM(3) - vpcmpb $0, (VEC_SIZE * 0)(%rax), %VMMMATCH, %k4 + vpxorq (VEC_SIZE * -2)(%rax), %VMATCH, %VMM(2) + vpxorq (VEC_SIZE * -3)(%rax), %VMATCH, %VMM(3) + vpcmpeqb (VEC_SIZE * -4)(%rax), %VMATCH, %k4 - /* Combine VEC(2/3) with min and maskz with k1 (k1 has zero bit where - CHAR is found and VEC(2/3) have zero-byte where CHAR is found. */ + /* Combine VEC(2/3) with min and maskz with k1 (k1 has zero bit + where CHAR is found and VEC(2/3) have zero-byte where CHAR + is found. */ vpminub %VMM(2), %VMM(3), %VMM(3){%k1}{z} vptestnmb %VMM(3), %VMM(3), %k2 - /* Any 1s and we found CHAR. */ - kortestd %k2, %k4 - jnz L(loop_end) - addq $-(VEC_SIZE * 4), %rax - cmpq %rdx, %rax - jne L(loop_4x_vec) - /* Need to re-adjust rdx / rax for L(last_4x_vec). */ - subq $-(VEC_SIZE * 4), %rdx - movq %rdx, %rax - subl %edi, %edx -L(last_4x_vec): + /* Any 1s and we found CHAR. */ + KORTEST %k2, %k4 + jz L(loop_4x_vec) + - /* Used no matter what. 
*/ - vpcmpb $0, (VEC_SIZE * -1)(%rax), %VMMMATCH, %k0 - kmovd %k0, %ecx + /* K1 has non-matches for first VEC. inc; jz will overflow rcx + iff all bytes where non-matches. */ + KMOV %k1, %VRCX + inc %VRCX + jnz L(first_vec_x0_end) - cmpl $(VEC_SIZE * 2), %edx - jbe L(last_2x_vec) + vptestnmb %VMM(2), %VMM(2), %k0 + KMOV %k0, %VRCX + test %VRCX, %VRCX + jnz L(first_vec_x1_end) + KMOV %k2, %VRCX + + /* Seperate logic for VEC_SIZE == 64 and VEC_SIZE == 32 for + returning last 2x VEC. For VEC_SIZE == 64 we test each VEC + individually, for VEC_SIZE == 32 we combine them in a single + 64-bit GPR. */ +# if VEC_SIZE == 64 + test %VRCX, %VRCX + jnz L(first_vec_x2_end) + KMOV %k4, %VRCX +# else + /* Combine last 2 VEC matches for VEC_SIZE == 32. If rcx (from + VEC(3)) is zero (no CHAR in VEC(3)) then it won't affect the + result in rsi (from VEC(4)). If rcx is non-zero then CHAR in + VEC(3) and bsrq will use that position. */ + KMOV %k4, %VRSI + salq $32, %rcx + orq %rsi, %rcx +# endif + bsrq %rcx, %rcx + addq %rcx, %rax + ret - testl %ecx, %ecx - jnz L(ret_vec_x0_dec) + .p2align 4,, 4 +L(first_vec_x0_end): + /* rcx has 1s at non-matches so we need to `not` it. We used + `inc` to test if zero so use `neg` to complete the `not` so + the last 1 bit represent a match. NB: (-x + 1 == ~x). */ + neg %VRCX + bsr %VRCX, %VRCX + leaq (VEC_SIZE * 3)(%rcx, %rax), %rax + ret + .p2align 4,, 10 +L(first_vec_x1_end): + bsr %VRCX, %VRCX + leaq (VEC_SIZE * 2)(%rcx, %rax), %rax + ret - vpcmpb $0, (VEC_SIZE * -2)(%rax), %VMMMATCH, %k0 - kmovd %k0, %ecx +# if VEC_SIZE == 64 + /* Since we can't combine the last 2x VEC for VEC_SIZE == 64 + need return label for it. */ + .p2align 4,, 4 +L(first_vec_x2_end): + bsr %VRCX, %VRCX + leaq (VEC_SIZE * 1)(%rcx, %rax), %rax + ret +# endif - testl %ecx, %ecx - jnz L(ret_vec_x1) - /* Used no matter what. 
*/ - vpcmpb $0, (VEC_SIZE * -3)(%rax), %VMMMATCH, %k0 - kmovd %k0, %ecx + .p2align 4,, 4 +L(page_cross): + /* only lower bits of eax[log2(VEC_SIZE):0] are set so we can + use movzbl to get the amount of bytes we are checking here. + */ + movzbl %al, %ecx + andq $-VEC_SIZE, %rax + vpcmpeqb (%rax), %VMATCH, %k0 + KMOV %k0, %VRSI - cmpl $(VEC_SIZE * 3), %edx - ja L(last_vec) + /* eax was comptued as %rdi + %rdx - 1 so need to add back 1 + here. */ + leal 1(%rcx), %r8d - lzcntl %ecx, %ecx - subq $(VEC_SIZE * 2 + 1), %rax - subq %rcx, %rax - cmpq %rax, %rdi - jbe L(ret_1) + /* Invert ecx to get shift count for byte matches out of range. + */ + notl %ecx + shlx %VRCX, %VRSI, %VRSI + + /* if r8 < rdx then the entire [buf, buf + len] is handled in + the page cross case. NB: we can't use the trick here we use + in the non page-cross case because we aren't checking full + VEC_SIZE. */ + cmpq %r8, %rdx + ja L(page_cross_check) + lzcnt %VRSI, %VRSI + subl %esi, %edx + ja L(page_cross_ret) xorl %eax, %eax -L(ret_1): ret - .p2align 4,, 6 -L(loop_end): - kmovd %k1, %ecx - notl %ecx - testl %ecx, %ecx - jnz L(ret_vec_x0_end) +L(page_cross_check): + test %VRSI, %VRSI + jz L(page_cross_continue) - vptestnmb %VMM(2), %VMM(2), %k0 - kmovd %k0, %ecx - testl %ecx, %ecx - jnz L(ret_vec_x1_end) - - kmovd %k2, %ecx - kmovd %k4, %esi - /* Combine last 2 VEC matches. If ecx (VEC3) is zero (no CHAR in VEC3) - then it won't affect the result in esi (VEC4). If ecx is non-zero - then CHAR in VEC3 and bsrq will use that position. 
*/ - salq $32, %rcx - orq %rsi, %rcx - bsrq %rcx, %rcx - addq %rcx, %rax - ret - .p2align 4,, 4 -L(ret_vec_x0_end): - addq $(VEC_SIZE), %rax -L(ret_vec_x1_end): - bsrl %ecx, %ecx - leaq (VEC_SIZE * 2)(%rax, %rcx), %rax + lzcnt %VRSI, %VRSI + subl %esi, %edx +L(page_cross_ret): + leaq -1(%rdi, %rdx), %rax ret - END(MEMRCHR) #endif From patchwork Tue Oct 18 23:19:36 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Noah Goldstein X-Patchwork-Id: 1691741 Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@legolas.ozlabs.org Authentication-Results: legolas.ozlabs.org; spf=pass (sender SPF authorized) smtp.mailfrom=sourceware.org (client-ip=2620:52:3:1:0:246e:9693:128c; helo=sourceware.org; envelope-from=libc-alpha-bounces+incoming=patchwork.ozlabs.org@sourceware.org; receiver=) Authentication-Results: legolas.ozlabs.org; dkim=pass (1024-bit key; secure) header.d=sourceware.org header.i=@sourceware.org header.a=rsa-sha256 header.s=default header.b=ebb7ZBK7; dkim-atps=neutral Received: from sourceware.org (server2.sourceware.org [IPv6:2620:52:3:1:0:246e:9693:128c]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature ECDSA (P-384) server-digest SHA384) (No client certificate requested) by legolas.ozlabs.org (Postfix) with ESMTPS id 4MsVHg2rBQz23jk for ; Wed, 19 Oct 2022 10:21:51 +1100 (AEDT) Received: from server2.sourceware.org (localhost [IPv6:::1]) by sourceware.org (Postfix) with ESMTP id 4FA3D385701A for ; Tue, 18 Oct 2022 23:21:49 +0000 (GMT) DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org 4FA3D385701A DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=sourceware.org; s=default; t=1666135309; bh=oC01aOJqUCRtwCibJyVGnkt6cpo8JFfWkb5OftF4yO0=; h=To:Subject:Date:In-Reply-To:References:List-Id:List-Unsubscribe: List-Archive:List-Post:List-Help:List-Subscribe:From:Reply-To: From; 
From: Noah Goldstein
To: libc-alpha@sourceware.org
Subject: [PATCH v2 5/7] x86: Optimize strrchr-evex.S and implement with VMM headers
Date: Tue, 18 Oct 2022 16:19:36 -0700
Message-Id: <20221018231938.3621554-5-goldstein.w.n@gmail.com>
In-Reply-To: <20221018231938.3621554-1-goldstein.w.n@gmail.com>
References: <20221018024901.3381469-1-goldstein.w.n@gmail.com> <20221018231938.3621554-1-goldstein.w.n@gmail.com>

Optimization is:
1. Cache the latest result in the "fast path" loop with `vmovdqu`
   instead of `kunpckdq`. This helps if there is more than one match.

Code Size Changes:
strrchr-evex.S	: +30 bytes (same number of cache lines)

Net perf changes:
Reported as geometric mean of all improvements / regressions from N=10
runs of the benchtests. Value is New Time / Old Time, so < 1.0 is an
improvement and > 1.0 is a regression.

strrchr-evex.S	: 0.932 (from cases with higher match frequency)

Full results attached in email.

Full check passes on x86-64.
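The caching idea in (1) can be sketched as a scalar C model. This is illustrative only, not the assembly: `model_strrchr` and `BLOCK` are invented names, `BLOCK` stands in for VEC_SIZE, and the model assumes the search character is not NUL. The hot loop only remembers *which* block most recently held a match (as `vmovdqu` caches the whole vector), rather than merging match masks on every iteration (as `kunpckdq` did); the exact last-match position is computed once, on the cold path after the terminator is found.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BLOCK 8  /* stand-in for VEC_SIZE */

/* Scalar model of the strrchr fast-path loop.  Assumes c != '\0'. */
static const char *model_strrchr(const char *s, char c)
{
    const char *last_block = NULL;  /* most recent block with a match */
    size_t i = 0;

    for (;;) {
        int has_nul = 0, has_match = 0;
        for (size_t j = 0; j < BLOCK; j++) {
            if (s[i + j] == c)
                has_match = 1;
            if (s[i + j] == '\0') { has_nul = 1; break; }
        }
        if (has_match)
            last_block = s + i;     /* cheap: cache, don't locate */
        if (has_nul)
            break;
        i += BLOCK;
    }

    if (last_block == NULL)
        return NULL;
    /* Cold path: locate the exact last match inside the cached block,
       stopping at the terminator.  Runs once per call, not per
       iteration.  */
    const char *best = NULL;
    for (size_t j = 0; j < BLOCK; j++) {
        if (last_block[j] == '\0')
            break;
        if (last_block[j] == c)
            best = last_block + j;
    }
    return best;
}
```

The trade-off the commit message describes falls out of this shape: when matches are frequent, the per-iteration work shrinks to a store, and only the final block pays for an exact position computation.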
--- sysdeps/x86_64/multiarch/strrchr-evex.S | 371 +++++++++++++----------- 1 file changed, 200 insertions(+), 171 deletions(-) diff --git a/sysdeps/x86_64/multiarch/strrchr-evex.S b/sysdeps/x86_64/multiarch/strrchr-evex.S index 992b45fb47..45487dc87a 100644 --- a/sysdeps/x86_64/multiarch/strrchr-evex.S +++ b/sysdeps/x86_64/multiarch/strrchr-evex.S @@ -26,25 +26,30 @@ # define STRRCHR __strrchr_evex # endif -# define VMOVU vmovdqu64 -# define VMOVA vmovdqa64 +# include "x86-evex256-vecs.h" # ifdef USE_AS_WCSRCHR -# define SHIFT_REG esi - -# define kunpck kunpckbw +# define RCX_M cl +# define SHIFT_REG rcx +# define VPCOMPRESS vpcompressd +# define kunpck_2x kunpckbw # define kmov_2x kmovd # define maskz_2x ecx # define maskm_2x eax # define CHAR_SIZE 4 # define VPMIN vpminud # define VPTESTN vptestnmd +# define VPTEST vptestmd # define VPBROADCAST vpbroadcastd +# define VPCMPEQ vpcmpeqd # define VPCMP vpcmpd -# else -# define SHIFT_REG edi -# define kunpck kunpckdq +# define USE_WIDE_CHAR +# else +# define RCX_M ecx +# define SHIFT_REG rdi +# define VPCOMPRESS vpcompressb +# define kunpck_2x kunpckdq # define kmov_2x kmovq # define maskz_2x rcx # define maskm_2x rax @@ -52,58 +57,48 @@ # define CHAR_SIZE 1 # define VPMIN vpminub # define VPTESTN vptestnmb +# define VPTEST vptestmb # define VPBROADCAST vpbroadcastb +# define VPCMPEQ vpcmpeqb # define VPCMP vpcmpb # endif -# define XMMZERO xmm16 -# define YMMZERO ymm16 -# define YMMMATCH ymm17 -# define YMMSAVE ymm18 +# include "reg-macros.h" -# define YMM1 ymm19 -# define YMM2 ymm20 -# define YMM3 ymm21 -# define YMM4 ymm22 -# define YMM5 ymm23 -# define YMM6 ymm24 -# define YMM7 ymm25 -# define YMM8 ymm26 - - -# define VEC_SIZE 32 +# define VMATCH VMM(0) +# define CHAR_PER_VEC (VEC_SIZE / CHAR_SIZE) # define PAGE_SIZE 4096 - .section .text.evex, "ax", @progbits -ENTRY(STRRCHR) + + .section SECTION(.text), "ax", @progbits +ENTRY_P2ALIGN(STRRCHR, 6) movl %edi, %eax - /* Broadcast CHAR to YMMMATCH. 
*/ - VPBROADCAST %esi, %YMMMATCH + /* Broadcast CHAR to VMATCH. */ + VPBROADCAST %esi, %VMATCH andl $(PAGE_SIZE - 1), %eax cmpl $(PAGE_SIZE - VEC_SIZE), %eax jg L(cross_page_boundary) -L(page_cross_continue): - VMOVU (%rdi), %YMM1 - /* k0 has a 1 for each zero CHAR in YMM1. */ - VPTESTN %YMM1, %YMM1, %k0 - kmovd %k0, %ecx - testl %ecx, %ecx + VMOVU (%rdi), %VMM(1) + /* k0 has a 1 for each zero CHAR in VEC(1). */ + VPTESTN %VMM(1), %VMM(1), %k0 + KMOV %k0, %VRSI + test %VRSI, %VRSI jz L(aligned_more) /* fallthrough: zero CHAR in first VEC. */ - - /* K1 has a 1 for each search CHAR match in YMM1. */ - VPCMP $0, %YMMMATCH, %YMM1, %k1 - kmovd %k1, %eax +L(page_cross_return): + /* K1 has a 1 for each search CHAR match in VEC(1). */ + VPCMPEQ %VMATCH, %VMM(1), %k1 + KMOV %k1, %VRAX /* Build mask up until first zero CHAR (used to mask of potential search CHAR matches past the end of the string). */ - blsmskl %ecx, %ecx - andl %ecx, %eax + blsmsk %VRSI, %VRSI + and %VRSI, %VRAX jz L(ret0) - /* Get last match (the `andl` removed any out of bounds - matches). */ - bsrl %eax, %eax + /* Get last match (the `and` removed any out of bounds matches). + */ + bsr %VRAX, %VRAX # ifdef USE_AS_WCSRCHR leaq (%rdi, %rax, CHAR_SIZE), %rax # else @@ -116,22 +111,22 @@ L(ret0): search path for earlier matches. */ .p2align 4,, 6 L(first_vec_x1): - VPCMP $0, %YMMMATCH, %YMM2, %k1 - kmovd %k1, %eax - blsmskl %ecx, %ecx + VPCMPEQ %VMATCH, %VMM(2), %k1 + KMOV %k1, %VRAX + blsmsk %VRCX, %VRCX /* eax non-zero if search CHAR in range. */ - andl %ecx, %eax + and %VRCX, %VRAX jnz L(first_vec_x1_return) - /* fallthrough: no match in YMM2 then need to check for earlier - matches (in YMM1). */ + /* fallthrough: no match in VEC(2) then need to check for + earlier matches (in VEC(1)). 
*/ .p2align 4,, 4 L(first_vec_x0_test): - VPCMP $0, %YMMMATCH, %YMM1, %k1 - kmovd %k1, %eax - testl %eax, %eax + VPCMPEQ %VMATCH, %VMM(1), %k1 + KMOV %k1, %VRAX + test %VRAX, %VRAX jz L(ret1) - bsrl %eax, %eax + bsr %VRAX, %VRAX # ifdef USE_AS_WCSRCHR leaq (%rsi, %rax, CHAR_SIZE), %rax # else @@ -142,129 +137,144 @@ L(ret1): .p2align 4,, 10 L(first_vec_x1_or_x2): - VPCMP $0, %YMM3, %YMMMATCH, %k3 - VPCMP $0, %YMM2, %YMMMATCH, %k2 + VPCMPEQ %VMM(3), %VMATCH, %k3 + VPCMPEQ %VMM(2), %VMATCH, %k2 /* K2 and K3 have 1 for any search CHAR match. Test if any - matches between either of them. Otherwise check YMM1. */ - kortestd %k2, %k3 + matches between either of them. Otherwise check VEC(1). */ + KORTEST %k2, %k3 jz L(first_vec_x0_test) - /* Guranteed that YMM2 and YMM3 are within range so merge the - two bitmasks then get last result. */ - kunpck %k2, %k3, %k3 - kmovq %k3, %rax - bsrq %rax, %rax - leaq (VEC_SIZE)(%r8, %rax, CHAR_SIZE), %rax + /* Guranteed that VEC(2) and VEC(3) are within range so merge + the two bitmasks then get last result. */ + kunpck_2x %k2, %k3, %k3 + kmov_2x %k3, %maskm_2x + bsr %maskm_2x, %maskm_2x + leaq (VEC_SIZE * 1)(%r8, %rax, CHAR_SIZE), %rax ret - .p2align 4,, 6 + .p2align 4,, 7 L(first_vec_x3): - VPCMP $0, %YMMMATCH, %YMM4, %k1 - kmovd %k1, %eax - blsmskl %ecx, %ecx - /* If no search CHAR match in range check YMM1/YMM2/YMM3. */ - andl %ecx, %eax + VPCMPEQ %VMATCH, %VMM(4), %k1 + KMOV %k1, %VRAX + blsmsk %VRCX, %VRCX + /* If no search CHAR match in range check VEC(1)/VEC(2)/VEC(3). + */ + and %VRCX, %VRAX jz L(first_vec_x1_or_x2) - bsrl %eax, %eax + bsr %VRAX, %VRAX leaq (VEC_SIZE * 3)(%rdi, %rax, CHAR_SIZE), %rax ret + .p2align 4,, 6 L(first_vec_x0_x1_test): - VPCMP $0, %YMMMATCH, %YMM2, %k1 - kmovd %k1, %eax - /* Check YMM2 for last match first. If no match try YMM1. */ - testl %eax, %eax + VPCMPEQ %VMATCH, %VMM(2), %k1 + KMOV %k1, %VRAX + /* Check VEC(2) for last match first. If no match try VEC(1). 
+ */ + test %VRAX, %VRAX jz L(first_vec_x0_test) .p2align 4,, 4 L(first_vec_x1_return): - bsrl %eax, %eax + bsr %VRAX, %VRAX leaq (VEC_SIZE)(%rdi, %rax, CHAR_SIZE), %rax ret + .p2align 4,, 10 L(first_vec_x2): - VPCMP $0, %YMMMATCH, %YMM3, %k1 - kmovd %k1, %eax - blsmskl %ecx, %ecx - /* Check YMM3 for last match first. If no match try YMM2/YMM1. - */ - andl %ecx, %eax + VPCMPEQ %VMATCH, %VMM(3), %k1 + KMOV %k1, %VRAX + blsmsk %VRCX, %VRCX + /* Check VEC(3) for last match first. If no match try + VEC(2)/VEC(1). */ + and %VRCX, %VRAX jz L(first_vec_x0_x1_test) - bsrl %eax, %eax + bsr %VRAX, %VRAX leaq (VEC_SIZE * 2)(%rdi, %rax, CHAR_SIZE), %rax ret - .p2align 4 + .p2align 4,, 12 L(aligned_more): - /* Need to keep original pointer incase YMM1 has last match. */ +L(page_cross_continue): + /* Need to keep original pointer incase VEC(1) has last match. + */ movq %rdi, %rsi andq $-VEC_SIZE, %rdi - VMOVU VEC_SIZE(%rdi), %YMM2 - VPTESTN %YMM2, %YMM2, %k0 - kmovd %k0, %ecx - testl %ecx, %ecx + + VMOVU VEC_SIZE(%rdi), %VMM(2) + VPTESTN %VMM(2), %VMM(2), %k0 + KMOV %k0, %VRCX + + test %VRCX, %VRCX jnz L(first_vec_x1) - VMOVU (VEC_SIZE * 2)(%rdi), %YMM3 - VPTESTN %YMM3, %YMM3, %k0 - kmovd %k0, %ecx - testl %ecx, %ecx + VMOVU (VEC_SIZE * 2)(%rdi), %VMM(3) + VPTESTN %VMM(3), %VMM(3), %k0 + KMOV %k0, %VRCX + + test %VRCX, %VRCX jnz L(first_vec_x2) - VMOVU (VEC_SIZE * 3)(%rdi), %YMM4 - VPTESTN %YMM4, %YMM4, %k0 - kmovd %k0, %ecx + VMOVU (VEC_SIZE * 3)(%rdi), %VMM(4) + VPTESTN %VMM(4), %VMM(4), %k0 + KMOV %k0, %VRCX movq %rdi, %r8 - testl %ecx, %ecx + test %VRCX, %VRCX jnz L(first_vec_x3) andq $-(VEC_SIZE * 2), %rdi - .p2align 4 + .p2align 4,, 10 L(first_aligned_loop): - /* Preserve YMM1, YMM2, YMM3, and YMM4 until we can gurantee - they don't store a match. */ - VMOVA (VEC_SIZE * 4)(%rdi), %YMM5 - VMOVA (VEC_SIZE * 5)(%rdi), %YMM6 + /* Preserve VEC(1), VEC(2), VEC(3), and VEC(4) until we can + gurantee they don't store a match. 
*/ + VMOVA (VEC_SIZE * 4)(%rdi), %VMM(5) + VMOVA (VEC_SIZE * 5)(%rdi), %VMM(6) - VPCMP $0, %YMM5, %YMMMATCH, %k2 - vpxord %YMM6, %YMMMATCH, %YMM7 + VPCMPEQ %VMM(5), %VMATCH, %k2 + vpxord %VMM(6), %VMATCH, %VMM(7) - VPMIN %YMM5, %YMM6, %YMM8 - VPMIN %YMM8, %YMM7, %YMM7 + VPMIN %VMM(5), %VMM(6), %VMM(8) + VPMIN %VMM(8), %VMM(7), %VMM(7) - VPTESTN %YMM7, %YMM7, %k1 + VPTESTN %VMM(7), %VMM(7), %k1 subq $(VEC_SIZE * -2), %rdi - kortestd %k1, %k2 + KORTEST %k1, %k2 jz L(first_aligned_loop) - VPCMP $0, %YMM6, %YMMMATCH, %k3 - VPTESTN %YMM8, %YMM8, %k1 - ktestd %k1, %k1 + VPCMPEQ %VMM(6), %VMATCH, %k3 + VPTESTN %VMM(8), %VMM(8), %k1 + + /* If k1 is zero, then we found a CHAR match but no null-term. + We can now safely throw out VEC1-4. */ + KTEST %k1, %k1 jz L(second_aligned_loop_prep) - kortestd %k2, %k3 + KORTEST %k2, %k3 jnz L(return_first_aligned_loop) + .p2align 4,, 6 L(first_vec_x1_or_x2_or_x3): - VPCMP $0, %YMM4, %YMMMATCH, %k4 - kmovd %k4, %eax - testl %eax, %eax + VPCMPEQ %VMM(4), %VMATCH, %k4 + KMOV %k4, %VRAX + bsr %VRAX, %VRAX jz L(first_vec_x1_or_x2) - bsrl %eax, %eax leaq (VEC_SIZE * 3)(%r8, %rax, CHAR_SIZE), %rax ret + .p2align 4,, 8 L(return_first_aligned_loop): - VPTESTN %YMM5, %YMM5, %k0 - kunpck %k0, %k1, %k0 + VPTESTN %VMM(5), %VMM(5), %k0 + + /* Combined results from VEC5/6. */ + kunpck_2x %k0, %k1, %k0 kmov_2x %k0, %maskz_2x blsmsk %maskz_2x, %maskz_2x - kunpck %k2, %k3, %k3 + kunpck_2x %k2, %k3, %k3 kmov_2x %k3, %maskm_2x and %maskz_2x, %maskm_2x jz L(first_vec_x1_or_x2_or_x3) @@ -280,47 +290,62 @@ L(return_first_aligned_loop): L(second_aligned_loop_prep): L(second_aligned_loop_set_furthest_match): movq %rdi, %rsi - kunpck %k2, %k3, %k4 - + /* Ideally we would safe k2/k3 but `kmov/kunpck` take uops on + port0 and have noticable overhead in the loop. 
*/ + VMOVA %VMM(5), %VMM(7) + VMOVA %VMM(6), %VMM(8) .p2align 4 L(second_aligned_loop): - VMOVU (VEC_SIZE * 4)(%rdi), %YMM1 - VMOVU (VEC_SIZE * 5)(%rdi), %YMM2 - - VPCMP $0, %YMM1, %YMMMATCH, %k2 - vpxord %YMM2, %YMMMATCH, %YMM3 + VMOVU (VEC_SIZE * 4)(%rdi), %VMM(5) + VMOVU (VEC_SIZE * 5)(%rdi), %VMM(6) + VPCMPEQ %VMM(5), %VMATCH, %k2 + vpxord %VMM(6), %VMATCH, %VMM(3) - VPMIN %YMM1, %YMM2, %YMM4 - VPMIN %YMM3, %YMM4, %YMM3 + VPMIN %VMM(5), %VMM(6), %VMM(4) + VPMIN %VMM(3), %VMM(4), %VMM(3) - VPTESTN %YMM3, %YMM3, %k1 + VPTESTN %VMM(3), %VMM(3), %k1 subq $(VEC_SIZE * -2), %rdi - kortestd %k1, %k2 + KORTEST %k1, %k2 jz L(second_aligned_loop) - - VPCMP $0, %YMM2, %YMMMATCH, %k3 - VPTESTN %YMM4, %YMM4, %k1 - ktestd %k1, %k1 + VPCMPEQ %VMM(6), %VMATCH, %k3 + VPTESTN %VMM(4), %VMM(4), %k1 + KTEST %k1, %k1 jz L(second_aligned_loop_set_furthest_match) - kortestd %k2, %k3 - /* branch here because there is a significant advantage interms - of output dependency chance in using edx. */ + /* branch here because we know we have a match in VEC7/8 but + might not in VEC5/6 so the latter is expected to be less + likely. */ + KORTEST %k2, %k3 jnz L(return_new_match) + L(return_old_match): - kmovq %k4, %rax - bsrq %rax, %rax - leaq (VEC_SIZE * 2)(%rsi, %rax, CHAR_SIZE), %rax + VPCMPEQ %VMM(8), %VMATCH, %k0 + KMOV %k0, %VRCX + bsr %VRCX, %VRCX + jnz L(return_old_match_ret) + + VPCMPEQ %VMM(7), %VMATCH, %k0 + KMOV %k0, %VRCX + bsr %VRCX, %VRCX + subq $VEC_SIZE, %rsi +L(return_old_match_ret): + leaq (VEC_SIZE * 3)(%rsi, %rcx, CHAR_SIZE), %rax ret + .p2align 4,, 10 L(return_new_match): - VPTESTN %YMM1, %YMM1, %k0 - kunpck %k0, %k1, %k0 + VPTESTN %VMM(5), %VMM(5), %k0 + + /* Combined results from VEC5/6. */ + kunpck_2x %k0, %k1, %k0 kmov_2x %k0, %maskz_2x blsmsk %maskz_2x, %maskz_2x - kunpck %k2, %k3, %k3 + kunpck_2x %k2, %k3, %k3 kmov_2x %k3, %maskm_2x + + /* Match at end was out-of-bounds so use last known match. 
*/ and %maskz_2x, %maskm_2x jz L(return_old_match) @@ -328,49 +353,53 @@ L(return_new_match): leaq (VEC_SIZE * 2)(%rdi, %rax, CHAR_SIZE), %rax ret + .p2align 4,, 4 L(cross_page_boundary): - /* eax contains all the page offset bits of src (rdi). `xor rdi, - rax` sets pointer will all page offset bits cleared so - offset of (PAGE_SIZE - VEC_SIZE) will get last aligned VEC - before page cross (guranteed to be safe to read). Doing this - as opposed to `movq %rdi, %rax; andq $-VEC_SIZE, %rax` saves - a bit of code size. */ xorq %rdi, %rax - VMOVU (PAGE_SIZE - VEC_SIZE)(%rax), %YMM1 - VPTESTN %YMM1, %YMM1, %k0 - kmovd %k0, %ecx + mov $-1, %VRDX + VMOVU (PAGE_SIZE - VEC_SIZE)(%rax), %VMM(6) + VPTESTN %VMM(6), %VMM(6), %k0 + KMOV %k0, %VRSI + +# ifdef USE_AS_WCSRCHR + movl %edi, %ecx + and $(VEC_SIZE - 1), %ecx + shrl $2, %ecx +# endif + shlx %VGPR(SHIFT_REG), %VRDX, %VRDX - /* Shift out zero CHAR matches that are before the begining of - src (rdi). */ # ifdef USE_AS_WCSRCHR - movl %edi, %esi - andl $(VEC_SIZE - 1), %esi - shrl $2, %esi + kmovb %edx, %k1 +# else + KMOV %VRDX, %k1 # endif - shrxl %SHIFT_REG, %ecx, %ecx - testl %ecx, %ecx + /* Need to adjust result to VEC(1) so it can be re-used by + L(return_vec_x0_test). The alternative is to collect VEC(1) + will a page cross load which is far more expensive. */ + VPCOMPRESS %VMM(6), %VMM(1){%k1}{z} + + /* We could technically just jmp back after the vpcompress but + it doesn't save any 16-byte blocks. */ + shrx %VGPR(SHIFT_REG), %VRSI, %VRSI + test %VRSI, %VRSI jz L(page_cross_continue) - /* Found zero CHAR so need to test for search CHAR. */ - VPCMP $0, %YMMMATCH, %YMM1, %k1 - kmovd %k1, %eax - /* Shift out search CHAR matches that are before the begining of - src (rdi). */ - shrxl %SHIFT_REG, %eax, %eax - - /* Check if any search CHAR match in range. */ - blsmskl %ecx, %ecx - andl %ecx, %eax - jz L(ret3) - bsrl %eax, %eax + /* Duplicate of return logic from ENTRY. 
Doesn't cause spill to + next cache line so might as well copy it here. */ + VPCMPEQ %VMATCH, %VMM(1), %k1 + KMOV %k1, %VRAX + blsmsk %VRSI, %VRSI + and %VRSI, %VRAX + jz L(ret_page_cross) + bsr %VRAX, %VRAX # ifdef USE_AS_WCSRCHR leaq (%rdi, %rax, CHAR_SIZE), %rax # else addq %rdi, %rax # endif -L(ret3): +L(ret_page_cross): ret - + /* 1 byte till next cache line. */ END(STRRCHR) #endif From patchwork Tue Oct 18 23:19:37 2022 X-Patchwork-Submitter: Noah Goldstein X-Patchwork-Id: 1691737
To: libc-alpha@sourceware.org Subject: [PATCH v2 6/7] x86: Add support for VEC_SIZE == 64 in strcmp-evex.S impl Date: Tue, 18 Oct 2022 16:19:37 -0700 Message-Id: <20221018231938.3621554-6-goldstein.w.n@gmail.com> X-Mailer: git-send-email 2.34.1 In-Reply-To: <20221018231938.3621554-1-goldstein.w.n@gmail.com> References: <20221018024901.3381469-1-goldstein.w.n@gmail.com> <20221018231938.3621554-1-goldstein.w.n@gmail.com> From: Noah Goldstein Unused at the moment, but evex512 strcmp, strncmp, strcasecmp{l}, and strncasecmp{l} functions can be added by including strcmp-evex.S with "x86-evex512-vecs.h" defined. In addition, save a bit of code size in a few places: 1. tzcnt ... -> bsf ... 2. vpcmp{b|d} $0 ... -> vpcmpeq{b|d} This saves a touch of code size but has minimal net effect. Full check passes on x86-64.
--- sysdeps/x86_64/multiarch/strcmp-evex.S | 676 ++++++++++++++++--------- 1 file changed, 430 insertions(+), 246 deletions(-) diff --git a/sysdeps/x86_64/multiarch/strcmp-evex.S b/sysdeps/x86_64/multiarch/strcmp-evex.S index e482d0167f..756a3bb8d6 100644 --- a/sysdeps/x86_64/multiarch/strcmp-evex.S +++ b/sysdeps/x86_64/multiarch/strcmp-evex.S @@ -20,6 +20,10 @@ #if ISA_SHOULD_BUILD (4) +# ifndef VEC_SIZE +# include "x86-evex256-vecs.h" +# endif + # define STRCMP_ISA _evex # include "strcmp-naming.h" @@ -35,41 +39,57 @@ # define PAGE_SIZE 4096 /* VEC_SIZE = Number of bytes in a ymm register. */ -# define VEC_SIZE 32 # define CHAR_PER_VEC (VEC_SIZE / SIZE_OF_CHAR) -# define VMOVU vmovdqu64 -# define VMOVA vmovdqa64 - # ifdef USE_AS_WCSCMP -# define TESTEQ subl $0xff, /* Compare packed dwords. */ # define VPCMP vpcmpd +# define VPCMPEQ vpcmpeqd # define VPMINU vpminud # define VPTESTM vptestmd # define VPTESTNM vptestnmd /* 1 dword char == 4 bytes. */ # define SIZE_OF_CHAR 4 + +# define TESTEQ sub $((1 << CHAR_PER_VEC) - 1), + +# define USE_WIDE_CHAR # else -# define TESTEQ incl /* Compare packed bytes. */ # define VPCMP vpcmpb +# define VPCMPEQ vpcmpeqb # define VPMINU vpminub # define VPTESTM vptestmb # define VPTESTNM vptestnmb /* 1 byte char == 1 byte. 
*/ # define SIZE_OF_CHAR 1 + +# define TESTEQ inc +# endif + +# include "reg-macros.h" + +# if VEC_SIZE == 64 +# define RODATA_SECTION rodata.cst64 +# else +# define RODATA_SECTION rodata.cst32 +# endif + +# if CHAR_PER_VEC == 64 +# define FALLTHROUGH_RETURN_OFFSET (VEC_SIZE * 3) +# else +# define FALLTHROUGH_RETURN_OFFSET (VEC_SIZE * 2) # endif # ifdef USE_AS_STRNCMP -# define LOOP_REG r9d +# define LOOP_REG VR9 # define LOOP_REG64 r9 # define OFFSET_REG8 r9b # define OFFSET_REG r9d # define OFFSET_REG64 r9 # else -# define LOOP_REG edx +# define LOOP_REG VRDX # define LOOP_REG64 rdx # define OFFSET_REG8 dl @@ -83,32 +103,6 @@ # define VEC_OFFSET (-VEC_SIZE) # endif -# define XMM0 xmm17 -# define XMM1 xmm18 - -# define XMM10 xmm27 -# define XMM11 xmm28 -# define XMM12 xmm29 -# define XMM13 xmm30 -# define XMM14 xmm31 - - -# define YMM0 ymm17 -# define YMM1 ymm18 -# define YMM2 ymm19 -# define YMM3 ymm20 -# define YMM4 ymm21 -# define YMM5 ymm22 -# define YMM6 ymm23 -# define YMM7 ymm24 -# define YMM8 ymm25 -# define YMM9 ymm26 -# define YMM10 ymm27 -# define YMM11 ymm28 -# define YMM12 ymm29 -# define YMM13 ymm30 -# define YMM14 ymm31 - # ifdef USE_AS_STRCASECMP_L # define BYTE_LOOP_REG OFFSET_REG # else @@ -125,61 +119,72 @@ # endif # endif -# define LCASE_MIN_YMM %YMM12 -# define LCASE_MAX_YMM %YMM13 -# define CASE_ADD_YMM %YMM14 +# define LCASE_MIN_V VMM(12) +# define LCASE_MAX_V VMM(13) +# define CASE_ADD_V VMM(14) -# define LCASE_MIN_XMM %XMM12 -# define LCASE_MAX_XMM %XMM13 -# define CASE_ADD_XMM %XMM14 +# if VEC_SIZE == 64 +# define LCASE_MIN_YMM VMM_256(12) +# define LCASE_MAX_YMM VMM_256(13) +# define CASE_ADD_YMM VMM_256(14) +# endif + +# define LCASE_MIN_XMM VMM_128(12) +# define LCASE_MAX_XMM VMM_128(13) +# define CASE_ADD_XMM VMM_128(14) /* NB: wcsncmp uses r11 but strcasecmp is never used in conjunction with wcscmp. 
*/ # define TOLOWER_BASE %r11 # ifdef USE_AS_STRCASECMP_L -# define _REG(x, y) x ## y -# define REG(x, y) _REG(x, y) -# define TOLOWER(reg1, reg2, ext) \ - vpsubb REG(LCASE_MIN_, ext), reg1, REG(%ext, 10); \ - vpsubb REG(LCASE_MIN_, ext), reg2, REG(%ext, 11); \ - vpcmpub $1, REG(LCASE_MAX_, ext), REG(%ext, 10), %k5; \ - vpcmpub $1, REG(LCASE_MAX_, ext), REG(%ext, 11), %k6; \ - vpaddb reg1, REG(CASE_ADD_, ext), reg1{%k5}; \ - vpaddb reg2, REG(CASE_ADD_, ext), reg2{%k6} - -# define TOLOWER_gpr(src, dst) movl (TOLOWER_BASE, src, 4), dst -# define TOLOWER_YMM(...) TOLOWER(__VA_ARGS__, YMM) -# define TOLOWER_XMM(...) TOLOWER(__VA_ARGS__, XMM) - -# define CMP_R1_R2(s1_reg, s2_reg, reg_out, ext) \ - TOLOWER (s1_reg, s2_reg, ext); \ - VPCMP $0, s1_reg, s2_reg, reg_out - -# define CMP_R1_S2(s1_reg, s2_mem, s2_reg, reg_out, ext) \ - VMOVU s2_mem, s2_reg; \ - CMP_R1_R2(s1_reg, s2_reg, reg_out, ext) - -# define CMP_R1_R2_YMM(...) CMP_R1_R2(__VA_ARGS__, YMM) -# define CMP_R1_R2_XMM(...) CMP_R1_R2(__VA_ARGS__, XMM) - -# define CMP_R1_S2_YMM(...) CMP_R1_S2(__VA_ARGS__, YMM) -# define CMP_R1_S2_XMM(...) CMP_R1_S2(__VA_ARGS__, XMM) +# define _REG(x, y) x ## y +# define REG(x, y) _REG(x, y) +# define TOLOWER(reg1, reg2, ext, vec_macro) \ + vpsubb %REG(LCASE_MIN_, ext), reg1, %vec_macro(10); \ + vpsubb %REG(LCASE_MIN_, ext), reg2, %vec_macro(11); \ + vpcmpub $1, %REG(LCASE_MAX_, ext), %vec_macro(10), %k5; \ + vpcmpub $1, %REG(LCASE_MAX_, ext), %vec_macro(11), %k6; \ + vpaddb reg1, %REG(CASE_ADD_, ext), reg1{%k5}; \ + vpaddb reg2, %REG(CASE_ADD_, ext), reg2{%k6} + +# define TOLOWER_gpr(src, dst) movl (TOLOWER_BASE, src, 4), dst +# define TOLOWER_VMM(...) TOLOWER(__VA_ARGS__, V, VMM) +# define TOLOWER_YMM(...) TOLOWER(__VA_ARGS__, YMM, VMM_256) +# define TOLOWER_XMM(...) 
TOLOWER(__VA_ARGS__, XMM, VMM_128) + +# define CMP_R1_R2(s1_reg, s2_reg, reg_out, ext, vec_macro) \ + TOLOWER (s1_reg, s2_reg, ext, vec_macro); \ + VPCMPEQ s1_reg, s2_reg, reg_out + +# define CMP_R1_S2(s1_reg, s2_mem, s2_reg, reg_out, ext, vec_macro) \ + VMOVU s2_mem, s2_reg; \ + CMP_R1_R2 (s1_reg, s2_reg, reg_out, ext, vec_macro) + +# define CMP_R1_R2_VMM(...) CMP_R1_R2(__VA_ARGS__, V, VMM) +# define CMP_R1_R2_YMM(...) CMP_R1_R2(__VA_ARGS__, YMM, VMM_256) +# define CMP_R1_R2_XMM(...) CMP_R1_R2(__VA_ARGS__, XMM, VMM_128) + +# define CMP_R1_S2_VMM(...) CMP_R1_S2(__VA_ARGS__, V, VMM) +# define CMP_R1_S2_YMM(...) CMP_R1_S2(__VA_ARGS__, YMM, VMM_256) +# define CMP_R1_S2_XMM(...) CMP_R1_S2(__VA_ARGS__, XMM, VMM_128) # else # define TOLOWER_gpr(...) +# define TOLOWER_VMM(...) # define TOLOWER_YMM(...) # define TOLOWER_XMM(...) -# define CMP_R1_R2_YMM(s1_reg, s2_reg, reg_out) \ - VPCMP $0, s2_reg, s1_reg, reg_out +# define CMP_R1_R2_VMM(s1_reg, s2_reg, reg_out) \ + VPCMPEQ s2_reg, s1_reg, reg_out -# define CMP_R1_R2_XMM(...) CMP_R1_R2_YMM(__VA_ARGS__) +# define CMP_R1_R2_YMM(...) CMP_R1_R2_VMM(__VA_ARGS__) +# define CMP_R1_R2_XMM(...) CMP_R1_R2_VMM(__VA_ARGS__) -# define CMP_R1_S2_YMM(s1_reg, s2_mem, unused, reg_out) \ - VPCMP $0, s2_mem, s1_reg, reg_out - -# define CMP_R1_S2_XMM(...) CMP_R1_S2_YMM(__VA_ARGS__) +# define CMP_R1_S2_VMM(s1_reg, s2_mem, unused, reg_out) \ + VPCMPEQ s2_mem, s1_reg, reg_out +# define CMP_R1_S2_YMM(...) CMP_R1_S2_VMM(__VA_ARGS__) +# define CMP_R1_S2_XMM(...) CMP_R1_S2_VMM(__VA_ARGS__) # endif /* Warning! @@ -203,7 +208,7 @@ the maximum offset is reached before a difference is found, zero is returned. 
*/ - .section .text.evex, "ax", @progbits + .section SECTION(.text), "ax", @progbits .align 16 .type STRCMP, @function .globl STRCMP @@ -232,7 +237,7 @@ STRCMP: # else mov (%LOCALE_REG), %RAX_LP # endif - testl $1, LOCALE_DATA_VALUES + _NL_CTYPE_NONASCII_CASE * SIZEOF_VALUES(%rax) + testb $1, LOCALE_DATA_VALUES + _NL_CTYPE_NONASCII_CASE * SIZEOF_VALUES(%rax) jne STRCASECMP_L_NONASCII leaq _nl_C_LC_CTYPE_tolower + 128 * 4(%rip), TOLOWER_BASE # endif @@ -254,28 +259,46 @@ STRCMP: # endif # if defined USE_AS_STRCASECMP_L - .section .rodata.cst32, "aM", @progbits, 32 - .align 32 + .section RODATA_SECTION, "aM", @progbits, VEC_SIZE + .align VEC_SIZE L(lcase_min): .quad 0x4141414141414141 .quad 0x4141414141414141 .quad 0x4141414141414141 .quad 0x4141414141414141 +# if VEC_SIZE == 64 + .quad 0x4141414141414141 + .quad 0x4141414141414141 + .quad 0x4141414141414141 + .quad 0x4141414141414141 +# endif L(lcase_max): .quad 0x1a1a1a1a1a1a1a1a .quad 0x1a1a1a1a1a1a1a1a .quad 0x1a1a1a1a1a1a1a1a .quad 0x1a1a1a1a1a1a1a1a +# if VEC_SIZE == 64 + .quad 0x1a1a1a1a1a1a1a1a + .quad 0x1a1a1a1a1a1a1a1a + .quad 0x1a1a1a1a1a1a1a1a + .quad 0x1a1a1a1a1a1a1a1a +# endif L(case_add): .quad 0x2020202020202020 .quad 0x2020202020202020 .quad 0x2020202020202020 .quad 0x2020202020202020 +# if VEC_SIZE == 64 + .quad 0x2020202020202020 + .quad 0x2020202020202020 + .quad 0x2020202020202020 + .quad 0x2020202020202020 +# endif .previous - vmovdqa64 L(lcase_min)(%rip), LCASE_MIN_YMM - vmovdqa64 L(lcase_max)(%rip), LCASE_MAX_YMM - vmovdqa64 L(case_add)(%rip), CASE_ADD_YMM + VMOVA L(lcase_min)(%rip), %LCASE_MIN_V + VMOVA L(lcase_max)(%rip), %LCASE_MAX_V + VMOVA L(case_add)(%rip), %CASE_ADD_V # endif movl %edi, %eax @@ -288,12 +311,12 @@ L(case_add): L(no_page_cross): /* Safe to compare 4x vectors. */ - VMOVU (%rdi), %YMM0 - VPTESTM %YMM0, %YMM0, %k2 + VMOVU (%rdi), %VMM(0) + VPTESTM %VMM(0), %VMM(0), %k2 /* Each bit cleared in K1 represents a mismatch or a null CHAR in YMM0 and 32 bytes at (%rsi). 
*/ - CMP_R1_S2_YMM (%YMM0, (%rsi), %YMM1, %k1){%k2} - kmovd %k1, %ecx + CMP_R1_S2_VMM (%VMM(0), (%rsi), %VMM(1), %k1){%k2} + KMOV %k1, %VRCX # ifdef USE_AS_STRNCMP cmpq $CHAR_PER_VEC, %rdx jbe L(vec_0_test_len) @@ -303,14 +326,14 @@ L(no_page_cross): wcscmp/wcsncmp. */ /* All 1s represents all equals. TESTEQ will overflow to zero in - all equals case. Otherwise 1s will carry until position of first - mismatch. */ - TESTEQ %ecx + all equals case. Otherwise 1s will carry until position of + first mismatch. */ + TESTEQ %VRCX jz L(more_3x_vec) .p2align 4,, 4 L(return_vec_0): - tzcntl %ecx, %ecx + bsf %VRCX, %VRCX # ifdef USE_AS_WCSCMP movl (%rdi, %rcx, SIZE_OF_CHAR), %edx xorl %eax, %eax @@ -321,7 +344,16 @@ L(return_vec_0): orl $1, %eax # else movzbl (%rdi, %rcx), %eax + /* For VEC_SIZE == 64 use movb instead of movzbl to save a byte + and keep logic for len <= VEC_SIZE (common) in just the + first cache line. NB: No evex512 processor has partial- + register stalls. If that changes this ifdef can be disabled + without affecting correctness. */ +# if !defined USE_AS_STRNCMP && !defined USE_AS_STRCASECMP_L && VEC_SIZE == 64 + movb (%rsi, %rcx), %cl +# else movzbl (%rsi, %rcx), %ecx +# endif TOLOWER_gpr (%rax, %eax) TOLOWER_gpr (%rcx, %ecx) subl %ecx, %eax @@ -332,8 +364,8 @@ L(ret0): # ifdef USE_AS_STRNCMP .p2align 4,, 4 L(vec_0_test_len): - notl %ecx - bzhil %edx, %ecx, %eax + not %VRCX + bzhi %VRDX, %VRCX, %VRAX jnz L(return_vec_0) /* Align if will cross fetch block. */ .p2align 4,, 2 @@ -372,7 +404,7 @@ L(ret1): .p2align 4,, 10 L(return_vec_1): - tzcntl %ecx, %ecx + bsf %VRCX, %VRCX # ifdef USE_AS_STRNCMP /* rdx must be > CHAR_PER_VEC so its safe to subtract without worrying about underflow. */ @@ -401,24 +433,41 @@ L(ret2): .p2align 4,, 10 # ifdef USE_AS_STRNCMP L(return_vec_3): -# if CHAR_PER_VEC <= 16 +# if CHAR_PER_VEC <= 32 + /* If CHAR_PER_VEC <= 32 reuse code from L(return_vec_3) without + additional branches by adjusting the bit positions from + VEC3. 
We can't do this for CHAR_PER_VEC == 64. */ +# if CHAR_PER_VEC <= 16 sall $CHAR_PER_VEC, %ecx -# else +# else salq $CHAR_PER_VEC, %rcx +# endif +# else + /* If CHAR_PER_VEC == 64 we can't shift the return GPR so just + check it. */ + bsf %VRCX, %VRCX + addl $(CHAR_PER_VEC), %ecx + cmpq %rcx, %rdx + ja L(ret_vec_3_finish) + xorl %eax, %eax + ret # endif # endif + + /* If CHAR_PER_VEC == 64 we can't combine matches from the last + 2x VEC so need seperate return label. */ L(return_vec_2): # if (CHAR_PER_VEC <= 16) || !(defined USE_AS_STRNCMP) - tzcntl %ecx, %ecx + bsf %VRCX, %VRCX # else - tzcntq %rcx, %rcx + bsfq %rcx, %rcx # endif - # ifdef USE_AS_STRNCMP cmpq %rcx, %rdx jbe L(ret_zero) # endif +L(ret_vec_3_finish): # ifdef USE_AS_WCSCMP movl (VEC_SIZE * 2)(%rdi, %rcx, SIZE_OF_CHAR), %edx xorl %eax, %eax @@ -440,7 +489,7 @@ L(ret3): # ifndef USE_AS_STRNCMP .p2align 4,, 10 L(return_vec_3): - tzcntl %ecx, %ecx + bsf %VRCX, %VRCX # ifdef USE_AS_WCSCMP movl (VEC_SIZE * 3)(%rdi, %rcx, SIZE_OF_CHAR), %edx xorl %eax, %eax @@ -465,11 +514,11 @@ L(ret4): .p2align 5 L(more_3x_vec): /* Safe to compare 4x vectors. 
*/ - VMOVU (VEC_SIZE)(%rdi), %YMM0 - VPTESTM %YMM0, %YMM0, %k2 - CMP_R1_S2_YMM (%YMM0, VEC_SIZE(%rsi), %YMM1, %k1){%k2} - kmovd %k1, %ecx - TESTEQ %ecx + VMOVU (VEC_SIZE)(%rdi), %VMM(0) + VPTESTM %VMM(0), %VMM(0), %k2 + CMP_R1_S2_VMM (%VMM(0), VEC_SIZE(%rsi), %VMM(1), %k1){%k2} + KMOV %k1, %VRCX + TESTEQ %VRCX jnz L(return_vec_1) # ifdef USE_AS_STRNCMP @@ -477,18 +526,18 @@ L(more_3x_vec): jbe L(ret_zero) # endif - VMOVU (VEC_SIZE * 2)(%rdi), %YMM0 - VPTESTM %YMM0, %YMM0, %k2 - CMP_R1_S2_YMM (%YMM0, (VEC_SIZE * 2)(%rsi), %YMM1, %k1){%k2} - kmovd %k1, %ecx - TESTEQ %ecx + VMOVU (VEC_SIZE * 2)(%rdi), %VMM(0) + VPTESTM %VMM(0), %VMM(0), %k2 + CMP_R1_S2_VMM (%VMM(0), (VEC_SIZE * 2)(%rsi), %VMM(1), %k1){%k2} + KMOV %k1, %VRCX + TESTEQ %VRCX jnz L(return_vec_2) - VMOVU (VEC_SIZE * 3)(%rdi), %YMM0 - VPTESTM %YMM0, %YMM0, %k2 - CMP_R1_S2_YMM (%YMM0, (VEC_SIZE * 3)(%rsi), %YMM1, %k1){%k2} - kmovd %k1, %ecx - TESTEQ %ecx + VMOVU (VEC_SIZE * 3)(%rdi), %VMM(0) + VPTESTM %VMM(0), %VMM(0), %k2 + CMP_R1_S2_VMM (%VMM(0), (VEC_SIZE * 3)(%rsi), %VMM(1), %k1){%k2} + KMOV %k1, %VRCX + TESTEQ %VRCX jnz L(return_vec_3) # ifdef USE_AS_STRNCMP @@ -565,110 +614,123 @@ L(loop): /* Loop entry after handling page cross during loop. */ L(loop_skip_page_cross_check): - VMOVA (VEC_SIZE * 0)(%rdi), %YMM0 - VMOVA (VEC_SIZE * 1)(%rdi), %YMM2 - VMOVA (VEC_SIZE * 2)(%rdi), %YMM4 - VMOVA (VEC_SIZE * 3)(%rdi), %YMM6 + VMOVA (VEC_SIZE * 0)(%rdi), %VMM(0) + VMOVA (VEC_SIZE * 1)(%rdi), %VMM(2) + VMOVA (VEC_SIZE * 2)(%rdi), %VMM(4) + VMOVA (VEC_SIZE * 3)(%rdi), %VMM(6) - VPMINU %YMM0, %YMM2, %YMM8 - VPMINU %YMM4, %YMM6, %YMM9 + VPMINU %VMM(0), %VMM(2), %VMM(8) + VPMINU %VMM(4), %VMM(6), %VMM(9) /* A zero CHAR in YMM9 means that there is a null CHAR. */ - VPMINU %YMM8, %YMM9, %YMM9 + VPMINU %VMM(8), %VMM(9), %VMM(9) /* Each bit set in K1 represents a non-null CHAR in YMM9. 
*/ - VPTESTM %YMM9, %YMM9, %k1 + VPTESTM %VMM(9), %VMM(9), %k1 # ifndef USE_AS_STRCASECMP_L - vpxorq (VEC_SIZE * 0)(%rsi), %YMM0, %YMM1 - vpxorq (VEC_SIZE * 1)(%rsi), %YMM2, %YMM3 - vpxorq (VEC_SIZE * 2)(%rsi), %YMM4, %YMM5 + vpxorq (VEC_SIZE * 0)(%rsi), %VMM(0), %VMM(1) + vpxorq (VEC_SIZE * 1)(%rsi), %VMM(2), %VMM(3) + vpxorq (VEC_SIZE * 2)(%rsi), %VMM(4), %VMM(5) /* Ternary logic to xor (VEC_SIZE * 3)(%rsi) with YMM6 while oring with YMM1. Result is stored in YMM6. */ - vpternlogd $0xde, (VEC_SIZE * 3)(%rsi), %YMM1, %YMM6 + vpternlogd $0xde, (VEC_SIZE * 3)(%rsi), %VMM(1), %VMM(6) # else - VMOVU (VEC_SIZE * 0)(%rsi), %YMM1 - TOLOWER_YMM (%YMM0, %YMM1) - VMOVU (VEC_SIZE * 1)(%rsi), %YMM3 - TOLOWER_YMM (%YMM2, %YMM3) - VMOVU (VEC_SIZE * 2)(%rsi), %YMM5 - TOLOWER_YMM (%YMM4, %YMM5) - VMOVU (VEC_SIZE * 3)(%rsi), %YMM7 - TOLOWER_YMM (%YMM6, %YMM7) - vpxorq %YMM0, %YMM1, %YMM1 - vpxorq %YMM2, %YMM3, %YMM3 - vpxorq %YMM4, %YMM5, %YMM5 - vpternlogd $0xde, %YMM7, %YMM1, %YMM6 + VMOVU (VEC_SIZE * 0)(%rsi), %VMM(1) + TOLOWER_VMM (%VMM(0), %VMM(1)) + VMOVU (VEC_SIZE * 1)(%rsi), %VMM(3) + TOLOWER_VMM (%VMM(2), %VMM(3)) + VMOVU (VEC_SIZE * 2)(%rsi), %VMM(5) + TOLOWER_VMM (%VMM(4), %VMM(5)) + VMOVU (VEC_SIZE * 3)(%rsi), %VMM(7) + TOLOWER_VMM (%VMM(6), %VMM(7)) + vpxorq %VMM(0), %VMM(1), %VMM(1) + vpxorq %VMM(2), %VMM(3), %VMM(3) + vpxorq %VMM(4), %VMM(5), %VMM(5) + vpternlogd $0xde, %VMM(7), %VMM(1), %VMM(6) # endif /* Or together YMM3, YMM5, and YMM6. */ - vpternlogd $0xfe, %YMM3, %YMM5, %YMM6 + vpternlogd $0xfe, %VMM(3), %VMM(5), %VMM(6) /* A non-zero CHAR in YMM6 represents a mismatch. */ - VPTESTNM %YMM6, %YMM6, %k0{%k1} - kmovd %k0, %LOOP_REG + VPTESTNM %VMM(6), %VMM(6), %k0{%k1} + KMOV %k0, %LOOP_REG TESTEQ %LOOP_REG jz L(loop) /* Find which VEC has the mismatch of end of string. 
*/ - VPTESTM %YMM0, %YMM0, %k1 - VPTESTNM %YMM1, %YMM1, %k0{%k1} - kmovd %k0, %ecx - TESTEQ %ecx + VPTESTM %VMM(0), %VMM(0), %k1 + VPTESTNM %VMM(1), %VMM(1), %k0{%k1} + KMOV %k0, %VRCX + TESTEQ %VRCX jnz L(return_vec_0_end) - VPTESTM %YMM2, %YMM2, %k1 - VPTESTNM %YMM3, %YMM3, %k0{%k1} - kmovd %k0, %ecx - TESTEQ %ecx + VPTESTM %VMM(2), %VMM(2), %k1 + VPTESTNM %VMM(3), %VMM(3), %k0{%k1} + KMOV %k0, %VRCX + TESTEQ %VRCX jnz L(return_vec_1_end) - /* Handle VEC 2 and 3 without branches. */ + /* Handle VEC 2 and 3 without branches if CHAR_PER_VEC <= 32. + */ L(return_vec_2_3_end): # ifdef USE_AS_STRNCMP subq $(CHAR_PER_VEC * 2), %rdx jbe L(ret_zero_end) # endif - VPTESTM %YMM4, %YMM4, %k1 - VPTESTNM %YMM5, %YMM5, %k0{%k1} - kmovd %k0, %ecx - TESTEQ %ecx + VPTESTM %VMM(4), %VMM(4), %k1 + VPTESTNM %VMM(5), %VMM(5), %k0{%k1} + KMOV %k0, %VRCX + TESTEQ %VRCX # if CHAR_PER_VEC <= 16 sall $CHAR_PER_VEC, %LOOP_REG orl %ecx, %LOOP_REG -# else +# elif CHAR_PER_VEC <= 32 salq $CHAR_PER_VEC, %LOOP_REG64 orq %rcx, %LOOP_REG64 +# else + /* We aren't combining last 2x VEC so branch on second the last. + */ + jnz L(return_vec_2_end) # endif -L(return_vec_3_end): + /* LOOP_REG contains matches for null/mismatch from the loop. If - VEC 0,1,and 2 all have no null and no mismatches then mismatch - must entirely be from VEC 3 which is fully represented by - LOOP_REG. */ + VEC 0,1,and 2 all have no null and no mismatches then + mismatch must entirely be from VEC 3 which is fully + represented by LOOP_REG. */ # if CHAR_PER_VEC <= 16 - tzcntl %LOOP_REG, %LOOP_REG + bsf %LOOP_REG, %LOOP_REG # else - tzcntq %LOOP_REG64, %LOOP_REG64 + bsfq %LOOP_REG64, %LOOP_REG64 # endif # ifdef USE_AS_STRNCMP + + /* If CHAR_PER_VEC == 64 we can't combine last 2x VEC so need to + adj length before last comparison. 
*/ +# if CHAR_PER_VEC == 64 + subq $CHAR_PER_VEC, %rdx + jbe L(ret_zero_end) +# endif + cmpq %LOOP_REG64, %rdx jbe L(ret_zero_end) # endif # ifdef USE_AS_WCSCMP - movl (VEC_SIZE * 2)(%rdi, %LOOP_REG64, SIZE_OF_CHAR), %ecx + movl (FALLTHROUGH_RETURN_OFFSET)(%rdi, %LOOP_REG64, SIZE_OF_CHAR), %ecx xorl %eax, %eax - cmpl (VEC_SIZE * 2)(%rsi, %LOOP_REG64, SIZE_OF_CHAR), %ecx + cmpl (FALLTHROUGH_RETURN_OFFSET)(%rsi, %LOOP_REG64, SIZE_OF_CHAR), %ecx je L(ret5) setl %al negl %eax xorl %r8d, %eax # else - movzbl (VEC_SIZE * 2)(%rdi, %LOOP_REG64), %eax - movzbl (VEC_SIZE * 2)(%rsi, %LOOP_REG64), %ecx + movzbl (FALLTHROUGH_RETURN_OFFSET)(%rdi, %LOOP_REG64), %eax + movzbl (FALLTHROUGH_RETURN_OFFSET)(%rsi, %LOOP_REG64), %ecx TOLOWER_gpr (%rax, %eax) TOLOWER_gpr (%rcx, %ecx) subl %ecx, %eax @@ -686,23 +748,39 @@ L(ret_zero_end): # endif + /* The L(return_vec_N_end) differ from L(return_vec_N) in that - they use the value of `r8` to negate the return value. This is - because the page cross logic can swap `rdi` and `rsi`. */ + they use the value of `r8` to negate the return value. This + is because the page cross logic can swap `rdi` and `rsi`. + */ .p2align 4,, 10 # ifdef USE_AS_STRNCMP L(return_vec_1_end): -# if CHAR_PER_VEC <= 16 +# if CHAR_PER_VEC <= 32 + /* If CHAR_PER_VEC <= 32 reuse code from L(return_vec_0_end) + without additional branches by adjusting the bit positions + from VEC1. We can't do this for CHAR_PER_VEC == 64. */ +# if CHAR_PER_VEC <= 16 sall $CHAR_PER_VEC, %ecx -# else +# else salq $CHAR_PER_VEC, %rcx +# endif +# else + /* If CHAR_PER_VEC == 64 we can't shift the return GPR so just + check it. 
*/ + bsf %VRCX, %VRCX + addl $(CHAR_PER_VEC), %ecx + cmpq %rcx, %rdx + ja L(ret_vec_0_end_finish) + xorl %eax, %eax + ret # endif # endif L(return_vec_0_end): # if (CHAR_PER_VEC <= 16) || !(defined USE_AS_STRNCMP) - tzcntl %ecx, %ecx + bsf %VRCX, %VRCX # else - tzcntq %rcx, %rcx + bsfq %rcx, %rcx # endif # ifdef USE_AS_STRNCMP @@ -710,6 +788,7 @@ L(return_vec_0_end): jbe L(ret_zero_end) # endif +L(ret_vec_0_end_finish): # ifdef USE_AS_WCSCMP movl (%rdi, %rcx, SIZE_OF_CHAR), %edx xorl %eax, %eax @@ -737,7 +816,7 @@ L(ret6): # ifndef USE_AS_STRNCMP .p2align 4,, 10 L(return_vec_1_end): - tzcntl %ecx, %ecx + bsf %VRCX, %VRCX # ifdef USE_AS_WCSCMP movl VEC_SIZE(%rdi, %rcx, SIZE_OF_CHAR), %edx xorl %eax, %eax @@ -760,6 +839,41 @@ L(ret7): # endif + /* If CHAR_PER_VEC == 64 we can't combine matches from the last + 2x VEC so we need a separate return label. */ +# if CHAR_PER_VEC == 64 +L(return_vec_2_end): + bsf %VRCX, %VRCX +# ifdef USE_AS_STRNCMP + cmpq %rcx, %rdx + jbe L(ret_zero_end) +# endif +# ifdef USE_AS_WCSCMP + movl (VEC_SIZE * 2)(%rdi, %rcx, SIZE_OF_CHAR), %edx + xorl %eax, %eax + cmpl (VEC_SIZE * 2)(%rsi, %rcx, SIZE_OF_CHAR), %edx + je L(ret31) + setl %al + negl %eax + /* This is the non-zero case for `eax` so just xorl with `r8d` + to flip if `rdi` and `rsi` were swapped. */ + xorl %r8d, %eax +# else + movzbl (VEC_SIZE * 2)(%rdi, %rcx), %eax + movzbl (VEC_SIZE * 2)(%rsi, %rcx), %ecx + TOLOWER_gpr (%rax, %eax) + TOLOWER_gpr (%rcx, %ecx) + subl %ecx, %eax + /* Flip `eax` if `rdi` and `rsi` were swapped in page cross + logic. Subtract `r8d` after xor for zero case. */ + xorl %r8d, %eax + subl %r8d, %eax +# endif +L(ret13): + ret +# endif + + /* Page cross in rsi in next 4x VEC. */ /* TODO: Improve logic here.
*/ @@ -778,11 +892,11 @@ L(page_cross_during_loop): cmpl $-(VEC_SIZE * 3), %eax jle L(less_1x_vec_till_page_cross) - VMOVA (%rdi), %YMM0 - VPTESTM %YMM0, %YMM0, %k2 - CMP_R1_S2_YMM (%YMM0, (%rsi), %YMM1, %k1){%k2} - kmovd %k1, %ecx - TESTEQ %ecx + VMOVA (%rdi), %VMM(0) + VPTESTM %VMM(0), %VMM(0), %k2 + CMP_R1_S2_VMM (%VMM(0), (%rsi), %VMM(1), %k1){%k2} + KMOV %k1, %VRCX + TESTEQ %VRCX jnz L(return_vec_0_end) /* if distance >= 2x VEC then eax > -(VEC_SIZE * 2). */ @@ -799,9 +913,9 @@ L(less_1x_vec_till_page_cross): to read back -VEC_SIZE. If rdi is truly at the start of a page here, it means the previous page (rdi - VEC_SIZE) has already been loaded earlier so must be valid. */ - VMOVU -VEC_SIZE(%rdi, %rax), %YMM0 - VPTESTM %YMM0, %YMM0, %k2 - CMP_R1_S2_YMM (%YMM0, -VEC_SIZE(%rsi, %rax), %YMM1, %k1){%k2} + VMOVU -VEC_SIZE(%rdi, %rax), %VMM(0) + VPTESTM %VMM(0), %VMM(0), %k2 + CMP_R1_S2_VMM (%VMM(0), -VEC_SIZE(%rsi, %rax), %VMM(1), %k1){%k2} /* Mask of potentially valid bits. The lower bits can be out of range comparisons (but safe regarding page crosses). */ @@ -813,12 +927,12 @@ L(less_1x_vec_till_page_cross): shlxl %ecx, %r10d, %ecx movzbl %cl, %r10d # else - movl $-1, %ecx - shlxl %esi, %ecx, %r10d + mov $-1, %VRCX + shlx %VRSI, %VRCX, %VR10 # endif - kmovd %k1, %ecx - notl %ecx + KMOV %k1, %VRCX + not %VRCX # ifdef USE_AS_STRNCMP @@ -838,12 +952,10 @@ L(less_1x_vec_till_page_cross): /* Readjust eax before potentially returning to the loop. */ addl $(PAGE_SIZE - VEC_SIZE * 4), %eax - andl %r10d, %ecx + and %VR10, %VRCX jz L(loop_skip_page_cross_check) - .p2align 4,, 3 -L(return_page_cross_end): - tzcntl %ecx, %ecx + bsf %VRCX, %VRCX # if (defined USE_AS_STRNCMP) || (defined USE_AS_WCSCMP) leal -VEC_SIZE(%OFFSET_REG64, %rcx, SIZE_OF_CHAR), %ecx @@ -874,8 +986,12 @@ L(ret8): # ifdef USE_AS_STRNCMP .p2align 4,, 10 L(return_page_cross_end_check): - andl %r10d, %ecx - tzcntl %ecx, %ecx + and %VR10, %VRCX + /* Need to use tzcnt here as VRCX may be zero. 
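The tzcnt-vs-bsf distinction this comment relies on can be modeled in C: tzcnt is architecturally defined to return the operand width for a zero input, while bsf leaves its destination undefined when the source is zero. A small reference model (hypothetical helper, not part of the patch):

```c
#include <assert.h>
#include <stdint.h>

/* Reference model of 32-bit tzcnt: count trailing zero bits, returning the
   operand width (32) when the input is zero.  This is the property the
   page-cross code depends on, and the reason bsf (result undefined for a
   zero source) cannot be used when the mask may be empty.  */
static unsigned
tzcnt32 (uint32_t x)
{
  unsigned n = 0;
  if (x == 0)
    return 32;			/* tzcnt: zero input yields operand size.  */
  while ((x & 1) == 0)
    {
      x >>= 1;
      n++;
    }
  return n;
}
```

So on a zero mask the computed index is CHAR_PER_VEC, past any valid remaining length, which is why the code can use the result unconditionally.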
If VRCX is zero + tzcnt(VRCX) will be CHAR_PER_VEC and the remaining length (edx) is + guaranteed to be <= CHAR_PER_VEC, so we will only use the return + idx if VRCX was non-zero. */ + tzcnt %VRCX, %VRCX leal -VEC_SIZE(%rax, %rcx, SIZE_OF_CHAR), %ecx # ifdef USE_AS_WCSCMP sall $2, %edx @@ -892,11 +1008,11 @@ L(more_2x_vec_till_page_cross): /* If more 2x vec till cross we will complete a full loop iteration here. */ - VMOVA VEC_SIZE(%rdi), %YMM0 - VPTESTM %YMM0, %YMM0, %k2 - CMP_R1_S2_YMM (%YMM0, VEC_SIZE(%rsi), %YMM1, %k1){%k2} - kmovd %k1, %ecx - TESTEQ %ecx + VMOVA VEC_SIZE(%rdi), %VMM(0) + VPTESTM %VMM(0), %VMM(0), %k2 + CMP_R1_S2_VMM (%VMM(0), VEC_SIZE(%rsi), %VMM(1), %k1){%k2} + KMOV %k1, %VRCX + TESTEQ %VRCX jnz L(return_vec_1_end) # ifdef USE_AS_STRNCMP @@ -907,18 +1023,18 @@ L(more_2x_vec_till_page_cross): subl $-(VEC_SIZE * 4), %eax /* Safe to include comparisons from lower bytes. */ - VMOVU -(VEC_SIZE * 2)(%rdi, %rax), %YMM0 - VPTESTM %YMM0, %YMM0, %k2 - CMP_R1_S2_YMM (%YMM0, -(VEC_SIZE * 2)(%rsi, %rax), %YMM1, %k1){%k2} - kmovd %k1, %ecx - TESTEQ %ecx + VMOVU -(VEC_SIZE * 2)(%rdi, %rax), %VMM(0) + VPTESTM %VMM(0), %VMM(0), %k2 + CMP_R1_S2_VMM (%VMM(0), -(VEC_SIZE * 2)(%rsi, %rax), %VMM(1), %k1){%k2} + KMOV %k1, %VRCX + TESTEQ %VRCX jnz L(return_vec_page_cross_0) - VMOVU -(VEC_SIZE * 1)(%rdi, %rax), %YMM0 - VPTESTM %YMM0, %YMM0, %k2 - CMP_R1_S2_YMM (%YMM0, -(VEC_SIZE * 1)(%rsi, %rax), %YMM1, %k1){%k2} - kmovd %k1, %ecx - TESTEQ %ecx + VMOVU -(VEC_SIZE * 1)(%rdi, %rax), %VMM(0) + VPTESTM %VMM(0), %VMM(0), %k2 + CMP_R1_S2_VMM (%VMM(0), -(VEC_SIZE * 1)(%rsi, %rax), %VMM(1), %k1){%k2} + KMOV %k1, %VRCX + TESTEQ %VRCX jnz L(return_vec_page_cross_1) # ifdef USE_AS_STRNCMP @@ -937,30 +1053,30 @@ L(more_2x_vec_till_page_cross): # endif /* Finish the loop.
*/ - VMOVA (VEC_SIZE * 2)(%rdi), %YMM4 - VMOVA (VEC_SIZE * 3)(%rdi), %YMM6 - VPMINU %YMM4, %YMM6, %YMM9 - VPTESTM %YMM9, %YMM9, %k1 + VMOVA (VEC_SIZE * 2)(%rdi), %VMM(4) + VMOVA (VEC_SIZE * 3)(%rdi), %VMM(6) + VPMINU %VMM(4), %VMM(6), %VMM(9) + VPTESTM %VMM(9), %VMM(9), %k1 # ifndef USE_AS_STRCASECMP_L - vpxorq (VEC_SIZE * 2)(%rsi), %YMM4, %YMM5 + vpxorq (VEC_SIZE * 2)(%rsi), %VMM(4), %VMM(5) /* YMM6 = YMM5 | ((VEC_SIZE * 3)(%rsi) ^ YMM6). */ - vpternlogd $0xde, (VEC_SIZE * 3)(%rsi), %YMM5, %YMM6 + vpternlogd $0xde, (VEC_SIZE * 3)(%rsi), %VMM(5), %VMM(6) # else - VMOVU (VEC_SIZE * 2)(%rsi), %YMM5 - TOLOWER_YMM (%YMM4, %YMM5) - VMOVU (VEC_SIZE * 3)(%rsi), %YMM7 - TOLOWER_YMM (%YMM6, %YMM7) - vpxorq %YMM4, %YMM5, %YMM5 - vpternlogd $0xde, %YMM7, %YMM5, %YMM6 -# endif - VPTESTNM %YMM6, %YMM6, %k0{%k1} - kmovd %k0, %LOOP_REG + VMOVU (VEC_SIZE * 2)(%rsi), %VMM(5) + TOLOWER_VMM (%VMM(4), %VMM(5)) + VMOVU (VEC_SIZE * 3)(%rsi), %VMM(7) + TOLOWER_VMM (%VMM(6), %VMM(7)) + vpxorq %VMM(4), %VMM(5), %VMM(5) + vpternlogd $0xde, %VMM(7), %VMM(5), %VMM(6) +# endif + VPTESTNM %VMM(6), %VMM(6), %k0{%k1} + KMOV %k0, %LOOP_REG TESTEQ %LOOP_REG jnz L(return_vec_2_3_end) /* Best for code size to include ucond-jmp here. Would be faster - if this case is hot to duplicate the L(return_vec_2_3_end) code - as fall-through and have jump back to loop on mismatch + if this case is hot to duplicate the L(return_vec_2_3_end) + code as fall-through and have jump back to loop on mismatch comparison. */ subq $-(VEC_SIZE * 4), %rdi subq $-(VEC_SIZE * 4), %rsi @@ -980,7 +1096,7 @@ L(ret_zero_in_loop_page_cross): L(return_vec_page_cross_0): addl $-VEC_SIZE, %eax L(return_vec_page_cross_1): - tzcntl %ecx, %ecx + bsf %VRCX, %VRCX # if defined USE_AS_STRNCMP || defined USE_AS_WCSCMP leal -VEC_SIZE(%rax, %rcx, SIZE_OF_CHAR), %ecx # ifdef USE_AS_STRNCMP @@ -1023,8 +1139,8 @@ L(ret9): L(page_cross): # ifndef USE_AS_STRNCMP /* If both are VEC aligned we don't need any special logic here. 
- Only valid for strcmp where stop condition is guranteed to be - reachable by just reading memory. */ + Only valid for strcmp where stop condition is guaranteed to + be reachable by just reading memory. */ testl $((VEC_SIZE - 1) << 20), %eax jz L(no_page_cross) # endif @@ -1065,11 +1181,11 @@ L(page_cross): loadable memory until within 1x VEC of page cross. */ .p2align 4,, 8 L(page_cross_loop): - VMOVU (%rdi, %OFFSET_REG64, SIZE_OF_CHAR), %YMM0 - VPTESTM %YMM0, %YMM0, %k2 - CMP_R1_S2_YMM (%YMM0, (%rsi, %OFFSET_REG64, SIZE_OF_CHAR), %YMM1, %k1){%k2} - kmovd %k1, %ecx - TESTEQ %ecx + VMOVU (%rdi, %OFFSET_REG64, SIZE_OF_CHAR), %VMM(0) + VPTESTM %VMM(0), %VMM(0), %k2 + CMP_R1_S2_VMM (%VMM(0), (%rsi, %OFFSET_REG64, SIZE_OF_CHAR), %VMM(1), %k1){%k2} + KMOV %k1, %VRCX + TESTEQ %VRCX jnz L(check_ret_vec_page_cross) addl $CHAR_PER_VEC, %OFFSET_REG # ifdef USE_AS_STRNCMP @@ -1087,13 +1203,13 @@ L(page_cross_loop): subl %eax, %OFFSET_REG /* OFFSET_REG has distance to page cross - VEC_SIZE. Guranteed to not cross page so is safe to load. Since we have already - loaded at least 1 VEC from rsi it is also guranteed to be safe. - */ + loaded at least 1 VEC from rsi it is also guaranteed to be + safe.
*/ + VMOVU (%rdi, %OFFSET_REG64, SIZE_OF_CHAR), %VMM(0) + VPTESTM %VMM(0), %VMM(0), %k2 + CMP_R1_S2_VMM (%VMM(0), (%rsi, %OFFSET_REG64, SIZE_OF_CHAR), %VMM(1), %k1){%k2} - kmovd %k1, %ecx + KMOV %k1, %VRCX # ifdef USE_AS_STRNCMP leal CHAR_PER_VEC(%OFFSET_REG64), %eax cmpq %rax, %rdx @@ -1104,7 +1220,7 @@ L(page_cross_loop): addq %rdi, %rdx # endif # endif - TESTEQ %ecx + TESTEQ %VRCX jz L(prepare_loop_no_len) .p2align 4,, 4 @@ -1112,7 +1228,7 @@ L(ret_vec_page_cross): # ifndef USE_AS_STRNCMP L(check_ret_vec_page_cross): # endif - tzcntl %ecx, %ecx + tzcnt %VRCX, %VRCX addl %OFFSET_REG, %ecx L(ret_vec_page_cross_cont): # ifdef USE_AS_WCSCMP @@ -1139,9 +1255,9 @@ L(ret12): # ifdef USE_AS_STRNCMP .p2align 4,, 10 L(check_ret_vec_page_cross2): - TESTEQ %ecx + TESTEQ %VRCX L(check_ret_vec_page_cross): - tzcntl %ecx, %ecx + tzcnt %VRCX, %VRCX addl %OFFSET_REG, %ecx cmpq %rcx, %rdx ja L(ret_vec_page_cross_cont) @@ -1180,8 +1296,71 @@ L(less_1x_vec_till_page): # ifdef USE_AS_WCSCMP shrl $2, %eax # endif + + /* Find largest load size we can use. For VEC_SIZE == 64 only + check if we can do a full ymm load. */ +# if VEC_SIZE == 64 + + cmpl $((VEC_SIZE - 32) / SIZE_OF_CHAR), %eax + ja L(less_32_till_page) + + + /* Use 32 byte comparison. */ + VMOVU (%rdi), %VMM_256(0) + VPTESTM %VMM_256(0), %VMM_256(0), %k2 + CMP_R1_S2_YMM (%VMM_256(0), (%rsi), %VMM_256(1), %k1){%k2} + kmovd %k1, %ecx +# ifdef USE_AS_WCSCMP + subl $0xff, %ecx +# else + incl %ecx +# endif + jnz L(check_ret_vec_page_cross) + movl $((VEC_SIZE - 32) / SIZE_OF_CHAR), %OFFSET_REG +# ifdef USE_AS_STRNCMP + cmpq %OFFSET_REG64, %rdx + jbe L(ret_zero_page_cross_slow_case64) + subl %eax, %OFFSET_REG +# else + /* Explicit check for 32 byte alignment.
*/ + subl %eax, %OFFSET_REG + jz L(prepare_loop) +# endif + VMOVU (%rdi, %OFFSET_REG64, SIZE_OF_CHAR), %VMM_256(0) + VPTESTM %VMM_256(0), %VMM_256(0), %k2 + CMP_R1_S2_YMM (%VMM_256(0), (%rsi, %OFFSET_REG64, SIZE_OF_CHAR), %VMM_256(1), %k1){%k2} + kmovd %k1, %ecx +# ifdef USE_AS_WCSCMP + subl $0xff, %ecx +# else + incl %ecx +# endif + jnz L(check_ret_vec_page_cross) +# ifdef USE_AS_STRNCMP + addl $(32 / SIZE_OF_CHAR), %OFFSET_REG + subq %OFFSET_REG64, %rdx + jbe L(ret_zero_page_cross_slow_case64) + subq $-(CHAR_PER_VEC * 4), %rdx + + leaq -(VEC_SIZE * 4)(%rdi, %OFFSET_REG64, SIZE_OF_CHAR), %rdi + leaq -(VEC_SIZE * 4)(%rsi, %OFFSET_REG64, SIZE_OF_CHAR), %rsi +# else + leaq (32 - VEC_SIZE * 4)(%rdi, %OFFSET_REG64, SIZE_OF_CHAR), %rdi + leaq (32 - VEC_SIZE * 4)(%rsi, %OFFSET_REG64, SIZE_OF_CHAR), %rsi +# endif + jmp L(prepare_loop_aligned) + +# ifdef USE_AS_STRNCMP + .p2align 4,, 2 +L(ret_zero_page_cross_slow_case64): + xorl %eax, %eax + ret +# endif +L(less_32_till_page): +# endif + /* Find largest load size we can use. */ - cmpl $(16 / SIZE_OF_CHAR), %eax + cmpl $((VEC_SIZE - 16) / SIZE_OF_CHAR), %eax ja L(less_16_till_page) /* Use 16 byte comparison. */ @@ -1195,9 +1374,14 @@ L(less_1x_vec_till_page): incw %cx # endif jnz L(check_ret_vec_page_cross) - movl $(16 / SIZE_OF_CHAR), %OFFSET_REG + + movl $((VEC_SIZE - 16) / SIZE_OF_CHAR), %OFFSET_REG # ifdef USE_AS_STRNCMP +# if VEC_SIZE == 32 cmpq %OFFSET_REG64, %rdx +# else + cmpq $(16 / SIZE_OF_CHAR), %rdx +# endif jbe L(ret_zero_page_cross_slow_case0) subl %eax, %OFFSET_REG # else @@ -1239,7 +1423,7 @@ L(ret_zero_page_cross_slow_case0): .p2align 4,, 10 L(less_16_till_page): - cmpl $(24 / SIZE_OF_CHAR), %eax + cmpl $((VEC_SIZE - 8) / SIZE_OF_CHAR), %eax ja L(less_8_till_page) /* Use 8 byte comparison. 
*/ @@ -1260,7 +1444,7 @@ L(less_16_till_page): cmpq $(8 / SIZE_OF_CHAR), %rdx jbe L(ret_zero_page_cross_slow_case0) # endif - movl $(24 / SIZE_OF_CHAR), %OFFSET_REG + movl $((VEC_SIZE - 8) / SIZE_OF_CHAR), %OFFSET_REG subl %eax, %OFFSET_REG vmovq (%rdi, %OFFSET_REG64, SIZE_OF_CHAR), %xmm0 @@ -1320,7 +1504,7 @@ L(ret_less_8_wcs): ret # else - cmpl $28, %eax + cmpl $(VEC_SIZE - 4), %eax ja L(less_4_till_page) vmovd (%rdi), %xmm0 @@ -1335,7 +1519,7 @@ L(ret_less_8_wcs): cmpq $4, %rdx jbe L(ret_zero_page_cross_slow_case1) # endif - movl $(28 / SIZE_OF_CHAR), %OFFSET_REG + movl $((VEC_SIZE - 4) / SIZE_OF_CHAR), %OFFSET_REG subl %eax, %OFFSET_REG vmovd (%rdi, %OFFSET_REG64, SIZE_OF_CHAR), %xmm0 @@ -1386,7 +1570,7 @@ L(less_4_loop): # endif incq %rdi /* end condition is reach page boundary (rdi is aligned). */ - testl $31, %edi + testb $(VEC_SIZE - 1), %dil jnz L(less_4_loop) leaq -(VEC_SIZE * 4)(%rdi, %rsi), %rsi addq $-(VEC_SIZE * 4), %rdi
From patchwork Tue Oct 18 23:19:38 2022 X-Patchwork-Submitter: Noah Goldstein X-Patchwork-Id: 1691740 To: libc-alpha@sourceware.org Subject: [PATCH v2 7/7] Bench: Improve benchtests for memchr, strchr, strnlen, strrchr Date: Tue, 18 Oct 2022 16:19:38 -0700 Message-Id: <20221018231938.3621554-7-goldstein.w.n@gmail.com> In-Reply-To: <20221018231938.3621554-1-goldstein.w.n@gmail.com> References: <20221018024901.3381469-1-goldstein.w.n@gmail.com> <20221018231938.3621554-1-goldstein.w.n@gmail.com> From: Noah Goldstein Reply-To: Noah Goldstein
1. Add more complete coverage in the medium size range. 2. In strnlen remove the `1 << i` which was UB (`i` could go beyond 32/64) 3.
Add timer for total benchmark runtime (useful for deciding about tradeoff between coverage and runtime). --- benchtests/bench-memchr.c | 77 +++++++++++++++++++++++++----------- benchtests/bench-rawmemchr.c | 30 ++++++++++++-- benchtests/bench-strchr.c | 35 +++++++++++----- benchtests/bench-strnlen.c | 12 +++--- benchtests/bench-strrchr.c | 28 ++++++++++++- 5 files changed, 137 insertions(+), 45 deletions(-) diff --git a/benchtests/bench-memchr.c b/benchtests/bench-memchr.c index 0facda2fa0..2ec9dd86d0 100644 --- a/benchtests/bench-memchr.c +++ b/benchtests/bench-memchr.c @@ -126,7 +126,7 @@ do_test (json_ctx_t *json_ctx, size_t align, size_t pos, size_t len, int test_main (void) { - size_t i; + size_t i, j, al, al_max; int repeats; json_ctx_t json_ctx; test_init (); @@ -147,35 +147,46 @@ test_main (void) json_array_begin (&json_ctx, "results"); + al_max = 0; +#ifdef USE_AS_MEMRCHR + al_max = getpagesize () / 2; +#endif + for (repeats = 0; repeats < 2; ++repeats) { - for (i = 1; i < 8; ++i) + for (al = 0; al <= al_max; al += getpagesize () / 2) { - do_test (&json_ctx, 0, 16 << i, 2048, 23, repeats); - do_test (&json_ctx, i, 64, 256, 23, repeats); - do_test (&json_ctx, 0, 16 << i, 2048, 0, repeats); - do_test (&json_ctx, i, 64, 256, 0, repeats); - - do_test (&json_ctx, getpagesize () - 15, 64, 256, 0, repeats); + for (i = 1; i < 8; ++i) + { + do_test (&json_ctx, al, 16 << i, 2048, 23, repeats); + do_test (&json_ctx, al + i, 64, 256, 23, repeats); + do_test (&json_ctx, al, 16 << i, 2048, 0, repeats); + do_test (&json_ctx, al + i, 64, 256, 0, repeats); + + do_test (&json_ctx, al + getpagesize () - 15, 64, 256, 0, + repeats); #ifdef USE_AS_MEMRCHR - /* Also test the position close to the beginning for memrchr. */ - do_test (&json_ctx, 0, i, 256, 23, repeats); - do_test (&json_ctx, 0, i, 256, 0, repeats); - do_test (&json_ctx, i, i, 256, 23, repeats); - do_test (&json_ctx, i, i, 256, 0, repeats); + /* Also test the position close to the beginning for memrchr. 
*/ + do_test (&json_ctx, al, i, 256, 23, repeats); + do_test (&json_ctx, al, i, 256, 0, repeats); + do_test (&json_ctx, al + i, i, 256, 23, repeats); + do_test (&json_ctx, al + i, i, 256, 0, repeats); #endif + } + for (i = 1; i < 8; ++i) + { + do_test (&json_ctx, al + i, i << 5, 192, 23, repeats); + do_test (&json_ctx, al + i, i << 5, 192, 0, repeats); + do_test (&json_ctx, al + i, i << 5, 256, 23, repeats); + do_test (&json_ctx, al + i, i << 5, 256, 0, repeats); + do_test (&json_ctx, al + i, i << 5, 512, 23, repeats); + do_test (&json_ctx, al + i, i << 5, 512, 0, repeats); + + do_test (&json_ctx, al + getpagesize () - 15, i << 5, 256, 23, + repeats); + } } - for (i = 1; i < 8; ++i) - { - do_test (&json_ctx, i, i << 5, 192, 23, repeats); - do_test (&json_ctx, i, i << 5, 192, 0, repeats); - do_test (&json_ctx, i, i << 5, 256, 23, repeats); - do_test (&json_ctx, i, i << 5, 256, 0, repeats); - do_test (&json_ctx, i, i << 5, 512, 23, repeats); - do_test (&json_ctx, i, i << 5, 512, 0, repeats); - - do_test (&json_ctx, getpagesize () - 15, i << 5, 256, 23, repeats); - } + for (i = 1; i < 32; ++i) { do_test (&json_ctx, 0, i, i + 1, 23, repeats); @@ -207,6 +218,24 @@ test_main (void) do_test (&json_ctx, 0, 2, i + 1, 0, repeats); #endif } + for (al = 0; al <= al_max; al += getpagesize () / 2) + { + for (i = (16 / sizeof (CHAR)); i <= (8192 / sizeof (CHAR)); i += i) + { + for (j = 0; j <= (384 / sizeof (CHAR)); + j += (32 / sizeof (CHAR))) + { + do_test (&json_ctx, al, i + j, i, 23, repeats); + do_test (&json_ctx, al, i, i + j, 23, repeats); + if (j < i) + { + do_test (&json_ctx, al, i - j, i, 23, repeats); + do_test (&json_ctx, al, i, i - j, 23, repeats); + } + } + } + } + #ifndef USE_AS_MEMRCHR break; #endif diff --git a/benchtests/bench-rawmemchr.c b/benchtests/bench-rawmemchr.c index b1803afc14..dab77f3858 100644 --- a/benchtests/bench-rawmemchr.c +++ b/benchtests/bench-rawmemchr.c @@ -70,7 +70,7 @@ do_test (json_ctx_t *json_ctx, size_t align, size_t pos, size_t len, int 
seek_ch size_t i; char *result; - align &= 7; + align &= getpagesize () - 1; if (align + len >= page_size) return; @@ -106,7 +106,6 @@ test_main (void) { json_ctx_t json_ctx; size_t i; - test_init (); json_init (&json_ctx, 0, stdout); @@ -120,7 +119,7 @@ test_main (void) json_array_begin (&json_ctx, "ifuncs"); FOR_EACH_IMPL (impl, 0) - json_element_string (&json_ctx, impl->name); + json_element_string (&json_ctx, impl->name); json_array_end (&json_ctx); json_array_begin (&json_ctx, "results"); @@ -137,6 +136,31 @@ test_main (void) do_test (&json_ctx, 0, i, i + 1, 23); do_test (&json_ctx, 0, i, i + 1, 0); } + for (; i < 256; i += 32) + { + do_test (&json_ctx, 0, i, i + 1, 23); + do_test (&json_ctx, 0, i - 1, i, 23); + } + for (; i < 512; i += 64) + { + do_test (&json_ctx, 0, i, i + 1, 23); + do_test (&json_ctx, 0, i - 1, i, 23); + } + for (; i < 1024; i += 128) + { + do_test (&json_ctx, 0, i, i + 1, 23); + do_test (&json_ctx, 0, i - 1, i, 23); + } + for (; i < 2048; i += 256) + { + do_test (&json_ctx, 0, i, i + 1, 23); + do_test (&json_ctx, 0, i - 1, i, 23); + } + for (; i < 4096; i += 512) + { + do_test (&json_ctx, 0, i, i + 1, 23); + do_test (&json_ctx, 0, i - 1, i, 23); + } json_array_end (&json_ctx); json_attr_object_end (&json_ctx); diff --git a/benchtests/bench-strchr.c b/benchtests/bench-strchr.c index 54640bde7e..aeb882d442 100644 --- a/benchtests/bench-strchr.c +++ b/benchtests/bench-strchr.c @@ -287,8 +287,8 @@ int test_main (void) { json_ctx_t json_ctx; - size_t i; + size_t i, j; test_init (); json_init (&json_ctx, 0, stdout); @@ -367,15 +367,30 @@ test_main (void) do_test (&json_ctx, 0, i, i + 1, 0, BIG_CHAR); } - DO_RAND_TEST(&json_ctx, 0, 15, 16, 0.0); - DO_RAND_TEST(&json_ctx, 0, 15, 16, 0.1); - DO_RAND_TEST(&json_ctx, 0, 15, 16, 0.25); - DO_RAND_TEST(&json_ctx, 0, 15, 16, 0.33); - DO_RAND_TEST(&json_ctx, 0, 15, 16, 0.5); - DO_RAND_TEST(&json_ctx, 0, 15, 16, 0.66); - DO_RAND_TEST(&json_ctx, 0, 15, 16, 0.75); - DO_RAND_TEST(&json_ctx, 0, 15, 16, 0.9); 
- DO_RAND_TEST(&json_ctx, 0, 15, 16, 1.0); + for (i = 16 / sizeof (CHAR); i <= 8192 / sizeof (CHAR); i += i) + { + for (j = 32 / sizeof (CHAR); j <= 320 / sizeof (CHAR); + j += 32 / sizeof (CHAR)) + { + do_test (&json_ctx, 0, i, i + j, 0, MIDDLE_CHAR); + do_test (&json_ctx, 0, i + j, i, 0, MIDDLE_CHAR); + if (i > j) + { + do_test (&json_ctx, 0, i, i - j, 0, MIDDLE_CHAR); + do_test (&json_ctx, 0, i - j, i, 0, MIDDLE_CHAR); + } + } + } + + DO_RAND_TEST (&json_ctx, 0, 15, 16, 0.0); + DO_RAND_TEST (&json_ctx, 0, 15, 16, 0.1); + DO_RAND_TEST (&json_ctx, 0, 15, 16, 0.25); + DO_RAND_TEST (&json_ctx, 0, 15, 16, 0.33); + DO_RAND_TEST (&json_ctx, 0, 15, 16, 0.5); + DO_RAND_TEST (&json_ctx, 0, 15, 16, 0.66); + DO_RAND_TEST (&json_ctx, 0, 15, 16, 0.75); + DO_RAND_TEST (&json_ctx, 0, 15, 16, 0.9); + DO_RAND_TEST (&json_ctx, 0, 15, 16, 1.0); json_array_end (&json_ctx); json_attr_object_end (&json_ctx); diff --git a/benchtests/bench-strnlen.c b/benchtests/bench-strnlen.c index 13b46b3f57..82c02eb6ed 100644 --- a/benchtests/bench-strnlen.c +++ b/benchtests/bench-strnlen.c @@ -195,19 +195,19 @@ test_main (void) { for (j = 0; j <= (704 / sizeof (CHAR)); j += (32 / sizeof (CHAR))) { - do_test (&json_ctx, 0, 1 << i, (i + j), BIG_CHAR); do_test (&json_ctx, 0, i + j, i, BIG_CHAR); - - do_test (&json_ctx, 64, 1 << i, (i + j), BIG_CHAR); do_test (&json_ctx, 64, i + j, i, BIG_CHAR); + do_test (&json_ctx, 0, i, i + j, BIG_CHAR); + do_test (&json_ctx, 64, i, i + j, BIG_CHAR); + if (j < i) { - do_test (&json_ctx, 0, 1 << i, i - j, BIG_CHAR); do_test (&json_ctx, 0, i - j, i, BIG_CHAR); - - do_test (&json_ctx, 64, 1 << i, i - j, BIG_CHAR); do_test (&json_ctx, 64, i - j, i, BIG_CHAR); + + do_test (&json_ctx, 0, i, i - j, BIG_CHAR); + do_test (&json_ctx, 64, i, i - j, BIG_CHAR); } } } diff --git a/benchtests/bench-strrchr.c b/benchtests/bench-strrchr.c index 7cd2a15484..3fcf3f281d 100644 --- a/benchtests/bench-strrchr.c +++ b/benchtests/bench-strrchr.c @@ -151,7 +151,7 @@ int test_main (void) { 
json_ctx_t json_ctx; - size_t i, j; + size_t i, j, k; int seek; test_init (); @@ -173,7 +173,7 @@ test_main (void) for (seek = 0; seek <= 23; seek += 23) { - for (j = 1; j < 32; j += j) + for (j = 1; j <= 256; j = (j * 4)) { for (i = 1; i < 9; ++i) { @@ -197,6 +197,30 @@ test_main (void) do_test (&json_ctx, getpagesize () - i / 2 - 1, i, i + 1, seek, SMALL_CHAR, j); } + + for (i = (16 / sizeof (CHAR)); i <= (288 / sizeof (CHAR)); i += 32) + { + do_test (&json_ctx, 0, i - 16, i, seek, SMALL_CHAR, j); + do_test (&json_ctx, 0, i, i + 16, seek, SMALL_CHAR, j); + } + + for (i = (16 / sizeof (CHAR)); i <= (2048 / sizeof (CHAR)); i += i) + { + for (k = 0; k <= (288 / sizeof (CHAR)); + k += (48 / sizeof (CHAR))) + { + do_test (&json_ctx, 0, k, i, seek, SMALL_CHAR, j); + do_test (&json_ctx, 0, i, i + k, seek, SMALL_CHAR, j); + + if (k < i) + { + do_test (&json_ctx, 0, i - k, i, seek, SMALL_CHAR, j); + do_test (&json_ctx, 0, k, i - k, seek, SMALL_CHAR, j); + do_test (&json_ctx, 0, i, i - k, seek, SMALL_CHAR, j); + } + } + } + if (seek == 0) { break;
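Regarding item 2 of the commit message above (the strnlen benchtest fix): a shift expression `1 << i` is undefined behavior once `i` reaches the bit width of int, and the benchtest loops let `i` grow well past 32. A standalone illustration of the bound being checked (hypothetical helper, not code from the patch):

```c
#include <assert.h>
#include <limits.h>

/* 1 << i is only defined when i is non-negative and strictly less than the
   bit width of int; the old strnlen benchtest loops let i exceed that bound,
   which is why the fixed loops pass the size i directly instead of 1 << i.  */
static int
shift_in_range (unsigned i)
{
  return i < sizeof (int) * CHAR_BIT;
}
```

On the usual 32-bit-int targets this rejects every shift count the old loops could produce beyond 31.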