From patchwork Mon Feb 1 00:30:14 2021
X-Patchwork-Submitter: Noah Goldstein
X-Patchwork-Id: 1433987
To: libc-alpha@sourceware.org
Subject: [PATCH v2 1/2] x86: Refactor and improve performance of strchr-avx2.S
Date: Sun, 31 Jan 2021 19:30:14 -0500
Message-Id: <20210201003014.785099-1-goldstein.w.n@gmail.com>
X-Mailer: git-send-email 2.29.2
From: Noah Goldstein
Cc: goldstein.w.n@gmail.com

No bug.  It just seemed the performance could be improved a bit.
Observed and expected behavior are unchanged.  Optimized the body of
the main loop.  Updated the page-cross logic and optimized
accordingly.  Made a few minor instruction-selection modifications.
No regressions in the test suite; both test-strchrnul and test-strchr
pass.

Signed-off-by: noah
---
Since V1: optimized more for smaller lengths.  The original version
expected the 4x loop to be hit, but the benchmarks in bench-strchr.c
indicate that optimizing for very short strings matters most.  Made
the first 32-byte check expect to find either the end of the string or
the character in question.  Also increased the number of vectors
checked in L(aligned_more) to 4.  This does cost for most alignments
if the 4x loop is hit, but is faster for strings shorter than 128
bytes.

(A rough C-intrinsics sketch of the combined CHAR/NUL test used by the
new loop is appended after the diff, for reference.)

 sysdeps/x86_64/multiarch/strchr-avx2.S | 247 ++++++++++++-------------
 sysdeps/x86_64/multiarch/strchr.c      |   1 +
 2 files changed, 124 insertions(+), 124 deletions(-)

diff --git a/sysdeps/x86_64/multiarch/strchr-avx2.S b/sysdeps/x86_64/multiarch/strchr-avx2.S
index d416558d04..3012cb6ece 100644
--- a/sysdeps/x86_64/multiarch/strchr-avx2.S
+++ b/sysdeps/x86_64/multiarch/strchr-avx2.S
@@ -27,10 +27,12 @@
 # ifdef USE_AS_WCSCHR
 #  define VPBROADCAST	vpbroadcastd
 #  define VPCMPEQ	vpcmpeqd
+#  define VPMINU	vpminud
 #  define CHAR_REG	esi
 # else
 #  define VPBROADCAST	vpbroadcastb
 #  define VPCMPEQ	vpcmpeqb
+#  define VPMINU	vpminub
 #  define CHAR_REG	sil
 # endif
 
@@ -39,19 +41,25 @@
 # endif
 
 # define VEC_SIZE 32
+# define PAGE_SIZE 4096
 
 	.section .text.avx,"ax",@progbits
 ENTRY (STRCHR)
-	movl	%edi, %ecx
+	movl	%edi, %ecx
+# ifndef USE_AS_STRCHRNUL
+	xorl	%edx, %edx
+# endif
+
 	/* Broadcast CHAR to YMM0.  */
 	vmovd	%esi, %xmm0
 	vpxor	%xmm9, %xmm9, %xmm9
 	VPBROADCAST %xmm0, %ymm0
-	/* Check if we may cross page boundary with one vector load.  */
-	andl	$(2 * VEC_SIZE - 1), %ecx
-	cmpl	$VEC_SIZE, %ecx
-	ja	L(cros_page_boundary)
-
+
+	/* Check if we cross page boundary with one vector load.  */
+	andl	$(PAGE_SIZE - 1), %ecx
+	cmpl	$(PAGE_SIZE - VEC_SIZE), %ecx
+	ja	L(cross_page_boundary)
+
 	/* Check the first VEC_SIZE bytes.  Search for both CHAR and the
 	   null byte.  */
 	vmovdqu	(%rdi), %ymm8
@@ -60,50 +68,27 @@ ENTRY (STRCHR)
 	vpor	%ymm1, %ymm2, %ymm1
 	vpmovmskb %ymm1, %eax
 	testl	%eax, %eax
-	jnz	L(first_vec_x0)
-
-	/* Align data for aligned loads in the loop.  */
-	addq	$VEC_SIZE, %rdi
-	andl	$(VEC_SIZE - 1), %ecx
-	andq	$-VEC_SIZE, %rdi
-
-	jmp	L(more_4x_vec)
-
-	.p2align 4
-L(cros_page_boundary):
-	andl	$(VEC_SIZE - 1), %ecx
-	andq	$-VEC_SIZE, %rdi
-	vmovdqu	(%rdi), %ymm8
-	VPCMPEQ %ymm8, %ymm0, %ymm1
-	VPCMPEQ %ymm8, %ymm9, %ymm2
-	vpor	%ymm1, %ymm2, %ymm1
-	vpmovmskb %ymm1, %eax
-	/* Remove the leading bytes.  */
-	sarl	%cl, %eax
-	testl	%eax, %eax
-	jz	L(aligned_more)
+	jz	L(more_vecs)
+	tzcntl	%eax, %eax
 	/* Found CHAR or the null byte.  */
-	tzcntl	%eax, %eax
-	addq	%rcx, %rax
-# ifdef USE_AS_STRCHRNUL
 	addq	%rdi, %rax
-# else
-	xorl	%edx, %edx
-	leaq	(%rdi, %rax), %rax
-	cmp	(%rax), %CHAR_REG
+# ifndef USE_AS_STRCHRNUL
+	cmp	(%rax), %CHAR_REG
 	cmovne	%rdx, %rax
# endif
 	VZEROUPPER
 	ret
 
-	.p2align 4
+	.p2align 4
+L(more_vecs):
+	/* Align data for aligned loads in the loop.  */
+	andq	$-VEC_SIZE, %rdi
 L(aligned_more):
-	addq	$VEC_SIZE, %rdi
-L(more_4x_vec):
-	/* Check the first 4 * VEC_SIZE.  Only one VEC_SIZE at a time
-	   since data is only aligned to VEC_SIZE.  */
-	vmovdqa	(%rdi), %ymm8
+	/* Check the next 4 * VEC_SIZE.  Only one VEC_SIZE at a time
+	   since data is only aligned to VEC_SIZE.  */
+	vmovdqa	VEC_SIZE(%rdi), %ymm8
+	addq	$VEC_SIZE, %rdi
 	VPCMPEQ %ymm8, %ymm0, %ymm1
 	VPCMPEQ %ymm8, %ymm9, %ymm2
 	vpor	%ymm1, %ymm2, %ymm1
@@ -125,7 +110,7 @@ L(more_4x_vec):
 	vpor	%ymm1, %ymm2, %ymm1
 	vpmovmskb %ymm1, %eax
 	testl	%eax, %eax
-	jnz	L(first_vec_x2)
+	jnz	L(first_vec_x2)
 
 	vmovdqa	(VEC_SIZE * 3)(%rdi), %ymm8
 	VPCMPEQ %ymm8, %ymm0, %ymm1
@@ -133,122 +118,136 @@ L(more_4x_vec):
 	vpor	%ymm1, %ymm2, %ymm1
 	vpmovmskb %ymm1, %eax
 	testl	%eax, %eax
-	jnz	L(first_vec_x3)
-
-	addq	$(VEC_SIZE * 4), %rdi
-
-	/* Align data to 4 * VEC_SIZE.  */
-	movq	%rdi, %rcx
-	andl	$(4 * VEC_SIZE - 1), %ecx
-	andq	$-(4 * VEC_SIZE), %rdi
-
-	.p2align 4
-L(loop_4x_vec):
-	/* Compare 4 * VEC at a time forward.  */
-	vmovdqa	(%rdi), %ymm5
-	vmovdqa	VEC_SIZE(%rdi), %ymm6
-	vmovdqa	(VEC_SIZE * 2)(%rdi), %ymm7
-	vmovdqa	(VEC_SIZE * 3)(%rdi), %ymm8
-
-	VPCMPEQ %ymm5, %ymm0, %ymm1
-	VPCMPEQ %ymm6, %ymm0, %ymm2
-	VPCMPEQ %ymm7, %ymm0, %ymm3
-	VPCMPEQ %ymm8, %ymm0, %ymm4
-
-	VPCMPEQ %ymm5, %ymm9, %ymm5
-	VPCMPEQ %ymm6, %ymm9, %ymm6
-	VPCMPEQ %ymm7, %ymm9, %ymm7
-	VPCMPEQ %ymm8, %ymm9, %ymm8
-
-	vpor	%ymm1, %ymm5, %ymm1
-	vpor	%ymm2, %ymm6, %ymm2
-	vpor	%ymm3, %ymm7, %ymm3
-	vpor	%ymm4, %ymm8, %ymm4
-
-	vpor	%ymm1, %ymm2, %ymm5
-	vpor	%ymm3, %ymm4, %ymm6
+	jz	L(prep_loop_4x)
 
-	vpor	%ymm5, %ymm6, %ymm5
-
-	vpmovmskb %ymm5, %eax
-	testl	%eax, %eax
-	jnz	L(4x_vec_end)
-
-	addq	$(VEC_SIZE * 4), %rdi
-
-	jmp	L(loop_4x_vec)
+	tzcntl	%eax, %eax
+	leaq	(VEC_SIZE * 3)(%rdi, %rax), %rax
+# ifndef USE_AS_STRCHRNUL
+	cmp	(%rax), %CHAR_REG
+	cmovne	%rdx, %rax
+# endif
+	VZEROUPPER
+	ret
 
-	.p2align 4
+	.p2align 4
 L(first_vec_x0):
+	tzcntl	%eax, %eax
 	/* Found CHAR or the null byte.  */
-	tzcntl	%eax, %eax
-# ifdef USE_AS_STRCHRNUL
 	addq	%rdi, %rax
-# else
-	xorl	%edx, %edx
-	leaq	(%rdi, %rax), %rax
-	cmp	(%rax), %CHAR_REG
+# ifndef USE_AS_STRCHRNUL
+	cmp	(%rax), %CHAR_REG
 	cmovne	%rdx, %rax
 # endif
 	VZEROUPPER
 	ret
-
+
 	.p2align 4
 L(first_vec_x1):
-	tzcntl	%eax, %eax
-# ifdef USE_AS_STRCHRNUL
-	addq	$VEC_SIZE, %rax
-	addq	%rdi, %rax
-# else
-	xorl	%edx, %edx
-	leaq	VEC_SIZE(%rdi, %rax), %rax
-	cmp	(%rax), %CHAR_REG
+	tzcntl	%eax, %eax
+	leaq	VEC_SIZE(%rdi, %rax), %rax
+# ifndef USE_AS_STRCHRNUL
+	cmp	(%rax), %CHAR_REG
 	cmovne	%rdx, %rax
 # endif
 	VZEROUPPER
-	ret
-
-	.p2align 4
+	ret
+
+	.p2align 4
 L(first_vec_x2):
-	tzcntl	%eax, %eax
-# ifdef USE_AS_STRCHRNUL
-	addq	$(VEC_SIZE * 2), %rax
-	addq	%rdi, %rax
-# else
-	xorl	%edx, %edx
+	tzcntl	%eax, %eax
+	/* Found CHAR or the null byte.  */
 	leaq	(VEC_SIZE * 2)(%rdi, %rax), %rax
-	cmp	(%rax), %CHAR_REG
+# ifndef USE_AS_STRCHRNUL
+	cmp	(%rax), %CHAR_REG
 	cmovne	%rdx, %rax
 # endif
 	VZEROUPPER
 	ret
+
+L(prep_loop_4x):
+	/* Align data to 4 * VEC_SIZE.  */
+	andq	$-(VEC_SIZE * 4), %rdi
 
 	.p2align 4
-L(4x_vec_end):
+L(loop_4x_vec):
+	/* Compare 4 * VEC at a time forward.  */
+	vmovdqa	(VEC_SIZE * 4)(%rdi), %ymm5
+	vmovdqa	(VEC_SIZE * 5)(%rdi), %ymm6
+	vmovdqa	(VEC_SIZE * 6)(%rdi), %ymm7
+	vmovdqa	(VEC_SIZE * 7)(%rdi), %ymm8
+
+	/* Leaves only CHARS matching esi as 0.  */
+	vpxor	%ymm5, %ymm0, %ymm1
+	vpxor	%ymm6, %ymm0, %ymm2
+	vpxor	%ymm7, %ymm0, %ymm3
+	vpxor	%ymm8, %ymm0, %ymm4
+
+	VPMINU	%ymm1, %ymm5, %ymm1
+	VPMINU	%ymm2, %ymm6, %ymm2
+	VPMINU	%ymm3, %ymm7, %ymm3
+	VPMINU	%ymm4, %ymm8, %ymm4
+
+	VPMINU	%ymm1, %ymm2, %ymm5
+	VPMINU	%ymm3, %ymm4, %ymm6
+
+	VPMINU	%ymm5, %ymm6, %ymm5
+
+	VPCMPEQ %ymm5, %ymm9, %ymm5
+	vpmovmskb %ymm5, %eax
+
+	addq	$(VEC_SIZE * 4), %rdi
+	testl	%eax, %eax
+	jz	L(loop_4x_vec)
+
+	VPCMPEQ %ymm1, %ymm9, %ymm1
 	vpmovmskb %ymm1, %eax
 	testl	%eax, %eax
 	jnz	L(first_vec_x0)
-	vpmovmskb %ymm2, %eax
+
+	VPCMPEQ %ymm2, %ymm9, %ymm2
+	vpmovmskb %ymm2, %eax
 	testl	%eax, %eax
 	jnz	L(first_vec_x1)
-	vpmovmskb %ymm3, %eax
-	testl	%eax, %eax
-	jnz	L(first_vec_x2)
+
+	VPCMPEQ %ymm3, %ymm9, %ymm3
+	VPCMPEQ %ymm4, %ymm9, %ymm4
+	vpmovmskb %ymm3, %ecx
 	vpmovmskb %ymm4, %eax
+	salq	$32, %rax
+	orq	%rcx, %rax
+	tzcntq	%rax, %rax
+	leaq	(VEC_SIZE * 2)(%rdi, %rax), %rax
+# ifndef USE_AS_STRCHRNUL
+	cmp	(%rax), %CHAR_REG
+	cmovne	%rdx, %rax
+# endif
+	VZEROUPPER
+	ret
+
+	/* Cold case for crossing page with first load.  */
+	.p2align 4
+L(cross_page_boundary):
+	andq	$-VEC_SIZE, %rdi
+	andl	$(VEC_SIZE - 1), %ecx
+
+	vmovdqa	(%rdi), %ymm8
+	VPCMPEQ %ymm8, %ymm0, %ymm1
+	VPCMPEQ %ymm8, %ymm9, %ymm2
+	vpor	%ymm1, %ymm2, %ymm1
+	vpmovmskb %ymm1, %eax
+	/* Remove the leading bits.  */
+	sarxl	%ecx, %eax, %eax
 	testl	%eax, %eax
-L(first_vec_x3):
+	jz	L(aligned_more)
 	tzcntl	%eax, %eax
-# ifdef USE_AS_STRCHRNUL
-	addq	$(VEC_SIZE * 3), %rax
+	addq	%rcx, %rdi
 	addq	%rdi, %rax
-# else
-	xorl	%edx, %edx
-	leaq	(VEC_SIZE * 3)(%rdi, %rax), %rax
-	cmp	(%rax), %CHAR_REG
+# ifndef USE_AS_STRCHRNUL
+	cmp	(%rax), %CHAR_REG
 	cmovne	%rdx, %rax
 # endif
 	VZEROUPPER
 	ret
 
 END (STRCHR)
-#endif
+# endif
diff --git a/sysdeps/x86_64/multiarch/strchr.c b/sysdeps/x86_64/multiarch/strchr.c
index 583a152794..4dfbe3b58b 100644
--- a/sysdeps/x86_64/multiarch/strchr.c
+++ b/sysdeps/x86_64/multiarch/strchr.c
@@ -37,6 +37,7 @@ IFUNC_SELECTOR (void)
 
   if (!CPU_FEATURES_ARCH_P (cpu_features, Prefer_No_VZEROUPPER)
       && CPU_FEATURE_USABLE_P (cpu_features, AVX2)
+      && CPU_FEATURE_USABLE_P (cpu_features, BMI2)
       && CPU_FEATURES_ARCH_P (cpu_features, AVX_Fast_Unaligned_Load))
     return OPTIMIZE (avx2);
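
For reference, here is a rough, standalone C/AVX2-intrinsics sketch of the
two ideas described above.  It is not part of the patch; the helper names
(char_or_nul_mask, load_would_cross_page) and the toy buffer are made up
for illustration, and it assumes a compiler with -mavx2.  It shows how one
XOR against the broadcast CHAR plus an unsigned byte minimum with the
original data folds the CHAR test and the null-terminator test into a
single compare against zero (the vpxor/VPMINU/VPCMPEQ sequence in the new
4x loop), and how the new entry check decides whether an unaligned 32-byte
load would cross a page.

#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>

#define VEC_SIZE  32
#define PAGE_SIZE 4096

/* Return a 32-bit mask with bit i set iff p[i] == c or p[i] == '\0'.  */
static inline uint32_t
char_or_nul_mask (const char *p, char c)
{
  __m256i data  = _mm256_loadu_si256 ((const __m256i *) p);
  __m256i vchar = _mm256_set1_epi8 (c);
  /* XOR leaves a zero byte exactly where the data byte equals CHAR.  */
  __m256i x = _mm256_xor_si256 (data, vchar);
  /* Unsigned min with the original data is zero where either the XOR
     result is zero (CHAR match) or the data byte itself is zero (NUL).  */
  __m256i m = _mm256_min_epu8 (x, data);
  /* One compare against zero now covers both terminating conditions.  */
  __m256i z = _mm256_cmpeq_epi8 (m, _mm256_setzero_si256 ());
  return (uint32_t) _mm256_movemask_epi8 (z);
}

/* Mirror of the new entry check: the unaligned 32-byte load is only
   unsafe when the address sits in the last VEC_SIZE bytes of a page.  */
static inline int
load_would_cross_page (const char *p)
{
  return ((uintptr_t) p & (PAGE_SIZE - 1)) > PAGE_SIZE - VEC_SIZE;
}

int
main (void)
{
  /* Toy usage on a buffer padded out to a full vector.  */
  char buf[VEC_SIZE] = "strchr-avx2 refactor";
  uint32_t mask = char_or_nul_mask (buf, 'f');
  if (mask != 0)
    printf ("first CHAR or NUL at offset %d (would cross page: %d)\n",
            __builtin_ctz (mask), load_would_cross_page (buf));
  return 0;
}

Per 32-byte vector this matches what the assembly computes; the real
routine additionally processes four vectors per iteration and, on the
page-cross path, uses sarx to shift out the bytes before the start of the
string, which is why the IFUNC selector in strchr.c now also requires BMI2.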