From patchwork Tue Jul 12 19:29:05 2022
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Noah Goldstein
X-Patchwork-Id: 1655616
From: Noah Goldstein
To: libc-alpha@sourceware.org
Subject: [PATCH v1] x86: Move strchr SSE2 implementation to multiarch/strchr-sse2.S
Date: Tue, 12 Jul 2022 12:29:05 -0700
Message-Id: <20220712192910.351121-5-goldstein.w.n@gmail.com>
In-Reply-To: <20220712192910.351121-1-goldstein.w.n@gmail.com>
References: <20220712192910.351121-1-goldstein.w.n@gmail.com>
X-Mailer: git-send-email 2.34.1
List-Id: Libc-alpha mailing list

This commit doesn't affect libc.so.6; it's just housekeeping to prepare
for adding explicit ISA level support.

Tested build on x86_64 and x86_32 with/without multiarch.
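Reading aid (not part of the change): in the diff that follows, a single assembly body serves both strchr and strchrnul, with the `AS_STRCHRNUL` macro selecting the return path. The scalar C model below is only a sketch of that difference, assuming the usual strchr/strchrnul semantics. The names `model_strchr` and `model_strchrnul` are hypothetical and do not exist in glibc.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical scalar model of the AS_STRCHRNUL path: return a pointer
   to the first byte equal to C, or to the terminating NUL if C does
   not occur in S.  */
const char *
model_strchrnul (const char *s, int c)
{
  assert (s != NULL);
  unsigned char ch = (unsigned char) c;
  while (*s != '\0' && (unsigned char) *s != ch)
    s++;
  return s;
}

/* Hypothetical scalar model of the default path: the same scan, then
   one extra compare.  This mirrors the
     cmpb %sil, (%rax); cmovne %rdx, %rax
   tail, which turns "stopped at the NUL" into a NULL result unless C
   itself is NUL.  */
const char *
model_strchr (const char *s, int c)
{
  const char *p = model_strchrnul (s, c);
  return ((unsigned char) *p == (unsigned char) c) ? p : NULL;
}
```

With `s = "hello"`, `model_strchr (s, 'z')` yields NULL while `model_strchrnul (s, 'z')` yields `s + 5`; searching for `'\0'` yields `s + 5` from both.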
--- sysdeps/x86_64/multiarch/rtld-strchr.S | 18 +++ sysdeps/x86_64/multiarch/rtld-strchrnul.S | 18 +++ sysdeps/x86_64/multiarch/strchr-sse2.S | 175 +++++++++++++++++++++- sysdeps/x86_64/multiarch/strchrnul-sse2.S | 11 +- sysdeps/x86_64/strchr.S | 167 +-------------------- sysdeps/x86_64/strchrnul.S | 7 +- 6 files changed, 213 insertions(+), 183 deletions(-) create mode 100644 sysdeps/x86_64/multiarch/rtld-strchr.S create mode 100644 sysdeps/x86_64/multiarch/rtld-strchrnul.S diff --git a/sysdeps/x86_64/multiarch/rtld-strchr.S b/sysdeps/x86_64/multiarch/rtld-strchr.S new file mode 100644 index 0000000000..2b7b879e37 --- /dev/null +++ b/sysdeps/x86_64/multiarch/rtld-strchr.S @@ -0,0 +1,18 @@ +/* Copyright (C) 2022 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +#include "../strchr.S" diff --git a/sysdeps/x86_64/multiarch/rtld-strchrnul.S b/sysdeps/x86_64/multiarch/rtld-strchrnul.S new file mode 100644 index 0000000000..0cc5becc88 --- /dev/null +++ b/sysdeps/x86_64/multiarch/rtld-strchrnul.S @@ -0,0 +1,18 @@ +/* Copyright (C) 2022 Free Software Foundation, Inc. + This file is part of the GNU C Library. 
+ + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + . */ + +#include "../strchrnul.S" diff --git a/sysdeps/x86_64/multiarch/strchr-sse2.S b/sysdeps/x86_64/multiarch/strchr-sse2.S index 992f700077..f7767ca543 100644 --- a/sysdeps/x86_64/multiarch/strchr-sse2.S +++ b/sysdeps/x86_64/multiarch/strchr-sse2.S @@ -16,13 +16,172 @@ License along with the GNU C Library; if not, see . */ -#if IS_IN (libc) -# define strchr __strchr_sse2 +#if IS_IN (libc) || defined STRCHR +# ifndef STRCHR +# define STRCHR __strchr_sse2 +# endif -# undef weak_alias -# define weak_alias(strchr, index) -# undef libc_hidden_builtin_def -# define libc_hidden_builtin_def(strchr) -#endif +# include + + .text +ENTRY (STRCHR) + movd %esi, %xmm1 + movl %edi, %eax + andl $4095, %eax + punpcklbw %xmm1, %xmm1 + cmpl $4032, %eax + punpcklwd %xmm1, %xmm1 + pshufd $0, %xmm1, %xmm1 + jg L(cross_page) + movdqu (%rdi), %xmm0 + pxor %xmm3, %xmm3 + movdqa %xmm0, %xmm4 + pcmpeqb %xmm1, %xmm0 + pcmpeqb %xmm3, %xmm4 + por %xmm4, %xmm0 + pmovmskb %xmm0, %eax + test %eax, %eax + je L(next_48_bytes) + bsf %eax, %eax +# ifdef AS_STRCHRNUL + leaq (%rdi,%rax), %rax +# else + movl $0, %edx + leaq (%rdi,%rax), %rax + cmpb %sil, (%rax) + cmovne %rdx, %rax +# endif + ret + + .p2align 3 +L(next_48_bytes): + movdqu 16(%rdi), %xmm0 + movdqa %xmm0, %xmm4 + pcmpeqb %xmm1, %xmm0 + pcmpeqb %xmm3, %xmm4 + por %xmm4, %xmm0 + pmovmskb %xmm0, %ecx + movdqu 
32(%rdi), %xmm0 + movdqa %xmm0, %xmm4 + pcmpeqb %xmm1, %xmm0 + salq $16, %rcx + pcmpeqb %xmm3, %xmm4 + por %xmm4, %xmm0 + pmovmskb %xmm0, %eax + movdqu 48(%rdi), %xmm0 + pcmpeqb %xmm0, %xmm3 + salq $32, %rax + pcmpeqb %xmm1, %xmm0 + orq %rcx, %rax + por %xmm3, %xmm0 + pmovmskb %xmm0, %ecx + salq $48, %rcx + orq %rcx, %rax + testq %rax, %rax + jne L(return) +L(loop_start): + /* We use this alignment to force loop be aligned to 8 but not + 16 bytes. This gives better sheduling on AMD processors. */ + .p2align 4 + pxor %xmm6, %xmm6 + andq $-64, %rdi + .p2align 3 +L(loop64): + addq $64, %rdi + movdqa (%rdi), %xmm5 + movdqa 16(%rdi), %xmm2 + movdqa 32(%rdi), %xmm3 + pxor %xmm1, %xmm5 + movdqa 48(%rdi), %xmm4 + pxor %xmm1, %xmm2 + pxor %xmm1, %xmm3 + pminub (%rdi), %xmm5 + pxor %xmm1, %xmm4 + pminub 16(%rdi), %xmm2 + pminub 32(%rdi), %xmm3 + pminub %xmm2, %xmm5 + pminub 48(%rdi), %xmm4 + pminub %xmm3, %xmm5 + pminub %xmm4, %xmm5 + pcmpeqb %xmm6, %xmm5 + pmovmskb %xmm5, %eax + + testl %eax, %eax + je L(loop64) -#include "../strchr.S" + movdqa (%rdi), %xmm5 + movdqa %xmm5, %xmm0 + pcmpeqb %xmm1, %xmm5 + pcmpeqb %xmm6, %xmm0 + por %xmm0, %xmm5 + pcmpeqb %xmm6, %xmm2 + pcmpeqb %xmm6, %xmm3 + pcmpeqb %xmm6, %xmm4 + + pmovmskb %xmm5, %ecx + pmovmskb %xmm2, %eax + salq $16, %rax + pmovmskb %xmm3, %r8d + pmovmskb %xmm4, %edx + salq $32, %r8 + orq %r8, %rax + orq %rcx, %rax + salq $48, %rdx + orq %rdx, %rax + .p2align 3 +L(return): + bsfq %rax, %rax +# ifdef AS_STRCHRNUL + leaq (%rdi,%rax), %rax +# else + movl $0, %edx + leaq (%rdi,%rax), %rax + cmpb %sil, (%rax) + cmovne %rdx, %rax +# endif + ret + .p2align 4 + +L(cross_page): + movq %rdi, %rdx + pxor %xmm2, %xmm2 + andq $-64, %rdx + movdqa %xmm1, %xmm0 + movdqa (%rdx), %xmm3 + movdqa %xmm3, %xmm4 + pcmpeqb %xmm1, %xmm3 + pcmpeqb %xmm2, %xmm4 + por %xmm4, %xmm3 + pmovmskb %xmm3, %r8d + movdqa 16(%rdx), %xmm3 + movdqa %xmm3, %xmm4 + pcmpeqb %xmm1, %xmm3 + pcmpeqb %xmm2, %xmm4 + por %xmm4, %xmm3 + pmovmskb %xmm3, %eax + movdqa 
32(%rdx), %xmm3 + movdqa %xmm3, %xmm4 + pcmpeqb %xmm1, %xmm3 + salq $16, %rax + pcmpeqb %xmm2, %xmm4 + por %xmm4, %xmm3 + pmovmskb %xmm3, %r9d + movdqa 48(%rdx), %xmm3 + pcmpeqb %xmm3, %xmm2 + salq $32, %r9 + pcmpeqb %xmm3, %xmm0 + orq %r9, %rax + orq %r8, %rax + por %xmm2, %xmm0 + pmovmskb %xmm0, %ecx + salq $48, %rcx + orq %rcx, %rax + movl %edi, %ecx + subb %dl, %cl + shrq %cl, %rax + testq %rax, %rax + jne L(return) + jmp L(loop_start) + +END (STRCHR) +#endif diff --git a/sysdeps/x86_64/multiarch/strchrnul-sse2.S b/sysdeps/x86_64/multiarch/strchrnul-sse2.S index f91c670369..7238977a21 100644 --- a/sysdeps/x86_64/multiarch/strchrnul-sse2.S +++ b/sysdeps/x86_64/multiarch/strchrnul-sse2.S @@ -17,10 +17,11 @@ . */ #if IS_IN (libc) -# define __strchrnul __strchrnul_sse2 - -# undef weak_alias -# define weak_alias(__strchrnul, strchrnul) +# ifndef STRCHR +# define STRCHR __strchrnul_sse2 +# endif #endif -#include "../strchrnul.S" +#define AS_STRCHRNUL + +#include "strchr-sse2.S" diff --git a/sysdeps/x86_64/strchr.S b/sysdeps/x86_64/strchr.S index dda7c0431d..77c956c92c 100644 --- a/sysdeps/x86_64/strchr.S +++ b/sysdeps/x86_64/strchr.S @@ -17,171 +17,8 @@ License along with the GNU C Library; if not, see . 
*/ -#include - .text -ENTRY (strchr) - movd %esi, %xmm1 - movl %edi, %eax - andl $4095, %eax - punpcklbw %xmm1, %xmm1 - cmpl $4032, %eax - punpcklwd %xmm1, %xmm1 - pshufd $0, %xmm1, %xmm1 - jg L(cross_page) - movdqu (%rdi), %xmm0 - pxor %xmm3, %xmm3 - movdqa %xmm0, %xmm4 - pcmpeqb %xmm1, %xmm0 - pcmpeqb %xmm3, %xmm4 - por %xmm4, %xmm0 - pmovmskb %xmm0, %eax - test %eax, %eax - je L(next_48_bytes) - bsf %eax, %eax -#ifdef AS_STRCHRNUL - leaq (%rdi,%rax), %rax -#else - movl $0, %edx - leaq (%rdi,%rax), %rax - cmpb %sil, (%rax) - cmovne %rdx, %rax -#endif - ret - - .p2align 3 - L(next_48_bytes): - movdqu 16(%rdi), %xmm0 - movdqa %xmm0, %xmm4 - pcmpeqb %xmm1, %xmm0 - pcmpeqb %xmm3, %xmm4 - por %xmm4, %xmm0 - pmovmskb %xmm0, %ecx - movdqu 32(%rdi), %xmm0 - movdqa %xmm0, %xmm4 - pcmpeqb %xmm1, %xmm0 - salq $16, %rcx - pcmpeqb %xmm3, %xmm4 - por %xmm4, %xmm0 - pmovmskb %xmm0, %eax - movdqu 48(%rdi), %xmm0 - pcmpeqb %xmm0, %xmm3 - salq $32, %rax - pcmpeqb %xmm1, %xmm0 - orq %rcx, %rax - por %xmm3, %xmm0 - pmovmskb %xmm0, %ecx - salq $48, %rcx - orq %rcx, %rax - testq %rax, %rax - jne L(return) -L(loop_start): - /* We use this alignment to force loop be aligned to 8 but not - 16 bytes. This gives better sheduling on AMD processors. 
*/ - .p2align 4 - pxor %xmm6, %xmm6 - andq $-64, %rdi - .p2align 3 -L(loop64): - addq $64, %rdi - movdqa (%rdi), %xmm5 - movdqa 16(%rdi), %xmm2 - movdqa 32(%rdi), %xmm3 - pxor %xmm1, %xmm5 - movdqa 48(%rdi), %xmm4 - pxor %xmm1, %xmm2 - pxor %xmm1, %xmm3 - pminub (%rdi), %xmm5 - pxor %xmm1, %xmm4 - pminub 16(%rdi), %xmm2 - pminub 32(%rdi), %xmm3 - pminub %xmm2, %xmm5 - pminub 48(%rdi), %xmm4 - pminub %xmm3, %xmm5 - pminub %xmm4, %xmm5 - pcmpeqb %xmm6, %xmm5 - pmovmskb %xmm5, %eax - - testl %eax, %eax - je L(loop64) - - movdqa (%rdi), %xmm5 - movdqa %xmm5, %xmm0 - pcmpeqb %xmm1, %xmm5 - pcmpeqb %xmm6, %xmm0 - por %xmm0, %xmm5 - pcmpeqb %xmm6, %xmm2 - pcmpeqb %xmm6, %xmm3 - pcmpeqb %xmm6, %xmm4 - - pmovmskb %xmm5, %ecx - pmovmskb %xmm2, %eax - salq $16, %rax - pmovmskb %xmm3, %r8d - pmovmskb %xmm4, %edx - salq $32, %r8 - orq %r8, %rax - orq %rcx, %rax - salq $48, %rdx - orq %rdx, %rax - .p2align 3 -L(return): - bsfq %rax, %rax -#ifdef AS_STRCHRNUL - leaq (%rdi,%rax), %rax -#else - movl $0, %edx - leaq (%rdi,%rax), %rax - cmpb %sil, (%rax) - cmovne %rdx, %rax -#endif - ret - .p2align 4 - -L(cross_page): - movq %rdi, %rdx - pxor %xmm2, %xmm2 - andq $-64, %rdx - movdqa %xmm1, %xmm0 - movdqa (%rdx), %xmm3 - movdqa %xmm3, %xmm4 - pcmpeqb %xmm1, %xmm3 - pcmpeqb %xmm2, %xmm4 - por %xmm4, %xmm3 - pmovmskb %xmm3, %r8d - movdqa 16(%rdx), %xmm3 - movdqa %xmm3, %xmm4 - pcmpeqb %xmm1, %xmm3 - pcmpeqb %xmm2, %xmm4 - por %xmm4, %xmm3 - pmovmskb %xmm3, %eax - movdqa 32(%rdx), %xmm3 - movdqa %xmm3, %xmm4 - pcmpeqb %xmm1, %xmm3 - salq $16, %rax - pcmpeqb %xmm2, %xmm4 - por %xmm4, %xmm3 - pmovmskb %xmm3, %r9d - movdqa 48(%rdx), %xmm3 - pcmpeqb %xmm3, %xmm2 - salq $32, %r9 - pcmpeqb %xmm3, %xmm0 - orq %r9, %rax - orq %r8, %rax - por %xmm2, %xmm0 - pmovmskb %xmm0, %ecx - salq $48, %rcx - orq %rcx, %rax - movl %edi, %ecx - subb %dl, %cl - shrq %cl, %rax - testq %rax, %rax - jne L(return) - jmp L(loop_start) - -END (strchr) - -#ifndef AS_STRCHRNUL +#define STRCHR strchr +#include 
"multiarch/strchr-sse2.S" weak_alias (strchr, index) libc_hidden_builtin_def (strchr) -#endif diff --git a/sysdeps/x86_64/strchrnul.S b/sysdeps/x86_64/strchrnul.S index ec2e652e25..508e42db26 100644 --- a/sysdeps/x86_64/strchrnul.S +++ b/sysdeps/x86_64/strchrnul.S @@ -18,10 +18,7 @@ License along with the GNU C Library; if not, see . */ -#include - -#define strchr __strchrnul -#define AS_STRCHRNUL -#include "strchr.S" +#define STRCHR __strchrnul +#include "multiarch/strchrnul-sse2.S" weak_alias (__strchrnul, strchrnul)