From: "Roger Sayle"
To: "'Uros Bizjak'"
Cc: gcc-patches@gcc.gnu.org
Subject: [x86 PATCH take 2] Improved V1TI (and V2DI) mode equality/inequality.
Date: Fri, 13 May 2022 10:46:02 +0100
Message-ID: <01b001d866ae$42c7e990$c857bcb0$@nextmovesoftware.com>

Hi Uros,

Now that we're back in stage 1, here's the revised version of the patch
I submitted here:
https://gcc.gnu.org/pipermail/gcc-patches/2022-April/593434.html
incorporating all the suggested improvements from your review here:
https://gcc.gnu.org/pipermail/gcc-patches/2022-April/593454.html

This revised patch has been (re)tested against mainline (GCC 13) on
x86_64-pc-linux-gnu with make bootstrap and make -k check, both with
and without --target_board=unix{-m32}, with no new failures.
Ok for mainline?


2022-05-13  Roger Sayle
            Uroš Bizjak

gcc/ChangeLog
        * config/i386/sse.md (vec_cmpeqv2div2di): Enable for TARGET_SSE2.
        For !TARGET_SSE4_1, expand as a V4SI vector comparison, followed
        by a pshufd and pand.
        (vec_cmpeqv1tiv1ti): New define_expand implementing V1TImode
        vector equality as a V2DImode vector comparison (see above),
        followed by a pshufd and pand.

gcc/testsuite/ChangeLog
        * gcc.target/i386/sse2-v1ti-veq.c: New test case.
        * gcc.target/i386/sse2-v1ti-vne.c: New test case.

Thanks in advance,
Roger
---

> -----Original Message-----
> From: Uros Bizjak
> Sent: 21 April 2022 10:31
> To: Roger Sayle
> Cc: gcc-patches@gcc.gnu.org
> Subject: Re: [x86 PATCH] Improved V1TI (and V2DI) mode equality/inequality.
>
> On Wed, Apr 20, 2022 at 8:28 PM Roger Sayle wrote:
> >
> >
> > Doh! ENOPATCH.
> >
> > > -----Original Message-----
> > > From: Roger Sayle
> > > Sent: 20 April 2022 18:50
> > > To: 'gcc-patches@gcc.gnu.org'
> > > Subject: [x86 PATCH] Improved V1TI (and V2DI) mode equality/inequality.
> > >
> > >
> > > This patch (for when the compiler returns to stage 1) improves
> > > support for vector equality and inequality of V1TImode vectors, and
> > > V2DImode vectors with sse2 but not sse4. Consider the three
> > > functions below:
> > >
> > > typedef unsigned int uv4si __attribute__ ((__vector_size__ (16)));
> > > typedef unsigned long long uv2di __attribute__ ((__vector_size__ (16)));
> > > typedef unsigned __int128 uv1ti __attribute__ ((__vector_size__ (16)));
> > >
> > > uv4si eq_v4si(uv4si x, uv4si y) { return x == y; }
> > > uv2di eq_v2di(uv2di x, uv2di y) { return x == y; }
> > > uv1ti eq_v1ti(uv1ti x, uv1ti y) { return x == y; }
> > >
> > > These all perform vector comparisons of 128bit SSE2 registers,
> > > generating the result as a vector, where ~0 (all 1 bits) represents
> > > true and a zero represents false. eq_v4si is trivially implemented
> > > by x86_64's pcmpeqd instruction. This patch improves the other two
> > > cases:
> > >
> > > For v2di, gcc -O2 currently generates:
> > >
> > >         movq    %xmm0, %rdx
> > >         movq    %xmm1, %rax
> > >         movdqa  %xmm0, %xmm2
> > >         cmpq    %rax, %rdx
> > >         movhlps %xmm2, %xmm3
> > >         movhlps %xmm1, %xmm4
> > >         sete    %al
> > >         movq    %xmm3, %rdx
> > >         movzbl  %al, %eax
> > >         negq    %rax
> > >         movq    %rax, %xmm0
> > >         movq    %xmm4, %rax
> > >         cmpq    %rax, %rdx
> > >         sete    %al
> > >         movzbl  %al, %eax
> > >         negq    %rax
> > >         movq    %rax, %xmm5
> > >         punpcklqdq      %xmm5, %xmm0
> > >         ret
> > >
> > > but with this patch we now generate:
> > >
> > >         pcmpeqd %xmm0, %xmm1
> > >         pshufd  $177, %xmm1, %xmm0
> > >         pand    %xmm1, %xmm0
> > >         ret
> > >
> > > where the results of a V4SI comparison are shuffled and bit-wise
> > > ANDed to produce the desired result. There's no change in the code
> > > generated for "-O2 -msse4" where the compiler generates a single
> > > "pcmpeqq" insn.
> > >
> > > For V1TI mode, the results are equally dramatic, where the current
> > > -O2 output looks like:
> > >
> > >         movaps  %xmm0, -40(%rsp)
> > >         movq    -40(%rsp), %rax
> > >         movq    -32(%rsp), %rdx
> > >         movaps  %xmm1, -24(%rsp)
> > >         movq    -24(%rsp), %rcx
> > >         movq    -16(%rsp), %rsi
> > >         xorq    %rcx, %rax
> > >         xorq    %rsi, %rdx
> > >         orq     %rdx, %rax
> > >         sete    %al
> > >         xorl    %edx, %edx
> > >         movzbl  %al, %eax
> > >         negq    %rax
> > >         adcq    $0, %rdx
> > >         movq    %rax, %xmm2
> > >         negq    %rdx
> > >         movq    %rdx, -40(%rsp)
> > >         movhps  -40(%rsp), %xmm2
> > >         movdqa  %xmm2, %xmm0
> > >         ret
> > >
> > > with this patch we now generate:
> > >
> > >         pcmpeqd %xmm0, %xmm1
> > >         pshufd  $177, %xmm1, %xmm0
> > >         pand    %xmm1, %xmm0
> > >         pshufd  $78, %xmm0, %xmm1
> > >         pand    %xmm1, %xmm0
> > >         ret
> > >
> > > performing a V2DI comparison, followed by a shuffle and pand, and
> > > with -O2 -msse4 take advantages of SSE4.1's pcmpeqq:
> > >
> > >         pcmpeqq %xmm0, %xmm1
> > >         pshufd  $78, %xmm1, %xmm0
> > >         pand    %xmm1, %xmm0
> > >         ret
> > >
> > >
> > > This patch has been tested on x86_64-pc-linux-gnu with make
> > > bootstrap and make -k check, both with and without
> > > --target_board=unix{-m32}, with no new failures. Is this OK for
> > > when we return to stage 1?
> > >
> > >
> > > 2022-04-20  Roger Sayle
> > >
> > > gcc/ChangeLog
> > >         * config/i386/sse.md (vec_cmpeqv2div2di): Enable for TARGET_SSE2.
> > >         For !TARGET_SSE4_1, expand as a V4SI vector comparison, followed
> > >         by a pshufd and pand.
> > >         (vec_cmpeqv1tiv1ti): New define_expand implementing V1TImode
> > >         vector equality as a V2DImode vector comparison (see above),
> > >         followed by a pshufd and pand.
> > >
> > > gcc/testsuite/ChangeLog
> > >         * gcc.target/i386/sse2-v1ti-veq.c: New test case.
> > >         * gcc.target/i386/sse2-v1ti-vne.c: New test case.
> > >
> >
> +  bool ok;
> +  if (!TARGET_SSE4_1)
> +    {
> +      rtx ops[4];
> +      ops[0] = gen_reg_rtx (V4SImode);
> +      ops[2] = force_reg (V4SImode, gen_lowpart (V4SImode, operands[2]));
> +      ops[3] = force_reg (V4SImode, gen_lowpart (V4SImode,
> + operands[3]));
>
> In general, this is better written as e.g.:
>
> gen_lowpart (V4SImode, force_reg (V2DImode, operands[2]))
>
> This ensures that we get a subreg of V2DImode register, and avoids problems
> with gen_lowpart. Also, other expander functions should be prepared to handle
> subregs, so in
>
> +  rtx tmp2 = force_reg (V4SImode, gen_lowpart (V4SImode, dst));
> +  emit_insn (gen_sse2_pshufd (tmp1, tmp2, GEN_INT (0x4e)));
>
> forcing a subreg to a register before the call to gen_sse2_pshufd is not needed,
> since dst is already a register.
>
> Uros.

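For readers who want to experiment outside the compiler, the SSE2-only
sequences quoted above can be approximated with intrinsics. This is only
a sketch and not part of the patch: it assumes an x86 target with SSE2
and <emmintrin.h>, and the names eq_v2di_sse2, eq_v1ti_sse2 and lane64
are hypothetical helpers invented for the illustration.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* V2DI equality with SSE2 only: compare the 32-bit lanes, swap the two
   32-bit halves inside each 64-bit lane, and AND the two masks, so a
   64-bit lane is all ones only if both of its halves compared equal.  */
static __m128i
eq_v2di_sse2 (__m128i x, __m128i y)
{
  __m128i eq32 = _mm_cmpeq_epi32 (x, y);             /* pcmpeqd */
  __m128i swapped = _mm_shuffle_epi32 (eq32, 0xb1);  /* pshufd $177 */
  return _mm_and_si128 (eq32, swapped);              /* pand */
}

/* V1TI equality: start from the V2DI mask, swap the two 64-bit halves,
   and AND again, so the whole register is all ones only if both 64-bit
   lanes compared equal.  */
static __m128i
eq_v1ti_sse2 (__m128i x, __m128i y)
{
  __m128i eq64 = eq_v2di_sse2 (x, y);
  __m128i swapped = _mm_shuffle_epi32 (eq64, 0x4e);  /* pshufd $78 */
  return _mm_and_si128 (eq64, swapped);              /* pand */
}

/* Hypothetical helper for inspecting results: return 64-bit lane I
   (0 = low, 1 = high) of V.  */
static uint64_t
lane64 (__m128i v, int i)
{
  uint64_t out[2];
  _mm_storeu_si128 ((__m128i *) out, v);
  return out[i];
}
```

The shuffle immediates match the ones in the patch: 0xb1 exchanges
adjacent 32-bit lanes, 0x4e exchanges the two 64-bit halves.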
diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index 47f8b18..0fa1847 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -4376,13 +4376,57 @@
         (match_operator:V2DI 1 ""
           [(match_operand:V2DI 2 "register_operand")
            (match_operand:V2DI 3 "vector_operand")]))]
-  "TARGET_SSE4_1"
+  "TARGET_SSE2"
 {
-  bool ok = ix86_expand_int_vec_cmp (operands);
+  bool ok;
+  if (!TARGET_SSE4_1)
+    {
+      rtx ops[4];
+      ops[0] = gen_reg_rtx (V4SImode);
+      ops[2] = gen_lowpart (V4SImode, force_reg (V2DImode, operands[2]));
+      ops[3] = gen_lowpart (V4SImode, force_reg (V2DImode, operands[3]));
+      ops[1] = gen_rtx_fmt_ee (GET_CODE (operands[1]), V4SImode,
+                               ops[2], ops[3]);
+      ok = ix86_expand_int_vec_cmp (ops);
+
+      rtx tmp1 = gen_reg_rtx (V4SImode);
+      emit_insn (gen_sse2_pshufd (tmp1, ops[0], GEN_INT (0xb1)));
+
+      rtx tmp2 = gen_reg_rtx (V4SImode);
+      emit_insn (gen_andv4si3 (tmp2, tmp1, ops[0]));
+
+      emit_move_insn (operands[0], gen_lowpart (V2DImode, tmp2));
+    }
+  else
+    ok = ix86_expand_int_vec_cmp (operands);
   gcc_assert (ok);
   DONE;
 })
 
+(define_expand "vec_cmpeqv1tiv1ti"
+  [(set (match_operand:V1TI 0 "register_operand")
+        (match_operator:V1TI 1 ""
+          [(match_operand:V1TI 2 "register_operand")
+           (match_operand:V1TI 3 "vector_operand")]))]
+  "TARGET_SSE2"
+{
+  rtx dst = gen_reg_rtx (V2DImode);
+  rtx op1 = gen_lowpart (V2DImode, force_reg (V1TImode, operands[2]));
+  rtx op2 = gen_lowpart (V2DImode, force_reg (V1TImode, operands[3]));
+  rtx cmp = gen_rtx_fmt_ee (GET_CODE (operands[1]), V2DImode, op1, op2);
+  emit_insn (gen_vec_cmpeqv2div2di (dst, cmp, op1, op2));
+
+  rtx tmp1 = gen_reg_rtx (V4SImode);
+  rtx tmp2 = gen_lowpart (V4SImode, dst);
+  emit_insn (gen_sse2_pshufd (tmp1, tmp2, GEN_INT (0x4e)));
+
+  rtx tmp3 = gen_reg_rtx (V4SImode);
+  emit_insn (gen_andv4si3 (tmp3, tmp2, tmp1));
+
+  emit_move_insn (operands[0], gen_lowpart (V1TImode, tmp3));
+  DONE;
+})
+
 (define_expand "vcond<V_512:mode><VI_AVX512BW:mode>"
   [(set (match_operand:V_512 0 "register_operand")
         (if_then_else:V_512
diff --git a/gcc/testsuite/gcc.target/i386/sse2-v1ti-veq.c b/gcc/testsuite/gcc.target/i386/sse2-v1ti-veq.c
new file mode 100644
index 0000000..8bbda06
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/sse2-v1ti-veq.c
@@ -0,0 +1,12 @@
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -msse2" } */
+typedef unsigned __int128 uv1ti __attribute__ ((__vector_size__ (16)));
+typedef unsigned long long uv2di __attribute__ ((__vector_size__ (16)));
+typedef unsigned int uv4si __attribute__ ((__vector_size__ (16)));
+
+uv1ti eq_v1ti(uv1ti x, uv1ti y) { return x == y; }
+uv2di eq_v2di(uv2di x, uv2di y) { return x == y; }
+uv4si eq_v4si(uv4si x, uv4si y) { return x == y; }
+
+/* { dg-final { scan-assembler-times "pcmpeq" 3 } } */
+/* { dg-final { scan-assembler "pshufd" } } */
diff --git a/gcc/testsuite/gcc.target/i386/sse2-v1ti-vne.c b/gcc/testsuite/gcc.target/i386/sse2-v1ti-vne.c
new file mode 100644
index 0000000..cb47147
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/sse2-v1ti-vne.c
@@ -0,0 +1,13 @@
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -msse2" } */
+typedef unsigned __int128 uv1ti __attribute__ ((__vector_size__ (16)));
+typedef unsigned long long uv2di __attribute__ ((__vector_size__ (16)));
+typedef unsigned int uv4si __attribute__ ((__vector_size__ (16)));
+
+uv1ti eq_v1ti(uv1ti x, uv1ti y) { return x != y; }
+uv2di eq_v2di(uv2di x, uv2di y) { return x != y; }
+uv4si eq_v4si(uv4si x, uv4si y) { return x != y; }
+
+/* { dg-final { scan-assembler-times "pcmpeq" 6 } } */
+/* { dg-final { scan-assembler-times "pxor" 3 } } */
+/* { dg-final { scan-assembler "pshufd" } } */
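
The scan counts in sse2-v1ti-vne.c (six pcmpeq and three pxor across
three functions) are consistent with each inequality being computed as
an equality mask that is then inverted against an all-ones register.
That reading is an inference from the test, not something the patch
states. A minimal intrinsics sketch of the idea for the V2DI case,
assuming an x86 target with SSE2; ne_v2di_sse2 and ne_lane64 are
hypothetical names used only here:

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* V2DI inequality with SSE2 only: build the 64-bit equality mask from a
   32-bit compare, shuffle and AND, then invert it by XORing with an
   all-ones register obtained by comparing a register with itself.  */
static __m128i
ne_v2di_sse2 (__m128i x, __m128i y)
{
  __m128i eq32 = _mm_cmpeq_epi32 (x, y);                      /* pcmpeqd */
  __m128i eq64 = _mm_and_si128 (eq32,
                                _mm_shuffle_epi32 (eq32, 0xb1));
  __m128i ones = _mm_cmpeq_epi32 (x, x);                      /* all ones */
  return _mm_xor_si128 (eq64, ones);                          /* pxor */
}

/* Hypothetical helper for inspecting results: return 64-bit lane I
   (0 = low, 1 = high) of V.  */
static uint64_t
ne_lane64 (__m128i v, int i)
{
  uint64_t out[2];
  _mm_storeu_si128 ((__m128i *) out, v);
  return out[i];
}
```

Whether the compiler materialises the all-ones value in exactly this way
for every mode is not shown in the message, so treat the sequence as
illustrative.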