From patchwork Fri May 13 09:46:02 2022
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Patchwork-Submitter: Roger Sayle <roger@nextmovesoftware.com>
X-Patchwork-Id: 1630597
From: "Roger Sayle" <roger@nextmovesoftware.com>
To: "'Uros Bizjak'" <ubizjak@gmail.com>
Subject: [x86 PATCH take 2] Improved V1TI (and V2DI) mode equality/inequality.
Date: Fri, 13 May 2022 10:46:02 +0100
Message-ID: <01b001d866ae$42c7e990$c857bcb0$@nextmovesoftware.com>
Cc: gcc-patches@gcc.gnu.org

Hi Uros,
Now that we're back in stage 1, here's the revised version of the patch I submitted here:
https://gcc.gnu.org/pipermail/gcc-patches/2022-April/593434.html
incorporating all the suggested improvements from your review here:
https://gcc.gnu.org/pipermail/gcc-patches/2022-April/593454.html

This revised patch has been (re)tested against mainline (GCC 13) on
x86_64-pc-linux-gnu with make bootstrap and make -k check, both
with and without --target_board=unix{-m32}, with no new failures.
Ok for mainline?


2022-05-13  Roger Sayle  <roger@nextmovesoftware.com>
            Uroš Bizjak  <ubizjak@gmail.com>

gcc/ChangeLog
	* config/i386/sse.md (vec_cmpeqv2div2di): Enable for TARGET_SSE2.
	For !TARGET_SSE4_1, expand as a V4SI vector comparison, followed
	by a pshufd and pand.
	(vec_cmpeqv1tiv1ti): New define_expand implementing V1TImode
	vector equality as a V2DImode vector comparison (see above),
	followed by a pshufd and pand.

gcc/testsuite/ChangeLog
	* gcc.target/i386/sse2-v1ti-veq.c: New test case.
	* gcc.target/i386/sse2-v1ti-vne.c: New test case.


Thanks in advance,
Roger
---
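
For anyone reading along who is less familiar with the trick named in the
ChangeLog (a V4SI comparison followed by a pshufd and a pand), here is a
minimal sketch in portable C.  It is not part of the patch and is not how
the compiler implements anything: the v4si struct and the model_* helpers
are hypothetical names that model a 128-bit register as four 32-bit lanes.
The immediates 0xb1 and 0x4e are the ones used in the patch.

```c
#include <stdint.h>

/* Hypothetical model of a 128-bit SSE register as four 32-bit lanes;
   lane[0] is the least significant.  */
typedef struct { uint32_t lane[4]; } v4si;

/* Model of pcmpeqd: each lane becomes ~0 if the inputs match, else 0.  */
static v4si model_pcmpeqd (v4si a, v4si b)
{
  v4si r;
  for (int i = 0; i < 4; i++)
    r.lane[i] = (a.lane[i] == b.lane[i]) ? 0xffffffffu : 0u;
  return r;
}

/* Model of pshufd: result lane i is source lane (imm >> (2 * i)) & 3.  */
static v4si model_pshufd (v4si a, unsigned imm)
{
  v4si r;
  for (int i = 0; i < 4; i++)
    r.lane[i] = a.lane[(imm >> (2 * i)) & 3];
  return r;
}

/* Model of pand: lane-wise bitwise AND.  */
static v4si model_pand (v4si a, v4si b)
{
  v4si r;
  for (int i = 0; i < 4; i++)
    r.lane[i] = a.lane[i] & b.lane[i];
  return r;
}

/* V2DI equality without pcmpeqq: compare 32-bit lanes, swap the two
   lanes inside each 64-bit element (0xb1), then AND, so a 64-bit
   element is all ones only if both of its halves matched.  */
static v4si model_eq_v2di (v4si x, v4si y)
{
  v4si c = model_pcmpeqd (x, y);
  return model_pand (model_pshufd (c, 0xb1), c);
}

/* V1TI equality: take the V2DI result, swap the two 64-bit halves
   (0x4e), then AND, so the result is all ones only if both 64-bit
   elements matched.  */
static v4si model_eq_v1ti (v4si x, v4si y)
{
  v4si c = model_eq_v2di (x, y);
  return model_pand (model_pshufd (c, 0x4e), c);
}
```

With x = {1, 2, 3, 4} and y = {1, 2, 3, 5} (lane 0 first), model_eq_v2di
gives all ones in the low 64-bit element and zero in the high one, while
model_eq_v1ti gives zero everywhere.  Note that the AND combination in
this sketch models equality only; a wide inequality holds when either
half differs, which corresponds to an OR of the per-lane masks.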

> -----Original Message-----
> From: Uros Bizjak <ubizjak@gmail.com>
> Sent: 21 April 2022 10:31
> To: Roger Sayle <roger@nextmovesoftware.com>
> Cc: gcc-patches@gcc.gnu.org
> Subject: Re: [x86 PATCH] Improved V1TI (and V2DI) mode equality/inequality.
> 
> On Wed, Apr 20, 2022 at 8:28 PM Roger Sayle <roger@nextmovesoftware.com>
> wrote:
> >
> >
> > Doh! ENOPATCH.
> >
> > > -----Original Message-----
> > > From: Roger Sayle <roger@nextmovesoftware.com>
> > > Sent: 20 April 2022 18:50
> > > To: 'gcc-patches@gcc.gnu.org' <gcc-patches@gcc.gnu.org>
> > > Subject: [x86 PATCH] Improved V1TI (and V2DI) mode equality/inequality.
> > >
> > >
> > > > This patch (for when the compiler returns to stage 1) improves
> > > > support for vector equality and inequality of V1TImode vectors, and
> > > > V2DImode vectors with sse2 but not sse4.  Consider the three
> > > > functions below:
> > >
> > > > typedef unsigned int uv4si __attribute__ ((__vector_size__ (16)));
> > > > typedef unsigned long long uv2di __attribute__ ((__vector_size__ (16)));
> > > > typedef unsigned __int128 uv1ti __attribute__ ((__vector_size__ (16)));
> > > >
> > > > uv4si eq_v4si(uv4si x, uv4si y) { return x == y; }
> > > > uv2di eq_v2di(uv2di x, uv2di y) { return x == y; }
> > > > uv1ti eq_v1ti(uv1ti x, uv1ti y) { return x == y; }
> > >
> > > > These all perform vector comparisons of 128-bit SSE2 registers,
> > > > generating the result as a vector, where ~0 (all 1 bits) represents
> > > > true and zero represents false.  eq_v4si is trivially implemented by
> > > > x86_64's pcmpeqd instruction.  This patch improves the other two cases:
> > >
> > > For v2di, gcc -O2 currently generates:
> > >
> > >         movq    %xmm0, %rdx
> > >         movq    %xmm1, %rax
> > >         movdqa  %xmm0, %xmm2
> > >         cmpq    %rax, %rdx
> > >         movhlps %xmm2, %xmm3
> > >         movhlps %xmm1, %xmm4
> > >         sete    %al
> > >         movq    %xmm3, %rdx
> > >         movzbl  %al, %eax
> > >         negq    %rax
> > >         movq    %rax, %xmm0
> > >         movq    %xmm4, %rax
> > >         cmpq    %rax, %rdx
> > >         sete    %al
> > >         movzbl  %al, %eax
> > >         negq    %rax
> > >         movq    %rax, %xmm5
> > >         punpcklqdq      %xmm5, %xmm0
> > >         ret
> > >
> > > but with this patch we now generate:
> > >
> > >         pcmpeqd %xmm0, %xmm1
> > >         pshufd  $177, %xmm1, %xmm0
> > >         pand    %xmm1, %xmm0
> > >         ret
> > >
> > > > where the results of a V4SI comparison are shuffled and bit-wise
> > > > ANDed to produce the desired result.  There's no change in the code
> > > > generated for "-O2 -msse4" where the compiler generates a single
> > > > "pcmpeqq" insn.
> > >
> > > > For V1TI mode, the results are equally dramatic, where the current
> > > > -O2 output looks like:
> > >
> > >         movaps  %xmm0, -40(%rsp)
> > >         movq    -40(%rsp), %rax
> > >         movq    -32(%rsp), %rdx
> > >         movaps  %xmm1, -24(%rsp)
> > >         movq    -24(%rsp), %rcx
> > >         movq    -16(%rsp), %rsi
> > >         xorq    %rcx, %rax
> > >         xorq    %rsi, %rdx
> > >         orq     %rdx, %rax
> > >         sete    %al
> > >         xorl    %edx, %edx
> > >         movzbl  %al, %eax
> > >         negq    %rax
> > >         adcq    $0, %rdx
> > >         movq    %rax, %xmm2
> > >         negq    %rdx
> > >         movq    %rdx, -40(%rsp)
> > >         movhps  -40(%rsp), %xmm2
> > >         movdqa  %xmm2, %xmm0
> > >         ret
> > >
> > > with this patch we now generate:
> > >
> > >         pcmpeqd %xmm0, %xmm1
> > >         pshufd  $177, %xmm1, %xmm0
> > >         pand    %xmm1, %xmm0
> > >         pshufd  $78, %xmm0, %xmm1
> > >         pand    %xmm1, %xmm0
> > >         ret
> > >
> > > > performing a V2DI comparison, followed by a shuffle and pand, and
> > > > with -O2 -msse4 takes advantage of SSE4.1's pcmpeqq:
> > >
> > >         pcmpeqq %xmm0, %xmm1
> > >         pshufd  $78, %xmm1, %xmm0
> > >         pand    %xmm1, %xmm0
> > >         ret
> > >
> > >
> > > > This patch has been tested on x86_64-pc-linux-gnu with make
> > > > bootstrap and make -k check, both with and without
> > > > --target_board=unix{-m32}, with no new failures.  Is this OK for when
> > > > we return to stage 1?
> > >
> > >
> > > 2022-04-20  Roger Sayle  <roger@nextmovesoftware.com>
> > >
> > > gcc/ChangeLog
> > >       * config/i386/sse.md (vec_cmpeqv2div2di): Enable for TARGET_SSE2.
> > >       For !TARGET_SSE4_1, expand as a V4SI vector comparison, followed
> > >       by a pshufd and pand.
> > >       (vec_cmpeqv1tiv1ti): New define_expand implementing V1TImode
> > >       vector equality as a V2DImode vector comparison (see above),
> > >       followed by a pshufd and pand.
> > >
> > > gcc/testsuite/ChangeLog
> > >       * gcc.target/i386/sse2-v1ti-veq.c: New test case.
> > >       * gcc.target/i386/sse2-v1ti-vne.c: New test case.
> > >
> 
> 
> +  bool ok;
> +  if (!TARGET_SSE4_1)
> +    {
> +      rtx ops[4];
> +      ops[0] = gen_reg_rtx (V4SImode);
> +      ops[2] = force_reg (V4SImode, gen_lowpart (V4SImode, operands[2]));
> +      ops[3] = force_reg (V4SImode, gen_lowpart (V4SImode, operands[3]));
> 
> In general, this is better written as e.g.:
> 
> gen_lowpart (V4SImode, force_reg (V2DImode, operands[2]))
> 
> This ensures that we get a subreg of V2DImode register, and avoids problems
> with gen_lowpart. Also, other expander functions should be prepared to handle
> subregs, so in
> 
> +  rtx tmp2 = force_reg (V4SImode, gen_lowpart (V4SImode, dst));
> +  emit_insn (gen_sse2_pshufd (tmp1, tmp2, GEN_INT (0x4e)));
> 
> forcing a subreg to a register before the call to gen_sse2_pshufd is not needed,
> since dst is already a register.
> 
> Uros.
diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index 47f8b18..0fa1847 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -4376,13 +4376,57 @@
 	(match_operator:V2DI 1 ""
 	  [(match_operand:V2DI 2 "register_operand")
 	   (match_operand:V2DI 3 "vector_operand")]))]
-  "TARGET_SSE4_1"
+  "TARGET_SSE2"
 {
-  bool ok = ix86_expand_int_vec_cmp (operands);
+  bool ok;
+  if (!TARGET_SSE4_1)
+    {
+      rtx ops[4];
+      ops[0] = gen_reg_rtx (V4SImode);
+      ops[2] = gen_lowpart (V4SImode, force_reg (V2DImode, operands[2]));
+      ops[3] = gen_lowpart (V4SImode, force_reg (V2DImode, operands[3]));
+      ops[1] = gen_rtx_fmt_ee (GET_CODE (operands[1]), V4SImode,
+			       ops[2], ops[3]);
+      ok = ix86_expand_int_vec_cmp (ops);
+
+      rtx tmp1 = gen_reg_rtx (V4SImode);
+      emit_insn (gen_sse2_pshufd (tmp1, ops[0], GEN_INT (0xb1)));
+
+      rtx tmp2 = gen_reg_rtx (V4SImode);
+      emit_insn (gen_andv4si3 (tmp2, tmp1, ops[0]));
+
+      emit_move_insn (operands[0], gen_lowpart (V2DImode, tmp2));
+    }
+  else
+    ok = ix86_expand_int_vec_cmp (operands);
   gcc_assert (ok);
   DONE;
 })
 
+(define_expand "vec_cmpeqv1tiv1ti"
+  [(set (match_operand:V1TI 0 "register_operand")
+	(match_operator:V1TI 1 ""
+	  [(match_operand:V1TI 2 "register_operand")
+	   (match_operand:V1TI 3 "vector_operand")]))]
+  "TARGET_SSE2"
+{
+  rtx dst = gen_reg_rtx (V2DImode);
+  rtx op1 = gen_lowpart (V2DImode, force_reg (V1TImode, operands[2]));
+  rtx op2 = gen_lowpart (V2DImode, force_reg (V1TImode, operands[3]));
+  rtx cmp = gen_rtx_fmt_ee (GET_CODE (operands[1]), V2DImode, op1, op2);
+  emit_insn (gen_vec_cmpeqv2div2di (dst, cmp, op1, op2));
+
+  rtx tmp1 = gen_reg_rtx (V4SImode);
+  rtx tmp2 = gen_lowpart (V4SImode, dst);
+  emit_insn (gen_sse2_pshufd (tmp1, tmp2, GEN_INT (0x4e)));
+
+  rtx tmp3 = gen_reg_rtx (V4SImode);
+  emit_insn (gen_andv4si3 (tmp3, tmp2, tmp1));
+
+  emit_move_insn (operands[0], gen_lowpart (V1TImode, tmp3));
+  DONE;
+})
+
 (define_expand "vcond<V_512:mode><VF_512:mode>"
   [(set (match_operand:V_512 0 "register_operand")
 	(if_then_else:V_512
diff --git a/gcc/testsuite/gcc.target/i386/sse2-v1ti-veq.c b/gcc/testsuite/gcc.target/i386/sse2-v1ti-veq.c
new file mode 100644
index 0000000..8bbda06
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/sse2-v1ti-veq.c
@@ -0,0 +1,12 @@
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -msse2" } */
+typedef unsigned __int128 uv1ti __attribute__ ((__vector_size__ (16)));
+typedef unsigned long long uv2di __attribute__ ((__vector_size__ (16)));
+typedef unsigned int uv4si __attribute__ ((__vector_size__ (16)));
+
+uv1ti eq_v1ti(uv1ti x, uv1ti y) { return x == y; }
+uv2di eq_v2di(uv2di x, uv2di y) { return x == y; }
+uv4si eq_v4si(uv4si x, uv4si y) { return x == y; }
+
+/* { dg-final { scan-assembler-times "pcmpeq" 3 } } */
+/* { dg-final { scan-assembler "pshufd" } } */
diff --git a/gcc/testsuite/gcc.target/i386/sse2-v1ti-vne.c b/gcc/testsuite/gcc.target/i386/sse2-v1ti-vne.c
new file mode 100644
index 0000000..cb47147
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/sse2-v1ti-vne.c
@@ -0,0 +1,13 @@
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -msse2" } */
+typedef unsigned __int128 uv1ti __attribute__ ((__vector_size__ (16)));
+typedef unsigned long long uv2di __attribute__ ((__vector_size__ (16)));
+typedef unsigned int uv4si __attribute__ ((__vector_size__ (16)));
+
+uv1ti eq_v1ti(uv1ti x, uv1ti y) { return x != y; }
+uv2di eq_v2di(uv2di x, uv2di y) { return x != y; }
+uv4si eq_v4si(uv4si x, uv4si y) { return x != y; }
+
+/* { dg-final { scan-assembler-times "pcmpeq" 6 } } */
+/* { dg-final { scan-assembler-times "pxor" 3 } } */
+/* { dg-final { scan-assembler "pshufd" } } */