From patchwork Wed May  1 09:34:31 2019
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Patchwork-Submitter: Kyrill Tkachov <kyrylo.tkachov@foss.arm.com>
X-Patchwork-Id: 1093637
Return-Path: 
 <gcc-patches-return-499949-incoming=patchwork.ozlabs.org@gcc.gnu.org>
X-Original-To: incoming@patchwork.ozlabs.org
Delivered-To: patchwork-incoming@bilbo.ozlabs.org
Authentication-Results: ozlabs.org;
	spf=pass (mailfrom) smtp.mailfrom=gcc.gnu.org
	(client-ip=209.132.180.131; helo=sourceware.org;
	envelope-from=gcc-patches-return-499949-incoming=patchwork.ozlabs.org@gcc.gnu.org;
	receiver=<UNKNOWN>)
Authentication-Results: ozlabs.org; dmarc=none (p=none dis=none)
	header.from=foss.arm.com
Authentication-Results: ozlabs.org; dkim=pass (1024-bit key;
	unprotected) header.d=gcc.gnu.org header.i=@gcc.gnu.org
	header.b="u2yCS7as"; dkim-atps=neutral
Received: from sourceware.org (server1.sourceware.org [209.132.180.131])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256
	bits)) (No client certificate requested)
	by ozlabs.org (Postfix) with ESMTPS id 44vCsn0P83z9s6w
	for <incoming@patchwork.ozlabs.org>;
	Wed,  1 May 2019 19:34:45 +1000 (AEST)
DomainKey-Signature: a=rsa-sha1; c=nofws; d=gcc.gnu.org; h=list-id
	:list-unsubscribe:list-archive:list-post:list-help:sender:to:cc
	:from:subject:message-id:date:mime-version:content-type; q=dns;
	s=default; b=aZUfp9P/tx7PCAMOF3HI3C5zejpnD5L/Pl/EocLUKW+NyaJcm+
	43Un10VagAxP0CER8gEn52Lh+rszDmxUi77sY2XrQgRB9Bb8zAjxguUVp9DtkS2v
	Bu+ZGp4HGtTDVXxmgLjZsNcaeySNhe3kC2jGVpJXLn0lMOqiBDbuhwhus=
DKIM-Signature: v=1; a=rsa-sha1; c=relaxed; d=gcc.gnu.org; h=list-id
	:list-unsubscribe:list-archive:list-post:list-help:sender:to:cc
	:from:subject:message-id:date:mime-version:content-type; s=
	default; bh=H8Ki2/PEmW4eWcp+9K7wnw7ENYU=; b=u2yCS7asL33U1nZLyH8a
	peGMC6MuWzfnHNJdSkiWQb/E2BaWhI3I2btOhKv9KWLTOAT1RbqgP9uuFiTovB3S
	wnHAnrwJJiwkJKYXEHra5UPdZ3uMcM8+p2sbJqU6liLhQBlXtU28sfyq1qO0l8b4
	TxzGuDsw8tPO/+TGU+zB3OE=
Received: (qmail 54333 invoked by alias); 1 May 2019 09:34:37 -0000
Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
List-Id: <gcc-patches.gcc.gnu.org>
List-Unsubscribe: 
 <mailto:gcc-patches-unsubscribe-incoming=patchwork.ozlabs.org@gcc.gnu.org>
List-Archive: <http://gcc.gnu.org/ml/gcc-patches/>
List-Post: <mailto:gcc-patches@gcc.gnu.org>
List-Help: <mailto:gcc-patches-help@gcc.gnu.org>
Sender: gcc-patches-owner@gcc.gnu.org
Delivered-To: mailing list gcc-patches@gcc.gnu.org
Received: (qmail 54319 invoked by uid 89); 1 May 2019 09:34:37 -0000
Authentication-Results: sourceware.org; auth=none
X-Spam-SWARE-Status: No, score=-14.3 required=5.0 tests=AWL, BAYES_00,
	GIT_PATCH_0, GIT_PATCH_1, GIT_PATCH_2, GIT_PATCH_3,
	KAM_LOTSOFHASH autolearn=ham version=3.3.1 spammy=MOV,
	accurate
X-HELO: foss.arm.com
Received: from foss.arm.com (HELO foss.arm.com) (217.140.101.70) by
	sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with ESMTP;
	Wed, 01 May 2019 09:34:35 +0000
Received: from usa-sjc-imap-foss1.foss.arm.com (unknown [10.72.51.249])	by
	usa-sjc-mx-foss1.foss.arm.com (Postfix) with ESMTP id
	DB33EA78; Wed,  1 May 2019 02:34:33 -0700 (PDT)
Received: from [10.2.206.47] (e120808-lin.cambridge.arm.com
	[10.2.206.47])	by usa-sjc-imap-foss1.foss.arm.com (Postfix)
	with ESMTPSA id 141013F719; Wed,  1 May 2019 02:34:32 -0700 (PDT)
To: "gcc-patches@gcc.gnu.org" <gcc-patches@gcc.gnu.org>
Cc: James Greenhalgh <James.Greenhalgh@arm.com>,
	Marcus Shawcroft <Marcus.Shawcroft@arm.com>,
	Richard Earnshaw <Richard.Earnshaw@arm.com>
From: Kyrill Tkachov <kyrylo.tkachov@foss.arm.com>
Subject: [PATCH][AArch64] Emit TARGET_DOTPROD-specific sequence for
	<us>sadv16qi
Message-ID: <9b27c4db-c89d-d6c7-b3bd-2eef58511338@foss.arm.com>
Date: Wed, 1 May 2019 10:34:31 +0100
User-Agent: Mozilla/5.0 (X11; Linux x86_64;
	rv:60.0) Gecko/20100101 Thunderbird/60.4.0
MIME-Version: 1.0

Hi all,

Wilco pointed out that when the Dot Product instructions are available
we can use them to generate an even more efficient expansion for the
[us]sadv16qi optab.
Instead of the current:
         uabdl2  v0.8h, v1.16b, v2.16b
         uabal   v0.8h, v1.8b, v2.8b
         uadalp  v3.4s, v0.8h

we can generate:
       (1)  movi    v4.16b, 1
       (2)  uabd    v0.16b, v1.16b, v2.16b
       (3)  udot    v3.4s, v0.16b, v4.16b

Instruction (1) can be CSEd across multiple such expansions and even
hoisted out of loops, so when this sequence appears frequently
back-to-back (as in x264_r) we essentially have only 2 instructions
per sum. Also, the UDOT instruction does the byte-to-word accumulation
in one step, which allows us to use the much simpler UABD instruction
before it.
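
To illustrate why multiplying by a vector of ones works, here is a minimal
scalar sketch in portable C. It is not part of the patch: model_uabd,
model_udot, reference_sad and dotprod_sad are hypothetical names that only
model the lane-wise behaviour of UABD and UDOT as described above.

```c
#include <stdint.h>
#include <stdlib.h>

/* Scalar model of UABD on 16 byte lanes: per-lane absolute difference.  */
static void
model_uabd (uint8_t out[16], const uint8_t a[16], const uint8_t b[16])
{
  for (int i = 0; i < 16; i++)
    out[i] = (uint8_t) (a[i] > b[i] ? a[i] - b[i] : b[i] - a[i]);
}

/* Scalar model of UDOT: each 32-bit lane accumulates the dot product
   of the four byte pairs that belong to it.  */
static void
model_udot (uint32_t acc[4], const uint8_t x[16], const uint8_t y[16])
{
  for (int lane = 0; lane < 4; lane++)
    for (int k = 0; k < 4; k++)
      acc[lane] += (uint32_t) x[4 * lane + k] * y[4 * lane + k];
}

/* Reference: plain sum of absolute differences over 16 bytes.  */
static uint32_t
reference_sad (const uint8_t a[16], const uint8_t b[16])
{
  uint32_t sum = 0;
  for (int i = 0; i < 16; i++)
    sum += (uint32_t) abs ((int) a[i] - (int) b[i]);
  return sum;
}

/* Model of the sequence (1)-(3), followed by a final reduction of the
   four accumulator lanes so the result can be compared directly.  */
static uint32_t
dotprod_sad (const uint8_t a[16], const uint8_t b[16])
{
  uint8_t ones[16], abd[16];
  uint32_t acc[4] = { 0, 0, 0, 0 };

  for (int i = 0; i < 16; i++)
    ones[i] = 1;                 /* (1) all-ones vector */
  model_uabd (abd, a, b);        /* (2) per-byte absolute difference */
  model_udot (acc, abd, ones);   /* (3) widen and accumulate in one step */
  return acc[0] + acc[1] + acc[2] + acc[3];
}
```

Because every multiplier is 1, each 32-bit lane simply receives the sum of
its four absolute differences, so dotprod_sad and reference_sad agree for
any input. For example, 16 bytes of 255 against 16 bytes of 0 give 4080 in
both.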

This makes it a shorter and lower-latency sequence overall for targets
that support it.

I've added a helper <su>abd<mode>_3 expander to simplify the generation
of the [US]ABD patterns as well.
Bootstrapped and tested on aarch64-none-linux-gnu.
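
For readers unfamiliar with why the [US]ABD patterns are written as
MINUS (max, min) rather than ABS (MINUS ...), here is a small scalar
sketch of the signed byte case. The function names are hypothetical and
the code only models the arithmetic; it is not how the compiler
implements the pattern.

```c
#include <stdint.h>

/* Signed absolute difference modelled as max - min.  The subtraction
   is done in int, so the result is in 0..255 and fits a byte exactly.  */
static uint8_t
sabd_max_minus_min (int8_t a, int8_t b)
{
  int hi = a > b ? a : b;
  int lo = a > b ? b : a;
  return (uint8_t) (hi - lo);
}

/* The tempting but wrong model: subtract in 8 bits (wrapping), treat
   the result as signed, then take the absolute value.  */
static uint8_t
abs_of_wrapped_minus (int8_t a, int8_t b)
{
  uint8_t wrapped = (uint8_t) (a - b);   /* 8-bit wrap-around */
  /* Reinterpret the byte as a signed value without relying on
     implementation-defined narrowing conversions.  */
  int as_signed = wrapped >= 128 ? (int) wrapped - 256 : (int) wrapped;
  int magnitude = as_signed < 0 ? -as_signed : as_signed;
  return (uint8_t) magnitude;
}
```

With a = 64 and b = -128 the max - min form yields 192, whereas the
wrapped form yields 64, which matches the example in the comment above
the aarch64_<su>abd<mode>_3 pattern.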

This gives about a 0.5% improvement on 525.x264_r on Neoverse N1.

Ok for trunk?

Thanks,
Kyrill

2019-05-01  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>

     * config/aarch64/iterators.md (MAX_OPP): New code attr.
     * config/aarch64/aarch64-simd.md (<su>abd<mode>_3): New define_expand.
     (*aarch64_<su>abd<mode>_3): Rename to...
     (aarch64_<su>abd<mode>_3): ... This.
     (<sur>sadv16qi): Add TARGET_DOTPROD expansion.

2019-05-01  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>

     * gcc.target/aarch64/ssadv16qi.c: Add +nodotprod to pragma.
     * gcc.target/aarch64/usadv16qi.c: Likewise.
     * gcc.target/aarch64/ssadv16qi-dotprod.c: New test.
     * gcc.target/aarch64/usadv16qi-dotprod.c: Likewise.
diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
index eb99d3ab881e29f3069991e4f778be95d51ec4da..a823c4ddca420e0cca1caac59cbab59f17ec639c 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -705,12 +705,28 @@ (define_insn "aarch64_abs<mode>"
   [(set_attr "type" "neon_abs<q>")]
 )
 
+;; Helper expander for aarch64_<su>abd<mode>_3 to save the callers
+;; the hassle of constructing the other arm of the MINUS.
+(define_expand "<su>abd<mode>_3"
+  [(use (match_operand:VDQ_BHSI 0 "register_operand"))
+   (USMAX:VDQ_BHSI (match_operand:VDQ_BHSI 1 "register_operand")
+		   (match_operand:VDQ_BHSI 2 "register_operand"))]
+  "TARGET_SIMD"
+  {
+    rtx other_arm
+      = simplify_gen_binary (<MAX_OPP>, <MODE>mode, operands[1], operands[2]);
+    emit_insn (gen_aarch64_<su>abd<mode>_3 (operands[0], operands[1],
+	       operands[2], other_arm));
+    DONE;
+  }
+)
+
 ;; It's tempting to represent SABD as ABS (MINUS op1 op2).
 ;; This isn't accurate as ABS treats always its input as a signed value.
 ;; So (ABS:QI (minus:QI 64 -128)) == (ABS:QI (192 or -64 signed)) == 64.
 ;; Whereas SABD would return 192 (-64 signed) on the above example.
 ;; Use MINUS ([us]max (op1, op2), [us]min (op1, op2)) instead.
-(define_insn "*aarch64_<su>abd<mode>_3"
+(define_insn "aarch64_<su>abd<mode>_3"
   [(set (match_operand:VDQ_BHSI 0 "register_operand" "=w")
 	(minus:VDQ_BHSI
 	  (USMAX:VDQ_BHSI
@@ -764,6 +780,13 @@ (define_insn "aarch64_<sur>adalp<mode>_3"
 ;; UABAL	tmp.8h, op1.16b, op2.16b
 ;; UADALP	op3.4s, tmp.8h
 ;; MOV		op0, op3 // should be eliminated in later passes.
+;;
+;; For TARGET_DOTPROD we do:
+;; MOV	tmp1.16b, #1 // Can be CSE'd and hoisted out of loops.
+;; UABD	tmp2.16b, op1.16b, op2.16b
+;; UDOT	op3.4s, tmp2.16b, tmp1.16b
+;; MOV	op0, op3 // RA will tie the operands of UDOT appropriately.
+;;
 ;; The signed version just uses the signed variants of the above instructions.
 
 (define_expand "<sur>sadv16qi"
@@ -773,6 +796,18 @@ (define_expand "<sur>sadv16qi"
    (use (match_operand:V4SI 3 "register_operand"))]
   "TARGET_SIMD"
   {
+    if (TARGET_DOTPROD)
+      {
+	rtx ones = gen_reg_rtx (V16QImode);
+	emit_move_insn (ones,
+			aarch64_simd_gen_const_vector_dup (V16QImode,
+							    HOST_WIDE_INT_1));
+	rtx abd = gen_reg_rtx (V16QImode);
+	emit_insn (gen_<sur>abdv16qi_3 (abd, operands[1], operands[2]));
+	emit_insn (gen_aarch64_<sur>dotv16qi (operands[0], operands[3],
+					       abd, ones));
+	DONE;
+      }
     rtx reduc = gen_reg_rtx (V8HImode);
     emit_insn (gen_aarch64_<sur>abdl2v16qi_3 (reduc, operands[1],
 					       operands[2]));
diff --git a/gcc/config/aarch64/iterators.md b/gcc/config/aarch64/iterators.md
index 16e4dbda73ab928054590c47a4398408162c0332..5afb692493c6e9fa31355693e7843e4f0b1b281c 100644
--- a/gcc/config/aarch64/iterators.md
+++ b/gcc/config/aarch64/iterators.md
@@ -1059,6 +1059,9 @@ (define_code_attr f16mac [(plus "a") (minus "s")])
 ;; Map smax to smin and umax to umin.
 (define_code_attr max_opp [(smax "smin") (umax "umin")])
 
+;; Same as above, but in upper case, for use as an rtx code name.
+(define_code_attr MAX_OPP [(smax "SMIN") (umax "UMIN")])
+
 ;; The number of subvectors in an SVE_STRUCT.
 (define_mode_attr vector_count [(VNx32QI "2") (VNx16HI "2")
 				(VNx8SI  "2") (VNx4DI  "2")
diff --git a/gcc/testsuite/gcc.target/aarch64/ssadv16qi-dotprod.c b/gcc/testsuite/gcc.target/aarch64/ssadv16qi-dotprod.c
new file mode 100644
index 0000000000000000000000000000000000000000..e08c33785303e86815554e67a300189a67dfc1da
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/ssadv16qi-dotprod.c
@@ -0,0 +1,31 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target arm_v8_2a_dotprod_neon_ok } */
+/* { dg-add-options arm_v8_2a_dotprod_neon }  */
+/* { dg-additional-options "-O3" } */
+
+#pragma GCC target "+nosve"
+
+#define N 1024
+
+signed char pix1[N], pix2[N];
+
+int foo (void)
+{
+  int i_sum = 0;
+  int i;
+
+  for (i = 0; i < N; i++)
+    i_sum += __builtin_abs (pix1[i] - pix2[i]);
+
+  return i_sum;
+}
+
+/* { dg-final { scan-assembler-not {\tsshll\t} } } */
+/* { dg-final { scan-assembler-not {\tsshll2\t} } } */
+/* { dg-final { scan-assembler-not {\tssubl\t} } } */
+/* { dg-final { scan-assembler-not {\tssubl2\t} } } */
+/* { dg-final { scan-assembler-not {\tabs\t} } } */
+
+/* { dg-final { scan-assembler {\tsabd\t} } } */
+/* { dg-final { scan-assembler {\tsdot\t} } } */
+
diff --git a/gcc/testsuite/gcc.target/aarch64/ssadv16qi.c b/gcc/testsuite/gcc.target/aarch64/ssadv16qi.c
index 40b28843616e84df137210b45ec16abed2a37c75..85a867a113013f560bfd0a3142805b9c95ad8c5a 100644
--- a/gcc/testsuite/gcc.target/aarch64/ssadv16qi.c
+++ b/gcc/testsuite/gcc.target/aarch64/ssadv16qi.c
@@ -1,7 +1,7 @@
 /* { dg-do compile } */
 /* { dg-options "-O3" } */
 
-#pragma GCC target "+nosve"
+#pragma GCC target "+nosve+nodotprod"
 
 #define N 1024
 
diff --git a/gcc/testsuite/gcc.target/aarch64/usadv16qi-dotprod.c b/gcc/testsuite/gcc.target/aarch64/usadv16qi-dotprod.c
new file mode 100644
index 0000000000000000000000000000000000000000..ea8de4d69758bd6bc9af9e33e1498f838b706949
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/usadv16qi-dotprod.c
@@ -0,0 +1,30 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target arm_v8_2a_dotprod_neon_ok } */
+/* { dg-add-options arm_v8_2a_dotprod_neon }  */
+/* { dg-additional-options "-O3" } */
+
+#pragma GCC target "+nosve"
+
+#define N 1024
+
+unsigned char pix1[N], pix2[N];
+
+int foo (void)
+{
+  int i_sum = 0;
+  int i;
+
+  for (i = 0; i < N; i++)
+    i_sum += __builtin_abs (pix1[i] - pix2[i]);
+
+  return i_sum;
+}
+
+/* { dg-final { scan-assembler-not {\tushll\t} } } */
+/* { dg-final { scan-assembler-not {\tushll2\t} } } */
+/* { dg-final { scan-assembler-not {\tusubl\t} } } */
+/* { dg-final { scan-assembler-not {\tusubl2\t} } } */
+/* { dg-final { scan-assembler-not {\tabs\t} } } */
+
+/* { dg-final { scan-assembler {\tuabd\t} } } */
+/* { dg-final { scan-assembler {\tudot\t} } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/usadv16qi.c b/gcc/testsuite/gcc.target/aarch64/usadv16qi.c
index 69ceaf4259ea43e95078ce900d2498c3a2291369..a66e1209662cefaa95c90d8d2694f9c7c0de4152 100644
--- a/gcc/testsuite/gcc.target/aarch64/usadv16qi.c
+++ b/gcc/testsuite/gcc.target/aarch64/usadv16qi.c
@@ -1,7 +1,7 @@
 /* { dg-do compile } */
 /* { dg-options "-O3" } */
 
-#pragma GCC target "+nosve"
+#pragma GCC target "+nosve+nodotprod"
 
 #define N 1024