From patchwork Mon Nov  6 00:25:11 2017
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Bill Schmidt <wschmidt@linux.vnet.ibm.com>
X-Patchwork-Id: 834424
To: GCC Patches <gcc-patches@gcc.gnu.org>
Cc: Segher Boessenkool <segher@kernel.crashing.org>,
	David Edelsohn <dje.gcc@gmail.com>
From: Bill Schmidt <wschmidt@linux.vnet.ibm.com>
Subject: [PATCH,
	rs6000] Add support for usadv16qi and usadv8hi standard patterns
Date: Sun, 5 Nov 2017 18:25:11 -0600
Message-Id: <2a3e7921-c2ee-ff57-677a-f84becc0f002@linux.vnet.ibm.com>

Hi,

This patch adds support for vectorizing unsigned SAD (sum of absolute
differences) expressions.  The vectorizer uses the usad<mode> standard
pattern to represent a widening accumulation of SADs computed on a narrower
type.  The two patterns added here operate on V16QImode and V8HImode,
respectively, and both accumulate into V4SImode.  A vectorized SAD loop uses
these patterns in the main loop body, then performs a final reduction in the
loop epilogue to sum the four partial results held in the V4SImode
accumulator.
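
For readers less familiar with the pattern, here is a rough scalar sketch of
what one usadv16qi step and the epilogue reduction compute.  It is only an
illustration: the function names are made up for this note and are not part
of the patch, and the byte-to-lane grouping assumed here simply mirrors how
vsum4ubs sums the four bytes of each word.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical scalar model of one usadv16qi step (illustrative names,
   not part of the patch): each 32-bit lane receives the sum of absolute
   differences of its four bytes, added to the incoming accumulator lane.  */
void
usad_v16qi_model (uint32_t out[4], const uint8_t a[16], const uint8_t b[16],
                  const uint32_t acc[4])
{
  for (int w = 0; w < 4; w++)
    {
      uint32_t s = 0;
      for (int k = 0; k < 4; k++)
        s += (uint32_t) abs (a[4 * w + k] - b[4 * w + k]);
      out[w] = acc[w] + s;
    }
}

/* Model of a whole vectorized loop: one usad step per 16 bytes in the
   loop body, then the epilogue reduction over the four lanes.  */
uint32_t
sad_bytes_model (const uint8_t *a, const uint8_t *b, size_t n)
{
  uint32_t acc[4] = { 0, 0, 0, 0 };

  assert (n % 16 == 0);	/* The sketch handles whole vectors only.  */
  for (size_t i = 0; i < n; i += 16)
    usad_v16qi_model (acc, a + i, b + i, acc);

  /* Loop epilogue: sum the four partial results.  */
  return acc[0] + acc[1] + acc[2] + acc[3];
}
```

Calling sad_bytes_model on two 256-byte buffers corresponds to the 16x16
loop nest used in the new tests when both strides are 16.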

POWER's sum-across instructions (vsum4ubs and vsum4shs) unfortunately have
saturating semantics, so they are used only for the sum-across step, with a
zero addend; accumulating into the results of previous iterations requires a
separate modulo add (vadduwm).
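
To make the saturation point concrete, here is a small scalar sketch of a
single word lane.  The function names are hypothetical and exist only for
this note; the vsum4ubs model assumes the documented behavior of adding four
unsigned bytes to an unsigned word and clamping at 2^32-1.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed model of one word lane of vsum4ubs: four unsigned bytes plus an
   unsigned word addend, clamped to UINT32_MAX.  */
uint32_t
vsum4ubs_lane_model (const uint8_t bytes[4], uint32_t addend)
{
  uint64_t wide = (uint64_t) addend + bytes[0] + bytes[1] + bytes[2] + bytes[3];
  return wide > UINT32_MAX ? UINT32_MAX : (uint32_t) wide;
}

/* Model of one word lane of vadduwm: plain modulo-2^32 addition.  */
uint32_t
vadduwm_lane_model (uint32_t x, uint32_t y)
{
  return x + y;
}

/* Folding the accumulator into vsum4ubs: saturates near the top.  */
uint32_t
usad_lane_fused_model (const uint8_t bytes[4], uint32_t acc)
{
  return vsum4ubs_lane_model (bytes, acc);
}

/* What the expanders emit instead: sum across with a zero addend (which
   cannot saturate, since 4 * 255 fits easily), then a modulo add.  */
uint32_t
usad_lane_split_model (const uint8_t bytes[4], uint32_t acc)
{
  return vadduwm_lane_model (vsum4ubs_lane_model (bytes, 0), acc);
}

/* Small self-check showing where the two forms diverge.  */
void
check_lane_models (void)
{
  const uint8_t ones[4] = { 1, 1, 1, 1 };

  assert (usad_lane_fused_model (ones, 10) == usad_lane_split_model (ones, 10));
  assert (usad_lane_fused_model (ones, UINT32_MAX - 1) == UINT32_MAX);
  assert (usad_lane_split_model (ones, UINT32_MAX - 1) == 2);
}
```

The two forms agree until the accumulator nears 2^32-1; past that point only
the split form keeps the wrapping arithmetic of the scalar loop.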

Bootstrapped and tested on powerpc64le-linux-gnu for POWER8 and POWER9
subtargets with no regressions.  Is this ok for trunk?

Thanks,
Bill


[gcc]

2017-11-05  Bill Schmidt  <wschmidt@linux.vnet.ibm.com>

	* config/rs6000/altivec.md (*p9_vadu<mode>3): Rename to
	p9_vadu<mode>3.
	(usadv16qi): New define_expand.
	(usadv8hi): New define_expand.

[gcc/testsuite]

2017-11-05  Bill Schmidt  <wschmidt@linux.vnet.ibm.com>

	* gcc.target/powerpc/sad-vectorize-1.c: New file.
	* gcc.target/powerpc/sad-vectorize-2.c: New file.
	* gcc.target/powerpc/sad-vectorize-3.c: New file.
	* gcc.target/powerpc/sad-vectorize-4.c: New file.

Index: gcc/config/rs6000/altivec.md
===================================================================
--- gcc/config/rs6000/altivec.md	(revision 254428)
+++ gcc/config/rs6000/altivec.md	(working copy)
@@ -4020,7 +4020,7 @@
   "TARGET_P9_VECTOR")
 
 ;; Vector absolute difference unsigned
-(define_insn "*p9_vadu<mode>3"
+(define_insn "p9_vadu<mode>3"
   [(set (match_operand:VI 0 "register_operand" "=v")
         (unspec:VI [(match_operand:VI 1 "register_operand" "v")
 		    (match_operand:VI 2 "register_operand" "v")]
@@ -4184,6 +4184,51 @@
   "vbpermd %0,%1,%2"
   [(set_attr "type" "vecsimple")])
 
+;; Support for SAD (sum of absolute differences).
+
+;; Due to saturating semantics, we can't combine the sum-across
+;; with the vector accumulate in vsum4ubs.  A vadduwm is needed.
+(define_expand "usadv16qi"
+  [(use (match_operand:V4SI 0 "register_operand"))
+   (use (match_operand:V16QI 1 "register_operand"))
+   (use (match_operand:V16QI 2 "register_operand"))
+   (use (match_operand:V4SI 3 "register_operand"))]
+  "TARGET_P9_VECTOR"
+  "
+{
+  rtx absd = gen_reg_rtx (V16QImode);
+  rtx zero = gen_reg_rtx (V4SImode);
+  rtx psum = gen_reg_rtx (V4SImode);
+
+  emit_insn (gen_p9_vaduv16qi3 (absd, operands[1], operands[2]));
+  emit_insn (gen_altivec_vspltisw (zero, const0_rtx));
+  emit_insn (gen_altivec_vsum4ubs (psum, absd, zero));
+  emit_insn (gen_addv4si3 (operands[0], psum, operands[3]));
+  DONE;
+}")
+
+;; Since vsum4shs is saturating and further performs signed
+;; arithmetic, we can't combine the sum-across with the vector
+;; accumulate in vsum4shs.  A vadduwm is needed.
+(define_expand "usadv8hi"
+  [(use (match_operand:V4SI 0 "register_operand"))
+   (use (match_operand:V8HI 1 "register_operand"))
+   (use (match_operand:V8HI 2 "register_operand"))
+   (use (match_operand:V4SI 3 "register_operand"))]
+  "TARGET_P9_VECTOR"
+  "
+{
+  rtx absd = gen_reg_rtx (V8HImode);
+  rtx zero = gen_reg_rtx (V4SImode);
+  rtx psum = gen_reg_rtx (V4SImode);
+
+  emit_insn (gen_p9_vaduv8hi3 (absd, operands[1], operands[2]));
+  emit_insn (gen_altivec_vspltisw (zero, const0_rtx));
+  emit_insn (gen_altivec_vsum4shs (psum, absd, zero));
+  emit_insn (gen_addv4si3 (operands[0], psum, operands[3]));
+  DONE;
+}")
+
 ;; Decimal Integer operations
 (define_int_iterator UNSPEC_BCD_ADD_SUB [UNSPEC_BCDADD UNSPEC_BCDSUB])
 
Index: gcc/testsuite/gcc.target/powerpc/sad-vectorize-1.c
===================================================================
--- gcc/testsuite/gcc.target/powerpc/sad-vectorize-1.c	(nonexistent)
+++ gcc/testsuite/gcc.target/powerpc/sad-vectorize-1.c	(working copy)
@@ -0,0 +1,36 @@
+/* { dg-do compile { target { powerpc*-*-* } } } */
+/* { dg-skip-if "do not override -mcpu" { powerpc*-*-* } { "-mcpu=*" } { "-mcpu=power9" } } */
+/* { dg-require-effective-target powerpc_p9vector_ok } */
+/* { dg-skip-if "" { powerpc*-*-aix* } } */
+/* { dg-options "-O3 -mcpu=power9" } */
+
+/* Verify that we vectorize this SAD loop using vabsdub. */
+
+extern int abs (int __x) __attribute__ ((__nothrow__, __leaf__)) __attribute__ ((__const__));
+
+static int
+foo (unsigned char *w, int i, unsigned char *x, int j)
+{
+  int tot = 0;
+  for (int a = 0; a < 16; a++)
+    {
+      for (int b = 0; b < 16; b++)
+	tot += abs (w[b] - x[b]);
+      w += i;
+      x += j;
+    }
+  return tot;
+}
+
+void
+bar (unsigned char *w, unsigned char *x, int i, int *result)
+{
+  *result = foo (w, 16, x, i);
+}
+
+/* { dg-final { scan-assembler-times "vabsdub" 16 } } */
+/* { dg-final { scan-assembler-times "vsum4ubs" 16 } } */
+/* { dg-final { scan-assembler-times "vadduwm" 17 } } */
+
+/* Note: One of the 16 adds is optimized out (add with zero),
+   leaving 15.  The extra two adds are for the final reduction.  */
Index: gcc/testsuite/gcc.target/powerpc/sad-vectorize-2.c
===================================================================
--- gcc/testsuite/gcc.target/powerpc/sad-vectorize-2.c	(nonexistent)
+++ gcc/testsuite/gcc.target/powerpc/sad-vectorize-2.c	(working copy)
@@ -0,0 +1,36 @@
+/* { dg-do compile { target { powerpc*-*-* } } } */
+/* { dg-skip-if "do not override -mcpu" { powerpc*-*-* } { "-mcpu=*" } { "-mcpu=power9" } } */
+/* { dg-require-effective-target powerpc_p9vector_ok } */
+/* { dg-skip-if "" { powerpc*-*-aix* } } */
+/* { dg-options "-O3 -mcpu=power9" } */
+
+/* Verify that we vectorize this SAD loop using vabsduh. */
+
+extern int abs (int __x) __attribute__ ((__nothrow__, __leaf__)) __attribute__ ((__const__));
+
+static int
+foo (unsigned short *w, int i, unsigned short *x, int j)
+{
+  int tot = 0;
+  for (int a = 0; a < 16; a++)
+    {
+      for (int b = 0; b < 8; b++)
+	tot += abs (w[b] - x[b]);
+      w += i;
+      x += j;
+    }
+  return tot;
+}
+
+void
+bar (unsigned short *w, unsigned short *x, int i, int *result)
+{
+  *result = foo (w, 8, x, i);
+}
+
+/* { dg-final { scan-assembler-times "vabsduh" 16 } } */
+/* { dg-final { scan-assembler-times "vsum4shs" 16 } } */
+/* { dg-final { scan-assembler-times "vadduwm" 17 } } */
+
+/* Note: One of the 16 adds is optimized out (add with zero),
+   leaving 15.  The extra two adds are for the final reduction.  */
Index: gcc/testsuite/gcc.target/powerpc/sad-vectorize-3.c
===================================================================
--- gcc/testsuite/gcc.target/powerpc/sad-vectorize-3.c	(nonexistent)
+++ gcc/testsuite/gcc.target/powerpc/sad-vectorize-3.c	(working copy)
@@ -0,0 +1,57 @@
+/* { dg-do run { target { powerpc*-*-linux* && { lp64 && p9vector_hw } } } } */
+/* { dg-require-effective-target powerpc_p9vector_ok } */
+/* { dg-options "-O3 -mcpu=power9" } */
+/* { dg-skip-if "do not override -mcpu" { powerpc*-*-* } { "-mcpu=*" } { "-mcpu=power9" } } */
+
+/* Verify that we get correct code when we vectorize this SAD loop using
+   vabsdub. */
+
+extern void abort ();
+extern int abs (int __x) __attribute__ ((__nothrow__, __leaf__)) __attribute__ ((__const__));
+
+static int
+foo (unsigned char *w, int i, unsigned char *x, int j)
+{
+  int tot = 0;
+  for (int a = 0; a < 16; a++)
+    {
+      for (int b = 0; b < 16; b++)
+	tot += abs (w[b] - x[b]);
+      w += i;
+      x += j;
+    }
+  return tot;
+}
+
+void
+bar (unsigned char *w, unsigned char *x, int i, int *result)
+{
+  *result = foo (w, 16, x, i);
+}
+
+int
+main ()
+{
+  unsigned char m[256];
+  unsigned char n[256];
+  int sum, i;
+
+  for (i = 0; i < 256; ++i)
+    if (i % 2 == 0)
+      {
+	m[i] = (i % 8) * 2 + 1;
+	n[i] = -(i % 8);
+      }
+    else
+      {
+	m[i] = -((i % 8) * 2 + 2);
+	n[i] = -((i % 8) >> 1);
+      }
+  
+  bar (m, n, 16, &sum);
+
+  if (sum != 32384)
+    abort ();
+
+  return 0;
+}
Index: gcc/testsuite/gcc.target/powerpc/sad-vectorize-4.c
===================================================================
--- gcc/testsuite/gcc.target/powerpc/sad-vectorize-4.c	(nonexistent)
+++ gcc/testsuite/gcc.target/powerpc/sad-vectorize-4.c	(working copy)
@@ -0,0 +1,57 @@
+/* { dg-do run { target { powerpc*-*-linux* && { lp64 && p9vector_hw } } } } */
+/* { dg-require-effective-target powerpc_p9vector_ok } */
+/* { dg-options "-O3 -mcpu=power9" } */
+/* { dg-skip-if "do not override -mcpu" { powerpc*-*-* } { "-mcpu=*" } { "-mcpu=power9" } } */
+
+/* Verify that we get correct code when we vectorize this SAD loop using
+   vabsduh. */
+
+extern void abort ();
+extern int abs (int __x) __attribute__ ((__nothrow__, __leaf__)) __attribute__ ((__const__));
+
+static int
+foo (unsigned short *w, int i, unsigned short *x, int j)
+{
+  int tot = 0;
+  for (int a = 0; a < 16; a++)
+    {
+      for (int b = 0; b < 8; b++)
+	tot += abs (w[b] - x[b]);
+      w += i;
+      x += j;
+    }
+  return tot;
+}
+
+void
+bar (unsigned short *w, unsigned short *x, int i, int *result)
+{
+  *result = foo (w, 8, x, i);
+}
+
+int
+main ()
+{
+  unsigned short m[128];
+  unsigned short n[128];
+  int sum, i;
+
+  for (i = 0; i < 128; ++i)
+    if (i % 2 == 0)
+      {
+	m[i] = (i % 8) * 2 + 1;
+	n[i] = i % 8;
+      }
+    else
+      {
+	m[i] = (i % 8) * 4 - 3;
+	n[i] = (i % 8) >> 1;
+      }
+  
+  bar (m, n, 8, &sum);
+
+  if (sum != 992)
+    abort ();
+
+  return 0;
+}