From patchwork Mon Nov 6 00:25:11 2017
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Bill Schmidt
X-Patchwork-Id: 834424
To: GCC Patches
Cc: Segher Boessenkool, David Edelsohn
From: Bill Schmidt
Subject: [PATCH, rs6000] Add support for usadv16qi and usadv8hi standard patterns
Date: Sun, 5 Nov 2017 18:25:11 -0600
Message-Id: <2a3e7921-c2ee-ff57-677a-f84becc0f002@linux.vnet.ibm.com>

Hi,

This patch adds support for vectorization of unsigned SAD
expressions.  SAD vectorization uses the usad pattern to represent a
widening accumulation of SADs performed on a narrower type.  The two
cases in this patch operate on V16QImode and V8HImode, respectively,
accumulating into V4SImode.

A vectorized loop on SAD operations will use these patterns in the main
loop body.  During the loop epilogue it performs a final reduction to sum
the 4 accumulated results in the V4SImode accumulator.  POWER's sum-across
ops (vsum4ubs and vsum4shs) unfortunately have saturating semantics, so
they can only be used for the sum-across; the accumulation with previous
iteration results requires a separate add.

Bootstrapped and tested on powerpc64le-linux-gnu for POWER8 and POWER9
subtargets with no regressions.  Is this ok for trunk?

Thanks,
Bill


[gcc]

2017-11-05  Bill Schmidt

        * config/rs6000/altivec.md (*p9_vadu<mode>3): Rename to
        p9_vadu<mode>3.
        (usadv16qi): New define_expand.
        (usadv8hi): New define_expand.

[gcc/testsuite]

2017-11-05  Bill Schmidt

        * gcc.target/powerpc/sad-vectorize-1.c: New file.
        * gcc.target/powerpc/sad-vectorize-2.c: New file.
        * gcc.target/powerpc/sad-vectorize-3.c: New file.
        * gcc.target/powerpc/sad-vectorize-4.c: New file.

Index: gcc/config/rs6000/altivec.md
===================================================================
--- gcc/config/rs6000/altivec.md (revision 254428)
+++ gcc/config/rs6000/altivec.md (working copy)
@@ -4020,7 +4020,7 @@
   "TARGET_P9_VECTOR")
 
 ;; Vector absolute difference unsigned
-(define_insn "*p9_vadu<mode>3"
+(define_insn "p9_vadu<mode>3"
   [(set (match_operand:VI 0 "register_operand" "=v")
         (unspec:VI [(match_operand:VI 1 "register_operand" "v")
                     (match_operand:VI 2 "register_operand" "v")]
@@ -4184,6 +4184,51 @@
   "vbpermd %0,%1,%2"
   [(set_attr "type" "vecsimple")])
 
+;; Support for SAD (sum of absolute differences).
+
+;; Due to saturating semantics, we can't combine the sum-across
+;; with the vector accumulate in vsum4ubs.  A vadduwm is needed.
+(define_expand "usadv16qi"
+  [(use (match_operand:V4SI 0 "register_operand"))
+   (use (match_operand:V16QI 1 "register_operand"))
+   (use (match_operand:V16QI 2 "register_operand"))
+   (use (match_operand:V4SI 3 "register_operand"))]
+  "TARGET_P9_VECTOR"
+  "
+{
+  rtx absd = gen_reg_rtx (V16QImode);
+  rtx zero = gen_reg_rtx (V4SImode);
+  rtx psum = gen_reg_rtx (V4SImode);
+
+  emit_insn (gen_p9_vaduv16qi3 (absd, operands[1], operands[2]));
+  emit_insn (gen_altivec_vspltisw (zero, const0_rtx));
+  emit_insn (gen_altivec_vsum4ubs (psum, absd, zero));
+  emit_insn (gen_addv4si3 (operands[0], psum, operands[3]));
+  DONE;
+}")
+
+;; Since vsum4shs is saturating and further performs signed
+;; arithmetic, we can't combine the sum-across with the vector
+;; accumulate in vsum4shs.  A vadduwm is needed.
+(define_expand "usadv8hi"
+  [(use (match_operand:V4SI 0 "register_operand"))
+   (use (match_operand:V8HI 1 "register_operand"))
+   (use (match_operand:V8HI 2 "register_operand"))
+   (use (match_operand:V4SI 3 "register_operand"))]
+  "TARGET_P9_VECTOR"
+  "
+{
+  rtx absd = gen_reg_rtx (V8HImode);
+  rtx zero = gen_reg_rtx (V4SImode);
+  rtx psum = gen_reg_rtx (V4SImode);
+
+  emit_insn (gen_p9_vaduv8hi3 (absd, operands[1], operands[2]));
+  emit_insn (gen_altivec_vspltisw (zero, const0_rtx));
+  emit_insn (gen_altivec_vsum4shs (psum, absd, zero));
+  emit_insn (gen_addv4si3 (operands[0], psum, operands[3]));
+  DONE;
+}")
+
 ;; Decimal Integer operations
 (define_int_iterator UNSPEC_BCD_ADD_SUB [UNSPEC_BCDADD UNSPEC_BCDSUB])
 
Index: gcc/testsuite/gcc.target/powerpc/sad-vectorize-1.c
===================================================================
--- gcc/testsuite/gcc.target/powerpc/sad-vectorize-1.c (nonexistent)
+++ gcc/testsuite/gcc.target/powerpc/sad-vectorize-1.c (working copy)
@@ -0,0 +1,36 @@
+/* { dg-do compile { target { powerpc*-*-* } } } */
+/* { dg-skip-if "do not override -mcpu" { powerpc*-*-* } { "-mcpu=*" } { "-mcpu=power9" } } */
+/* { dg-require-effective-target powerpc_p9vector_ok } */
+/* { dg-skip-if "" { powerpc*-*-aix* } } */
+/* { dg-options "-O3 -mcpu=power9" } */
+
+/* Verify that we vectorize this SAD loop using vabsdub. */
+
+extern int abs (int __x) __attribute__ ((__nothrow__, __leaf__)) __attribute__ ((__const__));
+
+static int
+foo (unsigned char *w, int i, unsigned char *x, int j)
+{
+  int tot = 0;
+  for (int a = 0; a < 16; a++)
+    {
+      for (int b = 0; b < 16; b++)
+        tot += abs (w[b] - x[b]);
+      w += i;
+      x += j;
+    }
+  return tot;
+}
+
+void
+bar (unsigned char *w, unsigned char *x, int i, int *result)
+{
+  *result = foo (w, 16, x, i);
+}
+
+/* { dg-final { scan-assembler-times "vabsdub" 16 } } */
+/* { dg-final { scan-assembler-times "vsum4ubs" 16 } } */
+/* { dg-final { scan-assembler-times "vadduwm" 17 } } */
+
+/* Note: One of the 16 adds is optimized out (add with zero),
+   leaving 15.  The extra two adds are for the final reduction.  */
Index: gcc/testsuite/gcc.target/powerpc/sad-vectorize-2.c
===================================================================
--- gcc/testsuite/gcc.target/powerpc/sad-vectorize-2.c (nonexistent)
+++ gcc/testsuite/gcc.target/powerpc/sad-vectorize-2.c (working copy)
@@ -0,0 +1,36 @@
+/* { dg-do compile { target { powerpc*-*-* } } } */
+/* { dg-skip-if "do not override -mcpu" { powerpc*-*-* } { "-mcpu=*" } { "-mcpu=power9" } } */
+/* { dg-require-effective-target powerpc_p9vector_ok } */
+/* { dg-skip-if "" { powerpc*-*-aix* } } */
+/* { dg-options "-O3 -mcpu=power9" } */
+
+/* Verify that we vectorize this SAD loop using vabsduh. */
+
+extern int abs (int __x) __attribute__ ((__nothrow__, __leaf__)) __attribute__ ((__const__));
+
+static int
+foo (unsigned short *w, int i, unsigned short *x, int j)
+{
+  int tot = 0;
+  for (int a = 0; a < 16; a++)
+    {
+      for (int b = 0; b < 8; b++)
+        tot += abs (w[b] - x[b]);
+      w += i;
+      x += j;
+    }
+  return tot;
+}
+
+void
+bar (unsigned short *w, unsigned short *x, int i, int *result)
+{
+  *result = foo (w, 8, x, i);
+}
+
+/* { dg-final { scan-assembler-times "vabsduh" 16 } } */
+/* { dg-final { scan-assembler-times "vsum4shs" 16 } } */
+/* { dg-final { scan-assembler-times "vadduwm" 17 } } */
+
+/* Note: One of the 16 adds is optimized out (add with zero),
+   leaving 15.  The extra two adds are for the final reduction.  */
Index: gcc/testsuite/gcc.target/powerpc/sad-vectorize-3.c
===================================================================
--- gcc/testsuite/gcc.target/powerpc/sad-vectorize-3.c (nonexistent)
+++ gcc/testsuite/gcc.target/powerpc/sad-vectorize-3.c (working copy)
@@ -0,0 +1,57 @@
+/* { dg-do run { target { powerpc*-*-linux* && { lp64 && p9vector_hw } } } } */
+/* { dg-require-effective-target powerpc_p9vector_ok } */
+/* { dg-options "-O3 -mcpu=power9" } */
+/* { dg-skip-if "do not override -mcpu" { powerpc*-*-* } { "-mcpu=*" } { "-mcpu=power9" } } */
+
+/* Verify that we get correct code when we vectorize this SAD loop using
+   vabsdub. */
+
+extern void abort ();
+extern int abs (int __x) __attribute__ ((__nothrow__, __leaf__)) __attribute__ ((__const__));
+
+static int
+foo (unsigned char *w, int i, unsigned char *x, int j)
+{
+  int tot = 0;
+  for (int a = 0; a < 16; a++)
+    {
+      for (int b = 0; b < 16; b++)
+        tot += abs (w[b] - x[b]);
+      w += i;
+      x += j;
+    }
+  return tot;
+}
+
+void
+bar (unsigned char *w, unsigned char *x, int i, int *result)
+{
+  *result = foo (w, 16, x, i);
+}
+
+int
+main ()
+{
+  unsigned char m[256];
+  unsigned char n[256];
+  int sum, i;
+
+  for (i = 0; i < 256; ++i)
+    if (i % 2 == 0)
+      {
+        m[i] = (i % 8) * 2 + 1;
+        n[i] = -(i % 8);
+      }
+    else
+      {
+        m[i] = -((i % 8) * 2 + 2);
+        n[i] = -((i % 8) >> 1);
+      }
+
+  bar (m, n, 16, &sum);
+
+  if (sum != 32384)
+    abort ();
+
+  return 0;
+}
Index: gcc/testsuite/gcc.target/powerpc/sad-vectorize-4.c
===================================================================
--- gcc/testsuite/gcc.target/powerpc/sad-vectorize-4.c (nonexistent)
+++ gcc/testsuite/gcc.target/powerpc/sad-vectorize-4.c (working copy)
@@ -0,0 +1,57 @@
+/* { dg-do run { target { powerpc*-*-linux* && { lp64 && p9vector_hw } } } } */
+/* { dg-require-effective-target powerpc_p9vector_ok } */
+/* { dg-options "-O3 -mcpu=power9" } */
+/* { dg-skip-if "do not override -mcpu" { powerpc*-*-* } { "-mcpu=*" } { "-mcpu=power9" } } */
+
+/* Verify that we get correct code when we vectorize this SAD loop using
+   vabsduh. */
+
+extern void abort ();
+extern int abs (int __x) __attribute__ ((__nothrow__, __leaf__)) __attribute__ ((__const__));
+
+static int
+foo (unsigned short *w, int i, unsigned short *x, int j)
+{
+  int tot = 0;
+  for (int a = 0; a < 16; a++)
+    {
+      for (int b = 0; b < 8; b++)
+        tot += abs (w[b] - x[b]);
+      w += i;
+      x += j;
+    }
+  return tot;
+}
+
+void
+bar (unsigned short *w, unsigned short *x, int i, int *result)
+{
+  *result = foo (w, 8, x, i);
+}
+
+int
+main ()
+{
+  unsigned short m[128];
+  unsigned short n[128];
+  int sum, i;
+
+  for (i = 0; i < 128; ++i)
+    if (i % 2 == 0)
+      {
+        m[i] = (i % 8) * 2 + 1;
+        n[i] = i % 8;
+      }
+    else
+      {
+        m[i] = (i % 8) * 4 - 3;
+        n[i] = (i % 8) >> 1;
+      }
+
+  bar (m, n, 8, &sum);
+
+  if (sum != 992)
+    abort ();
+
+  return 0;
+}
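
For readers who want to see what the usadv16qi expansion computes, here is a rough scalar sketch in plain C.  It is not part of the patch and not GCC code: the names model_vabsdub, model_vsum4ubs, model_usadv16qi and sad_demo are hypothetical, and the byte-to-word element ordering is simplified (it does not matter for the final total).  It assumes vsum4ubs adds four unsigned bytes to a word and saturates at 2^32-1, which is why the expander feeds it a zero vector and leaves the running accumulation to a separate modular add.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical scalar model of vabsdub: per-byte unsigned absolute
   difference.  */
static void
model_vabsdub (uint8_t d[16], const uint8_t *a, const uint8_t *b)
{
  for (int k = 0; k < 16; k++)
    d[k] = a[k] > b[k] ? a[k] - b[k] : b[k] - a[k];
}

/* Hypothetical scalar model of vsum4ubs: each result word is the sum of
   four unsigned bytes plus the matching word of ADD, saturated to
   2^32-1.  */
static void
model_vsum4ubs (uint32_t r[4], const uint8_t v[16], const uint32_t add[4])
{
  for (int w = 0; w < 4; w++)
    {
      uint64_t s = add[w];
      for (int k = 0; k < 4; k++)
        s += v[4 * w + k];
      r[w] = s > UINT32_MAX ? UINT32_MAX : (uint32_t) s;
    }
}

/* Model of the expansion: out = sum4 (|a - b|, 0) + in.  The last add
   wraps modulo 2^32, as vadduwm does, so no saturation is introduced
   into the accumulator.  */
static void
model_usadv16qi (uint32_t out[4], const uint8_t *a, const uint8_t *b,
                 const uint32_t in[4])
{
  static const uint32_t zero[4] = { 0, 0, 0, 0 };
  uint8_t absd[16];
  uint32_t psum[4];

  model_vabsdub (absd, a, b);
  model_vsum4ubs (psum, absd, zero);
  for (int w = 0; w < 4; w++)
    out[w] = psum[w] + in[w];
}

/* Rebuild the inputs of sad-vectorize-3.c, run 16 modelled iterations,
   reduce the four accumulator words, and check against the scalar
   loop.  */
static int
sad_demo (void)
{
  uint8_t m[256], n[256];
  uint32_t acc[4] = { 0, 0, 0, 0 };
  int ref = 0;

  for (int i = 0; i < 256; i++)
    if (i % 2 == 0)
      {
        m[i] = (i % 8) * 2 + 1;
        n[i] = -(i % 8);
      }
    else
      {
        m[i] = -((i % 8) * 2 + 2);
        n[i] = -((i % 8) >> 1);
      }

  for (int row = 0; row < 16; row++)
    {
      model_usadv16qi (acc, m + 16 * row, n + 16 * row, acc);
      for (int b = 0; b < 16; b++)
        ref += abs (m[16 * row + b] - n[16 * row + b]);
    }

  /* Final reduction, as done in the loop epilogue.  */
  int vec = (int) (acc[0] + acc[1] + acc[2] + acc[3]);
  assert (vec == ref);
  return vec;
}
```

Calling sad_demo () from a small driver returns 32384, the value sad-vectorize-3.c checks for.  The V8HImode case follows the same shape with eight halfword lanes, except that vsum4shs additionally performs signed arithmetic, as the comment in the patch notes.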