From patchwork Wed May  3 19:43:09 2017
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Bill Schmidt <wschmidt@linux.vnet.ibm.com>
X-Patchwork-Id: 758188
Return-Path: 
 <gcc-patches-return-452732-incoming=patchwork.ozlabs.org@gcc.gnu.org>
X-Original-To: incoming@patchwork.ozlabs.org
Delivered-To: patchwork-incoming@bilbo.ozlabs.org
Received: from sourceware.org (server1.sourceware.org [209.132.180.131])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256
	bits)) (No client certificate requested)
	by ozlabs.org (Postfix) with ESMTPS id 3wJ7r03Q0Qz9rxl
	for <incoming@patchwork.ozlabs.org>;
	Thu,  4 May 2017 05:43:27 +1000 (AEST)
Authentication-Results: ozlabs.org; dkim=pass (1024-bit key;
	unprotected) header.d=gcc.gnu.org header.i=@gcc.gnu.org
	header.b="rRDaU7eF"; dkim-atps=neutral
DomainKey-Signature: a=rsa-sha1; c=nofws; d=gcc.gnu.org; h=list-id
	:list-unsubscribe:list-archive:list-post:list-help:sender:to:cc
	:from:subject:date:mime-version:content-type
	:content-transfer-encoding:message-id; q=dns; s=default; b=QiLaL
	o29LriA6po+tS1eOGtJFfknRjIf29XeQMaUzRStuTNyEHFryr15cPOW33kjlfJv6
	u09n/PaA01QRTkR7Lw9F063msuCcaHlOvVT3THofTZ+PEtNL3ESeEb0lcjdDQLgg
	70l0fpVUSnaueGOFc2FAA4so2Jl/kICTuM+Qsk=
DKIM-Signature: v=1; a=rsa-sha1; c=relaxed; d=gcc.gnu.org; h=list-id
	:list-unsubscribe:list-archive:list-post:list-help:sender:to:cc
	:from:subject:date:mime-version:content-type
	:content-transfer-encoding:message-id; s=default; bh=8dp8bT+banb
	/SzeVhg7NZq13Fq4=; b=rRDaU7eFBlriCMoM93M4Doz/CVxB/B4FU5PkRJJjdlH
	hJK6st6B5uYLfhtRAawegOHaT5r+ubP2PapOkcuL43P2RcUhAtmERksMIkZzJanF
	6uKzJwcBrQlaYmVu80yrD6ZvzazwhKqsIG1VgalRiP6W6EUOxL7HMUf5sQiAc9rg
	=
Received: (qmail 60353 invoked by alias); 3 May 2017 19:43:16 -0000
Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
List-Id: <gcc-patches.gcc.gnu.org>
List-Unsubscribe: 
 <mailto:gcc-patches-unsubscribe-incoming=patchwork.ozlabs.org@gcc.gnu.org>
List-Archive: <http://gcc.gnu.org/ml/gcc-patches/>
List-Post: <mailto:gcc-patches@gcc.gnu.org>
List-Help: <mailto:gcc-patches-help@gcc.gnu.org>
Sender: gcc-patches-owner@gcc.gnu.org
Delivered-To: mailing list gcc-patches@gcc.gnu.org
Received: (qmail 60342 invoked by uid 89); 3 May 2017 19:43:16 -0000
Authentication-Results: sourceware.org; auth=none
X-Virus-Found: No
X-Spam-SWARE-Status: No, score=-10.2 required=5.0 tests=AWL, BAYES_00,
	GIT_PATCH_2, GIT_PATCH_3, KAM_ASCII_DIVIDERS,
	KAM_LAZY_DOMAIN_SECURITY,
	RCVD_IN_DNSWL_LOW autolearn=ham version=3.3.2 spammy=reversed,
	profitably, wash
X-HELO: mx0a-001b2d01.pphosted.com
Received: from mx0b-001b2d01.pphosted.com (HELO mx0a-001b2d01.pphosted.com)
	(148.163.158.5) by sourceware.org
	(qpsmtpd/0.93/v0.84-503-g423c35a) with ESMTP;
	Wed, 03 May 2017 19:43:14 +0000
Received: from pps.filterd (m0098417.ppops.net [127.0.0.1])	by
	mx0a-001b2d01.pphosted.com (8.16.0.20/8.16.0.20) with SMTP id
	v43Jcm0x038081	for <gcc-patches@gcc.gnu.org>;
	Wed, 3 May 2017 15:43:14 -0400
Received: from e18.ny.us.ibm.com (e18.ny.us.ibm.com [129.33.205.208])	by
	mx0a-001b2d01.pphosted.com with ESMTP id
	2a7mehts7f-1	(version=TLSv1.2 cipher=AES256-SHA bits=256
	verify=NOT)	for <gcc-patches@gcc.gnu.org>;
	Wed, 03 May 2017 15:43:13 -0400
Received: from localhost	by e18.ny.us.ibm.com with IBM ESMTP SMTP Gateway:
	Authorized Use Only! Violators will be prosecuted	for
	<gcc-patches@gcc.gnu.org> from <wschmidt@linux.vnet.ibm.com>;
	Wed, 3 May 2017 15:43:13 -0400
Received: from b01cxnp23032.gho.pok.ibm.com (9.57.198.27)	by
	e18.ny.us.ibm.com (146.89.104.205) with IBM ESMTP SMTP
	Gateway: Authorized Use Only! Violators will be prosecuted;
	Wed, 3 May 2017 15:43:10 -0400
Received: from b01ledav004.gho.pok.ibm.com (b01ledav004.gho.pok.ibm.com
	[9.57.199.109])	by b01cxnp23032.gho.pok.ibm.com
	(8.14.9/8.14.9/NCO v10.0) with ESMTP id v43JhAOs39780562;
	Wed, 3 May 2017 19:43:10 GMT
Received: from b01ledav004.gho.pok.ibm.com (unknown [127.0.0.1])	by IMSVA
	(Postfix) with ESMTP id D7A55112054;
	Wed,  3 May 2017 15:43:10 -0400 (EDT)
Received: from bigmac.rchland.ibm.com (unknown [9.10.86.41])	by
	b01ledav004.gho.pok.ibm.com (Postfix) with ESMTP id
	9D856112034; Wed,  3 May 2017 15:43:10 -0400 (EDT)
To: GCC Patches <gcc-patches@gcc.gnu.org>
Cc: Segher Boessenkool <segher@kernel.crashing.org>,
	David Edelsohn <dje.gcc@gmail.com>
From: Bill Schmidt <wschmidt@linux.vnet.ibm.com>
Subject: [PATCH,
	rs6000] Avoid vectorizing versioned copy loops with vectorization
	factor 2
Date: Wed, 3 May 2017 14:43:09 -0500
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12;
	rv:45.0) Gecko/20100101 Thunderbird/45.8.0
MIME-Version: 1.0
X-TM-AS-GCONF: 00
x-cbid: 17050319-0044-0000-0000-000003201D32
X-IBM-SpamModules-Scores: 
X-IBM-SpamModules-Versions: BY=3.00007018; HX=3.00000240; KW=3.00000007;
	PH=3.00000004; SC=3.00000208; SDB=6.00855796; UDB=6.00423582;
	IPR=6.00634959; BA=6.00005323; NDR=6.00000001; ZLA=6.00000005;
	ZF=6.00000009; ZB=6.00000000; ZP=6.00000000; ZH=6.00000000;
	ZU=6.00000002; MB=3.00015286; XFM=3.00000014;
	UTC=2017-05-03 19:43:11
X-IBM-AV-DETECTION: SAVI=unused REMOTE=unused XFE=unused
x-cbparentid: 17050319-0045-0000-0000-0000074E2732
Message-Id: <f4ee0d29-6cdd-6b4f-167a-3fec1b38358f@linux.vnet.ibm.com>
X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10432:, ,
	definitions=2017-05-03_14:, , signatures=0
X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0
	spamscore=0 suspectscore=0 malwarescore=0 phishscore=0
	adultscore=0 bulkscore=0 classifier=spam adjust=0 reason=mlx
	scancount=1 engine=8.0.1-1703280000
	definitions=main-1705030345
X-IsSubscribed: yes

Hi,

We recently became aware of some poor code generation caused by loop
vectorization that is unprofitable on POWER.  When a loop simply copies
data with 64-bit loads and stores, vectorizing it with 128-bit loads and
stores generally provides no benefit on modern POWER processors.
Furthermore, if the loop must be versioned for aliasing, alignment, etc.,
the cost of the versioning test almost certainly makes vectorization a net
performance loss for such loops.  The customer code that exposed this
contained such a copy loop, executed only a few times on average, inside
an outer loop that was executed many times on average, and suffered a
tremendous slowdown.
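
For illustration only, here is a minimal sketch of the loop shape described
above.  It is not the customer code; copy_chunks and copy_records are
hypothetical names, and the record sizes are assumptions.  Each 8-byte
memcpy is expected to become one 64-bit load and store, and since the
pointers are not restrict-qualified, a vectorized version would presumably
need a runtime versioning check.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy n bytes from src to dst in 8-byte chunks.  n must be a nonzero
   multiple of 8.  The pointers may alias as far as the compiler knows.  */
static void
copy_chunks (uint8_t *dst, const uint8_t *src, size_t n)
{
  assert (n != 0 && n % 8 == 0);
  uint8_t *const end = dst + n;

  do
    {
      memcpy (dst, src, 8);
      dst += 8;
      src += 8;
    }
  while (dst < end);
}

/* Hypothetical hot outer loop: many records, each only a few chunks
   long, so the inner loop runs a handful of iterations per call and any
   per-entry versioning test is amortized poorly.  */
static void
copy_records (uint8_t *dst, const uint8_t *src, size_t nrecords,
	      size_t record_bytes)
{
  for (size_t i = 0; i < nrecords; i++)
    copy_chunks (dst + i * record_bytes, src + i * record_bytes,
		 record_bytes);
}
```

With a 16-byte record, the inner loop runs only two iterations per call,
which is a single iteration at a vectorization factor of 2.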

This patch targets exactly these loops (pure copy loops with a
vectorization factor of 2 that require versioning) and no others.  It
artificially inflates the vector body cost so that vectorization does not
appear profitable.  This is done within the target cost model hooks to
avoid affecting other targets.  A new test case is included that
demonstrates the refusal to vectorize.
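
The decision logic can be sketched outside of GCC as follows.  This is a
simplified model under stated assumptions, not the rs6000.c code: the
struct, enums, and function names are hypothetical, and only two
representative non-memory kinds are modeled rather than the full list
checked in the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the vectorizer's statement kinds and
   cost locations.  */
enum stmt_kind { KIND_VECTOR_LOAD, KIND_VECTOR_STORE, KIND_VECTOR_STMT,
		 KIND_VEC_PERM };
enum cost_where { WHERE_PROLOGUE, WHERE_BODY, WHERE_EPILOGUE };

struct loop_cost
{
  unsigned cost[3];		/* Indexed by enum cost_where.  */
  bool saw_nonmem;		/* True once a non-copy operation is seen.  */
  int vect_factor;
  bool requires_versioning;
};

static void
loop_cost_init (struct loop_cost *lc, int vect_factor,
		bool requires_versioning)
{
  assert (vect_factor > 0);
  lc->cost[WHERE_PROLOGUE] = 0;
  lc->cost[WHERE_BODY] = 0;
  lc->cost[WHERE_EPILOGUE] = 0;
  lc->saw_nonmem = false;
  lc->vect_factor = vect_factor;
  lc->requires_versioning = requires_versioning;
}

/* Accumulate a statement's cost and note whether the loop does anything
   other than load and store.  */
static void
loop_cost_add (struct loop_cost *lc, enum cost_where where,
	       enum stmt_kind kind, unsigned stmt_cost)
{
  lc->cost[where] += stmt_cost;
  if (kind == KIND_VEC_PERM
      || (where == WHERE_BODY && kind == KIND_VECTOR_STMT))
    lc->saw_nonmem = true;
}

/* Return the body cost, inflated for versioned VF=2 pure copy loops.  */
static unsigned
loop_cost_finish_body (const struct loop_cost *lc)
{
  unsigned body = lc->cost[WHERE_BODY];
  if (!lc->saw_nonmem && lc->vect_factor == 2 && lc->requires_versioning)
    body += 10000;
  return body;
}
```

In this model the penalty applies only when all three conditions hold, so
a loop with any arithmetic in its body, a larger vectorization factor, or
no versioning requirement is costed as before.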

We've done SPEC performance testing to verify that the patch does not
degrade those benchmarks.  All differences were within the noise range.
The customer code's performance loss was verified to have been reversed.

Bootstrapped and tested on powerpc64le-unknown-linux-gnu with no regressions.
Is this ok for trunk?

Thanks,
Bill


[gcc]

2017-05-03  Bill Schmidt  <wschmidt@linux.vnet.ibm.com>

	* config/rs6000/rs6000.c (rs6000_vect_nonmem): New static var.
	(rs6000_init_cost): Initialize rs6000_vect_nonmem.
	(rs6000_add_stmt_cost): Update rs6000_vect_nonmem.
	(rs6000_finish_cost): Avoid vectorizing simple copy loops with
	VF=2 that require versioning.

[gcc/testsuite]

2017-05-03  Bill Schmidt  <wschmidt@linux.vnet.ibm.com>

	* gcc.target/powerpc/versioned-copy-loop.c: New file.

Index: gcc/config/rs6000/rs6000.c
===================================================================
--- gcc/config/rs6000/rs6000.c	(revision 247560)
+++ gcc/config/rs6000/rs6000.c	(working copy)
@@ -5873,6 +5873,8 @@ rs6000_density_test (rs6000_cost_data *data)
 
 /* Implement targetm.vectorize.init_cost.  */
 
+static bool rs6000_vect_nonmem;
+
 static void *
 rs6000_init_cost (struct loop *loop_info)
 {
@@ -5881,6 +5883,7 @@ rs6000_init_cost (struct loop *loop_info)
   data->cost[vect_prologue] = 0;
   data->cost[vect_body]     = 0;
   data->cost[vect_epilogue] = 0;
+  rs6000_vect_nonmem = false;
   return data;
 }
 
@@ -5907,6 +5910,19 @@ rs6000_add_stmt_cost (void *data, int count, enum
 
       retval = (unsigned) (count * stmt_cost);
       cost_data->cost[where] += retval;
+
+      /* Check whether we're doing something other than just a copy loop.
+	 Not all such loops may be profitably vectorized; see
+	 rs6000_finish_cost.  */
+      if ((where == vect_body
+	   && (kind == vector_stmt || kind == vec_to_scalar || kind == vec_perm
+	       || kind == vec_promote_demote || kind == vec_construct
+	       || kind == scalar_to_vec))
+	  || (where != vect_body
+	      && (kind == vec_to_scalar || kind == vec_perm
+		  || kind == vec_promote_demote || kind == vec_construct
+		  || kind == scalar_to_vec)))
+	rs6000_vect_nonmem = true;
     }
 
   return retval;
@@ -5923,6 +5939,19 @@ rs6000_finish_cost (void *data, unsigned *prologue
   if (cost_data->loop_info)
     rs6000_density_test (cost_data);
 
+  /* Don't vectorize minimum-vectorization-factor, simple copy loops
+     that require versioning for any reason.  The vectorization is at
+     best a wash inside the loop, and the versioning checks make
+     profitability highly unlikely and potentially quite harmful.  */
+  if (cost_data->loop_info)
+    {
+      loop_vec_info vec_info = loop_vec_info_for_loop (cost_data->loop_info);
+      if (!rs6000_vect_nonmem
+	  && LOOP_VINFO_VECT_FACTOR (vec_info) == 2
+	  && LOOP_REQUIRES_VERSIONING (vec_info))
+	cost_data->cost[vect_body] += 10000;
+    }
+
   *prologue_cost = cost_data->cost[vect_prologue];
   *body_cost     = cost_data->cost[vect_body];
   *epilogue_cost = cost_data->cost[vect_epilogue];
Index: gcc/testsuite/gcc.target/powerpc/versioned-copy-loop.c
===================================================================
--- gcc/testsuite/gcc.target/powerpc/versioned-copy-loop.c	(nonexistent)
+++ gcc/testsuite/gcc.target/powerpc/versioned-copy-loop.c	(working copy)
@@ -0,0 +1,30 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target powerpc_p8vector_ok } */
+/* { dg-options "-O3 -fdump-tree-vect-details" } */
+
+/* Verify that a pure copy loop with a vectorization factor of two
+   that requires versioning will not be vectorized.  See the cost
+   model hooks in rs6000.c.  */
+
+typedef long unsigned int size_t;
+typedef unsigned char uint8_t;
+
+extern void *memcpy (void *__restrict __dest, const void *__restrict __src,
+       size_t __n) __attribute__ ((__nothrow__ , __leaf__)) __attribute__ ((__nonnull__ (1, 2)));
+
+void foo (void *dstPtr, const void *srcPtr, void *dstEnd)
+{
+    uint8_t *d = (uint8_t*)dstPtr;
+    const uint8_t *s = (const uint8_t*)srcPtr;
+    uint8_t* const e = (uint8_t*)dstEnd;
+
+    do
+      {
+	memcpy (d, s, 8);
+	d += 8;
+	s += 8;
+      }
+    while (d < e);
+}
+
+/* { dg-final { scan-tree-dump-times "vectorized 0 loops" 1 "vect" } } */