From patchwork Thu Nov  5 13:33:47 2015
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Alan Lawrence <alan.lawrence@arm.com>
X-Patchwork-Id: 540425
From: Alan Lawrence <alan.lawrence@arm.com>
To: gcc-patches@gcc.gnu.org
Cc: richard.guenther@gmail.com
Subject: Re: [PATCH] PR/67682,
	break SLP groups up if only some elements match
Date: Thu,  5 Nov 2015 13:33:47 +0000
Message-Id: <1446730427-13428-1-git-send-email-alan.lawrence@arm.com>
In-Reply-To: 
 <CAFiYyc1Ad3f=LBJn7Af49wWPR_ptw5nekw-rrTfMfyCZ_GJ+iQ@mail.gmail.com>
References: 
 <CAFiYyc1Ad3f=LBJn7Af49wWPR_ptw5nekw-rrTfMfyCZ_GJ+iQ@mail.gmail.com>
X-MC-Unique: rTW7mwwsSfmWJux0HMC7rg-1
X-IsSubscribed: yes

On 03/11/15 13:39, Richard Biener wrote:
> On Tue, Oct 27, 2015 at 6:38 PM, Alan Lawrence <alan.lawrence@arm.com> wrote:
>>
>> Say I...P are consecutive, the input would have gaps 0 1 1 1 1 1 1 1. If we
>> split the load group, we would want subgroups with gaps 0 1 1 1 and 0 1 1 1?
>
> As said on IRC it should be 4 1 1 1 and 4 1 1 1.

Right. So a twelve-element group (0 1 1 1 1 1 1 1 1 1 1 1), once it has been
split into three subgroups, should end up as three groups of (8 1 1 1), via an
intermediate stage of (4 1 1 1 1 1 1 1) (8 1 1 1). This leads to the code in
the attached patch.
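
To make the gap arithmetic concrete, here is a minimal standalone sketch (not
GCC code). It models a group as a plain array of gaps, with the first entry
playing the role of GROUP_GAP on the first element; struct gap_group and
split_gap_group are hypothetical names used only for illustration.

```c
#include <assert.h>

/* Hypothetical stand-in for a store group: gaps[0] models GROUP_GAP of the
   first element, gaps[1..size-1] the gaps of the remaining elements.  */
struct gap_group
{
  unsigned *gaps;
  unsigned size;
};

/* Split G after its first GROUP1_SIZE elements and return the second group.
   Both groups keep pointing into the same underlying array.  */
static struct gap_group
split_gap_group (struct gap_group *g, unsigned group1_size)
{
  assert (group1_size > 0 && group1_size < g->size);

  /* Only consecutive stores are handled, so every interior gap must be 1.  */
  for (unsigned i = 1; i < g->size; i++)
    assert (g->gaps[i] == 1);

  struct gap_group g2 = { g->gaps + group1_size, g->size - group1_size };

  /* The second group inherits the gap before the original group, plus the
     elements of the first group that it now has to skip.  */
  g2.gaps[0] = g->gaps[0] + group1_size;

  /* The first group now has to skip over the second group as well.  */
  g->gaps[0] += g2.size;
  g->size = group1_size;

  return g2;
}
```

Under these assumptions, splitting the twelve-element group at 4 gives
(8 1 1 1) and (4 1 1 1 1 1 1 1), and splitting the second of those at 4 again
gives three groups of (8 1 1 1).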

> No, I don't think we can split load groups that way.  So I think if
> splitting store
> groups works well (with having larger load groups) then that's the way to go
> (even for loop vect).

Well, slp-11a.c still fails if I enable the splitting for non-BB SLP. I had
assumed this was because I needed to split the load groups too, but maybe not;
it may be a separate bug/issue with hybrid SLP. Whatever the reason, I still
think splitting groups in hybrid SLP belongs in another patch. (Do we really
want to put off handling the basic-block case until it works for hybrid SLP as
well? I would think not.) It sounds as if restricting splitting to store
groups, with asserts that GROUP_GAP == 1, is thus the right thing to do in the
longer term too (hence I have renamed the function to
vect_split_slp_store_group to emphasize that), at least until we remove that
restriction on SLP generally.

Bootstrapped + check-{gcc,g++,gfortran} on x86_64, AArch64, ARM.

Re. the extra skipping loop: I think it is theoretically possible for the
recursive call to vect_analyze_slp_instance to succeed on an element where the
original call failed, because it may have more num_permutes remaining (after
skipping over the first vector). So is that a second argument (besides code
complexity) for dropping that part?
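
For reference, the choice of split point, including the skipping loop
discussed above, can be sketched in isolation as follows. This is an
illustrative model only, not the code in the patch: choose_split_point is a
hypothetical helper, it returns 0 to mean "do not split", and unlike the patch
as posted it only examines whole vectors that lie entirely inside the group.

```c
#include <assert.h>
#include <stdbool.h>

/* Given the per-element MATCHES flags from a failed SLP build, return how
   many leading elements to peel off as the first group, or 0 if no split
   point is found.  VF must be a power of two.  */
static unsigned
choose_split_point (const bool *matches, unsigned group_size, unsigned vf)
{
  assert (vf != 0 && (vf & (vf - 1)) == 0);

  /* Find the first element that did not match.  */
  unsigned i = 0;
  while (i < group_size && matches[i])
    i++;

  if (i >= group_size)
    return 0;  /* No mismatch recorded, so there is no obvious split point.  */

  if (i >= vf)
    /* At least one whole vector matched: split at the last vector boundary
       at or before the first mismatch.  */
    return i & ~(vf - 1);

  /* The first vector is mixed.  Skip it and look for a later vector whose
     elements are either all matches or all non-matches.  */
  for (i = vf; i + vf <= group_size; i += vf)
    {
      bool uniform = true;
      for (unsigned j = i + 1; j < i + vf; j++)
	if (matches[j] != matches[i])
	  {
	    uniform = false;
	    break;
	  }
      if (uniform)
	return i;
    }

  return 0;
}
```

Dropping the skipping loop would correspond to deleting the final for loop and
returning 0 whenever the first mismatch falls inside the first vector.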

gcc/ChangeLog:

	* tree-vect-slp.c (vect_split_slp_store_group): New.
	(vect_analyze_slp_instance): Recurse on subgroup(s) if
	vect_build_slp_tree fails during basic block SLP.

gcc/testsuite/ChangeLog:

	* gcc.dg/vect/bb-slp-7.c (main1): Make subgroups non-isomorphic.
	* gcc.dg/vect/bb-slp-subgroups-1.c: New.
	* gcc.dg/vect/bb-slp-subgroups-2.c: New.
	* gcc.dg/vect/bb-slp-subgroups-3.c: New.
	* gcc.dg/vect/bb-slp-subgroups-4.c: New.
---
 gcc/testsuite/gcc.dg/vect/bb-slp-7.c           | 10 +--
 gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-1.c | 44 +++++++++++++
 gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-2.c | 42 +++++++++++++
 gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-3.c | 41 ++++++++++++
 gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-4.c | 41 ++++++++++++
 gcc/tree-vect-slp.c                            | 87 +++++++++++++++++++++++++-
 6 files changed, 259 insertions(+), 6 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-1.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-2.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-3.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-4.c

diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-7.c b/gcc/testsuite/gcc.dg/vect/bb-slp-7.c
index ab54a48..b8bef8c 100644
--- a/gcc/testsuite/gcc.dg/vect/bb-slp-7.c
+++ b/gcc/testsuite/gcc.dg/vect/bb-slp-7.c
@@ -16,12 +16,12 @@ main1 (unsigned int x, unsigned int y)
   unsigned int *pout = &out[0];
   unsigned int a0, a1, a2, a3;
 
-  /* Non isomorphic.  */
+  /* Non isomorphic, even 64-bit subgroups.  */
   a0 = *pin++ + 23;
-  a1 = *pin++ + 142;
+  a1 = *pin++ * 142;
   a2 = *pin++ + 2;
   a3 = *pin++ * 31;
-  
+
   *pout++ = a0 * x;
   *pout++ = a1 * y;
   *pout++ = a2 * x;
@@ -29,7 +29,7 @@ main1 (unsigned int x, unsigned int y)
 
   /* Check results.  */
   if (out[0] != (in[0] + 23) * x
-      || out[1] != (in[1] + 142) * y
+      || out[1] != (in[1] * 142) * y
       || out[2] != (in[2] + 2) * x
       || out[3] != (in[3] * 31) * y)
     abort();
@@ -47,4 +47,4 @@ int main (void)
 }
 
 /* { dg-final { scan-tree-dump-times "basic block vectorized" 0 "slp2" } } */
-  
+
diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-1.c b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-1.c
new file mode 100644
index 0000000..39c23c3
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-1.c
@@ -0,0 +1,44 @@
+/* { dg-require-effective-target vect_int } */
+/* PR tree-optimization/67682.  */
+
+#include "tree-vect.h"
+
+int __attribute__((__aligned__(8))) a[8];
+int __attribute__((__aligned__(8))) b[4];
+
+__attribute__ ((noinline)) void
+test ()
+{
+    a[0] = b[0];
+    a[1] = b[1];
+    a[2] = b[2];
+    a[3] = b[3];
+    a[4] = 0;
+    a[5] = 0;
+    a[6] = 0;
+    a[7] = 0;
+}
+
+int
+main (int argc, char **argv)
+{
+  check_vect ();
+
+  for (int i = 0; i < 8; i++)
+    a[i] = 1;
+  for (int i = 0; i < 4; i++)
+    b[i] = i + 4;
+  __asm__ volatile ("" : : : "memory");
+  test (a, b);
+  __asm__ volatile ("" : : : "memory");
+  for (int i = 0; i < 4; i++)
+    if (a[i] != i+4)
+      abort ();
+  for (int i = 4; i < 8; i++)
+    if (a[i] != 0)
+      abort ();
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "Basic block will be vectorized using SLP" 1 "slp2" } } */
+/* { dg-final { scan-tree-dump-times "basic block vectorized" 1 "slp2" } } */
diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-2.c b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-2.c
new file mode 100644
index 0000000..06099fd
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-2.c
@@ -0,0 +1,42 @@
+/* { dg-require-effective-target vect_int } */
+/* PR tree-optimization/67682.  */
+
+#include "tree-vect.h"
+
+int __attribute__((__aligned__(8))) a[8];
+int __attribute__((__aligned__(8))) b[8];
+
+__attribute__ ((noinline)) void
+test ()
+{
+    a[0] = b[0];
+    a[1] = b[1] + 1;
+    a[2] = b[2] * 2;
+    a[3] = b[3] + 3;
+
+    a[4] = b[4] + 4;
+    a[5] = b[5] + 4;
+    a[6] = b[6] + 4;
+    a[7] = b[7] + 4;
+}
+
+int
+main (int argc, char **argv)
+{
+  check_vect ();
+
+  for (int i = 0; i < 8; i++)
+    b[i] = i + 1;
+  __asm__ volatile ("" : : : "memory");
+  test (a, b);
+  __asm__ volatile ("" : : : "memory");
+  if ((a[0] != 1) || (a[1] != 3) || (a[2] != 6) || (a[3] != 7))
+    abort ();
+  for (int i = 4; i < 8; i++)
+    if (a[i] != i + 5)
+      abort ();
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "Basic block will be vectorized using SLP" 1 "slp2" } } */
+/* { dg-final { scan-tree-dump-times "basic block vectorized" 1 "slp2" } } */
diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-3.c b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-3.c
new file mode 100644
index 0000000..13c51f3
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-3.c
@@ -0,0 +1,41 @@
+/* { dg-require-effective-target vect_int } */
+/* PR tree-optimization/67682.  */
+
+#include "tree-vect.h"
+
+int __attribute__((__aligned__(8))) a[8];
+int __attribute__((__aligned__(8))) b[4];
+
+__attribute__ ((noinline)) void
+test ()
+{
+    a[0] = b[2] + 1;
+    a[1] = b[0] + 2;
+    a[2] = b[1] + 3;
+    a[3] = b[1] + 4;
+    a[4] = b[3] * 3;
+    a[5] = b[0] * 4;
+    a[6] = b[2] * 5;
+    a[7] = b[1] * 7;
+}
+
+int
+main (int argc, char **argv)
+{
+  check_vect ();
+
+  for (int i = 0; i < 8; i++)
+    a[i] = 1;
+  for (int i = 0; i < 4; i++)
+    b[i] = i + 4;
+  __asm__ volatile ("" : : : "memory");
+  test (a, b);
+  __asm__ volatile ("" : : : "memory");
+  if ((a[0] != 7) || a[1] != 6 || (a[2] != 8) || (a[3] != 9)
+      || (a[4] != 21) || (a[5] != 16) || (a[6] != 30) || (a[7] != 35))
+    abort ();
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "Basic block will be vectorized using SLP" 1 "slp2" } } */
+/* { dg-final { scan-tree-dump-times "basic block vectorized" 1 "slp2" } } */
diff --git a/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-4.c b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-4.c
new file mode 100644
index 0000000..6ae9a89
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/bb-slp-subgroups-4.c
@@ -0,0 +1,41 @@
+/* { dg-require-effective-target vect_int } */
+/* PR tree-optimization/67682.  */
+
+#include "tree-vect.h"
+
+int __attribute__((__aligned__(8))) a[8];
+int __attribute__((__aligned__(8))) b[8];
+
+__attribute__ ((noinline)) void
+test ()
+{
+    a[0] = b[0] + 1;
+    a[1] = b[1] + 2;
+    a[2] = b[2] + 3;
+    a[3] = b[3] + 4;
+    a[4] = b[0] * 3;
+    a[5] = b[2] * 4;
+    a[6] = b[4] * 5;
+    a[7] = b[6] * 7;
+}
+
+int
+main (int argc, char **argv)
+{
+  check_vect ();
+
+  for (int i = 0; i < 8; i++)
+    a[i] = 1;
+  for (int i = 0; i < 8; i++)
+    b[i] = i + 4;
+  __asm__ volatile ("" : : : "memory");
+  test (a, b);
+  __asm__ volatile ("" : : : "memory");
+  if ((a[0] != 5) || (a[1] != 7) || (a[2] != 9) || (a[3] != 11)
+      || (a[4] != 12) || (a[5] != 24) || (a[6] != 40) || (a[7] != 70))
+    abort ();
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "Basic block will be vectorized using SLP" 1 "slp2" } } */
+/* { dg-final { scan-tree-dump-times "basic block vectorized" 1 "slp2" } } */
diff --git a/gcc/tree-vect-slp.c b/gcc/tree-vect-slp.c
index 1424123..c01632a 100644
--- a/gcc/tree-vect-slp.c
+++ b/gcc/tree-vect-slp.c
@@ -1641,6 +1641,50 @@ vect_analyze_slp_cost (slp_instance instance, void *data)
   body_cost_vec.release ();
 }
 
+/* Splits a group of stores, currently beginning at FIRST_STMT, into two groups:
+   one (still beginning at FIRST_STMT) of size GROUP1_SIZE (also containing
+   the first GROUP1_SIZE stmts, since stores are consecutive), the second
+   containing the remainder.
+   Return the first stmt in the second group.  */
+
+static gimple *
+vect_split_slp_store_group (gimple *first_stmt, unsigned group1_size)
+{
+  stmt_vec_info first_vinfo = vinfo_for_stmt (first_stmt);
+  gcc_assert (GROUP_FIRST_ELEMENT (first_vinfo) == first_stmt);
+  gcc_assert (group1_size > 0);
+  int group2_size = GROUP_SIZE (first_vinfo) - group1_size;
+  gcc_assert (group2_size > 0);
+  GROUP_SIZE (first_vinfo) = group1_size;
+
+  gimple *stmt = first_stmt;
+  for (unsigned i = group1_size; i > 1; i--)
+    {
+      stmt = GROUP_NEXT_ELEMENT (vinfo_for_stmt (stmt));
+      gcc_assert (GROUP_GAP (vinfo_for_stmt (stmt)) == 1);
+    }
+  /* STMT is now the last element of the first group.  */
+  gimple *group2 = GROUP_NEXT_ELEMENT (vinfo_for_stmt (stmt));
+  GROUP_NEXT_ELEMENT (vinfo_for_stmt (stmt)) = 0;
+
+  GROUP_SIZE (vinfo_for_stmt (group2)) = group2_size;
+  for (stmt = group2; stmt; stmt = GROUP_NEXT_ELEMENT (vinfo_for_stmt (stmt)))
+    {
+      GROUP_FIRST_ELEMENT (vinfo_for_stmt (stmt)) = group2;
+      gcc_assert (GROUP_GAP (vinfo_for_stmt (stmt)) == 1);
+    }
+
+  /* For the second group, the GROUP_GAP is that before the original group,
+     plus skipping over the first vector.  */
+  GROUP_GAP (vinfo_for_stmt (group2))
+    = GROUP_GAP (first_vinfo) + group1_size;
+
+  /* GROUP_GAP of the first group now has to skip over the second group too.  */
+  GROUP_GAP (first_vinfo) += group2_size;
+
+  return group2;
+}
+
 /* Analyze an SLP instance starting from a group of grouped stores.  Call
    vect_build_slp_tree to build a tree of packed stmts if possible.
    Return FALSE if it's impossible to SLP any stmt in the loop.  */
@@ -1656,7 +1700,7 @@ vect_analyze_slp_instance (vec_info *vinfo,
   tree vectype, scalar_type = NULL_TREE;
   gimple *next;
   unsigned int vectorization_factor = 0;
-  int i;
+  unsigned int i;
   unsigned int max_nunits = 0;
   vec<slp_tree> loads;
   struct data_reference *dr = STMT_VINFO_DATA_REF (vinfo_for_stmt (stmt));
@@ -1846,6 +1890,47 @@ vect_analyze_slp_instance (vec_info *vinfo,
   vect_free_slp_tree (node);
   loads.release ();
 
+  /* For basic block vectorization, try to break the group up into multiples
+     of the vectorization factor.  */
+  if (is_a <bb_vec_info> (vinfo) && GROUP_FIRST_ELEMENT (vinfo_for_stmt (stmt)))
+    {
+      /* We consider breaking the group only on VF boundaries from the existing
+	 start.  */
+      for (i = 0; i < group_size; i++)
+	if (!matches[i]) break;
+
+      if (i < vectorization_factor)
+	{
+	  /* First vector is a mix of (non-/)matches, or first element was
+	     impossible for another reason.  Skip the first vector, and look
+	     for a vector's worth of consecutive (non-/)matches.  */
+	  i = vectorization_factor;
+	  while (i + vectorization_factor <= group_size)
+	    {
+	      for (unsigned j = i + 1; matches[j] == matches[i]; j++)
+		if (j == (i + vectorization_factor - 1))
+		  {
+		    /* Either all in that vector match, or none do.
+		       Retry SLP from that vector onwards.  */
+		    stmt = vect_split_slp_store_group (stmt, i);
+		    return vect_analyze_slp_instance (vinfo,
+						      stmt, max_tree_size);
+		  }
+	      /* Vector contains both matches and non-matches.  Skip over.  */
+	      i += vectorization_factor;
+	    }
+	}
+      else if (i < group_size)
+	{
+	  /* Split into two groups at the first vector boundary before i.  */
+	  gcc_assert ((vectorization_factor & (vectorization_factor - 1)) == 0);
+	  i &= ~(vectorization_factor - 1);
+	  gimple *grp2start = vect_split_slp_store_group (stmt, i);
+	  return vect_analyze_slp_instance (vinfo, stmt, max_tree_size)
+		 | vect_analyze_slp_instance (vinfo, grp2start, max_tree_size);
+	}
+    }
+
   return false;
 }