From patchwork Wed Jul  3 13:23:42 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Richard Biener <rguenther@suse.de>
X-Patchwork-Id: 1956238
Return-Path: <gcc-patches-bounces~incoming=patchwork.ozlabs.org@gcc.gnu.org>
X-Original-To: incoming@patchwork.ozlabs.org
Delivered-To: patchwork-incoming@legolas.ozlabs.org
Authentication-Results: legolas.ozlabs.org;
	dkim=pass (1024-bit key;
 unprotected) header.d=suse.de header.i=@suse.de header.a=rsa-sha256
 header.s=susede2_rsa header.b=oveqKnmu;
	dkim=pass header.d=suse.de header.i=@suse.de header.a=ed25519-sha256
 header.s=susede2_ed25519 header.b=jMyvA0Jx;
	dkim=pass (1024-bit key) header.d=suse.de header.i=@suse.de
 header.a=rsa-sha256 header.s=susede2_rsa header.b=oveqKnmu;
	dkim=neutral header.d=suse.de header.i=@suse.de header.a=ed25519-sha256
 header.s=susede2_ed25519 header.b=jMyvA0Jx;
	dkim-atps=neutral
Authentication-Results: legolas.ozlabs.org;
 spf=pass (sender SPF authorized) smtp.mailfrom=gcc.gnu.org
 (client-ip=2620:52:3:1:0:246e:9693:128c; helo=server2.sourceware.org;
 envelope-from=gcc-patches-bounces~incoming=patchwork.ozlabs.org@gcc.gnu.org;
 receiver=patchwork.ozlabs.org)
Received: from server2.sourceware.org (server2.sourceware.org
 [IPv6:2620:52:3:1:0:246e:9693:128c])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
	 key-exchange X25519 server-signature ECDSA (secp384r1) server-digest SHA384)
	(No client certificate requested)
	by legolas.ozlabs.org (Postfix) with ESMTPS id 4WDgV93HMPz1xqh
	for <incoming@patchwork.ozlabs.org>; Wed,  3 Jul 2024 23:24:17 +1000 (AEST)
Received: from server2.sourceware.org (localhost [IPv6:::1])
	by sourceware.org (Postfix) with ESMTP id AD3A3384A48D
	for <incoming@patchwork.ozlabs.org>; Wed,  3 Jul 2024 13:24:15 +0000 (GMT)
X-Original-To: gcc-patches@gcc.gnu.org
Delivered-To: gcc-patches@gcc.gnu.org
Received: from smtp-out1.suse.de (smtp-out1.suse.de
 [IPv6:2a07:de40:b251:101:10:150:64:1])
 by sourceware.org (Postfix) with ESMTPS id BFA6B3860769
 for <gcc-patches@gcc.gnu.org>; Wed,  3 Jul 2024 13:23:43 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org BFA6B3860769
Authentication-Results: sourceware.org;
 dmarc=pass (p=none dis=none) header.from=suse.de
Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=suse.de
ARC-Filter: OpenARC Filter v1.0.0 sourceware.org BFA6B3860769
Authentication-Results: server2.sourceware.org;
 arc=none smtp.remote-ip=2a07:de40:b251:101:10:150:64:1
ARC-Seal: i=1; a=rsa-sha256; d=sourceware.org; s=key; t=1720013026; cv=none;
 b=vH8f2827H4cPFsIS/UQ1iYcCTrbfrW9/+izSWixso0Po+Ya4sK+mJm3y3U5zCl9829uq9V6RK9r9ii6EY+4RpxWwmd1ZhNRd4QJ+UNUq/0Obyrd5faczZavISVmui1nQBujsXxRner1iH+xY58Gt5vFGWXuMnRu97t0f3aKD9Ao=
ARC-Message-Signature: i=1; a=rsa-sha256; d=sourceware.org; s=key;
 t=1720013026; c=relaxed/simple;
 bh=IODeveUpQfczYYRWt7sO863/KtyKaaaQjh0QgvQf+58=;
 h=DKIM-Signature:DKIM-Signature:DKIM-Signature:DKIM-Signature:Date:
 From:To:Subject:MIME-Version;
 b=IUDAr0W84EtW+3kj9vtsJEAClZ1NqueHhuGbFhW3wQcqO1FxHYTJrUcxK9a7mEkW75m7Bb4EN/owXL1ywCFr28NuMQFRggZ7P4FNNkNHXBrQ6oTUf9GaPm7g5zUBjLYz1knSZ1U3LMcE248SrS5xd4Lq/8CaLJSzi5B2NupDMf4=
ARC-Authentication-Results: i=1; server2.sourceware.org
Received: from murzim.nue2.suse.org (unknown [10.168.4.243])
 (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
 key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest
 SHA256)
 (No client certificate requested)
 by smtp-out1.suse.de (Postfix) with ESMTPS id ACE7A21BD0
 for <gcc-patches@gcc.gnu.org>; Wed,  3 Jul 2024 13:23:42 +0000 (UTC)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=suse.de;
 s=susede2_rsa;
 t=1720013022;
 h=from:from:reply-to:date:date:to:to:cc:mime-version:mime-version:
 content-type:content-type; bh=e02ZWnynlvNUbSpJVV9hHE8iE29D4XcySzVIihUUtcM=;
 b=oveqKnmubiNwPSvMeIwdnC87LBwZNivso2/g6dSw0EAGVECAyfgkxUaBKHltBSj5x966bN
 oBbPg8WAMALi+LZTsRliGpopYYQPPgaUxAxqh2IAGYtQrcPuCXEekXbjcldIHSfXLvvGsi
 UMN8eQ9Gi4pcWW7ySc1z9HyJkmoBgpA=
DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed; d=suse.de;
 s=susede2_ed25519; t=1720013022;
 h=from:from:reply-to:date:date:to:to:cc:mime-version:mime-version:
 content-type:content-type; bh=e02ZWnynlvNUbSpJVV9hHE8iE29D4XcySzVIihUUtcM=;
 b=jMyvA0JxMeOgUF2BYeS5AuOCEZMjIyOM6LUTfkCce1n2xmWz0kp/9mkSlf2ptqQo+U5H8R
 xVSsxz0oDRk8HGAg==
Authentication-Results: smtp-out1.suse.de;
	none
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=suse.de;
 s=susede2_rsa;
 t=1720013022;
 h=from:from:reply-to:date:date:to:to:cc:mime-version:mime-version:
 content-type:content-type; bh=e02ZWnynlvNUbSpJVV9hHE8iE29D4XcySzVIihUUtcM=;
 b=oveqKnmubiNwPSvMeIwdnC87LBwZNivso2/g6dSw0EAGVECAyfgkxUaBKHltBSj5x966bN
 oBbPg8WAMALi+LZTsRliGpopYYQPPgaUxAxqh2IAGYtQrcPuCXEekXbjcldIHSfXLvvGsi
 UMN8eQ9Gi4pcWW7ySc1z9HyJkmoBgpA=
DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed; d=suse.de;
 s=susede2_ed25519; t=1720013022;
 h=from:from:reply-to:date:date:to:to:cc:mime-version:mime-version:
 content-type:content-type; bh=e02ZWnynlvNUbSpJVV9hHE8iE29D4XcySzVIihUUtcM=;
 b=jMyvA0JxMeOgUF2BYeS5AuOCEZMjIyOM6LUTfkCce1n2xmWz0kp/9mkSlf2ptqQo+U5H8R
 xVSsxz0oDRk8HGAg==
Date: Wed, 3 Jul 2024 15:23:42 +0200 (CEST)
From: Richard Biener <rguenther@suse.de>
To: gcc-patches@gcc.gnu.org
Subject: [PATCH 1/5] lower SLP load permutation to interleaving
MIME-Version: 1.0
X-Spam-Score: -0.41
X-Spam-Level: 
X-Spamd-Result: default: False [-0.41 / 50.00]; BAYES_HAM(-3.00)[100.00%];
 MISSING_MID(2.50)[]; NEURAL_SPAM_LONG(0.39)[0.112];
 NEURAL_HAM_SHORT(-0.20)[-1.000]; MIME_GOOD(-0.10)[text/plain];
 RCPT_COUNT_ONE(0.00)[1]; RCVD_COUNT_ZERO(0.00)[0];
 ARC_NA(0.00)[]; MISSING_XM_UA(0.00)[];
 DKIM_SIGNED(0.00)[suse.de:s=susede2_rsa,suse.de:s=susede2_ed25519];
 FUZZY_BLOCKED(0.00)[rspamd.com]; FROM_EQ_ENVFROM(0.00)[];
 MIME_TRACE(0.00)[0:+]; TO_DN_NONE(0.00)[];
 TO_MATCH_ENVRCPT_ALL(0.00)[]; FROM_HAS_DN(0.00)[]
X-Spam-Status: No, score=-10.9 required=5.0 tests=BAYES_00, DKIM_SIGNED,
 DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, GIT_PATCH_0, MISSING_MID,
 RCVD_IN_DNSWL_LOW, SPF_HELO_NONE, SPF_PASS,
 TXREP autolearn=ham autolearn_force=no version=3.4.6
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on
 server2.sourceware.org
X-BeenThere: gcc-patches@gcc.gnu.org
X-Mailman-Version: 2.1.30
Precedence: list
List-Id: Gcc-patches mailing list <gcc-patches.gcc.gnu.org>
List-Unsubscribe: <https://gcc.gnu.org/mailman/options/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=unsubscribe>
List-Archive: <https://gcc.gnu.org/pipermail/gcc-patches/>
List-Post: <mailto:gcc-patches@gcc.gnu.org>
List-Help: <mailto:gcc-patches-request@gcc.gnu.org?subject=help>
List-Subscribe: <https://gcc.gnu.org/mailman/listinfo/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=subscribe>
Errors-To: gcc-patches-bounces~incoming=patchwork.ozlabs.org@gcc.gnu.org
Message-Id: <20240703132415.AD3A3384A48D@sourceware.org>

The following emulates classical interleaving for SLP load permutes
that we are unlikely handling natively.  This is to handle cases
where interleaving (or load/store-lanes) is the optimal choice for
vectorizing even when we are doing that within SLP.  An example
would be

void foo (int * __restrict a, int * b)
{
  for (int i = 0; i < 16; ++i)
    {
      a[4*i + 0] = b[4*i + 0] * 3;
      a[4*i + 1] = b[4*i + 1] + 3;
      a[4*i + 2] = (b[4*i + 2] * 3 + 3);
      a[4*i + 3] = b[4*i + 3] * 3;
    }
}

where currently the SLP store is merging four single-lane SLP
sub-graphs but none of the loads in it can be code-generated
with V4SImode vectors and a VF of four as the permutes would need
three vectors.

The patch introduces a lowering phase after SLP discovery but
before SLP pattern recognition or permute optimization that
analyzes all loads from the same dataref group and creates an
interleaving scheme starting from an unpermuted load.

What can be handled is quite restrictive, matching only a subset
of the non-SLP interleaving cases (the power-of-two group size
ones, in addition only cases without gaps).  The interleaving
vectorization in addition can handle size 3 and 5 - but I am not
sure if it's possible to do that in a VL agnostic way.  It
should be still possible to set up the SLP graph in a way that
a load-lane could be matched from SLP pattern recognition.

As said gaps are currently not handled - for SLP we have a
representational issue that SLP_TREE_SCALAR_STMTS for "gap lanes"
would need to be filled in some way (even if we just push NULL).

The patch misses multi-level even/odd handling as well as CSEing
intermediate generated permutes.  Both is quite straight-forward
to add, but eventually there's a better or more general strategy
for lowering?  The main goal of the patch is to avoid falling
back to non-SLP for cases the interleaving code handles.

	* tree-vect-slp.cc (vllp_cmp): New function.
	(vect_lower_load_permutations): Likewise.
	(vect_analyze_slp): Call it.
---
 gcc/tree-vect-slp.cc | 287 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 287 insertions(+)

diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
index b88bba44760..1d5c4d99549 100644
--- a/gcc/tree-vect-slp.cc
+++ b/gcc/tree-vect-slp.cc
@@ -3889,6 +3889,287 @@ vect_analyze_slp_instance (vec_info *vinfo,
   return res;
 }
 
+/* qsort comparator ordering SLP load nodes.  */
+
+static int
+vllp_cmp (const void *a_, const void *b_)
+{
+  const slp_tree a = *(const slp_tree *)a_;
+  const slp_tree b = *(const slp_tree *)b_;
+  stmt_vec_info a0 = SLP_TREE_SCALAR_STMTS (a)[0];
+  stmt_vec_info b0 = SLP_TREE_SCALAR_STMTS (b)[0];
+  if (STMT_VINFO_GROUPED_ACCESS (a0)
+      && STMT_VINFO_GROUPED_ACCESS (b0)
+      && DR_GROUP_FIRST_ELEMENT (a0) == DR_GROUP_FIRST_ELEMENT (b0))
+    {
+      /* Same group, order after lanes used.  */
+      if (SLP_TREE_LANES (a) < SLP_TREE_LANES (b))
+	return 1;
+      else if (SLP_TREE_LANES (a) > SLP_TREE_LANES (b))
+	return -1;
+      else
+	{
+	  /* Try to order loads using the same lanes together, breaking
+	     the tie with the lane number that first differs.  */
+	  if (!SLP_TREE_LOAD_PERMUTATION (a).exists ()
+	      && !SLP_TREE_LOAD_PERMUTATION (b).exists ())
+	    return 0;
+	  else if (SLP_TREE_LOAD_PERMUTATION (a).exists ()
+		   && !SLP_TREE_LOAD_PERMUTATION (b).exists ())
+	    return 1;
+	  else if (!SLP_TREE_LOAD_PERMUTATION (a).exists ()
+		   && SLP_TREE_LOAD_PERMUTATION (b).exists ())
+	    return -1;
+	  else
+	    {
+	      for (unsigned i = 0; i < SLP_TREE_LANES (a); ++i)
+		if (SLP_TREE_LOAD_PERMUTATION (a)[i]
+		    != SLP_TREE_LOAD_PERMUTATION (b)[i])
+		  {
+		    /* In-order lane first, that's what the above case for
+		       no permutation does.  */
+		    if (SLP_TREE_LOAD_PERMUTATION (a)[i] == i)
+		      return -1;
+		    else if (SLP_TREE_LOAD_PERMUTATION (b)[i] == i)
+		      return 1;
+		    else if (SLP_TREE_LOAD_PERMUTATION (a)[i]
+			     < SLP_TREE_LOAD_PERMUTATION (b)[i])
+		      return -1;
+		    else
+		      return 1;
+		  }
+	      return 0;
+	    }
+	}
+    }
+  else /* Different groups or non-groups.  */
+    {
+      /* Order groups as their first element to keep them together.  */
+      if (STMT_VINFO_GROUPED_ACCESS (a0))
+	a0 = DR_GROUP_FIRST_ELEMENT (a0);
+      if (STMT_VINFO_GROUPED_ACCESS (b0))
+	b0 = DR_GROUP_FIRST_ELEMENT (b0);
+      if (a0 == b0)
+	return 0;
+      /* Tie using UID.  */
+      else if (gimple_uid (STMT_VINFO_STMT (a0))
+	       < gimple_uid (STMT_VINFO_STMT (b0)))
+	return -1;
+      else
+	{
+	  gcc_assert (gimple_uid (STMT_VINFO_STMT (a0))
+		      != gimple_uid (STMT_VINFO_STMT (b0)));
+	  return 1;
+	}
+    }
+}
+
+/* Process the set of LOADS that are all from the same dataref group.  */
+
+static void
+vect_lower_load_permutations (loop_vec_info loop_vinfo,
+			      scalar_stmts_to_slp_tree_map_t *bst_map,
+			      const array_slice<slp_tree> &loads)
+{
+  /* We at this point want to lower without a fixed VF or vector
+     size in mind which means we cannot actually compute whether we
+     need three or more vectors for a load permutation yet.  So always
+     lower.  */
+  stmt_vec_info first
+    = DR_GROUP_FIRST_ELEMENT (SLP_TREE_SCALAR_STMTS (loads[0])[0]);
+
+  /* ???  In principle we have to consider a gap up to the next full
+     vector, but we have to actually represent a scalar stmt for the
+     gaps value so delay handling this.  The same is true for
+     inbetween gaps which the load places in the load-permutation
+     represent.  It's probably not worth trying an intermediate packing
+     to vectors without gap even if that might handle some more cases.
+     Instead get the gap case correct in some way.  */
+  unsigned group_lanes = 0;
+  for (stmt_vec_info s = first; s; s = DR_GROUP_NEXT_ELEMENT (s))
+    {
+      if ((s == first && DR_GROUP_GAP (s) != 0)
+	  || (s != first && DR_GROUP_GAP (s) != 1))
+	return;
+      group_lanes++;
+    }
+  /* Only a power-of-two number of lanes matches interleaving.  */
+  if (exact_log2 (group_lanes) == -1)
+    return;
+
+  for (slp_tree load : loads)
+    {
+      /* Leave masked or gather loads alone for now.  */
+      if (!SLP_TREE_CHILDREN (load).is_empty ())
+	continue;
+
+      /* We need to lower only loads of less than half of the groups
+	 lanes, including duplicate lanes.  */
+      if (SLP_TREE_LANES (load) >= group_lanes / 2)
+	continue;
+
+      /* Lower by reducing the group to half its size using an
+	 interleaving scheme.  For this try to compute whether all
+	 elements needed for this load are in even or odd elements of
+	 an even/odd decomposition with N consecutive elements.
+	 Thus { e, e, o, o, e, e, o, o } woud be an even/odd decomposition
+	 with N == 2.  */
+      unsigned even = (1 << ceil_log2 (DR_GROUP_SIZE (first))) - 1;
+      unsigned odd = even;
+      for (unsigned l : SLP_TREE_LOAD_PERMUTATION (load))
+	{
+	  even &= ~l;
+	  odd &= l;
+	}
+      /* Give up when this doesn't match up with an interleaving scheme.  */
+      if (!even && !odd)
+	continue;
+
+      /* First build (and possibly re-use) a load node for the
+	 unpermuted group.  */
+      vec<stmt_vec_info> stmts;
+      stmts.create (group_lanes);
+      for (stmt_vec_info s = first; s; s = DR_GROUP_NEXT_ELEMENT (s))
+	stmts.quick_push (s);
+      poly_uint64 max_nunits;
+      bool *matches = XALLOCAVEC (bool, group_lanes);
+      unsigned limit = 1;
+      unsigned tree_size = 0;
+      slp_tree l0 = vect_build_slp_tree (loop_vinfo, stmts,
+					 group_lanes,
+					 &max_nunits, matches, &limit,
+					 &tree_size, bst_map);
+
+      /* Build the permute to get the original load permutation order.  */
+      lane_permutation_t final_perm;
+      final_perm.create (SLP_TREE_LANES (load));
+      for (unsigned i = 0; i < SLP_TREE_LANES (load); ++i)
+	final_perm.quick_push
+	  (std::make_pair (0, SLP_TREE_LOAD_PERMUTATION (load)[i]));
+
+      /* Now build an even or odd extraction from the unpermuted load.  */
+      lane_permutation_t perm;
+      perm.create (group_lanes / 2);
+      unsigned level;
+      if (even
+	  && ((level = 1 << ctz_hwi (even)), true)
+	  && group_lanes % (2 * level) == 0)
+	{
+	  /* { 0, 1, ... 4, 5 ..., } */
+	  unsigned level = 1 << ctz_hwi (even);
+	  for (unsigned i = 0; i < group_lanes / 2 / level; ++i)
+	    for (unsigned j = 0; j < level; ++j)
+	      perm.quick_push (std::make_pair (0, 2 * i * level + j));
+	}
+      else if (odd)
+	{
+	  /* { ..., 2, 3, ... 6, 7 } */
+	  unsigned level = 1 << ctz_hwi (odd);
+	  gcc_assert (group_lanes % (2 * level) == 0);
+	  for (unsigned i = 0; i < group_lanes / 2 / level; ++i)
+	    for (unsigned j = 0; j < level; ++j)
+	      perm.quick_push (std::make_pair (0, (2 * i + 1) * level + j));
+	}
+      else
+	gcc_unreachable ();
+
+      /* And update final_perm.  */
+      for (unsigned i = 0; i < SLP_TREE_LANES (load); ++i)
+	{
+	  unsigned l = final_perm[i].second;
+	  unsigned j;
+	  for (j = 0; j < perm.length (); ++j)
+	    if (perm[j].second == l)
+	      {
+		final_perm[i].second = j;
+		break;
+	      }
+	  gcc_assert (j < perm.length ());
+	}
+
+      /* And create scalar stmts.  */
+      vec<stmt_vec_info> perm_stmts;
+      perm_stmts.create (perm.length ());
+      for (unsigned i = 0; i < perm.length (); ++i)
+	perm_stmts.quick_push (SLP_TREE_SCALAR_STMTS (l0)[perm[i].second]);
+
+      slp_tree p = vect_create_new_slp_node (1, VEC_PERM_EXPR);
+      SLP_TREE_CHILDREN (p).quick_push (l0);
+      SLP_TREE_LANE_PERMUTATION (p) = perm;
+      SLP_TREE_VECTYPE (p) = SLP_TREE_VECTYPE (load);
+      SLP_TREE_LANES (p) = perm.length ();
+      SLP_TREE_REPRESENTATIVE (p) = SLP_TREE_REPRESENTATIVE (load);
+      /* ???  As we have scalar stmts for this intermediate permute we
+	 could CSE it via bst_map but we do not want to pick up
+	 another SLP node with a load permutation.  We instead should
+	 have a "local" CSE map here.  */
+      SLP_TREE_SCALAR_STMTS (p) = perm_stmts;
+      /* ???  We need to iterate if group_lanes / 2 is still too large.  */
+      /* ???  Ideally pick the best even/odd scheme usable for
+	 most of the loads.  -> do a multi-step scheme?  */
+
+      /* And finally from the ordered reduction node create the
+	 permute to shuffle the lanes into the original load-permutation
+	 order.  We replace the original load node with this.  */
+      SLP_TREE_CODE (load) = VEC_PERM_EXPR;
+      SLP_TREE_LOAD_PERMUTATION (load).release ();
+      SLP_TREE_LANE_PERMUTATION (load) = final_perm;
+      SLP_TREE_CHILDREN (load).create (1);
+      SLP_TREE_CHILDREN (load).quick_push (p);
+    }
+}
+
+/* Transform SLP loads in the SLP graph created by SLP discovery to
+   group loads from the same group and lower load permutations that
+   are unlikely to be supported into a series of permutes.
+   In the degenerate case of having only single-lane SLP instances
+   this should result in a series of permute nodes emulating an
+   interleaving scheme.  */
+
+static void
+vect_lower_load_permutations (loop_vec_info loop_vinfo,
+			      scalar_stmts_to_slp_tree_map_t *bst_map)
+{
+  /* Gather and sort loads across all instances.  */
+  hash_set<slp_tree> visited;
+  auto_vec<slp_tree> loads;
+  for (auto inst : loop_vinfo->slp_instances)
+    vect_gather_slp_loads (loads, SLP_INSTANCE_TREE (inst), visited);
+  if (loads.is_empty ())
+    return;
+  loads.qsort (vllp_cmp);
+
+  /* Now process each dataref group separately.  */
+  unsigned firsti = 0;
+  for (unsigned i = 1; i < loads.length (); ++i)
+    {
+      slp_tree first = loads[firsti];
+      slp_tree next = loads[i];
+      stmt_vec_info a0 = SLP_TREE_SCALAR_STMTS (first)[0];
+      stmt_vec_info b0 = SLP_TREE_SCALAR_STMTS (next)[0];
+      if (STMT_VINFO_GROUPED_ACCESS (a0)
+	  && STMT_VINFO_GROUPED_ACCESS (b0)
+	  && DR_GROUP_FIRST_ELEMENT (a0) == DR_GROUP_FIRST_ELEMENT (b0))
+	continue;
+      /* Just one SLP load of a possible group, leave those alone.  */
+      if (i == firsti + 1)
+	{
+	  firsti = i;
+	  continue;
+	}
+      /* Now we have multiple SLP loads of the same group from
+	 firsti to i - 1.  */
+      vect_lower_load_permutations (loop_vinfo, bst_map,
+				    make_array_slice (&loads[firsti],
+						      i - firsti));
+      firsti = i;
+    }
+  if (firsti < loads.length () - 1)
+    vect_lower_load_permutations (loop_vinfo, bst_map,
+				  make_array_slice (&loads[firsti],
+						    loads.length () - firsti));
+}
+
 /* Check if there are stmts in the loop can be vectorized using SLP.  Build SLP
    trees of packed scalar stmts if SLP is possible.  */
 
@@ -4030,6 +4311,12 @@ vect_analyze_slp (vec_info *vinfo, unsigned max_tree_size)
 	}
     }
 
+  /* When we end up with load permutations that we cannot possibly handle,
+     like those requiring three vector inputs, lower them using interleaving
+     like schemes.  */
+  if (loop_vec_info loop_vinfo = dyn_cast <loop_vec_info> (vinfo))
+    vect_lower_load_permutations (loop_vinfo, bst_map);
+
   hash_set<slp_tree> visited_patterns;
   slp_tree_to_load_perm_map_t perm_cache;
   slp_compat_nodes_map_t compat_cache;

From patchwork Wed Jul  3 13:23:52 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Richard Biener <rguenther@suse.de>
X-Patchwork-Id: 1956239
Return-Path: <gcc-patches-bounces~incoming=patchwork.ozlabs.org@gcc.gnu.org>
X-Original-To: incoming@patchwork.ozlabs.org
Delivered-To: patchwork-incoming@legolas.ozlabs.org
Authentication-Results: legolas.ozlabs.org;
	dkim=pass (1024-bit key;
 unprotected) header.d=suse.de header.i=@suse.de header.a=rsa-sha256
 header.s=susede2_rsa header.b=gbpv1JcN;
	dkim=pass header.d=suse.de header.i=@suse.de header.a=ed25519-sha256
 header.s=susede2_ed25519 header.b=yPAHvZ/a;
	dkim=pass (1024-bit key) header.d=suse.de header.i=@suse.de
 header.a=rsa-sha256 header.s=susede2_rsa header.b=gbpv1JcN;
	dkim=neutral header.d=suse.de header.i=@suse.de header.a=ed25519-sha256
 header.s=susede2_ed25519 header.b=yPAHvZ/a;
	dkim-atps=neutral
Authentication-Results: legolas.ozlabs.org;
 spf=pass (sender SPF authorized) smtp.mailfrom=gcc.gnu.org
 (client-ip=2620:52:3:1:0:246e:9693:128c; helo=server2.sourceware.org;
 envelope-from=gcc-patches-bounces~incoming=patchwork.ozlabs.org@gcc.gnu.org;
 receiver=patchwork.ozlabs.org)
Received: from server2.sourceware.org (server2.sourceware.org
 [IPv6:2620:52:3:1:0:246e:9693:128c])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
	 key-exchange X25519 server-signature ECDSA (secp384r1) server-digest SHA384)
	(No client certificate requested)
	by legolas.ozlabs.org (Postfix) with ESMTPS id 4WDgVN6NJxz1xqh
	for <incoming@patchwork.ozlabs.org>; Wed,  3 Jul 2024 23:24:28 +1000 (AEST)
Received: from server2.sourceware.org (localhost [IPv6:::1])
	by sourceware.org (Postfix) with ESMTP id 19C1A3861011
	for <incoming@patchwork.ozlabs.org>; Wed,  3 Jul 2024 13:24:27 +0000 (GMT)
X-Original-To: gcc-patches@gcc.gnu.org
Delivered-To: gcc-patches@gcc.gnu.org
Received: from smtp-out1.suse.de (smtp-out1.suse.de
 [IPv6:2a07:de40:b251:101:10:150:64:1])
 by sourceware.org (Postfix) with ESMTPS id BCFF13861022
 for <gcc-patches@gcc.gnu.org>; Wed,  3 Jul 2024 13:23:53 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org BCFF13861022
Authentication-Results: sourceware.org;
 dmarc=pass (p=none dis=none) header.from=suse.de
Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=suse.de
ARC-Filter: OpenARC Filter v1.0.0 sourceware.org BCFF13861022
Authentication-Results: server2.sourceware.org;
 arc=none smtp.remote-ip=2a07:de40:b251:101:10:150:64:1
ARC-Seal: i=1; a=rsa-sha256; d=sourceware.org; s=key; t=1720013037; cv=none;
 b=PeSv3yHi1vuukSXN9TPhR1e62nz8V1lE47cQGANomlbJT+pVLZGLhiVcVVRBzlpmbhFS+t8gONMeOFlQ/5RgC72xj65dOL5WTA+FzzLsaEKrufshJVqp2FIvL+6LoN69sZSDJRtvFOw1ILA2SZx4+V8BKO8p1dWGmD5Af5CjYLM=
ARC-Message-Signature: i=1; a=rsa-sha256; d=sourceware.org; s=key;
 t=1720013037; c=relaxed/simple;
 bh=rHxH1jotRLaAX5p6OWzes7PPznClrl1CERaYsrYxCMc=;
 h=DKIM-Signature:DKIM-Signature:DKIM-Signature:DKIM-Signature:Date:
 From:To:Subject:MIME-Version;
 b=Wb7j7fZSmHE/oQs2t6jvVLfPdLeUthd4n/4gA7WUsu/g3VsqZfJ0g1kd5iCDMBtHMOggSJxo0D0mnEZYutJ4q99I2+8798Zuwiu1tbvgBJnt/Xduo/iVJPps1Jxb3takxmsSD4p05C5GZZTXJCa4u/GBurkjFdHhPRqKTVYTnpU=
ARC-Authentication-Results: i=1; server2.sourceware.org
Received: from murzim.nue2.suse.org (unknown [10.168.4.243])
 (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
 key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest
 SHA256)
 (No client certificate requested)
 by smtp-out1.suse.de (Postfix) with ESMTPS id D3D6221BD4
 for <gcc-patches@gcc.gnu.org>; Wed,  3 Jul 2024 13:23:52 +0000 (UTC)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=suse.de;
 s=susede2_rsa;
 t=1720013032;
 h=from:from:reply-to:date:date:to:to:cc:mime-version:mime-version:
 content-type:content-type; bh=oNCVaVDyuUxLQMDYn5pZQuSMaP7dZ//YZ1f15U14jzw=;
 b=gbpv1JcNqYzH+ERsbPy6lhf3ms1mW6dGWoCsgNuqO50P4fo8iJlADsxyRngNzVBc6pGBEA
 8+Z5QcmQ50oPaCu7NlyWev+rovcjzy8qgvRr6ltGM5Awg7AO0OH+SA6Ydb6SgNEWTUI+JU
 yXO1Nk8OvABFqb3MC1NycmYxVPCzvrU=
DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed; d=suse.de;
 s=susede2_ed25519; t=1720013032;
 h=from:from:reply-to:date:date:to:to:cc:mime-version:mime-version:
 content-type:content-type; bh=oNCVaVDyuUxLQMDYn5pZQuSMaP7dZ//YZ1f15U14jzw=;
 b=yPAHvZ/ahx4fY0bReuir7ue8dY/zBOb9/eFHzKIF1fgfAAO64YD+Bb5t2E4KtdbjR1HWF7
 iOAwRQaVDkpdbVCg==
Authentication-Results: smtp-out1.suse.de;
	none
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=suse.de;
 s=susede2_rsa;
 t=1720013032;
 h=from:from:reply-to:date:date:to:to:cc:mime-version:mime-version:
 content-type:content-type; bh=oNCVaVDyuUxLQMDYn5pZQuSMaP7dZ//YZ1f15U14jzw=;
 b=gbpv1JcNqYzH+ERsbPy6lhf3ms1mW6dGWoCsgNuqO50P4fo8iJlADsxyRngNzVBc6pGBEA
 8+Z5QcmQ50oPaCu7NlyWev+rovcjzy8qgvRr6ltGM5Awg7AO0OH+SA6Ydb6SgNEWTUI+JU
 yXO1Nk8OvABFqb3MC1NycmYxVPCzvrU=
DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed; d=suse.de;
 s=susede2_ed25519; t=1720013032;
 h=from:from:reply-to:date:date:to:to:cc:mime-version:mime-version:
 content-type:content-type; bh=oNCVaVDyuUxLQMDYn5pZQuSMaP7dZ//YZ1f15U14jzw=;
 b=yPAHvZ/ahx4fY0bReuir7ue8dY/zBOb9/eFHzKIF1fgfAAO64YD+Bb5t2E4KtdbjR1HWF7
 iOAwRQaVDkpdbVCg==
Date: Wed, 3 Jul 2024 15:23:52 +0200 (CEST)
From: Richard Biener <rguenther@suse.de>
To: gcc-patches@gcc.gnu.org
Subject: [PATCH 2/5] extend SLP load permutation lowering
MIME-Version: 1.0
X-Spamd-Result: default: False [-0.26 / 50.00]; BAYES_HAM(-3.00)[100.00%];
 MISSING_MID(2.50)[]; NEURAL_SPAM_LONG(0.54)[0.155];
 NEURAL_HAM_SHORT(-0.20)[-1.000]; MIME_GOOD(-0.10)[text/plain];
 RCPT_COUNT_ONE(0.00)[1]; RCVD_COUNT_ZERO(0.00)[0];
 ARC_NA(0.00)[]; MISSING_XM_UA(0.00)[];
 DKIM_SIGNED(0.00)[suse.de:s=susede2_rsa,suse.de:s=susede2_ed25519];
 FUZZY_BLOCKED(0.00)[rspamd.com]; FROM_EQ_ENVFROM(0.00)[];
 MIME_TRACE(0.00)[0:+]; TO_DN_NONE(0.00)[];
 TO_MATCH_ENVRCPT_ALL(0.00)[]; FROM_HAS_DN(0.00)[]
X-Spam-Score: -0.26
X-Spam-Level: 
X-Spam-Status: No, score=-10.9 required=5.0 tests=BAYES_00, DKIM_SIGNED,
 DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, GIT_PATCH_0, MISSING_MID,
 RCVD_IN_DNSWL_LOW, SPF_HELO_NONE, SPF_PASS,
 TXREP autolearn=ham autolearn_force=no version=3.4.6
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on
 server2.sourceware.org
X-BeenThere: gcc-patches@gcc.gnu.org
X-Mailman-Version: 2.1.30
Precedence: list
List-Id: Gcc-patches mailing list <gcc-patches.gcc.gnu.org>
List-Unsubscribe: <https://gcc.gnu.org/mailman/options/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=unsubscribe>
List-Archive: <https://gcc.gnu.org/pipermail/gcc-patches/>
List-Post: <mailto:gcc-patches@gcc.gnu.org>
List-Help: <mailto:gcc-patches-request@gcc.gnu.org?subject=help>
List-Subscribe: <https://gcc.gnu.org/mailman/listinfo/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=subscribe>
Errors-To: gcc-patches-bounces~incoming=patchwork.ozlabs.org@gcc.gnu.org
Message-Id: <20240703132427.19C1A3861011@sourceware.org>

The following extends the SLP load permutation to single-level
interleaving to handle the case we need multiple levels and adding
a fallback handling when the required permutes do not match
interleaving but would need three vectors to implement (and thus
fail).  With the change all single-lane SLP instances should be
supported with interleaving (with similar constraints as for non-SLP
but not implementing the identical 3 and 5 element group special
cases).

It also handles

void foo (int * __restrict a, int * b)
{
  for (int i = 0; i < 16; ++i)
    {
      a[8*i + 0] = b[8*i + 0] * 3;
      a[8*i + 1] = b[8*i + 1] + 3;
      a[8*i + 2] = (b[8*i + 2] * 3 + 3);
      a[8*i + 3] = b[8*i + 3] * 3;
      a[8*i + 4] = b[8*i + 4] * 3;
      a[8*i + 5] = b[8*i + 5] + 3;
      a[8*i + 6] = (b[8*i + 6] * 3 + 3);
      a[8*i + 7] = b[8*i + 7] * 3;
    }
}

albeit with 58 instead of 48 permutes.  The non-interleaving fallback
needs more work.

Next up is supporting gaps.

	* tree-vect-slp.cc (vect_lower_load_permutations): Support
	multi-level interleaving.  Support non-even/odd permutes.

	* gcc.dg/vect/slp-11a.c: Expect SLP.
	* gcc.dg/vect/slp-12a.c: Likewise.
---
 gcc/testsuite/gcc.dg/vect/slp-11a.c |   2 +-
 gcc/testsuite/gcc.dg/vect/slp-12a.c |   2 +-
 gcc/tree-vect-slp.cc                | 199 +++++++++++++++++-----------
 3 files changed, 122 insertions(+), 81 deletions(-)

diff --git a/gcc/testsuite/gcc.dg/vect/slp-11a.c b/gcc/testsuite/gcc.dg/vect/slp-11a.c
index fcb7cf6c7a2..2efa1796757 100644
--- a/gcc/testsuite/gcc.dg/vect/slp-11a.c
+++ b/gcc/testsuite/gcc.dg/vect/slp-11a.c
@@ -72,4 +72,4 @@ int main (void)
 
 /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { target { vect_strided8 && vect_int_mult } } } } */
 /* { dg-final { scan-tree-dump-times "vectorized 0 loops" 1 "vect" { target { ! { vect_strided8 && vect_int_mult } } } } } */
-/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 0 "vect" } } */
+/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" } } */
diff --git a/gcc/testsuite/gcc.dg/vect/slp-12a.c b/gcc/testsuite/gcc.dg/vect/slp-12a.c
index 2f98dc9da0b..fedf27b69d2 100644
--- a/gcc/testsuite/gcc.dg/vect/slp-12a.c
+++ b/gcc/testsuite/gcc.dg/vect/slp-12a.c
@@ -80,5 +80,5 @@ int main (void)
 
 /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { target { vect_strided8 && vect_int_mult } } } } */
 /* { dg-final { scan-tree-dump-times "vectorized 0 loops" 1 "vect" { target { ! { vect_strided8 && vect_int_mult } } } } } */
-/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 0 "vect" { target { { vect_strided8 && {! vect_load_lanes } } && vect_int_mult } } } } */
+/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { target { { vect_strided8 && {! vect_load_lanes } } && vect_int_mult } } } } */
 /* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 0 "vect" { target { ! { vect_strided8 && vect_int_mult } } } } } */
diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
index 1d5c4d99549..6f3822af950 100644
--- a/gcc/tree-vect-slp.cc
+++ b/gcc/tree-vect-slp.cc
@@ -3993,7 +3993,9 @@ vect_lower_load_permutations (loop_vec_info loop_vinfo,
 	return;
       group_lanes++;
     }
-  /* Only a power-of-two number of lanes matches interleaving.  */
+  /* Only a power-of-two number of lanes matches interleaving with N levels.
+     ???  An even number of lanes could be reduced to 1<<ceil_log2(N)-1 lanes
+     at each step.  */
   if (exact_log2 (group_lanes) == -1)
     return;
 
@@ -4003,28 +4005,17 @@ vect_lower_load_permutations (loop_vec_info loop_vinfo,
       if (!SLP_TREE_CHILDREN (load).is_empty ())
 	continue;
 
+      /* We want to pattern-match special cases here and keep those
+	 alone.  Candidates are splats and load-lane.  */
+
       /* We need to lower only loads of less than half of the groups
-	 lanes, including duplicate lanes.  */
+	 lanes, including duplicate lanes.  Note this leaves nodes
+	 with a non-1:1 load permutation around instead of canonicalizing
+	 those into a load and a permute node.  Removing this early
+	 check would do such canonicalization.  */
       if (SLP_TREE_LANES (load) >= group_lanes / 2)
 	continue;
 
-      /* Lower by reducing the group to half its size using an
-	 interleaving scheme.  For this try to compute whether all
-	 elements needed for this load are in even or odd elements of
-	 an even/odd decomposition with N consecutive elements.
-	 Thus { e, e, o, o, e, e, o, o } woud be an even/odd decomposition
-	 with N == 2.  */
-      unsigned even = (1 << ceil_log2 (DR_GROUP_SIZE (first))) - 1;
-      unsigned odd = even;
-      for (unsigned l : SLP_TREE_LOAD_PERMUTATION (load))
-	{
-	  even &= ~l;
-	  odd &= l;
-	}
-      /* Give up when this doesn't match up with an interleaving scheme.  */
-      if (!even && !odd)
-	continue;
-
       /* First build (and possibly re-use) a load node for the
 	 unpermuted group.  */
       vec<stmt_vec_info> stmts;
@@ -4047,66 +4038,105 @@ vect_lower_load_permutations (loop_vec_info loop_vinfo,
 	final_perm.quick_push
 	  (std::make_pair (0, SLP_TREE_LOAD_PERMUTATION (load)[i]));
 
-      /* Now build an even or odd extraction from the unpermuted load.  */
-      lane_permutation_t perm;
-      perm.create (group_lanes / 2);
-      unsigned level;
-      if (even
-	  && ((level = 1 << ctz_hwi (even)), true)
-	  && group_lanes % (2 * level) == 0)
-	{
-	  /* { 0, 1, ... 4, 5 ..., } */
-	  unsigned level = 1 << ctz_hwi (even);
-	  for (unsigned i = 0; i < group_lanes / 2 / level; ++i)
-	    for (unsigned j = 0; j < level; ++j)
-	      perm.quick_push (std::make_pair (0, 2 * i * level + j));
-	}
-      else if (odd)
-	{
-	  /* { ..., 2, 3, ... 6, 7 } */
-	  unsigned level = 1 << ctz_hwi (odd);
-	  gcc_assert (group_lanes % (2 * level) == 0);
-	  for (unsigned i = 0; i < group_lanes / 2 / level; ++i)
-	    for (unsigned j = 0; j < level; ++j)
-	      perm.quick_push (std::make_pair (0, (2 * i + 1) * level + j));
-	}
-      else
-	gcc_unreachable ();
-
-      /* And update final_perm.  */
-      for (unsigned i = 0; i < SLP_TREE_LANES (load); ++i)
+      while (1)
 	{
-	  unsigned l = final_perm[i].second;
-	  unsigned j;
-	  for (j = 0; j < perm.length (); ++j)
-	    if (perm[j].second == l)
-	      {
-		final_perm[i].second = j;
-		break;
-	      }
-	  gcc_assert (j < perm.length ());
-	}
+	  unsigned group_lanes = SLP_TREE_LANES (l0);
+	  if (SLP_TREE_LANES (load) >= group_lanes / 2)
+	    break;
 
-      /* And create scalar stmts.  */
-      vec<stmt_vec_info> perm_stmts;
-      perm_stmts.create (perm.length ());
-      for (unsigned i = 0; i < perm.length (); ++i)
-	perm_stmts.quick_push (SLP_TREE_SCALAR_STMTS (l0)[perm[i].second]);
-
-      slp_tree p = vect_create_new_slp_node (1, VEC_PERM_EXPR);
-      SLP_TREE_CHILDREN (p).quick_push (l0);
-      SLP_TREE_LANE_PERMUTATION (p) = perm;
-      SLP_TREE_VECTYPE (p) = SLP_TREE_VECTYPE (load);
-      SLP_TREE_LANES (p) = perm.length ();
-      SLP_TREE_REPRESENTATIVE (p) = SLP_TREE_REPRESENTATIVE (load);
-      /* ???  As we have scalar stmts for this intermediate permute we
-	 could CSE it via bst_map but we do not want to pick up
-	 another SLP node with a load permutation.  We instead should
-	 have a "local" CSE map here.  */
-      SLP_TREE_SCALAR_STMTS (p) = perm_stmts;
-      /* ???  We need to iterate if group_lanes / 2 is still too large.  */
-      /* ???  Ideally pick the best even/odd scheme usable for
-	 most of the loads.  -> do a multi-step scheme?  */
+	  /* Try to lower by reducing the group to half its size using an
+	     interleaving scheme.  For this try to compute whether all
+	     elements needed for this load are in even or odd elements of
+	     an even/odd decomposition with N consecutive elements.
+	     Thus { e, e, o, o, e, e, o, o } woud be an even/odd decomposition
+	     with N == 2.  */
+	  /* ???  Only an even number of lanes can be handed this way, but the
+	     fallback below could work for any number.  */
+	  gcc_assert ((group_lanes & 1) == 0);
+	  unsigned even = (1 << ceil_log2 (group_lanes)) - 1;
+	  unsigned odd = even;
+	  for (auto l : final_perm)
+	    {
+	      even &= ~l.second;
+	      odd &= l.second;
+	    }
+
+	  /* Now build an even or odd extraction from the unpermuted load.  */
+	  lane_permutation_t perm;
+	  perm.create (group_lanes / 2);
+	  unsigned level;
+	  if (even
+	      && ((level = 1 << ctz_hwi (even)), true)
+	      && group_lanes % (2 * level) == 0)
+	    {
+	      /* { 0, 1, ... 4, 5 ..., } */
+	      unsigned level = 1 << ctz_hwi (even);
+	      for (unsigned i = 0; i < group_lanes / 2 / level; ++i)
+		for (unsigned j = 0; j < level; ++j)
+		  perm.quick_push (std::make_pair (0, 2 * i * level + j));
+	    }
+	  else if (odd)
+	    {
+	      /* { ..., 2, 3, ... 6, 7 } */
+	      unsigned level = 1 << ctz_hwi (odd);
+	      gcc_assert (group_lanes % (2 * level) == 0);
+	      for (unsigned i = 0; i < group_lanes / 2 / level; ++i)
+		for (unsigned j = 0; j < level; ++j)
+		  perm.quick_push (std::make_pair (0, (2 * i + 1) * level + j));
+	    }
+	  else
+	    {
+	      /* As fallback extract all used lanes and fill to half the
+		 group size by repeating the last element.
+		 ???  This is quite a bad strathegy for re-use - we could
+		 brute force our way to find more optimal filling lanes to
+		 maximize re-use when looking at all loads from the group.  */
+	      auto_bitmap l;
+	      for (auto p : final_perm)
+		bitmap_set_bit (l, p.second);
+	      unsigned i = 0;
+	      bitmap_iterator bi;
+	      EXECUTE_IF_SET_IN_BITMAP (l, 0, i, bi)
+		  perm.quick_push (std::make_pair (0, i));
+	      while (perm.length () < group_lanes / 2)
+		perm.quick_push (perm.last ());
+	    }
+
+	  /* Update final_perm with the intermediate permute.  */
+	  for (unsigned i = 0; i < final_perm.length (); ++i)
+	    {
+	      unsigned l = final_perm[i].second;
+	      unsigned j;
+	      for (j = 0; j < perm.length (); ++j)
+		if (perm[j].second == l)
+		  {
+		    final_perm[i].second = j;
+		    break;
+		  }
+	      gcc_assert (j < perm.length ());
+	    }
+
+	  /* And create scalar stmts.  */
+	  vec<stmt_vec_info> perm_stmts;
+	  perm_stmts.create (perm.length ());
+	  for (unsigned i = 0; i < perm.length (); ++i)
+	    perm_stmts.quick_push (SLP_TREE_SCALAR_STMTS (l0)[perm[i].second]);
+
+	  slp_tree p = vect_create_new_slp_node (1, VEC_PERM_EXPR);
+	  SLP_TREE_CHILDREN (p).quick_push (l0);
+	  SLP_TREE_LANE_PERMUTATION (p) = perm;
+	  SLP_TREE_VECTYPE (p) = SLP_TREE_VECTYPE (load);
+	  SLP_TREE_LANES (p) = perm.length ();
+	  SLP_TREE_REPRESENTATIVE (p) = SLP_TREE_REPRESENTATIVE (load);
+	  /* ???  As we have scalar stmts for this intermediate permute we
+	     could CSE it via bst_map but we do not want to pick up
+	     another SLP node with a load permutation.  We instead should
+	     have a "local" CSE map here.  */
+	  SLP_TREE_SCALAR_STMTS (p) = perm_stmts;
+
+	  /* We now have a node for group_lanes / 2 lanes.  */
+	  l0 = p;
+	}
 
       /* And finally from the ordered reduction node create the
 	 permute to shuffle the lanes into the original load-permutation
@@ -4115,7 +4145,7 @@ vect_lower_load_permutations (loop_vec_info loop_vinfo,
       SLP_TREE_LOAD_PERMUTATION (load).release ();
       SLP_TREE_LANE_PERMUTATION (load) = final_perm;
       SLP_TREE_CHILDREN (load).create (1);
-      SLP_TREE_CHILDREN (load).quick_push (p);
+      SLP_TREE_CHILDREN (load).quick_push (l0);
     }
 }
 
@@ -4315,7 +4345,18 @@ vect_analyze_slp (vec_info *vinfo, unsigned max_tree_size)
      like those requiring three vector inputs, lower them using interleaving
      like schemes.  */
   if (loop_vec_info loop_vinfo = dyn_cast <loop_vec_info> (vinfo))
-    vect_lower_load_permutations (loop_vinfo, bst_map);
+    {
+      vect_lower_load_permutations (loop_vinfo, bst_map);
+      if (dump_enabled_p ())
+	{
+	  dump_printf_loc (MSG_NOTE, vect_location,
+			   "SLP graph after lowering permutations:\n");
+	  hash_set<slp_tree> visited;
+	  FOR_EACH_VEC_ELT (LOOP_VINFO_SLP_INSTANCES (vinfo), i, instance)
+	    vect_print_slp_graph (MSG_NOTE, vect_location,
+				  SLP_INSTANCE_TREE (instance), visited);
+	}
+    }
 
   hash_set<slp_tree> visited_patterns;
   slp_tree_to_load_perm_map_t perm_cache;

From patchwork Wed Jul  3 13:23:57 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Richard Biener <rguenther@suse.de>
X-Patchwork-Id: 1956240
Return-Path: <gcc-patches-bounces~incoming=patchwork.ozlabs.org@gcc.gnu.org>
X-Original-To: incoming@patchwork.ozlabs.org
Delivered-To: patchwork-incoming@legolas.ozlabs.org
Authentication-Results: legolas.ozlabs.org;
	dkim=pass (1024-bit key;
 unprotected) header.d=suse.de header.i=@suse.de header.a=rsa-sha256
 header.s=susede2_rsa header.b=rqncHb8j;
	dkim=pass header.d=suse.de header.i=@suse.de header.a=ed25519-sha256
 header.s=susede2_ed25519 header.b=k0v2yNTj;
	dkim=pass (1024-bit key) header.d=suse.de header.i=@suse.de
 header.a=rsa-sha256 header.s=susede2_rsa header.b=rqncHb8j;
	dkim=neutral header.d=suse.de header.i=@suse.de header.a=ed25519-sha256
 header.s=susede2_ed25519 header.b=k0v2yNTj;
	dkim-atps=neutral
Authentication-Results: legolas.ozlabs.org;
 spf=pass (sender SPF authorized) smtp.mailfrom=gcc.gnu.org
 (client-ip=8.43.85.97; helo=server2.sourceware.org;
 envelope-from=gcc-patches-bounces~incoming=patchwork.ozlabs.org@gcc.gnu.org;
 receiver=patchwork.ozlabs.org)
Received: from server2.sourceware.org (server2.sourceware.org [8.43.85.97])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
	 key-exchange X25519 server-signature ECDSA (secp384r1) server-digest SHA384)
	(No client certificate requested)
	by legolas.ozlabs.org (Postfix) with ESMTPS id 4WDgWW0GBBz1xpN
	for <incoming@patchwork.ozlabs.org>; Wed,  3 Jul 2024 23:25:26 +1000 (AEST)
Received: from server2.sourceware.org (localhost [IPv6:::1])
	by sourceware.org (Postfix) with ESMTP id 336CB386101B
	for <incoming@patchwork.ozlabs.org>; Wed,  3 Jul 2024 13:25:24 +0000 (GMT)
X-Original-To: gcc-patches@gcc.gnu.org
Delivered-To: gcc-patches@gcc.gnu.org
Received: from smtp-out1.suse.de (smtp-out1.suse.de [195.135.223.130])
 by sourceware.org (Postfix) with ESMTPS id C3830386100E
 for <gcc-patches@gcc.gnu.org>; Wed,  3 Jul 2024 13:23:58 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org C3830386100E
Authentication-Results: sourceware.org;
 dmarc=pass (p=none dis=none) header.from=suse.de
Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=suse.de
ARC-Filter: OpenARC Filter v1.0.0 sourceware.org C3830386100E
Authentication-Results: server2.sourceware.org;
 arc=none smtp.remote-ip=195.135.223.130
ARC-Seal: i=1; a=rsa-sha256; d=sourceware.org; s=key; t=1720013040; cv=none;
 b=djxprGZ4j3SSEH/hSIv1fvESRYygWvPeM1FUDoZUmhg8dreQFfYXsr6a0aIfT+oNQ0VovcDvpQyOPLzPg57fLV0QVGsnVrWwVy9XQ/99ZS4plbrRKKVCSmaETYCykpJJdIVVtBzZUWiGZbK/Ps7YZ4LJjX0otqRwtbxC68AbR+M=
ARC-Message-Signature: i=1; a=rsa-sha256; d=sourceware.org; s=key;
 t=1720013040; c=relaxed/simple;
 bh=+LBrBY1aRxKmWjPrdodoeeXYr5/P00jAZwcRJysdMrA=;
 h=DKIM-Signature:DKIM-Signature:DKIM-Signature:DKIM-Signature:Date:
 From:To:Subject:MIME-Version;
 b=aiVirep1nfmjvwiumGnzEIUZDUVHpeQ+fx4P52NfnYylwlQlmyU7SjCgvaYtzrhYwOmGTdXIVG1zXL7rVQrkFddc38pT2RKk1853gEgPtGYeiaQeNTUEuE5VjKJwvzo6U8MwBSF9ZTXJywTmE/srrMTeIcDvO3TTRUwaQU1di+A=
ARC-Authentication-Results: i=1; server2.sourceware.org
Received: from murzim.nue2.suse.org (unknown [10.168.4.243])
 (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
 key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest
 SHA256)
 (No client certificate requested)
 by smtp-out1.suse.de (Postfix) with ESMTPS id D0AD821BD0
 for <gcc-patches@gcc.gnu.org>; Wed,  3 Jul 2024 13:23:57 +0000 (UTC)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=suse.de;
 s=susede2_rsa;
 t=1720013037;
 h=from:from:reply-to:date:date:to:to:cc:mime-version:mime-version:
 content-type:content-type; bh=hiwsuPA1Rqp+UnfYnhXq/qstuzJ0jgZdnkaaZ5RVueI=;
 b=rqncHb8j5uOr8vSj5PlaaVj1QWkdq30rPNeitJ1/iSp2YuDXKfx80UzWafCV+aQvzMEjpQ
 QLXWp6Muj51MT0wZTMVXg5qKyup75BBPuOTnYuXyOjYCgMyhmPFkg504VrVYUBmCbc5qSF
 sN2a3ovE2iiMm4DJE6xyrXI9LBvKgWw=
DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed; d=suse.de;
 s=susede2_ed25519; t=1720013037;
 h=from:from:reply-to:date:date:to:to:cc:mime-version:mime-version:
 content-type:content-type; bh=hiwsuPA1Rqp+UnfYnhXq/qstuzJ0jgZdnkaaZ5RVueI=;
 b=k0v2yNTjjvMeaX8j2SqilRenp1eBg5uJZ2UiPTrVe0+tXGkjWqIgv5hXVDmsmn4MX/QdJN
 ZyQ6/SO0BXfZG0Bg==
Authentication-Results: smtp-out1.suse.de;
	none
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=suse.de;
 s=susede2_rsa;
 t=1720013037;
 h=from:from:reply-to:date:date:to:to:cc:mime-version:mime-version:
 content-type:content-type; bh=hiwsuPA1Rqp+UnfYnhXq/qstuzJ0jgZdnkaaZ5RVueI=;
 b=rqncHb8j5uOr8vSj5PlaaVj1QWkdq30rPNeitJ1/iSp2YuDXKfx80UzWafCV+aQvzMEjpQ
 QLXWp6Muj51MT0wZTMVXg5qKyup75BBPuOTnYuXyOjYCgMyhmPFkg504VrVYUBmCbc5qSF
 sN2a3ovE2iiMm4DJE6xyrXI9LBvKgWw=
DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed; d=suse.de;
 s=susede2_ed25519; t=1720013037;
 h=from:from:reply-to:date:date:to:to:cc:mime-version:mime-version:
 content-type:content-type; bh=hiwsuPA1Rqp+UnfYnhXq/qstuzJ0jgZdnkaaZ5RVueI=;
 b=k0v2yNTjjvMeaX8j2SqilRenp1eBg5uJZ2UiPTrVe0+tXGkjWqIgv5hXVDmsmn4MX/QdJN
 ZyQ6/SO0BXfZG0Bg==
Date: Wed, 3 Jul 2024 15:23:57 +0200 (CEST)
From: Richard Biener <rguenther@suse.de>
To: gcc-patches@gcc.gnu.org
Subject: [PATCH 3/5] Handle gaps in SLP load permutation lowering
MIME-Version: 1.0
X-Spamd-Result: default: False [-0.22 / 50.00]; BAYES_HAM(-3.00)[100.00%];
 MISSING_MID(2.50)[]; NEURAL_SPAM_LONG(0.58)[0.166];
 NEURAL_HAM_SHORT(-0.20)[-1.000]; MIME_GOOD(-0.10)[text/plain];
 RCPT_COUNT_ONE(0.00)[1]; RCVD_COUNT_ZERO(0.00)[0];
 ARC_NA(0.00)[]; MISSING_XM_UA(0.00)[];
 DKIM_SIGNED(0.00)[suse.de:s=susede2_rsa,suse.de:s=susede2_ed25519];
 FUZZY_BLOCKED(0.00)[rspamd.com]; FROM_EQ_ENVFROM(0.00)[];
 MIME_TRACE(0.00)[0:+]; TO_DN_NONE(0.00)[];
 TO_MATCH_ENVRCPT_ALL(0.00)[]; FROM_HAS_DN(0.00)[]
X-Spam-Score: -0.22
X-Spam-Level: 
X-Spam-Status: No, score=-10.5 required=5.0 tests=BAYES_00, DKIM_SIGNED,
 DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, GIT_PATCH_0, MISSING_MID,
 SPF_HELO_NONE, SPF_PASS, TXREP autolearn=ham autolearn_force=no version=3.4.6
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on
 server2.sourceware.org
X-BeenThere: gcc-patches@gcc.gnu.org
X-Mailman-Version: 2.1.30
Precedence: list
List-Id: Gcc-patches mailing list <gcc-patches.gcc.gnu.org>
List-Unsubscribe: <https://gcc.gnu.org/mailman/options/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=unsubscribe>
List-Archive: <https://gcc.gnu.org/pipermail/gcc-patches/>
List-Post: <mailto:gcc-patches@gcc.gnu.org>
List-Help: <mailto:gcc-patches-request@gcc.gnu.org?subject=help>
List-Subscribe: <https://gcc.gnu.org/mailman/listinfo/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=subscribe>
Errors-To: gcc-patches-bounces~incoming=patchwork.ozlabs.org@gcc.gnu.org
Message-Id: <20240703132524.336CB386101B@sourceware.org>

The following adds handling of gaps by representing them with NULL
entries in SLP_TREE_SCALAR_STMTS for the unpermuted load node.

The SLP discovery changes could be elided if we manually build the
load node instead.

	* tree-vect-slp.cc (vect_build_slp_tree_1): Handle NULL stmt.
	(vect_build_slp_tree_2): Likewise.  Release load permutation
	when there's a NULL in SLP_TREE_SCALAR_STMTS and assert there's
	no actual permutation in that case.
	(vect_lower_load_permutations): Handle gaps in loads.

	* gcc.dg/vect/slp-51.c: New testcase.
---
 gcc/testsuite/gcc.dg/vect/slp-51.c | 17 +++++++++++
 gcc/tree-vect-slp.cc               | 49 ++++++++++++++++++------------
 2 files changed, 47 insertions(+), 19 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/vect/slp-51.c

diff --git a/gcc/testsuite/gcc.dg/vect/slp-51.c b/gcc/testsuite/gcc.dg/vect/slp-51.c
new file mode 100644
index 00000000000..91ae763be30
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/slp-51.c
@@ -0,0 +1,17 @@
+/* { dg-do compile } */
+
+void foo (int * __restrict x, int *y)
+{
+  x = __builtin_assume_aligned (x, __BIGGEST_ALIGNMENT__);
+  y = __builtin_assume_aligned (y, __BIGGEST_ALIGNMENT__);
+  for (int i = 0; i < 1024; ++i)
+    {
+      x[4*i+0] = y[4*i+0];
+      x[4*i+1] = y[4*i+2] * 2;
+      x[4*i+2] = y[4*i+0] + 3;
+      x[4*i+3] = y[4*i+2] * 2 - 5;
+    }
+}
+
+/* Check we can handle SLP with gaps and an interleaving scheme.  */
+/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" { target { vect_int && vect_int_mult } } } } */
diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
index 6f3822af950..fdefee90e92 100644
--- a/gcc/tree-vect-slp.cc
+++ b/gcc/tree-vect-slp.cc
@@ -1080,10 +1080,15 @@ vect_build_slp_tree_1 (vec_info *vinfo, unsigned char *swap,
   stmt_vec_info stmt_info;
   FOR_EACH_VEC_ELT (stmts, i, stmt_info)
     {
-      gimple *stmt = stmt_info->stmt;
       swap[i] = 0;
       matches[i] = false;
+      if (!stmt_info)
+	{
+	  matches[i] = true;
+	  continue;
+	}
 
+      gimple *stmt = stmt_info->stmt;
       if (dump_enabled_p ())
 	dump_printf_loc (MSG_NOTE, vect_location, "Build SLP for %G", stmt);
 
@@ -1984,10 +1989,16 @@ vect_build_slp_tree_2 (vec_info *vinfo, slp_tree node,
 	  stmt_vec_info first_stmt_info
 	    = DR_GROUP_FIRST_ELEMENT (SLP_TREE_SCALAR_STMTS (node)[0]);
 	  bool any_permute = false;
+	  bool any_null = false;
 	  FOR_EACH_VEC_ELT (SLP_TREE_SCALAR_STMTS (node), j, load_info)
 	    {
 	      int load_place;
-	      if (STMT_VINFO_GROUPED_ACCESS (stmt_info))
+	      if (! load_info)
+		{
+		  load_place = j;
+		  any_null = true;
+		}
+	      else if (STMT_VINFO_GROUPED_ACCESS (stmt_info))
 		load_place = vect_get_place_in_interleaving_chain
 		    (load_info, first_stmt_info);
 	      else
@@ -1996,6 +2007,11 @@ vect_build_slp_tree_2 (vec_info *vinfo, slp_tree node,
 	      any_permute |= load_place != j;
 	      load_permutation.quick_push (load_place);
 	    }
+	  if (any_null)
+	    {
+	      gcc_assert (!any_permute);
+	      load_permutation.release ();
+	    }
 
 	  if (gcall *stmt = dyn_cast <gcall *> (stmt_info->stmt))
 	    {
@@ -3978,24 +3994,11 @@ vect_lower_load_permutations (loop_vec_info loop_vinfo,
   stmt_vec_info first
     = DR_GROUP_FIRST_ELEMENT (SLP_TREE_SCALAR_STMTS (loads[0])[0]);
 
-  /* ???  In principle we have to consider a gap up to the next full
-     vector, but we have to actually represent a scalar stmt for the
-     gaps value so delay handling this.  The same is true for
-     inbetween gaps which the load places in the load-permutation
-     represent.  It's probably not worth trying an intermediate packing
-     to vectors without gap even if that might handle some more cases.
-     Instead get the gap case correct in some way.  */
-  unsigned group_lanes = 0;
-  for (stmt_vec_info s = first; s; s = DR_GROUP_NEXT_ELEMENT (s))
-    {
-      if ((s == first && DR_GROUP_GAP (s) != 0)
-	  || (s != first && DR_GROUP_GAP (s) != 1))
-	return;
-      group_lanes++;
-    }
   /* Only a power-of-two number of lanes matches interleaving with N levels.
+     The non-SLP path also supports DR_GROUP_SIZE == 3.
      ???  An even number of lanes could be reduced to 1<<ceil_log2(N)-1 lanes
      at each step.  */
+  unsigned group_lanes = DR_GROUP_SIZE (first);
   if (exact_log2 (group_lanes) == -1)
     return;
 
@@ -4017,11 +4020,19 @@ vect_lower_load_permutations (loop_vec_info loop_vinfo,
 	continue;
 
       /* First build (and possibly re-use) a load node for the
-	 unpermuted group.  */
+	 unpermuted group.  Gaps in the middle and on the end are
+	 represented with NULL stmts.  */
       vec<stmt_vec_info> stmts;
       stmts.create (group_lanes);
       for (stmt_vec_info s = first; s; s = DR_GROUP_NEXT_ELEMENT (s))
-	stmts.quick_push (s);
+	{
+	  if (s != first)
+	    for (unsigned i = 1; i < DR_GROUP_GAP (s); ++i)
+	      stmts.quick_push (NULL);
+	  stmts.quick_push (s);
+	}
+      for (unsigned i = 0; i < DR_GROUP_GAP (first); ++i)
+	stmts.quick_push (NULL);
       poly_uint64 max_nunits;
       bool *matches = XALLOCAVEC (bool, group_lanes);
       unsigned limit = 1;

From patchwork Wed Jul  3 13:24:05 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Richard Biener <rguenther@suse.de>
X-Patchwork-Id: 1956242
Return-Path: <gcc-patches-bounces~incoming=patchwork.ozlabs.org@gcc.gnu.org>
X-Original-To: incoming@patchwork.ozlabs.org
Delivered-To: patchwork-incoming@legolas.ozlabs.org
Authentication-Results: legolas.ozlabs.org;
	dkim=pass (1024-bit key;
 unprotected) header.d=suse.de header.i=@suse.de header.a=rsa-sha256
 header.s=susede2_rsa header.b=tK2MyEvN;
	dkim=pass header.d=suse.de header.i=@suse.de header.a=ed25519-sha256
 header.s=susede2_ed25519 header.b=lNkxWoMf;
	dkim=pass (1024-bit key) header.d=suse.de header.i=@suse.de
 header.a=rsa-sha256 header.s=susede2_rsa header.b=tK2MyEvN;
	dkim=neutral header.d=suse.de header.i=@suse.de header.a=ed25519-sha256
 header.s=susede2_ed25519 header.b=lNkxWoMf;
	dkim-atps=neutral
Authentication-Results: legolas.ozlabs.org;
 spf=pass (sender SPF authorized) smtp.mailfrom=gcc.gnu.org
 (client-ip=2620:52:3:1:0:246e:9693:128c; helo=server2.sourceware.org;
 envelope-from=gcc-patches-bounces~incoming=patchwork.ozlabs.org@gcc.gnu.org;
 receiver=patchwork.ozlabs.org)
Received: from server2.sourceware.org (server2.sourceware.org
 [IPv6:2620:52:3:1:0:246e:9693:128c])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
	 key-exchange X25519 server-signature ECDSA (secp384r1) server-digest SHA384)
	(No client certificate requested)
	by legolas.ozlabs.org (Postfix) with ESMTPS id 4WDgXq016Sz1xpN
	for <incoming@patchwork.ozlabs.org>; Wed,  3 Jul 2024 23:26:34 +1000 (AEST)
Received: from server2.sourceware.org (localhost [IPv6:::1])
	by sourceware.org (Postfix) with ESMTP id 3BEF53860C3D
	for <incoming@patchwork.ozlabs.org>; Wed,  3 Jul 2024 13:26:33 +0000 (GMT)
X-Original-To: gcc-patches@gcc.gnu.org
Delivered-To: gcc-patches@gcc.gnu.org
Received: from smtp-out1.suse.de (smtp-out1.suse.de [195.135.223.130])
 by sourceware.org (Postfix) with ESMTPS id 4811D386481A
 for <gcc-patches@gcc.gnu.org>; Wed,  3 Jul 2024 13:24:06 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org 4811D386481A
Authentication-Results: sourceware.org;
 dmarc=pass (p=none dis=none) header.from=suse.de
Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=suse.de
ARC-Filter: OpenARC Filter v1.0.0 sourceware.org 4811D386481A
Authentication-Results: server2.sourceware.org;
 arc=none smtp.remote-ip=195.135.223.130
ARC-Seal: i=1; a=rsa-sha256; d=sourceware.org; s=key; t=1720013049; cv=none;
 b=GxAuAnhNsk2CyA6twum96/MXPfpKqD9l5I07brP/cES5yttAgrF1zN0BT2eJFnOQgUS8C+Hqx/DxzlWUxEFZwFKGeeVWhh0o9D+BYejRT+LcE+fwR2KdAe3xTHXrk7HMLeBhEgeuvd8H0BlLC5Pk2j4fXbUItEarzl74iHNrc2U=
ARC-Message-Signature: i=1; a=rsa-sha256; d=sourceware.org; s=key;
 t=1720013049; c=relaxed/simple;
 bh=A706mkrVisk7yEJvKPYx4ZZSaVFTN1MUEAINNchYeDI=;
 h=DKIM-Signature:DKIM-Signature:DKIM-Signature:DKIM-Signature:Date:
 From:To:Subject:MIME-Version;
 b=giXQz3QrZJFW2cx3p7kDC4ZWpkMCbE61/8jbfuTD8Hws5tE1TO0ZbrCvOhfgbtVVWt4371morYb/L3wOi8IRaFzUp9fDKP9JF+t5uNZpbCn1f4cAGb5JZyO8AibvGc7H8nZquExp/ysSMt/+xfxz3s7N7TwZsyMsMNila4mpOw4=
ARC-Authentication-Results: i=1; server2.sourceware.org
Received: from murzim.nue2.suse.org (unknown [10.168.4.243])
 (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
 key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest
 SHA256)
 (No client certificate requested)
 by smtp-out1.suse.de (Postfix) with ESMTPS id 4E53321BD0
 for <gcc-patches@gcc.gnu.org>; Wed,  3 Jul 2024 13:24:05 +0000 (UTC)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=suse.de;
 s=susede2_rsa;
 t=1720013045;
 h=from:from:reply-to:date:date:to:to:cc:mime-version:mime-version:
 content-type:content-type; bh=6oH0DkDdjmUXLT7dbd9OVpJFB6nISzoCiGz1xAflXzg=;
 b=tK2MyEvNRgzF016a06/v7MH/v3pDVb0Pa5/27czrz3GpzfW5xsm4KCDjhmCqysTTACJqwH
 pjXXIcmDydnWi75XEm9rw4Iqa7SYZTIFyFbTknS7ZrSkHCbbMV6G4Emx8uK8i7lLvG5bMm
 qZndxTK7v1TiHE42/7ydBwZmzCly7sc=
DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed; d=suse.de;
 s=susede2_ed25519; t=1720013045;
 h=from:from:reply-to:date:date:to:to:cc:mime-version:mime-version:
 content-type:content-type; bh=6oH0DkDdjmUXLT7dbd9OVpJFB6nISzoCiGz1xAflXzg=;
 b=lNkxWoMfeZa6G+1s9ivPoK1y4mIn9FqHD2U1R69y+A5GnqGDNUkFYri7SFrHMFHCQONj/9
 HcHVU8es2i31kYAw==
Authentication-Results: smtp-out1.suse.de;
	none
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=suse.de;
 s=susede2_rsa;
 t=1720013045;
 h=from:from:reply-to:date:date:to:to:cc:mime-version:mime-version:
 content-type:content-type; bh=6oH0DkDdjmUXLT7dbd9OVpJFB6nISzoCiGz1xAflXzg=;
 b=tK2MyEvNRgzF016a06/v7MH/v3pDVb0Pa5/27czrz3GpzfW5xsm4KCDjhmCqysTTACJqwH
 pjXXIcmDydnWi75XEm9rw4Iqa7SYZTIFyFbTknS7ZrSkHCbbMV6G4Emx8uK8i7lLvG5bMm
 qZndxTK7v1TiHE42/7ydBwZmzCly7sc=
DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed; d=suse.de;
 s=susede2_ed25519; t=1720013045;
 h=from:from:reply-to:date:date:to:to:cc:mime-version:mime-version:
 content-type:content-type; bh=6oH0DkDdjmUXLT7dbd9OVpJFB6nISzoCiGz1xAflXzg=;
 b=lNkxWoMfeZa6G+1s9ivPoK1y4mIn9FqHD2U1R69y+A5GnqGDNUkFYri7SFrHMFHCQONj/9
 HcHVU8es2i31kYAw==
Date: Wed, 3 Jul 2024 15:24:05 +0200 (CEST)
From: Richard Biener <rguenther@suse.de>
To: gcc-patches@gcc.gnu.org
Subject: [PATCH 4/5] Support group-size of three in SLP load permutation
 lowering
MIME-Version: 1.0
X-Spamd-Result: default: False [-0.10 / 50.00]; BAYES_HAM(-3.00)[100.00%];
 MISSING_MID(2.50)[]; NEURAL_SPAM_LONG(0.70)[0.200];
 NEURAL_HAM_SHORT(-0.20)[-1.000]; MIME_GOOD(-0.10)[text/plain];
 RCPT_COUNT_ONE(0.00)[1]; RCVD_COUNT_ZERO(0.00)[0];
 ARC_NA(0.00)[]; MISSING_XM_UA(0.00)[];
 DKIM_SIGNED(0.00)[suse.de:s=susede2_rsa,suse.de:s=susede2_ed25519];
 FUZZY_BLOCKED(0.00)[rspamd.com]; FROM_EQ_ENVFROM(0.00)[];
 MIME_TRACE(0.00)[0:+]; TO_DN_NONE(0.00)[];
 TO_MATCH_ENVRCPT_ALL(0.00)[]; FROM_HAS_DN(0.00)[]
X-Spam-Score: -0.10
X-Spam-Level: 
X-Spam-Status: No, score=-10.6 required=5.0 tests=BAYES_00, DKIM_SIGNED,
 DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, GIT_PATCH_0, MISSING_MID,
 SPF_HELO_NONE, SPF_PASS, TXREP autolearn=ham autolearn_force=no version=3.4.6
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on
 server2.sourceware.org
X-BeenThere: gcc-patches@gcc.gnu.org
X-Mailman-Version: 2.1.30
Precedence: list
List-Id: Gcc-patches mailing list <gcc-patches.gcc.gnu.org>
List-Unsubscribe: <https://gcc.gnu.org/mailman/options/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=unsubscribe>
List-Archive: <https://gcc.gnu.org/pipermail/gcc-patches/>
List-Post: <mailto:gcc-patches@gcc.gnu.org>
List-Help: <mailto:gcc-patches-request@gcc.gnu.org?subject=help>
List-Subscribe: <https://gcc.gnu.org/mailman/listinfo/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=subscribe>
Errors-To: gcc-patches-bounces~incoming=patchwork.ozlabs.org@gcc.gnu.org
Message-Id: <20240703132633.3BEF53860C3D@sourceware.org>

The following adds support for group-size three in SLP load permutation
lowering to match the non-SLP capabilities.  This is done by using
the non-interleaving fallback code which then creates at VF == 4 from
{ { a0, b0, c0 }, { a1, b1, c1 }, { a2, b2, c2 }, { a3, b3, c3 } }
the intermediate vectors { c0, c0, c1, c1 } and { c2, c2, c3, c3 }
to produce { c0, c1, c2, c3 }.

This turns out to be more effective than the scheme implemented
for non-SLP for SSE and only slightly worse for AVX512 and a bit
more worse for AVX2.  It seems to me that this would extend to
other non-power-of-two group-sizes though (but the patch does not).
Optimal schemes are likely difficult to lay out in VF agnostic form.

I'll note that while the lowering assumes even/odd extract is
generally available for all vector element sizes (which is probably
a good assumption), it doesn't in any way constrain the other
permutes it generates based on target availability.  Again difficult
to do in a VF agnostic way (but at least currently the vector type
is fixed).

I'll also note that the SLP store side merges lanes in a way
producing three-vector permutes for store group-size of three, so
the testcase uses a store group-size of four.

	* tree-vect-slp.cc (vect_lower_load_permutations): Support
	group-size of three.

	* gcc.dg/vect/slp-52.c: New testcase.
---
 gcc/testsuite/gcc.dg/vect/slp-52.c | 14 ++++++++++++
 gcc/tree-vect-slp.cc               | 35 +++++++++++++++++-------------
 2 files changed, 34 insertions(+), 15 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/vect/slp-52.c

diff --git a/gcc/testsuite/gcc.dg/vect/slp-52.c b/gcc/testsuite/gcc.dg/vect/slp-52.c
new file mode 100644
index 00000000000..ba49f0046e2
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/slp-52.c
@@ -0,0 +1,14 @@
+/* { dg-do compile } */
+
+void foo (int * __restrict x, int *y)
+{
+  for (int i = 0; i < 1024; ++i)
+    {
+      x[4*i+0] = y[3*i+0];
+      x[4*i+1] = y[3*i+1] * 2;
+      x[4*i+2] = y[3*i+2] + 3;
+      x[4*i+3] = y[3*i+2] * 2 - 5;
+    }
+}
+
+/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" { target { vect_int && vect_int_mult } } } } */
diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
index fdefee90e92..c62b0b5cf88 100644
--- a/gcc/tree-vect-slp.cc
+++ b/gcc/tree-vect-slp.cc
@@ -3718,7 +3718,8 @@ vect_build_slp_instance (vec_info *vinfo,
 		 with the least number of lanes to one and then repeat until
 		 we end up with two inputs.  That scheme makes sure we end
 		 up with permutes satisfying the restriction of requiring at
-		 most two vector inputs to produce a single vector output.  */
+		 most two vector inputs to produce a single vector output
+		 when the number of lanes is even.  */
 	      while (SLP_TREE_CHILDREN (perm).length () > 2)
 		{
 		  /* Pick the two nodes with the least number of lanes,
@@ -3995,11 +3996,10 @@ vect_lower_load_permutations (loop_vec_info loop_vinfo,
     = DR_GROUP_FIRST_ELEMENT (SLP_TREE_SCALAR_STMTS (loads[0])[0]);
 
   /* Only a power-of-two number of lanes matches interleaving with N levels.
-     The non-SLP path also supports DR_GROUP_SIZE == 3.
      ???  An even number of lanes could be reduced to 1<<ceil_log2(N)-1 lanes
      at each step.  */
   unsigned group_lanes = DR_GROUP_SIZE (first);
-  if (exact_log2 (group_lanes) == -1)
+  if (exact_log2 (group_lanes) == -1 && group_lanes != 3)
     return;
 
   for (slp_tree load : loads)
@@ -4016,7 +4016,7 @@ vect_lower_load_permutations (loop_vec_info loop_vinfo,
 	 with a non-1:1 load permutation around instead of canonicalizing
 	 those into a load and a permute node.  Removing this early
 	 check would do such canonicalization.  */
-      if (SLP_TREE_LANES (load) >= group_lanes / 2)
+      if (SLP_TREE_LANES (load) >= (group_lanes + 1) / 2)
 	continue;
 
       /* First build (and possibly re-use) a load node for the
@@ -4052,7 +4052,7 @@ vect_lower_load_permutations (loop_vec_info loop_vinfo,
       while (1)
 	{
 	  unsigned group_lanes = SLP_TREE_LANES (l0);
-	  if (SLP_TREE_LANES (load) >= group_lanes / 2)
+	  if (SLP_TREE_LANES (load) >= (group_lanes + 1) / 2)
 	    break;
 
 	  /* Try to lower by reducing the group to half its size using an
@@ -4062,19 +4062,24 @@ vect_lower_load_permutations (loop_vec_info loop_vinfo,
 	     Thus { e, e, o, o, e, e, o, o } woud be an even/odd decomposition
 	     with N == 2.  */
 	  /* ???  Only an even number of lanes can be handed this way, but the
-	     fallback below could work for any number.  */
-	  gcc_assert ((group_lanes & 1) == 0);
-	  unsigned even = (1 << ceil_log2 (group_lanes)) - 1;
-	  unsigned odd = even;
-	  for (auto l : final_perm)
+	     fallback below could work for any number.  We have to make sure
+	     to round up in that case.  */
+	  gcc_assert ((group_lanes & 1) == 0 || group_lanes == 3);
+	  unsigned even = 0, odd = 0;
+	  if ((group_lanes & 1) == 0)
 	    {
-	      even &= ~l.second;
-	      odd &= l.second;
+	      even = (1 << ceil_log2 (group_lanes)) - 1;
+	      odd = even;
+	      for (auto l : final_perm)
+		{
+		  even &= ~l.second;
+		  odd &= l.second;
+		}
 	    }
 
 	  /* Now build an even or odd extraction from the unpermuted load.  */
 	  lane_permutation_t perm;
-	  perm.create (group_lanes / 2);
+	  perm.create ((group_lanes + 1) / 2);
 	  unsigned level;
 	  if (even
 	      && ((level = 1 << ctz_hwi (even)), true)
@@ -4109,7 +4114,7 @@ vect_lower_load_permutations (loop_vec_info loop_vinfo,
 	      bitmap_iterator bi;
 	      EXECUTE_IF_SET_IN_BITMAP (l, 0, i, bi)
 		  perm.quick_push (std::make_pair (0, i));
-	      while (perm.length () < group_lanes / 2)
+	      while (perm.length () < (group_lanes + 1) / 2)
 		perm.quick_push (perm.last ());
 	    }
 
@@ -4145,7 +4150,7 @@ vect_lower_load_permutations (loop_vec_info loop_vinfo,
 	     have a "local" CSE map here.  */
 	  SLP_TREE_SCALAR_STMTS (p) = perm_stmts;
 
-	  /* We now have a node for group_lanes / 2 lanes.  */
+	  /* We now have a node for (group_lanes + 1) / 2 lanes.  */
 	  l0 = p;
 	}
 

From patchwork Wed Jul  3 13:24:11 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Richard Biener <rguenther@suse.de>
X-Patchwork-Id: 1956241
Return-Path: <gcc-patches-bounces~incoming=patchwork.ozlabs.org@gcc.gnu.org>
X-Original-To: incoming@patchwork.ozlabs.org
Delivered-To: patchwork-incoming@legolas.ozlabs.org
Authentication-Results: legolas.ozlabs.org;
	dkim=pass (1024-bit key;
 unprotected) header.d=suse.de header.i=@suse.de header.a=rsa-sha256
 header.s=susede2_rsa header.b=QyCP+27P;
	dkim=pass header.d=suse.de header.i=@suse.de header.a=ed25519-sha256
 header.s=susede2_ed25519 header.b=DM3GboWI;
	dkim=pass (1024-bit key) header.d=suse.de header.i=@suse.de
 header.a=rsa-sha256 header.s=susede2_rsa header.b=QyCP+27P;
	dkim=neutral header.d=suse.de header.i=@suse.de header.a=ed25519-sha256
 header.s=susede2_ed25519 header.b=DM3GboWI;
	dkim-atps=neutral
Authentication-Results: legolas.ozlabs.org;
 spf=pass (sender SPF authorized) smtp.mailfrom=gcc.gnu.org
 (client-ip=8.43.85.97; helo=server2.sourceware.org;
 envelope-from=gcc-patches-bounces~incoming=patchwork.ozlabs.org@gcc.gnu.org;
 receiver=patchwork.ozlabs.org)
Received: from server2.sourceware.org (server2.sourceware.org [8.43.85.97])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
	 key-exchange X25519 server-signature ECDSA (secp384r1) server-digest SHA384)
	(No client certificate requested)
	by legolas.ozlabs.org (Postfix) with ESMTPS id 4WDgWy3GKXz1xpN
	for <incoming@patchwork.ozlabs.org>; Wed,  3 Jul 2024 23:25:50 +1000 (AEST)
Received: from server2.sourceware.org (localhost [IPv6:::1])
	by sourceware.org (Postfix) with ESMTP id B89483861028
	for <incoming@patchwork.ozlabs.org>; Wed,  3 Jul 2024 13:25:48 +0000 (GMT)
X-Original-To: gcc-patches@gcc.gnu.org
Delivered-To: gcc-patches@gcc.gnu.org
Received: from smtp-out2.suse.de (smtp-out2.suse.de [195.135.223.131])
 by sourceware.org (Postfix) with ESMTPS id 6DB733860C3D
 for <gcc-patches@gcc.gnu.org>; Wed,  3 Jul 2024 13:24:12 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org 6DB733860C3D
Authentication-Results: sourceware.org;
 dmarc=pass (p=none dis=none) header.from=suse.de
Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=suse.de
ARC-Filter: OpenARC Filter v1.0.0 sourceware.org 6DB733860C3D
Authentication-Results: server2.sourceware.org;
 arc=none smtp.remote-ip=195.135.223.131
ARC-Seal: i=1; a=rsa-sha256; d=sourceware.org; s=key; t=1720013055; cv=none;
 b=bXd7W956x9Dz8I53oFBRf9MwYuPFIkRr0NCXqWno5m6lK1+mN3jqGpfomFE7W6Ry1SLJP/O0KXBGJqBqZ8KH92d2UI0P86oZFaIew7QZCCT4IHmdL16is/nayD9uClYtCfrVtM3QuCjUsdRpaZBiKqBxjlu2aSWJ80MmVMtKp0Y=
ARC-Message-Signature: i=1; a=rsa-sha256; d=sourceware.org; s=key;
 t=1720013055; c=relaxed/simple;
 bh=Q+oUt2zKgj4jFw+YY8UMgmiPdavHQKYyNxyFzYafXUI=;
 h=DKIM-Signature:DKIM-Signature:DKIM-Signature:DKIM-Signature:Date:
 From:To:Subject:MIME-Version;
 b=rrDCEQx55DG2BlFPlU1qq8Drq6yMhh5IxMGQ6fMe33eJumfLfU+nwPTVYVXFZN6geSzJ1MxxeEIrpX/OxZDKO0zhbp2wx3c8tRC5xhD3B4sOpxAzY0RkGwlpju0tOwq7AafkJGfpHKImcsFdc1V7SdVZZ3rFwhv5VinSgNY3+ig=
ARC-Authentication-Results: i=1; server2.sourceware.org
Received: from murzim.nue2.suse.org (unknown [10.168.4.243])
 (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
 key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest
 SHA256)
 (No client certificate requested)
 by smtp-out2.suse.de (Postfix) with ESMTPS id 78FB41F449
 for <gcc-patches@gcc.gnu.org>; Wed,  3 Jul 2024 13:24:11 +0000 (UTC)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=suse.de;
 s=susede2_rsa;
 t=1720013051;
 h=from:from:reply-to:date:date:to:to:cc:mime-version:mime-version:
 content-type:content-type; bh=vrX6Pi/ZCnGk2cM7n+nb5cCXM5bJ9ytX3lZLv5OKVXs=;
 b=QyCP+27PUGO6Mx546bTrJ5Ec78fR4ZE2yhP+Ey65VEjOIhi1iuHtqveOJvTD98BmMUATlk
 WrgSLYaDcdsW+y1fpHIY6b2CaxY79qb7zXTzHCxqghi0z9Hqbg/AdVXs5QVH8CI8lGVegL
 WndTEgn0gnkyCZhlZUyau7q+XrLooG8=
DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed; d=suse.de;
 s=susede2_ed25519; t=1720013051;
 h=from:from:reply-to:date:date:to:to:cc:mime-version:mime-version:
 content-type:content-type; bh=vrX6Pi/ZCnGk2cM7n+nb5cCXM5bJ9ytX3lZLv5OKVXs=;
 b=DM3GboWIcNkFQneJQmqhuAsddfumoHFP+c8bPXxiGyXYDpjpZRCc0xfSBhNNWNyGt12aV6
 uLgSy1lNoQjPTdCw==
Authentication-Results: smtp-out2.suse.de;
	none
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=suse.de;
 s=susede2_rsa;
 t=1720013051;
 h=from:from:reply-to:date:date:to:to:cc:mime-version:mime-version:
 content-type:content-type; bh=vrX6Pi/ZCnGk2cM7n+nb5cCXM5bJ9ytX3lZLv5OKVXs=;
 b=QyCP+27PUGO6Mx546bTrJ5Ec78fR4ZE2yhP+Ey65VEjOIhi1iuHtqveOJvTD98BmMUATlk
 WrgSLYaDcdsW+y1fpHIY6b2CaxY79qb7zXTzHCxqghi0z9Hqbg/AdVXs5QVH8CI8lGVegL
 WndTEgn0gnkyCZhlZUyau7q+XrLooG8=
DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed; d=suse.de;
 s=susede2_ed25519; t=1720013051;
 h=from:from:reply-to:date:date:to:to:cc:mime-version:mime-version:
 content-type:content-type; bh=vrX6Pi/ZCnGk2cM7n+nb5cCXM5bJ9ytX3lZLv5OKVXs=;
 b=DM3GboWIcNkFQneJQmqhuAsddfumoHFP+c8bPXxiGyXYDpjpZRCc0xfSBhNNWNyGt12aV6
 uLgSy1lNoQjPTdCw==
Date: Wed, 3 Jul 2024 15:24:11 +0200 (CEST)
From: Richard Biener <rguenther@suse.de>
To: gcc-patches@gcc.gnu.org
Subject: [PATCH 5/5] RISC-V: Support group size of three in SLP store permute
 lowering
MIME-Version: 1.0
X-Spam-Score: -0.40
X-Spam-Level: 
X-Spamd-Result: default: False [-0.40 / 50.00]; BAYES_HAM(-3.00)[100.00%];
 MISSING_MID(2.50)[]; NEURAL_SPAM_LONG(0.40)[0.114];
 NEURAL_HAM_SHORT(-0.20)[-1.000]; MIME_GOOD(-0.10)[text/plain];
 FUZZY_BLOCKED(0.00)[rspamd.com];
 DKIM_SIGNED(0.00)[suse.de:s=susede2_rsa,suse.de:s=susede2_ed25519];
 RCPT_COUNT_ONE(0.00)[1]; ARC_NA(0.00)[];
 TO_MATCH_ENVRCPT_ALL(0.00)[]; FROM_HAS_DN(0.00)[];
 RCVD_COUNT_ZERO(0.00)[0]; MISSING_XM_UA(0.00)[];
 FROM_EQ_ENVFROM(0.00)[]; TO_DN_NONE(0.00)[];
 MIME_TRACE(0.00)[0:+]
X-Spam-Status: No, score=-10.5 required=5.0 tests=BAYES_00, DKIM_SIGNED,
 DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, GIT_PATCH_0, MISSING_MID,
 SPF_HELO_NONE, SPF_PASS, TXREP autolearn=ham autolearn_force=no version=3.4.6
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on
 server2.sourceware.org
X-BeenThere: gcc-patches@gcc.gnu.org
X-Mailman-Version: 2.1.30
Precedence: list
List-Id: Gcc-patches mailing list <gcc-patches.gcc.gnu.org>
List-Unsubscribe: <https://gcc.gnu.org/mailman/options/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=unsubscribe>
List-Archive: <https://gcc.gnu.org/pipermail/gcc-patches/>
List-Post: <mailto:gcc-patches@gcc.gnu.org>
List-Help: <mailto:gcc-patches-request@gcc.gnu.org?subject=help>
List-Subscribe: <https://gcc.gnu.org/mailman/listinfo/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=subscribe>
Errors-To: gcc-patches-bounces~incoming=patchwork.ozlabs.org@gcc.gnu.org
Message-Id: <20240703132548.B89483861028@sourceware.org>

The following implements the group-size three scheme from
vect_permute_store_chain in SLP grouped store permute lowering
and extends it to power-of-two multiples of group size three.

The scheme goes from vectors A, B and C to
{ A[0], B[0], C[0], A[1], B[1], C[1], ... } by first producing
{ A[0], B[0], X, A[1], B[1], X, ... } (with X random but chosen
to A[n]) and then permuting in C[n] in the appropriate places.

The extension goes as to replace vector elements with a
power-of-two number of lanes and you'd get pairwise interleaving
until the final three input permutes happen.

The last permute step could be seen as extending C to { C[0], C[0],
C[0], ... } and then performing a blend.

VLA archs will want to use store-lanes here I guess, I'm not sure
if the three vector interleave operation is also available with
a register source and destination and thus available for a shuffle.

	* tree-vect-slp.cc (vect_build_slp_instance): Special case
	three input permute with the same number of lanes in store
	permute lowering.

	* gcc.dg/vect/slp-53.c: New testcase.
	* gcc.dg/vect/slp-54.c: New testcase.
---
 gcc/testsuite/gcc.dg/vect/slp-53.c | 15 +++++++
 gcc/testsuite/gcc.dg/vect/slp-54.c | 18 +++++++++
 gcc/tree-vect-slp.cc               | 65 +++++++++++++++++++++++++++++-
 3 files changed, 97 insertions(+), 1 deletion(-)
 create mode 100644 gcc/testsuite/gcc.dg/vect/slp-53.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/slp-54.c

diff --git a/gcc/testsuite/gcc.dg/vect/slp-53.c b/gcc/testsuite/gcc.dg/vect/slp-53.c
new file mode 100644
index 00000000000..d00ca236958
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/slp-53.c
@@ -0,0 +1,15 @@
+/* { dg-do compile } */
+
+void foo (int * __restrict x, int *y)
+{
+  x = __builtin_assume_aligned (x, __BIGGEST_ALIGNMENT__);
+  y = __builtin_assume_aligned (y, __BIGGEST_ALIGNMENT__);
+  for (int i = 0; i < 1024; ++i)
+    {
+      x[3*i+0] = y[2*i+0] * 7 + 5;
+      x[3*i+1] = y[2*i+1] * 2;
+      x[3*i+2] = y[2*i+0] + 3;
+    }
+}
+
+/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" { target { vect_int && vect_int_mult } } } } */
diff --git a/gcc/testsuite/gcc.dg/vect/slp-54.c b/gcc/testsuite/gcc.dg/vect/slp-54.c
new file mode 100644
index 00000000000..57268ab50b7
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/slp-54.c
@@ -0,0 +1,18 @@
+/* { dg-do compile } */
+
+void foo (int * __restrict x, int *y)
+{
+  x = __builtin_assume_aligned (x, __BIGGEST_ALIGNMENT__);
+  y = __builtin_assume_aligned (y, __BIGGEST_ALIGNMENT__);
+  for (int i = 0; i < 1024; ++i)
+    {
+      x[6*i+0] = y[4*i+0] * 7 + 5;
+      x[6*i+1] = y[4*i+1] * 2;
+      x[6*i+2] = y[4*i+2] + 3;
+      x[6*i+3] = y[4*i+3] * 7 + 5;
+      x[6*i+4] = y[4*i+0] * 2;
+      x[6*i+5] = y[4*i+3] + 3;
+    }
+}
+
+/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" { target { vect_int && vect_int_mult } } } } */
diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
index c62b0b5cf88..fd4b4574a2f 100644
--- a/gcc/tree-vect-slp.cc
+++ b/gcc/tree-vect-slp.cc
@@ -3722,6 +3722,69 @@ vect_build_slp_instance (vec_info *vinfo,
 		 when the number of lanes is even.  */
 	      while (SLP_TREE_CHILDREN (perm).length () > 2)
 		{
+		  /* When we have three equal sized groups left the pairwise
+		     reduction does not result in a scheme that avoids using
+		     three vectors.  Instead merge the first two groups
+		     to the final size with do-not-care elements (chosen
+		     from the first group) and then merge with the third.
+			   { A0, B0,  x, A1, B1,  x, ... }
+			-> { A0, B0, C0, A1, B1, C1, ... }
+		     This handles group size of three (and at least
+		     power-of-two multiples of that).  */
+		  if (SLP_TREE_CHILDREN (perm).length () == 3
+		      && (SLP_TREE_LANES (SLP_TREE_CHILDREN (perm)[0])
+			  == SLP_TREE_LANES (SLP_TREE_CHILDREN (perm)[1]))
+		      && (SLP_TREE_LANES (SLP_TREE_CHILDREN (perm)[0])
+			  == SLP_TREE_LANES (SLP_TREE_CHILDREN (perm)[2])))
+		    {
+		      int ai = 0;
+		      int bi = 1;
+		      slp_tree a = SLP_TREE_CHILDREN (perm)[ai];
+		      slp_tree b = SLP_TREE_CHILDREN (perm)[bi];
+		      unsigned n = SLP_TREE_LANES (perm);
+
+		      slp_tree permab
+			= vect_create_new_slp_node (2, VEC_PERM_EXPR);
+		      SLP_TREE_LANES (permab) = n;
+		      SLP_TREE_LANE_PERMUTATION (permab).create (n);
+		      SLP_TREE_VECTYPE (permab) = SLP_TREE_VECTYPE (perm);
+		      /* ???  Should be NULL but that's not expected.  */
+		      SLP_TREE_REPRESENTATIVE (permab)
+			= SLP_TREE_REPRESENTATIVE (perm);
+		      SLP_TREE_CHILDREN (permab).quick_push (a);
+		      for (unsigned k = 0; k < SLP_TREE_LANES (a); ++k)
+			SLP_TREE_LANE_PERMUTATION (permab)
+			  .quick_push (std::make_pair (0, k));
+		      SLP_TREE_CHILDREN (permab).quick_push (b);
+		      for (unsigned k = 0; k < SLP_TREE_LANES (b); ++k)
+			SLP_TREE_LANE_PERMUTATION (permab)
+			  .quick_push (std::make_pair (1, k));
+		      /* Push the do-not-care lanes.  */
+		      for (unsigned k = 0; k < SLP_TREE_LANES (a); ++k)
+			SLP_TREE_LANE_PERMUTATION (permab)
+			  .quick_push (std::make_pair (0, k));
+
+		      /* Put the merged node into 'perm', in place of a.  */
+		      SLP_TREE_CHILDREN (perm)[ai] = permab;
+		      /* Adjust the references to b in the permutation
+			 of perm and to the later children which we'll
+			 remove.  */
+		      for (unsigned k = 0; k < SLP_TREE_LANES (perm); ++k)
+			{
+			  std::pair<unsigned, unsigned> &p
+			      = SLP_TREE_LANE_PERMUTATION (perm)[k];
+			  if (p.first == (unsigned) bi)
+			    {
+			      p.first = ai;
+			      p.second += SLP_TREE_LANES (a);
+			    }
+			  else if (p.first > (unsigned) bi)
+			    p.first--;
+			}
+		      SLP_TREE_CHILDREN (perm).ordered_remove (bi);
+		      break;
+		    }
+
 		  /* Pick the two nodes with the least number of lanes,
 		     prefer the earliest candidate and maintain ai < bi.  */
 		  int ai = -1;
@@ -3758,7 +3821,7 @@ vect_build_slp_instance (vec_info *vinfo,
 		  SLP_TREE_LANES (permab) = n;
 		  SLP_TREE_LANE_PERMUTATION (permab).create (n);
 		  SLP_TREE_VECTYPE (permab) = SLP_TREE_VECTYPE (perm);
-		  /* ???  We should set this NULL but that's not expected.  */
+		  /* ???  Should be NULL but that's not expected.  */
 		  SLP_TREE_REPRESENTATIVE (permab)
 		    = SLP_TREE_REPRESENTATIVE (perm);
 		  SLP_TREE_CHILDREN (permab).quick_push (a);