From patchwork Tue Jul  9 11:34:53 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Richard Biener <rguenther@suse.de>
X-Patchwork-Id: 1958341
Date: Tue, 9 Jul 2024 13:34:53 +0200 (CEST)
From: Richard Biener <rguenther@suse.de>
To: gcc-patches@gcc.gnu.org
Subject: [PATCH 1/3] lower SLP load permutation to interleaving
MIME-Version: 1.0
Message-Id: <20240709113531.72614386C5A5@sourceware.org>

The following emulates classical interleaving for SLP load permutes
that we are unlikely to handle natively.  This is to handle cases
where interleaving (or load/store-lanes) is the optimal choice for
vectorizing even when we are doing that within SLP.  An example
would be

void foo (int * __restrict a, int * b)
{
  for (int i = 0; i < 16; ++i)
    {
      a[4*i + 0] = b[4*i + 0] * 3;
      a[4*i + 1] = b[4*i + 1] + 3;
      a[4*i + 2] = (b[4*i + 2] * 3 + 3);
      a[4*i + 3] = b[4*i + 3] * 3;
    }
}

where currently the SLP store is merging four single-lane SLP
sub-graphs but none of the loads in it can be code-generated
with V4SImode vectors and a VF of four as the permutes would need
three vectors.
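
As an illustration only (not part of the patch; count_source_vectors is
a hypothetical helper, not a GCC function), the following sketch counts
how many contiguous input vectors feed one output vector of a
single-lane load.  It assumes the group is loaded contiguously into
vectors of nunits elements.

```cpp
#include <cassert>
#include <set>

// Hypothetical helper, not a GCC API: for a single-lane SLP load picking
// lane LANE out of a group of GROUP_SIZE elements, count how many distinct
// contiguous input vectors of NUNITS elements feed one output vector.
static unsigned
count_source_vectors (unsigned group_size, unsigned lane, unsigned nunits)
{
  assert (nunits != 0 && lane < group_size);
  std::set<unsigned> vectors;
  for (unsigned i = 0; i < nunits; ++i)
    {
      // Output element i is scalar element group_size * i + lane of the
      // contiguous load.
      unsigned elt = group_size * i + lane;
      vectors.insert (elt / nunits);
    }
  return static_cast<unsigned> (vectors.size ());
}
```

For the example above (group size four, four elements per V4SImode
vector) every lane draws from four input vectors, which is more than the
two inputs a single VEC_PERM_EXPR accepts; presumably this is the
"three vectors" condition referred to here.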

The patch introduces a lowering phase after SLP discovery but
before SLP pattern recognition or permute optimization that
analyzes all loads from the same dataref group and creates an
interleaving scheme starting from an unpermuted load.

What can be handled is a power-of-two group size.  A group size of
three is handled in a followup, as is the possibility of
doing the interleaving with a load-lanes like instruction.
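
A stand-alone sketch of one halving step for a power-of-two group,
modelled on the even/odd logic in the diff below.  It uses plain
std::vector in place of lane_permutation_t, and halve_by_even_odd is a
hypothetical name, so treat it as an approximation rather than the
actual implementation.

```cpp
#include <cassert>
#include <vector>

// Hypothetical model of one halving step.  NEEDED holds the group lanes a
// load uses and is rewritten to index into the returned half-size
// selection.  Returns an empty vector when neither an even nor an odd
// extraction covers all needed lanes.
static std::vector<unsigned>
halve_by_even_odd (unsigned group_lanes, std::vector<unsigned> &needed)
{
  assert (group_lanes >= 2 && (group_lanes & (group_lanes - 1)) == 0);
  // Bit B stays set in EVEN if it is clear in every needed lane and stays
  // set in ODD if it is set in every needed lane.
  unsigned even = group_lanes - 1, odd = even;
  for (unsigned l : needed)
    {
      even &= ~l;
      odd &= l;
    }
  std::vector<unsigned> perm;
  unsigned mask = even ? even : odd;
  if (!mask)
    return perm;
  // Lowest set bit of MASK: the number of consecutive lanes per chunk.
  unsigned level = mask & (~mask + 1);
  // Even extraction starts at chunk 0, odd extraction at chunk 1.
  unsigned start = even ? 0 : level;
  for (unsigned i = 0; i < group_lanes / 2 / level; ++i)
    for (unsigned j = 0; j < level; ++j)
      perm.push_back (2 * i * level + start + j);
  // Re-express the needed lanes relative to the extracted half.
  for (unsigned &l : needed)
    for (unsigned j = 0; j < perm.size (); ++j)
      if (perm[j] == l)
        {
          l = j;
          break;
        }
  return perm;
}
```

Starting from an 8-lane group where only lane 2 is needed, the first
step selects lanes { 0, 2, 4, 6 } and the second selects { 0, 1 } of
those, leaving a two-lane node from which the final permute picks one
lane.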

The patch has a fallback for when there are multi-lane groups
and the resulting permutes do not fit interleaving.  Code
generation is not optimal when this triggers and might be
worse than doing single-lane group interleaving.
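
A minimal sketch of that fallback selection, again with plain
containers and a hypothetical function name: every used lane is
selected once in increasing order and the selection is padded to half
the group size by repeating the last lane.

```cpp
#include <cassert>
#include <set>
#include <vector>

// Hypothetical model of the fallback: select every used lane once, in
// increasing order, then pad to HALF entries by repeating the last lane.
static std::vector<unsigned>
fallback_selection (const std::vector<unsigned> &needed, unsigned half)
{
  std::set<unsigned> used (needed.begin (), needed.end ());
  assert (!used.empty () && used.size () <= half);
  std::vector<unsigned> perm (used.begin (), used.end ());
  while (perm.size () < half)
    perm.push_back (perm.back ());
  return perm;
}
```

For a load using lanes { 7, 0 } of an 8-lane group this yields the
selection { 0, 7, 7, 7 }, which no even/odd extraction would produce.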

The patch handles gaps by representing them with NULL
entries in SLP_TREE_SCALAR_STMTS for the unpermuted load node.
The SLP discovery changes could be elided if we manually build the
load node instead.
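
To illustrate the gap representation, here is a sketch that expands a
group chain into one slot per lane.  The member struct and expand_group
are hypothetical stand-ins; the gap convention is the one read off the
loop in the diff below (for the first member the gap after the last
member, for every other member the distance to its predecessor, with 1
meaning adjacent).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a group member, see the convention above.
struct member
{
  int id;
  unsigned gap;
};

// Expand the chain into one slot per group lane; nullptr marks a gap.
static std::vector<const member *>
expand_group (const std::vector<member> &chain)
{
  assert (!chain.empty ());
  std::vector<const member *> lanes;
  for (std::size_t k = 0; k < chain.size (); ++k)
    {
      // Gaps in the middle: distance minus one empty slots.
      if (k != 0)
        for (unsigned i = 1; i < chain[k].gap; ++i)
          lanes.push_back (nullptr);
      lanes.push_back (&chain[k]);
    }
  // Gap at the end of the group is recorded on the first member.
  for (unsigned i = 0; i < chain[0].gap; ++i)
    lanes.push_back (nullptr);
  return lanes;
}
```

With the loads of the new slp-51.c testcase (y[4*i+0] and y[4*i+2] in a
group of four) the chain { {0, 1}, {2, 2} } expands to
{ y0, NULL, y2, NULL }.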

SLP load nodes covering enough lanes to not need intermediate
permutes are retained as having a load-permutation and do not
use the single SLP load node for each dataref group.  That's
something we might want to change, making load-permutation
something purely local to SLP discovery (but then SLP discovery
could do part of the lowering).

The patch does not yet CSE the generated intermediate permutes or
register them with the bst_map, which is possibly required
for SLP pattern detection in some cases.

	* tree-vect-slp.cc (vect_build_slp_tree_1): Handle NULL stmt.
	(vect_build_slp_tree_2): Likewise.  Release load permutation
	when there's a NULL in SLP_TREE_SCALAR_STMTS and assert there's
	no actual permutation in that case.
	(vllp_cmp): New function.
	(vect_lower_load_permutations): Likewise.
	(vect_analyze_slp): Call it.

	* gcc.dg/vect/slp-11a.c: Expect SLP.
	* gcc.dg/vect/slp-12a.c: Likewise.
	* gcc.dg/vect/slp-51.c: New testcase.
---
 gcc/testsuite/gcc.dg/vect/slp-11a.c |   2 +-
 gcc/testsuite/gcc.dg/vect/slp-12a.c |   2 +-
 gcc/testsuite/gcc.dg/vect/slp-51.c  |  17 ++
 gcc/tree-vect-slp.cc                | 343 +++++++++++++++++++++++++++-
 4 files changed, 360 insertions(+), 4 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/vect/slp-51.c

diff --git a/gcc/testsuite/gcc.dg/vect/slp-11a.c b/gcc/testsuite/gcc.dg/vect/slp-11a.c
index fcb7cf6c7a2..2efa1796757 100644
--- a/gcc/testsuite/gcc.dg/vect/slp-11a.c
+++ b/gcc/testsuite/gcc.dg/vect/slp-11a.c
@@ -72,4 +72,4 @@ int main (void)
 
 /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { target { vect_strided8 && vect_int_mult } } } } */
 /* { dg-final { scan-tree-dump-times "vectorized 0 loops" 1 "vect" { target { ! { vect_strided8 && vect_int_mult } } } } } */
-/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 0 "vect" } } */
+/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" } } */
diff --git a/gcc/testsuite/gcc.dg/vect/slp-12a.c b/gcc/testsuite/gcc.dg/vect/slp-12a.c
index 2f98dc9da0b..fedf27b69d2 100644
--- a/gcc/testsuite/gcc.dg/vect/slp-12a.c
+++ b/gcc/testsuite/gcc.dg/vect/slp-12a.c
@@ -80,5 +80,5 @@ int main (void)
 
 /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { target { vect_strided8 && vect_int_mult } } } } */
 /* { dg-final { scan-tree-dump-times "vectorized 0 loops" 1 "vect" { target { ! { vect_strided8 && vect_int_mult } } } } } */
-/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 0 "vect" { target { { vect_strided8 && {! vect_load_lanes } } && vect_int_mult } } } } */
+/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { target { { vect_strided8 && {! vect_load_lanes } } && vect_int_mult } } } } */
 /* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 0 "vect" { target { ! { vect_strided8 && vect_int_mult } } } } } */
diff --git a/gcc/testsuite/gcc.dg/vect/slp-51.c b/gcc/testsuite/gcc.dg/vect/slp-51.c
new file mode 100644
index 00000000000..91ae763be30
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/slp-51.c
@@ -0,0 +1,17 @@
+/* { dg-do compile } */
+
+void foo (int * __restrict x, int *y)
+{
+  x = __builtin_assume_aligned (x, __BIGGEST_ALIGNMENT__);
+  y = __builtin_assume_aligned (y, __BIGGEST_ALIGNMENT__);
+  for (int i = 0; i < 1024; ++i)
+    {
+      x[4*i+0] = y[4*i+0];
+      x[4*i+1] = y[4*i+2] * 2;
+      x[4*i+2] = y[4*i+0] + 3;
+      x[4*i+3] = y[4*i+2] * 2 - 5;
+    }
+}
+
+/* Check we can handle SLP with gaps and an interleaving scheme.  */
+/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" { target { vect_int && vect_int_mult } } } } */
diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
index d0a8531fd3b..0f830c1ad9c 100644
--- a/gcc/tree-vect-slp.cc
+++ b/gcc/tree-vect-slp.cc
@@ -1080,10 +1080,15 @@ vect_build_slp_tree_1 (vec_info *vinfo, unsigned char *swap,
   stmt_vec_info stmt_info;
   FOR_EACH_VEC_ELT (stmts, i, stmt_info)
     {
-      gimple *stmt = stmt_info->stmt;
       swap[i] = 0;
       matches[i] = false;
+      if (!stmt_info)
+	{
+	  matches[i] = true;
+	  continue;
+	}
 
+      gimple *stmt = stmt_info->stmt;
       if (dump_enabled_p ())
 	dump_printf_loc (MSG_NOTE, vect_location, "Build SLP for %G", stmt);
 
@@ -1984,10 +1989,16 @@ vect_build_slp_tree_2 (vec_info *vinfo, slp_tree node,
 	  stmt_vec_info first_stmt_info
 	    = DR_GROUP_FIRST_ELEMENT (SLP_TREE_SCALAR_STMTS (node)[0]);
 	  bool any_permute = false;
+	  bool any_null = false;
 	  FOR_EACH_VEC_ELT (SLP_TREE_SCALAR_STMTS (node), j, load_info)
 	    {
 	      int load_place;
-	      if (STMT_VINFO_GROUPED_ACCESS (stmt_info))
+	      if (! load_info)
+		{
+		  load_place = j;
+		  any_null = true;
+		}
+	      else if (STMT_VINFO_GROUPED_ACCESS (stmt_info))
 		load_place = vect_get_place_in_interleaving_chain
 		    (load_info, first_stmt_info);
 	      else
@@ -1996,6 +2007,11 @@ vect_build_slp_tree_2 (vec_info *vinfo, slp_tree node,
 	      any_permute |= load_place != j;
 	      load_permutation.quick_push (load_place);
 	    }
+	  if (any_null)
+	    {
+	      gcc_assert (!any_permute);
+	      load_permutation.release ();
+	    }
 
 	  if (gcall *stmt = dyn_cast <gcall *> (stmt_info->stmt))
 	    {
@@ -3944,6 +3960,312 @@ vect_analyze_slp_instance (vec_info *vinfo,
   return res;
 }
 
+/* qsort comparator ordering SLP load nodes.  */
+
+static int
+vllp_cmp (const void *a_, const void *b_)
+{
+  const slp_tree a = *(const slp_tree *)a_;
+  const slp_tree b = *(const slp_tree *)b_;
+  stmt_vec_info a0 = SLP_TREE_SCALAR_STMTS (a)[0];
+  stmt_vec_info b0 = SLP_TREE_SCALAR_STMTS (b)[0];
+  if (STMT_VINFO_GROUPED_ACCESS (a0)
+      && STMT_VINFO_GROUPED_ACCESS (b0)
+      && DR_GROUP_FIRST_ELEMENT (a0) == DR_GROUP_FIRST_ELEMENT (b0))
+    {
+      /* Same group, order after lanes used.  */
+      if (SLP_TREE_LANES (a) < SLP_TREE_LANES (b))
+	return 1;
+      else if (SLP_TREE_LANES (a) > SLP_TREE_LANES (b))
+	return -1;
+      else
+	{
+	  /* Try to order loads using the same lanes together, breaking
+	     the tie with the lane number that first differs.  */
+	  if (!SLP_TREE_LOAD_PERMUTATION (a).exists ()
+	      && !SLP_TREE_LOAD_PERMUTATION (b).exists ())
+	    return 0;
+	  else if (SLP_TREE_LOAD_PERMUTATION (a).exists ()
+		   && !SLP_TREE_LOAD_PERMUTATION (b).exists ())
+	    return 1;
+	  else if (!SLP_TREE_LOAD_PERMUTATION (a).exists ()
+		   && SLP_TREE_LOAD_PERMUTATION (b).exists ())
+	    return -1;
+	  else
+	    {
+	      for (unsigned i = 0; i < SLP_TREE_LANES (a); ++i)
+		if (SLP_TREE_LOAD_PERMUTATION (a)[i]
+		    != SLP_TREE_LOAD_PERMUTATION (b)[i])
+		  {
+		    /* In-order lane first, that's what the above case for
+		       no permutation does.  */
+		    if (SLP_TREE_LOAD_PERMUTATION (a)[i] == i)
+		      return -1;
+		    else if (SLP_TREE_LOAD_PERMUTATION (b)[i] == i)
+		      return 1;
+		    else if (SLP_TREE_LOAD_PERMUTATION (a)[i]
+			     < SLP_TREE_LOAD_PERMUTATION (b)[i])
+		      return -1;
+		    else
+		      return 1;
+		  }
+	      return 0;
+	    }
+	}
+    }
+  else /* Different groups or non-groups.  */
+    {
+      /* Order groups as their first element to keep them together.  */
+      if (STMT_VINFO_GROUPED_ACCESS (a0))
+	a0 = DR_GROUP_FIRST_ELEMENT (a0);
+      if (STMT_VINFO_GROUPED_ACCESS (b0))
+	b0 = DR_GROUP_FIRST_ELEMENT (b0);
+      if (a0 == b0)
+	return 0;
+      /* Tie using UID.  */
+      else if (gimple_uid (STMT_VINFO_STMT (a0))
+	       < gimple_uid (STMT_VINFO_STMT (b0)))
+	return -1;
+      else
+	{
+	  gcc_assert (gimple_uid (STMT_VINFO_STMT (a0))
+		      != gimple_uid (STMT_VINFO_STMT (b0)));
+	  return 1;
+	}
+    }
+}
+
+/* Process the set of LOADS that are all from the same dataref group.  */
+
+static void
+vect_lower_load_permutations (loop_vec_info loop_vinfo,
+			      scalar_stmts_to_slp_tree_map_t *bst_map,
+			      const array_slice<slp_tree> &loads)
+{
+  /* We at this point want to lower without a fixed VF or vector
+     size in mind which means we cannot actually compute whether we
+     need three or more vectors for a load permutation yet.  So always
+     lower.  */
+  stmt_vec_info first
+    = DR_GROUP_FIRST_ELEMENT (SLP_TREE_SCALAR_STMTS (loads[0])[0]);
+
+  /* Only a power-of-two number of lanes matches interleaving with N levels.
+     The non-SLP path also supports DR_GROUP_SIZE == 3.
+     ???  An even number of lanes could be reduced to 1<<ceil_log2(N)-1 lanes
+     at each step.  */
+  unsigned group_lanes = DR_GROUP_SIZE (first);
+  if (exact_log2 (group_lanes) == -1)
+    return;
+
+  for (slp_tree load : loads)
+    {
+      /* Leave masked or gather loads alone for now.  */
+      if (!SLP_TREE_CHILDREN (load).is_empty ())
+	continue;
+
+      /* We want to pattern-match special cases here and keep those
+	 alone.  Candidates are splats and load-lane.  */
+
+      /* We need to lower only loads of less than half of the group's
+	 lanes, including duplicate lanes.  Note this leaves nodes
+	 with a non-1:1 load permutation around instead of canonicalizing
+	 those into a load and a permute node.  Removing this early
+	 check would do such canonicalization.  */
+      if (SLP_TREE_LANES (load) >= group_lanes / 2)
+	continue;
+
+      /* First build (and possibly re-use) a load node for the
+	 unpermuted group.  Gaps in the middle and on the end are
+	 represented with NULL stmts.  */
+      vec<stmt_vec_info> stmts;
+      stmts.create (group_lanes);
+      for (stmt_vec_info s = first; s; s = DR_GROUP_NEXT_ELEMENT (s))
+	{
+	  if (s != first)
+	    for (unsigned i = 1; i < DR_GROUP_GAP (s); ++i)
+	      stmts.quick_push (NULL);
+	  stmts.quick_push (s);
+	}
+      for (unsigned i = 0; i < DR_GROUP_GAP (first); ++i)
+	stmts.quick_push (NULL);
+      poly_uint64 max_nunits = 1;
+      bool *matches = XALLOCAVEC (bool, group_lanes);
+      unsigned limit = 1;
+      unsigned tree_size = 0;
+      slp_tree l0 = vect_build_slp_tree (loop_vinfo, stmts,
+					 group_lanes,
+					 &max_nunits, matches, &limit,
+					 &tree_size, bst_map);
+
+      /* Build the permute to get the original load permutation order.  */
+      lane_permutation_t final_perm;
+      final_perm.create (SLP_TREE_LANES (load));
+      for (unsigned i = 0; i < SLP_TREE_LANES (load); ++i)
+	final_perm.quick_push
+	  (std::make_pair (0, SLP_TREE_LOAD_PERMUTATION (load)[i]));
+
+      while (1)
+	{
+	  unsigned group_lanes = SLP_TREE_LANES (l0);
+	  if (SLP_TREE_LANES (load) >= group_lanes / 2)
+	    break;
+
+	  /* Try to lower by reducing the group to half its size using an
+	     interleaving scheme.  For this try to compute whether all
+	     elements needed for this load are in even or odd elements of
+	     an even/odd decomposition with N consecutive elements.
+	     Thus { e, e, o, o, e, e, o, o } would be an even/odd decomposition
+	     with N == 2.  */
+	  /* ???  Only an even number of lanes can be handled this way, but the
+	     fallback below could work for any number.  */
+	  gcc_assert ((group_lanes & 1) == 0);
+	  unsigned even = (1 << ceil_log2 (group_lanes)) - 1;
+	  unsigned odd = even;
+	  for (auto l : final_perm)
+	    {
+	      even &= ~l.second;
+	      odd &= l.second;
+	    }
+
+	  /* Now build an even or odd extraction from the unpermuted load.  */
+	  lane_permutation_t perm;
+	  perm.create (group_lanes / 2);
+	  unsigned level;
+	  if (even
+	      && ((level = 1 << ctz_hwi (even)), true)
+	      && group_lanes % (2 * level) == 0)
+	    {
+	      /* { 0, 1, ... 4, 5 ..., } */
+	      unsigned level = 1 << ctz_hwi (even);
+	      for (unsigned i = 0; i < group_lanes / 2 / level; ++i)
+		for (unsigned j = 0; j < level; ++j)
+		  perm.quick_push (std::make_pair (0, 2 * i * level + j));
+	    }
+	  else if (odd)
+	    {
+	      /* { ..., 2, 3, ... 6, 7 } */
+	      unsigned level = 1 << ctz_hwi (odd);
+	      gcc_assert (group_lanes % (2 * level) == 0);
+	      for (unsigned i = 0; i < group_lanes / 2 / level; ++i)
+		for (unsigned j = 0; j < level; ++j)
+		  perm.quick_push (std::make_pair (0, (2 * i + 1) * level + j));
+	    }
+	  else
+	    {
+	      /* As fallback extract all used lanes and fill to half the
+		 group size by repeating the last element.
+		 ???  This is quite a bad strategy for re-use - we could
+		 brute force our way to find more optimal filling lanes to
+		 maximize re-use when looking at all loads from the group.  */
+	      auto_bitmap l;
+	      for (auto p : final_perm)
+		bitmap_set_bit (l, p.second);
+	      unsigned i = 0;
+	      bitmap_iterator bi;
+	      EXECUTE_IF_SET_IN_BITMAP (l, 0, i, bi)
+		  perm.quick_push (std::make_pair (0, i));
+	      while (perm.length () < group_lanes / 2)
+		perm.quick_push (perm.last ());
+	    }
+
+	  /* Update final_perm with the intermediate permute.  */
+	  for (unsigned i = 0; i < final_perm.length (); ++i)
+	    {
+	      unsigned l = final_perm[i].second;
+	      unsigned j;
+	      for (j = 0; j < perm.length (); ++j)
+		if (perm[j].second == l)
+		  {
+		    final_perm[i].second = j;
+		    break;
+		  }
+	      gcc_assert (j < perm.length ());
+	    }
+
+	  /* And create scalar stmts.  */
+	  vec<stmt_vec_info> perm_stmts;
+	  perm_stmts.create (perm.length ());
+	  for (unsigned i = 0; i < perm.length (); ++i)
+	    perm_stmts.quick_push (SLP_TREE_SCALAR_STMTS (l0)[perm[i].second]);
+
+	  slp_tree p = vect_create_new_slp_node (1, VEC_PERM_EXPR);
+	  SLP_TREE_CHILDREN (p).quick_push (l0);
+	  SLP_TREE_LANE_PERMUTATION (p) = perm;
+	  SLP_TREE_VECTYPE (p) = SLP_TREE_VECTYPE (load);
+	  SLP_TREE_LANES (p) = perm.length ();
+	  SLP_TREE_REPRESENTATIVE (p) = SLP_TREE_REPRESENTATIVE (load);
+	  /* ???  As we have scalar stmts for this intermediate permute we
+	     could CSE it via bst_map but we do not want to pick up
+	     another SLP node with a load permutation.  We instead should
+	     have a "local" CSE map here.  */
+	  SLP_TREE_SCALAR_STMTS (p) = perm_stmts;
+
+	  /* We now have a node for group_lanes / 2 lanes.  */
+	  l0 = p;
+	}
+
+      /* And finally from the ordered reduction node create the
+	 permute to shuffle the lanes into the original load-permutation
+	 order.  We replace the original load node with this.  */
+      SLP_TREE_CODE (load) = VEC_PERM_EXPR;
+      SLP_TREE_LOAD_PERMUTATION (load).release ();
+      SLP_TREE_LANE_PERMUTATION (load) = final_perm;
+      SLP_TREE_CHILDREN (load).create (1);
+      SLP_TREE_CHILDREN (load).quick_push (l0);
+    }
+}
+
+/* Transform SLP loads in the SLP graph created by SLP discovery to
+   group loads from the same group and lower load permutations that
+   are unlikely to be supported into a series of permutes.
+   In the degenerate case of having only single-lane SLP instances
+   this should result in a series of permute nodes emulating an
+   interleaving scheme.  */
+
+static void
+vect_lower_load_permutations (loop_vec_info loop_vinfo,
+			      scalar_stmts_to_slp_tree_map_t *bst_map)
+{
+  /* Gather and sort loads across all instances.  */
+  hash_set<slp_tree> visited;
+  auto_vec<slp_tree> loads;
+  for (auto inst : loop_vinfo->slp_instances)
+    vect_gather_slp_loads (loads, SLP_INSTANCE_TREE (inst), visited);
+  if (loads.is_empty ())
+    return;
+  loads.qsort (vllp_cmp);
+
+  /* Now process each dataref group separately.  */
+  unsigned firsti = 0;
+  for (unsigned i = 1; i < loads.length (); ++i)
+    {
+      slp_tree first = loads[firsti];
+      slp_tree next = loads[i];
+      stmt_vec_info a0 = SLP_TREE_SCALAR_STMTS (first)[0];
+      stmt_vec_info b0 = SLP_TREE_SCALAR_STMTS (next)[0];
+      if (STMT_VINFO_GROUPED_ACCESS (a0)
+	  && STMT_VINFO_GROUPED_ACCESS (b0)
+	  && DR_GROUP_FIRST_ELEMENT (a0) == DR_GROUP_FIRST_ELEMENT (b0))
+	continue;
+      /* Just one SLP load of a possible group, leave those alone.  */
+      if (i == firsti + 1)
+	{
+	  firsti = i;
+	  continue;
+	}
+      /* Now we have multiple SLP loads of the same group from
+	 firsti to i - 1.  */
+      vect_lower_load_permutations (loop_vinfo, bst_map,
+				    make_array_slice (&loads[firsti],
+						      i - firsti));
+      firsti = i;
+    }
+  if (firsti < loads.length () - 1)
+    vect_lower_load_permutations (loop_vinfo, bst_map,
+				  make_array_slice (&loads[firsti],
+						    loads.length () - firsti));
+}
+
 /* Check if there are stmts in the loop can be vectorized using SLP.  Build SLP
    trees of packed scalar stmts if SLP is possible.  */
 
@@ -4085,6 +4407,23 @@ vect_analyze_slp (vec_info *vinfo, unsigned max_tree_size)
 	}
     }
 
+  /* When we end up with load permutations that we cannot possibly handle,
+     like those requiring three vector inputs, lower them using interleaving
+     like schemes.  */
+  if (loop_vec_info loop_vinfo = dyn_cast <loop_vec_info> (vinfo))
+    {
+      vect_lower_load_permutations (loop_vinfo, bst_map);
+      if (dump_enabled_p ())
+	{
+	  dump_printf_loc (MSG_NOTE, vect_location,
+			   "SLP graph after lowering permutations:\n");
+	  hash_set<slp_tree> visited;
+	  FOR_EACH_VEC_ELT (LOOP_VINFO_SLP_INSTANCES (vinfo), i, instance)
+	    vect_print_slp_graph (MSG_NOTE, vect_location,
+				  SLP_INSTANCE_TREE (instance), visited);
+	}
+    }
+
   hash_set<slp_tree> visited_patterns;
   slp_tree_to_load_perm_map_t perm_cache;
   slp_compat_nodes_map_t compat_cache;

From patchwork Tue Jul  9 11:35:01 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Richard Biener <rguenther@suse.de>
X-Patchwork-Id: 1958342
Date: Tue, 9 Jul 2024 13:35:01 +0200 (CEST)
From: Richard Biener <rguenther@suse.de>
To: gcc-patches@gcc.gnu.org
Subject: [PATCH 2/3] Support group-size of three in SLP load permutation
 lowering
MIME-Version: 1.0
Message-Id: <20240709113541.A2797386C5A3@sourceware.org>

The following adds support for group-size three in SLP load permutation
lowering to match the non-SLP capabilities.  This is done by using
the non-interleaving fallback code, which at VF == 4 creates from
{ { a0, b0, c0 }, { a1, b1, c1 }, { a2, b2, c2 }, { a3, b3, c3 } }
the intermediate vectors { c0, c0, c1, c1 } and { c2, c2, c3, c3 }
and from those produces { c0, c1, c2, c3 }.
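
As a rough scalar model of the two steps above (not the vectorizer
code itself; select_padded and extract_even are names made up for
illustration), the lowering first pads the wanted lane to
(group_lanes + 1) / 2 == 2 lanes and then does an even extract:

```c
#include <assert.h>
#include <stddef.h>

enum { GROUP = 3, VF = 4, HALF = (GROUP + 1) / 2 };

/* Step 1: reduce the three-lane group to HALF == 2 lanes.  Only one
   lane is wanted, so it is repeated to pad the node, i.e. the lane
   permutation is { lane, lane }.  IN holds VF groups of GROUP
   elements, MID receives VF * HALF elements.  */
static void
select_padded (const int *in, size_t lane, int *mid)
{
  assert (lane < (size_t) GROUP);
  for (size_t i = 0; i < VF; ++i)
    for (size_t k = 0; k < HALF; ++k)
      mid[i * HALF + k] = in[i * GROUP + lane];
}

/* Step 2: an even extract over the intermediate vectors yields one
   element per scalar iteration.  */
static void
extract_even (const int *mid, int *out)
{
  for (size_t i = 0; i < VF; ++i)
    out[i] = mid[i * HALF];
}
```

With in = { a0, b0, c0, a1, b1, c1, ... } and lane == 2, mid models
the two intermediate vectors and out models the final one.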

For SSE this turns out to be more effective than the scheme implemented
for non-SLP; it is only slightly worse for AVX512 and somewhat worse
for AVX2.  It seems to me that this would extend to other
non-power-of-two group sizes, though the patch does not do that.
Optimal schemes are likely difficult to lay out in VF-agnostic form.

I'll note that while the lowering assumes even/odd extract is
generally available for all vector element sizes (which is probably
a good assumption), it doesn't in any way constrain the other
permutes it generates based on target availability.  Again, this is
difficult to do in a VF-agnostic way (but at least currently the
vector type is fixed).

I'll also note that the SLP store side merges lanes in a way
producing three-vector permutes for store group-size of three, so
the testcase uses a store group-size of four.

	* tree-vect-slp.cc (vect_lower_load_permutations): Support
	group-size of three.

	* gcc.dg/vect/slp-52.c: New testcase.
---
 gcc/testsuite/gcc.dg/vect/slp-52.c | 14 ++++++++++++
 gcc/tree-vect-slp.cc               | 35 +++++++++++++++++-------------
 2 files changed, 34 insertions(+), 15 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/vect/slp-52.c

diff --git a/gcc/testsuite/gcc.dg/vect/slp-52.c b/gcc/testsuite/gcc.dg/vect/slp-52.c
new file mode 100644
index 00000000000..ba49f0046e2
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/slp-52.c
@@ -0,0 +1,14 @@
+/* { dg-do compile } */
+
+void foo (int * __restrict x, int *y)
+{
+  for (int i = 0; i < 1024; ++i)
+    {
+      x[4*i+0] = y[3*i+0];
+      x[4*i+1] = y[3*i+1] * 2;
+      x[4*i+2] = y[3*i+2] + 3;
+      x[4*i+3] = y[3*i+2] * 2 - 5;
+    }
+}
+
+/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" { target { vect_int && vect_int_mult } } } } */
diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
index 0f830c1ad9c..2dc6d365303 100644
--- a/gcc/tree-vect-slp.cc
+++ b/gcc/tree-vect-slp.cc
@@ -3710,7 +3710,8 @@ vect_build_slp_instance (vec_info *vinfo,
 		 with the least number of lanes to one and then repeat until
 		 we end up with two inputs.  That scheme makes sure we end
 		 up with permutes satisfying the restriction of requiring at
-		 most two vector inputs to produce a single vector output.  */
+		 most two vector inputs to produce a single vector output
+		 when the number of lanes is even.  */
 	      while (SLP_TREE_CHILDREN (perm).length () > 2)
 		{
 		  /* When we have three equal sized groups left the pairwise
@@ -4050,11 +4051,10 @@ vect_lower_load_permutations (loop_vec_info loop_vinfo,
     = DR_GROUP_FIRST_ELEMENT (SLP_TREE_SCALAR_STMTS (loads[0])[0]);
 
   /* Only a power-of-two number of lanes matches interleaving with N levels.
-     The non-SLP path also supports DR_GROUP_SIZE == 3.
      ???  An even number of lanes could be reduced to 1<<ceil_log2(N)-1 lanes
      at each step.  */
   unsigned group_lanes = DR_GROUP_SIZE (first);
-  if (exact_log2 (group_lanes) == -1)
+  if (exact_log2 (group_lanes) == -1 && group_lanes != 3)
     return;
 
   for (slp_tree load : loads)
@@ -4071,7 +4071,7 @@ vect_lower_load_permutations (loop_vec_info loop_vinfo,
 	 with a non-1:1 load permutation around instead of canonicalizing
 	 those into a load and a permute node.  Removing this early
 	 check would do such canonicalization.  */
-      if (SLP_TREE_LANES (load) >= group_lanes / 2)
+      if (SLP_TREE_LANES (load) >= (group_lanes + 1) / 2)
 	continue;
 
       /* First build (and possibly re-use) a load node for the
@@ -4107,7 +4107,7 @@ vect_lower_load_permutations (loop_vec_info loop_vinfo,
       while (1)
 	{
 	  unsigned group_lanes = SLP_TREE_LANES (l0);
-	  if (SLP_TREE_LANES (load) >= group_lanes / 2)
+	  if (SLP_TREE_LANES (load) >= (group_lanes + 1) / 2)
 	    break;
 
 	  /* Try to lower by reducing the group to half its size using an
@@ -4117,19 +4117,24 @@ vect_lower_load_permutations (loop_vec_info loop_vinfo,
 	     Thus { e, e, o, o, e, e, o, o } woud be an even/odd decomposition
 	     with N == 2.  */
 	  /* ???  Only an even number of lanes can be handed this way, but the
-	     fallback below could work for any number.  */
-	  gcc_assert ((group_lanes & 1) == 0);
-	  unsigned even = (1 << ceil_log2 (group_lanes)) - 1;
-	  unsigned odd = even;
-	  for (auto l : final_perm)
+	     fallback below could work for any number.  We have to make sure
+	     to round up in that case.  */
+	  gcc_assert ((group_lanes & 1) == 0 || group_lanes == 3);
+	  unsigned even = 0, odd = 0;
+	  if ((group_lanes & 1) == 0)
 	    {
-	      even &= ~l.second;
-	      odd &= l.second;
+	      even = (1 << ceil_log2 (group_lanes)) - 1;
+	      odd = even;
+	      for (auto l : final_perm)
+		{
+		  even &= ~l.second;
+		  odd &= l.second;
+		}
 	    }
 
 	  /* Now build an even or odd extraction from the unpermuted load.  */
 	  lane_permutation_t perm;
-	  perm.create (group_lanes / 2);
+	  perm.create ((group_lanes + 1) / 2);
 	  unsigned level;
 	  if (even
 	      && ((level = 1 << ctz_hwi (even)), true)
@@ -4164,7 +4169,7 @@ vect_lower_load_permutations (loop_vec_info loop_vinfo,
 	      bitmap_iterator bi;
 	      EXECUTE_IF_SET_IN_BITMAP (l, 0, i, bi)
 		  perm.quick_push (std::make_pair (0, i));
-	      while (perm.length () < group_lanes / 2)
+	      while (perm.length () < (group_lanes + 1) / 2)
 		perm.quick_push (perm.last ());
 	    }
 
@@ -4200,7 +4205,7 @@ vect_lower_load_permutations (loop_vec_info loop_vinfo,
 	     have a "local" CSE map here.  */
 	  SLP_TREE_SCALAR_STMTS (p) = perm_stmts;
 
-	  /* We now have a node for group_lanes / 2 lanes.  */
+	  /* We now have a node for (group_lanes + 1) / 2 lanes.  */
 	  l0 = p;
 	}
 

From patchwork Tue Jul  9 11:38:27 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Richard Biener <rguenther@suse.de>
X-Patchwork-Id: 1958343
Return-Path: <gcc-patches-bounces~incoming=patchwork.ozlabs.org@gcc.gnu.org>
X-Original-To: incoming@patchwork.ozlabs.org
Delivered-To: patchwork-incoming@legolas.ozlabs.org
Authentication-Results: legolas.ozlabs.org;
	dkim=pass (1024-bit key;
 unprotected) header.d=suse.de header.i=@suse.de header.a=rsa-sha256
 header.s=susede2_rsa header.b=J1aFl4JO;
	dkim=pass header.d=suse.de header.i=@suse.de header.a=ed25519-sha256
 header.s=susede2_ed25519 header.b=NiBNqynE;
	dkim=pass (1024-bit key) header.d=suse.de header.i=@suse.de
 header.a=rsa-sha256 header.s=susede2_rsa header.b=J1aFl4JO;
	dkim=neutral header.d=suse.de header.i=@suse.de header.a=ed25519-sha256
 header.s=susede2_ed25519 header.b=NiBNqynE;
	dkim-atps=neutral
Authentication-Results: legolas.ozlabs.org;
 spf=pass (sender SPF authorized) smtp.mailfrom=gcc.gnu.org
 (client-ip=2620:52:3:1:0:246e:9693:128c; helo=server2.sourceware.org;
 envelope-from=gcc-patches-bounces~incoming=patchwork.ozlabs.org@gcc.gnu.org;
 receiver=patchwork.ozlabs.org)
Received: from server2.sourceware.org (server2.sourceware.org
 [IPv6:2620:52:3:1:0:246e:9693:128c])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
	 key-exchange X25519 server-signature ECDSA (secp384r1) server-digest SHA384)
	(No client certificate requested)
	by legolas.ozlabs.org (Postfix) with ESMTPS id 4WJJst000kz1xr9
	for <incoming@patchwork.ozlabs.org>; Tue,  9 Jul 2024 21:38:57 +1000 (AEST)
Received: from server2.sourceware.org (localhost [IPv6:::1])
	by sourceware.org (Postfix) with ESMTP id 3D7AC386C5B3
	for <incoming@patchwork.ozlabs.org>; Tue,  9 Jul 2024 11:38:56 +0000 (GMT)
X-Original-To: gcc-patches@gcc.gnu.org
Delivered-To: gcc-patches@gcc.gnu.org
Received: from smtp-out2.suse.de (smtp-out2.suse.de
 [IPv6:2a07:de40:b251:101:10:150:64:2])
 by sourceware.org (Postfix) with ESMTPS id 3CB113858420
 for <gcc-patches@gcc.gnu.org>; Tue,  9 Jul 2024 11:38:28 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org 3CB113858420
Authentication-Results: sourceware.org;
 dmarc=pass (p=none dis=none) header.from=suse.de
Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=suse.de
ARC-Filter: OpenARC Filter v1.0.0 sourceware.org 3CB113858420
Authentication-Results: server2.sourceware.org;
 arc=none smtp.remote-ip=2a07:de40:b251:101:10:150:64:2
ARC-Seal: i=1; a=rsa-sha256; d=sourceware.org; s=key; t=1720525111; cv=none;
 b=cHAH/qAxgmb7AVEYBGcyUJwRPFuY0Vy0VzQI1b1DA6KhvnaWZcFmllzAAWE/kTGKoquldS+I/SCFuAZSnNseLo56tevrkU8fUcqOXOlMp5jSJYs3NoVgj+wG5PYEKH7SqGEQB2xNWXUJhZi/t/z4JxAxYfIE4Lpapg/NDnXD4Xo=
ARC-Message-Signature: i=1; a=rsa-sha256; d=sourceware.org; s=key;
 t=1720525111; c=relaxed/simple;
 bh=DGsFtc5yl1hiLhxrxPC754cbHTgh9LFrRYRPyTC04Ys=;
 h=DKIM-Signature:DKIM-Signature:DKIM-Signature:DKIM-Signature:Date:
 From:To:Subject:MIME-Version;
 b=Zn2Uf9Y4Ql6yGLRB6X/KQ9l5Jq9cEneVd7ov8CzVRouZJae0a62/b6hOivUH/m9Ik4A51KKHa/EXNBTTd2rm0WKzWjfhnKupScpQ9bIKIuaiQjkErRE2wJTr0lx9DjaxVm7ulisIux9214t8yIjTiZbXsD0q+LUW8siAoQYRzZU=
ARC-Authentication-Results: i=1; server2.sourceware.org
Received: from murzim.nue2.suse.org (unknown [10.168.4.243])
 (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
 key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest
 SHA256)
 (No client certificate requested)
 by smtp-out2.suse.de (Postfix) with ESMTPS id 367A51F7B7;
 Tue,  9 Jul 2024 11:38:27 +0000 (UTC)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=suse.de;
 s=susede2_rsa;
 t=1720525107;
 h=from:from:reply-to:date:date:to:to:cc:cc:mime-version:mime-version:
 content-type:content-type; bh=MVn3p8vvqld1TxxoUvXphk7tvWEXbofuQgRpcGeUYlI=;
 b=J1aFl4JO56mt9kn+/Vq6yHssNpXZwL+tjVhlwu0OXNLhZChE2kpetoBmm/0I5WFefjw9mD
 HCO0ElXdkRvY1+9KAqJnOpHPIBjghY3F0NzGODxrq7Jvbes5ctmGqw/Z5M6b+F6VxzvVLs
 tQGFRziQtoGJD29w0JIRnRBtpMRKHmI=
DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed; d=suse.de;
 s=susede2_ed25519; t=1720525107;
 h=from:from:reply-to:date:date:to:to:cc:cc:mime-version:mime-version:
 content-type:content-type; bh=MVn3p8vvqld1TxxoUvXphk7tvWEXbofuQgRpcGeUYlI=;
 b=NiBNqynEo5x5qXSOl6/MIa0pZnmfTo7J30BkMpTL6viZr/5iv5D1Mtac42vRE6kOGVP/lI
 AFhu40koIb9oQzBg==
Authentication-Results: smtp-out2.suse.de;
	none
Date: Tue, 9 Jul 2024 13:38:27 +0200 (CEST)
From: Richard Biener <rguenther@suse.de>
To: gcc-patches@gcc.gnu.org
cc: richard.sandiford@arm.com
Subject: [PATCH 3/3] RISC-V: load and store-lanes with SLP
MIME-Version: 1.0
X-Spamd-Result: default: False [1.48 / 50.00]; BAYES_HAM(-3.00)[100.00%];
 MISSING_MID(2.50)[]; NEURAL_SPAM_LONG(2.27)[0.648];
 NEURAL_HAM_SHORT(-0.19)[-0.960]; MIME_GOOD(-0.10)[text/plain];
 MISSING_XM_UA(0.00)[]; RCVD_COUNT_ZERO(0.00)[0];
 ARC_NA(0.00)[]; RCPT_COUNT_TWO(0.00)[2]; FROM_HAS_DN(0.00)[];
 DKIM_SIGNED(0.00)[suse.de:s=susede2_rsa,suse.de:s=susede2_ed25519];
 FROM_EQ_ENVFROM(0.00)[]; MIME_TRACE(0.00)[0:+];
 TO_DN_NONE(0.00)[]; FUZZY_BLOCKED(0.00)[rspamd.com];
 TO_MATCH_ENVRCPT_ALL(0.00)[];
 DBL_BLOCKED_OPENRESOLVER(0.00)[tree-vect-slp.cc:url]
X-Spam-Score: 1.48
X-Spam-Status: No, score=-10.5 required=5.0 tests=BAYES_00, DKIM_SIGNED,
 DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, GIT_PATCH_0, MISSING_MID,
 RCVD_IN_DNSWL_NONE, SPF_HELO_NONE, SPF_PASS,
 TXREP autolearn=ham autolearn_force=no version=3.4.6
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on
 server2.sourceware.org
X-BeenThere: gcc-patches@gcc.gnu.org
X-Mailman-Version: 2.1.30
Precedence: list
List-Id: Gcc-patches mailing list <gcc-patches.gcc.gnu.org>
List-Unsubscribe: <https://gcc.gnu.org/mailman/options/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=unsubscribe>
List-Archive: <https://gcc.gnu.org/pipermail/gcc-patches/>
List-Post: <mailto:gcc-patches@gcc.gnu.org>
List-Help: <mailto:gcc-patches-request@gcc.gnu.org?subject=help>
List-Subscribe: <https://gcc.gnu.org/mailman/listinfo/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=subscribe>
Errors-To: gcc-patches-bounces~incoming=patchwork.ozlabs.org@gcc.gnu.org
Message-Id: <20240709113856.3D7AC386C5B3@sourceware.org>

As promised, this is the rework of SLP load/store-lane support with
a simpler representation.  It builds on top of the load-permute
lowering series, which I've squashed to two patches (already tested
separately in the CI yesterday).  The load/store-permute work hasn't
seen much testing; I hope the CI will spot obvious bugs.

Comments on the design are welcome.  The main issue might be that
we're deciding on load/store-lane too(?) early, but it mostly
matches what we do right now where we cancel SLP.

Richard.


The following is a prototype for how to represent load/store-lanes
within SLP.  For now I've settled on having a single load node
with multiple permute nodes acting as selections, one for each loaded
lane, and a single store node fed from all stored lanes.  For

  for (int i = 0; i < 1024; ++i)
    {
      a[2*i] = b[2*i] + 7;
      a[2*i+1] = b[2*i+1] * 3;
    }

you have the following SLP graph where I explain how things are set
up and code-generated:

t.c:23:21: note:   SLP graph after lowering permutations:
t.c:23:21: note:   node 0x50dc8b0 (max_nunits=1, refcnt=1) vector(4) int
t.c:23:21: note:   op template: *_6 = _7;
t.c:23:21: note:        stmt 0 *_6 = _7;
t.c:23:21: note:        stmt 1 *_12 = _13;
t.c:23:21: note:        children 0x50dc488 0x50dc6e8

This is the store node; it is marked with ldst_lanes = true during
SLP discovery.  This node code-generates

  vect_array.65[0] = vect__7.61_29;
  vect_array.65[1] = vect__13.62_28;
  MEM <int[8]> [(int *)vectp_a.63_27] = .STORE_LANES (vect_array.65);

...
t.c:23:21: note:   node 0x50dc520 (max_nunits=4, refcnt=2) vector(4) int
t.c:23:21: note:   op: VEC_PERM_EXPR
t.c:23:21: note:        stmt 0 _5 = *_4;
t.c:23:21: note:        lane permutation { 0[0] }
t.c:23:21: note:        children 0x50dc948
t.c:23:21: note:   node 0x50dc780 (max_nunits=4, refcnt=2) vector(4) int
t.c:23:21: note:   op: VEC_PERM_EXPR
t.c:23:21: note:        stmt 0 _11 = *_10;
t.c:23:21: note:        lane permutation { 0[1] }
t.c:23:21: note:        children 0x50dc948

These are the selection nodes, marked with ldst_lanes = true.
They generate no code.
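
Although the selection nodes emit no statements, the
vectorizable_slp_permutation_1 hunk below has them forward the
matching vector defs of the load node.  A minimal sketch of that index
computation, assuming plain ints stand in for the vector defs
(forward_lane_defs and vec_def are hypothetical names, not part of
the patch):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a vector def; the patch forwards trees.  */
typedef int vec_def;

/* LOAD_DEFS is laid out like { v0, w0, r0, v1, w1, r1, ... } for a
   three-lane load.  Copy to OUT the N_OUT defs belonging to the
   selection whose first selected lane is FIRST_LANE.  */
static void
forward_lane_defs (const vec_def *load_defs, size_t load_lanes,
		   size_t sel_lanes, size_t first_lane,
		   size_t n_out, vec_def *out)
{
  assert (sel_lanes != 0 && load_lanes % sel_lanes == 0);
  size_t vec_idx = first_lane / sel_lanes;
  size_t vec_num = load_lanes / sel_lanes;
  assert (vec_idx < vec_num);
  for (size_t i = 0; i < n_out; ++i)
    out[i] = load_defs[i * vec_num + vec_idx];
}
```

For the two single-lane selections above, load_lanes is 2 and
sel_lanes is 1, so lane permutation { 0[1] } picks every second def
starting at index 1.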

t.c:23:21: note:   node 0x50dc948 (max_nunits=4, refcnt=3) vector(4) int
t.c:23:21: note:   op template: _5 = *_4;
t.c:23:21: note:        stmt 0 _5 = *_4;
t.c:23:21: note:        stmt 1 _11 = *_10;
t.c:23:21: note:        load permutation { 0 1 }

This is the load node, marked with ldst_lanes = true (the load
permutation is only accurate when taking into account the lane permute
in the selection nodes).  It code generates

  vect_array.58 = .LOAD_LANES (MEM <int[8]> [(int *)vectp_b.56_33]);
  vect__5.59_31 = vect_array.58[0];
  vect__5.60_30 = vect_array.58[1];

This scheme allows leaving code generation in vectorizable_load/store
mostly as-is.
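
For readers unfamiliar with the lane operations, here is a scalar
model of what the generated code above computes for one vector
iteration of the example loop.  It is a sketch only: LANES == 2 and
NUNITS == 4 are taken from the dump above, and load_lanes,
store_lanes and baz_one_vector_iteration are illustrative helpers,
not GCC functions.

```c
#include <assert.h>
#include <stddef.h>

enum { LANES = 2, NUNITS = 4 };

/* Model of .LOAD_LANES: de-interleave LANES * NUNITS contiguous
   elements into LANES vectors of NUNITS elements each.  */
static void
load_lanes (const int *mem, int vec_array[LANES][NUNITS])
{
  for (size_t l = 0; l < LANES; ++l)
    for (size_t e = 0; e < NUNITS; ++e)
      vec_array[l][e] = mem[e * LANES + l];
}

/* Model of .STORE_LANES: interleave LANES vectors back to memory.  */
static void
store_lanes (int vec_array[LANES][NUNITS], int *mem)
{
  for (size_t l = 0; l < LANES; ++l)
    for (size_t e = 0; e < NUNITS; ++e)
      mem[e * LANES + l] = vec_array[l][e];
}

/* One vector iteration of
     a[2*i] = b[2*i] + 7;  a[2*i+1] = b[2*i+1] * 3;  */
static void
baz_one_vector_iteration (int *a, const int *b)
{
  assert (LANES == 2);	/* The per-lane operations below assume this.  */
  int v[LANES][NUNITS];
  load_lanes (b, v);
  for (size_t e = 0; e < NUNITS; ++e)
    {
      v[0][e] += 7;	/* Lane 0, selected by { 0[0] }.  */
      v[1][e] *= 3;	/* Lane 1, selected by { 0[1] }.  */
    }
  store_lanes (v, a);
}
```

Each row of v corresponds to one element of vect_array in the dump.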

With this I've removed the code in vect_analyze_loop_2 that cancels
SLP, as it will no longer fire.

	* tree-vectorizer.h (_slp_tree::ldst_lanes): New flag to mark
	load, store and permute nodes.
	* tree-vect-slp.cc (_slp_tree::_slp_tree): Initialize ldst_lanes.
	(vect_build_slp_instance): For stores iff the target prefers
	store-lanes discover single-lane sub-groups, do not perform
	interleaving lowering but mark the node with ldst_lanes.
	(vect_lower_load_permutations): When the target supports
	load lanes and the loads all fit the pattern split out
	a single level of permutes only and mark the load and
	permute nodes with ldst_lanes.
	(vectorizable_slp_permutation_1): Handle the load-lane permute
	forwarding of vector defs.
	* tree-vect-stmts.cc (get_group_load_store_type): Support
	load/store-lanes for SLP.
	(vectorizable_store): Support SLP code generation for store-lanes.
	(vectorizable_load): Support SLP code generation for load-lanes.
	* tree-vect-loop.cc (vect_analyze_loop_2): Do not cancel SLP
	when store-lanes can be used.

	* gcc.dg/vect/slp-55.c: New testcase.
---
 gcc/testsuite/gcc.dg/vect/slp-55.c |  37 ++++++++++
 gcc/tree-vect-loop.cc              |  76 --------------------
 gcc/tree-vect-slp.cc               | 107 +++++++++++++++++++++++++++--
 gcc/tree-vect-stmts.cc             |  81 +++++++++++++++-------
 gcc/tree-vectorizer.h              |   3 +
 5 files changed, 196 insertions(+), 108 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/vect/slp-55.c

diff --git a/gcc/testsuite/gcc.dg/vect/slp-55.c b/gcc/testsuite/gcc.dg/vect/slp-55.c
new file mode 100644
index 00000000000..0bf65ef6dc4
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/slp-55.c
@@ -0,0 +1,37 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target vect_int } */
+/* { dg-require-effective-target vect_int_mult } */
+/* { dg-additional-options "-fdump-tree-optimized" } */
+
+void foo (int * __restrict a, int *b, int *c)
+{
+  for (int i = 0; i < 1024; ++i)
+    {
+      a[2*i] = b[i] + 7;
+      a[2*i+1] = c[i] * 3;
+    }
+}
+
+int bar (int *b)
+{
+  int res = 0;
+  for (int i = 0; i < 1024; ++i)
+    {
+      res += b[2*i] + 7;
+      res += b[2*i+1] * 3;
+    }
+  return res;
+}
+
+void baz (int * __restrict a, int *b)
+{
+  for (int i = 0; i < 1024; ++i)
+    {
+      a[2*i] = b[2*i] + 7;
+      a[2*i+1] = b[2*i+1] * 3;
+    }
+}
+
+/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 3 "vect" } } */
+/* { dg-final { scan-tree-dump-times "LOAD_LANES" 2 "optimized" { target vect_load_lanes } } } */
+/* { dg-final { scan-tree-dump-times "STORE_LANES" 2 "optimized" { target vect_load_lanes } } } */
diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index a64b5082bd1..0d48c4980ce 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -2957,82 +2957,6 @@ start_over:
 				       "unsupported SLP instances\n");
 	  goto again;
 	}
-
-      /* Check whether any load in ALL SLP instances is possibly permuted.  */
-      slp_tree load_node, slp_root;
-      unsigned i, x;
-      slp_instance instance;
-      bool can_use_lanes = true;
-      FOR_EACH_VEC_ELT (LOOP_VINFO_SLP_INSTANCES (loop_vinfo), x, instance)
-	{
-	  slp_root = SLP_INSTANCE_TREE (instance);
-	  int group_size = SLP_TREE_LANES (slp_root);
-	  tree vectype = SLP_TREE_VECTYPE (slp_root);
-	  bool loads_permuted = false;
-	  FOR_EACH_VEC_ELT (SLP_INSTANCE_LOADS (instance), i, load_node)
-	    {
-	      if (!SLP_TREE_LOAD_PERMUTATION (load_node).exists ())
-		continue;
-	      unsigned j;
-	      stmt_vec_info load_info;
-	      FOR_EACH_VEC_ELT (SLP_TREE_SCALAR_STMTS (load_node), j, load_info)
-		if (SLP_TREE_LOAD_PERMUTATION (load_node)[j] != j)
-		  {
-		    loads_permuted = true;
-		    break;
-		  }
-	    }
-
-	  /* If the loads and stores can be handled with load/store-lane
-	     instructions record it and move on to the next instance.  */
-	  if (loads_permuted
-	      && SLP_INSTANCE_KIND (instance) == slp_inst_kind_store
-	      && vect_store_lanes_supported (vectype, group_size, false)
-		   != IFN_LAST)
-	    {
-	      FOR_EACH_VEC_ELT (SLP_INSTANCE_LOADS (instance), i, load_node)
-		if (STMT_VINFO_GROUPED_ACCESS
-		      (SLP_TREE_REPRESENTATIVE (load_node)))
-		  {
-		    stmt_vec_info stmt_vinfo = DR_GROUP_FIRST_ELEMENT
-			(SLP_TREE_REPRESENTATIVE (load_node));
-		    /* Use SLP for strided accesses (or if we can't
-		       load-lanes).  */
-		    if (STMT_VINFO_STRIDED_P (stmt_vinfo)
-			|| vect_load_lanes_supported
-			     (STMT_VINFO_VECTYPE (stmt_vinfo),
-			      DR_GROUP_SIZE (stmt_vinfo), false) == IFN_LAST)
-		      break;
-		  }
-
-	      can_use_lanes
-		= can_use_lanes && i == SLP_INSTANCE_LOADS (instance).length ();
-
-	      if (can_use_lanes && dump_enabled_p ())
-		dump_printf_loc (MSG_NOTE, vect_location,
-				 "SLP instance %p can use load/store-lanes\n",
-				 (void *) instance);
-	    }
-	  else
-	    {
-	      can_use_lanes = false;
-	      break;
-	    }
-	}
-
-      /* If all SLP instances can use load/store-lanes abort SLP and try again
-	 with SLP disabled.  */
-      if (can_use_lanes)
-	{
-	  ok = opt_result::failure_at (vect_location,
-				       "Built SLP cancelled: can use "
-				       "load/store-lanes\n");
-	  if (dump_enabled_p ())
-	    dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
-			     "Built SLP cancelled: all SLP instances support "
-			     "load/store-lanes\n");
-	  goto again;
-	}
     }
 
   /* Dissolve SLP-only groups.  */
diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
index 2dc6d365303..17d3c59a3d8 100644
--- a/gcc/tree-vect-slp.cc
+++ b/gcc/tree-vect-slp.cc
@@ -120,6 +120,7 @@ _slp_tree::_slp_tree ()
   SLP_TREE_SIMD_CLONE_INFO (this) = vNULL;
   SLP_TREE_DEF_TYPE (this) = vect_uninitialized_def;
   SLP_TREE_CODE (this) = ERROR_MARK;
+  this->ldst_lanes = false;
   SLP_TREE_VECTYPE (this) = NULL_TREE;
   SLP_TREE_REPRESENTATIVE (this) = NULL;
   SLP_TREE_REF_COUNT (this) = 1;
@@ -3600,10 +3601,24 @@ vect_build_slp_instance (vec_info *vinfo,
       /* For loop vectorization split the RHS into arbitrary pieces of
 	 size >= 1.  */
       else if (is_a <loop_vec_info> (vinfo)
-	       && (i > 0 && i < group_size)
-	       && !vect_slp_prefer_store_lanes_p (vinfo,
-						  stmt_info, group_size, i))
-	{
+	       && (i > 0 && i < group_size))
+	{
+	  /* There are targets that cannot do even/odd interleaving schemes
+	     so they absolutely need to use load/store-lanes.  For now
+	     force single-lane SLP for them - they would be happy with
+	     uniform power-of-two lanes (but depending on element size),
+	     but even if we can use 'i' as indicator we would need to
+	     backtrack when later lanes fail to discover with the same
+	     granularity.  We cannot turn any of .MASK_STORE or
+	     scatter store into store-lanes.  */
+	  bool want_store_lanes
+	    = (! is_a <gcall *> (stmt_info->stmt)
+	       && ! STMT_VINFO_GATHER_SCATTER_P (stmt_info)
+	       && vect_slp_prefer_store_lanes_p (vinfo, stmt_info,
+						 group_size, 1));
+	  if (want_store_lanes)
+	    i = 1;
+
 	  if (dump_enabled_p ())
 	    dump_printf_loc (MSG_NOTE, vect_location,
 			     "Splitting SLP group at stmt %u\n", i);
@@ -3637,7 +3652,10 @@ vect_build_slp_instance (vec_info *vinfo,
 					       (max_nunits, end - start));
 		  rhs_nodes.safe_push (node);
 		  start = end;
-		  end = group_size;
+		  if (want_store_lanes)
+		    end = start + 1;
+		  else
+		    end = group_size;
 		}
 	      else
 		{
@@ -3676,6 +3694,18 @@ vect_build_slp_instance (vec_info *vinfo,
 					   SLP_TREE_CHILDREN
 					     (rhs_nodes[0]).length ());
 	  SLP_TREE_VECTYPE (node) = SLP_TREE_VECTYPE (rhs_nodes[0]);
+	  if (want_store_lanes)
+	    {
+	      /* For store-lanes feed the store node with all RHS nodes
+		 in order.  We cannot handle .MASK_STORE here.  */
+	      gcc_assert (SLP_TREE_CHILDREN (rhs_nodes[0]).length () == 1);
+	      node->ldst_lanes = 1;
+	      SLP_TREE_CHILDREN (node).reserve_exact (rhs_nodes.length ());
+	      for (unsigned j = 0; j < rhs_nodes.length (); ++j)
+		SLP_TREE_CHILDREN (node)
+		  .quick_push (SLP_TREE_CHILDREN (rhs_nodes[j])[0]);
+	    }
+	  else
 	  for (unsigned l = 0;
 	       l < SLP_TREE_CHILDREN (rhs_nodes[0]).length (); ++l)
 	    {
@@ -4057,6 +4087,42 @@ vect_lower_load_permutations (loop_vec_info loop_vinfo,
   if (exact_log2 (group_lanes) == -1 && group_lanes != 3)
     return;
 
+  /* Verify if all load permutations can be implemented with a suitably
+     large element load-lanes operation.  */
+  unsigned ld_lanes_lanes = SLP_TREE_LANES (loads[0]);
+  if (exact_log2 (ld_lanes_lanes) == -1
+      /* ???  For now only support the single-lane case as there is
+	 missing support on the store-lane side and code generation
+	 isn't up to the task yet.  */
+      || ld_lanes_lanes != 1
+      || vect_load_lanes_supported (SLP_TREE_VECTYPE (loads[0]),
+				    group_lanes / ld_lanes_lanes,
+				    false) == IFN_LAST)
+    ld_lanes_lanes = 0;
+  else
+    /* Verify the loads access the same number of lanes aligned to
+       ld_lanes_lanes.  */
+    for (slp_tree load : loads)
+      {
+	if (SLP_TREE_LANES (load) != ld_lanes_lanes)
+	  {
+	    ld_lanes_lanes = 0;
+	    break;
+	  }
+	unsigned first = SLP_TREE_LOAD_PERMUTATION (load)[0];
+	if (first % ld_lanes_lanes != 0)
+	  {
+	    ld_lanes_lanes = 0;
+	    break;
+	  }
+	for (unsigned i = 1; i < SLP_TREE_LANES (load); ++i)
+	  if (SLP_TREE_LOAD_PERMUTATION (load)[i] != first + i)
+	    {
+	      ld_lanes_lanes = 0;
+	      break;
+	    }
+      }
+
   for (slp_tree load : loads)
     {
       /* Leave masked or gather loads alone for now.  */
@@ -4071,7 +4137,8 @@ vect_lower_load_permutations (loop_vec_info loop_vinfo,
 	 with a non-1:1 load permutation around instead of canonicalizing
 	 those into a load and a permute node.  Removing this early
 	 check would do such canonicalization.  */
-      if (SLP_TREE_LANES (load) >= (group_lanes + 1) / 2)
+      if (SLP_TREE_LANES (load) >= (group_lanes + 1) / 2
+	  && ld_lanes_lanes == 0)
 	continue;
 
       /* First build (and possibly re-use) a load node for the
@@ -4104,6 +4171,12 @@ vect_lower_load_permutations (loop_vec_info loop_vinfo,
 	final_perm.quick_push
 	  (std::make_pair (0, SLP_TREE_LOAD_PERMUTATION (load)[i]));
 
+      if (ld_lanes_lanes != 0)
+	{
+	  l0->ldst_lanes = true;
+	  load->ldst_lanes = true;
+	}
+      else
       while (1)
 	{
 	  unsigned group_lanes = SLP_TREE_LANES (l0);
@@ -9758,6 +9831,28 @@ vectorizable_slp_permutation_1 (vec_info *vinfo, gimple_stmt_iterator *gsi,
 
   gcc_assert (perm.length () == SLP_TREE_LANES (node));
 
+  /* Load-lanes permute.  This permute only acts as a forwarder to
+     select the correct vector def of the load-lanes load which
+     has the permuted vectors in its vector defs like
+     { v0, w0, r0, v1, w1, r1 ... } for a ld3.  */
+  if (node->ldst_lanes)
+    {
+      gcc_assert (children.length () == 1);
+      if (!gsi)
+	/* This is a trivial op always supported.  */
+	return 1;
+      slp_tree child = children[0];
+      unsigned vec_idx = (SLP_TREE_LANE_PERMUTATION (node)[0].second
+			  / SLP_TREE_LANES (node));
+      unsigned vec_num = SLP_TREE_LANES (child) / SLP_TREE_LANES (node);
+      for (unsigned i = 0; i < SLP_TREE_NUMBER_OF_VEC_STMTS (node); ++i)
+	{
+	  tree def = SLP_TREE_VEC_DEFS (child)[i * vec_num  + vec_idx];
+	  node->push_vec_def (def);
+	}
+      return 1;
+    }
+
   /* REPEATING_P is true if every output vector is guaranteed to use the
      same permute vector.  We can handle that case for both variable-length
      and constant-length vectors, but we only handle other cases for
diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
index fdcda0d2aba..330091f08b3 100644
--- a/gcc/tree-vect-stmts.cc
+++ b/gcc/tree-vect-stmts.cc
@@ -1508,7 +1508,8 @@ check_load_store_for_partial_vectors (loop_vec_info loop_vinfo, tree vectype,
 
   unsigned int nvectors;
   if (slp_node)
-    nvectors = SLP_TREE_NUMBER_OF_VEC_STMTS (slp_node);
+    /* ???  Incorrect for multi-lane lanes.  */
+    nvectors = SLP_TREE_NUMBER_OF_VEC_STMTS (slp_node) / group_size;
   else
     nvectors = vect_get_num_copies (loop_vinfo, vectype);
 
@@ -2069,6 +2070,14 @@ get_group_load_store_type (vec_info *vinfo, stmt_vec_info stmt_info,
 		 is irrelevant for them.  */
 	      *alignment_support_scheme = dr_unaligned_supported;
 	    }
+	  /* Try using LOAD/STORE_LANES.  */
+	  else if (slp_node->ldst_lanes
+		   && (*lanes_ifn
+			 = (vls_type == VLS_LOAD
+			    ? vect_load_lanes_supported (vectype, group_size, masked_p)
+			    : vect_store_lanes_supported (vectype, group_size,
+							  masked_p))) != IFN_LAST)
+	    *memory_access_type = VMAT_LOAD_STORE_LANES;
 	  else
 	    *memory_access_type = VMAT_CONTIGUOUS;
 
@@ -8705,7 +8714,7 @@ vectorizable_store (vec_info *vinfo,
   else
     {
       if (memory_access_type == VMAT_LOAD_STORE_LANES)
-	aggr_type = build_array_type_nelts (elem_type, vec_num * nunits);
+	aggr_type = build_array_type_nelts (elem_type, group_size * nunits);
       else
 	aggr_type = vectype;
       bump = vect_get_data_ptr_increment (vinfo, gsi, dr_info, aggr_type,
@@ -8762,11 +8771,12 @@ vectorizable_store (vec_info *vinfo,
 
   if (memory_access_type == VMAT_LOAD_STORE_LANES)
     {
-      gcc_assert (!slp && grouped_store);
       unsigned inside_cost = 0, prologue_cost = 0;
       /* For costing some adjacent vector stores, we'd like to cost with
 	 the total number of them once instead of cost each one by one. */
       unsigned int n_adjacent_stores = 0;
+      if (slp)
+	ncopies = SLP_TREE_NUMBER_OF_VEC_STMTS (slp_node) / group_size;
       for (j = 0; j < ncopies; j++)
 	{
 	  gimple *new_stmt;
@@ -8784,7 +8794,7 @@ vectorizable_store (vec_info *vinfo,
 		  op = vect_get_store_rhs (next_stmt_info);
 		  if (costing_p)
 		    update_prologue_cost (&prologue_cost, op);
-		  else
+		  else if (!slp)
 		    {
 		      vect_get_vec_defs_for_operand (vinfo, next_stmt_info,
 						     ncopies, op,
@@ -8799,15 +8809,15 @@ vectorizable_store (vec_info *vinfo,
 		{
 		  if (mask)
 		    {
-		      vect_get_vec_defs_for_operand (vinfo, stmt_info, ncopies,
-						     mask, &vec_masks,
-						     mask_vectype);
+		      if (slp_node)
+			vect_get_slp_defs (mask_node, &vec_masks);
+		      else
+			vect_get_vec_defs_for_operand (vinfo, stmt_info, ncopies,
+						       mask, &vec_masks,
+						       mask_vectype);
 		      vec_mask = vec_masks[0];
 		    }
 
-		  /* We should have catched mismatched types earlier.  */
-		  gcc_assert (
-		    useless_type_conversion_p (vectype, TREE_TYPE (vec_oprnd)));
 		  dataref_ptr
 		    = vect_create_data_ref_ptr (vinfo, first_stmt_info,
 						aggr_type, NULL, offset, &dummy,
@@ -8819,10 +8829,16 @@ vectorizable_store (vec_info *vinfo,
 	      gcc_assert (!LOOP_VINFO_USING_SELECT_VL_P (loop_vinfo));
 	      /* DR_CHAIN is then used as an input to
 		 vect_permute_store_chain().  */
-	      for (i = 0; i < group_size; i++)
+	      if (!slp)
 		{
-		  vec_oprnd = (*gvec_oprnds[i])[j];
-		  dr_chain[i] = vec_oprnd;
+		  /* We should have catched mismatched types earlier.  */
+		  gcc_assert (
+		    useless_type_conversion_p (vectype, TREE_TYPE (vec_oprnd)));
+		  for (i = 0; i < group_size; i++)
+		    {
+		      vec_oprnd = (*gvec_oprnds[i])[j];
+		      dr_chain[i] = vec_oprnd;
+		    }
 		}
 	      if (mask)
 		vec_mask = vec_masks[j];
@@ -8832,12 +8848,12 @@ vectorizable_store (vec_info *vinfo,
 
 	  if (costing_p)
 	    {
-	      n_adjacent_stores += vec_num;
+	      n_adjacent_stores += group_size;
 	      continue;
 	    }
 
 	  /* Get an array into which we can store the individual vectors.  */
-	  tree vec_array = create_vector_array (vectype, vec_num);
+	  tree vec_array = create_vector_array (vectype, group_size);
 
 	  /* Invalidate the current contents of VEC_ARRAY.  This should
 	     become an RTL clobber too, which prevents the vector registers
@@ -8845,9 +8861,13 @@ vectorizable_store (vec_info *vinfo,
 	  vect_clobber_variable (vinfo, stmt_info, gsi, vec_array);
 
 	  /* Store the individual vectors into the array.  */
-	  for (i = 0; i < vec_num; i++)
+	  for (i = 0; i < group_size; i++)
 	    {
-	      vec_oprnd = dr_chain[i];
+	      if (slp)
+		vec_oprnd
+		  = SLP_TREE_VEC_DEFS (SLP_TREE_CHILDREN (slp_node)[i])[j];
+	      else
+		vec_oprnd = dr_chain[i];
 	      write_vector_array (vinfo, stmt_info, gsi, vec_oprnd, vec_array,
 				  i);
 	    }
@@ -8917,9 +8937,10 @@ vectorizable_store (vec_info *vinfo,
 
 	  /* Record that VEC_ARRAY is now dead.  */
 	  vect_clobber_variable (vinfo, stmt_info, gsi, vec_array);
-	  if (j == 0)
+	  if (j == 0 && !slp)
 	    *vec_stmt = new_stmt;
-	  STMT_VINFO_VEC_STMTS (stmt_info).safe_push (new_stmt);
+	  if (!slp)
+	    STMT_VINFO_VEC_STMTS (stmt_info).safe_push (new_stmt);
 	}
 
       if (costing_p)
@@ -10765,12 +10786,13 @@ vectorizable_load (vec_info *vinfo,
     {
       gcc_assert (alignment_support_scheme == dr_aligned
 		  || alignment_support_scheme == dr_unaligned_supported);
-      gcc_assert (grouped_load && !slp);
 
       unsigned int inside_cost = 0, prologue_cost = 0;
       /* For costing some adjacent vector loads, we'd like to cost with
 	 the total number of them once instead of cost each one by one. */
       unsigned int n_adjacent_loads = 0;
+      if (slp_node)
+	ncopies = slp_node->vec_stmts_size / vec_num;
       for (j = 0; j < ncopies; j++)
 	{
 	  if (costing_p)
@@ -10884,24 +10906,31 @@ vectorizable_load (vec_info *vinfo,
 	  gimple_call_set_nothrow (call, true);
 	  vect_finish_stmt_generation (vinfo, stmt_info, call, gsi);
 
-	  dr_chain.create (vec_num);
+	  if (!slp)
+	    dr_chain.create (vec_num);
 	  /* Extract each vector into an SSA_NAME.  */
 	  for (i = 0; i < vec_num; i++)
 	    {
 	      new_temp = read_vector_array (vinfo, stmt_info, gsi, scalar_dest,
 					    vec_array, i);
-	      dr_chain.quick_push (new_temp);
+	      if (slp)
+		slp_node->push_vec_def (new_temp);
+	      else
+		dr_chain.quick_push (new_temp);
 	    }
 
-	  /* Record the mapping between SSA_NAMEs and statements.  */
-	  vect_record_grouped_load_vectors (vinfo, stmt_info, dr_chain);
+	  if (!slp)
+	    /* Record the mapping between SSA_NAMEs and statements.  */
+	    vect_record_grouped_load_vectors (vinfo, stmt_info, dr_chain);
 
 	  /* Record that VEC_ARRAY is now dead.  */
 	  vect_clobber_variable (vinfo, stmt_info, gsi, vec_array);
 
-	  dr_chain.release ();
+	  if (!slp)
+	    dr_chain.release ();
 
-	  *vec_stmt = STMT_VINFO_VEC_STMTS (stmt_info)[0];
+	  if (!slp_node)
+	    *vec_stmt = STMT_VINFO_VEC_STMTS (stmt_info)[0];
 	}
 
       if (costing_p)
diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index 8eb3ec4df86..ac288541c51 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -222,6 +222,9 @@ struct _slp_tree {
   unsigned int lanes;
   /* The operation of this node.  */
   enum tree_code code;
+  /* Whether uses of this load or feeders of this store are suitable
+     for load/store-lanes.  */
+  bool ldst_lanes;
 
   int vertex;