From patchwork Tue Jul  9 11:38:27 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Richard Biener <rguenther@suse.de>
X-Patchwork-Id: 1958343
Return-Path: <gcc-patches-bounces~incoming=patchwork.ozlabs.org@gcc.gnu.org>
X-Original-To: incoming@patchwork.ozlabs.org
Delivered-To: patchwork-incoming@legolas.ozlabs.org
Authentication-Results: legolas.ozlabs.org;
	dkim=pass (1024-bit key;
 unprotected) header.d=suse.de header.i=@suse.de header.a=rsa-sha256
 header.s=susede2_rsa header.b=J1aFl4JO;
	dkim=pass header.d=suse.de header.i=@suse.de header.a=ed25519-sha256
 header.s=susede2_ed25519 header.b=NiBNqynE;
	dkim=pass (1024-bit key) header.d=suse.de header.i=@suse.de
 header.a=rsa-sha256 header.s=susede2_rsa header.b=J1aFl4JO;
	dkim=neutral header.d=suse.de header.i=@suse.de header.a=ed25519-sha256
 header.s=susede2_ed25519 header.b=NiBNqynE;
	dkim-atps=neutral
Authentication-Results: legolas.ozlabs.org;
 spf=pass (sender SPF authorized) smtp.mailfrom=gcc.gnu.org
 (client-ip=2620:52:3:1:0:246e:9693:128c; helo=server2.sourceware.org;
 envelope-from=gcc-patches-bounces~incoming=patchwork.ozlabs.org@gcc.gnu.org;
 receiver=patchwork.ozlabs.org)
Received: from server2.sourceware.org (server2.sourceware.org
 [IPv6:2620:52:3:1:0:246e:9693:128c])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
	 key-exchange X25519 server-signature ECDSA (secp384r1) server-digest SHA384)
	(No client certificate requested)
	by legolas.ozlabs.org (Postfix) with ESMTPS id 4WJJst000kz1xr9
	for <incoming@patchwork.ozlabs.org>; Tue,  9 Jul 2024 21:38:57 +1000 (AEST)
Received: from server2.sourceware.org (localhost [IPv6:::1])
	by sourceware.org (Postfix) with ESMTP id 3D7AC386C5B3
	for <incoming@patchwork.ozlabs.org>; Tue,  9 Jul 2024 11:38:56 +0000 (GMT)
X-Original-To: gcc-patches@gcc.gnu.org
Delivered-To: gcc-patches@gcc.gnu.org
Received: from smtp-out2.suse.de (smtp-out2.suse.de
 [IPv6:2a07:de40:b251:101:10:150:64:2])
 by sourceware.org (Postfix) with ESMTPS id 3CB113858420
 for <gcc-patches@gcc.gnu.org>; Tue,  9 Jul 2024 11:38:28 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org 3CB113858420
Authentication-Results: sourceware.org;
 dmarc=pass (p=none dis=none) header.from=suse.de
Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=suse.de
ARC-Filter: OpenARC Filter v1.0.0 sourceware.org 3CB113858420
Authentication-Results: server2.sourceware.org;
 arc=none smtp.remote-ip=2a07:de40:b251:101:10:150:64:2
ARC-Seal: i=1; a=rsa-sha256; d=sourceware.org; s=key; t=1720525111; cv=none;
 b=cHAH/qAxgmb7AVEYBGcyUJwRPFuY0Vy0VzQI1b1DA6KhvnaWZcFmllzAAWE/kTGKoquldS+I/SCFuAZSnNseLo56tevrkU8fUcqOXOlMp5jSJYs3NoVgj+wG5PYEKH7SqGEQB2xNWXUJhZi/t/z4JxAxYfIE4Lpapg/NDnXD4Xo=
ARC-Message-Signature: i=1; a=rsa-sha256; d=sourceware.org; s=key;
 t=1720525111; c=relaxed/simple;
 bh=DGsFtc5yl1hiLhxrxPC754cbHTgh9LFrRYRPyTC04Ys=;
 h=DKIM-Signature:DKIM-Signature:DKIM-Signature:DKIM-Signature:Date:
 From:To:Subject:MIME-Version;
 b=Zn2Uf9Y4Ql6yGLRB6X/KQ9l5Jq9cEneVd7ov8CzVRouZJae0a62/b6hOivUH/m9Ik4A51KKHa/EXNBTTd2rm0WKzWjfhnKupScpQ9bIKIuaiQjkErRE2wJTr0lx9DjaxVm7ulisIux9214t8yIjTiZbXsD0q+LUW8siAoQYRzZU=
ARC-Authentication-Results: i=1; server2.sourceware.org
Received: from murzim.nue2.suse.org (unknown [10.168.4.243])
 (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
 key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest
 SHA256)
 (No client certificate requested)
 by smtp-out2.suse.de (Postfix) with ESMTPS id 367A51F7B7;
 Tue,  9 Jul 2024 11:38:27 +0000 (UTC)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=suse.de;
 s=susede2_rsa;
 t=1720525107;
 h=from:from:reply-to:date:date:to:to:cc:cc:mime-version:mime-version:
 content-type:content-type; bh=MVn3p8vvqld1TxxoUvXphk7tvWEXbofuQgRpcGeUYlI=;
 b=J1aFl4JO56mt9kn+/Vq6yHssNpXZwL+tjVhlwu0OXNLhZChE2kpetoBmm/0I5WFefjw9mD
 HCO0ElXdkRvY1+9KAqJnOpHPIBjghY3F0NzGODxrq7Jvbes5ctmGqw/Z5M6b+F6VxzvVLs
 tQGFRziQtoGJD29w0JIRnRBtpMRKHmI=
DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed; d=suse.de;
 s=susede2_ed25519; t=1720525107;
 h=from:from:reply-to:date:date:to:to:cc:cc:mime-version:mime-version:
 content-type:content-type; bh=MVn3p8vvqld1TxxoUvXphk7tvWEXbofuQgRpcGeUYlI=;
 b=NiBNqynEo5x5qXSOl6/MIa0pZnmfTo7J30BkMpTL6viZr/5iv5D1Mtac42vRE6kOGVP/lI
 AFhu40koIb9oQzBg==
Authentication-Results: smtp-out2.suse.de;
	none
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=suse.de;
 s=susede2_rsa;
 t=1720525107;
 h=from:from:reply-to:date:date:to:to:cc:cc:mime-version:mime-version:
 content-type:content-type; bh=MVn3p8vvqld1TxxoUvXphk7tvWEXbofuQgRpcGeUYlI=;
 b=J1aFl4JO56mt9kn+/Vq6yHssNpXZwL+tjVhlwu0OXNLhZChE2kpetoBmm/0I5WFefjw9mD
 HCO0ElXdkRvY1+9KAqJnOpHPIBjghY3F0NzGODxrq7Jvbes5ctmGqw/Z5M6b+F6VxzvVLs
 tQGFRziQtoGJD29w0JIRnRBtpMRKHmI=
DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed; d=suse.de;
 s=susede2_ed25519; t=1720525107;
 h=from:from:reply-to:date:date:to:to:cc:cc:mime-version:mime-version:
 content-type:content-type; bh=MVn3p8vvqld1TxxoUvXphk7tvWEXbofuQgRpcGeUYlI=;
 b=NiBNqynEo5x5qXSOl6/MIa0pZnmfTo7J30BkMpTL6viZr/5iv5D1Mtac42vRE6kOGVP/lI
 AFhu40koIb9oQzBg==
Date: Tue, 9 Jul 2024 13:38:27 +0200 (CEST)
From: Richard Biener <rguenther@suse.de>
To: gcc-patches@gcc.gnu.org
cc: richard.sandiford@arm.com
Subject: [PATCH 3/3] RISC-V: load and store-lanes with SLP
MIME-Version: 1.0
X-Spamd-Result: default: False [1.48 / 50.00]; BAYES_HAM(-3.00)[100.00%];
 MISSING_MID(2.50)[]; NEURAL_SPAM_LONG(2.27)[0.648];
 NEURAL_HAM_SHORT(-0.19)[-0.960]; MIME_GOOD(-0.10)[text/plain];
 MISSING_XM_UA(0.00)[]; RCVD_COUNT_ZERO(0.00)[0];
 ARC_NA(0.00)[]; RCPT_COUNT_TWO(0.00)[2]; FROM_HAS_DN(0.00)[];
 DKIM_SIGNED(0.00)[suse.de:s=susede2_rsa,suse.de:s=susede2_ed25519];
 FROM_EQ_ENVFROM(0.00)[]; MIME_TRACE(0.00)[0:+];
 TO_DN_NONE(0.00)[]; FUZZY_BLOCKED(0.00)[rspamd.com];
 TO_MATCH_ENVRCPT_ALL(0.00)[];
 DBL_BLOCKED_OPENRESOLVER(0.00)[tree-vect-slp.cc:url]
X-Spam-Score: 1.48
X-Spam-Status: No, score=-10.5 required=5.0 tests=BAYES_00, DKIM_SIGNED,
 DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, GIT_PATCH_0, MISSING_MID,
 RCVD_IN_DNSWL_NONE, SPF_HELO_NONE, SPF_PASS,
 TXREP autolearn=ham autolearn_force=no version=3.4.6
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on
 server2.sourceware.org
X-BeenThere: gcc-patches@gcc.gnu.org
X-Mailman-Version: 2.1.30
Precedence: list
List-Id: Gcc-patches mailing list <gcc-patches.gcc.gnu.org>
List-Unsubscribe: <https://gcc.gnu.org/mailman/options/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=unsubscribe>
List-Archive: <https://gcc.gnu.org/pipermail/gcc-patches/>
List-Post: <mailto:gcc-patches@gcc.gnu.org>
List-Help: <mailto:gcc-patches-request@gcc.gnu.org?subject=help>
List-Subscribe: <https://gcc.gnu.org/mailman/listinfo/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=subscribe>
Errors-To: gcc-patches-bounces~incoming=patchwork.ozlabs.org@gcc.gnu.org
Message-Id: <20240709113856.3D7AC386C5B3@sourceware.org>

As promised this is the rework of SLP load/store-lane support with
a simpler representation.  It builds ontop the load-permute lowering
series which I've squashed to two patches (already tested separately
in the CI yesterday).  The load/store-permute work hasn't seen much
testing, I hope the CI will spot obvious bugs.

Comments on the design are welcome.  The main issue might be that
we're deciding on load/store-lane too(?) early, but it mostly
matches what we do right now where we cancel SLP.

Richard.


The following is a prototype for how to represent load/store-lanes
within SLP.  I've for now settled with having a single load node
with multiple permute nodes acting as selection, one for each loaded lane
and a single store node fed from all stored lanes.  For

  for (int i = 0; i < 1024; ++i)
    {
      a[2*i] = b[2*i] + 7;
      a[2*i+1] = b[2*i+1] * 3;
    }

you have the following SLP graph where I explain how things are set
up and code-generated:

t.c:23:21: note:   SLP graph after lowering permutations:
t.c:23:21: note:   node 0x50dc8b0 (max_nunits=1, refcnt=1) vector(4) int
t.c:23:21: note:   op template: *_6 = _7;
t.c:23:21: note:        stmt 0 *_6 = _7;
t.c:23:21: note:        stmt 1 *_12 = _13;
t.c:23:21: note:        children 0x50dc488 0x50dc6e8

This is the store node, it's marked with ldst_lanes = true during
SLP discovery.  This node code-generates

  vect_array.65[0] = vect__7.61_29;
  vect_array.65[1] = vect__13.62_28;
  MEM <int[8]> [(int *)vectp_a.63_27] = .STORE_LANES (vect_array.65);

...
t.c:23:21: note:   node 0x50dc520 (max_nunits=4, refcnt=2) vector(4) int
t.c:23:21: note:   op: VEC_PERM_EXPR
t.c:23:21: note:        stmt 0 _5 = *_4;
t.c:23:21: note:        lane permutation { 0[0] }
t.c:23:21: note:        children 0x50dc948
t.c:23:21: note:   node 0x50dc780 (max_nunits=4, refcnt=2) vector(4) int
t.c:23:21: note:   op: VEC_PERM_EXPR
t.c:23:21: note:        stmt 0 _11 = *_10;
t.c:23:21: note:        lane permutation { 0[1] }
t.c:23:21: note:        children 0x50dc948

These are the selection nodes, marked with ldst_lanes = true.
They code generate nothing.

t.c:23:21: note:   node 0x50dc948 (max_nunits=4, refcnt=3) vector(4) int
t.c:23:21: note:   op template: _5 = *_4;
t.c:23:21: note:        stmt 0 _5 = *_4;
t.c:23:21: note:        stmt 1 _11 = *_10;
t.c:23:21: note:        load permutation { 0 1 }

This is the load node, marked with ldst_lanes = true (the load
permutation is only accurate when taking into account the lane permute
in the selection nodes).  It code generates

  vect_array.58 = .LOAD_LANES (MEM <int[8]> [(int *)vectp_b.56_33]);
  vect__5.59_31 = vect_array.58[0];
  vect__5.60_30 = vect_array.58[1];

This scheme allows to leave code generation in vectorizable_load/store
mostly as-is.

With this I've disabled the code scrapping SLP as it will no longer
fire.

	* tree-vectorizer.h (_slp_tree::ldst_lanes): New flag to mark
	load, store and permute nodes.
	* tree-vect-slp.cc (_slp_tree::_slp_tree): Initialize ldst_lanes.
	(vect_build_slp_instance): For stores iff the target prefers
	store-lanes discover single-lane sub-groups, do not perform
	interleaving lowering but mark the node with ldst_lanes.
	(vect_lower_load_permutations): When the target supports
	load lanes and the loads all fit the pattern split out
	a single level of permutes only and mark the load and
	permute nodes with ldst_lanes.
	(vectorizable_slp_permutation_1): Handle the load-lane permute
	forwarding of vector defs.
	* tree-vect-stmts.cc (get_group_load_store_type): Support
	load/store-lanes for SLP.
	(vectorizable_store): Support SLP code generation for store-lanes.
	(vectorizable_load): Support SLP code generation for load-lanes.
	* tree-vect-loop.cc (vect_analyze_loop_2): Do not cancel SLP
	when store-lanes can be used.

	* gcc.dg/vect/slp-55.c: New testcase.
---
 gcc/testsuite/gcc.dg/vect/slp-55.c |  37 ++++++++++
 gcc/tree-vect-loop.cc              |  76 --------------------
 gcc/tree-vect-slp.cc               | 107 +++++++++++++++++++++++++++--
 gcc/tree-vect-stmts.cc             |  81 +++++++++++++++-------
 gcc/tree-vectorizer.h              |   3 +
 5 files changed, 196 insertions(+), 108 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/vect/slp-55.c

diff --git a/gcc/testsuite/gcc.dg/vect/slp-55.c b/gcc/testsuite/gcc.dg/vect/slp-55.c
new file mode 100644
index 00000000000..0bf65ef6dc4
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/slp-55.c
@@ -0,0 +1,37 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target vect_int } */
+/* { dg-require-effective-target vect_int_mult } */
+/* { dg-additional-options "-fdump-tree-optimized" } */
+
+void foo (int * __restrict a, int *b, int *c)
+{
+  for (int i = 0; i < 1024; ++i)
+    {
+      a[2*i] = b[i] + 7;
+      a[2*i+1] = c[i] * 3;
+    }
+}
+
+int bar (int *b)
+{
+  int res = 0;
+  for (int i = 0; i < 1024; ++i)
+    {
+      res += b[2*i] + 7;
+      res += b[2*i+1] * 3;
+    }
+  return res;
+}
+
+void baz (int * __restrict a, int *b)
+{
+  for (int i = 0; i < 1024; ++i)
+    {
+      a[2*i] = b[2*i] + 7;
+      a[2*i+1] = b[2*i+1] * 3;
+    }
+}
+
+/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 3 "vect" } } */
+/* { dg-final { scan-tree-dump-times "LOAD_LANES" 2 "optimized" { target vect_load_lanes } } } */
+/* { dg-final { scan-tree-dump-times "STORE_LANES" 2 "optimized" { target vect_load_lanes } } } */
diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index a64b5082bd1..0d48c4980ce 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -2957,82 +2957,6 @@ start_over:
 				       "unsupported SLP instances\n");
 	  goto again;
 	}
-
-      /* Check whether any load in ALL SLP instances is possibly permuted.  */
-      slp_tree load_node, slp_root;
-      unsigned i, x;
-      slp_instance instance;
-      bool can_use_lanes = true;
-      FOR_EACH_VEC_ELT (LOOP_VINFO_SLP_INSTANCES (loop_vinfo), x, instance)
-	{
-	  slp_root = SLP_INSTANCE_TREE (instance);
-	  int group_size = SLP_TREE_LANES (slp_root);
-	  tree vectype = SLP_TREE_VECTYPE (slp_root);
-	  bool loads_permuted = false;
-	  FOR_EACH_VEC_ELT (SLP_INSTANCE_LOADS (instance), i, load_node)
-	    {
-	      if (!SLP_TREE_LOAD_PERMUTATION (load_node).exists ())
-		continue;
-	      unsigned j;
-	      stmt_vec_info load_info;
-	      FOR_EACH_VEC_ELT (SLP_TREE_SCALAR_STMTS (load_node), j, load_info)
-		if (SLP_TREE_LOAD_PERMUTATION (load_node)[j] != j)
-		  {
-		    loads_permuted = true;
-		    break;
-		  }
-	    }
-
-	  /* If the loads and stores can be handled with load/store-lane
-	     instructions record it and move on to the next instance.  */
-	  if (loads_permuted
-	      && SLP_INSTANCE_KIND (instance) == slp_inst_kind_store
-	      && vect_store_lanes_supported (vectype, group_size, false)
-		   != IFN_LAST)
-	    {
-	      FOR_EACH_VEC_ELT (SLP_INSTANCE_LOADS (instance), i, load_node)
-		if (STMT_VINFO_GROUPED_ACCESS
-		      (SLP_TREE_REPRESENTATIVE (load_node)))
-		  {
-		    stmt_vec_info stmt_vinfo = DR_GROUP_FIRST_ELEMENT
-			(SLP_TREE_REPRESENTATIVE (load_node));
-		    /* Use SLP for strided accesses (or if we can't
-		       load-lanes).  */
-		    if (STMT_VINFO_STRIDED_P (stmt_vinfo)
-			|| vect_load_lanes_supported
-			     (STMT_VINFO_VECTYPE (stmt_vinfo),
-			      DR_GROUP_SIZE (stmt_vinfo), false) == IFN_LAST)
-		      break;
-		  }
-
-	      can_use_lanes
-		= can_use_lanes && i == SLP_INSTANCE_LOADS (instance).length ();
-
-	      if (can_use_lanes && dump_enabled_p ())
-		dump_printf_loc (MSG_NOTE, vect_location,
-				 "SLP instance %p can use load/store-lanes\n",
-				 (void *) instance);
-	    }
-	  else
-	    {
-	      can_use_lanes = false;
-	      break;
-	    }
-	}
-
-      /* If all SLP instances can use load/store-lanes abort SLP and try again
-	 with SLP disabled.  */
-      if (can_use_lanes)
-	{
-	  ok = opt_result::failure_at (vect_location,
-				       "Built SLP cancelled: can use "
-				       "load/store-lanes\n");
-	  if (dump_enabled_p ())
-	    dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
-			     "Built SLP cancelled: all SLP instances support "
-			     "load/store-lanes\n");
-	  goto again;
-	}
     }
 
   /* Dissolve SLP-only groups.  */
diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
index 2dc6d365303..17d3c59a3d8 100644
--- a/gcc/tree-vect-slp.cc
+++ b/gcc/tree-vect-slp.cc
@@ -120,6 +120,7 @@ _slp_tree::_slp_tree ()
   SLP_TREE_SIMD_CLONE_INFO (this) = vNULL;
   SLP_TREE_DEF_TYPE (this) = vect_uninitialized_def;
   SLP_TREE_CODE (this) = ERROR_MARK;
+  this->ldst_lanes = false;
   SLP_TREE_VECTYPE (this) = NULL_TREE;
   SLP_TREE_REPRESENTATIVE (this) = NULL;
   SLP_TREE_REF_COUNT (this) = 1;
@@ -3600,10 +3601,24 @@ vect_build_slp_instance (vec_info *vinfo,
       /* For loop vectorization split the RHS into arbitrary pieces of
 	 size >= 1.  */
       else if (is_a <loop_vec_info> (vinfo)
-	       && (i > 0 && i < group_size)
-	       && !vect_slp_prefer_store_lanes_p (vinfo,
-						  stmt_info, group_size, i))
-	{
+	       && (i > 0 && i < group_size))
+	{
+	  /* There are targets that cannot do even/odd interleaving schemes
+	     so they absolutely need to use load/store-lanes.  For now
+	     force single-lane SLP for them - they would be happy with
+	     uniform power-of-two lanes (but depending on element size),
+	     but even if we can use 'i' as indicator we would need to
+	     backtrack when later lanes fail to discover with the same
+	     granularity.  We cannot turn any of .MASK_STORE or
+	     scatter store into store-lanes.  */
+	  bool want_store_lanes
+	    = (! is_a <gcall *> (stmt_info->stmt)
+	       && ! STMT_VINFO_GATHER_SCATTER_P (stmt_info)
+	       && vect_slp_prefer_store_lanes_p (vinfo, stmt_info,
+						 group_size, 1));
+	  if (want_store_lanes)
+	    i = 1;
+
 	  if (dump_enabled_p ())
 	    dump_printf_loc (MSG_NOTE, vect_location,
 			     "Splitting SLP group at stmt %u\n", i);
@@ -3637,7 +3652,10 @@ vect_build_slp_instance (vec_info *vinfo,
 					       (max_nunits, end - start));
 		  rhs_nodes.safe_push (node);
 		  start = end;
-		  end = group_size;
+		  if (want_store_lanes)
+		    end = start + 1;
+		  else
+		    end = group_size;
 		}
 	      else
 		{
@@ -3676,6 +3694,18 @@ vect_build_slp_instance (vec_info *vinfo,
 					   SLP_TREE_CHILDREN
 					     (rhs_nodes[0]).length ());
 	  SLP_TREE_VECTYPE (node) = SLP_TREE_VECTYPE (rhs_nodes[0]);
+	  if (want_store_lanes)
+	    {
+	      /* For store-lanes feed the store node with all RHS nodes
+		 in order.  We cannot handle .MASK_STORE here.  */
+	      gcc_assert (SLP_TREE_CHILDREN (rhs_nodes[0]).length () == 1);
+	      node->ldst_lanes = 1;
+	      SLP_TREE_CHILDREN (node).reserve_exact (rhs_nodes.length ());
+	      for (unsigned j = 0; j < rhs_nodes.length (); ++j)
+		SLP_TREE_CHILDREN (node)
+		  .quick_push (SLP_TREE_CHILDREN (rhs_nodes[j])[0]);
+	    }
+	  else
 	  for (unsigned l = 0;
 	       l < SLP_TREE_CHILDREN (rhs_nodes[0]).length (); ++l)
 	    {
@@ -4057,6 +4087,42 @@ vect_lower_load_permutations (loop_vec_info loop_vinfo,
   if (exact_log2 (group_lanes) == -1 && group_lanes != 3)
     return;
 
+  /* Verify if all load permutations can be implemented with a suitably
+     large element load-lanes operation.  */
+  unsigned ld_lanes_lanes = SLP_TREE_LANES (loads[0]);
+  if (exact_log2 (ld_lanes_lanes) == -1
+      /* ???  For now only support the single-lane case as there is
+	 missing support on the store-lane side and code generation
+	 isn't up to the task yet.  */
+      || ld_lanes_lanes != 1
+      || vect_load_lanes_supported (SLP_TREE_VECTYPE (loads[0]),
+				    group_lanes / ld_lanes_lanes,
+				    false) == IFN_LAST)
+    ld_lanes_lanes = 0;
+  else
+    /* Verify the loads access the same number of lanes aligned to
+       ld_lanes_lanes.  */
+    for (slp_tree load : loads)
+      {
+	if (SLP_TREE_LANES (load) != ld_lanes_lanes)
+	  {
+	    ld_lanes_lanes = 0;
+	    break;
+	  }
+	unsigned first = SLP_TREE_LOAD_PERMUTATION (load)[0];
+	if (first % ld_lanes_lanes != 0)
+	  {
+	    ld_lanes_lanes = 0;
+	    break;
+	  }
+	for (unsigned i = 1; i < SLP_TREE_LANES (load); ++i)
+	  if (SLP_TREE_LOAD_PERMUTATION (load)[i] != first + i)
+	    {
+	      ld_lanes_lanes = 0;
+	      break;
+	    }
+      }
+
   for (slp_tree load : loads)
     {
       /* Leave masked or gather loads alone for now.  */
@@ -4071,7 +4137,8 @@ vect_lower_load_permutations (loop_vec_info loop_vinfo,
 	 with a non-1:1 load permutation around instead of canonicalizing
 	 those into a load and a permute node.  Removing this early
 	 check would do such canonicalization.  */
-      if (SLP_TREE_LANES (load) >= (group_lanes + 1) / 2)
+      if (SLP_TREE_LANES (load) >= (group_lanes + 1) / 2
+	  && ld_lanes_lanes == 0)
 	continue;
 
       /* First build (and possibly re-use) a load node for the
@@ -4104,6 +4171,12 @@ vect_lower_load_permutations (loop_vec_info loop_vinfo,
 	final_perm.quick_push
 	  (std::make_pair (0, SLP_TREE_LOAD_PERMUTATION (load)[i]));
 
+      if (ld_lanes_lanes != 0)
+	{
+	  l0->ldst_lanes = true;
+	  load->ldst_lanes = true;
+	}
+      else
       while (1)
 	{
 	  unsigned group_lanes = SLP_TREE_LANES (l0);
@@ -9758,6 +9831,28 @@ vectorizable_slp_permutation_1 (vec_info *vinfo, gimple_stmt_iterator *gsi,
 
   gcc_assert (perm.length () == SLP_TREE_LANES (node));
 
+  /* Load-lanes permute.  This permute only acts as a forwarder to
+     select the correct vector def of the load-lanes load which
+     has the permuted vectors in its vector defs like
+     { v0, w0, r0, v1, w1, r1 ... } for a ld3.  */
+  if (node->ldst_lanes)
+    {
+      gcc_assert (children.length () == 1);
+      if (!gsi)
+	/* This is a trivial op always supported.  */
+	return 1;
+      slp_tree child = children[0];
+      unsigned vec_idx = (SLP_TREE_LANE_PERMUTATION (node)[0].second
+			  / SLP_TREE_LANES (node));
+      unsigned vec_num = SLP_TREE_LANES (child) / SLP_TREE_LANES (node);
+      for (unsigned i = 0; i < SLP_TREE_NUMBER_OF_VEC_STMTS (node); ++i)
+	{
+	  tree def = SLP_TREE_VEC_DEFS (child)[i * vec_num  + vec_idx];
+	  node->push_vec_def (def);
+	}
+      return 1;
+    }
+
   /* REPEATING_P is true if every output vector is guaranteed to use the
      same permute vector.  We can handle that case for both variable-length
      and constant-length vectors, but we only handle other cases for
diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
index fdcda0d2aba..330091f08b3 100644
--- a/gcc/tree-vect-stmts.cc
+++ b/gcc/tree-vect-stmts.cc
@@ -1508,7 +1508,8 @@ check_load_store_for_partial_vectors (loop_vec_info loop_vinfo, tree vectype,
 
   unsigned int nvectors;
   if (slp_node)
-    nvectors = SLP_TREE_NUMBER_OF_VEC_STMTS (slp_node);
+    /* ???  Incorrect for multi-lane lanes.  */
+    nvectors = SLP_TREE_NUMBER_OF_VEC_STMTS (slp_node) / group_size;
   else
     nvectors = vect_get_num_copies (loop_vinfo, vectype);
 
@@ -2069,6 +2070,14 @@ get_group_load_store_type (vec_info *vinfo, stmt_vec_info stmt_info,
 		 is irrelevant for them.  */
 	      *alignment_support_scheme = dr_unaligned_supported;
 	    }
+	  /* Try using LOAD/STORE_LANES.  */
+	  else if (slp_node->ldst_lanes
+		   && (*lanes_ifn
+			 = (vls_type == VLS_LOAD
+			    ? vect_load_lanes_supported (vectype, group_size, masked_p)
+			    : vect_store_lanes_supported (vectype, group_size,
+							  masked_p))) != IFN_LAST)
+	    *memory_access_type = VMAT_LOAD_STORE_LANES;
 	  else
 	    *memory_access_type = VMAT_CONTIGUOUS;
 
@@ -8705,7 +8714,7 @@ vectorizable_store (vec_info *vinfo,
   else
     {
       if (memory_access_type == VMAT_LOAD_STORE_LANES)
-	aggr_type = build_array_type_nelts (elem_type, vec_num * nunits);
+	aggr_type = build_array_type_nelts (elem_type, group_size * nunits);
       else
 	aggr_type = vectype;
       bump = vect_get_data_ptr_increment (vinfo, gsi, dr_info, aggr_type,
@@ -8762,11 +8771,12 @@ vectorizable_store (vec_info *vinfo,
 
   if (memory_access_type == VMAT_LOAD_STORE_LANES)
     {
-      gcc_assert (!slp && grouped_store);
       unsigned inside_cost = 0, prologue_cost = 0;
       /* For costing some adjacent vector stores, we'd like to cost with
 	 the total number of them once instead of cost each one by one. */
       unsigned int n_adjacent_stores = 0;
+      if (slp)
+	ncopies = SLP_TREE_NUMBER_OF_VEC_STMTS (slp_node) / group_size;
       for (j = 0; j < ncopies; j++)
 	{
 	  gimple *new_stmt;
@@ -8784,7 +8794,7 @@ vectorizable_store (vec_info *vinfo,
 		  op = vect_get_store_rhs (next_stmt_info);
 		  if (costing_p)
 		    update_prologue_cost (&prologue_cost, op);
-		  else
+		  else if (!slp)
 		    {
 		      vect_get_vec_defs_for_operand (vinfo, next_stmt_info,
 						     ncopies, op,
@@ -8799,15 +8809,15 @@ vectorizable_store (vec_info *vinfo,
 		{
 		  if (mask)
 		    {
-		      vect_get_vec_defs_for_operand (vinfo, stmt_info, ncopies,
-						     mask, &vec_masks,
-						     mask_vectype);
+		      if (slp_node)
+			vect_get_slp_defs (mask_node, &vec_masks);
+		      else
+			vect_get_vec_defs_for_operand (vinfo, stmt_info, ncopies,
+						       mask, &vec_masks,
+						       mask_vectype);
 		      vec_mask = vec_masks[0];
 		    }
 
-		  /* We should have catched mismatched types earlier.  */
-		  gcc_assert (
-		    useless_type_conversion_p (vectype, TREE_TYPE (vec_oprnd)));
 		  dataref_ptr
 		    = vect_create_data_ref_ptr (vinfo, first_stmt_info,
 						aggr_type, NULL, offset, &dummy,
@@ -8819,10 +8829,16 @@ vectorizable_store (vec_info *vinfo,
 	      gcc_assert (!LOOP_VINFO_USING_SELECT_VL_P (loop_vinfo));
 	      /* DR_CHAIN is then used as an input to
 		 vect_permute_store_chain().  */
-	      for (i = 0; i < group_size; i++)
+	      if (!slp)
 		{
-		  vec_oprnd = (*gvec_oprnds[i])[j];
-		  dr_chain[i] = vec_oprnd;
+		  /* We should have catched mismatched types earlier.  */
+		  gcc_assert (
+		    useless_type_conversion_p (vectype, TREE_TYPE (vec_oprnd)));
+		  for (i = 0; i < group_size; i++)
+		    {
+		      vec_oprnd = (*gvec_oprnds[i])[j];
+		      dr_chain[i] = vec_oprnd;
+		    }
 		}
 	      if (mask)
 		vec_mask = vec_masks[j];
@@ -8832,12 +8848,12 @@ vectorizable_store (vec_info *vinfo,
 
 	  if (costing_p)
 	    {
-	      n_adjacent_stores += vec_num;
+	      n_adjacent_stores += group_size;
 	      continue;
 	    }
 
 	  /* Get an array into which we can store the individual vectors.  */
-	  tree vec_array = create_vector_array (vectype, vec_num);
+	  tree vec_array = create_vector_array (vectype, group_size);
 
 	  /* Invalidate the current contents of VEC_ARRAY.  This should
 	     become an RTL clobber too, which prevents the vector registers
@@ -8845,9 +8861,13 @@ vectorizable_store (vec_info *vinfo,
 	  vect_clobber_variable (vinfo, stmt_info, gsi, vec_array);
 
 	  /* Store the individual vectors into the array.  */
-	  for (i = 0; i < vec_num; i++)
+	  for (i = 0; i < group_size; i++)
 	    {
-	      vec_oprnd = dr_chain[i];
+	      if (slp)
+		vec_oprnd
+		  = SLP_TREE_VEC_DEFS (SLP_TREE_CHILDREN (slp_node)[i])[j];
+	      else
+		vec_oprnd = dr_chain[i];
 	      write_vector_array (vinfo, stmt_info, gsi, vec_oprnd, vec_array,
 				  i);
 	    }
@@ -8917,9 +8937,10 @@ vectorizable_store (vec_info *vinfo,
 
 	  /* Record that VEC_ARRAY is now dead.  */
 	  vect_clobber_variable (vinfo, stmt_info, gsi, vec_array);
-	  if (j == 0)
+	  if (j == 0 && !slp)
 	    *vec_stmt = new_stmt;
-	  STMT_VINFO_VEC_STMTS (stmt_info).safe_push (new_stmt);
+	  if (!slp)
+	    STMT_VINFO_VEC_STMTS (stmt_info).safe_push (new_stmt);
 	}
 
       if (costing_p)
@@ -10765,12 +10786,13 @@ vectorizable_load (vec_info *vinfo,
     {
       gcc_assert (alignment_support_scheme == dr_aligned
 		  || alignment_support_scheme == dr_unaligned_supported);
-      gcc_assert (grouped_load && !slp);
 
       unsigned int inside_cost = 0, prologue_cost = 0;
       /* For costing some adjacent vector loads, we'd like to cost with
 	 the total number of them once instead of cost each one by one. */
       unsigned int n_adjacent_loads = 0;
+      if (slp_node)
+	ncopies = slp_node->vec_stmts_size / vec_num;
       for (j = 0; j < ncopies; j++)
 	{
 	  if (costing_p)
@@ -10884,24 +10906,31 @@ vectorizable_load (vec_info *vinfo,
 	  gimple_call_set_nothrow (call, true);
 	  vect_finish_stmt_generation (vinfo, stmt_info, call, gsi);
 
-	  dr_chain.create (vec_num);
+	  if (!slp)
+	    dr_chain.create (vec_num);
 	  /* Extract each vector into an SSA_NAME.  */
 	  for (i = 0; i < vec_num; i++)
 	    {
 	      new_temp = read_vector_array (vinfo, stmt_info, gsi, scalar_dest,
 					    vec_array, i);
-	      dr_chain.quick_push (new_temp);
+	      if (slp)
+		slp_node->push_vec_def (new_temp);
+	      else
+		dr_chain.quick_push (new_temp);
 	    }
 
-	  /* Record the mapping between SSA_NAMEs and statements.  */
-	  vect_record_grouped_load_vectors (vinfo, stmt_info, dr_chain);
+	  if (!slp)
+	    /* Record the mapping between SSA_NAMEs and statements.  */
+	    vect_record_grouped_load_vectors (vinfo, stmt_info, dr_chain);
 
 	  /* Record that VEC_ARRAY is now dead.  */
 	  vect_clobber_variable (vinfo, stmt_info, gsi, vec_array);
 
-	  dr_chain.release ();
+	  if (!slp)
+	    dr_chain.release ();
 
-	  *vec_stmt = STMT_VINFO_VEC_STMTS (stmt_info)[0];
+	  if (!slp_node)
+	    *vec_stmt = STMT_VINFO_VEC_STMTS (stmt_info)[0];
 	}
 
       if (costing_p)
diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index 8eb3ec4df86..ac288541c51 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -222,6 +222,9 @@ struct _slp_tree {
   unsigned int lanes;
   /* The operation of this node.  */
   enum tree_code code;
+  /* Whether uses of this load or feeders of this store are suitable
+     for load/store-lanes.  */
+  bool ldst_lanes;
 
   int vertex;