From patchwork Fri Aug 30 13:39:18 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Richard Biener <rguenther@suse.de>
X-Patchwork-Id: 1979013
Return-Path: <gcc-patches-bounces~incoming=patchwork.ozlabs.org@gcc.gnu.org>
X-Original-To: incoming@patchwork.ozlabs.org
Delivered-To: patchwork-incoming@legolas.ozlabs.org
Authentication-Results: legolas.ozlabs.org;
	dkim=pass (1024-bit key;
 unprotected) header.d=suse.de header.i=@suse.de header.a=rsa-sha256
 header.s=susede2_rsa header.b=SYr7BcCJ;
	dkim=pass header.d=suse.de header.i=@suse.de header.a=ed25519-sha256
 header.s=susede2_ed25519 header.b=tJw3cZdi;
	dkim=pass (1024-bit key) header.d=suse.de header.i=@suse.de
 header.a=rsa-sha256 header.s=susede2_rsa header.b=SYr7BcCJ;
	dkim=neutral header.d=suse.de header.i=@suse.de header.a=ed25519-sha256
 header.s=susede2_ed25519 header.b=tJw3cZdi;
	dkim-atps=neutral
Authentication-Results: legolas.ozlabs.org;
 spf=pass (sender SPF authorized) smtp.mailfrom=gcc.gnu.org
 (client-ip=2620:52:3:1:0:246e:9693:128c; helo=server2.sourceware.org;
 envelope-from=gcc-patches-bounces~incoming=patchwork.ozlabs.org@gcc.gnu.org;
 receiver=patchwork.ozlabs.org)
Received: from server2.sourceware.org (server2.sourceware.org
 [IPv6:2620:52:3:1:0:246e:9693:128c])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
	 key-exchange X25519 server-signature ECDSA (secp384r1) server-digest SHA384)
	(No client certificate requested)
	by legolas.ozlabs.org (Postfix) with ESMTPS id 4WwK5w3Jx9z1yfn
	for <incoming@patchwork.ozlabs.org>; Fri, 30 Aug 2024 23:40:20 +1000 (AEST)
Received: from server2.sourceware.org (localhost [IPv6:::1])
	by sourceware.org (Postfix) with ESMTP id 38BBA386076B
	for <incoming@patchwork.ozlabs.org>; Fri, 30 Aug 2024 13:40:18 +0000 (GMT)
X-Original-To: gcc-patches@gcc.gnu.org
Delivered-To: gcc-patches@gcc.gnu.org
Received: from smtp-out2.suse.de (smtp-out2.suse.de [195.135.223.131])
 by sourceware.org (Postfix) with ESMTPS id 84A71385DDD1
 for <gcc-patches@gcc.gnu.org>; Fri, 30 Aug 2024 13:39:24 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org 84A71385DDD1
Authentication-Results: sourceware.org;
 dmarc=pass (p=none dis=none) header.from=suse.de
Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=suse.de
ARC-Filter: OpenARC Filter v1.0.0 sourceware.org 84A71385DDD1
Authentication-Results: server2.sourceware.org;
 arc=none smtp.remote-ip=195.135.223.131
ARC-Seal: i=1; a=rsa-sha256; d=sourceware.org; s=key; t=1725025168; cv=none;
 b=o+V1lfobUBDbthJQmeDIdTMhF8NkdXYt6CQa0eBj+ONIf03arGpCvalxdPJocxJAF4jUgR64lCc2cZYfL4jznodOdQy9ozkV5MKk51hswILuMDMpXW8bYBjxEMJ9Rz8pKY7v5KFjkPd5mNU2/0iAxtVESFqd3SjyOEab7dn8JrQ=
ARC-Message-Signature: i=1; a=rsa-sha256; d=sourceware.org; s=key;
 t=1725025168; c=relaxed/simple;
 bh=7o8sIcdRWdKJ9qWRchcSdeGL0Ra5dGHLl58CtZTSILg=;
 h=DKIM-Signature:DKIM-Signature:DKIM-Signature:DKIM-Signature:Date:
 From:To:Subject:MIME-Version:Message-Id;
 b=T2dNuKrFdwG3DQiUPYewlmCCPBuOyJj8S77bea3CUvCT00yStzh3Plf3lPbVTwHTFij1v1xMoq6fxncRLkCJAiMmbatt0lxtj3HyGTlU6hV/d1hGSC8yWy2gqE2HviaoKnzwlsE+y5WxUqR04lwGp4H27JNPeU8OB/epMCi1p+s=
ARC-Authentication-Results: i=1; server2.sourceware.org
Received: from imap1.dmz-prg2.suse.org (unknown [10.150.64.97])
 (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
 key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest
 SHA256)
 (No client certificate requested)
 by smtp-out2.suse.de (Postfix) with ESMTPS id 5948F1F7D4
 for <gcc-patches@gcc.gnu.org>; Fri, 30 Aug 2024 13:39:23 +0000 (UTC)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=suse.de;
 s=susede2_rsa;
 t=1725025163; h=from:from:reply-to:date:date:message-id:message-id:to:to:cc:
 mime-version:mime-version:content-type:content-type;
 bh=6JIuAmGeRZUaT6I2pSB5eEa2cQdx8TmZh5qNegDqu5M=;
 b=SYr7BcCJvZ+k7cxDs2YyHIPj/w3CIK8yBKK0u8sNdsSRWl70MtqBzudUHxfsDPEHtuuhh2
 3+CZwinonDR81DnUdGZup/nE7N0Cm6HeNWtsVEXWDL9BalX/IFq8y1+99Pj/hoSJtAGWze
 JiX8Hfnhl/eBmnol54MUcG/7PyDt6Bg=
DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed; d=suse.de;
 s=susede2_ed25519; t=1725025163;
 h=from:from:reply-to:date:date:message-id:message-id:to:to:cc:
 mime-version:mime-version:content-type:content-type;
 bh=6JIuAmGeRZUaT6I2pSB5eEa2cQdx8TmZh5qNegDqu5M=;
 b=tJw3cZdibAfGrSfRmxU3BUw18O/YHoz7LcxOZJaSWVZMi5S6rQU7FSmIhaN9y6oOaRexwJ
 xCJcUoY/YMROtvBg==
Authentication-Results: smtp-out2.suse.de;
	none
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=suse.de;
 s=susede2_rsa;
 t=1725025163; h=from:from:reply-to:date:date:message-id:message-id:to:to:cc:
 mime-version:mime-version:content-type:content-type;
 bh=6JIuAmGeRZUaT6I2pSB5eEa2cQdx8TmZh5qNegDqu5M=;
 b=SYr7BcCJvZ+k7cxDs2YyHIPj/w3CIK8yBKK0u8sNdsSRWl70MtqBzudUHxfsDPEHtuuhh2
 3+CZwinonDR81DnUdGZup/nE7N0Cm6HeNWtsVEXWDL9BalX/IFq8y1+99Pj/hoSJtAGWze
 JiX8Hfnhl/eBmnol54MUcG/7PyDt6Bg=
DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed; d=suse.de;
 s=susede2_ed25519; t=1725025163;
 h=from:from:reply-to:date:date:message-id:message-id:to:to:cc:
 mime-version:mime-version:content-type:content-type;
 bh=6JIuAmGeRZUaT6I2pSB5eEa2cQdx8TmZh5qNegDqu5M=;
 b=tJw3cZdibAfGrSfRmxU3BUw18O/YHoz7LcxOZJaSWVZMi5S6rQU7FSmIhaN9y6oOaRexwJ
 xCJcUoY/YMROtvBg==
Received: from imap1.dmz-prg2.suse.org (localhost [127.0.0.1])
 (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
 key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest
 SHA256)
 (No client certificate requested)
 by imap1.dmz-prg2.suse.org (Postfix) with ESMTPS id 3FD2F13A44
 for <gcc-patches@gcc.gnu.org>; Fri, 30 Aug 2024 13:39:23 +0000 (UTC)
Received: from dovecot-director2.suse.de ([2a07:de40:b281:106:10:150:64:167])
 by imap1.dmz-prg2.suse.org with ESMTPSA id dBn1DYvL0WbuWQAAD6G6ig
 (envelope-from <rguenther@suse.de>)
 for <gcc-patches@gcc.gnu.org>; Fri, 30 Aug 2024 13:39:23 +0000
Date: Fri, 30 Aug 2024 15:39:18 +0200 (CEST)
From: Richard Biener <rguenther@suse.de>
To: gcc-patches@gcc.gnu.org
Subject: [PATCH 1/3] lower SLP load permutation to interleaving
MIME-Version: 1.0
Message-Id: <20240830133923.3FD2F13A44@imap1.dmz-prg2.suse.org>
X-Spam-Level: 
X-Spamd-Result: default: False [-4.30 / 50.00]; BAYES_HAM(-3.00)[100.00%];
 NEURAL_HAM_LONG(-1.00)[-1.000];
 NEURAL_HAM_SHORT(-0.20)[-1.000]; MIME_GOOD(-0.10)[text/plain];
 RCVD_VIA_SMTP_AUTH(0.00)[]; ARC_NA(0.00)[];
 RCPT_COUNT_ONE(0.00)[1]; MIME_TRACE(0.00)[0:+];
 MISSING_XM_UA(0.00)[]; RCVD_TLS_ALL(0.00)[];
 DKIM_SIGNED(0.00)[suse.de:s=susede2_rsa,suse.de:s=susede2_ed25519];
 PREVIOUSLY_DELIVERED(0.00)[gcc-patches@gcc.gnu.org];
 FROM_EQ_ENVFROM(0.00)[]; FROM_HAS_DN(0.00)[];
 FUZZY_BLOCKED(0.00)[rspamd.com]; RCVD_COUNT_TWO(0.00)[2];
 TO_MATCH_ENVRCPT_ALL(0.00)[]; TO_DN_NONE(0.00)[];
 DBL_BLOCKED_OPENRESOLVER(0.00)[imap1.dmz-prg2.suse.org:helo,
 imap1.dmz-prg2.suse.org:mid, tree-vect-slp.cc:url]
X-Spam-Score: -4.30
X-Spam-Status: No, score=-11.7 required=5.0 tests=BAYES_00, DKIM_SIGNED,
 DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, GIT_PATCH_0, SPF_HELO_NONE,
 SPF_PASS,
 TXREP, T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on
 server2.sourceware.org
X-BeenThere: gcc-patches@gcc.gnu.org
X-Mailman-Version: 2.1.30
Precedence: list
List-Id: Gcc-patches mailing list <gcc-patches.gcc.gnu.org>
List-Unsubscribe: <https://gcc.gnu.org/mailman/options/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=unsubscribe>
List-Archive: <https://gcc.gnu.org/pipermail/gcc-patches/>
List-Post: <mailto:gcc-patches@gcc.gnu.org>
List-Help: <mailto:gcc-patches-request@gcc.gnu.org?subject=help>
List-Subscribe: <https://gcc.gnu.org/mailman/listinfo/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=subscribe>
Errors-To: gcc-patches-bounces~incoming=patchwork.ozlabs.org@gcc.gnu.org

The following emulates classical interleaving for SLP load permutes
that we are unlikely handling natively.  This is to handle cases
where interleaving (or load/store-lanes) is the optimal choice for
vectorizing even when we are doing that within SLP.  An example
would be

void foo (int * __restrict a, int * b)
{
  for (int i = 0; i < 16; ++i)
    {
      a[4*i + 0] = b[4*i + 0] * 3;
      a[4*i + 1] = b[4*i + 1] + 3;
      a[4*i + 2] = (b[4*i + 2] * 3 + 3);
      a[4*i + 3] = b[4*i + 3] * 3;
    }
}

where currently the SLP store is merging four single-lane SLP
sub-graphs but none of the loads in it can be code-generated
with V4SImode vectors and a VF of four as the permutes would need
three vectors.

The patch introduces a lowering phase after SLP discovery but
before SLP pattern recognition or permute optimization that
analyzes all loads from the same dataref group and creates an
interleaving scheme starting from an unpermuted load.

What can be handled is power-of-two group size and a group size of
three.  The possibility for doing the interleaving with a load-lanes
like instruction is done as followup.

For a group-size of three this is done by using
the non-interleaving fallback code which then creates at VF == 4 from
{ { a0, b0, c0 }, { a1, b1, c1 }, { a2, b2, c2 }, { a3, b3, c3 } }
the intermediate vectors { c0, c0, c1, c1 } and { c2, c2, c3, c3 }
to produce { c0, c1, c2, c3 }.  This turns out to be more effective
than the scheme implemented for non-SLP for SSE and only slightly
worse for AVX512 and a bit more worse for AVX2.  It seems to me that
this would extend to other non-power-of-two group-sizes though (but
the patch does not).  Optimal schemes are likely difficult to lay out
in VF agnostic form.

I'll note that while the lowering assumes even/odd extract is
generally available for all vector element sizes (which is probably
a good assumption), it doesn't in any way constrain the other
permutes it generates based on target availability.  Again difficult
to do in a VF agnostic way (but at least currently the vector type
is fixed).

I'll also note that the SLP store side merges lanes in a way
producing three-vector permutes for store group-size of three, so
the testcase uses a store group-size of four.

The patch has a fallback for when there are multi-lane groups
and the resulting permutes to not fit interleaving.  Code
generation is not optimal when this triggers and might be
worse than doing single-lane group interleaving.

The patch handles gaps by representing them with NULL
entries in SLP_TREE_SCALAR_STMTS for the unpermuted load node.
The SLP discovery changes could be elided if we manually build the
load node instead.

SLP load nodes covering enough lanes to not need intermediate
permutes are retained as having a load-permutation and do not
use the single SLP load node for each dataref group.  That's
something we might want to change, making load-permutation
something purely local to SLP discovery (but then SLP discovery
could do part of the lowering).

The patch misses CSEing intermediate generated permutes and
registering them with the bst_map which is possibly required
for SLP pattern detection in some cases - this re-spin of the
patch moves the lowering after SLP pattern detection.

	* tree-vect-slp.cc (vect_build_slp_tree_1): Handle NULL stmt.
	(vect_build_slp_tree_2): Likewise.  Release load permutation
	when there's a NULL in SLP_TREE_SCALAR_STMTS and assert there's
	no actual permutation in that case.
	(vllp_cmp): New function.
	(vect_lower_load_permutations): Likewise.
	(vect_analyze_slp): Call it.

	* gcc.dg/vect/slp-11a.c: Expect SLP.
	* gcc.dg/vect/slp-12a.c: Likewise.
	* gcc.dg/vect/slp-51.c: New testcase.
	* gcc.dg/vect/slp-52.c: New testcase.
---
 gcc/testsuite/gcc.dg/vect/slp-11a.c |   2 +-
 gcc/testsuite/gcc.dg/vect/slp-12a.c |   2 +-
 gcc/testsuite/gcc.dg/vect/slp-51.c  |  17 ++
 gcc/testsuite/gcc.dg/vect/slp-52.c  |  14 ++
 gcc/tree-vect-slp.cc                | 347 +++++++++++++++++++++++++++-
 5 files changed, 378 insertions(+), 4 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/vect/slp-51.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/slp-52.c

diff --git a/gcc/testsuite/gcc.dg/vect/slp-11a.c b/gcc/testsuite/gcc.dg/vect/slp-11a.c
index fcb7cf6c7a2..2efa1796757 100644
--- a/gcc/testsuite/gcc.dg/vect/slp-11a.c
+++ b/gcc/testsuite/gcc.dg/vect/slp-11a.c
@@ -72,4 +72,4 @@ int main (void)
 
 /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { target { vect_strided8 && vect_int_mult } } } } */
 /* { dg-final { scan-tree-dump-times "vectorized 0 loops" 1 "vect" { target { ! { vect_strided8 && vect_int_mult } } } } } */
-/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 0 "vect" } } */
+/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" } } */
diff --git a/gcc/testsuite/gcc.dg/vect/slp-12a.c b/gcc/testsuite/gcc.dg/vect/slp-12a.c
index 2f98dc9da0b..fedf27b69d2 100644
--- a/gcc/testsuite/gcc.dg/vect/slp-12a.c
+++ b/gcc/testsuite/gcc.dg/vect/slp-12a.c
@@ -80,5 +80,5 @@ int main (void)
 
 /* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" { target { vect_strided8 && vect_int_mult } } } } */
 /* { dg-final { scan-tree-dump-times "vectorized 0 loops" 1 "vect" { target { ! { vect_strided8 && vect_int_mult } } } } } */
-/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 0 "vect" { target { { vect_strided8 && {! vect_load_lanes } } && vect_int_mult } } } } */
+/* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 1 "vect" { target { { vect_strided8 && {! vect_load_lanes } } && vect_int_mult } } } } */
 /* { dg-final { scan-tree-dump-times "vectorizing stmts using SLP" 0 "vect" { target { ! { vect_strided8 && vect_int_mult } } } } } */
diff --git a/gcc/testsuite/gcc.dg/vect/slp-51.c b/gcc/testsuite/gcc.dg/vect/slp-51.c
new file mode 100644
index 00000000000..91ae763be30
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/slp-51.c
@@ -0,0 +1,17 @@
+/* { dg-do compile } */
+
+void foo (int * __restrict x, int *y)
+{
+  x = __builtin_assume_aligned (x, __BIGGEST_ALIGNMENT__);
+  y = __builtin_assume_aligned (y, __BIGGEST_ALIGNMENT__);
+  for (int i = 0; i < 1024; ++i)
+    {
+      x[4*i+0] = y[4*i+0];
+      x[4*i+1] = y[4*i+2] * 2;
+      x[4*i+2] = y[4*i+0] + 3;
+      x[4*i+3] = y[4*i+2] * 2 - 5;
+    }
+}
+
+/* Check we can handle SLP with gaps and an interleaving scheme.  */
+/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" { target { vect_int && vect_int_mult } } } } */
diff --git a/gcc/testsuite/gcc.dg/vect/slp-52.c b/gcc/testsuite/gcc.dg/vect/slp-52.c
new file mode 100644
index 00000000000..ba49f0046e2
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/slp-52.c
@@ -0,0 +1,14 @@
+/* { dg-do compile } */
+
+void foo (int * __restrict x, int *y)
+{
+  for (int i = 0; i < 1024; ++i)
+    {
+      x[4*i+0] = y[3*i+0];
+      x[4*i+1] = y[3*i+1] * 2;
+      x[4*i+2] = y[3*i+2] + 3;
+      x[4*i+3] = y[3*i+2] * 2 - 5;
+    }
+}
+
+/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" { target { vect_int && vect_int_mult } } } } */
diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
index 621ebf670c7..2304cdac583 100644
--- a/gcc/tree-vect-slp.cc
+++ b/gcc/tree-vect-slp.cc
@@ -1081,10 +1081,15 @@ vect_build_slp_tree_1 (vec_info *vinfo, unsigned char *swap,
   stmt_vec_info stmt_info;
   FOR_EACH_VEC_ELT (stmts, i, stmt_info)
     {
-      gimple *stmt = stmt_info->stmt;
       swap[i] = 0;
       matches[i] = false;
+      if (!stmt_info)
+	{
+	  matches[i] = true;
+	  continue;
+	}
 
+      gimple *stmt = stmt_info->stmt;
       if (dump_enabled_p ())
 	dump_printf_loc (MSG_NOTE, vect_location, "Build SLP for %G", stmt);
 
@@ -1979,10 +1984,16 @@ vect_build_slp_tree_2 (vec_info *vinfo, slp_tree node,
 	  stmt_vec_info first_stmt_info
 	    = DR_GROUP_FIRST_ELEMENT (SLP_TREE_SCALAR_STMTS (node)[0]);
 	  bool any_permute = false;
+	  bool any_null = false;
 	  FOR_EACH_VEC_ELT (SLP_TREE_SCALAR_STMTS (node), j, load_info)
 	    {
 	      int load_place;
-	      if (STMT_VINFO_GROUPED_ACCESS (stmt_info))
+	      if (! load_info)
+		{
+		  load_place = j;
+		  any_null = true;
+		}
+	      else if (STMT_VINFO_GROUPED_ACCESS (stmt_info))
 		load_place = vect_get_place_in_interleaving_chain
 		    (load_info, first_stmt_info);
 	      else
@@ -1991,6 +2002,11 @@ vect_build_slp_tree_2 (vec_info *vinfo, slp_tree node,
 	      any_permute |= load_place != j;
 	      load_permutation.quick_push (load_place);
 	    }
+	  if (any_null)
+	    {
+	      gcc_assert (!any_permute);
+	      load_permutation.release ();
+	    }
 
 	  if (gcall *stmt = dyn_cast <gcall *> (stmt_info->stmt))
 	    {
@@ -4080,6 +4096,316 @@ vect_analyze_slp_instance (vec_info *vinfo,
   return res;
 }
 
+/* qsort comparator ordering SLP load nodes.  */
+
+static int
+vllp_cmp (const void *a_, const void *b_)
+{
+  const slp_tree a = *(const slp_tree *)a_;
+  const slp_tree b = *(const slp_tree *)b_;
+  stmt_vec_info a0 = SLP_TREE_SCALAR_STMTS (a)[0];
+  stmt_vec_info b0 = SLP_TREE_SCALAR_STMTS (b)[0];
+  if (STMT_VINFO_GROUPED_ACCESS (a0)
+      && STMT_VINFO_GROUPED_ACCESS (b0)
+      && DR_GROUP_FIRST_ELEMENT (a0) == DR_GROUP_FIRST_ELEMENT (b0))
+    {
+      /* Same group, order after lanes used.  */
+      if (SLP_TREE_LANES (a) < SLP_TREE_LANES (b))
+	return 1;
+      else if (SLP_TREE_LANES (a) > SLP_TREE_LANES (b))
+	return -1;
+      else
+	{
+	  /* Try to order loads using the same lanes together, breaking
+	     the tie with the lane number that first differs.  */
+	  if (!SLP_TREE_LOAD_PERMUTATION (a).exists ()
+	      && !SLP_TREE_LOAD_PERMUTATION (b).exists ())
+	    return 0;
+	  else if (SLP_TREE_LOAD_PERMUTATION (a).exists ()
+		   && !SLP_TREE_LOAD_PERMUTATION (b).exists ())
+	    return 1;
+	  else if (!SLP_TREE_LOAD_PERMUTATION (a).exists ()
+		   && SLP_TREE_LOAD_PERMUTATION (b).exists ())
+	    return -1;
+	  else
+	    {
+	      for (unsigned i = 0; i < SLP_TREE_LANES (a); ++i)
+		if (SLP_TREE_LOAD_PERMUTATION (a)[i]
+		    != SLP_TREE_LOAD_PERMUTATION (b)[i])
+		  {
+		    /* In-order lane first, that's what the above case for
+		       no permutation does.  */
+		    if (SLP_TREE_LOAD_PERMUTATION (a)[i] == i)
+		      return -1;
+		    else if (SLP_TREE_LOAD_PERMUTATION (b)[i] == i)
+		      return 1;
+		    else if (SLP_TREE_LOAD_PERMUTATION (a)[i]
+			     < SLP_TREE_LOAD_PERMUTATION (b)[i])
+		      return -1;
+		    else
+		      return 1;
+		  }
+	      return 0;
+	    }
+	}
+    }
+  else /* Different groups or non-groups.  */
+    {
+      /* Order groups as their first element to keep them together.  */
+      if (STMT_VINFO_GROUPED_ACCESS (a0))
+	a0 = DR_GROUP_FIRST_ELEMENT (a0);
+      if (STMT_VINFO_GROUPED_ACCESS (b0))
+	b0 = DR_GROUP_FIRST_ELEMENT (b0);
+      if (a0 == b0)
+	return 0;
+      /* Tie using UID.  */
+      else if (gimple_uid (STMT_VINFO_STMT (a0))
+	       < gimple_uid (STMT_VINFO_STMT (b0)))
+	return -1;
+      else
+	{
+	  gcc_assert (gimple_uid (STMT_VINFO_STMT (a0))
+		      != gimple_uid (STMT_VINFO_STMT (b0)));
+	  return 1;
+	}
+    }
+}
+
+/* Process the set of LOADS that are all from the same dataref group.  */
+
+static void
+vect_lower_load_permutations (loop_vec_info loop_vinfo,
+			      scalar_stmts_to_slp_tree_map_t *bst_map,
+			      const array_slice<slp_tree> &loads)
+{
+  /* We at this point want to lower without a fixed VF or vector
+     size in mind which means we cannot actually compute whether we
+     need three or more vectors for a load permutation yet.  So always
+     lower.  */
+  stmt_vec_info first
+    = DR_GROUP_FIRST_ELEMENT (SLP_TREE_SCALAR_STMTS (loads[0])[0]);
+
+  /* Only a power-of-two number of lanes matches interleaving with N levels.
+     ???  An even number of lanes could be reduced to 1<<ceil_log2(N)-1 lanes
+     at each step.  */
+  unsigned group_lanes = DR_GROUP_SIZE (first);
+  if (exact_log2 (group_lanes) == -1 && group_lanes != 3)
+    return;
+
+  for (slp_tree load : loads)
+    {
+      /* Leave masked or gather loads alone for now.  */
+      if (!SLP_TREE_CHILDREN (load).is_empty ())
+	continue;
+
+      /* We want to pattern-match special cases here and keep those
+	 alone.  Candidates are splats and load-lane.  */
+
+      /* We need to lower only loads of less than half of the groups
+	 lanes, including duplicate lanes.  Note this leaves nodes
+	 with a non-1:1 load permutation around instead of canonicalizing
+	 those into a load and a permute node.  Removing this early
+	 check would do such canonicalization.  */
+      if (SLP_TREE_LANES (load) >= (group_lanes + 1) / 2)
+	continue;
+
+      /* First build (and possibly re-use) a load node for the
+	 unpermuted group.  Gaps in the middle and on the end are
+	 represented with NULL stmts.  */
+      vec<stmt_vec_info> stmts;
+      stmts.create (group_lanes);
+      for (stmt_vec_info s = first; s; s = DR_GROUP_NEXT_ELEMENT (s))
+	{
+	  if (s != first)
+	    for (unsigned i = 1; i < DR_GROUP_GAP (s); ++i)
+	      stmts.quick_push (NULL);
+	  stmts.quick_push (s);
+	}
+      for (unsigned i = 0; i < DR_GROUP_GAP (first); ++i)
+	stmts.quick_push (NULL);
+      poly_uint64 max_nunits = 1;
+      bool *matches = XALLOCAVEC (bool, group_lanes);
+      unsigned limit = 1;
+      unsigned tree_size = 0;
+      slp_tree l0 = vect_build_slp_tree (loop_vinfo, stmts,
+					 group_lanes,
+					 &max_nunits, matches, &limit,
+					 &tree_size, bst_map);
+
+      /* Build the permute to get the original load permutation order.  */
+      lane_permutation_t final_perm;
+      final_perm.create (SLP_TREE_LANES (load));
+      for (unsigned i = 0; i < SLP_TREE_LANES (load); ++i)
+	final_perm.quick_push
+	  (std::make_pair (0, SLP_TREE_LOAD_PERMUTATION (load)[i]));
+
+      while (1)
+	{
+	  unsigned group_lanes = SLP_TREE_LANES (l0);
+	  if (SLP_TREE_LANES (load) >= (group_lanes + 1) / 2)
+	    break;
+
+	  /* Try to lower by reducing the group to half its size using an
+	     interleaving scheme.  For this try to compute whether all
+	     elements needed for this load are in even or odd elements of
+	     an even/odd decomposition with N consecutive elements.
+	     Thus { e, e, o, o, e, e, o, o } woud be an even/odd decomposition
+	     with N == 2.  */
+	  /* ???  Only an even number of lanes can be handed this way, but the
+	     fallback below could work for any number.  We have to make sure
+	     to round up in that case.  */
+	  gcc_assert ((group_lanes & 1) == 0 || group_lanes == 3);
+	  unsigned even = 0, odd = 0;
+	  if ((group_lanes & 1) == 0)
+	    {
+	      even = (1 << ceil_log2 (group_lanes)) - 1;
+	      odd = even;
+	      for (auto l : final_perm)
+		{
+		  even &= ~l.second;
+		  odd &= l.second;
+		}
+	    }
+
+	  /* Now build an even or odd extraction from the unpermuted load.  */
+	  lane_permutation_t perm;
+	  perm.create ((group_lanes + 1) / 2);
+	  unsigned level;
+	  if (even
+	      && ((level = 1 << ctz_hwi (even)), true)
+	      && group_lanes % (2 * level) == 0)
+	    {
+	      /* { 0, 1, ... 4, 5 ..., } */
+	      unsigned level = 1 << ctz_hwi (even);
+	      for (unsigned i = 0; i < group_lanes / 2 / level; ++i)
+		for (unsigned j = 0; j < level; ++j)
+		  perm.quick_push (std::make_pair (0, 2 * i * level + j));
+	    }
+	  else if (odd)
+	    {
+	      /* { ..., 2, 3, ... 6, 7 } */
+	      unsigned level = 1 << ctz_hwi (odd);
+	      gcc_assert (group_lanes % (2 * level) == 0);
+	      for (unsigned i = 0; i < group_lanes / 2 / level; ++i)
+		for (unsigned j = 0; j < level; ++j)
+		  perm.quick_push (std::make_pair (0, (2 * i + 1) * level + j));
+	    }
+	  else
+	    {
+	      /* As fallback extract all used lanes and fill to half the
+		 group size by repeating the last element.
+		 ???  This is quite a bad strathegy for re-use - we could
+		 brute force our way to find more optimal filling lanes to
+		 maximize re-use when looking at all loads from the group.  */
+	      auto_bitmap l;
+	      for (auto p : final_perm)
+		bitmap_set_bit (l, p.second);
+	      unsigned i = 0;
+	      bitmap_iterator bi;
+	      EXECUTE_IF_SET_IN_BITMAP (l, 0, i, bi)
+		  perm.quick_push (std::make_pair (0, i));
+	      while (perm.length () < (group_lanes + 1) / 2)
+		perm.quick_push (perm.last ());
+	    }
+
+	  /* Update final_perm with the intermediate permute.  */
+	  for (unsigned i = 0; i < final_perm.length (); ++i)
+	    {
+	      unsigned l = final_perm[i].second;
+	      unsigned j;
+	      for (j = 0; j < perm.length (); ++j)
+		if (perm[j].second == l)
+		  {
+		    final_perm[i].second = j;
+		    break;
+		  }
+	      gcc_assert (j < perm.length ());
+	    }
+
+	  /* And create scalar stmts.  */
+	  vec<stmt_vec_info> perm_stmts;
+	  perm_stmts.create (perm.length ());
+	  for (unsigned i = 0; i < perm.length (); ++i)
+	    perm_stmts.quick_push (SLP_TREE_SCALAR_STMTS (l0)[perm[i].second]);
+
+	  slp_tree p = vect_create_new_slp_node (1, VEC_PERM_EXPR);
+	  SLP_TREE_CHILDREN (p).quick_push (l0);
+	  SLP_TREE_LANE_PERMUTATION (p) = perm;
+	  SLP_TREE_VECTYPE (p) = SLP_TREE_VECTYPE (load);
+	  SLP_TREE_LANES (p) = perm.length ();
+	  SLP_TREE_REPRESENTATIVE (p) = SLP_TREE_REPRESENTATIVE (load);
+	  /* ???  As we have scalar stmts for this intermediate permute we
+	     could CSE it via bst_map but we do not want to pick up
+	     another SLP node with a load permutation.  We instead should
+	     have a "local" CSE map here.  */
+	  SLP_TREE_SCALAR_STMTS (p) = perm_stmts;
+
+	  /* We now have a node for (group_lanes + 1) / 2 lanes.  */
+	  l0 = p;
+	}
+
+      /* And finally from the ordered reduction node create the
+	 permute to shuffle the lanes into the original load-permutation
+	 order.  We replace the original load node with this.  */
+      SLP_TREE_CODE (load) = VEC_PERM_EXPR;
+      SLP_TREE_LOAD_PERMUTATION (load).release ();
+      SLP_TREE_LANE_PERMUTATION (load) = final_perm;
+      SLP_TREE_CHILDREN (load).create (1);
+      SLP_TREE_CHILDREN (load).quick_push (l0);
+    }
+}
+
+/* Transform SLP loads in the SLP graph created by SLP discovery to
+   group loads from the same group and lower load permutations that
+   are unlikely to be supported into a series of permutes.
+   In the degenerate case of having only single-lane SLP instances
+   this should result in a series of permute nodes emulating an
+   interleaving scheme.  */
+
+static void
+vect_lower_load_permutations (loop_vec_info loop_vinfo,
+			      scalar_stmts_to_slp_tree_map_t *bst_map)
+{
+  /* Gather and sort loads across all instances.  */
+  hash_set<slp_tree> visited;
+  auto_vec<slp_tree> loads;
+  for (auto inst : loop_vinfo->slp_instances)
+    vect_gather_slp_loads (loads, SLP_INSTANCE_TREE (inst), visited);
+  if (loads.is_empty ())
+    return;
+  loads.qsort (vllp_cmp);
+
+  /* Now process each dataref group separately.  */
+  unsigned firsti = 0;
+  for (unsigned i = 1; i < loads.length (); ++i)
+    {
+      slp_tree first = loads[firsti];
+      slp_tree next = loads[i];
+      stmt_vec_info a0 = SLP_TREE_SCALAR_STMTS (first)[0];
+      stmt_vec_info b0 = SLP_TREE_SCALAR_STMTS (next)[0];
+      if (STMT_VINFO_GROUPED_ACCESS (a0)
+	  && STMT_VINFO_GROUPED_ACCESS (b0)
+	  && DR_GROUP_FIRST_ELEMENT (a0) == DR_GROUP_FIRST_ELEMENT (b0))
+	continue;
+      /* Just one SLP load of a possible group, leave those alone.  */
+      if (i == firsti + 1)
+	{
+	  firsti = i;
+	  continue;
+	}
+      /* Now we have multiple SLP loads of the same group from
+	 firsti to i - 1.  */
+      vect_lower_load_permutations (loop_vinfo, bst_map,
+				    make_array_slice (&loads[firsti],
+						      i - firsti));
+      firsti = i;
+    }
+  if (firsti < loads.length () - 1)
+    vect_lower_load_permutations (loop_vinfo, bst_map,
+				  make_array_slice (&loads[firsti],
+						    loads.length () - firsti));
+}
+
 /* Check if there are stmts in the loop can be vectorized using SLP.  Build SLP
    trees of packed scalar stmts if SLP is possible.  */
 
@@ -4244,6 +4570,23 @@ vect_analyze_slp (vec_info *vinfo, unsigned max_tree_size)
 	}
     }
 
+  /* When we end up with load permutations that we cannot possibly handle,
+     like those requiring three vector inputs, lower them using interleaving
+     like schemes.  */
+  if (loop_vec_info loop_vinfo = dyn_cast <loop_vec_info> (vinfo))
+    {
+      vect_lower_load_permutations (loop_vinfo, bst_map);
+      if (dump_enabled_p ())
+	{
+	  dump_printf_loc (MSG_NOTE, vect_location,
+			   "SLP graph after lowering permutations:\n");
+	  hash_set<slp_tree> visited;
+	  FOR_EACH_VEC_ELT (LOOP_VINFO_SLP_INSTANCES (vinfo), i, instance)
+	    vect_print_slp_graph (MSG_NOTE, vect_location,
+				  SLP_INSTANCE_TREE (instance), visited);
+	}
+    }
+
   release_scalar_stmts_to_slp_tree_map (bst_map);
 
   if (pattern_found && dump_enabled_p ())