From patchwork Wed Jul  3 13:24:05 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Richard Biener <rguenther@suse.de>
X-Patchwork-Id: 1956242
Return-Path: <gcc-patches-bounces~incoming=patchwork.ozlabs.org@gcc.gnu.org>
X-Original-To: incoming@patchwork.ozlabs.org
Delivered-To: patchwork-incoming@legolas.ozlabs.org
Authentication-Results: legolas.ozlabs.org;
	dkim=pass (1024-bit key;
 unprotected) header.d=suse.de header.i=@suse.de header.a=rsa-sha256
 header.s=susede2_rsa header.b=tK2MyEvN;
	dkim=pass header.d=suse.de header.i=@suse.de header.a=ed25519-sha256
 header.s=susede2_ed25519 header.b=lNkxWoMf;
	dkim=pass (1024-bit key) header.d=suse.de header.i=@suse.de
 header.a=rsa-sha256 header.s=susede2_rsa header.b=tK2MyEvN;
	dkim=neutral header.d=suse.de header.i=@suse.de header.a=ed25519-sha256
 header.s=susede2_ed25519 header.b=lNkxWoMf;
	dkim-atps=neutral
Authentication-Results: legolas.ozlabs.org;
 spf=pass (sender SPF authorized) smtp.mailfrom=gcc.gnu.org
 (client-ip=2620:52:3:1:0:246e:9693:128c; helo=server2.sourceware.org;
 envelope-from=gcc-patches-bounces~incoming=patchwork.ozlabs.org@gcc.gnu.org;
 receiver=patchwork.ozlabs.org)
Received: from server2.sourceware.org (server2.sourceware.org
 [IPv6:2620:52:3:1:0:246e:9693:128c])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
	 key-exchange X25519 server-signature ECDSA (secp384r1) server-digest SHA384)
	(No client certificate requested)
	by legolas.ozlabs.org (Postfix) with ESMTPS id 4WDgXq016Sz1xpN
	for <incoming@patchwork.ozlabs.org>; Wed,  3 Jul 2024 23:26:34 +1000 (AEST)
Received: from server2.sourceware.org (localhost [IPv6:::1])
	by sourceware.org (Postfix) with ESMTP id 3BEF53860C3D
	for <incoming@patchwork.ozlabs.org>; Wed,  3 Jul 2024 13:26:33 +0000 (GMT)
X-Original-To: gcc-patches@gcc.gnu.org
Delivered-To: gcc-patches@gcc.gnu.org
Received: from smtp-out1.suse.de (smtp-out1.suse.de [195.135.223.130])
 by sourceware.org (Postfix) with ESMTPS id 4811D386481A
 for <gcc-patches@gcc.gnu.org>; Wed,  3 Jul 2024 13:24:06 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org 4811D386481A
Authentication-Results: sourceware.org;
 dmarc=pass (p=none dis=none) header.from=suse.de
Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=suse.de
ARC-Filter: OpenARC Filter v1.0.0 sourceware.org 4811D386481A
Authentication-Results: server2.sourceware.org;
 arc=none smtp.remote-ip=195.135.223.130
ARC-Seal: i=1; a=rsa-sha256; d=sourceware.org; s=key; t=1720013049; cv=none;
 b=GxAuAnhNsk2CyA6twum96/MXPfpKqD9l5I07brP/cES5yttAgrF1zN0BT2eJFnOQgUS8C+Hqx/DxzlWUxEFZwFKGeeVWhh0o9D+BYejRT+LcE+fwR2KdAe3xTHXrk7HMLeBhEgeuvd8H0BlLC5Pk2j4fXbUItEarzl74iHNrc2U=
ARC-Message-Signature: i=1; a=rsa-sha256; d=sourceware.org; s=key;
 t=1720013049; c=relaxed/simple;
 bh=A706mkrVisk7yEJvKPYx4ZZSaVFTN1MUEAINNchYeDI=;
 h=DKIM-Signature:DKIM-Signature:DKIM-Signature:DKIM-Signature:Date:
 From:To:Subject:MIME-Version;
 b=giXQz3QrZJFW2cx3p7kDC4ZWpkMCbE61/8jbfuTD8Hws5tE1TO0ZbrCvOhfgbtVVWt4371morYb/L3wOi8IRaFzUp9fDKP9JF+t5uNZpbCn1f4cAGb5JZyO8AibvGc7H8nZquExp/ysSMt/+xfxz3s7N7TwZsyMsMNila4mpOw4=
ARC-Authentication-Results: i=1; server2.sourceware.org
Received: from murzim.nue2.suse.org (unknown [10.168.4.243])
 (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
 key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest
 SHA256)
 (No client certificate requested)
 by smtp-out1.suse.de (Postfix) with ESMTPS id 4E53321BD0
 for <gcc-patches@gcc.gnu.org>; Wed,  3 Jul 2024 13:24:05 +0000 (UTC)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=suse.de;
 s=susede2_rsa;
 t=1720013045;
 h=from:from:reply-to:date:date:to:to:cc:mime-version:mime-version:
 content-type:content-type; bh=6oH0DkDdjmUXLT7dbd9OVpJFB6nISzoCiGz1xAflXzg=;
 b=tK2MyEvNRgzF016a06/v7MH/v3pDVb0Pa5/27czrz3GpzfW5xsm4KCDjhmCqysTTACJqwH
 pjXXIcmDydnWi75XEm9rw4Iqa7SYZTIFyFbTknS7ZrSkHCbbMV6G4Emx8uK8i7lLvG5bMm
 qZndxTK7v1TiHE42/7ydBwZmzCly7sc=
DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed; d=suse.de;
 s=susede2_ed25519; t=1720013045;
 h=from:from:reply-to:date:date:to:to:cc:mime-version:mime-version:
 content-type:content-type; bh=6oH0DkDdjmUXLT7dbd9OVpJFB6nISzoCiGz1xAflXzg=;
 b=lNkxWoMfeZa6G+1s9ivPoK1y4mIn9FqHD2U1R69y+A5GnqGDNUkFYri7SFrHMFHCQONj/9
 HcHVU8es2i31kYAw==
Authentication-Results: smtp-out1.suse.de;
	none
Date: Wed, 3 Jul 2024 15:24:05 +0200 (CEST)
From: Richard Biener <rguenther@suse.de>
To: gcc-patches@gcc.gnu.org
Subject: [PATCH 4/5] Support group-size of three in SLP load permutation
 lowering
MIME-Version: 1.0
X-Spamd-Result: default: False [-0.10 / 50.00]; BAYES_HAM(-3.00)[100.00%];
 MISSING_MID(2.50)[]; NEURAL_SPAM_LONG(0.70)[0.200];
 NEURAL_HAM_SHORT(-0.20)[-1.000]; MIME_GOOD(-0.10)[text/plain];
 RCPT_COUNT_ONE(0.00)[1]; RCVD_COUNT_ZERO(0.00)[0];
 ARC_NA(0.00)[]; MISSING_XM_UA(0.00)[];
 DKIM_SIGNED(0.00)[suse.de:s=susede2_rsa,suse.de:s=susede2_ed25519];
 FUZZY_BLOCKED(0.00)[rspamd.com]; FROM_EQ_ENVFROM(0.00)[];
 MIME_TRACE(0.00)[0:+]; TO_DN_NONE(0.00)[];
 TO_MATCH_ENVRCPT_ALL(0.00)[]; FROM_HAS_DN(0.00)[]
X-Spam-Score: -0.10
X-Spam-Level: 
X-Spam-Status: No, score=-10.6 required=5.0 tests=BAYES_00, DKIM_SIGNED,
 DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, GIT_PATCH_0, MISSING_MID,
 SPF_HELO_NONE, SPF_PASS, TXREP autolearn=ham autolearn_force=no version=3.4.6
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on
 server2.sourceware.org
X-BeenThere: gcc-patches@gcc.gnu.org
X-Mailman-Version: 2.1.30
Precedence: list
List-Id: Gcc-patches mailing list <gcc-patches.gcc.gnu.org>
List-Unsubscribe: <https://gcc.gnu.org/mailman/options/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=unsubscribe>
List-Archive: <https://gcc.gnu.org/pipermail/gcc-patches/>
List-Post: <mailto:gcc-patches@gcc.gnu.org>
List-Help: <mailto:gcc-patches-request@gcc.gnu.org?subject=help>
List-Subscribe: <https://gcc.gnu.org/mailman/listinfo/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=subscribe>
Errors-To: gcc-patches-bounces~incoming=patchwork.ozlabs.org@gcc.gnu.org
Message-Id: <20240703132633.3BEF53860C3D@sourceware.org>

The following adds support for group-size three in SLP load permutation
lowering to match the non-SLP capabilities.  This is done by using
the non-interleaving fallback code, which at VF == 4 creates from
{ { a0, b0, c0 }, { a1, b1, c1 }, { a2, b2, c2 }, { a3, b3, c3 } }
the intermediate vectors { c0, c0, c1, c1 } and { c2, c2, c3, c3 }
to produce { c0, c1, c2, c3 }.
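
As a rough illustration of the two steps above, here is a standalone
sketch in plain C.  It is not GCC code: apply_lane_perm and
lower_lane_c are hypothetical names, and a flat int array stands in
for the vectors of an SLP node (VF iterations times the number of
lanes).

```c
#include <assert.h>

/* Hypothetical model, not GCC code: apply an SLP-style lane permutation.
   IN holds VF iterations of IN_LANES lanes each.  OUT receives VF
   iterations of OUT_LANES lanes, where output lane J of every iteration
   is input lane PERM[J] of the same iteration.  */
static void
apply_lane_perm (const int *in, unsigned in_lanes,
                 const unsigned *perm, unsigned out_lanes,
                 unsigned vf, int *out)
{
  for (unsigned i = 0; i < vf; ++i)
    for (unsigned j = 0; j < out_lanes; ++j)
      {
        assert (perm[j] < in_lanes);
        out[i * out_lanes + j] = in[i * in_lanes + perm[j]];
      }
}

/* Extract lane c (index 2) of a group of three at VF == 4 in two steps.
   MEM has 12 elements { a0, b0, c0, a1, ... }, MID receives 8 elements
   and RES receives 4.  */
static void
lower_lane_c (const int *mem, int *mid, int *res)
{
  /* Step 1: reduce 3 lanes to (3 + 1) / 2 == 2 lanes.  The single
     wanted lane is repeated to pad the node: { c, c }.  */
  static const unsigned step1[2] = { 2, 2 };
  apply_lane_perm (mem, 3, step1, 2, 4, mid);

  /* Step 2: the final permutation picks lane 0 of the two-lane node,
     which amounts to an even extract over the two intermediate
     vectors.  */
  static const unsigned step2[1] = { 0 };
  apply_lane_perm (mid, 2, step2, 1, 4, res);
}
```

With mem filled as { a0, b0, c0, ..., a3, b3, c3 }, mid ends up as
{ c0, c0, c1, c1, c2, c2, c3, c3 } and res as { c0, c1, c2, c3 }.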

This turns out to be more effective than the scheme implemented
for non-SLP for SSE, only slightly worse for AVX512 and somewhat
worse for AVX2.  It seems to me that this would extend to other
non-power-of-two group-sizes (though the patch does not do that).
Optimal schemes are likely difficult to lay out in VF agnostic form.

I'll note that while the lowering assumes even/odd extract is
generally available for all vector element sizes (which is probably
a good assumption), it doesn't in any way constrain the other
permutes it generates based on target availability.  Again, this is
difficult to do in a VF agnostic way (but at least currently the
vector type is fixed).
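
For reference, the even/odd extract the lowering relies on is a
two-input permute that selects every second element of the
concatenation of two vectors.  The following is a minimal sketch in
plain C, not GCC code: extract_even_odd is a hypothetical name and
arrays of NELTS ints stand in for vectors.

```c
#include <assert.h>

/* Hypothetical model, not GCC code: two-input even/odd extract.
   V0 and V1 are vectors of NELTS elements each.  OUT receives the
   NELTS elements at even (ODD == 0) or odd (ODD == 1) positions of
   the concatenation V0 ++ V1.  */
static void
extract_even_odd (const int *v0, const int *v1, unsigned nelts,
                  unsigned odd, int *out)
{
  assert (odd <= 1);
  for (unsigned i = 0; i < nelts; ++i)
    {
      /* Selector index into the 2 * NELTS element concatenation.  */
      unsigned sel = 2 * i + odd;
      out[i] = sel < nelts ? v0[sel] : v1[sel - nelts];
    }
}
```

For NELTS == 4 the selectors are { 0, 2, 4, 6 } for even and
{ 1, 3, 5, 7 } for odd.  Any other permute the lowering generates
has no such fixed shape, which is why checking it against target
support is the harder part.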

I'll also note that the SLP store side merges lanes in a way that
produces three-vector permutes for a store group-size of three, so
the testcase uses a store group-size of four.

	* tree-vect-slp.cc (vect_lower_load_permutations): Support
	group-size of three.

	* gcc.dg/vect/slp-52.c: New testcase.
---
 gcc/testsuite/gcc.dg/vect/slp-52.c | 14 ++++++++++++
 gcc/tree-vect-slp.cc               | 35 +++++++++++++++++-------------
 2 files changed, 34 insertions(+), 15 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/vect/slp-52.c

diff --git a/gcc/testsuite/gcc.dg/vect/slp-52.c b/gcc/testsuite/gcc.dg/vect/slp-52.c
new file mode 100644
index 00000000000..ba49f0046e2
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/slp-52.c
@@ -0,0 +1,14 @@
+/* { dg-do compile } */
+
+void foo (int * __restrict x, int *y)
+{
+  for (int i = 0; i < 1024; ++i)
+    {
+      x[4*i+0] = y[3*i+0];
+      x[4*i+1] = y[3*i+1] * 2;
+      x[4*i+2] = y[3*i+2] + 3;
+      x[4*i+3] = y[3*i+2] * 2 - 5;
+    }
+}
+
+/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" { target { vect_int && vect_int_mult } } } } */
diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
index fdefee90e92..c62b0b5cf88 100644
--- a/gcc/tree-vect-slp.cc
+++ b/gcc/tree-vect-slp.cc
@@ -3718,7 +3718,8 @@ vect_build_slp_instance (vec_info *vinfo,
 		 with the least number of lanes to one and then repeat until
 		 we end up with two inputs.  That scheme makes sure we end
 		 up with permutes satisfying the restriction of requiring at
-		 most two vector inputs to produce a single vector output.  */
+		 most two vector inputs to produce a single vector output
+		 when the number of lanes is even.  */
 	      while (SLP_TREE_CHILDREN (perm).length () > 2)
 		{
 		  /* Pick the two nodes with the least number of lanes,
@@ -3995,11 +3996,10 @@ vect_lower_load_permutations (loop_vec_info loop_vinfo,
     = DR_GROUP_FIRST_ELEMENT (SLP_TREE_SCALAR_STMTS (loads[0])[0]);
 
   /* Only a power-of-two number of lanes matches interleaving with N levels.
-     The non-SLP path also supports DR_GROUP_SIZE == 3.
      ???  An even number of lanes could be reduced to 1<<ceil_log2(N)-1 lanes
      at each step.  */
   unsigned group_lanes = DR_GROUP_SIZE (first);
-  if (exact_log2 (group_lanes) == -1)
+  if (exact_log2 (group_lanes) == -1 && group_lanes != 3)
     return;
 
   for (slp_tree load : loads)
@@ -4016,7 +4016,7 @@ vect_lower_load_permutations (loop_vec_info loop_vinfo,
 	 with a non-1:1 load permutation around instead of canonicalizing
 	 those into a load and a permute node.  Removing this early
 	 check would do such canonicalization.  */
-      if (SLP_TREE_LANES (load) >= group_lanes / 2)
+      if (SLP_TREE_LANES (load) >= (group_lanes + 1) / 2)
 	continue;
 
       /* First build (and possibly re-use) a load node for the
@@ -4052,7 +4052,7 @@ vect_lower_load_permutations (loop_vec_info loop_vinfo,
       while (1)
 	{
 	  unsigned group_lanes = SLP_TREE_LANES (l0);
-	  if (SLP_TREE_LANES (load) >= group_lanes / 2)
+	  if (SLP_TREE_LANES (load) >= (group_lanes + 1) / 2)
 	    break;
 
 	  /* Try to lower by reducing the group to half its size using an
@@ -4062,19 +4062,24 @@ vect_lower_load_permutations (loop_vec_info loop_vinfo,
 	     Thus { e, e, o, o, e, e, o, o } woud be an even/odd decomposition
 	     with N == 2.  */
 	  /* ???  Only an even number of lanes can be handed this way, but the
-	     fallback below could work for any number.  */
-	  gcc_assert ((group_lanes & 1) == 0);
-	  unsigned even = (1 << ceil_log2 (group_lanes)) - 1;
-	  unsigned odd = even;
-	  for (auto l : final_perm)
+	     fallback below could work for any number.  We have to make sure
+	     to round up in that case.  */
+	  gcc_assert ((group_lanes & 1) == 0 || group_lanes == 3);
+	  unsigned even = 0, odd = 0;
+	  if ((group_lanes & 1) == 0)
 	    {
-	      even &= ~l.second;
-	      odd &= l.second;
+	      even = (1 << ceil_log2 (group_lanes)) - 1;
+	      odd = even;
+	      for (auto l : final_perm)
+		{
+		  even &= ~l.second;
+		  odd &= l.second;
+		}
 	    }
 
 	  /* Now build an even or odd extraction from the unpermuted load.  */
 	  lane_permutation_t perm;
-	  perm.create (group_lanes / 2);
+	  perm.create ((group_lanes + 1) / 2);
 	  unsigned level;
 	  if (even
 	      && ((level = 1 << ctz_hwi (even)), true)
@@ -4109,7 +4114,7 @@ vect_lower_load_permutations (loop_vec_info loop_vinfo,
 	      bitmap_iterator bi;
 	      EXECUTE_IF_SET_IN_BITMAP (l, 0, i, bi)
 		  perm.quick_push (std::make_pair (0, i));
-	      while (perm.length () < group_lanes / 2)
+	      while (perm.length () < (group_lanes + 1) / 2)
 		perm.quick_push (perm.last ());
 	    }
 
@@ -4145,7 +4150,7 @@ vect_lower_load_permutations (loop_vec_info loop_vinfo,
 	     have a "local" CSE map here.  */
 	  SLP_TREE_SCALAR_STMTS (p) = perm_stmts;
 
-	  /* We now have a node for group_lanes / 2 lanes.  */
+	  /* We now have a node for (group_lanes + 1) / 2 lanes.  */
 	  l0 = p;
 	}