From patchwork Wed Jun 12 09:15:31 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Richard Biener <rguenther@suse.de>
X-Patchwork-Id: 1946756
Return-Path: <gcc-patches-bounces+incoming=patchwork.ozlabs.org@gcc.gnu.org>
X-Original-To: incoming@patchwork.ozlabs.org
Delivered-To: patchwork-incoming@legolas.ozlabs.org
Authentication-Results: legolas.ozlabs.org;
	dkim=pass (1024-bit key;
 unprotected) header.d=suse.de header.i=@suse.de header.a=rsa-sha256
 header.s=susede2_rsa header.b=ceJQ15gE;
	dkim=pass header.d=suse.de header.i=@suse.de header.a=ed25519-sha256
 header.s=susede2_ed25519 header.b=Qa+BHRfy;
	dkim=pass (1024-bit key) header.d=suse.de header.i=@suse.de
 header.a=rsa-sha256 header.s=susede2_rsa header.b=ceJQ15gE;
	dkim=neutral header.d=suse.de header.i=@suse.de header.a=ed25519-sha256
 header.s=susede2_ed25519 header.b=Qa+BHRfy;
	dkim-atps=neutral
Authentication-Results: legolas.ozlabs.org;
 spf=pass (sender SPF authorized) smtp.mailfrom=gcc.gnu.org
 (client-ip=8.43.85.97; helo=server2.sourceware.org;
 envelope-from=gcc-patches-bounces+incoming=patchwork.ozlabs.org@gcc.gnu.org;
 receiver=patchwork.ozlabs.org)
Received: from server2.sourceware.org (server2.sourceware.org [8.43.85.97])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
	 key-exchange X25519 server-signature ECDSA (secp384r1) server-digest SHA384)
	(No client certificate requested)
	by legolas.ozlabs.org (Postfix) with ESMTPS id 4VzfzK1z6Qz20KL
	for <incoming@patchwork.ozlabs.org>; Wed, 12 Jun 2024 19:15:57 +1000 (AEST)
Received: from server2.sourceware.org (localhost [IPv6:::1])
	by sourceware.org (Postfix) with ESMTP id 6924E385DC3C
	for <incoming@patchwork.ozlabs.org>; Wed, 12 Jun 2024 09:15:55 +0000 (GMT)
X-Original-To: gcc-patches@gcc.gnu.org
Delivered-To: gcc-patches@gcc.gnu.org
Received: from smtp-out2.suse.de (smtp-out2.suse.de [195.135.223.131])
 by sourceware.org (Postfix) with ESMTPS id BECA83858D34
 for <gcc-patches@gcc.gnu.org>; Wed, 12 Jun 2024 09:15:32 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org BECA83858D34
Authentication-Results: sourceware.org;
 dmarc=pass (p=none dis=none) header.from=suse.de
Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=suse.de
ARC-Filter: OpenARC Filter v1.0.0 sourceware.org BECA83858D34
Authentication-Results: server2.sourceware.org;
 arc=none smtp.remote-ip=195.135.223.131
ARC-Seal: i=1; a=rsa-sha256; d=sourceware.org; s=key; t=1718183735; cv=none;
 b=xiCTWDC1Y+2MGzg0MwAIlwkSfhx9SqDNjEsUharsVsuQkhlo74Sg/kdVily9iKZnjtVYipF+j6xIQhE/odTc9bug4MNet1iVzK7+vwoIeiwfKd9/Qu0ZhBJoVweKH+KEZS8XgAq95Q6WcAVTcCBsOpAsEA+wu5SAkU5ISBA+l8I=
ARC-Message-Signature: i=1; a=rsa-sha256; d=sourceware.org; s=key;
 t=1718183735; c=relaxed/simple;
 bh=ZRr8tY7ei06kHofPNmCc82DLzyaBxFM2ICjJphr9D5k=;
 h=DKIM-Signature:DKIM-Signature:DKIM-Signature:DKIM-Signature:Date:
 From:To:Subject:MIME-Version;
 b=G9sgguCPtL5WQSx12t95l9vJ061aEV3XHojXVj3Fcq3crNdd9fMU42MkQhWpD5cBIPpfHWMSVUZnN+NyBw0ieoBaPwFZftJ03Gd7gaeVgCmUyQRaup8PM0s0SqkH2TTnIyRNp5IkVvq8opq9C6KrqcuWJRwDiGuumopreg7QSA0=
ARC-Authentication-Results: i=1; server2.sourceware.org
Received: from murzim.nue2.suse.org (unknown [10.168.4.243])
 (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
 key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest
 SHA256)
 (No client certificate requested)
 by smtp-out2.suse.de (Postfix) with ESMTPS id B885B5BE70
 for <gcc-patches@gcc.gnu.org>; Wed, 12 Jun 2024 09:15:31 +0000 (UTC)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=suse.de;
 s=susede2_rsa;
 t=1718183731;
 h=from:from:reply-to:date:date:to:to:cc:mime-version:mime-version:
 content-type:content-type; bh=/k2fUJMNfTNRQ7CEbdFIZcVOqX/O2a4e1blnwLf3nZE=;
 b=ceJQ15gE/vxFolHliOqs3BhmAWKpYqaC0ancHbttA5QaXkgx2fMo1Q0QvRI1jVkBbixPwi
 7vSYOL/spDrbJFL0+CQ3ZhiIdL8MbXFpJtUZi+2J3+PIAZYGJ/h9aac/+wi/LqdqQiilI0
 UBC4qs6n3/yI540Hlz/3Wcx64LdBDx0=
DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed; d=suse.de;
 s=susede2_ed25519; t=1718183731;
 h=from:from:reply-to:date:date:to:to:cc:mime-version:mime-version:
 content-type:content-type; bh=/k2fUJMNfTNRQ7CEbdFIZcVOqX/O2a4e1blnwLf3nZE=;
 b=Qa+BHRfyYQ3AFHlXxtZx85zufsIEamqOCXJcm4cokGD8NnUBD/AIBt7lZjfEJKczeOQ4DK
 2ZPZLZ1el8+q98DA==
Authentication-Results: smtp-out2.suse.de;
	none
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=suse.de;
 s=susede2_rsa;
 t=1718183731;
 h=from:from:reply-to:date:date:to:to:cc:mime-version:mime-version:
 content-type:content-type; bh=/k2fUJMNfTNRQ7CEbdFIZcVOqX/O2a4e1blnwLf3nZE=;
 b=ceJQ15gE/vxFolHliOqs3BhmAWKpYqaC0ancHbttA5QaXkgx2fMo1Q0QvRI1jVkBbixPwi
 7vSYOL/spDrbJFL0+CQ3ZhiIdL8MbXFpJtUZi+2J3+PIAZYGJ/h9aac/+wi/LqdqQiilI0
 UBC4qs6n3/yI540Hlz/3Wcx64LdBDx0=
DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed; d=suse.de;
 s=susede2_ed25519; t=1718183731;
 h=from:from:reply-to:date:date:to:to:cc:mime-version:mime-version:
 content-type:content-type; bh=/k2fUJMNfTNRQ7CEbdFIZcVOqX/O2a4e1blnwLf3nZE=;
 b=Qa+BHRfyYQ3AFHlXxtZx85zufsIEamqOCXJcm4cokGD8NnUBD/AIBt7lZjfEJKczeOQ4DK
 2ZPZLZ1el8+q98DA==
Date: Wed, 12 Jun 2024 11:15:31 +0200 (CEST)
From: Richard Biener <rguenther@suse.de>
To: gcc-patches@gcc.gnu.org
Subject: [PATCH 1/3][v3] tree-optimization/114107 - avoid peeling for gaps
 in more cases
MIME-Version: 1.0
X-Spamd-Result: default: False [-1.22 / 50.00]; BAYES_HAM(-3.00)[100.00%];
 MISSING_MID(2.50)[]; NEURAL_HAM_LONG(-0.44)[-0.436];
 NEURAL_HAM_SHORT(-0.18)[-0.905]; MIME_GOOD(-0.10)[text/plain];
 RCPT_COUNT_ONE(0.00)[1]; RCVD_COUNT_ZERO(0.00)[0];
 ARC_NA(0.00)[]; MISSING_XM_UA(0.00)[];
 DKIM_SIGNED(0.00)[suse.de:s=susede2_rsa,suse.de:s=susede2_ed25519];
 FUZZY_BLOCKED(0.00)[rspamd.com]; FROM_EQ_ENVFROM(0.00)[];
 MIME_TRACE(0.00)[0:+]; TO_DN_NONE(0.00)[];
 TO_MATCH_ENVRCPT_ALL(0.00)[]; FROM_HAS_DN(0.00)[]
X-Spam-Score: -1.22
X-Spam-Level: 
X-Spam-Status: No, score=-10.5 required=5.0 tests=BAYES_00, DKIM_SIGNED,
 DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, GIT_PATCH_0, MISSING_MID,
 SPF_HELO_NONE, SPF_PASS, TXREP,
 T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on
 server2.sourceware.org
X-BeenThere: gcc-patches@gcc.gnu.org
X-Mailman-Version: 2.1.30
Precedence: list
List-Id: Gcc-patches mailing list <gcc-patches.gcc.gnu.org>
List-Unsubscribe: <https://gcc.gnu.org/mailman/options/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=unsubscribe>
List-Archive: <https://gcc.gnu.org/pipermail/gcc-patches/>
List-Post: <mailto:gcc-patches@gcc.gnu.org>
List-Help: <mailto:gcc-patches-request@gcc.gnu.org?subject=help>
List-Subscribe: <https://gcc.gnu.org/mailman/listinfo/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=subscribe>
Errors-To: gcc-patches-bounces+incoming=patchwork.ozlabs.org@gcc.gnu.org
Message-Id: <20240612091555.6924E385DC3C@sourceware.org>

The following refactors the code to detect necessary peeling for
gaps, in particular the PR103116 case when there is no gap but
the group size is smaller than the vector size.  The testcase in
PR114107 shows we fail to SLP

  for (int i=0; i<n; i++)
    for (int k=0; k<4; k++)
      data[4*i+k] *= factor[i];

because peeling one scalar iteration isn't enough to cover a gap
of 3 elements of factor[i].  But the code detecting this is placed
after the logic that detects cases we already handle properly, since
we'd already code generate { factor[i], 0., 0., 0. } for V4DFmode
vectorization.  In fact the check that detects when peeling a single
iteration isn't enough seems improperly guarded, as it should apply
to all cases.
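
As a rough illustration (not part of the patch), the arithmetic of the
rewritten check can be modelled with plain integers.  This is only a
sketch: the helper name single_peel_insufficient is made up, all
quantities are assumed to be compile-time constants counted in
elements, and the real code uses poly-int operations and only
evaluates the test once an overrun has already been detected.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified model of the "peeling for gaps
   insufficient" test.  GCC performs this with poly-ints; here every
   value is a plain constant element count.  */
static bool
single_peel_insufficient (unsigned group_size, unsigned vf,
                          unsigned gap, unsigned nunits)
{
  /* The model assumes the gap never exceeds the accessed group.  */
  assert (nunits > 0 && group_size * vf >= gap);

  /* Elements touched by one vector iteration, trailing gap excluded.  */
  unsigned accessed = group_size * vf - gap;

  /* Remainder of the truncating division by the vector element count.  */
  unsigned remain = accessed % nunits;

  /* One peeled scalar iteration adds group_size elements; if the sum
     is still below nunits the single peel is treated as insufficient.  */
  return remain + group_size < nunits;
}
```

With a single-element group, a vectorization factor of 1, no recorded
gap and four elements per vector, the model reports that one peeled
iteration is insufficient; with a full group of four it does not.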

I'm not sure we correctly handle VMAT_CONTIGUOUS_REVERSE but I
checked that VMAT_STRIDED_SLP and VMAT_ELEMENTWISE correctly avoid
touching excess elements.

With this change we can use SLP for the above testcase, and the
PR103116 testcases no longer require an epilogue on x86-64.  This
might be different on other targets, so I made those testcases
runtime FAIL only instead of relying on dump scanning, which there
is currently no easy way to properly constrain.

	PR tree-optimization/114107
	PR tree-optimization/110445
	* tree-vect-stmts.cc (get_group_load_store_type): Refactor
	contiguous access case.  Make sure peeling for gaps constraints
	are always tested and consistently relaxed when we know we can
	avoid touching excess elements during code generation.  Rewrite
	the check to be poly-int aware.

	* gcc.dg/vect/pr114107.c: New testcase.
	* gcc.dg/vect/pr103116-1.c: Adjust.
	* gcc.dg/vect/pr103116-2.c: Likewise.
---
 gcc/testsuite/gcc.dg/vect/pr103116-1.c |   4 +-
 gcc/testsuite/gcc.dg/vect/pr103116-2.c |   3 +-
 gcc/testsuite/gcc.dg/vect/pr114107.c   |  31 +++++++
 gcc/tree-vect-stmts.cc                 | 119 ++++++++++++-------------
 4 files changed, 94 insertions(+), 63 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/vect/pr114107.c

diff --git a/gcc/testsuite/gcc.dg/vect/pr103116-1.c b/gcc/testsuite/gcc.dg/vect/pr103116-1.c
index d3639fc8cfd..280ec57eb4b 100644
--- a/gcc/testsuite/gcc.dg/vect/pr103116-1.c
+++ b/gcc/testsuite/gcc.dg/vect/pr103116-1.c
@@ -47,4 +47,6 @@ main (void)
   return 0;
 }
 
-/* { dg-final { scan-tree-dump "Data access with gaps requires scalar epilogue loop" "vect" { target { vect_perm && vect_int } } } } */
+/* When the target can compose a vector from its half we do not require
+   a scalar epilogue, but there's no effective target for this.  */
+/* { dg-final { scan-tree-dump "vectorized 1 loops" "vect" { target { vect_perm && vect_int } } } } */
diff --git a/gcc/testsuite/gcc.dg/vect/pr103116-2.c b/gcc/testsuite/gcc.dg/vect/pr103116-2.c
index aa9797a9407..ee5b82922f9 100644
--- a/gcc/testsuite/gcc.dg/vect/pr103116-2.c
+++ b/gcc/testsuite/gcc.dg/vect/pr103116-2.c
@@ -56,4 +56,5 @@ main (void)
   return 0;
 }
 
-/* { dg-final { scan-tree-dump "peeling for gaps insufficient for access" "vect" { target { vect_perm_short } } } } */
+/* Whether or not peeling for gaps is required depends on the ability of
+   the target to compose a vector from two-element bits.  */
diff --git a/gcc/testsuite/gcc.dg/vect/pr114107.c b/gcc/testsuite/gcc.dg/vect/pr114107.c
new file mode 100644
index 00000000000..65175d9a680
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/pr114107.c
@@ -0,0 +1,31 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target vect_double } */
+
+void rescale_x4 (double* __restrict data,
+		 const double* __restrict factor, int n)
+{
+  for (int i=0; i<n; i++)
+    {
+      data[4*i] *= factor[i];
+      data[4*i+1] *= factor[i];
+      data[4*i+2] *= factor[i];
+      data[4*i+3] *= factor[i];
+    }
+}
+
+void rescale_x4_s (double* __restrict data,
+		   const double* __restrict factor, int n, int s)
+{
+  for (int i=0; i<n; i++)
+    {
+      data[s*i] *= factor[s*i];
+      data[s*i+1] *= factor[s*i];
+      data[s*i+2] *= factor[s*i];
+      data[s*i+3] *= factor[s*i];
+    }
+}
+
+/* All targets should be able to compose a vector from scalar elements, but
+   depending on vector size different permutes might be necessary.  */
+/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 2 "vect" { target vect_perm } } } */
+/* { dg-final { scan-tree-dump-not "Data access with gaps requires scalar epilogue loop" "vect" { target vect_perm } } } */
diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
index 1d7f0fe5c4e..f8c4b33878d 100644
--- a/gcc/tree-vect-stmts.cc
+++ b/gcc/tree-vect-stmts.cc
@@ -2047,6 +2047,36 @@ get_group_load_store_type (vec_info *vinfo, stmt_vec_info stmt_info,
 	}
       else
 	{
+	  int cmp = compare_step_with_zero (vinfo, stmt_info);
+	  if (cmp < 0)
+	    {
+	      if (single_element_p)
+		/* ???  The VMAT_CONTIGUOUS_REVERSE code generation is
+		   only correct for single element "interleaving" SLP.  */
+		*memory_access_type = get_negative_load_store_type
+			     (vinfo, stmt_info, vectype, vls_type, 1, poffset);
+	      else
+		{
+		  /* Try to use consecutive accesses of DR_GROUP_SIZE elements,
+		     separated by the stride, until we have a complete vector.
+		     Fall back to scalar accesses if that isn't possible.  */
+		  if (multiple_p (nunits, group_size))
+		    *memory_access_type = VMAT_STRIDED_SLP;
+		  else
+		    *memory_access_type = VMAT_ELEMENTWISE;
+		}
+	    }
+	  else if (cmp == 0 && loop_vinfo)
+	    {
+	      gcc_assert (vls_type == VLS_LOAD);
+	      *memory_access_type = VMAT_INVARIANT;
+	      /* Invariant accesses perform only component accesses, alignment
+		 is irrelevant for them.  */
+	      *alignment_support_scheme = dr_unaligned_supported;
+	    }
+	  else
+	    *memory_access_type = VMAT_CONTIGUOUS;
+
 	  overrun_p = loop_vinfo && gap != 0;
 	  if (overrun_p && vls_type != VLS_LOAD)
 	    {
@@ -2065,6 +2095,21 @@ get_group_load_store_type (vec_info *vinfo, stmt_vec_info stmt_info,
 			/ vect_get_scalar_dr_size (first_dr_info)))
 	    overrun_p = false;
 
+	  /* When we have a contiguous access across loop iterations
+	     but the access in the loop doesn't cover the full vector
+	     we can end up with no gap recorded but still excess
+	     elements accessed, see PR103116.  Make sure we peel for
+	     gaps if necessary and sufficient and give up if not.
+
+	     If there is a combination of the access not covering the full
+	     vector and a gap recorded then we may need to peel twice.  */
+	  if (loop_vinfo
+	      && *memory_access_type == VMAT_CONTIGUOUS
+	      && SLP_TREE_LOAD_PERMUTATION (slp_node).exists ()
+	      && !multiple_p (group_size * LOOP_VINFO_VECT_FACTOR (loop_vinfo),
+			      nunits))
+	    overrun_p = true;
+
 	  /* If the gap splits the vector in half and the target
 	     can do half-vector operations avoid the epilogue peeling
 	     by simply loading half of the vector only.  Usually
@@ -2097,68 +2142,20 @@ get_group_load_store_type (vec_info *vinfo, stmt_vec_info stmt_info,
 				 "Peeling for outer loop is not supported\n");
 	      return false;
 	    }
-	  int cmp = compare_step_with_zero (vinfo, stmt_info);
-	  if (cmp < 0)
-	    {
-	      if (single_element_p)
-		/* ???  The VMAT_CONTIGUOUS_REVERSE code generation is
-		   only correct for single element "interleaving" SLP.  */
-		*memory_access_type = get_negative_load_store_type
-			     (vinfo, stmt_info, vectype, vls_type, 1, poffset);
-	      else
-		{
-		  /* Try to use consecutive accesses of DR_GROUP_SIZE elements,
-		     separated by the stride, until we have a complete vector.
-		     Fall back to scalar accesses if that isn't possible.  */
-		  if (multiple_p (nunits, group_size))
-		    *memory_access_type = VMAT_STRIDED_SLP;
-		  else
-		    *memory_access_type = VMAT_ELEMENTWISE;
-		}
-	    }
-	  else if (cmp == 0 && loop_vinfo)
-	    {
-	      gcc_assert (vls_type == VLS_LOAD);
-	      *memory_access_type = VMAT_INVARIANT;
-	      /* Invariant accesses perform only component accesses, alignment
-		 is irrelevant for them.  */
-	      *alignment_support_scheme = dr_unaligned_supported;
-	    }
-	  else
-	    *memory_access_type = VMAT_CONTIGUOUS;
-
-	  /* When we have a contiguous access across loop iterations
-	     but the access in the loop doesn't cover the full vector
-	     we can end up with no gap recorded but still excess
-	     elements accessed, see PR103116.  Make sure we peel for
-	     gaps if necessary and sufficient and give up if not.
-
-	     If there is a combination of the access not covering the full
-	     vector and a gap recorded then we may need to peel twice.  */
-	  if (loop_vinfo
-	      && *memory_access_type == VMAT_CONTIGUOUS
-	      && SLP_TREE_LOAD_PERMUTATION (slp_node).exists ()
-	      && !multiple_p (group_size * LOOP_VINFO_VECT_FACTOR (loop_vinfo),
-			      nunits))
+	  /* Peeling for gaps assumes that a single scalar iteration
+	     is enough to make sure the last vector iteration doesn't
+	     access excess elements.  */
+	  if (overrun_p
+	      && (!can_div_trunc_p (group_size
+				    * LOOP_VINFO_VECT_FACTOR (loop_vinfo) - gap,
+				    nunits, &tem, &remain)
+		  || maybe_lt (remain + group_size, nunits)))
 	    {
-	      unsigned HOST_WIDE_INT cnunits, cvf;
-	      if (!can_overrun_p
-		  || !nunits.is_constant (&cnunits)
-		  || !LOOP_VINFO_VECT_FACTOR (loop_vinfo).is_constant (&cvf)
-		  /* Peeling for gaps assumes that a single scalar iteration
-		     is enough to make sure the last vector iteration doesn't
-		     access excess elements.
-		     ???  Enhancements include peeling multiple iterations
-		     or using masked loads with a static mask.  */
-		  || (group_size * cvf) % cnunits + group_size - gap < cnunits)
-		{
-		  if (dump_enabled_p ())
-		    dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
-				     "peeling for gaps insufficient for "
-				     "access\n");
-		  return false;
-		}
-	      overrun_p = true;
+	      if (dump_enabled_p ())
+		dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+				 "peeling for gaps insufficient for "
+				 "access\n");
+	      return false;
 	    }
 
 	  /* If this is single-element interleaving with an element

From patchwork Wed Jun 12 09:15:52 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Richard Biener <rguenther@suse.de>
X-Patchwork-Id: 1946757
Return-Path: <gcc-patches-bounces+incoming=patchwork.ozlabs.org@gcc.gnu.org>
X-Original-To: incoming@patchwork.ozlabs.org
Delivered-To: patchwork-incoming@legolas.ozlabs.org
Authentication-Results: legolas.ozlabs.org;
	dkim=pass (1024-bit key;
 unprotected) header.d=suse.de header.i=@suse.de header.a=rsa-sha256
 header.s=susede2_rsa header.b=lyRKagXi;
	dkim=pass header.d=suse.de header.i=@suse.de header.a=ed25519-sha256
 header.s=susede2_ed25519 header.b=gqg+9ZUz;
	dkim=pass (1024-bit key) header.d=suse.de header.i=@suse.de
 header.a=rsa-sha256 header.s=susede2_rsa header.b=lyRKagXi;
	dkim=neutral header.d=suse.de header.i=@suse.de header.a=ed25519-sha256
 header.s=susede2_ed25519 header.b=gqg+9ZUz;
	dkim-atps=neutral
Authentication-Results: legolas.ozlabs.org;
 spf=pass (sender SPF authorized) smtp.mailfrom=gcc.gnu.org
 (client-ip=8.43.85.97; helo=server2.sourceware.org;
 envelope-from=gcc-patches-bounces+incoming=patchwork.ozlabs.org@gcc.gnu.org;
 receiver=patchwork.ozlabs.org)
Received: from server2.sourceware.org (server2.sourceware.org [8.43.85.97])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
	 key-exchange X25519 server-signature ECDSA (secp384r1) server-digest SHA384)
	(No client certificate requested)
	by legolas.ozlabs.org (Postfix) with ESMTPS id 4Vzfzk4sm9z20KL
	for <incoming@patchwork.ozlabs.org>; Wed, 12 Jun 2024 19:16:18 +1000 (AEST)
Received: from server2.sourceware.org (localhost [IPv6:::1])
	by sourceware.org (Postfix) with ESMTP id E722F385DDDE
	for <incoming@patchwork.ozlabs.org>; Wed, 12 Jun 2024 09:16:16 +0000 (GMT)
X-Original-To: gcc-patches@gcc.gnu.org
Delivered-To: gcc-patches@gcc.gnu.org
Received: from smtp-out1.suse.de (smtp-out1.suse.de [195.135.223.130])
 by sourceware.org (Postfix) with ESMTPS id C7811385DC32
 for <gcc-patches@gcc.gnu.org>; Wed, 12 Jun 2024 09:15:53 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org C7811385DC32
Authentication-Results: sourceware.org;
 dmarc=pass (p=none dis=none) header.from=suse.de
Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=suse.de
ARC-Filter: OpenARC Filter v1.0.0 sourceware.org C7811385DC32
Authentication-Results: server2.sourceware.org;
 arc=none smtp.remote-ip=195.135.223.130
ARC-Seal: i=1; a=rsa-sha256; d=sourceware.org; s=key; t=1718183756; cv=none;
 b=T0O5eNi3lQxz9Uwc2spzfYiwWjMPUwg4X1bNvkFZJLam4m8zGk/uH5ru363Yozj5RTgM+3e+0MOjWaoIaNoTTmsjUqtw7x72u05FxYZDGTetzKVJ4I74tP1ZR4rFwG7jFmxGViHNKJ5Qbb5pzCP04f9Ssq4zzlkacNZ4vbuUotg=
ARC-Message-Signature: i=1; a=rsa-sha256; d=sourceware.org; s=key;
 t=1718183756; c=relaxed/simple;
 bh=fGAgF9SEFMKbtTmhQA4Pd9zJKfYpnhzfLCXIzOVQ6Ik=;
 h=DKIM-Signature:DKIM-Signature:DKIM-Signature:DKIM-Signature:Date:
 From:To:Subject:MIME-Version;
 b=eaM0iBuWYtJhV7gSSmXs5RJArISWdYIIjSQLsls/H013u/fTr6WNm6sOKwA1aDr8U2kgEO5heOCOlwHlZjad8P9FEj50QgyQGjJpOyRHMELsnwPPpBES2fw8i9m9E6KRo1bYDL5Gh7ALOpoWBKaw5QAkudl9fIUf+xBxrpTPqTM=
ARC-Authentication-Results: i=1; server2.sourceware.org
Received: from murzim.nue2.suse.org (unknown [10.168.4.243])
 (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
 key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest
 SHA256)
 (No client certificate requested)
 by smtp-out1.suse.de (Postfix) with ESMTPS id C583E22A01
 for <gcc-patches@gcc.gnu.org>; Wed, 12 Jun 2024 09:15:52 +0000 (UTC)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=suse.de;
 s=susede2_rsa;
 t=1718183752;
 h=from:from:reply-to:date:date:to:to:cc:mime-version:mime-version:
 content-type:content-type; bh=LG4p6xdg36gL0P4ZUBbQZI/PpRVBn8X31yToodteFr4=;
 b=lyRKagXiuVn9kSwDqFs89GRGAucLu2SE6aV1L04JeEHRop+8Jbb3SN7Ze+4+NXqqulErS8
 h7A6YXu5kKA2CXCcd+i3+zo6ApIyCZOifR/+UVIdvC7i+balMZ9BNLp0SwfYQIkWEW39lg
 /p70lGpe90HJaHT4EUxQKlzpxh19JgQ=
DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed; d=suse.de;
 s=susede2_ed25519; t=1718183752;
 h=from:from:reply-to:date:date:to:to:cc:mime-version:mime-version:
 content-type:content-type; bh=LG4p6xdg36gL0P4ZUBbQZI/PpRVBn8X31yToodteFr4=;
 b=gqg+9ZUzLtAoPeRvYUDsD439HCrFZYqCZ0y/DPKpBAOEaEQUJvI/satrEF0iOLQE3jIM8p
 tKq7LfeIPYAZcTCQ==
Authentication-Results: smtp-out1.suse.de;
	none
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=suse.de;
 s=susede2_rsa;
 t=1718183752;
 h=from:from:reply-to:date:date:to:to:cc:mime-version:mime-version:
 content-type:content-type; bh=LG4p6xdg36gL0P4ZUBbQZI/PpRVBn8X31yToodteFr4=;
 b=lyRKagXiuVn9kSwDqFs89GRGAucLu2SE6aV1L04JeEHRop+8Jbb3SN7Ze+4+NXqqulErS8
 h7A6YXu5kKA2CXCcd+i3+zo6ApIyCZOifR/+UVIdvC7i+balMZ9BNLp0SwfYQIkWEW39lg
 /p70lGpe90HJaHT4EUxQKlzpxh19JgQ=
DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed; d=suse.de;
 s=susede2_ed25519; t=1718183752;
 h=from:from:reply-to:date:date:to:to:cc:mime-version:mime-version:
 content-type:content-type; bh=LG4p6xdg36gL0P4ZUBbQZI/PpRVBn8X31yToodteFr4=;
 b=gqg+9ZUzLtAoPeRvYUDsD439HCrFZYqCZ0y/DPKpBAOEaEQUJvI/satrEF0iOLQE3jIM8p
 tKq7LfeIPYAZcTCQ==
Date: Wed, 12 Jun 2024 11:15:52 +0200 (CEST)
From: Richard Biener <rguenther@suse.de>
To: gcc-patches@gcc.gnu.org
Subject: [PATCH 2/3][v3] tree-optimization/115385 - handle more gaps with
 peeling of a single iteration
MIME-Version: 1.0
X-Spam-Score: -1.19
X-Spam-Level: 
X-Spamd-Result: default: False [-1.19 / 50.00]; BAYES_HAM(-3.00)[100.00%];
 MISSING_MID(2.50)[]; NEURAL_HAM_LONG(-0.41)[-0.408];
 NEURAL_HAM_SHORT(-0.18)[-0.920]; MIME_GOOD(-0.10)[text/plain];
 DKIM_SIGNED(0.00)[suse.de:s=susede2_rsa,suse.de:s=susede2_ed25519];
 ARC_NA(0.00)[]; RCPT_COUNT_ONE(0.00)[1];
 FUZZY_BLOCKED(0.00)[rspamd.com]; TO_MATCH_ENVRCPT_ALL(0.00)[];
 FROM_HAS_DN(0.00)[]; RCVD_COUNT_ZERO(0.00)[0];
 MISSING_XM_UA(0.00)[]; FROM_EQ_ENVFROM(0.00)[];
 TO_DN_NONE(0.00)[]; MIME_TRACE(0.00)[0:+]
X-Spam-Status: No, score=-10.5 required=5.0 tests=BAYES_00, DKIM_SIGNED,
 DKIM_VALID, DKIM_VALID_AU, DKIM_VALID_EF, GIT_PATCH_0, KAM_SHORT,
 MISSING_MID,
 SPF_HELO_NONE, SPF_PASS, TXREP,
 T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on
 server2.sourceware.org
X-BeenThere: gcc-patches@gcc.gnu.org
X-Mailman-Version: 2.1.30
Precedence: list
List-Id: Gcc-patches mailing list <gcc-patches.gcc.gnu.org>
List-Unsubscribe: <https://gcc.gnu.org/mailman/options/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=unsubscribe>
List-Archive: <https://gcc.gnu.org/pipermail/gcc-patches/>
List-Post: <mailto:gcc-patches@gcc.gnu.org>
List-Help: <mailto:gcc-patches-request@gcc.gnu.org?subject=help>
List-Subscribe: <https://gcc.gnu.org/mailman/listinfo/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=subscribe>
Errors-To: gcc-patches-bounces+incoming=patchwork.ozlabs.org@gcc.gnu.org
Message-Id: <20240612091616.E722F385DDDE@sourceware.org>

The following makes peeling of a single scalar iteration handle more
gaps, including non-power-of-two cases.  This can be done by rounding
up the remaining access to the next power of two, which ensures that
the next scalar iteration will pick up at least the number of excess
elements we access.
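
As a rough illustration (not part of the patch), the rounding can be
sketched with plain integers.  The helpers next_pow2 and
excess_elements are made-up names; next_pow2 stands in for the
1 << ceil_log2 (cremain) expression used in the diff, and the element
counts are assumed to be constants.

```c
#include <assert.h>

/* Smallest power of two that is >= n.  Hypothetical stand-in for
   1 << ceil_log2 (n); valid for 1 <= n <= 2^31.  */
static unsigned
next_pow2 (unsigned n)
{
  assert (n >= 1 && n <= 0x80000000u);
  unsigned p = 1;
  while (p < n)
    p <<= 1;
  return p;
}

/* Number of elements the narrowed last access reads beyond the
   remain elements that are actually needed.  */
static unsigned
excess_elements (unsigned remain)
{
  return next_pow2 (remain) - remain;
}
```

If the remaining access is 3 or 5 elements, as the group sizes in the
new testcases suggest, the narrowed access is 4 or 8 elements wide and
reads 1 or 3 excess elements, fewer than one more scalar iteration of
the same group would touch.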

I've added a correctness testcase and an x86-specific one scanning
for the optimization.

	PR tree-optimization/115385
	* tree-vect-stmts.cc (get_group_load_store_type): Peeling
	of a single scalar iteration is sufficient if we can narrow
	the access to the next power of two of the bits in the last
	access.
	(vectorizable_load): Ensure that the last access is narrowed.

	* gcc.dg/vect/pr115385.c: New testcase.
	* gcc.target/i386/vect-pr115385.c: Likewise.
---
 gcc/testsuite/gcc.dg/vect/pr115385.c          | 88 +++++++++++++++++++
 gcc/testsuite/gcc.target/i386/vect-pr115385.c | 53 +++++++++++
 gcc/tree-vect-stmts.cc                        | 44 ++++++++--
 3 files changed, 180 insertions(+), 5 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/vect/pr115385.c
 create mode 100644 gcc/testsuite/gcc.target/i386/vect-pr115385.c

diff --git a/gcc/testsuite/gcc.dg/vect/pr115385.c b/gcc/testsuite/gcc.dg/vect/pr115385.c
new file mode 100644
index 00000000000..a18cd665d7d
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/pr115385.c
@@ -0,0 +1,88 @@
+/* { dg-require-effective-target mmap } */
+
+#include <sys/mman.h>
+#include <stdio.h>
+
+#define COUNT 511
+#define MMAP_SIZE 0x20000
+#define ADDRESS 0x1122000000
+#define TYPE unsigned char
+
+#ifndef MAP_ANONYMOUS
+#define MAP_ANONYMOUS MAP_ANON
+#endif
+
+void __attribute__((noipa)) foo(TYPE * __restrict x,
+                                TYPE *y, int n)
+{
+  for (int i = 0; i < n; ++i)
+    {
+      x[16*i+0] = y[3*i+0];
+      x[16*i+1] = y[3*i+1];
+      x[16*i+2] = y[3*i+2];
+      x[16*i+3] = y[3*i+0];
+      x[16*i+4] = y[3*i+1];
+      x[16*i+5] = y[3*i+2];
+      x[16*i+6] = y[3*i+0];
+      x[16*i+7] = y[3*i+1];
+      x[16*i+8] = y[3*i+2];
+      x[16*i+9] = y[3*i+0];
+      x[16*i+10] = y[3*i+1];
+      x[16*i+11] = y[3*i+2];
+      x[16*i+12] = y[3*i+0];
+      x[16*i+13] = y[3*i+1];
+      x[16*i+14] = y[3*i+2];
+      x[16*i+15] = y[3*i+0];
+    }
+}
+
+void __attribute__((noipa)) bar(TYPE * __restrict x,
+                                TYPE *y, int n)
+{
+  for (int i = 0; i < n; ++i)
+    {
+      x[16*i+0] = y[5*i+0];
+      x[16*i+1] = y[5*i+1];
+      x[16*i+2] = y[5*i+2];
+      x[16*i+3] = y[5*i+3];
+      x[16*i+4] = y[5*i+4];
+      x[16*i+5] = y[5*i+0];
+      x[16*i+6] = y[5*i+1];
+      x[16*i+7] = y[5*i+2];
+      x[16*i+8] = y[5*i+3];
+      x[16*i+9] = y[5*i+4];
+      x[16*i+10] = y[5*i+0];
+      x[16*i+11] = y[5*i+1];
+      x[16*i+12] = y[5*i+2];
+      x[16*i+13] = y[5*i+3];
+      x[16*i+14] = y[5*i+4];
+      x[16*i+15] = y[5*i+0];
+    }
+}
+
+TYPE x[COUNT * 16];
+
+int
+main (void)
+{
+  void *y;
+  TYPE *end_y;
+
+  y = mmap ((void *) ADDRESS, MMAP_SIZE, PROT_READ | PROT_WRITE,
+            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+  if (y == MAP_FAILED)
+    {
+      perror ("mmap");
+      return 1;
+    }
+
+  end_y = (TYPE *) ((char *) y + MMAP_SIZE);
+
+  foo (x, end_y - COUNT * 3, COUNT);
+  bar (x, end_y - COUNT * 5, COUNT);
+
+  return 0;
+}
+
+/* We always require a scalar epilogue here but we don't know which
+   targets support vector composition this way.  */
diff --git a/gcc/testsuite/gcc.target/i386/vect-pr115385.c b/gcc/testsuite/gcc.target/i386/vect-pr115385.c
new file mode 100644
index 00000000000..a6be9ce4e54
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/vect-pr115385.c
@@ -0,0 +1,53 @@
+/* { dg-do compile } */
+/* { dg-options "-O3 -msse4.1 -mno-avx -fdump-tree-vect-details" } */
+
+void __attribute__((noipa)) foo(unsigned char * __restrict x,
+                                unsigned char *y, int n)
+{
+  for (int i = 0; i < n; ++i)
+    {
+      x[16*i+0] = y[3*i+0];
+      x[16*i+1] = y[3*i+1];
+      x[16*i+2] = y[3*i+2];
+      x[16*i+3] = y[3*i+0];
+      x[16*i+4] = y[3*i+1];
+      x[16*i+5] = y[3*i+2];
+      x[16*i+6] = y[3*i+0];
+      x[16*i+7] = y[3*i+1];
+      x[16*i+8] = y[3*i+2];
+      x[16*i+9] = y[3*i+0];
+      x[16*i+10] = y[3*i+1];
+      x[16*i+11] = y[3*i+2];
+      x[16*i+12] = y[3*i+0];
+      x[16*i+13] = y[3*i+1];
+      x[16*i+14] = y[3*i+2];
+      x[16*i+15] = y[3*i+0];
+    }
+}
+
+void __attribute__((noipa)) bar(unsigned char * __restrict x,
+                                unsigned char *y, int n)
+{
+  for (int i = 0; i < n; ++i)
+    {
+      x[16*i+0] = y[5*i+0];
+      x[16*i+1] = y[5*i+1];
+      x[16*i+2] = y[5*i+2];
+      x[16*i+3] = y[5*i+3];
+      x[16*i+4] = y[5*i+4];
+      x[16*i+5] = y[5*i+0];
+      x[16*i+6] = y[5*i+1];
+      x[16*i+7] = y[5*i+2];
+      x[16*i+8] = y[5*i+3];
+      x[16*i+9] = y[5*i+4];
+      x[16*i+10] = y[5*i+0];
+      x[16*i+11] = y[5*i+1];
+      x[16*i+12] = y[5*i+2];
+      x[16*i+13] = y[5*i+3];
+      x[16*i+14] = y[5*i+4];
+      x[16*i+15] = y[5*i+0];
+    }
+}
+
+/* { dg-final { scan-tree-dump "Data access with gaps requires scalar epilogue loop" "vect"} } */
+/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 2 "vect"} } */
diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
index f8c4b33878d..701a44e44cd 100644
--- a/gcc/tree-vect-stmts.cc
+++ b/gcc/tree-vect-stmts.cc
@@ -2151,11 +2151,24 @@ get_group_load_store_type (vec_info *vinfo, stmt_vec_info stmt_info,
 				    nunits, &tem, &remain)
 		  || maybe_lt (remain + group_size, nunits)))
 	    {
-	      if (dump_enabled_p ())
-		dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
-				 "peeling for gaps insufficient for "
-				 "access\n");
-	      return false;
+	      /* But peeling a single scalar iteration is enough if
+		 we can use the next power-of-two sized partial
+		 access.  */
+	      unsigned HOST_WIDE_INT cnunits, cvf, cremain, cpart_size;
+	      if (!nunits.is_constant (&cnunits)
+		  || !LOOP_VINFO_VECT_FACTOR (loop_vinfo).is_constant (&cvf)
+		  || ((cremain = remain.to_constant (), true)
+		      && ((cpart_size = (1 << ceil_log2 (cremain))) != cnunits)
+		      && vector_vector_composition_type
+			   (vectype, cnunits / cpart_size,
+			    &half_vtype) == NULL_TREE))
+		{
+		  if (dump_enabled_p ())
+		    dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
+				     "peeling for gaps insufficient for "
+				     "access\n");
+		  return false;
+		}
 	    }
 
 	  /* If this is single-element interleaving with an element
@@ -11597,6 +11610,27 @@ vectorizable_load (vec_info *vinfo,
 			      gcc_assert (new_vtype
 					  || LOOP_VINFO_PEELING_FOR_GAPS
 					       (loop_vinfo));
+			    /* But still reduce the access size to the next
+			       required power-of-two so peeling a single
+			       scalar iteration is sufficient.  */
+			    unsigned HOST_WIDE_INT cremain;
+			    if (remain.is_constant (&cremain))
+			      {
+				unsigned HOST_WIDE_INT cpart_size
+				  = 1 << ceil_log2 (cremain);
+				if (known_gt (nunits, cpart_size)
+				    && constant_multiple_p (nunits, cpart_size,
+							    &num))
+				  {
+				    tree ptype;
+				    new_vtype
+				      = vector_vector_composition_type (vectype,
+									num,
+									&ptype);
+				    if (new_vtype)
+				      ltype = ptype;
+				  }
+			      }
 			  }
 		      }
 		    tree offset

From patchwork Wed Jun 12 09:16:03 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Richard Biener <rguenther@suse.de>
X-Patchwork-Id: 1946758
Date: Wed, 12 Jun 2024 11:16:03 +0200 (CEST)
From: Richard Biener <rguenther@suse.de>
To: gcc-patches@gcc.gnu.org
Subject: [PATCH 3/3][v3] Improve code generation of strided SLP loads
MIME-Version: 1.0
X-BeenThere: gcc-patches@gcc.gnu.org
X-Mailman-Version: 2.1.30
Precedence: list
List-Id: Gcc-patches mailing list <gcc-patches.gcc.gnu.org>
List-Unsubscribe: <https://gcc.gnu.org/mailman/options/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=unsubscribe>
List-Archive: <https://gcc.gnu.org/pipermail/gcc-patches/>
List-Post: <mailto:gcc-patches@gcc.gnu.org>
List-Help: <mailto:gcc-patches-request@gcc.gnu.org?subject=help>
List-Subscribe: <https://gcc.gnu.org/mailman/listinfo/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=subscribe>
Errors-To: gcc-patches-bounces+incoming=patchwork.ozlabs.org@gcc.gnu.org
Message-Id: <20240612091715.22DAE385DDF1@sourceware.org>

This avoids falling back to elementwise accesses for strided SLP
loads when the group size does not evenly divide the number of vector
elements.  Instead we can load pieces of gcd (group size, number of
vector elements) elements using a smaller vector or integer type.
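
As a rough illustration of that selection, here is a minimal sketch.
It is not the GCC code: strided_access and choose_strided_access are
hypothetical names, and the real load path additionally requires
vector_vector_composition_type to succeed before it uses sub-vector
pieces.

```cpp
#include <cassert>
#include <numeric>   // std::gcd

// Hypothetical model of the choice made for a strided SLP access.
struct strided_access
{
  unsigned pieces;          // accesses per vector (nloads / nstores)
  unsigned elts_per_piece;  // elements per access (lnel)
};

inline strided_access
choose_strided_access (unsigned group_size, unsigned nunits)
{
  assert (group_size > 0 && nunits > 0);
  unsigned n = std::gcd (group_size, nunits);
  if (n == nunits)
    return { 1, nunits };      // group size is a multiple of nunits
  if (n > 1)
    return { nunits / n, n };  // sub-vector pieces of n elements
  return { nunits, 1 };        // element-wise fallback
}
```

Under these assumptions a 6-element group with 4-element vectors gives
two 2-element pieces per vector, while a 3-element group still ends up
element-wise.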

For stores we can do the same, though restrictions on the stores we
handle and the fact that store-merging covers up for this make it
mostly effective for cost modeling.  This shows for
gcc.target/i386/vect-strided-3.c, which we now vectorize with V4SI
vectors rather than just V2SI ones.

For all of this there's still the opportunity to use non-uniform
accesses, say for a 6-element group with a VF of two do
V4SI, { V2SI, V2SI }, V4SI.  But that's for a possible followup.
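
To make the non-uniform idea concrete, the sketch below splits each
vector at group boundaries and reports the resulting piece sizes.  It
only illustrates the example above and is not a proposed
implementation: split_at_group_boundaries is a hypothetical name, and
a real followup would still have to break pieces into sizes the target
supports.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// For each of the vectors covering VF groups, list the lengths of the
// maximal runs of elements that stay within a single group.
inline std::vector<std::vector<unsigned>>
split_at_group_boundaries (unsigned group_size, unsigned nunits,
			   unsigned vf)
{
  assert (group_size > 0 && nunits > 0);
  assert ((group_size * vf) % nunits == 0);
  std::vector<std::vector<unsigned>> result;
  unsigned total = group_size * vf;
  for (unsigned start = 0; start < total; start += nunits)
    {
      std::vector<unsigned> pieces;
      unsigned pos = start;
      while (pos < start + nunits)
	{
	  // End of the group that element POS belongs to.
	  unsigned group_end = (pos / group_size + 1) * group_size;
	  unsigned len = std::min (group_end, start + nunits) - pos;
	  pieces.push_back (len);
	  pos += len;
	}
      result.push_back (pieces);
    }
  return result;
}
```

For a 6-element group, 4-element vectors and a VF of two this yields
{ 4 }, { 2, 2 }, { 4 }, matching the V4SI, { V2SI, V2SI }, V4SI
sequence mentioned above.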

	* gcc.target/i386/vect-strided-1.c: New testcase.
	* gcc.target/i386/vect-strided-2.c: Likewise.
	* gcc.target/i386/vect-strided-3.c: Likewise.
	* gcc.target/i386/vect-strided-4.c: Likewise.
---
 .../gcc.target/i386/vect-strided-1.c          |  24 +++++
 .../gcc.target/i386/vect-strided-2.c          |  17 +++
 .../gcc.target/i386/vect-strided-3.c          |  20 ++++
 .../gcc.target/i386/vect-strided-4.c          |  20 ++++
 gcc/tree-vect-stmts.cc                        | 100 ++++++++----------
 5 files changed, 127 insertions(+), 54 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/i386/vect-strided-1.c
 create mode 100644 gcc/testsuite/gcc.target/i386/vect-strided-2.c
 create mode 100644 gcc/testsuite/gcc.target/i386/vect-strided-3.c
 create mode 100644 gcc/testsuite/gcc.target/i386/vect-strided-4.c

diff --git a/gcc/testsuite/gcc.target/i386/vect-strided-1.c b/gcc/testsuite/gcc.target/i386/vect-strided-1.c
new file mode 100644
index 00000000000..db4a06711f1
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/vect-strided-1.c
@@ -0,0 +1,24 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -msse2 -mno-avx" } */
+
+void foo (int * __restrict a, int *b, int s)
+{
+  for (int i = 0; i < 1024; ++i)
+    {
+      a[8*i+0] = b[s*i+0];
+      a[8*i+1] = b[s*i+1];
+      a[8*i+2] = b[s*i+2];
+      a[8*i+3] = b[s*i+3];
+      a[8*i+4] = b[s*i+4];
+      a[8*i+5] = b[s*i+5];
+      a[8*i+6] = b[s*i+4];
+      a[8*i+7] = b[s*i+5];
+    }
+}
+
+/* Three two-element loads, two four-element stores.  On ia32 we elide
+   a permute and perform a redundant load.  */
+/* { dg-final { scan-assembler-times "movq" 2 } } */
+/* { dg-final { scan-assembler-times "movhps" 2 { target ia32 } } } */
+/* { dg-final { scan-assembler-times "movhps" 1 { target { ! ia32 } } } } */
+/* { dg-final { scan-assembler-times "movups" 2 } } */
diff --git a/gcc/testsuite/gcc.target/i386/vect-strided-2.c b/gcc/testsuite/gcc.target/i386/vect-strided-2.c
new file mode 100644
index 00000000000..6fd64e28cf0
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/vect-strided-2.c
@@ -0,0 +1,17 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -msse2 -mno-avx" } */
+
+void foo (int * __restrict a, int *b, int s)
+{
+  for (int i = 0; i < 1024; ++i)
+    {
+      a[4*i+0] = b[s*i+0];
+      a[4*i+1] = b[s*i+1];
+      a[4*i+2] = b[s*i+0];
+      a[4*i+3] = b[s*i+1];
+    }
+}
+
+/* One two-element load, one four-element store.  */
+/* { dg-final { scan-assembler-times "movq" 1 } } */
+/* { dg-final { scan-assembler-times "movups" 1 } } */
diff --git a/gcc/testsuite/gcc.target/i386/vect-strided-3.c b/gcc/testsuite/gcc.target/i386/vect-strided-3.c
new file mode 100644
index 00000000000..b462701a0b2
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/vect-strided-3.c
@@ -0,0 +1,20 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -msse2 -mno-avx -fno-tree-slp-vectorize" } */
+
+void foo (int * __restrict a, int *b, int s)
+{
+  if (s >= 6)
+    for (int i = 0; i < 1024; ++i)
+      {
+	a[s*i+0] = b[4*i+0];
+	a[s*i+1] = b[4*i+1];
+	a[s*i+2] = b[4*i+2];
+	a[s*i+3] = b[4*i+3];
+	a[s*i+4] = b[4*i+0];
+	a[s*i+5] = b[4*i+1];
+      }
+}
+
+/* The vectorizer generates 6 uint64 stores.  */
+/* { dg-final { scan-assembler-times "movq" 4 } } */
+/* { dg-final { scan-assembler-times "movhps" 2 } } */
diff --git a/gcc/testsuite/gcc.target/i386/vect-strided-4.c b/gcc/testsuite/gcc.target/i386/vect-strided-4.c
new file mode 100644
index 00000000000..dd922926a2a
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/vect-strided-4.c
@@ -0,0 +1,20 @@
+/* { dg-do compile } */
+/* { dg-options "-O2 -msse4.2 -mno-avx -fno-tree-slp-vectorize" } */
+
+void foo (int * __restrict a, int * __restrict b, int *c, int s)
+{
+  if (s >= 2)
+    for (int i = 0; i < 1024; ++i)
+      {
+	a[s*i+0] = c[4*i+0];
+	a[s*i+1] = c[4*i+1];
+	b[s*i+0] = c[4*i+2];
+	b[s*i+1] = c[4*i+3];
+      }
+}
+
+/* Vectorization factor two, two two-element stores to a using movq
+   and two two-element stores to b via pextrq/movhps of the high part.  */
+/* { dg-final { scan-assembler-times "movq" 2 } } */
+/* { dg-final { scan-assembler-times "pextrq" 2 { target { ! ia32 } } } } */
+/* { dg-final { scan-assembler-times "movhps" 2 { target { ia32 } } } } */
diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
index 701a44e44cd..d148e11a514 100644
--- a/gcc/tree-vect-stmts.cc
+++ b/gcc/tree-vect-stmts.cc
@@ -2036,15 +2036,10 @@ get_group_load_store_type (vec_info *vinfo, stmt_vec_info stmt_info,
 	first_dr_info
 	  = STMT_VINFO_DR_INFO (SLP_TREE_SCALAR_STMTS (slp_node)[0]);
       if (STMT_VINFO_STRIDED_P (first_stmt_info))
-	{
-	  /* Try to use consecutive accesses of DR_GROUP_SIZE elements,
-	     separated by the stride, until we have a complete vector.
-	     Fall back to scalar accesses if that isn't possible.  */
-	  if (multiple_p (nunits, group_size))
-	    *memory_access_type = VMAT_STRIDED_SLP;
-	  else
-	    *memory_access_type = VMAT_ELEMENTWISE;
-	}
+	/* Try to use consecutive accesses of as many elements as possible,
+	   separated by the stride, until we have a complete vector.
+	   Fall back to scalar accesses if that isn't possible.  */
+	*memory_access_type = VMAT_STRIDED_SLP;
       else
 	{
 	  int cmp = compare_step_with_zero (vinfo, stmt_info);
@@ -8514,12 +8509,29 @@ vectorizable_store (vec_info *vinfo,
       tree lvectype = vectype;
       if (slp)
 	{
-	  if (group_size < const_nunits
-	      && const_nunits % group_size == 0)
+	  HOST_WIDE_INT n = gcd (group_size, const_nunits);
+	  if (n == const_nunits)
 	    {
-	      nstores = const_nunits / group_size;
-	      lnel = group_size;
-	      ltype = build_vector_type (elem_type, group_size);
+	      int mis_align = dr_misalignment (first_dr_info, vectype);
+	      dr_alignment_support dr_align
+		= vect_supportable_dr_alignment (vinfo, dr_info, vectype,
+						 mis_align);
+	      if (dr_align == dr_aligned
+		  || dr_align == dr_unaligned_supported)
+		{
+		  nstores = 1;
+		  lnel = const_nunits;
+		  ltype = vectype;
+		  lvectype = vectype;
+		  alignment_support_scheme = dr_align;
+		  misalignment = mis_align;
+		}
+	    }
+	  else if (n > 1)
+	    {
+	      nstores = const_nunits / n;
+	      lnel = n;
+	      ltype = build_vector_type (elem_type, n);
 	      lvectype = vectype;
 
 	      /* First check if vec_extract optab doesn't support extraction
@@ -8528,7 +8540,7 @@ vectorizable_store (vec_info *vinfo,
 	      machine_mode vmode;
 	      if (!VECTOR_MODE_P (TYPE_MODE (vectype))
 		  || !related_vector_mode (TYPE_MODE (vectype), elmode,
-					   group_size).exists (&vmode)
+					   n).exists (&vmode)
 		  || (convert_optab_handler (vec_extract_optab,
 					     TYPE_MODE (vectype), vmode)
 		      == CODE_FOR_nothing))
@@ -8539,8 +8551,8 @@ vectorizable_store (vec_info *vinfo,
 		     re-interpreting it as the original vector type if
 		     supported.  */
 		  unsigned lsize
-		    = group_size * GET_MODE_BITSIZE (elmode);
-		  unsigned int lnunits = const_nunits / group_size;
+		    = n * GET_MODE_BITSIZE (elmode);
+		  unsigned int lnunits = const_nunits / n;
 		  /* If we can't construct such a vector fall back to
 		     element extracts from the original vector type and
 		     element size stores.  */
@@ -8553,7 +8565,7 @@ vectorizable_store (vec_info *vinfo,
 			  != CODE_FOR_nothing))
 		    {
 		      nstores = lnunits;
-		      lnel = group_size;
+		      lnel = n;
 		      ltype = build_nonstandard_integer_type (lsize, 1);
 		      lvectype = build_vector_type (ltype, nstores);
 		    }
@@ -8564,24 +8576,6 @@ vectorizable_store (vec_info *vinfo,
 		     issue exists here for reasonable archs.  */
 		}
 	    }
-	  else if (group_size >= const_nunits
-		   && group_size % const_nunits == 0)
-	    {
-	      int mis_align = dr_misalignment (first_dr_info, vectype);
-	      dr_alignment_support dr_align
-		= vect_supportable_dr_alignment (vinfo, dr_info, vectype,
-						 mis_align);
-	      if (dr_align == dr_aligned
-		  || dr_align == dr_unaligned_supported)
-		{
-		  nstores = 1;
-		  lnel = const_nunits;
-		  ltype = vectype;
-		  lvectype = vectype;
-		  alignment_support_scheme = dr_align;
-		  misalignment = mis_align;
-		}
-	    }
 	  ltype = build_aligned_type (ltype, TYPE_ALIGN (elem_type));
 	  ncopies = SLP_TREE_NUMBER_OF_VEC_STMTS (slp_node);
 	}
@@ -10366,34 +10360,32 @@ vectorizable_load (vec_info *vinfo,
       auto_vec<tree> dr_chain;
       if (memory_access_type == VMAT_STRIDED_SLP)
 	{
-	  if (group_size < const_nunits)
+	  HOST_WIDE_INT n = gcd (group_size, const_nunits);
+	  /* Use the target vector type if the group size is a multiple
+	     of its number of elements.  */
+	  if (n == const_nunits)
+	    {
+	      nloads = 1;
+	      lnel = const_nunits;
+	      ltype = vectype;
+	    }
+	  /* Else use the biggest vector with which we can load the group
+	     without accessing excess elements.  */
+	  else if (n > 1)
 	    {
-	      /* First check if vec_init optab supports construction from vector
-		 elts directly.  Otherwise avoid emitting a constructor of
-		 vector elements by performing the loads using an integer type
-		 of the same size, constructing a vector of those and then
-		 re-interpreting it as the original vector type.  This avoids a
-		 huge runtime penalty due to the general inability to perform
-		 store forwarding from smaller stores to a larger load.  */
 	      tree ptype;
 	      tree vtype
-		= vector_vector_composition_type (vectype,
-						  const_nunits / group_size,
+		= vector_vector_composition_type (vectype, const_nunits / n,
 						  &ptype);
 	      if (vtype != NULL_TREE)
 		{
-		  nloads = const_nunits / group_size;
-		  lnel = group_size;
+		  nloads = const_nunits / n;
+		  lnel = n;
 		  lvectype = vtype;
 		  ltype = ptype;
 		}
 	    }
-	  else
-	    {
-	      nloads = 1;
-	      lnel = const_nunits;
-	      ltype = vectype;
-	    }
+	  /* Else fall back to the default element-wise access.  */
 	  ltype = build_aligned_type (ltype, TYPE_ALIGN (TREE_TYPE (vectype)));
 	}
       /* Load vector(1) scalar_type if it's 1 element-wise vectype.  */