From patchwork Wed May 22 02:04:37 2019
X-Patchwork-Submitter: Kugan Vivekanandarajah
X-Patchwork-Id: 1103091
Mailing-List: gcc-patches@gcc.gnu.org; run by ezmlm
From: Kugan Vivekanandarajah
Date: Wed, 22 May 2019 12:04:37 +1000
Subject: [RFC][PR88838][SVE] Use 32-bit WHILELO in LP64 mode
To: GCC Patches

Hi,

The attached RFC patch attempts to use a 32-bit WHILELO in LP64 mode to
fix the PR.  Bootstrap and regression testing are ongoing.  In earlier
testing, I ran into an issue related to fwprop; I will tackle that based
on the feedback on this patch.
Thanks,
Kugan

From 4e9837ff9c0c080923f342e83574a6fdba2b3d92 Mon Sep 17 00:00:00 2001
From: Kugan Vivekanandarajah
Date: Tue, 5 Mar 2019 10:01:45 +1100
Subject: [PATCH] pr88838[v2]

As mentioned in PR88838, this patch avoids the SXTW by using WHILELO on
W registers instead of X registers.

As mentioned in the PR, vect_verify_full_masking checks which IV widths
are supported for WHILELO but prefers to go to Pmode width.  This is
because using Pmode allows ivopts to reuse the IV for indices (as in
the loads and stores above).  However, it would be better to use a
32-bit WHILELO with a truncated 64-bit IV if:

(a) the limit is extended from 32 bits, and
(b) the detection loop in vect_verify_full_masking detects that using a
    32-bit IV would be correct.

gcc/ChangeLog:

2019-05-22  Kugan Vivekanandarajah

	* tree-vect-loop-manip.c (vect_set_loop_masks_directly): If
	compare_type does not have Pmode size, create an IV of Pmode size
	and truncate its uses (i.e. convert them to the correct type).
	* tree-vect-loop.c (vect_verify_full_masking): Find which IV
	widths are supported for WHILELO.

gcc/testsuite/ChangeLog:

2019-05-22  Kugan Vivekanandarajah

	* gcc.target/aarch64/pr88838.c: New test.
	* gcc.target/aarch64/sve/while_1.c: Adjust.

Change-Id: Iff52946c28d468078f2cc0868d53edb05325b8ca
---
 gcc/fwprop.c                                   | 13 +++++++
 gcc/testsuite/gcc.target/aarch64/pr88838.c     | 11 ++++++
 gcc/testsuite/gcc.target/aarch64/sve/while_1.c | 16 ++++----
 gcc/tree-vect-loop-manip.c                     | 52 ++++++++++++++++++++++++--
 gcc/tree-vect-loop.c                           | 39 ++++++++++++++++++-
 5 files changed, 117 insertions(+), 14 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/pr88838.c

diff --git a/gcc/fwprop.c b/gcc/fwprop.c
index cf2c9de..5275ad3 100644
--- a/gcc/fwprop.c
+++ b/gcc/fwprop.c
@@ -1358,6 +1358,19 @@ forward_propagate_and_simplify (df_ref use, rtx_insn *def_insn, rtx def_set)
   else
     mode = GET_MODE (*loc);
 
+  /* TODO.  */
+  if (GET_MODE_CLASS (mode) != GET_MODE_CLASS (GET_MODE (reg)))
+    return false;
+  /* TODO.  We can't get the mode for
+     (set (reg:VNx16BI 109)
+          (unspec:VNx16BI [
+             (reg:SI 131)
+             (reg:SI 106)
+          ] UNSPEC_WHILE_LO))
+     Thus, bail out when it is an UNSPEC and the modes are not compatible.  */
+  if (GET_MODE_CLASS (mode) != GET_MODE_CLASS (GET_MODE (reg))
+      && GET_CODE (SET_SRC (use_set)) == UNSPEC)
+    return false;
   new_rtx = propagate_rtx (*loc, mode, reg, src,
 			   optimize_bb_for_speed_p (BLOCK_FOR_INSN (use_insn)));
diff --git a/gcc/testsuite/gcc.target/aarch64/pr88838.c b/gcc/testsuite/gcc.target/aarch64/pr88838.c
new file mode 100644
index 0000000..9d03c0a
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/pr88838.c
@@ -0,0 +1,11 @@
+/* { dg-do compile } */
+/* { dg-options "-S -O3 -march=armv8.2-a+sve" } */
+
+void
+f (int *restrict x, int *restrict y, int *restrict z, int n)
+{
+  for (int i = 0; i < n; i += 1)
+    x[i] = y[i] + z[i];
+}
+
+/* { dg-final { scan-assembler-not "sxtw" } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/while_1.c b/gcc/testsuite/gcc.target/aarch64/sve/while_1.c
index a93a04b..05a4860 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/while_1.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/while_1.c
@@ -26,14 +26,14 @@ TEST_ALL (ADD_LOOP)
 
 /* { dg-final { scan-assembler-not {\tuqdec} } } */
-/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.b, xzr,} 2 } } */
-/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.b, x[0-9]+,} 2 } } */
-/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.h, xzr,} 2 } } */
-/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.h, x[0-9]+,} 2 } } */
-/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.s, xzr,} 3 } } */
-/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.s, x[0-9]+,} 3 } } */
-/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.d, xzr,} 3 } } */
-/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.d, x[0-9]+,} 3 } } */
+/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.b, wzr,} 2 } } */
+/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.b, w[0-9]+,} 2 } } */
+/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.h, wzr,} 2 } } */
+/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.h, w[0-9]+,} 2 } } */
+/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.s, wzr,} 3 } } */
+/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.s, w[0-9]+,} 3 } } */
+/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.d, wzr,} 3 } } */
+/* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.d, w[0-9]+,} 3 } } */
 /* { dg-final { scan-assembler-times {\tld1b\tz[0-9]+\.b, p[0-7]/z, \[x0, x[0-9]+\]\n} 2 } } */
 /* { dg-final { scan-assembler-times {\tst1b\tz[0-9]+\.b, p[0-7], \[x0, x[0-9]+\]\n} 2 } } */
 /* { dg-final { scan-assembler-times {\tld1h\tz[0-9]+\.h, p[0-7]/z, \[x0, x[0-9]+, lsl 1\]\n} 2 } } */
diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c
index 77d3dac..d6452a1 100644
--- a/gcc/tree-vect-loop-manip.c
+++ b/gcc/tree-vect-loop-manip.c
@@ -418,7 +418,20 @@ vect_set_loop_masks_directly (struct loop *loop, loop_vec_info loop_vinfo,
   tree mask_type = rgm->mask_type;
   unsigned int nscalars_per_iter = rgm->max_nscalars_per_iter;
   poly_uint64 nscalars_per_mask = TYPE_VECTOR_SUBPARTS (mask_type);
-
+  bool convert = false;
+  tree iv_type = NULL_TREE;
+
+  /* If compare_type does not have Pmode size, create an IV of Pmode size
+     with truncated uses (i.e. converted to the correct type).  This is
+     because using Pmode allows ivopts to reuse the IV for indices (in the
+     loads and stores).  */
+  if (known_lt (GET_MODE_BITSIZE (TYPE_MODE (compare_type)),
+		GET_MODE_BITSIZE (Pmode)))
+    {
+      iv_type = build_nonstandard_integer_type (GET_MODE_BITSIZE (Pmode),
+						true);
+      convert = true;
+    }
   /* Calculate the maximum number of scalar values that the rgroup
      handles in total, the number that it handles for each iteration
      of the vector loop, and the number that it should skip during the
@@ -444,12 +457,43 @@ vect_set_loop_masks_directly (struct loop *loop, loop_vec_info loop_vinfo,
      processed.  */
   tree index_before_incr, index_after_incr;
   gimple_stmt_iterator incr_gsi;
+  gimple_stmt_iterator incr_gsi2;
   bool insert_after;
-  tree zero_index = build_int_cst (compare_type, 0);
+  tree zero_index;
   standard_iv_increment_position (loop, &incr_gsi, &insert_after);
-  create_iv (zero_index, nscalars_step, NULL_TREE, loop, &incr_gsi,
-	     insert_after, &index_before_incr, &index_after_incr);
+  if (convert)
+    {
+      /* We are creating an IV of Pmode type and converting its uses.  */
+      zero_index = build_int_cst (iv_type, 0);
+      tree step = build_int_cst (iv_type,
+				 LOOP_VINFO_VECT_FACTOR (loop_vinfo));
+      /* Create an IV of Pmode type.  */
+      create_iv (zero_index, step, NULL_TREE, loop, &incr_gsi,
+		 insert_after, &index_before_incr, &index_after_incr);
+      /* Create truncated index_before and index_after the increment.  */
+      tree index_before_incr_trunc = make_ssa_name (compare_type);
+      tree index_after_incr_trunc = make_ssa_name (compare_type);
+      gimple *incr_before_stmt = gimple_build_assign (index_before_incr_trunc,
+						      NOP_EXPR,
+						      index_before_incr);
+      gimple *incr_after_stmt = gimple_build_assign (index_after_incr_trunc,
+						     NOP_EXPR,
+						     index_after_incr);
+      incr_gsi2 = incr_gsi;
+      gsi_insert_before (&incr_gsi2, incr_before_stmt, GSI_NEW_STMT);
+      gsi_insert_after (&incr_gsi, incr_after_stmt, GSI_NEW_STMT);
+      index_before_incr = index_before_incr_trunc;
+      index_after_incr = index_after_incr_trunc;
+      zero_index = build_int_cst (compare_type, 0);
+    }
+  else
+    {
+      /* The IV is of Pmode compare_type; no conversion is needed.  */
+      zero_index = build_int_cst (compare_type, 0);
+      create_iv (zero_index, nscalars_step, NULL_TREE, loop, &incr_gsi,
+		 insert_after, &index_before_incr, &index_after_incr);
+    }
   tree test_index, test_limit, first_limit;
   gimple_stmt_iterator *test_gsi;
   if (might_wrap_p)
diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index bd81193..2769c86 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -1035,6 +1035,30 @@ vect_verify_full_masking (loop_vec_info loop_vinfo)
   /* Find a scalar mode for which WHILE_ULT is supported.  */
   opt_scalar_int_mode cmp_mode_iter;
   tree cmp_type = NULL_TREE;
+  tree niters_type = TREE_TYPE (LOOP_VINFO_NITERS (loop_vinfo));
+  tree niters_skip = LOOP_VINFO_MASK_SKIP_NITERS (loop_vinfo);
+  unsigned HOST_WIDE_INT max_vf = vect_max_vf (loop_vinfo);
+  widest_int iv_limit;
+  bool known_max_iters = max_loop_iterations (loop, &iv_limit);
+  if (known_max_iters)
+    {
+      if (niters_skip)
+	{
+	  /* Add the maximum number of skipped iterations to the
+	     maximum iteration count.  */
+	  if (TREE_CODE (niters_skip) == INTEGER_CST)
+	    iv_limit += wi::to_widest (niters_skip);
+	  else
+	    iv_limit += max_vf - 1;
+	}
+      /* IV_LIMIT is the maximum number of latch iterations, which is also
+	 the maximum in-range IV value.  Round this value down to the previous
+	 vector alignment boundary and then add an extra full iteration.  */
+      poly_uint64 vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
+      iv_limit = (iv_limit & -(int) known_alignment (vf)) + max_vf;
+    }
+
+  /* Get the vectorization factor in tree form.  */
 FOR_EACH_MODE_IN_CLASS (cmp_mode_iter, MODE_INT)
@@ -1045,12 +1069,23 @@ vect_verify_full_masking (loop_vec_info loop_vinfo)
       if (this_type
 	  && can_produce_all_loop_masks_p (loop_vinfo, this_type))
 	{
+	  /* See whether a zero-based IV would ever generate all-false masks
+	     before wrapping around.  */
+	  bool might_wrap_p
+	    = (!known_max_iters
+	       || (wi::min_precision
+		   (iv_limit
+		    * vect_get_max_nscalars_per_iter (loop_vinfo),
+		    UNSIGNED) > cmp_bits));
 	  /* Although we could stop as soon as we find a valid mode,
 	     it's often better to continue until we hit Pmode, since the
 	     operands to the WHILE are more likely to be reusable in
-	     address calculations.  */
+	     address calculations, unless the limit is extended from
+	     this_type.  */
 	  cmp_type = this_type;
-	  if (cmp_bits >= GET_MODE_BITSIZE (Pmode))
+	  if (cmp_bits >= GET_MODE_BITSIZE (Pmode)
+	      || (!might_wrap_p
+		  && (cmp_bits == TYPE_PRECISION (niters_type))))
 	    break;
 	}
     }
-- 
2.7.4