From patchwork Sat Jan  6 02:29:21 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: =?utf-8?b?6ZKf5bGF5ZOy?= <juzhe.zhong@rivai.ai>
X-Patchwork-Id: 1883171
Return-Path: <gcc-patches-bounces+incoming=patchwork.ozlabs.org@gcc.gnu.org>
X-Original-To: incoming@patchwork.ozlabs.org
Delivered-To: patchwork-incoming@legolas.ozlabs.org
Authentication-Results: legolas.ozlabs.org;
 spf=pass (sender SPF authorized) smtp.mailfrom=gcc.gnu.org
 (client-ip=8.43.85.97; helo=server2.sourceware.org;
 envelope-from=gcc-patches-bounces+incoming=patchwork.ozlabs.org@gcc.gnu.org;
 receiver=patchwork.ozlabs.org)
Received: from server2.sourceware.org (server2.sourceware.org [8.43.85.97])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
	 key-exchange X25519 server-signature ECDSA (secp384r1) server-digest SHA384)
	(No client certificate requested)
	by legolas.ozlabs.org (Postfix) with ESMTPS id 4T6PS41K53z1yPM
	for <incoming@patchwork.ozlabs.org>; Sat,  6 Jan 2024 13:30:12 +1100 (AEDT)
Received: from server2.sourceware.org (localhost [IPv6:::1])
	by sourceware.org (Postfix) with ESMTP id 2FBF0385803F
	for <incoming@patchwork.ozlabs.org>; Sat,  6 Jan 2024 02:30:10 +0000 (GMT)
X-Original-To: gcc-patches@gcc.gnu.org
Delivered-To: gcc-patches@gcc.gnu.org
Received: from smtpbgeu1.qq.com (smtpbgeu1.qq.com [52.59.177.22])
 by sourceware.org (Postfix) with ESMTPS id 8FA9C3858C98
 for <gcc-patches@gcc.gnu.org>; Sat,  6 Jan 2024 02:29:42 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org 8FA9C3858C98
Authentication-Results: sourceware.org;
 dmarc=none (p=none dis=none) header.from=rivai.ai
Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=rivai.ai
ARC-Filter: OpenARC Filter v1.0.0 sourceware.org 8FA9C3858C98
Authentication-Results: server2.sourceware.org;
 arc=none smtp.remote-ip=52.59.177.22
ARC-Seal: i=1; a=rsa-sha256; d=sourceware.org; s=key; t=1704508188; cv=none;
 b=bVYLxEbxFrCn8NRGfRjUXJhSaP0mqMjE9HYd2RqmitbftPabiw4Z7rI4gOPAbJQRZT/1EuXSKPzKcrsuMCmyFp7166eSGdh9b4WO8oOHFDjmbD8I4v/h8OLXYvdNk+ksZ+Ned9FcZgQwbEug8q2JY1WzY29uwsFJuvOb6p98EK4=
ARC-Message-Signature: i=1; a=rsa-sha256; d=sourceware.org; s=key;
 t=1704508188; c=relaxed/simple;
 bh=CCbQHBnpVaODF5DqzVywa0cjxD44sWQLOhDu7hhYaE8=;
 h=From:To:Subject:Date:Message-Id:MIME-Version;
 b=smtM+K8aA0v+A7YeQBHrw1fzcOWXQyATUx5Zgkv+n3GPpY6V1qFHowsjLN7Mtgo1FcZRieYTXXSXWSaRDneTG8e7n1Dvq2r3CS4+GYwEfs+aviTC/YiGZEcFBL1YW2pwsujaNTOs2UAm4MqSb0NqYaeATPnkNpJRvwhIeymI6XY=
ARC-Authentication-Results: i=1; server2.sourceware.org
X-QQ-mid: bizesmtp75t1704508163tweni7qf
Received: from rios-cad122.hadoop.rioslab.org ( [58.60.1.26])
 by bizesmtp.qq.com (ESMTP) with
 id ; Sat, 06 Jan 2024 10:29:22 +0800 (CST)
X-QQ-SSF: 01400000000000G0V000000A0000000
X-QQ-FEAT: k0mQ4ihyJQO9phCv5PZntwuSLJKWyRRDrLeeOITYcT2bOZv4mD2+K1Ho3Bevv
 wPk999WfL4x3fkjxtRrfQbg6eOLAWJz4Aij8E3CiOiVQXL5axSX60GR8ITt1iyR/B4poxY6
 sXOPSS9H3sFxefy1mOju8e/ftOFpAO52mS3cJ2lsiBv/KuqGc6QyaAuSu3Qg7hFvBSCm7XF
 l/+6JsJucSy8djYG/hGgyUkkZocBpyoyN+qmx23cL8trVf+8ejG8LIpij8idjcHMUqztdoi
 JioQgiReiE7CaHIqmwNDaz4ltwUIHuLuLWOGZKBTi5kp683EHunLpPUyraqlpFN8Ibd6rup
 tJy+0U9p6xlErZxoKaB1iz4gWTSALr/Tu9vE8XLIIG2ejg4hGVv/lD7f4MvwQ5ZjWv1T6E5
 0YIHIJf35Nw=
X-QQ-GoodBg: 2
X-BIZMAIL-ID: 15901390673188008368
From: Juzhe-Zhong <juzhe.zhong@rivai.ai>
To: gcc-patches@gcc.gnu.org
Cc: Juzhe-Zhong <juzhe.zhong@rivai.ai>
Subject: [Committed V2] RISC-V: Teach liveness computation loop invariant
 shift amount
Date: Sat,  6 Jan 2024 10:29:21 +0800
Message-Id: <20240106022921.1714868-1-juzhe.zhong@rivai.ai>
X-Mailer: git-send-email 2.36.3
MIME-Version: 1.0
X-QQ-SENDSIZE: 520
Feedback-ID: bizesmtp:rivai.ai:qybglogicsvrgz:qybglogicsvrgz7a-one-0
X-Spam-Status: No, score=-10.3 required=5.0 tests=BAYES_00, GIT_PATCH_0,
 KAM_DMARC_STATUS, RCVD_IN_BARRACUDACENTRAL, RCVD_IN_DNSWL_NONE,
 RCVD_IN_MSPIKE_H2, SPF_PASS, TXREP, T_SCC_BODY_TEXT_LINE,
 T_SPF_HELO_TEMPERROR autolearn=ham autolearn_force=no version=3.4.6
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on
 server2.sourceware.org
X-BeenThere: gcc-patches@gcc.gnu.org
X-Mailman-Version: 2.1.30
Precedence: list
List-Id: Gcc-patches mailing list <gcc-patches.gcc.gnu.org>
List-Unsubscribe: <https://gcc.gnu.org/mailman/options/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=unsubscribe>
List-Archive: <https://gcc.gnu.org/pipermail/gcc-patches/>
List-Post: <mailto:gcc-patches@gcc.gnu.org>
List-Help: <mailto:gcc-patches-request@gcc.gnu.org?subject=help>
List-Subscribe: <https://gcc.gnu.org/mailman/listinfo/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=subscribe>
Errors-To: gcc-patches-bounces+incoming=patchwork.ozlabs.org@gcc.gnu.org

1). We have vashl_optab, vashr_optab and vlshr_optab, which vectorize shifts with a vector shift amount,
that is, vectorization of 'a[i] >> x[i]', where the shift amount is loop variant.
2). We also have ashl_optab, ashr_optab and lshr_optab, which vectorize shifts with a scalar shift amount,
that is, vectorization of 'a[i] >> x', where the shift amount is loop invariant.

For case 2), we don't need to allocate a vector register group for the shift amount.

Consider the following case:

void
f (int *restrict a, int *restrict b, int *restrict c, int *restrict d, int x,
   int n)
{
  for (int i = 0; i < n; i++)
    {
      int tmp = b[i] >> x;
      int tmp2 = tmp * b[i];
      c[i] = tmp2 * b[i];
      d[i] = tmp * tmp2 * b[i] >> x;
    }
}

Before this patch, we chose LMUL = 4; with this patch, we can choose LMUL = 8:

f:
	ble	a5,zero,.L5
.L3:
	vsetvli	a0,a5,e32,m8,ta,ma
	slli	a6,a0,2
	vle32.v	v16,0(a1)
	vsra.vx	v24,v16,a4
	vmul.vv	v8,v24,v16
	vmul.vv	v0,v8,v16
	vse32.v	v0,0(a2)
	vmul.vv	v8,v8,v24
	vmul.vv	v8,v8,v16
	vsra.vx	v8,v8,a4
	vse32.v	v8,0(a3)
	add	a1,a1,a6
	add	a2,a2,a6
	add	a3,a3,a6
	sub	a5,a5,a0
	bne	a5,zero,.L3
.L5:
	ret
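
The register budget behind the LMUL choice can be sketched with a small
model.  This is only an illustration under stated assumptions, not the code
in riscv-vector-costs.cc: the helper names register_groups and
max_lmul_without_spill are hypothetical, and the peak of 4 live vector
values is read off the assembly above (v0, v8, v16, v24) rather than taken
from the vectorizer dump.  It assumes only that RVV has 32 vector registers,
so a value at LMUL = m occupies one of 32 / m register groups.

```c
#include <assert.h>

/* RVV provides 32 architectural vector registers.  */
#define NUM_VECTOR_REGS 32

/* Hypothetical helper: number of register groups available at LMUL.
   Only the integer LMUL values 1, 2, 4 and 8 are modelled.  */
int
register_groups (int lmul)
{
  assert (lmul == 1 || lmul == 2 || lmul == 4 || lmul == 8);
  return NUM_VECTOR_REGS / lmul;
}

/* Hypothetical helper: return the largest LMUL whose register groups can
   hold LIVE_VALUES simultaneously live vector values.  Falls back to 1
   when no larger LMUL fits.  */
int
max_lmul_without_spill (int live_values)
{
  for (int lmul = 8; lmul > 1; lmul /= 2)
    if (live_values <= register_groups (lmul))
      return lmul;
  return 1;
}
```

In this model, max_lmul_without_spill (4) returns 8, while counting the
shift amount as a fifth vector value, max_lmul_without_spill (5), returns 4,
which matches the LMUL = 4 versus LMUL = 8 difference described above.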

Tested on both RV32 and RV64 with no regressions.  Ok for trunk?

Note that we will apply the same heuristic to vadd.vx, etc. once the late-combine pass from
Richard Sandiford is committed (we need the late-combine pass to do the vv->vx transformation for vadd).
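
For illustration, the kind of loop the note above refers to would look
roughly like the following.  The function name add_invariant is
hypothetical and is not part of this patch or its tests.  The addend x is
loop invariant in the same way as the shift amount in f above, but whether
a vadd.vx is emitted for it depends on the late-combine pass mentioned in
the note.

```c
/* Hypothetical example: X is a loop-invariant addend, analogous to the
   loop-invariant shift amount handled by this patch.  */
void
add_invariant (int *restrict a, const int *restrict b, int x, int n)
{
  for (int i = 0; i < n; i++)
    a[i] = b[i] + x;
}
```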

gcc/ChangeLog:

	* config/riscv/riscv-vector-costs.cc (loop_invariant_op_p): New function.
	(variable_vectorized_p): Teach loop invariant.
	(has_unexpected_spills_p): Ditto.

gcc/testsuite/ChangeLog:

	* gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul4-12.c: New test.
	* gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul8-14.c: New test.
---
 gcc/config/riscv/riscv-vector-costs.cc        | 31 +++++++--
 .../costmodel/riscv/rvv/dynamic-lmul4-12.c    | 40 ++++++++++++
 .../costmodel/riscv/rvv/dynamic-lmul8-14.c    | 64 +++++++++++++++++++
 3 files changed, 131 insertions(+), 4 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul4-12.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul8-14.c

diff --git a/gcc/config/riscv/riscv-vector-costs.cc b/gcc/config/riscv/riscv-vector-costs.cc
index ec8156fbaf8..3bae581d6fd 100644
--- a/gcc/config/riscv/riscv-vector-costs.cc
+++ b/gcc/config/riscv/riscv-vector-costs.cc
@@ -230,9 +230,24 @@ get_biggest_mode (machine_mode mode1, machine_mode mode2)
   return mode1_size >= mode2_size ? mode1 : mode2;
 }
 
+/* Return true if OP is invariant.  */
+
+static bool
+loop_invariant_op_p (class loop *loop,
+		     tree op)
+{
+  if (is_gimple_constant (op))
+    return true;
+  if (SSA_NAME_IS_DEFAULT_DEF (op)
+      || !flow_bb_inside_loop_p (loop, gimple_bb (SSA_NAME_DEF_STMT (op))))
+    return true;
+  return gimple_uid (SSA_NAME_DEF_STMT (op)) & 1;
+}
+
 /* Return true if the variable should be counted into liveness.  */
 static bool
-variable_vectorized_p (stmt_vec_info stmt_info, tree var, bool lhs_p)
+variable_vectorized_p (class loop *loop, stmt_vec_info stmt_info, tree var,
+		       bool lhs_p)
 {
   if (!var)
     return false;
@@ -275,6 +290,10 @@ variable_vectorized_p (stmt_vec_info stmt_info, tree var, bool lhs_p)
 		 || !tree_fits_shwi_p (var)
 		 || !IN_RANGE (tree_to_shwi (var), -16, 15)
 		 || gimple_assign_rhs1 (stmt) != var;
+	case LSHIFT_EXPR:
+	case RSHIFT_EXPR:
+	  return gimple_assign_rhs2 (stmt) != var
+		 || !loop_invariant_op_p (loop, var);
 	default:
 	  break;
 	}
@@ -312,10 +331,12 @@ variable_vectorized_p (stmt_vec_info stmt_info, tree var, bool lhs_p)
    The live range of SSA 2 is [0, 4] in bb 3.  */
 static machine_mode
 compute_local_live_ranges (
+  loop_vec_info loop_vinfo,
   const hash_map<basic_block, vec<stmt_point>> &program_points_per_bb,
   hash_map<basic_block, hash_map<tree, pair>> &live_ranges_per_bb)
 {
   machine_mode biggest_mode = QImode;
+  class loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
   if (!program_points_per_bb.is_empty ())
     {
       auto_vec<tree> visited_vars;
@@ -339,7 +360,8 @@ compute_local_live_ranges (
 	      unsigned int point = program_point.point;
 	      gimple *stmt = program_point.stmt;
 	      tree lhs = gimple_get_lhs (stmt);
-	      if (variable_vectorized_p (program_point.stmt_info, lhs, true))
+	      if (variable_vectorized_p (loop, program_point.stmt_info, lhs,
+					 true))
 		{
 		  biggest_mode = get_biggest_mode (biggest_mode,
 						   TYPE_MODE (TREE_TYPE (lhs)));
@@ -356,7 +378,7 @@ compute_local_live_ranges (
 	      for (i = 0; i < gimple_num_args (stmt); i++)
 		{
 		  tree var = gimple_arg (stmt, i);
-		  if (variable_vectorized_p (program_point.stmt_info, var,
+		  if (variable_vectorized_p (loop, program_point.stmt_info, var,
 					     false))
 		    {
 		      biggest_mode
@@ -781,7 +803,8 @@ has_unexpected_spills_p (loop_vec_info loop_vinfo)
   /* Compute local live ranges.  */
   hash_map<basic_block, hash_map<tree, pair>> live_ranges_per_bb;
   machine_mode biggest_mode
-    = compute_local_live_ranges (program_points_per_bb, live_ranges_per_bb);
+    = compute_local_live_ranges (loop_vinfo, program_points_per_bb,
+				 live_ranges_per_bb);
 
   /* Update live ranges according to PHI.  */
   update_local_live_ranges (loop_vinfo, program_points_per_bb,
diff --git a/gcc/testsuite/gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul4-12.c b/gcc/testsuite/gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul4-12.c
new file mode 100644
index 00000000000..0cb492e611c
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul4-12.c
@@ -0,0 +1,40 @@
+/* { dg-do compile } */
+/* { dg-options "-march=rv64gcv -mabi=lp64d -O3 -ftree-vectorize --param riscv-autovec-lmul=dynamic -fdump-tree-vect-details" } */
+
+void
+f (int *restrict a, int *restrict b, int *restrict c, int *restrict d,
+   int *restrict x, int n)
+{
+  for (int i = 0; i < n; i++)
+    {
+      int tmp = b[i] >> x[i];
+      int tmp2 = tmp * b[i];
+      c[i] = tmp2 * b[i];
+      d[i] = tmp * tmp2 * b[i] >> x[i];
+    }
+}
+
+void
+f2 (int *restrict a, int *restrict b, int *restrict c, int *restrict d,
+    int *restrict x, int n)
+{
+  for (int i = 0; i < n; i++)
+    {
+      int tmp = b[i] << x[i];
+      int tmp2 = tmp * b[i];
+      c[i] = tmp2 * b[i];
+      d[i] = tmp * tmp2 * b[i] >> x[i];
+    }
+}
+
+/* { dg-final { scan-assembler-times {e32,m4} 2 } } */
+/* { dg-final { scan-assembler-not {csrr} } } */
+/* { dg-final { scan-assembler-not {jr} } } */
+/* { dg-final { scan-assembler-not {e32,m8} } } */
+/* { dg-final { scan-assembler-not {e32,m2} } } */
+/* { dg-final { scan-assembler-not {e32,m1} } } */
+/* { dg-final { scan-assembler-times {ret} 2 } } */
+/* { dg-final { scan-tree-dump-times "Preferring smaller LMUL loop because it has unexpected spills" 2 "vect" } } */
+/* { dg-final { scan-tree-dump-times "Maximum lmul = 8" 2 "vect" } } */
+/* { dg-final { scan-tree-dump-times "Maximum lmul = 4" 2 "vect" } } */
+/* { dg-final { scan-tree-dump-times "Maximum lmul = 2" 2 "vect" } } */
diff --git a/gcc/testsuite/gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul8-14.c b/gcc/testsuite/gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul8-14.c
new file mode 100644
index 00000000000..0d42c3b27cb
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul8-14.c
@@ -0,0 +1,64 @@
+/* { dg-do compile } */
+/* { dg-options "-march=rv64gcv -mabi=lp64d -O3 -ftree-vectorize --param riscv-autovec-lmul=dynamic -fdump-tree-vect-details" } */
+
+void
+f (int *restrict a, int *restrict b, int *restrict c, int *restrict d, int x,
+   int n)
+{
+  for (int i = 0; i < n; i++)
+    {
+      int tmp = b[i] >> x;
+      int tmp2 = tmp * b[i];
+      c[i] = tmp2 * b[i];
+      d[i] = tmp * tmp2 * b[i] >> x;
+    }
+}
+
+void
+f2 (int *restrict a, int *restrict b, int *restrict c, int *restrict d, int x,
+    int n)
+{
+  for (int i = 0; i < n; i++)
+    {
+      int tmp = b[i] << x;
+      int tmp2 = tmp * b[i];
+      c[i] = tmp2 * b[i];
+      d[i] = tmp * tmp2 * b[i] >> x;
+    }
+}
+
+void
+f3 (int *restrict a, int *restrict b, int *restrict c, int *restrict d, int n)
+{
+  for (int i = 0; i < n; i++)
+    {
+      int tmp = b[i] >> 17;
+      int tmp2 = tmp * b[i];
+      c[i] = tmp2 * b[i];
+      d[i] = tmp * tmp2 * b[i] >> 17;
+    }
+}
+
+void
+f4 (int *restrict a, int *restrict b, int *restrict c, int *restrict d, int n)
+{
+  for (int i = 0; i < n; i++)
+    {
+      int tmp = b[i] << 17;
+      int tmp2 = tmp * b[i];
+      c[i] = tmp2 * b[i];
+      d[i] = tmp * tmp2 * b[i] >> 17;
+    }
+}
+
+/* { dg-final { scan-assembler-times {e32,m8} 4 } } */
+/* { dg-final { scan-assembler-not {csrr} } } */
+/* { dg-final { scan-assembler-not {jr} } } */
+/* { dg-final { scan-assembler-not {e32,m4} } } */
+/* { dg-final { scan-assembler-not {e32,m2} } } */
+/* { dg-final { scan-assembler-not {e32,m1} } } */
+/* { dg-final { scan-assembler-times {ret} 4 } } */
+/* { dg-final { scan-tree-dump-not "Preferring smaller LMUL loop because it has unexpected spills" "vect" } } */
+/* { dg-final { scan-tree-dump-times "Maximum lmul = 8" 4 "vect" } } */
+/* { dg-final { scan-tree-dump-times "Maximum lmul = 4" 4 "vect" } } */
+/* { dg-final { scan-tree-dump-times "Maximum lmul = 2" 4 "vect" } } */