From patchwork Mon Jan 22 06:46:46 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: =?utf-8?b?6ZKf5bGF5ZOy?= <juzhe.zhong@rivai.ai>
X-Patchwork-Id: 1889018
Return-Path: <gcc-patches-bounces+incoming=patchwork.ozlabs.org@gcc.gnu.org>
X-Original-To: incoming@patchwork.ozlabs.org
Delivered-To: patchwork-incoming@legolas.ozlabs.org
Authentication-Results: legolas.ozlabs.org;
 spf=pass (sender SPF authorized) smtp.mailfrom=gcc.gnu.org
 (client-ip=8.43.85.97; helo=server2.sourceware.org;
 envelope-from=gcc-patches-bounces+incoming=patchwork.ozlabs.org@gcc.gnu.org;
 receiver=patchwork.ozlabs.org)
Received: from server2.sourceware.org (server2.sourceware.org [8.43.85.97])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
	 key-exchange X25519 server-signature ECDSA (secp384r1) server-digest SHA384)
	(No client certificate requested)
	by legolas.ozlabs.org (Postfix) with ESMTPS id 4TJLPR11rSz1yPv
	for <incoming@patchwork.ozlabs.org>; Mon, 22 Jan 2024 17:47:21 +1100 (AEDT)
Received: from server2.sourceware.org (localhost [IPv6:::1])
	by sourceware.org (Postfix) with ESMTP id 9D4AF385782D
	for <incoming@patchwork.ozlabs.org>; Mon, 22 Jan 2024 06:47:19 +0000 (GMT)
X-Original-To: gcc-patches@gcc.gnu.org
Delivered-To: gcc-patches@gcc.gnu.org
Received: from smtpbg153.qq.com (smtpbg153.qq.com [13.245.218.24])
 by sourceware.org (Postfix) with ESMTPS id 599D13858429
 for <gcc-patches@gcc.gnu.org>; Mon, 22 Jan 2024 06:46:56 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org 599D13858429
Authentication-Results: sourceware.org;
 dmarc=none (p=none dis=none) header.from=rivai.ai
Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=rivai.ai
ARC-Filter: OpenARC Filter v1.0.0 sourceware.org 599D13858429
Authentication-Results: server2.sourceware.org;
 arc=none smtp.remote-ip=13.245.218.24
ARC-Seal: i=1; a=rsa-sha256; d=sourceware.org; s=key; t=1705906021; cv=none;
 b=Afo7CVtpv/1N8FnnH4mMNAKLDuRZOqGrozIDpY9mTs+IxIdfvRGwbPYFb3PE2GvfR4cjFM2MRDKRLmZk+/306OZK6UFh7tisismvr+ViLj22IWYJvxlF3KbqPW50mjjGj2mBjbIa+vdQ2HrsiWCn4H3XFkTCjihW+09EuhO3dcA=
ARC-Message-Signature: i=1; a=rsa-sha256; d=sourceware.org; s=key;
 t=1705906021; c=relaxed/simple;
 bh=1bE+stOALgZpFDEXyGzwF6CDt+gXCeNT+6Tynbd/tfk=;
 h=From:To:Subject:Date:Message-Id:MIME-Version;
 b=SUGRXkyTB9dcX4qeOngBNqhdsUdSxO3Wk7KKqxDBg14xnxRChW/F4qO/SGDJXZEYKRX4GOYYjM05ZIl9GN4zmWSEm5PaTPuKhnzKK40httK54Dqp2lxA34HPOHIZ0/K9CphkWzlFWn25I0Tq6KgKY+0TISPJXXC423SbRtCsccg=
ARC-Authentication-Results: i=1; server2.sourceware.org
X-QQ-mid: bizesmtp83t1705906008tgxmefwf
X-QQ-Originating-IP: pnAdgWrtd4dBMzdX5teK0CSeoyK/V8v8CSlJBQhxtTo=
Received: from rios-cad121.hadoop.rioslab.org ( [58.60.1.9])
 by bizesmtp.qq.com (ESMTP) with
 id ; Mon, 22 Jan 2024 14:46:47 +0800 (CST)
X-QQ-SSF: 01400000000000G0V000000A0000000
X-QQ-FEAT: 90EFqYDyPxDbcZQEGK4hSUxzZ6eYADwyhJ72y6z4VL+8Pk/9tzjeYnfaXcCyc
 sZmY3qToPnjSWf16S084sBvcLjbPSxiPrn5SCTFWqAp9p7mrb9UQpZbxmbEDwNk4uDkFble
 SlQEStPeRsqzZ+sprwCFdlV4Y5yFv5+U52Xcdzfp13ZfY/aVb9sxEIRmdsMwDDZ7n+h39E6
 uuI3qp618ucJjJ7PZhuZAtMhQwtGdxmKV2JOxdj/Y0D1fiZqmDjQ1E8pfVnq3fVtxpOVf11
 4C9s185n5NgOXtpbDNU5YaVO3z+/BwZYxWwUGIHLB4BQMHGViEKjRXXiEXS9NqWPFIOpYWO
 VJ+rrJlFVfqTUnh+030wos8CUI+yaAnFITMt58gHjP6jSXfl9mcZ+lcW3uMDVqAY8FFB02I
 nhY3YsJxQzzI9gj1v1bNeg==
X-QQ-GoodBg: 2
X-BIZMAIL-ID: 8521345292727623073
From: Juzhe-Zhong <juzhe.zhong@rivai.ai>
To: gcc-patches@gcc.gnu.org
Cc: kito.cheng@gmail.com, kito.cheng@sifive.com, jeffreyalaw@gmail.com,
 rdapp.gcc@gmail.com, Juzhe-Zhong <juzhe.zhong@rivai.ai>
Subject: [PATCH] RISC-V: Lower vmv.v.x (avl = 1) into vmv.s.x
Date: Mon, 22 Jan 2024 14:46:46 +0800
Message-Id: <20240122064646.2001825-1-juzhe.zhong@rivai.ai>
X-Mailer: git-send-email 2.36.3
MIME-Version: 1.0
X-QQ-SENDSIZE: 520
Feedback-ID: bizesmtp:rivai.ai:qybglogicsvrgz:qybglogicsvrgz7a-one-0
X-Spam-Status: No, score=-12.1 required=5.0 tests=BAYES_00, GIT_PATCH_0,
 KAM_DMARC_STATUS, KAM_SHORT, RCVD_IN_DNSWL_NONE, RCVD_IN_MSPIKE_H2,
 SPF_HELO_PASS, SPF_PASS, TXREP,
 T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on
 server2.sourceware.org
X-BeenThere: gcc-patches@gcc.gnu.org
X-Mailman-Version: 2.1.30
Precedence: list
List-Id: Gcc-patches mailing list <gcc-patches.gcc.gnu.org>
List-Unsubscribe: <https://gcc.gnu.org/mailman/options/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=unsubscribe>
List-Archive: <https://gcc.gnu.org/pipermail/gcc-patches/>
List-Post: <mailto:gcc-patches@gcc.gnu.org>
List-Help: <mailto:gcc-patches-request@gcc.gnu.org?subject=help>
List-Subscribe: <https://gcc.gnu.org/mailman/listinfo/gcc-patches>,
 <mailto:gcc-patches-request@gcc.gnu.org?subject=subscribe>
Errors-To: gcc-patches-bounces+incoming=patchwork.ozlabs.org@gcc.gnu.org

Notice there is a AI benchmark, GCC vs Clang has 3% performance drop.

It's because Clang/LLVM has a simplification transform vmv.v.x (avl = 1) into vmv.s.x.

Since vmv.s.x has more flexible vsetvl demand than vmv.v.x that can allow us to have
better chances to fuse vsetvl.

Consider this following case:

void
foo (uint32_t *outputMat, uint32_t *inputMat)
{
  vuint32m1_t matRegIn0 = __riscv_vle32_v_u32m1 (inputMat, 4);
  vuint32m1_t matRegIn1 = __riscv_vle32_v_u32m1 (inputMat + 4, 4);
  vuint32m1_t matRegIn2 = __riscv_vle32_v_u32m1 (inputMat + 8, 4);
  vuint32m1_t matRegIn3 = __riscv_vle32_v_u32m1 (inputMat + 12, 4);

  vbool32_t oddMask
    = __riscv_vreinterpret_v_u32m1_b32 (__riscv_vmv_v_x_u32m1 (0xaaaa, 1));

  vuint32m1_t smallTransposeMat0
    = __riscv_vslideup_vx_u32m1_tumu (oddMask, matRegIn0, matRegIn1, 1, 4);
  vuint32m1_t smallTransposeMat2
    = __riscv_vslideup_vx_u32m1_tumu (oddMask, matRegIn2, matRegIn3, 1, 4);

  vuint32m1_t outMat0 = __riscv_vslideup_vx_u32m1_tu (smallTransposeMat0,
						      smallTransposeMat2, 2, 4);

  __riscv_vse32_v_u32m1 (outputMat, outMat0, 4);
}

Before this patch:

        vsetivli        zero,4,e32,m1,ta,ma
        li      a5,45056
        addi    a2,a1,16
        addi    a3,a1,32
        addi    a4,a1,48
        vle32.v v1,0(a1)
        vle32.v v4,0(a2)
        vle32.v v2,0(a3)
        vle32.v v3,0(a4)
        addiw   a5,a5,-1366
        vsetivli        zero,1,e32,m1,ta,ma
        vmv.v.x v0,a5                         ---> Since it avl = 1, we can transform it into vmv.s.x
        vsetivli        zero,4,e32,m1,tu,mu
        vslideup.vi     v1,v4,1,v0.t
        vslideup.vi     v2,v3,1,v0.t
        vslideup.vi     v1,v2,2
        vse32.v v1,0(a0)
        ret

After this patch:

	li	a5,45056
	addi	a2,a1,16
	vsetivli	zero,4,e32,m1,tu,mu
	addiw	a5,a5,-1366
	vle32.v	v3,0(a2)
	addi	a3,a1,32
	addi	a4,a1,48
	vle32.v	v1,0(a1)
	vmv.s.x	v0,a5
	vle32.v	v2,0(a3)
	vslideup.vi	v1,v3,1,v0.t
	vle32.v	v3,0(a4)
	vslideup.vi	v2,v3,1,v0.t
	vslideup.vi	v1,v2,2
	vse32.v	v1,0(a0)
	ret

Tested on both RV32 and RV64 no regression.

gcc/ChangeLog:

	* config/riscv/riscv-protos.h (splat_to_scalar_move_p): New function.
	* config/riscv/riscv-v.cc (splat_to_scalar_move_p): Ditto.
	* config/riscv/vector.md: Simplify vmv.v.x. into vmv.s.x.

gcc/testsuite/ChangeLog:

	* gcc.target/riscv/rvv/vsetvl/attribute-2.c: New test.
	* gcc.target/riscv/rvv/vsetvl/attribute-3.c: New test.
---
 gcc/config/riscv/riscv-protos.h               |  1 +
 gcc/config/riscv/riscv-v.cc                   | 12 ++++++
 gcc/config/riscv/vector.md                    |  9 ++++-
 .../gcc.target/riscv/rvv/vsetvl/attribute-2.c | 37 +++++++++++++++++++
 .../gcc.target/riscv/rvv/vsetvl/attribute-3.c | 36 ++++++++++++++++++
 5 files changed, 94 insertions(+), 1 deletion(-)
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/vsetvl/attribute-2.c
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/vsetvl/attribute-3.c

diff --git a/gcc/config/riscv/riscv-protos.h b/gcc/config/riscv/riscv-protos.h
index 7fe26fcd939..b3f0bdb9924 100644
--- a/gcc/config/riscv/riscv-protos.h
+++ b/gcc/config/riscv/riscv-protos.h
@@ -708,6 +708,7 @@ bool can_be_broadcasted_p (rtx);
 bool gather_scatter_valid_offset_p (machine_mode);
 HOST_WIDE_INT estimated_poly_value (poly_int64, unsigned int);
 bool whole_reg_to_reg_move_p (rtx *, machine_mode, int);
+bool splat_to_scalar_move_p (rtx *);
 }
 
 /* We classify builtin types into two classes:
diff --git a/gcc/config/riscv/riscv-v.cc b/gcc/config/riscv/riscv-v.cc
index 93a1238a5ab..4bacb7fea45 100644
--- a/gcc/config/riscv/riscv-v.cc
+++ b/gcc/config/riscv/riscv-v.cc
@@ -5151,4 +5151,16 @@ whole_reg_to_reg_move_p (rtx *ops, machine_mode mode, int avl_type_index)
   return false;
 }
 
+/* Return true if we can transform vmv.v.x/vfmv.v.f to vmv.s.x/vfmv.s.f.  */
+bool
+splat_to_scalar_move_p (rtx *ops)
+{
+  return satisfies_constraint_Wc1 (ops[1])
+	 && satisfies_constraint_vu (ops[2])
+	 && !MEM_P (ops[3])
+	 && satisfies_constraint_c01 (ops[4])
+	 && INTVAL (ops[7]) == NONVLMAX
+	 && known_ge (GET_MODE_SIZE (Pmode), GET_MODE_SIZE (GET_MODE (ops[3])));
+}
+
 } // namespace riscv_vector
diff --git a/gcc/config/riscv/vector.md b/gcc/config/riscv/vector.md
index 307d9a8c952..ab6e099852d 100644
--- a/gcc/config/riscv/vector.md
+++ b/gcc/config/riscv/vector.md
@@ -1977,8 +1977,15 @@
 	  (match_operand:V_VLS 2 "vector_merge_operand")))]
   "TARGET_VECTOR"
 {
+  /* Transform vmv.v.x/vfmv.v.f (avl = 1) into vmv.s.x since vmv.s.x/vfmv.s.f
+     has better chances to do vsetvl fusion in vsetvl pass.  */
+  if (riscv_vector::splat_to_scalar_move_p (operands))
+    {
+      operands[1] = riscv_vector::gen_scalar_move_mask (<VM>mode);
+      operands[3] = force_reg (<VEL>mode, operands[3]);
+    }
   /* Handle vmv.s.x instruction (Wb1 mask) which has memory scalar.  */
-  if (satisfies_constraint_Wdm (operands[3]))
+  else if (satisfies_constraint_Wdm (operands[3]))
     {
       if (satisfies_constraint_Wb1 (operands[1]))
 	{
diff --git a/gcc/testsuite/gcc.target/riscv/rvv/vsetvl/attribute-2.c b/gcc/testsuite/gcc.target/riscv/rvv/vsetvl/attribute-2.c
new file mode 100644
index 00000000000..b3fec269301
--- /dev/null
+++ b/gcc/testsuite/gcc.target/riscv/rvv/vsetvl/attribute-2.c
@@ -0,0 +1,37 @@
+/* { dg-do compile } */
+/* { dg-options "-march=rv64gcv -mabi=lp64d -O3" } */
+
+#include "riscv_vector.h"
+
+void
+foo (uint32_t *outputMat, uint32_t *inputMat)
+{
+  vuint32m1_t matRegIn0 = __riscv_vle32_v_u32m1 (inputMat, 4);
+  vuint32m1_t matRegIn1 = __riscv_vle32_v_u32m1 (inputMat + 4, 4);
+  vuint32m1_t matRegIn2 = __riscv_vle32_v_u32m1 (inputMat + 8, 4);
+  vuint32m1_t matRegIn3 = __riscv_vle32_v_u32m1 (inputMat + 12, 4);
+
+  vbool32_t oddMask
+    = __riscv_vreinterpret_v_u32m1_b32 (__riscv_vmv_v_x_u32m1 (0xaaaa, 1));
+
+  vuint32m1_t smallTransposeMat0
+    = __riscv_vslideup_vx_u32m1_tumu (oddMask, matRegIn0, matRegIn1, 1, 4);
+  vuint32m1_t smallTransposeMat2
+    = __riscv_vslideup_vx_u32m1_tumu (oddMask, matRegIn2, matRegIn3, 1, 4);
+
+  vuint32m1_t outMat0 = __riscv_vslideup_vx_u32m1_tu (smallTransposeMat0,
+						      smallTransposeMat2, 2, 4);
+
+  __riscv_vse32_v_u32m1 (outputMat, outMat0, 4);
+}
+
+void
+foo2 (void *outputMat, void *inputMat)
+{
+  vfloat32m1_t v = __riscv_vfmv_v_f_f32m1 (0xaaaa, 1);
+  __riscv_vse32_v_f32m1 (outputMat, v, 4);
+}
+
+/* { dg-final { scan-assembler-times {vsetivli\s+zero,\s*4,\s*e32,\s*m1,\s*t[au],\s*m[au]} 2 } } */
+/* { dg-final { scan-assembler-times {vsetivli} 2 } } */
+/* { dg-final { scan-assembler-not {vsetvli} } } */
diff --git a/gcc/testsuite/gcc.target/riscv/rvv/vsetvl/attribute-3.c b/gcc/testsuite/gcc.target/riscv/rvv/vsetvl/attribute-3.c
new file mode 100644
index 00000000000..643f6a96aec
--- /dev/null
+++ b/gcc/testsuite/gcc.target/riscv/rvv/vsetvl/attribute-3.c
@@ -0,0 +1,36 @@
+/* { dg-do compile } */
+/* { dg-options "-march=rv64gcv -mabi=lp64d -O3 -fno-schedule-insns -fno-schedule-insns2" } */
+
+#include "riscv_vector.h"
+
+void matrix_transpose_in_register(uint32_t* outputMat, uint32_t* inputMat) {
+    vuint32m1_t matRegIn0 = __riscv_vle32_v_u32m1(inputMat, 4);
+    vuint32m1_t matRegIn1 = __riscv_vle32_v_u32m1(inputMat + 4, 4);
+    vuint32m1_t matRegIn2 = __riscv_vle32_v_u32m1(inputMat + 8, 4);
+    vuint32m1_t matRegIn3 = __riscv_vle32_v_u32m1(inputMat + 12, 4);
+
+    vbool32_t oddMask = __riscv_vreinterpret_v_u32m1_b32(__riscv_vmv_v_x_u32m1(0xaaaa, 1));
+
+    vuint32m1_t smallTransposeMat0 = __riscv_vslideup_vx_u32m1_tumu(oddMask, matRegIn0, matRegIn1, 1, 4);
+    vuint32m1_t smallTransposeMat2 = __riscv_vslideup_vx_u32m1_tumu(oddMask, matRegIn2, matRegIn3, 1, 4);
+
+    vbool32_t evenMask = __riscv_vreinterpret_v_u32m1_b32(__riscv_vmv_v_x_u32m1(0x5555, 1));
+
+    vuint32m1_t smallTransposeMat1 = __riscv_vslidedown_vx_u32m1_tumu(evenMask, matRegIn1, matRegIn0, 1, 4);
+    vuint32m1_t smallTransposeMat3 = __riscv_vslidedown_vx_u32m1_tumu(evenMask, matRegIn3, matRegIn2, 1, 4);
+
+    vuint32m1_t outMat0 = __riscv_vslideup_vx_u32m1_tu(smallTransposeMat0, smallTransposeMat2, 2, 4);
+    vuint32m1_t outMat1 = __riscv_vslideup_vx_u32m1_tu(smallTransposeMat1, smallTransposeMat3, 2, 4);
+
+    vuint32m1_t outMat2 = __riscv_vslidedown_vx_u32m1_tu(smallTransposeMat2, smallTransposeMat0, 2, 2);
+    vuint32m1_t outMat3 = __riscv_vslidedown_vx_u32m1_tu(smallTransposeMat3, smallTransposeMat1, 2, 2);
+    __riscv_vse32_v_u32m1(outputMat, outMat0, 4);
+    __riscv_vse32_v_u32m1(outputMat + 4, outMat1, 4);
+    __riscv_vse32_v_u32m1(outputMat + 8, outMat2, 4);
+    __riscv_vse32_v_u32m1(outputMat + 12, outMat3, 4);
+}
+
+/* { dg-final { scan-assembler-times {vsetivli\s+zero,\s*4,\s*e32,\s*m1,\s*t[au],\s*m[au]} 2 } } */
+/* { dg-final { scan-assembler-times {vsetivli\s+zero,\s*2,\s*e32,\s*m1,\s*t[au],\s*m[au]} 1 } } */
+/* { dg-final { scan-assembler-times {vsetivli} 3 } } */
+/* { dg-final { scan-assembler-not {vsetvli} } } */