[V2] RISC-V: Add AVL propagation PASS for RVV auto-vectorization

From: Juzhe-Zhong <juzhe.zhong@rivai.ai>

This patch addresses the redundant AVL/VL toggling in RVV partial auto-vectorization,
which has been a known issue for a long time; I have finally found the time to address it.

Consider a simple vector addition operation:

https://godbolt.org/z/7hfGfEjW3

void
foo (int *__restrict a,
     int *__restrict b,
     int n)
{
  for (int i = 0; i < n; i++)
      a[i] = a[i] + b[i];
}

Optimized IR:

Loop body:
  _38 = .SELECT_VL (ivtmp_36, POLY_INT_CST [4, 4]);                          -> vsetvli a5,a2,e8,mf4,ta,ma
  ...
  vect__4.8_27 = .MASK_LEN_LOAD (vectp_a.6_29, 32B, { -1, ... }, _38, 0);    -> vle32.v v2,0(a0)
  vect__6.11_20 = .MASK_LEN_LOAD (vectp_b.9_25, 32B, { -1, ... }, _38, 0);   -> vle32.v v1,0(a1)
  vect__7.12_19 = vect__6.11_20 + vect__4.8_27;                              -> vsetvli a6,zero,e32,m1,ta,ma + vadd.vv v1,v1,v2
  .MASK_LEN_STORE (vectp_a.13_11, 32B, { -1, ... }, _38, 0, vect__7.12_19);  -> vsetvli zero,a5,e32,m1,ta,ma + vse32.v v1,0(a4)

We can see 2 redundant vsetvls inside the loop body due to AVL/VL toggling.
The toggling happens because the simple PLUS_EXPR GIMPLE assignment carries no LEN information:

vect__7.12_19 = vect__6.11_20 + vect__4.8_27;

GCC applies predicated loads/stores together with un-predicated full-vector operations in partial vectorization.
This flow is used by all other targets such as ARM SVE (RVV uses it too):

ARM SVE:

.L3:
        ld1w    z30.s, p7/z, [x0, x3, lsl 2]   -> predicated load
        ld1w    z31.s, p7/z, [x1, x3, lsl 2]   -> predicated load
        add     z31.s, z31.s, z30.s            -> un-predicated add
        st1w    z31.s, p7, [x0, x3, lsl 2]     -> predicated store

This vectorization flow causes AVL/VL toggling on RVV, so we need an AVL propagation PASS for it.
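
To illustrate why the un-predicated add can safely run with the loop's length instead of VLMAX, here is a minimal scalar C model of the flow above. It is only a sketch: VLMAX, select_vl, len_load, len_store, vadd and vec_add are hypothetical names that do not appear in the patch, select_vl models the simplest legal vsetvli result (the minimum of the remaining count and VLMAX), and the junk value stands in for lanes whose contents are unspecified.

```c
#include <assert.h>
#include <stddef.h>

#define VLMAX 4  /* hypothetical: number of elements in one vector register */

/* Model of .SELECT_VL: handle at most VLMAX elements per iteration.  */
static size_t
select_vl (size_t remaining)
{
  return remaining < VLMAX ? remaining : VLMAX;
}

/* Length-controlled load: lanes at or beyond vl are never read from
   memory; a junk value stands for "unspecified contents".  */
static void
len_load (int *vreg, const int *mem, size_t vl)
{
  for (size_t i = 0; i < VLMAX; i++)
    vreg[i] = i < vl ? mem[i] : 0x5a5a;
}

/* Length-controlled store: only the first vl lanes reach memory.  */
static void
len_store (int *mem, const int *vreg, size_t vl)
{
  for (size_t i = 0; i < vl; i++)
    mem[i] = vreg[i];
}

/* vadd.vv executed with the given active length.  */
static void
vadd (int *dst, const int *x, const int *y, size_t len)
{
  assert (len <= VLMAX);
  for (size_t i = 0; i < len; i++)
    dst[i] = x[i] + y[i];
}

/* a[i] += b[i] for n elements.  With full_width_add the add runs on all
   VLMAX lanes (un-predicated); otherwise it runs on vl lanes only.  */
void
vec_add (int *a, const int *b, size_t n, int full_width_add)
{
  while (n > 0)
    {
      int va[VLMAX], vb[VLMAX], vsum[VLMAX] = { 0 };
      size_t vl = select_vl (n);
      len_load (va, a, vl);
      len_load (vb, b, vl);
      vadd (vsum, va, vb, full_width_add ? VLMAX : vl);
      len_store (a, vsum, vl);
      a += vl;
      b += vl;
      n -= vl;
    }
}
```

Both settings of full_width_add leave the same values in memory, because the lanes beyond vl computed by the full-width add are never stored. In this model, that is what makes it safe to give the add the same length as the surrounding loads and stores.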

Also, it is very unlikely that we can apply predicated operations to all vectorization, for the following reasons:

1. Supporting them for all vectorization is a very heavy workload, and there is no benefit if the target backend can handle the issue itself.
2. Changing the loop vectorizer for this would make the code base ugly and hard to maintain.
3. We would need patterns for every operation: not only COND_LEN_ADD, COND_LEN_SUB, ...,
   but also COND_LEN_EXTEND, ..., COND_LEN_CEIL, ...: well over 100 patterns, which is an unreasonable number.

To conclude, we prefer un-predicated operations here, and design a clean AVL propagation PASS to elide the redundant vsetvls
caused by AVL/VL toggling.
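
As a rough illustration of the idea only (this is not the algorithm implemented in riscv-avlprop.cc), the toy C model below treats a loop body as a straight-line list of vector instructions. Everything in it is an assumption made for the sketch: the struct insn layout, the AVL_VLMAX and NO_INSN markers, and the rule "a VLMAX instruction may take a register AVL when all of its users agree on that AVL". It ignores SEW/LMUL, tail policy, control flow and operations whose result depends on the inactive lanes.

```c
#include <assert.h>

#define AVL_VLMAX (-1)  /* hypothetical marker: operate on all VLMAX elements */
#define NO_INSN   (-1)  /* hypothetical marker: operand is not a vector insn */

/* One vector instruction of a straight-line loop body (toy model).  */
struct insn
{
  int avl;     /* AVL_VLMAX, or a non-negative id of the length register */
  int src[2];  /* indices of the instructions that produce the inputs */
};

/* A vsetvl is needed whenever the AVL differs from the previous one.  */
int
count_vsetvl (const struct insn *code, int n)
{
  int count = 0;
  for (int i = 0; i < n; i++)
    if (i == 0 || code[i].avl != code[i - 1].avl)
      count++;
  return count;
}

/* Walk backwards so that chains of VLMAX instructions are handled:
   a VLMAX instruction takes the AVL of its users when all of them
   agree on one register AVL.  */
void
propagate_avl (struct insn *code, int n)
{
  for (int i = n - 1; i >= 0; i--)
    {
      if (code[i].avl != AVL_VLMAX)
        continue;

      int common = AVL_VLMAX;  /* AVL shared by the users seen so far */
      int safe = 1;
      for (int u = i + 1; u < n; u++)
        for (int k = 0; k < 2; k++)
          {
            assert (code[u].src[k] < u);  /* producers precede users */
            if (code[u].src[k] != i)
              continue;
            if (code[u].avl == AVL_VLMAX
                || (common != AVL_VLMAX && common != code[u].avl))
              safe = 0;
            else
              common = code[u].avl;
          }

      if (safe && common != AVL_VLMAX)
        code[i].avl = common;
    }
}
```

Modelling the loop body of foo as two loads and one store that use the a5 length plus one VLMAX add, count_vsetvl reports 3 before propagate_avl and 1 after it. This matches the three vsetvli instructions in the listing above, two of which are redundant.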

The second question is why we add a separate PASS for AVL propagation instead of optimizing it in the VSETVL PASS (we definitely could optimize AVL in the VSETVL PASS).

Frankly, I was planning to address this issue in the VSETVL PASS, which is why we recently refactored it. However, I changed my mind after several
experiments.

The reasons are as follows:

1. Code base management and maintenance. The current VSETVL PASS is complicated enough and already has enough aggressive and fancy optimizations;
   it turns out to generate optimal code in most cases. It is not a good idea to keep adding features to the VSETVL PASS and make it
   heavier and heavier, or we will need to refactor it again in the future.
   Actually, the VSETVL PASS is very stable and optimal after the recent refactoring. Hopefully, we should not need to change it any more except for minor
   fixes.

2. vsetvl insertion (which is what the VSETVL PASS does) and AVL propagation are 2 different things; I don't think we should fuse them into the same PASS.

3. The VSETVL PASS is a post-RA PASS, whereas AVL propagation should be done before RA, where it can reduce register pressure.

4. The AVL propagation PASS in this patch only handles RVV partial auto-vectorization situations.
   The code is only a few hundred lines, which is very manageable and easy to extend.
   We can add more AVL propagation cases in a clean, separate PASS in the future. (If we did it in the VSETVL PASS, we would complicate
   the VSETVL PASS again, which is already so complicated.)

Here is a larger example:

https://godbolt.org/z/bE86sv3q5

void foo2 (int *__restrict a,
          int *__restrict b,
          int *__restrict c,
          int *__restrict a2,
          int *__restrict b2,
          int *__restrict c2,
          int *__restrict a3,
          int *__restrict b3,
          int *__restrict c3,
          int *__restrict a4,
          int *__restrict b4,
          int *__restrict c4,
          int *__restrict a5,
          int *__restrict b5,
          int *__restrict c5,
          int n)
{
    for (int i = 0; i < n; i++){
      a[i] = b[i] + c[i];
      b5[i] = b[i] + c[i];
      a2[i] = b2[i] + c2[i];
      a3[i] = b3[i] + c3[i];
      a4[i] = b4[i] + c4[i];
      a5[i] = a[i] + a4[i];
      a[i] = a5[i] + b5[i]+ a[i];

      a[i] = a[i] + c[i];
      b5[i] = a[i] + c[i];
      a2[i] = a[i] + c2[i];
      a3[i] = a[i] + c3[i];
      a4[i] = a[i] + c4[i];
      a5[i] = a[i] + a4[i];
      a[i] = a[i] + b5[i]+ a[i];
    }
}

1. Loop Body:

Before this patch:                                          After this patch:

        vsetvli a4,t1,e8,mf4,ta,ma                           vsetvli	a4,t1,e32,m1,ta,ma
        vle32.v v2,0(a2)                                     vle32.v	v2,0(a2)
        vle32.v v4,0(a1)                                     vle32.v	v3,0(t2)
        vle32.v v1,0(t2)                                     vle32.v	v4,0(a1)
        vsetvli a7,zero,e32,m1,ta,ma                         vle32.v	v1,0(t0)
        vadd.vv v4,v2,v4                                     vadd.vv	v4,v2,v4
        vsetvli zero,a4,e32,m1,ta,ma                         vadd.vv	v1,v3,v1
        vle32.v v3,0(s0)                                     vadd.vv	v1,v1,v4
        vsetvli a7,zero,e32,m1,ta,ma                         vadd.vv	v1,v1,v4
        vadd.vv v1,v3,v1                                     vadd.vv	v1,v1,v4
        vadd.vv v1,v1,v4                                     vadd.vv	v1,v1,v2
        vadd.vv v1,v1,v4                                     vadd.vv	v2,v1,v2
        vadd.vv v1,v1,v4                                     vse32.v	v2,0(t5)
        vsetvli zero,a4,e32,m1,ta,ma                         vadd.vv	v2,v2,v1
        vle32.v v4,0(a5)                                     vadd.vv	v2,v2,v1
        vsetvli a7,zero,e32,m1,ta,ma                         slli	a7,a4,2
        vadd.vv v1,v1,v2                                     vadd.vv	v3,v1,v3
        vadd.vv v2,v1,v2                                     vle32.v	v5,0(a5)
        vadd.vv v4,v1,v4                                     vle32.v	v6,0(t6)
        vsetvli zero,a4,e32,m1,ta,ma                         vse32.v	v3,0(t3)
        vse32.v v2,0(t5)                                     vse32.v	v2,0(a0)
        vse32.v v4,0(a3)                                     vadd.vv	v3,v3,v1
        vsetvli a7,zero,e32,m1,ta,ma                         vadd.vv	v2,v1,v5
        vadd.vv v3,v1,v3                                     vse32.v	v3,0(t4)
        vadd.vv v2,v2,v1                                     vadd.vv	v1,v1,v6
        vadd.vv v2,v2,v1                                     vse32.v	v2,0(a3)
        vsetvli zero,a4,e32,m1,ta,ma                         vse32.v	v1,0(a6)
        vse32.v v2,0(a0)                                      
        vse32.v v3,0(t3)                                      
        vle32.v v2,0(t0)                                      
        vsetvli a7,zero,e32,m1,ta,ma                                      
        vadd.vv v3,v3,v1                                      
        vsetvli zero,a4,e32,m1,ta,ma                                      
        vse32.v v3,0(t4)                                      
        vsetvli a7,zero,e32,m1,ta,ma                                      
        slli    a7,a4,2                                      
        vadd.vv v1,v1,v2                                      
        sub     t1,t1,a4                                      
        vsetvli zero,a4,e32,m1,ta,ma                                      
        vse32.v v1,0(a6)                                      

Clearly, all the costly, redundant vsetvls inside the loop body are eliminated.

2. Epilogue:
    Before this patch:                                          After this patch:

     .L5:                                                      .L5:                                           
        ld      s0,8(sp)                                         ret
        addi    sp,sp,16                                         
        jr      ra                                         

This is the benefit of doing AVL propagation before RA: we eliminate the use of the 'a7' register
by the redundant AVL/VL toggling instruction 'vsetvli a7,zero,e32,m1,ta,ma', and with the lower register pressure 's0' no longer has to be saved and restored.

The final codegen after this patch:

foo2:
	lw	t1,56(sp)
	ld	t6,0(sp)
	ld	t3,8(sp)
	ld	t0,16(sp)
	ld	t2,24(sp)
	ld	t4,32(sp)
	ld	t5,40(sp)
	ble	t1,zero,.L5
.L3:
	vsetvli	a4,t1,e32,m1,ta,ma
	vle32.v	v2,0(a2)
	vle32.v	v3,0(t2)
	vle32.v	v4,0(a1)
	vle32.v	v1,0(t0)
	vadd.vv	v4,v2,v4
	vadd.vv	v1,v3,v1
	vadd.vv	v1,v1,v4
	vadd.vv	v1,v1,v4
	vadd.vv	v1,v1,v4
	vadd.vv	v1,v1,v2
	vadd.vv	v2,v1,v2
	vse32.v	v2,0(t5)
	vadd.vv	v2,v2,v1
	vadd.vv	v2,v2,v1
	slli	a7,a4,2
	vadd.vv	v3,v1,v3
	vle32.v	v5,0(a5)
	vle32.v	v6,0(t6)
	vse32.v	v3,0(t3)
	vse32.v	v2,0(a0)
	vadd.vv	v3,v3,v1
	vadd.vv	v2,v1,v5
	vse32.v	v3,0(t4)
	vadd.vv	v1,v1,v6
	vse32.v	v2,0(a3)
	vse32.v	v1,0(a6)
	sub	t1,t1,a4
	add	a1,a1,a7
	add	a2,a2,a7
	add	a5,a5,a7
	add	t6,t6,a7
	add	t0,t0,a7
	add	t2,t2,a7
	add	t5,t5,a7
	add	a3,a3,a7
	add	a6,a6,a7
	add	t3,t3,a7
	add	t4,t4,a7
	add	a0,a0,a7
	bne	t1,zero,.L3
.L5:
	ret

	PR target/111318
	PR target/111888

gcc/ChangeLog:

	* config.gcc: Add AVL propagation PASS.
	* config/riscv/riscv-passes.def (INSERT_PASS_AFTER): Ditto.
	* config/riscv/riscv-protos.h (make_pass_avlprop): Ditto.
	* config/riscv/t-riscv: Ditto.
	* config/riscv/riscv-avlprop.cc: New file.

gcc/testsuite/ChangeLog:

	* gcc.target/riscv/rvv/autovec/partial/select_vl-2.c: Adapt test.
	* gcc.target/riscv/rvv/autovec/ternop/ternop_nofm-2.c: Ditto.
	* gcc.target/riscv/rvv/autovec/pr111318.c: New test.
	* gcc.target/riscv/rvv/autovec/pr111888.c: New test.

---
 gcc/config.gcc                                |   2 +-
 gcc/config/riscv/riscv-avlprop.cc             | 419 ++++++++++++++++++
 gcc/config/riscv/riscv-passes.def             |   1 +
 gcc/config/riscv/riscv-protos.h               |   1 +
 gcc/config/riscv/t-riscv                      |   6 +
 .../riscv/rvv/autovec/partial/select_vl-2.c   |   5 +-
 .../gcc.target/riscv/rvv/autovec/pr111318.c   |  16 +
 .../gcc.target/riscv/rvv/autovec/pr111888.c   |  33 ++
 .../riscv/rvv/autovec/ternop/ternop_nofm-2.c  |   1 -
 9 files changed, 480 insertions(+), 4 deletions(-)
 create mode 100644 gcc/config/riscv/riscv-avlprop.cc
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/pr111318.c
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/pr111888.c

Message ID	20231025120518.1319929-1-juzhe.zhong@rivai.ai
State	New
From	Juzhe-Zhong <juzhe.zhong@rivai.ai>
To	gcc-patches@gcc.gnu.org
Cc	kito.cheng@gmail.com, kito.cheng@sifive.com, jeffreyalaw@gmail.com, rdapp.gcc@gmail.com
Subject	[PATCH V2] RISC-V: Add AVL propagation PASS for RVV auto-vectorization
Date	Wed, 25 Oct 2023 20:05:18 +0800