diff mbox series

CPROP: Allow cprop optimization when the function has a single block

Message ID 20230201122444.253620-1-juzhe.zhong@rivai.ai
State New
Headers show
Series CPROP: Allow cprop optimization when the function has a single block | expand

Commit Message

钟居哲 Feb. 1, 2023, 12:24 p.m. UTC
From: Ju-Zhe Zhong <juzhe.zhong@rivai.ai>

Hi, this patch is present for GCC 14 since I understand it's not appropriate
to land it in GCC 13.

NUM_FIXED_BLOCKS = 2 since GCC define each function has aleast 2 blocks
one is entry block, the other is exit block.
So according this code, the function will not do cprop optimization when
there is only exactly one single block.

I am not sure whether it's correct to fix it like this.
Can anyone tell me why forbid cprop optimization if the function only has s
single block ?

Let's take a look at these 2 examples of RVV intrinsics:
1.    void f1 (void * in, void *out, int64_t x, int n)
      {
      vint64m1_t v = __riscv_vle64_v_i64m1 (in + 1, 4);
      vint64m1_t v2 = __riscv_vle64_v_i64m1_tu (v, in + 2, 4);
      vint64m1_t v3 = __riscv_vadd_vx_i64m1 (v2, x, 4);
      vint64m1_t v4 = __riscv_vadd_vx_i64m1 (v3, x, 4);
      __riscv_vse64_v_i64m1 (out + 2, v4, 4);
      }

asm:
     	addi	sp,sp,-16
	sw	a2,8(sp)
	sw	a3,12(sp)
	sw	a2,0(sp)
	sw	a3,4(sp)
	addi	a5,a0,1
	vsetivli	zero,4,e64,m1,ta,ma
	addi	a0,a0,2
	vle64.v	v24,0(a5)
	addi	a5,sp,8
	vlse64.v	v27,0(a5),zero
	addi	a1,a1,2
	vsetvli	zero,zero,e64,m1,tu,ma
	vle64.v	v24,0(a0)
	vsetvli	zero,zero,e64,m1,ta,ma
	vlse64.v	v25,0(sp),zero
	vadd.vv	v26,v24,v27
	vadd.vv	v24,v26,v25
	vse64.v	v24,0(a1)
	addi	sp,sp,16
	jr	ra
you can see here there are 2 vlse64.v instructions that broadcast the scalar value "x"
GCC fail to eliminate the second vlse64.v instruction since GCC doesn't do the cprop
optimization (the function only has 1 single block). It can be optimized if we apply
this patch.

2. void f1 (void * in, void *out, int64_t x, int n)
{
    if (n) {
      vint64m1_t v = __riscv_vle64_v_i64m1 (in + 1, 4);
      vint64m1_t v2 = __riscv_vle64_v_i64m1_tu (v, in + 2, 4);
      vint64m1_t v3 = __riscv_vadd_vx_i64m1 (v2, x, 4);
      vint64m1_t v4 = __riscv_vadd_vx_i64m1 (v3, x, 4);
      __riscv_vse64_v_i64m1 (out + 2, v4, 4);
    }
}

asm:
  f1:
	vsetivli	zero,4,e64,m1,ta,ma
	beq	a4,zero,.L7
	addi	sp,sp,-16
	sw	a2,8(sp)
	sw	a3,12(sp)
	addi	a5,a0,1
	vle64.v	v24,0(a5)
	addi	a0,a0,2
	addi	a5,sp,8
	vlse64.v	v25,0(a5),zero
	addi	a1,a1,2
	vsetvli	zero,zero,e64,m1,tu,ma
	vle64.v	v24,0(a0)
	vadd.vv	v26,v24,v25
	vadd.vv	v24,v26,v25
	vse64.v	v24,0(a1)
	addi	sp,sp,16
	jr	ra
.L7:
	ret

Here, if we add if (n) condition here, the program will end up with more than 1 block.
So GCC will do the cprop optimization and the second vlse64.v instruction is eliminated.

I am not sure whether this patch is correct.
Can anyone help me with that ?
Thanks.


gcc/ChangeLog:

        * cprop.cc (one_cprop_pass): Remove +1.

---
 gcc/cprop.cc | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Richard Biener Feb. 1, 2023, 12:40 p.m. UTC | #1
On Wed, 1 Feb 2023, juzhe.zhong@rivai.ai wrote:

> From: Ju-Zhe Zhong <juzhe.zhong@rivai.ai>
> 
> Hi, this patch is present for GCC 14 since I understand it's not appropriate
> to land it in GCC 13.
> 
> NUM_FIXED_BLOCKS = 2 since GCC define each function has aleast 2 blocks
> one is entry block, the other is exit block.
> So according this code, the function will not do cprop optimization when
> there is only exactly one single block.

cprop / GCSE are global dataflow problems, there's "nothing" to do for
a single block, the local problem plus its application isn't considered
but probably left for CSE.

Why does CSE not perform the desired transform?

> I am not sure whether it's correct to fix it like this.
> Can anyone tell me why forbid cprop optimization if the function only has s
> single block ?
> 
> Let's take a look at these 2 examples of RVV intrinsics:
> 1.    void f1 (void * in, void *out, int64_t x, int n)
>       {
>       vint64m1_t v = __riscv_vle64_v_i64m1 (in + 1, 4);
>       vint64m1_t v2 = __riscv_vle64_v_i64m1_tu (v, in + 2, 4);
>       vint64m1_t v3 = __riscv_vadd_vx_i64m1 (v2, x, 4);
>       vint64m1_t v4 = __riscv_vadd_vx_i64m1 (v3, x, 4);
>       __riscv_vse64_v_i64m1 (out + 2, v4, 4);
>       }
> 
> asm:
>      	addi	sp,sp,-16
> 	sw	a2,8(sp)
> 	sw	a3,12(sp)
> 	sw	a2,0(sp)
> 	sw	a3,4(sp)
> 	addi	a5,a0,1
> 	vsetivli	zero,4,e64,m1,ta,ma
> 	addi	a0,a0,2
> 	vle64.v	v24,0(a5)
> 	addi	a5,sp,8
> 	vlse64.v	v27,0(a5),zero
> 	addi	a1,a1,2
> 	vsetvli	zero,zero,e64,m1,tu,ma
> 	vle64.v	v24,0(a0)
> 	vsetvli	zero,zero,e64,m1,ta,ma
> 	vlse64.v	v25,0(sp),zero
> 	vadd.vv	v26,v24,v27
> 	vadd.vv	v24,v26,v25
> 	vse64.v	v24,0(a1)
> 	addi	sp,sp,16
> 	jr	ra
> you can see here there are 2 vlse64.v instructions that broadcast the scalar value "x"
> GCC fail to eliminate the second vlse64.v instruction since GCC doesn't do the cprop
> optimization (the function only has 1 single block). It can be optimized if we apply
> this patch.
> 
> 2. void f1 (void * in, void *out, int64_t x, int n)
> {
>     if (n) {
>       vint64m1_t v = __riscv_vle64_v_i64m1 (in + 1, 4);
>       vint64m1_t v2 = __riscv_vle64_v_i64m1_tu (v, in + 2, 4);
>       vint64m1_t v3 = __riscv_vadd_vx_i64m1 (v2, x, 4);
>       vint64m1_t v4 = __riscv_vadd_vx_i64m1 (v3, x, 4);
>       __riscv_vse64_v_i64m1 (out + 2, v4, 4);
>     }
> }
> 
> asm:
>   f1:
> 	vsetivli	zero,4,e64,m1,ta,ma
> 	beq	a4,zero,.L7
> 	addi	sp,sp,-16
> 	sw	a2,8(sp)
> 	sw	a3,12(sp)
> 	addi	a5,a0,1
> 	vle64.v	v24,0(a5)
> 	addi	a0,a0,2
> 	addi	a5,sp,8
> 	vlse64.v	v25,0(a5),zero
> 	addi	a1,a1,2
> 	vsetvli	zero,zero,e64,m1,tu,ma
> 	vle64.v	v24,0(a0)
> 	vadd.vv	v26,v24,v25
> 	vadd.vv	v24,v26,v25
> 	vse64.v	v24,0(a1)
> 	addi	sp,sp,16
> 	jr	ra
> .L7:
> 	ret
> 
> Here, if we add if (n) condition here, the program will end up with more than 1 block.
> So GCC will do the cprop optimization and the second vlse64.v instruction is eliminated.
> 
> I am not sure whether this patch is correct.
> Can anyone help me with that ?
> Thanks.
> 
> 
> gcc/ChangeLog:
> 
>         * cprop.cc (one_cprop_pass): Remove +1.
> 
> ---
>  gcc/cprop.cc | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/gcc/cprop.cc b/gcc/cprop.cc
> index 6ec0bda4a24..615bc5078b6 100644
> --- a/gcc/cprop.cc
> +++ b/gcc/cprop.cc
> @@ -1749,7 +1749,7 @@ one_cprop_pass (void)
>    int changed = 0;
>  
>    /* Return if there's nothing to do, or it is too expensive.  */
> -  if (n_basic_blocks_for_fn (cfun) <= NUM_FIXED_BLOCKS + 1
> +  if (n_basic_blocks_for_fn (cfun) <= NUM_FIXED_BLOCKS
>        || gcse_or_cprop_is_too_expensive (_ ("const/copy propagation disabled")))
>      return 0;
>  
>
钟居哲 Feb. 1, 2023, 12:47 p.m. UTC | #2
I don't known whether CSE do the job. What I saw is CPROP do the optimization when we have more than 1 block.

This the RTL dump before CPROP:

(insn 19 18 20 4 (set (reg:VNx1DI 151)
        (if_then_else:VNx1DI (unspec:VNx1BI [
                    (const_vector:VNx1BI repeat [
                            (const_int 1 [0x1])
                        ])
                    (const_int 4 [0x4])
                    (const_int 2 [0x2]) repeated x2
                    (const_int 0 [0])
                    (reg:SI 66 vl)
                    (reg:SI 67 vtype)
                ] UNSPEC_VPREDICATE)
            (vec_duplicate:VNx1DI (reg/v:DI 148 [ x ]))
            (unspec:VNx1DI [
                    (const_int 0 [0])
                ] UNSPEC_VUNDEF))) "rvv.c":22:23 695 {pred_broadcastvnx1di}
     (nil))
(insn 20 19 21 4 (set (reg/v:VNx1DI 139 [ v3 ])
        (if_then_else:VNx1DI (unspec:VNx1BI [
                    (const_vector:VNx1BI repeat [
                            (const_int 1 [0x1])
                        ])
                    (const_int 4 [0x4])
                    (const_int 2 [0x2]) repeated x2
                    (const_int 0 [0])
                    (reg:SI 66 vl)
                    (reg:SI 67 vtype)
                ] UNSPEC_VPREDICATE)
            (plus:VNx1DI (reg/v:VNx1DI 138 [ v2 ])
                (reg:VNx1DI 151))
            (unspec:VNx1DI [
                    (const_int 0 [0])
                ] UNSPEC_VUNDEF))) "rvv.c":22:23 1528 {pred_addvnx1di}
     (expr_list:REG_DEAD (reg:VNx1DI 151)
        (nil)))
(insn 21 20 22 4 (set (reg:VNx1DI 152)
        (if_then_else:VNx1DI (unspec:VNx1BI [
                    (const_vector:VNx1BI repeat [
                            (const_int 1 [0x1])
                        ])
                    (const_int 4 [0x4])
                    (const_int 2 [0x2]) repeated x2
                    (const_int 0 [0])
                    (reg:SI 66 vl)
                    (reg:SI 67 vtype)
                ] UNSPEC_VPREDICATE)
            (vec_duplicate:VNx1DI (reg/v:DI 148 [ x ]))
            (unspec:VNx1DI [
                    (const_int 0 [0])
                ] UNSPEC_VUNDEF))) "rvv.c":23:23 695 {pred_broadcastvnx1di}
     (nil))
(insn 22 21 23 4 (set (reg/v:VNx1DI 140 [ v4 ])
        (if_then_else:VNx1DI (unspec:VNx1BI [
                    (const_vector:VNx1BI repeat [
                            (const_int 1 [0x1])
                        ])
                    (const_int 4 [0x4])
                    (const_int 0 [0])
                    (const_int 2 [0x2])
                    (const_int 0 [0])
                    (reg:SI 66 vl)
                    (reg:SI 67 vtype)
                ] UNSPEC_VPREDICATE)
            (plus:VNx1DI (reg/v:VNx1DI 138 [ v2 ])
                (reg:VNx1DI 152))
            (reg/v:VNx1DI 139 [ v3 ]))) "rvv.c":23:23 1528 {pred_addvnx1di}
     (expr_list:REG_DEAD (reg:VNx1DI 152)
        (expr_list:REG_DEAD (reg/v:VNx1DI 139 [ v3 ])
            (expr_list:REG_DEAD (reg/v:VNx1DI 138 [ v2 ])
                (nil)))))

After CRPOP:
(insn 15 14 16 4 (set (reg:VNx1DI 147)
        (if_then_else:VNx1DI (unspec:VNx1BI [
                    (const_vector:VNx1BI repeat [
                            (const_int 1 [0x1])
                        ])
                    (const_int 4 [0x4])
                    (const_int 2 [0x2]) repeated x2
                    (const_int 0 [0])
                    (reg:SI 66 vl)
                    (reg:SI 67 vtype)
                ] UNSPEC_VPREDICATE)
            (vec_duplicate:VNx1DI (reg/v:DI 143 [ x ]))
            (unspec:VNx1DI [
                    (const_int 0 [0])
                ] UNSPEC_VUNDEF))) "rvv.c":11:23 695 {pred_broadcastvnx1di}
     (nil))
(insn 16 15 18 4 (set (reg/v:VNx1DI 139 [ v3 ])
        (if_then_else:VNx1DI (unspec:VNx1BI [
                    (const_vector:VNx1BI repeat [
                            (const_int 1 [0x1])
                        ])
                    (const_int 4 [0x4])
                    (const_int 2 [0x2]) repeated x2
                    (const_int 0 [0])
                    (reg:SI 66 vl)
                    (reg:SI 67 vtype)
                ] UNSPEC_VPREDICATE)
            (plus:VNx1DI (reg/v:VNx1DI 138 [ v2 ])
                (reg:VNx1DI 147))
            (unspec:VNx1DI [
                    (const_int 0 [0])
                ] UNSPEC_VUNDEF))) "rvv.c":11:23 1528 {pred_addvnx1di}
     (expr_list:REG_DEAD (reg:VNx1DI 147)
        (expr_list:REG_DEAD (reg/v:VNx1DI 138 [ v2 ])
            (nil))))
(insn 18 16 19 4 (set (reg/v:VNx1DI 140 [ v4 ])
        (if_then_else:VNx1DI (unspec:VNx1BI [
                    (const_vector:VNx1BI repeat [
                            (const_int 1 [0x1])
                        ])
                    (const_int 4 [0x4])
                    (const_int 2 [0x2]) repeated x2
                    (const_int 0 [0])
                    (reg:SI 66 vl)
                    (reg:SI 67 vtype)
                ] UNSPEC_VPREDICATE)
            (plus:VNx1DI (reg/v:VNx1DI 139 [ v3 ])
                (reg:VNx1DI 147))
            (unspec:VNx1DI [
                    (const_int 0 [0])
                ] UNSPEC_VUNDEF))) "rvv.c":12:23 1528 {pred_addvnx1di}
     (expr_list:REG_DEAD (reg:VNx1DI 148)
        (expr_list:REG_DEAD (reg/v:VNx1DI 139 [ v3 ])
            (nil))))

You can see CPROP remove the second the "pred_broadcast" instruction and propagate the result to the second "pred_add" instruction。



juzhe.zhong@rivai.ai
 
From: Richard Biener
Date: 2023-02-01 20:40
To: Ju-Zhe Zhong
CC: gcc-patches; kito.cheng; richard.sandiford; jeffreyalaw; apinski
Subject: Re: [PATCH] CPROP: Allow cprop optimization when the function has a single block
On Wed, 1 Feb 2023, juzhe.zhong@rivai.ai wrote:
 
> From: Ju-Zhe Zhong <juzhe.zhong@rivai.ai>
> 
> Hi, this patch is present for GCC 14 since I understand it's not appropriate
> to land it in GCC 13.
> 
> NUM_FIXED_BLOCKS = 2 since GCC define each function has aleast 2 blocks
> one is entry block, the other is exit block.
> So according this code, the function will not do cprop optimization when
> there is only exactly one single block.
 
cprop / GCSE are global dataflow problems, there's "nothing" to do for
a single block, the local problem plus its application isn't considered
but probably left for CSE.
 
Why does CSE not perform the desired transform?
 
> I am not sure whether it's correct to fix it like this.
> Can anyone tell me why forbid cprop optimization if the function only has s
> single block ?
> 
> Let's take a look at these 2 examples of RVV intrinsics:
> 1.    void f1 (void * in, void *out, int64_t x, int n)
>       {
>       vint64m1_t v = __riscv_vle64_v_i64m1 (in + 1, 4);
>       vint64m1_t v2 = __riscv_vle64_v_i64m1_tu (v, in + 2, 4);
>       vint64m1_t v3 = __riscv_vadd_vx_i64m1 (v2, x, 4);
>       vint64m1_t v4 = __riscv_vadd_vx_i64m1 (v3, x, 4);
>       __riscv_vse64_v_i64m1 (out + 2, v4, 4);
>       }
> 
> asm:
>      addi sp,sp,-16
> sw a2,8(sp)
> sw a3,12(sp)
> sw a2,0(sp)
> sw a3,4(sp)
> addi a5,a0,1
> vsetivli zero,4,e64,m1,ta,ma
> addi a0,a0,2
> vle64.v v24,0(a5)
> addi a5,sp,8
> vlse64.v v27,0(a5),zero
> addi a1,a1,2
> vsetvli zero,zero,e64,m1,tu,ma
> vle64.v v24,0(a0)
> vsetvli zero,zero,e64,m1,ta,ma
> vlse64.v v25,0(sp),zero
> vadd.vv v26,v24,v27
> vadd.vv v24,v26,v25
> vse64.v v24,0(a1)
> addi sp,sp,16
> jr ra
> you can see here there are 2 vlse64.v instructions that broadcast the scalar value "x"
> GCC fail to eliminate the second vlse64.v instruction since GCC doesn't do the cprop
> optimization (the function only has 1 single block). It can be optimized if we apply
> this patch.
> 
> 2. void f1 (void * in, void *out, int64_t x, int n)
> {
>     if (n) {
>       vint64m1_t v = __riscv_vle64_v_i64m1 (in + 1, 4);
>       vint64m1_t v2 = __riscv_vle64_v_i64m1_tu (v, in + 2, 4);
>       vint64m1_t v3 = __riscv_vadd_vx_i64m1 (v2, x, 4);
>       vint64m1_t v4 = __riscv_vadd_vx_i64m1 (v3, x, 4);
>       __riscv_vse64_v_i64m1 (out + 2, v4, 4);
>     }
> }
> 
> asm:
>   f1:
> vsetivli zero,4,e64,m1,ta,ma
> beq a4,zero,.L7
> addi sp,sp,-16
> sw a2,8(sp)
> sw a3,12(sp)
> addi a5,a0,1
> vle64.v v24,0(a5)
> addi a0,a0,2
> addi a5,sp,8
> vlse64.v v25,0(a5),zero
> addi a1,a1,2
> vsetvli zero,zero,e64,m1,tu,ma
> vle64.v v24,0(a0)
> vadd.vv v26,v24,v25
> vadd.vv v24,v26,v25
> vse64.v v24,0(a1)
> addi sp,sp,16
> jr ra
> .L7:
> ret
> 
> Here, if we add if (n) condition here, the program will end up with more than 1 block.
> So GCC will do the cprop optimization and the second vlse64.v instruction is eliminated.
> 
> I am not sure whether this patch is correct.
> Can anyone help me with that ?
> Thanks.
> 
> 
> gcc/ChangeLog:
> 
>         * cprop.cc (one_cprop_pass): Remove +1.
> 
> ---
>  gcc/cprop.cc | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/gcc/cprop.cc b/gcc/cprop.cc
> index 6ec0bda4a24..615bc5078b6 100644
> --- a/gcc/cprop.cc
> +++ b/gcc/cprop.cc
> @@ -1749,7 +1749,7 @@ one_cprop_pass (void)
>    int changed = 0;
>  
>    /* Return if there's nothing to do, or it is too expensive.  */
> -  if (n_basic_blocks_for_fn (cfun) <= NUM_FIXED_BLOCKS + 1
> +  if (n_basic_blocks_for_fn (cfun) <= NUM_FIXED_BLOCKS
>        || gcse_or_cprop_is_too_expensive (_ ("const/copy propagation disabled")))
>      return 0;
>  
>
Richard Biener Feb. 1, 2023, 12:51 p.m. UTC | #3
On Wed, 1 Feb 2023, juzhe.zhong@rivai.ai wrote:

>  I don't known whether CSE do the job. What I saw is CPROP do the optimization when we have more than 1 block.
> 
> This the RTL dump before CPROP:
> 
> (insn 19 18 20 4 (set (reg:VNx1DI 151)
>         (if_then_else:VNx1DI (unspec:VNx1BI [
>                     (const_vector:VNx1BI repeat [
>                             (const_int 1 [0x1])
>                         ])
>                     (const_int 4 [0x4])
>                     (const_int 2 [0x2]) repeated x2
>                     (const_int 0 [0])
>                     (reg:SI 66 vl)
>                     (reg:SI 67 vtype)
>                 ] UNSPEC_VPREDICATE)
>             (vec_duplicate:VNx1DI (reg/v:DI 148 [ x ]))
>             (unspec:VNx1DI [
>                     (const_int 0 [0])
>                 ] UNSPEC_VUNDEF))) "rvv.c":22:23 695 {pred_broadcastvnx1di}
>      (nil))
> (insn 20 19 21 4 (set (reg/v:VNx1DI 139 [ v3 ])
>         (if_then_else:VNx1DI (unspec:VNx1BI [
>                     (const_vector:VNx1BI repeat [
>                             (const_int 1 [0x1])
>                         ])
>                     (const_int 4 [0x4])
>                     (const_int 2 [0x2]) repeated x2
>                     (const_int 0 [0])
>                     (reg:SI 66 vl)
>                     (reg:SI 67 vtype)
>                 ] UNSPEC_VPREDICATE)
>             (plus:VNx1DI (reg/v:VNx1DI 138 [ v2 ])
>                 (reg:VNx1DI 151))
>             (unspec:VNx1DI [
>                     (const_int 0 [0])
>                 ] UNSPEC_VUNDEF))) "rvv.c":22:23 1528 {pred_addvnx1di}
>      (expr_list:REG_DEAD (reg:VNx1DI 151)
>         (nil)))
> (insn 21 20 22 4 (set (reg:VNx1DI 152)
>         (if_then_else:VNx1DI (unspec:VNx1BI [
>                     (const_vector:VNx1BI repeat [
>                             (const_int 1 [0x1])
>                         ])
>                     (const_int 4 [0x4])
>                     (const_int 2 [0x2]) repeated x2
>                     (const_int 0 [0])
>                     (reg:SI 66 vl)
>                     (reg:SI 67 vtype)
>                 ] UNSPEC_VPREDICATE)
>             (vec_duplicate:VNx1DI (reg/v:DI 148 [ x ]))
>             (unspec:VNx1DI [
>                     (const_int 0 [0])
>                 ] UNSPEC_VUNDEF))) "rvv.c":23:23 695 {pred_broadcastvnx1di}
>      (nil))
> (insn 22 21 23 4 (set (reg/v:VNx1DI 140 [ v4 ])
>         (if_then_else:VNx1DI (unspec:VNx1BI [
>                     (const_vector:VNx1BI repeat [
>                             (const_int 1 [0x1])
>                         ])
>                     (const_int 4 [0x4])
>                     (const_int 0 [0])
>                     (const_int 2 [0x2])
>                     (const_int 0 [0])
>                     (reg:SI 66 vl)
>                     (reg:SI 67 vtype)
>                 ] UNSPEC_VPREDICATE)
>             (plus:VNx1DI (reg/v:VNx1DI 138 [ v2 ])
>                 (reg:VNx1DI 152))
>             (reg/v:VNx1DI 139 [ v3 ]))) "rvv.c":23:23 1528 {pred_addvnx1di}
>      (expr_list:REG_DEAD (reg:VNx1DI 152)
>         (expr_list:REG_DEAD (reg/v:VNx1DI 139 [ v3 ])
>             (expr_list:REG_DEAD (reg/v:VNx1DI 138 [ v2 ])
>                 (nil)))))
> 
> After CRPOP:
> (insn 15 14 16 4 (set (reg:VNx1DI 147)
>         (if_then_else:VNx1DI (unspec:VNx1BI [
>                     (const_vector:VNx1BI repeat [
>                             (const_int 1 [0x1])
>                         ])
>                     (const_int 4 [0x4])
>                     (const_int 2 [0x2]) repeated x2
>                     (const_int 0 [0])
>                     (reg:SI 66 vl)
>                     (reg:SI 67 vtype)
>                 ] UNSPEC_VPREDICATE)
>             (vec_duplicate:VNx1DI (reg/v:DI 143 [ x ]))
>             (unspec:VNx1DI [
>                     (const_int 0 [0])
>                 ] UNSPEC_VUNDEF))) "rvv.c":11:23 695 {pred_broadcastvnx1di}
>      (nil))
> (insn 16 15 18 4 (set (reg/v:VNx1DI 139 [ v3 ])
>         (if_then_else:VNx1DI (unspec:VNx1BI [
>                     (const_vector:VNx1BI repeat [
>                             (const_int 1 [0x1])
>                         ])
>                     (const_int 4 [0x4])
>                     (const_int 2 [0x2]) repeated x2
>                     (const_int 0 [0])
>                     (reg:SI 66 vl)
>                     (reg:SI 67 vtype)
>                 ] UNSPEC_VPREDICATE)
>             (plus:VNx1DI (reg/v:VNx1DI 138 [ v2 ])
>                 (reg:VNx1DI 147))
>             (unspec:VNx1DI [
>                     (const_int 0 [0])
>                 ] UNSPEC_VUNDEF))) "rvv.c":11:23 1528 {pred_addvnx1di}
>      (expr_list:REG_DEAD (reg:VNx1DI 147)
>         (expr_list:REG_DEAD (reg/v:VNx1DI 138 [ v2 ])
>             (nil))))
> (insn 18 16 19 4 (set (reg/v:VNx1DI 140 [ v4 ])
>         (if_then_else:VNx1DI (unspec:VNx1BI [
>                     (const_vector:VNx1BI repeat [
>                             (const_int 1 [0x1])
>                         ])
>                     (const_int 4 [0x4])
>                     (const_int 2 [0x2]) repeated x2
>                     (const_int 0 [0])
>                     (reg:SI 66 vl)
>                     (reg:SI 67 vtype)
>                 ] UNSPEC_VPREDICATE)
>             (plus:VNx1DI (reg/v:VNx1DI 139 [ v3 ])
>                 (reg:VNx1DI 147))
>             (unspec:VNx1DI [
>                     (const_int 0 [0])
>                 ] UNSPEC_VUNDEF))) "rvv.c":12:23 1528 {pred_addvnx1di}
>      (expr_list:REG_DEAD (reg:VNx1DI 148)
>         (expr_list:REG_DEAD (reg/v:VNx1DI 139 [ v3 ])
>             (nil))))
> 
> You can see CPROP remove the second the "pred_broadcast" instruction and propagate the result to the second "pred_add" instruction?

I see loads of UNSPECs, that might explain why some passes do something
and some not.  That said, not sure what exactly CPROP does, the two
pred_broadcast insns look exactly the same so CSE should CSE them?

> 
> 
> juzhe.zhong@rivai.ai
>  
> From: Richard Biener
> Date: 2023-02-01 20:40
> To: Ju-Zhe Zhong
> CC: gcc-patches; kito.cheng; richard.sandiford; jeffreyalaw; apinski
> Subject: Re: [PATCH] CPROP: Allow cprop optimization when the function has a single block
> On Wed, 1 Feb 2023, juzhe.zhong@rivai.ai wrote:
>  
> > From: Ju-Zhe Zhong <juzhe.zhong@rivai.ai>
> > 
> > Hi, this patch is present for GCC 14 since I understand it's not appropriate
> > to land it in GCC 13.
> > 
> > NUM_FIXED_BLOCKS = 2 since GCC define each function has aleast 2 blocks
> > one is entry block, the other is exit block.
> > So according this code, the function will not do cprop optimization when
> > there is only exactly one single block.
>  
> cprop / GCSE are global dataflow problems, there's "nothing" to do for
> a single block, the local problem plus its application isn't considered
> but probably left for CSE.
>  
> Why does CSE not perform the desired transform?
>  
> > I am not sure whether it's correct to fix it like this.
> > Can anyone tell me why forbid cprop optimization if the function only has s
> > single block ?
> > 
> > Let's take a look at these 2 examples of RVV intrinsics:
> > 1.    void f1 (void * in, void *out, int64_t x, int n)
> >       {
> >       vint64m1_t v = __riscv_vle64_v_i64m1 (in + 1, 4);
> >       vint64m1_t v2 = __riscv_vle64_v_i64m1_tu (v, in + 2, 4);
> >       vint64m1_t v3 = __riscv_vadd_vx_i64m1 (v2, x, 4);
> >       vint64m1_t v4 = __riscv_vadd_vx_i64m1 (v3, x, 4);
> >       __riscv_vse64_v_i64m1 (out + 2, v4, 4);
> >       }
> > 
> > asm:
> >      addi sp,sp,-16
> > sw a2,8(sp)
> > sw a3,12(sp)
> > sw a2,0(sp)
> > sw a3,4(sp)
> > addi a5,a0,1
> > vsetivli zero,4,e64,m1,ta,ma
> > addi a0,a0,2
> > vle64.v v24,0(a5)
> > addi a5,sp,8
> > vlse64.v v27,0(a5),zero
> > addi a1,a1,2
> > vsetvli zero,zero,e64,m1,tu,ma
> > vle64.v v24,0(a0)
> > vsetvli zero,zero,e64,m1,ta,ma
> > vlse64.v v25,0(sp),zero
> > vadd.vv v26,v24,v27
> > vadd.vv v24,v26,v25
> > vse64.v v24,0(a1)
> > addi sp,sp,16
> > jr ra
> > you can see here there are 2 vlse64.v instructions that broadcast the scalar value "x"
> > GCC fail to eliminate the second vlse64.v instruction since GCC doesn't do the cprop
> > optimization (the function only has 1 single block). It can be optimized if we apply
> > this patch.
> > 
> > 2. void f1 (void * in, void *out, int64_t x, int n)
> > {
> >     if (n) {
> >       vint64m1_t v = __riscv_vle64_v_i64m1 (in + 1, 4);
> >       vint64m1_t v2 = __riscv_vle64_v_i64m1_tu (v, in + 2, 4);
> >       vint64m1_t v3 = __riscv_vadd_vx_i64m1 (v2, x, 4);
> >       vint64m1_t v4 = __riscv_vadd_vx_i64m1 (v3, x, 4);
> >       __riscv_vse64_v_i64m1 (out + 2, v4, 4);
> >     }
> > }
> > 
> > asm:
> >   f1:
> > vsetivli zero,4,e64,m1,ta,ma
> > beq a4,zero,.L7
> > addi sp,sp,-16
> > sw a2,8(sp)
> > sw a3,12(sp)
> > addi a5,a0,1
> > vle64.v v24,0(a5)
> > addi a0,a0,2
> > addi a5,sp,8
> > vlse64.v v25,0(a5),zero
> > addi a1,a1,2
> > vsetvli zero,zero,e64,m1,tu,ma
> > vle64.v v24,0(a0)
> > vadd.vv v26,v24,v25
> > vadd.vv v24,v26,v25
> > vse64.v v24,0(a1)
> > addi sp,sp,16
> > jr ra
> > .L7:
> > ret
> > 
> > Here, if we add if (n) condition here, the program will end up with more than 1 block.
> > So GCC will do the cprop optimization and the second vlse64.v instruction is eliminated.
> > 
> > I am not sure whether this patch is correct.
> > Can anyone help me with that ?
> > Thanks.
> > 
> > 
> > gcc/ChangeLog:
> > 
> >         * cprop.cc (one_cprop_pass): Remove +1.
> > 
> > ---
> >  gcc/cprop.cc | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/gcc/cprop.cc b/gcc/cprop.cc
> > index 6ec0bda4a24..615bc5078b6 100644
> > --- a/gcc/cprop.cc
> > +++ b/gcc/cprop.cc
> > @@ -1749,7 +1749,7 @@ one_cprop_pass (void)
> >    int changed = 0;
> >  
> >    /* Return if there's nothing to do, or it is too expensive.  */
> > -  if (n_basic_blocks_for_fn (cfun) <= NUM_FIXED_BLOCKS + 1
> > +  if (n_basic_blocks_for_fn (cfun) <= NUM_FIXED_BLOCKS
> >        || gcse_or_cprop_is_too_expensive (_ ("const/copy propagation disabled")))
> >      return 0;
> >  
> > 
>  
>
钟居哲 Feb. 1, 2023, 12:56 p.m. UTC | #4
>> I see loads of UNSPECs, that might explain why some passes do something
>> and some not.  That said, not sure what exactly CPROP does, the two
>> pred_broadcast insns look exactly the same so CSE should CSE them?
Yes, the "source" these 2 pred_broadcast the same. However, they have different pseudos in their "dest" (operands[0])
The first one is 151, the second one is 152. And the result in these 2 pseudos are the same.
The first pred_add use 151, the second pred_add use 152.

CSE should do the job to eliminate the second pred_broadcast, but I am not sure whether  CSE can propagate the 151 pseudo
to the second pred_add ??


juzhe.zhong@rivai.ai
 
From: Richard Biener
Date: 2023-02-01 20:51
To: juzhe.zhong@rivai.ai
CC: gcc-patches; kito.cheng; richard.sandiford; jeffreyalaw; apinski
Subject: Re: Re: [PATCH] CPROP: Allow cprop optimization when the function has a single block
On Wed, 1 Feb 2023, juzhe.zhong@rivai.ai wrote:
 
>  I don't known whether CSE do the job. What I saw is CPROP do the optimization when we have more than 1 block.
> 
> This the RTL dump before CPROP:
> 
> (insn 19 18 20 4 (set (reg:VNx1DI 151)
>         (if_then_else:VNx1DI (unspec:VNx1BI [
>                     (const_vector:VNx1BI repeat [
>                             (const_int 1 [0x1])
>                         ])
>                     (const_int 4 [0x4])
>                     (const_int 2 [0x2]) repeated x2
>                     (const_int 0 [0])
>                     (reg:SI 66 vl)
>                     (reg:SI 67 vtype)
>                 ] UNSPEC_VPREDICATE)
>             (vec_duplicate:VNx1DI (reg/v:DI 148 [ x ]))
>             (unspec:VNx1DI [
>                     (const_int 0 [0])
>                 ] UNSPEC_VUNDEF))) "rvv.c":22:23 695 {pred_broadcastvnx1di}
>      (nil))
> (insn 20 19 21 4 (set (reg/v:VNx1DI 139 [ v3 ])
>         (if_then_else:VNx1DI (unspec:VNx1BI [
>                     (const_vector:VNx1BI repeat [
>                             (const_int 1 [0x1])
>                         ])
>                     (const_int 4 [0x4])
>                     (const_int 2 [0x2]) repeated x2
>                     (const_int 0 [0])
>                     (reg:SI 66 vl)
>                     (reg:SI 67 vtype)
>                 ] UNSPEC_VPREDICATE)
>             (plus:VNx1DI (reg/v:VNx1DI 138 [ v2 ])
>                 (reg:VNx1DI 151))
>             (unspec:VNx1DI [
>                     (const_int 0 [0])
>                 ] UNSPEC_VUNDEF))) "rvv.c":22:23 1528 {pred_addvnx1di}
>      (expr_list:REG_DEAD (reg:VNx1DI 151)
>         (nil)))
> (insn 21 20 22 4 (set (reg:VNx1DI 152)
>         (if_then_else:VNx1DI (unspec:VNx1BI [
>                     (const_vector:VNx1BI repeat [
>                             (const_int 1 [0x1])
>                         ])
>                     (const_int 4 [0x4])
>                     (const_int 2 [0x2]) repeated x2
>                     (const_int 0 [0])
>                     (reg:SI 66 vl)
>                     (reg:SI 67 vtype)
>                 ] UNSPEC_VPREDICATE)
>             (vec_duplicate:VNx1DI (reg/v:DI 148 [ x ]))
>             (unspec:VNx1DI [
>                     (const_int 0 [0])
>                 ] UNSPEC_VUNDEF))) "rvv.c":23:23 695 {pred_broadcastvnx1di}
>      (nil))
> (insn 22 21 23 4 (set (reg/v:VNx1DI 140 [ v4 ])
>         (if_then_else:VNx1DI (unspec:VNx1BI [
>                     (const_vector:VNx1BI repeat [
>                             (const_int 1 [0x1])
>                         ])
>                     (const_int 4 [0x4])
>                     (const_int 0 [0])
>                     (const_int 2 [0x2])
>                     (const_int 0 [0])
>                     (reg:SI 66 vl)
>                     (reg:SI 67 vtype)
>                 ] UNSPEC_VPREDICATE)
>             (plus:VNx1DI (reg/v:VNx1DI 138 [ v2 ])
>                 (reg:VNx1DI 152))
>             (reg/v:VNx1DI 139 [ v3 ]))) "rvv.c":23:23 1528 {pred_addvnx1di}
>      (expr_list:REG_DEAD (reg:VNx1DI 152)
>         (expr_list:REG_DEAD (reg/v:VNx1DI 139 [ v3 ])
>             (expr_list:REG_DEAD (reg/v:VNx1DI 138 [ v2 ])
>                 (nil)))))
> 
> After CRPOP:
> (insn 15 14 16 4 (set (reg:VNx1DI 147)
>         (if_then_else:VNx1DI (unspec:VNx1BI [
>                     (const_vector:VNx1BI repeat [
>                             (const_int 1 [0x1])
>                         ])
>                     (const_int 4 [0x4])
>                     (const_int 2 [0x2]) repeated x2
>                     (const_int 0 [0])
>                     (reg:SI 66 vl)
>                     (reg:SI 67 vtype)
>                 ] UNSPEC_VPREDICATE)
>             (vec_duplicate:VNx1DI (reg/v:DI 143 [ x ]))
>             (unspec:VNx1DI [
>                     (const_int 0 [0])
>                 ] UNSPEC_VUNDEF))) "rvv.c":11:23 695 {pred_broadcastvnx1di}
>      (nil))
> (insn 16 15 18 4 (set (reg/v:VNx1DI 139 [ v3 ])
>         (if_then_else:VNx1DI (unspec:VNx1BI [
>                     (const_vector:VNx1BI repeat [
>                             (const_int 1 [0x1])
>                         ])
>                     (const_int 4 [0x4])
>                     (const_int 2 [0x2]) repeated x2
>                     (const_int 0 [0])
>                     (reg:SI 66 vl)
>                     (reg:SI 67 vtype)
>                 ] UNSPEC_VPREDICATE)
>             (plus:VNx1DI (reg/v:VNx1DI 138 [ v2 ])
>                 (reg:VNx1DI 147))
>             (unspec:VNx1DI [
>                     (const_int 0 [0])
>                 ] UNSPEC_VUNDEF))) "rvv.c":11:23 1528 {pred_addvnx1di}
>      (expr_list:REG_DEAD (reg:VNx1DI 147)
>         (expr_list:REG_DEAD (reg/v:VNx1DI 138 [ v2 ])
>             (nil))))
> (insn 18 16 19 4 (set (reg/v:VNx1DI 140 [ v4 ])
>         (if_then_else:VNx1DI (unspec:VNx1BI [
>                     (const_vector:VNx1BI repeat [
>                             (const_int 1 [0x1])
>                         ])
>                     (const_int 4 [0x4])
>                     (const_int 2 [0x2]) repeated x2
>                     (const_int 0 [0])
>                     (reg:SI 66 vl)
>                     (reg:SI 67 vtype)
>                 ] UNSPEC_VPREDICATE)
>             (plus:VNx1DI (reg/v:VNx1DI 139 [ v3 ])
>                 (reg:VNx1DI 147))
>             (unspec:VNx1DI [
>                     (const_int 0 [0])
>                 ] UNSPEC_VUNDEF))) "rvv.c":12:23 1528 {pred_addvnx1di}
>      (expr_list:REG_DEAD (reg:VNx1DI 148)
>         (expr_list:REG_DEAD (reg/v:VNx1DI 139 [ v3 ])
>             (nil))))
> 
> You can see CPROP remove the second the "pred_broadcast" instruction and propagate the result to the second "pred_add" instruction?
 
I see loads of UNSPECs, that might explain why some passes do something
and some not.  That said, not sure what exactly CPROP does, the two
pred_broadcast insns look exactly the same so CSE should CSE them?
 
> 
> 
> juzhe.zhong@rivai.ai
>  
> From: Richard Biener
> Date: 2023-02-01 20:40
> To: Ju-Zhe Zhong
> CC: gcc-patches; kito.cheng; richard.sandiford; jeffreyalaw; apinski
> Subject: Re: [PATCH] CPROP: Allow cprop optimization when the function has a single block
> On Wed, 1 Feb 2023, juzhe.zhong@rivai.ai wrote:
>  
> > From: Ju-Zhe Zhong <juzhe.zhong@rivai.ai>
> > 
> > Hi, this patch is present for GCC 14 since I understand it's not appropriate
> > to land it in GCC 13.
> > 
> > NUM_FIXED_BLOCKS = 2 since GCC define each function has aleast 2 blocks
> > one is entry block, the other is exit block.
> > So according this code, the function will not do cprop optimization when
> > there is only exactly one single block.
>  
> cprop / GCSE are global dataflow problems, there's "nothing" to do for
> a single block, the local problem plus its application isn't considered
> but probably left for CSE.
>  
> Why does CSE not perform the desired transform?
>  
> > I am not sure whether it's correct to fix it like this.
> > Can anyone tell me why forbid cprop optimization if the function only has s
> > single block ?
> > 
> > Let's take a look at these 2 examples of RVV intrinsics:
> > 1.    void f1 (void * in, void *out, int64_t x, int n)
> >       {
> >       vint64m1_t v = __riscv_vle64_v_i64m1 (in + 1, 4);
> >       vint64m1_t v2 = __riscv_vle64_v_i64m1_tu (v, in + 2, 4);
> >       vint64m1_t v3 = __riscv_vadd_vx_i64m1 (v2, x, 4);
> >       vint64m1_t v4 = __riscv_vadd_vx_i64m1 (v3, x, 4);
> >       __riscv_vse64_v_i64m1 (out + 2, v4, 4);
> >       }
> > 
> > asm:
> >      addi sp,sp,-16
> > sw a2,8(sp)
> > sw a3,12(sp)
> > sw a2,0(sp)
> > sw a3,4(sp)
> > addi a5,a0,1
> > vsetivli zero,4,e64,m1,ta,ma
> > addi a0,a0,2
> > vle64.v v24,0(a5)
> > addi a5,sp,8
> > vlse64.v v27,0(a5),zero
> > addi a1,a1,2
> > vsetvli zero,zero,e64,m1,tu,ma
> > vle64.v v24,0(a0)
> > vsetvli zero,zero,e64,m1,ta,ma
> > vlse64.v v25,0(sp),zero
> > vadd.vv v26,v24,v27
> > vadd.vv v24,v26,v25
> > vse64.v v24,0(a1)
> > addi sp,sp,16
> > jr ra
> > you can see here there are 2 vlse64.v instructions that broadcast the scalar value "x"
> > GCC fail to eliminate the second vlse64.v instruction since GCC doesn't do the cprop
> > optimization (the function only has 1 single block). It can be optimized if we apply
> > this patch.
> > 
> > 2. void f1 (void * in, void *out, int64_t x, int n)
> > {
> >     if (n) {
> >       vint64m1_t v = __riscv_vle64_v_i64m1 (in + 1, 4);
> >       vint64m1_t v2 = __riscv_vle64_v_i64m1_tu (v, in + 2, 4);
> >       vint64m1_t v3 = __riscv_vadd_vx_i64m1 (v2, x, 4);
> >       vint64m1_t v4 = __riscv_vadd_vx_i64m1 (v3, x, 4);
> >       __riscv_vse64_v_i64m1 (out + 2, v4, 4);
> >     }
> > }
> > 
> > asm:
> >   f1:
> > vsetivli zero,4,e64,m1,ta,ma
> > beq a4,zero,.L7
> > addi sp,sp,-16
> > sw a2,8(sp)
> > sw a3,12(sp)
> > addi a5,a0,1
> > vle64.v v24,0(a5)
> > addi a0,a0,2
> > addi a5,sp,8
> > vlse64.v v25,0(a5),zero
> > addi a1,a1,2
> > vsetvli zero,zero,e64,m1,tu,ma
> > vle64.v v24,0(a0)
> > vadd.vv v26,v24,v25
> > vadd.vv v24,v26,v25
> > vse64.v v24,0(a1)
> > addi sp,sp,16
> > jr ra
> > .L7:
> > ret
> > 
> > Here, if we add if (n) condition here, the program will end up with more than 1 block.
> > So GCC will do the cprop optimization and the second vlse64.v instruction is eliminated.
> > 
> > I am not sure whether this patch is correct.
> > Can anyone help me with that ?
> > Thanks.
> > 
> > 
> > gcc/ChangeLog:
> > 
> >         * cprop.cc (one_cprop_pass): Remove +1.
> > 
> > ---
> >  gcc/cprop.cc | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/gcc/cprop.cc b/gcc/cprop.cc
> > index 6ec0bda4a24..615bc5078b6 100644
> > --- a/gcc/cprop.cc
> > +++ b/gcc/cprop.cc
> > @@ -1749,7 +1749,7 @@ one_cprop_pass (void)
> >    int changed = 0;
> >  
> >    /* Return if there's nothing to do, or it is too expensive.  */
> > -  if (n_basic_blocks_for_fn (cfun) <= NUM_FIXED_BLOCKS + 1
> > +  if (n_basic_blocks_for_fn (cfun) <= NUM_FIXED_BLOCKS
> >        || gcse_or_cprop_is_too_expensive (_ ("const/copy propagation disabled")))
> >      return 0;
> >  
> > 
>  
>
Jeff Law Feb. 2, 2023, 2:36 p.m. UTC | #5
On 2/2/23 05:26, Richard Biener wrote:
> On Thu, 2 Feb 2023, juzhe.zhong@rivai.ai wrote:
> 
>> Yeah, Thanks. You are right. CSE should do the job.
>> Now I know the reason CSE failed to optimize is I include VL_REGNUM(66)/VTYPE_RENUM(67) hard reg
>> as the dependency of pred_broadcast:
>> (insn 19 18 20 4 (set (reg:VNx1DI 152)
>>>          (if_then_else:VNx1DI (unspec:VNx1BI [
>>>                      (const_vector:VNx1BI repeat [
>>>                              (const_int 1 [0x1])
>>>                          ])
>>>                      (const_int 4 [0x4])
>>>                      (const_int 2 [0x2]) repeated x2
>>>                      (const_int 0 [0])
>>>                      (reg:SI 66 vl)
>>>                      (reg:SI 67 vtype)
>>>                  ] UNSPEC_VPREDICATE)
>>>              (vec_duplicate:VNx1DI (reg/v:DI 148 [ x ]))
>>>              (unspec:VNx1DI [
>>>                      (const_int 0 [0])
>>>                  ] UNSPEC_VUNDEF))) "rvv.c":22:23 695 {pred_broadcastvnx1di}
>>>       (nil))
>> Then CSE failed to set the 152 as copy.
>>
>> VL_REGNUM(66)/VTYPE_RENUM(67) are the global hard reg that I should make each RVV instruction depend on them.
>> Since we use vsetvl instruction (which is setting global VL_REGNUM(66)/VTYPE_RENUM(67) status) to set the global status for
>> each RVV instruction.
>> Including the dependency here is to make sure the global VL/VTYPE status is correct of each RVV instruction. (If we don't include
>> such dependency in RVV instruction, instruction scheduling may move the RVV instructions and vsetvl instructions randomly then
>> produce incorrect vsetvl configuration)
>>
>> The original reg_class of VL_REGNUM(66)/VTYPE_RENUM(67) I set here:
>> riscv_regno_to_class [VL_REGNUM] = VL_REGS;
>> riscv_regno_to_class [VTYPE_RENUM] = VTYPE_REGS;
>> Such configuration make CSE failed.
>>
>> However, if I change the reg_class :
>> riscv_regno_to_class [VL_REGNUM] = NO_REGS;
>> riscv_regno_to_class [VTYPE_RENUM] = NO_REGS;
>> The CSE now can do the optimization now!
>>
>> 1) Would you mind telling me the difference between them?
> 
> No idea.  I think CSE avoids to touch hard register references because
> eliding them to copies can increase register pressure.IIRC the costing is set up differently and for a given partition a 
pseudo will be preferred over a hard reg.  This is in addition to other 
places that test the small register class hooks.



> 
>> 2) If I set these 2 global status register as NO_REGS, will it create
>>     issues for the global status configuration of each RVV instructions ?
> 
> No idea either.  Usually these kind of dependences are introduced
> by targets at the point the VL setting is introduced to avoid
> pessimizing optimizations earlier.  Often, for cases like a VL
> register, this is done after register allocation only and indeed
> necessary to avoid the second scheduling pass from breaking things.
Yea.  I'm wondering about when the right place to introduce these 
dependencies might be.  I'm still a few months out from worrying about 
RVV, but it's not too far away.
jeff
Jeff Law Feb. 2, 2023, 2:36 p.m. UTC | #6
On 2/2/23 05:35, juzhe.zhong@rivai.ai wrote:
> Thank you so much. Kito helped me fix it already.
> RVV instruction patterns can have CSE optimizations now.
What was the issue?

jeff
钟居哲 Feb. 2, 2023, 2:43 p.m. UTC | #7
We set VL/VTYPE these 2 implicit global status denpency register as fixed reg.
Then CSE can do the optimization now.

>> Yea.  I'm wondering about when the right place to introduce these
>>dependencies might be.  I'm still a few months out from worrying about
>>RVV, but it's not too far away.
You don't need to worry about RVV. I can promise you that RVV support in GCC will be solid and
optimal. You can just try. For example, try VSETVL PASS,  this PASS implemented in GCC is much better
than LLVM. I have include so many fancy optimizations there.


juzhe.zhong@rivai.ai
 
From: Jeff Law
Date: 2023-02-02 22:36
To: juzhe.zhong@rivai.ai; rguenther
CC: gcc-patches; kito.cheng; richard.sandiford; apinski
Subject: Re: [PATCH] CPROP: Allow cprop optimization when the function has a single block
 
 
On 2/2/23 05:35, juzhe.zhong@rivai.ai wrote:
> Thank you so much. Kito helped me fix it already.
> RVV instruction patterns can have CSE optimizations now.
What was the issue?
 
jeff
Kito Cheng Feb. 2, 2023, 3:18 p.m. UTC | #8
> > Thank you so much. Kito helped me fix it already.
> > RVV instruction patterns can have CSE optimizations now.
> What was the issue?

VL and VTYPE isn't listed in fixed register so CSE feel that isn't
cheap (See CHEAP_REGNO in cse.cc),
but actually it's kind of mistake sett for VL and VTYPE register to
non fixed register,
it all managed by vsetvl insertion pass, and won't involved into the
register allocation
process, so it should be set 1 in FIXED_REGISTERS,

then CSE pass is happy to cse that after we fix that :)

More story behind that is we were trying to rely on RA to manage VL
and VTYPE before,
and then...we gave up and decided to manage that by ourselves.
diff mbox series

Patch

diff --git a/gcc/cprop.cc b/gcc/cprop.cc
index 6ec0bda4a24..615bc5078b6 100644
--- a/gcc/cprop.cc
+++ b/gcc/cprop.cc
@@ -1749,7 +1749,7 @@  one_cprop_pass (void)
   int changed = 0;
 
   /* Return if there's nothing to do, or it is too expensive.  */
-  if (n_basic_blocks_for_fn (cfun) <= NUM_FIXED_BLOCKS + 1
+  if (n_basic_blocks_for_fn (cfun) <= NUM_FIXED_BLOCKS
       || gcse_or_cprop_is_too_expensive (_ ("const/copy propagation disabled")))
     return 0;