Message ID | 20231025120518.1319929-1-juzhe.zhong@rivai.ai |
---|---|
State | New |
Series | [V2] RISC-V: Add AVL propagation PASS for RVV auto-vectorization |
LGTM, Thanks, it's really awesome - the implementation is simpler than I
expected, it's another great improvement for RISC-V GCC!

Just make sure Patrick gives a green light on the testing before
committing the patch :)

On Wed, Oct 25, 2023 at 8:05 PM Juzhe-Zhong <juzhe.zhong@rivai.ai> wrote:
>
> This patch addresses the redundant AVL/VL toggling in RVV partial auto-vectorization,
> which has been a known issue for a long time; I finally found the time to address it.
>
> Consider a simple vector addition operation:
>
> https://godbolt.org/z/7hfGfEjW3
>
> void
> foo (int *__restrict a,
>      int *__restrict b,
>      int n)
> {
>   for (int i = 0; i < n; i++)
>     a[i] = a[i] + b[i];
> }
>
> Optimized IR:
>
> Loop body:
>   _38 = .SELECT_VL (ivtmp_36, POLY_INT_CST [4, 4]);                          -> vsetvli a5,a2,e8,mf4,ta,ma
>   ...
>   vect__4.8_27 = .MASK_LEN_LOAD (vectp_a.6_29, 32B, { -1, ... }, _38, 0);    -> vle32.v v2,0(a0)
>   vect__6.11_20 = .MASK_LEN_LOAD (vectp_b.9_25, 32B, { -1, ... }, _38, 0);   -> vle32.v v1,0(a1)
>   vect__7.12_19 = vect__6.11_20 + vect__4.8_27;                              -> vsetvli a6,zero,e32,m1,ta,ma + vadd.vv v1,v1,v2
>   .MASK_LEN_STORE (vectp_a.13_11, 32B, { -1, ... }, _38, 0, vect__7.12_19);  -> vsetvli zero,a5,e32,m1,ta,ma + vse32.v v1,0(a4)
>
> We can see 2 redundant vsetvls inside the loop body due to AVL/VL toggling.
> The AVL/VL toggling happens because we are missing LEN information in the simple PLUS_EXPR GIMPLE assignment:
>
> vect__7.12_19 = vect__6.11_20 + vect__4.8_27;
>
> GCC applies partial predicated load/store and un-predicated full vector operations in partial vectorization.
> Such a flow is used by all other targets like ARM SVE (RVV also uses such a flow):
>
> ARM SVE:
>
> .L3:
>         ld1w    z30.s, p7/z, [x0, x3, lsl 2]   -> predicated load
>         ld1w    z31.s, p7/z, [x1, x3, lsl 2]   -> predicated load
>         add     z31.s, z31.s, z30.s            -> un-predicated add
>         st1w    z31.s, p7, [x0, x3, lsl 2]     -> predicated store
>
> Such a vectorization flow causes AVL/VL toggling on RVV, so we need an AVL propagation PASS for it.
>
> Also, it's very unlikely that we can apply predicated operations on all vectorization, for the following reasons:
>
> 1. It's a very heavy workload to support them on all vectorization, and we don't see any benefit if we can handle that in the target backend.
> 2. Changing the Loop vectorizer for it will make the code base ugly and hard to maintain.
> 3. We will need so many patterns for all operations. Not only COND_LEN_ADD, COND_LEN_SUB, ....
>    We also need COND_LEN_EXTEND, ...., COND_LEN_CEIL, ... .. over 100+ patterns, an unreasonable number of patterns.
>
> To conclude, we prefer un-predicated operations here, and design a nice and clean AVL propagation PASS to elide the redundant vsetvls
> due to AVL/VL toggling.
>
> The second question is why we add a separate PASS called AVL propagation. Why not optimize it in the VSETVL PASS? (We definitely can optimize AVL in the VSETVL PASS.)
>
> Frankly, I was planning to address this issue in the VSETVL PASS, which is why we recently refactored the VSETVL PASS. However, I changed my mind recently after several
> experiments and tries.
>
> The reasons are as follows:
>
> 1. Code base management and maintenance. The current VSETVL PASS is complicated enough and already has enough aggressive and fancy optimizations, which
>    turn out to generate optimal codegen in most cases. It's not a good idea to keep adding more features into the VSETVL PASS and make it
>    heavier and heavier again; then we would need to refactor it again in the future.
>    Actually, the VSETVL PASS is very stable and optimal after the recent refactoring. Hopefully, we should not change the VSETVL PASS any more except for minor
>    fixes.
>
> 2. vsetvl insertion (which the VSETVL PASS does) and AVL propagation are 2 different things; I don't think we should fuse them into the same PASS.
>
> 3. The VSETVL PASS is a post-RA PASS, whereas AVL propagation should be done before RA, which can reduce scalar register pressure.
>
> 4. This patch's AVL propagation PASS only does AVL propagation for RVV partial auto-vectorization situations.
>    This patch's code is only a few hundred lines, which is very manageable and can very easily be extended with features and enhancements.
>    We can easily extend and enhance more AVL propagation in a clean and separate PASS in the future. (If we do it in the VSETVL PASS, we will complicate
>    the VSETVL PASS again, which is already so complicated.)
>
> Here is an example to demonstrate more:
>
> https://godbolt.org/z/bE86sv3q5
>
> void foo2 (int *__restrict a,
>           int *__restrict b,
>           int *__restrict c,
>           int *__restrict a2,
>           int *__restrict b2,
>           int *__restrict c2,
>           int *__restrict a3,
>           int *__restrict b3,
>           int *__restrict c3,
>           int *__restrict a4,
>           int *__restrict b4,
>           int *__restrict c4,
>           int *__restrict a5,
>           int *__restrict b5,
>           int *__restrict c5,
>           int n)
> {
>     for (int i = 0; i < n; i++){
>       a[i] = b[i] + c[i];
>       b5[i] = b[i] + c[i];
>       a2[i] = b2[i] + c2[i];
>       a3[i] = b3[i] + c3[i];
>       a4[i] = b4[i] + c4[i];
>       a5[i] = a[i] + a4[i];
>       a[i] = a5[i] + b5[i]+ a[i];
>
>       a[i] = a[i] + c[i];
>       b5[i] = a[i] + c[i];
>       a2[i] = a[i] + c2[i];
>       a3[i] = a[i] + c3[i];
>       a4[i] = a[i] + c4[i];
>       a5[i] = a[i] + a4[i];
>       a[i] = a[i] + b5[i]+ a[i];
>     }
> }
>
> 1. Loop Body:
>
>   Before this patch:                After this patch:
>
>   vsetvli a4,t1,e8,mf4,ta,ma        vsetvli a4,t1,e32,m1,ta,ma
>   vle32.v v2,0(a2)                  vle32.v v2,0(a2)
>   vle32.v v4,0(a1)                  vle32.v v3,0(t2)
>   vle32.v v1,0(t2)                  vle32.v v4,0(a1)
>   vsetvli a7,zero,e32,m1,ta,ma      vle32.v v1,0(t0)
>   vadd.vv v4,v2,v4                  vadd.vv v4,v2,v4
>   vsetvli zero,a4,e32,m1,ta,ma      vadd.vv v1,v3,v1
>   vle32.v v3,0(s0)                  vadd.vv v1,v1,v4
>   vsetvli a7,zero,e32,m1,ta,ma      vadd.vv v1,v1,v4
>   vadd.vv v1,v3,v1                  vadd.vv v1,v1,v4
>   vadd.vv v1,v1,v4                  vadd.vv v1,v1,v2
>   vadd.vv v1,v1,v4                  vadd.vv v2,v1,v2
>   vadd.vv v1,v1,v4                  vse32.v v2,0(t5)
>   vsetvli zero,a4,e32,m1,ta,ma      vadd.vv v2,v2,v1
>   vle32.v v4,0(a5)                  vadd.vv v2,v2,v1
>   vsetvli a7,zero,e32,m1,ta,ma      slli a7,a4,2
>   vadd.vv v1,v1,v2                  vadd.vv v3,v1,v3
>   vadd.vv v2,v1,v2                  vle32.v v5,0(a5)
>   vadd.vv v4,v1,v4                  vle32.v v6,0(t6)
>   vsetvli zero,a4,e32,m1,ta,ma      vse32.v v3,0(t3)
>   vse32.v v2,0(t5)                  vse32.v v2,0(a0)
>   vse32.v v4,0(a3)                  vadd.vv v3,v3,v1
>   vsetvli a7,zero,e32,m1,ta,ma      vadd.vv v2,v1,v5
>   vadd.vv v3,v1,v3                  vse32.v v3,0(t4)
>   vadd.vv v2,v2,v1                  vadd.vv v1,v1,v6
>   vadd.vv v2,v2,v1                  vse32.v v2,0(a3)
>   vsetvli zero,a4,e32,m1,ta,ma      vse32.v v1,0(a6)
>   vse32.v v2,0(a0)
>   vse32.v v3,0(t3)
>   vle32.v v2,0(t0)
>   vsetvli a7,zero,e32,m1,ta,ma
>   vadd.vv v3,v3,v1
>   vsetvli zero,a4,e32,m1,ta,ma
>   vse32.v v3,0(t4)
>   vsetvli a7,zero,e32,m1,ta,ma
>   slli a7,a4,2
>   vadd.vv v1,v1,v2
>   sub t1,t1,a4
>   vsetvli zero,a4,e32,m1,ta,ma
>   vse32.v v1,0(a6)
>
> It's quite obvious that all heavy and redundant vsetvls inside the loop body are eliminated.
>
> 2. Epilogue:
>
>   Before this patch:                After this patch:
>
>   .L5:                              .L5:
>       ld s0,8(sp)                       ret
>       addi sp,sp,16
>       jr ra
>
> This is the benefit of doing the AVL propagation before RA: we eliminate the use of the 'a7' register,
> which is used by the redundant AVL/VL toggling instruction 'vsetvli a7,zero,e32,m1,ta,ma'.
>
> The final codegen after this patch:
>
> foo2:
>         lw t1,56(sp)
>         ld t6,0(sp)
>         ld t3,8(sp)
>         ld t0,16(sp)
>         ld t2,24(sp)
>         ld t4,32(sp)
>         ld t5,40(sp)
>         ble t1,zero,.L5
> .L3:
>         vsetvli a4,t1,e32,m1,ta,ma
>         vle32.v v2,0(a2)
>         vle32.v v3,0(t2)
>         vle32.v v4,0(a1)
>         vle32.v v1,0(t0)
>         vadd.vv v4,v2,v4
>         vadd.vv v1,v3,v1
>         vadd.vv v1,v1,v4
>         vadd.vv v1,v1,v4
>         vadd.vv v1,v1,v4
>         vadd.vv v1,v1,v2
>         vadd.vv v2,v1,v2
>         vse32.v v2,0(t5)
>         vadd.vv v2,v2,v1
>         vadd.vv v2,v2,v1
>         slli a7,a4,2
>         vadd.vv v3,v1,v3
>         vle32.v v5,0(a5)
>         vle32.v v6,0(t6)
>         vse32.v v3,0(t3)
>         vse32.v v2,0(a0)
>         vadd.vv v3,v3,v1
>         vadd.vv v2,v1,v5
>         vse32.v v3,0(t4)
>         vadd.vv v1,v1,v6
>         vse32.v v2,0(a3)
>         vse32.v v1,0(a6)
>         sub t1,t1,a4
>         add a1,a1,a7
>         add a2,a2,a7
>         add a5,a5,a7
>         add t6,t6,a7
>         add t0,t0,a7
>         add t2,t2,a7
>         add t5,t5,a7
>         add a3,a3,a7
>         add a6,a6,a7
>         add t3,t3,a7
>         add t4,t4,a7
>         add a0,a0,a7
>         bne t1,zero,.L3
> .L5:
>         ret
>
>
>         PR target/111318
>         PR target/111888
>
> gcc/ChangeLog:
>
>         * config.gcc: Add AVL propagation PASS.
>         * config/riscv/riscv-passes.def (INSERT_PASS_AFTER): Ditto.
>         * config/riscv/riscv-protos.h (make_pass_avlprop): Ditto.
>         * config/riscv/t-riscv: Ditto.
>         * config/riscv/riscv-avlprop.cc: New file.
>
> gcc/testsuite/ChangeLog:
>
>         * gcc.target/riscv/rvv/autovec/partial/select_vl-2.c: Adapt test.
>         * gcc.target/riscv/rvv/autovec/ternop/ternop_nofm-2.c: Ditto.
>         * gcc.target/riscv/rvv/autovec/pr111318.c: New test.
>         * gcc.target/riscv/rvv/autovec/pr111888.c: New test.
>
> ---
>  gcc/config.gcc                                |   2 +-
>  gcc/config/riscv/riscv-avlprop.cc             | 419 ++++++++++++++++++
>  gcc/config/riscv/riscv-passes.def             |   1 +
>  gcc/config/riscv/riscv-protos.h               |   1 +
>  gcc/config/riscv/t-riscv                      |   6 +
>  .../riscv/rvv/autovec/partial/select_vl-2.c   |   5 +-
>  .../gcc.target/riscv/rvv/autovec/pr111318.c   |  16 +
>  .../gcc.target/riscv/rvv/autovec/pr111888.c   |  33 ++
>  .../riscv/rvv/autovec/ternop/ternop_nofm-2.c  |   1 -
>  9 files changed, 480 insertions(+), 4 deletions(-)
>  create mode 100644 gcc/config/riscv/riscv-avlprop.cc
>  create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/pr111318.c
>  create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/pr111888.c
>
> diff --git a/gcc/config.gcc b/gcc/config.gcc
> index 606d3a8513e..efd53965c9a 100644
> --- a/gcc/config.gcc
> +++ b/gcc/config.gcc
> @@ -544,7 +544,7 @@ pru-*-*)
>  riscv*)
>          cpu_type=riscv
>          extra_objs="riscv-builtins.o riscv-c.o riscv-sr.o riscv-shorten-memrefs.o riscv-selftests.o riscv-string.o"
> -        extra_objs="${extra_objs} riscv-v.o riscv-vsetvl.o riscv-vector-costs.o"
> +        extra_objs="${extra_objs} riscv-v.o riscv-vsetvl.o riscv-vector-costs.o riscv-avlprop.o"
>          extra_objs="${extra_objs} riscv-vector-builtins.o riscv-vector-builtins-shapes.o riscv-vector-builtins-bases.o"
>          extra_objs="${extra_objs} thead.o"
>          d_target_objs="riscv-d.o"
> diff --git a/gcc/config/riscv/riscv-avlprop.cc b/gcc/config/riscv/riscv-avlprop.cc
> new file mode 100644
> index 00000000000..2c79ec81806
> --- /dev/null
> +++ b/gcc/config/riscv/riscv-avlprop.cc
> @@ -0,0 +1,419 @@
> +/* AVL propagation pass for RISC-V 'V' Extension for GNU compiler.
> +   Copyright (C) 2023-2023 Free Software Foundation, Inc.
> +   Contributed by Juzhe Zhong (juzhe.zhong@rivai.ai), RiVAI Technologies Ltd.
> +
> +This file is part of GCC.
> +
> +GCC is free software; you can redistribute it and/or modify
> +it under the terms of the GNU General Public License as published by
> +the Free Software Foundation; either version 3, or (at your option)
> +any later version.
> +
> +GCC is distributed in the hope that it will be useful,
> +but WITHOUT ANY WARRANTY; without even the implied warranty of
> +MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> +GNU General Public License for more details.
> +
> +You should have received a copy of the GNU General Public License
> +along with GCC; see the file COPYING3.  If not see
> +<http://www.gnu.org/licenses/>.  */
> +
> +/* Pre-RA RTL_SSA-based pass propagates AVL for RVV instructions.
> +   A standalone AVL propagation pass is designed because:
> +
> +     - Better code maintain:
> +       Current LCM-based VSETVL pass is so complicated that codes
> +       there will become even harder to maintain. A straight forward
> +       AVL propagation PASS is much easier to maintain.
> +
> +     - Reduce scalar register pressure:
> +       A type of AVL propagation is we propagate AVL from NON-VLMAX
> +       instruction to VLMAX instruction.
> +       Note: VLMAX instruction should be ignore tail elements (TA)
> +       and the result should be used by the NON-VLMAX instruction.
> +       This optimization is mostly for auto-vectorization codes:
> +
> +          vsetvli r136, r137      --- SELECT_VL
> +          vle8.v (use avl = r136) --- IFN_MASK_LEN_LOAD
> +          vadd.vv (use VLMAX)     --- PLUS_EXPR
> +          vse8.v (use avl = r136) --- IFN_MASK_LEN_STORE
> +
> +        NO AVL propagation:
> +
> +          vsetvli a5, a4, ta
> +          vle8.v v1
> +          vsetvli t0, zero, ta
> +          vadd.vv v2, v1, v1
> +          vse8.v v2
> +
> +        We can propagate the AVL to 'vadd.vv' since its result
> +        is consumed by a 'vse8.v' which has AVL = a5 and its
> +        tail elements are agnostic.
> +
> +        We DON'T do this optimization on VSETVL pass since it is a
> +        post-RA pass that consumed 't0' already whereas a standalone
> +        pre-RA AVL propagation pass allows us elide the consumption
> +        of the pseudo register of 't0' then we can reduce scalar
> +        register pressure.
> +
> +     - More AVL propagation opportunities:
> +       A pre-RA pass is more flexible for AVL REG def-use chain,
> +       thus we will get more potential AVL propagation as long as
> +       it doesn't increase the scalar register pressure.
> +*/
> +
> +#define IN_TARGET_CODE 1
> +#define INCLUDE_ALGORITHM
> +#define INCLUDE_FUNCTIONAL
> +
> +#include "config.h"
> +#include "system.h"
> +#include "coretypes.h"
> +#include "tm.h"
> +#include "backend.h"
> +#include "rtl.h"
> +#include "target.h"
> +#include "tree-pass.h"
> +#include "df.h"
> +#include "rtl-ssa.h"
> +#include "cfgcleanup.h"
> +#include "insn-attr.h"
> +
> +using namespace rtl_ssa;
> +using namespace riscv_vector;
> +
> +enum avlprop_type
> +{
> +  /* VLMAX AVL and tail agnostic candidates.  */
> +  AVLPROP_VLMAX_TA,
> +  AVLPROP_NONE
> +};
> +
> +/* dump helper functions */
> +static const char *
> +avlprop_type_to_str (enum avlprop_type type)
> +{
> +  switch (type)
> +    {
> +    case AVLPROP_VLMAX_TA:
> +      return "vlmax_ta";
> +
> +    default:
> +      gcc_unreachable ();
> +    }
> +}
> +
> +static bool
> +vlmax_ta_p (rtx_insn *rinsn)
> +{
> +  return vlmax_avl_type_p (rinsn) && tail_agnostic_p (rinsn);
> +}
> +
> +const pass_data pass_data_avlprop = {
> +  RTL_PASS,      /* type */
> +  "avlprop",     /* name */
> +  OPTGROUP_NONE, /* optinfo_flags */
> +  TV_NONE,       /* tv_id */
> +  0,             /* properties_required */
> +  0,             /* properties_provided */
> +  0,             /* properties_destroyed */
> +  0,             /* todo_flags_start */
> +  0,             /* todo_flags_finish */
> +};
> +
> +class pass_avlprop : public rtl_opt_pass
> +{
> +public:
> +  pass_avlprop (gcc::context *ctxt) : rtl_opt_pass (pass_data_avlprop, ctxt) {}
> +
> +  /* opt_pass methods: */
> +  virtual bool gate (function *) final override
> +  {
> +    return TARGET_VECTOR && optimize > 0;
> +  }
> +  virtual unsigned int execute (function *) final override;
> +
> +private:
> +  /* The AVL propagation instructions and corresponding preferred AVL.
> +     It will be updated during the analysis.  */
> +  hash_map<insn_info *, rtx> *m_avl_propagations;
> +
> +  /* Potential feasible AVL propagation candidates.  */
> +  auto_vec<std::pair<enum avlprop_type, insn_info *>> m_candidates;
> +
> +  rtx get_preferred_avl (const std::pair<enum avlprop_type, insn_info *>) const;
> +  rtx get_vlmax_ta_preferred_avl (insn_info *) const;
> +  rtx get_nonvlmax_avl (insn_info *) const;
> +
> +  void avlprop_init (function *);
> +  void avlprop_done (void);
> +}; // class pass_avlprop
> +
> +void
> +pass_avlprop::avlprop_init (function *fn)
> +{
> +  calculate_dominance_info (CDI_DOMINATORS);
> +  df_analyze ();
> +  crtl->ssa = new function_info (fn);
> +  m_avl_propagations = new hash_map<insn_info *, rtx>;
> +}
> +
> +void
> +pass_avlprop::avlprop_done (void)
> +{
> +  free_dominance_info (CDI_DOMINATORS);
> +  if (crtl->ssa->perform_pending_updates ())
> +    cleanup_cfg (0);
> +  delete crtl->ssa;
> +  crtl->ssa = nullptr;
> +  delete m_avl_propagations;
> +  m_avl_propagations = NULL;
> +  if (!m_candidates.is_empty ())
> +    m_candidates.release ();
> +}
> +
> +/* If we have a preferred AVL to propagate, return the AVL.
> +   Otherwise, return NULL_RTX as we don't need have any preferred
> +   AVL.  */
> +
> +rtx
> +pass_avlprop::get_preferred_avl (
> +  const std::pair<enum avlprop_type, insn_info *> candidate) const
> +{
> +  switch (candidate.first)
> +    {
> +    case AVLPROP_VLMAX_TA:
> +      return get_vlmax_ta_preferred_avl (candidate.second);
> +    default:
> +      gcc_unreachable ();
> +    }
> +  return NULL_RTX;
> +}
> +
> +/* This is a straight forward pattern ALWAYS in partial auto-vectorization:
> +
> +     VL = SELECT_AVL (AVL, ...)
> +     V0 = MASK_LEN_LOAD (..., VL)
> +     V1 = MASK_LEN_LOAD (..., VL)
> +     V2 = V0 + V1 --- Missed LEN information.
> +     MASK_LEN_STORE (..., V2, VL)
> +
> +   We prefer PLUS_EXPR (V0 + V1) instead of COND_LEN_ADD (V0, V1, dummy LEN)
> +   because:
> +
> +     - Few code changes in Loop Vectorizer.
> +     - Reuse the current clean flow of partial vectorization, That is, apply
> +       predicate LEN or MASK into LOAD/STORE operations and other special
> +       arithmetic operations (e.g. DIV), then do the whole vector register
> +       operation if it DON'T affect the correctness.
> +       Such flow is used by all other targets like x86, sve, s390, ... etc.
> +     - PLUS_EXPR has better gimple optimizations than COND_LEN_ADD.
> +
> +   We propagate AVL from NON-VLMAX to VLMAX for gimple IR like PLUS_EXPR which
> +   generates the VLMAX instruction due to missed LEN information. The later
> +   VSETVL PASS will elide the redundant vsetvls.
> +*/
> +
> +rtx
> +pass_avlprop::get_vlmax_ta_preferred_avl (insn_info *insn) const
> +{
> +  int sew = get_sew (insn->rtl ());
> +  enum vlmul_type vlmul = get_vlmul (insn->rtl ());
> +  int ratio = calculate_ratio (sew, vlmul);
> +
> +  rtx use_avl = NULL_RTX;
> +  for (def_info *def : insn->defs ())
> +    {
> +      if (!is_a<set_info *> (def) || def->is_mem ())
> +        return NULL_RTX;
> +      const auto *set = dyn_cast<set_info *> (def);
> +
> +      /* FIXME: Stop AVL propagation if any USE is not a RVV real
> +         instruction. It should be totally enough for vectorized codes since
> +         they always locate at extended blocks.
> +
> +         TODO: We can extend PHI checking for intrinsic codes if it
> +         necessary in the future.  */
> +      if (!set->is_local_to_ebb ())
> +        return NULL_RTX;
> +
> +      for (use_info *use : set->nondebug_insn_uses ())
> +        {
> +          insn_info *use_insn = use->insn ();
> +          if (!use_insn->can_be_optimized () || use_insn->is_asm ()
> +              || use_insn->is_call () || use_insn->has_volatile_refs ()
> +              || use_insn->has_pre_post_modify ()
> +              || !has_vl_op (use_insn->rtl ())
> +              || !tail_agnostic_p (use_insn->rtl ()))
> +            return NULL_RTX;
> +
> +          int new_sew = get_sew (use_insn->rtl ());
> +          enum vlmul_type new_vlmul = get_vlmul (use_insn->rtl ());
> +          int new_ratio = calculate_ratio (new_sew, new_vlmul);
> +          if (new_ratio != ratio)
> +            return NULL_RTX;
> +
> +          rtx new_use_avl = get_nonvlmax_avl (use_insn);
> +          if (!new_use_avl || SUBREG_P (new_use_avl))
> +            return NULL_RTX;
> +          if (REG_P (new_use_avl))
> +            {
> +              resource_info resource = full_register (REGNO (new_use_avl));
> +              def_lookup dl = crtl->ssa->find_def (resource, use_insn);
> +              if (dl.matching_set ())
> +                return NULL_RTX;
> +              def_info *def1 = dl.prev_def (insn);
> +              def_info *def2 = dl.prev_def (use_insn);
> +              if (!def1 || !def2 || def1 != def2)
> +                return NULL_RTX;
> +
> +              /* FIXME: We only allow AVL propagation within a block which should
> +                 be totally enough for vectorized codes.
> +
> +                 TODO: We can enhance it here for intrinsic codes in the future
> +                 if it is necessary.  */
> +              if (def1->insn ()->bb () != insn->bb ()
> +                  && !dominated_by_p (CDI_DOMINATORS, insn->bb ()->cfg_bb (),
> +                                      def1->insn ()->bb ()->cfg_bb ()))
> +                return NULL_RTX;
> +              if (def1->insn ()->bb () == insn->bb ()
> +                  && def1->insn ()->compare_with (insn) >= 0)
> +                return NULL_RTX;
> +            }
> +
> +          if (!use_avl)
> +            use_avl = new_use_avl;
> +          else if (!rtx_equal_p (use_avl, new_use_avl))
> +            return NULL_RTX;
> +        }
> +    }
> +
> +  return use_avl;
> +}
> +
> +/* Try to get the NONVLMAX AVL of the INSN.
> +   INSN can be either NON-VLMAX AVL itself or VLMAX AVL INSN
> +   before the PASS but has been propagated a NON-VLMAX AVL
> +   in the before round propagation.  */
> +rtx
> +pass_avlprop::get_nonvlmax_avl (insn_info *insn) const
> +{
> +  if (m_avl_propagations->get (insn))
> +    return (*m_avl_propagations->get (insn));
> +  else if (nonvlmax_avl_type_p (insn->rtl ()))
> +    {
> +      extract_insn_cached (insn->rtl ());
> +      return recog_data.operand[get_attr_vl_op_idx (insn->rtl ())];
> +    }
> +
> +  return NULL_RTX;
> +}
> +
> +/* Main entry point for this pass.  */
> +unsigned int
> +pass_avlprop::execute (function *fn)
> +{
> +  avlprop_init (fn);
> +
> +  /* Iterate the whole function in reverse order (which could speed the
> +     convergence) to collect all potential candidates that could be AVL
> +     propagated.
> +
> +     Note that: **NOT** all the candidates will be successfully AVL propagated.
> +  */
> +  for (bb_info *bb : crtl->ssa->reverse_bbs ())
> +    {
> +      for (insn_info *insn : bb->reverse_real_nondebug_insns ())
> +        {
> +          /* We only forward AVL to the instruction that has AVL/VL operand
> +             and can be optimized in RTL_SSA level.  */
> +          if (!insn->can_be_optimized () || !has_vl_op (insn->rtl ()))
> +            continue;
> +
> +          /* TODO: We only do AVL propagation for VLMAX AVL with tail
> +             agnostic policy since we have missed-LEN information partial
> +             autovectorization.  We could add more AVL propagation
> +             for intrinsic codes in the future.  */
> +          if (vlmax_ta_p (insn->rtl ()))
> +            m_candidates.safe_push (std::make_pair (AVLPROP_VLMAX_TA, insn));
> +        }
> +    }
> +
> +  if (dump_file && (dump_flags & TDF_DETAILS))
> +    {
> +      fprintf (dump_file, "\nNumber of potential AVL propagations: %d\n",
> +               m_candidates.length ());
> +      for (const auto candidate : m_candidates)
> +        {
> +          fprintf (dump_file, "\nAVL propagation type: %s\n",
> +                   avlprop_type_to_str (candidate.first));
> +          print_rtl_single (dump_file, candidate.second->rtl ());
> +        }
> +    }
> +
> +  /* Go through all the candidates looking for AVL that we could propagate. */
> +  bool change_p = true;
> +  while (change_p)
> +    {
> +      change_p = false;
> +      for (auto &candidate : m_candidates)
> +        {
> +          rtx new_avl = get_preferred_avl (candidate);
> +          if (new_avl)
> +            {
> +              gcc_assert (!vlmax_avl_p (new_avl));
> +              auto &update
> +                = m_avl_propagations->get_or_insert (candidate.second);
> +              change_p = !rtx_equal_p (update, new_avl);
> +              update = new_avl;
> +            }
> +        }
> +    }
> +
> +  if (dump_file && (dump_flags & TDF_DETAILS))
> +    fprintf (dump_file, "\nNumber of successful AVL propagations: %d\n\n",
> +             (int) m_avl_propagations->elements ());
> +
> +  for (const auto prop : *m_avl_propagations)
> +    {
> +      rtx_insn *rinsn = prop.first->rtl ();
> +      if (dump_file && (dump_flags & TDF_DETAILS))
> +        {
> +          fprintf (dump_file, "\nPropagating AVL: ");
> +          print_rtl_single (dump_file, prop.second);
> +          fprintf (dump_file, "into: ");
> +          print_rtl_single (dump_file, rinsn);
> +        }
> +      /* Replace AVL operand.  */
> +      extract_insn_cached (rinsn);
> +      rtx avl = recog_data.operand[get_attr_vl_op_idx (rinsn)];
> +      int count = count_regno_occurrences (rinsn, REGNO (avl));
> +      gcc_assert (count == 1);
> +      rtx new_pat = simplify_replace_rtx (PATTERN (rinsn), avl, prop.second);
> +      validate_change_or_fail (rinsn, &PATTERN (rinsn), new_pat, false);
> +
> +      /* Change AVL TYPE into NONVLMAX if it is VLMAX.  */
> +      if (vlmax_avl_type_p (rinsn))
> +        {
> +          int index = get_attr_avl_type_idx (rinsn);
> +          gcc_assert (index != INVALID_ATTRIBUTE);
> +          validate_change_or_fail (rinsn, recog_data.operand_loc[index],
> +                                   get_avl_type_rtx (avl_type::NONVLMAX),
> +                                   false);
> +        }
> +      if (dump_file && (dump_flags & TDF_DETAILS))
> +        {
> +          fprintf (dump_file, "Successfully to match this instruction: ");
> +          print_rtl_single (dump_file, rinsn);
> +        }
> +    }
> +
> +  avlprop_done ();
> +  return 0;
> +}
> +
> +rtl_opt_pass *
> +make_pass_avlprop (gcc::context *ctxt)
> +{
> +  return new pass_avlprop (ctxt);
> +}
> diff --git a/gcc/config/riscv/riscv-passes.def b/gcc/config/riscv/riscv-passes.def
> index 4084122cf0a..b6260939d5c 100644
> --- a/gcc/config/riscv/riscv-passes.def
> +++ b/gcc/config/riscv/riscv-passes.def
> @@ -18,4 +18,5 @@
>     <http://www.gnu.org/licenses/>.  */
>
>  INSERT_PASS_AFTER (pass_rtl_store_motion, 1, pass_shorten_memrefs);
> +INSERT_PASS_AFTER (pass_split_all_insns, 1, pass_avlprop);
>  INSERT_PASS_BEFORE (pass_fast_rtl_dce, 1, pass_vsetvl);
> diff --git a/gcc/config/riscv/riscv-protos.h b/gcc/config/riscv/riscv-protos.h
> index 668d75043ca..d4e17fc3fd0 100644
> --- a/gcc/config/riscv/riscv-protos.h
> +++ b/gcc/config/riscv/riscv-protos.h
> @@ -156,6 +156,7 @@ extern void riscv_parse_arch_string (const char *, struct gcc_options *, locatio
>  extern bool riscv_hard_regno_rename_ok (unsigned, unsigned);
>
>  rtl_opt_pass * make_pass_shorten_memrefs (gcc::context *ctxt);
> +rtl_opt_pass * make_pass_avlprop (gcc::context *ctxt);
>  rtl_opt_pass * make_pass_vsetvl (gcc::context *ctxt);
>
>  /* Routines implemented in riscv-string.c.  */
> diff --git a/gcc/config/riscv/t-riscv b/gcc/config/riscv/t-riscv
> index dd17056fe82..f8ca3f4ac57 100644
> --- a/gcc/config/riscv/t-riscv
> +++ b/gcc/config/riscv/t-riscv
> @@ -78,6 +78,12 @@ riscv-vector-costs.o: $(srcdir)/config/riscv/riscv-vector-costs.cc \
>          $(COMPILER) -c $(ALL_COMPILERFLAGS) $(ALL_CPPFLAGS) $(INCLUDES) \
>                  $(srcdir)/config/riscv/riscv-vector-costs.cc
>
> +riscv-avlprop.o: $(srcdir)/config/riscv/riscv-avlprop.cc \
> +  $(CONFIG_H) $(SYSTEM_H) coretypes.h $(TM_H) $(RTL_H) $(REGS_H) \
> +  $(TARGET_H) tree-pass.h df.h rtl-ssa.h cfgcleanup.h insn-attr.h
> +        $(COMPILER) -c $(ALL_COMPILERFLAGS) $(ALL_CPPFLAGS) $(INCLUDES) \
> +                $(srcdir)/config/riscv/riscv-avlprop.cc
> +
>  riscv-d.o: $(srcdir)/config/riscv/riscv-d.cc \
>    $(CONFIG_H) $(SYSTEM_H) coretypes.h $(TM_H)
>          $(COMPILE) $<
> diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/partial/select_vl-2.c b/gcc/testsuite/gcc.target/riscv/rvv/autovec/partial/select_vl-2.c
> index eac7cbc757b..ca88d42cdf4 100644
> --- a/gcc/testsuite/gcc.target/riscv/rvv/autovec/partial/select_vl-2.c
> +++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/partial/select_vl-2.c
> @@ -7,10 +7,11 @@
>  /*
>  ** foo:
>  **      vsetivli\t[a-x0-9]+,\s*8,\s*e(8?|16?|32?|64),\s*m(1?|2?|4?|8?|f2?|f4?|f8),\s*t[au],\s*m[au]
> +**      ...
>  **      vle32\.v\tv[0-9]+,0\([a-x0-9]+\)
>  **      ...
> -**      vsetvli\t[a-x0-9]+,\s*[a-x0-9]+,\s*e(8?|16?|32?|64),\s*m(1?|2?|4?|8?|f2?|f4?|f8),\s*t[au],\s*m[au]
> -**      add\t[a-x0-9]+,[a-x0-9]+,[a-x0-9]+
> +**      vsetvli\tzero,\s*[a-x0-9]+,\s*e(8?|16?|32?|64),\s*m(1?|2?|4?|8?|f2?|f4?|f8),\s*t[au],\s*m[au]
> +**      ...
>  **      vle32\.v\tv[0-9]+,0\([a-x0-9]+\)
>  **      ...
>  */
> diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr111318.c b/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr111318.c
> new file mode 100644
> index 00000000000..ff36da8feeb
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr111318.c
> @@ -0,0 +1,16 @@
> +/* { dg-do compile } */
> +/* { dg-options "-march=rv64gcv -mabi=lp64d -O3 -fno-vect-cost-model" } */
> +
> +void
> +foo (int *__restrict a, int *__restrict b, int *__restrict c, int n)
> +{
> +  for (int i = 0; i < n; i += 1)
> +    c[i] = a[i] + b[i];
> +}
> +
> +/* { dg-final { scan-assembler-times {vsetvli} 1 } } */
> +/* { dg-final { scan-assembler-not {vsetivli} } } */
> +/* { dg-final { scan-assembler-times {vsetvli\s*[a-x0-9]+,\s*[a-x0-9]+} 1 } } */
> +/* { dg-final { scan-assembler-not {vsetvli\s*[a-x0-9]+,\s*zero} } } */
> +/* { dg-final { scan-assembler-not {vsetvli\s*zero} } } */
> +/* { dg-final { scan-assembler-not {vsetivli\s*zero} } } */
> diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr111888.c b/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr111888.c
> new file mode 100644
> index 00000000000..2387c20a26c
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr111888.c
> @@ -0,0 +1,33 @@
> +/* { dg-do compile } */
> +/* { dg-options "-march=rv64gcv -mabi=lp64d -O3 -fno-vect-cost-model" } */
> +
> +void
> +foo (int *__restrict a, int *__restrict b, int *__restrict c,
> +     int *__restrict a2, int *__restrict b2, int *__restrict c2,
> +     int *__restrict a3, int *__restrict b3, int *__restrict c3,
> +     int *__restrict a4, int *__restrict b4, int *__restrict c4,
> +     int *__restrict a5, int *__restrict b5, int *__restrict c5,
> +     int *__restrict d, int *__restrict d2, int *__restrict d3,
> +     int *__restrict d4, int *__restrict d5, int n, int m)
> +{
> +  for (int i = 0; i < n; i++)
> +    {
> +      a[i] = b[i] + c[i];
> +      a2[i] = b2[i] + c2[i];
> +      a3[i] = b3[i] + c3[i];
> +      a4[i] = b4[i] + c4[i];
> +      a5[i] = a[i] + a4[i];
> +      d[i] = a[i] - a2[i];
> +      d2[i] = a2[i] * a[i];
> +      d3[i] = a3[i] * a2[i];
> +      d4[i] = a2[i] * d2[i];
> +      d5[i] = a[i] * a2[i] * a3[i] * a4[i] * d[i];
> +    }
> +}
> +
> +/* { dg-final { scan-assembler-times {vsetvli} 1 } } */
> +/* { dg-final { scan-assembler-not {vsetivli} } } */
> +/* { dg-final { scan-assembler-times {vsetvli\s*[a-x0-9]+,\s*[a-x0-9]+} 1 } } */
> +/* { dg-final { scan-assembler-not {vsetvli\s*[a-x0-9]+,\s*zero} } } */
> +/* { dg-final { scan-assembler-not {vsetvli\s*zero} } } */
> +/* { dg-final { scan-assembler-not {vsetivli\s*zero} } } */
> diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/ternop/ternop_nofm-2.c b/gcc/testsuite/gcc.target/riscv/rvv/autovec/ternop/ternop_nofm-2.c
> index 965365da4bb..13367423751 100644
> --- a/gcc/testsuite/gcc.target/riscv/rvv/autovec/ternop/ternop_nofm-2.c
> +++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/ternop/ternop_nofm-2.c
> @@ -3,7 +3,6 @@
>
>  #include "ternop-2.c"
>
> -/* { dg-final { scan-assembler-times {\tvmacc\.vv} 8 } } */
>  /* { dg-final { scan-assembler-times {\tvfma[c-d][c-d]\.vv} 9 } } */
>  /* { dg-final { scan-tree-dump-times "COND_LEN_FMA" 9 "optimized" } } */
>  /* { dg-final { scan-assembler-not {\tvmv} } } */
> --
> 2.36.3
>
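
For readers following the thread, the core idea of the pass (give a VLMAX, tail-agnostic instruction the AVL of its consumers when they all agree and have the same SEW/LMUL ratio, and repeat until nothing changes) can be sketched outside GCC as a toy model. This is only an illustration under simplifying assumptions, not the patch's code: `ToyInsn`, `ratio`, `preferred_avl`, and `propagate_avl` are hypothetical names, AVLs are modelled as plain strings, and the RTL-SSA checks on the reaching definition of the AVL register are left out.

```cpp
// Toy model of AVL propagation; NOT GCC code. All names are hypothetical.
#include <optional>
#include <string>
#include <vector>

struct ToyInsn {
  std::string name;
  int sew;                         // element width in bits, e.g. 32
  int lmul_x8;                     // LMUL scaled by 8: mf4 -> 2, m1 -> 8
  bool tail_agnostic;              // "ta" policy
  std::optional<std::string> avl;  // std::nullopt models a VLMAX AVL
  std::vector<int> users;          // indices of insns consuming the result
};

// SEW/LMUL ratio: e8,mf4 and e32,m1 both give 32.
static int ratio(const ToyInsn &insn) { return insn.sew * 8 / insn.lmul_x8; }

// Return the AVL shared by every user of INSN, or std::nullopt when any
// user is tail-undisturbed, has a different ratio, still uses VLMAX, or
// disagrees with another user (propagation would then be unsafe).
static std::optional<std::string>
preferred_avl(const std::vector<ToyInsn> &insns, const ToyInsn &insn) {
  std::optional<std::string> found;
  for (int u : insn.users) {
    const ToyInsn &use = insns[u];
    if (!use.tail_agnostic || ratio(use) != ratio(insn) || !use.avl)
      return std::nullopt;
    if (found && *found != *use.avl)
      return std::nullopt;
    found = use.avl;
  }
  return found;  // stays empty when INSN has no users
}

// Walk the instructions in reverse order until a fixed point is reached,
// so an AVL can flow backwards through a chain of VLMAX instructions.
// Returns the number of instructions that received a new AVL.
static int propagate_avl(std::vector<ToyInsn> &insns) {
  int count = 0;
  bool changed = true;
  while (changed) {
    changed = false;
    for (int i = static_cast<int>(insns.size()) - 1; i >= 0; --i) {
      ToyInsn &insn = insns[i];
      if (insn.avl || !insn.tail_agnostic)
        continue;  // only VLMAX + tail-agnostic insns are candidates
      if (auto avl = preferred_avl(insns, insn)) {
        insn.avl = avl;
        ++count;
        changed = true;
      }
    }
  }
  return count;
}
```

Feeding this model a load/load/add/add/store chain in which only the two adds use VLMAX gives both adds the store's AVL, mirroring how the redundant `vsetvli a7,zero,...` toggles disappear in the listings above. Unlike the patch, which records propagations in a map and rewrites the instructions afterwards, the sketch mutates its toy instructions directly.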
Thanks Kito. I have sent V3 with adapting testcases (2 additional dump FAILs detected by both Pan Li and Patrick). No need to review. I will wait for patrick is ok to ignore popcount FAILs for now then commit it. juzhe.zhong@rivai.ai From: Kito Cheng Date: 2023-10-26 15:51 To: Juzhe-Zhong CC: gcc-patches; kito.cheng; jeffreyalaw; rdapp.gcc; Patrick O'Neill Subject: Re: [PATCH V2] RISC-V: Add AVL propagation PASS for RVV auto-vectorization LGTM, Thanks, it's really awesome - the implementation is simpler than I expected, it's another great improvement for RISC-V GCC! Just make sure Patrick gives a green light on the testing before committing the patch :) On Wed, Oct 25, 2023 at 8:05 PM Juzhe-Zhong <juzhe.zhong@rivai.ai> wrote: > > This patch addresses the redundant AVL/VL toggling in RVV partial auto-vectorization > which is a known issue for a long time and I finally find the time to address it. > > Consider a simple vector addition operation: > > https://godbolt.org/z/7hfGfEjW3 > > void > foo (int *__restrict a, > int *__restrict b, > int *__restrict n) > { > for (int i = 0; i < n; i++) > a[i] = a[i] + b[i]; > } > > Optimized IR: > > Loop body: > _38 = .SELECT_VL (ivtmp_36, POLY_INT_CST [4, 4]); -> vsetvli a5,a2,e8,mf4,ta,ma > ... > vect__4.8_27 = .MASK_LEN_LOAD (vectp_a.6_29, 32B, { -1, ... }, _38, 0); -> vle32.v v2,0(a0) > vect__6.11_20 = .MASK_LEN_LOAD (vectp_b.9_25, 32B, { -1, ... }, _38, 0); -> vle32.v v1,0(a1) > vect__7.12_19 = vect__6.11_20 + vect__4.8_27; -> vsetvli a6,zero,e32,m1,ta,ma + vadd.vv v1,v1,v2 > .MASK_LEN_STORE (vectp_a.13_11, 32B, { -1, ... }, _38, 0, vect__7.12_19); -> vsetvli zero,a5,e32,m1,ta,ma + vse32.v v1,0(a4) > > We can see 2 redundant vsetvls inside the loop body due to AVL/VL toggling. 
> The AVL/VL toggling is because we are missing LEN information in simple PLUS_EXPR GIMPLE assignment: > > vect__7.12_19 = vect__6.11_20 + vect__4.8_27; > > GCC apply partial predicate load/store and un-predicated full vector operation on partial vectorization. > Such flow are used by all other targets like ARM SVE (RVV also uses such flow): > > ARM SVE: > > .L3: > ld1w z30.s, p7/z, [x0, x3, lsl 2] -> predicated load > ld1w z31.s, p7/z, [x1, x3, lsl 2] -> predicated load > add z31.s, z31.s, z30.s -> un-predicated add > st1w z31.s, p7, [x0, x3, lsl 2] -> predicated store > > Such vectorization flow causes AVL/VL toggling on RVV so we need AVL propagation PASS for it. > > Also, It's very unlikely that we can apply predicated operations on all vectorization for following reasons: > > 1. It's very heavy workload to support them on all vectorization and we don't see any benefits if we can handle that on targets backend. > 2. Changing Loop vectorizer for it will make code base ugly and hard to maintain. > 3. We will need so many patterns for all operations. Not only COND_LEN_ADD, COND_LEN_SUB, .... > We also need COND_LEN_EXTEND, ...., COND_LEN_CEIL, ... .. over 100+ patterns, unreasonable number of patterns. > > To conclude, we prefer un-predicated operations here, and design a nice and clean AVL propagation PASS for it to elide the redundant vsetvls > due to AVL/VL toggling. > > The second question is that why we separate a PASS called AVL propagation. Why not optimize it in VSETVL PASS (We definitetly can optimize AVL in VSETVL PASS) > > Frankly, I was planning to address such issue in VSETVL PASS that's why we recently refactored VSETVL PASS. However, I changed my mind recently after several > experiments and tries. > > The reasons as follows: > > 1. For code base management and maintainience. 
The current VSETVL PASS is complicated enough and already has enough aggressive and fancy optimizations that > it generates optimal code in most cases. It is not a good idea to keep adding features to the VSETVL PASS and make it > heavier and heavier, or we will need to refactor it again in the future. > Actually, the VSETVL PASS is very stable and optimal after the recent refactoring. Hopefully we will not need to change it any more except for minor > fixes. > > 2. vsetvl insertion (which the VSETVL PASS does) and AVL propagation are 2 different things; I don't think we should fuse them into the same PASS. > > 3. The VSETVL PASS is a post-RA PASS, whereas AVL propagation should be done before RA, which can reduce register pressure. > > 4. This patch's AVL propagation PASS only handles RVV partial auto-vectorization. > The code in this patch is only a few hundred lines, which is very manageable and easy to extend. > We can easily extend and enhance AVL propagation in a clean, separate PASS in the future. (If we did it in the VSETVL PASS, we would complicate > the VSETVL PASS again, which is already so complicated.) 
> > Here is an example to demonstrate more: > > https://godbolt.org/z/bE86sv3q5 > > void foo2 (int *__restrict a, > int *__restrict b, > int *__restrict c, > int *__restrict a2, > int *__restrict b2, > int *__restrict c2, > int *__restrict a3, > int *__restrict b3, > int *__restrict c3, > int *__restrict a4, > int *__restrict b4, > int *__restrict c4, > int *__restrict a5, > int *__restrict b5, > int *__restrict c5, > int n) > { > for (int i = 0; i < n; i++){ > a[i] = b[i] + c[i]; > b5[i] = b[i] + c[i]; > a2[i] = b2[i] + c2[i]; > a3[i] = b3[i] + c3[i]; > a4[i] = b4[i] + c4[i]; > a5[i] = a[i] + a4[i]; > a[i] = a5[i] + b5[i]+ a[i]; > > a[i] = a[i] + c[i]; > b5[i] = a[i] + c[i]; > a2[i] = a[i] + c2[i]; > a3[i] = a[i] + c3[i]; > a4[i] = a[i] + c4[i]; > a5[i] = a[i] + a4[i]; > a[i] = a[i] + b5[i]+ a[i]; > } > } > > 1. Loop Body: > > Before this patch: After this patch: > > vsetvli a4,t1,e8,mf4,ta,ma vsetvli a4,t1,e32,m1,ta,ma > vle32.v v2,0(a2) vle32.v v2,0(a2) > vle32.v v4,0(a1) vle32.v v3,0(t2) > vle32.v v1,0(t2) vle32.v v4,0(a1) > vsetvli a7,zero,e32,m1,ta,ma vle32.v v1,0(t0) > vadd.vv v4,v2,v4 vadd.vv v4,v2,v4 > vsetvli zero,a4,e32,m1,ta,ma vadd.vv v1,v3,v1 > vle32.v v3,0(s0) vadd.vv v1,v1,v4 > vsetvli a7,zero,e32,m1,ta,ma vadd.vv v1,v1,v4 > vadd.vv v1,v3,v1 vadd.vv v1,v1,v4 > vadd.vv v1,v1,v4 vadd.vv v1,v1,v2 > vadd.vv v1,v1,v4 vadd.vv v2,v1,v2 > vadd.vv v1,v1,v4 vse32.v v2,0(t5) > vsetvli zero,a4,e32,m1,ta,ma vadd.vv v2,v2,v1 > vle32.v v4,0(a5) vadd.vv v2,v2,v1 > vsetvli a7,zero,e32,m1,ta,ma slli a7,a4,2 > vadd.vv v1,v1,v2 vadd.vv v3,v1,v3 > vadd.vv v2,v1,v2 vle32.v v5,0(a5) > vadd.vv v4,v1,v4 vle32.v v6,0(t6) > vsetvli zero,a4,e32,m1,ta,ma vse32.v v3,0(t3) > vse32.v v2,0(t5) vse32.v v2,0(a0) > vse32.v v4,0(a3) vadd.vv v3,v3,v1 > vsetvli a7,zero,e32,m1,ta,ma vadd.vv v2,v1,v5 > vadd.vv v3,v1,v3 vse32.v v3,0(t4) > vadd.vv v2,v2,v1 vadd.vv v1,v1,v6 > vadd.vv v2,v2,v1 vse32.v v2,0(a3) > vsetvli zero,a4,e32,m1,ta,ma vse32.v v1,0(a6) > vse32.v v2,0(a0) > vse32.v 
v3,0(t3) > vle32.v v2,0(t0) > vsetvli a7,zero,e32,m1,ta,ma > vadd.vv v3,v3,v1 > vsetvli zero,a4,e32,m1,ta,ma > vse32.v v3,0(t4) > vsetvli a7,zero,e32,m1,ta,ma > slli a7,a4,2 > vadd.vv v1,v1,v2 > sub t1,t1,a4 > vsetvli zero,a4,e32,m1,ta,ma > vse32.v v1,0(a6) > > Clearly, all the heavy and redundant vsetvls inside the loop body are eliminated. > > 2. Epilogue: > Before this patch: After this patch: > > .L5: .L5: > ld s0,8(sp) ret > addi sp,sp,16 > jr ra > > This is the benefit of doing the AVL propagation before RA: we eliminate the use of the 'a7' register > that is consumed by the redundant AVL/VL toggling instruction, 'vsetvli a7,zero,e32,m1,ta,ma'. > > The final codegen after this patch: > > foo2: > lw t1,56(sp) > ld t6,0(sp) > ld t3,8(sp) > ld t0,16(sp) > ld t2,24(sp) > ld t4,32(sp) > ld t5,40(sp) > ble t1,zero,.L5 > .L3: > vsetvli a4,t1,e32,m1,ta,ma > vle32.v v2,0(a2) > vle32.v v3,0(t2) > vle32.v v4,0(a1) > vle32.v v1,0(t0) > vadd.vv v4,v2,v4 > vadd.vv v1,v3,v1 > vadd.vv v1,v1,v4 > vadd.vv v1,v1,v4 > vadd.vv v1,v1,v4 > vadd.vv v1,v1,v2 > vadd.vv v2,v1,v2 > vse32.v v2,0(t5) > vadd.vv v2,v2,v1 > vadd.vv v2,v2,v1 > slli a7,a4,2 > vadd.vv v3,v1,v3 > vle32.v v5,0(a5) > vle32.v v6,0(t6) > vse32.v v3,0(t3) > vse32.v v2,0(a0) > vadd.vv v3,v3,v1 > vadd.vv v2,v1,v5 > vse32.v v3,0(t4) > vadd.vv v1,v1,v6 > vse32.v v2,0(a3) > vse32.v v1,0(a6) > sub t1,t1,a4 > add a1,a1,a7 > add a2,a2,a7 > add a5,a5,a7 > add t6,t6,a7 > add t0,t0,a7 > add t2,t2,a7 > add t5,t5,a7 > add a3,a3,a7 > add a6,a6,a7 > add t3,t3,a7 > add t4,t4,a7 > add a0,a0,a7 > bne t1,zero,.L3 > .L5: > ret > > > PR target/111318 > PR target/111888 > > gcc/ChangeLog: > > * config.gcc: Add AVL propagation PASS. > * config/riscv/riscv-passes.def (INSERT_PASS_AFTER): Ditto. > * config/riscv/riscv-protos.h (make_pass_avlprop): Ditto. > * config/riscv/t-riscv: Ditto. > * config/riscv/riscv-avlprop.cc: New file. > > gcc/testsuite/ChangeLog: > > * gcc.target/riscv/rvv/autovec/partial/select_vl-2.c: Adapt test. 
> * gcc.target/riscv/rvv/autovec/ternop/ternop_nofm-2.c: Ditto. > * gcc.target/riscv/rvv/autovec/pr111318.c: New test. > * gcc.target/riscv/rvv/autovec/pr111888.c: New test. > > --- > gcc/config.gcc | 2 +- > gcc/config/riscv/riscv-avlprop.cc | 419 ++++++++++++++++++ > gcc/config/riscv/riscv-passes.def | 1 + > gcc/config/riscv/riscv-protos.h | 1 + > gcc/config/riscv/t-riscv | 6 + > .../riscv/rvv/autovec/partial/select_vl-2.c | 5 +- > .../gcc.target/riscv/rvv/autovec/pr111318.c | 16 + > .../gcc.target/riscv/rvv/autovec/pr111888.c | 33 ++ > .../riscv/rvv/autovec/ternop/ternop_nofm-2.c | 1 - > 9 files changed, 480 insertions(+), 4 deletions(-) > create mode 100644 gcc/config/riscv/riscv-avlprop.cc > create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/pr111318.c > create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/pr111888.c > > diff --git a/gcc/config.gcc b/gcc/config.gcc > index 606d3a8513e..efd53965c9a 100644 > --- a/gcc/config.gcc > +++ b/gcc/config.gcc > @@ -544,7 +544,7 @@ pru-*-*) > riscv*) > cpu_type=riscv > extra_objs="riscv-builtins.o riscv-c.o riscv-sr.o riscv-shorten-memrefs.o riscv-selftests.o riscv-string.o" > - extra_objs="${extra_objs} riscv-v.o riscv-vsetvl.o riscv-vector-costs.o" > + extra_objs="${extra_objs} riscv-v.o riscv-vsetvl.o riscv-vector-costs.o riscv-avlprop.o" > extra_objs="${extra_objs} riscv-vector-builtins.o riscv-vector-builtins-shapes.o riscv-vector-builtins-bases.o" > extra_objs="${extra_objs} thead.o" > d_target_objs="riscv-d.o" > diff --git a/gcc/config/riscv/riscv-avlprop.cc b/gcc/config/riscv/riscv-avlprop.cc > new file mode 100644 > index 00000000000..2c79ec81806 > --- /dev/null > +++ b/gcc/config/riscv/riscv-avlprop.cc > @@ -0,0 +1,419 @@ > +/* AVL propagation pass for RISC-V 'V' Extension for GNU compiler. > + Copyright (C) 2023-2023 Free Software Foundation, Inc. > + Contributed by Juzhe Zhong (juzhe.zhong@rivai.ai), RiVAI Technologies Ltd. > + > +This file is part of GCC. 
> + > +GCC is free software; you can redistribute it and/or modify > +it under the terms of the GNU General Public License as published by > +the Free Software Foundation; either version 3, or(at your option) > +any later version. > + > +GCC is distributed in the hope that it will be useful, > +but WITHOUT ANY WARRANTY; without even the implied warranty of > +MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the > +GNU General Public License for more details. > + > +You should have received a copy of the GNU General Public License > +along with GCC; see the file COPYING3. If not see > +<http://www.gnu.org/licenses/>. */ > + > +/* Pre-RA RTL_SSA-based pass propagates AVL for RVV instructions. > + A standalone AVL propagation pass is designed because: > + > + - Better code maintain: > + Current LCM-based VSETVL pass is so complicated that codes > + there will become even harder to maintain. A straight forward > + AVL propagation PASS is much easier to maintain. > + > + - Reduce scalar register pressure: > + A type of AVL propagation is we propagate AVL from NON-VLMAX > + instruction to VLMAX instruction. > + Note: VLMAX instruction should be ignore tail elements (TA) > + and the result should be used by the NON-VLMAX instruction. > + This optimization is mostly for auto-vectorization codes: > + > + vsetvli r136, r137 --- SELECT_VL > + vle8.v (use avl = r136) --- IFN_MASK_LEN_LOAD > + vadd.vv (use VLMAX) --- PLUS_EXPR > + vse8.v (use avl = r136) --- IFN_MASK_LEN_STORE > + > + NO AVL propation: > + > + vsetvli a5, a4, ta > + vle8.v v1 > + vsetvli t0, zero, ta > + vadd.vv v2, v1, v1 > + vse8.v v2 > + > + We can propagate the AVL to 'vadd.vv' since its result > + is consumed by a 'vse8.v' which has AVL = a5 and its > + tail elements are agnostic. 
> + > + We DON'T do this optimization on VSETVL pass since it is a > + post-RA pass that consumed 't0' already wheras a standalone > + pre-RA AVL propagation pass allows us elide the consumption > + of the pseudo register of 't0' then we can reduce scalar > + register pressure. > + > + - More AVL propagation opportunities: > + A pre-RA pass is more flexible for AVL REG def-use chain, > + thus we will get more potential AVL propagation as long as > + it doesn't increase the scalar register pressure. > +*/ > + > +#define IN_TARGET_CODE 1 > +#define INCLUDE_ALGORITHM > +#define INCLUDE_FUNCTIONAL > + > +#include "config.h" > +#include "system.h" > +#include "coretypes.h" > +#include "tm.h" > +#include "backend.h" > +#include "rtl.h" > +#include "target.h" > +#include "tree-pass.h" > +#include "df.h" > +#include "rtl-ssa.h" > +#include "cfgcleanup.h" > +#include "insn-attr.h" > + > +using namespace rtl_ssa; > +using namespace riscv_vector; > + > +enum avlprop_type > +{ > + /* VLMAX AVL and tail agnostic candidates. 
*/ > + AVLPROP_VLMAX_TA, > + AVLPROP_NONE > +}; > + > +/* dump helper functions */ > +static const char * > +avlprop_type_to_str (enum avlprop_type type) > +{ > + switch (type) > + { > + case AVLPROP_VLMAX_TA: > + return "vlmax_ta"; > + > + default: > + gcc_unreachable (); > + } > +} > + > +static bool > +vlmax_ta_p (rtx_insn *rinsn) > +{ > + return vlmax_avl_type_p (rinsn) && tail_agnostic_p (rinsn); > +} > + > +const pass_data pass_data_avlprop = { > + RTL_PASS, /* type */ > + "avlprop", /* name */ > + OPTGROUP_NONE, /* optinfo_flags */ > + TV_NONE, /* tv_id */ > + 0, /* properties_required */ > + 0, /* properties_provided */ > + 0, /* properties_destroyed */ > + 0, /* todo_flags_start */ > + 0, /* todo_flags_finish */ > +}; > + > +class pass_avlprop : public rtl_opt_pass > +{ > +public: > + pass_avlprop (gcc::context *ctxt) : rtl_opt_pass (pass_data_avlprop, ctxt) {} > + > + /* opt_pass methods: */ > + virtual bool gate (function *) final override > + { > + return TARGET_VECTOR && optimize > 0; > + } > + virtual unsigned int execute (function *) final override; > + > +private: > + /* The AVL propagation instructions and corresponding preferred AVL. > + It will be updated during the analysis. */ > + hash_map<insn_info *, rtx> *m_avl_propagations; > + > + /* Potential feasible AVL propagation candidates. 
*/ > + auto_vec<std::pair<enum avlprop_type, insn_info *>> m_candidates; > + > + rtx get_preferred_avl (const std::pair<enum avlprop_type, insn_info *>) const; > + rtx get_vlmax_ta_preferred_avl (insn_info *) const; > + rtx get_nonvlmax_avl (insn_info *) const; > + > + void avlprop_init (function *); > + void avlprop_done (void); > +}; // class pass_avlprop > + > +void > +pass_avlprop::avlprop_init (function *fn) > +{ > + calculate_dominance_info (CDI_DOMINATORS); > + df_analyze (); > + crtl->ssa = new function_info (fn); > + m_avl_propagations = new hash_map<insn_info *, rtx>; > +} > + > +void > +pass_avlprop::avlprop_done (void) > +{ > + free_dominance_info (CDI_DOMINATORS); > + if (crtl->ssa->perform_pending_updates ()) > + cleanup_cfg (0); > + delete crtl->ssa; > + crtl->ssa = nullptr; > + delete m_avl_propagations; > + m_avl_propagations = NULL; > + if (!m_candidates.is_empty ()) > + m_candidates.release (); > +} > + > +/* If we have a preferred AVL to propagate, return the AVL. > + Otherwise, return NULL_RTX as we don't need have any preferred > + AVL. */ > + > +rtx > +pass_avlprop::get_preferred_avl ( > + const std::pair<enum avlprop_type, insn_info *> candidate) const > +{ > + switch (candidate.first) > + { > + case AVLPROP_VLMAX_TA: > + return get_vlmax_ta_preferred_avl (candidate.second); > + default: > + gcc_unreachable (); > + } > + return NULL_RTX; > +} > + > +/* This is a straight forward pattern ALWAYS in paritial auto-vectorization: > + > + VL = SELECT_AVL (AVL, ...) > + V0 = MASK_LEN_LOAD (..., VL) > + V1 = MASK_LEN_LOAD (..., VL) > + V2 = V0 + V1 --- Missed LEN information. > + MASK_LEN_STORE (..., V2, VL) > + > + We prefer PLUS_EXPR (V0 + V1) instead of COND_LEN_ADD (V0, V1, dummy LEN) > + because: > + > + - Few code changes in Loop Vectorizer. > + - Reuse the current clean flow of partial vectorization, That is, apply > + predicate LEN or MASK into LOAD/STORE operations and other special > + arithmetic operations (e.d. 
DIV), then do the whole vector register > + operation if it DON'T affect the correctness. > + Such flow is used by all other targets like x86, sve, s390, ... etc. > + - PLUS_EXPR has better gimple optimizations than COND_LEN_ADD. > + > + We propagate AVL from NON-VLMAX to VLMAX for gimple IR like PLUS_EXPR which > + generates the VLMAX instruction due to missed LEN information. The later > + VSETVL PASS will elided the redundant vsetvls. > +*/ > + > +rtx > +pass_avlprop::get_vlmax_ta_preferred_avl (insn_info *insn) const > +{ > + int sew = get_sew (insn->rtl ()); > + enum vlmul_type vlmul = get_vlmul (insn->rtl ()); > + int ratio = calculate_ratio (sew, vlmul); > + > + rtx use_avl = NULL_RTX; > + for (def_info *def : insn->defs ()) > + { > + if (!is_a<set_info *> (def) || def->is_mem ()) > + return NULL_RTX; > + const auto *set = dyn_cast<set_info *> (def); > + > + /* FIXME: Stop AVL propagation if any USE is not a RVV real > + instruction. It should be totally enough for vectorized codes since > + they always locate at extended blocks. > + > + TODO: We can extend PHI checking for intrinsic codes if it > + necessary in the future. 
*/ > + if (!set->is_local_to_ebb ()) > + return NULL_RTX; > + > + for (use_info *use : set->nondebug_insn_uses ()) > + { > + insn_info *use_insn = use->insn (); > + if (!use_insn->can_be_optimized () || use_insn->is_asm () > + || use_insn->is_call () || use_insn->has_volatile_refs () > + || use_insn->has_pre_post_modify () > + || !has_vl_op (use_insn->rtl ()) > + || !tail_agnostic_p (use_insn->rtl ())) > + return NULL_RTX; > + > + int new_sew = get_sew (use_insn->rtl ()); > + enum vlmul_type new_vlmul = get_vlmul (use_insn->rtl ()); > + int new_ratio = calculate_ratio (new_sew, new_vlmul); > + if (new_ratio != ratio) > + return NULL_RTX; > + > + rtx new_use_avl = get_nonvlmax_avl (use_insn); > + if (!new_use_avl || SUBREG_P (new_use_avl)) > + return NULL_RTX; > + if (REG_P (new_use_avl)) > + { > + resource_info resource = full_register (REGNO (new_use_avl)); > + def_lookup dl = crtl->ssa->find_def (resource, use_insn); > + if (dl.matching_set ()) > + return NULL_RTX; > + def_info *def1 = dl.prev_def (insn); > + def_info *def2 = dl.prev_def (use_insn); > + if (!def1 || !def2 || def1 != def2) > + return NULL_RTX; > + > + /* FIXME: We only all AVL propation within a block which should > + be totally enough for vectorized codes. > + > + TODO: We can enhance it here for intrinsic codes in the future > + if it is necessary. */ > + if (def1->insn ()->bb () != insn->bb () > + && !dominated_by_p (CDI_DOMINATORS, insn->bb ()->cfg_bb (), > + def1->insn ()->bb ()->cfg_bb ())) > + return NULL_RTX; > + if (def1->insn ()->bb () == insn->bb () > + && def1->insn ()->compare_with (insn) >= 0) > + return NULL_RTX; > + } > + > + if (!use_avl) > + use_avl = new_use_avl; > + else if (!rtx_equal_p (use_avl, new_use_avl)) > + return NULL_RTX; > + } > + } > + > + return use_avl; > +} > + > +/* Try to get the NONVLMAX AVL of the INSN. > + INSN can be either NON-VLMAX AVL itself or VLMAX AVL INSN > + before the PASS but has been propagated a NON-VLMAX AVL > + in the before round propagation. 
*/ > +rtx > +pass_avlprop::get_nonvlmax_avl (insn_info *insn) const > +{ > + if (m_avl_propagations->get (insn)) > + return (*m_avl_propagations->get (insn)); > + else if (nonvlmax_avl_type_p (insn->rtl ())) > + { > + extract_insn_cached (insn->rtl ()); > + return recog_data.operand[get_attr_vl_op_idx (insn->rtl ())]; > + } > + > + return NULL_RTX; > +} > + > +/* Main entry point for this pass. */ > +unsigned int > +pass_avlprop::execute (function *fn) > +{ > + avlprop_init (fn); > + > + /* Iterate the whole function in reverse order (which could speed the > + convergence) to collect all potential candidates that could be AVL > + propagated. > + > + Note that: **NOT** all the candidates will be successfully AVL propagated. > + */ > + for (bb_info *bb : crtl->ssa->reverse_bbs ()) > + { > + for (insn_info *insn : bb->reverse_real_nondebug_insns ()) > + { > + /* We only forward AVL to the instruction that has AVL/VL operand > + and can be optimized in RTL_SSA level. */ > + if (!insn->can_be_optimized () || !has_vl_op (insn->rtl ())) > + continue; > + > + /* TODO: We only do AVL propagation for VLMAX AVL with tail > + agnostic policy since we have missed-LEN information partial > + autovectorization. We could add more more AVL propagation > + for intrinsic codes in the future. */ > + if (vlmax_ta_p (insn->rtl ())) > + m_candidates.safe_push (std::make_pair (AVLPROP_VLMAX_TA, insn)); > + } > + } > + > + if (dump_file && (dump_flags & TDF_DETAILS)) > + { > + fprintf (dump_file, "\nNumber of potential AVL propagations: %d\n", > + m_candidates.length ()); > + for (const auto candidate : m_candidates) > + { > + fprintf (dump_file, "\nAVL propagation type: %s\n", > + avlprop_type_to_str (candidate.first)); > + print_rtl_single (dump_file, candidate.second->rtl ()); > + } > + } > + > + /* Go through all the candidates looking for AVL that we could propagate. 
*/ > + bool change_p = true; > + while (change_p) > + { > + change_p = false; > + for (auto &candidate : m_candidates) > + { > + rtx new_avl = get_preferred_avl (candidate); > + if (new_avl) > + { > + gcc_assert (!vlmax_avl_p (new_avl)); > + auto &update > + = m_avl_propagations->get_or_insert (candidate.second); > + change_p = !rtx_equal_p (update, new_avl); > + update = new_avl; > + } > + } > + } > + > + if (dump_file && (dump_flags & TDF_DETAILS)) > + fprintf (dump_file, "\nNumber of successful AVL propagations: %d\n\n", > + (int) m_avl_propagations->elements ()); > + > + for (const auto prop : *m_avl_propagations) > + { > + rtx_insn *rinsn = prop.first->rtl (); > + if (dump_file && (dump_flags & TDF_DETAILS)) > + { > + fprintf (dump_file, "\nPropagating AVL: "); > + print_rtl_single (dump_file, prop.second); > + fprintf (dump_file, "into: "); > + print_rtl_single (dump_file, rinsn); > + } > + /* Replace AVL operand. */ > + extract_insn_cached (rinsn); > + rtx avl = recog_data.operand[get_attr_vl_op_idx (rinsn)]; > + int count = count_regno_occurrences (rinsn, REGNO (avl)); > + gcc_assert (count == 1); > + rtx new_pat = simplify_replace_rtx (PATTERN (rinsn), avl, prop.second); > + validate_change_or_fail (rinsn, &PATTERN (rinsn), new_pat, false); > + > + /* Change AVL TYPE into NONVLMAX if it is VLMAX. 
*/ > + if (vlmax_avl_type_p (rinsn)) > + { > + int index = get_attr_avl_type_idx (rinsn); > + gcc_assert (index != INVALID_ATTRIBUTE); > + validate_change_or_fail (rinsn, recog_data.operand_loc[index], > + get_avl_type_rtx (avl_type::NONVLMAX), > + false); > + } > + if (dump_file && (dump_flags & TDF_DETAILS)) > + { > + fprintf (dump_file, "Successfully to match this instruction: "); > + print_rtl_single (dump_file, rinsn); > + } > + } > + > + avlprop_done (); > + return 0; > +} > + > +rtl_opt_pass * > +make_pass_avlprop (gcc::context *ctxt) > +{ > + return new pass_avlprop (ctxt); > +} > diff --git a/gcc/config/riscv/riscv-passes.def b/gcc/config/riscv/riscv-passes.def > index 4084122cf0a..b6260939d5c 100644 > --- a/gcc/config/riscv/riscv-passes.def > +++ b/gcc/config/riscv/riscv-passes.def > @@ -18,4 +18,5 @@ > <http://www.gnu.org/licenses/>. */ > > INSERT_PASS_AFTER (pass_rtl_store_motion, 1, pass_shorten_memrefs); > +INSERT_PASS_AFTER (pass_split_all_insns, 1, pass_avlprop); > INSERT_PASS_BEFORE (pass_fast_rtl_dce, 1, pass_vsetvl); > diff --git a/gcc/config/riscv/riscv-protos.h b/gcc/config/riscv/riscv-protos.h > index 668d75043ca..d4e17fc3fd0 100644 > --- a/gcc/config/riscv/riscv-protos.h > +++ b/gcc/config/riscv/riscv-protos.h > @@ -156,6 +156,7 @@ extern void riscv_parse_arch_string (const char *, struct gcc_options *, locatio > extern bool riscv_hard_regno_rename_ok (unsigned, unsigned); > > rtl_opt_pass * make_pass_shorten_memrefs (gcc::context *ctxt); > +rtl_opt_pass * make_pass_avlprop (gcc::context *ctxt); > rtl_opt_pass * make_pass_vsetvl (gcc::context *ctxt); > > /* Routines implemented in riscv-string.c. 
*/ > diff --git a/gcc/config/riscv/t-riscv b/gcc/config/riscv/t-riscv > index dd17056fe82..f8ca3f4ac57 100644 > --- a/gcc/config/riscv/t-riscv > +++ b/gcc/config/riscv/t-riscv > @@ -78,6 +78,12 @@ riscv-vector-costs.o: $(srcdir)/config/riscv/riscv-vector-costs.cc \ > $(COMPILER) -c $(ALL_COMPILERFLAGS) $(ALL_CPPFLAGS) $(INCLUDES) \ > $(srcdir)/config/riscv/riscv-vector-costs.cc > > +riscv-avlprop.o: $(srcdir)/config/riscv/riscv-avlprop.cc \ > + $(CONFIG_H) $(SYSTEM_H) coretypes.h $(TM_H) $(RTL_H) $(REGS_H) \ > + $(TARGET_H) tree-pass.h df.h rtl-ssa.h cfgcleanup.h insn-attr.h > + $(COMPILER) -c $(ALL_COMPILERFLAGS) $(ALL_CPPFLAGS) $(INCLUDES) \ > + $(srcdir)/config/riscv/riscv-avlprop.cc > + > riscv-d.o: $(srcdir)/config/riscv/riscv-d.cc \ > $(CONFIG_H) $(SYSTEM_H) coretypes.h $(TM_H) > $(COMPILE) $< > diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/partial/select_vl-2.c b/gcc/testsuite/gcc.target/riscv/rvv/autovec/partial/select_vl-2.c > index eac7cbc757b..ca88d42cdf4 100644 > --- a/gcc/testsuite/gcc.target/riscv/rvv/autovec/partial/select_vl-2.c > +++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/partial/select_vl-2.c > @@ -7,10 +7,11 @@ > /* > ** foo: > ** vsetivli\t[a-x0-9]+,\s*8,\s*e(8?|16?|32?|64),\s*m(1?|2?|4?|8?|f2?|f4?|f8),\s*t[au],\s*m[au] > +** ... > ** vle32\.v\tv[0-9]+,0\([a-x0-9]+\) > ** ... > -** vsetvli\t[a-x0-9]+,\s*[a-x0-9]+,\s*e(8?|16?|32?|64),\s*m(1?|2?|4?|8?|f2?|f4?|f8),\s*t[au],\s*m[au] > -** add\t[a-x0-9]+,[a-x0-9]+,[a-x0-9]+ > +** vsetvli\tzero,\s*[a-x0-9]+,\s*e(8?|16?|32?|64),\s*m(1?|2?|4?|8?|f2?|f4?|f8),\s*t[au],\s*m[au] > +** ... > ** vle32\.v\tv[0-9]+,0\([a-x0-9]+\) > ** ... 
> */ > diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr111318.c b/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr111318.c > new file mode 100644 > index 00000000000..ff36da8feeb > --- /dev/null > +++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr111318.c > @@ -0,0 +1,16 @@ > +/* { dg-do compile } */ > +/* { dg-options "-march=rv64gcv -mabi=lp64d -O3 -fno-vect-cost-model" } */ > + > +void > +foo (int *__restrict a, int *__restrict b, int *__restrict c, int n) > +{ > + for (int i = 0; i < n; i += 1) > + c[i] = a[i] + b[i]; > +} > + > +/* { dg-final { scan-assembler-times {vsetvli} 1 } } */ > +/* { dg-final { scan-assembler-not {vsetivli} } } */ > +/* { dg-final { scan-assembler-times {vsetvli\s*[a-x0-9]+,\s*[a-x0-9]+} 1 } } */ > +/* { dg-final { scan-assembler-not {vsetvli\s*[a-x0-9]+,\s*zero} } } */ > +/* { dg-final { scan-assembler-not {vsetvli\s*zero} } } */ > +/* { dg-final { scan-assembler-not {vsetivli\s*zero} } } */ > diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr111888.c b/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr111888.c > new file mode 100644 > index 00000000000..2387c20a26c > --- /dev/null > +++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr111888.c > @@ -0,0 +1,33 @@ > +/* { dg-do compile } */ > +/* { dg-options "-march=rv64gcv -mabi=lp64d -O3 -fno-vect-cost-model" } */ > + > +void > +foo (int *__restrict a, int *__restrict b, int *__restrict c, > + int *__restrict a2, int *__restrict b2, int *__restrict c2, > + int *__restrict a3, int *__restrict b3, int *__restrict c3, > + int *__restrict a4, int *__restrict b4, int *__restrict c4, > + int *__restrict a5, int *__restrict b5, int *__restrict c5, > + int *__restrict d, int *__restrict d2, int *__restrict d3, > + int *__restrict d4, int *__restrict d5, int n, int m) > +{ > + for (int i = 0; i < n; i++) > + { > + a[i] = b[i] + c[i]; > + a2[i] = b2[i] + c2[i]; > + a3[i] = b3[i] + c3[i]; > + a4[i] = b4[i] + c4[i]; > + a5[i] = a[i] + a4[i]; > + d[i] = a[i] - a2[i]; > + d2[i] = a2[i] * 
a[i]; > + d3[i] = a3[i] * a2[i]; > + d4[i] = a2[i] * d2[i]; > + d5[i] = a[i] * a2[i] * a3[i] * a4[i] * d[i]; > + } > +} > + > +/* { dg-final { scan-assembler-times {vsetvli} 1 } } */ > +/* { dg-final { scan-assembler-not {vsetivli} } } */ > +/* { dg-final { scan-assembler-times {vsetvli\s*[a-x0-9]+,\s*[a-x0-9]+} 1 } } */ > +/* { dg-final { scan-assembler-not {vsetvli\s*[a-x0-9]+,\s*zero} } } */ > +/* { dg-final { scan-assembler-not {vsetvli\s*zero} } } */ > +/* { dg-final { scan-assembler-not {vsetivli\s*zero} } } */ > diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/ternop/ternop_nofm-2.c b/gcc/testsuite/gcc.target/riscv/rvv/autovec/ternop/ternop_nofm-2.c > index 965365da4bb..13367423751 100644 > --- a/gcc/testsuite/gcc.target/riscv/rvv/autovec/ternop/ternop_nofm-2.c > +++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/ternop/ternop_nofm-2.c > @@ -3,7 +3,6 @@ > > #include "ternop-2.c" > > -/* { dg-final { scan-assembler-times {\tvmacc\.vv} 8 } } */ > /* { dg-final { scan-assembler-times {\tvfma[c-d][c-d]\.vv} 9 } } */ > /* { dg-final { scan-tree-dump-times "COND_LEN_FMA" 9 "optimized" } } */ > /* { dg-final { scan-assembler-not {\tvmv} } } */ > -- > 2.36.3 >
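A note on the ratio check in the quoted patch: get_vlmax_ta_preferred_avl refuses to propagate when the consumer's ratio differs from the candidate's (`new_ratio != ratio`). The sketch below is a stand-alone model of why that matters, assuming the ratio in question is SEW/LMUL. It relies on the RVV rule that VLMAX = LMUL * VLEN / SEW, so two vtype settings with the same SEW/LMUL ratio share the same VLMAX. That is also why the `e8,mf4` SELECT_VL in the first example can feed `e32,m1` operations. The names `Lmul`, `sew_lmul_ratio`, `vlmax` and `check_ratio_example` are hypothetical and are not GCC's API.

```cpp
#include <cassert>

// LMUL modelled as a fraction num/den, e.g. m2 = {2, 1}, mf4 = {1, 4}.
// Hypothetical helper type, not part of GCC.
struct Lmul { int num; int den; };

// SEW/LMUL ratio. Settings with an equal ratio have an equal VLMAX.
constexpr int sew_lmul_ratio (int sew, Lmul lmul)
{
  return sew * lmul.den / lmul.num;
}

// VLMAX = LMUL * VLEN / SEW, as defined by the RVV specification.
constexpr int vlmax (int vlen, int sew, Lmul lmul)
{
  return vlen * lmul.num / (sew * lmul.den);
}

// Self-check using the vtype settings from the first example.
void check_ratio_example ()
{
  const Lmul mf4 = {1, 4};
  const Lmul m1 = {1, 1};
  const Lmul m2 = {2, 1};

  // e8,mf4 and e32,m1 share one ratio, hence one VLMAX for any VLEN.
  assert (sew_lmul_ratio (8, mf4) == sew_lmul_ratio (32, m1));
  assert (vlmax (128, 8, mf4) == vlmax (128, 32, m1));

  // e32,m2 has a different ratio, so reusing the AVL would not be safe.
  assert (sew_lmul_ratio (32, m2) != sew_lmul_ratio (32, m1));
}
```

Calling check_ratio_example () from a test driver exercises the three cases; VLEN = 128 is only an illustrative value.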
> I have sent V3 with adapted testcases (2 additional dump FAILs detected by both Pan Li and Patrick). > No need to review. > > I will wait until Patrick confirms it is OK to ignore the popcount FAILs for now, then commit it. Just to confirm: I can now also reproduce the popcount fail on my machine without your patch. Regards Robin
Oh, that's surprising. I think the current RVV GCC is unstable and buggy, so different machines show different FAILs. Currently, we have 2 middle-end bugs: 1. COND_LEN_XXX: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111760 2. Gather load bug: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111970 I guess they contribute to making RVV GCC unstable, so test results vary across machines. juzhe.zhong@rivai.ai From: Robin Dapp Date: 2023-10-26 16:34 To: juzhe.zhong@rivai.ai; Kito.cheng CC: rdapp.gcc; gcc-patches; kito.cheng; jeffreyalaw; Patrick O'Neill Subject: Re: [PATCH V2] RISC-V: Add AVL propagation PASS for RVV auto-vectorization > I have sent V3 with adapted testcases (2 additional dump FAILs detected by both Pan Li and Patrick). > No need to review. > > I will wait until Patrick confirms it is OK to ignore the popcount FAILs for now, then commit it. Just to confirm: I can now also reproduce the popcount fail on my machine without your patch. Regards Robin
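For readers skimming the thread, here is a toy model of the idea behind the patch. It is only a sketch under simplifying assumptions, not the pass itself: a VLMAX, tail-agnostic instruction takes over the AVL of its consumers when every consumer is tail agnostic, has the same SEW/LMUL ratio, and agrees on one non-VLMAX AVL. The analysis repeats until nothing changes. `ToyInsn`, `known_avl`, `preferred_avl`, `propagate_avl`, `example_loop_body` and `self_check` are hypothetical names; the real pass works on RTL-SSA `insn_info` objects and additionally checks def-use chains, dominance and extended basic blocks, all of which this model omits.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Toy stand-in for one RVV instruction. This is NOT GCC's insn_info.
struct ToyInsn
{
  std::string name;       // mnemonic, for readability only
  bool vlmax;             // true if the instruction uses the VLMAX AVL
  bool tail_agnostic;     // true if the tail policy is "ta"
  int ratio;              // SEW/LMUL ratio
  std::string avl;        // AVL register when !vlmax, otherwise empty
  std::vector<int> users; // indices of instructions reading the result
};

typedef std::map<int, std::string> Propagations;

// AVL currently known for instruction I: a recorded propagation if any,
// else its own AVL when it is non-VLMAX, else "" (unknown).
static std::string
known_avl (const std::vector<ToyInsn> &insns, const Propagations &props, int i)
{
  Propagations::const_iterator it = props.find (i);
  if (it != props.end ())
    return it->second;
  return insns[i].vlmax ? std::string () : insns[i].avl;
}

// Preferred AVL for VLMAX candidate I, or "" when this model considers
// propagation unsafe (policy, ratio or AVL mismatch, or no users).
static std::string
preferred_avl (const std::vector<ToyInsn> &insns, const Propagations &props,
	       int i)
{
  std::string avl;
  for (int u : insns[i].users)
    {
      const ToyInsn &use = insns[u];
      if (!use.tail_agnostic || use.ratio != insns[i].ratio)
	return "";
      std::string use_avl = known_avl (insns, props, u);
      if (use_avl.empty ())
	return "";
      if (avl.empty ())
	avl = use_avl;
      else if (avl != use_avl)
	return "";
    }
  return avl;
}

// Collect VLMAX + TA candidates in reverse order, then iterate to a
// fixed point so that chains of VLMAX instructions are handled too.
Propagations
propagate_avl (const std::vector<ToyInsn> &insns)
{
  std::vector<int> candidates;
  for (int i = (int) insns.size () - 1; i >= 0; --i)
    if (insns[i].vlmax && insns[i].tail_agnostic)
      candidates.push_back (i);

  Propagations props;
  bool changed = true;
  while (changed)
    {
      changed = false;
      for (int c : candidates)
	{
	  std::string avl = preferred_avl (insns, props, c);
	  if (avl.empty ())
	    continue;
	  Propagations::iterator it = props.find (c);
	  if (it == props.end () || it->second != avl)
	    {
	      props[c] = avl;
	      changed = true;
	    }
	}
    }
  return props;
}

// The load/load/add/store loop body from the first example.
std::vector<ToyInsn>
example_loop_body ()
{
  return {
    {"vle32.v", false, true, 32, "a5", {2}},
    {"vle32.v", false, true, 32, "a5", {2}},
    {"vadd.vv", true, true, 32, "", {3}},
    {"vse32.v", false, true, 32, "a5", {}},
  };
}

// Self-check: only the VLMAX vadd.vv receives the AVL of the store.
void
self_check ()
{
  Propagations props = propagate_avl (example_loop_body ());
  assert (props.size () == 1);
  assert (props.at (2) == "a5");
}
```

In this model the `vadd.vv` ends up with AVL `a5`, the same as the loads and the store, which is the situation in which the later VSETVL PASS can drop the two toggling vsetvls. One deliberate simplification: the sketch sets its change flag whenever any candidate changes in a round, which may differ in detail from how the patch tracks `change_p`.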
diff --git a/gcc/config.gcc b/gcc/config.gcc index 606d3a8513e..efd53965c9a 100644 --- a/gcc/config.gcc +++ b/gcc/config.gcc @@ -544,7 +544,7 @@ pru-*-*) riscv*) cpu_type=riscv extra_objs="riscv-builtins.o riscv-c.o riscv-sr.o riscv-shorten-memrefs.o riscv-selftests.o riscv-string.o" - extra_objs="${extra_objs} riscv-v.o riscv-vsetvl.o riscv-vector-costs.o" + extra_objs="${extra_objs} riscv-v.o riscv-vsetvl.o riscv-vector-costs.o riscv-avlprop.o" extra_objs="${extra_objs} riscv-vector-builtins.o riscv-vector-builtins-shapes.o riscv-vector-builtins-bases.o" extra_objs="${extra_objs} thead.o" d_target_objs="riscv-d.o" diff --git a/gcc/config/riscv/riscv-avlprop.cc b/gcc/config/riscv/riscv-avlprop.cc new file mode 100644 index 00000000000..2c79ec81806 --- /dev/null +++ b/gcc/config/riscv/riscv-avlprop.cc @@ -0,0 +1,419 @@ +/* AVL propagation pass for RISC-V 'V' Extension for GNU compiler. + Copyright (C) 2023-2023 Free Software Foundation, Inc. + Contributed by Juzhe Zhong (juzhe.zhong@rivai.ai), RiVAI Technologies Ltd. + +This file is part of GCC. + +GCC is free software; you can redistribute it and/or modify +it under the terms of the GNU General Public License as published by +the Free Software Foundation; either version 3, or(at your option) +any later version. + +GCC is distributed in the hope that it will be useful, +but WITHOUT ANY WARRANTY; without even the implied warranty of +MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +GNU General Public License for more details. + +You should have received a copy of the GNU General Public License +along with GCC; see the file COPYING3. If not see +<http://www.gnu.org/licenses/>. */ + +/* Pre-RA RTL_SSA-based pass propagates AVL for RVV instructions. + A standalone AVL propagation pass is designed because: + + - Better code maintain: + Current LCM-based VSETVL pass is so complicated that codes + there will become even harder to maintain. A straight forward + AVL propagation PASS is much easier to maintain. 
+
+     - Reduce scalar register pressure:
+       A type of AVL propagation is to propagate the AVL from a NON-VLMAX
+       instruction to a VLMAX instruction.
+       Note: the VLMAX instruction should ignore tail elements (TA)
+       and its result should be used by the NON-VLMAX instruction.
+       This optimization is mostly for auto-vectorized code:
+
+         vsetvli r136, r137      --- SELECT_VL
+         vle8.v (use avl = r136) --- IFN_MASK_LEN_LOAD
+         vadd.vv (use VLMAX)     --- PLUS_EXPR
+         vse8.v (use avl = r136) --- IFN_MASK_LEN_STORE
+
+       NO AVL propagation:
+
+         vsetvli a5, a4, ta
+         vle8.v v1
+         vsetvli t0, zero, ta
+         vadd.vv v2, v1, v1
+         vse8.v v2
+
+       We can propagate the AVL to 'vadd.vv' since its result
+       is consumed by a 'vse8.v' which has AVL = a5 and its
+       tail elements are agnostic.
+
+       We DON'T do this optimization in the VSETVL pass since it is a
+       post-RA pass that has already consumed 't0', whereas a standalone
+       pre-RA AVL propagation pass allows us to elide the consumption
+       of the pseudo register 't0' and thereby reduce scalar
+       register pressure.
+
+     - More AVL propagation opportunities:
+       A pre-RA pass is more flexible for AVL REG def-use chain,
+       thus we will get more potential AVL propagation as long as
+       it doesn't increase the scalar register pressure.
+*/
+
+#define IN_TARGET_CODE 1
+#define INCLUDE_ALGORITHM
+#define INCLUDE_FUNCTIONAL
+
+#include "config.h"
+#include "system.h"
+#include "coretypes.h"
+#include "tm.h"
+#include "backend.h"
+#include "rtl.h"
+#include "target.h"
+#include "tree-pass.h"
+#include "df.h"
+#include "rtl-ssa.h"
+#include "cfgcleanup.h"
+#include "insn-attr.h"
+
+using namespace rtl_ssa;
+using namespace riscv_vector;
+
+enum avlprop_type
+{
+  /* VLMAX AVL and tail agnostic candidates.  */
+  AVLPROP_VLMAX_TA,
+  AVLPROP_NONE
+};
+
+/* dump helper functions */
+static const char *
+avlprop_type_to_str (enum avlprop_type type)
+{
+  switch (type)
+    {
+    case AVLPROP_VLMAX_TA:
+      return "vlmax_ta";
+
+    default:
+      gcc_unreachable ();
+    }
+}
+
+static bool
+vlmax_ta_p (rtx_insn *rinsn)
+{
+  return vlmax_avl_type_p (rinsn) && tail_agnostic_p (rinsn);
+}
+
+const pass_data pass_data_avlprop = {
+  RTL_PASS,      /* type */
+  "avlprop",     /* name */
+  OPTGROUP_NONE, /* optinfo_flags */
+  TV_NONE,       /* tv_id */
+  0,             /* properties_required */
+  0,             /* properties_provided */
+  0,             /* properties_destroyed */
+  0,             /* todo_flags_start */
+  0,             /* todo_flags_finish */
+};
+
+class pass_avlprop : public rtl_opt_pass
+{
+public:
+  pass_avlprop (gcc::context *ctxt) : rtl_opt_pass (pass_data_avlprop, ctxt) {}
+
+  /* opt_pass methods: */
+  virtual bool gate (function *) final override
+  {
+    return TARGET_VECTOR && optimize > 0;
+  }
+  virtual unsigned int execute (function *) final override;
+
+private:
+  /* The AVL propagation instructions and corresponding preferred AVL.
+     It will be updated during the analysis.  */
+  hash_map<insn_info *, rtx> *m_avl_propagations;
+
+  /* Potential feasible AVL propagation candidates.  */
+  auto_vec<std::pair<enum avlprop_type, insn_info *>> m_candidates;
+
+  rtx get_preferred_avl (const std::pair<enum avlprop_type, insn_info *>) const;
+  rtx get_vlmax_ta_preferred_avl (insn_info *) const;
+  rtx get_nonvlmax_avl (insn_info *) const;
+
+  void avlprop_init (function *);
+  void avlprop_done (void);
+}; // class pass_avlprop
+
+void
+pass_avlprop::avlprop_init (function *fn)
+{
+  calculate_dominance_info (CDI_DOMINATORS);
+  df_analyze ();
+  crtl->ssa = new function_info (fn);
+  m_avl_propagations = new hash_map<insn_info *, rtx>;
+}
+
+void
+pass_avlprop::avlprop_done (void)
+{
+  free_dominance_info (CDI_DOMINATORS);
+  if (crtl->ssa->perform_pending_updates ())
+    cleanup_cfg (0);
+  delete crtl->ssa;
+  crtl->ssa = nullptr;
+  delete m_avl_propagations;
+  m_avl_propagations = NULL;
+  if (!m_candidates.is_empty ())
+    m_candidates.release ();
+}
+
+/* If we have a preferred AVL to propagate, return the AVL.
+   Otherwise, return NULL_RTX as we don't have any preferred
+   AVL.  */
+
+rtx
+pass_avlprop::get_preferred_avl (
+  const std::pair<enum avlprop_type, insn_info *> candidate) const
+{
+  switch (candidate.first)
+    {
+    case AVLPROP_VLMAX_TA:
+      return get_vlmax_ta_preferred_avl (candidate.second);
+    default:
+      gcc_unreachable ();
+    }
+  return NULL_RTX;
+}
+
+/* This is a straightforward pattern that ALWAYS appears in partial
+   auto-vectorization:
+
+     VL = SELECT_AVL (AVL, ...)
+     V0 = MASK_LEN_LOAD (..., VL)
+     V1 = MASK_LEN_LOAD (..., VL)
+     V2 = V0 + V1 --- Missed LEN information.
+     MASK_LEN_STORE (..., V2, VL)
+
+   We prefer PLUS_EXPR (V0 + V1) instead of COND_LEN_ADD (V0, V1, dummy LEN)
+   because:
+
+     - Few code changes in Loop Vectorizer.
+     - Reuse the current clean flow of partial vectorization. That is, apply
+       predicate LEN or MASK into LOAD/STORE operations and other special
+       arithmetic operations (e.g. DIV), then do the whole vector register
+       operation if it DOESN'T affect the correctness.
+       Such flow is used by all other targets like x86, sve, s390, ... etc.
+     - PLUS_EXPR has better gimple optimizations than COND_LEN_ADD.
+
+   We propagate AVL from NON-VLMAX to VLMAX for gimple IR like PLUS_EXPR which
+   generates the VLMAX instruction due to missed LEN information. The later
+   VSETVL PASS will elide the redundant vsetvls.
+*/
+
+rtx
+pass_avlprop::get_vlmax_ta_preferred_avl (insn_info *insn) const
+{
+  int sew = get_sew (insn->rtl ());
+  enum vlmul_type vlmul = get_vlmul (insn->rtl ());
+  int ratio = calculate_ratio (sew, vlmul);
+
+  rtx use_avl = NULL_RTX;
+  for (def_info *def : insn->defs ())
+    {
+      if (!is_a<set_info *> (def) || def->is_mem ())
+        return NULL_RTX;
+      const auto *set = dyn_cast<set_info *> (def);
+
+      /* FIXME: Stop AVL propagation if any USE is not a RVV real
+         instruction. It should be totally enough for vectorized codes since
+         they are always located in extended blocks.
+
+         TODO: We can extend PHI checking for intrinsic codes if it is
+         necessary in the future.  */
+      if (!set->is_local_to_ebb ())
+        return NULL_RTX;
+
+      for (use_info *use : set->nondebug_insn_uses ())
+        {
+          insn_info *use_insn = use->insn ();
+          if (!use_insn->can_be_optimized () || use_insn->is_asm ()
+              || use_insn->is_call () || use_insn->has_volatile_refs ()
+              || use_insn->has_pre_post_modify ()
+              || !has_vl_op (use_insn->rtl ())
+              || !tail_agnostic_p (use_insn->rtl ()))
+            return NULL_RTX;
+
+          int new_sew = get_sew (use_insn->rtl ());
+          enum vlmul_type new_vlmul = get_vlmul (use_insn->rtl ());
+          int new_ratio = calculate_ratio (new_sew, new_vlmul);
+          if (new_ratio != ratio)
+            return NULL_RTX;
+
+          rtx new_use_avl = get_nonvlmax_avl (use_insn);
+          if (!new_use_avl || SUBREG_P (new_use_avl))
+            return NULL_RTX;
+          if (REG_P (new_use_avl))
+            {
+              resource_info resource = full_register (REGNO (new_use_avl));
+              def_lookup dl = crtl->ssa->find_def (resource, use_insn);
+              if (dl.matching_set ())
+                return NULL_RTX;
+              def_info *def1 = dl.prev_def (insn);
+              def_info *def2 = dl.prev_def (use_insn);
+              if (!def1 || !def2 || def1 != def2)
+                return NULL_RTX;
+
+              /* FIXME: We only allow AVL propagation within a block, which
+                 should be totally enough for vectorized codes.
+
+                 TODO: We can enhance it here for intrinsic codes in the future
+                 if it is necessary.  */
+              if (def1->insn ()->bb () != insn->bb ()
+                  && !dominated_by_p (CDI_DOMINATORS, insn->bb ()->cfg_bb (),
+                                      def1->insn ()->bb ()->cfg_bb ()))
+                return NULL_RTX;
+              if (def1->insn ()->bb () == insn->bb ()
+                  && def1->insn ()->compare_with (insn) >= 0)
+                return NULL_RTX;
+            }
+
+          if (!use_avl)
+            use_avl = new_use_avl;
+          else if (!rtx_equal_p (use_avl, new_use_avl))
+            return NULL_RTX;
+        }
+    }
+
+  return use_avl;
+}
+
+/* Try to get the NONVLMAX AVL of the INSN.
+   INSN can be either a NON-VLMAX AVL instruction itself or a VLMAX AVL
+   instruction that has been given a NON-VLMAX AVL in a previous
+   round of propagation.  */
+rtx
+pass_avlprop::get_nonvlmax_avl (insn_info *insn) const
+{
+  if (m_avl_propagations->get (insn))
+    return (*m_avl_propagations->get (insn));
+  else if (nonvlmax_avl_type_p (insn->rtl ()))
+    {
+      extract_insn_cached (insn->rtl ());
+      return recog_data.operand[get_attr_vl_op_idx (insn->rtl ())];
+    }
+
+  return NULL_RTX;
+}
+
+/* Main entry point for this pass.  */
+unsigned int
+pass_avlprop::execute (function *fn)
+{
+  avlprop_init (fn);
+
+  /* Iterate the whole function in reverse order (which could speed the
+     convergence) to collect all potential candidates that could be AVL
+     propagated.
+
+     Note that: **NOT** all the candidates will be successfully AVL propagated.
+  */
+  for (bb_info *bb : crtl->ssa->reverse_bbs ())
+    {
+      for (insn_info *insn : bb->reverse_real_nondebug_insns ())
+        {
+          /* We only forward AVL to the instruction that has AVL/VL operand
+             and can be optimized in RTL_SSA level.  */
+          if (!insn->can_be_optimized () || !has_vl_op (insn->rtl ()))
+            continue;
+
+          /* TODO: We only do AVL propagation for VLMAX AVL with tail
+             agnostic policy since we have missed-LEN information in partial
+             auto-vectorization.  We could add more AVL propagation
+             for intrinsic codes in the future.  */
+          if (vlmax_ta_p (insn->rtl ()))
+            m_candidates.safe_push (std::make_pair (AVLPROP_VLMAX_TA, insn));
+        }
+    }
+
+  if (dump_file && (dump_flags & TDF_DETAILS))
+    {
+      fprintf (dump_file, "\nNumber of potential AVL propagations: %d\n",
+               m_candidates.length ());
+      for (const auto candidate : m_candidates)
+        {
+          fprintf (dump_file, "\nAVL propagation type: %s\n",
+                   avlprop_type_to_str (candidate.first));
+          print_rtl_single (dump_file, candidate.second->rtl ());
+        }
+    }
+
+  /* Go through all the candidates looking for AVL that we could propagate.  */
+  bool change_p = true;
+  while (change_p)
+    {
+      change_p = false;
+      for (auto &candidate : m_candidates)
+        {
+          rtx new_avl = get_preferred_avl (candidate);
+          if (new_avl)
+            {
+              gcc_assert (!vlmax_avl_p (new_avl));
+              auto &update
+                = m_avl_propagations->get_or_insert (candidate.second);
+              change_p = !rtx_equal_p (update, new_avl);
+              update = new_avl;
+            }
+        }
+    }
+
+  if (dump_file && (dump_flags & TDF_DETAILS))
+    fprintf (dump_file, "\nNumber of successful AVL propagations: %d\n\n",
+             (int) m_avl_propagations->elements ());
+
+  for (const auto prop : *m_avl_propagations)
+    {
+      rtx_insn *rinsn = prop.first->rtl ();
+      if (dump_file && (dump_flags & TDF_DETAILS))
+        {
+          fprintf (dump_file, "\nPropagating AVL: ");
+          print_rtl_single (dump_file, prop.second);
+          fprintf (dump_file, "into: ");
+          print_rtl_single (dump_file, rinsn);
+        }
+      /* Replace AVL operand.  */
+      extract_insn_cached (rinsn);
+      rtx avl = recog_data.operand[get_attr_vl_op_idx (rinsn)];
+      int count = count_regno_occurrences (rinsn, REGNO (avl));
+      gcc_assert (count == 1);
+      rtx new_pat = simplify_replace_rtx (PATTERN (rinsn), avl, prop.second);
+      validate_change_or_fail (rinsn, &PATTERN (rinsn), new_pat, false);
+
+      /* Change AVL TYPE into NONVLMAX if it is VLMAX.  */
+      if (vlmax_avl_type_p (rinsn))
+        {
+          int index = get_attr_avl_type_idx (rinsn);
+          gcc_assert (index != INVALID_ATTRIBUTE);
+          validate_change_or_fail (rinsn, recog_data.operand_loc[index],
+                                   get_avl_type_rtx (avl_type::NONVLMAX),
+                                   false);
+        }
+      if (dump_file && (dump_flags & TDF_DETAILS))
+        {
+          fprintf (dump_file, "Successfully to match this instruction: ");
+          print_rtl_single (dump_file, rinsn);
+        }
+    }
+
+  avlprop_done ();
+  return 0;
+}
+
+rtl_opt_pass *
+make_pass_avlprop (gcc::context *ctxt)
+{
+  return new pass_avlprop (ctxt);
+}
diff --git a/gcc/config/riscv/riscv-passes.def b/gcc/config/riscv/riscv-passes.def
index 4084122cf0a..b6260939d5c 100644
--- a/gcc/config/riscv/riscv-passes.def
+++ b/gcc/config/riscv/riscv-passes.def
@@ -18,4 +18,5 @@
    <http://www.gnu.org/licenses/>.  */
 
 INSERT_PASS_AFTER (pass_rtl_store_motion, 1, pass_shorten_memrefs);
+INSERT_PASS_AFTER (pass_split_all_insns, 1, pass_avlprop);
 INSERT_PASS_BEFORE (pass_fast_rtl_dce, 1, pass_vsetvl);
diff --git a/gcc/config/riscv/riscv-protos.h b/gcc/config/riscv/riscv-protos.h
index 668d75043ca..d4e17fc3fd0 100644
--- a/gcc/config/riscv/riscv-protos.h
+++ b/gcc/config/riscv/riscv-protos.h
@@ -156,6 +156,7 @@ extern void riscv_parse_arch_string (const char *, struct gcc_options *, locatio
 extern bool riscv_hard_regno_rename_ok (unsigned, unsigned);
 
 rtl_opt_pass * make_pass_shorten_memrefs (gcc::context *ctxt);
+rtl_opt_pass * make_pass_avlprop (gcc::context *ctxt);
 rtl_opt_pass * make_pass_vsetvl (gcc::context *ctxt);
 
 /* Routines implemented in riscv-string.c.  */
diff --git a/gcc/config/riscv/t-riscv b/gcc/config/riscv/t-riscv
index dd17056fe82..f8ca3f4ac57 100644
--- a/gcc/config/riscv/t-riscv
+++ b/gcc/config/riscv/t-riscv
@@ -78,6 +78,12 @@ riscv-vector-costs.o: $(srcdir)/config/riscv/riscv-vector-costs.cc \
 	$(COMPILER) -c $(ALL_COMPILERFLAGS) $(ALL_CPPFLAGS) $(INCLUDES) \
 		$(srcdir)/config/riscv/riscv-vector-costs.cc
 
+riscv-avlprop.o: $(srcdir)/config/riscv/riscv-avlprop.cc \
+  $(CONFIG_H) $(SYSTEM_H) coretypes.h $(TM_H) $(RTL_H) $(REGS_H) \
+  $(TARGET_H) tree-pass.h df.h rtl-ssa.h cfgcleanup.h insn-attr.h
+	$(COMPILER) -c $(ALL_COMPILERFLAGS) $(ALL_CPPFLAGS) $(INCLUDES) \
+		$(srcdir)/config/riscv/riscv-avlprop.cc
+
 riscv-d.o: $(srcdir)/config/riscv/riscv-d.cc \
   $(CONFIG_H) $(SYSTEM_H) coretypes.h $(TM_H)
 	$(COMPILE) $<
diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/partial/select_vl-2.c b/gcc/testsuite/gcc.target/riscv/rvv/autovec/partial/select_vl-2.c
index eac7cbc757b..ca88d42cdf4 100644
--- a/gcc/testsuite/gcc.target/riscv/rvv/autovec/partial/select_vl-2.c
+++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/partial/select_vl-2.c
@@ -7,10 +7,11 @@
 /*
 ** foo:
 **	vsetivli\t[a-x0-9]+,\s*8,\s*e(8?|16?|32?|64),\s*m(1?|2?|4?|8?|f2?|f4?|f8),\s*t[au],\s*m[au]
+**	...
 **	vle32\.v\tv[0-9]+,0\([a-x0-9]+\)
 **	...
-**	vsetvli\t[a-x0-9]+,\s*[a-x0-9]+,\s*e(8?|16?|32?|64),\s*m(1?|2?|4?|8?|f2?|f4?|f8),\s*t[au],\s*m[au]
-**	add\t[a-x0-9]+,[a-x0-9]+,[a-x0-9]+
+**	vsetvli\tzero,\s*[a-x0-9]+,\s*e(8?|16?|32?|64),\s*m(1?|2?|4?|8?|f2?|f4?|f8),\s*t[au],\s*m[au]
+**	...
 **	vle32\.v\tv[0-9]+,0\([a-x0-9]+\)
 **	...
 */
diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr111318.c b/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr111318.c
new file mode 100644
index 00000000000..ff36da8feeb
--- /dev/null
+++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr111318.c
@@ -0,0 +1,16 @@
+/* { dg-do compile } */
+/* { dg-options "-march=rv64gcv -mabi=lp64d -O3 -fno-vect-cost-model" } */
+
+void
+foo (int *__restrict a, int *__restrict b, int *__restrict c, int n)
+{
+  for (int i = 0; i < n; i += 1)
+    c[i] = a[i] + b[i];
+}
+
+/* { dg-final { scan-assembler-times {vsetvli} 1 } } */
+/* { dg-final { scan-assembler-not {vsetivli} } } */
+/* { dg-final { scan-assembler-times {vsetvli\s*[a-x0-9]+,\s*[a-x0-9]+} 1 } } */
+/* { dg-final { scan-assembler-not {vsetvli\s*[a-x0-9]+,\s*zero} } } */
+/* { dg-final { scan-assembler-not {vsetvli\s*zero} } } */
+/* { dg-final { scan-assembler-not {vsetivli\s*zero} } } */
diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr111888.c b/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr111888.c
new file mode 100644
index 00000000000..2387c20a26c
--- /dev/null
+++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/pr111888.c
@@ -0,0 +1,33 @@
+/* { dg-do compile } */
+/* { dg-options "-march=rv64gcv -mabi=lp64d -O3 -fno-vect-cost-model" } */
+
+void
+foo (int *__restrict a, int *__restrict b, int *__restrict c,
+     int *__restrict a2, int *__restrict b2, int *__restrict c2,
+     int *__restrict a3, int *__restrict b3, int *__restrict c3,
+     int *__restrict a4, int *__restrict b4, int *__restrict c4,
+     int *__restrict a5, int *__restrict b5, int *__restrict c5,
+     int *__restrict d, int *__restrict d2, int *__restrict d3,
+     int *__restrict d4, int *__restrict d5, int n, int m)
+{
+  for (int i = 0; i < n; i++)
+    {
+      a[i] = b[i] + c[i];
+      a2[i] = b2[i] + c2[i];
+      a3[i] = b3[i] + c3[i];
+      a4[i] = b4[i] + c4[i];
+      a5[i] = a[i] + a4[i];
+      d[i] = a[i] - a2[i];
+      d2[i] = a2[i] * a[i];
+      d3[i] = a3[i] * a2[i];
+      d4[i] = a2[i] * d2[i];
+      d5[i] = a[i] * a2[i] * a3[i] * a4[i] * d[i];
+    }
+}
+
+/* { dg-final { scan-assembler-times {vsetvli} 1 } } */
+/* { dg-final { scan-assembler-not {vsetivli} } } */
+/* { dg-final { scan-assembler-times {vsetvli\s*[a-x0-9]+,\s*[a-x0-9]+} 1 } } */
+/* { dg-final { scan-assembler-not {vsetvli\s*[a-x0-9]+,\s*zero} } } */
+/* { dg-final { scan-assembler-not {vsetvli\s*zero} } } */
+/* { dg-final { scan-assembler-not {vsetivli\s*zero} } } */
diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/ternop/ternop_nofm-2.c b/gcc/testsuite/gcc.target/riscv/rvv/autovec/ternop/ternop_nofm-2.c
index 965365da4bb..13367423751 100644
--- a/gcc/testsuite/gcc.target/riscv/rvv/autovec/ternop/ternop_nofm-2.c
+++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/ternop/ternop_nofm-2.c
@@ -3,7 +3,6 @@
 
 #include "ternop-2.c"
 
-/* { dg-final { scan-assembler-times {\tvmacc\.vv} 8 } } */
 /* { dg-final { scan-assembler-times {\tvfma[c-d][c-d]\.vv} 9 } } */
 /* { dg-final { scan-tree-dump-times "COND_LEN_FMA" 9 "optimized" } } */
 /* { dg-final { scan-assembler-not {\tvmv} } } */
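A note on the candidate loop in pass_avlprop::execute: the iteration to a fixed point can be modelled outside GCC with a small toy program. The sketch below is not GCC code; ToyInsn, known_avl, preferred_avl and propagate_avl are hypothetical names, AVLs are modelled as plain strings, and a VLMAX instruction is modelled as one with no AVL. It assumes every user is listed explicitly and ignores SEW/LMUL ratio and dominance checks. One deliberate difference from the patch: the sketch sets its changed flag only when an entry really changes and never clears it again within a round, whereas the patch assigns change_p for each successful candidate. Whether that difference matters for the patch is not established here.

```cpp
#include <cassert>
#include <map>
#include <optional>
#include <string>
#include <vector>

// Toy model of one vector instruction. All names are hypothetical and are
// not taken from GCC.
struct ToyInsn
{
  std::optional<std::string> avl; // std::nullopt models a VLMAX instruction.
  bool tail_agnostic;             // Models the TA policy.
  std::vector<int> users;         // Indices of insns consuming the result.
};

// AVL currently known for instruction I: a previously propagated AVL if
// there is one, otherwise the instruction's own AVL (possibly none).
static std::optional<std::string>
known_avl (const std::vector<ToyInsn> &insns,
           const std::map<int, std::string> &props, int i)
{
  auto it = props.find (i);
  if (it != props.end ())
    return it->second;
  return insns[i].avl;
}

// Preferred AVL for VLMAX candidate I: the single AVL shared by all of its
// tail-agnostic users, or nothing if any user disagrees or has no AVL yet.
static std::optional<std::string>
preferred_avl (const std::vector<ToyInsn> &insns,
               const std::map<int, std::string> &props, int i)
{
  std::optional<std::string> result;
  for (int u : insns[i].users)
    {
      if (!insns[u].tail_agnostic)
        return std::nullopt;
      std::optional<std::string> avl = known_avl (insns, props, u);
      if (!avl)
        return std::nullopt;
      if (!result)
        result = avl;
      else if (*result != *avl)
        return std::nullopt;
    }
  return result;
}

// Iterate over all candidates until no propagation entry changes.
std::map<int, std::string>
propagate_avl (const std::vector<ToyInsn> &insns)
{
  std::map<int, std::string> props;
  bool changed = true;
  while (changed)
    {
      changed = false;
      for (int i = 0; i < (int) insns.size (); ++i)
        {
          // Only VLMAX, tail-agnostic instructions are candidates.
          if (insns[i].avl || !insns[i].tail_agnostic)
            continue;
          std::optional<std::string> avl = preferred_avl (insns, props, i);
          if (!avl)
            continue;
          auto it = props.find (i);
          if (it == props.end () || it->second != *avl)
            {
              props[i] = *avl;
              changed = true; // Accumulate; never reset inside this round.
            }
        }
    }
  return props;
}
```

With a chain such as load (AVL a5), add, add, store (AVL a5), visiting the instructions in forward order needs a second round before the first add sees the AVL that was propagated into the second add, which is why the pass comment mentions that reverse iteration could speed up convergence.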