Message ID | 20200415015600.124587-1-guojiufu@linux.ibm.com |
---|---|
State | New |
Headers | show |
Series | [RFC] split pseudos during loop unrolling in RTL unroller | expand |
On Wed, Apr 15, 2020 at 3:56 AM Jiufu Guo via Gcc-patches <gcc-patches@gcc.gnu.org> wrote: > > Hi, > > As you may know, we have loop unroll pass in RTL which was introduced a few > years ago, and works for a long time. Currently, this unroller is using the > pseudos in the original body, and then the pseudos are written multiple times. > > It would be a good idea to create new pseudos for those duplicated instructions > during loop unrolling. This would relax reg dependence, and then provide more > freedom/opportunity for other optimizations, like scheduling, RA... I think there's a separate pass to do something like this, conveniently placed after unrolling. -fweb, IIRC enabled by default for -funroll-loops unless explicitly disabled. Related is regrename which is also enabled then. So where does your patch make a difference? Is the webizers dataflow analysis maybe confused by backedges? > As below example: > > Original loop: > ``` > 26: L26: > 17: NOTE_INSN_BASIC_BLOCK 4 > 18: r130:SF=[r125:DI+r122:DI] > 19: r131:SF=[r126:DI+r122:DI] > 20: r129:SF=r130:SF*r131:SF > 21: r132:DF=float_extend(r129:SF) > 22: r121:DF=r121:DF+r132:DF > 23: r133:SI=r123:DI#0-0x1 > 24: r123:DI=zero_extend(r133:SI) > 25: r122:DI=r122:DI+0x4 > 27: r134:CC=cmp(r123:DI,0) > 28: pc={(r134:CC!=0)?L49:pc} > ; pc falls through to BB 5 > 49: L49: > 48: NOTE_INSN_BASIC_BLOCK 8 > ; pc falls through to BB 4 > ``` > It was unrolled to: > ``` > 26: L26: > 17: NOTE_INSN_BASIC_BLOCK 4 > 18: r130:SF=[r125:DI+r122:DI] > 19: r131:SF=[r126:DI+r122:DI] > 20: r129:SF=r130:SF*r131:SF > 21: r132:DF=float_extend(r129:SF) > 93: r145:DF=r121:DF+r132:DF > 23: r133:SI=r123:DI#0-0x1 > 95: r148:DI=zero_extend(r133:SI) > 91: r140:DI=r122:DI+0x4 > 25: r122:DI=r140:DI > 97: r134:CC=cmp(r148:DI,0) > > 87: NOTE_INSN_BASIC_BLOCK 16 > 77: r130:SF=[r125:DI+r122:DI] > 78: r131:SF=[r126:DI+r122:DI] > 79: r129:SF=r130:SF*r131:SF > 80: r132:DF=float_extend(r129:SF) > 81: r121:DF=r121:DF+r132:DF > 82: r133:SI=r123:DI#0-0x1 > 83: r123:DI=zero_extend(r133:SI) > 84: r122:DI=r140:DI+0x4 > 85: r134:CC=cmp(r123:DI,0) > 86: pc={(r134:CC!=0)?L90:pc} > > ``` > We can see, the original loop is using r130,r131,r129,r132,r121... > For the unrolled sequence BB 16, they are still using them. > > We could use new pseudos for duplicate body, like below: > ``` > 26: L26: > 17: NOTE_INSN_BASIC_BLOCK 4 > 18: r130:SF=[r125:DI+r122:DI] > 19: r131:SF=[r126:DI+r122:DI] > 20: r129:SF=r130:SF*r131:SF > 21: r132:DF=float_extend(r129:SF) > 93: r145:DF=r121:DF+r132:DF > 23: r133:SI=r123:DI#0-0x1 > 95: r148:DI=zero_extend(r133:SI) > 91: r140:DI=r122:DI+0x4 > 25: r122:DI=r140:DI > 97: r134:CC=cmp(r148:DI,0) > > 87: NOTE_INSN_BASIC_BLOCK 16 > 77: r141:SF=[r125:DI+r122:DI] > 78: r142:SF=[r126:DI+r122:DI] > 79: r143:SF=r141:SF*r142:SF > 80: r144:DF=float_extend(r143:SF) > 81: r146:DF=r145:DF+r144:DF > 82: r147:SI=r148:DI#0-0x1 > 83: r149:DI=zero_extend(r147:SI) > 84: r122:DI=r140:DI+0x4 > 85: r150:CC=cmp(r149:DI,0) > 98: r130:SF=r141:SF > 99: r131:SF=r142:SF > 100: r129:SF=r143:SF > 101: r132:DF=r144:DF > 102: r121:DF=r146:DF > 103: r133:SI=r147:SI > 104: r123:DI=r149:DI > 105: r134:CC=r150:CC > 86: pc={(r150:CC!=0)?L90:pc} > ``` > In BB 16, new pseudos: r141,r142,r143... are used. > > For this, I drafted a prototype like the patch below. > This patch only supports simple loops: one exit edge with one major basic block. > > With this patch, we can see, the generated asm code uses fewer instructions on PowerPC. > > Welcome for comments, thanks! > > BR. > Jiufu Guo. > > --- > gcc/common.opt | 4 + > gcc/loop-unroll.c | 309 +++++++++++++++++++++++++++++++++++++++++++++- > 2 files changed, 310 insertions(+), 3 deletions(-) > > diff --git a/gcc/common.opt b/gcc/common.opt > index fa9da505fc2..e298ce09c82 100644 > --- a/gcc/common.opt > +++ b/gcc/common.opt > @@ -2526,6 +2526,10 @@ fvariable-expansion-in-unroller > Common Report Var(flag_variable_expansion_in_unroller) Optimization > Apply variable expansion when loops are unrolled. > > +fassin-new-operand-in-unroller > +Common Report Var(flag_assign_new_operand_in_unroller) Init(1) Optimization > +Creating new pseudo for duplicated instruction when loops are unrolled. > + > fstack-check= > Common Report RejectNegative Joined Optimization > -fstack-check=[no|generic|specific] Insert stack checking code into the program. > diff --git a/gcc/loop-unroll.c b/gcc/loop-unroll.c > index 693c7768868..54a48bd9fd7 100644 > --- a/gcc/loop-unroll.c > +++ b/gcc/loop-unroll.c > @@ -94,6 +94,17 @@ struct var_to_expand > var_expansions[REUSE_EXPANSION - 1]. */ > }; > > +struct def_to_split > +{ > + rtx_insn *insn; /* The insn in that the assigment occurs. */ > + rtx reg; /* The def/set pseudo. */ > + vec<rtx> defs; /* The copies of the defs which is split to. */ > + struct def_to_split *next; /* Next entry in walking order. */ > + int count; /* Number of DEFS. */ > + int use_before_def; /* The pseudo is used before re-defined. */ > + rtx_insn *last_insn; > +}; > + > /* Hashtable helper for iv_to_split. */ > > struct iv_split_hasher : free_ptr_hash <iv_to_split> > @@ -143,6 +154,30 @@ var_expand_hasher::equal (const var_to_expand *i1, const var_to_expand *i2) > return i1->insn == i2->insn; > } > > +/* Hashtable helper for iv_to_split. */ > + > +struct def_split_hasher : free_ptr_hash<def_to_split> > +{ > + static inline hashval_t hash (const def_to_split *); > + static inline bool equal (const def_to_split *, const def_to_split *); > +}; > + > +/* Return a hash for VES. */ > + > +inline hashval_t > +def_split_hasher::hash (const def_to_split *i) > +{ > + return (hashval_t) INSN_UID (i->insn); > +} > + > +/* Return true if I1 and I2 refer to the same instruction. */ > + > +inline bool > +def_split_hasher::equal (const def_to_split *i1, const def_to_split *i2) > +{ > + return i1->insn == i2->insn; > +} > + > /* Information about optimization applied in > the unrolled loop. */ > > @@ -156,10 +191,17 @@ struct opt_info > insns with accumulators to expand. */ > struct var_to_expand *var_to_expand_head; /* The first var to expand. */ > struct var_to_expand **var_to_expand_tail; /* Pointer to the tail of the list. */ > + > + hash_table<def_split_hasher> *insns_def_split; /* A hashtable of insns to > + split def. */ > + struct def_to_split *def_to_split_head; /* The def to split. */ > + struct def_to_split **def_to_split_tail; /* Pointer to the tail of the list. */ > + > unsigned first_new_block; /* The first basic block that was > duplicated. */ > basic_block loop_exit; /* The loop exit basic block. */ > basic_block loop_preheader; /* The loop preheader basic block. */ > + > }; > > static void decide_unroll_stupid (class loop *, int); > @@ -183,6 +225,7 @@ static void combine_var_copies_in_loop_exit (struct var_to_expand *, > basic_block); > static rtx get_expansion (struct var_to_expand *); > > +static struct def_to_split *analyze_insn_to_split_def (class loop *, rtx_insn *); > /* Emit a message summarizing the unroll that will be > performed for LOOP, along with the loop's location LOCUS, if > appropriate given the dump or -fopt-info settings. */ > @@ -898,7 +941,8 @@ unroll_loop_runtime_iterations (class loop *loop) > struct opt_info *opt_info = NULL; > bool ok; > > - if (flag_split_ivs_in_unroller > + if (flag_assign_new_operand_in_unroller > + || flag_split_ivs_in_unroller > || flag_variable_expansion_in_unroller) > opt_info = analyze_insns_in_loop (loop); > > @@ -1564,6 +1608,71 @@ analyze_iv_to_split_insn (rtx_insn *insn) > return ivts; > } > > +static struct def_to_split * > +analyze_insn_to_split_def (class loop *loop, rtx_insn *insn) > +{ > + rtx set, dest, src; > + rtx_insn *ins; > + struct def_to_split *dts; > + > + /* TODO: to be relaxed */ > + if (loop->num_nodes != 2) > + return NULL; > + > + FOR_BB_INSNS (loop->latch, ins) > + if (INSN_P (ins)) > + return NULL; > + > + set = single_set (insn); > + if (!set) > + return NULL; > + > + dest = SET_DEST (set); > + src = SET_SRC (set); > + > + if (!REG_P (dest)) // && !(GET_CODE (dest) == SUBREG && REG_P (SUBREG_REG > + // (dest)))) > + return NULL; > + > + if (REGNO (dest) < FIRST_PSEUDO_REGISTER) > + return NULL; > + > + int use_before_set = 0; > + basic_block bb = BLOCK_FOR_INSN (insn); > + rtx set1; > + for (ins = BB_HEAD (bb); ins != insn; ins = NEXT_INSN (ins)) > + if (rtx_referenced_p (dest, ins)) > + { > + set1 = single_set (ins); > + if (set1 && rtx_equal_p (SET_DEST (set1), dest)) > + return NULL; > + use_before_set = 1; > + } > + > + if (!use_before_set && rtx_referenced_p (dest, src)) > + use_before_set = 1; > + > + if (dump_file) > + { > + fprintf (dump_file, > + "\n;; Split assignment %s: ", > + use_before_set ? "use leads def" : ""); > + print_rtl (dump_file, dest); > + fprintf (dump_file, "\n"); > + } > + > + /* Record the accumulator to expand. */ > + dts = XNEW (struct def_to_split); > + dts->insn = insn; > + dts->reg = copy_rtx (dest); > + dts->defs.create (1); > + dts->next = NULL; > + dts->count = 0; > + dts->use_before_def = use_before_set; > + dts->last_insn = NULL; > + return dts; > +} > + > /* Determines which of insns in LOOP can be optimized. > Return a OPT_INFO struct with the relevant hash tables filled > with all insns to be optimized. The FIRST_NEW_BLOCK field > @@ -1578,6 +1687,7 @@ analyze_insns_in_loop (class loop *loop) > rtx_insn *insn; > struct iv_to_split *ivts = NULL; > struct var_to_expand *ves = NULL; > + struct def_to_split *dts= NULL; > iv_to_split **slot1; > var_to_expand **slot2; > vec<edge> edges = get_loop_exit_edges (loop); > @@ -1618,6 +1728,15 @@ analyze_insns_in_loop (class loop *loop) > opt_info->var_to_expand_tail = &opt_info->var_to_expand_head; > } > > + if (flag_assign_new_operand_in_unroller && can_apply > + && LPT_UNROLL_RUNTIME == loop->lpt_decision.decision) > + { > + opt_info->insns_def_split > + = new hash_table<def_split_hasher> (5 * loop->num_nodes); > + opt_info->def_to_split_head = NULL; > + opt_info->def_to_split_tail = &opt_info->def_to_split_head; > + } > + > for (i = 0; i < loop->num_nodes; i++) > { > bb = body[i]; > @@ -1653,6 +1772,20 @@ analyze_insns_in_loop (class loop *loop) > *opt_info->var_to_expand_tail = ves; > opt_info->var_to_expand_tail = &ves->next; > } > + > + dts = NULL; > + if (opt_info->insns_def_split && !ves && !ivts) > + dts = analyze_insn_to_split_def (loop, insn); > + > + if (dts) > + { > + struct def_to_split **slot > + = opt_info->insns_def_split->find_slot (dts, INSERT); > + gcc_assert (*slot == NULL); > + *slot = dts; > + *opt_info->def_to_split_tail = dts; > + opt_info->def_to_split_tail = &dts->next; > + } > } > } > > @@ -1840,6 +1973,145 @@ expand_var_during_unrolling (struct var_to_expand *ve, rtx_insn *insn) > } > } > > +/* Replace INSN with a new duplicate insn. */ > +static rtx_insn * > +replace_with_new_insn (rtx_insn *insn) > +{ > + rtx_insn *seq; > + start_sequence (); > + seq = duplicate_insn_chain (insn, insn); > + gcc_assert (seq == get_insns ()); > + end_sequence (); > + emit_insn_before (seq, insn); > + delete_insn (insn); > + return seq; > +} > + > +/* For original body, those insn(s) are not able to be update directly. Then, > + * duplicate new one for update. */ > + > +static void > +prepare_to_update_orignal_body (struct def_to_split *head) > +{ > + struct def_to_split *dts; > + rtx_insn *insn; > + vec<rtx_insn *> new_insn_list; > + > + /* For all those defines which is used before define, replace with new one for > + * update. */ > + new_insn_list.create (1); > + for (dts = head; dts; dts = dts->next) > + if (dts->use_before_def) > + { > + dts->insn = replace_with_new_insn (dts->insn); > + new_insn_list.safe_push (dts->insn); > + } > + > + /* For all those insn which is using those defines, replace with new one for > + * update. */ > + for (dts = head; dts; dts = dts->next) > + if (dts->use_before_def) > + for (insn = NEXT_INSN (dts->insn); > + insn != NEXT_INSN (BB_END (BLOCK_FOR_INSN (dts->insn))); > + insn = NEXT_INSN (insn)) > + if (rtx_referenced_p (dts->reg, insn) && !new_insn_list.contains (insn)) > + { > + insn = replace_with_new_insn (insn); > + new_insn_list.safe_push (insn); > + } > + > + new_insn_list.release (); > +} > + > +/* Replace reg which occur between START and END. */ > +static void > +replace_regs (rtx_insn *start, rtx_insn *end, rtx old_reg, rtx new_reg) > +{ > + rtx_insn *insn; > + for (insn = start; insn != NEXT_INSN (end); insn = NEXT_INSN (insn)) > + if (rtx_referenced_p (old_reg, insn)) > + validate_replace_rtx_group (old_reg, new_reg, insn); > + > + apply_change_group (); > +} > + > +/* Update the the pseudo in orignal blocks, and assign back them at the last > + * block. */ > +static void > +do_additional_update (struct def_to_split *head) > +{ > + rtx_insn *insn; > + rtx new_reg; > + basic_block bb; > + struct def_to_split *dts; > + > + for (dts = head; dts; dts = dts->next) > + { > + if (dts->use_before_def) > + { > + gcc_assert (dts->last_insn); > + gcc_assert (dts->count > 1); > + > + /* Update the insns in original body. */ > + bb = BLOCK_FOR_INSN (dts->insn); > + new_reg = dts->defs[0]; > + SET_DEST (single_set (dts->insn)) = new_reg; > + > + replace_regs (NEXT_INSN (dts->insn), BB_END (bb), dts->reg, new_reg); > + } > + > + /* In last BB, assign back to orignal reg. */ > + start_sequence (); > + emit_move_insn (dts->reg, SET_DEST (single_set (dts->last_insn))); > + insn = get_insns (); > + end_sequence (); > + emit_insn_after (insn, dts->last_insn); > + } > +} > + > +/* Replace the DTS->REG with new split reg for INSN's block. If the DTS->REG is > + * referenced before set, Those usage which leading INSN is replaced with > + * previous define/set. */ > +static void > +split_def_during_unrolling (struct def_to_split *dts, rtx_insn *insn) > +{ > + rtx new_reg, new_reg0, set; > + basic_block bb = BLOCK_FOR_INSN (insn); > + > + if (dts->use_before_def) > + { > + if (dts->count == 0) > + { > + dts->defs.safe_push (gen_reg_rtx (GET_MODE (dts->reg))); > + dts->count++; > + } > + > + /* Replace the usage of the define reg with new reg0 for those insns which > + * are leading the define, include the usage of the define. */ > + new_reg0 = dts->defs[dts->count - 1]; > + replace_regs (BB_HEAD (bb), insn, dts->reg, new_reg0); > + > + /* Replace the define reg with new reg, for the define and those usages > + * which following the define. */ > + set = single_set (insn); > + gcc_assert (set); > + new_reg = gen_reg_rtx (GET_MODE (dts->reg)); > + SET_DEST (single_set (insn)) = new_reg; > + > + replace_regs (NEXT_INSN (insn), BB_END (bb), dts->reg, new_reg); > + } > + else > + { > + new_reg = gen_reg_rtx (GET_MODE (dts->reg)); > + replace_regs (insn, BB_END (bb), dts->reg, new_reg); > + } > + > + /* Record the insn in last block. */ > + dts->last_insn = insn; > + dts->defs.safe_push (new_reg); > + dts->count++; > +} > + > /* Initialize the variable expansions in loop preheader. PLACE is the > loop-preheader basic block where the initialization of the > expansions should take place. The expansions are initialized with > @@ -2011,6 +2283,7 @@ apply_opt_in_copies (struct opt_info *opt_info, > rtx_insn *insn, *orig_insn, *next; > struct iv_to_split ivts_templ, *ivts; > struct var_to_expand ve_templ, *ves; > + struct def_to_split dts_templ, *dts; > > /* Sanity check -- we need to put initialization in the original loop > body. */ > @@ -2051,8 +2324,9 @@ apply_opt_in_copies (struct opt_info *opt_info, > > ivts_templ.insn = orig_insn; > ve_templ.insn = orig_insn; > + dts_templ.insn = orig_insn; > > - /* Apply splitting iv optimization. */ > + /* Apply splitting iv optimization. */ > if (opt_info->insns_to_split) > { > maybe_strip_eq_note_for_split_iv (opt_info, insn); > @@ -2081,6 +2355,19 @@ apply_opt_in_copies (struct opt_info *opt_info, > expand_var_during_unrolling (ves, insn); > } > } > + > + if (unrolling && opt_info->insns_def_split) > + { > + dts = (struct def_to_split *) opt_info->insns_def_split->find ( > + &dts_templ); > + if (dts) > + { > + gcc_assert (GET_CODE (PATTERN (insn)) > + == GET_CODE (PATTERN (orig_insn))); > + split_def_during_unrolling (dts, insn); > + } > + } > + > orig_insn = NEXT_INSN (orig_insn); > } > } > @@ -2135,9 +2422,14 @@ apply_opt_in_copies (struct opt_info *opt_info, > continue; > } > } > - > } > } > + > + if (unrolling && opt_info->insns_def_split) > + { > + prepare_to_update_orignal_body (opt_info->def_to_split_head); > + do_additional_update (opt_info->def_to_split_head); > + } > } > > /* Release OPT_INFO. */ > @@ -2156,5 +2448,16 @@ free_opt_info (struct opt_info *opt_info) > delete opt_info->insns_with_var_to_expand; > opt_info->insns_with_var_to_expand = NULL; > } > + > + if (opt_info->insns_def_split) > + { > + struct def_to_split *dts; > + > + for (dts = opt_info->def_to_split_head; dts; dts = dts->next) > + dts->defs.release (); > + delete opt_info->insns_def_split; > + opt_info->insns_def_split = NULL; > + } > + > free (opt_info); > } > -- > 2.17.1 >
Hi Jiufu, Just reviewing random things as I see them... On Wed, Apr 15, 2020 at 09:56:00AM +0800, Jiufu Guo wrote: > This patch only supports simple loops: one exit edge with one major basic block. That is fine for a proof-of-concept, but will need fixing perhaps. > --- a/gcc/common.opt > +++ b/gcc/common.opt > @@ -2526,6 +2526,10 @@ fvariable-expansion-in-unroller > Common Report Var(flag_variable_expansion_in_unroller) Optimization > Apply variable expansion when loops are unrolled. > > +fassin-new-operand-in-unroller Typo ("assign"). But, is there ever any reason to disable this? (You want an option for it during development, of course, but all public options cost something). > +struct def_to_split > +{ > + rtx_insn *insn; /* The insn in that the assigment occurs. */ > + rtx reg; /* The def/set pseudo. */ > + vec<rtx> defs; /* The copies of the defs which is split to. */ > + struct def_to_split *next; /* Next entry in walking order. */ > + int count; /* Number of DEFS. */ > + int use_before_def; /* The pseudo is used before re-defined. */ > + rtx_insn *last_insn; > +}; Three different kinds of indentation here. It might work better to explain it all in a bigger comment before this struct definition. It also would help to have better, more obvious names? Not "reg" but "orig_reg", not "defs" but "regs", not "count" but "nregs"? > +/* Return a hash for VES. */ > + > +inline hashval_t > +def_split_hasher::hash (const def_to_split *i) > +{ > + return (hashval_t) INSN_UID (i->insn); > +} What does "VES" mean? Something from an older version of the patch? Segher
Hi! On Wed, Apr 15, 2020 at 08:21:03AM +0200, Richard Biener wrote: > On Wed, Apr 15, 2020 at 3:56 AM Jiufu Guo via Gcc-patches > <gcc-patches@gcc.gnu.org> wrote: > > As you may know, we have loop unroll pass in RTL which was introduced a few > > years ago, and works for a long time. Currently, this unroller is using the > > pseudos in the original body, and then the pseudos are written multiple times. > > > > It would be a good idea to create new pseudos for those duplicated instructions > > during loop unrolling. This would relax reg dependence, and then provide more > > freedom/opportunity for other optimizations, like scheduling, RA... > > I think there's a separate pass to do something like this, conveniently > placed after unrolling. -fweb, IIRC enabled by default for -funroll-loops > unless explicitly disabled. Related is regrename which is also enabled then. > > So where does your patch make a difference? Is the webizers dataflow analysis > maybe confused by backedges? Does -fweb handle things set by the last unrolled iteration, used by the first unrolled iteration? On a general note, we shouldn't depend on some pass that may or may not clean up the mess we make, when we could just avoid making a mess in the first place. The web pass belongs immediately after expand; but ideally, even expand would not reuse pseudos anyway. Maybe it would be better as some utility routines, not a pass? Segher
on 2020/4/15 下午2:21, Richard Biener via Gcc-patches wrote: > On Wed, Apr 15, 2020 at 3:56 AM Jiufu Guo via Gcc-patches > <gcc-patches@gcc.gnu.org> wrote: >> >> Hi, >> >> As you may know, we have loop unroll pass in RTL which was introduced a few >> years ago, and works for a long time. Currently, this unroller is using the >> pseudos in the original body, and then the pseudos are written multiple times. >> >> It would be a good idea to create new pseudos for those duplicated instructions >> during loop unrolling. This would relax reg dependence, and then provide more >> freedom/opportunity for other optimizations, like scheduling, RA... > > I think there's a separate pass to do something like this, conveniently > placed after unrolling. -fweb, IIRC enabled by default for -funroll-loops > unless explicitly disabled. Related is regrename which is also enabled then. > > So where does your patch make a difference? Is the webizers dataflow analysis > maybe confused by backedges? > Juofu can help to confirm, it looks we disabled them on rs6000 by default and don't enable them for unroll_only_small_loops but just enable for explicit unroll options. If it's proved that web can cover what we want here, one option sounds to bring it/them back even for unroll_only_small_loops? BR, Kewen
Segher Boessenkool <segher@kernel.crashing.org> writes: Hi, Thanks for all your comments! > Hi! > > On Wed, Apr 15, 2020 at 08:21:03AM +0200, Richard Biener wrote: >> On Wed, Apr 15, 2020 at 3:56 AM Jiufu Guo via Gcc-patches >> <gcc-patches@gcc.gnu.org> wrote: >> > As you may know, we have loop unroll pass in RTL which was introduced a few >> > years ago, and works for a long time. Currently, this unroller is using the >> > pseudos in the original body, and then the pseudos are written multiple times. >> > >> > It would be a good idea to create new pseudos for those duplicated instructions >> > during loop unrolling. This would relax reg dependence, and then provide more >> > freedom/opportunity for other optimizations, like scheduling, RA... >> >> I think there's a separate pass to do something like this, conveniently >> placed after unrolling. -fweb, IIRC enabled by default for -funroll-loops >> unless explicitly disabled. Related is regrename which is also >> enabled then. web/rnreg have some help for some cases, while they does not know much information about loop, and do not care if a BB/pseudo/reg is generated during unrolling. Unroller know these information, and it maybe accurate to create new pseudos and relax depences between insns in unroller. For the example of preview RTL sequences, using new pseudo in unroller is better than web/rnreg. >> >> So where does your patch make a difference? Is the webizers dataflow analysis >> maybe confused by backedges? > > Does -fweb handle things set by the last unrolled iteration, used by the > first unrolled iteration? > > On a general note, we shouldn't depend on some pass that may or may not > clean up the mess we make, when we could just avoid making a mess in the > first place. Yes, this is a reason that I also think it maybe good to do it during unroller. > > The web pass belongs immediately after expand; but ideally, even expand > would not reuse pseudos anyway. Currently, web runs after loop done now. Jiufu > > Maybe it would be better as some utility routines, not a pass? > > > Segher
Segher Boessenkool <segher@kernel.crashing.org> writes: > Hi Jiufu, > > Just reviewing random things as I see them... > > On Wed, Apr 15, 2020 at 09:56:00AM +0800, Jiufu Guo wrote: >> This patch only supports simple loops: one exit edge with one major basic block. > > That is fine for a proof-of-concept, but will need fixing perhaps. Hi Segher, Thanks so much. You are always point out things to improve! > >> --- a/gcc/common.opt >> +++ b/gcc/common.opt >> @@ -2526,6 +2526,10 @@ fvariable-expansion-in-unroller >> Common Report Var(flag_variable_expansion_in_unroller) Optimization >> Apply variable expansion when loops are unrolled. >> >> +fassin-new-operand-in-unroller > > Typo ("assign"). But, is there ever any reason to disable this? (You > want an option for it during development, of course, but all public > options cost something). Helpful question, development is one reason for now to test benefits and lost, and incase some one want to disable it. Your question let me think if it really need to disable it. In theory, this behavior is correct and not harmful. > >> +struct def_to_split >> +{ >> + rtx_insn *insn; /* The insn in that the assigment occurs. */ >> + rtx reg; /* The def/set pseudo. */ >> + vec<rtx> defs; /* The copies of the defs which is split to. */ >> + struct def_to_split *next; /* Next entry in walking order. */ >> + int count; /* Number of DEFS. */ >> + int use_before_def; /* The pseudo is used before re-defined. */ >> + rtx_insn *last_insn; >> +}; > > Three different kinds of indentation here. > > It might work better to explain it all in a bigger comment before this > struct definition. It also would help to have better, more obvious > names? Not "reg" but "orig_reg", not "defs" but "regs", not "count" > but "nregs"? Style, I may enhance it through tools. Thanks for comments about naming :) > >> +/* Return a hash for VES. */ >> + >> +inline hashval_t >> +def_split_hasher::hash (const def_to_split *i) >> +{ >> + return (hashval_t) INSN_UID (i->insn); >> +} > > What does "VES" mean? Something from an older version of the patch? it is typo :( > > > Segher
On Wed, Apr 15, 2020 at 11:23 PM Segher Boessenkool <segher@kernel.crashing.org> wrote: > > Hi! > > On Wed, Apr 15, 2020 at 08:21:03AM +0200, Richard Biener wrote: > > On Wed, Apr 15, 2020 at 3:56 AM Jiufu Guo via Gcc-patches > > <gcc-patches@gcc.gnu.org> wrote: > > > As you may know, we have loop unroll pass in RTL which was introduced a few > > > years ago, and works for a long time. Currently, this unroller is using the > > > pseudos in the original body, and then the pseudos are written multiple times. > > > > > > It would be a good idea to create new pseudos for those duplicated instructions > > > during loop unrolling. This would relax reg dependence, and then provide more > > > freedom/opportunity for other optimizations, like scheduling, RA... > > > > I think there's a separate pass to do something like this, conveniently > > placed after unrolling. -fweb, IIRC enabled by default for -funroll-loops > > unless explicitly disabled. Related is regrename which is also enabled then. > > > > So where does your patch make a difference? Is the webizers dataflow analysis > > maybe confused by backedges? > > Does -fweb handle things set by the last unrolled iteration, used by the > first unrolled iteration? > > On a general note, we shouldn't depend on some pass that may or may not > clean up the mess we make, when we could just avoid making a mess in the > first place. True - but the issue at hand is not trivial given you have to care for partial defs, uses outside of the loop (or across the backedge), etc. So there's plenty of things to go "wrong" here. > The web pass belongs immediately after expand; but ideally, even expand > would not reuse pseudos anyway. But for example when lower-subreg decomposes things in a way turning partial defs into full defs new opportunities to split the web arise. > Maybe it would be better as some utility routines, not a pass? Sure, but then when do we apply it? Ideally scheduling would to register renaming itself and thus not rely on the used pseudos (I'm not sure if it tracks false dependences - I guess it must if it isn't able to rename regs). That would be a much better place for improvements? Richard. > > Segher
On Thu, Apr 16, 2020 at 10:33:45AM +0200, Richard Biener wrote: > On Wed, Apr 15, 2020 at 11:23 PM Segher Boessenkool > > On a general note, we shouldn't depend on some pass that may or may not > > clean up the mess we make, when we could just avoid making a mess in the > > first place. > > True - but the issue at hand is not trivial given you have to care for > partial defs, uses outside of the loop (or across the backedge), etc. > So there's plenty of things to go "wrong" here. Certainly. But *all* RTL passes before RA should leave things in "web form" (compare to SSA form). The code in web.c is probably just fine; but we shouldn't have one web pass, *all* passes should leave things in a useful form! > > The web pass belongs immediately after expand; but ideally, even expand > > would not reuse pseudos anyway. > > But for example when lower-subreg decomposes things in a way turning > partial defs into full defs new opportunities to split the web arise. But that is only for the new registers it creates, or what am I missing? > > Maybe it would be better as some utility routines, not a pass? > > Sure, but then when do we apply it? All of the time! Ideally every pass would leave things in a good shape. Usually it is cheapest as well as easiest to do things manually, but for some harder cases, such helper routines can be used. > Ideally scheduling would to > register renaming itself and thus not rely on the used pseudos > (I'm not sure if it tracks false dependences - I guess it must if it > isn't able to rename regs). That would be a much better place > for improvements? sched2 runs after RA, so it has nothing to do with webs? And sched1 doesn't do much relevant here (it doesn't move insns much). I don't see how this is directly related to register renaming either? Segher
On Thu, Apr 16, 2020 at 7:46 PM Segher Boessenkool <segher@kernel.crashing.org> wrote: > > On Thu, Apr 16, 2020 at 10:33:45AM +0200, Richard Biener wrote: > > On Wed, Apr 15, 2020 at 11:23 PM Segher Boessenkool > > > On a general note, we shouldn't depend on some pass that may or may not > > > clean up the mess we make, when we could just avoid making a mess in the > > > first place. > > > > True - but the issue at hand is not trivial given you have to care for > > partial defs, uses outside of the loop (or across the backedge), etc. > > So there's plenty of things to go "wrong" here. > > Certainly. But *all* RTL passes before RA should leave things in "web > form" (compare to SSA form). The code in web.c is probably just fine; > but we shouldn't have one web pass, *all* passes should leave things in > a useful form! Yeah well, but RTL is not in SSA form and there's no RTL IL verification in place to track degradation. And we even work in the opposite way when expanding to RTL from SSA, coalescing as much as we can ... > > > The web pass belongs immediately after expand; but ideally, even expand > > > would not reuse pseudos anyway. > > > > But for example when lower-subreg decomposes things in a way turning > > partial defs into full defs new opportunities to split the web arise. > > But that is only for the new registers it creates, or what am I missing? No idea, just made up the example that maintaing "SSA" RTL is not automagic. > > > Maybe it would be better as some utility routines, not a pass? > > > > Sure, but then when do we apply it? > > All of the time! Ideally every pass would leave things in a good shape. > Usually it is cheapest as well as easiest to do things manually, but for > some harder cases, such helper routines can be used. > > > Ideally scheduling would to > > register renaming itself and thus not rely on the used pseudos > > (I'm not sure if it tracks false dependences - I guess it must if it > > isn't able to rename regs). That would be a much better place > > for improvements? > > sched2 runs after RA, so it has nothing to do with webs? And sched1 > doesn't do much relevant here (it doesn't move insns much). > > I don't see how this is directly related to register renaming either? If scheduling ignores "false" dependences (anti dependence with full defs) then when it schedules across such defs needs to perform renaming. Maybe I'm using bogus terms here. Richard. > > Segher
On 4/17/20 1:53 AM, Richard Biener wrote: > Yeah well, but RTL is not in SSA form and there's no RTL IL verification > in place to track degradation. And we even work in the opposite way > when expanding to RTL from SSA, coalescing as much as we can ... Which is itself problematic, introducing unnecessary antidependences at the start of RTL. We've seen performance issues with this on several occasions. Bill
On Fri, Apr 17, 2020 at 08:53:08AM +0200, Richard Biener wrote: > On Thu, Apr 16, 2020 at 7:46 PM Segher Boessenkool > <segher@kernel.crashing.org> wrote: > > > > On Thu, Apr 16, 2020 at 10:33:45AM +0200, Richard Biener wrote: > > > On Wed, Apr 15, 2020 at 11:23 PM Segher Boessenkool > > > > On a general note, we shouldn't depend on some pass that may or may not > > > > clean up the mess we make, when we could just avoid making a mess in the > > > > first place. > > > > > > True - but the issue at hand is not trivial given you have to care for > > > partial defs, uses outside of the loop (or across the backedge), etc. > > > So there's plenty of things to go "wrong" here. > > > > Certainly. But *all* RTL passes before RA should leave things in "web > > form" (compare to SSA form). The code in web.c is probably just fine; > > but we shouldn't have one web pass, *all* passes should leave things in > > a useful form! > > Yeah well, but RTL is not in SSA form "Webs" are not the *same* as SSA, in a few crucial ways; but they serve similar purposes: they both make code transformations easier to do. Both do this by having more different (independent) names. > and there's no RTL IL verification > in place to track degradation. Adding this isn't hard in principle. But currently it is violated all over the place, including in backend code. > And we even work in the opposite way > when expanding to RTL from SSA, coalescing as much as we can ... Which often is bad. A lot of what expand does is premature optimisation that a) is much too complicated, and b) results in worse code in the end. Expand should not try to do *anything* "interesting", it should "merely" translate to RTL. (And in a way that can be debugged sanely at all, if possible ;-) ) > > > > The web pass belongs immediately after expand; but ideally, even expand > > > > would not reuse pseudos anyway. > > > > > > But for example when lower-subreg decomposes things in a way turning > > > partial defs into full defs new opportunities to split the web arise. > > > > But that is only for the new registers it creates, or what am I missing? > > No idea, just made up the example that maintaing "SSA" RTL is > not automagic. Oh sure, but in most cases, it is. My point is that making web form only once (and messing it up after that) isn't great. > > > > Maybe it would be better as some utility routines, not a pass? > > > > > > Sure, but then when do we apply it? > > > > All of the time! Ideally every pass would leave things in a good shape. > > Usually it is cheapest as well as easiest to do things manually, but for > > some harder cases, such helper routines can be used. > > > > > Ideally scheduling would to > > > register renaming itself and thus not rely on the used pseudos > > > (I'm not sure if it tracks false dependences - I guess it must if it > > > isn't able to rename regs). That would be a much better place > > > for improvements? > > > > sched2 runs after RA, so it has nothing to do with webs? And sched1 > > doesn't do much relevant here (it doesn't move insns much). > > > > I don't see how this is directly related to register renaming either? > > If scheduling ignores "false" dependences (anti dependence with > full defs) then when it schedules across such defs needs to perform > renaming. Maybe I'm using bogus terms here. Your terminology is fine afaic, but I don't see what you mean, sorry. Maybe I don't know sched well enough... I thought it would just refuse to move something if that would result in broken code? Segher
On Fri, Apr 17, 2020 at 10:51 PM Segher Boessenkool <segher@kernel.crashing.org> wrote: > > On Fri, Apr 17, 2020 at 08:53:08AM +0200, Richard Biener wrote: > > On Thu, Apr 16, 2020 at 7:46 PM Segher Boessenkool > > <segher@kernel.crashing.org> wrote: > > > > > > On Thu, Apr 16, 2020 at 10:33:45AM +0200, Richard Biener wrote: > > > > On Wed, Apr 15, 2020 at 11:23 PM Segher Boessenkool > > > > > On a general note, we shouldn't depend on some pass that may or may not > > > > > clean up the mess we make, when we could just avoid making a mess in the > > > > > first place. > > > > > > > > True - but the issue at hand is not trivial given you have to care for > > > > partial defs, uses outside of the loop (or across the backedge), etc. > > > > So there's plenty of things to go "wrong" here. > > > > > > Certainly. But *all* RTL passes before RA should leave things in "web > > > form" (compare to SSA form). The code in web.c is probably just fine; > > > but we shouldn't have one web pass, *all* passes should leave things in > > > a useful form! > > > > Yeah well, but RTL is not in SSA form > > "Webs" are not the *same* as SSA, in a few crucial ways; but they serve > similar purposes: they both make code transformations easier to do. > Both do this by having more different (independent) names. > > > and there's no RTL IL verification > > in place to track degradation. > > Adding this isn't hard in principle. But currently it is violated all > over the place, including in backend code. > > > And we even work in the opposite way > > when expanding to RTL from SSA, coalescing as much as we can ... > > Which often is bad. A lot of what expand does is premature optimisation > that a) is much too complicated, and b) results in worse code in the end. > Expand should not try to do *anything* "interesting", it should "merely" > translate to RTL. > > (And in a way that can be debugged sanely at all, if possible ;-) ) > > > > > > The web pass belongs immediately after expand; but ideally, even expand > > > > > would not reuse pseudos anyway. > > > > > > > > But for example when lower-subreg decomposes things in a way turning > > > > partial defs into full defs new opportunities to split the web arise. > > > > > > But that is only for the new registers it creates, or what am I missing? > > > > No idea, just made up the example that maintaing "SSA" RTL is > > not automagic. > > Oh sure, but in most cases, it is. > > My point is that making web form only once (and messing it up after that) > isn't great. > > > > > > Maybe it would be better as some utility routines, not a pass? > > > > > > > > Sure, but then when do we apply it? > > > > > > All of the time! Ideally every pass would leave things in a good shape. > > > Usually it is cheapest as well as easiest to do things manually, but for > > > some harder cases, such helper routines can be used. > > > > > > > Ideally scheduling would to > > > > register renaming itself and thus not rely on the used pseudos > > > > (I'm not sure if it tracks false dependences - I guess it must if it > > > > isn't able to rename regs). That would be a much better place > > > > for improvements? > > > > > > sched2 runs after RA, so it has nothing to do with webs? And sched1 > > > doesn't do much relevant here (it doesn't move insns much). > > > > > > I don't see how this is directly related to register renaming either? > > > > If scheduling ignores "false" dependences (anti dependence with > > full defs) then when it schedules across such defs needs to perform > > renaming. Maybe I'm using bogus terms here. > > Your terminology is fine afaic, but I don't see what you mean, sorry. > Maybe I don't know sched well enough... I thought it would just refuse > to move something if that would result in broken code? Yes, I think that's what it does. And I was suggesting it should learn to schedule the use after the def in (use (reg:SI 101)) (set (reg:SI 101) (...)) by renaming the pseudo, removing the false/anti dependence. After RA this means the ability to do simple local register allocation of course. Before RA the trivial way to implement this is to not have those false dependences, aka have webs / SSA ... ;) Richard. > > Segher
On Fri, Apr 17, 2020 at 10:51 PM Segher Boessenkool <segher@kernel.crashing.org> wrote: > > On Fri, Apr 17, 2020 at 08:53:08AM +0200, Richard Biener wrote: > > On Thu, Apr 16, 2020 at 7:46 PM Segher Boessenkool > > <segher@kernel.crashing.org> wrote: > > > > > > On Thu, Apr 16, 2020 at 10:33:45AM +0200, Richard Biener wrote: > > > > On Wed, Apr 15, 2020 at 11:23 PM Segher Boessenkool > > > > > On a general note, we shouldn't depend on some pass that may or may not > > > > > clean up the mess we make, when we could just avoid making a mess in the > > > > > first place. > > > > > > > > True - but the issue at hand is not trivial given you have to care for > > > > partial defs, uses outside of the loop (or across the backedge), etc. > > > > So there's plenty of things to go "wrong" here. > > > > > > Certainly. But *all* RTL passes before RA should leave things in "web > > > form" (compare to SSA form). The code in web.c is probably just fine; > > > but we shouldn't have one web pass, *all* passes should leave things in > > > a useful form! > > > > Yeah well, but RTL is not in SSA form > > "Webs" are not the *same* as SSA, in a few crucial ways; but they serve > similar purposes: they both make code transformations easier to do. > Both do this by having more different (independent) names. > > > and there's no RTL IL verification > > in place to track degradation. > > Adding this isn't hard in principle. But currently it is violated all > over the place, including in backend code. It's a cheap excuse and clearly a lot of work. But we've done it for GIMPLE and it was clearly worth it. If only for documentation purposes. > > And we even work in the opposite way > > when expanding to RTL from SSA, coalescing as much as we can ... > > Which often is bad. A lot of what expand does is premature optimisation > that a) is much too complicated, and b) results in worse code in the end. > Expand should not try to do *anything* "interesting", it should "merely" > translate to RTL. > > (And in a way that can be debugged sanely at all, if possible ;-) ) Sure. I've tried to turn off TER with only moderate success. Another approach would be to not do any coalescing but for PHIs (and maybe there only for edges where we cannot insert copies on w/o splitting it). But clearly all the magic expand does now should materialize itself on GIMPLE (see the patch series Martin posted for vcond* / vec_cmp* "expansion" on GIMPLE). It's the magic that requires TER and TER clearly introduces the most noise in the GIMPLE -> RTL translation step (as to code motion and initial instruction selection if it "fails"). Richard.
Hi! On Mon, Apr 20, 2020 at 09:56:34AM +0200, Richard Biener wrote: > On Fri, Apr 17, 2020 at 10:51 PM Segher Boessenkool > <segher@kernel.crashing.org> wrote: > > > Yeah well, but RTL is not in SSA form > > > > "Webs" are not the *same* as SSA, in a few crucial ways; but they serve > > similar purposes: they both make code transformations easier to do. > > Both do this by having more different (independent) names. > > > > > and there's no RTL IL verification > > > in place to track degradation. > > > > Adding this isn't hard in principle. But currently it is violated all > > over the place, including in backend code. > > It's a cheap excuse and clearly a lot of work. But we've done it for GIMPLE > and it was clearly worth it. If only for documentation purposes. Yes, I very much agree :-) But how will we do this? Make it warnings only, and have those opt-in (for a port) even? > > > And we even work in the opposite way > > > when expanding to RTL from SSA, coalescing as much as we can ... > > > > Which often is bad. A lot of what expand does is premature optimisation > > that a) is much too complicated, and b) results in worse code in the end. > > Expand should not try to do *anything* "interesting", it should "merely" > > translate to RTL. > > > > (And in a way that can be debugged sanely at all, if possible ;-) ) > > Sure. I've tried to turn off TER with only moderate success. Another approach > would be to not do any coalescing but for PHIs (and maybe there only for edges > where we cannot insert copies on w/o splitting it). Yeah, that would be much simpler to do as well. > But clearly all the magic expand does now should materialize itself on > GIMPLE (see the patch series Martin posted for vcond* / vec_cmp* > "expansion" on GIMPLE). Probably most, yeah, but some things that expand does should be done much *later* in RTL even :-) > It's the magic that requires TER and TER clearly > introduces the most noise in the GIMPLE -> RTL translation step (as to > code motion and initial instruction selection if it "fails"). Expand does a lot of things that are very much historic, too. Segher
On Mon, Apr 20, 2020 at 3:38 PM Segher Boessenkool <segher@kernel.crashing.org> wrote: > > Hi! > > On Mon, Apr 20, 2020 at 09:56:34AM +0200, Richard Biener wrote: > > On Fri, Apr 17, 2020 at 10:51 PM Segher Boessenkool > > <segher@kernel.crashing.org> wrote: > > > > Yeah well, but RTL is not in SSA form > > > > > > "Webs" are not the *same* as SSA, in a few crucial ways; but they serve > > > similar purposes: they both make code transformations easier to do. > > > Both do this by having more different (independent) names. > > > > > > > and there's no RTL IL verification > > > > in place to track degradation. > > > > > > Adding this isn't hard in principle. But currently it is violated all > > > over the place, including in backend code. > > > > It's a cheap excuse and clearly a lot of work. But we've done it for GIMPLE > > and it was clearly worth it. If only for documentation purposes. > > Yes, I very much agree :-) But how will we do this? Make it warnings > only, and have those opt-in (for a port) even? Add -fcheck-rtl, when you're ready (per port?) make it default for bootstrap. When you're even more "ready", turn it on with -fchecking (or --enable-checking=xyz). We've gone a similar way with the GIMPLE checking (not SSA checking, that was always present) which was formerly called "type" checking but now verifies more details. It was broken quite badly as well and mostly incrementally fixed (some cases by intentionally weakening the checker). Of course as you say with RTL you have all the targets to consider ... > > > > And we even work in the opposite way > > > > when expanding to RTL from SSA, coalescing as much as we can ... > > > > > > Which often is bad. A lot of what expand does is premature optimisation > > > that a) is much too complicated, and b) results in worse code in the end. > > > Expand should not try to do *anything* "interesting", it should "merely" > > > translate to RTL. > > > > > > (And in a way that can be debugged sanely at all, if possible ;-) ) > > > > Sure. I've tried to turn off TER with only moderate success. Another approach > > would be to not do any coalescing but for PHIs (and maybe there only for edges > > where we cannot insert copies on w/o splitting it). > > Yeah, that would be much simpler to do as well. > > > But clearly all the magic expand does now should materialize itself on > > GIMPLE (see the patch series Martin posted for vcond* / vec_cmp* > > "expansion" on GIMPLE). > > Probably most, yeah, but some things that expand does should be done > much *later* in RTL even :-) We have combine for that. Eh. > > It's the magic that requires TER and TER clearly > > introduces the most noise in the GIMPLE -> RTL translation step (as to > > code motion and initial instruction selection if it "fails"). > > Expand does a lot of things that are very much historic, too. Sure... Richard. > > Segher
Richard Biener via Gcc-patches <gcc-patches@gcc.gnu.org> writes: > On Mon, Apr 20, 2020 at 3:38 PM Segher Boessenkool > <segher@kernel.crashing.org> wrote: >> >> Hi! >> >> On Mon, Apr 20, 2020 at 09:56:34AM +0200, Richard Biener wrote: >> > On Fri, Apr 17, 2020 at 10:51 PM Segher Boessenkool >> > <segher@kernel.crashing.org> wrote: >> > > > Yeah well, but RTL is not in SSA form >> > > >> > > "Webs" are not the *same* as SSA, in a few crucial ways; but they serve >> > > similar purposes: they both make code transformations easier to do. >> > > Both do this by having more different (independent) names. >> > > >> > > > and there's no RTL IL verification >> > > > in place to track degradation. >> > > >> > > Adding this isn't hard in principle. But currently it is violated all >> > > over the place, including in backend code. >> > >> > It's a cheap excuse and clearly a lot of work. But we've done it for GIMPLE >> > and it was clearly worth it. If only for documentation purposes. >> >> Yes, I very much agree :-) But how will we do this? Make it warnings >> only, and have those opt-in (for a port) even? > > Add -fcheck-rtl, when you're ready (per port?) make it default for > bootstrap. When you're even more "ready", turn it on with -fchecking > (or --enable-checking=xyz). What kind of condition would we be checking? That no register has two definitions in the same block? Or something stronger? Testing that -fweb would have no effect is expensive, since it requires full UD chains. And there's a risk that optimisations could break the condition accidentally. E.g.: if (cond) { x = ...; ...use x...; } else { x = ...; ...use x...; } ...use x...; // A, REG_DEAD x If anything managed to fold away the need for x in A (e.g. combine), it would then be required to rename x in one of the if/else blocks. In some ways it feels like it would be easier to resurrect RTL SSA :-) Thanks, Richard
On Wed, Apr 22, 2020 at 11:26 AM Richard Sandiford <richard.sandiford@arm.com> wrote: > > Richard Biener via Gcc-patches <gcc-patches@gcc.gnu.org> writes: > > On Mon, Apr 20, 2020 at 3:38 PM Segher Boessenkool > > <segher@kernel.crashing.org> wrote: > >> > >> Hi! > >> > >> On Mon, Apr 20, 2020 at 09:56:34AM +0200, Richard Biener wrote: > >> > On Fri, Apr 17, 2020 at 10:51 PM Segher Boessenkool > >> > <segher@kernel.crashing.org> wrote: > >> > > > Yeah well, but RTL is not in SSA form > >> > > > >> > > "Webs" are not the *same* as SSA, in a few crucial ways; but they serve > >> > > similar purposes: they both make code transformations easier to do. > >> > > Both do this by having more different (independent) names. > >> > > > >> > > > and there's no RTL IL verification > >> > > > in place to track degradation. > >> > > > >> > > Adding this isn't hard in principle. But currently it is violated all > >> > > over the place, including in backend code. > >> > > >> > It's a cheap excuse and clearly a lot of work. But we've done it for GIMPLE > >> > and it was clearly worth it. If only for documentation purposes. > >> > >> Yes, I very much agree :-) But how will we do this? Make it warnings > >> only, and have those opt-in (for a port) even? > > > > Add -fcheck-rtl, when you're ready (per port?) make it default for > > bootstrap. When you're even more "ready", turn it on with -fchecking > > (or --enable-checking=xyz). > > What kind of condition would we be checking? That no register has two > definitions in the same block? Or something stronger? For the particular discussion no idea. But we generally lack verification of what is "valid" RTL and what is not. > Testing that -fweb would have no effect is expensive, since it requires > full UD chains. And there's a risk that optimisations could break the > condition accidentally. E.g.: > > if (cond) > { > x = ...; > ...use x...; > } > else > { > x = ...; > ...use x...; > } > ...use x...; // A, REG_DEAD x > > If anything managed to fold away the need for x in A (e.g. combine), > it would then be required to rename x in one of the if/else blocks. > > In some ways it feels like it would be easier to resurrect RTL SSA :-) > > Thanks, > Richard
On Wed, Apr 22, 2020 at 11:58:38AM +0200, Richard Biener wrote: > On Wed, Apr 22, 2020 at 11:26 AM Richard Sandiford > <richard.sandiford@arm.com> wrote: > > Richard Biener via Gcc-patches <gcc-patches@gcc.gnu.org> writes: > > > On Mon, Apr 20, 2020 at 3:38 PM Segher Boessenkool > > > <segher@kernel.crashing.org> wrote: > > >> On Mon, Apr 20, 2020 at 09:56:34AM +0200, Richard Biener wrote: > > >> > On Fri, Apr 17, 2020 at 10:51 PM Segher Boessenkool > > >> > <segher@kernel.crashing.org> wrote: > > >> > > > Yeah well, but RTL is not in SSA form > > >> > > > > >> > > "Webs" are not the *same* as SSA, in a few crucial ways; but they serve > > >> > > similar purposes: they both make code transformations easier to do. > > >> > > Both do this by having more different (independent) names. > > >> > > > > >> > > > and there's no RTL IL verification > > >> > > > in place to track degradation. > > >> > > > > >> > > Adding this isn't hard in principle. But currently it is violated all > > >> > > over the place, including in backend code. > > >> > > > >> > It's a cheap excuse and clearly a lot of work. But we've done it for GIMPLE > > >> > and it was clearly worth it. If only for documentation purposes. > > >> > > >> Yes, I very much agree :-) But how will we do this? Make it warnings > > >> only, and have those opt-in (for a port) even? > > > > > > Add -fcheck-rtl, when you're ready (per port?) make it default for > > > bootstrap. When you're even more "ready", turn it on with -fchecking > > > (or --enable-checking=xyz). > > > > What kind of condition would we be checking? That no register has two > > definitions in the same block? Or something stronger? > > For the particular discussion no idea. But we generally lack > verification of what is "valid" RTL and what is not. There is RTL checking, and it works very well... But "valid" just means "structurally valid", and other very local things. If we want to check some extra properties, we actually have to make the rules for those first! :-) > > In some ways it feels like it would be easier to resurrect RTL SSA :-) Why was RTL SSA abandoned? It might well work to keep everything in SSA form all the way to RA. Hrm, that doesn't sound bad at all :-) (The PHIs need to be made explicit to something that resembles the machine code we will end up with, very early in the pipeline, but it could still also be some valid SSA form; and we can of course also have hard registers in all RTL, so that needs to be dealt with sanely some way as well Lots of details, I don't see a crucial problem though, probably means I need to look harder ;-) ) Segher
On Wed, 2020-04-22 at 15:50 -0500, Segher Boessenkool wrote: > > > > In some ways it feels like it would be easier to resurrect RTL SSA :-) > > Why was RTL SSA abandoned? > > It might well work to keep everything in SSA form all the way to RA. > Hrm, that doesn't sound bad at all :-) > > (The PHIs need to be made explicit to something that resembles the > machine code we will end up with, very early in the pipeline, but it > could still also be some valid SSA form; and we can of course also > have hard registers in all RTL, so that needs to be dealt with sanely > some way as well Lots of details, I don't see a crucial problem though, > probably means I need to look harder ;-) ) Lack of time mostly. There's some complications like subregs, argument registers and the like. But you can restrict ssa based analysis & optimizations to just the set of pseudos that are in SSA form and do something more conservative on the rest. Jeff
On Thu, Apr 23, 2020 at 12:31 AM Jeff Law <law@redhat.com> wrote: > > On Wed, 2020-04-22 at 15:50 -0500, Segher Boessenkool wrote: > > > > > > In some ways it feels like it would be easier to resurrect RTL SSA :-) > > > > Why was RTL SSA abandoned? > > > > It might well work to keep everything in SSA form all the way to RA. > > Hrm, that doesn't sound bad at all :-) > > > > (The PHIs need to be made explicit to something that resembles the > > machine code we will end up with, very early in the pipeline, but it > > could still also be some valid SSA form; and we can of course also > > have hard registers in all RTL, so that needs to be dealt with sanely > > some way as well Lots of details, I don't see a crucial problem though, > > probably means I need to look harder ;-) ) > Lack of time mostly. There's some complications like subregs, argument registers > and the like. But you can restrict ssa based analysis & optimizations to just > the set of pseudos that are in SSA form and do something more conservative on the > rest. I guess time is better spent on trying to extend GIMPLE + SSA up to RA, thus make instruction selection on GIMPLE. Back in time Steven spent quite some time doing factored SSA but I don't remember either what went wrong or whether simply DF appeared first. Richard. > Jeff >
On Thu, Apr 23, 2020 at 09:32:37AM +0200, Richard Biener wrote: > On Thu, Apr 23, 2020 at 12:31 AM Jeff Law <law@redhat.com> wrote: > > On Wed, 2020-04-22 at 15:50 -0500, Segher Boessenkool wrote: > > > > > In some ways it feels like it would be easier to resurrect RTL SSA :-) > > > > > > Why was RTL SSA abandoned? > > > > > > It might well work to keep everything in SSA form all the way to RA. > > > Hrm, that doesn't sound bad at all :-) > > > > > > (The PHIs need to be made explicit to something that resembles the > > > machine code we will end up with, very early in the pipeline, but it > > > could still also be some valid SSA form; and we can of course also > > > have hard registers in all RTL, so that needs to be dealt with sanely > > > some way as well Lots of details, I don't see a crucial problem though, > > > probably means I need to look harder ;-) ) > > Lack of time mostly. There's some complications like subregs, argument registers > > and the like. But you can restrict ssa based analysis & optimizations to just > > the set of pseudos that are in SSA form and do something more conservative on the > > rest. > > I guess time is better spent on trying to extend GIMPLE + SSA up to RA, thus > make instruction selection on GIMPLE. I think this is a bad idea. By the time you have invented enough new "lower GIMPLE" ("limple"?) to be able to use it to describe machine insns like we can with RTL, you will have a more verbose, more memory hungry, slower, etc. reinvented RTL. RTL is a *feature*, and it is one of the things that makes GCC significantly better than the competition. More optimisations should move to GIMPLE, for example some loop optimisations should be done much earlier (most unrolling). The expand pass should lose most of the "optimisations" it has built up over the decades (that now often are detrimental at best). Some of what expand now does should probably be done while still in GIMPLE, even. But it is very useful to have a separate "low level" representation, that is actually close to the machine code we will eventually generate. RTL is one such representation, and we already have it, and it is very well tuned by now -- throwing it away would require some *huge* advantage, because the costs of doing that are immense as well. Segher
On Thu, Apr 23, 2020 at 12:17 PM Segher Boessenkool <segher@kernel.crashing.org> wrote: > > On Thu, Apr 23, 2020 at 09:32:37AM +0200, Richard Biener wrote: > > On Thu, Apr 23, 2020 at 12:31 AM Jeff Law <law@redhat.com> wrote: > > > On Wed, 2020-04-22 at 15:50 -0500, Segher Boessenkool wrote: > > > > > > In some ways it feels like it would be easier to resurrect RTL SSA :-) > > > > > > > > Why was RTL SSA abandoned? > > > > > > > > It might well work to keep everything in SSA form all the way to RA. > > > > Hrm, that doesn't sound bad at all :-) > > > > > > > > (The PHIs need to be made explicit to something that resembles the > > > > machine code we will end up with, very early in the pipeline, but it > > > > could still also be some valid SSA form; and we can of course also > > > > have hard registers in all RTL, so that needs to be dealt with sanely > > > > some way as well Lots of details, I don't see a crucial problem though, > > > > probably means I need to look harder ;-) ) > > > Lack of time mostly. There's some complications like subregs, argument registers > > > and the like. But you can restrict ssa based analysis & optimizations to just > > > the set of pseudos that are in SSA form and do something more conservative on the > > > rest. > > > > I guess time is better spent on trying to extend GIMPLE + SSA up to RA, thus > > make instruction selection on GIMPLE. > > I think this is a bad idea. By the time you have invented enough new > "lower GIMPLE" ("limple"?) to be able to use it to describe machine > insns like we can with RTL, you will have a more verbose, more memory > hungry, slower, etc. reinvented RTL. I don't think there's much to invent. I think at least one step would be uncontroversical(?), namely moving the RTL expansion "magic" up to a GIMPLE pass. Where the "magic" would be to turn GIMPLE stmts not directly expandable via an existing optab into GIMPLE that can be trivially expanded. That includes eventually combining multiple stmts into more powerful instructions and doing the magic we have in, like, expand_binop (widening, etc.). Where there's not a 1:1 mapping of a GIMPLE stmt to an optab GIMPLE gets direct-internal-fn calls. Then RTL expansion would be mostly invoking gen_insn (optab-code). More controversical would be ending up in GIMPLE there. I think GIMPLE can handle all RTL insns if we massage GIMPLE_ASM a bit. You'd end up with, say, asm ("(set (reg:DI $0) (and:DI (reg/v:DI $1 [ dst ]) (reg:DI $2)))" : "r" (_1) : "r" (_2), "r" (_3) : "cc"); in place of _1 = _2 & _3; and the GIMPLE_ASM text could be actual RTL. We'd extend the stmt with an extra operand to denote recognized patterns, so another option would be to keep the original GIMPLE as well. > RTL is a *feature*, and it is one of the things that makes GCC > significantly better than the competition. That said, I actually agree with that. It's just that I hope we can make some of the knowledge just represented on the RTL side available on the GIMPLE side. The more complicated parts, like calling conventions, that is. And yes, I want to get rid of that expand monster to be able to do something like sched1 on "GIMPLE" without expand coming along and re-scheduling everything at-will. > More optimisations should move to GIMPLE, for example some loop > optimisations should be done much earlier (most unrolling). The expand > pass should lose most of the "optimisations" it has built up over the > decades (that now often are detrimental at best). Some of what expand > now does should probably be done while still in GIMPLE, even. > > But it is very useful to have a separate "low level" representation, > that is actually close to the machine code we will eventually generate. > RTL is one such representation, and we already have it, and it is very > well tuned by now -- throwing it away would require some *huge* > advantage, because the costs of doing that are immense as well. But being stuck with something means no progress... I know very well it's 100 times harder to get rid of something than to add something new ontop. Richard. > > Segher
On Thu, Apr 23, 2020 at 12:52:30PM +0200, Richard Biener wrote: > On Thu, Apr 23, 2020 at 12:17 PM Segher Boessenkool > <segher@kernel.crashing.org> wrote: > > > > On Thu, Apr 23, 2020 at 09:32:37AM +0200, Richard Biener wrote: > > > On Thu, Apr 23, 2020 at 12:31 AM Jeff Law <law@redhat.com> wrote: > > > > On Wed, 2020-04-22 at 15:50 -0500, Segher Boessenkool wrote: > > > > > > > In some ways it feels like it would be easier to resurrect RTL SSA :-) > > > > > > > > > > Why was RTL SSA abandoned? > > > > > > > > > > It might well work to keep everything in SSA form all the way to RA. > > > > > Hrm, that doesn't sound bad at all :-) > > > > > > > > > > (The PHIs need to be made explicit to something that resembles the > > > > > machine code we will end up with, very early in the pipeline, but it > > > > > could still also be some valid SSA form; and we can of course also > > > > > have hard registers in all RTL, so that needs to be dealt with sanely > > > > > some way as well Lots of details, I don't see a crucial problem though, > > > > > probably means I need to look harder ;-) ) > > > > Lack of time mostly. There's some complications like subregs, argument registers > > > > and the like. But you can restrict ssa based analysis & optimizations to just > > > > the set of pseudos that are in SSA form and do something more conservative on the > > > > rest. > > > > > > I guess time is better spent on trying to extend GIMPLE + SSA up to RA, thus > > > make instruction selection on GIMPLE. > > > > I think this is a bad idea. By the time you have invented enough new > > "lower GIMPLE" ("limple"?) to be able to use it to describe machine > > insns like we can with RTL, you will have a more verbose, more memory > > hungry, slower, etc. reinvented RTL. > > I don't think there's much to invent. > > I think at least one step would be uncontroversical(?), namely moving > the RTL expansion "magic" > up to a GIMPLE pass. Where the "magic" would be to turn > GIMPLE stmts not directly expandable via an existing optab into > GIMPLE that can be trivially expanded. That includes eventually > combining multiple stmts into more powerful instructions and > doing the magic we have in, like, expand_binop (widening, etc.). > Where there's not a 1:1 mapping of a GIMPLE stmt to an optab > GIMPLE gets direct-internal-fn calls. > Then RTL expansion would be mostly invoking gen_insn (optab-code). Most of expand is *other stuff*. Expand does a *lot* of things that are actually changing the code. And much of that is not done anywhere else either yet, so this cannot be fixed by simply deleting the offending code. > More controversical would be ending up in GIMPLE there. I think > GIMPLE can handle all RTL insns if we massage GIMPLE_ASM > a bit. You'd end up with, say, > > asm ("(set (reg:DI $0) > (and:DI (reg/v:DI $1 [ dst ]) > (reg:DI $2)))" : "r" (_1) : "r" (_2), "r" (_3) : "cc"); > > in place of > > _1 = _2 & _3; > > and the GIMPLE_ASM text could be actual RTL. We'd extend > the stmt with an extra operand to denote recognized patterns, > so another option would be to keep the original GIMPLE as well. Why would you ever want to do that? That would take much more memory, and RTL's memory use until recently always was a pain point. > > RTL is a *feature*, and it is one of the things that makes GCC > > significantly better than the competition. > > That said, I actually agree with that. It's just that I hope we can > make some of the knowledge just represented on the RTL side > available on the GIMPLE side. The more complicated parts, > like calling conventions, that is. Yeah, and like I said, some things (unroll...) should move to GIMPLE as well, most of it anyway. And some of the remaining RTL code needs a good overhaul (oh hello CSE). > And yes, I want to get rid of that expand monster to be able to > do something like sched1 on "GIMPLE" without expand coming > along and re-scheduling everything at-will. Right, ideally, expand would just translate GIMPLE to RTL one-to-one (well, few-to-few, whatever :-) ). But it does so much other stuff now, so all that has to be moved or reimplemented or whatever. > > More optimisations should move to GIMPLE, for example some loop > > optimisations should be done much earlier (most unrolling). The expand > > pass should lose most of the "optimisations" it has built up over the > > decades (that now often are detrimental at best). Some of what expand > > now does should probably be done while still in GIMPLE, even. > > > > But it is very useful to have a separate "low level" representation, > > that is actually close to the machine code we will eventually generate. > > RTL is one such representation, and we already have it, and it is very > > well tuned by now -- throwing it away would require some *huge* > > advantage, because the costs of doing that are immense as well. > > But being stuck with something means no progress... I know > very well it's 100 times harder to get rid of something than to > add something new ontop. Well, what progress do you expect to make? After expand that is :-) Most of what is done in RTL is done very well. Replacing specific things in RTL, maybe as big as whole passes, already is hard to do without regressing (a *lot*). And if there is no real reason to do that... Segher
On Thu, Apr 23, 2020 at 2:07 PM Segher Boessenkool <segher@kernel.crashing.org> wrote: > > On Thu, Apr 23, 2020 at 12:52:30PM +0200, Richard Biener wrote: > > On Thu, Apr 23, 2020 at 12:17 PM Segher Boessenkool > > <segher@kernel.crashing.org> wrote: > > > > > > On Thu, Apr 23, 2020 at 09:32:37AM +0200, Richard Biener wrote: > > > > On Thu, Apr 23, 2020 at 12:31 AM Jeff Law <law@redhat.com> wrote: > > > > > On Wed, 2020-04-22 at 15:50 -0500, Segher Boessenkool wrote: > > > > > > > > In some ways it feels like it would be easier to resurrect RTL SSA :-) > > > > > > > > > > > > Why was RTL SSA abandoned? > > > > > > > > > > > > It might well work to keep everything in SSA form all the way to RA. > > > > > > Hrm, that doesn't sound bad at all :-) > > > > > > > > > > > > (The PHIs need to be made explicit to something that resembles the > > > > > > machine code we will end up with, very early in the pipeline, but it > > > > > > could still also be some valid SSA form; and we can of course also > > > > > > have hard registers in all RTL, so that needs to be dealt with sanely > > > > > > some way as well Lots of details, I don't see a crucial problem though, > > > > > > probably means I need to look harder ;-) ) > > > > > Lack of time mostly. There's some complications like subregs, argument registers > > > > > and the like. But you can restrict ssa based analysis & optimizations to just > > > > > the set of pseudos that are in SSA form and do something more conservative on the > > > > > rest. > > > > > > > > I guess time is better spent on trying to extend GIMPLE + SSA up to RA, thus > > > > make instruction selection on GIMPLE. > > > > > > I think this is a bad idea. By the time you have invented enough new > > > "lower GIMPLE" ("limple"?) to be able to use it to describe machine > > > insns like we can with RTL, you will have a more verbose, more memory > > > hungry, slower, etc. reinvented RTL. > > > > I don't think there's much to invent. > > > > I think at least one step would be uncontroversical(?), namely moving > > the RTL expansion "magic" > > up to a GIMPLE pass. Where the "magic" would be to turn > > GIMPLE stmts not directly expandable via an existing optab into > > GIMPLE that can be trivially expanded. That includes eventually > > combining multiple stmts into more powerful instructions and > > doing the magic we have in, like, expand_binop (widening, etc.). > > Where there's not a 1:1 mapping of a GIMPLE stmt to an optab > > GIMPLE gets direct-internal-fn calls. > > Then RTL expansion would be mostly invoking gen_insn (optab-code). > > Most of expand is *other stuff*. Expand does a *lot* of things that are > actually changing the code. And much of that is not done anywhere else > either yet, so this cannot be fixed by simply deleting the offending code. > > > More controversical would be ending up in GIMPLE there. I think > > GIMPLE can handle all RTL insns if we massage GIMPLE_ASM > > a bit. You'd end up with, say, > > > > asm ("(set (reg:DI $0) > > (and:DI (reg/v:DI $1 [ dst ]) > > (reg:DI $2)))" : "r" (_1) : "r" (_2), "r" (_3) : "cc"); > > > > in place of > > > > _1 = _2 & _3; > > > > and the GIMPLE_ASM text could be actual RTL. We'd extend > > the stmt with an extra operand to denote recognized patterns, > > so another option would be to keep the original GIMPLE as well. > > Why would you ever want to do that? That would take much more memory, > and RTL's memory use until recently always was a pain point. It's not the RTL IL that uses much memory, it's infrastructure like DF that easily blows up. Mind that GCC has only a single function in RTL at a time but the whole program in GIMPLE. And GIMPLE includes "DF" by default. > > > RTL is a *feature*, and it is one of the things that makes GCC > > > significantly better than the competition. > > > > That said, I actually agree with that. It's just that I hope we can > > make some of the knowledge just represented on the RTL side > > available on the GIMPLE side. The more complicated parts, > > like calling conventions, that is. > > Yeah, and like I said, some things (unroll...) should move to GIMPLE as > well, most of it anyway. And some of the remaining RTL code needs a > good overhaul (oh hello CSE). > > > And yes, I want to get rid of that expand monster to be able to > > do something like sched1 on "GIMPLE" without expand coming > > along and re-scheduling everything at-will. > > Right, ideally, expand would just translate GIMPLE to RTL one-to-one > (well, few-to-few, whatever :-) ). But it does so much other stuff now, > so all that has to be moved or reimplemented or whatever. > > > > More optimisations should move to GIMPLE, for example some loop > > > optimisations should be done much earlier (most unrolling). The expand > > > pass should lose most of the "optimisations" it has built up over the > > > decades (that now often are detrimental at best). Some of what expand > > > now does should probably be done while still in GIMPLE, even. > > > > > > But it is very useful to have a separate "low level" representation, > > > that is actually close to the machine code we will eventually generate. > > > RTL is one such representation, and we already have it, and it is very > > > well tuned by now -- throwing it away would require some *huge* > > > advantage, because the costs of doing that are immense as well. > > > > But being stuck with something means no progress... I know > > very well it's 100 times harder to get rid of something than to > > add something new ontop. > > Well, what progress do you expect to make? After expand that is :-) I'd like the RTL pipeline before RA to shrink significantly, no PRE, no CSE, ... The important part before RA is robust and intelligent instruction selection - part of which is already done at RTL expansion time. > Most of what is done in RTL is done very well. Umm, well... I beg to differ with regard to DF and passes like postreload-gcse. > Replacing specific things in RTL, maybe as big as whole passes, already > is hard to do without regressing (a *lot*). And if there is no real > reason to do that... The motivation is to make GCC faster, obviously. Not spending time doing things that should have been done before (RTL PRE vs. GIMPLE PRE, etc.). Using the same infrastructure (what, no loop dependency analysis on RTL?), etc. You could say we should do more on RTL and enhance it instead, like do vectorization where we actually could have a better idea on costs and capabilities. But I'm a GIMPLE person and don't know enough of RTL to enhance it ... Richard. > > Segher
On Thu, Apr 23, 2020 at 02:25:40PM +0200, Richard Biener wrote: > > > But being stuck with something means no progress... I know > > > very well it's 100 times harder to get rid of something than to > > > add something new ontop. > > > > Well, what progress do you expect to make? After expand that is :-) > > I'd like the RTL pipeline before RA to shrink significantly, no PRE, > no CSE, ... RTL CSE for example is very much required to get any good code. It needs to CSE stuff that wasn't there before expand. The pass currently does much more (as well as not enough), of course. > The important part before RA is robust and intelligent > instruction selection - part of which is already done at RTL expansion > time. LOL. The expand pass doesn't often make good choices, and it *shouldn't*, it should not make many choices at all; it should just generate valid RTL, new pseudos for everything, and let later RTL passes make faster code from that. > > Most of what is done in RTL is done very well. > > Umm, well... I beg to differ with regard to DF and passes like > postreload-gcse. What is wrong with DF? Is there something particular in postreload-gcse that is bad? To me it always is just one of those passes that doesn't do anything :-) That can and should be cleaned up, sure :-) > > Replacing specific things in RTL, maybe as big as whole passes, already > > is hard to do without regressing (a *lot*). And if there is no real > > reason to do that... > > The motivation is to make GCC faster, obviously. Not spending time > doing things that should have been done before (RTL PRE vs. GIMPLE PRE, etc.). > Using the same infrastructure (what, no loop dependency analysis on RTL?), etc. But everything you want to remove isn't high on profiles anyway? And you proposed adding bigger, slower, stuff to replace it all with. Slow are RA, and the language frontends (esp. C++ like to take 15% of total :-/) > You could say we should do more on RTL and enhance it instead, like do > vectorization where we actually could have a better idea on costs and > capabilities. But I'm a GIMPLE person and don't know enough of RTL to > enhance it ... Oh no, I think we should do more earlier, and GIMPLE is a fine IR for there. But for low-level, close-to-the-machine stuff, RTL is much better suited. And we *do* want to optimise at that level as well, and much more than just peepholes. Segher
On Thu, Apr 23, 2020 at 2:52 PM Segher Boessenkool <segher@kernel.crashing.org> wrote: > > On Thu, Apr 23, 2020 at 02:25:40PM +0200, Richard Biener wrote: > > > > But being stuck with something means no progress... I know > > > > very well it's 100 times harder to get rid of something than to > > > > add something new ontop. > > > > > > Well, what progress do you expect to make? After expand that is :-) > > > > I'd like the RTL pipeline before RA to shrink significantly, no PRE, > > no CSE, ... > > RTL CSE for example is very much required to get any good code. It > needs to CSE stuff that wasn't there before expand. Sure, but then we should fix that! > The pass currently does much more (as well as not enough), of course. > > > The important part before RA is robust and intelligent > > instruction selection - part of which is already done at RTL expansion > > time. > > LOL. > > The expand pass doesn't often make good choices, and it *shouldn't*, it > should not make many choices at all; it should just generate valid RTL, > new pseudos for everything, and let later RTL passes make faster code > from that. But valid RTL is instructions that are recognized. Which means when the target doesn't support an SImode add we may not create one. That's instruction selection ;) > > > Most of what is done in RTL is done very well. > > > > Umm, well... I beg to differ with regard to DF and passes like > > postreload-gcse. > > What is wrong with DF? It's slow and memory hungry? > Is there something particular in postreload-gcse that is bad? To me it > always is just one of those passes that doesn't do anything :-) That > can and should be cleaned up, sure :-) postreload-gcse is ad-hoc, it uses full blown gcse tools that easily blow up (compute_transp) when it doesn't really require it (Ive fixed things up a bit in dc91c65378cd0e6c0). But I wonder why, if we want to do PRE of loads, we don't simply schedule another gcse pass rather than implementing a new one. IIRC what the pass does could be done with much more local dataflow. Both postreload gcse and cse are major time-hogs on "bad" testcases :/ > > > Replacing specific things in RTL, maybe as big as whole passes, already > > > is hard to do without regressing (a *lot*). And if there is no real > > > reason to do that... > > > > The motivation is to make GCC faster, obviously. Not spending time > > doing things that should have been done before (RTL PRE vs. GIMPLE PRE, etc.). > > Using the same infrastructure (what, no loop dependency analysis on RTL?), etc. > > But everything you want to remove isn't high on profiles anyway? And > you proposed adding bigger, slower, stuff to replace it all with. > > Slow are RA, and the language frontends (esp. C++ like to take 15% of > total :-/) > > > You could say we should do more on RTL and enhance it instead, like do > > vectorization where we actually could have a better idea on costs and > > capabilities. But I'm a GIMPLE person and don't know enough of RTL to > > enhance it ... > > Oh no, I think we should do more earlier, and GIMPLE is a fine IR for > there. But for low-level, close-to-the-machine stuff, RTL is much > better suited. And we *do* want to optimise at that level as well, and > much more than just peepholes. Well, everything that requires costing (unrolling, vectorization, IV selection to name a few) _is_ close-to-the-machine. We're just saying they are not because GIMPLE is so much easier to work with here (not sure why exactly...). Richard. > > Segher
On Thu, Apr 23, 2020 at 03:07:23PM +0200, Richard Biener wrote: > On Thu, Apr 23, 2020 at 2:52 PM Segher Boessenkool > <segher@kernel.crashing.org> wrote: > > > > On Thu, Apr 23, 2020 at 02:25:40PM +0200, Richard Biener wrote: > > > > > But being stuck with something means no progress... I know > > > > > very well it's 100 times harder to get rid of something than to > > > > > add something new ontop. > > > > > > > > Well, what progress do you expect to make? After expand that is :-) > > > > > > I'd like the RTL pipeline before RA to shrink significantly, no PRE, > > > no CSE, ... > > > > RTL CSE for example is very much required to get any good code. It > > needs to CSE stuff that wasn't there before expand. > > Sure, but then we should fix that! The expand pass, but also many RTL passes, will naturally generate code that can be CSEd. You don't want passes to have to do this themselves: for example, this can be constants used to implement some standard patterns in target code, etc. > > LOL. > > > > The expand pass doesn't often make good choices, and it *shouldn't*, it > > should not make many choices at all; it should just generate valid RTL, > > new pseudos for everything, and let later RTL passes make faster code > > from that. > > But valid RTL is instructions that are recognized. Which means > when the target doesn't support an SImode add we may not create > one. That's instruction selection ;) In that sense, you can call all RTL passes instruction selection? Usually I understand more like combine, cprop, fwprop, cse, ifcvt, splitters, peepholes. That kind of thing :-) Pretty much all of the RTL passes before RA, and a few after it. > > > > Most of what is done in RTL is done very well. > > > > > > Umm, well... I beg to differ with regard to DF and passes like > > > postreload-gcse. > > > > What is wrong with DF? > > It's slow and memory hungry? Very true, of course. But can this be significantly better? > > Is there something particular in postreload-gcse that is bad? To me it > > always is just one of those passes that doesn't do anything :-) That > > can and should be cleaned up, sure :-) > > postreload-gcse is ad-hoc, it uses full blown gcse tools that easily > blow up (compute_transp) when it doesn't really require it > (Ive fixed things up a bit in dc91c65378cd0e6c0). But I wonder why, > if we want to do PRE of loads, we don't simply schedule another > gcse pass rather than implementing a new one. IIRC what the pass > does could be done with much more local dataflow. Both > postreload gcse and cse are major time-hogs on "bad" testcases :/ RTL CSE? Really? It just loves to give up early (which is a bad thing of course, but that makes it take bounded time, and *less* on bad testcases :-) ) So the "normal" gcse does not have this problem? > > Oh no, I think we should do more earlier, and GIMPLE is a fine IR for > > there. But for low-level, close-to-the-machine stuff, RTL is much > > better suited. And we *do* want to optimise at that level as well, and > > much more than just peepholes. > > Well, everything that requires costing (unrolling, vectorization, > IV selection to name a few) _is_ close-to-the-machine. We're > just saying they are not because GIMPLE is so much easier to > work with here (not sure why exactly...). Those transforms aren't close to the machine, not in the same way, because they are beneficial independent of what exact instruction sequences are generated. Both are nasty in that both have cases doing the transform actually hurts quite a bit; but *not* doing it where it *could* costs a lot as well. But other than that "little" issue ;-) Segher
On Thu, 2020-04-23 at 07:07 -0500, Segher Boessenkool wrote: > > > I think at least one step would be uncontroversical(?), namely moving > > the RTL expansion "magic" > > up to a GIMPLE pass. Where the "magic" would be to turn > > GIMPLE stmts not directly expandable via an existing optab into > > GIMPLE that can be trivially expanded. That includes eventually > > combining multiple stmts into more powerful instructions and > > doing the magic we have in, like, expand_binop (widening, etc.). > > Where there's not a 1:1 mapping of a GIMPLE stmt to an optab > > GIMPLE gets direct-internal-fn calls. > > Then RTL expansion would be mostly invoking gen_insn (optab-code). > > Most of expand is *other stuff*. Expand does a *lot* of things that are > actually changing the code. And much of that is not done anywhere else > either yet, so this cannot be fixed by simply deleting the offending code. A lot of what is done by the expansion pass is historical and dates back to when we never optimized more than a statement at a time for trees and the real heavy lifting was all done in RTL. THe introduction of tree-ssa meant that all the expansion code which wanted to see a "nice" complex statement and generate good RTL code was useless and as a result we saw significant regressions in the end code quality. That in turn brought in TER who's sole purpose was to reconstruct those more complex trees for the purposes of improving initial expansion. I think match.pd has the potential to make TER go away as match.pd is a generic combining framework. WRT extending SSA deeper, I'm all for it and it's always been something I wanted to see happen. I've always believed we could do RTL SSA to the register allocation phase and that doing so would be a positive. If we start at the front of the RTL pipeline and work towards the back we bump into CSE quickly, which would be good. Our CSE is awful across multiple axis. I suspect most RTL passes would be reimplementations rather than bolting SSA on the side and I suspect those reimplementations would be simpler than the existing stuff -- I'm a strong believer there's a lot of dead code in the RTL passes as well. jeff
On Thu, 2020-04-23 at 15:07 +0200, Richard Biener wrote: > On Thu, Apr 23, 2020 at 2:52 PM Segher Boessenkool > <segher@kernel.crashing.org> wrote: > > On Thu, Apr 23, 2020 at 02:25:40PM +0200, Richard Biener wrote: > > > > > But being stuck with something means no progress... I know > > > > > very well it's 100 times harder to get rid of something than to > > > > > add something new ontop. > > > > > > > > Well, what progress do you expect to make? After expand that is :-) > > > > > > I'd like the RTL pipeline before RA to shrink significantly, no PRE, > > > no CSE, ... > > > > RTL CSE for example is very much required to get any good code. It > > needs to CSE stuff that wasn't there before expand. > > Sure, but then we should fix that! Exactly. It's purpose largely becomes dealing with the redundancies exposed by expansion. ie, address arithmetic and the like. A lot of its path following code should be throttled back. > > But valid RTL is instructions that are recognized. Which means > when the target doesn't support an SImode add we may not create > one. That's instruction selection ;) That's always a point of tension. But I think that in general continuing to have targets claim to support things they do not (such as double-wordsize arithmetic, logicals, moves, etc) is a mistake. It made sense at one time, but I think we've got better mechansisms in place to deal with this stuff now. > > > Is there something particular in postreload-gcse that is bad? To me it > > always is just one of those passes that doesn't do anything :-) That > > can and should be cleaned up, sure :-) > > postreload-gcse is ad-hoc, it uses full blown gcse tools that easily > blow up (compute_transp) when it doesn't really require it > (Ive fixed things up a bit in dc91c65378cd0e6c0). But I wonder why, > if we want to do PRE of loads, we don't simply schedule another > gcse pass rather than implementing a new one. IIRC what the pass > does could be done with much more local dataflow. Both > postreload gcse and cse are major time-hogs on "bad" testcases :/ I think the biggest reason is the existing gcse bits inherently assume they can create new registers. It's deeply baked into gcse.c. There's ways around that, but it's likely a lot of work. > > > Oh no, I think we should do more earlier, and GIMPLE is a fine IR for > > there. But for low-level, close-to-the-machine stuff, RTL is much > > better suited. And we *do* want to optimise at that level as well, and > > much more than just peepholes. > > Well, everything that requires costing (unrolling, vectorization, > IV selection to name a few) _is_ close-to-the-machine. We're > just saying they are not because GIMPLE is so much easier to > work with here (not sure why exactly...). The primary motivation behind discouraging target costing and the like from gimple was to make it easier to implement and predict the behavior of the gimple optimizers. We've relaxed that somewhat, particularly for vectorization, but I think the principle is still solid. But I think there is a place for adding target dependencies -- and that's at the end of the current gimple pipeline. Jeff
> > > What is wrong with DF? > > > > It's slow and memory hungry? > > Very true, of course. But can this be significantly better? That's a good question worth investigating in my opinion, because DF didn't quite achieve its initial goal of replacing all the custom liveness analysis passes only because of its performance; the rest is quite neat. But IIRC long-time GCC hackers were involved in it, so it was probably tuned already.
Jeff Law via Gcc-patches <gcc-patches@gcc.gnu.org> writes: > On Thu, 2020-04-23 at 15:07 +0200, Richard Biener wrote: >> On Thu, Apr 23, 2020 at 2:52 PM Segher Boessenkool >> <segher@kernel.crashing.org> wrote: >> > On Thu, Apr 23, 2020 at 02:25:40PM +0200, Richard Biener wrote: >> > > > > But being stuck with something means no progress... I know >> > > > > very well it's 100 times harder to get rid of something than to >> > > > > add something new ontop. >> > > > >> > > > Well, what progress do you expect to make? After expand that is :-) >> > > >> > > I'd like the RTL pipeline before RA to shrink significantly, no PRE, >> > > no CSE, ... >> > >> > RTL CSE for example is very much required to get any good code. It >> > needs to CSE stuff that wasn't there before expand. >> >> Sure, but then we should fix that! > Exactly. It's purpose largely becomes dealing with the redundancies exposed by > expansion. ie, address arithmetic and the like. A lot of its path following > code should be throttled back. Agreed. But things like address legitimisation and ensuring immediate operands are in-range could be done in gimple too, and probably be optimised more effectively and efficiently in SSA form than in RTL. The ultimate question then wouldn't just be "does the target support this optab?" but also "are these operands already legitimate for the optab"? I also wonder how difficult it would be to get recog to recognise gimple :-) Thanks, Richard
On Thu, 2020-04-23 at 16:32 +0100, Richard Sandiford wrote: > Jeff Law via Gcc-patches <gcc-patches@gcc.gnu.org> writes: > > On Thu, 2020-04-23 at 15:07 +0200, Richard Biener wrote: > > > On Thu, Apr 23, 2020 at 2:52 PM Segher Boessenkool > > > <segher@kernel.crashing.org> wrote: > > > > On Thu, Apr 23, 2020 at 02:25:40PM +0200, Richard Biener wrote: > > > > > > > But being stuck with something means no progress... I know > > > > > > > very well it's 100 times harder to get rid of something than to > > > > > > > add something new ontop. > > > > > > > > > > > > Well, what progress do you expect to make? After expand that is :-) > > > > > > > > > > I'd like the RTL pipeline before RA to shrink significantly, no PRE, > > > > > no CSE, ... > > > > > > > > RTL CSE for example is very much required to get any good code. It > > > > needs to CSE stuff that wasn't there before expand. > > > > > > Sure, but then we should fix that! > > Exactly. It's purpose largely becomes dealing with the redundancies exposed > > by > > expansion. ie, address arithmetic and the like. A lot of its path > > following > > code should be throttled back. > > Agreed. But things like address legitimisation and ensuring immediate > operands are in-range could be done in gimple too, and probably be > optimised more effectively and efficiently in SSA form than in RTL. Yea, but once exposed we'd probably need some kind of way to prevent something like an out-of-range immediate from being propagated back into the statement. Certainly do-able though. Maybe significantly easier than resurrecting RTL-SSA. > The ultimate question then wouldn't just be "does the target support > this optab?" but also "are these operands already legitimate for the > optab"? > > I also wonder how difficult it would be to get recog to recognise > gimple :-) :-) You know better than I... The more structured nature of gimple probably helps that effort. jeff >
Hi! On Thu, Apr 23, 2020 at 08:29:34AM -0600, Jeff Law wrote: > On Thu, 2020-04-23 at 07:07 -0500, Segher Boessenkool wrote: > > Most of expand is *other stuff*. Expand does a *lot* of things that are > > actually changing the code. And much of that is not done anywhere else > > either yet, so this cannot be fixed by simply deleting the offending code. > A lot of what is done by the expansion pass is historical and dates back to when > we never optimized more than a statement at a time for trees and the real heavy > lifting was all done in RTL. Yes. > THe introduction of tree-ssa meant that all the expansion code which wanted to > see a "nice" complex statement and generate good RTL code was useless and as a > result we saw significant regressions in the end code quality. > > That in turn brought in TER who's sole purpose was to reconstruct those more > complex trees for the purposes of improving initial expansion. I think match.pd > has the potential to make TER go away as match.pd is a generic combining > framework. Ooh that would be great news as well! For moving stuff out of expand, it needs to *move*, not just be deleted. > WRT extending SSA deeper, I'm all for it and it's always been something I wanted > to see happen. I've always believed we could do RTL SSA to the register > allocation phase and that doing so would be a positive. Well, since we have hard regs as well, it cannot really be SSA? And SSA doesn't fit RTL all that well anyway? But the "webs always" idea is the same of course. > If we start at the front > of the RTL pipeline and work towards the back we bump into CSE quickly, which > would be good. Our CSE is awful across multiple axis. RTL CSE does quite many things that aren't really CSE... Have to figure out for each whether it still is useful, and what to do with it then. > I suspect most RTL passes > would be reimplementations rather than bolting SSA on the side and I suspect > those reimplementations would be simpler than the existing stuff -- I'm a strong > believer there's a lot of dead code in the RTL passes as well. Sure, but there also is a lot of stuff that perhaps is historical, and perhaps is badly architected, but still does beneficial things. If a reimplementation hugely improves things it's not bad in the end to lose all such little things, but otherwise? Segher
On Thu, Apr 23, 2020 at 08:40:50AM -0600, Jeff Law wrote: > On Thu, 2020-04-23 at 15:07 +0200, Richard Biener wrote: > > On Thu, Apr 23, 2020 at 2:52 PM Segher Boessenkool > > <segher@kernel.crashing.org> wrote: > > > On Thu, Apr 23, 2020 at 02:25:40PM +0200, Richard Biener wrote: > > > > > > But being stuck with something means no progress... I know > > > > > > very well it's 100 times harder to get rid of something than to > > > > > > add something new ontop. > > > > > > > > > > Well, what progress do you expect to make? After expand that is :-) > > > > > > > > I'd like the RTL pipeline before RA to shrink significantly, no PRE, > > > > no CSE, ... > > > > > > RTL CSE for example is very much required to get any good code. It > > > needs to CSE stuff that wasn't there before expand. > > > > Sure, but then we should fix that! > Exactly. It's purpose largely becomes dealing with the redundancies exposed by > expansion. ie, address arithmetic and the like. A lot of its path following > code should be throttled back. Hrm, I never thought about it like this. CSE was always there, I never stopped to question if we needed it :-) Well, that's cse1 then. What about cse2? > > But valid RTL is instructions that are recognized. Which means > > when the target doesn't support an SImode add we may not create > > one. That's instruction selection ;) > That's always a point of tension. But I think that in general continuing to have > targets claim to support things they do not (such as double-wordsize arithmetic, > logicals, moves, etc) is a mistake. It made sense at one time, but I think we've > got better mechansisms in place to deal with this stuff now. Different targets have *very* different insns for add, mul, div, shifts; everything really. Describing this at expand time with two-machine-word operations works pretty bloody well, for most or all targets -- this is just part of the power of define_expand (but an important part). And define_expand is very very useful, it's the swiss army escape hatch, it lets you do everything optabs have a too small mind for. > > > Oh no, I think we should do more earlier, and GIMPLE is a fine IR for > > > there. But for low-level, close-to-the-machine stuff, RTL is much > > > better suited. And we *do* want to optimise at that level as well, and > > > much more than just peepholes. > > > > Well, everything that requires costing (unrolling, vectorization, > > IV selection to name a few) _is_ close-to-the-machine. We're > > just saying they are not because GIMPLE is so much easier to > > work with here (not sure why exactly...). > The primary motivation behind discouraging target costing and the like from > gimple was to make it easier to implement and predict the behavior of the gimple > optimizers. We've relaxed that somewhat, particularly for vectorization, but I > think the principle is still solid. There are two kinds of costing. The first only says which of A or B is better; that can perhaps be done on GIMPLE already, using target-specific costs. The other gives a number to everything, which is much harder to get anywhere close to usably correct (what does the number even *mean*? For performance, latency of the whole sequence is the most important number, but that is not easy to work with, or what we use for say insn_cost). > > But I think there is a place for adding target dependencies -- and that's at the > end of the current gimple pipeline. There are a *few* things in GIMPLE that use target costs (ivopts...) But yeah, most things should not. Segher
On Thu, Apr 23, 2020 at 04:32:51PM +0100, Richard Sandiford wrote: > I also wonder how difficult it would be to get recog to recognise > gimple :-) Since recog recognises single (rtl) insns: hard to impossible? Segher
On Thu, 23 Apr 2020, Richard Biener via Gcc-patches wrote: > I think at least one step would be uncontroversical(?), namely moving > the RTL expansion "magic" > up to a GIMPLE pass. Where the "magic" would be to turn > GIMPLE stmts not directly expandable via an existing optab into > GIMPLE that can be trivially expanded. That includes eventually I suspect some such pieces should actually happen before some GIMPLE optimizations are done, rather than as the last pass before expansion - though then you need to make sure subsequent GIMPLE passes don't reintroduce the operations you've lowered. E.g. the special handling of arithmetic on bit-field types in expand is a good candidate for being a lowering step (lowering to operations on integer widths actually supported by the machine / that correspond to known machine modes) somewhere in GIMPLE processing (and you don't want to subsequently narrow back to operations on those widths that don't exist in hardware). Note in that regard that there is a proposal for C2x to support integer types with a specified number of bits more generally than just as bit-fields. > That said, I actually agree with that. It's just that I hope we can > make some of the knowledge just represented on the RTL side > available on the GIMPLE side. The more complicated parts, > like calling conventions, that is. On the other hand, there is some knowledge like that which should move out of the front ends, into a GIMPLE lowering step (front ends should not be calling targetm.calls.promote_prototypes and changing their internal representation accordingly, that information should be handled later in the compiler, probably on GIMPLE).
On Thu, 2020-04-23 at 15:16 -0500, Segher Boessenkool wrote: > On Thu, Apr 23, 2020 at 08:40:50AM -0600, Jeff Law wrote: > > On Thu, 2020-04-23 at 15:07 +0200, Richard Biener wrote: > > > On Thu, Apr 23, 2020 at 2:52 PM Segher Boessenkool > > > <segher@kernel.crashing.org> wrote: > > > > On Thu, Apr 23, 2020 at 02:25:40PM +0200, Richard Biener wrote: > > > > > > > But being stuck with something means no progress... I know > > > > > > > very well it's 100 times harder to get rid of something than to > > > > > > > add something new ontop. > > > > > > > > > > > > Well, what progress do you expect to make? After expand that is :-) > > > > > > > > > > I'd like the RTL pipeline before RA to shrink significantly, no PRE, > > > > > no CSE, ... > > > > > > > > RTL CSE for example is very much required to get any good code. It > > > > needs to CSE stuff that wasn't there before expand. > > > > > > Sure, but then we should fix that! > > Exactly. It's purpose largely becomes dealing with the redundancies exposed > > by > > expansion. ie, address arithmetic and the like. A lot of its path > > following > > code should be throttled back. > > Hrm, I never thought about it like this. CSE was always there, I never > stopped to question if we needed it :-) :-) It's a dog-slow pass that isn't nearly as important as it once was. > > Well, that's cse1 then. What about cse2? cse2's original purpose was to clean up after the loop optimizers and it should be doing even less work than cse1. The steps in my mind are to see what's left in the first jump pass, then the cse path following code, then the core of cse itself. lower-subreg is a wart that could likely go away if we stop lying about the target's capabilities. > > > > But valid RTL is instructions that are recognized. Which means > > > when the target doesn't support an SImode add we may not create > > > one. That's instruction selection ;) > > That's always a point of tension. But I think that in general continuing to > > have > > targets claim to support things they do not (such as double-wordsize > > arithmetic, > > logicals, moves, etc) is a mistake. It made sense at one time, but I think > > we've > > got better mechansisms in place to deal with this stuff now. > > Different targets have *very* different insns for add, mul, div, shifts; > everything really. Describing this at expand time with two-machine-word > operations works pretty bloody well, for most or all targets -- this is > just part of the power of define_expand (but an important part). And > define_expand is very very useful, it's the swiss army escape hatch, it > lets you do everything optabs have a too small mind for. Absolutely true, but most of the double-word-mode-crap was in there because we couldn't really describe things like the carry bit. That's long since been fixed and I bet if we handled just that the vast majority pretending the target has double-word support becomes unnecessary (and the need for early lower-subreg is then reduced significantly as well). And if we look at the number of tricks that are used to do things like optimize double-word moves in the target files, it's just insane. If we stop lying and take advantage of improvements over the last 20 years we end up killing lots of crappy target code. > There are two kinds of costing. The first only says which of A or B is > better; that can perhaps be done on GIMPLE already, using > target-specific costs. The other gives a number to everything, which is > much harder to get anywhere close to usably correct (what does the > number even *mean*? For performance, latency of the whole sequence is > the most important number, but that is not easy to work with, or what we > use for say insn_cost). True. I'm referring mostly to traditional costing. ie, given form A & B, which is preferred for the target. Full sequence latency is distinct issue, though we let it bleed into the former (for example by rewriting sequences with similar costs into a sequence with fewer dependencies). > > > But I think there is a place for adding target dependencies -- and that's at > > the > > end of the current gimple pipeline. > > There are a *few* things in GIMPLE that use target costs (ivopts...) > But yeah, most things should not. Precisely. Avoiding target dependencies is the aspiration goal, but we have to also be sensible and realize that some things really do require a degree of target knowledge. > Jeff
On Thu, Apr 23, 2020 at 4:54 PM Eric Botcazou <botcazou@adacore.com> wrote: > > > > > What is wrong with DF? > > > > > > It's slow and memory hungry? > > > > Very true, of course. But can this be significantly better? > > That's a good question worth investigating in my opinion, because DF didn't > quite achieve its initial goal of replacing all the custom liveness analysis > passes only because of its performance; the rest is quite neat. But IIRC > long-time GCC hackers were involved in it, so it was probably tuned already. Having looked at DF a few times one of its problems seems to be too much indirection in its data structures which is probably a fallout of making it on-the-side rather than a part of the core RTL data structures (like SSA and immediate uses are on GIMPLE). But then where it blows up is usually during computing of the dataflow problems to create those data structures and there's little room for improvement because it deals with a more complex problem than we have on the GIMPLE side. But passes are also mindlessly throwing 'compute-me-everyting' flags at it without really thinking what they need, so ... In the end I question the need of true "global" optimization passes on the RTL IL. Almost everything should be local to interesting-CFG-shape regions. And yes, it probably requires a further lowering on GIMPLE to expose things like multi-reg operations, lowered bit-precision arithmetic, lowered address generation and eventually constant legitimization (introducing a constant pool already on GIMPLE, ideally maintaining it in the symbol table). Richard. > -- > Eric Botcazou
On Fri, Apr 24, 2020 at 08:15:36AM +0200, Richard Biener wrote: > On Thu, Apr 23, 2020 at 4:54 PM Eric Botcazou <botcazou@adacore.com> wrote: > > > > > What is wrong with DF? > > > > > > > > It's slow and memory hungry? > > > > > > Very true, of course. But can this be significantly better? > > > > That's a good question worth investigating in my opinion, because DF didn't > > quite achieve its initial goal of replacing all the custom liveness analysis > > passes only because of its performance; the rest is quite neat. But IIRC > > long-time GCC hackers were involved in it, so it was probably tuned already. > > Having looked at DF a few times one of its problems seems to be too much > indirection in its data structures which is probably a fallout of making it > on-the-side rather than a part of the core RTL data structures (like SSA > and immediate uses are on GIMPLE). But then where it blows up is usually > during computing of the dataflow problems to create those data structures > and there's little room for improvement because it deals with a more complex > problem than we have on the GIMPLE side. But passes are also mindlessly > throwing 'compute-me-everyting' flags at it without really thinking > what they need, so ... And also, throwing away all existing info way too often. Part of that is causes by the interfaces to DF we have. That shouldn't be hard to fix. But mainly it is because too many things do not keep the info up-to-date after changes. > In the end I question the need of true "global" optimization passes on the > RTL IL. Almost everything should be local to interesting-CFG-shape regions. Depending on the target, almost *everything* depends on how far some code is from other code, and you can get much better code if you know you *are* close. But many passes already are per-BB of course. It's just the hard ones (RA...) that are not local at all. So you'll have to attack IRA if you want DF to be more local only, I think? > And yes, it probably requires a further lowering on GIMPLE to expose things > like multi-reg operations, As I said before, I don't think this is a good idea. > lowered bit-precision arithmetic, lowered But that is :-) > address generation and eventually constant legitimization (introducing a > constant pool already on GIMPLE, ideally maintaining it in the symbol > table). That might work on some targets, sure :-/ Segher
diff --git a/gcc/common.opt b/gcc/common.opt index fa9da505fc2..e298ce09c82 100644 --- a/gcc/common.opt +++ b/gcc/common.opt @@ -2526,6 +2526,10 @@ fvariable-expansion-in-unroller Common Report Var(flag_variable_expansion_in_unroller) Optimization Apply variable expansion when loops are unrolled. +fassin-new-operand-in-unroller +Common Report Var(flag_assign_new_operand_in_unroller) Init(1) Optimization +Creating new pseudo for duplicated instruction when loops are unrolled. + fstack-check= Common Report RejectNegative Joined Optimization -fstack-check=[no|generic|specific] Insert stack checking code into the program. diff --git a/gcc/loop-unroll.c b/gcc/loop-unroll.c index 693c7768868..54a48bd9fd7 100644 --- a/gcc/loop-unroll.c +++ b/gcc/loop-unroll.c @@ -94,6 +94,17 @@ struct var_to_expand var_expansions[REUSE_EXPANSION - 1]. */ }; +struct def_to_split +{ + rtx_insn *insn; /* The insn in that the assigment occurs. */ + rtx reg; /* The def/set pseudo. */ + vec<rtx> defs; /* The copies of the defs which is split to. */ + struct def_to_split *next; /* Next entry in walking order. */ + int count; /* Number of DEFS. */ + int use_before_def; /* The pseudo is used before re-defined. */ + rtx_insn *last_insn; +}; + /* Hashtable helper for iv_to_split. */ struct iv_split_hasher : free_ptr_hash <iv_to_split> @@ -143,6 +154,30 @@ var_expand_hasher::equal (const var_to_expand *i1, const var_to_expand *i2) return i1->insn == i2->insn; } +/* Hashtable helper for iv_to_split. */ + +struct def_split_hasher : free_ptr_hash<def_to_split> +{ + static inline hashval_t hash (const def_to_split *); + static inline bool equal (const def_to_split *, const def_to_split *); +}; + +/* Return a hash for VES. */ + +inline hashval_t +def_split_hasher::hash (const def_to_split *i) +{ + return (hashval_t) INSN_UID (i->insn); +} + +/* Return true if I1 and I2 refer to the same instruction. */ + +inline bool +def_split_hasher::equal (const def_to_split *i1, const def_to_split *i2) +{ + return i1->insn == i2->insn; +} + /* Information about optimization applied in the unrolled loop. */ @@ -156,10 +191,17 @@ struct opt_info insns with accumulators to expand. */ struct var_to_expand *var_to_expand_head; /* The first var to expand. */ struct var_to_expand **var_to_expand_tail; /* Pointer to the tail of the list. */ + + hash_table<def_split_hasher> *insns_def_split; /* A hashtable of insns to + split def. */ + struct def_to_split *def_to_split_head; /* The def to split. */ + struct def_to_split **def_to_split_tail; /* Pointer to the tail of the list. */ + unsigned first_new_block; /* The first basic block that was duplicated. */ basic_block loop_exit; /* The loop exit basic block. */ basic_block loop_preheader; /* The loop preheader basic block. */ + }; static void decide_unroll_stupid (class loop *, int); @@ -183,6 +225,7 @@ static void combine_var_copies_in_loop_exit (struct var_to_expand *, basic_block); static rtx get_expansion (struct var_to_expand *); +static struct def_to_split *analyze_insn_to_split_def (class loop *, rtx_insn *); /* Emit a message summarizing the unroll that will be performed for LOOP, along with the loop's location LOCUS, if appropriate given the dump or -fopt-info settings. */ @@ -898,7 +941,8 @@ unroll_loop_runtime_iterations (class loop *loop) struct opt_info *opt_info = NULL; bool ok; - if (flag_split_ivs_in_unroller + if (flag_assign_new_operand_in_unroller + || flag_split_ivs_in_unroller || flag_variable_expansion_in_unroller) opt_info = analyze_insns_in_loop (loop); @@ -1564,6 +1608,71 @@ analyze_iv_to_split_insn (rtx_insn *insn) return ivts; } +static struct def_to_split * +analyze_insn_to_split_def (class loop *loop, rtx_insn *insn) +{ + rtx set, dest, src; + rtx_insn *ins; + struct def_to_split *dts; + + /* TODO: to be relaxed */ + if (loop->num_nodes != 2) + return NULL; + + FOR_BB_INSNS (loop->latch, ins) + if (INSN_P (ins)) + return NULL; + + set = single_set (insn); + if (!set) + return NULL; + + dest = SET_DEST (set); + src = SET_SRC (set); + + if (!REG_P (dest)) // && !(GET_CODE (dest) == SUBREG && REG_P (SUBREG_REG + // (dest)))) + return NULL; + + if (REGNO (dest) < FIRST_PSEUDO_REGISTER) + return NULL; + + int use_before_set = 0; + basic_block bb = BLOCK_FOR_INSN (insn); + rtx set1; + for (ins = BB_HEAD (bb); ins != insn; ins = NEXT_INSN (ins)) + if (rtx_referenced_p (dest, ins)) + { + set1 = single_set (ins); + if (set1 && rtx_equal_p (SET_DEST (set1), dest)) + return NULL; + use_before_set = 1; + } + + if (!use_before_set && rtx_referenced_p (dest, src)) + use_before_set = 1; + + if (dump_file) + { + fprintf (dump_file, + "\n;; Split assignment %s: ", + use_before_set ? "use leads def" : ""); + print_rtl (dump_file, dest); + fprintf (dump_file, "\n"); + } + + /* Record the accumulator to expand. */ + dts = XNEW (struct def_to_split); + dts->insn = insn; + dts->reg = copy_rtx (dest); + dts->defs.create (1); + dts->next = NULL; + dts->count = 0; + dts->use_before_def = use_before_set; + dts->last_insn = NULL; + return dts; +} + /* Determines which of insns in LOOP can be optimized. Return a OPT_INFO struct with the relevant hash tables filled with all insns to be optimized. The FIRST_NEW_BLOCK field @@ -1578,6 +1687,7 @@ analyze_insns_in_loop (class loop *loop) rtx_insn *insn; struct iv_to_split *ivts = NULL; struct var_to_expand *ves = NULL; + struct def_to_split *dts= NULL; iv_to_split **slot1; var_to_expand **slot2; vec<edge> edges = get_loop_exit_edges (loop); @@ -1618,6 +1728,15 @@ analyze_insns_in_loop (class loop *loop) opt_info->var_to_expand_tail = &opt_info->var_to_expand_head; } + if (flag_assign_new_operand_in_unroller && can_apply + && LPT_UNROLL_RUNTIME == loop->lpt_decision.decision) + { + opt_info->insns_def_split + = new hash_table<def_split_hasher> (5 * loop->num_nodes); + opt_info->def_to_split_head = NULL; + opt_info->def_to_split_tail = &opt_info->def_to_split_head; + } + for (i = 0; i < loop->num_nodes; i++) { bb = body[i]; @@ -1653,6 +1772,20 @@ analyze_insns_in_loop (class loop *loop) *opt_info->var_to_expand_tail = ves; opt_info->var_to_expand_tail = &ves->next; } + + dts = NULL; + if (opt_info->insns_def_split && !ves && !ivts) + dts = analyze_insn_to_split_def (loop, insn); + + if (dts) + { + struct def_to_split **slot + = opt_info->insns_def_split->find_slot (dts, INSERT); + gcc_assert (*slot == NULL); + *slot = dts; + *opt_info->def_to_split_tail = dts; + opt_info->def_to_split_tail = &dts->next; + } } } @@ -1840,6 +1973,145 @@ expand_var_during_unrolling (struct var_to_expand *ve, rtx_insn *insn) } } +/* Replace INSN with a new duplicate insn. */ +static rtx_insn * +replace_with_new_insn (rtx_insn *insn) +{ + rtx_insn *seq; + start_sequence (); + seq = duplicate_insn_chain (insn, insn); + gcc_assert (seq == get_insns ()); + end_sequence (); + emit_insn_before (seq, insn); + delete_insn (insn); + return seq; +} + +/* For original body, those insn(s) are not able to be update directly. Then, + * duplicate new one for update. */ + +static void +prepare_to_update_orignal_body (struct def_to_split *head) +{ + struct def_to_split *dts; + rtx_insn *insn; + vec<rtx_insn *> new_insn_list; + + /* For all those defines which is used before define, replace with new one for + * update. */ + new_insn_list.create (1); + for (dts = head; dts; dts = dts->next) + if (dts->use_before_def) + { + dts->insn = replace_with_new_insn (dts->insn); + new_insn_list.safe_push (dts->insn); + } + + /* For all those insn which is using those defines, replace with new one for + * update. */ + for (dts = head; dts; dts = dts->next) + if (dts->use_before_def) + for (insn = NEXT_INSN (dts->insn); + insn != NEXT_INSN (BB_END (BLOCK_FOR_INSN (dts->insn))); + insn = NEXT_INSN (insn)) + if (rtx_referenced_p (dts->reg, insn) && !new_insn_list.contains (insn)) + { + insn = replace_with_new_insn (insn); + new_insn_list.safe_push (insn); + } + + new_insn_list.release (); +} + +/* Replace reg which occur between START and END. */ +static void +replace_regs (rtx_insn *start, rtx_insn *end, rtx old_reg, rtx new_reg) +{ + rtx_insn *insn; + for (insn = start; insn != NEXT_INSN (end); insn = NEXT_INSN (insn)) + if (rtx_referenced_p (old_reg, insn)) + validate_replace_rtx_group (old_reg, new_reg, insn); + + apply_change_group (); +} + +/* Update the the pseudo in orignal blocks, and assign back them at the last + * block. */ +static void +do_additional_update (struct def_to_split *head) +{ + rtx_insn *insn; + rtx new_reg; + basic_block bb; + struct def_to_split *dts; + + for (dts = head; dts; dts = dts->next) + { + if (dts->use_before_def) + { + gcc_assert (dts->last_insn); + gcc_assert (dts->count > 1); + + /* Update the insns in original body. */ + bb = BLOCK_FOR_INSN (dts->insn); + new_reg = dts->defs[0]; + SET_DEST (single_set (dts->insn)) = new_reg; + + replace_regs (NEXT_INSN (dts->insn), BB_END (bb), dts->reg, new_reg); + } + + /* In last BB, assign back to orignal reg. */ + start_sequence (); + emit_move_insn (dts->reg, SET_DEST (single_set (dts->last_insn))); + insn = get_insns (); + end_sequence (); + emit_insn_after (insn, dts->last_insn); + } +} + +/* Replace the DTS->REG with new split reg for INSN's block. If the DTS->REG is + * referenced before set, Those usage which leading INSN is replaced with + * previous define/set. */ +static void +split_def_during_unrolling (struct def_to_split *dts, rtx_insn *insn) +{ + rtx new_reg, new_reg0, set; + basic_block bb = BLOCK_FOR_INSN (insn); + + if (dts->use_before_def) + { + if (dts->count == 0) + { + dts->defs.safe_push (gen_reg_rtx (GET_MODE (dts->reg))); + dts->count++; + } + + /* Replace the usage of the define reg with new reg0 for those insns which + * are leading the define, include the usage of the define. */ + new_reg0 = dts->defs[dts->count - 1]; + replace_regs (BB_HEAD (bb), insn, dts->reg, new_reg0); + + /* Replace the define reg with new reg, for the define and those usages + * which following the define. */ + set = single_set (insn); + gcc_assert (set); + new_reg = gen_reg_rtx (GET_MODE (dts->reg)); + SET_DEST (single_set (insn)) = new_reg; + + replace_regs (NEXT_INSN (insn), BB_END (bb), dts->reg, new_reg); + } + else + { + new_reg = gen_reg_rtx (GET_MODE (dts->reg)); + replace_regs (insn, BB_END (bb), dts->reg, new_reg); + } + + /* Record the insn in last block. */ + dts->last_insn = insn; + dts->defs.safe_push (new_reg); + dts->count++; +} + /* Initialize the variable expansions in loop preheader. PLACE is the loop-preheader basic block where the initialization of the expansions should take place. The expansions are initialized with @@ -2011,6 +2283,7 @@ apply_opt_in_copies (struct opt_info *opt_info, rtx_insn *insn, *orig_insn, *next; struct iv_to_split ivts_templ, *ivts; struct var_to_expand ve_templ, *ves; + struct def_to_split dts_templ, *dts; /* Sanity check -- we need to put initialization in the original loop body. */ @@ -2051,8 +2324,9 @@ apply_opt_in_copies (struct opt_info *opt_info, ivts_templ.insn = orig_insn; ve_templ.insn = orig_insn; + dts_templ.insn = orig_insn; - /* Apply splitting iv optimization. */ + /* Apply splitting iv optimization. */ if (opt_info->insns_to_split) { maybe_strip_eq_note_for_split_iv (opt_info, insn); @@ -2081,6 +2355,19 @@ apply_opt_in_copies (struct opt_info *opt_info, expand_var_during_unrolling (ves, insn); } } + + if (unrolling && opt_info->insns_def_split) + { + dts = (struct def_to_split *) opt_info->insns_def_split->find ( + &dts_templ); + if (dts) + { + gcc_assert (GET_CODE (PATTERN (insn)) + == GET_CODE (PATTERN (orig_insn))); + split_def_during_unrolling (dts, insn); + } + } + orig_insn = NEXT_INSN (orig_insn); } } @@ -2135,9 +2422,14 @@ apply_opt_in_copies (struct opt_info *opt_info, continue; } } - } } + + if (unrolling && opt_info->insns_def_split) + { + prepare_to_update_orignal_body (opt_info->def_to_split_head); + do_additional_update (opt_info->def_to_split_head); + } } /* Release OPT_INFO. */ @@ -2156,5 +2448,16 @@ free_opt_info (struct opt_info *opt_info) delete opt_info->insns_with_var_to_expand; opt_info->insns_with_var_to_expand = NULL; } + + if (opt_info->insns_def_split) + { + struct def_to_split *dts; + + for (dts = opt_info->def_to_split_head; dts; dts = dts->next) + dts->defs.release (); + delete opt_info->insns_def_split; + opt_info->insns_def_split = NULL; + } + free (opt_info); }