[RFC] split pseudos during loop unrolling in RTL unroller

Message ID	20200415015600.124587-1-guojiufu@linux.ibm.com
State	New
Headers	show Return-Path: <gcc-patches-bounces@gcc.gnu.org> DMARC-Filter: OpenDMARC Filter v1.3.2 sourceware.org 7F789385DC14 Gateway: Authorized Use Only! Violators will be prosecuted for <gcc-patches@gcc.gnu.org> from <guojiufu@linux.ibm.com>; Wed, 15 Apr 2020 02:55:48 +0100 Gateway: Authorized Use Only! Violators will be prosecuted; (version=TLSv1/SSLv3 cipher=AES256-GCM-SHA384 bits=256/256) Wed, 15 Apr 2020 02:55:45 +0100 To: gcc-patches@gcc.gnu.org Subject: [RFC] split pseudos during loop unrolling in RTL unroller Date: Wed, 15 Apr 2020 09:56:00 +0800 Message-Id: <20200415015600.124587-1-guojiufu@linux.ibm.com> Precedence: list From: Jiufu Guo via Gcc-patches <gcc-patches@gcc.gnu.org> Reply-To: Jiufu Guo <guojiufu@linux.ibm.com> Cc: wschmidt@linux.ibm.com, segher@kernel.crashing.org Errors-To: gcc-patches-bounces@gcc.gnu.org Sender: "Gcc-patches" <gcc-patches-bounces@gcc.gnu.org>
Series	[RFC] split pseudos during loop unrolling in RTL unroller \| expand [RFC] split pseudos during loop unrolling in RTL unroller

Jiufu Guo April 15, 2020, 1:56 a.m. UTC

Hi,

As you may know, we have loop unroll pass in RTL which was introduced a few
years ago, and works for a long time.  Currently, this unroller is using the
pseudos in the original body, and then the pseudos are written multiple times.

It would be a good idea to create new pseudos for those duplicated instructions
during loop unrolling.  This would relax reg dependence, and then provide more
freedom/opportunity for other optimizations, like scheduling, RA... 

As below example:

Original loop:
```
   26: L26:
   17: NOTE_INSN_BASIC_BLOCK 4
   18: r130:SF=[r125:DI+r122:DI]
   19: r131:SF=[r126:DI+r122:DI]
   20: r129:SF=r130:SF*r131:SF
   21: r132:DF=float_extend(r129:SF)
   22: r121:DF=r121:DF+r132:DF
   23: r133:SI=r123:DI#0-0x1
   24: r123:DI=zero_extend(r133:SI)
   25: r122:DI=r122:DI+0x4
   27: r134:CC=cmp(r123:DI,0)
   28: pc={(r134:CC!=0)?L49:pc}
      ; pc falls through to BB 5
   49: L49:
   48: NOTE_INSN_BASIC_BLOCK 8
      ; pc falls through to BB 4
```
It was unrolled to:
```
   26: L26:
   17: NOTE_INSN_BASIC_BLOCK 4
   18: r130:SF=[r125:DI+r122:DI]
   19: r131:SF=[r126:DI+r122:DI]
   20: r129:SF=r130:SF*r131:SF
   21: r132:DF=float_extend(r129:SF)
   93: r145:DF=r121:DF+r132:DF
   23: r133:SI=r123:DI#0-0x1
   95: r148:DI=zero_extend(r133:SI)
   91: r140:DI=r122:DI+0x4
   25: r122:DI=r140:DI
   97: r134:CC=cmp(r148:DI,0)

   87: NOTE_INSN_BASIC_BLOCK 16
   77: r130:SF=[r125:DI+r122:DI]
   78: r131:SF=[r126:DI+r122:DI]
   79: r129:SF=r130:SF*r131:SF
   80: r132:DF=float_extend(r129:SF)
   81: r121:DF=r121:DF+r132:DF
   82: r133:SI=r123:DI#0-0x1
   83: r123:DI=zero_extend(r133:SI)
   84: r122:DI=r140:DI+0x4
   85: r134:CC=cmp(r123:DI,0)
   86: pc={(r134:CC!=0)?L90:pc}

```
We can see, the original loop is using r130,r131,r129,r132,r121...
For the unrolled sequence BB 16, they are still using them.

We could use new pseudos for duplicate body, like below:
```
   26: L26:
   17: NOTE_INSN_BASIC_BLOCK 4
   18: r130:SF=[r125:DI+r122:DI]
   19: r131:SF=[r126:DI+r122:DI]
   20: r129:SF=r130:SF*r131:SF
   21: r132:DF=float_extend(r129:SF)
   93: r145:DF=r121:DF+r132:DF
   23: r133:SI=r123:DI#0-0x1
   95: r148:DI=zero_extend(r133:SI)
   91: r140:DI=r122:DI+0x4
   25: r122:DI=r140:DI
   97: r134:CC=cmp(r148:DI,0)

   87: NOTE_INSN_BASIC_BLOCK 16
   77: r141:SF=[r125:DI+r122:DI]
   78: r142:SF=[r126:DI+r122:DI]
   79: r143:SF=r141:SF*r142:SF
   80: r144:DF=float_extend(r143:SF)
   81: r146:DF=r145:DF+r144:DF
   82: r147:SI=r148:DI#0-0x1
   83: r149:DI=zero_extend(r147:SI)
   84: r122:DI=r140:DI+0x4
   85: r150:CC=cmp(r149:DI,0)
   98: r130:SF=r141:SF
   99: r131:SF=r142:SF
  100: r129:SF=r143:SF
  101: r132:DF=r144:DF
  102: r121:DF=r146:DF
  103: r133:SI=r147:SI
  104: r123:DI=r149:DI
  105: r134:CC=r150:CC
   86: pc={(r150:CC!=0)?L90:pc}
``` 
In BB 16, new pseudos: r141,r142,r143... are used.

For this, I drafted a prototype like the patch below.
This patch only supports simple loops: one exit edge with one major basic block.

With this patch, we can see, the generated asm code uses fewer instructions on PowerPC.

Welcome for comments, thanks!

BR.
Jiufu Guo.

---
 gcc/common.opt    |   4 +
 gcc/loop-unroll.c | 309 +++++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 310 insertions(+), 3 deletions(-)

Richard Biener April 15, 2020, 6:21 a.m. UTC | #1

On Wed, Apr 15, 2020 at 3:56 AM Jiufu Guo via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
>
> Hi,
>
> As you may know, we have loop unroll pass in RTL which was introduced a few
> years ago, and works for a long time.  Currently, this unroller is using the
> pseudos in the original body, and then the pseudos are written multiple times.
>
> It would be a good idea to create new pseudos for those duplicated instructions
> during loop unrolling.  This would relax reg dependence, and then provide more
> freedom/opportunity for other optimizations, like scheduling, RA...

I think there's a separate pass to do something like this, conveniently
placed after unrolling.  -fweb, IIRC enabled by default for -funroll-loops
unless explicitly disabled.  Related is regrename which is also enabled then.

So where does your patch make a difference?  Is the webizers dataflow analysis
maybe confused by backedges?

> As below example:
>
> Original loop:
> ```
>    26: L26:
>    17: NOTE_INSN_BASIC_BLOCK 4
>    18: r130:SF=[r125:DI+r122:DI]
>    19: r131:SF=[r126:DI+r122:DI]
>    20: r129:SF=r130:SF*r131:SF
>    21: r132:DF=float_extend(r129:SF)
>    22: r121:DF=r121:DF+r132:DF
>    23: r133:SI=r123:DI#0-0x1
>    24: r123:DI=zero_extend(r133:SI)
>    25: r122:DI=r122:DI+0x4
>    27: r134:CC=cmp(r123:DI,0)
>    28: pc={(r134:CC!=0)?L49:pc}
>       ; pc falls through to BB 5
>    49: L49:
>    48: NOTE_INSN_BASIC_BLOCK 8
>       ; pc falls through to BB 4
> ```
> It was unrolled to:
> ```
>    26: L26:
>    17: NOTE_INSN_BASIC_BLOCK 4
>    18: r130:SF=[r125:DI+r122:DI]
>    19: r131:SF=[r126:DI+r122:DI]
>    20: r129:SF=r130:SF*r131:SF
>    21: r132:DF=float_extend(r129:SF)
>    93: r145:DF=r121:DF+r132:DF
>    23: r133:SI=r123:DI#0-0x1
>    95: r148:DI=zero_extend(r133:SI)
>    91: r140:DI=r122:DI+0x4
>    25: r122:DI=r140:DI
>    97: r134:CC=cmp(r148:DI,0)
>
>    87: NOTE_INSN_BASIC_BLOCK 16
>    77: r130:SF=[r125:DI+r122:DI]
>    78: r131:SF=[r126:DI+r122:DI]
>    79: r129:SF=r130:SF*r131:SF
>    80: r132:DF=float_extend(r129:SF)
>    81: r121:DF=r121:DF+r132:DF
>    82: r133:SI=r123:DI#0-0x1
>    83: r123:DI=zero_extend(r133:SI)
>    84: r122:DI=r140:DI+0x4
>    85: r134:CC=cmp(r123:DI,0)
>    86: pc={(r134:CC!=0)?L90:pc}
>
> ```
> We can see, the original loop is using r130,r131,r129,r132,r121...
> For the unrolled sequence BB 16, they are still using them.
>
> We could use new pseudos for duplicate body, like below:
> ```
>    26: L26:
>    17: NOTE_INSN_BASIC_BLOCK 4
>    18: r130:SF=[r125:DI+r122:DI]
>    19: r131:SF=[r126:DI+r122:DI]
>    20: r129:SF=r130:SF*r131:SF
>    21: r132:DF=float_extend(r129:SF)
>    93: r145:DF=r121:DF+r132:DF
>    23: r133:SI=r123:DI#0-0x1
>    95: r148:DI=zero_extend(r133:SI)
>    91: r140:DI=r122:DI+0x4
>    25: r122:DI=r140:DI
>    97: r134:CC=cmp(r148:DI,0)
>
>    87: NOTE_INSN_BASIC_BLOCK 16
>    77: r141:SF=[r125:DI+r122:DI]
>    78: r142:SF=[r126:DI+r122:DI]
>    79: r143:SF=r141:SF*r142:SF
>    80: r144:DF=float_extend(r143:SF)
>    81: r146:DF=r145:DF+r144:DF
>    82: r147:SI=r148:DI#0-0x1
>    83: r149:DI=zero_extend(r147:SI)
>    84: r122:DI=r140:DI+0x4
>    85: r150:CC=cmp(r149:DI,0)
>    98: r130:SF=r141:SF
>    99: r131:SF=r142:SF
>   100: r129:SF=r143:SF
>   101: r132:DF=r144:DF
>   102: r121:DF=r146:DF
>   103: r133:SI=r147:SI
>   104: r123:DI=r149:DI
>   105: r134:CC=r150:CC
>    86: pc={(r150:CC!=0)?L90:pc}
> ```
> In BB 16, new pseudos: r141,r142,r143... are used.
>
> For this, I drafted a prototype like the patch below.
> This patch only supports simple loops: one exit edge with one major basic block.
>
> With this patch, we can see, the generated asm code uses fewer instructions on PowerPC.
>
> Welcome for comments, thanks!
>
> BR.
> Jiufu Guo.
>
> ---
>  gcc/common.opt    |   4 +
>  gcc/loop-unroll.c | 309 +++++++++++++++++++++++++++++++++++++++++++++-
>  2 files changed, 310 insertions(+), 3 deletions(-)
>
> diff --git a/gcc/common.opt b/gcc/common.opt
> index fa9da505fc2..e298ce09c82 100644
> --- a/gcc/common.opt
> +++ b/gcc/common.opt
> @@ -2526,6 +2526,10 @@ fvariable-expansion-in-unroller
>  Common Report Var(flag_variable_expansion_in_unroller) Optimization
>  Apply variable expansion when loops are unrolled.
>
> +fassin-new-operand-in-unroller
> +Common Report Var(flag_assign_new_operand_in_unroller) Init(1) Optimization
> +Creating new pseudo for duplicated instruction when loops are unrolled.
> +
>  fstack-check=
>  Common Report RejectNegative Joined Optimization
>  -fstack-check=[no|generic|specific]    Insert stack checking code into the program.
> diff --git a/gcc/loop-unroll.c b/gcc/loop-unroll.c
> index 693c7768868..54a48bd9fd7 100644
> --- a/gcc/loop-unroll.c
> +++ b/gcc/loop-unroll.c
> @@ -94,6 +94,17 @@ struct var_to_expand
>                                        var_expansions[REUSE_EXPANSION - 1].  */
>  };
>
> +struct def_to_split
> +{
> +  rtx_insn *insn;          /* The insn in that the assigment occurs.  */
> +  rtx reg;                  /* The def/set pseudo.  */
> +  vec<rtx> defs;            /* The copies of the defs which is split to.  */
> +  struct def_to_split *next; /* Next entry in walking order.  */
> +  int count;                /* Number of DEFS.  */
> +  int use_before_def;  /* The pseudo is used before re-defined.  */
> +  rtx_insn *last_insn;
> +};
> +
>  /* Hashtable helper for iv_to_split.  */
>
>  struct iv_split_hasher : free_ptr_hash <iv_to_split>
> @@ -143,6 +154,30 @@ var_expand_hasher::equal (const var_to_expand *i1, const var_to_expand *i2)
>    return i1->insn == i2->insn;
>  }
>
> +/* Hashtable helper for iv_to_split.  */
> +
> +struct def_split_hasher : free_ptr_hash<def_to_split>
> +{
> +  static inline hashval_t hash (const def_to_split *);
> +  static inline bool equal (const def_to_split *, const def_to_split *);
> +};
> +
> +/* Return a hash for VES.  */
> +
> +inline hashval_t
> +def_split_hasher::hash (const def_to_split *i)
> +{
> +  return (hashval_t) INSN_UID (i->insn);
> +}
> +
> +/* Return true if I1 and I2 refer to the same instruction.  */
> +
> +inline bool
> +def_split_hasher::equal (const def_to_split *i1, const def_to_split *i2)
> +{
> +  return i1->insn == i2->insn;
> +}
> +
>  /* Information about optimization applied in
>     the unrolled loop.  */
>
> @@ -156,10 +191,17 @@ struct opt_info
>                                         insns with accumulators to expand.  */
>    struct var_to_expand *var_to_expand_head; /* The first var to expand.  */
>    struct var_to_expand **var_to_expand_tail; /* Pointer to the tail of the list.  */
> +
> +  hash_table<def_split_hasher> *insns_def_split; /* A hashtable of insns to
> +                                                   split def.  */
> +  struct def_to_split *def_to_split_head; /* The def to split.  */
> +  struct def_to_split **def_to_split_tail; /* Pointer to the tail of the list.  */
> +
>    unsigned first_new_block;        /* The first basic block that was
>                                        duplicated.  */
>    basic_block loop_exit;           /* The loop exit basic block.  */
>    basic_block loop_preheader;      /* The loop preheader basic block.  */
> +
>  };
>
>  static void decide_unroll_stupid (class loop *, int);
> @@ -183,6 +225,7 @@ static void combine_var_copies_in_loop_exit (struct var_to_expand *,
>                                              basic_block);
>  static rtx get_expansion (struct var_to_expand *);
>
> +static struct def_to_split *analyze_insn_to_split_def (class loop *, rtx_insn *);
>  /* Emit a message summarizing the unroll that will be
>     performed for LOOP, along with the loop's location LOCUS, if
>     appropriate given the dump or -fopt-info settings.  */
> @@ -898,7 +941,8 @@ unroll_loop_runtime_iterations (class loop *loop)
>    struct opt_info *opt_info = NULL;
>    bool ok;
>
> -  if (flag_split_ivs_in_unroller
> +  if (flag_assign_new_operand_in_unroller
> +      || flag_split_ivs_in_unroller
>        || flag_variable_expansion_in_unroller)
>      opt_info = analyze_insns_in_loop (loop);
>
> @@ -1564,6 +1608,71 @@ analyze_iv_to_split_insn (rtx_insn *insn)
>    return ivts;
>  }
>
> +static struct def_to_split *
> +analyze_insn_to_split_def (class loop *loop, rtx_insn *insn)
> +{
> +  rtx set, dest, src;
> +  rtx_insn *ins;
> +  struct def_to_split *dts;
> +
> +  /* TODO: to be relaxed */
> +  if (loop->num_nodes != 2)
> +    return NULL;
> +
> +  FOR_BB_INSNS (loop->latch, ins)
> +    if (INSN_P (ins))
> +      return NULL;
> +
> +  set = single_set (insn);
> +  if (!set)
> +    return NULL;
> +
> +  dest = SET_DEST (set);
> +  src = SET_SRC (set);
> +
> +  if (!REG_P (dest)) //    && !(GET_CODE (dest) == SUBREG && REG_P (SUBREG_REG
> +                    //    (dest))))
> +    return NULL;
> +
> +  if (REGNO (dest) < FIRST_PSEUDO_REGISTER)
> +    return NULL;
> +
> +  int use_before_set = 0;
> +  basic_block bb = BLOCK_FOR_INSN (insn);
> +  rtx set1;
> +  for (ins = BB_HEAD (bb); ins != insn; ins = NEXT_INSN (ins))
> +    if (rtx_referenced_p (dest, ins))
> +      {
> +       set1 = single_set (ins);
> +       if (set1 && rtx_equal_p (SET_DEST (set1), dest))
> +         return NULL;
> +       use_before_set = 1;
> +      }
> +
> +  if (!use_before_set && rtx_referenced_p (dest, src))
> +    use_before_set = 1;
> +
> +  if (dump_file)
> +    {
> +      fprintf (dump_file,
> +              "\n;; Split assignment %s: ",
> +              use_before_set ? "use leads def" : "");
> +      print_rtl (dump_file, dest);
> +      fprintf (dump_file, "\n");
> +    }
> +
> +  /* Record the accumulator to expand.  */
> +  dts = XNEW (struct def_to_split);
> +  dts->insn = insn;
> +  dts->reg = copy_rtx (dest);
> +  dts->defs.create (1);
> +  dts->next = NULL;
> +  dts->count = 0;
> +  dts->use_before_def = use_before_set;
> +  dts->last_insn = NULL;
> +  return dts;
> +}
> +
>  /* Determines which of insns in LOOP can be optimized.
>     Return a OPT_INFO struct with the relevant hash tables filled
>     with all insns to be optimized.  The FIRST_NEW_BLOCK field
> @@ -1578,6 +1687,7 @@ analyze_insns_in_loop (class loop *loop)
>    rtx_insn *insn;
>    struct iv_to_split *ivts = NULL;
>    struct var_to_expand *ves = NULL;
> +  struct def_to_split *dts= NULL;
>    iv_to_split **slot1;
>    var_to_expand **slot2;
>    vec<edge> edges = get_loop_exit_edges (loop);
> @@ -1618,6 +1728,15 @@ analyze_insns_in_loop (class loop *loop)
>        opt_info->var_to_expand_tail = &opt_info->var_to_expand_head;
>      }
>
> +  if (flag_assign_new_operand_in_unroller && can_apply
> +      && LPT_UNROLL_RUNTIME == loop->lpt_decision.decision)
> +    {
> +      opt_info->insns_def_split
> +       = new hash_table<def_split_hasher> (5 * loop->num_nodes);
> +      opt_info->def_to_split_head = NULL;
> +      opt_info->def_to_split_tail = &opt_info->def_to_split_head;
> +    }
> +
>    for (i = 0; i < loop->num_nodes; i++)
>      {
>        bb = body[i];
> @@ -1653,6 +1772,20 @@ analyze_insns_in_loop (class loop *loop)
>             *opt_info->var_to_expand_tail = ves;
>             opt_info->var_to_expand_tail = &ves->next;
>            }
> +
> +       dts = NULL;
> +       if (opt_info->insns_def_split && !ves && !ivts)
> +         dts = analyze_insn_to_split_def (loop, insn);
> +
> +       if (dts)
> +         {
> +           struct def_to_split **slot
> +             = opt_info->insns_def_split->find_slot (dts, INSERT);
> +           gcc_assert (*slot == NULL);
> +           *slot = dts;
> +           *opt_info->def_to_split_tail = dts;
> +           opt_info->def_to_split_tail = &dts->next;
> +         }
>        }
>      }
>
> @@ -1840,6 +1973,145 @@ expand_var_during_unrolling (struct var_to_expand *ve, rtx_insn *insn)
>        }
>  }
>
> +/* Replace INSN with a new duplicate insn.  */
> +static rtx_insn *
> +replace_with_new_insn (rtx_insn *insn)
> +{
> +  rtx_insn *seq;
> +  start_sequence ();
> +  seq = duplicate_insn_chain (insn, insn);
> +  gcc_assert (seq == get_insns ());
> +  end_sequence ();
> +  emit_insn_before (seq, insn);
> +  delete_insn (insn);
> +  return seq;
> +}
> +
> +/* For original body, those insn(s) are not able to be update directly. Then,
> + * duplicate new one for update.  */
> +
> +static void
> +prepare_to_update_orignal_body (struct def_to_split *head)
> +{
> +  struct def_to_split *dts;
> +  rtx_insn *insn;
> +  vec<rtx_insn *> new_insn_list;
> +
> +  /* For all those defines which is used before define, replace with new one for
> +   * update.  */
> +  new_insn_list.create (1);
> +  for (dts = head; dts; dts = dts->next)
> +    if (dts->use_before_def)
> +      {
> +       dts->insn = replace_with_new_insn (dts->insn);
> +       new_insn_list.safe_push (dts->insn);
> +      }
> +
> +  /* For all those insn which is using those defines, replace with new one for
> +   * update.  */
> +  for (dts = head; dts; dts = dts->next)
> +    if (dts->use_before_def)
> +      for (insn = NEXT_INSN (dts->insn);
> +          insn != NEXT_INSN (BB_END (BLOCK_FOR_INSN (dts->insn)));
> +          insn = NEXT_INSN (insn))
> +       if (rtx_referenced_p (dts->reg, insn) && !new_insn_list.contains (insn))
> +         {
> +           insn = replace_with_new_insn (insn);
> +           new_insn_list.safe_push (insn);
> +         }
> +
> +  new_insn_list.release ();
> +}
> +
> +/* Replace reg which occur between START and END.  */
> +static void
> +replace_regs (rtx_insn *start, rtx_insn *end, rtx old_reg, rtx new_reg)
> +{
> +  rtx_insn *insn;
> +  for (insn = start; insn != NEXT_INSN (end); insn = NEXT_INSN (insn))
> +    if (rtx_referenced_p (old_reg, insn))
> +      validate_replace_rtx_group (old_reg, new_reg, insn);
> +
> +  apply_change_group ();
> +}
> +
> +/* Update the the pseudo in orignal blocks, and assign back them at the last
> + * block.  */
> +static void
> +do_additional_update (struct def_to_split *head)
> +{
> +  rtx_insn *insn;
> +  rtx new_reg;
> +  basic_block bb;
> +  struct def_to_split *dts;
> +
> +  for (dts = head; dts; dts = dts->next)
> +    {
> +      if (dts->use_before_def)
> +       {
> +         gcc_assert (dts->last_insn);
> +         gcc_assert (dts->count > 1);
> +
> +         /* Update the insns in original body.  */
> +         bb = BLOCK_FOR_INSN (dts->insn);
> +         new_reg = dts->defs[0];
> +         SET_DEST (single_set (dts->insn)) = new_reg;
> +
> +         replace_regs (NEXT_INSN (dts->insn), BB_END (bb), dts->reg, new_reg);
> +       }
> +
> +      /* In last BB, assign back to orignal reg.  */
> +      start_sequence ();
> +      emit_move_insn (dts->reg, SET_DEST (single_set (dts->last_insn)));
> +      insn = get_insns ();
> +      end_sequence ();
> +      emit_insn_after (insn, dts->last_insn);
> +    }
> +}
> +
> +/* Replace the DTS->REG with new split reg for INSN's block. If the DTS->REG is
> + * referenced before set, Those usage which leading INSN is replaced with
> + * previous define/set.  */
> +static void
> +split_def_during_unrolling (struct def_to_split *dts, rtx_insn *insn)
> +{
> +  rtx new_reg, new_reg0, set;
> +  basic_block bb = BLOCK_FOR_INSN (insn);
> +
> +  if (dts->use_before_def)
> +    {
> +      if (dts->count == 0)
> +       {
> +         dts->defs.safe_push (gen_reg_rtx (GET_MODE (dts->reg)));
> +         dts->count++;
> +       }
> +
> +      /* Replace the usage of the define reg with new reg0 for those insns which
> +       * are leading the define, include the usage of the define.  */
> +      new_reg0 = dts->defs[dts->count - 1];
> +      replace_regs (BB_HEAD (bb), insn, dts->reg, new_reg0);
> +
> +      /* Replace the define reg with new reg, for the define and those usages
> +       * which following the define.  */
> +      set = single_set (insn);
> +      gcc_assert (set);
> +      new_reg = gen_reg_rtx (GET_MODE (dts->reg));
> +      SET_DEST (single_set (insn)) = new_reg;
> +
> +      replace_regs (NEXT_INSN (insn), BB_END (bb), dts->reg, new_reg);
> +    }
> +  else
> +    {
> +      new_reg = gen_reg_rtx (GET_MODE (dts->reg));
> +      replace_regs (insn, BB_END (bb), dts->reg, new_reg);
> +    }
> +
> +  /* Record the insn in last block.  */
> +  dts->last_insn = insn;
> +  dts->defs.safe_push (new_reg);
> +  dts->count++;
> +}
> +
>  /* Initialize the variable expansions in loop preheader.  PLACE is the
>     loop-preheader basic block where the initialization of the
>     expansions should take place.  The expansions are initialized with
> @@ -2011,6 +2283,7 @@ apply_opt_in_copies (struct opt_info *opt_info,
>    rtx_insn *insn, *orig_insn, *next;
>    struct iv_to_split ivts_templ, *ivts;
>    struct var_to_expand ve_templ, *ves;
> +  struct def_to_split dts_templ, *dts;
>
>    /* Sanity check -- we need to put initialization in the original loop
>       body.  */
> @@ -2051,8 +2324,9 @@ apply_opt_in_copies (struct opt_info *opt_info,
>
>            ivts_templ.insn = orig_insn;
>            ve_templ.insn = orig_insn;
> +         dts_templ.insn = orig_insn;
>
> -          /* Apply splitting iv optimization.  */
> +         /* Apply splitting iv optimization.  */
>            if (opt_info->insns_to_split)
>              {
>               maybe_strip_eq_note_for_split_iv (opt_info, insn);
> @@ -2081,6 +2355,19 @@ apply_opt_in_copies (struct opt_info *opt_info,
>                    expand_var_during_unrolling (ves, insn);
>                  }
>              }
> +
> +         if (unrolling && opt_info->insns_def_split)
> +           {
> +             dts = (struct def_to_split *) opt_info->insns_def_split->find (
> +               &dts_templ);
> +             if (dts)
> +               {
> +                 gcc_assert (GET_CODE (PATTERN (insn))
> +                             == GET_CODE (PATTERN (orig_insn)));
> +                 split_def_during_unrolling (dts, insn);
> +               }
> +           }
> +
>            orig_insn = NEXT_INSN (orig_insn);
>          }
>      }
> @@ -2135,9 +2422,14 @@ apply_opt_in_copies (struct opt_info *opt_info,
>                    continue;
>                  }
>              }
> -
>          }
>      }
> +
> +  if (unrolling && opt_info->insns_def_split)
> +    {
> +      prepare_to_update_orignal_body (opt_info->def_to_split_head);
> +      do_additional_update (opt_info->def_to_split_head);
> +    }
>  }
>
>  /* Release OPT_INFO.  */
> @@ -2156,5 +2448,16 @@ free_opt_info (struct opt_info *opt_info)
>        delete opt_info->insns_with_var_to_expand;
>        opt_info->insns_with_var_to_expand = NULL;
>      }
> +
> +  if (opt_info->insns_def_split)
> +    {
> +      struct def_to_split *dts;
> +
> +      for (dts = opt_info->def_to_split_head; dts; dts = dts->next)
> +       dts->defs.release ();
> +      delete opt_info->insns_def_split;
> +      opt_info->insns_def_split = NULL;
> +    }
> +
>    free (opt_info);
>  }
> --
> 2.17.1
>

Segher Boessenkool April 15, 2020, 9:15 p.m. UTC | #2

Hi Jiufu,

Just reviewing random things as I see them...

On Wed, Apr 15, 2020 at 09:56:00AM +0800, Jiufu Guo wrote:
> This patch only supports simple loops: one exit edge with one major basic block.

That is fine for a proof-of-concept, but will need fixing perhaps.

> --- a/gcc/common.opt
> +++ b/gcc/common.opt
> @@ -2526,6 +2526,10 @@ fvariable-expansion-in-unroller
>  Common Report Var(flag_variable_expansion_in_unroller) Optimization
>  Apply variable expansion when loops are unrolled.
>  
> +fassin-new-operand-in-unroller

Typo ("assign").  But, is there ever any reason to disable this?  (You
want an option for it during development, of course, but all public
options cost something).

> +struct def_to_split
> +{
> +  rtx_insn *insn;	    /* The insn in that the assigment occurs.  */
> +  rtx reg;		     /* The def/set pseudo.  */
> +  vec<rtx> defs;	     /* The copies of the defs which is split to.  */
> +  struct def_to_split *next; /* Next entry in walking order.  */
> +  int count;		     /* Number of DEFS.  */
> +  int use_before_def;	/* The pseudo is used before re-defined.  */
> +  rtx_insn *last_insn;
> +};

Three different kinds of indentation here.

It might work better to explain it all in a bigger comment before this
struct definition.  It also would help to have better, more obvious
names?  Not "reg" but "orig_reg", not "defs" but "regs", not "count"
but "nregs"?

> +/* Return a hash for VES.  */
> +
> +inline hashval_t
> +def_split_hasher::hash (const def_to_split *i)
> +{
> +  return (hashval_t) INSN_UID (i->insn);
> +}

What does "VES" mean?  Something from an older version of the patch?

Segher

Segher Boessenkool April 15, 2020, 9:23 p.m. UTC | #3

Hi!

On Wed, Apr 15, 2020 at 08:21:03AM +0200, Richard Biener wrote:
> On Wed, Apr 15, 2020 at 3:56 AM Jiufu Guo via Gcc-patches
> <gcc-patches@gcc.gnu.org> wrote:
> > As you may know, we have loop unroll pass in RTL which was introduced a few
> > years ago, and works for a long time.  Currently, this unroller is using the
> > pseudos in the original body, and then the pseudos are written multiple times.
> >
> > It would be a good idea to create new pseudos for those duplicated instructions
> > during loop unrolling.  This would relax reg dependence, and then provide more
> > freedom/opportunity for other optimizations, like scheduling, RA...
> 
> I think there's a separate pass to do something like this, conveniently
> placed after unrolling.  -fweb, IIRC enabled by default for -funroll-loops
> unless explicitly disabled.  Related is regrename which is also enabled then.
> 
> So where does your patch make a difference?  Is the webizers dataflow analysis
> maybe confused by backedges?

Does -fweb handle things set by the last unrolled iteration, used by the
first unrolled iteration?

On a general note, we shouldn't depend on some pass that may or may not
clean up the mess we make, when we could just avoid making a mess in the
first place.

The web pass belongs immediately after expand; but ideally, even expand
would not reuse pseudos anyway.

Maybe it would be better as some utility routines, not a pass?


Segher

Kewen.Lin April 16, 2020, 2:34 a.m. UTC | #4

on 2020/4/15 下午2:21, Richard Biener via Gcc-patches wrote:
> On Wed, Apr 15, 2020 at 3:56 AM Jiufu Guo via Gcc-patches
> <gcc-patches@gcc.gnu.org> wrote:
>>
>> Hi,
>>
>> As you may know, we have loop unroll pass in RTL which was introduced a few
>> years ago, and works for a long time.  Currently, this unroller is using the
>> pseudos in the original body, and then the pseudos are written multiple times.
>>
>> It would be a good idea to create new pseudos for those duplicated instructions
>> during loop unrolling.  This would relax reg dependence, and then provide more
>> freedom/opportunity for other optimizations, like scheduling, RA...
> 
> I think there's a separate pass to do something like this, conveniently
> placed after unrolling.  -fweb, IIRC enabled by default for -funroll-loops
> unless explicitly disabled.  Related is regrename which is also enabled then.
> 
> So where does your patch make a difference?  Is the webizers dataflow analysis
> maybe confused by backedges?
> 

Juofu can help to confirm, it looks we disabled them on rs6000 by default and don't
enable them for unroll_only_small_loops but just enable for explicit unroll options.
If it's proved that web can cover what we want here, one option sounds to bring 
it/them back even for unroll_only_small_loops?

BR,
Kewen

Jiufu Guo April 16, 2020, 3:57 a.m. UTC | #5

Segher Boessenkool <segher@kernel.crashing.org> writes:

Hi,

Thanks for all your comments!

> Hi!
>
> On Wed, Apr 15, 2020 at 08:21:03AM +0200, Richard Biener wrote:
>> On Wed, Apr 15, 2020 at 3:56 AM Jiufu Guo via Gcc-patches
>> <gcc-patches@gcc.gnu.org> wrote:
>> > As you may know, we have loop unroll pass in RTL which was introduced a few
>> > years ago, and works for a long time.  Currently, this unroller is using the
>> > pseudos in the original body, and then the pseudos are written multiple times.
>> >
>> > It would be a good idea to create new pseudos for those duplicated instructions
>> > during loop unrolling.  This would relax reg dependence, and then provide more
>> > freedom/opportunity for other optimizations, like scheduling, RA...
>> 
>> I think there's a separate pass to do something like this, conveniently
>> placed after unrolling.  -fweb, IIRC enabled by default for -funroll-loops
>> unless explicitly disabled.  Related is regrename which is also
>> enabled then.
web/rnreg have some help for some cases, while they does not know much
information about loop, and do not care if a BB/pseudo/reg is generated
during unrolling. Unroller know these information, and it maybe accurate
to create new pseudos and relax depences between insns in unroller. 
For the example of preview RTL sequences, using new pseudo in unroller
is better than web/rnreg.

>> 
>> So where does your patch make a difference?  Is the webizers dataflow analysis
>> maybe confused by backedges?
>
> Does -fweb handle things set by the last unrolled iteration, used by the
> first unrolled iteration?
>
> On a general note, we shouldn't depend on some pass that may or may not
> clean up the mess we make, when we could just avoid making a mess in the
> first place.
Yes, this is a reason that I also think it maybe good to do it during
unroller.

>
> The web pass belongs immediately after expand; but ideally, even expand
> would not reuse pseudos anyway.
Currently, web runs after loop done now. 

Jiufu

>
> Maybe it would be better as some utility routines, not a pass?
>
>
> Segher

Jiufu Guo April 16, 2020, 4:45 a.m. UTC | #6

Segher Boessenkool <segher@kernel.crashing.org> writes:

> Hi Jiufu,
>
> Just reviewing random things as I see them...
>
> On Wed, Apr 15, 2020 at 09:56:00AM +0800, Jiufu Guo wrote:
>> This patch only supports simple loops: one exit edge with one major basic block.
>
> That is fine for a proof-of-concept, but will need fixing perhaps.
Hi Segher,

Thanks so much.  You are always point out things to improve! 
>
>> --- a/gcc/common.opt
>> +++ b/gcc/common.opt
>> @@ -2526,6 +2526,10 @@ fvariable-expansion-in-unroller
>>  Common Report Var(flag_variable_expansion_in_unroller) Optimization
>>  Apply variable expansion when loops are unrolled.
>>  
>> +fassin-new-operand-in-unroller
>
> Typo ("assign").  But, is there ever any reason to disable this?  (You
> want an option for it during development, of course, but all public
> options cost something).
Helpful question, development is one reason for now to test benefits and
lost, and incase some one want to disable it.
Your question let me think if it really need to disable it.  In theory,
this behavior is correct and not harmful. 


>
>> +struct def_to_split
>> +{
>> +  rtx_insn *insn;	    /* The insn in that the assigment occurs.  */
>> +  rtx reg;		     /* The def/set pseudo.  */
>> +  vec<rtx> defs;	     /* The copies of the defs which is split to.  */
>> +  struct def_to_split *next; /* Next entry in walking order.  */
>> +  int count;		     /* Number of DEFS.  */
>> +  int use_before_def;	/* The pseudo is used before re-defined.  */
>> +  rtx_insn *last_insn;
>> +};
>
> Three different kinds of indentation here.
>
> It might work better to explain it all in a bigger comment before this
> struct definition.  It also would help to have better, more obvious
> names?  Not "reg" but "orig_reg", not "defs" but "regs", not "count"
> but "nregs"?
Style, I may enhance it through tools.
Thanks for comments about naming :)
>
>> +/* Return a hash for VES.  */
>> +
>> +inline hashval_t
>> +def_split_hasher::hash (const def_to_split *i)
>> +{
>> +  return (hashval_t) INSN_UID (i->insn);
>> +}
>
> What does "VES" mean?  Something from an older version of the patch?
it is typo :(
>
>
> Segher

Richard Biener April 16, 2020, 8:33 a.m. UTC | #7

On Wed, Apr 15, 2020 at 11:23 PM Segher Boessenkool
<segher@kernel.crashing.org> wrote:
>
> Hi!
>
> On Wed, Apr 15, 2020 at 08:21:03AM +0200, Richard Biener wrote:
> > On Wed, Apr 15, 2020 at 3:56 AM Jiufu Guo via Gcc-patches
> > <gcc-patches@gcc.gnu.org> wrote:
> > > As you may know, we have loop unroll pass in RTL which was introduced a few
> > > years ago, and works for a long time.  Currently, this unroller is using the
> > > pseudos in the original body, and then the pseudos are written multiple times.
> > >
> > > It would be a good idea to create new pseudos for those duplicated instructions
> > > during loop unrolling.  This would relax reg dependence, and then provide more
> > > freedom/opportunity for other optimizations, like scheduling, RA...
> >
> > I think there's a separate pass to do something like this, conveniently
> > placed after unrolling.  -fweb, IIRC enabled by default for -funroll-loops
> > unless explicitly disabled.  Related is regrename which is also enabled then.
> >
> > So where does your patch make a difference?  Is the webizers dataflow analysis
> > maybe confused by backedges?
>
> Does -fweb handle things set by the last unrolled iteration, used by the
> first unrolled iteration?
>
> On a general note, we shouldn't depend on some pass that may or may not
> clean up the mess we make, when we could just avoid making a mess in the
> first place.

True - but the issue at hand is not trivial given you have to care for
partial defs, uses outside of the loop (or across the backedge), etc.
So there's plenty of things to go "wrong" here.

> The web pass belongs immediately after expand; but ideally, even expand
> would not reuse pseudos anyway.

But for example when lower-subreg decomposes things in a way turning
partial defs into full defs new opportunities to split the web arise.

> Maybe it would be better as some utility routines, not a pass?

Sure, but then when do we apply it?  Ideally scheduling would to
register renaming itself and thus not rely on the used pseudos
(I'm not sure if it tracks false dependences - I guess it must if it
isn't able to rename regs).  That would be a much better place
for improvements?

Richard.

>
> Segher

Segher Boessenkool April 16, 2020, 5:46 p.m. UTC | #8

On Thu, Apr 16, 2020 at 10:33:45AM +0200, Richard Biener wrote:
> On Wed, Apr 15, 2020 at 11:23 PM Segher Boessenkool
> > On a general note, we shouldn't depend on some pass that may or may not
> > clean up the mess we make, when we could just avoid making a mess in the
> > first place.
> 
> True - but the issue at hand is not trivial given you have to care for
> partial defs, uses outside of the loop (or across the backedge), etc.
> So there's plenty of things to go "wrong" here.

Certainly.  But *all* RTL passes before RA should leave things in "web
form" (compare to SSA form).  The code in web.c is probably just fine;
but we shouldn't have one web pass, *all* passes should leave things in
a useful form!

> > The web pass belongs immediately after expand; but ideally, even expand
> > would not reuse pseudos anyway.
> 
> But for example when lower-subreg decomposes things in a way turning
> partial defs into full defs new opportunities to split the web arise.

But that is only for the new registers it creates, or what am I missing?

> > Maybe it would be better as some utility routines, not a pass?
> 
> Sure, but then when do we apply it?

All of the time!  Ideally every pass would leave things in a good shape.
Usually it is cheapest as well as easiest to do things manually, but for
some harder cases, such helper routines can be used.

> Ideally scheduling would to
> register renaming itself and thus not rely on the used pseudos
> (I'm not sure if it tracks false dependences - I guess it must if it
> isn't able to rename regs).  That would be a much better place
> for improvements?

sched2 runs after RA, so it has nothing to do with webs?  And sched1
doesn't do much relevant here (it doesn't move insns much).

I don't see how this is directly related to register renaming either?

Segher

Richard Biener April 17, 2020, 6:53 a.m. UTC | #9

On Thu, Apr 16, 2020 at 7:46 PM Segher Boessenkool
<segher@kernel.crashing.org> wrote:
>
> On Thu, Apr 16, 2020 at 10:33:45AM +0200, Richard Biener wrote:
> > On Wed, Apr 15, 2020 at 11:23 PM Segher Boessenkool
> > > On a general note, we shouldn't depend on some pass that may or may not
> > > clean up the mess we make, when we could just avoid making a mess in the
> > > first place.
> >
> > True - but the issue at hand is not trivial given you have to care for
> > partial defs, uses outside of the loop (or across the backedge), etc.
> > So there's plenty of things to go "wrong" here.
>
> Certainly.  But *all* RTL passes before RA should leave things in "web
> form" (compare to SSA form).  The code in web.c is probably just fine;
> but we shouldn't have one web pass, *all* passes should leave things in
> a useful form!

Yeah well, but RTL is not in SSA form and there's no RTL IL verification
in place to track degradation.  And we even work in the opposite way
when expanding to RTL from SSA, coalescing as much as we can ...

> > > The web pass belongs immediately after expand; but ideally, even expand
> > > would not reuse pseudos anyway.
> >
> > But for example when lower-subreg decomposes things in a way turning
> > partial defs into full defs new opportunities to split the web arise.
>
> But that is only for the new registers it creates, or what am I missing?

No idea, just made up the example that maintaing "SSA" RTL is
not automagic.

> > > Maybe it would be better as some utility routines, not a pass?
> >
> > Sure, but then when do we apply it?
>
> All of the time!  Ideally every pass would leave things in a good shape.
> Usually it is cheapest as well as easiest to do things manually, but for
> some harder cases, such helper routines can be used.
>
> > Ideally scheduling would to
> > register renaming itself and thus not rely on the used pseudos
> > (I'm not sure if it tracks false dependences - I guess it must if it
> > isn't able to rename regs).  That would be a much better place
> > for improvements?
>
> sched2 runs after RA, so it has nothing to do with webs?  And sched1
> doesn't do much relevant here (it doesn't move insns much).
>
> I don't see how this is directly related to register renaming either?

If scheduling ignores "false" dependences (anti dependence with
full defs) then when it schedules across such defs needs to perform
renaming.  Maybe I'm using bogus terms here.

Richard.

>
> Segher

Li, Pan2 via Gcc-patches April 17, 2020, 1:22 p.m. UTC | #10

On 4/17/20 1:53 AM, Richard Biener wrote:
> Yeah well, but RTL is not in SSA form and there's no RTL IL verification
> in place to track degradation.  And we even work in the opposite way
> when expanding to RTL from SSA, coalescing as much as we can ...

Which is itself problematic, introducing unnecessary antidependences at 
the start of RTL.  We've seen performance issues with this on several 
occasions.

Bill

Segher Boessenkool April 17, 2020, 8:51 p.m. UTC | #11

On Fri, Apr 17, 2020 at 08:53:08AM +0200, Richard Biener wrote:
> On Thu, Apr 16, 2020 at 7:46 PM Segher Boessenkool
> <segher@kernel.crashing.org> wrote:
> >
> > On Thu, Apr 16, 2020 at 10:33:45AM +0200, Richard Biener wrote:
> > > On Wed, Apr 15, 2020 at 11:23 PM Segher Boessenkool
> > > > On a general note, we shouldn't depend on some pass that may or may not
> > > > clean up the mess we make, when we could just avoid making a mess in the
> > > > first place.
> > >
> > > True - but the issue at hand is not trivial given you have to care for
> > > partial defs, uses outside of the loop (or across the backedge), etc.
> > > So there's plenty of things to go "wrong" here.
> >
> > Certainly.  But *all* RTL passes before RA should leave things in "web
> > form" (compare to SSA form).  The code in web.c is probably just fine;
> > but we shouldn't have one web pass, *all* passes should leave things in
> > a useful form!
> 
> Yeah well, but RTL is not in SSA form

"Webs" are not the *same* as SSA, in a few crucial ways; but they serve
similar purposes: they both make code transformations easier to do.
Both do this by having more different (independent) names.

> and there's no RTL IL verification
> in place to track degradation.

Adding this isn't hard in principle.  But currently it is violated all
over the place, including in backend code.

> And we even work in the opposite way
> when expanding to RTL from SSA, coalescing as much as we can ...

Which often is bad.  A lot of what expand does is premature optimisation
that a) is much too complicated, and b) results in worse code in the end.
Expand should not try to do *anything* "interesting", it should "merely"
translate to RTL.

(And in a way that can be debugged sanely at all, if possible ;-) )

> > > > The web pass belongs immediately after expand; but ideally, even expand
> > > > would not reuse pseudos anyway.
> > >
> > > But for example when lower-subreg decomposes things in a way turning
> > > partial defs into full defs new opportunities to split the web arise.
> >
> > But that is only for the new registers it creates, or what am I missing?
> 
> No idea, just made up the example that maintaing "SSA" RTL is
> not automagic.

Oh sure, but in most cases, it is.

My point is that making web form only once (and messing it up after that)
isn't great.

> > > > Maybe it would be better as some utility routines, not a pass?
> > >
> > > Sure, but then when do we apply it?
> >
> > All of the time!  Ideally every pass would leave things in a good shape.
> > Usually it is cheapest as well as easiest to do things manually, but for
> > some harder cases, such helper routines can be used.
> >
> > > Ideally scheduling would to
> > > register renaming itself and thus not rely on the used pseudos
> > > (I'm not sure if it tracks false dependences - I guess it must if it
> > > isn't able to rename regs).  That would be a much better place
> > > for improvements?
> >
> > sched2 runs after RA, so it has nothing to do with webs?  And sched1
> > doesn't do much relevant here (it doesn't move insns much).
> >
> > I don't see how this is directly related to register renaming either?
> 
> If scheduling ignores "false" dependences (anti dependence with
> full defs) then when it schedules across such defs needs to perform
> renaming.  Maybe I'm using bogus terms here.

Your terminology is fine afaic, but I don't see what you mean, sorry.
Maybe I don't know sched well enough...  I thought it would just refuse
to move something if that would result in broken code?


Segher

Richard Biener April 20, 2020, 7:49 a.m. UTC | #12

On Fri, Apr 17, 2020 at 10:51 PM Segher Boessenkool
<segher@kernel.crashing.org> wrote:
>
> On Fri, Apr 17, 2020 at 08:53:08AM +0200, Richard Biener wrote:
> > On Thu, Apr 16, 2020 at 7:46 PM Segher Boessenkool
> > <segher@kernel.crashing.org> wrote:
> > >
> > > On Thu, Apr 16, 2020 at 10:33:45AM +0200, Richard Biener wrote:
> > > > On Wed, Apr 15, 2020 at 11:23 PM Segher Boessenkool
> > > > > On a general note, we shouldn't depend on some pass that may or may not
> > > > > clean up the mess we make, when we could just avoid making a mess in the
> > > > > first place.
> > > >
> > > > True - but the issue at hand is not trivial given you have to care for
> > > > partial defs, uses outside of the loop (or across the backedge), etc.
> > > > So there's plenty of things to go "wrong" here.
> > >
> > > Certainly.  But *all* RTL passes before RA should leave things in "web
> > > form" (compare to SSA form).  The code in web.c is probably just fine;
> > > but we shouldn't have one web pass, *all* passes should leave things in
> > > a useful form!
> >
> > Yeah well, but RTL is not in SSA form
>
> "Webs" are not the *same* as SSA, in a few crucial ways; but they serve
> similar purposes: they both make code transformations easier to do.
> Both do this by having more different (independent) names.
>
> > and there's no RTL IL verification
> > in place to track degradation.
>
> Adding this isn't hard in principle.  But currently it is violated all
> over the place, including in backend code.
>
> > And we even work in the opposite way
> > when expanding to RTL from SSA, coalescing as much as we can ...
>
> Which often is bad.  A lot of what expand does is premature optimisation
> that a) is much too complicated, and b) results in worse code in the end.
> Expand should not try to do *anything* "interesting", it should "merely"
> translate to RTL.
>
> (And in a way that can be debugged sanely at all, if possible ;-) )
>
> > > > > The web pass belongs immediately after expand; but ideally, even expand
> > > > > would not reuse pseudos anyway.
> > > >
> > > > But for example when lower-subreg decomposes things in a way turning
> > > > partial defs into full defs new opportunities to split the web arise.
> > >
> > > But that is only for the new registers it creates, or what am I missing?
> >
> > No idea, just made up the example that maintaing "SSA" RTL is
> > not automagic.
>
> Oh sure, but in most cases, it is.
>
> My point is that making web form only once (and messing it up after that)
> isn't great.
>
> > > > > Maybe it would be better as some utility routines, not a pass?
> > > >
> > > > Sure, but then when do we apply it?
> > >
> > > All of the time!  Ideally every pass would leave things in a good shape.
> > > Usually it is cheapest as well as easiest to do things manually, but for
> > > some harder cases, such helper routines can be used.
> > >
> > > > Ideally scheduling would to
> > > > register renaming itself and thus not rely on the used pseudos
> > > > (I'm not sure if it tracks false dependences - I guess it must if it
> > > > isn't able to rename regs).  That would be a much better place
> > > > for improvements?
> > >
> > > sched2 runs after RA, so it has nothing to do with webs?  And sched1
> > > doesn't do much relevant here (it doesn't move insns much).
> > >
> > > I don't see how this is directly related to register renaming either?
> >
> > If scheduling ignores "false" dependences (anti dependence with
> > full defs) then when it schedules across such defs needs to perform
> > renaming.  Maybe I'm using bogus terms here.
>
> Your terminology is fine afaic, but I don't see what you mean, sorry.
> Maybe I don't know sched well enough...  I thought it would just refuse
> to move something if that would result in broken code?

Yes, I think that's what it does.  And I was suggesting it should learn to
schedule the use after the def in

(use (reg:SI 101))
(set (reg:SI 101) (...))

by renaming the pseudo, removing the false/anti dependence.  After RA
this means the ability to do simple local register allocation of course.
Before RA the trivial way to implement this is to not have those false
dependences, aka have webs / SSA ... ;)

Richard.

>
> Segher

Richard Biener April 20, 2020, 7:56 a.m. UTC | #13

On Fri, Apr 17, 2020 at 10:51 PM Segher Boessenkool
<segher@kernel.crashing.org> wrote:
>
> On Fri, Apr 17, 2020 at 08:53:08AM +0200, Richard Biener wrote:
> > On Thu, Apr 16, 2020 at 7:46 PM Segher Boessenkool
> > <segher@kernel.crashing.org> wrote:
> > >
> > > On Thu, Apr 16, 2020 at 10:33:45AM +0200, Richard Biener wrote:
> > > > On Wed, Apr 15, 2020 at 11:23 PM Segher Boessenkool
> > > > > On a general note, we shouldn't depend on some pass that may or may not
> > > > > clean up the mess we make, when we could just avoid making a mess in the
> > > > > first place.
> > > >
> > > > True - but the issue at hand is not trivial given you have to care for
> > > > partial defs, uses outside of the loop (or across the backedge), etc.
> > > > So there's plenty of things to go "wrong" here.
> > >
> > > Certainly.  But *all* RTL passes before RA should leave things in "web
> > > form" (compare to SSA form).  The code in web.c is probably just fine;
> > > but we shouldn't have one web pass, *all* passes should leave things in
> > > a useful form!
> >
> > Yeah well, but RTL is not in SSA form
>
> "Webs" are not the *same* as SSA, in a few crucial ways; but they serve
> similar purposes: they both make code transformations easier to do.
> Both do this by having more different (independent) names.
>
> > and there's no RTL IL verification
> > in place to track degradation.
>
> Adding this isn't hard in principle.  But currently it is violated all
> over the place, including in backend code.

It's a cheap excuse and clearly a lot of work.  But we've done it for GIMPLE
and it was clearly worth it.  If only for documentation purposes.

> > And we even work in the opposite way
> > when expanding to RTL from SSA, coalescing as much as we can ...
>
> Which often is bad.  A lot of what expand does is premature optimisation
> that a) is much too complicated, and b) results in worse code in the end.
> Expand should not try to do *anything* "interesting", it should "merely"
> translate to RTL.
>
> (And in a way that can be debugged sanely at all, if possible ;-) )

Sure.  I've tried to turn off TER with only moderate success.  Another approach
would be to not do any coalescing but for PHIs (and maybe there only for edges
where we cannot insert copies on w/o splitting it).

But clearly all the magic expand does now should materialize itself on
GIMPLE (see the patch series Martin posted for vcond* / vec_cmp*
"expansion" on GIMPLE).  It's the magic that requires TER and TER clearly
introduces the most noise in the GIMPLE -> RTL translation step (as to
code motion and initial instruction selection if it "fails").

Richard.

Segher Boessenkool April 20, 2020, 1:37 p.m. UTC | #14

Hi!

On Mon, Apr 20, 2020 at 09:56:34AM +0200, Richard Biener wrote:
> On Fri, Apr 17, 2020 at 10:51 PM Segher Boessenkool
> <segher@kernel.crashing.org> wrote:
> > > Yeah well, but RTL is not in SSA form
> >
> > "Webs" are not the *same* as SSA, in a few crucial ways; but they serve
> > similar purposes: they both make code transformations easier to do.
> > Both do this by having more different (independent) names.
> >
> > > and there's no RTL IL verification
> > > in place to track degradation.
> >
> > Adding this isn't hard in principle.  But currently it is violated all
> > over the place, including in backend code.
> 
> It's a cheap excuse and clearly a lot of work.  But we've done it for GIMPLE
> and it was clearly worth it.  If only for documentation purposes.

Yes, I very much agree :-)  But how will we do this?  Make it warnings
only, and have those opt-in (for a port) even?

> > > And we even work in the opposite way
> > > when expanding to RTL from SSA, coalescing as much as we can ...
> >
> > Which often is bad.  A lot of what expand does is premature optimisation
> > that a) is much too complicated, and b) results in worse code in the end.
> > Expand should not try to do *anything* "interesting", it should "merely"
> > translate to RTL.
> >
> > (And in a way that can be debugged sanely at all, if possible ;-) )
> 
> Sure.  I've tried to turn off TER with only moderate success.  Another approach
> would be to not do any coalescing but for PHIs (and maybe there only for edges
> where we cannot insert copies on w/o splitting it).

Yeah, that would be much simpler to do as well.

> But clearly all the magic expand does now should materialize itself on
> GIMPLE (see the patch series Martin posted for vcond* / vec_cmp*
> "expansion" on GIMPLE).

Probably most, yeah, but some things that expand does should be done
much *later* in RTL even :-)

> It's the magic that requires TER and TER clearly
> introduces the most noise in the GIMPLE -> RTL translation step (as to
> code motion and initial instruction selection if it "fails").

Expand does a lot of things that are very much historic, too.


Segher

Richard Biener April 20, 2020, 2:23 p.m. UTC | #15

On Mon, Apr 20, 2020 at 3:38 PM Segher Boessenkool
<segher@kernel.crashing.org> wrote:
>
> Hi!
>
> On Mon, Apr 20, 2020 at 09:56:34AM +0200, Richard Biener wrote:
> > On Fri, Apr 17, 2020 at 10:51 PM Segher Boessenkool
> > <segher@kernel.crashing.org> wrote:
> > > > Yeah well, but RTL is not in SSA form
> > >
> > > "Webs" are not the *same* as SSA, in a few crucial ways; but they serve
> > > similar purposes: they both make code transformations easier to do.
> > > Both do this by having more different (independent) names.
> > >
> > > > and there's no RTL IL verification
> > > > in place to track degradation.
> > >
> > > Adding this isn't hard in principle.  But currently it is violated all
> > > over the place, including in backend code.
> >
> > It's a cheap excuse and clearly a lot of work.  But we've done it for GIMPLE
> > and it was clearly worth it.  If only for documentation purposes.
>
> Yes, I very much agree :-)  But how will we do this?  Make it warnings
> only, and have those opt-in (for a port) even?

Add -fcheck-rtl, when you're ready (per port?) make it default for
bootstrap.  When you're even more "ready", turn it on with -fchecking
(or --enable-checking=xyz).

We've gone a similar way with the GIMPLE checking (not SSA checking,
that was always present) which was formerly called "type" checking but
now verifies more details.  It was broken quite badly as well and mostly
incrementally fixed (some cases by intentionally weakening the checker).

Of course as you say with RTL you have all the targets to consider ...

> > > > And we even work in the opposite way
> > > > when expanding to RTL from SSA, coalescing as much as we can ...
> > >
> > > Which often is bad.  A lot of what expand does is premature optimisation
> > > that a) is much too complicated, and b) results in worse code in the end.
> > > Expand should not try to do *anything* "interesting", it should "merely"
> > > translate to RTL.
> > >
> > > (And in a way that can be debugged sanely at all, if possible ;-) )
> >
> > Sure.  I've tried to turn off TER with only moderate success.  Another approach
> > would be to not do any coalescing but for PHIs (and maybe there only for edges
> > where we cannot insert copies on w/o splitting it).
>
> Yeah, that would be much simpler to do as well.
>
> > But clearly all the magic expand does now should materialize itself on
> > GIMPLE (see the patch series Martin posted for vcond* / vec_cmp*
> > "expansion" on GIMPLE).
>
> Probably most, yeah, but some things that expand does should be done
> much *later* in RTL even :-)

We have combine for that.  Eh.

> > It's the magic that requires TER and TER clearly
> > introduces the most noise in the GIMPLE -> RTL translation step (as to
> > code motion and initial instruction selection if it "fails").
>
> Expand does a lot of things that are very much historic, too.

Sure...

Richard.

>
> Segher

Richard Sandiford April 22, 2020, 9:26 a.m. UTC | #16

Richard Biener via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> On Mon, Apr 20, 2020 at 3:38 PM Segher Boessenkool
> <segher@kernel.crashing.org> wrote:
>>
>> Hi!
>>
>> On Mon, Apr 20, 2020 at 09:56:34AM +0200, Richard Biener wrote:
>> > On Fri, Apr 17, 2020 at 10:51 PM Segher Boessenkool
>> > <segher@kernel.crashing.org> wrote:
>> > > > Yeah well, but RTL is not in SSA form
>> > >
>> > > "Webs" are not the *same* as SSA, in a few crucial ways; but they serve
>> > > similar purposes: they both make code transformations easier to do.
>> > > Both do this by having more different (independent) names.
>> > >
>> > > > and there's no RTL IL verification
>> > > > in place to track degradation.
>> > >
>> > > Adding this isn't hard in principle.  But currently it is violated all
>> > > over the place, including in backend code.
>> >
>> > It's a cheap excuse and clearly a lot of work.  But we've done it for GIMPLE
>> > and it was clearly worth it.  If only for documentation purposes.
>>
>> Yes, I very much agree :-)  But how will we do this?  Make it warnings
>> only, and have those opt-in (for a port) even?
>
> Add -fcheck-rtl, when you're ready (per port?) make it default for
> bootstrap.  When you're even more "ready", turn it on with -fchecking
> (or --enable-checking=xyz).

What kind of condition would we be checking?  That no register has two
definitions in the same block?  Or something stronger?

Testing that -fweb would have no effect is expensive, since it requires
full UD chains.  And there's a risk that optimisations could break the
condition accidentally.  E.g.:

   if (cond)
     {
       x = ...;
       ...use x...;
     }
   else
     {
       x = ...;
       ...use x...;
     }
   ...use x...; // A, REG_DEAD x

If anything managed to fold away the need for x in A (e.g. combine),
it would then be required to rename x in one of the if/else blocks.

In some ways it feels like it would be easier to resurrect RTL SSA :-)

Thanks,
Richard

Richard Biener April 22, 2020, 9:58 a.m. UTC | #17

On Wed, Apr 22, 2020 at 11:26 AM Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> Richard Biener via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> > On Mon, Apr 20, 2020 at 3:38 PM Segher Boessenkool
> > <segher@kernel.crashing.org> wrote:
> >>
> >> Hi!
> >>
> >> On Mon, Apr 20, 2020 at 09:56:34AM +0200, Richard Biener wrote:
> >> > On Fri, Apr 17, 2020 at 10:51 PM Segher Boessenkool
> >> > <segher@kernel.crashing.org> wrote:
> >> > > > Yeah well, but RTL is not in SSA form
> >> > >
> >> > > "Webs" are not the *same* as SSA, in a few crucial ways; but they serve
> >> > > similar purposes: they both make code transformations easier to do.
> >> > > Both do this by having more different (independent) names.
> >> > >
> >> > > > and there's no RTL IL verification
> >> > > > in place to track degradation.
> >> > >
> >> > > Adding this isn't hard in principle.  But currently it is violated all
> >> > > over the place, including in backend code.
> >> >
> >> > It's a cheap excuse and clearly a lot of work.  But we've done it for GIMPLE
> >> > and it was clearly worth it.  If only for documentation purposes.
> >>
> >> Yes, I very much agree :-)  But how will we do this?  Make it warnings
> >> only, and have those opt-in (for a port) even?
> >
> > Add -fcheck-rtl, when you're ready (per port?) make it default for
> > bootstrap.  When you're even more "ready", turn it on with -fchecking
> > (or --enable-checking=xyz).
>
> What kind of condition would we be checking?  That no register has two
> definitions in the same block?  Or something stronger?

For the particular discussion no idea.  But we generally lack
verification of what is "valid" RTL and what is not.

> Testing that -fweb would have no effect is expensive, since it requires
> full UD chains.  And there's a risk that optimisations could break the
> condition accidentally.  E.g.:
>
>    if (cond)
>      {
>        x = ...;
>        ...use x...;
>      }
>    else
>      {
>        x = ...;
>        ...use x...;
>      }
>    ...use x...; // A, REG_DEAD x
>
> If anything managed to fold away the need for x in A (e.g. combine),
> it would then be required to rename x in one of the if/else blocks.
>
> In some ways it feels like it would be easier to resurrect RTL SSA :-)
>
> Thanks,
> Richard

Segher Boessenkool April 22, 2020, 8:50 p.m. UTC | #18

On Wed, Apr 22, 2020 at 11:58:38AM +0200, Richard Biener wrote:
> On Wed, Apr 22, 2020 at 11:26 AM Richard Sandiford
> <richard.sandiford@arm.com> wrote:
> > Richard Biener via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> > > On Mon, Apr 20, 2020 at 3:38 PM Segher Boessenkool
> > > <segher@kernel.crashing.org> wrote:
> > >> On Mon, Apr 20, 2020 at 09:56:34AM +0200, Richard Biener wrote:
> > >> > On Fri, Apr 17, 2020 at 10:51 PM Segher Boessenkool
> > >> > <segher@kernel.crashing.org> wrote:
> > >> > > > Yeah well, but RTL is not in SSA form
> > >> > >
> > >> > > "Webs" are not the *same* as SSA, in a few crucial ways; but they serve
> > >> > > similar purposes: they both make code transformations easier to do.
> > >> > > Both do this by having more different (independent) names.
> > >> > >
> > >> > > > and there's no RTL IL verification
> > >> > > > in place to track degradation.
> > >> > >
> > >> > > Adding this isn't hard in principle.  But currently it is violated all
> > >> > > over the place, including in backend code.
> > >> >
> > >> > It's a cheap excuse and clearly a lot of work.  But we've done it for GIMPLE
> > >> > and it was clearly worth it.  If only for documentation purposes.
> > >>
> > >> Yes, I very much agree :-)  But how will we do this?  Make it warnings
> > >> only, and have those opt-in (for a port) even?
> > >
> > > Add -fcheck-rtl, when you're ready (per port?) make it default for
> > > bootstrap.  When you're even more "ready", turn it on with -fchecking
> > > (or --enable-checking=xyz).
> >
> > What kind of condition would we be checking?  That no register has two
> > definitions in the same block?  Or something stronger?
> 
> For the particular discussion no idea.  But we generally lack
> verification of what is "valid" RTL and what is not.

There is RTL checking, and it works very well...  But "valid" just means
"structurally valid", and other very local things.  If we want to check
some extra properties, we actually have to make the rules for those
first!  :-)

> > In some ways it feels like it would be easier to resurrect RTL SSA :-)

Why was RTL SSA abandoned?

It might well work to keep everything in SSA form all the way to RA.
Hrm, that doesn't sound bad at all :-)

(The PHIs need to be made explicit to something that resembles the
machine code we will end up with, very early in the pipeline, but it
could still also be some valid SSA form; and we can of course also
have hard registers in all RTL, so that needs to be dealt with sanely
some way as well  Lots of details, I don't see a crucial problem though,
probably means I need to look harder ;-) )


Segher

Li, Pan2 via Gcc-patches April 22, 2020, 10:30 p.m. UTC | #19

On Wed, 2020-04-22 at 15:50 -0500, Segher Boessenkool wrote:
> 
> > > In some ways it feels like it would be easier to resurrect RTL SSA :-)
> 
> Why was RTL SSA abandoned?
> 
> It might well work to keep everything in SSA form all the way to RA.
> Hrm, that doesn't sound bad at all :-)
> 
> (The PHIs need to be made explicit to something that resembles the
> machine code we will end up with, very early in the pipeline, but it
> could still also be some valid SSA form; and we can of course also
> have hard registers in all RTL, so that needs to be dealt with sanely
> some way as well  Lots of details, I don't see a crucial problem though,
> probably means I need to look harder ;-) )
Lack of time mostly.  There's some complications like subregs, argument registers
and the like.  But you can restrict ssa based analysis & optimizations to just
the set of pseudos that are in SSA form and do something more conservative on the
rest.

Jeff

Richard Biener April 23, 2020, 7:32 a.m. UTC | #20

On Thu, Apr 23, 2020 at 12:31 AM Jeff Law <law@redhat.com> wrote:
>
> On Wed, 2020-04-22 at 15:50 -0500, Segher Boessenkool wrote:
> >
> > > > In some ways it feels like it would be easier to resurrect RTL SSA :-)
> >
> > Why was RTL SSA abandoned?
> >
> > It might well work to keep everything in SSA form all the way to RA.
> > Hrm, that doesn't sound bad at all :-)
> >
> > (The PHIs need to be made explicit to something that resembles the
> > machine code we will end up with, very early in the pipeline, but it
> > could still also be some valid SSA form; and we can of course also
> > have hard registers in all RTL, so that needs to be dealt with sanely
> > some way as well  Lots of details, I don't see a crucial problem though,
> > probably means I need to look harder ;-) )
> Lack of time mostly.  There's some complications like subregs, argument registers
> and the like.  But you can restrict ssa based analysis & optimizations to just
> the set of pseudos that are in SSA form and do something more conservative on the
> rest.

I guess time is better spent on trying to extend GIMPLE + SSA up to RA, thus
make instruction selection on GIMPLE.  Back in time Steven spent quite some
time doing factored SSA but I don't remember either what went wrong or whether
simply DF appeared first.

Richard.

> Jeff
>

Segher Boessenkool April 23, 2020, 10:17 a.m. UTC | #21

On Thu, Apr 23, 2020 at 09:32:37AM +0200, Richard Biener wrote:
> On Thu, Apr 23, 2020 at 12:31 AM Jeff Law <law@redhat.com> wrote:
> > On Wed, 2020-04-22 at 15:50 -0500, Segher Boessenkool wrote:
> > > > > In some ways it feels like it would be easier to resurrect RTL SSA :-)
> > >
> > > Why was RTL SSA abandoned?
> > >
> > > It might well work to keep everything in SSA form all the way to RA.
> > > Hrm, that doesn't sound bad at all :-)
> > >
> > > (The PHIs need to be made explicit to something that resembles the
> > > machine code we will end up with, very early in the pipeline, but it
> > > could still also be some valid SSA form; and we can of course also
> > > have hard registers in all RTL, so that needs to be dealt with sanely
> > > some way as well  Lots of details, I don't see a crucial problem though,
> > > probably means I need to look harder ;-) )
> > Lack of time mostly.  There's some complications like subregs, argument registers
> > and the like.  But you can restrict ssa based analysis & optimizations to just
> > the set of pseudos that are in SSA form and do something more conservative on the
> > rest.
> 
> I guess time is better spent on trying to extend GIMPLE + SSA up to RA, thus
> make instruction selection on GIMPLE.

I think this is a bad idea.  By the time you have invented enough new
"lower GIMPLE" ("limple"?) to be able to use it to describe machine
insns like we can with RTL, you will have a more verbose, more memory
hungry, slower, etc. reinvented RTL.

RTL is a *feature*, and it is one of the things that makes GCC
significantly better than the competition.

More optimisations should move to GIMPLE, for example some loop
optimisations should be done much earlier (most unrolling).  The expand
pass should lose most of the "optimisations" it has built up over the
decades (that now often are detrimental at best).  Some of what expand
now does should probably be done while still in GIMPLE, even.

But it is very useful to have a separate "low level" representation,
that is actually close to the machine code we will eventually generate.
RTL is one such representation, and we already have it, and it is very
well tuned by now -- throwing it away would require some *huge*
advantage, because the costs of doing that are immense as well.

Segher

Richard Biener April 23, 2020, 10:52 a.m. UTC | #22

On Thu, Apr 23, 2020 at 12:17 PM Segher Boessenkool
<segher@kernel.crashing.org> wrote:
>
> On Thu, Apr 23, 2020 at 09:32:37AM +0200, Richard Biener wrote:
> > On Thu, Apr 23, 2020 at 12:31 AM Jeff Law <law@redhat.com> wrote:
> > > On Wed, 2020-04-22 at 15:50 -0500, Segher Boessenkool wrote:
> > > > > > In some ways it feels like it would be easier to resurrect RTL SSA :-)
> > > >
> > > > Why was RTL SSA abandoned?
> > > >
> > > > It might well work to keep everything in SSA form all the way to RA.
> > > > Hrm, that doesn't sound bad at all :-)
> > > >
> > > > (The PHIs need to be made explicit to something that resembles the
> > > > machine code we will end up with, very early in the pipeline, but it
> > > > could still also be some valid SSA form; and we can of course also
> > > > have hard registers in all RTL, so that needs to be dealt with sanely
> > > > some way as well  Lots of details, I don't see a crucial problem though,
> > > > probably means I need to look harder ;-) )
> > > Lack of time mostly.  There's some complications like subregs, argument registers
> > > and the like.  But you can restrict ssa based analysis & optimizations to just
> > > the set of pseudos that are in SSA form and do something more conservative on the
> > > rest.
> >
> > I guess time is better spent on trying to extend GIMPLE + SSA up to RA, thus
> > make instruction selection on GIMPLE.
>
> I think this is a bad idea.  By the time you have invented enough new
> "lower GIMPLE" ("limple"?) to be able to use it to describe machine
> insns like we can with RTL, you will have a more verbose, more memory
> hungry, slower, etc. reinvented RTL.

I don't think there's much to invent.

I think at least one step would be uncontroversical(?), namely moving
the RTL expansion "magic"
up to a GIMPLE pass.  Where the "magic" would be to turn
GIMPLE stmts not directly expandable via an existing optab into
GIMPLE that can be trivially expanded.  That includes eventually
combining multiple stmts into more powerful instructions and
doing the magic we have in, like, expand_binop (widening, etc.).
Where there's not a 1:1 mapping of a GIMPLE stmt to an optab
GIMPLE gets direct-internal-fn calls.
Then RTL expansion would be mostly invoking gen_insn (optab-code).

More controversical would be ending up in GIMPLE there.  I think
GIMPLE can handle all RTL insns if we massage GIMPLE_ASM
a bit.  You'd end up with, say,

 asm ("(set (reg:DI $0)
                (and:DI (reg/v:DI $1 [ dst ])
                    (reg:DI $2)))" : "r" (_1) : "r" (_2), "r" (_3) : "cc");

in place of

  _1 = _2 & _3;

and the GIMPLE_ASM text could be actual RTL.  We'd extend
the stmt with an extra operand to denote recognized patterns,
so another option would be to keep the original GIMPLE as well.

> RTL is a *feature*, and it is one of the things that makes GCC
> significantly better than the competition.

That said, I actually agree with that.  It's just that I hope we can
make some of the knowledge just represented on the RTL side
available on the GIMPLE side.  The more complicated parts,
like calling conventions, that is.

And yes, I want to get rid of that expand monster to be able to
do something like sched1 on "GIMPLE" without expand coming
along and re-scheduling everything at-will.

> More optimisations should move to GIMPLE, for example some loop
> optimisations should be done much earlier (most unrolling).  The expand
> pass should lose most of the "optimisations" it has built up over the
> decades (that now often are detrimental at best).  Some of what expand
> now does should probably be done while still in GIMPLE, even.
>
> But it is very useful to have a separate "low level" representation,
> that is actually close to the machine code we will eventually generate.
> RTL is one such representation, and we already have it, and it is very
> well tuned by now -- throwing it away would require some *huge*
> advantage, because the costs of doing that are immense as well.

But being stuck with something means no progress...  I know
very well it's 100 times harder to get rid of something than to
add something new ontop.

Richard.

>
> Segher

Segher Boessenkool April 23, 2020, 12:07 p.m. UTC | #23

On Thu, Apr 23, 2020 at 12:52:30PM +0200, Richard Biener wrote:
> On Thu, Apr 23, 2020 at 12:17 PM Segher Boessenkool
> <segher@kernel.crashing.org> wrote:
> >
> > On Thu, Apr 23, 2020 at 09:32:37AM +0200, Richard Biener wrote:
> > > On Thu, Apr 23, 2020 at 12:31 AM Jeff Law <law@redhat.com> wrote:
> > > > On Wed, 2020-04-22 at 15:50 -0500, Segher Boessenkool wrote:
> > > > > > > In some ways it feels like it would be easier to resurrect RTL SSA :-)
> > > > >
> > > > > Why was RTL SSA abandoned?
> > > > >
> > > > > It might well work to keep everything in SSA form all the way to RA.
> > > > > Hrm, that doesn't sound bad at all :-)
> > > > >
> > > > > (The PHIs need to be made explicit to something that resembles the
> > > > > machine code we will end up with, very early in the pipeline, but it
> > > > > could still also be some valid SSA form; and we can of course also
> > > > > have hard registers in all RTL, so that needs to be dealt with sanely
> > > > > some way as well  Lots of details, I don't see a crucial problem though,
> > > > > probably means I need to look harder ;-) )
> > > > Lack of time mostly.  There's some complications like subregs, argument registers
> > > > and the like.  But you can restrict ssa based analysis & optimizations to just
> > > > the set of pseudos that are in SSA form and do something more conservative on the
> > > > rest.
> > >
> > > I guess time is better spent on trying to extend GIMPLE + SSA up to RA, thus
> > > make instruction selection on GIMPLE.
> >
> > I think this is a bad idea.  By the time you have invented enough new
> > "lower GIMPLE" ("limple"?) to be able to use it to describe machine
> > insns like we can with RTL, you will have a more verbose, more memory
> > hungry, slower, etc. reinvented RTL.
> 
> I don't think there's much to invent.
> 
> I think at least one step would be uncontroversical(?), namely moving
> the RTL expansion "magic"
> up to a GIMPLE pass.  Where the "magic" would be to turn
> GIMPLE stmts not directly expandable via an existing optab into
> GIMPLE that can be trivially expanded.  That includes eventually
> combining multiple stmts into more powerful instructions and
> doing the magic we have in, like, expand_binop (widening, etc.).
> Where there's not a 1:1 mapping of a GIMPLE stmt to an optab
> GIMPLE gets direct-internal-fn calls.
> Then RTL expansion would be mostly invoking gen_insn (optab-code).

Most of expand is *other stuff*.  Expand does a *lot* of things that are
actually changing the code.  And much of that is not done anywhere else
either yet, so this cannot be fixed by simply deleting the offending code.

> More controversical would be ending up in GIMPLE there.  I think
> GIMPLE can handle all RTL insns if we massage GIMPLE_ASM
> a bit.  You'd end up with, say,
> 
>  asm ("(set (reg:DI $0)
>                 (and:DI (reg/v:DI $1 [ dst ])
>                     (reg:DI $2)))" : "r" (_1) : "r" (_2), "r" (_3) : "cc");
> 
> in place of
> 
>   _1 = _2 & _3;
> 
> and the GIMPLE_ASM text could be actual RTL.  We'd extend
> the stmt with an extra operand to denote recognized patterns,
> so another option would be to keep the original GIMPLE as well.

Why would you ever want to do that?  That would take much more memory,
and RTL's memory use until recently always was a pain point.

> > RTL is a *feature*, and it is one of the things that makes GCC
> > significantly better than the competition.
> 
> That said, I actually agree with that.  It's just that I hope we can
> make some of the knowledge just represented on the RTL side
> available on the GIMPLE side.  The more complicated parts,
> like calling conventions, that is.

Yeah, and like I said, some things (unroll...) should move to GIMPLE as
well, most of it anyway.  And some of the remaining RTL code needs a
good overhaul (oh hello CSE).

> And yes, I want to get rid of that expand monster to be able to
> do something like sched1 on "GIMPLE" without expand coming
> along and re-scheduling everything at-will.

Right, ideally, expand would just translate GIMPLE to RTL one-to-one
(well, few-to-few, whatever :-) ).  But it does so much other stuff now,
so all that has to be moved or reimplemented or whatever.

> > More optimisations should move to GIMPLE, for example some loop
> > optimisations should be done much earlier (most unrolling).  The expand
> > pass should lose most of the "optimisations" it has built up over the
> > decades (that now often are detrimental at best).  Some of what expand
> > now does should probably be done while still in GIMPLE, even.
> >
> > But it is very useful to have a separate "low level" representation,
> > that is actually close to the machine code we will eventually generate.
> > RTL is one such representation, and we already have it, and it is very
> > well tuned by now -- throwing it away would require some *huge*
> > advantage, because the costs of doing that are immense as well.
> 
> But being stuck with something means no progress...  I know
> very well it's 100 times harder to get rid of something than to
> add something new ontop.

Well, what progress do you expect to make?  After expand that is :-)

Most of what is done in RTL is done very well.

Replacing specific things in RTL, maybe as big as whole passes, already
is hard to do without regressing (a *lot*).  And if there is no real
reason to do that...


Segher

Richard Biener April 23, 2020, 12:25 p.m. UTC | #24

On Thu, Apr 23, 2020 at 2:07 PM Segher Boessenkool
<segher@kernel.crashing.org> wrote:
>
> On Thu, Apr 23, 2020 at 12:52:30PM +0200, Richard Biener wrote:
> > On Thu, Apr 23, 2020 at 12:17 PM Segher Boessenkool
> > <segher@kernel.crashing.org> wrote:
> > >
> > > On Thu, Apr 23, 2020 at 09:32:37AM +0200, Richard Biener wrote:
> > > > On Thu, Apr 23, 2020 at 12:31 AM Jeff Law <law@redhat.com> wrote:
> > > > > On Wed, 2020-04-22 at 15:50 -0500, Segher Boessenkool wrote:
> > > > > > > > In some ways it feels like it would be easier to resurrect RTL SSA :-)
> > > > > >
> > > > > > Why was RTL SSA abandoned?
> > > > > >
> > > > > > It might well work to keep everything in SSA form all the way to RA.
> > > > > > Hrm, that doesn't sound bad at all :-)
> > > > > >
> > > > > > (The PHIs need to be made explicit to something that resembles the
> > > > > > machine code we will end up with, very early in the pipeline, but it
> > > > > > could still also be some valid SSA form; and we can of course also
> > > > > > have hard registers in all RTL, so that needs to be dealt with sanely
> > > > > > some way as well  Lots of details, I don't see a crucial problem though,
> > > > > > probably means I need to look harder ;-) )
> > > > > Lack of time mostly.  There's some complications like subregs, argument registers
> > > > > and the like.  But you can restrict ssa based analysis & optimizations to just
> > > > > the set of pseudos that are in SSA form and do something more conservative on the
> > > > > rest.
> > > >
> > > > I guess time is better spent on trying to extend GIMPLE + SSA up to RA, thus
> > > > make instruction selection on GIMPLE.
> > >
> > > I think this is a bad idea.  By the time you have invented enough new
> > > "lower GIMPLE" ("limple"?) to be able to use it to describe machine
> > > insns like we can with RTL, you will have a more verbose, more memory
> > > hungry, slower, etc. reinvented RTL.
> >
> > I don't think there's much to invent.
> >
> > I think at least one step would be uncontroversical(?), namely moving
> > the RTL expansion "magic"
> > up to a GIMPLE pass.  Where the "magic" would be to turn
> > GIMPLE stmts not directly expandable via an existing optab into
> > GIMPLE that can be trivially expanded.  That includes eventually
> > combining multiple stmts into more powerful instructions and
> > doing the magic we have in, like, expand_binop (widening, etc.).
> > Where there's not a 1:1 mapping of a GIMPLE stmt to an optab
> > GIMPLE gets direct-internal-fn calls.
> > Then RTL expansion would be mostly invoking gen_insn (optab-code).
>
> Most of expand is *other stuff*.  Expand does a *lot* of things that are
> actually changing the code.  And much of that is not done anywhere else
> either yet, so this cannot be fixed by simply deleting the offending code.
>
> > More controversical would be ending up in GIMPLE there.  I think
> > GIMPLE can handle all RTL insns if we massage GIMPLE_ASM
> > a bit.  You'd end up with, say,
> >
> >  asm ("(set (reg:DI $0)
> >                 (and:DI (reg/v:DI $1 [ dst ])
> >                     (reg:DI $2)))" : "r" (_1) : "r" (_2), "r" (_3) : "cc");
> >
> > in place of
> >
> >   _1 = _2 & _3;
> >
> > and the GIMPLE_ASM text could be actual RTL.  We'd extend
> > the stmt with an extra operand to denote recognized patterns,
> > so another option would be to keep the original GIMPLE as well.
>
> Why would you ever want to do that?  That would take much more memory,
> and RTL's memory use until recently always was a pain point.

It's not the RTL IL that uses much memory, it's infrastructure like DF
that easily blows up.  Mind that GCC has only a single function in
RTL at a time but the whole program in GIMPLE.  And GIMPLE
includes "DF" by default.

> > > RTL is a *feature*, and it is one of the things that makes GCC
> > > significantly better than the competition.
> >
> > That said, I actually agree with that.  It's just that I hope we can
> > make some of the knowledge just represented on the RTL side
> > available on the GIMPLE side.  The more complicated parts,
> > like calling conventions, that is.
>
> Yeah, and like I said, some things (unroll...) should move to GIMPLE as
> well, most of it anyway.  And some of the remaining RTL code needs a
> good overhaul (oh hello CSE).
>
> > And yes, I want to get rid of that expand monster to be able to
> > do something like sched1 on "GIMPLE" without expand coming
> > along and re-scheduling everything at-will.
>
> Right, ideally, expand would just translate GIMPLE to RTL one-to-one
> (well, few-to-few, whatever :-) ).  But it does so much other stuff now,
> so all that has to be moved or reimplemented or whatever.
>
> > > More optimisations should move to GIMPLE, for example some loop
> > > optimisations should be done much earlier (most unrolling).  The expand
> > > pass should lose most of the "optimisations" it has built up over the
> > > decades (that now often are detrimental at best).  Some of what expand
> > > now does should probably be done while still in GIMPLE, even.
> > >
> > > But it is very useful to have a separate "low level" representation,
> > > that is actually close to the machine code we will eventually generate.
> > > RTL is one such representation, and we already have it, and it is very
> > > well tuned by now -- throwing it away would require some *huge*
> > > advantage, because the costs of doing that are immense as well.
> >
> > But being stuck with something means no progress...  I know
> > very well it's 100 times harder to get rid of something than to
> > add something new ontop.
>
> Well, what progress do you expect to make?  After expand that is :-)

I'd like the RTL pipeline before RA to shrink significantly, no PRE,
no CSE, ...  The important part before RA is robust and intelligent
instruction selection - part of which is already done at RTL expansion
time.

> Most of what is done in RTL is done very well.

Umm, well...  I beg to differ with regard to DF and passes like
postreload-gcse.

> Replacing specific things in RTL, maybe as big as whole passes, already
> is hard to do without regressing (a *lot*).  And if there is no real
> reason to do that...

The motivation is to make GCC faster, obviously.  Not spending time
doing things that should have been done before (RTL PRE vs. GIMPLE PRE, etc.).
Using the same infrastructure (what, no loop dependency analysis on RTL?), etc.

You could say we should do more on RTL and enhance it instead, like do
vectorization where we actually could have a better idea on costs and
capabilities.  But I'm a GIMPLE person and don't know enough of RTL to
enhance it ...

Richard.

>
> Segher

Segher Boessenkool April 23, 2020, 12:52 p.m. UTC | #25

On Thu, Apr 23, 2020 at 02:25:40PM +0200, Richard Biener wrote:
> > > But being stuck with something means no progress...  I know
> > > very well it's 100 times harder to get rid of something than to
> > > add something new ontop.
> >
> > Well, what progress do you expect to make?  After expand that is :-)
> 
> I'd like the RTL pipeline before RA to shrink significantly, no PRE,
> no CSE, ...

RTL CSE for example is very much required to get any good code.  It
needs to CSE stuff that wasn't there before expand.

The pass currently does much more (as well as not enough), of course.

> The important part before RA is robust and intelligent
> instruction selection - part of which is already done at RTL expansion
> time.

LOL.

The expand pass doesn't often make good choices, and it *shouldn't*, it
should not make many choices at all; it should just generate valid RTL,
new pseudos for everything, and let later RTL passes make faster code
from that.

> > Most of what is done in RTL is done very well.
> 
> Umm, well...  I beg to differ with regard to DF and passes like
> postreload-gcse.

What is wrong with DF?

Is there something particular in postreload-gcse that is bad?  To me it
always is just one of those passes that doesn't do anything :-)  That
can and should be cleaned up, sure :-)

> > Replacing specific things in RTL, maybe as big as whole passes, already
> > is hard to do without regressing (a *lot*).  And if there is no real
> > reason to do that...
> 
> The motivation is to make GCC faster, obviously.  Not spending time
> doing things that should have been done before (RTL PRE vs. GIMPLE PRE, etc.).
> Using the same infrastructure (what, no loop dependency analysis on RTL?), etc.

But everything you want to remove isn't high on profiles anyway?  And
you proposed adding bigger, slower, stuff to replace it all with.

Slow are RA, and the language frontends (esp. C++ like to take 15% of
total :-/)

> You could say we should do more on RTL and enhance it instead, like do
> vectorization where we actually could have a better idea on costs and
> capabilities.  But I'm a GIMPLE person and don't know enough of RTL to
> enhance it ...

Oh no, I think we should do more earlier, and GIMPLE is a fine IR for
there.  But for low-level, close-to-the-machine stuff, RTL is much
better suited.  And we *do* want to optimise at that level as well, and
much more than just peepholes.

Segher

Richard Biener April 23, 2020, 1:07 p.m. UTC | #26

On Thu, Apr 23, 2020 at 2:52 PM Segher Boessenkool
<segher@kernel.crashing.org> wrote:
>
> On Thu, Apr 23, 2020 at 02:25:40PM +0200, Richard Biener wrote:
> > > > But being stuck with something means no progress...  I know
> > > > very well it's 100 times harder to get rid of something than to
> > > > add something new ontop.
> > >
> > > Well, what progress do you expect to make?  After expand that is :-)
> >
> > I'd like the RTL pipeline before RA to shrink significantly, no PRE,
> > no CSE, ...
>
> RTL CSE for example is very much required to get any good code.  It
> needs to CSE stuff that wasn't there before expand.

Sure, but then we should fix that!

> The pass currently does much more (as well as not enough), of course.
>
> > The important part before RA is robust and intelligent
> > instruction selection - part of which is already done at RTL expansion
> > time.
>
> LOL.
>
> The expand pass doesn't often make good choices, and it *shouldn't*, it
> should not make many choices at all; it should just generate valid RTL,
> new pseudos for everything, and let later RTL passes make faster code
> from that.

But valid RTL is instructions that are recognized.  Which means
when the target doesn't support an SImode add we may not create
one.  That's instruction selection ;)

> > > Most of what is done in RTL is done very well.
> >
> > Umm, well...  I beg to differ with regard to DF and passes like
> > postreload-gcse.
>
> What is wrong with DF?

It's slow and memory hungry?

> Is there something particular in postreload-gcse that is bad?  To me it
> always is just one of those passes that doesn't do anything :-)  That
> can and should be cleaned up, sure :-)

postreload-gcse is ad-hoc, it uses full blown gcse tools that easily
blow up (compute_transp) when it doesn't really require it
(Ive fixed things up a bit in dc91c65378cd0e6c0).  But I wonder why,
if we want to do PRE of loads, we don't simply schedule another
gcse pass rather than implementing a new one.  IIRC what the pass
does could be done with much more local dataflow.  Both
postreload gcse and cse are major time-hogs on "bad" testcases :/

> > > Replacing specific things in RTL, maybe as big as whole passes, already
> > > is hard to do without regressing (a *lot*).  And if there is no real
> > > reason to do that...
> >
> > The motivation is to make GCC faster, obviously.  Not spending time
> > doing things that should have been done before (RTL PRE vs. GIMPLE PRE, etc.).
> > Using the same infrastructure (what, no loop dependency analysis on RTL?), etc.
>
> But everything you want to remove isn't high on profiles anyway?  And
> you proposed adding bigger, slower, stuff to replace it all with.
>
> Slow are RA, and the language frontends (esp. C++ like to take 15% of
> total :-/)
>
> > You could say we should do more on RTL and enhance it instead, like do
> > vectorization where we actually could have a better idea on costs and
> > capabilities.  But I'm a GIMPLE person and don't know enough of RTL to
> > enhance it ...
>
> Oh no, I think we should do more earlier, and GIMPLE is a fine IR for
> there.  But for low-level, close-to-the-machine stuff, RTL is much
> better suited.  And we *do* want to optimise at that level as well, and
> much more than just peepholes.

Well, everything that requires costing (unrolling, vectorization,
IV selection to name a few) _is_ close-to-the-machine.  We're
just saying they are not because GIMPLE is so much easier to
work with here (not sure why exactly...).

Richard.

>
> Segher

Segher Boessenkool April 23, 2020, 1:43 p.m. UTC | #27

On Thu, Apr 23, 2020 at 03:07:23PM +0200, Richard Biener wrote:
> On Thu, Apr 23, 2020 at 2:52 PM Segher Boessenkool
> <segher@kernel.crashing.org> wrote:
> >
> > On Thu, Apr 23, 2020 at 02:25:40PM +0200, Richard Biener wrote:
> > > > > But being stuck with something means no progress...  I know
> > > > > very well it's 100 times harder to get rid of something than to
> > > > > add something new ontop.
> > > >
> > > > Well, what progress do you expect to make?  After expand that is :-)
> > >
> > > I'd like the RTL pipeline before RA to shrink significantly, no PRE,
> > > no CSE, ...
> >
> > RTL CSE for example is very much required to get any good code.  It
> > needs to CSE stuff that wasn't there before expand.
> 
> Sure, but then we should fix that!

The expand pass, but also many RTL passes, will naturally generate code
that can be CSEd.  You don't want passes to have to do this themselves:
for example, this can be constants used to implement some standard
patterns in target code, etc.

> > LOL.
> >
> > The expand pass doesn't often make good choices, and it *shouldn't*, it
> > should not make many choices at all; it should just generate valid RTL,
> > new pseudos for everything, and let later RTL passes make faster code
> > from that.
> 
> But valid RTL is instructions that are recognized.  Which means
> when the target doesn't support an SImode add we may not create
> one.  That's instruction selection ;)

In that sense, you can call all RTL passes instruction selection?
Usually I understand more like combine, cprop, fwprop, cse, ifcvt,
splitters, peepholes.  That kind of thing :-)

Pretty much all of the RTL passes before RA, and a few after it.

> > > > Most of what is done in RTL is done very well.
> > >
> > > Umm, well...  I beg to differ with regard to DF and passes like
> > > postreload-gcse.
> >
> > What is wrong with DF?
> 
> It's slow and memory hungry?

Very true, of course.  But can this be significantly better?

> > Is there something particular in postreload-gcse that is bad?  To me it
> > always is just one of those passes that doesn't do anything :-)  That
> > can and should be cleaned up, sure :-)
> 
> postreload-gcse is ad-hoc, it uses full blown gcse tools that easily
> blow up (compute_transp) when it doesn't really require it
> (Ive fixed things up a bit in dc91c65378cd0e6c0).  But I wonder why,
> if we want to do PRE of loads, we don't simply schedule another
> gcse pass rather than implementing a new one.  IIRC what the pass
> does could be done with much more local dataflow.  Both
> postreload gcse and cse are major time-hogs on "bad" testcases :/

RTL CSE?  Really?  It just loves to give up early (which is a bad thing
of course, but that makes it take bounded time, and *less* on bad
testcases :-) )

So the "normal" gcse does not have this problem?

> > Oh no, I think we should do more earlier, and GIMPLE is a fine IR for
> > there.  But for low-level, close-to-the-machine stuff, RTL is much
> > better suited.  And we *do* want to optimise at that level as well, and
> > much more than just peepholes.
> 
> Well, everything that requires costing (unrolling, vectorization,
> IV selection to name a few) _is_ close-to-the-machine.  We're
> just saying they are not because GIMPLE is so much easier to
> work with here (not sure why exactly...).

Those transforms aren't close to the machine, not in the same way,
because they are beneficial independent of what exact instruction
sequences are generated.

Both are nasty in that both have cases doing the transform actually
hurts quite a bit; but *not* doing it where it *could* costs a lot as
well.  But other than that "little" issue ;-)


Segher

Li, Pan2 via Gcc-patches April 23, 2020, 2:29 p.m. UTC | #28

On Thu, 2020-04-23 at 07:07 -0500, Segher Boessenkool wrote:
> 
> > I think at least one step would be uncontroversical(?), namely moving
> > the RTL expansion "magic"
> > up to a GIMPLE pass.  Where the "magic" would be to turn
> > GIMPLE stmts not directly expandable via an existing optab into
> > GIMPLE that can be trivially expanded.  That includes eventually
> > combining multiple stmts into more powerful instructions and
> > doing the magic we have in, like, expand_binop (widening, etc.).
> > Where there's not a 1:1 mapping of a GIMPLE stmt to an optab
> > GIMPLE gets direct-internal-fn calls.
> > Then RTL expansion would be mostly invoking gen_insn (optab-code).
> 
> Most of expand is *other stuff*.  Expand does a *lot* of things that are
> actually changing the code.  And much of that is not done anywhere else
> either yet, so this cannot be fixed by simply deleting the offending code.
A lot of what is done by the expansion pass is historical and dates back to when
we never optimized more than a statement at a time for trees and the real heavy
lifting was all done in RTL.

THe introduction of tree-ssa meant that all the expansion code which wanted to
see a "nice" complex statement and generate good RTL code was useless and as a
result we saw significant regressions in the end code quality.

That in turn brought in TER who's sole purpose was to reconstruct those more
complex trees for the purposes of improving initial expansion.  I think match.pd
has the potential to make TER go away as match.pd is a generic combining
framework.

WRT extending SSA deeper, I'm all for it and it's always been something I wanted
to see happen.    I've always believed we could do RTL SSA to the register
allocation phase and that doing so would be a positive.  If we start at the front
of the RTL pipeline and work towards the back we bump into CSE quickly, which
would be good.  Our CSE is awful across multiple axis.  I suspect most RTL passes
would be reimplementations rather than bolting SSA on the side and I suspect
those reimplementations would be simpler than the existing stuff -- I'm a strong
believer there's a lot of dead code in the RTL passes as well.

jeff

Li, Pan2 via Gcc-patches April 23, 2020, 2:40 p.m. UTC | #29

On Thu, 2020-04-23 at 15:07 +0200, Richard Biener wrote:
> On Thu, Apr 23, 2020 at 2:52 PM Segher Boessenkool
> <segher@kernel.crashing.org> wrote:
> > On Thu, Apr 23, 2020 at 02:25:40PM +0200, Richard Biener wrote:
> > > > > But being stuck with something means no progress...  I know
> > > > > very well it's 100 times harder to get rid of something than to
> > > > > add something new ontop.
> > > > 
> > > > Well, what progress do you expect to make?  After expand that is :-)
> > > 
> > > I'd like the RTL pipeline before RA to shrink significantly, no PRE,
> > > no CSE, ...
> > 
> > RTL CSE for example is very much required to get any good code.  It
> > needs to CSE stuff that wasn't there before expand.
> 
> Sure, but then we should fix that!
Exactly.  It's purpose largely becomes dealing with the redundancies exposed by
expansion.  ie, address arithmetic and the like.   A lot of its path following
code should be throttled back.

> 
> But valid RTL is instructions that are recognized.  Which means
> when the target doesn't support an SImode add we may not create
> one.  That's instruction selection ;)
That's always a point of tension.  But I think that in general continuing to have
targets claim to support things they do not (such as double-wordsize arithmetic,
logicals, moves, etc) is a mistake.  It made sense at one time, but I think we've
got better mechansisms in place to deal with this stuff now.

> 
> > Is there something particular in postreload-gcse that is bad?  To me it
> > always is just one of those passes that doesn't do anything :-)  That
> > can and should be cleaned up, sure :-)
> 
> postreload-gcse is ad-hoc, it uses full blown gcse tools that easily
> blow up (compute_transp) when it doesn't really require it
> (Ive fixed things up a bit in dc91c65378cd0e6c0).  But I wonder why,
> if we want to do PRE of loads, we don't simply schedule another
> gcse pass rather than implementing a new one.  IIRC what the pass
> does could be done with much more local dataflow.  Both
> postreload gcse and cse are major time-hogs on "bad" testcases :/
I think the biggest reason is the existing gcse bits inherently assume they can
create new registers.  It's deeply baked into gcse.c.  There's ways around that,
but it's likely a lot of work.

> 
> > Oh no, I think we should do more earlier, and GIMPLE is a fine IR for
> > there.  But for low-level, close-to-the-machine stuff, RTL is much
> > better suited.  And we *do* want to optimise at that level as well, and
> > much more than just peepholes.
> 
> Well, everything that requires costing (unrolling, vectorization,
> IV selection to name a few) _is_ close-to-the-machine.  We're
> just saying they are not because GIMPLE is so much easier to
> work with here (not sure why exactly...).
The primary motivation behind discouraging target costing and the like from
gimple was to make it easier to implement and predict the behavior of the gimple
optimizers.   We've relaxed that somewhat, particularly for vectorization, but I
think the principle is still solid.

But I think there is a place for adding target dependencies -- and that's at the
end of the current gimple pipeline.

Jeff

Eric Botcazou April 23, 2020, 2:54 p.m. UTC | #30

> > > What is wrong with DF?
> > 
> > It's slow and memory hungry?
> 
> Very true, of course.  But can this be significantly better?

That's a good question worth investigating in my opinion, because DF didn't 
quite achieve its initial goal of replacing all the custom liveness analysis 
passes only because of its performance; the rest is quite neat.  But IIRC 
long-time GCC hackers were involved in it, so it was probably tuned already.

Richard Sandiford April 23, 2020, 3:32 p.m. UTC | #31

Jeff Law via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> On Thu, 2020-04-23 at 15:07 +0200, Richard Biener wrote:
>> On Thu, Apr 23, 2020 at 2:52 PM Segher Boessenkool
>> <segher@kernel.crashing.org> wrote:
>> > On Thu, Apr 23, 2020 at 02:25:40PM +0200, Richard Biener wrote:
>> > > > > But being stuck with something means no progress...  I know
>> > > > > very well it's 100 times harder to get rid of something than to
>> > > > > add something new ontop.
>> > > > 
>> > > > Well, what progress do you expect to make?  After expand that is :-)
>> > > 
>> > > I'd like the RTL pipeline before RA to shrink significantly, no PRE,
>> > > no CSE, ...
>> > 
>> > RTL CSE for example is very much required to get any good code.  It
>> > needs to CSE stuff that wasn't there before expand.
>> 
>> Sure, but then we should fix that!
> Exactly.  It's purpose largely becomes dealing with the redundancies exposed by
> expansion.  ie, address arithmetic and the like.   A lot of its path following
> code should be throttled back.

Agreed.  But things like address legitimisation and ensuring immediate
operands are in-range could be done in gimple too, and probably be
optimised more effectively and efficiently in SSA form than in RTL.
The ultimate question then wouldn't just be "does the target support
this optab?" but also "are these operands already legitimate for the
optab"?

I also wonder how difficult it would be to get recog to recognise
gimple :-)

Thanks,
Richard

Li, Pan2 via Gcc-patches April 23, 2020, 3:55 p.m. UTC | #32

On Thu, 2020-04-23 at 16:32 +0100, Richard Sandiford wrote:
> Jeff Law via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> > On Thu, 2020-04-23 at 15:07 +0200, Richard Biener wrote:
> > > On Thu, Apr 23, 2020 at 2:52 PM Segher Boessenkool
> > > <segher@kernel.crashing.org> wrote:
> > > > On Thu, Apr 23, 2020 at 02:25:40PM +0200, Richard Biener wrote:
> > > > > > > But being stuck with something means no progress...  I know
> > > > > > > very well it's 100 times harder to get rid of something than to
> > > > > > > add something new ontop.
> > > > > > 
> > > > > > Well, what progress do you expect to make?  After expand that is :-)
> > > > > 
> > > > > I'd like the RTL pipeline before RA to shrink significantly, no PRE,
> > > > > no CSE, ...
> > > > 
> > > > RTL CSE for example is very much required to get any good code.  It
> > > > needs to CSE stuff that wasn't there before expand.
> > > 
> > > Sure, but then we should fix that!
> > Exactly.  It's purpose largely becomes dealing with the redundancies exposed
> > by
> > expansion.  ie, address arithmetic and the like.   A lot of its path
> > following
> > code should be throttled back.
> 
> Agreed.  But things like address legitimisation and ensuring immediate
> operands are in-range could be done in gimple too, and probably be
> optimised more effectively and efficiently in SSA form than in RTL.
Yea, but once exposed we'd probably need some kind of way to prevent something
like an out-of-range immediate from being propagated back into the statement. 
Certainly do-able though.  Maybe significantly easier than resurrecting RTL-SSA.

> The ultimate question then wouldn't just be "does the target support
> this optab?" but also "are these operands already legitimate for the
> optab"?
> 
> I also wonder how difficult it would be to get recog to recognise
> gimple :-)
:-)  You know better than I...  The more structured nature of gimple probably
helps that effort. 

jeff
>

Segher Boessenkool April 23, 2020, 5:24 p.m. UTC | #33

Hi!

On Thu, Apr 23, 2020 at 08:29:34AM -0600, Jeff Law wrote:
> On Thu, 2020-04-23 at 07:07 -0500, Segher Boessenkool wrote:
> > Most of expand is *other stuff*.  Expand does a *lot* of things that are
> > actually changing the code.  And much of that is not done anywhere else
> > either yet, so this cannot be fixed by simply deleting the offending code.
> A lot of what is done by the expansion pass is historical and dates back to when
> we never optimized more than a statement at a time for trees and the real heavy
> lifting was all done in RTL.

Yes.

> THe introduction of tree-ssa meant that all the expansion code which wanted to
> see a "nice" complex statement and generate good RTL code was useless and as a
> result we saw significant regressions in the end code quality.
> 
> That in turn brought in TER who's sole purpose was to reconstruct those more
> complex trees for the purposes of improving initial expansion.  I think match.pd
> has the potential to make TER go away as match.pd is a generic combining
> framework.

Ooh that would be great news as well!

For moving stuff out of expand, it needs to *move*, not just be deleted.

> WRT extending SSA deeper, I'm all for it and it's always been something I wanted
> to see happen.    I've always believed we could do RTL SSA to the register
> allocation phase and that doing so would be a positive.

Well, since we have hard regs as well, it cannot really be SSA?  And SSA
doesn't fit RTL all that well anyway?  But the "webs always" idea is the
same of course.

> If we start at the front
> of the RTL pipeline and work towards the back we bump into CSE quickly, which
> would be good.  Our CSE is awful across multiple axis.

RTL CSE does quite many things that aren't really CSE...  Have to figure
out for each whether it still is useful, and what to do with it then.

> I suspect most RTL passes
> would be reimplementations rather than bolting SSA on the side and I suspect
> those reimplementations would be simpler than the existing stuff -- I'm a strong
> believer there's a lot of dead code in the RTL passes as well.

Sure, but there also is a lot of stuff that perhaps is historical, and
perhaps is badly architected, but still does beneficial things.

If a reimplementation hugely improves things it's not bad in the end to
lose all such little things, but otherwise?


Segher

Segher Boessenkool April 23, 2020, 8:16 p.m. UTC | #34

On Thu, Apr 23, 2020 at 08:40:50AM -0600, Jeff Law wrote:
> On Thu, 2020-04-23 at 15:07 +0200, Richard Biener wrote:
> > On Thu, Apr 23, 2020 at 2:52 PM Segher Boessenkool
> > <segher@kernel.crashing.org> wrote:
> > > On Thu, Apr 23, 2020 at 02:25:40PM +0200, Richard Biener wrote:
> > > > > > But being stuck with something means no progress...  I know
> > > > > > very well it's 100 times harder to get rid of something than to
> > > > > > add something new ontop.
> > > > > 
> > > > > Well, what progress do you expect to make?  After expand that is :-)
> > > > 
> > > > I'd like the RTL pipeline before RA to shrink significantly, no PRE,
> > > > no CSE, ...
> > > 
> > > RTL CSE for example is very much required to get any good code.  It
> > > needs to CSE stuff that wasn't there before expand.
> > 
> > Sure, but then we should fix that!
> Exactly.  It's purpose largely becomes dealing with the redundancies exposed by
> expansion.  ie, address arithmetic and the like.   A lot of its path following
> code should be throttled back.

Hrm, I never thought about it like this.  CSE was always there, I never
stopped to question if we needed it :-)

Well, that's cse1 then.  What about cse2?

> > But valid RTL is instructions that are recognized.  Which means
> > when the target doesn't support an SImode add we may not create
> > one.  That's instruction selection ;)
> That's always a point of tension.  But I think that in general continuing to have
> targets claim to support things they do not (such as double-wordsize arithmetic,
> logicals, moves, etc) is a mistake.  It made sense at one time, but I think we've
> got better mechansisms in place to deal with this stuff now.

Different targets have *very* different insns for add, mul, div, shifts;
everything really.  Describing this at expand time with two-machine-word
operations works pretty bloody well, for most or all targets -- this is
just part of the power of define_expand (but an important part).  And
define_expand is very very useful, it's the swiss army escape hatch, it
lets you do everything optabs have a too small mind for.

> > > Oh no, I think we should do more earlier, and GIMPLE is a fine IR for
> > > there.  But for low-level, close-to-the-machine stuff, RTL is much
> > > better suited.  And we *do* want to optimise at that level as well, and
> > > much more than just peepholes.
> > 
> > Well, everything that requires costing (unrolling, vectorization,
> > IV selection to name a few) _is_ close-to-the-machine.  We're
> > just saying they are not because GIMPLE is so much easier to
> > work with here (not sure why exactly...).
> The primary motivation behind discouraging target costing and the like from
> gimple was to make it easier to implement and predict the behavior of the gimple
> optimizers.   We've relaxed that somewhat, particularly for vectorization, but I
> think the principle is still solid.

There are two kinds of costing.  The first only says which of A or B is
better; that can perhaps be done on GIMPLE already, using
target-specific costs.  The other gives a number to everything, which is
much harder to get anywhere close to usably correct (what does the
number even *mean*?  For performance, latency of the whole sequence is
the most important number, but that is not easy to work with, or what we
use for say insn_cost).

> 
> But I think there is a place for adding target dependencies -- and that's at the
> end of the current gimple pipeline.

There are a *few* things in GIMPLE that use target costs (ivopts...)
But yeah, most things should not.

Segher

Segher Boessenkool April 23, 2020, 8:18 p.m. UTC | #35

On Thu, Apr 23, 2020 at 04:32:51PM +0100, Richard Sandiford wrote:
> I also wonder how difficult it would be to get recog to recognise
> gimple :-)

Since recog recognises single (rtl) insns: hard to impossible?


Segher

Joseph Myers April 23, 2020, 8:26 p.m. UTC | #36

On Thu, 23 Apr 2020, Richard Biener via Gcc-patches wrote:

> I think at least one step would be uncontroversical(?), namely moving
> the RTL expansion "magic"
> up to a GIMPLE pass.  Where the "magic" would be to turn
> GIMPLE stmts not directly expandable via an existing optab into
> GIMPLE that can be trivially expanded.  That includes eventually

I suspect some such pieces should actually happen before some GIMPLE 
optimizations are done, rather than as the last pass before expansion - 
though then you need to make sure subsequent GIMPLE passes don't 
reintroduce the operations you've lowered.

E.g. the special handling of arithmetic on bit-field types in expand is a 
good candidate for being a lowering step (lowering to operations on 
integer widths actually supported by the machine / that correspond to 
known machine modes) somewhere in GIMPLE processing (and you don't want to 
subsequently narrow back to operations on those widths that don't exist in 
hardware).  Note in that regard that there is a proposal for C2x to 
support integer types with a specified number of bits more generally than 
just as bit-fields.

> That said, I actually agree with that.  It's just that I hope we can
> make some of the knowledge just represented on the RTL side
> available on the GIMPLE side.  The more complicated parts,
> like calling conventions, that is.

On the other hand, there is some knowledge like that which should move out 
of the front ends, into a GIMPLE lowering step (front ends should not be 
calling targetm.calls.promote_prototypes and changing their internal 
representation accordingly, that information should be handled later in 
the compiler, probably on GIMPLE).

Li, Pan2 via Gcc-patches April 23, 2020, 8:55 p.m. UTC | #37

On Thu, 2020-04-23 at 15:16 -0500, Segher Boessenkool wrote:
> On Thu, Apr 23, 2020 at 08:40:50AM -0600, Jeff Law wrote:
> > On Thu, 2020-04-23 at 15:07 +0200, Richard Biener wrote:
> > > On Thu, Apr 23, 2020 at 2:52 PM Segher Boessenkool
> > > <segher@kernel.crashing.org> wrote:
> > > > On Thu, Apr 23, 2020 at 02:25:40PM +0200, Richard Biener wrote:
> > > > > > > But being stuck with something means no progress...  I know
> > > > > > > very well it's 100 times harder to get rid of something than to
> > > > > > > add something new ontop.
> > > > > > 
> > > > > > Well, what progress do you expect to make?  After expand that is :-)
> > > > > 
> > > > > I'd like the RTL pipeline before RA to shrink significantly, no PRE,
> > > > > no CSE, ...
> > > > 
> > > > RTL CSE for example is very much required to get any good code.  It
> > > > needs to CSE stuff that wasn't there before expand.
> > > 
> > > Sure, but then we should fix that!
> > Exactly.  It's purpose largely becomes dealing with the redundancies exposed
> > by
> > expansion.  ie, address arithmetic and the like.   A lot of its path
> > following
> > code should be throttled back.
> 
> Hrm, I never thought about it like this.  CSE was always there, I never
> stopped to question if we needed it :-)
:-) It's a dog-slow pass that isn't nearly as important as it once was.
> 
> Well, that's cse1 then.  What about cse2?
cse2's original purpose was to clean up after the loop optimizers and it should
be doing even less work than cse1.

The steps in my mind are to see what's left in the first jump pass, then the cse
path following code, then the core of cse itself.  lower-subreg is a wart that
could likely go away if we stop lying about the target's capabilities.

> 
> > > But valid RTL is instructions that are recognized.  Which means
> > > when the target doesn't support an SImode add we may not create
> > > one.  That's instruction selection ;)
> > That's always a point of tension.  But I think that in general continuing to
> > have
> > targets claim to support things they do not (such as double-wordsize
> > arithmetic,
> > logicals, moves, etc) is a mistake.  It made sense at one time, but I think
> > we've
> > got better mechansisms in place to deal with this stuff now.
> 
> Different targets have *very* different insns for add, mul, div, shifts;
> everything really.  Describing this at expand time with two-machine-word
> operations works pretty bloody well, for most or all targets -- this is
> just part of the power of define_expand (but an important part).  And
> define_expand is very very useful, it's the swiss army escape hatch, it
> lets you do everything optabs have a too small mind for.
Absolutely true, but most of the double-word-mode-crap was in there because we
couldn't really describe things like the carry bit.  That's long since been fixed
and I bet if we handled just that the vast majority pretending the target has
double-word support becomes unnecessary (and the need for early lower-subreg is
then reduced significantly as well).

And if we look at the number of tricks that are used to do things like optimize
double-word moves in the target files, it's just insane.  If we stop lying and
take advantage of improvements over the last 20 years we end up killing lots of
crappy target code.


> There are two kinds of costing.  The first only says which of A or B is
> better; that can perhaps be done on GIMPLE already, using
> target-specific costs.  The other gives a number to everything, which is
> much harder to get anywhere close to usably correct (what does the
> number even *mean*?  For performance, latency of the whole sequence is
> the most important number, but that is not easy to work with, or what we
> use for say insn_cost).
True.  I'm referring mostly to traditional costing.  ie, given form A & B, which
is preferred for the target.  Full sequence latency is distinct issue, though we
let it bleed into the former (for example by rewriting sequences with similar
costs into a sequence with fewer dependencies).

> 
> > But I think there is a place for adding target dependencies -- and that's at
> > the
> > end of the current gimple pipeline.
> 
> There are a *few* things in GIMPLE that use target costs (ivopts...)
> But yeah, most things should not.
Precisely.  Avoiding target dependencies is the aspiration goal, but we have to
also be sensible and realize that some things really do require a degree of
target knowledge.
> 

Jeff

Richard Biener April 24, 2020, 6:15 a.m. UTC | #38

On Thu, Apr 23, 2020 at 4:54 PM Eric Botcazou <botcazou@adacore.com> wrote:
>
> > > > What is wrong with DF?
> > >
> > > It's slow and memory hungry?
> >
> > Very true, of course.  But can this be significantly better?
>
> That's a good question worth investigating in my opinion, because DF didn't
> quite achieve its initial goal of replacing all the custom liveness analysis
> passes only because of its performance; the rest is quite neat.  But IIRC
> long-time GCC hackers were involved in it, so it was probably tuned already.

Having looked at DF a few times one of its problems seems to be too much
indirection in its data structures which is probably a fallout of making it
on-the-side rather than a part of the core RTL data structures (like SSA
and immediate uses are on GIMPLE).  But then where it blows up is usually
during computing of the dataflow problems to create those data structures
and there's little room for improvement because it deals with a more complex
problem than we have on the GIMPLE side.  But passes are also mindlessly
throwing 'compute-me-everyting' flags at it without really thinking
what they need, so ...

In the end I question the need of true "global" optimization passes on the
RTL IL.  Almost everything should be local to interesting-CFG-shape regions.

And yes, it probably requires a further lowering on GIMPLE to expose things
like multi-reg operations, lowered bit-precision arithmetic, lowered
address generation and eventually constant legitimization (introducing a
constant pool already on GIMPLE, ideally maintaining it in the symbol
table).

Richard.

> --
> Eric Botcazou

Segher Boessenkool April 24, 2020, 11:13 a.m. UTC | #39

On Fri, Apr 24, 2020 at 08:15:36AM +0200, Richard Biener wrote:
> On Thu, Apr 23, 2020 at 4:54 PM Eric Botcazou <botcazou@adacore.com> wrote:
> > > > > What is wrong with DF?
> > > >
> > > > It's slow and memory hungry?
> > >
> > > Very true, of course.  But can this be significantly better?
> >
> > That's a good question worth investigating in my opinion, because DF didn't
> > quite achieve its initial goal of replacing all the custom liveness analysis
> > passes only because of its performance; the rest is quite neat.  But IIRC
> > long-time GCC hackers were involved in it, so it was probably tuned already.
> 
> Having looked at DF a few times one of its problems seems to be too much
> indirection in its data structures which is probably a fallout of making it
> on-the-side rather than a part of the core RTL data structures (like SSA
> and immediate uses are on GIMPLE).  But then where it blows up is usually
> during computing of the dataflow problems to create those data structures
> and there's little room for improvement because it deals with a more complex
> problem than we have on the GIMPLE side.  But passes are also mindlessly
> throwing 'compute-me-everyting' flags at it without really thinking
> what they need, so ...

And also, throwing away all existing info way too often.

Part of that is causes by the interfaces to DF we have.  That shouldn't
be hard to fix.

But mainly it is because too many things do not keep the info up-to-date
after changes.

> In the end I question the need of true "global" optimization passes on the
> RTL IL.  Almost everything should be local to interesting-CFG-shape regions.

Depending on the target, almost *everything* depends on how far some
code is from other code, and you can get much better code if you know
you *are* close.

But many passes already are per-BB of course.  It's just the hard ones
(RA...) that are not local at all.  So you'll have to attack IRA if you
want DF to be more local only, I think?

> And yes, it probably requires a further lowering on GIMPLE to expose things
> like multi-reg operations,

As I said before, I don't think this is a good idea.

> lowered bit-precision arithmetic, lowered

But that is :-)

> address generation and eventually constant legitimization (introducing a
> constant pool already on GIMPLE, ideally maintaining it in the symbol
> table).

That might work on some targets, sure :-/


Segher

[RFC] split pseudos during loop unrolling in RTL unroller

Commit Message

Comments

Patch