diff mbox series

fwprop: Allow UNARY_P and check register pressure.

Message ID 5a90c8a9-1570-5af4-bfdc-19d097bfee6e@gmail.com
State New
Headers show
Series fwprop: Allow UNARY_P and check register pressure. | expand

Commit Message

Robin Dapp Aug. 7, 2023, 10:26 a.m. UTC
Hi,

originally inspired by the wish to transform

 vmv v3, a0 ; = vec_duplicate
 vadd.vv v1, v2, v3

into

 vadd.vx v1, v2, a0

via fwprop for riscv, this patch enables the forward propagation
of UNARY_P sources.

As this involves potentially replacing a vector register with
a scalar register the ira_hoist_pressure machinery is used to
calculate the change in register pressure.  If the propagation
would increase the pressure beyond the number of hard regs, we
don't perform it.

The regpressure commit this patch depends on is not yet pushed
because I figured I'd post the code using it in case of further
comments.

The testsuite is unchanged on i386 and power10 but aarch64 has
some new FAILs but I'm not terribly familiar with aarch64, hence
some examples here:

The following cases shrn-combine-1/2/3 seem worse as we emit one
instruction more.

Before:
          shrn    v30.8b, v30.8h, 2
          shrn2   v30.16b, v31.8h, 2
          str     q30, [x1, x3]
After:
          ushr    v30.8h, v30.8h, 2
          ushr    v31.8h, v31.8h, 2
          uzp1    v31.16b, v30.16b, v31.16b
          str     q31, [x1, x3]

Here, one optimization already happens in fwprop so combine
cannot do the same work it did before.

vec-init-22-speed.c changes from
          sxth    w0, w0
          sxth    w1, w1
          fmov    d31, x0
          fmov    d0, x1
to:
          dup     v31.4h, w0
          dup     v0.4h, w1 
which I hope has the same semantics and is shorter.

Apart from that there are numerous check-asm testcases where
a new, earlier propagation prevents a later, supposedly better
propagation.  One such example is from ld4_s8.c:

  (insn 11 10 12 2 (set (reg:DI 100) 
          (neg:DI (reg:DI 102))) 385 {negdi2}
       (expr_list:REG_EQUAL (const_poly_int:DI [-576, -576])                                   
          (nil)))
  (insn 12 11 13 2 (set (reg/f:DI 99)
          (plus:DI (reg/v/f:DI 97 [ x0 ])
              (reg:DI 100))) 153 {*adddi3_aarch64}
       (expr_list:REG_EQUAL (plus:DI (reg/v/f:DI 97 [ x0 ])                                    
              (const_poly_int:DI [-576, -576]))                                                
          (nil)))
  (insn 13 12 14 2 (set (reg/v:VNx64QI 94 [ z0 ])                                              
          (unspec:VNx64QI [
                  (reg/v:VNx16BI 96 [ p0 ])
                  (mem:VNx64QI (reg/f:DI 99) [0 MEM <svint8_t[4]> [(signed char *)_1]+0 S[64, 64] A8])        
              ] UNSPEC_LDN)) 5885 {vec_mask_load_lanesvnx64qivnx16qi}                          
       (nil)) 

where we now do:

  propagating insn 11 into insn 12, replacing:
  (set (reg/f:DI 99)
      (plus:DI (reg/v/f:DI 97 [ x0 ])
          (reg:DI 100)))
  successfully matched this instruction to subdi3:
  (set (reg/f:DI 99)
      (minus:DI (reg/v/f:DI 97 [ x0 ])
          (reg:DI 102)))

vs before:

  cannot propagate from insn 11 into insn 12: would increase complexity of pattern
  
  propagating insn 12 into insn 13, replacing:
  (set (reg/v:VNx64QI 94 [ z0 ])
      (unspec:VNx64QI [
              (reg/v:VNx16BI 96 [ p0 ])
              (mem:VNx64QI (reg/f:DI 99) [0 MEM <svint8_t[4]> [(signed char *)_1]+0 S[64, 64] A8])
          ] UNSPEC_LDN))
  successfully matched this instruction to vec_mask_load_lanesvnx64qivnx16qi:
  (set (reg/v:VNx64QI 94 [ z0 ])
      (unspec:VNx64QI [
              (reg/v:VNx16BI 96 [ p0 ])
              (mem:VNx64QI (plus:DI (reg/v/f:DI 97 [ x0 ])
                      (reg:DI 100)) [0 MEM <svint8_t[4]> [(signed char *)_1]+0 S[64, 64] A8])
          ] UNSPEC_LDN))

All in all this seems like a general problem with earlier
optimizations preventing later ones and surely all of those
could be fixed individually.  Still, the question remains if
the general approach is useful or desired or if we not rather
prevent more optimizations that we gain.  Suggestions welcome.

I have some riscv tests for it but didn't attach them yet
in order to focus on the main part first.

Regards
 Robin

gcc/ChangeLog:

	* fwprop.cc (fwprop_propagation::profitable_p): Add unary
	handling.
	(fwprop_propagation::update_register_pressure): New function.
	(fwprop_propagation::register_pressure_high_p): New function
	(reg_single_def_for_src_p): Look through unary expressions.
	(try_fwprop_subst_pattern): Check register pressure.
	(forward_propagate_into): Call new function.
	(fwprop_init): Init register pressure.
	(fwprop_done): Clean up register pressure.
	(fwprop_insn): Add comment.
---
 gcc/fwprop.cc | 307 ++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 300 insertions(+), 7 deletions(-)

Comments

Robin Dapp Aug. 24, 2023, 2:06 p.m. UTC | #1
Ping.  I refined the code and some comments a bit and added a test
case.

My question in general would still be:  Is this something we want
given that we potentially move some of combine's work a bit towards
the front of the RTL pipeline?

Regards
 Robin

Subject: [PATCH] fwprop: Allow UNARY_P and check register pressure.

This patch enables the forwarding of UNARY_P sources.  As this
involves potentially replacing a vector register with a scalar register
the ira_hoist_pressure machinery is used to calculate the change in
register pressure.  If the propagation would increase the pressure
beyond the number of hard regs, we don't perform it.

gcc/ChangeLog:

	* fwprop.cc (fwprop_propagation::profitable_p): Add unary
	handling.
	(fwprop_propagation::update_register_pressure): New function.
	(fwprop_propagation::register_pressure_high_p): New function
	(reg_single_def_for_src_p): Look through unary expressions.
	(try_fwprop_subst_pattern): Check register pressure.
	(forward_propagate_into): Call new function.
	(fwprop_init): Init register pressure.
	(fwprop_done): Clean up register pressure.
	(fwprop_insn): Add comment.

gcc/testsuite/ChangeLog:

	* gcc.target/riscv/rvv/autovec/binop/vadd-vx-fwprop.c: New test.
---
 gcc/fwprop.cc                                 | 314 +++++++++++++++++-
 .../riscv/rvv/autovec/binop/vadd-vx-fwprop.c  |  64 ++++
 2 files changed, 371 insertions(+), 7 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/binop/vadd-vx-fwprop.c

diff --git a/gcc/fwprop.cc b/gcc/fwprop.cc
index 0707a234726..b49d4e4ced4 100644
--- a/gcc/fwprop.cc
+++ b/gcc/fwprop.cc
@@ -36,6 +36,10 @@ along with GCC; see the file COPYING3.  If not see
 #include "tree-pass.h"
 #include "rtl-iter.h"
 #include "target.h"
+#include "dominance.h"
+
+#include "ira.h"
+#include "regpressure.h"
 
 /* This pass does simple forward propagation and simplification when an
    operand of an insn can only come from a single def.  This pass uses
@@ -103,6 +107,10 @@ using namespace rtl_ssa;
 
 static int num_changes;
 
+/* Keep track of which registers already increased the pressure to avoid double
+   booking.  */
+sbitmap pressure_accounted;
+
 /* Do not try to replace constant addresses or addresses of local and
    argument slots.  These MEM expressions are made only once and inserted
    in many instructions, as well as being used to control symbol table
@@ -181,6 +189,8 @@ namespace
     bool changed_mem_p () const { return result_flags & CHANGED_MEM; }
     bool folded_to_constants_p () const;
     bool profitable_p () const;
+    bool register_pressure_high_p (rtx, rtx, rtx_insn *, rtx_insn *) const;
+    bool update_register_pressure (rtx, rtx, rtx_insn *, rtx_insn *) const;
 
     bool check_mem (int, rtx) final override;
     void note_simplification (int, uint16_t, rtx, rtx) final override;
@@ -332,25 +342,247 @@ fwprop_propagation::profitable_p () const
       && (result_flags & PROFITABLE))
     return true;
 
-  if (REG_P (to))
+  /* Only continue with an unary operation if we consider register
+     pressure.  */
+  rtx what = copy_rtx (to);
+  if (UNARY_P (what) && flag_ira_hoist_pressure)
+    what = XEXP (what, 0);
+
+  if (REG_P (what))
     return true;
 
-  if (GET_CODE (to) == SUBREG
-      && REG_P (SUBREG_REG (to))
-      && !paradoxical_subreg_p (to))
+  if (GET_CODE (what) == SUBREG
+      && REG_P (SUBREG_REG (what))
+      && !paradoxical_subreg_p (what))
     return true;
 
-  if (CONSTANT_P (to))
+  if (CONSTANT_P (what))
     return true;
 
   return false;
 }
 
-/* Check that X has a single def.  */
+/* Check if the register pressure in any predecessor block of USE's block
+   until DEF's block is equal or higher to the number of hardregs in NU's
+   register class.  */
+bool
+fwprop_propagation::register_pressure_high_p (rtx nu, rtx old, rtx_insn *def,
+					      rtx_insn *use) const
+{
+  enum reg_class nu_class, old_class;
+  int nu_nregs, old_nregs;
+  nu_class = regpressure_get_regno_pressure_class (REGNO (nu), &nu_nregs);
+  old_class
+    = regpressure_get_regno_pressure_class (REGNO (old), &old_nregs);
+
+  if (nu_class == NO_REGS && old_class == NO_REGS)
+    return true;
+
+  if (nu_class == old_class)
+    return false;
+
+  basic_block bbfrom = BLOCK_FOR_INSN (def);
+  basic_block bbto = BLOCK_FOR_INSN (use);
+
+  basic_block bb;
+
+  sbitmap visited = sbitmap_alloc (last_basic_block_for_fn (cfun));
+  bitmap_clear (visited);
+  auto_vec<basic_block> q;
+  q.safe_push (bbto);
+
+  while (!q.is_empty ())
+    {
+      bb = q.pop ();
+
+      if (bitmap_bit_p (visited, bb->index))
+	continue;
+
+      /* Nothing to do if the register to be replaced is not live
+	 in this BB.  */
+      if (bb != bbfrom && !regpressure_is_live_in (bb, REGNO (old)))
+	continue;
+
+      /* Nothing to do if the replacement register is already live in
+	 this BB.  */
+      if (regpressure_is_live_in (bb, REGNO (nu)))
+	continue;
+
+      bitmap_set_bit (visited, bb->index);
+
+      int press = regpressure_get (bb, nu_class);
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	fprintf (dump_file, "pressure for reg %d in bb %d: %d\n",
+		 REGNO (nu), bb->index, press);
+
+      if (press + nu_nregs > ira_class_hard_regs_num[nu_class])
+	return true;
+
+      if (bb == bbfrom || bb == ENTRY_BLOCK_PTR_FOR_FN (cfun))
+	break;
+
+      edge e;
+      edge_iterator ei;
+
+      FOR_EACH_EDGE (e, ei, bb->preds)
+	{
+	  basic_block pred = e->src;
+	  q.safe_push (pred);
+	}
+    }
+
+  sbitmap_free (visited);
+
+  return false;
+}
+
+/* Update the register pressure when propagating from DEF to USE
+   replacing OLD with NU.  */
+bool
+fwprop_propagation::update_register_pressure (rtx nu, rtx old, rtx_insn *def,
+					      rtx_insn *use) const
+{
+  basic_block bb;
+
+  enum reg_class nu_class, old_class;
+  int nu_nregs, old_nregs;
+  nu_class = regpressure_get_regno_pressure_class (REGNO (nu), &nu_nregs);
+  old_class = regpressure_get_regno_pressure_class (REGNO (old), &old_nregs);
+
+  bool changed = false;
+
+  if (nu_class == old_class)
+    return false;
+
+  basic_block bbfrom = BLOCK_FOR_INSN (def);
+  basic_block bbto = BLOCK_FOR_INSN (use);
+
+  sbitmap visited = sbitmap_alloc (last_basic_block_for_fn (cfun));
+  bitmap_clear (visited);
+  auto_vec<basic_block> q;
+  q.safe_push (bbto);
+
+  /* Precompute the users.  */
+  auto_vec<rtx_insn *> users;
+  df_ref duse;
+  bool should_decrease = true;
+  FOR_EACH_INSN_USE (duse, def)
+    {
+      rtx dreg = DF_REF_REAL_REG (duse);
+      if (REGNO (dreg) != REGNO (old))
+	continue;
+
+      df_ref op_ref = DF_REG_USE_CHAIN (REGNO (dreg));
+      for (; op_ref; op_ref = DF_REF_NEXT_REG (op_ref))
+	{
+	  if (!DF_REF_INSN_INFO (op_ref))
+	    continue;
+
+	  rtx_insn *insn = DF_REF_INSN (op_ref);
+
+	  /* If there are other users in BBTO, never decrease the
+	     pressure.  */
+	  if (BLOCK_FOR_INSN (insn) == bbto && insn != use)
+	    {
+	      should_decrease = false;
+	      break;
+	    }
+
+	  users.safe_push (insn);
+	}
+    }
+
+  while (!q.is_empty ())
+    {
+      bb = q.pop ();
+
+      if (bitmap_bit_p (visited, bb->index))
+	continue;
+
+      /* Nothing to do if the register to be replaced is not live
+	 in this BB.  */
+      if (bb != bbfrom && !regpressure_is_live_in (bb, REGNO (old)))
+	continue;
+
+      /* Nothing to do if the new register is already live in this BB.  */
+      if (regpressure_is_live_in (bb, REGNO (nu)))
+	continue;
+
+      bitmap_set_bit (visited, bb->index);
+
+      if (regpressure_is_live_in (bb, REGNO (old)))
+	{
+	  if (dump_file && (dump_flags & TDF_DETAILS))
+	    fprintf (dump_file, "increasing pressure for "
+		     "reg %d in bb %d by %d reg(s).\n",
+		     REGNO (nu), bb->index, nu_nregs);
+	  regpressure_set_live_in (bb, REGNO (nu));
+	  regpressure_increase (bb, nu_class, nu_nregs);
+	  changed = true;
+
+	  if (should_decrease)
+	    {
+	      rtx_insn *user;
+	      int ix;
+	      bool should_decrease_local = true;
+
+	      /* If this BB dominates no BB where OLD is used (except BBTO)
+		 the register pressure is decreased.  */
+	      FOR_EACH_VEC_ELT(users, ix, user)
+		{
+		  basic_block ubb = BLOCK_FOR_INSN (user);
+		  if (ubb == bbto)
+		    continue;
+		  if (dominated_by_p (CDI_DOMINATORS, ubb, bb))
+		    {
+		      should_decrease_local = false;
+		      break;
+		    }
+		}
+
+	      if (should_decrease_local)
+		{
+		  if (dump_file && (dump_flags & TDF_DETAILS))
+		    fprintf (dump_file, "decreasing pressure for "
+			     "reg %d in bb %d by %d reg(s).\n",
+			     REGNO (old), bb->index, old_nregs);
+		  regpressure_clear_live_in (bb, REGNO (old));
+		  regpressure_decrease (bb, old_class, old_nregs);
+		}
+	    }
+	}
+
+      if (bb == bbfrom || bb == ENTRY_BLOCK_PTR_FOR_FN (cfun))
+	break;
+
+      edge e;
+      edge_iterator ei;
+
+      FOR_EACH_EDGE (e, ei, bb->preds)
+	{
+	  basic_block pred = e->src;
+	  q.safe_push (pred);
+	}
+    }
+
+  sbitmap_free (visited);
+  return changed;
+}
+
+/* Check that X has a single def.  Also look through unary operations if
+   we take register pressure into consideration.  */
 
 static bool
 reg_single_def_p (rtx x)
 {
+  if (UNARY_P (x) && flag_ira_hoist_pressure)
+    {
+      x = XEXP (x, 0);
+
+      if (SUBREG_P (x) && !contains_paradoxical_subreg_p (x))
+	x = SUBREG_REG (x);
+    }
+
   return REG_P (x) && crtl->ssa->single_dominating_def (REGNO (x));
 }
 
@@ -490,6 +722,56 @@ try_fwprop_subst_pattern (obstack_watermark &attempt, insn_change &use_change,
 	  }
       }
 
+  if (ok && UNARY_P (src))
+    {
+      /* Currently the register classes can only be different with a
+	 unary operation like vec_duplicate.
+	 In that case the propagation can increase the register pressure
+	 on the way to the target BB.
+	 Therefore, we first check if the pressure of the respective register
+	 is already high, i.e. equals or exceeds the number of hardregs for
+	 that class.  If not, we allow the propagation and update the
+	 pressure.  */
+
+      rtx src1 = src;
+
+      /* Strip unary operation and possible subreg.  */
+      src1 = XEXP (src, 0);
+      if (GET_CODE (src1) == SUBREG)
+	src1 = SUBREG_REG (src1);
+
+      rtx_insn *def_rtl = def_insn->rtl ();
+
+      if (REG_P (dest) && REG_P (src1))
+	{
+	  if (!bitmap_bit_p (pressure_accounted, REGNO (src1)))
+	    {
+	      if (!prop.register_pressure_high_p (src1, dest, def_rtl, use_rtl))
+		{
+		  bool updated
+		    = prop.update_register_pressure (src1, dest, def_rtl,
+						     use_rtl);
+		  if (updated)
+		    bitmap_set_bit (pressure_accounted, REGNO (src1));
+		}
+	      else
+		{
+		  if (dump_file && (dump_flags & TDF_DETAILS))
+		    fprintf (dump_file, "cannot propagate from insn %d into"
+			     " insn %d: %s\n", def_insn->uid (),
+			     use_insn->uid (),
+			     "would increase register pressure beyond number"
+			     " of hard regs");
+		  ok = false;
+		}
+	    }
+	  else
+	    if (dump_file && (dump_flags & TDF_DETAILS))
+	      fprintf (dump_file, "pressure for reg %d has already"
+		       " been accounted\n", REGNO (src1));
+	}
+    }
+
   if (!ok)
     {
       /* The pattern didn't match, but if all uses of SRC folded to
@@ -859,7 +1141,7 @@ forward_propagate_into (use_info *use, bool reg_prop_only = false)
   if ((reg_prop_only
        || (def_loop != use_loop
 	   && !flow_loop_nested_p (use_loop, def_loop)))
-      && (!reg_single_def_p (dest) || !reg_single_def_p (src)))
+      && !reg_single_def_p (dest))
     return false;
 
   /* Don't substitute into a non-local goto, this confuses CFG.  */
@@ -890,6 +1172,14 @@ fwprop_init (void)
 
   df_analyze ();
   crtl->ssa = new rtl_ssa::function_info (cfun);
+
+  /* Calculate register pressure for each basic block.  */
+  if (flag_ira_hoist_pressure)
+    {
+      regpressure_init ();
+
+      pressure_accounted = sbitmap_alloc (DF_REG_SIZE (df));
+    }
 }
 
 static void
@@ -910,6 +1200,12 @@ fwprop_done (void)
     fprintf (dump_file,
 	     "\nNumber of successful forward propagations: %d\n\n",
 	     num_changes);
+
+  if (flag_ira_hoist_pressure)
+    {
+      regpressure_cleanup ();
+      sbitmap_free (pressure_accounted);
+    }
 }
 
 /* Try to optimize INSN, returning true if something changes.
@@ -919,6 +1215,10 @@ fwprop_done (void)
 static bool
 fwprop_insn (insn_info *insn, bool fwprop_addr_p)
 {
+  /* TODO: when rejecting propagations due to register pressure
+     we would actually like to try the propagation with most
+     potential = uses first instead of going through them
+     sequentially.  */
   for (use_info *use : insn->uses ())
     {
       if (use->is_mem ())
diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/binop/vadd-vx-fwprop.c b/gcc/testsuite/gcc.target/riscv/rvv/autovec/binop/vadd-vx-fwprop.c
new file mode 100644
index 00000000000..0ed1bb80f73
--- /dev/null
+++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/binop/vadd-vx-fwprop.c
@@ -0,0 +1,64 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-O2 -march=rv64gcv -mabi=lp64d --param=riscv-autovec-preference=fixed-vlmax -fdump-rtl-fwprop1-details" } */
+
+/* Check that we propagate from vec_duplicates into vadd.vv, turning it into
+   vadd.vx.  */
+
+#include <stdint-gcc.h>
+
+/* Define 26 live variables so that we can perform two forward propagations
+   until we reach the hard reg limit of 28.  */
+uint64_t *restrict dst1;
+uint64_t *restrict dst2;
+uint64_t *restrict dst3;
+uint64_t *restrict dst4;
+uint64_t *restrict dst5;
+uint64_t *restrict dst6;
+uint64_t *restrict dst7;
+uint64_t *restrict dst8;
+uint64_t *restrict dst9;
+uint64_t *restrict dst10;
+uint64_t *restrict dst11;
+uint64_t *restrict a1;
+uint64_t *restrict a2;
+uint64_t *restrict a3;
+uint64_t *restrict a4;
+uint64_t *restrict a5;
+uint64_t *restrict a6;
+uint64_t *restrict a7;
+uint64_t *restrict a8;
+uint64_t *restrict a9;
+uint64_t *restrict a10;
+uint64_t *restrict a11;
+uint64_t b1;
+uint64_t b2;
+uint64_t b3;
+uint64_t b4;
+
+void foo (int n)
+{
+  __builtin_expect (n % 128, 0);
+  for (int i = 0; i < n; i++)
+    {
+      /* We expect b3 to be propagated into vadd here as well as in the other
+         7 occurences (without increasing the pressure every time).
+         Then b2 is still supposed to be propagated but b1 is not even though
+         it would be better to do so.  */
+      dst1[i] = a1[i] + b3;
+      dst2[i] = a2[i] + b3;
+      dst3[i] = a3[i] + b2;
+      dst4[i] = a4[i] + b1;
+      dst5[i] = a5[i] + b1;
+      dst6[i] = a6[i] + b3;
+      dst7[i] = a7[i] + b3;
+      dst8[i] = a8[i] + b3;
+      dst9[i] = a9[i] + b3;
+      dst10[i] = a10[i] + b3;
+      dst11[i] = a11[i] + b3;
+    }
+}
+
+/* { dg-final { scan-rtl-dump-times "increasing pressure for reg" 2 "fwprop1" } } */
+/* { dg-final { scan-rtl-dump-times "pressure for reg \[0-9\]+ has already been accounted" 7 "fwprop1" } } */
+/* { dg-final { scan-rtl-dump-times "would increase register pressure beyond number of hard regs" 2 "fwprop1" } } */
+/* { dg-final { scan-assembler-times "\tvadd\.vx" 9 } } */
Jeff Law Aug. 28, 2023, 11:33 p.m. UTC | #2
On 8/24/23 08:06, Robin Dapp via Gcc-patches wrote:
> Ping.  I refined the code and some comments a bit and added a test
> case.
> 
> My question in general would still be:  Is this something we want
> given that we potentially move some of combine's work a bit towards
> the front of the RTL pipeline?
> 
> Regards
>   Robin
> 
> Subject: [PATCH] fwprop: Allow UNARY_P and check register pressure.
> 
> This patch enables the forwarding of UNARY_P sources.  As this
> involves potentially replacing a vector register with a scalar register
> the ira_hoist_pressure machinery is used to calculate the change in
> register pressure.  If the propagation would increase the pressure
> beyond the number of hard regs, we don't perform it.
> 
> gcc/ChangeLog:
> 
> 	* fwprop.cc (fwprop_propagation::profitable_p): Add unary
> 	handling.
> 	(fwprop_propagation::update_register_pressure): New function.
> 	(fwprop_propagation::register_pressure_high_p): New function
> 	(reg_single_def_for_src_p): Look through unary expressions.
> 	(try_fwprop_subst_pattern): Check register pressure.
> 	(forward_propagate_into): Call new function.
> 	(fwprop_init): Init register pressure.
> 	(fwprop_done): Clean up register pressure.
> 	(fwprop_insn): Add comment.
> 
> gcc/testsuite/ChangeLog:
> 
> 	* gcc.target/riscv/rvv/autovec/binop/vadd-vx-fwprop.c: New test.
So I was hoping that Richard S. would chime in here as he knows this 
code better than anyone.

This looks like a much better implementation of something I've done 
before :-)  Basically imagine a target where a sign/zero extension can 
be folded into arithmetic for free.  We put in various hacks to this 
code to encourage more propagations of extensions.

I still think this is valuable.  As we lower from gimple->RTL we're 
going to still have artifacts in the RTL that we're going to want to 
optimize away.  fwprop has certain advantages over combine, including 
the fact that it runs earlier, pre-loop.


It looks generally sensible to me.  But give Richard S. another week to 
chime in.  He seems to be around, but may be slammed with stuff right now.

jeff
Richard Sandiford Aug. 29, 2023, 11:40 a.m. UTC | #3
Jeff Law <jeffreyalaw@gmail.com> writes:
> On 8/24/23 08:06, Robin Dapp via Gcc-patches wrote:
>> Ping.  I refined the code and some comments a bit and added a test
>> case.
>> 
>> My question in general would still be:  Is this something we want
>> given that we potentially move some of combine's work a bit towards
>> the front of the RTL pipeline?
>> 
>> Regards
>>   Robin
>> 
>> Subject: [PATCH] fwprop: Allow UNARY_P and check register pressure.
>> 
>> This patch enables the forwarding of UNARY_P sources.  As this
>> involves potentially replacing a vector register with a scalar register
>> the ira_hoist_pressure machinery is used to calculate the change in
>> register pressure.  If the propagation would increase the pressure
>> beyond the number of hard regs, we don't perform it.
>> 
>> gcc/ChangeLog:
>> 
>> 	* fwprop.cc (fwprop_propagation::profitable_p): Add unary
>> 	handling.
>> 	(fwprop_propagation::update_register_pressure): New function.
>> 	(fwprop_propagation::register_pressure_high_p): New function
>> 	(reg_single_def_for_src_p): Look through unary expressions.
>> 	(try_fwprop_subst_pattern): Check register pressure.
>> 	(forward_propagate_into): Call new function.
>> 	(fwprop_init): Init register pressure.
>> 	(fwprop_done): Clean up register pressure.
>> 	(fwprop_insn): Add comment.
>> 
>> gcc/testsuite/ChangeLog:
>> 
>> 	* gcc.target/riscv/rvv/autovec/binop/vadd-vx-fwprop.c: New test.
> So I was hoping that Richard S. would chime in here as he knows this 
> code better than anyone.

Heh, I'm not sure about that.  I rewrote the code to use rtl-ssa,
so in that sense I'm OK with the framework side.  But I tried to
preserve the decisions that the old pass made as closely as possible.
I don't know why most of those decisions were made (which is why I just
kept them).

So I don't think I have a good feel for the advantages and disadvantages
of doing this.  Robin's analysis of the aarch64 changes was nice and
detailed though.  I think the one that worries me most is the addressing
mode one.  fwprop is probably the first chance we get to propagate adds
into addresses, and virtual register elimination means that some of
those opportunities won't show up in gimple.

There again, virtual register elimination wouldn't be the reason for
the ld4_s8.c failure.  Perhaps there's something missing in expand.

Other than that, I think my main question is: why just unary operations?
Is the underlying assumption that we only want to propagate a maximum of
one register?  If so, then I think we should check for that directly, by
iterating over subrtxes.

That way we can handle things like binary operations involving a
register and a constant, and unspecs with a single non-constant operand.

I imagine the check would be something like:

  unsigned int nregs = 0;
  for (each subrtx x)
    {
      if (MEM_P (x))
        return false;
      if (SUBREG_P (x) && .../*current conditions */...)
        return false;
      if (REG_P (x))
        {
          nregs += 1;
          if (nregs > 1)
            return false;
        }
    }
  return true;

Perhaps we should allow the optimisation without register-pressure
information if (a) the source register and destination register are
in the same pressure class and (b) all uses of the destination are
being replaced.  (FWIW, rtl-ssa should make it easier to try to
replace all definitions at once, with an all-or-nothing choice,
if we ever wanted to do that.)

Thanks,
Richard

>
> This looks like a much better implementation of something I've done 
> before :-)  Basically imagine a target where a sign/zero extension can 
> be folded into arithmetic for free.  We put in various hacks to this 
> code to encourage more propagations of extensions.
>
> I still think this is valuable.  As we lower from gimple->RTL we're 
> going to still have artifacts in the RTL that we're going to want to 
> optimize away.  fwprop has certain advantages over combine, including 
> the fact that it runs earlier, pre-loop.
>
>
> It looks generally sensible to me.  But give Richard S. another week to 
> chime in.  He seems to be around, but may be slammed with stuff right now.
>
> jeff
Robin Dapp Sept. 5, 2023, 6:53 a.m. UTC | #4
> So I don't think I have a good feel for the advantages and disadvantages
> of doing this.  Robin's analysis of the aarch64 changes was nice and
> detailed though.  I think the one that worries me most is the addressing
> mode one.  fwprop is probably the first chance we get to propagate adds
> into addresses, and virtual register elimination means that some of
> those opportunities won't show up in gimple.
> 
> There again, virtual register elimination wouldn't be the reason for
> the ld4_s8.c failure.  Perhaps there's something missing in expand.
> 
> Other than that, I think my main question is: why just unary operations?
> Is the underlying assumption that we only want to propagate a maximum of
> one register?  If so, then I think we should check for that directly, by
> iterating over subrtxes.

The main reason for stopping at unary operations was to limit the scope
and change as little as possible (not restricting the change to one
register).  I'm currently testing a v2 that iterates over subrtxs.

> Perhaps we should allow the optimisation without register-pressure
> information if (a) the source register and destination register are
> in the same pressure class and (b) all uses of the destination are
> being replaced.  (FWIW, rtl-ssa should make it easier to try to
> replace all definitions at once, with an all-or-nothing choice,
> if we ever wanted to do that.)

I presume you're referring to replacing one register (dest) in all using
insns?  Source and destination are somewhat overloaded in fwprop context
because I'm thinking of the "to be replaced" register as dest when it's
actually the replacement register.

AFAICT fwprop currently iterates over insns, going through all their uses
and trying if an individual use can be substituted.  Do you suggest to
change this general iteration order to iterate over the defs of an insn
and then try to replace all the uses at once (e.g. using ssa->change_insns)?
When keeping the current order, wouldn't we need to store all potential
changes instead of committing them and later apply them in bulk, e.g.
grouped by use?  This order would also help to pick the propagation
with the most number of uses (i.e. propagation potential) but maybe
I'm misunderstanding?

Regards
 Robin
Richard Sandiford Sept. 5, 2023, 8:38 a.m. UTC | #5
Robin Dapp <rdapp.gcc@gmail.com> writes:
>> So I don't think I have a good feel for the advantages and disadvantages
>> of doing this.  Robin's analysis of the aarch64 changes was nice and
>> detailed though.  I think the one that worries me most is the addressing
>> mode one.  fwprop is probably the first chance we get to propagate adds
>> into addresses, and virtual register elimination means that some of
>> those opportunities won't show up in gimple.
>> 
>> There again, virtual register elimination wouldn't be the reason for
>> the ld4_s8.c failure.  Perhaps there's something missing in expand.
>> 
>> Other than that, I think my main question is: why just unary operations?
>> Is the underlying assumption that we only want to propagate a maximum of
>> one register?  If so, then I think we should check for that directly, by
>> iterating over subrtxes.
>
> The main reason for stopping at unary operations was to limit the scope
> and change as little as possible (not restricting the change to one
> register).  I'm currently testing a v2 that iterates over subrtxs.

Thanks.  Definitely no problem with doing things in small steps, but IMO
it's better if each choice of step can still be justified in its own terms.

>> Perhaps we should allow the optimisation without register-pressure
>> information if (a) the source register and destination register are
>> in the same pressure class and (b) all uses of the destination are
>> being replaced.  (FWIW, rtl-ssa should make it easier to try to
>> replace all definitions at once, with an all-or-nothing choice,
>> if we ever wanted to do that.)
>
> I presume you're referring to replacing one register (dest) in all using
> insns?  Source and destination are somewhat overloaded in fwprop context
> because I'm thinking of the "to be replaced" register as dest when it's
> actually the replacement register.

Yeah.

> AFAICT fwprop currently iterates over insns, going through all their uses
> and trying if an individual use can be substituted.  Do you suggest to
> change this general iteration order to iterate over the defs of an insn
> and then try to replace all the uses at once (e.g. using ssa->change_insns)?

No, I was just noting in passing that we could try do that if we wanted to.
The current code is a fairly mechanical conversion of the original DF-based
code, but there's no reason why it has to continue to work the way it
does now.

> When keeping the current order, wouldn't we need to store all potential
> changes instead of committing them and later apply them in bulk, e.g.
> grouped by use?  This order would also help to pick the propagation
> with the most number of uses (i.e. propagation potential) but maybe
> I'm misunderstanding?

I imagine doing it in reverse postorder would still make sense.

But my point was that, for the current fwprop limitation of substituting
into exactly one use of a register, we can check whether that use is
the *only* use of register.

I.e. if we substitute:

  A: (set (reg R1) (foo (reg R2)))

into:

  B: (set ... (reg R1) ...)

if R1 and R2 are likely to be in the same register class, and if B
is the only user of R2, then we don't need to calculate register
pressure.  The change is either neutral (if R2 died in A) or an
improvement (if R2 doesn't die in A, and so R1 and R2 were previously
live at the same time).

Thanks,
Richard
Robin Dapp Sept. 5, 2023, 8:45 a.m. UTC | #6
> I imagine doing it in reverse postorder would still make sense.
> 
> But my point was that, for the current fwprop limitation of substituting
> into exactly one use of a register, we can check whether that use is
> the *only* use of register.
> 
> I.e. if we substitute:
> 
>   A: (set (reg R1) (foo (reg R2)))
> 
> into:
> 
>   B: (set ... (reg R1) ...)
> 
> if R1 and R2 are likely to be in the same register class, and if B
> is the only user of R2, then we don't need to calculate register
> pressure.  The change is either neutral (if R2 died in A) or an
> improvement (if R2 doesn't die in A, and so R1 and R2 were previously
> live at the same time).

Ah, understood, thanks.  Sure, that one I can include.

Regards
 Robin
Robin Dapp Sept. 6, 2023, 11:22 a.m. UTC | #7
Hi Richard,

I did some testing with the attached v2 that does not restrict to UNARY
anymore.  As feared ;) there is some more fallout that I'm detailing below.

On Power there is one guality fail (pr43051-1.c) that I would take
the liberty of ignoring for now.

On x86 there are four fails:

 - cond_op_addsubmuldiv__Float16-2.c: assembler error
   unsupported masking for `vmovsh'.  I guess that's a latent backend
   problem.

 - ifcvt-3.c, pr49095.c: Here we propagate into a compare.  Before, we had
   (cmp (reg/CC) 0) and now we have (cmp (plus (reg1 reg2) 0).
   That looks like a costing problem and can hopefully solveable by making
   the second compare more expensive, preventing the propagation.
   i386 costing (or every costing?) is brittle so that could well break other
   things. 

 - pr88873.c: This is interesting because even before this patch we
   propagated with different register classes (V2DF vs DI).  With the patch
   we check the register pressure, find the class NO_REGS for V2DF and
   abort (because the patch assumes NO_REGS = high pressure).  I'm thinking
   of keeping the old behavior for reg-reg propagations and only checking
   the pressure for more complex operations.

aarch64 has the most fails:

 - One guality fail (same as Power).
 - shrn-combine-[123].c as before.

 - A class of (hopefully, I only checked some) similar cases where we
   propagate an unspec_whilelo into an unspec_ptest.  Before we would only
   set a REG_EQUALS note.
   Before we managed to create a while_ultsivnx16bi_cc whereas now we have
   while_ultsivnx16bi and while_ultsivnx16bi_ptest that won't be combined.
   We create redundant whilelos and I'm not sure how to improve that. I
   guess a peephole is out of the question :)

 - pred-combine-and.c: Here the new propagation appears useful at first.
   We propagate a "vector mask and" into a while_ultsivnx4bi_ptest and the
   individual and registers remain live up to the propagation site (while
   being dead before the patch).
   With the registers dead, combine could create a single fcmgt before.
   Now it only manages a 2->2 combination because we still need the registers
   and end up with two fcmgts.
   The code is worse but this seems more bad luck than anything.

 - Addressing fails from before:  I looked into these and suspect all of
   them are a similar.
   What happens is that we have a poly_int offset that we shift, negate
   and then add to x0.  The result is used as load address.
   Before, we would pull (combine) the (plus x0 reg) into the load keeping
   the neg and shift.
   Now we propagate everything into a single (set (minus x0 offset)).
   The propagation itself seems worthwhile because we save one insn.
   However as we got rid of the base/offset split by lumping everything
   together, combine cannot pull the (plus) into the address load and
   we require an aarch64_split_add_offset.  This will emit the longer
   sequence of ashiftl and subtract.  The "base" address is x0 here so
   we cannot convert (minus x0 ...)) into neg.
   I didn't go through all of aarch64_split_add_offset.  I suppose we
   could re-add the separation of base/offset there but that might be
   a loss when the result is not used as an address. 
   
Again, all in all no fatal problems but pretty annoying :)  It's not much
but just gradually worse than with just UNARY.  Any idea on how/whether to
continue?

Regards
 Robin

gcc/ChangeLog:

	* fwprop.cc (fwprop_propagation::profitable_p): Add unary
	handling.
	(fwprop_propagation::update_register_pressure): New function.
	(fwprop_propagation::register_pressure_high_p): New function
	(reg_single_def_for_src_p): Look through unary expressions.
	(try_fwprop_subst_pattern): Check register pressure.
	(forward_propagate_into): Call new function.
	(fwprop_init): Init register pressure.
	(fwprop_done): Clean up register pressure.
	(fwprop_insn): Add comment.

gcc/testsuite/ChangeLog:

	* gcc.target/riscv/rvv/autovec/binop/vadd-vx-fwprop.c: New test.
---
 gcc/fwprop.cc                                 | 359 +++++++++++++++++-
 .../riscv/rvv/autovec/binop/vadd-vx-fwprop.c  |  64 ++++
 2 files changed, 419 insertions(+), 4 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/binop/vadd-vx-fwprop.c

diff --git a/gcc/fwprop.cc b/gcc/fwprop.cc
index 0707a234726..ce6f5a74b00 100644
--- a/gcc/fwprop.cc
+++ b/gcc/fwprop.cc
@@ -36,6 +36,10 @@ along with GCC; see the file COPYING3.  If not see
 #include "tree-pass.h"
 #include "rtl-iter.h"
 #include "target.h"
+#include "dominance.h"
+
+#include "ira.h"
+#include "regpressure.h"
 
 /* This pass does simple forward propagation and simplification when an
    operand of an insn can only come from a single def.  This pass uses
@@ -103,6 +107,10 @@ using namespace rtl_ssa;
 
 static int num_changes;
 
+/* Keep track of which registers already increased the pressure to avoid double
+   booking.  */
+sbitmap pressure_accounted;
+
 /* Do not try to replace constant addresses or addresses of local and
    argument slots.  These MEM expressions are made only once and inserted
    in many instructions, as well as being used to control symbol table
@@ -181,6 +189,8 @@ namespace
     bool changed_mem_p () const { return result_flags & CHANGED_MEM; }
     bool folded_to_constants_p () const;
     bool profitable_p () const;
+    bool register_pressure_high_p (rtx, rtx, rtx_insn *, rtx_insn *) const;
+    bool update_register_pressure (rtx, rtx, rtx_insn *, rtx_insn *) const;
 
     bool check_mem (int, rtx) final override;
     void note_simplification (int, uint16_t, rtx, rtx) final override;
@@ -340,18 +350,272 @@ fwprop_propagation::profitable_p () const
       && !paradoxical_subreg_p (to))
     return true;
 
-  if (CONSTANT_P (to))
+  /* Disallow propagations into debug insns when it is not a reg-reg
+     propagation.  Their costs are usually 0 and a cost comparison will
+     always be profitable no matter how much we complicate the pattern.  */
+  if (DEBUG_INSN_P (insn))
+    return false;
+
+  /* For more complex operations check each sub-rtx.  */
+  subrtx_iterator::array_type array;
+  FOR_EACH_SUBRTX (iter, array, to, NONCONST)
+    {
+      const_rtx x = *iter;
+
+      if (REG_P (x))
+	continue;
+
+      else if (GET_CODE (x) == SUBREG
+	  && REG_P (SUBREG_REG (x))
+	  && !paradoxical_subreg_p (x))
+	continue;
+
+      else if (rtx_equal_p (x, to))
+	continue;
+
+      else
+	return false;
+    }
+
+  return true;
+}
+
+/* Check if the register pressure in any predecessor block of USE's block
+   until DEF's block is equal or higher than the number of hardregs in NU's
+   register class.  */
+bool
+fwprop_propagation::register_pressure_high_p (rtx nu, rtx old, rtx_insn *def,
+					      rtx_insn *use) const
+{
+  enum reg_class nu_class, old_class;
+  int nu_nregs, old_nregs;
+  nu_class = regpressure_get_regno_pressure_class (REGNO (nu), &nu_nregs);
+  old_class
+    = regpressure_get_regno_pressure_class (REGNO (old), &old_nregs);
+
+  if (nu_class == NO_REGS && old_class == NO_REGS)
     return true;
 
-  return false;
This patch enables the forwarding of sources from more complex
operations like unary (i.e. vec_duplicate) or binary.  As this
potentially involves replacing a register with a register of a different
class the ira_hoist_pressure machinery is used to calculate the change
in register pressure.  If the propagation would increase the pressure
beyond the number of hard regs, we don't perform it.
+  if (nu_class == old_class)
+    return false;
+
+  basic_block bbfrom = BLOCK_FOR_INSN (def);
+  basic_block bbto = BLOCK_FOR_INSN (use);
+
+  basic_block bb;
+
+  sbitmap visited = sbitmap_alloc (last_basic_block_for_fn (cfun));
+  bitmap_clear (visited);
+  auto_vec<basic_block> q;
+  q.safe_push (bbto);
+
+  bool high = false;
+
+  while (!q.is_empty ())
+    {
+      bb = q.pop ();
+
+      if (bitmap_bit_p (visited, bb->index))
+	continue;
+
+      /* Nothing to do if the register to be replaced is not live
+	 in this BB.  */
+      if (bb != bbfrom && !regpressure_is_live_in (bb, REGNO (old)))
+	continue;
+
+      /* Nothing to do if the replacement register is already live in
+	 this BB.  */
+      if (regpressure_is_live_in (bb, REGNO (nu)))
+	continue;
+
+      bitmap_set_bit (visited, bb->index);
+
+      int press = regpressure_get (bb, nu_class);
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	fprintf (dump_file, "pressure for reg %d in bb %d: %d\n",
+		 REGNO (nu), bb->index, press);
+
+      if (press + nu_nregs > ira_class_hard_regs_num[nu_class])
+	{
+	  high = true;
+	  break;
+	}
+
+      if (bb == bbfrom || bb == ENTRY_BLOCK_PTR_FOR_FN (cfun))
+	break;
+
+      edge e;
+      edge_iterator ei;
+
+      FOR_EACH_EDGE (e, ei, bb->preds)
+	{
+	  basic_block pred = e->src;
+	  q.safe_push (pred);
+	}
+    }
+
+  sbitmap_free (visited);
+
+  return high;
 }
 
-/* Check that X has a single def.  */
+/* Update the register pressure when propagating from DEF to USE
+   replacing OLD with NU.  Return TRUE if the pressure was updated.  */
+bool
+fwprop_propagation::update_register_pressure (rtx nu, rtx old, rtx_insn *def,
+					      rtx_insn *use) const
+{
+  basic_block bb;
+
+  enum reg_class nu_class, old_class;
+  int nu_nregs, old_nregs;
+  nu_class = regpressure_get_regno_pressure_class (REGNO (nu), &nu_nregs);
+  old_class = regpressure_get_regno_pressure_class (REGNO (old), &old_nregs);
+
+  bool changed = false;
+
+  if (nu_class == old_class)
+    return false;
+
+  basic_block bbfrom = BLOCK_FOR_INSN (def);
+  basic_block bbto = BLOCK_FOR_INSN (use);
+
+  sbitmap visited = sbitmap_alloc (last_basic_block_for_fn (cfun));
+  bitmap_clear (visited);
+  auto_vec<basic_block> q;
+  q.safe_push (bbto);
+
+  /* Precompute the users.  */
+  auto_vec<rtx_insn *> users;
+  df_ref duse;
+  bool should_decrease = true;
+  FOR_EACH_INSN_USE (duse, def)
+    {
+      rtx dreg = DF_REF_REAL_REG (duse);
+      if (REGNO (dreg) != REGNO (old))
+	continue;
+
+      df_ref op_ref = DF_REG_USE_CHAIN (REGNO (dreg));
+      for (; op_ref; op_ref = DF_REF_NEXT_REG (op_ref))
+	{
+	  if (!DF_REF_INSN_INFO (op_ref))
+	    continue;
+
+	  rtx_insn *insn = DF_REF_INSN (op_ref);
+
+	  /* If there are other users in BBTO, never decrease the
+	     pressure.  */
+	  if (BLOCK_FOR_INSN (insn) == bbto && insn != use)
+	    {
+	      should_decrease = false;
+	      break;
+	    }
+
+	  users.safe_push (insn);
+	}
+    }
+
+  while (!q.is_empty ())
+    {
+      bb = q.pop ();
+
+      if (bitmap_bit_p (visited, bb->index))
+	continue;
+
+      /* Nothing to do if the register to be replaced is not live
+	 in this BB.  */
+      if (bb != bbfrom && !regpressure_is_live_in (bb, REGNO (old)))
+	continue;
+
+      /* Nothing to do if the new register is already live in this BB.  */
+      if (regpressure_is_live_in (bb, REGNO (nu)))
+	continue;
+
+      bitmap_set_bit (visited, bb->index);
+
+      if (regpressure_is_live_in (bb, REGNO (old)))
+	{
+	  if (dump_file && (dump_flags & TDF_DETAILS))
+	    fprintf (dump_file, "increasing pressure for "
+		     "reg %d in bb %d by %d reg(s).\n",
+		     REGNO (nu), bb->index, nu_nregs);
+	  regpressure_set_live_in (bb, REGNO (nu));
+	  regpressure_increase (bb, nu_class, nu_nregs);
+	  changed = true;
+
+	  if (should_decrease)
+	    {
+	      rtx_insn *user;
+	      int ix;
+	      bool should_decrease_local = true;
+
+	      /* If this BB dominates no BB where OLD is used (except BBTO)
+		 the register pressure is decreased.  */
+	      FOR_EACH_VEC_ELT(users, ix, user)
+		{
+		  basic_block ubb = BLOCK_FOR_INSN (user);
+		  if (ubb == bbto)
+		    continue;
+		  if (dominated_by_p (CDI_DOMINATORS, ubb, bb))
+		    {
+		      should_decrease_local = false;
+		      break;
+		    }
+		}
+
+	      if (should_decrease_local)
+		{
+		  if (dump_file && (dump_flags & TDF_DETAILS))
+		    fprintf (dump_file, "decreasing pressure for "
+			     "reg %d in bb %d by %d reg(s).\n",
+			     REGNO (old), bb->index, old_nregs);
+		  regpressure_clear_live_in (bb, REGNO (old));
+		  regpressure_decrease (bb, old_class, old_nregs);
+		}
+	    }
+	}
+
+      if (bb == bbfrom || bb == ENTRY_BLOCK_PTR_FOR_FN (cfun))
+	break;
+
+      edge e;
+      edge_iterator ei;
+
+      FOR_EACH_EDGE (e, ei, bb->preds)
+	{
+	  basic_block pred = e->src;
+	  q.safe_push (pred);
+	}
+    }
+
+  sbitmap_free (visited);
+  return changed;
+}
+
+/* Check that X has a single def.  In case of more complex operations
+   check each sub-rtx.  */
 
 static bool
 reg_single_def_p (rtx x)
 {
-  return REG_P (x) && crtl->ssa->single_dominating_def (REGNO (x));
+  if (REG_P (x))
+    return crtl->ssa->single_dominating_def (REGNO (x));
+
+  subrtx_var_iterator::array_type array;
+  FOR_EACH_SUBRTX_VAR (iter, array, x, ALL)
+    {
+      rtx sub = *iter;
+
+      if (SUBREG_P (x) && !contains_paradoxical_subreg_p (sub))
+	sub = SUBREG_REG (x);
+
+      bool ok = REG_P (sub) && crtl->ssa->single_dominating_def (REGNO (sub));
+
+      if (!ok)
+	return false;
+    }
+
+  return true;
 }
 
 /* Try to substitute (set DEST SRC), which defines DEF, into note NOTE of
@@ -490,6 +754,81 @@ try_fwprop_subst_pattern (obstack_watermark &attempt, insn_change &use_change,
 	  }
       }
 
+  /* Perform register-pressure checks for all sub-rtxs.
+     If FLAG_IRA_HOIST_PRESSURE is set only allow the propagation if the
+     register pressure is not high.
+     Otherwise only allow the propagation if DEF has only a single use.  */
+  if (ok)
+    {
+      rtx_insn *def_rtl = def_insn->rtl ();
+
+      subrtx_var_iterator::array_type array;
+      FOR_EACH_SUBRTX_VAR (iter, array, src, NONCONST)
+	{
+	  rtx srcreg = *iter;
+	  rtx dstreg = dest;
+
+	  /* For register pressure we do not care about subregs so strip
+	     them.  */
+	  while (SUBREG_P (dstreg))
+	    dstreg = SUBREG_REG (dstreg);
+
+	  while (SUBREG_P (srcreg))
+	    srcreg = SUBREG_REG (srcreg);
+
+	  if (!REG_P (dstreg) || !REG_P (srcreg))
+	    continue;
+
+	  /* Nothing to do for the same class.  */
+	  bool same_class = regpressure_same_class (dstreg, srcreg);
+	  if (same_class)
+	    {
+	      if (flag_ira_hoist_pressure)
+		continue;
+	      else
+		{
+		  /* Check if this is the only use.  If not do not
+		     continue.  */
+		  for (use_info *use : def->all_uses ())
+		    if (use->insn () != use_insn)
+		      return false;
+		}
+	    }
+
+	  /* Never allow different pressure classes when not considering
+	     register pressure.  */
+	  if (!same_class && !flag_ira_hoist_pressure)
+	    return false;
+
+	  if (!bitmap_bit_p (pressure_accounted, REGNO (srcreg)))
+	    {
+	      if (!prop.register_pressure_high_p (srcreg, dstreg, def_rtl,
+						  use_rtl))
+		{
+		  bool updated
+		    = prop.update_register_pressure (srcreg, dstreg, def_rtl,
+						     use_rtl);
+		  if (updated)
+		    bitmap_set_bit (pressure_accounted, REGNO (srcreg));
+		}
+	      else
+		{
+		  if (dump_file && (dump_flags & TDF_DETAILS))
+		    fprintf (dump_file, "cannot propagate from insn %d into"
+			     " insn %d: %s\n", def_insn->uid (),
+			     use_insn->uid (),
+			     "would increase register pressure beyond number"
+			     " of hard regs");
+		  ok = false;
+		}
+	    }
+	  else
+	    if (dump_file && (dump_flags & TDF_DETAILS))
+	      fprintf (dump_file, "pressure for reg %d has already"
+		       " been accounted\n", REGNO (srcreg));
+	}
+    }
+
   if (!ok)
     {
       /* The pattern didn't match, but if all uses of SRC folded to
@@ -538,6 +877,7 @@ try_fwprop_subst_pattern (obstack_watermark &attempt, insn_change &use_change,
   confirm_change_group ();
   crtl->ssa->change_insn (use_change);
   num_changes++;
+
   return true;
 }
 
@@ -890,6 +1230,10 @@ fwprop_init (void)
 
   df_analyze ();
   crtl->ssa = new rtl_ssa::function_info (cfun);
+
+  regpressure_init ();
+  pressure_accounted = sbitmap_alloc (DF_REG_SIZE (df));
+  bitmap_clear (pressure_accounted);
 }
 
 static void
@@ -910,6 +1254,9 @@ fwprop_done (void)
     fprintf (dump_file,
 	     "\nNumber of successful forward propagations: %d\n\n",
 	     num_changes);
+
+  regpressure_cleanup ();
+  sbitmap_free (pressure_accounted);
 }
 
 /* Try to optimize INSN, returning true if something changes.
@@ -919,6 +1266,10 @@ fwprop_done (void)
 static bool
 fwprop_insn (insn_info *insn, bool fwprop_addr_p)
 {
+  /* TODO: when rejecting propagations due to register pressure
+     we would actually like to try the propagation with most
+     potential = uses first instead of going through them sequentially.  */
+
   for (use_info *use : insn->uses ())
     {
       if (use->is_mem ())
diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/binop/vadd-vx-fwprop.c b/gcc/testsuite/gcc.target/riscv/rvv/autovec/binop/vadd-vx-fwprop.c
new file mode 100644
index 00000000000..0ed1bb80f73
--- /dev/null
+++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/binop/vadd-vx-fwprop.c
@@ -0,0 +1,64 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-O2 -march=rv64gcv -mabi=lp64d --param=riscv-autovec-preference=fixed-vlmax -fdump-rtl-fwprop1-details" } */
+
+/* Check that we propagate from vec_duplicates into vadd.vv, turning it into
+   vadd.vx.  */
+
+#include <stdint-gcc.h>
+
+/* Define 26 variables so that we can perform two forward propagations until we
+   reach the hard reg limit of 28.  */
+uint64_t *restrict dst1;
+uint64_t *restrict dst2;
+uint64_t *restrict dst3;
+uint64_t *restrict dst4;
+uint64_t *restrict dst5;
+uint64_t *restrict dst6;
+uint64_t *restrict dst7;
+uint64_t *restrict dst8;
+uint64_t *restrict dst9;
+uint64_t *restrict dst10;
+uint64_t *restrict dst11;
+uint64_t *restrict a1;
+uint64_t *restrict a2;
+uint64_t *restrict a3;
+uint64_t *restrict a4;
+uint64_t *restrict a5;
+uint64_t *restrict a6;
+uint64_t *restrict a7;
+uint64_t *restrict a8;
+uint64_t *restrict a9;
+uint64_t *restrict a10;
+uint64_t *restrict a11;
+uint64_t b1;
+uint64_t b2;
+uint64_t b3;
+uint64_t b4;
+
+void foo (int n)
+{
+  __builtin_expect (n % 128, 0);
+  for (int i = 0; i < n; i++)
+    {
+      /* We expect b3 to be propagated into vadd here as well as in the other
+         7 occurences (without increasing the pressure every time).
+         Then b2 is still supposed to be propagated but b1 is not even though
+         it would be better to do so.  */
+      dst1[i] = a1[i] + b3;
+      dst2[i] = a2[i] + b3;
+      dst3[i] = a3[i] + b2;
+      dst4[i] = a4[i] + b1;
+      dst5[i] = a5[i] + b1;
+      dst6[i] = a6[i] + b3;
+      dst7[i] = a7[i] + b3;
+      dst8[i] = a8[i] + b3;
+      dst9[i] = a9[i] + b3;
+      dst10[i] = a10[i] + b3;
+      dst11[i] = a11[i] + b3;
+    }
+}
+
+/* { dg-final { scan-rtl-dump-times "increasing pressure for reg" 2 "fwprop1" } } */
+/* { dg-final { scan-rtl-dump-times "pressure for reg \[0-9\]+ has already been accounted" 7 "fwprop1" } } */
+/* { dg-final { scan-rtl-dump-times "would increase register pressure beyond number of hard regs" 2 "fwprop1" } } */
+/* { dg-final { scan-assembler-times "\tvadd\.vx" 9 } } */
Richard Sandiford Sept. 6, 2023, 8:44 p.m. UTC | #8
Robin Dapp <rdapp.gcc@gmail.com> writes:
> Hi Richard,
>
> I did some testing with the attached v2 that does not restrict to UNARY
> anymore.  As feared ;) there is some more fallout that I'm detailing below.
>
> On Power there is one guality fail (pr43051-1.c) that I would take
> the liberty of ignoring for now.
>
> On x86 there are four fails:
>
>  - cond_op_addsubmuldiv__Float16-2.c: assembler error
>    unsupported masking for `vmovsh'.  I guess that's a latent backend
>    problem.
>
>  - ifcvt-3.c, pr49095.c: Here we propagate into a compare.  Before, we had
>    (cmp (reg/CC) 0) and now we have (cmp (plus (reg1 reg2) 0).
>    That looks like a costing problem and can hopefully solveable by making
>    the second compare more expensive, preventing the propagation.
>    i386 costing (or every costing?) is brittle so that could well break other
>    things. 
>
>  - pr88873.c: This is interesting because even before this patch we
>    propagated with different register classes (V2DF vs DI).  With the patch
>    we check the register pressure, find the class NO_REGS for V2DF and
>    abort (because the patch assumes NO_REGS = high pressure).  I'm thinking
>    of keeping the old behavior for reg-reg propagations and only checking
>    the pressure for more complex operations.
>
> aarch64 has the most fails:
>
>  - One guality fail (same as Power).
>  - shrn-combine-[123].c as before.
>
>  - A class of (hopefully, I only checked some) similar cases where we
>    propagate an unspec_whilelo into an unspec_ptest.  Before we would only
>    set a REG_EQUALS note.
>    Before we managed to create a while_ultsivnx16bi_cc whereas now we have
>    while_ultsivnx16bi and while_ultsivnx16bi_ptest that won't be combined.
>    We create redundant whilelos and I'm not sure how to improve that. I
>    guess a peephole is out of the question :)
>
>  - pred-combine-and.c: Here the new propagation appears useful at first.
>    We propagate a "vector mask and" into a while_ultsivnx4bi_ptest and the
>    individual and registers remain live up to the propagation site (while
>    being dead before the patch).
>    With the registers dead, combine could create a single fcmgt before.
>    Now it only manages a 2->2 combination because we still need the registers
>    and end up with two fcmgts.
>    The code is worse but this seems more bad luck than anything.
>
>  - Addressing fails from before:  I looked into these and suspect all of
>    them are a similar.
>    What happens is that we have a poly_int offset that we shift, negate
>    and then add to x0.  The result is used as load address.
>    Before, we would pull (combine) the (plus x0 reg) into the load keeping
>    the neg and shift.
>    Now we propagate everything into a single (set (minus x0 offset)).
>    The propagation itself seems worthwhile because we save one insn.
>    However as we got rid of the base/offset split by lumping everything
>    together, combine cannot pull the (plus) into the address load and
>    we require an aarch64_split_add_offset.  This will emit the longer
>    sequence of ashiftl and subtract.  The "base" address is x0 here so
>    we cannot convert (minus x0 ...)) into neg.
>    I didn't go through all of aarch64_split_add_offset.  I suppose we
>    could re-add the separation of base/offset there but that might be
>    a loss when the result is not used as an address. 
>    
> Again, all in all no fatal problems but pretty annoying :)  It's not much
> but just gradually worse than with just UNARY.  Any idea on how/whether to
> continue?

Thanks for giving it a go.  Can you post the latest version of the
regpressure patch too?  The previous on-list version I could find
seems to be too old.

Thanks,
Richard

> Regards
>  Robin
>
> gcc/ChangeLog:
>
> 	* fwprop.cc (fwprop_propagation::profitable_p): Add unary
> 	handling.
> 	(fwprop_propagation::update_register_pressure): New function.
> 	(fwprop_propagation::register_pressure_high_p): New function
> 	(reg_single_def_for_src_p): Look through unary expressions.
> 	(try_fwprop_subst_pattern): Check register pressure.
> 	(forward_propagate_into): Call new function.
> 	(fwprop_init): Init register pressure.
> 	(fwprop_done): Clean up register pressure.
> 	(fwprop_insn): Add comment.
>
> gcc/testsuite/ChangeLog:
>
> 	* gcc.target/riscv/rvv/autovec/binop/vadd-vx-fwprop.c: New test.
> ---
>  gcc/fwprop.cc                                 | 359 +++++++++++++++++-
>  .../riscv/rvv/autovec/binop/vadd-vx-fwprop.c  |  64 ++++
>  2 files changed, 419 insertions(+), 4 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/binop/vadd-vx-fwprop.c
>
> diff --git a/gcc/fwprop.cc b/gcc/fwprop.cc
> index 0707a234726..ce6f5a74b00 100644
> --- a/gcc/fwprop.cc
> +++ b/gcc/fwprop.cc
> @@ -36,6 +36,10 @@ along with GCC; see the file COPYING3.  If not see
>  #include "tree-pass.h"
>  #include "rtl-iter.h"
>  #include "target.h"
> +#include "dominance.h"
> +
> +#include "ira.h"
> +#include "regpressure.h"
>  
>  /* This pass does simple forward propagation and simplification when an
>     operand of an insn can only come from a single def.  This pass uses
> @@ -103,6 +107,10 @@ using namespace rtl_ssa;
>  
>  static int num_changes;
>  
> +/* Keep track of which registers already increased the pressure to avoid double
> +   booking.  */
> +sbitmap pressure_accounted;
> +
>  /* Do not try to replace constant addresses or addresses of local and
>     argument slots.  These MEM expressions are made only once and inserted
>     in many instructions, as well as being used to control symbol table
> @@ -181,6 +189,8 @@ namespace
>      bool changed_mem_p () const { return result_flags & CHANGED_MEM; }
>      bool folded_to_constants_p () const;
>      bool profitable_p () const;
> +    bool register_pressure_high_p (rtx, rtx, rtx_insn *, rtx_insn *) const;
> +    bool update_register_pressure (rtx, rtx, rtx_insn *, rtx_insn *) const;
>  
>      bool check_mem (int, rtx) final override;
>      void note_simplification (int, uint16_t, rtx, rtx) final override;
> @@ -340,18 +350,272 @@ fwprop_propagation::profitable_p () const
>        && !paradoxical_subreg_p (to))
>      return true;
>  
> -  if (CONSTANT_P (to))
> +  /* Disallow propagations into debug insns when it is not a reg-reg
> +     propagation.  Their costs are usually 0 and a cost comparison will
> +     always be profitable no matter how much we complicate the pattern.  */
> +  if (DEBUG_INSN_P (insn))
> +    return false;
> +
> +  /* For more complex operations check each sub-rtx.  */
> +  subrtx_iterator::array_type array;
> +  FOR_EACH_SUBRTX (iter, array, to, NONCONST)
> +    {
> +      const_rtx x = *iter;
> +
> +      if (REG_P (x))
> +	continue;
> +
> +      else if (GET_CODE (x) == SUBREG
> +	  && REG_P (SUBREG_REG (x))
> +	  && !paradoxical_subreg_p (x))
> +	continue;
> +
> +      else if (rtx_equal_p (x, to))
> +	continue;
> +
> +      else
> +	return false;
> +    }
> +
> +  return true;
> +}
> +
> +/* Check if the register pressure in any predecessor block of USE's block
> +   until DEF's block is equal or higher than the number of hardregs in NU's
> +   register class.  */
> +bool
> +fwprop_propagation::register_pressure_high_p (rtx nu, rtx old, rtx_insn *def,
> +					      rtx_insn *use) const
> +{
> +  enum reg_class nu_class, old_class;
> +  int nu_nregs, old_nregs;
> +  nu_class = regpressure_get_regno_pressure_class (REGNO (nu), &nu_nregs);
> +  old_class
> +    = regpressure_get_regno_pressure_class (REGNO (old), &old_nregs);
> +
> +  if (nu_class == NO_REGS && old_class == NO_REGS)
>      return true;
>  
> -  return false;
> This patch enables the forwarding of sources from more complex
> operations like unary (i.e. vec_duplicate) or binary.  As this
> potentially involves replacing a register with a register of a different
> class the ira_hoist_pressure machinery is used to calculate the change
> in register pressure.  If the propagation would increase the pressure
> beyond the number of hard regs, we don't perform it.
> +  if (nu_class == old_class)
> +    return false;
> +
> +  basic_block bbfrom = BLOCK_FOR_INSN (def);
> +  basic_block bbto = BLOCK_FOR_INSN (use);
> +
> +  basic_block bb;
> +
> +  sbitmap visited = sbitmap_alloc (last_basic_block_for_fn (cfun));
> +  bitmap_clear (visited);
> +  auto_vec<basic_block> q;
> +  q.safe_push (bbto);
> +
> +  bool high = false;
> +
> +  while (!q.is_empty ())
> +    {
> +      bb = q.pop ();
> +
> +      if (bitmap_bit_p (visited, bb->index))
> +	continue;
> +
> +      /* Nothing to do if the register to be replaced is not live
> +	 in this BB.  */
> +      if (bb != bbfrom && !regpressure_is_live_in (bb, REGNO (old)))
> +	continue;
> +
> +      /* Nothing to do if the replacement register is already live in
> +	 this BB.  */
> +      if (regpressure_is_live_in (bb, REGNO (nu)))
> +	continue;
> +
> +      bitmap_set_bit (visited, bb->index);
> +
> +      int press = regpressure_get (bb, nu_class);
> +      if (dump_file && (dump_flags & TDF_DETAILS))
> +	fprintf (dump_file, "pressure for reg %d in bb %d: %d\n",
> +		 REGNO (nu), bb->index, press);
> +
> +      if (press + nu_nregs > ira_class_hard_regs_num[nu_class])
> +	{
> +	  high = true;
> +	  break;
> +	}
> +
> +      if (bb == bbfrom || bb == ENTRY_BLOCK_PTR_FOR_FN (cfun))
> +	break;
> +
> +      edge e;
> +      edge_iterator ei;
> +
> +      FOR_EACH_EDGE (e, ei, bb->preds)
> +	{
> +	  basic_block pred = e->src;
> +	  q.safe_push (pred);
> +	}
> +    }
> +
> +  sbitmap_free (visited);
> +
> +  return high;
>  }
>  
> -/* Check that X has a single def.  */
> +/* Update the register pressure when propagating from DEF to USE
> +   replacing OLD with NU.  Return TRUE if the pressure was updated.  */
> +bool
> +fwprop_propagation::update_register_pressure (rtx nu, rtx old, rtx_insn *def,
> +					      rtx_insn *use) const
> +{
> +  basic_block bb;
> +
> +  enum reg_class nu_class, old_class;
> +  int nu_nregs, old_nregs;
> +  nu_class = regpressure_get_regno_pressure_class (REGNO (nu), &nu_nregs);
> +  old_class = regpressure_get_regno_pressure_class (REGNO (old), &old_nregs);
> +
> +  bool changed = false;
> +
> +  if (nu_class == old_class)
> +    return false;
> +
> +  basic_block bbfrom = BLOCK_FOR_INSN (def);
> +  basic_block bbto = BLOCK_FOR_INSN (use);
> +
> +  sbitmap visited = sbitmap_alloc (last_basic_block_for_fn (cfun));
> +  bitmap_clear (visited);
> +  auto_vec<basic_block> q;
> +  q.safe_push (bbto);
> +
> +  /* Precompute the users.  */
> +  auto_vec<rtx_insn *> users;
> +  df_ref duse;
> +  bool should_decrease = true;
> +  FOR_EACH_INSN_USE (duse, def)
> +    {
> +      rtx dreg = DF_REF_REAL_REG (duse);
> +      if (REGNO (dreg) != REGNO (old))
> +	continue;
> +
> +      df_ref op_ref = DF_REG_USE_CHAIN (REGNO (dreg));
> +      for (; op_ref; op_ref = DF_REF_NEXT_REG (op_ref))
> +	{
> +	  if (!DF_REF_INSN_INFO (op_ref))
> +	    continue;
> +
> +	  rtx_insn *insn = DF_REF_INSN (op_ref);
> +
> +	  /* If there are other users in BBTO, never decrease the
> +	     pressure.  */
> +	  if (BLOCK_FOR_INSN (insn) == bbto && insn != use)
> +	    {
> +	      should_decrease = false;
> +	      break;
> +	    }
> +
> +	  users.safe_push (insn);
> +	}
> +    }
> +
> +  while (!q.is_empty ())
> +    {
> +      bb = q.pop ();
> +
> +      if (bitmap_bit_p (visited, bb->index))
> +	continue;
> +
> +      /* Nothing to do if the register to be replaced is not live
> +	 in this BB.  */
> +      if (bb != bbfrom && !regpressure_is_live_in (bb, REGNO (old)))
> +	continue;
> +
> +      /* Nothing to do if the new register is already live in this BB.  */
> +      if (regpressure_is_live_in (bb, REGNO (nu)))
> +	continue;
> +
> +      bitmap_set_bit (visited, bb->index);
> +
> +      if (regpressure_is_live_in (bb, REGNO (old)))
> +	{
> +	  if (dump_file && (dump_flags & TDF_DETAILS))
> +	    fprintf (dump_file, "increasing pressure for "
> +		     "reg %d in bb %d by %d reg(s).\n",
> +		     REGNO (nu), bb->index, nu_nregs);
> +	  regpressure_set_live_in (bb, REGNO (nu));
> +	  regpressure_increase (bb, nu_class, nu_nregs);
> +	  changed = true;
> +
> +	  if (should_decrease)
> +	    {
> +	      rtx_insn *user;
> +	      int ix;
> +	      bool should_decrease_local = true;
> +
> +	      /* If this BB dominates no BB where OLD is used (except BBTO)
> +		 the register pressure is decreased.  */
> +	      FOR_EACH_VEC_ELT(users, ix, user)
> +		{
> +		  basic_block ubb = BLOCK_FOR_INSN (user);
> +		  if (ubb == bbto)
> +		    continue;
> +		  if (dominated_by_p (CDI_DOMINATORS, ubb, bb))
> +		    {
> +		      should_decrease_local = false;
> +		      break;
> +		    }
> +		}
> +
> +	      if (should_decrease_local)
> +		{
> +		  if (dump_file && (dump_flags & TDF_DETAILS))
> +		    fprintf (dump_file, "decreasing pressure for "
> +			     "reg %d in bb %d by %d reg(s).\n",
> +			     REGNO (old), bb->index, old_nregs);
> +		  regpressure_clear_live_in (bb, REGNO (old));
> +		  regpressure_decrease (bb, old_class, old_nregs);
> +		}
> +	    }
> +	}
> +
> +      if (bb == bbfrom || bb == ENTRY_BLOCK_PTR_FOR_FN (cfun))
> +	break;
> +
> +      edge e;
> +      edge_iterator ei;
> +
> +      FOR_EACH_EDGE (e, ei, bb->preds)
> +	{
> +	  basic_block pred = e->src;
> +	  q.safe_push (pred);
> +	}
> +    }
> +
> +  sbitmap_free (visited);
> +  return changed;
> +}
> +
> +/* Check that X has a single def.  In case of more complex operations
> +   check each sub-rtx.  */
>  
>  static bool
>  reg_single_def_p (rtx x)
>  {
> -  return REG_P (x) && crtl->ssa->single_dominating_def (REGNO (x));
> +  if (REG_P (x))
> +    return crtl->ssa->single_dominating_def (REGNO (x));
> +
> +  subrtx_var_iterator::array_type array;
> +  FOR_EACH_SUBRTX_VAR (iter, array, x, ALL)
> +    {
> +      rtx sub = *iter;
> +
> +      if (SUBREG_P (x) && !contains_paradoxical_subreg_p (sub))
> +	sub = SUBREG_REG (x);
> +
> +      bool ok = REG_P (sub) && crtl->ssa->single_dominating_def (REGNO (sub));
> +
> +      if (!ok)
> +	return false;
> +    }
> +
> +  return true;
>  }
>  
>  /* Try to substitute (set DEST SRC), which defines DEF, into note NOTE of
> @@ -490,6 +754,81 @@ try_fwprop_subst_pattern (obstack_watermark &attempt, insn_change &use_change,
>  	  }
>        }
>  
> +  /* Perform register-pressure checks for all sub-rtxs.
> +     If FLAG_IRA_HOIST_PRESSURE is set only allow the propagation if the
> +     register pressure is not high.
> +     Otherwise only allow the propagation if DEF has only a single use.  */
> +  if (ok)
> +    {
> +      rtx_insn *def_rtl = def_insn->rtl ();
> +
> +      subrtx_var_iterator::array_type array;
> +      FOR_EACH_SUBRTX_VAR (iter, array, src, NONCONST)
> +	{
> +	  rtx srcreg = *iter;
> +	  rtx dstreg = dest;
> +
> +	  /* For register pressure we do not care about subregs so strip
> +	     them.  */
> +	  while (SUBREG_P (dstreg))
> +	    dstreg = SUBREG_REG (dstreg);
> +
> +	  while (SUBREG_P (srcreg))
> +	    srcreg = SUBREG_REG (srcreg);
> +
> +	  if (!REG_P (dstreg) || !REG_P (srcreg))
> +	    continue;
> +
> +	  /* Nothing to do for the same class.  */
> +	  bool same_class = regpressure_same_class (dstreg, srcreg);
> +	  if (same_class)
> +	    {
> +	      if (flag_ira_hoist_pressure)
> +		continue;
> +	      else
> +		{
> +		  /* Check if this is the only use.  If not do not
> +		     continue.  */
> +		  for (use_info *use : def->all_uses ())
> +		    if (use->insn () != use_insn)
> +		      return false;
> +		}
> +	    }
> +
> +	  /* Never allow different pressure classes when not considering
> +	     register pressure.  */
> +	  if (!same_class && !flag_ira_hoist_pressure)
> +	    return false;
> +
> +	  if (!bitmap_bit_p (pressure_accounted, REGNO (srcreg)))
> +	    {
> +	      if (!prop.register_pressure_high_p (srcreg, dstreg, def_rtl,
> +						  use_rtl))
> +		{
> +		  bool updated
> +		    = prop.update_register_pressure (srcreg, dstreg, def_rtl,
> +						     use_rtl);
> +		  if (updated)
> +		    bitmap_set_bit (pressure_accounted, REGNO (srcreg));
> +		}
> +	      else
> +		{
> +		  if (dump_file && (dump_flags & TDF_DETAILS))
> +		    fprintf (dump_file, "cannot propagate from insn %d into"
> +			     " insn %d: %s\n", def_insn->uid (),
> +			     use_insn->uid (),
> +			     "would increase register pressure beyond number"
> +			     " of hard regs");
> +		  ok = false;
> +		}
> +	    }
> +	  else
> +	    if (dump_file && (dump_flags & TDF_DETAILS))
> +	      fprintf (dump_file, "pressure for reg %d has already"
> +		       " been accounted\n", REGNO (srcreg));
> +	}
> +    }
> +
>    if (!ok)
>      {
>        /* The pattern didn't match, but if all uses of SRC folded to
> @@ -538,6 +877,7 @@ try_fwprop_subst_pattern (obstack_watermark &attempt, insn_change &use_change,
>    confirm_change_group ();
>    crtl->ssa->change_insn (use_change);
>    num_changes++;
> +
>    return true;
>  }
>  
> @@ -890,6 +1230,10 @@ fwprop_init (void)
>  
>    df_analyze ();
>    crtl->ssa = new rtl_ssa::function_info (cfun);
> +
> +  regpressure_init ();
> +  pressure_accounted = sbitmap_alloc (DF_REG_SIZE (df));
> +  bitmap_clear (pressure_accounted);
>  }
>  
>  static void
> @@ -910,6 +1254,9 @@ fwprop_done (void)
>      fprintf (dump_file,
>  	     "\nNumber of successful forward propagations: %d\n\n",
>  	     num_changes);
> +
> +  regpressure_cleanup ();
> +  sbitmap_free (pressure_accounted);
>  }
>  
>  /* Try to optimize INSN, returning true if something changes.
> @@ -919,6 +1266,10 @@ fwprop_done (void)
>  static bool
>  fwprop_insn (insn_info *insn, bool fwprop_addr_p)
>  {
> +  /* TODO: when rejecting propagations due to register pressure
> +     we would actually like to try the propagation with most
> +     potential = uses first instead of going through them sequentially.  */
> +
>    for (use_info *use : insn->uses ())
>      {
>        if (use->is_mem ())
> diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/binop/vadd-vx-fwprop.c b/gcc/testsuite/gcc.target/riscv/rvv/autovec/binop/vadd-vx-fwprop.c
> new file mode 100644
> index 00000000000..0ed1bb80f73
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/binop/vadd-vx-fwprop.c
> @@ -0,0 +1,64 @@
> +/* { dg-do compile } */
> +/* { dg-additional-options "-O2 -march=rv64gcv -mabi=lp64d --param=riscv-autovec-preference=fixed-vlmax -fdump-rtl-fwprop1-details" } */
> +
> +/* Check that we propagate from vec_duplicates into vadd.vv, turning it into
> +   vadd.vx.  */
> +
> +#include <stdint-gcc.h>
> +
> +/* Define 26 variables so that we can perform two forward propagations until we
> +   reach the hard reg limit of 28.  */
> +uint64_t *restrict dst1;
> +uint64_t *restrict dst2;
> +uint64_t *restrict dst3;
> +uint64_t *restrict dst4;
> +uint64_t *restrict dst5;
> +uint64_t *restrict dst6;
> +uint64_t *restrict dst7;
> +uint64_t *restrict dst8;
> +uint64_t *restrict dst9;
> +uint64_t *restrict dst10;
> +uint64_t *restrict dst11;
> +uint64_t *restrict a1;
> +uint64_t *restrict a2;
> +uint64_t *restrict a3;
> +uint64_t *restrict a4;
> +uint64_t *restrict a5;
> +uint64_t *restrict a6;
> +uint64_t *restrict a7;
> +uint64_t *restrict a8;
> +uint64_t *restrict a9;
> +uint64_t *restrict a10;
> +uint64_t *restrict a11;
> +uint64_t b1;
> +uint64_t b2;
> +uint64_t b3;
> +uint64_t b4;
> +
> +void foo (int n)
> +{
> +  __builtin_expect (n % 128, 0);
> +  for (int i = 0; i < n; i++)
> +    {
> +      /* We expect b3 to be propagated into vadd here as well as in the other
> +         7 occurences (without increasing the pressure every time).
> +         Then b2 is still supposed to be propagated but b1 is not even though
> +         it would be better to do so.  */
> +      dst1[i] = a1[i] + b3;
> +      dst2[i] = a2[i] + b3;
> +      dst3[i] = a3[i] + b2;
> +      dst4[i] = a4[i] + b1;
> +      dst5[i] = a5[i] + b1;
> +      dst6[i] = a6[i] + b3;
> +      dst7[i] = a7[i] + b3;
> +      dst8[i] = a8[i] + b3;
> +      dst9[i] = a9[i] + b3;
> +      dst10[i] = a10[i] + b3;
> +      dst11[i] = a11[i] + b3;
> +    }
> +}
> +
> +/* { dg-final { scan-rtl-dump-times "increasing pressure for reg" 2 "fwprop1" } } */
> +/* { dg-final { scan-rtl-dump-times "pressure for reg \[0-9\]+ has already been accounted" 7 "fwprop1" } } */
> +/* { dg-final { scan-rtl-dump-times "would increase register pressure beyond number of hard regs" 2 "fwprop1" } } */
> +/* { dg-final { scan-assembler-times "\tvadd\.vx" 9 } } */
Robin Dapp Sept. 7, 2023, 7:56 a.m. UTC | #9
> Thanks for giving it a go.  Can you post the latest version of the
> regpressure patch too?  The previous on-list version I could find
> seems to be too old.

Oh, sure, attached.  Apologies, I added the regpressure_same_class
convenience helper but forgot to re-send it.

Regards
 Robin

From d3f87e4de7d7d05a2fcf8c948097b14eadf08c90 Mon Sep 17 00:00:00 2001
From: Robin Dapp <rdapp@ventanamicro.com>
Date: Mon, 24 Jul 2023 16:25:38 +0200
Subject: [PATCH] gcse: Extract reg pressure handling into separate file.

This patch extracts the hoist-pressure handling from gcse and puts it
into a separate file so it can be used by other passes in the future.
No functional change.

gcc/ChangeLog:

	* Makefile.in: Add regpressure.o.
	* gcse.cc (struct bb_data): Move to regpressure.cc.
	(BB_DATA): Ditto.
	(get_regno_pressure_class): Ditto.
	(get_pressure_class_and_nregs): Ditto.
	(record_set_data): Ditto.
	(update_bb_reg_pressure): Ditto.
	(should_hoist_expr_to_dom): Ditto.
	(hoist_code): Ditto.
	(change_pressure): Ditto.
	(calculate_bb_reg_pressure): Ditto.
	(one_code_hoisting_pass): Ditto.
	* gcse.h (single_set_gcse): Export single_set_gcse.
	* regpressure.cc: New file.
	* regpressure.h: New file.
---
 gcc/Makefile.in    |   1 +
 gcc/gcse.cc        | 304 ++---------------------------------
 gcc/gcse.h         |   2 +
 gcc/regpressure.cc | 391 +++++++++++++++++++++++++++++++++++++++++++++
 gcc/regpressure.h  |  48 ++++++
 5 files changed, 459 insertions(+), 287 deletions(-)
 create mode 100644 gcc/regpressure.cc
 create mode 100644 gcc/regpressure.h

diff --git a/gcc/Makefile.in b/gcc/Makefile.in
index 5930b52462a..62768a84f81 100644
--- a/gcc/Makefile.in
+++ b/gcc/Makefile.in
@@ -1610,6 +1610,7 @@ OBJS = \
 	reg-stack.o \
 	regcprop.o \
 	reginfo.o \
+	regpressure.o \
 	regrename.o \
 	regstat.o \
 	reload.o \
diff --git a/gcc/gcse.cc b/gcc/gcse.cc
index f689c0c2687..5bafef7970f 100644
--- a/gcc/gcse.cc
+++ b/gcc/gcse.cc
@@ -160,6 +160,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "gcse.h"
 #include "gcse-common.h"
 #include "function-abi.h"
+#include "regpressure.h"
 
 /* We support GCSE via Partial Redundancy Elimination.  PRE optimizations
    are a superset of those done by classic GCSE.
@@ -419,30 +420,6 @@ static bool doing_code_hoisting_p = false;
 /* For available exprs */
 static sbitmap *ae_kill;
 
-/* Data stored for each basic block.  */
-struct bb_data
-{
-  /* Maximal register pressure inside basic block for given register class
-     (defined only for the pressure classes).  */
-  int max_reg_pressure[N_REG_CLASSES];
-  /* Recorded register pressure of basic block before trying to hoist
-     an expression.  Will be used to restore the register pressure
-     if the expression should not be hoisted.  */
-  int old_pressure;
-  /* Recorded register live_in info of basic block during code hoisting
-     process.  BACKUP is used to record live_in info before trying to
-     hoist an expression, and will be used to restore LIVE_IN if the
-     expression should not be hoisted.  */
-  bitmap live_in, backup;
-};
-
-#define BB_DATA(bb) ((struct bb_data *) (bb)->aux)
-
-static basic_block curr_bb;
-
-/* Current register pressure for each pressure class.  */
-static int curr_reg_pressure[N_REG_CLASSES];
-
 
 static void compute_can_copy (void);
 static void *gmalloc (size_t) ATTRIBUTE_MALLOC;
@@ -494,8 +471,6 @@ static bool should_hoist_expr_to_dom (basic_block, struct gcse_expr *,
 				      enum reg_class,
 				      int *, bitmap, rtx_insn *);
 static bool hoist_code (void);
-static enum reg_class get_regno_pressure_class (int regno, int *nregs);
-static enum reg_class get_pressure_class_and_nregs (rtx_insn *insn, int *nregs);
 static bool one_code_hoisting_pass (void);
 static rtx_insn *process_insert_insn (struct gcse_expr *);
 static bool pre_edge_insert (struct edge_list *, struct gcse_expr **);
@@ -2402,7 +2377,7 @@ record_set_data (rtx dest, const_rtx set, void *data)
     }
 }
 
-static const_rtx
+const_rtx
 single_set_gcse (rtx_insn *insn)
 {
   struct set_data s;
@@ -2804,72 +2779,6 @@ compute_code_hoist_data (void)
     fprintf (dump_file, "\n");
 }
 
-/* Update register pressure for BB when hoisting an expression from
-   instruction FROM, if live ranges of inputs are shrunk.  Also
-   maintain live_in information if live range of register referred
-   in FROM is shrunk.
-   
-   Return 0 if register pressure doesn't change, otherwise return
-   the number by which register pressure is decreased.
-   
-   NOTE: Register pressure won't be increased in this function.  */
-
-static int
-update_bb_reg_pressure (basic_block bb, rtx_insn *from)
-{
-  rtx dreg;
-  rtx_insn *insn;
-  basic_block succ_bb;
-  df_ref use, op_ref;
-  edge succ;
-  edge_iterator ei;
-  int decreased_pressure = 0;
-  int nregs;
-  enum reg_class pressure_class;
-
-  FOR_EACH_INSN_USE (use, from)
-    {
-      dreg = DF_REF_REAL_REG (use);
-      /* The live range of register is shrunk only if it isn't:
-	 1. referred on any path from the end of this block to EXIT, or
-	 2. referred by insns other than FROM in this block.  */
-      FOR_EACH_EDGE (succ, ei, bb->succs)
-	{
-	  succ_bb = succ->dest;
-	  if (succ_bb == EXIT_BLOCK_PTR_FOR_FN (cfun))
-	    continue;
-
-	  if (bitmap_bit_p (BB_DATA (succ_bb)->live_in, REGNO (dreg)))
-	    break;
-	}
-      if (succ != NULL)
-	continue;
-
-      op_ref = DF_REG_USE_CHAIN (REGNO (dreg));
-      for (; op_ref; op_ref = DF_REF_NEXT_REG (op_ref))
-	{
-	  if (!DF_REF_INSN_INFO (op_ref))
-	    continue;
-
-	  insn = DF_REF_INSN (op_ref);
-	  if (BLOCK_FOR_INSN (insn) == bb
-	      && NONDEBUG_INSN_P (insn) && insn != from)
-	    break;
-	}
-
-      pressure_class = get_regno_pressure_class (REGNO (dreg), &nregs);
-      /* Decrease register pressure and update live_in information for
-	 this block.  */
-      if (!op_ref && pressure_class != NO_REGS)
-	{
-	  decreased_pressure += nregs;
-	  BB_DATA (bb)->max_reg_pressure[pressure_class] -= nregs;
-	  bitmap_clear_bit (BB_DATA (bb)->live_in, REGNO (dreg));
-	}
-    }
-  return decreased_pressure;
-}
-
 /* Determine if the expression EXPR should be hoisted to EXPR_BB up in
    flow graph, if it can reach BB unimpared.  Stop the search if the
    expression would need to be moved more than DISTANCE instructions.
@@ -2917,12 +2826,9 @@ should_hoist_expr_to_dom (basic_block expr_bb, struct gcse_expr *expr,
       /* Record old information of basic block BB when it is visited
 	 at the first time.  */
       if (!bitmap_bit_p (hoisted_bbs, bb->index))
-	{
-	  struct bb_data *data = BB_DATA (bb);
-	  bitmap_copy (data->backup, data->live_in);
-	  data->old_pressure = data->max_reg_pressure[pressure_class];
-	}
-      decreased_pressure = update_bb_reg_pressure (bb, from);
+	  regpressure_init_bb_info (bb, pressure_class);
+
+      decreased_pressure = regpressure_update_bb_reg_pressure (bb, from);
     }
   /* Terminate the search if distance, for which EXPR is allowed to move,
      is exhausted.  */
@@ -2945,8 +2851,7 @@ should_hoist_expr_to_dom (basic_block expr_bb, struct gcse_expr *expr,
 	     on ARM target, while it has no obvious effect on other
 	     targets like x86, x86_64, mips and powerpc.  */
 	  else if (CONST_INT_P (expr->expr)
-		   || (BB_DATA (bb)->max_reg_pressure[pressure_class]
-			 >= ira_class_hard_regs_num[pressure_class]
+		   || (!regpressure_viable (bb, pressure_class)
 		       && decreased_pressure < *nregs))
 	    distance -= bb_size[bb->index];
 	}
@@ -3073,7 +2978,6 @@ hoist_code (void)
   int *to_bb_head;
   int *bb_size;
   bool changed = false;
-  struct bb_data *data;
   /* Basic blocks that have occurrences reachable from BB.  */
   bitmap from_bbs;
   /* Basic blocks through which expr is hoisted.  */
@@ -3206,8 +3110,9 @@ hoist_code (void)
 		    max_distance += (bb_size[dominated->index]
 				     - to_bb_head[INSN_UID (occr->insn)]);
 
-		  pressure_class = get_pressure_class_and_nregs (occr->insn,
-								 &nregs);
+		  pressure_class =
+		    regpressure_get_pressure_class_and_nregs (occr->insn,
+							      &nregs);
 
 		  /* Note if the expression should be hoisted from the dominated
 		     block to BB if it can reach DOMINATED unimpared.
@@ -3262,13 +3167,11 @@ hoist_code (void)
 		  /* Increase register pressure of basic blocks to which
 		     expr is hoisted because of extended live range of
 		     output.  */
-		  data = BB_DATA (bb);
-		  data->max_reg_pressure[pressure_class] += nregs;
+		  regpressure_increase (bb, pressure_class, nregs);
+
 		  EXECUTE_IF_SET_IN_BITMAP (hoisted_bbs, 0, k, bi)
-		    {
-		      data = BB_DATA (BASIC_BLOCK_FOR_FN (cfun, k));
-		      data->max_reg_pressure[pressure_class] += nregs;
-		    }
+		    regpressure_increase (BASIC_BLOCK_FOR_FN (cfun, k),
+					  pressure_class, nregs);
 		}
 	      else if (flag_ira_hoist_pressure)
 		{
@@ -3276,12 +3179,8 @@ hoist_code (void)
 		     blocks recorded in hoisted_bbs when expr will not be
 		     hoisted.  */
 		  EXECUTE_IF_SET_IN_BITMAP (hoisted_bbs, 0, k, bi)
-		    {
-		      data = BB_DATA (BASIC_BLOCK_FOR_FN (cfun, k));
-		      bitmap_copy (data->live_in, data->backup);
-		      data->max_reg_pressure[pressure_class]
-			  = data->old_pressure;
-		    }
+		    regpressure_reset (BASIC_BLOCK_FOR_FN (cfun, k),
+				       pressure_class);
 		}
 
 	      if (flag_ira_hoist_pressure)
@@ -3343,166 +3242,6 @@ hoist_code (void)
   return changed;
 }
 
-/* Return pressure class and number of needed hard registers (through
-   *NREGS) of register REGNO.  */
-static enum reg_class
-get_regno_pressure_class (int regno, int *nregs)
-{
-  if (regno >= FIRST_PSEUDO_REGISTER)
-    {
-      enum reg_class pressure_class;
-
-      pressure_class = reg_allocno_class (regno);
-      pressure_class = ira_pressure_class_translate[pressure_class];
-      *nregs
-	= ira_reg_class_max_nregs[pressure_class][PSEUDO_REGNO_MODE (regno)];
-      return pressure_class;
-    }
-  else if (! TEST_HARD_REG_BIT (ira_no_alloc_regs, regno)
-	   && ! TEST_HARD_REG_BIT (eliminable_regset, regno))
-    {
-      *nregs = 1;
-      return ira_pressure_class_translate[REGNO_REG_CLASS (regno)];
-    }
-  else
-    {
-      *nregs = 0;
-      return NO_REGS;
-    }
-}
-
-/* Return pressure class and number of hard registers (through *NREGS)
-   for destination of INSN. */
-static enum reg_class
-get_pressure_class_and_nregs (rtx_insn *insn, int *nregs)
-{
-  rtx reg;
-  enum reg_class pressure_class;
-  const_rtx set = single_set_gcse (insn);
-
-  reg = SET_DEST (set);
-  if (GET_CODE (reg) == SUBREG)
-    reg = SUBREG_REG (reg);
-  if (MEM_P (reg))
-    {
-      *nregs = 0;
-      pressure_class = NO_REGS;
-    }
-  else
-    {
-      gcc_assert (REG_P (reg));
-      pressure_class = reg_allocno_class (REGNO (reg));
-      pressure_class = ira_pressure_class_translate[pressure_class];
-      *nregs
-	= ira_reg_class_max_nregs[pressure_class][GET_MODE (SET_SRC (set))];
-    }
-  return pressure_class;
-}
-
-/* Increase (if INCR_P) or decrease current register pressure for
-   register REGNO.  */
-static void
-change_pressure (int regno, bool incr_p)
-{
-  int nregs;
-  enum reg_class pressure_class;
-
-  pressure_class = get_regno_pressure_class (regno, &nregs);
-  if (! incr_p)
-    curr_reg_pressure[pressure_class] -= nregs;
-  else
-    {
-      curr_reg_pressure[pressure_class] += nregs;
-      if (BB_DATA (curr_bb)->max_reg_pressure[pressure_class]
-	  < curr_reg_pressure[pressure_class])
-	BB_DATA (curr_bb)->max_reg_pressure[pressure_class]
-	  = curr_reg_pressure[pressure_class];
-    }
-}
-
-/* Calculate register pressure for each basic block by walking insns
-   from last to first.  */
-static void
-calculate_bb_reg_pressure (void)
-{
-  int i;
-  unsigned int j;
-  rtx_insn *insn;
-  basic_block bb;
-  bitmap curr_regs_live;
-  bitmap_iterator bi;
-
-
-  ira_setup_eliminable_regset ();
-  curr_regs_live = BITMAP_ALLOC (&reg_obstack);
-  FOR_EACH_BB_FN (bb, cfun)
-    {
-      curr_bb = bb;
-      BB_DATA (bb)->live_in = BITMAP_ALLOC (NULL);
-      BB_DATA (bb)->backup = BITMAP_ALLOC (NULL);
-      bitmap_copy (BB_DATA (bb)->live_in, df_get_live_in (bb));
-      bitmap_copy (curr_regs_live, df_get_live_out (bb));
-      for (i = 0; i < ira_pressure_classes_num; i++)
-	curr_reg_pressure[ira_pressure_classes[i]] = 0;
-      EXECUTE_IF_SET_IN_BITMAP (curr_regs_live, 0, j, bi)
-	change_pressure (j, true);
-
-      FOR_BB_INSNS_REVERSE (bb, insn)
-	{
-	  rtx dreg;
-	  int regno;
-	  df_ref def, use;
-
-	  if (! NONDEBUG_INSN_P (insn))
-	    continue;
-
-	  FOR_EACH_INSN_DEF (def, insn)
-	    {
-	      dreg = DF_REF_REAL_REG (def);
-	      gcc_assert (REG_P (dreg));
-	      regno = REGNO (dreg);
-	      if (!(DF_REF_FLAGS (def)
-		    & (DF_REF_PARTIAL | DF_REF_CONDITIONAL)))
-		{
-		  if (bitmap_clear_bit (curr_regs_live, regno))
-		    change_pressure (regno, false);
-		}
-	    }
-
-	  FOR_EACH_INSN_USE (use, insn)
-	    {
-	      dreg = DF_REF_REAL_REG (use);
-	      gcc_assert (REG_P (dreg));
-	      regno = REGNO (dreg);
-	      if (bitmap_set_bit (curr_regs_live, regno))
-		change_pressure (regno, true);
-	    }
-	}
-    }
-  BITMAP_FREE (curr_regs_live);
-
-  if (dump_file == NULL)
-    return;
-
-  fprintf (dump_file, "\nRegister Pressure: \n");
-  FOR_EACH_BB_FN (bb, cfun)
-    {
-      fprintf (dump_file, "  Basic block %d: \n", bb->index);
-      for (i = 0; (int) i < ira_pressure_classes_num; i++)
-	{
-	  enum reg_class pressure_class;
-
-	  pressure_class = ira_pressure_classes[i];
-	  if (BB_DATA (bb)->max_reg_pressure[pressure_class] == 0)
-	    continue;
-
-	  fprintf (dump_file, "    %s=%d\n", reg_class_names[pressure_class],
-		   BB_DATA (bb)->max_reg_pressure[pressure_class]);
-	}
-    }
-  fprintf (dump_file, "\n");
-}
-
 /* Top level routine to perform one code hoisting (aka unification) pass
 
    Return true if a change was made.  */
@@ -3524,13 +3263,7 @@ one_code_hoisting_pass (void)
 
   /* Calculate register pressure for each basic block.  */
   if (flag_ira_hoist_pressure)
-    {
-      regstat_init_n_sets_and_refs ();
-      ira_set_pseudo_classes (false, dump_file);
-      alloc_aux_for_blocks (sizeof (struct bb_data));
-      calculate_bb_reg_pressure ();
-      regstat_free_n_sets_and_refs ();
-    }
+    regpressure_init ();
 
   /* We need alias.  */
   init_alias_analysis ();
@@ -3554,10 +3287,7 @@ one_code_hoisting_pass (void)
     }
 
   if (flag_ira_hoist_pressure)
-    {
-      free_aux_for_blocks ();
-      free_reg_info ();
-    }
+    regpressure_cleanup ();
   free_hash_table (&expr_hash_table);
   free_gcse_mem ();
   obstack_free (&gcse_obstack, NULL);
diff --git a/gcc/gcse.h b/gcc/gcse.h
index e68afdcea21..1162086570d 100644
--- a/gcc/gcse.h
+++ b/gcc/gcse.h
@@ -43,4 +43,6 @@ void gcse_cc_finalize (void);
 extern bool gcse_or_cprop_is_too_expensive (const char *);
 extern rtx_insn *insert_insn_end_basic_block (rtx_insn *, basic_block);
 
+const_rtx single_set_gcse (rtx_insn *insn);
+
 #endif
diff --git a/gcc/regpressure.cc b/gcc/regpressure.cc
new file mode 100644
index 00000000000..7846d320a66
--- /dev/null
+++ b/gcc/regpressure.cc
@@ -0,0 +1,391 @@
+/* Register pressure helper functions.
+   Copyright (C) 2023 Free Software Foundation, Inc.
+
+This file is part of GCC.
+
+GCC is free software; you can redistribute it and/or modify it under
+the terms of the GNU General Public License as published by the Free
+Software Foundation; either version 3, or (at your option) any later
+version.
+
+GCC is distributed in the hope that it will be useful, but WITHOUT ANY
+WARRANTY; without even the implied warranty of MERCHANTABILITY or
+FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
+for more details.
+
+You should have received a copy of the GNU General Public License
+along with GCC; see the file COPYING3.  If not see
+<http://www.gnu.org/licenses/>.  */
+
+#include "config.h"
+#include "system.h"
+#include "coretypes.h"
+#include "backend.h"
+#include "target.h"
+#include "rtl.h"
+#include "df.h"
+#include "memmodel.h"
+#include "tm_p.h"
+#include "regs.h"
+#include "ira.h"
+#include "gcse.h"
+#include "gcse-common.h"
+#include "regpressure.h"
+
+struct bb_data
+{
+  /* Maximal register pressure inside basic block for given register class
+     (defined only for the pressure classes).  */
+  int max_reg_pressure[N_REG_CLASSES];
+  /* Recorded register pressure of basic block before trying to hoist
+     an expression.  Will be used to restore the register pressure
+     if the expression should not be hoisted.  */
+  int old_pressure;
+  /* Recorded register live_in info of basic block during code hoisting
+     process.  BACKUP is used to record live_in info before trying to
+     hoist an expression, and will be used to restore LIVE_IN if the
+     expression should not be hoisted.  */
+  bitmap live_in, backup;
+};
+
+#define BB_DATA(bb) ((struct bb_data *) (bb)->aux)
+
+static basic_block curr_bb;
+
+/* Current register pressure for each pressure class.  */
+static int curr_reg_pressure[N_REG_CLASSES];
+
+/* Update register pressure for BB when hoisting an expression from
+   instruction FROM, if live ranges of inputs are shrunk.  Also
+   maintain live_in information if live range of register referred
+   in FROM is shrunk.
+
+   Return 0 if register pressure doesn't change, otherwise return
+   the number by which register pressure is decreased.
+
+   NOTE: Register pressure won't be increased in this function.  */
+
+int
+regpressure_update_bb_reg_pressure (basic_block bb, rtx_insn *from)
+{
+  rtx dreg;
+  rtx_insn *insn;
+  basic_block succ_bb;
+  df_ref use, op_ref;
+  edge succ;
+  edge_iterator ei;
+  int decreased_pressure = 0;
+  int nregs;
+  enum reg_class pressure_class;
+
+  FOR_EACH_INSN_USE (use, from)
+    {
+      dreg = DF_REF_REAL_REG (use);
+      /* The live range of register is shrunk only if it isn't:
+	 1. referred on any path from the end of this block to EXIT, or
+	 2. referred by insns other than FROM in this block.  */
+      FOR_EACH_EDGE (succ, ei, bb->succs)
+	{
+	  succ_bb = succ->dest;
+	  if (succ_bb == EXIT_BLOCK_PTR_FOR_FN (cfun))
+	    continue;
+
+	  if (regpressure_is_live_in (succ_bb, REGNO (dreg)))
+	    break;
+	}
+      if (succ != NULL)
+	continue;
+
+      op_ref = DF_REG_USE_CHAIN (REGNO (dreg));
+      for (; op_ref; op_ref = DF_REF_NEXT_REG (op_ref))
+	{
+	  if (!DF_REF_INSN_INFO (op_ref))
+	    continue;
+
+	  insn = DF_REF_INSN (op_ref);
+	  if (BLOCK_FOR_INSN (insn) == bb && NONDEBUG_INSN_P (insn)
+	      && insn != from)
+	    break;
+	}
+
+      pressure_class
+	= regpressure_get_regno_pressure_class (REGNO (dreg), &nregs);
+      /* Decrease register pressure and update live_in information for
+	 this block.  */
+      if (!op_ref && pressure_class != NO_REGS)
+	{
+	  decreased_pressure += nregs;
+	  regpressure_decrease (bb, pressure_class, nregs);
+	  regpressure_clear_live_in (bb, REGNO (dreg));
+	}
+    }
+  return decreased_pressure;
+}
+
+/* Increase (if INCR_P) or decrease current register pressure for
+   register REGNO.  */
+static void
+change_pressure (int regno, bool incr_p)
+{
+  int nregs;
+  enum reg_class pressure_class;
+
+  pressure_class = regpressure_get_regno_pressure_class (regno, &nregs);
+  if (!incr_p)
+    curr_reg_pressure[pressure_class] -= nregs;
+  else
+    {
+      curr_reg_pressure[pressure_class] += nregs;
+      if (regpressure_get (curr_bb, pressure_class)
+	  < curr_reg_pressure[pressure_class])
+	BB_DATA (curr_bb)->max_reg_pressure[pressure_class]
+	  = curr_reg_pressure[pressure_class];
+    }
+}
+
+/* Calculate register pressure for each basic block by walking insns
+   from last to first.  */
+static void
+calculate_bb_reg_pressure (void)
+{
+  int i;
+  unsigned int j;
+  rtx_insn *insn;
+  basic_block bb;
+  bitmap curr_regs_live;
+  bitmap_iterator bi;
+
+  ira_setup_eliminable_regset ();
+  curr_regs_live = BITMAP_ALLOC (&reg_obstack);
+  FOR_EACH_BB_FN (bb, cfun)
+    {
+      curr_bb = bb;
+      BB_DATA (bb)->live_in = BITMAP_ALLOC (NULL);
+      BB_DATA (bb)->backup = BITMAP_ALLOC (NULL);
+      bitmap_copy (BB_DATA (bb)->live_in, df_get_live_in (bb));
+      bitmap_copy (curr_regs_live, df_get_live_out (bb));
+      for (i = 0; i < ira_pressure_classes_num; i++)
+	curr_reg_pressure[ira_pressure_classes[i]] = 0;
+      EXECUTE_IF_SET_IN_BITMAP (curr_regs_live, 0, j, bi)
+	change_pressure (j, true);
+
+      FOR_BB_INSNS_REVERSE (bb, insn)
+	{
+	  rtx dreg;
+	  int regno;
+	  df_ref def, use;
+
+	  if (!NONDEBUG_INSN_P (insn))
+	    continue;
+
+	  FOR_EACH_INSN_DEF (def, insn)
+	    {
+	      dreg = DF_REF_REAL_REG (def);
+	      gcc_assert (REG_P (dreg));
+	      regno = REGNO (dreg);
+	      if (!(DF_REF_FLAGS (def) & (DF_REF_PARTIAL | DF_REF_CONDITIONAL)))
+		{
+		  if (bitmap_clear_bit (curr_regs_live, regno))
+		    change_pressure (regno, false);
+		}
+	    }
+
+	  FOR_EACH_INSN_USE (use, insn)
+	    {
+	      dreg = DF_REF_REAL_REG (use);
+	      gcc_assert (REG_P (dreg));
+	      regno = REGNO (dreg);
+	      if (bitmap_set_bit (curr_regs_live, regno))
+		change_pressure (regno, true);
+	    }
+	}
+    }
+  BITMAP_FREE (curr_regs_live);
+
+  if (dump_file == NULL)
+    return;
+
+  fprintf (dump_file, "\nRegister Pressure: \n");
+  FOR_EACH_BB_FN (bb, cfun)
+    {
+      fprintf (dump_file, "  Basic block %d: \n", bb->index);
+      for (i = 0; (int) i < ira_pressure_classes_num; i++)
+	{
+	  enum reg_class pressure_class;
+
+	  pressure_class = ira_pressure_classes[i];
+	  if (BB_DATA (bb)->max_reg_pressure[pressure_class] == 0)
+	    continue;
+
+	  fprintf (dump_file, "    %s=%d\n", reg_class_names[pressure_class],
+		   BB_DATA (bb)->max_reg_pressure[pressure_class]);
+	}
+    }
+  fprintf (dump_file, "\n");
+}
+
+/* Initialize needed resources for register pressure calculation.  */
+void
+regpressure_init ()
+{
+  regstat_init_n_sets_and_refs ();
+  ira_set_pseudo_classes (false, dump_file);
+  if (flag_ira_hoist_pressure)
+    {
+      alloc_aux_for_blocks (sizeof (struct bb_data));
+      calculate_bb_reg_pressure ();
+    }
+  regstat_free_n_sets_and_refs ();
+}
+
+/* Free up all initialized resources.  */
+void
+regpressure_cleanup ()
+{
+  if (flag_ira_hoist_pressure)
+    {
+      free_aux_for_blocks ();
+    }
+  free_reg_info ();
+}
+
+/* Initialize aux data for BB regarding PRESSURE_CLASS.  */
+void
+regpressure_init_bb_info (basic_block bb, enum reg_class pressure_class)
+{
+  /* Record old information of basic block BB when it is visited
+     at the first time.  */
+  struct bb_data *data = BB_DATA (bb);
+  bitmap_copy (data->backup, data->live_in);
+  data->old_pressure = data->max_reg_pressure[pressure_class];
+}
+
+/* Increase PRESSURE_CLASS's register pressure in BB by NREGS.  */
+void
+regpressure_increase (basic_block bb, enum reg_class pressure_class, int nregs)
+{
+  struct bb_data *data = BB_DATA (bb);
+  data->max_reg_pressure[pressure_class] += nregs;
+}
+
+/* Decrease PRESSURE_CLASS's register pressure in BB by NREGS.  */
+void
+regpressure_decrease (basic_block bb, enum reg_class pressure_class, int nregs)
+{
+  struct bb_data *data = BB_DATA (bb);
+  data->max_reg_pressure[pressure_class] -= nregs;
+}
+
+/* Reset PRESSURE_CLASS's register pressure in BB to its initial value.  */
+void
+regpressure_reset (basic_block bb, enum reg_class pressure_class)
+{
+  struct bb_data *data = BB_DATA (bb);
+  bitmap_copy (data->live_in, data->backup);
+  data->max_reg_pressure[pressure_class] = data->old_pressure;
+}
+
+/* Return TRUE if the current register pressure of PRESSURE_CLASS in BB
+   is less than the number of hard regs.  */
+bool
+regpressure_viable (basic_block bb, enum reg_class pressure_class)
+{
+  return (regpressure_get (bb, pressure_class)
+	  < ira_class_hard_regs_num[pressure_class]);
+}
+
+/* Return pressure class and number of needed hard registers (through
+ *NREGS) of register REGNO.  */
+enum reg_class
+regpressure_get_regno_pressure_class (int regno, int *nregs)
+{
+  if (regno >= FIRST_PSEUDO_REGISTER)
+    {
+      enum reg_class pressure_class;
+
+      pressure_class = reg_allocno_class (regno);
+      pressure_class = ira_pressure_class_translate[pressure_class];
+      *nregs
+	= ira_reg_class_max_nregs[pressure_class][PSEUDO_REGNO_MODE (regno)];
+      return pressure_class;
+    }
+  else if (!TEST_HARD_REG_BIT (ira_no_alloc_regs, regno)
+	   && !TEST_HARD_REG_BIT (eliminable_regset, regno))
+    {
+      *nregs = 1;
+      return ira_pressure_class_translate[REGNO_REG_CLASS (regno)];
+    }
+  else
+    {
+      *nregs = 0;
+      return NO_REGS;
+    }
+}
+
+/* Return pressure class and number of hard registers (through *NREGS)
+   for destination of INSN. */
+enum reg_class
+regpressure_get_pressure_class_and_nregs (rtx_insn *insn, int *nregs)
+{
+  rtx reg;
+  enum reg_class pressure_class;
+  const_rtx set = single_set_gcse (insn);
+
+  reg = SET_DEST (set);
+  if (GET_CODE (reg) == SUBREG)
+    reg = SUBREG_REG (reg);
+  if (MEM_P (reg))
+    {
+      *nregs = 0;
+      pressure_class = NO_REGS;
+    }
+  else
+    {
+      gcc_assert (REG_P (reg));
+      pressure_class = reg_allocno_class (REGNO (reg));
+      pressure_class = ira_pressure_class_translate[pressure_class];
+      *nregs
+	= ira_reg_class_max_nregs[pressure_class][GET_MODE (SET_SRC (set))];
+    }
+  return pressure_class;
+}
+
+/* Return TRUE if REGNO is live (incoming) in BB.  */
+bool
+regpressure_is_live_in (basic_block bb, int regno)
+{
+  return bitmap_bit_p (BB_DATA (bb)->live_in, regno);
+}
+
+/* Clear the live (incoming) bit for REGNO in BB.  */
+void
+regpressure_clear_live_in (basic_block bb, int regno)
+{
+  bitmap_clear_bit (BB_DATA (bb)->live_in, regno);
+}
+
+/* Set the live (incoming) bit for REGNO in BB.  */
+void
+regpressure_set_live_in (basic_block bb, int regno)
+{
+  bitmap_set_bit (BB_DATA (bb)->live_in, regno);
+}
+
+/* Returns the register pressure for PRESSURE_CLASS in BB.  */
+int
+regpressure_get (basic_block bb, enum reg_class pressure_class)
+{
+  return BB_DATA (bb)->max_reg_pressure[pressure_class];
+}
+
+bool
+regpressure_same_class (rtx a, rtx b)
+{
+  enum reg_class a_class, b_class;
+  int a_nregs, b_nregs;
+
+  a_class = regpressure_get_regno_pressure_class (REGNO (a), &a_nregs);
+  b_class = regpressure_get_regno_pressure_class (REGNO (b), &b_nregs);
+
+  return a_class == b_class;
+}
diff --git a/gcc/regpressure.h b/gcc/regpressure.h
new file mode 100644
index 00000000000..b9008699acd
--- /dev/null
+++ b/gcc/regpressure.h
@@ -0,0 +1,48 @@
+/* Register pressure helper functions.
+   Copyright (C) 2023 Free Software Foundation, Inc.
+
+This file is part of GCC.
+
+GCC is free software; you can redistribute it and/or modify it under
+the terms of the GNU General Public License as published by the Free
+Software Foundation; either version 3, or (at your option) any later
+version.
+
+GCC is distributed in the hope that it will be useful, but WITHOUT ANY
+WARRANTY; without even the implied warranty of MERCHANTABILITY or
+FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
+for more details.
+
+You should have received a copy of the GNU General Public License
+along with GCC; see the file COPYING3.  If not see
+<http://www.gnu.org/licenses/>.  */
+
+#ifndef GCC_REGPRESSURE_H
+#define GCC_REGPRESSURE_H
+
+int regpressure_update_bb_reg_pressure (basic_block bb, rtx_insn *from);
+
+void regpressure_init ();
+void regpressure_cleanup ();
+
+void regpressure_init_bb_info (basic_block bb, enum reg_class pressure_class);
+
+void regpressure_increase (basic_block bb, enum reg_class pressure_class, int nregs);
+void regpressure_decrease (basic_block bb, enum reg_class pressure_class, int nregs);
+
+int regpressure_get (basic_block bb, enum reg_class pressure_class);
+
+bool regpressure_is_live_in (basic_block bb, int regno);
+void regpressure_clear_live_in (basic_block bb, int regno);
+void regpressure_set_live_in (basic_block bb, int regno);
+
+void regpressure_reset (basic_block bb, enum reg_class pressure_class);
+
+bool regpressure_viable (basic_block bb, enum reg_class pressure_class);
+
+enum reg_class regpressure_get_regno_pressure_class (int regno, int *nregs);
+enum reg_class regpressure_get_pressure_class_and_nregs (rtx_insn *insn, int *nregs);
+
+bool regpressure_same_class (rtx a, rtx b);
+
+#endif
Richard Sandiford Sept. 7, 2023, 1:42 p.m. UTC | #10
Robin Dapp <rdapp.gcc@gmail.com> writes:
> Hi Richard,
>
> I did some testing with the attached v2 that does not restrict to UNARY
> anymore.  As feared ;) there is some more fallout that I'm detailing below.
>
> On Power there is one guality fail (pr43051-1.c) that I would take
> the liberty of ignoring for now.
>
> On x86 there are four fails:
>
>  - cond_op_addsubmuldiv__Float16-2.c: assembler error
>    unsupported masking for `vmovsh'.  I guess that's a latent backend
>    problem.
>
>  - ifcvt-3.c, pr49095.c: Here we propagate into a compare.  Before, we had
>    (cmp (reg/CC) 0) and now we have (cmp (plus (reg1 reg2) 0).
>    That looks like a costing problem and can hopefully solveable by making
>    the second compare more expensive, preventing the propagation.
>    i386 costing (or every costing?) is brittle so that could well break other
>    things. 
>
>  - pr88873.c: This is interesting because even before this patch we
>    propagated with different register classes (V2DF vs DI).  With the patch
>    we check the register pressure, find the class NO_REGS for V2DF and
>    abort (because the patch assumes NO_REGS = high pressure).  I'm thinking
>    of keeping the old behavior for reg-reg propagations and only checking
>    the pressure for more complex operations.
>
> aarch64 has the most fails:
>
>  - One guality fail (same as Power).
>  - shrn-combine-[123].c as before.
>
>  - A class of (hopefully, I only checked some) similar cases where we
>    propagate an unspec_whilelo into an unspec_ptest.  Before we would only
>    set a REG_EQUALS note.
>    Before we managed to create a while_ultsivnx16bi_cc whereas now we have
>    while_ultsivnx16bi and while_ultsivnx16bi_ptest that won't be combined.
>    We create redundant whilelos and I'm not sure how to improve that. I
>    guess a peephole is out of the question :)

Yeah, I think this is potentially a blocker for propagating A into B
when A is used elsewhere.  Combine is able to combine A and B while
keeping A in parallel with the result.  I think either fwprop would
need to try that too, or it would need to be restricted to cases where A
is only used in B.

I don't think that's really a UNARY_P-or-not thing.  The same problem
would in principle apply to plain unary operations.

>  - pred-combine-and.c: Here the new propagation appears useful at first.

I'm not sure about that, because...

>    We propagate a "vector mask and" into a while_ultsivnx4bi_ptest and the
>    individual and registers remain live up to the propagation site (while
>    being dead before the patch).
>    With the registers dead, combine could create a single fcmgt before.
>    Now it only manages a 2->2 combination because we still need the registers
>    and end up with two fcmgts.
>    The code is worse but this seems more bad luck than anything.

...the fwprop1 change is:

(insn 26 25 27 4 (set (reg:VNx4BI 102 [ vec_mask_and_64 ])
        (and:VNx4BI (reg:VNx4BI 116 [ mask__30.13 ])
            (reg:VNx4BI 98 [ loop_mask_58 ]))) 8420 {andvnx4bi3}
     (nil))
...
(insn 31 30 32 4 (set (reg:VNx4BI 106 [ mask__24.18 ])
        (and:VNx4BI (reg:VNx4BI 118 [ mask__25.17 ])
            (reg:VNx4BI 102 [ vec_mask_and_64 ]))) 8420 {andvnx4bi3}
     (nil))

to:

(insn 26 25 27 4 (set (reg:VNx4BI 102 [ vec_mask_and_64 ])
        (and:VNx4BI (reg:VNx4BI 116 [ mask__30.13 ])
            (reg:VNx4BI 98 [ loop_mask_58 ]))) 8420 {andvnx4bi3}
     (expr_list:REG_DEAD (reg:VNx4BI 116 [ mask__30.13 ])
        (expr_list:REG_DEAD (reg:VNx4BI 98 [ loop_mask_58 ])
            (nil))))
...
(insn 31 30 32 4 (set (reg:VNx4BI 106 [ mask__24.18 ])
        (and:VNx4BI (and:VNx4BI (reg:VNx4BI 116 [ mask__30.13 ])
                (reg:VNx4BI 98 [ loop_mask_58 ]))
            (reg:VNx4BI 118 [ mask__25.17 ]))) 8428 {aarch64_pred_andvnx4bi_z}
     (nil))

On its own this isn't worse.  But it's also not a win.  The before and
after sequences have equal cost, but the after sequence is more complex.

That would probably be OK for something that runs near the end of
the pre-RA pipeline, since it could in principle increase parallelism.
But it's probably a bad idea when we know that the main instruction
combination pass is still to run.  By making insn 31 more complex,
we're making it a less likely combination candidate.

So this isn't necesarily the wrong thing to do.  But I think it is
the wrong time to do it.

>  - Addressing fails from before:  I looked into these and suspect all of
>    them are a similar.
>    What happens is that we have a poly_int offset that we shift, negate
>    and then add to x0.  The result is used as load address.
>    Before, we would pull (combine) the (plus x0 reg) into the load keeping
>    the neg and shift.
>    Now we propagate everything into a single (set (minus x0 offset)).
>    The propagation itself seems worthwhile because we save one insn.
>    However as we got rid of the base/offset split by lumping everything
>    together, combine cannot pull the (plus) into the address load and
>    we require an aarch64_split_add_offset.  This will emit the longer
>    sequence of ashiftl and subtract.  The "base" address is x0 here so
>    we cannot convert (minus x0 ...)) into neg.
>    I didn't go through all of aarch64_split_add_offset.  I suppose we
>    could re-add the separation of base/offset there but that might be
>    a loss when the result is not used as an address. 

Yeah, at least for ld4_s8.c, I agree there are improvements as
well as regressions.  And we don't use (shifted) index addressing
for ld4_s16.c etc even before the patch.  So I agree those changes
are OK.

I don't know whether this will show up elsewhere.  But we can ignore
it for these particular instances.

I think the summary is:

IMO, we have to be mindful that combine is still to run.  We need to avoid
making equal-cost changes if the new form is more complex, or otherwise
likely to interfere with combine.

Alternatively, we could delay the optimisation until after combine
and have freer rein, since we're then just mopping up opportunities
that other passes left behind.

A while back I was experimenting with a second combine pass.  That was
the original motiviation for rtl-ssa.  I never got chance to finish it
off though.

We could do with something similar for PR106594.

Thanks,
Richard

> Again, all in all no fatal problems but pretty annoying :)  It's not much
> but just gradually worse than with just UNARY.  Any idea on how/whether to
> continue?
>
> Regards
>  Robin
>
> gcc/ChangeLog:
>
> 	* fwprop.cc (fwprop_propagation::profitable_p): Add unary
> 	handling.
> 	(fwprop_propagation::update_register_pressure): New function.
> 	(fwprop_propagation::register_pressure_high_p): New function
> 	(reg_single_def_for_src_p): Look through unary expressions.
> 	(try_fwprop_subst_pattern): Check register pressure.
> 	(forward_propagate_into): Call new function.
> 	(fwprop_init): Init register pressure.
> 	(fwprop_done): Clean up register pressure.
> 	(fwprop_insn): Add comment.
>
> gcc/testsuite/ChangeLog:
>
> 	* gcc.target/riscv/rvv/autovec/binop/vadd-vx-fwprop.c: New test.
> ---
>  gcc/fwprop.cc                                 | 359 +++++++++++++++++-
>  .../riscv/rvv/autovec/binop/vadd-vx-fwprop.c  |  64 ++++
>  2 files changed, 419 insertions(+), 4 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/binop/vadd-vx-fwprop.c
>
> diff --git a/gcc/fwprop.cc b/gcc/fwprop.cc
> index 0707a234726..ce6f5a74b00 100644
> --- a/gcc/fwprop.cc
> +++ b/gcc/fwprop.cc
> @@ -36,6 +36,10 @@ along with GCC; see the file COPYING3.  If not see
>  #include "tree-pass.h"
>  #include "rtl-iter.h"
>  #include "target.h"
> +#include "dominance.h"
> +
> +#include "ira.h"
> +#include "regpressure.h"
>  
>  /* This pass does simple forward propagation and simplification when an
>     operand of an insn can only come from a single def.  This pass uses
> @@ -103,6 +107,10 @@ using namespace rtl_ssa;
>  
>  static int num_changes;
>  
> +/* Keep track of which registers already increased the pressure to avoid double
> +   booking.  */
> +sbitmap pressure_accounted;
> +
>  /* Do not try to replace constant addresses or addresses of local and
>     argument slots.  These MEM expressions are made only once and inserted
>     in many instructions, as well as being used to control symbol table
> @@ -181,6 +189,8 @@ namespace
>      bool changed_mem_p () const { return result_flags & CHANGED_MEM; }
>      bool folded_to_constants_p () const;
>      bool profitable_p () const;
> +    bool register_pressure_high_p (rtx, rtx, rtx_insn *, rtx_insn *) const;
> +    bool update_register_pressure (rtx, rtx, rtx_insn *, rtx_insn *) const;
>  
>      bool check_mem (int, rtx) final override;
>      void note_simplification (int, uint16_t, rtx, rtx) final override;
> @@ -340,18 +350,272 @@ fwprop_propagation::profitable_p () const
>        && !paradoxical_subreg_p (to))
>      return true;
>  
> -  if (CONSTANT_P (to))
> +  /* Disallow propagations into debug insns when it is not a reg-reg
> +     propagation.  Their costs are usually 0 and a cost comparison will
> +     always be profitable no matter how much we complicate the pattern.  */
> +  if (DEBUG_INSN_P (insn))
> +    return false;
> +
> +  /* For more complex operations check each sub-rtx.  */
> +  subrtx_iterator::array_type array;
> +  FOR_EACH_SUBRTX (iter, array, to, NONCONST)
> +    {
> +      const_rtx x = *iter;
> +
> +      if (REG_P (x))
> +	continue;
> +
> +      else if (GET_CODE (x) == SUBREG
> +	  && REG_P (SUBREG_REG (x))
> +	  && !paradoxical_subreg_p (x))
> +	continue;
> +
> +      else if (rtx_equal_p (x, to))
> +	continue;
> +
> +      else
> +	return false;
> +    }
> +
> +  return true;
> +}
> +
> +/* Check if the register pressure in any predecessor block of USE's block
> +   until DEF's block is equal or higher than the number of hardregs in NU's
> +   register class.  */
> +bool
> +fwprop_propagation::register_pressure_high_p (rtx nu, rtx old, rtx_insn *def,
> +					      rtx_insn *use) const
> +{
> +  enum reg_class nu_class, old_class;
> +  int nu_nregs, old_nregs;
> +  nu_class = regpressure_get_regno_pressure_class (REGNO (nu), &nu_nregs);
> +  old_class
> +    = regpressure_get_regno_pressure_class (REGNO (old), &old_nregs);
> +
> +  if (nu_class == NO_REGS && old_class == NO_REGS)
>      return true;
>  
> -  return false;
> This patch enables the forwarding of sources from more complex
> operations like unary (i.e. vec_duplicate) or binary.  As this
> potentially involves replacing a register with a register of a different
> class the ira_hoist_pressure machinery is used to calculate the change
> in register pressure.  If the propagation would increase the pressure
> beyond the number of hard regs, we don't perform it.
> +  if (nu_class == old_class)
> +    return false;
> +
> +  basic_block bbfrom = BLOCK_FOR_INSN (def);
> +  basic_block bbto = BLOCK_FOR_INSN (use);
> +
> +  basic_block bb;
> +
> +  sbitmap visited = sbitmap_alloc (last_basic_block_for_fn (cfun));
> +  bitmap_clear (visited);
> +  auto_vec<basic_block> q;
> +  q.safe_push (bbto);
> +
> +  bool high = false;
> +
> +  while (!q.is_empty ())
> +    {
> +      bb = q.pop ();
> +
> +      if (bitmap_bit_p (visited, bb->index))
> +	continue;
> +
> +      /* Nothing to do if the register to be replaced is not live
> +	 in this BB.  */
> +      if (bb != bbfrom && !regpressure_is_live_in (bb, REGNO (old)))
> +	continue;
> +
> +      /* Nothing to do if the replacement register is already live in
> +	 this BB.  */
> +      if (regpressure_is_live_in (bb, REGNO (nu)))
> +	continue;
> +
> +      bitmap_set_bit (visited, bb->index);
> +
> +      int press = regpressure_get (bb, nu_class);
> +      if (dump_file && (dump_flags & TDF_DETAILS))
> +	fprintf (dump_file, "pressure for reg %d in bb %d: %d\n",
> +		 REGNO (nu), bb->index, press);
> +
> +      if (press + nu_nregs > ira_class_hard_regs_num[nu_class])
> +	{
> +	  high = true;
> +	  break;
> +	}
> +
> +      if (bb == bbfrom || bb == ENTRY_BLOCK_PTR_FOR_FN (cfun))
> +	break;
> +
> +      edge e;
> +      edge_iterator ei;
> +
> +      FOR_EACH_EDGE (e, ei, bb->preds)
> +	{
> +	  basic_block pred = e->src;
> +	  q.safe_push (pred);
> +	}
> +    }
> +
> +  sbitmap_free (visited);
> +
> +  return high;
>  }
>  
> -/* Check that X has a single def.  */
> +/* Update the register pressure when propagating from DEF to USE
> +   replacing OLD with NU.  Return TRUE if the pressure was updated.  */
> +bool
> +fwprop_propagation::update_register_pressure (rtx nu, rtx old, rtx_insn *def,
> +					      rtx_insn *use) const
> +{
> +  basic_block bb;
> +
> +  enum reg_class nu_class, old_class;
> +  int nu_nregs, old_nregs;
> +  nu_class = regpressure_get_regno_pressure_class (REGNO (nu), &nu_nregs);
> +  old_class = regpressure_get_regno_pressure_class (REGNO (old), &old_nregs);
> +
> +  bool changed = false;
> +
> +  if (nu_class == old_class)
> +    return false;
> +
> +  basic_block bbfrom = BLOCK_FOR_INSN (def);
> +  basic_block bbto = BLOCK_FOR_INSN (use);
> +
> +  sbitmap visited = sbitmap_alloc (last_basic_block_for_fn (cfun));
> +  bitmap_clear (visited);
> +  auto_vec<basic_block> q;
> +  q.safe_push (bbto);
> +
> +  /* Precompute the users.  */
> +  auto_vec<rtx_insn *> users;
> +  df_ref duse;
> +  bool should_decrease = true;
> +  FOR_EACH_INSN_USE (duse, def)
> +    {
> +      rtx dreg = DF_REF_REAL_REG (duse);
> +      if (REGNO (dreg) != REGNO (old))
> +	continue;
> +
> +      df_ref op_ref = DF_REG_USE_CHAIN (REGNO (dreg));
> +      for (; op_ref; op_ref = DF_REF_NEXT_REG (op_ref))
> +	{
> +	  if (!DF_REF_INSN_INFO (op_ref))
> +	    continue;
> +
> +	  rtx_insn *insn = DF_REF_INSN (op_ref);
> +
> +	  /* If there are other users in BBTO, never decrease the
> +	     pressure.  */
> +	  if (BLOCK_FOR_INSN (insn) == bbto && insn != use)
> +	    {
> +	      should_decrease = false;
> +	      break;
> +	    }
> +
> +	  users.safe_push (insn);
> +	}
> +    }
> +
> +  while (!q.is_empty ())
> +    {
> +      bb = q.pop ();
> +
> +      if (bitmap_bit_p (visited, bb->index))
> +	continue;
> +
> +      /* Nothing to do if the register to be replaced is not live
> +	 in this BB.  */
> +      if (bb != bbfrom && !regpressure_is_live_in (bb, REGNO (old)))
> +	continue;
> +
> +      /* Nothing to do if the new register is already live in this BB.  */
> +      if (regpressure_is_live_in (bb, REGNO (nu)))
> +	continue;
> +
> +      bitmap_set_bit (visited, bb->index);
> +
> +      if (regpressure_is_live_in (bb, REGNO (old)))
> +	{
> +	  if (dump_file && (dump_flags & TDF_DETAILS))
> +	    fprintf (dump_file, "increasing pressure for "
> +		     "reg %d in bb %d by %d reg(s).\n",
> +		     REGNO (nu), bb->index, nu_nregs);
> +	  regpressure_set_live_in (bb, REGNO (nu));
> +	  regpressure_increase (bb, nu_class, nu_nregs);
> +	  changed = true;
> +
> +	  if (should_decrease)
> +	    {
> +	      rtx_insn *user;
> +	      int ix;
> +	      bool should_decrease_local = true;
> +
> +	      /* If this BB dominates no BB where OLD is used (except BBTO)
> +		 the register pressure is decreased.  */
> +	      FOR_EACH_VEC_ELT(users, ix, user)
> +		{
> +		  basic_block ubb = BLOCK_FOR_INSN (user);
> +		  if (ubb == bbto)
> +		    continue;
> +		  if (dominated_by_p (CDI_DOMINATORS, ubb, bb))
> +		    {
> +		      should_decrease_local = false;
> +		      break;
> +		    }
> +		}
> +
> +	      if (should_decrease_local)
> +		{
> +		  if (dump_file && (dump_flags & TDF_DETAILS))
> +		    fprintf (dump_file, "decreasing pressure for "
> +			     "reg %d in bb %d by %d reg(s).\n",
> +			     REGNO (old), bb->index, old_nregs);
> +		  regpressure_clear_live_in (bb, REGNO (old));
> +		  regpressure_decrease (bb, old_class, old_nregs);
> +		}
> +	    }
> +	}
> +
> +      if (bb == bbfrom || bb == ENTRY_BLOCK_PTR_FOR_FN (cfun))
> +	break;
> +
> +      edge e;
> +      edge_iterator ei;
> +
> +      FOR_EACH_EDGE (e, ei, bb->preds)
> +	{
> +	  basic_block pred = e->src;
> +	  q.safe_push (pred);
> +	}
> +    }
> +
> +  sbitmap_free (visited);
> +  return changed;
> +}
> +
> +/* Check that X has a single def.  In case of more complex operations
> +   check each sub-rtx.  */
>  
>  static bool
>  reg_single_def_p (rtx x)
>  {
> -  return REG_P (x) && crtl->ssa->single_dominating_def (REGNO (x));
> +  if (REG_P (x))
> +    return crtl->ssa->single_dominating_def (REGNO (x));
> +
> +  subrtx_var_iterator::array_type array;
> +  FOR_EACH_SUBRTX_VAR (iter, array, x, ALL)
> +    {
> +      rtx sub = *iter;
> +
> +      if (SUBREG_P (x) && !contains_paradoxical_subreg_p (sub))
> +	sub = SUBREG_REG (x);
> +
> +      bool ok = REG_P (sub) && crtl->ssa->single_dominating_def (REGNO (sub));
> +
> +      if (!ok)
> +	return false;
> +    }
> +
> +  return true;
>  }
>  
>  /* Try to substitute (set DEST SRC), which defines DEF, into note NOTE of
> @@ -490,6 +754,81 @@ try_fwprop_subst_pattern (obstack_watermark &attempt, insn_change &use_change,
>  	  }
>        }
>  
> +  /* Perform register-pressure checks for all sub-rtxs.
> +     If FLAG_IRA_HOIST_PRESSURE is set only allow the propagation if the
> +     register pressure is not high.
> +     Otherwise only allow the propagation if DEF has only a single use.  */
> +  if (ok)
> +    {
> +      rtx_insn *def_rtl = def_insn->rtl ();
> +
> +      subrtx_var_iterator::array_type array;
> +      FOR_EACH_SUBRTX_VAR (iter, array, src, NONCONST)
> +	{
> +	  rtx srcreg = *iter;
> +	  rtx dstreg = dest;
> +
> +	  /* For register pressure we do not care about subregs so strip
> +	     them.  */
> +	  while (SUBREG_P (dstreg))
> +	    dstreg = SUBREG_REG (dstreg);
> +
> +	  while (SUBREG_P (srcreg))
> +	    srcreg = SUBREG_REG (srcreg);
> +
> +	  if (!REG_P (dstreg) || !REG_P (srcreg))
> +	    continue;
> +
> +	  /* Nothing to do for the same class.  */
> +	  bool same_class = regpressure_same_class (dstreg, srcreg);
> +	  if (same_class)
> +	    {
> +	      if (flag_ira_hoist_pressure)
> +		continue;
> +	      else
> +		{
> +		  /* Check if this is the only use.  If not do not
> +		     continue.  */
> +		  for (use_info *use : def->all_uses ())
> +		    if (use->insn () != use_insn)
> +		      return false;
> +		}
> +	    }
> +
> +	  /* Never allow different pressure classes when not considering
> +	     register pressure.  */
> +	  if (!same_class && !flag_ira_hoist_pressure)
> +	    return false;
> +
> +	  if (!bitmap_bit_p (pressure_accounted, REGNO (srcreg)))
> +	    {
> +	      if (!prop.register_pressure_high_p (srcreg, dstreg, def_rtl,
> +						  use_rtl))
> +		{
> +		  bool updated
> +		    = prop.update_register_pressure (srcreg, dstreg, def_rtl,
> +						     use_rtl);
> +		  if (updated)
> +		    bitmap_set_bit (pressure_accounted, REGNO (srcreg));
> +		}
> +	      else
> +		{
> +		  if (dump_file && (dump_flags & TDF_DETAILS))
> +		    fprintf (dump_file, "cannot propagate from insn %d into"
> +			     " insn %d: %s\n", def_insn->uid (),
> +			     use_insn->uid (),
> +			     "would increase register pressure beyond number"
> +			     " of hard regs");
> +		  ok = false;
> +		}
> +	    }
> +	  else
> +	    if (dump_file && (dump_flags & TDF_DETAILS))
> +	      fprintf (dump_file, "pressure for reg %d has already"
> +		       " been accounted\n", REGNO (srcreg));
> +	}
> +    }
> +
>    if (!ok)
>      {
>        /* The pattern didn't match, but if all uses of SRC folded to
> @@ -538,6 +877,7 @@ try_fwprop_subst_pattern (obstack_watermark &attempt, insn_change &use_change,
>    confirm_change_group ();
>    crtl->ssa->change_insn (use_change);
>    num_changes++;
> +
>    return true;
>  }
>  
> @@ -890,6 +1230,10 @@ fwprop_init (void)
>  
>    df_analyze ();
>    crtl->ssa = new rtl_ssa::function_info (cfun);
> +
> +  regpressure_init ();
> +  pressure_accounted = sbitmap_alloc (DF_REG_SIZE (df));
> +  bitmap_clear (pressure_accounted);
>  }
>  
>  static void
> @@ -910,6 +1254,9 @@ fwprop_done (void)
>      fprintf (dump_file,
>  	     "\nNumber of successful forward propagations: %d\n\n",
>  	     num_changes);
> +
> +  regpressure_cleanup ();
> +  sbitmap_free (pressure_accounted);
>  }
>  
>  /* Try to optimize INSN, returning true if something changes.
> @@ -919,6 +1266,10 @@ fwprop_done (void)
>  static bool
>  fwprop_insn (insn_info *insn, bool fwprop_addr_p)
>  {
> +  /* TODO: when rejecting propagations due to register pressure
> +     we would actually like to try the propagation with most
> +     potential = uses first instead of going through them sequentially.  */
> +
>    for (use_info *use : insn->uses ())
>      {
>        if (use->is_mem ())
> diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/binop/vadd-vx-fwprop.c b/gcc/testsuite/gcc.target/riscv/rvv/autovec/binop/vadd-vx-fwprop.c
> new file mode 100644
> index 00000000000..0ed1bb80f73
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/binop/vadd-vx-fwprop.c
> @@ -0,0 +1,64 @@
> +/* { dg-do compile } */
> +/* { dg-additional-options "-O2 -march=rv64gcv -mabi=lp64d --param=riscv-autovec-preference=fixed-vlmax -fdump-rtl-fwprop1-details" } */
> +
> +/* Check that we propagate from vec_duplicates into vadd.vv, turning it into
> +   vadd.vx.  */
> +
> +#include <stdint-gcc.h>
> +
> +/* Define 26 variables so that we can perform two forward propagations until we
> +   reach the hard reg limit of 28.  */
> +uint64_t *restrict dst1;
> +uint64_t *restrict dst2;
> +uint64_t *restrict dst3;
> +uint64_t *restrict dst4;
> +uint64_t *restrict dst5;
> +uint64_t *restrict dst6;
> +uint64_t *restrict dst7;
> +uint64_t *restrict dst8;
> +uint64_t *restrict dst9;
> +uint64_t *restrict dst10;
> +uint64_t *restrict dst11;
> +uint64_t *restrict a1;
> +uint64_t *restrict a2;
> +uint64_t *restrict a3;
> +uint64_t *restrict a4;
> +uint64_t *restrict a5;
> +uint64_t *restrict a6;
> +uint64_t *restrict a7;
> +uint64_t *restrict a8;
> +uint64_t *restrict a9;
> +uint64_t *restrict a10;
> +uint64_t *restrict a11;
> +uint64_t b1;
> +uint64_t b2;
> +uint64_t b3;
> +uint64_t b4;
> +
> +void foo (int n)
> +{
> +  __builtin_expect (n % 128, 0);
> +  for (int i = 0; i < n; i++)
> +    {
> +      /* We expect b3 to be propagated into vadd here as well as in the other
> +         7 occurences (without increasing the pressure every time).
> +         Then b2 is still supposed to be propagated but b1 is not even though
> +         it would be better to do so.  */
> +      dst1[i] = a1[i] + b3;
> +      dst2[i] = a2[i] + b3;
> +      dst3[i] = a3[i] + b2;
> +      dst4[i] = a4[i] + b1;
> +      dst5[i] = a5[i] + b1;
> +      dst6[i] = a6[i] + b3;
> +      dst7[i] = a7[i] + b3;
> +      dst8[i] = a8[i] + b3;
> +      dst9[i] = a9[i] + b3;
> +      dst10[i] = a10[i] + b3;
> +      dst11[i] = a11[i] + b3;
> +    }
> +}
> +
> +/* { dg-final { scan-rtl-dump-times "increasing pressure for reg" 2 "fwprop1" } } */
> +/* { dg-final { scan-rtl-dump-times "pressure for reg \[0-9\]+ has already been accounted" 7 "fwprop1" } } */
> +/* { dg-final { scan-rtl-dump-times "would increase register pressure beyond number of hard regs" 2 "fwprop1" } } */
> +/* { dg-final { scan-assembler-times "\tvadd\.vx" 9 } } */
Robin Dapp Sept. 7, 2023, 2:25 p.m. UTC | #11
Thanks for looking at it in detail.

> Yeah, I think this is potentially a blocker for propagating A into B
> when A is used elsewhere.  Combine is able to combine A and B while
> keeping A in parallel with the result.  I think either fwprop would
> need to try that too, or it would need to be restricted to cases where A
> is only used in B.

That seems a rather severe limitation and my original use case would
not get optimized considerably anymore.  The intention was to replace
all uses (if register pressure allows).  Of course the example is simple
enough that a propagation is always useful if the costs allow it, so
it might not be representative.

I'm wondering if we could (my original misunderstanding) tentatively
try to propagate into all uses of a definition and, when reaching
a certain ratio, decide that it might be worth it, otherwise revert.
Would be very crude though, and not driven by the actual problem we're
trying to avoid. 

> I think the summary is:
> 
> IMO, we have to be mindful that combine is still to run.  We need to
> avoid making equal-cost changes if the new form is more complex, or
> otherwise likely to interfere with combine.

I guess we don't have a good measure for complexity or "combinability"
and even lower-cost changes could result in worse options later.
Would it make sense to have a strict less-than cost policy for those
more complex propagations?  Or do you consider the approach in its
current shape "hopeless", given the complications we discussed?

> Alternatively, we could delay the optimisation until after combine
> and have freer rein, since we're then just mopping up opportunities
> that other passes left behind.
> 
> A while back I was experimenting with a second combine pass.  That was
> the original motiviation for rtl-ssa.  I never got chance to finish it
> off though.

This doesn't sound like something that would still materialize before
the end of stage 1 :)
Do you see any way of restricting the current approach to make it less
intrusive and still worthwhile?  Limiting to vec_duplicate might be
much too arbitrary but would still help for my original example.

Regards
 Robin
Richard Sandiford Sept. 26, 2023, 4:24 p.m. UTC | #12
Robin Dapp via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> Thanks for looking at it in detail.
>
>> Yeah, I think this is potentially a blocker for propagating A into B
>> when A is used elsewhere.  Combine is able to combine A and B while
>> keeping A in parallel with the result.  I think either fwprop would
>> need to try that too, or it would need to be restricted to cases where A
>> is only used in B.
>
> That seems a rather severe limitation and my original use case would
> not get optimized considerably anymore.  The intention was to replace
> all uses (if register pressure allows).  Of course the example is simple
> enough that a propagation is always useful if the costs allow it, so
> it might not be representative.
>
> I'm wondering if we could (my original misunderstanding) tentatively
> try to propagate into all uses of a definition and, when reaching
> a certain ratio, decide that it might be worth it, otherwise revert.
> Would be very crude though, and not driven by the actual problem we're
> trying to avoid. 
>
>> I think the summary is:
>> 
>> IMO, we have to be mindful that combine is still to run.  We need to
>> avoid making equal-cost changes if the new form is more complex, or
>> otherwise likely to interfere with combine.
>
> I guess we don't have a good measure for complexity or "combinability"
> and even lower-cost changes could result in worse options later.
> Would it make sense to have a strict less-than cost policy for those
> more complex propagations?  Or do you consider the approach in its
> current shape "hopeless", given the complications we discussed?
>
>> Alternatively, we could delay the optimisation until after combine
>> and have freer rein, since we're then just mopping up opportunities
>> that other passes left behind.
>> 
>> A while back I was experimenting with a second combine pass.  That was
>> the original motiviation for rtl-ssa.  I never got chance to finish it
>> off though.
>
> This doesn't sound like something that would still materialize before
> the end of stage 1 :)
> Do you see any way of restricting the current approach to make it less
> intrusive and still worthwhile?  Limiting to vec_duplicate might be
> much too arbitrary but would still help for my original example.

FWIW, I sent an RFC for a late-combine pass that might help:

  https://gcc.gnu.org/pipermail/gcc-patches/2023-September/631406.html

I think it'll need some tweaking for your use case, but hopefully
it's "just" a case of expanding the register pressure tests.

Thanks,
Richard
diff mbox series

Patch

diff --git a/gcc/fwprop.cc b/gcc/fwprop.cc
index 0707a234726..413fe4e7335 100644
--- a/gcc/fwprop.cc
+++ b/gcc/fwprop.cc
@@ -36,6 +36,10 @@  along with GCC; see the file COPYING3.  If not see
 #include "tree-pass.h"
 #include "rtl-iter.h"
 #include "target.h"
+#include "dominance.h"
+
+#include "ira.h"
+#include "regpressure.h"
 
 /* This pass does simple forward propagation and simplification when an
    operand of an insn can only come from a single def.  This pass uses
@@ -103,6 +107,10 @@  using namespace rtl_ssa;
 
 static int num_changes;
 
+/* Keep track of which registers already increased the pressure to avoid double
+   booking.  */
+sbitmap pressure_accounted;
+
 /* Do not try to replace constant addresses or addresses of local and
    argument slots.  These MEM expressions are made only once and inserted
    in many instructions, as well as being used to control symbol table
@@ -181,6 +189,8 @@  namespace
     bool changed_mem_p () const { return result_flags & CHANGED_MEM; }
     bool folded_to_constants_p () const;
     bool profitable_p () const;
+    bool register_pressure_high_p (rtx, rtx, rtx_insn *, rtx_insn *) const;
+    void update_register_pressure (rtx, rtx, rtx_insn *, rtx_insn *) const;
 
     bool check_mem (int, rtx) final override;
     void note_simplification (int, uint16_t, rtx, rtx) final override;
@@ -332,25 +342,243 @@  fwprop_propagation::profitable_p () const
       && (result_flags & PROFITABLE))
     return true;
 
-  if (REG_P (to))
+  /* Only continue with an unary operation if we consider register
+     pressure.  */
+  rtx what = copy_rtx (to);
+  if (UNARY_P (what) && flag_ira_hoist_pressure)
+    what = XEXP (what, 0);
+
+  if (REG_P (what))
     return true;
 
-  if (GET_CODE (to) == SUBREG
-      && REG_P (SUBREG_REG (to))
-      && !paradoxical_subreg_p (to))
+  if (GET_CODE (what) == SUBREG
+      && REG_P (SUBREG_REG (what))
+      && !paradoxical_subreg_p (what))
     return true;
 
-  if (CONSTANT_P (to))
+  if (CONSTANT_P (what))
     return true;
 
   return false;
 }
 
-/* Check that X has a single def.  */
+/* Check if the register pressure in any predecessor block of USE's block
+   until DEF's block is equal or higher to the number of hardregs in NU's
+   register class.  */
+bool
+fwprop_propagation::register_pressure_high_p (rtx nu, rtx old, rtx_insn *def,
+					      rtx_insn *use) const
+{
+  enum reg_class nu_class, old_class;
+  int nu_nregs, old_nregs;
+  nu_class = regpressure_get_regno_pressure_class (REGNO (nu), &nu_nregs);
+  old_class
+    = regpressure_get_regno_pressure_class (REGNO (old), &old_nregs);
+
+  if (nu_class == NO_REGS && old_class == NO_REGS)
+    return true;
+
+  if (nu_class == old_class)
+    return false;
+
+  basic_block bbfrom = BLOCK_FOR_INSN (def);
+  basic_block bbto = BLOCK_FOR_INSN (use);
+
+  basic_block bb;
+
+  sbitmap visited = sbitmap_alloc (last_basic_block_for_fn (cfun));
+  bitmap_clear (visited);
+  auto_vec<basic_block> q;
+  q.safe_push (bbto);
+
+  while (!q.is_empty ())
+    {
+      bb = q.pop ();
+
+      if (bitmap_bit_p (visited, bb->index))
+	continue;
+
+      /* Nothing to do if the register to be replaced is not live
+	 in this BB.  */
+      if (bb != bbfrom && !regpressure_is_live_in (bb, REGNO (old)))
+	continue;
+
+      /* Nothing to do if the replacement register is already live in
+	 this BB.  */
+      if (regpressure_is_live_in (bb, REGNO (nu)))
+	continue;
+
+      bitmap_set_bit (visited, bb->index);
+
+      int press = regpressure_get (bb, nu_class);
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	fprintf (dump_file, "pressure for reg %d in bb %d: %d\n",
+		 REGNO (nu), bb->index, press);
+
+      if (press + nu_nregs > ira_class_hard_regs_num[nu_class])
+	return true;
+
+      if (bb == bbfrom || bb == ENTRY_BLOCK_PTR_FOR_FN (cfun))
+	break;
+
+      edge e;
+      edge_iterator ei;
+
+      FOR_EACH_EDGE (e, ei, bb->preds)
+	{
+	  basic_block pred = e->src;
+	  q.safe_push (pred);
+	}
+    }
+
+  sbitmap_free (visited);
+
+  return false;
+}
+
+/* Update the register pressure when propagating from DEF to USE
+   replacing OLD with NU.  */
+void
+fwprop_propagation::update_register_pressure (rtx nu, rtx old, rtx_insn *def,
+					      rtx_insn *use) const
+{
+  basic_block bb;
+
+  enum reg_class nu_class, old_class;
+  int nu_nregs, old_nregs;
+  nu_class = regpressure_get_regno_pressure_class (REGNO (nu), &nu_nregs);
+  old_class = regpressure_get_regno_pressure_class (REGNO (old), &old_nregs);
+
+  if (nu_class == old_class)
+    return;
+
+  basic_block bbfrom = BLOCK_FOR_INSN (def);
+  basic_block bbto = BLOCK_FOR_INSN (use);
+
+  sbitmap visited = sbitmap_alloc (last_basic_block_for_fn (cfun));
+  bitmap_clear (visited);
+  auto_vec<basic_block> q;
+  q.safe_push (bbto);
+
+  /* Precompute the users.  */
+  auto_vec<rtx_insn *> users;
+  df_ref duse;
+  bool should_decrease = true;
+  FOR_EACH_INSN_USE (duse, def)
+    {
+      rtx dreg = DF_REF_REAL_REG (duse);
+      if (REGNO (dreg) != REGNO (old))
+	continue;
+
+      df_ref op_ref = DF_REG_USE_CHAIN (REGNO (dreg));
+      for (; op_ref; op_ref = DF_REF_NEXT_REG (op_ref))
+	{
+	  if (!DF_REF_INSN_INFO (op_ref))
+	    continue;
+
+	  rtx_insn *insn = DF_REF_INSN (op_ref);
+
+	  /* If there are other users in BBTO, never decrease the
+	     pressure.  */
+	  if (BLOCK_FOR_INSN (insn) == bbto && insn != use)
+	    {
+	      should_decrease = false;
+	      break;
+	    }
+
+	  users.safe_push (insn);
+	}
+    }
+
+  while (!q.is_empty ())
+    {
+      bb = q.pop ();
+
+      if (bitmap_bit_p (visited, bb->index))
+	continue;
+
+      /* Nothing to do if the register to be replaced is not live
+	 in this BB.  */
+      if (bb != bbfrom && !regpressure_is_live_in (bb, REGNO (old)))
+	continue;
+
+      /* Nothing to do if the new register is already live in this BB.  */
+      if (regpressure_is_live_in (bb, REGNO (nu)))
+	continue;
+
+      bitmap_set_bit (visited, bb->index);
+
+      if (regpressure_is_live_in (bb, REGNO (old)))
+	{
+	  if (dump_file && (dump_flags & TDF_DETAILS))
+	    fprintf (dump_file, "increasing pressure for "
+		     "reg %d in bb %d by %d regs.\n",
+		     REGNO (nu), bb->index, nu_nregs);
+	  regpressure_set_live_in (bb, REGNO (nu));
+	  regpressure_increase (bb, nu_class, nu_nregs);
+
+	  if (should_decrease)
+	    {
+	      rtx_insn *user;
+	      int ix;
+	      bool should_decrease_local = true;
+
+	      /* If this BB dominates no BB where OLD is used (except BBTO)
+		 the register pressure is decreased.  */
+	      FOR_EACH_VEC_ELT(users, ix, user)
+		{
+		  basic_block ubb = BLOCK_FOR_INSN (user);
+		  if (ubb == bbto)
+		    continue;
+		  if (dominated_by_p (CDI_DOMINATORS, ubb, bb))
+		    {
+		      should_decrease_local = false;
+		      break;
+		    }
+		}
+
+	      if (should_decrease_local)
+		{
+		  if (dump_file && (dump_flags & TDF_DETAILS))
+		    fprintf (dump_file, "decreasing pressure for "
+			     "reg %d in bb %d by %d regs.\n",
+			     REGNO (nu), bb->index, old_nregs);
+		  regpressure_clear_live_in (bb, REGNO (old));
+		  regpressure_decrease (bb, old_class, old_nregs);
+		}
+	    }
+	}
+
+      if (bb == bbfrom || bb == ENTRY_BLOCK_PTR_FOR_FN (cfun))
+	break;
+
+      edge e;
+      edge_iterator ei;
+
+      FOR_EACH_EDGE (e, ei, bb->preds)
+	{
+	  basic_block pred = e->src;
+	  q.safe_push (pred);
+	}
+    }
+
+  sbitmap_free (visited);
+}
+
+/* Check that X has a single def.  Also look through unary operations if
+   we take register pressure into consideration.  */
 
 static bool
 reg_single_def_p (rtx x)
 {
+  if (UNARY_P (x) && flag_ira_hoist_pressure)
+    {
+      x = XEXP (x, 0);
+
+      if (SUBREG_P (x) && !contains_paradoxical_subreg_p (x))
+	x = SUBREG_REG (x);
+    }
+
   return REG_P (x) && crtl->ssa->single_dominating_def (REGNO (x));
 }
 
@@ -490,6 +718,53 @@  try_fwprop_subst_pattern (obstack_watermark &attempt, insn_change &use_change,
 	  }
       }
 
+  if (ok && UNARY_P (src))
+    {
+      /* Currently the register classes can only be different with a
+	 unary operation like vec_duplicate.
+	 In that case the propagation can increase the register pressure
+	 on the way to the target BB.
+	 Therefore, we first check if the pressure of the respective register
+	 is already high, i.e. equals or exceeds the number of hardregs for
+	 that class.  If not, we allow the propagation and update the
+	 pressure.  */
+
+      rtx src1 = src;
+
+      /* Strip unary operation and possible subreg.  */
+      src1 = XEXP (src, 0);
+      if (GET_CODE (src1) == SUBREG)
+	src1 = SUBREG_REG (src1);
+
+      rtx_insn *def_rtl = def_insn->rtl ();
+
+      if (REG_P (dest) && REG_P (src1))
+	{
+	  if (!bitmap_bit_p (pressure_accounted, REGNO (src1)))
+	    {
+	      if (!prop.register_pressure_high_p (src1, dest, def_rtl, use_rtl))
+		{
+		  prop.update_register_pressure (src1, dest, def_rtl, use_rtl);
+		  bitmap_set_bit (pressure_accounted, REGNO (src1));
+		}
+	      else
+		{
+		  if (dump_file && (dump_flags & TDF_DETAILS))
+		    fprintf (dump_file, "cannot propagate from insn %d into"
+			     " insn %d: %s\n", def_insn->uid (),
+			     use_insn->uid (),
+			     "would increase register pressure beyond number"
+			     " of hard regs");
+		  ok = false;
+		}
+	    }
+	  else
+	    if (dump_file && (dump_flags & TDF_DETAILS))
+	      fprintf (dump_file, "pressure for reg %d has already"
+		       " been accounted\n", REGNO (src1));
+	}
+    }
+
   if (!ok)
     {
       /* The pattern didn't match, but if all uses of SRC folded to
@@ -859,7 +1134,7 @@  forward_propagate_into (use_info *use, bool reg_prop_only = false)
   if ((reg_prop_only
        || (def_loop != use_loop
 	   && !flow_loop_nested_p (use_loop, def_loop)))
-      && (!reg_single_def_p (dest) || !reg_single_def_p (src)))
+      && !reg_single_def_p (dest))
     return false;
 
   /* Don't substitute into a non-local goto, this confuses CFG.  */
@@ -890,6 +1165,14 @@  fwprop_init (void)
 
   df_analyze ();
   crtl->ssa = new rtl_ssa::function_info (cfun);
+
+  /* Calculate register pressure for each basic block.  */
+  if (flag_ira_hoist_pressure)
+    {
+      regpressure_init ();
+
+      pressure_accounted = sbitmap_alloc (DF_REG_SIZE (df));
+    }
 }
 
 static void
@@ -910,6 +1193,12 @@  fwprop_done (void)
     fprintf (dump_file,
 	     "\nNumber of successful forward propagations: %d\n\n",
 	     num_changes);
+
+  if (flag_ira_hoist_pressure)
+    {
+      regpressure_cleanup ();
+      sbitmap_free (pressure_accounted);
+    }
 }
 
 /* Try to optimize INSN, returning true if something changes.
@@ -919,6 +1208,10 @@  fwprop_done (void)
 static bool
 fwprop_insn (insn_info *insn, bool fwprop_addr_p)
 {
+  /* TODO: when rejecting propagations due to register pressure
+     we would actually like to try the propagation with most
+     potential = uses first instead of going through them
+     sequentially.  */
   for (use_info *use : insn->uses ())
     {
       if (use->is_mem ())