diff mbox series

[committed,PR,rtl-optimization/87761] Limited iteration in regcprop to pick up secondary opportunities

Message ID 0e195e6c-5e41-34c8-dd6c-b6722845a3a3@redhat.com
State New
Headers show
Series [committed,PR,rtl-optimization/87761] Limited iteration in regcprop to pick up secondary opportunities | expand

Commit Message

Jeff Law March 24, 2019, 3:20 p.m. UTC
This is final bits to address 87761.

Originally this patch tried to detect when it could delete a feeding
insn when hard register copy propagation was performed and immediately
delete the (dead) feeding insn.

For that to work we had to have accurate REG_DEAD notes in the stream.
So I twiddled things to make sure those notes were accurate.

But in doing so, the need to actually detect and internally delete a
feeding insn when copy propagation was performed isn't necessary.  When
we reprocess a block, if a particular reg->reg copy becomes dead it will
have a REG_UNUSED note attached by the DF framework.

Thus, the prior change in this space to remove simple const/copy
initializations where the destination has a REG_UNUSED note "just works"
and is sufficient to address the fix-r4000-10 regression (as well as
fix-r4000-5 which isn't mentioned in the BZ).

So the only real issue left was the iteration step.  I pondered leaving
in the code which tracked when we would likely be able to delete a copy
and use that to trigger re-processing a block.  That works, but I think
it's more complexity than we really want/need.

Other prototypes iterated on a block until nothing changed.  I didn't
really like that either.  I strongly suspected that iteration was rare
and iterating more than once was really rare, there's just not a lot of
benefit here to justify an iterate until nothing changes algorithm --
particularly since the we've got to rescan the whole block each
iteration and possibly propagate DF info as well.  It is clearly safe to
stop iterating, it just potentially leaves the code less optimized than
it could have been.

So it's of course time to do some instrumentation.  I instrumented an
x86_64 compiler to print iteration counts for every block we processed
then did a bootstrap.

Before giving you the raw data, a few notes.  An iteration count of zero
means regcprop processed the block, but found no optimization
opportunities at all.  That case is by far the most common.

An iteration count of one indicates regcprop  was able to do useful
propagations on the first pass over the block, but that no secondary
opportunities were identified.

An iteration count of two indicates the first pass propagations resulted
in insn deletions which in turn exposed additional propagations.  This
is the case we're looking for.

Counts higher than indicate that we found even more opportunities on
subsequent iterations.  These are exceedingly rare.


Iterations    Count
    0         3584456    No opts found
    1         183165     Single level propagation
    2         1364       Two level propagation
    3         43
    4         5
    5+        none

The case we're chasing is the two level propagation.  ie, the first pass
does some propagation which in turn makes some insn(s) dead and removal
of those dead insn(s) exposes new propagation opportunities.

The data make it pretty clear we want to avoid as much overhead as
possible in the common case of nothing changes.  That's over 95% of the
time.

Additionally, there's just so so little to gain with each iteration that
an "iterate until nothing changes" approach just seems like a waste.

So we cater to those components in the distribution.  ie, avoid as much
overhead as possible in the most common case, and don't bother with a
"repeat until nothing changes" kind of algorithm.  We allow a single
reprocessing of the block if something changed.

I seriously considered bubbling up a "hint" that we'd likely be able to
remove an insn and using that to help decide when to allow that single
reprocessing of the current block.  The hint would have to be a bit of
an optimistic test because of the inaccuracies in the notes.  Ultimately
I decided it was unlikely that the hint would really filter out that
much work.

This has been bootstrapped and regression tested on x86_64. aarch64,
ppc64le, ppc64 & sparc64.  It's also bootstrapped on various platforms
vi qemu.  It's also gone through a build/test cycle on mips64-linux-gnu
and mips64el-linux-gnu where it fixed fix-r4000-10.c and fix-4000-5.c.
Finally it's gone through a variety of *-elf builds, mostly to make sure
we didn't trip over anything nasty in the target specific tests like we
saw for the fpr-moves-5 fixes.

The patch looks bigger than it really is due to an indention change.
I'll be installing into on the trunk momentarily.

Phew.

On a non-technical note, I'm happy I dug into this, mostly because the
regcprop fix can help other targets and there's some easy stage1 things
which likely help in other cases that I haven't discussed much here.

However, I'm increasingly of the opinion that MIPS targets need to drop
off the priority platform list.  Given the trajectory I see for MIPS
based processors in industry, it's really hard to justify spending this
much time on them, particularly for low priority code quality issues.

Comments

Segher Boessenkool March 24, 2019, 5:11 p.m. UTC | #1
Hi Jeff,

On Sun, Mar 24, 2019 at 09:20:07AM -0600, Jeff Law wrote:
> +	PR rtl-optimization/87761
> +	* regcprop.c (copyprop_hardreg_forward_1): Check may_trap_p on SET,
> +	not INSN.  Also check RTX_FRAME_RELATED_P.  Queue insns for DF rescan
> +	as needed.

Why the RTX_FRAME_RELATED_P addition?  You didn't explain it I think.  Is it
just a bugfix, or something this patch exposed, something that couldn't
happen before (or was harmless) and now isn't anymore?

Or should this part be backported :-)


Segher
Jeff Law March 24, 2019, 5:20 p.m. UTC | #2
On 3/24/19 11:11 AM, Segher Boessenkool wrote:
> Hi Jeff,
> 
> On Sun, Mar 24, 2019 at 09:20:07AM -0600, Jeff Law wrote:
>> +	PR rtl-optimization/87761
>> +	* regcprop.c (copyprop_hardreg_forward_1): Check may_trap_p on SET,
>> +	not INSN.  Also check RTX_FRAME_RELATED_P.  Queue insns for DF rescan
>> +	as needed.
> 
> Why the RTX_FRAME_RELATED_P addition?  You didn't explain it I think.  Is it
> just a bugfix, or something this patch exposed, something that couldn't
> happen before (or was harmless) and now isn't anymore?
It's just a bugfix -- one of the embedded targets needed it, I can't
offhand remember which -- we had a frame related insn with a REG_UNUSED
note which the code deleted causing a fault in the dwarf2cfi bits.

Thank goodness for Jenkins  scripts which will build and test all those
silly *-elf targets :-)


> 
> Or should this part be backported :-)
Shouldn't be needed as the code to remove insns with REG_UNUSED notes is
new for gcc-9.

jeff
Jakub Jelinek March 27, 2019, 2:36 p.m. UTC | #3
On Sun, Mar 24, 2019 at 09:20:07AM -0600, Jeff Law wrote:
> However, I'm increasingly of the opinion that MIPS targets need to drop
> off the priority platform list.  Given the trajectory I see for MIPS
> based processors in industry, it's really hard to justify spending this
> much time on them, particularly for low priority code quality issues.

Besides what has been discussed on IRC for the PR89826 fix, that we really
need a df_analyze before processing the first block, because otherwise we
can't rely on the REG_UNUSED notes in the IL, I see some other issues, but I
admit I don't know much about df nor regcprop.

1) the df_analyze () after every (successful) processing of a basic block
is IMHO way too expensive, I would be very surprised if df_analyze () isn't
quadratic in number of basic blocks and so one could construct testcases
with millions of basic blocks and at least one regcprop change in each bb
and get at cubic complexity (correct me if I'm wrong, and I'm aware of the
95% bbs you said won't have any changes at all)

2) another thing I've noticed today, we queue up the debug insn changes,
but if we iterate the second time for any bb, we overwrite and thus lose any
of the debug insn queued changes from the first iteration, so those changes
aren't applied (wrong-debug?)

3) as mentioned on IRC, the order of tests in copyprop_hardreg_forward_1
conditional doesn't start from the cheapest to most expensive one

So, do we want something like the following untested patch (only tested
that the PR89826 testcase is fixed on ARM, but df_analyze right after
df_set_flags is all that is needed to cure that.  I've outlined
two hunks of code into separate functions, so that I can call them from two
different spots.  And, I don't see why in the debug processing loop we
should test the visited bitmap, isn't it guaranteed that it is set on all
the bbs after going through the first FOR_EACH_BB_FN loop?
On the other side, I thought it might be beneficial not to do the debug
FOR_EACH_BB_FN loop(s) if there aren't any debug insn changes queued (not
sure how often would that happen, if not often enough, that part can be
dropped).

--- gcc/regcprop.c.jj	2019-03-25 11:02:55.826998948 +0100
+++ gcc/regcprop.c	2019-03-27 15:15:23.505651565 +0100
@@ -800,9 +800,9 @@ copyprop_hardreg_forward_1 (basic_block
 
       /* Detect obviously dead sets (via REG_UNUSED notes) and remove them.  */
       if (set
-	  && !may_trap_p (set)
-	  && !RTX_FRAME_RELATED_P (insn)
 	  && INSN_P (insn)
+	  && !RTX_FRAME_RELATED_P (insn)
+	  && !may_trap_p (set)
 	  && find_reg_note (insn, REG_UNUSED, SET_DEST (set))
 	  && !side_effects_p (SET_SRC (set))
 	  && !side_effects_p (SET_DEST (set)))
@@ -1282,6 +1282,66 @@ public:
 
 }; // class pass_cprop_hardreg
 
+static bool
+cprop_hardreg_bb (basic_block bb, struct value_data *all_vd, sbitmap visited)
+{
+  bitmap_set_bit (visited, bb->index);
+
+  /* If a block has a single predecessor, that we've already
+     processed, begin with the value data that was live at
+     the end of the predecessor block.  */
+  /* ??? Ought to use more intelligent queuing of blocks.  */
+  if (single_pred_p (bb)
+      && bitmap_bit_p (visited, single_pred (bb)->index)
+      && ! (single_pred_edge (bb)->flags & (EDGE_ABNORMAL_CALL | EDGE_EH)))
+    {
+      all_vd[bb->index] = all_vd[single_pred (bb)->index];
+      if (all_vd[bb->index].n_debug_insn_changes)
+	{
+	  unsigned int regno;
+
+	  for (regno = 0; regno < FIRST_PSEUDO_REGISTER; regno++)
+	    {
+	      if (all_vd[bb->index].e[regno].debug_insn_changes)
+		{
+		  all_vd[bb->index].e[regno].debug_insn_changes = NULL;
+		  if (--all_vd[bb->index].n_debug_insn_changes == 0)
+		    break;
+		}
+	    }
+	}
+    }
+  else
+    init_value_data (all_vd + bb->index);
+
+  return copyprop_hardreg_forward_1 (bb, all_vd + bb->index);
+}
+
+static void
+cprop_hardreg_debug (function *fun, struct value_data *all_vd)
+{
+  basic_block bb;
+
+  FOR_EACH_BB_FN (bb, fun)
+    if (all_vd[bb->index].n_debug_insn_changes)
+      {
+	unsigned int regno;
+	bitmap live;
+
+	live = df_get_live_out (bb);
+	for (regno = 0; regno < FIRST_PSEUDO_REGISTER; regno++)
+	  if (all_vd[bb->index].e[regno].debug_insn_changes)
+	    {
+	      if (REGNO_REG_SET_P (live, regno))
+		apply_debug_insn_changes (all_vd + bb->index, regno);
+	      if (all_vd[bb->index].n_debug_insn_changes == 0)
+		break;
+	    }
+      }
+
+  queued_debug_insn_change_pool.release ();
+}
+
 unsigned int
 pass_cprop_hardreg::execute (function *fun)
 {
@@ -1293,6 +1353,9 @@ pass_cprop_hardreg::execute (function *f
   auto_sbitmap visited (last_basic_block_for_fn (fun));
   bitmap_clear (visited);
 
+  auto_vec<int> redo;
+  bool any_debug_changes = false;
+
   df_note_add_problem ();
 
   /* It is tempting to set DF_LR_RUN_DCE, but DCE may choose to delete
@@ -1305,71 +1368,42 @@ pass_cprop_hardreg::execute (function *f
      data structures appropriately.  */
   df_set_flags (DF_DEFER_INSN_RESCAN);
 
+  df_analyze ();
+
   FOR_EACH_BB_FN (bb, fun)
     {
-      bitmap_set_bit (visited, bb->index);
-
-      for (int pass = 0; pass < 2; pass++)
-	{
-	  /* If a block has a single predecessor, that we've already
-	     processed, begin with the value data that was live at
-	    the end of the predecessor block.  */
-	  /* ??? Ought to use more intelligent queuing of blocks.  */
-	  if (single_pred_p (bb)
-	      && bitmap_bit_p (visited, single_pred (bb)->index)
-	      && ! (single_pred_edge (bb)->flags & (EDGE_ABNORMAL_CALL | EDGE_EH)))
-	    {
-	      all_vd[bb->index] = all_vd[single_pred (bb)->index];
-	      if (all_vd[bb->index].n_debug_insn_changes)
-		{
-		  unsigned int regno;
-
-		  for (regno = 0; regno < FIRST_PSEUDO_REGISTER; regno++)
-		    {
-		      if (all_vd[bb->index].e[regno].debug_insn_changes)
-			{
-			  all_vd[bb->index].e[regno].debug_insn_changes = NULL;
-			  if (--all_vd[bb->index].n_debug_insn_changes == 0)
-			    break;
-			}
-		    }
-		}
-	    }
-	  else
-	    init_value_data (all_vd + bb->index);
-
-	  /* If we were unable to propagate, then break the loop.  */
-	  if (!copyprop_hardreg_forward_1 (bb, all_vd + bb->index))
-	    break;
-	  df_analyze ();
-	}
+      if (cprop_hardreg_bb (bb, all_vd, visited))
+	redo.safe_push (bb->index);
+      if (all_vd[bb->index].n_debug_insn_changes)
+	any_debug_changes = true;
     }
 
   /* We must call df_analyze here unconditionally to ensure that the
      REG_UNUSED and REG_DEAD notes are consistent with and without -g.  */
   df_analyze ();
 
-  if (MAY_HAVE_DEBUG_BIND_INSNS)
+  if (MAY_HAVE_DEBUG_BIND_INSNS && any_debug_changes)
+    cprop_hardreg_debug (fun, all_vd);
+
+  /* Second pass if we've changed anything, only for the bbs where we have
+     changed anything though.  */
+  if (!redo.is_empty ())
     {
-      FOR_EACH_BB_FN (bb, fun)
-	if (bitmap_bit_p (visited, bb->index)
-	    && all_vd[bb->index].n_debug_insn_changes)
-	  {
-	    unsigned int regno;
-	    bitmap live;
+      unsigned int i;
+      int index;
 
-	    live = df_get_live_out (bb);
-	    for (regno = 0; regno < FIRST_PSEUDO_REGISTER; regno++)
-	      if (all_vd[bb->index].e[regno].debug_insn_changes)
-		{
-		  if (REGNO_REG_SET_P (live, regno))
-		    apply_debug_insn_changes (all_vd + bb->index, regno);
-		  if (all_vd[bb->index].n_debug_insn_changes == 0)
-		    break;
-		}
-	  }
+      any_debug_changes = false;
+      FOR_EACH_VEC_ELT (redo, i, index)
+	{
+	  bb = BASIC_BLOCK_FOR_FN (fun, index);
+	  cprop_hardreg_bb (bb, all_vd, visited);
+	  if (all_vd[bb->index].n_debug_insn_changes)
+	    any_debug_changes = true;
+	}
 
-      queued_debug_insn_change_pool.release ();
+      df_analyze ();
+      if (MAY_HAVE_DEBUG_BIND_INSNS && any_debug_changes)
+	cprop_hardreg_debug (fun, all_vd);
     }
 
   free (all_vd);


	Jakub
Jeff Law March 27, 2019, 3:24 p.m. UTC | #4
On 3/27/19 8:36 AM, Jakub Jelinek wrote:
> On Sun, Mar 24, 2019 at 09:20:07AM -0600, Jeff Law wrote:
>> However, I'm increasingly of the opinion that MIPS targets need to drop
>> off the priority platform list.  Given the trajectory I see for MIPS
>> based processors in industry, it's really hard to justify spending this
>> much time on them, particularly for low priority code quality issues.
> 
> Besides what has been discussed on IRC for the PR89826 fix, that we really
> need a df_analyze before processing the first block, because otherwise we
> can't rely on the REG_UNUSED notes in the IL, I see some other issues, but I
> admit I don't know much about df nor regcprop.
RIght.  I plan to commit that today along with the test reordering you
pointed out.

> 
> 1) the df_analyze () after every (successful) processing of a basic block
> is IMHO way too expensive, I would be very surprised if df_analyze () isn't
> quadratic in number of basic blocks and so one could construct testcases
> with millions of basic blocks and at least one regcprop change in each bb
> and get at cubic complexity (correct me if I'm wrong, and I'm aware of the
> 95% bbs you said won't have any changes at all)
I'm going to look this further today.

> 
> 2) another thing I've noticed today, we queue up the debug insn changes,
> but if we iterate the second time for any bb, we overwrite and thus lose any
> of the debug insn queued changes from the first iteration, so those changes
> aren't applied (wrong-debug?)
This is actually what worries me, both with the current bits and with
any bits that change to a worklist of blocks to reprocess.  As is I
think we preserve the debug stuff across the iterations which should
keep debug info correct.  Managing that in a world where we queue up the
iteration step is going to be tougher.

Queuing the iteration step feels like gcc-10 to me, but we'll see if it
falls out easier than expected.

> 
> 3) as mentioned on IRC, the order of tests in copyprop_hardreg_forward_1
> conditional doesn't start from the cheapest to most expensive one
That's going to be fixed in today's commit since it's been tested.

Jeff
Jakub Jelinek March 27, 2019, 3:32 p.m. UTC | #5
On Wed, Mar 27, 2019 at 09:24:46AM -0600, Jeff Law wrote:
> > 2) another thing I've noticed today, we queue up the debug insn changes,
> > but if we iterate the second time for any bb, we overwrite and thus lose any
> > of the debug insn queued changes from the first iteration, so those changes
> > aren't applied (wrong-debug?)
> This is actually what worries me, both with the current bits and with
> any bits that change to a worklist of blocks to reprocess.  As is I
> think we preserve the debug stuff across the iterations which should
> keep debug info correct.  Managing that in a world where we queue up the
> iteration step is going to be tougher.

The patch I've posted would do df_analyze, all bbs once, then df_analyze,
process queued debug changes, and if anything needs to be repeated (just
once), redo the bbs from worklist and repeat the above.
That seems to me the easiest way to get rid of the per-bb df_analyze calls
as well as fixing the debug stuff.  Because otherwise, we'd need to somehow
merge the queued debug insns from the first and second pass and figure out
how to deal with that.

	Jakub
Jeff Law March 27, 2019, 4:20 p.m. UTC | #6
On 3/27/19 9:24 AM, Jeff Law wrote:
> On 3/27/19 8:36 AM, Jakub Jelinek wrote:
>> On Sun, Mar 24, 2019 at 09:20:07AM -0600, Jeff Law wrote:
>>> However, I'm increasingly of the opinion that MIPS targets need to drop
>>> off the priority platform list.  Given the trajectory I see for MIPS
>>> based processors in industry, it's really hard to justify spending this
>>> much time on them, particularly for low priority code quality issues.
>>
>> Besides what has been discussed on IRC for the PR89826 fix, that we really
>> need a df_analyze before processing the first block, because otherwise we
>> can't rely on the REG_UNUSED notes in the IL, I see some other issues, but I
>> admit I don't know much about df nor regcprop.
> RIght.  I plan to commit that today along with the test reordering you
> pointed out.
And the actual patch committed after the usual bootstrapping &
regression test on x86_64 plus testing on various *-elf platforms,
mips64-linux-gnu, mips64el-linux-gnu and others :-)


Jeff
diff --git a/gcc/ChangeLog b/gcc/ChangeLog
index 07a1333cee0..f8a353d16c0 100644
--- a/gcc/ChangeLog
+++ b/gcc/ChangeLog
@@ -1,3 +1,13 @@
+2019-03-27  Jeff Law  <law@redhat.com>
+
+
+	PR rtl-optimization/87761
+	PR rtl-optimization/89826
+	* regcprop.c (copyprop_hardreg_forward_1): Move may_trap_p test
+	slightly later.
+	(pass_cprop_hardreg::execute): Call df_analyze after adding the
+	note problem to get REG_DEAD/REG_UNUSED notes updated.
+
 2019-03-27  Richard Biener  <rguenther@suse.de>
 
 	PR tree-optimization/89463
diff --git a/gcc/regcprop.c b/gcc/regcprop.c
index 8ca523ffe23..3efe21f377c 100644
--- a/gcc/regcprop.c
+++ b/gcc/regcprop.c
@@ -800,9 +800,9 @@ copyprop_hardreg_forward_1 (basic_block bb, struct value_data *vd)
 
       /* Detect obviously dead sets (via REG_UNUSED notes) and remove them.  */
       if (set
-	  && !may_trap_p (set)
 	  && !RTX_FRAME_RELATED_P (insn)
 	  && INSN_P (insn)
+	  && !may_trap_p (set)
 	  && find_reg_note (insn, REG_UNUSED, SET_DEST (set))
 	  && !side_effects_p (SET_SRC (set))
 	  && !side_effects_p (SET_DEST (set)))
@@ -1293,7 +1293,10 @@ pass_cprop_hardreg::execute (function *fun)
   auto_sbitmap visited (last_basic_block_for_fn (fun));
   bitmap_clear (visited);
 
+  /* We need accurate notes.  Earlier passes such as if-conversion may
+     leave notes in an inconsistent state.  */
   df_note_add_problem ();
+  df_analyze ();
 
   /* It is tempting to set DF_LR_RUN_DCE, but DCE may choose to delete
      an insn and this pass would not have visibility into the removal.
diff --git a/gcc/testsuite/ChangeLog b/gcc/testsuite/ChangeLog
index 0679cb72e52..b2c649cfb3e 100644
--- a/gcc/testsuite/ChangeLog
+++ b/gcc/testsuite/ChangeLog
@@ -1,3 +1,9 @@
+2019-03-26  Jeff Law  <law@redhat.com>
+
+	PR rtl-optimization/87761
+	PR rtl-optimization/89826
+	* gcc.c-torture/execute/pr89826.c: New test.
+
 2019-03-27  Richard Biener  <rguenther@suse.de>
 
 	* gcc.dg/torture/20190327-1.c: New testcase.
diff --git a/gcc/testsuite/gcc.c-torture/execute/pr89826.c b/gcc/testsuite/gcc.c-torture/execute/pr89826.c
new file mode 100644
index 00000000000..091644849d3
--- /dev/null
+++ b/gcc/testsuite/gcc.c-torture/execute/pr89826.c
@@ -0,0 +1,21 @@
+typedef unsigned int u32;
+typedef unsigned long long u64;
+u64 a;
+u32 b;
+
+u64
+foo (u32 d)
+{
+  a -= d ? 0 : ~a;
+  return a + b;
+}
+
+int
+main (void)
+{
+  u64 x = foo (2);
+  if (x != 0)
+    __builtin_abort();
+  return 0;
+}
+
Jakub Jelinek March 27, 2019, 4:24 p.m. UTC | #7
On Wed, Mar 27, 2019 at 10:20:17AM -0600, Jeff Law wrote:
> --- a/gcc/regcprop.c
> +++ b/gcc/regcprop.c
> @@ -800,9 +800,9 @@ copyprop_hardreg_forward_1 (basic_block bb, struct value_data *vd)
>  
>        /* Detect obviously dead sets (via REG_UNUSED notes) and remove them.  */
>        if (set
> -	  && !may_trap_p (set)
>  	  && !RTX_FRAME_RELATED_P (insn)
>  	  && INSN_P (insn)
> +	  && !may_trap_p (set)

One more nitpick, I think the && INSN (insn) test is redundant there.
A few lines above this we have:
      if (!NONDEBUG_INSN_P (insn))
        {
	  ...
	  break; or continue;
        }

      // code that doesn't change insn, at most delete_insn + break/continue

and as
#define NONDEBUG_INSN_P(X) (INSN_P (X) && !DEBUG_INSN_P (X))
INSN_P (insn) must be always true.

	Jakub
Richard Biener March 28, 2019, 8:55 a.m. UTC | #8
On Wed, Mar 27, 2019 at 4:26 PM Jeff Law <law@redhat.com> wrote:
>
> On 3/27/19 8:36 AM, Jakub Jelinek wrote:
> > On Sun, Mar 24, 2019 at 09:20:07AM -0600, Jeff Law wrote:
> >> However, I'm increasingly of the opinion that MIPS targets need to drop
> >> off the priority platform list.  Given the trajectory I see for MIPS
> >> based processors in industry, it's really hard to justify spending this
> >> much time on them, particularly for low priority code quality issues.
> >
> > Besides what has been discussed on IRC for the PR89826 fix, that we really
> > need a df_analyze before processing the first block, because otherwise we
> > can't rely on the REG_UNUSED notes in the IL, I see some other issues, but I
> > admit I don't know much about df nor regcprop.
> RIght.  I plan to commit that today along with the test reordering you
> pointed out.
>
> >
> > 1) the df_analyze () after every (successful) processing of a basic block
> > is IMHO way too expensive, I would be very surprised if df_analyze () isn't
> > quadratic in number of basic blocks and so one could construct testcases
> > with millions of basic blocks and at least one regcprop change in each bb
> > and get at cubic complexity (correct me if I'm wrong, and I'm aware of the
> > 95% bbs you said won't have any changes at all)
> I'm going to look this further today.

Look at https://gcc.opensuse.org/gcc-old/c++bench-czerny/random/random-performance-latest
and you'll see multiple testcases with 'hard reg cprop' >10% compile-time.
It's indeed a hog for no good reason.

Richard.

> >
> > 2) another thing I've noticed today, we queue up the debug insn changes,
> > but if we iterate the second time for any bb, we overwrite and thus lose any
> > of the debug insn queued changes from the first iteration, so those changes
> > aren't applied (wrong-debug?)
> This is actually what worries me, both with the current bits and with
> any bits that change to a worklist of blocks to reprocess.  As is I
> think we preserve the debug stuff across the iterations which should
> keep debug info correct.  Managing that in a world where we queue up the
> iteration step is going to be tougher.
>
> Queuing the iteration step feels like gcc-10 to me, but we'll see if it
> falls out easier than expected.
>
> >
> > 3) as mentioned on IRC, the order of tests in copyprop_hardreg_forward_1
> > conditional doesn't start from the cheapest to most expensive one
> That's going to be fixed in today's commit since it's been tested.
>
> Jeff
Jakub Jelinek March 28, 2019, 1:47 p.m. UTC | #9
On Thu, Mar 28, 2019 at 09:55:46AM +0100, Richard Biener wrote:
> On Wed, Mar 27, 2019 at 4:26 PM Jeff Law <law@redhat.com> wrote:
> >
> > On 3/27/19 8:36 AM, Jakub Jelinek wrote:
> > > On Sun, Mar 24, 2019 at 09:20:07AM -0600, Jeff Law wrote:
> > >> However, I'm increasingly of the opinion that MIPS targets need to drop
> > >> off the priority platform list.  Given the trajectory I see for MIPS
> > >> based processors in industry, it's really hard to justify spending this
> > >> much time on them, particularly for low priority code quality issues.
> > >
> > > Besides what has been discussed on IRC for the PR89826 fix, that we really
> > > need a df_analyze before processing the first block, because otherwise we
> > > can't rely on the REG_UNUSED notes in the IL, I see some other issues, but I
> > > admit I don't know much about df nor regcprop.
> > RIght.  I plan to commit that today along with the test reordering you
> > pointed out.
> >
> > >
> > > 1) the df_analyze () after every (successful) processing of a basic block
> > > is IMHO way too expensive, I would be very surprised if df_analyze () isn't
> > > quadratic in number of basic blocks and so one could construct testcases
> > > with millions of basic blocks and at least one regcprop change in each bb
> > > and get at cubic complexity (correct me if I'm wrong, and I'm aware of the
> > > 95% bbs you said won't have any changes at all)
> > I'm going to look this further today.
> 
> Look at https://gcc.opensuse.org/gcc-old/c++bench-czerny/random/random-performance-latest
> and you'll see multiple testcases with 'hard reg cprop' >10% compile-time.
> It's indeed a hog for no good reason.

I've tried https://gcc.gnu.org/bugzilla/show_bug.cgi?id=28071#c1
in --enable-checking=yes,rtl,extra bootstrapped cc1 at -O2, without and with
the patch.
The important times in -ftime-report with vanilla trunk:
 phase opt and generate             : 250.76 (100%)   2.00 ( 96%) 253.36 (100%)  768860 kB ( 99%)
 df live regs                       :  19.95 (  8%)   0.03 (  1%)  19.39 (  8%)       0 kB (  0%)
 df live&initialized regs           :  20.29 (  8%)   0.05 (  2%)  19.73 (  8%)       0 kB (  0%)
 df reg dead/unused notes           : 158.66 ( 63%)   0.02 (  1%) 160.12 ( 63%)    4665 kB (  1%)
 hard reg cprop                     :  21.03 (  8%)   0.01 (  0%)  21.39 (  8%)     509 kB (  0%)
 TOTAL                              : 250.85          2.09        253.57         776940 kB
(ignoring everything <2% in the first % column).
Configure with --enable-checking=release to disable checks.
With the https://gcc.gnu.org/ml/gcc-patches/2019-03/msg01335.html patch the
same testcase with -O2 -ftime-report results in identical assembly, but:
 phase opt and generate             :  28.92 (100%)   1.82 ( 95%)  30.85 ( 99%)  768882 kB ( 99%)
 CFG verifier                       :   1.66 (  6%)   0.02 (  1%)   1.69 (  5%)       0 kB (  0%)
 df live regs                       :   0.63 (  2%)   0.00 (  0%)   0.61 (  2%)       0 kB (  0%)
 df live&initialized regs           :   1.01 (  3%)   0.03 (  2%)   1.00 (  3%)       0 kB (  0%)
 df must-initialized regs           :   1.51 (  5%)   0.93 ( 48%)   2.46 (  8%)       0 kB (  0%)
 tree SSA verifier                  :   2.79 ( 10%)   0.01 (  1%)   2.78 (  9%)       0 kB (  0%)
 tree STMT verifier                 :   2.00 (  7%)   0.00 (  0%)   1.99 (  6%)       0 kB (  0%)
 dominance computation              :   0.61 (  2%)   0.00 (  0%)   0.59 (  2%)       0 kB (  0%)
 out of ssa                         :   0.61 (  2%)   0.04 (  2%)   0.65 (  2%)       1 kB (  0%)
 loop init                          :   0.58 (  2%)   0.00 (  0%)   0.63 (  2%)      38 kB (  0%)
 combiner                           :   0.44 (  2%)   0.02 (  1%)   0.47 (  2%)   17926 kB (  2%)
 integrated RA                      :   2.24 (  8%)   0.08 (  4%)   2.35 (  8%)  205177 kB ( 26%)
 LRA non-specific                   :   1.46 (  5%)   0.05 (  3%)   1.50 (  5%)   19172 kB (  2%)
 LRA create live ranges             :   1.23 (  4%)   0.00 (  0%)   1.23 (  4%)    2589 kB (  0%)
 reload CSE regs                    :   0.54 (  2%)   0.00 (  0%)   0.51 (  2%)    8456 kB (  1%)
 scheduling 2                       :   0.73 (  3%)   0.09 (  5%)   0.81 (  3%)    2715 kB (  0%)
 verify RTL sharing                 :   1.19 (  4%)   0.00 (  0%)   1.15 (  4%)       0 kB (  0%)
 TOTAL                              :  29.02          1.92         31.07         776962 kB
So 8.5x usr time speedup with that patch.

	Jakub
Jeff Law March 31, 2019, 5:10 p.m. UTC | #10
On 3/28/19 2:55 AM, Richard Biener wrote:
> On Wed, Mar 27, 2019 at 4:26 PM Jeff Law <law@redhat.com> wrote:
>>
>> On 3/27/19 8:36 AM, Jakub Jelinek wrote:
>>> On Sun, Mar 24, 2019 at 09:20:07AM -0600, Jeff Law wrote:
>>>> However, I'm increasingly of the opinion that MIPS targets need to drop
>>>> off the priority platform list.  Given the trajectory I see for MIPS
>>>> based processors in industry, it's really hard to justify spending this
>>>> much time on them, particularly for low priority code quality issues.
>>>
>>> Besides what has been discussed on IRC for the PR89826 fix, that we really
>>> need a df_analyze before processing the first block, because otherwise we
>>> can't rely on the REG_UNUSED notes in the IL, I see some other issues, but I
>>> admit I don't know much about df nor regcprop.
>> RIght.  I plan to commit that today along with the test reordering you
>> pointed out.
>>
>>>
>>> 1) the df_analyze () after every (successful) processing of a basic block
>>> is IMHO way too expensive, I would be very surprised if df_analyze () isn't
>>> quadratic in number of basic blocks and so one could construct testcases
>>> with millions of basic blocks and at least one regcprop change in each bb
>>> and get at cubic complexity (correct me if I'm wrong, and I'm aware of the
>>> 95% bbs you said won't have any changes at all)
>> I'm going to look this further today.
> 
> Look at https://gcc.opensuse.org/gcc-old/c++bench-czerny/random/random-performance-latest
> and you'll see multiple testcases with 'hard reg cprop' >10% compile-time.
> It's indeed a hog for no good reason.
I stand corrected.  That's really a surprise.  I looked at today's
report and don't see anything over 1%, so Jakub's improvements really
helped.
diff mbox series

Patch

diff --git a/gcc/ChangeLog b/gcc/ChangeLog
index 6b9f1c4a2b6..4a5b122eedd 100644
--- a/gcc/ChangeLog
+++ b/gcc/ChangeLog
@@ -1,3 +1,14 @@ 
+2019-03-26  Jeff Law  <law@redhat.com>
+
+	PR rtl-optimization/87761
+	* regcprop.c (copyprop_hardreg_forward_1): Check may_trap_p on SET,
+	not INSN.  Also check RTX_FRAME_RELATED_P.  Queue insns for DF rescan
+	as needed.
+	(pass_cprop_hardreg::execute): Add df note problem and defer insn
+	rescans.  Reprocess blocks as needed, calling df_analyze before
+	reprocessing.  Always call df_analyze before fixing up debug bind
+	insns.
+	
 2019-03-23  Segher Boessenkool  <segher@kernel.crashing.org>
 
 	* config/rs6000/xmmintrin.h (_mm_movemask_pi8): Implement for 32-bit
@@ -8,7 +19,7 @@ 
 	* config/aarch64/aarch64.md (zero_extendsidi2_aarch64): Fix type
 	attrribute for uxtw.
 
-2019-02-26  Jeff Law  <law@redhat.com>
+2019-03-26  Jeff Law  <law@redhat.com>
 
 	PR rtl-optimization/87761
 	* config/mips/mips-protos.h (mips_split_move): Add new argument.
diff --git a/gcc/regcprop.c b/gcc/regcprop.c
index 926df40f954..8ca523ffe23 100644
--- a/gcc/regcprop.c
+++ b/gcc/regcprop.c
@@ -800,8 +800,9 @@  copyprop_hardreg_forward_1 (basic_block bb, struct value_data *vd)
 
       /* Detect obviously dead sets (via REG_UNUSED notes) and remove them.  */
       if (set
+	  && !may_trap_p (set)
+	  && !RTX_FRAME_RELATED_P (insn)
 	  && INSN_P (insn)
-	  && !may_trap_p (insn)
 	  && find_reg_note (insn, REG_UNUSED, SET_DEST (set))
 	  && !side_effects_p (SET_SRC (set))
 	  && !side_effects_p (SET_DEST (set)))
@@ -1034,6 +1035,7 @@  copyprop_hardreg_forward_1 (basic_block bb, struct value_data *vd)
 	     DEBUG_INSNs can be applied.  */
 	  if (vd->n_debug_insn_changes)
 	    note_uses (&PATTERN (insn), cprop_find_used_regs, vd);
+	  df_insn_rescan (insn);
 	}
 
       ksvd.vd = vd;
@@ -1112,7 +1114,10 @@  copyprop_hardreg_forward_1 (basic_block bb, struct value_data *vd)
 
 	  /* Notice copies.  */
 	  if (copy_p)
-	    copy_value (SET_DEST (set), SET_SRC (set), vd);
+	    {
+	      df_insn_rescan (insn);
+	      copy_value (SET_DEST (set), SET_SRC (set), vd);
+	    }
 	}
 
       if (insn == BB_END (bb))
@@ -1282,47 +1287,68 @@  pass_cprop_hardreg::execute (function *fun)
 {
   struct value_data *all_vd;
   basic_block bb;
-  bool analyze_called = false;
 
   all_vd = XNEWVEC (struct value_data, last_basic_block_for_fn (fun));
 
   auto_sbitmap visited (last_basic_block_for_fn (fun));
   bitmap_clear (visited);
 
+  df_note_add_problem ();
+
+  /* It is tempting to set DF_LR_RUN_DCE, but DCE may choose to delete
+     an insn and this pass would not have visibility into the removal.
+     This pass would then potentially use the source of that
+     INSN for propagation purposes, generating invalid code.
+
+     So we just ask for updated notes and handle trivial deletions
+     within this pass where we can update this passes internal
+     data structures appropriately.  */
+  df_set_flags (DF_DEFER_INSN_RESCAN);
+
   FOR_EACH_BB_FN (bb, fun)
     {
       bitmap_set_bit (visited, bb->index);
 
-      /* If a block has a single predecessor, that we've already
-	 processed, begin with the value data that was live at
-	 the end of the predecessor block.  */
-      /* ??? Ought to use more intelligent queuing of blocks.  */
-      if (single_pred_p (bb)
-	  && bitmap_bit_p (visited, single_pred (bb)->index)
-	  && ! (single_pred_edge (bb)->flags & (EDGE_ABNORMAL_CALL | EDGE_EH)))
+      for (int pass = 0; pass < 2; pass++)
 	{
-	  all_vd[bb->index] = all_vd[single_pred (bb)->index];
-	  if (all_vd[bb->index].n_debug_insn_changes)
+	  /* If a block has a single predecessor, that we've already
+	     processed, begin with the value data that was live at
+	    the end of the predecessor block.  */
+	  /* ??? Ought to use more intelligent queuing of blocks.  */
+	  if (single_pred_p (bb)
+	      && bitmap_bit_p (visited, single_pred (bb)->index)
+	      && ! (single_pred_edge (bb)->flags & (EDGE_ABNORMAL_CALL | EDGE_EH)))
 	    {
-	      unsigned int regno;
-
-	      for (regno = 0; regno < FIRST_PSEUDO_REGISTER; regno++)
+	      all_vd[bb->index] = all_vd[single_pred (bb)->index];
+	      if (all_vd[bb->index].n_debug_insn_changes)
 		{
-		  if (all_vd[bb->index].e[regno].debug_insn_changes)
+		  unsigned int regno;
+
+		  for (regno = 0; regno < FIRST_PSEUDO_REGISTER; regno++)
 		    {
-		      all_vd[bb->index].e[regno].debug_insn_changes = NULL;
-		      if (--all_vd[bb->index].n_debug_insn_changes == 0)
-			break;
+		      if (all_vd[bb->index].e[regno].debug_insn_changes)
+			{
+			  all_vd[bb->index].e[regno].debug_insn_changes = NULL;
+			  if (--all_vd[bb->index].n_debug_insn_changes == 0)
+			    break;
+			}
 		    }
 		}
 	    }
-	}
-      else
-	init_value_data (all_vd + bb->index);
+	  else
+	    init_value_data (all_vd + bb->index);
 
-      copyprop_hardreg_forward_1 (bb, all_vd + bb->index);
+	  /* If we were unable to propagate, then break the loop.  */
+	  if (!copyprop_hardreg_forward_1 (bb, all_vd + bb->index))
+	    break;
+	  df_analyze ();
+	}
     }
 
+  /* We must call df_analyze here unconditionally to ensure that the
+     REG_UNUSED and REG_DEAD notes are consistent with and without -g.  */
+  df_analyze ();
+
   if (MAY_HAVE_DEBUG_BIND_INSNS)
     {
       FOR_EACH_BB_FN (bb, fun)
@@ -1332,11 +1358,6 @@  pass_cprop_hardreg::execute (function *fun)
 	    unsigned int regno;
 	    bitmap live;
 
-	    if (!analyze_called)
-	      {
-		df_analyze ();
-		analyze_called = true;
-	      }
 	    live = df_get_live_out (bb);
 	    for (regno = 0; regno < FIRST_PSEUDO_REGISTER; regno++)
 	      if (all_vd[bb->index].e[regno].debug_insn_changes)