diff mbox series

RFA: Fix combine.c combining a move and a non-move into two non-moves, PR93372

Message ID 202007060111.0661BHHM032190@ignucius.se.axis.com
State New
Headers show
Series RFA: Fix combine.c combining a move and a non-move into two non-moves, PR93372 | expand

Commit Message

Hans-Peter Nilsson July 6, 2020, 1:11 a.m. UTC
TL;DR: fixing a misdetection of what is a "simple move".

Looking into performace degradation after de-cc0 for CRIS, I
noticed combine behaving badly; it changed a move and a
right-shift into two right-shifts, where the "combined" move was
not eliminated in later passes, and where the deficiency caused
an extra insn in a hot loop: crcu16 (and crcu32) in coremark.

Before de-cc0, the insns input to combine looked like:
   33: r58:SI=r56:SI 0>>r48:SI
      REG_DEAD r56:SI
   35: r37:HI=r58:SI#0
and after:
   33: {r58:SI=r56:SI 0>>r48:SI;clobber dccr:CC;}
      REG_DEAD r56:SI
      REG_UNUSED dccr:CC
   35: {r37:HI=r58:SI#0;clobber dccr:CC;}
      REG_UNUSED dccr:CC
That is, there's always a parallel with a clobber of the
condition-codes register.  Being a parallel, it's not an
is_just_move, but e.g. a single_set.

For the de-cc0:ed "combination", it ended up as
   33: {r58:SI=r56:SI 0>>r48:SI;clobber dccr:CC;}
      REG_UNUSED dccr:CC
   35: {r37:HI#0=r56:SI 0>>r48:SI;clobber dccr:CC;}
      REG_DEAD r56:SI
      REG_UNUSED dccr:CC
That is, a move and a shift turned into two shifts; the extra
shift is not eliminated by later passes, while the move was
(with cc0, and "will be again") leading to redundant
instructions.

At first I thought this was due to parallel-ignorant old code
but the "guilty" change is actually pretty recent.  Regarding a
parallel with a clobber not being "just" a move, there's only
the two adjacent callers seen in the patch (obviously with the
rename), and their use exactly matches to check that the
argument is a single_set which is a move.  It's always applied
to an rtx_insn, so I changed the type and name to avoid the
"just" issue.  I had to adjust the type when calling single_set.

I checked the original commit, c4c5ad1d6d1e1e a.k.a r263067 and
it seems parallels-as-sets were just overlooked and that this
patch appears to agree with the intent and the comment at the
use of i2_was_move and i3_was_move, which has a clause saying
"Also do this if we started with two insns neither of which was
a simple move".

With this correction in place, the performance degradation
related to de-cc0 of the CRIS port as measured by coremark is
gone and turned into a small win.  N.B.: there certainly is a
code difference in other hot functions, and the swing between
different functions is larger than this difference; to be dealt
with separately.

Tested cris-elf, x86_64-linux, powerpc64le-linux, 2/3 through
aarch64-linux (unexpectedly slow).

Ok to commit?

2020-07-06  Hans-Peter Nilsson  <hp@axis.com>

	PR target/93372
	* gcc/combine.c (is_move): Rename from is_just_move.  Use
	single_set instead of of peeking directly into the PATTERN.
---
 gcc/combine.c | 16 +++++++++-------
 1 file changed, 9 insertions(+), 7 deletions(-)

Comments

Segher Boessenkool July 6, 2020, 11:42 p.m. UTC | #1
Hi!

On Mon, Jul 06, 2020 at 03:11:17AM +0200, Hans-Peter Nilsson wrote:
> TL;DR: fixing a misdetection of what is a "simple move".

That is not a very correct characterisation of what this does :-)

> Looking into performace degradation after de-cc0 for CRIS, I
> noticed combine behaving badly; it changed a move and a
> right-shift into two right-shifts, where the "combined" move was
> not eliminated in later passes, and where the deficiency caused
> an extra insn in a hot loop: crcu16 (and crcu32) in coremark.
> 
> Before de-cc0, the insns input to combine looked like:
>    33: r58:SI=r56:SI 0>>r48:SI
>       REG_DEAD r56:SI
>    35: r37:HI=r58:SI#0
> and after:
>    33: {r58:SI=r56:SI 0>>r48:SI;clobber dccr:CC;}
>       REG_DEAD r56:SI
>       REG_UNUSED dccr:CC
>    35: {r37:HI=r58:SI#0;clobber dccr:CC;}
>       REG_UNUSED dccr:CC

So a shift like this is at most as expensive as a move, on your target
(or, in your backend, anyway ;-) )

> That is, there's always a parallel with a clobber of the
> condition-codes register.  Being a parallel, it's not an
> is_just_move, but e.g. a single_set.
> 
> For the de-cc0:ed "combination", it ended up as
>    33: {r58:SI=r56:SI 0>>r48:SI;clobber dccr:CC;}
>       REG_UNUSED dccr:CC
>    35: {r37:HI#0=r56:SI 0>>r48:SI;clobber dccr:CC;}
>       REG_DEAD r56:SI
>       REG_UNUSED dccr:CC
> That is, a move and a shift turned into two shifts; the extra
> shift is not eliminated by later passes, while the move was
> (with cc0, and "will be again") leading to redundant
> instructions.

Which was the whole point of the is_just_move() thing, yes.  Combine
doesn't know most moves will be eliminated by RA (but many are useful to
do have before RA, because it gives RA much more freedom).  If a move is
the same cost as a simple insn, doing two (say shift, like here) insns
in parallel is cheaper on most machines than having a shift and a move
sequentially.  (2-2 combinations are helpful on single-scalar and even
in-order machines as well, btw).

> At first I thought this was due to parallel-ignorant old code
> but the "guilty" change is actually pretty recent.  Regarding a
> parallel with a clobber not being "just" a move, there's only
> the two adjacent callers seen in the patch (obviously with the
> rename), and their use exactly matches to check that the
> argument is a single_set which is a move.  It's always applied
> to an rtx_insn, so I changed the type and name to avoid the
> "just" issue.  I had to adjust the type when calling single_set.

But it is *not* supposed to be the same as single_set.

> I checked the original commit, c4c5ad1d6d1e1e a.k.a r263067 and
> it seems parallels-as-sets were just overlooked and that this

They were not.  It causes regressions.  That is why it has a different
name, not something with "single set".  "just_move" isn't a very good
name, I couldn't come up with something better, it is a pretty
complicated concept :-/

"i2_should_not_be_2_2_combined"?

I'll rerun some testing to show this.  It'll take a while.

> patch appears to agree with the intent and the comment at the
> use of i2_was_move and i3_was_move, which has a clause saying
> "Also do this if we started with two insns neither of which was
> a simple move".

That says that we combine to 2 insns also if we started with only 2,
but neither of those was just a move.

> With this correction in place, the performance degradation
> related to de-cc0 of the CRIS port as measured by coremark is
> gone and turned into a small win.  N.B.: there certainly is a
> code difference in other hot functions, and the swing between
> different functions is larger than this difference; to be dealt
> with separately.
> 
> Tested cris-elf, x86_64-linux, powerpc64le-linux, 2/3 through
> aarch64-linux (unexpectedly slow).
> 
> Ok to commit?

No, sorry.

One thing that could work is allowing (unused) clobbers of fixed
registers (like you have here), or maybe some hook is needed to say this
register is like a flags register, or similar.  That should work for you,
and not regress other targets, maybe even help a little?  We'll see.

Thanks,


Segher
Segher Boessenkool July 7, 2020, 8:23 p.m. UTC | #2
Hi!

On Tue, Jul 07, 2020 at 02:50:09AM +0200, Hans-Peter Nilsson wrote:
> > On Mon, Jul 06, 2020 at 03:11:17AM +0200, Hans-Peter Nilsson wrote:
> > > TL;DR: fixing a misdetection of what is a "simple move".
> > 
> > That is not a very correct characterisation of what this does :-)
> 
> That's apparently where we completely disagree. :-)

Well, I wrote that code, I know what is considered "just a move" there.
You want to extend that, and that is fine, but this is not a bug.  Taken
to the extreme, anything in GCC is completely buggy, because it doesn't
solve world hunger (yet!), following that line of thought.

> > > Looking into performace degradation after de-cc0 for CRIS, I
> > > noticed combine behaving badly; it changed a move and a
> > > right-shift into two right-shifts, where the "combined" move was
> > > not eliminated in later passes, and where the deficiency caused
> > > an extra insn in a hot loop: crcu16 (and crcu32) in coremark.
> > > 
> > > Before de-cc0, the insns input to combine looked like:
> > >    33: r58:SI=r56:SI 0>>r48:SI
> > >       REG_DEAD r56:SI
> > >    35: r37:HI=r58:SI#0
> > > and after:
> > >    33: {r58:SI=r56:SI 0>>r48:SI;clobber dccr:CC;}
> > >       REG_DEAD r56:SI
> > >       REG_UNUSED dccr:CC
> > >    35: {r37:HI=r58:SI#0;clobber dccr:CC;}
> > >       REG_UNUSED dccr:CC
> > 
> > So a shift like this is at most as expensive as a move, on your target
> > (or, in your backend, anyway ;-) )
> 
> On CRIS, the backend *and* the target, yes; one cycle, one short
> instruction.

So combine did what it is supposed to do.

> > > That is, there's always a parallel with a clobber of the
> > > condition-codes register.  Being a parallel, it's not an
> > > is_just_move, but e.g. a single_set.

This is something that happens on many targets.  For some, only for some
instructions (and flag registers).  For some, for many instructions; and
for really unhappy targets, for almost all instructions, even for
register moves and/or loads and/or stores.

> > > For the de-cc0:ed "combination", it ended up as
> > >    33: {r58:SI=r56:SI 0>>r48:SI;clobber dccr:CC;}
> > >       REG_UNUSED dccr:CC
> > >    35: {r37:HI#0=r56:SI 0>>r48:SI;clobber dccr:CC;}
> > >       REG_DEAD r56:SI
> > >       REG_UNUSED dccr:CC
> > > That is, a move and a shift turned into two shifts; the extra
> > > shift is not eliminated by later passes, while the move was
> > > (with cc0, and "will be again") leading to redundant
> > > instructions.
> > 
> > Which was the whole point of the is_just_move() thing, yes.  Combine
> > doesn't know most moves will be eliminated by RA (but many are useful to
> > do have before RA, because it gives RA much more freedom).  If a move is
> > the same cost as a simple insn, doing two (say shift, like here) insns
> > in parallel is cheaper on most machines than having a shift and a move
> > sequentially.
> 
> Most parallel machines you mean,

No, I mean most machines, not just super-scalar machines.

> but why bring up them when
> there's no means for combine to tell the difference?

I did not bring them up, and this optimisation is important not only for
super-scalar targets.  But, the most important targets for GCC (in terms
of how many people use it, all targets are important for many other
reasons) are all supper-scalar.

> Here, the
> end result is that it *added* an instruction to the hot loop.

It did not.  Combine *never* does that.

> It's a deficiency and it caused a performance regression, can't
> argue with that.

It did not cause any regression.  It can be improved, sure, and thank
you for pointing that out!

> >  (2-2 combinations are helpful on single-scalar and even
> > in-order machines as well, btw).
> 
> I certainly don't contest that the move can be eliminated, and
> that most cost-effective 2-2 eliminations are helpful.  (See my
> other post about combine being eager with allowing same-cost
> combinations.)

I did not see that post, do you have a pointer?

> > > At first I thought this was due to parallel-ignorant old code
> > > but the "guilty" change is actually pretty recent.  Regarding a
> > > parallel with a clobber not being "just" a move, there's only
> > > the two adjacent callers seen in the patch (obviously with the
> > > rename), and their use exactly matches to check that the
> > > argument is a single_set which is a move.  It's always applied
> > > to an rtx_insn, so I changed the type and name to avoid the
> > > "just" issue.  I had to adjust the type when calling single_set.
> > 
> > But it is *not* supposed to be the same as single_set.
> 
> (I'm not saying that a single_set is a move.  But that's
> obvious.)  I guess you meant to say that a single_set with a
> general_operand as a source (as in the patch) is not supposed to
> be the same as a move.  This is the only place in combine where
> that distinction would be important.  Why?

It used to be that is_just_move was called on new*pat as well, so you
simply *could not* use single_set there (you have no rtx_insn*).

single_set also allows other insns, for example, multiple sets!  But
testing shows that this does not actually happen often at all, so we
could use single_set here and not change much at all for most other
targets.

> I think you're just overconcious about clobbers here.

It has nothing to do with that.

> > > I checked the original commit, c4c5ad1d6d1e1e a.k.a r263067 and
> > > it seems parallels-as-sets were just overlooked and that this
> > 
> > They were not.  It causes regressions.
> 
> Do you have some pointers to PR:s or something else backing that
> statement, or is it your work-in-progress hinted below?

I do not know what "work in progress" you mean?

> >  That is why it has a different
> > name, not something with "single set".  "just_move" isn't a very good
> > name, I couldn't come up with something better, it is a pretty
> > complicated concept :-/
> > 
> > "i2_should_not_be_2_2_combined"?
> > 
> > I'll rerun some testing to show this.  It'll take a while.
> 
> Care to fill in some details on what kind of testing?

Same as always: I build the toolchain and Linux kernel for 30 different
targets, and compare the output.

For most things, just looking at the size of the generated binary is
telling (it shows some combinations did or did not happen).  For a more
detailed view, you need to use at the actual machine code diffs directly,
which is much more work (sometimes diff of size per function is useful
to see first as well).

For 2-2, size does *not* usually change, which brings us immediately
into "a lot more work" territory.  Oh, and all x86 compilers ICE.

> > > With this correction in place, the performance degradation
> > > related to de-cc0 of the CRIS port as measured by coremark is
> > > gone and turned into a small win.  N.B.: there certainly is a
> > > code difference in other hot functions, and the swing between
> > > different functions is larger than this difference; to be dealt
> > > with separately.
> > > 
> > > Tested cris-elf, x86_64-linux, powerpc64le-linux, 2/3 through
> > > aarch64-linux (unexpectedly slow).
> > > 
> > > Ok to commit?
> > 
> > No, sorry.
> 
> Sigh.  I'm very interested in what your investigation will show.

It shows we can change to use single_set here.

> > One thing that could work is allowing (unused) clobbers of fixed
> > registers (like you have here), or maybe some hook is needed to say this
> > register is like a flags register, or similar.  That should work for you,
> > and not regress other targets, maybe even help a little?  We'll see.
> 
> Still, there is already TARGET_FLAGS_REGNUM (a "hooked"
> constant), so I take it you would be happy if we recognize a
> clobber of *just that*, in a parallel?  (I'll take care of
> updating tm.texi of course.)

Most targets do not use cmpelim, and many *can not* define
TARGET_FLAGS_REGNUM, unfortunately.

I'll review the original patch again, to point out where it still needs
changing.

Thanks,


Segher
Segher Boessenkool July 7, 2020, 8:50 p.m. UTC | #3
Hi!

On Mon, Jul 06, 2020 at 03:11:17AM +0200, Hans-Peter Nilsson wrote:
> TL;DR: fixing a misdetection of what is a "simple move".

As set before, this is not a fix, not a "misdetection", it is plain and
simple a behaviour change.

"Use single_set for is_just_move" would be a fine subject.

> Regarding a
> parallel with a clobber not being "just" a move, there's only
> the two adjacent callers seen in the patch (obviously with the
> rename), and their use exactly matches to check that the
> argument is a single_set which is a move.

This isn't true, single_set has somewhat different semantics.  Some of
which is exactly the change you are looking for, but there are more
differences.

> It's always applied
> to an rtx_insn, so I changed the type and name to avoid the
> "just" issue.  I had to adjust the type when calling single_set.

Don't change the name please.  Changing it to take an rtx_insn* is fine,
we aren't likely to change back to testing the resulting patterns.

> I checked the original commit, c4c5ad1d6d1e1e a.k.a r263067 and

The history is years older (some of which is on gcc-patches@).

I'll make a simpler patch.  Thanks!


Segher
Segher Boessenkool July 14, 2020, 8:40 p.m. UTC | #4
On Mon, Jul 13, 2020 at 07:29:00AM +0200, Hans-Peter Nilsson wrote:
> > From: Segher Boessenkool <segher@kernel.crashing.org>
> > Date: Tue, 7 Jul 2020 22:50:43 +0200
> 
> > I'll make a simpler patch.  Thanks!
> 
> You're welcome.  So, you'll take care of the updated patch
> yourself?

Yes.  Delayed a bit, I had a lot to do for 10.2...  But Soon(tm) :-)


Segher
Segher Boessenkool July 14, 2020, 9:33 p.m. UTC | #5
Hi!

On Mon, Jul 13, 2020 at 07:25:37AM +0200, Hans-Peter Nilsson wrote:
> > > > > TL;DR: fixing a misdetection of what is a "simple move".
> > > > 
> > > > That is not a very correct characterisation of what this does :-)
> > > 
> > > That's apparently where we completely disagree. :-)
> > 
> > Well, I wrote that code, I know what is considered "just a move" there.
> 
> You lost some context: I'm comparing before/after the
> cc0-conversion for CRIS, where this is a misdetection (a false
> negative) of a move and causes a performance-regression.

The cc0 conversion caused a performance regression.  You can improve
some code in combine to make that not happen.

> > > I certainly don't contest that the move can be eliminated, and
> > > that most cost-effective 2-2 eliminations are helpful.  (See my
> > > other post about combine being eager with allowing same-cost
> > > combinations.)
> > 
> > I did not see that post, do you have a pointer?
> 
> https://gcc.gnu.org/pipermail/gcc-patches/2020-July/549416.html

I'll reply to that separately.

> > single_set also allows other insns, for example, multiple sets!
> 
> ...where the other sets are unused.  Not sure how that kind of
> thing would get combined here, but if its combined cost is
> beneficial, it would be a win, I guess.

The unused outputs are thrown away by combine, except when that doesn't
match, then it is tried again but with the original clobbers.

> > > Do you have some pointers to PR:s or something else backing that
> > > statement, or is it your work-in-progress hinted below?
> > 
> > I do not know what "work in progress" you mean?
> 
> I'm referring to your "I'll rerun some testing to show this.
> It'll take a while."  Are the regressions you refer to above
> tracked in bugzilla or on some mailing list?

The whole 2-2 combine thing took half a year at least to develop.  I
have posted to gcc-patches@ about it a few times.  Not having the
is_just_move stuff causes highly visible regressions on some targets, I
do not remember which, but it hurts all targets.  2-2 combine with the
is_just_move (and make_extra_copies) stuff was a win everywhere.

The tests just take long to run (it used to be 2h per run, but it is
closer to 3h per run with the current compiler: building cross-compilers
used to be less than 4m per target, this is much worse since a few
years).

Analysing the results is easy for most kinds of instruction combination:
just looking at the binary size of the testcase (I usually build Linux)
gives a good idea how effective combine was.  But for 2-2 combinations
that doesn't show much at all, so I dig through the actual resulting
code (for many targets, not all 30, just those I think are interesting).

> > For 2-2, size does *not* usually change, which brings us immediately
> > into "a lot more work" territory.  Oh, and all x86 compilers ICE.

The x86-64 kernel *does* build, just some boot wrapper code fails, but
the kernel itself does build.  i386 does ICE however, something with
memcpy or some such.

> If combine only did lower-cost combinations (perhaps with
> Richard Sandifords lower-size-when-tied suggestion), I guess
> this wouldn't happen. 0:-)

And we would regress (a LOT).

> > It shows we can change to use single_set here.
> 
> Did you mean "will show whether" or is it already complete?

It did complete, yes (and didn't change a single resulting intruction).
So that was easy :-)

> > I'll review the original patch again, to point out where it still needs
> > changing.
> 
> ...but if you're in progress with a single_set variant, I'm all
> for it.

Yup, it's pretty simple actually :-)

Thanks,


Segher
Segher Boessenkool July 17, 2020, 12:05 a.m. UTC | #6
On Tue, Jul 14, 2020 at 04:33:42PM -0500, Segher Boessenkool wrote:
> > If combine only did lower-cost combinations (perhaps with
> > Richard Sandifords lower-size-when-tied suggestion), I guess
> > this wouldn't happen. 0:-)
> 
> And we would regress (a LOT).

Like this.  C0 is an unmodified compiler.  C1 is with the single_set
modification to is_just_move I committed a minute ago (84c5396d4bdb).
C2 is with this patch:

-- 8< --
diff --git a/gcc/combine.c b/gcc/combine.c
index 3a81bb6..619ba77 100644
--- a/gcc/combine.c
+++ b/gcc/combine.c
@@ -939,7 +939,7 @@ combine_validate_cost (rtx_insn *i0, rtx_insn *i1, rtx_insn 
 
   /* Disallow this combination if both new_cost and old_cost are greater than
      zero, and new_cost is greater than old cost.  */
-  int reject = old_cost > 0 && new_cost > old_cost;
+  int reject = old_cost > 0 && new_cost >= old_cost;
 
   if (dump_file)
     {
-- 8< --

[segher@gcc135 buildall]$ perl sizes.pl --percent C[012]
                    C0        C1        C2
       alpha   6045560  100.000%  100.518%
         arc   3529016  100.000%   99.933%
         arm  14173370  100.000%  101.607%
       arm64  12958466  100.000%  100.477%
         c6x   2341205  100.000%  100.154%
        csky   3320386  100.000%  100.838%
       h8300   1163584  100.000%  100.331%
        i386         0         0         0
        ia64  18079744  100.000%  100.857%
        m68k   3711195  100.000%  100.327%
  microblaze   4930937  100.000%  100.054%
        mips   8403293  100.000%  100.049%
      mips64   6975860  100.000%   99.986%
       nds32   4450951  100.000%   99.992%
       nios2   3641733  100.000%  100.206%
    openrisc   4182260  100.000%  100.025%
      parisc   7706299  100.000%  101.500%
    parisc64   8677365  100.000%  101.491%
     powerpc  10016575  100.000%  100.001%
   powerpc64  17331518  100.000%   99.974%
 powerpc64le  17331518  100.000%   99.974%
     riscv32         0         0         0
     riscv64         0         0         0
        s390  13091897  100.000%  100.396%
          sh   3213207  100.000%  100.031%
     shnommu   1610444  100.000%  100.031%
       sparc   4356641  100.000%  101.521%
     sparc64   6745123  100.000%  101.450%
      x86_64  19663507  100.000%  100.007%
      xtensa   2105610  100.000%  100.425%


Segher
diff mbox series

Patch

diff --git a/gcc/combine.c b/gcc/combine.c
index 7da144e..ed90b16 100644
--- a/gcc/combine.c
+++ b/gcc/combine.c
@@ -2624,13 +2624,15 @@  can_split_parallel_of_n_reg_sets (rtx_insn *insn, int n)
   return true;
 }
 
-/* Return whether X is just a single set, with the source
-   a general_operand.  */
+/* Return whether INSN is just a single set, with the source
+   a general_operand.  INSN must be an insn, not stripped to its PATTERN.  */
 static bool
-is_just_move (rtx x)
+is_move (const rtx_insn *insn)
 {
-  if (INSN_P (x))
-    x = PATTERN (x);
+  rtx x = single_set (insn);
+
+  if (x == NULL_RTX)
+    return false;
 
   return (GET_CODE (x) == SET && general_operand (SET_SRC (x), VOIDmode));
 }
@@ -3103,8 +3105,8 @@  try_combine (rtx_insn *i3, rtx_insn *i2, rtx_insn *i1, rtx_insn *i0,
     }
 
   /* Record whether i2 and i3 are trivial moves.  */
-  i2_was_move = is_just_move (i2);
-  i3_was_move = is_just_move (i3);
+  i2_was_move = is_move (i2);
+  i3_was_move = is_move (i3);
 
   /* Record whether I2DEST is used in I2SRC and similarly for the other
      cases.  Knowing this will help in register status updating below.  */