diff mbox series

[PATCH/RFC] combine_completed global variable.

Message ID 001a01d89239$713109e0$53931da0$@nextmovesoftware.com
State New
Headers show
Series [PATCH/RFC] combine_completed global variable. | expand

Commit Message

Roger Sayle July 7, 2022, 7:40 p.m. UTC
Hi Kewen (and Segher),
Many thanks for stress testing my patch to improve multiplication
by integer constants on rs6000 by using the rldmi instruction.
Although I've not been able to reproduce your ICE (using gcc135
on the compile farm), I completely agree with Segher's analysis
that the Achilles heel with my approach/patch is that there's
currently no way for the backend/recog to know that we're in a
pass after combine.

Rather than give up on this optimization (and a similar one for
I386.md where test;sete can be replaced by xor $1 when combine
knows that nonzero_bits is 1, but loses that information afterwards),
I thought I'd post this "strawman" proposal to add a combine_completed
global variable, matching the reload_completed and regstack_completed
global variables already used (to track progress) by the middle-end.

I was wondering if I could ask you could test the attached patch
in combination with my previous rs6000.md patch (with the obvious
change of reload_completed to combine_completed) to confirm
that it fixes the problems you were seeing.

Segher/Richard, would this sort of patch be considered acceptable?
Or is there a better approach/solution?


2022-07-07  Roger Sayle  <roger@nextmovesoftware.com>

gcc/ChangeLog
        * combine.cc (combine_completed): New global variable.
        (rest_of_handle_combine): Set combine_completed after pass.
        * final.cc (rest_of_clean_state): Reset combine_completed.
        * rtl.h (combine_completed): Prototype here.


Many thanks in advance,
Roger
--

> -----Original Message-----
> From: Kewen.Lin <linkw@linux.ibm.com>
> Sent: 27 June 2022 10:04
> To: Roger Sayle <roger@nextmovesoftware.com>
> Cc: gcc-patches@gcc.gnu.org; Segher Boessenkool
> <segher@kernel.crashing.org>; David Edelsohn <dje.gcc@gmail.com>
> Subject: Re: [rs6000 PATCH] Improve constant integer multiply using rldimi.
> 
> Hi Roger,
> 
> on 2022/6/27 04:56, Roger Sayle wrote:
> >
> >
> > This patch tweaks the code generated on POWER for integer
> > multiplications
> >
> > by a constant, by making use of rldimi instructions.  Much like x86's
> >
> > lea instruction, rldimi can be used to implement a shift and add pair
> >
> > in some circumstances.  For rldimi this is when the shifted operand
> >
> > is known to have no bits in common with the added operand.
> >
> >
> >
> > Hence for the new testcase below:
> >
> >
> >
> > int foo(int x)
> >
> > {
> >
> >   int t = x & 42;
> >
> >   return t * 0x2001;
> >
> > }
> >
> >
> >
> > when compiled with -O2, GCC currently generates:
> >
> >
> >
> >         andi. 3,3,0x2a
> >
> >         slwi 9,3,13
> >
> >         add 3,9,3
> >
> >         extsw 3,3
> >
> >         blr
> >
> >
> >
> > with this patch, we now generate:
> >
> >
> >
> >         andi. 3,3,0x2a
> >
> >         rlwimi 3,3,13,0,31-13
> >
> >         extsw 3,3
> >
> >         blr
> >
> >
> >
> > It turns out this optimization already exists in the form of a combine
> >
> > splitter in rs6000.md, but the constraints on combine splitters,
> >
> > requiring three of four input instructions (and generating one or two
> >
> > output instructions) mean it doesn't get applied as often as it could.
> >
> > This patch converts the define_split into a define_insn_and_split to
> >
> > catch more cases (such as the one above).
> >
> >
> >
> > The one bit that's tricky/controversial is the use of RTL's
> >
> > nonzero_bits which is accurate during the combine pass when this
> >
> > pattern is first recognized, but not as advanced (not kept up to
> >
> > date) when this pattern is eventually split.  To support this,
> >
> > I've used a "|| reload_completed" idiom.  Does this approach seem
> >
> > reasonable? [I've another patch of x86 that uses the same idiom].
> >
> >
> 
> I tested this patch on powerpc64-linux-gnu, it caused the below ICE against test
> case gcc/testsuite/gcc.c-torture/compile/pr93098.c.
> 
> gcc/testsuite/gcc.c-torture/compile/pr93098.c: In function ‘foo’:
> gcc/testsuite/gcc.c-torture/compile/pr93098.c:10:1: error: unrecognizable insn:
> (insn 104 32 34 2 (set (reg:SI 185 [+4 ])
>         (ior:SI (and:SI (reg:SI 200 [+4 ])
>                 (const_int 4294967295 [0xffffffff]))
>             (ashift:SI (reg:SI 140)
>                 (const_int 32 [0x20])))) "gcc/testsuite/gcc.c-
> torture/compile/pr93098.c":6:11 -1
>      (nil))
> during RTL pass: subreg3
> dump file: pr93098.c.291r.subreg3
> gcc/testsuite/gcc.c-torture/compile/pr93098.c:10:1: internal compiler error: in
> extract_insn, at recog.cc:2791 0x101f664b _fatal_insn(char const*, rtx_def
> const*, char const*, int, char const*)
>         gcc/rtl-error.cc:108
> 0x101f6697 _fatal_insn_not_found(rtx_def const*, char const*, int, char
> const*)
>         gcc/rtl-error.cc:116
> 0x10ae427f extract_insn(rtx_insn*)
>         gcc/recog.cc:2791
> 0x11b239bb decompose_multiword_subregs
>         gcc/lower-subreg.cc:1555
> 0x11b25013 execute
>         gcc/lower-subreg.cc:1818
> 
> The above trace shows we fails to recog the pattern again due to the inaccurate
> nonzero_bits information as you pointed out above.
> 
> There was another patch [1] which wasn't on trunk but touched this same
> define_split, not sure if that can help or we can follow the similar idea.
> 
> [1] https://gcc.gnu.org/pipermail/gcc-patches/2021-December/585841.html
> 
> BR,
> Kewen

Comments

Richard Biener July 8, 2022, 6:51 a.m. UTC | #1
On Thu, Jul 7, 2022 at 9:41 PM Roger Sayle <roger@nextmovesoftware.com> wrote:
>
>
> Hi Kewen (and Segher),
> Many thanks for stress testing my patch to improve multiplication
> by integer constants on rs6000 by using the rldmi instruction.
> Although I've not been able to reproduce your ICE (using gcc135
> on the compile farm), I completely agree with Segher's analysis
> that the Achilles heel with my approach/patch is that there's
> currently no way for the backend/recog to know that we're in a
> pass after combine.
>
> Rather than give up on this optimization (and a similar one for
> I386.md where test;sete can be replaced by xor $1 when combine
> knows that nonzero_bits is 1, but loses that information afterwards),
> I thought I'd post this "strawman" proposal to add a combine_completed
> global variable, matching the reload_completed and regstack_completed
> global variables already used (to track progress) by the middle-end.
>
> I was wondering if I could ask you could test the attached patch
> in combination with my previous rs6000.md patch (with the obvious
> change of reload_completed to combine_completed) to confirm
> that it fixes the problems you were seeing.
>
> Segher/Richard, would this sort of patch be considered acceptable?
> Or is there a better approach/solution?

Just to chime in - on the GIMPLE world we've transitioned to
using cfun->curr_properties for this (PROP_gimple_lvec, etc.)
so PROP_rtl_after_{combine,reload,regstack} would make this
state per function and avoids global variables (well, in exchange
for accessing cfun in the machine patterns of course).

The new variable/state needs some documentation.

A more functional state like can_create_pseudo_p () would be
more useful than 'combine_completed', but I haven't second-guessed
what exactly is no longer available after combine (some nonzero bits?
but aren't those only available _during_ combine?)

>
> 2022-07-07  Roger Sayle  <roger@nextmovesoftware.com>
>
> gcc/ChangeLog
>         * combine.cc (combine_completed): New global variable.
>         (rest_of_handle_combine): Set combine_completed after pass.
>         * final.cc (rest_of_clean_state): Reset combine_completed.
>         * rtl.h (combine_completed): Prototype here.
>
>
> Many thanks in advance,
> Roger
> --
>
> > -----Original Message-----
> > From: Kewen.Lin <linkw@linux.ibm.com>
> > Sent: 27 June 2022 10:04
> > To: Roger Sayle <roger@nextmovesoftware.com>
> > Cc: gcc-patches@gcc.gnu.org; Segher Boessenkool
> > <segher@kernel.crashing.org>; David Edelsohn <dje.gcc@gmail.com>
> > Subject: Re: [rs6000 PATCH] Improve constant integer multiply using rldimi.
> >
> > Hi Roger,
> >
> > on 2022/6/27 04:56, Roger Sayle wrote:
> > >
> > >
> > > This patch tweaks the code generated on POWER for integer
> > > multiplications
> > >
> > > by a constant, by making use of rldimi instructions.  Much like x86's
> > >
> > > lea instruction, rldimi can be used to implement a shift and add pair
> > >
> > > in some circumstances.  For rldimi this is when the shifted operand
> > >
> > > is known to have no bits in common with the added operand.
> > >
> > >
> > >
> > > Hence for the new testcase below:
> > >
> > >
> > >
> > > int foo(int x)
> > >
> > > {
> > >
> > >   int t = x & 42;
> > >
> > >   return t * 0x2001;
> > >
> > > }
> > >
> > >
> > >
> > > when compiled with -O2, GCC currently generates:
> > >
> > >
> > >
> > >         andi. 3,3,0x2a
> > >
> > >         slwi 9,3,13
> > >
> > >         add 3,9,3
> > >
> > >         extsw 3,3
> > >
> > >         blr
> > >
> > >
> > >
> > > with this patch, we now generate:
> > >
> > >
> > >
> > >         andi. 3,3,0x2a
> > >
> > >         rlwimi 3,3,13,0,31-13
> > >
> > >         extsw 3,3
> > >
> > >         blr
> > >
> > >
> > >
> > > It turns out this optimization already exists in the form of a combine
> > >
> > > splitter in rs6000.md, but the constraints on combine splitters,
> > >
> > > requiring three of four input instructions (and generating one or two
> > >
> > > output instructions) mean it doesn't get applied as often as it could.
> > >
> > > This patch converts the define_split into a define_insn_and_split to
> > >
> > > catch more cases (such as the one above).
> > >
> > >
> > >
> > > The one bit that's tricky/controversial is the use of RTL's
> > >
> > > nonzero_bits which is accurate during the combine pass when this
> > >
> > > pattern is first recognized, but not as advanced (not kept up to
> > >
> > > date) when this pattern is eventually split.  To support this,
> > >
> > > I've used a "|| reload_completed" idiom.  Does this approach seem
> > >
> > > reasonable? [I've another patch of x86 that uses the same idiom].
> > >
> > >
> >
> > I tested this patch on powerpc64-linux-gnu, it caused the below ICE against test
> > case gcc/testsuite/gcc.c-torture/compile/pr93098.c.
> >
> > gcc/testsuite/gcc.c-torture/compile/pr93098.c: In function ‘foo’:
> > gcc/testsuite/gcc.c-torture/compile/pr93098.c:10:1: error: unrecognizable insn:
> > (insn 104 32 34 2 (set (reg:SI 185 [+4 ])
> >         (ior:SI (and:SI (reg:SI 200 [+4 ])
> >                 (const_int 4294967295 [0xffffffff]))
> >             (ashift:SI (reg:SI 140)
> >                 (const_int 32 [0x20])))) "gcc/testsuite/gcc.c-
> > torture/compile/pr93098.c":6:11 -1
> >      (nil))
> > during RTL pass: subreg3
> > dump file: pr93098.c.291r.subreg3
> > gcc/testsuite/gcc.c-torture/compile/pr93098.c:10:1: internal compiler error: in
> > extract_insn, at recog.cc:2791 0x101f664b _fatal_insn(char const*, rtx_def
> > const*, char const*, int, char const*)
> >         gcc/rtl-error.cc:108
> > 0x101f6697 _fatal_insn_not_found(rtx_def const*, char const*, int, char
> > const*)
> >         gcc/rtl-error.cc:116
> > 0x10ae427f extract_insn(rtx_insn*)
> >         gcc/recog.cc:2791
> > 0x11b239bb decompose_multiword_subregs
> >         gcc/lower-subreg.cc:1555
> > 0x11b25013 execute
> >         gcc/lower-subreg.cc:1818
> >
> > The above trace shows we fails to recog the pattern again due to the inaccurate
> > nonzero_bits information as you pointed out above.
> >
> > There was another patch [1] which wasn't on trunk but touched this same
> > define_split, not sure if that can help or we can follow the similar idea.
> >
> > [1] https://gcc.gnu.org/pipermail/gcc-patches/2021-December/585841.html
> >
> > BR,
> > Kewen
Kewen.Lin July 8, 2022, 7:21 a.m. UTC | #2
Hi Roger,

on 2022/7/8 03:40, Roger Sayle wrote:
> 
> Hi Kewen (and Segher),
> Many thanks for stress testing my patch to improve multiplication
> by integer constants on rs6000 by using the rldmi instruction.
> Although I've not been able to reproduce your ICE (using gcc135
> on the compile farm), I completely agree with Segher's analysis
> that the Achilles heel with my approach/patch is that there's
> currently no way for the backend/recog to know that we're in a
> pass after combine.
> 

It's weird that it can't be reproduced on your side, did you try
with -m32 explicitly?  Sorry that I didn't say the used options
clearly in the previous reply, they are "-O2 -m32".

> Rather than give up on this optimization (and a similar one for
> I386.md where test;sete can be replaced by xor $1 when combine
> knows that nonzero_bits is 1, but loses that information afterwards),
> I thought I'd post this "strawman" proposal to add a combine_completed
> global variable, matching the reload_completed and regstack_completed
> global variables already used (to track progress) by the middle-end.
> 
> I was wondering if I could ask you could test the attached patch
> in combination with my previous rs6000.md patch (with the obvious
> change of reload_completed to combine_completed) to confirm
> that it fixes the problems you were seeing.

I just had a try, it still failed.  I checked the unrecognizable
pattern and the original patch, I guessed it needs a tiny adjustment
like below:

diff --git a/gcc/config/rs6000/rs6000.md b/gcc/config/rs6000/rs6000.md
index dde123e87b8..0a089f12510 100644
--- a/gcc/config/rs6000/rs6000.md
+++ b/gcc/config/rs6000/rs6000.md
@@ -4216,7 +4216,7 @@ (define_insn_and_split "*rotl<mode>3_insert_3b_<code>"
                      (match_operand:SI 2 "const_int_operand" "n"))
          (match_operand:GPR 3 "gpc_reg_operand" "0")))]
   "INTVAL (operands[2]) > 0
-   && INTVAL (operands[2]) < 64
+   && INTVAL (operands[2]) < GET_MODE_PRECISION (<MODE>mode)
    && ((nonzero_bits (operands[3], <MODE>mode)
        < HOST_WIDE_INT_1U << INTVAL (operands[2]))
        || combine_completed)"

the hardcoded value 64 is too big for SImode in the failure, it seems
we should use the mode precision instead?  I confirmed the failures are
gone with this proposal and the tiny change.

BR,
Kewen
Segher Boessenkool July 8, 2022, 8:23 p.m. UTC | #3
Hi!

On Thu, Jul 07, 2022 at 08:40:38PM +0100, Roger Sayle wrote:
> Although I've not been able to reproduce your ICE (using gcc135
> on the compile farm), I completely agree with Segher's analysis
> that the Achilles heel with my approach/patch is that there's
> currently no way for the backend/recog to know that we're in a
> pass after combine.

To know combine has been run at least once already, yes.  It currently
won't run more than one of course, but the only thing that really stops
running it more than once was historically that combine was one of the
more expensive passes.  Now otoh it is really quite cheap.

Currently combine is one of the last passes before RA.  It likely will
be nice to run combine quite early as well (or a limited version of it
that is), just like other optimisation passes already are, including its
closest relative fwprop.

> Rather than give up on this optimization (and a similar one for
> I386.md where test;sete can be replaced by xor $1 when combine
> knows that nonzero_bits is 1, but loses that information afterwards),
> I thought I'd post this "strawman" proposal to add a combine_completed
> global variable, matching the reload_completed and regstack_completed
> global variables already used (to track progress) by the middle-end.

It should be called differently, given the above.

> I was wondering if I could ask you could test the attached patch
> in combination with my previous rs6000.md patch (with the obvious
> change of reload_completed to combine_completed) to confirm
> that it fixes the problems you were seeing.
> 
> Segher/Richard, would this sort of patch be considered acceptable?
> Or is there a better approach/solution?

There is the thing I have been working on for about a year now, that
makes combine's nonzero_bits superfluous, by making the generic version
more capable than combine's version.  That is a better way forward imo,
it solves many other problems as well.  But it isn't ready yet.

Your patch is fine with a name change (and documentation change)
indicating it means that combine has run at least once.

Thanks!


Segher
diff mbox series

Patch

diff --git a/gcc/combine.cc b/gcc/combine.cc
index a5fabf3..9c87bf9 100644
--- a/gcc/combine.cc
+++ b/gcc/combine.cc
@@ -128,6 +128,10 @@  static rtx i2mod_old_rhs;
 /* When I2MOD is nonnull, this is a copy of the new right hand side.  */
 
 static rtx i2mod_new_rhs;
+
+/* Nonzero after end of combine pass.  */
+
+int combine_completed = 0;
 
 struct reg_stat_type {
   /* Record last point of death of (hard or pseudo) register n.  */
@@ -14991,6 +14995,7 @@  rest_of_handle_combine (void)
     }
 
   regstat_free_n_sets_and_refs ();
+  combine_completed = 1;
   return 0;
 }
 
diff --git a/gcc/final.cc b/gcc/final.cc
index 0352786..0fc2695 100644
--- a/gcc/final.cc
+++ b/gcc/final.cc
@@ -4513,6 +4513,7 @@  rest_of_clean_state (void)
   flag_rerun_cse_after_global_opts = 0;
   reload_completed = 0;
   epilogue_completed = 0;
+  combine_completed = 0;
 #ifdef STACK_REGS
   regstack_completed = 0;
 #endif
diff --git a/gcc/rtl.h b/gcc/rtl.h
index 488016b..3bb92bd 100644
--- a/gcc/rtl.h
+++ b/gcc/rtl.h
@@ -4105,6 +4105,9 @@  extern int reload_in_progress;
 /* Set to 1 while in lra.  */
 extern int lra_in_progress;
 
+/* Nonzero after end of combine pass.  */
+extern int combine_completed;
+
 /* This macro indicates whether you may create a new
    pseudo-register.  */