Message ID | 001a01d89239$713109e0$53931da0$@nextmovesoftware.com |
---|---|
State | New |
Headers | show |
Series | [PATCH/RFC] combine_completed global variable. | expand |
On Thu, Jul 7, 2022 at 9:41 PM Roger Sayle <roger@nextmovesoftware.com> wrote: > > > Hi Kewen (and Segher), > Many thanks for stress testing my patch to improve multiplication > by integer constants on rs6000 by using the rldmi instruction. > Although I've not been able to reproduce your ICE (using gcc135 > on the compile farm), I completely agree with Segher's analysis > that the Achilles heel with my approach/patch is that there's > currently no way for the backend/recog to know that we're in a > pass after combine. > > Rather than give up on this optimization (and a similar one for > I386.md where test;sete can be replaced by xor $1 when combine > knows that nonzero_bits is 1, but loses that information afterwards), > I thought I'd post this "strawman" proposal to add a combine_completed > global variable, matching the reload_completed and regstack_completed > global variables already used (to track progress) by the middle-end. > > I was wondering if I could ask you could test the attached patch > in combination with my previous rs6000.md patch (with the obvious > change of reload_completed to combine_completed) to confirm > that it fixes the problems you were seeing. > > Segher/Richard, would this sort of patch be considered acceptable? > Or is there a better approach/solution? Just to chime in - on the GIMPLE world we've transitioned to using cfun->curr_properties for this (PROP_gimple_lvec, etc.) so PROP_rtl_after_{combine,reload,regstack} would make this state per function and avoids global variables (well, in exchange for accessing cfun in the machine patterns of course). The new variable/state needs some documentation. A more functional state like can_create_pseudo_p () would be more useful than 'combine_completed', but I haven't second-guessed what exactly is no longer available after combine (some nonzero bits? but aren't those only available _during_ combine?) > > 2022-07-07 Roger Sayle <roger@nextmovesoftware.com> > > gcc/ChangeLog > * combine.cc (combine_completed): New global variable. > (rest_of_handle_combine): Set combine_completed after pass. > * final.cc (rest_of_clean_state): Reset combine_completed. > * rtl.h (combine_completed): Prototype here. > > > Many thanks in advance, > Roger > -- > > > -----Original Message----- > > From: Kewen.Lin <linkw@linux.ibm.com> > > Sent: 27 June 2022 10:04 > > To: Roger Sayle <roger@nextmovesoftware.com> > > Cc: gcc-patches@gcc.gnu.org; Segher Boessenkool > > <segher@kernel.crashing.org>; David Edelsohn <dje.gcc@gmail.com> > > Subject: Re: [rs6000 PATCH] Improve constant integer multiply using rldimi. > > > > Hi Roger, > > > > on 2022/6/27 04:56, Roger Sayle wrote: > > > > > > > > > This patch tweaks the code generated on POWER for integer > > > multiplications > > > > > > by a constant, by making use of rldimi instructions. Much like x86's > > > > > > lea instruction, rldimi can be used to implement a shift and add pair > > > > > > in some circumstances. For rldimi this is when the shifted operand > > > > > > is known to have no bits in common with the added operand. > > > > > > > > > > > > Hence for the new testcase below: > > > > > > > > > > > > int foo(int x) > > > > > > { > > > > > > int t = x & 42; > > > > > > return t * 0x2001; > > > > > > } > > > > > > > > > > > > when compiled with -O2, GCC currently generates: > > > > > > > > > > > > andi. 3,3,0x2a > > > > > > slwi 9,3,13 > > > > > > add 3,9,3 > > > > > > extsw 3,3 > > > > > > blr > > > > > > > > > > > > with this patch, we now generate: > > > > > > > > > > > > andi. 3,3,0x2a > > > > > > rlwimi 3,3,13,0,31-13 > > > > > > extsw 3,3 > > > > > > blr > > > > > > > > > > > > It turns out this optimization already exists in the form of a combine > > > > > > splitter in rs6000.md, but the constraints on combine splitters, > > > > > > requiring three of four input instructions (and generating one or two > > > > > > output instructions) mean it doesn't get applied as often as it could. > > > > > > This patch converts the define_split into a define_insn_and_split to > > > > > > catch more cases (such as the one above). > > > > > > > > > > > > The one bit that's tricky/controversial is the use of RTL's > > > > > > nonzero_bits which is accurate during the combine pass when this > > > > > > pattern is first recognized, but not as advanced (not kept up to > > > > > > date) when this pattern is eventually split. To support this, > > > > > > I've used a "|| reload_completed" idiom. Does this approach seem > > > > > > reasonable? [I've another patch of x86 that uses the same idiom]. > > > > > > > > > > I tested this patch on powerpc64-linux-gnu, it caused the below ICE against test > > case gcc/testsuite/gcc.c-torture/compile/pr93098.c. > > > > gcc/testsuite/gcc.c-torture/compile/pr93098.c: In function ‘foo’: > > gcc/testsuite/gcc.c-torture/compile/pr93098.c:10:1: error: unrecognizable insn: > > (insn 104 32 34 2 (set (reg:SI 185 [+4 ]) > > (ior:SI (and:SI (reg:SI 200 [+4 ]) > > (const_int 4294967295 [0xffffffff])) > > (ashift:SI (reg:SI 140) > > (const_int 32 [0x20])))) "gcc/testsuite/gcc.c- > > torture/compile/pr93098.c":6:11 -1 > > (nil)) > > during RTL pass: subreg3 > > dump file: pr93098.c.291r.subreg3 > > gcc/testsuite/gcc.c-torture/compile/pr93098.c:10:1: internal compiler error: in > > extract_insn, at recog.cc:2791 0x101f664b _fatal_insn(char const*, rtx_def > > const*, char const*, int, char const*) > > gcc/rtl-error.cc:108 > > 0x101f6697 _fatal_insn_not_found(rtx_def const*, char const*, int, char > > const*) > > gcc/rtl-error.cc:116 > > 0x10ae427f extract_insn(rtx_insn*) > > gcc/recog.cc:2791 > > 0x11b239bb decompose_multiword_subregs > > gcc/lower-subreg.cc:1555 > > 0x11b25013 execute > > gcc/lower-subreg.cc:1818 > > > > The above trace shows we fails to recog the pattern again due to the inaccurate > > nonzero_bits information as you pointed out above. > > > > There was another patch [1] which wasn't on trunk but touched this same > > define_split, not sure if that can help or we can follow the similar idea. > > > > [1] https://gcc.gnu.org/pipermail/gcc-patches/2021-December/585841.html > > > > BR, > > Kewen
Hi Roger, on 2022/7/8 03:40, Roger Sayle wrote: > > Hi Kewen (and Segher), > Many thanks for stress testing my patch to improve multiplication > by integer constants on rs6000 by using the rldmi instruction. > Although I've not been able to reproduce your ICE (using gcc135 > on the compile farm), I completely agree with Segher's analysis > that the Achilles heel with my approach/patch is that there's > currently no way for the backend/recog to know that we're in a > pass after combine. > It's weird that it can't be reproduced on your side, did you try with -m32 explicitly? Sorry that I didn't say the used options clearly in the previous reply, they are "-O2 -m32". > Rather than give up on this optimization (and a similar one for > I386.md where test;sete can be replaced by xor $1 when combine > knows that nonzero_bits is 1, but loses that information afterwards), > I thought I'd post this "strawman" proposal to add a combine_completed > global variable, matching the reload_completed and regstack_completed > global variables already used (to track progress) by the middle-end. > > I was wondering if I could ask you could test the attached patch > in combination with my previous rs6000.md patch (with the obvious > change of reload_completed to combine_completed) to confirm > that it fixes the problems you were seeing. I just had a try, it still failed. I checked the unrecognizable pattern and the original patch, I guessed it needs a tiny adjustment like below: diff --git a/gcc/config/rs6000/rs6000.md b/gcc/config/rs6000/rs6000.md index dde123e87b8..0a089f12510 100644 --- a/gcc/config/rs6000/rs6000.md +++ b/gcc/config/rs6000/rs6000.md @@ -4216,7 +4216,7 @@ (define_insn_and_split "*rotl<mode>3_insert_3b_<code>" (match_operand:SI 2 "const_int_operand" "n")) (match_operand:GPR 3 "gpc_reg_operand" "0")))] "INTVAL (operands[2]) > 0 - && INTVAL (operands[2]) < 64 + && INTVAL (operands[2]) < GET_MODE_PRECISION (<MODE>mode) && ((nonzero_bits (operands[3], <MODE>mode) < HOST_WIDE_INT_1U << INTVAL (operands[2])) || combine_completed)" the hardcoded value 64 is too big for SImode in the failure, it seems we should use the mode precision instead? I confirmed the failures are gone with this proposal and the tiny change. BR, Kewen
Hi! On Thu, Jul 07, 2022 at 08:40:38PM +0100, Roger Sayle wrote: > Although I've not been able to reproduce your ICE (using gcc135 > on the compile farm), I completely agree with Segher's analysis > that the Achilles heel with my approach/patch is that there's > currently no way for the backend/recog to know that we're in a > pass after combine. To know combine has been run at least once already, yes. It currently won't run more than one of course, but the only thing that really stops running it more than once was historically that combine was one of the more expensive passes. Now otoh it is really quite cheap. Currently combine is one of the last passes before RA. It likely will be nice to run combine quite early as well (or a limited version of it that is), just like other optimisation passes already are, including its closest relative fwprop. > Rather than give up on this optimization (and a similar one for > I386.md where test;sete can be replaced by xor $1 when combine > knows that nonzero_bits is 1, but loses that information afterwards), > I thought I'd post this "strawman" proposal to add a combine_completed > global variable, matching the reload_completed and regstack_completed > global variables already used (to track progress) by the middle-end. It should be called differently, given the above. > I was wondering if I could ask you could test the attached patch > in combination with my previous rs6000.md patch (with the obvious > change of reload_completed to combine_completed) to confirm > that it fixes the problems you were seeing. > > Segher/Richard, would this sort of patch be considered acceptable? > Or is there a better approach/solution? There is the thing I have been working on for about a year now, that makes combine's nonzero_bits superfluous, by making the generic version more capable than combine's version. That is a better way forward imo, it solves many other problems as well. But it isn't ready yet. Your patch is fine with a name change (and documentation change) indicating it means that combine has run at least once. Thanks! Segher
diff --git a/gcc/combine.cc b/gcc/combine.cc index a5fabf3..9c87bf9 100644 --- a/gcc/combine.cc +++ b/gcc/combine.cc @@ -128,6 +128,10 @@ static rtx i2mod_old_rhs; /* When I2MOD is nonnull, this is a copy of the new right hand side. */ static rtx i2mod_new_rhs; + +/* Nonzero after end of combine pass. */ + +int combine_completed = 0; struct reg_stat_type { /* Record last point of death of (hard or pseudo) register n. */ @@ -14991,6 +14995,7 @@ rest_of_handle_combine (void) } regstat_free_n_sets_and_refs (); + combine_completed = 1; return 0; } diff --git a/gcc/final.cc b/gcc/final.cc index 0352786..0fc2695 100644 --- a/gcc/final.cc +++ b/gcc/final.cc @@ -4513,6 +4513,7 @@ rest_of_clean_state (void) flag_rerun_cse_after_global_opts = 0; reload_completed = 0; epilogue_completed = 0; + combine_completed = 0; #ifdef STACK_REGS regstack_completed = 0; #endif diff --git a/gcc/rtl.h b/gcc/rtl.h index 488016b..3bb92bd 100644 --- a/gcc/rtl.h +++ b/gcc/rtl.h @@ -4105,6 +4105,9 @@ extern int reload_in_progress; /* Set to 1 while in lra. */ extern int lra_in_progress; +/* Nonzero after end of combine pass. */ +extern int combine_completed; + /* This macro indicates whether you may create a new pseudo-register. */