[0/6] Add a late-combine pass

Message ID 20240620133418.350772-1-richard.sandiford@arm.com

Message

Richard Sandiford June 20, 2024, 1:34 p.m. UTC
This series is a resubmission of the late-combine work.  I've fixed
some bugs that Jeff's cross-target CI found last time and some others
that I hit since then.

I've also removed a source of quadratic behaviour (oops!).  That in
turn drove some tweaks to the rtl-ssa scan routines.

The complexity of the new pass should be amortised O(n1 log(n2)), where
n1 is the total number of input operands in the function and n2 is the
number of instructions.  The log(n2) component comes from searching call
clobbers and is very much a worst case.  We therefore shouldn't need a
--param to limit the optimisation.
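
As an illustration of where a log(n2) factor can come from (this is not
code from the series): if each of the n1 operands needs at most one lookup
in an ordered collection of up to n2 call positions, the total is
O(n1 log(n2)).  The sketch below uses a binary search over a sorted vector
as a stand-in.  The function name and the integer "positions" are
hypothetical, and rtl-ssa's real data structures are not modelled here.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Toy model only: call_points holds the positions of call insns that
// clobber the register of interest, sorted in ascending order.
// Returns true if such a call lies strictly between def and use.
bool call_clobber_between(const std::vector<int> &call_points,
                          int def, int use) {
    // Find the first call positioned after the definition.  This is the
    // O(log n2) step: a binary search over at most n2 positions.
    auto it = std::upper_bound(call_points.begin(), call_points.end(), def);
    return it != call_points.end() && *it < use;
}
```

Running this query once per input operand gives the n1 factor; the search
only matters when there are many clobbering calls, which is why the
log(n2) term is described as a worst case.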

I think the main comment from last time was that we should enable
the pass by default on most targets.  If there is a known reason
why the pass doesn't work on a particular target, we should default
to off for that specific target and file a bug to track the problem.

The only targets that I know need to be handled in this way are
i386, rs6000 and xtensa.  See the covering note in the last patch
for details.  If the series is OK, I'll file PRs for those targets
after pushing the patches.
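
The "on by default, off for specific targets" policy can be sketched as
follows.  This is a self-contained toy model, not the code from the last
patch: the Options struct, the Target enum and both function names are
hypothetical, and GCC's real option machinery works differently.

```cpp
#include <cassert>

// Toy option state; not GCC's real option handling.
struct Options {
    bool late_combine = true;       // generic default: pass enabled
    bool late_combine_set = false;  // true if the user chose explicitly
};

enum class Target { AArch64, I386, RS6000, Xtensa };

// The targets named above as needing default-off for now.
bool target_defaults_off(Target t) {
    return t == Target::I386 || t == Target::RS6000 || t == Target::Xtensa;
}

// Apply the target's default only when the user has not made a choice,
// so an explicit command-line setting always wins.
void override_options(Options &opts, Target t) {
    if (!opts.late_combine_set && target_defaults_off(t))
        opts.late_combine = false;
}
```

The point of checking the "explicitly set" bit is that users on a
default-off target can still enable the pass by hand to help track down
the underlying problem.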

Tested on aarch64-linux-gnu and x86_64-linux-gnu (somewhat of a
token gesture given the default-off for x86_64).  Also tested by
compiling one target per CPU directory and comparing the assembly output
for parts of the GCC testsuite.  This is just a way of getting a flavour
of how the pass performs; it obviously isn't a meaningful benchmark.
All targets seemed to improve on average, as described in the covering
note to the last patch.

The original motivation for the pass was to fix things like PR106594.
However, it also helps to reclaim some of the optimisations that
were lost in r15-268.  Please let me know if there are some cases
that the pass fails to reclaim.

The series depends on Gui Haochen's insn_cost fix.

OK to install?

Thanks to Jeff for the help with testing the series.

Richard


Richard Sandiford (6):
  rtl-ssa: Rework _ignoring interfaces
  rtl-ssa: Don't cost no-op moves
  iq2000: Fix test and branch instructions
  sh: Make *minus_plus_one work after RA
  xstormy16: Fix xs_hi_nonmemory_operand
  Add a late-combine pass [PR106594]

 gcc/Makefile.in                               |   1 +
 gcc/common.opt                                |   5 +
 gcc/config/aarch64/aarch64-cc-fusion.cc       |   4 +-
 gcc/config/i386/i386-options.cc               |   4 +
 gcc/config/iq2000/iq2000.cc                   |   2 +-
 gcc/config/iq2000/iq2000.md                   |   4 +-
 gcc/config/rs6000/rs6000.cc                   |   8 +
 gcc/config/sh/sh.md                           |   6 +-
 gcc/config/stormy16/predicates.md             |   2 +-
 gcc/config/xtensa/xtensa.cc                   |  11 +
 gcc/doc/invoke.texi                           |  11 +-
 gcc/doc/rtl.texi                              |  14 +-
 gcc/late-combine.cc                           | 747 ++++++++++++++++++
 gcc/opts.cc                                   |   1 +
 gcc/pair-fusion.cc                            |  34 +-
 gcc/passes.def                                |   2 +
 gcc/rtl-ssa.h                                 |   1 +
 gcc/rtl-ssa/access-utils.h                    | 145 ++--
 gcc/rtl-ssa/change-utils.h                    |  67 +-
 gcc/rtl-ssa/changes.cc                        |   6 +-
 gcc/rtl-ssa/changes.h                         |  13 -
 gcc/rtl-ssa/functions.h                       |  16 +-
 gcc/rtl-ssa/insn-utils.h                      |   8 -
 gcc/rtl-ssa/insns.cc                          |   7 +-
 gcc/rtl-ssa/insns.h                           |  12 -
 gcc/rtl-ssa/member-fns.inl                    |  35 +-
 gcc/rtl-ssa/movement.h                        | 118 ++-
 gcc/rtl-ssa/predicates.h                      |  58 ++
 gcc/testsuite/gcc.dg/ira-shrinkwrap-prep-1.c  |   2 +-
 gcc/testsuite/gcc.dg/ira-shrinkwrap-prep-2.c  |   2 +-
 gcc/testsuite/gcc.dg/stack-check-4.c          |   2 +-
 .../aarch64/bitfield-bitint-abi-align16.c     |   2 +-
 .../aarch64/bitfield-bitint-abi-align8.c      |   2 +-
 gcc/testsuite/gcc.target/aarch64/pr106594_1.c |  20 +
 .../gcc.target/aarch64/sve/cond_asrd_3.c      |  10 +-
 .../gcc.target/aarch64/sve/cond_convert_3.c   |   8 +-
 .../gcc.target/aarch64/sve/cond_convert_6.c   |   8 +-
 .../gcc.target/aarch64/sve/cond_fabd_5.c      |  11 +-
 .../gcc.target/aarch64/sve/cond_unary_4.c     |  13 +-
 gcc/tree-pass.h                               |   1 +
 40 files changed, 1127 insertions(+), 296 deletions(-)
 create mode 100644 gcc/late-combine.cc
 create mode 100644 gcc/rtl-ssa/predicates.h
 create mode 100644 gcc/testsuite/gcc.target/aarch64/pr106594_1.c

Comments

Xi Ruoyao June 28, 2024, 12:25 p.m. UTC | #1
Hi Richard,

The late combine pass has triggered some FAILs on LoongArch and I'm
investigating.  One of them is movcf2gr-via-fr.c.  In 315r.postreload:

(insn 22 7 24 2 (set (reg:FCC 32 $f0 [87])
        (reg:FCC 64 $fcc0 [87])) "../gcc/gcc/testsuite/gcc.target/loongarch/movcf2gr-via-fr.c":9:12 168 {movfcc_internal}
     (nil))
(insn 24 22 8 2 (set (reg:FCC 4 $r4 [88])
        (reg:FCC 32 $f0 [87])) "../gcc/gcc/testsuite/gcc.target/loongarch/movcf2gr-via-fr.c":9:12 168 {movfcc_internal}
     (nil))

The late combine pass combines these to:

(insn 24 7 8 2 (set (reg:FCC 4 $r4 [88])
        (reg:FCC 64 $fcc0 [87])) "../gcc/gcc/testsuite/gcc.target/loongarch/movcf2gr-via-fr.c":9:12 168 {movfcc_internal}
     (nil))

But we are using an FPR ($f0) here deliberately to work around an
architectural issue in LA464 that makes a direct FCC-to-GPR move very slow.

Could you suggest how to fix this issue?

On Thu, 2024-06-20 at 14:34 +0100, Richard Sandiford wrote:
> This series is a resubmission of the late-combine work.  I've fixed
> some bugs that Jeff's cross-target CI found last time and some others
> that I hit since then.

/* snip */
Lulu Cheng June 28, 2024, 12:34 p.m. UTC | #2
On 2024/6/28 at 8:25 PM, Xi Ruoyao wrote:
> Hi Richard,
>
> The late combine pass has triggered some FAILs on LoongArch and I'm
> investigating.  One of them is movcf2gr-via-fr.c.  In 315r.postreload:
>
> (insn 22 7 24 2 (set (reg:FCC 32 $f0 [87])
>          (reg:FCC 64 $fcc0 [87])) "../gcc/gcc/testsuite/gcc.target/loongarch/movcf2gr-via-fr.c":9:12 168 {movfcc_internal}
>       (nil))
> (insn 24 22 8 2 (set (reg:FCC 4 $r4 [88])
>          (reg:FCC 32 $f0 [87])) "../gcc/gcc/testsuite/gcc.target/loongarch/movcf2gr-via-fr.c":9:12 168 {movfcc_internal}
>       (nil))
>
> The late combine pass combines these to:
>
> (insn 24 7 8 2 (set (reg:FCC 4 $r4 [88])
>          (reg:FCC 64 $fcc0 [87])) "../gcc/gcc/testsuite/gcc.target/loongarch/movcf2gr-via-fr.c":9:12 168 {movfcc_internal}
>       (nil))
>
> But we are using an FPR ($f0) here deliberately to work around an
> architectural issue in LA464 that makes a direct FCC-to-GPR move very slow.
>
> Could you suggest how to fix this issue?

Hi, Ruoyao:

We need to define TARGET_INSN_COST and set the cost of movcf2gr/movgr2cf.

I've fixed this and am doing correctness testing now.
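
To show why an insn cost can stop this combination, here is a
self-contained toy model.  It is not the LoongArch fix and not GCC code:
the RegClass and Move types, both function names and the cost numbers are
all hypothetical.  It assumes only that a replacement is accepted when
its cost does not exceed the summed cost of the instructions it replaces.

```cpp
#include <cassert>
#include <vector>

// Toy model only; these are not GCC's RTL types.
enum class RegClass { GPR, FPR, FCC };

struct Move {
    RegClass dest;
    RegClass src;
};

// Stand-in for a target insn-cost hook.  A direct FCC<->GPR move gets a
// high cost to reflect that it is slow on the modelled core.
int insn_cost(const Move &m) {
    bool direct_fcc_gpr =
        (m.dest == RegClass::GPR && m.src == RegClass::FCC) ||
        (m.dest == RegClass::FCC && m.src == RegClass::GPR);
    return direct_fcc_gpr ? 12 : 4;  // assumed costs, arbitrary units
}

// Stand-in for a combine-style profitability check: accept the
// replacement only if it costs no more than the original sequence.
bool replacement_is_profitable(const std::vector<Move> &before,
                               const Move &after) {
    int old_cost = 0;
    for (const Move &m : before)
        old_cost += insn_cost(m);
    return insn_cost(after) <= old_cost;
}
```

With these assumed numbers, the FCC-to-FPR and FPR-to-GPR pair costs
4 + 4 = 8, while the single FCC-to-GPR move costs 12, so the model
rejects the combination.  If every move had the same cost, the single
move would always look cheaper.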

>
> On Thu, 2024-06-20 at 14:34 +0100, Richard Sandiford wrote:
>> This series is a resubmission of the late-combine work.  I've fixed
>> some bugs that Jeff's cross-target CI found last time and some others
>> that I hit since then.
> /* snip */
>
Xi Ruoyao June 28, 2024, 12:35 p.m. UTC | #3
On Fri, 2024-06-28 at 20:34 +0800, chenglulu wrote:
> 
> On 2024/6/28 at 8:25 PM, Xi Ruoyao wrote:
> > Hi Richard,
> > 
> > The late combine pass has triggered some FAILs on LoongArch and I'm
> > investigating.  One of them is movcf2gr-via-fr.c.  In
> > 315r.postreload:
> > 
> > (insn 22 7 24 2 (set (reg:FCC 32 $f0 [87])
> >          (reg:FCC 64 $fcc0 [87]))
> > "../gcc/gcc/testsuite/gcc.target/loongarch/movcf2gr-via-fr.c":9:12
> > 168 {movfcc_internal}
> >       (nil))
> > (insn 24 22 8 2 (set (reg:FCC 4 $r4 [88])
> >          (reg:FCC 32 $f0 [87]))
> > "../gcc/gcc/testsuite/gcc.target/loongarch/movcf2gr-via-fr.c":9:12
> > 168 {movfcc_internal}
> >       (nil))
> > 
> > The late combine pass combines these to:
> > 
> > (insn 24 7 8 2 (set (reg:FCC 4 $r4 [88])
> >          (reg:FCC 64 $fcc0 [87]))
> > "../gcc/gcc/testsuite/gcc.target/loongarch/movcf2gr-via-fr.c":9:12
> > 168 {movfcc_internal}
> >       (nil))
> > 
> > But we are using an FPR ($f0) here deliberately to work around an
> > architectural issue in LA464 that makes a direct FCC-to-GPR move very
> > slow.
> > 
> > Could you suggest how to fix this issue?
> 
> Hi, Ruoyao:
> 
> We need to define TARGET_INSN_COST and set the cost of
> movcf2gr/movgr2cf.
> 
> I've fixed this and am doing correctness testing now.

Ah thanks!  So it uses insn cost instead of rtx cost and I didn't
realize.
Lulu Cheng June 28, 2024, 12:44 p.m. UTC | #4
On 2024/6/28 at 8:35 PM, Xi Ruoyao wrote:
> On Fri, 2024-06-28 at 20:34 +0800, chenglulu wrote:
>> On 2024/6/28 at 8:25 PM, Xi Ruoyao wrote:
>>> Hi Richard,
>>>
>>> The late combine pass has triggered some FAILs on LoongArch and I'm
>>> investigating.  One of them is movcf2gr-via-fr.c.  In
>>> 315r.postreload:
>>>
>>> (insn 22 7 24 2 (set (reg:FCC 32 $f0 [87])
>>>           (reg:FCC 64 $fcc0 [87]))
>>> "../gcc/gcc/testsuite/gcc.target/loongarch/movcf2gr-via-fr.c":9:12
>>> 168 {movfcc_internal}
>>>        (nil))
>>> (insn 24 22 8 2 (set (reg:FCC 4 $r4 [88])
>>>           (reg:FCC 32 $f0 [87]))
>>> "../gcc/gcc/testsuite/gcc.target/loongarch/movcf2gr-via-fr.c":9:12
>>> 168 {movfcc_internal}
>>>        (nil))
>>>
>>> The late combine pass combines these to:
>>>
>>> (insn 24 7 8 2 (set (reg:FCC 4 $r4 [88])
>>>           (reg:FCC 64 $fcc0 [87]))
>>> "../gcc/gcc/testsuite/gcc.target/loongarch/movcf2gr-via-fr.c":9:12
>>> 168 {movfcc_internal}
>>>        (nil))
>>>
>>> But we are using an FPR ($f0) here deliberately to work around an
>>> architectural issue in LA464 that makes a direct FCC-to-GPR move very
>>> slow.
>>>
>>> Could you suggest how to fix this issue?
>> Hi, Ruoyao:
>>
>> We need to define TARGET_INSN_COST and set the cost of
>> movcf2gr/movgr2cf.
>>
>> I've fixed this and am doing correctness testing now.
> Ah thanks!  So it uses insn cost instead of rtx cost and I didn't
> realize.
>
>
That's right. :-D