[0/6] Add a late-combine pass


Message ID

20240620133418.350772-1-richard.sandiford@arm.com

Headers

DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org D2C1E389245F
From: Richard Sandiford <richard.sandiford@arm.com>
To: jlaw@ventanamicro.com,
	gcc-patches@gcc.gnu.org
Cc: Richard Sandiford <richard.sandiford@arm.com>
Subject: [PATCH 0/6] Add a late-combine pass
Date: Thu, 20 Jun 2024 14:34:12 +0100
Message-Id: <20240620133418.350772-1-richard.sandiford@arm.com>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Precedence: list
Errors-To: gcc-patches-bounces+incoming=patchwork.ozlabs.org@gcc.gnu.org

Message

Richard Sandiford June 20, 2024, 1:34 p.m. UTC

This series is a resubmission of the late-combine work.  I've fixed
some bugs that Jeff's cross-target CI found last time and some others
that I hit since then.

I've also removed a source of quadraticness (oops!).  Doing that
in turn drove some tweaks to the rtl-ssa scan routines.

The complexity of the new pass should be amortised O(n1 log(n2)), where
n1 is the total number of input operands in the function and n2 is the
number of instructions.  The log(n2) component comes from searching call
clobbers and is very much a worst case.  We therefore shouldn't need a
--param to limit the optimisation.
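As a rough illustration of where a log(n2) factor can come from, the sketch below answers "is there a call between this definition and this use?" with one binary search per input operand. It is a toy model under stated assumptions: the names `call_between_p`, `operand_range` and `count_operands_crossing_calls` are hypothetical, instructions are reduced to integer positions, and a sorted vector stands in for whatever structure rtl-ssa really uses to record call clobbers.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Toy model only: instructions are numbered 0..n2-1 and CALL_POINTS holds
// the sorted positions of the call instructions.  This is not the rtl-ssa
// representation; it just shows one source of a log(n2) factor.
bool
call_between_p (const std::vector<int> &call_points, int def_point,
                int use_point)
{
  // Binary search for the first call strictly after DEF_POINT.
  auto it = std::upper_bound (call_points.begin (), call_points.end (),
                              def_point);
  // A clobber matters only if that call comes before the use.
  return it != call_points.end () && *it < use_point;
}

// Hypothetical description of one input operand: where its value is
// defined and where it is used.
struct operand_range
{
  int def_point;
  int use_point;
};

// One query per input operand.  With n1 operands and at most n2 calls,
// the loop costs O(n1 log(n2)) in total.
std::size_t
count_operands_crossing_calls (const std::vector<int> &call_points,
                               const std::vector<operand_range> &operands)
{
  std::size_t count = 0;
  for (const operand_range &op : operands)
    if (call_between_p (call_points, op.def_point, op.use_point))
      ++count;
  return count;
}
```

In a function with few calls the search is effectively constant time, which matches the description of log(n2) as a worst case.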

I think the main comment from last time was that we should enable
the pass by default on most targets.  If there is a known reason
why the pass doesn't work on a particular target, we should default
to off for that specific target and file a bug to track the problem.

The only targets that I know need to be handled in this way are
i386, rs6000 and xtensa.  See the covering note in the last patch
for details.  If the series is OK, I'll file PRs for those targets
after pushing the patches.
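The "on by default, off for specific targets" policy can be sketched as follows. This is a self-contained model rather than GCC code: `pass_option`, `target_kind` and `late_combine_option_for` are hypothetical names, and GCC records "did the user set this option?" through its own option machinery rather than `std::optional`. The point it illustrates is that a target override should change only the default, so that an explicit user choice still wins.

```cpp
#include <optional>

// Hypothetical stand-in for an option that remembers whether the user
// set it explicitly.  The names here are invented for illustration.
struct pass_option
{
  bool default_value;
  std::optional<bool> user_value;

  // An explicit user setting takes priority over any default.
  bool enabled () const { return user_value.value_or (default_value); }
};

// Two kinds of target: one with no known issue, and one with a known
// reason for the pass to be off until a tracked bug is fixed.
enum class target_kind { generic, needs_default_off };

// Start from the global default (on), then let an affected target flip
// only the default value.
pass_option
late_combine_option_for (target_kind target, std::optional<bool> user_value)
{
  pass_option opt { true, user_value };
  if (target == target_kind::needs_default_off)
    opt.default_value = false;
  return opt;
}
```

Under this model, someone investigating one of the filed PRs can still turn the pass on explicitly for an affected target.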

Tested on aarch64-linux-gnu and x86_64-linux-gnu (somewhat of a
token gesture given the default-off for x86_64).  Also tested by
compiling one target per CPU directory and comparing the assembly output
for parts of the GCC testsuite.  This is just a way of getting a flavour
of how the pass performs; it obviously isn't a meaningful benchmark.
All targets seemed to improve on average, as described in the covering
note to the last patch.

The original motivation for the pass was to fix things like PR106594.
However, it also helps to reclaim some of the optimisations that
were lost in r15-268.  Please let me know if there are some cases
that the pass fails to reclaim.

The series depends on Gui Haochen's insn_cost fix.

OK to install?

Thanks to Jeff for the help with testing the series.

Richard


Richard Sandiford (6):
  rtl-ssa: Rework _ignoring interfaces
  rtl-ssa: Don't cost no-op moves
  iq2000: Fix test and branch instructions
  sh: Make *minus_plus_one work after RA
  xstormy16: Fix xs_hi_nonmemory_operand
  Add a late-combine pass [PR106594]

 gcc/Makefile.in                               |   1 +
 gcc/common.opt                                |   5 +
 gcc/config/aarch64/aarch64-cc-fusion.cc       |   4 +-
 gcc/config/i386/i386-options.cc               |   4 +
 gcc/config/iq2000/iq2000.cc                   |   2 +-
 gcc/config/iq2000/iq2000.md                   |   4 +-
 gcc/config/rs6000/rs6000.cc                   |   8 +
 gcc/config/sh/sh.md                           |   6 +-
 gcc/config/stormy16/predicates.md             |   2 +-
 gcc/config/xtensa/xtensa.cc                   |  11 +
 gcc/doc/invoke.texi                           |  11 +-
 gcc/doc/rtl.texi                              |  14 +-
 gcc/late-combine.cc                           | 747 ++++++++++++++++++
 gcc/opts.cc                                   |   1 +
 gcc/pair-fusion.cc                            |  34 +-
 gcc/passes.def                                |   2 +
 gcc/rtl-ssa.h                                 |   1 +
 gcc/rtl-ssa/access-utils.h                    | 145 ++--
 gcc/rtl-ssa/change-utils.h                    |  67 +-
 gcc/rtl-ssa/changes.cc                        |   6 +-
 gcc/rtl-ssa/changes.h                         |  13 -
 gcc/rtl-ssa/functions.h                       |  16 +-
 gcc/rtl-ssa/insn-utils.h                      |   8 -
 gcc/rtl-ssa/insns.cc                          |   7 +-
 gcc/rtl-ssa/insns.h                           |  12 -
 gcc/rtl-ssa/member-fns.inl                    |  35 +-
 gcc/rtl-ssa/movement.h                        | 118 ++-
 gcc/rtl-ssa/predicates.h                      |  58 ++
 gcc/testsuite/gcc.dg/ira-shrinkwrap-prep-1.c  |   2 +-
 gcc/testsuite/gcc.dg/ira-shrinkwrap-prep-2.c  |   2 +-
 gcc/testsuite/gcc.dg/stack-check-4.c          |   2 +-
 .../aarch64/bitfield-bitint-abi-align16.c     |   2 +-
 .../aarch64/bitfield-bitint-abi-align8.c      |   2 +-
 gcc/testsuite/gcc.target/aarch64/pr106594_1.c |  20 +
 .../gcc.target/aarch64/sve/cond_asrd_3.c      |  10 +-
 .../gcc.target/aarch64/sve/cond_convert_3.c   |   8 +-
 .../gcc.target/aarch64/sve/cond_convert_6.c   |   8 +-
 .../gcc.target/aarch64/sve/cond_fabd_5.c      |  11 +-
 .../gcc.target/aarch64/sve/cond_unary_4.c     |  13 +-
 gcc/tree-pass.h                               |   1 +
 40 files changed, 1127 insertions(+), 296 deletions(-)
 create mode 100644 gcc/late-combine.cc
 create mode 100644 gcc/rtl-ssa/predicates.h
 create mode 100644 gcc/testsuite/gcc.target/aarch64/pr106594_1.c

Comments

Xi Ruoyao June 28, 2024, 12:25 p.m. UTC | #1

Hi Richard,

The late combine pass has triggered some FAILs on LoongArch and I'm
investigating.  One of them is movcf2gr-via-fr.c.  In 315r.postreload:

(insn 22 7 24 2 (set (reg:FCC 32 $f0 [87])
        (reg:FCC 64 $fcc0 [87])) "../gcc/gcc/testsuite/gcc.target/loongarch/movcf2gr-via-fr.c":9:12 168 {movfcc_internal}
     (nil))
(insn 24 22 8 2 (set (reg:FCC 4 $r4 [88])
        (reg:FCC 32 $f0 [87])) "../gcc/gcc/testsuite/gcc.target/loongarch/movcf2gr-via-fr.c":9:12 168 {movfcc_internal}
     (nil))

The late combine pass combines these to:

(insn 24 7 8 2 (set (reg:FCC 4 $r4 [88])
        (reg:FCC 64 $fcc0 [87])) "../gcc/gcc/testsuite/gcc.target/loongarch/movcf2gr-via-fr.c":9:12 168 {movfcc_internal}
     (nil))

But we are deliberately using an FPR ($f0) here to work around an
architectural issue in LA464 that makes a direct FCC-to-GPR move very slow.

Could you suggest how to fix this issue?

On Thu, 2024-06-20 at 14:34 +0100, Richard Sandiford wrote:
> This series is a resubmission of the late-combine work.  I've fixed
> some bugs that Jeff's cross-target CI found last time and some others
> that I hit since then.

/* snip */

Lulu Cheng June 28, 2024, 12:34 p.m. UTC | #2

On 2024/6/28 at 8:25 PM, Xi Ruoyao wrote:
> Hi Richard,
>
> The late combine pass has triggered some FAILs on LoongArch and I'm
> investigating.  One of them is movcf2gr-via-fr.c.  In 315r.postreload:
>
> (insn 22 7 24 2 (set (reg:FCC 32 $f0 [87])
>          (reg:FCC 64 $fcc0 [87])) "../gcc/gcc/testsuite/gcc.target/loongarch/movcf2gr-via-fr.c":9:12 168 {movfcc_internal}
>       (nil))
> (insn 24 22 8 2 (set (reg:FCC 4 $r4 [88])
>          (reg:FCC 32 $f0 [87])) "../gcc/gcc/testsuite/gcc.target/loongarch/movcf2gr-via-fr.c":9:12 168 {movfcc_internal}
>       (nil))
>
> The late combine pass combines these to:
>
> (insn 24 7 8 2 (set (reg:FCC 4 $r4 [88])
>          (reg:FCC 64 $fcc0 [87])) "../gcc/gcc/testsuite/gcc.target/loongarch/movcf2gr-via-fr.c":9:12 168 {movfcc_internal}
>       (nil))
>
> But we are deliberately using an FPR ($f0) here to work around an
> architectural issue in LA464 that makes a direct FCC-to-GPR move very slow.
>
> Could you suggest how to fix this issue?

Hi, Ruoyao:

We need to define TARGET_INSN_COST and set the cost of movcf2gr/movgr2cf.

I've fixed this and am doing correctness testing now.

>
> On Thu, 2024-06-20 at 14:34 +0100, Richard Sandiford wrote:
>> This series is a resubmission of the late-combine work.  I've fixed
>> some bugs that Jeff's cross-target CI found last time and some others
>> that I hit since then.
> /* snip */
>

Xi Ruoyao June 28, 2024, 12:35 p.m. UTC | #3

On Fri, 2024-06-28 at 20:34 +0800, chenglulu wrote:
> 
> On 2024/6/28 at 8:25 PM, Xi Ruoyao wrote:
> > Hi Richard,
> > 
> > The late combine pass has triggered some FAILs on LoongArch and I'm
> > investigating.  One of them is movcf2gr-via-fr.c.  In
> > 315r.postreload:
> > 
> > (insn 22 7 24 2 (set (reg:FCC 32 $f0 [87])
> >          (reg:FCC 64 $fcc0 [87]))
> > "../gcc/gcc/testsuite/gcc.target/loongarch/movcf2gr-via-fr.c":9:12
> > 168 {movfcc_internal}
> >       (nil))
> > (insn 24 22 8 2 (set (reg:FCC 4 $r4 [88])
> >          (reg:FCC 32 $f0 [87]))
> > "../gcc/gcc/testsuite/gcc.target/loongarch/movcf2gr-via-fr.c":9:12
> > 168 {movfcc_internal}
> >       (nil))
> > 
> > The late combine pass combines these to:
> > 
> > (insn 24 7 8 2 (set (reg:FCC 4 $r4 [88])
> >          (reg:FCC 64 $fcc0 [87]))
> > "../gcc/gcc/testsuite/gcc.target/loongarch/movcf2gr-via-fr.c":9:12
> > 168 {movfcc_internal}
> >       (nil))
> > 
> > But we are deliberately using an FPR ($f0) here to work around an
> > architectural issue in LA464 that makes a direct FCC-to-GPR move very
> > slow.
> > 
> > Could you suggest how to fix this issue?
> 
> Hi, Ruoyao:
> 
> We need to define TARGET_INSN_COST and set the cost of
> movcf2gr/movgr2cf.
> 
> I've fixed this and am doing correctness testing now.

Ah thanks!  So it uses insn cost instead of rtx cost and I didn't
realize.
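To make the exchange above concrete, here is a minimal sketch of how a per-instruction cost can veto the FCC-to-GPR combination. It assumes a simplified acceptance rule (keep the combined instruction only if it costs no more than the two instructions it replaces), which may not match the pass's real rule in detail. `toy_insn_cost` is a hypothetical stand-in for a TARGET_INSN_COST implementation, and the cost values are invented rather than taken from LoongArch.

```cpp
// Toy register classes mirroring the dump above: condition-flag registers
// (FCC), floating-point registers (FPR) and general registers (GPR).
enum class reg_class { fcc, fpr, gpr };

// A register-to-register move, reduced to its destination and source class.
struct move_insn
{
  reg_class dest;
  reg_class src;
};

// Hypothetical cost hook.  The numbers are made up; the only property
// that matters is that a direct FCC<->GPR move is priced above the
// two-step route through an FPR.
int
toy_insn_cost (const move_insn &insn)
{
  bool direct_fcc_gpr
    = ((insn.src == reg_class::fcc && insn.dest == reg_class::gpr)
       || (insn.src == reg_class::gpr && insn.dest == reg_class::fcc));
  return direct_fcc_gpr ? 12 : 4;
}

// Simplified acceptance test: substitute DEF's source into USE and keep
// the result only if it is no more expensive than the original pair.
bool
combination_profitable_p (const move_insn &def, const move_insn &use)
{
  move_insn combined { use.dest, def.src };
  return toy_insn_cost (combined)
         <= toy_insn_cost (def) + toy_insn_cost (use);
}
```

With these invented costs, combining the FCC-to-FPR and FPR-to-GPR moves is rejected (12 against 4 + 4), while a cheap pair such as GPR-to-FPR followed by FPR-to-FPR can still be merged.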

Lulu Cheng June 28, 2024, 12:44 p.m. UTC | #4

On 2024/6/28 at 8:35 PM, Xi Ruoyao wrote:
> On Fri, 2024-06-28 at 20:34 +0800, chenglulu wrote:
>> On 2024/6/28 at 8:25 PM, Xi Ruoyao wrote:
>>> Hi Richard,
>>>
>>> The late combine pass has triggered some FAILs on LoongArch and I'm
>>> investigating.  One of them is movcf2gr-via-fr.c.  In
>>> 315r.postreload:
>>>
>>> (insn 22 7 24 2 (set (reg:FCC 32 $f0 [87])
>>>           (reg:FCC 64 $fcc0 [87]))
>>> "../gcc/gcc/testsuite/gcc.target/loongarch/movcf2gr-via-fr.c":9:12
>>> 168 {movfcc_internal}
>>>        (nil))
>>> (insn 24 22 8 2 (set (reg:FCC 4 $r4 [88])
>>>           (reg:FCC 32 $f0 [87]))
>>> "../gcc/gcc/testsuite/gcc.target/loongarch/movcf2gr-via-fr.c":9:12
>>> 168 {movfcc_internal}
>>>        (nil))
>>>
>>> The late combine pass combines these to:
>>>
>>> (insn 24 7 8 2 (set (reg:FCC 4 $r4 [88])
>>>           (reg:FCC 64 $fcc0 [87]))
>>> "../gcc/gcc/testsuite/gcc.target/loongarch/movcf2gr-via-fr.c":9:12
>>> 168 {movfcc_internal}
>>>        (nil))
>>>
>>> But we are deliberately using an FPR ($f0) here to work around an
>>> architectural issue in LA464 that makes a direct FCC-to-GPR move very
>>> slow.
>>>
>>> Could you suggest how to fix this issue?
>> Hi, Ruoyao:
>>
>> We need to define TARGET_INSN_COST and set the cost of
>> movcf2gr/movgr2cf.
>>
>> I've fixed this and am doing correctness testing now.
> Ah thanks!  So it uses insn cost instead of rtx cost and I didn't
> realize.
>
>
That's right. :-D