diff mbox series

Add a late-combine pass [PR106594]

Message ID mptr0ljn9eh.fsf@arm.com
State New
Headers show
Series Add a late-combine pass [PR106594] | expand

Commit Message

Richard Sandiford Oct. 24, 2023, 6:49 p.m. UTC
This patch adds a combine pass that runs late in the pipeline.
There are two instances: one between combine and split1, and one
after postreload.

The pass currently has a single objective: remove definitions by
substituting into all uses.  The pre-RA version tries to restrict
itself to cases that are likely to have a neutral or beneficial
effect on register pressure.

The patch fixes PR106594.  It also fixes a few FAILs and XFAILs
in the aarch64 test results, mostly due to making proper use of
MOVPRFX in cases where we didn't previously.  I hope it would
also help with Robin's vec_duplicate testcase, although the
pressure heuristic might need tweaking for that case.

This is just a first step..  I'm hoping that the pass could be
used for other combine-related optimisations in future.  In particular,
the post-RA version doesn't need to restrict itself to cases where all
uses are substitutitable, since it doesn't have to worry about register
pressure.  If we did that, and if we extended it to handle multi-register
REGs, the pass might be a viable replacement for regcprop, which in
turn might reduce the cost of having a post-RA instance of the new pass.

I've run an assembly comparison with one target per CPU directory,
and it seems to be a win for all targets except nvptx (which is hard
to measure, being a higher-level asm).  The biggest winner seemed
to be AVR.

I'd originally hoped to enable the pass by default at -O2 and above
on all targets.  But in the end, I don't think that's possible,
because it interacts badly with x86's STV and partial register
dependency passes.

For example, gcc.target/i386/minmax-6.c tests whether the code
compiles without any spilling.  The RTL created by STV contains:

(insn 33 31 3 2 (set (subreg:V4SI (reg:SI 120) 0)
        (vec_merge:V4SI (vec_duplicate:V4SI (reg:SI 116))
            (const_vector:V4SI [
                    (const_int 0 [0]) repeated x4
                ])
            (const_int 1 [0x1]))) -1
     (nil))
(insn 3 33 34 2 (set (subreg:V4SI (reg:SI 118) 0)
        (subreg:V4SI (reg:SI 120) 0)) {movv4si_internal}
     (expr_list:REG_DEAD (reg:SI 120)
        (nil)))
(insn 34 3 32 2 (set (reg/v:SI 108 [ y ])
        (reg:SI 118)) -1
     (nil))

and it's crucial for the test that reg 108 is kept, rather than
propagated into uses.  As things stand, 118 can be allocated
a vector register and 108 a scalar register.  If 108 is propagated,
there will be scalar and vector uses of 118, and so it will be
spilled to memory.

That one could be solved by running STV2 later.  But RPAD is
a bigger problem.  In gcc.target/i386/pr87007-5.c, RPAD converts:

(insn 27 26 28 6 (set (reg:DF 100 [ _15 ])
        (sqrt:DF (mem/c:DF (symbol_ref:DI ("d2"))))) {*sqrtdf2_sse}
     (nil))

into:

(insn 45 26 44 6 (set (reg:V4SF 108)
        (const_vector:V4SF [
                (const_double:SF 0.0 [0x0.0p+0]) repeated x4
            ])) -1
     (nil))
(insn 44 45 27 6 (set (reg:V2DF 109)
        (vec_merge:V2DF (vec_duplicate:V2DF (sqrt:DF (mem/c:DF (symbol_ref:DI ("d2")))))
            (subreg:V2DF (reg:V4SF 108) 0)
            (const_int 1 [0x1]))) -1
     (nil))
(insn 27 44 28 6 (set (reg:DF 100 [ _15 ])
        (subreg:DF (reg:V2DF 109) 0)) {*movdf_internal}
     (nil))

But both the pre-RA and post-RA passes are able to combine these
instructions back to the original form.

The patch therefore enables the pass by default only on AArch64.
However, I did test the patch with it enabled on x86_64-linux-gnu
as well, which was useful for debugging.

Bootstrapped & regression-tested on aarch64-linux-gnu and
x86_64-linux-gnu (as posted, with no regressions, and with the
pass enabled by default, with some gcc.target/i386 regressions).
OK to install?

Richard


gcc/
	PR rtl-optimization/106594
	* Makefile.in (OBJS): Add late-combine.o.
	* common.opt (flate-combine-instructions): New option.
	* doc/invoke.texi: Document it.
	* common/config/aarch64/aarch64-common.cc: Enable it by default
	at -O2 and above.
	* tree-pass.h (make_pass_late_combine): Declare.
	* late-combine.cc: New file.
	* passes.def: Add two instances of late_combine.

gcc/testsuite/
	PR rtl-optimization/106594
	* gcc.dg/ira-shrinkwrap-prep-1.c: Restrict XFAIL to non-aarch64
	targets.
	* gcc.dg/ira-shrinkwrap-prep-2.c: Likewise.
	* gcc.dg/stack-check-4.c: Add -fno-shrink-wrap.
	* gcc.target/aarch64/sve/cond_asrd_3.c: Remove XFAILs.
	* gcc.target/aarch64/sve/cond_convert_3.c: Likewise.
	* gcc.target/aarch64/sve/cond_fabd_5.c: Likewise.
	* gcc.target/aarch64/sve/cond_convert_6.c: Expect the MOVPRFX /Zs
	described in the comment.
	* gcc.target/aarch64/sve/cond_unary_4.c: Likewise.
	* gcc.target/aarch64/pr106594_1.c: New test.
---
 gcc/Makefile.in                               |   1 +
 gcc/common.opt                                |   5 +
 gcc/common/config/aarch64/aarch64-common.cc   |   1 +
 gcc/doc/invoke.texi                           |  11 +-
 gcc/late-combine.cc                           | 718 ++++++++++++++++++
 gcc/passes.def                                |   2 +
 gcc/testsuite/gcc.dg/ira-shrinkwrap-prep-1.c  |   2 +-
 gcc/testsuite/gcc.dg/ira-shrinkwrap-prep-2.c  |   2 +-
 gcc/testsuite/gcc.dg/stack-check-4.c          |   2 +-
 gcc/testsuite/gcc.target/aarch64/pr106594_1.c |  20 +
 .../gcc.target/aarch64/sve/cond_asrd_3.c      |  10 +-
 .../gcc.target/aarch64/sve/cond_convert_3.c   |   8 +-
 .../gcc.target/aarch64/sve/cond_convert_6.c   |   8 +-
 .../gcc.target/aarch64/sve/cond_fabd_5.c      |  11 +-
 .../gcc.target/aarch64/sve/cond_unary_4.c     |  13 +-
 gcc/tree-pass.h                               |   1 +
 16 files changed, 780 insertions(+), 35 deletions(-)
 create mode 100644 gcc/late-combine.cc
 create mode 100644 gcc/testsuite/gcc.target/aarch64/pr106594_1.c

Comments

Segher Boessenkool Dec. 30, 2023, 6:13 p.m. UTC | #1
Hi!

On Tue, Oct 24, 2023 at 07:49:10PM +0100, Richard Sandiford wrote:
> This patch adds a combine pass that runs late in the pipeline.

But it is not.  It is a completely new thing, and much closer to
fwprop than to combine, too.

Could you rename it to something else, please?  Something less confusing
to both users and maintainers :-)

> There are two instances: one between combine and split1,

So, what kind of things does this do that the real combine does not?
And, same question but for fwprop.  That would be the crucial motivation
for why we want to have this new pass at all :-)

> The pass currently has a single objective: remove definitions by
> substituting into all uses.

The easy case ;-)


Segher
Richard Sandiford Jan. 2, 2024, 9:47 a.m. UTC | #2
Segher Boessenkool <segher@kernel.crashing.org> writes:
> Hi!
>
> On Tue, Oct 24, 2023 at 07:49:10PM +0100, Richard Sandiford wrote:
>> This patch adds a combine pass that runs late in the pipeline.
>
> But it is not.  It is a completely new thing, and much closer to
> fwprop than to combine, too.

Well, it is a combine pass.  It's not a new instance of the pass in
combine.cc, but I don't think that's the implication.  We already have
two combine passes: the combine.cc one and the postreload one.
Similarly we have at least two DCE implementations.

> Could you rename it to something else, please?  Something less confusing
> to both users and maintainers :-)

Do you have any suggestions?

>> There are two instances: one between combine and split1,
>
> So, what kind of things does this do that the real combine does not?
> And, same question but for fwprop.  That would be the crucial motivation
> for why we want to have this new pass at all :-)

You've already responded (below) to the part where I explained that.

>> The pass currently has a single objective: remove definitions by
>> substituting into all uses.
>
> The easy case ;-)

And the yet a case that no existing pass handles. :)  That's why I'm
trying to add something that does.

Richard
Jeff Law Jan. 3, 2024, 4:20 a.m. UTC | #3
On 10/24/23 12:49, Richard Sandiford wrote:
> This patch adds a combine pass that runs late in the pipeline.
> There are two instances: one between combine and split1, and one
> after postreload.
So have you done any investigation on cases caught by your new pass 
between combine and split1 to characterize them?  In particular do they 
point at solvable problems in combine?  Or do you forsee this subsuming 
the old combiner pass at some point in the distant future?

rth and I sketched out an SSA based RTL combine at some point in the 
deep past.  The key goal we were trying to achieve was combining across 
blocks.  We didn't have a functioning RTL SSA form at the time, so it 
never went to any implementation work.  It looks like yours would solve 
the class of problems rth and I were considering.

[ ... ]

> 
> The patch therefore enables the pass by default only on AArch64.
> However, I did test the patch with it enabled on x86_64-linux-gnu
> as well, which was useful for debugging.
> 
> Bootstrapped & regression-tested on aarch64-linux-gnu and
> x86_64-linux-gnu (as posted, with no regressions, and with the
> pass enabled by default, with some gcc.target/i386 regressions).
> OK to install?
I'm going to adjust this slightly so that it's enabled across the board 
and throw it into the tester tomorrow (tester is busy tonight).  Even if 
we make it opt-in on a per-port basis, the alternate target testing does 
seems to find stuff that needs fixing ;-)



> 
> Richard
> 
> 
> gcc/
> 	PR rtl-optimization/106594
> 	* Makefile.in (OBJS): Add late-combine.o.
> 	* common.opt (flate-combine-instructions): New option.
> 	* doc/invoke.texi: Document it.
> 	* common/config/aarch64/aarch64-common.cc: Enable it by default
> 	at -O2 and above.
> 	* tree-pass.h (make_pass_late_combine): Declare.
> 	* late-combine.cc: New file.
> 	* passes.def: Add two instances of late_combine.
> 
> gcc/testsuite/
> 	PR rtl-optimization/106594
> 	* gcc.dg/ira-shrinkwrap-prep-1.c: Restrict XFAIL to non-aarch64
> 	targets.
> 	* gcc.dg/ira-shrinkwrap-prep-2.c: Likewise.
> 	* gcc.dg/stack-check-4.c: Add -fno-shrink-wrap.
> 	* gcc.target/aarch64/sve/cond_asrd_3.c: Remove XFAILs.
> 	* gcc.target/aarch64/sve/cond_convert_3.c: Likewise.
> 	* gcc.target/aarch64/sve/cond_fabd_5.c: Likewise.
> 	* gcc.target/aarch64/sve/cond_convert_6.c: Expect the MOVPRFX /Zs
> 	described in the comment.
> 	* gcc.target/aarch64/sve/cond_unary_4.c: Likewise.
> 	* gcc.target/aarch64/pr106594_1.c: New test.


>
> +
> +      // Don't substitute into a non-local goto, since it can then be
> +      // treated as a jump to local label, e.g. in shorten_branches.
> +      // ??? But this shouldn't be necessary.
> +      if (use_insn->is_jump ()
> +	  && find_reg_note (use_insn->rtl (), REG_NON_LOCAL_GOTO, NULL_RTX))
> +	return false;
Agreed that this shouldn't be necessary.  In fact, if you can substitute 
it's potentially very profitable.  Then again, I doubt it happens all 
that much in practice, particularly since I think gimple does squash out 
some of these.

Nothing jumps out at horribly wrong.  You might want/need to reject 
frame related insns in optimizable_set, though I guess if the dwarf2 
writer isn't complaining, then we haven't mucked things up too bad.
Richard Sandiford Jan. 5, 2024, 5:35 p.m. UTC | #4
Jeff Law <jeffreyalaw@gmail.com> writes:
> On 10/24/23 12:49, Richard Sandiford wrote:
>> This patch adds a combine pass that runs late in the pipeline.
>> There are two instances: one between combine and split1, and one
>> after postreload.
> So have you done any investigation on cases caught by your new pass 
> between combine and split1 to characterize them?  In particular do they 
> point at solvable problems in combine?  Or do you forsee this subsuming 
> the old combiner pass at some point in the distant future?

Examples like the PR are the main motivation for the pre-RA pass.
There we had an extension that could be combined into an address,
but no longer was after GCC 13.

The PR itself could in principle be fixed in combine (various
approaches were suggested, but not accepted).  But the same problem
applies to multiple uses of extensions.  fwprop can't handle it because
individual propagations are not a win in isolation.  And combine has
a limit of combining 4 insns (with a maximum of 2 output insns, IIRC).
So I don't think either of the existing passes scale to the general case.

FWIW, an example of this is gcc.c-torture/compile/pr33133.c:

@@ -20,7 +20,7 @@
        bne     .L3
        mov     x3, 0
 .L7:
-       mov     w8, w3
+       mov     w9, w3
        add     x1, x5, x3
        ldrb    w6, [x1, 256]
        ubfx    x1, x6, 1, 7
@@ -28,15 +28,14 @@
        ldrb    w1, [x7, x1]
        lsr     w1, w1, 4
 .L5:
-       asr     w6, w8, 1
-       sxtw    x9, w6
-       ldrb    w6, [x2, w6, sxtw]
+       asr     w8, w9, 1
+       ldrb    w6, [x2, w8, sxtw]
        sxtb    w1, w1
-       tbz     x8, 0, .L6
+       tbz     x9, 0, .L6
        sbfiz   w1, w1, 4, 4
 .L6:
        orr     w1, w6, w1
-       strb    w1, [x2, x9]
+       strb    w1, [x2, w8, sxtw]
        add     x3, x3, 1
        cmp     x3, 16
        bne     .L7

I hope that the new pass could be extended to do new things in future,
especially if they're things that would make sense in both a pre-RA
and post-RA pass.  And there's some benefit to having a cleanish slate,
especially without the baggage of the expand_compound_operation/
make_compound_operation wrangler.  But I don't think we can realistically
expect to replace old combine, or even that it's something worth aiming for.

> rth and I sketched out an SSA based RTL combine at some point in the 
> deep past.  The key goal we were trying to achieve was combining across 
> blocks.  We didn't have a functioning RTL SSA form at the time, so it 
> never went to any implementation work.  It looks like yours would solve 
> the class of problems rth and I were considering.

Yeah, I do see some cases where combining across blocks helps.
The case above is one example of that.  Another is:

@@ -8,9 +8,8 @@
 .LFB0:
        .cfi_startproc
        ldr     w1, [x0, 4]
-       cmp     w1, 0
        cbz     w1, .L3
-       blt     .L4
+       tbnz    w1, #31, .L4
        ldr     w2, [x0]
        ldr     w0, [x0, 8]
        add     w1, w1, w2

And, as a similar example for multiple uses:

@@ -144,8 +144,7 @@
        neg     x21, x21
        mov     w26, 255
 .L18:
-       cmp     x19, 0
-       bge     .L19
+       tbz     x19, #63, .L19
        mvn     w26, w26
        and     w26, w26, 255
        neg     x19, x19
@@ -160,7 +159,7 @@
        mov     w26, 0
        b       .L18
 .L19:
-       bne     .L20
+       cbnz    x19, .L20
        mov     x19, 9223372036854775807
        b       .L21

>> The patch therefore enables the pass by default only on AArch64.
>> However, I did test the patch with it enabled on x86_64-linux-gnu
>> as well, which was useful for debugging.
>> 
>> Bootstrapped & regression-tested on aarch64-linux-gnu and
>> x86_64-linux-gnu (as posted, with no regressions, and with the
>> pass enabled by default, with some gcc.target/i386 regressions).
>> OK to install?
> I'm going to adjust this slightly so that it's enabled across the board 
> and throw it into the tester tomorrow (tester is busy tonight).  Even if 
> we make it opt-in on a per-port basis, the alternate target testing does 
> seems to find stuff that needs fixing ;-)

Thanks!  As per our off-list discussion, the cris-elf failures showed
up a bug in the handling of call arguments.  Here's an updated version
with that fixed.

Unfortunately (also as per the off-list discussion), it seems like some
ports have latent assumptions that certain combinations don't happen
between RA and the first post-RA split.  They should be easy enough
to iron out in most cases, although as mentioned, the bad interaction
with the x86 passes seems harder to fix.

> > +      // Don't substitute into a non-local goto, since it can then be
> > +      // treated as a jump to local label, e.g. in shorten_branches.
> > +      // ??? But this shouldn't be necessary.
> > +      if (use_insn->is_jump ()
> > +	  && find_reg_note (use_insn->rtl (), REG_NON_LOCAL_GOTO, NULL_RTX))
> > +	return false;
> Agreed that this shouldn't be necessary.  In fact, if you can substitute 
> it's potentially very profitable.  Then again, I doubt it happens all 
> that much in practice, particularly since I think gimple does squash out 
> some of these.

Yeah.  If I'd posted this earlier in stage 1 (rather than October),
I might have tried teaching shorten_branches how to handle this.
But it felt like it could be a bit of a rabbit hole at this stage.

> Nothing jumps out at horribly wrong.  You might want/need to reject 
> frame related insns in optimizable_set, though I guess if the dwarf2 
> writer isn't complaining, then we haven't mucked things up too bad.

Ah, yeah, I wondered about that.  But I suppose most prologue insns
don't really create combination opportunities, since most of them
either set up the stack (which we wouldn't combine anyway) or sink
incoming data.  So if there cases where this is a problem, it might
unfortunately be a case of "see what breaks".

Tested on aarch64-linux-gnu.  Does this version look OK?
(There's no change to the covering note btw.  See:

      // Don't try to substitute into call arguments.

for the new code.)

Thanks,
Richard

----

This patch adds a combine pass that runs late in the pipeline.
There are two instances: one between combine and split1, and one
after postreload.

The pass currently has a single objective: remove definitions by
substituting into all uses.  The pre-RA version tries to restrict
itself to cases that are likely to have a neutral or beneficial
effect on register pressure.

The patch fixes PR106594.  It also fixes a few FAILs and XFAILs
in the aarch64 test results, mostly due to making proper use of
MOVPRFX in cases where we didn't previously.  I hope it would
also help with Robin's vec_duplicate testcase, although the
pressure heuristic might need tweaking for that case.

This is just a first step.  I'm hoping that the pass could be
used for other combine-related optimisations in future.  In particular,
the post-RA version doesn't need to restrict itself to cases where all
uses are substitutable, since it doesn't have to worry about register
pressure.  If we did that, and if we extended it to handle multi-register
REGs, the pass might be a viable replacement for regcprop, which in
turn might reduce the cost of having a post-RA instance of the new pass.

I've run an assembly comparison with one target per CPU directory,
and it seems to be a win for all targets except nvptx (which is hard
to measure, being a higher-level asm).  The biggest winner seemed
to be AVR.

I'd originally hoped to enable the pass by default at -O2 and above
on all targets.  But in the end, I don't think that's possible,
because it interacts badly with x86's STV and partial register
dependency passes.

For example, gcc.target/i386/minmax-6.c tests whether the code
compiles without any spilling.  The RTL created by STV contains:

(insn 33 31 3 2 (set (subreg:V4SI (reg:SI 120) 0)
        (vec_merge:V4SI (vec_duplicate:V4SI (reg:SI 116))
            (const_vector:V4SI [
                    (const_int 0 [0]) repeated x4
                ])
            (const_int 1 [0x1]))) -1
     (nil))
(insn 3 33 34 2 (set (subreg:V4SI (reg:SI 118) 0)
        (subreg:V4SI (reg:SI 120) 0)) {movv4si_internal}
     (expr_list:REG_DEAD (reg:SI 120)
        (nil)))
(insn 34 3 32 2 (set (reg/v:SI 108 [ y ])
        (reg:SI 118)) -1
     (nil))

and it's crucial for the test that reg 108 is kept, rather than
propagated into uses.  As things stand, 118 can be allocated
a vector register and 108 a scalar register.  If 108 is propagated,
there will be scalar and vector uses of 118, and so it will be
spilled to memory.

That one could be solved by running STV2 later.  But RPAD is
a bigger problem.  In gcc.target/i386/pr87007-5.c, RPAD converts:

(insn 27 26 28 6 (set (reg:DF 100 [ _15 ])
        (sqrt:DF (mem/c:DF (symbol_ref:DI ("d2"))))) {*sqrtdf2_sse}
     (nil))

into:

(insn 45 26 44 6 (set (reg:V4SF 108)
        (const_vector:V4SF [
                (const_double:SF 0.0 [0x0.0p+0]) repeated x4
            ])) -1
     (nil))
(insn 44 45 27 6 (set (reg:V2DF 109)
        (vec_merge:V2DF (vec_duplicate:V2DF (sqrt:DF (mem/c:DF (symbol_ref:DI ("d2")))))
            (subreg:V2DF (reg:V4SF 108) 0)
            (const_int 1 [0x1]))) -1
     (nil))
(insn 27 44 28 6 (set (reg:DF 100 [ _15 ])
        (subreg:DF (reg:V2DF 109) 0)) {*movdf_internal}
     (nil))

But both the pre-RA and post-RA passes are able to combine these
instructions back to the original form.

The patch therefore enables the pass by default only on AArch64.
However, I did test the patch with it enabled on x86_64-linux-gnu
as well, which was useful for debugging.

gcc/
	PR rtl-optimization/106594
	* Makefile.in (OBJS): Add late-combine.o.
	* common.opt (flate-combine-instructions): New option.
	* doc/invoke.texi: Document it.
	* common/config/aarch64/aarch64-common.cc: Enable it by default
	at -O2 and above.
	* tree-pass.h (make_pass_late_combine): Declare.
	* late-combine.cc: New file.
	* passes.def: Add two instances of late_combine.

gcc/testsuite/
	PR rtl-optimization/106594
	* gcc.dg/ira-shrinkwrap-prep-1.c: Restrict XFAIL to non-aarch64
	targets.
	* gcc.dg/ira-shrinkwrap-prep-2.c: Likewise.
	* gcc.dg/stack-check-4.c: Add -fno-shrink-wrap.
	* gcc.target/aarch64/sve/cond_asrd_3.c: Remove XFAILs.
	* gcc.target/aarch64/sve/cond_convert_3.c: Likewise.
	* gcc.target/aarch64/sve/cond_fabd_5.c: Likewise.
	* gcc.target/aarch64/sve/cond_convert_6.c: Expect the MOVPRFX /Zs
	described in the comment.
	* gcc.target/aarch64/sve/cond_unary_4.c: Likewise.
	* gcc.target/aarch64/pr106594_1.c: New test.
---
 gcc/Makefile.in                               |   1 +
 gcc/common.opt                                |   5 +
 gcc/common/config/aarch64/aarch64-common.cc   |   1 +
 gcc/doc/invoke.texi                           |  11 +-
 gcc/late-combine.cc                           | 732 ++++++++++++++++++
 gcc/passes.def                                |   2 +
 gcc/testsuite/gcc.dg/ira-shrinkwrap-prep-1.c  |   2 +-
 gcc/testsuite/gcc.dg/ira-shrinkwrap-prep-2.c  |   2 +-
 gcc/testsuite/gcc.dg/stack-check-4.c          |   2 +-
 gcc/testsuite/gcc.target/aarch64/pr106594_1.c |  20 +
 .../gcc.target/aarch64/sve/cond_asrd_3.c      |  10 +-
 .../gcc.target/aarch64/sve/cond_convert_3.c   |   8 +-
 .../gcc.target/aarch64/sve/cond_convert_6.c   |   8 +-
 .../gcc.target/aarch64/sve/cond_fabd_5.c      |  11 +-
 .../gcc.target/aarch64/sve/cond_unary_4.c     |  13 +-
 gcc/tree-pass.h                               |   1 +
 16 files changed, 794 insertions(+), 35 deletions(-)
 create mode 100644 gcc/late-combine.cc
 create mode 100644 gcc/testsuite/gcc.target/aarch64/pr106594_1.c

diff --git a/gcc/Makefile.in b/gcc/Makefile.in
index deb12e17d25..417390b3b79 100644
--- a/gcc/Makefile.in
+++ b/gcc/Makefile.in
@@ -1571,6 +1571,7 @@ OBJS = \
 	ira-lives.o \
 	jump.o \
 	langhooks.o \
+	late-combine.o \
 	lcm.o \
 	lists.o \
 	loop-doloop.o \
diff --git a/gcc/common.opt b/gcc/common.opt
index 5f0a101bccb..56036f2295f 100644
--- a/gcc/common.opt
+++ b/gcc/common.opt
@@ -1784,6 +1784,11 @@ Common Var(flag_large_source_files) Init(0)
 Improve GCC's ability to track column numbers in large source files,
 at the expense of slower compilation.
 
+flate-combine-instructions
+Common Var(flag_late_combine_instructions) Optimization Init(0)
+Run two instruction combination passes late in the pass pipeline;
+one before register allocation and one after.
+
 floop-parallelize-all
 Common Var(flag_loop_parallelize_all) Optimization
 Mark all loops as parallel.
diff --git a/gcc/common/config/aarch64/aarch64-common.cc b/gcc/common/config/aarch64/aarch64-common.cc
index 951d041d310..75c72b1680d 100644
--- a/gcc/common/config/aarch64/aarch64-common.cc
+++ b/gcc/common/config/aarch64/aarch64-common.cc
@@ -56,6 +56,7 @@ static const struct default_options aarch_option_optimization_table[] =
     /* Enable redundant extension instructions removal at -O2 and higher.  */
     { OPT_LEVELS_2_PLUS, OPT_free, NULL, 1 },
     { OPT_LEVELS_2_PLUS, OPT_mearly_ra_, NULL, AARCH64_EARLY_RA_ALL },
+    { OPT_LEVELS_2_PLUS, OPT_flate_combine_instructions, NULL, 1 },
 #if (TARGET_DEFAULT_ASYNC_UNWIND_TABLES == 1)
     { OPT_LEVELS_ALL, OPT_fasynchronous_unwind_tables, NULL, 1 },
     { OPT_LEVELS_ALL, OPT_funwind_tables, NULL, 1},
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 68d1f364ac0..dbb798cc610 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -571,7 +571,7 @@ Objective-C and Objective-C++ Dialects}.
 -fipa-bit-cp  -fipa-vrp  -fipa-pta  -fipa-profile  -fipa-pure-const
 -fipa-reference  -fipa-reference-addressable
 -fipa-stack-alignment  -fipa-icf  -fira-algorithm=@var{algorithm}
--flive-patching=@var{level}
+-flate-combine-instructions  -flive-patching=@var{level}
 -fira-region=@var{region}  -fira-hoist-pressure
 -fira-loop-pressure  -fno-ira-share-save-slots
 -fno-ira-share-spill-slots
@@ -13442,6 +13442,15 @@ equivalences that are found only by GCC and equivalences found only by Gold.
 
 This flag is enabled by default at @option{-O2} and @option{-Os}.
 
+@opindex flate-combine-instructions
+@item -flate-combine-instructions
+Enable two instruction combination passes that run relatively late in the
+compilation process.  One of the passes runs before register allocation and
+the other after register allocation.  The main aim of the passes is to
+substitute definitions into all uses.
+
+Some targets enable this flag by default at @option{-O2} and @option{-Os}.
+
 @opindex flive-patching
 @item -flive-patching=@var{level}
 Control GCC's optimizations to produce output suitable for live-patching.
diff --git a/gcc/late-combine.cc b/gcc/late-combine.cc
new file mode 100644
index 00000000000..47122ca2e4b
--- /dev/null
+++ b/gcc/late-combine.cc
@@ -0,0 +1,732 @@
+// Late-stage instruction combination pass.
+// Copyright (C) 2023 Free Software Foundation, Inc.
+//
+// This file is part of GCC.
+//
+// GCC is free software; you can redistribute it and/or modify it under
+// the terms of the GNU General Public License as published by the Free
+// Software Foundation; either version 3, or (at your option) any later
+// version.
+//
+// GCC is distributed in the hope that it will be useful, but WITHOUT ANY
+// WARRANTY; without even the implied warranty of MERCHANTABILITY or
+// FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
+// for more details.
+//
+// You should have received a copy of the GNU General Public License
+// along with GCC; see the file COPYING3.  If not see
+// <http://www.gnu.org/licenses/>.
+
+// The current purpose of this pass is to substitute definitions into
+// all uses, so that the definition can be removed.  However, it could
+// be extended to handle other combination-related optimizations in future.
+//
+// The pass can run before or after register allocation.  When running
+// before register allocation, it tries to avoid cases that are likely
+// to increase register pressure.  For the same reason, it avoids moving
+// instructions around, even if doing so would allow an optimization to
+// succeed.  These limitations are removed when running after register
+// allocation.
+
+#define INCLUDE_ALGORITHM
+#define INCLUDE_FUNCTIONAL
+#include "config.h"
+#include "system.h"
+#include "coretypes.h"
+#include "backend.h"
+#include "rtl.h"
+#include "df.h"
+#include "rtl-ssa.h"
+#include "print-rtl.h"
+#include "tree-pass.h"
+#include "cfgcleanup.h"
+#include "target.h"
+
+using namespace rtl_ssa;
+
+namespace {
+const pass_data pass_data_late_combine =
+{
+  RTL_PASS, // type
+  "late_combine", // name
+  OPTGROUP_NONE, // optinfo_flags
+  TV_NONE, // tv_id
+  0, // properties_required
+  0, // properties_provided
+  0, // properties_destroyed
+  0, // todo_flags_start
+  TODO_df_finish, // todo_flags_finish
+};
+
+// Represents an attempt to substitute a single-set definition into all
+// uses of the definition.
+class insn_combination
+{
+public:
+  insn_combination (set_info *, rtx, rtx);
+  bool run ();
+  const vec<insn_change *> &use_changes () const { return m_use_changes; }
+
+private:
+  use_array get_new_uses (use_info *);
+  bool substitute_nondebug_use (use_info *);
+  bool substitute_nondebug_uses (set_info *);
+  bool move_and_recog (insn_change &);
+  bool try_to_preserve_debug_info (insn_change &, use_info *);
+  void substitute_debug_use (use_info *);
+  bool substitute_note (insn_info *, rtx, bool);
+  void substitute_notes (insn_info *, bool);
+  void substitute_note_uses (use_info *);
+  void substitute_optional_uses (set_info *);
+
+  // Represents the state of the function's RTL at the start of this
+  // combination attempt.
+  insn_change_watermark m_rtl_watermark;
+
+  // Represents the rtl-ssa state at the start of this combination attempt.
+  obstack_watermark m_attempt;
+
+  // The instruction that contains the definition, and that we're trying
+  // to delete.
+  insn_info *m_def_insn;
+
+  // The definition itself.
+  set_info *m_def;
+
+  // The destination and source of the single set that defines m_def.
+  // The destination is known to be a plain REG.
+  rtx m_dest;
+  rtx m_src;
+
+  // Contains all in-progress changes to uses of m_def.
+  auto_vec<insn_change *> m_use_changes;
+
+  // Contains the full list of changes that we want to make, in reverse
+  // postorder.
+  auto_vec<insn_change *> m_nondebug_changes;
+};
+
+// Class that represents one run of the pass.
+class late_combine
+{
+public:
+  unsigned int execute (function *);
+
+private:
+  rtx optimizable_set (insn_info *);
+  bool check_register_pressure (insn_info *, rtx);
+  bool check_uses (set_info *, rtx);
+  bool combine_into_uses (insn_info *, insn_info *);
+
+  auto_vec<insn_info *> m_worklist;
+};
+
+insn_combination::insn_combination (set_info *def, rtx dest, rtx src)
+  : m_rtl_watermark (),
+    m_attempt (crtl->ssa->new_change_attempt ()),
+    m_def_insn (def->insn ()),
+    m_def (def),
+    m_dest (dest),
+    m_src (src),
+    m_use_changes (),
+    m_nondebug_changes ()
+{
+}
+
+// USE is a direct or indirect use of m_def.  Return the list of uses
+// that would be needed after substituting m_def into the instruction.
+// The returned list is marked as invalid if USE's insn and m_def_insn
+// use different definitions for the same resource (register or memory).
+use_array
+insn_combination::get_new_uses (use_info *use)
+{
+  auto *def = use->def ();
+  auto *use_insn = use->insn ();
+
+  use_array new_uses = use_insn->uses ();
+  new_uses = remove_uses_of_def (m_attempt, new_uses, def);
+  new_uses = merge_access_arrays (m_attempt, m_def_insn->uses (), new_uses);
+  if (new_uses.is_valid () && use->ebb () != m_def->ebb ())
+    new_uses = crtl->ssa->make_uses_available (m_attempt, new_uses, use->bb (),
+					       use_insn->is_debug_insn ());
+  return new_uses;
+}
+
+// Start the process of trying to replace USE by substitution, given that
+// USE occurs in a non-debug instruction.  Check that the substitution can
+// be represented in RTL and that each use of a resource (register or memory)
+// has a consistent definition.  If so, start an insn_change record for the
+// substitution and return true.
+bool
+insn_combination::substitute_nondebug_use (use_info *use)
+{
+  insn_info *use_insn = use->insn ();
+  rtx_insn *use_rtl = use_insn->rtl ();
+
+  if (dump_file && (dump_flags & TDF_DETAILS))
+    dump_insn_slim (dump_file, use->insn ()->rtl ());
+
+  // Chceck that we can change the instruction pattern.  Leave recognition
+  // of the result till later.
+  insn_propagation prop (use_rtl, m_dest, m_src);
+  if (!prop.apply_to_pattern (&PATTERN (use_rtl))
+      || prop.num_replacements == 0)
+    {
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	fprintf (dump_file, "-- RTL substitution failed\n");
+      return false;
+    }
+
+  use_array new_uses = get_new_uses (use);
+  if (!new_uses.is_valid ())
+    {
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	fprintf (dump_file, "-- could not prove that all sources"
+		 " are available\n");
+      return false;
+    }
+
+  auto *where = XOBNEW (m_attempt, insn_change);
+  auto *use_change = new (where) insn_change (use_insn);
+  m_use_changes.safe_push (use_change);
+  use_change->new_uses = new_uses;
+
+  return true;
+}
+
+// Apply substitute_nondebug_use to all direct and indirect uses of DEF.
+// There will be at most one level of indirection.
+bool
+insn_combination::substitute_nondebug_uses (set_info *def)
+{
+  for (use_info *use : def->nondebug_insn_uses ())
+    if (!use->is_live_out_use ()
+	&& !use->only_occurs_in_notes ()
+	&& !substitute_nondebug_use (use))
+      return false;
+
+  for (use_info *use : def->phi_uses ())
+    if (!substitute_nondebug_uses (use->phi ()))
+      return false;
+
+  return true;
+}
+
+// Complete the verification of USE_CHANGE, given that m_nondebug_insns
+// now contains an insn_change record for all proposed non-debug changes.
+// Check that the new instruction is a recognized pattern.  Also check that
+// the instruction can be placed somewhere that makes all definitions and
+// uses valid, and that permits any new hard-register clobbers added
+// during the recognition process.  Return true on success.
+bool
+insn_combination::move_and_recog (insn_change &use_change)
+{
+  insn_info *use_insn = use_change.insn ();
+
+  if (reload_completed && can_move_insn_p (use_insn))
+    use_change.move_range = { use_insn->bb ()->head_insn (),
+			      use_insn->ebb ()->last_bb ()->end_insn () };
+  if (!restrict_movement_ignoring (use_change,
+				   insn_is_changing (m_nondebug_changes)))
+    {
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	fprintf (dump_file, "-- cannot satisfy all definitions and uses"
+		 " in insn %d\n", INSN_UID (use_insn->rtl ()));
+      return false;
+    }
+
+  if (!recog_ignoring (m_attempt, use_change,
+		       insn_is_changing (m_nondebug_changes)))
+    return false;
+
+  return true;
+}
+
+// USE_CHANGE.insn () is a debug instruction that uses m_def.  Try to
+// substitute the definition into the instruction and try to describe
+// the result in USE_CHANGE.  Return true on success.  Failure means that
+// the instruction must be reset instead.
+bool
+insn_combination::try_to_preserve_debug_info (insn_change &use_change,
+					      use_info *use)
+{
+  insn_info *use_insn = use_change.insn ();
+  rtx_insn *use_rtl = use_insn->rtl ();
+
+  use_change.new_uses = get_new_uses (use);
+  if (!use_change.new_uses.is_valid ()
+      || !restrict_movement (use_change))
+    return false;
+
+  insn_propagation prop (use_rtl, m_dest, m_src);
+  return prop.apply_to_pattern (&INSN_VAR_LOCATION_LOC (use_rtl));
+}
+
+// USE_INSN is a debug instruction that uses m_def.  Update it to reflect
+// the fact that m_def is going to disappear.  Try to preserve the source
+// value if possible, but reset the instruction if not.
+void
+insn_combination::substitute_debug_use (use_info *use)
+{
+  auto *use_insn = use->insn ();
+  rtx_insn *use_rtl = use_insn->rtl ();
+
+  auto use_change = insn_change (use_insn);
+  if (!try_to_preserve_debug_info (use_change, use))
+    {
+      use_change.new_uses = {};
+      use_change.move_range = use_change.insn ();
+      INSN_VAR_LOCATION_LOC (use_rtl) = gen_rtx_UNKNOWN_VAR_LOC ();
+    }
+  insn_change *changes[] = { &use_change };
+  crtl->ssa->change_insns (changes);
+}
+
+// NOTE is a reg note of USE_INSN, which previously used m_def.  Update
+// the note to reflect the fact that m_def is going to disappear.  Return
+// true on success, or false if the note must be deleted.
+//
+// CAN_PROPAGATE is true if m_dest can be replaced with m_use.
+bool
+insn_combination::substitute_note (insn_info *use_insn, rtx note,
+				   bool can_propagate)
+{
+  if (REG_NOTE_KIND (note) == REG_EQUAL
+      || REG_NOTE_KIND (note) == REG_EQUIV)
+    {
+      insn_propagation prop (use_insn->rtl (), m_dest, m_src);
+      return (prop.apply_to_rvalue (&XEXP (note, 0))
+	      && (can_propagate || prop.num_replacements == 0));
+    }
+  return true;
+}
+
+// Update USE_INSN's notes after deciding to go ahead with the optimization.
+// CAN_PROPAGATE is true if m_dest can be replaced with m_use.
+void
+insn_combination::substitute_notes (insn_info *use_insn, bool can_propagate)
+{
+  rtx_insn *use_rtl = use_insn->rtl ();
+  rtx *ptr = &REG_NOTES (use_rtl);
+  while (rtx note = *ptr)
+    {
+      if (substitute_note (use_insn, note, can_propagate))
+	ptr = &XEXP (note, 1);
+      else
+	*ptr = XEXP (note, 1);
+    }
+}
+
+// We've decided to go ahead with the substitution.  Update all REG_NOTES
+// involving USE.
+void
+insn_combination::substitute_note_uses (use_info *use)
+{
+  insn_info *use_insn = use->insn ();
+
+  bool can_propagate = true;
+  if (use->only_occurs_in_notes ())
+    {
+      // The only uses are in notes.  Try to keep the note if we can,
+      // but removing it is better than aborting the optimization.
+      insn_change use_change (use_insn);
+      use_change.new_uses = get_new_uses (use);
+      if (!use_change.new_uses.is_valid ()
+	  || !restrict_movement (use_change))
+	{
+	  use_change.move_range = use_insn;
+	  use_change.new_uses = remove_uses_of_def (m_attempt,
+						    use_insn->uses (),
+						    use->def ());
+	  can_propagate = false;
+	}
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	{
+	  fprintf (dump_file, "%s notes in:\n",
+		   can_propagate ? "updating" : "removing");
+	  dump_insn_slim (dump_file, use_insn->rtl ());
+	}
+      substitute_notes (use_insn, can_propagate);
+      insn_change *changes[] = { &use_change };
+      crtl->ssa->change_insns (changes);
+    }
+  else
+    substitute_notes (use_insn, can_propagate);
+}
+
+// We've decided to go ahead with the substitution and we've dealt with
+// all uses that occur in the patterns of non-debug insns.  Update all
+// other uses for the fact that m_def is about to disappear.
+void
+insn_combination::substitute_optional_uses (set_info *def)
+{
+  if (auto insn_uses = def->all_insn_uses ())
+    {
+      use_info *use = *insn_uses.begin ();
+      while (use)
+	{
+	  use_info *next_use = use->next_any_insn_use ();
+	  if (use->is_in_debug_insn ())
+	    substitute_debug_use (use);
+	  else if (!use->is_live_out_use ())
+	    substitute_note_uses (use);
+	  use = next_use;
+	}
+    }
+  for (use_info *use : def->phi_uses ())
+    substitute_optional_uses (use->phi ());
+}
+
+// Try to perform the substitution.  Return true on success.
+bool
+insn_combination::run ()
+{
+  if (dump_file && (dump_flags & TDF_DETAILS))
+    {
+      fprintf (dump_file, "\ntrying to combine definition of r%d in:\n",
+	       m_def->regno ());
+      dump_insn_slim (dump_file, m_def_insn->rtl ());
+      fprintf (dump_file, "into:\n");
+    }
+
+  if (!substitute_nondebug_uses (m_def))
+    return false;
+
+  auto def_change = insn_change::delete_insn (m_def_insn);
+
+  m_nondebug_changes.reserve (m_use_changes.length () + 1);
+  m_nondebug_changes.quick_push (&def_change);
+  m_nondebug_changes.splice (m_use_changes);
+
+  for (auto *use_change : m_use_changes)
+    if (!move_and_recog (*use_change))
+      return false;
+
+  if (!changes_are_worthwhile (m_nondebug_changes)
+      || !crtl->ssa->verify_insn_changes (m_nondebug_changes))
+    return false;
+
+  substitute_optional_uses (m_def);
+
+  confirm_change_group ();
+  crtl->ssa->change_insns (m_nondebug_changes);
+  return true;
+}
+
+// See whether INSN is a single_set that we can optimize.  Return the
+// set if so, otherwise return null.
+rtx
+late_combine::optimizable_set (insn_info *insn)
+{
+  if (!insn->can_be_optimized ()
+      || insn->is_asm ()
+      || insn->is_call ()
+      || insn->has_volatile_refs ()
+      || insn->has_pre_post_modify ()
+      || !can_move_insn_p (insn))
+    return NULL_RTX;
+
+  return single_set (insn->rtl ());
+}
+
+// Suppose that we can replace all uses of SET_DEST (SET) with SET_SRC (SET),
+// where SET occurs in INSN.  Return true if doing so is not likely to
+// increase register pressure.
+bool
+late_combine::check_register_pressure (insn_info *insn, rtx set)
+{
+  // Plain register-to-register moves do not establish a register class
+  // preference and have no well-defined effect on the register allocator.
+  // If changes in register class are needed, the register allocator is
+  // in the best position to place those changes.  If no change in
+  // register class is needed, then the optimization reduces register
+  // pressure if SET_SRC (set) was already live at uses, otherwise the
+  // optimization is pressure-neutral.
+  rtx src = SET_SRC (set);
+  if (REG_P (src))
+    return true;
+
+  // On the same basis, substituting a SET_SRC that contains a single
+  // pseudo register either reduces pressure or is pressure-neutral,
+  // subject to the constraints below.  We would need to do more
+  // analysis for SET_SRCs that use more than one pseudo register.
+  unsigned int nregs = 0;
+  for (auto *use : insn->uses ())
+    if (use->is_reg ()
+	&& !HARD_REGISTER_NUM_P (use->regno ())
+	&& !use->only_occurs_in_notes ())
+      if (++nregs > 1)
+	return false;
+
+  // If there are no pseudo registers in SET_SRC then the optimization
+  // should improve register pressure.
+  if (nregs == 0)
+    return true;
+
+  // We'd be substituting (set (reg R1) SRC) where SRC is known to
+  // contain a single pseudo register R2.  Assume for simplicity that
+  // each new use of R2 would need to be in the same class C as the
+  // current use of R2.  If, for a realistic allocation, C is a
+  // non-strict superset of the R1's register class, the effect on
+  // register pressure should be positive or neutral.  If instead
+  // R1 occupies a different register class from R2, or if R1 has
+  // more allocation freedom than R2, then there's a higher risk that
+  // the effect on register pressure could be negative.
+  //
+  // First use constrain_operands to get the most likely choice of
+  // alternative.  For simplicity, just handle the case where the
+  // output operand is operand 0.
+  extract_insn (insn->rtl ());
+  rtx dest = SET_DEST (set);
+  if (recog_data.n_operands == 0
+      || recog_data.operand[0] != dest)
+    return false;
+
+  if (!constrain_operands (0, get_enabled_alternatives (insn->rtl ())))
+    return false;
+
+  preprocess_constraints (insn->rtl ());
+  auto *alt = which_op_alt ();
+  auto dest_class = alt[0].cl;
+
+  // Check operands 1 and above.
+  auto check_src = [&](unsigned int i)
+    {
+      if (recog_data.is_operator[i])
+	return true;
+
+      rtx op = recog_data.operand[i];
+      if (CONSTANT_P (op))
+	return true;
+
+      if (SUBREG_P (op))
+	op = SUBREG_REG (op);
+      if (REG_P (op))
+	{
+	  if (HARD_REGISTER_P (op))
+	    {
+	      // We've already rejected uses of non-fixed hard registers.
+	      gcc_checking_assert (fixed_regs[REGNO (op)]);
+	      return true;
+	    }
+
+	  // Make sure that the source operand's class is at least as
+	  // permissive as the destination operand's class.
+	  auto src_class = alternative_class (alt, i);
+	  if (!reg_class_subset_p (dest_class, src_class))
+	    return false;
+
+	  // Make sure that the source operand occupies no more hard
+	  // registers than the destination operand.  This mostly matters
+	  // for subregs.
+	  if (targetm.class_max_nregs (dest_class, GET_MODE (dest))
+	      < targetm.class_max_nregs (src_class, GET_MODE (op)))
+	    return false;
+
+	  return true;
+	}
+      return false;
+    };
+  for (int i = 1; i < recog_data.n_operands; ++i)
+    if (!check_src (i))
+      return false;
+
+  return true;
+}
+
+// Check uses of DEF to see whether there is anything obvious that
+// prevents the substitution of SET into uses of DEF.
+bool
+late_combine::check_uses (set_info *def, rtx set)
+{
+  use_info *last_use = nullptr;
+  for (use_info *use : def->nondebug_insn_uses ())
+    {
+      insn_info *use_insn = use->insn ();
+
+      if (use->is_live_out_use ())
+	continue;
+      if (use->only_occurs_in_notes ())
+	continue;
+
+      // We cannot replace all uses if the value is live on exit.
+      if (use->is_artificial ())
+	return false;
+
+      // Avoid increasing the complexity of instructions that
+      // reference allocatable hard registers.
+      if (!REG_P (SET_SRC (set))
+	  && !reload_completed
+	  && (accesses_include_nonfixed_hard_registers (use_insn->uses ())
+	      || accesses_include_nonfixed_hard_registers (use_insn->defs ())))
+	return false;
+
+      // Don't substitute into a non-local goto, since it can then be
+      // treated as a jump to local label, e.g. in shorten_branches.
+      // ??? But this shouldn't be necessary.
+      if (use_insn->is_jump ()
+	  && find_reg_note (use_insn->rtl (), REG_NON_LOCAL_GOTO, NULL_RTX))
+	return false;
+
+      // Don't try to substitute into call arguments.
+      if (use_insn->is_call ())
+	{
+	  auto usage = CALL_INSN_FUNCTION_USAGE (use_insn->rtl ());
+	  for (rtx item = usage; item; item = XEXP (item, 1))
+	    {
+	      rtx x = XEXP (item, 0);
+	      if (GET_CODE (x) == USE
+		  && reg_overlap_mentioned_p (SET_DEST (set), XEXP (x, 0)))
+		return false;
+	    }
+	}
+
+      // We'll keep the uses in their original order, even if we move
+      // them relative to other instructions.  Make sure that non-final
+      // uses do not change any values that occur in the SET_SRC.
+      if (last_use && last_use->ebb () == use->ebb ())
+	{
+	  def_info *ultimate_def = look_through_degenerate_phi (def);
+	  if (insn_clobbers_resources (last_use->insn (),
+				       ultimate_def->insn ()->uses ()))
+	    return false;
+	}
+
+      last_use = use;
+    }
+
+  for (use_info *use : def->phi_uses ())
+    if (!use->phi ()->is_degenerate ()
+	|| !check_uses (use->phi (), set))
+      return false;
+
+  return true;
+}
+
+// Try to remove INSN by substituting a definition into all uses.
+// If the optimization moves any instructions before CURSOR, add those
+// instructions to the end of m_worklist.
+bool
+late_combine::combine_into_uses (insn_info *insn, insn_info *cursor)
+{
+  // For simplicity, don't try to handle sets of multiple hard registers.
+  // And for correctness, don't remove any assignments to the stack or
+  // frame pointers, since that would implicitly change the set of valid
+  // memory locations between this assignment and the next.
+  //
+  // Removing assignments to the hard frame pointer would invalidate
+  // backtraces.
+  set_info *def = single_set_info (insn);
+  if (!def
+      || !def->is_reg ()
+      || def->regno () == STACK_POINTER_REGNUM
+      || def->regno () == FRAME_POINTER_REGNUM
+      || def->regno () == HARD_FRAME_POINTER_REGNUM)
+    return false;
+
+  rtx set = optimizable_set (insn);
+  if (!set)
+    return false;
+
+  // For simplicity, don't try to handle subreg destinations.
+  rtx dest = SET_DEST (set);
+  if (!REG_P (dest) || def->regno () != REGNO (dest))
+    return false;
+
+  // Don't prolong the live ranges of allocatable hard registers, or put
+  // them into more complicated instructions.  Failing to prevent this
+  // could lead to spill failures, or at least to worst register allocation.
+  if (!reload_completed
+      && accesses_include_nonfixed_hard_registers (insn->uses ()))
+    return false;
+
+  if (!reload_completed && !check_register_pressure (insn, set))
+    return false;
+
+  if (!check_uses (def, set))
+    return false;
+
+  insn_combination combination (def, SET_DEST (set), SET_SRC (set));
+  if (!combination.run ())
+    return false;
+
+  for (auto *use_change : combination.use_changes ())
+    if (*use_change->insn () < *cursor)
+      m_worklist.safe_push (use_change->insn ());
+    else
+      break;
+  return true;
+}
+
+// Run the pass on function FN.
+unsigned int
+late_combine::execute (function *fn)
+{
+  // Initialization.
+  calculate_dominance_info (CDI_DOMINATORS);
+  df_analyze ();
+  crtl->ssa = new rtl_ssa::function_info (fn);
+  // Don't allow memory_operand to match volatile MEMs.
+  init_recog_no_volatile ();
+
+  insn_info *insn = *crtl->ssa->nondebug_insns ().begin ();
+  while (insn)
+    {
+      if (!insn->is_artificial ())
+	{
+	  insn_info *prev = insn->prev_nondebug_insn ();
+	  if (combine_into_uses (insn, prev))
+	    {
+	      // Any instructions that get added to the worklist were
+	      // previously after PREV.  Thus if we were able to move
+	      // an instruction X before PREV during one combination,
+	      // X cannot depend on any instructions that we move before
+	      // PREV during subsequent combinations.  This means that
+	      // the worklist should be free of backwards dependencies,
+	      // even if it isn't necessarily in RPO.
+	      for (unsigned int i = 0; i < m_worklist.length (); ++i)
+		combine_into_uses (m_worklist[i], prev);
+	      m_worklist.truncate (0);
+	      insn = prev;
+	    }
+	}
+      insn = insn->next_nondebug_insn ();
+    }
+
+  // Finalization.
+  if (crtl->ssa->perform_pending_updates ())
+    cleanup_cfg (0);
+  // Make recognizer allow volatile MEMs again.
+  init_recog ();
+  free_dominance_info (CDI_DOMINATORS);
+  return 0;
+}
+
+class pass_late_combine : public rtl_opt_pass
+{
+public:
+  pass_late_combine (gcc::context *ctxt)
+    : rtl_opt_pass (pass_data_late_combine, ctxt)
+  {}
+
+  // opt_pass methods:
+  opt_pass *clone () override { return new pass_late_combine (m_ctxt); }
+  bool gate (function *) override { return flag_late_combine_instructions; }
+  unsigned int execute (function *) override;
+};
+
+unsigned int
+pass_late_combine::execute (function *fn)
+{
+  return late_combine ().execute (fn);
+}
+
+} // end namespace
+
+// Create a new CC fusion pass instance.
+
+rtl_opt_pass *
+make_pass_late_combine (gcc::context *ctxt)
+{
+  return new pass_late_combine (ctxt);
+}
diff --git a/gcc/passes.def b/gcc/passes.def
index 1cbbd413097..1d56a0d4799 100644
--- a/gcc/passes.def
+++ b/gcc/passes.def
@@ -492,6 +492,7 @@ along with GCC; see the file COPYING3.  If not see
       NEXT_PASS (pass_initialize_regs);
       NEXT_PASS (pass_ud_rtl_dce);
       NEXT_PASS (pass_combine);
+      NEXT_PASS (pass_late_combine);
       NEXT_PASS (pass_if_after_combine);
       NEXT_PASS (pass_jump_after_combine);
       NEXT_PASS (pass_partition_blocks);
@@ -511,6 +512,7 @@ along with GCC; see the file COPYING3.  If not see
       NEXT_PASS (pass_postreload);
       PUSH_INSERT_PASSES_WITHIN (pass_postreload)
 	  NEXT_PASS (pass_postreload_cse);
+	  NEXT_PASS (pass_late_combine);
 	  NEXT_PASS (pass_gcse2);
 	  NEXT_PASS (pass_split_after_reload);
 	  NEXT_PASS (pass_ree);
diff --git a/gcc/testsuite/gcc.dg/ira-shrinkwrap-prep-1.c b/gcc/testsuite/gcc.dg/ira-shrinkwrap-prep-1.c
index f290b9ccbdc..a95637abbe5 100644
--- a/gcc/testsuite/gcc.dg/ira-shrinkwrap-prep-1.c
+++ b/gcc/testsuite/gcc.dg/ira-shrinkwrap-prep-1.c
@@ -25,5 +25,5 @@ bar (long a)
 }
 
 /* { dg-final { scan-rtl-dump "Will split live ranges of parameters" "ira" } } */
-/* { dg-final { scan-rtl-dump "Split live-range of register" "ira" { xfail *-*-* } } } */
+/* { dg-final { scan-rtl-dump "Split live-range of register" "ira" { xfail { ! aarch64*-*-* } } } } */
 /* { dg-final { scan-rtl-dump "Performing shrink-wrapping" "pro_and_epilogue" { xfail powerpc*-*-* } } } */
diff --git a/gcc/testsuite/gcc.dg/ira-shrinkwrap-prep-2.c b/gcc/testsuite/gcc.dg/ira-shrinkwrap-prep-2.c
index 6212c95585d..0690e036eaa 100644
--- a/gcc/testsuite/gcc.dg/ira-shrinkwrap-prep-2.c
+++ b/gcc/testsuite/gcc.dg/ira-shrinkwrap-prep-2.c
@@ -30,6 +30,6 @@ bar (long a)
 }
 
 /* { dg-final { scan-rtl-dump "Will split live ranges of parameters" "ira" } } */
-/* { dg-final { scan-rtl-dump "Split live-range of register" "ira" { xfail *-*-* } } } */
+/* { dg-final { scan-rtl-dump "Split live-range of register" "ira" { xfail { ! aarch64*-*-* } } } } */
 /* XFAIL due to PR70681.  */ 
 /* { dg-final { scan-rtl-dump "Performing shrink-wrapping" "pro_and_epilogue" { xfail arm*-*-* powerpc*-*-* } } } */
diff --git a/gcc/testsuite/gcc.dg/stack-check-4.c b/gcc/testsuite/gcc.dg/stack-check-4.c
index b0c5c61972f..052d2abc2f1 100644
--- a/gcc/testsuite/gcc.dg/stack-check-4.c
+++ b/gcc/testsuite/gcc.dg/stack-check-4.c
@@ -20,7 +20,7 @@
    scan for.   We scan for both the positive and negative cases.  */
 
 /* { dg-do compile } */
-/* { dg-options "-O2 -fstack-clash-protection -fdump-rtl-pro_and_epilogue -fno-optimize-sibling-calls" } */
+/* { dg-options "-O2 -fstack-clash-protection -fdump-rtl-pro_and_epilogue -fno-optimize-sibling-calls -fno-shrink-wrap" } */
 /* { dg-require-effective-target supports_stack_clash_protection } */
 
 extern void arf (char *);
diff --git a/gcc/testsuite/gcc.target/aarch64/pr106594_1.c b/gcc/testsuite/gcc.target/aarch64/pr106594_1.c
new file mode 100644
index 00000000000..71bcafcb44f
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/pr106594_1.c
@@ -0,0 +1,20 @@
+/* { dg-options "-O2" } */
+
+extern const int constellation_64qam[64];
+
+void foo(int nbits,
+         const char *p_src,
+         int *p_dst) {
+
+  while (nbits > 0U) {
+    char first = *p_src++;
+
+    char index1 = ((first & 0x3) << 4) | (first >> 4);
+
+    *p_dst++ = constellation_64qam[index1];
+
+    nbits--;
+  }
+}
+
+/* { dg-final { scan-assembler {(?n)\tldr\t.*\[x[0-9]+, w[0-9]+, sxtw #?2\]} } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/cond_asrd_3.c b/gcc/testsuite/gcc.target/aarch64/sve/cond_asrd_3.c
index 0d620a30d5d..b537c6154a3 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/cond_asrd_3.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/cond_asrd_3.c
@@ -27,9 +27,9 @@ TEST_ALL (DEF_LOOP)
 /* { dg-final { scan-assembler-times {\tasrd\tz[0-9]+\.h, p[0-7]/m, z[0-9]+\.h, #4\n} 2 } } */
 /* { dg-final { scan-assembler-times {\tasrd\tz[0-9]+\.s, p[0-7]/m, z[0-9]+\.s, #4\n} 1 } } */
 
-/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.b, p[0-7]/z, z[0-9]+\.b\n} 3 { xfail *-*-* } } } */
-/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.h, p[0-7]/z, z[0-9]+\.h\n} 2 { xfail *-*-* } } } */
-/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.s, p[0-7]/z, z[0-9]+\.s\n} 1 { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.b, p[0-7]/z, z[0-9]+\.b\n} 3 } } */
+/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.h, p[0-7]/z, z[0-9]+\.h\n} 2 } } */
+/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.s, p[0-7]/z, z[0-9]+\.s\n} 1 } } */
 
-/* { dg-final { scan-assembler-not {\tmov\tz} { xfail *-*-* } } } */
-/* { dg-final { scan-assembler-not {\tsel\t} { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-not {\tmov\tz} } } */
+/* { dg-final { scan-assembler-not {\tsel\t} } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/cond_convert_3.c b/gcc/testsuite/gcc.target/aarch64/sve/cond_convert_3.c
index a294effd4a9..cff806c278d 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/cond_convert_3.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/cond_convert_3.c
@@ -30,11 +30,9 @@ TEST_ALL (DEF_LOOP)
 /* { dg-final { scan-assembler-times {\tscvtf\tz[0-9]+\.d, p[0-7]/m,} 1 } } */
 /* { dg-final { scan-assembler-times {\tucvtf\tz[0-9]+\.d, p[0-7]/m,} 1 } } */
 
-/* Really we should be able to use MOVPRFX /z here, but at the moment
-   we're relying on combine to merge a SEL and an arithmetic operation,
-   and the SEL doesn't allow the "false" value to be zero when the "true"
-   value is a register.  */
-/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+, z[0-9]+\n} 6 } } */
+/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.h, p[0-7]/z,} 2 } } */
+/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.s, p[0-7]/z,} 2 } } */
+/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.d, p[0-7]/z,} 2 } } */
 
 /* { dg-final { scan-assembler-not {\tmov\tz[^\n]*z} } } */
 /* { dg-final { scan-assembler-not {\tsel\t} } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/cond_convert_6.c b/gcc/testsuite/gcc.target/aarch64/sve/cond_convert_6.c
index 6541a2ea49d..abf0a2e832f 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/cond_convert_6.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/cond_convert_6.c
@@ -30,11 +30,9 @@ TEST_ALL (DEF_LOOP)
 /* { dg-final { scan-assembler-times {\tfcvtzs\tz[0-9]+\.d, p[0-7]/m,} 1 } } */
 /* { dg-final { scan-assembler-times {\tfcvtzu\tz[0-9]+\.d, p[0-7]/m,} 1 } } */
 
-/* Really we should be able to use MOVPRFX /z here, but at the moment
-   we're relying on combine to merge a SEL and an arithmetic operation,
-   and the SEL doesn't allow the "false" value to be zero when the "true"
-   value is a register.  */
-/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+, z[0-9]+\n} 6 } } */
+/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.h, p[0-7]/z,} 2 } } */
+/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.s, p[0-7]/z,} 2 } } */
+/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.d, p[0-7]/z,} 2 } } */
 
 /* { dg-final { scan-assembler-not {\tmov\tz[^\n]*z} } } */
 /* { dg-final { scan-assembler-not {\tsel\t} } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/cond_fabd_5.c b/gcc/testsuite/gcc.target/aarch64/sve/cond_fabd_5.c
index e66477b3bce..401201b315a 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/cond_fabd_5.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/cond_fabd_5.c
@@ -24,12 +24,9 @@ TEST_ALL (DEF_LOOP)
 /* { dg-final { scan-assembler-times {\tfabd\tz[0-9]+\.s, p[0-7]/m,} 1 } } */
 /* { dg-final { scan-assembler-times {\tfabd\tz[0-9]+\.d, p[0-7]/m,} 1 } } */
 
-/* Really we should be able to use MOVPRFX /Z here, but at the moment
-   we're relying on combine to merge a SEL and an arithmetic operation,
-   and the SEL doesn't allow zero operands.  */
-/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.h, p[0-7]/z, z[0-9]+\.h\n} 1 { xfail *-*-* } } } */
-/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.s, p[0-7]/z, z[0-9]+\.s\n} 1 { xfail *-*-* } } } */
-/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.d, p[0-7]/z, z[0-9]+\.d\n} 1 { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.h, p[0-7]/z, z[0-9]+\.h\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.s, p[0-7]/z, z[0-9]+\.s\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.d, p[0-7]/z, z[0-9]+\.d\n} 1 } } */
 
 /* { dg-final { scan-assembler-not {\tmov\tz[^,]*z} } } */
-/* { dg-final { scan-assembler-not {\tsel\t} { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-not {\tsel\t} } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/cond_unary_4.c b/gcc/testsuite/gcc.target/aarch64/sve/cond_unary_4.c
index a491f899088..cbb957bffa4 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/cond_unary_4.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/cond_unary_4.c
@@ -52,15 +52,10 @@ TEST_ALL (DEF_LOOP)
 /* { dg-final { scan-assembler-times {\tfneg\tz[0-9]+\.s, p[0-7]/m,} 1 } } */
 /* { dg-final { scan-assembler-times {\tfneg\tz[0-9]+\.d, p[0-7]/m,} 1 } } */
 
-/* Really we should be able to use MOVPRFX /z here, but at the moment
-   we're relying on combine to merge a SEL and an arithmetic operation,
-   and the SEL doesn't allow the "false" value to be zero when the "true"
-   value is a register.  */
-/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+, z[0-9]+\n} 7 } } */
-/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.b, p[0-7]/z, z[0-9]+\.b} 1 } } */
-/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.h, p[0-7]/z, z[0-9]+\.h} 2 } } */
-/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.s, p[0-7]/z, z[0-9]+\.s} 2 } } */
-/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.d, p[0-7]/z, z[0-9]+\.d} 2 } } */
+/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.b, p[0-7]/z, z[0-9]+\.b} 2 } } */
+/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.h, p[0-7]/z, z[0-9]+\.h} 4 } } */
+/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.s, p[0-7]/z, z[0-9]+\.s} 4 } } */
+/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.d, p[0-7]/z, z[0-9]+\.d} 4 } } */
 
 /* { dg-final { scan-assembler-not {\tmov\tz[^\n]*z} } } */
 /* { dg-final { scan-assembler-not {\tsel\t} } } */
diff --git a/gcc/tree-pass.h b/gcc/tree-pass.h
index 29267589eeb..4bedaa40aa2 100644
--- a/gcc/tree-pass.h
+++ b/gcc/tree-pass.h
@@ -614,6 +614,7 @@ extern rtl_opt_pass *make_pass_branch_prob (gcc::context *ctxt);
 extern rtl_opt_pass *make_pass_value_profile_transformations (gcc::context
 							      *ctxt);
 extern rtl_opt_pass *make_pass_postreload_cse (gcc::context *ctxt);
+extern rtl_opt_pass *make_pass_late_combine (gcc::context *ctxt);
 extern rtl_opt_pass *make_pass_gcse2 (gcc::context *ctxt);
 extern rtl_opt_pass *make_pass_split_after_reload (gcc::context *ctxt);
 extern rtl_opt_pass *make_pass_thread_prologue_and_epilogue (gcc::context
Jeff Law Jan. 8, 2024, 5:03 a.m. UTC | #5
On 1/5/24 10:35, Richard Sandiford wrote:
> Jeff Law <jeffreyalaw@gmail.com> writes:
>> On 10/24/23 12:49, Richard Sandiford wrote:
>>> This patch adds a combine pass that runs late in the pipeline.
>>> There are two instances: one between combine and split1, and one
>>> after postreload.
>> So have you done any investigation on cases caught by your new pass
>> between combine and split1 to characterize them?  In particular do they
>> point at solvable problems in combine?  Or do you forsee this subsuming
>> the old combiner pass at some point in the distant future?
> 
> Examples like the PR are the main motivation for the pre-RA pass.
> There we had an extension that could be combined into an address,
> but no longer was after GCC 13.
> 
> The PR itself could in principle be fixed in combine (various
> approaches were suggested, but not accepted).  But the same problem
> applies to multiple uses of extensions.  fwprop can't handle it because
> individual propagations are not a win in isolation.  And combine has
> a limit of combining 4 insns (with a maximum of 2 output insns, IIRC).
> So I don't think either of the existing passes scale to the general case.
Oh, that discussion :(


>
> 
>> rth and I sketched out an SSA based RTL combine at some point in the
>> deep past.  The key goal we were trying to achieve was combining across
>> blocks.  We didn't have a functioning RTL SSA form at the time, so it
>> never went to any implementation work.  It looks like yours would solve
>> the class of problems rth and I were considering.
> 
> Yeah, I do see some cases where combining across blocks helps.
> The case above is one example of that.  Another is:
Great.  The cases I think rth and I were looking at were inspired by a 
talk at one of the early Cauldrons -- whichever one we had in 
California.  Someone did a talk on cross-block combinations, mostly 
motivated by missed combinations on ARM.


> 
>>> The patch therefore enables the pass by default only on AArch64.
>>> However, I did test the patch with it enabled on x86_64-linux-gnu
>>> as well, which was useful for debugging.
>>>
>>> Bootstrapped & regression-tested on aarch64-linux-gnu and
>>> x86_64-linux-gnu (as posted, with no regressions, and with the
>>> pass enabled by default, with some gcc.target/i386 regressions).
>>> OK to install?
>> I'm going to adjust this slightly so that it's enabled across the board
>> and throw it into the tester tomorrow (tester is busy tonight).  Even if
>> we make it opt-in on a per-port basis, the alternate target testing does
>> seems to find stuff that needs fixing ;-)
> 
> Thanks!  As per our off-list discussion, the cris-elf failures showed
> up a bug in the handling of call arguments.  Here's an updated version
> with that fixed.
Perfect.  I'll spin that in the tester overnight.  Hopefully that fixes 
the other ports that failed to build (as opposed to the testsuite 
failures which I think you've covered as not an issue with the new 
combine pass, but which are instead port issues.




> 
> Yeah.  If I'd posted this earlier in stage 1 (rather than October),
> I might have tried teaching shorten_branches how to handle this.
> But it felt like it could be a bit of a rabbit hole at this stage.
> 
>> Nothing jumps out at horribly wrong.  You might want/need to reject
>> frame related insns in optimizable_set, though I guess if the dwarf2
>> writer isn't complaining, then we haven't mucked things up too bad.
> 
> Ah, yeah, I wondered about that.  But I suppose most prologue insns
> don't really create combination opportunities, since most of them
> either set up the stack (which we wouldn't combine anyway) or sink
> incoming data.  So if there cases where this is a problem, it might
> unfortunately be a case of "see what breaks".
The other issue that's been in the back of my mind is costing.  But I 
think the model here is combine without regards to cost.

Let's let the tester chew on the updated version overnight, see what 
else may pop out, but barring any big surprises, I think we're good to go.



jeff
Richard Sandiford Jan. 8, 2024, 11:52 a.m. UTC | #6
Jeff Law <jeffreyalaw@gmail.com> writes:
> The other issue that's been in the back of my mind is costing.  But I 
> think the model here is combine without regards to cost.

No, it does take costing into account.  For size, it's the usual
"sum up the before and after insn costs and see which one is lower".
For speed, the costs are weighted by execution frequency, so e.g.
two insns of cost 4 in the same block can be combined into a single
instruction of cost 8, but a hoisted invariant can only be combined
into a loop body instruction if the loop body instruction's cost
doesn't increase significantly.

This is done by rtl_ssa::changes_are_worthwhile.

Thanks,
Richard
Jeff Law Jan. 8, 2024, 4:14 p.m. UTC | #7
On 1/8/24 04:52, Richard Sandiford wrote:
> Jeff Law <jeffreyalaw@gmail.com> writes:
>> The other issue that's been in the back of my mind is costing.  But I
>> think the model here is combine without regards to cost.
> 
> No, it does take costing into account.  For size, it's the usual
> "sum up the before and after insn costs and see which one is lower".
> For speed, the costs are weighted by execution frequency, so e.g.
> two insns of cost 4 in the same block can be combined into a single
> instruction of cost 8, but a hoisted invariant can only be combined
> into a loop body instruction if the loop body instruction's cost
> doesn't increase significantly.
> 
> This is done by rtl_ssa::changes_are_worthwhile.
You're absolutely correct.  My bad.

Interesting that's exactly where we do have a notable concern.

If you remember, there were a few ports that failed to build 
newlib/libgcc that we initially ignored.  I went back and looked at one 
(arc-elf).

What appears to be happening for arc-elf is we're testing to see if the 
change is profitable.  On arc-elf the costing model is highly dependent 
on the length of the insns.

We've got a very reasonable looking insn:

> (insn 3674 753 2851 98 (set (reg/v:SI 18 r18 [orig:300 inex ] [300])
>         (ashift:SI (reg:SI 27 fp [548])
>             (const_int 4 [0x4]))) "../../../..//newlib-cygwin/newlib/libc/stdlib/gdtoa-gdtoa.c":437:10 120 {*ashlsi3_insn}
>      (nil))

We call rtl_ssa::changes_are_profitable -> insn_cost -> arc_insn_cost -> 
get_attr_length -> get_attr_length_1 -> insn_default_length

insn_default_length grubs around looking at the operands via recog_data 
which appears to be stale:



> (gdb) p debug_rtx(recog_data.operand[0])
> (reg/v:SI 18 r18 [orig:300 inex ] [300])
> $4 = void
> (gdb) p debug_rtx(recog_data.operand[1])
> (reg/v:SI 3 r3 [orig:300 inex ] [300])
> $5 = void
> (gdb) p debug_rtx(recog_data.operand[2])
> 
> Program received signal SIGSEGV, Segmentation fault.
> 0x0000000001432955 in rtx_writer::print_rtx (this=0x7fffffffe0e0, in_rtx=0xabababababababab) at /home/jlaw/test/gcc/gcc/print-rtl.cc:809
> 809       else if (GET_CODE (in_rtx) > NUM_RTX_CODE)

Note the 0xabab.... That was accessing operand #2, which should have 
been (const_int 4).

Sure enough if I force re-recognition then look at the recog_data I get 
the right values.

After LRA we have:

> (insn 753 2434 3674 98 (set (reg/v:SI 3 r3 [orig:300 inex ] [300])
>         (ashift:SI (reg:SI 27 fp [548])
>             (const_int 4 [0x4]))) "../../../..//newlib-cygwin/newlib/libc/stdlib/gdtoa-gdtoa.c":437:10 120 {*ashlsi3_insn}
>      (nil))
> (insn 3674 753 2851 98 (set (reg/v:SI 18 r18 [orig:300 inex ] [300])
>         (reg/v:SI 3 r3 [orig:300 inex ] [300])) "../../../..//newlib-cygwin/newlib/libc/stdlib/gdtoa-gdtoa.c":437:10 3 {*movsi_insn}
>      (nil))

In the emergency dump in late_combine2 (so cleanup hasn't been done):

> (insn 753 2434 3674 98 (set (reg/v:SI 3 r3 [orig:300 inex ] [300])
>         (ashift:SI (reg:SI 27 fp [548])
>             (const_int 4 [0x4]))) "../../../..//newlib-cygwin/newlib/libc/stdlib/gdtoa-gdtoa.c":437:10 120 {*ashlsi3_insn}
>      (nil))
> (insn 3674 753 2851 98 (set (reg/v:SI 18 r18 [orig:300 inex ] [300])
>         (ashift:SI (reg:SI 27 fp [548])
>             (const_int 4 [0x4]))) "../../../..//newlib-cygwin/newlib/libc/stdlib/gdtoa-gdtoa.c":437:10 120 {*ashlsi3_insn}
>      (nil))


Which brings us to the question.  If we change the form of an insn, then 
ask for its cost, don't we need to make sure the insn is re-recognized 
as the costing function may do things like query the insn's length which 
would use cached recog_data?

jeff
Richard Sandiford Jan. 8, 2024, 4:59 p.m. UTC | #8
Jeff Law <jlaw@ventanamicro.com> writes:
> On 1/8/24 04:52, Richard Sandiford wrote:
>> Jeff Law <jeffreyalaw@gmail.com> writes:
>>> The other issue that's been in the back of my mind is costing.  But I
>>> think the model here is combine without regards to cost.
>> 
>> No, it does take costing into account.  For size, it's the usual
>> "sum up the before and after insn costs and see which one is lower".
>> For speed, the costs are weighted by execution frequency, so e.g.
>> two insns of cost 4 in the same block can be combined into a single
>> instruction of cost 8, but a hoisted invariant can only be combined
>> into a loop body instruction if the loop body instruction's cost
>> doesn't increase significantly.
>> 
>> This is done by rtl_ssa::changes_are_worthwhile.
> You're absolutely correct.  My bad.
>
> Interesting that's exactly where we do have a notable concern.

Gah.

> If you remember, there were a few ports that failed to build 
> newlib/libgcc that we initially ignored.  I went back and looked at one 
> (arc-elf).
>
> What appears to be happening for arc-elf is we're testing to see if the 
> change is profitable.  On arc-elf the costing model is highly dependent 
> on the length of the insns.
>
> We've got a very reasonable looking insn:
>
>> (insn 3674 753 2851 98 (set (reg/v:SI 18 r18 [orig:300 inex ] [300])
>>         (ashift:SI (reg:SI 27 fp [548])
>>             (const_int 4 [0x4]))) "../../../..//newlib-cygwin/newlib/libc/stdlib/gdtoa-gdtoa.c":437:10 120 {*ashlsi3_insn}
>>      (nil))
>
> We call rtl_ssa::changes_are_profitable -> insn_cost -> arc_insn_cost -> 
> get_attr_length -> get_attr_length_1 -> insn_default_length
>
> insn_default_length grubs around looking at the operands via recog_data 
> which appears to be stale:
>
>
>
>> (gdb) p debug_rtx(recog_data.operand[0])
>> (reg/v:SI 18 r18 [orig:300 inex ] [300])
>> $4 = void
>> (gdb) p debug_rtx(recog_data.operand[1])
>> (reg/v:SI 3 r3 [orig:300 inex ] [300])
>> $5 = void
>> (gdb) p debug_rtx(recog_data.operand[2])
>> 
>> Program received signal SIGSEGV, Segmentation fault.
>> 0x0000000001432955 in rtx_writer::print_rtx (this=0x7fffffffe0e0, in_rtx=0xabababababababab) at /home/jlaw/test/gcc/gcc/print-rtl.cc:809
>> 809       else if (GET_CODE (in_rtx) > NUM_RTX_CODE)
>
> Note the 0xabab.... That was accessing operand #2, which should have 
> been (const_int 4).
>
> Sure enough if I force re-recognition then look at the recog_data I get 
> the right values.
>
> After LRA we have:
>
>> (insn 753 2434 3674 98 (set (reg/v:SI 3 r3 [orig:300 inex ] [300])
>>         (ashift:SI (reg:SI 27 fp [548])
>>             (const_int 4 [0x4]))) "../../../..//newlib-cygwin/newlib/libc/stdlib/gdtoa-gdtoa.c":437:10 120 {*ashlsi3_insn}
>>      (nil))
>> (insn 3674 753 2851 98 (set (reg/v:SI 18 r18 [orig:300 inex ] [300])
>>         (reg/v:SI 3 r3 [orig:300 inex ] [300])) "../../../..//newlib-cygwin/newlib/libc/stdlib/gdtoa-gdtoa.c":437:10 3 {*movsi_insn}
>>      (nil))
>
> In the emergency dump in late_combine2 (so cleanup hasn't been done):
>
>> (insn 753 2434 3674 98 (set (reg/v:SI 3 r3 [orig:300 inex ] [300])
>>         (ashift:SI (reg:SI 27 fp [548])
>>             (const_int 4 [0x4]))) "../../../..//newlib-cygwin/newlib/libc/stdlib/gdtoa-gdtoa.c":437:10 120 {*ashlsi3_insn}
>>      (nil))
>> (insn 3674 753 2851 98 (set (reg/v:SI 18 r18 [orig:300 inex ] [300])
>>         (ashift:SI (reg:SI 27 fp [548])
>>             (const_int 4 [0x4]))) "../../../..//newlib-cygwin/newlib/libc/stdlib/gdtoa-gdtoa.c":437:10 120 {*ashlsi3_insn}
>>      (nil))
>
>
> Which brings us to the question.  If we change the form of an insn, then 
> ask for its cost, don't we need to make sure the insn is re-recognized 
> as the costing function may do things like query the insn's length which 
> would use cached recog_data?

Yeah, this only happens once we've verified that the new instruction
is valid.  And it looks from the emergency dump above that the insn
code has been correctly updated to *ashlsi3_insn.

This is a bit of a hopeful stab, but is the problem that recog_data still
had the previous contents of insn 3674, and so extract_insn_cached wrongly
thought that it doesn't need to do anything?  If so, does something like:

diff --git a/gcc/recog.cc b/gcc/recog.cc
index a6799e3f5e6..8ba63c78179 100644
--- a/gcc/recog.cc
+++ b/gcc/recog.cc
@@ -267,6 +267,8 @@ validate_change_1 (rtx object, rtx *loc, rtx new_rtx, bool in_group,
 	 case invalid.  */
       changes[num_changes].old_code = INSN_CODE (object);
       INSN_CODE (object) = -1;
+      if (recog_data.insn == object)
+	recog_data.insn = nullptr;
     }
 
   num_changes++;

fix it?  I suppose there's an argument that this belongs in whatever code
sets INSN_CODE to a new nonnegative value (so recog_level2 for RTL-SSA).
But doing it in validate_change_1 seems more robust, since anything
calling that function is considering changing the insn code.

Thanks for debugging the problem.

Richard
Jeff Law Jan. 8, 2024, 5:10 p.m. UTC | #9
On 1/8/24 09:59, Richard Sandiford wrote:

> 
> This is a bit of a hopeful stab, but is the problem that recog_data still
> had the previous contents of insn 3674, and so extract_insn_cached wrongly
> thought that it doesn't need to do anything?  If so, does something like:
> 
> diff --git a/gcc/recog.cc b/gcc/recog.cc
> index a6799e3f5e6..8ba63c78179 100644
> --- a/gcc/recog.cc
> +++ b/gcc/recog.cc
> @@ -267,6 +267,8 @@ validate_change_1 (rtx object, rtx *loc, rtx new_rtx, bool in_group,
>   	 case invalid.  */
>         changes[num_changes].old_code = INSN_CODE (object);
>         INSN_CODE (object) = -1;
> +      if (recog_data.insn == object)
> +	recog_data.insn = nullptr;
>       }
>   
>     num_changes++;
> 
> fix it?  I suppose there's an argument that this belongs in whatever code
> sets INSN_CODE to a new nonnegative value (so recog_level2 for RTL-SSA).
> But doing it in validate_change_1 seems more robust, since anything
> calling that function is considering changing the insn code.
Nope, doesn't help at all.   I'd briefly put a reset of the INSN_CODE 
and a call to recog_memoized in the costing path of rtl-ssa to see if 
that would allow things to move forward, but it failed miserably.

I'll pass along the .i file separately.  Hopefully it'll fail for you 
and you can debug.  But given failure depends on stale bits in 
recog_data, it may not.

Jeff
Richard Sandiford Jan. 8, 2024, 7:11 p.m. UTC | #10
Jeff Law <jeffreyalaw@gmail.com> writes:
> On 1/8/24 09:59, Richard Sandiford wrote:
>> This is a bit of a hopeful stab, but is the problem that recog_data still
>> had the previous contents of insn 3674, and so extract_insn_cached wrongly
>> thought that it doesn't need to do anything?  If so, does something like:
>> 
>> diff --git a/gcc/recog.cc b/gcc/recog.cc
>> index a6799e3f5e6..8ba63c78179 100644
>> --- a/gcc/recog.cc
>> +++ b/gcc/recog.cc
>> @@ -267,6 +267,8 @@ validate_change_1 (rtx object, rtx *loc, rtx new_rtx, bool in_group,
>>   	 case invalid.  */
>>         changes[num_changes].old_code = INSN_CODE (object);
>>         INSN_CODE (object) = -1;
>> +      if (recog_data.insn == object)
>> +	recog_data.insn = nullptr;
>>       }
>>   
>>     num_changes++;
>> 
>> fix it?  I suppose there's an argument that this belongs in whatever code
>> sets INSN_CODE to a new nonnegative value (so recog_level2 for RTL-SSA).
>> But doing it in validate_change_1 seems more robust, since anything
>> calling that function is considering changing the insn code.
> Nope, doesn't help at all.

Yeah, in hindsight it was a dull guess.  recog resets recog_data.insn
itself, so doing it here wasn't likely to help.

> I'd briefly put a reset of the INSN_CODE 
> and a call to recog_memoized in the costing path of rtl-ssa to see if 
> that would allow things to move forward, but it failed miserably.
>
> I'll pass along the .i file separately.  Hopefully it'll fail for you 
> and you can debug.  But given failure depends on stale bits in 
> recog_data, it may not.

Thanks.  That led me to the following, which seems a bit more plausible
than my first attempt.  I'll test it on aarch64-linux-gnu and
x86_64-linux-gnu.  Does it look OK?

Richard


insn_info::calculate_cost computes the costs of unchanged insns lazily,
so that we don't waste time costing instructions that we never try to
change.  It therefore has to revert any in-progress changes, cost the
original instruction, and then reapply the in-progress changes.

However, doing that temporarily changes the INSN_CODEs, and so
temporarily invalidates any information cached about the insn.
This means that insn_cost can end up looking at stale data,
or can cache data that becomes stale once the in-progress
changes are reapplied.

This could in principle happen for any use of temporarily_undo_changes
and redo_changes.  Those functions in turn share a common subroutine,
swap_change, so that seems like the best place to fix this.

gcc/
	* recog.cc (swap_change): Invalidate the cached recog_data if it
	describes an insn that is being changed.
---
 gcc/recog.cc | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/gcc/recog.cc b/gcc/recog.cc
index a6799e3f5e6..56370e40e01 100644
--- a/gcc/recog.cc
+++ b/gcc/recog.cc
@@ -614,7 +614,11 @@ swap_change (int num)
   else
     std::swap (*changes[num].loc, changes[num].old);
   if (changes[num].object && !MEM_P (changes[num].object))
-    std::swap (INSN_CODE (changes[num].object), changes[num].old_code);
+    {
+      std::swap (INSN_CODE (changes[num].object), changes[num].old_code);
+      if (recog_data.insn == changes[num].object)
+	recog_data.insn = nullptr;
+    }
 }
 
 /* Temporarily undo all the changes numbered NUM and up, with a view
Jeff Law Jan. 8, 2024, 9:42 p.m. UTC | #11
On 1/8/24 12:11, Richard Sandiford wrote:

> 
> Thanks.  That led me to the following, which seems a bit more plausible
> than my first attempt.  I'll test it on aarch64-linux-gnu and
> x86_64-linux-gnu.  Does it look OK?
It looks reasonable to me.  I'm going to send another failure (ICE in 
finalize_new_accesses on a different target) separately.

Jeff
Richard Sandiford Jan. 10, 2024, 1:01 p.m. UTC | #12
Just a note that, following discussion on IRC, I'll pull this for
GCC 14 and resubmit for GCC 15.

There was also pushback on IRC about making the pass opt-in.
Enabling it for x86_64 would mean fixing RPAD to use a representation
that is more robust against recombination, but as you can imagine, it's
kind-of difficult for me to justify spending significant time fixing an
issue in the x86_64 port.  Jeff's testing suggested that there are also
latent issues in the older, less maintained ports.

So to get an idea for expectations: would it be a requirement that a
GCC 15 submission is enabled unconditionally and all known issues in
the ports fixed?

Thanks,
Richard
Richard Biener Jan. 10, 2024, 1:35 p.m. UTC | #13
On Wed, 10 Jan 2024, Richard Sandiford wrote:

> Just a note that, following discussion on IRC, I'll pull this for
> GCC 14 and resubmit for GCC 15.
> 
> There was also pushback on IRC about making the pass opt-in.
> Enabling it for x86_64 would mean fixing RPAD to use a representation
> that is more robust against recombination, but as you can imagine, it's
> kind-of difficult for me to justify spending significant time fixing an
> issue in the x86_64 port.  Jeff's testing suggested that there are also
> latent issues in the older, less maintained ports.
> 
> So to get an idea for expectations: would it be a requirement that a
> GCC 15 submission is enabled unconditionally and all known issues in
> the ports fixed?

Can you open a bugreport with the issue in RPAD, maybe outlining
what would need to be done?

I think x86 maintainers could opt to disable the pass - so it would
be opt-out.  It's reasonable to expect them to fix the backend given
there's nothing really wrong with the new pass, it just does
something that wasn't done before at that point?

Richard.
Jeff Law Jan. 10, 2024, 4:27 p.m. UTC | #14
On 1/10/24 06:35, Richard Biener wrote:
> 
> I think x86 maintainers could opt to disable the pass - so it would
> be opt-out.  It's reasonable to expect them to fix the backend given
> there's nothing really wrong with the new pass, it just does
> something that wasn't done before at that point?
That's been both Richard S and my experience so far -- it's exposing 
latent target issues (which we're pushing forward as independent fixes) 
as well as a few latent issues in various generic RTL bits (which I'll 
leave to Richard S to submit).  Nothing major though.

I'm a bit disappointed it's not going forward for gcc-14, but understand 
and will support the decision.

Jeff
Jeff Law Jan. 10, 2024, 4:40 p.m. UTC | #15
On 1/10/24 06:01, Richard Sandiford wrote:
> 
> So to get an idea for expectations: would it be a requirement that a
> GCC 15 submission is enabled unconditionally and all known issues in
> the ports fixed?
I don't think we need to fix those latent port issues as a hard 
requirement.  I try to balance the complexity of the fix, overall state 
of the port, value of having the port test the feature, etc.

So something like the mn103 or ephiphany where the fix was clear after a 
bit of debugging, we just fix.  Others like the long standing c6x faults 
or the rl78 assembler complaints which we're also seeing without the 
late-combine work and which don't have clearly identifiable fixes I'd 
say we leave to the port maintainers (if any) to address.

So I tend to want to understand a regression reported by the tester, 
then we determine a reasonable course of action.  I don't think that a 
no regression policy on all those old ports is a reasonable requirement.

Jeff
Hongtao Liu June 21, 2024, 4:50 a.m. UTC | #16
On Wed, Oct 25, 2023 at 2:49 AM Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> This patch adds a combine pass that runs late in the pipeline.
> There are two instances: one between combine and split1, and one
> after postreload.
>
> The pass currently has a single objective: remove definitions by
> substituting into all uses.  The pre-RA version tries to restrict
> itself to cases that are likely to have a neutral or beneficial
> effect on register pressure.
>
> The patch fixes PR106594.  It also fixes a few FAILs and XFAILs
> in the aarch64 test results, mostly due to making proper use of
> MOVPRFX in cases where we didn't previously.  I hope it would
> also help with Robin's vec_duplicate testcase, although the
> pressure heuristic might need tweaking for that case.
>
> This is just a first step..  I'm hoping that the pass could be
> used for other combine-related optimisations in future.  In particular,
> the post-RA version doesn't need to restrict itself to cases where all
> uses are substitutitable, since it doesn't have to worry about register
> pressure.  If we did that, and if we extended it to handle multi-register
> REGs, the pass might be a viable replacement for regcprop, which in
> turn might reduce the cost of having a post-RA instance of the new pass.
>
> I've run an assembly comparison with one target per CPU directory,
> and it seems to be a win for all targets except nvptx (which is hard
> to measure, being a higher-level asm).  The biggest winner seemed
> to be AVR.
>
> I'd originally hoped to enable the pass by default at -O2 and above
> on all targets.  But in the end, I don't think that's possible,
> because it interacts badly with x86's STV and partial register
> dependency passes.
>
> For example, gcc.target/i386/minmax-6.c tests whether the code
> compiles without any spilling.  The RTL created by STV contains:
>
> (insn 33 31 3 2 (set (subreg:V4SI (reg:SI 120) 0)
>         (vec_merge:V4SI (vec_duplicate:V4SI (reg:SI 116))
>             (const_vector:V4SI [
>                     (const_int 0 [0]) repeated x4
>                 ])
>             (const_int 1 [0x1]))) -1
>      (nil))
> (insn 3 33 34 2 (set (subreg:V4SI (reg:SI 118) 0)
>         (subreg:V4SI (reg:SI 120) 0)) {movv4si_internal}
>      (expr_list:REG_DEAD (reg:SI 120)
>         (nil)))
> (insn 34 3 32 2 (set (reg/v:SI 108 [ y ])
>         (reg:SI 118)) -1
>      (nil))
>
> and it's crucial for the test that reg 108 is kept, rather than
> propagated into uses.  As things stand, 118 can be allocated
> a vector register and 108 a scalar register.  If 108 is propagated,
> there will be scalar and vector uses of 118, and so it will be
> spilled to memory.
>
> That one could be solved by running STV2 later.  But RPAD is
> a bigger problem.  In gcc.target/i386/pr87007-5.c, RPAD converts:
>
> (insn 27 26 28 6 (set (reg:DF 100 [ _15 ])
>         (sqrt:DF (mem/c:DF (symbol_ref:DI ("d2"))))) {*sqrtdf2_sse}
>      (nil))
>
> into:
>
> (insn 45 26 44 6 (set (reg:V4SF 108)
>         (const_vector:V4SF [
>                 (const_double:SF 0.0 [0x0.0p+0]) repeated x4
>             ])) -1
>      (nil))
> (insn 44 45 27 6 (set (reg:V2DF 109)
>         (vec_merge:V2DF (vec_duplicate:V2DF (sqrt:DF (mem/c:DF (symbol_ref:DI ("d2")))))
>             (subreg:V2DF (reg:V4SF 108) 0)
>             (const_int 1 [0x1]))) -1
>      (nil))
> (insn 27 44 28 6 (set (reg:DF 100 [ _15 ])
>         (subreg:DF (reg:V2DF 109) 0)) {*movdf_internal}
>      (nil))
>
> But both the pre-RA and post-RA passes are able to combine these
> instructions back to the original form.
One possible solution is implement target_insn_cost for x86, and
return high cost for those insns when
get_attr_avx_partial_xmm_update(insn) == AVX_PARTIAL_XMM_UPDATE_TRUE.
I'll try with that when the patch lands on the trunk.
>
> The patch therefore enables the pass by default only on AArch64.
> However, I did test the patch with it enabled on x86_64-linux-gnu
> as well, which was useful for debugging.
>
> Bootstrapped & regression-tested on aarch64-linux-gnu and
> x86_64-linux-gnu (as posted, with no regressions, and with the
> pass enabled by default, with some gcc.target/i386 regressions).
> OK to install?
>
> Richard
>
>
> gcc/
>         PR rtl-optimization/106594
>         * Makefile.in (OBJS): Add late-combine.o.
>         * common.opt (flate-combine-instructions): New option.
>         * doc/invoke.texi: Document it.
>         * common/config/aarch64/aarch64-common.cc: Enable it by default
>         at -O2 and above.
>         * tree-pass.h (make_pass_late_combine): Declare.
>         * late-combine.cc: New file.
>         * passes.def: Add two instances of late_combine.
>
> gcc/testsuite/
>         PR rtl-optimization/106594
>         * gcc.dg/ira-shrinkwrap-prep-1.c: Restrict XFAIL to non-aarch64
>         targets.
>         * gcc.dg/ira-shrinkwrap-prep-2.c: Likewise.
>         * gcc.dg/stack-check-4.c: Add -fno-shrink-wrap.
>         * gcc.target/aarch64/sve/cond_asrd_3.c: Remove XFAILs.
>         * gcc.target/aarch64/sve/cond_convert_3.c: Likewise.
>         * gcc.target/aarch64/sve/cond_fabd_5.c: Likewise.
>         * gcc.target/aarch64/sve/cond_convert_6.c: Expect the MOVPRFX /Zs
>         described in the comment.
>         * gcc.target/aarch64/sve/cond_unary_4.c: Likewise.
>         * gcc.target/aarch64/pr106594_1.c: New test.
> ---
>  gcc/Makefile.in                               |   1 +
>  gcc/common.opt                                |   5 +
>  gcc/common/config/aarch64/aarch64-common.cc   |   1 +
>  gcc/doc/invoke.texi                           |  11 +-
>  gcc/late-combine.cc                           | 718 ++++++++++++++++++
>  gcc/passes.def                                |   2 +
>  gcc/testsuite/gcc.dg/ira-shrinkwrap-prep-1.c  |   2 +-
>  gcc/testsuite/gcc.dg/ira-shrinkwrap-prep-2.c  |   2 +-
>  gcc/testsuite/gcc.dg/stack-check-4.c          |   2 +-
>  gcc/testsuite/gcc.target/aarch64/pr106594_1.c |  20 +
>  .../gcc.target/aarch64/sve/cond_asrd_3.c      |  10 +-
>  .../gcc.target/aarch64/sve/cond_convert_3.c   |   8 +-
>  .../gcc.target/aarch64/sve/cond_convert_6.c   |   8 +-
>  .../gcc.target/aarch64/sve/cond_fabd_5.c      |  11 +-
>  .../gcc.target/aarch64/sve/cond_unary_4.c     |  13 +-
>  gcc/tree-pass.h                               |   1 +
>  16 files changed, 780 insertions(+), 35 deletions(-)
>  create mode 100644 gcc/late-combine.cc
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/pr106594_1.c
>
> diff --git a/gcc/Makefile.in b/gcc/Makefile.in
> index 91d6bfbea4d..b43fd6e8df1 100644
> --- a/gcc/Makefile.in
> +++ b/gcc/Makefile.in
> @@ -1554,6 +1554,7 @@ OBJS = \
>         ira-lives.o \
>         jump.o \
>         langhooks.o \
> +       late-combine.o \
>         lcm.o \
>         lists.o \
>         loop-doloop.o \
> diff --git a/gcc/common.opt b/gcc/common.opt
> index 1cf3bdd3b51..306b11f91c7 100644
> --- a/gcc/common.opt
> +++ b/gcc/common.opt
> @@ -1775,6 +1775,11 @@ Common Var(flag_large_source_files) Init(0)
>  Improve GCC's ability to track column numbers in large source files,
>  at the expense of slower compilation.
>
> +flate-combine-instructions
> +Common Var(flag_late_combine_instructions) Optimization Init(0)
> +Run two instruction combination passes late in the pass pipeline;
> +one before register allocation and one after.
> +
>  floop-parallelize-all
>  Common Var(flag_loop_parallelize_all) Optimization
>  Mark all loops as parallel.
> diff --git a/gcc/common/config/aarch64/aarch64-common.cc b/gcc/common/config/aarch64/aarch64-common.cc
> index 20bc4e1291b..05647e0c93a 100644
> --- a/gcc/common/config/aarch64/aarch64-common.cc
> +++ b/gcc/common/config/aarch64/aarch64-common.cc
> @@ -55,6 +55,7 @@ static const struct default_options aarch_option_optimization_table[] =
>      { OPT_LEVELS_1_PLUS, OPT_fsched_pressure, NULL, 1 },
>      /* Enable redundant extension instructions removal at -O2 and higher.  */
>      { OPT_LEVELS_2_PLUS, OPT_free, NULL, 1 },
> +    { OPT_LEVELS_2_PLUS, OPT_flate_combine_instructions, NULL, 1 },
>  #if (TARGET_DEFAULT_ASYNC_UNWIND_TABLES == 1)
>      { OPT_LEVELS_ALL, OPT_fasynchronous_unwind_tables, NULL, 1 },
>      { OPT_LEVELS_ALL, OPT_funwind_tables, NULL, 1},
> diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
> index 5a9284d635c..d0576ac97cf 100644
> --- a/gcc/doc/invoke.texi
> +++ b/gcc/doc/invoke.texi
> @@ -562,7 +562,7 @@ Objective-C and Objective-C++ Dialects}.
>  -fipa-bit-cp  -fipa-vrp  -fipa-pta  -fipa-profile  -fipa-pure-const
>  -fipa-reference  -fipa-reference-addressable
>  -fipa-stack-alignment  -fipa-icf  -fira-algorithm=@var{algorithm}
> --flive-patching=@var{level}
> +-flate-combine-instructions  -flive-patching=@var{level}
>  -fira-region=@var{region}  -fira-hoist-pressure
>  -fira-loop-pressure  -fno-ira-share-save-slots
>  -fno-ira-share-spill-slots
> @@ -13201,6 +13201,15 @@ equivalences that are found only by GCC and equivalences found only by Gold.
>
>  This flag is enabled by default at @option{-O2} and @option{-Os}.
>
> +@opindex flate-combine-instructions
> +@item -flate-combine-instructions
> +Enable two instruction combination passes that run relatively late in the
> +compilation process.  One of the passes runs before register allocation and
> +the other after register allocation.  The main aim of the passes is to
> +substitute definitions into all uses.
> +
> +Some targets enable this flag by default at @option{-O2} and @option{-Os}.
> +
>  @opindex flive-patching
>  @item -flive-patching=@var{level}
>  Control GCC's optimizations to produce output suitable for live-patching.
> diff --git a/gcc/late-combine.cc b/gcc/late-combine.cc
> new file mode 100644
> index 00000000000..b1845875c4b
> --- /dev/null
> +++ b/gcc/late-combine.cc
> @@ -0,0 +1,718 @@
> +// Late-stage instruction combination pass.
> +// Copyright (C) 2023 Free Software Foundation, Inc.
> +//
> +// This file is part of GCC.
> +//
> +// GCC is free software; you can redistribute it and/or modify it under
> +// the terms of the GNU General Public License as published by the Free
> +// Software Foundation; either version 3, or (at your option) any later
> +// version.
> +//
> +// GCC is distributed in the hope that it will be useful, but WITHOUT ANY
> +// WARRANTY; without even the implied warranty of MERCHANTABILITY or
> +// FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
> +// for more details.
> +//
> +// You should have received a copy of the GNU General Public License
> +// along with GCC; see the file COPYING3.  If not see
> +// <http://www.gnu.org/licenses/>.
> +
> +// The current purpose of this pass is to substitute definitions into
> +// all uses, so that the definition can be removed.  However, it could
> +// be extended to handle other combination-related optimizations in future.
> +//
> +// The pass can run before or after register allocation.  When running
> +// before register allocation, it tries to avoid cases that are likely
> +// to increase register pressure.  For the same reason, it avoids moving
> +// instructions around, even if doing so would allow an optimization to
> +// succeed.  These limitations are removed when running after register
> +// allocation.
> +
> +#define INCLUDE_ALGORITHM
> +#define INCLUDE_FUNCTIONAL
> +#include "config.h"
> +#include "system.h"
> +#include "coretypes.h"
> +#include "backend.h"
> +#include "rtl.h"
> +#include "df.h"
> +#include "rtl-ssa.h"
> +#include "print-rtl.h"
> +#include "tree-pass.h"
> +#include "cfgcleanup.h"
> +#include "target.h"
> +
> +using namespace rtl_ssa;
> +
> +namespace {
> +const pass_data pass_data_late_combine =
> +{
> +  RTL_PASS, // type
> +  "late_combine", // name
> +  OPTGROUP_NONE, // optinfo_flags
> +  TV_NONE, // tv_id
> +  0, // properties_required
> +  0, // properties_provided
> +  0, // properties_destroyed
> +  0, // todo_flags_start
> +  TODO_df_finish, // todo_flags_finish
> +};
> +
> +// Class that represents one run of the pass.
> +class late_combine
> +{
> +public:
> +  unsigned int execute (function *);
> +
> +private:
> +  rtx optimizable_set (insn_info *);
> +  bool check_register_pressure (insn_info *, rtx);
> +  bool check_uses (set_info *, rtx);
> +  bool combine_into_uses (insn_info *, insn_info *);
> +
> +  auto_vec<insn_info *> m_worklist;
> +};
> +
> +// Represents an attempt to substitute a single-set definition into all
> +// uses of the definition.
> +class insn_combination
> +{
> +public:
> +  insn_combination (set_info *, rtx, rtx);
> +  bool run ();
> +  const vec<insn_change *> &use_changes () const { return m_use_changes; }
> +
> +private:
> +  use_array get_new_uses (use_info *);
> +  bool substitute_nondebug_use (use_info *);
> +  bool substitute_nondebug_uses (set_info *);
> +  bool move_and_recog (insn_change &);
> +  bool try_to_preserve_debug_info (insn_change &, use_info *);
> +  void substitute_debug_use (use_info *);
> +  bool substitute_note (insn_info *, rtx, bool);
> +  void substitute_notes (insn_info *, bool);
> +  void substitute_note_uses (use_info *);
> +  void substitute_optional_uses (set_info *);
> +
> +  // Represents the state of the function's RTL at the start of this
> +  // combination attempt.
> +  insn_change_watermark m_rtl_watermark;
> +
> +  // Represents the rtl-ssa state at the start of this combination attempt.
> +  obstack_watermark m_attempt;
> +
> +  // The instruction that contains the definition, and that we're trying
> +  // to delete.
> +  insn_info *m_def_insn;
> +
> +  // The definition itself.
> +  set_info *m_def;
> +
> +  // The destination and source of the single set that defines m_def.
> +  // The destination is known to be a plain REG.
> +  rtx m_dest;
> +  rtx m_src;
> +
> +  // Contains all in-progress changes to uses of m_def.
> +  auto_vec<insn_change *> m_use_changes;
> +
> +  // Contains the full list of changes that we want to make, in reverse
> +  // postorder.
> +  auto_vec<insn_change *> m_nondebug_changes;
> +};
> +
> +insn_combination::insn_combination (set_info *def, rtx dest, rtx src)
> +  : m_rtl_watermark (),
> +    m_attempt (crtl->ssa->new_change_attempt ()),
> +    m_def_insn (def->insn ()),
> +    m_def (def),
> +    m_dest (dest),
> +    m_src (src),
> +    m_use_changes (),
> +    m_nondebug_changes ()
> +{
> +}
> +
> +// USE is a direct or indirect use of m_def.  Return the list of uses
> +// that would be needed after substituting m_def into the instruction.
> +// The returned list is marked as invalid if USE's insn and m_def_insn
> +// use different definitions for the same resource (register or memory).
> +use_array
> +insn_combination::get_new_uses (use_info *use)
> +{
> +  auto *def = use->def ();
> +  auto *use_insn = use->insn ();
> +
> +  use_array new_uses = use_insn->uses ();
> +  new_uses = remove_uses_of_def (m_attempt, new_uses, def);
> +  new_uses = merge_access_arrays (m_attempt, m_def_insn->uses (), new_uses);
> +  if (new_uses.is_valid () && use->ebb () != m_def->ebb ())
> +    new_uses = crtl->ssa->make_uses_available (m_attempt, new_uses, use->bb (),
> +                                              use_insn->is_debug_insn ());
> +  return new_uses;
> +}
> +
> +// Start the process of trying to replace USE by substitution, given that
> +// USE occurs in a non-debug instruction.  Check that the substitution can
> +// be represented in RTL and that each use of a resource (register or memory)
> +// has a consistent definition.  If so, start an insn_change record for the
> +// substitution and return true.
> +bool
> +insn_combination::substitute_nondebug_use (use_info *use)
> +{
> +  insn_info *use_insn = use->insn ();
> +  rtx_insn *use_rtl = use_insn->rtl ();
> +
> +  if (dump_file && (dump_flags & TDF_DETAILS))
> +    dump_insn_slim (dump_file, use->insn ()->rtl ());
> +
> +  // Chceck that we can change the instruction pattern.  Leave recognition
> +  // of the result till later.
> +  insn_propagation prop (use_rtl, m_dest, m_src);
> +  if (!prop.apply_to_pattern (&PATTERN (use_rtl))
> +      || prop.num_replacements == 0)
> +    {
> +      if (dump_file && (dump_flags & TDF_DETAILS))
> +       fprintf (dump_file, "-- RTL substitution failed\n");
> +      return false;
> +    }
> +
> +  use_array new_uses = get_new_uses (use);
> +  if (!new_uses.is_valid ())
> +    {
> +      if (dump_file && (dump_flags & TDF_DETAILS))
> +       fprintf (dump_file, "-- could not prove that all sources"
> +                " are available\n");
> +      return false;
> +    }
> +
> +  auto *where = XOBNEW (m_attempt, insn_change);
> +  auto *use_change = new (where) insn_change (use_insn);
> +  m_use_changes.safe_push (use_change);
> +  use_change->new_uses = new_uses;
> +
> +  return true;
> +}
> +
> +// Apply substitute_nondebug_use to all direct and indirect uses of DEF.
> +// There will be at most one level of indirection.
> +bool
> +insn_combination::substitute_nondebug_uses (set_info *def)
> +{
> +  for (use_info *use : def->nondebug_insn_uses ())
> +    if (!use->is_live_out_use ()
> +       && !use->only_occurs_in_notes ()
> +       && !substitute_nondebug_use (use))
> +      return false;
> +
> +  for (use_info *use : def->phi_uses ())
> +    if (!substitute_nondebug_uses (use->phi ()))
> +      return false;
> +
> +  return true;
> +}
> +
> +// Complete the verification of USE_CHANGE, given that m_nondebug_insns
> +// now contains an insn_change record for all proposed non-debug changes.
> +// Check that the new instruction is a recognized pattern.  Also check that
> +// the instruction can be placed somewhere that makes all definitions and
> +// uses valid, and that permits any new hard-register clobbers added
> +// during the recognition process.  Return true on success.
> +bool
> +insn_combination::move_and_recog (insn_change &use_change)
> +{
> +  insn_info *use_insn = use_change.insn ();
> +
> +  if (reload_completed && can_move_insn_p (use_insn))
> +    use_change.move_range = { use_insn->bb ()->head_insn (),
> +                             use_insn->ebb ()->last_bb ()->end_insn () };
> +  if (!restrict_movement_ignoring (use_change,
> +                                  insn_is_changing (m_nondebug_changes)))
> +    {
> +      if (dump_file && (dump_flags & TDF_DETAILS))
> +       fprintf (dump_file, "-- cannot satisfy all definitions and uses"
> +                " in insn %d\n", INSN_UID (use_insn->rtl ()));
> +      return false;
> +    }
> +
> +  if (!recog_ignoring (m_attempt, use_change,
> +                      insn_is_changing (m_nondebug_changes)))
> +    return false;
> +
> +  return true;
> +}
> +
> +// USE_CHANGE.insn () is a debug instruction that uses m_def.  Try to
> +// substitute the definition into the instruction and try to describe
> +// the result in USE_CHANGE.  Return true on success.  Failure means that
> +// the instruction must be reset instead.
> +bool
> +insn_combination::try_to_preserve_debug_info (insn_change &use_change,
> +                                             use_info *use)
> +{
> +  insn_info *use_insn = use_change.insn ();
> +  rtx_insn *use_rtl = use_insn->rtl ();
> +
> +  use_change.new_uses = get_new_uses (use);
> +  if (!use_change.new_uses.is_valid ()
> +      || !restrict_movement (use_change))
> +    return false;
> +
> +  insn_propagation prop (use_rtl, m_dest, m_src);
> +  return prop.apply_to_pattern (&INSN_VAR_LOCATION_LOC (use_rtl));
> +}
> +
> +// USE_INSN is a debug instruction that uses m_def.  Update it to reflect
> +// the fact that m_def is going to disappear.  Try to preserve the source
> +// value if possible, but reset the instruction if not.
> +void
> +insn_combination::substitute_debug_use (use_info *use)
> +{
> +  auto *use_insn = use->insn ();
> +  rtx_insn *use_rtl = use_insn->rtl ();
> +
> +  auto use_change = insn_change (use_insn);
> +  if (!try_to_preserve_debug_info (use_change, use))
> +    {
> +      use_change.new_uses = {};
> +      use_change.move_range = use_change.insn ();
> +      INSN_VAR_LOCATION_LOC (use_rtl) = gen_rtx_UNKNOWN_VAR_LOC ();
> +    }
> +  insn_change *changes[] = { &use_change };
> +  crtl->ssa->change_insns (changes);
> +}
> +
> +// NOTE is a reg note of USE_INSN, which previously used m_def.  Update
> +// the note to reflect the fact that m_def is going to disappear.  Return
> +// true on success, or false if the note must be deleted.
> +//
> +// CAN_PROPAGATE is true if m_dest can be replaced with m_use.
> +bool
> +insn_combination::substitute_note (insn_info *use_insn, rtx note,
> +                                  bool can_propagate)
> +{
> +  if (REG_NOTE_KIND (note) == REG_EQUAL
> +      || REG_NOTE_KIND (note) == REG_EQUIV)
> +    {
> +      insn_propagation prop (use_insn->rtl (), m_dest, m_src);
> +      return (prop.apply_to_rvalue (&XEXP (note, 0))
> +             && (can_propagate || prop.num_replacements == 0));
> +    }
> +  return true;
> +}
> +
> +// Update USE_INSN's notes after deciding to go ahead with the optimization.
> +// CAN_PROPAGATE is true if m_dest can be replaced with m_use.
> +void
> +insn_combination::substitute_notes (insn_info *use_insn, bool can_propagate)
> +{
> +  rtx_insn *use_rtl = use_insn->rtl ();
> +  rtx *ptr = &REG_NOTES (use_rtl);
> +  while (rtx note = *ptr)
> +    {
> +      if (substitute_note (use_insn, note, can_propagate))
> +       ptr = &XEXP (note, 1);
> +      else
> +       *ptr = XEXP (note, 1);
> +    }
> +}
> +
> +// We've decided to go ahead with the substitution.  Update all REG_NOTES
> +// involving USE.
> +void
> +insn_combination::substitute_note_uses (use_info *use)
> +{
> +  insn_info *use_insn = use->insn ();
> +
> +  bool can_propagate = true;
> +  if (use->only_occurs_in_notes ())
> +    {
> +      // The only uses are in notes.  Try to keep the note if we can,
> +      // but removing it is better than aborting the optimization.
> +      insn_change use_change (use_insn);
> +      use_change.new_uses = get_new_uses (use);
> +      if (!use_change.new_uses.is_valid ()
> +         || !restrict_movement (use_change))
> +       {
> +         use_change.move_range = use_insn;
> +         use_change.new_uses = remove_uses_of_def (m_attempt,
> +                                                   use_insn->uses (),
> +                                                   use->def ());
> +         can_propagate = false;
> +       }
> +      if (dump_file && (dump_flags & TDF_DETAILS))
> +       {
> +         fprintf (dump_file, "%s notes in:\n",
> +                  can_propagate ? "updating" : "removing");
> +         dump_insn_slim (dump_file, use_insn->rtl ());
> +       }
> +      substitute_notes (use_insn, can_propagate);
> +      insn_change *changes[] = { &use_change };
> +      crtl->ssa->change_insns (changes);
> +    }
> +  else
> +    substitute_notes (use_insn, can_propagate);
> +}
> +
> +// We've decided to go ahead with the substitution and we've dealt with
> +// all uses that occur in the patterns of non-debug insns.  Update all
> +// other uses for the fact that m_def is about to disappear.
> +void
> +insn_combination::substitute_optional_uses (set_info *def)
> +{
> +  if (auto insn_uses = def->all_insn_uses ())
> +    {
> +      use_info *use = *insn_uses.begin ();
> +      while (use)
> +       {
> +         use_info *next_use = use->next_any_insn_use ();
> +         if (use->is_in_debug_insn ())
> +           substitute_debug_use (use);
> +         else if (!use->is_live_out_use ())
> +           substitute_note_uses (use);
> +         use = next_use;
> +       }
> +    }
> +  for (use_info *use : def->phi_uses ())
> +    substitute_optional_uses (use->phi ());
> +}
> +
> +// Try to perform the substitution.  Return true on success.
> +bool
> +insn_combination::run ()
> +{
> +  if (dump_file && (dump_flags & TDF_DETAILS))
> +    {
> +      fprintf (dump_file, "\ntrying to combine definition of r%d in:\n",
> +              m_def->regno ());
> +      dump_insn_slim (dump_file, m_def_insn->rtl ());
> +      fprintf (dump_file, "into:\n");
> +    }
> +
> +  if (!substitute_nondebug_uses (m_def))
> +    return false;
> +
> +  auto def_change = insn_change::delete_insn (m_def_insn);
> +
> +  m_nondebug_changes.reserve (m_use_changes.length () + 1);
> +  m_nondebug_changes.quick_push (&def_change);
> +  m_nondebug_changes.splice (m_use_changes);
> +
> +  for (auto *use_change : m_use_changes)
> +    if (!move_and_recog (*use_change))
> +      return false;
> +
> +  if (!changes_are_worthwhile (m_nondebug_changes)
> +      || !crtl->ssa->verify_insn_changes (m_nondebug_changes))
> +    return false;
> +
> +  substitute_optional_uses (m_def);
> +
> +  confirm_change_group ();
> +  crtl->ssa->change_insns (m_nondebug_changes);
> +  return true;
> +}
> +
> +// See whether INSN is a single_set that we can optimize.  Return the
> +// set if so, otherwise return null.
> +rtx
> +late_combine::optimizable_set (insn_info *insn)
> +{
> +  if (!insn->can_be_optimized ()
> +      || insn->is_asm ()
> +      || insn->is_call ()
> +      || insn->has_volatile_refs ()
> +      || insn->has_pre_post_modify ()
> +      || !can_move_insn_p (insn))
> +    return NULL_RTX;
> +
> +  return single_set (insn->rtl ());
> +}
> +
> +// Suppose that we can replace all uses of SET_DEST (SET) with SET_SRC (SET),
> +// where SET occurs in INSN.  Return true if doing so is not likely to
> +// increase register pressure.
> +bool
> +late_combine::check_register_pressure (insn_info *insn, rtx set)
> +{
> +  // Plain register-to-register moves do not establish a register class
> +  // preference and have no well-defined effect on the register allocator.
> +  // If changes in register class are needed, the register allocator is
> +  // in the best position to place those changes.  If no change in
> +  // register class is needed, then the optimization reduces register
> +  // pressure if SET_SRC (set) was already live at uses, otherwise the
> +  // optimization is pressure-neutral.
> +  rtx src = SET_SRC (set);
> +  if (REG_P (src))
> +    return true;
> +
> +  // On the same basis, substituting a SET_SRC that contains a single
> +  // pseudo register either reduces pressure or is pressure-neutral,
> +  // subject to the constraints below.  We would need to do more
> +  // analysis for SET_SRCs that use more than one pseudo register.
> +  unsigned int nregs = 0;
> +  for (auto *use : insn->uses ())
> +    if (use->is_reg ()
> +       && !HARD_REGISTER_NUM_P (use->regno ())
> +       && !use->only_occurs_in_notes ())
> +      if (++nregs > 1)
> +       return false;
> +
> +  // If there are no pseudo registers in SET_SRC then the optimization
> +  // should improve register pressure.
> +  if (nregs == 0)
> +    return true;
> +
> +  // We'd be substituting (set (reg R1) SRC) where SRC is known to
> +  // contain a single pseudo register R2.  Assume for simplicity that
> +  // each new use of R2 would need to be in the same class C as the
> +  // current use of R2.  If, for a realistic allocation, C is a
> +  // non-strict superset of the R1's register class, the effect on
> +  // register pressure should be positive or neutral.  If instead
> +  // R1 occupies a different register class from R2, or if R1 has
> +  // more allocation freedom than R2, then there's a higher risk that
> +  // the effect on register pressure could be negative.
> +  //
> +  // First use constrain_operands to get the most likely choice of
> +  // alternative.  For simplicity, just handle the case where the
> +  // output operand is operand 0.
> +  extract_insn (insn->rtl ());
> +  rtx dest = SET_DEST (set);
> +  if (recog_data.n_operands == 0
> +      || recog_data.operand[0] != dest)
> +    return false;
> +
> +  if (!constrain_operands (0, get_enabled_alternatives (insn->rtl ())))
> +    return false;
> +
> +  preprocess_constraints (insn->rtl ());
> +  auto *alt = which_op_alt ();
> +  auto dest_class = alt[0].cl;
> +
> +  // Check operands 1 and above.
> +  auto check_src = [&](unsigned int i)
> +    {
> +      if (recog_data.is_operator[i])
> +       return true;
> +
> +      rtx op = recog_data.operand[i];
> +      if (CONSTANT_P (op))
> +       return true;
> +
> +      if (SUBREG_P (op))
> +       op = SUBREG_REG (op);
> +      if (REG_P (op))
> +       {
> +         if (HARD_REGISTER_P (op))
> +           {
> +             // We've already rejected uses of non-fixed hard registers.
> +             gcc_checking_assert (fixed_regs[REGNO (op)]);
> +             return true;
> +           }
> +
> +         // Make sure that the source operand's class is at least as
> +         // permissive as the destination operand's class.
> +         if (!reg_class_subset_p (dest_class, alt[i].cl))
> +           return false;
> +
> +         // Make sure that the source operand occupies no more hard
> +         // registers than the destination operand.  This mostly matters
> +         // for subregs.
> +         if (targetm.class_max_nregs (dest_class, GET_MODE (dest))
> +             < targetm.class_max_nregs (alt[i].cl, GET_MODE (op)))
> +           return false;
> +
> +         return true;
> +       }
> +      return false;
> +    };
> +  for (int i = 1; i < recog_data.n_operands; ++i)
> +    if (!check_src (i))
> +      return false;
> +
> +  return true;
> +}
> +
> +// Check uses of DEF to see whether there is anything obvious that
> +// prevents the substitution of SET into uses of DEF.
> +bool
> +late_combine::check_uses (set_info *def, rtx set)
> +{
> +  use_info *last_use = nullptr;
> +  for (use_info *use : def->nondebug_insn_uses ())
> +    {
> +      insn_info *use_insn = use->insn ();
> +
> +      if (use->is_live_out_use ())
> +       continue;
> +      if (use->only_occurs_in_notes ())
> +       continue;
> +
> +      // We cannot replace all uses if the value is live on exit.
> +      if (use->is_artificial ())
> +       return false;
> +
> +      // Avoid increasing the complexity of instructions that
> +      // reference allocatable hard registers.
> +      if (!REG_P (SET_SRC (set))
> +         && !reload_completed
> +         && (accesses_include_nonfixed_hard_registers (use_insn->uses ())
> +             || accesses_include_nonfixed_hard_registers (use_insn->defs ())))
> +       return false;
> +
> +      // Don't substitute into a non-local goto, since it can then be
> +      // treated as a jump to local label, e.g. in shorten_branches.
> +      // ??? But this shouldn't be necessary.
> +      if (use_insn->is_jump ()
> +         && find_reg_note (use_insn->rtl (), REG_NON_LOCAL_GOTO, NULL_RTX))
> +       return false;
> +
> +      // We'll keep the uses in their original order, even if we move
> +      // them relative to other instructions.  Make sure that non-final
> +      // uses do not change any values that occur in the SET_SRC.
> +      if (last_use && last_use->ebb () == use->ebb ())
> +       {
> +         def_info *ultimate_def = look_through_degenerate_phi (def);
> +         if (insn_clobbers_resources (last_use->insn (),
> +                                      ultimate_def->insn ()->uses ()))
> +           return false;
> +       }
> +
> +      last_use = use;
> +    }
> +
> +  for (use_info *use : def->phi_uses ())
> +    if (!use->phi ()->is_degenerate ()
> +       || !check_uses (use->phi (), set))
> +      return false;
> +
> +  return true;
> +}
> +
> +// Try to remove INSN by substituting a definition into all uses.
> +// If the optimization moves any instructions before CURSOR, add those
> +// instructions to the end of m_worklist.
> +bool
> +late_combine::combine_into_uses (insn_info *insn, insn_info *cursor)
> +{
> +  // For simplicity, don't try to handle sets of multiple hard registers.
> +  // And for correctness, don't remove any assignments to the stack or
> +  // frame pointers, since that would implicitly change the set of valid
> +  // memory locations between this assignment and the next.
> +  //
> +  // Removing assignments to the hard frame pointer would invalidate
> +  // backtraces.
> +  set_info *def = single_set_info (insn);
> +  if (!def
> +      || !def->is_reg ()
> +      || def->regno () == STACK_POINTER_REGNUM
> +      || def->regno () == FRAME_POINTER_REGNUM
> +      || def->regno () == HARD_FRAME_POINTER_REGNUM)
> +    return false;
> +
> +  rtx set = optimizable_set (insn);
> +  if (!set)
> +    return false;
> +
> +  // For simplicity, don't try to handle subreg destinations.
> +  rtx dest = SET_DEST (set);
> +  if (!REG_P (dest) || def->regno () != REGNO (dest))
> +    return false;
> +
> +  // Don't prolong the live ranges of allocatable hard registers, or put
> +  // them into more complicated instructions.  Failing to prevent this
> +  // could lead to spill failures, or at least to worst register allocation.
> +  if (!reload_completed
> +      && accesses_include_nonfixed_hard_registers (insn->uses ()))
> +    return false;
> +
> +  if (!reload_completed && !check_register_pressure (insn, set))
> +    return false;
> +
> +  if (!check_uses (def, set))
> +    return false;
> +
> +  insn_combination combination (def, SET_DEST (set), SET_SRC (set));
> +  if (!combination.run ())
> +    return false;
> +
> +  for (auto *use_change : combination.use_changes ())
> +    if (*use_change->insn () < *cursor)
> +      m_worklist.safe_push (use_change->insn ());
> +    else
> +      break;
> +  return true;
> +}
> +
> +// Run the pass on function FN.
> +unsigned int
> +late_combine::execute (function *fn)
> +{
> +  // Initialization.
> +  calculate_dominance_info (CDI_DOMINATORS);
> +  df_analyze ();
> +  crtl->ssa = new rtl_ssa::function_info (fn);
> +  // Don't allow memory_operand to match volatile MEMs.
> +  init_recog_no_volatile ();
> +
> +  insn_info *insn = *crtl->ssa->nondebug_insns ().begin ();
> +  while (insn)
> +    {
> +      if (!insn->is_artificial ())
> +       {
> +         insn_info *prev = insn->prev_nondebug_insn ();
> +         if (combine_into_uses (insn, prev))
> +           {
> +             // Any instructions that get added to the worklist were
> +             // previously after PREV.  Thus if we were able to move
> +             // an instruction X before PREV during one combination,
> +             // X cannot depend on any instructions that we move before
> +             // PREV during subsequent combinations.  This means that
> +             // the worklist should be free of backwards dependencies,
> +             // even if it isn't necessarily in RPO.
> +             for (unsigned int i = 0; i < m_worklist.length (); ++i)
> +               combine_into_uses (m_worklist[i], prev);
> +             m_worklist.truncate (0);
> +             insn = prev;
> +           }
> +       }
> +      insn = insn->next_nondebug_insn ();
> +    }
> +
> +  // Finalization.
> +  if (crtl->ssa->perform_pending_updates ())
> +    cleanup_cfg (0);
> +  // Make recognizer allow volatile MEMs again.
> +  init_recog ();
> +  free_dominance_info (CDI_DOMINATORS);
> +  return 0;
> +}
> +
> +class pass_late_combine : public rtl_opt_pass
> +{
> +public:
> +  pass_late_combine (gcc::context *ctxt)
> +    : rtl_opt_pass (pass_data_late_combine, ctxt)
> +  {}
> +
> +  // opt_pass methods:
> +  opt_pass *clone () override { return new pass_late_combine (m_ctxt); }
> +  bool gate (function *) override { return flag_late_combine_instructions; }
> +  unsigned int execute (function *) override;
> +};
> +
> +unsigned int
> +pass_late_combine::execute (function *fn)
> +{
> +  return late_combine ().execute (fn);
> +}
> +
> +} // end namespace
> +
> +// Create a new CC fusion pass instance.
> +
> +rtl_opt_pass *
> +make_pass_late_combine (gcc::context *ctxt)
> +{
> +  return new pass_late_combine (ctxt);
> +}
> diff --git a/gcc/passes.def b/gcc/passes.def
> index 1e1950bdb39..56ab5204b08 100644
> --- a/gcc/passes.def
> +++ b/gcc/passes.def
> @@ -488,6 +488,7 @@ along with GCC; see the file COPYING3.  If not see
>        NEXT_PASS (pass_initialize_regs);
>        NEXT_PASS (pass_ud_rtl_dce);
>        NEXT_PASS (pass_combine);
> +      NEXT_PASS (pass_late_combine);
>        NEXT_PASS (pass_if_after_combine);
>        NEXT_PASS (pass_jump_after_combine);
>        NEXT_PASS (pass_partition_blocks);
> @@ -507,6 +508,7 @@ along with GCC; see the file COPYING3.  If not see
>        NEXT_PASS (pass_postreload);
>        PUSH_INSERT_PASSES_WITHIN (pass_postreload)
>           NEXT_PASS (pass_postreload_cse);
> +         NEXT_PASS (pass_late_combine);
>           NEXT_PASS (pass_gcse2);
>           NEXT_PASS (pass_split_after_reload);
>           NEXT_PASS (pass_ree);
> diff --git a/gcc/testsuite/gcc.dg/ira-shrinkwrap-prep-1.c b/gcc/testsuite/gcc.dg/ira-shrinkwrap-prep-1.c
> index f290b9ccbdc..a95637abbe5 100644
> --- a/gcc/testsuite/gcc.dg/ira-shrinkwrap-prep-1.c
> +++ b/gcc/testsuite/gcc.dg/ira-shrinkwrap-prep-1.c
> @@ -25,5 +25,5 @@ bar (long a)
>  }
>
>  /* { dg-final { scan-rtl-dump "Will split live ranges of parameters" "ira" } } */
> -/* { dg-final { scan-rtl-dump "Split live-range of register" "ira" { xfail *-*-* } } } */
> +/* { dg-final { scan-rtl-dump "Split live-range of register" "ira" { xfail { ! aarch64*-*-* } } } } */
>  /* { dg-final { scan-rtl-dump "Performing shrink-wrapping" "pro_and_epilogue" { xfail powerpc*-*-* } } } */
> diff --git a/gcc/testsuite/gcc.dg/ira-shrinkwrap-prep-2.c b/gcc/testsuite/gcc.dg/ira-shrinkwrap-prep-2.c
> index 6212c95585d..0690e036eaa 100644
> --- a/gcc/testsuite/gcc.dg/ira-shrinkwrap-prep-2.c
> +++ b/gcc/testsuite/gcc.dg/ira-shrinkwrap-prep-2.c
> @@ -30,6 +30,6 @@ bar (long a)
>  }
>
>  /* { dg-final { scan-rtl-dump "Will split live ranges of parameters" "ira" } } */
> -/* { dg-final { scan-rtl-dump "Split live-range of register" "ira" { xfail *-*-* } } } */
> +/* { dg-final { scan-rtl-dump "Split live-range of register" "ira" { xfail { ! aarch64*-*-* } } } } */
>  /* XFAIL due to PR70681.  */
>  /* { dg-final { scan-rtl-dump "Performing shrink-wrapping" "pro_and_epilogue" { xfail arm*-*-* powerpc*-*-* } } } */
> diff --git a/gcc/testsuite/gcc.dg/stack-check-4.c b/gcc/testsuite/gcc.dg/stack-check-4.c
> index b0c5c61972f..052d2abc2f1 100644
> --- a/gcc/testsuite/gcc.dg/stack-check-4.c
> +++ b/gcc/testsuite/gcc.dg/stack-check-4.c
> @@ -20,7 +20,7 @@
>     scan for.   We scan for both the positive and negative cases.  */
>
>  /* { dg-do compile } */
> -/* { dg-options "-O2 -fstack-clash-protection -fdump-rtl-pro_and_epilogue -fno-optimize-sibling-calls" } */
> +/* { dg-options "-O2 -fstack-clash-protection -fdump-rtl-pro_and_epilogue -fno-optimize-sibling-calls -fno-shrink-wrap" } */
>  /* { dg-require-effective-target supports_stack_clash_protection } */
>
>  extern void arf (char *);
> diff --git a/gcc/testsuite/gcc.target/aarch64/pr106594_1.c b/gcc/testsuite/gcc.target/aarch64/pr106594_1.c
> new file mode 100644
> index 00000000000..71bcafcb44f
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/pr106594_1.c
> @@ -0,0 +1,20 @@
> +/* { dg-options "-O2" } */
> +
> +extern const int constellation_64qam[64];
> +
> +void foo(int nbits,
> +         const char *p_src,
> +         int *p_dst) {
> +
> +  while (nbits > 0U) {
> +    char first = *p_src++;
> +
> +    char index1 = ((first & 0x3) << 4) | (first >> 4);
> +
> +    *p_dst++ = constellation_64qam[index1];
> +
> +    nbits--;
> +  }
> +}
> +
> +/* { dg-final { scan-assembler {(?n)\tldr\t.*\[x[0-9]+, w[0-9]+, sxtw #?2\]} } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/cond_asrd_3.c b/gcc/testsuite/gcc.target/aarch64/sve/cond_asrd_3.c
> index 0d620a30d5d..b537c6154a3 100644
> --- a/gcc/testsuite/gcc.target/aarch64/sve/cond_asrd_3.c
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/cond_asrd_3.c
> @@ -27,9 +27,9 @@ TEST_ALL (DEF_LOOP)
>  /* { dg-final { scan-assembler-times {\tasrd\tz[0-9]+\.h, p[0-7]/m, z[0-9]+\.h, #4\n} 2 } } */
>  /* { dg-final { scan-assembler-times {\tasrd\tz[0-9]+\.s, p[0-7]/m, z[0-9]+\.s, #4\n} 1 } } */
>
> -/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.b, p[0-7]/z, z[0-9]+\.b\n} 3 { xfail *-*-* } } } */
> -/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.h, p[0-7]/z, z[0-9]+\.h\n} 2 { xfail *-*-* } } } */
> -/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.s, p[0-7]/z, z[0-9]+\.s\n} 1 { xfail *-*-* } } } */
> +/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.b, p[0-7]/z, z[0-9]+\.b\n} 3 } } */
> +/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.h, p[0-7]/z, z[0-9]+\.h\n} 2 } } */
> +/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.s, p[0-7]/z, z[0-9]+\.s\n} 1 } } */
>
> -/* { dg-final { scan-assembler-not {\tmov\tz} { xfail *-*-* } } } */
> -/* { dg-final { scan-assembler-not {\tsel\t} { xfail *-*-* } } } */
> +/* { dg-final { scan-assembler-not {\tmov\tz} } } */
> +/* { dg-final { scan-assembler-not {\tsel\t} } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/cond_convert_3.c b/gcc/testsuite/gcc.target/aarch64/sve/cond_convert_3.c
> index a294effd4a9..cff806c278d 100644
> --- a/gcc/testsuite/gcc.target/aarch64/sve/cond_convert_3.c
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/cond_convert_3.c
> @@ -30,11 +30,9 @@ TEST_ALL (DEF_LOOP)
>  /* { dg-final { scan-assembler-times {\tscvtf\tz[0-9]+\.d, p[0-7]/m,} 1 } } */
>  /* { dg-final { scan-assembler-times {\tucvtf\tz[0-9]+\.d, p[0-7]/m,} 1 } } */
>
> -/* Really we should be able to use MOVPRFX /z here, but at the moment
> -   we're relying on combine to merge a SEL and an arithmetic operation,
> -   and the SEL doesn't allow the "false" value to be zero when the "true"
> -   value is a register.  */
> -/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+, z[0-9]+\n} 6 } } */
> +/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.h, p[0-7]/z,} 2 } } */
> +/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.s, p[0-7]/z,} 2 } } */
> +/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.d, p[0-7]/z,} 2 } } */
>
>  /* { dg-final { scan-assembler-not {\tmov\tz[^\n]*z} } } */
>  /* { dg-final { scan-assembler-not {\tsel\t} } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/cond_convert_6.c b/gcc/testsuite/gcc.target/aarch64/sve/cond_convert_6.c
> index 6541a2ea49d..abf0a2e832f 100644
> --- a/gcc/testsuite/gcc.target/aarch64/sve/cond_convert_6.c
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/cond_convert_6.c
> @@ -30,11 +30,9 @@ TEST_ALL (DEF_LOOP)
>  /* { dg-final { scan-assembler-times {\tfcvtzs\tz[0-9]+\.d, p[0-7]/m,} 1 } } */
>  /* { dg-final { scan-assembler-times {\tfcvtzu\tz[0-9]+\.d, p[0-7]/m,} 1 } } */
>
> -/* Really we should be able to use MOVPRFX /z here, but at the moment
> -   we're relying on combine to merge a SEL and an arithmetic operation,
> -   and the SEL doesn't allow the "false" value to be zero when the "true"
> -   value is a register.  */
> -/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+, z[0-9]+\n} 6 } } */
> +/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.h, p[0-7]/z,} 2 } } */
> +/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.s, p[0-7]/z,} 2 } } */
> +/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.d, p[0-7]/z,} 2 } } */
>
>  /* { dg-final { scan-assembler-not {\tmov\tz[^\n]*z} } } */
>  /* { dg-final { scan-assembler-not {\tsel\t} } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/cond_fabd_5.c b/gcc/testsuite/gcc.target/aarch64/sve/cond_fabd_5.c
> index e66477b3bce..401201b315a 100644
> --- a/gcc/testsuite/gcc.target/aarch64/sve/cond_fabd_5.c
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/cond_fabd_5.c
> @@ -24,12 +24,9 @@ TEST_ALL (DEF_LOOP)
>  /* { dg-final { scan-assembler-times {\tfabd\tz[0-9]+\.s, p[0-7]/m,} 1 } } */
>  /* { dg-final { scan-assembler-times {\tfabd\tz[0-9]+\.d, p[0-7]/m,} 1 } } */
>
> -/* Really we should be able to use MOVPRFX /Z here, but at the moment
> -   we're relying on combine to merge a SEL and an arithmetic operation,
> -   and the SEL doesn't allow zero operands.  */
> -/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.h, p[0-7]/z, z[0-9]+\.h\n} 1 { xfail *-*-* } } } */
> -/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.s, p[0-7]/z, z[0-9]+\.s\n} 1 { xfail *-*-* } } } */
> -/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.d, p[0-7]/z, z[0-9]+\.d\n} 1 { xfail *-*-* } } } */
> +/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.h, p[0-7]/z, z[0-9]+\.h\n} 1 } } */
> +/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.s, p[0-7]/z, z[0-9]+\.s\n} 1 } } */
> +/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.d, p[0-7]/z, z[0-9]+\.d\n} 1 } } */
>
>  /* { dg-final { scan-assembler-not {\tmov\tz[^,]*z} } } */
> -/* { dg-final { scan-assembler-not {\tsel\t} { xfail *-*-* } } } */
> +/* { dg-final { scan-assembler-not {\tsel\t} } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/cond_unary_4.c b/gcc/testsuite/gcc.target/aarch64/sve/cond_unary_4.c
> index a491f899088..cbb957bffa4 100644
> --- a/gcc/testsuite/gcc.target/aarch64/sve/cond_unary_4.c
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/cond_unary_4.c
> @@ -52,15 +52,10 @@ TEST_ALL (DEF_LOOP)
>  /* { dg-final { scan-assembler-times {\tfneg\tz[0-9]+\.s, p[0-7]/m,} 1 } } */
>  /* { dg-final { scan-assembler-times {\tfneg\tz[0-9]+\.d, p[0-7]/m,} 1 } } */
>
> -/* Really we should be able to use MOVPRFX /z here, but at the moment
> -   we're relying on combine to merge a SEL and an arithmetic operation,
> -   and the SEL doesn't allow the "false" value to be zero when the "true"
> -   value is a register.  */
> -/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+, z[0-9]+\n} 7 } } */
> -/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.b, p[0-7]/z, z[0-9]+\.b} 1 } } */
> -/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.h, p[0-7]/z, z[0-9]+\.h} 2 } } */
> -/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.s, p[0-7]/z, z[0-9]+\.s} 2 } } */
> -/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.d, p[0-7]/z, z[0-9]+\.d} 2 } } */
> +/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.b, p[0-7]/z, z[0-9]+\.b} 2 } } */
> +/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.h, p[0-7]/z, z[0-9]+\.h} 4 } } */
> +/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.s, p[0-7]/z, z[0-9]+\.s} 4 } } */
> +/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.d, p[0-7]/z, z[0-9]+\.d} 4 } } */
>
>  /* { dg-final { scan-assembler-not {\tmov\tz[^\n]*z} } } */
>  /* { dg-final { scan-assembler-not {\tsel\t} } } */
> diff --git a/gcc/tree-pass.h b/gcc/tree-pass.h
> index 09e6ada5b2f..75376316e40 100644
> --- a/gcc/tree-pass.h
> +++ b/gcc/tree-pass.h
> @@ -612,6 +612,7 @@ extern rtl_opt_pass *make_pass_branch_prob (gcc::context *ctxt);
>  extern rtl_opt_pass *make_pass_value_profile_transformations (gcc::context
>                                                               *ctxt);
>  extern rtl_opt_pass *make_pass_postreload_cse (gcc::context *ctxt);
> +extern rtl_opt_pass *make_pass_late_combine (gcc::context *ctxt);
>  extern rtl_opt_pass *make_pass_gcse2 (gcc::context *ctxt);
>  extern rtl_opt_pass *make_pass_split_after_reload (gcc::context *ctxt);
>  extern rtl_opt_pass *make_pass_thread_prologue_and_epilogue (gcc::context
> --
> 2.25.1
>
Segher Boessenkool June 24, 2024, 7:37 p.m. UTC | #17
I didn't see this before.   Sigh.

On Tue, Jan 02, 2024 at 09:47:11AM +0000, Richard Sandiford wrote:
> Segher Boessenkool <segher@kernel.crashing.org> writes:
> > On Tue, Oct 24, 2023 at 07:49:10PM +0100, Richard Sandiford wrote:
> >> This patch adds a combine pass that runs late in the pipeline.
> >
> > But it is not.  It is a completely new thing, and much closer to
> > fwprop than to combine, too.
> 
> Well, it is a combine pass.

No, it is not.  In the context of GCC combine is the instruction
combiner.  Which does something else than this does.

So use a different name.  Please.  It will be NAKked by the combine
maintainer otherwise.

> It's not a new instance of the pass in
> combine.cc, but I don't think that's the implication.  We already have
> two combine passes: the combine.cc one and the postreload one.

There is no postreload-combine pass.  There is a postreload pass that
does various trivial things.  One of those is reload_combine, which is
nothing like combine.  It is a kind of limited fwprop for memory
addressing.

> > Could you rename it to something else, please?  Something less confusing
> > to both users and maintainers :-)
> 
> Do you have any suggestions?

Since it is something like fwprop, maybe something like that?  Or maybe
put "addressing" in the name, if that is the point here.

> >> The pass currently has a single objective: remove definitions by
> >> substituting into all uses.
> >
> > The easy case ;-)
> 
> And the yet a case that no existing pass handles. :)  That's why I'm
> trying to add something that does.

So, fwprop.


Segher
Richard Biener June 25, 2024, 10:31 a.m. UTC | #18
On Mon, Jun 24, 2024 at 9:38 PM Segher Boessenkool
<segher@kernel.crashing.org> wrote:
>
> I didn't see this before.   Sigh.
>
> On Tue, Jan 02, 2024 at 09:47:11AM +0000, Richard Sandiford wrote:
> > Segher Boessenkool <segher@kernel.crashing.org> writes:
> > > On Tue, Oct 24, 2023 at 07:49:10PM +0100, Richard Sandiford wrote:
> > >> This patch adds a combine pass that runs late in the pipeline.
> > >
> > > But it is not.  It is a completely new thing, and much closer to
> > > fwprop than to combine, too.
> >
> > Well, it is a combine pass.
>
> No, it is not.  In the context of GCC combine is the instruction
> combiner.  Which does something else than this does.
>
> So use a different name.  Please.  It will be NAKked by the combine
> maintainer otherwise.
>
> > It's not a new instance of the pass in
> > combine.cc, but I don't think that's the implication.  We already have
> > two combine passes: the combine.cc one and the postreload one.
>
> There is no postreload-combine pass.  There is a postreload pass that
> does various trivial things.  One of those is reload_combine, which is
> nothing like combine.  It is a kind of limited fwprop for memory
> addressing.
>
> > > Could you rename it to something else, please?  Something less confusing
> > > to both users and maintainers :-)
> >
> > Do you have any suggestions?
>
> Since it is something like fwprop, maybe something like that?  Or maybe
> put "addressing" in the name, if that is the point here.
>
> > >> The pass currently has a single objective: remove definitions by
> > >> substituting into all uses.
> > >
> > > The easy case ;-)
> >
> > And the yet a case that no existing pass handles. :)  That's why I'm
> > trying to add something that does.
>
> So, fwprop.

fullprop

Richard.

>
> Segher
YunQiang Su June 25, 2024, 5:22 p.m. UTC | #19
Just FYI. This patch does something to gcc.target/mips/madd-8.c, and
gcc.target/mips/msub-8.c.

-PASS: gcc.target/mips/madd-8.c   -O2   scan-assembler \tmul\t
-PASS: gcc.target/mips/madd-8.c   -O2   scan-assembler-not \tmadd\t
-PASS: gcc.target/mips/madd-8.c   -O2   scan-assembler-not \tmflo\t
-PASS: gcc.target/mips/madd-8.c   -O2   scan-assembler-not \tmtlo\t
+FAIL: gcc.target/mips/madd-8.c   -O2   scan-assembler \tmul\t
+FAIL: gcc.target/mips/madd-8.c   -O2   scan-assembler-not \tmadd\t
+FAIL: gcc.target/mips/madd-8.c   -O2   scan-assembler-not \tmflo\t
+FAIL: gcc.target/mips/madd-8.c   -O2   scan-assembler-not \tmtlo\t

-FAIL: gcc.target/mips/msub-8.c   -O2   scan-assembler \tmul\t
-FAIL: gcc.target/mips/msub-8.c   -O2   scan-assembler-not \tmflo\t
-FAIL: gcc.target/mips/msub-8.c   -O2   scan-assembler-not \tmsub\t
-FAIL: gcc.target/mips/msub-8.c   -O2   scan-assembler-not \tmtlo\t
+PASS: gcc.target/mips/msub-8.c   -O2   scan-assembler \tmul\t
+PASS: gcc.target/mips/msub-8.c   -O2   scan-assembler-not \tmflo\t
+PASS: gcc.target/mips/msub-8.c   -O2   scan-assembler-not \tmsub\t
+PASS: gcc.target/mips/msub-8.c   -O2   scan-assembler-not \tmtlo\t

Quite interesting.  I will inverest the real reason.
Thomas Schwinge June 27, 2024, 4:49 p.m. UTC | #20
Hi!

On 2023-10-24T19:49:10+0100, Richard Sandiford <richard.sandiford@arm.com> wrote:
> This patch adds a combine pass that runs late in the pipeline.

Great!

In context of <https://gcc.gnu.org/PR115682>
'nvptx vs. "fwprop: invoke change_is_worthwhile to judge if a replacement is worthwhile"',
I've beek looking a bit through recent nvptx target code generation
changes for GCC target libraries, and thought I'd also share here my
findings for the "late-combine" changes in isolation, for nvptx target.

First the unexpected thing: there are a few cases where we now see unused
registers get declared, for example (random) in
'nvptx-none/newlib/libc/libm_a-s_modf.o:modf' (full 'diff' before vs.
after):

    [...]
     .visible .func (.param .f64 %value_out) modf (.param .f64 %in_ar0, .param .u64 %in_ar1)
     {
     .reg .f64 %value;
     .reg .f64 %ar0;
     ld.param.f64 %ar0,[%in_ar0];
     .reg .u64 %ar1;
     ld.param.u64 %ar1,[%in_ar1];
     .reg .u32 %r23;
     .reg .f64 %r32;
     .reg .u32 %r41;
     .reg .u32 %r42;
     .reg .f64 %r43;
     .reg .u32 %r46;
    +.reg .u64 %r48;
    +.reg .u64 %r49;
    +.reg .u64 %r50;
    +.reg .u64 %r51;
    +.reg .u64 %r52;
     .reg .f64 %r53;
     .reg .f64 %r54;
     .reg .u64 %r55;
    [...]

That is, five additional registers declared, without any use.  I suppose
that's some 'gen_reg_rtx' that needs to be "confined" (in whichever way;
to be looked into).  I suppose that's not actually a problem: the PTX JIT
should clean these out again, but it's still noise when reading
GCC-emitted PTX code.


Other than that, I've only got good things to report :-) -- a few
examples in the following, if anyone's interested, without much
commentary.  I haven't looked of how much these "visible" PTX code
generation changes translate into actual GPU SASS code improvements after
the PTX JIT has done its thing, but it's certainly easier to read/less
state to keep during human reading (due to less live registers,
primarily).

'nvptx-none/libatomic/cas_16_.o':

    [...]
    -.reg .u32 %r24;
    [...]
     setp.eq.u64 %r45,%r40,0;
    -selp.u32 %r24,1,0,%r45;
     .loc 3 113 6
     setp.ne.u64 %r48,%r58,%r60;
     @ %r48 bra $L2;
    @@ -86,7 +84,7 @@
     call libat_unlock_1,(%out_arg1);
     }
     .loc 3 122 1
    -mov.u32 %value,%r24;
    +selp.u32 %value,1,0,%r45;
     st.param.u32 [%value_out],%value;
     ret;
     }

A lot more instances of similar patterns.

'nvptx-none/libbacktrace/dwarf.o':

    [...]
    -.reg .u64 %r695;
    [...]
     setp.ne.u32 %r679,%r678,0;
    -@ ! %r679 bra $L1149;
    -add.u64 %r695,%r212,32;
    -bra $L1090;
    -$L1149:
    +@ %r679 bra $L1090;
    [...]
     $L1090:
     .loc 2 4046 7
    -mov.u64 %r28,%r695;
    +add.u64 %r28,%r212,32;
     $L1096:
    [...]

'nvptx-none/libbacktrace/dwarf.o':

    [...]
    -.reg .u64 %r41;
    [...]
    -add.u64 %r41,%frame,8;
     mov.u64 %r43,0;
    -st.u64 [%r41],%r43;
    -st.u64 [%r41+8],%r43;
    -st.u64 [%r41+16],%r43;
    +st.u64 [%frame+8],%r43;
    +st.u64 [%frame+16],%r43;
    +st.u64 [%frame+24],%r43;
    [...]

'nvptx-none/libgfortran/generated/cshift1_8.o':

    [...]
    -.reg .u64 %r293;
    [...]
    -add.u64 %r293,%r292,120;
    -st.u64 [%r293],%r409;
    +st.u64 [%r292+120],%r409;
    [...]

'nvptx-none/newlib/libc/stdio/libc_a-siscanf.o', have 'st' do 'u32'
truncation instead of explicit 'cvt':

    [...]
    -.reg .u32 %r23;
    [...]
    -cvt.u32.u64 %r23,%r32;
    -st.u32 [%frame+8],%r23;
    +st.u32 [%frame+8],%r32;
    [...]


'nvptx-none/newlib/libc/string/libc_a-memccpy.o', simplify (supposedly)
char -> int -> short into char -> short:

    [...]
    -.reg .u32 %r37;
    [...]
    -cvt.u32.u8 %r37,%r49;
    [...]
    -cvt.u16.u32 %r67,%r37;
    +cvt.u16.u8 %r67,%r49;
    [...]

'nvptx-none/libgcc/crt0.o':

    [...]
    -.reg .u64 %r30;
    [...]
     cvta.global.u64 %r29,stack$0+131072;
     st.shared.u64 [__nvptx_stacks],%r29;
    -cvta.shared.u64 %r30,__nvptx_uni;
     mov.u32 %r31,0;
    -st.u32 [%r30],%r31;
    +st.shared.u32 [__nvptx_uni],%r31;
    [...]

There are a lot more instances of getting rid of a register in favor of
using more complex instructions.

'nvptx-none/libgfortran/generated/findloc2_s4.o':

    [...]
    -.reg .u64 %r33;
    [...]
    -neg.s64 %r33,%r35;
    [...]
    -add.u64 %r39,%r39,%r33;
    +sub.u64 %r39,%r39,%r35;
    [...]

..., but in turn also a multi-use variant of the 'neg' (is this still
beneficial?), 'nvptx-none/newlib/libc/search/libc_a-bsd_qsort_r.o':

    [...]
    -.reg .u64 %r34;
    [...]
    -neg.s64 %r34,%r82;
    -.loc 2 213 9
    -add.u64 %r35,%r76,%r34;
    +sub.u64 %r35,%r76,%r82;
    [...]
    -add.u64 %r37,%r71,%r34;
    -add.u64 %r38,%r37,%r34;
    +sub.u64 %r37,%r71,%r82;
    +sub.u64 %r38,%r37,%r82;
    [...]

'nvptx-none/newlib/libm/complex/libm_a-cephes_subr.o':

    [...]
    -.reg .f64 %r35;
    [...]
    -neg.f64 %r35,%r27;
    -fma.rn.f64 %r22,%r35,0d400921fb54000000,%r32;
    +fma.rn.f64 %r22,%r27,0dc00921fb54000000,%r32;
     .loc 2 80 21
    -fma.rn.f64 %r23,%r35,0d3e210b4610000000,%r22;
    +fma.rn.f64 %r23,%r27,0dbe210b4610000000,%r22;
     .loc 2 80 4
    -fma.rn.f64 %value,%r35,0d3c6a62633145c06e,%r23;
    +fma.rn.f64 %value,%r27,0dbc6a62633145c06e,%r23;
    [...]

'nvptx-none/libgcc/_gcov_info_to_gcda.o':

    [...]
    -.reg .u32 %r186;
    [...]
    -add.u32 %r186,%r59,%r59;
    -add.u32 %r187,%r186,%r59;
    +vadd.u32.u32.u32.add %r187,%r59,%r59,%r59;
    [...]


Grüße
 Thomas


> There are two instances: one between combine and split1, and one
> after postreload.
>
> The pass currently has a single objective: remove definitions by
> substituting into all uses.  The pre-RA version tries to restrict
> itself to cases that are likely to have a neutral or beneficial
> effect on register pressure.
>
> The patch fixes PR106594.  It also fixes a few FAILs and XFAILs
> in the aarch64 test results, mostly due to making proper use of
> MOVPRFX in cases where we didn't previously.  I hope it would
> also help with Robin's vec_duplicate testcase, although the
> pressure heuristic might need tweaking for that case.
>
> This is just a first step..  I'm hoping that the pass could be
> used for other combine-related optimisations in future.  In particular,
> the post-RA version doesn't need to restrict itself to cases where all
> uses are substitutitable, since it doesn't have to worry about register
> pressure.  If we did that, and if we extended it to handle multi-register
> REGs, the pass might be a viable replacement for regcprop, which in
> turn might reduce the cost of having a post-RA instance of the new pass.
>
> I've run an assembly comparison with one target per CPU directory,
> and it seems to be a win for all targets except nvptx (which is hard
> to measure, being a higher-level asm).  The biggest winner seemed
> to be AVR.
>
> I'd originally hoped to enable the pass by default at -O2 and above
> on all targets.  But in the end, I don't think that's possible,
> because it interacts badly with x86's STV and partial register
> dependency passes.
>
> For example, gcc.target/i386/minmax-6.c tests whether the code
> compiles without any spilling.  The RTL created by STV contains:
>
> (insn 33 31 3 2 (set (subreg:V4SI (reg:SI 120) 0)
>         (vec_merge:V4SI (vec_duplicate:V4SI (reg:SI 116))
>             (const_vector:V4SI [
>                     (const_int 0 [0]) repeated x4
>                 ])
>             (const_int 1 [0x1]))) -1
>      (nil))
> (insn 3 33 34 2 (set (subreg:V4SI (reg:SI 118) 0)
>         (subreg:V4SI (reg:SI 120) 0)) {movv4si_internal}
>      (expr_list:REG_DEAD (reg:SI 120)
>         (nil)))
> (insn 34 3 32 2 (set (reg/v:SI 108 [ y ])
>         (reg:SI 118)) -1
>      (nil))
>
> and it's crucial for the test that reg 108 is kept, rather than
> propagated into uses.  As things stand, 118 can be allocated
> a vector register and 108 a scalar register.  If 108 is propagated,
> there will be scalar and vector uses of 118, and so it will be
> spilled to memory.
>
> That one could be solved by running STV2 later.  But RPAD is
> a bigger problem.  In gcc.target/i386/pr87007-5.c, RPAD converts:
>
> (insn 27 26 28 6 (set (reg:DF 100 [ _15 ])
>         (sqrt:DF (mem/c:DF (symbol_ref:DI ("d2"))))) {*sqrtdf2_sse}
>      (nil))
>
> into:
>
> (insn 45 26 44 6 (set (reg:V4SF 108)
>         (const_vector:V4SF [
>                 (const_double:SF 0.0 [0x0.0p+0]) repeated x4
>             ])) -1
>      (nil))
> (insn 44 45 27 6 (set (reg:V2DF 109)
>         (vec_merge:V2DF (vec_duplicate:V2DF (sqrt:DF (mem/c:DF (symbol_ref:DI ("d2")))))
>             (subreg:V2DF (reg:V4SF 108) 0)
>             (const_int 1 [0x1]))) -1
>      (nil))
> (insn 27 44 28 6 (set (reg:DF 100 [ _15 ])
>         (subreg:DF (reg:V2DF 109) 0)) {*movdf_internal}
>      (nil))
>
> But both the pre-RA and post-RA passes are able to combine these
> instructions back to the original form.
>
> The patch therefore enables the pass by default only on AArch64.
> However, I did test the patch with it enabled on x86_64-linux-gnu
> as well, which was useful for debugging.
>
> Bootstrapped & regression-tested on aarch64-linux-gnu and
> x86_64-linux-gnu (as posted, with no regressions, and with the
> pass enabled by default, with some gcc.target/i386 regressions).
> OK to install?
>
> Richard
Thomas Schwinge June 27, 2024, 8:27 p.m. UTC | #21
Hi!

On 2024-06-27T18:49:17+0200, I wrote:
> On 2023-10-24T19:49:10+0100, Richard Sandiford <richard.sandiford@arm.com> wrote:
>> This patch adds a combine pass that runs late in the pipeline.

[After sending, I realized I replied to a previous thread of this work.]

> I've beek looking a bit through recent nvptx target code generation
> changes for GCC target libraries, and thought I'd also share here my
> findings for the "late-combine" changes in isolation, for nvptx target.
> 
> First the unexpected thing:

So much for "unexpected thing" -- next level of unexpected here...
Appreciated if anyone feels like helping me find my way through this, but
I totally understand if you've got other things to do.

> there are a few cases where we now see unused
> registers get declared, for example (random) in
> 'nvptx-none/newlib/libc/libm_a-s_modf.o:modf'

I first looked into a simpler case: newlib 'libc/locale/lnumeric.c'.

Here we get the following 'diff' for '*.s' for
'-fno-late-combine-instructions' vs. (default)
'-flate-combine-instructions':

     .visible .func (.param.u32 %value_out) __numeric_load_locale (.param.u64 %in_ar0, .param.u64 %in_ar1, .param.u64 %in_ar2, .param.u64 %in_ar3)
     {
            .reg.u32 %value;
            .reg.u64 %ar0;
            ld.param.u64 %ar0, [%in_ar0];
            .reg.u64 %ar1;
            ld.param.u64 %ar1, [%in_ar1];
            .reg.u64 %ar2;
            ld.param.u64 %ar2, [%in_ar2];
            .reg.u64 %ar3;
            ld.param.u64 %ar3, [%in_ar3];
    +       .reg.u32 %r22;
            .file 2 "../../../source-gcc/newlib/libc/locale/lnumeric.c"
            .loc 2 89 1
                    mov.u32 %value, 0;
            st.param.u32    [%value_out], %value;
            ret;
     }

Clearly, '%r22' is unused.  However, looking at the source code (manually
trimmed):

    int
    __numeric_load_locale (struct __locale_t *locale, const char *name ,
                           void *f_wctomb, const char *charset)
    {
      int ret;
      struct lc_numeric_T nm;
      char *bufp = NULL;
    
    #ifdef __CYGWIN__
      [...]
    #else
      /* TODO */
    #endif
      return ret;
    }

..., and adding '-Wall' (why isn't top-level/newlib build system doing
that...):

    [...]
    ../../../source-gcc/newlib/libc/locale/lnumeric.c:88:10: warning: ‘ret’ is used uninitialized [-Wuninitialized]
       88 |   return ret;
          |          ^~~
    ../../../source-gcc/newlib/libc/locale/lnumeric.c:48:7: note: ‘ret’ was declared here
       48 |   int ret;
          |       ^~~

Uh.  Given nothing else is going on in that function, I suppose '%r22'
relates to the uninitialized 'ret' -- and given undefined behavior, GCC
of course is fine to emit an unused 'reg' in that case...

But: should we expect '-fno-late-combine-instructions' vs.
'-flate-combine-instructions' to behave in the same way?  (After all,
'%r22' remains unused also with '-flate-combine-instructions', and
doesn't need to be emitted.)  This could, of course, also be a nvptx back
end issue?

I'm happy to supply any dump files etc.  Also, 'tmp-libc_a-lnumeric.i.xz'
is attached if you'd like to reproduce this with your own nvptx target
'cc1':

    $ [...]/configure --target=nvptx-none --enable-languages=c
    $ make -j12 all-gcc
    $ gcc/cc1 -fpreprocessed tmp-libc_a-lnumeric.i -quiet -dumpbase tmp-libc_a-lnumeric.c -dumpbase-ext .c -misa=sm_30 -g -O2 -fno-builtin -o tmp-libc_a-lnumeric.s -fdump-rtl-all # -fno-late-combine-instructions


Grüße
 Thomas
Thomas Schwinge June 27, 2024, 9:20 p.m. UTC | #22
Hi!

On 2024-06-27T22:27:21+0200, I wrote:
> On 2024-06-27T18:49:17+0200, I wrote:
>> On 2023-10-24T19:49:10+0100, Richard Sandiford <richard.sandiford@arm.com> wrote:
>>> This patch adds a combine pass that runs late in the pipeline.
>
> [After sending, I realized I replied to a previous thread of this work.]
>
>> I've beek looking a bit through recent nvptx target code generation
>> changes for GCC target libraries, and thought I'd also share here my
>> findings for the "late-combine" changes in isolation, for nvptx target.
>> 
>> First the unexpected thing:
>
> So much for "unexpected thing" -- next level of unexpected here...
> Appreciated if anyone feels like helping me find my way through this, but
> I totally understand if you've got other things to do.

OK, I found something already.  (Unexpectedly quickly...)  ;-)

>> there are a few cases where we now see unused
>> registers get declared, for example (random) in
>> 'nvptx-none/newlib/libc/libm_a-s_modf.o:modf'

I've now looked into the former one ('tmp-libm_a-s_modf.i.xz' is
attached), to avoid...

> I first looked into a simpler case: newlib 'libc/locale/lnumeric.c'.

>     ../../../source-gcc/newlib/libc/locale/lnumeric.c:88:10: warning: ‘ret’ is used uninitialized [-Wuninitialized]
>        88 |   return ret;
>           |          ^~~
>     ../../../source-gcc/newlib/libc/locale/lnumeric.c:48:7: note: ‘ret’ was declared here
>        48 |   int ret;
>           |       ^~~
>
> Uh.  Given nothing else is going on in that function, I suppose '%r22'
> relates to the uninitialized 'ret' -- and given undefined behavior, GCC
> of course is fine to emit an unused 'reg' in that case...

... the undefined behavior here.

But in fact, for both cases, the unexpected difference goes away if after
'pass_late_combine' I inject a 'pass_fast_rtl_dce'.  That's normally run
as part of 'PUSH_INSERT_PASSES_WITHIN (pass_postreload)' -- but that's
all not active for nvptx target given '!reload_completed', given nvptx is
'targetm.no_register_allocation'.  Maybe we need to enable a few more
passes, or is there anything in 'pass_late_combine' to change, so that we
don't run into this?  Does it inadvertently mark registers live or
something like that?

The following makes these two cases work, but evidently needs a lot more
analysis: a lot of other passes are enabled that may be anything between
beneficial and harmful for 'targetm.no_register_allocation'/nvptx.

    --- gcc/passes.cc
    +++ gcc/passes.cc
    @@ -676,17 +676,17 @@ const pass_data pass_data_postreload =
     class pass_postreload : public rtl_opt_pass
     {
     public:
       pass_postreload (gcc::context *ctxt)
         : rtl_opt_pass (pass_data_postreload, ctxt)
       {}
     
       /* opt_pass methods: */
    -  bool gate (function *) final override { return reload_completed; }
    +  bool gate (function *) final override { return reload_completed || targetm.no_register_allocation; }
    --- gcc/regcprop.cc
    +++ gcc/regcprop.cc
    @@ -1305,17 +1305,17 @@ class pass_cprop_hardreg : public rtl_opt_pass
     public:
       pass_cprop_hardreg (gcc::context *ctxt)
         : rtl_opt_pass (pass_data_cprop_hardreg, ctxt)
       {}
     
       /* opt_pass methods: */
       bool gate (function *) final override
         {
    -      return (optimize > 0 && (flag_cprop_registers));
    +      return (optimize > 0 && flag_cprop_registers && !targetm.no_register_allocation);
         }


Grüße
 Thomas


> But: should we expect '-fno-late-combine-instructions' vs.
> '-flate-combine-instructions' to behave in the same way?  (After all,
> '%r22' remains unused also with '-flate-combine-instructions', and
> doesn't need to be emitted.)  This could, of course, also be a nvptx back
> end issue?
>
> I'm happy to supply any dump files etc.  Also, 'tmp-libc_a-lnumeric.i.xz'
> is attached if you'd like to reproduce this with your own nvptx target
> 'cc1':
>
>     $ [...]/configure --target=nvptx-none --enable-languages=c
>     $ make -j12 all-gcc
>     $ gcc/cc1 -fpreprocessed tmp-libc_a-lnumeric.i -quiet -dumpbase tmp-libc_a-lnumeric.c -dumpbase-ext .c -misa=sm_30 -g -O2 -fno-builtin -o tmp-libc_a-lnumeric.s -fdump-rtl-all # -fno-late-combine-instructions
>
>
> Grüße
>  Thomas
Thomas Schwinge June 27, 2024, 10:41 p.m. UTC | #23
Hi!

On 2024-06-27T23:20:18+0200, I wrote:
> On 2024-06-27T22:27:21+0200, I wrote:
>> On 2024-06-27T18:49:17+0200, I wrote:
>>> On 2023-10-24T19:49:10+0100, Richard Sandiford <richard.sandiford@arm.com> wrote:
>>>> This patch adds a combine pass that runs late in the pipeline.
>>
>> [After sending, I realized I replied to a previous thread of this work.]
>>
>>> I've beek looking a bit through recent nvptx target code generation
>>> changes for GCC target libraries, and thought I'd also share here my
>>> findings for the "late-combine" changes in isolation, for nvptx target.
>>> 
>>> First the unexpected thing:
>>
>> So much for "unexpected thing" -- next level of unexpected here...
>> Appreciated if anyone feels like helping me find my way through this, but
>> I totally understand if you've got other things to do.
>
> OK, I found something already.  (Unexpectedly quickly...)  ;-)
>
>>> there are a few cases where we now see unused
>>> registers get declared

> But in fact, for both cases

Now tested: 's%both%all'.  :-)

> the unexpected difference goes away if after
> 'pass_late_combine' I inject a 'pass_fast_rtl_dce'.  That's normally run
> as part of 'PUSH_INSERT_PASSES_WITHIN (pass_postreload)' -- but that's
> all not active for nvptx target given '!reload_completed', given nvptx is
> 'targetm.no_register_allocation'.  Maybe we need to enable a few more
> passes, or is there anything in 'pass_late_combine' to change, so that we
> don't run into this?  Does it inadvertently mark registers live or
> something like that?

Basically, is 'pass_late_combine' potentionally doing things that depend
on later clean-up?  (..., or shouldn't it be doing these things in the
first place?)

> The following makes these two cases work, but evidently needs a lot more
> analysis: a lot of other passes are enabled that may be anything between
> beneficial and harmful for 'targetm.no_register_allocation'/nvptx.
>
>     --- gcc/passes.cc
>     +++ gcc/passes.cc
>     @@ -676,17 +676,17 @@ const pass_data pass_data_postreload =
>      class pass_postreload : public rtl_opt_pass
>      {
>      public:
>        pass_postreload (gcc::context *ctxt)
>          : rtl_opt_pass (pass_data_postreload, ctxt)
>        {}
>      
>        /* opt_pass methods: */
>     -  bool gate (function *) final override { return reload_completed; }
>     +  bool gate (function *) final override { return reload_completed || targetm.no_register_allocation; }
>     --- gcc/regcprop.cc
>     +++ gcc/regcprop.cc
>     @@ -1305,17 +1305,17 @@ class pass_cprop_hardreg : public rtl_opt_pass
>      public:
>        pass_cprop_hardreg (gcc::context *ctxt)
>          : rtl_opt_pass (pass_data_cprop_hardreg, ctxt)
>        {}
>      
>        /* opt_pass methods: */
>        bool gate (function *) final override
>          {
>     -      return (optimize > 0 && (flag_cprop_registers));
>     +      return (optimize > 0 && flag_cprop_registers && !targetm.no_register_allocation);
>          }

Also, that quickly ICEs; more '[...] && !targetm.no_register_allocation'
are needed elsewhere, at least.

The following simpler thing, however, does work; move 'pass_fast_rtl_dce'
out of 'pass_postreload':

    --- gcc/passes.cc
    +++ gcc/passes.cc
    @@ -677,14 +677,15 @@ class pass_postreload : public rtl_opt_pass
     {
     public:
       pass_postreload (gcc::context *ctxt)
         : rtl_opt_pass (pass_data_postreload, ctxt)
       {}
     
       /* opt_pass methods: */
    +  opt_pass * clone () final override { return new pass_postreload (m_ctxt); }
       bool gate (function *) final override { return reload_completed; }
     
     }; // class pass_postreload
    --- gcc/passes.def
    +++ gcc/passes.def
    @@ -529,7 +529,10 @@ along with GCC; see the file COPYING3.  If not see
              NEXT_PASS (pass_regrename);
              NEXT_PASS (pass_fold_mem_offsets);
              NEXT_PASS (pass_cprop_hardreg);
    -         NEXT_PASS (pass_fast_rtl_dce);
    +      POP_INSERT_PASSES ()
    +      NEXT_PASS (pass_fast_rtl_dce);
    +      NEXT_PASS (pass_postreload);
    +      PUSH_INSERT_PASSES_WITHIN (pass_postreload)
              NEXT_PASS (pass_reorder_blocks);
              NEXT_PASS (pass_leaf_regs);
              NEXT_PASS (pass_split_before_sched2);

This (only) cleans up "the mess that 'pass_late_combine' created"; no
further changes in GCC target libraries for nvptx.  (For avoidance of
doubt: "mess" is a great exaggeration here.)


Grüße
 Thomas


>> But: should we expect '-fno-late-combine-instructions' vs.
>> '-flate-combine-instructions' to behave in the same way?  (After all,
>> '%r22' remains unused also with '-flate-combine-instructions', and
>> doesn't need to be emitted.)  This could, of course, also be a nvptx back
>> end issue?
>>
>> I'm happy to supply any dump files etc.  Also, 'tmp-libc_a-lnumeric.i.xz'
>> is attached if you'd like to reproduce this with your own nvptx target
>> 'cc1':
>>
>>     $ [...]/configure --target=nvptx-none --enable-languages=c
>>     $ make -j12 all-gcc
>>     $ gcc/cc1 -fpreprocessed tmp-libc_a-lnumeric.i -quiet -dumpbase tmp-libc_a-lnumeric.c -dumpbase-ext .c -misa=sm_30 -g -O2 -fno-builtin -o tmp-libc_a-lnumeric.s -fdump-rtl-all # -fno-late-combine-instructions
>>
>>
>> Grüße
>>  Thomas
Roger Sayle June 28, 2024, 6:07 a.m. UTC | #24
Hi Thomas,

There are two things I think I can contribute to this discussion.  The first is that I have
a patch (from a year or two ago) for adding rtx_costs to the nvptx backend that I will
respin, which will provide more backend control over combine-like pass decisions.

The second is in response to your comment below; nvptx is currently GCC's only
target.no_register_allocation target, and I think this provides an opportunity to
tweak/change these semantics.  Currently "no_register_allocation" turns off a 
whole bunch of post-reload passes (including things like peephole2 and predication
that would be extremely useful on nvptx).  My thoughts were that it would be nice,
if no_register_allocation turned off just/only the ira/reload passes, where the gate
function would instead simply set reload_completed.

One major advantage of this approach is that it would minimize the divergence
in code paths, between nvptx and all of GCC's other backends, which would hopefully
help with things like "late-combine" that may implicitly (and perhaps unintentionally)
require/expect certain passes to be run later.

Again I've investigated a patch/change or two towards the above approach, but its 
slightly more complicated than I'd anticipated, due to things like late nvptx-specific
passes assuming they can call gen_reg_rtx, after other targets have no_new_pseudos.
There's nothing that's impossible, but it would take a concerted effort by target and
middle-end maintainers to align a shallower "no_register_allocation".

I'm curious what folks think.

Best regards,
Roger
--

> -----Original Message-----
> From: Thomas Schwinge <tschwinge@baylibre.com>
> Sent: 27 June 2024 22:20
> To: Richard Sandiford <richard.sandiford@arm.com>
> Cc: jlaw@ventanamicro.com; rdapp.gcc@gmail.com; gcc-patches@gcc.gnu.org;
> Tom de Vries <tdevries@suse.de>; Roger Sayle <roger@nextmovesoftware.com>
> Subject: Re: nvptx vs. [PATCH] Add a late-combine pass [PR106594]
> 
> Hi!
> 
> On 2024-06-27T22:27:21+0200, I wrote:
> > On 2024-06-27T18:49:17+0200, I wrote:
> >> On 2023-10-24T19:49:10+0100, Richard Sandiford
> <richard.sandiford@arm.com> wrote:
> >>> This patch adds a combine pass that runs late in the pipeline.
> >
> > [After sending, I realized I replied to a previous thread of this
> > work.]
> >
> >> I've beek looking a bit through recent nvptx target code generation
> >> changes for GCC target libraries, and thought I'd also share here my
> >> findings for the "late-combine" changes in isolation, for nvptx target.
> >>
> >> First the unexpected thing:
> >
> > So much for "unexpected thing" -- next level of unexpected here...
> > Appreciated if anyone feels like helping me find my way through this,
> > but I totally understand if you've got other things to do.
> 
> OK, I found something already.  (Unexpectedly quickly...)  ;-)
> 
> >> there are a few cases where we now see unused registers get declared,
> >> for example (random) in 'nvptx-none/newlib/libc/libm_a-s_modf.o:modf'
> 
> I've now looked into the former one ('tmp-libm_a-s_modf.i.xz' is attached), to
> avoid...
> 
> > I first looked into a simpler case: newlib 'libc/locale/lnumeric.c'.
> 
> >     ../../../source-gcc/newlib/libc/locale/lnumeric.c:88:10: warning: ‘ret’ is used
> uninitialized [-Wuninitialized]
> >        88 |   return ret;
> >           |          ^~~
> >     ../../../source-gcc/newlib/libc/locale/lnumeric.c:48:7: note: ‘ret’ was
> declared here
> >        48 |   int ret;
> >           |       ^~~
> >
> > Uh.  Given nothing else is going on in that function, I suppose '%r22'
> > relates to the uninitialized 'ret' -- and given undefined behavior,
> > GCC of course is fine to emit an unused 'reg' in that case...
> 
> ... the undefined behavior here.
> 
> But in fact, for both cases, the unexpected difference goes away if after
> 'pass_late_combine' I inject a 'pass_fast_rtl_dce'.  That's normally run as part of
> 'PUSH_INSERT_PASSES_WITHIN (pass_postreload)' -- but that's all not active for
> nvptx target given '!reload_completed', given nvptx is
> 'targetm.no_register_allocation'.  Maybe we need to enable a few more passes,
> or is there anything in 'pass_late_combine' to change, so that we don't run into
> this?  Does it inadvertently mark registers live or something like that?
> 
> The following makes these two cases work, but evidently needs a lot more
> analysis: a lot of other passes are enabled that may be anything between
> beneficial and harmful for 'targetm.no_register_allocation'/nvptx.
> 
>     --- gcc/passes.cc
>     +++ gcc/passes.cc
>     @@ -676,17 +676,17 @@ const pass_data pass_data_postreload =
>      class pass_postreload : public rtl_opt_pass
>      {
>      public:
>        pass_postreload (gcc::context *ctxt)
>          : rtl_opt_pass (pass_data_postreload, ctxt)
>        {}
> 
>        /* opt_pass methods: */
>     -  bool gate (function *) final override { return reload_completed; }
>     +  bool gate (function *) final override { return reload_completed ||
> targetm.no_register_allocation; }
>     --- gcc/regcprop.cc
>     +++ gcc/regcprop.cc
>     @@ -1305,17 +1305,17 @@ class pass_cprop_hardreg : public rtl_opt_pass
>      public:
>        pass_cprop_hardreg (gcc::context *ctxt)
>          : rtl_opt_pass (pass_data_cprop_hardreg, ctxt)
>        {}
> 
>        /* opt_pass methods: */
>        bool gate (function *) final override
>          {
>     -      return (optimize > 0 && (flag_cprop_registers));
>     +      return (optimize > 0 && flag_cprop_registers &&
> !targetm.no_register_allocation);
>          }
> 
> 
> Grüße
>  Thomas
> 
> 
> > But: should we expect '-fno-late-combine-instructions' vs.
> > '-flate-combine-instructions' to behave in the same way?  (After all,
> > '%r22' remains unused also with '-flate-combine-instructions', and
> > doesn't need to be emitted.)  This could, of course, also be a nvptx
> > back end issue?
> >
> > I'm happy to supply any dump files etc.  Also, 'tmp-libc_a-lnumeric.i.xz'
> > is attached if you'd like to reproduce this with your own nvptx target
> > 'cc1':
> >
> >     $ [...]/configure --target=nvptx-none --enable-languages=c
> >     $ make -j12 all-gcc
> >     $ gcc/cc1 -fpreprocessed tmp-libc_a-lnumeric.i -quiet -dumpbase
> > tmp-libc_a-lnumeric.c -dumpbase-ext .c -misa=sm_30 -g -O2 -fno-builtin
> > -o tmp-libc_a-lnumeric.s -fdump-rtl-all #
> > -fno-late-combine-instructions
> >
> >
> > Grüße
> >  Thomas
>
Richard Sandiford June 28, 2024, 2:01 p.m. UTC | #25
Thomas Schwinge <tschwinge@baylibre.com> writes:
> Hi!
>
> On 2024-06-27T23:20:18+0200, I wrote:
>> On 2024-06-27T22:27:21+0200, I wrote:
>>> On 2024-06-27T18:49:17+0200, I wrote:
>>>> On 2023-10-24T19:49:10+0100, Richard Sandiford <richard.sandiford@arm.com> wrote:
>>>>> This patch adds a combine pass that runs late in the pipeline.
>>>
>>> [After sending, I realized I replied to a previous thread of this work.]
>>>
>>>> I've beek looking a bit through recent nvptx target code generation
>>>> changes for GCC target libraries, and thought I'd also share here my
>>>> findings for the "late-combine" changes in isolation, for nvptx target.
>>>> 
>>>> First the unexpected thing:
>>>
>>> So much for "unexpected thing" -- next level of unexpected here...
>>> Appreciated if anyone feels like helping me find my way through this, but
>>> I totally understand if you've got other things to do.
>>
>> OK, I found something already.  (Unexpectedly quickly...)  ;-)
>>
>>>> there are a few cases where we now see unused
>>>> registers get declared
>
>> But in fact, for both cases
>
> Now tested: 's%both%all'.  :-)
>
>> the unexpected difference goes away if after
>> 'pass_late_combine' I inject a 'pass_fast_rtl_dce'.  That's normally run
>> as part of 'PUSH_INSERT_PASSES_WITHIN (pass_postreload)' -- but that's
>> all not active for nvptx target given '!reload_completed', given nvptx is
>> 'targetm.no_register_allocation'.  Maybe we need to enable a few more
>> passes, or is there anything in 'pass_late_combine' to change, so that we
>> don't run into this?  Does it inadvertently mark registers live or
>> something like that?
>
> Basically, is 'pass_late_combine' potentionally doing things that depend
> on later clean-up?  (..., or shouldn't it be doing these things in the
> first place?)

It's possible that late-combine could expose dead code, but I imagine
it's a niche case.

I had a look at the nvptx logs from my comparison, and the cases in
which I saw this seemed to be those where late-combine doesn't find
anything to do.  Does that match your examples?  Specifically,
the effect should be the same with -fdbg-cnt=late_combine:0-0

I think what's happening is that:

- combine exposes dead code

- ce2 previously ran df_analyze with DF_LR_RUN_DCE set, and so cleared
  up the dead code

- late-combine instead runs df_analyze without that flag (since late-combine
  itself doesn't really care whether dead code is present)

- if late-combine doesn't do anything, ce2's df_analyze call has nothing
  to do, and skips even the DCE

The easiest fix would be to add:

  df_set_flags (DF_LR_RUN_DCE);

before df_analyze in late-combine.cc, so that it behaves like ce2.
But the arrangement feels wrong.  I would have expected DF_LR_RUN_DCE
to depend on whether df_analyze had been called since the last DCE pass
(whether DF_LR_RUN_DCE or a full DCE).

Thanks,
Richard
Richard Sandiford June 28, 2024, 4:48 p.m. UTC | #26
Richard Sandiford <richard.sandiford@arm.com> writes:
> Thomas Schwinge <tschwinge@baylibre.com> writes:
>> Hi!
>>
>> On 2024-06-27T23:20:18+0200, I wrote:
>>> On 2024-06-27T22:27:21+0200, I wrote:
>>>> On 2024-06-27T18:49:17+0200, I wrote:
>>>>> On 2023-10-24T19:49:10+0100, Richard Sandiford <richard.sandiford@arm.com> wrote:
>>>>>> This patch adds a combine pass that runs late in the pipeline.
>>>>
>>>> [After sending, I realized I replied to a previous thread of this work.]
>>>>
>>>>> I've beek looking a bit through recent nvptx target code generation
>>>>> changes for GCC target libraries, and thought I'd also share here my
>>>>> findings for the "late-combine" changes in isolation, for nvptx target.
>>>>> 
>>>>> First the unexpected thing:
>>>>
>>>> So much for "unexpected thing" -- next level of unexpected here...
>>>> Appreciated if anyone feels like helping me find my way through this, but
>>>> I totally understand if you've got other things to do.
>>>
>>> OK, I found something already.  (Unexpectedly quickly...)  ;-)
>>>
>>>>> there are a few cases where we now see unused
>>>>> registers get declared
>>
>>> But in fact, for both cases
>>
>> Now tested: 's%both%all'.  :-)
>>
>>> the unexpected difference goes away if after
>>> 'pass_late_combine' I inject a 'pass_fast_rtl_dce'.  That's normally run
>>> as part of 'PUSH_INSERT_PASSES_WITHIN (pass_postreload)' -- but that's
>>> all not active for nvptx target given '!reload_completed', given nvptx is
>>> 'targetm.no_register_allocation'.  Maybe we need to enable a few more
>>> passes, or is there anything in 'pass_late_combine' to change, so that we
>>> don't run into this?  Does it inadvertently mark registers live or
>>> something like that?
>>
>> Basically, is 'pass_late_combine' potentionally doing things that depend
>> on later clean-up?  (..., or shouldn't it be doing these things in the
>> first place?)
>
> It's possible that late-combine could expose dead code, but I imagine
> it's a niche case.
>
> I had a look at the nvptx logs from my comparison, and the cases in
> which I saw this seemed to be those where late-combine doesn't find
> anything to do.  Does that match your examples?  Specifically,
> the effect should be the same with -fdbg-cnt=late_combine:0-0
>
> I think what's happening is that:
>
> - combine exposes dead code
>
> - ce2 previously ran df_analyze with DF_LR_RUN_DCE set, and so cleared
>   up the dead code
>
> - late-combine instead runs df_analyze without that flag (since late-combine
>   itself doesn't really care whether dead code is present)
>
> - if late-combine doesn't do anything, ce2's df_analyze call has nothing
>   to do, and skips even the DCE
>
> The easiest fix would be to add:
>
>   df_set_flags (DF_LR_RUN_DCE);
>
> before df_analyze in late-combine.cc, so that it behaves like ce2.
> But the arrangement feels wrong.  I would have expected DF_LR_RUN_DCE
> to depend on whether df_analyze had been called since the last DCE pass
> (whether DF_LR_RUN_DCE or a full DCE).

I'm testing the attached patch to do that.  I'll submit it properly if
testing passes, but it seems to fix the extra-register problem for me.

Thanks,
Richard

---
Give fast DCE a separate dirty flag

Thomas pointed out that we sometimes failed to eliminate some dead code
(specifically clobbers of otherwise unused registers) on nvptx when
late-combine is enabled.  This happens because:

- combine is able to optimise the function in a way that exposes dead code.
  This leaves the df information in a "dirty" state.

- late_combine calls df_analyze without DF_LR_RUN_DCE run set.
  This updates the df information and clears the "dirty" state.

- late_combine doesn't find any extra optimisations, and so leaves
  the df information up-to-date.

- if_after_combine (ce2) calls df_analyze with DF_LR_RUN_DCE set.
  Because the df information is already up-to-date, fast DCE is
  not run.

The upshot is that running late-combine has the effect of suppressing
a DCE opportunity that would have been noticed without late_combine.

I think this shows that we should track the state of the DCE separately
from the LR problem.  Every pass updates the latter, but not all passes
update the former.

gcc/
	* df.h (DF_LR_DCE): New df_problem_id.
	(df_lr_dce): New macro.
	* df-core.cc (rest_of_handle_df_finish): Check for a null free_fun.
	* df-problems.cc (df_lr_finalize): Split out fast DCE handling to...
	(df_lr_dce_finalize): ...this new function.
	(problem_LR_DCE): New df_problem.
	(df_lr_add_problem): Register LR_DCE rather than LR itself.
	* dce.cc (fast_dce): Clear df_lr_dce->solutions_dirty.
---
 gcc/dce.cc         |  3 ++
 gcc/df-core.cc     |  3 +-
 gcc/df-problems.cc | 96 ++++++++++++++++++++++++++++++++--------------
 gcc/df.h           |  2 +
 4 files changed, 74 insertions(+), 30 deletions(-)

diff --git a/gcc/dce.cc b/gcc/dce.cc
index be1a2a87732..04e8d98818d 100644
--- a/gcc/dce.cc
+++ b/gcc/dce.cc
@@ -1182,6 +1182,9 @@ fast_dce (bool word_level)
   BITMAP_FREE (processed);
   BITMAP_FREE (redo_out);
   BITMAP_FREE (all_blocks);
+
+  /* Both forms of DCE should make further DCE unnecessary.  */
+  df_lr_dce->solutions_dirty = false;
 }
 
 
diff --git a/gcc/df-core.cc b/gcc/df-core.cc
index b0e8a88d433..8fd778a8618 100644
--- a/gcc/df-core.cc
+++ b/gcc/df-core.cc
@@ -806,7 +806,8 @@ rest_of_handle_df_finish (void)
   for (i = 0; i < df->num_problems_defined; i++)
     {
       struct dataflow *dflow = df->problems_in_order[i];
-      dflow->problem->free_fun ();
+      if (dflow->problem->free_fun)
+	dflow->problem->free_fun ();
     }
 
   free (df->postorder);
diff --git a/gcc/df-problems.cc b/gcc/df-problems.cc
index 88ee0dd67fc..bfd24bd1e86 100644
--- a/gcc/df-problems.cc
+++ b/gcc/df-problems.cc
@@ -1054,37 +1054,10 @@ df_lr_transfer_function (int bb_index)
 }
 
 
-/* Run the fast dce as a side effect of building LR.  */
-
 static void
-df_lr_finalize (bitmap all_blocks)
+df_lr_finalize (bitmap)
 {
   df_lr->solutions_dirty = false;
-  if (df->changeable_flags & DF_LR_RUN_DCE)
-    {
-      run_fast_df_dce ();
-
-      /* If dce deletes some instructions, we need to recompute the lr
-	 solution before proceeding further.  The problem is that fast
-	 dce is a pessimestic dataflow algorithm.  In the case where
-	 it deletes a statement S inside of a loop, the uses inside of
-	 S may not be deleted from the dataflow solution because they
-	 were carried around the loop.  While it is conservatively
-	 correct to leave these extra bits, the standards of df
-	 require that we maintain the best possible (least fixed
-	 point) solution.  The only way to do that is to redo the
-	 iteration from the beginning.  See PR35805 for an
-	 example.  */
-      if (df_lr->solutions_dirty)
-	{
-	  df_clear_flags (DF_LR_RUN_DCE);
-	  df_lr_alloc (all_blocks);
-	  df_lr_local_compute (all_blocks);
-	  df_worklist_dataflow (df_lr, all_blocks, df->postorder, df->n_blocks);
-	  df_lr_finalize (all_blocks);
-	  df_set_flags (DF_LR_RUN_DCE);
-	}
-    }
 }
 
 
@@ -1266,6 +1239,69 @@ static const struct df_problem problem_LR =
   false                       /* Reset blocks on dropping out of blocks_to_analyze.  */
 };
 
+/* Run the fast DCE after building LR.  This is a separate problem so that
+   the "dirty" flag is only cleared after a DCE pass is actually run.  */
+
+static void
+df_lr_dce_finalize (bitmap all_blocks)
+{
+  if (!(df->changeable_flags & DF_LR_RUN_DCE))
+    return;
+
+  /* Also clears df_lr_dce->solutions_dirty.  */
+  run_fast_df_dce ();
+
+  /* If dce deletes some instructions, we need to recompute the lr
+     solution before proceeding further.  The problem is that fast
+     dce is a pessimestic dataflow algorithm.  In the case where
+     it deletes a statement S inside of a loop, the uses inside of
+     S may not be deleted from the dataflow solution because they
+     were carried around the loop.  While it is conservatively
+     correct to leave these extra bits, the standards of df
+     require that we maintain the best possible (least fixed
+     point) solution.  The only way to do that is to redo the
+     iteration from the beginning.  See PR35805 for an
+     example.  */
+  if (df_lr->solutions_dirty)
+    {
+      df_clear_flags (DF_LR_RUN_DCE);
+      df_lr_alloc (all_blocks);
+      df_lr_local_compute (all_blocks);
+      df_worklist_dataflow (df_lr, all_blocks, df->postorder, df->n_blocks);
+      df_lr_finalize (all_blocks);
+      df_set_flags (DF_LR_RUN_DCE);
+    }
+}
+
+static const struct df_problem problem_LR_DCE
+{
+  DF_LR_DCE,                  /* Problem id.  */
+  DF_BACKWARD,                /* Direction (arbitrary).  */
+  NULL,                       /* Allocate the problem specific data.  */
+  NULL,                       /* Reset global information.  */
+  NULL,                       /* Free basic block info.  */
+  NULL,                       /* Local compute function.  */
+  NULL,                       /* Init the solution specific data.  */
+  NULL,                       /* Worklist solver.  */
+  NULL,                       /* Confluence operator 0.  */
+  NULL,                       /* Confluence operator n.  */
+  NULL,                       /* Transfer function.  */
+  df_lr_dce_finalize,         /* Finalize function.  */
+  NULL,                       /* Free all of the problem information.  */
+  NULL,                       /* Remove this problem from the stack of dataflow problems.  */
+  NULL,                       /* Debugging.  */
+  NULL,                       /* Debugging start block.  */
+  NULL,                       /* Debugging end block.  */
+  NULL,                       /* Debugging start insn.  */
+  NULL,                       /* Debugging end insn.  */
+  NULL,                       /* Incremental solution verify start.  */
+  NULL,                       /* Incremental solution verify end.  */
+  &problem_LR,                /* Dependent problem.  */
+  0,                          /* Size of entry of block_info array.  */
+  TV_DF_LR,                   /* Timing variable.  */
+  false                       /* Reset blocks on dropping out of blocks_to_analyze.  */
+};
+
 
 /* Create a new DATAFLOW instance and add it to an existing instance
    of DF.  The returned structure is what is used to get at the
@@ -1274,7 +1310,9 @@ static const struct df_problem problem_LR =
 void
 df_lr_add_problem (void)
 {
-  df_add_problem (&problem_LR);
+  /* Also add the fast DCE problem.  It is then conditionally enabled by
+     the DF_LR_RUN_DCE flag.  */
+  df_add_problem (&problem_LR_DCE);
   /* These will be initialized when df_scan_blocks processes each
      block.  */
   df_lr->out_of_date_transfer_functions = BITMAP_ALLOC (&df_bitmap_obstack);
diff --git a/gcc/df.h b/gcc/df.h
index c4e690b40cf..bbb4f297b47 100644
--- a/gcc/df.h
+++ b/gcc/df.h
@@ -47,6 +47,7 @@ enum df_problem_id
   {
     DF_SCAN,
     DF_LR,                /* Live Registers backward. */
+    DF_LR_DCE,            /* Dead code elimination post-pass for LR.  */
     DF_LIVE,              /* Live Registers & Uninitialized Registers */
     DF_RD,                /* Reaching Defs. */
     DF_CHAIN,             /* Def-Use and/or Use-Def Chains. */
@@ -940,6 +941,7 @@ extern class df_d *df;
 #define df_scan    (df->problems_by_index[DF_SCAN])
 #define df_rd      (df->problems_by_index[DF_RD])
 #define df_lr      (df->problems_by_index[DF_LR])
+#define df_lr_dce  (df->problems_by_index[DF_LR_DCE])
 #define df_live    (df->problems_by_index[DF_LIVE])
 #define df_chain   (df->problems_by_index[DF_CHAIN])
 #define df_word_lr (df->problems_by_index[DF_WORD_LR])
Thomas Schwinge July 1, 2024, 11:55 a.m. UTC | #27
Hi Richard!

On 2024-06-28T17:48:30+0100, Richard Sandiford <richard.sandiford@arm.com> wrote:
> Richard Sandiford <richard.sandiford@arm.com> writes:
>> Thomas Schwinge <tschwinge@baylibre.com> writes:
>>> On 2024-06-27T23:20:18+0200, I wrote:
>>>> On 2024-06-27T22:27:21+0200, I wrote:
>>>>> On 2024-06-27T18:49:17+0200, I wrote:
>>>>>> On 2023-10-24T19:49:10+0100, Richard Sandiford <richard.sandiford@arm.com> wrote:
>>>>>>> This patch adds a combine pass that runs late in the pipeline.
>>>>>
>>>>> [After sending, I realized I replied to a previous thread of this work.]
>>>>>
>>>>>> I've beek looking a bit through recent nvptx target code generation
>>>>>> changes for GCC target libraries, and thought I'd also share here my
>>>>>> findings for the "late-combine" changes in isolation, for nvptx target.
>>>>>> 
>>>>>> First the unexpected thing:
>>>>>
>>>>> So much for "unexpected thing" -- next level of unexpected here...
>>>>> Appreciated if anyone feels like helping me find my way through this, but
>>>>> I totally understand if you've got other things to do.
>>>>
>>>> OK, I found something already.  (Unexpectedly quickly...)  ;-)
>>>>
>>>>>> there are a few cases where we now see unused
>>>>>> registers get declared
>>>
>>>> But in fact, for both cases
>>>
>>> Now tested: 's%both%all'.  :-)
>>>
>>>> the unexpected difference goes away if after
>>>> 'pass_late_combine' I inject a 'pass_fast_rtl_dce'.  That's normally run
>>>> as part of 'PUSH_INSERT_PASSES_WITHIN (pass_postreload)' -- but that's
>>>> all not active for nvptx target given '!reload_completed', given nvptx is
>>>> 'targetm.no_register_allocation'.  Maybe we need to enable a few more
>>>> passes, or is there anything in 'pass_late_combine' to change, so that we
>>>> don't run into this?  Does it inadvertently mark registers live or
>>>> something like that?
>>>
>>> Basically, is 'pass_late_combine' potentionally doing things that depend
>>> on later clean-up?  (..., or shouldn't it be doing these things in the
>>> first place?)
>>
>> It's possible that late-combine could expose dead code, but I imagine
>> it's a niche case.
>>
>> I had a look at the nvptx logs from my comparison, and the cases in
>> which I saw this seemed to be those where late-combine doesn't find
>> anything to do.  Does that match your examples?  Specifically,
>> the effect should be the same with -fdbg-cnt=late_combine:0-0
>>
>> I think what's happening is that:
>>
>> - combine exposes dead code
>>
>> - ce2 previously ran df_analyze with DF_LR_RUN_DCE set, and so cleared
>>   up the dead code
>>
>> - late-combine instead runs df_analyze without that flag (since late-combine
>>   itself doesn't really care whether dead code is present)
>>
>> - if late-combine doesn't do anything, ce2's df_analyze call has nothing
>>   to do, and skips even the DCE
>>
>> The easiest fix would be to add:
>>
>>   df_set_flags (DF_LR_RUN_DCE);
>>
>> before df_analyze in late-combine.cc, so that it behaves like ce2.
>> But the arrangement feels wrong.  I would have expected DF_LR_RUN_DCE
>> to depend on whether df_analyze had been called since the last DCE pass
>> (whether DF_LR_RUN_DCE or a full DCE).
>
> I'm testing the attached patch to do that.  I'll submit it properly if
> testing passes, but it seems to fix the extra-register problem for me.

> Give fast DCE a separate dirty flag

Thanks, and yes, your analysis makes sense to me (to the extent that I
only superficially understand these parts of GCC) -- and I confirm that
your proposed change to "Give fast DCE a separate dirty flag" does
address the issue for nvptx target.


Grüße
 Thomas


> Thomas pointed out that we sometimes failed to eliminate some dead code
> (specifically clobbers of otherwise unused registers) on nvptx when
> late-combine is enabled.  This happens because:
>
> - combine is able to optimise the function in a way that exposes dead code.
>   This leaves the df information in a "dirty" state.
>
> - late_combine calls df_analyze without DF_LR_RUN_DCE run set.
>   This updates the df information and clears the "dirty" state.
>
> - late_combine doesn't find any extra optimisations, and so leaves
>   the df information up-to-date.
>
> - if_after_combine (ce2) calls df_analyze with DF_LR_RUN_DCE set.
>   Because the df information is already up-to-date, fast DCE is
>   not run.
>
> The upshot is that running late-combine has the effect of suppressing
> a DCE opportunity that would have been noticed without late_combine.
>
> I think this shows that we should track the state of the DCE separately
> from the LR problem.  Every pass updates the latter, but not all passes
> update the former.
>
> gcc/
> 	* df.h (DF_LR_DCE): New df_problem_id.
> 	(df_lr_dce): New macro.
> 	* df-core.cc (rest_of_handle_df_finish): Check for a null free_fun.
> 	* df-problems.cc (df_lr_finalize): Split out fast DCE handling to...
> 	(df_lr_dce_finalize): ...this new function.
> 	(problem_LR_DCE): New df_problem.
> 	(df_lr_add_problem): Register LR_DCE rather than LR itself.
> 	* dce.cc (fast_dce): Clear df_lr_dce->solutions_dirty.
> ---
>  gcc/dce.cc         |  3 ++
>  gcc/df-core.cc     |  3 +-
>  gcc/df-problems.cc | 96 ++++++++++++++++++++++++++++++++--------------
>  gcc/df.h           |  2 +
>  4 files changed, 74 insertions(+), 30 deletions(-)
>
> diff --git a/gcc/dce.cc b/gcc/dce.cc
> index be1a2a87732..04e8d98818d 100644
> --- a/gcc/dce.cc
> +++ b/gcc/dce.cc
> @@ -1182,6 +1182,9 @@ fast_dce (bool word_level)
>    BITMAP_FREE (processed);
>    BITMAP_FREE (redo_out);
>    BITMAP_FREE (all_blocks);
> +
> +  /* Both forms of DCE should make further DCE unnecessary.  */
> +  df_lr_dce->solutions_dirty = false;
>  }
>  
>  
> diff --git a/gcc/df-core.cc b/gcc/df-core.cc
> index b0e8a88d433..8fd778a8618 100644
> --- a/gcc/df-core.cc
> +++ b/gcc/df-core.cc
> @@ -806,7 +806,8 @@ rest_of_handle_df_finish (void)
>    for (i = 0; i < df->num_problems_defined; i++)
>      {
>        struct dataflow *dflow = df->problems_in_order[i];
> -      dflow->problem->free_fun ();
> +      if (dflow->problem->free_fun)
> +	dflow->problem->free_fun ();
>      }
>  
>    free (df->postorder);
> diff --git a/gcc/df-problems.cc b/gcc/df-problems.cc
> index 88ee0dd67fc..bfd24bd1e86 100644
> --- a/gcc/df-problems.cc
> +++ b/gcc/df-problems.cc
> @@ -1054,37 +1054,10 @@ df_lr_transfer_function (int bb_index)
>  }
>  
>  
> -/* Run the fast dce as a side effect of building LR.  */
> -
>  static void
> -df_lr_finalize (bitmap all_blocks)
> +df_lr_finalize (bitmap)
>  {
>    df_lr->solutions_dirty = false;
> -  if (df->changeable_flags & DF_LR_RUN_DCE)
> -    {
> -      run_fast_df_dce ();
> -
> -      /* If dce deletes some instructions, we need to recompute the lr
> -	 solution before proceeding further.  The problem is that fast
> -	 dce is a pessimestic dataflow algorithm.  In the case where
> -	 it deletes a statement S inside of a loop, the uses inside of
> -	 S may not be deleted from the dataflow solution because they
> -	 were carried around the loop.  While it is conservatively
> -	 correct to leave these extra bits, the standards of df
> -	 require that we maintain the best possible (least fixed
> -	 point) solution.  The only way to do that is to redo the
> -	 iteration from the beginning.  See PR35805 for an
> -	 example.  */
> -      if (df_lr->solutions_dirty)
> -	{
> -	  df_clear_flags (DF_LR_RUN_DCE);
> -	  df_lr_alloc (all_blocks);
> -	  df_lr_local_compute (all_blocks);
> -	  df_worklist_dataflow (df_lr, all_blocks, df->postorder, df->n_blocks);
> -	  df_lr_finalize (all_blocks);
> -	  df_set_flags (DF_LR_RUN_DCE);
> -	}
> -    }
>  }
>  
>  
> @@ -1266,6 +1239,69 @@ static const struct df_problem problem_LR =
>    false                       /* Reset blocks on dropping out of blocks_to_analyze.  */
>  };
>  
> +/* Run the fast DCE after building LR.  This is a separate problem so that
> +   the "dirty" flag is only cleared after a DCE pass is actually run.  */
> +
> +static void
> +df_lr_dce_finalize (bitmap all_blocks)
> +{
> +  if (!(df->changeable_flags & DF_LR_RUN_DCE))
> +    return;
> +
> +  /* Also clears df_lr_dce->solutions_dirty.  */
> +  run_fast_df_dce ();
> +
> +  /* If dce deletes some instructions, we need to recompute the lr
> +     solution before proceeding further.  The problem is that fast
> +     dce is a pessimestic dataflow algorithm.  In the case where
> +     it deletes a statement S inside of a loop, the uses inside of
> +     S may not be deleted from the dataflow solution because they
> +     were carried around the loop.  While it is conservatively
> +     correct to leave these extra bits, the standards of df
> +     require that we maintain the best possible (least fixed
> +     point) solution.  The only way to do that is to redo the
> +     iteration from the beginning.  See PR35805 for an
> +     example.  */
> +  if (df_lr->solutions_dirty)
> +    {
> +      df_clear_flags (DF_LR_RUN_DCE);
> +      df_lr_alloc (all_blocks);
> +      df_lr_local_compute (all_blocks);
> +      df_worklist_dataflow (df_lr, all_blocks, df->postorder, df->n_blocks);
> +      df_lr_finalize (all_blocks);
> +      df_set_flags (DF_LR_RUN_DCE);
> +    }
> +}
> +
> +static const struct df_problem problem_LR_DCE
> +{
> +  DF_LR_DCE,                  /* Problem id.  */
> +  DF_BACKWARD,                /* Direction (arbitrary).  */
> +  NULL,                       /* Allocate the problem specific data.  */
> +  NULL,                       /* Reset global information.  */
> +  NULL,                       /* Free basic block info.  */
> +  NULL,                       /* Local compute function.  */
> +  NULL,                       /* Init the solution specific data.  */
> +  NULL,                       /* Worklist solver.  */
> +  NULL,                       /* Confluence operator 0.  */
> +  NULL,                       /* Confluence operator n.  */
> +  NULL,                       /* Transfer function.  */
> +  df_lr_dce_finalize,         /* Finalize function.  */
> +  NULL,                       /* Free all of the problem information.  */
> +  NULL,                       /* Remove this problem from the stack of dataflow problems.  */
> +  NULL,                       /* Debugging.  */
> +  NULL,                       /* Debugging start block.  */
> +  NULL,                       /* Debugging end block.  */
> +  NULL,                       /* Debugging start insn.  */
> +  NULL,                       /* Debugging end insn.  */
> +  NULL,                       /* Incremental solution verify start.  */
> +  NULL,                       /* Incremental solution verify end.  */
> +  &problem_LR,                /* Dependent problem.  */
> +  0,                          /* Size of entry of block_info array.  */
> +  TV_DF_LR,                   /* Timing variable.  */
> +  false                       /* Reset blocks on dropping out of blocks_to_analyze.  */
> +};
> +
>  
>  /* Create a new DATAFLOW instance and add it to an existing instance
>     of DF.  The returned structure is what is used to get at the
> @@ -1274,7 +1310,9 @@ static const struct df_problem problem_LR =
>  void
>  df_lr_add_problem (void)
>  {
> -  df_add_problem (&problem_LR);
> +  /* Also add the fast DCE problem.  It is then conditionally enabled by
> +     the DF_LR_RUN_DCE flag.  */
> +  df_add_problem (&problem_LR_DCE);
>    /* These will be initialized when df_scan_blocks processes each
>       block.  */
>    df_lr->out_of_date_transfer_functions = BITMAP_ALLOC (&df_bitmap_obstack);
> diff --git a/gcc/df.h b/gcc/df.h
> index c4e690b40cf..bbb4f297b47 100644
> --- a/gcc/df.h
> +++ b/gcc/df.h
> @@ -47,6 +47,7 @@ enum df_problem_id
>    {
>      DF_SCAN,
>      DF_LR,                /* Live Registers backward. */
> +    DF_LR_DCE,            /* Dead code elimination post-pass for LR.  */
>      DF_LIVE,              /* Live Registers & Uninitialized Registers */
>      DF_RD,                /* Reaching Defs. */
>      DF_CHAIN,             /* Def-Use and/or Use-Def Chains. */
> @@ -940,6 +941,7 @@ extern class df_d *df;
>  #define df_scan    (df->problems_by_index[DF_SCAN])
>  #define df_rd      (df->problems_by_index[DF_RD])
>  #define df_lr      (df->problems_by_index[DF_LR])
> +#define df_lr_dce  (df->problems_by_index[DF_LR_DCE])
>  #define df_live    (df->problems_by_index[DF_LIVE])
>  #define df_chain   (df->problems_by_index[DF_CHAIN])
>  #define df_word_lr (df->problems_by_index[DF_WORD_LR])
> -- 
> 2.25.1
diff mbox series

Patch

diff --git a/gcc/Makefile.in b/gcc/Makefile.in
index 91d6bfbea4d..b43fd6e8df1 100644
--- a/gcc/Makefile.in
+++ b/gcc/Makefile.in
@@ -1554,6 +1554,7 @@  OBJS = \
 	ira-lives.o \
 	jump.o \
 	langhooks.o \
+	late-combine.o \
 	lcm.o \
 	lists.o \
 	loop-doloop.o \
diff --git a/gcc/common.opt b/gcc/common.opt
index 1cf3bdd3b51..306b11f91c7 100644
--- a/gcc/common.opt
+++ b/gcc/common.opt
@@ -1775,6 +1775,11 @@  Common Var(flag_large_source_files) Init(0)
 Improve GCC's ability to track column numbers in large source files,
 at the expense of slower compilation.
 
+flate-combine-instructions
+Common Var(flag_late_combine_instructions) Optimization Init(0)
+Run two instruction combination passes late in the pass pipeline;
+one before register allocation and one after.
+
 floop-parallelize-all
 Common Var(flag_loop_parallelize_all) Optimization
 Mark all loops as parallel.
diff --git a/gcc/common/config/aarch64/aarch64-common.cc b/gcc/common/config/aarch64/aarch64-common.cc
index 20bc4e1291b..05647e0c93a 100644
--- a/gcc/common/config/aarch64/aarch64-common.cc
+++ b/gcc/common/config/aarch64/aarch64-common.cc
@@ -55,6 +55,7 @@  static const struct default_options aarch_option_optimization_table[] =
     { OPT_LEVELS_1_PLUS, OPT_fsched_pressure, NULL, 1 },
     /* Enable redundant extension instructions removal at -O2 and higher.  */
     { OPT_LEVELS_2_PLUS, OPT_free, NULL, 1 },
+    { OPT_LEVELS_2_PLUS, OPT_flate_combine_instructions, NULL, 1 },
 #if (TARGET_DEFAULT_ASYNC_UNWIND_TABLES == 1)
     { OPT_LEVELS_ALL, OPT_fasynchronous_unwind_tables, NULL, 1 },
     { OPT_LEVELS_ALL, OPT_funwind_tables, NULL, 1},
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 5a9284d635c..d0576ac97cf 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -562,7 +562,7 @@  Objective-C and Objective-C++ Dialects}.
 -fipa-bit-cp  -fipa-vrp  -fipa-pta  -fipa-profile  -fipa-pure-const
 -fipa-reference  -fipa-reference-addressable
 -fipa-stack-alignment  -fipa-icf  -fira-algorithm=@var{algorithm}
--flive-patching=@var{level}
+-flate-combine-instructions  -flive-patching=@var{level}
 -fira-region=@var{region}  -fira-hoist-pressure
 -fira-loop-pressure  -fno-ira-share-save-slots
 -fno-ira-share-spill-slots
@@ -13201,6 +13201,15 @@  equivalences that are found only by GCC and equivalences found only by Gold.
 
 This flag is enabled by default at @option{-O2} and @option{-Os}.
 
+@opindex flate-combine-instructions
+@item -flate-combine-instructions
+Enable two instruction combination passes that run relatively late in the
+compilation process.  One of the passes runs before register allocation and
+the other after register allocation.  The main aim of the passes is to
+substitute definitions into all uses.
+
+Some targets enable this flag by default at @option{-O2} and @option{-Os}.
+
 @opindex flive-patching
 @item -flive-patching=@var{level}
 Control GCC's optimizations to produce output suitable for live-patching.
diff --git a/gcc/late-combine.cc b/gcc/late-combine.cc
new file mode 100644
index 00000000000..b1845875c4b
--- /dev/null
+++ b/gcc/late-combine.cc
@@ -0,0 +1,718 @@ 
+// Late-stage instruction combination pass.
+// Copyright (C) 2023 Free Software Foundation, Inc.
+//
+// This file is part of GCC.
+//
+// GCC is free software; you can redistribute it and/or modify it under
+// the terms of the GNU General Public License as published by the Free
+// Software Foundation; either version 3, or (at your option) any later
+// version.
+//
+// GCC is distributed in the hope that it will be useful, but WITHOUT ANY
+// WARRANTY; without even the implied warranty of MERCHANTABILITY or
+// FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
+// for more details.
+//
+// You should have received a copy of the GNU General Public License
+// along with GCC; see the file COPYING3.  If not see
+// <http://www.gnu.org/licenses/>.
+
+// The current purpose of this pass is to substitute definitions into
+// all uses, so that the definition can be removed.  However, it could
+// be extended to handle other combination-related optimizations in future.
+//
+// The pass can run before or after register allocation.  When running
+// before register allocation, it tries to avoid cases that are likely
+// to increase register pressure.  For the same reason, it avoids moving
+// instructions around, even if doing so would allow an optimization to
+// succeed.  These limitations are removed when running after register
+// allocation.
+
+#define INCLUDE_ALGORITHM
+#define INCLUDE_FUNCTIONAL
+#include "config.h"
+#include "system.h"
+#include "coretypes.h"
+#include "backend.h"
+#include "rtl.h"
+#include "df.h"
+#include "rtl-ssa.h"
+#include "print-rtl.h"
+#include "tree-pass.h"
+#include "cfgcleanup.h"
+#include "target.h"
+
+using namespace rtl_ssa;
+
+namespace {
+const pass_data pass_data_late_combine =
+{
+  RTL_PASS, // type
+  "late_combine", // name
+  OPTGROUP_NONE, // optinfo_flags
+  TV_NONE, // tv_id
+  0, // properties_required
+  0, // properties_provided
+  0, // properties_destroyed
+  0, // todo_flags_start
+  TODO_df_finish, // todo_flags_finish
+};
+
+// Class that represents one run of the pass.
+class late_combine
+{
+public:
+  unsigned int execute (function *);
+
+private:
+  rtx optimizable_set (insn_info *);
+  bool check_register_pressure (insn_info *, rtx);
+  bool check_uses (set_info *, rtx);
+  bool combine_into_uses (insn_info *, insn_info *);
+
+  auto_vec<insn_info *> m_worklist;
+};
+
+// Represents an attempt to substitute a single-set definition into all
+// uses of the definition.
+class insn_combination
+{
+public:
+  insn_combination (set_info *, rtx, rtx);
+  bool run ();
+  const vec<insn_change *> &use_changes () const { return m_use_changes; }
+
+private:
+  use_array get_new_uses (use_info *);
+  bool substitute_nondebug_use (use_info *);
+  bool substitute_nondebug_uses (set_info *);
+  bool move_and_recog (insn_change &);
+  bool try_to_preserve_debug_info (insn_change &, use_info *);
+  void substitute_debug_use (use_info *);
+  bool substitute_note (insn_info *, rtx, bool);
+  void substitute_notes (insn_info *, bool);
+  void substitute_note_uses (use_info *);
+  void substitute_optional_uses (set_info *);
+
+  // Represents the state of the function's RTL at the start of this
+  // combination attempt.
+  insn_change_watermark m_rtl_watermark;
+
+  // Represents the rtl-ssa state at the start of this combination attempt.
+  obstack_watermark m_attempt;
+
+  // The instruction that contains the definition, and that we're trying
+  // to delete.
+  insn_info *m_def_insn;
+
+  // The definition itself.
+  set_info *m_def;
+
+  // The destination and source of the single set that defines m_def.
+  // The destination is known to be a plain REG.
+  rtx m_dest;
+  rtx m_src;
+
+  // Contains all in-progress changes to uses of m_def.
+  auto_vec<insn_change *> m_use_changes;
+
+  // Contains the full list of changes that we want to make, in reverse
+  // postorder.
+  auto_vec<insn_change *> m_nondebug_changes;
+};
+
+insn_combination::insn_combination (set_info *def, rtx dest, rtx src)
+  : m_rtl_watermark (),
+    m_attempt (crtl->ssa->new_change_attempt ()),
+    m_def_insn (def->insn ()),
+    m_def (def),
+    m_dest (dest),
+    m_src (src),
+    m_use_changes (),
+    m_nondebug_changes ()
+{
+}
+
+// USE is a direct or indirect use of m_def.  Return the list of uses
+// that would be needed after substituting m_def into the instruction.
+// The returned list is marked as invalid if USE's insn and m_def_insn
+// use different definitions for the same resource (register or memory).
+use_array
+insn_combination::get_new_uses (use_info *use)
+{
+  auto *def = use->def ();
+  auto *use_insn = use->insn ();
+
+  use_array new_uses = use_insn->uses ();
+  new_uses = remove_uses_of_def (m_attempt, new_uses, def);
+  new_uses = merge_access_arrays (m_attempt, m_def_insn->uses (), new_uses);
+  if (new_uses.is_valid () && use->ebb () != m_def->ebb ())
+    new_uses = crtl->ssa->make_uses_available (m_attempt, new_uses, use->bb (),
+					       use_insn->is_debug_insn ());
+  return new_uses;
+}
+
+// Start the process of trying to replace USE by substitution, given that
+// USE occurs in a non-debug instruction.  Check that the substitution can
+// be represented in RTL and that each use of a resource (register or memory)
+// has a consistent definition.  If so, start an insn_change record for the
+// substitution and return true.
+bool
+insn_combination::substitute_nondebug_use (use_info *use)
+{
+  insn_info *use_insn = use->insn ();
+  rtx_insn *use_rtl = use_insn->rtl ();
+
+  if (dump_file && (dump_flags & TDF_DETAILS))
+    dump_insn_slim (dump_file, use->insn ()->rtl ());
+
+  // Chceck that we can change the instruction pattern.  Leave recognition
+  // of the result till later.
+  insn_propagation prop (use_rtl, m_dest, m_src);
+  if (!prop.apply_to_pattern (&PATTERN (use_rtl))
+      || prop.num_replacements == 0)
+    {
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	fprintf (dump_file, "-- RTL substitution failed\n");
+      return false;
+    }
+
+  use_array new_uses = get_new_uses (use);
+  if (!new_uses.is_valid ())
+    {
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	fprintf (dump_file, "-- could not prove that all sources"
+		 " are available\n");
+      return false;
+    }
+
+  auto *where = XOBNEW (m_attempt, insn_change);
+  auto *use_change = new (where) insn_change (use_insn);
+  m_use_changes.safe_push (use_change);
+  use_change->new_uses = new_uses;
+
+  return true;
+}
+
+// Apply substitute_nondebug_use to all direct and indirect uses of DEF.
+// There will be at most one level of indirection.
+bool
+insn_combination::substitute_nondebug_uses (set_info *def)
+{
+  for (use_info *use : def->nondebug_insn_uses ())
+    if (!use->is_live_out_use ()
+	&& !use->only_occurs_in_notes ()
+	&& !substitute_nondebug_use (use))
+      return false;
+
+  for (use_info *use : def->phi_uses ())
+    if (!substitute_nondebug_uses (use->phi ()))
+      return false;
+
+  return true;
+}
+
+// Complete the verification of USE_CHANGE, given that m_nondebug_insns
+// now contains an insn_change record for all proposed non-debug changes.
+// Check that the new instruction is a recognized pattern.  Also check that
+// the instruction can be placed somewhere that makes all definitions and
+// uses valid, and that permits any new hard-register clobbers added
+// during the recognition process.  Return true on success.
+bool
+insn_combination::move_and_recog (insn_change &use_change)
+{
+  insn_info *use_insn = use_change.insn ();
+
+  if (reload_completed && can_move_insn_p (use_insn))
+    use_change.move_range = { use_insn->bb ()->head_insn (),
+			      use_insn->ebb ()->last_bb ()->end_insn () };
+  if (!restrict_movement_ignoring (use_change,
+				   insn_is_changing (m_nondebug_changes)))
+    {
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	fprintf (dump_file, "-- cannot satisfy all definitions and uses"
+		 " in insn %d\n", INSN_UID (use_insn->rtl ()));
+      return false;
+    }
+
+  if (!recog_ignoring (m_attempt, use_change,
+		       insn_is_changing (m_nondebug_changes)))
+    return false;
+
+  return true;
+}
+
+// USE_CHANGE.insn () is a debug instruction that uses m_def.  Try to
+// substitute the definition into the instruction and try to describe
+// the result in USE_CHANGE.  Return true on success.  Failure means that
+// the instruction must be reset instead.
+bool
+insn_combination::try_to_preserve_debug_info (insn_change &use_change,
+					      use_info *use)
+{
+  insn_info *use_insn = use_change.insn ();
+  rtx_insn *use_rtl = use_insn->rtl ();
+
+  use_change.new_uses = get_new_uses (use);
+  if (!use_change.new_uses.is_valid ()
+      || !restrict_movement (use_change))
+    return false;
+
+  insn_propagation prop (use_rtl, m_dest, m_src);
+  return prop.apply_to_pattern (&INSN_VAR_LOCATION_LOC (use_rtl));
+}
+
+// USE_INSN is a debug instruction that uses m_def.  Update it to reflect
+// the fact that m_def is going to disappear.  Try to preserve the source
+// value if possible, but reset the instruction if not.
+void
+insn_combination::substitute_debug_use (use_info *use)
+{
+  auto *use_insn = use->insn ();
+  rtx_insn *use_rtl = use_insn->rtl ();
+
+  auto use_change = insn_change (use_insn);
+  if (!try_to_preserve_debug_info (use_change, use))
+    {
+      use_change.new_uses = {};
+      use_change.move_range = use_change.insn ();
+      INSN_VAR_LOCATION_LOC (use_rtl) = gen_rtx_UNKNOWN_VAR_LOC ();
+    }
+  insn_change *changes[] = { &use_change };
+  crtl->ssa->change_insns (changes);
+}
+
+// NOTE is a reg note of USE_INSN, which previously used m_def.  Update
+// the note to reflect the fact that m_def is going to disappear.  Return
+// true on success, or false if the note must be deleted.
+//
+// CAN_PROPAGATE is true if m_dest can be replaced with m_use.
+bool
+insn_combination::substitute_note (insn_info *use_insn, rtx note,
+				   bool can_propagate)
+{
+  if (REG_NOTE_KIND (note) == REG_EQUAL
+      || REG_NOTE_KIND (note) == REG_EQUIV)
+    {
+      insn_propagation prop (use_insn->rtl (), m_dest, m_src);
+      return (prop.apply_to_rvalue (&XEXP (note, 0))
+	      && (can_propagate || prop.num_replacements == 0));
+    }
+  return true;
+}
+
+// Update USE_INSN's notes after deciding to go ahead with the optimization.
+// CAN_PROPAGATE is true if m_dest can be replaced with m_use.
+void
+insn_combination::substitute_notes (insn_info *use_insn, bool can_propagate)
+{
+  rtx_insn *use_rtl = use_insn->rtl ();
+  rtx *ptr = &REG_NOTES (use_rtl);
+  while (rtx note = *ptr)
+    {
+      if (substitute_note (use_insn, note, can_propagate))
+	ptr = &XEXP (note, 1);
+      else
+	*ptr = XEXP (note, 1);
+    }
+}
+
+// We've decided to go ahead with the substitution.  Update all REG_NOTES
+// involving USE.
+void
+insn_combination::substitute_note_uses (use_info *use)
+{
+  insn_info *use_insn = use->insn ();
+
+  bool can_propagate = true;
+  if (use->only_occurs_in_notes ())
+    {
+      // The only uses are in notes.  Try to keep the note if we can,
+      // but removing it is better than aborting the optimization.
+      insn_change use_change (use_insn);
+      use_change.new_uses = get_new_uses (use);
+      if (!use_change.new_uses.is_valid ()
+	  || !restrict_movement (use_change))
+	{
+	  use_change.move_range = use_insn;
+	  use_change.new_uses = remove_uses_of_def (m_attempt,
+						    use_insn->uses (),
+						    use->def ());
+	  can_propagate = false;
+	}
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	{
+	  fprintf (dump_file, "%s notes in:\n",
+		   can_propagate ? "updating" : "removing");
+	  dump_insn_slim (dump_file, use_insn->rtl ());
+	}
+      substitute_notes (use_insn, can_propagate);
+      insn_change *changes[] = { &use_change };
+      crtl->ssa->change_insns (changes);
+    }
+  else
+    substitute_notes (use_insn, can_propagate);
+}
+
+// We've decided to go ahead with the substitution and we've dealt with
+// all uses that occur in the patterns of non-debug insns.  Update all
+// other uses for the fact that m_def is about to disappear.
+void
+insn_combination::substitute_optional_uses (set_info *def)
+{
+  if (auto insn_uses = def->all_insn_uses ())
+    {
+      use_info *use = *insn_uses.begin ();
+      while (use)
+	{
+	  use_info *next_use = use->next_any_insn_use ();
+	  if (use->is_in_debug_insn ())
+	    substitute_debug_use (use);
+	  else if (!use->is_live_out_use ())
+	    substitute_note_uses (use);
+	  use = next_use;
+	}
+    }
+  for (use_info *use : def->phi_uses ())
+    substitute_optional_uses (use->phi ());
+}
+
+// Try to perform the substitution.  Return true on success.
+bool
+insn_combination::run ()
+{
+  if (dump_file && (dump_flags & TDF_DETAILS))
+    {
+      fprintf (dump_file, "\ntrying to combine definition of r%d in:\n",
+	       m_def->regno ());
+      dump_insn_slim (dump_file, m_def_insn->rtl ());
+      fprintf (dump_file, "into:\n");
+    }
+
+  if (!substitute_nondebug_uses (m_def))
+    return false;
+
+  auto def_change = insn_change::delete_insn (m_def_insn);
+
+  m_nondebug_changes.reserve (m_use_changes.length () + 1);
+  m_nondebug_changes.quick_push (&def_change);
+  m_nondebug_changes.splice (m_use_changes);
+
+  for (auto *use_change : m_use_changes)
+    if (!move_and_recog (*use_change))
+      return false;
+
+  if (!changes_are_worthwhile (m_nondebug_changes)
+      || !crtl->ssa->verify_insn_changes (m_nondebug_changes))
+    return false;
+
+  substitute_optional_uses (m_def);
+
+  confirm_change_group ();
+  crtl->ssa->change_insns (m_nondebug_changes);
+  return true;
+}
+
+// See whether INSN is a single_set that we can optimize.  Return the
+// set if so, otherwise return null.
+rtx
+late_combine::optimizable_set (insn_info *insn)
+{
+  if (!insn->can_be_optimized ()
+      || insn->is_asm ()
+      || insn->is_call ()
+      || insn->has_volatile_refs ()
+      || insn->has_pre_post_modify ()
+      || !can_move_insn_p (insn))
+    return NULL_RTX;
+
+  return single_set (insn->rtl ());
+}
+
+// Suppose that we can replace all uses of SET_DEST (SET) with SET_SRC (SET),
+// where SET occurs in INSN.  Return true if doing so is not likely to
+// increase register pressure.
+bool
+late_combine::check_register_pressure (insn_info *insn, rtx set)
+{
+  // Plain register-to-register moves do not establish a register class
+  // preference and have no well-defined effect on the register allocator.
+  // If changes in register class are needed, the register allocator is
+  // in the best position to place those changes.  If no change in
+  // register class is needed, then the optimization reduces register
+  // pressure if SET_SRC (set) was already live at uses, otherwise the
+  // optimization is pressure-neutral.
+  rtx src = SET_SRC (set);
+  if (REG_P (src))
+    return true;
+
+  // On the same basis, substituting a SET_SRC that contains a single
+  // pseudo register either reduces pressure or is pressure-neutral,
+  // subject to the constraints below.  We would need to do more
+  // analysis for SET_SRCs that use more than one pseudo register.
+  unsigned int nregs = 0;
+  for (auto *use : insn->uses ())
+    if (use->is_reg ()
+	&& !HARD_REGISTER_NUM_P (use->regno ())
+	&& !use->only_occurs_in_notes ())
+      if (++nregs > 1)
+	return false;
+
+  // If there are no pseudo registers in SET_SRC then the optimization
+  // should improve register pressure.
+  if (nregs == 0)
+    return true;
+
+  // We'd be substituting (set (reg R1) SRC) where SRC is known to
+  // contain a single pseudo register R2.  Assume for simplicity that
+  // each new use of R2 would need to be in the same class C as the
+  // current use of R2.  If, for a realistic allocation, C is a
+  // non-strict superset of the R1's register class, the effect on
+  // register pressure should be positive or neutral.  If instead
+  // R1 occupies a different register class from R2, or if R1 has
+  // more allocation freedom than R2, then there's a higher risk that
+  // the effect on register pressure could be negative.
+  //
+  // First use constrain_operands to get the most likely choice of
+  // alternative.  For simplicity, just handle the case where the
+  // output operand is operand 0.
+  extract_insn (insn->rtl ());
+  rtx dest = SET_DEST (set);
+  if (recog_data.n_operands == 0
+      || recog_data.operand[0] != dest)
+    return false;
+
+  if (!constrain_operands (0, get_enabled_alternatives (insn->rtl ())))
+    return false;
+
+  preprocess_constraints (insn->rtl ());
+  auto *alt = which_op_alt ();
+  auto dest_class = alt[0].cl;
+
+  // Check operands 1 and above.
+  auto check_src = [&](unsigned int i)
+    {
+      if (recog_data.is_operator[i])
+	return true;
+
+      rtx op = recog_data.operand[i];
+      if (CONSTANT_P (op))
+	return true;
+
+      if (SUBREG_P (op))
+	op = SUBREG_REG (op);
+      if (REG_P (op))
+	{
+	  if (HARD_REGISTER_P (op))
+	    {
+	      // We've already rejected uses of non-fixed hard registers.
+	      gcc_checking_assert (fixed_regs[REGNO (op)]);
+	      return true;
+	    }
+
+	  // Make sure that the source operand's class is at least as
+	  // permissive as the destination operand's class.
+	  if (!reg_class_subset_p (dest_class, alt[i].cl))
+	    return false;
+
+	  // Make sure that the source operand occupies no more hard
+	  // registers than the destination operand.  This mostly matters
+	  // for subregs.
+	  if (targetm.class_max_nregs (dest_class, GET_MODE (dest))
+	      < targetm.class_max_nregs (alt[i].cl, GET_MODE (op)))
+	    return false;
+
+	  return true;
+	}
+      return false;
+    };
+  for (int i = 1; i < recog_data.n_operands; ++i)
+    if (!check_src (i))
+      return false;
+
+  return true;
+}
+
+// Check uses of DEF to see whether there is anything obvious that
+// prevents the substitution of SET into uses of DEF.
+bool
+late_combine::check_uses (set_info *def, rtx set)
+{
+  use_info *last_use = nullptr;
+  for (use_info *use : def->nondebug_insn_uses ())
+    {
+      insn_info *use_insn = use->insn ();
+
+      if (use->is_live_out_use ())
+	continue;
+      if (use->only_occurs_in_notes ())
+	continue;
+
+      // We cannot replace all uses if the value is live on exit.
+      if (use->is_artificial ())
+	return false;
+
+      // Avoid increasing the complexity of instructions that
+      // reference allocatable hard registers.
+      if (!REG_P (SET_SRC (set))
+	  && !reload_completed
+	  && (accesses_include_nonfixed_hard_registers (use_insn->uses ())
+	      || accesses_include_nonfixed_hard_registers (use_insn->defs ())))
+	return false;
+
+      // Don't substitute into a non-local goto, since it can then be
+      // treated as a jump to local label, e.g. in shorten_branches.
+      // ??? But this shouldn't be necessary.
+      if (use_insn->is_jump ()
+	  && find_reg_note (use_insn->rtl (), REG_NON_LOCAL_GOTO, NULL_RTX))
+	return false;
+
+      // We'll keep the uses in their original order, even if we move
+      // them relative to other instructions.  Make sure that non-final
+      // uses do not change any values that occur in the SET_SRC.
+      if (last_use && last_use->ebb () == use->ebb ())
+	{
+	  def_info *ultimate_def = look_through_degenerate_phi (def);
+	  if (insn_clobbers_resources (last_use->insn (),
+				       ultimate_def->insn ()->uses ()))
+	    return false;
+	}
+
+      last_use = use;
+    }
+
+  for (use_info *use : def->phi_uses ())
+    if (!use->phi ()->is_degenerate ()
+	|| !check_uses (use->phi (), set))
+      return false;
+
+  return true;
+}
+
+// Try to remove INSN by substituting a definition into all uses.
+// If the optimization moves any instructions before CURSOR, add those
+// instructions to the end of m_worklist.
+bool
+late_combine::combine_into_uses (insn_info *insn, insn_info *cursor)
+{
+  // For simplicity, don't try to handle sets of multiple hard registers.
+  // And for correctness, don't remove any assignments to the stack or
+  // frame pointers, since that would implicitly change the set of valid
+  // memory locations between this assignment and the next.
+  //
+  // Removing assignments to the hard frame pointer would invalidate
+  // backtraces.
+  set_info *def = single_set_info (insn);
+  if (!def
+      || !def->is_reg ()
+      || def->regno () == STACK_POINTER_REGNUM
+      || def->regno () == FRAME_POINTER_REGNUM
+      || def->regno () == HARD_FRAME_POINTER_REGNUM)
+    return false;
+
+  rtx set = optimizable_set (insn);
+  if (!set)
+    return false;
+
+  // For simplicity, don't try to handle subreg destinations.
+  rtx dest = SET_DEST (set);
+  if (!REG_P (dest) || def->regno () != REGNO (dest))
+    return false;
+
+  // Don't prolong the live ranges of allocatable hard registers, or put
+  // them into more complicated instructions.  Failing to prevent this
+  // could lead to spill failures, or at least to worst register allocation.
+  if (!reload_completed
+      && accesses_include_nonfixed_hard_registers (insn->uses ()))
+    return false;
+
+  if (!reload_completed && !check_register_pressure (insn, set))
+    return false;
+
+  if (!check_uses (def, set))
+    return false;
+
+  insn_combination combination (def, SET_DEST (set), SET_SRC (set));
+  if (!combination.run ())
+    return false;
+
+  for (auto *use_change : combination.use_changes ())
+    if (*use_change->insn () < *cursor)
+      m_worklist.safe_push (use_change->insn ());
+    else
+      break;
+  return true;
+}
+
+// Run the pass on function FN.
+unsigned int
+late_combine::execute (function *fn)
+{
+  // Initialization.
+  calculate_dominance_info (CDI_DOMINATORS);
+  df_analyze ();
+  crtl->ssa = new rtl_ssa::function_info (fn);
+  // Don't allow memory_operand to match volatile MEMs.
+  init_recog_no_volatile ();
+
+  insn_info *insn = *crtl->ssa->nondebug_insns ().begin ();
+  while (insn)
+    {
+      if (!insn->is_artificial ())
+	{
+	  insn_info *prev = insn->prev_nondebug_insn ();
+	  if (combine_into_uses (insn, prev))
+	    {
+	      // Any instructions that get added to the worklist were
+	      // previously after PREV.  Thus if we were able to move
+	      // an instruction X before PREV during one combination,
+	      // X cannot depend on any instructions that we move before
+	      // PREV during subsequent combinations.  This means that
+	      // the worklist should be free of backwards dependencies,
+	      // even if it isn't necessarily in RPO.
+	      for (unsigned int i = 0; i < m_worklist.length (); ++i)
+		combine_into_uses (m_worklist[i], prev);
+	      m_worklist.truncate (0);
+	      insn = prev;
+	    }
+	}
+      insn = insn->next_nondebug_insn ();
+    }
+
+  // Finalization.
+  if (crtl->ssa->perform_pending_updates ())
+    cleanup_cfg (0);
+  // Make recognizer allow volatile MEMs again.
+  init_recog ();
+  free_dominance_info (CDI_DOMINATORS);
+  return 0;
+}
+
+class pass_late_combine : public rtl_opt_pass
+{
+public:
+  pass_late_combine (gcc::context *ctxt)
+    : rtl_opt_pass (pass_data_late_combine, ctxt)
+  {}
+
+  // opt_pass methods:
+  opt_pass *clone () override { return new pass_late_combine (m_ctxt); }
+  bool gate (function *) override { return flag_late_combine_instructions; }
+  unsigned int execute (function *) override;
+};
+
+unsigned int
+pass_late_combine::execute (function *fn)
+{
+  return late_combine ().execute (fn);
+}
+
+} // end namespace
+
+// Create a new CC fusion pass instance.
+
+rtl_opt_pass *
+make_pass_late_combine (gcc::context *ctxt)
+{
+  return new pass_late_combine (ctxt);
+}
diff --git a/gcc/passes.def b/gcc/passes.def
index 1e1950bdb39..56ab5204b08 100644
--- a/gcc/passes.def
+++ b/gcc/passes.def
@@ -488,6 +488,7 @@  along with GCC; see the file COPYING3.  If not see
       NEXT_PASS (pass_initialize_regs);
       NEXT_PASS (pass_ud_rtl_dce);
       NEXT_PASS (pass_combine);
+      NEXT_PASS (pass_late_combine);
       NEXT_PASS (pass_if_after_combine);
       NEXT_PASS (pass_jump_after_combine);
       NEXT_PASS (pass_partition_blocks);
@@ -507,6 +508,7 @@  along with GCC; see the file COPYING3.  If not see
       NEXT_PASS (pass_postreload);
       PUSH_INSERT_PASSES_WITHIN (pass_postreload)
 	  NEXT_PASS (pass_postreload_cse);
+	  NEXT_PASS (pass_late_combine);
 	  NEXT_PASS (pass_gcse2);
 	  NEXT_PASS (pass_split_after_reload);
 	  NEXT_PASS (pass_ree);
diff --git a/gcc/testsuite/gcc.dg/ira-shrinkwrap-prep-1.c b/gcc/testsuite/gcc.dg/ira-shrinkwrap-prep-1.c
index f290b9ccbdc..a95637abbe5 100644
--- a/gcc/testsuite/gcc.dg/ira-shrinkwrap-prep-1.c
+++ b/gcc/testsuite/gcc.dg/ira-shrinkwrap-prep-1.c
@@ -25,5 +25,5 @@  bar (long a)
 }
 
 /* { dg-final { scan-rtl-dump "Will split live ranges of parameters" "ira" } } */
-/* { dg-final { scan-rtl-dump "Split live-range of register" "ira" { xfail *-*-* } } } */
+/* { dg-final { scan-rtl-dump "Split live-range of register" "ira" { xfail { ! aarch64*-*-* } } } } */
 /* { dg-final { scan-rtl-dump "Performing shrink-wrapping" "pro_and_epilogue" { xfail powerpc*-*-* } } } */
diff --git a/gcc/testsuite/gcc.dg/ira-shrinkwrap-prep-2.c b/gcc/testsuite/gcc.dg/ira-shrinkwrap-prep-2.c
index 6212c95585d..0690e036eaa 100644
--- a/gcc/testsuite/gcc.dg/ira-shrinkwrap-prep-2.c
+++ b/gcc/testsuite/gcc.dg/ira-shrinkwrap-prep-2.c
@@ -30,6 +30,6 @@  bar (long a)
 }
 
 /* { dg-final { scan-rtl-dump "Will split live ranges of parameters" "ira" } } */
-/* { dg-final { scan-rtl-dump "Split live-range of register" "ira" { xfail *-*-* } } } */
+/* { dg-final { scan-rtl-dump "Split live-range of register" "ira" { xfail { ! aarch64*-*-* } } } } */
 /* XFAIL due to PR70681.  */ 
 /* { dg-final { scan-rtl-dump "Performing shrink-wrapping" "pro_and_epilogue" { xfail arm*-*-* powerpc*-*-* } } } */
diff --git a/gcc/testsuite/gcc.dg/stack-check-4.c b/gcc/testsuite/gcc.dg/stack-check-4.c
index b0c5c61972f..052d2abc2f1 100644
--- a/gcc/testsuite/gcc.dg/stack-check-4.c
+++ b/gcc/testsuite/gcc.dg/stack-check-4.c
@@ -20,7 +20,7 @@ 
    scan for.   We scan for both the positive and negative cases.  */
 
 /* { dg-do compile } */
-/* { dg-options "-O2 -fstack-clash-protection -fdump-rtl-pro_and_epilogue -fno-optimize-sibling-calls" } */
+/* { dg-options "-O2 -fstack-clash-protection -fdump-rtl-pro_and_epilogue -fno-optimize-sibling-calls -fno-shrink-wrap" } */
 /* { dg-require-effective-target supports_stack_clash_protection } */
 
 extern void arf (char *);
diff --git a/gcc/testsuite/gcc.target/aarch64/pr106594_1.c b/gcc/testsuite/gcc.target/aarch64/pr106594_1.c
new file mode 100644
index 00000000000..71bcafcb44f
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/pr106594_1.c
@@ -0,0 +1,20 @@ 
+/* { dg-options "-O2" } */
+
+extern const int constellation_64qam[64];
+
+void foo(int nbits,
+         const char *p_src,
+         int *p_dst) {
+
+  while (nbits > 0U) {
+    char first = *p_src++;
+
+    char index1 = ((first & 0x3) << 4) | (first >> 4);
+
+    *p_dst++ = constellation_64qam[index1];
+
+    nbits--;
+  }
+}
+
+/* { dg-final { scan-assembler {(?n)\tldr\t.*\[x[0-9]+, w[0-9]+, sxtw #?2\]} } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/cond_asrd_3.c b/gcc/testsuite/gcc.target/aarch64/sve/cond_asrd_3.c
index 0d620a30d5d..b537c6154a3 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/cond_asrd_3.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/cond_asrd_3.c
@@ -27,9 +27,9 @@  TEST_ALL (DEF_LOOP)
 /* { dg-final { scan-assembler-times {\tasrd\tz[0-9]+\.h, p[0-7]/m, z[0-9]+\.h, #4\n} 2 } } */
 /* { dg-final { scan-assembler-times {\tasrd\tz[0-9]+\.s, p[0-7]/m, z[0-9]+\.s, #4\n} 1 } } */
 
-/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.b, p[0-7]/z, z[0-9]+\.b\n} 3 { xfail *-*-* } } } */
-/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.h, p[0-7]/z, z[0-9]+\.h\n} 2 { xfail *-*-* } } } */
-/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.s, p[0-7]/z, z[0-9]+\.s\n} 1 { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.b, p[0-7]/z, z[0-9]+\.b\n} 3 } } */
+/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.h, p[0-7]/z, z[0-9]+\.h\n} 2 } } */
+/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.s, p[0-7]/z, z[0-9]+\.s\n} 1 } } */
 
-/* { dg-final { scan-assembler-not {\tmov\tz} { xfail *-*-* } } } */
-/* { dg-final { scan-assembler-not {\tsel\t} { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-not {\tmov\tz} } } */
+/* { dg-final { scan-assembler-not {\tsel\t} } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/cond_convert_3.c b/gcc/testsuite/gcc.target/aarch64/sve/cond_convert_3.c
index a294effd4a9..cff806c278d 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/cond_convert_3.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/cond_convert_3.c
@@ -30,11 +30,9 @@  TEST_ALL (DEF_LOOP)
 /* { dg-final { scan-assembler-times {\tscvtf\tz[0-9]+\.d, p[0-7]/m,} 1 } } */
 /* { dg-final { scan-assembler-times {\tucvtf\tz[0-9]+\.d, p[0-7]/m,} 1 } } */
 
-/* Really we should be able to use MOVPRFX /z here, but at the moment
-   we're relying on combine to merge a SEL and an arithmetic operation,
-   and the SEL doesn't allow the "false" value to be zero when the "true"
-   value is a register.  */
-/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+, z[0-9]+\n} 6 } } */
+/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.h, p[0-7]/z,} 2 } } */
+/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.s, p[0-7]/z,} 2 } } */
+/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.d, p[0-7]/z,} 2 } } */
 
 /* { dg-final { scan-assembler-not {\tmov\tz[^\n]*z} } } */
 /* { dg-final { scan-assembler-not {\tsel\t} } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/cond_convert_6.c b/gcc/testsuite/gcc.target/aarch64/sve/cond_convert_6.c
index 6541a2ea49d..abf0a2e832f 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/cond_convert_6.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/cond_convert_6.c
@@ -30,11 +30,9 @@  TEST_ALL (DEF_LOOP)
 /* { dg-final { scan-assembler-times {\tfcvtzs\tz[0-9]+\.d, p[0-7]/m,} 1 } } */
 /* { dg-final { scan-assembler-times {\tfcvtzu\tz[0-9]+\.d, p[0-7]/m,} 1 } } */
 
-/* Really we should be able to use MOVPRFX /z here, but at the moment
-   we're relying on combine to merge a SEL and an arithmetic operation,
-   and the SEL doesn't allow the "false" value to be zero when the "true"
-   value is a register.  */
-/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+, z[0-9]+\n} 6 } } */
+/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.h, p[0-7]/z,} 2 } } */
+/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.s, p[0-7]/z,} 2 } } */
+/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.d, p[0-7]/z,} 2 } } */
 
 /* { dg-final { scan-assembler-not {\tmov\tz[^\n]*z} } } */
 /* { dg-final { scan-assembler-not {\tsel\t} } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/cond_fabd_5.c b/gcc/testsuite/gcc.target/aarch64/sve/cond_fabd_5.c
index e66477b3bce..401201b315a 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/cond_fabd_5.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/cond_fabd_5.c
@@ -24,12 +24,9 @@  TEST_ALL (DEF_LOOP)
 /* { dg-final { scan-assembler-times {\tfabd\tz[0-9]+\.s, p[0-7]/m,} 1 } } */
 /* { dg-final { scan-assembler-times {\tfabd\tz[0-9]+\.d, p[0-7]/m,} 1 } } */
 
-/* Really we should be able to use MOVPRFX /Z here, but at the moment
-   we're relying on combine to merge a SEL and an arithmetic operation,
-   and the SEL doesn't allow zero operands.  */
-/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.h, p[0-7]/z, z[0-9]+\.h\n} 1 { xfail *-*-* } } } */
-/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.s, p[0-7]/z, z[0-9]+\.s\n} 1 { xfail *-*-* } } } */
-/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.d, p[0-7]/z, z[0-9]+\.d\n} 1 { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.h, p[0-7]/z, z[0-9]+\.h\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.s, p[0-7]/z, z[0-9]+\.s\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.d, p[0-7]/z, z[0-9]+\.d\n} 1 } } */
 
 /* { dg-final { scan-assembler-not {\tmov\tz[^,]*z} } } */
-/* { dg-final { scan-assembler-not {\tsel\t} { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-not {\tsel\t} } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/cond_unary_4.c b/gcc/testsuite/gcc.target/aarch64/sve/cond_unary_4.c
index a491f899088..cbb957bffa4 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/cond_unary_4.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/cond_unary_4.c
@@ -52,15 +52,10 @@  TEST_ALL (DEF_LOOP)
 /* { dg-final { scan-assembler-times {\tfneg\tz[0-9]+\.s, p[0-7]/m,} 1 } } */
 /* { dg-final { scan-assembler-times {\tfneg\tz[0-9]+\.d, p[0-7]/m,} 1 } } */
 
-/* Really we should be able to use MOVPRFX /z here, but at the moment
-   we're relying on combine to merge a SEL and an arithmetic operation,
-   and the SEL doesn't allow the "false" value to be zero when the "true"
-   value is a register.  */
-/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+, z[0-9]+\n} 7 } } */
-/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.b, p[0-7]/z, z[0-9]+\.b} 1 } } */
-/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.h, p[0-7]/z, z[0-9]+\.h} 2 } } */
-/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.s, p[0-7]/z, z[0-9]+\.s} 2 } } */
-/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.d, p[0-7]/z, z[0-9]+\.d} 2 } } */
+/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.b, p[0-7]/z, z[0-9]+\.b} 2 } } */
+/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.h, p[0-7]/z, z[0-9]+\.h} 4 } } */
+/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.s, p[0-7]/z, z[0-9]+\.s} 4 } } */
+/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.d, p[0-7]/z, z[0-9]+\.d} 4 } } */
 
 /* { dg-final { scan-assembler-not {\tmov\tz[^\n]*z} } } */
 /* { dg-final { scan-assembler-not {\tsel\t} } } */
diff --git a/gcc/tree-pass.h b/gcc/tree-pass.h
index 09e6ada5b2f..75376316e40 100644
--- a/gcc/tree-pass.h
+++ b/gcc/tree-pass.h
@@ -612,6 +612,7 @@  extern rtl_opt_pass *make_pass_branch_prob (gcc::context *ctxt);
 extern rtl_opt_pass *make_pass_value_profile_transformations (gcc::context
 							      *ctxt);
 extern rtl_opt_pass *make_pass_postreload_cse (gcc::context *ctxt);
+extern rtl_opt_pass *make_pass_late_combine (gcc::context *ctxt);
 extern rtl_opt_pass *make_pass_gcse2 (gcc::context *ctxt);
 extern rtl_opt_pass *make_pass_split_after_reload (gcc::context *ctxt);
 extern rtl_opt_pass *make_pass_thread_prologue_and_epilogue (gcc::context