Message ID: mptr0ljn9eh.fsf@arm.com
State:      New
Series:     Add a late-combine pass [PR106594]
Hi!

On Tue, Oct 24, 2023 at 07:49:10PM +0100, Richard Sandiford wrote:
> This patch adds a combine pass that runs late in the pipeline.

But it is not.  It is a completely new thing, and much closer to
fwprop than to combine, too.

Could you rename it to something else, please?  Something less
confusing to both users and maintainers :-)

> There are two instances: one between combine and split1,

So, what kind of things does this do that the real combine does not?
And, same question but for fwprop.  That would be the crucial
motivation for why we want to have this new pass at all :-)

> The pass currently has a single objective: remove definitions by
> substituting into all uses.

The easy case ;-)


Segher
Segher Boessenkool <segher@kernel.crashing.org> writes:
> Hi!
>
> On Tue, Oct 24, 2023 at 07:49:10PM +0100, Richard Sandiford wrote:
>> This patch adds a combine pass that runs late in the pipeline.
>
> But it is not.  It is a completely new thing, and much closer to
> fwprop than to combine, too.

Well, it is a combine pass.  It's not a new instance of the pass in
combine.cc, but I don't think that's the implication.  We already have
two combine passes: the combine.cc one and the postreload one.
Similarly we have at least two DCE implementations.

> Could you rename it to something else, please?  Something less
> confusing to both users and maintainers :-)

Do you have any suggestions?

>> There are two instances: one between combine and split1,
>
> So, what kind of things does this do that the real combine does not?
> And, same question but for fwprop.  That would be the crucial
> motivation for why we want to have this new pass at all :-)

You've already responded (below) to the part where I explained that.

>> The pass currently has a single objective: remove definitions by
>> substituting into all uses.
>
> The easy case ;-)

And yet a case that no existing pass handles. :)  That's why I'm
trying to add something that does.

Richard
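To make the kind of case concrete, here is a minimal hand-written C
sketch of the "multiple uses of an extension" situation described
later in the thread.  This is illustrative only (the real testcase is
gcc.target/aarch64/pr106594_1.c in the patch below):

extern const int a[256], b[256], c[256];

int
sum (const unsigned char *p, int n)
{
  int s = 0;
  for (int i = 0; i < n; ++i)
    {
      unsigned int j = p[i];	/* zero-extending load */
      /* Three address uses of the extended value.  Folding the
	 extension into any single address is not a win while the
	 extension must stay for the other two uses, so fwprop
	 declines; the multi-use pattern also exceeds what combine's
	 small instruction window will attempt.  Substituting the
	 definition into all uses at once removes the extension.  */
      s += a[j] + b[j] + c[j];
    }
  return s;
}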
On 10/24/23 12:49, Richard Sandiford wrote:
> This patch adds a combine pass that runs late in the pipeline.
> There are two instances: one between combine and split1, and one
> after postreload.
So have you done any investigation on cases caught by your new pass
between combine and split1 to characterize them?  In particular do
they point at solvable problems in combine?  Or do you foresee this
subsuming the old combiner pass at some point in the distant future?

rth and I sketched out an SSA based RTL combine at some point in the
deep past.  The key goal we were trying to achieve was combining
across blocks.  We didn't have a functioning RTL SSA form at the time,
so it never went to any implementation work.  It looks like yours
would solve the class of problems rth and I were considering.

[ ... ]

>
> The patch therefore enables the pass by default only on AArch64.
> However, I did test the patch with it enabled on x86_64-linux-gnu
> as well, which was useful for debugging.
>
> Bootstrapped & regression-tested on aarch64-linux-gnu and
> x86_64-linux-gnu (as posted, with no regressions, and with the
> pass enabled by default, with some gcc.target/i386 regressions).
> OK to install?
I'm going to adjust this slightly so that it's enabled across the
board and throw it into the tester tomorrow (tester is busy tonight).
Even if we make it opt-in on a per-port basis, the alternate target
testing does seem to find stuff that needs fixing ;-)

>
> Richard
>
>
> gcc/
> 	PR rtl-optimization/106594
> 	* Makefile.in (OBJS): Add late-combine.o.
> 	* common.opt (flate-combine-instructions): New option.
> 	* doc/invoke.texi: Document it.
> 	* common/config/aarch64/aarch64-common.cc: Enable it by default
> 	at -O2 and above.
> 	* tree-pass.h (make_pass_late_combine): Declare.
> 	* late-combine.cc: New file.
> 	* passes.def: Add two instances of late_combine.
>
> gcc/testsuite/
> 	PR rtl-optimization/106594
> 	* gcc.dg/ira-shrinkwrap-prep-1.c: Restrict XFAIL to non-aarch64
> 	targets.
> 	* gcc.dg/ira-shrinkwrap-prep-2.c: Likewise.
> 	* gcc.dg/stack-check-4.c: Add -fno-shrink-wrap.
> 	* gcc.target/aarch64/sve/cond_asrd_3.c: Remove XFAILs.
> 	* gcc.target/aarch64/sve/cond_convert_3.c: Likewise.
> 	* gcc.target/aarch64/sve/cond_fabd_5.c: Likewise.
> 	* gcc.target/aarch64/sve/cond_convert_6.c: Expect the MOVPRFX /Zs
> 	described in the comment.
> 	* gcc.target/aarch64/sve/cond_unary_4.c: Likewise.
> 	* gcc.target/aarch64/pr106594_1.c: New test.
>
> +
> +      // Don't substitute into a non-local goto, since it can then be
> +      // treated as a jump to local label, e.g. in shorten_branches.
> +      // ??? But this shouldn't be necessary.
> +      if (use_insn->is_jump ()
> +	  && find_reg_note (use_insn->rtl (), REG_NON_LOCAL_GOTO, NULL_RTX))
> +	return false;
Agreed that this shouldn't be necessary.  In fact, if you can
substitute it's potentially very profitable.  Then again, I doubt it
happens all that much in practice, particularly since I think gimple
does squash out some of these.

Nothing jumps out as horribly wrong.  You might want/need to reject
frame related insns in optimizable_set, though I guess if the dwarf2
writer isn't complaining, then we haven't mucked things up too bad.
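For the frame-related suggestion, the check would presumably be a
one-liner.  A hypothetical sketch (not part of the posted patch) of
what could go at the top of late_combine::optimizable_set:

  // Hypothetical guard: leave frame-related insns alone, since the
  // dwarf2 writer expects to see them in a recognizable form.
  if (RTX_FRAME_RELATED_P (insn->rtl ()))
    return NULL_RTX;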
Jeff Law <jeffreyalaw@gmail.com> writes:
> On 10/24/23 12:49, Richard Sandiford wrote:
>> This patch adds a combine pass that runs late in the pipeline.
>> There are two instances: one between combine and split1, and one
>> after postreload.
> So have you done any investigation on cases caught by your new pass
> between combine and split1 to characterize them?  In particular do
> they point at solvable problems in combine?  Or do you foresee this
> subsuming the old combiner pass at some point in the distant future?

Examples like the PR are the main motivation for the pre-RA pass.
There we had an extension that could be combined into an address,
but no longer was after GCC 13.

The PR itself could in principle be fixed in combine (various
approaches were suggested, but not accepted).  But the same problem
applies to multiple uses of extensions.  fwprop can't handle it because
individual propagations are not a win in isolation.  And combine has
a limit of combining 4 insns (with a maximum of 2 output insns, IIRC).
So I don't think either of the existing passes scales to the general
case.

FWIW, an example of this is gcc.c-torture/compile/pr33133.c:

@@ -20,7 +20,7 @@
 	bne	.L3
 	mov	x3, 0
 .L7:
-	mov	w8, w3
+	mov	w9, w3
 	add	x1, x5, x3
 	ldrb	w6, [x1, 256]
 	ubfx	x1, x6, 1, 7
@@ -28,15 +28,14 @@
 	ldrb	w1, [x7, x1]
 	lsr	w1, w1, 4
 .L5:
-	asr	w6, w8, 1
-	sxtw	x9, w6
-	ldrb	w6, [x2, w6, sxtw]
+	asr	w8, w9, 1
+	ldrb	w6, [x2, w8, sxtw]
 	sxtb	w1, w1
-	tbz	x8, 0, .L6
+	tbz	x9, 0, .L6
 	sbfiz	w1, w1, 4, 4
 .L6:
 	orr	w1, w6, w1
-	strb	w1, [x2, x9]
+	strb	w1, [x2, w8, sxtw]
 	add	x3, x3, 1
 	cmp	x3, 16
 	bne	.L7

I hope that the new pass could be extended to do new things in future,
especially if they're things that would make sense in both a pre-RA
and post-RA pass.  And there's some benefit to having a cleanish
slate, especially without the baggage of the
expand_compound_operation/make_compound_operation wrangler.  But I
don't think we can realistically expect to replace old combine, or
even that it's something worth aiming for.

> rth and I sketched out an SSA based RTL combine at some point in the
> deep past.  The key goal we were trying to achieve was combining
> across blocks.  We didn't have a functioning RTL SSA form at the
> time, so it never went to any implementation work.  It looks like
> yours would solve the class of problems rth and I were considering.

Yeah, I do see some cases where combining across blocks helps.
The case above is one example of that.  Another is:

@@ -8,9 +8,8 @@
 .LFB0:
 	.cfi_startproc
 	ldr	w1, [x0, 4]
-	cmp	w1, 0
 	cbz	w1, .L3
-	blt	.L4
+	tbnz	w1, #31, .L4
 	ldr	w2, [x0]
 	ldr	w0, [x0, 8]
 	add	w1, w1, w2

And, as a similar example for multiple uses:

@@ -144,8 +144,7 @@
 	neg	x21, x21
 	mov	w26, 255
 .L18:
-	cmp	x19, 0
-	bge	.L19
+	tbz	x19, #63, .L19
 	mvn	w26, w26
 	and	w26, w26, 255
 	neg	x19, x19
@@ -160,7 +159,7 @@
 	mov	w26, 0
 	b	.L18
 .L19:
-	bne	.L20
+	cbnz	x19, .L20
 	mov	x19, 9223372036854775807
 	b	.L21

>> The patch therefore enables the pass by default only on AArch64.
>> However, I did test the patch with it enabled on x86_64-linux-gnu
>> as well, which was useful for debugging.
>>
>> Bootstrapped & regression-tested on aarch64-linux-gnu and
>> x86_64-linux-gnu (as posted, with no regressions, and with the
>> pass enabled by default, with some gcc.target/i386 regressions).
>> OK to install?
> I'm going to adjust this slightly so that it's enabled across the
> board and throw it into the tester tomorrow (tester is busy tonight).
> Even if we make it opt-in on a per-port basis, the alternate target
> testing does seem to find stuff that needs fixing ;-)

Thanks!  As per our off-list discussion, the cris-elf failures showed
up a bug in the handling of call arguments.  Here's an updated version
with that fixed.

Unfortunately (also as per the off-list discussion), it seems like
some ports have latent assumptions that certain combinations don't
happen between RA and the first post-RA split.  They should be easy
enough to iron out in most cases, although as mentioned, the bad
interaction with the x86 passes seems harder to fix.

>> +
>> +      // Don't substitute into a non-local goto, since it can then be
>> +      // treated as a jump to local label, e.g. in shorten_branches.
>> +      // ??? But this shouldn't be necessary.
>> +      if (use_insn->is_jump ()
>> +	  && find_reg_note (use_insn->rtl (), REG_NON_LOCAL_GOTO, NULL_RTX))
>> +	return false;
> Agreed that this shouldn't be necessary.  In fact, if you can
> substitute it's potentially very profitable.  Then again, I doubt it
> happens all that much in practice, particularly since I think gimple
> does squash out some of these.

Yeah.  If I'd posted this earlier in stage 1 (rather than October),
I might have tried teaching shorten_branches how to handle this.
But it felt like it could be a bit of a rabbit hole at this stage.

> Nothing jumps out as horribly wrong.  You might want/need to reject
> frame related insns in optimizable_set, though I guess if the dwarf2
> writer isn't complaining, then we haven't mucked things up too bad.

Ah, yeah, I wondered about that.  But I suppose most prologue insns
don't really create combination opportunities, since most of them
either set up the stack (which we wouldn't combine anyway) or sink
incoming data.  So if there are cases where this is a problem, it
might unfortunately be a case of "see what breaks".

Tested on aarch64-linux-gnu.  Does this version look OK?

(There's no change to the covering note btw.  See:

  // Don't try to substitute into call arguments.

for the new code.)

Thanks,
Richard

----

This patch adds a combine pass that runs late in the pipeline.
There are two instances: one between combine and split1, and one
after postreload.

The pass currently has a single objective: remove definitions by
substituting into all uses.  The pre-RA version tries to restrict
itself to cases that are likely to have a neutral or beneficial
effect on register pressure.

The patch fixes PR106594.  It also fixes a few FAILs and XFAILs in
the aarch64 test results, mostly due to making proper use of MOVPRFX
in cases where we didn't previously.  I hope it would also help with
Robin's vec_duplicate testcase, although the pressure heuristic might
need tweaking for that case.

This is just a first step.  I'm hoping that the pass could be used
for other combine-related optimisations in future.  In particular,
the post-RA version doesn't need to restrict itself to cases where
all uses are substitutable, since it doesn't have to worry about
register pressure.  If we did that, and if we extended it to handle
multi-register REGs, the pass might be a viable replacement for
regcprop, which in turn might reduce the cost of having a post-RA
instance of the new pass.

I've run an assembly comparison with one target per CPU directory,
and it seems to be a win for all targets except nvptx (which is hard
to measure, being a higher-level asm).  The biggest winner seemed
to be AVR.

I'd originally hoped to enable the pass by default at -O2 and above
on all targets.
But in the end, I don't think that's possible, because it interacts
badly with x86's STV and partial register dependency passes.

For example, gcc.target/i386/minmax-6.c tests whether the code
compiles without any spilling.  The RTL created by STV contains:

(insn 33 31 3 2 (set (subreg:V4SI (reg:SI 120) 0)
        (vec_merge:V4SI (vec_duplicate:V4SI (reg:SI 116))
            (const_vector:V4SI [
                    (const_int 0 [0]) repeated x4
                ])
            (const_int 1 [0x1]))) -1
     (nil))
(insn 3 33 34 2 (set (subreg:V4SI (reg:SI 118) 0)
        (subreg:V4SI (reg:SI 120) 0)) {movv4si_internal}
     (expr_list:REG_DEAD (reg:SI 120)
        (nil)))
(insn 34 3 32 2 (set (reg/v:SI 108 [ y ])
        (reg:SI 118)) -1
     (nil))

and it's crucial for the test that reg 108 is kept, rather than
propagated into uses.  As things stand, 118 can be allocated a vector
register and 108 a scalar register.  If 108 is propagated, there will
be scalar and vector uses of 118, and so it will be spilled to memory.

That one could be solved by running STV2 later.  But RPAD is a bigger
problem.  In gcc.target/i386/pr87007-5.c, RPAD converts:

(insn 27 26 28 6 (set (reg:DF 100 [ _15 ])
        (sqrt:DF (mem/c:DF (symbol_ref:DI ("d2"))))) {*sqrtdf2_sse}
     (nil))

into:

(insn 45 26 44 6 (set (reg:V4SF 108)
        (const_vector:V4SF [
                (const_double:SF 0.0 [0x0.0p+0]) repeated x4
            ])) -1
     (nil))
(insn 44 45 27 6 (set (reg:V2DF 109)
        (vec_merge:V2DF
            (vec_duplicate:V2DF
                (sqrt:DF (mem/c:DF (symbol_ref:DI ("d2")))))
            (subreg:V2DF (reg:V4SF 108) 0)
            (const_int 1 [0x1]))) -1
     (nil))
(insn 27 44 28 6 (set (reg:DF 100 [ _15 ])
        (subreg:DF (reg:V2DF 109) 0)) {*movdf_internal}
     (nil))

But both the pre-RA and post-RA passes are able to combine these
instructions back to the original form.

The patch therefore enables the pass by default only on AArch64.
However, I did test the patch with it enabled on x86_64-linux-gnu
as well, which was useful for debugging.

gcc/
	PR rtl-optimization/106594
	* Makefile.in (OBJS): Add late-combine.o.
	* common.opt (flate-combine-instructions): New option.
	* doc/invoke.texi: Document it.
	* common/config/aarch64/aarch64-common.cc: Enable it by default
	at -O2 and above.
	* tree-pass.h (make_pass_late_combine): Declare.
	* late-combine.cc: New file.
	* passes.def: Add two instances of late_combine.

gcc/testsuite/
	PR rtl-optimization/106594
	* gcc.dg/ira-shrinkwrap-prep-1.c: Restrict XFAIL to non-aarch64
	targets.
	* gcc.dg/ira-shrinkwrap-prep-2.c: Likewise.
	* gcc.dg/stack-check-4.c: Add -fno-shrink-wrap.
	* gcc.target/aarch64/sve/cond_asrd_3.c: Remove XFAILs.
	* gcc.target/aarch64/sve/cond_convert_3.c: Likewise.
	* gcc.target/aarch64/sve/cond_fabd_5.c: Likewise.
	* gcc.target/aarch64/sve/cond_convert_6.c: Expect the MOVPRFX /Zs
	described in the comment.
	* gcc.target/aarch64/sve/cond_unary_4.c: Likewise.
	* gcc.target/aarch64/pr106594_1.c: New test.
---
 gcc/Makefile.in                               |   1 +
 gcc/common.opt                                |   5 +
 gcc/common/config/aarch64/aarch64-common.cc   |   1 +
 gcc/doc/invoke.texi                           |  11 +-
 gcc/late-combine.cc                           | 732 ++++++++++++++++++
 gcc/passes.def                                |   2 +
 gcc/testsuite/gcc.dg/ira-shrinkwrap-prep-1.c  |   2 +-
 gcc/testsuite/gcc.dg/ira-shrinkwrap-prep-2.c  |   2 +-
 gcc/testsuite/gcc.dg/stack-check-4.c          |   2 +-
 gcc/testsuite/gcc.target/aarch64/pr106594_1.c |  20 +
 .../gcc.target/aarch64/sve/cond_asrd_3.c      |  10 +-
 .../gcc.target/aarch64/sve/cond_convert_3.c   |   8 +-
 .../gcc.target/aarch64/sve/cond_convert_6.c   |   8 +-
 .../gcc.target/aarch64/sve/cond_fabd_5.c      |  11 +-
 .../gcc.target/aarch64/sve/cond_unary_4.c     |  13 +-
 gcc/tree-pass.h                               |   1 +
 16 files changed, 794 insertions(+), 35 deletions(-)
 create mode 100644 gcc/late-combine.cc
 create mode 100644 gcc/testsuite/gcc.target/aarch64/pr106594_1.c

diff --git a/gcc/Makefile.in b/gcc/Makefile.in
index deb12e17d25..417390b3b79 100644
--- a/gcc/Makefile.in
+++ b/gcc/Makefile.in
@@ -1571,6 +1571,7 @@ OBJS = \
 	ira-lives.o \
 	jump.o \
 	langhooks.o \
+	late-combine.o \
 	lcm.o \
 	lists.o \
 	loop-doloop.o \
diff --git a/gcc/common.opt b/gcc/common.opt
index 5f0a101bccb..56036f2295f 100644
--- a/gcc/common.opt
+++ b/gcc/common.opt
@@ -1784,6 +1784,11 @@ Common Var(flag_large_source_files) Init(0)
 Improve GCC's ability to track column numbers in large source files,
 at the expense of slower compilation.
 
+flate-combine-instructions
+Common Var(flag_late_combine_instructions) Optimization Init(0)
+Run two instruction combination passes late in the pass pipeline;
+one before register allocation and one after.
+
 floop-parallelize-all
 Common Var(flag_loop_parallelize_all) Optimization
 Mark all loops as parallel.
diff --git a/gcc/common/config/aarch64/aarch64-common.cc b/gcc/common/config/aarch64/aarch64-common.cc
index 951d041d310..75c72b1680d 100644
--- a/gcc/common/config/aarch64/aarch64-common.cc
+++ b/gcc/common/config/aarch64/aarch64-common.cc
@@ -56,6 +56,7 @@ static const struct default_options aarch_option_optimization_table[] =
     /* Enable redundant extension instructions removal at -O2 and higher.  */
     { OPT_LEVELS_2_PLUS, OPT_free, NULL, 1 },
     { OPT_LEVELS_2_PLUS, OPT_mearly_ra_, NULL, AARCH64_EARLY_RA_ALL },
+    { OPT_LEVELS_2_PLUS, OPT_flate_combine_instructions, NULL, 1 },
 #if (TARGET_DEFAULT_ASYNC_UNWIND_TABLES == 1)
     { OPT_LEVELS_ALL, OPT_fasynchronous_unwind_tables, NULL, 1 },
     { OPT_LEVELS_ALL, OPT_funwind_tables, NULL, 1},
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 68d1f364ac0..dbb798cc610 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -571,7 +571,7 @@ Objective-C and Objective-C++ Dialects}.
 -fipa-bit-cp  -fipa-vrp -fipa-pta  -fipa-profile  -fipa-pure-const
 -fipa-reference  -fipa-reference-addressable
 -fipa-stack-alignment  -fipa-icf  -fira-algorithm=@var{algorithm}
--flive-patching=@var{level}
+-flate-combine-instructions  -flive-patching=@var{level}
 -fira-region=@var{region}  -fira-hoist-pressure
 -fira-loop-pressure  -fno-ira-share-save-slots
 -fno-ira-share-spill-slots
@@ -13442,6 +13442,15 @@ equivalences that are found only by GCC and equivalences found only by Gold.
 
 This flag is enabled by default at @option{-O2} and @option{-Os}.
 
+@opindex flate-combine-instructions
+@item -flate-combine-instructions
+Enable two instruction combination passes that run relatively late in the
+compilation process.  One of the passes runs before register allocation and
+the other after register allocation.  The main aim of the passes is to
+substitute definitions into all uses.
+
+Some targets enable this flag by default at @option{-O2} and @option{-Os}.
+
 @opindex flive-patching
 @item -flive-patching=@var{level}
 Control GCC's optimizations to produce output suitable for live-patching.
diff --git a/gcc/late-combine.cc b/gcc/late-combine.cc
new file mode 100644
index 00000000000..47122ca2e4b
--- /dev/null
+++ b/gcc/late-combine.cc
@@ -0,0 +1,732 @@
+// Late-stage instruction combination pass.
+// Copyright (C) 2023 Free Software Foundation, Inc.
+//
+// This file is part of GCC.
+//
+// GCC is free software; you can redistribute it and/or modify it under
+// the terms of the GNU General Public License as published by the Free
+// Software Foundation; either version 3, or (at your option) any later
+// version.
+//
+// GCC is distributed in the hope that it will be useful, but WITHOUT ANY
+// WARRANTY; without even the implied warranty of MERCHANTABILITY or
+// FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
+// for more details.
+//
+// You should have received a copy of the GNU General Public License
+// along with GCC; see the file COPYING3.  If not see
+// <http://www.gnu.org/licenses/>.
+
+// The current purpose of this pass is to substitute definitions into
+// all uses, so that the definition can be removed.  However, it could
+// be extended to handle other combination-related optimizations in future.
+//
+// The pass can run before or after register allocation.  When running
+// before register allocation, it tries to avoid cases that are likely
+// to increase register pressure.  For the same reason, it avoids moving
+// instructions around, even if doing so would allow an optimization to
+// succeed.  These limitations are removed when running after register
+// allocation.
+
+#define INCLUDE_ALGORITHM
+#define INCLUDE_FUNCTIONAL
+#include "config.h"
+#include "system.h"
+#include "coretypes.h"
+#include "backend.h"
+#include "rtl.h"
+#include "df.h"
+#include "rtl-ssa.h"
+#include "print-rtl.h"
+#include "tree-pass.h"
+#include "cfgcleanup.h"
+#include "target.h"
+
+using namespace rtl_ssa;
+
+namespace {
+const pass_data pass_data_late_combine =
+{
+  RTL_PASS, // type
+  "late_combine", // name
+  OPTGROUP_NONE, // optinfo_flags
+  TV_NONE, // tv_id
+  0, // properties_required
+  0, // properties_provided
+  0, // properties_destroyed
+  0, // todo_flags_start
+  TODO_df_finish, // todo_flags_finish
+};
+
+// Represents an attempt to substitute a single-set definition into all
+// uses of the definition.
+class insn_combination
+{
+public:
+  insn_combination (set_info *, rtx, rtx);
+  bool run ();
+  const vec<insn_change *> &use_changes () const { return m_use_changes; }
+
+private:
+  use_array get_new_uses (use_info *);
+  bool substitute_nondebug_use (use_info *);
+  bool substitute_nondebug_uses (set_info *);
+  bool move_and_recog (insn_change &);
+  bool try_to_preserve_debug_info (insn_change &, use_info *);
+  void substitute_debug_use (use_info *);
+  bool substitute_note (insn_info *, rtx, bool);
+  void substitute_notes (insn_info *, bool);
+  void substitute_note_uses (use_info *);
+  void substitute_optional_uses (set_info *);
+
+  // Represents the state of the function's RTL at the start of this
+  // combination attempt.
+  insn_change_watermark m_rtl_watermark;
+
+  // Represents the rtl-ssa state at the start of this combination attempt.
+  obstack_watermark m_attempt;
+
+  // The instruction that contains the definition, and that we're trying
+  // to delete.
+  insn_info *m_def_insn;
+
+  // The definition itself.
+  set_info *m_def;
+
+  // The destination and source of the single set that defines m_def.
+  // The destination is known to be a plain REG.
+  rtx m_dest;
+  rtx m_src;
+
+  // Contains all in-progress changes to uses of m_def.
+  auto_vec<insn_change *> m_use_changes;
+
+  // Contains the full list of changes that we want to make, in reverse
+  // postorder.
+  auto_vec<insn_change *> m_nondebug_changes;
+};
+
+// Class that represents one run of the pass.
+class late_combine
+{
+public:
+  unsigned int execute (function *);
+
+private:
+  rtx optimizable_set (insn_info *);
+  bool check_register_pressure (insn_info *, rtx);
+  bool check_uses (set_info *, rtx);
+  bool combine_into_uses (insn_info *, insn_info *);
+
+  auto_vec<insn_info *> m_worklist;
+};
+
+insn_combination::insn_combination (set_info *def, rtx dest, rtx src)
+  : m_rtl_watermark (),
+    m_attempt (crtl->ssa->new_change_attempt ()),
+    m_def_insn (def->insn ()),
+    m_def (def),
+    m_dest (dest),
+    m_src (src),
+    m_use_changes (),
+    m_nondebug_changes ()
+{
+}
+
+// USE is a direct or indirect use of m_def.  Return the list of uses
+// that would be needed after substituting m_def into the instruction.
+// The returned list is marked as invalid if USE's insn and m_def_insn
+// use different definitions for the same resource (register or memory).
+use_array
+insn_combination::get_new_uses (use_info *use)
+{
+  auto *def = use->def ();
+  auto *use_insn = use->insn ();
+
+  use_array new_uses = use_insn->uses ();
+  new_uses = remove_uses_of_def (m_attempt, new_uses, def);
+  new_uses = merge_access_arrays (m_attempt, m_def_insn->uses (), new_uses);
+  if (new_uses.is_valid () && use->ebb () != m_def->ebb ())
+    new_uses = crtl->ssa->make_uses_available (m_attempt, new_uses, use->bb (),
+					       use_insn->is_debug_insn ());
+  return new_uses;
+}
+
+// Start the process of trying to replace USE by substitution, given that
+// USE occurs in a non-debug instruction.  Check that the substitution can
+// be represented in RTL and that each use of a resource (register or memory)
+// has a consistent definition.  If so, start an insn_change record for the
+// substitution and return true.
+bool
+insn_combination::substitute_nondebug_use (use_info *use)
+{
+  insn_info *use_insn = use->insn ();
+  rtx_insn *use_rtl = use_insn->rtl ();
+
+  if (dump_file && (dump_flags & TDF_DETAILS))
+    dump_insn_slim (dump_file, use->insn ()->rtl ());
+
+  // Check that we can change the instruction pattern.  Leave recognition
+  // of the result till later.
+  insn_propagation prop (use_rtl, m_dest, m_src);
+  if (!prop.apply_to_pattern (&PATTERN (use_rtl))
+      || prop.num_replacements == 0)
+    {
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	fprintf (dump_file, "-- RTL substitution failed\n");
+      return false;
+    }
+
+  use_array new_uses = get_new_uses (use);
+  if (!new_uses.is_valid ())
+    {
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	fprintf (dump_file, "-- could not prove that all sources"
+		 " are available\n");
+      return false;
+    }
+
+  auto *where = XOBNEW (m_attempt, insn_change);
+  auto *use_change = new (where) insn_change (use_insn);
+  m_use_changes.safe_push (use_change);
+  use_change->new_uses = new_uses;
+
+  return true;
+}
+
+// Apply substitute_nondebug_use to all direct and indirect uses of DEF.
+// There will be at most one level of indirection.
+bool
+insn_combination::substitute_nondebug_uses (set_info *def)
+{
+  for (use_info *use : def->nondebug_insn_uses ())
+    if (!use->is_live_out_use ()
+	&& !use->only_occurs_in_notes ()
+	&& !substitute_nondebug_use (use))
+      return false;
+
+  for (use_info *use : def->phi_uses ())
+    if (!substitute_nondebug_uses (use->phi ()))
+      return false;
+
+  return true;
+}
+
+// Complete the verification of USE_CHANGE, given that m_nondebug_insns
+// now contains an insn_change record for all proposed non-debug changes.
+// Check that the new instruction is a recognized pattern.  Also check that
+// the instruction can be placed somewhere that makes all definitions and
+// uses valid, and that permits any new hard-register clobbers added
+// during the recognition process.  Return true on success.
+bool
+insn_combination::move_and_recog (insn_change &use_change)
+{
+  insn_info *use_insn = use_change.insn ();
+
+  if (reload_completed && can_move_insn_p (use_insn))
+    use_change.move_range = { use_insn->bb ()->head_insn (),
+			      use_insn->ebb ()->last_bb ()->end_insn () };
+  if (!restrict_movement_ignoring (use_change,
+				   insn_is_changing (m_nondebug_changes)))
+    {
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	fprintf (dump_file, "-- cannot satisfy all definitions and uses"
+		 " in insn %d\n", INSN_UID (use_insn->rtl ()));
+      return false;
+    }
+
+  if (!recog_ignoring (m_attempt, use_change,
+		       insn_is_changing (m_nondebug_changes)))
+    return false;
+
+  return true;
+}
+
+// USE_CHANGE.insn () is a debug instruction that uses m_def.  Try to
+// substitute the definition into the instruction and try to describe
+// the result in USE_CHANGE.  Return true on success.  Failure means that
+// the instruction must be reset instead.
+bool
+insn_combination::try_to_preserve_debug_info (insn_change &use_change,
+					      use_info *use)
+{
+  insn_info *use_insn = use_change.insn ();
+  rtx_insn *use_rtl = use_insn->rtl ();
+
+  use_change.new_uses = get_new_uses (use);
+  if (!use_change.new_uses.is_valid ()
+      || !restrict_movement (use_change))
+    return false;
+
+  insn_propagation prop (use_rtl, m_dest, m_src);
+  return prop.apply_to_pattern (&INSN_VAR_LOCATION_LOC (use_rtl));
+}
+
+// USE_INSN is a debug instruction that uses m_def.  Update it to reflect
+// the fact that m_def is going to disappear.  Try to preserve the source
+// value if possible, but reset the instruction if not.
+void
+insn_combination::substitute_debug_use (use_info *use)
+{
+  auto *use_insn = use->insn ();
+  rtx_insn *use_rtl = use_insn->rtl ();
+
+  auto use_change = insn_change (use_insn);
+  if (!try_to_preserve_debug_info (use_change, use))
+    {
+      use_change.new_uses = {};
+      use_change.move_range = use_change.insn ();
+      INSN_VAR_LOCATION_LOC (use_rtl) = gen_rtx_UNKNOWN_VAR_LOC ();
+    }
+  insn_change *changes[] = { &use_change };
+  crtl->ssa->change_insns (changes);
+}
+
+// NOTE is a reg note of USE_INSN, which previously used m_def.  Update
+// the note to reflect the fact that m_def is going to disappear.  Return
+// true on success, or false if the note must be deleted.
+//
+// CAN_PROPAGATE is true if m_dest can be replaced with m_src.
+bool
+insn_combination::substitute_note (insn_info *use_insn, rtx note,
+				   bool can_propagate)
+{
+  if (REG_NOTE_KIND (note) == REG_EQUAL
+      || REG_NOTE_KIND (note) == REG_EQUIV)
+    {
+      insn_propagation prop (use_insn->rtl (), m_dest, m_src);
+      return (prop.apply_to_rvalue (&XEXP (note, 0))
+	      && (can_propagate || prop.num_replacements == 0));
+    }
+  return true;
+}
+
+// Update USE_INSN's notes after deciding to go ahead with the optimization.
+// CAN_PROPAGATE is true if m_dest can be replaced with m_src.
+void
+insn_combination::substitute_notes (insn_info *use_insn, bool can_propagate)
+{
+  rtx_insn *use_rtl = use_insn->rtl ();
+  rtx *ptr = &REG_NOTES (use_rtl);
+  while (rtx note = *ptr)
+    {
+      if (substitute_note (use_insn, note, can_propagate))
+	ptr = &XEXP (note, 1);
+      else
+	*ptr = XEXP (note, 1);
+    }
+}
+
+// We've decided to go ahead with the substitution.  Update all REG_NOTES
+// involving USE.
+void
+insn_combination::substitute_note_uses (use_info *use)
+{
+  insn_info *use_insn = use->insn ();
+
+  bool can_propagate = true;
+  if (use->only_occurs_in_notes ())
+    {
+      // The only uses are in notes.  Try to keep the note if we can,
+      // but removing it is better than aborting the optimization.
+      insn_change use_change (use_insn);
+      use_change.new_uses = get_new_uses (use);
+      if (!use_change.new_uses.is_valid ()
+	  || !restrict_movement (use_change))
+	{
+	  use_change.move_range = use_insn;
+	  use_change.new_uses = remove_uses_of_def (m_attempt,
+						    use_insn->uses (),
+						    use->def ());
+	  can_propagate = false;
+	}
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	{
+	  fprintf (dump_file, "%s notes in:\n",
+		   can_propagate ? "updating" : "removing");
+	  dump_insn_slim (dump_file, use_insn->rtl ());
+	}
+      substitute_notes (use_insn, can_propagate);
+      insn_change *changes[] = { &use_change };
+      crtl->ssa->change_insns (changes);
+    }
+  else
+    substitute_notes (use_insn, can_propagate);
+}
+
+// We've decided to go ahead with the substitution and we've dealt with
+// all uses that occur in the patterns of non-debug insns.  Update all
+// other uses for the fact that m_def is about to disappear.
+void
+insn_combination::substitute_optional_uses (set_info *def)
+{
+  if (auto insn_uses = def->all_insn_uses ())
+    {
+      use_info *use = *insn_uses.begin ();
+      while (use)
+	{
+	  use_info *next_use = use->next_any_insn_use ();
+	  if (use->is_in_debug_insn ())
+	    substitute_debug_use (use);
+	  else if (!use->is_live_out_use ())
+	    substitute_note_uses (use);
+	  use = next_use;
+	}
+    }
+  for (use_info *use : def->phi_uses ())
+    substitute_optional_uses (use->phi ());
+}
+
+// Try to perform the substitution.  Return true on success.
+bool
+insn_combination::run ()
+{
+  if (dump_file && (dump_flags & TDF_DETAILS))
+    {
+      fprintf (dump_file, "\ntrying to combine definition of r%d in:\n",
+	       m_def->regno ());
+      dump_insn_slim (dump_file, m_def_insn->rtl ());
+      fprintf (dump_file, "into:\n");
+    }
+
+  if (!substitute_nondebug_uses (m_def))
+    return false;
+
+  auto def_change = insn_change::delete_insn (m_def_insn);
+
+  m_nondebug_changes.reserve (m_use_changes.length () + 1);
+  m_nondebug_changes.quick_push (&def_change);
+  m_nondebug_changes.splice (m_use_changes);
+
+  for (auto *use_change : m_use_changes)
+    if (!move_and_recog (*use_change))
+      return false;
+
+  if (!changes_are_worthwhile (m_nondebug_changes)
+      || !crtl->ssa->verify_insn_changes (m_nondebug_changes))
+    return false;
+
+  substitute_optional_uses (m_def);
+
+  confirm_change_group ();
+  crtl->ssa->change_insns (m_nondebug_changes);
+  return true;
+}
+
+// See whether INSN is a single_set that we can optimize.  Return the
+// set if so, otherwise return null.
+rtx
+late_combine::optimizable_set (insn_info *insn)
+{
+  if (!insn->can_be_optimized ()
+      || insn->is_asm ()
+      || insn->is_call ()
+      || insn->has_volatile_refs ()
+      || insn->has_pre_post_modify ()
+      || !can_move_insn_p (insn))
+    return NULL_RTX;
+
+  return single_set (insn->rtl ());
+}
+
+// Suppose that we can replace all uses of SET_DEST (SET) with SET_SRC (SET),
+// where SET occurs in INSN.  Return true if doing so is not likely to
+// increase register pressure.
+bool
+late_combine::check_register_pressure (insn_info *insn, rtx set)
+{
+  // Plain register-to-register moves do not establish a register class
+  // preference and have no well-defined effect on the register allocator.
+  // If changes in register class are needed, the register allocator is
+  // in the best position to place those changes.  If no change in
+  // register class is needed, then the optimization reduces register
+  // pressure if SET_SRC (set) was already live at uses, otherwise the
+  // optimization is pressure-neutral.
+  rtx src = SET_SRC (set);
+  if (REG_P (src))
+    return true;
+
+  // On the same basis, substituting a SET_SRC that contains a single
+  // pseudo register either reduces pressure or is pressure-neutral,
+  // subject to the constraints below.  We would need to do more
+  // analysis for SET_SRCs that use more than one pseudo register.
+  unsigned int nregs = 0;
+  for (auto *use : insn->uses ())
+    if (use->is_reg ()
+	&& !HARD_REGISTER_NUM_P (use->regno ())
+	&& !use->only_occurs_in_notes ())
+      if (++nregs > 1)
+	return false;
+
+  // If there are no pseudo registers in SET_SRC then the optimization
+  // should improve register pressure.
+  if (nregs == 0)
+    return true;
+
+  // We'd be substituting (set (reg R1) SRC) where SRC is known to
+  // contain a single pseudo register R2.  Assume for simplicity that
+  // each new use of R2 would need to be in the same class C as the
+  // current use of R2.  If, for a realistic allocation, C is a
+  // non-strict superset of the R1's register class, the effect on
+  // register pressure should be positive or neutral.  If instead
+  // R1 occupies a different register class from R2, or if R1 has
+  // more allocation freedom than R2, then there's a higher risk that
+  // the effect on register pressure could be negative.
+  //
+  // First use constrain_operands to get the most likely choice of
+  // alternative.  For simplicity, just handle the case where the
+  // output operand is operand 0.
+  extract_insn (insn->rtl ());
+  rtx dest = SET_DEST (set);
+  if (recog_data.n_operands == 0
+      || recog_data.operand[0] != dest)
+    return false;
+
+  if (!constrain_operands (0, get_enabled_alternatives (insn->rtl ())))
+    return false;
+
+  preprocess_constraints (insn->rtl ());
+  auto *alt = which_op_alt ();
+  auto dest_class = alt[0].cl;
+
+  // Check operands 1 and above.
+  auto check_src = [&](unsigned int i)
+    {
+      if (recog_data.is_operator[i])
+	return true;
+
+      rtx op = recog_data.operand[i];
+      if (CONSTANT_P (op))
+	return true;
+
+      if (SUBREG_P (op))
+	op = SUBREG_REG (op);
+      if (REG_P (op))
+	{
+	  if (HARD_REGISTER_P (op))
+	    {
+	      // We've already rejected uses of non-fixed hard registers.
+	      gcc_checking_assert (fixed_regs[REGNO (op)]);
+	      return true;
+	    }
+
+	  // Make sure that the source operand's class is at least as
+	  // permissive as the destination operand's class.
+	  auto src_class = alternative_class (alt, i);
+	  if (!reg_class_subset_p (dest_class, src_class))
+	    return false;
+
+	  // Make sure that the source operand occupies no more hard
+	  // registers than the destination operand.  This mostly matters
+	  // for subregs.
+	  if (targetm.class_max_nregs (dest_class, GET_MODE (dest))
+	      < targetm.class_max_nregs (src_class, GET_MODE (op)))
+	    return false;
+
+	  return true;
+	}
+      return false;
+    };
+  for (int i = 1; i < recog_data.n_operands; ++i)
+    if (!check_src (i))
+      return false;
+
+  return true;
+}
+
+// Check uses of DEF to see whether there is anything obvious that
+// prevents the substitution of SET into uses of DEF.
+bool
+late_combine::check_uses (set_info *def, rtx set)
+{
+  use_info *last_use = nullptr;
+  for (use_info *use : def->nondebug_insn_uses ())
+    {
+      insn_info *use_insn = use->insn ();
+
+      if (use->is_live_out_use ())
+	continue;
+      if (use->only_occurs_in_notes ())
+	continue;
+
+      // We cannot replace all uses if the value is live on exit.
+      if (use->is_artificial ())
+	return false;
+
+      // Avoid increasing the complexity of instructions that
+      // reference allocatable hard registers.
+      if (!REG_P (SET_SRC (set))
+	  && !reload_completed
+	  && (accesses_include_nonfixed_hard_registers (use_insn->uses ())
+	      || accesses_include_nonfixed_hard_registers (use_insn->defs ())))
+	return false;
+
+      // Don't substitute into a non-local goto, since it can then be
+      // treated as a jump to local label, e.g. in shorten_branches.
+      // ??? But this shouldn't be necessary.
+      if (use_insn->is_jump ()
+	  && find_reg_note (use_insn->rtl (), REG_NON_LOCAL_GOTO, NULL_RTX))
+	return false;
+
+      // Don't try to substitute into call arguments.
+      if (use_insn->is_call ())
+	{
+	  auto usage = CALL_INSN_FUNCTION_USAGE (use_insn->rtl ());
+	  for (rtx item = usage; item; item = XEXP (item, 1))
+	    {
+	      rtx x = XEXP (item, 0);
+	      if (GET_CODE (x) == USE
+		  && reg_overlap_mentioned_p (SET_DEST (set), XEXP (x, 0)))
+		return false;
+	    }
+	}
+
+      // We'll keep the uses in their original order, even if we move
+      // them relative to other instructions.  Make sure that non-final
+      // uses do not change any values that occur in the SET_SRC.
+      if (last_use && last_use->ebb () == use->ebb ())
+	{
+	  def_info *ultimate_def = look_through_degenerate_phi (def);
+	  if (insn_clobbers_resources (last_use->insn (),
+				       ultimate_def->insn ()->uses ()))
+	    return false;
+	}
+
+      last_use = use;
+    }
+
+  for (use_info *use : def->phi_uses ())
+    if (!use->phi ()->is_degenerate ()
+	|| !check_uses (use->phi (), set))
+      return false;
+
+  return true;
+}
+
+// Try to remove INSN by substituting a definition into all uses.
+// If the optimization moves any instructions before CURSOR, add those
+// instructions to the end of m_worklist.
+bool
+late_combine::combine_into_uses (insn_info *insn, insn_info *cursor)
+{
+  // For simplicity, don't try to handle sets of multiple hard registers.
+  // And for correctness, don't remove any assignments to the stack or
+  // frame pointers, since that would implicitly change the set of valid
+  // memory locations between this assignment and the next.
+  //
+  // Removing assignments to the hard frame pointer would invalidate
+  // backtraces.
+  set_info *def = single_set_info (insn);
+  if (!def
+      || !def->is_reg ()
+      || def->regno () == STACK_POINTER_REGNUM
+      || def->regno () == FRAME_POINTER_REGNUM
+      || def->regno () == HARD_FRAME_POINTER_REGNUM)
+    return false;
+
+  rtx set = optimizable_set (insn);
+  if (!set)
+    return false;
+
+  // For simplicity, don't try to handle subreg destinations.
+  rtx dest = SET_DEST (set);
+  if (!REG_P (dest) || def->regno () != REGNO (dest))
+    return false;
+
+  // Don't prolong the live ranges of allocatable hard registers, or put
+  // them into more complicated instructions.  Failing to prevent this
+  // could lead to spill failures, or at least to worse register allocation.
+  if (!reload_completed
+      && accesses_include_nonfixed_hard_registers (insn->uses ()))
+    return false;
+
+  if (!reload_completed && !check_register_pressure (insn, set))
+    return false;
+
+  if (!check_uses (def, set))
+    return false;
+
+  insn_combination combination (def, SET_DEST (set), SET_SRC (set));
+  if (!combination.run ())
+    return false;
+
+  for (auto *use_change : combination.use_changes ())
+    if (*use_change->insn () < *cursor)
+      m_worklist.safe_push (use_change->insn ());
+    else
+      break;
+  return true;
+}
+
+// Run the pass on function FN.
+unsigned int
+late_combine::execute (function *fn)
+{
+  // Initialization.
+  calculate_dominance_info (CDI_DOMINATORS);
+  df_analyze ();
+  crtl->ssa = new rtl_ssa::function_info (fn);
+  // Don't allow memory_operand to match volatile MEMs.
+  init_recog_no_volatile ();
+
+  insn_info *insn = *crtl->ssa->nondebug_insns ().begin ();
+  while (insn)
+    {
+      if (!insn->is_artificial ())
+	{
+	  insn_info *prev = insn->prev_nondebug_insn ();
+	  if (combine_into_uses (insn, prev))
+	    {
+	      // Any instructions that get added to the worklist were
+	      // previously after PREV.  Thus if we were able to move
+	      // an instruction X before PREV during one combination,
+	      // X cannot depend on any instructions that we move before
+	      // PREV during subsequent combinations.  This means that
+	      // the worklist should be free of backwards dependencies,
+	      // even if it isn't necessarily in RPO.
+	      for (unsigned int i = 0; i < m_worklist.length (); ++i)
+		combine_into_uses (m_worklist[i], prev);
+	      m_worklist.truncate (0);
+	      insn = prev;
+	    }
+	}
+      insn = insn->next_nondebug_insn ();
+    }
+
+  // Finalization.
+  if (crtl->ssa->perform_pending_updates ())
+    cleanup_cfg (0);
+  // Make recognizer allow volatile MEMs again.
+  init_recog ();
+  free_dominance_info (CDI_DOMINATORS);
+  return 0;
+}
+
+class pass_late_combine : public rtl_opt_pass
+{
+public:
+  pass_late_combine (gcc::context *ctxt)
+    : rtl_opt_pass (pass_data_late_combine, ctxt)
+  {}
+
+  // opt_pass methods:
+  opt_pass *clone () override { return new pass_late_combine (m_ctxt); }
+  bool gate (function *) override { return flag_late_combine_instructions; }
+  unsigned int execute (function *) override;
+};
+
+unsigned int
+pass_late_combine::execute (function *fn)
+{
+  return late_combine ().execute (fn);
+}
+
+} // end namespace
+
+// Create a new late-combine pass instance.
+
+rtl_opt_pass *
+make_pass_late_combine (gcc::context *ctxt)
+{
+  return new pass_late_combine (ctxt);
+}
diff --git a/gcc/passes.def b/gcc/passes.def
index 1cbbd413097..1d56a0d4799 100644
--- a/gcc/passes.def
+++ b/gcc/passes.def
@@ -492,6 +492,7 @@ along with GCC; see the file COPYING3.  If not see
       NEXT_PASS (pass_initialize_regs);
       NEXT_PASS (pass_ud_rtl_dce);
       NEXT_PASS (pass_combine);
+      NEXT_PASS (pass_late_combine);
       NEXT_PASS (pass_if_after_combine);
       NEXT_PASS (pass_jump_after_combine);
       NEXT_PASS (pass_partition_blocks);
@@ -511,6 +512,7 @@ along with GCC; see the file COPYING3.  If not see
       NEXT_PASS (pass_postreload);
       PUSH_INSERT_PASSES_WITHIN (pass_postreload)
 	  NEXT_PASS (pass_postreload_cse);
+	  NEXT_PASS (pass_late_combine);
 	  NEXT_PASS (pass_gcse2);
 	  NEXT_PASS (pass_split_after_reload);
 	  NEXT_PASS (pass_ree);
diff --git a/gcc/testsuite/gcc.dg/ira-shrinkwrap-prep-1.c b/gcc/testsuite/gcc.dg/ira-shrinkwrap-prep-1.c
index f290b9ccbdc..a95637abbe5 100644
--- a/gcc/testsuite/gcc.dg/ira-shrinkwrap-prep-1.c
+++ b/gcc/testsuite/gcc.dg/ira-shrinkwrap-prep-1.c
@@ -25,5 +25,5 @@ bar (long a)
 }
 
 /* { dg-final { scan-rtl-dump "Will split live ranges of parameters" "ira" } } */
-/* { dg-final { scan-rtl-dump "Split live-range of register" "ira" { xfail *-*-* } } } */
+/* { dg-final { scan-rtl-dump "Split live-range of register" "ira" { xfail { ! aarch64*-*-* } } } } */
 /* { dg-final { scan-rtl-dump "Performing shrink-wrapping" "pro_and_epilogue" { xfail powerpc*-*-* } } } */
diff --git a/gcc/testsuite/gcc.dg/ira-shrinkwrap-prep-2.c b/gcc/testsuite/gcc.dg/ira-shrinkwrap-prep-2.c
index 6212c95585d..0690e036eaa 100644
--- a/gcc/testsuite/gcc.dg/ira-shrinkwrap-prep-2.c
+++ b/gcc/testsuite/gcc.dg/ira-shrinkwrap-prep-2.c
@@ -30,6 +30,6 @@ bar (long a)
 }
 
 /* { dg-final { scan-rtl-dump "Will split live ranges of parameters" "ira" } } */
-/* { dg-final { scan-rtl-dump "Split live-range of register" "ira" { xfail *-*-* } } } */
+/* { dg-final { scan-rtl-dump "Split live-range of register" "ira" { xfail { ! aarch64*-*-* } } } } */
 /* XFAIL due to PR70681.  */
 /* { dg-final { scan-rtl-dump "Performing shrink-wrapping" "pro_and_epilogue" { xfail arm*-*-* powerpc*-*-* } } } */
diff --git a/gcc/testsuite/gcc.dg/stack-check-4.c b/gcc/testsuite/gcc.dg/stack-check-4.c
index b0c5c61972f..052d2abc2f1 100644
--- a/gcc/testsuite/gcc.dg/stack-check-4.c
+++ b/gcc/testsuite/gcc.dg/stack-check-4.c
@@ -20,7 +20,7 @@
    scan for.  We scan for both the positive and negative cases.  */
 
 /* { dg-do compile } */
-/* { dg-options "-O2 -fstack-clash-protection -fdump-rtl-pro_and_epilogue -fno-optimize-sibling-calls" } */
+/* { dg-options "-O2 -fstack-clash-protection -fdump-rtl-pro_and_epilogue -fno-optimize-sibling-calls -fno-shrink-wrap" } */
 /* { dg-require-effective-target supports_stack_clash_protection } */
 
 extern void arf (char *);
diff --git a/gcc/testsuite/gcc.target/aarch64/pr106594_1.c b/gcc/testsuite/gcc.target/aarch64/pr106594_1.c
new file mode 100644
index 00000000000..71bcafcb44f
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/pr106594_1.c
@@ -0,0 +1,20 @@
+/* { dg-options "-O2" } */
+
+extern const int constellation_64qam[64];
+
+void foo(int nbits,
+         const char *p_src,
+         int *p_dst) {
+
+  while (nbits > 0U) {
+    char first = *p_src++;
+
+    char index1 = ((first & 0x3) << 4) | (first >> 4);
+
+    *p_dst++ = constellation_64qam[index1];
+
+    nbits--;
+  }
+}
+
+/* { dg-final { scan-assembler {(?n)\tldr\t.*\[x[0-9]+, w[0-9]+, sxtw #?2\]} } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/cond_asrd_3.c b/gcc/testsuite/gcc.target/aarch64/sve/cond_asrd_3.c
index 0d620a30d5d..b537c6154a3 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/cond_asrd_3.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/cond_asrd_3.c
@@ -27,9 +27,9 @@ TEST_ALL (DEF_LOOP)
 /* { dg-final { scan-assembler-times {\tasrd\tz[0-9]+\.h, p[0-7]/m, z[0-9]+\.h, #4\n} 2 } } */
 /* { dg-final { scan-assembler-times {\tasrd\tz[0-9]+\.s, p[0-7]/m, z[0-9]+\.s, #4\n} 1 } } */
 
-/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.b, p[0-7]/z, z[0-9]+\.b\n} 3 { xfail *-*-* } } } */
-/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.h, p[0-7]/z, z[0-9]+\.h\n} 2 { xfail *-*-* } } } */
-/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.s, p[0-7]/z, z[0-9]+\.s\n} 1 { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.b, p[0-7]/z, z[0-9]+\.b\n} 3 } } */
+/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.h, p[0-7]/z, z[0-9]+\.h\n} 2 } } */
+/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.s, p[0-7]/z, z[0-9]+\.s\n} 1 } } */
 
-/* { dg-final { scan-assembler-not {\tmov\tz} { xfail *-*-* } } } */
-/* { dg-final { scan-assembler-not {\tsel\t} { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-not {\tmov\tz} } } */
+/* { dg-final { scan-assembler-not {\tsel\t} } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/cond_convert_3.c b/gcc/testsuite/gcc.target/aarch64/sve/cond_convert_3.c
index a294effd4a9..cff806c278d 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/cond_convert_3.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/cond_convert_3.c
@@ -30,11 +30,9 @@ TEST_ALL (DEF_LOOP)
 /* { dg-final { scan-assembler-times {\tscvtf\tz[0-9]+\.d, p[0-7]/m,} 1 } } */
 /* { dg-final { scan-assembler-times {\tucvtf\tz[0-9]+\.d, p[0-7]/m,} 1 } } */
 
-/* Really we should be able to use MOVPRFX /z here, but at the moment
-   we're relying on combine to merge a SEL and an arithmetic operation,
-   and the SEL doesn't allow the "false" value to be zero when the "true"
-   value is a register.  */
-/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+, z[0-9]+\n} 6 } } */
+/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.h, p[0-7]/z,} 2 } } */
+/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.s, p[0-7]/z,} 2 } } */
+/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.d, p[0-7]/z,} 2 } } */
 
 /* { dg-final { scan-assembler-not {\tmov\tz[^\n]*z} } } */
 /* { dg-final { scan-assembler-not {\tsel\t} } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/cond_convert_6.c b/gcc/testsuite/gcc.target/aarch64/sve/cond_convert_6.c
index 6541a2ea49d..abf0a2e832f 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/cond_convert_6.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/cond_convert_6.c
@@ -30,11 +30,9 @@ TEST_ALL (DEF_LOOP)
 /* { dg-final { scan-assembler-times {\tfcvtzs\tz[0-9]+\.d, p[0-7]/m,} 1 } } */
 /* { dg-final { scan-assembler-times {\tfcvtzu\tz[0-9]+\.d, p[0-7]/m,} 1 } } */
 
-/* Really we should be able to use MOVPRFX /z here, but at the moment
-   we're relying on combine to merge a SEL and an arithmetic operation,
-   and the SEL doesn't allow the "false" value to be zero when the "true"
-   value is a register.  */
-/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+, z[0-9]+\n} 6 } } */
+/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.h, p[0-7]/z,} 2 } } */
+/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.s, p[0-7]/z,} 2 } } */
+/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.d, p[0-7]/z,} 2 } } */
 
 /* { dg-final { scan-assembler-not {\tmov\tz[^\n]*z} } } */
 /* { dg-final { scan-assembler-not {\tsel\t} } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/cond_fabd_5.c b/gcc/testsuite/gcc.target/aarch64/sve/cond_fabd_5.c
index e66477b3bce..401201b315a 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/cond_fabd_5.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/cond_fabd_5.c
@@ -24,12 +24,9 @@ TEST_ALL (DEF_LOOP)
 /* { dg-final { scan-assembler-times {\tfabd\tz[0-9]+\.s, p[0-7]/m,} 1 } } */
 /* { dg-final { scan-assembler-times {\tfabd\tz[0-9]+\.d, p[0-7]/m,} 1 } } */
 
-/* Really we should be able to use MOVPRFX /Z here, but at the moment
-   we're relying on combine to merge a SEL and an arithmetic operation,
-   and the SEL doesn't allow zero operands.  */
-/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.h, p[0-7]/z, z[0-9]+\.h\n} 1 { xfail *-*-* } } } */
-/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.s, p[0-7]/z, z[0-9]+\.s\n} 1 { xfail *-*-* } } } */
-/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.d, p[0-7]/z, z[0-9]+\.d\n} 1 { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.h, p[0-7]/z, z[0-9]+\.h\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.s, p[0-7]/z, z[0-9]+\.s\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.d, p[0-7]/z, z[0-9]+\.d\n} 1 } } */
 
 /* { dg-final { scan-assembler-not {\tmov\tz[^,]*z} } } */
-/* { dg-final { scan-assembler-not {\tsel\t} { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-not {\tsel\t} } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/cond_unary_4.c b/gcc/testsuite/gcc.target/aarch64/sve/cond_unary_4.c
index a491f899088..cbb957bffa4 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/cond_unary_4.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/cond_unary_4.c
@@ -52,15 +52,10 @@ TEST_ALL (DEF_LOOP)
 /* { dg-final { scan-assembler-times {\tfneg\tz[0-9]+\.s, p[0-7]/m,} 1 } } */
 /* { dg-final { scan-assembler-times {\tfneg\tz[0-9]+\.d, p[0-7]/m,} 1 } } */
 
-/* Really we should be able to use MOVPRFX /z here, but at the moment
-   we're relying on combine to merge a SEL and an arithmetic operation,
-   and the SEL doesn't allow the "false" value to be zero when the "true"
-   value is a register.  */
-/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+, z[0-9]+\n} 7 } } */
-/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.b, p[0-7]/z, z[0-9]+\.b} 1 } } */
-/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.h, p[0-7]/z, z[0-9]+\.h} 2 } } */
-/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.s, p[0-7]/z, z[0-9]+\.s} 2 } } */
-/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.d, p[0-7]/z, z[0-9]+\.d} 2 } } */
+/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.b, p[0-7]/z, z[0-9]+\.b} 2 } } */
+/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.h, p[0-7]/z, z[0-9]+\.h} 4 } } */
+/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.s, p[0-7]/z, z[0-9]+\.s} 4 } } */
+/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.d, p[0-7]/z, z[0-9]+\.d} 4 } } */
 
 /* { dg-final { scan-assembler-not {\tmov\tz[^\n]*z} } } */
 /* { dg-final { scan-assembler-not {\tsel\t} } } */
diff --git a/gcc/tree-pass.h b/gcc/tree-pass.h
index 29267589eeb..4bedaa40aa2 100644
--- a/gcc/tree-pass.h
+++ b/gcc/tree-pass.h
@@ -614,6 +614,7 @@ extern rtl_opt_pass *make_pass_branch_prob (gcc::context *ctxt);
 extern rtl_opt_pass *make_pass_value_profile_transformations (gcc::context
 							      *ctxt);
 extern rtl_opt_pass *make_pass_postreload_cse (gcc::context *ctxt);
+extern rtl_opt_pass *make_pass_late_combine (gcc::context *ctxt);
 extern rtl_opt_pass *make_pass_gcse2 (gcc::context *ctxt);
 extern rtl_opt_pass *make_pass_split_after_reload (gcc::context *ctxt);
 extern rtl_opt_pass *make_pass_thread_prologue_and_epilogue (gcc::context
							     *ctxt);
On 1/5/24 10:35, Richard Sandiford wrote:
> Jeff Law <jeffreyalaw@gmail.com> writes:
>> On 10/24/23 12:49, Richard Sandiford wrote:
>>> This patch adds a combine pass that runs late in the pipeline.
>>> There are two instances: one between combine and split1, and one
>>> after postreload.
>> So have you done any investigation on cases caught by your new pass
>> between combine and split1 to characterize them?  In particular do
>> they point at solvable problems in combine?  Or do you foresee this
>> subsuming the old combiner pass at some point in the distant future?
>
> Examples like the PR are the main motivation for the pre-RA pass.
> There we had an extension that could be combined into an address,
> but no longer was after GCC 13.
>
> The PR itself could in principle be fixed in combine (various
> approaches were suggested, but not accepted).  But the same problem
> applies to multiple uses of extensions.  fwprop can't handle it because
> individual propagations are not a win in isolation.  And combine has
> a limit of combining 4 insns (with a maximum of 2 output insns, IIRC).
> So I don't think either of the existing passes scales to the general
> case.
Oh, that discussion :(

>
>> rth and I sketched out an SSA based RTL combine at some point in the
>> deep past.  The key goal we were trying to achieve was combining
>> across blocks.  We didn't have a functioning RTL SSA form at the
>> time, so it never went to any implementation work.  It looks like
>> yours would solve the class of problems rth and I were considering.
>
> Yeah, I do see some cases where combining across blocks helps.
> The case above is one example of that.  Another is:
Great.  The cases I think rth and I were looking at were inspired by a
talk at one of the early Cauldrons -- whichever one we had in
California.  Someone did a talk on cross-block combinations, mostly
motivated by missed combinations on ARM.

>
>>> The patch therefore enables the pass by default only on AArch64.
>>> However, I did test the patch with it enabled on x86_64-linux-gnu
>>> as well, which was useful for debugging.
>>>
>>> Bootstrapped & regression-tested on aarch64-linux-gnu and
>>> x86_64-linux-gnu (as posted, with no regressions, and with the
>>> pass enabled by default, with some gcc.target/i386 regressions).
>>> OK to install?
>> I'm going to adjust this slightly so that it's enabled across the
>> board and throw it into the tester tomorrow (tester is busy tonight).
>> Even if we make it opt-in on a per-port basis, the alternate target
>> testing does seem to find stuff that needs fixing ;-)
>
> Thanks!  As per our off-list discussion, the cris-elf failures showed
> up a bug in the handling of call arguments.  Here's an updated version
> with that fixed.
Perfect.  I'll spin that in the tester overnight.  Hopefully that fixes
the other ports that failed to build (as opposed to the testsuite
failures, which I think you've covered as not an issue with the new
combine pass, but which are instead port issues).

>
> Yeah.  If I'd posted this earlier in stage 1 (rather than October),
> I might have tried teaching shorten_branches how to handle this.
> But it felt like it could be a bit of a rabbit hole at this stage.
>
>> Nothing jumps out as horribly wrong.  You might want/need to reject
>> frame related insns in optimizable_set, though I guess if the dwarf2
>> writer isn't complaining, then we haven't mucked things up too bad.
>
> Ah, yeah, I wondered about that.  But I suppose most prologue insns
> don't really create combination opportunities, since most of them
> either set up the stack (which we wouldn't combine anyway) or sink
> incoming data.  So if there are cases where this is a problem, it
> might unfortunately be a case of "see what breaks".
The other issue that's been in the back of my mind is costing.  But I
think the model here is combine without regards to cost.

Let's let the tester chew on the updated version overnight, see what
else may pop out, but barring any big surprises, I think we're good
to go.

jeff
Jeff Law <jeffreyalaw@gmail.com> writes:
> The other issue that's been in the back of my mind is costing.  But I
> think the model here is combine without regards to cost.

No, it does take costing into account.  For size, it's the usual
"sum up the before and after insn costs and see which one is lower".
For speed, the costs are weighted by execution frequency, so e.g.
two insns of cost 4 in the same block can be combined into a single
instruction of cost 8, but a hoisted invariant can only be combined
into a loop body instruction if the loop body instruction's cost
doesn't increase significantly.

This is done by rtl_ssa::changes_are_worthwhile.

Thanks,
Richard
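In other words, the rule can be sketched roughly as follows.  This is
an illustrative sketch with made-up names, not the rtl-ssa code itself
(the real logic in rtl_ssa::changes_are_worthwhile works on
insn_change records rather than plain cost/frequency pairs):

struct weighted_insn { int cost; int freq; };

/* Sum each insn's cost weighted by its block's execution frequency;
   a change is worthwhile if the weighted cost does not increase.
   For size optimization the weights would all be 1.  */
static bool
worthwhile_p (const weighted_insn *old_insns, int n_old,
	      const weighted_insn *new_insns, int n_new)
{
  long long old_sum = 0, new_sum = 0;
  for (int i = 0; i < n_old; ++i)
    old_sum += (long long) old_insns[i].cost * old_insns[i].freq;
  for (int i = 0; i < n_new; ++i)
    new_sum += (long long) new_insns[i].cost * new_insns[i].freq;
  /* E.g. two cost-4 insns at frequency f can become one cost-8 insn
     (4f + 4f >= 8f), but folding a rarely-executed hoisted invariant
     into a loop-body insn must not raise the body insn's cost much,
     since the body frequency dominates the sum.  */
  return new_sum <= old_sum;
}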
On 1/8/24 04:52, Richard Sandiford wrote: > Jeff Law <jeffreyalaw@gmail.com> writes: >> The other issue that's been in the back of my mind is costing. But I >> think the model here is combine without regards to cost. > > No, it does take costing into account. For size, it's the usual > "sum up the before and after insn costs and see which one is lower". > For speed, the costs are weighted by execution frequency, so e.g. > two insns of cost 4 in the same block can be combined into a single > instruction of cost 8, but a hoisted invariant can only be combined > into a loop body instruction if the loop body instruction's cost > doesn't increase significantly. > > This is done by rtl_ssa::changes_are_worthwhile. You're absolutely correct. My bad. Interesting that's exactly where we do have a notable concern. If you remember, there were a few ports that failed to build newlib/libgcc that we initially ignored. I went back and looked at one (arc-elf). What appears to be happening for arc-elf is that we're testing to see if the change is profitable. On arc-elf the costing model is highly dependent on the length of the insns. We've got a very reasonable-looking insn: > (insn 3674 753 2851 98 (set (reg/v:SI 18 r18 [orig:300 inex ] [300]) > (ashift:SI (reg:SI 27 fp [548]) > (const_int 4 [0x4]))) "../../../..//newlib-cygwin/newlib/libc/stdlib/gdtoa-gdtoa.c":437:10 120 {*ashlsi3_insn} > (nil)) We call rtl_ssa::changes_are_worthwhile -> insn_cost -> arc_insn_cost -> get_attr_length -> get_attr_length_1 -> insn_default_length insn_default_length grubs around looking at the operands via recog_data, which appears to be stale: > (gdb) p debug_rtx(recog_data.operand[0]) > (reg/v:SI 18 r18 [orig:300 inex ] [300]) > $4 = void > (gdb) p debug_rtx(recog_data.operand[1]) > (reg/v:SI 3 r3 [orig:300 inex ] [300]) > $5 = void > (gdb) p debug_rtx(recog_data.operand[2]) > > Program received signal SIGSEGV, Segmentation fault. > 0x0000000001432955 in rtx_writer::print_rtx (this=0x7fffffffe0e0, in_rtx=0xabababababababab) at /home/jlaw/test/gcc/gcc/print-rtl.cc:809 > 809 else if (GET_CODE (in_rtx) > NUM_RTX_CODE) Note the 0xabab.... That was accessing operand #2, which should have been (const_int 4). Sure enough, if I force re-recognition and then look at the recog_data, I get the right values. After LRA we have: > (insn 753 2434 3674 98 (set (reg/v:SI 3 r3 [orig:300 inex ] [300]) > (ashift:SI (reg:SI 27 fp [548]) > (const_int 4 [0x4]))) "../../../..//newlib-cygwin/newlib/libc/stdlib/gdtoa-gdtoa.c":437:10 120 {*ashlsi3_insn} > (nil)) > (insn 3674 753 2851 98 (set (reg/v:SI 18 r18 [orig:300 inex ] [300]) > (reg/v:SI 3 r3 [orig:300 inex ] [300])) "../../../..//newlib-cygwin/newlib/libc/stdlib/gdtoa-gdtoa.c":437:10 3 {*movsi_insn} > (nil)) In the emergency dump in late_combine2 (so cleanup hasn't been done): > (insn 753 2434 3674 98 (set (reg/v:SI 3 r3 [orig:300 inex ] [300]) > (ashift:SI (reg:SI 27 fp [548]) > (const_int 4 [0x4]))) "../../../..//newlib-cygwin/newlib/libc/stdlib/gdtoa-gdtoa.c":437:10 120 {*ashlsi3_insn} > (nil)) > (insn 3674 753 2851 98 (set (reg/v:SI 18 r18 [orig:300 inex ] [300]) > (ashift:SI (reg:SI 27 fp [548]) > (const_int 4 [0x4]))) "../../../..//newlib-cygwin/newlib/libc/stdlib/gdtoa-gdtoa.c":437:10 120 {*ashlsi3_insn} > (nil)) Which brings us to the question. If we change the form of an insn, then ask for its cost, don't we need to make sure the insn is re-recognized, as the costing function may do things like query the insn's length, which would use cached recog_data? jeff
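For concreteness, "forcing re-recognition" here amounts to something like the following debugging hack (a sketch, not committed code; insn stands for insn 3674's rtx_insn):

    /* Sketch: throw away the cached recognition and extraction so the
       costing path sees the insn's current pattern instead of stale
       recog_data contents.  */
    INSN_CODE (insn) = -1;           /* forget the cached insn code  */
    if (recog_memoized (insn) >= 0)  /* re-recognize the new pattern */
      extract_insn (insn);           /* refresh recog_data.operand[] */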
Jeff Law <jlaw@ventanamicro.com> writes: > On 1/8/24 04:52, Richard Sandiford wrote: >> Jeff Law <jeffreyalaw@gmail.com> writes: >>> The other issue that's been in the back of my mind is costing. But I >>> think the model here is combine without regards to cost. >> >> No, it does take costing into account. For size, it's the usual >> "sum up the before and after insn costs and see which one is lower". >> For speed, the costs are weighted by execution frequency, so e.g. >> two insns of cost 4 in the same block can be combined into a single >> instruction of cost 8, but a hoisted invariant can only be combined >> into a loop body instruction if the loop body instruction's cost >> doesn't increase significantly. >> >> This is done by rtl_ssa::changes_are_worthwhile. > You're absolutely correct. My bad. > > Interesting that's exactly where we do have a notable concern. Gah. > If you remember, there were a few ports that failed to build > newlib/libgcc that we initially ignored. I went back and looked at one > (arc-elf). > > What appears to be happening for arc-elf is we're testing to see if the > change is profitable. On arc-elf the costing model is highly dependent > on the length of the insns. > > We've got a very reasonable looking insn: > >> (insn 3674 753 2851 98 (set (reg/v:SI 18 r18 [orig:300 inex ] [300]) >> (ashift:SI (reg:SI 27 fp [548]) >> (const_int 4 [0x4]))) "../../../..//newlib-cygwin/newlib/libc/stdlib/gdtoa-gdtoa.c":437:10 120 {*ashlsi3_insn} >> (nil)) > > We call rtl_ssa::changes_are_profitable -> insn_cost -> arc_insn_cost -> > get_attr_length -> get_attr_length_1 -> insn_default_length > > insn_default_length grubs around looking at the operands via recog_data > which appears to be stale: > > > >> (gdb) p debug_rtx(recog_data.operand[0]) >> (reg/v:SI 18 r18 [orig:300 inex ] [300]) >> $4 = void >> (gdb) p debug_rtx(recog_data.operand[1]) >> (reg/v:SI 3 r3 [orig:300 inex ] [300]) >> $5 = void >> (gdb) p debug_rtx(recog_data.operand[2]) >> >> Program received signal SIGSEGV, Segmentation fault. >> 0x0000000001432955 in rtx_writer::print_rtx (this=0x7fffffffe0e0, in_rtx=0xabababababababab) at /home/jlaw/test/gcc/gcc/print-rtl.cc:809 >> 809 else if (GET_CODE (in_rtx) > NUM_RTX_CODE) > > Note the 0xabab.... That was accessing operand #2, which should have > been (const_int 4). > > Sure enough if I force re-recognition then look at the recog_data I get > the right values. > > After LRA we have: > >> (insn 753 2434 3674 98 (set (reg/v:SI 3 r3 [orig:300 inex ] [300]) >> (ashift:SI (reg:SI 27 fp [548]) >> (const_int 4 [0x4]))) "../../../..//newlib-cygwin/newlib/libc/stdlib/gdtoa-gdtoa.c":437:10 120 {*ashlsi3_insn} >> (nil)) >> (insn 3674 753 2851 98 (set (reg/v:SI 18 r18 [orig:300 inex ] [300]) >> (reg/v:SI 3 r3 [orig:300 inex ] [300])) "../../../..//newlib-cygwin/newlib/libc/stdlib/gdtoa-gdtoa.c":437:10 3 {*movsi_insn} >> (nil)) > > In the emergency dump in late_combine2 (so cleanup hasn't been done): > >> (insn 753 2434 3674 98 (set (reg/v:SI 3 r3 [orig:300 inex ] [300]) >> (ashift:SI (reg:SI 27 fp [548]) >> (const_int 4 [0x4]))) "../../../..//newlib-cygwin/newlib/libc/stdlib/gdtoa-gdtoa.c":437:10 120 {*ashlsi3_insn} >> (nil)) >> (insn 3674 753 2851 98 (set (reg/v:SI 18 r18 [orig:300 inex ] [300]) >> (ashift:SI (reg:SI 27 fp [548]) >> (const_int 4 [0x4]))) "../../../..//newlib-cygwin/newlib/libc/stdlib/gdtoa-gdtoa.c":437:10 120 {*ashlsi3_insn} >> (nil)) > > > Which brings us to the question. 
If we change the form of an insn, then > ask for its cost, don't we need to make sure the insn is re-recognized > as the costing function may do things like query the insn's length which > would use cached recog_data? Yeah, this only happens once we've verified that the new instruction is valid. And it looks from the emergency dump above that the insn code has been correctly updated to *ashlsi3_insn. This is a bit of a hopeful stab, but is the problem that recog_data still had the previous contents of insn 3674, and so extract_insn_cached wrongly thought that it doesn't need to do anything? If so, does something like: diff --git a/gcc/recog.cc b/gcc/recog.cc index a6799e3f5e6..8ba63c78179 100644 --- a/gcc/recog.cc +++ b/gcc/recog.cc @@ -267,6 +267,8 @@ validate_change_1 (rtx object, rtx *loc, rtx new_rtx, bool in_group, case invalid. */ changes[num_changes].old_code = INSN_CODE (object); INSN_CODE (object) = -1; + if (recog_data.insn == object) + recog_data.insn = nullptr; } num_changes++; fix it? I suppose there's an argument that this belongs in whatever code sets INSN_CODE to a new nonnegative value (so recog_level2 for RTL-SSA). But doing it in validate_change_1 seems more robust, since anything calling that function is considering changing the insn code. Thanks for debugging the problem. Richard
On 1/8/24 09:59, Richard Sandiford wrote: > > This is a bit of a hopeful stab, but is the problem that recog_data still > had the previous contents of insn 3674, and so extract_insn_cached wrongly > thought that it doesn't need to do anything? If so, does something like: > > diff --git a/gcc/recog.cc b/gcc/recog.cc > index a6799e3f5e6..8ba63c78179 100644 > --- a/gcc/recog.cc > +++ b/gcc/recog.cc > @@ -267,6 +267,8 @@ validate_change_1 (rtx object, rtx *loc, rtx new_rtx, bool in_group, > case invalid. */ > changes[num_changes].old_code = INSN_CODE (object); > INSN_CODE (object) = -1; > + if (recog_data.insn == object) > + recog_data.insn = nullptr; > } > > num_changes++; > > fix it? I suppose there's an argument that this belongs in whatever code > sets INSN_CODE to a new nonnegative value (so recog_level2 for RTL-SSA). > But doing it in validate_change_1 seems more robust, since anything > calling that function is considering changing the insn code. Nope, doesn't help at all. I'd briefly put a reset of the INSN_CODE and a call to recog_memoized in the costing path of rtl-ssa to see if that would allow things to move forward, but it failed miserably. I'll pass along the .i file separately. Hopefully it'll fail for you and you can debug. But given failure depends on stale bits in recog_data, it may not. Jeff
Jeff Law <jeffreyalaw@gmail.com> writes: > On 1/8/24 09:59, Richard Sandiford wrote: >> This is a bit of a hopeful stab, but is the problem that recog_data still >> had the previous contents of insn 3674, and so extract_insn_cached wrongly >> thought that it doesn't need to do anything? If so, does something like: >> >> diff --git a/gcc/recog.cc b/gcc/recog.cc >> index a6799e3f5e6..8ba63c78179 100644 >> --- a/gcc/recog.cc >> +++ b/gcc/recog.cc >> @@ -267,6 +267,8 @@ validate_change_1 (rtx object, rtx *loc, rtx new_rtx, bool in_group, >> case invalid. */ >> changes[num_changes].old_code = INSN_CODE (object); >> INSN_CODE (object) = -1; >> + if (recog_data.insn == object) >> + recog_data.insn = nullptr; >> } >> >> num_changes++; >> >> fix it? I suppose there's an argument that this belongs in whatever code >> sets INSN_CODE to a new nonnegative value (so recog_level2 for RTL-SSA). >> But doing it in validate_change_1 seems more robust, since anything >> calling that function is considering changing the insn code. > Nope, doesn't help at all. Yeah, in hindsight it was a dull guess. recog resets recog_data.insn itself, so doing it here wasn't likely to help. > I'd briefly put a reset of the INSN_CODE > and a call to recog_memoized in the costing path of rtl-ssa to see if > that would allow things to move forward, but it failed miserably. > > I'll pass along the .i file separately. Hopefully it'll fail for you > and you can debug. But given failure depends on stale bits in > recog_data, it may not. Thanks. That led me to the following, which seems a bit more plausible than my first attempt. I'll test it on aarch64-linux-gnu and x86_64-linux-gnu. Does it look OK? Richard insn_info::calculate_cost computes the costs of unchanged insns lazily, so that we don't waste time costing instructions that we never try to change. It therefore has to revert any in-progress changes, cost the original instruction, and then reapply the in-progress changes. However, doing that temporarily changes the INSN_CODEs, and so temporarily invalidates any information cached about the insn. This means that insn_cost can end up looking at stale data, or can cache data that becomes stale once the in-progress changes are reapplied. This could in principle happen for any use of temporarily_undo_changes and redo_changes. Those functions in turn share a common subroutine, swap_change, so that seems like the best place to fix this. gcc/ * recog.cc (swap_change): Invalidate the cached recog_data if it describes an insn that is being changed. --- gcc/recog.cc | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/gcc/recog.cc b/gcc/recog.cc index a6799e3f5e6..56370e40e01 100644 --- a/gcc/recog.cc +++ b/gcc/recog.cc @@ -614,7 +614,11 @@ swap_change (int num) else std::swap (*changes[num].loc, changes[num].old); if (changes[num].object && !MEM_P (changes[num].object)) - std::swap (INSN_CODE (changes[num].object), changes[num].old_code); + { + std::swap (INSN_CODE (changes[num].object), changes[num].old_code); + if (recog_data.insn == changes[num].object) + recog_data.insn = nullptr; + } } /* Temporarily undo all the changes numbered NUM and up, with a view
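For context on why clearing recog_data.insn is sufficient: the cache check at the top of extract_insn_cached is roughly the following (paraphrased from gcc/recog.cc; see the source for the exact form):

    /* Paraphrase: the cached extraction is reused whenever recog_data
       still claims to describe this insn and the insn is recognized.
       Clearing recog_data.insn in swap_change therefore forces a fresh
       extract_insn the next time the insn is costed.  */
    void
    extract_insn_cached (rtx_insn *insn)
    {
      if (recog_data.insn == insn && INSN_CODE (insn) >= 0)
        return;
      extract_insn (insn);
      recog_data.insn = insn;
    }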
On 1/8/24 12:11, Richard Sandiford wrote: > > Thanks. That led me to the following, which seems a bit more plausible > than my first attempt. I'll test it on aarch64-linux-gnu and > x86_64-linux-gnu. Does it look OK? It looks reasonable to me. I'm going to send another failure (ICE in finalize_new_accesses on a different target) separately. Jeff
Just a note that, following discussion on IRC, I'll pull this for GCC 14 and resubmit for GCC 15. There was also pushback on IRC about making the pass opt-in. Enabling it for x86_64 would mean fixing RPAD to use a representation that is more robust against recombination, but as you can imagine, it's kind-of difficult for me to justify spending significant time fixing an issue in the x86_64 port. Jeff's testing suggested that there are also latent issues in the older, less maintained ports. So to get an idea for expectations: would it be a requirement that a GCC 15 submission is enabled unconditionally and all known issues in the ports fixed? Thanks, Richard
On Wed, 10 Jan 2024, Richard Sandiford wrote: > Just a note that, following discussion on IRC, I'll pull this for > GCC 14 and resubmit for GCC 15. > > There was also pushback on IRC about making the pass opt-in. > Enabling it for x86_64 would mean fixing RPAD to use a representation > that is more robust against recombination, but as you can imagine, it's > kind-of difficult for me to justify spending significant time fixing an > issue in the x86_64 port. Jeff's testing suggested that there are also > latent issues in the older, less maintained ports. > > So to get an idea for expectations: would it be a requirement that a > GCC 15 submission is enabled unconditionally and all known issues in > the ports fixed? Can you open a bugreport with the issue in RPAD, maybe outlining what would need to be done? I think x86 maintainers could opt to disable the pass - so it would be opt-out. It's reasonable to expect them to fix the backend given there's nothing really wrong with the new pass, it just does something that wasn't done before at that point? Richard.
On 1/10/24 06:35, Richard Biener wrote: > > I think x86 maintainers could opt to disable the pass - so it would > be opt-out. It's reasonable to expect them to fix the backend given > there's nothing really wrong with the new pass, it just does > something that wasn't done before at that point? That's been both Richard S and my experience so far -- it's exposing latent target issues (which we're pushing forward as independent fixes) as well as a few latent issues in various generic RTL bits (which I'll leave to Richard S to submit). Nothing major though. I'm a bit disappointed it's not going forward for gcc-14, but understand and will support the decision. Jeff
On 1/10/24 06:01, Richard Sandiford wrote: > > So to get an idea for expectations: would it be a requirement that a > GCC 15 submission is enabled unconditionally and all known issues in > the ports fixed? I don't think we need to fix those latent port issues as a hard requirement. I try to balance the complexity of the fix, overall state of the port, value of having the port test the feature, etc. So something like the mn103 or epiphany, where the fix was clear after a bit of debugging, we just fix. Others, like the long-standing c6x faults or the rl78 assembler complaints, which we're also seeing without the late-combine work and which don't have clearly identifiable fixes, I'd say we leave to the port maintainers (if any) to address. So I tend to want to understand a regression reported by the tester, then we determine a reasonable course of action. I don't think that a no-regression policy on all those old ports is a reasonable requirement. Jeff
On Wed, Oct 25, 2023 at 2:49 AM Richard Sandiford <richard.sandiford@arm.com> wrote: > > This patch adds a combine pass that runs late in the pipeline. > There are two instances: one between combine and split1, and one > after postreload. > > The pass currently has a single objective: remove definitions by > substituting into all uses. The pre-RA version tries to restrict > itself to cases that are likely to have a neutral or beneficial > effect on register pressure. > > The patch fixes PR106594. It also fixes a few FAILs and XFAILs > in the aarch64 test results, mostly due to making proper use of > MOVPRFX in cases where we didn't previously. I hope it would > also help with Robin's vec_duplicate testcase, although the > pressure heuristic might need tweaking for that case. > > This is just a first step. I'm hoping that the pass could be > used for other combine-related optimisations in future. In particular, > the post-RA version doesn't need to restrict itself to cases where all > uses are substitutable, since it doesn't have to worry about register > pressure. If we did that, and if we extended it to handle multi-register > REGs, the pass might be a viable replacement for regcprop, which in > turn might reduce the cost of having a post-RA instance of the new pass. > > I've run an assembly comparison with one target per CPU directory, > and it seems to be a win for all targets except nvptx (which is hard > to measure, being a higher-level asm). The biggest winner seemed > to be AVR. > > I'd originally hoped to enable the pass by default at -O2 and above > on all targets. But in the end, I don't think that's possible, > because it interacts badly with x86's STV and partial register > dependency passes. > > For example, gcc.target/i386/minmax-6.c tests whether the code > compiles without any spilling. The RTL created by STV contains: > > (insn 33 31 3 2 (set (subreg:V4SI (reg:SI 120) 0) > (vec_merge:V4SI (vec_duplicate:V4SI (reg:SI 116)) > (const_vector:V4SI [ > (const_int 0 [0]) repeated x4 > ]) > (const_int 1 [0x1]))) -1 > (nil)) > (insn 3 33 34 2 (set (subreg:V4SI (reg:SI 118) 0) > (subreg:V4SI (reg:SI 120) 0)) {movv4si_internal} > (expr_list:REG_DEAD (reg:SI 120) > (nil))) > (insn 34 3 32 2 (set (reg/v:SI 108 [ y ]) > (reg:SI 118)) -1 > (nil)) > > and it's crucial for the test that reg 108 is kept, rather than > propagated into uses. As things stand, 118 can be allocated > a vector register and 108 a scalar register. If 108 is propagated, > there will be scalar and vector uses of 118, and so it will be > spilled to memory. > > That one could be solved by running STV2 later. But RPAD is > a bigger problem. In gcc.target/i386/pr87007-5.c, RPAD converts: > > (insn 27 26 28 6 (set (reg:DF 100 [ _15 ]) > (sqrt:DF (mem/c:DF (symbol_ref:DI ("d2"))))) {*sqrtdf2_sse} > (nil)) > > into: > > (insn 45 26 44 6 (set (reg:V4SF 108) > (const_vector:V4SF [ > (const_double:SF 0.0 [0x0.0p+0]) repeated x4 > ])) -1 > (nil)) > (insn 44 45 27 6 (set (reg:V2DF 109) > (vec_merge:V2DF (vec_duplicate:V2DF (sqrt:DF (mem/c:DF (symbol_ref:DI ("d2"))))) > (subreg:V2DF (reg:V4SF 108) 0) > (const_int 1 [0x1]))) -1 > (nil)) > (insn 27 44 28 6 (set (reg:DF 100 [ _15 ]) > (subreg:DF (reg:V2DF 109) 0)) {*movdf_internal} > (nil)) > > But both the pre-RA and post-RA passes are able to combine these > instructions back to the original form. One possible solution is to implement target_insn_cost for x86, and return a high cost for those insns when get_attr_avx_partial_xmm_update(insn) == AVX_PARTIAL_XMM_UPDATE_TRUE. 
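Roughly, such a hook might look like the sketch below (untested; the cost threshold and the fallback to pattern_cost are illustrative choices, and the attribute query is the one named above):

    /* Sketch of a possible TARGET_INSN_COST implementation for i386:
       make the insns RPAD creates to break partial-XMM-register
       dependencies look expensive, so that late-combine does not fold
       them back into the scalar forms.  */
    static int
    ix86_insn_cost (rtx_insn *insn, bool speed)
    {
      if (recog_memoized (insn) >= 0
          && get_attr_avx_partial_xmm_update (insn)
             == AVX_PARTIAL_XMM_UPDATE_TRUE)
        return COSTS_N_INSNS (100);  /* illustrative "very expensive" */
      return pattern_cost (PATTERN (insn), speed);
    }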
I'll try with that when the patch lands on the trunk. > > The patch therefore enables the pass by default only on AArch64. > However, I did test the patch with it enabled on x86_64-linux-gnu > as well, which was useful for debugging. > > Bootstrapped & regression-tested on aarch64-linux-gnu and > x86_64-linux-gnu (as posted, with no regressions, and with the > pass enabled by default, with some gcc.target/i386 regressions). > OK to install? > > Richard > > > gcc/ > PR rtl-optimization/106594 > * Makefile.in (OBJS): Add late-combine.o. > * common.opt (flate-combine-instructions): New option. > * doc/invoke.texi: Document it. > * common/config/aarch64/aarch64-common.cc: Enable it by default > at -O2 and above. > * tree-pass.h (make_pass_late_combine): Declare. > * late-combine.cc: New file. > * passes.def: Add two instances of late_combine. > > gcc/testsuite/ > PR rtl-optimization/106594 > * gcc.dg/ira-shrinkwrap-prep-1.c: Restrict XFAIL to non-aarch64 > targets. > * gcc.dg/ira-shrinkwrap-prep-2.c: Likewise. > * gcc.dg/stack-check-4.c: Add -fno-shrink-wrap. > * gcc.target/aarch64/sve/cond_asrd_3.c: Remove XFAILs. > * gcc.target/aarch64/sve/cond_convert_3.c: Likewise. > * gcc.target/aarch64/sve/cond_fabd_5.c: Likewise. > * gcc.target/aarch64/sve/cond_convert_6.c: Expect the MOVPRFX /Zs > described in the comment. > * gcc.target/aarch64/sve/cond_unary_4.c: Likewise. > * gcc.target/aarch64/pr106594_1.c: New test. > --- > gcc/Makefile.in | 1 + > gcc/common.opt | 5 + > gcc/common/config/aarch64/aarch64-common.cc | 1 + > gcc/doc/invoke.texi | 11 +- > gcc/late-combine.cc | 718 ++++++++++++++++++ > gcc/passes.def | 2 + > gcc/testsuite/gcc.dg/ira-shrinkwrap-prep-1.c | 2 +- > gcc/testsuite/gcc.dg/ira-shrinkwrap-prep-2.c | 2 +- > gcc/testsuite/gcc.dg/stack-check-4.c | 2 +- > gcc/testsuite/gcc.target/aarch64/pr106594_1.c | 20 + > .../gcc.target/aarch64/sve/cond_asrd_3.c | 10 +- > .../gcc.target/aarch64/sve/cond_convert_3.c | 8 +- > .../gcc.target/aarch64/sve/cond_convert_6.c | 8 +- > .../gcc.target/aarch64/sve/cond_fabd_5.c | 11 +- > .../gcc.target/aarch64/sve/cond_unary_4.c | 13 +- > gcc/tree-pass.h | 1 + > 16 files changed, 780 insertions(+), 35 deletions(-) > create mode 100644 gcc/late-combine.cc > create mode 100644 gcc/testsuite/gcc.target/aarch64/pr106594_1.c > > diff --git a/gcc/Makefile.in b/gcc/Makefile.in > index 91d6bfbea4d..b43fd6e8df1 100644 > --- a/gcc/Makefile.in > +++ b/gcc/Makefile.in > @@ -1554,6 +1554,7 @@ OBJS = \ > ira-lives.o \ > jump.o \ > langhooks.o \ > + late-combine.o \ > lcm.o \ > lists.o \ > loop-doloop.o \ > diff --git a/gcc/common.opt b/gcc/common.opt > index 1cf3bdd3b51..306b11f91c7 100644 > --- a/gcc/common.opt > +++ b/gcc/common.opt > @@ -1775,6 +1775,11 @@ Common Var(flag_large_source_files) Init(0) > Improve GCC's ability to track column numbers in large source files, > at the expense of slower compilation. > > +flate-combine-instructions > +Common Var(flag_late_combine_instructions) Optimization Init(0) > +Run two instruction combination passes late in the pass pipeline; > +one before register allocation and one after. > + > floop-parallelize-all > Common Var(flag_loop_parallelize_all) Optimization > Mark all loops as parallel. 
> diff --git a/gcc/common/config/aarch64/aarch64-common.cc b/gcc/common/config/aarch64/aarch64-common.cc > index 20bc4e1291b..05647e0c93a 100644 > --- a/gcc/common/config/aarch64/aarch64-common.cc > +++ b/gcc/common/config/aarch64/aarch64-common.cc > @@ -55,6 +55,7 @@ static const struct default_options aarch_option_optimization_table[] = > { OPT_LEVELS_1_PLUS, OPT_fsched_pressure, NULL, 1 }, > /* Enable redundant extension instructions removal at -O2 and higher. */ > { OPT_LEVELS_2_PLUS, OPT_free, NULL, 1 }, > + { OPT_LEVELS_2_PLUS, OPT_flate_combine_instructions, NULL, 1 }, > #if (TARGET_DEFAULT_ASYNC_UNWIND_TABLES == 1) > { OPT_LEVELS_ALL, OPT_fasynchronous_unwind_tables, NULL, 1 }, > { OPT_LEVELS_ALL, OPT_funwind_tables, NULL, 1}, > diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi > index 5a9284d635c..d0576ac97cf 100644 > --- a/gcc/doc/invoke.texi > +++ b/gcc/doc/invoke.texi > @@ -562,7 +562,7 @@ Objective-C and Objective-C++ Dialects}. > -fipa-bit-cp -fipa-vrp -fipa-pta -fipa-profile -fipa-pure-const > -fipa-reference -fipa-reference-addressable > -fipa-stack-alignment -fipa-icf -fira-algorithm=@var{algorithm} > --flive-patching=@var{level} > +-flate-combine-instructions -flive-patching=@var{level} > -fira-region=@var{region} -fira-hoist-pressure > -fira-loop-pressure -fno-ira-share-save-slots > -fno-ira-share-spill-slots > @@ -13201,6 +13201,15 @@ equivalences that are found only by GCC and equivalences found only by Gold. > > This flag is enabled by default at @option{-O2} and @option{-Os}. > > +@opindex flate-combine-instructions > +@item -flate-combine-instructions > +Enable two instruction combination passes that run relatively late in the > +compilation process. One of the passes runs before register allocation and > +the other after register allocation. The main aim of the passes is to > +substitute definitions into all uses. > + > +Some targets enable this flag by default at @option{-O2} and @option{-Os}. > + > @opindex flive-patching > @item -flive-patching=@var{level} > Control GCC's optimizations to produce output suitable for live-patching. > diff --git a/gcc/late-combine.cc b/gcc/late-combine.cc > new file mode 100644 > index 00000000000..b1845875c4b > --- /dev/null > +++ b/gcc/late-combine.cc > @@ -0,0 +1,718 @@ > +// Late-stage instruction combination pass. > +// Copyright (C) 2023 Free Software Foundation, Inc. > +// > +// This file is part of GCC. > +// > +// GCC is free software; you can redistribute it and/or modify it under > +// the terms of the GNU General Public License as published by the Free > +// Software Foundation; either version 3, or (at your option) any later > +// version. > +// > +// GCC is distributed in the hope that it will be useful, but WITHOUT ANY > +// WARRANTY; without even the implied warranty of MERCHANTABILITY or > +// FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License > +// for more details. > +// > +// You should have received a copy of the GNU General Public License > +// along with GCC; see the file COPYING3. If not see > +// <http://www.gnu.org/licenses/>. > + > +// The current purpose of this pass is to substitute definitions into > +// all uses, so that the definition can be removed. However, it could > +// be extended to handle other combination-related optimizations in future. > +// > +// The pass can run before or after register allocation. When running > +// before register allocation, it tries to avoid cases that are likely > +// to increase register pressure. 
For the same reason, it avoids moving > +// instructions around, even if doing so would allow an optimization to > +// succeed. These limitations are removed when running after register > +// allocation. > + > +#define INCLUDE_ALGORITHM > +#define INCLUDE_FUNCTIONAL > +#include "config.h" > +#include "system.h" > +#include "coretypes.h" > +#include "backend.h" > +#include "rtl.h" > +#include "df.h" > +#include "rtl-ssa.h" > +#include "print-rtl.h" > +#include "tree-pass.h" > +#include "cfgcleanup.h" > +#include "target.h" > + > +using namespace rtl_ssa; > + > +namespace { > +const pass_data pass_data_late_combine = > +{ > + RTL_PASS, // type > + "late_combine", // name > + OPTGROUP_NONE, // optinfo_flags > + TV_NONE, // tv_id > + 0, // properties_required > + 0, // properties_provided > + 0, // properties_destroyed > + 0, // todo_flags_start > + TODO_df_finish, // todo_flags_finish > +}; > + > +// Class that represents one run of the pass. > +class late_combine > +{ > +public: > + unsigned int execute (function *); > + > +private: > + rtx optimizable_set (insn_info *); > + bool check_register_pressure (insn_info *, rtx); > + bool check_uses (set_info *, rtx); > + bool combine_into_uses (insn_info *, insn_info *); > + > + auto_vec<insn_info *> m_worklist; > +}; > + > +// Represents an attempt to substitute a single-set definition into all > +// uses of the definition. > +class insn_combination > +{ > +public: > + insn_combination (set_info *, rtx, rtx); > + bool run (); > + const vec<insn_change *> &use_changes () const { return m_use_changes; } > + > +private: > + use_array get_new_uses (use_info *); > + bool substitute_nondebug_use (use_info *); > + bool substitute_nondebug_uses (set_info *); > + bool move_and_recog (insn_change &); > + bool try_to_preserve_debug_info (insn_change &, use_info *); > + void substitute_debug_use (use_info *); > + bool substitute_note (insn_info *, rtx, bool); > + void substitute_notes (insn_info *, bool); > + void substitute_note_uses (use_info *); > + void substitute_optional_uses (set_info *); > + > + // Represents the state of the function's RTL at the start of this > + // combination attempt. > + insn_change_watermark m_rtl_watermark; > + > + // Represents the rtl-ssa state at the start of this combination attempt. > + obstack_watermark m_attempt; > + > + // The instruction that contains the definition, and that we're trying > + // to delete. > + insn_info *m_def_insn; > + > + // The definition itself. > + set_info *m_def; > + > + // The destination and source of the single set that defines m_def. > + // The destination is known to be a plain REG. > + rtx m_dest; > + rtx m_src; > + > + // Contains all in-progress changes to uses of m_def. > + auto_vec<insn_change *> m_use_changes; > + > + // Contains the full list of changes that we want to make, in reverse > + // postorder. > + auto_vec<insn_change *> m_nondebug_changes; > +}; > + > +insn_combination::insn_combination (set_info *def, rtx dest, rtx src) > + : m_rtl_watermark (), > + m_attempt (crtl->ssa->new_change_attempt ()), > + m_def_insn (def->insn ()), > + m_def (def), > + m_dest (dest), > + m_src (src), > + m_use_changes (), > + m_nondebug_changes () > +{ > +} > + > +// USE is a direct or indirect use of m_def. Return the list of uses > +// that would be needed after substituting m_def into the instruction. > +// The returned list is marked as invalid if USE's insn and m_def_insn > +// use different definitions for the same resource (register or memory). 
> +use_array > +insn_combination::get_new_uses (use_info *use) > +{ > + auto *def = use->def (); > + auto *use_insn = use->insn (); > + > + use_array new_uses = use_insn->uses (); > + new_uses = remove_uses_of_def (m_attempt, new_uses, def); > + new_uses = merge_access_arrays (m_attempt, m_def_insn->uses (), new_uses); > + if (new_uses.is_valid () && use->ebb () != m_def->ebb ()) > + new_uses = crtl->ssa->make_uses_available (m_attempt, new_uses, use->bb (), > + use_insn->is_debug_insn ()); > + return new_uses; > +} > + > +// Start the process of trying to replace USE by substitution, given that > +// USE occurs in a non-debug instruction. Check that the substitution can > +// be represented in RTL and that each use of a resource (register or memory) > +// has a consistent definition. If so, start an insn_change record for the > +// substitution and return true. > +bool > +insn_combination::substitute_nondebug_use (use_info *use) > +{ > + insn_info *use_insn = use->insn (); > + rtx_insn *use_rtl = use_insn->rtl (); > + > + if (dump_file && (dump_flags & TDF_DETAILS)) > + dump_insn_slim (dump_file, use->insn ()->rtl ()); > + > + // Check that we can change the instruction pattern. Leave recognition > + // of the result till later. > + insn_propagation prop (use_rtl, m_dest, m_src); > + if (!prop.apply_to_pattern (&PATTERN (use_rtl)) > + || prop.num_replacements == 0) > + { > + if (dump_file && (dump_flags & TDF_DETAILS)) > + fprintf (dump_file, "-- RTL substitution failed\n"); > + return false; > + } > + > + use_array new_uses = get_new_uses (use); > + if (!new_uses.is_valid ()) > + { > + if (dump_file && (dump_flags & TDF_DETAILS)) > + fprintf (dump_file, "-- could not prove that all sources" > + " are available\n"); > + return false; > + } > + > + auto *where = XOBNEW (m_attempt, insn_change); > + auto *use_change = new (where) insn_change (use_insn); > + m_use_changes.safe_push (use_change); > + use_change->new_uses = new_uses; > + > + return true; > +} > + > +// Apply substitute_nondebug_use to all direct and indirect uses of DEF. > +// There will be at most one level of indirection. > +bool > +insn_combination::substitute_nondebug_uses (set_info *def) > +{ > + for (use_info *use : def->nondebug_insn_uses ()) > + if (!use->is_live_out_use () > + && !use->only_occurs_in_notes () > + && !substitute_nondebug_use (use)) > + return false; > + > + for (use_info *use : def->phi_uses ()) > + if (!substitute_nondebug_uses (use->phi ())) > + return false; > + > + return true; > +} > + > +// Complete the verification of USE_CHANGE, given that m_nondebug_changes > +// now contains an insn_change record for all proposed non-debug changes. > +// Check that the new instruction is a recognized pattern. Also check that > +// the instruction can be placed somewhere that makes all definitions and > +// uses valid, and that permits any new hard-register clobbers added > +// during the recognition process. Return true on success. 
> +bool > +insn_combination::move_and_recog (insn_change &use_change) > +{ > + insn_info *use_insn = use_change.insn (); > + > + if (reload_completed && can_move_insn_p (use_insn)) > + use_change.move_range = { use_insn->bb ()->head_insn (), > + use_insn->ebb ()->last_bb ()->end_insn () }; > + if (!restrict_movement_ignoring (use_change, > + insn_is_changing (m_nondebug_changes))) > + { > + if (dump_file && (dump_flags & TDF_DETAILS)) > + fprintf (dump_file, "-- cannot satisfy all definitions and uses" > + " in insn %d\n", INSN_UID (use_insn->rtl ())); > + return false; > + } > + > + if (!recog_ignoring (m_attempt, use_change, > + insn_is_changing (m_nondebug_changes))) > + return false; > + > + return true; > +} > + > +// USE_CHANGE.insn () is a debug instruction that uses m_def. Try to > +// substitute the definition into the instruction and try to describe > +// the result in USE_CHANGE. Return true on success. Failure means that > +// the instruction must be reset instead. > +bool > +insn_combination::try_to_preserve_debug_info (insn_change &use_change, > + use_info *use) > +{ > + insn_info *use_insn = use_change.insn (); > + rtx_insn *use_rtl = use_insn->rtl (); > + > + use_change.new_uses = get_new_uses (use); > + if (!use_change.new_uses.is_valid () > + || !restrict_movement (use_change)) > + return false; > + > + insn_propagation prop (use_rtl, m_dest, m_src); > + return prop.apply_to_pattern (&INSN_VAR_LOCATION_LOC (use_rtl)); > +} > + > +// USE_INSN is a debug instruction that uses m_def. Update it to reflect > +// the fact that m_def is going to disappear. Try to preserve the source > +// value if possible, but reset the instruction if not. > +void > +insn_combination::substitute_debug_use (use_info *use) > +{ > + auto *use_insn = use->insn (); > + rtx_insn *use_rtl = use_insn->rtl (); > + > + auto use_change = insn_change (use_insn); > + if (!try_to_preserve_debug_info (use_change, use)) > + { > + use_change.new_uses = {}; > + use_change.move_range = use_change.insn (); > + INSN_VAR_LOCATION_LOC (use_rtl) = gen_rtx_UNKNOWN_VAR_LOC (); > + } > + insn_change *changes[] = { &use_change }; > + crtl->ssa->change_insns (changes); > +} > + > +// NOTE is a reg note of USE_INSN, which previously used m_def. Update > +// the note to reflect the fact that m_def is going to disappear. Return > +// true on success, or false if the note must be deleted. > +// > +// CAN_PROPAGATE is true if m_dest can be replaced with m_src. > +bool > +insn_combination::substitute_note (insn_info *use_insn, rtx note, > + bool can_propagate) > +{ > + if (REG_NOTE_KIND (note) == REG_EQUAL > + || REG_NOTE_KIND (note) == REG_EQUIV) > + { > + insn_propagation prop (use_insn->rtl (), m_dest, m_src); > + return (prop.apply_to_rvalue (&XEXP (note, 0)) > + && (can_propagate || prop.num_replacements == 0)); > + } > + return true; > +} > + > +// Update USE_INSN's notes after deciding to go ahead with the optimization. > +// CAN_PROPAGATE is true if m_dest can be replaced with m_src. > +void > +insn_combination::substitute_notes (insn_info *use_insn, bool can_propagate) > +{ > + rtx_insn *use_rtl = use_insn->rtl (); > + rtx *ptr = &REG_NOTES (use_rtl); > + while (rtx note = *ptr) > + { > + if (substitute_note (use_insn, note, can_propagate)) > + ptr = &XEXP (note, 1); > + else > + *ptr = XEXP (note, 1); > + } > +} > + > +// We've decided to go ahead with the substitution. Update all REG_NOTES > +// involving USE. 
> +void > +insn_combination::substitute_note_uses (use_info *use) > +{ > + insn_info *use_insn = use->insn (); > + > + bool can_propagate = true; > + if (use->only_occurs_in_notes ()) > + { > + // The only uses are in notes. Try to keep the note if we can, > + // but removing it is better than aborting the optimization. > + insn_change use_change (use_insn); > + use_change.new_uses = get_new_uses (use); > + if (!use_change.new_uses.is_valid () > + || !restrict_movement (use_change)) > + { > + use_change.move_range = use_insn; > + use_change.new_uses = remove_uses_of_def (m_attempt, > + use_insn->uses (), > + use->def ()); > + can_propagate = false; > + } > + if (dump_file && (dump_flags & TDF_DETAILS)) > + { > + fprintf (dump_file, "%s notes in:\n", > + can_propagate ? "updating" : "removing"); > + dump_insn_slim (dump_file, use_insn->rtl ()); > + } > + substitute_notes (use_insn, can_propagate); > + insn_change *changes[] = { &use_change }; > + crtl->ssa->change_insns (changes); > + } > + else > + substitute_notes (use_insn, can_propagate); > +} > + > +// We've decided to go ahead with the substitution and we've dealt with > +// all uses that occur in the patterns of non-debug insns. Update all > +// other uses for the fact that m_def is about to disappear. > +void > +insn_combination::substitute_optional_uses (set_info *def) > +{ > + if (auto insn_uses = def->all_insn_uses ()) > + { > + use_info *use = *insn_uses.begin (); > + while (use) > + { > + use_info *next_use = use->next_any_insn_use (); > + if (use->is_in_debug_insn ()) > + substitute_debug_use (use); > + else if (!use->is_live_out_use ()) > + substitute_note_uses (use); > + use = next_use; > + } > + } > + for (use_info *use : def->phi_uses ()) > + substitute_optional_uses (use->phi ()); > +} > + > +// Try to perform the substitution. Return true on success. > +bool > +insn_combination::run () > +{ > + if (dump_file && (dump_flags & TDF_DETAILS)) > + { > + fprintf (dump_file, "\ntrying to combine definition of r%d in:\n", > + m_def->regno ()); > + dump_insn_slim (dump_file, m_def_insn->rtl ()); > + fprintf (dump_file, "into:\n"); > + } > + > + if (!substitute_nondebug_uses (m_def)) > + return false; > + > + auto def_change = insn_change::delete_insn (m_def_insn); > + > + m_nondebug_changes.reserve (m_use_changes.length () + 1); > + m_nondebug_changes.quick_push (&def_change); > + m_nondebug_changes.splice (m_use_changes); > + > + for (auto *use_change : m_use_changes) > + if (!move_and_recog (*use_change)) > + return false; > + > + if (!changes_are_worthwhile (m_nondebug_changes) > + || !crtl->ssa->verify_insn_changes (m_nondebug_changes)) > + return false; > + > + substitute_optional_uses (m_def); > + > + confirm_change_group (); > + crtl->ssa->change_insns (m_nondebug_changes); > + return true; > +} > + > +// See whether INSN is a single_set that we can optimize. Return the > +// set if so, otherwise return null. > +rtx > +late_combine::optimizable_set (insn_info *insn) > +{ > + if (!insn->can_be_optimized () > + || insn->is_asm () > + || insn->is_call () > + || insn->has_volatile_refs () > + || insn->has_pre_post_modify () > + || !can_move_insn_p (insn)) > + return NULL_RTX; > + > + return single_set (insn->rtl ()); > +} > + > +// Suppose that we can replace all uses of SET_DEST (SET) with SET_SRC (SET), > +// where SET occurs in INSN. Return true if doing so is not likely to > +// increase register pressure. 
> +bool > +late_combine::check_register_pressure (insn_info *insn, rtx set) > +{ > + // Plain register-to-register moves do not establish a register class > + // preference and have no well-defined effect on the register allocator. > + // If changes in register class are needed, the register allocator is > + // in the best position to place those changes. If no change in > + // register class is needed, then the optimization reduces register > + // pressure if SET_SRC (set) was already live at uses, otherwise the > + // optimization is pressure-neutral. > + rtx src = SET_SRC (set); > + if (REG_P (src)) > + return true; > + > + // On the same basis, substituting a SET_SRC that contains a single > + // pseudo register either reduces pressure or is pressure-neutral, > + // subject to the constraints below. We would need to do more > + // analysis for SET_SRCs that use more than one pseudo register. > + unsigned int nregs = 0; > + for (auto *use : insn->uses ()) > + if (use->is_reg () > + && !HARD_REGISTER_NUM_P (use->regno ()) > + && !use->only_occurs_in_notes ()) > + if (++nregs > 1) > + return false; > + > + // If there are no pseudo registers in SET_SRC then the optimization > + // should improve register pressure. > + if (nregs == 0) > + return true; > + > + // We'd be substituting (set (reg R1) SRC) where SRC is known to > + // contain a single pseudo register R2. Assume for simplicity that > + // each new use of R2 would need to be in the same class C as the > + // current use of R2. If, for a realistic allocation, C is a > + // non-strict superset of R1's register class, the effect on > + // register pressure should be positive or neutral. If instead > + // R1 occupies a different register class from R2, or if R1 has > + // more allocation freedom than R2, then there's a higher risk that > + // the effect on register pressure could be negative. > + // > + // First use constrain_operands to get the most likely choice of > + // alternative. For simplicity, just handle the case where the > + // output operand is operand 0. > + extract_insn (insn->rtl ()); > + rtx dest = SET_DEST (set); > + if (recog_data.n_operands == 0 > + || recog_data.operand[0] != dest) > + return false; > + > + if (!constrain_operands (0, get_enabled_alternatives (insn->rtl ()))) > + return false; > + > + preprocess_constraints (insn->rtl ()); > + auto *alt = which_op_alt (); > + auto dest_class = alt[0].cl; > + > + // Check operands 1 and above. > + auto check_src = [&](unsigned int i) > + { > + if (recog_data.is_operator[i]) > + return true; > + > + rtx op = recog_data.operand[i]; > + if (CONSTANT_P (op)) > + return true; > + > + if (SUBREG_P (op)) > + op = SUBREG_REG (op); > + if (REG_P (op)) > + { > + if (HARD_REGISTER_P (op)) > + { > + // We've already rejected uses of non-fixed hard registers. > + gcc_checking_assert (fixed_regs[REGNO (op)]); > + return true; > + } > + > + // Make sure that the source operand's class is at least as > + // permissive as the destination operand's class. > + if (!reg_class_subset_p (dest_class, alt[i].cl)) > + return false; > + > + // Make sure that the source operand occupies no more hard > + // registers than the destination operand. This mostly matters > + // for subregs. 
> + if (targetm.class_max_nregs (dest_class, GET_MODE (dest)) > + < targetm.class_max_nregs (alt[i].cl, GET_MODE (op))) > + return false; > + > + return true; > + } > + return false; > + }; > + for (int i = 1; i < recog_data.n_operands; ++i) > + if (!check_src (i)) > + return false; > + > + return true; > +} > + > +// Check uses of DEF to see whether there is anything obvious that > +// prevents the substitution of SET into uses of DEF. > +bool > +late_combine::check_uses (set_info *def, rtx set) > +{ > + use_info *last_use = nullptr; > + for (use_info *use : def->nondebug_insn_uses ()) > + { > + insn_info *use_insn = use->insn (); > + > + if (use->is_live_out_use ()) > + continue; > + if (use->only_occurs_in_notes ()) > + continue; > + > + // We cannot replace all uses if the value is live on exit. > + if (use->is_artificial ()) > + return false; > + > + // Avoid increasing the complexity of instructions that > + // reference allocatable hard registers. > + if (!REG_P (SET_SRC (set)) > + && !reload_completed > + && (accesses_include_nonfixed_hard_registers (use_insn->uses ()) > + || accesses_include_nonfixed_hard_registers (use_insn->defs ()))) > + return false; > + > + // Don't substitute into a non-local goto, since it can then be > + // treated as a jump to local label, e.g. in shorten_branches. > + // ??? But this shouldn't be necessary. > + if (use_insn->is_jump () > + && find_reg_note (use_insn->rtl (), REG_NON_LOCAL_GOTO, NULL_RTX)) > + return false; > + > + // We'll keep the uses in their original order, even if we move > + // them relative to other instructions. Make sure that non-final > + // uses do not change any values that occur in the SET_SRC. > + if (last_use && last_use->ebb () == use->ebb ()) > + { > + def_info *ultimate_def = look_through_degenerate_phi (def); > + if (insn_clobbers_resources (last_use->insn (), > + ultimate_def->insn ()->uses ())) > + return false; > + } > + > + last_use = use; > + } > + > + for (use_info *use : def->phi_uses ()) > + if (!use->phi ()->is_degenerate () > + || !check_uses (use->phi (), set)) > + return false; > + > + return true; > +} > + > +// Try to remove INSN by substituting a definition into all uses. > +// If the optimization moves any instructions before CURSOR, add those > +// instructions to the end of m_worklist. > +bool > +late_combine::combine_into_uses (insn_info *insn, insn_info *cursor) > +{ > + // For simplicity, don't try to handle sets of multiple hard registers. > + // And for correctness, don't remove any assignments to the stack or > + // frame pointers, since that would implicitly change the set of valid > + // memory locations between this assignment and the next. > + // > + // Removing assignments to the hard frame pointer would invalidate > + // backtraces. > + set_info *def = single_set_info (insn); > + if (!def > + || !def->is_reg () > + || def->regno () == STACK_POINTER_REGNUM > + || def->regno () == FRAME_POINTER_REGNUM > + || def->regno () == HARD_FRAME_POINTER_REGNUM) > + return false; > + > + rtx set = optimizable_set (insn); > + if (!set) > + return false; > + > + // For simplicity, don't try to handle subreg destinations. > + rtx dest = SET_DEST (set); > + if (!REG_P (dest) || def->regno () != REGNO (dest)) > + return false; > + > + // Don't prolong the live ranges of allocatable hard registers, or put > + // them into more complicated instructions. Failing to prevent this > + // could lead to spill failures, or at least to worse register allocation. 
> + if (!reload_completed > + && accesses_include_nonfixed_hard_registers (insn->uses ())) > + return false; > + > + if (!reload_completed && !check_register_pressure (insn, set)) > + return false; > + > + if (!check_uses (def, set)) > + return false; > + > + insn_combination combination (def, SET_DEST (set), SET_SRC (set)); > + if (!combination.run ()) > + return false; > + > + for (auto *use_change : combination.use_changes ()) > + if (*use_change->insn () < *cursor) > + m_worklist.safe_push (use_change->insn ()); > + else > + break; > + return true; > +} > + > +// Run the pass on function FN. > +unsigned int > +late_combine::execute (function *fn) > +{ > + // Initialization. > + calculate_dominance_info (CDI_DOMINATORS); > + df_analyze (); > + crtl->ssa = new rtl_ssa::function_info (fn); > + // Don't allow memory_operand to match volatile MEMs. > + init_recog_no_volatile (); > + > + insn_info *insn = *crtl->ssa->nondebug_insns ().begin (); > + while (insn) > + { > + if (!insn->is_artificial ()) > + { > + insn_info *prev = insn->prev_nondebug_insn (); > + if (combine_into_uses (insn, prev)) > + { > + // Any instructions that get added to the worklist were > + // previously after PREV. Thus if we were able to move > + // an instruction X before PREV during one combination, > + // X cannot depend on any instructions that we move before > + // PREV during subsequent combinations. This means that > + // the worklist should be free of backwards dependencies, > + // even if it isn't necessarily in RPO. > + for (unsigned int i = 0; i < m_worklist.length (); ++i) > + combine_into_uses (m_worklist[i], prev); > + m_worklist.truncate (0); > + insn = prev; > + } > + } > + insn = insn->next_nondebug_insn (); > + } > + > + // Finalization. > + if (crtl->ssa->perform_pending_updates ()) > + cleanup_cfg (0); > + // Make recognizer allow volatile MEMs again. > + init_recog (); > + free_dominance_info (CDI_DOMINATORS); > + return 0; > +} > + > +class pass_late_combine : public rtl_opt_pass > +{ > +public: > + pass_late_combine (gcc::context *ctxt) > + : rtl_opt_pass (pass_data_late_combine, ctxt) > + {} > + > + // opt_pass methods: > + opt_pass *clone () override { return new pass_late_combine (m_ctxt); } > + bool gate (function *) override { return flag_late_combine_instructions; } > + unsigned int execute (function *) override; > +}; > + > +unsigned int > +pass_late_combine::execute (function *fn) > +{ > + return late_combine ().execute (fn); > +} > + > +} // end namespace > + > +// Create a new late-combine pass instance. > + > +rtl_opt_pass * > +make_pass_late_combine (gcc::context *ctxt) > +{ > + return new pass_late_combine (ctxt); > +} > diff --git a/gcc/passes.def b/gcc/passes.def > index 1e1950bdb39..56ab5204b08 100644 > --- a/gcc/passes.def > +++ b/gcc/passes.def > @@ -488,6 +488,7 @@ along with GCC; see the file COPYING3. If not see > NEXT_PASS (pass_initialize_regs); > NEXT_PASS (pass_ud_rtl_dce); > NEXT_PASS (pass_combine); > + NEXT_PASS (pass_late_combine); > NEXT_PASS (pass_if_after_combine); > NEXT_PASS (pass_jump_after_combine); > NEXT_PASS (pass_partition_blocks); > @@ -507,6 +508,7 @@ along with GCC; see the file COPYING3. 
If not see > NEXT_PASS (pass_postreload); > PUSH_INSERT_PASSES_WITHIN (pass_postreload) > NEXT_PASS (pass_postreload_cse); > + NEXT_PASS (pass_late_combine); > NEXT_PASS (pass_gcse2); > NEXT_PASS (pass_split_after_reload); > NEXT_PASS (pass_ree); > diff --git a/gcc/testsuite/gcc.dg/ira-shrinkwrap-prep-1.c b/gcc/testsuite/gcc.dg/ira-shrinkwrap-prep-1.c > index f290b9ccbdc..a95637abbe5 100644 > --- a/gcc/testsuite/gcc.dg/ira-shrinkwrap-prep-1.c > +++ b/gcc/testsuite/gcc.dg/ira-shrinkwrap-prep-1.c > @@ -25,5 +25,5 @@ bar (long a) > } > > /* { dg-final { scan-rtl-dump "Will split live ranges of parameters" "ira" } } */ > -/* { dg-final { scan-rtl-dump "Split live-range of register" "ira" { xfail *-*-* } } } */ > +/* { dg-final { scan-rtl-dump "Split live-range of register" "ira" { xfail { ! aarch64*-*-* } } } } */ > /* { dg-final { scan-rtl-dump "Performing shrink-wrapping" "pro_and_epilogue" { xfail powerpc*-*-* } } } */ > diff --git a/gcc/testsuite/gcc.dg/ira-shrinkwrap-prep-2.c b/gcc/testsuite/gcc.dg/ira-shrinkwrap-prep-2.c > index 6212c95585d..0690e036eaa 100644 > --- a/gcc/testsuite/gcc.dg/ira-shrinkwrap-prep-2.c > +++ b/gcc/testsuite/gcc.dg/ira-shrinkwrap-prep-2.c > @@ -30,6 +30,6 @@ bar (long a) > } > > /* { dg-final { scan-rtl-dump "Will split live ranges of parameters" "ira" } } */ > -/* { dg-final { scan-rtl-dump "Split live-range of register" "ira" { xfail *-*-* } } } */ > +/* { dg-final { scan-rtl-dump "Split live-range of register" "ira" { xfail { ! aarch64*-*-* } } } } */ > /* XFAIL due to PR70681. */ > /* { dg-final { scan-rtl-dump "Performing shrink-wrapping" "pro_and_epilogue" { xfail arm*-*-* powerpc*-*-* } } } */ > diff --git a/gcc/testsuite/gcc.dg/stack-check-4.c b/gcc/testsuite/gcc.dg/stack-check-4.c > index b0c5c61972f..052d2abc2f1 100644 > --- a/gcc/testsuite/gcc.dg/stack-check-4.c > +++ b/gcc/testsuite/gcc.dg/stack-check-4.c > @@ -20,7 +20,7 @@ > scan for. We scan for both the positive and negative cases. 
*/ > > /* { dg-do compile } */ > -/* { dg-options "-O2 -fstack-clash-protection -fdump-rtl-pro_and_epilogue -fno-optimize-sibling-calls" } */ > +/* { dg-options "-O2 -fstack-clash-protection -fdump-rtl-pro_and_epilogue -fno-optimize-sibling-calls -fno-shrink-wrap" } */ > /* { dg-require-effective-target supports_stack_clash_protection } */ > > extern void arf (char *); > diff --git a/gcc/testsuite/gcc.target/aarch64/pr106594_1.c b/gcc/testsuite/gcc.target/aarch64/pr106594_1.c > new file mode 100644 > index 00000000000..71bcafcb44f > --- /dev/null > +++ b/gcc/testsuite/gcc.target/aarch64/pr106594_1.c > @@ -0,0 +1,20 @@ > +/* { dg-options "-O2" } */ > + > +extern const int constellation_64qam[64]; > + > +void foo(int nbits, > + const char *p_src, > + int *p_dst) { > + > + while (nbits > 0U) { > + char first = *p_src++; > + > + char index1 = ((first & 0x3) << 4) | (first >> 4); > + > + *p_dst++ = constellation_64qam[index1]; > + > + nbits--; > + } > +} > + > +/* { dg-final { scan-assembler {(?n)\tldr\t.*\[x[0-9]+, w[0-9]+, sxtw #?2\]} } } */ > diff --git a/gcc/testsuite/gcc.target/aarch64/sve/cond_asrd_3.c b/gcc/testsuite/gcc.target/aarch64/sve/cond_asrd_3.c > index 0d620a30d5d..b537c6154a3 100644 > --- a/gcc/testsuite/gcc.target/aarch64/sve/cond_asrd_3.c > +++ b/gcc/testsuite/gcc.target/aarch64/sve/cond_asrd_3.c > @@ -27,9 +27,9 @@ TEST_ALL (DEF_LOOP) > /* { dg-final { scan-assembler-times {\tasrd\tz[0-9]+\.h, p[0-7]/m, z[0-9]+\.h, #4\n} 2 } } */ > /* { dg-final { scan-assembler-times {\tasrd\tz[0-9]+\.s, p[0-7]/m, z[0-9]+\.s, #4\n} 1 } } */ > > -/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.b, p[0-7]/z, z[0-9]+\.b\n} 3 { xfail *-*-* } } } */ > -/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.h, p[0-7]/z, z[0-9]+\.h\n} 2 { xfail *-*-* } } } */ > -/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.s, p[0-7]/z, z[0-9]+\.s\n} 1 { xfail *-*-* } } } */ > +/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.b, p[0-7]/z, z[0-9]+\.b\n} 3 } } */ > +/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.h, p[0-7]/z, z[0-9]+\.h\n} 2 } } */ > +/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.s, p[0-7]/z, z[0-9]+\.s\n} 1 } } */ > > -/* { dg-final { scan-assembler-not {\tmov\tz} { xfail *-*-* } } } */ > -/* { dg-final { scan-assembler-not {\tsel\t} { xfail *-*-* } } } */ > +/* { dg-final { scan-assembler-not {\tmov\tz} } } */ > +/* { dg-final { scan-assembler-not {\tsel\t} } } */ > diff --git a/gcc/testsuite/gcc.target/aarch64/sve/cond_convert_3.c b/gcc/testsuite/gcc.target/aarch64/sve/cond_convert_3.c > index a294effd4a9..cff806c278d 100644 > --- a/gcc/testsuite/gcc.target/aarch64/sve/cond_convert_3.c > +++ b/gcc/testsuite/gcc.target/aarch64/sve/cond_convert_3.c > @@ -30,11 +30,9 @@ TEST_ALL (DEF_LOOP) > /* { dg-final { scan-assembler-times {\tscvtf\tz[0-9]+\.d, p[0-7]/m,} 1 } } */ > /* { dg-final { scan-assembler-times {\tucvtf\tz[0-9]+\.d, p[0-7]/m,} 1 } } */ > > -/* Really we should be able to use MOVPRFX /z here, but at the moment > - we're relying on combine to merge a SEL and an arithmetic operation, > - and the SEL doesn't allow the "false" value to be zero when the "true" > - value is a register. 
*/ > -/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+, z[0-9]+\n} 6 } } */ > +/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.h, p[0-7]/z,} 2 } } */ > +/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.s, p[0-7]/z,} 2 } } */ > +/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.d, p[0-7]/z,} 2 } } */ > > /* { dg-final { scan-assembler-not {\tmov\tz[^\n]*z} } } */ > /* { dg-final { scan-assembler-not {\tsel\t} } } */ > diff --git a/gcc/testsuite/gcc.target/aarch64/sve/cond_convert_6.c b/gcc/testsuite/gcc.target/aarch64/sve/cond_convert_6.c > index 6541a2ea49d..abf0a2e832f 100644 > --- a/gcc/testsuite/gcc.target/aarch64/sve/cond_convert_6.c > +++ b/gcc/testsuite/gcc.target/aarch64/sve/cond_convert_6.c > @@ -30,11 +30,9 @@ TEST_ALL (DEF_LOOP) > /* { dg-final { scan-assembler-times {\tfcvtzs\tz[0-9]+\.d, p[0-7]/m,} 1 } } */ > /* { dg-final { scan-assembler-times {\tfcvtzu\tz[0-9]+\.d, p[0-7]/m,} 1 } } */ > > -/* Really we should be able to use MOVPRFX /z here, but at the moment > - we're relying on combine to merge a SEL and an arithmetic operation, > - and the SEL doesn't allow the "false" value to be zero when the "true" > - value is a register. */ > -/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+, z[0-9]+\n} 6 } } */ > +/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.h, p[0-7]/z,} 2 } } */ > +/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.s, p[0-7]/z,} 2 } } */ > +/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.d, p[0-7]/z,} 2 } } */ > > /* { dg-final { scan-assembler-not {\tmov\tz[^\n]*z} } } */ > /* { dg-final { scan-assembler-not {\tsel\t} } } */ > diff --git a/gcc/testsuite/gcc.target/aarch64/sve/cond_fabd_5.c b/gcc/testsuite/gcc.target/aarch64/sve/cond_fabd_5.c > index e66477b3bce..401201b315a 100644 > --- a/gcc/testsuite/gcc.target/aarch64/sve/cond_fabd_5.c > +++ b/gcc/testsuite/gcc.target/aarch64/sve/cond_fabd_5.c > @@ -24,12 +24,9 @@ TEST_ALL (DEF_LOOP) > /* { dg-final { scan-assembler-times {\tfabd\tz[0-9]+\.s, p[0-7]/m,} 1 } } */ > /* { dg-final { scan-assembler-times {\tfabd\tz[0-9]+\.d, p[0-7]/m,} 1 } } */ > > -/* Really we should be able to use MOVPRFX /Z here, but at the moment > - we're relying on combine to merge a SEL and an arithmetic operation, > - and the SEL doesn't allow zero operands. 
*/ > -/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.h, p[0-7]/z, z[0-9]+\.h\n} 1 { xfail *-*-* } } } */ > -/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.s, p[0-7]/z, z[0-9]+\.s\n} 1 { xfail *-*-* } } } */ > -/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.d, p[0-7]/z, z[0-9]+\.d\n} 1 { xfail *-*-* } } } */ > +/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.h, p[0-7]/z, z[0-9]+\.h\n} 1 } } */ > +/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.s, p[0-7]/z, z[0-9]+\.s\n} 1 } } */ > +/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.d, p[0-7]/z, z[0-9]+\.d\n} 1 } } */ > > /* { dg-final { scan-assembler-not {\tmov\tz[^,]*z} } } */ > -/* { dg-final { scan-assembler-not {\tsel\t} { xfail *-*-* } } } */ > +/* { dg-final { scan-assembler-not {\tsel\t} } } */ > diff --git a/gcc/testsuite/gcc.target/aarch64/sve/cond_unary_4.c b/gcc/testsuite/gcc.target/aarch64/sve/cond_unary_4.c > index a491f899088..cbb957bffa4 100644 > --- a/gcc/testsuite/gcc.target/aarch64/sve/cond_unary_4.c > +++ b/gcc/testsuite/gcc.target/aarch64/sve/cond_unary_4.c > @@ -52,15 +52,10 @@ TEST_ALL (DEF_LOOP) > /* { dg-final { scan-assembler-times {\tfneg\tz[0-9]+\.s, p[0-7]/m,} 1 } } */ > /* { dg-final { scan-assembler-times {\tfneg\tz[0-9]+\.d, p[0-7]/m,} 1 } } */ > > -/* Really we should be able to use MOVPRFX /z here, but at the moment > - we're relying on combine to merge a SEL and an arithmetic operation, > - and the SEL doesn't allow the "false" value to be zero when the "true" > - value is a register. */ > -/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+, z[0-9]+\n} 7 } } */ > -/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.b, p[0-7]/z, z[0-9]+\.b} 1 } } */ > -/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.h, p[0-7]/z, z[0-9]+\.h} 2 } } */ > -/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.s, p[0-7]/z, z[0-9]+\.s} 2 } } */ > -/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.d, p[0-7]/z, z[0-9]+\.d} 2 } } */ > +/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.b, p[0-7]/z, z[0-9]+\.b} 2 } } */ > +/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.h, p[0-7]/z, z[0-9]+\.h} 4 } } */ > +/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.s, p[0-7]/z, z[0-9]+\.s} 4 } } */ > +/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.d, p[0-7]/z, z[0-9]+\.d} 4 } } */ > > /* { dg-final { scan-assembler-not {\tmov\tz[^\n]*z} } } */ > /* { dg-final { scan-assembler-not {\tsel\t} } } */ > diff --git a/gcc/tree-pass.h b/gcc/tree-pass.h > index 09e6ada5b2f..75376316e40 100644 > --- a/gcc/tree-pass.h > +++ b/gcc/tree-pass.h > @@ -612,6 +612,7 @@ extern rtl_opt_pass *make_pass_branch_prob (gcc::context *ctxt); > extern rtl_opt_pass *make_pass_value_profile_transformations (gcc::context > *ctxt); > extern rtl_opt_pass *make_pass_postreload_cse (gcc::context *ctxt); > +extern rtl_opt_pass *make_pass_late_combine (gcc::context *ctxt); > extern rtl_opt_pass *make_pass_gcse2 (gcc::context *ctxt); > extern rtl_opt_pass *make_pass_split_after_reload (gcc::context *ctxt); > extern rtl_opt_pass *make_pass_thread_prologue_and_epilogue (gcc::context > -- > 2.25.1 >
I didn't see this before. Sigh. On Tue, Jan 02, 2024 at 09:47:11AM +0000, Richard Sandiford wrote: > Segher Boessenkool <segher@kernel.crashing.org> writes: > > On Tue, Oct 24, 2023 at 07:49:10PM +0100, Richard Sandiford wrote: > >> This patch adds a combine pass that runs late in the pipeline. > > > > But it is not. It is a completely new thing, and much closer to > > fwprop than to combine, too. > > Well, it is a combine pass. No, it is not. In the context of GCC combine is the instruction combiner. Which does something else than this does. So use a different name. Please. It will be NAKked by the combine maintainer otherwise. > It's not a new instance of the pass in > combine.cc, but I don't think that's the implication. We already have > two combine passes: the combine.cc one and the postreload one. There is no postreload-combine pass. There is a postreload pass that does various trivial things. One of those is reload_combine, which is nothing like combine. It is a kind of limited fwprop for memory addressing. > > Could you rename it to something else, please? Something less confusing > > to both users and maintainers :-) > > Do you have any suggestions? Since it is something like fwprop, maybe something like that? Or maybe put "addressing" in the name, if that is the point here. > >> The pass currently has a single objective: remove definitions by > >> substituting into all uses. > > > > The easy case ;-) > > And the yet a case that no existing pass handles. :) That's why I'm > trying to add something that does. So, fwprop. Segher
On Mon, Jun 24, 2024 at 9:38 PM Segher Boessenkool <segher@kernel.crashing.org> wrote: > > I didn't see this before. Sigh. > > On Tue, Jan 02, 2024 at 09:47:11AM +0000, Richard Sandiford wrote: > > Segher Boessenkool <segher@kernel.crashing.org> writes: > > > On Tue, Oct 24, 2023 at 07:49:10PM +0100, Richard Sandiford wrote: > > >> This patch adds a combine pass that runs late in the pipeline. > > > > > > But it is not. It is a completely new thing, and much closer to > > > fwprop than to combine, too. > > > > Well, it is a combine pass. > > No, it is not. In the context of GCC combine is the instruction > combiner. Which does something else than this does. > > So use a different name. Please. It will be NAKked by the combine > maintainer otherwise. > > > It's not a new instance of the pass in > > combine.cc, but I don't think that's the implication. We already have > > two combine passes: the combine.cc one and the postreload one. > > There is no postreload-combine pass. There is a postreload pass that > does various trivial things. One of those is reload_combine, which is > nothing like combine. It is a kind of limited fwprop for memory > addressing. > > > > Could you rename it to something else, please? Something less confusing > > > to both users and maintainers :-) > > > > Do you have any suggestions? > > Since it is something like fwprop, maybe something like that? Or maybe > put "addressing" in the name, if that is the point here. > > > >> The pass currently has a single objective: remove definitions by > > >> substituting into all uses. > > > > > > The easy case ;-) > > > > And the yet a case that no existing pass handles. :) That's why I'm > > trying to add something that does. > > So, fwprop. fullprop Richard. > > Segher
Just FYI. This patch does something to gcc.target/mips/madd-8.c and
gcc.target/mips/msub-8.c.

-PASS: gcc.target/mips/madd-8.c -O2 scan-assembler \tmul\t
-PASS: gcc.target/mips/madd-8.c -O2 scan-assembler-not \tmadd\t
-PASS: gcc.target/mips/madd-8.c -O2 scan-assembler-not \tmflo\t
-PASS: gcc.target/mips/madd-8.c -O2 scan-assembler-not \tmtlo\t
+FAIL: gcc.target/mips/madd-8.c -O2 scan-assembler \tmul\t
+FAIL: gcc.target/mips/madd-8.c -O2 scan-assembler-not \tmadd\t
+FAIL: gcc.target/mips/madd-8.c -O2 scan-assembler-not \tmflo\t
+FAIL: gcc.target/mips/madd-8.c -O2 scan-assembler-not \tmtlo\t

-FAIL: gcc.target/mips/msub-8.c -O2 scan-assembler \tmul\t
-FAIL: gcc.target/mips/msub-8.c -O2 scan-assembler-not \tmflo\t
-FAIL: gcc.target/mips/msub-8.c -O2 scan-assembler-not \tmsub\t
-FAIL: gcc.target/mips/msub-8.c -O2 scan-assembler-not \tmtlo\t
+PASS: gcc.target/mips/msub-8.c -O2 scan-assembler \tmul\t
+PASS: gcc.target/mips/msub-8.c -O2 scan-assembler-not \tmflo\t
+PASS: gcc.target/mips/msub-8.c -O2 scan-assembler-not \tmsub\t
+PASS: gcc.target/mips/msub-8.c -O2 scan-assembler-not \tmtlo\t

Quite interesting. I will investigate the real reason.
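For context, the madd-8.c/msub-8.c tests exercise widening
multiply-accumulate patterns of roughly the following shape (a sketch
only -- not the actual testcase contents, which live in
gcc.target/mips/):

    /* At -O2 the tests scan for a plain "mul" and for the absence of
       HI/LO accumulator traffic ("madd"/"msub"/"mflo"/"mtlo").  */
    long long
    madd_like (long long acc, int a, int b)
    {
      return acc + (long long) a * b;
    }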
Hi!

On 2023-10-24T19:49:10+0100, Richard Sandiford <richard.sandiford@arm.com> wrote:
> This patch adds a combine pass that runs late in the pipeline.

Great!

In context of <https://gcc.gnu.org/PR115682> 'nvptx vs. "fwprop: invoke
change_is_worthwhile to judge if a replacement is worthwhile"', I've
been looking a bit through recent nvptx target code generation changes
for GCC target libraries, and thought I'd also share here my findings
for the "late-combine" changes in isolation, for nvptx target.

First the unexpected thing: there are a few cases where we now see
unused registers get declared, for example (random) in
'nvptx-none/newlib/libc/libm_a-s_modf.o:modf' (full 'diff' before vs.
after):

[...]
.visible .func (.param .f64 %value_out) modf (.param .f64 %in_ar0, .param .u64 %in_ar1)
{
.reg .f64 %value;
.reg .f64 %ar0;
ld.param.f64 %ar0,[%in_ar0];
.reg .u64 %ar1;
ld.param.u64 %ar1,[%in_ar1];
.reg .u32 %r23;
.reg .f64 %r32;
.reg .u32 %r41;
.reg .u32 %r42;
.reg .f64 %r43;
.reg .u32 %r46;
+.reg .u64 %r48;
+.reg .u64 %r49;
+.reg .u64 %r50;
+.reg .u64 %r51;
+.reg .u64 %r52;
.reg .f64 %r53;
.reg .f64 %r54;
.reg .u64 %r55;
[...]

That is, five additional registers declared, without any use.  I
suppose that's some 'gen_reg_rtx' that needs to be "confined" (in
whichever way; to be looked into).

I suppose that's not actually a problem: the PTX JIT should clean these
out again, but it's still noise when reading GCC-emitted PTX code.

Other than that, I've only got good things to report :-) -- a few
examples in the following, if anyone's interested, without much
commentary.  I haven't looked at how much these "visible" PTX code
generation changes translate into actual GPU SASS code improvements
after the PTX JIT has done its thing, but it's certainly easier to
read/less state to keep during human reading (due to fewer live
registers, primarily).

'nvptx-none/libatomic/cas_16_.o':

[...]
-.reg .u32 %r24;
[...]
setp.eq.u64 %r45,%r40,0;
-selp.u32 %r24,1,0,%r45;
.loc 3 113 6
setp.ne.u64 %r48,%r58,%r60;
@ %r48 bra $L2;
@@ -86,7 +84,7 @@
call libat_unlock_1,(%out_arg1);
}
.loc 3 122 1
-mov.u32 %value,%r24;
+selp.u32 %value,1,0,%r45;
st.param.u32 [%value_out],%value;
ret;
}

A lot more instances of similar patterns.

'nvptx-none/libbacktrace/dwarf.o':

[...]
-.reg .u64 %r695;
[...]
setp.ne.u32 %r679,%r678,0;
-@ ! %r679 bra $L1149;
-add.u64 %r695,%r212,32;
-bra $L1090;
-$L1149:
+@ %r679 bra $L1090;
[...]
$L1090:
.loc 2 4046 7
-mov.u64 %r28,%r695;
+add.u64 %r28,%r212,32;
$L1096:
[...]

'nvptx-none/libbacktrace/dwarf.o':

[...]
-.reg .u64 %r41;
[...]
-add.u64 %r41,%frame,8;
mov.u64 %r43,0;
-st.u64 [%r41],%r43;
-st.u64 [%r41+8],%r43;
-st.u64 [%r41+16],%r43;
+st.u64 [%frame+8],%r43;
+st.u64 [%frame+16],%r43;
+st.u64 [%frame+24],%r43;
[...]

'nvptx-none/libgfortran/generated/cshift1_8.o':

[...]
-.reg .u64 %r293;
[...]
-add.u64 %r293,%r292,120;
-st.u64 [%r293],%r409;
+st.u64 [%r292+120],%r409;
[...]

'nvptx-none/newlib/libc/stdio/libc_a-siscanf.o', have 'st' do 'u32'
truncation instead of explicit 'cvt':

[...]
-.reg .u32 %r23;
[...]
-cvt.u32.u64 %r23,%r32;
-st.u32 [%frame+8],%r23;
+st.u32 [%frame+8],%r32;
[...]

'nvptx-none/newlib/libc/string/libc_a-memccpy.o', simplify (supposedly)
char -> int -> short into char -> short:

[...]
-.reg .u32 %r37;
[...]
-cvt.u32.u8 %r37,%r49;
[...]
-cvt.u16.u32 %r67,%r37;
+cvt.u16.u8 %r67,%r49;
[...]

'nvptx-none/libgcc/crt0.o':

[...]
-.reg .u64 %r30;
[...]
cvta.global.u64 %r29,stack$0+131072; st.shared.u64 [__nvptx_stacks],%r29; -cvta.shared.u64 %r30,__nvptx_uni; mov.u32 %r31,0; -st.u32 [%r30],%r31; +st.shared.u32 [__nvptx_uni],%r31; [...] There are a lot more instances of getting rid of a register in favor of using more complex instructions. 'nvptx-none/libgfortran/generated/findloc2_s4.o': [...] -.reg .u64 %r33; [...] -neg.s64 %r33,%r35; [...] -add.u64 %r39,%r39,%r33; +sub.u64 %r39,%r39,%r35; [...] ..., but in turn also a multi-use variant of the 'neg' (is this still beneficial?), 'nvptx-none/newlib/libc/search/libc_a-bsd_qsort_r.o': [...] -.reg .u64 %r34; [...] -neg.s64 %r34,%r82; -.loc 2 213 9 -add.u64 %r35,%r76,%r34; +sub.u64 %r35,%r76,%r82; [...] -add.u64 %r37,%r71,%r34; -add.u64 %r38,%r37,%r34; +sub.u64 %r37,%r71,%r82; +sub.u64 %r38,%r37,%r82; [...] 'nvptx-none/newlib/libm/complex/libm_a-cephes_subr.o': [...] -.reg .f64 %r35; [...] -neg.f64 %r35,%r27; -fma.rn.f64 %r22,%r35,0d400921fb54000000,%r32; +fma.rn.f64 %r22,%r27,0dc00921fb54000000,%r32; .loc 2 80 21 -fma.rn.f64 %r23,%r35,0d3e210b4610000000,%r22; +fma.rn.f64 %r23,%r27,0dbe210b4610000000,%r22; .loc 2 80 4 -fma.rn.f64 %value,%r35,0d3c6a62633145c06e,%r23; +fma.rn.f64 %value,%r27,0dbc6a62633145c06e,%r23; [...] 'nvptx-none/libgcc/_gcov_info_to_gcda.o': [...] -.reg .u32 %r186; [...] -add.u32 %r186,%r59,%r59; -add.u32 %r187,%r186,%r59; +vadd.u32.u32.u32.add %r187,%r59,%r59,%r59; [...] Grüße Thomas > There are two instances: one between combine and split1, and one > after postreload. > > The pass currently has a single objective: remove definitions by > substituting into all uses. The pre-RA version tries to restrict > itself to cases that are likely to have a neutral or beneficial > effect on register pressure. > > The patch fixes PR106594. It also fixes a few FAILs and XFAILs > in the aarch64 test results, mostly due to making proper use of > MOVPRFX in cases where we didn't previously. I hope it would > also help with Robin's vec_duplicate testcase, although the > pressure heuristic might need tweaking for that case. > > This is just a first step.. I'm hoping that the pass could be > used for other combine-related optimisations in future. In particular, > the post-RA version doesn't need to restrict itself to cases where all > uses are substitutitable, since it doesn't have to worry about register > pressure. If we did that, and if we extended it to handle multi-register > REGs, the pass might be a viable replacement for regcprop, which in > turn might reduce the cost of having a post-RA instance of the new pass. > > I've run an assembly comparison with one target per CPU directory, > and it seems to be a win for all targets except nvptx (which is hard > to measure, being a higher-level asm). The biggest winner seemed > to be AVR. > > I'd originally hoped to enable the pass by default at -O2 and above > on all targets. But in the end, I don't think that's possible, > because it interacts badly with x86's STV and partial register > dependency passes. > > For example, gcc.target/i386/minmax-6.c tests whether the code > compiles without any spilling. 
The RTL created by STV contains: > > (insn 33 31 3 2 (set (subreg:V4SI (reg:SI 120) 0) > (vec_merge:V4SI (vec_duplicate:V4SI (reg:SI 116)) > (const_vector:V4SI [ > (const_int 0 [0]) repeated x4 > ]) > (const_int 1 [0x1]))) -1 > (nil)) > (insn 3 33 34 2 (set (subreg:V4SI (reg:SI 118) 0) > (subreg:V4SI (reg:SI 120) 0)) {movv4si_internal} > (expr_list:REG_DEAD (reg:SI 120) > (nil))) > (insn 34 3 32 2 (set (reg/v:SI 108 [ y ]) > (reg:SI 118)) -1 > (nil)) > > and it's crucial for the test that reg 108 is kept, rather than > propagated into uses. As things stand, 118 can be allocated > a vector register and 108 a scalar register. If 108 is propagated, > there will be scalar and vector uses of 118, and so it will be > spilled to memory. > > That one could be solved by running STV2 later. But RPAD is > a bigger problem. In gcc.target/i386/pr87007-5.c, RPAD converts: > > (insn 27 26 28 6 (set (reg:DF 100 [ _15 ]) > (sqrt:DF (mem/c:DF (symbol_ref:DI ("d2"))))) {*sqrtdf2_sse} > (nil)) > > into: > > (insn 45 26 44 6 (set (reg:V4SF 108) > (const_vector:V4SF [ > (const_double:SF 0.0 [0x0.0p+0]) repeated x4 > ])) -1 > (nil)) > (insn 44 45 27 6 (set (reg:V2DF 109) > (vec_merge:V2DF (vec_duplicate:V2DF (sqrt:DF (mem/c:DF (symbol_ref:DI ("d2"))))) > (subreg:V2DF (reg:V4SF 108) 0) > (const_int 1 [0x1]))) -1 > (nil)) > (insn 27 44 28 6 (set (reg:DF 100 [ _15 ]) > (subreg:DF (reg:V2DF 109) 0)) {*movdf_internal} > (nil)) > > But both the pre-RA and post-RA passes are able to combine these > instructions back to the original form. > > The patch therefore enables the pass by default only on AArch64. > However, I did test the patch with it enabled on x86_64-linux-gnu > as well, which was useful for debugging. > > Bootstrapped & regression-tested on aarch64-linux-gnu and > x86_64-linux-gnu (as posted, with no regressions, and with the > pass enabled by default, with some gcc.target/i386 regressions). > OK to install? > > Richard
Hi! On 2024-06-27T18:49:17+0200, I wrote: > On 2023-10-24T19:49:10+0100, Richard Sandiford <richard.sandiford@arm.com> wrote: >> This patch adds a combine pass that runs late in the pipeline. [After sending, I realized I replied to a previous thread of this work.] > I've beek looking a bit through recent nvptx target code generation > changes for GCC target libraries, and thought I'd also share here my > findings for the "late-combine" changes in isolation, for nvptx target. > > First the unexpected thing: So much for "unexpected thing" -- next level of unexpected here... Appreciated if anyone feels like helping me find my way through this, but I totally understand if you've got other things to do. > there are a few cases where we now see unused > registers get declared, for example (random) in > 'nvptx-none/newlib/libc/libm_a-s_modf.o:modf' I first looked into a simpler case: newlib 'libc/locale/lnumeric.c'. Here we get the following 'diff' for '*.s' for '-fno-late-combine-instructions' vs. (default) '-flate-combine-instructions': .visible .func (.param.u32 %value_out) __numeric_load_locale (.param.u64 %in_ar0, .param.u64 %in_ar1, .param.u64 %in_ar2, .param.u64 %in_ar3) { .reg.u32 %value; .reg.u64 %ar0; ld.param.u64 %ar0, [%in_ar0]; .reg.u64 %ar1; ld.param.u64 %ar1, [%in_ar1]; .reg.u64 %ar2; ld.param.u64 %ar2, [%in_ar2]; .reg.u64 %ar3; ld.param.u64 %ar3, [%in_ar3]; + .reg.u32 %r22; .file 2 "../../../source-gcc/newlib/libc/locale/lnumeric.c" .loc 2 89 1 mov.u32 %value, 0; st.param.u32 [%value_out], %value; ret; } Clearly, '%r22' is unused. However, looking at the source code (manually trimmed): int __numeric_load_locale (struct __locale_t *locale, const char *name , void *f_wctomb, const char *charset) { int ret; struct lc_numeric_T nm; char *bufp = NULL; #ifdef __CYGWIN__ [...] #else /* TODO */ #endif return ret; } ..., and adding '-Wall' (why isn't top-level/newlib build system doing that...): [...] ../../../source-gcc/newlib/libc/locale/lnumeric.c:88:10: warning: ‘ret’ is used uninitialized [-Wuninitialized] 88 | return ret; | ^~~ ../../../source-gcc/newlib/libc/locale/lnumeric.c:48:7: note: ‘ret’ was declared here 48 | int ret; | ^~~ Uh. Given nothing else is going on in that function, I suppose '%r22' relates to the uninitialized 'ret' -- and given undefined behavior, GCC of course is fine to emit an unused 'reg' in that case... But: should we expect '-fno-late-combine-instructions' vs. '-flate-combine-instructions' to behave in the same way? (After all, '%r22' remains unused also with '-flate-combine-instructions', and doesn't need to be emitted.) This could, of course, also be a nvptx back end issue? I'm happy to supply any dump files etc. Also, 'tmp-libc_a-lnumeric.i.xz' is attached if you'd like to reproduce this with your own nvptx target 'cc1': $ [...]/configure --target=nvptx-none --enable-languages=c $ make -j12 all-gcc $ gcc/cc1 -fpreprocessed tmp-libc_a-lnumeric.i -quiet -dumpbase tmp-libc_a-lnumeric.c -dumpbase-ext .c -misa=sm_30 -g -O2 -fno-builtin -o tmp-libc_a-lnumeric.s -fdump-rtl-all # -fno-late-combine-instructions Grüße Thomas
Hi! On 2024-06-27T22:27:21+0200, I wrote: > On 2024-06-27T18:49:17+0200, I wrote: >> On 2023-10-24T19:49:10+0100, Richard Sandiford <richard.sandiford@arm.com> wrote: >>> This patch adds a combine pass that runs late in the pipeline. > > [After sending, I realized I replied to a previous thread of this work.] > >> I've beek looking a bit through recent nvptx target code generation >> changes for GCC target libraries, and thought I'd also share here my >> findings for the "late-combine" changes in isolation, for nvptx target. >> >> First the unexpected thing: > > So much for "unexpected thing" -- next level of unexpected here... > Appreciated if anyone feels like helping me find my way through this, but > I totally understand if you've got other things to do. OK, I found something already. (Unexpectedly quickly...) ;-) >> there are a few cases where we now see unused >> registers get declared, for example (random) in >> 'nvptx-none/newlib/libc/libm_a-s_modf.o:modf' I've now looked into the former one ('tmp-libm_a-s_modf.i.xz' is attached), to avoid... > I first looked into a simpler case: newlib 'libc/locale/lnumeric.c'. > ../../../source-gcc/newlib/libc/locale/lnumeric.c:88:10: warning: ‘ret’ is used uninitialized [-Wuninitialized] > 88 | return ret; > | ^~~ > ../../../source-gcc/newlib/libc/locale/lnumeric.c:48:7: note: ‘ret’ was declared here > 48 | int ret; > | ^~~ > > Uh. Given nothing else is going on in that function, I suppose '%r22' > relates to the uninitialized 'ret' -- and given undefined behavior, GCC > of course is fine to emit an unused 'reg' in that case... ... the undefined behavior here. But in fact, for both cases, the unexpected difference goes away if after 'pass_late_combine' I inject a 'pass_fast_rtl_dce'. That's normally run as part of 'PUSH_INSERT_PASSES_WITHIN (pass_postreload)' -- but that's all not active for nvptx target given '!reload_completed', given nvptx is 'targetm.no_register_allocation'. Maybe we need to enable a few more passes, or is there anything in 'pass_late_combine' to change, so that we don't run into this? Does it inadvertently mark registers live or something like that? The following makes these two cases work, but evidently needs a lot more analysis: a lot of other passes are enabled that may be anything between beneficial and harmful for 'targetm.no_register_allocation'/nvptx. --- gcc/passes.cc +++ gcc/passes.cc @@ -676,17 +676,17 @@ const pass_data pass_data_postreload = class pass_postreload : public rtl_opt_pass { public: pass_postreload (gcc::context *ctxt) : rtl_opt_pass (pass_data_postreload, ctxt) {} /* opt_pass methods: */ - bool gate (function *) final override { return reload_completed; } + bool gate (function *) final override { return reload_completed || targetm.no_register_allocation; } --- gcc/regcprop.cc +++ gcc/regcprop.cc @@ -1305,17 +1305,17 @@ class pass_cprop_hardreg : public rtl_opt_pass public: pass_cprop_hardreg (gcc::context *ctxt) : rtl_opt_pass (pass_data_cprop_hardreg, ctxt) {} /* opt_pass methods: */ bool gate (function *) final override { - return (optimize > 0 && (flag_cprop_registers)); + return (optimize > 0 && flag_cprop_registers && !targetm.no_register_allocation); } Grüße Thomas > But: should we expect '-fno-late-combine-instructions' vs. > '-flate-combine-instructions' to behave in the same way? (After all, > '%r22' remains unused also with '-flate-combine-instructions', and > doesn't need to be emitted.) This could, of course, also be a nvptx back > end issue? > > I'm happy to supply any dump files etc. 
Also, 'tmp-libc_a-lnumeric.i.xz' > is attached if you'd like to reproduce this with your own nvptx target > 'cc1': > > $ [...]/configure --target=nvptx-none --enable-languages=c > $ make -j12 all-gcc > $ gcc/cc1 -fpreprocessed tmp-libc_a-lnumeric.i -quiet -dumpbase tmp-libc_a-lnumeric.c -dumpbase-ext .c -misa=sm_30 -g -O2 -fno-builtin -o tmp-libc_a-lnumeric.s -fdump-rtl-all # -fno-late-combine-instructions > > > Grüße > Thomas
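For concreteness, the injection experiment described above amounts to
something like this in gcc/passes.def (a sketch for debugging only, not
a proposed fix; the placement of the pre-RA late_combine instance
follows the cover letter):

      /* Run fast RTL DCE directly after the pre-RA late_combine
	 instance, to clean up any dead code it leaves behind.  */
      NEXT_PASS (pass_late_combine);
      NEXT_PASS (pass_fast_rtl_dce);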
Hi!

On 2024-06-27T23:20:18+0200, I wrote:
> On 2024-06-27T22:27:21+0200, I wrote:
>> On 2024-06-27T18:49:17+0200, I wrote:
>>> On 2023-10-24T19:49:10+0100, Richard Sandiford <richard.sandiford@arm.com> wrote:
>>>> This patch adds a combine pass that runs late in the pipeline.
>>
>> [After sending, I realized I replied to a previous thread of this work.]
>>
>>> I've beek looking a bit through recent nvptx target code generation
>>> changes for GCC target libraries, and thought I'd also share here my
>>> findings for the "late-combine" changes in isolation, for nvptx target.
>>>
>>> First the unexpected thing:
>>
>> So much for "unexpected thing" -- next level of unexpected here...
>> Appreciated if anyone feels like helping me find my way through this, but
>> I totally understand if you've got other things to do.
>
> OK, I found something already.  (Unexpectedly quickly...)  ;-)
>
>>> there are a few cases where we now see unused
>>> registers get declared

> But in fact, for both cases

Now tested: 's%both%all'.  :-)

> the unexpected difference goes away if after
> 'pass_late_combine' I inject a 'pass_fast_rtl_dce'.  That's normally run
> as part of 'PUSH_INSERT_PASSES_WITHIN (pass_postreload)' -- but that's
> all not active for nvptx target given '!reload_completed', given nvptx is
> 'targetm.no_register_allocation'.  Maybe we need to enable a few more
> passes, or is there anything in 'pass_late_combine' to change, so that we
> don't run into this?  Does it inadvertently mark registers live or
> something like that?

Basically, is 'pass_late_combine' potentially doing things that depend
on later clean-up?  (..., or shouldn't it be doing these things in the
first place?)

> The following makes these two cases work, but evidently needs a lot more
> analysis: a lot of other passes are enabled that may be anything between
> beneficial and harmful for 'targetm.no_register_allocation'/nvptx.
>
> --- gcc/passes.cc
> +++ gcc/passes.cc
> @@ -676,17 +676,17 @@ const pass_data pass_data_postreload =
>  class pass_postreload : public rtl_opt_pass
>  {
>  public:
>    pass_postreload (gcc::context *ctxt)
>      : rtl_opt_pass (pass_data_postreload, ctxt)
>    {}
>
>    /* opt_pass methods: */
> -  bool gate (function *) final override { return reload_completed; }
> +  bool gate (function *) final override { return reload_completed || targetm.no_register_allocation; }
> --- gcc/regcprop.cc
> +++ gcc/regcprop.cc
> @@ -1305,17 +1305,17 @@ class pass_cprop_hardreg : public rtl_opt_pass
>  public:
>    pass_cprop_hardreg (gcc::context *ctxt)
>      : rtl_opt_pass (pass_data_cprop_hardreg, ctxt)
>    {}
>
>    /* opt_pass methods: */
>    bool gate (function *) final override
>    {
> -    return (optimize > 0 && (flag_cprop_registers));
> +    return (optimize > 0 && flag_cprop_registers && !targetm.no_register_allocation);
>    }

Also, that quickly ICEs; more '[...] && !targetm.no_register_allocation'
are needed elsewhere, at least.

The following simpler thing, however, does work; move
'pass_fast_rtl_dce' out of 'pass_postreload':

--- gcc/passes.cc
+++ gcc/passes.cc
@@ -677,14 +677,15 @@ class pass_postreload : public rtl_opt_pass
 {
 public:
   pass_postreload (gcc::context *ctxt)
     : rtl_opt_pass (pass_data_postreload, ctxt)
   {}

   /* opt_pass methods: */
+  opt_pass * clone () final override { return new pass_postreload (m_ctxt); }
   bool gate (function *) final override { return reload_completed; }
 }; // class pass_postreload

--- gcc/passes.def
+++ gcc/passes.def
@@ -529,7 +529,10 @@ along with GCC; see the file COPYING3.
If not see NEXT_PASS (pass_regrename); NEXT_PASS (pass_fold_mem_offsets); NEXT_PASS (pass_cprop_hardreg); - NEXT_PASS (pass_fast_rtl_dce); + POP_INSERT_PASSES () + NEXT_PASS (pass_fast_rtl_dce); + NEXT_PASS (pass_postreload); + PUSH_INSERT_PASSES_WITHIN (pass_postreload) NEXT_PASS (pass_reorder_blocks); NEXT_PASS (pass_leaf_regs); NEXT_PASS (pass_split_before_sched2); This (only) cleans up "the mess that 'pass_late_combine' created"; no further changes in GCC target libraries for nvptx. (For avoidance of doubt: "mess" is a great exaggeration here.) Grüße Thomas >> But: should we expect '-fno-late-combine-instructions' vs. >> '-flate-combine-instructions' to behave in the same way? (After all, >> '%r22' remains unused also with '-flate-combine-instructions', and >> doesn't need to be emitted.) This could, of course, also be a nvptx back >> end issue? >> >> I'm happy to supply any dump files etc. Also, 'tmp-libc_a-lnumeric.i.xz' >> is attached if you'd like to reproduce this with your own nvptx target >> 'cc1': >> >> $ [...]/configure --target=nvptx-none --enable-languages=c >> $ make -j12 all-gcc >> $ gcc/cc1 -fpreprocessed tmp-libc_a-lnumeric.i -quiet -dumpbase tmp-libc_a-lnumeric.c -dumpbase-ext .c -misa=sm_30 -g -O2 -fno-builtin -o tmp-libc_a-lnumeric.s -fdump-rtl-all # -fno-late-combine-instructions >> >> >> Grüße >> Thomas
Hi Thomas,

There are two things I think I can contribute to this discussion.

The first is that I have a patch (from a year or two ago) for adding
rtx_costs to the nvptx backend that I will respin, which will provide
more backend control over combine-like pass decisions.

The second is in response to your comment below; nvptx is currently
GCC's only targetm.no_register_allocation target, and I think this
provides an opportunity to tweak/change these semantics.  Currently
"no_register_allocation" turns off a whole bunch of post-reload passes
(including things like peephole2 and predication that would be
extremely useful on nvptx).  My thoughts were that it would be nice if
no_register_allocation turned off just/only the ira/reload passes,
where the gate function would instead simply set reload_completed.
One major advantage of this approach is that it would minimize the
divergence in code paths between nvptx and all of GCC's other
backends, which would hopefully help with things like "late-combine"
that may implicitly (and perhaps unintentionally) require/expect
certain passes to be run later.

Again I've investigated a patch/change or two towards the above
approach, but it's slightly more complicated than I'd anticipated, due
to things like late nvptx-specific passes assuming they can call
gen_reg_rtx, after other targets have no_new_pseudos.  There's nothing
that's impossible, but it would take a concerted effort by target and
middle-end maintainers to align a shallower "no_register_allocation".

I'm curious what folks think.

Best regards,
Roger
--

> -----Original Message-----
> From: Thomas Schwinge <tschwinge@baylibre.com>
> Sent: 27 June 2024 22:20
> To: Richard Sandiford <richard.sandiford@arm.com>
> Cc: jlaw@ventanamicro.com; rdapp.gcc@gmail.com; gcc-patches@gcc.gnu.org;
> Tom de Vries <tdevries@suse.de>; Roger Sayle <roger@nextmovesoftware.com>
> Subject: Re: nvptx vs. [PATCH] Add a late-combine pass [PR106594]
>
> [...]
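A rough sketch of the "shallower" no_register_allocation Roger
describes might look like the following (purely hypothetical --
pass_ira has no such gate today, and as noted above the real change
would be more involved):

    /* Hypothetical gate for the IRA pass: on no-RA targets, skip IRA
       and reload themselves, but mark register allocation as complete
       so that the post-reload part of the pipeline still runs.  */
    bool
    pass_ira::gate (function *)
    {
      if (targetm.no_register_allocation)
        {
          reload_completed = 1;
          return false;
        }
      return true;
    }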
Thomas Schwinge <tschwinge@baylibre.com> writes: > Hi! > > On 2024-06-27T23:20:18+0200, I wrote: >> On 2024-06-27T22:27:21+0200, I wrote: >>> On 2024-06-27T18:49:17+0200, I wrote: >>>> On 2023-10-24T19:49:10+0100, Richard Sandiford <richard.sandiford@arm.com> wrote: >>>>> This patch adds a combine pass that runs late in the pipeline. >>> >>> [After sending, I realized I replied to a previous thread of this work.] >>> >>>> I've beek looking a bit through recent nvptx target code generation >>>> changes for GCC target libraries, and thought I'd also share here my >>>> findings for the "late-combine" changes in isolation, for nvptx target. >>>> >>>> First the unexpected thing: >>> >>> So much for "unexpected thing" -- next level of unexpected here... >>> Appreciated if anyone feels like helping me find my way through this, but >>> I totally understand if you've got other things to do. >> >> OK, I found something already. (Unexpectedly quickly...) ;-) >> >>>> there are a few cases where we now see unused >>>> registers get declared > >> But in fact, for both cases > > Now tested: 's%both%all'. :-) > >> the unexpected difference goes away if after >> 'pass_late_combine' I inject a 'pass_fast_rtl_dce'. That's normally run >> as part of 'PUSH_INSERT_PASSES_WITHIN (pass_postreload)' -- but that's >> all not active for nvptx target given '!reload_completed', given nvptx is >> 'targetm.no_register_allocation'. Maybe we need to enable a few more >> passes, or is there anything in 'pass_late_combine' to change, so that we >> don't run into this? Does it inadvertently mark registers live or >> something like that? > > Basically, is 'pass_late_combine' potentionally doing things that depend > on later clean-up? (..., or shouldn't it be doing these things in the > first place?) It's possible that late-combine could expose dead code, but I imagine it's a niche case. I had a look at the nvptx logs from my comparison, and the cases in which I saw this seemed to be those where late-combine doesn't find anything to do. Does that match your examples? Specifically, the effect should be the same with -fdbg-cnt=late_combine:0-0 I think what's happening is that: - combine exposes dead code - ce2 previously ran df_analyze with DF_LR_RUN_DCE set, and so cleared up the dead code - late-combine instead runs df_analyze without that flag (since late-combine itself doesn't really care whether dead code is present) - if late-combine doesn't do anything, ce2's df_analyze call has nothing to do, and skips even the DCE The easiest fix would be to add: df_set_flags (DF_LR_RUN_DCE); before df_analyze in late-combine.cc, so that it behaves like ce2. But the arrangement feels wrong. I would have expected DF_LR_RUN_DCE to depend on whether df_analyze had been called since the last DCE pass (whether DF_LR_RUN_DCE or a full DCE). Thanks, Richard
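For reference, the "easiest fix" mentioned above would be a two-liner
in late-combine.cc (a sketch, assuming the pass calls df_analyze
directly from its execute path):

    /* Mirror ce2: let df_analyze also run fast DCE, so dead code
       exposed by earlier passes is eliminated even when late-combine
       itself finds nothing to do.  */
    df_set_flags (DF_LR_RUN_DCE);
    df_analyze ();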
Richard Sandiford <richard.sandiford@arm.com> writes: > Thomas Schwinge <tschwinge@baylibre.com> writes: >> Hi! >> >> On 2024-06-27T23:20:18+0200, I wrote: >>> On 2024-06-27T22:27:21+0200, I wrote: >>>> On 2024-06-27T18:49:17+0200, I wrote: >>>>> On 2023-10-24T19:49:10+0100, Richard Sandiford <richard.sandiford@arm.com> wrote: >>>>>> This patch adds a combine pass that runs late in the pipeline. >>>> >>>> [After sending, I realized I replied to a previous thread of this work.] >>>> >>>>> I've beek looking a bit through recent nvptx target code generation >>>>> changes for GCC target libraries, and thought I'd also share here my >>>>> findings for the "late-combine" changes in isolation, for nvptx target. >>>>> >>>>> First the unexpected thing: >>>> >>>> So much for "unexpected thing" -- next level of unexpected here... >>>> Appreciated if anyone feels like helping me find my way through this, but >>>> I totally understand if you've got other things to do. >>> >>> OK, I found something already. (Unexpectedly quickly...) ;-) >>> >>>>> there are a few cases where we now see unused >>>>> registers get declared >> >>> But in fact, for both cases >> >> Now tested: 's%both%all'. :-) >> >>> the unexpected difference goes away if after >>> 'pass_late_combine' I inject a 'pass_fast_rtl_dce'. That's normally run >>> as part of 'PUSH_INSERT_PASSES_WITHIN (pass_postreload)' -- but that's >>> all not active for nvptx target given '!reload_completed', given nvptx is >>> 'targetm.no_register_allocation'. Maybe we need to enable a few more >>> passes, or is there anything in 'pass_late_combine' to change, so that we >>> don't run into this? Does it inadvertently mark registers live or >>> something like that? >> >> Basically, is 'pass_late_combine' potentionally doing things that depend >> on later clean-up? (..., or shouldn't it be doing these things in the >> first place?) > > It's possible that late-combine could expose dead code, but I imagine > it's a niche case. > > I had a look at the nvptx logs from my comparison, and the cases in > which I saw this seemed to be those where late-combine doesn't find > anything to do. Does that match your examples? Specifically, > the effect should be the same with -fdbg-cnt=late_combine:0-0 > > I think what's happening is that: > > - combine exposes dead code > > - ce2 previously ran df_analyze with DF_LR_RUN_DCE set, and so cleared > up the dead code > > - late-combine instead runs df_analyze without that flag (since late-combine > itself doesn't really care whether dead code is present) > > - if late-combine doesn't do anything, ce2's df_analyze call has nothing > to do, and skips even the DCE > > The easiest fix would be to add: > > df_set_flags (DF_LR_RUN_DCE); > > before df_analyze in late-combine.cc, so that it behaves like ce2. > But the arrangement feels wrong. I would have expected DF_LR_RUN_DCE > to depend on whether df_analyze had been called since the last DCE pass > (whether DF_LR_RUN_DCE or a full DCE). I'm testing the attached patch to do that. I'll submit it properly if testing passes, but it seems to fix the extra-register problem for me. Thanks, Richard --- Give fast DCE a separate dirty flag Thomas pointed out that we sometimes failed to eliminate some dead code (specifically clobbers of otherwise unused registers) on nvptx when late-combine is enabled. This happens because: - combine is able to optimise the function in a way that exposes dead code. This leaves the df information in a "dirty" state. 
- late_combine calls df_analyze without DF_LR_RUN_DCE run set. This updates the df information and clears the "dirty" state. - late_combine doesn't find any extra optimisations, and so leaves the df information up-to-date. - if_after_combine (ce2) calls df_analyze with DF_LR_RUN_DCE set. Because the df information is already up-to-date, fast DCE is not run. The upshot is that running late-combine has the effect of suppressing a DCE opportunity that would have been noticed without late_combine. I think this shows that we should track the state of the DCE separately from the LR problem. Every pass updates the latter, but not all passes update the former. gcc/ * df.h (DF_LR_DCE): New df_problem_id. (df_lr_dce): New macro. * df-core.cc (rest_of_handle_df_finish): Check for a null free_fun. * df-problems.cc (df_lr_finalize): Split out fast DCE handling to... (df_lr_dce_finalize): ...this new function. (problem_LR_DCE): New df_problem. (df_lr_add_problem): Register LR_DCE rather than LR itself. * dce.cc (fast_dce): Clear df_lr_dce->solutions_dirty. --- gcc/dce.cc | 3 ++ gcc/df-core.cc | 3 +- gcc/df-problems.cc | 96 ++++++++++++++++++++++++++++++++-------------- gcc/df.h | 2 + 4 files changed, 74 insertions(+), 30 deletions(-) diff --git a/gcc/dce.cc b/gcc/dce.cc index be1a2a87732..04e8d98818d 100644 --- a/gcc/dce.cc +++ b/gcc/dce.cc @@ -1182,6 +1182,9 @@ fast_dce (bool word_level) BITMAP_FREE (processed); BITMAP_FREE (redo_out); BITMAP_FREE (all_blocks); + + /* Both forms of DCE should make further DCE unnecessary. */ + df_lr_dce->solutions_dirty = false; } diff --git a/gcc/df-core.cc b/gcc/df-core.cc index b0e8a88d433..8fd778a8618 100644 --- a/gcc/df-core.cc +++ b/gcc/df-core.cc @@ -806,7 +806,8 @@ rest_of_handle_df_finish (void) for (i = 0; i < df->num_problems_defined; i++) { struct dataflow *dflow = df->problems_in_order[i]; - dflow->problem->free_fun (); + if (dflow->problem->free_fun) + dflow->problem->free_fun (); } free (df->postorder); diff --git a/gcc/df-problems.cc b/gcc/df-problems.cc index 88ee0dd67fc..bfd24bd1e86 100644 --- a/gcc/df-problems.cc +++ b/gcc/df-problems.cc @@ -1054,37 +1054,10 @@ df_lr_transfer_function (int bb_index) } -/* Run the fast dce as a side effect of building LR. */ - static void -df_lr_finalize (bitmap all_blocks) +df_lr_finalize (bitmap) { df_lr->solutions_dirty = false; - if (df->changeable_flags & DF_LR_RUN_DCE) - { - run_fast_df_dce (); - - /* If dce deletes some instructions, we need to recompute the lr - solution before proceeding further. The problem is that fast - dce is a pessimestic dataflow algorithm. In the case where - it deletes a statement S inside of a loop, the uses inside of - S may not be deleted from the dataflow solution because they - were carried around the loop. While it is conservatively - correct to leave these extra bits, the standards of df - require that we maintain the best possible (least fixed - point) solution. The only way to do that is to redo the - iteration from the beginning. See PR35805 for an - example. */ - if (df_lr->solutions_dirty) - { - df_clear_flags (DF_LR_RUN_DCE); - df_lr_alloc (all_blocks); - df_lr_local_compute (all_blocks); - df_worklist_dataflow (df_lr, all_blocks, df->postorder, df->n_blocks); - df_lr_finalize (all_blocks); - df_set_flags (DF_LR_RUN_DCE); - } - } } @@ -1266,6 +1239,69 @@ static const struct df_problem problem_LR = false /* Reset blocks on dropping out of blocks_to_analyze. */ }; +/* Run the fast DCE after building LR. 
This is a separate problem so that + the "dirty" flag is only cleared after a DCE pass is actually run. */ + +static void +df_lr_dce_finalize (bitmap all_blocks) +{ + if (!(df->changeable_flags & DF_LR_RUN_DCE)) + return; + + /* Also clears df_lr_dce->solutions_dirty. */ + run_fast_df_dce (); + + /* If dce deletes some instructions, we need to recompute the lr + solution before proceeding further. The problem is that fast + dce is a pessimestic dataflow algorithm. In the case where + it deletes a statement S inside of a loop, the uses inside of + S may not be deleted from the dataflow solution because they + were carried around the loop. While it is conservatively + correct to leave these extra bits, the standards of df + require that we maintain the best possible (least fixed + point) solution. The only way to do that is to redo the + iteration from the beginning. See PR35805 for an + example. */ + if (df_lr->solutions_dirty) + { + df_clear_flags (DF_LR_RUN_DCE); + df_lr_alloc (all_blocks); + df_lr_local_compute (all_blocks); + df_worklist_dataflow (df_lr, all_blocks, df->postorder, df->n_blocks); + df_lr_finalize (all_blocks); + df_set_flags (DF_LR_RUN_DCE); + } +} + +static const struct df_problem problem_LR_DCE +{ + DF_LR_DCE, /* Problem id. */ + DF_BACKWARD, /* Direction (arbitrary). */ + NULL, /* Allocate the problem specific data. */ + NULL, /* Reset global information. */ + NULL, /* Free basic block info. */ + NULL, /* Local compute function. */ + NULL, /* Init the solution specific data. */ + NULL, /* Worklist solver. */ + NULL, /* Confluence operator 0. */ + NULL, /* Confluence operator n. */ + NULL, /* Transfer function. */ + df_lr_dce_finalize, /* Finalize function. */ + NULL, /* Free all of the problem information. */ + NULL, /* Remove this problem from the stack of dataflow problems. */ + NULL, /* Debugging. */ + NULL, /* Debugging start block. */ + NULL, /* Debugging end block. */ + NULL, /* Debugging start insn. */ + NULL, /* Debugging end insn. */ + NULL, /* Incremental solution verify start. */ + NULL, /* Incremental solution verify end. */ + &problem_LR, /* Dependent problem. */ + 0, /* Size of entry of block_info array. */ + TV_DF_LR, /* Timing variable. */ + false /* Reset blocks on dropping out of blocks_to_analyze. */ +}; + /* Create a new DATAFLOW instance and add it to an existing instance of DF. The returned structure is what is used to get at the @@ -1274,7 +1310,9 @@ static const struct df_problem problem_LR = void df_lr_add_problem (void) { - df_add_problem (&problem_LR); + /* Also add the fast DCE problem. It is then conditionally enabled by + the DF_LR_RUN_DCE flag. */ + df_add_problem (&problem_LR_DCE); /* These will be initialized when df_scan_blocks processes each block. */ df_lr->out_of_date_transfer_functions = BITMAP_ALLOC (&df_bitmap_obstack); diff --git a/gcc/df.h b/gcc/df.h index c4e690b40cf..bbb4f297b47 100644 --- a/gcc/df.h +++ b/gcc/df.h @@ -47,6 +47,7 @@ enum df_problem_id { DF_SCAN, DF_LR, /* Live Registers backward. */ + DF_LR_DCE, /* Dead code elimination post-pass for LR. */ DF_LIVE, /* Live Registers & Uninitialized Registers */ DF_RD, /* Reaching Defs. */ DF_CHAIN, /* Def-Use and/or Use-Def Chains. 
*/ @@ -940,6 +941,7 @@ extern class df_d *df; #define df_scan (df->problems_by_index[DF_SCAN]) #define df_rd (df->problems_by_index[DF_RD]) #define df_lr (df->problems_by_index[DF_LR]) +#define df_lr_dce (df->problems_by_index[DF_LR_DCE]) #define df_live (df->problems_by_index[DF_LIVE]) #define df_chain (df->problems_by_index[DF_CHAIN]) #define df_word_lr (df->problems_by_index[DF_WORD_LR])
Hi Richard! On 2024-06-28T17:48:30+0100, Richard Sandiford <richard.sandiford@arm.com> wrote: > Richard Sandiford <richard.sandiford@arm.com> writes: >> Thomas Schwinge <tschwinge@baylibre.com> writes: >>> On 2024-06-27T23:20:18+0200, I wrote: >>>> On 2024-06-27T22:27:21+0200, I wrote: >>>>> On 2024-06-27T18:49:17+0200, I wrote: >>>>>> On 2023-10-24T19:49:10+0100, Richard Sandiford <richard.sandiford@arm.com> wrote: >>>>>>> This patch adds a combine pass that runs late in the pipeline. >>>>> >>>>> [After sending, I realized I replied to a previous thread of this work.] >>>>> >>>>>> I've beek looking a bit through recent nvptx target code generation >>>>>> changes for GCC target libraries, and thought I'd also share here my >>>>>> findings for the "late-combine" changes in isolation, for nvptx target. >>>>>> >>>>>> First the unexpected thing: >>>>> >>>>> So much for "unexpected thing" -- next level of unexpected here... >>>>> Appreciated if anyone feels like helping me find my way through this, but >>>>> I totally understand if you've got other things to do. >>>> >>>> OK, I found something already. (Unexpectedly quickly...) ;-) >>>> >>>>>> there are a few cases where we now see unused >>>>>> registers get declared >>> >>>> But in fact, for both cases >>> >>> Now tested: 's%both%all'. :-) >>> >>>> the unexpected difference goes away if after >>>> 'pass_late_combine' I inject a 'pass_fast_rtl_dce'. That's normally run >>>> as part of 'PUSH_INSERT_PASSES_WITHIN (pass_postreload)' -- but that's >>>> all not active for nvptx target given '!reload_completed', given nvptx is >>>> 'targetm.no_register_allocation'. Maybe we need to enable a few more >>>> passes, or is there anything in 'pass_late_combine' to change, so that we >>>> don't run into this? Does it inadvertently mark registers live or >>>> something like that? >>> >>> Basically, is 'pass_late_combine' potentionally doing things that depend >>> on later clean-up? (..., or shouldn't it be doing these things in the >>> first place?) >> >> It's possible that late-combine could expose dead code, but I imagine >> it's a niche case. >> >> I had a look at the nvptx logs from my comparison, and the cases in >> which I saw this seemed to be those where late-combine doesn't find >> anything to do. Does that match your examples? Specifically, >> the effect should be the same with -fdbg-cnt=late_combine:0-0 >> >> I think what's happening is that: >> >> - combine exposes dead code >> >> - ce2 previously ran df_analyze with DF_LR_RUN_DCE set, and so cleared >> up the dead code >> >> - late-combine instead runs df_analyze without that flag (since late-combine >> itself doesn't really care whether dead code is present) >> >> - if late-combine doesn't do anything, ce2's df_analyze call has nothing >> to do, and skips even the DCE >> >> The easiest fix would be to add: >> >> df_set_flags (DF_LR_RUN_DCE); >> >> before df_analyze in late-combine.cc, so that it behaves like ce2. >> But the arrangement feels wrong. I would have expected DF_LR_RUN_DCE >> to depend on whether df_analyze had been called since the last DCE pass >> (whether DF_LR_RUN_DCE or a full DCE). > > I'm testing the attached patch to do that. I'll submit it properly if > testing passes, but it seems to fix the extra-register problem for me. 
> Give fast DCE a separate dirty flag

Thanks, and yes, your analysis makes sense to me (to the extent that I
only superficially understand these parts of GCC) -- and I confirm that
your proposed change to "Give fast DCE a separate dirty flag" does
address the issue for the nvptx target.

Regards
Thomas


> Thomas pointed out that we sometimes failed to eliminate some dead code
> (specifically clobbers of otherwise unused registers) on nvptx when
> late-combine is enabled.  This happens because:
>
> - combine is able to optimise the function in a way that exposes dead code.
>   This leaves the df information in a "dirty" state.
>
> - late_combine calls df_analyze without DF_LR_RUN_DCE set.
>   This updates the df information and clears the "dirty" state.
>
> - late_combine doesn't find any extra optimisations, and so leaves
>   the df information up-to-date.
>
> - if_after_combine (ce2) calls df_analyze with DF_LR_RUN_DCE set.
>   Because the df information is already up-to-date, fast DCE is
>   not run.
>
> The upshot is that running late-combine has the effect of suppressing
> a DCE opportunity that would have been noticed without late_combine.
>
> I think this shows that we should track the state of the DCE separately
> from the LR problem.  Every pass updates the latter, but not all passes
> update the former.
>
> gcc/
> 	* df.h (DF_LR_DCE): New df_problem_id.
> 	(df_lr_dce): New macro.
> 	* df-core.cc (rest_of_handle_df_finish): Check for a null free_fun.
> 	* df-problems.cc (df_lr_finalize): Split out fast DCE handling to...
> 	(df_lr_dce_finalize): ...this new function.
> 	(problem_LR_DCE): New df_problem.
> 	(df_lr_add_problem): Register LR_DCE rather than LR itself.
> 	* dce.cc (fast_dce): Clear df_lr_dce->solutions_dirty.
> ---
>  gcc/dce.cc         |  3 ++
>  gcc/df-core.cc     |  3 +-
>  gcc/df-problems.cc | 96 ++++++++++++++++++++++++++++++++--------------
>  gcc/df.h           |  2 +
>  4 files changed, 74 insertions(+), 30 deletions(-)
>
> diff --git a/gcc/dce.cc b/gcc/dce.cc
> index be1a2a87732..04e8d98818d 100644
> --- a/gcc/dce.cc
> +++ b/gcc/dce.cc
> @@ -1182,6 +1182,9 @@ fast_dce (bool word_level)
>    BITMAP_FREE (processed);
>    BITMAP_FREE (redo_out);
>    BITMAP_FREE (all_blocks);
> +
> +  /* Both forms of DCE should make further DCE unnecessary.  */
> +  df_lr_dce->solutions_dirty = false;
>  }
>
>
> diff --git a/gcc/df-core.cc b/gcc/df-core.cc
> index b0e8a88d433..8fd778a8618 100644
> --- a/gcc/df-core.cc
> +++ b/gcc/df-core.cc
> @@ -806,7 +806,8 @@ rest_of_handle_df_finish (void)
>    for (i = 0; i < df->num_problems_defined; i++)
>      {
>        struct dataflow *dflow = df->problems_in_order[i];
> -      dflow->problem->free_fun ();
> +      if (dflow->problem->free_fun)
> +	dflow->problem->free_fun ();
>      }
>
>    free (df->postorder);
> diff --git a/gcc/df-problems.cc b/gcc/df-problems.cc
> index 88ee0dd67fc..bfd24bd1e86 100644
> --- a/gcc/df-problems.cc
> +++ b/gcc/df-problems.cc
> @@ -1054,37 +1054,10 @@ df_lr_transfer_function (int bb_index)
>  }
>
>
> -/* Run the fast dce as a side effect of building LR.  */
> -
>  static void
> -df_lr_finalize (bitmap all_blocks)
> +df_lr_finalize (bitmap)
>  {
>    df_lr->solutions_dirty = false;
> -  if (df->changeable_flags & DF_LR_RUN_DCE)
> -    {
> -      run_fast_df_dce ();
> -
> -      /* If dce deletes some instructions, we need to recompute the lr
> -	 solution before proceeding further.  The problem is that fast
> -	 dce is a pessimestic dataflow algorithm.  In the case where
> -	 it deletes a statement S inside of a loop, the uses inside of
> -	 S may not be deleted from the dataflow solution because they
> -	 were carried around the loop.  While it is conservatively
> -	 correct to leave these extra bits, the standards of df
> -	 require that we maintain the best possible (least fixed
> -	 point) solution.  The only way to do that is to redo the
> -	 iteration from the beginning.  See PR35805 for an
> -	 example.  */
> -      if (df_lr->solutions_dirty)
> -	{
> -	  df_clear_flags (DF_LR_RUN_DCE);
> -	  df_lr_alloc (all_blocks);
> -	  df_lr_local_compute (all_blocks);
> -	  df_worklist_dataflow (df_lr, all_blocks, df->postorder, df->n_blocks);
> -	  df_lr_finalize (all_blocks);
> -	  df_set_flags (DF_LR_RUN_DCE);
> -	}
> -    }
>  }
>
>
> @@ -1266,6 +1239,69 @@ static const struct df_problem problem_LR =
>    false                       /* Reset blocks on dropping out of blocks_to_analyze.  */
>  };
>
> +/* Run the fast DCE after building LR.  This is a separate problem so that
> +   the "dirty" flag is only cleared after a DCE pass is actually run.  */
> +
> +static void
> +df_lr_dce_finalize (bitmap all_blocks)
> +{
> +  if (!(df->changeable_flags & DF_LR_RUN_DCE))
> +    return;
> +
> +  /* Also clears df_lr_dce->solutions_dirty.  */
> +  run_fast_df_dce ();
> +
> +  /* If dce deletes some instructions, we need to recompute the lr
> +     solution before proceeding further.  The problem is that fast
> +     dce is a pessimistic dataflow algorithm.  In the case where
> +     it deletes a statement S inside of a loop, the uses inside of
> +     S may not be deleted from the dataflow solution because they
> +     were carried around the loop.  While it is conservatively
> +     correct to leave these extra bits, the standards of df
> +     require that we maintain the best possible (least fixed
> +     point) solution.  The only way to do that is to redo the
> +     iteration from the beginning.  See PR35805 for an
> +     example.  */
> +  if (df_lr->solutions_dirty)
> +    {
> +      df_clear_flags (DF_LR_RUN_DCE);
> +      df_lr_alloc (all_blocks);
> +      df_lr_local_compute (all_blocks);
> +      df_worklist_dataflow (df_lr, all_blocks, df->postorder, df->n_blocks);
> +      df_lr_finalize (all_blocks);
> +      df_set_flags (DF_LR_RUN_DCE);
> +    }
> +}
> +
> +static const struct df_problem problem_LR_DCE
> +{
> +  DF_LR_DCE,                  /* Problem id.  */
> +  DF_BACKWARD,                /* Direction (arbitrary).  */
> +  NULL,                       /* Allocate the problem specific data.  */
> +  NULL,                       /* Reset global information.  */
> +  NULL,                       /* Free basic block info.  */
> +  NULL,                       /* Local compute function.  */
> +  NULL,                       /* Init the solution specific data.  */
> +  NULL,                       /* Worklist solver.  */
> +  NULL,                       /* Confluence operator 0.  */
> +  NULL,                       /* Confluence operator n.  */
> +  NULL,                       /* Transfer function.  */
> +  df_lr_dce_finalize,         /* Finalize function.  */
> +  NULL,                       /* Free all of the problem information.  */
> +  NULL,                       /* Remove this problem from the stack of dataflow problems.  */
> +  NULL,                       /* Debugging.  */
> +  NULL,                       /* Debugging start block.  */
> +  NULL,                       /* Debugging end block.  */
> +  NULL,                       /* Debugging start insn.  */
> +  NULL,                       /* Debugging end insn.  */
> +  NULL,                       /* Incremental solution verify start.  */
> +  NULL,                       /* Incremental solution verify end.  */
> +  &problem_LR,                /* Dependent problem.  */
> +  0,                          /* Size of entry of block_info array.  */
> +  TV_DF_LR,                   /* Timing variable.  */
> +  false                       /* Reset blocks on dropping out of blocks_to_analyze.  */
> +};
> +
>
>  /* Create a new DATAFLOW instance and add it to an existing instance
>     of DF.  The returned structure is what is used to get at the
> @@ -1274,7 +1310,9 @@ static const struct df_problem problem_LR =
>  void
>  df_lr_add_problem (void)
>  {
> -  df_add_problem (&problem_LR);
> +  /* Also add the fast DCE problem.  It is then conditionally enabled by
> +     the DF_LR_RUN_DCE flag.  */
> +  df_add_problem (&problem_LR_DCE);
>    /* These will be initialized when df_scan_blocks processes each
>       block.  */
>    df_lr->out_of_date_transfer_functions = BITMAP_ALLOC (&df_bitmap_obstack);
> diff --git a/gcc/df.h b/gcc/df.h
> index c4e690b40cf..bbb4f297b47 100644
> --- a/gcc/df.h
> +++ b/gcc/df.h
> @@ -47,6 +47,7 @@ enum df_problem_id
>  {
>    DF_SCAN,
>    DF_LR,                /* Live Registers backward.  */
> +  DF_LR_DCE,            /* Dead code elimination post-pass for LR.  */
>    DF_LIVE,              /* Live Registers & Uninitialized Registers */
>    DF_RD,                /* Reaching Defs.  */
>    DF_CHAIN,             /* Def-Use and/or Use-Def Chains.  */
> @@ -940,6 +941,7 @@ extern class df_d *df;
>  #define df_scan    (df->problems_by_index[DF_SCAN])
>  #define df_rd      (df->problems_by_index[DF_RD])
>  #define df_lr      (df->problems_by_index[DF_LR])
> +#define df_lr_dce  (df->problems_by_index[DF_LR_DCE])
>  #define df_live    (df->problems_by_index[DF_LIVE])
>  #define df_chain   (df->problems_by_index[DF_CHAIN])
>  #define df_word_lr (df->problems_by_index[DF_WORD_LR])
> -- 
> 2.25.1
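The interaction above reduces to a small state machine.  The following
toy model is a minimal sketch, not GCC code: Problem, analyze_without_dce
and analyze_with_dce are invented names standing in for the df machinery,
purely to illustrate why a single shared dirty flag loses the DCE
opportunity while a second, separately tracked flag preserves it.

    // Toy model of the "solutions_dirty" interaction described above.
    // Not GCC code: all names here are invented for illustration.
    #include <cstdio>

    struct Problem
    {
      bool lr_dirty = true;   // Is the LR solution stale?
      bool dce_dirty = true;  // Separate flag: does DCE still have work to do?
    };

    // Models late_combine calling df_analyze without DF_LR_RUN_DCE:
    // it refreshes LR but must not mark the DCE work as done.
    static void analyze_without_dce (Problem &p)
    {
      p.lr_dirty = false;
    }

    // Models ce2 calling df_analyze with DF_LR_RUN_DCE: DCE runs whenever
    // its own flag says there is outstanding work, even if LR is up to date.
    static void analyze_with_dce (Problem &p)
    {
      if (p.dce_dirty)
        {
          std::printf ("running fast DCE\n");
          p.dce_dirty = false;
        }
      p.lr_dirty = false;
    }

    int main ()
    {
      Problem p;
      analyze_without_dce (p);  // late_combine's df_analyze call.
      analyze_with_dce (p);     // ce2 still runs DCE, thanks to the second flag.
    }

With a single shared flag, the first call would have cleared it and the
second call would conclude there was nothing left to do, which is exactly
the suppressed-DCE symptom reported for nvptx.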
diff --git a/gcc/Makefile.in b/gcc/Makefile.in
index 91d6bfbea4d..b43fd6e8df1 100644
--- a/gcc/Makefile.in
+++ b/gcc/Makefile.in
@@ -1554,6 +1554,7 @@ OBJS = \
 	ira-lives.o \
 	jump.o \
 	langhooks.o \
+	late-combine.o \
 	lcm.o \
 	lists.o \
 	loop-doloop.o \
diff --git a/gcc/common.opt b/gcc/common.opt
index 1cf3bdd3b51..306b11f91c7 100644
--- a/gcc/common.opt
+++ b/gcc/common.opt
@@ -1775,6 +1775,11 @@ Common Var(flag_large_source_files) Init(0)
 Improve GCC's ability to track column numbers in large source files,
 at the expense of slower compilation.
 
+flate-combine-instructions
+Common Var(flag_late_combine_instructions) Optimization Init(0)
+Run two instruction combination passes late in the pass pipeline;
+one before register allocation and one after.
+
 floop-parallelize-all
 Common Var(flag_loop_parallelize_all) Optimization
 Mark all loops as parallel.
diff --git a/gcc/common/config/aarch64/aarch64-common.cc b/gcc/common/config/aarch64/aarch64-common.cc
index 20bc4e1291b..05647e0c93a 100644
--- a/gcc/common/config/aarch64/aarch64-common.cc
+++ b/gcc/common/config/aarch64/aarch64-common.cc
@@ -55,6 +55,7 @@ static const struct default_options aarch_option_optimization_table[] =
     { OPT_LEVELS_1_PLUS, OPT_fsched_pressure, NULL, 1 },
     /* Enable redundant extension instructions removal at -O2 and higher.  */
     { OPT_LEVELS_2_PLUS, OPT_free, NULL, 1 },
+    { OPT_LEVELS_2_PLUS, OPT_flate_combine_instructions, NULL, 1 },
 #if (TARGET_DEFAULT_ASYNC_UNWIND_TABLES == 1)
     { OPT_LEVELS_ALL, OPT_fasynchronous_unwind_tables, NULL, 1 },
     { OPT_LEVELS_ALL, OPT_funwind_tables, NULL, 1},
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 5a9284d635c..d0576ac97cf 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -562,7 +562,7 @@ Objective-C and Objective-C++ Dialects}.
 -fipa-bit-cp  -fipa-vrp  -fipa-pta  -fipa-profile  -fipa-pure-const
 -fipa-reference  -fipa-reference-addressable
 -fipa-stack-alignment  -fipa-icf  -fira-algorithm=@var{algorithm}
--flive-patching=@var{level}
+-flate-combine-instructions  -flive-patching=@var{level}
 -fira-region=@var{region}  -fira-hoist-pressure
 -fira-loop-pressure  -fno-ira-share-save-slots
 -fno-ira-share-spill-slots
@@ -13201,6 +13201,15 @@ equivalences that are found only by GCC and equivalences found only by Gold.
 
 This flag is enabled by default at @option{-O2} and @option{-Os}.
 
+@opindex flate-combine-instructions
+@item -flate-combine-instructions
+Enable two instruction combination passes that run relatively late in the
+compilation process.  One of the passes runs before register allocation and
+the other after register allocation.  The main aim of the passes is to
+substitute definitions into all uses.
+
+Some targets enable this flag by default at @option{-O2} and @option{-Os}.
+
 @opindex flive-patching
 @item -flive-patching=@var{level}
 Control GCC's optimizations to produce output suitable for live-patching.
diff --git a/gcc/late-combine.cc b/gcc/late-combine.cc
new file mode 100644
index 00000000000..b1845875c4b
--- /dev/null
+++ b/gcc/late-combine.cc
@@ -0,0 +1,718 @@
+// Late-stage instruction combination pass.
+// Copyright (C) 2023 Free Software Foundation, Inc.
+//
+// This file is part of GCC.
+//
+// GCC is free software; you can redistribute it and/or modify it under
+// the terms of the GNU General Public License as published by the Free
+// Software Foundation; either version 3, or (at your option) any later
+// version.
+//
+// GCC is distributed in the hope that it will be useful, but WITHOUT ANY
+// WARRANTY; without even the implied warranty of MERCHANTABILITY or
+// FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
+// for more details.
+//
+// You should have received a copy of the GNU General Public License
+// along with GCC; see the file COPYING3.  If not see
+// <http://www.gnu.org/licenses/>.
+
+// The current purpose of this pass is to substitute definitions into
+// all uses, so that the definition can be removed.  However, it could
+// be extended to handle other combination-related optimizations in future.
+//
+// The pass can run before or after register allocation.  When running
+// before register allocation, it tries to avoid cases that are likely
+// to increase register pressure.  For the same reason, it avoids moving
+// instructions around, even if doing so would allow an optimization to
+// succeed.  These limitations are removed when running after register
+// allocation.
+
+#define INCLUDE_ALGORITHM
+#define INCLUDE_FUNCTIONAL
+#include "config.h"
+#include "system.h"
+#include "coretypes.h"
+#include "backend.h"
+#include "rtl.h"
+#include "df.h"
+#include "rtl-ssa.h"
+#include "print-rtl.h"
+#include "tree-pass.h"
+#include "cfgcleanup.h"
+#include "target.h"
+
+using namespace rtl_ssa;
+
+namespace {
+const pass_data pass_data_late_combine =
+{
+  RTL_PASS, // type
+  "late_combine", // name
+  OPTGROUP_NONE, // optinfo_flags
+  TV_NONE, // tv_id
+  0, // properties_required
+  0, // properties_provided
+  0, // properties_destroyed
+  0, // todo_flags_start
+  TODO_df_finish, // todo_flags_finish
+};
+
+// Class that represents one run of the pass.
+class late_combine
+{
+public:
+  unsigned int execute (function *);
+
+private:
+  rtx optimizable_set (insn_info *);
+  bool check_register_pressure (insn_info *, rtx);
+  bool check_uses (set_info *, rtx);
+  bool combine_into_uses (insn_info *, insn_info *);
+
+  auto_vec<insn_info *> m_worklist;
+};
+
+// Represents an attempt to substitute a single-set definition into all
+// uses of the definition.
+class insn_combination
+{
+public:
+  insn_combination (set_info *, rtx, rtx);
+  bool run ();
+  const vec<insn_change *> &use_changes () const { return m_use_changes; }
+
+private:
+  use_array get_new_uses (use_info *);
+  bool substitute_nondebug_use (use_info *);
+  bool substitute_nondebug_uses (set_info *);
+  bool move_and_recog (insn_change &);
+  bool try_to_preserve_debug_info (insn_change &, use_info *);
+  void substitute_debug_use (use_info *);
+  bool substitute_note (insn_info *, rtx, bool);
+  void substitute_notes (insn_info *, bool);
+  void substitute_note_uses (use_info *);
+  void substitute_optional_uses (set_info *);
+
+  // Represents the state of the function's RTL at the start of this
+  // combination attempt.
+  insn_change_watermark m_rtl_watermark;
+
+  // Represents the rtl-ssa state at the start of this combination attempt.
+  obstack_watermark m_attempt;
+
+  // The instruction that contains the definition, and that we're trying
+  // to delete.
+  insn_info *m_def_insn;
+
+  // The definition itself.
+  set_info *m_def;
+
+  // The destination and source of the single set that defines m_def.
+  // The destination is known to be a plain REG.
+  rtx m_dest;
+  rtx m_src;
+
+  // Contains all in-progress changes to uses of m_def.
+  auto_vec<insn_change *> m_use_changes;
+
+  // Contains the full list of changes that we want to make, in reverse
+  // postorder.
+  auto_vec<insn_change *> m_nondebug_changes;
+};
+
+insn_combination::insn_combination (set_info *def, rtx dest, rtx src)
+  : m_rtl_watermark (),
+    m_attempt (crtl->ssa->new_change_attempt ()),
+    m_def_insn (def->insn ()),
+    m_def (def),
+    m_dest (dest),
+    m_src (src),
+    m_use_changes (),
+    m_nondebug_changes ()
+{
+}
+
+// USE is a direct or indirect use of m_def.  Return the list of uses
+// that would be needed after substituting m_def into the instruction.
+// The returned list is marked as invalid if USE's insn and m_def_insn
+// use different definitions for the same resource (register or memory).
+use_array
+insn_combination::get_new_uses (use_info *use)
+{
+  auto *def = use->def ();
+  auto *use_insn = use->insn ();
+
+  use_array new_uses = use_insn->uses ();
+  new_uses = remove_uses_of_def (m_attempt, new_uses, def);
+  new_uses = merge_access_arrays (m_attempt, m_def_insn->uses (), new_uses);
+  if (new_uses.is_valid () && use->ebb () != m_def->ebb ())
+    new_uses = crtl->ssa->make_uses_available (m_attempt, new_uses, use->bb (),
+					       use_insn->is_debug_insn ());
+  return new_uses;
+}
+
+// Start the process of trying to replace USE by substitution, given that
+// USE occurs in a non-debug instruction.  Check that the substitution can
+// be represented in RTL and that each use of a resource (register or memory)
+// has a consistent definition.  If so, start an insn_change record for the
+// substitution and return true.
+bool
+insn_combination::substitute_nondebug_use (use_info *use)
+{
+  insn_info *use_insn = use->insn ();
+  rtx_insn *use_rtl = use_insn->rtl ();
+
+  if (dump_file && (dump_flags & TDF_DETAILS))
+    dump_insn_slim (dump_file, use->insn ()->rtl ());
+
+  // Check that we can change the instruction pattern.  Leave recognition
+  // of the result till later.
+  insn_propagation prop (use_rtl, m_dest, m_src);
+  if (!prop.apply_to_pattern (&PATTERN (use_rtl))
+      || prop.num_replacements == 0)
+    {
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	fprintf (dump_file, "-- RTL substitution failed\n");
+      return false;
+    }
+
+  use_array new_uses = get_new_uses (use);
+  if (!new_uses.is_valid ())
+    {
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	fprintf (dump_file, "-- could not prove that all sources"
+		 " are available\n");
+      return false;
+    }
+
+  auto *where = XOBNEW (m_attempt, insn_change);
+  auto *use_change = new (where) insn_change (use_insn);
+  m_use_changes.safe_push (use_change);
+  use_change->new_uses = new_uses;
+
+  return true;
+}
+
+// Apply substitute_nondebug_use to all direct and indirect uses of DEF.
+// There will be at most one level of indirection.
+bool
+insn_combination::substitute_nondebug_uses (set_info *def)
+{
+  for (use_info *use : def->nondebug_insn_uses ())
+    if (!use->is_live_out_use ()
+	&& !use->only_occurs_in_notes ()
+	&& !substitute_nondebug_use (use))
+      return false;
+
+  for (use_info *use : def->phi_uses ())
+    if (!substitute_nondebug_uses (use->phi ()))
+      return false;
+
+  return true;
+}
+
+// Complete the verification of USE_CHANGE, given that m_nondebug_changes
+// now contains an insn_change record for all proposed non-debug changes.
+// Check that the new instruction is a recognized pattern.  Also check that
+// the instruction can be placed somewhere that makes all definitions and
+// uses valid, and that permits any new hard-register clobbers added
+// during the recognition process.  Return true on success.
+bool
+insn_combination::move_and_recog (insn_change &use_change)
+{
+  insn_info *use_insn = use_change.insn ();
+
+  if (reload_completed && can_move_insn_p (use_insn))
+    use_change.move_range = { use_insn->bb ()->head_insn (),
+			      use_insn->ebb ()->last_bb ()->end_insn () };
+  if (!restrict_movement_ignoring (use_change,
+				   insn_is_changing (m_nondebug_changes)))
+    {
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	fprintf (dump_file, "-- cannot satisfy all definitions and uses"
+		 " in insn %d\n", INSN_UID (use_insn->rtl ()));
+      return false;
+    }
+
+  if (!recog_ignoring (m_attempt, use_change,
+		       insn_is_changing (m_nondebug_changes)))
+    return false;
+
+  return true;
+}
+
+// USE_CHANGE.insn () is a debug instruction that uses m_def.  Try to
+// substitute the definition into the instruction and try to describe
+// the result in USE_CHANGE.  Return true on success.  Failure means that
+// the instruction must be reset instead.
+bool
+insn_combination::try_to_preserve_debug_info (insn_change &use_change,
+					      use_info *use)
+{
+  insn_info *use_insn = use_change.insn ();
+  rtx_insn *use_rtl = use_insn->rtl ();
+
+  use_change.new_uses = get_new_uses (use);
+  if (!use_change.new_uses.is_valid ()
+      || !restrict_movement (use_change))
+    return false;
+
+  insn_propagation prop (use_rtl, m_dest, m_src);
+  return prop.apply_to_pattern (&INSN_VAR_LOCATION_LOC (use_rtl));
+}
+
+// USE_INSN is a debug instruction that uses m_def.  Update it to reflect
+// the fact that m_def is going to disappear.  Try to preserve the source
+// value if possible, but reset the instruction if not.
+void
+insn_combination::substitute_debug_use (use_info *use)
+{
+  auto *use_insn = use->insn ();
+  rtx_insn *use_rtl = use_insn->rtl ();
+
+  auto use_change = insn_change (use_insn);
+  if (!try_to_preserve_debug_info (use_change, use))
+    {
+      use_change.new_uses = {};
+      use_change.move_range = use_change.insn ();
+      INSN_VAR_LOCATION_LOC (use_rtl) = gen_rtx_UNKNOWN_VAR_LOC ();
+    }
+  insn_change *changes[] = { &use_change };
+  crtl->ssa->change_insns (changes);
+}
+
+// NOTE is a reg note of USE_INSN, which previously used m_def.  Update
+// the note to reflect the fact that m_def is going to disappear.  Return
+// true on success, or false if the note must be deleted.
+//
+// CAN_PROPAGATE is true if m_dest can be replaced with m_src.
+bool
+insn_combination::substitute_note (insn_info *use_insn, rtx note,
+				   bool can_propagate)
+{
+  if (REG_NOTE_KIND (note) == REG_EQUAL
+      || REG_NOTE_KIND (note) == REG_EQUIV)
+    {
+      insn_propagation prop (use_insn->rtl (), m_dest, m_src);
+      return (prop.apply_to_rvalue (&XEXP (note, 0))
+	      && (can_propagate || prop.num_replacements == 0));
+    }
+  return true;
+}
+
+// Update USE_INSN's notes after deciding to go ahead with the optimization.
+// CAN_PROPAGATE is true if m_dest can be replaced with m_src.
+void
+insn_combination::substitute_notes (insn_info *use_insn, bool can_propagate)
+{
+  rtx_insn *use_rtl = use_insn->rtl ();
+  rtx *ptr = &REG_NOTES (use_rtl);
+  while (rtx note = *ptr)
+    {
+      if (substitute_note (use_insn, note, can_propagate))
+	ptr = &XEXP (note, 1);
+      else
+	*ptr = XEXP (note, 1);
+    }
+}
+
+// We've decided to go ahead with the substitution.  Update all REG_NOTES
+// involving USE.
+void
+insn_combination::substitute_note_uses (use_info *use)
+{
+  insn_info *use_insn = use->insn ();
+
+  bool can_propagate = true;
+  if (use->only_occurs_in_notes ())
+    {
+      // The only uses are in notes.  Try to keep the note if we can,
+      // but removing it is better than aborting the optimization.
+      insn_change use_change (use_insn);
+      use_change.new_uses = get_new_uses (use);
+      if (!use_change.new_uses.is_valid ()
+	  || !restrict_movement (use_change))
+	{
+	  use_change.move_range = use_insn;
+	  use_change.new_uses = remove_uses_of_def (m_attempt,
+						    use_insn->uses (),
+						    use->def ());
+	  can_propagate = false;
+	}
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	{
+	  fprintf (dump_file, "%s notes in:\n",
+		   can_propagate ? "updating" : "removing");
+	  dump_insn_slim (dump_file, use_insn->rtl ());
+	}
+      substitute_notes (use_insn, can_propagate);
+      insn_change *changes[] = { &use_change };
+      crtl->ssa->change_insns (changes);
+    }
+  else
+    substitute_notes (use_insn, can_propagate);
+}
+
+// We've decided to go ahead with the substitution and we've dealt with
+// all uses that occur in the patterns of non-debug insns.  Update all
+// other uses for the fact that m_def is about to disappear.
+void
+insn_combination::substitute_optional_uses (set_info *def)
+{
+  if (auto insn_uses = def->all_insn_uses ())
+    {
+      use_info *use = *insn_uses.begin ();
+      while (use)
+	{
+	  use_info *next_use = use->next_any_insn_use ();
+	  if (use->is_in_debug_insn ())
+	    substitute_debug_use (use);
+	  else if (!use->is_live_out_use ())
+	    substitute_note_uses (use);
+	  use = next_use;
+	}
+    }
+  for (use_info *use : def->phi_uses ())
+    substitute_optional_uses (use->phi ());
+}
+
+// Try to perform the substitution.  Return true on success.
+bool
+insn_combination::run ()
+{
+  if (dump_file && (dump_flags & TDF_DETAILS))
+    {
+      fprintf (dump_file, "\ntrying to combine definition of r%d in:\n",
+	       m_def->regno ());
+      dump_insn_slim (dump_file, m_def_insn->rtl ());
+      fprintf (dump_file, "into:\n");
+    }
+
+  if (!substitute_nondebug_uses (m_def))
+    return false;
+
+  auto def_change = insn_change::delete_insn (m_def_insn);
+
+  m_nondebug_changes.reserve (m_use_changes.length () + 1);
+  m_nondebug_changes.quick_push (&def_change);
+  m_nondebug_changes.splice (m_use_changes);
+
+  for (auto *use_change : m_use_changes)
+    if (!move_and_recog (*use_change))
+      return false;
+
+  if (!changes_are_worthwhile (m_nondebug_changes)
+      || !crtl->ssa->verify_insn_changes (m_nondebug_changes))
+    return false;
+
+  substitute_optional_uses (m_def);
+
+  confirm_change_group ();
+  crtl->ssa->change_insns (m_nondebug_changes);
+  return true;
+}
+
+// See whether INSN is a single_set that we can optimize.  Return the
+// set if so, otherwise return null.
+rtx
+late_combine::optimizable_set (insn_info *insn)
+{
+  if (!insn->can_be_optimized ()
+      || insn->is_asm ()
+      || insn->is_call ()
+      || insn->has_volatile_refs ()
+      || insn->has_pre_post_modify ()
+      || !can_move_insn_p (insn))
+    return NULL_RTX;
+
+  return single_set (insn->rtl ());
+}
+
+// Suppose that we can replace all uses of SET_DEST (SET) with SET_SRC (SET),
+// where SET occurs in INSN.  Return true if doing so is not likely to
+// increase register pressure.
+bool
+late_combine::check_register_pressure (insn_info *insn, rtx set)
+{
+  // Plain register-to-register moves do not establish a register class
+  // preference and have no well-defined effect on the register allocator.
+  // If changes in register class are needed, the register allocator is
+  // in the best position to place those changes.  If no change in
+  // register class is needed, then the optimization reduces register
+  // pressure if SET_SRC (set) was already live at uses, otherwise the
+  // optimization is pressure-neutral.
+  rtx src = SET_SRC (set);
+  if (REG_P (src))
+    return true;
+
+  // On the same basis, substituting a SET_SRC that contains a single
+  // pseudo register either reduces pressure or is pressure-neutral,
+  // subject to the constraints below.  We would need to do more
+  // analysis for SET_SRCs that use more than one pseudo register.
+  unsigned int nregs = 0;
+  for (auto *use : insn->uses ())
+    if (use->is_reg ()
+	&& !HARD_REGISTER_NUM_P (use->regno ())
+	&& !use->only_occurs_in_notes ())
+      if (++nregs > 1)
+	return false;
+
+  // If there are no pseudo registers in SET_SRC then the optimization
+  // should improve register pressure.
+  if (nregs == 0)
+    return true;
+
+  // We'd be substituting (set (reg R1) SRC) where SRC is known to
+  // contain a single pseudo register R2.  Assume for simplicity that
+  // each new use of R2 would need to be in the same class C as the
+  // current use of R2.  If, for a realistic allocation, C is a
+  // non-strict superset of R1's register class, the effect on
+  // register pressure should be positive or neutral.  If instead
+  // R1 occupies a different register class from R2, or if R1 has
+  // more allocation freedom than R2, then there's a higher risk that
+  // the effect on register pressure could be negative.
+  //
+  // First use constrain_operands to get the most likely choice of
+  // alternative.  For simplicity, just handle the case where the
+  // output operand is operand 0.
+  extract_insn (insn->rtl ());
+  rtx dest = SET_DEST (set);
+  if (recog_data.n_operands == 0
+      || recog_data.operand[0] != dest)
+    return false;
+
+  if (!constrain_operands (0, get_enabled_alternatives (insn->rtl ())))
+    return false;
+
+  preprocess_constraints (insn->rtl ());
+  auto *alt = which_op_alt ();
+  auto dest_class = alt[0].cl;
+
+  // Check operands 1 and above.
+  auto check_src = [&](unsigned int i)
+    {
+      if (recog_data.is_operator[i])
+	return true;
+
+      rtx op = recog_data.operand[i];
+      if (CONSTANT_P (op))
+	return true;
+
+      if (SUBREG_P (op))
+	op = SUBREG_REG (op);
+      if (REG_P (op))
+	{
+	  if (HARD_REGISTER_P (op))
+	    {
+	      // We've already rejected uses of non-fixed hard registers.
+	      gcc_checking_assert (fixed_regs[REGNO (op)]);
+	      return true;
+	    }
+
+	  // Make sure that the source operand's class is at least as
+	  // permissive as the destination operand's class.
+	  if (!reg_class_subset_p (dest_class, alt[i].cl))
+	    return false;
+
+	  // Make sure that the source operand occupies no more hard
+	  // registers than the destination operand.  This mostly matters
+	  // for subregs.
+	  if (targetm.class_max_nregs (dest_class, GET_MODE (dest))
+	      < targetm.class_max_nregs (alt[i].cl, GET_MODE (op)))
+	    return false;
+
+	  return true;
+	}
+      return false;
+    };
+  for (int i = 1; i < recog_data.n_operands; ++i)
+    if (!check_src (i))
+      return false;
+
+  return true;
+}
+
+// Check uses of DEF to see whether there is anything obvious that
+// prevents the substitution of SET into uses of DEF.
+bool
+late_combine::check_uses (set_info *def, rtx set)
+{
+  use_info *last_use = nullptr;
+  for (use_info *use : def->nondebug_insn_uses ())
+    {
+      insn_info *use_insn = use->insn ();
+
+      if (use->is_live_out_use ())
+	continue;
+      if (use->only_occurs_in_notes ())
+	continue;
+
+      // We cannot replace all uses if the value is live on exit.
+      if (use->is_artificial ())
+	return false;
+
+      // Avoid increasing the complexity of instructions that
+      // reference allocatable hard registers.
+      if (!REG_P (SET_SRC (set))
+	  && !reload_completed
+	  && (accesses_include_nonfixed_hard_registers (use_insn->uses ())
+	      || accesses_include_nonfixed_hard_registers (use_insn->defs ())))
+	return false;
+
+      // Don't substitute into a non-local goto, since it can then be
+      // treated as a jump to local label, e.g. in shorten_branches.
+      // ??? But this shouldn't be necessary.
+      if (use_insn->is_jump ()
+	  && find_reg_note (use_insn->rtl (), REG_NON_LOCAL_GOTO, NULL_RTX))
+	return false;
+
+      // We'll keep the uses in their original order, even if we move
+      // them relative to other instructions.  Make sure that non-final
+      // uses do not change any values that occur in the SET_SRC.
+      if (last_use && last_use->ebb () == use->ebb ())
+	{
+	  def_info *ultimate_def = look_through_degenerate_phi (def);
+	  if (insn_clobbers_resources (last_use->insn (),
+				       ultimate_def->insn ()->uses ()))
+	    return false;
+	}
+
+      last_use = use;
+    }
+
+  for (use_info *use : def->phi_uses ())
+    if (!use->phi ()->is_degenerate ()
+	|| !check_uses (use->phi (), set))
+      return false;
+
+  return true;
+}
+
+// Try to remove INSN by substituting a definition into all uses.
+// If the optimization moves any instructions before CURSOR, add those
+// instructions to the end of m_worklist.
+bool
+late_combine::combine_into_uses (insn_info *insn, insn_info *cursor)
+{
+  // For simplicity, don't try to handle sets of multiple hard registers.
+  // And for correctness, don't remove any assignments to the stack or
+  // frame pointers, since that would implicitly change the set of valid
+  // memory locations between this assignment and the next.
+  //
+  // Removing assignments to the hard frame pointer would invalidate
+  // backtraces.
+  set_info *def = single_set_info (insn);
+  if (!def
+      || !def->is_reg ()
+      || def->regno () == STACK_POINTER_REGNUM
+      || def->regno () == FRAME_POINTER_REGNUM
+      || def->regno () == HARD_FRAME_POINTER_REGNUM)
+    return false;
+
+  rtx set = optimizable_set (insn);
+  if (!set)
+    return false;
+
+  // For simplicity, don't try to handle subreg destinations.
+  rtx dest = SET_DEST (set);
+  if (!REG_P (dest) || def->regno () != REGNO (dest))
+    return false;
+
+  // Don't prolong the live ranges of allocatable hard registers, or put
+  // them into more complicated instructions.  Failing to prevent this
+  // could lead to spill failures, or at least to worse register allocation.
+  if (!reload_completed
+      && accesses_include_nonfixed_hard_registers (insn->uses ()))
+    return false;
+
+  if (!reload_completed && !check_register_pressure (insn, set))
+    return false;
+
+  if (!check_uses (def, set))
+    return false;
+
+  insn_combination combination (def, SET_DEST (set), SET_SRC (set));
+  if (!combination.run ())
+    return false;
+
+  for (auto *use_change : combination.use_changes ())
+    if (*use_change->insn () < *cursor)
+      m_worklist.safe_push (use_change->insn ());
+    else
+      break;
+  return true;
+}
+
+// Run the pass on function FN.
+unsigned int
+late_combine::execute (function *fn)
+{
+  // Initialization.
+  calculate_dominance_info (CDI_DOMINATORS);
+  df_analyze ();
+  crtl->ssa = new rtl_ssa::function_info (fn);
+  // Don't allow memory_operand to match volatile MEMs.
+  init_recog_no_volatile ();
+
+  insn_info *insn = *crtl->ssa->nondebug_insns ().begin ();
+  while (insn)
+    {
+      if (!insn->is_artificial ())
+	{
+	  insn_info *prev = insn->prev_nondebug_insn ();
+	  if (combine_into_uses (insn, prev))
+	    {
+	      // Any instructions that get added to the worklist were
+	      // previously after PREV.  Thus if we were able to move
+	      // an instruction X before PREV during one combination,
+	      // X cannot depend on any instructions that we move before
+	      // PREV during subsequent combinations.  This means that
+	      // the worklist should be free of backwards dependencies,
+	      // even if it isn't necessarily in RPO.
+	      for (unsigned int i = 0; i < m_worklist.length (); ++i)
+		combine_into_uses (m_worklist[i], prev);
+	      m_worklist.truncate (0);
+	      insn = prev;
+	    }
+	}
+      insn = insn->next_nondebug_insn ();
+    }
+
+  // Finalization.
+  if (crtl->ssa->perform_pending_updates ())
+    cleanup_cfg (0);
+  // Make recognizer allow volatile MEMs again.
+  init_recog ();
+  free_dominance_info (CDI_DOMINATORS);
+  return 0;
+}
+
+class pass_late_combine : public rtl_opt_pass
+{
+public:
+  pass_late_combine (gcc::context *ctxt)
+    : rtl_opt_pass (pass_data_late_combine, ctxt)
+  {}
+
+  // opt_pass methods:
+  opt_pass *clone () override { return new pass_late_combine (m_ctxt); }
+  bool gate (function *) override { return flag_late_combine_instructions; }
+  unsigned int execute (function *) override;
+};
+
+unsigned int
+pass_late_combine::execute (function *fn)
+{
+  return late_combine ().execute (fn);
+}
+
+} // end namespace
+
+// Create a new late-combine pass instance.
+
+rtl_opt_pass *
+make_pass_late_combine (gcc::context *ctxt)
+{
+  return new pass_late_combine (ctxt);
+}
diff --git a/gcc/passes.def b/gcc/passes.def
index 1e1950bdb39..56ab5204b08 100644
--- a/gcc/passes.def
+++ b/gcc/passes.def
@@ -488,6 +488,7 @@ along with GCC; see the file COPYING3.  If not see
       NEXT_PASS (pass_initialize_regs);
       NEXT_PASS (pass_ud_rtl_dce);
       NEXT_PASS (pass_combine);
+      NEXT_PASS (pass_late_combine);
       NEXT_PASS (pass_if_after_combine);
       NEXT_PASS (pass_jump_after_combine);
       NEXT_PASS (pass_partition_blocks);
@@ -507,6 +508,7 @@ along with GCC; see the file COPYING3.  If not see
       NEXT_PASS (pass_postreload);
       PUSH_INSERT_PASSES_WITHIN (pass_postreload)
 	  NEXT_PASS (pass_postreload_cse);
+	  NEXT_PASS (pass_late_combine);
 	  NEXT_PASS (pass_gcse2);
 	  NEXT_PASS (pass_split_after_reload);
 	  NEXT_PASS (pass_ree);
diff --git a/gcc/testsuite/gcc.dg/ira-shrinkwrap-prep-1.c b/gcc/testsuite/gcc.dg/ira-shrinkwrap-prep-1.c
index f290b9ccbdc..a95637abbe5 100644
--- a/gcc/testsuite/gcc.dg/ira-shrinkwrap-prep-1.c
+++ b/gcc/testsuite/gcc.dg/ira-shrinkwrap-prep-1.c
@@ -25,5 +25,5 @@ bar (long a)
 }
 
 /* { dg-final { scan-rtl-dump "Will split live ranges of parameters" "ira" } } */
-/* { dg-final { scan-rtl-dump "Split live-range of register" "ira" { xfail *-*-* } } } */
+/* { dg-final { scan-rtl-dump "Split live-range of register" "ira" { xfail { ! aarch64*-*-* } } } } */
 /* { dg-final { scan-rtl-dump "Performing shrink-wrapping" "pro_and_epilogue" { xfail powerpc*-*-* } } } */
diff --git a/gcc/testsuite/gcc.dg/ira-shrinkwrap-prep-2.c b/gcc/testsuite/gcc.dg/ira-shrinkwrap-prep-2.c
index 6212c95585d..0690e036eaa 100644
--- a/gcc/testsuite/gcc.dg/ira-shrinkwrap-prep-2.c
+++ b/gcc/testsuite/gcc.dg/ira-shrinkwrap-prep-2.c
@@ -30,6 +30,6 @@ bar (long a)
 }
 
 /* { dg-final { scan-rtl-dump "Will split live ranges of parameters" "ira" } } */
-/* { dg-final { scan-rtl-dump "Split live-range of register" "ira" { xfail *-*-* } } } */
+/* { dg-final { scan-rtl-dump "Split live-range of register" "ira" { xfail { ! aarch64*-*-* } } } } */
 /* XFAIL due to PR70681.  */
 /* { dg-final { scan-rtl-dump "Performing shrink-wrapping" "pro_and_epilogue" { xfail arm*-*-* powerpc*-*-* } } } */
diff --git a/gcc/testsuite/gcc.dg/stack-check-4.c b/gcc/testsuite/gcc.dg/stack-check-4.c
index b0c5c61972f..052d2abc2f1 100644
--- a/gcc/testsuite/gcc.dg/stack-check-4.c
+++ b/gcc/testsuite/gcc.dg/stack-check-4.c
@@ -20,7 +20,7 @@
    scan for.  We scan for both the positive and negative cases.  */
 
 /* { dg-do compile } */
-/* { dg-options "-O2 -fstack-clash-protection -fdump-rtl-pro_and_epilogue -fno-optimize-sibling-calls" } */
+/* { dg-options "-O2 -fstack-clash-protection -fdump-rtl-pro_and_epilogue -fno-optimize-sibling-calls -fno-shrink-wrap" } */
 /* { dg-require-effective-target supports_stack_clash_protection } */
 
 extern void arf (char *);
diff --git a/gcc/testsuite/gcc.target/aarch64/pr106594_1.c b/gcc/testsuite/gcc.target/aarch64/pr106594_1.c
new file mode 100644
index 00000000000..71bcafcb44f
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/pr106594_1.c
@@ -0,0 +1,20 @@
+/* { dg-options "-O2" } */
+
+extern const int constellation_64qam[64];
+
+void foo(int nbits,
+         const char *p_src,
+         int *p_dst) {
+
+  while (nbits > 0U) {
+    char first = *p_src++;
+
+    char index1 = ((first & 0x3) << 4) | (first >> 4);
+
+    *p_dst++ = constellation_64qam[index1];
+
+    nbits--;
+  }
+}
+
+/* { dg-final { scan-assembler {(?n)\tldr\t.*\[x[0-9]+, w[0-9]+, sxtw #?2\]} } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/cond_asrd_3.c b/gcc/testsuite/gcc.target/aarch64/sve/cond_asrd_3.c
index 0d620a30d5d..b537c6154a3 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/cond_asrd_3.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/cond_asrd_3.c
@@ -27,9 +27,9 @@ TEST_ALL (DEF_LOOP)
 /* { dg-final { scan-assembler-times {\tasrd\tz[0-9]+\.h, p[0-7]/m, z[0-9]+\.h, #4\n} 2 } } */
 /* { dg-final { scan-assembler-times {\tasrd\tz[0-9]+\.s, p[0-7]/m, z[0-9]+\.s, #4\n} 1 } } */
 
-/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.b, p[0-7]/z, z[0-9]+\.b\n} 3 { xfail *-*-* } } } */
-/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.h, p[0-7]/z, z[0-9]+\.h\n} 2 { xfail *-*-* } } } */
-/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.s, p[0-7]/z, z[0-9]+\.s\n} 1 { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.b, p[0-7]/z, z[0-9]+\.b\n} 3 } } */
+/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.h, p[0-7]/z, z[0-9]+\.h\n} 2 } } */
+/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.s, p[0-7]/z, z[0-9]+\.s\n} 1 } } */
 
-/* { dg-final { scan-assembler-not {\tmov\tz} { xfail *-*-* } } } */
-/* { dg-final { scan-assembler-not {\tsel\t} { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-not {\tmov\tz} } } */
+/* { dg-final { scan-assembler-not {\tsel\t} } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/cond_convert_3.c b/gcc/testsuite/gcc.target/aarch64/sve/cond_convert_3.c
index a294effd4a9..cff806c278d 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/cond_convert_3.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/cond_convert_3.c
@@ -30,11 +30,9 @@ TEST_ALL (DEF_LOOP)
 /* { dg-final { scan-assembler-times {\tscvtf\tz[0-9]+\.d, p[0-7]/m,} 1 } } */
 /* { dg-final { scan-assembler-times {\tucvtf\tz[0-9]+\.d, p[0-7]/m,} 1 } } */
 
-/* Really we should be able to use MOVPRFX /z here, but at the moment
-   we're relying on combine to merge a SEL and an arithmetic operation,
-   and the SEL doesn't allow the "false" value to be zero when the "true"
-   value is a register.  */
-/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+, z[0-9]+\n} 6 } } */
+/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.h, p[0-7]/z,} 2 } } */
+/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.s, p[0-7]/z,} 2 } } */
+/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.d, p[0-7]/z,} 2 } } */
 
 /* { dg-final { scan-assembler-not {\tmov\tz[^\n]*z} } } */
 /* { dg-final { scan-assembler-not {\tsel\t} } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/cond_convert_6.c b/gcc/testsuite/gcc.target/aarch64/sve/cond_convert_6.c
index 6541a2ea49d..abf0a2e832f 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/cond_convert_6.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/cond_convert_6.c
@@ -30,11 +30,9 @@ TEST_ALL (DEF_LOOP)
 /* { dg-final { scan-assembler-times {\tfcvtzs\tz[0-9]+\.d, p[0-7]/m,} 1 } } */
 /* { dg-final { scan-assembler-times {\tfcvtzu\tz[0-9]+\.d, p[0-7]/m,} 1 } } */
 
-/* Really we should be able to use MOVPRFX /z here, but at the moment
-   we're relying on combine to merge a SEL and an arithmetic operation,
-   and the SEL doesn't allow the "false" value to be zero when the "true"
-   value is a register.  */
-/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+, z[0-9]+\n} 6 } } */
+/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.h, p[0-7]/z,} 2 } } */
+/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.s, p[0-7]/z,} 2 } } */
+/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.d, p[0-7]/z,} 2 } } */
 
 /* { dg-final { scan-assembler-not {\tmov\tz[^\n]*z} } } */
 /* { dg-final { scan-assembler-not {\tsel\t} } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/cond_fabd_5.c b/gcc/testsuite/gcc.target/aarch64/sve/cond_fabd_5.c
index e66477b3bce..401201b315a 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/cond_fabd_5.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/cond_fabd_5.c
@@ -24,12 +24,9 @@ TEST_ALL (DEF_LOOP)
 /* { dg-final { scan-assembler-times {\tfabd\tz[0-9]+\.s, p[0-7]/m,} 1 } } */
 /* { dg-final { scan-assembler-times {\tfabd\tz[0-9]+\.d, p[0-7]/m,} 1 } } */
 
-/* Really we should be able to use MOVPRFX /Z here, but at the moment
-   we're relying on combine to merge a SEL and an arithmetic operation,
-   and the SEL doesn't allow zero operands.  */
-/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.h, p[0-7]/z, z[0-9]+\.h\n} 1 { xfail *-*-* } } } */
-/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.s, p[0-7]/z, z[0-9]+\.s\n} 1 { xfail *-*-* } } } */
-/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.d, p[0-7]/z, z[0-9]+\.d\n} 1 { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.h, p[0-7]/z, z[0-9]+\.h\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.s, p[0-7]/z, z[0-9]+\.s\n} 1 } } */
+/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.d, p[0-7]/z, z[0-9]+\.d\n} 1 } } */
 
 /* { dg-final { scan-assembler-not {\tmov\tz[^,]*z} } } */
-/* { dg-final { scan-assembler-not {\tsel\t} { xfail *-*-* } } } */
+/* { dg-final { scan-assembler-not {\tsel\t} } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/cond_unary_4.c b/gcc/testsuite/gcc.target/aarch64/sve/cond_unary_4.c
index a491f899088..cbb957bffa4 100644
--- a/gcc/testsuite/gcc.target/aarch64/sve/cond_unary_4.c
+++ b/gcc/testsuite/gcc.target/aarch64/sve/cond_unary_4.c
@@ -52,15 +52,10 @@ TEST_ALL (DEF_LOOP)
 /* { dg-final { scan-assembler-times {\tfneg\tz[0-9]+\.s, p[0-7]/m,} 1 } } */
 /* { dg-final { scan-assembler-times {\tfneg\tz[0-9]+\.d, p[0-7]/m,} 1 } } */
 
-/* Really we should be able to use MOVPRFX /z here, but at the moment
-   we're relying on combine to merge a SEL and an arithmetic operation,
-   and the SEL doesn't allow the "false" value to be zero when the "true"
-   value is a register.  */
-/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+, z[0-9]+\n} 7 } } */
-/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.b, p[0-7]/z, z[0-9]+\.b} 1 } } */
-/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.h, p[0-7]/z, z[0-9]+\.h} 2 } } */
-/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.s, p[0-7]/z, z[0-9]+\.s} 2 } } */
-/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.d, p[0-7]/z, z[0-9]+\.d} 2 } } */
+/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.b, p[0-7]/z, z[0-9]+\.b} 2 } } */
+/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.h, p[0-7]/z, z[0-9]+\.h} 4 } } */
+/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.s, p[0-7]/z, z[0-9]+\.s} 4 } } */
+/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.d, p[0-7]/z, z[0-9]+\.d} 4 } } */
 
 /* { dg-final { scan-assembler-not {\tmov\tz[^\n]*z} } } */
 /* { dg-final { scan-assembler-not {\tsel\t} } } */
diff --git a/gcc/tree-pass.h b/gcc/tree-pass.h
index 09e6ada5b2f..75376316e40 100644
--- a/gcc/tree-pass.h
+++ b/gcc/tree-pass.h
@@ -612,6 +612,7 @@ extern rtl_opt_pass *make_pass_branch_prob (gcc::context *ctxt);
 extern rtl_opt_pass *make_pass_value_profile_transformations (gcc::context
 							      *ctxt);
 extern rtl_opt_pass *make_pass_postreload_cse (gcc::context *ctxt);
+extern rtl_opt_pass *make_pass_late_combine (gcc::context *ctxt);
 extern rtl_opt_pass *make_pass_gcse2 (gcc::context *ctxt);
 extern rtl_opt_pass *make_pass_split_after_reload (gcc::context *ctxt);
 extern rtl_opt_pass *make_pass_thread_prologue_and_epilogue (gcc::context
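For anyone who wants to experiment with the pass once the series is
applied: the command line below is a usage sketch, not taken from the
patch.  -flate-combine-instructions is the option added in the common.opt
hunk above; the dump option name is inferred from the pass name
"late_combine" on the assumption that the usual -fdump-rtl-<pass>
convention applies, so treat its exact spelling (and the numbering of the
dump files, one per pass instance) as unverified.

    # Enable the pass explicitly and request its RTL dumps.
    gcc -O2 -flate-combine-instructions -fdump-rtl-late_combine -S test.c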