Message ID: 20240620133418.350772-7-richard.sandiford@arm.com
State: New
Series: Add a late-combine pass
On Thu, 2024-06-20 at 14:34 +0100, Richard Sandiford wrote:
>
> I tried compiling at least one target per CPU directory and comparing
> the assembly output for parts of the GCC testsuite.  This is just a way
> of getting a flavour of how the pass performs; it obviously isn't a
> meaningful benchmark.  All targets seemed to improve on average:
>
> [full per-target results table snipped; it is quoted in full in the
> next message]

Since you have already briefly compared some of the code, can you share
those cases which get worse and might require some potential follow-up
patches?

Best regards,
Oleg Endo
On Thu, Jun 20, 2024 at 3:37 PM Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> This patch adds a combine pass that runs late in the pipeline.
> There are two instances: one between combine and split1, and one
> after postreload.
>
> The pass currently has a single objective: remove definitions by
> substituting into all uses.  The pre-RA version tries to restrict
> itself to cases that are likely to have a neutral or beneficial
> effect on register pressure.
>
> The patch fixes PR106594.  It also fixes a few FAILs and XFAILs
> in the aarch64 test results, mostly due to making proper use of
> MOVPRFX in cases where we didn't previously.
>
> This is just a first step.  I'm hoping that the pass could be
> used for other combine-related optimisations in future.  In particular,
> the post-RA version doesn't need to restrict itself to cases where all
> uses are substitutable, since it doesn't have to worry about register
> pressure.  If we did that, and if we extended it to handle multi-register
> REGs, the pass might be a viable replacement for regcprop, which in
> turn might reduce the cost of having a post-RA instance of the new pass.
>
> On most targets, the pass is enabled by default at -O2 and above.
> However, it has a tendency to undo x86's STV and RPAD passes,
> by folding the more complex post-STV/RPAD form back into the
> simpler pre-pass form.
>
> Also, running a pass after register allocation means that we can
> now match define_insn_and_splits that were previously only matched
> before register allocation.  This trips things like:
>
>   (define_insn_and_split "..."
>     [...pattern...]
>     "...cond..."
>     "#"
>     "&& 1"
>     [...pattern...]
>     {
>       ...unconditional use of gen_reg_rtx ()...;
>     }
>
> because matching and splitting after RA will call gen_reg_rtx when
> pseudos are no longer allowed.  rs6000 has several instances of this.
>
> xtensa has a variation in which the split condition is:
>
>   "&& can_create_pseudo_p ()"
>
> The failure then is that, if we match after RA, we'll never be
> able to split the instruction.
>
> The patch therefore disables the pass by default on i386, rs6000
> and xtensa.  Hopefully we can fix those ports later (if their
> maintainers want).  It seems easier to add the pass first, though,
> to make it easier to test any such fixes.
>
> gcc.target/aarch64/bitfield-bitint-abi-align{16,8}.c would need
> quite a few updates for the late-combine output.  That might be
> worth doing, but it seems too complex to do as part of this patch.
>
> I tried compiling at least one target per CPU directory and comparing
> the assembly output for parts of the GCC testsuite.  This is just a way
> of getting a flavour of how the pass performs; it obviously isn't a
> meaningful benchmark.  All targets seemed to improve on average:
>
>   Target                 Tests   Good    Bad   %Good    Delta  Median
>   ======                 =====   ====    ===   =====    =====  ======
>   aarch64-linux-gnu       2215   1975    240  89.16%    -4159      -1
>   aarch64_be-linux-gnu    1569   1483     86  94.52%   -10117      -1
>   alpha-linux-gnu         1454   1370     84  94.22%    -9502      -1
>   amdgcn-amdhsa           5122   4671    451  91.19%   -35737      -1
>   arc-elf                 2166   1932    234  89.20%   -37742      -1
>   arm-linux-gnueabi       1953   1661    292  85.05%   -12415      -1
>   arm-linux-gnueabihf     1834   1549    285  84.46%   -11137      -1
>   avr-elf                 4789   4330    459  90.42%  -441276      -4
>   bfin-elf                2795   2394    401  85.65%   -19252      -1
>   bpf-elf                 3122   2928    194  93.79%    -8785      -1
>   c6x-elf                 2227   1929    298  86.62%   -17339      -1
>   cris-elf                3464   3270    194  94.40%   -23263      -2
>   csky-elf                2915   2591    324  88.89%   -22146      -1
>   epiphany-elf            2399   2304     95  96.04%   -28698      -2
>   fr30-elf                7712   7299    413  94.64%   -99830      -2
>   frv-linux-gnu           3332   2877    455  86.34%   -25108      -1
>   ft32-elf                2775   2667    108  96.11%   -25029      -1
>   h8300-elf               3176   2862    314  90.11%   -29305      -2
>   hppa64-hp-hpux11.23     4287   4247     40  99.07%   -45963      -2
>   ia64-linux-gnu          2343   1946    397  83.06%    -9907      -2
>   iq2000-elf              9684   9637     47  99.51%  -126557      -2
>   lm32-elf                2681   2608     73  97.28%   -59884      -3
>   loongarch64-linux-gnu   1303   1218     85  93.48%   -13375      -2
>   m32r-elf                1626   1517    109  93.30%    -9323      -2
>   m68k-linux-gnu          3022   2620    402  86.70%   -21531      -1
>   mcore-elf               2315   2085    230  90.06%   -24160      -1
>   microblaze-elf          2782   2585    197  92.92%   -16530      -1
>   mipsel-linux-gnu        1958   1827    131  93.31%   -15462      -1
>   mipsisa64-linux-gnu     1655   1488    167  89.91%   -16592      -2
>   mmix                    4914   4814    100  97.96%   -63021      -1
>   mn10300-elf             3639   3320    319  91.23%   -34752      -2
>   moxie-rtems             3497   3252    245  92.99%   -87305      -3
>   msp430-elf              4353   3876    477  89.04%   -23780      -1
>   nds32le-elf             3042   2780    262  91.39%   -27320      -1
>   nios2-linux-gnu         1683   1355    328  80.51%    -8065      -1
>   nvptx-none              2114   1781    333  84.25%   -12589      -2
>   or1k-elf                3045   2699    346  88.64%   -14328      -2
>   pdp11                   4515   4146    369  91.83%   -26047      -2
>   pru-elf                 1585   1245    340  78.55%    -5225      -1
>   riscv32-elf             2122   2000    122  94.25%  -101162      -2
>   riscv64-elf             1841   1726    115  93.75%   -49997      -2
>   rl78-elf                2823   2530    293  89.62%   -40742      -4
>   rx-elf                  2614   2480    134  94.87%   -18863      -1
>   s390-linux-gnu          1591   1393    198  87.55%   -16696      -1
>   s390x-linux-gnu         2015   1879    136  93.25%   -21134      -1
>   sh-linux-gnu            1870   1507    363  80.59%    -9491      -1
>   sparc-linux-gnu         1123   1075     48  95.73%   -14503      -1
>   sparc-wrs-vxworks       1121   1073     48  95.72%   -14578      -1
>   sparc64-linux-gnu       1096   1021     75  93.16%   -15003      -1
>   v850-elf                1897   1728    169  91.09%   -11078      -1
>   vax-netbsdelf           3035   2995     40  98.68%   -27642      -1
>   visium-elf              1392   1106    286  79.45%    -7984      -2
>   xstormy16-elf           2577   2071    506  80.36%   -13061      -1

I wonder if you can amend doc/passes.texi, specifically noting differences
between fwprop, combine and late-combine?

> gcc/
>         PR rtl-optimization/106594
>         * Makefile.in (OBJS): Add late-combine.o.
>         * common.opt (flate-combine-instructions): New option.
>         * doc/invoke.texi: Document it.
>         * opts.cc (default_options_table): Enable it by default at -O2
>         and above.
>         * tree-pass.h (make_pass_late_combine): Declare.
>         * late-combine.cc: New file.
>         * passes.def: Add two instances of late_combine.
>         * config/i386/i386-options.cc (ix86_override_options_after_change):
>         Disable late-combine by default.
>         * config/rs6000/rs6000.cc (rs6000_option_override_internal): Likewise.
>         * config/xtensa/xtensa.cc (xtensa_option_override): Likewise.
>
> gcc/testsuite/
>         PR rtl-optimization/106594
>         * gcc.dg/ira-shrinkwrap-prep-1.c: Restrict XFAIL to non-aarch64
>         targets.
>         * gcc.dg/ira-shrinkwrap-prep-2.c: Likewise.
>         * gcc.dg/stack-check-4.c: Add -fno-shrink-wrap.
>         * gcc.target/aarch64/bitfield-bitint-abi-align16.c: Add
>         -fno-late-combine-instructions.
>         * gcc.target/aarch64/bitfield-bitint-abi-align8.c: Likewise.
>         * gcc.target/aarch64/sve/cond_asrd_3.c: Remove XFAILs.
>         * gcc.target/aarch64/sve/cond_convert_3.c: Likewise.
>         * gcc.target/aarch64/sve/cond_fabd_5.c: Likewise.
>         * gcc.target/aarch64/sve/cond_convert_6.c: Expect the MOVPRFX /Zs
>         described in the comment.
>         * gcc.target/aarch64/sve/cond_unary_4.c: Likewise.
>         * gcc.target/aarch64/pr106594_1.c: New test.
> ---
>  gcc/Makefile.in                               |   1 +
>  gcc/common.opt                                |   5 +
>  gcc/config/i386/i386-options.cc               |   4 +
>  gcc/config/rs6000/rs6000.cc                   |   8 +
>  gcc/config/xtensa/xtensa.cc                   |  11 +
>  gcc/doc/invoke.texi                           |  11 +-
>  gcc/late-combine.cc                           | 747 ++++++++++++++++++
>  gcc/opts.cc                                   |   1 +
>  gcc/passes.def                                |   2 +
>  gcc/testsuite/gcc.dg/ira-shrinkwrap-prep-1.c  |   2 +-
>  gcc/testsuite/gcc.dg/ira-shrinkwrap-prep-2.c  |   2 +-
>  gcc/testsuite/gcc.dg/stack-check-4.c          |   2 +-
>  .../aarch64/bitfield-bitint-abi-align16.c     |   2 +-
>  .../aarch64/bitfield-bitint-abi-align8.c      |   2 +-
>  gcc/testsuite/gcc.target/aarch64/pr106594_1.c |  20 +
>  .../gcc.target/aarch64/sve/cond_asrd_3.c      |  10 +-
>  .../gcc.target/aarch64/sve/cond_convert_3.c   |   8 +-
>  .../gcc.target/aarch64/sve/cond_convert_6.c   |   8 +-
>  .../gcc.target/aarch64/sve/cond_fabd_5.c      |  11 +-
>  .../gcc.target/aarch64/sve/cond_unary_4.c     |  13 +-
>  gcc/tree-pass.h                               |   1 +
>  21 files changed, 834 insertions(+), 37 deletions(-)
>  create mode 100644 gcc/late-combine.cc
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/pr106594_1.c
>
> diff --git a/gcc/Makefile.in b/gcc/Makefile.in
> index f5adb647d3f..5e29ddb5690 100644
> --- a/gcc/Makefile.in
> +++ b/gcc/Makefile.in
> @@ -1574,6 +1574,7 @@ OBJS = \
>         ira-lives.o \
>         jump.o \
>         langhooks.o \
> +       late-combine.o \
>         lcm.o \
>         lists.o \
>         loop-doloop.o \
> diff --git a/gcc/common.opt b/gcc/common.opt
> index f2bc47fdc5e..327230967ea 100644
> --- a/gcc/common.opt
> +++ b/gcc/common.opt
> @@ -1796,6 +1796,11 @@ Common Var(flag_large_source_files) Init(0)
>  Improve GCC's ability to track column numbers in large source files,
>  at the expense of slower compilation.
>
> +flate-combine-instructions
> +Common Var(flag_late_combine_instructions) Optimization Init(0)
> +Run two instruction combination passes late in the pass pipeline;
> +one before register allocation and one after.
> +
>  floop-parallelize-all
>  Common Var(flag_loop_parallelize_all) Optimization
>  Mark all loops as parallel.
> diff --git a/gcc/config/i386/i386-options.cc b/gcc/config/i386/i386-options.cc
> index f2cecc0e254..4620bf8e9e6 100644
> --- a/gcc/config/i386/i386-options.cc
> +++ b/gcc/config/i386/i386-options.cc
> @@ -1942,6 +1942,10 @@ ix86_override_options_after_change (void)
>        flag_cunroll_grow_size = flag_peel_loops || optimize >= 3;
>      }
>
> +  /* Late combine tends to undo some of the effects of STV and RPAD,
> +     by combining instructions back to their original form.  */
> +  if (!OPTION_SET_P (flag_late_combine_instructions))
> +    flag_late_combine_instructions = 0;
>  }
>
>  /* Clear stack slot assignments remembered from previous functions.
> diff --git a/gcc/config/rs6000/rs6000.cc b/gcc/config/rs6000/rs6000.cc
> index e4dc629ddcc..f39b8909925 100644
> --- a/gcc/config/rs6000/rs6000.cc
> +++ b/gcc/config/rs6000/rs6000.cc
> @@ -4768,6 +4768,14 @@ rs6000_option_override_internal (bool global_init_p)
>         targetm.expand_builtin_va_start = NULL;
>      }
>
> +  /* One of the late-combine passes runs after register allocation
> +     and can match define_insn_and_splits that were previously used
> +     only before register allocation.  Some of those define_insn_and_splits
> +     use gen_reg_rtx unconditionally.  Disable late-combine by default
> +     until the define_insn_and_splits are fixed.  */
> +  if (!OPTION_SET_P (flag_late_combine_instructions))
> +    flag_late_combine_instructions = 0;
> +
>    rs6000_override_options_after_change ();
>
>    /* If not explicitly specified via option, decide whether to generate indexed
> diff --git a/gcc/config/xtensa/xtensa.cc b/gcc/config/xtensa/xtensa.cc
> index 45dc1be3ff5..308dc62e0f8 100644
> --- a/gcc/config/xtensa/xtensa.cc
> +++ b/gcc/config/xtensa/xtensa.cc
> @@ -59,6 +59,7 @@ along with GCC; see the file COPYING3.  If not see
>  #include "tree-pass.h"
>  #include "print-rtl.h"
>  #include <math.h>
> +#include "opts.h"
>
>  /* This file should be included last.  */
>  #include "target-def.h"
> @@ -2916,6 +2917,16 @@ xtensa_option_override (void)
>        flag_reorder_blocks_and_partition = 0;
>        flag_reorder_blocks = 1;
>      }
> +
> +  /* One of the late-combine passes runs after register allocation
> +     and can match define_insn_and_splits that were previously used
> +     only before register allocation.  Some of those define_insn_and_splits
> +     require the split to take place, but have a split condition of
> +     can_create_pseudo_p, and so matching after RA will give an
> +     unsplittable instruction.  Disable late-combine by default until
> +     the define_insn_and_splits are fixed.  */
> +  if (!OPTION_SET_P (flag_late_combine_instructions))
> +    flag_late_combine_instructions = 0;
>  }
>
>  /* Implement TARGET_HARD_REGNO_NREGS.  */
> diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
> index 5d7a87fde86..3b8c427d509 100644
> --- a/gcc/doc/invoke.texi
> +++ b/gcc/doc/invoke.texi
> @@ -575,7 +575,7 @@ Objective-C and Objective-C++ Dialects}.
>  -fipa-bit-cp -fipa-vrp -fipa-pta -fipa-profile -fipa-pure-const
>  -fipa-reference -fipa-reference-addressable
>  -fipa-stack-alignment -fipa-icf -fira-algorithm=@var{algorithm}
> --flive-patching=@var{level}
> +-flate-combine-instructions -flive-patching=@var{level}
>  -fira-region=@var{region} -fira-hoist-pressure
>  -fira-loop-pressure -fno-ira-share-save-slots
>  -fno-ira-share-spill-slots
> @@ -13675,6 +13675,15 @@ equivalences that are found only by GCC and equivalences found only by Gold.
>
>  This flag is enabled by default at @option{-O2} and @option{-Os}.
>
> +@opindex flate-combine-instructions
> +@item -flate-combine-instructions
> +Enable two instruction combination passes that run relatively late in the
> +compilation process.  One of the passes runs before register allocation and
> +the other after register allocation.  The main aim of the passes is to
> +substitute definitions into all uses.
> +
> +Most targets enable this flag by default at @option{-O2} and @option{-Os}.
> +
>  @opindex flive-patching
>  @item -flive-patching=@var{level}
>  Control GCC's optimizations to produce output suitable for live-patching.
> diff --git a/gcc/late-combine.cc b/gcc/late-combine.cc
> new file mode 100644
> index 00000000000..22a1d81d38e
> --- /dev/null
> +++ b/gcc/late-combine.cc
> @@ -0,0 +1,747 @@
> +// Late-stage instruction combination pass.
> +// Copyright (C) 2023-2024 Free Software Foundation, Inc.
> +//
> +// This file is part of GCC.
> +//
> +// GCC is free software; you can redistribute it and/or modify it under
> +// the terms of the GNU General Public License as published by the Free
> +// Software Foundation; either version 3, or (at your option) any later
> +// version.
> +//
> +// GCC is distributed in the hope that it will be useful, but WITHOUT ANY
> +// WARRANTY; without even the implied warranty of MERCHANTABILITY or
> +// FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
> +// for more details.
> +//
> +// You should have received a copy of the GNU General Public License
> +// along with GCC; see the file COPYING3.  If not see
> +// <http://www.gnu.org/licenses/>.
> +
> +// The current purpose of this pass is to substitute definitions into
> +// all uses, so that the definition can be removed.  However, it could
> +// be extended to handle other combination-related optimizations in future.
> +//
> +// The pass can run before or after register allocation.  When running
> +// before register allocation, it tries to avoid cases that are likely
> +// to increase register pressure.  For the same reason, it avoids moving
> +// instructions around, even if doing so would allow an optimization to
> +// succeed.  These limitations are removed when running after register
> +// allocation.
> +
> +#define INCLUDE_ALGORITHM
> +#define INCLUDE_FUNCTIONAL
> +#include "config.h"
> +#include "system.h"
> +#include "coretypes.h"
> +#include "backend.h"
> +#include "rtl.h"
> +#include "df.h"
> +#include "rtl-ssa.h"
> +#include "print-rtl.h"
> +#include "tree-pass.h"
> +#include "cfgcleanup.h"
> +#include "target.h"
> +
> +using namespace rtl_ssa;
> +
> +namespace {
> +const pass_data pass_data_late_combine =
> +{
> +  RTL_PASS, // type
> +  "late_combine", // name
> +  OPTGROUP_NONE, // optinfo_flags
> +  TV_NONE, // tv_id
> +  0, // properties_required
> +  0, // properties_provided
> +  0, // properties_destroyed
> +  0, // todo_flags_start
> +  TODO_df_finish, // todo_flags_finish
> +};
> +
> +// Represents an attempt to substitute a single-set definition into all
> +// uses of the definition.
> +class insn_combination
> +{
> +public:
> +  insn_combination (set_info *, rtx, rtx);
> +  bool run ();
> +  array_slice<insn_change *const> use_changes () const;
> +
> +private:
> +  use_array get_new_uses (use_info *);
> +  bool substitute_nondebug_use (use_info *);
> +  bool substitute_nondebug_uses (set_info *);
> +  bool try_to_preserve_debug_info (insn_change &, use_info *);
> +  void substitute_debug_use (use_info *);
> +  bool substitute_note (insn_info *, rtx, bool);
> +  void substitute_notes (insn_info *, bool);
> +  void substitute_note_uses (use_info *);
> +  void substitute_optional_uses (set_info *);
> +
> +  // Represents the state of the function's RTL at the start of this
> +  // combination attempt.
> +  insn_change_watermark m_rtl_watermark;
> +
> +  // Represents the rtl-ssa state at the start of this combination attempt.
> +  obstack_watermark m_attempt;
> +
> +  // The instruction that contains the definition, and that we're trying
> +  // to delete.
> +  insn_info *m_def_insn;
> +
> +  // The definition itself.
> +  set_info *m_def;
> +
> +  // The destination and source of the single set that defines m_def.
> +  // The destination is known to be a plain REG.
> +  rtx m_dest;
> +  rtx m_src;
> +
> +  // Contains the full list of changes that we want to make, in reverse
> +  // postorder.
> +  auto_vec<insn_change *> m_nondebug_changes;
> +};
> +
> +// Class that represents one run of the pass.
> +class late_combine
> +{
> +public:
> +  unsigned int execute (function *);
> +
> +private:
> +  rtx optimizable_set (insn_info *);
> +  bool check_register_pressure (insn_info *, rtx);
> +  bool check_uses (set_info *, rtx);
> +  bool combine_into_uses (insn_info *, insn_info *);
> +
> +  auto_vec<insn_info *> m_worklist;
> +};
> +
> +insn_combination::insn_combination (set_info *def, rtx dest, rtx src)
> +  : m_rtl_watermark (),
> +    m_attempt (crtl->ssa->new_change_attempt ()),
> +    m_def_insn (def->insn ()),
> +    m_def (def),
> +    m_dest (dest),
> +    m_src (src),
> +    m_nondebug_changes ()
> +{
> +}
> +
> +array_slice<insn_change *const>
> +insn_combination::use_changes () const
> +{
> +  return { m_nondebug_changes.address () + 1,
> +           m_nondebug_changes.length () - 1 };
> +}
> +
> +// USE is a direct or indirect use of m_def.  Return the list of uses
> +// that would be needed after substituting m_def into the instruction.
> +// The returned list is marked as invalid if USE's insn and m_def_insn
> +// use different definitions for the same resource (register or memory).
> +use_array
> +insn_combination::get_new_uses (use_info *use)
> +{
> +  auto *def = use->def ();
> +  auto *use_insn = use->insn ();
> +
> +  use_array new_uses = use_insn->uses ();
> +  new_uses = remove_uses_of_def (m_attempt, new_uses, def);
> +  new_uses = merge_access_arrays (m_attempt, m_def_insn->uses (), new_uses);
> +  if (new_uses.is_valid () && use->ebb () != m_def->ebb ())
> +    new_uses = crtl->ssa->make_uses_available (m_attempt, new_uses, use->bb (),
> +                                               use_insn->is_debug_insn ());
> +  return new_uses;
> +}
> +
> +// Start the process of trying to replace USE by substitution, given that
> +// USE occurs in a non-debug instruction.  Check:
> +//
> +// - that the substitution can be represented in RTL
> +//
> +// - that each use of a resource (register or memory) within the new
> +//   instruction has a consistent definition
> +//
> +// - that the new instruction is a recognized pattern
> +//
> +// - that the instruction can be placed somewhere that makes all definitions
> +//   and uses valid, and that permits any new hard-register clobbers added
> +//   during the recognition process
> +//
> +// Return true on success.
> +bool
> +insn_combination::substitute_nondebug_use (use_info *use)
> +{
> +  insn_info *use_insn = use->insn ();
> +  rtx_insn *use_rtl = use_insn->rtl ();
> +
> +  if (dump_file && (dump_flags & TDF_DETAILS))
> +    dump_insn_slim (dump_file, use->insn ()->rtl ());
> +
> +  // Check that we can change the instruction pattern.  Leave recognition
> +  // of the result till later.
> +  insn_propagation prop (use_rtl, m_dest, m_src);
> +  if (!prop.apply_to_pattern (&PATTERN (use_rtl))
> +      || prop.num_replacements == 0)
> +    {
> +      if (dump_file && (dump_flags & TDF_DETAILS))
> +        fprintf (dump_file, "-- RTL substitution failed\n");
> +      return false;
> +    }
> +
> +  use_array new_uses = get_new_uses (use);
> +  if (!new_uses.is_valid ())
> +    {
> +      if (dump_file && (dump_flags & TDF_DETAILS))
> +        fprintf (dump_file, "-- could not prove that all sources"
> +                 " are available\n");
> +      return false;
> +    }
> +
> +  // Create a tentative change for the use.
> +  auto *where = XOBNEW (m_attempt, insn_change);
> +  auto *use_change = new (where) insn_change (use_insn);
> +  m_nondebug_changes.safe_push (use_change);
> +  use_change->new_uses = new_uses;
> +
> +  struct local_ignore : ignore_nothing
> +  {
> +    local_ignore (const set_info *def, const insn_info *use_insn)
> +      : m_def (def), m_use_insn (use_insn) {}
> +
> +    // We don't limit the number of insns per optimization, so ignoring all
> +    // insns for all insns would lead to quadratic complexity.  Just ignore
> +    // the use and definition, which should be enough for most purposes.
> +    bool
> +    should_ignore_insn (const insn_info *insn)
> +    {
> +      return insn == m_def->insn () || insn == m_use_insn;
> +    }
> +
> +    // Ignore the definition that we're removing, and all uses of it.
> +    bool should_ignore_def (const def_info *def) { return def == m_def; }
> +
> +    const set_info *m_def;
> +    const insn_info *m_use_insn;
> +  };
> +
> +  auto ignore = local_ignore (m_def, use_insn);
> +
> +  // Moving instructions before register allocation could increase
> +  // register pressure.  Only try moving them after RA.
> +  if (reload_completed && can_move_insn_p (use_insn))
> +    use_change->move_range = { use_insn->bb ()->head_insn (),
> +                               use_insn->ebb ()->last_bb ()->end_insn () };
> +  if (!restrict_movement (*use_change, ignore))
> +    {
> +      if (dump_file && (dump_flags & TDF_DETAILS))
> +        fprintf (dump_file, "-- cannot satisfy all definitions and uses"
> +                 " in insn %d\n", INSN_UID (use_insn->rtl ()));
> +      return false;
> +    }
> +
> +  if (!recog (m_attempt, *use_change, ignore))
> +    return false;
> +
> +  return true;
> +}
> +
> +// Apply substitute_nondebug_use to all direct and indirect uses of DEF.
> +// There will be at most one level of indirection.
> +bool
> +insn_combination::substitute_nondebug_uses (set_info *def)
> +{
> +  for (use_info *use : def->nondebug_insn_uses ())
> +    if (!use->is_live_out_use ()
> +        && !use->only_occurs_in_notes ()
> +        && !substitute_nondebug_use (use))
> +      return false;
> +
> +  for (use_info *use : def->phi_uses ())
> +    if (!substitute_nondebug_uses (use->phi ()))
> +      return false;
> +
> +  return true;
> +}
> +
> +// USE_CHANGE.insn () is a debug instruction that uses m_def.  Try to
> +// substitute the definition into the instruction and try to describe
> +// the result in USE_CHANGE.  Return true on success.  Failure means that
> +// the instruction must be reset instead.
> +bool
> +insn_combination::try_to_preserve_debug_info (insn_change &use_change,
> +                                              use_info *use)
> +{
> +  // Punt on unsimplified subregs of hard registers.  In that case,
> +  // propagation can succeed and create a wider reg than the one we
> +  // started with.
> +  if (HARD_REGISTER_NUM_P (use->regno ())
> +      && use->includes_subregs ())
> +    return false;
> +
> +  insn_info *use_insn = use_change.insn ();
> +  rtx_insn *use_rtl = use_insn->rtl ();
> +
> +  use_change.new_uses = get_new_uses (use);
> +  if (!use_change.new_uses.is_valid ()
> +      || !restrict_movement (use_change))
> +    return false;
> +
> +  insn_propagation prop (use_rtl, m_dest, m_src);
> +  return prop.apply_to_pattern (&INSN_VAR_LOCATION_LOC (use_rtl));
> +}
> +
> +// USE_INSN is a debug instruction that uses m_def.  Update it to reflect
> +// the fact that m_def is going to disappear.  Try to preserve the source
> +// value if possible, but reset the instruction if not.
> +void
> +insn_combination::substitute_debug_use (use_info *use)
> +{
> +  auto *use_insn = use->insn ();
> +  rtx_insn *use_rtl = use_insn->rtl ();
> +
> +  auto use_change = insn_change (use_insn);
> +  if (!try_to_preserve_debug_info (use_change, use))
> +    {
> +      use_change.new_uses = {};
> +      use_change.move_range = use_change.insn ();
> +      INSN_VAR_LOCATION_LOC (use_rtl) = gen_rtx_UNKNOWN_VAR_LOC ();
> +    }
> +  insn_change *changes[] = { &use_change };
> +  crtl->ssa->change_insns (changes);
> +}
> +
> +// NOTE is a reg note of USE_INSN, which previously used m_def.  Update
> +// the note to reflect the fact that m_def is going to disappear.  Return
> +// true on success, or false if the note must be deleted.
> +//
> +// CAN_PROPAGATE is true if m_dest can be replaced with m_use.
> +bool
> +insn_combination::substitute_note (insn_info *use_insn, rtx note,
> +                                   bool can_propagate)
> +{
> +  if (REG_NOTE_KIND (note) == REG_EQUAL
> +      || REG_NOTE_KIND (note) == REG_EQUIV)
> +    {
> +      insn_propagation prop (use_insn->rtl (), m_dest, m_src);
> +      return (prop.apply_to_rvalue (&XEXP (note, 0))
> +              && (can_propagate || prop.num_replacements == 0));
> +    }
> +  return true;
> +}
> +
> +// Update USE_INSN's notes after deciding to go ahead with the optimization.
> +// CAN_PROPAGATE is true if m_dest can be replaced with m_use.
> +void
> +insn_combination::substitute_notes (insn_info *use_insn, bool can_propagate)
> +{
> +  rtx_insn *use_rtl = use_insn->rtl ();
> +  rtx *ptr = &REG_NOTES (use_rtl);
> +  while (rtx note = *ptr)
> +    {
> +      if (substitute_note (use_insn, note, can_propagate))
> +        ptr = &XEXP (note, 1);
> +      else
> +        *ptr = XEXP (note, 1);
> +    }
> +}
> +
> +// We've decided to go ahead with the substitution.  Update all REG_NOTES
> +// involving USE.
> +void
> +insn_combination::substitute_note_uses (use_info *use)
> +{
> +  insn_info *use_insn = use->insn ();
> +
> +  bool can_propagate = true;
> +  if (use->only_occurs_in_notes ())
> +    {
> +      // The only uses are in notes.  Try to keep the note if we can,
> +      // but removing it is better than aborting the optimization.
> +      insn_change use_change (use_insn);
> +      use_change.new_uses = get_new_uses (use);
> +      if (!use_change.new_uses.is_valid ()
> +          || !restrict_movement (use_change))
> +        {
> +          use_change.move_range = use_insn;
> +          use_change.new_uses = remove_uses_of_def (m_attempt,
> +                                                    use_insn->uses (),
> +                                                    use->def ());
> +          can_propagate = false;
> +        }
> +      if (dump_file && (dump_flags & TDF_DETAILS))
> +        {
> +          fprintf (dump_file, "%s notes in:\n",
> +                   can_propagate ? "updating" : "removing");
> +          dump_insn_slim (dump_file, use_insn->rtl ());
> +        }
> +      substitute_notes (use_insn, can_propagate);
> +      insn_change *changes[] = { &use_change };
> +      crtl->ssa->change_insns (changes);
> +    }
> +  else
> +    // We've already decided to update the insn's pattern and know that m_src
> +    // will be available at the insn's new location.  Now update its notes.
> +    substitute_notes (use_insn, can_propagate);
> +}
> +
> +// We've decided to go ahead with the substitution and we've dealt with
> +// all uses that occur in the patterns of non-debug insns.  Update all
> +// other uses for the fact that m_def is about to disappear.
> +void
> +insn_combination::substitute_optional_uses (set_info *def)
> +{
> +  if (auto insn_uses = def->all_insn_uses ())
> +    {
> +      use_info *use = *insn_uses.begin ();
> +      while (use)
> +        {
> +          use_info *next_use = use->next_any_insn_use ();
> +          if (use->is_in_debug_insn ())
> +            substitute_debug_use (use);
> +          else if (!use->is_live_out_use ())
> +            substitute_note_uses (use);
> +          use = next_use;
> +        }
> +    }
> +  for (use_info *use : def->phi_uses ())
> +    substitute_optional_uses (use->phi ());
> +}
> +
> +// Try to perform the substitution.  Return true on success.
> +bool
> +insn_combination::run ()
> +{
> +  if (dump_file && (dump_flags & TDF_DETAILS))
> +    {
> +      fprintf (dump_file, "\ntrying to combine definition of r%d in:\n",
> +               m_def->regno ());
> +      dump_insn_slim (dump_file, m_def_insn->rtl ());
> +      fprintf (dump_file, "into:\n");
> +    }
> +
> +  auto def_change = insn_change::delete_insn (m_def_insn);
> +  m_nondebug_changes.safe_push (&def_change);
> +
> +  if (!substitute_nondebug_uses (m_def)
> +      || !changes_are_worthwhile (m_nondebug_changes)
> +      || !crtl->ssa->verify_insn_changes (m_nondebug_changes))
> +    return false;
> +
> +  substitute_optional_uses (m_def);
> +
> +  confirm_change_group ();
> +  crtl->ssa->change_insns (m_nondebug_changes);
> +  return true;
> +}
> +
> +// See whether INSN is a single_set that we can optimize.  Return the
> +// set if so, otherwise return null.
> +rtx
> +late_combine::optimizable_set (insn_info *insn)
> +{
> +  if (!insn->can_be_optimized ()
> +      || insn->is_asm ()
> +      || insn->is_call ()
> +      || insn->has_volatile_refs ()
> +      || insn->has_pre_post_modify ()
> +      || !can_move_insn_p (insn))
> +    return NULL_RTX;
> +
> +  return single_set (insn->rtl ());
> +}
> +
> +// Suppose that we can replace all uses of SET_DEST (SET) with SET_SRC (SET),
> +// where SET occurs in INSN.  Return true if doing so is not likely to
> +// increase register pressure.
> +bool
> +late_combine::check_register_pressure (insn_info *insn, rtx set)
> +{
> +  // Plain register-to-register moves do not establish a register class
> +  // preference and have no well-defined effect on the register allocator.
> +  // If changes in register class are needed, the register allocator is
> +  // in the best position to place those changes.  If no change in
> +  // register class is needed, then the optimization reduces register
> +  // pressure if SET_SRC (set) was already live at uses, otherwise the
> +  // optimization is pressure-neutral.
> +  rtx src = SET_SRC (set);
> +  if (REG_P (src))
> +    return true;
> +
> +  // On the same basis, substituting a SET_SRC that contains a single
> +  // pseudo register either reduces pressure or is pressure-neutral,
> +  // subject to the constraints below.  We would need to do more
> +  // analysis for SET_SRCs that use more than one pseudo register.
> +  unsigned int nregs = 0;
> +  for (auto *use : insn->uses ())
> +    if (use->is_reg ()
> +        && !HARD_REGISTER_NUM_P (use->regno ())
> +        && !use->only_occurs_in_notes ())
> +      if (++nregs > 1)
> +        return false;
> +
> +  // If there are no pseudo registers in SET_SRC then the optimization
> +  // should improve register pressure.
> +  if (nregs == 0)
> +    return true;
> +
> +  // We'd be substituting (set (reg R1) SRC) where SRC is known to
> +  // contain a single pseudo register R2.  Assume for simplicity that
> +  // each new use of R2 would need to be in the same class C as the
> +  // current use of R2.  If, for a realistic allocation, C is a
> +  // non-strict superset of R1's register class, the effect on
> +  // register pressure should be positive or neutral.  If instead
> +  // R1 occupies a different register class from R2, or if R1 has
> +  // more allocation freedom than R2, then there's a higher risk that
> +  // the effect on register pressure could be negative.
> +  //
> +  // First use constrain_operands to get the most likely choice of
> +  // alternative.  For simplicity, just handle the case where the
> +  // output operand is operand 0.
> +  extract_insn (insn->rtl ());
> +  rtx dest = SET_DEST (set);
> +  if (recog_data.n_operands == 0
> +      || recog_data.operand[0] != dest)
> +    return false;
> +
> +  if (!constrain_operands (0, get_enabled_alternatives (insn->rtl ())))
> +    return false;
> +
> +  preprocess_constraints (insn->rtl ());
> +  auto *alt = which_op_alt ();
> +  auto dest_class = alt[0].cl;
> +
> +  // Check operands 1 and above.
> +  auto check_src = [&] (unsigned int i)
> +    {
> +      if (recog_data.is_operator[i])
> +        return true;
> +
> +      rtx op = recog_data.operand[i];
> +      if (CONSTANT_P (op))
> +        return true;
> +
> +      if (SUBREG_P (op))
> +        op = SUBREG_REG (op);
> +      if (REG_P (op))
> +        {
> +          // Ignore hard registers.  We've already rejected uses of non-fixed
> +          // hard registers in the SET_SRC.
> +          if (HARD_REGISTER_P (op))
> +            return true;
> +
> +          // Make sure that the source operand's class is at least as
> +          // permissive as the destination operand's class.
> +          auto src_class = alternative_class (alt, i);
> +          if (!reg_class_subset_p (dest_class, src_class))
> +            return false;
> +
> +          // Make sure that the source operand occupies no more hard
> +          // registers than the destination operand.  This mostly matters
> +          // for subregs.
> +          if (targetm.class_max_nregs (dest_class, GET_MODE (dest))
> +              < targetm.class_max_nregs (src_class, GET_MODE (op)))
> +            return false;
> +
> +          return true;
> +        }
> +      return false;
> +    };
> +  for (int i = 1; i < recog_data.n_operands; ++i)
> +    if (recog_data.operand_type[i] != OP_OUT && !check_src (i))
> +      return false;
> +
> +  return true;
> +}
> +
> +// Check uses of DEF to see whether there is anything obvious that
> +// prevents the substitution of SET into uses of DEF.
> +bool
> +late_combine::check_uses (set_info *def, rtx set)
> +{
> +  use_info *prev_use = nullptr;
> +  for (use_info *use : def->nondebug_insn_uses ())
> +    {
> +      insn_info *use_insn = use->insn ();
> +
> +      if (use->is_live_out_use ())
> +        continue;
> +      if (use->only_occurs_in_notes ())
> +        continue;
> +
> +      // We cannot replace all uses if the value is live on exit.
> +      if (use->is_artificial ())
> +        return false;
> +
> +      // Avoid increasing the complexity of instructions that
> +      // reference allocatable hard registers.
> +      if (!REG_P (SET_SRC (set))
> +          && !reload_completed
> +          && (accesses_include_nonfixed_hard_registers (use_insn->uses ())
> +              || accesses_include_nonfixed_hard_registers (use_insn->defs ())))
> +        return false;
> +
> +      // Don't substitute into a non-local goto, since it can then be
> +      // treated as a jump to local label, e.g. in shorten_branches.
> +      // ??? But this shouldn't be necessary.
> +      if (use_insn->is_jump ()
> +          && find_reg_note (use_insn->rtl (), REG_NON_LOCAL_GOTO, NULL_RTX))
> +        return false;
> +
> +      // Reject cases where one of the uses is a function argument.
> +      // The combine attempt should fail anyway, but this is a common
> +      // case that is easy to check early.
> +      if (use_insn->is_call ()
> +          && HARD_REGISTER_P (SET_DEST (set))
> +          && find_reg_fusage (use_insn->rtl (), USE, SET_DEST (set)))
> +        return false;
> +
> +      // We'll keep the uses in their original order, even if we move
> +      // them relative to other instructions.  Make sure that non-final
> +      // uses do not change any values that occur in the SET_SRC.
> +      if (prev_use && prev_use->ebb () == use->ebb ())
> +        {
> +          def_info *ultimate_def = look_through_degenerate_phi (def);
> +          if (insn_clobbers_resources (prev_use->insn (),
> +                                       ultimate_def->insn ()->uses ()))
> +            return false;
> +        }
> +
> +      prev_use = use;
> +    }
> +
> +  for (use_info *use : def->phi_uses ())
> +    if (!use->phi ()->is_degenerate ()
> +        || !check_uses (use->phi (), set))
> +      return false;
> +
> +  return true;
> +}
> +
> +// Try to remove INSN by substituting a definition into all uses.
> +// If the optimization moves any instructions before CURSOR, add those
> +// instructions to the end of m_worklist.
> +bool
> +late_combine::combine_into_uses (insn_info *insn, insn_info *cursor)
> +{
> +  // For simplicity, don't try to handle sets of multiple hard registers.
> +  // And for correctness, don't remove any assignments to the stack or
> +  // frame pointers, since that would implicitly change the set of valid
> +  // memory locations between this assignment and the next.
> +  //
> +  // Removing assignments to the hard frame pointer would invalidate
> +  // backtraces.
> +  set_info *def = single_set_info (insn);
> +  if (!def
> +      || !def->is_reg ()
> +      || def->regno () == STACK_POINTER_REGNUM
> +      || def->regno () == FRAME_POINTER_REGNUM
> +      || def->regno () == HARD_FRAME_POINTER_REGNUM)
> +    return false;
> +
> +  rtx set = optimizable_set (insn);
> +  if (!set)
> +    return false;
> +
> +  // For simplicity, don't try to handle subreg destinations.
> +  rtx dest = SET_DEST (set);
> +  if (!REG_P (dest) || def->regno () != REGNO (dest))
> +    return false;
> +
> +  // Don't prolong the live ranges of allocatable hard registers, or put
> +  // them into more complicated instructions.  Failing to prevent this
> +  // could lead to spill failures, or at least to worse register allocation.
> +  if (!reload_completed
> +      && accesses_include_nonfixed_hard_registers (insn->uses ()))
> +    return false;
> +
> +  if (!reload_completed && !check_register_pressure (insn, set))
> +    return false;
> +
> +  if (!check_uses (def, set))
> +    return false;
> +
> +  insn_combination combination (def, SET_DEST (set), SET_SRC (set));
> +  if (!combination.run ())
> +    return false;
> +
> +  for (auto *use_change : combination.use_changes ())
> +    if (*use_change->insn () < *cursor)
> +      m_worklist.safe_push (use_change->insn ());
> +    else
> +      break;
> +  return true;
> +}
> +
> +// Run the pass on function FN.
> +unsigned int
> +late_combine::execute (function *fn)
> +{
> +  // Initialization.
> +  calculate_dominance_info (CDI_DOMINATORS);
> +  df_analyze ();
> +  crtl->ssa = new rtl_ssa::function_info (fn);
> +  // Don't allow memory_operand to match volatile MEMs.
> +  init_recog_no_volatile ();
> +
> +  insn_info *insn = *crtl->ssa->nondebug_insns ().begin ();
> +  while (insn)
> +    {
> +      if (!insn->is_artificial ())
> +        {
> +          insn_info *prev = insn->prev_nondebug_insn ();
> +          if (combine_into_uses (insn, prev))
> +            {
> +              // Any instructions that get added to the worklist were
> +              // previously after PREV.  Thus if we were able to move
> +              // an instruction X before PREV during one combination,
> +              // X cannot depend on any instructions that we move before
> +              // PREV during subsequent combinations.  This means that
> +              // the worklist should be free of backwards dependencies,
> +              // even if it isn't necessarily in RPO.
> +              for (unsigned int i = 0; i < m_worklist.length (); ++i)
> +                combine_into_uses (m_worklist[i], prev);
> +              m_worklist.truncate (0);
> +              insn = prev;
> +            }
> +        }
> +      insn = insn->next_nondebug_insn ();
> +    }
> +
> +  // Finalization.
> +  if (crtl->ssa->perform_pending_updates ())
> +    cleanup_cfg (0);
> +  // Make the recognizer allow volatile MEMs again.
> +  init_recog ();
> +  free_dominance_info (CDI_DOMINATORS);
> +  return 0;
> +}
> +
> +class pass_late_combine : public rtl_opt_pass
> +{
> +public:
> +  pass_late_combine (gcc::context *ctxt)
> +    : rtl_opt_pass (pass_data_late_combine, ctxt)
> +  {}
> +
> +  // opt_pass methods:
> +  opt_pass *clone () override { return new pass_late_combine (m_ctxt); }
> +  bool gate (function *) override { return flag_late_combine_instructions; }
> +  unsigned int execute (function *) override;
> +};
> +
> +unsigned int
> +pass_late_combine::execute (function *fn)
> +{
> +  return late_combine ().execute (fn);
> +}
> +
> +} // end namespace
> +
> +// Create a new late-combine pass instance.
> +
> +rtl_opt_pass *
> +make_pass_late_combine (gcc::context *ctxt)
> +{
> +  return new pass_late_combine (ctxt);
> +}
> diff --git a/gcc/opts.cc b/gcc/opts.cc
> index 1b1b46455af..915bce88fd6 100644
> --- a/gcc/opts.cc
> +++ b/gcc/opts.cc
> @@ -664,6 +664,7 @@ static const struct default_options default_options_table[] =
>        VECT_COST_MODEL_VERY_CHEAP },
>      { OPT_LEVELS_2_PLUS, OPT_finline_functions, NULL, 1 },
>      { OPT_LEVELS_2_PLUS, OPT_ftree_loop_distribute_patterns, NULL, 1 },
> +    { OPT_LEVELS_2_PLUS, OPT_flate_combine_instructions, NULL, 1 },
>
>      /* -O2 and above optimizations, but not -Os or -Og.  */
>      { OPT_LEVELS_2_PLUS_SPEED_ONLY, OPT_falign_functions, NULL, 1 },
> diff --git a/gcc/passes.def b/gcc/passes.def
> index 041229e47a6..13c9dc34ddf 100644
> --- a/gcc/passes.def
> +++ b/gcc/passes.def
> @@ -493,6 +493,7 @@ along with GCC; see the file COPYING3.  If not see
>        NEXT_PASS (pass_initialize_regs);
>        NEXT_PASS (pass_ud_rtl_dce);
>        NEXT_PASS (pass_combine);
> +      NEXT_PASS (pass_late_combine);
>        NEXT_PASS (pass_if_after_combine);
>        NEXT_PASS (pass_jump_after_combine);
>        NEXT_PASS (pass_partition_blocks);
> @@ -512,6 +513,7 @@ along with GCC; see the file COPYING3.  If not see
>        NEXT_PASS (pass_postreload);
>        PUSH_INSERT_PASSES_WITHIN (pass_postreload)
>        NEXT_PASS (pass_postreload_cse);
> +      NEXT_PASS (pass_late_combine);
>        NEXT_PASS (pass_gcse2);
>        NEXT_PASS (pass_split_after_reload);
>        NEXT_PASS (pass_ree);
> diff --git a/gcc/testsuite/gcc.dg/ira-shrinkwrap-prep-1.c b/gcc/testsuite/gcc.dg/ira-shrinkwrap-prep-1.c
> index f290b9ccbdc..a95637abbe5 100644
> --- a/gcc/testsuite/gcc.dg/ira-shrinkwrap-prep-1.c
> +++ b/gcc/testsuite/gcc.dg/ira-shrinkwrap-prep-1.c
> @@ -25,5 +25,5 @@ bar (long a)
>  }
>
>  /* { dg-final { scan-rtl-dump "Will split live ranges of parameters" "ira" } } */
> -/* { dg-final { scan-rtl-dump "Split live-range of register" "ira" { xfail *-*-* } } } */
> +/* { dg-final { scan-rtl-dump "Split live-range of register" "ira" { xfail { ! aarch64*-*-* } } } } */
>  /* { dg-final { scan-rtl-dump "Performing shrink-wrapping" "pro_and_epilogue" { xfail powerpc*-*-* } } } */
> diff --git a/gcc/testsuite/gcc.dg/ira-shrinkwrap-prep-2.c b/gcc/testsuite/gcc.dg/ira-shrinkwrap-prep-2.c
> index 6212c95585d..0690e036eaa 100644
> --- a/gcc/testsuite/gcc.dg/ira-shrinkwrap-prep-2.c
> +++ b/gcc/testsuite/gcc.dg/ira-shrinkwrap-prep-2.c
> @@ -30,6 +30,6 @@ bar (long a)
>  }
>
>  /* { dg-final { scan-rtl-dump "Will split live ranges of parameters" "ira" } } */
> -/* { dg-final { scan-rtl-dump "Split live-range of register" "ira" { xfail *-*-* } } } */
> +/* { dg-final { scan-rtl-dump "Split live-range of register" "ira" { xfail { ! aarch64*-*-* } } } } */
>  /* XFAIL due to PR70681.  */
>  /* { dg-final { scan-rtl-dump "Performing shrink-wrapping" "pro_and_epilogue" { xfail arm*-*-* powerpc*-*-* } } } */
> diff --git a/gcc/testsuite/gcc.dg/stack-check-4.c b/gcc/testsuite/gcc.dg/stack-check-4.c
> index b0c5c61972f..052d2abc2f1 100644
> --- a/gcc/testsuite/gcc.dg/stack-check-4.c
> +++ b/gcc/testsuite/gcc.dg/stack-check-4.c
> @@ -20,7 +20,7 @@
>     scan for.  We scan for both the positive and negative cases.  */
>
>  /* { dg-do compile } */
> -/* { dg-options "-O2 -fstack-clash-protection -fdump-rtl-pro_and_epilogue -fno-optimize-sibling-calls" } */
> +/* { dg-options "-O2 -fstack-clash-protection -fdump-rtl-pro_and_epilogue -fno-optimize-sibling-calls -fno-shrink-wrap" } */
>  /* { dg-require-effective-target supports_stack_clash_protection } */
>
>  extern void arf (char *);
> diff --git a/gcc/testsuite/gcc.target/aarch64/bitfield-bitint-abi-align16.c b/gcc/testsuite/gcc.target/aarch64/bitfield-bitint-abi-align16.c
> index 4a228b0a1ce..c29a230a771 100644
> --- a/gcc/testsuite/gcc.target/aarch64/bitfield-bitint-abi-align16.c
> +++ b/gcc/testsuite/gcc.target/aarch64/bitfield-bitint-abi-align16.c
> @@ -1,5 +1,5 @@
>  /* { dg-do compile { target bitint } } */
> -/* { dg-additional-options "-std=c23 -O2 -fno-stack-protector -save-temps -fno-schedule-insns -fno-schedule-insns2" } */
> +/* { dg-additional-options "-std=c23 -O2 -fno-stack-protector -save-temps -fno-schedule-insns -fno-schedule-insns2 -fno-late-combine-instructions" } */
>  /* { dg-final { check-function-bodies "**" "" "" } } */
>
>  #define ALIGN 16
> diff --git a/gcc/testsuite/gcc.target/aarch64/bitfield-bitint-abi-align8.c b/gcc/testsuite/gcc.target/aarch64/bitfield-bitint-abi-align8.c
> index e7f773640f0..13ffbf416ca 100644
> --- a/gcc/testsuite/gcc.target/aarch64/bitfield-bitint-abi-align8.c
> +++ b/gcc/testsuite/gcc.target/aarch64/bitfield-bitint-abi-align8.c
> @@ -1,5 +1,5 @@
>  /* { dg-do compile { target bitint } } */
> -/* { dg-additional-options "-std=c23 -O2 -fno-stack-protector -save-temps -fno-schedule-insns -fno-schedule-insns2" } */
> +/* { dg-additional-options "-std=c23 -O2 -fno-stack-protector -save-temps -fno-schedule-insns -fno-schedule-insns2 -fno-late-combine-instructions" } */
>  /* { dg-final { check-function-bodies "**" "" "" } } */
>
>  #define ALIGN 8
> diff --git a/gcc/testsuite/gcc.target/aarch64/pr106594_1.c b/gcc/testsuite/gcc.target/aarch64/pr106594_1.c
> new file mode 100644
> index 00000000000..71bcafcb44f
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/pr106594_1.c
> @@ -0,0 +1,20 @@
> +/* { dg-options "-O2" } */
> +
> +extern const int constellation_64qam[64];
> +
> +void foo(int nbits,
> +         const char *p_src,
> +         int *p_dst) {
> +
> +  while (nbits > 0U) {
> +    char first = *p_src++;
> +
> +    char index1 = ((first & 0x3) << 4) | (first >> 4);
> +
> +    *p_dst++ = constellation_64qam[index1];
> +
> +    nbits--;
> +  }
> +}
> +
> +/* { dg-final { scan-assembler {(?n)\tldr\t.*\[x[0-9]+, w[0-9]+, sxtw #?2\]} } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/cond_asrd_3.c b/gcc/testsuite/gcc.target/aarch64/sve/cond_asrd_3.c
> index 0d620a30d5d..b537c6154a3 100644
> --- a/gcc/testsuite/gcc.target/aarch64/sve/cond_asrd_3.c
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/cond_asrd_3.c
> @@ -27,9 +27,9 @@ TEST_ALL (DEF_LOOP)
>  /* { dg-final { scan-assembler-times {\tasrd\tz[0-9]+\.h, p[0-7]/m, z[0-9]+\.h, #4\n} 2 } } */
>  /* { dg-final { scan-assembler-times {\tasrd\tz[0-9]+\.s, p[0-7]/m, z[0-9]+\.s, #4\n} 1 } } */
>
> -/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.b, p[0-7]/z, z[0-9]+\.b\n} 3 { xfail *-*-* } } } */
> -/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.h, p[0-7]/z, z[0-9]+\.h\n} 2 { xfail *-*-* } } } */
> -/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.s, p[0-7]/z, z[0-9]+\.s\n} 1 { xfail *-*-* } } } */
> +/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.b, p[0-7]/z, z[0-9]+\.b\n} 3 } } */
> +/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.h, p[0-7]/z, z[0-9]+\.h\n} 2 } } */
> +/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.s, p[0-7]/z, z[0-9]+\.s\n} 1 } } */
>
> -/* { dg-final { scan-assembler-not {\tmov\tz} { xfail *-*-* } } } */
> -/* { dg-final { scan-assembler-not {\tsel\t} { xfail *-*-* } } } */
> +/* { dg-final { scan-assembler-not {\tmov\tz} } } */
> +/* { dg-final { scan-assembler-not {\tsel\t} } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/cond_convert_3.c b/gcc/testsuite/gcc.target/aarch64/sve/cond_convert_3.c
> index a294effd4a9..cff806c278d 100644
> --- a/gcc/testsuite/gcc.target/aarch64/sve/cond_convert_3.c
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/cond_convert_3.c
> @@ -30,11 +30,9 @@ TEST_ALL (DEF_LOOP)
>  /* { dg-final { scan-assembler-times {\tscvtf\tz[0-9]+\.d, p[0-7]/m,} 1 } } */
>  /* { dg-final { scan-assembler-times {\tucvtf\tz[0-9]+\.d, p[0-7]/m,} 1 } } */
>
> -/* Really we should be able to use MOVPRFX /z here, but at the moment
> -   we're relying on combine to merge a SEL and an arithmetic operation,
> -   and the SEL doesn't allow the "false" value to be zero when the "true"
> -   value is a register.  */
> -/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+, z[0-9]+\n} 6 } } */
> +/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.h, p[0-7]/z,} 2 } } */
> +/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.s, p[0-7]/z,} 2 } } */
> +/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.d, p[0-7]/z,} 2 } } */
>
>  /* { dg-final { scan-assembler-not {\tmov\tz[^\n]*z} } } */
>  /* { dg-final { scan-assembler-not {\tsel\t} } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/cond_convert_6.c b/gcc/testsuite/gcc.target/aarch64/sve/cond_convert_6.c
> index 6541a2ea49d..abf0a2e832f 100644
> --- a/gcc/testsuite/gcc.target/aarch64/sve/cond_convert_6.c
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/cond_convert_6.c
> @@ -30,11 +30,9 @@ TEST_ALL (DEF_LOOP)
>  /* { dg-final { scan-assembler-times {\tfcvtzs\tz[0-9]+\.d, p[0-7]/m,} 1 } } */
>  /* { dg-final { scan-assembler-times {\tfcvtzu\tz[0-9]+\.d, p[0-7]/m,} 1 } } */
>
> -/* Really we should be able to use MOVPRFX /z here, but at the moment
> -   we're relying on combine to merge a SEL and an arithmetic operation,
> -   and the SEL doesn't allow the "false" value to be zero when the "true"
> -   value is a register.  */
> -/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+, z[0-9]+\n} 6 } } */
> +/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.h, p[0-7]/z,} 2 } } */
> +/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.s, p[0-7]/z,} 2 } } */
> +/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.d, p[0-7]/z,} 2 } } */
>
>  /* { dg-final { scan-assembler-not {\tmov\tz[^\n]*z} } } */
>  /* { dg-final { scan-assembler-not {\tsel\t} } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/cond_fabd_5.c b/gcc/testsuite/gcc.target/aarch64/sve/cond_fabd_5.c
> index e66477b3bce..401201b315a 100644
> --- a/gcc/testsuite/gcc.target/aarch64/sve/cond_fabd_5.c
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/cond_fabd_5.c
> @@ -24,12 +24,9 @@ TEST_ALL (DEF_LOOP)
>  /* { dg-final { scan-assembler-times {\tfabd\tz[0-9]+\.s, p[0-7]/m,} 1 } } */
>  /* { dg-final { scan-assembler-times {\tfabd\tz[0-9]+\.d, p[0-7]/m,} 1 } } */
>
> -/* Really we should be able to use MOVPRFX /Z here, but at the moment
> -   we're relying on combine to merge a SEL and an arithmetic operation,
> -   and the SEL doesn't allow zero operands.  */
> -/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.h, p[0-7]/z, z[0-9]+\.h\n} 1 { xfail *-*-* } } } */
> -/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.s, p[0-7]/z, z[0-9]+\.s\n} 1 { xfail *-*-* } } } */
> -/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.d, p[0-7]/z, z[0-9]+\.d\n} 1 { xfail *-*-* } } } */
> +/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.h, p[0-7]/z, z[0-9]+\.h\n} 1 } } */
> +/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.s, p[0-7]/z, z[0-9]+\.s\n} 1 } } */
> +/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.d, p[0-7]/z, z[0-9]+\.d\n} 1 } } */
>
>  /* { dg-final { scan-assembler-not {\tmov\tz[^,]*z} } } */
> -/* { dg-final { scan-assembler-not {\tsel\t} { xfail *-*-* } } } */
> +/* { dg-final { scan-assembler-not {\tsel\t} } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/cond_unary_4.c b/gcc/testsuite/gcc.target/aarch64/sve/cond_unary_4.c
> index a491f899088..cbb957bffa4 100644
> --- a/gcc/testsuite/gcc.target/aarch64/sve/cond_unary_4.c
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/cond_unary_4.c
> @@ -52,15 +52,10 @@ TEST_ALL (DEF_LOOP)
>  /* { dg-final { scan-assembler-times {\tfneg\tz[0-9]+\.s, p[0-7]/m,} 1 } } */
>  /* { dg-final { scan-assembler-times {\tfneg\tz[0-9]+\.d, p[0-7]/m,} 1 } } */
>
> -/* Really we should be able to use MOVPRFX /z here, but at the moment
> -   we're relying on combine to merge a SEL and an arithmetic operation,
> -   and the SEL doesn't allow the "false" value to be zero when the "true"
> -   value is a register.  */
> -/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+, z[0-9]+\n} 7 } } */
> -/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.b, p[0-7]/z, z[0-9]+\.b} 1 } } */
> -/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.h, p[0-7]/z, z[0-9]+\.h} 2 } } */
> -/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.s, p[0-7]/z, z[0-9]+\.s} 2 } } */
> -/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.d, p[0-7]/z, z[0-9]+\.d} 2 } } */
> +/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.b, p[0-7]/z, z[0-9]+\.b} 2 } } */
> +/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.h, p[0-7]/z, z[0-9]+\.h} 4 } } */
> +/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.s, p[0-7]/z, z[0-9]+\.s} 4 } } */
> +/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.d, p[0-7]/z, z[0-9]+\.d} 4 } } */
>
>  /* { dg-final { scan-assembler-not {\tmov\tz[^\n]*z} } } */
>  /* { dg-final { scan-assembler-not {\tsel\t} } } */
> diff --git a/gcc/tree-pass.h b/gcc/tree-pass.h
> index edebb2be245..38902b1b01b 100644
> --- a/gcc/tree-pass.h
> +++ b/gcc/tree-pass.h
> @@ -615,6 +615,7 @@ extern rtl_opt_pass *make_pass_branch_prob (gcc::context *ctxt);
>  extern rtl_opt_pass *make_pass_value_profile_transformations (gcc::context
>                                                                *ctxt);
>  extern rtl_opt_pass *make_pass_postreload_cse (gcc::context *ctxt);
> +extern rtl_opt_pass *make_pass_late_combine (gcc::context *ctxt);
>  extern rtl_opt_pass *make_pass_gcse2 (gcc::context *ctxt);
>  extern rtl_opt_pass *make_pass_split_after_reload (gcc::context *ctxt);
>  extern rtl_opt_pass *make_pass_thread_prologue_and_epilogue (gcc::context
>                                                               *ctxt);
> --
> 2.25.1
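To make the transformation concrete, here is a schematic before/after
sketch of the kind of substitution the pass performs, modelled on the
pr106594_1.c test above.  The RTL is illustrative only: the register
numbers, modes and exact address form are invented rather than taken
from a real dump.

    ;; Before: insn A defines r101, and insn B is its only use.
    ;;
    ;;   A: (set (reg:DI 101) (sign_extend:DI (reg:SI 100)))
    ;;   B: (set (reg:SI 102)
    ;;           (mem:SI (plus:DI (mult:DI (reg:DI 101) (const_int 4))
    ;;                            (reg:DI 103))))
    ;;
    ;; late-combine substitutes A's source into B and, because every
    ;; use of r101 accepted the substitution, deletes A:
    ;;
    ;;   B: (set (reg:SI 102)
    ;;           (mem:SI (plus:DI (mult:DI (sign_extend:DI (reg:SI 100))
    ;;                                     (const_int 4))
    ;;                            (reg:DI 103))))
    ;;
    ;; On AArch64 the combined B can match a single extending load such
    ;; as "ldr w2, [x3, w0, sxtw #2]", which is what the new test's
    ;; scan-assembler pattern looks for.

The pre-RA instance would only attempt this after check_register_pressure
and check_uses agree that the substitution is safe and unlikely to
increase pressure.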
Oleg Endo <oleg.endo@t-online.de> writes:
> On Thu, 2024-06-20 at 14:34 +0100, Richard Sandiford wrote:
>>
>> I tried compiling at least one target per CPU directory and comparing
>> the assembly output for parts of the GCC testsuite.  This is just a way
>> of getting a flavour of how the pass performs; it obviously isn't a
>> meaningful benchmark.  All targets seemed to improve on average:
>>
>> [full per-target results table snipped]
>
> Since you have already briefly compared some of the code, can you share
> those cases which get worse and might require some potential follow-up
> patches?

I think a lot of them are unpredictable secondary effects, such as on
register allocation, tail merging potential, and so on.  For sh, it also
includes whether delay slots are filled with useful work, or whether
they get a nop.  (Instruction combination tends to create more complex
instructions, so there will be fewer 2-byte instructions to act as delay
slot candidates.)
Also, this kind of combination can decrease the number of instructions
but increase the constant pool size.  The figures take that into account.
(The comparison is a bit ad hoc, though, since I wasn't dedicated enough
to try to build a full source->executable toolchain for each target. :))

To give one example, the effect on gcc.c-torture/compile/20040727-1.c is:

@@ -6,18 +6,21 @@
 	.global	GC_dirty_init
 	.type	GC_dirty_init, @function
 GC_dirty_init:
-	mov.l	.L2,r4
-	mov	r4,r6
-	mov	r4,r5
-	add	#-64,r5
-	mov.l	.L3,r0
+	mov.l	.L2,r6
+	mov.l	.L3,r5
+	mov.l	.L4,r4
+	mov.l	.L5,r0
 	jmp	@r0
-	add	#-128,r4
-.L4:
+	nop
+.L6:
 	.align 2
 .L2:
 	.long	GC_old_exc_ports+132
 .L3:
+	.long	GC_old_exc_ports+68
+.L4:
+	.long	GC_old_exc_ports+4
+.L5:
 	.long	task_get_exception_ports
 	.size	GC_dirty_init, .-GC_dirty_init
 	.local	GC_old_exc_ports

Thanks,
Richard
Richard Biener <richard.guenther@gmail.com> writes:
> [...]
> I wonder if you can amend doc/passes.texi, specifically noting differences
> between fwprop, combine and late-combine?

Ooh, we have a doc/passes.texi? :)  Somehow missed that.

How about the patch below?

Thanks,
Richard


diff --git a/gcc/doc/passes.texi b/gcc/doc/passes.texi
index 5746d3ec636..4ac7a2306a1 100644
--- a/gcc/doc/passes.texi
+++ b/gcc/doc/passes.texi
@@ -991,6 +991,25 @@ RTL expressions for the instructions by substitution, simplifies the
 result using algebra, and then attempts to match the result against
 the machine description.  The code is located in @file{combine.cc}.
 
+@item Late instruction combination
+
+This pass attempts to do further instruction combination, on top of
+that performed by @file{combine.cc}.  Its current purpose is to
+substitute definitions into all uses simultaneously, so that the
+definition can be removed.  This differs from the forward propagation
+pass, whose purpose is instead to simplify individual uses on the
+assumption that the definition will remain.  It differs from
+@file{combine.cc} in that there is no hard-coded limit on the number
+of instructions that can be combined at once.  It also differs from
+@file{combine.cc} in that it can move instructions, where necessary.
+
+However, the pass is not in principle limited to this form of
+combination.  It is intended to be a home for other, future
+combination approaches as well.
+
+The pass runs twice, once before register allocation and once after
+register allocation.  The code is located in @file{late-combine.cc}.
+
 @item Mode switching optimization
 
 This pass looks for instructions that require the processor to be in a
On Fri, Jun 21, 2024 at 10:21 AM Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> Richard Biener <richard.guenther@gmail.com> writes:
> > [...]
> > I wonder if you can amend doc/passes.texi, specifically noting differences
> > between fwprop, combine and late-combine?
>
> Ooh, we have a doc/passes.texi? :)  Somehow missed that.

Yeah, I also usually forget this.

> How about the patch below?

Thanks - looks good to me.

Richard.

> Thanks,
> Richard
>
> [...]
On 6/20/24 7:34 AM, Richard Sandiford wrote:
> This patch adds a combine pass that runs late in the pipeline.
> There are two instances: one between combine and split1, and one
> after postreload.
>
> The pass currently has a single objective: remove definitions by
> substituting into all uses.  The pre-RA version tries to restrict
> itself to cases that are likely to have a neutral or beneficial
> effect on register pressure.
I would expect this to fix a problem we've seen on RISC-V as well.
Essentially we have A, B and C.  We want to combine A->B and A->C,
generating B' and C', and eliminate A.  This shows up in the xz loop.

>
> On most targets, the pass is enabled by default at -O2 and above.
> However, it has a tendency to undo x86's STV and RPAD passes,
> by folding the more complex post-STV/RPAD form back into the
> simpler pre-pass form.
IIRC the limited enablement was one of the things folks were unhappy
about in the gcc-14 cycle.  Good to see that addressed.

>
> Also, running a pass after register allocation means that we can
> now match define_insn_and_splits that were previously only matched
> before register allocation.  This trips things like:
>
>   (define_insn_and_split "..."
>     [...pattern...]
>     "...cond..."
>     "#"
>     "&& 1"
>     [...pattern...]
>   {
>     ...unconditional use of gen_reg_rtx ()...;
>   }
>
> because matching and splitting after RA will call gen_reg_rtx when
> pseudos are no longer allowed.  rs6000 has several instances of this.
Interesting.  I suspect ppc won't be the only affected port.  This is
somewhat worrisome.

>
> xtensa has a variation in which the split condition is:
>
>   "&& can_create_pseudo_p ()"
>
> The failure then is that, if we match after RA, we'll never be
> able to split the instruction.
>
> The patch therefore disables the pass by default on i386, rs6000
> and xtensa.  Hopefully we can fix those ports later (if their
> maintainers want).  It seems easier to add the pass first, though,
> to make it easier to test any such fixes.
I suspect it'll be a "does this make code better on the port, then
let's fix the port so it can be used consistently" kind of scenario.

Given the data you've presented I strongly suspect it would make the
code better on the xtensa, so hopefully Max will do the gruntwork on
that one.

>
> gcc/
>   PR rtl-optimization/106594
>   * Makefile.in (OBJS): Add late-combine.o.
>   * common.opt (flate-combine-instructions): New option.
>   * doc/invoke.texi: Document it.
>   * opts.cc (default_options_table): Enable it by default at -O2
>   and above.
>   * tree-pass.h (make_pass_late_combine): Declare.
>   * late-combine.cc: New file.
>   * passes.def: Add two instances of late_combine.
>   * config/i386/i386-options.cc (ix86_override_options_after_change):
>   Disable late-combine by default.
>   * config/rs6000/rs6000.cc (rs6000_option_override_internal): Likewise.
>   * config/xtensa/xtensa.cc (xtensa_option_override): Likewise.
>
> gcc/testsuite/
>   PR rtl-optimization/106594
>   * gcc.dg/ira-shrinkwrap-prep-1.c: Restrict XFAIL to non-aarch64
>   targets.
>   * gcc.dg/ira-shrinkwrap-prep-2.c: Likewise.
>   * gcc.dg/stack-check-4.c: Add -fno-shrink-wrap.
>   * gcc.target/aarch64/bitfield-bitint-abi-align16.c: Add
>   -fno-late-combine-instructions.
>   * gcc.target/aarch64/bitfield-bitint-abi-align8.c: Likewise.
>   * gcc.target/aarch64/sve/cond_asrd_3.c: Remove XFAILs.
>   * gcc.target/aarch64/sve/cond_convert_3.c: Likewise.
>   * gcc.target/aarch64/sve/cond_fabd_5.c: Likewise.
>   * gcc.target/aarch64/sve/cond_convert_6.c: Expect the MOVPRFX /Zs
>   described in the comment.
>   * gcc.target/aarch64/sve/cond_unary_4.c: Likewise.
>   * gcc.target/aarch64/pr106594_1.c: New test.
> ---
OK.  Obviously we'll need to keep an eye on testing state after this
patch.  I do expect fallout from the splitter issue noted above, but
IMHO those are port problems for the port maintainers to sort out.

Jeff
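To make the A/B/C scenario described above concrete, here is a minimal
pseudo-RTL sketch (register numbers and operations invented purely for
illustration).  One definition feeds two uses; late-combine substitutes
the definition into both uses at once and then deletes it:

	;; Before:
	;; A: (set (reg 100) (ashift (reg 101) (const_int 2)))
	;; B: (set (reg 102) (plus (reg 100) (reg 103)))
	;; C: (set (reg 104) (plus (reg 100) (reg 105)))

	;; After substituting A into both B and C (assuming each result
	;; still matches a pattern, e.g. a shift-and-add instruction),
	;; A can be removed:
	;; B': (set (reg 102) (plus (ashift (reg 101) (const_int 2)) (reg 103)))
	;; C': (set (reg 104) (plus (ashift (reg 101) (const_int 2)) (reg 105)))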
Hi!

On 2024/06/20 22:34, Richard Sandiford wrote:
> This patch adds a combine pass that runs late in the pipeline.
> [...]
>
> Also, running a pass after register allocation means that we can
> now match define_insn_and_splits that were previously only matched
> before register allocation.  This trips things like:
>
>   (define_insn_and_split "..."
>     [...pattern...]
>     "...cond..."
>     "#"
>     "&& 1"
>     [...pattern...]
>   {
>     ...unconditional use of gen_reg_rtx ()...;
>   }
>
> because matching and splitting after RA will call gen_reg_rtx when
> pseudos are no longer allowed.  rs6000 has several instances of this.

xtensa also has something like that.

> xtensa has a variation in which the split condition is:
>
>   "&& can_create_pseudo_p ()"
>
> The failure then is that, if we match after RA, we'll never be
> able to split the instruction.

To be honest, I'm confused by the possibility of adding a split pattern
application opportunity that depends on the optimization options after
Rel... ah, LRA and before the existing rtl-split2.

Because I just recently submitted a patch that I expected would reliably
(i.e. regardless of optimization options, etc.) apply the split pattern
first in the rtl-split2 pass after RA, and it was merged.

> [...]
> ** snip **

To be more frank, once a split pattern is defined, it is applied by
the existing five split paths and possibly by combiners.  In most cases,
it is enough to apply it in one of these places, or that is what the
pattern creator intended.

Wouldn't applying the split pattern indiscriminately in various places
be a waste of execution resources and bring about unexpected and
undesired results?

I think we need some way to properly control the application of the
split pattern, perhaps some predicate function.
Takayuki 'January June' Suwa <jjsuwa_sys3175@yahoo.co.jp> writes:
> On 2024/06/20 22:34, Richard Sandiford wrote:
> [...]
>
> To be more frank, once a split pattern is defined, it is applied by
> the existing five split paths and possibly by combiners.  In most cases,
> it is enough to apply it in one of these places, or that is what the
> pattern creator intended.
>
> Wouldn't applying the split pattern indiscriminately in various places
> be a waste of execution resources and bring about unexpected and undesired
> results?
>
> I think we need some way to properly control the application of the split
> pattern, perhaps some predicate function.

The problem is more the define_insn part of the define_insn_and_split,
rather than the define_split part.  The number and location of the split
passes is the same: anything matched by rtl-late_combine1 will be split by
rtl-split1 and anything matched by rtl-late_combine2 will be split by
rtl-split2.  (If the split condition allows it, of course.)
But more things can be matched by rtl-late_combine2 than are matched by
other post-RA passes like rtl-postreload.  And that's what causes the
issue.  If:

  (define_insn_and_split "..."
    [...pattern...]
    "...cond..."
    "#"
    "&& 1"
    [...pattern...]
  {
    ...unconditional use of gen_reg_rtx ()...;
  }

is matched by rtl-late_combine2, the split will be done by rtl-split2.
But the split will ICE, because it isn't valid to call gen_reg_rtx after
register allocation.

Similarly, if:

  (define_insn_and_split "..."
    [...pattern...]
    "...cond..."
    "#"
    "&& can_create_pseudo_p ()"
    [...pattern...]
  {
    ...unconditional use of gen_reg_rtx ()...;
  }

is matched by rtl-late_combine2, the can_create_pseudo_p condition will
be false in rtl-split2, and in all subsequent split passes.  So we'll
still have the unsplit instruction during final, which will ICE because
it doesn't have a valid means of implementing the "#".

The traditional (and IMO correct) way to handle this is to make the
pattern reserve the temporary registers that it needs, using
match_scratches.  rs6000 has many examples of this.  E.g.:

(define_insn_and_split "@ieee_128bit_vsx_neg<mode>2"
  [(set (match_operand:IEEE128 0 "register_operand" "=wa")
	(neg:IEEE128 (match_operand:IEEE128 1 "register_operand" "wa")))
   (clobber (match_scratch:V16QI 2 "=v"))]
  "TARGET_FLOAT128_TYPE && !TARGET_FLOAT128_HW"
  "#"
  "&& 1"
  [(parallel [(set (match_dup 0)
		   (neg:IEEE128 (match_dup 1)))
	      (use (match_dup 2))])]
{
  if (GET_CODE (operands[2]) == SCRATCH)
    operands[2] = gen_reg_rtx (V16QImode);

  emit_insn (gen_ieee_128bit_negative_zero (operands[2]));
}
  [(set_attr "length" "8")
   (set_attr "type" "vecsimple")])

Before RA, this is just:

  (set ...)
  (clobber (scratch:V16QI))

and the split creates a new register.  After RA, operand 2 provides
the required temporary register:

  (set ...)
  (clobber (reg:V16QI TMP))

Another approach is to add can_create_pseudo_p () to the define_insn
condition (rather than the split condition).  But IMO that's an ICE
trap, since insns that have already been matched & accepted shouldn't
suddenly become invalid if recog is reattempted later.

Thanks,
Richard
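To make the contrast concrete, here is a purely hypothetical sketch of
the two styles; the name, mode and unspec below are invented, and only
the shape of the fix matters:

	;; Problematic: can be matched by rtl-late_combine2, but can then
	;; never be split, because can_create_pseudo_p () stays false.
	(define_insn_and_split "*hypothetical_op"
	  [(set (match_operand:SI 0 "register_operand" "=r")
		(unspec:SI [(match_operand:SI 1 "register_operand" "r")]
			   UNSPEC_HYPOTHETICAL))]
	  ""
	  "#"
	  "&& can_create_pseudo_p ()"
	  [...split pattern...]
	{
	  rtx tmp = gen_reg_rtx (SImode);
	  ...use tmp in the split...;
	})

	;; Fixed: the insn itself reserves the temporary, so the split
	;; works before RA (via gen_reg_rtx) and after RA (via the
	;; allocated scratch register).
	(define_insn_and_split "*hypothetical_op"
	  [(set (match_operand:SI 0 "register_operand" "=r")
		(unspec:SI [(match_operand:SI 1 "register_operand" "r")]
			   UNSPEC_HYPOTHETICAL))
	   (clobber (match_scratch:SI 2 "=&r"))]
	  ""
	  "#"
	  "&& 1"
	  [...split pattern...]
	{
	  if (GET_CODE (operands[2]) == SCRATCH)
	    operands[2] = gen_reg_rtx (SImode);
	  ...use operands[2] in the split...;
	})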
Hi!

On 2024/06/23 1:49, Richard Sandiford wrote:
> Takayuki 'January June' Suwa <jjsuwa_sys3175@yahoo.co.jp> writes:
> [...]
>
> The traditional (and IMO correct) way to handle this is to make the
> pattern reserve the temporary registers that it needs, using
> match_scratches.  rs6000 has many examples of this.
> [...]
>
> Another approach is to add can_create_pseudo_p () to the define_insn
> condition (rather than the split condition).  But IMO that's an ICE
> trap, since insns that have already been matched & accepted shouldn't
> suddenly become invalid if recog is reattempted later.

Ah, I see.  I understand the standard idiom for synchronizing RA between
define_insn and split conditions (I guess).

However, as I wrote before, it is true that split paths are useful for
other purposes as well.  If this is unacceptable practice, then an
alternative solution would be to add a target-specific path (in my case,
after postreload and before rtl-late_combine2).  That is obviously a very
involved process.
On Sat, Jun 22, 2024 at 6:50 PM Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> Takayuki 'January June' Suwa <jjsuwa_sys3175@yahoo.co.jp> writes:
> [...]
>
> The traditional (and IMO correct) way to handle this is to make the
> pattern reserve the temporary registers that it needs, using
> match_scratches.  rs6000 has many examples of this.
> [...]
>
> Another approach is to add can_create_pseudo_p () to the define_insn
> condition (rather than the split condition).  But IMO that's an ICE
> trap, since insns that have already been matched & accepted shouldn't
> suddenly become invalid if recog is reattempted later.

What about splitting immediately in late-combine?  Wouldn't that possibly
allow more combinations to immediately happen?

Richard.

> Thanks,
> Richard
>
Richard Biener <richard.guenther@gmail.com> writes:
> On Sat, Jun 22, 2024 at 6:50 PM Richard Sandiford
>> [...]
>> Another approach is to add can_create_pseudo_p () to the define_insn
>> condition (rather than the split condition).  But IMO that's an ICE
>> trap, since insns that have already been matched & accepted shouldn't
>> suddenly become invalid if recog is reattempted later.
>
> What about splitting immediately in late-combine?  Wouldn't that possibly
> allow more combinations to immediately happen?

It would be difficult to guarantee termination.  Often the split
instructions can be immediately recombined back to the original
instruction.  Even if we guard against that happening directly,
it'd be difficult to prove that it can't happen indirectly.

We might also run into issues like PR101523.

Combine uses define_splits (without define_insns) for 3->2 combinations,
but the current late-combine optimisation is kind-of 1/N+1->1 x N.

Personally, I think we should allow targets to use the .md file to
define match.pd-style simplification rules involving unspecs, but there
were objections to that when I last suggested it.

Thanks,
Richard
On Mon, Jun 24, 2024 at 10:03 AM Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> Richard Biener <richard.guenther@gmail.com> writes:
> > [...]
> > What about splitting immediately in late-combine?  Wouldn't that possibly
> > allow more combinations to immediately happen?
>
> It would be difficult to guarantee termination.  Often the split
> instructions can be immediately recombined back to the original
> instruction.  Even if we guard against that happening directly,
> it'd be difficult to prove that it can't happen indirectly.
>
> We might also run into issues like PR101523.
>
> Combine uses define_splits (without define_insns) for 3->2 combinations,
> but the current late-combine optimisation is kind-of 1/N+1->1 x N.
>
> Personally, I think we should allow targets to use the .md file to
> define match.pd-style simplification rules involving unspecs, but there
> were objections to that when I last suggested it.

Isn't that what basically "combine-helper" patterns do to some extent?

Richard.

>
> Thanks,
> Richard
Richard Biener <richard.guenther@gmail.com> writes:
> On Mon, Jun 24, 2024 at 10:03 AM Richard Sandiford
> <richard.sandiford@arm.com> wrote:
>> [...]
>> Personally, I think we should allow targets to use the .md file to
>> define match.pd-style simplification rules involving unspecs, but there
>> were objections to that when I last suggested it.
>
> Isn't that what basically "combine-helper" patterns do to some extent?

Partly, but:

(1) It's a big hammer.  It means we add all the overhead of a define_insn
for something that is only meant to survive between one pass and the next.

(2) Unlike match.pd, it isn't designed to be applied iteratively.
There is no attempt even in theory to ensure that match helper
-> split -> match helper -> split -> ... would terminate.

(3) It operates at the level of complete instructions, including e.g.
destinations of sets.  The kind of rule I had in mind would be aimed
at arithmetic simplification, and would operate at the simplify-rtx.cc
level.

That is, if simplify_foo failed to apply a target-independent rule,
it could fall back on an automatically generated target-specific rule,
with the requirement/understanding that these rules really should be
target-specific.  One easy way of enforcing that is to say that
at least one side of a production rule must involve an unspec.

Richard
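For illustration only, such a rule might be written along these lines
(completely hypothetical syntax; no define_simplify construct exists
today, and UNSPEC_FOO is an invented placeholder):

	;; Hypothetical target-specific simplification rule in the .md
	;; file: collapse a nested application of an idempotent unspec.
	;; At least one side involves an unspec, as suggested above.
	(define_simplify
	  (unspec:SI [(unspec:SI [(match_operand:SI 0)] UNSPEC_FOO)]
		     UNSPEC_FOO)
	  (unspec:SI [(match_dup 0)] UNSPEC_FOO))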
On Mon, Jun 24, 2024 at 1:34 PM Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> Richard Biener <richard.guenther@gmail.com> writes:
> > [...]
> > Isn't that what basically "combine-helper" patterns do to some extent?
>
> Partly, but:
> [...]
>
> That is, if simplify_foo failed to apply a target-independent rule,
> it could fall back on an automatically generated target-specific rule,
> with the requirement/understanding that these rules really should be
> target-specific.  One easy way of enforcing that is to say that
> at least one side of a production rule must involve an unspec.

OK, that makes sense.  I did think of having something like match.pd
generate simplify-rtx.cc.  It probably has different constraints, so
simply translating tree codes to rtx codes and re-using match.pd
patterns isn't going to work well.

Richard.

> Richard
>
Hi!

On 2024-06-20T14:34:18+0100, Richard Sandiford <richard.sandiford@arm.com> wrote:
> This patch adds a combine pass that runs late in the pipeline.
> [...]

Nice!

> The patch [...] disables the pass by default on i386, rs6000
> and xtensa.

Like here:

> --- a/gcc/config/i386/i386-options.cc
> +++ b/gcc/config/i386/i386-options.cc
> @@ -1942,6 +1942,10 @@ ix86_override_options_after_change (void)
>        flag_cunroll_grow_size = flag_peel_loops || optimize >= 3;
>      }
>
> +  /* Late combine tends to undo some of the effects of STV and RPAD,
> +     by combining instructions back to their original form.  */
> +  if (!OPTION_SET_P (flag_late_combine_instructions))
> +    flag_late_combine_instructions = 0;
>  }

..., I think also here:

> --- a/gcc/config/rs6000/rs6000.cc
> +++ b/gcc/config/rs6000/rs6000.cc
> @@ -4768,6 +4768,14 @@ rs6000_option_override_internal (bool global_init_p)
>  	targetm.expand_builtin_va_start = NULL;
>      }
>
> +  /* One of the late-combine passes runs after register allocation
> +     and can match define_insn_and_splits that were previously used
> +     only before register allocation.  Some of those define_insn_and_splits
> +     use gen_reg_rtx unconditionally.  Disable late-combine by default
> +     until the define_insn_and_splits are fixed.  */
> +  if (!OPTION_SET_P (flag_late_combine_instructions))
> +    flag_late_combine_instructions = 0;
> +
>    rs6000_override_options_after_change ();

..., this needs to be done in 'rs6000_override_options_after_change'
instead of 'rs6000_option_override_internal', to address the PRs under
discussion.  I'm testing such a patch.

Regards,
Thomas
Thomas Schwinge <tschwinge@baylibre.com> writes:
> Hi!
>
> On 2024-06-20T14:34:18+0100, Richard Sandiford <richard.sandiford@arm.com> wrote:
>> This patch adds a combine pass that runs late in the pipeline.
>> [...]
>
> Nice!
>
> [...]
>
> ..., this needs to be done in 'rs6000_override_options_after_change'
> instead of 'rs6000_option_override_internal', to address the PRs under
> discussion.  I'm testing such a patch.

Oops!  Sorry about that, and thanks for tracking it down.

Richard
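For reference, the fix Thomas describes would presumably look something
like the following untested sketch, which simply moves the existing
default into rs6000_override_options_after_change (signature and
surrounding code abbreviated):

	--- a/gcc/config/rs6000/rs6000.cc
	+++ b/gcc/config/rs6000/rs6000.cc
	 void
	 rs6000_override_options_after_change (void)
	 {
	   ...
	+  /* Disable late-combine by default until the port's
	+     define_insn_and_splits stop calling gen_reg_rtx
	+     unconditionally in their split code.  */
	+  if (!OPTION_SET_P (flag_late_combine_instructions))
	+    flag_late_combine_instructions = 0;
	 }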
diff --git a/gcc/Makefile.in b/gcc/Makefile.in index f5adb647d3f..5e29ddb5690 100644 --- a/gcc/Makefile.in +++ b/gcc/Makefile.in @@ -1574,6 +1574,7 @@ OBJS = \ ira-lives.o \ jump.o \ langhooks.o \ + late-combine.o \ lcm.o \ lists.o \ loop-doloop.o \ diff --git a/gcc/common.opt b/gcc/common.opt index f2bc47fdc5e..327230967ea 100644 --- a/gcc/common.opt +++ b/gcc/common.opt @@ -1796,6 +1796,11 @@ Common Var(flag_large_source_files) Init(0) Improve GCC's ability to track column numbers in large source files, at the expense of slower compilation. +flate-combine-instructions +Common Var(flag_late_combine_instructions) Optimization Init(0) +Run two instruction combination passes late in the pass pipeline; +one before register allocation and one after. + floop-parallelize-all Common Var(flag_loop_parallelize_all) Optimization Mark all loops as parallel. diff --git a/gcc/config/i386/i386-options.cc b/gcc/config/i386/i386-options.cc index f2cecc0e254..4620bf8e9e6 100644 --- a/gcc/config/i386/i386-options.cc +++ b/gcc/config/i386/i386-options.cc @@ -1942,6 +1942,10 @@ ix86_override_options_after_change (void) flag_cunroll_grow_size = flag_peel_loops || optimize >= 3; } + /* Late combine tends to undo some of the effects of STV and RPAD, + by combining instructions back to their original form. */ + if (!OPTION_SET_P (flag_late_combine_instructions)) + flag_late_combine_instructions = 0; } /* Clear stack slot assignments remembered from previous functions. diff --git a/gcc/config/rs6000/rs6000.cc b/gcc/config/rs6000/rs6000.cc index e4dc629ddcc..f39b8909925 100644 --- a/gcc/config/rs6000/rs6000.cc +++ b/gcc/config/rs6000/rs6000.cc @@ -4768,6 +4768,14 @@ rs6000_option_override_internal (bool global_init_p) targetm.expand_builtin_va_start = NULL; } + /* One of the late-combine passes runs after register allocation + and can match define_insn_and_splits that were previously used + only before register allocation. Some of those define_insn_and_splits + use gen_reg_rtx unconditionally. Disable late-combine by default + until the define_insn_and_splits are fixed. */ + if (!OPTION_SET_P (flag_late_combine_instructions)) + flag_late_combine_instructions = 0; + rs6000_override_options_after_change (); /* If not explicitly specified via option, decide whether to generate indexed diff --git a/gcc/config/xtensa/xtensa.cc b/gcc/config/xtensa/xtensa.cc index 45dc1be3ff5..308dc62e0f8 100644 --- a/gcc/config/xtensa/xtensa.cc +++ b/gcc/config/xtensa/xtensa.cc @@ -59,6 +59,7 @@ along with GCC; see the file COPYING3. If not see #include "tree-pass.h" #include "print-rtl.h" #include <math.h> +#include "opts.h" /* This file should be included last. */ #include "target-def.h" @@ -2916,6 +2917,16 @@ xtensa_option_override (void) flag_reorder_blocks_and_partition = 0; flag_reorder_blocks = 1; } + + /* One of the late-combine passes runs after register allocation + and can match define_insn_and_splits that were previously used + only before register allocation. Some of those define_insn_and_splits + require the split to take place, but have a split condition of + can_create_pseudo_p, and so matching after RA will give an + unsplittable instruction. Disable late-combine by default until + the define_insn_and_splits are fixed. */ + if (!OPTION_SET_P (flag_late_combine_instructions)) + flag_late_combine_instructions = 0; } /* Implement TARGET_HARD_REGNO_NREGS. 
*/ diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi index 5d7a87fde86..3b8c427d509 100644 --- a/gcc/doc/invoke.texi +++ b/gcc/doc/invoke.texi @@ -575,7 +575,7 @@ Objective-C and Objective-C++ Dialects}. -fipa-bit-cp -fipa-vrp -fipa-pta -fipa-profile -fipa-pure-const -fipa-reference -fipa-reference-addressable -fipa-stack-alignment -fipa-icf -fira-algorithm=@var{algorithm} --flive-patching=@var{level} +-flate-combine-instructions -flive-patching=@var{level} -fira-region=@var{region} -fira-hoist-pressure -fira-loop-pressure -fno-ira-share-save-slots -fno-ira-share-spill-slots @@ -13675,6 +13675,15 @@ equivalences that are found only by GCC and equivalences found only by Gold. This flag is enabled by default at @option{-O2} and @option{-Os}. +@opindex flate-combine-instructions +@item -flate-combine-instructions +Enable two instruction combination passes that run relatively late in the +compilation process. One of the passes runs before register allocation and +the other after register allocation. The main aim of the passes is to +substitute definitions into all uses. + +Most targets enable this flag by default at @option{-O2} and @option{-Os}. + @opindex flive-patching @item -flive-patching=@var{level} Control GCC's optimizations to produce output suitable for live-patching. diff --git a/gcc/late-combine.cc b/gcc/late-combine.cc new file mode 100644 index 00000000000..22a1d81d38e --- /dev/null +++ b/gcc/late-combine.cc @@ -0,0 +1,747 @@ +// Late-stage instruction combination pass. +// Copyright (C) 2023-2024 Free Software Foundation, Inc. +// +// This file is part of GCC. +// +// GCC is free software; you can redistribute it and/or modify it under +// the terms of the GNU General Public License as published by the Free +// Software Foundation; either version 3, or (at your option) any later +// version. +// +// GCC is distributed in the hope that it will be useful, but WITHOUT ANY +// WARRANTY; without even the implied warranty of MERCHANTABILITY or +// FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License +// for more details. +// +// You should have received a copy of the GNU General Public License +// along with GCC; see the file COPYING3. If not see +// <http://www.gnu.org/licenses/>. + +// The current purpose of this pass is to substitute definitions into +// all uses, so that the definition can be removed. However, it could +// be extended to handle other combination-related optimizations in future. +// +// The pass can run before or after register allocation. When running +// before register allocation, it tries to avoid cases that are likely +// to increase register pressure. For the same reason, it avoids moving +// instructions around, even if doing so would allow an optimization to +// succeed. These limitations are removed when running after register +// allocation. 
+ +#define INCLUDE_ALGORITHM +#define INCLUDE_FUNCTIONAL +#include "config.h" +#include "system.h" +#include "coretypes.h" +#include "backend.h" +#include "rtl.h" +#include "df.h" +#include "rtl-ssa.h" +#include "print-rtl.h" +#include "tree-pass.h" +#include "cfgcleanup.h" +#include "target.h" + +using namespace rtl_ssa; + +namespace { +const pass_data pass_data_late_combine = +{ + RTL_PASS, // type + "late_combine", // name + OPTGROUP_NONE, // optinfo_flags + TV_NONE, // tv_id + 0, // properties_required + 0, // properties_provided + 0, // properties_destroyed + 0, // todo_flags_start + TODO_df_finish, // todo_flags_finish +}; + +// Represents an attempt to substitute a single-set definition into all +// uses of the definition. +class insn_combination +{ +public: + insn_combination (set_info *, rtx, rtx); + bool run (); + array_slice<insn_change *const> use_changes () const; + +private: + use_array get_new_uses (use_info *); + bool substitute_nondebug_use (use_info *); + bool substitute_nondebug_uses (set_info *); + bool try_to_preserve_debug_info (insn_change &, use_info *); + void substitute_debug_use (use_info *); + bool substitute_note (insn_info *, rtx, bool); + void substitute_notes (insn_info *, bool); + void substitute_note_uses (use_info *); + void substitute_optional_uses (set_info *); + + // Represents the state of the function's RTL at the start of this + // combination attempt. + insn_change_watermark m_rtl_watermark; + + // Represents the rtl-ssa state at the start of this combination attempt. + obstack_watermark m_attempt; + + // The instruction that contains the definition, and that we're trying + // to delete. + insn_info *m_def_insn; + + // The definition itself. + set_info *m_def; + + // The destination and source of the single set that defines m_def. + // The destination is known to be a plain REG. + rtx m_dest; + rtx m_src; + + // Contains the full list of changes that we want to make, in reverse + // postorder. + auto_vec<insn_change *> m_nondebug_changes; +}; + +// Class that represents one run of the pass. +class late_combine +{ +public: + unsigned int execute (function *); + +private: + rtx optimizable_set (insn_info *); + bool check_register_pressure (insn_info *, rtx); + bool check_uses (set_info *, rtx); + bool combine_into_uses (insn_info *, insn_info *); + + auto_vec<insn_info *> m_worklist; +}; + +insn_combination::insn_combination (set_info *def, rtx dest, rtx src) + : m_rtl_watermark (), + m_attempt (crtl->ssa->new_change_attempt ()), + m_def_insn (def->insn ()), + m_def (def), + m_dest (dest), + m_src (src), + m_nondebug_changes () +{ +} + +array_slice<insn_change *const> +insn_combination::use_changes () const +{ + return { m_nondebug_changes.address () + 1, + m_nondebug_changes.length () - 1 }; +} + +// USE is a direct or indirect use of m_def. Return the list of uses +// that would be needed after substituting m_def into the instruction. +// The returned list is marked as invalid if USE's insn and m_def_insn +// use different definitions for the same resource (register or memory). 
+use_array +insn_combination::get_new_uses (use_info *use) +{ + auto *def = use->def (); + auto *use_insn = use->insn (); + + use_array new_uses = use_insn->uses (); + new_uses = remove_uses_of_def (m_attempt, new_uses, def); + new_uses = merge_access_arrays (m_attempt, m_def_insn->uses (), new_uses); + if (new_uses.is_valid () && use->ebb () != m_def->ebb ()) + new_uses = crtl->ssa->make_uses_available (m_attempt, new_uses, use->bb (), + use_insn->is_debug_insn ()); + return new_uses; +} + +// Start the process of trying to replace USE by substitution, given that +// USE occurs in a non-debug instruction. Check: +// +// - that the substitution can be represented in RTL +// +// - that each use of a resource (register or memory) within the new +// instruction has a consistent definition +// +// - that the new instruction is a recognized pattern +// +// - that the instruction can be placed somewhere that makes all definitions +// and uses valid, and that permits any new hard-register clobbers added +// during the recognition process +// +// Return true on success. +bool +insn_combination::substitute_nondebug_use (use_info *use) +{ + insn_info *use_insn = use->insn (); + rtx_insn *use_rtl = use_insn->rtl (); + + if (dump_file && (dump_flags & TDF_DETAILS)) + dump_insn_slim (dump_file, use->insn ()->rtl ()); + + // Check that we can change the instruction pattern. Leave recognition + // of the result till later. + insn_propagation prop (use_rtl, m_dest, m_src); + if (!prop.apply_to_pattern (&PATTERN (use_rtl)) + || prop.num_replacements == 0) + { + if (dump_file && (dump_flags & TDF_DETAILS)) + fprintf (dump_file, "-- RTL substitution failed\n"); + return false; + } + + use_array new_uses = get_new_uses (use); + if (!new_uses.is_valid ()) + { + if (dump_file && (dump_flags & TDF_DETAILS)) + fprintf (dump_file, "-- could not prove that all sources" + " are available\n"); + return false; + } + + // Create a tentative change for the use. + auto *where = XOBNEW (m_attempt, insn_change); + auto *use_change = new (where) insn_change (use_insn); + m_nondebug_changes.safe_push (use_change); + use_change->new_uses = new_uses; + + struct local_ignore : ignore_nothing + { + local_ignore (const set_info *def, const insn_info *use_insn) + : m_def (def), m_use_insn (use_insn) {} + + // We don't limit the number of insns per optimization, so ignoring all + // insns for all insns would lead to quadratic complexity. Just ignore + // the use and definition, which should be enough for most purposes. + bool + should_ignore_insn (const insn_info *insn) + { + return insn == m_def->insn () || insn == m_use_insn; + } + + // Ignore the definition that we're removing, and all uses of it. + bool should_ignore_def (const def_info *def) { return def == m_def; } + + const set_info *m_def; + const insn_info *m_use_insn; + }; + + auto ignore = local_ignore (m_def, use_insn); + + // Moving instructions before register allocation could increase + // register pressure. Only try moving them after RA. 
+  if (reload_completed && can_move_insn_p (use_insn))
+    use_change->move_range = { use_insn->bb ()->head_insn (),
+                               use_insn->ebb ()->last_bb ()->end_insn () };
+  if (!restrict_movement (*use_change, ignore))
+    {
+      if (dump_file && (dump_flags & TDF_DETAILS))
+        fprintf (dump_file, "-- cannot satisfy all definitions and uses"
+                 " in insn %d\n", INSN_UID (use_insn->rtl ()));
+      return false;
+    }
+
+  if (!recog (m_attempt, *use_change, ignore))
+    return false;
+
+  return true;
+}
+
+// Apply substitute_nondebug_use to all direct and indirect uses of DEF.
+// There will be at most one level of indirection.
+bool
+insn_combination::substitute_nondebug_uses (set_info *def)
+{
+  for (use_info *use : def->nondebug_insn_uses ())
+    if (!use->is_live_out_use ()
+        && !use->only_occurs_in_notes ()
+        && !substitute_nondebug_use (use))
+      return false;
+
+  for (use_info *use : def->phi_uses ())
+    if (!substitute_nondebug_uses (use->phi ()))
+      return false;
+
+  return true;
+}
+
+// USE_CHANGE.insn () is a debug instruction that uses m_def.  Try to
+// substitute the definition into the instruction and try to describe
+// the result in USE_CHANGE.  Return true on success.  Failure means that
+// the instruction must be reset instead.
+bool
+insn_combination::try_to_preserve_debug_info (insn_change &use_change,
+                                              use_info *use)
+{
+  // Punt on unsimplified subregs of hard registers.  In that case,
+  // propagation can succeed and create a wider reg than the one we
+  // started with.
+  if (HARD_REGISTER_NUM_P (use->regno ())
+      && use->includes_subregs ())
+    return false;
+
+  insn_info *use_insn = use_change.insn ();
+  rtx_insn *use_rtl = use_insn->rtl ();
+
+  use_change.new_uses = get_new_uses (use);
+  if (!use_change.new_uses.is_valid ()
+      || !restrict_movement (use_change))
+    return false;
+
+  insn_propagation prop (use_rtl, m_dest, m_src);
+  return prop.apply_to_pattern (&INSN_VAR_LOCATION_LOC (use_rtl));
+}
+
+// USE_INSN is a debug instruction that uses m_def.  Update it to reflect
+// the fact that m_def is going to disappear.  Try to preserve the source
+// value if possible, but reset the instruction if not.
+void
+insn_combination::substitute_debug_use (use_info *use)
+{
+  auto *use_insn = use->insn ();
+  rtx_insn *use_rtl = use_insn->rtl ();
+
+  auto use_change = insn_change (use_insn);
+  if (!try_to_preserve_debug_info (use_change, use))
+    {
+      use_change.new_uses = {};
+      use_change.move_range = use_change.insn ();
+      INSN_VAR_LOCATION_LOC (use_rtl) = gen_rtx_UNKNOWN_VAR_LOC ();
+    }
+  insn_change *changes[] = { &use_change };
+  crtl->ssa->change_insns (changes);
+}
+
+// NOTE is a reg note of USE_INSN, which previously used m_def.  Update
+// the note to reflect the fact that m_def is going to disappear.  Return
+// true on success, or false if the note must be deleted.
+//
+// CAN_PROPAGATE is true if m_dest can be replaced with m_src.
+bool
+insn_combination::substitute_note (insn_info *use_insn, rtx note,
+                                   bool can_propagate)
+{
+  if (REG_NOTE_KIND (note) == REG_EQUAL
+      || REG_NOTE_KIND (note) == REG_EQUIV)
+    {
+      insn_propagation prop (use_insn->rtl (), m_dest, m_src);
+      return (prop.apply_to_rvalue (&XEXP (note, 0))
+              && (can_propagate || prop.num_replacements == 0));
+    }
+  return true;
+}
+
+// Update USE_INSN's notes after deciding to go ahead with the optimization.
+// CAN_PROPAGATE is true if m_dest can be replaced with m_src.
+void
+insn_combination::substitute_notes (insn_info *use_insn, bool can_propagate)
+{
+  rtx_insn *use_rtl = use_insn->rtl ();
+  rtx *ptr = &REG_NOTES (use_rtl);
+  while (rtx note = *ptr)
+    {
+      if (substitute_note (use_insn, note, can_propagate))
+        ptr = &XEXP (note, 1);
+      else
+        *ptr = XEXP (note, 1);
+    }
+}
+
+// We've decided to go ahead with the substitution.  Update all REG_NOTES
+// involving USE.
+void
+insn_combination::substitute_note_uses (use_info *use)
+{
+  insn_info *use_insn = use->insn ();
+
+  bool can_propagate = true;
+  if (use->only_occurs_in_notes ())
+    {
+      // The only uses are in notes.  Try to keep the note if we can,
+      // but removing it is better than aborting the optimization.
+      insn_change use_change (use_insn);
+      use_change.new_uses = get_new_uses (use);
+      if (!use_change.new_uses.is_valid ()
+          || !restrict_movement (use_change))
+        {
+          use_change.move_range = use_insn;
+          use_change.new_uses = remove_uses_of_def (m_attempt,
+                                                    use_insn->uses (),
+                                                    use->def ());
+          can_propagate = false;
+        }
+      if (dump_file && (dump_flags & TDF_DETAILS))
+        {
+          fprintf (dump_file, "%s notes in:\n",
+                   can_propagate ? "updating" : "removing");
+          dump_insn_slim (dump_file, use_insn->rtl ());
+        }
+      substitute_notes (use_insn, can_propagate);
+      insn_change *changes[] = { &use_change };
+      crtl->ssa->change_insns (changes);
+    }
+  else
+    // We've already decided to update the insn's pattern and know that m_src
+    // will be available at the insn's new location.  Now update its notes.
+    substitute_notes (use_insn, can_propagate);
+}
+
+// We've decided to go ahead with the substitution and we've dealt with
+// all uses that occur in the patterns of non-debug insns.  Update all
+// other uses for the fact that m_def is about to disappear.
+void
+insn_combination::substitute_optional_uses (set_info *def)
+{
+  if (auto insn_uses = def->all_insn_uses ())
+    {
+      use_info *use = *insn_uses.begin ();
+      while (use)
+        {
+          use_info *next_use = use->next_any_insn_use ();
+          if (use->is_in_debug_insn ())
+            substitute_debug_use (use);
+          else if (!use->is_live_out_use ())
+            substitute_note_uses (use);
+          use = next_use;
+        }
+    }
+  for (use_info *use : def->phi_uses ())
+    substitute_optional_uses (use->phi ());
+}
+
+// Try to perform the substitution.  Return true on success.
+bool
+insn_combination::run ()
+{
+  if (dump_file && (dump_flags & TDF_DETAILS))
+    {
+      fprintf (dump_file, "\ntrying to combine definition of r%d in:\n",
+               m_def->regno ());
+      dump_insn_slim (dump_file, m_def_insn->rtl ());
+      fprintf (dump_file, "into:\n");
+    }
+
+  auto def_change = insn_change::delete_insn (m_def_insn);
+  m_nondebug_changes.safe_push (&def_change);
+
+  if (!substitute_nondebug_uses (m_def)
+      || !changes_are_worthwhile (m_nondebug_changes)
+      || !crtl->ssa->verify_insn_changes (m_nondebug_changes))
+    return false;
+
+  substitute_optional_uses (m_def);
+
+  confirm_change_group ();
+  crtl->ssa->change_insns (m_nondebug_changes);
+  return true;
+}
+
+// See whether INSN is a single_set that we can optimize.  Return the
+// set if so, otherwise return null.
+rtx
+late_combine::optimizable_set (insn_info *insn)
+{
+  if (!insn->can_be_optimized ()
+      || insn->is_asm ()
+      || insn->is_call ()
+      || insn->has_volatile_refs ()
+      || insn->has_pre_post_modify ()
+      || !can_move_insn_p (insn))
+    return NULL_RTX;
+
+  return single_set (insn->rtl ());
+}
+
+// Suppose that we can replace all uses of SET_DEST (SET) with SET_SRC (SET),
+// where SET occurs in INSN.  Return true if doing so is not likely to
+// increase register pressure.
+bool
+late_combine::check_register_pressure (insn_info *insn, rtx set)
+{
+  // Plain register-to-register moves do not establish a register class
+  // preference and have no well-defined effect on the register allocator.
+  // If changes in register class are needed, the register allocator is
+  // in the best position to place those changes.  If no change in
+  // register class is needed, then the optimization reduces register
+  // pressure if SET_SRC (set) was already live at uses, otherwise the
+  // optimization is pressure-neutral.
+  rtx src = SET_SRC (set);
+  if (REG_P (src))
+    return true;
+
+  // On the same basis, substituting a SET_SRC that contains a single
+  // pseudo register either reduces pressure or is pressure-neutral,
+  // subject to the constraints below.  We would need to do more
+  // analysis for SET_SRCs that use more than one pseudo register.
+  unsigned int nregs = 0;
+  for (auto *use : insn->uses ())
+    if (use->is_reg ()
+        && !HARD_REGISTER_NUM_P (use->regno ())
+        && !use->only_occurs_in_notes ())
+      if (++nregs > 1)
+        return false;
+
+  // If there are no pseudo registers in SET_SRC then the optimization
+  // should improve register pressure.
+  if (nregs == 0)
+    return true;
+
+  // We'd be substituting (set (reg R1) SRC) where SRC is known to
+  // contain a single pseudo register R2.  Assume for simplicity that
+  // each new use of R2 would need to be in the same class C as the
+  // current use of R2.  If, for a realistic allocation, C is a
+  // non-strict superset of R1's register class, the effect on
+  // register pressure should be positive or neutral.  If instead
+  // R1 occupies a different register class from R2, or if R1 has
+  // more allocation freedom than R2, then there's a higher risk that
+  // the effect on register pressure could be negative.
+  //
+  // First use constrain_operands to get the most likely choice of
+  // alternative.  For simplicity, just handle the case where the
+  // output operand is operand 0.
+  extract_insn (insn->rtl ());
+  rtx dest = SET_DEST (set);
+  if (recog_data.n_operands == 0
+      || recog_data.operand[0] != dest)
+    return false;
+
+  if (!constrain_operands (0, get_enabled_alternatives (insn->rtl ())))
+    return false;
+
+  preprocess_constraints (insn->rtl ());
+  auto *alt = which_op_alt ();
+  auto dest_class = alt[0].cl;
+
+  // Check operands 1 and above.
+  auto check_src = [&] (unsigned int i)
+    {
+      if (recog_data.is_operator[i])
+        return true;
+
+      rtx op = recog_data.operand[i];
+      if (CONSTANT_P (op))
+        return true;
+
+      if (SUBREG_P (op))
+        op = SUBREG_REG (op);
+      if (REG_P (op))
+        {
+          // Ignore hard registers.  We've already rejected uses of non-fixed
+          // hard registers in the SET_SRC.
+          if (HARD_REGISTER_P (op))
+            return true;
+
+          // Make sure that the source operand's class is at least as
+          // permissive as the destination operand's class.
+          auto src_class = alternative_class (alt, i);
+          if (!reg_class_subset_p (dest_class, src_class))
+            return false;
+
+          // Make sure that the source operand occupies no more hard
+          // registers than the destination operand.  This mostly matters
+          // for subregs.
+          if (targetm.class_max_nregs (dest_class, GET_MODE (dest))
+              < targetm.class_max_nregs (src_class, GET_MODE (op)))
+            return false;
+
+          return true;
+        }
+      return false;
+    };
+  for (int i = 1; i < recog_data.n_operands; ++i)
+    if (recog_data.operand_type[i] != OP_OUT && !check_src (i))
+      return false;
+
+  return true;
+}
+
+// Check uses of DEF to see whether there is anything obvious that
+// prevents the substitution of SET into uses of DEF.
+bool
+late_combine::check_uses (set_info *def, rtx set)
+{
+  use_info *prev_use = nullptr;
+  for (use_info *use : def->nondebug_insn_uses ())
+    {
+      insn_info *use_insn = use->insn ();
+
+      if (use->is_live_out_use ())
+        continue;
+      if (use->only_occurs_in_notes ())
+        continue;
+
+      // We cannot replace all uses if the value is live on exit.
+      if (use->is_artificial ())
+        return false;
+
+      // Avoid increasing the complexity of instructions that
+      // reference allocatable hard registers.
+      if (!REG_P (SET_SRC (set))
+          && !reload_completed
+          && (accesses_include_nonfixed_hard_registers (use_insn->uses ())
+              || accesses_include_nonfixed_hard_registers (use_insn->defs ())))
+        return false;
+
+      // Don't substitute into a non-local goto, since it can then be
+      // treated as a jump to local label, e.g. in shorten_branches.
+      // ??? But this shouldn't be necessary.
+      if (use_insn->is_jump ()
+          && find_reg_note (use_insn->rtl (), REG_NON_LOCAL_GOTO, NULL_RTX))
+        return false;
+
+      // Reject cases where one of the uses is a function argument.
+      // The combine attempt should fail anyway, but this is a common
+      // case that is easy to check early.
+      if (use_insn->is_call ()
+          && HARD_REGISTER_P (SET_DEST (set))
+          && find_reg_fusage (use_insn->rtl (), USE, SET_DEST (set)))
+        return false;
+
+      // We'll keep the uses in their original order, even if we move
+      // them relative to other instructions.  Make sure that non-final
+      // uses do not change any values that occur in the SET_SRC.
+      if (prev_use && prev_use->ebb () == use->ebb ())
+        {
+          def_info *ultimate_def = look_through_degenerate_phi (def);
+          if (insn_clobbers_resources (prev_use->insn (),
+                                       ultimate_def->insn ()->uses ()))
+            return false;
+        }
+
+      prev_use = use;
+    }
+
+  for (use_info *use : def->phi_uses ())
+    if (!use->phi ()->is_degenerate ()
+        || !check_uses (use->phi (), set))
+      return false;
+
+  return true;
+}
+
+// Try to remove INSN by substituting a definition into all uses.
+// If the optimization moves any instructions before CURSOR, add those
+// instructions to the end of m_worklist.
+bool
+late_combine::combine_into_uses (insn_info *insn, insn_info *cursor)
+{
+  // For simplicity, don't try to handle sets of multiple hard registers.
+  // And for correctness, don't remove any assignments to the stack or
+  // frame pointers, since that would implicitly change the set of valid
+  // memory locations between this assignment and the next.
+  //
+  // Removing assignments to the hard frame pointer would invalidate
+  // backtraces.
+  set_info *def = single_set_info (insn);
+  if (!def
+      || !def->is_reg ()
+      || def->regno () == STACK_POINTER_REGNUM
+      || def->regno () == FRAME_POINTER_REGNUM
+      || def->regno () == HARD_FRAME_POINTER_REGNUM)
+    return false;
+
+  rtx set = optimizable_set (insn);
+  if (!set)
+    return false;
+
+  // For simplicity, don't try to handle subreg destinations.
+  rtx dest = SET_DEST (set);
+  if (!REG_P (dest) || def->regno () != REGNO (dest))
+    return false;
+
+  // Don't prolong the live ranges of allocatable hard registers, or put
+  // them into more complicated instructions.  Failing to prevent this
+  // could lead to spill failures, or at least to worse register allocation.
+  if (!reload_completed
+      && accesses_include_nonfixed_hard_registers (insn->uses ()))
+    return false;
+
+  if (!reload_completed && !check_register_pressure (insn, set))
+    return false;
+
+  if (!check_uses (def, set))
+    return false;
+
+  insn_combination combination (def, SET_DEST (set), SET_SRC (set));
+  if (!combination.run ())
+    return false;
+
+  for (auto *use_change : combination.use_changes ())
+    if (*use_change->insn () < *cursor)
+      m_worklist.safe_push (use_change->insn ());
+    else
+      break;
+  return true;
+}
+
+// Run the pass on function FN.
+unsigned int
+late_combine::execute (function *fn)
+{
+  // Initialization.
+  calculate_dominance_info (CDI_DOMINATORS);
+  df_analyze ();
+  crtl->ssa = new rtl_ssa::function_info (fn);
+  // Don't allow memory_operand to match volatile MEMs.
+  init_recog_no_volatile ();
+
+  insn_info *insn = *crtl->ssa->nondebug_insns ().begin ();
+  while (insn)
+    {
+      if (!insn->is_artificial ())
+        {
+          insn_info *prev = insn->prev_nondebug_insn ();
+          if (combine_into_uses (insn, prev))
+            {
+              // Any instructions that get added to the worklist were
+              // previously after PREV.  Thus if we were able to move
+              // an instruction X before PREV during one combination,
+              // X cannot depend on any instructions that we move before
+              // PREV during subsequent combinations.  This means that
+              // the worklist should be free of backwards dependencies,
+              // even if it isn't necessarily in RPO.
+              for (unsigned int i = 0; i < m_worklist.length (); ++i)
+                combine_into_uses (m_worklist[i], prev);
+              m_worklist.truncate (0);
+              insn = prev;
+            }
+        }
+      insn = insn->next_nondebug_insn ();
+    }
+
+  // Finalization.
+  if (crtl->ssa->perform_pending_updates ())
+    cleanup_cfg (0);
+  // Make the recognizer allow volatile MEMs again.
+  init_recog ();
+  free_dominance_info (CDI_DOMINATORS);
+  return 0;
+}
+
+class pass_late_combine : public rtl_opt_pass
+{
+public:
+  pass_late_combine (gcc::context *ctxt)
+    : rtl_opt_pass (pass_data_late_combine, ctxt)
+  {}
+
+  // opt_pass methods:
+  opt_pass *clone () override { return new pass_late_combine (m_ctxt); }
+  bool gate (function *) override { return flag_late_combine_instructions; }
+  unsigned int execute (function *) override;
+};
+
+unsigned int
+pass_late_combine::execute (function *fn)
+{
+  return late_combine ().execute (fn);
+}
+
+} // end namespace
+
+// Create a new late-combine pass instance.
+
+rtl_opt_pass *
+make_pass_late_combine (gcc::context *ctxt)
+{
+  return new pass_late_combine (ctxt);
+}
diff --git a/gcc/opts.cc b/gcc/opts.cc
index 1b1b46455af..915bce88fd6 100644
--- a/gcc/opts.cc
+++ b/gcc/opts.cc
@@ -664,6 +664,7 @@ static const struct default_options default_options_table[] =
       VECT_COST_MODEL_VERY_CHEAP },
     { OPT_LEVELS_2_PLUS, OPT_finline_functions, NULL, 1 },
     { OPT_LEVELS_2_PLUS, OPT_ftree_loop_distribute_patterns, NULL, 1 },
+    { OPT_LEVELS_2_PLUS, OPT_flate_combine_instructions, NULL, 1 },
 
     /* -O2 and above optimizations, but not -Os or -Og.  */
     { OPT_LEVELS_2_PLUS_SPEED_ONLY, OPT_falign_functions, NULL, 1 },
diff --git a/gcc/passes.def b/gcc/passes.def
index 041229e47a6..13c9dc34ddf 100644
--- a/gcc/passes.def
+++ b/gcc/passes.def
@@ -493,6 +493,7 @@ along with GCC; see the file COPYING3.
If not see NEXT_PASS (pass_initialize_regs); NEXT_PASS (pass_ud_rtl_dce); NEXT_PASS (pass_combine); + NEXT_PASS (pass_late_combine); NEXT_PASS (pass_if_after_combine); NEXT_PASS (pass_jump_after_combine); NEXT_PASS (pass_partition_blocks); @@ -512,6 +513,7 @@ along with GCC; see the file COPYING3. If not see NEXT_PASS (pass_postreload); PUSH_INSERT_PASSES_WITHIN (pass_postreload) NEXT_PASS (pass_postreload_cse); + NEXT_PASS (pass_late_combine); NEXT_PASS (pass_gcse2); NEXT_PASS (pass_split_after_reload); NEXT_PASS (pass_ree); diff --git a/gcc/testsuite/gcc.dg/ira-shrinkwrap-prep-1.c b/gcc/testsuite/gcc.dg/ira-shrinkwrap-prep-1.c index f290b9ccbdc..a95637abbe5 100644 --- a/gcc/testsuite/gcc.dg/ira-shrinkwrap-prep-1.c +++ b/gcc/testsuite/gcc.dg/ira-shrinkwrap-prep-1.c @@ -25,5 +25,5 @@ bar (long a) } /* { dg-final { scan-rtl-dump "Will split live ranges of parameters" "ira" } } */ -/* { dg-final { scan-rtl-dump "Split live-range of register" "ira" { xfail *-*-* } } } */ +/* { dg-final { scan-rtl-dump "Split live-range of register" "ira" { xfail { ! aarch64*-*-* } } } } */ /* { dg-final { scan-rtl-dump "Performing shrink-wrapping" "pro_and_epilogue" { xfail powerpc*-*-* } } } */ diff --git a/gcc/testsuite/gcc.dg/ira-shrinkwrap-prep-2.c b/gcc/testsuite/gcc.dg/ira-shrinkwrap-prep-2.c index 6212c95585d..0690e036eaa 100644 --- a/gcc/testsuite/gcc.dg/ira-shrinkwrap-prep-2.c +++ b/gcc/testsuite/gcc.dg/ira-shrinkwrap-prep-2.c @@ -30,6 +30,6 @@ bar (long a) } /* { dg-final { scan-rtl-dump "Will split live ranges of parameters" "ira" } } */ -/* { dg-final { scan-rtl-dump "Split live-range of register" "ira" { xfail *-*-* } } } */ +/* { dg-final { scan-rtl-dump "Split live-range of register" "ira" { xfail { ! aarch64*-*-* } } } } */ /* XFAIL due to PR70681. */ /* { dg-final { scan-rtl-dump "Performing shrink-wrapping" "pro_and_epilogue" { xfail arm*-*-* powerpc*-*-* } } } */ diff --git a/gcc/testsuite/gcc.dg/stack-check-4.c b/gcc/testsuite/gcc.dg/stack-check-4.c index b0c5c61972f..052d2abc2f1 100644 --- a/gcc/testsuite/gcc.dg/stack-check-4.c +++ b/gcc/testsuite/gcc.dg/stack-check-4.c @@ -20,7 +20,7 @@ scan for. We scan for both the positive and negative cases. 
*/ /* { dg-do compile } */ -/* { dg-options "-O2 -fstack-clash-protection -fdump-rtl-pro_and_epilogue -fno-optimize-sibling-calls" } */ +/* { dg-options "-O2 -fstack-clash-protection -fdump-rtl-pro_and_epilogue -fno-optimize-sibling-calls -fno-shrink-wrap" } */ /* { dg-require-effective-target supports_stack_clash_protection } */ extern void arf (char *); diff --git a/gcc/testsuite/gcc.target/aarch64/bitfield-bitint-abi-align16.c b/gcc/testsuite/gcc.target/aarch64/bitfield-bitint-abi-align16.c index 4a228b0a1ce..c29a230a771 100644 --- a/gcc/testsuite/gcc.target/aarch64/bitfield-bitint-abi-align16.c +++ b/gcc/testsuite/gcc.target/aarch64/bitfield-bitint-abi-align16.c @@ -1,5 +1,5 @@ /* { dg-do compile { target bitint } } */ -/* { dg-additional-options "-std=c23 -O2 -fno-stack-protector -save-temps -fno-schedule-insns -fno-schedule-insns2" } */ +/* { dg-additional-options "-std=c23 -O2 -fno-stack-protector -save-temps -fno-schedule-insns -fno-schedule-insns2 -fno-late-combine-instructions" } */ /* { dg-final { check-function-bodies "**" "" "" } } */ #define ALIGN 16 diff --git a/gcc/testsuite/gcc.target/aarch64/bitfield-bitint-abi-align8.c b/gcc/testsuite/gcc.target/aarch64/bitfield-bitint-abi-align8.c index e7f773640f0..13ffbf416ca 100644 --- a/gcc/testsuite/gcc.target/aarch64/bitfield-bitint-abi-align8.c +++ b/gcc/testsuite/gcc.target/aarch64/bitfield-bitint-abi-align8.c @@ -1,5 +1,5 @@ /* { dg-do compile { target bitint } } */ -/* { dg-additional-options "-std=c23 -O2 -fno-stack-protector -save-temps -fno-schedule-insns -fno-schedule-insns2" } */ +/* { dg-additional-options "-std=c23 -O2 -fno-stack-protector -save-temps -fno-schedule-insns -fno-schedule-insns2 -fno-late-combine-instructions" } */ /* { dg-final { check-function-bodies "**" "" "" } } */ #define ALIGN 8 diff --git a/gcc/testsuite/gcc.target/aarch64/pr106594_1.c b/gcc/testsuite/gcc.target/aarch64/pr106594_1.c new file mode 100644 index 00000000000..71bcafcb44f --- /dev/null +++ b/gcc/testsuite/gcc.target/aarch64/pr106594_1.c @@ -0,0 +1,20 @@ +/* { dg-options "-O2" } */ + +extern const int constellation_64qam[64]; + +void foo(int nbits, + const char *p_src, + int *p_dst) { + + while (nbits > 0U) { + char first = *p_src++; + + char index1 = ((first & 0x3) << 4) | (first >> 4); + + *p_dst++ = constellation_64qam[index1]; + + nbits--; + } +} + +/* { dg-final { scan-assembler {(?n)\tldr\t.*\[x[0-9]+, w[0-9]+, sxtw #?2\]} } } */ diff --git a/gcc/testsuite/gcc.target/aarch64/sve/cond_asrd_3.c b/gcc/testsuite/gcc.target/aarch64/sve/cond_asrd_3.c index 0d620a30d5d..b537c6154a3 100644 --- a/gcc/testsuite/gcc.target/aarch64/sve/cond_asrd_3.c +++ b/gcc/testsuite/gcc.target/aarch64/sve/cond_asrd_3.c @@ -27,9 +27,9 @@ TEST_ALL (DEF_LOOP) /* { dg-final { scan-assembler-times {\tasrd\tz[0-9]+\.h, p[0-7]/m, z[0-9]+\.h, #4\n} 2 } } */ /* { dg-final { scan-assembler-times {\tasrd\tz[0-9]+\.s, p[0-7]/m, z[0-9]+\.s, #4\n} 1 } } */ -/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.b, p[0-7]/z, z[0-9]+\.b\n} 3 { xfail *-*-* } } } */ -/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.h, p[0-7]/z, z[0-9]+\.h\n} 2 { xfail *-*-* } } } */ -/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.s, p[0-7]/z, z[0-9]+\.s\n} 1 { xfail *-*-* } } } */ +/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.b, p[0-7]/z, z[0-9]+\.b\n} 3 } } */ +/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.h, p[0-7]/z, z[0-9]+\.h\n} 2 } } */ +/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.s, p[0-7]/z, z[0-9]+\.s\n} 1 } } */ -/* { 
dg-final { scan-assembler-not {\tmov\tz} { xfail *-*-* } } } */ -/* { dg-final { scan-assembler-not {\tsel\t} { xfail *-*-* } } } */ +/* { dg-final { scan-assembler-not {\tmov\tz} } } */ +/* { dg-final { scan-assembler-not {\tsel\t} } } */ diff --git a/gcc/testsuite/gcc.target/aarch64/sve/cond_convert_3.c b/gcc/testsuite/gcc.target/aarch64/sve/cond_convert_3.c index a294effd4a9..cff806c278d 100644 --- a/gcc/testsuite/gcc.target/aarch64/sve/cond_convert_3.c +++ b/gcc/testsuite/gcc.target/aarch64/sve/cond_convert_3.c @@ -30,11 +30,9 @@ TEST_ALL (DEF_LOOP) /* { dg-final { scan-assembler-times {\tscvtf\tz[0-9]+\.d, p[0-7]/m,} 1 } } */ /* { dg-final { scan-assembler-times {\tucvtf\tz[0-9]+\.d, p[0-7]/m,} 1 } } */ -/* Really we should be able to use MOVPRFX /z here, but at the moment - we're relying on combine to merge a SEL and an arithmetic operation, - and the SEL doesn't allow the "false" value to be zero when the "true" - value is a register. */ -/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+, z[0-9]+\n} 6 } } */ +/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.h, p[0-7]/z,} 2 } } */ +/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.s, p[0-7]/z,} 2 } } */ +/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.d, p[0-7]/z,} 2 } } */ /* { dg-final { scan-assembler-not {\tmov\tz[^\n]*z} } } */ /* { dg-final { scan-assembler-not {\tsel\t} } } */ diff --git a/gcc/testsuite/gcc.target/aarch64/sve/cond_convert_6.c b/gcc/testsuite/gcc.target/aarch64/sve/cond_convert_6.c index 6541a2ea49d..abf0a2e832f 100644 --- a/gcc/testsuite/gcc.target/aarch64/sve/cond_convert_6.c +++ b/gcc/testsuite/gcc.target/aarch64/sve/cond_convert_6.c @@ -30,11 +30,9 @@ TEST_ALL (DEF_LOOP) /* { dg-final { scan-assembler-times {\tfcvtzs\tz[0-9]+\.d, p[0-7]/m,} 1 } } */ /* { dg-final { scan-assembler-times {\tfcvtzu\tz[0-9]+\.d, p[0-7]/m,} 1 } } */ -/* Really we should be able to use MOVPRFX /z here, but at the moment - we're relying on combine to merge a SEL and an arithmetic operation, - and the SEL doesn't allow the "false" value to be zero when the "true" - value is a register. */ -/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+, z[0-9]+\n} 6 } } */ +/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.h, p[0-7]/z,} 2 } } */ +/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.s, p[0-7]/z,} 2 } } */ +/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.d, p[0-7]/z,} 2 } } */ /* { dg-final { scan-assembler-not {\tmov\tz[^\n]*z} } } */ /* { dg-final { scan-assembler-not {\tsel\t} } } */ diff --git a/gcc/testsuite/gcc.target/aarch64/sve/cond_fabd_5.c b/gcc/testsuite/gcc.target/aarch64/sve/cond_fabd_5.c index e66477b3bce..401201b315a 100644 --- a/gcc/testsuite/gcc.target/aarch64/sve/cond_fabd_5.c +++ b/gcc/testsuite/gcc.target/aarch64/sve/cond_fabd_5.c @@ -24,12 +24,9 @@ TEST_ALL (DEF_LOOP) /* { dg-final { scan-assembler-times {\tfabd\tz[0-9]+\.s, p[0-7]/m,} 1 } } */ /* { dg-final { scan-assembler-times {\tfabd\tz[0-9]+\.d, p[0-7]/m,} 1 } } */ -/* Really we should be able to use MOVPRFX /Z here, but at the moment - we're relying on combine to merge a SEL and an arithmetic operation, - and the SEL doesn't allow zero operands. 
*/ -/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.h, p[0-7]/z, z[0-9]+\.h\n} 1 { xfail *-*-* } } } */ -/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.s, p[0-7]/z, z[0-9]+\.s\n} 1 { xfail *-*-* } } } */ -/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.d, p[0-7]/z, z[0-9]+\.d\n} 1 { xfail *-*-* } } } */ +/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.h, p[0-7]/z, z[0-9]+\.h\n} 1 } } */ +/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.s, p[0-7]/z, z[0-9]+\.s\n} 1 } } */ +/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.d, p[0-7]/z, z[0-9]+\.d\n} 1 } } */ /* { dg-final { scan-assembler-not {\tmov\tz[^,]*z} } } */ -/* { dg-final { scan-assembler-not {\tsel\t} { xfail *-*-* } } } */ +/* { dg-final { scan-assembler-not {\tsel\t} } } */ diff --git a/gcc/testsuite/gcc.target/aarch64/sve/cond_unary_4.c b/gcc/testsuite/gcc.target/aarch64/sve/cond_unary_4.c index a491f899088..cbb957bffa4 100644 --- a/gcc/testsuite/gcc.target/aarch64/sve/cond_unary_4.c +++ b/gcc/testsuite/gcc.target/aarch64/sve/cond_unary_4.c @@ -52,15 +52,10 @@ TEST_ALL (DEF_LOOP) /* { dg-final { scan-assembler-times {\tfneg\tz[0-9]+\.s, p[0-7]/m,} 1 } } */ /* { dg-final { scan-assembler-times {\tfneg\tz[0-9]+\.d, p[0-7]/m,} 1 } } */ -/* Really we should be able to use MOVPRFX /z here, but at the moment - we're relying on combine to merge a SEL and an arithmetic operation, - and the SEL doesn't allow the "false" value to be zero when the "true" - value is a register. */ -/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+, z[0-9]+\n} 7 } } */ -/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.b, p[0-7]/z, z[0-9]+\.b} 1 } } */ -/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.h, p[0-7]/z, z[0-9]+\.h} 2 } } */ -/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.s, p[0-7]/z, z[0-9]+\.s} 2 } } */ -/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.d, p[0-7]/z, z[0-9]+\.d} 2 } } */ +/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.b, p[0-7]/z, z[0-9]+\.b} 2 } } */ +/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.h, p[0-7]/z, z[0-9]+\.h} 4 } } */ +/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.s, p[0-7]/z, z[0-9]+\.s} 4 } } */ +/* { dg-final { scan-assembler-times {\tmovprfx\tz[0-9]+\.d, p[0-7]/z, z[0-9]+\.d} 4 } } */ /* { dg-final { scan-assembler-not {\tmov\tz[^\n]*z} } } */ /* { dg-final { scan-assembler-not {\tsel\t} } } */ diff --git a/gcc/tree-pass.h b/gcc/tree-pass.h index edebb2be245..38902b1b01b 100644 --- a/gcc/tree-pass.h +++ b/gcc/tree-pass.h @@ -615,6 +615,7 @@ extern rtl_opt_pass *make_pass_branch_prob (gcc::context *ctxt); extern rtl_opt_pass *make_pass_value_profile_transformations (gcc::context *ctxt); extern rtl_opt_pass *make_pass_postreload_cse (gcc::context *ctxt); +extern rtl_opt_pass *make_pass_late_combine (gcc::context *ctxt); extern rtl_opt_pass *make_pass_gcse2 (gcc::context *ctxt); extern rtl_opt_pass *make_pass_split_after_reload (gcc::context *ctxt); extern rtl_opt_pass *make_pass_thread_prologue_and_epilogue (gcc::context