[v8] Target-independent store forwarding avoidance.

Message ID 20241109094856.220379-1-konstantinos.eleftheriou@vrull.eu
State New

Commit Message

Konstantinos Eleftheriou Nov. 9, 2024, 9:48 a.m. UTC
From: kelefth <konstantinos.eleftheriou@vrull.eu>

This pass detects cases of expensive store forwarding and tries to avoid them
by reordering the stores and using suitable bit insertion sequences.
For example, it can transform this:

     strb    w2, [x1, 1]
     ldr     x0, [x1]      # Expensive store forwarding to larger load.

To:

     ldr     x0, [x1]
     strb    w2, [x1, 1]
     bfi     x0, x2, 8, 8
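
As a rough illustration of why the two orderings are equivalent, the sketch
below mirrors the first sequence (byte store at offset 1, 64-bit load at
offset 0) in portable C++. The function names are made up for this example
and are not part of the patch, and the bit position used for the insertion
assumes a little-endian target.

```cpp
#include <cstdint>
#include <cstring>

// Original shape: a narrow store followed by a wider load that overlaps it.
std::uint64_t
store_then_load (unsigned char *p, std::uint8_t v)
{
  p[1] = v;                        // byte store at offset 1
  std::uint64_t r;
  std::memcpy (&r, p, sizeof r);   // 64-bit load covering the stored byte
  return r;
}

// Rewritten shape: load first, then store, then patch the stored byte into
// the loaded value with a bit-field insert.
// Assumption: little-endian, so byte offset 1 maps to bits 8..15.
std::uint64_t
load_then_insert (unsigned char *p, std::uint8_t v)
{
  std::uint64_t r;
  std::memcpy (&r, p, sizeof r);   // 64-bit load of the old contents
  p[1] = v;                        // byte store at offset 1
  const std::uint64_t mask = std::uint64_t{0xff} << 8;
  r = (r & ~mask) | (std::uint64_t{v} << 8);   // insert 8 bits at bit 8
  return r;
}
```

Both functions leave memory in the same state and return the same value; only
the second avoids a load that depends on a just-issued narrower store.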

Assembly like this can appear with bitfields or type punning / unions.
When running the cpu-union microbenchmark from stress-ng, the following
speedups were observed:

  Neoverse-N1:       +29.4%
  Intel Coffee Lake: +13.1%
  AMD 5950X:         +17.5%

The transformation is rejected in cases that would cause store_bit_field
to generate subreg expressions on different register classes.
Files avoid-store-forwarding-4.c and avoid-store-forwarding-5.c contain
such cases and have been marked as XFAIL.

There is special handling for machines with BITS_BIG_ENDIAN !=
BYTES_BIG_ENDIAN. The need for this came up from an issue on the H8
architecture, which uses big-endian byte ordering while BITS_BIG_ENDIAN
is false. In that case, the START parameter of store_bit_field
needs to be calculated from the end of the destination register.
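
One way to read "calculated from the end" is to mirror the start position
within the width of the destination register. The sketch below uses plain
integers instead of GCC's RTL types; the helper name and all of its
parameters are illustrative assumptions, not code taken from the patch.

```cpp
#include <cstdint>

// Hypothetical helper: given the bit position START that would be used when
// bit and byte endianness agree, return the position to use when they differ.
// INSERT_BITS is the width of the inserted field and DEST_BITS is the width
// of the destination register.
std::uint64_t
adjust_bit_start (std::uint64_t start, std::uint64_t insert_bits,
                  std::uint64_t dest_bits, bool bits_big_endian,
                  bool bytes_big_endian)
{
  // With mismatched conventions, count from the other end of the register.
  if (bits_big_endian != bytes_big_endian)
    return dest_bits - insert_bits - start;
  return start;
}
```

For an 8-bit field at position 8 in a 64-bit register, this yields 48 when the
two conventions differ and leaves the position at 8 when they agree.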

gcc/ChangeLog:

	* Makefile.in (OBJS): Add avoid-store-forwarding.o.
	* common.opt (favoid-store-forwarding): New option.
	* common.opt.urls: Regenerate.
	* doc/invoke.texi: New param store-forwarding-max-distance.
	* doc/passes.texi: Document new pass.
	* doc/tm.texi: Regenerate.
	* doc/tm.texi.in: Document new pass.
	* params.opt (store-forwarding-max-distance): New param.
	* passes.def: Add pass_rtl_avoid_store_forwarding before
	pass_early_remat.
	* target.def (avoid_store_forwarding_p): New DEFHOOK.
	* target.h (struct store_fwd_info): Declare.
	* targhooks.cc (default_avoid_store_forwarding_p): New function.
	* targhooks.h (default_avoid_store_forwarding_p): Declare.
	* tree-pass.h (make_pass_rtl_avoid_store_forwarding): Declare.
	* avoid-store-forwarding.cc: New file.
	* avoid-store-forwarding.h: New file.

gcc/testsuite/ChangeLog:

	* gcc.target/aarch64/avoid-store-forwarding-1.c: New test.
	* gcc.target/aarch64/avoid-store-forwarding-2.c: New test.
	* gcc.target/aarch64/avoid-store-forwarding-3.c: New test.
	* gcc.target/aarch64/avoid-store-forwarding-4.c: New test.
	* gcc.target/aarch64/avoid-store-forwarding-5.c: New test.
	* gcc.target/x86_64/abi/callabi/avoid-store-forwarding-1.c: New test.
	* gcc.target/x86_64/abi/callabi/avoid-store-forwarding-2.c: New test.

Signed-off-by: Philipp Tomsich <philipp.tomsich@vrull.eu>
Signed-off-by: Konstantinos Eleftheriou <konstantinos.eleftheriou@vrull.eu>

Series-version: 8

Series-changes: 8
	- Fix store_bit_field call for big-endian targets, where
	  BITS_BIG_ENDIAN is false.
	- Handle store_forwarding_max_distance = 0 as a special case that
	  disables cost checks for avoid-store-forwarding.
	- Update testcases for AArch64 and add testcases for x86-64.

Series-changes: 7
	- Fix bug when copying back the load register, in the case that the
	  load is eliminated.

Series-changes: 6
	- Reject the transformation in cases that would cause store_bit_field
	  to generate subreg expressions on different register classes.
	  Files avoid-store-forwarding-4.c and avoid-store-forwarding-5.c
	  contain such cases and have been marked as XFAIL.
	- Use optimize_bb_for_speed_p instead of optimize_insn_for_speed_p.
	- Inline and remove get_load_mem.
	- New implementation for is_store_forwarding.
	- Refactor the main loop in avoid_store_forwarding.
	- Avoid using the word 'forwardings'.
	- Use lowpart_subreg instead of validate_subreg + gen_rtx_SUBREG.
	- Don't use df_insn_rescan where not needed.
	- Change order of emitting stores and bit insert instructions.
	- Check and reject loads for which the dest register overlaps with src.
	- Remove unused variable.
	- Change some gen_move_insn function calls to gen_rtx_SET.
	- Subtract the cost of eliminated load, instead of 1, for the total cost.
	- Use delete_insn instead of set_insn_deleted.
	- Regenerate common.opt.urls.
	- Add some more comments.

Series-changes: 5
	- Fix bug with BIG_ENDIAN targets.
	- Fix bug with unrecognized instructions.
	- Fix / simplify pass init/fini.

Series-changes: 4
	- Change pass scheduling to run after sched1.
	- Add target hook to decide whether a store forwarding instance
	  should be avoided or not.
	- Fix bugs.

Series-changes: 3
	- Only emit SUBREG after calling validate_subreg.
	- Fix memory corruption due to vec self-reference.
	- Fix bitmap_bit_in_range_p ICE due to BLKmode.
	- Reject MEM to MEM sets.
	- Add get_load_mem comment.
	- Add new testcase.

Series-changes: 2
	- Allow modes that are not scalar_int_mode.
	- Introduce simple costing to avoid unprofitable transformations.
	- Reject bit insert sequences that spill to memory.
	- Document new pass.
	- Fix and add testcases.
---
 gcc/Makefile.in                               |   1 +
 gcc/avoid-store-forwarding.cc                 | 648 ++++++++++++++++++
 gcc/avoid-store-forwarding.h                  |  56 ++
 gcc/common.opt                                |   4 +
 gcc/common.opt.urls                           |   3 +
 gcc/doc/invoke.texi                           |   9 +
 gcc/doc/passes.texi                           |   8 +
 gcc/doc/tm.texi                               |   8 +
 gcc/doc/tm.texi.in                            |   2 +
 gcc/params.opt                                |   4 +
 gcc/passes.def                                |   1 +
 gcc/target.def                                |  10 +
 gcc/target.h                                  |   3 +
 gcc/targhooks.cc                              |  27 +
 gcc/targhooks.h                               |   3 +
 .../aarch64/avoid-store-forwarding-1.c        |  27 +
 .../aarch64/avoid-store-forwarding-2.c        |  39 ++
 .../aarch64/avoid-store-forwarding-3.c        |  30 +
 .../aarch64/avoid-store-forwarding-4.c        |  26 +
 .../aarch64/avoid-store-forwarding-5.c        |  41 ++
 .../abi/callabi/avoid-store-forwarding-1.c    |  28 +
 .../abi/callabi/avoid-store-forwarding-2.c    |  39 ++
 gcc/tree-pass.h                               |   1 +
 23 files changed, 1018 insertions(+)
 create mode 100644 gcc/avoid-store-forwarding.cc
 create mode 100644 gcc/avoid-store-forwarding.h
 create mode 100644 gcc/testsuite/gcc.target/aarch64/avoid-store-forwarding-1.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/avoid-store-forwarding-2.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/avoid-store-forwarding-3.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/avoid-store-forwarding-4.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/avoid-store-forwarding-5.c
 create mode 100644 gcc/testsuite/gcc.target/x86_64/abi/callabi/avoid-store-forwarding-1.c
 create mode 100644 gcc/testsuite/gcc.target/x86_64/abi/callabi/avoid-store-forwarding-2.c

Comments

Jeff Law Nov. 24, 2024, 7:30 p.m. UTC | #1
On 11/9/24 2:48 AM, Konstantinos Eleftheriou wrote:

> +namespace {
> +
> +const pass_data pass_data_avoid_store_forwarding =
> +{
> +  RTL_PASS, /* type.  */
> +  "avoid_store_forwarding", /* name.  */
> +  OPTGROUP_NONE, /* optinfo_flags.  */
> +  TV_NONE, /* tv_id.  */
> +  0, /* properties_required.  */
> +  0, /* properties_provided.  */
> +  0, /* properties_destroyed.  */
> +  0, /* todo_flags_start.  */
> +  TODO_df_finish /* todo_flags_finish.  */
> +};
Probably want a TV entry for store forwarding.  While it shouldn't be a 
big time sink, we've seen other passes that should be efficient go nuts 
in some cases (extension elimination being the most recent example).




> +
> +/* Try to modify BB so that expensive store forwarding cases are avoided.  */
> +
> +void store_forwarding_analyzer::avoid_store_forwarding (basic_block bb)

Formatting nit.  Put the return type on its own line so that the 
function name always starts on column 0 of its own line.
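
For illustration only, the requested layout looks roughly like the sketch
below. The class and member names are stand-ins chosen for this example, not
code from the patch.

```cpp
// Stand-in class used only to show the definition layout.
class analyzer
{
public:
  int count_twice (int n);
};

/* Return N doubled.  */

int
analyzer::count_twice (int n)
{
  return 2 * n;
}
```

With the return type on its own line, the qualified function name starts in
column 0, which makes definitions easy to find with a search anchored at the
start of a line.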



> +
> +/* Update pass statistics.  */
> +
> +void store_forwarding_analyzer::update_stats (function *fn)
Similarly.


OK with the new timevar and two formatting nits.  Thanks for your 
patience on this.

Jeff
Philipp Tomsich Nov. 25, 2024, 2:24 a.m. UTC | #2
Pushed to master with the following fixups:
  - new timevar added
  - nits addressed
  - whitespace fixes

Philipp.


Richard Biener Nov. 28, 2024, 7:35 a.m. UTC | #3
On Mon, Nov 25, 2024 at 3:28 AM Philipp Tomsich
<philipp.tomsich@vrull.eu> wrote:
>
> Pushed to master with the following fixups:
>   - new timevar added
>   - nits addressed
>   - whitespace fixes

The pass seems to be disabled by default everywhere - I thought we decided
to avoid adding passes like this because they tend to bit-rot quickly and
become a maintenance burden.

What was the plan here?

Richard.

Philipp Tomsich Nov. 28, 2024, 7:37 a.m. UTC | #4
On Thu 28. Nov 2024 at 15:36, Richard Biener <richard.guenther@gmail.com>
wrote:

> On Mon, Nov 25, 2024 at 3:28 AM Philipp Tomsich
> <philipp.tomsich@vrull.eu> wrote:
> >
> > Pushed to master with the following fixups:
> >   - new timevar added
> >   - nits addressed
> >   - whitespace fixes
>
> The pass seems to be disabled by default everywhere - I thought we
> decided to avoid adding
> passes like this because they tend to bit-rot quickly and become a
> maintenance burden.
>
> What was the plan here?


We are preparing a follow-on commit to enable it on AArch64 and a few more key
architectures.

Richard.
>
> > Philipp.
> >
> >
> > On Mon, 25 Nov 2024 at 03:30, Jeff Law <jeffreyalaw@gmail.com> wrote:
> > >
> > >
> > >
> > > On 11/9/24 2:48 AM, Konstantinos Eleftheriou wrote:
> > > > From: kelefth <konstantinos.eleftheriou@vrull.eu>
> > > >
> > > > This pass detects cases of expensive store forwarding and tries to
> avoid them
> > > > by reordering the stores and using suitable bit insertion sequences.
> > > > For example it can transform this:
> > > >
> > > >       strb    w2, [x1, 1]
> > > >       ldr     x0, [x1]      # Expensive store forwarding to larger
> load.
> > > >
> > > > To:
> > > >
> > > >       ldr     x0, [x1]
> > > >       strb    w2, [x1]
> > > >       bfi     x0, x2, 0, 8
> > > >
> > > > Assembly like this can appear with bitfields or type punning /
> unions.
> > > > On stress-ng when running the cpu-union microbenchmark the following
> speedups
> > > > have been observed.
> > > >
> > > >    Neoverse-N1:      +29.4%
> > > >    Intel Coffeelake: +13.1%
> > > >    AMD 5950X:        +17.5%
> > > >
> > > > The transformation is rejected on cases that would cause
> store_bit_field
> > > > to generate subreg expressions on different register classes.
> > > > Files avoid-store-forwarding-4.c and avoid-store-forwarding-5.c
> contain
> > > > such cases and have been marked as XFAIL.
> > > >
> > > > There is a special handling for machines with BITS_BIG_ENDIAN !=
> > > > BYTES_BIG_ENDIAN. The need for this came up from an issue in H8
> > > > architecture, which uses big-endian ordering, but BITS_BIG_ENDIAN
> > > > is false. In that case, the START parameter of store_bit_field
> > > > needs to be calculated from the end of the destination register.
> > > >
> > > > gcc/ChangeLog:
> > > >
> > > >       * Makefile.in (OBJS): Add avoid-store-forwarding.o.
> > > >       * common.opt (favoid-store-forwarding): New option.
> > > >       * common.opt.urls: Regenerate.
> > > >       * doc/invoke.texi: New param store-forwarding-max-distance.
> > > >       * doc/passes.texi: Document new pass.
> > > >       * doc/tm.texi: Regenerate.
> > > >       * doc/tm.texi.in: Document new pass.
> > > >       * params.opt (store-forwarding-max-distance): New param.
> > > >       * passes.def: Add pass_rtl_avoid_store_forwarding before
> > > >       pass_early_remat.
> > > >       * target.def (avoid_store_forwarding_p): New DEFHOOK.
> > > >       * target.h (struct store_fwd_info): Declare.
> > > >       * targhooks.cc (default_avoid_store_forwarding_p): New
> function.
> > > >       * targhooks.h (default_avoid_store_forwarding_p): Declare.
> > > >       * tree-pass.h (make_pass_rtl_avoid_store_forwarding): Declare.
> > > >       * avoid-store-forwarding.cc: New file.
> > > >       * avoid-store-forwarding.h: New file.
> > > >
> > > > gcc/testsuite/ChangeLog:
> > > >
> > > >       * gcc.target/aarch64/avoid-store-forwarding-1.c: New test.
> > > >       * gcc.target/aarch64/avoid-store-forwarding-2.c: New test.
> > > >       * gcc.target/aarch64/avoid-store-forwarding-3.c: New test.
> > > >       * gcc.target/aarch64/avoid-store-forwarding-4.c: New test.
> > > >       * gcc.target/aarch64/avoid-store-forwarding-5.c: New test.
> > > >       * gcc.target/x86_64/abi/callabi/avoid-store-forwarding-1.c: New test.
> > > >       * gcc.target/x86_64/abi/callabi/avoid-store-forwarding-2.c: New test.
> > > >
> > > > Signed-off-by: Philipp Tomsich <philipp.tomsich@vrull.eu>
> > > > Signed-off-by: Konstantinos Eleftheriou <konstantinos.eleftheriou@vrull.eu>
> > > >
> > > > Series-version: 8
> > > >
> > > > Series-changes: 8
> > > >       - Fix store_bit_field call for big-endian targets, where
> > > >         BITS_BIG_ENDIAN is false.
> > > >       - Handle store_forwarding_max_distance = 0 as a special case that
> > > >         disables cost checks for avoid-store-forwarding.
> > > >       - Update testcases for AArch64 and add testcases for x86-64.
> > > >
> > > > Series-changes: 7
> > > >       - Fix bug when copying back the load register, in the case that the
> > > >         load is eliminated.
> > > >
> > > > Series-changes: 6
> > > >       - Reject the transformation on cases that would cause store_bit_field
> > > >         to generate subreg expressions on different register classes.
> > > >         Files avoid-store-forwarding-4.c and avoid-store-forwarding-5.c
> > > >         contain such cases and have been marked as XFAIL.
> > > >       - Use optimize_bb_for_speed_p instead of optimize_insn_for_speed_p.
> > > >       - Inline and remove get_load_mem.
> > > >       - New implementation for is_store_forwarding.
> > > >       - Refactor the main loop in avoid_store_forwarding.
> > > >       - Avoid using the word 'forwardings'.
> > > >       - Use lowpart_subreg instead of validate_subreg + gen_rtx_subreg.
> > > >       - Don't use df_insn_rescan where not needed.
> > > >       - Change order of emitting stores and bit insert instructions.
> > > >       - Check and reject loads for which the dest register overlaps with src.
> > > >       - Remove unused variable.
> > > >       - Change some gen_mov_insn function calls to gen_rtx_SET.
> > > >       - Subtract the cost of eliminated load, instead of 1, for the total cost.
> > > >       - Use delete_insn instead of set_insn_deleted.
> > > >       - Regenerate common.opt.urls.
> > > >       - Add some more comments.
> > > >
> > > > Series-changes: 5
> > > >       - Fix bug with BIG_ENDIAN targets.
> > > >       - Fix bug with unrecognized instructions.
> > > >       - Fix / simplify pass init/fini.
> > > >
> > > > Series-changes: 4
> > > >       - Change pass scheduling to run after sched1.
> > > >       - Add target hook to decide whether a store forwarding instance
> > > >         should be avoided or not.
> > > >       - Fix bugs.
> > > >
> > > > Series-changes: 3
> > > >       - Only emit SUBREG after calling validate_subreg.
> > > >       - Fix memory corruption due to vec self-reference.
> > > >       - Fix bitmap_bit_in_range_p ICE due to BLKMode.
> > > >       - Reject MEM to MEM sets.
> > > >       - Add get_load_mem comment.
> > > >       - Add new testcase.
> > > >
> > > > Series-changes: 2
> > > >       - Allow modes that are not scalar_int_mode.
> > > >       - Introduce simple costing to avoid unprofitable transformations.
> > > >       - Reject bit insert sequences that spill to memory.
> > > >       - Document new pass.
> > > >       - Fix and add testcases.
> > > > ---
> > >
> > > > +namespace {
> > > > +
> > > > +const pass_data pass_data_avoid_store_forwarding =
> > > > +{
> > > > +  RTL_PASS, /* type.  */
> > > > +  "avoid_store_forwarding", /* name.  */
> > > > +  OPTGROUP_NONE, /* optinfo_flags.  */
> > > > +  TV_NONE, /* tv_id.  */
> > > > +  0, /* properties_required.  */
> > > > +  0, /* properties_provided.  */
> > > > +  0, /* properties_destroyed.  */
> > > > +  0, /* todo_flags_start.  */
> > > > +  TODO_df_finish /* todo_flags_finish.  */
> > > > +};
> > > Probably want a TV entry for store forwarding.  While it shouldn't be a
> > > big time sink, we've seen other passes that should be efficient go nuts
> > > in some cases (extension elimination being the most recent example).
> > >
> > >
> > >
> > >
> > > > +
> > > > +/* Try to modify BB so that expensive store forwarding cases are avoided.  */
> > > > +
> > > > +void store_forwarding_analyzer::avoid_store_forwarding (basic_block bb)
> > >
> > > Formatting nit.  Put the return type on its own line so that the
> > > function name always starts on column 0 of its own line.
> > >
> > >
> > >
> > > > +
> > > > +/* Update pass statistics.  */
> > > > +
> > > > +void store_forwarding_analyzer::update_stats (function *fn)
> > > Similarly.
> > >
> > >
> > > OK with the new timevar and two formatting nits.  Thanks for your
> > > patience on this.
> > >
> > > Jeff
> > >
> > >
>
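
The bit-insert rewrite and the START adjustment described in the commit
message can be checked with a small standalone model.  This is only a sketch
under stated assumptions, not code from the patch: load64, bit_start,
insert_bits and transformed_matches are hypothetical names, memory is modelled
as an 8-byte array, and bits are assumed to be numbered from the least
significant bit (the BITS_BIG_ENDIAN == false case).

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

using std::uint64_t;
using std::uint8_t;

// Hypothetical helper: read 8 bytes as one 64-bit value in the given byte order.
static uint64_t load64 (const uint8_t *p, bool bytes_big_endian)
{
  uint64_t v = 0;
  for (unsigned i = 0; i < 8; i++)
    {
      unsigned shift = bytes_big_endian ? (7 - i) * 8 : i * 8;
      v |= (uint64_t) p[i] << shift;
    }
  return v;
}

// Bit position, counted from the least significant bit, of a store that
// starts BYTE_OFFSET bytes into the loaded region.  On a big-endian byte
// order the position is measured from the end of the register.
static unsigned bit_start (unsigned byte_offset, unsigned store_bytes,
                           unsigned load_bytes, bool bytes_big_endian)
{
  unsigned bitsize = store_bytes * 8;
  unsigned start = byte_offset * 8;
  if (bytes_big_endian)
    start = load_bytes * 8 - bitsize - start;
  return start;
}

// Replace BITSIZE bits of DEST, starting at bit START, with the low bits of SRC.
static uint64_t insert_bits (uint64_t dest, uint64_t src, unsigned start,
                             unsigned bitsize)
{
  assert (bitsize >= 1 && start + bitsize <= 64);
  uint64_t field = bitsize == 64 ? ~UINT64_C (0) : (UINT64_C (1) << bitsize) - 1;
  uint64_t mask = field << start;
  return (dest & ~mask) | ((src << start) & mask);
}

// Compare "store byte, then load" against "load, store byte, insert bits".
static bool transformed_matches (unsigned byte_offset, bool bytes_big_endian)
{
  assert (byte_offset < 8);
  const uint8_t init[8] = {0x10, 0x21, 0x32, 0x43, 0x54, 0x65, 0x76, 0x87};
  const uint8_t value = 0xAB;

  // Original order: the small store is forwarded to the wide load.
  uint8_t m1[8];
  std::memcpy (m1, init, 8);
  m1[byte_offset] = value;
  uint64_t expected = load64 (m1, bytes_big_endian);

  // Rewritten order: load first, store afterwards, then patch the register.
  uint8_t m2[8];
  std::memcpy (m2, init, 8);
  uint64_t got = load64 (m2, bytes_big_endian);
  m2[byte_offset] = value;
  got = insert_bits (got, value,
                     bit_start (byte_offset, 1, 8, bytes_big_endian), 8);

  return got == expected && std::memcmp (m1, m2, 8) == 0;
}
```

Calling transformed_matches for every byte offset in both byte orders is a
quick way to confirm that the register value and the memory contents agree
with the original instruction order.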

Patch

diff --git a/gcc/Makefile.in b/gcc/Makefile.in
index eb578210411..550c28bd614 100644
--- a/gcc/Makefile.in
+++ b/gcc/Makefile.in
@@ -1686,6 +1686,7 @@  OBJS = \
 	statistics.o \
 	stmt.o \
 	stor-layout.o \
+	avoid-store-forwarding.o \
 	store-motion.o \
 	streamer-hooks.o \
 	stringpool.o \
diff --git a/gcc/avoid-store-forwarding.cc b/gcc/avoid-store-forwarding.cc
new file mode 100644
index 00000000000..bb99cf27250
--- /dev/null
+++ b/gcc/avoid-store-forwarding.cc
@@ -0,0 +1,648 @@ 
+/* Avoid store forwarding optimization pass.
+   Copyright (C) 2024 Free Software Foundation, Inc.
+   Contributed by VRULL GmbH.
+
+   This file is part of GCC.
+
+   GCC is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by
+   the Free Software Foundation; either version 3, or (at your option)
+   any later version.
+
+   GCC is distributed in the hope that it will be useful, but
+   WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   General Public License for more details.
+
+   You should have received a copy of the GNU General Public License
+   along with GCC; see the file COPYING3.  If not see
+   <http://www.gnu.org/licenses/>.  */
+
+#include "avoid-store-forwarding.h"
+#include "config.h"
+#include "system.h"
+#include "coretypes.h"
+#include "backend.h"
+#include "target.h"
+#include "rtl.h"
+#include "alias.h"
+#include "rtlanal.h"
+#include "cfgrtl.h"
+#include "tree-pass.h"
+#include "cselib.h"
+#include "predict.h"
+#include "insn-config.h"
+#include "expmed.h"
+#include "recog.h"
+#include "regset.h"
+#include "df.h"
+#include "expr.h"
+#include "memmodel.h"
+#include "emit-rtl.h"
+#include "vec.h"
+
+/* This pass tries to detect and avoid cases of store forwarding.
+   On many processors there is a large penalty when smaller stores are
+   forwarded to larger loads.  The idea used to avoid the stall is to move
+   the store after the load and in addition emit a bit insert sequence so
+   the load register has the correct value.  For example the following:
+
+     strb    w2, [x1, 1]
+     ldr     x0, [x1]
+
+   Will be transformed to:
+
+     ldr     x0, [x1]
+     strb    w2, [x1]
+     bfi     x0, x2, 0, 8
+*/
+
+namespace {
+
+const pass_data pass_data_avoid_store_forwarding =
+{
+  RTL_PASS, /* type.  */
+  "avoid_store_forwarding", /* name.  */
+  OPTGROUP_NONE, /* optinfo_flags.  */
+  TV_NONE, /* tv_id.  */
+  0, /* properties_required.  */
+  0, /* properties_provided.  */
+  0, /* properties_destroyed.  */
+  0, /* todo_flags_start.  */
+  TODO_df_finish /* todo_flags_finish.  */
+};
+
+class pass_rtl_avoid_store_forwarding : public rtl_opt_pass
+{
+public:
+  pass_rtl_avoid_store_forwarding (gcc::context *ctxt)
+    : rtl_opt_pass (pass_data_avoid_store_forwarding, ctxt)
+  {}
+
+  /* opt_pass methods: */
+  virtual bool gate (function *)
+    {
+      return flag_avoid_store_forwarding && optimize >= 1;
+    }
+
+  virtual unsigned int execute (function *) override;
+}; // class pass_rtl_avoid_store_forwarding
+
+/* Handler for finding and avoiding store forwarding cases.  */
+
+class store_forwarding_analyzer
+{
+public:
+  unsigned int stats_sf_detected = 0;
+  unsigned int stats_sf_avoided = 0;
+
+  bool is_store_forwarding (rtx store_mem, rtx load_mem,
+			    HOST_WIDE_INT *off_val);
+  bool process_store_forwarding (vec<store_fwd_info> &, rtx_insn *load_insn,
+				 rtx load_mem);
+  void avoid_store_forwarding (basic_block);
+  void update_stats (function *);
+};
+
+/* Return a bit insertion sequence that would make DEST have the correct value
+   if the store represented by STORE_INFO were to be moved after DEST.  */
+
+static rtx_insn *
+generate_bit_insert_sequence (store_fwd_info *store_info, rtx dest)
+{
+  /* Memory size should be a constant at this stage.  */
+  unsigned HOST_WIDE_INT store_size
+    = MEM_SIZE (store_info->store_mem).to_constant ();
+
+  start_sequence ();
+
+  unsigned HOST_WIDE_INT bitsize = store_size * BITS_PER_UNIT;
+  unsigned HOST_WIDE_INT start = store_info->offset * BITS_PER_UNIT;
+
+  /* Adjust START for machines with BITS_BIG_ENDIAN != BYTES_BIG_ENDIAN.
+     Given that the bytes will be reversed in this case, we need to
+     calculate the starting position from the end of the destination
+     register.  */
+  if (BITS_BIG_ENDIAN != BYTES_BIG_ENDIAN)
+    {
+      unsigned HOST_WIDE_INT load_mode_bitsize
+	= (GET_MODE_BITSIZE (GET_MODE (dest))).to_constant ();
+      start = load_mode_bitsize - bitsize - start;
+    }
+
+  rtx mov_reg = store_info->mov_reg;
+  store_bit_field (dest, bitsize, start, 0, 0, GET_MODE (mov_reg), mov_reg,
+		   false, false);
+
+  rtx_insn *insns = get_insns ();
+  unshare_all_rtl_in_chain (insns);
+  end_sequence ();
+
+  for (rtx_insn *insn = insns; insn; insn = NEXT_INSN (insn))
+    if (contains_mem_rtx_p (PATTERN (insn))
+	|| recog_memoized (insn) < 0)
+      return NULL;
+
+  return insns;
+}
+
+/* Return true iff a store to STORE_MEM would write to a sub-region of bytes
+   from what LOAD_MEM would read.  If true also store the relative byte offset
+   of the store within the load to OFF_VAL.  */
+
+bool store_forwarding_analyzer::
+is_store_forwarding (rtx store_mem, rtx load_mem, HOST_WIDE_INT *off_val)
+{
+  poly_int64 load_offset, store_offset;
+  rtx load_base = strip_offset (XEXP (load_mem, 0), &load_offset);
+  rtx store_base = strip_offset (XEXP (store_mem, 0), &store_offset);
+  return (MEM_SIZE (load_mem).is_constant ()
+	  && rtx_equal_p (load_base, store_base)
+	  && known_subrange_p (store_offset, MEM_SIZE (store_mem),
+			       load_offset, MEM_SIZE (load_mem))
+	  && (store_offset - load_offset).is_constant (off_val));
+}
+
+/* Given a list of small stores that are forwarded to LOAD_INSN, try to
+   rearrange them so that a store-forwarding penalty doesn't occur.
+   The stores must be given in reverse program order, starting from the
+   one closer to LOAD_INSN.  */
+
+bool store_forwarding_analyzer::
+process_store_forwarding (vec<store_fwd_info> &stores, rtx_insn *load_insn,
+			  rtx load_mem)
+{
+  machine_mode load_mem_mode = GET_MODE (load_mem);
+  /* Memory sizes should be constants at this stage.  */
+  HOST_WIDE_INT load_size = MEM_SIZE (load_mem).to_constant ();
+
+  /* If the stores cover all the bytes of the load without overlap then we can
+     eliminate the load entirely and use the computed value instead.  */
+
+  sbitmap forwarded_bytes = sbitmap_alloc (load_size);
+  bitmap_clear (forwarded_bytes);
+
+  unsigned int i;
+  store_fwd_info *it;
+  FOR_EACH_VEC_ELT (stores, i, it)
+    {
+      HOST_WIDE_INT store_size = MEM_SIZE (it->store_mem).to_constant ();
+      if (bitmap_bit_in_range_p (forwarded_bytes, it->offset,
+				 it->offset + store_size - 1))
+	break;
+      bitmap_set_range (forwarded_bytes, it->offset, store_size);
+    }
+
+  bitmap_not (forwarded_bytes, forwarded_bytes);
+  bool load_elim = bitmap_empty_p (forwarded_bytes);
+
+  stats_sf_detected++;
+
+  if (dump_file)
+    {
+      fprintf (dump_file, "Store forwarding detected:\n");
+
+      FOR_EACH_VEC_ELT (stores, i, it)
+	{
+	  fprintf (dump_file, "From: ");
+	  print_rtl_single (dump_file, it->store_insn);
+	}
+
+      fprintf (dump_file, "To: ");
+      print_rtl_single (dump_file, load_insn);
+
+      if (load_elim)
+	fprintf (dump_file, "(Load elimination candidate)\n");
+    }
+
+  rtx load = single_set (load_insn);
+  rtx dest;
+
+  if (load_elim)
+    dest = gen_reg_rtx (load_mem_mode);
+  else
+    dest = SET_DEST (load);
+
+  int move_to_front = -1;
+  int total_cost = 0;
+
+  /* Check if we can emit bit insert instructions for all forwarded stores.  */
+  FOR_EACH_VEC_ELT (stores, i, it)
+    {
+      it->mov_reg = gen_reg_rtx (GET_MODE (it->store_mem));
+      rtx_insn *insns = NULL;
+
+      /* If we're eliminating the load then find the store with zero offset
+	 and use it as the base register to avoid a bit insert if possible.  */
+      if (load_elim && it->offset == 0)
+	{
+	  start_sequence ();
+
+	  /* We can use a paradoxical subreg to force this to a wider mode, as
+	     the only use will be inserting the bits (i.e., we don't care about
+	     the value of the higher bits).  */
+	  rtx ext0 = lowpart_subreg (GET_MODE (dest), it->mov_reg,
+				     GET_MODE (it->mov_reg));
+	  if (ext0)
+	    {
+	      rtx_insn *move0 = emit_move_insn (dest, ext0);
+	      if (recog_memoized (move0) >= 0)
+		{
+		  insns = get_insns ();
+		  move_to_front = (int) i;
+		}
+	    }
+
+	  end_sequence ();
+	}
+
+      if (!insns)
+	insns = generate_bit_insert_sequence (&(*it), dest);
+
+      if (!insns)
+	{
+	  if (dump_file)
+	    {
+	      fprintf (dump_file, "Failed due to: ");
+	      print_rtl_single (dump_file, it->store_insn);
+	    }
+	  return false;
+	}
+
+      total_cost += seq_cost (insns, true);
+      it->bits_insert_insns = insns;
+
+      rtx store_set = single_set (it->store_insn);
+
+      /* Create a register move at the store's original position to save the
+	 stored value.  */
+      start_sequence ();
+      rtx_insn *insn1
+	= emit_insn (gen_rtx_SET (it->mov_reg, SET_SRC (store_set)));
+      end_sequence ();
+
+      if (recog_memoized (insn1) < 0)
+	{
+	  if (dump_file)
+	    {
+	      fprintf (dump_file, "Failed due to unrecognizable insn: ");
+	      print_rtl_single (dump_file, insn1);
+	    }
+	  return false;
+	}
+
+      it->save_store_value_insn = insn1;
+
+      /* Create a new store after the load with the saved original value.
+	 This avoids the forwarding stall.  */
+      start_sequence ();
+      rtx_insn *insn2
+	= emit_insn (gen_rtx_SET (SET_DEST (store_set), it->mov_reg));
+      end_sequence ();
+
+      if (recog_memoized (insn2) < 0)
+	{
+	  if (dump_file)
+	    {
+	      fprintf (dump_file, "Failed due to unrecognizable insn: ");
+	      print_rtl_single (dump_file, insn2);
+	    }
+	  return false;
+	}
+
+      it->store_saved_value_insn = insn2;
+    }
+
+  if (load_elim)
+    total_cost -= insn_cost (load_insn, true);
+
+  /* Let the target decide if transforming this store forwarding instance is
+     profitable.  */
+  if (!targetm.avoid_store_forwarding_p (stores, load_mem, total_cost,
+					 load_elim))
+    {
+      if (dump_file)
+	fprintf (dump_file, "Not transformed due to target decision.\n");
+
+      return false;
+    }
+
+  /* If we have a move instead of bit insert, it needs to be emitted first in
+     the resulting sequence.  */
+  if (move_to_front != -1)
+    {
+      store_fwd_info copy = stores[move_to_front];
+      stores.safe_push (copy);
+      stores.ordered_remove (move_to_front);
+    }
+
+  if (load_elim)
+    {
+      machine_mode outer_mode = GET_MODE (SET_DEST (load));
+      rtx load_move;
+      if (outer_mode != load_mem_mode)
+	{
+	  rtx load_value = simplify_gen_unary (GET_CODE (SET_SRC (load)),
+					       outer_mode, dest, load_mem_mode);
+	  load_move = gen_rtx_SET (SET_DEST (load), load_value);
+	}
+      else
+	  load_move = gen_rtx_SET (SET_DEST (load), dest);
+
+      start_sequence ();
+      rtx_insn *insn = emit_insn (load_move);
+      rtx_insn *seq = get_insns ();
+      end_sequence ();
+
+      if (recog_memoized (insn) < 0)
+	return false;
+
+      emit_insn_after (seq, load_insn);
+    }
+
+  if (dump_file)
+    {
+      fprintf (dump_file, "Store forwarding avoided with bit inserts:\n");
+
+      FOR_EACH_VEC_ELT (stores, i, it)
+	{
+	  if (stores.length () > 1)
+	    {
+	      fprintf (dump_file, "For: ");
+	      print_rtl_single (dump_file, it->store_insn);
+	    }
+
+	  fprintf (dump_file, "With sequence:\n");
+
+	  for (rtx_insn *insn = it->bits_insert_insns; insn;
+	       insn = NEXT_INSN (insn))
+	    {
+	      fprintf (dump_file, "  ");
+	      print_rtl_single (dump_file, insn);
+	    }
+	}
+    }
+
+  stats_sf_avoided++;
+
+  /* Done, emit all the generated instructions and delete the stores.
+     Note that STORES are in reverse program order.  */
+
+  FOR_EACH_VEC_ELT (stores, i, it)
+    {
+      emit_insn_after (it->bits_insert_insns, load_insn);
+      emit_insn_after (it->store_saved_value_insn, load_insn);
+    }
+
+  FOR_EACH_VEC_ELT (stores, i, it)
+    {
+      emit_insn_before (it->save_store_value_insn, it->store_insn);
+      delete_insn (it->store_insn);
+    }
+
+  df_insn_rescan (load_insn);
+
+  if (load_elim)
+    delete_insn (load_insn);
+
+  return true;
+}
+
+/* Try to modify BB so that expensive store forwarding cases are avoided.  */
+
+void store_forwarding_analyzer::avoid_store_forwarding (basic_block bb)
+{
+  if (!optimize_bb_for_speed_p (bb))
+    return;
+
+  auto_vec<store_fwd_info, 8> store_exprs;
+  rtx_insn *insn;
+  unsigned int insn_cnt = 0;
+
+  FOR_BB_INSNS (bb, insn)
+    {
+      if (!NONDEBUG_INSN_P (insn))
+	continue;
+
+      vec_rtx_properties properties;
+      properties.add_insn (insn, false);
+
+      rtx set = single_set (insn);
+
+      if (!set)
+	{
+	  store_exprs.truncate (0);
+	  continue;
+	}
+
+      /* The inner mem RTX if INSN is a load, NULL_RTX otherwise.  */
+      rtx load_mem = SET_SRC (set);
+
+      if (GET_CODE (load_mem) == ZERO_EXTEND
+	  || GET_CODE (load_mem) == SIGN_EXTEND)
+	load_mem = XEXP (load_mem, 0);
+
+      if (!MEM_P (load_mem))
+	load_mem = NULL_RTX;
+
+      /* The mem RTX if INSN is a store, NULL_RTX otherwise.  */
+      rtx store_mem = MEM_P (SET_DEST (set)) ? SET_DEST (set) : NULL_RTX;
+
+      /* We cannot analyze memory RTXs that have unknown size.  */
+      if ((store_mem && (!MEM_SIZE_KNOWN_P (store_mem)
+			 || !MEM_SIZE (store_mem).is_constant ()))
+	  || (load_mem && (!MEM_SIZE_KNOWN_P (load_mem)
+			   || !MEM_SIZE (load_mem).is_constant ())))
+	{
+	  store_exprs.truncate (0);
+	  continue;
+	}
+
+      bool is_simple = !properties.has_asm
+		       && !properties.has_side_effects ();
+      bool is_simple_store = is_simple
+			     && store_mem
+			     && !contains_mem_rtx_p (SET_SRC (set));
+      bool is_simple_load = is_simple
+			    && load_mem
+			    && !contains_mem_rtx_p (SET_DEST (set));
+
+      int removed_count = 0;
+
+      if (is_simple_store)
+	{
+	  /* Record store forwarding candidate.  */
+	  store_fwd_info info;
+	  info.store_insn = insn;
+	  info.store_mem = store_mem;
+	  info.insn_cnt = insn_cnt;
+	  info.remove = false;
+	  info.forwarded = false;
+	  store_exprs.safe_push (info);
+	}
+
+      bool reads_mem = false;
+      bool writes_mem = false;
+      for (auto ref : properties.refs ())
+	if (ref.is_mem ())
+	  {
+	    reads_mem |= ref.is_read ();
+	    writes_mem |= ref.is_write ();
+	  }
+	else if (ref.is_write ())
+	  {
+	    /* Drop store forwarding candidates when the address register is
+	    overwritten.  */
+	    bool remove_rest = false;
+	    unsigned int i;
+	    store_fwd_info *it;
+	    FOR_EACH_VEC_ELT_REVERSE (store_exprs, i, it)
+	      {
+		if (remove_rest
+		    || reg_overlap_mentioned_p (regno_reg_rtx[ref.regno],
+						it->store_mem))
+		  {
+		    it->remove = true;
+		    removed_count++;
+		    remove_rest = true;
+		  }
+	      }
+	  }
+
+      if (is_simple_load)
+	{
+	  /* Process load for possible store forwarding cases.
+	     Possible newly created/moved stores, resulted from a successful
+	     forwarding, will be processed in subsequent iterations.  */
+	  auto_vec<store_fwd_info> forwardings;
+	  bool partial_forwarding = false;
+	  bool remove_rest = false;
+
+	  bool vector_load = VECTOR_MODE_P (GET_MODE (load_mem));
+
+	  unsigned int i;
+	  store_fwd_info *it;
+	  FOR_EACH_VEC_ELT_REVERSE (store_exprs, i, it)
+	    {
+	      rtx store_mem = it->store_mem;
+	      HOST_WIDE_INT off_val;
+
+	      bool vector_store = VECTOR_MODE_P (GET_MODE (store_mem));
+
+	      if (remove_rest)
+	      {
+		it->remove = true;
+		removed_count++;
+	      }
+	      else if (vector_load ^ vector_store)
+	      {
+		  /* Vector stores followed by a non-vector load or the
+		     opposite, cause store_bit_field to generate non-canonical
+		     expressions, like (subreg:V4SI (reg:DI ...) 0)).
+		     Cases like that should be handled using vec_duplicate,
+		     so we reject the transformation in those cases.  */
+		     it->remove = true;
+		     removed_count++;
+		     remove_rest = true;
+	      }
+	      else if (is_store_forwarding (store_mem, load_mem, &off_val))
+	      {
+		/* Check if moving this store after the load is legal.  */
+		bool write_dep = false;
+		for (unsigned int j = store_exprs.length () - 1; j != i; j--)
+		  if (!store_exprs[j].forwarded
+		      && output_dependence (store_mem,
+					    store_exprs[j].store_mem))
+		    {
+		      write_dep = true;
+		      break;
+		    }
+
+		if (!write_dep)
+		  {
+		    it->forwarded = true;
+		    it->offset = off_val;
+		    forwardings.safe_push (*it);
+		  }
+		else
+		    partial_forwarding = true;
+
+		it->remove = true;
+		removed_count++;
+	      }
+	      else if (true_dependence (store_mem, GET_MODE (store_mem),
+					load_mem))
+	      {
+		/* We cannot keep a store forwarding candidate if it possibly
+		   interferes with this load.  */
+		it->remove = true;
+		removed_count++;
+		remove_rest = true;
+	      }
+	    }
+
+	  if (!forwardings.is_empty () && !partial_forwarding)
+	    process_store_forwarding (forwardings, insn, load_mem);
+	}
+
+	if ((writes_mem && !is_simple_store)
+	     || (reads_mem && !is_simple_load))
+	   store_exprs.truncate (0);
+
+	if (removed_count)
+	{
+	  unsigned int i, j;
+	  store_fwd_info *it;
+	  VEC_ORDERED_REMOVE_IF (store_exprs, i, j, it, it->remove);
+	}
+
+	/* Don't consider store forwarding if the RTL instruction distance is
+	   more than PARAM_STORE_FORWARDING_MAX_DISTANCE and the cost checks
+	   are not disabled.  */
+	const bool unlimited_cost = (param_store_forwarding_max_distance == 0);
+	if (!unlimited_cost && !store_exprs.is_empty ()
+	    && (store_exprs[0].insn_cnt
+		+ param_store_forwarding_max_distance <= insn_cnt))
+	  store_exprs.ordered_remove (0);
+
+	insn_cnt++;
+    }
+}
+
+/* Update pass statistics.  */
+
+void store_forwarding_analyzer::update_stats (function *fn)
+{
+  statistics_counter_event (fn, "Cases of store forwarding detected: ",
+			    stats_sf_detected);
+  statistics_counter_event (fn, "Cases of store forwarding avoided: ",
+			    stats_sf_avoided);
+}
+
+unsigned int
+pass_rtl_avoid_store_forwarding::execute (function *fn)
+{
+  df_set_flags (DF_DEFER_INSN_RESCAN);
+
+  init_alias_analysis ();
+
+  store_forwarding_analyzer analyzer;
+
+  basic_block bb;
+  FOR_EACH_BB_FN (bb, fn)
+    analyzer.avoid_store_forwarding (bb);
+
+  end_alias_analysis ();
+
+  analyzer.update_stats (fn);
+
+  return 0;
+}
+
+} // anon namespace.
+
+rtl_opt_pass *
+make_pass_rtl_avoid_store_forwarding (gcc::context *ctxt)
+{
+  return new pass_rtl_avoid_store_forwarding (ctxt);
+}
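
The two range checks used by the pass above (whether a store writes a
sub-region of what the load reads, and whether a set of stores covers the
whole load so that the load can be eliminated) can be modelled outside GCC
roughly as follows.  This is a sketch only: store_range, store_within_load and
stores_cover_load are hypothetical names, all offsets and sizes are assumed to
be compile-time constants in bytes, and both accesses are assumed to use the
same base address.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// A store described by its byte offset within the load and its size in bytes.
struct store_range
{
  std::int64_t offset;
  std::int64_t size;
};

// Return true if [store_off, store_off + store_size) lies entirely inside
// [load_off, load_off + load_size).  On success *REL receives the byte offset
// of the store relative to the start of the load.
static bool store_within_load (std::int64_t store_off, std::int64_t store_size,
                               std::int64_t load_off, std::int64_t load_size,
                               std::int64_t *rel)
{
  if (store_size <= 0 || load_size <= 0)
    return false;
  if (store_off < load_off
      || store_off + store_size > load_off + load_size)
    return false;
  *rel = store_off - load_off;
  return true;
}

// Return true if the stores, visited in order, cover every byte of the load
// without overlapping each other.  As in the pass, scanning stops at the
// first store that overlaps bytes already covered.
static bool stores_cover_load (const std::vector<store_range> &stores,
                               std::int64_t load_size)
{
  assert (load_size > 0);
  std::vector<bool> covered ((std::size_t) load_size, false);

  for (const store_range &s : stores)
    {
      assert (s.offset >= 0 && s.size > 0 && s.offset + s.size <= load_size);

      bool overlap = false;
      for (std::int64_t i = s.offset; i < s.offset + s.size; i++)
        overlap |= covered[(std::size_t) i];
      if (overlap)
        break;

      for (std::int64_t i = s.offset; i < s.offset + s.size; i++)
        covered[(std::size_t) i] = true;
    }

  for (bool byte_covered : covered)
    if (!byte_covered)
      return false;
  return true;
}
```

For example, two 4-byte stores at offsets 4 and 0 cover an 8-byte load, while
a single 4-byte store, or two stores that overlap, do not.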
diff --git a/gcc/avoid-store-forwarding.h b/gcc/avoid-store-forwarding.h
new file mode 100644
index 00000000000..55a0c97f008
--- /dev/null
+++ b/gcc/avoid-store-forwarding.h
@@ -0,0 +1,56 @@ 
+/* Avoid store forwarding optimization pass.
+   Copyright (C) 2024 Free Software Foundation, Inc.
+   Contributed by VRULL GmbH.
+
+   This file is part of GCC.
+
+   GCC is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by
+   the Free Software Foundation; either version 3, or (at your option)
+   any later version.
+
+   GCC is distributed in the hope that it will be useful, but
+   WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   General Public License for more details.
+
+   You should have received a copy of the GNU General Public License
+   along with GCC; see the file COPYING3.  If not see
+   <http://www.gnu.org/licenses/>.  */
+
+#ifndef GCC_AVOID_STORE_FORWARDING_H
+#define GCC_AVOID_STORE_FORWARDING_H
+
+#include "config.h"
+#include "system.h"
+#include "coretypes.h"
+#include "backend.h"
+#include "rtl.h"
+
+struct store_fwd_info
+{
+  /* The store instruction that is a store forwarding candidate.  */
+  rtx_insn *store_insn;
+  /* SET_DEST (single_set (store_insn)).  */
+  rtx store_mem;
+  /* The temporary that will hold the stored value at the original store
+     position.  */
+  rtx mov_reg;
+  /* The instruction sequence that inserts the stored value's bits at the
+     appropriate position in the loaded value.  */
+  rtx_insn *bits_insert_insns;
+  /* An instruction that saves the store's value in a register temporarily,
+     (set (reg X) (SET_SRC (store_insn))).  */
+  rtx_insn *save_store_value_insn;
+  /* An instruction that stores the saved value back to memory,
+     (set (SET_DEST (store_insn)) (reg X)).  */
+  rtx_insn *store_saved_value_insn;
+  /* The byte offset for the store's position within the load.  */
+  HOST_WIDE_INT offset;
+
+  unsigned int insn_cnt;
+  bool remove;
+  bool forwarded;
+};
+
+#endif  /* GCC_AVOID_STORE_FORWARDING_H  */
diff --git a/gcc/common.opt b/gcc/common.opt
index 0b1f1ec26e1..dfd668f9fb6 100644
--- a/gcc/common.opt
+++ b/gcc/common.opt
@@ -1759,6 +1759,10 @@  fgcse-sm
 Common Var(flag_gcse_sm) Init(0) Optimization
 Perform store motion after global common subexpression elimination.
 
+favoid-store-forwarding
+Common Var(flag_avoid_store_forwarding) Init(0) Optimization
+Try to avoid store forwarding.
+
 fgcse-las
 Common Var(flag_gcse_las) Init(0) Optimization
 Perform redundant load after store elimination in global common subexpression
diff --git a/gcc/common.opt.urls b/gcc/common.opt.urls
index 78e0dc209d1..e3c6e70daca 100644
--- a/gcc/common.opt.urls
+++ b/gcc/common.opt.urls
@@ -706,6 +706,9 @@  UrlSuffix(gcc/Optimize-Options.html#index-fgcse-lm)
 fgcse-sm
 UrlSuffix(gcc/Optimize-Options.html#index-fgcse-sm)
 
+favoid-store-forwarding
+UrlSuffix(gcc/Optimize-Options.html#index-favoid-store-forwarding)
+
 fgcse-las
 UrlSuffix(gcc/Optimize-Options.html#index-fgcse-las)
 
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 7146163d66d..d36cd6f74e0 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -12734,6 +12734,15 @@  loop unrolling.
 This option is enabled by default at optimization levels @option{-O1},
 @option{-O2}, @option{-O3}, @option{-Os}.
 
+@opindex favoid-store-forwarding
+@item -favoid-store-forwarding
+@itemx -fno-avoid-store-forwarding
+Many CPUs will stall for many cycles when a load partially depends on previous
+smaller stores.  This pass tries to detect such cases and avoid the penalty by
+changing the order of the load and store and then fixing up the loaded value.
+
+Disabled by default.
+
 @opindex ffp-contract
 @item -ffp-contract=@var{style}
 @option{-ffp-contract=off} disables floating-point expression contraction.
diff --git a/gcc/doc/passes.texi b/gcc/doc/passes.texi
index 4ac7a2306a1..639f6b325c8 100644
--- a/gcc/doc/passes.texi
+++ b/gcc/doc/passes.texi
@@ -925,6 +925,14 @@  and addressing mode selection.  The pass is run twice, with values
 being propagated into loops only on the second run.  The code is
 located in @file{fwprop.cc}.
 
+@item Store forwarding avoidance
+
+This pass attempts to reduce the overhead of store to load forwarding.
+It detects when a load reads from one or more previous smaller stores and
+then rearranges them so that the stores are done after the load.  The loaded
+value is adjusted with a series of bit insert instructions so that it stays
+the same.  The code is located in @file{avoid-store-forwarding.cc}.
+
 @item Common subexpression elimination
 
 This pass removes redundant computation within basic blocks, and
diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
index 4deb3d2c283..eed30a20103 100644
--- a/gcc/doc/tm.texi
+++ b/gcc/doc/tm.texi
@@ -7355,6 +7355,14 @@  the @code{POLY_VALUE_MIN}, @code{POLY_VALUE_MAX} and
 implementation returns the lowest possible value of @var{val}.
 @end deftypefn
 
+@deftypefn {Target Hook} bool TARGET_AVOID_STORE_FORWARDING_P (vec<store_fwd_info>, @var{rtx}, @var{int}, @var{bool})
+Given a list of stores and a load instruction that reads from the location
+of the stores, this hook decides if it's profitable to emit additional code
+to avoid a potential store forwarding stall.  The additional instructions
+needed, the sequence cost and additional relevant information are given in
+the arguments so that the target can make an informed decision.
+@end deftypefn
+
 @node Scheduling
 @section Adjusting the Instruction Scheduler
 
diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
index 9f147ccb95c..59a478971ab 100644
--- a/gcc/doc/tm.texi.in
+++ b/gcc/doc/tm.texi.in
@@ -4752,6 +4752,8 @@  Define this macro if a non-short-circuit operation produced by
 
 @hook TARGET_ESTIMATED_POLY_VALUE
 
+@hook TARGET_AVOID_STORE_FORWARDING_P
+
 @node Scheduling
 @section Adjusting the Instruction Scheduler
 
diff --git a/gcc/params.opt b/gcc/params.opt
index 7c572774df2..56a277d3b20 100644
--- a/gcc/params.opt
+++ b/gcc/params.opt
@@ -1040,6 +1040,10 @@  Allow the store merging pass to introduce unaligned stores if it is legal to do
 Common Joined UInteger Var(param_store_merging_max_size) Init(65536) IntegerRange(1, 65536) Param Optimization
 Maximum size of a single store merging region in bytes.
 
+-param=store-forwarding-max-distance=
+Common Joined UInteger Var(param_store_forwarding_max_distance) Init(10) IntegerRange(0, 1000) Param Optimization
+Maximum instruction distance at which a small store forwarded to a larger load may stall.  A value of 0 disables the cost checks for the avoid-store-forwarding pass.
+
 -param=switch-conversion-max-branch-ratio=
 Common Joined UInteger Var(param_switch_conversion_branch_ratio) Init(8) IntegerRange(1, 65536) Param Optimization
 The maximum ratio between array size and switch branches for a switch conversion to take place.
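
How the store-forwarding-max-distance window behaves in the pass can be
illustrated with a small standalone model.  This is a sketch based on the
check at the end of the main loop in avoid_store_forwarding; candidate,
expire_oldest and live_after are hypothetical names, and a std::deque stands
in for the vector of pending stores.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>

// A pending store, identified only by the instruction count at which it
// was recorded.
struct candidate
{
  unsigned insn_cnt;
};

// Drop the oldest candidate once the current instruction count has reached
// its recorded count plus MAX_DISTANCE.  A MAX_DISTANCE of 0 means that
// candidates never expire by distance.
static void expire_oldest (std::deque<candidate> &window, unsigned insn_cnt,
                           unsigned max_distance)
{
  const bool unlimited = (max_distance == 0);
  if (!unlimited && !window.empty ()
      && window.front ().insn_cnt + max_distance <= insn_cnt)
    window.pop_front ();
}

// Record one store at STORE_AT, run the expiry check for every instruction
// count up to and including NOW, and report how many candidates remain.
static std::size_t live_after (unsigned store_at, unsigned now,
                               unsigned max_distance)
{
  std::deque<candidate> window;
  window.push_back ({store_at});
  for (unsigned cnt = store_at; cnt <= now; cnt++)
    expire_oldest (window, cnt, max_distance);
  return window.size ();
}
```

With the default value of 10, a store recorded at count 0 is still pending
after the check for count 9 and is dropped by the check for count 10; with a
value of 0 it stays pending regardless of distance.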
diff --git a/gcc/passes.def b/gcc/passes.def
index 7d01227eed1..b736879e4bd 100644
--- a/gcc/passes.def
+++ b/gcc/passes.def
@@ -503,6 +503,7 @@  along with GCC; see the file COPYING3.  If not see
       NEXT_PASS (pass_sms);
       NEXT_PASS (pass_live_range_shrinkage);
       NEXT_PASS (pass_sched);
+      NEXT_PASS (pass_rtl_avoid_store_forwarding);
       NEXT_PASS (pass_early_remat);
       NEXT_PASS (pass_ira);
       NEXT_PASS (pass_reload);
diff --git a/gcc/target.def b/gcc/target.def
index 523ae7ec9aa..000dd116798 100644
--- a/gcc/target.def
+++ b/gcc/target.def
@@ -6979,6 +6979,16 @@  HOOK_VECTOR_END (shrink_wrap)
 #undef HOOK_PREFIX
 #define HOOK_PREFIX "TARGET_"
 
+DEFHOOK
+(avoid_store_forwarding_p,
+ "Given a list of stores and a load instruction that reads from the location\n\
+of the stores, this hook decides if it's profitable to emit additional code\n\
+to avoid a potential store forwarding stall.  The additional instructions\n\
+needed, the sequence cost and additional relevant information are given in\n\
+the arguments so that the target can make an informed decision.",
+ bool, (vec<store_fwd_info>, rtx, int, bool),
+ default_avoid_store_forwarding_p)
+
 /* Determine the type of unwind info to emit for debugging.  */
 DEFHOOK
 (debug_unwind_info,
diff --git a/gcc/target.h b/gcc/target.h
index 837651d273a..bc6845b8023 100644
--- a/gcc/target.h
+++ b/gcc/target.h
@@ -165,6 +165,9 @@  class function_arg_info;
 /* This is defined in function-abi.h.  */
 class predefined_function_abi;
 
+/* This is defined in avoid-store-forwarding.h.  */
+struct store_fwd_info;
+
 /* These are defined in tree-vect-stmts.cc.  */
 extern tree stmt_vectype (class _stmt_vec_info *);
 extern bool stmt_in_inner_loop_p (class vec_info *, class _stmt_vec_info *);
diff --git a/gcc/targhooks.cc b/gcc/targhooks.cc
index 304b35ed772..0fbada88b48 100644
--- a/gcc/targhooks.cc
+++ b/gcc/targhooks.cc
@@ -96,6 +96,7 @@  along with GCC; see the file COPYING3.  If not see
 #include "tree-vectorizer.h"
 #include "options.h"
 #include "case-cfn-macros.h"
+#include "avoid-store-forwarding.h"
 
 bool
 default_legitimate_address_p (machine_mode mode ATTRIBUTE_UNUSED,
@@ -2281,6 +2282,32 @@  default_class_max_nregs (reg_class_t rclass ATTRIBUTE_UNUSED,
 #endif
 }
 
+/* The default implementation of TARGET_AVOID_STORE_FORWARDING_P.  */
+
+bool
+default_avoid_store_forwarding_p (vec<store_fwd_info>, rtx, int total_cost,
+				  bool)
+{
+  /* Use a simple cost heuristic based on param_store_forwarding_max_distance.
+     In general the distance should be somewhat correlated to the store
+     forwarding penalty; if the penalty is large then it is justified to
+     increase the window size.  Use this to reject sequences that are clearly
+     unprofitable.
+     Skip the cost check if param_store_forwarding_max_distance is 0.  */
+  int max_cost = COSTS_N_INSNS (param_store_forwarding_max_distance / 2);
+  const bool unlimited_cost = (param_store_forwarding_max_distance == 0);
+  if (!unlimited_cost && total_cost > max_cost && max_cost)
+    {
+      if (dump_file)
+	fprintf (dump_file, "Not transformed due to cost: %d > %d.\n",
+		 total_cost, max_cost);
+
+      return false;
+    }
+
+  return true;
+}
+
 /* Determine the debugging unwind mechanism for the target.  */
 
 enum unwind_info_type
diff --git a/gcc/targhooks.h b/gcc/targhooks.h
index 2704d6008f1..7cf22038100 100644
--- a/gcc/targhooks.h
+++ b/gcc/targhooks.h
@@ -254,6 +254,9 @@  extern reg_class_t default_preferred_rename_class (reg_class_t rclass);
 extern bool default_class_likely_spilled_p (reg_class_t);
 extern unsigned char default_class_max_nregs (reg_class_t, machine_mode);
 
+extern bool default_avoid_store_forwarding_p (vec<store_fwd_info>, rtx, int,
+					      bool);
+
 extern enum unwind_info_type default_debug_unwind_info (void);
 
 extern void default_canonicalize_comparison (int *, rtx *, rtx *, bool);
diff --git a/gcc/testsuite/gcc.target/aarch64/avoid-store-forwarding-1.c b/gcc/testsuite/gcc.target/aarch64/avoid-store-forwarding-1.c
new file mode 100644
index 00000000000..5a43eb41be7
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/avoid-store-forwarding-1.c
@@ -0,0 +1,27 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2 -favoid-store-forwarding" } */
+
+typedef union {
+    char arr_8[8];
+    long long_value;
+} DataUnion;
+
+long ssll_1 (DataUnion *data, char x)
+{
+  data->arr_8[0] = x;
+  return data->long_value;
+}
+
+long ssll_2 (DataUnion *data, char x)
+{
+  data->arr_8[1] = x;
+  return data->long_value;
+}
+
+long ssll_3 (DataUnion *data, char x)
+{
+  data->arr_8[7] = x;
+  return data->long_value;
+}
+
+/* { dg-final { scan-assembler-times {ldr\tx[0-9]+, \[x[0-9]+\]\n\tstrb\tw[0-9]+, \[x[0-9]+(, \d+)?\]\n\tbfi\tx[0-9]+, x[0-9]+, \d+, \d+} 3 } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/avoid-store-forwarding-2.c b/gcc/testsuite/gcc.target/aarch64/avoid-store-forwarding-2.c
new file mode 100644
index 00000000000..b958612173b
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/avoid-store-forwarding-2.c
@@ -0,0 +1,39 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2 -favoid-store-forwarding -fdump-rtl-avoid_store_forwarding" } */
+
+typedef union {
+    char arr_8[8];
+    int long_value;
+} DataUnion1;
+
+long no_ssll_1 (DataUnion1 *data, char x)
+{
+  data->arr_8[4] = x;
+  return data->long_value;
+}
+
+long no_ssll_2 (DataUnion1 *data, char x)
+{
+  data->arr_8[5] = x;
+  return data->long_value;
+}
+
+typedef union {
+    char arr_8[8];
+    short long_value[4];
+} DataUnion2;
+
+long no_ssll_3 (DataUnion2 *data, char x)
+{
+  data->arr_8[4] = x;
+  return data->long_value[1];
+}
+
+long no_ssll_4 (DataUnion2 *data, char x)
+{
+  data->arr_8[0] = x;
+  return data->long_value[1];
+}
+
+/* { dg-final { scan-rtl-dump-times "Store forwarding detected" 0 "avoid_store_forwarding" } } */
+/* { dg-final { scan-rtl-dump-times "Store forwarding avoided" 0 "avoid_store_forwarding" } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/avoid-store-forwarding-3.c b/gcc/testsuite/gcc.target/aarch64/avoid-store-forwarding-3.c
new file mode 100644
index 00000000000..d969c774905
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/avoid-store-forwarding-3.c
@@ -0,0 +1,30 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2 -favoid-store-forwarding" } */
+
+typedef union {
+    char arr_8[8];
+    long long_value;
+} DataUnion;
+
+long ssll_multi_1 (DataUnion *data, char x)
+{
+  data->arr_8[0] = x;
+  data->arr_8[2] = x;
+  return data->long_value;
+}
+
+long ssll_multi_2 (DataUnion *data, char x)
+{
+  data->arr_8[0] = x;
+  data->arr_8[1] = 11;
+  return data->long_value;
+}
+
+long ssll_multi_3 (DataUnion *data, char x, short y)
+{
+  data->arr_8[1] = x;
+  __builtin_memcpy(data->arr_8 + 4, &y, sizeof(short));
+  return data->long_value;
+}
+
+/* { dg-final { scan-assembler-times {(\tstr[bh]\tw[0-9]+, \[x[0-9]+(, \d+)?\]\n){2}(\tbfi\tx[0-9]+, x[0-9]+, \d+, \d+\n){2}} 3 } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/avoid-store-forwarding-4.c b/gcc/testsuite/gcc.target/aarch64/avoid-store-forwarding-4.c
new file mode 100644
index 00000000000..3572dd5cd39
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/avoid-store-forwarding-4.c
@@ -0,0 +1,26 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2 -favoid-store-forwarding -fdump-rtl-avoid_store_forwarding" } */
+
+typedef int v4si __attribute__ ((vector_size (16)));
+
+typedef union {
+    char arr_16[16];
+    v4si vec_value;
+} DataUnion;
+
+v4si ssll_vect_1 (DataUnion *data, char x)
+{
+  data->arr_16[0] = x;
+  return data->vec_value;
+}
+
+v4si ssll_vect_2 (DataUnion *data, int x)
+{
+  __builtin_memcpy(data->arr_16 + 4, &x, sizeof(int));
+  return data->vec_value;
+}
+
+/* Scalar stores leading to vector loads cause store_bit_field to generate
+   subreg expressions on different register classes. This should be handled
+   using vec_duplicate, so it is marked as an XFAIL for now.  */
+/* { dg-final { scan-rtl-dump-times "Store forwarding detected" 2 "avoid_store_forwarding" { xfail *-*-* } } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/avoid-store-forwarding-5.c b/gcc/testsuite/gcc.target/aarch64/avoid-store-forwarding-5.c
new file mode 100644
index 00000000000..0b2764256cd
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/avoid-store-forwarding-5.c
@@ -0,0 +1,41 @@ 
+/* { dg-do compile { target int128 } } */
+/* { dg-options "-O2 -favoid-store-forwarding -fdump-rtl-avoid_store_forwarding" } */
+
+typedef float v4f __attribute__ ((vector_size (16)));
+
+typedef union {
+    float arr_2[4];
+    long long_value;
+    __int128 longlong_value;
+    v4f vec_value;
+} DataUnion;
+
+long ssll_load_elim_1 (DataUnion *data, float x)
+{
+  data->arr_2[0] = x;
+  data->arr_2[1] = 0.0f;
+  return data->long_value;
+}
+
+__int128 ssll_load_elim_2 (DataUnion *data, float x)
+{
+  data->arr_2[0] = x;
+  data->arr_2[1] = 0.0f;
+  data->arr_2[2] = x;
+  data->arr_2[3] = 0.0f;
+  return data->longlong_value;
+}
+
+v4f ssll_load_elim_3 (DataUnion *data, float x)
+{
+  data->arr_2[3] = x;
+  data->arr_2[2] = x;
+  data->arr_2[1] = x;
+  data->arr_2[0] = x;
+  return data->vec_value;
+}
+
+/* Scalar stores leading to vector loads cause store_bit_field to generate
+   subreg expressions on different register classes. This should be handled
+   using vec_duplicate, so it is marked as an XFAIL for now.  */
+/* { dg-final { scan-rtl-dump-times "Store forwarding detected" 3 "avoid_store_forwarding" { xfail *-*-* } } } */
diff --git a/gcc/testsuite/gcc.target/x86_64/abi/callabi/avoid-store-forwarding-1.c b/gcc/testsuite/gcc.target/x86_64/abi/callabi/avoid-store-forwarding-1.c
new file mode 100644
index 00000000000..d94bcf136ac
--- /dev/null
+++ b/gcc/testsuite/gcc.target/x86_64/abi/callabi/avoid-store-forwarding-1.c
@@ -0,0 +1,28 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2 -favoid-store-forwarding" } */
+
+typedef union {
+    char arr_8[8];
+    long long_value;
+} DataUnion;
+
+long ssll_1 (DataUnion *data, char x)
+{
+  data->arr_8[0] = x;
+  return data->long_value;
+}
+
+long ssll_2 (DataUnion *data, char x)
+{
+  data->arr_8[1] = x;
+  return data->long_value;
+}
+
+long ssll_3 (DataUnion *data, char x)
+{
+  data->arr_8[7] = x;
+  return data->long_value;
+}
+
+/* Check that the order of stores and loads has changed.  */
+/* { dg-final { scan-assembler-times {movq\t\(%[a-z]{3}\), %[a-z]{3}\n(\tmovl\t%[a-z]{3}, %[a-z]{3}\n)?\tmovb\t%[a-z]{3}, (\d+)?\(%[a-z]{3}\)} 2 } } */
diff --git a/gcc/testsuite/gcc.target/x86_64/abi/callabi/avoid-store-forwarding-2.c b/gcc/testsuite/gcc.target/x86_64/abi/callabi/avoid-store-forwarding-2.c
new file mode 100644
index 00000000000..b958612173b
--- /dev/null
+++ b/gcc/testsuite/gcc.target/x86_64/abi/callabi/avoid-store-forwarding-2.c
@@ -0,0 +1,39 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2 -favoid-store-forwarding -fdump-rtl-avoid_store_forwarding" } */
+
+typedef union {
+    char arr_8[8];
+    int long_value;
+} DataUnion1;
+
+long no_ssll_1 (DataUnion1 *data, char x)
+{
+  data->arr_8[4] = x;
+  return data->long_value;
+}
+
+long no_ssll_2 (DataUnion1 *data, char x)
+{
+  data->arr_8[5] = x;
+  return data->long_value;
+}
+
+typedef union {
+    char arr_8[8];
+    short long_value[4];
+} DataUnion2;
+
+long no_ssll_3 (DataUnion2 *data, char x)
+{
+  data->arr_8[4] = x;
+  return data->long_value[1];
+}
+
+long no_ssll_4 (DataUnion2 *data, char x)
+{
+  data->arr_8[0] = x;
+  return data->long_value[1];
+}
+
+/* { dg-final { scan-rtl-dump-times "Store forwarding detected" 0 "avoid_store_forwarding" } } */
+/* { dg-final { scan-rtl-dump-times "Store forwarding avoided" 0 "avoid_store_forwarding" } } */
diff --git a/gcc/tree-pass.h b/gcc/tree-pass.h
index a928cbe4557..f8bb875cf9f 100644
--- a/gcc/tree-pass.h
+++ b/gcc/tree-pass.h
@@ -572,6 +572,7 @@  extern rtl_opt_pass *make_pass_rtl_dse3 (gcc::context *ctxt);
 extern rtl_opt_pass *make_pass_rtl_cprop (gcc::context *ctxt);
 extern rtl_opt_pass *make_pass_rtl_pre (gcc::context *ctxt);
 extern rtl_opt_pass *make_pass_rtl_hoist (gcc::context *ctxt);
+extern rtl_opt_pass *make_pass_rtl_avoid_store_forwarding (gcc::context *ctxt);
 extern rtl_opt_pass *make_pass_rtl_store_motion (gcc::context *ctxt);
 extern rtl_opt_pass *make_pass_cse_after_global_opts (gcc::context *ctxt);
 extern rtl_opt_pass *make_pass_rtl_ifcvt (gcc::context *ctxt);