diff mbox series

[v7] Target-independent store forwarding avoidance.

Message ID 20241023142729.2404078-1-konstantinos.eleftheriou@vrull.eu
State New
Headers show
Series [v7] Target-independent store forwarding avoidance. | expand

Commit Message

Konstantinos Eleftheriou Oct. 23, 2024, 2:27 p.m. UTC
From: kelefth <konstantinos.eleftheriou@vrull.eu>

This pass detects cases of expensive store forwarding and tries to avoid them
by reordering the stores and using suitable bit insertion sequences.
For example it can transform this:

     strb    w2, [x1, 1]
     ldr     x0, [x1]      # Expensive store forwarding to larger load.

To:

     ldr     x0, [x1]
     strb    w2, [x1]
     bfi     x0, x2, 0, 8

Assembly like this can appear with bitfields or type punning / unions.
On stress-ng when running the cpu-union microbenchmark the following speedups
have been observed.

  Neoverse-N1:      +29.4%
  Intel Coffeelake: +13.1%
  AMD 5950X:        +17.5%

gcc/ChangeLog:

	* Makefile.in: Add avoid-store-forwarding.o
	* common.opt: New option -favoid-store-forwarding.
	* common.opt.urls: Regenerate.
	* doc/invoke.texi: New param store-forwarding-max-distance.
	* doc/passes.texi: Document new pass.
	* doc/tm.texi: Regenerate.
	* doc/tm.texi.in: Document new pass.
	* params.opt: New param store-forwarding-max-distance.
	* passes.def: Schedule a new pass.
	* target.def (HOOK_PREFIX): New target hook avoid_store_forwarding_p.
	* target.h (struct store_fwd_info): Declare.
	* targhooks.cc (default_avoid_store_forwarding_p):
	  Add default_avoid_store_forwarding_p.
	* targhooks.h (default_avoid_store_forwarding_p):
	  Add default_avoid_store_forwarding_p.
	* tree-pass.h (make_pass_rtl_avoid_store_forwarding): Declare.
	* avoid-store-forwarding.cc: New file.
	* avoid-store-forwarding.h: New file.

gcc/testsuite/ChangeLog:

	* gcc.target/aarch64/avoid-store-forwarding-1.c: New test.
	* gcc.target/aarch64/avoid-store-forwarding-2.c: New test.
	* gcc.target/aarch64/avoid-store-forwarding-3.c: New test.
	* gcc.target/aarch64/avoid-store-forwarding-4.c: New test (XFAIL).
	* gcc.target/aarch64/avoid-store-forwarding-5.c: New test (XFAIL).

Signed-off-by: Philipp Tomsich <philipp.tomsich@vrull.eu>
Signed-off-by: Konstantinos Eleftheriou <konstantinos.eleftheriou@vrull.eu>

Series-version: 7

Series-changes: 7
	- Fix bug when copying back the load register, in the case that the
	  load is eliminated.

Series-changes: 6
	- Reject the transformation on cases that would cause store_bit_field
	  to generate subreg expressions on different register classes.
	  Files avoid-store-forwarding-4.c and avoid-store-forwarding-5.c
          contain such cases and have been marked as XFAIL.
	- Use optimize_bb_for_speed_p instead of optimize_insn_for_speed_p.
	- Inline and remove get_load_mem.
	- New implementation for is_store_forwarding.
	- Refactor the main loop in avoid_store_forwarding.
	- Avoid using the word 'forwardings'.
	- Use lowpart_subreg instead of validate_subreg + gen_rtx_subreg.
	- Don't use df_insn_rescan where not needed.
	- Change order of emitting stores and bit insert instructions.
	- Check and reject loads for which the dest register overlaps with src.
	- Remove unused variable.
	- Change some gen_mov_insn function calls to gen_rtx_SET.
	- Subtract the cost of eliminated load, instead of 1, for the total cost.
	- Use delete_insn instead of set_insn_deleted.
	- Regenerate common.opt.urls.
	- Add some more comments.

Series-changes: 5
	- Fix bug with BIG_ENDIAN targets.
	- Fix bug with unrecognized instructions.
	- Fix / simplify pass init/fini.

Series-changes: 4
	- Change pass scheduling to run after sched1.
	- Add target hook to decide whether a store forwarding instance
	  should be avoided or not.
	- Fix bugs.

Series-changes: 3
	- Only emit SUBREG after calling validate_subreg.
	- Fix memory corruption due to vec self-reference.
	- Fix bitmap_bit_in_range_p ICE due to BLKMode.
	- Reject MEM to MEM sets.
	- Add get_load_mem comment.
	- Add new testcase.

Series-changes: 2
	- Allow modes that are not scalar_int_mode.
	- Introduce simple costing to avoid unprofitable transformations.
	- Reject bit insert sequences that spill to memory.
	- Document new pass.
	- Fix and add testcases.
---
 gcc/Makefile.in                               |   1 +
 gcc/avoid-store-forwarding.cc                 | 634 ++++++++++++++++++
 gcc/avoid-store-forwarding.h                  |  56 ++
 gcc/common.opt                                |   4 +
 gcc/common.opt.urls                           |   3 +
 gcc/doc/invoke.texi                           |   9 +
 gcc/doc/passes.texi                           |   8 +
 gcc/doc/tm.texi                               |   8 +
 gcc/doc/tm.texi.in                            |   2 +
 gcc/params.opt                                |   4 +
 gcc/passes.def                                |   1 +
 gcc/target.def                                |  10 +
 gcc/target.h                                  |   3 +
 gcc/targhooks.cc                              |  25 +
 gcc/targhooks.h                               |   3 +
 .../aarch64/avoid-store-forwarding-1.c        |  28 +
 .../aarch64/avoid-store-forwarding-2.c        |  39 ++
 .../aarch64/avoid-store-forwarding-3.c        |  31 +
 .../aarch64/avoid-store-forwarding-4.c        |  26 +
 .../aarch64/avoid-store-forwarding-5.c        |  41 ++
 gcc/tree-pass.h                               |   1 +
 21 files changed, 937 insertions(+)
 create mode 100644 gcc/avoid-store-forwarding.cc
 create mode 100644 gcc/avoid-store-forwarding.h
 create mode 100644 gcc/testsuite/gcc.target/aarch64/avoid-store-forwarding-1.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/avoid-store-forwarding-2.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/avoid-store-forwarding-3.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/avoid-store-forwarding-4.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/avoid-store-forwarding-5.c

Comments

Jakub Jelinek Oct. 23, 2024, 2:38 p.m. UTC | #1
On Wed, Oct 23, 2024 at 04:27:29PM +0200, Konstantinos Eleftheriou wrote:

Just random ChangeLog formatting nits, not actual patch review:

> gcc/ChangeLog:
> 
> 	* Makefile.in: Add avoid-store-forwarding.o

Missing . at the end.  Though, you should really also mention
what you're changing, so
	* Makefile.in (OBJS): Add avoid-store-forwarding.o.

> 	* common.opt: New option -favoid-store-forwarding.

	* common.opt (favoid-store-forwarding): New option.

> 	* common.opt.urls: Regenerate.
> 	* doc/invoke.texi: New param store-forwarding-max-distance.
> 	* doc/passes.texi: Document new pass.
> 	* doc/tm.texi: Regenerate.
> 	* doc/tm.texi.in: Document new pass.
> 	* params.opt: New param store-forwarding-max-distance.

	* params.opt (store-forwarding-max-distance): New param.

> 	* passes.def: Schedule a new pass.

This doesn't say what you schedule and where.
	* passes.def: Add pass_rtl_avoid_store_forwarding before
	pass_early_remat.
or so.

> 	* target.def (HOOK_PREFIX): New target hook avoid_store_forwarding_p.

Better
	* target.def (avoid_store_forwarding_p): New DEFHOOK.

> 	* target.h (struct store_fwd_info): Declare.
> 	* targhooks.cc (default_avoid_store_forwarding_p):
> 	  Add default_avoid_store_forwarding_p.

We don't indent the subsequent lines further than by a single tab.  Plus
it is just a new function, so just say
	* targhooks.cc (default_avoid_store_forwarding_p): New function.

> 	* targhooks.h (default_avoid_store_forwarding_p):
> 	  Add default_avoid_store_forwarding_p.

See above, but just: Declare.

> 	* tree-pass.h (make_pass_rtl_avoid_store_forwarding): Declare.
> 	* avoid-store-forwarding.cc: New file.
> 	* avoid-store-forwarding.h: New file.
> 
> gcc/testsuite/ChangeLog:
> 
> 	* gcc.target/aarch64/avoid-store-forwarding-1.c: New test.
> 	* gcc.target/aarch64/avoid-store-forwarding-2.c: New test.
> 	* gcc.target/aarch64/avoid-store-forwarding-3.c: New test.
> 	* gcc.target/aarch64/avoid-store-forwarding-4.c: New test (XFAIL).
> 	* gcc.target/aarch64/avoid-store-forwarding-5.c: New test (XFAIL).

The " (XFAIL)" parts IMHO don't belong here.
And, it would be nice to get some test coverage on other arches too.

	Jakub
Jeff Law Oct. 26, 2024, 2:54 p.m. UTC | #2
On 10/23/24 8:27 AM, Konstantinos Eleftheriou wrote:
> From: kelefth <konstantinos.eleftheriou@vrull.eu>
> 
> This pass detects cases of expensive store forwarding and tries to avoid them
> by reordering the stores and using suitable bit insertion sequences.
> For example it can transform this:
> 
>       strb    w2, [x1, 1]
>       ldr     x0, [x1]      # Expensive store forwarding to larger load.
> 
> To:
> 
>       ldr     x0, [x1]
>       strb    w2, [x1]
>       bfi     x0, x2, 0, 8
> 
> Assembly like this can appear with bitfields or type punning / unions.
> On stress-ng when running the cpu-union microbenchmark the following speedups
> have been observed.
> 
>    Neoverse-N1:      +29.4%
>    Intel Coffeelake: +13.1%
>    AMD 5950X:        +17.5%
[ ... ]
Seems to still have some correctness issues.  H8 reports this when the 
pass is enabled by default:

> Tests that now fail, but worked before (8 tests):
> 
> h8300-sim/-mh/-mint32: gcc: gcc.c-torture/execute/pr63843.c   -O2  execution test
[ ... ]

It looks like we miss setting the high half of the register.  The good 
sequence looks like:

> !       mov.b   @er2,r3l
> !       mov.b   r3l,@(2,er7)
>         mov.b   @(1,er2),r2l
>         mov.b   r2l,@(3,er7)
> !       mov.w   @(2,er7),r0

Note the word (16 bit) move at the end of the sequence that sets both 
halves of the r0 register.

The broken sequence looks like this:

> !       mov.b   @er2,r0l
>         mov.b   @(1,er2),r2l
> +       mov.b   r0l,@(2,er7)
>         mov.b   r2l,@(3,er7)
> !       mov.b   r2l,r0l

Note how all the assignments are byte sized and that nothing sets r0h. 
We get whatever value happened to be lying around in the high half of 
the register.

You should be able to see this with an H8 cross compiler and shouldn't 
need a full toolchain to test.  Compile with -O2 -mh -mint32.  There is 
only one opportunity for SFB avoidance in pr63843.

There's also a failure for bfin-elf, but I suspect it's ultimately the 
same underlying issue.  Obviously I'll retest once there's a fix.

jeff
diff mbox series

Patch

diff --git a/gcc/Makefile.in b/gcc/Makefile.in
index 059cf2e8f79..8b3e3e45d2e 100644
--- a/gcc/Makefile.in
+++ b/gcc/Makefile.in
@@ -1682,6 +1682,7 @@  OBJS = \
 	statistics.o \
 	stmt.o \
 	stor-layout.o \
+	avoid-store-forwarding.o \
 	store-motion.o \
 	streamer-hooks.o \
 	stringpool.o \
diff --git a/gcc/avoid-store-forwarding.cc b/gcc/avoid-store-forwarding.cc
new file mode 100644
index 00000000000..95777fc70cc
--- /dev/null
+++ b/gcc/avoid-store-forwarding.cc
@@ -0,0 +1,634 @@ 
+/* Avoid store forwarding optimization pass.
+   Copyright (C) 2024 Free Software Foundation, Inc.
+   Contributed by VRULL GmbH.
+
+   This file is part of GCC.
+
+   GCC is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by
+   the Free Software Foundation; either version 3, or (at your option)
+   any later version.
+
+   GCC is distributed in the hope that it will be useful, but
+   WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   General Public License for more details.
+
+   You should have received a copy of the GNU General Public License
+   along with GCC; see the file COPYING3.  If not see
+   <http://www.gnu.org/licenses/>.  */
+
+#include "avoid-store-forwarding.h"
+#include "config.h"
+#include "system.h"
+#include "coretypes.h"
+#include "backend.h"
+#include "target.h"
+#include "rtl.h"
+#include "alias.h"
+#include "rtlanal.h"
+#include "cfgrtl.h"
+#include "tree-pass.h"
+#include "cselib.h"
+#include "predict.h"
+#include "insn-config.h"
+#include "expmed.h"
+#include "recog.h"
+#include "regset.h"
+#include "df.h"
+#include "expr.h"
+#include "memmodel.h"
+#include "emit-rtl.h"
+#include "vec.h"
+
+/* This pass tries to detect and avoid cases of store forwarding.
+   On many processors there is a large penalty when smaller stores are
+   forwarded to larger loads.  The idea used to avoid the stall is to move
+   the store after the load and in addition emit a bit insert sequence so
+   the load register has the correct value.  For example the following:
+
+     strb    w2, [x1, 1]
+     ldr     x0, [x1]
+
+   Will be transformed to:
+
+     ldr     x0, [x1]
+     strb    w2, [x1]
+     bfi     x0, x2, 0, 8
+*/
+
+namespace {
+
+const pass_data pass_data_avoid_store_forwarding =
+{
+  RTL_PASS, /* type.  */
+  "avoid_store_forwarding", /* name.  */
+  OPTGROUP_NONE, /* optinfo_flags.  */
+  TV_NONE, /* tv_id.  */
+  0, /* properties_required.  */
+  0, /* properties_provided.  */
+  0, /* properties_destroyed.  */
+  0, /* todo_flags_start.  */
+  TODO_df_finish /* todo_flags_finish.  */
+};
+
+class pass_rtl_avoid_store_forwarding : public rtl_opt_pass
+{
+public:
+  pass_rtl_avoid_store_forwarding (gcc::context *ctxt)
+    : rtl_opt_pass (pass_data_avoid_store_forwarding, ctxt)
+  {}
+
+  /* opt_pass methods: */
+  virtual bool gate (function *)
+    {
+      return flag_avoid_store_forwarding && optimize >= 1;
+    }
+
+  virtual unsigned int execute (function *) override;
+}; // class pass_rtl_avoid_store_forwarding
+
+/* Handler for finding and avoiding store forwardings.  */
+
+class store_forwarding_analyzer
+{
+public:
+  unsigned int stats_sf_detected = 0;
+  unsigned int stats_sf_avoided = 0;
+
+  bool is_store_forwarding (rtx store_mem, rtx load_mem,
+			    HOST_WIDE_INT *off_val);
+  bool process_store_forwarding (vec<store_fwd_info> &, rtx_insn *load_insn,
+				 rtx load_mem);
+  void avoid_store_forwarding (basic_block);
+  void update_stats (function *);
+};
+
+/* Return a bit insertion sequence that would make DEST have the correct value
+   if the store represented by STORE_INFO were to be moved after DEST.  */
+
+static rtx_insn *
+generate_bit_insert_sequence (store_fwd_info *store_info, rtx dest)
+{
+  /* Memory size should be a constant at this stage.  */
+  unsigned HOST_WIDE_INT store_size
+    = MEM_SIZE (store_info->store_mem).to_constant ();
+
+  start_sequence ();
+
+  rtx mov_reg = store_info->mov_reg;
+  store_bit_field (dest, store_size * BITS_PER_UNIT,
+		   store_info->offset * BITS_PER_UNIT, 0, 0,
+		   GET_MODE (mov_reg), mov_reg,
+		   false, false);
+
+  rtx_insn *insns = get_insns ();
+  unshare_all_rtl_in_chain (insns);
+  end_sequence ();
+
+  for (rtx_insn *insn = insns; insn; insn = NEXT_INSN (insn))
+    if (contains_mem_rtx_p (PATTERN (insn))
+	|| recog_memoized (insn) < 0)
+      return NULL;
+
+  return insns;
+}
+
+/* Return true iff a store to STORE_MEM would write to a sub-region of bytes
+   from what LOAD_MEM would read.  If true also store the relative byte offset
+   of the store within the load to OFF_VAL.  */
+
+bool store_forwarding_analyzer::
+is_store_forwarding (rtx store_mem, rtx load_mem, HOST_WIDE_INT *off_val)
+{
+  poly_int64 load_offset, store_offset;
+  rtx load_base = strip_offset (XEXP (load_mem, 0), &load_offset);
+  rtx store_base = strip_offset (XEXP (store_mem, 0), &store_offset);
+  return (MEM_SIZE (load_mem).is_constant ()
+	  && rtx_equal_p (load_base, store_base)
+	  && known_subrange_p (store_offset, MEM_SIZE (store_mem),
+			       load_offset, MEM_SIZE (load_mem))
+	  && (store_offset - load_offset).is_constant (off_val));
+}
+
+/* Given a list of small stores that are forwarded to LOAD_INSN, try to
+   rearrange them so that a store-forwarding penalty doesn't occur.
+   The stores must be given in reverse program order, starting from the
+   one closer to LOAD_INSN.  */
+
+bool store_forwarding_analyzer::
+process_store_forwarding (vec<store_fwd_info> &stores, rtx_insn *load_insn,
+			  rtx load_mem)
+{
+  machine_mode load_mem_mode = GET_MODE (load_mem);
+  /* Memory sizes should be constants at this stage.  */
+  HOST_WIDE_INT load_size = MEM_SIZE (load_mem).to_constant ();
+
+  /* If the stores cover all the bytes of the load without overlap then we can
+     eliminate the load entirely and use the computed value instead.  */
+
+  sbitmap forwarded_bytes = sbitmap_alloc (load_size);
+  bitmap_clear (forwarded_bytes);
+
+  unsigned int i;
+  store_fwd_info* it;
+  FOR_EACH_VEC_ELT (stores, i, it)
+    {
+      HOST_WIDE_INT store_size = MEM_SIZE (it->store_mem).to_constant ();
+      if (bitmap_bit_in_range_p (forwarded_bytes, it->offset,
+				 it->offset + store_size - 1))
+	break;
+      bitmap_set_range (forwarded_bytes, it->offset, store_size);
+    }
+
+  bitmap_not (forwarded_bytes, forwarded_bytes);
+  bool load_elim = bitmap_empty_p (forwarded_bytes);
+
+  stats_sf_detected++;
+
+  if (dump_file)
+    {
+      fprintf (dump_file, "Store forwarding detected:\n");
+
+      FOR_EACH_VEC_ELT (stores, i, it)
+	{
+	  fprintf (dump_file, "From: ");
+	  print_rtl_single (dump_file, it->store_insn);
+	}
+
+      fprintf (dump_file, "To: ");
+      print_rtl_single (dump_file, load_insn);
+
+      if (load_elim)
+	fprintf (dump_file, "(Load elimination candidate)\n");
+    }
+
+  rtx load = single_set (load_insn);
+  rtx dest;
+
+  if (load_elim)
+    dest = gen_reg_rtx (load_mem_mode);
+  else
+    dest = SET_DEST (load);
+
+  int move_to_front = -1;
+  int total_cost = 0;
+
+  /* Check if we can emit bit insert instructions for all forwarded stores.  */
+  FOR_EACH_VEC_ELT (stores, i, it)
+    {
+      it->mov_reg = gen_reg_rtx (GET_MODE (it->store_mem));
+      rtx_insn *insns = NULL;
+
+      /* If we're eliminating the load then find the store with zero offset
+	 and use it as the base register to avoid a bit insert if possible.  */
+      if (load_elim && it->offset == 0)
+	{
+	  start_sequence ();
+
+	  /* We can use a paradoxical subreg to force this to a wider mode, as
+	     the only use will be inserting the bits (i.e., we don't care about
+	     the value of the higher bits).  */
+	  rtx ext0 = lowpart_subreg (GET_MODE (dest), it->mov_reg,
+				     GET_MODE (it->mov_reg));
+	  if (ext0)
+	    {
+	      rtx_insn *move0 = emit_move_insn (dest, ext0);
+	      if (recog_memoized (move0) >= 0)
+		{
+		  insns = get_insns ();
+		  move_to_front = (int) i;
+		}
+	    }
+
+	  end_sequence ();
+	}
+
+      if (!insns)
+	insns = generate_bit_insert_sequence (&(*it), dest);
+
+      if (!insns)
+	{
+	  if (dump_file)
+	    {
+	      fprintf (dump_file, "Failed due to: ");
+	      print_rtl_single (dump_file, it->store_insn);
+	    }
+	  return false;
+	}
+
+      total_cost += seq_cost (insns, true);
+      it->bits_insert_insns = insns;
+
+      rtx store_set = single_set (it->store_insn);
+
+      /* Create a register move at the store's original position to save the
+	 stored value.  */
+      start_sequence ();
+      rtx_insn *insn1
+	= emit_insn (gen_rtx_SET (it->mov_reg, SET_SRC (store_set)));
+      end_sequence ();
+
+      if (recog_memoized (insn1) < 0)
+	{
+	  if (dump_file)
+	    {
+	      fprintf (dump_file, "Failed due to unrecognizable insn: ");
+	      print_rtl_single (dump_file, insn1);
+	    }
+	  return false;
+	}
+
+      it->save_store_value_insn = insn1;
+
+      /* Create a new store after the load with the saved original value.
+	 This avoids the forwarding stall.  */
+      start_sequence ();
+      rtx_insn *insn2
+	= emit_insn (gen_rtx_SET (SET_DEST (store_set), it->mov_reg));
+      end_sequence ();
+
+      if (recog_memoized (insn2) < 0)
+	{
+	  if (dump_file)
+	    {
+	      fprintf (dump_file, "Failed due to unrecognizable insn: ");
+	      print_rtl_single (dump_file, insn2);
+	    }
+	  return false;
+	}
+
+      it->store_saved_value_insn = insn2;
+    }
+
+  if (load_elim)
+    total_cost -= insn_cost (load_insn, true);
+
+  /* Let the target decide if transforming this store forwarding instance is
+     profitable.  */
+  if (!targetm.avoid_store_forwarding_p (stores, load_mem, total_cost,
+					 load_elim))
+    {
+      if (dump_file)
+	fprintf (dump_file, "Not transformed due to target decision.\n");
+
+      return false;
+    }
+
+  /* If we have a move instead of bit insert, it needs to be emitted first in
+     the resulting sequence.  */
+  if (move_to_front != -1)
+    {
+      store_fwd_info copy = stores[move_to_front];
+      stores.safe_push (copy);
+      stores.ordered_remove (move_to_front);
+    }
+
+  if (load_elim)
+    {
+      machine_mode outer_mode = GET_MODE (SET_DEST (load));
+      rtx load_move;
+      if (outer_mode != load_mem_mode)
+	{
+	  rtx load_value = simplify_gen_unary (GET_CODE (SET_SRC (load)),
+					       outer_mode, dest, load_mem_mode);
+	  load_move = gen_rtx_SET (SET_DEST (load), load_value);
+	}
+      else
+	  load_move = gen_rtx_SET (SET_DEST (load), dest);
+
+      start_sequence ();
+      rtx_insn *insn = emit_insn (load_move);
+      rtx_insn *seq = get_insns ();
+      end_sequence ();
+
+      if (recog_memoized (insn) < 0)
+	return false;
+
+      emit_insn_after (seq, load_insn);
+    }
+
+  if (dump_file)
+    {
+      fprintf (dump_file, "Store forwarding avoided with bit inserts:\n");
+
+      FOR_EACH_VEC_ELT (stores, i, it)
+	{
+	  if (stores.length () > 1)
+	    {
+	      fprintf (dump_file, "For: ");
+	      print_rtl_single (dump_file, it->store_insn);
+	    }
+
+	  fprintf (dump_file, "With sequence:\n");
+
+	  for (rtx_insn *insn = it->bits_insert_insns; insn;
+	       insn = NEXT_INSN (insn))
+	    {
+	      fprintf (dump_file, "  ");
+	      print_rtl_single (dump_file, insn);
+	    }
+	}
+    }
+
+  stats_sf_avoided++;
+
+  /* Done, emit all the generated instructions and delete the stores.
+     Note that STORES are in reverse program order.  */
+
+  FOR_EACH_VEC_ELT (stores, i, it)
+    {
+      emit_insn_after (it->bits_insert_insns, load_insn);
+      emit_insn_after (it->store_saved_value_insn, load_insn);
+    }
+
+  FOR_EACH_VEC_ELT (stores, i, it)
+    {
+      emit_insn_before (it->save_store_value_insn, it->store_insn);
+      delete_insn (it->store_insn);
+    }
+
+  df_insn_rescan (load_insn);
+
+  if (load_elim)
+    delete_insn (load_insn);
+
+  return true;
+}
+
+/* Try to modify BB so that expensive store forwarding cases are avoided.  */
+
+void store_forwarding_analyzer::avoid_store_forwarding (basic_block bb)
+{
+  if (!optimize_bb_for_speed_p (bb))
+    return;
+
+  auto_vec<store_fwd_info, 8> store_exprs;
+  rtx_insn *insn;
+  unsigned int insn_cnt = 0;
+
+  FOR_BB_INSNS (bb, insn)
+    {
+      if (!NONDEBUG_INSN_P (insn))
+	continue;
+
+      vec_rtx_properties properties;
+      properties.add_insn (insn, false);
+
+      rtx set = single_set (insn);
+
+      if (!set)
+	{
+	  store_exprs.truncate (0);
+	  continue;
+	}
+
+      /* The inner mem RTX if INSN is a load, NULL_RTX otherwise.  */
+      rtx load_mem = SET_SRC (set);
+
+      if (GET_CODE (load_mem) == ZERO_EXTEND
+	  || GET_CODE (load_mem) == SIGN_EXTEND)
+	load_mem = XEXP (load_mem, 0);
+
+      if (!MEM_P (load_mem))
+	load_mem = NULL_RTX;
+
+      /* The mem RTX if INSN is a store, NULL_RTX otherwise.  */
+      rtx store_mem = MEM_P (SET_DEST (set)) ? SET_DEST (set) : NULL_RTX;
+
+      /* We cannot analyze memory RTXs that have unknown size.  */
+      if ((store_mem && (!MEM_SIZE_KNOWN_P (store_mem)
+			 || !MEM_SIZE (store_mem).is_constant ()))
+	  || (load_mem && (!MEM_SIZE_KNOWN_P (load_mem)
+			   || !MEM_SIZE (load_mem).is_constant ())))
+	{
+	  store_exprs.truncate (0);
+	  continue;
+	}
+
+      bool is_simple = !properties.has_asm
+		       && !properties.has_side_effects ();
+      bool is_simple_store = is_simple
+			     && store_mem
+			     && !contains_mem_rtx_p (SET_SRC (set));
+      bool is_simple_load = is_simple
+			    && load_mem
+			    && !contains_mem_rtx_p (SET_DEST (set));
+
+      int removed_count = 0;
+
+      if (is_simple_store)
+	{
+	  /* Record store forwarding candidate.  */
+	  store_fwd_info info;
+	  info.store_insn = insn;
+	  info.store_mem = store_mem;
+	  info.insn_cnt = insn_cnt;
+	  info.remove = false;
+	  info.forwarded = false;
+	  store_exprs.safe_push (info);
+	}
+
+      bool reads_mem = false;
+      bool writes_mem = false;
+      for (auto ref : properties.refs ())
+	if (ref.is_mem ())
+	  {
+	    reads_mem |= ref.is_read ();
+	    writes_mem |= ref.is_write ();
+	  }
+	else if (ref.is_write ())
+	  {
+	    /* Drop store forwarding candidates when the address register is
+	    overwritten.  */
+	    bool remove_rest = false;
+	    unsigned int i;
+	    store_fwd_info *it;
+	    FOR_EACH_VEC_ELT_REVERSE (store_exprs, i, it)
+	      {
+		if (remove_rest
+		    || reg_overlap_mentioned_p (regno_reg_rtx[ref.regno],
+						it->store_mem))
+		  {
+		    it->remove = true;
+		    removed_count++;
+		    remove_rest = true;
+		  }
+	      }
+	  }
+
+      if (is_simple_load)
+	{
+	  /* Process load for possible store forwarding cases.
+	     Possible newly created/moved stores, resulted from a successful
+	     forwarding, will be processed in subsequent iterations.  */
+	  auto_vec<store_fwd_info> forwardings;
+	  bool partial_forwarding = false;
+	  bool remove_rest = false;
+
+	  bool vector_load = VECTOR_MODE_P (GET_MODE (load_mem));
+
+	  unsigned int i;
+	  store_fwd_info *it;
+	  FOR_EACH_VEC_ELT_REVERSE (store_exprs, i, it)
+	    {
+	      rtx store_mem = it->store_mem;
+	      HOST_WIDE_INT off_val;
+
+	      bool vector_store = VECTOR_MODE_P (GET_MODE (store_mem));
+
+	      if (remove_rest)
+	      {
+		it->remove = true;
+		removed_count++;
+	      }
+	      else if (vector_load ^ vector_store)
+	      {
+		  /* Vector stores followed by a non-vector load or the
+		     opposite, cause store_bit_field to generate non-canonical
+		     expressions, like (subreg:V4SI (reg:DI ...) 0)).
+		     Cases like that should be handled using vec_duplicate,
+		     so we reject the transformation in those cases.  */
+		     it->remove = true;
+		     removed_count++;
+		     remove_rest = true;
+	      }
+	      else if (is_store_forwarding (store_mem, load_mem, &off_val))
+	      {
+		/* Check if moving this store after the load is legal.  */
+		bool write_dep = false;
+		for (unsigned int j = store_exprs.length () - 1; j != i; j--)
+		  if (!store_exprs[j].forwarded
+		      && output_dependence (store_mem,
+					    store_exprs[j].store_mem))
+		    {
+		      write_dep = true;
+		      break;
+		    }
+
+		if (!write_dep)
+		  {
+		    it->forwarded = true;
+		    it->offset = off_val;
+		    forwardings.safe_push (*it);
+		  }
+		else
+		    partial_forwarding = true;
+
+		it->remove = true;
+		removed_count++;
+	      }
+	      else if (true_dependence (store_mem, GET_MODE (store_mem),
+					load_mem))
+	      {
+		/* We cannot keep a store forwarding candidate if it possibly
+		   interferes with this load.  */
+		it->remove = true;
+		removed_count++;
+		remove_rest = true;
+	      }
+	    }
+
+	  if (!forwardings.is_empty () && !partial_forwarding)
+	    process_store_forwarding (forwardings, insn, load_mem);
+	}
+
+	if ((writes_mem && !is_simple_store)
+	     || (reads_mem && !is_simple_load))
+	   store_exprs.truncate (0);
+
+	if (removed_count)
+	{
+	  unsigned int i, j;
+	  store_fwd_info *it;
+	  VEC_ORDERED_REMOVE_IF (store_exprs, i, j, it, it->remove);
+	}
+
+	/* Don't consider store forwarding if the RTL instruction distance is
+	   more than PARAM_STORE_FORWARDING_MAX_DISTANCE.  */
+	if (!store_exprs.is_empty ()
+	    && (store_exprs[0].insn_cnt
+		+ param_store_forwarding_max_distance <= insn_cnt))
+	  store_exprs.ordered_remove (0);
+
+	insn_cnt++;
+    }
+}
+
+/* Update pass statistics.  */
+
+void store_forwarding_analyzer::update_stats (function *fn)
+{
+  statistics_counter_event (fn, "Cases of store forwarding detected: ",
+			    stats_sf_detected);
+  statistics_counter_event (fn, "Cases of store forwarding avoided: ",
+			    stats_sf_detected);
+}
+
+unsigned int
+pass_rtl_avoid_store_forwarding::execute (function *fn)
+{
+  df_set_flags (DF_DEFER_INSN_RESCAN);
+
+  init_alias_analysis ();
+
+  store_forwarding_analyzer analyzer;
+
+  basic_block bb;
+  FOR_EACH_BB_FN (bb, fn)
+    analyzer.avoid_store_forwarding (bb);
+
+  end_alias_analysis ();
+
+  analyzer.update_stats (fn);
+
+  return 0;
+}
+
+} // anon namespace.
+
+rtl_opt_pass *
+make_pass_rtl_avoid_store_forwarding (gcc::context *ctxt)
+{
+  return new pass_rtl_avoid_store_forwarding (ctxt);
+}
diff --git a/gcc/avoid-store-forwarding.h b/gcc/avoid-store-forwarding.h
new file mode 100644
index 00000000000..55a0c97f008
--- /dev/null
+++ b/gcc/avoid-store-forwarding.h
@@ -0,0 +1,56 @@ 
+/* Avoid store forwarding optimization pass.
+   Copyright (C) 2024 Free Software Foundation, Inc.
+   Contributed by VRULL GmbH.
+
+   This file is part of GCC.
+
+   GCC is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by
+   the Free Software Foundation; either version 3, or (at your option)
+   any later version.
+
+   GCC is distributed in the hope that it will be useful, but
+   WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   General Public License for more details.
+
+   You should have received a copy of the GNU General Public License
+   along with GCC; see the file COPYING3.  If not see
+   <http://www.gnu.org/licenses/>.  */
+
+#ifndef GCC_AVOID_STORE_FORWARDING_H
+#define GCC_AVOID_STORE_FORWARDING_H
+
+#include "config.h"
+#include "system.h"
+#include "coretypes.h"
+#include "backend.h"
+#include "rtl.h"
+
+struct store_fwd_info
+{
+  /* The store instruction that is a store forwarding candidate.  */
+  rtx_insn *store_insn;
+  /* SET_DEST (single_set (store_insn)).  */
+  rtx store_mem;
+  /* The temporary that will hold the stored value at the original store
+     position.  */
+  rtx mov_reg;
+  /* The instruction sequence that inserts the stored value's bits at the
+     appropriate position in the loaded value.  */
+  rtx_insn *bits_insert_insns;
+  /* An instruction that saves the store's value in a register temporarily,
+     (set (reg X) (SET_SRC (store_insn))).  */
+  rtx_insn *save_store_value_insn;
+  /* An instruction that stores the saved value back to memory,
+     (set (SET_DEST (store_insn)) (reg X)).  */
+  rtx_insn *store_saved_value_insn;
+  /* The byte offset for the store's position within the load.  */
+  HOST_WIDE_INT offset;
+
+  unsigned int insn_cnt;
+  bool remove;
+  bool forwarded;
+};
+
+#endif  /* GCC_AVOID_STORE_FORWARDING_H  */
diff --git a/gcc/common.opt b/gcc/common.opt
index 12b25ff486d..346a996ef07 100644
--- a/gcc/common.opt
+++ b/gcc/common.opt
@@ -1755,6 +1755,10 @@  fgcse-sm
 Common Var(flag_gcse_sm) Init(0) Optimization
 Perform store motion after global common subexpression elimination.
 
+favoid-store-forwarding
+Common Var(flag_avoid_store_forwarding) Init(0) Optimization
+Try to avoid store forwarding.
+
 fgcse-las
 Common Var(flag_gcse_las) Init(0) Optimization
 Perform redundant load after store elimination in global common subexpression
diff --git a/gcc/common.opt.urls b/gcc/common.opt.urls
index e31736cd994..8325b2c28c3 100644
--- a/gcc/common.opt.urls
+++ b/gcc/common.opt.urls
@@ -700,6 +700,9 @@  UrlSuffix(gcc/Optimize-Options.html#index-fgcse-lm)
 fgcse-sm
 UrlSuffix(gcc/Optimize-Options.html#index-fgcse-sm)
 
+favoid-store-forwarding
+UrlSuffix(gcc/Optimize-Options.html#index-favoid-store-forwarding)
+
 fgcse-las
 UrlSuffix(gcc/Optimize-Options.html#index-fgcse-las)
 
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index aebcc9082ff..24c0e32a37e 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -12614,6 +12614,15 @@  loop unrolling.
 This option is enabled by default at optimization levels @option{-O1},
 @option{-O2}, @option{-O3}, @option{-Os}.
 
+@opindex favoid-store-forwarding
+@item -favoid-store-forwarding
+@itemx -fno-avoid-store-forwarding
+Many CPUs will stall for many cycles when a load partially depends on previous
+smaller stores.  This pass tries to detect such cases and avoid the penalty by
+changing the order of the load and store and then fixing up the loaded value.
+
+Disabled by default.
+
 @opindex ffp-contract
 @item -ffp-contract=@var{style}
 @option{-ffp-contract=off} disables floating-point expression contraction.
diff --git a/gcc/doc/passes.texi b/gcc/doc/passes.texi
index 4ac7a2306a1..639f6b325c8 100644
--- a/gcc/doc/passes.texi
+++ b/gcc/doc/passes.texi
@@ -925,6 +925,14 @@  and addressing mode selection.  The pass is run twice, with values
 being propagated into loops only on the second run.  The code is
 located in @file{fwprop.cc}.
 
+@item Store forwarding avoidance
+
+This pass attempts to reduce the overhead of store to load forwarding.
+It detects when a load reads from one or more previous smaller stores and
+then rearranges them so that the stores are done after the load.  The loaded
+value is adjusted with a series of bit insert instructions so that it stays
+the same.  The code is located in @file{avoid-store-forwarding.cc}.
+
 @item Common subexpression elimination
 
 This pass removes redundant computation within basic blocks, and
diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
index 4deb3d2c283..eed30a20103 100644
--- a/gcc/doc/tm.texi
+++ b/gcc/doc/tm.texi
@@ -7355,6 +7355,14 @@  the @code{POLY_VALUE_MIN}, @code{POLY_VALUE_MAX} and
 implementation returns the lowest possible value of @var{val}.
 @end deftypefn
 
+@deftypefn {Target Hook} bool TARGET_AVOID_STORE_FORWARDING_P (vec<store_fwd_info>, @var{rtx}, @var{int}, @var{bool})
+Given a list of stores and a load instruction that reads from the location
+of the stores, this hook decides if it's profitable to emit additional code
+to avoid a potential store forwarding stall.  The additional instructions
+needed, the sequence cost and additional relevant information is given in
+the arguments so that the target can make an informed decision.
+@end deftypefn
+
 @node Scheduling
 @section Adjusting the Instruction Scheduler
 
diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
index 9f147ccb95c..59a478971ab 100644
--- a/gcc/doc/tm.texi.in
+++ b/gcc/doc/tm.texi.in
@@ -4752,6 +4752,8 @@  Define this macro if a non-short-circuit operation produced by
 
 @hook TARGET_ESTIMATED_POLY_VALUE
 
+@hook TARGET_AVOID_STORE_FORWARDING_P
+
 @node Scheduling
 @section Adjusting the Instruction Scheduler
 
diff --git a/gcc/params.opt b/gcc/params.opt
index 24f440bbe71..bcaf47fd159 100644
--- a/gcc/params.opt
+++ b/gcc/params.opt
@@ -1040,6 +1040,10 @@  Allow the store merging pass to introduce unaligned stores if it is legal to do
 Common Joined UInteger Var(param_store_merging_max_size) Init(65536) IntegerRange(1, 65536) Param Optimization
 Maximum size of a single store merging region in bytes.
 
+-param=store-forwarding-max-distance=
+Common Joined UInteger Var(param_store_forwarding_max_distance) Init(10) IntegerRange(1, 1000) Param Optimization
+Maximum number of instruction distance that a small store forwarded to a larger load may stall.
+
 -param=switch-conversion-max-branch-ratio=
 Common Joined UInteger Var(param_switch_conversion_branch_ratio) Init(8) IntegerRange(1, 65536) Param Optimization
 The maximum ratio between array size and switch branches for a switch conversion to take place.
diff --git a/gcc/passes.def b/gcc/passes.def
index 7d01227eed1..b736879e4bd 100644
--- a/gcc/passes.def
+++ b/gcc/passes.def
@@ -503,6 +503,7 @@  along with GCC; see the file COPYING3.  If not see
       NEXT_PASS (pass_sms);
       NEXT_PASS (pass_live_range_shrinkage);
       NEXT_PASS (pass_sched);
+      NEXT_PASS (pass_rtl_avoid_store_forwarding);
       NEXT_PASS (pass_early_remat);
       NEXT_PASS (pass_ira);
       NEXT_PASS (pass_reload);
diff --git a/gcc/target.def b/gcc/target.def
index b3155010888..37e07cb373a 100644
--- a/gcc/target.def
+++ b/gcc/target.def
@@ -6979,6 +6979,16 @@  HOOK_VECTOR_END (shrink_wrap)
 #undef HOOK_PREFIX
 #define HOOK_PREFIX "TARGET_"
 
+DEFHOOK
+(avoid_store_forwarding_p,
+ "Given a list of stores and a load instruction that reads from the location\n\
+of the stores, this hook decides if it's profitable to emit additional code\n\
+to avoid a potential store forwarding stall.  The additional instructions\n\
+needed, the sequence cost and additional relevant information is given in\n\
+the arguments so that the target can make an informed decision.",
+ bool, (vec<store_fwd_info>, rtx, int, bool),
+ default_avoid_store_forwarding_p)
+
 /* Determine the type of unwind info to emit for debugging.  */
 DEFHOOK
 (debug_unwind_info,
diff --git a/gcc/target.h b/gcc/target.h
index 837651d273a..bc6845b8023 100644
--- a/gcc/target.h
+++ b/gcc/target.h
@@ -165,6 +165,9 @@  class function_arg_info;
 /* This is defined in function-abi.h.  */
 class predefined_function_abi;
 
+/* This is defined in avoid-store-forwarding.h.  */
+struct store_fwd_info;
+
 /* These are defined in tree-vect-stmts.cc.  */
 extern tree stmt_vectype (class _stmt_vec_info *);
 extern bool stmt_in_inner_loop_p (class vec_info *, class _stmt_vec_info *);
diff --git a/gcc/targhooks.cc b/gcc/targhooks.cc
index dc040df9fcd..19927815312 100644
--- a/gcc/targhooks.cc
+++ b/gcc/targhooks.cc
@@ -96,6 +96,7 @@  along with GCC; see the file COPYING3.  If not see
 #include "tree-vectorizer.h"
 #include "options.h"
 #include "case-cfn-macros.h"
+#include "avoid-store-forwarding.h"
 
 bool
 default_legitimate_address_p (machine_mode mode ATTRIBUTE_UNUSED,
@@ -2281,6 +2282,30 @@  default_class_max_nregs (reg_class_t rclass ATTRIBUTE_UNUSED,
 #endif
 }
 
+/* The default implementation of TARGET_AVOID_STORE_FORWARDING_P.  */
+
+bool
+default_avoid_store_forwarding_p (vec<store_fwd_info>, rtx, int total_cost,
+				  bool)
+{
+  /* Use a simple cost heurstic base on param_store_forwarding_max_distance.
+     In general the distance should be somewhat correlated to the store
+     forwarding penalty; if the penalty is large then it is justified to
+     increase the window size.  Use this to reject sequences that are clearly
+     unprofitable.  */
+  int max_cost = COSTS_N_INSNS (param_store_forwarding_max_distance / 2);
+  if (total_cost > max_cost)
+    {
+      if (dump_file)
+	fprintf (dump_file, "Not transformed due to cost: %d > %d.\n",
+		 total_cost, max_cost);
+
+      return false;
+    }
+
+  return true;
+}
+
 /* Determine the debugging unwind mechanism for the target.  */
 
 enum unwind_info_type
diff --git a/gcc/targhooks.h b/gcc/targhooks.h
index 2704d6008f1..7cf22038100 100644
--- a/gcc/targhooks.h
+++ b/gcc/targhooks.h
@@ -254,6 +254,9 @@  extern reg_class_t default_preferred_rename_class (reg_class_t rclass);
 extern bool default_class_likely_spilled_p (reg_class_t);
 extern unsigned char default_class_max_nregs (reg_class_t, machine_mode);
 
+extern bool default_avoid_store_forwarding_p (vec<store_fwd_info>, rtx, int,
+					      bool);
+
 extern enum unwind_info_type default_debug_unwind_info (void);
 
 extern void default_canonicalize_comparison (int *, rtx *, rtx *, bool);
diff --git a/gcc/testsuite/gcc.target/aarch64/avoid-store-forwarding-1.c b/gcc/testsuite/gcc.target/aarch64/avoid-store-forwarding-1.c
new file mode 100644
index 00000000000..4712f101d5e
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/avoid-store-forwarding-1.c
@@ -0,0 +1,28 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2 -favoid-store-forwarding -fdump-rtl-avoid_store_forwarding" } */
+
+typedef union {
+    char arr_8[8];
+    long long_value;
+} DataUnion;
+
+long ssll_1 (DataUnion *data, char x)
+{
+  data->arr_8[0] = x;
+  return data->long_value;
+}
+
+long ssll_2 (DataUnion *data, char x)
+{
+  data->arr_8[1] = x;
+  return data->long_value;
+}
+
+long ssll_3 (DataUnion *data, char x)
+{
+  data->arr_8[7] = x;
+  return data->long_value;
+}
+
+/* { dg-final { scan-rtl-dump-times "Store forwarding detected" 3 "avoid_store_forwarding" } } */
+/* { dg-final { scan-rtl-dump-times "Store forwarding avoided" 3 "avoid_store_forwarding" } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/avoid-store-forwarding-2.c b/gcc/testsuite/gcc.target/aarch64/avoid-store-forwarding-2.c
new file mode 100644
index 00000000000..b958612173b
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/avoid-store-forwarding-2.c
@@ -0,0 +1,39 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2 -favoid-store-forwarding -fdump-rtl-avoid_store_forwarding" } */
+
+typedef union {
+    char arr_8[8];
+    int long_value;
+} DataUnion1;
+
+long no_ssll_1 (DataUnion1 *data, char x)
+{
+  data->arr_8[4] = x;
+  return data->long_value;
+}
+
+long no_ssll_2 (DataUnion1 *data, char x)
+{
+  data->arr_8[5] = x;
+  return data->long_value;
+}
+
+typedef union {
+    char arr_8[8];
+    short long_value[4];
+} DataUnion2;
+
+long no_ssll_3 (DataUnion2 *data, char x)
+{
+  data->arr_8[4] = x;
+  return data->long_value[1];
+}
+
+long no_ssll_4 (DataUnion2 *data, char x)
+{
+  data->arr_8[0] = x;
+  return data->long_value[1];
+}
+
+/* { dg-final { scan-rtl-dump-times "Store forwarding detected" 0 "avoid_store_forwarding" } } */
+/* { dg-final { scan-rtl-dump-times "Store forwarding avoided" 0 "avoid_store_forwarding" } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/avoid-store-forwarding-3.c b/gcc/testsuite/gcc.target/aarch64/avoid-store-forwarding-3.c
new file mode 100644
index 00000000000..28191f5aa39
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/avoid-store-forwarding-3.c
@@ -0,0 +1,31 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2 -favoid-store-forwarding -fdump-rtl-avoid_store_forwarding" } */
+
+typedef union {
+    char arr_8[8];
+    long long_value;
+} DataUnion;
+
+long ssll_multi_1 (DataUnion *data, char x)
+{
+  data->arr_8[0] = x;
+  data->arr_8[2] = x;
+  return data->long_value;
+}
+
+long ssll_multi_2 (DataUnion *data, char x)
+{
+  data->arr_8[0] = x;
+  data->arr_8[1] = 11;
+  return data->long_value;
+}
+
+long ssll_multi_3 (DataUnion *data, char x, short y)
+{
+  data->arr_8[1] = x;
+  __builtin_memcpy(data->arr_8 + 4, &y, sizeof(short));
+  return data->long_value;
+}
+
+/* { dg-final { scan-rtl-dump-times "Store forwarding detected" 3 "avoid_store_forwarding" } } */
+/* { dg-final { scan-rtl-dump-times "Store forwarding avoided" 3 "avoid_store_forwarding" } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/avoid-store-forwarding-4.c b/gcc/testsuite/gcc.target/aarch64/avoid-store-forwarding-4.c
new file mode 100644
index 00000000000..3572dd5cd39
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/avoid-store-forwarding-4.c
@@ -0,0 +1,26 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2 -favoid-store-forwarding -fdump-rtl-avoid_store_forwarding" } */
+
+typedef int v4si __attribute__ ((vector_size (16)));
+
+typedef union {
+    char arr_16[16];
+    v4si vec_value;
+} DataUnion;
+
+v4si ssll_vect_1 (DataUnion *data, char x)
+{
+  data->arr_16[0] = x;
+  return data->vec_value;
+}
+
+v4si ssll_vect_2 (DataUnion *data, int x)
+{
+  __builtin_memcpy(data->arr_16 + 4, &x, sizeof(int));
+  return data->vec_value;
+}
+
+/* Scalar stores leading to vector loads cause store_bit_field to generate
+   subreg expressions on different register classes. This should be handled
+   using vec_duplicate, so it is marked as an XFAIL for now.  */
+/* { dg-final { scan-rtl-dump-times "Store forwarding detected" 2 "avoid_store_forwarding" { xfail *-*-* } } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/avoid-store-forwarding-5.c b/gcc/testsuite/gcc.target/aarch64/avoid-store-forwarding-5.c
new file mode 100644
index 00000000000..0b2764256cd
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/avoid-store-forwarding-5.c
@@ -0,0 +1,41 @@ 
+/* { dg-do compile { target int128 } } */
+/* { dg-options "-O2 -favoid-store-forwarding -fdump-rtl-avoid_store_forwarding" } */
+
+typedef float v4f __attribute__ ((vector_size (16)));
+
+typedef union {
+    float arr_2[4];
+    long long_value;
+    __int128 longlong_value;
+    v4f vec_value;
+} DataUnion;
+
+long ssll_load_elim_1 (DataUnion *data, float x)
+{
+  data->arr_2[0] = x;
+  data->arr_2[1] = 0.0f;
+  return data->long_value;
+}
+
+__int128 ssll_load_elim_2 (DataUnion *data, float x)
+{
+  data->arr_2[0] = x;
+  data->arr_2[1] = 0.0f;
+  data->arr_2[2] = x;
+  data->arr_2[3] = 0.0f;
+  return data->longlong_value;
+}
+
+v4f ssll_load_elim_3 (DataUnion *data, float x)
+{
+  data->arr_2[3] = x;
+  data->arr_2[2] = x;
+  data->arr_2[1] = x;
+  data->arr_2[0] = x;
+  return data->vec_value;
+}
+
+/* Scalar stores leading to vector loads cause store_bit_field to generate
+   subreg expressions on different register classes. This should be handled
+   using vec_duplicate, so it is marked as an XFAIL for now.  */
+/* { dg-final { scan-rtl-dump-times "Store forwarding detected" 3 "avoid_store_forwarding" { xfail *-*-* } } } */
diff --git a/gcc/tree-pass.h b/gcc/tree-pass.h
index a928cbe4557..f8bb875cf9f 100644
--- a/gcc/tree-pass.h
+++ b/gcc/tree-pass.h
@@ -572,6 +572,7 @@  extern rtl_opt_pass *make_pass_rtl_dse3 (gcc::context *ctxt);
 extern rtl_opt_pass *make_pass_rtl_cprop (gcc::context *ctxt);
 extern rtl_opt_pass *make_pass_rtl_pre (gcc::context *ctxt);
 extern rtl_opt_pass *make_pass_rtl_hoist (gcc::context *ctxt);
+extern rtl_opt_pass *make_pass_rtl_avoid_store_forwarding (gcc::context *ctxt);
 extern rtl_opt_pass *make_pass_rtl_store_motion (gcc::context *ctxt);
 extern rtl_opt_pass *make_pass_cse_after_global_opts (gcc::context *ctxt);
 extern rtl_opt_pass *make_pass_rtl_ifcvt (gcc::context *ctxt);