
[v5] Target-independent store forwarding avoidance.

Message ID 20240809145917.474722-1-manolis.tsamis@vrull.eu
State New

Commit Message

Manolis Tsamis Aug. 9, 2024, 2:58 p.m. UTC
This pass detects cases of expensive store forwarding and tries to avoid them
by reordering the stores and using suitable bit insertion sequences.
For example it can transform this:

     strb    w2, [x1, 1]
     ldr     x0, [x1]      # Expensive store forwarding to larger load.

To:

     ldr     x0, [x1]
     strb    w2, [x1]
     bfi     x0, x2, 0, 8

Assembly like this can appear with bitfields or type punning / unions.
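A minimal C reproducer of this pattern (a hypothetical sketch, not taken from
the patch's testsuite) is:

```c
#include <stdint.h>

/* Hypothetical reproducer: writing one byte of a union member and then
   reading the whole 64-bit member back makes the narrow store forward
   into the wider load, i.e. the strb/ldr sequence shown above.  */
union word
{
  uint64_t u64;
  uint8_t u8[8];
};

uint64_t
punned_read (union word *w, uint8_t b)
{
  w->u8[1] = b;   /* strb w2, [x1, 1] */
  return w->u64;  /* ldr  x0, [x1]    */
}
```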
On stress-ng, when running the cpu-union microbenchmark, the following speedups
have been observed:

  Neoverse-N1:      +29.4%
  Intel Coffeelake: +13.1%
  AMD 5950X:        +17.5%

gcc/ChangeLog:

	* Makefile.in: Add avoid-store-forwarding.o
	* common.opt: New option -favoid-store-forwarding.
	* doc/invoke.texi: New param store-forwarding-max-distance.
	* doc/passes.texi: Document new pass.
	* doc/tm.texi: Regenerate.
	* doc/tm.texi.in: Document new pass.
	* params.opt: New param store-forwarding-max-distance.
	* passes.def: Schedule a new pass.
	* target.def (HOOK_PREFIX): New target hook avoid_store_forwarding_p.
	* target.h (struct store_fwd_info): Declare.
	* tree-pass.h (make_pass_rtl_avoid_store_forwarding): Declare.
	* avoid-store-forwarding.cc: New file.
	* avoid-store-forwarding.h: New file.

gcc/testsuite/ChangeLog:

	* gcc.target/aarch64/avoid-store-forwarding-1.c: New test.
	* gcc.target/aarch64/avoid-store-forwarding-2.c: New test.
	* gcc.target/aarch64/avoid-store-forwarding-3.c: New test.
	* gcc.target/aarch64/avoid-store-forwarding-4.c: New test.
	* gcc.target/aarch64/avoid-store-forwarding-5.c: New test.

Signed-off-by: Manolis Tsamis <manolis.tsamis@vrull.eu>
---

Changes in v5:
        - Fix bug with BIG_ENDIAN targets.
        - Fix bug with unrecognized instructions.
        - Fix / simplify pass init/fini.

Changes in v4:
        - Change pass scheduling to run after sched1.
        - Add target hook to decide whether a store forwarding instance
        should be avoided or not.
        - Fix bugs.

Changes in v3:
        - Only emit SUBREG after calling validate_subreg.
        - Fix memory corruption due to vec self-reference.
        - Fix bitmap_bit_in_range_p ICE due to BLKMode.
        - Reject MEM to MEM sets.
        - Add get_load_mem comment.
        - Add new testcase.

Changes in v2:
        - Allow modes that are not scalar_int_mode.
        - Introduce simple costing to avoid unprofitable transformations.
        - Reject bit insert sequences that spill to memory.
        - Document new pass.
        - Fix and add testcases.

 gcc/Makefile.in                               |   1 +
 gcc/avoid-store-forwarding.cc                 | 616 ++++++++++++++++++
 gcc/avoid-store-forwarding.h                  |  56 ++
 gcc/common.opt                                |   4 +
 gcc/doc/invoke.texi                           |   9 +
 gcc/doc/passes.texi                           |   8 +
 gcc/doc/tm.texi                               |   9 +
 gcc/doc/tm.texi.in                            |   2 +
 gcc/params.opt                                |   4 +
 gcc/passes.def                                |   1 +
 gcc/target.def                                |  11 +
 gcc/target.h                                  |   3 +
 .../aarch64/avoid-store-forwarding-1.c        |  28 +
 .../aarch64/avoid-store-forwarding-2.c        |  39 ++
 .../aarch64/avoid-store-forwarding-3.c        |  31 +
 .../aarch64/avoid-store-forwarding-4.c        |  23 +
 .../aarch64/avoid-store-forwarding-5.c        |  38 ++
 gcc/tree-pass.h                               |   1 +
 18 files changed, 884 insertions(+)
 create mode 100644 gcc/avoid-store-forwarding.cc
 create mode 100644 gcc/avoid-store-forwarding.h
 create mode 100644 gcc/testsuite/gcc.target/aarch64/avoid-store-forwarding-1.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/avoid-store-forwarding-2.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/avoid-store-forwarding-3.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/avoid-store-forwarding-4.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/avoid-store-forwarding-5.c

Comments

Jeff Law Aug. 11, 2024, 5:27 p.m. UTC | #1
On 8/9/24 8:58 AM, Manolis Tsamis wrote:
> This pass detects cases of expensive store forwarding and tries to avoid them
> by reordering the stores and using suitable bit insertion sequences.
> For example it can transform this:
[ ... ]


Trying to bootstrap (enabled by default) on 68k:

> ../../../gcc/gcc/avoid-store-forwarding.cc: In function 'rtx_insn* {anonymous}::generate_bit_insert_sequence(store_fwd_info*, rtx, machine_mode)':
> ../../../gcc/gcc/avoid-store-forwarding.cc:152:44: error: unused parameter 'load_inner_mode' [-Werror=unused-parameter]
>   152 |                               machine_mode load_inner_mode)
>       |                               ~~~~~~~~~~~~~^~~~~~~~~~~~~~~
Richard Sandiford Aug. 16, 2024, 2:33 p.m. UTC | #2
Manolis Tsamis <manolis.tsamis@vrull.eu> writes:
> This pass detects cases of expensive store forwarding and tries to avoid them
> by reordering the stores and using suitable bit insertion sequences.
> For example it can transform this:
>
>      strb    w2, [x1, 1]
>      ldr     x0, [x1]      # Expensive store forwarding to larger load.
>
> To:
>
>      ldr     x0, [x1]
>      strb    w2, [x1]
>      bfi     x0, x2, 0, 8
>
> Assembly like this can appear with bitfields or type punning / unions.
> On stress-ng when running the cpu-union microbenchmark the following speedups
> have been observed.
>
>   Neoverse-N1:      +29.4%
>   Intel Coffeelake: +13.1%
>   AMD 5950X:        +17.5%
>
> gcc/ChangeLog:
>
> 	* Makefile.in: Add avoid-store-forwarding.o
> 	* common.opt: New option -favoid-store-forwarding.
> 	* doc/invoke.texi: New param store-forwarding-max-distance.
> 	* doc/passes.texi: Document new pass.
> 	* doc/tm.texi: Regenerate.
> 	* doc/tm.texi.in: Document new pass.
> 	* params.opt: New param store-forwarding-max-distance.
> 	* passes.def: Schedule a new pass.
> 	* target.def (HOOK_PREFIX): New target hook avoid_store_forwarding_p.
> 	* target.h (struct store_fwd_info): Declare.
> 	* tree-pass.h (make_pass_rtl_avoid_store_forwarding): Declare.
> 	* avoid-store-forwarding.cc: New file.
> 	* avoid-store-forwarding.h: New file.
>
> gcc/testsuite/ChangeLog:
>
> 	* gcc.target/aarch64/avoid-store-forwarding-1.c: New test.
> 	* gcc.target/aarch64/avoid-store-forwarding-2.c: New test.
> 	* gcc.target/aarch64/avoid-store-forwarding-3.c: New test.
> 	* gcc.target/aarch64/avoid-store-forwarding-4.c: New test.
> 	* gcc.target/aarch64/avoid-store-forwarding-5.c: New test.
>
> Signed-off-by: Manolis Tsamis <manolis.tsamis@vrull.eu>
> ---
>
> Changes in v5:
>         - Fix bug with BIG_ENDIAN targets.
>         - Fix bug with unrecognized instructions.
>         - Fix / simplify pass init/fini.
>
> Changes in v4:
>         - Change pass scheduling to run after sched1.
>         - Add target hook to decide whether a store forwarding instance
>         should be avoided or not.
>         - Fix bugs.
>
> Changes in v3:
>         - Only emit SUBREG after calling validate_subreg.
>         - Fix memory corruption due to vec self-reference.
>         - Fix bitmap_bit_in_range_p ICE due to BLKMode.
>         - Reject MEM to MEM sets.
>         - Add get_load_mem comment.
>         - Add new testcase.
>
> Changes in v2:
>         - Allow modes that are not scalar_int_mode.
>         - Introduce simple costing to avoid unprofitable transformations.
>         - Reject bit insert sequences that spill to memory.
>         - Document new pass.
>         - Fix and add testcases.
>
>  gcc/Makefile.in                               |   1 +
>  gcc/avoid-store-forwarding.cc                 | 616 ++++++++++++++++++
>  gcc/avoid-store-forwarding.h                  |  56 ++
>  gcc/common.opt                                |   4 +
>  gcc/doc/invoke.texi                           |   9 +
>  gcc/doc/passes.texi                           |   8 +
>  gcc/doc/tm.texi                               |   9 +
>  gcc/doc/tm.texi.in                            |   2 +
>  gcc/params.opt                                |   4 +
>  gcc/passes.def                                |   1 +
>  gcc/target.def                                |  11 +
>  gcc/target.h                                  |   3 +
>  .../aarch64/avoid-store-forwarding-1.c        |  28 +
>  .../aarch64/avoid-store-forwarding-2.c        |  39 ++
>  .../aarch64/avoid-store-forwarding-3.c        |  31 +
>  .../aarch64/avoid-store-forwarding-4.c        |  23 +
>  .../aarch64/avoid-store-forwarding-5.c        |  38 ++
>  gcc/tree-pass.h                               |   1 +
>  18 files changed, 884 insertions(+)
>  create mode 100644 gcc/avoid-store-forwarding.cc
>  create mode 100644 gcc/avoid-store-forwarding.h
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/avoid-store-forwarding-1.c
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/avoid-store-forwarding-2.c
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/avoid-store-forwarding-3.c
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/avoid-store-forwarding-4.c
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/avoid-store-forwarding-5.c
>
> diff --git a/gcc/Makefile.in b/gcc/Makefile.in
> index 8fba8f7db6a..43675288399 100644
> --- a/gcc/Makefile.in
> +++ b/gcc/Makefile.in
> @@ -1682,6 +1682,7 @@ OBJS = \
>  	statistics.o \
>  	stmt.o \
>  	stor-layout.o \
> +	avoid-store-forwarding.o \
>  	store-motion.o \
>  	streamer-hooks.o \
>  	stringpool.o \
> diff --git a/gcc/avoid-store-forwarding.cc b/gcc/avoid-store-forwarding.cc
> new file mode 100644
> index 00000000000..4a0343c0314
> --- /dev/null
> +++ b/gcc/avoid-store-forwarding.cc
> @@ -0,0 +1,616 @@
> +/* Avoid store forwarding optimization pass.
> +   Copyright (C) 2024 Free Software Foundation, Inc.
> +   Contributed by VRULL GmbH.
> +
> +   This file is part of GCC.
> +
> +   GCC is free software; you can redistribute it and/or modify it
> +   under the terms of the GNU General Public License as published by
> +   the Free Software Foundation; either version 3, or (at your option)
> +   any later version.
> +
> +   GCC is distributed in the hope that it will be useful, but
> +   WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   General Public License for more details.
> +
> +   You should have received a copy of the GNU General Public License
> +   along with GCC; see the file COPYING3.  If not see
> +   <http://www.gnu.org/licenses/>.  */
> +
> +#include "avoid-store-forwarding.h"
> +#include "config.h"
> +#include "system.h"
> +#include "coretypes.h"
> +#include "backend.h"
> +#include "target.h"
> +#include "rtl.h"
> +#include "alias.h"
> +#include "rtlanal.h"
> +#include "tree-pass.h"
> +#include "cselib.h"
> +#include "predict.h"
> +#include "insn-config.h"
> +#include "expmed.h"
> +#include "recog.h"
> +#include "regset.h"
> +#include "df.h"
> +#include "expr.h"
> +#include "memmodel.h"
> +#include "emit-rtl.h"
> +#include "vec.h"
> +
> +/* This pass tries to detect and avoid cases of store forwarding.
> +   On many processors there is a large penalty when smaller stores are
> +   forwarded to larger loads.  The idea used to avoid the stall is to move
> +   the store after the load and in addition emit a bit insert sequence so
> +   the load register has the correct value.  For example the following:
> +
> +     strb    w2, [x1, 1]
> +     ldr     x0, [x1]
> +
> +   Will be transformed to:
> +
> +     ldr     x0, [x1]
> +     strb    w2, [x1]
> +     bfi     x0, x2, 0, 8
> +*/
> +
> +namespace {
> +
> +const pass_data pass_data_avoid_store_forwarding =
> +{
> +  RTL_PASS, /* type.  */
> +  "avoid_store_forwarding", /* name.  */
> +  OPTGROUP_NONE, /* optinfo_flags.  */
> +  TV_NONE, /* tv_id.  */
> +  0, /* properties_required.  */
> +  0, /* properties_provided.  */
> +  0, /* properties_destroyed.  */
> +  0, /* todo_flags_start.  */
> +  TODO_df_finish /* todo_flags_finish.  */
> +};
> +
> +class pass_rtl_avoid_store_forwarding : public rtl_opt_pass
> +{
> +public:
> +  pass_rtl_avoid_store_forwarding (gcc::context *ctxt)
> +    : rtl_opt_pass (pass_data_avoid_store_forwarding, ctxt)
> +  {}
> +
> +  /* opt_pass methods: */
> +  virtual bool gate (function *)
> +    {
> +      return flag_avoid_store_forwarding && optimize >= 1
> +	     && optimize_insn_for_speed_p ();

I think this should be optimize_function_for_speed_p, since we have no
contextual information for crtl->maybe_hot_insn_p.  But if possible, it'd
be better if we could drop this and instead check optimize_bb_for_speed_p
for each block.

> +    }
> +
> +  virtual unsigned int execute (function *) override;
> +}; // class pass_rtl_avoid_store_forwarding
> +
> +static unsigned int stats_sf_detected = 0;
> +static unsigned int stats_sf_avoided = 0;

Could you instead structure the code as a class with these as member
variables?  I realise it's a bit over-the-top for just 2 variables,
but the pass might be expanded in future.

> +
> +/* If expr is a SET that reads from memory then return the RTX for its MEM
> +   argument, otherwise return NULL.  Allows MEM to be zero/sign extended.  */
> +
> +static rtx
> +get_load_mem (rtx expr)
> +{
> +  if (!expr)
> +    return NULL_RTX;

Could you check this in the caller instead?  It seems an odd interface
that expr can be null, but if nonnull, must be known to be a SET.

> +
> +  rtx mem = SET_SRC (expr);
> +
> +  if (GET_CODE (mem) == ZERO_EXTEND
> +      || GET_CODE (mem) == SIGN_EXTEND)
> +    mem = XEXP (mem, 0);
> +
> +  if (MEM_P (mem))
> +    return mem;
> +  else
> +    return NULL_RTX;
> +}
> +
> +/* Return true iff a store to STORE_MEM would write to a sub-region of bytes
> +   from what LOAD_MEM would read.  If true also store the relative byte offset
> +   of the store within the load to OFF_VAL.  */
> +
> +static bool
> +is_store_forwarding (rtx store_mem, rtx load_mem, HOST_WIDE_INT *off_val)
> +{
> +  if (known_ge (GET_MODE_SIZE (GET_MODE (store_mem)),
> +		GET_MODE_SIZE (GET_MODE (load_mem))))
> +    return false;
> +
> +  rtx off = simplify_gen_binary (MINUS, GET_MODE (XEXP (store_mem, 0)),
> +				 XEXP (store_mem, 0), XEXP (load_mem, 0));
> +
> +  if (CONST_INT_P (off))
> +    {
> +      *off_val = INTVAL (off);
> +      poly_uint64 store_mode_size = GET_MODE_SIZE (GET_MODE (store_mem));
> +      poly_uint64 load_mode_size = GET_MODE_SIZE (GET_MODE (load_mem));
> +      unsigned HOST_WIDE_INT store_size, load_size;
> +      if (store_mode_size.is_constant (&store_size)
> +	  && load_mode_size.is_constant (&load_size))
> +	{
> +	  gcc_checking_assert (store_size > 0 && load_size > 0);
> +	  return *off_val >= 0
> +		 && (store_size + *off_val <= load_size);
> +	}
> +    }
> +
> +  return false;

A more direct way of writing this is:

  poly_int64 load_offset, store_offset;
  rtx load_base = strip_offset (XEXP (load_mem, 0), &load_offset);
  rtx store_base = strip_offset (XEXP (store_mem, 0), &store_offset);
  return (MEM_SIZE (load_mem).is_constant ()
	  && rtx_equal_p (load_base, store_base)
	  && known_subrange_p (store_offset, MEM_SIZE (store_mem),
			       load_offset, MEM_SIZE (load_mem))
	  && (store_offset - load_offset).is_constant (off_val));

which avoids the need for simplify_gen_binary (and thus unnecessary rtxes).

This doesn't enforce that the load and store have constant size,
but IMO that's more naturally done when detecting loads and stores.
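The containment test that known_subrange_p performs can be sketched with plain
integers (a simplification of the poly_int version above; offsets and sizes
are assumed constant here):

```c
#include <stdbool.h>

/* Plain-integer sketch of the containment check: the store's byte range
   [store_off, store_off + store_size) must lie entirely within the
   load's byte range [load_off, load_off + load_size).  */
static bool
subrange_p (long store_off, long store_size, long load_off, long load_size)
{
  return store_off >= load_off
	 && store_off + store_size <= load_off + load_size;
}
```

(The pass separately rejects stores at least as large as the load, so the
equal-range case never reaches this check.)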

> +}
> +
> +/* Return a bit insertion sequence that would make DEST have the correct value
> +   if the store represented by STORE_INFO were to be moved after DEST.  */
> +
> +static rtx_insn *
> +generate_bit_insert_sequence (store_fwd_info *store_info, rtx dest,
> +			      machine_mode load_inner_mode)

As Jeff says, load_inner_mode appears to be unused.

> +{
> +  poly_uint64 store_mode_size
> +    = GET_MODE_SIZE (GET_MODE (store_info->store_mem));
> +  unsigned HOST_WIDE_INT store_size = store_mode_size.to_constant ();

Uses of to_constant should have a comment saying where is_constant
is enforced.  Using MEM_SIZE is a bit shorter than checking the size
of the mode (and more general, although that's not helpful in this
context).

> +
> +  start_sequence ();
> +  store_bit_field (dest, store_size * BITS_PER_UNIT,
> +		   store_info->offset * BITS_PER_UNIT, 0, 0,
> +		   GET_MODE (dest), store_info->mov_reg,
> +		   false, false);
> +
> +  rtx_insn *insns = get_insns ();
> +  unshare_all_rtl_in_chain (insns);
> +  end_sequence ();
> +
> +  for (rtx_insn *insn = insns; insn; insn = NEXT_INSN (insn))
> +    if (contains_mem_rtx_p (PATTERN (insn))
> +	|| recog_memoized (insn) < 0)

When does recog_memoized (insn) < 0 occur?  I'd expect only valid
instructions to be produced.  (Checking for MEMs is ok though.)

> +      return NULL;
> +
> +  return insns;
> +}
> +
> +/* Given a list of small stores that are forwarded to LOAD_INSN, try to
> +   rearrange them so that a store-forwarding penalty doesn't occur.  */

I think this should say that STORES are in reverse program order
(if I've read the code correctly).

I was surprised that we emitted the bit insertions for the last store first,
and only realised later that that order was reversing a previous reverse.

> +
> +static bool
> +process_forwardings (vec<store_fwd_info> &stores, rtx_insn *load_insn)
> +{
> +  rtx load = single_set (load_insn);
> +  rtx load_inner = get_load_mem (load);
> +  machine_mode load_inner_mode = GET_MODE (load_inner);
> +  poly_uint64 load_mode_size = GET_MODE_SIZE (load_inner_mode);
> +  HOST_WIDE_INT load_size = load_mode_size.to_constant ();
> +
> +  /* If the stores cover all the bytes of the load without overlap then we can
> +     eliminate the load entirely and use the computed value instead.  */
> +
> +  sbitmap forwarded_bytes = sbitmap_alloc (load_size);
> +  bitmap_clear (forwarded_bytes);
> +
> +  unsigned int i;
> +  store_fwd_info* it;
> +  FOR_EACH_VEC_ELT (stores, i, it)
> +    {
> +      poly_uint64 store_mode_size = GET_MODE_SIZE (GET_MODE (it->store_mem));
> +      HOST_WIDE_INT store_size = store_mode_size.to_constant ();
> +      if (bitmap_bit_in_range_p (forwarded_bytes, it->offset,
> +				 it->offset + store_size - 1))
> +	break;
> +      bitmap_set_range (forwarded_bytes, it->offset, store_size);
> +    }
> +
> +  bitmap_not (forwarded_bytes, forwarded_bytes);
> +  bool load_elim = bitmap_empty_p (forwarded_bytes);
> +
> +  stats_sf_detected++;
> +
> +  if (dump_file)
> +    {
> +      fprintf (dump_file, "Store forwarding%s detected:\n",
> +	       (stores.length () > 1) ? "s" : "");

Very minor, but "Store forwarding" sounds more natural to me even
when there are multiple stores involved.

> +
> +      FOR_EACH_VEC_ELT (stores, i, it)
> +	{
> +	  fprintf (dump_file, "From: ");
> +	  print_rtl_single (dump_file, it->store_insn);
> +	}
> +
> +      fprintf (dump_file, "To: ");
> +      print_rtl_single (dump_file, load_insn);
> +
> +      if (load_elim)
> +	fprintf (dump_file, "(Load elimination candidate)\n");
> +    }
> +
> +  rtx dest;
> +  if (load_elim)
> +    dest = gen_reg_rtx (load_inner_mode);
> +  else
> +    dest = SET_DEST (load);
> +
> +  int move_to_front = -1;
> +  int total_cost = 0;
> +
> +  /* Check if we can emit bit insert instructions for all forwarded stores.  */
> +  FOR_EACH_VEC_ELT (stores, i, it)
> +    {
> +      it->mov_reg = gen_reg_rtx (GET_MODE (it->store_mem));
> +      rtx_insn *insns = NULL;
> +
> +      /* If we're eliminating the load then find the store with zero offset
> +	 and use it as the base register to avoid a bit insert if possible.  */
> +      if (load_elim && it->offset == 0
> +	  && validate_subreg (GET_MODE (dest), GET_MODE (it->mov_reg),
> +			      it->mov_reg, 0))
> +	{
> +	  start_sequence ();
> +
> +	  /* We can use a paradoxical subreg to force this to a wider mode, as
> +	     the only use will be inserting the bits (i.e., we don't care about
> +	     the value of the higher bits).  */
> +	  rtx ext0 = gen_rtx_SUBREG (GET_MODE (dest), it->mov_reg, 0);

IMO it'd be safer to call lowpart_subreg and check whether the result
is null.  This would also remove the need to check validate_subreg
directly.

> +	  rtx_insn *move0 = emit_move_insn (dest, ext0);
> +	  if (recog_memoized (move0) >= 0)
> +	    {
> +	      insns = get_insns ();
> +	      move_to_front = (int) i;
> +	    }

Here too I'd expect emit_move_insn to produce only valid insns.
(And it can produce multiple insns.)

> +
> +	  end_sequence ();
> +	}
> +
> +      if (!insns)
> +	insns = generate_bit_insert_sequence (&(*it), dest, load_inner_mode);
> +
> +      if (!insns)
> +	{
> +	  if (dump_file)
> +	    {
> +	      fprintf (dump_file, "Failed due to: ");
> +	      print_rtl_single (dump_file, it->store_insn);
> +	    }
> +	  return false;
> +	}
> +
> +      total_cost += seq_cost (insns, true);
> +      it->bits_insert_insns = insns;
> +
> +      rtx store_set = single_set (it->store_insn);
> +
> +      /* Create a register move at the store's original position to save the
> +	 stored value.  */
> +      start_sequence ();
> +      rtx mov1 = gen_move_insn (it->mov_reg, SET_SRC (store_set));

Since SET_SRC is a general expression, we should just try to emit it
as a single set, using gen_rtx_SET rather than emit_move_insn.

> +      rtx_insn *insn1 = emit_insn (mov1);
> +      end_sequence ();
> +
> +      if (recog_memoized (insn1) < 0)
> +	{
> +	  if (dump_file)
> +	    {
> +	      fprintf (dump_file, "Failed due to unrecognizable insn: ");
> +	      print_rtl_single (dump_file, insn1);
> +	    }
> +	  return false;
> +	}
> +
> +      it->save_store_value_insn = insn1;
> +
> +      /* Create a new store after the load with the saved original value.
> +	 This avoids the forwarding stall.  */
> +      start_sequence ();
> +      rtx mov2 = gen_move_insn (SET_DEST (store_set), it->mov_reg);
> +      rtx_insn *insn2 = emit_insn (mov2);
> +      end_sequence ();
> +
> +      if (recog_memoized (insn2) < 0)
> +	{
> +	  if (dump_file)
> +	    {
> +	      fprintf (dump_file, "Failed due to unrecognizable insn: ");
> +	      print_rtl_single (dump_file, insn2);
> +	    }
> +	  return false;
> +	}

Similarly, this recog_memoized shouldn't be needed.

> +
> +      it->store_saved_value_insn = insn2;
> +    }
> +
> +  if (load_elim)
> +    total_cost -= COSTS_N_INSNS (1);

Why 1, rather than the insn_cost of the load?

> +
> +  if (targetm.avoid_store_forwarding_p)
> +    {
> +      /* If a target-specific hook exists to decide profitability of the
> +	 specific store-forwarding instance then use that as it should be
> +	 the most accurate.  */
> +      if (!targetm.avoid_store_forwarding_p (stores, load_inner, total_cost,
> +					     load_elim))
> +	{
> +	  if (dump_file)
> +	    fprintf (dump_file, "Not transformed due to target decision.\n");
> +
> +	  return false;
> +	}
> +    }
> +  else
> +    {
> +      /* Otherwise use a simple cost heuristic based on
> +	 param_store_forwarding_max_distance.  In general the distance should
> +	 be somewhat correlated to the store forwarding penalty; if the penalty
> +	 is large then it is justified to increase the window size.  Use this
> +	 to reject sequences that are clearly unprofitable.  */
> +      int max_cost = COSTS_N_INSNS (param_store_forwarding_max_distance / 2);
> +      if (total_cost > max_cost)
> +	{
> +	  if (dump_file)
> +	    fprintf (dump_file, "Not transformed due to cost: %d > %d.\n",
> +		     total_cost, max_cost);
> +
> +	  return false;
> +	}

Could you instead make this the default implementation of the hook,
so that targets have the option of falling back to it?

> +    }
> +
> +  /* If we have a move instead of bit insert, it needs to be emitted first in
> +     the resulting sequence.  */
> +  if (move_to_front != -1)
> +    {
> +      store_fwd_info copy = stores[move_to_front];
> +      stores.safe_push (copy);
> +      stores.ordered_remove (move_to_front);
> +    }
> +
> +  if (load_elim)
> +    {
> +      machine_mode outter_mode = GET_MODE (SET_DEST (load));

outer

> +      rtx_code extend = ZERO_EXTEND;
> +      if (outter_mode != load_inner_mode)
> +	extend = GET_CODE (SET_SRC (load));
> +
> +      start_sequence ();
> +      rtx load_value = simplify_gen_unary (extend, outter_mode, dest,
> +					   load_inner_mode);

We should only do this if the modes are different.  And in the case
where they are:

> +      rtx load_move = gen_move_insn (SET_DEST (load), load_value);

...this should be a gen_rtx_SET.

It might be simpler to separate the "equal modes" and "unequal modes"
paths out and emit different sequences.

> +      rtx_insn *insn = emit_insn (load_move);
> +      rtx_insn *seq = get_insns ();
> +      end_sequence ();
> +
> +      if (recog_memoized (insn) < 0)
> +	return false;
> +
> +      df_insn_rescan (emit_insn_after (seq, load_insn));

df_insn_rescan shouldn't be needed when emitting instructions.
Same below.  (The call for load_insn looks good though.)

> +    }
> +
> +  if (dump_file)
> +    {
> +      fprintf (dump_file, "Store forwarding%s avoided with bit inserts:\n",
> +	       (stores.length () > 1) ? "s" : "");
> +
> +      FOR_EACH_VEC_ELT (stores, i, it)
> +	{
> +	  if (stores.length () > 1)
> +	    {
> +	      fprintf (dump_file, "For: ");
> +	      print_rtl_single (dump_file, it->store_insn);
> +	    }
> +
> +	  fprintf (dump_file, "With sequence:\n");
> +
> +	  for (rtx_insn *insn = it->bits_insert_insns; insn;
> +	       insn = NEXT_INSN (insn))
> +	    {
> +	      fprintf (dump_file, "  ");
> +	      print_rtl_single (dump_file, insn);
> +	    }
> +	}
> +    }
> +
> +  stats_sf_avoided++;
> +
> +  /* Done, emit all the generated instructions and delete the stores.  */
> +
> +  FOR_EACH_VEC_ELT (stores, i, it)
> +    df_insn_rescan (emit_insn_after (it->bits_insert_insns, load_insn));
> +
> +  FOR_EACH_VEC_ELT (stores, i, it)
> +    {
> +      df_insn_rescan (emit_insn_before (it->save_store_value_insn,
> +					it->store_insn));
> +      df_insn_rescan (emit_insn_after (it->store_saved_value_insn, load_insn));
> +      set_insn_deleted (it->store_insn);

Any particular reason not to emit the new store immediately after the
corresponding bits_insert_insns? That would lead to shorter live ranges.

> +    }
> +
> +  df_insn_rescan (load_insn);
> +
> +  if (load_elim)
> +    set_insn_deleted (load_insn);

Could you try to delete the insn via delete_insn instead?  There have
been problems in the past with INSN_DELETED notes being left around.

> +
> +  return true;
> +}
> +
> +/* Process BB for expensive store forwardings.  */
> +
> +static void
> +avoid_store_forwarding (basic_block bb)
> +{
> +  auto_vec<store_fwd_info, 8> store_exprs;
> +  rtx_insn *insn;
> +  unsigned int insn_cnt = 0;
> +
> +  FOR_BB_INSNS (bb, insn)
> +    {
> +      if (!NONDEBUG_INSN_P (insn))
> +	continue;
> +
> +      rtx set = single_set (insn);
> +
> +      /* Store forwarding issues are unlikely if we cross a call.
> +	 Clear store forwarding candidates if we can't deal with INSN.  */
> +      if (CALL_P (insn) || !set || volatile_refs_p (set)

Minor, but IMO it'd be more natural to check CALL_P alongside
NONDEBUG_INSN_P, rather than call single_set and then ignore the result.

> +	  || GET_MODE (SET_DEST (set)) == BLKmode
> +	  || (MEM_P (SET_DEST (set)) && MEM_P (SET_SRC (set))))
> +	{
> +	  store_exprs.truncate (0);
> +	  continue;
> +	}
> +
> +      rtx load_mem = get_load_mem (set);
> +      int removed_count = 0;
> +
> +      if (MEM_P (SET_DEST (set)))
> +	{
> +	  /* Record store forwarding candidate.  */
> +	  store_fwd_info info;
> +	  info.store_insn = insn;
> +	  info.store_mem = SET_DEST (set);
> +	  info.insn_cnt = insn_cnt;
> +	  info.remove = false;
> +	  info.forwarded = false;
> +	  store_exprs.safe_push (info);
> +	}
> +      else if (load_mem)
> +	{
> +	  /* Process load for possible store forwardings.  */
> +	  auto_vec<store_fwd_info> forwardings;
> +	  bool partial_forwarding = false;
> +	  bool remove_rest = false;
> +
> +	  unsigned int i;
> +	  store_fwd_info *it;
> +	  FOR_EACH_VEC_ELT_REVERSE (store_exprs, i, it)
> +	    {
> +	      rtx store_mem = it->store_mem;
> +	      HOST_WIDE_INT off_val;
> +
> +	      if (remove_rest)
> +		{
> +		  it->remove = true;
> +		  removed_count++;
> +		}
> +	      else if (is_store_forwarding (store_mem, load_mem, &off_val))
> +		{
> +		  /* Check if moving this store after the load is legal.  */
> +		  bool write_dep = false;
> +		  for (unsigned int j = store_exprs.length () - 1; j != i; j--)
> +		    if (!store_exprs[j].forwarded
> +			&& output_dependence (store_mem,
> +					      store_exprs[j].store_mem))
> +		      {
> +			write_dep = true;
> +			break;
> +		      }
> +
> +		  if (!write_dep)
> +		    {
> +		      it->forwarded = true;
> +		      it->offset = off_val;
> +		      forwardings.safe_push (*it);
> +		    }
> +		  else
> +		    partial_forwarding = true;
> +
> +		  it->remove = true;
> +		  removed_count++;
> +		}
> +	      else if (true_dependence (store_mem, GET_MODE (store_mem),
> +					load_mem))
> +		{
> +		  /* We cannot keep a store forwarding candidate if it possibly
> +		     interferes with this load.  */
> +		  it->remove = true;
> +		  removed_count++;
> +		  remove_rest = true;
> +		}
> +	    }
> +
> +	  if (!forwardings.is_empty () && !partial_forwarding)
> +	    process_forwardings (forwardings, insn);

It doesn't look like this checks whether the load result conflicts
with the memory addresses, e.g. in:

  (set (mem:QI (reg:SI R)) (const_int 0))
  (set (reg:SI R) (mem:SI (reg:SI R)))
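A C analog of that hazard (hypothetical; note that on aarch64 the argument and
return registers coincide, so the load's destination is also the address
register) might look like:

```c
/* Hypothetical analog of the RTL above: the load result reuses the
   register holding the address, so moving the byte store after the load
   would store through a clobbered pointer.  */
void *
load_after_byte_store (void **p)
{
  *(unsigned char *) p = 0;  /* (set (mem:QI R) (const_int 0)) */
  return *p;                 /* (set R (mem:SI R))             */
}
```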

> +	}
> +      else
> +	{
> +	  rtx reg = SET_DEST (set);
> +
> +	  while (GET_CODE (reg) == ZERO_EXTRACT
> +		|| GET_CODE (reg) == STRICT_LOW_PART
> +		|| GET_CODE (reg) == SUBREG)
> +	    reg = XEXP (reg, 0);
> +
> +	  /* Drop store forwarding candidates when the address register is
> +	     overwritten.  */
> +	  if (REG_P (reg))
> +	    {
> +	      bool remove_rest = false;
> +	      unsigned int i;
> +	      store_fwd_info *it;
> +	      FOR_EACH_VEC_ELT_REVERSE (store_exprs, i, it)
> +		{
> +		  if (remove_rest
> +		      || reg_overlap_mentioned_p (reg, it->store_mem))
> +		    {
> +		      it->remove = true;
> +		      removed_count++;
> +		      remove_rest = true;
> +		    }
> +		}
> +	    }
> +	  else
> +	    {
> +	      /* We can't understand INSN.  */
> +	      store_exprs.truncate (0);
> +	      continue;
> +	    }
> +	}
> +
> +      if (removed_count)
> +	{
> +	  unsigned int i, j;
> +	  store_fwd_info *it;
> +	  VEC_ORDERED_REMOVE_IF (store_exprs, i, j, it, it->remove);
> +	}

It might be safer (and more general) to structure the code as follows:

      if (!NONDEBUG_INSN_P (insn))
	continue;

      vec_rtx_properties properties;
      properties.add_insn (insn, false);

      rtx set = single_set (insn);
      bool is_simple = (set
			&& !properties.has_asm
			&& !properties.has_side_effects ());
      bool is_simple_store = (is_simple
			      && MEM_P (SET_DEST (set))
			      && !contains_mem_rtx_p (SET_SRC (set)));
      rtx load_mem = nullptr;
      bool is_simple_load = (is_simple
			     && !contains_mem_rtx_p (SET_DEST (set))
			     && (load_mem = get_load_mem (set)));
      if (is_simple_store)
	...record store...

      bool reads_mem = false;
      bool writes_mem = false;
      for (auto ref : properties.refs ())
        if (ref.is_mem ())
          {
	    reads_mem |= ref.is_read ();
	    writes_mem |= ref.is_write ();
	  }
        else if (ref.is_write ())
	  {
	    // Check for the last store to mention ref.regno (), use
	    // block_remove to remove it and all preceding stores.
	  }

      if (is_simple_load)
	{
	  // Try store forwarding.  Remove the last store that
	  // load_mem might depend on, and all previous stores.
	}

      if ((writes_mem && !is_simple_store)
	  || (reads_mem && !is_simple_load))
	store_exprs.truncate (0);

It doesn't need to be exactly like that, but the idea is to test
directly for all references to memory and registers, without giving
up unnecessarily on (say) const calls or arithmetic double-sets.

I think it would be worth adding a comment that, when forwarding succeeds,
we'll process the newly created/moved stores in subsequent iterations
of the loop.
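To make the "don't give up unnecessarily" point concrete, here is a hypothetical source-level case (not from the patch's testsuite; the union mirrors the testsuite idiom): the intervening arithmetic only touches registers, so ideally it should not flush the candidate store.

```c
#include <stdint.h>

typedef union {
  char arr_8[8];
  int64_t long_value;
} DataUnion;

/* The multiply-add between the byte store and the 8-byte load only touches
   registers (no memory reference), so the store should remain a
   store-forwarding candidate for the load of long_value.  */
int64_t
keep_candidate (DataUnion *data, char x, int64_t y)
{
  data->arr_8[0] = x;
  y = y * 3 + 1;              /* register-only arithmetic */
  return data->long_value + y;
}
```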

> +
> +      /* Don't consider store forwarding if the RTL instruction distance is
> +	 more than PARAM_STORE_FORWARDING_MAX_DISTANCE.  */
> +      if (!store_exprs.is_empty ()
> +	  && (store_exprs[0].insn_cnt
> +	      + param_store_forwarding_max_distance <= insn_cnt))
> +	store_exprs.ordered_remove (0);
> +
> +      insn_cnt++;
> +    }
> +}
> +
> +unsigned int
> +pass_rtl_avoid_store_forwarding::execute (function *fn)
> +{
> +  df_set_flags (DF_DEFER_INSN_RESCAN);
> +
> +  init_alias_analysis ();
> +
> +  stats_sf_detected = 0;
> +  stats_sf_avoided = 0;
> +
> +  basic_block bb;
> +  FOR_EACH_BB_FN (bb, fn)
> +    avoid_store_forwarding (bb);
> +
> +  end_alias_analysis ();
> +
> +  statistics_counter_event (fn, "Store forwardings detected: ",
> +			    stats_sf_detected);
> +  statistics_counter_event (fn, "Store forwardings avoided: ",
> +			    stats_sf_avoided);
> +
> +  return 0;
> +}
> +
> +} // anon namespace.
> +
> +rtl_opt_pass *
> +make_pass_rtl_avoid_store_forwarding (gcc::context *ctxt)
> +{
> +  return new pass_rtl_avoid_store_forwarding (ctxt);
> +}
> diff --git a/gcc/avoid-store-forwarding.h b/gcc/avoid-store-forwarding.h
> new file mode 100644
> index 00000000000..55a0c97f008
> --- /dev/null
> +++ b/gcc/avoid-store-forwarding.h
> @@ -0,0 +1,56 @@
> +/* Avoid store forwarding optimization pass.
> +   Copyright (C) 2024 Free Software Foundation, Inc.
> +   Contributed by VRULL GmbH.
> +
> +   This file is part of GCC.
> +
> +   GCC is free software; you can redistribute it and/or modify it
> +   under the terms of the GNU General Public License as published by
> +   the Free Software Foundation; either version 3, or (at your option)
> +   any later version.
> +
> +   GCC is distributed in the hope that it will be useful, but
> +   WITHOUT ANY WARRANTY; without even the implied warranty of
> +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> +   General Public License for more details.
> +
> +   You should have received a copy of the GNU General Public License
> +   along with GCC; see the file COPYING3.  If not see
> +   <http://www.gnu.org/licenses/>.  */
> +
> +#ifndef GCC_AVOID_STORE_FORWARDING_H
> +#define GCC_AVOID_STORE_FORWARDING_H
> +
> +#include "config.h"
> +#include "system.h"
> +#include "coretypes.h"
> +#include "backend.h"
> +#include "rtl.h"
> +
> +struct store_fwd_info
> +{
> +  /* The store instruction that is a store forwarding candidate.  */
> +  rtx_insn *store_insn;
> +  /* SET_DEST (single_set (store_insn)).  */
> +  rtx store_mem;
> +  /* The temporary that will hold the stored value at the original store
> +     position.  */
> +  rtx mov_reg;
> +  /* The instruction sequence that inserts the stored value's bits at the
> +     appropriate position in the loaded value.  */
> +  rtx_insn *bits_insert_insns;
> +  /* An instruction that saves the store's value in a register temporarily,
> +     (set (reg X) (SET_SRC (store_insn))).  */
> +  rtx_insn *save_store_value_insn;
> +  /* An instruction that stores the saved value back to memory,
> +     (set (SET_DEST (store_insn)) (reg X)).  */
> +  rtx_insn *store_saved_value_insn;
> +  /* The byte offset for the store's position within the load.  */
> +  HOST_WIDE_INT offset;
> +
> +  unsigned int insn_cnt;
> +  bool remove;
> +  bool forwarded;
> +};
> +
> +#endif  /* GCC_AVOID_STORE_FORWARDING_H  */
> diff --git a/gcc/common.opt b/gcc/common.opt
> index ea39f87ae71..7d080fbc513 100644
> --- a/gcc/common.opt
> +++ b/gcc/common.opt
> @@ -1751,6 +1751,10 @@ fgcse-sm
>  Common Var(flag_gcse_sm) Init(0) Optimization
>  Perform store motion after global common subexpression elimination.
>  
> +favoid-store-forwarding
> +Common Var(flag_avoid_store_forwarding) Init(0) Optimization
> +Try to avoid store forwarding.
> +
>  fgcse-las
>  Common Var(flag_gcse_las) Init(0) Optimization
>  Perform redundant load after store elimination in global common subexpression

Remember to regenerate common.opt.urls too. :)

> diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
> index 27539a01785..4f8c145ff1f 100644
> --- a/gcc/doc/invoke.texi
> +++ b/gcc/doc/invoke.texi
> @@ -12797,6 +12797,15 @@ loop unrolling.
>  This option is enabled by default at optimization levels @option{-O1},
>  @option{-O2}, @option{-O3}, @option{-Os}.
>  
> +@opindex favoid-store-forwarding
> +@item -favoid-store-forwarding
> +@itemx -fno-avoid-store-forwarding
> +Many CPUs stall for several cycles when a load partially depends on
> +preceding smaller stores.  This pass tries to detect such cases and avoid
> +the penalty by reordering the load and stores and then fixing up the
> +loaded value.
> +
> +Disabled by default.
> +
>  @opindex ffp-contract
>  @item -ffp-contract=@var{style}
>  @option{-ffp-contract=off} disables floating-point expression contraction.
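As a hedged source-level illustration of what the documented option targets (mirroring the union idiom of the aarch64 tests below): the one-byte store is narrower than the eight-byte load that follows it, which is exactly the pattern many cores cannot forward cheaply.

```c
#include <stdint.h>
#include <string.h>

typedef union {
  char arr_8[8];
  int64_t long_value;
} DataUnion;

/* A small store immediately followed by a larger, overlapping load: without
   -favoid-store-forwarding this compiles to a byte store feeding an 8-byte
   load, a potential store-to-load forwarding stall.  */
int64_t
small_store_large_load (DataUnion *data, char x)
{
  data->arr_8[1] = x;
  return data->long_value;
}
```

(The return value's byte layout is endian-dependent; only the stored byte itself is asserted below.)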
> diff --git a/gcc/doc/passes.texi b/gcc/doc/passes.texi
> index 4ac7a2306a1..639f6b325c8 100644
> --- a/gcc/doc/passes.texi
> +++ b/gcc/doc/passes.texi
> @@ -925,6 +925,14 @@ and addressing mode selection.  The pass is run twice, with values
>  being propagated into loops only on the second run.  The code is
>  located in @file{fwprop.cc}.
>  
> +@item Store forwarding avoidance
> +
> +This pass attempts to reduce the overhead of store to load forwarding.
> +It detects when a load reads from one or more previous smaller stores and
> +then rearranges them so that the stores are done after the load.  The loaded
> +value is adjusted with a series of bit insert instructions so that it stays
> +the same.  The code is located in @file{avoid-store-forwarding.cc}.
> +
>  @item Common subexpression elimination
>  
>  This pass removes redundant computation within basic blocks, and
> diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
> index cc33084ed32..47ada0c6bf5 100644
> --- a/gcc/doc/tm.texi
> +++ b/gcc/doc/tm.texi
> @@ -7348,6 +7348,15 @@ the @code{POLY_VALUE_MIN}, @code{POLY_VALUE_MAX} and
>  implementation returns the lowest possible value of @var{val}.
>  @end deftypefn
>  
> +@deftypefn {Target Hook} bool TARGET_AVOID_STORE_FORWARDING_P (vec<store_fwd_info>, @var{rtx}, @var{int}, @var{bool})
> +This hook, if defined, enables the use of the avoid-store-forwarding pass.

The pass is enabled even if the hook isn't defined.  Maybe just drop
this sentence.

Thanks,
Richard

> +Given a list of stores and a load instruction that reads from the location
> +of the stores, this hook decides if it's profitable to emit additional code
> +to avoid a potential store forwarding stall.  The additional instructions
> +needed, the sequence cost and other relevant information are given in
> +the arguments so that the target can make an informed decision.
> +@end deftypefn
> +
>  @node Scheduling
>  @section Adjusting the Instruction Scheduler
>  
> diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
> index 8af3f414505..73ae1a1b8fe 100644
> --- a/gcc/doc/tm.texi.in
> +++ b/gcc/doc/tm.texi.in
> @@ -4745,6 +4745,8 @@ Define this macro if a non-short-circuit operation produced by
>  
>  @hook TARGET_ESTIMATED_POLY_VALUE
>  
> +@hook TARGET_AVOID_STORE_FORWARDING_P
> +
>  @node Scheduling
>  @section Adjusting the Instruction Scheduler
>  
> diff --git a/gcc/params.opt b/gcc/params.opt
> index c17ba17b91b..84a32b28b4a 100644
> --- a/gcc/params.opt
> +++ b/gcc/params.opt
> @@ -1032,6 +1032,10 @@ Allow the store merging pass to introduce unaligned stores if it is legal to do
>  Common Joined UInteger Var(param_store_merging_max_size) Init(65536) IntegerRange(1, 65536) Param Optimization
>  Maximum size of a single store merging region in bytes.
>  
> +-param=store-forwarding-max-distance=
> +Common Joined UInteger Var(param_store_forwarding_max_distance) Init(10) IntegerRange(1, 1000) Param Optimization
> +Maximum number of instructions between a small store and a larger overlapping load for the store to be considered a forwarding candidate.
> +
>  -param=switch-conversion-max-branch-ratio=
>  Common Joined UInteger Var(param_switch_conversion_branch_ratio) Init(8) IntegerRange(1, 65536) Param Optimization
>  The maximum ratio between array size and switch branches for a switch conversion to take place.
> diff --git a/gcc/passes.def b/gcc/passes.def
> index b06d6d45f63..7f7fd96e336 100644
> --- a/gcc/passes.def
> +++ b/gcc/passes.def
> @@ -506,6 +506,7 @@ along with GCC; see the file COPYING3.  If not see
>        NEXT_PASS (pass_sms);
>        NEXT_PASS (pass_live_range_shrinkage);
>        NEXT_PASS (pass_sched);
> +      NEXT_PASS (pass_rtl_avoid_store_forwarding);
>        NEXT_PASS (pass_early_remat);
>        NEXT_PASS (pass_ira);
>        NEXT_PASS (pass_reload);
> diff --git a/gcc/target.def b/gcc/target.def
> index 1d0ea6f30ca..c90ec9c7edd 100644
> --- a/gcc/target.def
> +++ b/gcc/target.def
> @@ -6959,6 +6959,17 @@ HOOK_VECTOR_END (shrink_wrap)
>  #undef HOOK_PREFIX
>  #define HOOK_PREFIX "TARGET_"
>  
> +DEFHOOK
> +(avoid_store_forwarding_p,
> + "This hook, if defined, enables the use of the avoid-store-forwarding pass.\n\
> +Given a list of stores and a load instruction that reads from the location\n\
> +of the stores, this hook decides if it's profitable to emit additional code\n\
> +to avoid a potential store forwarding stall.  The additional instructions\n\
> +needed, the sequence cost and other relevant information are given in\n\
> +the arguments so that the target can make an informed decision.",
> + bool, (vec<store_fwd_info>, rtx, int, bool),
> + NULL)
> +
>  /* Determine the type of unwind info to emit for debugging.  */
>  DEFHOOK
>  (debug_unwind_info,
> diff --git a/gcc/target.h b/gcc/target.h
> index 837651d273a..2f59f3c6a80 100644
> --- a/gcc/target.h
> +++ b/gcc/target.h
> @@ -165,6 +165,9 @@ class function_arg_info;
>  /* This is defined in function-abi.h.  */
>  class predefined_function_abi;
>  
> +/* This is defined in avoid-store-forwarding.h .  */
> +struct store_fwd_info;
> +
>  /* These are defined in tree-vect-stmts.cc.  */
>  extern tree stmt_vectype (class _stmt_vec_info *);
>  extern bool stmt_in_inner_loop_p (class vec_info *, class _stmt_vec_info *);
> diff --git a/gcc/testsuite/gcc.target/aarch64/avoid-store-forwarding-1.c b/gcc/testsuite/gcc.target/aarch64/avoid-store-forwarding-1.c
> new file mode 100644
> index 00000000000..4712f101d5e
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/avoid-store-forwarding-1.c
> @@ -0,0 +1,28 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -favoid-store-forwarding -fdump-rtl-avoid_store_forwarding" } */
> +
> +typedef union {
> +    char arr_8[8];
> +    long long_value;
> +} DataUnion;
> +
> +long ssll_1 (DataUnion *data, char x)
> +{
> +  data->arr_8[0] = x;
> +  return data->long_value;
> +}
> +
> +long ssll_2 (DataUnion *data, char x)
> +{
> +  data->arr_8[1] = x;
> +  return data->long_value;
> +}
> +
> +long ssll_3 (DataUnion *data, char x)
> +{
> +  data->arr_8[7] = x;
> +  return data->long_value;
> +}
> +
> +/* { dg-final { scan-rtl-dump-times "Store forwarding detected" 3 "avoid_store_forwarding" } } */
> +/* { dg-final { scan-rtl-dump-times "Store forwarding avoided" 3 "avoid_store_forwarding" } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/avoid-store-forwarding-2.c b/gcc/testsuite/gcc.target/aarch64/avoid-store-forwarding-2.c
> new file mode 100644
> index 00000000000..b958612173b
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/avoid-store-forwarding-2.c
> @@ -0,0 +1,39 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -favoid-store-forwarding -fdump-rtl-avoid_store_forwarding" } */
> +
> +typedef union {
> +    char arr_8[8];
> +    int long_value;
> +} DataUnion1;
> +
> +long no_ssll_1 (DataUnion1 *data, char x)
> +{
> +  data->arr_8[4] = x;
> +  return data->long_value;
> +}
> +
> +long no_ssll_2 (DataUnion1 *data, char x)
> +{
> +  data->arr_8[5] = x;
> +  return data->long_value;
> +}
> +
> +typedef union {
> +    char arr_8[8];
> +    short long_value[4];
> +} DataUnion2;
> +
> +long no_ssll_3 (DataUnion2 *data, char x)
> +{
> +  data->arr_8[4] = x;
> +  return data->long_value[1];
> +}
> +
> +long no_ssll_4 (DataUnion2 *data, char x)
> +{
> +  data->arr_8[0] = x;
> +  return data->long_value[1];
> +}
> +
> +/* { dg-final { scan-rtl-dump-times "Store forwarding detected" 0 "avoid_store_forwarding" } } */
> +/* { dg-final { scan-rtl-dump-times "Store forwarding avoided" 0 "avoid_store_forwarding" } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/avoid-store-forwarding-3.c b/gcc/testsuite/gcc.target/aarch64/avoid-store-forwarding-3.c
> new file mode 100644
> index 00000000000..f4814c1a4d2
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/avoid-store-forwarding-3.c
> @@ -0,0 +1,31 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -favoid-store-forwarding -fdump-rtl-avoid_store_forwarding" } */
> +
> +typedef union {
> +    char arr_8[8];
> +    long long_value;
> +} DataUnion;
> +
> +long ssll_multi_1 (DataUnion *data, char x)
> +{
> +  data->arr_8[0] = x;
> +  data->arr_8[2] = x;
> +  return data->long_value;
> +}
> +
> +long ssll_multi_2 (DataUnion *data, char x)
> +{
> +  data->arr_8[0] = x;
> +  data->arr_8[1] = 11;
> +  return data->long_value;
> +}
> +
> +long ssll_multi_3 (DataUnion *data, char x, short y)
> +{
> +  data->arr_8[1] = x;
> +  __builtin_memcpy(data->arr_8 + 4, &y, sizeof(short));
> +  return data->long_value;
> +}
> +
> +/* { dg-final { scan-rtl-dump-times "Store forwardings detected" 3 "avoid_store_forwarding" } } */
> +/* { dg-final { scan-rtl-dump-times "Store forwardings avoided" 3 "avoid_store_forwarding" } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/avoid-store-forwarding-4.c b/gcc/testsuite/gcc.target/aarch64/avoid-store-forwarding-4.c
> new file mode 100644
> index 00000000000..db3c6b6630b
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/avoid-store-forwarding-4.c
> @@ -0,0 +1,23 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -favoid-store-forwarding -fdump-rtl-avoid_store_forwarding" } */
> +
> +typedef int v4si __attribute__ ((vector_size (16)));
> +
> +typedef union {
> +    char arr_16[16];
> +    v4si vec_value;
> +} DataUnion;
> +
> +v4si ssll_vect_1 (DataUnion *data, char x)
> +{
> +  data->arr_16[0] = x;
> +  return data->vec_value;
> +}
> +
> +v4si ssll_vect_2 (DataUnion *data, int x)
> +{
> +  __builtin_memcpy(data->arr_16 + 4, &x, sizeof(int));
> +  return data->vec_value;
> +}
> +
> +/* { dg-final { scan-rtl-dump-times "Store forwarding detected" 2 "avoid_store_forwarding" } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/avoid-store-forwarding-5.c b/gcc/testsuite/gcc.target/aarch64/avoid-store-forwarding-5.c
> new file mode 100644
> index 00000000000..2522df56905
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/avoid-store-forwarding-5.c
> @@ -0,0 +1,38 @@
> +/* { dg-do compile { target int128 } } */
> +/* { dg-options "-O2 -favoid-store-forwarding -fdump-rtl-avoid_store_forwarding" } */
> +
> +typedef float v4f __attribute__ ((vector_size (16)));
> +
> +typedef union {
> +    float arr_2[4];
> +    long long_value;
> +    __int128 longlong_value;
> +    v4f vec_value;
> +} DataUnion;
> +
> +long ssll_load_elim_1 (DataUnion *data, float x)
> +{
> +  data->arr_2[0] = x;
> +  data->arr_2[1] = 0.0f;
> +  return data->long_value;
> +}
> +
> +__int128 ssll_load_elim_2 (DataUnion *data, float x)
> +{
> +  data->arr_2[0] = x;
> +  data->arr_2[1] = 0.0f;
> +  data->arr_2[2] = x;
> +  data->arr_2[3] = 0.0f;
> +  return data->longlong_value;
> +}
> +
> +v4f ssll_load_elim_3 (DataUnion *data, float x)
> +{
> +  data->arr_2[3] = x;
> +  data->arr_2[2] = x;
> +  data->arr_2[1] = x;
> +  data->arr_2[0] = x;
> +  return data->vec_value;
> +}
> +
> +/* { dg-final { scan-rtl-dump-times "Store forwardings detected" 3 "avoid_store_forwarding" } } */
> diff --git a/gcc/tree-pass.h b/gcc/tree-pass.h
> index 3a0cf13089e..75b1d9cbae4 100644
> --- a/gcc/tree-pass.h
> +++ b/gcc/tree-pass.h
> @@ -572,6 +572,7 @@ extern rtl_opt_pass *make_pass_rtl_dse3 (gcc::context *ctxt);
>  extern rtl_opt_pass *make_pass_rtl_cprop (gcc::context *ctxt);
>  extern rtl_opt_pass *make_pass_rtl_pre (gcc::context *ctxt);
>  extern rtl_opt_pass *make_pass_rtl_hoist (gcc::context *ctxt);
> +extern rtl_opt_pass *make_pass_rtl_avoid_store_forwarding (gcc::context *ctxt);
>  extern rtl_opt_pass *make_pass_rtl_store_motion (gcc::context *ctxt);
>  extern rtl_opt_pass *make_pass_cse_after_global_opts (gcc::context *ctxt);
>  extern rtl_opt_pass *make_pass_rtl_ifcvt (gcc::context *ctxt);
Manolis Tsamis Aug. 29, 2024, 1:44 p.m. UTC | #3
Hi Richard,

Thanks a lot for the extensive review and useful suggestions (and
sorry for the late reply).  I have implemented most of them for the next
version, so let me address your individual comments below:

On Fri, Aug 16, 2024 at 5:33 PM Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> Manolis Tsamis <manolis.tsamis@vrull.eu> writes:
> > This pass detects cases of expensive store forwarding and tries to avoid them
> > by reordering the stores and using suitable bit insertion sequences.
> > For example it can transform this:
> >
> >      strb    w2, [x1, 1]
> >      ldr     x0, [x1]      # Expensive store forwarding to larger load.
> >
> > To:
> >
> >      ldr     x0, [x1]
> >      strb    w2, [x1]
> >      bfi     x0, x2, 0, 8
> >
> > Assembly like this can appear with bitfields or type punning / unions.
> > On stress-ng when running the cpu-union microbenchmark the following speedups
> > have been observed.
> >
> >   Neoverse-N1:      +29.4%
> >   Intel Coffeelake: +13.1%
> >   AMD 5950X:        +17.5%
> >
> > gcc/ChangeLog:
> >
> >       * Makefile.in: Add avoid-store-forwarding.o
> >       * common.opt: New option -favoid-store-forwarding.
> >       * doc/invoke.texi: New param store-forwarding-max-distance.
> >       * doc/passes.texi: Document new pass.
> >       * doc/tm.texi: Regenerate.
> >       * doc/tm.texi.in: Document new pass.
> >       * params.opt: New param store-forwarding-max-distance.
> >       * passes.def: Schedule a new pass.
> >       * target.def (HOOK_PREFIX): New target hook avoid_store_forwarding_p.
> >       * target.h (struct store_fwd_info): Declare.
> >       * tree-pass.h (make_pass_rtl_avoid_store_forwarding): Declare.
> >       * avoid-store-forwarding.cc: New file.
> >       * avoid-store-forwarding.h: New file.
> >
> > gcc/testsuite/ChangeLog:
> >
> >       * gcc.target/aarch64/avoid-store-forwarding-1.c: New test.
> >       * gcc.target/aarch64/avoid-store-forwarding-2.c: New test.
> >       * gcc.target/aarch64/avoid-store-forwarding-3.c: New test.
> >       * gcc.target/aarch64/avoid-store-forwarding-4.c: New test.
> >       * gcc.target/aarch64/avoid-store-forwarding-5.c: New test.
> >
> > Signed-off-by: Manolis Tsamis <manolis.tsamis@vrull.eu>
> > ---
> >
> > Changes in v5:
> >         - Fix bug with BIG_ENDIAN targets.
> >         - Fix bug with unrecognized instructions.
> >         - Fix / simplify pass init/fini.
> >
> > Changes in v4:
> >         - Change pass scheduling to run after sched1.
> >         - Add target hook to decide whether a store forwarding instance
> >         should be avoided or not.
> >         - Fix bugs.
> >
> > Changes in v3:
> >         - Only emit SUBREG after calling validate_subreg.
> >         - Fix memory corruption due to vec self-reference.
> >         - Fix bitmap_bit_in_range_p ICE due to BLKMode.
> >         - Reject MEM to MEM sets.
> >         - Add get_load_mem comment.
> >         - Add new testcase.
> >
> > Changes in v2:
> >         - Allow modes that are not scalar_int_mode.
> >         - Introduce simple costing to avoid unprofitable transformations.
> >         - Reject bit insert sequences that spill to memory.
> >         - Document new pass.
> >         - Fix and add testcases.
> >
> >  gcc/Makefile.in                               |   1 +
> >  gcc/avoid-store-forwarding.cc                 | 616 ++++++++++++++++++
> >  gcc/avoid-store-forwarding.h                  |  56 ++
> >  gcc/common.opt                                |   4 +
> >  gcc/doc/invoke.texi                           |   9 +
> >  gcc/doc/passes.texi                           |   8 +
> >  gcc/doc/tm.texi                               |   9 +
> >  gcc/doc/tm.texi.in                            |   2 +
> >  gcc/params.opt                                |   4 +
> >  gcc/passes.def                                |   1 +
> >  gcc/target.def                                |  11 +
> >  gcc/target.h                                  |   3 +
> >  .../aarch64/avoid-store-forwarding-1.c        |  28 +
> >  .../aarch64/avoid-store-forwarding-2.c        |  39 ++
> >  .../aarch64/avoid-store-forwarding-3.c        |  31 +
> >  .../aarch64/avoid-store-forwarding-4.c        |  23 +
> >  .../aarch64/avoid-store-forwarding-5.c        |  38 ++
> >  gcc/tree-pass.h                               |   1 +
> >  18 files changed, 884 insertions(+)
> >  create mode 100644 gcc/avoid-store-forwarding.cc
> >  create mode 100644 gcc/avoid-store-forwarding.h
> >  create mode 100644 gcc/testsuite/gcc.target/aarch64/avoid-store-forwarding-1.c
> >  create mode 100644 gcc/testsuite/gcc.target/aarch64/avoid-store-forwarding-2.c
> >  create mode 100644 gcc/testsuite/gcc.target/aarch64/avoid-store-forwarding-3.c
> >  create mode 100644 gcc/testsuite/gcc.target/aarch64/avoid-store-forwarding-4.c
> >  create mode 100644 gcc/testsuite/gcc.target/aarch64/avoid-store-forwarding-5.c
> >
> > diff --git a/gcc/Makefile.in b/gcc/Makefile.in
> > index 8fba8f7db6a..43675288399 100644
> > --- a/gcc/Makefile.in
> > +++ b/gcc/Makefile.in
> > @@ -1682,6 +1682,7 @@ OBJS = \
> >       statistics.o \
> >       stmt.o \
> >       stor-layout.o \
> > +     avoid-store-forwarding.o \
> >       store-motion.o \
> >       streamer-hooks.o \
> >       stringpool.o \
> > diff --git a/gcc/avoid-store-forwarding.cc b/gcc/avoid-store-forwarding.cc
> > new file mode 100644
> > index 00000000000..4a0343c0314
> > --- /dev/null
> > +++ b/gcc/avoid-store-forwarding.cc
> > @@ -0,0 +1,616 @@
> > +/* Avoid store forwarding optimization pass.
> > +   Copyright (C) 2024 Free Software Foundation, Inc.
> > +   Contributed by VRULL GmbH.
> > +
> > +   This file is part of GCC.
> > +
> > +   GCC is free software; you can redistribute it and/or modify it
> > +   under the terms of the GNU General Public License as published by
> > +   the Free Software Foundation; either version 3, or (at your option)
> > +   any later version.
> > +
> > +   GCC is distributed in the hope that it will be useful, but
> > +   WITHOUT ANY WARRANTY; without even the implied warranty of
> > +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> > +   General Public License for more details.
> > +
> > +   You should have received a copy of the GNU General Public License
> > +   along with GCC; see the file COPYING3.  If not see
> > +   <http://www.gnu.org/licenses/>.  */
> > +
> > +#include "avoid-store-forwarding.h"
> > +#include "config.h"
> > +#include "system.h"
> > +#include "coretypes.h"
> > +#include "backend.h"
> > +#include "target.h"
> > +#include "rtl.h"
> > +#include "alias.h"
> > +#include "rtlanal.h"
> > +#include "tree-pass.h"
> > +#include "cselib.h"
> > +#include "predict.h"
> > +#include "insn-config.h"
> > +#include "expmed.h"
> > +#include "recog.h"
> > +#include "regset.h"
> > +#include "df.h"
> > +#include "expr.h"
> > +#include "memmodel.h"
> > +#include "emit-rtl.h"
> > +#include "vec.h"
> > +
> > +/* This pass tries to detect and avoid cases of store forwarding.
> > +   On many processors there is a large penalty when smaller stores are
> > +   forwarded to larger loads.  The idea used to avoid the stall is to move
> > +   the store after the load and in addition emit a bit insert sequence so
> > +   the load register has the correct value.  For example the following:
> > +
> > +     strb    w2, [x1, 1]
> > +     ldr     x0, [x1]
> > +
> > +   Will be transformed to:
> > +
> > +     ldr     x0, [x1]
> > +     strb    w2, [x1]
> > +     bfi     x0, x2, 0, 8
> > +*/
> > +
> > +namespace {
> > +
> > +const pass_data pass_data_avoid_store_forwarding =
> > +{
> > +  RTL_PASS, /* type.  */
> > +  "avoid_store_forwarding", /* name.  */
> > +  OPTGROUP_NONE, /* optinfo_flags.  */
> > +  TV_NONE, /* tv_id.  */
> > +  0, /* properties_required.  */
> > +  0, /* properties_provided.  */
> > +  0, /* properties_destroyed.  */
> > +  0, /* todo_flags_start.  */
> > +  TODO_df_finish /* todo_flags_finish.  */
> > +};
> > +
> > +class pass_rtl_avoid_store_forwarding : public rtl_opt_pass
> > +{
> > +public:
> > +  pass_rtl_avoid_store_forwarding (gcc::context *ctxt)
> > +    : rtl_opt_pass (pass_data_avoid_store_forwarding, ctxt)
> > +  {}
> > +
> > +  /* opt_pass methods: */
> > +  virtual bool gate (function *)
> > +    {
> > +      return flag_avoid_store_forwarding && optimize >= 1
> > +          && optimize_insn_for_speed_p ();
>
> I think this should be optimize_function_for_speed_p, since we have no
> contextual information for crtl->maybe_hot_insn_p.  But if possible, it'd
> be better if we could drop this and instead check optimize_bb_for_speed_p
> for each block.
>

Yes, my mistake. Changed to optimize_bb_for_speed_p (and moved to
avoid_store_forwarding).

> > +    }
> > +
> > +  virtual unsigned int execute (function *) override;
> > +}; // class pass_rtl_avoid_store_forwarding
> > +
> > +static unsigned int stats_sf_detected = 0;
> > +static unsigned int stats_sf_avoided = 0;
>
> Could you instead structure the code as a class with these as member
> variables?  I realise it's a bit over-the-top for just 2 variables,
> but the pass might be expanded in future.
>
I could do that, but it would require making most of the functions
member functions (the counters are incremented in
process_store_forwarding), and from looking at other passes/GCC code
this is almost never done.  So I felt that it would be a bit
unconventional and I haven't implemented it yet.  Thoughts on that?

> > +
> > +/* If expr is a SET that reads from memory then return the RTX for its MEM
> > +   argument, otherwise return NULL.  Allows MEM to be zero/sign extended.  */
> > +
> > +static rtx
> > +get_load_mem (rtx expr)
> > +{
> > +  if (!expr)
> > +    return NULL_RTX;
>
> Could you check this in the caller instead?  It seems an odd interface
> that expr can be null, but if nonnull, must be known to be a SET.
>
After implementing your suggestions I ended up removing this function
altogether.  With a load_mem argument added to the process function,
only a single caller is left and the check can be inlined nicely.
Thanks.

> > +
> > +  rtx mem = SET_SRC (expr);
> > +
> > +  if (GET_CODE (mem) == ZERO_EXTEND
> > +      || GET_CODE (mem) == SIGN_EXTEND)
> > +    mem = XEXP (mem, 0);
> > +
> > +  if (MEM_P (mem))
> > +    return mem;
> > +  else
> > +    return NULL_RTX;
> > +}
> > +
> > +/* Return true iff a store to STORE_MEM would write to a sub-region of bytes
> > +   from what LOAD_MEM would read.  If true also store the relative byte offset
> > +   of the store within the load to OFF_VAL.  */
> > +
> > +static bool
> > +is_store_forwarding (rtx store_mem, rtx load_mem, HOST_WIDE_INT *off_val)
> > +{
> > +  if (known_ge (GET_MODE_SIZE (GET_MODE (store_mem)),
> > +             GET_MODE_SIZE (GET_MODE (load_mem))))
> > +    return false;
> > +
> > +  rtx off = simplify_gen_binary (MINUS, GET_MODE (XEXP (store_mem, 0)),
> > +                              XEXP (store_mem, 0), XEXP (load_mem, 0));
> > +
> > +  if (CONST_INT_P (off))
> > +    {
> > +      *off_val = INTVAL (off);
> > +      poly_uint64 store_mode_size = GET_MODE_SIZE (GET_MODE (store_mem));
> > +      poly_uint64 load_mode_size = GET_MODE_SIZE (GET_MODE (load_mem));
> > +      unsigned HOST_WIDE_INT store_size, load_size;
> > +      if (store_mode_size.is_constant (&store_size)
> > +       && load_mode_size.is_constant (&load_size))
> > +     {
> > +       gcc_checking_assert (store_size > 0 && load_size > 0);
> > +       return *off_val >= 0
> > +              && (store_size + *off_val <= load_size);
> > +     }
> > +    }
> > +
> > +  return false;
>
> A more direct way of writing this is:
>
>   poly_int64 load_offset, store_offset;
>   rtx load_base = strip_offset (XEXP (load_mem, 0), &load_offset);
>   rtx store_base = strip_offset (XEXP (store_mem, 0), &store_offset);
>   return (MEM_SIZE (load_mem).is_constant ()
>           && rtx_equal_p (load_base, store_base)
>           && known_subrange_p (store_offset, MEM_SIZE (store_mem),
>                                load_offset, MEM_SIZE (load_mem))
>           && (store_offset - load_offset).is_constant (off_val));
>
> which avoids the need for simplify_gen_binary (and thus unnecessary rtxes).
>
Thanks, I like this version much more. Done.
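For reference, the condition being computed can be modeled with plain integers (the names below are illustrative, not GCC's poly-int API): the store's byte range must lie wholly within the load's byte range, measured from a common base.

```c
#include <stdbool.h>
#include <stdint.h>

/* Models known_subrange_p for constant offsets and sizes: true iff
   [off1, off1 + size1) lies inside [off2, off2 + size2).  The pass
   additionally rejects stores at least as large as the load.  */
static bool
subrange_p (int64_t off1, uint64_t size1, int64_t off2, uint64_t size2)
{
  return size1 <= size2
	 && off1 >= off2
	 && (uint64_t) (off1 - off2) <= size2 - size1;
}
```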

> This doesn't enforce that the load and store have constant size,
> but IMO that's more naturally done when detecting loads and stores.
>
Indeed and also done.

> > +}
> > +
> > +/* Return a bit insertion sequence that would make DEST have the correct value
> > +   if the store represented by STORE_INFO were to be moved after DEST.  */
> > +
> > +static rtx_insn *
> > +generate_bit_insert_sequence (store_fwd_info *store_info, rtx dest,
> > +                           machine_mode load_inner_mode)
>
> As Jeff says, load_inner_mode appears to be unused.
>
> > +{
> > +  poly_uint64 store_mode_size
> > +    = GET_MODE_SIZE (GET_MODE (store_info->store_mem));
> > +  unsigned HOST_WIDE_INT store_size = store_mode_size.to_constant ();
>
> Uses of to_constant should have a comment saying where is_constant
> is enforced.  Using MEM_SIZE is a bit shorter than checking the size
> of the mode (and more general, although that's not helpful in this
> context).
>
Ok.

> > +
> > +  start_sequence ();
> > +  store_bit_field (dest, store_size * BITS_PER_UNIT,
> > +                store_info->offset * BITS_PER_UNIT, 0, 0,
> > +                GET_MODE (dest), store_info->mov_reg,
> > +                false, false);
> > +
> > +  rtx_insn *insns = get_insns ();
> > +  unshare_all_rtl_in_chain (insns);
> > +  end_sequence ();
> > +
> > +  for (rtx_insn *insn = insns; insn; insn = NEXT_INSN (insn))
> > +    if (contains_mem_rtx_p (PATTERN (insn))
> > +     || recog_memoized (insn) < 0)
>
> When does recog_memoized (insn) < 0 occur?  I'd expect only valid
> instructions to be produced.  (Checking for MEMs is ok though.)
>
> > +      return NULL;
> > +
> > +  return insns;
> > +}
> > +
> > +/* Given a list of small stores that are forwarded to LOAD_INSN, try to
> > +   rearrange them so that a store-forwarding penalty doesn't occur.  */
>
> I think this should say that STORES are in reverse program order
> (if I've read the code correctly).
>
> I was surprised that we emitted the bit insertions for the last store first,
> and only realised later that that order was reversing a previous reverse.
>
Yes, that can definitely be a confusing part; I've added two comments,
one in the function and one when we emit these.
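For reference, a C model of what each emitted bit insertion computes on the loaded value (this mirrors the bfi in the motivating example; it is not GCC's store_bit_field interface). One such insert is emitted per forwarded store, at the store's byte offset within the load.

```c
#include <stdint.h>

/* Insert the low BIT_SIZE bits of VALUE into DEST starting at bit BIT_OFF,
   leaving all other bits of DEST unchanged.  BIT_SIZE is assumed to be in
   the range 1..63 for this sketch.  */
static uint64_t
bit_insert (uint64_t dest, uint64_t value, unsigned bit_off,
	    unsigned bit_size)
{
  uint64_t mask = ((1ULL << bit_size) - 1) << bit_off;
  return (dest & ~mask) | ((value << bit_off) & mask);
}
```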

> > +
> > +static bool
> > +process_forwardings (vec<store_fwd_info> &stores, rtx_insn *load_insn)
> > +{
> > +  rtx load = single_set (load_insn);
> > +  rtx load_inner = get_load_mem (load);
> > +  machine_mode load_inner_mode = GET_MODE (load_inner);
> > +  poly_uint64 load_mode_size = GET_MODE_SIZE (load_inner_mode);
> > +  HOST_WIDE_INT load_size = load_mode_size.to_constant ();
> > +
> > +  /* If the stores cover all the bytes of the load without overlap then we can
> > +     eliminate the load entirely and use the computed value instead.  */
> > +
> > +  sbitmap forwarded_bytes = sbitmap_alloc (load_size);
> > +  bitmap_clear (forwarded_bytes);
> > +
> > +  unsigned int i;
> > +  store_fwd_info* it;
> > +  FOR_EACH_VEC_ELT (stores, i, it)
> > +    {
> > +      poly_uint64 store_mode_size = GET_MODE_SIZE (GET_MODE (it->store_mem));
> > +      HOST_WIDE_INT store_size = store_mode_size.to_constant ();
> > +      if (bitmap_bit_in_range_p (forwarded_bytes, it->offset,
> > +                              it->offset + store_size - 1))
> > +     break;
> > +      bitmap_set_range (forwarded_bytes, it->offset, store_size);
> > +    }
> > +
> > +  bitmap_not (forwarded_bytes, forwarded_bytes);
> > +  bool load_elim = bitmap_empty_p (forwarded_bytes);
> > +
> > +  stats_sf_detected++;
> > +
> > +  if (dump_file)
> > +    {
> > +      fprintf (dump_file, "Store forwarding%s detected:\n",
> > +            (stores.length () > 1) ? "s" : "");
>
> Very minor, but "Store forwarding" sounds more natural to me even
> when there are multiple stores involved.
>
Fair. I changed the dump messages and most uses of "forwardings" (e.g.
"store forwarding cases" instead of "store forwardings" in the contexts
where that read better).

> > +
> > +      FOR_EACH_VEC_ELT (stores, i, it)
> > +     {
> > +       fprintf (dump_file, "From: ");
> > +       print_rtl_single (dump_file, it->store_insn);
> > +     }
> > +
> > +      fprintf (dump_file, "To: ");
> > +      print_rtl_single (dump_file, load_insn);
> > +
> > +      if (load_elim)
> > +     fprintf (dump_file, "(Load elimination candidate)\n");
> > +    }
> > +
> > +  rtx dest;
> > +  if (load_elim)
> > +    dest = gen_reg_rtx (load_inner_mode);
> > +  else
> > +    dest = SET_DEST (load);
> > +
> > +  int move_to_front = -1;
> > +  int total_cost = 0;
> > +
> > +  /* Check if we can emit bit insert instructions for all forwarded stores.  */
> > +  FOR_EACH_VEC_ELT (stores, i, it)
> > +    {
> > +      it->mov_reg = gen_reg_rtx (GET_MODE (it->store_mem));
> > +      rtx_insn *insns = NULL;
> > +
> > +      /* If we're eliminating the load then find the store with zero offset
> > +      and use it as the base register to avoid a bit insert if possible.  */
> > +      if (load_elim && it->offset == 0
> > +       && validate_subreg (GET_MODE (dest), GET_MODE (it->mov_reg),
> > +                           it->mov_reg, 0))
> > +     {
> > +       start_sequence ();
> > +
> > +       /* We can use a paradoxical subreg to force this to a wider mode, as
> > +          the only use will be inserting the bits (i.e., we don't care about
> > +          the value of the higher bits).  */
> > +       rtx ext0 = gen_rtx_SUBREG (GET_MODE (dest), it->mov_reg, 0);
>
> IMO it'd be safer to call lowpart_subreg and check whether the result
> is null.  This would also remove the need to check validate_subreg
> directly.
>
I did change this to lowpart_subreg, but note that validate_subreg
still appears to be necessary (for AArch64). Otherwise I get various
ICEs, e.g. when an SF -> TI case is hit, which was the original reason
for validate_subreg.
I could use some help with that, because after the new changes a
subreg-related ICE also happens within store_bit_field when a DI ->
V4SI case is hit. AFAIK store_bit_field should just return NULL if it
can't handle something, so I don't really know how to address this ICE
in the new version.

> > +       rtx_insn *move0 = emit_move_insn (dest, ext0);
> > +       if (recog_memoized (move0) >= 0)
> > +         {
> > +           insns = get_insns ();
> > +           move_to_front = (int) i;
> > +         }
>
> Here too I'd expect emit_move_insn to produce only valid insns.
> (And it can produce multiple insns.)
>
The original implementation emitted simple mov instructions without
recog, but each of these recog calls was then gradually added to fix
ICEs on various more obscure targets that Jeff helped find (some even
due to "simple" moves).
So, although I don't like it and it complicates things, I had to add
recog_memoized pretty much everywhere and also make sure to emit only
if all of these checks succeed...

> > +
> > +       end_sequence ();
> > +     }
> > +
> > +      if (!insns)
> > +     insns = generate_bit_insert_sequence (&(*it), dest, load_inner_mode);
> > +
> > +      if (!insns)
> > +     {
> > +       if (dump_file)
> > +         {
> > +           fprintf (dump_file, "Failed due to: ");
> > +           print_rtl_single (dump_file, it->store_insn);
> > +         }
> > +       return false;
> > +     }
> > +
> > +      total_cost += seq_cost (insns, true);
> > +      it->bits_insert_insns = insns;
> > +
> > +      rtx store_set = single_set (it->store_insn);
> > +
> > +      /* Create a register move at the store's original position to save the
> > +      stored value.  */
> > +      start_sequence ();
> > +      rtx mov1 = gen_move_insn (it->mov_reg, SET_SRC (store_set));
>
> Since SET_SRC is a general expression, we should just try to emit it
> as a single set, using gen_rtx_SET rather than emit_move_insn.
>
Ok, done.

> > +      rtx_insn *insn1 = emit_insn (mov1);
> > +      end_sequence ();
> > +
> > +      if (recog_memoized (insn1) < 0)
> > +     {
> > +       if (dump_file)
> > +         {
> > +           fprintf (dump_file, "Failed due to unrecognizable insn: ");
> > +           print_rtl_single (dump_file, insn1);
> > +         }
> > +       return false;
> > +     }
> > +
> > +      it->save_store_value_insn = insn1;
> > +
> > +      /* Create a new store after the load with the saved original value.
> > +      This avoids the forwarding stall.  */
> > +      start_sequence ();
> > +      rtx mov2 = gen_move_insn (SET_DEST (store_set), it->mov_reg);
> > +      rtx_insn *insn2 = emit_insn (mov2);
> > +      end_sequence ();
> > +
> > +      if (recog_memoized (insn2) < 0)
> > +     {
> > +       if (dump_file)
> > +         {
> > +           fprintf (dump_file, "Failed due to unrecognizable insn: ");
> > +           print_rtl_single (dump_file, insn2);
> > +         }
> > +       return false;
> > +     }
>
> Similarly, this recog_memoized shouldn't be needed.
>
See above.

> > +
> > +      it->store_saved_value_insn = insn2;
> > +    }
> > +
> > +  if (load_elim)
> > +    total_cost -= COSTS_N_INSNS (1);
>
> Why 1, rather than the insn_cost of the load?
>
Good question, fixed now.

> > +
> > +  if (targetm.avoid_store_forwarding_p)
> > +    {
> > +      /* If a target-specific hook exists to decide profitability of the
> > +      specific store-forwarding instance then use that as it should be
> > +      the most accurate.  */
> > +      if (!targetm.avoid_store_forwarding_p (stores, load_inner, total_cost,
> > +                                          load_elim))
> > +     {
> > +       if (dump_file)
> > +         fprintf (dump_file, "Not transformed due to target decision.\n");
> > +
> > +       return false;
> > +     }
> > +    }
> > +  else
> > +    {
> > +      /* Otherwise use a simple cost heurstic base on
> > +      param_store_forwarding_max_distance.  In general the distance should
> > +      be somewhat correlated to the store forwarding penalty; if the penalty
> > +      is large then it is justified to increase the window size.  Use this
> > +      to reject sequences that are clearly unprofitable.  */
> > +      int max_cost = COSTS_N_INSNS (param_store_forwarding_max_distance / 2);
> > +      if (total_cost > max_cost)
> > +     {
> > +       if (dump_file)
> > +         fprintf (dump_file, "Not transformed due to cost: %d > %d.\n",
> > +                  total_cost, max_cost);
> > +
> > +       return false;
> > +     }
>
> Could you instead make this the default implementation of the hook,
> so that targets have the option of falling back to it?
>
> > +    }
> > +
> > +  /* If we have a move instead of bit insert, it needs to be emitted first in
> > +     the resulting sequence.  */
> > +  if (move_to_front != -1)
> > +    {
> > +      store_fwd_info copy = stores[move_to_front];
> > +      stores.safe_push (copy);
> > +      stores.ordered_remove (move_to_front);
> > +    }
> > +
> > +  if (load_elim)
> > +    {
> > +      machine_mode outter_mode = GET_MODE (SET_DEST (load));
>
> outer
>

Fixed.

> > +      rtx_code extend = ZERO_EXTEND;
> > +      if (outter_mode != load_inner_mode)
> > +     extend = GET_CODE (SET_SRC (load));
> > +
> > +      start_sequence ();
> > +      rtx load_value = simplify_gen_unary (extend, outter_mode, dest,
> > +                                        load_inner_mode);
>
> We should only do this if the modes are different.  And in the case
> where they are:
>
> > +      rtx load_move = gen_move_insn (SET_DEST (load), load_value);
>
> ...this should be a gen_rtx_SET.
>
> It might be simpler to separate the "equal modes" and "unequal modes"
> paths out and emit different sequences.
>
Haven't implemented that yet, but I'm looking into it.

> > +      rtx_insn *insn = emit_insn (load_move);
> > +      rtx_insn *seq = get_insns ();
> > +      end_sequence ();
> > +
> > +      if (recog_memoized (insn) < 0)
> > +     return false;
> > +
> > +      df_insn_rescan (emit_insn_after (seq, load_insn));
>
> df_insn_rescan shouldn't be needed when emitting instructions.
> Same below.  (The call for load_insn looks good though.)
>
Ok, I wasn't aware, done.

> > +    }
> > +
> > +  if (dump_file)
> > +    {
> > +      fprintf (dump_file, "Store forwarding%s avoided with bit inserts:\n",
> > +            (stores.length () > 1) ? "s" : "");
> > +
> > +      FOR_EACH_VEC_ELT (stores, i, it)
> > +     {
> > +       if (stores.length () > 1)
> > +         {
> > +           fprintf (dump_file, "For: ");
> > +           print_rtl_single (dump_file, it->store_insn);
> > +         }
> > +
> > +       fprintf (dump_file, "With sequence:\n");
> > +
> > +       for (rtx_insn *insn = it->bits_insert_insns; insn;
> > +            insn = NEXT_INSN (insn))
> > +         {
> > +           fprintf (dump_file, "  ");
> > +           print_rtl_single (dump_file, insn);
> > +         }
> > +     }
> > +    }
> > +
> > +  stats_sf_avoided++;
> > +
> > +  /* Done, emit all the generated instructions and delete the stores.  */
> > +
> > +  FOR_EACH_VEC_ELT (stores, i, it)
> > +    df_insn_rescan (emit_insn_after (it->bits_insert_insns, load_insn));
> > +
> > +  FOR_EACH_VEC_ELT (stores, i, it)
> > +    {
> > +      df_insn_rescan (emit_insn_before (it->save_store_value_insn,
> > +                                     it->store_insn));
> > +      df_insn_rescan (emit_insn_after (it->store_saved_value_insn, load_insn));
> > +      set_insn_deleted (it->store_insn);
>
> Any particular reason not to emit the new store immediately after the
> corresponding bits_insert_insns?  That would lead to shorter live ranges.
>
Hmm, good question. I don't remember, maybe to keep all the stores
together? In any case I did change it.

> > +    }
> > +
> > +  df_insn_rescan (load_insn);
> > +
> > +  if (load_elim)
> > +    set_insn_deleted (load_insn);
>
> Could you try to delete the insn via delete_insn instead?  There have
> been problems in the past with INSN_DELETED notes being left around.
>
Done.

> > +
> > +  return true;
> > +}
> > +
> > +/* Process BB for expensive store forwardings.  */
> > +
> > +static void
> > +avoid_store_forwarding (basic_block bb)
> > +{
> > +  auto_vec<store_fwd_info, 8> store_exprs;
> > +  rtx_insn *insn;
> > +  unsigned int insn_cnt = 0;
> > +
> > +  FOR_BB_INSNS (bb, insn)
> > +    {
> > +      if (!NONDEBUG_INSN_P (insn))
> > +     continue;
> > +
> > +      rtx set = single_set (insn);
> > +
> > +      /* Store forwarding issues are unlikely if we cross a call.
> > +      Clear store forwarding candidates if we can't deal with INSN.  */
> > +      if (CALL_P (insn) || !set || volatile_refs_p (set)
>
> Minor, but IMO it'd more natural to check CALL_P alongside
> NONDEBUG_INSN_P, rather than call single_set and then ignore the result.
>

NONDEBUG_INSN_P results in a continue statement, whereas CALL_P clears
the forwarding candidates; that's why they're not together. I think
that makes sense, so I kept it as is.

> > +       || GET_MODE (SET_DEST (set)) == BLKmode
> > +       || (MEM_P (SET_DEST (set)) && MEM_P (SET_SRC (set))))
> > +     {
> > +       store_exprs.truncate (0);
> > +       continue;
> > +     }
> > +
> > +      rtx load_mem = get_load_mem (set);
> > +      int removed_count = 0;
> > +
> > +      if (MEM_P (SET_DEST (set)))
> > +     {
> > +       /* Record store forwarding candidate.  */
> > +       store_fwd_info info;
> > +       info.store_insn = insn;
> > +       info.store_mem = SET_DEST (set);
> > +       info.insn_cnt = insn_cnt;
> > +       info.remove = false;
> > +       info.forwarded = false;
> > +       store_exprs.safe_push (info);
> > +     }
> > +      else if (load_mem)
> > +     {
> > +       /* Process load for possible store forwardings.  */
> > +       auto_vec<store_fwd_info> forwardings;
> > +       bool partial_forwarding = false;
> > +       bool remove_rest = false;
> > +
> > +       unsigned int i;
> > +       store_fwd_info *it;
> > +       FOR_EACH_VEC_ELT_REVERSE (store_exprs, i, it)
> > +         {
> > +           rtx store_mem = it->store_mem;
> > +           HOST_WIDE_INT off_val;
> > +
> > +           if (remove_rest)
> > +             {
> > +               it->remove = true;
> > +               removed_count++;
> > +             }
> > +           else if (is_store_forwarding (store_mem, load_mem, &off_val))
> > +             {
> > +               /* Check if moving this store after the load is legal.  */
> > +               bool write_dep = false;
> > +               for (unsigned int j = store_exprs.length () - 1; j != i; j--)
> > +                 if (!store_exprs[j].forwarded
> > +                     && output_dependence (store_mem,
> > +                                           store_exprs[j].store_mem))
> > +                   {
> > +                     write_dep = true;
> > +                     break;
> > +                   }
> > +
> > +               if (!write_dep)
> > +                 {
> > +                   it->forwarded = true;
> > +                   it->offset = off_val;
> > +                   forwardings.safe_push (*it);
> > +                 }
> > +               else
> > +                 partial_forwarding = true;
> > +
> > +               it->remove = true;
> > +               removed_count++;
> > +             }
> > +           else if (true_dependence (store_mem, GET_MODE (store_mem),
> > +                                     load_mem))
> > +             {
> > +               /* We cannot keep a store forwarding candidate if it possibly
> > +                  interferes with this load.  */
> > +               it->remove = true;
> > +               removed_count++;
> > +               remove_rest = true;
> > +             }
> > +         }
> > +
> > +       if (!forwardings.is_empty () && !partial_forwarding)
> > +         process_forwardings (forwardings, insn);
>
> It doesn't look like this checks whether the load result conflicts
> with the memory addresses, e.g. in:
>
>   (set (mem:QI (reg:SI R)) (const_int 0))
>   (set (reg:SI R) (mem:SI (reg:SI R)))
>
Thanks! Added a new check for that with reg_overlap_mentioned_p.

> > +     }
> > +      else
> > +     {
> > +       rtx reg = SET_DEST (set);
> > +
> > +       while (GET_CODE (reg) == ZERO_EXTRACT
> > +             || GET_CODE (reg) == STRICT_LOW_PART
> > +             || GET_CODE (reg) == SUBREG)
> > +         reg = XEXP (reg, 0);
> > +
> > +       /* Drop store forwarding candidates when the address register is
> > +          overwritten.  */
> > +       if (REG_P (reg))
> > +         {
> > +           bool remove_rest = false;
> > +           unsigned int i;
> > +           store_fwd_info *it;
> > +           FOR_EACH_VEC_ELT_REVERSE (store_exprs, i, it)
> > +             {
> > +               if (remove_rest
> > +                   || reg_overlap_mentioned_p (reg, it->store_mem))
> > +                 {
> > +                   it->remove = true;
> > +                   removed_count++;
> > +                   remove_rest = true;
> > +                 }
> > +             }
> > +         }
> > +       else
> > +         {
> > +           /* We can't understand INSN.  */
> > +           store_exprs.truncate (0);
> > +           continue;
> > +         }
> > +     }
> > +
> > +      if (removed_count)
> > +     {
> > +       unsigned int i, j;
> > +       store_fwd_info *it;
> > +       VEC_ORDERED_REMOVE_IF (store_exprs, i, j, it, it->remove);
> > +     }
>
> It might be safer (and more general) to structure the code as follows:
>
>       if (!NONDEBUG_INSN_P (insn))
>         continue;
>
>       vec_rtx_properties properties;
>       properties.add_insn (insn, false);
>
>       rtx set = single_set (insn);
>       bool is_simple = (set
>                         && !properties.has_asm
>                         && !properties.has_side_effects ());
>       bool is_simple_store = (is_simple
>                               && MEM_P (SET_DEST (set))
>                               && !contains_mem_rtx_p (SET_SRC (set)));
>       rtx load_mem = nullptr;
>       bool is_simple_load = (is_simple
>                              && !contains_mem_rtx_p (SET_DEST (set))
>                              && (load_mem = get_load_mem (set)));
>       if (is_simple_store)
>         ...record store...
>
>       bool reads_mem = false;
>       bool writes_mem = false;
>       for (auto ref : properties.refs ())
>         if (ref.is_mem ())
>           {
>             reads_mem |= ref.is_read ();
>             writes_mem |= ref.is_write ();
>           }
>         else if (ref.is_write ())
>           {
>             // Check for the last store to mention ref.regno (), use
>             // block_remove to remove it and all preceding stores.
>           }
>
>       if (is_simple_load)
>         {
>           // Try store forwarding.  Remove the last store that
>           // load_mem might depend on, and all previous stores.
>         }
>
>       if ((writes_mem && !is_simple_store)
>           || (reads_mem && !is_simple_load))
>         store_exprs.truncate (0);
>
> It doesn't need to be exactly like that, but the idea is to test
> directly for all references to memory and registers, without giving
> up unnecessarily on (say) const calls or arithmetic double-sets.
>
Hmm, I have restructured the code based on your other feedback and it
indeed looks more like this (except for not using vec_rtx_properties).
I'll evaluate whether moving it further in that direction (or using
vec_rtx_properties) is an improvement; I'm a bit nervous about making
such a large change there, but if it helps I'll do it.

> I think it would worth adding a comment that, when forwarding succeeds,
> we'll process the newly created/moved stores in subsequent iterations
> of the loop.
>
Ok.

> > +
> > +      /* Don't consider store forwarding if the RTL instruction distance is
> > +      more than PARAM_STORE_FORWARDING_MAX_DISTANCE.  */
> > +      if (!store_exprs.is_empty ()
> > +       && (store_exprs[0].insn_cnt
> > +           + param_store_forwarding_max_distance <= insn_cnt))
> > +     store_exprs.ordered_remove (0);
> > +
> > +      insn_cnt++;
> > +    }
> > +}
> > +
> > +unsigned int
> > +pass_rtl_avoid_store_forwarding::execute (function *fn)
> > +{
> > +  df_set_flags (DF_DEFER_INSN_RESCAN);
> > +
> > +  init_alias_analysis ();
> > +
> > +  stats_sf_detected = 0;
> > +  stats_sf_avoided = 0;
> > +
> > +  basic_block bb;
> > +  FOR_EACH_BB_FN (bb, fn)
> > +    avoid_store_forwarding (bb);
> > +
> > +  end_alias_analysis ();
> > +
> > +  statistics_counter_event (fn, "Store forwardings detected: ",
> > +                         stats_sf_detected);
> > +  statistics_counter_event (fn, "Store forwardings avoided: ",
> > +                         stats_sf_detected);
> > +
> > +  return 0;
> > +}
> > +
> > +} // anon namespace.
> > +
> > +rtl_opt_pass *
> > +make_pass_rtl_avoid_store_forwarding (gcc::context *ctxt)
> > +{
> > +  return new pass_rtl_avoid_store_forwarding (ctxt);
> > +}
> > diff --git a/gcc/avoid-store-forwarding.h b/gcc/avoid-store-forwarding.h
> > new file mode 100644
> > index 00000000000..55a0c97f008
> > --- /dev/null
> > +++ b/gcc/avoid-store-forwarding.h
> > @@ -0,0 +1,56 @@
> > +/* Avoid store forwarding optimization pass.
> > +   Copyright (C) 2024 Free Software Foundation, Inc.
> > +   Contributed by VRULL GmbH.
> > +
> > +   This file is part of GCC.
> > +
> > +   GCC is free software; you can redistribute it and/or modify it
> > +   under the terms of the GNU General Public License as published by
> > +   the Free Software Foundation; either version 3, or (at your option)
> > +   any later version.
> > +
> > +   GCC is distributed in the hope that it will be useful, but
> > +   WITHOUT ANY WARRANTY; without even the implied warranty of
> > +   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> > +   General Public License for more details.
> > +
> > +   You should have received a copy of the GNU General Public License
> > +   along with GCC; see the file COPYING3.  If not see
> > +   <http://www.gnu.org/licenses/>.  */
> > +
> > +#ifndef GCC_AVOID_STORE_FORWARDING_H
> > +#define GCC_AVOID_STORE_FORWARDING_H
> > +
> > +#include "config.h"
> > +#include "system.h"
> > +#include "coretypes.h"
> > +#include "backend.h"
> > +#include "rtl.h"
> > +
> > +struct store_fwd_info
> > +{
> > +  /* The store instruction that is a store forwarding candidate.  */
> > +  rtx_insn *store_insn;
> > +  /* SET_DEST (single_set (store_insn)).  */
> > +  rtx store_mem;
> > +  /* The temporary that will hold the stored value at the original store
> > +     position.  */
> > +  rtx mov_reg;
> > +  /* The instruction sequence that inserts the stored value's bits at the
> > +     appropriate position in the loaded value.  */
> > +  rtx_insn *bits_insert_insns;
> > +  /* An instruction that saves the store's value in a register temporarily,
> > +     (set (reg X) (SET_SRC (store_insn))).  */
> > +  rtx_insn *save_store_value_insn;
> > +  /* An instruction that stores the saved value back to memory,
> > +     (set (SET_DEST (store_insn)) (reg X)).  */
> > +  rtx_insn *store_saved_value_insn;
> > +  /* The byte offset for the store's position within the load.  */
> > +  HOST_WIDE_INT offset;
> > +
> > +  unsigned int insn_cnt;
> > +  bool remove;
> > +  bool forwarded;
> > +};
> > +
> > +#endif  /* GCC_AVOID_STORE_FORWARDING_H  */
> > diff --git a/gcc/common.opt b/gcc/common.opt
> > index ea39f87ae71..7d080fbc513 100644
> > --- a/gcc/common.opt
> > +++ b/gcc/common.opt
> > @@ -1751,6 +1751,10 @@ fgcse-sm
> >  Common Var(flag_gcse_sm) Init(0) Optimization
> >  Perform store motion after global common subexpression elimination.
> >
> > +favoid-store-forwarding
> > +Common Var(flag_avoid_store_forwarding) Init(0) Optimization
> > +Try to avoid store forwarding.
> > +
> >  fgcse-las
> >  Common Var(flag_gcse_las) Init(0) Optimization
> >  Perform redundant load after store elimination in global common subexpression
>
> Remember to regenerate common.opt.urls too. :)
>
> > diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
> > index 27539a01785..4f8c145ff1f 100644
> > --- a/gcc/doc/invoke.texi
> > +++ b/gcc/doc/invoke.texi
> > @@ -12797,6 +12797,15 @@ loop unrolling.
> >  This option is enabled by default at optimization levels @option{-O1},
> >  @option{-O2}, @option{-O3}, @option{-Os}.
> >
> > +@opindex favoid-store-forwarding
> > +@item -favoid-store-forwarding
> > +@itemx -fno-avoid-store-forwarding
> > +Many CPUs will stall for many cycles when a load partially depends on previous
> > +smaller stores.  This pass tries to detect such cases and avoid the penalty by
> > +changing the order of the load and store and then fixing up the loaded value.
> > +
> > +Disabled by default.
> > +
> >  @opindex ffp-contract
> >  @item -ffp-contract=@var{style}
> >  @option{-ffp-contract=off} disables floating-point expression contraction.
> > diff --git a/gcc/doc/passes.texi b/gcc/doc/passes.texi
> > index 4ac7a2306a1..639f6b325c8 100644
> > --- a/gcc/doc/passes.texi
> > +++ b/gcc/doc/passes.texi
> > @@ -925,6 +925,14 @@ and addressing mode selection.  The pass is run twice, with values
> >  being propagated into loops only on the second run.  The code is
> >  located in @file{fwprop.cc}.
> >
> > +@item Store forwarding avoidance
> > +
> > +This pass attempts to reduce the overhead of store to load forwarding.
> > +It detects when a load reads from one or more previous smaller stores and
> > +then rearranges them so that the stores are done after the load.  The loaded
> > +value is adjusted with a series of bit insert instructions so that it stays
> > +the same.  The code is located in @file{avoid-store-forwarding.cc}.
> > +
> >  @item Common subexpression elimination
> >
> >  This pass removes redundant computation within basic blocks, and
> > diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
> > index cc33084ed32..47ada0c6bf5 100644
> > --- a/gcc/doc/tm.texi
> > +++ b/gcc/doc/tm.texi
> > @@ -7348,6 +7348,15 @@ the @code{POLY_VALUE_MIN}, @code{POLY_VALUE_MAX} and
> >  implementation returns the lowest possible value of @var{val}.
> >  @end deftypefn
> >
> > +@deftypefn {Target Hook} bool TARGET_AVOID_STORE_FORWARDING_P (vec<store_fwd_info>, @var{rtx}, @var{int}, @var{bool})
> > +This hook, if defined, enables the use of the avoid-store-forwarding pass.
>
> The pass is enabled even if the hook isn't defined.  Maybe just drop
> this sentence.
>
> Thanks,
> Richard
>
> > +Given a list of stores and a load instruction that reads from the location
> > +of the stores, this hook decides if it's profitable to emit additional code
> > +to avoid a potential store forwarding stall.  The additional instructions
> > +needed, the sequence cost and additional relevant information is given in
> > +the arguments so that the target can make an informed decision.
> > +@end deftypefn
> > +
> >  @node Scheduling
> >  @section Adjusting the Instruction Scheduler
> >
> > diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
> > index 8af3f414505..73ae1a1b8fe 100644
> > --- a/gcc/doc/tm.texi.in
> > +++ b/gcc/doc/tm.texi.in
> > @@ -4745,6 +4745,8 @@ Define this macro if a non-short-circuit operation produced by
> >
> >  @hook TARGET_ESTIMATED_POLY_VALUE
> >
> > +@hook TARGET_AVOID_STORE_FORWARDING_P
> > +
> >  @node Scheduling
> >  @section Adjusting the Instruction Scheduler
> >
> > diff --git a/gcc/params.opt b/gcc/params.opt
> > index c17ba17b91b..84a32b28b4a 100644
> > --- a/gcc/params.opt
> > +++ b/gcc/params.opt
> > @@ -1032,6 +1032,10 @@ Allow the store merging pass to introduce unaligned stores if it is legal to do
> >  Common Joined UInteger Var(param_store_merging_max_size) Init(65536) IntegerRange(1, 65536) Param Optimization
> >  Maximum size of a single store merging region in bytes.
> >
> > +-param=store-forwarding-max-distance=
> > +Common Joined UInteger Var(param_store_forwarding_max_distance) Init(10) IntegerRange(1, 1000) Param Optimization
> > +Maximum number of instruction distance that a small store forwarded to a larger load may stall.
> > +
> >  -param=switch-conversion-max-branch-ratio=
> >  Common Joined UInteger Var(param_switch_conversion_branch_ratio) Init(8) IntegerRange(1, 65536) Param Optimization
> >  The maximum ratio between array size and switch branches for a switch conversion to take place.
> > diff --git a/gcc/passes.def b/gcc/passes.def
> > index b06d6d45f63..7f7fd96e336 100644
> > --- a/gcc/passes.def
> > +++ b/gcc/passes.def
> > @@ -506,6 +506,7 @@ along with GCC; see the file COPYING3.  If not see
> >        NEXT_PASS (pass_sms);
> >        NEXT_PASS (pass_live_range_shrinkage);
> >        NEXT_PASS (pass_sched);
> > +      NEXT_PASS (pass_rtl_avoid_store_forwarding);
> >        NEXT_PASS (pass_early_remat);
> >        NEXT_PASS (pass_ira);
> >        NEXT_PASS (pass_reload);
> > diff --git a/gcc/target.def b/gcc/target.def
> > index 1d0ea6f30ca..c90ec9c7edd 100644
> > --- a/gcc/target.def
> > +++ b/gcc/target.def
> > @@ -6959,6 +6959,17 @@ HOOK_VECTOR_END (shrink_wrap)
> >  #undef HOOK_PREFIX
> >  #define HOOK_PREFIX "TARGET_"
> >
> > +DEFHOOK
> > +(avoid_store_forwarding_p,
> > + "This hook, if defined, enables the use of the avoid-store-forwarding pass.\n\
> > +Given a list of stores and a load instruction that reads from the location\n\
> > +of the stores, this hook decides if it's profitable to emit additional code\n\
> > +to avoid a potential store forwarding stall.  The additional instructions\n\
> > +needed, the sequence cost and additional relevant information is given in\n\
> > +the arguments so that the target can make an informed decision.",
> > + bool, (vec<store_fwd_info>, rtx, int, bool),
> > + NULL)
> > +
> >  /* Determine the type of unwind info to emit for debugging.  */
> >  DEFHOOK
> >  (debug_unwind_info,
> > diff --git a/gcc/target.h b/gcc/target.h
> > index 837651d273a..2f59f3c6a80 100644
> > --- a/gcc/target.h
> > +++ b/gcc/target.h
> > @@ -165,6 +165,9 @@ class function_arg_info;
> >  /* This is defined in function-abi.h.  */
> >  class predefined_function_abi;
> >
> > +/* This is defined in avoid-store-forwarding.h .  */
> > +struct store_fwd_info;
> > +
> >  /* These are defined in tree-vect-stmts.cc.  */
> >  extern tree stmt_vectype (class _stmt_vec_info *);
> >  extern bool stmt_in_inner_loop_p (class vec_info *, class _stmt_vec_info *);
> > diff --git a/gcc/testsuite/gcc.target/aarch64/avoid-store-forwarding-1.c b/gcc/testsuite/gcc.target/aarch64/avoid-store-forwarding-1.c
> > new file mode 100644
> > index 00000000000..4712f101d5e
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/aarch64/avoid-store-forwarding-1.c
> > @@ -0,0 +1,28 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-O2 -favoid-store-forwarding -fdump-rtl-avoid_store_forwarding" } */
> > +
> > +typedef union {
> > +    char arr_8[8];
> > +    long long_value;
> > +} DataUnion;
> > +
> > +long ssll_1 (DataUnion *data, char x)
> > +{
> > +  data->arr_8[0] = x;
> > +  return data->long_value;
> > +}
> > +
> > +long ssll_2 (DataUnion *data, char x)
> > +{
> > +  data->arr_8[1] = x;
> > +  return data->long_value;
> > +}
> > +
> > +long ssll_3 (DataUnion *data, char x)
> > +{
> > +  data->arr_8[7] = x;
> > +  return data->long_value;
> > +}
> > +
> > +/* { dg-final { scan-rtl-dump-times "Store forwarding detected" 3 "avoid_store_forwarding" } } */
> > +/* { dg-final { scan-rtl-dump-times "Store forwarding avoided" 3 "avoid_store_forwarding" } } */
> > diff --git a/gcc/testsuite/gcc.target/aarch64/avoid-store-forwarding-2.c b/gcc/testsuite/gcc.target/aarch64/avoid-store-forwarding-2.c
> > new file mode 100644
> > index 00000000000..b958612173b
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/aarch64/avoid-store-forwarding-2.c
> > @@ -0,0 +1,39 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-O2 -favoid-store-forwarding -fdump-rtl-avoid_store_forwarding" } */
> > +
> > +typedef union {
> > +    char arr_8[8];
> > +    int long_value;
> > +} DataUnion1;
> > +
> > +long no_ssll_1 (DataUnion1 *data, char x)
> > +{
> > +  data->arr_8[4] = x;
> > +  return data->long_value;
> > +}
> > +
> > +long no_ssll_2 (DataUnion1 *data, char x)
> > +{
> > +  data->arr_8[5] = x;
> > +  return data->long_value;
> > +}
> > +
> > +typedef union {
> > +    char arr_8[8];
> > +    short long_value[4];
> > +} DataUnion2;
> > +
> > +long no_ssll_3 (DataUnion2 *data, char x)
> > +{
> > +  data->arr_8[4] = x;
> > +  return data->long_value[1];
> > +}
> > +
> > +long no_ssll_4 (DataUnion2 *data, char x)
> > +{
> > +  data->arr_8[0] = x;
> > +  return data->long_value[1];
> > +}
> > +
> > +/* { dg-final { scan-rtl-dump-times "Store forwarding detected" 0 "avoid_store_forwarding" } } */
> > +/* { dg-final { scan-rtl-dump-times "Store forwarding avoided" 0 "avoid_store_forwarding" } } */
> > diff --git a/gcc/testsuite/gcc.target/aarch64/avoid-store-forwarding-3.c b/gcc/testsuite/gcc.target/aarch64/avoid-store-forwarding-3.c
> > new file mode 100644
> > index 00000000000..f4814c1a4d2
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/aarch64/avoid-store-forwarding-3.c
> > @@ -0,0 +1,31 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-O2 -favoid-store-forwarding -fdump-rtl-avoid_store_forwarding" } */
> > +
> > +typedef union {
> > +    char arr_8[8];
> > +    long long_value;
> > +} DataUnion;
> > +
> > +long ssll_multi_1 (DataUnion *data, char x)
> > +{
> > +  data->arr_8[0] = x;
> > +  data->arr_8[2] = x;
> > +  return data->long_value;
> > +}
> > +
> > +long ssll_multi_2 (DataUnion *data, char x)
> > +{
> > +  data->arr_8[0] = x;
> > +  data->arr_8[1] = 11;
> > +  return data->long_value;
> > +}
> > +
> > +long ssll_multi_3 (DataUnion *data, char x, short y)
> > +{
> > +  data->arr_8[1] = x;
> > +  __builtin_memcpy(data->arr_8 + 4, &y, sizeof(short));
> > +  return data->long_value;
> > +}
> > +
> > +/* { dg-final { scan-rtl-dump-times "Store forwardings detected" 3 "avoid_store_forwarding" } } */
> > +/* { dg-final { scan-rtl-dump-times "Store forwardings avoided" 3 "avoid_store_forwarding" } } */
> > diff --git a/gcc/testsuite/gcc.target/aarch64/avoid-store-forwarding-4.c b/gcc/testsuite/gcc.target/aarch64/avoid-store-forwarding-4.c
> > new file mode 100644
> > index 00000000000..db3c6b6630b
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/aarch64/avoid-store-forwarding-4.c
> > @@ -0,0 +1,23 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-O2 -favoid-store-forwarding -fdump-rtl-avoid_store_forwarding" } */
> > +
> > +typedef int v4si __attribute__ ((vector_size (16)));
> > +
> > +typedef union {
> > +    char arr_16[16];
> > +    v4si vec_value;
> > +} DataUnion;
> > +
> > +v4si ssll_vect_1 (DataUnion *data, char x)
> > +{
> > +  data->arr_16[0] = x;
> > +  return data->vec_value;
> > +}
> > +
> > +v4si ssll_vect_2 (DataUnion *data, int x)
> > +{
> > +  __builtin_memcpy(data->arr_16 + 4, &x, sizeof(int));
> > +  return data->vec_value;
> > +}
> > +
> > +/* { dg-final { scan-rtl-dump-times "Store forwarding detected" 2 "avoid_store_forwarding" } } */
> > diff --git a/gcc/testsuite/gcc.target/aarch64/avoid-store-forwarding-5.c b/gcc/testsuite/gcc.target/aarch64/avoid-store-forwarding-5.c
> > new file mode 100644
> > index 00000000000..2522df56905
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/aarch64/avoid-store-forwarding-5.c
> > @@ -0,0 +1,38 @@
> > +/* { dg-do compile { target int128 } } */
> > +/* { dg-options "-O2 -favoid-store-forwarding -fdump-rtl-avoid_store_forwarding" } */
> > +
> > +typedef float v4f __attribute__ ((vector_size (16)));
> > +
> > +typedef union {
> > +    float arr_2[4];
> > +    long long_value;
> > +    __int128 longlong_value;
> > +    v4f vec_value;
> > +} DataUnion;
> > +
> > +long ssll_load_elim_1 (DataUnion *data, float x)
> > +{
> > +  data->arr_2[0] = x;
> > +  data->arr_2[1] = 0.0f;
> > +  return data->long_value;
> > +}
> > +
> > +__int128 ssll_load_elim_2 (DataUnion *data, float x)
> > +{
> > +  data->arr_2[0] = x;
> > +  data->arr_2[1] = 0.0f;
> > +  data->arr_2[2] = x;
> > +  data->arr_2[3] = 0.0f;
> > +  return data->longlong_value;
> > +}
> > +
> > +v4f ssll_load_elim_3 (DataUnion *data, float x)
> > +{
> > +  data->arr_2[3] = x;
> > +  data->arr_2[2] = x;
> > +  data->arr_2[1] = x;
> > +  data->arr_2[0] = x;
> > +  return data->vec_value;
> > +}
> > +
> > +/* { dg-final { scan-rtl-dump-times "Store forwardings detected" 3 "avoid_store_forwarding" } } */
> > diff --git a/gcc/tree-pass.h b/gcc/tree-pass.h
> > index 3a0cf13089e..75b1d9cbae4 100644
> > --- a/gcc/tree-pass.h
> > +++ b/gcc/tree-pass.h
> > @@ -572,6 +572,7 @@ extern rtl_opt_pass *make_pass_rtl_dse3 (gcc::context *ctxt);
> >  extern rtl_opt_pass *make_pass_rtl_cprop (gcc::context *ctxt);
> >  extern rtl_opt_pass *make_pass_rtl_pre (gcc::context *ctxt);
> >  extern rtl_opt_pass *make_pass_rtl_hoist (gcc::context *ctxt);
> > +extern rtl_opt_pass *make_pass_rtl_avoid_store_forwarding (gcc::context *ctxt);
> >  extern rtl_opt_pass *make_pass_rtl_store_motion (gcc::context *ctxt);
> >  extern rtl_opt_pass *make_pass_cse_after_global_opts (gcc::context *ctxt);
> >  extern rtl_opt_pass *make_pass_rtl_ifcvt (gcc::context *ctxt);
Richard Sandiford Aug. 29, 2024, 6:16 p.m. UTC | #4
Manolis Tsamis <manolis.tsamis@vrull.eu> writes:
> On Fri, Aug 16, 2024 at 5:33 PM Richard Sandiford
> <richard.sandiford@arm.com> wrote:
>> Manolis Tsamis <manolis.tsamis@vrull.eu> writes:
>> > +    }
>> > +
>> > +  virtual unsigned int execute (function *) override;
>> > +}; // class pass_rtl_avoid_store_forwarding
>> > +
>> > +static unsigned int stats_sf_detected = 0;
>> > +static unsigned int stats_sf_avoided = 0;
>>
>> Could you instead structure the code as a class with these as member
>> variables?  I realise it's a bit over-the-top for just 2 variables,
>> but the pass might be expanded in future.
>>
> I could do that, but it would require making most of the functions
> member functions (as these are incremented in
> process_store_forwarding).

Right.

> and, looking at other passes in the GCC codebase, this is almost never
> done.  So I felt it would be a bit unconventional and I haven't
> implemented it yet.  Thoughts?

Most passes were written while the codebase was still C.  I think newer
passes have generally been written in the way I described.  Where possible,
it'd be better for new code to avoid using globals to communicate
information between functions, to reduce the work involved in any future
parallelisation effort.  (There have been tentative steps towards reducing
existing uses of globals in the past.)

Using classes also allows pass-level RAII.  I realise that doesn't matter
for the current code, but like I say, we should consider future extensions
too.

>> > +
>> > +      FOR_EACH_VEC_ELT (stores, i, it)
>> > +     {
>> > +       fprintf (dump_file, "From: ");
>> > +       print_rtl_single (dump_file, it->store_insn);
>> > +     }
>> > +
>> > +      fprintf (dump_file, "To: ");
>> > +      print_rtl_single (dump_file, load_insn);
>> > +
>> > +      if (load_elim)
>> > +     fprintf (dump_file, "(Load elimination candidate)\n");
>> > +    }
>> > +
>> > +  rtx dest;
>> > +  if (load_elim)
>> > +    dest = gen_reg_rtx (load_inner_mode);
>> > +  else
>> > +    dest = SET_DEST (load);
>> > +
>> > +  int move_to_front = -1;
>> > +  int total_cost = 0;
>> > +
>> > +  /* Check if we can emit bit insert instructions for all forwarded stores.  */
>> > +  FOR_EACH_VEC_ELT (stores, i, it)
>> > +    {
>> > +      it->mov_reg = gen_reg_rtx (GET_MODE (it->store_mem));
>> > +      rtx_insn *insns = NULL;
>> > +
>> > +      /* If we're eliminating the load then find the store with zero offset
>> > +      and use it as the base register to avoid a bit insert if possible.  */
>> > +      if (load_elim && it->offset == 0
>> > +       && validate_subreg (GET_MODE (dest), GET_MODE (it->mov_reg),
>> > +                           it->mov_reg, 0))
>> > +     {
>> > +       start_sequence ();
>> > +
>> > +       /* We can use a paradoxical subreg to force this to a wider mode, as
>> > +          the only use will be inserting the bits (i.e., we don't care about
>> > +          the value of the higher bits).  */
>> > +       rtx ext0 = gen_rtx_SUBREG (GET_MODE (dest), it->mov_reg, 0);
>>
>> IMO it'd be safer to call lowpart_subreg and check whether the result
>> is null.  This would also remove the need to check validate_subreg
>> directly.
>>
> I did change this to lowpart_subreg, but note that validate_subreg
> still looks to be necessary (for AArch64).  Otherwise I get various
> ICEs, e.g. when an SF -> TI case is hit, which was the original reason
> for validate_subreg.

Do you have a testcase?

Direct uses of gen_rtx_SUBREG are generally suspicious.  We should be
using higher-level generators in most cases.

> I could use some help with that, because after the new changes a
> subreg related ICE also happens within store_bit_field when a DI ->
> V4SI case is hit. Afaik store_bit_field should just return NULL if it
> can't handle something so I don't really know how to address this ICE
> in the new version.

I don't think store_bit_field is expected to fail.  But yeah, if you have
a testcase, I can take a look.

>> > +       rtx_insn *move0 = emit_move_insn (dest, ext0);
>> > +       if (recog_memoized (move0) >= 0)
>> > +         {
>> > +           insns = get_insns ();
>> > +           move_to_front = (int) i;
>> > +         }
>>
>> Here too I'd expect emit_move_insn to produce only valid insns.
>> (And it can produce multiple insns.)
>>
> The original implementation emitted simple mov instructions without
> recog, but each of these recog calls was gradually added to fix ICEs
> on various more obscure targets that Jeff helped find (even for
> "simple" moves).

By "simple mov instructions", do you mean via emit_move_insn, or by
emitting a SET directly?

> So, although I don't like it and it complicates things, I had to add
> recog_memoized pretty much everywhere and also make sure to only emit
> if all of these are successful...

Do you have testcases?  emit_move_insn's job is to make the move
legitimate, so it seems like the checks might be papering over an issue
elsewhere.

>> > +     }
>> > +      else
>> > +     {
>> > +       rtx reg = SET_DEST (set);
>> > +
>> > +       while (GET_CODE (reg) == ZERO_EXTRACT
>> > +             || GET_CODE (reg) == STRICT_LOW_PART
>> > +             || GET_CODE (reg) == SUBREG)
>> > +         reg = XEXP (reg, 0);
>> > +
>> > +       /* Drop store forwarding candidates when the address register is
>> > +          overwritten.  */
>> > +       if (REG_P (reg))
>> > +         {
>> > +           bool remove_rest = false;
>> > +           unsigned int i;
>> > +           store_fwd_info *it;
>> > +           FOR_EACH_VEC_ELT_REVERSE (store_exprs, i, it)
>> > +             {
>> > +               if (remove_rest
>> > +                   || reg_overlap_mentioned_p (reg, it->store_mem))
>> > +                 {
>> > +                   it->remove = true;
>> > +                   removed_count++;
>> > +                   remove_rest = true;
>> > +                 }
>> > +             }
>> > +         }
>> > +       else
>> > +         {
>> > +           /* We can't understand INSN.  */
>> > +           store_exprs.truncate (0);
>> > +           continue;
>> > +         }
>> > +     }
>> > +
>> > +      if (removed_count)
>> > +     {
>> > +       unsigned int i, j;
>> > +       store_fwd_info *it;
>> > +       VEC_ORDERED_REMOVE_IF (store_exprs, i, j, it, it->remove);
>> > +     }
>>
>> It might be safer (and more general) to structure the code as follows:
>>
>>       if (!NONDEBUG_INSN_P (insn))
>>         continue;
>>
>>       vec_rtx_properties properties;
>>       properties.add_insn (insn, false);
>>
>>       rtx set = single_set (insn);
>>       bool is_simple = (set
>>                         && !properties.has_asm
>>                         && !properties.has_side_effects ());
>>       bool is_simple_store = (is_simple
>>                               && MEM_P (SET_DEST (set))
>>                               && !contains_mem_rtx_p (SET_SRC (set)));
>>       rtx load_mem = nullptr;
>>       bool is_simple_load = (is_simple
>>                              && !contains_mem_rtx_p (SET_DEST (set))
>>                              && (load_mem = get_load_mem (set)));
>>       if (is_simple_store)
>>         ...record store...
>>
>>       bool reads_mem = false;
>>       bool writes_mem = false;
>>       for (auto ref : properties.refs ())
>>         if (ref.is_mem ())
>>           {
>>             reads_mem |= ref.is_read ();
>>             writes_mem |= ref.is_write ();
>>           }
>>         else if (ref.is_write ())
>>           {
>>             // Check for the last store to mention ref.regno (), use
>>             // block_remove to remove it and all preceding stores.
>>           }
>>
>>       if (is_simple_load)
>>         {
>>           // Try store forwarding.  Remove the last store that
>>           // load_mem might depend on, and all previous stores.
>>         }
>>
>>       if ((writes_mem && !is_simple_store)
>>           || (reads_mem && !is_simple_load))
>>         store_exprs.truncate (0);
>>
>> It doesn't need to be exactly like that, but the idea is to test
>> directly for all references to memory and registers, without giving
>> up unnecessarily on (say) const calls or arithmetic double-sets.
>>
> Hmm, I have restructured the code based on your other feedback and it
> indeed looks more like this (except for not using vec_rtx_properties).
> I'll evaluate whether moving it further in this direction (or using
> vec_rtx_properties) is an improvement; I'm a bit nervous about making a
> very large change there, but if it improves things I'll do it.

Thanks.  I'd appreciate it if you could use vec_rtx_properties, since it
is designed for cases like this.  IMO:

   if (CALL_P (insn) || !set || volatile_refs_p (set)
       || GET_MODE (SET_DEST (set)) == BLKmode
       || (MEM_P (SET_DEST (set)) && MEM_P (SET_SRC (set))))
     {
       store_exprs.truncate (0);
       continue;
     }

and the later continue are too conservative.  In particular,
we shouldn't treat all non-single_sets as optimisation barriers,
since many of them are fine.
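For instance, a two-output divmod-style pattern is a PARALLEL of two
SETs, so single_set fails on it, yet it touches no memory and needn't
flush the store candidates (hypothetical RTL, for illustration):

```
(insn ... (parallel [
            (set (reg:SI q) (div:SI (reg:SI a) (reg:SI b)))
            (set (reg:SI r) (mod:SI (reg:SI a) (reg:SI b)))]) ...)
```

Register writes in such insns are still handled by the ref.is_write ()
check above, so only genuine memory barriers need to clear the list.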

Richard
Jeff Law Aug. 29, 2024, 8:41 p.m. UTC | #5
On 8/29/24 12:16 PM, Richard Sandiford wrote:
> 
> Most passes were written while the codebase was still C.  I think newer
> passes have generally been written in the way I described.  Where possible,
> it'd be better for new code to avoid using globals to communicate
> information between functions, to reduce the work involved in any future
> parallelisation effort.  (There have been tentative steps towards reducing
> existing uses of globals in the past.)
> 
> Using classes also allows pass-level RAII.  I realise that doesn't matter
> for the current code, but like I say, we should consider future extensions
> too.
Right.  I've actually got a patch here from Jivan that does this kind of 
conversion of ext-dce.  I haven't done anything with it yet as I'm 
starting to worry it's got some fundamental flaws and may need to get 
pulled out.

Jeff
Manolis Tsamis Aug. 30, 2024, 12:45 p.m. UTC | #6
On Thu, Aug 29, 2024 at 9:16 PM Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> Manolis Tsamis <manolis.tsamis@vrull.eu> writes:
> > On Fri, Aug 16, 2024 at 5:33 PM Richard Sandiford
> > <richard.sandiford@arm.com> wrote:
> >> Manolis Tsamis <manolis.tsamis@vrull.eu> writes:
> >> > +    }
> >> > +
> >> > +  virtual unsigned int execute (function *) override;
> >> > +}; // class pass_rtl_avoid_store_forwarding
> >> > +
> >> > +static unsigned int stats_sf_detected = 0;
> >> > +static unsigned int stats_sf_avoided = 0;
> >>
> >> Could you instead structure the code as a class with these as member
> >> variables?  I realise it's a bit over-the-top for just 2 variables,
> >> but the pass might be expanded in future.
> >>
> > I could do that, but it would require making most of the functions
> > member functions (as these are incremented in
> > process_store_forwarding).
>
> Right.
>
> > and, looking at other passes in the GCC codebase, this is almost never
> > done.  So I felt it would be a bit unconventional and I haven't
> > implemented it yet.  Thoughts?
>
> Most passes were written while the codebase was still C.  I think newer
> passes have generally been written in the way I described.  Where possible,
> it'd be better for new code to avoid using globals to communicate
> information between functions, to reduce the work involved in any future
> parallelisation effort.  (There have been tentative steps towards reducing
> existing uses of globals in the past.)
>
> Using classes also allows pass-level RAII.  I realise that doesn't matter
> for the current code, but like I say, we should consider future extensions
> too.
>
Ok, sounds good, will do then.

> >> > +
> >> > +      FOR_EACH_VEC_ELT (stores, i, it)
> >> > +     {
> >> > +       fprintf (dump_file, "From: ");
> >> > +       print_rtl_single (dump_file, it->store_insn);
> >> > +     }
> >> > +
> >> > +      fprintf (dump_file, "To: ");
> >> > +      print_rtl_single (dump_file, load_insn);
> >> > +
> >> > +      if (load_elim)
> >> > +     fprintf (dump_file, "(Load elimination candidate)\n");
> >> > +    }
> >> > +
> >> > +  rtx dest;
> >> > +  if (load_elim)
> >> > +    dest = gen_reg_rtx (load_inner_mode);
> >> > +  else
> >> > +    dest = SET_DEST (load);
> >> > +
> >> > +  int move_to_front = -1;
> >> > +  int total_cost = 0;
> >> > +
> >> > +  /* Check if we can emit bit insert instructions for all forwarded stores.  */
> >> > +  FOR_EACH_VEC_ELT (stores, i, it)
> >> > +    {
> >> > +      it->mov_reg = gen_reg_rtx (GET_MODE (it->store_mem));
> >> > +      rtx_insn *insns = NULL;
> >> > +
> >> > +      /* If we're eliminating the load then find the store with zero offset
> >> > +      and use it as the base register to avoid a bit insert if possible.  */
> >> > +      if (load_elim && it->offset == 0
> >> > +       && validate_subreg (GET_MODE (dest), GET_MODE (it->mov_reg),
> >> > +                           it->mov_reg, 0))
> >> > +     {
> >> > +       start_sequence ();
> >> > +
> >> > +       /* We can use a paradoxical subreg to force this to a wider mode, as
> >> > +          the only use will be inserting the bits (i.e., we don't care about
> >> > +          the value of the higher bits).  */
> >> > +       rtx ext0 = gen_rtx_SUBREG (GET_MODE (dest), it->mov_reg, 0);
> >>
> >> IMO it'd be safer to call lowpart_subreg and check whether the result
> >> is null.  This would also remove the need to check validate_subreg
> >> directly.
> >>
> > I did change this to lowpart_subreg, but note that validate_subreg
> > still looks to be necessary (for AArch64).  Otherwise I get various
> > ICEs, e.g. when an SF -> TI case is hit, which was the original reason
> > for validate_subreg.
>
> Do you have a testcase?
>
> Direct uses of gen_rtx_SUBREG are generally suspicious.  We should be
> using higher-level generators in most cases.
>
After going over the code I found it was my mistake in how I was
calling lowpart_subreg. Fixed and now it's working fine.

> > I could use some help with that, because after the new changes a
> > subreg related ICE also happens within store_bit_field when a DI ->
> > V4SI case is hit. Afaik store_bit_field should just return NULL if it
> > can't handle something so I don't really know how to address this ICE
> > in the new version.
>
> I don't think store_bit_field is expected to fail.  But yeah, if you have
> a testcase, I can take a look.
>

> Yes, that was my initial reaction too.  I don't yet have a reduced
> testcase; this now only happens with SPEC2017 gcc_r when compiled with
> `-g -O3 -march=armv8.2-a -flto=32 -favoid-store-forwarding` (it doesn't
> reproduce without -flto).  I have also attached the work-in-progress
> patch that implements your changes up to that point, which can be used
> to reproduce this, at least until we have a smaller testcase.

The case that leads to that ICE is this:

Store forwarding detected:
From: (insn 1700 1714 1701 132 (set (mem/j/c:SI (plus:DI (reg/f:DI 64 sfp)
                (const_int -160 [0xffffffffffffff60])) [0 +0 S4 A128])
        (reg:SI 617)) "real.c":158:9 69 {*movsi_aarch64}
     (expr_list:REG_DEAD (reg:SI 617)
        (nil)))
From: (insn 1695 1698 1715 132 (set (mem/c:TI (plus:DI (reg/f:DI 64 sfp)
                (const_int -160 [0xffffffffffffff60])) [0 MEM  [(void
*)&r]+0 S16 A128])
        (const_int 0 [0])) "real.c":157:3 75 {*movti_aarch64}
     (nil))
To: (insn 1709 1703 1696 132 (set (reg:V4SI 622 [ r ])
        (mem/c:V4SI (plus:DI (reg/f:DI 64 sfp)
                (const_int -160 [0xffffffffffffff60])) [51 r+0 S16
A128])) "builtins.c":9619:13 1270 {*aarch64_simd_movv4si}
     (nil))

> Which is somewhat suspicious; we'll have to first make sure it's not an
> avoid-store-forwarding bug.  In any case, the ICE is

builtins.c:6684:1: internal compiler error: in simplify_subreg, at
simplify-rtx.cc:7680
 6684 | }
      | ^
0xc41ecb simplify_context::simplify_subreg(machine_mode, rtx_def*,
machine_mode, poly_int<2u, unsigned long>)
        /home/mtsamis/src/gcc/gcc4/gcc/simplify-rtx.cc:7680
0xc42d0f simplify_context::simplify_gen_subreg(machine_mode, rtx_def*,
machine_mode, poly_int<2u, unsigned long>)
        /home/mtsamis/src/gcc/gcc4/gcc/simplify-rtx.cc:8007
0x8afb4b simplify_gen_subreg(machine_mode, rtx_def*, machine_mode,
poly_int<2u, unsigned long>)
        /home/mtsamis/src/gcc/gcc4/gcc/rtl.h:3562
0x8afb4b force_subreg(machine_mode, rtx_def*, machine_mode,
poly_int<2u, unsigned long>)
        /home/mtsamis/src/gcc/gcc4/gcc/explow.cc:755
0x8b9ef3 store_bit_field_1
        /home/mtsamis/src/gcc/gcc4/gcc/expmed.cc:810
0x8ba327 store_bit_field(rtx_def*, poly_int<2u, unsigned long>,
poly_int<2u, unsigned long>, poly_int<2u, unsigned long>, poly_int<2u,
unsigned long>, machine_mode, rtx_def*, bool, bool)
        /home/mtsamis/src/gcc/gcc4/gcc/expmed.cc:1214
0xc53c7f generate_bit_insert_sequence
        /home/mtsamis/src/gcc/gcc4/gcc/avoid-store-forwarding.cc:122
0xc53c7f process_store_forwarding
        /home/mtsamis/src/gcc/gcc4/gcc/avoid-store-forwarding.cc:231
0xc53c7f avoid_store_forwarding
        /home/mtsamis/src/gcc/gcc4/gcc/avoid-store-forwarding.cc:504
0xc546eb execute
        /home/mtsamis/src/gcc/gcc4/gcc/avoid-store-forwarding.cc:571

> >> > +       rtx_insn *move0 = emit_move_insn (dest, ext0);
> >> > +       if (recog_memoized (move0) >= 0)
> >> > +         {
> >> > +           insns = get_insns ();
> >> > +           move_to_front = (int) i;
> >> > +         }
> >>
> >> Here too I'd expect emit_move_insn to produce only valid insns.
> >> (And it can produce multiple insns.)
> >>
> > The original implementation emitted simple mov instructions without
> > recog, but each of these recog calls was gradually added to fix ICEs
> > on various more obscure targets that Jeff helped find (even for
> > "simple" moves).
>
> By "simple mov instructions", do you mean via emit_move_insn, or by
> emitting a SET directly?
>
The implementation used emit_move_insn from the beginning. I only
added gen_rtx_SET per your suggestion.

> > So, although I don't like it and it complicates things, I had to add
> > recog_memoized pretty much everywhere and also make sure to only emit
> > if all of these are successful...
>
> Do you have testcases?  emit_move_insn's job is to make the move
> legitimate, so it seems like the checks might be papering over an issue
> elsewhere.
>
I had a quick look and couldn't find an existing testcase. I'll have
to recheck more thoroughly in order to create a testcase (if possible
at all).
> I remember that in one case the rtx provided was a zero_extend and
> there was no matching instruction, so recog would fail.

> >> > +     }
> >> > +      else
> >> > +     {
> >> > +       rtx reg = SET_DEST (set);
> >> > +
> >> > +       while (GET_CODE (reg) == ZERO_EXTRACT
> >> > +             || GET_CODE (reg) == STRICT_LOW_PART
> >> > +             || GET_CODE (reg) == SUBREG)
> >> > +         reg = XEXP (reg, 0);
> >> > +
> >> > +       /* Drop store forwarding candidates when the address register is
> >> > +          overwritten.  */
> >> > +       if (REG_P (reg))
> >> > +         {
> >> > +           bool remove_rest = false;
> >> > +           unsigned int i;
> >> > +           store_fwd_info *it;
> >> > +           FOR_EACH_VEC_ELT_REVERSE (store_exprs, i, it)
> >> > +             {
> >> > +               if (remove_rest
> >> > +                   || reg_overlap_mentioned_p (reg, it->store_mem))
> >> > +                 {
> >> > +                   it->remove = true;
> >> > +                   removed_count++;
> >> > +                   remove_rest = true;
> >> > +                 }
> >> > +             }
> >> > +         }
> >> > +       else
> >> > +         {
> >> > +           /* We can't understand INSN.  */
> >> > +           store_exprs.truncate (0);
> >> > +           continue;
> >> > +         }
> >> > +     }
> >> > +
> >> > +      if (removed_count)
> >> > +     {
> >> > +       unsigned int i, j;
> >> > +       store_fwd_info *it;
> >> > +       VEC_ORDERED_REMOVE_IF (store_exprs, i, j, it, it->remove);
> >> > +     }
> >>
> >> It might be safer (and more general) to structure the code as follows:
> >>
> >>       if (!NONDEBUG_INSN_P (insn))
> >>         continue;
> >>
> >>       vec_rtx_properties properties;
> >>       properties.add_insn (insn, false);
> >>
> >>       rtx set = single_set (insn);
> >>       bool is_simple = (set
> >>                         && !properties.has_asm
> >>                         && !properties.has_side_effects ());
> >>       bool is_simple_store = (is_simple
> >>                               && MEM_P (SET_DEST (set))
> >>                               && !contains_mem_rtx_p (SET_SRC (set)));
> >>       rtx load_mem = nullptr;
> >>       bool is_simple_load = (is_simple
> >>                              && !contains_mem_rtx_p (SET_DEST (set))
> >>                              && (load_mem = get_load_mem (set)));
> >>       if (is_simple_store)
> >>         ...record store...
> >>
> >>       bool reads_mem = false;
> >>       bool writes_mem = false;
> >>       for (auto ref : properties.refs ())
> >>         if (ref.is_mem ())
> >>           {
> >>             reads_mem |= ref.is_read ();
> >>             writes_mem |= ref.is_write ();
> >>           }
> >>         else if (ref.is_write ())
> >>           {
> >>             // Check for the last store to mention ref.regno (), use
> >>             // block_remove to remove it and all preceding stores.
> >>           }
> >>
> >>       if (is_simple_load)
> >>         {
> >>           // Try store forwarding.  Remove the last store that
> >>           // load_mem might depend on, and all previous stores.
> >>         }
> >>
> >>       if ((writes_mem && !is_simple_store)
> >>           || (reads_mem && !is_simple_load))
> >>         store_exprs.truncate (0);
> >>
> >> It doesn't need to be exactly like that, but the idea is to test
> >> directly for all references to memory and registers, without giving
> >> up unnecessarily on (say) const calls or arithmetic double-sets.
> >>
> > Hmm, I have restructured the code based on your other feedback and it
> > indeed looks more like this (except for not using vec_rtx_properties).
> > I'll evaluate whether moving it further in this direction (or using
> > vec_rtx_properties) is an improvement; I'm a bit nervous about making a
> > very large change there, but if it improves things I'll do it.
>
> Thanks.  I'd appreciate it if you could use vec_rtx_properties, since it
> is designed for cases like this.  IMO:
>
Alright, will do.

>    if (CALL_P (insn) || !set || volatile_refs_p (set)
>        || GET_MODE (SET_DEST (set)) == BLKmode
>        || (MEM_P (SET_DEST (set)) && MEM_P (SET_SRC (set))))
>      {
>        store_exprs.truncate (0);
>        continue;
>      }
>
> and the later continue are too conservative.  In particular,
> we shouldn't treat all non-single_sets as optimisation barriers,
> since many of them are fine.
>
Fair enough.

Thanks,
Manolis

> Richard
Richard Sandiford Aug. 30, 2024, 1:31 p.m. UTC | #7
Manolis Tsamis <manolis.tsamis@vrull.eu> writes:
>> > I could use some help with that, because after the new changes a
>> > subreg related ICE also happens within store_bit_field when a DI ->
>> > V4SI case is hit. Afaik store_bit_field should just return NULL if it
>> > can't handle something so I don't really know how to address this ICE
>> > in the new version.
>>
>> I don't think store_bit_field is expected to fail.  But yeah, if you have
>> a testcase, I can take a look.
>>
>
> Yes, that was my initial reaction too. I don't yet have a reduced
> testcase, this now only happens with SPEC2017 gcc_r when compiled with
> `-g -O3 -march=armv8.2-a -flto=32 -favoid-store-forwarding` (doesn't
> reproduce without flto). I have also attached the work-in-progress
> patch that implements your changes up to that point and which can be
> used to reproduce this, at least till we have a smaller testcase.
>
> The case that leads to that ICE is this:
>
> Store forwarding detected:
> From: (insn 1700 1714 1701 132 (set (mem/j/c:SI (plus:DI (reg/f:DI 64 sfp)
>                 (const_int -160 [0xffffffffffffff60])) [0 +0 S4 A128])
>         (reg:SI 617)) "real.c":158:9 69 {*movsi_aarch64}
>      (expr_list:REG_DEAD (reg:SI 617)
>         (nil)))
> From: (insn 1695 1698 1715 132 (set (mem/c:TI (plus:DI (reg/f:DI 64 sfp)
>                 (const_int -160 [0xffffffffffffff60])) [0 MEM  [(void
> *)&r]+0 S16 A128])
>         (const_int 0 [0])) "real.c":157:3 75 {*movti_aarch64}
>      (nil))
> To: (insn 1709 1703 1696 132 (set (reg:V4SI 622 [ r ])
>         (mem/c:V4SI (plus:DI (reg/f:DI 64 sfp)
>                 (const_int -160 [0xffffffffffffff60])) [51 r+0 S16
> A128])) "builtins.c":9619:13 1270 {*aarch64_simd_movv4si}
>      (nil))
>
> > Which is somewhat suspicious; we'll have to first make sure it's not an
> > avoid-store-forwarding bug.

Does 1695 come before 1700?  If so, it looks like it should be valid.

> In any case the ICE is
>
> builtins.c:6684:1: internal compiler error: in simplify_subreg, at
> simplify-rtx.cc:7680
>  6684 | }
>       | ^
> 0xc41ecb simplify_context::simplify_subreg(machine_mode, rtx_def*,
> machine_mode, poly_int<2u, unsigned long>)
>         /home/mtsamis/src/gcc/gcc4/gcc/simplify-rtx.cc:7680
> 0xc42d0f simplify_context::simplify_gen_subreg(machine_mode, rtx_def*,
> machine_mode, poly_int<2u, unsigned long>)
>         /home/mtsamis/src/gcc/gcc4/gcc/simplify-rtx.cc:8007
> 0x8afb4b simplify_gen_subreg(machine_mode, rtx_def*, machine_mode,
> poly_int<2u, unsigned long>)
>         /home/mtsamis/src/gcc/gcc4/gcc/rtl.h:3562
> 0x8afb4b force_subreg(machine_mode, rtx_def*, machine_mode,
> poly_int<2u, unsigned long>)
>         /home/mtsamis/src/gcc/gcc4/gcc/explow.cc:755
> 0x8b9ef3 store_bit_field_1
>         /home/mtsamis/src/gcc/gcc4/gcc/expmed.cc:810
> 0x8ba327 store_bit_field(rtx_def*, poly_int<2u, unsigned long>,
> poly_int<2u, unsigned long>, poly_int<2u, unsigned long>, poly_int<2u,
> unsigned long>, machine_mode, rtx_def*, bool, bool)
>         /home/mtsamis/src/gcc/gcc4/gcc/expmed.cc:1214
> 0xc53c7f generate_bit_insert_sequence
>         /home/mtsamis/src/gcc/gcc4/gcc/avoid-store-forwarding.cc:122
> 0xc53c7f process_store_forwarding
>         /home/mtsamis/src/gcc/gcc4/gcc/avoid-store-forwarding.cc:231
> 0xc53c7f avoid_store_forwarding
>         /home/mtsamis/src/gcc/gcc4/gcc/avoid-store-forwarding.cc:504
> 0xc546eb execute
>         /home/mtsamis/src/gcc/gcc4/gcc/avoid-store-forwarding.cc:571

OK, thanks, I'll have a go at reproducing, but it sounds like it might
be tricky. :)

>> > So, although I don't like it and it complicates things, I had to add
>> > recog_memoized pretty much everywhere and also make sure to only emit
>> > if all of these are successful...
>>
>> Do you have testcases?  emit_move_insn's job is to make the move
>> legitimate, so it seems like the checks might be papering over an issue
>> elsewhere.
>>
> I had a quick look and couldn't find an existing testcase. I'll have
> to recheck more thoroughly in order to create a testcase (if possible
> at all).
> I remember that in one case the rtx provided was zero_extend and there
> was no instruction for that so recog would fail.

Ah, ok.  That might have been the bug then.  emit_move_insn should only
be used for moving one object to another, where an "object" can be a
constant, memory or register (including subregs of registers).
The advantage of using it is that it works out how to make the move
valid, given target limitations.

If the instruction can be more general than that then we should just use
gen_rtx_SET, and recog the result.
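In other words, roughly this (a sketch against the internal API, not
meant to compile standalone):

```
/* Object-to-object copy: emit_move_insn legitimizes the move itself
   and may emit several insns, all of which should be valid.  */
start_sequence ();
emit_move_insn (dest, src);   /* dest/src: constant, MEM, REG or SUBREG.  */
rtx_insn *insns = get_insns ();
end_sequence ();

/* Anything more general on the RHS (e.g. the zero_extend you mention):
   build the SET directly and recog the result, bailing out if the
   target has no matching pattern.  */
rtx_insn *insn = emit_insn (gen_rtx_SET (dest,
					 gen_rtx_ZERO_EXTEND (mode, src)));
if (recog_memoized (insn) < 0)
  /* Not supported on this target; give up on the candidate.  */
```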

Thanks,
Richard
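Richard's suggestion above could look roughly like this inside the pass. This is a hypothetical GCC-internals sketch, not tested against the tree; `dest` and `src` stand for whatever operands the pass has at hand:

```c
/* Hedged sketch of the suggested approach: when the RHS is not a plain
   "object" (e.g. a ZERO_EXTEND), build the SET explicitly and let recog
   decide whether the target has a matching pattern, instead of relying
   on emit_move_insn to legitimize it.  */
start_sequence ();
rtx_insn *insn = emit_insn (gen_rtx_SET (dest, src));
rtx_insn *seq = get_insns ();
end_sequence ();

if (recog_memoized (insn) < 0)
  return NULL;  /* No pattern for this set; give up on the transform.  */
```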
Konstantinos Eleftheriou Oct. 18, 2024, 10 a.m. UTC | #8
Sent a new version for this in
https://gcc.gnu.org/pipermail/gcc-patches/2024-October/665794.html.

Thanks,
Konstantinos.

On Fri, Aug 30, 2024 at 4:32 PM Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> Manolis Tsamis <manolis.tsamis@vrull.eu> writes:
> >> > I could have some help with that, because after the new changes a
> >> > subreg related ICE also happens within store_bit_field when a DI ->
> >> > V4SI case is hit. Afaik store_bit_field should just return NULL if it
> >> > can't handle something so I don't really know how to address this ICE
> >> > in the new version.
> >>
> >> I don't think store_bit_field is expected to fail.  But yeah, if you have
> >> a testcase, I can take a look.
> >>
> >
> > Yes, that was my initial reaction too. I don't yet have a reduced
> > testcase, this now only happens with SPEC2017 gcc_r when compiled with
> > `-g -O3 -march=armv8.2-a -flto=32 -favoid-store-forwarding` (doesn't
> > reproduce without flto). I have also attached the work-in-progress
> > patch that implements your changes up to that point and which can be
> > used to reproduce this, at least till we have a smaller testcase.
> >
> > The case that leads to that ICE is this:
> >
> > Store forwarding detected:
> > From: (insn 1700 1714 1701 132 (set (mem/j/c:SI (plus:DI (reg/f:DI 64 sfp)
> >                 (const_int -160 [0xffffffffffffff60])) [0 +0 S4 A128])
> >         (reg:SI 617)) "real.c":158:9 69 {*movsi_aarch64}
> >      (expr_list:REG_DEAD (reg:SI 617)
> >         (nil)))
> > From: (insn 1695 1698 1715 132 (set (mem/c:TI (plus:DI (reg/f:DI 64 sfp)
> >                 (const_int -160 [0xffffffffffffff60])) [0 MEM  [(void
> > *)&r]+0 S16 A128])
> >         (const_int 0 [0])) "real.c":157:3 75 {*movti_aarch64}
> >      (nil))
> > To: (insn 1709 1703 1696 132 (set (reg:V4SI 622 [ r ])
> >         (mem/c:V4SI (plus:DI (reg/f:DI 64 sfp)
> >                 (const_int -160 [0xffffffffffffff60])) [51 r+0 S16
> > A128])) "builtins.c":9619:13 1270 {*aarch64_simd_movv4si}
> >      (nil))
> >
> > Which is somewhat suspicious; we'll have to first make sure it's not an
> > avoid-store-forwarding bug.
>
> Does 1695 come before 1700?  If so, it looks like it should be valid.
>
> > In any case the ICE is
> >
> > builtins.c:6684:1: internal compiler error: in simplify_subreg, at
> > simplify-rtx.cc:7680
> >  6684 | }
> >       | ^
> > 0xc41ecb simplify_context::simplify_subreg(machine_mode, rtx_def*,
> > machine_mode, poly_int<2u, unsigned long>)
> >         /home/mtsamis/src/gcc/gcc4/gcc/simplify-rtx.cc:7680
> > 0xc42d0f simplify_context::simplify_gen_subreg(machine_mode, rtx_def*,
> > machine_mode, poly_int<2u, unsigned long>)
> >         /home/mtsamis/src/gcc/gcc4/gcc/simplify-rtx.cc:8007
> > 0x8afb4b simplify_gen_subreg(machine_mode, rtx_def*, machine_mode,
> > poly_int<2u, unsigned long>)
> >         /home/mtsamis/src/gcc/gcc4/gcc/rtl.h:3562
> > 0x8afb4b force_subreg(machine_mode, rtx_def*, machine_mode,
> > poly_int<2u, unsigned long>)
> >         /home/mtsamis/src/gcc/gcc4/gcc/explow.cc:755
> > 0x8b9ef3 store_bit_field_1
> >         /home/mtsamis/src/gcc/gcc4/gcc/expmed.cc:810
> > 0x8ba327 store_bit_field(rtx_def*, poly_int<2u, unsigned long>,
> > poly_int<2u, unsigned long>, poly_int<2u, unsigned long>, poly_int<2u,
> > unsigned long>, machine_mode, rtx_def*, bool, bool)
> >         /home/mtsamis/src/gcc/gcc4/gcc/expmed.cc:1214
> > 0xc53c7f generate_bit_insert_sequence
> >         /home/mtsamis/src/gcc/gcc4/gcc/avoid-store-forwarding.cc:122
> > 0xc53c7f process_store_forwarding
> >         /home/mtsamis/src/gcc/gcc4/gcc/avoid-store-forwarding.cc:231
> > 0xc53c7f avoid_store_forwarding
> >         /home/mtsamis/src/gcc/gcc4/gcc/avoid-store-forwarding.cc:504
> > 0xc546eb execute
> >         /home/mtsamis/src/gcc/gcc4/gcc/avoid-store-forwarding.cc:571
>
> OK, thanks, I'll have a go at reproducing, but it sounds like it might
> be tricky. :)
>
> >> > So, although I don't like it and it complicates things, I had to add
> >> > recog_memoized pretty much everywhere and also make sure to only emit
> >> > if all of these are successful...
> >>
> >> Do you have testcases?  emit_move_insn's job is to make the move
> >> legitimate, so it seems like the checks might be papering over an issue
> >> elsewhere.
> >>
> > I had a quick look and couldn't find an existing testcase. I'll have
> > to recheck more thoroughly in order to create a testcase (if possible
> > at all).
> > I remember that in one case the rtx provided was zero_extend and there
> > was no instruction for that so recog would fail.
>
> Ah, ok.  That might have been the bug then.  emit_move_insn should only
> be used for moving one object to another, where an "object" can be a
> constant, memory or register (including subregs of registers).
> The advantage of using it is that it works out how to make the move
> valid, given target limitations.
>
> If the instruction can be more general than that then we should just use
> gen_rtx_SET, and recog the result.
>
> Thanks,
> Richard

Patch

diff --git a/gcc/Makefile.in b/gcc/Makefile.in
index 8fba8f7db6a..43675288399 100644
--- a/gcc/Makefile.in
+++ b/gcc/Makefile.in
@@ -1682,6 +1682,7 @@  OBJS = \
 	statistics.o \
 	stmt.o \
 	stor-layout.o \
+	avoid-store-forwarding.o \
 	store-motion.o \
 	streamer-hooks.o \
 	stringpool.o \
diff --git a/gcc/avoid-store-forwarding.cc b/gcc/avoid-store-forwarding.cc
new file mode 100644
index 00000000000..4a0343c0314
--- /dev/null
+++ b/gcc/avoid-store-forwarding.cc
@@ -0,0 +1,616 @@ 
+/* Avoid store forwarding optimization pass.
+   Copyright (C) 2024 Free Software Foundation, Inc.
+   Contributed by VRULL GmbH.
+
+   This file is part of GCC.
+
+   GCC is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by
+   the Free Software Foundation; either version 3, or (at your option)
+   any later version.
+
+   GCC is distributed in the hope that it will be useful, but
+   WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   General Public License for more details.
+
+   You should have received a copy of the GNU General Public License
+   along with GCC; see the file COPYING3.  If not see
+   <http://www.gnu.org/licenses/>.  */
+
+#include "avoid-store-forwarding.h"
+#include "config.h"
+#include "system.h"
+#include "coretypes.h"
+#include "backend.h"
+#include "target.h"
+#include "rtl.h"
+#include "alias.h"
+#include "rtlanal.h"
+#include "tree-pass.h"
+#include "cselib.h"
+#include "predict.h"
+#include "insn-config.h"
+#include "expmed.h"
+#include "recog.h"
+#include "regset.h"
+#include "df.h"
+#include "expr.h"
+#include "memmodel.h"
+#include "emit-rtl.h"
+#include "vec.h"
+
+/* This pass tries to detect and avoid cases of store forwarding.
+   On many processors there is a large penalty when smaller stores are
+   forwarded to larger loads.  The idea used to avoid the stall is to move
+   the store after the load and in addition emit a bit insert sequence so
+   the load register has the correct value.  For example the following:
+
+     strb    w2, [x1, 1]
+     ldr     x0, [x1]
+
+   Will be transformed to:
+
+     ldr     x0, [x1]
+     strb    w2, [x1]
+     bfi     x0, x2, 0, 8
+*/
+
+namespace {
+
+const pass_data pass_data_avoid_store_forwarding =
+{
+  RTL_PASS, /* type.  */
+  "avoid_store_forwarding", /* name.  */
+  OPTGROUP_NONE, /* optinfo_flags.  */
+  TV_NONE, /* tv_id.  */
+  0, /* properties_required.  */
+  0, /* properties_provided.  */
+  0, /* properties_destroyed.  */
+  0, /* todo_flags_start.  */
+  TODO_df_finish /* todo_flags_finish.  */
+};
+
+class pass_rtl_avoid_store_forwarding : public rtl_opt_pass
+{
+public:
+  pass_rtl_avoid_store_forwarding (gcc::context *ctxt)
+    : rtl_opt_pass (pass_data_avoid_store_forwarding, ctxt)
+  {}
+
+  /* opt_pass methods: */
+  virtual bool gate (function *)
+    {
+      return flag_avoid_store_forwarding && optimize >= 1
+	     && optimize_insn_for_speed_p ();
+    }
+
+  virtual unsigned int execute (function *) override;
+}; // class pass_rtl_avoid_store_forwarding
+
+static unsigned int stats_sf_detected = 0;
+static unsigned int stats_sf_avoided = 0;
+
+/* If EXPR is a SET that reads from memory then return the RTX for its MEM
+   argument, otherwise return NULL.  Allows MEM to be zero/sign extended.  */
+
+static rtx
+get_load_mem (rtx expr)
+{
+  if (!expr)
+    return NULL_RTX;
+
+  rtx mem = SET_SRC (expr);
+
+  if (GET_CODE (mem) == ZERO_EXTEND
+      || GET_CODE (mem) == SIGN_EXTEND)
+    mem = XEXP (mem, 0);
+
+  if (MEM_P (mem))
+    return mem;
+  else
+    return NULL_RTX;
+}
+
+/* Return true iff a store to STORE_MEM would write to a sub-region of the
+   bytes that LOAD_MEM would read.  If so, also store the relative byte
+   offset of the store within the load to OFF_VAL.  */
+
+static bool
+is_store_forwarding (rtx store_mem, rtx load_mem, HOST_WIDE_INT *off_val)
+{
+  if (known_ge (GET_MODE_SIZE (GET_MODE (store_mem)),
+		GET_MODE_SIZE (GET_MODE (load_mem))))
+    return false;
+
+  rtx off = simplify_gen_binary (MINUS, GET_MODE (XEXP (store_mem, 0)),
+				 XEXP (store_mem, 0), XEXP (load_mem, 0));
+
+  if (CONST_INT_P (off))
+    {
+      *off_val = INTVAL (off);
+      poly_uint64 store_mode_size = GET_MODE_SIZE (GET_MODE (store_mem));
+      poly_uint64 load_mode_size = GET_MODE_SIZE (GET_MODE (load_mem));
+      unsigned HOST_WIDE_INT store_size, load_size;
+      if (store_mode_size.is_constant (&store_size)
+	  && load_mode_size.is_constant (&load_size))
+	{
+	  gcc_checking_assert (store_size > 0 && load_size > 0);
+	  return *off_val >= 0
+		 && (store_size + *off_val <= load_size);
+	}
+    }
+
+  return false;
+}
+
+/* Return a bit insertion sequence that would make DEST have the correct value
+   if the store represented by STORE_INFO were moved after the load.  */
+
+static rtx_insn *
+generate_bit_insert_sequence (store_fwd_info *store_info, rtx dest,
+			      machine_mode load_inner_mode)
+{
+  poly_uint64 store_mode_size
+    = GET_MODE_SIZE (GET_MODE (store_info->store_mem));
+  unsigned HOST_WIDE_INT store_size = store_mode_size.to_constant ();
+
+  start_sequence ();
+  store_bit_field (dest, store_size * BITS_PER_UNIT,
+		   store_info->offset * BITS_PER_UNIT, 0, 0,
+		   GET_MODE (dest), store_info->mov_reg,
+		   false, false);
+
+  rtx_insn *insns = get_insns ();
+  unshare_all_rtl_in_chain (insns);
+  end_sequence ();
+
+  for (rtx_insn *insn = insns; insn; insn = NEXT_INSN (insn))
+    if (contains_mem_rtx_p (PATTERN (insn))
+	|| recog_memoized (insn) < 0)
+      return NULL;
+
+  return insns;
+}
+
+/* Given a list of small stores that are forwarded to LOAD_INSN, try to
+   rearrange them so that a store-forwarding penalty doesn't occur.  */
+
+static bool
+process_forwardings (vec<store_fwd_info> &stores, rtx_insn *load_insn)
+{
+  rtx load = single_set (load_insn);
+  rtx load_inner = get_load_mem (load);
+  machine_mode load_inner_mode = GET_MODE (load_inner);
+  poly_uint64 load_mode_size = GET_MODE_SIZE (load_inner_mode);
+  HOST_WIDE_INT load_size = load_mode_size.to_constant ();
+
+  /* If the stores cover all the bytes of the load without overlap then we can
+     eliminate the load entirely and use the computed value instead.  */
+
+  sbitmap forwarded_bytes = sbitmap_alloc (load_size);
+  bitmap_clear (forwarded_bytes);
+
+  unsigned int i;
+  store_fwd_info* it;
+  FOR_EACH_VEC_ELT (stores, i, it)
+    {
+      poly_uint64 store_mode_size = GET_MODE_SIZE (GET_MODE (it->store_mem));
+      HOST_WIDE_INT store_size = store_mode_size.to_constant ();
+      if (bitmap_bit_in_range_p (forwarded_bytes, it->offset,
+				 it->offset + store_size - 1))
+	break;
+      bitmap_set_range (forwarded_bytes, it->offset, store_size);
+    }
+
+  bitmap_not (forwarded_bytes, forwarded_bytes);
+  bool load_elim = bitmap_empty_p (forwarded_bytes);
+
+  stats_sf_detected++;
+
+  if (dump_file)
+    {
+      fprintf (dump_file, "Store forwarding%s detected:\n",
+	       (stores.length () > 1) ? "s" : "");
+
+      FOR_EACH_VEC_ELT (stores, i, it)
+	{
+	  fprintf (dump_file, "From: ");
+	  print_rtl_single (dump_file, it->store_insn);
+	}
+
+      fprintf (dump_file, "To: ");
+      print_rtl_single (dump_file, load_insn);
+
+      if (load_elim)
+	fprintf (dump_file, "(Load elimination candidate)\n");
+    }
+
+  rtx dest;
+  if (load_elim)
+    dest = gen_reg_rtx (load_inner_mode);
+  else
+    dest = SET_DEST (load);
+
+  int move_to_front = -1;
+  int total_cost = 0;
+
+  /* Check if we can emit bit insert instructions for all forwarded stores.  */
+  FOR_EACH_VEC_ELT (stores, i, it)
+    {
+      it->mov_reg = gen_reg_rtx (GET_MODE (it->store_mem));
+      rtx_insn *insns = NULL;
+
+      /* If we're eliminating the load then find the store with zero offset
+	 and use it as the base register to avoid a bit insert if possible.  */
+      if (load_elim && it->offset == 0
+	  && validate_subreg (GET_MODE (dest), GET_MODE (it->mov_reg),
+			      it->mov_reg, 0))
+	{
+	  start_sequence ();
+
+	  /* We can use a paradoxical subreg to force this to a wider mode, as
+	     the only use will be inserting the bits (i.e., we don't care about
+	     the value of the higher bits).  */
+	  rtx ext0 = gen_rtx_SUBREG (GET_MODE (dest), it->mov_reg, 0);
+	  rtx_insn *move0 = emit_move_insn (dest, ext0);
+	  if (recog_memoized (move0) >= 0)
+	    {
+	      insns = get_insns ();
+	      move_to_front = (int) i;
+	    }
+
+	  end_sequence ();
+	}
+
+      if (!insns)
+	insns = generate_bit_insert_sequence (&(*it), dest, load_inner_mode);
+
+      if (!insns)
+	{
+	  if (dump_file)
+	    {
+	      fprintf (dump_file, "Failed due to: ");
+	      print_rtl_single (dump_file, it->store_insn);
+	    }
+	  return false;
+	}
+
+      total_cost += seq_cost (insns, true);
+      it->bits_insert_insns = insns;
+
+      rtx store_set = single_set (it->store_insn);
+
+      /* Create a register move at the store's original position to save the
+	 stored value.  */
+      start_sequence ();
+      rtx mov1 = gen_move_insn (it->mov_reg, SET_SRC (store_set));
+      rtx_insn *insn1 = emit_insn (mov1);
+      end_sequence ();
+
+      if (recog_memoized (insn1) < 0)
+	{
+	  if (dump_file)
+	    {
+	      fprintf (dump_file, "Failed due to unrecognizable insn: ");
+	      print_rtl_single (dump_file, insn1);
+	    }
+	  return false;
+	}
+
+      it->save_store_value_insn = insn1;
+
+      /* Create a new store after the load with the saved original value.
+	 This avoids the forwarding stall.  */
+      start_sequence ();
+      rtx mov2 = gen_move_insn (SET_DEST (store_set), it->mov_reg);
+      rtx_insn *insn2 = emit_insn (mov2);
+      end_sequence ();
+
+      if (recog_memoized (insn2) < 0)
+	{
+	  if (dump_file)
+	    {
+	      fprintf (dump_file, "Failed due to unrecognizable insn: ");
+	      print_rtl_single (dump_file, insn2);
+	    }
+	  return false;
+	}
+
+      it->store_saved_value_insn = insn2;
+    }
+
+  if (load_elim)
+    total_cost -= COSTS_N_INSNS (1);
+
+  if (targetm.avoid_store_forwarding_p)
+    {
+      /* If a target-specific hook exists to decide profitability of the
+	 specific store-forwarding instance then use that as it should be
+	 the most accurate.  */
+      if (!targetm.avoid_store_forwarding_p (stores, load_inner, total_cost,
+					     load_elim))
+	{
+	  if (dump_file)
+	    fprintf (dump_file, "Not transformed due to target decision.\n");
+
+	  return false;
+	}
+    }
+  else
+    {
+      /* Otherwise use a simple cost heuristic based on
+	 param_store_forwarding_max_distance.  In general the distance should
+	 be somewhat correlated to the store forwarding penalty; if the penalty
+	 is large then it is justified to increase the window size.  Use this
+	 to reject sequences that are clearly unprofitable.  */
+      int max_cost = COSTS_N_INSNS (param_store_forwarding_max_distance / 2);
+      if (total_cost > max_cost)
+	{
+	  if (dump_file)
+	    fprintf (dump_file, "Not transformed due to cost: %d > %d.\n",
+		     total_cost, max_cost);
+
+	  return false;
+	}
+    }
+
+  /* If we have a move instead of bit insert, it needs to be emitted first in
+     the resulting sequence.  */
+  if (move_to_front != -1)
+    {
+      store_fwd_info copy = stores[move_to_front];
+      stores.safe_push (copy);
+      stores.ordered_remove (move_to_front);
+    }
+
+  if (load_elim)
+    {
+      machine_mode outer_mode = GET_MODE (SET_DEST (load));
+      rtx_code extend = ZERO_EXTEND;
+      if (outer_mode != load_inner_mode)
+	extend = GET_CODE (SET_SRC (load));
+
+      start_sequence ();
+      rtx load_value = simplify_gen_unary (extend, outer_mode, dest,
+					   load_inner_mode);
+      rtx load_move = gen_move_insn (SET_DEST (load), load_value);
+      rtx_insn *insn = emit_insn (load_move);
+      rtx_insn *seq = get_insns ();
+      end_sequence ();
+
+      if (recog_memoized (insn) < 0)
+	return false;
+
+      df_insn_rescan (emit_insn_after (seq, load_insn));
+    }
+
+  if (dump_file)
+    {
+      fprintf (dump_file, "Store forwarding%s avoided with bit inserts:\n",
+	       (stores.length () > 1) ? "s" : "");
+
+      FOR_EACH_VEC_ELT (stores, i, it)
+	{
+	  if (stores.length () > 1)
+	    {
+	      fprintf (dump_file, "For: ");
+	      print_rtl_single (dump_file, it->store_insn);
+	    }
+
+	  fprintf (dump_file, "With sequence:\n");
+
+	  for (rtx_insn *insn = it->bits_insert_insns; insn;
+	       insn = NEXT_INSN (insn))
+	    {
+	      fprintf (dump_file, "  ");
+	      print_rtl_single (dump_file, insn);
+	    }
+	}
+    }
+
+  stats_sf_avoided++;
+
+  /* Done, emit all the generated instructions and delete the stores.  */
+
+  FOR_EACH_VEC_ELT (stores, i, it)
+    df_insn_rescan (emit_insn_after (it->bits_insert_insns, load_insn));
+
+  FOR_EACH_VEC_ELT (stores, i, it)
+    {
+      df_insn_rescan (emit_insn_before (it->save_store_value_insn,
+					it->store_insn));
+      df_insn_rescan (emit_insn_after (it->store_saved_value_insn, load_insn));
+      set_insn_deleted (it->store_insn);
+    }
+
+  df_insn_rescan (load_insn);
+
+  if (load_elim)
+    set_insn_deleted (load_insn);
+
+  return true;
+}
+
+/* Process BB for expensive store forwardings.  */
+
+static void
+avoid_store_forwarding (basic_block bb)
+{
+  auto_vec<store_fwd_info, 8> store_exprs;
+  rtx_insn *insn;
+  unsigned int insn_cnt = 0;
+
+  FOR_BB_INSNS (bb, insn)
+    {
+      if (!NONDEBUG_INSN_P (insn))
+	continue;
+
+      rtx set = single_set (insn);
+
+      /* Store forwarding issues are unlikely if we cross a call.
+	 Clear store forwarding candidates if we can't deal with INSN.  */
+      if (CALL_P (insn) || !set || volatile_refs_p (set)
+	  || GET_MODE (SET_DEST (set)) == BLKmode
+	  || (MEM_P (SET_DEST (set)) && MEM_P (SET_SRC (set))))
+	{
+	  store_exprs.truncate (0);
+	  continue;
+	}
+
+      rtx load_mem = get_load_mem (set);
+      int removed_count = 0;
+
+      if (MEM_P (SET_DEST (set)))
+	{
+	  /* Record store forwarding candidate.  */
+	  store_fwd_info info;
+	  info.store_insn = insn;
+	  info.store_mem = SET_DEST (set);
+	  info.insn_cnt = insn_cnt;
+	  info.remove = false;
+	  info.forwarded = false;
+	  store_exprs.safe_push (info);
+	}
+      else if (load_mem)
+	{
+	  /* Process load for possible store forwardings.  */
+	  auto_vec<store_fwd_info> forwardings;
+	  bool partial_forwarding = false;
+	  bool remove_rest = false;
+
+	  unsigned int i;
+	  store_fwd_info *it;
+	  FOR_EACH_VEC_ELT_REVERSE (store_exprs, i, it)
+	    {
+	      rtx store_mem = it->store_mem;
+	      HOST_WIDE_INT off_val;
+
+	      if (remove_rest)
+		{
+		  it->remove = true;
+		  removed_count++;
+		}
+	      else if (is_store_forwarding (store_mem, load_mem, &off_val))
+		{
+		  /* Check if moving this store after the load is legal.  */
+		  bool write_dep = false;
+		  for (unsigned int j = store_exprs.length () - 1; j != i; j--)
+		    if (!store_exprs[j].forwarded
+			&& output_dependence (store_mem,
+					      store_exprs[j].store_mem))
+		      {
+			write_dep = true;
+			break;
+		      }
+
+		  if (!write_dep)
+		    {
+		      it->forwarded = true;
+		      it->offset = off_val;
+		      forwardings.safe_push (*it);
+		    }
+		  else
+		    partial_forwarding = true;
+
+		  it->remove = true;
+		  removed_count++;
+		}
+	      else if (true_dependence (store_mem, GET_MODE (store_mem),
+					load_mem))
+		{
+		  /* We cannot keep a store forwarding candidate if it possibly
+		     interferes with this load.  */
+		  it->remove = true;
+		  removed_count++;
+		  remove_rest = true;
+		}
+	    }
+
+	  if (!forwardings.is_empty () && !partial_forwarding)
+	    process_forwardings (forwardings, insn);
+	}
+      else
+	{
+	  rtx reg = SET_DEST (set);
+
+	  while (GET_CODE (reg) == ZERO_EXTRACT
+		|| GET_CODE (reg) == STRICT_LOW_PART
+		|| GET_CODE (reg) == SUBREG)
+	    reg = XEXP (reg, 0);
+
+	  /* Drop store forwarding candidates when the address register is
+	     overwritten.  */
+	  if (REG_P (reg))
+	    {
+	      bool remove_rest = false;
+	      unsigned int i;
+	      store_fwd_info *it;
+	      FOR_EACH_VEC_ELT_REVERSE (store_exprs, i, it)
+		{
+		  if (remove_rest
+		      || reg_overlap_mentioned_p (reg, it->store_mem))
+		    {
+		      it->remove = true;
+		      removed_count++;
+		      remove_rest = true;
+		    }
+		}
+	    }
+	  else
+	    {
+	      /* We can't understand INSN.  */
+	      store_exprs.truncate (0);
+	      continue;
+	    }
+	}
+
+      if (removed_count)
+	{
+	  unsigned int i, j;
+	  store_fwd_info *it;
+	  VEC_ORDERED_REMOVE_IF (store_exprs, i, j, it, it->remove);
+	}
+
+      /* Don't consider store forwarding if the RTL instruction distance is
+	 more than PARAM_STORE_FORWARDING_MAX_DISTANCE.  */
+      if (!store_exprs.is_empty ()
+	  && (store_exprs[0].insn_cnt
+	      + param_store_forwarding_max_distance <= insn_cnt))
+	store_exprs.ordered_remove (0);
+
+      insn_cnt++;
+    }
+}
+
+unsigned int
+pass_rtl_avoid_store_forwarding::execute (function *fn)
+{
+  df_set_flags (DF_DEFER_INSN_RESCAN);
+
+  init_alias_analysis ();
+
+  stats_sf_detected = 0;
+  stats_sf_avoided = 0;
+
+  basic_block bb;
+  FOR_EACH_BB_FN (bb, fn)
+    avoid_store_forwarding (bb);
+
+  end_alias_analysis ();
+
+  statistics_counter_event (fn, "Store forwardings detected: ",
+			    stats_sf_detected);
+  statistics_counter_event (fn, "Store forwardings avoided: ",
+			    stats_sf_avoided);
+
+  return 0;
+}
+
+} // anon namespace.
+
+rtl_opt_pass *
+make_pass_rtl_avoid_store_forwarding (gcc::context *ctxt)
+{
+  return new pass_rtl_avoid_store_forwarding (ctxt);
+}
diff --git a/gcc/avoid-store-forwarding.h b/gcc/avoid-store-forwarding.h
new file mode 100644
index 00000000000..55a0c97f008
--- /dev/null
+++ b/gcc/avoid-store-forwarding.h
@@ -0,0 +1,56 @@ 
+/* Avoid store forwarding optimization pass.
+   Copyright (C) 2024 Free Software Foundation, Inc.
+   Contributed by VRULL GmbH.
+
+   This file is part of GCC.
+
+   GCC is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by
+   the Free Software Foundation; either version 3, or (at your option)
+   any later version.
+
+   GCC is distributed in the hope that it will be useful, but
+   WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   General Public License for more details.
+
+   You should have received a copy of the GNU General Public License
+   along with GCC; see the file COPYING3.  If not see
+   <http://www.gnu.org/licenses/>.  */
+
+#ifndef GCC_AVOID_STORE_FORWARDING_H
+#define GCC_AVOID_STORE_FORWARDING_H
+
+#include "config.h"
+#include "system.h"
+#include "coretypes.h"
+#include "backend.h"
+#include "rtl.h"
+
+struct store_fwd_info
+{
+  /* The store instruction that is a store forwarding candidate.  */
+  rtx_insn *store_insn;
+  /* SET_DEST (single_set (store_insn)).  */
+  rtx store_mem;
+  /* The temporary that will hold the stored value at the original store
+     position.  */
+  rtx mov_reg;
+  /* The instruction sequence that inserts the stored value's bits at the
+     appropriate position in the loaded value.  */
+  rtx_insn *bits_insert_insns;
+  /* An instruction that saves the store's value in a register temporarily,
+     (set (reg X) (SET_SRC (store_insn))).  */
+  rtx_insn *save_store_value_insn;
+  /* An instruction that stores the saved value back to memory,
+     (set (SET_DEST (store_insn)) (reg X)).  */
+  rtx_insn *store_saved_value_insn;
+  /* The byte offset for the store's position within the load.  */
+  HOST_WIDE_INT offset;
+
+  unsigned int insn_cnt;
+  bool remove;
+  bool forwarded;
+};
+
+#endif  /* GCC_AVOID_STORE_FORWARDING_H  */
diff --git a/gcc/common.opt b/gcc/common.opt
index ea39f87ae71..7d080fbc513 100644
--- a/gcc/common.opt
+++ b/gcc/common.opt
@@ -1751,6 +1751,10 @@  fgcse-sm
 Common Var(flag_gcse_sm) Init(0) Optimization
 Perform store motion after global common subexpression elimination.
 
+favoid-store-forwarding
+Common Var(flag_avoid_store_forwarding) Init(0) Optimization
+Try to avoid store forwarding.
+
 fgcse-las
 Common Var(flag_gcse_las) Init(0) Optimization
 Perform redundant load after store elimination in global common subexpression
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 27539a01785..4f8c145ff1f 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -12797,6 +12797,15 @@  loop unrolling.
 This option is enabled by default at optimization levels @option{-O1},
 @option{-O2}, @option{-O3}, @option{-Os}.
 
+@opindex favoid-store-forwarding
+@item -favoid-store-forwarding
+@itemx -fno-avoid-store-forwarding
+Many CPUs stall for several cycles when a load partially depends on previous
+smaller stores.  This pass tries to detect such cases and avoid the penalty
+by reordering the load and store and then fixing up the loaded value.
+
+Disabled by default.
+
 @opindex ffp-contract
 @item -ffp-contract=@var{style}
 @option{-ffp-contract=off} disables floating-point expression contraction.
diff --git a/gcc/doc/passes.texi b/gcc/doc/passes.texi
index 4ac7a2306a1..639f6b325c8 100644
--- a/gcc/doc/passes.texi
+++ b/gcc/doc/passes.texi
@@ -925,6 +925,14 @@  and addressing mode selection.  The pass is run twice, with values
 being propagated into loops only on the second run.  The code is
 located in @file{fwprop.cc}.
 
+@item Store forwarding avoidance
+
+This pass attempts to reduce the overhead of store to load forwarding.
+It detects when a load reads from one or more previous smaller stores and
+then reorders the instructions so that the stores execute after the load.  The loaded
+value is adjusted with a series of bit insert instructions so that it stays
+the same.  The code is located in @file{avoid-store-forwarding.cc}.
+
 @item Common subexpression elimination
 
 This pass removes redundant computation within basic blocks, and
diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
index cc33084ed32..47ada0c6bf5 100644
--- a/gcc/doc/tm.texi
+++ b/gcc/doc/tm.texi
@@ -7348,6 +7348,15 @@  the @code{POLY_VALUE_MIN}, @code{POLY_VALUE_MAX} and
 implementation returns the lowest possible value of @var{val}.
 @end deftypefn
 
+@deftypefn {Target Hook} bool TARGET_AVOID_STORE_FORWARDING_P (vec<store_fwd_info>, @var{rtx}, @var{int}, @var{bool})
+This hook, if defined, enables the use of the avoid-store-forwarding pass.
+Given a list of stores and a load instruction that reads from the location
+of the stores, this hook decides if it's profitable to emit additional code
+to avoid a potential store forwarding stall.  The additional instructions
+needed, the sequence cost and other relevant information are given in
+the arguments so that the target can make an informed decision.
+@end deftypefn
+
 @node Scheduling
 @section Adjusting the Instruction Scheduler
 
diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
index 8af3f414505..73ae1a1b8fe 100644
--- a/gcc/doc/tm.texi.in
+++ b/gcc/doc/tm.texi.in
@@ -4745,6 +4745,8 @@  Define this macro if a non-short-circuit operation produced by
 
 @hook TARGET_ESTIMATED_POLY_VALUE
 
+@hook TARGET_AVOID_STORE_FORWARDING_P
+
 @node Scheduling
 @section Adjusting the Instruction Scheduler
 
diff --git a/gcc/params.opt b/gcc/params.opt
index c17ba17b91b..84a32b28b4a 100644
--- a/gcc/params.opt
+++ b/gcc/params.opt
@@ -1032,6 +1032,10 @@  Allow the store merging pass to introduce unaligned stores if it is legal to do
 Common Joined UInteger Var(param_store_merging_max_size) Init(65536) IntegerRange(1, 65536) Param Optimization
 Maximum size of a single store merging region in bytes.
 
+-param=store-forwarding-max-distance=
+Common Joined UInteger Var(param_store_forwarding_max_distance) Init(10) IntegerRange(1, 1000) Param Optimization
Maximum distance, counted in instructions, between a smaller store and a larger load for the store to be considered a store-forwarding candidate.
+
 -param=switch-conversion-max-branch-ratio=
 Common Joined UInteger Var(param_switch_conversion_branch_ratio) Init(8) IntegerRange(1, 65536) Param Optimization
 The maximum ratio between array size and switch branches for a switch conversion to take place.
diff --git a/gcc/passes.def b/gcc/passes.def
index b06d6d45f63..7f7fd96e336 100644
--- a/gcc/passes.def
+++ b/gcc/passes.def
@@ -506,6 +506,7 @@  along with GCC; see the file COPYING3.  If not see
       NEXT_PASS (pass_sms);
       NEXT_PASS (pass_live_range_shrinkage);
       NEXT_PASS (pass_sched);
+      NEXT_PASS (pass_rtl_avoid_store_forwarding);
       NEXT_PASS (pass_early_remat);
       NEXT_PASS (pass_ira);
       NEXT_PASS (pass_reload);
diff --git a/gcc/target.def b/gcc/target.def
index 1d0ea6f30ca..c90ec9c7edd 100644
--- a/gcc/target.def
+++ b/gcc/target.def
@@ -6959,6 +6959,17 @@  HOOK_VECTOR_END (shrink_wrap)
 #undef HOOK_PREFIX
 #define HOOK_PREFIX "TARGET_"
 
+DEFHOOK
+(avoid_store_forwarding_p,
+ "This hook, if defined, enables the use of the avoid-store-forwarding pass.\n\
+Given a list of stores and a load instruction that reads from the location\n\
+of the stores, this hook decides if it's profitable to emit additional code\n\
+to avoid a potential store forwarding stall.  The additional instructions\n\
+needed, the sequence cost and other relevant information are given in\n\
+the arguments so that the target can make an informed decision.",
+ bool, (vec<store_fwd_info>, rtx, int, bool),
+ NULL)
+
 /* Determine the type of unwind info to emit for debugging.  */
 DEFHOOK
 (debug_unwind_info,
diff --git a/gcc/target.h b/gcc/target.h
index 837651d273a..2f59f3c6a80 100644
--- a/gcc/target.h
+++ b/gcc/target.h
@@ -165,6 +165,9 @@  class function_arg_info;
 /* This is defined in function-abi.h.  */
 class predefined_function_abi;
 
+/* This is defined in avoid-store-forwarding.h.  */
+struct store_fwd_info;
+
 /* These are defined in tree-vect-stmts.cc.  */
 extern tree stmt_vectype (class _stmt_vec_info *);
 extern bool stmt_in_inner_loop_p (class vec_info *, class _stmt_vec_info *);
diff --git a/gcc/testsuite/gcc.target/aarch64/avoid-store-forwarding-1.c b/gcc/testsuite/gcc.target/aarch64/avoid-store-forwarding-1.c
new file mode 100644
index 00000000000..4712f101d5e
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/avoid-store-forwarding-1.c
@@ -0,0 +1,28 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2 -favoid-store-forwarding -fdump-rtl-avoid_store_forwarding" } */
+
+typedef union {
+    char arr_8[8];
+    long long_value;
+} DataUnion;
+
+long ssll_1 (DataUnion *data, char x)
+{
+  data->arr_8[0] = x;
+  return data->long_value;
+}
+
+long ssll_2 (DataUnion *data, char x)
+{
+  data->arr_8[1] = x;
+  return data->long_value;
+}
+
+long ssll_3 (DataUnion *data, char x)
+{
+  data->arr_8[7] = x;
+  return data->long_value;
+}
+
+/* { dg-final { scan-rtl-dump-times "Store forwarding detected" 3 "avoid_store_forwarding" } } */
+/* { dg-final { scan-rtl-dump-times "Store forwarding avoided" 3 "avoid_store_forwarding" } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/avoid-store-forwarding-2.c b/gcc/testsuite/gcc.target/aarch64/avoid-store-forwarding-2.c
new file mode 100644
index 00000000000..b958612173b
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/avoid-store-forwarding-2.c
@@ -0,0 +1,39 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2 -favoid-store-forwarding -fdump-rtl-avoid_store_forwarding" } */
+
+typedef union {
+    char arr_8[8];
+    int long_value;
+} DataUnion1;
+
+long no_ssll_1 (DataUnion1 *data, char x)
+{
+  data->arr_8[4] = x;
+  return data->long_value;
+}
+
+long no_ssll_2 (DataUnion1 *data, char x)
+{
+  data->arr_8[5] = x;
+  return data->long_value;
+}
+
+typedef union {
+    char arr_8[8];
+    short long_value[4];
+} DataUnion2;
+
+long no_ssll_3 (DataUnion2 *data, char x)
+{
+  data->arr_8[4] = x;
+  return data->long_value[1];
+}
+
+long no_ssll_4 (DataUnion2 *data, char x)
+{
+  data->arr_8[0] = x;
+  return data->long_value[1];
+}
+
+/* { dg-final { scan-rtl-dump-times "Store forwarding detected" 0 "avoid_store_forwarding" } } */
+/* { dg-final { scan-rtl-dump-times "Store forwarding avoided" 0 "avoid_store_forwarding" } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/avoid-store-forwarding-3.c b/gcc/testsuite/gcc.target/aarch64/avoid-store-forwarding-3.c
new file mode 100644
index 00000000000..f4814c1a4d2
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/avoid-store-forwarding-3.c
@@ -0,0 +1,31 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2 -favoid-store-forwarding -fdump-rtl-avoid_store_forwarding" } */
+
+typedef union {
+    char arr_8[8];
+    long long_value;
+} DataUnion;
+
+long ssll_multi_1 (DataUnion *data, char x)
+{
+  data->arr_8[0] = x;
+  data->arr_8[2] = x;
+  return data->long_value;
+}
+
+long ssll_multi_2 (DataUnion *data, char x)
+{
+  data->arr_8[0] = x;
+  data->arr_8[1] = 11;
+  return data->long_value;
+}
+
+long ssll_multi_3 (DataUnion *data, char x, short y)
+{
+  data->arr_8[1] = x;
+  __builtin_memcpy(data->arr_8 + 4, &y, sizeof(short));
+  return data->long_value;
+}
+
+/* { dg-final { scan-rtl-dump-times "Store forwardings detected" 3 "avoid_store_forwarding" } } */
+/* { dg-final { scan-rtl-dump-times "Store forwardings avoided" 3 "avoid_store_forwarding" } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/avoid-store-forwarding-4.c b/gcc/testsuite/gcc.target/aarch64/avoid-store-forwarding-4.c
new file mode 100644
index 00000000000..db3c6b6630b
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/avoid-store-forwarding-4.c
@@ -0,0 +1,23 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2 -favoid-store-forwarding -fdump-rtl-avoid_store_forwarding" } */
+
+typedef int v4si __attribute__ ((vector_size (16)));
+
+typedef union {
+    char arr_16[16];
+    v4si vec_value;
+} DataUnion;
+
+v4si ssll_vect_1 (DataUnion *data, char x)
+{
+  data->arr_16[0] = x;
+  return data->vec_value;
+}
+
+v4si ssll_vect_2 (DataUnion *data, int x)
+{
+  __builtin_memcpy(data->arr_16 + 4, &x, sizeof(int));
+  return data->vec_value;
+}
+
+/* { dg-final { scan-rtl-dump-times "Store forwarding detected" 2 "avoid_store_forwarding" } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/avoid-store-forwarding-5.c b/gcc/testsuite/gcc.target/aarch64/avoid-store-forwarding-5.c
new file mode 100644
index 00000000000..2522df56905
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/avoid-store-forwarding-5.c
@@ -0,0 +1,38 @@ 
+/* { dg-do compile { target int128 } } */
+/* { dg-options "-O2 -favoid-store-forwarding -fdump-rtl-avoid_store_forwarding" } */
+
+typedef float v4f __attribute__ ((vector_size (16)));
+
+typedef union {
+    float arr_2[4];
+    long long_value;
+    __int128 longlong_value;
+    v4f vec_value;
+} DataUnion;
+
+long ssll_load_elim_1 (DataUnion *data, float x)
+{
+  data->arr_2[0] = x;
+  data->arr_2[1] = 0.0f;
+  return data->long_value;
+}
+
+__int128 ssll_load_elim_2 (DataUnion *data, float x)
+{
+  data->arr_2[0] = x;
+  data->arr_2[1] = 0.0f;
+  data->arr_2[2] = x;
+  data->arr_2[3] = 0.0f;
+  return data->longlong_value;
+}
+
+v4f ssll_load_elim_3 (DataUnion *data, float x)
+{
+  data->arr_2[3] = x;
+  data->arr_2[2] = x;
+  data->arr_2[1] = x;
+  data->arr_2[0] = x;
+  return data->vec_value;
+}
+
+/* { dg-final { scan-rtl-dump-times "Store forwardings detected" 3 "avoid_store_forwarding" } } */
diff --git a/gcc/tree-pass.h b/gcc/tree-pass.h
index 3a0cf13089e..75b1d9cbae4 100644
--- a/gcc/tree-pass.h
+++ b/gcc/tree-pass.h
@@ -572,6 +572,7 @@  extern rtl_opt_pass *make_pass_rtl_dse3 (gcc::context *ctxt);
 extern rtl_opt_pass *make_pass_rtl_cprop (gcc::context *ctxt);
 extern rtl_opt_pass *make_pass_rtl_pre (gcc::context *ctxt);
 extern rtl_opt_pass *make_pass_rtl_hoist (gcc::context *ctxt);
+extern rtl_opt_pass *make_pass_rtl_avoid_store_forwarding (gcc::context *ctxt);
 extern rtl_opt_pass *make_pass_rtl_store_motion (gcc::context *ctxt);
 extern rtl_opt_pass *make_pass_cse_after_global_opts (gcc::context *ctxt);
 extern rtl_opt_pass *make_pass_rtl_ifcvt (gcc::context *ctxt);