diff mbox series

[rs6000,middle-end] v10: Add implementation for different targets for pair mem fusion

Message ID d878c99b-b8b7-4738-ae30-ce127fd98f08@linux.ibm.com
State New
Headers show
Series [rs6000,middle-end] v10: Add implementation for different targets for pair mem fusion | expand

Commit Message

Ajit Agarwal Sept. 3, 2024, 9:54 a.m. UTC
Hello Richard:

This patch addresses all the review comments.
It also fix the arm build failure.

Common infrastructure using generic code for pair mem fusion of different
targets.

rs6000 target specific code implement virtual functions defined by generic code.

Target specific code are added in rs6000-mem-fusion.cc.

Bootstrapped and regtested on powepc64-linux-gnu.

Thanks & Regards
Ajit


rs6000, middle-end: Add implementation for different targets for pair mem fusion

Common infrastructure using generic code for pair mem fusion of different
targets.

rs6000 target specific code implement virtual functions defined by generic code.

Target specific code are added in rs6000-mem-fusion.cc.

2024-09-03  Ajit Kumar Agarwal  <aagarwa1@linux.ibm.com>

gcc/ChangeLog:

	* config/rs6000/rs6000-passes.def: New mem fusion pass
	before pass_early_remat.
	* pair-fusion.h: Add additional pure virtual function
	required for rs6000 target implementation.
	* pair-fusion.cc: Use of virtual functions for additional
	virtual function addded for rs6000 target.
	* config/rs6000/rs6000-mem-fusion.cc: Add new pass.
	Add target specific implementation for generic pure virtual
	functions.
	* config/rs6000/mma.md: Modify movoo machine description.
	Add new machine description movoo1.
	* config/rs6000/rs6000.cc: Modify rs6000_split_multireg_move
	to expand movoo machine description for all constraints.
	* config.gcc: Add new object file.
	* config/rs6000/rs6000-protos.h: Add new prototype for mem
	fusion pass.
	* config/rs6000/t-rs6000: Add new rule.
	* rtl-ssa/functions.h: Move out allocate function from private
	to public and add get_m_temp_defs function.

gcc/testsuite/ChangeLog:

	* g++.target/powerpc/mem-fusion.C: New test.
	* g++.target/powerpc/mem-fusion-1.C: New test.
	* gcc.target/powerpc/mma-builtin-1.c: Modify test.
---
 gcc/config.gcc                                |   2 +
 gcc/config/rs6000/mma.md                      |  26 +-
 gcc/config/rs6000/rs6000-mem-fusion.cc        | 695 ++++++++++++++++++
 gcc/config/rs6000/rs6000-passes.def           |   4 +-
 gcc/config/rs6000/rs6000-protos.h             |   1 +
 gcc/config/rs6000/rs6000.cc                   |  59 +-
 gcc/config/rs6000/rs6000.md                   |   1 +
 gcc/config/rs6000/t-rs6000                    |   5 +
 gcc/pair-fusion.cc                            |  48 +-
 gcc/pair-fusion.h                             |  48 ++
 gcc/rtl-ssa/access-utils.h                    |   2 +
 gcc/rtl-ssa/changes.cc                        | 122 ++-
 gcc/rtl-ssa/functions.h                       |  28 +-
 .../g++.target/powerpc/mem-fusion-1.C         |  22 +
 gcc/testsuite/g++.target/powerpc/mem-fusion.C |  15 +
 .../gcc.target/powerpc/mma-builtin-1.c        |   4 +-
 16 files changed, 1022 insertions(+), 60 deletions(-)
 create mode 100644 gcc/config/rs6000/rs6000-mem-fusion.cc
 create mode 100644 gcc/testsuite/g++.target/powerpc/mem-fusion-1.C
 create mode 100644 gcc/testsuite/g++.target/powerpc/mem-fusion.C

Comments

Alex Coplan Nov. 1, 2024, 5:16 p.m. UTC | #1
Hi Ajit,

Sorry for not reviewing this sooner.  For a while I thought this patch
was with Richard S to look at the rtl-ssa changes, but we've now agreed
that I should take a look at it.

I noticed that this breaks the build on aarch64.  I suspect you got an
email from the Linaro CI about this?  You can see on patchwork that the
Linaro CI build failed on aarch64:
https://patchwork.sourceware.org/project/gcc/patch/d878c99b-b8b7-4738-ae30-ce127fd98f08@linux.ibm.com/
unfortunately I'm told the CI only keeps builds around for 30 days, so
you'll need to try your own build to reproduce the failure, but that
should be fairly straightforward (just building a cc1 cross is enough).

The main issue leading to the build failure is the introduction of new
pure virtual functions in pair-fusion.h without corresponding
implementations in aarch64-ldp-fusion.cc:aarch64_pair_fusion.

I see the following diagnostics:

$SRC/gcc/config/aarch64/aarch64-ldp-fusion.cc: In member function ‘virtual unsigned int {anonymous}::pass_ldp_fusion::execute(function*)’:
$SRC/gcc/config/aarch64/aarch64-ldp-fusion.cc:302:27: error: cannot declare variable ‘pass’ to be of abstract type ‘aarch64_pair_fusion’
  302 |       aarch64_pair_fusion pass;
      |                           ^~~~
$SRC/gcc/config/aarch64/aarch64-ldp-fusion.cc:38:8: note:   because the following virtual functions are pure within ‘aarch64_pair_fusion’:
   38 | struct aarch64_pair_fusion : public pair_fusion
      |        ^~~~~~~~~~~~~~~~~~~
In file included from
$SRC/gcc/config/aarch64/aarch64-ldp-fusion.cc:31:
$SRC/gcc/pair-fusion.h:151:16: note:     ‘virtual void pair_fusion::change_existing_multword_mode(rtx_insn*)’
  151 |   virtual void change_existing_multword_mode (rtx_insn *insn) = 0;
      |                ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
$SRC/gcc/pair-fusion.h:156:16: note:     ‘virtual void pair_fusion::modify_new_rtx_insn(rtl_ssa::insn_info*, obstack_watermark*, rtl_ssa::insn_change**, auto_vec<rtl_ssa::insn_change*>&)’
  156 |   virtual void modify_new_rtx_insn (rtl_ssa::insn_info *first,
      |                ^~~~~~~~~~~~~~~~~~~
[... several more complaints about new pure virtuals ...]

so either these will need to be given default implementations (i.e. not pure
virtuals) or you will need to add a specific implementation in
aarch64-ldp-fusion.cc.  Most likely the former.

Please try to make sure any future versions of the patch build and test
cleanly on aarch64-linux-gnu.

More comments below.

The most important comment is the one below the call to set_multiword_subreg in
pair-fusion.cc, please pay particular attention to that.  Hopefully if you
re-work things along those lines, you should be able to drop most (if not all)
of your rtl-ssa changes.

On 03/09/2024 15:24, Ajit Agarwal wrote:
> Hello Richard:
> 
> This patch addresses all the review comments.
> It also fix the arm build failure.
> 
> Common infrastructure using generic code for pair mem fusion of different
> targets.
> 
> rs6000 target specific code implement virtual functions defined by generic code.
> 
> Target specific code are added in rs6000-mem-fusion.cc.
> 
> Bootstrapped and regtested on powepc64-linux-gnu.
> 
> Thanks & Regards
> Ajit
> 
> 
> rs6000, middle-end: Add implementation for different targets for pair mem fusion
> 
> Common infrastructure using generic code for pair mem fusion of different
> targets.
> 
> rs6000 target specific code implement virtual functions defined by generic code.
> 
> Target specific code are added in rs6000-mem-fusion.cc.
> 
> 2024-09-03  Ajit Kumar Agarwal  <aagarwa1@linux.ibm.com>
> 
> gcc/ChangeLog:
> 
> 	* config/rs6000/rs6000-passes.def: New mem fusion pass
> 	before pass_early_remat.
> 	* pair-fusion.h: Add additional pure virtual function
> 	required for rs6000 target implementation.
> 	* pair-fusion.cc: Use of virtual functions for additional
> 	virtual function addded for rs6000 target.
> 	* config/rs6000/rs6000-mem-fusion.cc: Add new pass.
> 	Add target specific implementation for generic pure virtual
> 	functions.
> 	* config/rs6000/mma.md: Modify movoo machine description.
> 	Add new machine description movoo1.
> 	* config/rs6000/rs6000.cc: Modify rs6000_split_multireg_move
> 	to expand movoo machine description for all constraints.
> 	* config.gcc: Add new object file.
> 	* config/rs6000/rs6000-protos.h: Add new prototype for mem
> 	fusion pass.
> 	* config/rs6000/t-rs6000: Add new rule.
> 	* rtl-ssa/functions.h: Move out allocate function from private
> 	to public and add get_m_temp_defs function.
> 
> gcc/testsuite/ChangeLog:
> 
> 	* g++.target/powerpc/mem-fusion.C: New test.
> 	* g++.target/powerpc/mem-fusion-1.C: New test.
> 	* gcc.target/powerpc/mma-builtin-1.c: Modify test.
> ---
>  gcc/config.gcc                                |   2 +
>  gcc/config/rs6000/mma.md                      |  26 +-
>  gcc/config/rs6000/rs6000-mem-fusion.cc        | 695 ++++++++++++++++++
>  gcc/config/rs6000/rs6000-passes.def           |   4 +-
>  gcc/config/rs6000/rs6000-protos.h             |   1 +
>  gcc/config/rs6000/rs6000.cc                   |  59 +-
>  gcc/config/rs6000/rs6000.md                   |   1 +
>  gcc/config/rs6000/t-rs6000                    |   5 +
>  gcc/pair-fusion.cc                            |  48 +-
>  gcc/pair-fusion.h                             |  48 ++
>  gcc/rtl-ssa/access-utils.h                    |   2 +
>  gcc/rtl-ssa/changes.cc                        | 122 ++-
>  gcc/rtl-ssa/functions.h                       |  28 +-
>  .../g++.target/powerpc/mem-fusion-1.C         |  22 +
>  gcc/testsuite/g++.target/powerpc/mem-fusion.C |  15 +
>  .../gcc.target/powerpc/mma-builtin-1.c        |   4 +-
>  16 files changed, 1022 insertions(+), 60 deletions(-)
>  create mode 100644 gcc/config/rs6000/rs6000-mem-fusion.cc
>  create mode 100644 gcc/testsuite/g++.target/powerpc/mem-fusion-1.C
>  create mode 100644 gcc/testsuite/g++.target/powerpc/mem-fusion.C
> 
> diff --git a/gcc/config.gcc b/gcc/config.gcc
> index 08291f4b6e0..c043abdb871 100644
> --- a/gcc/config.gcc
> +++ b/gcc/config.gcc
> @@ -530,6 +530,7 @@ powerpc*-*-*)
>  	extra_objs="rs6000-string.o rs6000-p8swap.o rs6000-logue.o"
>  	extra_objs="${extra_objs} rs6000-call.o rs6000-pcrel-opt.o"
>  	extra_objs="${extra_objs} rs6000-builtins.o rs6000-builtin.o"
> +	extra_objs="${extra_objs} rs6000-mem-fusion.o"
>  	extra_headers="ppc-asm.h altivec.h htmintrin.h htmxlintrin.h"
>  	extra_headers="${extra_headers} bmi2intrin.h bmiintrin.h"
>  	extra_headers="${extra_headers} xmmintrin.h mm_malloc.h emmintrin.h"
> @@ -566,6 +567,7 @@ rs6000*-*-*)
>  	extra_options="${extra_options} g.opt fused-madd.opt rs6000/rs6000-tables.opt"
>  	extra_objs="rs6000-string.o rs6000-p8swap.o rs6000-logue.o"
>  	extra_objs="${extra_objs} rs6000-call.o rs6000-pcrel-opt.o"
> +	extra_objs="${extra_objs} rs6000-mem-fusion.o"
>  	target_gtfiles="$target_gtfiles \$(srcdir)/config/rs6000/rs6000-logue.cc \$(srcdir)/config/rs6000/rs6000-call.cc"
>  	target_gtfiles="$target_gtfiles \$(srcdir)/config/rs6000/rs6000-pcrel-opt.cc"
>  	;;
> diff --git a/gcc/config/rs6000/mma.md b/gcc/config/rs6000/mma.md
> index 04e2d0066df..88413926a02 100644
> --- a/gcc/config/rs6000/mma.md
> +++ b/gcc/config/rs6000/mma.md
> @@ -294,7 +294,31 @@
>  
>  (define_insn_and_split "*movoo"
>    [(set (match_operand:OO 0 "nonimmediate_operand" "=wa,ZwO,wa")
> -	(match_operand:OO 1 "input_operand" "ZwO,wa,wa"))]
> +        (match_operand:OO 1 "input_operand" "ZwO,wa,wa"))]
> +  "TARGET_MMA
> +   && (gpc_reg_operand (operands[0], OOmode)
> +       || gpc_reg_operand (operands[1], OOmode))"
> +;;    ""
> +  "@
> +   #
> +   #
> +   #"
> +  "&& reload_completed"
> +  [(const_int 0)]
> +{
> +  rs6000_split_multireg_move (operands[0], operands[1]);
> +  DONE;
> +}
> +  [(set_attr "type" "vecload,vecstore,veclogical")
> +   (set_attr "length" "*,*,8")])
> +;;   (set_attr "max_prefixed_insns" "2,2,*")])
> +
> +
> +(define_insn_and_split "*movoo1"
> +  [(set (match_operand:OO 0 "nonimmediate_operand" "=wa,ZwO,wa")
> +        (unspec [
> +          (match_operand:OO 1 "input_operand" "ZwO,wa,wa")
> +        ] UNSPEC_LXVP))]
>    "TARGET_MMA
>     && (gpc_reg_operand (operands[0], OOmode)
>         || gpc_reg_operand (operands[1], OOmode))"
> diff --git a/gcc/config/rs6000/rs6000-mem-fusion.cc b/gcc/config/rs6000/rs6000-mem-fusion.cc
> new file mode 100644
> index 00000000000..8c27a8f32e1
> --- /dev/null
> +++ b/gcc/config/rs6000/rs6000-mem-fusion.cc
> @@ -0,0 +1,695 @@
> +/* Subroutines used to perform adjacent load/store into
> +   paired memory accesses for TARGET_POWER10 and TARGET_VSX.
> +
> +   Copyright (C) 2024 Free Software Foundation, Inc.
> +
> +   This file is part of GCC.
> +
> +   GCC is free software; you can redistribute it and/or modify it
> +   under the terms of the GNU General Public License as published
> +   by the Free Software Foundation; either version 3, or (at your
> +   option) any later version.
> +
> +   GCC is distributed in the hope that it will be useful, but WITHOUT
> +   ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
> +   or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public
> +   License for more details.
> +
> +   You should have received a copy of the GNU General Public License
> +   along with GCC; see the file COPYING3.  If not see
> +   <http://www.gnu.org/licenses/>.  */
> +
> +#define INCLUDE_ALGORITHM
> +#define INCLUDE_FUNCTIONAL
> +#define INCLUDE_LIST
> +#define INCLUDE_TYPE_TRAITS
> +#include "config.h"
> +#include "system.h"
> +#include "coretypes.h"
> +#include "backend.h"
> +#include "rtl.h"
> +#include "df.h"
> +#include "rtl-iter.h"
> +#include "rtl-ssa.h"
> +#include "rtl-ssa/internals.h"
> +#include "rtl-ssa/internals.inl"

Including rtl-ssa/internals.* looks wrong.

> +#include "cfgcleanup.h"
> +#include "tree-pass.h"
> +#include "ordered-hash-map.h"
> +#include "pair-fusion.h"
> +
> +using namespace rtl_ssa;
> +
> +struct rs6000_pair_fusion : public pair_fusion
> +{
> +  bool fpsimd_op_p (rtx , machine_mode , bool)  override final
> +  {
> +    return false;
> +  }

You don't need an override here: the default implementation in
pair-fusion.h returns false.

> +
> +  bool pair_mem_insn_p (rtx_insn *, bool &) override final
> +  {
> +    return false;
> +  }

Perhaps a comment that you can safely return false here since you don't
support writeback (your should_handle_writeback unconditionally returns
false) would be helpful.  Perhaps you could also add a comment above the
declaration in pair-fusion.h indicating that targets which don't need
writeback support can unconditionally return false here, since it is
only used for identifying non-writeback pairs which might be candidates
for promotion to writeback pairs.

I appreciate that the pair_fusion interface isn't very ergonomic for
targets that don't care about writeback, as you end up having to provide
dummy implementations for three different pure virtuals that aren't
really relevant.  Here's an idea for a possible refactoring to pair-fusion.h,
but it might only be worth doing once we have a second target that wants this
behaviour: we could introduce a more concrete (but still abstract) base class,
say pair_fusion_no_writeback, derived from pair_fusion.  That class could
provide an implementation of should_handle_writeback which unconditionally
returns false, along with dummy implementations of the relevant virtual
functions:
 - pair_mem_insn_p
 - destructure_pair
 - gen_promote_writeback_pair
and rs6000_pair_fusion could then inherit from that.

> +
> +  void change_existing_multword_mode (rtx_insn *insn) override final;
> +
> +  bool pair_mem_ok_with_policy (rtx, bool) override final
> +  {
> +    return true;
> +  }
> +
> +  bool pair_operand_mode_ok_p (machine_mode mode) override final;
> +
> +  rtx gen_pair (rtx *pats, rtx, bool load_p) override final;
> +
> +  bool pair_reg_operand_ok_p (bool, rtx, machine_mode) override final
> +  {
> +    return true;
> +  }

This doesn't look right: presumably there are some requirements on the
register operands that lxvp and stxvp support.  Note that on aarch64 we
needed to reject invalid register operands early for correctness, see:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113062
for the history.  It's also better for compile time to avoid tracking
candidate insns altogether when they have no hope of being formed into a
pair.

> +
> +  int pair_mem_alias_check_limit () override final
> +  {
> +    return 0;
> +  }

Are you sure you want to do this?  This essentially forces the alias
analysis to always assume a conflict, which will significantly reduce
the optimization power of the pass.

> +
> +  bool should_handle_writeback (enum writeback_type) override final
> +  {
> +    return false;
> +  }
> +
> +  bool track_loads_p () override final
> +  {
> +    return true;
> +  }
> +
> +  bool track_stores_p () override final
> +  {
> +    return true;
> +  }

You don't need to ovedrride track_{loads,stores}_p: the default
implementation in pair-fusion.h returns true unconditionally.

> +
> +  bool pair_mem_in_range_p (HOST_WIDE_INT) override final
> +  {
> +    return true;
> +  }

This doesn't look right: this would only make sense if lxvp and stxvp
have unlimited range (which presumably isn't the case).

> +
> +  rtx gen_promote_writeback_pair (rtx, rtx, rtx *, bool) override final
> +  {
> +    return NULL_RTX;
> +  }
> +
> +  rtx destructure_pair (rtx_def **, rtx, bool) override final
> +  {
> +    return NULL_RTX;
> +  }
> +
> +  bool fuseable_store_p (insn_info *i1, insn_info *i2) override final;
> +
> +  bool fuseable_load_p (insn_info *i1) override final;
> +
> +  void set_multiword_subreg (insn_info *i1, insn_info *i2,
> +			     bool load_p, rtx *pats) override final;
> +
> +  void set_uses (insn_info *i1, insn_info *i2,
> +		 insn_change **pair_change,
> +		 obstack_watermark *attempt) override final;
> +
> +  void  set_defs (insn_info *i1, insn_info *i2,
> +		  insn_change **pair_change,
> +		  obstack_watermark *attempt) override final;
> +
> +
> +  void modify_new_rtx_insn (insn_info *first, obstack_watermark *attempt,
> +			    insn_change **pair_change,
> +			    auto_vec <insn_change *> &changes) override final;
> +};
> +
> +bool
> +rs6000_pair_fusion::pair_operand_mode_ok_p (machine_mode mode)
> +{
> +  return (ALTIVEC_OR_VSX_VECTOR_MODE (mode));
> +}
> +
> +void
> +rs6000_pair_fusion::change_existing_multword_mode (rtx_insn *insn)
> +{
> +  rtx set = single_set (insn);
> +  rtx src = SET_SRC (set);
> +  rtx dest = SET_DEST (set);
> +  rtx copy = NULL_RTX;
> +
> +  if ((MEM_P (src) && GET_MODE (src) == OOmode)
> +       || (MEM_P (dest) && GET_MODE (dest) == OOmode))
> +    {
> +      rtx unspec  = gen_rtx_UNSPEC (GET_MODE (dest),
> +				    gen_rtvec (1, src),
> +				    UNSPEC_LXVP);
> +      copy =  gen_rtx_SET (dest, unspec);
> +      rtx_insn *new_insn = emit_insn_after (copy, insn);
> +      set_block_for_insn (new_insn, BLOCK_FOR_INSN (insn));
> +      df_insn_rescan (new_insn);
> +      df_insn_delete (insn);
> +      remove_insn (insn);
> +      insn->set_deleted ();
> +    }
> +}
> +
> +void
> +rs6000_pair_fusion::modify_new_rtx_insn (insn_info *first,
> +					 obstack_watermark *attempt,
> +					 insn_change **pair_change,
> +					 auto_vec<insn_change *> &changes)
> +{
> +  def_array new_defs;
> +  vec_rtx_properties properties;
> +  properties.add_insn (first->rtl (), true);
> +  for (rtx_obj_reference ref : properties.refs ())
> +    if (ref.is_write () && ref.is_reg ())
> +      {
> +	auto def = find_access (first->defs (), ref.regno);
> +	if (!def)
> +	  {
> +	    auto *set = crtl->ssa->allocate<set_info> (first,
> +						       full_register (ref.regno));

Use crtl->ssa->create_set instead of the (internal) allocate function.

> +	    new_defs = insert_access (*attempt, set, new_defs);
> +	  }
> +	else
> +	  new_defs = insert_access (*attempt, def, new_defs);
> +      }
> +
> +  (*pair_change)->new_defs
> +    = merge_access_arrays (*attempt, (*pair_change)->new_defs, new_defs);
> +  crtl->ssa->update_changes (first, attempt, pair_change, changes);

Couldn't this be made a lot simpler, since this code is only called for
loads, and you know the shape of the RTL is:

  (set (reg:OO rN) (unspec: (mem: ...)))

then all you need is a singleton def_array containing a def for rN (the
new pseudo), no?

> +
> +}
> +
> +//Given insn I1, PAIR_CHANGE and obstack_watermark ATTEMPT sets new
> +//uses in insn_change.
> +static void
> +set_uses_load (insn_info *i1,insn_change **pair_change,
> +	       obstack_watermark *attempt)
> +{
> +  for (auto def : i1->defs ())
> +    {
> +      auto set = dyn_cast<set_info *> (def);
> +      for (auto use : set->all_uses ())
> +	{
> +	  insn_change *change = *pair_change;
> +	  change->new_uses = insert_access (*attempt, use, change->new_uses);
> +	}
> +    }
> +}
> +
> +void rs6000_pair_fusion::set_uses (insn_info *i1,
> +				   insn_info *i2,
> +				   insn_change **pair_change,
> +				   obstack_watermark *attempt)
> +{
> +  set_uses_load (i1, pair_change, attempt);
> +  set_uses_load (i2, pair_change, attempt);
> +}

This code doesn't seem to get called from anywhere, is it needed?

> +
> +// df_insn_rescan dependent instruction where operands
> +// are reversed given insn_info INFO.
> +static void
> +set_rescan_load (insn_info *i1)
> +{
> +  for (auto def : i1->defs ())
> +    {
> +      auto set = dyn_cast<set_info *> (def);
> +      for (auto use : set->nondebug_insn_uses ())
> +	{
> +	  insn_info *info = use->insn ();
> +	  if (use->is_artificial ())
> +	    continue;
> +	  rtx_insn *rtl_insn = info->rtl ();
> +	  df_insn_rescan (rtl_insn);
> +	}
> +    }
> +}
> +
> +//Given insn INSN, PAIR_CHANGE and obstack_watermark ATTEMPT sets new
> +//defs in insn_change.
> +static bool
> +set_defs_store (insn_info *insn,
> +		insn_change **pair_change,
> +		obstack_watermark *attempt)
> +{
> +  for (auto use : insn->uses())
> +    {
> +      auto def = use->def ();
> +
> +      if (!def)
> +	return false;
> +
> +      if (def->insn ()->is_artificial ())
> +	{
> +	  insn_change *change = *pair_change;
> +	  change->new_defs = insert_access (*attempt, def, change->new_defs);
> +	}
> +    }
> +  return true;
> +}
> +
> +void rs6000_pair_fusion::set_defs(insn_info *i1,
> +				  insn_info *i2,
> +				  insn_change **pair_change,
> +				  obstack_watermark *attempt)
> +{
> +  set_defs_store (i1, pair_change, attempt);
> +  set_defs_store (i2, pair_change, attempt);
> +}
> +
> +// df_insn_rescan the def instruction where operands are reversed given INSN.
> +static bool
> +set_rescan_store (insn_info *insn)
> +{
> +  for (auto use : insn->uses())
> +    {
> +      auto def = use->def ();
> +
> +      if (!def)
> +	return false;
> +
> +      if (def->insn ()->is_artificial ())
> +	return false;
> +
> +      if (def->insn () && def->insn ()->rtl ()
> +	  && def->insn()->is_real ())
> +	{
> +	  rtx_insn *rtl_insn = def->insn ()->rtl ();
> +	  rtx set = single_set (rtl_insn);
> +
> +	  if (set == NULL_RTX)
> +	    return false;
> +	  df_insn_rescan (rtl_insn);
> +	}
> +    }
> +  return true;
> +}
> +
> +// Check for feasibility of store to be fuseable or not. Return true if
> +// feasible otherwise false.
> +static bool
> +feasible_store_p (insn_info *insn)
> +{
> +  for (auto use : insn->uses ())
> +    {
> +      auto def = use->def ();
> +
> +      if (def->insn ()->is_artificial ())
> +	return false;
> +
> +      if (def->insn () && def->insn ()->rtl ()
> +	  && def->insn()->is_real ())
> +	{
> +	  rtx_insn *rtl_insn = def->insn ()->rtl ();
> +	  rtx set = single_set (rtl_insn);
> +
> +	  if (set == NULL_RTX)
> +	    return false;
> +
> +	  // Return false if dependent def is load.
> +	  // This is done as def instruction could be a fused load and
> +	  // to avoid already existing subreg (reg:OO R) offset.
> +	  if (rtl_insn && MEM_P (SET_SRC (set)))
> +	    return false;
> +
> +	  // Return false if dependent def is store.
> +	  if (rtl_insn && MEM_P (SET_DEST (set)))
> +	    return false;
> +	}
> +    }
> +  return true;
> +}
> +
> +// Check if store can be fuseable or not.  Return true if fuseable otherwise
> +// false.
> +bool
> +rs6000_pair_fusion::fuseable_store_p (insn_info *i1, insn_info *i2)
> +{
> +  rtx_insn *insn1 = i1->rtl ();
> +  rtx_insn *insn2 = i2->rtl ();
> +  rtx body = PATTERN (insn1);
> +  rtx src_exp = SET_SRC (body);
> +  rtx insn2_body = PATTERN (insn2);
> +  rtx insn2_src_exp = SET_SRC (insn2_body);

I think it would make the code more readable if you kept consistent
naming for the variables, e.g. src1 and src2 instead of src_exp and
insn2_src_exp.

> +
> +  if (!(REG_P (src_exp)
> +      && crtl->ssa->single_dominating_def (REGNO (src_exp))))
> +    return false;
> +
> +  if (!(REG_P (insn2_src_exp)
> +      && crtl->ssa->single_dominating_def (REGNO (insn2_src_exp))))
> +    return false;

Given that you also require the transfer regs to have a
single_dominating_def in the load case, can you just check this
condition in pair_reg_operand_ok_p and drop the checks here?

> +
> +  // This is done as def instruction could be a fused load and
> +  // to avoid  already existing subreg (reg:OO R) offset.
> +  if (DF_REG_USE_COUNT (REGNO (src_exp)) > 1)
> +    return false;
> +
> +  // Return false if src of insn1 and src of insn2 are same.
> +  if (src_exp == insn2_src_exp)
> +    return false;
> +
> +  if (!feasible_store_p (i1))
> +    return false;;
> +
> +  if (!feasible_store_p (i2))
> +    return false;
> +
> +  return true;
> +}
> +
> +// Set subreg for def of store INSN given rtx SRC instruction.
> +static void
> +set_store_subreg (insn_info *i1, rtx src, int regoff)
> +{
> +  for (auto use: i1->uses ())
> +    {
> +      auto def = use->def ();
> +      if (!def)
> +	return;
> +
> +      insn_info *info = def->insn ();
> +
> +      if (info->is_artificial ())
> +	return;
> +
> +      if (info && info->is_real ())
> +	{
> +	  rtx_insn *rtl_insn = info->rtl ();
> +	  rtx set = single_set (rtl_insn);
> +	  if (set == NULL_RTX)
> +	    return;
> +	  df_ref ref;
> +	  FOR_EACH_INSN_DEF (ref, rtl_insn)
> +	    {
> +	      rtx src_exp = SET_SRC (PATTERN (i1->rtl ()));
> +	      if (REG_P (src_exp) && DF_REF_REGNO (ref) == REGNO (src_exp))
> +		{
> +		  rtx *loc = DF_REF_LOC (ref);
> +		  if (GET_CODE (*loc) == SUBREG)
> +		    {
> +		      src = simplify_gen_subreg (GET_MODE (*loc),
> +						      SUBREG_REG (src),
> +						      OOmode,
> +						      regoff);
> +		    }
> +		  *loc =  copy_rtx (src);
> +		}
> +	    }
> +	}
> +    }
> +}
> +
> +// Check whether load can be fusable or not.
> +// Return true if fuseable otherwise false.
> +bool
> +rs6000_pair_fusion::fuseable_load_p (insn_info *i1)
> +{
> +  rtx_insn *insn = i1->rtl ();
> +  rtx body = PATTERN (insn);
> +  rtx dest_exp = SET_DEST (body);
> +
> +  if (!(REG_P (dest_exp)
> +      && crtl->ssa->single_dominating_def (REGNO (dest_exp))))
> +    return false;
> +
> +  return true;
> +}
> +
> +// Propagate insn I1 with new rtx NEW_DEST_EXP.
> +static void
> +propagate_insn (insn_info *i1, rtx new_dest_exp)
> +{
> +  df_ref ref;
> +  FOR_EACH_INSN_DEF (ref, i1->rtl())
> +    {
> +      rtx dest_exp = SET_DEST (PATTERN (i1->rtl ()));
> +      if (REG_P (dest_exp)
> +	  && DF_REF_REGNO (ref) == REGNO (dest_exp))
> +	{
> +	  rtx *loc = DF_REF_LOC (ref);
> +	  *loc = new_dest_exp;
> +	}
> +     }
> +}
> +
> +static bool
> +set_load_subreg (auto_vec <insn_change *> &changes,
> +		 insn_info *i1, rtx src, obstack_watermark *attempt)
> +{
> +  rtx set = single_set (i1->rtl());
> +  rtx old_dest = SET_DEST (set);
> +
> +  for (auto def : i1->defs ())
> +    {
> +      auto set1 = dyn_cast<set_info *> (def);
> +      for (auto use : set1->nondebug_insn_uses ())
> +	{
> +	  if (use->is_artificial ())
> +	    continue;
> +
> +	  insn_info *info = use->insn ();
> +	  rtx_insn *rtl_insn = info->rtl ();
> +	  insn_propagation prop (rtl_insn, old_dest, src);
> +	  if (!prop.apply_to_pattern (&PATTERN (rtl_insn)))
> +	    gcc_assert (0);
> +
> +	  vec_rtx_properties properties;
> +	  insn_change *use_change;
> +	  auto new_insn = crtl->ssa->create_insn (*attempt, INSN,
> +						  PATTERN (rtl_insn));
> +	  use_change = crtl->ssa->change_alloc<insn_change> (*attempt,
> +							     new_insn);
> +	  use_array new_uses_temp;
> +	  properties.add_insn (rtl_insn ,  true);
> +
> +	  for (rtx_obj_reference ref : properties.refs ())
> +	    if (ref.is_reg ())
> +	      {
> +		auto def = find_access (info->defs (), ref.regno);
> +		if (ref.is_write () || use_change->new_defs.size () == 0)
> +		  {
> +		    if (!def)
> +		      {
> +			auto resource = full_register (ref.regno);
> +			auto *set
> +			  = crtl->ssa->allocate<set_info> (info,
> +							   resource);

As above, please can you use the public create_set function instead of
exposing the rtl-ssa internal allocate function?

> +			use_change->new_defs
> +			  = insert_access (*attempt, set,
> +					   use_change->new_defs);
> +		      }
> +		   else
> +		     use_change->new_defs
> +		       = insert_access (*attempt, def,
> +					use_change->new_defs);
> +
> +		  }
> +		auto_vec<access_info *, 1> new_uses;
> +		auto *set
> +		  = crtl->ssa->allocate<set_info> (new_insn,
> +						   full_register (ref.regno));
> +		if (set)
> +		  {
> +		    auto new_use = crtl->ssa->create_use (*attempt, new_insn,
> +							  set);
> +		    if (!find_access (new_uses_temp, ref.regno))
> +		      {
> +			if (new_uses_temp.size () == 0)
> +			  {
> +			    new_uses.quick_push (new_use);
> +			    new_uses_temp = use_array (access_array (new_uses));
> +			    new_uses_temp = merge_access_arrays (*attempt,
> +								 new_uses_temp,
> +								 new_uses_temp);
> +			  }
> +			else
> +			  new_uses_temp = insert_access (*attempt, new_use,
> +							 new_uses_temp);
> +		      }
> +		  }
> +	      }
> +	   use_change->new_uses = new_uses_temp;
> +	   use_change->move_range = new_insn;
> +	   changes.safe_push (use_change);
> +	}
> +    }
> +  return true;
> +}
> +
> +// Set subreg for OO mode store pair to generate registers in pairs
> +// given insn_info I1 and I2.
> +static void
> +set_multiword_subreg_store (insn_info *i1, insn_info *i2)
> +{
> +  rtx_insn *insn1 = i1->rtl ();
> +  rtx_insn *insn2 = i2->rtl ();
> +  rtx body = PATTERN (insn1);
> +  rtx src_exp = SET_SRC (body);
> +  rtx insn2_body = PATTERN (insn2);
> +  rtx insn2_dest_exp = SET_DEST (insn2_body);
> +  machine_mode mode = GET_MODE (src_exp);
> +  int regoff;
> +  rtx src;
> +  rtx addr = XEXP (insn2_dest_exp, 0);
> +
> +  PUT_MODE_RAW (src_exp, OOmode);

I don't think you should be changing the mode of an existing pseudo in place
like this.  Instead the store path should use a fresh OOmode pseudo just like
the load path.

> +  if (GET_CODE (addr) == PLUS
> +      && XEXP (addr, 1) && CONST_INT_P (XEXP (addr, 1)))
> +    regoff = 16;
> +  else
> +    regoff = 0;
> +
> +  src = simplify_gen_subreg (mode,
> +			     src_exp, GET_MODE (src_exp),
> +			     regoff);
> +
> +  set_store_subreg (i1, src, regoff);
> +
> +  int regoff1 = 0;
> +  rtx src1;
> +
> +  src1 = simplify_gen_subreg (mode,
> +			      src_exp, GET_MODE (src_exp),
> +			      regoff1);
> +
> +  set_store_subreg (i2, src1, regoff1);
> +  set_rescan_store (i1);
> +  set_rescan_store (i2);
> +  df_insn_rescan (insn1);
> +}
> +
> +// Set subreg for OO mode pair load for existing subreg rtx to generate
> +// registers in pairs given insn_info I2 and I2.
> +static rtx
> +set_multiword_subreg_load (insn_info *i1, insn_info *i2)
> +{

Please see the comments below the call to set_multiword_subreg in
pair-fusion.cc:fuse_pair.

> +  rtx_insn *insn1 = i1->rtl ();
> +  rtx body = PATTERN (insn1);
> +  rtx dest_exp = SET_DEST (body);
> +  machine_mode mode = GET_MODE (dest_exp);
> +  int regoff1;
> +  regoff1 = 16;
> +  auto attempt = crtl->ssa->new_change_attempt ();
> +  rtx new_dest_exp = gen_reg_rtx (OOmode);
> +  auto_vec <insn_change *> use_changes;
> +  rtx src = simplify_gen_subreg (mode,
> +				 new_dest_exp,
> +				 OOmode,
> +				 regoff1);
> +
> +  set_load_subreg (use_changes, i1, src, &attempt);
> +  propagate_insn (i1, new_dest_exp);
> +  confirm_change_group ();
> +  crtl->ssa->change_insns (use_changes);
> +  use_changes.release ();
> +
> +  int regoff = 0;
> +  rtx sset = single_set (i2->rtl ());
> +  rtx insn2_dest_exp = SET_DEST (sset);
> +  machine_mode insn2_mode = GET_MODE (insn2_dest_exp);
> +
> +  src = simplify_gen_subreg (insn2_mode,
> +			     new_dest_exp,
> +			     OOmode,
> +			     regoff);
> +
> +  set_load_subreg (use_changes, i2, src, &attempt);
> +  propagate_insn (i2, new_dest_exp);
> +  confirm_change_group ();
> +  crtl->ssa->change_insns (use_changes);
> +  set_rescan_load (i1);
> +  set_rescan_load (i2);
> +  df_insn_rescan (i1->rtl ());
> +  return new_dest_exp;
> +}
> +
> +// Set subreg for OO mode pair to generate sequential registers given
> +// insn_info pairs I1, I2 and LOAD_P is true iff load insn and false
> +// if store insn.
> +void
> +rs6000_pair_fusion::set_multiword_subreg (insn_info *i1, insn_info *i2,
> +					  bool load_p, rtx *pats)
> +{
> +  if (load_p)
> +    {
> +      rtx new_dest_exp =  set_multiword_subreg_load (i1, i2);
> +      if (new_dest_exp)
> +	{
> +	  rtx src = SET_SRC (pats[0]);
> +	  rtx set = gen_rtx_SET (new_dest_exp, src);
> +	  pats[0] = set;
> +	}
> +    }
> +  else
> +    set_multiword_subreg_store (i1, i2);
> +}
> +
> +rtx
> +rs6000_pair_fusion::gen_pair (rtx *pats, rtx,  bool load_p)
> +{
> +  rtx i1 = pats[0];
> +  rtx src_exp = SET_SRC (i1);
> +  rtx dest_exp = SET_DEST (i1);
> +  PUT_MODE_RAW (src_exp, OOmode);
> +  PUT_MODE_RAW (dest_exp, OOmode);
> +  rtx unspec  = gen_rtx_UNSPEC (GET_MODE (dest_exp),
> +				gen_rtvec (1, src_exp),
> +				UNSPEC_LXVP);
> +  rtx set =  gen_rtx_SET (dest_exp, unspec);
> +  if (dump_file)
> +    {
> +      if (load_p)
> +	fprintf (dump_file, "lxv with lxvp ");
> +      else
> +	fprintf (dump_file, "stxv with stxvp ");
> +      print_rtl_single (dump_file, set);
> +    }
> +
> +  return set;
> +}
> +
> +const pass_data pass_data_mem_fusion =
> +{
> +  RTL_PASS, /* type */
> +  "mem_fusion", /* name */
> +  OPTGROUP_NONE, /* optinfo_flags */
> +  TV_NONE, /* tv_id */
> +  0, /* properties_required */
> +  0, /* properties_provided */
> +  0, /* properties_destroyed */
> +  0, /* todo_flags_start */
> +  TODO_df_finish, /* todo_flags_finish */
> +};
> +
> +class pass_mem_fusion : public rtl_opt_pass
> +{
> +public:
> +  pass_mem_fusion (gcc::context *ctxt)
> +    : rtl_opt_pass (pass_data_mem_fusion, ctxt)
> +  {}
> +
> +  opt_pass *clone () override { return new pass_mem_fusion (m_ctxt);}
> +
> +  /* opt_pass methods: */
> +  bool gate (function *)
> +    {
> +      return (optimize > 0 && TARGET_VSX && TARGET_POWER10);
> +    }
> +
> +  unsigned int execute (function *) final override
> +    {
> +      rs6000_pair_fusion pass;
> +      pass.run ();
> +      return 0;
> +    }
> +}; // class pass_mem_fusion
> +
> +rtl_opt_pass *
> +make_pass_mem_fusion (gcc::context *ctxt)
> +{
> +  return new pass_mem_fusion (ctxt);
> +}
> diff --git a/gcc/config/rs6000/rs6000-passes.def b/gcc/config/rs6000/rs6000-passes.def
> index 46a0d0b8c56..0b48f57014d 100644
> --- a/gcc/config/rs6000/rs6000-passes.def
> +++ b/gcc/config/rs6000/rs6000-passes.def
> @@ -28,7 +28,9 @@ along with GCC; see the file COPYING3.  If not see
>       The power8 does not have instructions that automaticaly do the byte swaps
>       for loads and stores.  */
>    INSERT_PASS_BEFORE (pass_cse, 1, pass_analyze_swaps);
> -
> +  /* Pass to replace adjacent memory addresses lxv/stxv instruction with
> +     lxvp/stxvp instruction.  */
> +  INSERT_PASS_BEFORE (pass_early_remat, 1, pass_mem_fusion);
>    /* Pass to do the PCREL_OPT optimization that combines the load of an
>       external symbol's address along with a single load or store using that
>       address as a base register.  */
> diff --git a/gcc/config/rs6000/rs6000-protos.h b/gcc/config/rs6000/rs6000-protos.h
> index b40557a8557..5295ff66950 100644
> --- a/gcc/config/rs6000/rs6000-protos.h
> +++ b/gcc/config/rs6000/rs6000-protos.h
> @@ -342,6 +342,7 @@ namespace gcc { class context; }
>  class rtl_opt_pass;
>  
>  extern rtl_opt_pass *make_pass_analyze_swaps (gcc::context *);
> +extern rtl_opt_pass *make_pass_mem_fusion (gcc::context *);
>  extern rtl_opt_pass *make_pass_pcrel_opt (gcc::context *);
>  extern bool rs6000_sum_of_two_registers_p (const_rtx expr);
>  extern bool rs6000_quadword_masked_address_p (const_rtx exp);
> diff --git a/gcc/config/rs6000/rs6000.cc b/gcc/config/rs6000/rs6000.cc
> index 08579bc83e6..d00c687cbc9 100644
> --- a/gcc/config/rs6000/rs6000.cc
> +++ b/gcc/config/rs6000/rs6000.cc
> @@ -20,7 +20,6 @@
>     <http://www.gnu.org/licenses/>.  */
>  
>  #define IN_TARGET_CODE 1
> -
>  #include "config.h"
>  #define INCLUDE_MEMORY
>  #include "system.h"
> @@ -27469,7 +27468,8 @@ rs6000_split_multireg_move (rtx dst, rtx src)
>      reg_mode = word_mode;
>    reg_mode_size = GET_MODE_SIZE (reg_mode);
>  
> -  gcc_assert (reg_mode_size * nregs == GET_MODE_SIZE (mode));
> +  gcc_assert (mode == OOmode
> +	      || reg_mode_size * nregs == GET_MODE_SIZE (mode));
>  
>    /* TDmode residing in FP registers is special, since the ISA requires that
>       the lower-numbered word of a register pair is always the most significant
> @@ -27516,6 +27516,11 @@ rs6000_split_multireg_move (rtx dst, rtx src)
>        int reg_mode_nregs = hard_regno_nregs (reg, reg_mode);
>        if (MEM_P (dst))
>  	{
> +	  rtx addr = XEXP (dst, 0);
> +	  rtx opnd1 = NULL_RTX;
> +	  if (addr && GET_CODE (addr) == PLUS)
> +	    opnd1 = XEXP (addr,1);
> +
>  	  unsigned offset = 0;
>  	  unsigned size = GET_MODE_SIZE (reg_mode);
>  
> @@ -27529,7 +27534,13 @@ rs6000_split_multireg_move (rtx dst, rtx src)
>  	    {
>  	      unsigned subreg
>  		= WORDS_BIG_ENDIAN ? i : (nregs - reg_mode_nregs - i);
> -	      rtx dst2 = adjust_address (dst, reg_mode, offset);
> +	      rtx dst2 = dst;
> +
> +	      if ((GET_CODE (addr) != PLUS
> +		  || (opnd1 && CONST_INT_P(opnd1))))
> +		dst2 = adjust_address (dst, reg_mode, offset);
> +	      else
> +		PUT_MODE_RAW (dst, reg_mode);
>  	      rtx src2 = gen_rtx_REG (reg_mode, reg + subreg);
>  	      offset += size;
>  	      emit_insn (gen_rtx_SET (dst2, src2));
> @@ -27540,15 +27551,25 @@ rs6000_split_multireg_move (rtx dst, rtx src)
>  
>        if (MEM_P (src))
>  	{
> +	  rtx  addr = XEXP (src, 0);
> +	  rtx opnd1 = NULL_RTX;
> +	  if (addr && GET_CODE (addr) == PLUS)
> +	    opnd1 = XEXP (addr,1);
> +
>  	  unsigned offset = 0;
>  	  unsigned size = GET_MODE_SIZE (reg_mode);
>  
> -	  for (int i = 0; i < nregs; i += reg_mode_nregs)
> +	  for (int i = nregs-1; i >= 0; i -= reg_mode_nregs)
>  	    {
>  	      unsigned subreg
>  		= WORDS_BIG_ENDIAN ? i : (nregs - reg_mode_nregs - i);
>  	      rtx dst2 = gen_rtx_REG (reg_mode, reg + subreg);
> -	      rtx src2 = adjust_address (src, reg_mode, offset);
> +	      rtx src2 = src;
> +
> +	      if ((GET_CODE (addr) != PLUS || (opnd1 && CONST_INT_P (opnd1))))
> +		src2 = adjust_address (src, reg_mode, offset);
> +	      else
> +		PUT_MODE_RAW (src2, reg_mode);
>  	      offset += size;
>  	      emit_insn (gen_rtx_SET (dst2, src2));
>  	    }
> @@ -27556,7 +27577,8 @@ rs6000_split_multireg_move (rtx dst, rtx src)
>  	  /* If we are writing an accumulator register, we have to
>  	     prime it after we've written it.  */
>  	  if (TARGET_MMA
> -	      && GET_MODE (dst) == XOmode && FP_REGNO_P (REGNO (dst)))
> +	      && REG_P (dst) && GET_MODE (dst) == XOmode
> +	      && FP_REGNO_P (REGNO (dst)))
>  	    emit_insn (gen_mma_xxmtacc (dst, dst));
>  
>  	  return;
> @@ -27649,9 +27671,12 @@ rs6000_split_multireg_move (rtx dst, rtx src)
>  	{
>  	  for (i = nregs - 1; i >= 0; i--)
>  	    {
> -	      rtx dst_i = gen_rtx_REG (reg_mode, REGNO (dst) + i);
> -	      rtx src_i = gen_rtx_REG (reg_mode, REGNO (src) + i);
> -	      emit_insn (gen_rtx_SET (dst_i, src_i));
> +	      if (REG_P (dst) && REG_P (src))
> +		{
> +		  rtx dst_i = gen_rtx_REG (reg_mode, REGNO (dst) + i);
> +		  rtx src_i = gen_rtx_REG (reg_mode, REGNO (src) + i);
> +		  emit_insn (gen_rtx_SET (dst_i, src_i));
> +		}
>  	    }
>  	}
>        else
> @@ -27666,7 +27691,8 @@ rs6000_split_multireg_move (rtx dst, rtx src)
>        /* If we are writing an accumulator register, we have to
>  	 prime it after we've written it.  */
>        if (TARGET_MMA
> -	  && GET_MODE (dst) == XOmode && FP_REGNO_P (REGNO (dst)))
> +	  && REG_P (dst) && GET_MODE (dst) == XOmode
> +	  && FP_REGNO_P (REGNO (dst)))
>  	emit_insn (gen_mma_xxmtacc (dst, dst));
>      }
>    else
> @@ -27723,7 +27749,7 @@ rs6000_split_multireg_move (rtx dst, rtx src)
>  
>  	  /* If the base register we are using to address memory is
>  	     also a destination reg, then change that register last.  */
> -	  if (REG_P (breg)
> +	  if (REG_P (dst) && REG_P (breg)
>  	      && REGNO (breg) >= REGNO (dst)
>  	      && REGNO (breg) < REGNO (dst) + nregs)
>  	    j = REGNO (breg) - REGNO (dst);
> @@ -27821,9 +27847,12 @@ rs6000_split_multireg_move (rtx dst, rtx src)
>  	  /* XO/OO are opaque so cannot use subregs. */
>  	  if (mode == OOmode || mode == XOmode )
>  	    {
> -	      rtx dst_i = gen_rtx_REG (reg_mode, REGNO (dst) + j);
> -	      rtx src_i = gen_rtx_REG (reg_mode, REGNO (src) + j);
> -	      emit_insn (gen_rtx_SET (dst_i, src_i));
> +	      if (REG_P (dst) &&  REG_P (src))
> +		{
> +		  rtx dst_i = gen_rtx_REG (reg_mode, REGNO (dst) + j);
> +		  rtx src_i = gen_rtx_REG (reg_mode, REGNO (src) + j);
> +		  emit_insn (gen_rtx_SET (dst_i, src_i));
> +		}
>  	    }
>  	  else
>  	    emit_insn (gen_rtx_SET (simplify_gen_subreg (reg_mode, dst, mode,
> @@ -27841,7 +27870,9 @@ rs6000_split_multireg_move (rtx dst, rtx src)
>        if (restore_basereg != NULL_RTX)
>  	emit_insn (restore_basereg);
>      }
> +  return;
>  }
> +
>  
>  /* Return true if the peephole2 can combine a load involving a combination of
>     an addis instruction and a load with an offset that can be fused together on
> diff --git a/gcc/config/rs6000/rs6000.md b/gcc/config/rs6000/rs6000.md
> index 8eda2f7bb0d..b9c38088510 100644
> --- a/gcc/config/rs6000/rs6000.md
> +++ b/gcc/config/rs6000/rs6000.md
> @@ -173,6 +173,7 @@
>     UNSPEC_XXSPLTIW_CONST
>     UNSPEC_FMAX
>     UNSPEC_FMIN
> +   UNSPEC_LXVP
>    ])
>  
>  ;;
> diff --git a/gcc/config/rs6000/t-rs6000 b/gcc/config/rs6000/t-rs6000
> index 155788de40a..a052163eec4 100644
> --- a/gcc/config/rs6000/t-rs6000
> +++ b/gcc/config/rs6000/t-rs6000
> @@ -34,6 +34,11 @@ rs6000-p8swap.o: $(srcdir)/config/rs6000/rs6000-p8swap.cc
>  	$(COMPILE) $<
>  	$(POSTCOMPILE)
>  
> +rs6000-mem-fusion.o: $(srcdir)/config/rs6000/rs6000-mem-fusion.cc
> +	$(COMPILE) $<
> +	$(POSTCOMPILE)
> +
> +
>  rs6000-d.o: $(srcdir)/config/rs6000/rs6000-d.cc
>  	$(COMPILE) $<
>  	$(POSTCOMPILE)
> diff --git a/gcc/pair-fusion.cc b/gcc/pair-fusion.cc
> index cb0374f426b..d0665b8730a 100644
> --- a/gcc/pair-fusion.cc
> +++ b/gcc/pair-fusion.cc
> @@ -21,7 +21,6 @@
>  #define INCLUDE_FUNCTIONAL
>  #define INCLUDE_LIST
>  #define INCLUDE_TYPE_TRAITS
> -#define INCLUDE_ARRAY

Why is this needed as part of this patch?

>  #include "config.h"
>  #include "system.h"
>  #include "coretypes.h"
> @@ -36,6 +35,8 @@
>  #include "tree-dfa.h"
>  #include "fold-const.h"
>  #include "tree-hash-traits.h"
> +#include "rtl-ssa/internals.h"
> +#include "rtl-ssa/internals.inl"

We shouldn't be including rtl-ssa/internals.* here.

>  #include "print-tree.h"
>  #include "pair-fusion.h"
>  
> @@ -313,9 +314,9 @@ static int
>  encode_lfs (lfs_fields fields)
>  {
>    int size_log2 = exact_log2 (fields.size);
> -  gcc_checking_assert (size_log2 >= 2 && size_log2 <= 4);
> -  return ((int)fields.load_p << 3)
> -    | ((int)fields.fpsimd_p << 2)
> +  gcc_checking_assert (size_log2 >= 2 && size_log2 <= 9);

Why is this change needed?  Do you really need to handle accesses that
are 2^9 = 512 bytes (!) wide?  That seems unlikely.  Any need for this
looks to be symptomatic of missing validation elsewhere (e.g. perhaps
pair_operand_mode_ok_p is not sufficiently constrained).

AFAICT you should only need to handle 16 byte candidate accesses, since
OOmode is 32 bytes wide.

> +  return ((int)fields.load_p << 4)
> +    | ((int)fields.fpsimd_p << 3)
>      | (size_log2 - 2);
>  }
>  
> @@ -323,8 +324,8 @@ encode_lfs (lfs_fields fields)
>  static lfs_fields
>  decode_lfs (int lfs)
>  {
> -  bool load_p = (lfs & (1 << 3));
> -  bool fpsimd_p = (lfs & (1 << 2));
> +  bool load_p = (lfs & (1 << 4));
> +  bool fpsimd_p = (lfs & (1 << 3));
>    unsigned size = 1U << ((lfs & 3) + 2);
>    return { load_p, fpsimd_p, size };
>  }
> @@ -426,6 +427,9 @@ pair_fusion_bb_info::track_access (insn_info *insn, bool load_p, rtx mem)
>    if (MEM_VOLATILE_P (mem))
>      return;
>  
> +  if (load_p && !m_pass->fuseable_load_p (insn))
> +    return;
> +

Instead of adding the new hook (fuseable_load_p), could you handle this
in pair_reg_operand_ok_p?

>    // Ignore writeback accesses if the hook says to do so.
>    if (!m_pass->should_handle_writeback (writeback_type::EXISTING)
>        && GET_RTX_CLASS (GET_CODE (XEXP (mem, 0))) == RTX_AUTOINC)
> @@ -1815,7 +1819,7 @@ pair_fusion_bb_info::fuse_pair (bool load_p,
>      }
>  
>    rtx reg_notes = combine_reg_notes (first, second, load_p);
> -
> +  m_pass->set_multiword_subreg (i1, i2, load_p, pats);

Looking at the rs6000 code (for now just thinking about the load case), it
looks like this function is responsible for updating uses of the existing
transfer registers to consume subregs of the new OOmode pseudo instead.

I think the interface is a bit off, though, since I think we want a
single call to change_insns to handle the entire change group for all
nondebug changes in one go (namely the call at the bottom of fuse_pair).
The rs6000 code should not be calling change_insns itself.

To make this work, I think we want to allow the rs6000 code to push
changes to the main auto_vec of changes that is created in fuse_pair.
The interface you chose for modify_new_rtx_insn is therefore more
appropraite for this.

Note the RTL-SSA constraint that the changes have to be created in
program order.  This means that (in the rs6000 code) you'll need to walk the
use lists of the two candidate load instructions simultaneously, pushing
changes for whichever use comes first at each step.

Also, I think any new virtual function calls to finish constructing the
changes should come after the new pattern is applied with the call to
set_pair_pat.  That way gen_pair can first (in the rs6000 case) create
the new pseudo and set up the new pattern.

It's OK to move the call to set_pair_pat earlier in the below
load_p=true block, if that helps.

Now, in the store pair case, I think you want to hook into
store_change_builder.  It looks like you need to visit the defs of the transfer
registers so that you can substitute the subreg rtxes in (and update the defs
of the instructions accordingly).

Again, the same principle applies (as in the load case) that we should
build up all the nondebug changes into a single vector (in program
order) which we then apply in one go at the bottom of fuse_pair.

Specifically, I think we want to extend store_change_builder to take a
boolean, say visit_store_reg_defs_p (the value of which would be
determined by a new virtual function, say update_store_reg_defs_p).  If
true, store_change_builder would then additionally visit the defs of the
transfer registers (at the appropriate point in program order w.r.t. the
other steps).  You could then add a new action (say FIXUP_REG_DEF), and
in the switch statement in fuse_pair you could then invoke a second new
virtual function (fixup_store_reg_def) which (in rs6000 code) constructs
the appropriate insn_change (doing the subreg substitution etc.)

>    rtx pair_pat = m_pass->gen_pair (pats, writeback_effect, load_p);

>    insn_change *pair_change = nullptr;
>    auto set_pair_pat = [pair_pat,reg_notes](insn_change *change) {
> @@ -1829,18 +1833,19 @@ pair_fusion_bb_info::fuse_pair (bool load_p,
>        changes.safe_push (make_delete (first));
>        pair_change = make_change (second);
>        changes.safe_push (pair_change);
> -
> -      pair_change->move_range = move_range;

Why do you need to move the setting of the move_range later?

>        pair_change->new_defs = merge_access_arrays (attempt,
>  						   input_defs[0],
>  						   input_defs[1]);
> -      gcc_assert (pair_change->new_defs.is_valid ());
> -
>        pair_change->new_uses
>  	= merge_access_arrays (attempt,
>  			       drop_memory_access (input_uses[0]),
>  			       drop_memory_access (input_uses[1]));
> +      pair_change->move_range = move_range;
> +      m_pass->modify_new_rtx_insn (first, &attempt, &pair_change, changes);

Ideally I think we'd just have a single hook (here) to handle the load case.
I.e. set_multiword_subreg should go away and any load pair parts still needed
merged into the implementation of this hook.   Naming-wise, something like
finish_load_pair_changes might be more appropriate.

> +      gcc_assert (pair_change->new_defs.is_valid ());
> +

Nit: I don't think there's a need for these blank lines.

>        gcc_assert (pair_change->new_uses.is_valid ());
> +
>        set_pair_pat (pair_change);
>      }
>    else
> @@ -1873,6 +1878,7 @@ pair_fusion_bb_info::fuse_pair (bool load_p,
>  						store_def,
>  						change->new_defs);
>  	      gcc_assert (change->new_defs.is_valid ());
> +	      m_pass->set_defs (i1, i2, &change, &attempt);

Looking at the rs6000 code I'm really not sure what set_defs is supposed
to be doing.  It would help if the comments at the top of your functions
actually explained the purpose/intent of the function.  Please can you
elaborate?

>  	      change->move_range = move_range;
>  	      pair_change = change;
>  	      break;
> @@ -1909,6 +1915,7 @@ pair_fusion_bb_info::fuse_pair (bool load_p,
>  	      change->new_defs = insert_access (attempt, new_set,
>  						change->new_defs);
>  	      gcc_assert (change->new_defs.is_valid ());
> +	      m_pass->set_defs (i1, i2, &change, &attempt);
>  	      pair_change = change;
>  	      break;
>  	    }
> @@ -1920,6 +1927,7 @@ pair_fusion_bb_info::fuse_pair (bool load_p,
>  			 "  stp: changing i%d to use mem from new stp "
>  			 "(after i%d)\n",
>  			 action.insn->uid (), pair_dst->uid ());
> +	  
>  	      change->new_uses = drop_memory_access (change->new_uses);
>  	      gcc_assert (new_set);
>  	      auto new_use = crtl->ssa->create_use (attempt, action.insn,
> @@ -1959,7 +1967,6 @@ pair_fusion_bb_info::fuse_pair (bool load_p,
>  	    }
>  	}
>      }
> -

Nit: stray whitespace change.

>    auto ignore = ignore_changing_insns (changes);
>    for (unsigned i = 0; i < changes.length (); i++)
>      gcc_assert (rtl_ssa::restrict_movement (*changes[i], ignore));
> @@ -1982,7 +1989,6 @@ pair_fusion_bb_info::fuse_pair (bool load_p,
>      }
>  
>    gcc_assert (crtl->ssa->verify_insn_changes (changes));
> -

Likewise.

>    // Fix up any debug uses that will be affected by the changes.
>    if (MAY_HAVE_DEBUG_INSNS)
>      fixup_debug_uses (attempt, insns, orig_rtl, pair_dst, trailing_add,
> @@ -2406,6 +2412,15 @@ pair_fusion_bb_info::try_fuse_pair (bool load_p, unsigned access_size,
>        reg_ops[i] = XEXP (pats[i], !load_p);
>      }
>  
> +  if (!load_p && !m_pass->fuseable_store_p (i1, i2))
> +    {
> +      if (dump_file)
> +	fprintf (dump_file,
> +		 "punting on store-mem-pairs due to non fuseable cand (%d,%d)\n",
> +		 insns[0]->uid (), insns[1]->uid ());
> +      return false;
> +    }
> +
>    if (load_p && reg_overlap_mentioned_p (reg_ops[0], reg_ops[1]))
>      {
>        if (dump_file)
> @@ -2650,7 +2665,6 @@ pair_fusion_bb_info::try_fuse_pair (bool load_p, unsigned access_size,
>  	    base = &base_cands[1];
>  	}
>      }
> -

Nit: stray whitespace change.

>    // Otherwise, hazards[0] > hazards[1].
>    // Pair can be formed anywhere in (hazards[1], hazards[0]).
>    insn_range_info range (insns[0], insns[1]);
> @@ -2698,7 +2712,6 @@ pair_fusion_bb_info::try_fuse_pair (bool load_p, unsigned access_size,
>        fprintf (dump_file, "), move_range: (%d,%d)\n",
>  	       range.first->uid (), range.last->uid ());
>      }
> -

Likewise.

>    return fuse_pair (load_p, access_size, writeback,
>  		    i1, i2, *base, range);
>  }
> @@ -2764,6 +2777,7 @@ pair_fusion_bb_info::merge_pairs (insn_list_t &left_list,
>  			  bool load_p,
>  			  unsigned access_size)
>  {
> +	//dump_file = stdout;

Please don't leave commented-out code in your patch submissions.
Also note that you can dump to stdout using the command-line,
-fdump-rtl-mem_fusion=- should do it in your case.

>    if (dump_file)
>      {
>        fprintf (dump_file, "merge_pairs [L=%d], cand vecs ", load_p);
> @@ -2817,10 +2831,12 @@ pair_fusion_bb_info::transform_for_base (int encoded_lfs,
>  
>    for (auto &access : group.list)
>      {
> +	    //fprintf (stdout, "MERGE \n");
>        if (skip_next)
>  	skip_next = false;
>        else if (known_eq (access.offset, prev_access->offset + access_size))
>  	{
> +	//	fprintf (stdout, " Inside Merge \n");
>  	  merge_pairs (prev_access->cand_insns,
>  		       access.cand_insns,
>  		       lfs.load_p,
> @@ -2998,6 +3014,8 @@ void pair_fusion::process_block (bb_info *bb)
>        if (GET_CODE (pat) != SET)
>  	continue;
>  
> +      change_existing_multword_mode (rti);
> +

Can you explain why this is needed?  Looking at the rs6000
implementation, it seems to be transforming any existing OOmode accesses
into lxvp instructions.  Why?

>        if (track_stores && MEM_P (XEXP (pat, 0)))
>  	bb_state.track_access (insn, false, XEXP (pat, 0));
>        else if (track_loads && MEM_P (XEXP (pat, 1)))
> diff --git a/gcc/pair-fusion.h b/gcc/pair-fusion.h
> index 45e4edceecb..b59c59f397a 100644
> --- a/gcc/pair-fusion.h
> +++ b/gcc/pair-fusion.h
> @@ -26,8 +26,12 @@ namespace rtl_ssa {
>    class insn_info;
>    class insn_range_info;
>    class bb_info;
> +  class insn_change;
> +  class insn_range_info;
>  }
>  
> +class obstack_watermark;
> +
>  // Information about a potential base candidate, used in try_fuse_pair.
>  // There may be zero, one, or two viable RTL bases for a given pair.
>  struct base_cand
> @@ -142,6 +146,19 @@ struct pair_fusion {
>    // true iff INSN is a load pair.
>    virtual bool pair_mem_insn_p (rtx_insn *insn, bool &load_p) = 0;
>  
> +  // Given INSN change multiword mode load and store to respective
> +  // unspec instruction.
> +  virtual void change_existing_multword_mode (rtx_insn *insn) = 0;
> +
> +  // Given INSN and watermark ATTEMPT and PAIR_CHANGE sets the
> +  // new rtx with INSN.  Remove all uses of definition that are
> +  // removed given CHANGES.
> +  virtual void modify_new_rtx_insn (rtl_ssa::insn_info *first,
> +				    obstack_watermark *attempt,

It's more conventional in RTL-SSA code to pass these by reference.
Let's do that here.

> +				    rtl_ssa::insn_change **pair_change,

No need for the double indirection here.

> +				    auto_vec<rtl_ssa::insn_change *> &changes)
> +				    = 0;
> +
>    // Return true if we should track loads.
>    virtual bool track_loads_p ()
>    {
> @@ -171,6 +188,37 @@ struct pair_fusion {
>    virtual rtx gen_promote_writeback_pair (rtx wb_effect, rtx mem,
>  					  rtx regs[2], bool load_p) = 0;
>  
> +  // Given insn_info pair I1 and I2, sets subreg with multiword registers
> +  // to assign register pairs by allocators.
> +  // LOAD_P is true iff the pair is a load.
> +  virtual void set_multiword_subreg (rtl_ssa::insn_info *i1,
> +				     rtl_ssa::insn_info *i2,
> +				     bool load_p, rtx *pats) = 0;
> +
> +  // Given insn_info pair I1 and I2, sets uses in PAIR_CHANGE.
> +  virtual void set_uses (rtl_ssa::insn_info *i1,
> +			 rtl_ssa::insn_info *i2,
> +			 rtl_ssa::insn_change **pair_change,
> +			 obstack_watermark *attempt) = 0;

I can't see this function being called anywhere, is it needed?

> +
> +  // Given insn_info pair I1 and I2, sets defss in PAIR_CHANGE.

This comment is pretty useless.  It doesn't add anything above what is
alredy implied by the name of the function and the names of the
parameters.  Try to make your function comments add some information
(intent, purpose, invariants) above and beyond what is already apparent.

> +  virtual void set_defs (rtl_ssa::insn_info *i1,
> +			 rtl_ssa::insn_info *i2,
> +			 rtl_ssa::insn_change **pair_change,
> +			 obstack_watermark *attempt) = 0;
> +
> +
> +  // Given insn_info pair I1 and I2, checks if pairs are feasible to perform
> +  // store mem pairs.
> +  // Return true if feasible to perform store mem pairs otherwise false.
> +  virtual bool fuseable_store_p (rtl_ssa::insn_info *i1,
> +				 rtl_ssa::insn_info *i2) = 0;
> +
> +  // Given insn_info pair I1 and I2, checks if pairs are feasible to perform
> +  // load mem pairs.
> +  // Return true if feasible to perform load mem pairs otherwise false.
> +  virtual bool fuseable_load_p (rtl_ssa::insn_info *i1) = 0;
> +
>    void process_block (rtl_ssa::bb_info *bb);
>    rtl_ssa::insn_info *find_trailing_add (rtl_ssa::insn_info *insns[2],
>  					 const rtl_ssa::insn_range_info
> diff --git a/gcc/rtl-ssa/access-utils.h b/gcc/rtl-ssa/access-utils.h
> index 8805eec1d7f..0d0d299bb83 100644
> --- a/gcc/rtl-ssa/access-utils.h
> +++ b/gcc/rtl-ssa/access-utils.h

I'll try to leave detailed review of the rtl-ssa changes to Richard, but
please try to remove as many rtl-ssa changes as possible.  If something
really is needed, ideally submit it as a pre-patch rather than bundling
it in this patch.

> @@ -574,6 +574,8 @@ sort_accesses (access_info **begin, access_info **end)
>  
>    if (count == 2)
>      {
> +      if (begin[0]->regno () == begin[1]->regno ())
> +	return;

This looks wrong.  It looks like you're papering over a problem elsewhere
because your code violates the precondition:

// The sequence must have at most one access to a given a regno.

and trips the below assert.

>        gcc_checking_assert (begin[0]->regno () != begin[1]->regno ());
>        if (begin[0]->regno () > begin[1]->regno ())
>  	std::swap (begin[0], begin[1]);
> diff --git a/gcc/rtl-ssa/changes.cc b/gcc/rtl-ssa/changes.cc
> index 0476296607b..9a84bf10129 100644
> --- a/gcc/rtl-ssa/changes.cc
> +++ b/gcc/rtl-ssa/changes.cc
> @@ -228,10 +228,7 @@ rtl_ssa::changes_are_worthwhile (array_slice<insn_change *const> changes,
>        for (const insn_change *change : changes)
>  	if (!change->is_deletion ())
>  	  {
> -	    if (INSN_CODE (change->rtl ()) == NOOP_MOVE_INSN_CODE)
> -	      fprintf (dump_file, " %c nop", sep);
> -	    else
> -	      fprintf (dump_file, " %c %d", sep, change->new_cost);
> +	    fprintf (dump_file, " %c %d", sep, change->new_cost);

This looks unneeded?

>  	    sep = '+';
>  	  }
>        if (weighted_new_cost != 0)
> @@ -320,13 +317,11 @@ choose_insn_placement (insn_change &change)
>    insn_info *insn = change.insn ();
>    insn_info *first = change.move_range.first;
>    insn_info *last = change.move_range.last;
> -

Stray whitespace change.

>    // Quick(ish) exit if there is only one possible choice.
>    if (first == last)
>      return first;
>    if (first == insn->prev_nondebug_insn () && last == insn)
>      return insn;
> -
>    // For now just use the closest valid choice to the original instruction.
>    // If the register usage has changed significantly, it might instead be
>    // better to try to take register pressure into account.
> @@ -457,9 +452,11 @@ function_info::finalize_new_accesses (insn_change &change, insn_info *pos,
>  
>    // Build up the new list of definitions.
>    for (rtx_obj_reference ref : properties.refs ())
> -    if (ref.is_write ())
> +    if (ref.is_reg ())

Why this change?

>        {
>  	def_info *def = find_access (change.new_defs, ref.regno);
> +	if (ref.is_read () && !def)
> +	  continue;

This looks like it's just papering over a bug elsewhere that triggers
the below assert.

>  	gcc_assert (def);
>  
>  	if (def->m_is_temp && is_a<set_info *> (def) && def->last_def ())
> @@ -497,11 +494,17 @@ function_info::finalize_new_accesses (insn_change &change, insn_info *pos,
>  	  }
>  	else if (!def->m_has_been_superceded)
>  	  {
> -	    // This is a second or subsequent definition.
> -	    // See function_info::record_def for a discussion of when
> -	    // this can happen.
> -	    def->record_reference (ref, false);
> -	    continue;
> +	    if (def && def->insn () == insn)
> +	      {
> +		if (!ref.is_clobber ())
> +		  {
> +		    // This is a second or subsequent definition.
> +		    // See function_info::record_def for a discussion of when
> +		    // this can happen.
> +		    def->record_reference (ref, false);
> +		    continue;
> +		  }
> +	      }
>  	  }
>  	else
>  	  {
> @@ -700,7 +703,7 @@ function_info::apply_changes_to_insn (insn_change &change,
>    unsigned int num_uses = change.new_uses.size ();
>    if (num_defs + num_uses <= insn->num_defs () + insn->num_uses ())
>      insn->copy_accesses (change.new_defs, change.new_uses);
> -  else
> +   else
>      {
>        access_array_builder builder (&m_obstack);
>        builder.reserve (num_defs + num_uses);
> @@ -751,11 +754,13 @@ function_info::change_insns (array_slice<insn_change *> changes)
>  
>  	  // Make sure that the placement of this instruction would still
>  	  // leave room for previous instructions.
> -	  change->move_range = move_later_than (change->move_range, min_insn);
> +	  if (min_insn->rtl ())
> +	    change->move_range = move_later_than (change->move_range, min_insn);

Why is this change needed?

>  	  if (!canonicalize_move_range (change->move_range, change->insn ()))
>  	    // verify_insn_changes is supposed to make sure that this holds.
>  	    gcc_unreachable ();
> -	  min_insn = later_insn (min_insn, change->move_range.first);
> +	   if (min_insn-> rtl ())
> +	     min_insn = later_insn (min_insn, change->move_range.first);
>  
>  	  if (change->insn ()->m_is_temp)
>  	    {
> @@ -785,7 +790,6 @@ function_info::change_insns (array_slice<insn_change *> changes)
>  	delete_insn (change);
>        else
>  	{
> -	  // Make sure that this instruction comes before later ones.
>  	  if (following_insn)
>  	    {
>  	      change.move_range = move_earlier_than (change.move_range,
> @@ -799,15 +803,18 @@ function_info::change_insns (array_slice<insn_change *> changes)
>  
>  	  // Decide which instruction INSN should go after.
>  	  insn_info *after = choose_insn_placement (change);
> -
>  	  // If INSN is moving, insert a placeholder insn_info at the
>  	  // new location.  We can't move INSN itself yet because it
>  	  // might still be referenced by earlier move ranges.
>  	  insn_info *insn = change.insn ();
> -	  if (after == insn || after == insn->prev_nondebug_insn ())
> +	  unsigned int after_uid = INSN_UID (after->rtl ());
> +	  unsigned int change_uid = INSN_UID (change.insn ()->rtl ());
> +	  if (after_uid == change_uid || after == insn
> +	      || after == insn->prev_nondebug_insn ())
>  	    {
>  	      update_insn_in_place (change);
> -	      following_insn = insn;
> +	      if (after_uid != change_uid)
> +		following_insn = insn;
>  	    }
>  	  else
>  	    {
> @@ -818,7 +825,7 @@ function_info::change_insns (array_slice<insn_change *> changes)
>  	}
>        placeholders[i] = placeholder;
>      }
> -
> +//#endif
>    // We need to keep track of newly-added sets as these need adding to
>    // the def chain later.
>    hash_set<def_info *> new_sets;
> @@ -1282,6 +1289,81 @@ function_info::create_insn (obstack_watermark &watermark,
>    return insn;
>  }
>  
> +void
> +function_info::update_change (def_info *def, set_info *set)
> +{
> +  if (!set->has_any_uses ())
> +    return;
> +
> +  auto *use = *set->all_uses ().begin ();
> +  do
> +    {
> +      auto *next_use = use->next_use ();
> +      if (use->is_in_debug_insn ())
> +	substitute_debug_use (use);
> +      else if (use->is_in_phi ())
> +	{
> +	  update_change (def, use->phi ());
> +	}
> +      else
> +	{
> +	  remove_use (use);
> +	}
> +      use = next_use;
> +    }
> +  while (use);
> +}
> +
> +void function_info::substitute_debug_use (use_info *use)
> +{
> +  auto *use_insn = use->insn ();
> +  rtx_insn *use_rtl = use_insn->rtl ();
> +  auto use_change = insn_change (use_insn);
> +  use_change.new_uses = {};
> +  use_change.move_range = use_change.insn ();
> +  INSN_VAR_LOCATION_LOC (use_rtl) = gen_rtx_UNKNOWN_VAR_LOC ();
> +  insn_change *changes[] = { &use_change };
> +  crtl->ssa->change_insns (changes);
> +}
> +
> +void
> +function_info::update_changes (insn_info *insn,
> +				obstack_watermark *attempt,
> +				insn_change **change,
> +				auto_vec<insn_change *> &changes)
> +{
> +  for (insn_change *change : changes)
> +  {
> +    for (auto def : change->old_defs ())
> +      {
> +	auto set = dyn_cast<set_info *> (def);
> +	update_change (def ,set);
> +      }
> +  }
> +  auto &new_defs = (*change)->new_defs;
> +  vec_rtx_properties properties;
> +  properties.add_insn (insn->rtl (), true);
> +  // Build up the new list of definitions.
> +  for (rtx_obj_reference ref : properties.refs ())
> +    if (ref.is_write ())
> +      {
> +	auto *set = crtl->ssa->allocate<set_info > (insn,
> +					   full_register (ref.regno));
> +	if (set)
> +	  {
> +	    auto def = find_access (new_defs, ref.regno);
> +	    if (!def)
> +	      {
> +		insn->defs() = insert_access (*attempt, set, insn->defs());
> +		new_defs = insert_access (*attempt, set,
> +					  new_defs);
> +		auto &m_temp_defs = get_m_temp_defs ();
> +		m_temp_defs.safe_push (set);
> +	      }
> +	  }
> +      }
> +}
> +
>  // Print a description of CHANGE to PP.
>  void
>  rtl_ssa::pp_insn_change (pretty_printer *pp, const insn_change &change)
> diff --git a/gcc/rtl-ssa/functions.h b/gcc/rtl-ssa/functions.h
> index 567701c7995..053f7155743 100644
> --- a/gcc/rtl-ssa/functions.h
> +++ b/gcc/rtl-ssa/functions.h
> @@ -84,6 +84,18 @@ public:
>    insn_info *create_insn (obstack_watermark &watermark,
>  			  rtx_code insn_code,
>  			  rtx pat);
> +  insn_info *create_insn_temp (obstack_watermark &watermark,
> +			  rtx_code insn_code,
> +			  rtx pat, insn_info *info);

This looks unused, please drop it.

> +
> +  void update_changes (insn_info *first,
> +		       obstack_watermark *attempt,
> +		       insn_change **pair_change,
> +		       auto_vec<insn_change *> &changes);
> +
> +  void update_change (def_info *def, set_info *set);
> +
> +  void substitute_debug_use (use_info *use);
>  
>    // Return a list of all the instructions in the function, in reverse
>    // postorder.  The list includes both real and artificial instructions.
> @@ -222,6 +234,14 @@ public:
>    template<typename T, typename... Ts>
>    T *change_alloc (obstack_watermark &wm, Ts... args);
>  
> +  auto_vec<access_info *> &get_m_temp_defs () { return m_temp_defs; }
> + 
> +  template<typename T, typename... Ts>
> +  T *allocate (Ts... args);
> +
> +  template<typename T, typename... Ts>
> +  T *allocate_temp (Ts... args);

I don't think we want to expose rtl-ssa internals like this.

> +
>  private:
>    class bb_phi_info;
>    class build_info;
> @@ -230,13 +250,6 @@ private:
>    // Return an RAII object that owns all objects allocated by
>    // allocate_temp during its lifetime.
>    obstack_watermark temp_watermark () { return &m_temp_obstack; }
> -
> -  template<typename T, typename... Ts>
> -  T *allocate (Ts... args);
> -
> -  template<typename T, typename... Ts>
> -  T *allocate_temp (Ts... args);
> -
>    access_array temp_access_array (access_array accesses);
>  
>    clobber_group *need_clobber_group (clobber_info *);
> @@ -257,6 +270,7 @@ private:
>    void append_clobber_to_group (clobber_info *, clobber_group *);
>    void merge_clobber_groups (clobber_info *, clobber_info *,
>  			     def_info *);
> +

Stray whitespace.

Thanks for working on this and apologies again for the slow review.

Alex

>    std::array<clobber_group *, 2> split_clobber_group (clobber_group *,
>  						      insn_info *);
>  
> diff --git a/gcc/testsuite/g++.target/powerpc/mem-fusion-1.C b/gcc/testsuite/g++.target/powerpc/mem-fusion-1.C
> new file mode 100644
> index 00000000000..d10ff0cdf36
> --- /dev/null
> +++ b/gcc/testsuite/g++.target/powerpc/mem-fusion-1.C
> @@ -0,0 +1,22 @@
> +/* { dg-do compile } */
> +/* { dg-require-effective-target power10_ok } */
> +/* { dg-options "-mdejagnu-cpu=power10 -O2" } */
> +
> +#include <altivec.h>
> +	
> +void
> +foo2 ()
> +{
> +  __vector_quad *dst1;
> +  __vector_quad *dst2;
> +  vector unsigned char src;
> +  __vector_quad acc;
> +  vector unsigned char *ptr;
> +  __builtin_mma_xvf32ger(&acc, src, ptr[0]);
> +  __builtin_mma_xvf32gerpp(&acc, src, ptr[1]);
> +  *dst1 = acc;
> +  __builtin_mma_xvf32ger(&acc, src, ptr[2]);
> +  __builtin_mma_xvf32gerpp(&acc, src, ptr[3]);
> +  *dst2 = acc;
> +}
> +/* { dg-final { scan-assembler {\mlxvp\M} } } */
> diff --git a/gcc/testsuite/g++.target/powerpc/mem-fusion.C b/gcc/testsuite/g++.target/powerpc/mem-fusion.C
> new file mode 100644
> index 00000000000..c523572cf3c
> --- /dev/null
> +++ b/gcc/testsuite/g++.target/powerpc/mem-fusion.C
> @@ -0,0 +1,15 @@
> +/* { dg-do compile } */ 
> +/* { dg-require-effective-target power10_ok } */
> +/* { dg-options "-mdejagnu-cpu=power10 -O2" } */ 
> +
> +#include <altivec.h>
> +
> +void
> +foo (__vector_quad *dst, vector unsigned char *ptr, vector unsigned char src)
> +{
> +  __vector_quad acc;
> +  __builtin_mma_xvf32ger(&acc, src, ptr[0]);
> +  __builtin_mma_xvf32gerpp(&acc, src, ptr[1]);
> +  *dst = acc;
> +}
> +/* { dg-final { scan-assembler {\mlxvp\M} } } */
> diff --git a/gcc/testsuite/gcc.target/powerpc/mma-builtin-1.c b/gcc/testsuite/gcc.target/powerpc/mma-builtin-1.c
> index 69ee826e1be..ae29127f954 100644
> --- a/gcc/testsuite/gcc.target/powerpc/mma-builtin-1.c
> +++ b/gcc/testsuite/gcc.target/powerpc/mma-builtin-1.c
> @@ -258,8 +258,8 @@ foo13b (__vector_quad *dst, __vector_quad *src, vec_t *vec)
>    dst[13] = acc;
>  }
>  
> -/* { dg-final { scan-assembler-times {\mlxv\M} 40 } } */
> -/* { dg-final { scan-assembler-times {\mlxvp\M} 12 } } */
> +/* { dg-final { scan-assembler-times {\mlxv\M} 0 } } */
> +/* { dg-final { scan-assembler-times {\mlxvp\M} 32 } } */
>  /* { dg-final { scan-assembler-times {\mstxvp\M} 40 } } */
>  /* { dg-final { scan-assembler-times {\mxxmfacc\M} 20 } } */
>  /* { dg-final { scan-assembler-times {\mxxmtacc\M} 6 } } */
> -- 
> 2.43.5
>
diff mbox series

Patch

diff --git a/gcc/config.gcc b/gcc/config.gcc
index 08291f4b6e0..c043abdb871 100644
--- a/gcc/config.gcc
+++ b/gcc/config.gcc
@@ -530,6 +530,7 @@  powerpc*-*-*)
 	extra_objs="rs6000-string.o rs6000-p8swap.o rs6000-logue.o"
 	extra_objs="${extra_objs} rs6000-call.o rs6000-pcrel-opt.o"
 	extra_objs="${extra_objs} rs6000-builtins.o rs6000-builtin.o"
+	extra_objs="${extra_objs} rs6000-mem-fusion.o"
 	extra_headers="ppc-asm.h altivec.h htmintrin.h htmxlintrin.h"
 	extra_headers="${extra_headers} bmi2intrin.h bmiintrin.h"
 	extra_headers="${extra_headers} xmmintrin.h mm_malloc.h emmintrin.h"
@@ -566,6 +567,7 @@  rs6000*-*-*)
 	extra_options="${extra_options} g.opt fused-madd.opt rs6000/rs6000-tables.opt"
 	extra_objs="rs6000-string.o rs6000-p8swap.o rs6000-logue.o"
 	extra_objs="${extra_objs} rs6000-call.o rs6000-pcrel-opt.o"
+	extra_objs="${extra_objs} rs6000-mem-fusion.o"
 	target_gtfiles="$target_gtfiles \$(srcdir)/config/rs6000/rs6000-logue.cc \$(srcdir)/config/rs6000/rs6000-call.cc"
 	target_gtfiles="$target_gtfiles \$(srcdir)/config/rs6000/rs6000-pcrel-opt.cc"
 	;;
diff --git a/gcc/config/rs6000/mma.md b/gcc/config/rs6000/mma.md
index 04e2d0066df..88413926a02 100644
--- a/gcc/config/rs6000/mma.md
+++ b/gcc/config/rs6000/mma.md
@@ -294,7 +294,31 @@ 
 
 (define_insn_and_split "*movoo"
   [(set (match_operand:OO 0 "nonimmediate_operand" "=wa,ZwO,wa")
-	(match_operand:OO 1 "input_operand" "ZwO,wa,wa"))]
+        (match_operand:OO 1 "input_operand" "ZwO,wa,wa"))]
+  "TARGET_MMA
+   && (gpc_reg_operand (operands[0], OOmode)
+       || gpc_reg_operand (operands[1], OOmode))"
+;;    ""
+  "@
+   #
+   #
+   #"
+  "&& reload_completed"
+  [(const_int 0)]
+{
+  rs6000_split_multireg_move (operands[0], operands[1]);
+  DONE;
+}
+  [(set_attr "type" "vecload,vecstore,veclogical")
+   (set_attr "length" "*,*,8")])
+;;   (set_attr "max_prefixed_insns" "2,2,*")])
+
+
+(define_insn_and_split "*movoo1"
+  [(set (match_operand:OO 0 "nonimmediate_operand" "=wa,ZwO,wa")
+        (unspec [
+          (match_operand:OO 1 "input_operand" "ZwO,wa,wa")
+        ] UNSPEC_LXVP))]
   "TARGET_MMA
    && (gpc_reg_operand (operands[0], OOmode)
        || gpc_reg_operand (operands[1], OOmode))"
diff --git a/gcc/config/rs6000/rs6000-mem-fusion.cc b/gcc/config/rs6000/rs6000-mem-fusion.cc
new file mode 100644
index 00000000000..8c27a8f32e1
--- /dev/null
+++ b/gcc/config/rs6000/rs6000-mem-fusion.cc
@@ -0,0 +1,695 @@ 
+/* Subroutines used to perform adjacent load/store into
+   paired memory accesses for TARGET_POWER10 and TARGET_VSX.
+
+   Copyright (C) 2024 Free Software Foundation, Inc.
+
+   This file is part of GCC.
+
+   GCC is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published
+   by the Free Software Foundation; either version 3, or (at your
+   option) any later version.
+
+   GCC is distributed in the hope that it will be useful, but WITHOUT
+   ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
+   or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public
+   License for more details.
+
+   You should have received a copy of the GNU General Public License
+   along with GCC; see the file COPYING3.  If not see
+   <http://www.gnu.org/licenses/>.  */
+
+#define INCLUDE_ALGORITHM
+#define INCLUDE_FUNCTIONAL
+#define INCLUDE_LIST
+#define INCLUDE_TYPE_TRAITS
+#include "config.h"
+#include "system.h"
+#include "coretypes.h"
+#include "backend.h"
+#include "rtl.h"
+#include "df.h"
+#include "rtl-iter.h"
+#include "rtl-ssa.h"
+#include "rtl-ssa/internals.h"
+#include "rtl-ssa/internals.inl"
+#include "cfgcleanup.h"
+#include "tree-pass.h"
+#include "ordered-hash-map.h"
+#include "pair-fusion.h"
+
+using namespace rtl_ssa;
+
+struct rs6000_pair_fusion : public pair_fusion
+{
+  bool fpsimd_op_p (rtx , machine_mode , bool)  override final
+  {
+    return false;
+  }
+
+  bool pair_mem_insn_p (rtx_insn *, bool &) override final
+  {
+    return false;
+  }
+
+  void change_existing_multword_mode (rtx_insn *insn) override final;
+
+  bool pair_mem_ok_with_policy (rtx, bool) override final
+  {
+    return true;
+  }
+
+  bool pair_operand_mode_ok_p (machine_mode mode) override final;
+
+  rtx gen_pair (rtx *pats, rtx, bool load_p) override final;
+
+  bool pair_reg_operand_ok_p (bool, rtx, machine_mode) override final
+  {
+    return true;
+  }
+
+  int pair_mem_alias_check_limit () override final
+  {
+    return 0;
+  }
+
+  bool should_handle_writeback (enum writeback_type) override final
+  {
+    return false;
+  }
+
+  bool track_loads_p () override final
+  {
+    return true;
+  }
+
+  bool track_stores_p () override final
+  {
+    return true;
+  }
+
+  bool pair_mem_in_range_p (HOST_WIDE_INT) override final
+  {
+    return true;
+  }
+
+  rtx gen_promote_writeback_pair (rtx, rtx, rtx *, bool) override final
+  {
+    return NULL_RTX;
+  }
+
+  rtx destructure_pair (rtx_def **, rtx, bool) override final
+  {
+    return NULL_RTX;
+  }
+
+  bool fuseable_store_p (insn_info *i1, insn_info *i2) override final;
+
+  bool fuseable_load_p (insn_info *i1) override final;
+
+  void set_multiword_subreg (insn_info *i1, insn_info *i2,
+			     bool load_p, rtx *pats) override final;
+
+  void set_uses (insn_info *i1, insn_info *i2,
+		 insn_change **pair_change,
+		 obstack_watermark *attempt) override final;
+
+  void  set_defs (insn_info *i1, insn_info *i2,
+		  insn_change **pair_change,
+		  obstack_watermark *attempt) override final;
+
+
+  void modify_new_rtx_insn (insn_info *first, obstack_watermark *attempt,
+			    insn_change **pair_change,
+			    auto_vec <insn_change *> &changes) override final;
+};
+
+bool
+rs6000_pair_fusion::pair_operand_mode_ok_p (machine_mode mode)
+{
+  return (ALTIVEC_OR_VSX_VECTOR_MODE (mode));
+}
+
+void
+rs6000_pair_fusion::change_existing_multword_mode (rtx_insn *insn)
+{
+  rtx set = single_set (insn);
+  rtx src = SET_SRC (set);
+  rtx dest = SET_DEST (set);
+  rtx copy = NULL_RTX;
+
+  if ((MEM_P (src) && GET_MODE (src) == OOmode)
+       || (MEM_P (dest) && GET_MODE (dest) == OOmode))
+    {
+      rtx unspec  = gen_rtx_UNSPEC (GET_MODE (dest),
+				    gen_rtvec (1, src),
+				    UNSPEC_LXVP);
+      copy =  gen_rtx_SET (dest, unspec);
+      rtx_insn *new_insn = emit_insn_after (copy, insn);
+      set_block_for_insn (new_insn, BLOCK_FOR_INSN (insn));
+      df_insn_rescan (new_insn);
+      df_insn_delete (insn);
+      remove_insn (insn);
+      insn->set_deleted ();
+    }
+}
+
+void
+rs6000_pair_fusion::modify_new_rtx_insn (insn_info *first,
+					 obstack_watermark *attempt,
+					 insn_change **pair_change,
+					 auto_vec<insn_change *> &changes)
+{
+  def_array new_defs;
+  vec_rtx_properties properties;
+  properties.add_insn (first->rtl (), true);
+  for (rtx_obj_reference ref : properties.refs ())
+    if (ref.is_write () && ref.is_reg ())
+      {
+	auto def = find_access (first->defs (), ref.regno);
+	if (!def)
+	  {
+	    auto *set = crtl->ssa->allocate<set_info> (first,
+						       full_register (ref.regno));
+	    new_defs = insert_access (*attempt, set, new_defs);
+	  }
+	else
+	  new_defs = insert_access (*attempt, def, new_defs);
+      }
+
+  (*pair_change)->new_defs
+    = merge_access_arrays (*attempt, (*pair_change)->new_defs, new_defs);
+  crtl->ssa->update_changes (first, attempt, pair_change, changes);
+
+}
+
+//Given insn I1, PAIR_CHANGE and obstack_watermark ATTEMPT sets new
+//uses in insn_change.
+static void
+set_uses_load (insn_info *i1,insn_change **pair_change,
+	       obstack_watermark *attempt)
+{
+  for (auto def : i1->defs ())
+    {
+      auto set = dyn_cast<set_info *> (def);
+      for (auto use : set->all_uses ())
+	{
+	  insn_change *change = *pair_change;
+	  change->new_uses = insert_access (*attempt, use, change->new_uses);
+	}
+    }
+}
+
+void rs6000_pair_fusion::set_uses (insn_info *i1,
+				   insn_info *i2,
+				   insn_change **pair_change,
+				   obstack_watermark *attempt)
+{
+  set_uses_load (i1, pair_change, attempt);
+  set_uses_load (i2, pair_change, attempt);
+}
+
+// df_insn_rescan dependent instruction where operands
+// are reversed given insn_info INFO.
+static void
+set_rescan_load (insn_info *i1)
+{
+  for (auto def : i1->defs ())
+    {
+      auto set = dyn_cast<set_info *> (def);
+      for (auto use : set->nondebug_insn_uses ())
+	{
+	  insn_info *info = use->insn ();
+	  if (use->is_artificial ())
+	    continue;
+	  rtx_insn *rtl_insn = info->rtl ();
+	  df_insn_rescan (rtl_insn);
+	}
+    }
+}
+
+//Given insn INSN, PAIR_CHANGE and obstack_watermark ATTEMPT sets new
+//defs in insn_change.
+static bool
+set_defs_store (insn_info *insn,
+		insn_change **pair_change,
+		obstack_watermark *attempt)
+{
+  for (auto use : insn->uses())
+    {
+      auto def = use->def ();
+
+      if (!def)
+	return false;
+
+      if (def->insn ()->is_artificial ())
+	{
+	  insn_change *change = *pair_change;
+	  change->new_defs = insert_access (*attempt, def, change->new_defs);
+	}
+    }
+  return true;
+}
+
+void rs6000_pair_fusion::set_defs(insn_info *i1,
+				  insn_info *i2,
+				  insn_change **pair_change,
+				  obstack_watermark *attempt)
+{
+  set_defs_store (i1, pair_change, attempt);
+  set_defs_store (i2, pair_change, attempt);
+}
+
+// df_insn_rescan the def instruction where operands are reversed given INSN.
+static bool
+set_rescan_store (insn_info *insn)
+{
+  for (auto use : insn->uses())
+    {
+      auto def = use->def ();
+
+      if (!def)
+	return false;
+
+      if (def->insn ()->is_artificial ())
+	return false;
+
+      if (def->insn () && def->insn ()->rtl ()
+	  && def->insn()->is_real ())
+	{
+	  rtx_insn *rtl_insn = def->insn ()->rtl ();
+	  rtx set = single_set (rtl_insn);
+
+	  if (set == NULL_RTX)
+	    return false;
+	  df_insn_rescan (rtl_insn);
+	}
+    }
+  return true;
+}
+
+// Check for feasibility of store to be fuseable or not. Return true if
+// feasible otherwise false.
+static bool
+feasible_store_p (insn_info *insn)
+{
+  for (auto use : insn->uses ())
+    {
+      auto def = use->def ();
+
+      if (def->insn ()->is_artificial ())
+	return false;
+
+      if (def->insn () && def->insn ()->rtl ()
+	  && def->insn()->is_real ())
+	{
+	  rtx_insn *rtl_insn = def->insn ()->rtl ();
+	  rtx set = single_set (rtl_insn);
+
+	  if (set == NULL_RTX)
+	    return false;
+
+	  // Return false if dependent def is load.
+	  // This is done as def instruction could be a fused load and
+	  // to avoid already existing subreg (reg:OO R) offset.
+	  if (rtl_insn && MEM_P (SET_SRC (set)))
+	    return false;
+
+	  // Return false if dependent def is store.
+	  if (rtl_insn && MEM_P (SET_DEST (set)))
+	    return false;
+	}
+    }
+  return true;
+}
+
+// Check if store can be fuseable or not.  Return true if fuseable otherwise
+// false.
+bool
+rs6000_pair_fusion::fuseable_store_p (insn_info *i1, insn_info *i2)
+{
+  rtx_insn *insn1 = i1->rtl ();
+  rtx_insn *insn2 = i2->rtl ();
+  rtx body = PATTERN (insn1);
+  rtx src_exp = SET_SRC (body);
+  rtx insn2_body = PATTERN (insn2);
+  rtx insn2_src_exp = SET_SRC (insn2_body);
+
+  if (!(REG_P (src_exp)
+      && crtl->ssa->single_dominating_def (REGNO (src_exp))))
+    return false;
+
+  if (!(REG_P (insn2_src_exp)
+      && crtl->ssa->single_dominating_def (REGNO (insn2_src_exp))))
+    return false;
+
+  // This is done as def instruction could be a fused load and
+  // to avoid  already existing subreg (reg:OO R) offset.
+  if (DF_REG_USE_COUNT (REGNO (src_exp)) > 1)
+    return false;
+
+  // Return false if src of insn1 and src of insn2 are same.
+  if (src_exp == insn2_src_exp)
+    return false;
+
+  if (!feasible_store_p (i1))
+    return false;;
+
+  if (!feasible_store_p (i2))
+    return false;
+
+  return true;
+}
+
+// Set subreg for def of store INSN given rtx SRC instruction.
+static void
+set_store_subreg (insn_info *i1, rtx src, int regoff)
+{
+  for (auto use: i1->uses ())
+    {
+      auto def = use->def ();
+      if (!def)
+	return;
+
+      insn_info *info = def->insn ();
+
+      if (info->is_artificial ())
+	return;
+
+      if (info && info->is_real ())
+	{
+	  rtx_insn *rtl_insn = info->rtl ();
+	  rtx set = single_set (rtl_insn);
+	  if (set == NULL_RTX)
+	    return;
+	  df_ref ref;
+	  FOR_EACH_INSN_DEF (ref, rtl_insn)
+	    {
+	      rtx src_exp = SET_SRC (PATTERN (i1->rtl ()));
+	      if (REG_P (src_exp) && DF_REF_REGNO (ref) == REGNO (src_exp))
+		{
+		  rtx *loc = DF_REF_LOC (ref);
+		  if (GET_CODE (*loc) == SUBREG)
+		    {
+		      src = simplify_gen_subreg (GET_MODE (*loc),
+						      SUBREG_REG (src),
+						      OOmode,
+						      regoff);
+		    }
+		  *loc =  copy_rtx (src);
+		}
+	    }
+	}
+    }
+}
+
+// Check whether load can be fusable or not.
+// Return true if fuseable otherwise false.
+bool
+rs6000_pair_fusion::fuseable_load_p (insn_info *i1)
+{
+  rtx_insn *insn = i1->rtl ();
+  rtx body = PATTERN (insn);
+  rtx dest_exp = SET_DEST (body);
+
+  if (!(REG_P (dest_exp)
+      && crtl->ssa->single_dominating_def (REGNO (dest_exp))))
+    return false;
+
+  return true;
+}
+
+// Propagate insn I1 with new rtx NEW_DEST_EXP.
+static void
+propagate_insn (insn_info *i1, rtx new_dest_exp)
+{
+  df_ref ref;
+  FOR_EACH_INSN_DEF (ref, i1->rtl())
+    {
+      rtx dest_exp = SET_DEST (PATTERN (i1->rtl ()));
+      if (REG_P (dest_exp)
+	  && DF_REF_REGNO (ref) == REGNO (dest_exp))
+	{
+	  rtx *loc = DF_REF_LOC (ref);
+	  *loc = new_dest_exp;
+	}
+     }
+}
+
+static bool
+set_load_subreg (auto_vec <insn_change *> &changes,
+		 insn_info *i1, rtx src, obstack_watermark *attempt)
+{
+  rtx set = single_set (i1->rtl());
+  rtx old_dest = SET_DEST (set);
+
+  for (auto def : i1->defs ())
+    {
+      auto set1 = dyn_cast<set_info *> (def);
+      for (auto use : set1->nondebug_insn_uses ())
+	{
+	  if (use->is_artificial ())
+	    continue;
+
+	  insn_info *info = use->insn ();
+	  rtx_insn *rtl_insn = info->rtl ();
+	  insn_propagation prop (rtl_insn, old_dest, src);
+	  if (!prop.apply_to_pattern (&PATTERN (rtl_insn)))
+	    gcc_assert (0);
+
+	  vec_rtx_properties properties;
+	  insn_change *use_change;
+	  auto new_insn = crtl->ssa->create_insn (*attempt, INSN,
+						  PATTERN (rtl_insn));
+	  use_change = crtl->ssa->change_alloc<insn_change> (*attempt,
+							     new_insn);
+	  use_array new_uses_temp;
+	  properties.add_insn (rtl_insn ,  true);
+
+	  for (rtx_obj_reference ref : properties.refs ())
+	    if (ref.is_reg ())
+	      {
+		auto def = find_access (info->defs (), ref.regno);
+		if (ref.is_write () || use_change->new_defs.size () == 0)
+		  {
+		    if (!def)
+		      {
+			auto resource = full_register (ref.regno);
+			auto *set
+			  = crtl->ssa->allocate<set_info> (info,
+							   resource);
+			use_change->new_defs
+			  = insert_access (*attempt, set,
+					   use_change->new_defs);
+		      }
+		   else
+		     use_change->new_defs
+		       = insert_access (*attempt, def,
+					use_change->new_defs);
+
+		  }
+		auto_vec<access_info *, 1> new_uses;
+		auto *set
+		  = crtl->ssa->allocate<set_info> (new_insn,
+						   full_register (ref.regno));
+		if (set)
+		  {
+		    auto new_use = crtl->ssa->create_use (*attempt, new_insn,
+							  set);
+		    if (!find_access (new_uses_temp, ref.regno))
+		      {
+			if (new_uses_temp.size () == 0)
+			  {
+			    new_uses.quick_push (new_use);
+			    new_uses_temp = use_array (access_array (new_uses));
+			    new_uses_temp = merge_access_arrays (*attempt,
+								 new_uses_temp,
+								 new_uses_temp);
+			  }
+			else
+			  new_uses_temp = insert_access (*attempt, new_use,
+							 new_uses_temp);
+		      }
+		  }
+	      }
+	   use_change->new_uses = new_uses_temp;
+	   use_change->move_range = new_insn;
+	   changes.safe_push (use_change);
+	}
+    }
+  return true;
+}
+
+// Set subreg for OO mode store pair to generate registers in pairs
+// given insn_info I1 and I2.
+static void
+set_multiword_subreg_store (insn_info *i1, insn_info *i2)
+{
+  rtx_insn *insn1 = i1->rtl ();
+  rtx_insn *insn2 = i2->rtl ();
+  rtx body = PATTERN (insn1);
+  rtx src_exp = SET_SRC (body);
+  rtx insn2_body = PATTERN (insn2);
+  rtx insn2_dest_exp = SET_DEST (insn2_body);
+  machine_mode mode = GET_MODE (src_exp);
+  int regoff;
+  rtx src;
+  rtx addr = XEXP (insn2_dest_exp, 0);
+
+  PUT_MODE_RAW (src_exp, OOmode);
+  if (GET_CODE (addr) == PLUS
+      && XEXP (addr, 1) && CONST_INT_P (XEXP (addr, 1)))
+    regoff = 16;
+  else
+    regoff = 0;
+
+  src = simplify_gen_subreg (mode,
+			     src_exp, GET_MODE (src_exp),
+			     regoff);
+
+  set_store_subreg (i1, src, regoff);
+
+  int regoff1 = 0;
+  rtx src1;
+
+  src1 = simplify_gen_subreg (mode,
+			      src_exp, GET_MODE (src_exp),
+			      regoff1);
+
+  set_store_subreg (i2, src1, regoff1);
+  set_rescan_store (i1);
+  set_rescan_store (i2);
+  df_insn_rescan (insn1);
+}
+
+// Set subreg for OO mode pair load for existing subreg rtx to generate
+// registers in pairs given insn_info I2 and I2.
+static rtx
+set_multiword_subreg_load (insn_info *i1, insn_info *i2)
+{
+  rtx_insn *insn1 = i1->rtl ();
+  rtx body = PATTERN (insn1);
+  rtx dest_exp = SET_DEST (body);
+  machine_mode mode = GET_MODE (dest_exp);
+  int regoff1;
+  regoff1 = 16;
+  auto attempt = crtl->ssa->new_change_attempt ();
+  rtx new_dest_exp = gen_reg_rtx (OOmode);
+  auto_vec <insn_change *> use_changes;
+  rtx src = simplify_gen_subreg (mode,
+				 new_dest_exp,
+				 OOmode,
+				 regoff1);
+
+  set_load_subreg (use_changes, i1, src, &attempt);
+  propagate_insn (i1, new_dest_exp);
+  confirm_change_group ();
+  crtl->ssa->change_insns (use_changes);
+  use_changes.release ();
+
+  int regoff = 0;
+  rtx sset = single_set (i2->rtl ());
+  rtx insn2_dest_exp = SET_DEST (sset);
+  machine_mode insn2_mode = GET_MODE (insn2_dest_exp);
+
+  src = simplify_gen_subreg (insn2_mode,
+			     new_dest_exp,
+			     OOmode,
+			     regoff);
+
+  set_load_subreg (use_changes, i2, src, &attempt);
+  propagate_insn (i2, new_dest_exp);
+  confirm_change_group ();
+  crtl->ssa->change_insns (use_changes);
+  set_rescan_load (i1);
+  set_rescan_load (i2);
+  df_insn_rescan (i1->rtl ());
+  return new_dest_exp;
+}
+
+// Set subreg for OO mode pair to generate sequential registers given
+// insn_info pairs I1, I2 and LOAD_P is true iff load insn and false
+// if store insn.
+void
+rs6000_pair_fusion::set_multiword_subreg (insn_info *i1, insn_info *i2,
+					  bool load_p, rtx *pats)
+{
+  if (load_p)
+    {
+      rtx new_dest_exp =  set_multiword_subreg_load (i1, i2);
+      if (new_dest_exp)
+	{
+	  rtx src = SET_SRC (pats[0]);
+	  rtx set = gen_rtx_SET (new_dest_exp, src);
+	  pats[0] = set;
+	}
+    }
+  else
+    set_multiword_subreg_store (i1, i2);
+}
+
+rtx
+rs6000_pair_fusion::gen_pair (rtx *pats, rtx,  bool load_p)
+{
+  rtx i1 = pats[0];
+  rtx src_exp = SET_SRC (i1);
+  rtx dest_exp = SET_DEST (i1);
+  PUT_MODE_RAW (src_exp, OOmode);
+  PUT_MODE_RAW (dest_exp, OOmode);
+  rtx unspec  = gen_rtx_UNSPEC (GET_MODE (dest_exp),
+				gen_rtvec (1, src_exp),
+				UNSPEC_LXVP);
+  rtx set =  gen_rtx_SET (dest_exp, unspec);
+  if (dump_file)
+    {
+      if (load_p)
+	fprintf (dump_file, "lxv with lxvp ");
+      else
+	fprintf (dump_file, "stxv with stxvp ");
+      print_rtl_single (dump_file, set);
+    }
+
+  return set;
+}
+
+const pass_data pass_data_mem_fusion =
+{
+  RTL_PASS, /* type */
+  "mem_fusion", /* name */
+  OPTGROUP_NONE, /* optinfo_flags */
+  TV_NONE, /* tv_id */
+  0, /* properties_required */
+  0, /* properties_provided */
+  0, /* properties_destroyed */
+  0, /* todo_flags_start */
+  TODO_df_finish, /* todo_flags_finish */
+};
+
+class pass_mem_fusion : public rtl_opt_pass
+{
+public:
+  pass_mem_fusion (gcc::context *ctxt)
+    : rtl_opt_pass (pass_data_mem_fusion, ctxt)
+  {}
+
+  opt_pass *clone () override { return new pass_mem_fusion (m_ctxt);}
+
+  /* opt_pass methods: */
+  bool gate (function *)
+    {
+      return (optimize > 0 && TARGET_VSX && TARGET_POWER10);
+    }
+
+  unsigned int execute (function *) final override
+    {
+      rs6000_pair_fusion pass;
+      pass.run ();
+      return 0;
+    }
+}; // class pass_mem_fusion
+
+rtl_opt_pass *
+make_pass_mem_fusion (gcc::context *ctxt)
+{
+  return new pass_mem_fusion (ctxt);
+}
diff --git a/gcc/config/rs6000/rs6000-passes.def b/gcc/config/rs6000/rs6000-passes.def
index 46a0d0b8c56..0b48f57014d 100644
--- a/gcc/config/rs6000/rs6000-passes.def
+++ b/gcc/config/rs6000/rs6000-passes.def
@@ -28,7 +28,9 @@  along with GCC; see the file COPYING3.  If not see
      The power8 does not have instructions that automaticaly do the byte swaps
      for loads and stores.  */
   INSERT_PASS_BEFORE (pass_cse, 1, pass_analyze_swaps);
-
+  /* Pass to replace adjacent memory addresses lxv/stxv instruction with
+     lxvp/stxvp instruction.  */
+  INSERT_PASS_BEFORE (pass_early_remat, 1, pass_mem_fusion);
   /* Pass to do the PCREL_OPT optimization that combines the load of an
      external symbol's address along with a single load or store using that
      address as a base register.  */
diff --git a/gcc/config/rs6000/rs6000-protos.h b/gcc/config/rs6000/rs6000-protos.h
index b40557a8557..5295ff66950 100644
--- a/gcc/config/rs6000/rs6000-protos.h
+++ b/gcc/config/rs6000/rs6000-protos.h
@@ -342,6 +342,7 @@  namespace gcc { class context; }
 class rtl_opt_pass;
 
 extern rtl_opt_pass *make_pass_analyze_swaps (gcc::context *);
+extern rtl_opt_pass *make_pass_mem_fusion (gcc::context *);
 extern rtl_opt_pass *make_pass_pcrel_opt (gcc::context *);
 extern bool rs6000_sum_of_two_registers_p (const_rtx expr);
 extern bool rs6000_quadword_masked_address_p (const_rtx exp);
diff --git a/gcc/config/rs6000/rs6000.cc b/gcc/config/rs6000/rs6000.cc
index 08579bc83e6..d00c687cbc9 100644
--- a/gcc/config/rs6000/rs6000.cc
+++ b/gcc/config/rs6000/rs6000.cc
@@ -20,7 +20,6 @@ 
    <http://www.gnu.org/licenses/>.  */
 
 #define IN_TARGET_CODE 1
-
 #include "config.h"
 #define INCLUDE_MEMORY
 #include "system.h"
@@ -27469,7 +27468,8 @@  rs6000_split_multireg_move (rtx dst, rtx src)
     reg_mode = word_mode;
   reg_mode_size = GET_MODE_SIZE (reg_mode);
 
-  gcc_assert (reg_mode_size * nregs == GET_MODE_SIZE (mode));
+  gcc_assert (mode == OOmode
+	      || reg_mode_size * nregs == GET_MODE_SIZE (mode));
 
   /* TDmode residing in FP registers is special, since the ISA requires that
      the lower-numbered word of a register pair is always the most significant
@@ -27516,6 +27516,11 @@  rs6000_split_multireg_move (rtx dst, rtx src)
       int reg_mode_nregs = hard_regno_nregs (reg, reg_mode);
       if (MEM_P (dst))
 	{
+	  rtx addr = XEXP (dst, 0);
+	  rtx opnd1 = NULL_RTX;
+	  if (addr && GET_CODE (addr) == PLUS)
+	    opnd1 = XEXP (addr,1);
+
 	  unsigned offset = 0;
 	  unsigned size = GET_MODE_SIZE (reg_mode);
 
@@ -27529,7 +27534,13 @@  rs6000_split_multireg_move (rtx dst, rtx src)
 	    {
 	      unsigned subreg
 		= WORDS_BIG_ENDIAN ? i : (nregs - reg_mode_nregs - i);
-	      rtx dst2 = adjust_address (dst, reg_mode, offset);
+	      rtx dst2 = dst;
+
+	      if ((GET_CODE (addr) != PLUS
+		  || (opnd1 && CONST_INT_P(opnd1))))
+		dst2 = adjust_address (dst, reg_mode, offset);
+	      else
+		PUT_MODE_RAW (dst, reg_mode);
 	      rtx src2 = gen_rtx_REG (reg_mode, reg + subreg);
 	      offset += size;
 	      emit_insn (gen_rtx_SET (dst2, src2));
@@ -27540,15 +27551,25 @@  rs6000_split_multireg_move (rtx dst, rtx src)
 
       if (MEM_P (src))
 	{
+	  rtx  addr = XEXP (src, 0);
+	  rtx opnd1 = NULL_RTX;
+	  if (addr && GET_CODE (addr) == PLUS)
+	    opnd1 = XEXP (addr,1);
+
 	  unsigned offset = 0;
 	  unsigned size = GET_MODE_SIZE (reg_mode);
 
-	  for (int i = 0; i < nregs; i += reg_mode_nregs)
+	  for (int i = nregs-1; i >= 0; i -= reg_mode_nregs)
 	    {
 	      unsigned subreg
 		= WORDS_BIG_ENDIAN ? i : (nregs - reg_mode_nregs - i);
 	      rtx dst2 = gen_rtx_REG (reg_mode, reg + subreg);
-	      rtx src2 = adjust_address (src, reg_mode, offset);
+	      rtx src2 = src;
+
+	      if ((GET_CODE (addr) != PLUS || (opnd1 && CONST_INT_P (opnd1))))
+		src2 = adjust_address (src, reg_mode, offset);
+	      else
+		PUT_MODE_RAW (src2, reg_mode);
 	      offset += size;
 	      emit_insn (gen_rtx_SET (dst2, src2));
 	    }
@@ -27556,7 +27577,8 @@  rs6000_split_multireg_move (rtx dst, rtx src)
 	  /* If we are writing an accumulator register, we have to
 	     prime it after we've written it.  */
 	  if (TARGET_MMA
-	      && GET_MODE (dst) == XOmode && FP_REGNO_P (REGNO (dst)))
+	      && REG_P (dst) && GET_MODE (dst) == XOmode
+	      && FP_REGNO_P (REGNO (dst)))
 	    emit_insn (gen_mma_xxmtacc (dst, dst));
 
 	  return;
@@ -27649,9 +27671,12 @@  rs6000_split_multireg_move (rtx dst, rtx src)
 	{
 	  for (i = nregs - 1; i >= 0; i--)
 	    {
-	      rtx dst_i = gen_rtx_REG (reg_mode, REGNO (dst) + i);
-	      rtx src_i = gen_rtx_REG (reg_mode, REGNO (src) + i);
-	      emit_insn (gen_rtx_SET (dst_i, src_i));
+	      if (REG_P (dst) && REG_P (src))
+		{
+		  rtx dst_i = gen_rtx_REG (reg_mode, REGNO (dst) + i);
+		  rtx src_i = gen_rtx_REG (reg_mode, REGNO (src) + i);
+		  emit_insn (gen_rtx_SET (dst_i, src_i));
+		}
 	    }
 	}
       else
@@ -27666,7 +27691,8 @@  rs6000_split_multireg_move (rtx dst, rtx src)
       /* If we are writing an accumulator register, we have to
 	 prime it after we've written it.  */
       if (TARGET_MMA
-	  && GET_MODE (dst) == XOmode && FP_REGNO_P (REGNO (dst)))
+	  && REG_P (dst) && GET_MODE (dst) == XOmode
+	  && FP_REGNO_P (REGNO (dst)))
 	emit_insn (gen_mma_xxmtacc (dst, dst));
     }
   else
@@ -27723,7 +27749,7 @@  rs6000_split_multireg_move (rtx dst, rtx src)
 
 	  /* If the base register we are using to address memory is
 	     also a destination reg, then change that register last.  */
-	  if (REG_P (breg)
+	  if (REG_P (dst) && REG_P (breg)
 	      && REGNO (breg) >= REGNO (dst)
 	      && REGNO (breg) < REGNO (dst) + nregs)
 	    j = REGNO (breg) - REGNO (dst);
@@ -27821,9 +27847,12 @@  rs6000_split_multireg_move (rtx dst, rtx src)
 	  /* XO/OO are opaque so cannot use subregs. */
 	  if (mode == OOmode || mode == XOmode )
 	    {
-	      rtx dst_i = gen_rtx_REG (reg_mode, REGNO (dst) + j);
-	      rtx src_i = gen_rtx_REG (reg_mode, REGNO (src) + j);
-	      emit_insn (gen_rtx_SET (dst_i, src_i));
+	      if (REG_P (dst) &&  REG_P (src))
+		{
+		  rtx dst_i = gen_rtx_REG (reg_mode, REGNO (dst) + j);
+		  rtx src_i = gen_rtx_REG (reg_mode, REGNO (src) + j);
+		  emit_insn (gen_rtx_SET (dst_i, src_i));
+		}
 	    }
 	  else
 	    emit_insn (gen_rtx_SET (simplify_gen_subreg (reg_mode, dst, mode,
@@ -27841,7 +27870,9 @@  rs6000_split_multireg_move (rtx dst, rtx src)
       if (restore_basereg != NULL_RTX)
 	emit_insn (restore_basereg);
     }
+  return;
 }
+
 
 /* Return true if the peephole2 can combine a load involving a combination of
    an addis instruction and a load with an offset that can be fused together on
diff --git a/gcc/config/rs6000/rs6000.md b/gcc/config/rs6000/rs6000.md
index 8eda2f7bb0d..b9c38088510 100644
--- a/gcc/config/rs6000/rs6000.md
+++ b/gcc/config/rs6000/rs6000.md
@@ -173,6 +173,7 @@ 
    UNSPEC_XXSPLTIW_CONST
    UNSPEC_FMAX
    UNSPEC_FMIN
+   UNSPEC_LXVP
   ])
 
 ;;
diff --git a/gcc/config/rs6000/t-rs6000 b/gcc/config/rs6000/t-rs6000
index 155788de40a..a052163eec4 100644
--- a/gcc/config/rs6000/t-rs6000
+++ b/gcc/config/rs6000/t-rs6000
@@ -34,6 +34,11 @@  rs6000-p8swap.o: $(srcdir)/config/rs6000/rs6000-p8swap.cc
 	$(COMPILE) $<
 	$(POSTCOMPILE)
 
+rs6000-mem-fusion.o: $(srcdir)/config/rs6000/rs6000-mem-fusion.cc
+	$(COMPILE) $<
+	$(POSTCOMPILE)
+
+
 rs6000-d.o: $(srcdir)/config/rs6000/rs6000-d.cc
 	$(COMPILE) $<
 	$(POSTCOMPILE)
diff --git a/gcc/pair-fusion.cc b/gcc/pair-fusion.cc
index cb0374f426b..d0665b8730a 100644
--- a/gcc/pair-fusion.cc
+++ b/gcc/pair-fusion.cc
@@ -21,7 +21,6 @@ 
 #define INCLUDE_FUNCTIONAL
 #define INCLUDE_LIST
 #define INCLUDE_TYPE_TRAITS
-#define INCLUDE_ARRAY
 #include "config.h"
 #include "system.h"
 #include "coretypes.h"
@@ -36,6 +35,8 @@ 
 #include "tree-dfa.h"
 #include "fold-const.h"
 #include "tree-hash-traits.h"
+#include "rtl-ssa/internals.h"
+#include "rtl-ssa/internals.inl"
 #include "print-tree.h"
 #include "pair-fusion.h"
 
@@ -313,9 +314,9 @@  static int
 encode_lfs (lfs_fields fields)
 {
   int size_log2 = exact_log2 (fields.size);
-  gcc_checking_assert (size_log2 >= 2 && size_log2 <= 4);
-  return ((int)fields.load_p << 3)
-    | ((int)fields.fpsimd_p << 2)
+  gcc_checking_assert (size_log2 >= 2 && size_log2 <= 9);
+  return ((int)fields.load_p << 4)
+    | ((int)fields.fpsimd_p << 3)
     | (size_log2 - 2);
 }
 
@@ -323,8 +324,8 @@  encode_lfs (lfs_fields fields)
 static lfs_fields
 decode_lfs (int lfs)
 {
-  bool load_p = (lfs & (1 << 3));
-  bool fpsimd_p = (lfs & (1 << 2));
+  bool load_p = (lfs & (1 << 4));
+  bool fpsimd_p = (lfs & (1 << 3));
   unsigned size = 1U << ((lfs & 3) + 2);
   return { load_p, fpsimd_p, size };
 }
@@ -426,6 +427,9 @@  pair_fusion_bb_info::track_access (insn_info *insn, bool load_p, rtx mem)
   if (MEM_VOLATILE_P (mem))
     return;
 
+  if (load_p && !m_pass->fuseable_load_p (insn))
+    return;
+
   // Ignore writeback accesses if the hook says to do so.
   if (!m_pass->should_handle_writeback (writeback_type::EXISTING)
       && GET_RTX_CLASS (GET_CODE (XEXP (mem, 0))) == RTX_AUTOINC)
@@ -1815,7 +1819,7 @@  pair_fusion_bb_info::fuse_pair (bool load_p,
     }
 
   rtx reg_notes = combine_reg_notes (first, second, load_p);
-
+  m_pass->set_multiword_subreg (i1, i2, load_p, pats);
   rtx pair_pat = m_pass->gen_pair (pats, writeback_effect, load_p);
   insn_change *pair_change = nullptr;
   auto set_pair_pat = [pair_pat,reg_notes](insn_change *change) {
@@ -1829,18 +1833,19 @@  pair_fusion_bb_info::fuse_pair (bool load_p,
       changes.safe_push (make_delete (first));
       pair_change = make_change (second);
       changes.safe_push (pair_change);
-
-      pair_change->move_range = move_range;
       pair_change->new_defs = merge_access_arrays (attempt,
 						   input_defs[0],
 						   input_defs[1]);
-      gcc_assert (pair_change->new_defs.is_valid ());
-
       pair_change->new_uses
 	= merge_access_arrays (attempt,
 			       drop_memory_access (input_uses[0]),
 			       drop_memory_access (input_uses[1]));
+      pair_change->move_range = move_range;
+      m_pass->modify_new_rtx_insn (first, &attempt, &pair_change, changes);
+      gcc_assert (pair_change->new_defs.is_valid ());
+
       gcc_assert (pair_change->new_uses.is_valid ());
+
       set_pair_pat (pair_change);
     }
   else
@@ -1873,6 +1878,7 @@  pair_fusion_bb_info::fuse_pair (bool load_p,
 						store_def,
 						change->new_defs);
 	      gcc_assert (change->new_defs.is_valid ());
+	      m_pass->set_defs (i1, i2, &change, &attempt);
 	      change->move_range = move_range;
 	      pair_change = change;
 	      break;
@@ -1909,6 +1915,7 @@  pair_fusion_bb_info::fuse_pair (bool load_p,
 	      change->new_defs = insert_access (attempt, new_set,
 						change->new_defs);
 	      gcc_assert (change->new_defs.is_valid ());
+	      m_pass->set_defs (i1, i2, &change, &attempt);
 	      pair_change = change;
 	      break;
 	    }
@@ -1920,6 +1927,7 @@  pair_fusion_bb_info::fuse_pair (bool load_p,
 			 "  stp: changing i%d to use mem from new stp "
 			 "(after i%d)\n",
 			 action.insn->uid (), pair_dst->uid ());
+	  
 	      change->new_uses = drop_memory_access (change->new_uses);
 	      gcc_assert (new_set);
 	      auto new_use = crtl->ssa->create_use (attempt, action.insn,
@@ -1959,7 +1967,6 @@  pair_fusion_bb_info::fuse_pair (bool load_p,
 	    }
 	}
     }
-
   auto ignore = ignore_changing_insns (changes);
   for (unsigned i = 0; i < changes.length (); i++)
     gcc_assert (rtl_ssa::restrict_movement (*changes[i], ignore));
@@ -1982,7 +1989,6 @@  pair_fusion_bb_info::fuse_pair (bool load_p,
     }
 
   gcc_assert (crtl->ssa->verify_insn_changes (changes));
-
   // Fix up any debug uses that will be affected by the changes.
   if (MAY_HAVE_DEBUG_INSNS)
     fixup_debug_uses (attempt, insns, orig_rtl, pair_dst, trailing_add,
@@ -2406,6 +2412,15 @@  pair_fusion_bb_info::try_fuse_pair (bool load_p, unsigned access_size,
       reg_ops[i] = XEXP (pats[i], !load_p);
     }
 
+  if (!load_p && !m_pass->fuseable_store_p (i1, i2))
+    {
+      if (dump_file)
+	fprintf (dump_file,
+		 "punting on store-mem-pairs due to non fuseable cand (%d,%d)\n",
+		 insns[0]->uid (), insns[1]->uid ());
+      return false;
+    }
+
   if (load_p && reg_overlap_mentioned_p (reg_ops[0], reg_ops[1]))
     {
       if (dump_file)
@@ -2650,7 +2665,6 @@  pair_fusion_bb_info::try_fuse_pair (bool load_p, unsigned access_size,
 	    base = &base_cands[1];
 	}
     }
-
   // Otherwise, hazards[0] > hazards[1].
   // Pair can be formed anywhere in (hazards[1], hazards[0]).
   insn_range_info range (insns[0], insns[1]);
@@ -2698,7 +2712,6 @@  pair_fusion_bb_info::try_fuse_pair (bool load_p, unsigned access_size,
       fprintf (dump_file, "), move_range: (%d,%d)\n",
 	       range.first->uid (), range.last->uid ());
     }
-
   return fuse_pair (load_p, access_size, writeback,
 		    i1, i2, *base, range);
 }
@@ -2764,6 +2777,7 @@  pair_fusion_bb_info::merge_pairs (insn_list_t &left_list,
 			  bool load_p,
 			  unsigned access_size)
 {
+	//dump_file = stdout;
   if (dump_file)
     {
       fprintf (dump_file, "merge_pairs [L=%d], cand vecs ", load_p);
@@ -2817,10 +2831,12 @@  pair_fusion_bb_info::transform_for_base (int encoded_lfs,
 
   for (auto &access : group.list)
     {
+	    //fprintf (stdout, "MERGE \n");
       if (skip_next)
 	skip_next = false;
       else if (known_eq (access.offset, prev_access->offset + access_size))
 	{
+	//	fprintf (stdout, " Inside Merge \n");
 	  merge_pairs (prev_access->cand_insns,
 		       access.cand_insns,
 		       lfs.load_p,
@@ -2998,6 +3014,8 @@  void pair_fusion::process_block (bb_info *bb)
       if (GET_CODE (pat) != SET)
 	continue;
 
+      change_existing_multword_mode (rti);
+
       if (track_stores && MEM_P (XEXP (pat, 0)))
 	bb_state.track_access (insn, false, XEXP (pat, 0));
       else if (track_loads && MEM_P (XEXP (pat, 1)))
diff --git a/gcc/pair-fusion.h b/gcc/pair-fusion.h
index 45e4edceecb..b59c59f397a 100644
--- a/gcc/pair-fusion.h
+++ b/gcc/pair-fusion.h
@@ -26,8 +26,12 @@  namespace rtl_ssa {
   class insn_info;
   class insn_range_info;
   class bb_info;
+  class insn_change;
+  class insn_range_info;
 }
 
+class obstack_watermark;
+
 // Information about a potential base candidate, used in try_fuse_pair.
 // There may be zero, one, or two viable RTL bases for a given pair.
 struct base_cand
@@ -142,6 +146,19 @@  struct pair_fusion {
   // true iff INSN is a load pair.
   virtual bool pair_mem_insn_p (rtx_insn *insn, bool &load_p) = 0;
 
+  // Given INSN change multiword mode load and store to respective
+  // unspec instruction.
+  virtual void change_existing_multword_mode (rtx_insn *insn) = 0;
+
+  // Given INSN and watermark ATTEMPT and PAIR_CHANGE sets the
+  // new rtx with INSN.  Remove all uses of definition that are
+  // removed given CHANGES.
+  virtual void modify_new_rtx_insn (rtl_ssa::insn_info *first,
+				    obstack_watermark *attempt,
+				    rtl_ssa::insn_change **pair_change,
+				    auto_vec<rtl_ssa::insn_change *> &changes)
+				    = 0;
+
   // Return true if we should track loads.
   virtual bool track_loads_p ()
   {
@@ -171,6 +188,37 @@  struct pair_fusion {
   virtual rtx gen_promote_writeback_pair (rtx wb_effect, rtx mem,
 					  rtx regs[2], bool load_p) = 0;
 
+  // Given insn_info pair I1 and I2, sets subreg with multiword registers
+  // to assign register pairs by allocators.
+  // LOAD_P is true iff the pair is a load.
+  virtual void set_multiword_subreg (rtl_ssa::insn_info *i1,
+				     rtl_ssa::insn_info *i2,
+				     bool load_p, rtx *pats) = 0;
+
+  // Given insn_info pair I1 and I2, sets uses in PAIR_CHANGE.
+  virtual void set_uses (rtl_ssa::insn_info *i1,
+			 rtl_ssa::insn_info *i2,
+			 rtl_ssa::insn_change **pair_change,
+			 obstack_watermark *attempt) = 0;
+
+  // Given insn_info pair I1 and I2, sets defss in PAIR_CHANGE.
+  virtual void set_defs (rtl_ssa::insn_info *i1,
+			 rtl_ssa::insn_info *i2,
+			 rtl_ssa::insn_change **pair_change,
+			 obstack_watermark *attempt) = 0;
+
+
+  // Given insn_info pair I1 and I2, checks if pairs are feasible to perform
+  // store mem pairs.
+  // Return true if feasible to perform store mem pairs otherwise false.
+  virtual bool fuseable_store_p (rtl_ssa::insn_info *i1,
+				 rtl_ssa::insn_info *i2) = 0;
+
+  // Given insn_info pair I1 and I2, checks if pairs are feasible to perform
+  // load mem pairs.
+  // Return true if feasible to perform load mem pairs otherwise false.
+  virtual bool fuseable_load_p (rtl_ssa::insn_info *i1) = 0;
+
   void process_block (rtl_ssa::bb_info *bb);
   rtl_ssa::insn_info *find_trailing_add (rtl_ssa::insn_info *insns[2],
 					 const rtl_ssa::insn_range_info
diff --git a/gcc/rtl-ssa/access-utils.h b/gcc/rtl-ssa/access-utils.h
index 8805eec1d7f..0d0d299bb83 100644
--- a/gcc/rtl-ssa/access-utils.h
+++ b/gcc/rtl-ssa/access-utils.h
@@ -574,6 +574,8 @@  sort_accesses (access_info **begin, access_info **end)
 
   if (count == 2)
     {
+      if (begin[0]->regno () == begin[1]->regno ())
+	return;
       gcc_checking_assert (begin[0]->regno () != begin[1]->regno ());
       if (begin[0]->regno () > begin[1]->regno ())
 	std::swap (begin[0], begin[1]);
diff --git a/gcc/rtl-ssa/changes.cc b/gcc/rtl-ssa/changes.cc
index 0476296607b..9a84bf10129 100644
--- a/gcc/rtl-ssa/changes.cc
+++ b/gcc/rtl-ssa/changes.cc
@@ -228,10 +228,7 @@  rtl_ssa::changes_are_worthwhile (array_slice<insn_change *const> changes,
       for (const insn_change *change : changes)
 	if (!change->is_deletion ())
 	  {
-	    if (INSN_CODE (change->rtl ()) == NOOP_MOVE_INSN_CODE)
-	      fprintf (dump_file, " %c nop", sep);
-	    else
-	      fprintf (dump_file, " %c %d", sep, change->new_cost);
+	    fprintf (dump_file, " %c %d", sep, change->new_cost);
 	    sep = '+';
 	  }
       if (weighted_new_cost != 0)
@@ -320,13 +317,11 @@  choose_insn_placement (insn_change &change)
   insn_info *insn = change.insn ();
   insn_info *first = change.move_range.first;
   insn_info *last = change.move_range.last;
-
   // Quick(ish) exit if there is only one possible choice.
   if (first == last)
     return first;
   if (first == insn->prev_nondebug_insn () && last == insn)
     return insn;
-
   // For now just use the closest valid choice to the original instruction.
   // If the register usage has changed significantly, it might instead be
   // better to try to take register pressure into account.
@@ -457,9 +452,11 @@  function_info::finalize_new_accesses (insn_change &change, insn_info *pos,
 
   // Build up the new list of definitions.
   for (rtx_obj_reference ref : properties.refs ())
-    if (ref.is_write ())
+    if (ref.is_reg ())
       {
 	def_info *def = find_access (change.new_defs, ref.regno);
+	if (ref.is_read () && !def)
+	  continue;
 	gcc_assert (def);
 
 	if (def->m_is_temp && is_a<set_info *> (def) && def->last_def ())
@@ -497,11 +494,17 @@  function_info::finalize_new_accesses (insn_change &change, insn_info *pos,
 	  }
 	else if (!def->m_has_been_superceded)
 	  {
-	    // This is a second or subsequent definition.
-	    // See function_info::record_def for a discussion of when
-	    // this can happen.
-	    def->record_reference (ref, false);
-	    continue;
+	    if (def && def->insn () == insn)
+	      {
+		if (!ref.is_clobber ())
+		  {
+		    // This is a second or subsequent definition.
+		    // See function_info::record_def for a discussion of when
+		    // this can happen.
+		    def->record_reference (ref, false);
+		    continue;
+		  }
+	      }
 	  }
 	else
 	  {
@@ -700,7 +703,7 @@  function_info::apply_changes_to_insn (insn_change &change,
   unsigned int num_uses = change.new_uses.size ();
   if (num_defs + num_uses <= insn->num_defs () + insn->num_uses ())
     insn->copy_accesses (change.new_defs, change.new_uses);
-  else
+   else
     {
       access_array_builder builder (&m_obstack);
       builder.reserve (num_defs + num_uses);
@@ -751,11 +754,13 @@  function_info::change_insns (array_slice<insn_change *> changes)
 
 	  // Make sure that the placement of this instruction would still
 	  // leave room for previous instructions.
-	  change->move_range = move_later_than (change->move_range, min_insn);
+	  if (min_insn->rtl ())
+	    change->move_range = move_later_than (change->move_range, min_insn);
 	  if (!canonicalize_move_range (change->move_range, change->insn ()))
 	    // verify_insn_changes is supposed to make sure that this holds.
 	    gcc_unreachable ();
-	  min_insn = later_insn (min_insn, change->move_range.first);
+	   if (min_insn-> rtl ())
+	     min_insn = later_insn (min_insn, change->move_range.first);
 
 	  if (change->insn ()->m_is_temp)
 	    {
@@ -785,7 +790,6 @@  function_info::change_insns (array_slice<insn_change *> changes)
 	delete_insn (change);
       else
 	{
-	  // Make sure that this instruction comes before later ones.
 	  if (following_insn)
 	    {
 	      change.move_range = move_earlier_than (change.move_range,
@@ -799,15 +803,18 @@  function_info::change_insns (array_slice<insn_change *> changes)
 
 	  // Decide which instruction INSN should go after.
 	  insn_info *after = choose_insn_placement (change);
-
 	  // If INSN is moving, insert a placeholder insn_info at the
 	  // new location.  We can't move INSN itself yet because it
 	  // might still be referenced by earlier move ranges.
 	  insn_info *insn = change.insn ();
-	  if (after == insn || after == insn->prev_nondebug_insn ())
+	  unsigned int after_uid = INSN_UID (after->rtl ());
+	  unsigned int change_uid = INSN_UID (change.insn ()->rtl ());
+	  if (after_uid == change_uid || after == insn
+	      || after == insn->prev_nondebug_insn ())
 	    {
 	      update_insn_in_place (change);
-	      following_insn = insn;
+	      if (after_uid != change_uid)
+		following_insn = insn;
 	    }
 	  else
 	    {
@@ -818,7 +825,7 @@  function_info::change_insns (array_slice<insn_change *> changes)
 	}
       placeholders[i] = placeholder;
     }
-
+//#endif
   // We need to keep track of newly-added sets as these need adding to
   // the def chain later.
   hash_set<def_info *> new_sets;
@@ -1282,6 +1289,81 @@  function_info::create_insn (obstack_watermark &watermark,
   return insn;
 }
 
+void
+function_info::update_change (def_info *def, set_info *set)
+{
+  if (!set->has_any_uses ())
+    return;
+
+  auto *use = *set->all_uses ().begin ();
+  do
+    {
+      auto *next_use = use->next_use ();
+      if (use->is_in_debug_insn ())
+	substitute_debug_use (use);
+      else if (use->is_in_phi ())
+	{
+	  update_change (def, use->phi ());
+	}
+      else
+	{
+	  remove_use (use);
+	}
+      use = next_use;
+    }
+  while (use);
+}
+
+void function_info::substitute_debug_use (use_info *use)
+{
+  auto *use_insn = use->insn ();
+  rtx_insn *use_rtl = use_insn->rtl ();
+  auto use_change = insn_change (use_insn);
+  use_change.new_uses = {};
+  use_change.move_range = use_change.insn ();
+  INSN_VAR_LOCATION_LOC (use_rtl) = gen_rtx_UNKNOWN_VAR_LOC ();
+  insn_change *changes[] = { &use_change };
+  crtl->ssa->change_insns (changes);
+}
+
+void
+function_info::update_changes (insn_info *insn,
+				obstack_watermark *attempt,
+				insn_change **change,
+				auto_vec<insn_change *> &changes)
+{
+  for (insn_change *change : changes)
+  {
+    for (auto def : change->old_defs ())
+      {
+	auto set = dyn_cast<set_info *> (def);
+	update_change (def ,set);
+      }
+  }
+  auto &new_defs = (*change)->new_defs;
+  vec_rtx_properties properties;
+  properties.add_insn (insn->rtl (), true);
+  // Build up the new list of definitions.
+  for (rtx_obj_reference ref : properties.refs ())
+    if (ref.is_write ())
+      {
+	auto *set = crtl->ssa->allocate<set_info > (insn,
+					   full_register (ref.regno));
+	if (set)
+	  {
+	    auto def = find_access (new_defs, ref.regno);
+	    if (!def)
+	      {
+		insn->defs() = insert_access (*attempt, set, insn->defs());
+		new_defs = insert_access (*attempt, set,
+					  new_defs);
+		auto &m_temp_defs = get_m_temp_defs ();
+		m_temp_defs.safe_push (set);
+	      }
+	  }
+      }
+}
+
 // Print a description of CHANGE to PP.
 void
 rtl_ssa::pp_insn_change (pretty_printer *pp, const insn_change &change)
diff --git a/gcc/rtl-ssa/functions.h b/gcc/rtl-ssa/functions.h
index 567701c7995..053f7155743 100644
--- a/gcc/rtl-ssa/functions.h
+++ b/gcc/rtl-ssa/functions.h
@@ -84,6 +84,18 @@  public:
   insn_info *create_insn (obstack_watermark &watermark,
 			  rtx_code insn_code,
 			  rtx pat);
+  insn_info *create_insn_temp (obstack_watermark &watermark,
+			  rtx_code insn_code,
+			  rtx pat, insn_info *info);
+
+  void update_changes (insn_info *first,
+		       obstack_watermark *attempt,
+		       insn_change **pair_change,
+		       auto_vec<insn_change *> &changes);
+
+  void update_change (def_info *def, set_info *set);
+
+  void substitute_debug_use (use_info *use);
 
   // Return a list of all the instructions in the function, in reverse
   // postorder.  The list includes both real and artificial instructions.
@@ -222,6 +234,14 @@  public:
   template<typename T, typename... Ts>
   T *change_alloc (obstack_watermark &wm, Ts... args);
 
+  auto_vec<access_info *> &get_m_temp_defs () { return m_temp_defs; }
+ 
+  template<typename T, typename... Ts>
+  T *allocate (Ts... args);
+
+  template<typename T, typename... Ts>
+  T *allocate_temp (Ts... args);
+
 private:
   class bb_phi_info;
   class build_info;
@@ -230,13 +250,6 @@  private:
   // Return an RAII object that owns all objects allocated by
   // allocate_temp during its lifetime.
   obstack_watermark temp_watermark () { return &m_temp_obstack; }
-
-  template<typename T, typename... Ts>
-  T *allocate (Ts... args);
-
-  template<typename T, typename... Ts>
-  T *allocate_temp (Ts... args);
-
   access_array temp_access_array (access_array accesses);
 
   clobber_group *need_clobber_group (clobber_info *);
@@ -257,6 +270,7 @@  private:
   void append_clobber_to_group (clobber_info *, clobber_group *);
   void merge_clobber_groups (clobber_info *, clobber_info *,
 			     def_info *);
+
   std::array<clobber_group *, 2> split_clobber_group (clobber_group *,
 						      insn_info *);
 
diff --git a/gcc/testsuite/g++.target/powerpc/mem-fusion-1.C b/gcc/testsuite/g++.target/powerpc/mem-fusion-1.C
new file mode 100644
index 00000000000..d10ff0cdf36
--- /dev/null
+++ b/gcc/testsuite/g++.target/powerpc/mem-fusion-1.C
@@ -0,0 +1,22 @@ 
+/* { dg-do compile } */
+/* { dg-require-effective-target power10_ok } */
+/* { dg-options "-mdejagnu-cpu=power10 -O2" } */
+
+#include <altivec.h>
+	
+void
+foo2 ()
+{
+  __vector_quad *dst1;
+  __vector_quad *dst2;
+  vector unsigned char src;
+  __vector_quad acc;
+  vector unsigned char *ptr;
+  __builtin_mma_xvf32ger(&acc, src, ptr[0]);
+  __builtin_mma_xvf32gerpp(&acc, src, ptr[1]);
+  *dst1 = acc;
+  __builtin_mma_xvf32ger(&acc, src, ptr[2]);
+  __builtin_mma_xvf32gerpp(&acc, src, ptr[3]);
+  *dst2 = acc;
+}
+/* { dg-final { scan-assembler {\mlxvp\M} } } */
diff --git a/gcc/testsuite/g++.target/powerpc/mem-fusion.C b/gcc/testsuite/g++.target/powerpc/mem-fusion.C
new file mode 100644
index 00000000000..c523572cf3c
--- /dev/null
+++ b/gcc/testsuite/g++.target/powerpc/mem-fusion.C
@@ -0,0 +1,15 @@ 
+/* { dg-do compile } */ 
+/* { dg-require-effective-target power10_ok } */
+/* { dg-options "-mdejagnu-cpu=power10 -O2" } */ 
+
+#include <altivec.h>
+
+void
+foo (__vector_quad *dst, vector unsigned char *ptr, vector unsigned char src)
+{
+  __vector_quad acc;
+  __builtin_mma_xvf32ger(&acc, src, ptr[0]);
+  __builtin_mma_xvf32gerpp(&acc, src, ptr[1]);
+  *dst = acc;
+}
+/* { dg-final { scan-assembler {\mlxvp\M} } } */
diff --git a/gcc/testsuite/gcc.target/powerpc/mma-builtin-1.c b/gcc/testsuite/gcc.target/powerpc/mma-builtin-1.c
index 69ee826e1be..ae29127f954 100644
--- a/gcc/testsuite/gcc.target/powerpc/mma-builtin-1.c
+++ b/gcc/testsuite/gcc.target/powerpc/mma-builtin-1.c
@@ -258,8 +258,8 @@  foo13b (__vector_quad *dst, __vector_quad *src, vec_t *vec)
   dst[13] = acc;
 }
 
-/* { dg-final { scan-assembler-times {\mlxv\M} 40 } } */
-/* { dg-final { scan-assembler-times {\mlxvp\M} 12 } } */
+/* { dg-final { scan-assembler-times {\mlxv\M} 0 } } */
+/* { dg-final { scan-assembler-times {\mlxvp\M} 32 } } */
 /* { dg-final { scan-assembler-times {\mstxvp\M} 40 } } */
 /* { dg-final { scan-assembler-times {\mxxmfacc\M} 20 } } */
 /* { dg-final { scan-assembler-times {\mxxmtacc\M} 6 } } */