diff mbox series

[V3] fsra: gimple final sra pass for paramters and returns

Message ID 20240725031848.1792449-1-guojiufu@linux.ibm.com
State New
Headers show
Series [V3] fsra: gimple final sra pass for paramters and returns | expand

Commit Message

Jiufu Guo July 25, 2024, 3:18 a.m. UTC
Hi,

There are a few PRs (meta-bug PR101926) about accessing aggregate
parameters/returns which are passed through registers.

A major reason of those issues is when access the aggregate, the
temporary stack slots are used without leverage the information about
the incoming/outgoing registers.

We could use the current SRA pass in a special mode right before
GIMPLE->RTL expansion for the parameters/returns, and scalarize
the access according to the incoming/outgoing registers.
Some discussion in:
https://gcc.gnu.org/pipermail/gcc-patches/2023-November/637935.html

This patch adds a FINAL mode for tree-sra; and introduces IFN ARG_PARTS
/SET_RET_PARTS for scalar(s) access on parameters/returns.
And expand the IFNs according to the incoming/outgoing registers.

Compare with previous version:
https://gcc.gnu.org/pipermail/gcc-patches/2024-July/656359.html
This version:
* Support to cover mutli-registers in one access
* Support more bitfield access
* Support aggregate assignment to 'return'
* Support aggregate assignment from 'parameter' or 'sub-parameter'
* Remove SET_RET_LAST_PARTS, only keep SET_RET_PARTS
* Accept caller's parameters on callee's arguments.
* Refactor the code, e.g. style, comments.

Again there would be a few thing could be enhanced.
* More optimization for access parameter which in memory
* Enhancemore parameter accesses, e.g. grp_write.
* More cases/targets checking

Bootstrapped/regtested on ppc64{,le}, x86_64.
With known behavior changes in pr88873.c, pr101908-3.c and bfxil_1.c(aarch64).

Is this ok for trunk?

BR,
Jeff (Jiufu Guo)

	PR target/108073
	PR target/65421
	PR target/69143

gcc/ChangeLog:

	* calls.cc (precompute_register_parameters): Prepare callee argument
	from caller parameter.
	* cfgexpand.cc (expand_value_return):  Update 'rtx eq' checking.
	(expand_return): Checking sclarized returns.
	* internal-fn.cc (query_position_in_series): New function.
	(assign_from_regs): New function.
	(reference_alias_ptr_type): Extern declare.
	(expand_ARG_PARTS): New IFN expand.
	(assign_to_regs): New function.
	(expand_SET_RET_PARTS): New IFN expand.
	* internal-fn.def (ARG_PARTS): New IFN
	(SET_RET_PARTS): New IFN
	* passes.def (pass_sra_final): Add new pass.
	* tree-pass.h (make_pass_sra_final): New function.
	* tree-sra.cc (enum sra_mode): New enum item SRA_MODE_FINAL_INTRA.
	(build_access_from_expr_1): Check TARGET_MEM_REF.
	(build_accesses_from_assign): Accept SRA_MODE_FINAL_INTRA.
	(scan_function):  Check SRA_MODE_FINAL_INTRA.
	(find_var_candidates): Add condidates for fsra.
	(fsra_analyze): New function.
	(analyze_access_subtree): Check fsra_analyze.
	(generate_subtree_copies): Generate IFNs for fsra.
	(final_intra_sra): New function.
	(class pass_sra_final): New pass class.
	(make_pass_sra_final): New function.

gcc/testsuite/ChangeLog:

	* g++.target/powerpc/pr102024.C: Update instructions checking.
	* gcc.target/powerpc/pr108073-1.c: New test.
	* gcc.target/powerpc/pr108073.c: New test.
	* gcc.target/powerpc/pr65421.c: New test.
	* gcc.target/powerpc/pr69143.c: New test.

---
 gcc/calls.cc                                  |   9 +
 gcc/cfgexpand.cc                              |   6 +-
 gcc/internal-fn.cc                            | 413 ++++++++++++++++++
 gcc/internal-fn.def                           |   6 +
 gcc/passes.def                                |   1 +
 gcc/tree-pass.h                               |   1 +
 gcc/tree-sra.cc                               | 170 ++++++-
 gcc/testsuite/g++.target/powerpc/pr102024.C   |   3 +-
 gcc/testsuite/gcc.target/powerpc/pr108073-1.c |  76 ++++
 gcc/testsuite/gcc.target/powerpc/pr108073.c   |  74 ++++
 gcc/testsuite/gcc.target/powerpc/pr65421.c    |  26 ++
 gcc/testsuite/gcc.target/powerpc/pr69143.c    |  22 +
 12 files changed, 795 insertions(+), 12 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/powerpc/pr108073-1.c
 create mode 100644 gcc/testsuite/gcc.target/powerpc/pr108073.c
 create mode 100644 gcc/testsuite/gcc.target/powerpc/pr65421.c
 create mode 100644 gcc/testsuite/gcc.target/powerpc/pr69143.c
diff mbox series

Patch

diff --git a/gcc/calls.cc b/gcc/calls.cc
index 21d78f9779f..9035b12f429 100644
--- a/gcc/calls.cc
+++ b/gcc/calls.cc
@@ -996,6 +996,15 @@  precompute_register_parameters (int num_actuals, struct arg_data *args,
 	    pop_temp_slots ();
 	  }
 
+	if (GET_CODE (args[i].value) == PARALLEL)
+	  {
+	    tree type = TREE_TYPE (args[i].tree_value);
+	    int size = int_size_in_bytes (type);
+	    rtx mem = assign_stack_temp (BLKmode, size);
+	    emit_group_store (mem, args[i].value, type, size);
+	    args[i].value = mem;
+	  }
+
 	/* If we are to promote the function arg to a wider mode,
 	   do it now.  */
 
diff --git a/gcc/cfgexpand.cc b/gcc/cfgexpand.cc
index dad3ae1b7c6..2940431cb16 100644
--- a/gcc/cfgexpand.cc
+++ b/gcc/cfgexpand.cc
@@ -3789,7 +3789,7 @@  expand_value_return (rtx val)
 
   tree decl = DECL_RESULT (current_function_decl);
   rtx return_reg = DECL_RTL (decl);
-  if (return_reg != val)
+  if (!rtx_equal_p (return_reg, val))
     {
       tree funtype = TREE_TYPE (current_function_decl);
       tree type = TREE_TYPE (decl);
@@ -3862,6 +3862,10 @@  expand_return (tree retval)
      been stored into it, so we don't have to do anything special.  */
   if (TREE_CODE (retval_rhs) == RESULT_DECL)
     expand_value_return (result_rtl);
+  /* return is scalarized by fsra.  */
+  else if (DECL_RTL_SET_P (retval_rhs)
+	   && rtx_equal_p (result_rtl, DECL_RTL (retval_rhs)))
+    expand_null_return_1 ();
 
   /* If the result is an aggregate that is being returned in one (or more)
      registers, load the registers here.  */
diff --git a/gcc/internal-fn.cc b/gcc/internal-fn.cc
index 95946bfd683..78d9d7d557a 100644
--- a/gcc/internal-fn.cc
+++ b/gcc/internal-fn.cc
@@ -3483,6 +3483,419 @@  expand_ACCESS_WITH_SIZE (internal_fn, gcall *stmt)
     expand_assignment (lhs, ref_to_obj, false);
 }
 
+/* From the registers series REGS, compute which registers are touched
+   at the position {BITPOS, BITSIZE}.  The results are stored into
+   START_INDEX, END_INDEX, LEFT_BITS and RIGHT_BITS.
+   Return the first register at the expected position.  */
+
+static rtx
+query_position_in_series (rtx regs, HOST_WIDE_INT bitpos, HOST_WIDE_INT bitsize,
+			  int &start_index, int &end_index,
+			  HOST_WIDE_INT &left_bits, HOST_WIDE_INT &right_bits)
+{
+  if (GET_CODE (regs) == PARALLEL)
+    {
+      int cur_index = XEXP (XVECEXP (regs, 0, 0), 0) ? 0 : 1;
+      for (; cur_index < XVECLEN (regs, 0); cur_index++)
+	{
+	  rtx slot = XVECEXP (regs, 0, cur_index);
+	  HOST_WIDE_INT off = UINTVAL (XEXP (slot, 1)) * BITS_PER_UNIT;
+	  machine_mode mode = GET_MODE (XEXP (slot, 0));
+	  HOST_WIDE_INT size = GET_MODE_BITSIZE (mode).to_constant ();
+	  if (off <= bitpos && off + size > bitpos)
+	    {
+	      start_index = cur_index;
+	      left_bits = bitpos - off;
+	    }
+	  if (off + size >= bitpos + bitsize)
+	    {
+	      end_index = cur_index;
+	      right_bits = off + size - (bitpos + bitsize);
+	      break;
+	    }
+	}
+      /* Does not hit the incoming reg, e.g. accessing the padding.  */
+      if (start_index < 0 || end_index < start_index)
+	return NULL_RTX;
+      return XEXP (XVECEXP (regs, 0, start_index), 0);
+    }
+
+  if (REG_P (regs) && GET_MODE (regs) == BLKmode)
+    {
+      HOST_WIDE_INT end_bits = bitpos + bitsize - 1;
+      start_index = bitpos / BITS_PER_WORD;
+      left_bits = bitpos % BITS_PER_WORD;
+      end_index = end_bits / BITS_PER_WORD;
+      right_bits = BITS_PER_WORD - 1 - (end_bits % BITS_PER_WORD);
+      return gen_rtx_REG (word_mode, REGNO (regs) + start_index);
+    }
+
+  /* Complex concat.  */
+  if (GET_CODE (regs) == CONCAT && COMPLEX_MODE_P (GET_MODE (regs)))
+    {
+      machine_mode mode = GET_MODE (XEXP (regs, 0));
+      HOST_WIDE_INT mode_size = GET_MODE_BITSIZE (mode).to_constant ();
+      if ((bitsize == mode_size && (bitpos == 0 || bitpos == mode_size))
+	  || (bitsize == mode_size * 2 && bitpos == 0))
+	{
+	  start_index = end_index = bitpos == 0 ? 0 : 1;
+	  left_bits = right_bits = 0;
+	  if (bitsize == mode_size)
+	    return XEXP (regs, start_index);
+
+	  /* bitsize == mode_size * 2 */
+	  rtx reg = gen_reg_rtx (GET_MODE (regs));
+	  emit_move_insn (reg, regs);
+	  return reg;
+	}
+
+      /* Take middle bits of complex. */
+      return NULL_RTX;
+    }
+
+  if (REG_P (regs))
+    {
+      start_index = end_index = 0;
+      left_bits = bitpos;
+      machine_mode mode = GET_MODE (regs);
+      right_bits = GET_MODE_BITSIZE (mode).to_constant () - bitpos - bitsize;
+      return regs;
+    }
+
+  return NULL_RTX;
+}
+
+/* For an access on ARG at {BITPOS, BITSIZE}, compute a RTX
+   expression for the access and assign it to LHS.
+   ARG is an aggregate parameter of a function, and it should
+   be passed through registers.
+   Return true if able to figure out an effective way to assign
+   the access to LHS, otherwise return false.  */
+
+static bool
+assign_from_regs (tree arg, HOST_WIDE_INT bitpos, HOST_WIDE_INT bitsize,
+		  bool reversep, tree lhs)
+{
+  rtx regs = DECL_INCOMING_RTL (arg);
+  if (MEM_P (regs))
+    return false;
+  int start_index = -1;
+  int end_index = -1;
+  HOST_WIDE_INT left_bits = 0;
+  HOST_WIDE_INT right_bits = 0;
+
+  /* Prefer pseudo on DECL_RTL instead of single hard register.  */
+  if (REG_P (regs))
+    {
+      rtx arg_rtx = arg->decl_with_rtl.rtl;
+      if (GET_MODE (regs) != BLKmode)
+	regs = arg_rtx;
+      else if (int_size_in_bytes (TREE_TYPE (arg)) < UNITS_PER_WORD)
+	{
+	  regs = gen_reg_rtx (word_mode);
+	  emit_move_insn (regs, adjust_address (arg_rtx, word_mode, 0));
+	}
+    }
+  rtx reg1st = query_position_in_series (regs, bitpos, bitsize, start_index,
+					 end_index, left_bits, right_bits);
+  if (!reg1st)
+    return false;
+  machine_mode mode = word_mode;
+
+  tree type = TREE_TYPE (lhs);
+  machine_mode expr_mode = TYPE_MODE (type);
+
+  /* Just need one reg for the access.  */
+  if (end_index == start_index)
+    {
+      rtx reg = reg1st;
+      /* Perfer pesudo */
+      if (HARD_REGISTER_P (reg))
+	{
+	  rtx tmp = gen_reg_rtx (GET_MODE (reg));
+	  emit_move_insn (tmp, reg);
+	  reg = tmp;
+	}
+
+      if (left_bits == 0 && right_bits == 0)
+	{
+	  if (reversep)
+	    reg = flip_storage_order (GET_MODE (reg), reg);
+	  if (GET_MODE (reg) != expr_mode)
+	    reg = gen_lowpart (expr_mode, reg);
+
+	  rtx dest = expand_expr (lhs, NULL_RTX, VOIDmode, EXPAND_WRITE);
+	  emit_move_insn (dest, reg);
+	  return true;
+	}
+
+      /* Need to extract bitfield part reg for the access.
+	 left_bits != 0 or right_bits != 0 */
+      scalar_int_mode imode;
+      if (!int_mode_for_mode (expr_mode).exists (&imode))
+	return false;
+
+      bool sgn = TYPE_UNSIGNED (type);
+      rtx res = extract_bit_field (reg, bitsize, left_bits, sgn, NULL_RTX,
+				   imode, imode, reversep, NULL);
+
+      if (GET_MODE (res) != imode)
+	res = gen_lowpart (imode, res);
+
+      if (expr_mode != imode)
+	{
+	  /* expr_mode != imode, e.g. SF != SI.  */
+	  rtx tmp = gen_reg_rtx (imode);
+	  emit_move_insn (tmp, res);
+	  res = gen_lowpart (expr_mode, tmp);
+	}
+
+      rtx dest = expand_expr (lhs, NULL_RTX, VOIDmode, EXPAND_WRITE);
+      emit_move_insn (dest, res);
+      return true;
+    }
+
+  /* Access mult-registers.  */
+  if (left_bits != 0 || right_bits != 0 || bitsize % BITS_PER_WORD)
+    return false;
+
+  int len = end_index - start_index + 1;
+  if (!VECTOR_MODE_P (expr_mode))
+    {
+      rtvec vec = rtvec_alloc (len);
+      int off0 = 0;
+      if (GET_CODE (regs) == PARALLEL)
+	off0 = INTVAL (XEXP (XVECEXP (regs, 0, start_index), 1));
+      for (int i = 0; i < len; i++)
+	{
+	  int index = start_index + i;
+	  rtx src;
+	  rtx off;
+	  if (REG_P (regs))
+	    {
+	      src = gen_rtx_REG (mode, REGNO (regs) + index);
+	      off = GEN_INT (UNITS_PER_WORD * i);
+	    }
+	  else
+	    {
+	      rtx e = XVECEXP (regs, 0, index);
+	      src = XEXP (e, 0);
+	      off = GEN_INT (INTVAL (XEXP (e, 1)) - off0);
+	    }
+	  rtx tmp = gen_reg_rtx (GET_MODE (src));
+	  emit_move_insn (tmp, src);
+	  RTVEC_ELT (vec, i) = alloc_EXPR_LIST (REG_DEP_TRUE, tmp, off);
+	}
+      rtx res = gen_rtx_PARALLEL (expr_mode, vec);
+      if (expr_mode == BLKmode)
+	lhs->decl_with_rtl.rtl = res; // SET_DECL_RTL
+      else
+	{
+	  rtx dest = expand_expr (lhs, NULL_RTX, VOIDmode, EXPAND_WRITE);
+	  emit_group_store (dest, res, type, bitsize / BITS_PER_UNIT);
+	}
+      return true;
+    }
+
+  /* Check vector mode and element mode.  */
+  if (GET_CODE (regs) == PARALLEL)
+    mode = GET_MODE (XEXP (XVECEXP (regs, 0, start_index), 0));
+  scalar_mode smode = as_a<scalar_mode> (mode);
+  machine_mode vmode;
+  if (!related_vector_mode (expr_mode, smode, len).exists (&vmode))
+    return false;
+  insn_code icode = convert_optab_handler (vec_init_optab, vmode, smode);
+  if (icode == CODE_FOR_nothing)
+    return false;
+
+  rtvec vec = rtvec_alloc (len);
+  for (int i = 0; i < len; i++)
+    {
+      int index = start_index + i;
+      rtx src = REG_P (regs) ? gen_rtx_REG (mode, REGNO (regs) + index)
+			     : XEXP (XVECEXP (regs, 0, index), 0);
+      RTVEC_ELT (vec, i) = src;
+    }
+
+  rtx res = gen_reg_rtx (vmode);
+  emit_insn (GEN_FCN (icode) (res, gen_rtx_PARALLEL (vmode, vec)));
+  if (expr_mode != vmode)
+    res = gen_lowpart (expr_mode, res);
+  rtx dest = expand_expr (lhs, NULL_RTX, VOIDmode, EXPAND_WRITE);
+  emit_move_insn (dest, res);
+  return true;
+}
+
+/* Extern function for building MEM_REF rtx.  */
+tree
+reference_alias_ptr_type (tree t);
+
+/* Expand the IFN_ARG_PARTS function:
+   LHS = .ARG_PARTS(INCOMING_ARG, BIT_OFFSET, BIT_SIZE, REVERSEP).  */
+
+static void
+expand_ARG_PARTS (internal_fn, gcall *stmt)
+{
+  tree lhs = gimple_call_lhs (stmt);
+  tree arg = gimple_call_arg (stmt, 0);
+  HOST_WIDE_INT offset = tree_to_shwi (gimple_call_arg (stmt, 1));
+  HOST_WIDE_INT size = tree_to_shwi (gimple_call_arg (stmt, 2));
+  int reversep = tree_to_shwi (gimple_call_arg (stmt, 3));
+  if (assign_from_regs (arg, offset, size, reversep, lhs))
+    return;
+
+  /* Access bitfiled.  */
+  tree type = TREE_TYPE (lhs);
+  if ((INTEGRAL_TYPE_P (type) && !type_has_mode_precision_p (type))
+      || offset % BITS_PER_UNIT != 0 || size % BITS_PER_UNIT != 0)
+    {
+      machine_mode mode = TYPE_MODE (type);
+      rtx src
+	= expand_expr_real (arg, NULL, VOIDmode, EXPAND_NORMAL, NULL, true);
+      src = extract_bit_field (src, size, offset, TYPE_UNSIGNED (type), NULL,
+			       mode, mode, reversep, NULL);
+      rtx dest = expand_expr (lhs, NULL, VOIDmode, EXPAND_WRITE);
+      if (GET_CODE (dest) == SUBREG && SUBREG_PROMOTED_VAR_P (dest))
+	convert_move (SUBREG_REG (dest), src, SUBREG_PROMOTED_SIGN (dest));
+      else
+	emit_move_insn (dest, src);
+      return;
+    }
+
+  /* Fall to original expand.  */
+  gcc_assert (offset % BITS_PER_UNIT == 0 && size % BITS_PER_UNIT == 0);
+  tree base = build_fold_addr_expr (arg);
+  tree atype = reference_alias_ptr_type (arg);
+  tree off = build_int_cst (atype, offset / BITS_PER_UNIT);
+  location_t loc = EXPR_LOCATION (arg);
+  tree rhs = fold_build2_loc (loc, MEM_REF, type, base, off);
+  REF_REVERSE_STORAGE_ORDER (rhs) = reversep;
+  expand_assignment (lhs, rhs, false);
+}
+
+/* REGS constains the function return registers, compute which register(s)
+   are access via. {BITPOS, BITSIZE}. REV indicates the aggrgates's storeage
+   order.
+   If the accessed register(s) are whole register(s) or lowpart of a
+   register, then assign rhs to it(them) and return true.
+   Otherwise return NULL_RTX.  */
+
+static bool
+assign_to_regs (rtx regs, HOST_WIDE_INT bitpos, HOST_WIDE_INT bitsize, bool rev,
+		tree rhs)
+{
+  auto store_into_reg
+    = [rev] (rtx dest, HOST_WIDE_INT size, HOST_WIDE_INT pos, rtx src) -> bool {
+    machine_mode mode = GET_MODE (dest);
+    if (known_eq (size, GET_MODE_BITSIZE (mode)))
+      emit_move_insn (dest, gen_lowpart (mode, src));
+    else
+      store_bit_field (dest, size, pos, 0, 0, mode, src, rev, false);
+    return true;
+  };
+
+  rtx src = expand_expr (rhs, NULL_RTX, VOIDmode, EXPAND_NORMAL);
+  gcc_assert (src);
+  if (REG_P (regs))
+    return store_into_reg (regs, bitsize, bitpos, src);
+
+  gcc_assert (GET_CODE (regs) == PARALLEL);
+
+  int start_index = -1;
+  int end_index = -1;
+  HOST_WIDE_INT left_bits = 0;
+  HOST_WIDE_INT right_bits = 0;
+  rtx reg1st = query_position_in_series (regs, bitpos, bitsize, start_index,
+					 end_index, left_bits, right_bits);
+
+  gcc_assert (reg1st);
+
+  if (end_index == start_index)
+    return store_into_reg (reg1st, bitsize, left_bits, src);
+
+  /* Multi-registers  */
+  machine_mode src_mode = GET_MODE (src);
+  if (src_mode != BLKmode && !REG_P (src))
+    {
+      /* To extract bitfield, convert src. */
+      poly_int64 size = GET_MODE_SIZE (src_mode);
+      rtx mem = assign_stack_temp (src_mode, size);
+      emit_move_insn (mem, src);
+      src = mem;
+    }
+
+  int off0 = INTVAL (XEXP (XVECEXP (regs, 0, start_index), 1)) * BITS_PER_UNIT;
+  if (left_bits != 0)
+    {
+      rtx reg = reg1st;
+      machine_mode mode = GET_MODE (reg);
+      int size = GET_MODE_BITSIZE (mode).to_constant ();
+      size -= left_bits;
+      rtx elt
+	= extract_bit_field (src, size, 0, 1, NULL, mode, mode, false, NULL);
+      store_into_reg (reg, size, left_bits, elt);
+
+      off0 += left_bits;
+      start_index++;
+    }
+
+  if (right_bits != 0)
+    {
+      rtx e = XVECEXP (regs, 0, end_index);
+      rtx reg = XEXP (e, 0);
+      machine_mode mode = GET_MODE (reg);
+      int size = GET_MODE_BITSIZE (mode).to_constant ();
+      size -= right_bits;
+      int off = INTVAL (XEXP (e, 1)) * BITS_PER_UNIT - off0;
+      rtx elt
+	= extract_bit_field (src, size, off, 1, NULL, mode, mode, false, NULL);
+      store_into_reg (reg, size, 0, elt);
+
+      end_index--;
+    }
+
+  for (int i = start_index; i <= end_index; i++)
+    {
+      rtx e = XVECEXP (regs, 0, i);
+      rtx reg = XEXP (e, 0);
+      int off = INTVAL (XEXP (e, 1)) * BITS_PER_UNIT - off0;
+      machine_mode mode = GET_MODE (reg);
+      int size = GET_MODE_BITSIZE (mode).to_constant ();
+      rtx elt
+	= extract_bit_field (src, size, off, 1, NULL, mode, mode, false, NULL);
+      emit_move_insn (reg, elt);
+    }
+  return true;
+}
+
+/* Expand the IFN_SET_RET_PARTS function:
+   .SET_REG_PARTS(RET_BASE, BIT_OFFSET, BIT_SIZE, SRC).
+   e.g. .SET_RET_PARTS (D.2774, 0, 8, SRC); return D.2774; */
+
+static void
+expand_SET_RET_PARTS (internal_fn, gcall *stmt)
+{
+  HOST_WIDE_INT offset = tree_to_shwi (gimple_call_arg (stmt, 1));
+  HOST_WIDE_INT size = tree_to_shwi (gimple_call_arg (stmt, 2));
+  tree rhs = gimple_call_arg (stmt, 3);
+
+  tree decl = DECL_RESULT (current_function_decl);
+  rtx dest_regs = decl->decl_with_rtl.rtl; // DECL_RTL (base);
+  if (GET_CODE (dest_regs) != PARALLEL
+      && (REG_P (dest_regs) ? REGNO (dest_regs) >= FIRST_PSEUDO_REGISTER
+			    : DECL_REGISTER (decl))
+      && GET_CODE (crtl->return_rtx) == PARALLEL)
+    {
+      dest_regs = crtl->return_rtx;
+      decl->decl_with_rtl.rtl = dest_regs;
+    }
+  tree base = gimple_call_arg (stmt, 0);
+  base->decl_with_rtl.rtl = dest_regs; // SET_DECL_RTL
+
+  bool rev = TYPE_REVERSE_STORAGE_ORDER (TREE_TYPE (decl));
+  assign_to_regs (dest_regs, offset, size, rev, rhs);
+}
+
 /* The size of an OpenACC compute dimension.  */
 
 static void
diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def
index 0b45f322f0d..555ac9576d8 100644
--- a/gcc/internal-fn.def
+++ b/gcc/internal-fn.def
@@ -520,6 +520,12 @@  DEF_INTERNAL_FN (DEFERRED_INIT, ECF_CONST | ECF_LEAF | ECF_NOTHROW, NULL)
    2nd argument.  */
 DEF_INTERNAL_FN (ACCESS_WITH_SIZE, ECF_PURE | ECF_LEAF | ECF_NOTHROW, NULL)
 
+/* A function to extract elemet(s) from an aggregate argument in fsra. */
+DEF_INTERNAL_FN (ARG_PARTS, ECF_CONST | ECF_LEAF | ECF_NOTHROW, NULL)
+
+/* Functions to set/construct elemet(s) for an 'return' aggregate. */
+DEF_INTERNAL_FN (SET_RET_PARTS, ECF_LEAF | ECF_NOTHROW, NULL)
+
 /* DIM_SIZE and DIM_POS return the size of a particular compute
    dimension and the executing thread's position within that
    dimension.  DIM_POS is pure (and not const) so that it isn't
diff --git a/gcc/passes.def b/gcc/passes.def
index 94386ba5457..d2a6b421418 100644
--- a/gcc/passes.def
+++ b/gcc/passes.def
@@ -450,6 +450,7 @@  along with GCC; see the file COPYING3.  If not see
   NEXT_PASS (pass_harden_conditional_branches);
   NEXT_PASS (pass_harden_compares);
   NEXT_PASS (pass_warn_access, /*early=*/false);
+  NEXT_PASS (pass_sra_final);
   NEXT_PASS (pass_cleanup_cfg_post_optimizing);
   NEXT_PASS (pass_warn_function_noreturn);
 
diff --git a/gcc/tree-pass.h b/gcc/tree-pass.h
index 84bc91b51e9..58ee0333594 100644
--- a/gcc/tree-pass.h
+++ b/gcc/tree-pass.h
@@ -366,6 +366,7 @@  extern gimple_opt_pass *make_pass_early_tree_profile (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_cleanup_eh (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_sra (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_sra_early (gcc::context *ctxt);
+extern gimple_opt_pass *make_pass_sra_final (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_tail_recursion (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_tail_calls (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_fix_loops (gcc::context *ctxt);
diff --git a/gcc/tree-sra.cc b/gcc/tree-sra.cc
index 8040b0c5645..132304428b9 100644
--- a/gcc/tree-sra.cc
+++ b/gcc/tree-sra.cc
@@ -21,14 +21,16 @@  along with GCC; see the file COPYING3.  If not see
 <http://www.gnu.org/licenses/>.  */
 
 /* This file implements Scalar Reduction of Aggregates (SRA).  SRA is run
-   twice, once in the early stages of compilation (early SRA) and once in the
-   late stages (late SRA).  The aim of both is to turn references to scalar
-   parts of aggregates into uses of independent scalar variables.
+   three times, once in the early stages of compilation (early SRA) and once
+   in the late stages (late SRA).  The aim of them is to turn references to
+   scalar parts of aggregates into uses of independent scalar variables.
 
-   The two passes are nearly identical, the only difference is that early SRA
+   The three passes are nearly identical, the difference are that early SRA
    does not scalarize unions which are used as the result in a GIMPLE_RETURN
    statement because together with inlining this can lead to weird type
-   conversions.
+   conversions.  The third pass is more care about parameters and returns,
+   it would be helpful for the parameters and returns which are passed through
+   registers.
 
    Both passes operate in four stages:
 
@@ -104,6 +106,7 @@  along with GCC; see the file COPYING3.  If not see
 /* Enumeration of all aggregate reductions we can do.  */
 enum sra_mode { SRA_MODE_EARLY_IPA,   /* early call regularization */
 		SRA_MODE_EARLY_INTRA, /* early intraprocedural SRA */
+		SRA_MODE_FINAL_INTRA, /* final gimple intraprocedural SRA */
 		SRA_MODE_INTRA };     /* late intraprocedural SRA */
 
 /* Global variable describing which aggregate reduction we are performing at
@@ -1269,6 +1272,12 @@  build_access_from_expr_1 (tree expr, gimple *stmt, bool write)
       return NULL;
     }
 
+  if (TREE_CODE (expr) == TARGET_MEM_REF)
+    {
+      disqualify_base_of_expr (expr, "variable offset on base.");
+      return NULL;
+    }
+
   struct access *ret = NULL;
   bool partial_ref;
 
@@ -1549,7 +1558,8 @@  build_accesses_from_assign (gimple *stmt)
     }
 
   if (lacc && racc
-      && (sra_mode == SRA_MODE_EARLY_INTRA || sra_mode == SRA_MODE_INTRA)
+      && (sra_mode == SRA_MODE_EARLY_INTRA || sra_mode == SRA_MODE_INTRA
+	  || sra_mode == SRA_MODE_FINAL_INTRA)
       && !lacc->grp_unscalarizable_region
       && !racc->grp_unscalarizable_region
       && AGGREGATE_TYPE_P (TREE_TYPE (lhs))
@@ -1616,7 +1626,8 @@  scan_function (void)
 	  tree t;
 	  unsigned i;
 
-	  if (gimple_code (stmt) != GIMPLE_CALL)
+	  if (gimple_code (stmt) != GIMPLE_CALL
+	      || sra_mode == SRA_MODE_FINAL_INTRA)
 	    walk_stmt_load_store_addr_ops (stmt, NULL, NULL, NULL,
 					   scan_visit_addr);
 
@@ -2261,6 +2272,26 @@  find_var_candidates (void)
        parm = DECL_CHAIN (parm))
     ret |= maybe_add_sra_candidate (parm);
 
+  /* fsra only care about parameters and returns */
+  if (sra_mode == SRA_MODE_FINAL_INTRA)
+    {
+      tree res = DECL_RESULT (current_function_decl);
+      if (!res || !AGGREGATE_TYPE_P (TREE_TYPE (res))
+	  || aggregate_value_p (res, current_function_decl))
+	return ret;
+
+      edge_iterator ei;
+      edge e;
+      FOR_EACH_EDGE (e, ei, EXIT_BLOCK_PTR_FOR_FN (cfun)->preds)
+	if (greturn *r = safe_dyn_cast<greturn *> (*gsi_last_bb (e->src)))
+	  {
+	    tree val = gimple_return_retval (r);
+	    if (val && VAR_P (val))
+	      ret |= maybe_add_sra_candidate (val);
+	  }
+      return ret;
+    }
+
   FOR_EACH_LOCAL_DECL (cfun, i, var)
     {
       if (!VAR_P (var))
@@ -2771,6 +2802,33 @@  expr_with_var_bounded_array_refs_p (tree expr)
   return false;
 }
 
+/* Analyze if the access ROOT should be replaced.  */
+static bool
+fsra_analyze (struct access *root)
+{
+  gcc_assert (sra_mode == SRA_MODE_FINAL_INTRA);
+
+  bool scalar = is_gimple_reg_type (root->type);
+  if (scalar)
+    {
+      /* Parameter is scalarizable even no writing to it.  */
+      if (TREE_CODE (root->base) == PARM_DECL)
+	return root->grp_scalar_read || root->grp_assignment_read;
+
+      /* Returns is scalarizable even no reading on it.  */
+      else
+	return root->grp_scalar_write || root->grp_assignment_write;
+    }
+
+  /* fsra could handle assignment on aggregate, e.g for *p = arg;  */
+  if (TREE_CODE (root->base) == PARM_DECL)
+    return !root->grp_scalar_read && root->grp_assignment_read;
+
+  /* For "D.3988 = *pa_2(D); return * D.3988;".  */
+  return !root->grp_scalar_write && root->grp_assignment_write
+	 && root->base == root->expr;
+}
+
 /* Analyze the subtree of accesses rooted in ROOT, scheduling replacements when
    both seeming beneficial and when ALLOW_REPLACEMENTS allows it.  If TOTALLY
    is set, we are totally scalarizing the aggregate.  Also set all sorts of
@@ -2903,6 +2961,23 @@  analyze_access_subtree (struct access *root, struct access *parent,
       sth_created = true;
       hole = false;
     }
+
+  /* Support aggregate assignment, e.g. "D.3988 = *pa_2(D); return * D.3988;"
+     or "*p = arg;" */
+  else if (allow_replacements && sra_mode == SRA_MODE_FINAL_INTRA
+	   && !root->first_child && fsra_analyze (root))
+    {
+      root->grp_to_be_replaced = 1;
+      if (scalar || TREE_CODE (root->base) == PARM_DECL)
+	root->replacement_decl = create_access_replacement (root);
+
+      /* For function assign to whole 'return', directly use DECL_RESULT.  */
+      else
+	root->replacement_decl = DECL_RESULT (current_function_decl);
+      sth_created = true;
+      hole = false;
+    }
+
   else
     {
       if (allow_replacements
@@ -2932,6 +3007,16 @@  analyze_access_subtree (struct access *root, struct access *parent,
     root->grp_covered = 1;
   else if (root->grp_write || comes_initialized_p (root->base))
     root->grp_unscalarized_data = 1; /* not covered and written to */
+
+  if (sra_mode == SRA_MODE_FINAL_INTRA)
+    {
+      /* Does not support writen to PARAM and partial-unscalarized RET yet.  */
+      if (root->grp_unscalarized_data && (VAR_P (root->base)))
+	return false;
+      if (root->grp_write && TREE_CODE (root->base) == PARM_DECL)
+	return false;
+    }
+
   return sth_created;
 }
 
@@ -3887,7 +3972,7 @@  generate_subtree_copies (struct access *access, tree agg,
 	      || access->offset + access->size > start_offset))
 	{
 	  tree expr, repl = get_access_replacement (access);
-	  gassign *stmt;
+	  gimple *stmt;
 
 	  expr = build_ref_for_model (loc, agg, access->offset - top_offset,
 				      access, gsi, insert_after);
@@ -3899,7 +3984,19 @@  generate_subtree_copies (struct access *access, tree agg,
 						 !insert_after,
 						 insert_after ? GSI_NEW_STMT
 						 : GSI_SAME_STMT);
-	      stmt = gimple_build_assign (repl, expr);
+	      if (sra_mode == SRA_MODE_FINAL_INTRA
+		  && TREE_CODE (access->base) == PARM_DECL
+		  && (access->grp_scalar_read || access->grp_assignment_read))
+		{
+		  stmt = gimple_build_call_internal (
+		    IFN_ARG_PARTS, 4, access->base,
+		    wide_int_to_tree (sizetype, access->offset),
+		    wide_int_to_tree (sizetype, access->size),
+		    wide_int_to_tree (sizetype, access->reverse));
+		  gimple_call_set_lhs (stmt, repl);
+		}
+	      else
+		stmt = gimple_build_assign (repl, expr);
 	    }
 	  else
 	    {
@@ -3909,7 +4006,14 @@  generate_subtree_copies (struct access *access, tree agg,
 						 !insert_after,
 						 insert_after ? GSI_NEW_STMT
 						 : GSI_SAME_STMT);
-	      stmt = gimple_build_assign (expr, repl);
+	      if (sra_mode == SRA_MODE_FINAL_INTRA && VAR_P (access->base)
+		  && (access->grp_scalar_write || access->grp_assignment_write))
+		stmt = gimple_build_call_internal (
+		  IFN_SET_RET_PARTS, 4, access->base,
+		  wide_int_to_tree (sizetype, access->offset),
+		  wide_int_to_tree (sizetype, access->size), repl);
+	      else
+		stmt = gimple_build_assign (expr, repl);
 	    }
 	  gimple_set_location (stmt, loc);
 
@@ -5134,6 +5238,14 @@  late_intra_sra (void)
   return perform_intra_sra ();
 }
 
+/* Perform "final sra" SRA just before expander.  */
+static unsigned int
+final_intra_sra (void)
+{
+  sra_mode = SRA_MODE_FINAL_INTRA;
+  return perform_intra_sra ();
+}
+
 
 static bool
 gate_intra_sra (void)
@@ -5217,6 +5329,44 @@  make_pass_sra (gcc::context *ctxt)
   return new pass_sra (ctxt);
 }
 
+namespace
+{
+const pass_data pass_data_sra_final = {
+  GIMPLE_PASS,		 /* type */
+  "fsra",		 /* name */
+  OPTGROUP_NONE,	 /* optinfo_flags */
+  TV_TREE_SRA,		 /* tv_id */
+  (PROP_cfg | PROP_ssa), /* properties_required */
+  0,			 /* properties_provided */
+  0,			 /* properties_destroyed */
+  0,			 /* todo_flags_start */
+  TODO_update_ssa,	 /* todo_flags_finish */
+};
+
+class pass_sra_final : public gimple_opt_pass
+{
+public:
+  pass_sra_final (gcc::context *ctxt)
+    : gimple_opt_pass (pass_data_sra_final, ctxt)
+  {
+  }
+
+  /* opt_pass methods: */
+  bool gate (function *) final override { return gate_intra_sra (); }
+  unsigned int execute (function *) final override
+  {
+    return final_intra_sra ();
+  }
+
+}; // class pass_sra_final
+
+} // namespace
+
+gimple_opt_pass *
+make_pass_sra_final (gcc::context *ctxt)
+{
+  return new pass_sra_final (ctxt);
+}
 
 /* If type T cannot be totally scalarized, return false.  Otherwise return true
    and push to the vector within PC offsets and lengths of all padding in the
diff --git a/gcc/testsuite/g++.target/powerpc/pr102024.C b/gcc/testsuite/g++.target/powerpc/pr102024.C
index 769585052b5..4d9bbb0f050 100644
--- a/gcc/testsuite/g++.target/powerpc/pr102024.C
+++ b/gcc/testsuite/g++.target/powerpc/pr102024.C
@@ -5,7 +5,8 @@ 
 // Test that a zero-width bit field in an otherwise homogeneous aggregate
 // generates a psabi warning and passes arguments in GPRs.
 
-// { dg-final { scan-assembler-times {\mstd\M} 4 } }
+// { dg-final { scan-assembler-times {\mstd\M} 4 {target { ! has_arch_pwr8 } } } }
+// { dg-final { scan-assembler-times {\mmtvsrd\M} 4 {target { has_arch_pwr8 } } } }
 
 struct a_thing
 {
diff --git a/gcc/testsuite/gcc.target/powerpc/pr108073-1.c b/gcc/testsuite/gcc.target/powerpc/pr108073-1.c
new file mode 100644
index 00000000000..4892716e85f
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/pr108073-1.c
@@ -0,0 +1,76 @@ 
+/* { dg-do run } */
+/* { dg-require-effective-target hard_float } */
+/* { dg-options "-O2 -save-temps" } */
+
+typedef struct DF
+{
+  double a[4];
+  short s1;
+  short s2;
+  short s3;
+  short s4;
+} DF;
+typedef struct SF
+{
+  float a[4];
+  int i1;
+  int i2;
+} SF;
+
+/* { dg-final { scan-assembler-times {\mmtvsrd|mtvsrws\M} 3 {target { lp64 && has_arch_pwr8 } } } } */
+/* { dg-final { scan-assembler-not {\mlwz\M} {target { lp64 && has_arch_pwr8 } } } } */
+/* { dg-final { scan-assembler-not {\mlhz\M} {target { lp64 && has_arch_pwr8 } } } } */
+
+#define NOIPA __attribute__ ((noipa))
+
+short NOIPA
+foo_hi (DF a, int flag)
+{
+  if (flag == 2)
+    return a.s2 + a.s3;
+  return 0;
+}
+int NOIPA
+foo_si (SF a, int flag)
+{
+  if (flag == 2)
+    return a.i2 + a.i1;
+  return 0;
+}
+double NOIPA
+foo_df (DF arg, int flag)
+{
+  if (flag == 2)
+    return arg.a[3];
+  else
+    return 0.0;
+}
+float NOIPA
+foo_sf (SF arg, int flag)
+{
+  if (flag == 2)
+    return arg.a[2];
+  return 0;
+}
+float NOIPA
+foo_sf1 (SF arg, int flag)
+{
+  if (flag == 2)
+    return arg.a[1];
+  return 0;
+}
+
+DF gdf = {{1.0, 2.0, 3.0, 4.0}, 1, 2, 3, 4};
+SF gsf = {{1.0f, 2.0f, 3.0f, 4.0f}, 1, 2};
+
+int
+main ()
+{
+  if (!(foo_hi (gdf, 2) == 5 && foo_si (gsf, 2) == 3 && foo_df (gdf, 2) == 4.0
+	&& foo_sf (gsf, 2) == 3.0 && foo_sf1 (gsf, 2) == 2.0))
+    __builtin_abort ();
+  if (!(foo_hi (gdf, 1) == 0 && foo_si (gsf, 1) == 0 && foo_df (gdf, 1) == 0
+	&& foo_sf (gsf, 1) == 0 && foo_sf1 (gsf, 1) == 0))
+    __builtin_abort ();
+  return 0;
+}
diff --git a/gcc/testsuite/gcc.target/powerpc/pr108073.c b/gcc/testsuite/gcc.target/powerpc/pr108073.c
new file mode 100644
index 00000000000..4e7feaa6810
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/pr108073.c
@@ -0,0 +1,74 @@ 
+/* { dg-do run } */
+/* { dg-require-effective-target hard_float } */
+/* { dg-options "-O2 -save-temps" } */
+
+/* { dg-final { scan-assembler-times {\mmtvsrd|mtvsrws\M} 5 {target { lp64 && { has_arch_pwr8 && be } } } } } */
+/* { dg-final { scan-assembler-times {\mxscvspdpn\M} 4 {target { lp64 && { has_arch_pwr8 && be } } } } } */
+/* { dg-final { scan-assembler-times {\mmtvsrd|mtvsrws\M} 3 {target { lp64 && { has_arch_pwr8 && le } } } } } */
+/* { dg-final { scan-assembler-times {\mxscvspdpn\M} 2 {target { lp64 && { has_arch_pwr8 && le } } } } } */
+/* { dg-final { scan-assembler-times {\mfadds\M} 2 {target { lp64 && has_arch_pwr8 } } } } */
+
+#define NOIPA __attribute__ ((noipa))
+typedef struct X
+{
+  float x;
+  float y;
+} X;
+
+float NOIPA
+fooX (X y)
+{
+  y.x += 1;
+  return y.x + y.y;
+}
+
+typedef struct Y
+{
+  double a[4];
+  long l;
+} Y;
+
+double NOIPA
+fooY (Y arg)
+{
+  return arg.a[3];
+}
+
+typedef struct Z
+{
+  float a[4];
+  short l;
+} Z;
+
+float NOIPA
+fooZ (Z arg)
+{
+  return arg.a[3];
+}
+
+float NOIPA
+fooZ2 (Z arg)
+{
+  return arg.a[2];
+}
+
+X x = {1.0f, 2.0f};
+Y y = {1.0, 2.0, 3.0, 4.0, 1};
+Z z = {1.0f, 2.0f, 3.0f, 4.0f, 1};
+int
+main ()
+{
+  if (fooX (x) != 4.0f)
+    __builtin_abort ();
+
+  if (fooY (y) != 4.0)
+    __builtin_abort ();
+
+  if (fooZ (z) != 4.0f)
+    __builtin_abort ();
+
+  if (fooZ2 (z) != 3.0f)
+    __builtin_abort ();
+
+  return 0;
+}
diff --git a/gcc/testsuite/gcc.target/powerpc/pr65421.c b/gcc/testsuite/gcc.target/powerpc/pr65421.c
new file mode 100644
index 00000000000..25293610680
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/pr65421.c
@@ -0,0 +1,26 @@ 
+/* { dg-require-effective-target hard_float } */
+/* { dg-options "-O2" } */
+
+/* generate lfd for fun1, fun4 and fun5  */
+/* { dg-final { scan-assembler-times {\mlfd\M} 12 {target { lp64 && powerpc_elfv2 } } } } */
+
+/* generate stfd for fun2  */
+/* { dg-final { scan-assembler-times {\mstfd\M} 4 {target { lp64 && powerpc_elfv2 } } } } */
+
+/* no other load/store */
+/* { dg-final { scan-assembler-not {\m[p?l|st]d\M} {target { lp64 && powerpc_elfv2 } } } } */
+/* { dg-final { scan-assembler-not {\m[l|st]xvd2x\M} {target { lp64 && powerpc_elfv2 } } } } */
+/* { dg-final { scan-assembler-not {\m[p?l|st]xv\M} {target { lp64 && powerpc_elfv2 } } } } */
+
+typedef struct { double a[4]; } A;
+typedef struct { A a; double b; } B;
+typedef struct { double c; A a;} C;
+
+A fun1 (const A *pa) { return *pa; }
+void fun2 (const A a, A *pa) { *pa = a; }
+A fun3 (const A a) { return a; }
+A fun4 (const B *b) { return b->a; }
+A fun5 (const C *b) { return b->a; }
+A fun6 (const B b) { return b.a; }
+B fun7 (const A a) { B b; b.a = a; b.b = 0; return b; }
+C fun8 (const A a) { C c; c.a = a; c.c = 0; return c; }
diff --git a/gcc/testsuite/gcc.target/powerpc/pr69143.c b/gcc/testsuite/gcc.target/powerpc/pr69143.c
new file mode 100644
index 00000000000..62db4b0bd82
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/pr69143.c
@@ -0,0 +1,22 @@ 
+/* { dg-require-effective-target hard_float } */
+/* { dg-options "-O2" } */
+
+/* { dg-final { scan-assembler-times {\mfmr\M} 3 {target { { lp64 && powerpc_elfv2 } && has_arch_pwr8 } } } } */
+/* { dg-final { scan-assembler-not {\mxscvspdpn\M} {target { { lp64 && powerpc_elfv2 } && has_arch_pwr8 } } } } */
+
+struct foo1
+{
+  float x;
+  float y;
+};
+
+struct foo1
+blah1 (struct foo1 y)
+{
+  struct foo1 x;
+
+  x.x = y.y;
+  x.y = y.x;
+
+  return x;
+}