diff mbox series

[V5] fsra: gimple final sra pass for paramters and returns

Message ID 20240821014512.4038866-1-guojiufu@linux.ibm.com
State New
Headers show
Series [V5] fsra: gimple final sra pass for paramters and returns | expand

Commit Message

Jiufu Guo Aug. 21, 2024, 1:45 a.m. UTC
Hi,

There are a few PRs (meta-bug PR101926) about accessing aggregate
parameters/returns which are passed through registers.

A major reason of those issues is when access the aggregate, the
temporary stack slots are used without leverage the information about
the incoming/outgoing registers.

We could use the current SRA pass in a special mode right before
GIMPLE->RTL expansion for the parameters/returns, and scalarize
the access according to the incoming/outgoing registers.
Some discussion in:
https://gcc.gnu.org/pipermail/gcc-patches/2023-November/637935.html

This patch adds a FINAL mode for tree-sra; and introduces IFN ARG_PARTS
/SET_RET_PARTS for scalar(s) access on parameters/returns.
And expand the IFNs according to the incoming/outgoing registers.

Compare with version:
https://gcc.gnu.org/pipermail/gcc-patches/2024-July/658224.html
This version supports more features:
* Allow access on parameter with grp_write
* Allow there are unscalarized hole on 'return var'
* Allow addr-taken occur on call statment

Compare with version:
https://gcc.gnu.org/pipermail/gcc-patches/2024-August/660360.html
This version handled regressions from aarch64.

Again there would be a few thing could be enhanced for more cases. e.g.
* More optimization for access parameter in memory

Bootstrapped/regtested on ppc64{,le}, x86_64 and aarch64.
With known behavior changes in pr88873.c, pr101908-3.c, bfxil_1.c
and pr100075.c.

Thanks in advance for your comments!

Is this ok for trunk?

BR,
Jeff (Jiufu Guo)

	PR target/108073
	PR target/65421
	PR target/69143

gcc/ChangeLog:

	* calls.cc (precompute_register_parameters): Prepare callee argument
	from caller parameter.
	* cfgexpand.cc (expand_value_return):  Update 'rtx eq' checking.
	(expand_return): Checking sclarized returns.
	* expr.cc (copy_blkmode_to_reg): Handle reg:TI=parallel:BLK.
	* internal-fn.cc (query_position_in_series): New function.
	(assign_from_regs): New function.
	(reference_alias_ptr_type): Extern declare.
	(expand_ARG_PARTS): New IFN expand.
	(assign_to_regs): New function.
	(expand_SET_RET_PARTS): New IFN expand.
	* internal-fn.def (ARG_PARTS): New IFN
	(SET_RET_PARTS): New IFN
	* passes.def (pass_sra_final): Add new pass.
	* tree-pass.h (make_pass_sra_final): New function.
	* tree-sra.cc (enum sra_mode): New enum item SRA_MODE_FINAL_INTRA.
	(build_access_from_expr_1): Check TARGET_MEM_REF.
	(build_accesses_from_assign): Accept SRA_MODE_FINAL_INTRA.
	(find_var_candidates): Add condidates for fsra.
	(fsra_analyze): New function.
	(analyze_access_subtree): Check fsra_analyze.
	(propagate_subaccesses_from_rhs): Check return-var for fsra.
	(generate_subtree_copies):  Generate IFNs and add a new parameter.
	(sra_modify_assign): Gen SET_RET_PARTS for assign to 'return'.
	(initialize_parameter_reductions): Pass fsra info.
	(final_intra_sra): New function
	(class pass_sra_final): New pass class.
	(make_pass_sra_final): New function.

gcc/testsuite/ChangeLog:

	* g++.target/powerpc/pr102024.C: Update instructions checking.
	* gcc.target/powerpc/pr108073-1.c: New test.
	* gcc.target/powerpc/pr108073.c: New test.
	* gcc.target/powerpc/pr65421.c: New test.
	* gcc.target/powerpc/pr69143.c: New test.
---
 gcc/calls.cc                                  |  10 +
 gcc/cfgexpand.cc                              |   6 +-
 gcc/expr.cc                                   |   8 +
 gcc/internal-fn.cc                            | 442 ++++++++++++++++++
 gcc/internal-fn.def                           |   6 +
 gcc/passes.def                                |   1 +
 gcc/tree-pass.h                               |   1 +
 gcc/tree-sra.cc                               | 207 +++++++-
 gcc/testsuite/g++.target/powerpc/pr102024.C   |   3 +-
 gcc/testsuite/gcc.target/powerpc/pr108073-1.c |  76 +++
 gcc/testsuite/gcc.target/powerpc/pr108073.c   |  74 +++
 gcc/testsuite/gcc.target/powerpc/pr65421.c    |  26 ++
 gcc/testsuite/gcc.target/powerpc/pr69143.c    |  22 +
 13 files changed, 868 insertions(+), 14 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/powerpc/pr108073-1.c
 create mode 100644 gcc/testsuite/gcc.target/powerpc/pr108073.c
 create mode 100644 gcc/testsuite/gcc.target/powerpc/pr65421.c
 create mode 100644 gcc/testsuite/gcc.target/powerpc/pr69143.c

Comments

Richard Biener Sept. 14, 2024, 6:55 a.m. UTC | #1
On Wed, 21 Aug 2024, Jiufu Guo wrote:

> Hi,
>
> There are a few PRs (meta-bug PR101926) about accessing aggregate
> parameters/returns which are passed through registers.
>
> A major reason of those issues is when access the aggregate, the
> temporary stack slots are used without leverage the information about
> the incoming/outgoing registers.
>
> We could use the current SRA pass in a special mode right before
> GIMPLE->RTL expansion for the parameters/returns, and scalarize
> the access according to the incoming/outgoing registers.
> Some discussion in:
> https://gcc.gnu.org/pipermail/gcc-patches/2023-November/637935.html
>
> This patch adds a FINAL mode for tree-sra; and introduces IFN ARG_PARTS
> /SET_RET_PARTS for scalar(s) access on parameters/returns.
> And expand the IFNs according to the incoming/outgoing registers.
>
> Compare with version:
> https://gcc.gnu.org/pipermail/gcc-patches/2024-July/658224.html
> This version supports more features:
> * Allow access on parameter with grp_write
> * Allow there are unscalarized hole on 'return var'
> * Allow addr-taken occur on call statment
>
> Compare with version:
> https://gcc.gnu.org/pipermail/gcc-patches/2024-August/660360.html
> This version handled regressions from aarch64.
>
> Again there would be a few thing could be enhanced for more cases. e.g.
> * More optimization for access parameter in memory
>
> Bootstrapped/regtested on ppc64{,le}, x86_64 and aarch64.
> With known behavior changes in pr88873.c, pr101908-3.c, bfxil_1.c
> and pr100075.c.
>
> Thanks in advance for your comments!

I have tried the patch on

struct S { float x; int j; };
struct S foo (struct S s)
{
   struct S x;
   x.x = s.j;
   x.j = s.x;
   return x;
}

void baz (_Complex float *);
_Complex float bar (_Complex float x)
{
   _Complex float k = x;
   baz (&k);
   return k;
}

where on foo I see expected behavior while for bar we reject to work
on 'x' because it's not aggregate - that should possibly be improved.

x86 passes s in a single 64bit integer register and FSRA produces
two .ARG_PARTS (so it also works when there's not a 1:1 correspondence
between registers and uses in size).  I don't think x86 would ever
split a structure member between two registers but I don't see the
FSRA pass taking into account how the incoming arguments split up
an aggregate?  I am not sure the correct thing is to add those
as accesses to SRA.  Trying

struct S { float i; int x; int j; float y; };

long foo (struct S s)
{
   return *(long *)&s.x;
}

to get two 64bit reg passing shows we then construct a TImode
reg as before the patch but we'll still have the .ARG_PARTS?
I wonder if the code would simplify if it would not need to
fallback to old expand by making sure we can expand .ARG_PARTS
appropriately in an efficient way?  Basically try to make SRA
work on the parts as determined by the calling convention
rather than the full aggregate as currently RTL expansion
would create.

On x86 code for both cases doesn't change.

I wonder about the precompute_register_parameters change,
can you comment on it?

I think the copy_blkmode_to_reg part shows one issue
with not aligning .ARG_PARTS with incoming reg boundaries?

I think you are reverse-engineering this in query_position_in_series?
I'd appreciate somebody more familiar with RTL and RTL expansion
in particular to review the expansion bits.

I do like the overall change, as said - I hope it could be
simplified by avoiding handling of .ARG_PARTs or .SET_RET_PARTS
that would touch multiple registers but I'm also not sure
whether that's needed, say on powerpc.

Maybe Jeff can run the patch on his tester to cover more
targets.

Thanks,
Richard.

> Is this ok for trunk?
>
> BR,
> Jeff (Jiufu Guo)
>
> 	PR target/108073
> 	PR target/65421
> 	PR target/69143
>
> gcc/ChangeLog:
>
> 	* calls.cc (precompute_register_parameters): Prepare callee argument
> 	from caller parameter.
> 	* cfgexpand.cc (expand_value_return):  Update 'rtx eq' checking.
> 	(expand_return): Checking sclarized returns.
> 	* expr.cc (copy_blkmode_to_reg): Handle reg:TI=parallel:BLK.
> 	* internal-fn.cc (query_position_in_series): New function.
> 	(assign_from_regs): New function.
> 	(reference_alias_ptr_type): Extern declare.
> 	(expand_ARG_PARTS): New IFN expand.
> 	(assign_to_regs): New function.
> 	(expand_SET_RET_PARTS): New IFN expand.
> 	* internal-fn.def (ARG_PARTS): New IFN
> 	(SET_RET_PARTS): New IFN
> 	* passes.def (pass_sra_final): Add new pass.
> 	* tree-pass.h (make_pass_sra_final): New function.
> 	* tree-sra.cc (enum sra_mode): New enum item SRA_MODE_FINAL_INTRA.
> 	(build_access_from_expr_1): Check TARGET_MEM_REF.
> 	(build_accesses_from_assign): Accept SRA_MODE_FINAL_INTRA.
> 	(find_var_candidates): Add condidates for fsra.
> 	(fsra_analyze): New function.
> 	(analyze_access_subtree): Check fsra_analyze.
> 	(propagate_subaccesses_from_rhs): Check return-var for fsra.
> 	(generate_subtree_copies):  Generate IFNs and add a new parameter.
> 	(sra_modify_assign): Gen SET_RET_PARTS for assign to 'return'.
> 	(initialize_parameter_reductions): Pass fsra info.
> 	(final_intra_sra): New function
> 	(class pass_sra_final): New pass class.
> 	(make_pass_sra_final): New function.
>
> gcc/testsuite/ChangeLog:
>
> 	* g++.target/powerpc/pr102024.C: Update instructions checking.
> 	* gcc.target/powerpc/pr108073-1.c: New test.
> 	* gcc.target/powerpc/pr108073.c: New test.
> 	* gcc.target/powerpc/pr65421.c: New test.
> 	* gcc.target/powerpc/pr69143.c: New test.
> ---
> gcc/calls.cc                                  |  10 +
> gcc/cfgexpand.cc                              |   6 +-
> gcc/expr.cc                                   |   8 +
> gcc/internal-fn.cc                            | 442 ++++++++++++++++++
> gcc/internal-fn.def                           |   6 +
> gcc/passes.def                                |   1 +
> gcc/tree-pass.h                               |   1 +
> gcc/tree-sra.cc                               | 207 +++++++-
> gcc/testsuite/g++.target/powerpc/pr102024.C   |   3 +-
> gcc/testsuite/gcc.target/powerpc/pr108073-1.c |  76 +++
> gcc/testsuite/gcc.target/powerpc/pr108073.c   |  74 +++
> gcc/testsuite/gcc.target/powerpc/pr65421.c    |  26 ++
> gcc/testsuite/gcc.target/powerpc/pr69143.c    |  22 +
> 13 files changed, 868 insertions(+), 14 deletions(-)
> create mode 100644 gcc/testsuite/gcc.target/powerpc/pr108073-1.c
> create mode 100644 gcc/testsuite/gcc.target/powerpc/pr108073.c
> create mode 100644 gcc/testsuite/gcc.target/powerpc/pr65421.c
> create mode 100644 gcc/testsuite/gcc.target/powerpc/pr69143.c
>
> diff --git a/gcc/calls.cc b/gcc/calls.cc
> index f28c58217fd..d15996b8de8 100644
> --- a/gcc/calls.cc
> +++ b/gcc/calls.cc
> @@ -996,6 +996,16 @@ precompute_register_parameters (int num_actuals, struct arg_data *args,
> 	    pop_temp_slots ();
> 	  }
>
> +	/* Put pseudos to temp stack for argument */
> +	if (GET_CODE (args[i].value) == PARALLEL)
> +	  {
> +	    tree type = TREE_TYPE (args[i].tree_value);
> +	    int size = int_size_in_bytes (type);
> +	    rtx mem = assign_stack_temp (BLKmode, size);
> +	    emit_group_store (mem, args[i].value, type, size);
> +	    args[i].value = mem;
> +	  }
> +
> 	/* If we are to promote the function arg to a wider mode,
> 	   do it now.  */
>
> diff --git a/gcc/cfgexpand.cc b/gcc/cfgexpand.cc
> index dad3ae1b7c6..2940431cb16 100644
> --- a/gcc/cfgexpand.cc
> +++ b/gcc/cfgexpand.cc
> @@ -3789,7 +3789,7 @@ expand_value_return (rtx val)
>
>   tree decl = DECL_RESULT (current_function_decl);
>   rtx return_reg = DECL_RTL (decl);
> -  if (return_reg != val)
> +  if (!rtx_equal_p (return_reg, val))
>     {
>       tree funtype = TREE_TYPE (current_function_decl);
>       tree type = TREE_TYPE (decl);
> @@ -3862,6 +3862,10 @@ expand_return (tree retval)
>      been stored into it, so we don't have to do anything special.  */
>   if (TREE_CODE (retval_rhs) == RESULT_DECL)
>     expand_value_return (result_rtl);
> +  /* return is scalarized by fsra.  */
> +  else if (DECL_RTL_SET_P (retval_rhs)
> +	   && rtx_equal_p (result_rtl, DECL_RTL (retval_rhs)))
> +    expand_null_return_1 ();
>
>   /* If the result is an aggregate that is being returned in one (or more)
>      registers, load the registers here.  */
> diff --git a/gcc/expr.cc b/gcc/expr.cc
> index 2089c2b86a9..8fef1af932c 100644
> --- a/gcc/expr.cc
> +++ b/gcc/expr.cc
> @@ -3647,6 +3647,14 @@ copy_blkmode_to_reg (machine_mode mode_in, tree src)
>   if (bytes == 0)
>     return NULL_RTX;
>
> +  /* For large mode in parallel, e.g. x=.ARG_PARTS; ret_reg:TI=x;  */
> +  if (GET_CODE (x) == PARALLEL)
> +    {
> +      rtx dest = gen_reg_rtx (mode_in);
> +      emit_group_store (dest, x, TREE_TYPE (src), bytes);
> +      return dest;
> +    }
> +
>   /* If the structure doesn't take up a whole number of words, see
>      whether the register value should be padded on the left or on
>      the right.  Set PADDING_CORRECTION to the number of padding
> diff --git a/gcc/internal-fn.cc b/gcc/internal-fn.cc
> index 8a2e07f2f96..71202360109 100644
> --- a/gcc/internal-fn.cc
> +++ b/gcc/internal-fn.cc
> @@ -3483,6 +3483,448 @@ expand_ACCESS_WITH_SIZE (internal_fn, gcall *stmt)
>     expand_assignment (lhs, ref_to_obj, false);
> }
>
> +/* For the registers series REGS, compute which registers are touched
> +   at the position {BITPOS, BITSIZE}.  The results are stored into
> +   START_INDEX, END_INDEX, LEFT_BITS and RIGHT_BITS.
> +   Return the first register at the expected position.  */
> +
> +static rtx
> +query_position_in_series (rtx regs, HOST_WIDE_INT bitpos, HOST_WIDE_INT bitsize,
> +			  int &start_index, int &end_index,
> +			  HOST_WIDE_INT &left_bits, HOST_WIDE_INT &right_bits)
> +{
> +  if (GET_CODE (regs) == PARALLEL)
> +    {
> +      int cur_index = XEXP (XVECEXP (regs, 0, 0), 0) ? 0 : 1;
> +      for (; cur_index < XVECLEN (regs, 0); cur_index++)
> +	{
> +	  rtx slot = XVECEXP (regs, 0, cur_index);
> +	  HOST_WIDE_INT off = UINTVAL (XEXP (slot, 1)) * BITS_PER_UNIT;
> +	  machine_mode mode = GET_MODE (XEXP (slot, 0));
> +	  HOST_WIDE_INT size = GET_MODE_BITSIZE (mode).to_constant ();
> +	  if (off <= bitpos && off + size > bitpos)
> +	    {
> +	      start_index = cur_index;
> +	      left_bits = bitpos - off;
> +	    }
> +	  if (off + size >= bitpos + bitsize)
> +	    {
> +	      end_index = cur_index;
> +	      right_bits = off + size - (bitpos + bitsize);
> +	      break;
> +	    }
> +	}
> +      /* Does not hit the incoming reg, e.g. accessing the padding.  */
> +      if (start_index < 0 || end_index < start_index)
> +	return NULL_RTX;
> +      return XEXP (XVECEXP (regs, 0, start_index), 0);
> +    }
> +
> +  /* Complex concat.  */
> +  if (GET_CODE (regs) == CONCAT && COMPLEX_MODE_P (GET_MODE (regs)))
> +    {
> +      machine_mode mode = GET_MODE (XEXP (regs, 0));
> +      HOST_WIDE_INT mode_size = GET_MODE_BITSIZE (mode).to_constant ();
> +      if ((bitsize == mode_size && (bitpos == 0 || bitpos == mode_size))
> +	  || (bitsize == mode_size * 2 && bitpos == 0))
> +	{
> +	  start_index = end_index = bitpos == 0 ? 0 : 1;
> +	  left_bits = right_bits = 0;
> +	  if (bitsize == mode_size)
> +	    return XEXP (regs, start_index);
> +
> +	  /* bitsize == mode_size * 2 */
> +	  rtx reg = gen_reg_rtx (GET_MODE (regs));
> +	  emit_move_insn (reg, regs);
> +	  return reg;
> +	}
> +
> +      /* Take middle bits of complex. */
> +      return NULL_RTX;
> +    }
> +
> +  /* Continues registers.  */
> +  if (REG_P (regs) && GET_MODE (regs) == BLKmode)
> +    {
> +      HOST_WIDE_INT end_bits = bitpos + bitsize - 1;
> +      start_index = bitpos / BITS_PER_WORD;
> +      left_bits = bitpos % BITS_PER_WORD;
> +      end_index = end_bits / BITS_PER_WORD;
> +      right_bits = BITS_PER_WORD - 1 - (end_bits % BITS_PER_WORD);
> +      return gen_rtx_REG (word_mode, REGNO (regs) + start_index);
> +    }
> +
> +  /* Only one incoming register.  */
> +  if (REG_P (regs))
> +    {
> +      start_index = end_index = 0;
> +      left_bits = bitpos;
> +      machine_mode mode = GET_MODE (regs);
> +      right_bits = GET_MODE_BITSIZE (mode).to_constant () - bitpos - bitsize;
> +      return regs;
> +    }
> +
> +  return NULL_RTX;
> +}
> +
> +/* For an access on ARG at {BITPOS, BITSIZE}, compute a RTX
> +   expression for the access and assign it to LHS.
> +   ARG is an aggregate parameter of a function, and it should
> +   be passed through registers.
> +   Return true if able to figure out an effective way to assign
> +   the access to LHS, otherwise return false.  */
> +
> +static bool
> +assign_from_regs (tree arg, HOST_WIDE_INT bitpos, HOST_WIDE_INT bitsize,
> +		  bool reversep, tree lhs)
> +{
> +  rtx regs = DECL_INCOMING_RTL (arg);
> +  if (MEM_P (regs))
> +    return false;
> +  int start_index = -1;
> +  int end_index = -1;
> +  HOST_WIDE_INT left_bits = 0;
> +  HOST_WIDE_INT right_bits = 0;
> +
> +  /* Use pseudo on DECL_RTL instead of single hard register.  */
> +  if (REG_P (regs))
> +    {
> +      rtx arg_rtx = arg->decl_with_rtl.rtl;
> +      if (GET_MODE (regs) != BLKmode)
> +	regs = arg_rtx;
> +
> +      /* For the case: small aggregate size.  */
> +      else if (int_size_in_bytes (TREE_TYPE (arg)) < UNITS_PER_WORD)
> +	{
> +	  regs = gen_reg_rtx (word_mode);
> +	  emit_move_insn (regs, adjust_address (arg_rtx, word_mode, 0));
> +	}
> +    }
> +
> +  rtx reg1st = query_position_in_series (regs, bitpos, bitsize, start_index,
> +					 end_index, left_bits, right_bits);
> +  if (!reg1st)
> +    return false;
> +  machine_mode mode = word_mode;
> +
> +  tree type = TREE_TYPE (lhs);
> +  machine_mode expr_mode = TYPE_MODE (type);
> +
> +  /* Just need one reg for the access.  */
> +  if (end_index == start_index)
> +    {
> +      rtx reg = reg1st;
> +      /* Perfer pesudo */
> +      if (HARD_REGISTER_P (reg))
> +	{
> +	  rtx tmp = gen_reg_rtx (GET_MODE (reg));
> +	  emit_move_insn (tmp, reg);
> +	  reg = tmp;
> +	}
> +
> +      /* Whole register.  */
> +      if (left_bits == 0 && right_bits == 0)
> +	{
> +	  if (reversep)
> +	    reg = flip_storage_order (GET_MODE (reg), reg);
> +
> +	  rtx dest = expand_expr (lhs, NULL_RTX, VOIDmode, EXPAND_WRITE);
> +	  if (GET_MODE (reg) != expr_mode)
> +	    {
> +	      if (expr_mode == BLKmode)
> +		dest = adjust_address (dest, GET_MODE (reg), 0);
> +	      else
> +		reg = gen_lowpart (expr_mode, reg);
> +	    }
> +	  if (GET_CODE (dest) == SUBREG && SUBREG_PROMOTED_VAR_P (dest))
> +	    convert_move (SUBREG_REG (dest), reg, SUBREG_PROMOTED_SIGN (dest));
> +	  else
> +	    emit_move_insn (dest, reg);
> +	  return true;
> +	}
> +
> +      /* For left_bits != 0 or right_bits != 0,
> +	 Need to extract bitfield part reg for the access.  */
> +      scalar_int_mode imode;
> +      if (!int_mode_for_mode (expr_mode).exists (&imode))
> +	return false;
> +
> +      bool sgn = TYPE_UNSIGNED (type);
> +      rtx res = extract_bit_field (reg, bitsize, left_bits, sgn, NULL_RTX,
> +				   expr_mode, expr_mode, reversep, NULL);
> +
> +      if (GET_MODE (res) != expr_mode)
> +	{
> +	  /* e.g. SI != SF.  */
> +	  rtx tmp = gen_reg_rtx (GET_MODE (res));
> +	  emit_move_insn (tmp, res);
> +	  res = gen_lowpart (expr_mode, tmp);
> +	}
> +
> +      rtx dest = expand_expr (lhs, NULL_RTX, VOIDmode, EXPAND_WRITE);
> +      if (GET_CODE (dest) == SUBREG && SUBREG_PROMOTED_VAR_P (dest))
> +	convert_move (SUBREG_REG (dest), res, SUBREG_PROMOTED_SIGN (dest));
> +      else
> +       emit_move_insn (dest, res);
> +      return true;
> +    }
> +
> +  /* Sub-bits cross multi-registers.  */
> +  if (left_bits != 0 || right_bits != 0 || bitsize % BITS_PER_WORD)
> +    return false;
> +
> +  /* Access multi-registers.  */
> +  int len = end_index - start_index + 1;
> +  if (!VECTOR_MODE_P (expr_mode))
> +    {
> +      rtvec vec = rtvec_alloc (len);
> +      int off0 = 0;
> +      if (GET_CODE (regs) == PARALLEL)
> +	off0 = INTVAL (XEXP (XVECEXP (regs, 0, start_index), 1));
> +      for (int i = 0; i < len; i++)
> +	{
> +	  int index = start_index + i;
> +	  rtx src;
> +	  rtx off;
> +	  if (REG_P (regs))
> +	    {
> +	      src = gen_rtx_REG (mode, REGNO (regs) + index);
> +	      off = GEN_INT (UNITS_PER_WORD * i);
> +	    }
> +	  else
> +	    {
> +	      rtx e = XVECEXP (regs, 0, index);
> +	      src = XEXP (e, 0);
> +	      off = GEN_INT (INTVAL (XEXP (e, 1)) - off0);
> +	    }
> +	  rtx tmp = gen_reg_rtx (GET_MODE (src));
> +	  emit_move_insn (tmp, src);
> +	  RTVEC_ELT (vec, i) = alloc_EXPR_LIST (REG_DEP_TRUE, tmp, off);
> +	}
> +      rtx res = gen_rtx_PARALLEL (expr_mode, vec);
> +
> +      /* A few registers: parallel:[DI,DI,DI..].  */
> +      if (expr_mode == BLKmode)
> +	lhs->decl_with_rtl.rtl = res; // SET_DECL_RTL
> +
> +      /* For case like: "reg:TI=[DI, DI]"  */
> +      else
> +	{
> +	  rtx dest = expand_expr (lhs, NULL_RTX, VOIDmode, EXPAND_WRITE);
> +	  emit_group_store (dest, res, type, bitsize / BITS_PER_UNIT);
> +	}
> +      return true;
> +    }
> +
> +  /* For vector, check vector mode and element mode.  */
> +  if (GET_CODE (regs) == PARALLEL)
> +    mode = GET_MODE (XEXP (XVECEXP (regs, 0, start_index), 0));
> +  scalar_mode smode = as_a<scalar_mode> (mode);
> +  machine_mode vmode;
> +  if (!related_vector_mode (expr_mode, smode, len).exists (&vmode))
> +    return false;
> +  insn_code icode = convert_optab_handler (vec_init_optab, vmode, smode);
> +  if (icode == CODE_FOR_nothing)
> +    return false;
> +
> +  rtvec vec = rtvec_alloc (len);
> +  for (int i = 0; i < len; i++)
> +    {
> +      int index = start_index + i;
> +      rtx src = REG_P (regs) ? gen_rtx_REG (mode, REGNO (regs) + index)
> +			     : XEXP (XVECEXP (regs, 0, index), 0);
> +      RTVEC_ELT (vec, i) = src;
> +    }
> +
> +  rtx res = gen_reg_rtx (vmode);
> +  emit_insn (GEN_FCN (icode) (res, gen_rtx_PARALLEL (vmode, vec)));
> +  if (expr_mode != vmode)
> +    res = gen_lowpart (expr_mode, res);
> +  rtx dest = expand_expr (lhs, NULL_RTX, VOIDmode, EXPAND_WRITE);
> +  emit_move_insn (dest, res);
> +  return true;
> +}
> +
> +/* Extern function for building MEM_REF rtx.  */
> +tree reference_alias_ptr_type (tree);
> +
> +/* Expand the IFN_ARG_PARTS function:
> +   LHS = .ARG_PARTS(INCOMING_ARG, BIT_OFFSET, BIT_SIZE, REVERSEP).  */
> +
> +static void
> +expand_ARG_PARTS (internal_fn, gcall *stmt)
> +{
> +  tree lhs = gimple_call_lhs (stmt);
> +  tree arg = gimple_call_arg (stmt, 0);
> +  HOST_WIDE_INT offset = tree_to_shwi (gimple_call_arg (stmt, 1));
> +  HOST_WIDE_INT size = tree_to_shwi (gimple_call_arg (stmt, 2));
> +  int reversep = tree_to_shwi (gimple_call_arg (stmt, 3));
> +  if (assign_from_regs (arg, offset, size, reversep, lhs))
> +    return;
> +
> +  /* Access bitfiled (mem or cross-registers).  */
> +  tree type = TREE_TYPE (lhs);
> +  if ((INTEGRAL_TYPE_P (type) && !type_has_mode_precision_p (type))
> +      || offset % BITS_PER_UNIT != 0 || size % BITS_PER_UNIT != 0)
> +    {
> +      machine_mode mode = TYPE_MODE (type);
> +      rtx src
> +	= expand_expr_real (arg, NULL, VOIDmode, EXPAND_NORMAL, NULL, true);
> +      src = extract_bit_field (src, size, offset, TYPE_UNSIGNED (type), NULL,
> +			       mode, mode, reversep, NULL);
> +      rtx dest = expand_expr (lhs, NULL, VOIDmode, EXPAND_WRITE);
> +      if (GET_CODE (dest) == SUBREG && SUBREG_PROMOTED_VAR_P (dest))
> +	convert_move (SUBREG_REG (dest), src, SUBREG_PROMOTED_SIGN (dest));
> +      else
> +	emit_move_insn (dest, src);
> +      return;
> +    }
> +
> +  /* Fallback to original expand.  */
> +  gcc_assert (offset % BITS_PER_UNIT == 0 && size % BITS_PER_UNIT == 0);
> +  tree base = build_fold_addr_expr (arg);
> +  tree atype = reference_alias_ptr_type (arg);
> +  tree off = build_int_cst (atype, offset / BITS_PER_UNIT);
> +  location_t loc = EXPR_LOCATION (arg);
> +  tree rhs = fold_build2_loc (loc, MEM_REF, type, base, off);
> +  REF_REVERSE_STORAGE_ORDER (rhs) = reversep;
> +  expand_assignment (lhs, rhs, false);
> +}
> +
> +/* REGS constains registers (e.g. function return), compute which register(s)
> +   are touched at {BITPOS, BITSIZE}. REV indicates the aggrgates's storeage
> +   order.  And assign rhs to it(them) and return true.  */
> +
> +static bool
> +assign_to_regs (rtx regs, HOST_WIDE_INT bitpos, HOST_WIDE_INT bitsize, bool rev,
> +		tree rhs)
> +{
> +  /* Store src into a (or part of) registrer.  */
> +  auto store_into_reg
> +    = [rev] (rtx dest, HOST_WIDE_INT size, HOST_WIDE_INT pos, rtx src) -> bool {
> +    machine_mode mode = GET_MODE (dest);
> +    if (MEM_P (src) && GET_MODE (src) != mode)
> +      PUT_MODE (src, mode);
> +    if (known_eq (size, GET_MODE_BITSIZE (mode)))
> +      emit_move_insn (dest, gen_lowpart (mode, src));
> +    else
> +      store_bit_field (dest, size, pos, 0, 0, mode, src, rev, false);
> +    return true;
> +  };
> +
> +  rtx src = expand_expr (rhs, NULL_RTX, VOIDmode, EXPAND_NORMAL);
> +  gcc_assert (src);
> +
> +  /* Just one register.  */
> +  if (REG_P (regs))
> +    return store_into_reg (regs, bitsize, bitpos, src);
> +
> +  gcc_assert (GET_CODE (regs) == PARALLEL);
> +
> +  int start_index = -1;
> +  int end_index = -1;
> +  HOST_WIDE_INT left_bits = 0;
> +  HOST_WIDE_INT right_bits = 0;
> +  rtx reg1st = query_position_in_series (regs, bitpos, bitsize, start_index,
> +					 end_index, left_bits, right_bits);
> +
> +  gcc_assert (reg1st);
> +
> +  /* Only hit one register.  */
> +  if (end_index == start_index)
> +    return store_into_reg (reg1st, bitsize, left_bits, src);
> +
> +  /* Hit multi-registers.  */
> +  machine_mode src_mode = GET_MODE (src);
> +
> +  /* Bits in mem:blk*/
> +  if (src_mode != BLKmode && !REG_P (src))
> +    {
> +      /* To extract bitfield, convert src. */
> +      poly_int64 size = GET_MODE_SIZE (src_mode);
> +      rtx mem = assign_stack_temp (src_mode, size);
> +      emit_move_insn (mem, src);
> +      src = mem;
> +    }
> +
> +  int off0 = INTVAL (XEXP (XVECEXP (regs, 0, start_index), 1)) * BITS_PER_UNIT;
> +  /* Need to set bits in the first register. */
> +  if (left_bits != 0)
> +    {
> +      rtx reg = reg1st;
> +      machine_mode mode = GET_MODE (reg);
> +      int size = GET_MODE_BITSIZE (mode).to_constant ();
> +      size -= left_bits;
> +      rtx elt
> +	= extract_bit_field (src, size, 0, 1, NULL, mode, mode, false, NULL);
> +      store_into_reg (reg, size, left_bits, elt);
> +
> +      off0 += left_bits;
> +      start_index++;
> +    }
> +
> +  /* Need to set bits in the last register. */
> +  if (right_bits != 0)
> +    {
> +      rtx e = XVECEXP (regs, 0, end_index);
> +      rtx reg = XEXP (e, 0);
> +      machine_mode mode = GET_MODE (reg);
> +      int size = GET_MODE_BITSIZE (mode).to_constant ();
> +      size -= right_bits;
> +      int off = INTVAL (XEXP (e, 1)) * BITS_PER_UNIT - off0;
> +      rtx elt
> +	= extract_bit_field (src, size, off, 1, NULL, mode, mode, false, NULL);
> +      store_into_reg (reg, size, 0, elt);
> +
> +      end_index--;
> +    }
> +
> +  /* Assign the hitted whole registers.  */
> +  for (int i = start_index; i <= end_index; i++)
> +    {
> +      rtx e = XVECEXP (regs, 0, i);
> +      rtx reg = XEXP (e, 0);
> +      int off = INTVAL (XEXP (e, 1)) * BITS_PER_UNIT - off0;
> +      machine_mode mode = GET_MODE (reg);
> +      int size = GET_MODE_BITSIZE (mode).to_constant ();
> +      rtx elt
> +	= extract_bit_field (src, size, off, 1, NULL, mode, mode, false, NULL);
> +      emit_move_insn (reg, elt);
> +    }
> +  return true;
> +}
> +
> +/* Expand the IFN_SET_RET_PARTS function:
> +   .SET_REG_PARTS(RET_BASE, BIT_OFFSET, BIT_SIZE, SRC).
> +   e.g. .SET_RET_PARTS (D.2774, 0, 8, SRC); return D.2774; */
> +
> +static void
> +expand_SET_RET_PARTS (internal_fn, gcall *stmt)
> +{
> +  HOST_WIDE_INT offset = tree_to_shwi (gimple_call_arg (stmt, 1));
> +  HOST_WIDE_INT size = tree_to_shwi (gimple_call_arg (stmt, 2));
> +  tree rhs = gimple_call_arg (stmt, 3);
> +
> +  tree decl = DECL_RESULT (current_function_decl);
> +  rtx dest_regs = decl->decl_with_rtl.rtl; // DECL_RTL (base);
> +  if (GET_CODE (dest_regs) != PARALLEL
> +      && (REG_P (dest_regs) ? REGNO (dest_regs) >= FIRST_PSEUDO_REGISTER
> +			    : DECL_REGISTER (decl))
> +      && GET_CODE (crtl->return_rtx) == PARALLEL)
> +    {
> +      /* Outgoing registers are in crtl->return_rtx.  */
> +      dest_regs = gen_group_rtx (crtl->return_rtx);
> +      decl->decl_with_rtl.rtl = dest_regs;
> +    }
> +
> +  /* Use outgoing pseudos for return-var.  */
> +  tree base = gimple_call_arg (stmt, 0);
> +  base->decl_with_rtl.rtl = dest_regs; // SET_DECL_RTL
> +
> +  bool rev = TYPE_REVERSE_STORAGE_ORDER (TREE_TYPE (decl));
> +  assign_to_regs (dest_regs, offset, size, rev, rhs);
> +}
> +
> /* The size of an OpenACC compute dimension.  */
>
> static void
> diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def
> index 75b527b1ab0..7eda3ecc059 100644
> --- a/gcc/internal-fn.def
> +++ b/gcc/internal-fn.def
> @@ -520,6 +520,12 @@ DEF_INTERNAL_FN (DEFERRED_INIT, ECF_CONST | ECF_LEAF | ECF_NOTHROW, NULL)
>    2nd argument.  */
> DEF_INTERNAL_FN (ACCESS_WITH_SIZE, ECF_PURE | ECF_LEAF | ECF_NOTHROW, NULL)
>
> +/* A function to extract element from function's incoming registers. */
> +DEF_INTERNAL_FN (ARG_PARTS, ECF_CONST | ECF_LEAF | ECF_NOTHROW, NULL)
> +
> +/* A function to set outgoing registers for 'return' aggregate. */
> +DEF_INTERNAL_FN (SET_RET_PARTS, ECF_LEAF | ECF_NOTHROW, NULL)
> +
> /* DIM_SIZE and DIM_POS return the size of a particular compute
>    dimension and the executing thread's position within that
>    dimension.  DIM_POS is pure (and not const) so that it isn't
> diff --git a/gcc/passes.def b/gcc/passes.def
> index b06d6d45f63..46eea60f395 100644
> --- a/gcc/passes.def
> +++ b/gcc/passes.def
> @@ -448,6 +448,7 @@ along with GCC; see the file COPYING3.  If not see
>   NEXT_PASS (pass_harden_conditional_branches);
>   NEXT_PASS (pass_harden_compares);
>   NEXT_PASS (pass_warn_access, /*early=*/false);
> +  NEXT_PASS (pass_sra_final);
>   NEXT_PASS (pass_cleanup_cfg_post_optimizing);
>   NEXT_PASS (pass_warn_function_noreturn);
>
> diff --git a/gcc/tree-pass.h b/gcc/tree-pass.h
> index 3a0cf13089e..0ea5c6748a8 100644
> --- a/gcc/tree-pass.h
> +++ b/gcc/tree-pass.h
> @@ -366,6 +366,7 @@ extern gimple_opt_pass *make_pass_early_tree_profile (gcc::context *ctxt);
> extern gimple_opt_pass *make_pass_cleanup_eh (gcc::context *ctxt);
> extern gimple_opt_pass *make_pass_sra (gcc::context *ctxt);
> extern gimple_opt_pass *make_pass_sra_early (gcc::context *ctxt);
> +extern gimple_opt_pass *make_pass_sra_final (gcc::context *ctxt);
> extern gimple_opt_pass *make_pass_tail_recursion (gcc::context *ctxt);
> extern gimple_opt_pass *make_pass_tail_calls (gcc::context *ctxt);
> extern gimple_opt_pass *make_pass_musttail (gcc::context *ctxt);
> diff --git a/gcc/tree-sra.cc b/gcc/tree-sra.cc
> index 8040b0c5645..b23e9fd8794 100644
> --- a/gcc/tree-sra.cc
> +++ b/gcc/tree-sra.cc
> @@ -21,14 +21,16 @@ along with GCC; see the file COPYING3.  If not see
> <http://www.gnu.org/licenses/>.  */
>
> /* This file implements Scalar Reduction of Aggregates (SRA).  SRA is run
> -   twice, once in the early stages of compilation (early SRA) and once in the
> -   late stages (late SRA).  The aim of both is to turn references to scalar
> -   parts of aggregates into uses of independent scalar variables.
> +   three times, once in the early stages of compilation (early SRA) and once
> +   in the late stages (late SRA).  The aim of them is to turn references to
> +   scalar parts of aggregates into uses of independent scalar variables.
>
> -   The two passes are nearly identical, the only difference is that early SRA
> +   The three passes are nearly identical, the difference are that early SRA
>    does not scalarize unions which are used as the result in a GIMPLE_RETURN
>    statement because together with inlining this can lead to weird type
> -   conversions.
> +   conversions.  The third pass is more care about parameters and returns,
> +   it would be helpful for the parameters and returns which are passed through
> +   registers.
>
>    Both passes operate in four stages:
>
> @@ -104,6 +106,7 @@ along with GCC; see the file COPYING3.  If not see
> /* Enumeration of all aggregate reductions we can do.  */
> enum sra_mode { SRA_MODE_EARLY_IPA,   /* early call regularization */
> 		SRA_MODE_EARLY_INTRA, /* early intraprocedural SRA */
> +		SRA_MODE_FINAL_INTRA, /* final gimple intraprocedural SRA */
> 		SRA_MODE_INTRA };     /* late intraprocedural SRA */
>
> /* Global variable describing which aggregate reduction we are performing at
> @@ -1269,6 +1272,12 @@ build_access_from_expr_1 (tree expr, gimple *stmt, bool write)
>       return NULL;
>     }
>
> +  if (TREE_CODE (expr) == TARGET_MEM_REF)
> +    {
> +      disqualify_base_of_expr (expr, "variable offset on base.");
> +      return NULL;
> +    }
> +
>   struct access *ret = NULL;
>   bool partial_ref;
>
> @@ -1549,7 +1558,8 @@ build_accesses_from_assign (gimple *stmt)
>     }
>
>   if (lacc && racc
> -      && (sra_mode == SRA_MODE_EARLY_INTRA || sra_mode == SRA_MODE_INTRA)
> +      && (sra_mode == SRA_MODE_EARLY_INTRA || sra_mode == SRA_MODE_INTRA
> +	  || sra_mode == SRA_MODE_FINAL_INTRA)
>       && !lacc->grp_unscalarizable_region
>       && !racc->grp_unscalarizable_region
>       && AGGREGATE_TYPE_P (TREE_TYPE (lhs))
> @@ -2261,6 +2271,26 @@ find_var_candidates (void)
>        parm = DECL_CHAIN (parm))
>     ret |= maybe_add_sra_candidate (parm);
>
> +  /* fsra only care about parameters and returns */
> +  if (sra_mode == SRA_MODE_FINAL_INTRA)
> +    {
> +      tree res = DECL_RESULT (current_function_decl);
> +      if (!res || !AGGREGATE_TYPE_P (TREE_TYPE (res))
> +	  || aggregate_value_p (res, current_function_decl))
> +	return ret;
> +
> +      edge_iterator ei;
> +      edge e;
> +      FOR_EACH_EDGE (e, ei, EXIT_BLOCK_PTR_FOR_FN (cfun)->preds)
> +	if (greturn *r = safe_dyn_cast<greturn *> (*gsi_last_bb (e->src)))
> +	  {
> +	    tree val = gimple_return_retval (r);
> +	    if (val && VAR_P (val))
> +	      ret |= maybe_add_sra_candidate (val);
> +	  }
> +      return ret;
> +    }
> +
>   FOR_EACH_LOCAL_DECL (cfun, i, var)
>     {
>       if (!VAR_P (var))
> @@ -2771,6 +2801,33 @@ expr_with_var_bounded_array_refs_p (tree expr)
>   return false;
> }
>
> +/* Analyze if the access ROOT should be replaced in fsra pass.  */
> +static bool
> +fsra_analyze (struct access *root)
> +{
> +  gcc_assert (sra_mode == SRA_MODE_FINAL_INTRA);
> +
> +  bool scalar = is_gimple_reg_type (root->type);
> +  if (scalar)
> +    {
> +      /* Parameter is scalarizable even no writing to it.  */
> +      if (TREE_CODE (root->base) == PARM_DECL)
> +	return root->grp_scalar_read || root->grp_assignment_read;
> +
> +      /* Returns is scalarizable even no reading on it.  */
> +      else
> +	return root->grp_scalar_write || root->grp_assignment_write;
> +    }
> +
> +  /* fsra could handle assignment on aggregate, e.g for *p = arg;  */
> +  if (TREE_CODE (root->base) == PARM_DECL)
> +    return !root->grp_scalar_read && root->grp_assignment_read;
> +
> +  /* For "D.3988 = *pa_2(D); return * D.3988;".  */
> +  return !root->grp_scalar_write && root->grp_assignment_write
> +	 && root->base == root->expr;
> +}
> +
> /* Analyze the subtree of accesses rooted in ROOT, scheduling replacements when
>    both seeming beneficial and when ALLOW_REPLACEMENTS allows it.  If TOTALLY
>    is set, we are totally scalarizing the aggregate.  Also set all sorts of
> @@ -2903,6 +2960,22 @@ analyze_access_subtree (struct access *root, struct access *parent,
>       sth_created = true;
>       hole = false;
>     }
> +
> +  /* Check if fsra is going to handle the access.  */
> +  else if (allow_replacements && sra_mode == SRA_MODE_FINAL_INTRA
> +	   && !root->first_child && fsra_analyze (root))
> +    {
> +      root->grp_to_be_replaced = 1;
> +      if (scalar || TREE_CODE (root->base) == PARM_DECL)
> +	root->replacement_decl = create_access_replacement (root);
> +
> +      /* For assigning to whole 'return', directly use DECL_RESULT.  */
> +      else
> +	root->replacement_decl = DECL_RESULT (current_function_decl);
> +      sth_created = true;
> +      hole = false;
> +    }
> +
>   else
>     {
>       if (allow_replacements
> @@ -3159,6 +3232,13 @@ propagate_subaccesses_from_rhs (struct access *lacc, struct access *racc)
>       return ret;
>     }
>
> +  /* In fsra, as all access (include no replacement) about assigning to
> +     'return' is assign to outgoing registers, then no need to propagate
> +     racc's child to the lhs('return').  */
> +  if (sra_mode == SRA_MODE_FINAL_INTRA && lacc
> +      && TREE_CODE (lacc->base) != PARM_DECL)
> +    return ret;
> +
>   for (rchild = racc->first_child; rchild; rchild = rchild->next_sibling)
>     {
>       struct access *new_acc = NULL;
> @@ -3872,7 +3952,8 @@ generate_subtree_copies (struct access *access, tree agg,
> 			 HOST_WIDE_INT top_offset,
> 			 HOST_WIDE_INT start_offset, HOST_WIDE_INT chunk_size,
> 			 gimple_stmt_iterator *gsi, bool write,
> -			 bool insert_after, location_t loc)
> +			 bool insert_after, location_t loc,
> +			 bool fsra_arg = false)
> {
>   /* Never write anything into constant pool decls.  See PR70602.  */
>   if (!write && constant_decl_p (agg))
> @@ -3887,7 +3968,7 @@ generate_subtree_copies (struct access *access, tree agg,
> 	      || access->offset + access->size > start_offset))
> 	{
> 	  tree expr, repl = get_access_replacement (access);
> -	  gassign *stmt;
> +	  gimple *stmt;
>
> 	  expr = build_ref_for_model (loc, agg, access->offset - top_offset,
> 				      access, gsi, insert_after);
> @@ -3899,7 +3980,21 @@ generate_subtree_copies (struct access *access, tree agg,
> 						 !insert_after,
> 						 insert_after ? GSI_NEW_STMT
> 						 : GSI_SAME_STMT);
> -	      stmt = gimple_build_assign (repl, expr);
> +	      /* When assigning the incoming part of argument to 'repl', Gen
> +		the ARG_PARTS.  */
> +	      if (fsra_arg
> +		  && (!access->grp_write || is_gimple_reg_type (access->type))
> +		  && (access->grp_scalar_read || access->grp_assignment_read))
> +		{
> +		  stmt = gimple_build_call_internal (
> +		    IFN_ARG_PARTS, 4, access->base,
> +		    wide_int_to_tree (sizetype, access->offset),
> +		    wide_int_to_tree (sizetype, access->size),
> +		    wide_int_to_tree (sizetype, access->reverse));
> +		  gimple_call_set_lhs (stmt, repl);
> +		}
> +	      else
> +		stmt = gimple_build_assign (repl, expr);
> 	    }
> 	  else
> 	    {
> @@ -3909,7 +4004,15 @@ generate_subtree_copies (struct access *access, tree agg,
> 						 !insert_after,
> 						 insert_after ? GSI_NEW_STMT
> 						 : GSI_SAME_STMT);
> -	      stmt = gimple_build_assign (expr, repl);
> +	      /* Gen .SET_RET_PARTS, when assigning to 'return var'.  */
> +	      if (sra_mode == SRA_MODE_FINAL_INTRA && VAR_P (access->base)
> +		  && (access->grp_scalar_write || access->grp_assignment_write))
> +		stmt = gimple_build_call_internal (
> +		  IFN_SET_RET_PARTS, 4, access->base,
> +		  wide_int_to_tree (sizetype, access->offset),
> +		  wide_int_to_tree (sizetype, access->size), repl);
> +	      else
> +		stmt = gimple_build_assign (expr, repl);
> 	    }
> 	  gimple_set_location (stmt, loc);
>
> @@ -3940,7 +4043,7 @@ generate_subtree_copies (struct access *access, tree agg,
>       if (access->first_child)
> 	generate_subtree_copies (access->first_child, agg, top_offset,
> 				 start_offset, chunk_size, gsi,
> -				 write, insert_after, loc);
> +				 write, insert_after, loc, fsra_arg);
>
>       access = access->next_sibling;
>     }
> @@ -4718,6 +4821,30 @@ sra_modify_assign (gimple *stmt, gimple_stmt_iterator *gsi)
>       gsi_insert_before (gsi, ds, GSI_SAME_STMT);
>     }
>
> +  /* In fsra, even without replacement, IFN SET_RET_PARTS should be used
> +     to replace the assignment on 'return var'.   */
> +  auto use_set_ret_parts
> +    = [stmt, &orig_gsi, &gsi] (struct access *lacc, tree rhs) -> bool {
> +    /* Change the assign to SET_RET_PARTS if it is not repalced yet.  */
> +    if (sra_mode == SRA_MODE_FINAL_INTRA && lacc
> +	&& TREE_CODE (lacc->base) != PARM_DECL && !lacc->replacement_decl)
> +      {
> +	tree base = lacc->base;
> +	tree offset = wide_int_to_tree (sizetype, lacc->offset);
> +	tree size = wide_int_to_tree (sizetype, lacc->size);
> +	gimple *new_stmt = gimple_build_call_internal (IFN_SET_RET_PARTS, 4,
> +						       base, offset, size, rhs);
> +	gsi_insert_before (&orig_gsi, new_stmt, GSI_SAME_STMT);
> +	update_stmt (new_stmt);
> +	unlink_stmt_vdef (stmt);
> +	gsi_remove (gsi_stmt (*gsi) == stmt ? gsi : &orig_gsi, true);
> +	release_defs (stmt);
> +	sra_stats.deleted++;
> +	return true;
> +      }
> +    return false;
> +  };
> +
>   /* From this point on, the function deals with assignments in between
>      aggregates when at least one has scalar reductions of some of its
>      components.  There are three possible scenarios: Both the LHS and RHS have
> @@ -4787,6 +4914,9 @@ sra_modify_assign (gimple *stmt, gimple_stmt_iterator *gsi)
> 	  gcc_assert (stmt == gsi_stmt (orig_gsi));
> 	}
>
> +      if (use_set_ret_parts (lacc, rhs))
> +	return SRA_AM_REMOVED;
> +
>       return modify_this_stmt ? SRA_AM_MODIFIED : SRA_AM_NONE;
>     }
>   else
> @@ -4823,6 +4953,9 @@ sra_modify_assign (gimple *stmt, gimple_stmt_iterator *gsi)
> 	      sra_stats.deleted++;
> 	      return SRA_AM_REMOVED;
> 	    }
> +
> +	  if (use_set_ret_parts (lacc, rhs))
> +	    return SRA_AM_REMOVED;
> 	}
>       else
> 	{
> @@ -4866,6 +4999,9 @@ sra_modify_assign (gimple *stmt, gimple_stmt_iterator *gsi)
> 		  return SRA_AM_REMOVED;
> 		}
> 	    }
> +
> +	  if (use_set_ret_parts (lacc, rhs))
> +	    return SRA_AM_REMOVED;
> 	}
>
>       return SRA_AM_NONE;
> @@ -5070,7 +5206,8 @@ initialize_parameter_reductions (void)
> 	   access;
> 	   access = access->next_grp)
> 	generate_subtree_copies (access, parm, 0, 0, 0, &gsi, true, true,
> -				 EXPR_LOCATION (parm));
> +				 EXPR_LOCATION (parm),
> +				 sra_mode == SRA_MODE_FINAL_INTRA);
>     }
>
>   seq = gsi_seq (gsi);
> @@ -5134,6 +5271,14 @@ late_intra_sra (void)
>   return perform_intra_sra ();
> }
>
> +/* Perform "final sra" SRA just before expander.  */
> +static unsigned int
> +final_intra_sra (void)
> +{
> +  sra_mode = SRA_MODE_FINAL_INTRA;
> +  return perform_intra_sra ();
> +}
> +
>
> static bool
> gate_intra_sra (void)
> @@ -5217,6 +5362,44 @@ make_pass_sra (gcc::context *ctxt)
>   return new pass_sra (ctxt);
> }
>
> +namespace
> +{
> +const pass_data pass_data_sra_final = {
> +  GIMPLE_PASS,		 /* type */
> +  "fsra",		 /* name */
> +  OPTGROUP_NONE,	 /* optinfo_flags */
> +  TV_TREE_SRA,		 /* tv_id */
> +  (PROP_cfg | PROP_ssa), /* properties_required */
> +  0,			 /* properties_provided */
> +  0,			 /* properties_destroyed */
> +  0,			 /* todo_flags_start */
> +  TODO_update_ssa,	 /* todo_flags_finish */
> +};
> +
> +class pass_sra_final : public gimple_opt_pass
> +{
> +public:
> +  pass_sra_final (gcc::context *ctxt)
> +    : gimple_opt_pass (pass_data_sra_final, ctxt)
> +  {
> +  }
> +
> +  /* opt_pass methods: */
> +  bool gate (function *) final override { return gate_intra_sra (); }
> +  unsigned int execute (function *) final override
> +  {
> +    return final_intra_sra ();
> +  }
> +
> +}; // class pass_sra_final
> +
> +} // namespace
> +
> +gimple_opt_pass *
> +make_pass_sra_final (gcc::context *ctxt)
> +{
> +  return new pass_sra_final (ctxt);
> +}
>
> /* If type T cannot be totally scalarized, return false.  Otherwise return true
>    and push to the vector within PC offsets and lengths of all padding in the
> diff --git a/gcc/testsuite/g++.target/powerpc/pr102024.C b/gcc/testsuite/g++.target/powerpc/pr102024.C
> index 769585052b5..4d9bbb0f050 100644
> --- a/gcc/testsuite/g++.target/powerpc/pr102024.C
> +++ b/gcc/testsuite/g++.target/powerpc/pr102024.C
> @@ -5,7 +5,8 @@
> // Test that a zero-width bit field in an otherwise homogeneous aggregate
> // generates a psabi warning and passes arguments in GPRs.
>
> -// { dg-final { scan-assembler-times {\mstd\M} 4 } }
> +// { dg-final { scan-assembler-times {\mstd\M} 4 {target { ! has_arch_pwr8 } } } }
> +// { dg-final { scan-assembler-times {\mmtvsrd\M} 4 {target { has_arch_pwr8 } } } }
>
> struct a_thing
> {
> diff --git a/gcc/testsuite/gcc.target/powerpc/pr108073-1.c b/gcc/testsuite/gcc.target/powerpc/pr108073-1.c
> new file mode 100644
> index 00000000000..4892716e85f
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/powerpc/pr108073-1.c
> @@ -0,0 +1,76 @@
> +/* { dg-do run } */
> +/* { dg-require-effective-target hard_float } */
> +/* { dg-options "-O2 -save-temps" } */
> +
> +typedef struct DF
> +{
> +  double a[4];
> +  short s1;
> +  short s2;
> +  short s3;
> +  short s4;
> +} DF;
> +typedef struct SF
> +{
> +  float a[4];
> +  int i1;
> +  int i2;
> +} SF;
> +
> +/* { dg-final { scan-assembler-times {\mmtvsrd|mtvsrws\M} 3 {target { lp64 && has_arch_pwr8 } } } } */
> +/* { dg-final { scan-assembler-not {\mlwz\M} {target { lp64 && has_arch_pwr8 } } } } */
> +/* { dg-final { scan-assembler-not {\mlhz\M} {target { lp64 && has_arch_pwr8 } } } } */
> +
> +#define NOIPA __attribute__ ((noipa))
> +
> +short NOIPA
> +foo_hi (DF a, int flag)
> +{
> +  if (flag == 2)
> +    return a.s2 + a.s3;
> +  return 0;
> +}
> +int NOIPA
> +foo_si (SF a, int flag)
> +{
> +  if (flag == 2)
> +    return a.i2 + a.i1;
> +  return 0;
> +}
> +double NOIPA
> +foo_df (DF arg, int flag)
> +{
> +  if (flag == 2)
> +    return arg.a[3];
> +  else
> +    return 0.0;
> +}
> +float NOIPA
> +foo_sf (SF arg, int flag)
> +{
> +  if (flag == 2)
> +    return arg.a[2];
> +  return 0;
> +}
> +float NOIPA
> +foo_sf1 (SF arg, int flag)
> +{
> +  if (flag == 2)
> +    return arg.a[1];
> +  return 0;
> +}
> +
> +DF gdf = {{1.0, 2.0, 3.0, 4.0}, 1, 2, 3, 4};
> +SF gsf = {{1.0f, 2.0f, 3.0f, 4.0f}, 1, 2};
> +
> +int
> +main ()
> +{
> +  if (!(foo_hi (gdf, 2) == 5 && foo_si (gsf, 2) == 3 && foo_df (gdf, 2) == 4.0
> +	&& foo_sf (gsf, 2) == 3.0 && foo_sf1 (gsf, 2) == 2.0))
> +    __builtin_abort ();
> +  if (!(foo_hi (gdf, 1) == 0 && foo_si (gsf, 1) == 0 && foo_df (gdf, 1) == 0
> +	&& foo_sf (gsf, 1) == 0 && foo_sf1 (gsf, 1) == 0))
> +    __builtin_abort ();
> +  return 0;
> +}
> diff --git a/gcc/testsuite/gcc.target/powerpc/pr108073.c b/gcc/testsuite/gcc.target/powerpc/pr108073.c
> new file mode 100644
> index 00000000000..4e7feaa6810
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/powerpc/pr108073.c
> @@ -0,0 +1,74 @@
> +/* { dg-do run } */
> +/* { dg-require-effective-target hard_float } */
> +/* { dg-options "-O2 -save-temps" } */
> +
> +/* { dg-final { scan-assembler-times {\mmtvsrd|mtvsrws\M} 5 {target { lp64 && { has_arch_pwr8 && be } } } } } */
> +/* { dg-final { scan-assembler-times {\mxscvspdpn\M} 4 {target { lp64 && { has_arch_pwr8 && be } } } } } */
> +/* { dg-final { scan-assembler-times {\mmtvsrd|mtvsrws\M} 3 {target { lp64 && { has_arch_pwr8 && le } } } } } */
> +/* { dg-final { scan-assembler-times {\mxscvspdpn\M} 2 {target { lp64 && { has_arch_pwr8 && le } } } } } */
> +/* { dg-final { scan-assembler-times {\mfadds\M} 2 {target { lp64 && has_arch_pwr8 } } } } */
> +
> +#define NOIPA __attribute__ ((noipa))
> +typedef struct X
> +{
> +  float x;
> +  float y;
> +} X;
> +
> +float NOIPA
> +fooX (X y)
> +{
> +  y.x += 1;
> +  return y.x + y.y;
> +}
> +
> +typedef struct Y
> +{
> +  double a[4];
> +  long l;
> +} Y;
> +
> +double NOIPA
> +fooY (Y arg)
> +{
> +  return arg.a[3];
> +}
> +
> +typedef struct Z
> +{
> +  float a[4];
> +  short l;
> +} Z;
> +
> +float NOIPA
> +fooZ (Z arg)
> +{
> +  return arg.a[3];
> +}
> +
> +float NOIPA
> +fooZ2 (Z arg)
> +{
> +  return arg.a[2];
> +}
> +
> +X x = {1.0f, 2.0f};
> +Y y = {1.0, 2.0, 3.0, 4.0, 1};
> +Z z = {1.0f, 2.0f, 3.0f, 4.0f, 1};
> +int
> +main ()
> +{
> +  if (fooX (x) != 4.0f)
> +    __builtin_abort ();
> +
> +  if (fooY (y) != 4.0)
> +    __builtin_abort ();
> +
> +  if (fooZ (z) != 4.0f)
> +    __builtin_abort ();
> +
> +  if (fooZ2 (z) != 3.0f)
> +    __builtin_abort ();
> +
> +  return 0;
> +}
> diff --git a/gcc/testsuite/gcc.target/powerpc/pr65421.c b/gcc/testsuite/gcc.target/powerpc/pr65421.c
> new file mode 100644
> index 00000000000..25293610680
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/powerpc/pr65421.c
> @@ -0,0 +1,26 @@
> +/* { dg-require-effective-target hard_float } */
> +/* { dg-options "-O2" } */
> +
> +/* generate lfd for fun1, fun4 and fun5  */
> +/* { dg-final { scan-assembler-times {\mlfd\M} 12 {target { lp64 && powerpc_elfv2 } } } } */
> +
> +/* generate stfd for fun2  */
> +/* { dg-final { scan-assembler-times {\mstfd\M} 4 {target { lp64 && powerpc_elfv2 } } } } */
> +
> +/* no other load/store */
> +/* { dg-final { scan-assembler-not {\m[p?l|st]d\M} {target { lp64 && powerpc_elfv2 } } } } */
> +/* { dg-final { scan-assembler-not {\m[l|st]xvd2x\M} {target { lp64 && powerpc_elfv2 } } } } */
> +/* { dg-final { scan-assembler-not {\m[p?l|st]xv\M} {target { lp64 && powerpc_elfv2 } } } } */
> +
> +typedef struct { double a[4]; } A;
> +typedef struct { A a; double b; } B;
> +typedef struct { double c; A a;} C;
> +
> +A fun1 (const A *pa) { return *pa; }
> +void fun2 (const A a, A *pa) { *pa = a; }
> +A fun3 (const A a) { return a; }
> +A fun4 (const B *b) { return b->a; }
> +A fun5 (const C *b) { return b->a; }
> +A fun6 (const B b) { return b.a; }
> +B fun7 (const A a) { B b; b.a = a; b.b = 0; return b; }
> +C fun8 (const A a) { C c; c.a = a; c.c = 0; return c; }
> diff --git a/gcc/testsuite/gcc.target/powerpc/pr69143.c b/gcc/testsuite/gcc.target/powerpc/pr69143.c
> new file mode 100644
> index 00000000000..62db4b0bd82
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/powerpc/pr69143.c
> @@ -0,0 +1,22 @@
> +/* { dg-require-effective-target hard_float } */
> +/* { dg-options "-O2" } */
> +
> +/* { dg-final { scan-assembler-times {\mfmr\M} 3 {target { { lp64 && powerpc_elfv2 } && has_arch_pwr8 } } } } */
> +/* { dg-final { scan-assembler-not {\mxscvspdpn\M} {target { { lp64 && powerpc_elfv2 } && has_arch_pwr8 } } } } */
> +
> +struct foo1
> +{
> +  float x;
> +  float y;
> +};
> +
> +struct foo1
> +blah1 (struct foo1 y)
> +{
> +  struct foo1 x;
> +
> +  x.x = y.y;
> +  x.y = y.x;
> +
> +  return x;
> +}
> -- 
> 2.25.1
>
>
diff mbox series

Patch

diff --git a/gcc/calls.cc b/gcc/calls.cc
index f28c58217fd..d15996b8de8 100644
--- a/gcc/calls.cc
+++ b/gcc/calls.cc
@@ -996,6 +996,16 @@  precompute_register_parameters (int num_actuals, struct arg_data *args,
 	    pop_temp_slots ();
 	  }
 
+	/* Put pseudos to temp stack for argument */
+	if (GET_CODE (args[i].value) == PARALLEL)
+	  {
+	    tree type = TREE_TYPE (args[i].tree_value);
+	    int size = int_size_in_bytes (type);
+	    rtx mem = assign_stack_temp (BLKmode, size);
+	    emit_group_store (mem, args[i].value, type, size);
+	    args[i].value = mem;
+	  }
+
 	/* If we are to promote the function arg to a wider mode,
 	   do it now.  */
 
diff --git a/gcc/cfgexpand.cc b/gcc/cfgexpand.cc
index dad3ae1b7c6..2940431cb16 100644
--- a/gcc/cfgexpand.cc
+++ b/gcc/cfgexpand.cc
@@ -3789,7 +3789,7 @@  expand_value_return (rtx val)
 
   tree decl = DECL_RESULT (current_function_decl);
   rtx return_reg = DECL_RTL (decl);
-  if (return_reg != val)
+  if (!rtx_equal_p (return_reg, val))
     {
       tree funtype = TREE_TYPE (current_function_decl);
       tree type = TREE_TYPE (decl);
@@ -3862,6 +3862,10 @@  expand_return (tree retval)
      been stored into it, so we don't have to do anything special.  */
   if (TREE_CODE (retval_rhs) == RESULT_DECL)
     expand_value_return (result_rtl);
+  /* return is scalarized by fsra.  */
+  else if (DECL_RTL_SET_P (retval_rhs)
+	   && rtx_equal_p (result_rtl, DECL_RTL (retval_rhs)))
+    expand_null_return_1 ();
 
   /* If the result is an aggregate that is being returned in one (or more)
      registers, load the registers here.  */
diff --git a/gcc/expr.cc b/gcc/expr.cc
index 2089c2b86a9..8fef1af932c 100644
--- a/gcc/expr.cc
+++ b/gcc/expr.cc
@@ -3647,6 +3647,14 @@  copy_blkmode_to_reg (machine_mode mode_in, tree src)
   if (bytes == 0)
     return NULL_RTX;
 
+  /* For large mode in parallel, e.g. x=.ARG_PARTS; ret_reg:TI=x;  */
+  if (GET_CODE (x) == PARALLEL)
+    {
+      rtx dest = gen_reg_rtx (mode_in);
+      emit_group_store (dest, x, TREE_TYPE (src), bytes);
+      return dest;
+    }
+
   /* If the structure doesn't take up a whole number of words, see
      whether the register value should be padded on the left or on
      the right.  Set PADDING_CORRECTION to the number of padding
diff --git a/gcc/internal-fn.cc b/gcc/internal-fn.cc
index 8a2e07f2f96..71202360109 100644
--- a/gcc/internal-fn.cc
+++ b/gcc/internal-fn.cc
@@ -3483,6 +3483,448 @@  expand_ACCESS_WITH_SIZE (internal_fn, gcall *stmt)
     expand_assignment (lhs, ref_to_obj, false);
 }
 
+/* For the registers series REGS, compute which registers are touched
+   at the position {BITPOS, BITSIZE}.  The results are stored into
+   START_INDEX, END_INDEX, LEFT_BITS and RIGHT_BITS.
+   Return the first register at the expected position.  */
+
+static rtx
+query_position_in_series (rtx regs, HOST_WIDE_INT bitpos, HOST_WIDE_INT bitsize,
+			  int &start_index, int &end_index,
+			  HOST_WIDE_INT &left_bits, HOST_WIDE_INT &right_bits)
+{
+  if (GET_CODE (regs) == PARALLEL)
+    {
+      int cur_index = XEXP (XVECEXP (regs, 0, 0), 0) ? 0 : 1;
+      for (; cur_index < XVECLEN (regs, 0); cur_index++)
+	{
+	  rtx slot = XVECEXP (regs, 0, cur_index);
+	  HOST_WIDE_INT off = UINTVAL (XEXP (slot, 1)) * BITS_PER_UNIT;
+	  machine_mode mode = GET_MODE (XEXP (slot, 0));
+	  HOST_WIDE_INT size = GET_MODE_BITSIZE (mode).to_constant ();
+	  if (off <= bitpos && off + size > bitpos)
+	    {
+	      start_index = cur_index;
+	      left_bits = bitpos - off;
+	    }
+	  if (off + size >= bitpos + bitsize)
+	    {
+	      end_index = cur_index;
+	      right_bits = off + size - (bitpos + bitsize);
+	      break;
+	    }
+	}
+      /* Does not hit the incoming reg, e.g. accessing the padding.  */
+      if (start_index < 0 || end_index < start_index)
+	return NULL_RTX;
+      return XEXP (XVECEXP (regs, 0, start_index), 0);
+    }
+
+  /* Complex concat.  */
+  if (GET_CODE (regs) == CONCAT && COMPLEX_MODE_P (GET_MODE (regs)))
+    {
+      machine_mode mode = GET_MODE (XEXP (regs, 0));
+      HOST_WIDE_INT mode_size = GET_MODE_BITSIZE (mode).to_constant ();
+      if ((bitsize == mode_size && (bitpos == 0 || bitpos == mode_size))
+	  || (bitsize == mode_size * 2 && bitpos == 0))
+	{
+	  start_index = end_index = bitpos == 0 ? 0 : 1;
+	  left_bits = right_bits = 0;
+	  if (bitsize == mode_size)
+	    return XEXP (regs, start_index);
+
+	  /* bitsize == mode_size * 2 */
+	  rtx reg = gen_reg_rtx (GET_MODE (regs));
+	  emit_move_insn (reg, regs);
+	  return reg;
+	}
+
+      /* Take middle bits of complex. */
+      return NULL_RTX;
+    }
+
+  /* Continues registers.  */
+  if (REG_P (regs) && GET_MODE (regs) == BLKmode)
+    {
+      HOST_WIDE_INT end_bits = bitpos + bitsize - 1;
+      start_index = bitpos / BITS_PER_WORD;
+      left_bits = bitpos % BITS_PER_WORD;
+      end_index = end_bits / BITS_PER_WORD;
+      right_bits = BITS_PER_WORD - 1 - (end_bits % BITS_PER_WORD);
+      return gen_rtx_REG (word_mode, REGNO (regs) + start_index);
+    }
+
+  /* Only one incoming register.  */
+  if (REG_P (regs))
+    {
+      start_index = end_index = 0;
+      left_bits = bitpos;
+      machine_mode mode = GET_MODE (regs);
+      right_bits = GET_MODE_BITSIZE (mode).to_constant () - bitpos - bitsize;
+      return regs;
+    }
+
+  return NULL_RTX;
+}
+
+/* For an access on ARG at {BITPOS, BITSIZE}, compute a RTX
+   expression for the access and assign it to LHS.
+   ARG is an aggregate parameter of a function, and it should
+   be passed through registers.
+   Return true if able to figure out an effective way to assign
+   the access to LHS, otherwise return false.  */
+
+static bool
+assign_from_regs (tree arg, HOST_WIDE_INT bitpos, HOST_WIDE_INT bitsize,
+		  bool reversep, tree lhs)
+{
+  rtx regs = DECL_INCOMING_RTL (arg);
+  if (MEM_P (regs))
+    return false;
+  int start_index = -1;
+  int end_index = -1;
+  HOST_WIDE_INT left_bits = 0;
+  HOST_WIDE_INT right_bits = 0;
+
+  /* Use pseudo on DECL_RTL instead of single hard register.  */
+  if (REG_P (regs))
+    {
+      rtx arg_rtx = arg->decl_with_rtl.rtl;
+      if (GET_MODE (regs) != BLKmode)
+	regs = arg_rtx;
+
+      /* For the case: small aggregate size.  */
+      else if (int_size_in_bytes (TREE_TYPE (arg)) < UNITS_PER_WORD)
+	{
+	  regs = gen_reg_rtx (word_mode);
+	  emit_move_insn (regs, adjust_address (arg_rtx, word_mode, 0));
+	}
+    }
+
+  rtx reg1st = query_position_in_series (regs, bitpos, bitsize, start_index,
+					 end_index, left_bits, right_bits);
+  if (!reg1st)
+    return false;
+  machine_mode mode = word_mode;
+
+  tree type = TREE_TYPE (lhs);
+  machine_mode expr_mode = TYPE_MODE (type);
+
+  /* Just need one reg for the access.  */
+  if (end_index == start_index)
+    {
+      rtx reg = reg1st;
+      /* Perfer pesudo */
+      if (HARD_REGISTER_P (reg))
+	{
+	  rtx tmp = gen_reg_rtx (GET_MODE (reg));
+	  emit_move_insn (tmp, reg);
+	  reg = tmp;
+	}
+
+      /* Whole register.  */
+      if (left_bits == 0 && right_bits == 0)
+	{
+	  if (reversep)
+	    reg = flip_storage_order (GET_MODE (reg), reg);
+
+	  rtx dest = expand_expr (lhs, NULL_RTX, VOIDmode, EXPAND_WRITE);
+	  if (GET_MODE (reg) != expr_mode)
+	    {
+	      if (expr_mode == BLKmode)
+		dest = adjust_address (dest, GET_MODE (reg), 0);
+	      else
+		reg = gen_lowpart (expr_mode, reg);
+	    }
+	  if (GET_CODE (dest) == SUBREG && SUBREG_PROMOTED_VAR_P (dest))
+	    convert_move (SUBREG_REG (dest), reg, SUBREG_PROMOTED_SIGN (dest));
+	  else
+	    emit_move_insn (dest, reg);
+	  return true;
+	}
+
+      /* For left_bits != 0 or right_bits != 0,
+	 Need to extract bitfield part reg for the access.  */
+      scalar_int_mode imode;
+      if (!int_mode_for_mode (expr_mode).exists (&imode))
+	return false;
+
+      bool sgn = TYPE_UNSIGNED (type);
+      rtx res = extract_bit_field (reg, bitsize, left_bits, sgn, NULL_RTX,
+				   expr_mode, expr_mode, reversep, NULL);
+
+      if (GET_MODE (res) != expr_mode)
+	{
+	  /* e.g. SI != SF.  */
+	  rtx tmp = gen_reg_rtx (GET_MODE (res));
+	  emit_move_insn (tmp, res);
+	  res = gen_lowpart (expr_mode, tmp);
+	}
+
+      rtx dest = expand_expr (lhs, NULL_RTX, VOIDmode, EXPAND_WRITE);
+      if (GET_CODE (dest) == SUBREG && SUBREG_PROMOTED_VAR_P (dest))
+	convert_move (SUBREG_REG (dest), res, SUBREG_PROMOTED_SIGN (dest));
+      else
+       emit_move_insn (dest, res);
+      return true;
+    }
+
+  /* Sub-bits cross multi-registers.  */
+  if (left_bits != 0 || right_bits != 0 || bitsize % BITS_PER_WORD)
+    return false;
+
+  /* Access multi-registers.  */
+  int len = end_index - start_index + 1;
+  if (!VECTOR_MODE_P (expr_mode))
+    {
+      rtvec vec = rtvec_alloc (len);
+      int off0 = 0;
+      if (GET_CODE (regs) == PARALLEL)
+	off0 = INTVAL (XEXP (XVECEXP (regs, 0, start_index), 1));
+      for (int i = 0; i < len; i++)
+	{
+	  int index = start_index + i;
+	  rtx src;
+	  rtx off;
+	  if (REG_P (regs))
+	    {
+	      src = gen_rtx_REG (mode, REGNO (regs) + index);
+	      off = GEN_INT (UNITS_PER_WORD * i);
+	    }
+	  else
+	    {
+	      rtx e = XVECEXP (regs, 0, index);
+	      src = XEXP (e, 0);
+	      off = GEN_INT (INTVAL (XEXP (e, 1)) - off0);
+	    }
+	  rtx tmp = gen_reg_rtx (GET_MODE (src));
+	  emit_move_insn (tmp, src);
+	  RTVEC_ELT (vec, i) = alloc_EXPR_LIST (REG_DEP_TRUE, tmp, off);
+	}
+      rtx res = gen_rtx_PARALLEL (expr_mode, vec);
+
+      /* A few registers: parallel:[DI,DI,DI..].  */
+      if (expr_mode == BLKmode)
+	lhs->decl_with_rtl.rtl = res; // SET_DECL_RTL
+
+      /* For case like: "reg:TI=[DI, DI]"  */
+      else
+	{
+	  rtx dest = expand_expr (lhs, NULL_RTX, VOIDmode, EXPAND_WRITE);
+	  emit_group_store (dest, res, type, bitsize / BITS_PER_UNIT);
+	}
+      return true;
+    }
+
+  /* For vector, check vector mode and element mode.  */
+  if (GET_CODE (regs) == PARALLEL)
+    mode = GET_MODE (XEXP (XVECEXP (regs, 0, start_index), 0));
+  scalar_mode smode = as_a<scalar_mode> (mode);
+  machine_mode vmode;
+  if (!related_vector_mode (expr_mode, smode, len).exists (&vmode))
+    return false;
+  insn_code icode = convert_optab_handler (vec_init_optab, vmode, smode);
+  if (icode == CODE_FOR_nothing)
+    return false;
+
+  rtvec vec = rtvec_alloc (len);
+  for (int i = 0; i < len; i++)
+    {
+      int index = start_index + i;
+      rtx src = REG_P (regs) ? gen_rtx_REG (mode, REGNO (regs) + index)
+			     : XEXP (XVECEXP (regs, 0, index), 0);
+      RTVEC_ELT (vec, i) = src;
+    }
+
+  rtx res = gen_reg_rtx (vmode);
+  emit_insn (GEN_FCN (icode) (res, gen_rtx_PARALLEL (vmode, vec)));
+  if (expr_mode != vmode)
+    res = gen_lowpart (expr_mode, res);
+  rtx dest = expand_expr (lhs, NULL_RTX, VOIDmode, EXPAND_WRITE);
+  emit_move_insn (dest, res);
+  return true;
+}
+
+/* Extern function for building MEM_REF rtx.  */
+tree reference_alias_ptr_type (tree);
+
+/* Expand the IFN_ARG_PARTS function:
+   LHS = .ARG_PARTS(INCOMING_ARG, BIT_OFFSET, BIT_SIZE, REVERSEP).  */
+
+static void
+expand_ARG_PARTS (internal_fn, gcall *stmt)
+{
+  tree lhs = gimple_call_lhs (stmt);
+  tree arg = gimple_call_arg (stmt, 0);
+  HOST_WIDE_INT offset = tree_to_shwi (gimple_call_arg (stmt, 1));
+  HOST_WIDE_INT size = tree_to_shwi (gimple_call_arg (stmt, 2));
+  int reversep = tree_to_shwi (gimple_call_arg (stmt, 3));
+  if (assign_from_regs (arg, offset, size, reversep, lhs))
+    return;
+
+  /* Access bitfiled (mem or cross-registers).  */
+  tree type = TREE_TYPE (lhs);
+  if ((INTEGRAL_TYPE_P (type) && !type_has_mode_precision_p (type))
+      || offset % BITS_PER_UNIT != 0 || size % BITS_PER_UNIT != 0)
+    {
+      machine_mode mode = TYPE_MODE (type);
+      rtx src
+	= expand_expr_real (arg, NULL, VOIDmode, EXPAND_NORMAL, NULL, true);
+      src = extract_bit_field (src, size, offset, TYPE_UNSIGNED (type), NULL,
+			       mode, mode, reversep, NULL);
+      rtx dest = expand_expr (lhs, NULL, VOIDmode, EXPAND_WRITE);
+      if (GET_CODE (dest) == SUBREG && SUBREG_PROMOTED_VAR_P (dest))
+	convert_move (SUBREG_REG (dest), src, SUBREG_PROMOTED_SIGN (dest));
+      else
+	emit_move_insn (dest, src);
+      return;
+    }
+
+  /* Fallback to original expand.  */
+  gcc_assert (offset % BITS_PER_UNIT == 0 && size % BITS_PER_UNIT == 0);
+  tree base = build_fold_addr_expr (arg);
+  tree atype = reference_alias_ptr_type (arg);
+  tree off = build_int_cst (atype, offset / BITS_PER_UNIT);
+  location_t loc = EXPR_LOCATION (arg);
+  tree rhs = fold_build2_loc (loc, MEM_REF, type, base, off);
+  REF_REVERSE_STORAGE_ORDER (rhs) = reversep;
+  expand_assignment (lhs, rhs, false);
+}
+
+/* REGS constains registers (e.g. function return), compute which register(s)
+   are touched at {BITPOS, BITSIZE}. REV indicates the aggrgates's storeage
+   order.  And assign rhs to it(them) and return true.  */
+
+static bool
+assign_to_regs (rtx regs, HOST_WIDE_INT bitpos, HOST_WIDE_INT bitsize, bool rev,
+		tree rhs)
+{
+  /* Store src into a (or part of) registrer.  */
+  auto store_into_reg
+    = [rev] (rtx dest, HOST_WIDE_INT size, HOST_WIDE_INT pos, rtx src) -> bool {
+    machine_mode mode = GET_MODE (dest);
+    if (MEM_P (src) && GET_MODE (src) != mode)
+      PUT_MODE (src, mode);
+    if (known_eq (size, GET_MODE_BITSIZE (mode)))
+      emit_move_insn (dest, gen_lowpart (mode, src));
+    else
+      store_bit_field (dest, size, pos, 0, 0, mode, src, rev, false);
+    return true;
+  };
+
+  rtx src = expand_expr (rhs, NULL_RTX, VOIDmode, EXPAND_NORMAL);
+  gcc_assert (src);
+
+  /* Just one register.  */
+  if (REG_P (regs))
+    return store_into_reg (regs, bitsize, bitpos, src);
+
+  gcc_assert (GET_CODE (regs) == PARALLEL);
+
+  int start_index = -1;
+  int end_index = -1;
+  HOST_WIDE_INT left_bits = 0;
+  HOST_WIDE_INT right_bits = 0;
+  rtx reg1st = query_position_in_series (regs, bitpos, bitsize, start_index,
+					 end_index, left_bits, right_bits);
+
+  gcc_assert (reg1st);
+
+  /* Only hit one register.  */
+  if (end_index == start_index)
+    return store_into_reg (reg1st, bitsize, left_bits, src);
+
+  /* Hit multi-registers.  */
+  machine_mode src_mode = GET_MODE (src);
+
+  /* Bits in mem:blk*/
+  if (src_mode != BLKmode && !REG_P (src))
+    {
+      /* To extract bitfield, convert src. */
+      poly_int64 size = GET_MODE_SIZE (src_mode);
+      rtx mem = assign_stack_temp (src_mode, size);
+      emit_move_insn (mem, src);
+      src = mem;
+    }
+
+  int off0 = INTVAL (XEXP (XVECEXP (regs, 0, start_index), 1)) * BITS_PER_UNIT;
+  /* Need to set bits in the first register. */
+  if (left_bits != 0)
+    {
+      rtx reg = reg1st;
+      machine_mode mode = GET_MODE (reg);
+      int size = GET_MODE_BITSIZE (mode).to_constant ();
+      size -= left_bits;
+      rtx elt
+	= extract_bit_field (src, size, 0, 1, NULL, mode, mode, false, NULL);
+      store_into_reg (reg, size, left_bits, elt);
+
+      off0 += left_bits;
+      start_index++;
+    }
+
+  /* Need to set bits in the last register. */
+  if (right_bits != 0)
+    {
+      rtx e = XVECEXP (regs, 0, end_index);
+      rtx reg = XEXP (e, 0);
+      machine_mode mode = GET_MODE (reg);
+      int size = GET_MODE_BITSIZE (mode).to_constant ();
+      size -= right_bits;
+      int off = INTVAL (XEXP (e, 1)) * BITS_PER_UNIT - off0;
+      rtx elt
+	= extract_bit_field (src, size, off, 1, NULL, mode, mode, false, NULL);
+      store_into_reg (reg, size, 0, elt);
+
+      end_index--;
+    }
+
+  /* Assign the hitted whole registers.  */
+  for (int i = start_index; i <= end_index; i++)
+    {
+      rtx e = XVECEXP (regs, 0, i);
+      rtx reg = XEXP (e, 0);
+      int off = INTVAL (XEXP (e, 1)) * BITS_PER_UNIT - off0;
+      machine_mode mode = GET_MODE (reg);
+      int size = GET_MODE_BITSIZE (mode).to_constant ();
+      rtx elt
+	= extract_bit_field (src, size, off, 1, NULL, mode, mode, false, NULL);
+      emit_move_insn (reg, elt);
+    }
+  return true;
+}
+
+/* Expand the IFN_SET_RET_PARTS function:
+   .SET_REG_PARTS(RET_BASE, BIT_OFFSET, BIT_SIZE, SRC).
+   e.g. .SET_RET_PARTS (D.2774, 0, 8, SRC); return D.2774; */
+
+static void
+expand_SET_RET_PARTS (internal_fn, gcall *stmt)
+{
+  HOST_WIDE_INT offset = tree_to_shwi (gimple_call_arg (stmt, 1));
+  HOST_WIDE_INT size = tree_to_shwi (gimple_call_arg (stmt, 2));
+  tree rhs = gimple_call_arg (stmt, 3);
+
+  tree decl = DECL_RESULT (current_function_decl);
+  rtx dest_regs = decl->decl_with_rtl.rtl; // DECL_RTL (base);
+  if (GET_CODE (dest_regs) != PARALLEL
+      && (REG_P (dest_regs) ? REGNO (dest_regs) >= FIRST_PSEUDO_REGISTER
+			    : DECL_REGISTER (decl))
+      && GET_CODE (crtl->return_rtx) == PARALLEL)
+    {
+      /* Outgoing registers are in crtl->return_rtx.  */
+      dest_regs = gen_group_rtx (crtl->return_rtx);
+      decl->decl_with_rtl.rtl = dest_regs;
+    }
+
+  /* Use outgoing pseudos for return-var.  */
+  tree base = gimple_call_arg (stmt, 0);
+  base->decl_with_rtl.rtl = dest_regs; // SET_DECL_RTL
+
+  bool rev = TYPE_REVERSE_STORAGE_ORDER (TREE_TYPE (decl));
+  assign_to_regs (dest_regs, offset, size, rev, rhs);
+}
+
 /* The size of an OpenACC compute dimension.  */
 
 static void
diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def
index 75b527b1ab0..7eda3ecc059 100644
--- a/gcc/internal-fn.def
+++ b/gcc/internal-fn.def
@@ -520,6 +520,12 @@  DEF_INTERNAL_FN (DEFERRED_INIT, ECF_CONST | ECF_LEAF | ECF_NOTHROW, NULL)
    2nd argument.  */
 DEF_INTERNAL_FN (ACCESS_WITH_SIZE, ECF_PURE | ECF_LEAF | ECF_NOTHROW, NULL)
 
+/* A function to extract element from function's incoming registers. */
+DEF_INTERNAL_FN (ARG_PARTS, ECF_CONST | ECF_LEAF | ECF_NOTHROW, NULL)
+
+/* A function to set outgoing registers for 'return' aggregate. */
+DEF_INTERNAL_FN (SET_RET_PARTS, ECF_LEAF | ECF_NOTHROW, NULL)
+
 /* DIM_SIZE and DIM_POS return the size of a particular compute
    dimension and the executing thread's position within that
    dimension.  DIM_POS is pure (and not const) so that it isn't
diff --git a/gcc/passes.def b/gcc/passes.def
index b06d6d45f63..46eea60f395 100644
--- a/gcc/passes.def
+++ b/gcc/passes.def
@@ -448,6 +448,7 @@  along with GCC; see the file COPYING3.  If not see
   NEXT_PASS (pass_harden_conditional_branches);
   NEXT_PASS (pass_harden_compares);
   NEXT_PASS (pass_warn_access, /*early=*/false);
+  NEXT_PASS (pass_sra_final);
   NEXT_PASS (pass_cleanup_cfg_post_optimizing);
   NEXT_PASS (pass_warn_function_noreturn);
 
diff --git a/gcc/tree-pass.h b/gcc/tree-pass.h
index 3a0cf13089e..0ea5c6748a8 100644
--- a/gcc/tree-pass.h
+++ b/gcc/tree-pass.h
@@ -366,6 +366,7 @@  extern gimple_opt_pass *make_pass_early_tree_profile (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_cleanup_eh (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_sra (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_sra_early (gcc::context *ctxt);
+extern gimple_opt_pass *make_pass_sra_final (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_tail_recursion (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_tail_calls (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_musttail (gcc::context *ctxt);
diff --git a/gcc/tree-sra.cc b/gcc/tree-sra.cc
index 8040b0c5645..b23e9fd8794 100644
--- a/gcc/tree-sra.cc
+++ b/gcc/tree-sra.cc
@@ -21,14 +21,16 @@  along with GCC; see the file COPYING3.  If not see
 <http://www.gnu.org/licenses/>.  */
 
 /* This file implements Scalar Reduction of Aggregates (SRA).  SRA is run
-   twice, once in the early stages of compilation (early SRA) and once in the
-   late stages (late SRA).  The aim of both is to turn references to scalar
-   parts of aggregates into uses of independent scalar variables.
+   three times, once in the early stages of compilation (early SRA) and once
+   in the late stages (late SRA).  The aim of them is to turn references to
+   scalar parts of aggregates into uses of independent scalar variables.
 
-   The two passes are nearly identical, the only difference is that early SRA
+   The three passes are nearly identical, the difference are that early SRA
    does not scalarize unions which are used as the result in a GIMPLE_RETURN
    statement because together with inlining this can lead to weird type
-   conversions.
+   conversions.  The third pass is more care about parameters and returns,
+   it would be helpful for the parameters and returns which are passed through
+   registers.
 
    Both passes operate in four stages:
 
@@ -104,6 +106,7 @@  along with GCC; see the file COPYING3.  If not see
 /* Enumeration of all aggregate reductions we can do.  */
 enum sra_mode { SRA_MODE_EARLY_IPA,   /* early call regularization */
 		SRA_MODE_EARLY_INTRA, /* early intraprocedural SRA */
+		SRA_MODE_FINAL_INTRA, /* final gimple intraprocedural SRA */
 		SRA_MODE_INTRA };     /* late intraprocedural SRA */
 
 /* Global variable describing which aggregate reduction we are performing at
@@ -1269,6 +1272,12 @@  build_access_from_expr_1 (tree expr, gimple *stmt, bool write)
       return NULL;
     }
 
+  if (TREE_CODE (expr) == TARGET_MEM_REF)
+    {
+      disqualify_base_of_expr (expr, "variable offset on base.");
+      return NULL;
+    }
+
   struct access *ret = NULL;
   bool partial_ref;
 
@@ -1549,7 +1558,8 @@  build_accesses_from_assign (gimple *stmt)
     }
 
   if (lacc && racc
-      && (sra_mode == SRA_MODE_EARLY_INTRA || sra_mode == SRA_MODE_INTRA)
+      && (sra_mode == SRA_MODE_EARLY_INTRA || sra_mode == SRA_MODE_INTRA
+	  || sra_mode == SRA_MODE_FINAL_INTRA)
       && !lacc->grp_unscalarizable_region
       && !racc->grp_unscalarizable_region
       && AGGREGATE_TYPE_P (TREE_TYPE (lhs))
@@ -2261,6 +2271,26 @@  find_var_candidates (void)
        parm = DECL_CHAIN (parm))
     ret |= maybe_add_sra_candidate (parm);
 
+  /* fsra only care about parameters and returns */
+  if (sra_mode == SRA_MODE_FINAL_INTRA)
+    {
+      tree res = DECL_RESULT (current_function_decl);
+      if (!res || !AGGREGATE_TYPE_P (TREE_TYPE (res))
+	  || aggregate_value_p (res, current_function_decl))
+	return ret;
+
+      edge_iterator ei;
+      edge e;
+      FOR_EACH_EDGE (e, ei, EXIT_BLOCK_PTR_FOR_FN (cfun)->preds)
+	if (greturn *r = safe_dyn_cast<greturn *> (*gsi_last_bb (e->src)))
+	  {
+	    tree val = gimple_return_retval (r);
+	    if (val && VAR_P (val))
+	      ret |= maybe_add_sra_candidate (val);
+	  }
+      return ret;
+    }
+
   FOR_EACH_LOCAL_DECL (cfun, i, var)
     {
       if (!VAR_P (var))
@@ -2771,6 +2801,33 @@  expr_with_var_bounded_array_refs_p (tree expr)
   return false;
 }
 
+/* Analyze if the access ROOT should be replaced in fsra pass.  */
+static bool
+fsra_analyze (struct access *root)
+{
+  gcc_assert (sra_mode == SRA_MODE_FINAL_INTRA);
+
+  bool scalar = is_gimple_reg_type (root->type);
+  if (scalar)
+    {
+      /* Parameter is scalarizable even no writing to it.  */
+      if (TREE_CODE (root->base) == PARM_DECL)
+	return root->grp_scalar_read || root->grp_assignment_read;
+
+      /* Returns is scalarizable even no reading on it.  */
+      else
+	return root->grp_scalar_write || root->grp_assignment_write;
+    }
+
+  /* fsra could handle assignment on aggregate, e.g for *p = arg;  */
+  if (TREE_CODE (root->base) == PARM_DECL)
+    return !root->grp_scalar_read && root->grp_assignment_read;
+
+  /* For "D.3988 = *pa_2(D); return * D.3988;".  */
+  return !root->grp_scalar_write && root->grp_assignment_write
+	 && root->base == root->expr;
+}
+
 /* Analyze the subtree of accesses rooted in ROOT, scheduling replacements when
    both seeming beneficial and when ALLOW_REPLACEMENTS allows it.  If TOTALLY
    is set, we are totally scalarizing the aggregate.  Also set all sorts of
@@ -2903,6 +2960,22 @@  analyze_access_subtree (struct access *root, struct access *parent,
       sth_created = true;
       hole = false;
     }
+
+  /* Check if fsra is going to handle the access.  */
+  else if (allow_replacements && sra_mode == SRA_MODE_FINAL_INTRA
+	   && !root->first_child && fsra_analyze (root))
+    {
+      root->grp_to_be_replaced = 1;
+      if (scalar || TREE_CODE (root->base) == PARM_DECL)
+	root->replacement_decl = create_access_replacement (root);
+
+      /* For assigning to whole 'return', directly use DECL_RESULT.  */
+      else
+	root->replacement_decl = DECL_RESULT (current_function_decl);
+      sth_created = true;
+      hole = false;
+    }
+
   else
     {
       if (allow_replacements
@@ -3159,6 +3232,13 @@  propagate_subaccesses_from_rhs (struct access *lacc, struct access *racc)
       return ret;
     }
 
+  /* In fsra, as all access (include no replacement) about assigning to
+     'return' is assign to outgoing registers, then no need to propagate
+     racc's child to the lhs('return').  */
+  if (sra_mode == SRA_MODE_FINAL_INTRA && lacc
+      && TREE_CODE (lacc->base) != PARM_DECL)
+    return ret;
+
   for (rchild = racc->first_child; rchild; rchild = rchild->next_sibling)
     {
       struct access *new_acc = NULL;
@@ -3872,7 +3952,8 @@  generate_subtree_copies (struct access *access, tree agg,
 			 HOST_WIDE_INT top_offset,
 			 HOST_WIDE_INT start_offset, HOST_WIDE_INT chunk_size,
 			 gimple_stmt_iterator *gsi, bool write,
-			 bool insert_after, location_t loc)
+			 bool insert_after, location_t loc,
+			 bool fsra_arg = false)
 {
   /* Never write anything into constant pool decls.  See PR70602.  */
   if (!write && constant_decl_p (agg))
@@ -3887,7 +3968,7 @@  generate_subtree_copies (struct access *access, tree agg,
 	      || access->offset + access->size > start_offset))
 	{
 	  tree expr, repl = get_access_replacement (access);
-	  gassign *stmt;
+	  gimple *stmt;
 
 	  expr = build_ref_for_model (loc, agg, access->offset - top_offset,
 				      access, gsi, insert_after);
@@ -3899,7 +3980,21 @@  generate_subtree_copies (struct access *access, tree agg,
 						 !insert_after,
 						 insert_after ? GSI_NEW_STMT
 						 : GSI_SAME_STMT);
-	      stmt = gimple_build_assign (repl, expr);
+	      /* When assigning the incoming part of argument to 'repl', Gen
+		the ARG_PARTS.  */
+	      if (fsra_arg
+		  && (!access->grp_write || is_gimple_reg_type (access->type))
+		  && (access->grp_scalar_read || access->grp_assignment_read))
+		{
+		  stmt = gimple_build_call_internal (
+		    IFN_ARG_PARTS, 4, access->base,
+		    wide_int_to_tree (sizetype, access->offset),
+		    wide_int_to_tree (sizetype, access->size),
+		    wide_int_to_tree (sizetype, access->reverse));
+		  gimple_call_set_lhs (stmt, repl);
+		}
+	      else
+		stmt = gimple_build_assign (repl, expr);
 	    }
 	  else
 	    {
@@ -3909,7 +4004,15 @@  generate_subtree_copies (struct access *access, tree agg,
 						 !insert_after,
 						 insert_after ? GSI_NEW_STMT
 						 : GSI_SAME_STMT);
-	      stmt = gimple_build_assign (expr, repl);
+	      /* Gen .SET_RET_PARTS, when assigning to 'return var'.  */
+	      if (sra_mode == SRA_MODE_FINAL_INTRA && VAR_P (access->base)
+		  && (access->grp_scalar_write || access->grp_assignment_write))
+		stmt = gimple_build_call_internal (
+		  IFN_SET_RET_PARTS, 4, access->base,
+		  wide_int_to_tree (sizetype, access->offset),
+		  wide_int_to_tree (sizetype, access->size), repl);
+	      else
+		stmt = gimple_build_assign (expr, repl);
 	    }
 	  gimple_set_location (stmt, loc);
 
@@ -3940,7 +4043,7 @@  generate_subtree_copies (struct access *access, tree agg,
       if (access->first_child)
 	generate_subtree_copies (access->first_child, agg, top_offset,
 				 start_offset, chunk_size, gsi,
-				 write, insert_after, loc);
+				 write, insert_after, loc, fsra_arg);
 
       access = access->next_sibling;
     }
@@ -4718,6 +4821,30 @@  sra_modify_assign (gimple *stmt, gimple_stmt_iterator *gsi)
       gsi_insert_before (gsi, ds, GSI_SAME_STMT);
     }
 
+  /* In fsra, even without replacement, IFN SET_RET_PARTS should be used
+     to replace the assignment on 'return var'.   */
+  auto use_set_ret_parts
+    = [stmt, &orig_gsi, &gsi] (struct access *lacc, tree rhs) -> bool {
+    /* Change the assign to SET_RET_PARTS if it is not repalced yet.  */
+    if (sra_mode == SRA_MODE_FINAL_INTRA && lacc
+	&& TREE_CODE (lacc->base) != PARM_DECL && !lacc->replacement_decl)
+      {
+	tree base = lacc->base;
+	tree offset = wide_int_to_tree (sizetype, lacc->offset);
+	tree size = wide_int_to_tree (sizetype, lacc->size);
+	gimple *new_stmt = gimple_build_call_internal (IFN_SET_RET_PARTS, 4,
+						       base, offset, size, rhs);
+	gsi_insert_before (&orig_gsi, new_stmt, GSI_SAME_STMT);
+	update_stmt (new_stmt);
+	unlink_stmt_vdef (stmt);
+	gsi_remove (gsi_stmt (*gsi) == stmt ? gsi : &orig_gsi, true);
+	release_defs (stmt);
+	sra_stats.deleted++;
+	return true;
+      }
+    return false;
+  };
+
   /* From this point on, the function deals with assignments in between
      aggregates when at least one has scalar reductions of some of its
      components.  There are three possible scenarios: Both the LHS and RHS have
@@ -4787,6 +4914,9 @@  sra_modify_assign (gimple *stmt, gimple_stmt_iterator *gsi)
 	  gcc_assert (stmt == gsi_stmt (orig_gsi));
 	}
 
+      if (use_set_ret_parts (lacc, rhs))
+	return SRA_AM_REMOVED;
+
       return modify_this_stmt ? SRA_AM_MODIFIED : SRA_AM_NONE;
     }
   else
@@ -4823,6 +4953,9 @@  sra_modify_assign (gimple *stmt, gimple_stmt_iterator *gsi)
 	      sra_stats.deleted++;
 	      return SRA_AM_REMOVED;
 	    }
+
+	  if (use_set_ret_parts (lacc, rhs))
+	    return SRA_AM_REMOVED;
 	}
       else
 	{
@@ -4866,6 +4999,9 @@  sra_modify_assign (gimple *stmt, gimple_stmt_iterator *gsi)
 		  return SRA_AM_REMOVED;
 		}
 	    }
+
+	  if (use_set_ret_parts (lacc, rhs))
+	    return SRA_AM_REMOVED;
 	}
 
       return SRA_AM_NONE;
@@ -5070,7 +5206,8 @@  initialize_parameter_reductions (void)
 	   access;
 	   access = access->next_grp)
 	generate_subtree_copies (access, parm, 0, 0, 0, &gsi, true, true,
-				 EXPR_LOCATION (parm));
+				 EXPR_LOCATION (parm),
+				 sra_mode == SRA_MODE_FINAL_INTRA);
     }
 
   seq = gsi_seq (gsi);
@@ -5134,6 +5271,14 @@  late_intra_sra (void)
   return perform_intra_sra ();
 }
 
+/* Perform "final sra" SRA just before expander.  */
+static unsigned int
+final_intra_sra (void)
+{
+  sra_mode = SRA_MODE_FINAL_INTRA;
+  return perform_intra_sra ();
+}
+
 
 static bool
 gate_intra_sra (void)
@@ -5217,6 +5362,44 @@  make_pass_sra (gcc::context *ctxt)
   return new pass_sra (ctxt);
 }
 
+namespace
+{
+const pass_data pass_data_sra_final = {
+  GIMPLE_PASS,		 /* type */
+  "fsra",		 /* name */
+  OPTGROUP_NONE,	 /* optinfo_flags */
+  TV_TREE_SRA,		 /* tv_id */
+  (PROP_cfg | PROP_ssa), /* properties_required */
+  0,			 /* properties_provided */
+  0,			 /* properties_destroyed */
+  0,			 /* todo_flags_start */
+  TODO_update_ssa,	 /* todo_flags_finish */
+};
+
+class pass_sra_final : public gimple_opt_pass
+{
+public:
+  pass_sra_final (gcc::context *ctxt)
+    : gimple_opt_pass (pass_data_sra_final, ctxt)
+  {
+  }
+
+  /* opt_pass methods: */
+  bool gate (function *) final override { return gate_intra_sra (); }
+  unsigned int execute (function *) final override
+  {
+    return final_intra_sra ();
+  }
+
+}; // class pass_sra_final
+
+} // namespace
+
+gimple_opt_pass *
+make_pass_sra_final (gcc::context *ctxt)
+{
+  return new pass_sra_final (ctxt);
+}
 
 /* If type T cannot be totally scalarized, return false.  Otherwise return true
    and push to the vector within PC offsets and lengths of all padding in the
diff --git a/gcc/testsuite/g++.target/powerpc/pr102024.C b/gcc/testsuite/g++.target/powerpc/pr102024.C
index 769585052b5..4d9bbb0f050 100644
--- a/gcc/testsuite/g++.target/powerpc/pr102024.C
+++ b/gcc/testsuite/g++.target/powerpc/pr102024.C
@@ -5,7 +5,8 @@ 
 // Test that a zero-width bit field in an otherwise homogeneous aggregate
 // generates a psabi warning and passes arguments in GPRs.
 
-// { dg-final { scan-assembler-times {\mstd\M} 4 } }
+// { dg-final { scan-assembler-times {\mstd\M} 4 {target { ! has_arch_pwr8 } } } }
+// { dg-final { scan-assembler-times {\mmtvsrd\M} 4 {target { has_arch_pwr8 } } } }
 
 struct a_thing
 {
diff --git a/gcc/testsuite/gcc.target/powerpc/pr108073-1.c b/gcc/testsuite/gcc.target/powerpc/pr108073-1.c
new file mode 100644
index 00000000000..4892716e85f
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/pr108073-1.c
@@ -0,0 +1,76 @@ 
+/* { dg-do run } */
+/* { dg-require-effective-target hard_float } */
+/* { dg-options "-O2 -save-temps" } */
+
+typedef struct DF
+{
+  double a[4];
+  short s1;
+  short s2;
+  short s3;
+  short s4;
+} DF;
+typedef struct SF
+{
+  float a[4];
+  int i1;
+  int i2;
+} SF;
+
+/* { dg-final { scan-assembler-times {\mmtvsrd|mtvsrws\M} 3 {target { lp64 && has_arch_pwr8 } } } } */
+/* { dg-final { scan-assembler-not {\mlwz\M} {target { lp64 && has_arch_pwr8 } } } } */
+/* { dg-final { scan-assembler-not {\mlhz\M} {target { lp64 && has_arch_pwr8 } } } } */
+
+#define NOIPA __attribute__ ((noipa))
+
+short NOIPA
+foo_hi (DF a, int flag)
+{
+  if (flag == 2)
+    return a.s2 + a.s3;
+  return 0;
+}
+int NOIPA
+foo_si (SF a, int flag)
+{
+  if (flag == 2)
+    return a.i2 + a.i1;
+  return 0;
+}
+double NOIPA
+foo_df (DF arg, int flag)
+{
+  if (flag == 2)
+    return arg.a[3];
+  else
+    return 0.0;
+}
+float NOIPA
+foo_sf (SF arg, int flag)
+{
+  if (flag == 2)
+    return arg.a[2];
+  return 0;
+}
+float NOIPA
+foo_sf1 (SF arg, int flag)
+{
+  if (flag == 2)
+    return arg.a[1];
+  return 0;
+}
+
+DF gdf = {{1.0, 2.0, 3.0, 4.0}, 1, 2, 3, 4};
+SF gsf = {{1.0f, 2.0f, 3.0f, 4.0f}, 1, 2};
+
+int
+main ()
+{
+  if (!(foo_hi (gdf, 2) == 5 && foo_si (gsf, 2) == 3 && foo_df (gdf, 2) == 4.0
+	&& foo_sf (gsf, 2) == 3.0 && foo_sf1 (gsf, 2) == 2.0))
+    __builtin_abort ();
+  if (!(foo_hi (gdf, 1) == 0 && foo_si (gsf, 1) == 0 && foo_df (gdf, 1) == 0
+	&& foo_sf (gsf, 1) == 0 && foo_sf1 (gsf, 1) == 0))
+    __builtin_abort ();
+  return 0;
+}
diff --git a/gcc/testsuite/gcc.target/powerpc/pr108073.c b/gcc/testsuite/gcc.target/powerpc/pr108073.c
new file mode 100644
index 00000000000..4e7feaa6810
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/pr108073.c
@@ -0,0 +1,74 @@ 
+/* { dg-do run } */
+/* { dg-require-effective-target hard_float } */
+/* { dg-options "-O2 -save-temps" } */
+
+/* { dg-final { scan-assembler-times {\mmtvsrd|mtvsrws\M} 5 {target { lp64 && { has_arch_pwr8 && be } } } } } */
+/* { dg-final { scan-assembler-times {\mxscvspdpn\M} 4 {target { lp64 && { has_arch_pwr8 && be } } } } } */
+/* { dg-final { scan-assembler-times {\mmtvsrd|mtvsrws\M} 3 {target { lp64 && { has_arch_pwr8 && le } } } } } */
+/* { dg-final { scan-assembler-times {\mxscvspdpn\M} 2 {target { lp64 && { has_arch_pwr8 && le } } } } } */
+/* { dg-final { scan-assembler-times {\mfadds\M} 2 {target { lp64 && has_arch_pwr8 } } } } */
+
+#define NOIPA __attribute__ ((noipa))
+typedef struct X
+{
+  float x;
+  float y;
+} X;
+
+float NOIPA
+fooX (X y)
+{
+  y.x += 1;
+  return y.x + y.y;
+}
+
+typedef struct Y
+{
+  double a[4];
+  long l;
+} Y;
+
+double NOIPA
+fooY (Y arg)
+{
+  return arg.a[3];
+}
+
+typedef struct Z
+{
+  float a[4];
+  short l;
+} Z;
+
+float NOIPA
+fooZ (Z arg)
+{
+  return arg.a[3];
+}
+
+float NOIPA
+fooZ2 (Z arg)
+{
+  return arg.a[2];
+}
+
+X x = {1.0f, 2.0f};
+Y y = {1.0, 2.0, 3.0, 4.0, 1};
+Z z = {1.0f, 2.0f, 3.0f, 4.0f, 1};
+int
+main ()
+{
+  if (fooX (x) != 4.0f)
+    __builtin_abort ();
+
+  if (fooY (y) != 4.0)
+    __builtin_abort ();
+
+  if (fooZ (z) != 4.0f)
+    __builtin_abort ();
+
+  if (fooZ2 (z) != 3.0f)
+    __builtin_abort ();
+
+  return 0;
+}
diff --git a/gcc/testsuite/gcc.target/powerpc/pr65421.c b/gcc/testsuite/gcc.target/powerpc/pr65421.c
new file mode 100644
index 00000000000..25293610680
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/pr65421.c
@@ -0,0 +1,26 @@ 
+/* { dg-require-effective-target hard_float } */
+/* { dg-options "-O2" } */
+
+/* generate lfd for fun1, fun4 and fun5  */
+/* { dg-final { scan-assembler-times {\mlfd\M} 12 {target { lp64 && powerpc_elfv2 } } } } */
+
+/* generate stfd for fun2  */
+/* { dg-final { scan-assembler-times {\mstfd\M} 4 {target { lp64 && powerpc_elfv2 } } } } */
+
+/* no other load/store */
+/* { dg-final { scan-assembler-not {\m[p?l|st]d\M} {target { lp64 && powerpc_elfv2 } } } } */
+/* { dg-final { scan-assembler-not {\m[l|st]xvd2x\M} {target { lp64 && powerpc_elfv2 } } } } */
+/* { dg-final { scan-assembler-not {\m[p?l|st]xv\M} {target { lp64 && powerpc_elfv2 } } } } */
+
+typedef struct { double a[4]; } A;
+typedef struct { A a; double b; } B;
+typedef struct { double c; A a;} C;
+
+A fun1 (const A *pa) { return *pa; }
+void fun2 (const A a, A *pa) { *pa = a; }
+A fun3 (const A a) { return a; }
+A fun4 (const B *b) { return b->a; }
+A fun5 (const C *b) { return b->a; }
+A fun6 (const B b) { return b.a; }
+B fun7 (const A a) { B b; b.a = a; b.b = 0; return b; }
+C fun8 (const A a) { C c; c.a = a; c.c = 0; return c; }
diff --git a/gcc/testsuite/gcc.target/powerpc/pr69143.c b/gcc/testsuite/gcc.target/powerpc/pr69143.c
new file mode 100644
index 00000000000..62db4b0bd82
--- /dev/null
+++ b/gcc/testsuite/gcc.target/powerpc/pr69143.c
@@ -0,0 +1,22 @@ 
+/* { dg-require-effective-target hard_float } */
+/* { dg-options "-O2" } */
+
+/* { dg-final { scan-assembler-times {\mfmr\M} 3 {target { { lp64 && powerpc_elfv2 } && has_arch_pwr8 } } } } */
+/* { dg-final { scan-assembler-not {\mxscvspdpn\M} {target { { lp64 && powerpc_elfv2 } && has_arch_pwr8 } } } } */
+
+struct foo1
+{
+  float x;
+  float y;
+};
+
+struct foo1
+blah1 (struct foo1 y)
+{
+  struct foo1 x;
+
+  x.x = y.y;
+  x.y = y.x;
+
+  return x;
+}