
[RFC,PR,80689] Copy small aggregates element-wise

Message ID 20171013161353.uvlix6gfxz7ir4y7@virgil.suse.cz
State New
Series [RFC,PR,80689] Copy small aggregates element-wise

Commit Message

Martin Jambor Oct. 13, 2017, 4:13 p.m. UTC
Hi,

I'd like to request comments on the patch below, which aims to fix PR
80689, an instance of a store-to-load forwarding stall on x86_64 CPUs
in the Image Magick benchmark that is responsible for a slowdown of up
to 9% compared to gcc 6, depending on options and HW used.  (Actually,
I have just seen 24% in one specific combination but for various
reasons can no longer verify it today.)

The revision causing the regression is 237074, which increased the
size of the mode for copying aggregates "by pieces" to 128 bits,
incurring big stalls when the values being copied are also still being
stored in a smaller data type or if the copied values are loaded in
smaller types shortly afterwards.  Such situations happen in Image
Magick even across calls, which means that any non-IPA flow-sensitive
approach would not detect them.  Therefore, the patch changes the way
we copy small BLKmode data that are simple combinations of records and
arrays (meaning unions and bit-fields are disallowed, and so are
character arrays) and simply copies them one field and/or element at a
time.
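
As a minimal illustration (hypothetical code, modelled on the pattern
in the testcase included below), this is the kind of sequence that
stalls and what the patch emits instead:

  struct S { unsigned long a, b; long c, d; };  /* 32 bytes, no padding */

  void consume (struct S *s);

  void
  f (struct S *dst, unsigned long a, unsigned long b, long c, long d)
  {
    struct S tmp;
    tmp.a = a;   /* Four narrow 8-byte stores...  */
    tmp.b = b;
    tmp.c = c;
    tmp.d = d;
    *dst = tmp;  /* ...followed by a copy that is currently expanded as
                    two 16-byte vector moves, each loading bytes that
                    still sit in two separate store-buffer entries.
                    With the patch, the copy becomes four 8-byte field
                    moves, which forward cleanly.  */
    consume (dst);
  }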

"Small" in this RFC patch means up to 35 bytes on x86_64 and i386 CPUs
(the structure in the benchmark has 32 bytes) but is subject to change
after more benchmarking and is actually zero - meaning element copying
never happens - on other architectures.  I believe that any
architecture with a store buffer can benefit but it's probably better
to leave it to their maintainers to find a different default value.  I
am not sure this is how such HW-dependent decisions should be done,
which is the primary reason why I am sending this RFC first.

I have decided to implement this change at the expansion level because
at that point the type information is still readily available and at
the same time we can also handle various implicit copies, for example
those made when passing parameters by value.  I found I could re-use
some bits and pieces of tree-SRA and so I did, creating the tree-sra.h
header file in the process.

I am fully aware that in the final patch the new parameter, or indeed
any new parameters, need to be documented.  I have skipped that
intentionally now and will write the documentation if feedback here is
generally good.
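
For experiments, the limit can be overridden with GCC's generic
--param mechanism, using the parameter name this patch introduces,
e.g.:

  gcc -O2 --param max-size-for-elementwise-copy=32 test.c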

I have bootstrapped and tested this patch on x86_64-linux, with
different values of the parameter and only found problems with
unreasonably high values leading to OOM.  I have done the same with a
previous version of the patch which was equivalent to the limit being
64 bytes on aarch64-linux, ppc64le-linux and ia64-linux and only ran
into failures of tests which assumed that structure padding was copied
in aggregate copies (mostly gcc.target/aarch64/aapcs64/ stuff but also
for example gcc.dg/vmx/varargs-4.c).
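
For instance, in a hypothetical example like the following, an
element-wise copy writes only the two fields and leaves the seven pad
bytes of the destination unwritten, whereas a block move copies them
too:

  struct P
  {
    char c;   /* followed by 7 bytes of padding on x86_64 */
    long l;
  };

  void
  g (struct P *dst, const struct P *src)
  {
    *dst = *src;  /* element-wise: two moves, padding untouched;
                     block move: all 16 bytes, padding included.  */
  }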

The patch decreases the SPEC 2017 "rate" run-time of imagick by 9% and
8% at -O2 and -Ofast compilation levels respectively on one particular
new AMD CPU and by 6% and 3% on one particular old Intel machine.

Thanks in advance for any comments,

Martin


2017-10-12  Martin Jambor  <mjambor@suse.cz>

	PR target/80689
	* tree-sra.h: New file.
	* ipa-prop.h: Moved declaration of build_ref_for_offset to
	tree-sra.h.
	* expr.c: Include params.h and tree-sra.h.
	(emit_move_elementwise): New function.
	(store_expr_with_bounds): Optionally use it.
	* ipa-cp.c: Include tree-sra.h.
	* params.def (PARAM_MAX_SIZE_FOR_ELEMENTWISE_COPY): New.
	* config/i386/i386.c (ix86_option_override_internal): Set
	PARAM_MAX_SIZE_FOR_ELEMENTWISE_COPY to 35.
	* tree-sra.c: Include tree-sra.h.
	(scalarizable_type_p): Renamed to
	simple_mix_of_records_and_arrays_p, made public, renamed the
	second parameter to allow_char_arrays.
	(extract_min_max_idx_from_array): New function.
	(completely_scalarize): Moved bits of the function to
	extract_min_max_idx_from_array.

	testsuite/
	* gcc.target/i386/pr80689-1.c: New test.
---
 gcc/config/i386/i386.c                    |   4 ++
 gcc/expr.c                                | 103 ++++++++++++++++++++++++++++--
 gcc/ipa-cp.c                              |   1 +
 gcc/ipa-prop.h                            |   4 --
 gcc/params.def                            |   6 ++
 gcc/testsuite/gcc.target/i386/pr80689-1.c |  38 +++++++++++
 gcc/tree-sra.c                            |  86 +++++++++++++++----------
 gcc/tree-sra.h                            |  33 ++++++++++
 8 files changed, 233 insertions(+), 42 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/i386/pr80689-1.c
 create mode 100644 gcc/tree-sra.h

Comments

Richard Biener Oct. 17, 2017, 11:34 a.m. UTC | #1
On Fri, Oct 13, 2017 at 6:13 PM, Martin Jambor <mjambor@suse.cz> wrote:
> Hi,
>
> I'd like to request comments on the patch below, which aims to fix PR
> 80689, an instance of a store-to-load forwarding stall on x86_64 CPUs
> in the Image Magick benchmark that is responsible for a slowdown of up
> to 9% compared to gcc 6, depending on options and HW used.  (Actually,
> I have just seen 24% in one specific combination but for various
> reasons can no longer verify it today.)
>
> The revision causing the regression is 237074, which increased the
> size of the mode for copying aggregates "by pieces" to 128 bits,
> incurring big stalls when the values being copied are also still being
> stored in a smaller data type or if the copied values are loaded in
> smaller types shortly afterwards.  Such situations happen in Image
> Magick even across calls, which means that any non-IPA flow-sensitive
> approach would not detect them.  Therefore, the patch changes the way
> we copy small BLKmode data that are simple combinations of records and
> arrays (meaning unions and bit-fields are disallowed, and so are
> character arrays) and simply copies them one field and/or element at a
> time.
>
> "Small" in this RFC patch means up to 35 bytes on x86_64 and i386 CPUs
> (the structure in the benchmark has 32 bytes) but is subject to change
> after more benchmarking and is actually zero - meaning element copying
> never happens - on other architectures.  I believe that any
> architecture with a store buffer can benefit but it's probably better
> to leave it to their maintainers to find a different default value.  I
> am not sure this is how such HW-dependent decisions should be done,
> which is the primary reason why I am sending this RFC first.
>
> I have decided to implement this change at the expansion level because
> at that point the type information is still readily available and at
> the same time we can also handle various implicit copies, for example
> those made when passing parameters by value.  I found I could re-use
> some bits and pieces of tree-SRA and so I did, creating the tree-sra.h
> header file in the process.
>
> I am fully aware that in the final patch the new parameter, or indeed
> any new parameters, need to be documented.  I have skipped that
> intentionally now and will write the documentation if feedback here is
> generally good.
>
> I have bootstrapped and tested this patch on x86_64-linux, with
> different values of the parameter and only found problems with
> unreasonably high values leading to OOM.  I have done the same with a
> previous version of the patch which was equivalent to the limit being
> 64 bytes on aarch64-linux, ppc64le-linux and ia64-linux and only ran
> into failures of tests which assumed that structure padding was copied
> in aggregate copies (mostly gcc.target/aarch64/aapcs64/ stuff but also
> for example gcc.dg/vmx/varargs-4.c).
>
> The patch decreases the SPEC 2017 "rate" run-time of imagick by 9% and
> 8% at -O2 and -Ofast compilation levels respectively on one particular
> new AMD CPU and by 6% and 3% on one particular old Intel machine.
>
> Thanks in advance for any comments,

I wonder if you can, at the place you choose to hook this in, elide
any copying of padding between fields.

I'd rather have hooked such "high level" optimization in expand_assignment
where you can be reasonably sure you're seeing an actual source-level construct.

35 bytes seems to be a lot - what is the code-size impact?

IIRC the reason this may be slow isn't loading in smaller types than stored
before by the copy - the store buffers can handle this reasonably well.  It's
solely when previous smaller stores are

  a1) not mergeable in the store buffer
  a2) not merged because earlier stores are already committed

and

  b) loaded afterwards as a type that would access multiple store buffers

a) would be sure to happen in case b) involves accessing padding.  Is the
Image Magick case one that involves padding in the structure?
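
A hypothetical example of b) touching padding:

  struct T { int i; /* 4 bytes of padding */ long l; } x, y;

  x.i = 1;   /* 4-byte store */
  x.l = 2;   /* 8-byte store */
  y = x;     /* a single 16-byte load would cover both stores plus the
                unwritten pad bytes, so it cannot be satisfied from one
                store-buffer entry */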

Richard.

> Martin
>
>
> 2017-10-12  Martin Jambor  <mjambor@suse.cz>
>
>         PR target/80689
>         * tree-sra.h: New file.
>         * ipa-prop.h: Moved declaration of build_ref_for_offset to
>         tree-sra.h.
>         * expr.c: Include params.h and tree-sra.h.
>         (emit_move_elementwise): New function.
>         (store_expr_with_bounds): Optionally use it.
>         * ipa-cp.c: Include tree-sra.h.
>         * params.def (PARAM_MAX_SIZE_FOR_ELEMENTWISE_COPY): New.
>         * config/i386/i386.c (ix86_option_override_internal): Set
>         PARAM_MAX_SIZE_FOR_ELEMENTWISE_COPY to 35.
>         * tree-sra.c: Include tree-sra.h.
>         (scalarizable_type_p): Renamed to
>         simple_mix_of_records_and_arrays_p, made public, renamed the
>         second parameter to allow_char_arrays.
>         (extract_min_max_idx_from_array): New function.
>         (completely_scalarize): Moved bits of the function to
>         extract_min_max_idx_from_array.
>
>         testsuite/
>         * gcc.target/i386/pr80689-1.c: New test.
> ---
>  gcc/config/i386/i386.c                    |   4 ++
>  gcc/expr.c                                | 103 ++++++++++++++++++++++++++++--
>  gcc/ipa-cp.c                              |   1 +
>  gcc/ipa-prop.h                            |   4 --
>  gcc/params.def                            |   6 ++
>  gcc/testsuite/gcc.target/i386/pr80689-1.c |  38 +++++++++++
>  gcc/tree-sra.c                            |  86 +++++++++++++++----------
>  gcc/tree-sra.h                            |  33 ++++++++++
>  8 files changed, 233 insertions(+), 42 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr80689-1.c
>  create mode 100644 gcc/tree-sra.h
>
> diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
> index 1ee8351c21f..87f602e7ead 100644
> --- a/gcc/config/i386/i386.c
> +++ b/gcc/config/i386/i386.c
> @@ -6511,6 +6511,10 @@ ix86_option_override_internal (bool main_args_p,
>                          ix86_tune_cost->l2_cache_size,
>                          opts->x_param_values,
>                          opts_set->x_param_values);
> +  maybe_set_param_value (PARAM_MAX_SIZE_FOR_ELEMENTWISE_COPY,
> +                        35,
> +                        opts->x_param_values,
> +                        opts_set->x_param_values);
>
>    /* Enable sw prefetching at -O3 for CPUS that prefetching is helpful.  */
>    if (opts->x_flag_prefetch_loop_arrays < 0
> diff --git a/gcc/expr.c b/gcc/expr.c
> index 134ee731c29..dff24e7f166 100644
> --- a/gcc/expr.c
> +++ b/gcc/expr.c
> @@ -61,7 +61,8 @@ along with GCC; see the file COPYING3.  If not see
>  #include "tree-chkp.h"
>  #include "rtl-chkp.h"
>  #include "ccmp.h"
> -
> +#include "params.h"
> +#include "tree-sra.h"
>
>  /* If this is nonzero, we do not bother generating VOLATILE
>     around volatile memory references, and we are willing to
> @@ -5340,6 +5341,80 @@ emit_storent_insn (rtx to, rtx from)
>    return maybe_expand_insn (code, 2, ops);
>  }
>
> +/* Generate code for copying data of type TYPE at SOURCE plus OFFSET to TARGET
> +   plus OFFSET, but do so element-wise and/or field-wise for each record and
> +   array within TYPE.  TYPE must either be a register type or an aggregate
> +   complying with scalarizable_type_p.
> +
> +   If CALL_PARAM_P is nonzero, this is a store into a call param on the
> +   stack, and block moves may need to be treated specially.  */
> +
> +static void
> +emit_move_elementwise (tree type, rtx target, rtx source, HOST_WIDE_INT offset,
> +                      int call_param_p)
> +{
> +  switch (TREE_CODE (type))
> +    {
> +    case RECORD_TYPE:
> +      for (tree fld = TYPE_FIELDS (type); fld; fld = DECL_CHAIN (fld))
> +       if (TREE_CODE (fld) == FIELD_DECL)
> +         {
> +           HOST_WIDE_INT fld_offset = offset + int_bit_position (fld);
> +           tree ft = TREE_TYPE (fld);
> +           emit_move_elementwise (ft, target, source, fld_offset,
> +                                  call_param_p);
> +         }
> +      break;
> +
> +    case ARRAY_TYPE:
> +      {
> +       tree elem_type = TREE_TYPE (type);
> +       HOST_WIDE_INT el_size = tree_to_shwi (TYPE_SIZE (elem_type));
> +       gcc_assert (el_size > 0);
> +
> +       offset_int idx, max;
> +       /* Skip (some) zero-length arrays; others have MAXIDX == MINIDX - 1.  */
> +       if (extract_min_max_idx_from_array (type, &idx, &max))
> +         {
> +           HOST_WIDE_INT el_offset = offset;
> +           for (; idx <= max; ++idx)
> +             {
> +               emit_move_elementwise (elem_type, target, source, el_offset,
> +                                      call_param_p);
> +               el_offset += el_size;
> +             }
> +         }
> +      }
> +      break;
> +    default:
> +      machine_mode mode = TYPE_MODE (type);
> +
> +      rtx ntgt = adjust_address (target, mode, offset / BITS_PER_UNIT);
> +      rtx nsrc = adjust_address (source, mode, offset / BITS_PER_UNIT);
> +
> +      /* TODO: Figure out whether the following is actually necessary.  */
> +      if (target == ntgt)
> +       ntgt = copy_rtx (target);
> +      if (source == nsrc)
> +       nsrc = copy_rtx (source);
> +
> +      gcc_assert (mode != VOIDmode);
> +      if (mode != BLKmode)
> +       emit_move_insn (ntgt, nsrc);
> +      else
> +       {
> +         /* For example vector gimple registers can end up here.  */
> +         rtx size = expand_expr (TYPE_SIZE_UNIT (type), NULL_RTX,
> +                                 TYPE_MODE (sizetype), EXPAND_NORMAL);
> +         emit_block_move (ntgt, nsrc, size,
> +                          (call_param_p
> +                           ? BLOCK_OP_CALL_PARM : BLOCK_OP_NORMAL));
> +       }
> +      break;
> +    }
> +  return;
> +}
> +
>  /* Generate code for computing expression EXP,
>     and storing the value into TARGET.
>
> @@ -5713,9 +5788,29 @@ store_expr_with_bounds (tree exp, rtx target, int call_param_p,
>         emit_group_store (target, temp, TREE_TYPE (exp),
>                           int_size_in_bytes (TREE_TYPE (exp)));
>        else if (GET_MODE (temp) == BLKmode)
> -       emit_block_move (target, temp, expr_size (exp),
> -                        (call_param_p
> -                         ? BLOCK_OP_CALL_PARM : BLOCK_OP_NORMAL));
> +       {
> +         /* Copying smallish BLKmode structures with emit_block_move and thus
> +            by-pieces can result in store-to-load stalls.  So copy some simple
> +            small aggregates element or field-wise.  */
> +         if (GET_MODE (target) == BLKmode
> +             && AGGREGATE_TYPE_P (TREE_TYPE (exp))
> +             && !TREE_ADDRESSABLE (TREE_TYPE (exp))
> +             && tree_fits_shwi_p (TYPE_SIZE (TREE_TYPE (exp)))
> +             && (tree_to_shwi (TYPE_SIZE (TREE_TYPE (exp)))
> +                 <= (PARAM_VALUE (PARAM_MAX_SIZE_FOR_ELEMENTWISE_COPY)
> +                     * BITS_PER_UNIT))
> +             && simple_mix_of_records_and_arrays_p (TREE_TYPE (exp), false))
> +           {
> +             /* FIXME:  Can this happen?  What would it mean?  */
> +             gcc_assert (!reverse);
> +             emit_move_elementwise (TREE_TYPE (exp), target, temp, 0,
> +                                    call_param_p);
> +           }
> +         else
> +           emit_block_move (target, temp, expr_size (exp),
> +                            (call_param_p
> +                             ? BLOCK_OP_CALL_PARM : BLOCK_OP_NORMAL));
> +       }
>        /* If we emit a nontemporal store, there is nothing else to do.  */
>        else if (nontemporal && emit_storent_insn (target, temp))
>         ;
> diff --git a/gcc/ipa-cp.c b/gcc/ipa-cp.c
> index 6b3d8d7364c..7d6019bbd30 100644
> --- a/gcc/ipa-cp.c
> +++ b/gcc/ipa-cp.c
> @@ -124,6 +124,7 @@ along with GCC; see the file COPYING3.  If not see
>  #include "tree-ssa-ccp.h"
>  #include "stringpool.h"
>  #include "attribs.h"
> +#include "tree-sra.h"
>
>  template <typename valtype> class ipcp_value;
>
> diff --git a/gcc/ipa-prop.h b/gcc/ipa-prop.h
> index fa5bed49ee0..2313cc884ed 100644
> --- a/gcc/ipa-prop.h
> +++ b/gcc/ipa-prop.h
> @@ -877,10 +877,6 @@ ipa_parm_adjustment *ipa_get_adjustment_candidate (tree **, bool *,
>  void ipa_release_body_info (struct ipa_func_body_info *);
>  tree ipa_get_callee_param_type (struct cgraph_edge *e, int i);
>
> -/* From tree-sra.c:  */
> -tree build_ref_for_offset (location_t, tree, HOST_WIDE_INT, bool, tree,
> -                          gimple_stmt_iterator *, bool);
> -
>  /* In ipa-cp.c  */
>  void ipa_cp_c_finalize (void);
>
> diff --git a/gcc/params.def b/gcc/params.def
> index e55afc28053..5e19f1414a0 100644
> --- a/gcc/params.def
> +++ b/gcc/params.def
> @@ -1294,6 +1294,12 @@ DEFPARAM (PARAM_VECT_EPILOGUES_NOMASK,
>           "Enable loop epilogue vectorization using smaller vector size.",
>           0, 0, 1)
>
> +DEFPARAM (PARAM_MAX_SIZE_FOR_ELEMENTWISE_COPY,
> +         "max-size-for-elementwise-copy",
> +         "Maximum size in bytes of a structure or array to be considered for "
> +         "copying by its individual fields or elements",
> +         0, 0, 512)
> +
>  /*
>
>  Local variables:
> diff --git a/gcc/testsuite/gcc.target/i386/pr80689-1.c b/gcc/testsuite/gcc.target/i386/pr80689-1.c
> new file mode 100644
> index 00000000000..4156d4fba45
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr80689-1.c
> @@ -0,0 +1,38 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2" } */
> +
> +typedef struct st1
> +{
> +        long unsigned int a,b;
> +        long int c,d;
> +}R;
> +
> +typedef struct st2
> +{
> +        int  t;
> +        R  reg;
> +}N;
> +
> +void Set (const R *region,  N *n_info );
> +
> +void test(N  *n_obj ,const long unsigned int a, const long unsigned int b,  const long int c,const long int d)
> +{
> +        R reg;
> +
> +        reg.a=a;
> +        reg.b=b;
> +        reg.c=c;
> +        reg.d=d;
> +        Set (&reg, n_obj);
> +
> +}
> +
> +void Set (const R *reg,  N *n_obj )
> +{
> +        n_obj->reg=(*reg);
> +}
> +
> +
> +/* { dg-final { scan-assembler-not "%(x|y|z)mm\[0-9\]+" } } */
> +/* { dg-final { scan-assembler-not "movdqu" } } */
> +/* { dg-final { scan-assembler-not "movups" } } */
> diff --git a/gcc/tree-sra.c b/gcc/tree-sra.c
> index bac593951e7..ade97964205 100644
> --- a/gcc/tree-sra.c
> +++ b/gcc/tree-sra.c
> @@ -104,6 +104,7 @@ along with GCC; see the file COPYING3.  If not see
>  #include "ipa-fnsummary.h"
>  #include "ipa-utils.h"
>  #include "builtins.h"
> +#include "tree-sra.h"
>
>  /* Enumeration of all aggregate reductions we can do.  */
>  enum sra_mode { SRA_MODE_EARLY_IPA,   /* early call regularization */
> @@ -952,14 +953,14 @@ create_access (tree expr, gimple *stmt, bool write)
>  }
>
>
> -/* Return true iff TYPE is scalarizable - i.e. a RECORD_TYPE or fixed-length
> -   ARRAY_TYPE with fields that are either of gimple register types (excluding
> -   bit-fields) or (recursively) scalarizable types.  CONST_DECL must be true if
> -   we are considering a decl from constant pool.  If it is false, char arrays
> -   will be refused.  */
> +/* Return true if TYPE consists of RECORD_TYPE or fixed-length ARRAY_TYPE with
> +   fields/elements that are not bit-fields and are either register types or
> +   recursively comply with simple_mix_of_records_and_arrays_p.  Furthermore, if
> +   ALLOW_CHAR_ARRAYS is false, the function will also return false if TYPE
> +   contains an array of elements that only have one byte.  */
>
> -static bool
> -scalarizable_type_p (tree type, bool const_decl)
> +bool
> +simple_mix_of_records_and_arrays_p (tree type, bool allow_char_arrays)
>  {
>    gcc_assert (!is_gimple_reg_type (type));
>    if (type_contains_placeholder_p (type))
> @@ -977,7 +978,7 @@ scalarizable_type_p (tree type, bool const_decl)
>             return false;
>
>           if (!is_gimple_reg_type (ft)
> -             && !scalarizable_type_p (ft, const_decl))
> +             && !simple_mix_of_records_and_arrays_p (ft, allow_char_arrays))
>             return false;
>         }
>
> @@ -986,7 +987,7 @@ scalarizable_type_p (tree type, bool const_decl)
>    case ARRAY_TYPE:
>      {
>        HOST_WIDE_INT min_elem_size;
> -      if (const_decl)
> +      if (allow_char_arrays)
>         min_elem_size = 0;
>        else
>         min_elem_size = BITS_PER_UNIT;
> @@ -1008,7 +1009,7 @@ scalarizable_type_p (tree type, bool const_decl)
>
>        tree elem = TREE_TYPE (type);
>        if (!is_gimple_reg_type (elem)
> -         && !scalarizable_type_p (elem, const_decl))
> +         && !simple_mix_of_records_and_arrays_p (elem, allow_char_arrays))
>         return false;
>        return true;
>      }
> @@ -1017,10 +1018,38 @@ scalarizable_type_p (tree type, bool const_decl)
>    }
>  }
>
> -static void scalarize_elem (tree, HOST_WIDE_INT, HOST_WIDE_INT, bool, tree, tree);
> +static void scalarize_elem (tree, HOST_WIDE_INT, HOST_WIDE_INT, bool, tree,
> +                           tree);
> +
> +/* For a given array TYPE, return false if its domain does not have any maximum
> +   value.  Otherwise calculate MIN and MAX indices of the first and the last
> +   element.  */
> +
> +bool
> +extract_min_max_idx_from_array (tree type, offset_int *min, offset_int *max)
> +{
> +  tree domain = TYPE_DOMAIN (type);
> +  tree minidx = TYPE_MIN_VALUE (domain);
> +  gcc_assert (TREE_CODE (minidx) == INTEGER_CST);
> +  tree maxidx = TYPE_MAX_VALUE (domain);
> +  if (!maxidx)
> +    return false;
> +  gcc_assert (TREE_CODE (maxidx) == INTEGER_CST);
> +
> +  /* MINIDX and MAXIDX are inclusive, and must be interpreted in
> +     DOMAIN (e.g. signed int, whereas min/max may be size_int).  */
> +  *min = wi::to_offset (minidx);
> +  *max = wi::to_offset (maxidx);
> +  if (!TYPE_UNSIGNED (domain))
> +    {
> +      *min = wi::sext (*min, TYPE_PRECISION (domain));
> +      *max = wi::sext (*max, TYPE_PRECISION (domain));
> +    }
> +  return true;
> +}
>
>  /* Create total_scalarization accesses for all scalar fields of a member
> -   of type DECL_TYPE conforming to scalarizable_type_p.  BASE
> +   of type DECL_TYPE conforming to simple_mix_of_records_and_arrays_p.  BASE
>     must be the top-most VAR_DECL representing the variable; within that,
>     OFFSET locates the member and REF must be the memory reference expression for
>     the member.  */
> @@ -1047,27 +1076,14 @@ completely_scalarize (tree base, tree decl_type, HOST_WIDE_INT offset, tree ref)
>        {
>         tree elemtype = TREE_TYPE (decl_type);
>         tree elem_size = TYPE_SIZE (elemtype);
> -       gcc_assert (elem_size && tree_fits_shwi_p (elem_size));
>         HOST_WIDE_INT el_size = tree_to_shwi (elem_size);
>         gcc_assert (el_size > 0);
>
> -       tree minidx = TYPE_MIN_VALUE (TYPE_DOMAIN (decl_type));
> -       gcc_assert (TREE_CODE (minidx) == INTEGER_CST);
> -       tree maxidx = TYPE_MAX_VALUE (TYPE_DOMAIN (decl_type));
> +       offset_int idx, max;
>         /* Skip (some) zero-length arrays; others have MAXIDX == MINIDX - 1.  */
> -       if (maxidx)
> +       if (extract_min_max_idx_from_array (decl_type, &idx, &max))
>           {
> -           gcc_assert (TREE_CODE (maxidx) == INTEGER_CST);
>             tree domain = TYPE_DOMAIN (decl_type);
> -           /* MINIDX and MAXIDX are inclusive, and must be interpreted in
> -              DOMAIN (e.g. signed int, whereas min/max may be size_int).  */
> -           offset_int idx = wi::to_offset (minidx);
> -           offset_int max = wi::to_offset (maxidx);
> -           if (!TYPE_UNSIGNED (domain))
> -             {
> -               idx = wi::sext (idx, TYPE_PRECISION (domain));
> -               max = wi::sext (max, TYPE_PRECISION (domain));
> -             }
>             for (int el_off = offset; idx <= max; ++idx)
>               {
>                 tree nref = build4 (ARRAY_REF, elemtype,
> @@ -1088,10 +1104,10 @@ completely_scalarize (tree base, tree decl_type, HOST_WIDE_INT offset, tree ref)
>  }
>
>  /* Create total_scalarization accesses for a member of type TYPE, which must
> -   satisfy either is_gimple_reg_type or scalarizable_type_p.  BASE must be the
> -   top-most VAR_DECL representing the variable; within that, POS and SIZE locate
> -   the member, REVERSE gives its torage order. and REF must be the reference
> -   expression for it.  */
> +   satisfy either is_gimple_reg_type or simple_mix_of_records_and_arrays_p.
> +   BASE must be the top-most VAR_DECL representing the variable; within that,
> +   POS and SIZE locate the member, REVERSE gives its storage order, and REF must
> +   be the reference expression for it.  */
>
>  static void
>  scalarize_elem (tree base, HOST_WIDE_INT pos, HOST_WIDE_INT size, bool reverse,
> @@ -1111,7 +1127,8 @@ scalarize_elem (tree base, HOST_WIDE_INT pos, HOST_WIDE_INT size, bool reverse,
>  }
>
>  /* Create a total_scalarization access for VAR as a whole.  VAR must be of a
> -   RECORD_TYPE or ARRAY_TYPE conforming to scalarizable_type_p.  */
> +   RECORD_TYPE or ARRAY_TYPE conforming to
> +   simple_mix_of_records_and_arrays_p.  */
>
>  static void
>  create_total_scalarization_access (tree var)
> @@ -2803,8 +2820,9 @@ analyze_all_variable_accesses (void)
>        {
>         tree var = candidate (i);
>
> -       if (VAR_P (var) && scalarizable_type_p (TREE_TYPE (var),
> -                                               constant_decl_p (var)))
> +       if (VAR_P (var)
> +           && simple_mix_of_records_and_arrays_p (TREE_TYPE (var),
> +                                                  constant_decl_p (var)))
>           {
>             if (tree_to_uhwi (TYPE_SIZE (TREE_TYPE (var)))
>                 <= max_scalarization_size)
> diff --git a/gcc/tree-sra.h b/gcc/tree-sra.h
> new file mode 100644
> index 00000000000..dc901385994
> --- /dev/null
> +++ b/gcc/tree-sra.h
> @@ -0,0 +1,33 @@
> +/* tree-sra.h - Run-time parameters.
> +   Copyright (C) 2017 Free Software Foundation, Inc.
> +
> +This file is part of GCC.
> +
> +GCC is free software; you can redistribute it and/or modify it under
> +the terms of the GNU General Public License as published by the Free
> +Software Foundation; either version 3, or (at your option) any later
> +version.
> +
> +GCC is distributed in the hope that it will be useful, but WITHOUT ANY
> +WARRANTY; without even the implied warranty of MERCHANTABILITY or
> +FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
> +for more details.
> +
> +You should have received a copy of the GNU General Public License
> +along with GCC; see the file COPYING3.  If not see
> +<http://www.gnu.org/licenses/>.  */
> +
> +#ifndef TREE_SRA_H
> +#define TREE_SRA_H
> +
> +
> +bool simple_mix_of_records_and_arrays_p (tree type, bool allow_char_arrays);
> +bool extract_min_max_idx_from_array (tree type, offset_int *idx,
> +                                    offset_int *max);
> +tree build_ref_for_offset (location_t loc, tree base, HOST_WIDE_INT offset,
> +                          bool reverse, tree exp_type,
> +                          gimple_stmt_iterator *gsi, bool insert_after);
> +
> +
> +
> +#endif /* TREE_SRA_H */
> --
> 2.14.1
>
Martin Jambor Oct. 26, 2017, 12:18 p.m. UTC | #2
Hi,

On Tue, Oct 17, 2017 at 01:34:54PM +0200, Richard Biener wrote:
> On Fri, Oct 13, 2017 at 6:13 PM, Martin Jambor <mjambor@suse.cz> wrote:
> > [...]
> 
> I wonder if you can, at the place you choose to hook this in, elide
> any copying of padding between fields.
> 
> I'd rather have hooked such "high level" optimization in
> expand_assignment where you can be reasonably sure you're seeing an
> actual source-level construct.

I have discussed this with Honza and we eventually decided to make the
element-wise copy an alternative to emit_block_move (which uses the
larger mode for moving since GCC 7) exactly so that we handle not only
source-level assignments but also passing parameters by value and
other situations.

> 
> 35 bytes seems to be a lot - what is the code-size impact?

I will find out and report on that.  I need at least 32 bytes (four
long ints) to fix imagemagick, where the problematic structure is:

  typedef struct _RectangleInfo
  {
    size_t
      width,
      height;
  
    ssize_t
      x,
      y;
  } RectangleInfo;

...so four longs, no padding.  Since any aggregate of between 33 and
35 bytes needs to consist of smaller fields/elements, it seemed
reasonable to also copy those element-wise.

Nevertheless, I still intend to experiment with the limit; I sent out
this RFC exactly so that I don't spend a lot of time benchmarking
something that is eventually not deemed acceptable on principle.

> 
> IIRC the reason this may be slow isn't loading in smaller types than stored
> before by the copy - the store buffers can handle this reasonably well.  It's
> solely when previous smaller stores are
> 
>   a1) not mergeabe in the store buffer
>   a2) not merged because earlier stores are already committed
> 
> and
> 
>   b) loaded afterwards as a type that would access multiple store buffers
> 
> a) would be sure to happen in case b) involves accessing padding.  Is the
> Image Magick case one that involves padding in the structure?

As I said above, there is no padding.

Basically, what happens is that in a number of places, there is a
variable region of the aforementioned type and it is initialized and
passed to function SetPixelCacheNexusPixels in the following manner:

    ...
    region.width=cache_info->columns;
    region.height=1;
    region.x=0;
    region.y=y;
    pixels=SetPixelCacheNexusPixels(cache_info,ReadMode,&region,MagickTrue,
      cache_nexus[id],exception);
    ...

and the first four statements in SetPixelCacheNexusPixels are:

  assert(cache_info != (const CacheInfo *) NULL);
  assert(cache_info->signature == MagickSignature);
  if (cache_info->type == UndefinedCache)
    return((PixelPacket *) NULL);
  nexus_info->region=(*region);

with the last one generating the stalls, on both Zen-based machines
and also on 2-3 years old Intel CPUs.

I have had a look at what Agner Fog's micro-architecture document says
about store forwarding stalls and:

  - on Broadwells and Haswells, any "write of any size is followed by
    a read of a larger size" incurs a stall, which fits our example,
  - on Skylakes: "A read that is bigger than the write, or a read that
    covers both written and unwritten bytes, takes approximately 11
    clock cycles extra" seems to apply
  - on Intel Silvermont, there will also be a stall because "A memory
    write can be forwarded to a subsequent read of the same size or a
    smaller size..."
  - on Zens, Agner Fog says they work perfectly except when crossing a
    page or when "A read that has a partial overlap with a preceding
    write has a penalty of 6-7 clock cycles," which must be why I see
    stalls.

So I guess the pending stores are not really merged even without
padding.

Martin


Richard Biener Oct. 26, 2017, 12:43 p.m. UTC | #3
On Thu, Oct 26, 2017 at 2:18 PM, Martin Jambor <mjambor@suse.cz> wrote:
> Hi,
>
> On Tue, Oct 17, 2017 at 01:34:54PM +0200, Richard Biener wrote:
>> On Fri, Oct 13, 2017 at 6:13 PM, Martin Jambor <mjambor@suse.cz> wrote:
>> > [...]
>>
>> I wonder if you can at the place you choose to hook this in elide any
>> copying of padding between fields.
>>
>> I'd rather have hooked such "high level" optimization in
>> expand_assignment where you can be reasonably sure you're seeing an
>> actual source-level construct.
>
> I have discussed this with Honza and we eventually decided to make the
> element-wise copy an alternative to emit_block_move (which uses the
> larger mode for moving since GCC 7) exactly so that we handle not only
> source-level assignments but also passing parameters by value and
> other situations.
>
>>
>> 35 bytes seems to be a lot - what is the code-size impact?
>
> I will find out and report on that.  I need at least 32 bytes (four
> long ints) to fix imagemagick, where the problematic structure is:
>
>   typedef struct _RectangleInfo
>   {
>     size_t
>       width,
>       height;
>
>     ssize_t
>       x,
>       y;
>   } RectangleInfo;
>
> ...so four longs, no padding.  Since any aggregate of between 33 and 35
> bytes needs to consist of smaller fields/elements, it seemed
> reasonable to also copy them element-wise.
>
> Nevertheless, I still intend to experiment with the limit; I sent out
> this RFC exactly so that I don't spend a lot of time benchmarking
> something that is eventually not deemed acceptable on principle.

I think the limit should be on the number of generated copies and not
the overall size of the structure...  If the struct were composed of
32 individual chars we wouldn't want to emit 32 loads and 32 stores...

I wonder how rep; movsb interacts with store-to-load forwarding?  Is
that maybe optimized well on some archs?  movsb should always
forward, and isn't the setup cost for small N reasonable on modern
CPUs?
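
(For reference, rep; movsb is semantically just a byte-wise copy -- a
minimal C sketch of the access pattern, not anything GCC would emit:

  /* Every 1-byte load here is fully contained in whatever wider store
     wrote that byte, which is why byte-wise copying should always get
     store-to-load forwarding.  */
  static void
  byte_copy (char *dst, const char *src, unsigned long n)
  {
    for (unsigned long i = 0; i < n; i++)
      dst[i] = src[i];
  }

so the open question is really only the setup cost of the rep prefix.)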

>>
>> IIRC the reason this may be slow isn't loading in smaller types than stored
>> before by the copy - the store buffers can handle this reasonably well.  It's
>> solely when previous smaller stores are
>>
>>   a1) not mergeable in the store buffer
>>   a2) not merged because earlier stores are already committed
>>
>> and
>>
>>   b) loaded afterwards as a type that would access multiple store buffers
>>
>> a) would be sure to happen in case b) involves accessing padding.  Is the
>> Image Magick case one that involves padding in the structure?
>
> As I said above, there is no padding.
>
> Basically, what happens is that in a number of places, there is a
> variable region of the aforementioned type and it is initialized and
> passed to function SetPixelCacheNexusPixels in the following manner:
>
>     ...
>     region.width=cache_info->columns;
>     region.height=1;
>     region.x=0;
>     region.y=y;
>     pixels=SetPixelCacheNexusPixels(cache_info,ReadMode,&region,MagickTrue,
>       cache_nexus[id],exception);
>     ...
>
> and the first four statements in SetPixelCacheNexusPixels are:
>
>   assert(cache_info != (const CacheInfo *) NULL);
>   assert(cache_info->signature == MagickSignature);
>   if (cache_info->type == UndefinedCache)
>     return((PixelPacket *) NULL);
>   nexus_info->region=(*region);
>
> with the last one generating the stalls, on both Zen-based machines
> and also on 2-3 year old Intel CPUs.
>
> I have had a look at what Agner Fog's micro-architecture document says
> about store forwarding stalls and:
>
>   - on Broadwells and Haswells, any "write of any size is followed by
>     a read of a larger size" incurs a stall, which fits our example,
>   - on Skylakes: "A read that is bigger than the write, or a read that
>     covers both written and unwritten bytes, takes approximately 11
>     clock cycles extra" seems to apply
>   - on Intel Silvermont, there will also be a stall because "A memory
>     write can be forwarded to a subsequent read of the same size or a
>     smaller size..."
>   - on Zens, Agner Fog says they work perfectly except when crossing a
>     page or when "A read that has a partial overlap with a preceding
>     write has a penalty of 6-7 clock cycles," which must be why I see
>     stalls.
>
> So I guess the pending stores are not really merged even without
> padding,

It probably depends on the width of the entries in the store buffer,
whether they appear in order, and the alignment of the stores (if they are
larger than 8 bytes they are surely aligned).  IIRC CPUs used to have
store buffer entries smaller than the cache line size.

Given that load bandwidth is usually higher than store bandwidth, it
might make sense to do the store combining in our copying sequence,
like for the 8-byte entry case use sth like

  movq 0(%eax), %xmm0
  movhps 8(%eax), %xmm0 // or vpinsert
  mov[au]ps %xmm0, 0(%ebx)
...

thus do two loads per store and perform the stores in a wider
mode?
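
In intrinsics form the idea is roughly this (a sketch assuming SSE2;
the real thing would of course be emitted at the RTL level):

  #include <emmintrin.h>

  /* Two 8-byte loads combined into a single 16-byte store.  */
  static void
  copy_16_by_halves (void *dst, const void *src)
  {
    __m128i lo = _mm_loadl_epi64 ((const __m128i *) src);    /* movq */
    __m128 v = _mm_loadh_pi (_mm_castsi128_ps (lo),
                             (const __m64 *) src + 1);       /* movhps */
    _mm_storeu_ps ((float *) dst, v);                        /* movups */
  }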

As said, a general concern was that you are not copying padding.  If you
put this into an even more common place you surely will break
stuff, no?

Richard.

>
> Martin
>
>
>>
>> Richard.
>>
>> > Martin
>> >
>> >
>> > 2017-10-12  Martin Jambor  <mjambor@suse.cz>
>> >
>> >         PR target/80689
>> >         * tree-sra.h: New file.
>> >         * ipa-prop.h: Moved declaration of build_ref_for_offset to
>> >         tree-sra.h.
>> >         * expr.c: Include params.h and tree-sra.h.
>> >         (emit_move_elementwise): New function.
>> >         (store_expr_with_bounds): Optionally use it.
>> >         * ipa-cp.c: Include tree-sra.h.
>> >         * params.def (PARAM_MAX_SIZE_FOR_ELEMENTWISE_COPY): New.
>> >         * config/i386/i386.c (ix86_option_override_internal): Set
>> >         PARAM_MAX_SIZE_FOR_ELEMENTWISE_COPY to 35.
>> >         * tree-sra.c: Include tree-sra.h.
>> >         (scalarizable_type_p): Renamed to
>> >         simple_mix_of_records_and_arrays_p, made public, renamed the
>> >         second parameter to allow_char_arrays.
>> >         (extract_min_max_idx_from_array): New function.
>> >         (completely_scalarize): Moved bits of the function to
>> >         extract_min_max_idx_from_array.
>> >
>> >         testsuite/
>> >         * gcc.target/i386/pr80689-1.c: New test.
>> > ---
>> >  gcc/config/i386/i386.c                    |   4 ++
>> >  gcc/expr.c                                | 103 ++++++++++++++++++++++++++++--
>> >  gcc/ipa-cp.c                              |   1 +
>> >  gcc/ipa-prop.h                            |   4 --
>> >  gcc/params.def                            |   6 ++
>> >  gcc/testsuite/gcc.target/i386/pr80689-1.c |  38 +++++++++++
>> >  gcc/tree-sra.c                            |  86 +++++++++++++++----------
>> >  gcc/tree-sra.h                            |  33 ++++++++++
>> >  8 files changed, 233 insertions(+), 42 deletions(-)
>> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr80689-1.c
>> >  create mode 100644 gcc/tree-sra.h
>> >
>> > diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
>> > index 1ee8351c21f..87f602e7ead 100644
>> > --- a/gcc/config/i386/i386.c
>> > +++ b/gcc/config/i386/i386.c
>> > @@ -6511,6 +6511,10 @@ ix86_option_override_internal (bool main_args_p,
>> >                          ix86_tune_cost->l2_cache_size,
>> >                          opts->x_param_values,
>> >                          opts_set->x_param_values);
>> > +  maybe_set_param_value (PARAM_MAX_SIZE_FOR_ELEMENTWISE_COPY,
>> > +                        35,
>> > +                        opts->x_param_values,
>> > +                        opts_set->x_param_values);
>> >
>> >    /* Enable sw prefetching at -O3 for CPUS that prefetching is helpful.  */
>> >    if (opts->x_flag_prefetch_loop_arrays < 0
>> > diff --git a/gcc/expr.c b/gcc/expr.c
>> > index 134ee731c29..dff24e7f166 100644
>> > --- a/gcc/expr.c
>> > +++ b/gcc/expr.c
>> > @@ -61,7 +61,8 @@ along with GCC; see the file COPYING3.  If not see
>> >  #include "tree-chkp.h"
>> >  #include "rtl-chkp.h"
>> >  #include "ccmp.h"
>> > -
>> > +#include "params.h"
>> > +#include "tree-sra.h"
>> >
>> >  /* If this is nonzero, we do not bother generating VOLATILE
>> >     around volatile memory references, and we are willing to
>> > @@ -5340,6 +5341,80 @@ emit_storent_insn (rtx to, rtx from)
>> >    return maybe_expand_insn (code, 2, ops);
>> >  }
>> >
>> > +/* Generate code for copying data of type TYPE at SOURCE plus OFFSET to TARGET
>> > +   plus OFFSET, but do so element-wise and/or field-wise for each record and
>> > +   array within TYPE.  TYPE must either be a register type or an aggregate
>> > +   complying with simple_mix_of_records_and_arrays_p.
>> > +
>> > +   If CALL_PARAM_P is nonzero, this is a store into a call param on the
>> > +   stack, and block moves may need to be treated specially.  */
>> > +
>> > +static void
>> > +emit_move_elementwise (tree type, rtx target, rtx source, HOST_WIDE_INT offset,
>> > +                      int call_param_p)
>> > +{
>> > +  switch (TREE_CODE (type))
>> > +    {
>> > +    case RECORD_TYPE:
>> > +      for (tree fld = TYPE_FIELDS (type); fld; fld = DECL_CHAIN (fld))
>> > +       if (TREE_CODE (fld) == FIELD_DECL)
>> > +         {
>> > +           HOST_WIDE_INT fld_offset = offset + int_bit_position (fld);
>> > +           tree ft = TREE_TYPE (fld);
>> > +           emit_move_elementwise (ft, target, source, fld_offset,
>> > +                                  call_param_p);
>> > +         }
>> > +      break;
>> > +
>> > +    case ARRAY_TYPE:
>> > +      {
>> > +       tree elem_type = TREE_TYPE (type);
>> > +       HOST_WIDE_INT el_size = tree_to_shwi (TYPE_SIZE (elem_type));
>> > +       gcc_assert (el_size > 0);
>> > +
>> > +       offset_int idx, max;
>> > +       /* Skip (some) zero-length arrays; others have MAXIDX == MINIDX - 1.  */
>> > +       if (extract_min_max_idx_from_array (type, &idx, &max))
>> > +         {
>> > +           HOST_WIDE_INT el_offset = offset;
>> > +           for (; idx <= max; ++idx)
>> > +             {
>> > +               emit_move_elementwise (elem_type, target, source, el_offset,
>> > +                                      call_param_p);
>> > +               el_offset += el_size;
>> > +             }
>> > +         }
>> > +      }
>> > +      break;
>> > +    default:
>> > +      machine_mode mode = TYPE_MODE (type);
>> > +
>> > +      rtx ntgt = adjust_address (target, mode, offset / BITS_PER_UNIT);
>> > +      rtx nsrc = adjust_address (source, mode, offset / BITS_PER_UNIT);
>> > +
>> > +      /* TODO: Figure out whether the following is actually necessary.  */
>> > +      if (target == ntgt)
>> > +       ntgt = copy_rtx (target);
>> > +      if (source == nsrc)
>> > +       nsrc = copy_rtx (source);
>> > +
>> > +      gcc_assert (mode != VOIDmode);
>> > +      if (mode != BLKmode)
>> > +       emit_move_insn (ntgt, nsrc);
>> > +      else
>> > +       {
>> > +         /* For example vector gimple registers can end up here.  */
>> > +         rtx size = expand_expr (TYPE_SIZE_UNIT (type), NULL_RTX,
>> > +                                 TYPE_MODE (sizetype), EXPAND_NORMAL);
>> > +         emit_block_move (ntgt, nsrc, size,
>> > +                          (call_param_p
>> > +                           ? BLOCK_OP_CALL_PARM : BLOCK_OP_NORMAL));
>> > +       }
>> > +      break;
>> > +    }
>> > +  return;
>> > +}
>> > +
>> >  /* Generate code for computing expression EXP,
>> >     and storing the value into TARGET.
>> >
>> > @@ -5713,9 +5788,29 @@ store_expr_with_bounds (tree exp, rtx target, int call_param_p,
>> >         emit_group_store (target, temp, TREE_TYPE (exp),
>> >                           int_size_in_bytes (TREE_TYPE (exp)));
>> >        else if (GET_MODE (temp) == BLKmode)
>> > -       emit_block_move (target, temp, expr_size (exp),
>> > -                        (call_param_p
>> > -                         ? BLOCK_OP_CALL_PARM : BLOCK_OP_NORMAL));
>> > +       {
>> > +         /* Copying smallish BLKmode structures with emit_block_move and thus
>> > +            by-pieces can result in store-to-load stalls.  So copy some simple
>> > +            small aggregates element or field-wise.  */
>> > +         if (GET_MODE (target) == BLKmode
>> > +             && AGGREGATE_TYPE_P (TREE_TYPE (exp))
>> > +             && !TREE_ADDRESSABLE (TREE_TYPE (exp))
>> > +             && tree_fits_shwi_p (TYPE_SIZE (TREE_TYPE (exp)))
>> > +             && (tree_to_shwi (TYPE_SIZE (TREE_TYPE (exp)))
>> > +                 <= (PARAM_VALUE (PARAM_MAX_SIZE_FOR_ELEMENTWISE_COPY)
>> > +                     * BITS_PER_UNIT))
>> > +             && simple_mix_of_records_and_arrays_p (TREE_TYPE (exp), false))
>> > +           {
>> > +             /* FIXME:  Can this happen?  What would it mean?  */
>> > +             gcc_assert (!reverse);
>> > +             emit_move_elementwise (TREE_TYPE (exp), target, temp, 0,
>> > +                                    call_param_p);
>> > +           }
>> > +         else
>> > +           emit_block_move (target, temp, expr_size (exp),
>> > +                            (call_param_p
>> > +                             ? BLOCK_OP_CALL_PARM : BLOCK_OP_NORMAL));
>> > +       }
>> >        /* If we emit a nontemporal store, there is nothing else to do.  */
>> >        else if (nontemporal && emit_storent_insn (target, temp))
>> >         ;
>> > diff --git a/gcc/ipa-cp.c b/gcc/ipa-cp.c
>> > index 6b3d8d7364c..7d6019bbd30 100644
>> > --- a/gcc/ipa-cp.c
>> > +++ b/gcc/ipa-cp.c
>> > @@ -124,6 +124,7 @@ along with GCC; see the file COPYING3.  If not see
>> >  #include "tree-ssa-ccp.h"
>> >  #include "stringpool.h"
>> >  #include "attribs.h"
>> > +#include "tree-sra.h"
>> >
>> >  template <typename valtype> class ipcp_value;
>> >
>> > diff --git a/gcc/ipa-prop.h b/gcc/ipa-prop.h
>> > index fa5bed49ee0..2313cc884ed 100644
>> > --- a/gcc/ipa-prop.h
>> > +++ b/gcc/ipa-prop.h
>> > @@ -877,10 +877,6 @@ ipa_parm_adjustment *ipa_get_adjustment_candidate (tree **, bool *,
>> >  void ipa_release_body_info (struct ipa_func_body_info *);
>> >  tree ipa_get_callee_param_type (struct cgraph_edge *e, int i);
>> >
>> > -/* From tree-sra.c:  */
>> > -tree build_ref_for_offset (location_t, tree, HOST_WIDE_INT, bool, tree,
>> > -                          gimple_stmt_iterator *, bool);
>> > -
>> >  /* In ipa-cp.c  */
>> >  void ipa_cp_c_finalize (void);
>> >
>> > diff --git a/gcc/params.def b/gcc/params.def
>> > index e55afc28053..5e19f1414a0 100644
>> > --- a/gcc/params.def
>> > +++ b/gcc/params.def
>> > @@ -1294,6 +1294,12 @@ DEFPARAM (PARAM_VECT_EPILOGUES_NOMASK,
>> >           "Enable loop epilogue vectorization using smaller vector size.",
>> >           0, 0, 1)
>> >
>> > +DEFPARAM (PARAM_MAX_SIZE_FOR_ELEMENTWISE_COPY,
>> > +         "max-size-for-elementwise-copy",
>> > +         "Maximum size in bytes of a structure or array to by considered for "
>> > +         "copying by its individual fields or elements",
>> > +         0, 0, 512)
>> > +
>> >  /*
>> >
>> >  Local variables:
>> > diff --git a/gcc/testsuite/gcc.target/i386/pr80689-1.c b/gcc/testsuite/gcc.target/i386/pr80689-1.c
>> > new file mode 100644
>> > index 00000000000..4156d4fba45
>> > --- /dev/null
>> > +++ b/gcc/testsuite/gcc.target/i386/pr80689-1.c
>> > @@ -0,0 +1,38 @@
>> > +/* { dg-do compile } */
>> > +/* { dg-options "-O2" } */
>> > +
>> > +typedef struct st1
>> > +{
>> > +        long unsigned int a,b;
>> > +        long int c,d;
>> > +}R;
>> > +
>> > +typedef struct st2
>> > +{
>> > +        int  t;
>> > +        R  reg;
>> > +}N;
>> > +
>> > +void Set (const R *region,  N *n_info );
>> > +
>> > +void test(N  *n_obj ,const long unsigned int a, const long unsigned int b,  const long int c,const long int d)
>> > +{
>> > +        R reg;
>> > +
>> > +        reg.a=a;
>> > +        reg.b=b;
>> > +        reg.c=c;
>> > +        reg.d=d;
>> > +        Set (&reg, n_obj);
>> > +
>> > +}
>> > +
>> > +void Set (const R *reg,  N *n_obj )
>> > +{
>> > +        n_obj->reg=(*reg);
>> > +}
>> > +
>> > +
>> > +/* { dg-final { scan-assembler-not "%(x|y|z)mm\[0-9\]+" } } */
>> > +/* { dg-final { scan-assembler-not "movdqu" } } */
>> > +/* { dg-final { scan-assembler-not "movups" } } */
>> > diff --git a/gcc/tree-sra.c b/gcc/tree-sra.c
>> > index bac593951e7..ade97964205 100644
>> > --- a/gcc/tree-sra.c
>> > +++ b/gcc/tree-sra.c
>> > @@ -104,6 +104,7 @@ along with GCC; see the file COPYING3.  If not see
>> >  #include "ipa-fnsummary.h"
>> >  #include "ipa-utils.h"
>> >  #include "builtins.h"
>> > +#include "tree-sra.h"
>> >
>> >  /* Enumeration of all aggregate reductions we can do.  */
>> >  enum sra_mode { SRA_MODE_EARLY_IPA,   /* early call regularization */
>> > @@ -952,14 +953,14 @@ create_access (tree expr, gimple *stmt, bool write)
>> >  }
>> >
>> >
>> > -/* Return true iff TYPE is scalarizable - i.e. a RECORD_TYPE or fixed-length
>> > -   ARRAY_TYPE with fields that are either of gimple register types (excluding
>> > -   bit-fields) or (recursively) scalarizable types.  CONST_DECL must be true if
>> > -   we are considering a decl from constant pool.  If it is false, char arrays
>> > -   will be refused.  */
>> > +/* Return true if TYPE consists of RECORD_TYPE or fixed-length ARRAY_TYPE with
>> > +   fields/elements that are not bit-fields and are either register types or
>> > +   recursively comply with simple_mix_of_records_and_arrays_p.  Furthermore, if
>> > +   ALLOW_CHAR_ARRAYS is false, the function will return false also if TYPE
>> > +   contains an array of elements that only have one byte.  */
>> >
>> > -static bool
>> > -scalarizable_type_p (tree type, bool const_decl)
>> > +bool
>> > +simple_mix_of_records_and_arrays_p (tree type, bool allow_char_arrays)
>> >  {
>> >    gcc_assert (!is_gimple_reg_type (type));
>> >    if (type_contains_placeholder_p (type))
>> > @@ -977,7 +978,7 @@ scalarizable_type_p (tree type, bool const_decl)
>> >             return false;
>> >
>> >           if (!is_gimple_reg_type (ft)
>> > -             && !scalarizable_type_p (ft, const_decl))
>> > +             && !simple_mix_of_records_and_arrays_p (ft, allow_char_arrays))
>> >             return false;
>> >         }
>> >
>> > @@ -986,7 +987,7 @@ scalarizable_type_p (tree type, bool const_decl)
>> >    case ARRAY_TYPE:
>> >      {
>> >        HOST_WIDE_INT min_elem_size;
>> > -      if (const_decl)
>> > +      if (allow_char_arrays)
>> >         min_elem_size = 0;
>> >        else
>> >         min_elem_size = BITS_PER_UNIT;
>> > @@ -1008,7 +1009,7 @@ scalarizable_type_p (tree type, bool const_decl)
>> >
>> >        tree elem = TREE_TYPE (type);
>> >        if (!is_gimple_reg_type (elem)
>> > -         && !scalarizable_type_p (elem, const_decl))
>> > +         && !simple_mix_of_records_and_arrays_p (elem, allow_char_arrays))
>> >         return false;
>> >        return true;
>> >      }
>> > @@ -1017,10 +1018,38 @@ scalarizable_type_p (tree type, bool const_decl)
>> >    }
>> >  }
>> >
>> > -static void scalarize_elem (tree, HOST_WIDE_INT, HOST_WIDE_INT, bool, tree, tree);
>> > +static void scalarize_elem (tree, HOST_WIDE_INT, HOST_WIDE_INT, bool, tree,
>> > +                           tree);
>> > +
>> > +/* For a given array TYPE, return false if its domain does not have any maximum
>> > +   value.  Otherwise calculate MIN and MAX indices of the first and the last
>> > +   element.  */
>> > +
>> > +bool
>> > +extract_min_max_idx_from_array (tree type, offset_int *min, offset_int *max)
>> > +{
>> > +  tree domain = TYPE_DOMAIN (type);
>> > +  tree minidx = TYPE_MIN_VALUE (domain);
>> > +  gcc_assert (TREE_CODE (minidx) == INTEGER_CST);
>> > +  tree maxidx = TYPE_MAX_VALUE (domain);
>> > +  if (!maxidx)
>> > +    return false;
>> > +  gcc_assert (TREE_CODE (maxidx) == INTEGER_CST);
>> > +
>> > +  /* MINIDX and MAXIDX are inclusive, and must be interpreted in
>> > +     DOMAIN (e.g. signed int, whereas min/max may be size_int).  */
>> > +  *min = wi::to_offset (minidx);
>> > +  *max = wi::to_offset (maxidx);
>> > +  if (!TYPE_UNSIGNED (domain))
>> > +    {
>> > +      *min = wi::sext (*min, TYPE_PRECISION (domain));
>> > +      *max = wi::sext (*max, TYPE_PRECISION (domain));
>> > +    }
>> > +  return true;
>> > +}
>> >
>> >  /* Create total_scalarization accesses for all scalar fields of a member
>> > -   of type DECL_TYPE conforming to scalarizable_type_p.  BASE
>> > +   of type DECL_TYPE conforming to simple_mix_of_records_and_arrays_p.  BASE
>> >     must be the top-most VAR_DECL representing the variable; within that,
>> >     OFFSET locates the member and REF must be the memory reference expression for
>> >     the member.  */
>> > @@ -1047,27 +1076,14 @@ completely_scalarize (tree base, tree decl_type, HOST_WIDE_INT offset, tree ref)
>> >        {
>> >         tree elemtype = TREE_TYPE (decl_type);
>> >         tree elem_size = TYPE_SIZE (elemtype);
>> > -       gcc_assert (elem_size && tree_fits_shwi_p (elem_size));
>> >         HOST_WIDE_INT el_size = tree_to_shwi (elem_size);
>> >         gcc_assert (el_size > 0);
>> >
>> > -       tree minidx = TYPE_MIN_VALUE (TYPE_DOMAIN (decl_type));
>> > -       gcc_assert (TREE_CODE (minidx) == INTEGER_CST);
>> > -       tree maxidx = TYPE_MAX_VALUE (TYPE_DOMAIN (decl_type));
>> > +       offset_int idx, max;
>> >         /* Skip (some) zero-length arrays; others have MAXIDX == MINIDX - 1.  */
>> > -       if (maxidx)
>> > +       if (extract_min_max_idx_from_array (decl_type, &idx, &max))
>> >           {
>> > -           gcc_assert (TREE_CODE (maxidx) == INTEGER_CST);
>> >             tree domain = TYPE_DOMAIN (decl_type);
>> > -           /* MINIDX and MAXIDX are inclusive, and must be interpreted in
>> > -              DOMAIN (e.g. signed int, whereas min/max may be size_int).  */
>> > -           offset_int idx = wi::to_offset (minidx);
>> > -           offset_int max = wi::to_offset (maxidx);
>> > -           if (!TYPE_UNSIGNED (domain))
>> > -             {
>> > -               idx = wi::sext (idx, TYPE_PRECISION (domain));
>> > -               max = wi::sext (max, TYPE_PRECISION (domain));
>> > -             }
>> >             for (int el_off = offset; idx <= max; ++idx)
>> >               {
>> >                 tree nref = build4 (ARRAY_REF, elemtype,
>> > @@ -1088,10 +1104,10 @@ completely_scalarize (tree base, tree decl_type, HOST_WIDE_INT offset, tree ref)
>> >  }
>> >
>> >  /* Create total_scalarization accesses for a member of type TYPE, which must
>> > -   satisfy either is_gimple_reg_type or scalarizable_type_p.  BASE must be the
>> > -   top-most VAR_DECL representing the variable; within that, POS and SIZE locate
>> > -   the member, REVERSE gives its torage order. and REF must be the reference
>> > -   expression for it.  */
>> > +   satisfy either is_gimple_reg_type or simple_mix_of_records_and_arrays_p.
>> > +   BASE must be the top-most VAR_DECL representing the variable; within that,
>> > +   POS and SIZE locate the member, REVERSE gives its storage order, and REF must
>> > +   be the reference expression for it.  */
>> >
>> >  static void
>> >  scalarize_elem (tree base, HOST_WIDE_INT pos, HOST_WIDE_INT size, bool reverse,
>> > @@ -1111,7 +1127,8 @@ scalarize_elem (tree base, HOST_WIDE_INT pos, HOST_WIDE_INT size, bool reverse,
>> >  }
>> >
>> >  /* Create a total_scalarization access for VAR as a whole.  VAR must be of a
>> > -   RECORD_TYPE or ARRAY_TYPE conforming to scalarizable_type_p.  */
>> > +   RECORD_TYPE or ARRAY_TYPE conforming to
>> > +   simple_mix_of_records_and_arrays_p.  */
>> >
>> >  static void
>> >  create_total_scalarization_access (tree var)
>> > @@ -2803,8 +2820,9 @@ analyze_all_variable_accesses (void)
>> >        {
>> >         tree var = candidate (i);
>> >
>> > -       if (VAR_P (var) && scalarizable_type_p (TREE_TYPE (var),
>> > -                                               constant_decl_p (var)))
>> > +       if (VAR_P (var)
>> > +           && simple_mix_of_records_and_arrays_p (TREE_TYPE (var),
>> > +                                                  constant_decl_p (var)))
>> >           {
>> >             if (tree_to_uhwi (TYPE_SIZE (TREE_TYPE (var)))
>> >                 <= max_scalarization_size)
>> > diff --git a/gcc/tree-sra.h b/gcc/tree-sra.h
>> > new file mode 100644
>> > index 00000000000..dc901385994
>> > --- /dev/null
>> > +++ b/gcc/tree-sra.h
>> > @@ -0,0 +1,33 @@
>> > +/* tree-sra.h - Scalar Replacement of Aggregates (SRA) declarations.
>> > +   Copyright (C) 2017 Free Software Foundation, Inc.
>> > +
>> > +This file is part of GCC.
>> > +
>> > +GCC is free software; you can redistribute it and/or modify it under
>> > +the terms of the GNU General Public License as published by the Free
>> > +Software Foundation; either version 3, or (at your option) any later
>> > +version.
>> > +
>> > +GCC is distributed in the hope that it will be useful, but WITHOUT ANY
>> > +WARRANTY; without even the implied warranty of MERCHANTABILITY or
>> > +FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
>> > +for more details.
>> > +
>> > +You should have received a copy of the GNU General Public License
>> > +along with GCC; see the file COPYING3.  If not see
>> > +<http://www.gnu.org/licenses/>.  */
>> > +
>> > +#ifndef TREE_SRA_H
>> > +#define TREE_SRA_H
>> > +
>> > +
>> > +bool simple_mix_of_records_and_arrays_p (tree type, bool allow_char_arrays);
>> > +bool extract_min_max_idx_from_array (tree type, offset_int *idx,
>> > +                                    offset_int *max);
>> > +tree build_ref_for_offset (location_t loc, tree base, HOST_WIDE_INT offset,
>> > +                          bool reverse, tree exp_type,
>> > +                          gimple_stmt_iterator *gsi, bool insert_after);
>> > +
>> > +
>> > +
>> > +#endif /* TREE_SRA_H */
>> > --
>> > 2.14.1
>> >
Jan Hubicka Oct. 26, 2017, 12:55 p.m. UTC | #4
> I think the limit should be on the number of generated copies and not
> the overall size of the structure...  If the struct were composed of
> 32 individual chars we wouldn't want to emit 32 loads and 32 stores...
> 
> I wonder how rep; movsb interacts with store-to-load forwarding?  Is
> that maybe optimized well on some archs?  movsb should always
> forward, and isn't the setup cost for small N reasonable on modern
> CPUs?

rep mov is a win over a loop for blocks over 128 bytes on Core, and for
blocks in the range 24-128 bytes on Zen.  This is w/o store/load forwarding,
but I doubt those provide a cheap way around.

> 
> It probably depends on the width of the entries in the store buffer,
> whether they appear in order, and the alignment of the stores (if they are
> larger than 8 bytes they are surely aligned).  IIRC CPUs used to have
> store buffer entries smaller than the cache line size.
> 
> Given that load bandwidth is usually higher than store bandwidth, it
> might make sense to do the store combining in our copying sequence,
> like for the 8-byte entry case use sth like
> 
>   movq 0(%eax), %xmm0
>   movhps 8(%eax), %xmm0 // or vpinsert
>   mov[au]ps %xmm0, 0(%ebx)
> ...
> 
> thus do two loads per store and perform the stores in a wider
> mode?

This may be somewhat faster indeed.  I am not sure if store-to-load
forwarding will work for the latter half when it is read again by halves.
It would not happen on older CPUs :)

Honza
> [... rest of the quoted message and patch snipped ...]
Michael Matz Oct. 26, 2017, 2:07 p.m. UTC | #5
Hi,

On Thu, 26 Oct 2017, Martin Jambor wrote:

> > 35 bytes seems to be a lot - what is the code-size impact?
> 
> I will find out and report on that.  I need at least 32 bytes (four
> long ints) to fix imagemagick, where the problematic structure is:

Surely the final heuristic should look at both the size and the number of
elements of the struct in question, not only at its size.
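
Something like the following, reusing the patch's helpers, could drive
such a limit (a sketch only; the function name and the cap are made up):

  /* Count the scalar moves emit_move_elementwise would generate for
     TYPE; the caller would compare this against a new param instead
     of capping the byte size of the aggregate.  */
  static HOST_WIDE_INT
  count_elementwise_moves (tree type)
  {
    switch (TREE_CODE (type))
      {
      case RECORD_TYPE:
        {
          HOST_WIDE_INT n = 0;
          for (tree fld = TYPE_FIELDS (type); fld; fld = DECL_CHAIN (fld))
            if (TREE_CODE (fld) == FIELD_DECL)
              n += count_elementwise_moves (TREE_TYPE (fld));
          return n;
        }
      case ARRAY_TYPE:
        {
          offset_int idx, max;
          if (!extract_min_max_idx_from_array (type, &idx, &max))
            return 0;
          return ((max - idx + 1).to_shwi ()
                  * count_elementwise_moves (TREE_TYPE (type)));
        }
      default:
        return 1;  /* One scalar load/store pair.  */
      }
  }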


Ciao,
Michael.
Richard Biener Oct. 26, 2017, 2:38 p.m. UTC | #6
On Thu, Oct 26, 2017 at 2:55 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
>> I think the limit should be on the number of generated copies and not
>> the overall size of the structure...  If the struct were composed of
>> 32 individual chars we wouldn't want to emit 32 loads and 32 stores...
>>
>> I wonder how rep; movsb interacts with store-to-load forwarding?  Is
>> that maybe optimized well on some archs?  movsb should always
>> forward, and isn't the setup cost for small N reasonable on modern
>> CPUs?
>
> rep mov is a win over a loop for blocks over 128 bytes on Core, and for
> blocks in the range 24-128 bytes on Zen.  This is w/o store/load forwarding,
> but I doubt those provide a cheap way around.
>
>>
>> It probably depends on the width of the entries in the store buffer,
>> whether they appear in order, and the alignment of the stores (if they are
>> larger than 8 bytes they are surely aligned).  IIRC CPUs used to have
>> store buffer entries smaller than the cache line size.
>>
>> Given that load bandwidth is usually higher than store bandwidth, it
>> might make sense to do the store combining in our copying sequence,
>> like for the 8-byte entry case use sth like
>>
>>   movq 0(%eax), %xmm0
>>   movhps 8(%eax), %xmm0 // or vpinsert
>>   mov[au]ps %xmm0, 0(%ebx)
>> ...
>>
>> thus do two loads per store and perform the stores in a wider
>> mode?
>
> This may be somewhat faster indeed.  I am not sure if store-to-load
> forwarding will work for the latter half when it is read again by halves.
> It would not happen on older CPUs :)

Yes, forwarding larger stores to smaller loads has generally worked fine
since forever, with the usual restrictions of alignment and the sizes being
power-of-two "halves".

The question is of course what to do for 4-byte or smaller elements or
mixed-size elements.  We can do zero-extending loads
(do we have them for QImode and HImode loads as well?) and
do shifts and ORs.  I'm quite sure the CPUs wouldn't like to
see vpinserts with different vector mode destinations.  So it
would be 8-byte stores from GPRs and values built up via
shift & or.
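
For four 2-byte elements that would look roughly like this (a C sketch,
little-endian assumed, illustration only):

  #include <stdint.h>

  /* Four zero-extending 2-byte loads combined via shift-and-or into
     one 8-byte GPR store.  */
  static void
  combine_4x16_store (uint64_t *dst, const uint16_t *src)
  {
    uint64_t v = (uint64_t) src[0]                /* movzwl */
                 | ((uint64_t) src[1] << 16)
                 | ((uint64_t) src[2] << 32)
                 | ((uint64_t) src[3] << 48);
    *dst = v;  /* one 8-byte store instead of four 2-byte ones */
  }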

As said, the important part is that IIRC CPUs can usually
have more loads in flight than stores.  Esp. Bulldozer
with the split core was store buffer size limited (but it
could do merging of store buffer entries IIRC).

Richard.

> Honza
>>
>> As said a general concern was you not copying padding.  If you
>> put this into an even more common place you surely will break
>> stuff, no?
>>
>> Richard.
>>
>> >
>> > Martin
>> >
>> >
>> >>
>> >> Richard.
>> >>
>> >> > Martin
>> >> >
>> >> >
>> >> > 2017-10-12  Martin Jambor  <mjambor@suse.cz>
>> >> >
>> >> >         PR target/80689
>> >> >         * tree-sra.h: New file.
>> >> >         * ipa-prop.h: Moved declaration of build_ref_for_offset to
>> >> >         tree-sra.h.
>> >> >         * expr.c: Include params.h and tree-sra.h.
>> >> >         (emit_move_elementwise): New function.
>> >> >         (store_expr_with_bounds): Optionally use it.
>> >> >         * ipa-cp.c: Include tree-sra.h.
>> >> >         * params.def (PARAM_MAX_SIZE_FOR_ELEMENTWISE_COPY): New.
>> >> >         * config/i386/i386.c (ix86_option_override_internal): Set
>> >> >         PARAM_MAX_SIZE_FOR_ELEMENTWISE_COPY to 35.
>> >> >         * tree-sra.c: Include tree-sra.h.
>> >> >         (scalarizable_type_p): Renamed to
>> >> >         simple_mix_of_records_and_arrays_p, made public, renamed the
>> >> >         second parameter to allow_char_arrays.
>> >> >         (extract_min_max_idx_from_array): New function.
>> >> >         (completely_scalarize): Moved bits of the function to
>> >> >         extract_min_max_idx_from_array.
>> >> >
>> >> >         testsuite/
>> >> >         * gcc.target/i386/pr80689-1.c: New test.
>> >> > ---
>> >> >  gcc/config/i386/i386.c                    |   4 ++
>> >> >  gcc/expr.c                                | 103 ++++++++++++++++++++++++++++--
>> >> >  gcc/ipa-cp.c                              |   1 +
>> >> >  gcc/ipa-prop.h                            |   4 --
>> >> >  gcc/params.def                            |   6 ++
>> >> >  gcc/testsuite/gcc.target/i386/pr80689-1.c |  38 +++++++++++
>> >> >  gcc/tree-sra.c                            |  86 +++++++++++++++----------
>> >> >  gcc/tree-sra.h                            |  33 ++++++++++
>> >> >  8 files changed, 233 insertions(+), 42 deletions(-)
>> >> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr80689-1.c
>> >> >  create mode 100644 gcc/tree-sra.h
>> >> >
>> >> > diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
>> >> > index 1ee8351c21f..87f602e7ead 100644
>> >> > --- a/gcc/config/i386/i386.c
>> >> > +++ b/gcc/config/i386/i386.c
>> >> > @@ -6511,6 +6511,10 @@ ix86_option_override_internal (bool main_args_p,
>> >> >                          ix86_tune_cost->l2_cache_size,
>> >> >                          opts->x_param_values,
>> >> >                          opts_set->x_param_values);
>> >> > +  maybe_set_param_value (PARAM_MAX_SIZE_FOR_ELEMENTWISE_COPY,
>> >> > +                        35,
>> >> > +                        opts->x_param_values,
>> >> > +                        opts_set->x_param_values);
>> >> >
>> >> >    /* Enable sw prefetching at -O3 for CPUS that prefetching is helpful.  */
>> >> >    if (opts->x_flag_prefetch_loop_arrays < 0
>> >> > diff --git a/gcc/expr.c b/gcc/expr.c
>> >> > index 134ee731c29..dff24e7f166 100644
>> >> > --- a/gcc/expr.c
>> >> > +++ b/gcc/expr.c
>> >> > @@ -61,7 +61,8 @@ along with GCC; see the file COPYING3.  If not see
>> >> >  #include "tree-chkp.h"
>> >> >  #include "rtl-chkp.h"
>> >> >  #include "ccmp.h"
>> >> > -
>> >> > +#include "params.h"
>> >> > +#include "tree-sra.h"
>> >> >
>> >> >  /* If this is nonzero, we do not bother generating VOLATILE
>> >> >     around volatile memory references, and we are willing to
>> >> > @@ -5340,6 +5341,80 @@ emit_storent_insn (rtx to, rtx from)
>> >> >    return maybe_expand_insn (code, 2, ops);
>> >> >  }
>> >> >
>> >> > +/* Generate code for copying data of type TYPE at SOURCE plus OFFSET to TARGET
>> >> > +   plus OFFSET, but do so element-wise and/or field-wise for each record and
>> >> > +   array within TYPE.  TYPE must either be a register type or an aggregate
>> >> > +   complying with simple_mix_of_records_and_arrays_p.
>> >> > +
>> >> > +   If CALL_PARAM_P is nonzero, this is a store into a call param on the
>> >> > +   stack, and block moves may need to be treated specially.  */
>> >> > +
>> >> > +static void
>> >> > +emit_move_elementwise (tree type, rtx target, rtx source, HOST_WIDE_INT offset,
>> >> > +                      int call_param_p)
>> >> > +{
>> >> > +  switch (TREE_CODE (type))
>> >> > +    {
>> >> > +    case RECORD_TYPE:
>> >> > +      for (tree fld = TYPE_FIELDS (type); fld; fld = DECL_CHAIN (fld))
>> >> > +       if (TREE_CODE (fld) == FIELD_DECL)
>> >> > +         {
>> >> > +           HOST_WIDE_INT fld_offset = offset + int_bit_position (fld);
>> >> > +           tree ft = TREE_TYPE (fld);
>> >> > +           emit_move_elementwise (ft, target, source, fld_offset,
>> >> > +                                  call_param_p);
>> >> > +         }
>> >> > +      break;
>> >> > +
>> >> > +    case ARRAY_TYPE:
>> >> > +      {
>> >> > +       tree elem_type = TREE_TYPE (type);
>> >> > +       HOST_WIDE_INT el_size = tree_to_shwi (TYPE_SIZE (elem_type));
>> >> > +       gcc_assert (el_size > 0);
>> >> > +
>> >> > +       offset_int idx, max;
>> >> > +       /* Skip (some) zero-length arrays; others have MAXIDX == MINIDX - 1.  */
>> >> > +       if (extract_min_max_idx_from_array (type, &idx, &max))
>> >> > +         {
>> >> > +           HOST_WIDE_INT el_offset = offset;
>> >> > +           for (; idx <= max; ++idx)
>> >> > +             {
>> >> > +               emit_move_elementwise (elem_type, target, source, el_offset,
>> >> > +                                      call_param_p);
>> >> > +               el_offset += el_size;
>> >> > +             }
>> >> > +         }
>> >> > +      }
>> >> > +      break;
>> >> > +    default:
>> >> > +      machine_mode mode = TYPE_MODE (type);
>> >> > +
>> >> > +      rtx ntgt = adjust_address (target, mode, offset / BITS_PER_UNIT);
>> >> > +      rtx nsrc = adjust_address (source, mode, offset / BITS_PER_UNIT);
>> >> > +
>> >> > +      /* TODO: Figure out whether the following is actually necessary.  */
>> >> > +      if (target == ntgt)
>> >> > +       ntgt = copy_rtx (target);
>> >> > +      if (source == nsrc)
>> >> > +       nsrc = copy_rtx (source);
>> >> > +
>> >> > +      gcc_assert (mode != VOIDmode);
>> >> > +      if (mode != BLKmode)
>> >> > +       emit_move_insn (ntgt, nsrc);
>> >> > +      else
>> >> > +       {
>> >> > +         /* For example vector gimple registers can end up here.  */
>> >> > +         rtx size = expand_expr (TYPE_SIZE_UNIT (type), NULL_RTX,
>> >> > +                                 TYPE_MODE (sizetype), EXPAND_NORMAL);
>> >> > +         emit_block_move (ntgt, nsrc, size,
>> >> > +                          (call_param_p
>> >> > +                           ? BLOCK_OP_CALL_PARM : BLOCK_OP_NORMAL));
>> >> > +       }
>> >> > +      break;
>> >> > +    }
>> >> > +  return;
>> >> > +}
>> >> > +
>> >> >  /* Generate code for computing expression EXP,
>> >> >     and storing the value into TARGET.
>> >> >
>> >> > @@ -5713,9 +5788,29 @@ store_expr_with_bounds (tree exp, rtx target, int call_param_p,
>> >> >         emit_group_store (target, temp, TREE_TYPE (exp),
>> >> >                           int_size_in_bytes (TREE_TYPE (exp)));
>> >> >        else if (GET_MODE (temp) == BLKmode)
>> >> > -       emit_block_move (target, temp, expr_size (exp),
>> >> > -                        (call_param_p
>> >> > -                         ? BLOCK_OP_CALL_PARM : BLOCK_OP_NORMAL));
>> >> > +       {
>> >> > +         /* Copying smallish BLKmode structures with emit_block_move and thus
>> >> > +            by-pieces can result in store-to-load stalls.  So copy some simple
>> >> > +            small aggregates element or field-wise.  */
>> >> > +         if (GET_MODE (target) == BLKmode
>> >> > +             && AGGREGATE_TYPE_P (TREE_TYPE (exp))
>> >> > +             && !TREE_ADDRESSABLE (TREE_TYPE (exp))
>> >> > +             && tree_fits_shwi_p (TYPE_SIZE (TREE_TYPE (exp)))
>> >> > +             && (tree_to_shwi (TYPE_SIZE (TREE_TYPE (exp)))
>> >> > +                 <= (PARAM_VALUE (PARAM_MAX_SIZE_FOR_ELEMENTWISE_COPY)
>> >> > +                     * BITS_PER_UNIT))
>> >> > +             && simple_mix_of_records_and_arrays_p (TREE_TYPE (exp), false))
>> >> > +           {
>> >> > +             /* FIXME:  Can this happen?  What would it mean?  */
>> >> > +             gcc_assert (!reverse);
>> >> > +             emit_move_elementwise (TREE_TYPE (exp), target, temp, 0,
>> >> > +                                    call_param_p);
>> >> > +           }
>> >> > +         else
>> >> > +           emit_block_move (target, temp, expr_size (exp),
>> >> > +                            (call_param_p
>> >> > +                             ? BLOCK_OP_CALL_PARM : BLOCK_OP_NORMAL));
>> >> > +       }
>> >> >        /* If we emit a nontemporal store, there is nothing else to do.  */
>> >> >        else if (nontemporal && emit_storent_insn (target, temp))
>> >> >         ;
>> >> > diff --git a/gcc/ipa-cp.c b/gcc/ipa-cp.c
>> >> > index 6b3d8d7364c..7d6019bbd30 100644
>> >> > --- a/gcc/ipa-cp.c
>> >> > +++ b/gcc/ipa-cp.c
>> >> > @@ -124,6 +124,7 @@ along with GCC; see the file COPYING3.  If not see
>> >> >  #include "tree-ssa-ccp.h"
>> >> >  #include "stringpool.h"
>> >> >  #include "attribs.h"
>> >> > +#include "tree-sra.h"
>> >> >
>> >> >  template <typename valtype> class ipcp_value;
>> >> >
>> >> > diff --git a/gcc/ipa-prop.h b/gcc/ipa-prop.h
>> >> > index fa5bed49ee0..2313cc884ed 100644
>> >> > --- a/gcc/ipa-prop.h
>> >> > +++ b/gcc/ipa-prop.h
>> >> > @@ -877,10 +877,6 @@ ipa_parm_adjustment *ipa_get_adjustment_candidate (tree **, bool *,
>> >> >  void ipa_release_body_info (struct ipa_func_body_info *);
>> >> >  tree ipa_get_callee_param_type (struct cgraph_edge *e, int i);
>> >> >
>> >> > -/* From tree-sra.c:  */
>> >> > -tree build_ref_for_offset (location_t, tree, HOST_WIDE_INT, bool, tree,
>> >> > -                          gimple_stmt_iterator *, bool);
>> >> > -
>> >> >  /* In ipa-cp.c  */
>> >> >  void ipa_cp_c_finalize (void);
>> >> >
>> >> > diff --git a/gcc/params.def b/gcc/params.def
>> >> > index e55afc28053..5e19f1414a0 100644
>> >> > --- a/gcc/params.def
>> >> > +++ b/gcc/params.def
>> >> > @@ -1294,6 +1294,12 @@ DEFPARAM (PARAM_VECT_EPILOGUES_NOMASK,
>> >> >           "Enable loop epilogue vectorization using smaller vector size.",
>> >> >           0, 0, 1)
>> >> >
>> >> > +DEFPARAM (PARAM_MAX_SIZE_FOR_ELEMENTWISE_COPY,
>> >> > +         "max-size-for-elementwise-copy",
>> >> > +         "Maximum size in bytes of a structure or array to by considered for "
>> >> > +         "copying by its individual fields or elements",
>> >> > +         0, 0, 512)
>> >> > +
>> >> >  /*
>> >> >
>> >> >  Local variables:
>> >> > diff --git a/gcc/testsuite/gcc.target/i386/pr80689-1.c b/gcc/testsuite/gcc.target/i386/pr80689-1.c
>> >> > new file mode 100644
>> >> > index 00000000000..4156d4fba45
>> >> > --- /dev/null
>> >> > +++ b/gcc/testsuite/gcc.target/i386/pr80689-1.c
>> >> > @@ -0,0 +1,38 @@
>> >> > +/* { dg-do compile } */
>> >> > +/* { dg-options "-O2" } */
>> >> > +
>> >> > +typedef struct st1
>> >> > +{
>> >> > +        long unsigned int a,b;
>> >> > +        long int c,d;
>> >> > +}R;
>> >> > +
>> >> > +typedef struct st2
>> >> > +{
>> >> > +        int  t;
>> >> > +        R  reg;
>> >> > +}N;
>> >> > +
>> >> > +void Set (const R *region,  N *n_info );
>> >> > +
>> >> > +void test(N  *n_obj ,const long unsigned int a, const long unsigned int b,  const long int c,const long int d)
>> >> > +{
>> >> > +        R reg;
>> >> > +
>> >> > +        reg.a=a;
>> >> > +        reg.b=b;
>> >> > +        reg.c=c;
>> >> > +        reg.d=d;
>> >> > +        Set (&reg, n_obj);
>> >> > +
>> >> > +}
>> >> > +
>> >> > +void Set (const R *reg,  N *n_obj )
>> >> > +{
>> >> > +        n_obj->reg=(*reg);
>> >> > +}
>> >> > +
>> >> > +
>> >> > +/* { dg-final { scan-assembler-not "%(x|y|z)mm\[0-9\]+" } } */
>> >> > +/* { dg-final { scan-assembler-not "movdqu" } } */
>> >> > +/* { dg-final { scan-assembler-not "movups" } } */
>> >> > diff --git a/gcc/tree-sra.c b/gcc/tree-sra.c
>> >> > index bac593951e7..ade97964205 100644
>> >> > --- a/gcc/tree-sra.c
>> >> > +++ b/gcc/tree-sra.c
>> >> > @@ -104,6 +104,7 @@ along with GCC; see the file COPYING3.  If not see
>> >> >  #include "ipa-fnsummary.h"
>> >> >  #include "ipa-utils.h"
>> >> >  #include "builtins.h"
>> >> > +#include "tree-sra.h"
>> >> >
>> >> >  /* Enumeration of all aggregate reductions we can do.  */
>> >> >  enum sra_mode { SRA_MODE_EARLY_IPA,   /* early call regularization */
>> >> > @@ -952,14 +953,14 @@ create_access (tree expr, gimple *stmt, bool write)
>> >> >  }
>> >> >
>> >> >
>> >> > -/* Return true iff TYPE is scalarizable - i.e. a RECORD_TYPE or fixed-length
>> >> > -   ARRAY_TYPE with fields that are either of gimple register types (excluding
>> >> > -   bit-fields) or (recursively) scalarizable types.  CONST_DECL must be true if
>> >> > -   we are considering a decl from constant pool.  If it is false, char arrays
>> >> > -   will be refused.  */
>> >> > +/* Return true if TYPE consists of RECORD_TYPE or fixed-length ARRAY_TYPE with
>> >> > +   fields/elements that are not bit-fields and are either register types or
>> >> > +   recursively comply with simple_mix_of_records_and_arrays_p.  Furthermore, if
>> >> > +   ALLOW_CHAR_ARRAYS is false, the function also returns false if TYPE
>> >> > +   contains an array of single-byte elements.  */
>> >> >
>> >> > -static bool
>> >> > -scalarizable_type_p (tree type, bool const_decl)
>> >> > +bool
>> >> > +simple_mix_of_records_and_arrays_p (tree type, bool allow_char_arrays)
>> >> >  {
>> >> >    gcc_assert (!is_gimple_reg_type (type));
>> >> >    if (type_contains_placeholder_p (type))
>> >> > @@ -977,7 +978,7 @@ scalarizable_type_p (tree type, bool const_decl)
>> >> >             return false;
>> >> >
>> >> >           if (!is_gimple_reg_type (ft)
>> >> > -             && !scalarizable_type_p (ft, const_decl))
>> >> > +             && !simple_mix_of_records_and_arrays_p (ft, allow_char_arrays))
>> >> >             return false;
>> >> >         }
>> >> >
>> >> > @@ -986,7 +987,7 @@ scalarizable_type_p (tree type, bool const_decl)
>> >> >    case ARRAY_TYPE:
>> >> >      {
>> >> >        HOST_WIDE_INT min_elem_size;
>> >> > -      if (const_decl)
>> >> > +      if (allow_char_arrays)
>> >> >         min_elem_size = 0;
>> >> >        else
>> >> >         min_elem_size = BITS_PER_UNIT;
>> >> > @@ -1008,7 +1009,7 @@ scalarizable_type_p (tree type, bool const_decl)
>> >> >
>> >> >        tree elem = TREE_TYPE (type);
>> >> >        if (!is_gimple_reg_type (elem)
>> >> > -         && !scalarizable_type_p (elem, const_decl))
>> >> > +         && !simple_mix_of_records_and_arrays_p (elem, allow_char_arrays))
>> >> >         return false;
>> >> >        return true;
>> >> >      }
>> >> > @@ -1017,10 +1018,38 @@ scalarizable_type_p (tree type, bool const_decl)
>> >> >    }
>> >> >  }
>> >> >
>> >> > -static void scalarize_elem (tree, HOST_WIDE_INT, HOST_WIDE_INT, bool, tree, tree);
>> >> > +static void scalarize_elem (tree, HOST_WIDE_INT, HOST_WIDE_INT, bool, tree,
>> >> > +                           tree);
>> >> > +
>> >> > +/* For a given array TYPE, return false if its domain does not have a maximum
>> >> > +   value.  Otherwise store the indices of its first and last element in MIN
>> >> > +   and MAX and return true.  */
>> >> > +
>> >> > +bool
>> >> > +extract_min_max_idx_from_array (tree type, offset_int *min, offset_int *max)
>> >> > +{
>> >> > +  tree domain = TYPE_DOMAIN (type);
>> >> > +  tree minidx = TYPE_MIN_VALUE (domain);
>> >> > +  gcc_assert (TREE_CODE (minidx) == INTEGER_CST);
>> >> > +  tree maxidx = TYPE_MAX_VALUE (domain);
>> >> > +  if (!maxidx)
>> >> > +    return false;
>> >> > +  gcc_assert (TREE_CODE (maxidx) == INTEGER_CST);
>> >> > +
>> >> > +  /* MINIDX and MAXIDX are inclusive, and must be interpreted in
>> >> > +     DOMAIN (e.g. signed int, whereas min/max may be size_int).  */
>> >> > +  *min = wi::to_offset (minidx);
>> >> > +  *max = wi::to_offset (maxidx);
>> >> > +  if (!TYPE_UNSIGNED (domain))
>> >> > +    {
>> >> > +      *min = wi::sext (*min, TYPE_PRECISION (domain));
>> >> > +      *max = wi::sext (*max, TYPE_PRECISION (domain));
>> >> > +    }
>> >> > +  return true;
>> >> > +}
>> >> >
>> >> >  /* Create total_scalarization accesses for all scalar fields of a member
>> >> > -   of type DECL_TYPE conforming to scalarizable_type_p.  BASE
>> >> > +   of type DECL_TYPE conforming to simple_mix_of_records_and_arrays_p.  BASE
>> >> >     must be the top-most VAR_DECL representing the variable; within that,
>> >> >     OFFSET locates the member and REF must be the memory reference expression for
>> >> >     the member.  */
>> >> > @@ -1047,27 +1076,14 @@ completely_scalarize (tree base, tree decl_type, HOST_WIDE_INT offset, tree ref)
>> >> >        {
>> >> >         tree elemtype = TREE_TYPE (decl_type);
>> >> >         tree elem_size = TYPE_SIZE (elemtype);
>> >> > -       gcc_assert (elem_size && tree_fits_shwi_p (elem_size));
>> >> >         HOST_WIDE_INT el_size = tree_to_shwi (elem_size);
>> >> >         gcc_assert (el_size > 0);
>> >> >
>> >> > -       tree minidx = TYPE_MIN_VALUE (TYPE_DOMAIN (decl_type));
>> >> > -       gcc_assert (TREE_CODE (minidx) == INTEGER_CST);
>> >> > -       tree maxidx = TYPE_MAX_VALUE (TYPE_DOMAIN (decl_type));
>> >> > +       offset_int idx, max;
>> >> >         /* Skip (some) zero-length arrays; others have MAXIDX == MINIDX - 1.  */
>> >> > -       if (maxidx)
>> >> > +       if (extract_min_max_idx_from_array (decl_type, &idx, &max))
>> >> >           {
>> >> > -           gcc_assert (TREE_CODE (maxidx) == INTEGER_CST);
>> >> >             tree domain = TYPE_DOMAIN (decl_type);
>> >> > -           /* MINIDX and MAXIDX are inclusive, and must be interpreted in
>> >> > -              DOMAIN (e.g. signed int, whereas min/max may be size_int).  */
>> >> > -           offset_int idx = wi::to_offset (minidx);
>> >> > -           offset_int max = wi::to_offset (maxidx);
>> >> > -           if (!TYPE_UNSIGNED (domain))
>> >> > -             {
>> >> > -               idx = wi::sext (idx, TYPE_PRECISION (domain));
>> >> > -               max = wi::sext (max, TYPE_PRECISION (domain));
>> >> > -             }
>> >> >             for (int el_off = offset; idx <= max; ++idx)
>> >> >               {
>> >> >                 tree nref = build4 (ARRAY_REF, elemtype,
>> >> > @@ -1088,10 +1104,10 @@ completely_scalarize (tree base, tree decl_type, HOST_WIDE_INT offset, tree ref)
>> >> >  }
>> >> >
>> >> >  /* Create total_scalarization accesses for a member of type TYPE, which must
>> >> > -   satisfy either is_gimple_reg_type or scalarizable_type_p.  BASE must be the
>> >> > -   top-most VAR_DECL representing the variable; within that, POS and SIZE locate
>> >> > -   the member, REVERSE gives its torage order. and REF must be the reference
>> >> > -   expression for it.  */
>> >> > +   satisfy either is_gimple_reg_type or simple_mix_of_records_and_arrays_p.
>> >> > +   BASE must be the top-most VAR_DECL representing the variable; within that,
>> >> > +   POS and SIZE locate the member, REVERSE gives its storage order, and REF must
>> >> > +   be the reference expression for it.  */
>> >> >
>> >> >  static void
>> >> >  scalarize_elem (tree base, HOST_WIDE_INT pos, HOST_WIDE_INT size, bool reverse,
>> >> > @@ -1111,7 +1127,8 @@ scalarize_elem (tree base, HOST_WIDE_INT pos, HOST_WIDE_INT size, bool reverse,
>> >> >  }
>> >> >
>> >> >  /* Create a total_scalarization access for VAR as a whole.  VAR must be of a
>> >> > -   RECORD_TYPE or ARRAY_TYPE conforming to scalarizable_type_p.  */
>> >> > +   RECORD_TYPE or ARRAY_TYPE conforming to
>> >> > +   simple_mix_of_records_and_arrays_p.  */
>> >> >
>> >> >  static void
>> >> >  create_total_scalarization_access (tree var)
>> >> > @@ -2803,8 +2820,9 @@ analyze_all_variable_accesses (void)
>> >> >        {
>> >> >         tree var = candidate (i);
>> >> >
>> >> > -       if (VAR_P (var) && scalarizable_type_p (TREE_TYPE (var),
>> >> > -                                               constant_decl_p (var)))
>> >> > +       if (VAR_P (var)
>> >> > +           && simple_mix_of_records_and_arrays_p (TREE_TYPE (var),
>> >> > +                                                  constant_decl_p (var)))
>> >> >           {
>> >> >             if (tree_to_uhwi (TYPE_SIZE (TREE_TYPE (var)))
>> >> >                 <= max_scalarization_size)
>> >> > diff --git a/gcc/tree-sra.h b/gcc/tree-sra.h
>> >> > new file mode 100644
>> >> > index 00000000000..dc901385994
>> >> > --- /dev/null
>> >> > +++ b/gcc/tree-sra.h
>> >> > @@ -0,0 +1,33 @@
>> >> > +/* tree-sra.h - Interface of the Scalar Replacement of Aggregates pass.
>> >> > +   Copyright (C) 2017 Free Software Foundation, Inc.
>> >> > +
>> >> > +This file is part of GCC.
>> >> > +
>> >> > +GCC is free software; you can redistribute it and/or modify it under
>> >> > +the terms of the GNU General Public License as published by the Free
>> >> > +Software Foundation; either version 3, or (at your option) any later
>> >> > +version.
>> >> > +
>> >> > +GCC is distributed in the hope that it will be useful, but WITHOUT ANY
>> >> > +WARRANTY; without even the implied warranty of MERCHANTABILITY or
>> >> > +FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
>> >> > +for more details.
>> >> > +
>> >> > +You should have received a copy of the GNU General Public License
>> >> > +along with GCC; see the file COPYING3.  If not see
>> >> > +<http://www.gnu.org/licenses/>.  */
>> >> > +
>> >> > +#ifndef TREE_SRA_H
>> >> > +#define TREE_SRA_H
>> >> > +
>> >> > +
>> >> > +bool simple_mix_of_records_and_arrays_p (tree type, bool allow_char_arrays);
>> >> > +bool extract_min_max_idx_from_array (tree type, offset_int *idx,
>> >> > +                                    offset_int *max);
>> >> > +tree build_ref_for_offset (location_t loc, tree base, HOST_WIDE_INT offset,
>> >> > +                          bool reverse, tree exp_type,
>> >> > +                          gimple_stmt_iterator *gsi, bool insert_after);
>> >> > +
>> >> > +
>> >> > +
>> >> > +#endif /* TREE_SRA_H */
>> >> > --
>> >> > 2.14.1
>> >> >
Richard Biener Oct. 26, 2017, 3:09 p.m. UTC | #7
On Thu, Oct 26, 2017 at 4:38 PM, Richard Biener
<richard.guenther@gmail.com> wrote:
> On Thu, Oct 26, 2017 at 2:55 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
>>> I think the limit should be on the number of generated copies and not
>>> the overall size of the structure...  If the struct were composed of
>>> 32 individual chars we wouldn't want to emit 32 loads and 32 stores...
>>>
>>> I wonder how rep; movb; interacts with store to load forwarding?  Is
>>> that maybe optimized well on some archs?  movb should always
>>> forward and wasn't the setup cost for small N reasonable on modern
>>> CPUs?
>>
>> rep mov is a win over a loop for blocks over 128 bytes on Core, and for blocks
>> in range 24-128 on Zen.  This is w/o store/load forwarding, but I doubt those
>> provide a cheap way around.
>>
>>>
>>> It probably depends on the width of the entries in the store buffer,
>>> if they appear in-order and the alignment of the stores (if they are larger than
>>> 8 bytes they are surely aligned).  IIRC CPUs had smaller store buffer
>>> entries than cache line size.
>>>
>>> Given that load bandwith is usually higher than store bandwith it
>>> might make sense to do the store combining in our copying sequence,
>>> like for the 8 byte entry case use sth like
>>>
>>>   movq 0(%eax), %xmm0
>>>   movhps 8(%eax), %xmm0 // or vpinsert
>>>   mov[au]ps %xmm0, 0(%ebx)
>>> ...
>>>
>>> thus do two loads per store and perform the stores in wider
>>> mode?
>>
>> This may be somewhat faster indeed.  I am not sure if store to load
>> forwarding will work for the latter half when read again by halves.
>> It would not happen on older CPUs :)
>
> Yes, forwarding larger stores to smaller loads generally works fine
> since forever with the usual restrictions of alignment/size being
> power of two "halves".
>
> The question is of course what to do for 4 byte or smaller elements or
> mixed size elements.  We can do zero-extending loads
> (do we have them for QI, HI mode loads as well?) and
> do shift and or's.  I'm quite sure the CPUs wouldn't like to
> see vpinsert's of different vector mode destinations.  So it
> would be 8 byte stores from GPRs and values built up via
> shift & or.

Like we generate

foo:
.LFB0:
        .cfi_startproc
        movl    4(%rdi), %eax
        movzwl  2(%rdi), %edx
        salq    $16, %rax
        orq     %rdx, %rax
        movzbl  1(%rdi), %edx
        salq    $8, %rax
        orq     %rdx, %rax
        movzbl  (%rdi), %edx
        salq    $8, %rax
        orq     %rdx, %rax
        movq    %rax, (%rsi)
        ret

for

struct x { char e; char f; short c; int i; } a;

void foo (struct x *p, long *q)
{
 *q = (((((((unsigned long)(unsigned int)p->i) << 16)
   | (((unsigned long)(unsigned short)p->c))) << 8)
   | (((unsigned long)(unsigned char)p->f))) << 8)
   | ((unsigned long)(unsigned char)p->e);
}

if you disable the bswap pass.  Doing 4 byte stores in this
case would save some prefixes at least.  I expected the
ORs and shifts to have smaller encodings...

With 4 byte stores we end up with the same size as with
individual loads & stores.
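
For reference, the movq/movhps sequence suggested above corresponds
roughly to this intrinsics sketch; it is illustrative only, not part
of the patch, and copy16 is a made-up name:

#include <emmintrin.h>

/* Combine two 8-byte loads into one 16-byte store.  */
static inline void
copy16 (const double *src, double *dst)
{
  __m128d v = _mm_load_sd (src);   /* movsd: low 8 bytes of the block  */
  v = _mm_loadh_pd (v, src + 1);   /* movhpd: high 8 bytes             */
  _mm_storeu_pd (dst, v);          /* movupd: one 16-byte store        */
}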

> As said, the important part is that IIRC CPUs can usually
> have more loads in flight than stores.  Esp. Bulldozer
> with the split core was store buffer size limited (but it
> could do merging of store buffer entries IIRC).

Also if we do the stores in smaller chunks we are more
likely to hit the same store-to-load-forwarding issue
elsewhere, for example when the destination is memcpy'ed
away.

So the proposed change isn't necessarily a win; it may
introduce a regression similar to the one it tries to fix.

Whole-program analysis of accesses might allow
marking affected objects.

Richard.

> Richard.
>
>> Honza
>>>
>>> As said a general concern was you not copying padding.  If you
>>> put this into an even more common place you surely will break
>>> stuff, no?
>>>
>>> Richard.
>>>
Jan Hubicka Oct. 27, 2017, 12:19 p.m. UTC | #8
> On Thu, Oct 26, 2017 at 2:55 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
> >> I think the limit should be on the number of generated copies and not
> >> the overall size of the structure...  If the struct were composed of
> >> 32 individual chars we wouldn't want to emit 32 loads and 32 stores...
> >>
> >> I wonder how rep; movb; interacts with store to load forwarding?  Is
> >> that maybe optimized well on some archs?  movb should always
> >> forward and wasn't the setup cost for small N reasonable on modern
> >> CPUs?
> >
> > rep mov is a win over a loop for blocks over 128 bytes on Core, and for blocks
> > in range 24-128 on Zen.  This is w/o store/load forwarding, but I doubt those
> > provide a cheap way around.
> >
> >>
> >> It probably depends on the width of the entries in the store buffer,
> >> if they appear in-order and the alignment of the stores (if they are larger than
> >> 8 bytes they are surely aligned).  IIRC CPUs had smaller store buffer
> >> entries than cache line size.
> >>
> >> Given that load bandwith is usually higher than store bandwith it
> >> might make sense to do the store combining in our copying sequence,
> >> like for the 8 byte entry case use sth like
> >>
> >>   movq 0(%eax), %xmm0
> >>   movhps 8(%eax), %xmm0 // or vpinsert
> >>   mov[au]ps %xmm0, 0(%ebx)
> >> ...
> >>
> >> thus do two loads per store and perform the stores in wider
> >> mode?
> >
> > This may be somewhat faster indeed.  I am not sure if store to load
> > forwarding will work for the latter half when read again by halves.
> > It would not happen on older CPUs :)
> 
> Yes, forwarding larger stores to smaller loads generally works fine
> since forever with the usual restrictions of alignment/size being
> power of two "halves".
> 
> The question is of course what to do for 4 byte or smaller elements or
> mixed size elements.  We can do zero-extending loads
> (do we have them for QI, HI mode loads as well?) and
> do shift and or's.  I'm quite sure the CPUs wouldn't like to
> see vpinsert's of different vector mode destinations.  So it
> would be 8 byte stores from GPRs and values built up via
> shift & or.
> 
> As said, the important part is that IIRC CPUs can usually
> have more loads in flight than stores.  Esp. Bulldozer
> with the split core was store buffer size limited (but it
> could do merging of store buffer entries IIRC).

In a way this seems like an independent optimization to me
(store combining to help forwarding), because it can certainly
help user code which does not originate from a copy sequence.

Seems like something a bit tricky to implement on top of RTL,
though.
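
As a concrete sketch of such user code, consider this made-up
example, where four narrow stores could be merged into one wider
store by an RTL store-combining pass, independently of any
block-copy expansion:

/* pack32 is hypothetical, not from the patch or the thread.  */
void
pack32 (unsigned char *p, unsigned int v)
{
  p[0] = v;          /* four 1-byte stores ...              */
  p[1] = v >> 8;
  p[2] = v >> 16;
  p[3] = v >> 24;    /* ... combinable into a single movl.  */
}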

Honza
> 
> Richard.
> 
> > Honza
> >>
> >> As said a general concern was you not copying padding.  If you
> >> put this into an even more common place you surely will break
> >> stuff, no?
> >>
> >> Richard.
> >>
> >> >
> >> > Martin
> >> >
> >> >
> >> >>
> >> >> Richard.
> >> >>
> >> >> > Martin
> >> >> >
> >> >> >
> >> >> > 2017-10-12  Martin Jambor  <mjambor@suse.cz>
> >> >> >
> >> >> >         PR target/80689
> >> >> >         * tree-sra.h: New file.
> >> >> >         * ipa-prop.h: Moved declaration of build_ref_for_offset to
> >> >> >         tree-sra.h.
> >> >> >         * expr.c: Include params.h and tree-sra.h.
> >> >> >         (emit_move_elementwise): New function.
> >> >> >         (store_expr_with_bounds): Optionally use it.
> >> >> >         * ipa-cp.c: Include tree-sra.h.
> >> >> >         * params.def (PARAM_MAX_SIZE_FOR_ELEMENTWISE_COPY): New.
> >> >> >         * config/i386/i386.c (ix86_option_override_internal): Set
> >> >> >         PARAM_MAX_SIZE_FOR_ELEMENTWISE_COPY to 35.
> >> >> >         * tree-sra.c: Include tree-sra.h.
> >> >> >         (scalarizable_type_p): Renamed to
> >> >> >         simple_mix_of_records_and_arrays_p, made public, renamed the
> >> >> >         second parameter to allow_char_arrays.
> >> >> >         (extract_min_max_idx_from_array): New function.
> >> >> >         (completely_scalarize): Moved bits of the function to
> >> >> >         extract_min_max_idx_from_array.
> >> >> >
> >> >> >         testsuite/
> >> >> >         * gcc.target/i386/pr80689-1.c: New test.
> >> >> > ---
> >> >> >  gcc/config/i386/i386.c                    |   4 ++
> >> >> >  gcc/expr.c                                | 103 ++++++++++++++++++++++++++++--
> >> >> >  gcc/ipa-cp.c                              |   1 +
> >> >> >  gcc/ipa-prop.h                            |   4 --
> >> >> >  gcc/params.def                            |   6 ++
> >> >> >  gcc/testsuite/gcc.target/i386/pr80689-1.c |  38 +++++++++++
> >> >> >  gcc/tree-sra.c                            |  86 +++++++++++++++----------
> >> >> >  gcc/tree-sra.h                            |  33 ++++++++++
> >> >> >  8 files changed, 233 insertions(+), 42 deletions(-)
> >> >> >  create mode 100644 gcc/testsuite/gcc.target/i386/pr80689-1.c
> >> >> >  create mode 100644 gcc/tree-sra.h
> >> >> >
> >> >> > diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
> >> >> > index 1ee8351c21f..87f602e7ead 100644
> >> >> > --- a/gcc/config/i386/i386.c
> >> >> > +++ b/gcc/config/i386/i386.c
> >> >> > @@ -6511,6 +6511,10 @@ ix86_option_override_internal (bool main_args_p,
> >> >> >                          ix86_tune_cost->l2_cache_size,
> >> >> >                          opts->x_param_values,
> >> >> >                          opts_set->x_param_values);
> >> >> > +  maybe_set_param_value (PARAM_MAX_SIZE_FOR_ELEMENTWISE_COPY,
> >> >> > +                        35,
> >> >> > +                        opts->x_param_values,
> >> >> > +                        opts_set->x_param_values);
> >> >> >
> >> >> >    /* Enable sw prefetching at -O3 for CPUS that prefetching is helpful.  */
> >> >> >    if (opts->x_flag_prefetch_loop_arrays < 0
> >> >> > diff --git a/gcc/expr.c b/gcc/expr.c
> >> >> > index 134ee731c29..dff24e7f166 100644
> >> >> > --- a/gcc/expr.c
> >> >> > +++ b/gcc/expr.c
> >> >> > @@ -61,7 +61,8 @@ along with GCC; see the file COPYING3.  If not see
> >> >> >  #include "tree-chkp.h"
> >> >> >  #include "rtl-chkp.h"
> >> >> >  #include "ccmp.h"
> >> >> > -
> >> >> > +#include "params.h"
> >> >> > +#include "tree-sra.h"
> >> >> >
> >> >> >  /* If this is nonzero, we do not bother generating VOLATILE
> >> >> >     around volatile memory references, and we are willing to
> >> >> > @@ -5340,6 +5341,80 @@ emit_storent_insn (rtx to, rtx from)
> >> >> >    return maybe_expand_insn (code, 2, ops);
> >> >> >  }
> >> >> >
> >> >> > +/* Generate code for copying data of type TYPE at SOURCE plus OFFSET to TARGET
> >> >> > +   plus OFFSET, but do so element-wise and/or field-wise for each record and
> >> >> > +   array within TYPE.  TYPE must either be a register type or an aggregate
> >> >> > +   complying with scalarizable_type_p.
> >> >> > +
> >> >> > +   If CALL_PARAM_P is nonzero, this is a store into a call param on the
> >> >> > +   stack, and block moves may need to be treated specially.  */
> >> >> > +
> >> >> > +static void
> >> >> > +emit_move_elementwise (tree type, rtx target, rtx source, HOST_WIDE_INT offset,
> >> >> > +                      int call_param_p)
> >> >> > +{
> >> >> > +  switch (TREE_CODE (type))
> >> >> > +    {
> >> >> > +    case RECORD_TYPE:
> >> >> > +      for (tree fld = TYPE_FIELDS (type); fld; fld = DECL_CHAIN (fld))
> >> >> > +       if (TREE_CODE (fld) == FIELD_DECL)
> >> >> > +         {
> >> >> > +           HOST_WIDE_INT fld_offset = offset + int_bit_position (fld);
> >> >> > +           tree ft = TREE_TYPE (fld);
> >> >> > +           emit_move_elementwise (ft, target, source, fld_offset,
> >> >> > +                                  call_param_p);
> >> >> > +         }
> >> >> > +      break;
> >> >> > +
> >> >> > +    case ARRAY_TYPE:
> >> >> > +      {
> >> >> > +       tree elem_type = TREE_TYPE (type);
> >> >> > +       HOST_WIDE_INT el_size = tree_to_shwi (TYPE_SIZE (elem_type));
> >> >> > +       gcc_assert (el_size > 0);
> >> >> > +
> >> >> > +       offset_int idx, max;
> >> >> > +       /* Skip (some) zero-length arrays; others have MAXIDX == MINIDX - 1.  */
> >> >> > +       if (extract_min_max_idx_from_array (type, &idx, &max))
> >> >> > +         {
> >> >> > +           HOST_WIDE_INT el_offset = offset;
> >> >> > +           for (; idx <= max; ++idx)
> >> >> > +             {
> >> >> > +               emit_move_elementwise (elem_type, target, source, el_offset,
> >> >> > +                                      call_param_p);
> >> >> > +               el_offset += el_size;
> >> >> > +             }
> >> >> > +         }
> >> >> > +      }
> >> >> > +      break;
> >> >> > +    default:
> >> >> > +      machine_mode mode = TYPE_MODE (type);
> >> >> > +
> >> >> > +      rtx ntgt = adjust_address (target, mode, offset / BITS_PER_UNIT);
> >> >> > +      rtx nsrc = adjust_address (source, mode, offset / BITS_PER_UNIT);
> >> >> > +
> >> >> > +      /* TODO: Figure out whether the following is actually necessary.  */
> >> >> > +      if (target == ntgt)
> >> >> > +       ntgt = copy_rtx (target);
> >> >> > +      if (source == nsrc)
> >> >> > +       nsrc = copy_rtx (source);
> >> >> > +
> >> >> > +      gcc_assert (mode != VOIDmode);
> >> >> > +      if (mode != BLKmode)
> >> >> > +       emit_move_insn (ntgt, nsrc);
> >> >> > +      else
> >> >> > +       {
> >> >> > +         /* For example vector gimple registers can end up here.  */
> >> >> > +         rtx size = expand_expr (TYPE_SIZE_UNIT (type), NULL_RTX,
> >> >> > +                                 TYPE_MODE (sizetype), EXPAND_NORMAL);
> >> >> > +         emit_block_move (ntgt, nsrc, size,
> >> >> > +                          (call_param_p
> >> >> > +                           ? BLOCK_OP_CALL_PARM : BLOCK_OP_NORMAL));
> >> >> > +       }
> >> >> > +      break;
> >> >> > +    }
> >> >> > +  return;
> >> >> > +}
> >> >> > +
> >> >> >  /* Generate code for computing expression EXP,
> >> >> >     and storing the value into TARGET.
> >> >> >
> >> >> > @@ -5713,9 +5788,29 @@ store_expr_with_bounds (tree exp, rtx target, int call_param_p,
> >> >> >         emit_group_store (target, temp, TREE_TYPE (exp),
> >> >> >                           int_size_in_bytes (TREE_TYPE (exp)));
> >> >> >        else if (GET_MODE (temp) == BLKmode)
> >> >> > -       emit_block_move (target, temp, expr_size (exp),
> >> >> > -                        (call_param_p
> >> >> > -                         ? BLOCK_OP_CALL_PARM : BLOCK_OP_NORMAL));
> >> >> > +       {
> >> >> > +         /* Copying smallish BLKmode structures with emit_block_move and thus
> >> >> > +            by-pieces can result in store-to-load stalls.  So copy some simple
> >> >> > +            small aggregates element or field-wise.  */
> >> >> > +         if (GET_MODE (target) == BLKmode
> >> >> > +             && AGGREGATE_TYPE_P (TREE_TYPE (exp))
> >> >> > +             && !TREE_ADDRESSABLE (TREE_TYPE (exp))
> >> >> > +             && tree_fits_shwi_p (TYPE_SIZE (TREE_TYPE (exp)))
> >> >> > +             && (tree_to_shwi (TYPE_SIZE (TREE_TYPE (exp)))
> >> >> > +                 <= (PARAM_VALUE (PARAM_MAX_SIZE_FOR_ELEMENTWISE_COPY)
> >> >> > +                     * BITS_PER_UNIT))
> >> >> > +             && simple_mix_of_records_and_arrays_p (TREE_TYPE (exp), false))
> >> >> > +           {
> >> >> > +             /* FIXME:  Can this happen?  What would it mean?  */
> >> >> > +             gcc_assert (!reverse);
> >> >> > +             emit_move_elementwise (TREE_TYPE (exp), target, temp, 0,
> >> >> > +                                    call_param_p);
> >> >> > +           }
> >> >> > +         else
> >> >> > +           emit_block_move (target, temp, expr_size (exp),
> >> >> > +                            (call_param_p
> >> >> > +                             ? BLOCK_OP_CALL_PARM : BLOCK_OP_NORMAL));
> >> >> > +       }
> >> >> >        /* If we emit a nontemporal store, there is nothing else to do.  */
> >> >> >        else if (nontemporal && emit_storent_insn (target, temp))
> >> >> >         ;
> >> >> > diff --git a/gcc/ipa-cp.c b/gcc/ipa-cp.c
> >> >> > index 6b3d8d7364c..7d6019bbd30 100644
> >> >> > --- a/gcc/ipa-cp.c
> >> >> > +++ b/gcc/ipa-cp.c
> >> >> > @@ -124,6 +124,7 @@ along with GCC; see the file COPYING3.  If not see
> >> >> >  #include "tree-ssa-ccp.h"
> >> >> >  #include "stringpool.h"
> >> >> >  #include "attribs.h"
> >> >> > +#include "tree-sra.h"
> >> >> >
> >> >> >  template <typename valtype> class ipcp_value;
> >> >> >
> >> >> > diff --git a/gcc/ipa-prop.h b/gcc/ipa-prop.h
> >> >> > index fa5bed49ee0..2313cc884ed 100644
> >> >> > --- a/gcc/ipa-prop.h
> >> >> > +++ b/gcc/ipa-prop.h
> >> >> > @@ -877,10 +877,6 @@ ipa_parm_adjustment *ipa_get_adjustment_candidate (tree **, bool *,
> >> >> >  void ipa_release_body_info (struct ipa_func_body_info *);
> >> >> >  tree ipa_get_callee_param_type (struct cgraph_edge *e, int i);
> >> >> >
> >> >> > -/* From tree-sra.c:  */
> >> >> > -tree build_ref_for_offset (location_t, tree, HOST_WIDE_INT, bool, tree,
> >> >> > -                          gimple_stmt_iterator *, bool);
> >> >> > -
> >> >> >  /* In ipa-cp.c  */
> >> >> >  void ipa_cp_c_finalize (void);
> >> >> >
> >> >> > diff --git a/gcc/params.def b/gcc/params.def
> >> >> > index e55afc28053..5e19f1414a0 100644
> >> >> > --- a/gcc/params.def
> >> >> > +++ b/gcc/params.def
> >> >> > @@ -1294,6 +1294,12 @@ DEFPARAM (PARAM_VECT_EPILOGUES_NOMASK,
> >> >> >           "Enable loop epilogue vectorization using smaller vector size.",
> >> >> >           0, 0, 1)
> >> >> >
> >> >> > +DEFPARAM (PARAM_MAX_SIZE_FOR_ELEMENTWISE_COPY,
> >> >> > +         "max-size-for-elementwise-copy",
> >> >> > +         "Maximum size in bytes of a structure or array to by considered for "
> >> >> > +         "copying by its individual fields or elements",
> >> >> > +         0, 0, 512)
> >> >> > +
> >> >> >  /*
> >> >> >
> >> >> >  Local variables:
> >> >> > diff --git a/gcc/testsuite/gcc.target/i386/pr80689-1.c b/gcc/testsuite/gcc.target/i386/pr80689-1.c
> >> >> > new file mode 100644
> >> >> > index 00000000000..4156d4fba45
> >> >> > --- /dev/null
> >> >> > +++ b/gcc/testsuite/gcc.target/i386/pr80689-1.c
> >> >> > @@ -0,0 +1,38 @@
> >> >> > +/* { dg-do compile } */
> >> >> > +/* { dg-options "-O2" } */
> >> >> > +
> >> >> > +typedef struct st1
> >> >> > +{
> >> >> > +        long unsigned int a,b;
> >> >> > +        long int c,d;
> >> >> > +}R;
> >> >> > +
> >> >> > +typedef struct st2
> >> >> > +{
> >> >> > +        int  t;
> >> >> > +        R  reg;
> >> >> > +}N;
> >> >> > +
> >> >> > +void Set (const R *region,  N *n_info );
> >> >> > +
> >> >> > +void test(N  *n_obj ,const long unsigned int a, const long unsigned int b,  const long int c,const long int d)
> >> >> > +{
> >> >> > +        R reg;
> >> >> > +
> >> >> > +        reg.a=a;
> >> >> > +        reg.b=b;
> >> >> > +        reg.c=c;
> >> >> > +        reg.d=d;
> >> >> > +        Set (&reg, n_obj);
> >> >> > +
> >> >> > +}
> >> >> > +
> >> >> > +void Set (const R *reg,  N *n_obj )
> >> >> > +{
> >> >> > +        n_obj->reg=(*reg);
> >> >> > +}
> >> >> > +
> >> >> > +
> >> >> > +/* { dg-final { scan-assembler-not "%(x|y|z)mm\[0-9\]+" } } */
> >> >> > +/* { dg-final { scan-assembler-not "movdqu" } } */
> >> >> > +/* { dg-final { scan-assembler-not "movups" } } */
> >> >> > diff --git a/gcc/tree-sra.c b/gcc/tree-sra.c
> >> >> > index bac593951e7..ade97964205 100644
> >> >> > --- a/gcc/tree-sra.c
> >> >> > +++ b/gcc/tree-sra.c
> >> >> > @@ -104,6 +104,7 @@ along with GCC; see the file COPYING3.  If not see
> >> >> >  #include "ipa-fnsummary.h"
> >> >> >  #include "ipa-utils.h"
> >> >> >  #include "builtins.h"
> >> >> > +#include "tree-sra.h"
> >> >> >
> >> >> >  /* Enumeration of all aggregate reductions we can do.  */
> >> >> >  enum sra_mode { SRA_MODE_EARLY_IPA,   /* early call regularization */
> >> >> > @@ -952,14 +953,14 @@ create_access (tree expr, gimple *stmt, bool write)
> >> >> >  }
> >> >> >
> >> >> >
> >> >> > -/* Return true iff TYPE is scalarizable - i.e. a RECORD_TYPE or fixed-length
> >> >> > -   ARRAY_TYPE with fields that are either of gimple register types (excluding
> >> >> > -   bit-fields) or (recursively) scalarizable types.  CONST_DECL must be true if
> >> >> > -   we are considering a decl from constant pool.  If it is false, char arrays
> >> >> > -   will be refused.  */
> >> >> > +/* Return true if TYPE consists of RECORD_TYPE or fixed-length ARRAY_TYPE with
> >> >> > +   fields/elements that are not bit-fields and are either register types or
> >> >> > +   recursively comply with simple_mix_of_records_and_arrays_p.  Furthermore, if
> >> >> > +   ALLOW_CHAR_ARRAYS is false, the function will also return false if TYPE
> >> >> > +   contains an array whose elements are only one byte in size.  */
> >> >> >
> >> >> > -static bool
> >> >> > -scalarizable_type_p (tree type, bool const_decl)
> >> >> > +bool
> >> >> > +simple_mix_of_records_and_arrays_p (tree type, bool allow_char_arrays)
> >> >> >  {
> >> >> >    gcc_assert (!is_gimple_reg_type (type));
> >> >> >    if (type_contains_placeholder_p (type))
> >> >> > @@ -977,7 +978,7 @@ scalarizable_type_p (tree type, bool const_decl)
> >> >> >             return false;
> >> >> >
> >> >> >           if (!is_gimple_reg_type (ft)
> >> >> > -             && !scalarizable_type_p (ft, const_decl))
> >> >> > +             && !simple_mix_of_records_and_arrays_p (ft, allow_char_arrays))
> >> >> >             return false;
> >> >> >         }
> >> >> >
> >> >> > @@ -986,7 +987,7 @@ scalarizable_type_p (tree type, bool const_decl)
> >> >> >    case ARRAY_TYPE:
> >> >> >      {
> >> >> >        HOST_WIDE_INT min_elem_size;
> >> >> > -      if (const_decl)
> >> >> > +      if (allow_char_arrays)
> >> >> >         min_elem_size = 0;
> >> >> >        else
> >> >> >         min_elem_size = BITS_PER_UNIT;
> >> >> > @@ -1008,7 +1009,7 @@ scalarizable_type_p (tree type, bool const_decl)
> >> >> >
> >> >> >        tree elem = TREE_TYPE (type);
> >> >> >        if (!is_gimple_reg_type (elem)
> >> >> > -         && !scalarizable_type_p (elem, const_decl))
> >> >> > +         && !simple_mix_of_records_and_arrays_p (elem, allow_char_arrays))
> >> >> >         return false;
> >> >> >        return true;
> >> >> >      }
> >> >> > @@ -1017,10 +1018,38 @@ scalarizable_type_p (tree type, bool const_decl)
> >> >> >    }
> >> >> >  }
> >> >> >
> >> >> > -static void scalarize_elem (tree, HOST_WIDE_INT, HOST_WIDE_INT, bool, tree, tree);
> >> >> > +static void scalarize_elem (tree, HOST_WIDE_INT, HOST_WIDE_INT, bool, tree,
> >> >> > +                           tree);
> >> >> > +
> >> >> > +/* For a given array TYPE, return false if its domain does not have any maximum
> >> >> > +   value.  Otherwise calculate MIN and MAX indices of the first and the last
> >> >> > +   element.  */
> >> >> > +
> >> >> > +bool
> >> >> > +extract_min_max_idx_from_array (tree type, offset_int *min, offset_int *max)
> >> >> > +{
> >> >> > +  tree domain = TYPE_DOMAIN (type);
> >> >> > +  tree minidx = TYPE_MIN_VALUE (domain);
> >> >> > +  gcc_assert (TREE_CODE (minidx) == INTEGER_CST);
> >> >> > +  tree maxidx = TYPE_MAX_VALUE (domain);
> >> >> > +  if (!maxidx)
> >> >> > +    return false;
> >> >> > +  gcc_assert (TREE_CODE (maxidx) == INTEGER_CST);
> >> >> > +
> >> >> > +  /* MINIDX and MAXIDX are inclusive, and must be interpreted in
> >> >> > +     DOMAIN (e.g. signed int, whereas min/max may be size_int).  */
> >> >> > +  *min = wi::to_offset (minidx);
> >> >> > +  *max = wi::to_offset (maxidx);
> >> >> > +  if (!TYPE_UNSIGNED (domain))
> >> >> > +    {
> >> >> > +      *min = wi::sext (*min, TYPE_PRECISION (domain));
> >> >> > +      *max = wi::sext (*max, TYPE_PRECISION (domain));
> >> >> > +    }
> >> >> > +  return true;
> >> >> > +}
> >> >> >
> >> >> >  /* Create total_scalarization accesses for all scalar fields of a member
> >> >> > -   of type DECL_TYPE conforming to scalarizable_type_p.  BASE
> >> >> > +   of type DECL_TYPE conforming to simple_mix_of_records_and_arrays_p.  BASE
> >> >> >     must be the top-most VAR_DECL representing the variable; within that,
> >> >> >     OFFSET locates the member and REF must be the memory reference expression for
> >> >> >     the member.  */
> >> >> > @@ -1047,27 +1076,14 @@ completely_scalarize (tree base, tree decl_type, HOST_WIDE_INT offset, tree ref)
> >> >> >        {
> >> >> >         tree elemtype = TREE_TYPE (decl_type);
> >> >> >         tree elem_size = TYPE_SIZE (elemtype);
> >> >> > -       gcc_assert (elem_size && tree_fits_shwi_p (elem_size));
> >> >> >         HOST_WIDE_INT el_size = tree_to_shwi (elem_size);
> >> >> >         gcc_assert (el_size > 0);
> >> >> >
> >> >> > -       tree minidx = TYPE_MIN_VALUE (TYPE_DOMAIN (decl_type));
> >> >> > -       gcc_assert (TREE_CODE (minidx) == INTEGER_CST);
> >> >> > -       tree maxidx = TYPE_MAX_VALUE (TYPE_DOMAIN (decl_type));
> >> >> > +       offset_int idx, max;
> >> >> >         /* Skip (some) zero-length arrays; others have MAXIDX == MINIDX - 1.  */
> >> >> > -       if (maxidx)
> >> >> > +       if (extract_min_max_idx_from_array (decl_type, &idx, &max))
> >> >> >           {
> >> >> > -           gcc_assert (TREE_CODE (maxidx) == INTEGER_CST);
> >> >> >             tree domain = TYPE_DOMAIN (decl_type);
> >> >> > -           /* MINIDX and MAXIDX are inclusive, and must be interpreted in
> >> >> > -              DOMAIN (e.g. signed int, whereas min/max may be size_int).  */
> >> >> > -           offset_int idx = wi::to_offset (minidx);
> >> >> > -           offset_int max = wi::to_offset (maxidx);
> >> >> > -           if (!TYPE_UNSIGNED (domain))
> >> >> > -             {
> >> >> > -               idx = wi::sext (idx, TYPE_PRECISION (domain));
> >> >> > -               max = wi::sext (max, TYPE_PRECISION (domain));
> >> >> > -             }
> >> >> >             for (int el_off = offset; idx <= max; ++idx)
> >> >> >               {
> >> >> >                 tree nref = build4 (ARRAY_REF, elemtype,
> >> >> > @@ -1088,10 +1104,10 @@ completely_scalarize (tree base, tree decl_type, HOST_WIDE_INT offset, tree ref)
> >> >> >  }
> >> >> >
> >> >> >  /* Create total_scalarization accesses for a member of type TYPE, which must
> >> >> > -   satisfy either is_gimple_reg_type or scalarizable_type_p.  BASE must be the
> >> >> > -   top-most VAR_DECL representing the variable; within that, POS and SIZE locate
> >> >> > -   the member, REVERSE gives its torage order. and REF must be the reference
> >> >> > -   expression for it.  */
> >> >> > +   satisfy either is_gimple_reg_type or simple_mix_of_records_and_arrays_p.
> >> >> > +   BASE must be the top-most VAR_DECL representing the variable; within that,
> >> >> > +   POS and SIZE locate the member, REVERSE gives its storage order, and REF must
> >> >> > +   be the reference expression for it.  */
> >> >> >
> >> >> >  static void
> >> >> >  scalarize_elem (tree base, HOST_WIDE_INT pos, HOST_WIDE_INT size, bool reverse,
> >> >> > @@ -1111,7 +1127,8 @@ scalarize_elem (tree base, HOST_WIDE_INT pos, HOST_WIDE_INT size, bool reverse,
> >> >> >  }
> >> >> >
> >> >> >  /* Create a total_scalarization access for VAR as a whole.  VAR must be of a
> >> >> > -   RECORD_TYPE or ARRAY_TYPE conforming to scalarizable_type_p.  */
> >> >> > +   RECORD_TYPE or ARRAY_TYPE conforming to
> >> >> > +   simple_mix_of_records_and_arrays_p.  */
> >> >> >
> >> >> >  static void
> >> >> >  create_total_scalarization_access (tree var)
> >> >> > @@ -2803,8 +2820,9 @@ analyze_all_variable_accesses (void)
> >> >> >        {
> >> >> >         tree var = candidate (i);
> >> >> >
> >> >> > -       if (VAR_P (var) && scalarizable_type_p (TREE_TYPE (var),
> >> >> > -                                               constant_decl_p (var)))
> >> >> > +       if (VAR_P (var)
> >> >> > +           && simple_mix_of_records_and_arrays_p (TREE_TYPE (var),
> >> >> > +                                                  constant_decl_p (var)))
> >> >> >           {
> >> >> >             if (tree_to_uhwi (TYPE_SIZE (TREE_TYPE (var)))
> >> >> >                 <= max_scalarization_size)
> >> >> > diff --git a/gcc/tree-sra.h b/gcc/tree-sra.h
> >> >> > new file mode 100644
> >> >> > index 00000000000..dc901385994
> >> >> > --- /dev/null
> >> >> > +++ b/gcc/tree-sra.h
> >> >> > @@ -0,0 +1,33 @@
> >> >> > +/* tree-sra.h - Scalar Replacement of Aggregates declarations.
> >> >> > +   Copyright (C) 2017 Free Software Foundation, Inc.
> >> >> > +
> >> >> > +This file is part of GCC.
> >> >> > +
> >> >> > +GCC is free software; you can redistribute it and/or modify it under
> >> >> > +the terms of the GNU General Public License as published by the Free
> >> >> > +Software Foundation; either version 3, or (at your option) any later
> >> >> > +version.
> >> >> > +
> >> >> > +GCC is distributed in the hope that it will be useful, but WITHOUT ANY
> >> >> > +WARRANTY; without even the implied warranty of MERCHANTABILITY or
> >> >> > +FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
> >> >> > +for more details.
> >> >> > +
> >> >> > +You should have received a copy of the GNU General Public License
> >> >> > +along with GCC; see the file COPYING3.  If not see
> >> >> > +<http://www.gnu.org/licenses/>.  */
> >> >> > +
> >> >> > +#ifndef TREE_SRA_H
> >> >> > +#define TREE_SRA_H
> >> >> > +
> >> >> > +
> >> >> > +bool simple_mix_of_records_and_arrays_p (tree type, bool allow_char_arrays);
> >> >> > +bool extract_min_max_idx_from_array (tree type, offset_int *idx,
> >> >> > +                                    offset_int *max);
> >> >> > +tree build_ref_for_offset (location_t loc, tree base, HOST_WIDE_INT offset,
> >> >> > +                          bool reverse, tree exp_type,
> >> >> > +                          gimple_stmt_iterator *gsi, bool insert_after);
> >> >> > +
> >> >> > +
> >> >> > +
> >> >> > +#endif /* TREE_SRA_H */
> >> >> > --
> >> >> > 2.14.1
> >> >> >
Richard Biener Nov. 13, 2017, 12:23 p.m. UTC | #9
On Fri, Nov 3, 2017 at 5:38 PM, Martin Jambor <mjambor@suse.cz> wrote:
> Hi,
>
> On Thu, Oct 26, 2017 at 02:43:02PM +0200, Richard Biener wrote:
>> On Thu, Oct 26, 2017 at 2:18 PM, Martin Jambor <mjambor@suse.cz> wrote:
>> >
> > Nevertheless, I still intend to experiment with the limit; I sent out
>> > this RFC exactly so that I don't spend a lot of time benchmarking
>> > something that is eventually not deemed acceptable on principle.
>>
>> I think the limit should be on the number of generated copies and not
>> the overall size of the structure...  If the struct were composed of
>> 32 individual chars we wouldn't want to emit 32 loads and 32 stores...
>
> I have added another parameter to also limit the number of generated
> element copies.  I have kept the size limit so that we don't even
> attempt to count them for large structures.
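>
> For illustration, a minimal sketch of how the two limits interact
> under the proposed defaults (--param max-size-for-elementwise-copy=35,
> --param max-insns-for-elementwise-copy=6):
>
>   struct ok  { long a, b, c, d; }; /* 32 bytes, 4 element copies: both
>                                       limits hold, copied element-wise.  */
>   struct big { short s[16]; };     /* also 32 bytes, but 16 element
>                                       copies: over the instruction limit,
>                                       so emit_block_move is used.  */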
>
>> Given that load bandwidth is usually higher than store bandwidth it
>> might make sense to do the store combining in our copying sequence,
>> like for the 8 byte entry case use sth like
>>
>>   movq 0(%eax), %xmm0
>>   movhps 8(%eax), %xmm0 // or vpinsert
>>   mov[au]ps %xmm0, 0(%ebx)
>
> I would be concerned about the cost of GPR->XMM moves when the value
> being stored is in a GPR, especially with generic tuning, which (with
> -O2) is the main thing I am targeting here.  Wouldn't we actually pass
> it through the stack with all the associated penalties?
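>
> For concreteness, a minimal sketch of the direct GPR->XMM variant
> (assuming SSE4.1 so that pinsrq is available; without such direct
> moves the values would presumably bounce through a stack slot):
>
>   movq    %rax, %xmm0        # GPR -> XMM, low 8 bytes
>   pinsrq  $1, %rdx, %xmm0    # GPR -> XMM, high 8 bytes
>   movups  %xmm0, (%rbx)      # one 16-byte store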
>
> Also, while such store combining might work for ImageMagick, if a
> programmer did:
>
> region1->x = x1;
> region2->x = x2;
> region1->y = 0;
> region2->y = 20;
> ...
> SetPixelCacheNexusPixels(cache_info, ReadMode, region1, ...)
>
> The transformation would not work unless it could prove region1 and
> region2 are not the same thing.
>
>> As said, a general concern was you not copying padding.  If you
>> put this into an even more common place you surely will break
>> stuff, no?
>
> I don't understand: what even more common place do you mean?
>
> I have been testing the patch also on a bunch of other architectures
> and those have tests in their testsuites that check that padding is
> copied; for example, some tests in gcc.target/aarch64/aapcs64/ check
> whether a structure passed to a function is bitwise identical to the
> original, and those tests fail because of padding.  That is the only
> "breakage" I know about, but I believe that the assumption that padding
> must always be copied is wrong (if it is not, then we need to make SRA
> quite a bit more conservative).

The main concern here is that GIMPLE is not very well defined for
aggregate copies and that gimple-fold.c happily optimizes
memcpy (&a, &b, sizeof (a)) into a = b;

struct A { short s; long i; long j; };
struct A a, b;
void foo ()
{
  __builtin_memcpy (&a, &b, sizeof (struct A));
}

gets folded to

  MEM[(char * {ref-all})&a] = MEM[(char * {ref-all})&b];
  return;

you see we're careful about TBAA but (you don't see that above, though
it can be verified by, for example, debugging expand_assignment)
TREE_TYPE (MEM[...]) is actually 'struct A'.

And yes, I've been worried about SRA as well here...  it _does_
have some early outs when seeing VIEW_CONVERT_EXPR but
apparently not for the above.  Testcase that aborts with SRA but
not without (it stashes a value in the structure padding and reads it
back after the copies, so dropping the padding loses the data):

struct A { short s; long i; long j; };
struct A a, b;
void foo ()
{
  struct A c;
  __builtin_memcpy (&c, &b, sizeof (struct A));
  __builtin_memcpy (&a, &c, sizeof (struct A));
}
int main()
{
  __builtin_memset (&b, 0, sizeof (struct A));
  b.s = 1;
  __builtin_memcpy ((char *)&b+2, &b, 2);  /* stash b.s in b's padding */
  foo ();
  __builtin_memcpy (&a, (char *)&a+2, 2);  /* read it back via a's padding */
  if (a.s != 1)
    __builtin_abort ();
  return 0;
}

> On Thu, Oct 26, 2017 at 05:09:42PM +0200, Richard Biener wrote:
>> Also if we do the stores in smaller chunks we are more
>> likely to hit the same store-to-load-forwarding issue
>> elsewhere, like in the case where the destination is
>> memcpy'ed away.
>>
>> So the proposed change isn't necessarily a win without
>> a possible similar regression that it tries to fix.
>>
>
> With some encouragement from Honza, I have done some benchmarking
> anyway and I did not see anything of that kind.

The regression would be visible when the aggregate copy is followed by
SLP-vectorized code, for example.  Then we'd get a vector load in,
say, V4SImode but had earlier done four SImode stores -> STLF issue
again.  Copying via xmm registers would have made perfect forwarding
possible.

I'm not saying you'll hit this in SPEC, just that it's easy to
construct a case that didn't have an STLF issue before but has one
after the "fix".

So the fix is to _not_ split the stores but only the loads ... unless
you can do sophisticated analysis of the context.

That said, splitting the loads is fine if the CPU can handle enough loads
in flight, etc., but splitting stores is dangerous (and CPU resources on
the store side are usually more limited).
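
A minimal sketch of the hazard (hypothetical, not a case from the PR):
if the aggregate copy below is expanded as four SImode stores, an
SLP-vectorized V4SImode load of y.v in the loop hits the
store-to-load-forwarding penalty that a single vector store would have
avoided:

  struct S { int v[4]; } x, y;
  int out[4];

  void bar (void)
  {
    y = x;                       /* four SImode stores if element-wise,
                                    one V4SImode store otherwise */
    for (int i = 0; i < 4; i++)  /* likely vectorized into a V4SImode
                                    load of y.v */
      out[i] = y.v[i] + 1;
  }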

>> Whole-program analysis of accesses might allow
>> marking affected objects.
>
> Attempting to save access patterns before IPA and then tracking them
> and keeping them in sync across inlining and all late gimple passes seems
> like a nightmarish task.  If this approach is indeed rejected I might
> attempt to do the store combining but a WPA analysis seems just too
> complex.

Ok.

> Anyway, here are the numbers.  They were taken on two different
> Zen-based machines.  I am also in the process of measuring at least
> something on a Haswell machine but I started later and the machine is
> quite a bit slower, so I will not have the numbers until next week (and
> will not have all the equivalent runs in any case).  I found out I do
> not have access to any more modern .*Lake Intel CPU.
>
> trunk is pristine trunk revision 254205.  All benchmarks were run
> three times and the median was chosen.
>
> s or strict means the patch with the strictest possible settings to
> speed up ImageMagick, i.e. --param max-size-for-elementwise-copy=32
> --param max-insns-for-elementwise-copy=4.  Also run three times.
>
> x1 is patched trunk with the parameters having the default values I was
> going to propose, i.e. --param max-size-for-elementwise-copy=35
> --param max-insns-for-elementwise-copy=6.  Also run three times.
>
> I then increased the parameters, in search of further missed
> opportunities and to see what would start to regress, and how soon.
> x2 is roughly twice that, --param max-size-for-elementwise-copy=67
> --param max-insns-for-elementwise-copy=12.  Run twice, outliers
> manually checked.
>
> x4 is roughly four times x1, namely --param max-size-for-elementwise-copy=143
> --param max-insns-for-elementwise-copy=24.  Run only once.
>
> The times below are of course "non-reportable," for a whole bunch of
> reasons.
>
>
> Zen SPECINT 2006  -O2 generic tuning
> ====================================
>
>  Run-time
>  --------
>
> | Benchmark      | trunk |   s |     % |  x1 |     % |  x2 |     % |  x4 |     % |
> |----------------+-------+-----+-------+-----+-------+-----+-------+-----+-------|
> | 400.perlbench  |   237 | 236 | -0.42 | 236 | -0.42 | 238 | +0.42 | 237 | +0.00 |
> | 401.bzip2      |   341 | 342 | +0.29 | 341 | +0.00 | 341 | +0.00 | 341 | +0.00 |
> | 403.gcc        |   217 | 217 | +0.00 | 217 | +0.00 | 216 | -0.46 | 217 | +0.00 |
> | 429.mcf        |   224 | 218 | -2.68 | 223 | -0.45 | 221 | -1.34 | 226 | +0.89 |
> | 445.gobmk      |   361 | 361 | +0.00 | 361 | +0.00 | 360 | -0.28 | 363 | +0.55 |
> | 456.hmmer      |   296 | 296 | +0.00 | 296 | +0.00 | 297 | +0.34 | 296 | +0.00 |
> | 458.sjeng      |   453 | 452 | -0.22 | 454 | +0.22 | 454 | +0.22 | 460 | +1.55 |
> | 462.libquantum |   289 | 289 | +0.00 | 291 | +0.69 | 289 | +0.00 | 291 | +0.69 |
> | 464.h264ref    |   391 | 391 | +0.00 | 385 | -1.53 | 385 | -1.53 | 385 | -1.53 |
> | 471.omnetpp    |   269 | 255 | -5.20 | 250 | -7.06 | 247 | -8.18 | 268 | -0.37 |
> | 473.astar      |   320 | 321 | +0.31 | 317 | -0.94 | 320 | +0.00 | 320 | +0.00 |
> | 483.xalancbmk  |   187 | 188 | +0.53 | 188 | +0.53 | 187 | +0.00 | 187 | +0.00 |
>
> Although the omnetpp result looks like a sizeable improvement, I should
> warn that this is one of the few slightly jumpy benchmarks.  However, I
> re-ran it a few more times and it seems to be jumping around a lower
> value when compiled with the patched compiler.  It might not be the
> full 5-8%, though.
>
>  Text size
>  ---------
>
> | Benchmark      |   trunk |  strict |     % |      x1 |     % |      x2 |     % |      x4 |     % |
> |----------------+---------+---------+-------+---------+-------+---------+-------+---------+-------|
> | 400.perlbench  |  875874 |  875954 | +0.01 |  875954 | +0.01 |  876018 | +0.02 |  876146 | +0.03 |
> | 401.bzip2      |   44754 |   44754 | +0.00 |   44754 | +0.00 |   44754 | +0.00 |   44754 | +0.00 |
> | 403.gcc        | 2294466 | 2294930 | +0.02 | 2296098 | +0.07 | 2296306 | +0.08 | 2296466 | +0.09 |
> | 429.mcf        |    8226 |    8226 | +0.00 |    8226 | +0.00 |    8258 | +0.39 |    8258 | +0.39 |
> | 445.gobmk      |  579778 |  579778 | +0.00 |  579826 | +0.01 |  579826 | +0.01 |  580402 | +0.11 |
> | 456.hmmer      |  221058 |  221058 | +0.00 |  221058 | +0.00 |  221058 | +0.00 |  221058 | +0.00 |
> | 458.sjeng      |   93362 |   93362 | +0.00 |   94882 | +1.63 |   94882 | +1.63 |   96066 | +2.90 |
> | 462.libquantum |   28314 |   28314 | +0.00 |   28362 | +0.17 |   28362 | +0.17 |   28362 | +0.17 |
> | 464.h264ref    |  393874 |  393874 | +0.00 |  393922 | +0.01 |  393922 | +0.01 |  394226 | +0.09 |
> | 471.omnetpp    |  430306 |  430306 | +0.00 |  430418 | +0.03 |  430418 | +0.03 |  430418 | +0.03 |
> | 473.astar      |   29362 |   29538 | +0.60 |   29538 | +0.60 |   29554 | +0.65 |   29554 | +0.65 |
> | 483.xalancbmk  | 2361298 | 2361506 | +0.01 | 2361506 | +0.01 | 2361506 | +0.01 | 2361506 | +0.01 |
>
>
>
> Zen SPECINT 2006  -Ofast native tuning
> ======================================
>
>  Run-time
>  --------
>
> | Benchmark      | trunk |   s |     % |  x1 |     % |  x2 |     % |  x4 |     % |
> |----------------+-------+-----+-------+-----+-------+-----+-------+-----+-------|
> | 400.perlbench  |   240 | 239 | -0.42 | 239 | -0.42 | 241 | +0.42 | 238 | -0.83 |
> | 401.bzip2      |   341 | 341 | +0.00 | 341 | +0.00 | 341 | +0.00 | 340 | -0.29 |
> | 403.gcc        |   210 | 208 | -0.95 | 207 | -1.43 | 209 | -0.48 | 208 | -0.95 |
> | 429.mcf        |   225 | 225 | +0.00 | 225 | +0.00 | 228 | +1.33 | 226 | +0.44 |
> | 445.gobmk      |   352 | 352 | +0.00 | 352 | +0.00 | 351 | -0.28 | 352 | +0.00 |
> | 456.hmmer      |   131 | 131 | +0.00 | 131 | +0.00 | 131 | +0.00 | 131 | +0.00 |
> | 458.sjeng      |   442 | 442 | +0.00 | 438 | -0.90 | 438 | -0.90 | 437 | -1.13 |
> | 462.libquantum |   291 | 292 | +0.34 | 286 | -1.72 | 287 | -1.37 | 287 | -1.37 |
> | 464.h264ref    |   364 | 365 | +0.27 | 364 | +0.00 | 364 | +0.00 | 363 | -0.27 |
> | 471.omnetpp    |   266 | 266 | +0.00 | 265 | -0.38 | 265 | -0.38 | 265 | -0.38 |
> | 473.astar      |   306 | 307 | +0.33 | 306 | +0.00 | 306 | +0.00 | 309 | +0.98 |
> | 483.xalancbmk  |   177 | 173 | -2.26 | 170 | -3.95 | 170 | -3.95 | 170 | -3.95 |
>
>  Text size
>  ---------
>
> | Benchmark      |   trunk |  strict |     % |      x1 |     % |      x2 |     % |      x4 |     % |
> |----------------+---------+---------+-------+---------+-------+---------+-------+---------+-------|
> | 400.perlbench  | 1161762 | 1161874 | +0.01 | 1161874 | +0.01 | 1162226 | +0.04 | 1162338 | +0.05 |
> | 401.bzip2      |   80834 |   80834 | +0.00 |   80834 | +0.00 |   80834 | +0.00 |   80834 | +0.00 |
> | 403.gcc        | 3170946 | 3171394 | +0.01 | 3172914 | +0.06 | 3173170 | +0.07 | 3174818 | +0.12 |
> | 429.mcf        |   10418 |   10418 | +0.00 |   10418 | +0.00 |   10450 | +0.31 |   10450 | +0.31 |
> | 445.gobmk      |  779778 |  779778 | +0.00 |  779842 | +0.01 |  779842 | +0.01 |  780418 | +0.08 |
> | 456.hmmer      |  328258 |  328258 | +0.00 |  328258 | +0.00 |  328258 | +0.00 |  328258 | +0.00 |
> | 458.sjeng      |  146386 |  146386 | +0.00 |  148162 | +1.21 |  148162 | +1.21 |  149330 | +2.01 |
> | 462.libquantum |   30666 |   30666 | +0.00 |   30730 | +0.21 |   30730 | +0.21 |   30730 | +0.21 |
> | 464.h264ref    |  737826 |  737826 | +0.00 |  737890 | +0.01 |  737890 | +0.01 |  739186 | +0.18 |
> | 471.omnetpp    |  561570 |  561570 | +0.00 |  561826 | +0.05 |  561826 | +0.05 |  561826 | +0.05 |
> | 473.astar      |   39314 |   39522 | +0.53 |   39522 | +0.53 |   39538 | +0.57 |   39538 | +0.57 |
> | 483.xalancbmk  | 3319682 | 3319842 | +0.00 | 3319842 | +0.00 | 3319842 | +0.00 | 3319842 | +0.00 |
>
>
>
> Zen SPECFP 2006 -O2 generic tuning
> ==================================
>
>  Run-time
>  --------
>
> | Benchmark     | trunk |   s |     % |  x1 |     % |  x2 |     % |  x4 |     % |
> |---------------+-------+-----+-------+-----+-------+-----+-------+-----+-------|
> | 410.bwaves    |   214 | 213 | -0.47 | 214 | +0.00 | 214 | +0.00 | 214 | +0.00 |
> | 433.milc      |   290 | 291 | +0.34 | 290 | +0.00 | 295 | +1.72 | 289 | -0.34 |
> | 434.zeusmp    |   182 | 182 | +0.00 | 182 | +0.00 | 184 | +1.10 | 182 | +0.00 |
> | 435.gromacs   |   218 | 218 | +0.00 | 217 | -0.46 | 216 | -0.92 | 220 | +0.92 |
> | 436.cactusADM |   350 | 349 | -0.29 | 349 | -0.29 | 343 | -2.00 | 349 | -0.29 |
> | 437.leslie3d  |   196 | 195 | -0.51 | 196 | +0.00 | 194 | -1.02 | 196 | +0.00 |
> | 444.namd      |   273 | 273 | +0.00 | 273 | +0.00 | 273 | +0.00 | 273 | +0.00 |
> | 447.dealII    |   211 | 211 | +0.00 | 210 | -0.47 | 210 | -0.47 | 211 | +0.00 |
> | 450.soplex    |   187 | 188 | +0.53 | 188 | +0.53 | 187 | +0.00 | 187 | +0.00 |
> | 453.povray    |   119 | 118 | -0.84 | 119 | +0.00 | 119 | +0.00 | 118 | -0.84 |
> | 454.calculix  |   534 | 533 | -0.19 | 531 | -0.56 | 531 | -0.56 | 532 | -0.37 |
> | 459.GemsFDTD  |   236 | 235 | -0.42 | 235 | -0.42 | 242 | +2.54 | 237 | +0.42 |
> | 465.tonto     |   366 | 365 | -0.27 | 365 | -0.27 | 364 | -0.55 | 365 | -0.27 |
> | 470.lbm       |   181 | 180 | -0.55 | 180 | -0.55 | 180 | -0.55 | 180 | -0.55 |
> | 481.wrf       |   303 | 303 | +0.00 | 302 | -0.33 | 304 | +0.33 | 304 | +0.33 |
> | 482.sphinx3   |   362 | 362 | +0.00 | 360 | -0.55 | 361 | -0.28 | 363 | +0.28 |
>
>  Text size
>  ---------
>
> | Benchmark     |   trunk |  strict |     % |      x1 |     % |      x2 |     % |      x4 |     % |
> |---------------+---------+---------+-------+---------+-------+---------+-------+---------+-------|
> | 410.bwaves    |   25954 |   25954 | +0.00 |   25954 | +0.00 |   25954 | +0.00 |   25954 | +0.00 |
> | 433.milc      |   87922 |   87922 | +0.00 |   87922 | +0.00 |   88610 | +0.78 |   89042 | +1.27 |
> | 434.zeusmp    |  212034 |  212034 | +0.00 |  212034 | +0.00 |  212034 | +0.00 |  212034 | +0.00 |
> | 435.gromacs   |  747026 |  747026 | +0.00 |  747026 | +0.00 |  747026 | +0.00 |  747026 | +0.00 |
> | 436.cactusADM |  526178 |  526178 | +0.00 |  526178 | +0.00 |  526274 | +0.02 |  526274 | +0.02 |
> | 437.leslie3d  |   83234 |   83234 | +0.00 |   83234 | +0.00 |   83234 | +0.00 |   83234 | +0.00 |
> | 444.namd      |  297234 |  297266 | +0.01 |  297266 | +0.01 |  297266 | +0.01 |  297266 | +0.01 |
> | 447.dealII    | 2165282 | 2167650 | +0.11 | 2172290 | +0.32 | 2174034 | +0.40 | 2174082 | +0.41 |
> | 450.soplex    |  347122 |  347122 | +0.00 |  347122 | +0.00 |  347122 | +0.00 |  347122 | +0.00 |
> | 453.povray    |  800914 |  800962 | +0.01 |  801570 | +0.08 |  802002 | +0.14 |  803138 | +0.28 |
> | 454.calculix  | 1342802 | 1342802 | +0.00 | 1342802 | +0.00 | 1342802 | +0.00 | 1342802 | +0.00 |
> | 459.GemsFDTD  |  353410 |  354050 | +0.18 |  354050 | +0.18 |  354050 | +0.18 |  354098 | +0.19 |
> | 465.tonto     | 3464210 | 3465058 | +0.02 | 3465058 | +0.02 | 3468434 | +0.12 | 3476594 | +0.36 |
> | 470.lbm       |    9202 |    9202 | +0.00 |    9202 | +0.00 |    9202 | +0.00 |    9202 | +0.00 |
> | 481.wrf       | 3345170 | 3345170 | +0.00 | 3345170 | +0.00 | 3351586 | +0.19 | 3351586 | +0.19 |
> | 482.sphinx3   |  125026 |  125026 | +0.00 |  125026 | +0.00 |  125026 | +0.00 |  125026 | +0.00 |
>
>
>
> Zen SPECFP 2006 -Ofast native tuning
> ====================================
>
>  Run-time
>  --------
>
> | Benchmark     | trunk |   s |     % |  x1 |     % |  x2 |     % |  x4 |     % |
> |---------------+-------+-----+-------+-----+-------+-----+-------+-----+-------|
> | 410.bwaves    |   151 | 150 | -0.66 | 151 | +0.00 | 151 | +0.00 | 151 | +0.00 |
> | 433.milc      |   197 | 197 | +0.00 | 197 | +0.00 | 194 | -1.52 | 186 | -5.58 |
> | 434.zeusmp    |   128 | 128 | +0.00 | 128 | +0.00 | 128 | +0.00 | 128 | +0.00 |
> | 435.gromacs   |   181 | 181 | +0.00 | 180 | -0.55 | 180 | -0.55 | 181 | +0.00 |
> | 436.cactusADM |   139 | 139 | +0.00 | 139 | +0.00 | 132 | -5.04 | 139 | +0.00 |
> | 437.leslie3d  |   159 | 160 | +0.63 | 160 | +0.63 | 159 | +0.00 | 159 | +0.00 |
> | 444.namd      |   256 | 256 | +0.00 | 255 | -0.39 | 255 | -0.39 | 256 | +0.00 |
> | 447.dealII    |   200 | 200 | +0.00 | 199 | -0.50 | 201 | +0.50 | 201 | +0.50 |
> | 450.soplex    |   184 | 184 | +0.00 | 185 | +0.54 | 184 | +0.00 | 184 | +0.00 |
> | 453.povray    |   124 | 122 | -1.61 | 123 | -0.81 | 124 | +0.00 | 122 | -1.61 |
> | 454.calculix  |   192 | 192 | +0.00 | 192 | +0.00 | 193 | +0.52 | 193 | +0.52 |
> | 459.GemsFDTD  |   208 | 208 | +0.00 | 208 | +0.00 | 214 | +2.88 | 208 | +0.00 |
> | 465.tonto     |   320 | 320 | +0.00 | 320 | +0.00 | 320 | +0.00 | 320 | +0.00 |
> | 470.lbm       |   142 | 142 | +0.00 | 142 | +0.00 | 142 | +0.00 | 142 | +0.00 |
> | 481.wrf       |   195 | 195 | +0.00 | 195 | +0.00 | 195 | +0.00 | 195 | +0.00 |
> | 482.sphinx3   |   256 | 258 | +0.78 | 256 | +0.00 | 256 | +0.00 | 257 | +0.39 |
>
>  Text size
>  ---------
>
> | Benchmark     |   trunk |  strict |     % |      x1 |     % |      x2 |     % |      x4 |     % |
> |---------------+---------+---------+-------+---------+-------+---------+-------+---------+-------|
> | 410.bwaves    |   27490 |   27490 | +0.00 |   27490 | +0.00 |   27490 | +0.00 |   27490 | +0.00 |
> | 433.milc      |  118178 |  118178 | +0.00 |  118178 | +0.00 |  118962 | +0.66 |  119634 | +1.23 |
> | 434.zeusmp    |  411106 |  411106 | +0.00 |  411106 | +0.00 |  411106 | +0.00 |  411106 | +0.00 |
> | 435.gromacs   |  935970 |  935970 | +0.00 |  935970 | +0.00 |  935970 | +0.00 |  936162 | +0.02 |
> | 436.cactusADM |  750546 |  750546 | +0.00 |  750546 | +0.00 |  750626 | +0.01 |  750626 | +0.01 |
> | 437.leslie3d  |  123410 |  123410 | +0.00 |  123410 | +0.00 |  123410 | +0.00 |  123410 | +0.00 |
> | 444.namd      |  284082 |  284114 | +0.01 |  284114 | +0.01 |  284114 | +0.01 |  284114 | +0.01 |
> | 447.dealII    | 2438610 | 2440946 | +0.10 | 2444978 | +0.26 | 2446882 | +0.34 | 2446930 | +0.34 |
> | 450.soplex    |  443218 |  443218 | +0.00 |  443218 | +0.00 |  443218 | +0.00 |  443218 | +0.00 |
> | 453.povray    | 1077778 | 1077890 | +0.01 | 1078658 | +0.08 | 1079026 | +0.12 | 1080370 | +0.24 |
> | 454.calculix  | 1639138 | 1639138 | +0.00 | 1639138 | +0.00 | 1639474 | +0.02 | 1639474 | +0.02 |
> | 459.GemsFDTD  |  451202 |  451234 | +0.01 |  451234 | +0.01 |  451234 | +0.01 |  451282 | +0.02 |
> | 465.tonto     | 4584690 | 4585250 | +0.01 | 4585250 | +0.01 | 4588130 | +0.08 | 4595442 | +0.23 |
> | 470.lbm       |    9858 |    9858 | +0.00 |    9858 | +0.00 |    9858 | +0.00 |    9858 | +0.00 |
> | 481.wrf       | 4588002 | 4588002 | +0.00 | 4588290 | +0.01 | 4621010 | +0.72 | 4621922 | +0.74 |
> | 482.sphinx3   |  179602 |  179602 | +0.00 |  179602 | +0.00 |  179602 | +0.00 |  179602 | +0.00 |
>
>
>
> Zen SPEC INT 2017 -O2 generic tuning
> ====================================
>
>  Run-time
>  --------
>
> | Benchmark       | trunk |   s |     % |  x1 |     % |  x2 |     % |  x4 |     % |
> |-----------------+-------+-----+-------+-----+-------+-----+-------+-----+-------|
> | 500.perlbench_r |   529 | 529 | +0.00 | 531 | +0.38 | 530 | +0.19 | 534 | +0.95 |
> | 502.gcc_r       |   338 | 333 | -1.48 | 334 | -1.18 | 339 | +0.30 | 339 | +0.30 |
> | 505.mcf_r       |   382 | 381 | -0.26 | 382 | +0.00 | 382 | +0.00 | 381 | -0.26 |
> | 520.omnetpp_r   |   511 | 503 | -1.57 | 497 | -2.74 | 497 | -2.74 | 497 | -2.74 |
> | 523.xalancbmk_r |   391 | 388 | -0.77 | 389 | -0.51 | 390 | -0.26 | 391 | +0.00 |
> | 525.x264_r      |   590 | 590 | +0.00 | 591 | +0.17 | 592 | +0.34 | 593 | +0.51 |
> | 531.deepsjeng_r |   427 | 427 | +0.00 | 427 | +0.00 | 428 | +0.23 | 427 | +0.00 |
> | 541.leela_r     |   716 | 716 | +0.00 | 716 | +0.00 | 719 | +0.42 | 719 | +0.42 |
> | 548.exchange2_r |   593 | 593 | +0.00 | 593 | +0.00 | 593 | +0.00 | 593 | +0.00 |
> | 557.xz_r        |   452 | 452 | +0.00 | 453 | +0.22 | 454 | +0.44 | 452 | +0.00 |
>
>  Text size
>  ---------
>
> | Benchmark       |   trunk |  strict |     % |      x1 |     % |      x2 |     % |      x4 |     % |
> |-----------------+---------+---------+-------+---------+-------+---------+-------+---------+-------|
> | 500.perlbench_r | 1599442 | 1599522 | +0.01 | 1599522 | +0.01 | 1599522 | +0.01 | 1600082 | +0.04 |
> | 502.gcc_r       | 6757602 | 6758978 | +0.02 | 6759090 | +0.02 | 6759842 | +0.03 | 6760306 | +0.04 |
> | 505.mcf_r       |   16098 |   16098 | +0.00 |   16098 | +0.00 |   16098 | +0.00 |   16306 | +1.29 |
> | 520.omnetpp_r   | 1262498 | 1262562 | +0.01 | 1264034 | +0.12 | 1264034 | +0.12 | 1264034 | +0.12 |
> | 523.xalancbmk_r | 3989026 | 3989202 | +0.00 | 3989202 | +0.00 | 3989202 | +0.00 | 3989202 | +0.00 |
> | 525.x264_r      |  414130 |  414194 | +0.02 |  414194 | +0.02 |  414738 | +0.15 |  415122 | +0.24 |
> | 531.deepsjeng_r |   67426 |   67426 | +0.00 |   67458 | +0.05 |   67458 | +0.05 |   67458 | +0.05 |
> | 541.leela_r     |  219378 |  219378 | +0.00 |  219378 | +0.00 |  224082 | +2.14 |  237026 | +8.04 |
> | 548.exchange2_r |   61234 |   61234 | +0.00 |   61234 | +0.00 |   61234 | +0.00 |   61234 | +0.00 |
> | 557.xz_r        |  111490 |  111490 | +0.00 |  111490 | +0.00 |  111506 | +0.01 |  111890 | +0.36 |
>
>
>
> Zen SPEC INT 2017 -Ofast native tuning
> ======================================
>
>  Run-time
>  ---------
>
> | Benchmark       | trunk |   s |     % |  x1 |     % |  x2 |     % |  x4 |     % |
> |-----------------+-------+-----+-------+-----+-------+-----+-------+-----+-------|
> | 500.perlbench_r |   525 | 524 | -0.19 | 525 | +0.00 | 525 | +0.00 | 534 | +1.71 |
> | 502.gcc_r       |   331 | 329 | -0.60 | 324 | -2.11 | 330 | -0.30 | 324 | -2.11 |
> | 505.mcf_r       |   380 | 380 | +0.00 | 381 | +0.26 | 380 | +0.00 | 379 | -0.26 |
> | 520.omnetpp_r   |   487 | 486 | -0.21 | 488 | +0.21 | 489 | +0.41 | 488 | +0.21 |
> | 523.xalancbmk_r |   373 | 369 | -1.07 | 367 | -1.61 | 370 | -0.80 | 368 | -1.34 |
> | 525.x264_r      |   319 | 319 | +0.00 | 320 | +0.31 | 321 | +0.63 | 322 | +0.94 |
> | 531.deepsjeng_r |   418 | 418 | +0.00 | 418 | +0.00 | 418 | +0.00 | 419 | +0.24 |
> | 541.leela_r     |   674 | 674 | +0.00 | 674 | +0.00 | 672 | -0.30 | 672 | -0.30 |
> | 548.exchange2_r |   466 | 466 | +0.00 | 466 | +0.00 | 466 | +0.00 | 466 | +0.00 |
> | 557.xz_r        |   443 | 443 | +0.00 | 443 | +0.00 | 449 | +1.35 | 449 | +1.35 |
>
>  Text size
>  ---------
>
> | Benchmark       |   trunk |  strict |     % |      x1 |     % |      x2 |     % |      x4 |     % |
> |-----------------+---------+---------+-------+---------+-------+---------+-------+---------+-------|
> | 500.perlbench_r | 2122882 | 2122962 | +0.00 | 2122962 | +0.00 | 2122962 | +0.00 | 2122514 | -0.02 |
> | 502.gcc_r       | 8566290 | 8567794 | +0.02 | 8569138 | +0.03 | 8570066 | +0.04 | 8570642 | +0.05 |
> | 505.mcf_r       |   26770 |   26770 | +0.00 |   26770 | +0.00 |   26770 | +0.00 |   26962 | +0.72 |
> | 520.omnetpp_r   | 1713938 | 1713954 | +0.00 | 1714754 | +0.05 | 1714754 | +0.05 | 1714754 | +0.05 |
> | 523.xalancbmk_r | 4881890 | 4882114 | +0.00 | 4882114 | +0.00 | 4882114 | +0.00 | 4882114 | +0.00 |
> | 525.x264_r      |  601522 |  601602 | +0.01 |  601602 | +0.01 |  602130 | +0.10 |  602834 | +0.22 |
> | 531.deepsjeng_r |   90306 |   90306 | +0.00 |   90338 | +0.04 |   90338 | +0.04 |   90338 | +0.04 |
> | 541.leela_r     |  277634 |  277650 | +0.01 |  277650 | +0.01 |  282386 | +1.71 |  295778 | +6.54 |
> | 548.exchange2_r |  109058 |  109058 | +0.00 |  109058 | +0.00 |  109058 | +0.00 |  109058 | +0.00 |
> | 557.xz_r        |  154594 |  154594 | +0.00 |  154594 | +0.00 |  154610 | +0.01 |  154930 | +0.22 |
>
>
>
> Zen SPEC 2017 FP -O2 generic tuning
> ===================================
>
>  Run-time
>  --------
> | Benchmark       | trunk |   s |      % |  x1 |      % |  x2 |      % |  x4 |      % |
> |-----------------+-------+-----+--------+-----+--------+-----+--------+-----+--------|
> | 503.bwaves_r    |   801 | 801 |  +0.00 | 801 |  +0.00 | 801 |  +0.00 | 801 |  +0.00 |
> | 507.cactuBSSN_r |   303 | 302 |  -0.33 | 299 |  -1.32 | 302 |  -0.33 | 307 |  +1.32 |
> | 508.namd_r      |   306 | 306 |  +0.00 | 307 |  +0.33 | 306 |  +0.00 | 306 |  +0.00 |
> | 510.parest_r    |   558 | 553 |  -0.90 | 561 |  +0.54 | 554 |  -0.72 | 562 |  +0.72 |
> | 511.povray_r    |   679 | 672 |  -1.03 | 673 |  -0.88 | 680 |  +0.15 | 644 |  -5.15 |
> | 519.lbm_r       |   240 | 240 |  +0.00 | 240 |  +0.00 | 240 |  +0.00 | 240 |  +0.00 |
> | 521.wrf_r       |   851 | 827 |  -2.82 | 827 |  -2.82 | 827 |  -2.82 | 828 |  -2.70 |
> | 526.blender_r   |   376 | 376 |  +0.00 | 379 |  +0.80 | 377 |  +0.27 | 376 |  +0.00 |
> | 527.cam4_r      |   529 | 527 |  -0.38 | 533 |  +0.76 | 536 |  +1.32 | 528 |  -0.19 |
> | 538.imagick_r   |   646 | 570 | -11.76 | 570 | -11.76 | 569 | -11.92 | 570 | -11.76 |
> | 544.nab_r       |   467 | 467 |  +0.00 | 467 |  +0.00 | 467 |  +0.00 | 467 |  +0.00 |
> | 549.fotonik3d_r |   413 | 413 |  +0.00 | 414 |  +0.24 | 415 |  +0.48 | 413 |  +0.00 |
> | 554.roms_r      |   459 | 455 |  -0.87 | 456 |  -0.65 | 456 |  -0.65 | 456 |  -0.65 |
>
>  Text size
>  ---------
>
> | Benchmark       |    trunk |   strict |     % |       x1 |     % |       x2 |     % |       x4 |     % |
> |-----------------+----------+----------+-------+----------+-------+----------+-------+----------+-------|
> | 503.bwaves_r    |    32034 |    32034 | +0.00 |    32034 | +0.00 |    32034 | +0.00 |    32034 | +0.00 |
> | 507.cactuBSSN_r |  2951634 |  2951634 | +0.00 |  2951634 | +0.00 |  2951698 | +0.00 |  2951730 | +0.00 |
> | 508.namd_r      |   837458 |   837490 | +0.00 |   837490 | +0.00 |   837490 | +0.00 |   837490 | +0.00 |
> | 510.parest_r    |  6540866 |  6545618 | +0.07 |  6546754 | +0.09 |  6561426 | +0.31 |  6569426 | +0.44 |
> | 511.povray_r    |   803618 |   803666 | +0.01 |   804274 | +0.08 |   804706 | +0.14 |   805842 | +0.28 |
> | 519.lbm_r       |    12018 |    12018 | +0.00 |    12018 | +0.00 |    12018 | +0.00 |    12018 | +0.00 |
> | 521.wrf_r       | 16292962 | 16296786 | +0.02 | 16296978 | +0.02 | 16302594 | +0.06 | 16419842 | +0.78 |
> | 526.blender_r   |  7268224 |  7281264 | +0.18 |  7282608 | +0.20 |  7289168 | +0.29 |  7295296 | +0.37 |
> | 527.cam4_r      |  5063666 |  5063922 | +0.01 |  5065010 | +0.03 |  5068114 | +0.09 |  5072946 | +0.18 |
> | 538.imagick_r   |  1608178 |  1609282 | +0.07 |  1609282 | +0.07 |  1613458 | +0.33 |  1613970 | +0.36 |
> | 544.nab_r       |   156242 |   156242 | +0.00 |   156242 | +0.00 |   156242 | +0.00 |   156242 | +0.00 |
> | 549.fotonik3d_r |   326738 |   326738 | +0.00 |   326738 | +0.00 |   326738 | +0.00 |   326738 | +0.00 |
> | 554.roms_r      |   728546 |   728546 | +0.00 |   728546 | +0.00 |   728546 | +0.00 |   728546 | +0.00 |
>
>
>
> Zen SPEC 2017 FP -Ofast native tuning
> =====================================
>
>  Run-time
>  --------
>
> | Benchmark       | trunk |   s |     % |  x1 |     % |  x2 |     % |  x4 |     % |
> |-----------------+-------+-----+-------+-----+-------+-----+-------+-----+-------|
> | 503.bwaves_r    |   310 | 310 | +0.00 | 310 | +0.00 | 310 | +0.00 | 309 | -0.32 |
> | 507.cactuBSSN_r |   269 | 266 | -1.12 | 266 | -1.12 | 268 | -0.37 | 270 | +0.37 |
> | 508.namd_r      |   270 | 269 | -0.37 | 269 | -0.37 | 268 | -0.74 | 268 | -0.74 |
> | 510.parest_r    |   607 | 601 | -0.99 | 599 | -1.32 | 599 | -1.32 | 604 | -0.49 |
> | 511.povray_r    |   662 | 664 | +0.30 | 671 | +1.36 | 680 | +2.72 | 675 | +1.96 |
> | 519.lbm_r       |   186 | 186 | +0.00 | 186 | +0.00 | 186 | +0.00 | 186 | +0.00 |
> | 521.wrf_r       |   550 | 554 | +0.73 | 550 | +0.00 | 550 | +0.00 | 549 | -0.18 |
> | 526.blender_r   |   355 | 354 | -0.28 | 355 | +0.00 | 354 | -0.28 | 354 | -0.28 |
> | 527.cam4_r      |   434 | 437 | +0.69 | 435 | +0.23 | 437 | +0.69 | 435 | +0.23 |
> | 538.imagick_r   |   433 | 420 | -3.00 | 420 | -3.00 | 420 | -3.00 | 419 | -3.23 |
> | 544.nab_r       |   424 | 425 | +0.24 | 425 | +0.24 | 425 | +0.24 | 425 | +0.24 |
> | 549.fotonik3d_r |   421 | 422 | +0.24 | 422 | +0.24 | 422 | +0.24 | 422 | +0.24 |
> | 554.roms_r      |   360 | 361 | +0.28 | 361 | +0.28 | 361 | +0.28 | 361 | +0.28 |
>
> +1.36% for 511.povray_r is the worst regression for the proposed x1
> defaults, by the way.  I have not investigated it further, however.
>
>  Text size
>  ---------
>
> | Benchmark       |    trunk |   strict |     % |       x1 |     % |       x2 |     % |       x4 |     % |
> |-----------------+----------+----------+-------+----------+-------+----------+-------+----------+-------|
> | 503.bwaves_r    |    34562 |    34562 | +0.00 |    34562 | +0.00 |    34562 | +0.00 |    34562 | +0.00 |
> | 507.cactuBSSN_r |  3978402 |  3978402 | +0.00 |  3978402 | +0.00 |  3978514 | +0.00 |  3978546 | +0.00 |
> | 508.namd_r      |   869106 |   869154 | +0.01 |   869154 | +0.01 |   869154 | +0.01 |   869154 | +0.01 |
> | 510.parest_r    |  7186258 |  7189298 | +0.04 |  7190370 | +0.06 |  7203890 | +0.25 |  7211202 | +0.35 |
> | 511.povray_r    |  1063314 |  1063410 | +0.01 |  1064178 | +0.08 |  1064546 | +0.12 |  1065890 | +0.24 |
> | 519.lbm_r       |    12178 |    12178 | +0.00 |    12178 | +0.00 |    12178 | +0.00 |    12178 | +0.00 |
> | 521.wrf_r       | 19480946 | 19484146 | +0.02 | 19484466 | +0.02 | 19607538 | +0.65 | 19716178 | +1.21 |
> | 526.blender_r   |  9708752 |  9719952 | +0.12 |  9722768 | +0.14 |  9730224 | +0.22 |  9737760 | +0.30 |
> | 527.cam4_r      |  6217970 |  6218162 | +0.00 |  6219570 | +0.03 |  6223362 | +0.09 |  6227762 | +0.16 |
> | 538.imagick_r   |  2255682 |  2256162 | +0.02 |  2256162 | +0.02 |  2261346 | +0.25 |  2261938 | +0.28 |
> | 544.nab_r       |   212418 |   212418 | +0.00 |   212418 | +0.00 |   212418 | +0.00 |   212578 | +0.08 |
> | 549.fotonik3d_r |   454738 |   454738 | +0.00 |   454738 | +0.00 |   454738 | +0.00 |   454738 | +0.00 |
> | 554.roms_r      |   910978 |   910978 | +0.00 |   910978 | +0.00 |   910978 | +0.00 |   910978 | +0.00 |
>
>
> I believe the numbers are good and thus I would like to ask for
> reconsideration of the objection and for approval to commit the patch
> below.  Needless to say, it has passed bootstrap and testing on
> x86_64-linux.
>
> Thanks
>
> Martin
>
>
> 2017-10-27  Martin Jambor  <mjambor@suse.cz>
>
>         PR target/80689
>         * tree-sra.h: New file.
>         * ipa-prop.h: Moved declaration of build_ref_for_offset to
>         tree-sra.h.
>         * expr.c: Include params.h and tree-sra.h.
>         (emit_move_elementwise): New function.
>         (store_expr_with_bounds): Optionally use it.
>         * ipa-cp.c: Include tree-sra.h.
>         * params.def (PARAM_MAX_SIZE_FOR_ELEMENTWISE_COPY): New.
>         (PARAM_MAX_INSNS_FOR_ELEMENTWISE_COPY): Likewise.
>         * config/i386/i386.c (ix86_option_override_internal): Set
>         PARAM_MAX_SIZE_FOR_ELEMENTWISE_COPY to 35.
>         * tree-sra.c: Include tree-sra.h.
>         (scalarizable_type_p): Renamed to
>         simple_mix_of_records_and_arrays_p, made public, renamed the
>         second parameter to allow_char_arrays, added count_p parameter.
>         (extract_min_max_idx_from_array): New function.
>         (completely_scalarize): Moved bits of the function to
>         extract_min_max_idx_from_array.
>
>         testsuite/
>         * gcc.target/i386/pr80689-1.c: New test.
>
> Added insns count param limit
> ---
>  gcc/config/i386/i386.c                    |   4 +
>  gcc/expr.c                                | 106 ++++++++++++++++++++++-
>  gcc/ipa-cp.c                              |   1 +
>  gcc/ipa-prop.h                            |   4 -
>  gcc/params.def                            |  12 +++
>  gcc/testsuite/gcc.target/i386/pr80689-1.c |  38 +++++++++
>  gcc/tree-sra.c                            | 134 +++++++++++++++++++++---------
>  gcc/tree-sra.h                            |  34 ++++++++
>  8 files changed, 288 insertions(+), 45 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/i386/pr80689-1.c
>  create mode 100644 gcc/tree-sra.h
>
> diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
> index 80c8ce7ecb9..0bff2da72dd 100644
> --- a/gcc/config/i386/i386.c
> +++ b/gcc/config/i386/i386.c
> @@ -4580,6 +4580,10 @@ ix86_option_override_internal (bool main_args_p,
>                          ix86_tune_cost->l2_cache_size,
>                          opts->x_param_values,
>                          opts_set->x_param_values);
> +  maybe_set_param_value (PARAM_MAX_SIZE_FOR_ELEMENTWISE_COPY,
> +                        35,
> +                        opts->x_param_values,
> +                        opts_set->x_param_values);
>
>    /* Enable sw prefetching at -O3 for CPUS that prefetching is helpful.  */
>    if (opts->x_flag_prefetch_loop_arrays < 0
> diff --git a/gcc/expr.c b/gcc/expr.c
> index 496d492c9fa..971880b635d 100644
> --- a/gcc/expr.c
> +++ b/gcc/expr.c
> @@ -61,7 +61,8 @@ along with GCC; see the file COPYING3.  If not see
>  #include "tree-chkp.h"
>  #include "rtl-chkp.h"
>  #include "ccmp.h"
> -
> +#include "params.h"
> +#include "tree-sra.h"
>
>  /* If this is nonzero, we do not bother generating VOLATILE
>     around volatile memory references, and we are willing to
> @@ -5340,6 +5341,80 @@ emit_storent_insn (rtx to, rtx from)
>    return maybe_expand_insn (code, 2, ops);
>  }
>
> +/* Generate code for copying data of type TYPE at SOURCE plus OFFSET to TARGET
> +   plus OFFSET, but do so element-wise and/or field-wise for each record and
> +   array within TYPE.  TYPE must either be a register type or an aggregate
> +   complying with simple_mix_of_records_and_arrays_p.
> +
> +   If CALL_PARAM_P is nonzero, this is a store into a call param on the
> +   stack, and block moves may need to be treated specially.  */
> +
> +static void
> +emit_move_elementwise (tree type, rtx target, rtx source, HOST_WIDE_INT offset,
> +                      int call_param_p)
> +{
> +  switch (TREE_CODE (type))
> +    {
> +    case RECORD_TYPE:
> +      for (tree fld = TYPE_FIELDS (type); fld; fld = DECL_CHAIN (fld))
> +       if (TREE_CODE (fld) == FIELD_DECL)
> +         {
> +           HOST_WIDE_INT fld_offset = offset + int_bit_position (fld);
> +           tree ft = TREE_TYPE (fld);
> +           emit_move_elementwise (ft, target, source, fld_offset,
> +                                  call_param_p);
> +         }
> +      break;
> +
> +    case ARRAY_TYPE:
> +      {
> +       tree elem_type = TREE_TYPE (type);
> +       HOST_WIDE_INT el_size = tree_to_shwi (TYPE_SIZE (elem_type));
> +       gcc_assert (el_size > 0);
> +
> +       offset_int idx, max;
> +       /* Skip (some) zero-length arrays; others have MAXIDX == MINIDX - 1.  */
> +       if (extract_min_max_idx_from_array (type, &idx, &max))
> +         {
> +           HOST_WIDE_INT el_offset = offset;
> +           for (; idx <= max; ++idx)
> +             {
> +               emit_move_elementwise (elem_type, target, source, el_offset,
> +                                      call_param_p);
> +               el_offset += el_size;
> +             }
> +         }
> +      }
> +      break;
> +    default:
> +      machine_mode mode = TYPE_MODE (type);
> +
> +      rtx ntgt = adjust_address (target, mode, offset / BITS_PER_UNIT);
> +      rtx nsrc = adjust_address (source, mode, offset / BITS_PER_UNIT);
> +
> +      /* TODO: Figure out whether the following is actually necessary.  */
> +      if (target == ntgt)
> +       ntgt = copy_rtx (target);
> +      if (source == nsrc)
> +       nsrc = copy_rtx (source);
> +
> +      gcc_assert (mode != VOIDmode);
> +      if (mode != BLKmode)
> +       emit_move_insn (ntgt, nsrc);
> +      else
> +       {
> +         /* For example vector gimple registers can end up here.  */
> +         rtx size = expand_expr (TYPE_SIZE_UNIT (type), NULL_RTX,
> +                                 TYPE_MODE (sizetype), EXPAND_NORMAL);
> +         emit_block_move (ntgt, nsrc, size,
> +                          (call_param_p
> +                           ? BLOCK_OP_CALL_PARM : BLOCK_OP_NORMAL));
> +       }
> +      break;
> +    }
> +  return;
> +}
> +
>  /* Generate code for computing expression EXP,
>     and storing the value into TARGET.
>
> @@ -5713,9 +5788,32 @@ store_expr_with_bounds (tree exp, rtx target, int call_param_p,
>         emit_group_store (target, temp, TREE_TYPE (exp),
>                           int_size_in_bytes (TREE_TYPE (exp)));
>        else if (GET_MODE (temp) == BLKmode)
> -       emit_block_move (target, temp, expr_size (exp),
> -                        (call_param_p
> -                         ? BLOCK_OP_CALL_PARM : BLOCK_OP_NORMAL));
> +       {
> +         /* Copying smallish BLKmode structures with emit_block_move and thus
> +            by-pieces can result in store-to-load stalls.  So copy some simple
> +            small aggregates element or field-wise.  */
> +         int count = 0;
> +         if (GET_MODE (target) == BLKmode
> +             && AGGREGATE_TYPE_P (TREE_TYPE (exp))
> +             && !TREE_ADDRESSABLE (TREE_TYPE (exp))
> +             && tree_fits_shwi_p (TYPE_SIZE (TREE_TYPE (exp)))
> +             && (tree_to_shwi (TYPE_SIZE (TREE_TYPE (exp)))
> +                 <= (PARAM_VALUE (PARAM_MAX_SIZE_FOR_ELEMENTWISE_COPY)
> +                     * BITS_PER_UNIT))
> +             && simple_mix_of_records_and_arrays_p (TREE_TYPE (exp), false,
> +                                                    &count)
> +             && (count <= PARAM_VALUE (PARAM_MAX_INSNS_FOR_ELEMENTWISE_COPY)))
> +           {
> +             /* FIXME:  Can this happen?  What would it mean?  */
> +             gcc_assert (!reverse);
> +             emit_move_elementwise (TREE_TYPE (exp), target, temp, 0,
> +                                    call_param_p);
> +           }
> +         else
> +           emit_block_move (target, temp, expr_size (exp),
> +                            (call_param_p
> +                             ? BLOCK_OP_CALL_PARM : BLOCK_OP_NORMAL));
> +       }
>        /* If we emit a nontemporal store, there is nothing else to do.  */
>        else if (nontemporal && emit_storent_insn (target, temp))
>         ;
> diff --git a/gcc/ipa-cp.c b/gcc/ipa-cp.c
> index d23c1d8ba3e..30f91e70c22 100644
> --- a/gcc/ipa-cp.c
> +++ b/gcc/ipa-cp.c
> @@ -124,6 +124,7 @@ along with GCC; see the file COPYING3.  If not see
>  #include "tree-ssa-ccp.h"
>  #include "stringpool.h"
>  #include "attribs.h"
> +#include "tree-sra.h"
>
>  template <typename valtype> class ipcp_value;
>
> diff --git a/gcc/ipa-prop.h b/gcc/ipa-prop.h
> index fa5bed49ee0..2313cc884ed 100644
> --- a/gcc/ipa-prop.h
> +++ b/gcc/ipa-prop.h
> @@ -877,10 +877,6 @@ ipa_parm_adjustment *ipa_get_adjustment_candidate (tree **, bool *,
>  void ipa_release_body_info (struct ipa_func_body_info *);
>  tree ipa_get_callee_param_type (struct cgraph_edge *e, int i);
>
> -/* From tree-sra.c:  */
> -tree build_ref_for_offset (location_t, tree, HOST_WIDE_INT, bool, tree,
> -                          gimple_stmt_iterator *, bool);
> -
>  /* In ipa-cp.c  */
>  void ipa_cp_c_finalize (void);
>
> diff --git a/gcc/params.def b/gcc/params.def
> index 8881f4c403a..9c778f9540a 100644
> --- a/gcc/params.def
> +++ b/gcc/params.def
> @@ -1287,6 +1287,18 @@ DEFPARAM (PARAM_VECT_EPILOGUES_NOMASK,
>           "Enable loop epilogue vectorization using smaller vector size.",
>           0, 0, 1)
>
> +DEFPARAM (PARAM_MAX_SIZE_FOR_ELEMENTWISE_COPY,
> +         "max-size-for-elementwise-copy",
> +         "Maximum size in bytes of a structure or an array to by considered "
> +         "for copying by its individual fields or elements",
> +         0, 0, 512)
> +
> +DEFPARAM (PARAM_MAX_INSNS_FOR_ELEMENTWISE_COPY,
> +         "max-insns-for-elementwise-copy",
> +         "Maximum number of instructions needed to consider copying "
> +          "a structure or an array by its individual fields or elements",
> +         6, 0, 64)
> +
>  /*
>
>  Local variables:
> diff --git a/gcc/testsuite/gcc.target/i386/pr80689-1.c b/gcc/testsuite/gcc.target/i386/pr80689-1.c
> new file mode 100644
> index 00000000000..4156d4fba45
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/i386/pr80689-1.c
> @@ -0,0 +1,38 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2" } */
> +
> +typedef struct st1
> +{
> +        long unsigned int a,b;
> +        long int c,d;
> +}R;
> +
> +typedef struct st2
> +{
> +        int  t;
> +        R  reg;
> +}N;
> +
> +void Set (const R *region,  N *n_info );
> +
> +void test(N  *n_obj ,const long unsigned int a, const long unsigned int b,  const long int c,const long int d)
> +{
> +        R reg;
> +
> +        reg.a=a;
> +        reg.b=b;
> +        reg.c=c;
> +        reg.d=d;
> +        Set (&reg, n_obj);
> +
> +}
> +
> +void Set (const R *reg,  N *n_obj )
> +{
> +        n_obj->reg=(*reg);
> +}
> +
> +
> +/* { dg-final { scan-assembler-not "%(x|y|z)mm\[0-9\]+" } } */
> +/* { dg-final { scan-assembler-not "movdqu" } } */
> +/* { dg-final { scan-assembler-not "movups" } } */
> diff --git a/gcc/tree-sra.c b/gcc/tree-sra.c
> index bac593951e7..d06463ce21c 100644
> --- a/gcc/tree-sra.c
> +++ b/gcc/tree-sra.c
> @@ -104,6 +104,7 @@ along with GCC; see the file COPYING3.  If not see
>  #include "ipa-fnsummary.h"
>  #include "ipa-utils.h"
>  #include "builtins.h"
> +#include "tree-sra.h"
>
>  /* Enumeration of all aggregate reductions we can do.  */
>  enum sra_mode { SRA_MODE_EARLY_IPA,   /* early call regularization */
> @@ -952,14 +953,15 @@ create_access (tree expr, gimple *stmt, bool write)
>  }
>
>
> -/* Return true iff TYPE is scalarizable - i.e. a RECORD_TYPE or fixed-length
> -   ARRAY_TYPE with fields that are either of gimple register types (excluding
> -   bit-fields) or (recursively) scalarizable types.  CONST_DECL must be true if
> -   we are considering a decl from constant pool.  If it is false, char arrays
> -   will be refused.  */
> +/* Return true if TYPE consists of RECORD_TYPE or fixed-length ARRAY_TYPE with
> +   fields/elements that are not bit-fields and are either register types or
> +   recursively comply with simple_mix_of_records_and_arrays_p.  Furthermore, if
> +   ALLOW_CHAR_ARRAYS is false, the function will also return false if TYPE
> +   contains an array whose elements are only one byte in size.  */
>
> -static bool
> -scalarizable_type_p (tree type, bool const_decl)
> +bool
> +simple_mix_of_records_and_arrays_p (tree type, bool allow_char_arrays,
> +                                   int *count_p)
>  {
>    gcc_assert (!is_gimple_reg_type (type));
>    if (type_contains_placeholder_p (type))
> @@ -976,8 +978,13 @@ scalarizable_type_p (tree type, bool const_decl)
>           if (DECL_BIT_FIELD (fld))
>             return false;
>
> -         if (!is_gimple_reg_type (ft)
> -             && !scalarizable_type_p (ft, const_decl))
> +         if (is_gimple_reg_type (ft))
> +           {
> +             if (count_p)
> +               (*count_p)++;
> +           }
> +         else if (!simple_mix_of_records_and_arrays_p (ft, allow_char_arrays,
> +                                                      count_p))
>             return false;
>         }
>
> @@ -986,7 +993,7 @@ scalarizable_type_p (tree type, bool const_decl)
>    case ARRAY_TYPE:
>      {
>        HOST_WIDE_INT min_elem_size;
> -      if (const_decl)
> +      if (allow_char_arrays)
>         min_elem_size = 0;
>        else
>         min_elem_size = BITS_PER_UNIT;
> @@ -1007,9 +1014,45 @@ scalarizable_type_p (tree type, bool const_decl)
>         return false;
>
>        tree elem = TREE_TYPE (type);
> -      if (!is_gimple_reg_type (elem)
> -         && !scalarizable_type_p (elem, const_decl))
> -       return false;
> +      if (!count_p)
> +       {
> +         if (!is_gimple_reg_type (elem)
> +             && !simple_mix_of_records_and_arrays_p (elem, allow_char_arrays,
> +                                                     NULL))
> +           return false;
> +         else
> +           return true;
> +       }
> +
> +      offset_int min, max;
> +      HOST_WIDE_INT ds;
> +      bool nonzero = extract_min_max_idx_from_array (type, &min, &max);
> +
> +      if (nonzero && (min <= max))
> +       {
> +         offset_int d = max - min + 1;
> +         if (!wi::fits_shwi_p (d))
> +           return false;
> +         ds = d.to_shwi ();
> +         if (ds > INT_MAX)
> +           return false;
> +       }
> +      else
> +       ds = 0;
> +
> +      if (is_gimple_reg_type (elem))
> +       *count_p += (int) ds;
> +      else
> +       {
> +         int elc = 0;
> +         if (!simple_mix_of_records_and_arrays_p (elem, allow_char_arrays,
> +                                                  &elc))
> +           return false;
> +         ds *= elc;
> +         if (ds > INT_MAX)
> +           return false;
> +         *count_p += (unsigned) ds;
> +       }
>        return true;
>      }
>    default:
> @@ -1017,10 +1060,38 @@ scalarizable_type_p (tree type, bool const_decl)
>    }
>  }
>
> -static void scalarize_elem (tree, HOST_WIDE_INT, HOST_WIDE_INT, bool, tree, tree);
> +static void scalarize_elem (tree, HOST_WIDE_INT, HOST_WIDE_INT, bool, tree,
> +                           tree);
> +
> +/* For a given array TYPE, return false if its domain does not have any maximum
> +   value.  Otherwise calculate MIN and MAX indices of the first and the last
> +   element.  */
> +
> +bool
> +extract_min_max_idx_from_array (tree type, offset_int *min, offset_int *max)
> +{
> +  tree domain = TYPE_DOMAIN (type);
> +  tree minidx = TYPE_MIN_VALUE (domain);
> +  gcc_assert (TREE_CODE (minidx) == INTEGER_CST);
> +  tree maxidx = TYPE_MAX_VALUE (domain);
> +  if (!maxidx)
> +    return false;
> +  gcc_assert (TREE_CODE (maxidx) == INTEGER_CST);
> +
> +  /* MINIDX and MAXIDX are inclusive, and must be interpreted in
> +     DOMAIN (e.g. signed int, whereas min/max may be size_int).  */
> +  *min = wi::to_offset (minidx);
> +  *max = wi::to_offset (maxidx);
> +  if (!TYPE_UNSIGNED (domain))
> +    {
> +      *min = wi::sext (*min, TYPE_PRECISION (domain));
> +      *max = wi::sext (*max, TYPE_PRECISION (domain));
> +    }
> +  return true;
> +}
>
>  /* Create total_scalarization accesses for all scalar fields of a member
> -   of type DECL_TYPE conforming to scalarizable_type_p.  BASE
> +   of type DECL_TYPE conforming to simple_mix_of_records_and_arrays_p.  BASE
>     must be the top-most VAR_DECL representing the variable; within that,
>     OFFSET locates the member and REF must be the memory reference expression for
>     the member.  */
> @@ -1047,27 +1118,14 @@ completely_scalarize (tree base, tree decl_type, HOST_WIDE_INT offset, tree ref)
>        {
>         tree elemtype = TREE_TYPE (decl_type);
>         tree elem_size = TYPE_SIZE (elemtype);
> -       gcc_assert (elem_size && tree_fits_shwi_p (elem_size));
>         HOST_WIDE_INT el_size = tree_to_shwi (elem_size);
>         gcc_assert (el_size > 0);
>
> -       tree minidx = TYPE_MIN_VALUE (TYPE_DOMAIN (decl_type));
> -       gcc_assert (TREE_CODE (minidx) == INTEGER_CST);
> -       tree maxidx = TYPE_MAX_VALUE (TYPE_DOMAIN (decl_type));
> +       offset_int idx, max;
>         /* Skip (some) zero-length arrays; others have MAXIDX == MINIDX - 1.  */
> -       if (maxidx)
> +       if (extract_min_max_idx_from_array (decl_type, &idx, &max))
>           {
> -           gcc_assert (TREE_CODE (maxidx) == INTEGER_CST);
>             tree domain = TYPE_DOMAIN (decl_type);
> -           /* MINIDX and MAXIDX are inclusive, and must be interpreted in
> -              DOMAIN (e.g. signed int, whereas min/max may be size_int).  */
> -           offset_int idx = wi::to_offset (minidx);
> -           offset_int max = wi::to_offset (maxidx);
> -           if (!TYPE_UNSIGNED (domain))
> -             {
> -               idx = wi::sext (idx, TYPE_PRECISION (domain));
> -               max = wi::sext (max, TYPE_PRECISION (domain));
> -             }
>             for (int el_off = offset; idx <= max; ++idx)
>               {
>                 tree nref = build4 (ARRAY_REF, elemtype,
> @@ -1088,10 +1146,10 @@ completely_scalarize (tree base, tree decl_type, HOST_WIDE_INT offset, tree ref)
>  }
>
>  /* Create total_scalarization accesses for a member of type TYPE, which must
> -   satisfy either is_gimple_reg_type or scalarizable_type_p.  BASE must be the
> -   top-most VAR_DECL representing the variable; within that, POS and SIZE locate
> -   the member, REVERSE gives its torage order. and REF must be the reference
> -   expression for it.  */
> +   satisfy either is_gimple_reg_type or simple_mix_of_records_and_arrays_p.
> +   BASE must be the top-most VAR_DECL representing the variable; within that,
> +   POS and SIZE locate the member, REVERSE gives its storage order, and REF
> +   be the reference expression for it.  */
>
>  static void
>  scalarize_elem (tree base, HOST_WIDE_INT pos, HOST_WIDE_INT size, bool reverse,
> @@ -1111,7 +1169,8 @@ scalarize_elem (tree base, HOST_WIDE_INT pos, HOST_WIDE_INT size, bool reverse,
>  }
>
>  /* Create a total_scalarization access for VAR as a whole.  VAR must be of a
> -   RECORD_TYPE or ARRAY_TYPE conforming to scalarizable_type_p.  */
> +   RECORD_TYPE or ARRAY_TYPE conforming to
> +   simple_mix_of_records_and_arrays_p.  */
>
>  static void
>  create_total_scalarization_access (tree var)
> @@ -2803,8 +2862,9 @@ analyze_all_variable_accesses (void)
>        {
>         tree var = candidate (i);
>
> -       if (VAR_P (var) && scalarizable_type_p (TREE_TYPE (var),
> -                                               constant_decl_p (var)))
> +       if (VAR_P (var)
> +           && simple_mix_of_records_and_arrays_p (TREE_TYPE (var),
> +                                                  constant_decl_p (var), NULL))
>           {
>             if (tree_to_uhwi (TYPE_SIZE (TREE_TYPE (var)))
>                 <= max_scalarization_size)
> diff --git a/gcc/tree-sra.h b/gcc/tree-sra.h
> new file mode 100644
> index 00000000000..2857688b21e
> --- /dev/null
> +++ b/gcc/tree-sra.h
> @@ -0,0 +1,34 @@
> +/* tree-sra.h - Scalar Replacement of Aggregates declarations.
> +   Copyright (C) 2017 Free Software Foundation, Inc.
> +
> +This file is part of GCC.
> +
> +GCC is free software; you can redistribute it and/or modify it under
> +the terms of the GNU General Public License as published by the Free
> +Software Foundation; either version 3, or (at your option) any later
> +version.
> +
> +GCC is distributed in the hope that it will be useful, but WITHOUT ANY
> +WARRANTY; without even the implied warranty of MERCHANTABILITY or
> +FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
> +for more details.
> +
> +You should have received a copy of the GNU General Public License
> +along with GCC; see the file COPYING3.  If not see
> +<http://www.gnu.org/licenses/>.  */
> +
> +#ifndef TREE_SRA_H
> +#define TREE_SRA_H
> +
> +
> +bool simple_mix_of_records_and_arrays_p (tree type, bool allow_char_arrays,
> +                                        int *count_p);
> +bool extract_min_max_idx_from_array (tree type, offset_int *idx,
> +                                    offset_int *max);
> +tree build_ref_for_offset (location_t loc, tree base, HOST_WIDE_INT offset,
> +                          bool reverse, tree exp_type,
> +                          gimple_stmt_iterator *gsi, bool insert_after);
> +
> +
> +
> +#endif /* TREE_SRA_H */
> --
> 2.14.2
>
Michael Matz Nov. 13, 2017, 1:46 p.m. UTC | #10
Hi,

On Mon, 13 Nov 2017, Richard Biener wrote:

> The main concern here is that GIMPLE is not very well defined for
> aggregate copies and that gimple-fold.c happily optimizes
> memcpy (&a, &b, sizeof (a)) into a = b;

What you failed to mention is that we then discussed rectifying this 
situation by defining GIMPLE more precisely :)  Effectively an aggregate 
assignment in GIMPLE (right now) is always defined to be a dumb block 
copy.  We need a way to describe a member-wise copy as well.  That can 
either be a flag on the statement or implicit in the alias type (i.e. 
block copies always need the ref-all alias type, all others would be 
member-wise copies).
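
To make the difference concrete (my example, not from the patch; a
4-byte int is assumed):

struct S { char c; int i; };  /* bytes 1-3 are padding */

void block_copy (struct S *d, const struct S *s)
{
  __builtin_memcpy (d, s, sizeof *d);  /* must copy the padding too */
}

void member_copy (struct S *d, const struct S *s)
{
  d->c = s->c;  /* padding bytes of *d are unspecified afterwards */
  d->i = s->i;
}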

Then a user-written memcpy can only be rewritten into a member-wise 
assignment when the types contain no padding, and SRA can only look 
through member-wise assignments when it doesn't see all accesses to the 
destination (when it does then it can look also through block copies).

Then this patch can be restricted to the member-wise assignments (which 
still helps imagemagick as the struct in question doesn't contain padding 
IIRC).

That, or we must dumb down SRA quite much, which I don't think would be a 
good idea.

(I'm not sure if your example would be really valid C as it changes the 
dynamic type of a statically typed declaration; but OTOH we shouldn't 
care, as in GIMPLE the example should of course be expressible)


Ciao,
Michael.
Richard Biener Nov. 13, 2017, 2:03 p.m. UTC | #11
On Mon, Nov 13, 2017 at 2:46 PM, Michael Matz <matz@suse.de> wrote:
> Hi,
>
> On Mon, 13 Nov 2017, Richard Biener wrote:
>
>> The main concern here is that GIMPLE is not very well defined for
>> aggregate copies and that gimple-fold.c happily optimizes
>> memcpy (&a, &b, sizeof (a)) into a = b;
>
> What you failed to mention is that we then discussed rectifying this
> situation by defining GIMPLE more precisely :)  Effectively an aggregate
> assignment in GIMPLE (right now) is always defined to be a dumb block
> copy.  We need a way to describe a member-wise copy as well.  That can
> either be a flag on the statement or implicit in the alias type (i.e.
> block copies always need the ref-all alias type, all others would be
> member-wise copies).

Yes.  Note that it's already GENERIC that needs to nail down the difference.

For the folding there's the possibility of using a char[n] type with n being
constant, thus a new array type for each size, or a char[] type with
variable size, using a WITH_SIZE_EXPR on the RHS (but support for
WITH_SIZE_EXPR isn't so good in passes so I'd rather avoid this for
constant sizes).
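
For example (a hand-written sketch of the intent, not actual dump
output), with sizeof (struct A) == 24 the folded copy would then be
something like

  MEM <unsigned char[24]> [(char * {ref-all})&a]
    = MEM <unsigned char[24]> [(char * {ref-all})&b];

where the access type no longer implies anything about fields or
padding.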

The chance here is, of course (find the PR, it exists...), that SRA then
decomposes the char[] copy bytewise...

That said, memcpy folding is easy to fix.  The question is of course
what the semantics of VIEW_CONVERTs are (SRA _does_ contain
bail-outs on those).  Like if you have

struct A { short s; int i; } x;
struct B { int i; short s; } y;

void foo ()
{
  x = VIEW_CONVERT <struct A> (y);
}

so can you access padding via view-converting its value?  Ada uses
VIEW_CONVERT punning on structures a _lot_ (probably the reason
for the SRA bailout).

The above assignment would still be going through that aggregate copy
expansion path.

> Then a user-written memcpy can only be rewritten into a member-wise
> assignment when the types contain no padding, and SRA can only look
> through member-wise assignments when it doesn't see all accesses to the
> destination (when it does then it can look also through block copies).
>
> Then this patch can be restricted to the member-wise assignments (which
> still helps imagemagick as the struct in question doesn't contain padding
> IIRC).
>
> That, or we must dumb down SRA quite much, which I don't think would be a
> good idea.
>
> (I'm not sure if your example would be really valid C as it changes the
> dynamic type of a statically typed declaration; but OTOH we shouldn't
> care, as in GIMPLE the example should of course be expressible)

Yeah, we can equally use allocated storage (but our memcpy folding
then won't apply ...).

Richard.
Michael Matz Nov. 13, 2017, 2:20 p.m. UTC | #12
Hi,

On Mon, 13 Nov 2017, Richard Biener wrote:

> The chance here is, of course (find the PR, it exists...), that SRA then 
> decomposes the char[] copy bytewise...
> 
> That said, memcpy folding is easy to fix.  The question is of course
> what the semantics of VIEW_CONVERTs are (SRA _does_ contain
> bail-outs on those).  Like if you have
> 
> struct A { short s; int i; } x;
> struct B { int i; short s; } y;
> 
> void foo ()
> {
>   x = VIEW_CONVERT <struct A> (y);
> }
> 
> so can you access padding via view-converting its value?  Ada uses 
> VIEW_CONVERT punning on structures a _lot_ (probably the reason for the 
> SRA bailout).

I would say a VIEW_CONVERT shouldn't be allowed to inspect padding on the 
RHS (and expected to clobber padding on the LHS).  That is, if you want to 
really really access padding on some struct type you can only use memcpy.  
(Or view-convert it to some char[N] array, perhaps there it makes sense to 
copy padding, i.e. regard that as a block copy).

The above example shows why I'm of this opinion.  Both structs have 
padding at different places, each overlapping a member of the other 
struct.  I don't see how to give that any sane meaning (beyond always 
handling it as a block copy, at which point we can as well give up and get 
rid of VIEW_CONVERT_EXPR in favor of explicit memcpy).
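
Concretely (my sketch, assuming a 2-byte short, a 4-byte int and natural
alignment):

struct A:  bytes 0-1 = s, bytes 2-3 = padding, bytes 4-7 = i
struct B:  bytes 0-3 = i, bytes 4-5 = s,       bytes 6-7 = padding

A's padding overlays part of B's member i, and B's padding overlays part
of A's member i, so "copy the members but also the padding" is not even
well-defined for such an assignment.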


Ciao,
Michael.
Richard Biener Nov. 13, 2017, 3:18 p.m. UTC | #13
On November 13, 2017 3:20:16 PM GMT+01:00, Michael Matz <matz@suse.de> wrote:
>Hi,
>
>On Mon, 13 Nov 2017, Richard Biener wrote:
>
>> The chance here is, of course (find the PR, it exists...), that SRA
>> then decomposes the char[] copy bytewise...
>>
>> That said, memcpy folding is easy to fix.  The question is of course
>> what the semantics of VIEW_CONVERTs are (SRA _does_ contain
>> bail-outs on those).  Like if you have
>>
>> struct A { short s; int i; } x;
>> struct B { int i; short s; } y;
>>
>> void foo ()
>> {
>>   x = VIEW_CONVERT <struct A> (y);
>> }
>>
>> so can you access padding via view-converting its value?  Ada uses
>> VIEW_CONVERT punning on structures a _lot_ (probably the reason for
>> the SRA bailout).
>
>I would say a VIEW_CONVERT shouldn't be allowed to inspect padding on
>the RHS (and expected to clobber padding on the LHS).  That is, if you
>want to really really access padding on some struct type you can only
>use memcpy.  (Or view-convert it to some char[N] array, perhaps there
>it makes sense to copy padding, i.e. regard that as a block copy).
>
>The above example shows why I'm of this opinion.  Both structs have
>padding at different places, each overlapping a member of the other
>struct.  I don't see how to give that any sane meaning (beyond always
>handling it as a block copy, at which point we can as well give up and
>get rid of VIEW_CONVERT_EXPR in favor of explicit memcpy).

Eric should know the constraints important for Ada. 

Richard. 

>
>Ciao,
>Michael.
Eric Botcazou Nov. 13, 2017, 11:57 p.m. UTC | #14
> The chance here is, of course (find the PR, it exists...), that SRA then
> decomposes the char[] copy bytewise...
> 
> That said, memcpy folding is easy to fix.  The question is of course
> what the semantics of VIEW_CONVERTs are (SRA _does_ contain
> bail-outs on those).  Like if you have
> 
> struct A { short s; int i; } x;
> struct B { int i; short s; } y;
> 
> void foo ()
> {
>   x = VIEW_CONVERT <struct A> (y);
> }
> 
> so can you access padding via view-converting its value?  Ada uses
> VIEW_CONVERT punning on structures a _lot_ (probably the reason
> for the SRA bailout).

Couple of things:
 1. We have been trying to get rid of VIEW_CONVERT as much as possible in the 
Ada compiler for a number of years now.
 2. Padding is garbage in Ada and thus its contents cannot have any effect on 
the execution of legal programs (and SRA already killed any hope of preserving 
padding a long time ago for not-so-legal programs anyway).
Martin Jambor Nov. 14, 2017, 9:59 a.m. UTC | #15
Hi,

I thought I had sent the following email last Friday but have only just
found it in my drafts folder, so let me send it now so that anybody
interested can see what the patch does on Haswell.

I have only skimmed through the new messages in the thread.  I am
looking into something else right now but will get back to this matter
next week at the latest.



On Fri, Nov 03, 2017 at 05:38:30PM +0100, Martin Jambor wrote:
>

...

> 
> Anyway, here are the numbers.  They were taken on two different
> Zen-based machines.  I am also in the process of measuring at least
> something on a Haswell machine but I started later and the machine is
> quite a bit slower so I will not have the numbers until next week (and
> not all equivalents in any way).  I found out I do not have access to
> any more modern .*Lake intel CPU.
> 

OK, I have the numbers now too.  So far I do not know why 465.tonto, in
addition to 416.gamess, failed to compile; I will investigate that
later.

Because the machine is quite a bit slower and everything took forever,
I measured unpatched trunk three times but then re-ran only those
benchmarks which were more than 2% off when compiled with the patched
compiler.

Haswell SPECINT 2006 -O2 generic tuning
=======================================

 Run-time
 --------

| Benchmark      | trunk |   x1 |     % |
|----------------+-------+------+-------|
| 400.perlbench  |   775 |  777 | +0.26 |
| 401.bzip2      |  1200 | 1200 | +0.00 |
| 403.gcc        |   655 |  656 | +0.15 |
| 429.mcf        |   547 |  517 | -5.48 |
| 445.gobmk      |  1140 | 1140 | +0.00 |
| 456.hmmer      |  1130 | 1130 | +0.00 |
| 458.sjeng      |  1310 | 1300 | -0.76 |
| 462.libquantum |   758 |  751 | -0.92 |
| 464.h264ref    |  1370 | 1390 | +1.46 |
| 471.omnetpp    |   475 |  471 | -0.84 |
| 473.astar      |   870 |  867 | -0.34 |
| 483.xalancbmk  |   488 |  486 | -0.41 |

 Text size
 ---------

| Benchmark      |   trunk |      x1 |     % |
|----------------+---------+---------+-------|
| 400.perlbench  |  875874 |  875954 | +0.01 |
| 401.bzip2      |   44754 |   44754 | +0.00 |
| 403.gcc        | 2294466 | 2296098 | +0.07 |
| 429.mcf        |    8226 |    8226 | +0.00 |
| 445.gobmk      |  579778 |  579826 | +0.01 |
| 456.hmmer      |  221058 |  221058 | +0.00 |
| 458.sjeng      |   93362 |   94882 | +1.63 |
| 462.libquantum |   28314 |   28362 | +0.17 |
| 464.h264ref    |  393874 |  393922 | +0.01 |
| 471.omnetpp    |  430306 |  430418 | +0.03 |
| 473.astar      |   29362 |   29538 | +0.60 |
| 483.xalancbmk  | 2361298 | 2361506 | +0.01 |

Haswell SPECINT 2006 -Ofast native tuning
=========================================

 Run-time
 --------

| Benchmark      | trunk |   x1 |     % |
|----------------+-------+------+-------|
| 400.perlbench  |   802 |  803 | +0.12 |
| 401.bzip2      |  1180 | 1170 | -0.85 |
| 403.gcc        |   646 |  647 | +0.15 |
| 429.mcf        |   543 |  508 | -6.45 |
| 445.gobmk      |  1130 | 1130 | +0.00 |
| 456.hmmer      |   529 |  532 | +0.57 |
| 458.sjeng      |  1260 | 1260 | +0.00 |
| 462.libquantum |   764 |  761 | -0.39 |
| 464.h264ref    |  1280 | 1290 | +0.78 |
| 471.omnetpp    |   476 |  464 | -2.52 |
| 473.astar      |   844 |  843 | -0.12 |
| 483.xalancbmk  |   480 |  476 | -0.83 |

 Text size
 ---------

| Benchmark      |   trunk |      x1 |     % |
|----------------+---------+---------+-------|
| 400.perlbench  | 1130994 | 1131058 | +0.01 |
| 401.bzip2      |   77346 |   77346 | +0.00 |
| 403.gcc        | 3099938 | 3101826 | +0.06 |
| 429.mcf        |   10162 |   10162 | +0.00 |
| 445.gobmk      |  766706 |  766786 | +0.01 |
| 456.hmmer      |  346610 |  346610 | +0.00 |
| 458.sjeng      |  143650 |  145522 | +1.30 |
| 462.libquantum |   30986 |   31066 | +0.26 |
| 464.h264ref    |  725218 |  725266 | +0.01 |
| 471.omnetpp    |  546386 |  546642 | +0.05 |
| 473.astar      |   38690 |   38914 | +0.58 |
| 483.xalancbmk  | 3313746 | 3313922 | +0.01 |

Haswell SPECFP 2006 -O2 generic tuning
======================================

 Run-time
 --------

| Benchmark     | trunk |   x1 |     % |
|---------------+-------+------+-------|
| 410.bwaves    |   833 |  831 | -0.24 |
| 416.gamess    |    NR |   NR |       |
| 433.milc      |   820 |  814 | -0.73 |
| 434.zeusmp    |   950 |  949 | -0.11 |
| 435.gromacs   |   945 |  946 | +0.11 |
| 436.cactusADM |  1380 | 1380 | +0.00 |
| 437.leslie3d  |   813 |  812 | -0.12 |
| 444.namd      |   983 |  983 | +0.00 |
| 447.dealII    |   755 |  759 | +0.53 |
| 450.soplex    |   467 |  464 | -0.64 |
| 453.povray    |   402 |  395 | -1.74 |
| 454.calculix  |  1980 | 1980 | +0.00 |
| 459.GemsFDTD  |   765 |  753 | -1.57 |
| 465.tonto     |    NR |   NR |       |
| 470.lbm       |   806 |  806 | +0.00 |
| 481.wrf       |  1330 | 1330 | +0.00 |
| 482.sphinx3   |  1380 | 1380 | +0.00 |

 Text size
 ---------

| Benchmark     |   trunk |      x1 |     % |
|---------------+---------+---------+-------|
| 410.bwaves    |   25954 |   25954 | +0.00 |
| 433.milc      |   87922 |   87922 | +0.00 |
| 434.zeusmp    |  212034 |  212034 | +0.00 |
| 435.gromacs   |  747026 |  747026 | +0.00 |
| 436.cactusADM |  526178 |  526178 | +0.00 |
| 437.leslie3d  |   83234 |   83234 | +0.00 |
| 444.namd      |  297234 |  297266 | +0.01 |
| 447.dealII    | 2165282 | 2172290 | +0.32 |
| 450.soplex    |  347122 |  347122 | +0.00 |
| 453.povray    |  800914 |  801570 | +0.08 |
| 454.calculix  | 1342802 | 1342802 | +0.00 |
| 459.GemsFDTD  |  353410 |  354050 | +0.18 |
| 470.lbm       |    9202 |    9202 | +0.00 |
| 481.wrf       | 3345170 | 3345170 | +0.00 |
| 482.sphinx3   |  125026 |  125026 | +0.00 |

Haswell SPECFP 2006 -Ofast native tuning
========================================

 Run-time
 --------

| Benchmark     | trunk |   x1 |     % |
|---------------+-------+------+-------|
| 410.bwaves    |   551 |  550 | -0.18 |
| 416.gamess    |    NR |   NR |       |
| 433.milc      |   773 |  776 | +0.39 |
| 434.zeusmp    |   660 |  660 | +0.00 |
| 435.gromacs   |   876 |  874 | -0.23 |
| 436.cactusADM |   620 |  619 | -0.16 |
| 437.leslie3d  |   501 |  501 | +0.00 |
| 444.namd      |   974 |  974 | +0.00 |
| 447.dealII    |   722 |  720 | -0.28 |
| 450.soplex    |   459 |  457 | -0.44 |
| 453.povray    |   416 |  410 | -1.44 |
| 454.calculix  |   883 |  882 | -0.11 |
| 459.GemsFDTD  |   625 |  614 | -1.76 |
| 465.tonto     |    NR |   NR |       |
| 470.lbm       |   783 |  781 | -0.26 |
| 481.wrf       |   748 |  746 | -0.27 |
| 482.sphinx3   |  1020 | 1020 | +0.00 |

 Text size
 ---------

| Benchmark     |   trunk |      x1 |     % |
|---------------+---------+---------+-------|
| 410.bwaves    |   30802 |   30802 | +0.00 |
| 433.milc      |  122450 |  122450 | +0.00 |
| 434.zeusmp    |  613458 |  613458 | +0.00 |
| 435.gromacs   |  957922 |  957922 | +0.00 |
| 436.cactusADM |  763794 |  763794 | +0.00 |
| 437.leslie3d  |  154690 |  154690 | +0.00 |
| 444.namd      |  311282 |  311314 | +0.01 |
| 447.dealII    | 2486482 | 2493202 | +0.27 |
| 450.soplex    |  436322 |  436322 | +0.00 |
| 453.povray    | 1088034 | 1088962 | +0.09 |
| 454.calculix  | 1701410 | 1701410 | +0.00 |
| 459.GemsFDTD  |  560642 |  560658 | +0.00 |
| 470.lbm       |    9458 |    9458 | +0.00 |
| 481.wrf       | 5413554 | 5413778 | +0.00 |
| 482.sphinx3   |  190034 |  190034 | +0.00 |

Haswell SPEC INTrate 2017 -O2 generic tuning
============================================

 Run-time
 --------

| Benchmark       | trunk |   x1 |     % |
|-----------------+-------+------+-------|
| 500.perlbench_r |  1201 | 1204 | +0.25 |
| 502.gcc_r       |   798 |  793 | -0.63 |
| 505.mcf_r       |  1038 | 1049 | +1.06 |
| 520.omnetpp_r   |   825 |  824 | -0.12 |
| 523.xalancbmk_r |   985 |  981 | -0.41 |
| 525.x264_r      |  1463 | 1463 | +0.00 |
| 531.deepsjeng_r |   954 |  956 | +0.21 |
| 541.leela_r     |  1570 | 1569 | -0.06 |
| 548.exchange2_r |  1266 | 1267 | +0.08 |
| 557.xz_r        |  1033 | 1029 | -0.39 |

 Text size
 ---------
 
| Benchmark       |   trunk |      x1 |     % |
|-----------------+---------+---------+-------|
| 500.perlbench_r | 1599442 | 1599522 | +0.01 |
| 502.gcc_r       | 6757602 | 6759090 | +0.02 |
| 505.mcf_r       |   16098 |   16098 | +0.00 |
| 520.omnetpp_r   | 1262498 | 1264034 | +0.12 |
| 523.xalancbmk_r | 3989026 | 3989202 | +0.00 |
| 525.x264_r      |  414130 |  414194 | +0.02 |
| 531.deepsjeng_r |   67426 |   67458 | +0.05 |
| 541.leela_r     |  219378 |  219378 | +0.00 |
| 548.exchange2_r |   61234 |   61234 | +0.00 |
| 557.xz_r        |  111490 |  111490 | +0.00 |

Haswell SPEC INTrate 2017 -Ofast native tuning
==============================================

 Run-time
 --------

| Benchmark       | trunk |   x1 |      % |
|-----------------+-------+------+--------|
| 500.perlbench_r |  1169 | 1170 |  +0.09 |
| 502.gcc_r       |   786 |  788 |  +0.25 |
| 505.mcf_r       |  1034 | 1032 |  -0.19 |
| 520.omnetpp_r   |   804 |  794 |  -1.24 |
| 523.xalancbmk_r |   962 |  971 |  +0.94 |
| 525.x264_r      |   886 |  887 |  +0.11 |
| 531.deepsjeng_r |   939 |  944 |  +0.53 |
| 541.leela_r     |  1462 | 1461 |  -0.07 |
| 548.exchange2_r |  1078 | 1082 |  +0.37 |
| 557.xz_r        |   960 |  950 |  -1.04 |

 Text size
 ---------

| Benchmark       |   trunk |      x1 |     % |
|-----------------+---------+---------+-------|
| 500.perlbench_r | 2074450 | 2074498 | +0.00 |
| 502.gcc_r       | 8434514 | 8437250 | +0.03 |
| 505.mcf_r       |   26322 |   26322 | +0.00 |
| 520.omnetpp_r   | 1680082 | 1682130 | +0.12 |
| 523.xalancbmk_r | 4853458 | 4853682 | +0.00 |
| 525.x264_r      |  594210 |  594210 | +0.00 |
| 531.deepsjeng_r |   88050 |   88082 | +0.04 |
| 541.leela_r     |  269298 |  269314 | +0.01 |
| 548.exchange2_r |  114098 |  114098 | +0.00 |
| 557.xz_r        |  152354 |  152354 | +0.00 |

Haswell SPEC FP rate 2017 - generic tuning
==========================================

 Run-time
 --------

| Benchmark       | trunk |   x1 |      % |
|-----------------+-------+------+--------|
| 503.bwaves_r    |  2319 | 2343 |  +1.03 |
| 507.cactuBSSN_r |  1023 |  975 |  -4.69 |
| 508.namd_r      |   934 |  935 |  +0.11 |
| 510.parest_r    |  1391 | 1413 |  +1.58 |
| 511.povray_r    |  1544 | 1570 |  +1.68 |
| 519.lbm_r       |   920 |  920 |  +0.00 |
| 521.wrf_r       |  2955 | 2958 |  +0.10 |
| 526.blender_r   |   976 |  974 |  -0.20 |
| 527.cam4_r      |  1580 | 1586 |  +0.38 |
| 538.imagick_r   |  1758 | 1581 | -10.07 |
| 544.nab_r       |  1357 | 1356 |  -0.07 |
| 549.fotonik3d_r |  1063 | 1077 |  +1.32 |
| 554.roms_r      |  1280 | 1283 |  +0.23 |

 Text size
 ---------
 
| Benchmark       |    trunk |       x1 |     % |
|-----------------+----------+----------+-------|
| 503.bwaves_r    |    32034 |    32034 | +0.00 |
| 507.cactuBSSN_r |  2951634 |  2951634 | +0.00 |
| 508.namd_r      |   837458 |   837490 | +0.00 |
| 510.parest_r    |  6540866 |  6546754 | +0.09 |
| 511.povray_r    |   803618 |   804274 | +0.08 |
| 519.lbm_r       |    12018 |    12018 | +0.00 |
| 521.wrf_r       | 16292962 | 16296978 | +0.02 |
| 526.blender_r   |  7268224 |  7282608 | +0.20 |
| 527.cam4_r      |  5063666 |  5065010 | +0.03 |
| 538.imagick_r   |  1608178 |  1609282 | +0.07 |
| 544.nab_r       |   156242 |   156242 | +0.00 |
| 549.fotonik3d_r |   326738 |   326738 | +0.00 |
| 554.roms_r      |   728546 |   728546 | +0.00 |

Haswell SPEC FP rate 2017 - native tuning
=========================================

 Run-time
 --------

| Benchmark       | trunk |   x1 |     % |
|-----------------+-------+------+-------|
| 503.bwaves_r    |   919 |  919 | +0.00 |
| 507.cactuBSSN_r |   864 |  853 | -1.27 |
| 508.namd_r      |   924 |  924 | +0.00 |
| 510.parest_r    |  1219 | 1220 | +0.08 |
| 511.povray_r    |  1597 | 1624 | +1.69 |
| 519.lbm_r       |   851 |  851 | +0.00 |
| 521.wrf_r       |  1591 | 1594 | +0.19 |
| 526.blender_r   |   912 |  920 | +0.88 |
| 527.cam4_r      |  1296 | 1309 | +1.00 |
| 538.imagick_r   |  1227 | 1207 | -1.63 |
| 544.nab_r       |  1278 | 1278 | +0.00 |
| 549.fotonik3d_r |    VE |   VE |       |
| 554.roms_r      |  1036 | 1037 | +0.10 |

 Text size
 ---------

| Benchmark       |    trunk |       x1 |     % |
|-----------------+----------+----------+-------|
| 503.bwaves_r    |    39426 |    39426 | +0.00 |
| 507.cactuBSSN_r |  3991794 |  3991794 | +0.00 |
| 508.namd_r      |   956450 |   956466 | +0.00 |
| 510.parest_r    |  7341122 |  7345426 | +0.06 |
| 511.povray_r    |  1083010 |  1083938 | +0.09 |
| 519.lbm_r       |    11826 |    11826 | +0.00 |
| 521.wrf_r       | 22028578 | 22032098 | +0.02 |
| 526.blender_r   |  9698768 |  9718544 | +0.20 |
| 527.cam4_r      |  6738562 |  6740050 | +0.02 |
| 538.imagick_r   |  2246674 |  2247122 | +0.02 |
| 544.nab_r       |   211378 |   211378 | +0.00 |
| 549.fotonik3d_r |   582626 |   582626 | +0.00 |
| 554.roms_r      |  1085234 |  1085234 | +0.00 |

Martin
Martin Jambor Nov. 23, 2017, 3:32 p.m. UTC | #16
Hi,

On Mon, Nov 13 2017, Richard Biener wrote:
> The main concern here is that GIMPLE is not very well defined for
> aggregate copies and that gimple-fold.c happily optimizes
> memcpy (&a, &b, sizeof (a)) into a = b;
>
> struct A { short s; long i; long j; };
> struct A a, b;
> void foo ()
> {
>   __builtin_memcpy (&a, &b, sizeof (struct A));
> }
>
> gets folded to
>
>   MEM[(char * {ref-all})&a] = MEM[(char * {ref-all})&b];
>   return;
>
> you see we're careful about TBAA but (don't see that above but
> can be verified by for example debugging expand_assignment)
> TREE_TYPE (MEM[...]) is actually 'struct A'.
>
> And yes, I've been worried about SRA as well here...  it _does_
> have some early outs when seeing VIEW_CONVERT_EXPR but
> appearantly not for the above.  Testcase that aborts with SRA but
> not without:
>
> struct A { short s; long i; long j; };
> struct A a, b;
> void foo ()
> {
>   struct A c;
>   __builtin_memcpy (&c, &b, sizeof (struct A));
>   __builtin_memcpy (&a, &c, sizeof (struct A));
> }
> int main()
> {
>   __builtin_memset (&b, 0, sizeof (struct A));
>   b.s = 1;
>   __builtin_memcpy ((char *)&b+2, &b, 2);
>   foo ();
>   __builtin_memcpy (&a, (char *)&a+2, 2);
>   if (a.s != 1)
>     __builtin_abort ();
>   return 0;
> }

Thanks for the testcase, I agree that is a fairly big problem.  Do you
think that the following (untested) patch is an appropriate way of
fixing it and generally of extending gimple to capture that a statement
is a bit-copy?

If so, I'll add the testcase, bootstrap it and formally propose it.
Subsequently I will of course make sure that any element-wise copying
patch would test the predicate.

Thanks,

Martin


2017-11-23  Martin Jambor  <mjambor@suse.cz>

	* gimple.c (gimple_bit_copy_p): New function.
	* gimple.h (gimple_bit_copy_p): Declare it.
	* tree-sra.c (sra_modify_assign): Use it.
---
 gcc/gimple.c   | 20 ++++++++++++++++++++
 gcc/gimple.h   |  1 +
 gcc/tree-sra.c |  1 +
 3 files changed, 22 insertions(+)

diff --git a/gcc/gimple.c b/gcc/gimple.c
index c986a732004..e1b428d91bb 100644
--- a/gcc/gimple.c
+++ b/gcc/gimple.c
@@ -3087,6 +3087,26 @@ gimple_inexpensive_call_p (gcall *stmt)
   return false;
 }
 
+/* Return true if STMT is an assignment performing a bit-copy and so is also
+   expected to copy any padding.  */
+
+bool
+gimple_bit_copy_p (gassign *stmt)
+{
+  if (!gimple_assign_single_p (stmt))
+    return false;
+
+  tree lhs = gimple_assign_lhs (stmt);
+  if (TREE_CODE (lhs) == MEM_REF
+      && TYPE_REF_CAN_ALIAS_ALL (reference_alias_ptr_type (lhs)))
+    return true;
+  tree rhs = gimple_assign_rhs1 (stmt);
+  if (TREE_CODE (rhs) == MEM_REF
+      && TYPE_REF_CAN_ALIAS_ALL (reference_alias_ptr_type (rhs)))
+    return true;
+  return false;
+}
+
 #if CHECKING_P
 
 namespace selftest {
diff --git a/gcc/gimple.h b/gcc/gimple.h
index 334def89398..60929473361 100644
--- a/gcc/gimple.h
+++ b/gcc/gimple.h
@@ -1531,6 +1531,7 @@ extern void gimple_seq_discard (gimple_seq);
 extern void maybe_remove_unused_call_args (struct function *, gimple *);
 extern bool gimple_inexpensive_call_p (gcall *);
 extern bool stmt_can_terminate_bb_p (gimple *);
+extern bool gimple_bit_copy_p (gassign *);
 
 /* Formal (expression) temporary table handling: multiple occurrences of
    the same scalar expression are evaluated into the same temporary.  */
diff --git a/gcc/tree-sra.c b/gcc/tree-sra.c
index db490b20c3e..fc0a8fe60bf 100644
--- a/gcc/tree-sra.c
+++ b/gcc/tree-sra.c
@@ -3591,6 +3591,7 @@ sra_modify_assign (gimple *stmt, gimple_stmt_iterator *gsi)
       || gimple_has_volatile_ops (stmt)
       || contains_vce_or_bfcref_p (rhs)
       || contains_vce_or_bfcref_p (lhs)
+      || gimple_bit_copy_p (as_a <gassign *> (stmt))
       || stmt_ends_bb_p (stmt))
     {
       /* No need to copy into a constant-pool, it comes pre-initialized.  */
Jakub Jelinek Nov. 23, 2017, 3:50 p.m. UTC | #17
On Thu, Nov 23, 2017 at 04:32:43PM +0100, Martin Jambor wrote:
> > struct A { short s; long i; long j; };
> > struct A a, b;
> > void foo ()
> > {
> >   struct A c;
> >   __builtin_memcpy (&c, &b, sizeof (struct A));
> >   __builtin_memcpy (&a, &c, sizeof (struct A));
> > }
> > int main()
> > {
> >   __builtin_memset (&b, 0, sizeof (struct A));
> >   b.s = 1;
> >   __builtin_memcpy ((char *)&b+2, &b, 2);
> >   foo ();
> >   __builtin_memcpy (&a, (char *)&a+2, 2);
> >   if (a.s != 1)
> >     __builtin_abort ();
> >   return 0;
> > }

Note the testcase would need to be guarded with sizeof (short) == 2
and offsetof (struct A, i) >= 4.
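
E.g. (just a sketch) by bailing out early in main:

  if (sizeof (short) != 2 || __builtin_offsetof (struct A, i) < 4)
    return 0;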

> Thanks for the testcase, I agree that is a fairly big problem.  Do you
> think that the following (untested) patch is an appropriate way of
> fixing it and generally of extending gimple to capture that a statement
> is a bit-copy?

Can you bail out just if the type contains any padding?  If there is no
padding, then perhaps SRA still might do its stuff (though, e.g. if it
contains bitfields, we'd need to hope store-merging merges it all back
again).
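
FWIW, a rough sketch of such a padding check (mine, untested, and
handling plain nested records only, no arrays or unions):

/* Return true if the RECORD_TYPE TYPE contains padding, i.e. if the
   bit-sizes of its fields do not add up to TYPE_SIZE.  */
static bool
record_contains_padding_p (tree type)
{
  unsigned HOST_WIDE_INT sum = 0;
  for (tree fld = TYPE_FIELDS (type); fld; fld = DECL_CHAIN (fld))
    if (TREE_CODE (fld) == FIELD_DECL)
      {
        if (!DECL_SIZE (fld) || !tree_fits_uhwi_p (DECL_SIZE (fld)))
          return true;
        if (TREE_CODE (TREE_TYPE (fld)) == RECORD_TYPE
            && record_contains_padding_p (TREE_TYPE (fld)))
          return true;
        sum += tree_to_uhwi (DECL_SIZE (fld));
      }
  return (!tree_fits_uhwi_p (TYPE_SIZE (type))
          || sum != tree_to_uhwi (TYPE_SIZE (type)));
}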

	Jakub
Richard Biener Nov. 24, 2017, 10:31 a.m. UTC | #18
On Thu, Nov 23, 2017 at 4:32 PM, Martin Jambor <mjambor@suse.cz> wrote:
> Hi,
>
> On Mon, Nov 13 2017, Richard Biener wrote:
>> The main concern here is that GIMPLE is not very well defined for
>> aggregate copies and that gimple-fold.c happily optimizes
>> memcpy (&a, &b, sizeof (a)) into a = b;
>>
>> struct A { short s; long i; long j; };
>> struct A a, b;
>> void foo ()
>> {
>>   __builtin_memcpy (&a, &b, sizeof (struct A));
>> }
>>
>> gets folded to
>>
>>   MEM[(char * {ref-all})&a] = MEM[(char * {ref-all})&b];
>>   return;
>>
>> you see we're careful about TBAA but (don't see that above but
>> can be verified by for example debugging expand_assignment)
>> TREE_TYPE (MEM[...]) is actually 'struct A'.
>>
>> And yes, I've been worried about SRA as well here...  it _does_
>> have some early outs when seeing VIEW_CONVERT_EXPR but
>> appearantly not for the above.  Testcase that aborts with SRA but
>> not without:
>>
>> struct A { short s; long i; long j; };
>> struct A a, b;
>> void foo ()
>> {
>>   struct A c;
>>   __builtin_memcpy (&c, &b, sizeof (struct A));
>>   __builtin_memcpy (&a, &c, sizeof (struct A));
>> }
>> int main()
>> {
>>   __builtin_memset (&b, 0, sizeof (struct A));
>>   b.s = 1;
>>   __builtin_memcpy ((char *)&b+2, &b, 2);
>>   foo ();
>>   __builtin_memcpy (&a, (char *)&a+2, 2);
>>   if (a.s != 1)
>>     __builtin_abort ();
>>   return 0;
>> }
>
> Thanks for the testcase, I agree that is a fairly big problem.  Do you
> think that the following (untested) patch is an appropriate way of
> fixing it and generally of extending gimple to capture that a statement
> is a bit-copy?

I think the place to fix is the memcpy folding.  That is, we'd say that
aggregate assignments are not bit-copies but do element-wise assignments.
For memcpy folding we'd then need to use a type that doesn't contain
padding.  Which effectively means char[].

Of course we need to stop SRA from decomposing that copy to
individual characters then ;)

So iff we decide that all aggregate copies are element copies,
maybe only those where TYPE_MAIN_VARIANT of lhs and rhs match
(currently we allow TYPE_CANONICAL compatibility and thus there
might be some mismatches), then we have to fix nothign but
the memcpy folding.

> If so, I'll add the testcase, bootstrap it and formally propose it.
> Subsequently I will of course make sure that any element-wise copying
> patch would test the predicate.

I don't think the alias-set should determine whether a copy is
bit-wise or not.

Richard.

> Thanks,
>
> Martin
>
>
> 2017-11-23  Martin Jambor  <mjambor@suse.cz>
>
>         * gimple.c (gimple_bit_copy_p): New function.
>         * gimple.h (gimple_bit_copy_p): Declare it.
>         * tree-sra.c (sra_modify_assign): Use it.
> ---
>  gcc/gimple.c   | 20 ++++++++++++++++++++
>  gcc/gimple.h   |  1 +
>  gcc/tree-sra.c |  1 +
>  3 files changed, 22 insertions(+)
>
> diff --git a/gcc/gimple.c b/gcc/gimple.c
> index c986a732004..e1b428d91bb 100644
> --- a/gcc/gimple.c
> +++ b/gcc/gimple.c
> @@ -3087,6 +3087,26 @@ gimple_inexpensive_call_p (gcall *stmt)
>    return false;
>  }
>
> +/* Return true if STMT is an assignment performing a bit-copy and so is also
> +   expected to copy any padding.  */
> +
> +bool
> +gimple_bit_copy_p (gassign *stmt)
> +{
> +  if (!gimple_assign_single_p (stmt))
> +    return false;
> +
> +  tree lhs = gimple_assign_lhs (stmt);
> +  if (TREE_CODE (lhs) == MEM_REF
> +      && TYPE_REF_CAN_ALIAS_ALL (reference_alias_ptr_type (lhs)))
> +    return true;
> +  tree rhs = gimple_assign_rhs1 (stmt);
> +  if (TREE_CODE (rhs) == MEM_REF
> +      && TYPE_REF_CAN_ALIAS_ALL (reference_alias_ptr_type (rhs)))
> +    return true;
> +  return false;
> +}
> +
>  #if CHECKING_P
>
>  namespace selftest {
> diff --git a/gcc/gimple.h b/gcc/gimple.h
> index 334def89398..60929473361 100644
> --- a/gcc/gimple.h
> +++ b/gcc/gimple.h
> @@ -1531,6 +1531,7 @@ extern void gimple_seq_discard (gimple_seq);
>  extern void maybe_remove_unused_call_args (struct function *, gimple *);
>  extern bool gimple_inexpensive_call_p (gcall *);
>  extern bool stmt_can_terminate_bb_p (gimple *);
> +extern bool gimple_bit_copy_p (gassign *);
>
>  /* Formal (expression) temporary table handling: multiple occurrences of
>     the same scalar expression are evaluated into the same temporary.  */
> diff --git a/gcc/tree-sra.c b/gcc/tree-sra.c
> index db490b20c3e..fc0a8fe60bf 100644
> --- a/gcc/tree-sra.c
> +++ b/gcc/tree-sra.c
> @@ -3591,6 +3591,7 @@ sra_modify_assign (gimple *stmt, gimple_stmt_iterator *gsi)
>        || gimple_has_volatile_ops (stmt)
>        || contains_vce_or_bfcref_p (rhs)
>        || contains_vce_or_bfcref_p (lhs)
> +      || gimple_bit_copy_p (as_a <gassign *> (stmt))
>        || stmt_ends_bb_p (stmt))
>      {
>        /* No need to copy into a constant-pool, it comes pre-initialized.  */
> --
> 2.15.0
>
Richard Biener Nov. 24, 2017, 10:57 a.m. UTC | #19
On Fri, Nov 24, 2017 at 11:31 AM, Richard Biener
<richard.guenther@gmail.com> wrote:
> On Thu, Nov 23, 2017 at 4:32 PM, Martin Jambor <mjambor@suse.cz> wrote:
>> Hi,
>>
>> On Mon, Nov 13 2017, Richard Biener wrote:
>>> The main concern here is that GIMPLE is not very well defined for
>>> aggregate copies and that gimple-fold.c happily optimizes
>>> memcpy (&a, &b, sizeof (a)) into a = b;
>>>
>>> struct A { short s; long i; long j; };
>>> struct A a, b;
>>> void foo ()
>>> {
>>>   __builtin_memcpy (&a, &b, sizeof (struct A));
>>> }
>>>
>>> gets folded to
>>>
>>>   MEM[(char * {ref-all})&a] = MEM[(char * {ref-all})&b];
>>>   return;
>>>
>>> you see we're careful about TBAA but (don't see that above but
>>> can be verified by for example debugging expand_assignment)
>>> TREE_TYPE (MEM[...]) is actually 'struct A'.
>>>
>>> And yes, I've been worried about SRA as well here...  it _does_
>>> have some early outs when seeing VIEW_CONVERT_EXPR but
>>> appearantly not for the above.  Testcase that aborts with SRA but
>>> not without:
>>>
>>> struct A { short s; long i; long j; };
>>> struct A a, b;
>>> void foo ()
>>> {
>>>   struct A c;
>>>   __builtin_memcpy (&c, &b, sizeof (struct A));
>>>   __builtin_memcpy (&a, &c, sizeof (struct A));
>>> }
>>> int main()
>>> {
>>>   __builtin_memset (&b, 0, sizeof (struct A));
>>>   b.s = 1;
>>>   __builtin_memcpy ((char *)&b+2, &b, 2);
>>>   foo ();
>>>   __builtin_memcpy (&a, (char *)&a+2, 2);
>>>   if (a.s != 1)
>>>     __builtin_abort ();
>>>   return 0;
>>> }
>>
>> Thanks for the testcase, I agree that is a fairly big problem.  Do you
>> think that the following (untested) patch is an appropriate way of
>> fixing it and generally of extending gimple to capture that a statement
>> is a bit-copy?
>
> I think the place to fix is the memcpy folding.  That is, we'd say that
> aggregate assignments are not bit-copies but do element-wise assignments.
> For memcpy folding we'd then need to use a type that doesn't contain
> padding.  Which effectively means char[].
>
> Of course we need to stop SRA from decomposing that copy to
> individual characters then ;)
>
> So iff we decide that all aggregate copies are element copies,
> maybe only those where TYPE_MAIN_VARIANT of lhs and rhs match
> (currently we allow TYPE_CANONICAL compatibility and thus there
>> might be some mismatches), then we have to fix nothing but
> the memcpy folding.
>
>> If so, I'll add the testcase, bootstrap it and formally propose it.
>> Subsequently I will of course make sure that any element-wise copying
>> patch would test the predicate.
>
> I don't think the alias-set should determine whether a copy is
> bit-wise or not.

Like the attached.  At least FAILs

FAIL: gcc.dg/tree-ssa/ssa-ccp-27.c scan-tree-dump-times ccp1
"memcpy[^\n]*123456" 2 (found 0 times)

not sure why we have this test.

Richard.
Richard Biener Nov. 24, 2017, 11:12 a.m. UTC | #20
On Fri, Nov 24, 2017 at 11:57 AM, Richard Biener
<richard.guenther@gmail.com> wrote:
> On Fri, Nov 24, 2017 at 11:31 AM, Richard Biener
> <richard.guenther@gmail.com> wrote:
>> On Thu, Nov 23, 2017 at 4:32 PM, Martin Jambor <mjambor@suse.cz> wrote:
>>> Hi,
>>>
>>> On Mon, Nov 13 2017, Richard Biener wrote:
>>>> The main concern here is that GIMPLE is not very well defined for
>>>> aggregate copies and that gimple-fold.c happily optimizes
>>>> memcpy (&a, &b, sizeof (a)) into a = b;
>>>>
>>>> struct A { short s; long i; long j; };
>>>> struct A a, b;
>>>> void foo ()
>>>> {
>>>>   __builtin_memcpy (&a, &b, sizeof (struct A));
>>>> }
>>>>
>>>> gets folded to
>>>>
>>>>   MEM[(char * {ref-all})&a] = MEM[(char * {ref-all})&b];
>>>>   return;
>>>>
>>>> you see we're careful about TBAA but (don't see that above but
>>>> can be verified by for example debugging expand_assignment)
>>>> TREE_TYPE (MEM[...]) is actually 'struct A'.
>>>>
>>>> And yes, I've been worried about SRA as well here...  it _does_
>>>> have some early outs when seeing VIEW_CONVERT_EXPR but
>>>> appearantly not for the above.  Testcase that aborts with SRA but
>>>> not without:
>>>>
>>>> struct A { short s; long i; long j; };
>>>> struct A a, b;
>>>> void foo ()
>>>> {
>>>>   struct A c;
>>>>   __builtin_memcpy (&c, &b, sizeof (struct A));
>>>>   __builtin_memcpy (&a, &c, sizeof (struct A));
>>>> }
>>>> int main()
>>>> {
>>>>   __builtin_memset (&b, 0, sizeof (struct A));
>>>>   b.s = 1;
>>>>   __builtin_memcpy ((char *)&b+2, &b, 2);
>>>>   foo ();
>>>>   __builtin_memcpy (&a, (char *)&a+2, 2);
>>>>   if (a.s != 1)
>>>>     __builtin_abort ();
>>>>   return 0;
>>>> }
>>>
>>> Thanks for the testcase, I agree that is a fairly big problem.  Do you
>>> think that the following (untested) patch is an appropriate way of
>>> fixing it and generally of extending gimple to capture that a statement
>>> is a bit-copy?
>>
>> I think the place to fix is the memcpy folding.  That is, we'd say that
>> aggregate assignments are not bit-copies but do element-wise assignments.
>> For memcpy folding we'd then need to use a type that doesn't contain
>> padding.  Which effectively means char[].
>>
>> Of course we need to stop SRA from decomposing that copy to
>> individual characters then ;)
>>
>> So iff we decide that all aggregate copies are element copies,
>> maybe only those where TYPE_MAIN_VARIANT of lhs and rhs match
>> (currently we allow TYPE_CANONICAL compatibility and thus there
>>> might be some mismatches), then we have to fix nothing but
>> the memcpy folding.
>>
>>> If so, I'll add the testcase, bootstrap it and formally propose it.
>>> Subsequently I will of course make sure that any element-wise copying
>>> patch would test the predicate.
>>
>> I don't think the alias-set should determine whether a copy is
>> bit-wise or not.
>
> Like the attached.  At least FAILs
>
> FAIL: gcc.dg/tree-ssa/ssa-ccp-27.c scan-tree-dump-times ccp1
> "memcpy[^\n]*123456" 2 (found 0 times)
>
> not sure why we have this test.

Hum.  And SRA still decomposes the copy to struct elements w/o padding
even though the access is done using char[].  So somehow it ignores
VIEW_CONVERT_EXPRs
(well, those implicitly present on MEM_REFs).

Looks like this is because total scalarization is done on the decls and
not at all honoring how the variable is accessed?  The following seems
to fix that, otherwise untested.

Index: gcc/tree-sra.c
===================================================================
--- gcc/tree-sra.c      (revision 255137)
+++ gcc/tree-sra.c      (working copy)
@@ -1338,7 +1338,9 @@ build_accesses_from_assign (gimple *stmt
     {
       racc->grp_assignment_read = 1;
       if (should_scalarize_away_bitmap && !gimple_has_volatile_ops (stmt)
-         && !is_gimple_reg_type (racc->type))
+         && !is_gimple_reg_type (racc->type)
+         && (TYPE_MAIN_VARIANT (racc->type)
+             == TYPE_MAIN_VARIANT (TREE_TYPE (racc->base))))
        bitmap_set_bit (should_scalarize_away_bitmap, DECL_UID (racc->base));
       if (storage_order_barrier_p (lhs))
        racc->grp_unscalarizable_region = 1;


I'm giving this full testing with the folding fix.

Richard.

> Richard.
Martin Jambor Nov. 24, 2017, 11:53 a.m. UTC | #21
Hi Richi,

On Fri, Nov 24 2017, Richard Biener wrote:
> On Fri, Nov 24, 2017 at 11:57 AM, Richard Biener
> <richard.guenther@gmail.com> wrote:
>> On Fri, Nov 24, 2017 at 11:31 AM, Richard Biener

..

>>>>> And yes, I've been worried about SRA as well here...  it _does_
>>>>> have some early outs when seeing VIEW_CONVERT_EXPR but
>>>>> appearantly not for the above.  Testcase that aborts with SRA but
>>>>> not without:
>>>>>
>>>>> struct A { short s; long i; long j; };
>>>>> struct A a, b;
>>>>> void foo ()
>>>>> {
>>>>>   struct A c;
>>>>>   __builtin_memcpy (&c, &b, sizeof (struct A));
>>>>>   __builtin_memcpy (&a, &c, sizeof (struct A));
>>>>> }
>>>>> int main()
>>>>> {
>>>>>   __builtin_memset (&b, 0, sizeof (struct A));
>>>>>   b.s = 1;
>>>>>   __builtin_memcpy ((char *)&b+2, &b, 2);
>>>>>   foo ();
>>>>>   __builtin_memcpy (&a, (char *)&a+2, 2);
>>>>>   if (a.s != 1)
>>>>>     __builtin_abort ();
>>>>>   return 0;
>>>>> }
>>>>
>>>> Thanks for the testcase, I agree that is a fairly big problem.  Do you
>>>> think that the following (untested) patch is an appropriate way of
>>>> fixing it and generally of extending gimple to capture that a statement
>>>> is a bit-copy?
>>>
>>> I think the place to fix is the memcpy folding.  That is, we'd say that
>>> aggregate assignments are not bit-copies but do element-wise assignments.
>>> For memcpy folding we'd then need to use a type that doesn't contain
>>> padding.  Which effectively means char[].
>>>
>>> Of course we need to stop SRA from decomposing that copy to
>>> individual characters then ;)
>>>
>>> So iff we decide that all aggregate copies are element copies,
>>> maybe only those where TYPE_MAIN_VARIANT of lhs and rhs match
>>> (currently we allow TYPE_CANONICAL compatibility and thus there
>>> might be some mismatches), then we have to fix nothing but
>>> the memcpy folding.
>>>
>>>> If so, I'll add the testcase, bootstrap it and formally propose it.
>>>> Subsequently I will of course make sure that any element-wise copying
>>>> patch would test the predicate.
>>>
>>> I don't think the alias-set should determine whether a copy is
>>> bit-wise or not.
>>
>> Like the attached.  At least FAILs
>>
>> FAIL: gcc.dg/tree-ssa/ssa-ccp-27.c scan-tree-dump-times ccp1
>> "memcpy[^\n]*123456" 2 (found 0 times)
>>
>> not sure why we have this test.
>
> Hum.  And SRA still decomposes the copy to struct elements w/o padding
> even though the access is done using char[].  So somehow it ignores
> VIEW_CONVERT_EXPRs
> (well, those implicitly present on MEM_REFs).

Yes.  SRA is not even too afraid of top-level V_C_Es.  It really bails
out only if they are buried under a handled_component.  And it does not
remove aggregate assignments containing them.

>
> Looks like this is because total scalarization is done on the decls and
> not at all honoring how the variable is accessed?  The following seems
> to fix that, otherwise untested.


>
> Index: gcc/tree-sra.c
> ===================================================================
> --- gcc/tree-sra.c      (revision 255137)
> +++ gcc/tree-sra.c      (working copy)
> @@ -1338,7 +1338,9 @@ build_accesses_from_assign (gimple *stmt
>      {
>        racc->grp_assignment_read = 1;
>        if (should_scalarize_away_bitmap && !gimple_has_volatile_ops (stmt)
> -         && !is_gimple_reg_type (racc->type))
> +         && !is_gimple_reg_type (racc->type)
> +         && (TYPE_MAIN_VARIANT (racc->type)
> +             == TYPE_MAIN_VARIANT (TREE_TYPE (racc->base))))
>         bitmap_set_bit (should_scalarize_away_bitmap, DECL_UID (racc->base));
>        if (storage_order_barrier_p (lhs))
>         racc->grp_unscalarizable_region = 1;

I believe that the added condition is not what you want; it seems to
trigger also for an ordinary:

  s1 = s2.field

where racc->type is the type of the field but racc->base is s2 and its
type is the type of the structure.

I also think you want to be setting a bit in
cannot_scalarize_away_bitmap in order to guarantee that total
scalarization will not happen for the given candidate.  Otherwise some
other regular assignment might trigger it ...except if we then also
checked the statement for bit-copying types in sra_modify_assign (in the
condition after the big comment), which I suppose is actually the
correct thing to do.
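
I mean something like this (only a sketch, in build_accesses_from_assign,
reusing the predicate from the patch above):

  if (cannot_scalarize_away_bitmap
      && gimple_bit_copy_p (as_a <gassign *> (stmt)))
    bitmap_set_bit (cannot_scalarize_away_bitmap, DECL_UID (racc->base));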

Thanks a lot for the folding patch, I can take over the SRA bits if you
want to.

Martin
Richard Biener Nov. 24, 2017, 12:01 p.m. UTC | #22
On Fri, Nov 24, 2017 at 12:53 PM, Martin Jambor <mjambor@suse.cz> wrote:
> Hi Richi,
>
> On Fri, Nov 24 2017, Richard Biener wrote:
>> On Fri, Nov 24, 2017 at 11:57 AM, Richard Biener
>> <richard.guenther@gmail.com> wrote:
>>> On Fri, Nov 24, 2017 at 11:31 AM, Richard Biener
>
> ..
>
>>>>>> And yes, I've been worried about SRA as well here...  it _does_
>>>>>> have some early outs when seeing VIEW_CONVERT_EXPR but
>>>>>> appearantly not for the above.  Testcase that aborts with SRA but
>>>>>> not without:
>>>>>>
>>>>>> struct A { short s; long i; long j; };
>>>>>> struct A a, b;
>>>>>> void foo ()
>>>>>> {
>>>>>>   struct A c;
>>>>>>   __builtin_memcpy (&c, &b, sizeof (struct A));
>>>>>>   __builtin_memcpy (&a, &c, sizeof (struct A));
>>>>>> }
>>>>>> int main()
>>>>>> {
>>>>>>   __builtin_memset (&b, 0, sizeof (struct A));
>>>>>>   b.s = 1;
>>>>>>   __builtin_memcpy ((char *)&b+2, &b, 2);
>>>>>>   foo ();
>>>>>>   __builtin_memcpy (&a, (char *)&a+2, 2);
>>>>>>   if (a.s != 1)
>>>>>>     __builtin_abort ();
>>>>>>   return 0;
>>>>>> }
>>>>>
>>>>> Thanks for the testcase, I agree that is a fairly big problem.  Do you
>>>>> think that the following (untested) patch is an appropriate way of
>>>>> fixing it and generally of extending gimple to capture that a statement
>>>>> is a bit-copy?
>>>>
>>>> I think the place to fix is the memcpy folding.  That is, we'd say that
>>>> aggregate assignments are not bit-copies but do element-wise assignments.
>>>> For memcpy folding we'd then need to use a type that doesn't contain
>>>> padding.  Which effectively means char[].
>>>>
>>>> Of course we need to stop SRA from decomposing that copy to
>>>> individual characters then ;)
>>>>
>>>> So iff we decide that all aggregate copies are element copies,
>>>> maybe only those where TYPE_MAIN_VARIANT of lhs and rhs match
>>>> (currently we allow TYPE_CANONICAL compatibility and thus there
>>>> might be some mismatches), then we have to fix nothing but
>>>> the memcpy folding.
>>>>
>>>>> If so, I'll add the testcase, bootstrap it and formally propose it.
>>>>> Subsequently I will of course make sure that any element-wise copying
>>>>> patch would test the predicate.
>>>>
>>>> I don't think the alias-set should determine whether a copy is
>>>> bit-wise or not.
>>>
>>> Like the attached.  At least FAILs
>>>
>>> FAIL: gcc.dg/tree-ssa/ssa-ccp-27.c scan-tree-dump-times ccp1
>>> "memcpy[^\n]*123456" 2 (found 0 times)
>>>
>>> not sure why we have this test.
>>
>> Hum.  And SRA still decomposes the copy to struct elements w/o padding
>> even though the access is done using char[].  So somehow it ignores
>> VIEW_CONVERT_EXPRs
>> (well, those implicitly present on MEM_REFs).
>
> Yes.  SRA is not even too afraid of top-level V_C_Es.  It really bails
> out only if they are buried under a handled_component.  And it does not
> remove aggregate assignments containing them.
>
>>
>> Looks like this is because total scalarization is done on the decls and
>> not at all honoring how the variable is accessed?  The following seems
>> to fix that, otherwise untested.
>
>
>>
>> Index: gcc/tree-sra.c
>> ===================================================================
>> --- gcc/tree-sra.c      (revision 255137)
>> +++ gcc/tree-sra.c      (working copy)
>> @@ -1338,7 +1338,9 @@ build_accesses_from_assign (gimple *stmt
>>      {
>>        racc->grp_assignment_read = 1;
>>        if (should_scalarize_away_bitmap && !gimple_has_volatile_ops (stmt)
>> -         && !is_gimple_reg_type (racc->type))
>> +         && !is_gimple_reg_type (racc->type)
>> +         && (TYPE_MAIN_VARIANT (racc->type)
>> +             == TYPE_MAIN_VARIANT (TREE_TYPE (racc->base))))
>>         bitmap_set_bit (should_scalarize_away_bitmap, DECL_UID (racc->base));
>>        if (storage_order_barrier_p (lhs))
>>         racc->grp_unscalarizable_region = 1;
>
> I believe that the added condition is not what you want; it seems to
> trigger also for an ordinary:
>
>   s1 = s2.field
>
> where racc->type is the type of the field but racc->base is s2 and its
> type is the type of the structure.

Yes.  But do we want to totally scalarize s2 in this case?  We only
access parts of it.  We don't seem to have a testcase that fails
(well, full testing still in progress).

> I also think you want to be setting a bit in
> cannot_scalarize_away_bitmap in order to guarantee that total
> scalarization will not happen for the given candidate.  Otherwise some
> other regular assignment might trigger it ...

Yeah, figured that out myself.

> except if we then also
> checked the statement for bit-copying types in sra_modify_assign (in the
> condition after the big comment), which I suppose is actually the
> correct thing to do.

But modification is too late, no?

> Thanks a lot for the folding patch, I can take over the SRA bits if you
> want to.

For reference below is the full patch.

Bootstrapped on x86_64-unknown-linux-gnu, testing in progress.

Ok for the SRA parts?

Thanks,
Richard.

2017-11-24  Richard Biener  <rguenther@suse.de>

        PR tree-optimization/83141
        * gimple-fold.c (gimple_fold_builtin_memory_op): Simplify
        aggregate copy generation by always using a unsigned char[]
        type to perform the copying.
        * tree-sra.c (build_accesses_from_assign): Disqualify
        accesses in non-native type for total scalarization.

        * gcc.dg/torture/pr83141.c: New testcase.
        * gcc.dg/tree-ssa/ssa-ccp-27.c: Adjust.
Martin Jambor Nov. 24, 2017, 1 p.m. UTC | #23
On Fri, Nov 24 2017, Richard Biener wrote:
> On Fri, Nov 24, 2017 at 12:53 PM, Martin Jambor <mjambor@suse.cz> wrote:
>> Hi Richi,
>>
>> On Fri, Nov 24 2017, Richard Biener wrote:
>>> On Fri, Nov 24, 2017 at 11:57 AM, Richard Biener
>>> <richard.guenther@gmail.com> wrote:
>>>> On Fri, Nov 24, 2017 at 11:31 AM, Richard Biener
>>
>> ..
>>
>>>>>>> And yes, I've been worried about SRA as well here...  it _does_
>>>>>>> have some early outs when seeing VIEW_CONVERT_EXPR but
>>>>>>> apparently not for the above.  Testcase that aborts with SRA but
>>>>>>> not without:
>>>>>>>
>>>>>>> struct A { short s; long i; long j; };
>>>>>>> struct A a, b;
>>>>>>> void foo ()
>>>>>>> {
>>>>>>>   struct A c;
>>>>>>>   __builtin_memcpy (&c, &b, sizeof (struct A));
>>>>>>>   __builtin_memcpy (&a, &c, sizeof (struct A));
>>>>>>> }
>>>>>>> int main()
>>>>>>> {
>>>>>>>   __builtin_memset (&b, 0, sizeof (struct A));
>>>>>>>   b.s = 1;
>>>>>>>   __builtin_memcpy ((char *)&b+2, &b, 2);
>>>>>>>   foo ();
>>>>>>>   __builtin_memcpy (&a, (char *)&a+2, 2);
>>>>>>>   if (a.s != 1)
>>>>>>>     __builtin_abort ();
>>>>>>>   return 0;
>>>>>>> }
>>>>>>
>>>>>> Thanks for the testcase, I agree that is a fairly big problem.  Do you
>>>>>> think that the following (untested) patch is an appropriate way of
>>>>>> fixing it and generally of extending gimple to capture that a statement
>>>>>> is a bit-copy?
>>>>>
>>>>> I think the place to fix is the memcpy folding.  That is, we'd say that
>>>>> aggregate assignments are not bit-copies but do element-wise assignments.
>>>>> For memcpy folding we'd then need to use a type that doesn't contain
>>>>> padding.  Which effectively means char[].
>>>>>
>>>>> Of course we need to stop SRA from decomposing that copy to
>>>>> individual characters then ;)
>>>>>
>>>>> So iff we decide that all aggregate copies are element copies,
>>>>> maybe only those where TYPE_MAIN_VARIANT of lhs and rhs match
>>>>> (currently we allow TYPE_CANONICAL compatibility and thus there
>>>>> might be some mismatches), then we have to fix nothing but
>>>>> the memcpy folding.
>>>>>
>>>>>> If so, I'll add the testcase, bootstrap it and formally propose it.
>>>>>> Subsequently I will of course make sure that any element-wise copying
>>>>>> patch would test the predicate.
>>>>>
>>>>> I don't think the alias-set should determine whether a copy is
>>>>> bit-wise or not.
>>>>
>>>> Like the attached.  At least FAILs
>>>>
>>>> FAIL: gcc.dg/tree-ssa/ssa-ccp-27.c scan-tree-dump-times ccp1
>>>> "memcpy[^\n]*123456" 2 (found 0 times)
>>>>
>>>> not sure why we have this test.
>>>
>>> Hum.  And SRA still decomposes the copy to struct elements w/o padding
>>> even though the access is done using char[].  So somehow it ignores
>>> VIEW_CONVERT_EXPRs
>>> (well, those implicitly present on MEM_REFs).
>>
>> Yes.  SRA is not even too afraid of top-level V_C_Es.  It really bails
>> out only if they are buried under a handled_component.  And it does not
>> remove aggregate assignments containing them.
>>
>>>
>>> Looks like this is because total scalarization is done on the decls
>>> and not at all
>>> honoring how the variable is accessed?  The following seems to fix
>>> that, otherwise untested.
>>
>>
>>>
>>> Index: gcc/tree-sra.c
>>> ===================================================================
>>> --- gcc/tree-sra.c      (revision 255137)
>>> +++ gcc/tree-sra.c      (working copy)
>>> @@ -1338,7 +1338,9 @@ build_accesses_from_assign (gimple *stmt
>>>      {
>>>        racc->grp_assignment_read = 1;
>>>        if (should_scalarize_away_bitmap && !gimple_has_volatile_ops (stmt)
>>> -         && !is_gimple_reg_type (racc->type))
>>> +         && !is_gimple_reg_type (racc->type)
>>> +         && (TYPE_MAIN_VARIANT (racc->type)
>>> +             == TYPE_MAIN_VARIANT (TREE_TYPE (racc->base))))
>>>         bitmap_set_bit (should_scalarize_away_bitmap, DECL_UID (racc->base));
>>>        if (storage_order_barrier_p (lhs))
>>>         racc->grp_unscalarizable_region = 1;
>>
>> I believe that the added condition is not what you want; this seems
>> to trigger also for ordinary:
>>
>>   s1 = s2.field
>>
>> Where racc->type is type of the field but racc->base is s2 and its type
>> is type of the structure.
>
> Yes.  But do we want to totally scalarize s2 in this case?  We only
> access parts of it.  We don't seem to have a testcase that fails
> (well, full testing still in progress).

If we start with

small_s1 = src;
dst = small_s1->even_smaller_struct_field;

and small_s1 is otherwise unused, I think that we want to facilitate
copy propagation with total scalarization too.
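
Spelled out as a compilable sketch (the type and variable names here
are made up, only the shape of the code matters):

struct inner { int x; int y; };
struct outer { struct inner in; int tail; };

struct inner
extract (struct outer *src)
{
  struct outer tmp = *src;   /* tmp is otherwise unused */
  return tmp.in;             /* dst = tmp.even_smaller_struct_field */
}

Totally scalarizing tmp lets SRA forward the loaded fields straight to
the result and remove tmp entirely.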

>
>> I also think you want to be setting a bit in
>> cannot_scalarize_away_bitmap in order to guarantee that total
>> scalarization will not happen for the given candidate.  Otherwise some
>> other regular assignment might trigger it ...
>
> Yeah, figured that out myself.
>
>> except if we then also
>> checked the statement for bit-copying types in sra_modify_assign (in the
>> condition after the big comment), which I suppose is actually the
>> correct thing to do.
>
> But modification is too late, no?

No, at modification time we still decide how to deal with an aggregate
assignment.  That can be done either pessimistically, by storing all
RHS replacements back to their original aggregate, leaving the original
assignment in place and then loading all LHS replacements from its
aggregate, or optimistically, by trying to load LHS replacements either
from RHS replacements or at least from the RHS, and storing RHS
replacements directly to the LHS, hoping to eliminate the original
load.  It is this optimistic approach, rather than total scalarization,
that we need to disable.  In fact, just disabling total scalarization
is not enough when SRA comes across accesses to the components from
elsewhere in the function body; your patch unfortunately still fails
for:

volatile short vs;
volatile long vl;

struct A { short s; long i; long j; };
struct A a, b;
void foo ()
{
  struct A c;
  __builtin_memcpy (&c, &b, sizeof (struct A));
  __builtin_memcpy (&a, &c, sizeof (struct A));

  vs = c.s;
  vl = c.i;
  vl = c.j;
}
int main()
{
  __builtin_memset (&b, 0, sizeof (struct A));
  b.s = 1;
  __builtin_memcpy ((char *)&b+2, &b, 2);
  foo ();
  __builtin_memcpy (&a, (char *)&a+2, 2);
  if (a.s != 1)
    __builtin_abort ();
  return 0;
}


>
>> Thanks a lot for the folding patch, I can take over the SRA bits if you
>> want to.
>
> For reference below is the full patch.
>
> Bootstrapped on x86_64-unknown-linux-gnu, testing in progress.
>
> Ok for the SRA parts?
>

My preferred SRA part would be:

diff --git a/gcc/tree-sra.c b/gcc/tree-sra.c
index db490b20c3e..7a0e4d1ae26 100644
--- a/gcc/tree-sra.c
+++ b/gcc/tree-sra.c
@@ -1302,6 +1302,17 @@ comes_initialized_p (tree base)
   return TREE_CODE (base) == PARM_DECL || constant_decl_p (base);
 }
 
+/* Return true if REF is a MEM_REF which changes the type of the data it
+   accesses.  */
+
+static bool
+type_changing_mem_ref_p (tree ref)
+{
+  return (TREE_CODE (ref) == MEM_REF
+         && (TYPE_MAIN_VARIANT (TREE_TYPE (ref))
+             != TYPE_MAIN_VARIANT (TREE_TYPE (TREE_TYPE (TREE_OPERAND (ref, 0))))));
+}
+
 /* Scan expressions occurring in STMT, create access structures for all accesses
    to candidates for scalarization and remove those candidates which occur in
    statements or expressions that prevent them from being split apart.  Return
@@ -1338,7 +1349,8 @@ build_accesses_from_assign (gimple *stmt)
     {
       racc->grp_assignment_read = 1;
       if (should_scalarize_away_bitmap && !gimple_has_volatile_ops (stmt)
-         && !is_gimple_reg_type (racc->type))
+         && !is_gimple_reg_type (racc->type)
+         && !type_changing_mem_ref_p (rhs))
        bitmap_set_bit (should_scalarize_away_bitmap, DECL_UID (racc->base));
       if (storage_order_barrier_p (lhs))
        racc->grp_unscalarizable_region = 1;
@@ -3589,6 +3601,8 @@ sra_modify_assign (gimple *stmt, gimple_stmt_iterator *gsi)
 
   if (modify_this_stmt
       || gimple_has_volatile_ops (stmt)
+      || type_changing_mem_ref_p (lhs)
+      || type_changing_mem_ref_p (rhs)
       || contains_vce_or_bfcref_p (rhs)
       || contains_vce_or_bfcref_p (lhs)
       || stmt_ends_bb_p (stmt))


I kept the condition in build_accesses_from_assign in order not to do
unnecessary total-scalarization work, but it is the checks in
sra_modify_assign that actually ensure we keep the assignment intact.

It still does not do what Jakub asked for, i.e. keep the old behavior
if there is no padding.  But that should be done at the gimple-fold
level too, I believe.
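
To illustrate what "no padding" means here (a hand-written sketch, not
from the testsuite):

struct no_pad { long a; long b; };    /* no holes anywhere */
struct padded { short s; long i; };   /* hole between s and i */

For struct no_pad, a memcpy of sizeof (struct no_pad) bytes could
still be folded to a plain aggregate assignment, because there a
member-wise copy and a bit-copy are indistinguishable; only for
struct padded does the folding need the char[] access type.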

Thanks,

Martin
Richard Biener Nov. 24, 2017, 1:29 p.m. UTC | #24
On Fri, Nov 24, 2017 at 2:00 PM, Martin Jambor <mjambor@suse.cz> wrote:
> On Fri, Nov 24 2017, Richard Biener wrote:
>> On Fri, Nov 24, 2017 at 12:53 PM, Martin Jambor <mjambor@suse.cz> wrote:
>>> Hi Richi,
>>>
>>> On Fri, Nov 24 2017, Richard Biener wrote:
>>>> On Fri, Nov 24, 2017 at 11:57 AM, Richard Biener
>>>> <richard.guenther@gmail.com> wrote:
>>>>> On Fri, Nov 24, 2017 at 11:31 AM, Richard Biener
>>>
>>> ..
>>>
>>>>>>>> And yes, I've been worried about SRA as well here...  it _does_
>>>>>>>> have some early outs when seeing VIEW_CONVERT_EXPR but
>>>>>>>> apparently not for the above.  Testcase that aborts with SRA but
>>>>>>>> not without:
>>>>>>>>
>>>>>>>> struct A { short s; long i; long j; };
>>>>>>>> struct A a, b;
>>>>>>>> void foo ()
>>>>>>>> {
>>>>>>>>   struct A c;
>>>>>>>>   __builtin_memcpy (&c, &b, sizeof (struct A));
>>>>>>>>   __builtin_memcpy (&a, &c, sizeof (struct A));
>>>>>>>> }
>>>>>>>> int main()
>>>>>>>> {
>>>>>>>>   __builtin_memset (&b, 0, sizeof (struct A));
>>>>>>>>   b.s = 1;
>>>>>>>>   __builtin_memcpy ((char *)&b+2, &b, 2);
>>>>>>>>   foo ();
>>>>>>>>   __builtin_memcpy (&a, (char *)&a+2, 2);
>>>>>>>>   if (a.s != 1)
>>>>>>>>     __builtin_abort ();
>>>>>>>>   return 0;
>>>>>>>> }
>>>>>>>
>>>>>>> Thanks for the testcase, I agree that is a fairly big problem.  Do you
>>>>>>> think that the following (untested) patch is an appropriate way of
>>>>>>> fixing it and generally of extending gimple to capture that a statement
>>>>>>> is a bit-copy?
>>>>>>
>>>>>> I think the place to fix is the memcpy folding.  That is, we'd say that
>>>>>> aggregate assignments are not bit-copies but do element-wise assignments.
>>>>>> For memcpy folding we'd then need to use a type that doesn't contain
>>>>>> padding.  Which effectively means char[].
>>>>>>
>>>>>> Of course we need to stop SRA from decomposing that copy to
>>>>>> individual characters then ;)
>>>>>>
>>>>>> So iff we decide that all aggregate copies are element copies,
>>>>>> maybe only those where TYPE_MAIN_VARIANT of lhs and rhs match
>>>>>> (currently we allow TYPE_CANONICAL compatibility and thus there
>>>>>> might be some mismatches), then we have to fix nothing but
>>>>>> the memcpy folding.
>>>>>>
>>>>>>> If so, I'll add the testcase, bootstrap it and formally propose it.
>>>>>>> Subsequently I will of course make sure that any element-wise copying
>>>>>>> patch would test the predicate.
>>>>>>
>>>>>> I don't think the alias-set should determine whether a copy is
>>>>>> bit-wise or not.
>>>>>
>>>>> Like the attached.  At least FAILs
>>>>>
>>>>> FAIL: gcc.dg/tree-ssa/ssa-ccp-27.c scan-tree-dump-times ccp1
>>>>> "memcpy[^\n]*123456" 2 (found 0 times)
>>>>>
>>>>> not sure why we have this test.
>>>>
>>>> Hum.  And SRA still decomposes the copy to struct elements w/o padding
>>>> even though the access is done using char[].  So somehow it ignores
>>>> VIEW_CONVERT_EXPRs
>>>> (well, those implicitly present on MEM_REFs).
>>>
>>> Yes.  SRA is not even too afraid of top-level V_C_Es.  It really bails
>>> out only if they are buried under a handled_component.  And it does not
>>> remove aggregate assignments containing them.
>>>
>>>>
>>>> Looks like this is because total scalarization is done on the decls
>>>> and not at all
>>>> honoring how the variable is accessed?  The following seems to fix
>>>> that, otherwise untested.
>>>
>>>
>>>>
>>>> Index: gcc/tree-sra.c
>>>> ===================================================================
>>>> --- gcc/tree-sra.c      (revision 255137)
>>>> +++ gcc/tree-sra.c      (working copy)
>>>> @@ -1338,7 +1338,9 @@ build_accesses_from_assign (gimple *stmt
>>>>      {
>>>>        racc->grp_assignment_read = 1;
>>>>        if (should_scalarize_away_bitmap && !gimple_has_volatile_ops (stmt)
>>>> -         && !is_gimple_reg_type (racc->type))
>>>> +         && !is_gimple_reg_type (racc->type)
>>>> +         && (TYPE_MAIN_VARIANT (racc->type)
>>>> +             == TYPE_MAIN_VARIANT (TREE_TYPE (racc->base))))
>>>>         bitmap_set_bit (should_scalarize_away_bitmap, DECL_UID (racc->base));
>>>>        if (storage_order_barrier_p (lhs))
>>>>         racc->grp_unscalarizable_region = 1;
>>>
>>> I believe that the added condition is not what you want; this seems
>>> to trigger also for ordinary:
>>>
>>>   s1 = s2.field
>>>
>>> Where racc->type is type of the field but racc->base is s2 and its type
>>> is type of the structure.
>>
>> Yes.  But do we want to totally scalarize s2 in this case?  We only
>> access parts of it.  We don't seem to have a testcase that fails
>> (well, full testing still in progress).
>
> If we start with
>
> small_s1 = src;
> dst = small_s1->even_smaller_struct_field;
>
> and small_s1 is otherwise unused, I think that we want to facilitate
> copy propagation with total scalarization too.
>
>>
>>> I also think you want to be setting a bit in
>>> cannot_scalarize_away_bitmap in order to guarantee that total
>>> scalarization will not happen for the given candidate.  Otherwise some
>>> other regular assignment might trigger it ...
>>
>> Yeah, figured that out myself.
>>
>>> except if we then also
>>> checked the statement for bit-copying types in sra_modify_assign (in the
>>> condition after the big comment), which I suppose is actually the
>>> correct thing to do.
>>
>> But modification is too late, no?
>
> No, at modification time we still decide how to deal with an aggregate
> assignment.  That can be done either pessimistically, by storing all
> RHS replacements back to their original aggregate, leaving the original
> assignment in place and then loading all LHS replacements from its
> aggregate, or optimistically, by trying to load LHS replacements either
> from RHS replacements or at least from the RHS, and storing RHS
> replacements directly to the LHS, hoping to eliminate the original
> load.  It is this optimistic approach, rather than total scalarization,
> that we need to disable.  In fact, just disabling total scalarization
> is not enough when SRA comes across accesses to the components from
> elsewhere in the function body; your patch unfortunately still fails
> for:
>
> volatile short vs;
> volatile long vl;
>
> struct A { short s; long i; long j; };
> struct A a, b;
> void foo ()
> {
>   struct A c;
>   __builtin_memcpy (&c, &b, sizeof (struct A));
>   __builtin_memcpy (&a, &c, sizeof (struct A));
>
>   vs = c.s;
>   vl = c.i;
>   vl = c.j;
> }
> int main()
> {
>   __builtin_memset (&b, 0, sizeof (struct A));
>   b.s = 1;
>   __builtin_memcpy ((char *)&b+2, &b, 2);
>   foo ();
>   __builtin_memcpy (&a, (char *)&a+2, 2);
>   if (a.s != 1)
>     __builtin_abort ();
>   return 0;
> }
>
>
>>
>>> Thanks a lot for the folding patch, I can take over the SRA bits if you
>>> want to.
>>
>> For reference below is the full patch.
>>
>> Bootstrapped on x86_64-unknown-linux-gnu, testing in progress.
>>
>> Ok for the SRA parts?
>>
>
> My preferred SRA part would be:
>
> diff --git a/gcc/tree-sra.c b/gcc/tree-sra.c
> index db490b20c3e..7a0e4d1ae26 100644
> --- a/gcc/tree-sra.c
> +++ b/gcc/tree-sra.c
> @@ -1302,6 +1302,17 @@ comes_initialized_p (tree base)
>    return TREE_CODE (base) == PARM_DECL || constant_decl_p (base);
>  }
>
> +/* Return true if REF is a MEM_REF which changes the type of the data it
> +   accesses.  */
> +
> +static bool
> +type_changing_mem_ref_p (tree ref)
> +{
> +  return (TREE_CODE (ref) == MEM_REF
> +         && (TYPE_MAIN_VARIANT (TREE_TYPE (ref))
> +             != TYPE_MAIN_VARIANT (TREE_TYPE (TREE_TYPE (TREE_OPERAND (ref, 0))))));
> +}
> +
>  /* Scan expressions occurring in STMT, create access structures for all accesses
>     to candidates for scalarization and remove those candidates which occur in
>     statements or expressions that prevent them from being split apart.  Return
> @@ -1338,7 +1349,8 @@ build_accesses_from_assign (gimple *stmt)
>      {
>        racc->grp_assignment_read = 1;
>        if (should_scalarize_away_bitmap && !gimple_has_volatile_ops (stmt)
> -         && !is_gimple_reg_type (racc->type))
> +         && !is_gimple_reg_type (racc->type)
> +         && !type_changing_mem_ref_p (rhs))
>         bitmap_set_bit (should_scalarize_away_bitmap, DECL_UID (racc->base));
>        if (storage_order_barrier_p (lhs))
>         racc->grp_unscalarizable_region = 1;
> @@ -3589,6 +3601,8 @@ sra_modify_assign (gimple *stmt, gimple_stmt_iterator *gsi)
>
>    if (modify_this_stmt
>        || gimple_has_volatile_ops (stmt)
> +      || type_changing_mem_ref_p (lhs)
> +      || type_changing_mem_ref_p (rhs)
>        || contains_vce_or_bfcref_p (rhs)
>        || contains_vce_or_bfcref_p (lhs)
>        || stmt_ends_bb_p (stmt))

But a type-changing MEM_REF can appear at each ref base.  Maybe it's
only relevant for top-level ones though; you should know.
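
For instance something like this, in GIMPLE-like pseudo code (a
made-up sketch, not from an actual dump):

  _1 = MEM[(struct B *)&a].f;

where the outermost reference is a COMPONENT_REF and the MEM_REF that
changes the type of 'a' only appears at the base of the chain.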

So we really need to handle it like we do VIEW_CONVERT_EXPR and/or we need
to handle VIEW_CONVERT_EXPRs somewhere in the ref chain the same.

So if you want to take over the SRA parts that would be nice.

I have to dumb down the memcpy folding a bit, as it produces more
fallout than I want to fix right now.

Richard.

>
> I kept the condition in build_accesses_from_assign in order not to do
> unnecessary total-scalarization work, but it is the checks in
> sra_modify_assign that actually ensure we keep the assignment intact.
>
> It still does not do what Jakub asked for, i.e. keep the old behavior
> if there is no padding.  But that should be done at the gimple-fold
> level too, I believe.
>
> Thanks,
>
> Martin
diff mbox series

Patch

diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 1ee8351c21f..87f602e7ead 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -6511,6 +6511,10 @@  ix86_option_override_internal (bool main_args_p,
 			 ix86_tune_cost->l2_cache_size,
 			 opts->x_param_values,
 			 opts_set->x_param_values);
+  maybe_set_param_value (PARAM_MAX_SIZE_FOR_ELEMENTWISE_COPY,
+			 35,
+			 opts->x_param_values,
+			 opts_set->x_param_values);
 
   /* Enable sw prefetching at -O3 for CPUS that prefetching is helpful.  */
   if (opts->x_flag_prefetch_loop_arrays < 0
diff --git a/gcc/expr.c b/gcc/expr.c
index 134ee731c29..dff24e7f166 100644
--- a/gcc/expr.c
+++ b/gcc/expr.c
@@ -61,7 +61,8 @@  along with GCC; see the file COPYING3.  If not see
 #include "tree-chkp.h"
 #include "rtl-chkp.h"
 #include "ccmp.h"
-
+#include "params.h"
+#include "tree-sra.h"
 
 /* If this is nonzero, we do not bother generating VOLATILE
    around volatile memory references, and we are willing to
@@ -5340,6 +5341,80 @@  emit_storent_insn (rtx to, rtx from)
   return maybe_expand_insn (code, 2, ops);
 }
 
+/* Generate code for copying data of type TYPE at SOURCE plus OFFSET to TARGET
+   plus OFFSET, but do so element-wise and/or field-wise for each record and
+   array within TYPE.  TYPE must either be a register type or an aggregate
+   complying with simple_mix_of_records_and_arrays_p.
+
+   If CALL_PARAM_P is nonzero, this is a store into a call param on the
+   stack, and block moves may need to be treated specially.  */
+
+static void
+emit_move_elementwise (tree type, rtx target, rtx source, HOST_WIDE_INT offset,
+		       int call_param_p)
+{
+  switch (TREE_CODE (type))
+    {
+    case RECORD_TYPE:
+      for (tree fld = TYPE_FIELDS (type); fld; fld = DECL_CHAIN (fld))
+	if (TREE_CODE (fld) == FIELD_DECL)
+	  {
+	    HOST_WIDE_INT fld_offset = offset + int_bit_position (fld);
+	    tree ft = TREE_TYPE (fld);
+	    emit_move_elementwise (ft, target, source, fld_offset,
+				   call_param_p);
+	  }
+      break;
+
+    case ARRAY_TYPE:
+      {
+	tree elem_type = TREE_TYPE (type);
+	HOST_WIDE_INT el_size = tree_to_shwi (TYPE_SIZE (elem_type));
+	gcc_assert (el_size > 0);
+
+	offset_int idx, max;
+	/* Skip (some) zero-length arrays; others have MAXIDX == MINIDX - 1.  */
+	if (extract_min_max_idx_from_array (type, &idx, &max))
+	  {
+	    HOST_WIDE_INT el_offset = offset;
+	    for (; idx <= max; ++idx)
+	      {
+		emit_move_elementwise (elem_type, target, source, el_offset,
+				       call_param_p);
+		el_offset += el_size;
+	      }
+	  }
+      }
+      break;
+    default:
+      machine_mode mode = TYPE_MODE (type);
+
+      rtx ntgt = adjust_address (target, mode, offset / BITS_PER_UNIT);
+      rtx nsrc = adjust_address (source, mode, offset / BITS_PER_UNIT);
+
+      /* TODO: Figure out whether the following is actually necessary.  */
+      if (target == ntgt)
+	ntgt = copy_rtx (target);
+      if (source == nsrc)
+	nsrc = copy_rtx (source);
+
+      gcc_assert (mode != VOIDmode);
+      if (mode != BLKmode)
+	emit_move_insn (ntgt, nsrc);
+      else
+	{
+	  /* For example vector gimple registers can end up here.  */
+	  rtx size = expand_expr (TYPE_SIZE_UNIT (type), NULL_RTX,
+				  TYPE_MODE (sizetype), EXPAND_NORMAL);
+	  emit_block_move (ntgt, nsrc, size,
+			   (call_param_p
+			    ? BLOCK_OP_CALL_PARM : BLOCK_OP_NORMAL));
+	}
+      break;
+    }
+  return;
+}
+
 /* Generate code for computing expression EXP,
    and storing the value into TARGET.
 
@@ -5713,9 +5788,29 @@  store_expr_with_bounds (tree exp, rtx target, int call_param_p,
 	emit_group_store (target, temp, TREE_TYPE (exp),
 			  int_size_in_bytes (TREE_TYPE (exp)));
       else if (GET_MODE (temp) == BLKmode)
-	emit_block_move (target, temp, expr_size (exp),
-			 (call_param_p
-			  ? BLOCK_OP_CALL_PARM : BLOCK_OP_NORMAL));
+	{
+	  /* Copying smallish BLKmode structures with emit_block_move and thus
+	     by-pieces can result in store-to-load stalls.  So copy some simple
+	     small aggregates element or field-wise.  */
+	  if (GET_MODE (target) == BLKmode
+	      && AGGREGATE_TYPE_P (TREE_TYPE (exp))
+	      && !TREE_ADDRESSABLE (TREE_TYPE (exp))
+	      && tree_fits_shwi_p (TYPE_SIZE (TREE_TYPE (exp)))
+	      && (tree_to_shwi (TYPE_SIZE (TREE_TYPE (exp)))
+		  <= (PARAM_VALUE (PARAM_MAX_SIZE_FOR_ELEMENTWISE_COPY)
+		      * BITS_PER_UNIT))
+	      && simple_mix_of_records_and_arrays_p (TREE_TYPE (exp), false))
+	    {
+	      /* FIXME:  Can this happen?  What would it mean?  */
+	      gcc_assert (!reverse);
+	      emit_move_elementwise (TREE_TYPE (exp), target, temp, 0,
+				     call_param_p);
+	    }
+	  else
+	    emit_block_move (target, temp, expr_size (exp),
+			     (call_param_p
+			      ? BLOCK_OP_CALL_PARM : BLOCK_OP_NORMAL));
+	}
       /* If we emit a nontemporal store, there is nothing else to do.  */
       else if (nontemporal && emit_storent_insn (target, temp))
 	;
diff --git a/gcc/ipa-cp.c b/gcc/ipa-cp.c
index 6b3d8d7364c..7d6019bbd30 100644
--- a/gcc/ipa-cp.c
+++ b/gcc/ipa-cp.c
@@ -124,6 +124,7 @@  along with GCC; see the file COPYING3.  If not see
 #include "tree-ssa-ccp.h"
 #include "stringpool.h"
 #include "attribs.h"
+#include "tree-sra.h"
 
 template <typename valtype> class ipcp_value;
 
diff --git a/gcc/ipa-prop.h b/gcc/ipa-prop.h
index fa5bed49ee0..2313cc884ed 100644
--- a/gcc/ipa-prop.h
+++ b/gcc/ipa-prop.h
@@ -877,10 +877,6 @@  ipa_parm_adjustment *ipa_get_adjustment_candidate (tree **, bool *,
 void ipa_release_body_info (struct ipa_func_body_info *);
 tree ipa_get_callee_param_type (struct cgraph_edge *e, int i);
 
-/* From tree-sra.c:  */
-tree build_ref_for_offset (location_t, tree, HOST_WIDE_INT, bool, tree,
-			   gimple_stmt_iterator *, bool);
-
 /* In ipa-cp.c  */
 void ipa_cp_c_finalize (void);
 
diff --git a/gcc/params.def b/gcc/params.def
index e55afc28053..5e19f1414a0 100644
--- a/gcc/params.def
+++ b/gcc/params.def
@@ -1294,6 +1294,12 @@  DEFPARAM (PARAM_VECT_EPILOGUES_NOMASK,
 	  "Enable loop epilogue vectorization using smaller vector size.",
 	  0, 0, 1)
 
+DEFPARAM (PARAM_MAX_SIZE_FOR_ELEMENTWISE_COPY,
+	  "max-size-for-elementwise-copy",
+	  "Maximum size in bytes of a structure or array to by considered for "
+	  "copying by its individual fields or elements",
+	  0, 0, 512)
+
 /*
 
 Local variables:
diff --git a/gcc/testsuite/gcc.target/i386/pr80689-1.c b/gcc/testsuite/gcc.target/i386/pr80689-1.c
new file mode 100644
index 00000000000..4156d4fba45
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr80689-1.c
@@ -0,0 +1,38 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2" } */
+
+typedef struct st1
+{
+        long unsigned int a,b;
+        long int c,d;
+}R;
+
+typedef struct st2
+{
+        int  t;
+        R  reg;
+}N;
+
+void Set (const R *region,  N *n_info );
+
+void test(N  *n_obj ,const long unsigned int a, const long unsigned int b,  const long int c,const long int d)
+{
+        R reg;
+
+        reg.a=a;
+        reg.b=b;
+        reg.c=c;
+        reg.d=d;
+        Set (&reg, n_obj);
+
+}
+
+void Set (const R *reg,  N *n_obj )
+{
+        n_obj->reg=(*reg);
+}
+
+
+/* { dg-final { scan-assembler-not "%(x|y|z)mm\[0-9\]+" } } */
+/* { dg-final { scan-assembler-not "movdqu" } } */
+/* { dg-final { scan-assembler-not "movups" } } */
diff --git a/gcc/tree-sra.c b/gcc/tree-sra.c
index bac593951e7..ade97964205 100644
--- a/gcc/tree-sra.c
+++ b/gcc/tree-sra.c
@@ -104,6 +104,7 @@  along with GCC; see the file COPYING3.  If not see
 #include "ipa-fnsummary.h"
 #include "ipa-utils.h"
 #include "builtins.h"
+#include "tree-sra.h"
 
 /* Enumeration of all aggregate reductions we can do.  */
 enum sra_mode { SRA_MODE_EARLY_IPA,   /* early call regularization */
@@ -952,14 +953,14 @@  create_access (tree expr, gimple *stmt, bool write)
 }
 
 
-/* Return true iff TYPE is scalarizable - i.e. a RECORD_TYPE or fixed-length
-   ARRAY_TYPE with fields that are either of gimple register types (excluding
-   bit-fields) or (recursively) scalarizable types.  CONST_DECL must be true if
-   we are considering a decl from constant pool.  If it is false, char arrays
-   will be refused.  */
+/* Return true if TYPE consists of RECORD_TYPE or fixed-length ARRAY_TYPE with
+   fields/elements that are not bit-fields and are either register types or
+   recursively comply with simple_mix_of_records_and_arrays_p.  Furthermore, if
+   ALLOW_CHAR_ARRAYS is false, the function will also return false if TYPE
+   contains an array of single-byte elements.  */
 
-static bool
-scalarizable_type_p (tree type, bool const_decl)
+bool
+simple_mix_of_records_and_arrays_p (tree type, bool allow_char_arrays)
 {
   gcc_assert (!is_gimple_reg_type (type));
   if (type_contains_placeholder_p (type))
@@ -977,7 +978,7 @@  scalarizable_type_p (tree type, bool const_decl)
 	    return false;
 
 	  if (!is_gimple_reg_type (ft)
-	      && !scalarizable_type_p (ft, const_decl))
+	      && !simple_mix_of_records_and_arrays_p (ft, allow_char_arrays))
 	    return false;
 	}
 
@@ -986,7 +987,7 @@  scalarizable_type_p (tree type, bool const_decl)
   case ARRAY_TYPE:
     {
       HOST_WIDE_INT min_elem_size;
-      if (const_decl)
+      if (allow_char_arrays)
 	min_elem_size = 0;
       else
 	min_elem_size = BITS_PER_UNIT;
@@ -1008,7 +1009,7 @@  scalarizable_type_p (tree type, bool const_decl)
 
       tree elem = TREE_TYPE (type);
       if (!is_gimple_reg_type (elem)
-	  && !scalarizable_type_p (elem, const_decl))
+	  && !simple_mix_of_records_and_arrays_p (elem, allow_char_arrays))
 	return false;
       return true;
     }
@@ -1017,10 +1018,38 @@  scalarizable_type_p (tree type, bool const_decl)
   }
 }
 
-static void scalarize_elem (tree, HOST_WIDE_INT, HOST_WIDE_INT, bool, tree, tree);
+static void scalarize_elem (tree, HOST_WIDE_INT, HOST_WIDE_INT, bool, tree,
+			    tree);
+
+/* For a given array TYPE, return false if its domain does not have any maximum
+   value.  Otherwise calculate MIN and MAX indices of the first and the last
+   element.  */
+
+bool
+extract_min_max_idx_from_array (tree type, offset_int *min, offset_int *max)
+{
+  tree domain = TYPE_DOMAIN (type);
+  tree minidx = TYPE_MIN_VALUE (domain);
+  gcc_assert (TREE_CODE (minidx) == INTEGER_CST);
+  tree maxidx = TYPE_MAX_VALUE (domain);
+  if (!maxidx)
+    return false;
+  gcc_assert (TREE_CODE (maxidx) == INTEGER_CST);
+
+  /* MINIDX and MAXIDX are inclusive, and must be interpreted in
+     DOMAIN (e.g. signed int, whereas min/max may be size_int).  */
+  *min = wi::to_offset (minidx);
+  *max = wi::to_offset (maxidx);
+  if (!TYPE_UNSIGNED (domain))
+    {
+      *min = wi::sext (*min, TYPE_PRECISION (domain));
+      *max = wi::sext (*max, TYPE_PRECISION (domain));
+    }
+  return true;
+}
 
 /* Create total_scalarization accesses for all scalar fields of a member
-   of type DECL_TYPE conforming to scalarizable_type_p.  BASE
+   of type DECL_TYPE conforming to simple_mix_of_records_and_arrays_p.  BASE
    must be the top-most VAR_DECL representing the variable; within that,
    OFFSET locates the member and REF must be the memory reference expression for
    the member.  */
@@ -1047,27 +1076,14 @@  completely_scalarize (tree base, tree decl_type, HOST_WIDE_INT offset, tree ref)
       {
 	tree elemtype = TREE_TYPE (decl_type);
 	tree elem_size = TYPE_SIZE (elemtype);
-	gcc_assert (elem_size && tree_fits_shwi_p (elem_size));
 	HOST_WIDE_INT el_size = tree_to_shwi (elem_size);
 	gcc_assert (el_size > 0);
 
-	tree minidx = TYPE_MIN_VALUE (TYPE_DOMAIN (decl_type));
-	gcc_assert (TREE_CODE (minidx) == INTEGER_CST);
-	tree maxidx = TYPE_MAX_VALUE (TYPE_DOMAIN (decl_type));
+	offset_int idx, max;
 	/* Skip (some) zero-length arrays; others have MAXIDX == MINIDX - 1.  */
-	if (maxidx)
+	if (extract_min_max_idx_from_array (decl_type, &idx, &max))
 	  {
-	    gcc_assert (TREE_CODE (maxidx) == INTEGER_CST);
 	    tree domain = TYPE_DOMAIN (decl_type);
-	    /* MINIDX and MAXIDX are inclusive, and must be interpreted in
-	       DOMAIN (e.g. signed int, whereas min/max may be size_int).  */
-	    offset_int idx = wi::to_offset (minidx);
-	    offset_int max = wi::to_offset (maxidx);
-	    if (!TYPE_UNSIGNED (domain))
-	      {
-		idx = wi::sext (idx, TYPE_PRECISION (domain));
-		max = wi::sext (max, TYPE_PRECISION (domain));
-	      }
 	    for (int el_off = offset; idx <= max; ++idx)
 	      {
 		tree nref = build4 (ARRAY_REF, elemtype,
@@ -1088,10 +1104,10 @@  completely_scalarize (tree base, tree decl_type, HOST_WIDE_INT offset, tree ref)
 }
 
 /* Create total_scalarization accesses for a member of type TYPE, which must
-   satisfy either is_gimple_reg_type or scalarizable_type_p.  BASE must be the
-   top-most VAR_DECL representing the variable; within that, POS and SIZE locate
-   the member, REVERSE gives its torage order. and REF must be the reference
-   expression for it.  */
+   satisfy either is_gimple_reg_type or simple_mix_of_records_and_arrays_p.
+   BASE must be the top-most VAR_DECL representing the variable; within that,
+   POS and SIZE locate the member, REVERSE gives its storage order, and REF must
+   be the reference expression for it.  */
 
 static void
 scalarize_elem (tree base, HOST_WIDE_INT pos, HOST_WIDE_INT size, bool reverse,
@@ -1111,7 +1127,8 @@  scalarize_elem (tree base, HOST_WIDE_INT pos, HOST_WIDE_INT size, bool reverse,
 }
 
 /* Create a total_scalarization access for VAR as a whole.  VAR must be of a
-   RECORD_TYPE or ARRAY_TYPE conforming to scalarizable_type_p.  */
+   RECORD_TYPE or ARRAY_TYPE conforming to
+   simple_mix_of_records_and_arrays_p.  */
 
 static void
 create_total_scalarization_access (tree var)
@@ -2803,8 +2820,9 @@  analyze_all_variable_accesses (void)
       {
 	tree var = candidate (i);
 
-	if (VAR_P (var) && scalarizable_type_p (TREE_TYPE (var),
-						constant_decl_p (var)))
+	if (VAR_P (var)
+	    && simple_mix_of_records_and_arrays_p (TREE_TYPE (var),
+						   constant_decl_p (var)))
 	  {
 	    if (tree_to_uhwi (TYPE_SIZE (TREE_TYPE (var)))
 		<= max_scalarization_size)
diff --git a/gcc/tree-sra.h b/gcc/tree-sra.h
new file mode 100644
index 00000000000..dc901385994
--- /dev/null
+++ b/gcc/tree-sra.h
@@ -0,0 +1,33 @@ 
+/* tree-sra.h - Scalar Replacement of Aggregates declarations.
+   Copyright (C) 2017 Free Software Foundation, Inc.
+
+This file is part of GCC.
+
+GCC is free software; you can redistribute it and/or modify it under
+the terms of the GNU General Public License as published by the Free
+Software Foundation; either version 3, or (at your option) any later
+version.
+
+GCC is distributed in the hope that it will be useful, but WITHOUT ANY
+WARRANTY; without even the implied warranty of MERCHANTABILITY or
+FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
+for more details.
+
+You should have received a copy of the GNU General Public License
+along with GCC; see the file COPYING3.  If not see
+<http://www.gnu.org/licenses/>.  */
+
+#ifndef TREE_SRA_H
+#define TREE_SRA_H
+
+
+bool simple_mix_of_records_and_arrays_p (tree type, bool allow_char_arrays);
+bool extract_min_max_idx_from_array (tree type, offset_int *idx,
+				     offset_int *max);
+tree build_ref_for_offset (location_t loc, tree base, HOST_WIDE_INT offset,
+			   bool reverse, tree exp_type,
+			   gimple_stmt_iterator *gsi, bool insert_after);
+
+
+
+#endif	/* TREE_SRA_H */