diff mbox series

[2/4] vect: Fix inaccurate vector stmts number for slp reduction with lane-reducing

Message ID LV2PR01MB7839726CEFECEB5A7747217AF7A52@LV2PR01MB7839.prod.exchangelabs.com
State New
Headers show
Series [1/4] vect: Shorten name of macro SLP_TREE_NUMBER_OF_VEC_STMTS | expand

Commit Message

Feng Xue OS July 11, 2024, 8:53 a.m. UTC
Vector stmts number of an operation is calculated based on output vectype.
This is over-estimated for lane-reducing operation. Sometimes, to workaround
the issue, we have to rely on additional logic to deduce an exactly accurate
number by other means. Aiming at the inconvenience, in this patch, we would
"turn" lane-reducing operation into a normal one by inserting new trivial
statements like zero-valued PHIs and pass-through copies, which could be
optimized away by later passes. At the same time, a new field is added for
slp node to hold number of vector stmts that are really effective after
vectorization. For example:

  int sum = 1;
  for (i)
    {
      sum += d0[i] * d1[i];      // dot-prod <vector(16) char>
    }

  The vector size is 128-bit,vectorization factor is 16.  Reduction
  statements would be transformed as:

  vector<4> int sum_v0 = { 0, 0, 0, 1 };
  vector<4> int sum_v1 = { 0, 0, 0, 0 };
  vector<4> int sum_v2 = { 0, 0, 0, 0 };
  vector<4> int sum_v3 = { 0, 0, 0, 0 };

  for (i / 16)
    {
      sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0);
      sum_v1 = sum_v1;  // copy
      sum_v2 = sum_v2;  // copy
      sum_v3 = sum_v3;  // copy
    }

  sum_v = sum_v0 + sum_v1 + sum_v2 + sum_v3;   // = sum_v0

Thanks,
Feng

---
gcc/
	* tree-vectorizer.h (vec_stmts_effec_size): New field in _slp_tree.
	(SLP_TREE_VEC_STMTS_EFFEC_NUM): New macro.
	(vect_get_num_vectors): New overload function.
	(vect_get_slp_num_vectors): New function.
	* tree-vect-loop.cc (vect_reduction_update_partial_vector_usage): Use
	effective vector stmts number.
	(vectorizable_reduction): Compute number of effective vector stmts for
	lane-reducing op and reduction PHI.
	(vect_transform_reduction): Insert copies for lane-reducing so as to
	fix inaccurate vector stmts number.
	(vect_transform_cycle_phi): Only need to calculate vector PHI number
	based on input vectype for non-slp code path.
	* tree-vect-slp.cc (_slp_tree::_slp_tree): Initialize effective vector
	stmts number to zero.
	(vect_slp_analyze_node_operations_1): Remove adjustment on vector
	stmts number specific to slp reduction.
	(vect_slp_analyze_node_operations): Compute number of vector elements
	for constant/external slp node with vect_get_slp_num_vectors.
---
 gcc/tree-vect-loop.cc | 139 ++++++++++++++++++++++++++++++++++++------
 gcc/tree-vect-slp.cc  |  56 ++++++-----------
 gcc/tree-vectorizer.h |  45 ++++++++++++++
 3 files changed, 183 insertions(+), 57 deletions(-)

Comments

Richard Biener July 11, 2024, 9:43 a.m. UTC | #1
On Thu, Jul 11, 2024 at 10:53 AM Feng Xue OS
<fxue@os.amperecomputing.com> wrote:
>
> Vector stmts number of an operation is calculated based on output vectype.
> This is over-estimated for lane-reducing operation. Sometimes, to workaround
> the issue, we have to rely on additional logic to deduce an exactly accurate
> number by other means. Aiming at the inconvenience, in this patch, we would
> "turn" lane-reducing operation into a normal one by inserting new trivial
> statements like zero-valued PHIs and pass-through copies, which could be
> optimized away by later passes. At the same time, a new field is added for
> slp node to hold number of vector stmts that are really effective after
> vectorization. For example:

Adding Richard into the loop.

I'm sorry, but this feels a bit backwards - in the end I was hoping that we
can get rid of SLP_TREE_NUMBER_OF_VEC_STMTS completely.
We do currently have the odd ncopies (non-SLP) vs. vec_num (SLP)
duality but in reality all vectorizable_* should know the number of
stmt copies (or output vector defs) to produce by looking at the vector
type and the vectorization factor (and in the SLP case the number of
lanes represented by the node).

That means that in the end vectorizable_* could at transform time
simply make sure that SLP_TREE_VEC_DEF is appropriately
created (currently generic code does this based on
SLP_TREE_NUMBER_OF_VEC_STMTS and also generic code
tries to determine SLP_TREE_NUMBER_OF_VEC_STMTS).

There are a lot (well, not too many) uses of SLP_TREE_NUMBER_OF_VEC_STMTS
that short-cut "appropriate" vec_num computation based on
VF, vector type and lanes.  I hope all of them could vanish.

You add vect_get_[slp_]num_vectors, but I'd rather see a single

inline unsigned int
vect_get_num_copies (vec_info *vinfo, tree vectype, slp_tree node = nullptr)
{
   if (loop_vec_info loop_vinfo = dyn_cast <loop_vec_info> (vinfo))
     {
        if (node)
          return vect_get_num_vectors (LOOP_VINFO_VECT_FACTOR
(loop_vinfo) * SLP_TREE_LANES (node), vectype);
        else
          return vect_get_num_vectors (LOOP_VINFO_VECT_FACTOR
(loop_vinfo), vectype);
     }
  else
     return vect_get_num_vectors (SLP_TREE_LANES (node), vectype);
}

so that

  if (slp_node)
    {
      ncopies = 1;
      vec_num = SLP_TREE_NUMBER_OF_VEC_STMTS (slp_node);
    }
  else
    {
      ncopies = vect_get_num_copies (loop_vinfo, vectype);
      vec_num = 1;
    }

can become

  if (slp_node)
    {
      ncopies = 1;
      vec_num = vect_get_num_copies (loop_vinfo, vectype, slp_node);
    }
  else
    {
      ncopies = vect_get_num_copies (loop_vinfo, vectype/*, slp_node */);
      vec_num = 1;
    }

without actually resolving the ncopies/vec_num duality (that will solve itself
when we have achieved only-SLP).

I think if SLP_TREE_NUMBER_OF_VEC_STMTS is gone then having the
few places you needed to fix compute the correct number of stmts should be OK.
Note this will probably require moving the allocation of SLP_TREE_VEC_DEFS
to the code doing the transform (and remove/adjust some sanity checking we do).

Thanks,
Richard.

>   int sum = 1;
>   for (i)
>     {
>       sum += d0[i] * d1[i];      // dot-prod <vector(16) char>
>     }
>
>   The vector size is 128-bit,vectorization factor is 16.  Reduction
>   statements would be transformed as:
>
>   vector<4> int sum_v0 = { 0, 0, 0, 1 };
>   vector<4> int sum_v1 = { 0, 0, 0, 0 };
>   vector<4> int sum_v2 = { 0, 0, 0, 0 };
>   vector<4> int sum_v3 = { 0, 0, 0, 0 };
>
>   for (i / 16)
>     {
>       sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0);
>       sum_v1 = sum_v1;  // copy
>       sum_v2 = sum_v2;  // copy
>       sum_v3 = sum_v3;  // copy
>     }
>
>   sum_v = sum_v0 + sum_v1 + sum_v2 + sum_v3;   // = sum_v0
>
> Thanks,
> Feng
>
> ---
> gcc/
>         * tree-vectorizer.h (vec_stmts_effec_size): New field in _slp_tree.
>         (SLP_TREE_VEC_STMTS_EFFEC_NUM): New macro.
>         (vect_get_num_vectors): New overload function.
>         (vect_get_slp_num_vectors): New function.
>         * tree-vect-loop.cc (vect_reduction_update_partial_vector_usage): Use
>         effective vector stmts number.
>         (vectorizable_reduction): Compute number of effective vector stmts for
>         lane-reducing op and reduction PHI.
>         (vect_transform_reduction): Insert copies for lane-reducing so as to
>         fix inaccurate vector stmts number.
>         (vect_transform_cycle_phi): Only need to calculate vector PHI number
>         based on input vectype for non-slp code path.
>         * tree-vect-slp.cc (_slp_tree::_slp_tree): Initialize effective vector
>         stmts number to zero.
>         (vect_slp_analyze_node_operations_1): Remove adjustment on vector
>         stmts number specific to slp reduction.
>         (vect_slp_analyze_node_operations): Compute number of vector elements
>         for constant/external slp node with vect_get_slp_num_vectors.
> ---
>  gcc/tree-vect-loop.cc | 139 ++++++++++++++++++++++++++++++++++++------
>  gcc/tree-vect-slp.cc  |  56 ++++++-----------
>  gcc/tree-vectorizer.h |  45 ++++++++++++++
>  3 files changed, 183 insertions(+), 57 deletions(-)
>
> diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
> index c183e2b6068..5ad9836d6c8 100644
> --- a/gcc/tree-vect-loop.cc
> +++ b/gcc/tree-vect-loop.cc
> @@ -7471,7 +7471,7 @@ vect_reduction_update_partial_vector_usage (loop_vec_info loop_vinfo,
>        unsigned nvectors;
>
>        if (slp_node)
> -       nvectors = SLP_TREE_VEC_STMTS_NUM (slp_node);
> +       nvectors = SLP_TREE_VEC_STMTS_EFFEC_NUM (slp_node);
>        else
>         nvectors = vect_get_num_copies (loop_vinfo, vectype_in);
>
> @@ -7594,6 +7594,15 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
>    stmt_vec_info phi_info = stmt_info;
>    if (!is_a <gphi *> (stmt_info->stmt))
>      {
> +      if (lane_reducing_stmt_p (stmt_info->stmt) && slp_node)
> +       {
> +         /* Compute number of effective vector statements for lane-reducing
> +            ops.  */
> +         vectype_in = STMT_VINFO_REDUC_VECTYPE_IN (stmt_info);
> +         gcc_assert (vectype_in);
> +         SLP_TREE_VEC_STMTS_EFFEC_NUM (slp_node)
> +           = vect_get_slp_num_vectors (loop_vinfo, slp_node, vectype_in);
> +       }
>        STMT_VINFO_TYPE (stmt_info) = reduc_vec_info_type;
>        return true;
>      }
> @@ -8012,14 +8021,25 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
>    if (STMT_VINFO_LIVE_P (phi_info))
>      return false;
>
> -  if (slp_node)
> -    ncopies = 1;
> -  else
> -    ncopies = vect_get_num_copies (loop_vinfo, vectype_in);
> +  poly_uint64 nunits_out = TYPE_VECTOR_SUBPARTS (vectype_out);
>
> -  gcc_assert (ncopies >= 1);
> +  if (slp_node)
> +    {
> +      ncopies = 1;
>
> -  poly_uint64 nunits_out = TYPE_VECTOR_SUBPARTS (vectype_out);
> +      if (maybe_ne (TYPE_VECTOR_SUBPARTS (vectype_in), nunits_out))
> +       {
> +         /* Not all vector reduction PHIs would be used, compute number
> +            of the effective statements.  */
> +         SLP_TREE_VEC_STMTS_EFFEC_NUM (slp_node)
> +           = vect_get_slp_num_vectors (loop_vinfo, slp_node, vectype_in);
> +       }
> +    }
> +  else
> +    {
> +      ncopies = vect_get_num_copies (loop_vinfo, vectype_in);
> +      gcc_assert (ncopies >= 1);
> +    }
>
>    if (nested_cycle)
>      {
> @@ -8360,7 +8380,7 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
>         || (slp_node
>            && !REDUC_GROUP_FIRST_ELEMENT (stmt_info)
>            && SLP_TREE_LANES (slp_node) == 1
> -          && vect_get_num_copies (loop_vinfo, vectype_in) > 1))
> +          && SLP_TREE_VEC_STMTS_EFFEC_NUM (slp_node) > 1))
>        && (STMT_VINFO_RELEVANT (stmt_info) <= vect_used_only_live)
>        && reduc_chain_length == 1
>        && loop_vinfo->suggested_unroll_factor == 1)
> @@ -8595,12 +8615,15 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
>    stmt_vec_info phi_info = STMT_VINFO_REDUC_DEF (vect_orig_stmt (stmt_info));
>    gphi *reduc_def_phi = as_a <gphi *> (phi_info->stmt);
>    int reduc_index = STMT_VINFO_REDUC_IDX (stmt_info);
> -  tree vectype_in = STMT_VINFO_REDUC_VECTYPE_IN (reduc_info);
> +  tree vectype_in = STMT_VINFO_REDUC_VECTYPE_IN (stmt_info);
> +
> +  if (!vectype_in)
> +    vectype_in = STMT_VINFO_VECTYPE (stmt_info);
>
>    if (slp_node)
>      {
>        ncopies = 1;
> -      vec_num = SLP_TREE_VEC_STMTS_NUM (slp_node);
> +      vec_num = SLP_TREE_VEC_STMTS_EFFEC_NUM (slp_node);
>      }
>    else
>      {
> @@ -8702,6 +8725,64 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
>                          reduc_index == 2 ? op.ops[2] : NULL_TREE,
>                          &vec_oprnds[2]);
>      }
> +  else if (lane_reducing)
> +    {
> +      /* For normal reduction, consistency between vectorized def/use is
> +        naturally ensured when mapping from scalar statement.  But if lane-
> +        reducing op is involved in reduction, thing would become somewhat
> +        complicated in that the op's result and operand for accumulation are
> +        limited to less lanes than other operands, which certainly causes
> +        def/use mismatch on adjacent statements around the op if do not have
> +        any kind of specific adjustment.  One approach is to refit lane-
> +        reducing op in the way of introducing new trivial pass-through copies
> +        to fix possible def/use gap, so as to make it behave like a normal op.
> +        And vector reduction PHIs are always generated to the full extent, no
> +        matter lane-reducing op exists or not.  If some copies or PHIs are
> +        actually superfluous, they would be cleaned up by passes after
> +        vectorization.  An example for single-lane slp is given as below.
> +        Similarly, this handling is applicable for multiple-lane slp as well.
> +
> +          int sum = 1;
> +          for (i)
> +            {
> +              sum += d0[i] * d1[i];      // dot-prod <vector(16) char>
> +            }
> +
> +        The vector size is 128-bit,vectorization factor is 16.  Reduction
> +        statements would be transformed as:
> +
> +          vector<4> int sum_v0 = { 0, 0, 0, 1 };
> +          vector<4> int sum_v1 = { 0, 0, 0, 0 };
> +          vector<4> int sum_v2 = { 0, 0, 0, 0 };
> +          vector<4> int sum_v3 = { 0, 0, 0, 0 };
> +
> +          for (i / 16)
> +            {
> +              sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0);
> +              sum_v1 = sum_v1;  // copy
> +              sum_v2 = sum_v2;  // copy
> +              sum_v3 = sum_v3;  // copy
> +            }
> +
> +          sum_v = sum_v0 + sum_v1 + sum_v2 + sum_v3;   // = sum_v0
> +       */
> +      unsigned effec_ncopies = vec_oprnds[0].length ();
> +      unsigned total_ncopies = vec_oprnds[reduc_index].length ();
> +
> +      if (slp_node)
> +       gcc_assert (effec_ncopies == SLP_TREE_VEC_STMTS_EFFEC_NUM (slp_node));
> +
> +      gcc_assert (effec_ncopies <= total_ncopies);
> +
> +      if (effec_ncopies < total_ncopies)
> +       {
> +         for (unsigned i = 0; i < op.num_ops - 1; i++)
> +           {
> +             gcc_assert (vec_oprnds[i].length () == effec_ncopies);
> +             vec_oprnds[i].safe_grow_cleared (total_ncopies);
> +           }
> +       }
> +    }
>
>    bool emulated_mixed_dot_prod = vect_is_emulated_mixed_dot_prod (stmt_info);
>    unsigned num = vec_oprnds[reduc_index == 0 ? 1 : 0].length ();
> @@ -8710,7 +8791,27 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
>      {
>        gimple *new_stmt;
>        tree vop[3] = { vec_oprnds[0][i], vec_oprnds[1][i], NULL_TREE };
> -      if (masked_loop_p && !mask_by_cond_expr)
> +      if (!vop[0] || !vop[1])
> +       {
> +         tree reduc_vop = vec_oprnds[reduc_index][i];
> +
> +         /* If could not generate an effective vector statement for current
> +            portion of reduction operand, insert a trivial copy to simply
> +            handle over the operand to other dependent statements.  */
> +         gcc_assert (reduc_vop);
> +
> +         if (slp_node && TREE_CODE (reduc_vop) == SSA_NAME
> +             && !SSA_NAME_IS_DEFAULT_DEF (reduc_vop))
> +           new_stmt = SSA_NAME_DEF_STMT (reduc_vop);
> +         else
> +           {
> +             new_temp = make_ssa_name (vec_dest);
> +             new_stmt = gimple_build_assign (new_temp, reduc_vop);
> +             vect_finish_stmt_generation (loop_vinfo, stmt_info, new_stmt,
> +                                          gsi);
> +           }
> +       }
> +      else if (masked_loop_p && !mask_by_cond_expr)
>         {
>           /* No conditional ifns have been defined for lane-reducing op
>              yet.  */
> @@ -8810,21 +8911,19 @@ vect_transform_cycle_phi (loop_vec_info loop_vinfo,
>      /* Leave the scalar phi in place.  */
>      return true;
>
> -  tree vectype_in = STMT_VINFO_REDUC_VECTYPE_IN (reduc_info);
> -  /* For a nested cycle we do not fill the above.  */
> -  if (!vectype_in)
> -    vectype_in = STMT_VINFO_VECTYPE (stmt_info);
> -  gcc_assert (vectype_in);
> -
>    if (slp_node)
>      {
> -      /* The size vect_schedule_slp_instance computes is off for us.  */
> -      vec_num = vect_get_num_vectors (LOOP_VINFO_VECT_FACTOR (loop_vinfo)
> -                                     * SLP_TREE_LANES (slp_node), vectype_in);
> +      vec_num = SLP_TREE_VEC_STMTS_NUM (slp_node);
>        ncopies = 1;
>      }
>    else
>      {
> +      tree vectype_in = STMT_VINFO_REDUC_VECTYPE_IN (reduc_info);
> +      /* For a nested cycle we do not fill the above.  */
> +      if (!vectype_in)
> +       vectype_in = STMT_VINFO_VECTYPE (stmt_info);
> +      gcc_assert (vectype_in);
> +
>        vec_num = 1;
>        ncopies = vect_get_num_copies (loop_vinfo, vectype_in);
>      }
> diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
> index 00863398fd0..3c163c16fc8 100644
> --- a/gcc/tree-vect-slp.cc
> +++ b/gcc/tree-vect-slp.cc
> @@ -114,6 +114,7 @@ _slp_tree::_slp_tree ()
>    SLP_TREE_SCALAR_OPS (this) = vNULL;
>    SLP_TREE_VEC_DEFS (this) = vNULL;
>    SLP_TREE_VEC_STMTS_NUM (this) = 0;
> +  SLP_TREE_VEC_STMTS_EFFEC_NUM (this) = 0;
>    SLP_TREE_CHILDREN (this) = vNULL;
>    SLP_TREE_LOAD_PERMUTATION (this) = vNULL;
>    SLP_TREE_LANE_PERMUTATION (this) = vNULL;
> @@ -6552,38 +6553,16 @@ vect_slp_analyze_node_operations_1 (vec_info *vinfo, slp_tree node,
>                                     slp_instance node_instance,
>                                     stmt_vector_for_cost *cost_vec)
>  {
> -  stmt_vec_info stmt_info = SLP_TREE_REPRESENTATIVE (node);
> -
>    /* Calculate the number of vector statements to be created for the
> -     scalar stmts in this node.  For SLP reductions it is equal to the
> -     number of vector statements in the children (which has already been
> -     calculated by the recursive call).  Otherwise it is the number of
> -     scalar elements in one scalar iteration (DR_GROUP_SIZE) multiplied by
> -     VF divided by the number of elements in a vector.  */
> -  if (SLP_TREE_CODE (node) != VEC_PERM_EXPR
> -      && !STMT_VINFO_DATA_REF (stmt_info)
> -      && REDUC_GROUP_FIRST_ELEMENT (stmt_info))
> -    {
> -      for (unsigned i = 0; i < SLP_TREE_CHILDREN (node).length (); ++i)
> -       if (SLP_TREE_DEF_TYPE (SLP_TREE_CHILDREN (node)[i]) == vect_internal_def)
> -         {
> -           SLP_TREE_VEC_STMTS_NUM (node)
> -             = SLP_TREE_VEC_STMTS_NUM (SLP_TREE_CHILDREN (node)[i]);
> -           break;
> -         }
> -    }
> -  else
> -    {
> -      poly_uint64 vf;
> -      if (loop_vec_info loop_vinfo = dyn_cast <loop_vec_info> (vinfo))
> -       vf = loop_vinfo->vectorization_factor;
> -      else
> -       vf = 1;
> -      unsigned int group_size = SLP_TREE_LANES (node);
> -      tree vectype = SLP_TREE_VECTYPE (node);
> -      SLP_TREE_VEC_STMTS_NUM (node)
> -       = vect_get_num_vectors (vf * group_size, vectype);
> -    }
> +     scalar stmts in this node.  */
> +  unsigned int num = vect_get_slp_num_vectors (vinfo, node);
> +
> +  SLP_TREE_VEC_STMTS_NUM (node) = num;
> +
> +  /* Initialize with the number of vector statements, and would be changed
> +     to some smaller one for reduction PHI and lane-reducing statement
> +     after real input vectype is determined by vectorizable analysis.  */
> +  SLP_TREE_VEC_STMTS_EFFEC_NUM (node) = num;
>
>    /* Handle purely internal nodes.  */
>    if (SLP_TREE_CODE (node) == VEC_PERM_EXPR)
> @@ -6605,7 +6584,9 @@ vect_slp_analyze_node_operations_1 (vec_info *vinfo, slp_tree node,
>        return true;
>      }
>
> +  stmt_vec_info stmt_info = SLP_TREE_REPRESENTATIVE (node);
>    bool dummy;
> +
>    return vect_analyze_stmt (vinfo, stmt_info, &dummy,
>                             node, node_instance, cost_vec);
>  }
> @@ -6851,12 +6832,13 @@ vect_slp_analyze_node_operations (vec_info *vinfo, slp_tree node,
>                           && j == 1);
>               continue;
>             }
> -         unsigned group_size = SLP_TREE_LANES (child);
> -         poly_uint64 vf = 1;
> -         if (loop_vec_info loop_vinfo = dyn_cast <loop_vec_info> (vinfo))
> -           vf = loop_vinfo->vectorization_factor;
> -         SLP_TREE_VEC_STMTS_NUM (child)
> -           = vect_get_num_vectors (vf * group_size, vector_type);
> +
> +         /* Calculate number of vector elements for constant slp node.  */
> +         unsigned int num = vect_get_slp_num_vectors (vinfo, child);
> +
> +         SLP_TREE_VEC_STMTS_NUM (child) = num;
> +         SLP_TREE_VEC_STMTS_EFFEC_NUM (child) = num;
> +
>           /* And cost them.  */
>           vect_prologue_cost_for_slp (child, cost_vec);
>         }
> diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
> index ef89ff1b3e4..f769c2b8c92 100644
> --- a/gcc/tree-vectorizer.h
> +++ b/gcc/tree-vectorizer.h
> @@ -211,6 +211,15 @@ struct _slp_tree {
>       divided by vector size.  */
>    unsigned int vec_stmts_size;
>
> +  /* Number of vector stmts that are really effective after transformation.
> +     Lane-reducing operation requires less vector stmts than normal, so
> +     trivial copies will be inserted to make vector def-use chain align
> +     mutually.  Vector reduction PHIs are fully generated with VF as above,
> +     some may be optimized away if it starts a copy-self def-use cycle, which
> +     occurs in reduction involving only lane-reducing.  Note: The enforcement
> +     of single-defuse-cycle would not impact this, is specially handled.  */
> +  unsigned int vec_stmts_effec_size;
> +
>    /* Reference count in the SLP graph.  */
>    unsigned int refcnt;
>    /* The maximum number of vector elements for the subtree rooted
> @@ -304,6 +313,7 @@ public:
>  #define SLP_TREE_REF_COUNT(S)                    (S)->refcnt
>  #define SLP_TREE_VEC_DEFS(S)                     (S)->vec_defs
>  #define SLP_TREE_VEC_STMTS_NUM(S)                (S)->vec_stmts_size
> +#define SLP_TREE_VEC_STMTS_EFFEC_NUM(S)          (S)->vec_stmts_effec_size
>  #define SLP_TREE_LOAD_PERMUTATION(S)             (S)->load_permutation
>  #define SLP_TREE_LANE_PERMUTATION(S)             (S)->lane_permutation
>  #define SLP_TREE_SIMD_CLONE_INFO(S)              (S)->simd_clone_info
> @@ -2080,6 +2090,41 @@ vect_get_num_vectors (poly_uint64 nunits, tree vectype)
>    return exact_div (nunits, TYPE_VECTOR_SUBPARTS (vectype)).to_constant ();
>  }
>
> +/* Return the number of vectors in the context of vectorization region VINFO,
> +   needed for a group of total SIZE statements that are supposed to be
> +   interleaved together with no gap, and all operate on vectors of type
> +   VECTYPE.  */
> +
> +inline unsigned int
> +vect_get_num_vectors (vec_info *vinfo, tree vectype, unsigned int size)
> +{
> +  poly_uint64 vf;
> +
> +  if (loop_vec_info loop_vinfo = dyn_cast <loop_vec_info> (vinfo))
> +    vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
> +  else
> +    vf = 1;
> +
> +  return vect_get_num_vectors (vf * size, vectype);
> +}
> +
> +/* Return the number of vectors needed for SLP_NODE regarding to VINFO.  If
> +   not NULL, VECTYPE gives a new vectype to operate on, rather than
> +   SLP_TREE_VECTYPE of the node.  */
> +
> +inline unsigned int
> +vect_get_slp_num_vectors (vec_info *vinfo, slp_tree slp_node,
> +                         tree vectype = NULL_TREE)
> +{
> +  if (!vectype)
> +    {
> +      vectype = SLP_TREE_VECTYPE (slp_node);
> +      gcc_checking_assert (vectype);
> +    }
> +
> +  return vect_get_num_vectors (vinfo, vectype, SLP_TREE_LANES (slp_node));
> +}
> +
>  /* Return the number of copies needed for loop vectorization when
>     a statement operates on vectors of type VECTYPE.  This is the
>     vectorization factor divided by the number of elements in
> --
> 2.17.1
Feng Xue OS July 12, 2024, 5:20 a.m. UTC | #2
Hi, Richard,

Let me explain some idea that has to be chosen for lane-reducing. The key
complication is that these ops are associated with two kinds of vec_nums,
one is number of effective vector stmts, which is used by partial vectorzation
function such as vect_get_loop_mask.  The other is number of total created
vector stmts. Now we should make it aligned with normal op, in order to
interoperate with normal op. Suppose expressions mixed with lane-reducing
and normal as:
    
    temp = lane_reducing<16*char> + expr<4*int>;
    temp = cst<4*int> * lane_reducing<16*char>;

If only generating effective vector stmt for lane_reducing, vector def/use
between ops will never be matched, so extra pass-through copies are 
necessary. This is why we say "refit a lane-reducing to be a fake normal op".

The requirement of two vec_stmts are independent of how we will implement
SLP_TREE_NUMBER_OF_VEC_STMTS. Moreover, if we want to refactor vect code
to unify ncopies/vec_num computation and completely remove
SLP_TREE_NUMBER_OF_VEC_STMTS, this tends to be a a large task, and might
be overkill for these lane-reducing patches. So I will keep it as before, and do
not touch it as what I have done in this patch.

Since one SLP_TREE_NUMBER_OF_VEC_STMTS could not be used for two purposes.
The your previous suggestion might not be work:

> As said, I don't like this much.  vect_slp_analyze_node_operations_1 sets this
> and I think the existing "exception"
>
>  /* Calculate the number of vector statements to be created for the
>     scalar stmts in this node.  For SLP reductions it is equal to the
>     number of vector statements in the children (which has already been
>     calculated by the recursive call).  Otherwise it is the number of
>     scalar elements in one scalar iteration (DR_GROUP_SIZE) multiplied by
>     VF divided by the number of elements in a vector.  */
>  if (SLP_TREE_CODE (node) != VEC_PERM_EXPR
>      && !STMT_VINFO_DATA_REF (stmt_info)
>      && REDUC_GROUP_FIRST_ELEMENT (stmt_info))
>    {
>      for (unsigned i = 0; i < SLP_TREE_CHILDREN (node).length (); ++i)
>        if (SLP_TREE_DEF_TYPE (SLP_TREE_CHILDREN (node)[i]) ==
>  vect_internal_def)
>          {
>            SLP_TREE_NUMBER_OF_VEC_STMTS (node)
>              = SLP_TREE_NUMBER_OF_VEC_STMTS (SLP_TREE_CHILDREN (node)[i]);
>            break;
>          }
>    }
>
> could be changed (or amended if replacing doesn't work out) to
> 
>   if (SLP_TREE_CODE (node) != VEC_PERM_EXPR
>       && STMT_VINFO_REDUC_IDX (stmt_info)
>       // do we have this always set?
>       && STMT_VINFO_REDUC_VECTYPE_IN (stmt_info))
>    {
>       do the same as in else {} but using VECTYPE_IN
>    }
> 
> Or maybe scrap the special case and use STMT_VINFO_REDUC_VECTYPE_IN
> when that's set instead of SLP_TREE_VECTYPE?  As said having wrong
> SLP_TREE_NUMBER_OF_VEC_STMTS is going to backfire.

Then the alternative is to limit special handling related to the vec_num only
inside vect_transform_reduction. Is that ok? Or any other suggestion?

Thanks,
Feng
Richard Biener July 12, 2024, 8:13 a.m. UTC | #3
On Fri, Jul 12, 2024 at 7:20 AM Feng Xue OS <fxue@os.amperecomputing.com> wrote:
>
> Hi, Richard,
>
> Let me explain some idea that has to be chosen for lane-reducing. The key
> complication is that these ops are associated with two kinds of vec_nums,
> one is number of effective vector stmts, which is used by partial vectorzation
> function such as vect_get_loop_mask.  The other is number of total created
> vector stmts. Now we should make it aligned with normal op, in order to
> interoperate with normal op. Suppose expressions mixed with lane-reducing
> and normal as:
>
>     temp = lane_reducing<16*char> + expr<4*int>;
>     temp = cst<4*int> * lane_reducing<16*char>;
>
> If only generating effective vector stmt for lane_reducing, vector def/use
> between ops will never be matched, so extra pass-through copies are
> necessary. This is why we say "refit a lane-reducing to be a fake normal op".

And this only happens in vect_transform_reduction, right?

The other pre-existing issue is that for single_defuse_cycle optimization
SLP_TREE_NUMBER_OF_VEC_STMTS is also off (too large).  But here
the transform also goes through vect_transform_reduction.

> The requirement of two vec_stmts are independent of how we will implement
> SLP_TREE_NUMBER_OF_VEC_STMTS. Moreover, if we want to refactor vect code
> to unify ncopies/vec_num computation and completely remove
> SLP_TREE_NUMBER_OF_VEC_STMTS, this tends to be a a large task, and might
> be overkill for these lane-reducing patches. So I will keep it as before, and do
> not touch it as what I have done in this patch.
>
> Since one SLP_TREE_NUMBER_OF_VEC_STMTS could not be used for two purposes.
> The your previous suggestion might not be work:
>
> > As said, I don't like this much.  vect_slp_analyze_node_operations_1 sets this
> > and I think the existing "exception"
> >
> >  /* Calculate the number of vector statements to be created for the
> >     scalar stmts in this node.  For SLP reductions it is equal to the
> >     number of vector statements in the children (which has already been
> >     calculated by the recursive call).  Otherwise it is the number of
> >     scalar elements in one scalar iteration (DR_GROUP_SIZE) multiplied by
> >     VF divided by the number of elements in a vector.  */
> >  if (SLP_TREE_CODE (node) != VEC_PERM_EXPR
> >      && !STMT_VINFO_DATA_REF (stmt_info)
> >      && REDUC_GROUP_FIRST_ELEMENT (stmt_info))
> >    {
> >      for (unsigned i = 0; i < SLP_TREE_CHILDREN (node).length (); ++i)
> >        if (SLP_TREE_DEF_TYPE (SLP_TREE_CHILDREN (node)[i]) ==
> >  vect_internal_def)
> >          {
> >            SLP_TREE_NUMBER_OF_VEC_STMTS (node)
> >              = SLP_TREE_NUMBER_OF_VEC_STMTS (SLP_TREE_CHILDREN (node)[i]);
> >            break;
> >          }
> >    }
> >
> > could be changed (or amended if replacing doesn't work out) to
> >
> >   if (SLP_TREE_CODE (node) != VEC_PERM_EXPR
> >       && STMT_VINFO_REDUC_IDX (stmt_info)
> >       // do we have this always set?
> >       && STMT_VINFO_REDUC_VECTYPE_IN (stmt_info))
> >    {
> >       do the same as in else {} but using VECTYPE_IN
> >    }
> >
> > Or maybe scrap the special case and use STMT_VINFO_REDUC_VECTYPE_IN
> > when that's set instead of SLP_TREE_VECTYPE?  As said having wrong
> > SLP_TREE_NUMBER_OF_VEC_STMTS is going to backfire.
>
> Then the alternative is to limit special handling related to the vec_num only
> inside vect_transform_reduction. Is that ok? Or any other suggestion?

I think that's kind-of in line with the suggestion of a reduction
specific VF, so yes,
not using SLP_TREE_NUMBER_OF_VEC_STMTS in vect_transform_reduction
sounds fine to me and would be a step towards not having
SLP_TREE_NUMBER_OF_VEC_STMTS
where the function would be responsible for appropriate allocation as well.

Richard.

> Thanks,
> Feng
> ________________________________________
> From: Richard Biener <richard.guenther@gmail.com>
> Sent: Thursday, July 11, 2024 5:43 PM
> To: Feng Xue OS; Richard Sandiford
> Cc: gcc-patches@gcc.gnu.org
> Subject: Re: [PATCH 2/4] vect: Fix inaccurate vector stmts number for slp reduction with lane-reducing
>
> On Thu, Jul 11, 2024 at 10:53 AM Feng Xue OS
> <fxue@os.amperecomputing.com> wrote:
> >
> > Vector stmts number of an operation is calculated based on output vectype.
> > This is over-estimated for lane-reducing operation. Sometimes, to workaround
> > the issue, we have to rely on additional logic to deduce an exactly accurate
> > number by other means. Aiming at the inconvenience, in this patch, we would
> > "turn" lane-reducing operation into a normal one by inserting new trivial
> > statements like zero-valued PHIs and pass-through copies, which could be
> > optimized away by later passes. At the same time, a new field is added for
> > slp node to hold number of vector stmts that are really effective after
> > vectorization. For example:
>
> Adding Richard into the loop.
>
> I'm sorry, but this feels a bit backwards - in the end I was hoping that we
> can get rid of SLP_TREE_NUMBER_OF_VEC_STMTS completely.
> We do currently have the odd ncopies (non-SLP) vs. vec_num (SLP)
> duality but in reality all vectorizable_* should know the number of
> stmt copies (or output vector defs) to produce by looking at the vector
> type and the vectorization factor (and in the SLP case the number of
> lanes represented by the node).
>
> That means that in the end vectorizable_* could at transform time
> simply make sure that SLP_TREE_VEC_DEF is appropriately
> created (currently generic code does this based on
> SLP_TREE_NUMBER_OF_VEC_STMTS and also generic code
> tries to determine SLP_TREE_NUMBER_OF_VEC_STMTS).
>
> There are a lot (well, not too many) uses of SLP_TREE_NUMBER_OF_VEC_STMTS
> that short-cut "appropriate" vec_num computation based on
> VF, vector type and lanes.  I hope all of them could vanish.
>
> You add vect_get_[slp_]num_vectors, but I'd rather see a single
>
> inline unsigned int
> vect_get_num_copies (vec_info *vinfo, tree vectype, slp_tree node = nullptr)
> {
>    if (loop_vec_info loop_vinfo = dyn_cast <loop_vec_info> (vinfo))
>      {
>         if (node)
>           return vect_get_num_vectors (LOOP_VINFO_VECT_FACTOR
> (loop_vinfo) * SLP_TREE_LANES (node), vectype);
>         else
>           return vect_get_num_vectors (LOOP_VINFO_VECT_FACTOR
> (loop_vinfo), vectype);
>      }
>   else
>      return vect_get_num_vectors (SLP_TREE_LANES (node), vectype);
> }
>
> so that
>
>   if (slp_node)
>     {
>       ncopies = 1;
>       vec_num = SLP_TREE_NUMBER_OF_VEC_STMTS (slp_node);
>     }
>   else
>     {
>       ncopies = vect_get_num_copies (loop_vinfo, vectype);
>       vec_num = 1;
>     }
>
> can become
>
>   if (slp_node)
>     {
>       ncopies = 1;
>       vec_num = vect_get_num_copies (loop_vinfo, vectype, slp_node);
>     }
>   else
>     {
>       ncopies = vect_get_num_copies (loop_vinfo, vectype/*, slp_node */);
>       vec_num = 1;
>     }
>
> without actually resolving the ncopies/vec_num duality (that will solve itself
> when we have achieved only-SLP).
>
> I think if SLP_TREE_NUMBER_OF_VEC_STMTS is gone then having the
> few places you needed to fix compute the correct number of stmts should be OK.
> Note this will probably require moving the allocation of SLP_TREE_VEC_DEFS
> to the code doing the transform (and remove/adjust some sanity checking we do).
>
> Thanks,
> Richard.
>
> >   int sum = 1;
> >   for (i)
> >     {
> >       sum += d0[i] * d1[i];      // dot-prod <vector(16) char>
> >     }
> >
> >   The vector size is 128-bit,vectorization factor is 16.  Reduction
> >   statements would be transformed as:
> >
> >   vector<4> int sum_v0 = { 0, 0, 0, 1 };
> >   vector<4> int sum_v1 = { 0, 0, 0, 0 };
> >   vector<4> int sum_v2 = { 0, 0, 0, 0 };
> >   vector<4> int sum_v3 = { 0, 0, 0, 0 };
> >
> >   for (i / 16)
> >     {
> >       sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0);
> >       sum_v1 = sum_v1;  // copy
> >       sum_v2 = sum_v2;  // copy
> >       sum_v3 = sum_v3;  // copy
> >     }
> >
> >   sum_v = sum_v0 + sum_v1 + sum_v2 + sum_v3;   // = sum_v0
> >
> > Thanks,
> > Feng
> >
> > ---
> > gcc/
> >         * tree-vectorizer.h (vec_stmts_effec_size): New field in _slp_tree.
> >         (SLP_TREE_VEC_STMTS_EFFEC_NUM): New macro.
> >         (vect_get_num_vectors): New overload function.
> >         (vect_get_slp_num_vectors): New function.
> >         * tree-vect-loop.cc (vect_reduction_update_partial_vector_usage): Use
> >         effective vector stmts number.
> >         (vectorizable_reduction): Compute number of effective vector stmts for
> >         lane-reducing op and reduction PHI.
> >         (vect_transform_reduction): Insert copies for lane-reducing so as to
> >         fix inaccurate vector stmts number.
> >         (vect_transform_cycle_phi): Only need to calculate vector PHI number
> >         based on input vectype for non-slp code path.
> >         * tree-vect-slp.cc (_slp_tree::_slp_tree): Initialize effective vector
> >         stmts number to zero.
> >         (vect_slp_analyze_node_operations_1): Remove adjustment on vector
> >         stmts number specific to slp reduction.
> >         (vect_slp_analyze_node_operations): Compute number of vector elements
> >         for constant/external slp node with vect_get_slp_num_vectors.
> > ---
> >  gcc/tree-vect-loop.cc | 139 ++++++++++++++++++++++++++++++++++++------
> >  gcc/tree-vect-slp.cc  |  56 ++++++-----------
> >  gcc/tree-vectorizer.h |  45 ++++++++++++++
> >  3 files changed, 183 insertions(+), 57 deletions(-)
> >
> > diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
> > index c183e2b6068..5ad9836d6c8 100644
> > --- a/gcc/tree-vect-loop.cc
> > +++ b/gcc/tree-vect-loop.cc
> > @@ -7471,7 +7471,7 @@ vect_reduction_update_partial_vector_usage (loop_vec_info loop_vinfo,
> >        unsigned nvectors;
> >
> >        if (slp_node)
> > -       nvectors = SLP_TREE_VEC_STMTS_NUM (slp_node);
> > +       nvectors = SLP_TREE_VEC_STMTS_EFFEC_NUM (slp_node);
> >        else
> >         nvectors = vect_get_num_copies (loop_vinfo, vectype_in);
> >
> > @@ -7594,6 +7594,15 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
> >    stmt_vec_info phi_info = stmt_info;
> >    if (!is_a <gphi *> (stmt_info->stmt))
> >      {
> > +      if (lane_reducing_stmt_p (stmt_info->stmt) && slp_node)
> > +       {
> > +         /* Compute number of effective vector statements for lane-reducing
> > +            ops.  */
> > +         vectype_in = STMT_VINFO_REDUC_VECTYPE_IN (stmt_info);
> > +         gcc_assert (vectype_in);
> > +         SLP_TREE_VEC_STMTS_EFFEC_NUM (slp_node)
> > +           = vect_get_slp_num_vectors (loop_vinfo, slp_node, vectype_in);
> > +       }
> >        STMT_VINFO_TYPE (stmt_info) = reduc_vec_info_type;
> >        return true;
> >      }
> > @@ -8012,14 +8021,25 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
> >    if (STMT_VINFO_LIVE_P (phi_info))
> >      return false;
> >
> > -  if (slp_node)
> > -    ncopies = 1;
> > -  else
> > -    ncopies = vect_get_num_copies (loop_vinfo, vectype_in);
> > +  poly_uint64 nunits_out = TYPE_VECTOR_SUBPARTS (vectype_out);
> >
> > -  gcc_assert (ncopies >= 1);
> > +  if (slp_node)
> > +    {
> > +      ncopies = 1;
> >
> > -  poly_uint64 nunits_out = TYPE_VECTOR_SUBPARTS (vectype_out);
> > +      if (maybe_ne (TYPE_VECTOR_SUBPARTS (vectype_in), nunits_out))
> > +       {
> > +         /* Not all vector reduction PHIs would be used, compute number
> > +            of the effective statements.  */
> > +         SLP_TREE_VEC_STMTS_EFFEC_NUM (slp_node)
> > +           = vect_get_slp_num_vectors (loop_vinfo, slp_node, vectype_in);
> > +       }
> > +    }
> > +  else
> > +    {
> > +      ncopies = vect_get_num_copies (loop_vinfo, vectype_in);
> > +      gcc_assert (ncopies >= 1);
> > +    }
> >
> >    if (nested_cycle)
> >      {
> > @@ -8360,7 +8380,7 @@ vectorizable_reduction (loop_vec_info loop_vinfo,
> >         || (slp_node
> >            && !REDUC_GROUP_FIRST_ELEMENT (stmt_info)
> >            && SLP_TREE_LANES (slp_node) == 1
> > -          && vect_get_num_copies (loop_vinfo, vectype_in) > 1))
> > +          && SLP_TREE_VEC_STMTS_EFFEC_NUM (slp_node) > 1))
> >        && (STMT_VINFO_RELEVANT (stmt_info) <= vect_used_only_live)
> >        && reduc_chain_length == 1
> >        && loop_vinfo->suggested_unroll_factor == 1)
> > @@ -8595,12 +8615,15 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
> >    stmt_vec_info phi_info = STMT_VINFO_REDUC_DEF (vect_orig_stmt (stmt_info));
> >    gphi *reduc_def_phi = as_a <gphi *> (phi_info->stmt);
> >    int reduc_index = STMT_VINFO_REDUC_IDX (stmt_info);
> > -  tree vectype_in = STMT_VINFO_REDUC_VECTYPE_IN (reduc_info);
> > +  tree vectype_in = STMT_VINFO_REDUC_VECTYPE_IN (stmt_info);
> > +
> > +  if (!vectype_in)
> > +    vectype_in = STMT_VINFO_VECTYPE (stmt_info);
> >
> >    if (slp_node)
> >      {
> >        ncopies = 1;
> > -      vec_num = SLP_TREE_VEC_STMTS_NUM (slp_node);
> > +      vec_num = SLP_TREE_VEC_STMTS_EFFEC_NUM (slp_node);
> >      }
> >    else
> >      {
> > @@ -8702,6 +8725,64 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
> >                          reduc_index == 2 ? op.ops[2] : NULL_TREE,
> >                          &vec_oprnds[2]);
> >      }
> > +  else if (lane_reducing)
> > +    {
> > +      /* For normal reduction, consistency between vectorized def/use is
> > +        naturally ensured when mapping from scalar statement.  But if lane-
> > +        reducing op is involved in reduction, thing would become somewhat
> > +        complicated in that the op's result and operand for accumulation are
> > +        limited to less lanes than other operands, which certainly causes
> > +        def/use mismatch on adjacent statements around the op if do not have
> > +        any kind of specific adjustment.  One approach is to refit lane-
> > +        reducing op in the way of introducing new trivial pass-through copies
> > +        to fix possible def/use gap, so as to make it behave like a normal op.
> > +        And vector reduction PHIs are always generated to the full extent, no
> > +        matter lane-reducing op exists or not.  If some copies or PHIs are
> > +        actually superfluous, they would be cleaned up by passes after
> > +        vectorization.  An example for single-lane slp is given as below.
> > +        Similarly, this handling is applicable for multiple-lane slp as well.
> > +
> > +          int sum = 1;
> > +          for (i)
> > +            {
> > +              sum += d0[i] * d1[i];      // dot-prod <vector(16) char>
> > +            }
> > +
> > +        The vector size is 128-bit,vectorization factor is 16.  Reduction
> > +        statements would be transformed as:
> > +
> > +          vector<4> int sum_v0 = { 0, 0, 0, 1 };
> > +          vector<4> int sum_v1 = { 0, 0, 0, 0 };
> > +          vector<4> int sum_v2 = { 0, 0, 0, 0 };
> > +          vector<4> int sum_v3 = { 0, 0, 0, 0 };
> > +
> > +          for (i / 16)
> > +            {
> > +              sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0);
> > +              sum_v1 = sum_v1;  // copy
> > +              sum_v2 = sum_v2;  // copy
> > +              sum_v3 = sum_v3;  // copy
> > +            }
> > +
> > +          sum_v = sum_v0 + sum_v1 + sum_v2 + sum_v3;   // = sum_v0
> > +       */
> > +      unsigned effec_ncopies = vec_oprnds[0].length ();
> > +      unsigned total_ncopies = vec_oprnds[reduc_index].length ();
> > +
> > +      if (slp_node)
> > +       gcc_assert (effec_ncopies == SLP_TREE_VEC_STMTS_EFFEC_NUM (slp_node));
> > +
> > +      gcc_assert (effec_ncopies <= total_ncopies);
> > +
> > +      if (effec_ncopies < total_ncopies)
> > +       {
> > +         for (unsigned i = 0; i < op.num_ops - 1; i++)
> > +           {
> > +             gcc_assert (vec_oprnds[i].length () == effec_ncopies);
> > +             vec_oprnds[i].safe_grow_cleared (total_ncopies);
> > +           }
> > +       }
> > +    }
> >
> >    bool emulated_mixed_dot_prod = vect_is_emulated_mixed_dot_prod (stmt_info);
> >    unsigned num = vec_oprnds[reduc_index == 0 ? 1 : 0].length ();
> > @@ -8710,7 +8791,27 @@ vect_transform_reduction (loop_vec_info loop_vinfo,
> >      {
> >        gimple *new_stmt;
> >        tree vop[3] = { vec_oprnds[0][i], vec_oprnds[1][i], NULL_TREE };
> > -      if (masked_loop_p && !mask_by_cond_expr)
> > +      if (!vop[0] || !vop[1])
> > +       {
> > +         tree reduc_vop = vec_oprnds[reduc_index][i];
> > +
> > +         /* If could not generate an effective vector statement for current
> > +            portion of reduction operand, insert a trivial copy to simply
> > +            handle over the operand to other dependent statements.  */
> > +         gcc_assert (reduc_vop);
> > +
> > +         if (slp_node && TREE_CODE (reduc_vop) == SSA_NAME
> > +             && !SSA_NAME_IS_DEFAULT_DEF (reduc_vop))
> > +           new_stmt = SSA_NAME_DEF_STMT (reduc_vop);
> > +         else
> > +           {
> > +             new_temp = make_ssa_name (vec_dest);
> > +             new_stmt = gimple_build_assign (new_temp, reduc_vop);
> > +             vect_finish_stmt_generation (loop_vinfo, stmt_info, new_stmt,
> > +                                          gsi);
> > +           }
> > +       }
> > +      else if (masked_loop_p && !mask_by_cond_expr)
> >         {
> >           /* No conditional ifns have been defined for lane-reducing op
> >              yet.  */
> > @@ -8810,21 +8911,19 @@ vect_transform_cycle_phi (loop_vec_info loop_vinfo,
> >      /* Leave the scalar phi in place.  */
> >      return true;
> >
> > -  tree vectype_in = STMT_VINFO_REDUC_VECTYPE_IN (reduc_info);
> > -  /* For a nested cycle we do not fill the above.  */
> > -  if (!vectype_in)
> > -    vectype_in = STMT_VINFO_VECTYPE (stmt_info);
> > -  gcc_assert (vectype_in);
> > -
> >    if (slp_node)
> >      {
> > -      /* The size vect_schedule_slp_instance computes is off for us.  */
> > -      vec_num = vect_get_num_vectors (LOOP_VINFO_VECT_FACTOR (loop_vinfo)
> > -                                     * SLP_TREE_LANES (slp_node), vectype_in);
> > +      vec_num = SLP_TREE_VEC_STMTS_NUM (slp_node);
> >        ncopies = 1;
> >      }
> >    else
> >      {
> > +      tree vectype_in = STMT_VINFO_REDUC_VECTYPE_IN (reduc_info);
> > +      /* For a nested cycle we do not fill the above.  */
> > +      if (!vectype_in)
> > +       vectype_in = STMT_VINFO_VECTYPE (stmt_info);
> > +      gcc_assert (vectype_in);
> > +
> >        vec_num = 1;
> >        ncopies = vect_get_num_copies (loop_vinfo, vectype_in);
> >      }
> > diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
> > index 00863398fd0..3c163c16fc8 100644
> > --- a/gcc/tree-vect-slp.cc
> > +++ b/gcc/tree-vect-slp.cc
> > @@ -114,6 +114,7 @@ _slp_tree::_slp_tree ()
> >    SLP_TREE_SCALAR_OPS (this) = vNULL;
> >    SLP_TREE_VEC_DEFS (this) = vNULL;
> >    SLP_TREE_VEC_STMTS_NUM (this) = 0;
> > +  SLP_TREE_VEC_STMTS_EFFEC_NUM (this) = 0;
> >    SLP_TREE_CHILDREN (this) = vNULL;
> >    SLP_TREE_LOAD_PERMUTATION (this) = vNULL;
> >    SLP_TREE_LANE_PERMUTATION (this) = vNULL;
> > @@ -6552,38 +6553,16 @@ vect_slp_analyze_node_operations_1 (vec_info *vinfo, slp_tree node,
> >                                     slp_instance node_instance,
> >                                     stmt_vector_for_cost *cost_vec)
> >  {
> > -  stmt_vec_info stmt_info = SLP_TREE_REPRESENTATIVE (node);
> > -
> >    /* Calculate the number of vector statements to be created for the
> > -     scalar stmts in this node.  For SLP reductions it is equal to the
> > -     number of vector statements in the children (which has already been
> > -     calculated by the recursive call).  Otherwise it is the number of
> > -     scalar elements in one scalar iteration (DR_GROUP_SIZE) multiplied by
> > -     VF divided by the number of elements in a vector.  */
> > -  if (SLP_TREE_CODE (node) != VEC_PERM_EXPR
> > -      && !STMT_VINFO_DATA_REF (stmt_info)
> > -      && REDUC_GROUP_FIRST_ELEMENT (stmt_info))
> > -    {
> > -      for (unsigned i = 0; i < SLP_TREE_CHILDREN (node).length (); ++i)
> > -       if (SLP_TREE_DEF_TYPE (SLP_TREE_CHILDREN (node)[i]) == vect_internal_def)
> > -         {
> > -           SLP_TREE_VEC_STMTS_NUM (node)
> > -             = SLP_TREE_VEC_STMTS_NUM (SLP_TREE_CHILDREN (node)[i]);
> > -           break;
> > -         }
> > -    }
> > -  else
> > -    {
> > -      poly_uint64 vf;
> > -      if (loop_vec_info loop_vinfo = dyn_cast <loop_vec_info> (vinfo))
> > -       vf = loop_vinfo->vectorization_factor;
> > -      else
> > -       vf = 1;
> > -      unsigned int group_size = SLP_TREE_LANES (node);
> > -      tree vectype = SLP_TREE_VECTYPE (node);
> > -      SLP_TREE_VEC_STMTS_NUM (node)
> > -       = vect_get_num_vectors (vf * group_size, vectype);
> > -    }
> > +     scalar stmts in this node.  */
> > +  unsigned int num = vect_get_slp_num_vectors (vinfo, node);
> > +
> > +  SLP_TREE_VEC_STMTS_NUM (node) = num;
> > +
> > +  /* Initialize with the number of vector statements, and would be changed
> > +     to some smaller one for reduction PHI and lane-reducing statement
> > +     after real input vectype is determined by vectorizable analysis.  */
> > +  SLP_TREE_VEC_STMTS_EFFEC_NUM (node) = num;
> >
> >    /* Handle purely internal nodes.  */
> >    if (SLP_TREE_CODE (node) == VEC_PERM_EXPR)
> > @@ -6605,7 +6584,9 @@ vect_slp_analyze_node_operations_1 (vec_info *vinfo, slp_tree node,
> >        return true;
> >      }
> >
> > +  stmt_vec_info stmt_info = SLP_TREE_REPRESENTATIVE (node);
> >    bool dummy;
> > +
> >    return vect_analyze_stmt (vinfo, stmt_info, &dummy,
> >                             node, node_instance, cost_vec);
> >  }
> > @@ -6851,12 +6832,13 @@ vect_slp_analyze_node_operations (vec_info *vinfo, slp_tree node,
> >                           && j == 1);
> >               continue;
> >             }
> > -         unsigned group_size = SLP_TREE_LANES (child);
> > -         poly_uint64 vf = 1;
> > -         if (loop_vec_info loop_vinfo = dyn_cast <loop_vec_info> (vinfo))
> > -           vf = loop_vinfo->vectorization_factor;
> > -         SLP_TREE_VEC_STMTS_NUM (child)
> > -           = vect_get_num_vectors (vf * group_size, vector_type);
> > +
> > +         /* Calculate number of vector elements for constant slp node.  */
> > +         unsigned int num = vect_get_slp_num_vectors (vinfo, child);
> > +
> > +         SLP_TREE_VEC_STMTS_NUM (child) = num;
> > +         SLP_TREE_VEC_STMTS_EFFEC_NUM (child) = num;
> > +
> >           /* And cost them.  */
> >           vect_prologue_cost_for_slp (child, cost_vec);
> >         }
> > diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
> > index ef89ff1b3e4..f769c2b8c92 100644
> > --- a/gcc/tree-vectorizer.h
> > +++ b/gcc/tree-vectorizer.h
> > @@ -211,6 +211,15 @@ struct _slp_tree {
> >       divided by vector size.  */
> >    unsigned int vec_stmts_size;
> >
> > +  /* Number of vector stmts that are really effective after transformation.
> > +     Lane-reducing operation requires less vector stmts than normal, so
> > +     trivial copies will be inserted to make vector def-use chain align
> > +     mutually.  Vector reduction PHIs are fully generated with VF as above,
> > +     some may be optimized away if it starts a copy-self def-use cycle, which
> > +     occurs in reduction involving only lane-reducing.  Note: The enforcement
> > +     of single-defuse-cycle would not impact this, is specially handled.  */
> > +  unsigned int vec_stmts_effec_size;
> > +
> >    /* Reference count in the SLP graph.  */
> >    unsigned int refcnt;
> >    /* The maximum number of vector elements for the subtree rooted
> > @@ -304,6 +313,7 @@ public:
> >  #define SLP_TREE_REF_COUNT(S)                    (S)->refcnt
> >  #define SLP_TREE_VEC_DEFS(S)                     (S)->vec_defs
> >  #define SLP_TREE_VEC_STMTS_NUM(S)                (S)->vec_stmts_size
> > +#define SLP_TREE_VEC_STMTS_EFFEC_NUM(S)          (S)->vec_stmts_effec_size
> >  #define SLP_TREE_LOAD_PERMUTATION(S)             (S)->load_permutation
> >  #define SLP_TREE_LANE_PERMUTATION(S)             (S)->lane_permutation
> >  #define SLP_TREE_SIMD_CLONE_INFO(S)              (S)->simd_clone_info
> > @@ -2080,6 +2090,41 @@ vect_get_num_vectors (poly_uint64 nunits, tree vectype)
> >    return exact_div (nunits, TYPE_VECTOR_SUBPARTS (vectype)).to_constant ();
> >  }
> >
> > +/* Return the number of vectors in the context of vectorization region VINFO,
> > +   needed for a group of total SIZE statements that are supposed to be
> > +   interleaved together with no gap, and all operate on vectors of type
> > +   VECTYPE.  */
> > +
> > +inline unsigned int
> > +vect_get_num_vectors (vec_info *vinfo, tree vectype, unsigned int size)
> > +{
> > +  poly_uint64 vf;
> > +
> > +  if (loop_vec_info loop_vinfo = dyn_cast <loop_vec_info> (vinfo))
> > +    vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
> > +  else
> > +    vf = 1;
> > +
> > +  return vect_get_num_vectors (vf * size, vectype);
> > +}
> > +
> > +/* Return the number of vectors needed for SLP_NODE regarding to VINFO.  If
> > +   not NULL, VECTYPE gives a new vectype to operate on, rather than
> > +   SLP_TREE_VECTYPE of the node.  */
> > +
> > +inline unsigned int
> > +vect_get_slp_num_vectors (vec_info *vinfo, slp_tree slp_node,
> > +                         tree vectype = NULL_TREE)
> > +{
> > +  if (!vectype)
> > +    {
> > +      vectype = SLP_TREE_VECTYPE (slp_node);
> > +      gcc_checking_assert (vectype);
> > +    }
> > +
> > +  return vect_get_num_vectors (vinfo, vectype, SLP_TREE_LANES (slp_node));
> > +}
> > +
> >  /* Return the number of copies needed for loop vectorization when
> >     a statement operates on vectors of type VECTYPE.  This is the
> >     vectorization factor divided by the number of elements in
> > --
> > 2.17.1
Feng Xue OS July 13, 2024, 3:25 p.m. UTC | #4
> > Hi, Richard,
> >
> > Let me explain some idea that has to be chosen for lane-reducing. The key
> > complication is that these ops are associated with two kinds of vec_nums,
> > one is number of effective vector stmts, which is used by partial vectorzation
> > function such as vect_get_loop_mask.  The other is number of total created
> > vector stmts. Now we should make it aligned with normal op, in order to
> > interoperate with normal op. Suppose expressions mixed with lane-reducing
> > and normal as:
> >
> >     temp = lane_reducing<16*char> + expr<4*int>;
> >     temp = cst<4*int> * lane_reducing<16*char>;
> >
> > If only generating effective vector stmt for lane_reducing, vector def/use
> > between ops will never be matched, so extra pass-through copies are
> > necessary. This is why we say "refit a lane-reducing to be a fake normal op".
> 
> And this only happens in vect_transform_reduction, right?

Yes. it is.

> 
> The other pre-existing issue is that for single_defuse_cycle optimization
> SLP_TREE_NUMBER_OF_VEC_STMTS is also off (too large).  But here
> the transform also goes through vect_transform_reduction.
> 
> > The requirement of two vec_stmts are independent of how we will implement
> > SLP_TREE_NUMBER_OF_VEC_STMTS. Moreover, if we want to refactor vect code
> > to unify ncopies/vec_num computation and completely remove
> > SLP_TREE_NUMBER_OF_VEC_STMTS, this tends to be a a large task, and might
> > be overkill for these lane-reducing patches. So I will keep it as before, and do
> > not touch it as what I have done in this patch.
> >
> > Since one SLP_TREE_NUMBER_OF_VEC_STMTS could not be used for two purposes.
> > The your previous suggestion might not be work:
> >
> > > As said, I don't like this much.  vect_slp_analyze_node_operations_1 sets this
> > > and I think the existing "exception"
> > >
> > >  /* Calculate the number of vector statements to be created for the
> > >     scalar stmts in this node.  For SLP reductions it is equal to the
> > >     number of vector statements in the children (which has already been
> > >     calculated by the recursive call).  Otherwise it is the number of
> > >     scalar elements in one scalar iteration (DR_GROUP_SIZE) multiplied by
> > >     VF divided by the number of elements in a vector.  */
> > >  if (SLP_TREE_CODE (node) != VEC_PERM_EXPR
> > >      && !STMT_VINFO_DATA_REF (stmt_info)
> > >      && REDUC_GROUP_FIRST_ELEMENT (stmt_info))
> > >    {
> > >      for (unsigned i = 0; i < SLP_TREE_CHILDREN (node).length (); ++i)
> > >        if (SLP_TREE_DEF_TYPE (SLP_TREE_CHILDREN (node)[i]) ==
> > >  vect_internal_def)
> > >          {
> > >            SLP_TREE_NUMBER_OF_VEC_STMTS (node)
> > >              = SLP_TREE_NUMBER_OF_VEC_STMTS (SLP_TREE_CHILDREN (node)[i]);
> > >            break;
> > >          }
> > >    }
> > >
> > > could be changed (or amended if replacing doesn't work out) to
> > >
> > >   if (SLP_TREE_CODE (node) != VEC_PERM_EXPR
> > >       && STMT_VINFO_REDUC_IDX (stmt_info)
> > >       // do we have this always set?
> > >       && STMT_VINFO_REDUC_VECTYPE_IN (stmt_info))
> > >    {
> > >       do the same as in else {} but using VECTYPE_IN
> > >    }
> > >
> > > Or maybe scrap the special case and use STMT_VINFO_REDUC_VECTYPE_IN
> > > when that's set instead of SLP_TREE_VECTYPE?  As said having wrong
> > > SLP_TREE_NUMBER_OF_VEC_STMTS is going to backfire.
> >
> > Then the alternative is to limit special handling related to the vec_num only
> > inside vect_transform_reduction. Is that ok? Or any other suggestion?
> 
> I think that's kind-of in line with the suggestion of a reduction
> specific VF, so yes,
> not using SLP_TREE_NUMBER_OF_VEC_STMTS in vect_transform_reduction
> sounds fine to me and would be a step towards not having
> SLP_TREE_NUMBER_OF_VEC_STMTS
> where the function would be responsible for appropriate allocation as well.

OK. I remade 4 patches, and send them in a new emails.

Thanks,
Feng
diff mbox series

Patch

From ece7ade0be117f2a081d098f9ea6625cb21a51f7 Mon Sep 17 00:00:00 2001
From: Feng Xue <fxue@os.amperecomputing.com>
Date: Tue, 2 Jul 2024 17:12:00 +0800
Subject: [PATCH 2/4] vect: Fix inaccurate vector stmts number for slp
 reduction with lane-reducing
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Vector stmts number of an operation is calculated based on output vectype.
This is over-estimated for lane-reducing operation. Sometimes, to workaround
the issue, we have to rely on additional logic to deduce an exactly accurate
number by other means. Aiming at the inconvenience, in this patch, we would
"turn" lane-reducing operation into a normal one by inserting new trivial
statements like zero-valued PHIs and pass-through copies, which could be
optimized away by later passes. At the same time, a new field is added for
slp node to hold number of vector stmts that are really effective after
vectorization. For example:

  int sum = 1;
  for (i)
    {
      sum += d0[i] * d1[i];      // dot-prod <vector(16) char>
    }

  The vector size is 128-bit,vectorization factor is 16.  Reduction
  statements would be transformed as:

  vector<4> int sum_v0 = { 0, 0, 0, 1 };
  vector<4> int sum_v1 = { 0, 0, 0, 0 };
  vector<4> int sum_v2 = { 0, 0, 0, 0 };
  vector<4> int sum_v3 = { 0, 0, 0, 0 };

  for (i / 16)
    {
      sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0);
      sum_v1 = sum_v1;  // copy
      sum_v2 = sum_v2;  // copy
      sum_v3 = sum_v3;  // copy
    }

  sum_v = sum_v0 + sum_v1 + sum_v2 + sum_v3;   // = sum_v0

2024-07-02 Feng Xue <fxue@os.amperecomputing.com>

gcc/
	* tree-vectorizer.h (vec_stmts_effec_size): New field in _slp_tree.
	(SLP_TREE_VEC_STMTS_EFFEC_NUM): New macro.
	(vect_get_num_vectors): New overload function.
	(vect_get_slp_num_vectors): New function.
	* tree-vect-loop.cc (vect_reduction_update_partial_vector_usage): Use
	effective vector stmts number.
	(vectorizable_reduction): Compute number of effective vector stmts for
	lane-reducing op and reduction PHI.
	(vect_transform_reduction): Insert copies for lane-reducing so as to
	fix inaccurate vector stmts number.
	(vect_transform_cycle_phi): Only need to calculate vector PHI number
	based on input vectype for non-slp code path.
	* tree-vect-slp.cc (_slp_tree::_slp_tree): Initialize effective vector
	stmts number to zero.
	(vect_slp_analyze_node_operations_1): Remove adjustment on vector
	stmts number specific to slp reduction.
	(vect_slp_analyze_node_operations): Compute number of vector elements
	for constant/external slp node with vect_get_slp_num_vectors.
---
 gcc/tree-vect-loop.cc | 139 ++++++++++++++++++++++++++++++++++++------
 gcc/tree-vect-slp.cc  |  56 ++++++-----------
 gcc/tree-vectorizer.h |  45 ++++++++++++++
 3 files changed, 183 insertions(+), 57 deletions(-)

diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index c183e2b6068..5ad9836d6c8 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -7471,7 +7471,7 @@  vect_reduction_update_partial_vector_usage (loop_vec_info loop_vinfo,
       unsigned nvectors;
 
       if (slp_node)
-	nvectors = SLP_TREE_VEC_STMTS_NUM (slp_node);
+	nvectors = SLP_TREE_VEC_STMTS_EFFEC_NUM (slp_node);
       else
 	nvectors = vect_get_num_copies (loop_vinfo, vectype_in);
 
@@ -7594,6 +7594,15 @@  vectorizable_reduction (loop_vec_info loop_vinfo,
   stmt_vec_info phi_info = stmt_info;
   if (!is_a <gphi *> (stmt_info->stmt))
     {
+      if (lane_reducing_stmt_p (stmt_info->stmt) && slp_node)
+	{
+	  /* Compute number of effective vector statements for lane-reducing
+	     ops.  */
+	  vectype_in = STMT_VINFO_REDUC_VECTYPE_IN (stmt_info);
+	  gcc_assert (vectype_in);
+	  SLP_TREE_VEC_STMTS_EFFEC_NUM (slp_node)
+	    = vect_get_slp_num_vectors (loop_vinfo, slp_node, vectype_in);
+	}
       STMT_VINFO_TYPE (stmt_info) = reduc_vec_info_type;
       return true;
     }
@@ -8012,14 +8021,25 @@  vectorizable_reduction (loop_vec_info loop_vinfo,
   if (STMT_VINFO_LIVE_P (phi_info))
     return false;
 
-  if (slp_node)
-    ncopies = 1;
-  else
-    ncopies = vect_get_num_copies (loop_vinfo, vectype_in);
+  poly_uint64 nunits_out = TYPE_VECTOR_SUBPARTS (vectype_out);
 
-  gcc_assert (ncopies >= 1);
+  if (slp_node)
+    {
+      ncopies = 1;
 
-  poly_uint64 nunits_out = TYPE_VECTOR_SUBPARTS (vectype_out);
+      if (maybe_ne (TYPE_VECTOR_SUBPARTS (vectype_in), nunits_out))
+	{
+	  /* Not all vector reduction PHIs would be used, compute number
+	     of the effective statements.  */
+	  SLP_TREE_VEC_STMTS_EFFEC_NUM (slp_node)
+	    = vect_get_slp_num_vectors (loop_vinfo, slp_node, vectype_in);
+	}
+    }
+  else
+    {
+      ncopies = vect_get_num_copies (loop_vinfo, vectype_in);
+      gcc_assert (ncopies >= 1);
+    }
 
   if (nested_cycle)
     {
@@ -8360,7 +8380,7 @@  vectorizable_reduction (loop_vec_info loop_vinfo,
        || (slp_node
 	   && !REDUC_GROUP_FIRST_ELEMENT (stmt_info)
 	   && SLP_TREE_LANES (slp_node) == 1
-	   && vect_get_num_copies (loop_vinfo, vectype_in) > 1))
+	   && SLP_TREE_VEC_STMTS_EFFEC_NUM (slp_node) > 1))
       && (STMT_VINFO_RELEVANT (stmt_info) <= vect_used_only_live)
       && reduc_chain_length == 1
       && loop_vinfo->suggested_unroll_factor == 1)
@@ -8595,12 +8615,15 @@  vect_transform_reduction (loop_vec_info loop_vinfo,
   stmt_vec_info phi_info = STMT_VINFO_REDUC_DEF (vect_orig_stmt (stmt_info));
   gphi *reduc_def_phi = as_a <gphi *> (phi_info->stmt);
   int reduc_index = STMT_VINFO_REDUC_IDX (stmt_info);
-  tree vectype_in = STMT_VINFO_REDUC_VECTYPE_IN (reduc_info);
+  tree vectype_in = STMT_VINFO_REDUC_VECTYPE_IN (stmt_info);
+
+  if (!vectype_in)
+    vectype_in = STMT_VINFO_VECTYPE (stmt_info);
 
   if (slp_node)
     {
       ncopies = 1;
-      vec_num = SLP_TREE_VEC_STMTS_NUM (slp_node);
+      vec_num = SLP_TREE_VEC_STMTS_EFFEC_NUM (slp_node);
     }
   else
     {
@@ -8702,6 +8725,64 @@  vect_transform_reduction (loop_vec_info loop_vinfo,
 			 reduc_index == 2 ? op.ops[2] : NULL_TREE,
 			 &vec_oprnds[2]);
     }
+  else if (lane_reducing)
+    {
+      /* For normal reduction, consistency between vectorized def/use is
+	 naturally ensured when mapping from scalar statement.  But if lane-
+	 reducing op is involved in reduction, thing would become somewhat
+	 complicated in that the op's result and operand for accumulation are
+	 limited to less lanes than other operands, which certainly causes
+	 def/use mismatch on adjacent statements around the op if do not have
+	 any kind of specific adjustment.  One approach is to refit lane-
+	 reducing op in the way of introducing new trivial pass-through copies
+	 to fix possible def/use gap, so as to make it behave like a normal op.
+	 And vector reduction PHIs are always generated to the full extent, no
+	 matter lane-reducing op exists or not.  If some copies or PHIs are
+	 actually superfluous, they would be cleaned up by passes after
+	 vectorization.  An example for single-lane slp is given as below.
+	 Similarly, this handling is applicable for multiple-lane slp as well.
+
+	   int sum = 1;
+	   for (i)
+	     {
+	       sum += d0[i] * d1[i];      // dot-prod <vector(16) char>
+	     }
+
+	 The vector size is 128-bit,vectorization factor is 16.  Reduction
+	 statements would be transformed as:
+
+	   vector<4> int sum_v0 = { 0, 0, 0, 1 };
+	   vector<4> int sum_v1 = { 0, 0, 0, 0 };
+	   vector<4> int sum_v2 = { 0, 0, 0, 0 };
+	   vector<4> int sum_v3 = { 0, 0, 0, 0 };
+
+	   for (i / 16)
+	     {
+	       sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0);
+	       sum_v1 = sum_v1;  // copy
+	       sum_v2 = sum_v2;  // copy
+	       sum_v3 = sum_v3;  // copy
+	     }
+
+	   sum_v = sum_v0 + sum_v1 + sum_v2 + sum_v3;   // = sum_v0
+	*/
+      unsigned effec_ncopies = vec_oprnds[0].length ();
+      unsigned total_ncopies = vec_oprnds[reduc_index].length ();
+
+      if (slp_node)
+	gcc_assert (effec_ncopies == SLP_TREE_VEC_STMTS_EFFEC_NUM (slp_node));
+
+      gcc_assert (effec_ncopies <= total_ncopies);
+
+      if (effec_ncopies < total_ncopies)
+	{
+	  for (unsigned i = 0; i < op.num_ops - 1; i++)
+	    {
+	      gcc_assert (vec_oprnds[i].length () == effec_ncopies);
+	      vec_oprnds[i].safe_grow_cleared (total_ncopies);
+	    }
+	}
+    }
 
   bool emulated_mixed_dot_prod = vect_is_emulated_mixed_dot_prod (stmt_info);
   unsigned num = vec_oprnds[reduc_index == 0 ? 1 : 0].length ();
@@ -8710,7 +8791,27 @@  vect_transform_reduction (loop_vec_info loop_vinfo,
     {
       gimple *new_stmt;
       tree vop[3] = { vec_oprnds[0][i], vec_oprnds[1][i], NULL_TREE };
-      if (masked_loop_p && !mask_by_cond_expr)
+      if (!vop[0] || !vop[1])
+	{
+	  tree reduc_vop = vec_oprnds[reduc_index][i];
+
+	  /* If could not generate an effective vector statement for current
+	     portion of reduction operand, insert a trivial copy to simply
+	     handle over the operand to other dependent statements.  */
+	  gcc_assert (reduc_vop);
+
+	  if (slp_node && TREE_CODE (reduc_vop) == SSA_NAME
+	      && !SSA_NAME_IS_DEFAULT_DEF (reduc_vop))
+	    new_stmt = SSA_NAME_DEF_STMT (reduc_vop);
+	  else
+	    {
+	      new_temp = make_ssa_name (vec_dest);
+	      new_stmt = gimple_build_assign (new_temp, reduc_vop);
+	      vect_finish_stmt_generation (loop_vinfo, stmt_info, new_stmt,
+					   gsi);
+	    }
+	}
+      else if (masked_loop_p && !mask_by_cond_expr)
 	{
 	  /* No conditional ifns have been defined for lane-reducing op
 	     yet.  */
@@ -8810,21 +8911,19 @@  vect_transform_cycle_phi (loop_vec_info loop_vinfo,
     /* Leave the scalar phi in place.  */
     return true;
 
-  tree vectype_in = STMT_VINFO_REDUC_VECTYPE_IN (reduc_info);
-  /* For a nested cycle we do not fill the above.  */
-  if (!vectype_in)
-    vectype_in = STMT_VINFO_VECTYPE (stmt_info);
-  gcc_assert (vectype_in);
-
   if (slp_node)
     {
-      /* The size vect_schedule_slp_instance computes is off for us.  */
-      vec_num = vect_get_num_vectors (LOOP_VINFO_VECT_FACTOR (loop_vinfo)
-				      * SLP_TREE_LANES (slp_node), vectype_in);
+      vec_num = SLP_TREE_VEC_STMTS_NUM (slp_node);
       ncopies = 1;
     }
   else
     {
+      tree vectype_in = STMT_VINFO_REDUC_VECTYPE_IN (reduc_info);
+      /* For a nested cycle we do not fill the above.  */
+      if (!vectype_in)
+	vectype_in = STMT_VINFO_VECTYPE (stmt_info);
+      gcc_assert (vectype_in);
+
       vec_num = 1;
       ncopies = vect_get_num_copies (loop_vinfo, vectype_in);
     }
diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
index 00863398fd0..3c163c16fc8 100644
--- a/gcc/tree-vect-slp.cc
+++ b/gcc/tree-vect-slp.cc
@@ -114,6 +114,7 @@  _slp_tree::_slp_tree ()
   SLP_TREE_SCALAR_OPS (this) = vNULL;
   SLP_TREE_VEC_DEFS (this) = vNULL;
   SLP_TREE_VEC_STMTS_NUM (this) = 0;
+  SLP_TREE_VEC_STMTS_EFFEC_NUM (this) = 0;
   SLP_TREE_CHILDREN (this) = vNULL;
   SLP_TREE_LOAD_PERMUTATION (this) = vNULL;
   SLP_TREE_LANE_PERMUTATION (this) = vNULL;
@@ -6552,38 +6553,16 @@  vect_slp_analyze_node_operations_1 (vec_info *vinfo, slp_tree node,
 				    slp_instance node_instance,
 				    stmt_vector_for_cost *cost_vec)
 {
-  stmt_vec_info stmt_info = SLP_TREE_REPRESENTATIVE (node);
-
   /* Calculate the number of vector statements to be created for the
-     scalar stmts in this node.  For SLP reductions it is equal to the
-     number of vector statements in the children (which has already been
-     calculated by the recursive call).  Otherwise it is the number of
-     scalar elements in one scalar iteration (DR_GROUP_SIZE) multiplied by
-     VF divided by the number of elements in a vector.  */
-  if (SLP_TREE_CODE (node) != VEC_PERM_EXPR
-      && !STMT_VINFO_DATA_REF (stmt_info)
-      && REDUC_GROUP_FIRST_ELEMENT (stmt_info))
-    {
-      for (unsigned i = 0; i < SLP_TREE_CHILDREN (node).length (); ++i)
-	if (SLP_TREE_DEF_TYPE (SLP_TREE_CHILDREN (node)[i]) == vect_internal_def)
-	  {
-	    SLP_TREE_VEC_STMTS_NUM (node)
-	      = SLP_TREE_VEC_STMTS_NUM (SLP_TREE_CHILDREN (node)[i]);
-	    break;
-	  }
-    }
-  else
-    {
-      poly_uint64 vf;
-      if (loop_vec_info loop_vinfo = dyn_cast <loop_vec_info> (vinfo))
-	vf = loop_vinfo->vectorization_factor;
-      else
-	vf = 1;
-      unsigned int group_size = SLP_TREE_LANES (node);
-      tree vectype = SLP_TREE_VECTYPE (node);
-      SLP_TREE_VEC_STMTS_NUM (node)
-	= vect_get_num_vectors (vf * group_size, vectype);
-    }
+     scalar stmts in this node.  */
+  unsigned int num = vect_get_slp_num_vectors (vinfo, node);
+
+  SLP_TREE_VEC_STMTS_NUM (node) = num;
+
+  /* Initialize with the number of vector statements, and would be changed
+     to some smaller one for reduction PHI and lane-reducing statement
+     after real input vectype is determined by vectorizable analysis.  */
+  SLP_TREE_VEC_STMTS_EFFEC_NUM (node) = num;
 
   /* Handle purely internal nodes.  */
   if (SLP_TREE_CODE (node) == VEC_PERM_EXPR)
@@ -6605,7 +6584,9 @@  vect_slp_analyze_node_operations_1 (vec_info *vinfo, slp_tree node,
       return true;
     }
 
+  stmt_vec_info stmt_info = SLP_TREE_REPRESENTATIVE (node);
   bool dummy;
+
   return vect_analyze_stmt (vinfo, stmt_info, &dummy,
 			    node, node_instance, cost_vec);
 }
@@ -6851,12 +6832,13 @@  vect_slp_analyze_node_operations (vec_info *vinfo, slp_tree node,
 			  && j == 1);
 	      continue;
 	    }
-	  unsigned group_size = SLP_TREE_LANES (child);
-	  poly_uint64 vf = 1;
-	  if (loop_vec_info loop_vinfo = dyn_cast <loop_vec_info> (vinfo))
-	    vf = loop_vinfo->vectorization_factor;
-	  SLP_TREE_VEC_STMTS_NUM (child)
-	    = vect_get_num_vectors (vf * group_size, vector_type);
+
+	  /* Calculate number of vector elements for constant slp node.  */
+	  unsigned int num = vect_get_slp_num_vectors (vinfo, child);
+
+	  SLP_TREE_VEC_STMTS_NUM (child) = num;
+	  SLP_TREE_VEC_STMTS_EFFEC_NUM (child) = num;
+
 	  /* And cost them.  */
 	  vect_prologue_cost_for_slp (child, cost_vec);
 	}
diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index ef89ff1b3e4..f769c2b8c92 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -211,6 +211,15 @@  struct _slp_tree {
      divided by vector size.  */
   unsigned int vec_stmts_size;
 
+  /* Number of vector stmts that are really effective after transformation.
+     Lane-reducing operation requires less vector stmts than normal, so
+     trivial copies will be inserted to make vector def-use chain align
+     mutually.  Vector reduction PHIs are fully generated with VF as above,
+     some may be optimized away if it starts a copy-self def-use cycle, which
+     occurs in reduction involving only lane-reducing.  Note: The enforcement
+     of single-defuse-cycle would not impact this, is specially handled.  */
+  unsigned int vec_stmts_effec_size;
+
   /* Reference count in the SLP graph.  */
   unsigned int refcnt;
   /* The maximum number of vector elements for the subtree rooted
@@ -304,6 +313,7 @@  public:
 #define SLP_TREE_REF_COUNT(S)                    (S)->refcnt
 #define SLP_TREE_VEC_DEFS(S)                     (S)->vec_defs
 #define SLP_TREE_VEC_STMTS_NUM(S)                (S)->vec_stmts_size
+#define SLP_TREE_VEC_STMTS_EFFEC_NUM(S)          (S)->vec_stmts_effec_size
 #define SLP_TREE_LOAD_PERMUTATION(S)             (S)->load_permutation
 #define SLP_TREE_LANE_PERMUTATION(S)             (S)->lane_permutation
 #define SLP_TREE_SIMD_CLONE_INFO(S)              (S)->simd_clone_info
@@ -2080,6 +2090,41 @@  vect_get_num_vectors (poly_uint64 nunits, tree vectype)
   return exact_div (nunits, TYPE_VECTOR_SUBPARTS (vectype)).to_constant ();
 }
 
+/* Return the number of vectors in the context of vectorization region VINFO,
+   needed for a group of total SIZE statements that are supposed to be
+   interleaved together with no gap, and all operate on vectors of type
+   VECTYPE.  */
+
+inline unsigned int
+vect_get_num_vectors (vec_info *vinfo, tree vectype, unsigned int size)
+{
+  poly_uint64 vf;
+
+  if (loop_vec_info loop_vinfo = dyn_cast <loop_vec_info> (vinfo))
+    vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
+  else
+    vf = 1;
+
+  return vect_get_num_vectors (vf * size, vectype);
+}
+
+/* Return the number of vectors needed for SLP_NODE regarding to VINFO.  If
+   not NULL, VECTYPE gives a new vectype to operate on, rather than
+   SLP_TREE_VECTYPE of the node.  */
+
+inline unsigned int
+vect_get_slp_num_vectors (vec_info *vinfo, slp_tree slp_node,
+			  tree vectype = NULL_TREE)
+{
+  if (!vectype)
+    {
+      vectype = SLP_TREE_VECTYPE (slp_node);
+      gcc_checking_assert (vectype);
+    }
+
+  return vect_get_num_vectors (vinfo, vectype, SLP_TREE_LANES (slp_node));
+}
+
 /* Return the number of copies needed for loop vectorization when
    a statement operates on vectors of type VECTYPE.  This is the
    vectorization factor divided by the number of elements in
-- 
2.17.1