Message ID | LV2PR01MB7839EFE65507BCE03159DDAAF7A72@LV2PR01MB7839.prod.exchangelabs.com |
---|---|
State | New |
Headers | show |
Series | [1/4] vect: Add a unified vect_get_num_copies for slp and non-slp | expand |
On Sat, Jul 13, 2024 at 5:47 PM Feng Xue OS <fxue@os.amperecomputing.com> wrote: > > Vector stmts number of an operation is calculated based on output vectype. > This is over-estimated for lane-reducing operation, which would cause vector > def/use mismatched when we want to support loop reduction mixed with lane- > reducing and normal operations. One solution is to refit lane-reducing > to make it behave like a normal one, by adding new pass-through copies to > fix possible def/use gap. And resultant superfluous statements could be > optimized away after vectorization. For example: > > int sum = 1; > for (i) > { > sum += d0[i] * d1[i]; // dot-prod <vector(16) char> > } > > The vector size is 128-bit,vectorization factor is 16. Reduction > statements would be transformed as: > > vector<4> int sum_v0 = { 0, 0, 0, 1 }; > vector<4> int sum_v1 = { 0, 0, 0, 0 }; > vector<4> int sum_v2 = { 0, 0, 0, 0 }; > vector<4> int sum_v3 = { 0, 0, 0, 0 }; > > for (i / 16) > { > sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0); > sum_v1 = sum_v1; // copy > sum_v2 = sum_v2; // copy > sum_v3 = sum_v3; // copy > } > > sum_v = sum_v0 + sum_v1 + sum_v2 + sum_v3; // = sum_v0 OK. Thanks, Richard. > Thanks, > Feng > --- > gcc/ > * tree-vect-loop.cc (vect_reduction_update_partial_vector_usage): > Calculate effective vector stmts number with generic > vect_get_num_copies. > (vect_transform_reduction): Insert copies for lane-reducing so as to > fix over-estimated vector stmts number. > (vect_transform_cycle_phi): Calculate vector PHI number only based on > output vectype. > * tree-vect-slp.cc (vect_slp_analyze_node_operations_1): Remove > adjustment on vector stmts number specific to slp reduction. > --- > gcc/tree-vect-loop.cc | 134 +++++++++++++++++++++++++++++++++++------- > gcc/tree-vect-slp.cc | 27 +++------ > 2 files changed, 121 insertions(+), 40 deletions(-) > > diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc > index a64b5082bd1..5ac83e76975 100644 > --- a/gcc/tree-vect-loop.cc > +++ b/gcc/tree-vect-loop.cc > @@ -7468,12 +7468,8 @@ vect_reduction_update_partial_vector_usage (loop_vec_info loop_vinfo, > = get_masked_reduction_fn (reduc_fn, vectype_in); > vec_loop_masks *masks = &LOOP_VINFO_MASKS (loop_vinfo); > vec_loop_lens *lens = &LOOP_VINFO_LENS (loop_vinfo); > - unsigned nvectors; > - > - if (slp_node) > - nvectors = SLP_TREE_NUMBER_OF_VEC_STMTS (slp_node); > - else > - nvectors = vect_get_num_copies (loop_vinfo, vectype_in); > + unsigned nvectors = vect_get_num_copies (loop_vinfo, slp_node, > + vectype_in); > > if (mask_reduc_fn == IFN_MASK_LEN_FOLD_LEFT_PLUS) > vect_record_loop_len (loop_vinfo, lens, nvectors, vectype_in, 1); > @@ -8595,12 +8591,15 @@ vect_transform_reduction (loop_vec_info loop_vinfo, > stmt_vec_info phi_info = STMT_VINFO_REDUC_DEF (vect_orig_stmt (stmt_info)); > gphi *reduc_def_phi = as_a <gphi *> (phi_info->stmt); > int reduc_index = STMT_VINFO_REDUC_IDX (stmt_info); > - tree vectype_in = STMT_VINFO_REDUC_VECTYPE_IN (reduc_info); > + tree vectype_in = STMT_VINFO_REDUC_VECTYPE_IN (stmt_info); > + > + if (!vectype_in) > + vectype_in = STMT_VINFO_VECTYPE (stmt_info); > > if (slp_node) > { > ncopies = 1; > - vec_num = SLP_TREE_NUMBER_OF_VEC_STMTS (slp_node); > + vec_num = vect_get_num_copies (loop_vinfo, slp_node, vectype_in); > } > else > { > @@ -8658,13 +8657,40 @@ vect_transform_reduction (loop_vec_info loop_vinfo, > bool lane_reducing = lane_reducing_op_p (code); > gcc_assert (single_defuse_cycle || lane_reducing); > > + if (lane_reducing) > + { > + /* The last operand of lane-reducing op is for reduction. */ > + gcc_assert (reduc_index == (int) op.num_ops - 1); > + } > + > /* Create the destination vector */ > tree scalar_dest = gimple_get_lhs (stmt_info->stmt); > tree vec_dest = vect_create_destination_var (scalar_dest, vectype_out); > > + if (lane_reducing && !slp_node && !single_defuse_cycle) > + { > + /* Note: there are still vectorizable cases that can not be handled by > + single-lane slp. Probably it would take some time to evolve the > + feature to a mature state. So we have to keep the below non-slp code > + path as failsafe for lane-reducing support. */ > + gcc_assert (op.num_ops <= 3); > + for (unsigned i = 0; i < op.num_ops; i++) > + { > + unsigned oprnd_ncopies = ncopies; > + > + if ((int) i == reduc_index) > + { > + tree vectype = STMT_VINFO_VECTYPE (stmt_info); > + oprnd_ncopies = vect_get_num_copies (loop_vinfo, vectype); > + } > + > + vect_get_vec_defs_for_operand (loop_vinfo, stmt_info, oprnd_ncopies, > + op.ops[i], &vec_oprnds[i]); > + } > + } > /* Get NCOPIES vector definitions for all operands except the reduction > definition. */ > - if (!cond_fn_p) > + else if (!cond_fn_p) > { > gcc_assert (reduc_index >= 0 && reduc_index <= 2); > vect_get_vec_defs (loop_vinfo, stmt_info, slp_node, ncopies, > @@ -8702,6 +8728,61 @@ vect_transform_reduction (loop_vec_info loop_vinfo, > reduc_index == 2 ? op.ops[2] : NULL_TREE, > &vec_oprnds[2]); > } > + else if (lane_reducing) > + { > + /* For normal reduction, consistency between vectorized def/use is > + naturally ensured when mapping from scalar statement. But if lane- > + reducing op is involved in reduction, thing would become somewhat > + complicated in that the op's result and operand for accumulation are > + limited to less lanes than other operands, which certainly causes > + def/use mismatch on adjacent statements around the op if do not have > + any kind of specific adjustment. One approach is to refit lane- > + reducing op in the way of introducing new trivial pass-through copies > + to fix possible def/use gap, so as to make it behave like a normal op. > + And vector reduction PHIs are always generated to the full extent, no > + matter lane-reducing op exists or not. If some copies or PHIs are > + actually superfluous, they would be cleaned up by passes after > + vectorization. An example for single-lane slp is given as below. > + Similarly, this handling is applicable for multiple-lane slp as well. > + > + int sum = 1; > + for (i) > + { > + sum += d0[i] * d1[i]; // dot-prod <vector(16) char> > + } > + > + The vector size is 128-bit,vectorization factor is 16. Reduction > + statements would be transformed as: > + > + vector<4> int sum_v0 = { 0, 0, 0, 1 }; > + vector<4> int sum_v1 = { 0, 0, 0, 0 }; > + vector<4> int sum_v2 = { 0, 0, 0, 0 }; > + vector<4> int sum_v3 = { 0, 0, 0, 0 }; > + > + for (i / 16) > + { > + sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0); > + sum_v1 = sum_v1; // copy > + sum_v2 = sum_v2; // copy > + sum_v3 = sum_v3; // copy > + } > + > + sum_v = sum_v0 + sum_v1 + sum_v2 + sum_v3; // = sum_v0 > + */ > + unsigned effec_ncopies = vec_oprnds[0].length (); > + unsigned total_ncopies = vec_oprnds[reduc_index].length (); > + > + gcc_assert (effec_ncopies <= total_ncopies); > + > + if (effec_ncopies < total_ncopies) > + { > + for (unsigned i = 0; i < op.num_ops - 1; i++) > + { > + gcc_assert (vec_oprnds[i].length () == effec_ncopies); > + vec_oprnds[i].safe_grow_cleared (total_ncopies); > + } > + } > + } > > bool emulated_mixed_dot_prod = vect_is_emulated_mixed_dot_prod (stmt_info); > unsigned num = vec_oprnds[reduc_index == 0 ? 1 : 0].length (); > @@ -8710,7 +8791,27 @@ vect_transform_reduction (loop_vec_info loop_vinfo, > { > gimple *new_stmt; > tree vop[3] = { vec_oprnds[0][i], vec_oprnds[1][i], NULL_TREE }; > - if (masked_loop_p && !mask_by_cond_expr) > + if (!vop[0] || !vop[1]) > + { > + tree reduc_vop = vec_oprnds[reduc_index][i]; > + > + /* If could not generate an effective vector statement for current > + portion of reduction operand, insert a trivial copy to simply > + handle over the operand to other dependent statements. */ > + gcc_assert (reduc_vop); > + > + if (slp_node && TREE_CODE (reduc_vop) == SSA_NAME > + && !SSA_NAME_IS_DEFAULT_DEF (reduc_vop)) > + new_stmt = SSA_NAME_DEF_STMT (reduc_vop); > + else > + { > + new_temp = make_ssa_name (vec_dest); > + new_stmt = gimple_build_assign (new_temp, reduc_vop); > + vect_finish_stmt_generation (loop_vinfo, stmt_info, new_stmt, > + gsi); > + } > + } > + else if (masked_loop_p && !mask_by_cond_expr) > { > /* No conditional ifns have been defined for lane-reducing op > yet. */ > @@ -8810,23 +8911,16 @@ vect_transform_cycle_phi (loop_vec_info loop_vinfo, > /* Leave the scalar phi in place. */ > return true; > > - tree vectype_in = STMT_VINFO_REDUC_VECTYPE_IN (reduc_info); > - /* For a nested cycle we do not fill the above. */ > - if (!vectype_in) > - vectype_in = STMT_VINFO_VECTYPE (stmt_info); > - gcc_assert (vectype_in); > - > if (slp_node) > { > - /* The size vect_schedule_slp_instance computes is off for us. */ > - vec_num = vect_get_num_vectors (LOOP_VINFO_VECT_FACTOR (loop_vinfo) > - * SLP_TREE_LANES (slp_node), vectype_in); > + vec_num = SLP_TREE_NUMBER_OF_VEC_STMTS (slp_node); > ncopies = 1; > } > else > { > vec_num = 1; > - ncopies = vect_get_num_copies (loop_vinfo, vectype_in); > + ncopies = vect_get_num_copies (loop_vinfo, > + STMT_VINFO_VECTYPE (stmt_info)); > } > > /* Check whether we should use a single PHI node and accumulate > diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc > index 4dadbc6854d..55ae496cbb2 100644 > --- a/gcc/tree-vect-slp.cc > +++ b/gcc/tree-vect-slp.cc > @@ -6554,26 +6554,13 @@ vect_slp_analyze_node_operations_1 (vec_info *vinfo, slp_tree node, > { > stmt_vec_info stmt_info = SLP_TREE_REPRESENTATIVE (node); > > - /* Calculate the number of vector statements to be created for the > - scalar stmts in this node. For SLP reductions it is equal to the > - number of vector statements in the children (which has already been > - calculated by the recursive call). Otherwise it is the number of > - scalar elements in one scalar iteration (DR_GROUP_SIZE) multiplied by > - VF divided by the number of elements in a vector. */ > - if (SLP_TREE_CODE (node) != VEC_PERM_EXPR > - && !STMT_VINFO_DATA_REF (stmt_info) > - && REDUC_GROUP_FIRST_ELEMENT (stmt_info)) > - { > - for (unsigned i = 0; i < SLP_TREE_CHILDREN (node).length (); ++i) > - if (SLP_TREE_DEF_TYPE (SLP_TREE_CHILDREN (node)[i]) == vect_internal_def) > - { > - SLP_TREE_NUMBER_OF_VEC_STMTS (node) > - = SLP_TREE_NUMBER_OF_VEC_STMTS (SLP_TREE_CHILDREN (node)[i]); > - break; > - } > - } > - else > - SLP_TREE_NUMBER_OF_VEC_STMTS (node) = vect_get_num_copies (vinfo, node); > + /* Calculate the number of vector statements to be created for the scalar > + stmts in this node. It is the number of scalar elements in one scalar > + iteration (DR_GROUP_SIZE) multiplied by VF divided by the number of > + elements in a vector. For single-defuse-cycle, lane-reducing op, and > + PHI statement that starts reduction comprised of only lane-reducing ops, > + the number is more than effective vector statements actually required. */ > + SLP_TREE_NUMBER_OF_VEC_STMTS (node) = vect_get_num_copies (vinfo, node); > > /* Handle purely internal nodes. */ > if (SLP_TREE_CODE (node) == VEC_PERM_EXPR) > -- > 2.17.1
From 2b9b22f7f1a19816a17086c79e7ec5f7d0298af6 Mon Sep 17 00:00:00 2001 From: Feng Xue <fxue@os.amperecomputing.com> Date: Tue, 2 Jul 2024 17:12:00 +0800 Subject: [PATCH 2/4] vect: Refit lane-reducing to be normal operation MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Vector stmts number of an operation is calculated based on output vectype. This is over-estimated for lane-reducing operation, which would cause vector def/use mismatched when we want to support loop reduction mixed with lane- reducing and normal operations. One solution is to refit lane-reducing to make it behave like a normal one, by adding new pass-through copies to fix possible def/use gap. And resultant superfluous statements could be optimized away after vectorization. For example: int sum = 1; for (i) { sum += d0[i] * d1[i]; // dot-prod <vector(16) char> } The vector size is 128-bit,vectorization factor is 16. Reduction statements would be transformed as: vector<4> int sum_v0 = { 0, 0, 0, 1 }; vector<4> int sum_v1 = { 0, 0, 0, 0 }; vector<4> int sum_v2 = { 0, 0, 0, 0 }; vector<4> int sum_v3 = { 0, 0, 0, 0 }; for (i / 16) { sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0); sum_v1 = sum_v1; // copy sum_v2 = sum_v2; // copy sum_v3 = sum_v3; // copy } sum_v = sum_v0 + sum_v1 + sum_v2 + sum_v3; // = sum_v0 2024-07-02 Feng Xue <fxue@os.amperecomputing.com> gcc/ * tree-vect-loop.cc (vect_reduction_update_partial_vector_usage): Calculate effective vector stmts number with generic vect_get_num_copies. (vect_transform_reduction): Insert copies for lane-reducing so as to fix over-estimated vector stmts number. (vect_transform_cycle_phi): Calculate vector PHI number only based on output vectype. * tree-vect-slp.cc (vect_slp_analyze_node_operations_1): Remove adjustment on vector stmts number specific to slp reduction. --- gcc/tree-vect-loop.cc | 134 +++++++++++++++++++++++++++++++++++------- gcc/tree-vect-slp.cc | 27 +++------ 2 files changed, 121 insertions(+), 40 deletions(-) diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc index a64b5082bd1..5ac83e76975 100644 --- a/gcc/tree-vect-loop.cc +++ b/gcc/tree-vect-loop.cc @@ -7468,12 +7468,8 @@ vect_reduction_update_partial_vector_usage (loop_vec_info loop_vinfo, = get_masked_reduction_fn (reduc_fn, vectype_in); vec_loop_masks *masks = &LOOP_VINFO_MASKS (loop_vinfo); vec_loop_lens *lens = &LOOP_VINFO_LENS (loop_vinfo); - unsigned nvectors; - - if (slp_node) - nvectors = SLP_TREE_NUMBER_OF_VEC_STMTS (slp_node); - else - nvectors = vect_get_num_copies (loop_vinfo, vectype_in); + unsigned nvectors = vect_get_num_copies (loop_vinfo, slp_node, + vectype_in); if (mask_reduc_fn == IFN_MASK_LEN_FOLD_LEFT_PLUS) vect_record_loop_len (loop_vinfo, lens, nvectors, vectype_in, 1); @@ -8595,12 +8591,15 @@ vect_transform_reduction (loop_vec_info loop_vinfo, stmt_vec_info phi_info = STMT_VINFO_REDUC_DEF (vect_orig_stmt (stmt_info)); gphi *reduc_def_phi = as_a <gphi *> (phi_info->stmt); int reduc_index = STMT_VINFO_REDUC_IDX (stmt_info); - tree vectype_in = STMT_VINFO_REDUC_VECTYPE_IN (reduc_info); + tree vectype_in = STMT_VINFO_REDUC_VECTYPE_IN (stmt_info); + + if (!vectype_in) + vectype_in = STMT_VINFO_VECTYPE (stmt_info); if (slp_node) { ncopies = 1; - vec_num = SLP_TREE_NUMBER_OF_VEC_STMTS (slp_node); + vec_num = vect_get_num_copies (loop_vinfo, slp_node, vectype_in); } else { @@ -8658,13 +8657,40 @@ vect_transform_reduction (loop_vec_info loop_vinfo, bool lane_reducing = lane_reducing_op_p (code); gcc_assert (single_defuse_cycle || lane_reducing); + if (lane_reducing) + { + /* The last operand of lane-reducing op is for reduction. */ + gcc_assert (reduc_index == (int) op.num_ops - 1); + } + /* Create the destination vector */ tree scalar_dest = gimple_get_lhs (stmt_info->stmt); tree vec_dest = vect_create_destination_var (scalar_dest, vectype_out); + if (lane_reducing && !slp_node && !single_defuse_cycle) + { + /* Note: there are still vectorizable cases that can not be handled by + single-lane slp. Probably it would take some time to evolve the + feature to a mature state. So we have to keep the below non-slp code + path as failsafe for lane-reducing support. */ + gcc_assert (op.num_ops <= 3); + for (unsigned i = 0; i < op.num_ops; i++) + { + unsigned oprnd_ncopies = ncopies; + + if ((int) i == reduc_index) + { + tree vectype = STMT_VINFO_VECTYPE (stmt_info); + oprnd_ncopies = vect_get_num_copies (loop_vinfo, vectype); + } + + vect_get_vec_defs_for_operand (loop_vinfo, stmt_info, oprnd_ncopies, + op.ops[i], &vec_oprnds[i]); + } + } /* Get NCOPIES vector definitions for all operands except the reduction definition. */ - if (!cond_fn_p) + else if (!cond_fn_p) { gcc_assert (reduc_index >= 0 && reduc_index <= 2); vect_get_vec_defs (loop_vinfo, stmt_info, slp_node, ncopies, @@ -8702,6 +8728,61 @@ vect_transform_reduction (loop_vec_info loop_vinfo, reduc_index == 2 ? op.ops[2] : NULL_TREE, &vec_oprnds[2]); } + else if (lane_reducing) + { + /* For normal reduction, consistency between vectorized def/use is + naturally ensured when mapping from scalar statement. But if lane- + reducing op is involved in reduction, thing would become somewhat + complicated in that the op's result and operand for accumulation are + limited to less lanes than other operands, which certainly causes + def/use mismatch on adjacent statements around the op if do not have + any kind of specific adjustment. One approach is to refit lane- + reducing op in the way of introducing new trivial pass-through copies + to fix possible def/use gap, so as to make it behave like a normal op. + And vector reduction PHIs are always generated to the full extent, no + matter lane-reducing op exists or not. If some copies or PHIs are + actually superfluous, they would be cleaned up by passes after + vectorization. An example for single-lane slp is given as below. + Similarly, this handling is applicable for multiple-lane slp as well. + + int sum = 1; + for (i) + { + sum += d0[i] * d1[i]; // dot-prod <vector(16) char> + } + + The vector size is 128-bit,vectorization factor is 16. Reduction + statements would be transformed as: + + vector<4> int sum_v0 = { 0, 0, 0, 1 }; + vector<4> int sum_v1 = { 0, 0, 0, 0 }; + vector<4> int sum_v2 = { 0, 0, 0, 0 }; + vector<4> int sum_v3 = { 0, 0, 0, 0 }; + + for (i / 16) + { + sum_v0 = DOT_PROD (d0_v0[i: 0 ~ 15], d1_v0[i: 0 ~ 15], sum_v0); + sum_v1 = sum_v1; // copy + sum_v2 = sum_v2; // copy + sum_v3 = sum_v3; // copy + } + + sum_v = sum_v0 + sum_v1 + sum_v2 + sum_v3; // = sum_v0 + */ + unsigned effec_ncopies = vec_oprnds[0].length (); + unsigned total_ncopies = vec_oprnds[reduc_index].length (); + + gcc_assert (effec_ncopies <= total_ncopies); + + if (effec_ncopies < total_ncopies) + { + for (unsigned i = 0; i < op.num_ops - 1; i++) + { + gcc_assert (vec_oprnds[i].length () == effec_ncopies); + vec_oprnds[i].safe_grow_cleared (total_ncopies); + } + } + } bool emulated_mixed_dot_prod = vect_is_emulated_mixed_dot_prod (stmt_info); unsigned num = vec_oprnds[reduc_index == 0 ? 1 : 0].length (); @@ -8710,7 +8791,27 @@ vect_transform_reduction (loop_vec_info loop_vinfo, { gimple *new_stmt; tree vop[3] = { vec_oprnds[0][i], vec_oprnds[1][i], NULL_TREE }; - if (masked_loop_p && !mask_by_cond_expr) + if (!vop[0] || !vop[1]) + { + tree reduc_vop = vec_oprnds[reduc_index][i]; + + /* If could not generate an effective vector statement for current + portion of reduction operand, insert a trivial copy to simply + handle over the operand to other dependent statements. */ + gcc_assert (reduc_vop); + + if (slp_node && TREE_CODE (reduc_vop) == SSA_NAME + && !SSA_NAME_IS_DEFAULT_DEF (reduc_vop)) + new_stmt = SSA_NAME_DEF_STMT (reduc_vop); + else + { + new_temp = make_ssa_name (vec_dest); + new_stmt = gimple_build_assign (new_temp, reduc_vop); + vect_finish_stmt_generation (loop_vinfo, stmt_info, new_stmt, + gsi); + } + } + else if (masked_loop_p && !mask_by_cond_expr) { /* No conditional ifns have been defined for lane-reducing op yet. */ @@ -8810,23 +8911,16 @@ vect_transform_cycle_phi (loop_vec_info loop_vinfo, /* Leave the scalar phi in place. */ return true; - tree vectype_in = STMT_VINFO_REDUC_VECTYPE_IN (reduc_info); - /* For a nested cycle we do not fill the above. */ - if (!vectype_in) - vectype_in = STMT_VINFO_VECTYPE (stmt_info); - gcc_assert (vectype_in); - if (slp_node) { - /* The size vect_schedule_slp_instance computes is off for us. */ - vec_num = vect_get_num_vectors (LOOP_VINFO_VECT_FACTOR (loop_vinfo) - * SLP_TREE_LANES (slp_node), vectype_in); + vec_num = SLP_TREE_NUMBER_OF_VEC_STMTS (slp_node); ncopies = 1; } else { vec_num = 1; - ncopies = vect_get_num_copies (loop_vinfo, vectype_in); + ncopies = vect_get_num_copies (loop_vinfo, + STMT_VINFO_VECTYPE (stmt_info)); } /* Check whether we should use a single PHI node and accumulate diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc index 4dadbc6854d..55ae496cbb2 100644 --- a/gcc/tree-vect-slp.cc +++ b/gcc/tree-vect-slp.cc @@ -6554,26 +6554,13 @@ vect_slp_analyze_node_operations_1 (vec_info *vinfo, slp_tree node, { stmt_vec_info stmt_info = SLP_TREE_REPRESENTATIVE (node); - /* Calculate the number of vector statements to be created for the - scalar stmts in this node. For SLP reductions it is equal to the - number of vector statements in the children (which has already been - calculated by the recursive call). Otherwise it is the number of - scalar elements in one scalar iteration (DR_GROUP_SIZE) multiplied by - VF divided by the number of elements in a vector. */ - if (SLP_TREE_CODE (node) != VEC_PERM_EXPR - && !STMT_VINFO_DATA_REF (stmt_info) - && REDUC_GROUP_FIRST_ELEMENT (stmt_info)) - { - for (unsigned i = 0; i < SLP_TREE_CHILDREN (node).length (); ++i) - if (SLP_TREE_DEF_TYPE (SLP_TREE_CHILDREN (node)[i]) == vect_internal_def) - { - SLP_TREE_NUMBER_OF_VEC_STMTS (node) - = SLP_TREE_NUMBER_OF_VEC_STMTS (SLP_TREE_CHILDREN (node)[i]); - break; - } - } - else - SLP_TREE_NUMBER_OF_VEC_STMTS (node) = vect_get_num_copies (vinfo, node); + /* Calculate the number of vector statements to be created for the scalar + stmts in this node. It is the number of scalar elements in one scalar + iteration (DR_GROUP_SIZE) multiplied by VF divided by the number of + elements in a vector. For single-defuse-cycle, lane-reducing op, and + PHI statement that starts reduction comprised of only lane-reducing ops, + the number is more than effective vector statements actually required. */ + SLP_TREE_NUMBER_OF_VEC_STMTS (node) = vect_get_num_copies (vinfo, node); /* Handle purely internal nodes. */ if (SLP_TREE_CODE (node) == VEC_PERM_EXPR) -- 2.17.1