Message ID: 4b7f2daf-467e-d940-b79c-31c1c30a1dd4@linux.ibm.com
State: New
Series: Support vector load/store with length
"Kewen.Lin" <linkw@linux.ibm.com> writes: > @@ -626,6 +645,12 @@ public: > /* True if have decided to use a fully-masked loop. */ > bool fully_masked_p; > > + /* Records whether we still have the option of using a length access loop. */ > + bool can_with_length_p; > + > + /* True if have decided to use length access for the loop fully. */ > + bool fully_with_length_p; Rather than duplicate the flags like this, I think we should have three bits of information: (1) Can the loop operate on partial vectors? Starts off optimistically assuming "yes", gets set to "no" when we find a counter-example. (2) If we do decide to use partial vectors, will we need loop masks? (3) If we do decide to use partial vectors, will we need lengths? Vectorisation using partial vectors succeeds if (1) && ((2) != (3)) LOOP_VINFO_CAN_FULLY_MASK_P currently tracks (1) and LOOP_VINFO_MASKS currently tracks (2). In pathological cases it's already possible to have (1) && !(2), see r9-6240 for an example. With the new support, LOOP_VINFO_LENS tracks (3). So I don't think we need the can_with_length_p. What is now LOOP_VINFO_CAN_FULLY_MASK_P can continue to track (1) for both approaches, with the final choice of approach only being made at the end. Maybe it would be worth renaming it to something more generic though, now that we have two approaches to partial vectorisation. I think we can assume for now that no arch will be asymmetrical, and require (say) loop masks for loads and lengths for stores. So if that does happen (i.e. if (2) && (3) ends up being true) we should just be able to punt on partial vectorisation. Some of the new length code looks like it's copied and adjusted from the corresponding mask code. It would be good to share the code instead where possible, e.g. when deciding whether an IV can overflow. Thanks, Richard
Richard Sandiford <richard.sandiford@arm.com> writes:
> "Kewen.Lin" <linkw@linux.ibm.com> writes:
>> @@ -626,6 +645,12 @@ public:
>>    /* True if have decided to use a fully-masked loop.  */
>>    bool fully_masked_p;
>>
>> +  /* Records whether we still have the option of using a length access loop.  */
>> +  bool can_with_length_p;
>> +
>> +  /* True if have decided to use length access for the loop fully.  */
>> +  bool fully_with_length_p;
>
> Rather than duplicate the flags like this, I think we should have
> three bits of information:
>
> (1) Can the loop operate on partial vectors?  Starts off optimistically
>     assuming "yes", gets set to "no" when we find a counter-example.
>
> (2) If we do decide to use partial vectors, will we need loop masks?
>
> (3) If we do decide to use partial vectors, will we need lengths?
>
> Vectorisation using partial vectors succeeds if (1) && ((2) != (3))
>
> LOOP_VINFO_CAN_FULLY_MASK_P currently tracks (1) and
> LOOP_VINFO_MASKS currently tracks (2).  In pathological cases it's
> already possible to have (1) && !(2), see r9-6240 for an example.

Oops, I meant r8-6240.
Hi Richard,

Thanks for your comments!

on 2020/5/26 8:49 PM, Richard Sandiford wrote:
> "Kewen.Lin" <linkw@linux.ibm.com> writes:
>> @@ -626,6 +645,12 @@ public:
>>    /* True if have decided to use a fully-masked loop.  */
>>    bool fully_masked_p;
>>
>> +  /* Records whether we still have the option of using a length access loop.  */
>> +  bool can_with_length_p;
>> +
>> +  /* True if have decided to use length access for the loop fully.  */
>> +  bool fully_with_length_p;
>
> Rather than duplicate the flags like this, I think we should have
> three bits of information:
>
> (1) Can the loop operate on partial vectors?  Starts off optimistically
>     assuming "yes", gets set to "no" when we find a counter-example.
>
> (2) If we do decide to use partial vectors, will we need loop masks?
>
> (3) If we do decide to use partial vectors, will we need lengths?
>
> Vectorisation using partial vectors succeeds if (1) && ((2) != (3))
>
> LOOP_VINFO_CAN_FULLY_MASK_P currently tracks (1) and
> LOOP_VINFO_MASKS currently tracks (2).  In pathological cases it's
> already possible to have (1) && !(2), see r9-6240 for an example.
>
> With the new support, LOOP_VINFO_LENS tracks (3).
>
> So I don't think we need the can_with_length_p.  What is now
> LOOP_VINFO_CAN_FULLY_MASK_P can continue to track (1) for both
> approaches, with the final choice of approach only being made
> at the end.  Maybe it would be worth renaming it to something
> more generic though, now that we have two approaches to partial
> vectorisation.

I like this idea!  I could be wrong, but I'm afraid we cannot share one
common flag between both approaches: the acceptance criteria can differ,
and a counter-example for length could still be acceptable for masking.
For instance, length only allows CONTIGUOUS-related modes, while masking
can support more.  When we see an acceptable VMAT_LOAD_STORE_LANES, we
leave LOOP_VINFO_CAN_FULLY_MASK_P true; should the later length checking
turn it to false?  I guess not.  If it stays true, then
LOOP_VINFO_CAN_FULLY_MASK_P will mean partial vectorization for masking
only, not for both.  We can probably clear LOOP_VINFO_LENS when the
length checking fails, but then we only know the vec is empty, not that
partial vectorization with length is impossible; while
LOOP_VINFO_CAN_FULLY_MASK_P is true, we could still record lengths into
it if possible.

> I think we can assume for now that no arch will be asymmetrical,
> and require (say) loop masks for loads and lengths for stores.
> So if that does happen (i.e. if (2) && (3) ends up being true)
> we should just be able to punt on partial vectorisation.

Agreed.  The current implementation takes masking as the preference: if
the loop is fully masked, we disable vector with length.

> Some of the new length code looks like it's copied and adjusted from the
> corresponding mask code.  It would be good to share the code instead
> where possible, e.g. when deciding whether an IV can overflow.

Yes, some refactoring can be done; it's on my to-do list and I'll give
it priority per your comments.

V2 attached with some changes against V1:
1) use rgroup_objs for both mask and length
2) merge both mask and length handlings into vect_set_loop_condition_partial which is renamed and extended from vect_set_loop_condition_masked.
3) renamed and updated vect_set_loop_masks_directly to vect_set_loop_objs_directly.
4) renamed vect_set_loop_condition_unmasked to vect_set_loop_condition_normal
5) factored out min_prec_for_max_niters.
6) added macro LOOP_VINFO_PARTIAL_VECT_P since a few places need to check (LOOP_VINFO_FULLY_MASKED_P || LOOP_VINFO_FULLY_WITH_LENGTH_P) Tested with ppc64le test cases, will update with changelog if everything goes well. BR, Kewen --- gcc/doc/invoke.texi | 7 + gcc/params.opt | 4 + gcc/tree-vect-loop-manip.c | 266 ++++++++++++++++++------------- gcc/tree-vect-loop.c | 311 ++++++++++++++++++++++++++++++++----- gcc/tree-vect-stmts.c | 152 ++++++++++++++++++ gcc/tree-vectorizer.h | 57 +++++-- 6 files changed, 639 insertions(+), 158 deletions(-) diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi index 8b9935dfe65..ac765feab13 100644 --- a/gcc/doc/invoke.texi +++ b/gcc/doc/invoke.texi @@ -13079,6 +13079,13 @@ by the copy loop headers pass. @item vect-epilogues-nomask Enable loop epilogue vectorization using smaller vector size. +@item vect-with-length-scope +Control the scope of vector memory access with length exploitation. 0 means we +don't expliot any vector memory access with length, 1 means we only exploit +vector memory access with length for those loops whose iteration number are +less than VF, such as very small loop or epilogue, 2 means we want to exploit +vector memory access with length for any loops if possible. + @item slp-max-insns-in-bb Maximum number of instructions in basic block to be considered for SLP vectorization. diff --git a/gcc/params.opt b/gcc/params.opt index 4aec480798b..d4309101067 100644 --- a/gcc/params.opt +++ b/gcc/params.opt @@ -964,4 +964,8 @@ Bound on number of runtime checks inserted by the vectorizer's loop versioning f Common Joined UInteger Var(param_vect_max_version_for_alignment_checks) Init(6) Param Optimization Bound on number of runtime checks inserted by the vectorizer's loop versioning for alignment check. +-param=vect-with-length-scope= +Common Joined UInteger Var(param_vect_with_length_scope) Init(0) IntegerRange(0, 2) Param Optimization +Control the vector with length exploitation scope. + ; This comment is to ensure we retain the blank line above. diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c index 8c5e696b995..0a5770c7d28 100644 --- a/gcc/tree-vect-loop-manip.c +++ b/gcc/tree-vect-loop-manip.c @@ -256,17 +256,17 @@ adjust_phi_and_debug_stmts (gimple *update_phi, edge e, tree new_def) gimple_bb (update_phi)); } -/* Define one loop mask MASK from loop LOOP. INIT_MASK is the value that - the mask should have during the first iteration and NEXT_MASK is the +/* Define one loop mask/length OBJ from loop LOOP. INIT_OBJ is the value that + the mask/length should have during the first iteration and NEXT_OBJ is the value that it should have on subsequent iterations. */ static void -vect_set_loop_mask (class loop *loop, tree mask, tree init_mask, - tree next_mask) +vect_set_loop_mask_or_len (class loop *loop, tree obj, tree init_obj, + tree next_obj) { - gphi *phi = create_phi_node (mask, loop->header); - add_phi_arg (phi, init_mask, loop_preheader_edge (loop), UNKNOWN_LOCATION); - add_phi_arg (phi, next_mask, loop_latch_edge (loop), UNKNOWN_LOCATION); + gphi *phi = create_phi_node (obj, loop->header); + add_phi_arg (phi, init_obj, loop_preheader_edge (loop), UNKNOWN_LOCATION); + add_phi_arg (phi, next_obj, loop_latch_edge (loop), UNKNOWN_LOCATION); } /* Add SEQ to the end of LOOP's preheader block. */ @@ -320,8 +320,8 @@ interleave_supported_p (vec_perm_indices *indices, tree vectype, latter. Return true on success, adding any new statements to SEQ. 
*/ static bool -vect_maybe_permute_loop_masks (gimple_seq *seq, rgroup_masks *dest_rgm, - rgroup_masks *src_rgm) +vect_maybe_permute_loop_masks (gimple_seq *seq, rgroup_objs *dest_rgm, + rgroup_objs *src_rgm) { tree src_masktype = src_rgm->mask_type; tree dest_masktype = dest_rgm->mask_type; @@ -338,10 +338,10 @@ vect_maybe_permute_loop_masks (gimple_seq *seq, rgroup_masks *dest_rgm, machine_mode dest_mode = insn_data[icode1].operand[0].mode; gcc_assert (dest_mode == insn_data[icode2].operand[0].mode); tree unpack_masktype = vect_halve_mask_nunits (src_masktype, dest_mode); - for (unsigned int i = 0; i < dest_rgm->masks.length (); ++i) + for (unsigned int i = 0; i < dest_rgm->objs.length (); ++i) { - tree src = src_rgm->masks[i / 2]; - tree dest = dest_rgm->masks[i]; + tree src = src_rgm->objs[i / 2]; + tree dest = dest_rgm->objs[i]; tree_code code = ((i & 1) == (BYTES_BIG_ENDIAN ? 0 : 1) ? VEC_UNPACK_HI_EXPR : VEC_UNPACK_LO_EXPR); @@ -371,10 +371,10 @@ vect_maybe_permute_loop_masks (gimple_seq *seq, rgroup_masks *dest_rgm, tree masks[2]; for (unsigned int i = 0; i < 2; ++i) masks[i] = vect_gen_perm_mask_checked (src_masktype, indices[i]); - for (unsigned int i = 0; i < dest_rgm->masks.length (); ++i) + for (unsigned int i = 0; i < dest_rgm->objs.length (); ++i) { - tree src = src_rgm->masks[i / 2]; - tree dest = dest_rgm->masks[i]; + tree src = src_rgm->objs[i / 2]; + tree dest = dest_rgm->objs[i]; gimple *stmt = gimple_build_assign (dest, VEC_PERM_EXPR, src, src, masks[i & 1]); gimple_seq_add_stmt (seq, stmt); @@ -384,60 +384,80 @@ vect_maybe_permute_loop_masks (gimple_seq *seq, rgroup_masks *dest_rgm, return false; } -/* Helper for vect_set_loop_condition_masked. Generate definitions for - all the masks in RGM and return a mask that is nonzero when the loop +/* Helper for vect_set_loop_condition_partial. Generate definitions for + all the objs in RGO and return a obj that is nonzero when the loop needs to iterate. Add any new preheader statements to PREHEADER_SEQ. Use LOOP_COND_GSI to insert code before the exit gcond. - RGM belongs to loop LOOP. The loop originally iterated NITERS + RGO belongs to loop LOOP. The loop originally iterated NITERS times and has been vectorized according to LOOP_VINFO. If NITERS_SKIP is nonnull, the first iteration of the vectorized loop starts with NITERS_SKIP dummy iterations of the scalar loop before - the real work starts. The mask elements for these dummy iterations + the real work starts. The obj elements for these dummy iterations must be 0, to ensure that the extra iterations do not have an effect. It is known that: - NITERS * RGM->max_nscalars_per_iter + NITERS * RGO->max_nscalars_per_iter does not overflow. However, MIGHT_WRAP_P says whether an induction variable that starts at 0 and has step: - VF * RGM->max_nscalars_per_iter + VF * RGO->max_nscalars_per_iter might overflow before hitting a value above: - (NITERS + NITERS_SKIP) * RGM->max_nscalars_per_iter + (NITERS + NITERS_SKIP) * RGO->max_nscalars_per_iter This means that we cannot guarantee that such an induction variable - would ever hit a value that produces a set of all-false masks for RGM. */ + would ever hit a value that produces a set of all-false masks or + zero byte length for RGO. 
*/ static tree -vect_set_loop_masks_directly (class loop *loop, loop_vec_info loop_vinfo, +vect_set_loop_objs_directly (class loop *loop, loop_vec_info loop_vinfo, gimple_seq *preheader_seq, gimple_stmt_iterator loop_cond_gsi, - rgroup_masks *rgm, tree niters, tree niters_skip, + rgroup_objs *rgo, tree niters, tree niters_skip, bool might_wrap_p) { tree compare_type = LOOP_VINFO_MASK_COMPARE_TYPE (loop_vinfo); tree iv_type = LOOP_VINFO_MASK_IV_TYPE (loop_vinfo); - tree mask_type = rgm->mask_type; - unsigned int nscalars_per_iter = rgm->max_nscalars_per_iter; - poly_uint64 nscalars_per_mask = TYPE_VECTOR_SUBPARTS (mask_type); + + bool vect_for_masking = LOOP_VINFO_FULLY_MASKED_P (loop_vinfo); + if (!vect_for_masking) + { + /* Obtain target supported length type. */ + scalar_int_mode len_mode = targetm.vectorize.length_mode; + unsigned int len_prec = GET_MODE_PRECISION (len_mode); + compare_type = build_nonstandard_integer_type (len_prec, true); + /* Simply set iv_type as same as compare_type. */ + iv_type = compare_type; + } + + tree obj_type = rgo->mask_type; + /* Here, take nscalars_per_iter as nbytes_per_iter for length. */ + unsigned int nscalars_per_iter = rgo->max_nscalars_per_iter; + poly_uint64 nscalars_per_obj = TYPE_VECTOR_SUBPARTS (obj_type); + poly_uint64 vector_size = GET_MODE_SIZE (TYPE_MODE (obj_type)); poly_uint64 vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo); + tree vec_size = NULL_TREE; + /* For length, we probably need vec_size to check length in range. */ + if (!vect_for_masking) + vec_size = build_int_cst (compare_type, vector_size); /* Calculate the maximum number of scalar values that the rgroup handles in total, the number that it handles for each iteration of the vector loop, and the number that it should skip during the - first iteration of the vector loop. */ + first iteration of the vector loop. For vector with length, take + scalar values as bytes. */ tree nscalars_total = niters; tree nscalars_step = build_int_cst (iv_type, vf); tree nscalars_skip = niters_skip; if (nscalars_per_iter != 1) { - /* We checked before choosing to use a fully-masked loop that these - multiplications don't overflow. */ + /* We checked before choosing to use a fully-masked or fully with length + loop that these multiplications don't overflow. */ tree compare_factor = build_int_cst (compare_type, nscalars_per_iter); tree iv_factor = build_int_cst (iv_type, nscalars_per_iter); nscalars_total = gimple_build (preheader_seq, MULT_EXPR, compare_type, @@ -541,28 +561,28 @@ vect_set_loop_masks_directly (class loop *loop, loop_vec_info loop_vinfo, test_index = gimple_convert (&test_seq, compare_type, test_index); gsi_insert_seq_before (test_gsi, test_seq, GSI_SAME_STMT); - /* Provide a definition of each mask in the group. */ - tree next_mask = NULL_TREE; - tree mask; + /* Provide a definition of each obj in the group. */ + tree next_obj = NULL_TREE; + tree obj; unsigned int i; - FOR_EACH_VEC_ELT_REVERSE (rgm->masks, i, mask) + poly_uint64 batch_cnt = vect_for_masking ? nscalars_per_obj : vector_size; + FOR_EACH_VEC_ELT_REVERSE (rgo->objs, i, obj) { - /* Previous masks will cover BIAS scalars. This mask covers the + /* Previous objs will cover BIAS scalars. This obj covers the next batch. */ - poly_uint64 bias = nscalars_per_mask * i; + poly_uint64 bias = batch_cnt * i; tree bias_tree = build_int_cst (compare_type, bias); - gimple *tmp_stmt; /* See whether the first iteration of the vector loop is known - to have a full mask. */ + to have a full mask or length. 
*/ poly_uint64 const_limit; bool first_iteration_full = (poly_int_tree_p (first_limit, &const_limit) - && known_ge (const_limit, (i + 1) * nscalars_per_mask)); + && known_ge (const_limit, (i + 1) * batch_cnt)); /* Rather than have a new IV that starts at BIAS and goes up to TEST_LIMIT, prefer to use the same 0-based IV for each mask - and adjust the bound down by BIAS. */ + or length and adjust the bound down by BIAS. */ tree this_test_limit = test_limit; if (i != 0) { @@ -574,9 +594,9 @@ vect_set_loop_masks_directly (class loop *loop, loop_vec_info loop_vinfo, bias_tree); } - /* Create the initial mask. First include all scalars that + /* Create the initial obj. First include all scalars that are within the loop limit. */ - tree init_mask = NULL_TREE; + tree init_obj = NULL_TREE; if (!first_iteration_full) { tree start, end; @@ -598,9 +618,18 @@ vect_set_loop_masks_directly (class loop *loop, loop_vec_info loop_vinfo, end = first_limit; } - init_mask = make_temp_ssa_name (mask_type, NULL, "max_mask"); - tmp_stmt = vect_gen_while (init_mask, start, end); - gimple_seq_add_stmt (preheader_seq, tmp_stmt); + if (vect_for_masking) + { + init_obj = make_temp_ssa_name (obj_type, NULL, "max_mask"); + gimple *tmp_stmt = vect_gen_while (init_obj, start, end); + gimple_seq_add_stmt (preheader_seq, tmp_stmt); + } + else + { + init_obj = make_temp_ssa_name (compare_type, NULL, "max_len"); + gimple_seq seq = vect_gen_len (init_obj, start, end, vec_size); + gimple_seq_add_seq (preheader_seq, seq); + } } /* Now AND out the bits that are within the number of skipped @@ -610,51 +639,76 @@ vect_set_loop_masks_directly (class loop *loop, loop_vec_info loop_vinfo, && !(poly_int_tree_p (nscalars_skip, &const_skip) && known_le (const_skip, bias))) { - tree unskipped_mask = vect_gen_while_not (preheader_seq, mask_type, + tree unskipped_mask = vect_gen_while_not (preheader_seq, obj_type, bias_tree, nscalars_skip); - if (init_mask) - init_mask = gimple_build (preheader_seq, BIT_AND_EXPR, mask_type, - init_mask, unskipped_mask); + if (init_obj) + init_obj = gimple_build (preheader_seq, BIT_AND_EXPR, obj_type, + init_obj, unskipped_mask); else - init_mask = unskipped_mask; + init_obj = unskipped_mask; + gcc_assert (vect_for_masking); } - if (!init_mask) - /* First iteration is full. */ - init_mask = build_minus_one_cst (mask_type); + /* First iteration is full. */ + if (!init_obj) + { + if (vect_for_masking) + init_obj = build_minus_one_cst (obj_type); + else + init_obj = vec_size; + } - /* Get the mask value for the next iteration of the loop. */ - next_mask = make_temp_ssa_name (mask_type, NULL, "next_mask"); - gcall *call = vect_gen_while (next_mask, test_index, this_test_limit); - gsi_insert_before (test_gsi, call, GSI_SAME_STMT); + /* Get the obj value for the next iteration of the loop. */ + if (vect_for_masking) + { + next_obj = make_temp_ssa_name (obj_type, NULL, "next_mask"); + gcall *call = vect_gen_while (next_obj, test_index, this_test_limit); + gsi_insert_before (test_gsi, call, GSI_SAME_STMT); + } + else + { + next_obj = make_temp_ssa_name (compare_type, NULL, "next_len"); + tree end = this_test_limit; + gimple_seq seq = vect_gen_len (next_obj, test_index, end, vec_size); + gsi_insert_seq_before (test_gsi, seq, GSI_SAME_STMT); + } - vect_set_loop_mask (loop, mask, init_mask, next_mask); + vect_set_loop_mask_or_len (loop, obj, init_obj, next_obj); } - return next_mask; + return next_obj; } -/* Make LOOP iterate NITERS times using masking and WHILE_ULT calls. 
- LOOP_VINFO describes the vectorization of LOOP. NITERS is the - number of iterations of the original scalar loop that should be - handled by the vector loop. NITERS_MAYBE_ZERO and FINAL_IV are - as for vect_set_loop_condition. +/* Make LOOP iterate NITERS times using objects like masks (and + WHILE_ULT calls) or lengths. LOOP_VINFO describes the vectorization + of LOOP. NITERS is the number of iterations of the original scalar + loop that should be handled by the vector loop. NITERS_MAYBE_ZERO + and FINAL_IV are as for vect_set_loop_condition. Insert the branch-back condition before LOOP_COND_GSI and return the final gcond. */ static gcond * -vect_set_loop_condition_masked (class loop *loop, loop_vec_info loop_vinfo, - tree niters, tree final_iv, - bool niters_maybe_zero, - gimple_stmt_iterator loop_cond_gsi) +vect_set_loop_condition_partial (class loop *loop, loop_vec_info loop_vinfo, + tree niters, tree final_iv, + bool niters_maybe_zero, + gimple_stmt_iterator loop_cond_gsi) { gimple_seq preheader_seq = NULL; gimple_seq header_seq = NULL; + bool vect_for_masking = LOOP_VINFO_FULLY_MASKED_P (loop_vinfo); + tree compare_type = LOOP_VINFO_MASK_COMPARE_TYPE (loop_vinfo); + if (!vect_for_masking) + { + /* Obtain target supported length type as compare_type. */ + scalar_int_mode len_mode = targetm.vectorize.length_mode; + unsigned len_prec = GET_MODE_PRECISION (len_mode); + compare_type = build_nonstandard_integer_type (len_prec, true); + } unsigned int compare_precision = TYPE_PRECISION (compare_type); - tree orig_niters = niters; + tree orig_niters = niters; /* Type of the initial value of NITERS. */ tree ni_actual_type = TREE_TYPE (niters); unsigned int ni_actual_precision = TYPE_PRECISION (ni_actual_type); @@ -677,42 +731,45 @@ vect_set_loop_condition_masked (class loop *loop, loop_vec_info loop_vinfo, else niters = gimple_convert (&preheader_seq, compare_type, niters); - widest_int iv_limit = vect_iv_limit_for_full_masking (loop_vinfo); + widest_int iv_limit = vect_iv_limit_for_partial_vect (loop_vinfo); - /* Iterate over all the rgroups and fill in their masks. We could use - the first mask from any rgroup for the loop condition; here we + /* Iterate over all the rgroups and fill in their objs. We could use + the first obj from any rgroup for the loop condition; here we arbitrarily pick the last. */ - tree test_mask = NULL_TREE; - rgroup_masks *rgm; + tree test_obj = NULL_TREE; + rgroup_objs *rgo; unsigned int i; - vec_loop_masks *masks = &LOOP_VINFO_MASKS (loop_vinfo); - FOR_EACH_VEC_ELT (*masks, i, rgm) - if (!rgm->masks.is_empty ()) + auto_vec<rgroup_objs> *objs = vect_for_masking + ? &LOOP_VINFO_MASKS (loop_vinfo) + : &LOOP_VINFO_LENS (loop_vinfo); + + FOR_EACH_VEC_ELT (*objs, i, rgo) + if (!rgo->objs.is_empty ()) { /* First try using permutes. This adds a single vector instruction to the loop for each mask, but needs no extra loop invariants or IVs. */ unsigned int nmasks = i + 1; - if ((nmasks & 1) == 0) + if (vect_for_masking && (nmasks & 1) == 0) { - rgroup_masks *half_rgm = &(*masks)[nmasks / 2 - 1]; - if (!half_rgm->masks.is_empty () - && vect_maybe_permute_loop_masks (&header_seq, rgm, half_rgm)) + rgroup_objs *half_rgo = &(*objs)[nmasks / 2 - 1]; + if (!half_rgo->objs.is_empty () + && vect_maybe_permute_loop_masks (&header_seq, rgo, half_rgo)) continue; } /* See whether zero-based IV would ever generate all-false masks - before wrapping around. */ + or zero byte length before wrapping around. 
*/ bool might_wrap_p = (iv_limit == -1 - || (wi::min_precision (iv_limit * rgm->max_nscalars_per_iter, + || (wi::min_precision (iv_limit * rgo->max_nscalars_per_iter, UNSIGNED) > compare_precision)); - /* Set up all masks for this group. */ - test_mask = vect_set_loop_masks_directly (loop, loop_vinfo, + /* Set up all masks/lengths for this group. */ + test_obj = vect_set_loop_objs_directly (loop, loop_vinfo, &preheader_seq, - loop_cond_gsi, rgm, + loop_cond_gsi, rgo, niters, niters_skip, might_wrap_p); } @@ -724,8 +781,8 @@ vect_set_loop_condition_masked (class loop *loop, loop_vec_info loop_vinfo, /* Get a boolean result that tells us whether to iterate. */ edge exit_edge = single_exit (loop); tree_code code = (exit_edge->flags & EDGE_TRUE_VALUE) ? EQ_EXPR : NE_EXPR; - tree zero_mask = build_zero_cst (TREE_TYPE (test_mask)); - gcond *cond_stmt = gimple_build_cond (code, test_mask, zero_mask, + tree zero_obj = build_zero_cst (TREE_TYPE (test_obj)); + gcond *cond_stmt = gimple_build_cond (code, test_obj, zero_obj, NULL_TREE, NULL_TREE); gsi_insert_before (&loop_cond_gsi, cond_stmt, GSI_SAME_STMT); @@ -748,13 +805,12 @@ vect_set_loop_condition_masked (class loop *loop, loop_vec_info loop_vinfo, } /* Like vect_set_loop_condition, but handle the case in which there - are no loop masks. */ + are no loop masks/lengths. */ static gcond * -vect_set_loop_condition_unmasked (class loop *loop, tree niters, - tree step, tree final_iv, - bool niters_maybe_zero, - gimple_stmt_iterator loop_cond_gsi) +vect_set_loop_condition_normal (class loop *loop, tree niters, tree step, + tree final_iv, bool niters_maybe_zero, + gimple_stmt_iterator loop_cond_gsi) { tree indx_before_incr, indx_after_incr; gcond *cond_stmt; @@ -912,14 +968,14 @@ vect_set_loop_condition (class loop *loop, loop_vec_info loop_vinfo, gcond *orig_cond = get_loop_exit_condition (loop); gimple_stmt_iterator loop_cond_gsi = gsi_for_stmt (orig_cond); - if (loop_vinfo && LOOP_VINFO_FULLY_MASKED_P (loop_vinfo)) - cond_stmt = vect_set_loop_condition_masked (loop, loop_vinfo, niters, - final_iv, niters_maybe_zero, - loop_cond_gsi); + if (loop_vinfo && LOOP_VINFO_PARTIAL_VECT_P (loop_vinfo)) + cond_stmt + = vect_set_loop_condition_partial (loop, loop_vinfo, niters, final_iv, + niters_maybe_zero, loop_cond_gsi); else - cond_stmt = vect_set_loop_condition_unmasked (loop, niters, step, - final_iv, niters_maybe_zero, - loop_cond_gsi); + cond_stmt + = vect_set_loop_condition_normal (loop, niters, step, final_iv, + niters_maybe_zero, loop_cond_gsi); /* Remove old loop exit test. 
*/ stmt_vec_info orig_cond_info; @@ -1938,8 +1994,7 @@ vect_gen_vector_loop_niters (loop_vec_info loop_vinfo, tree niters, ni_minus_gap = niters; unsigned HOST_WIDE_INT const_vf; - if (vf.is_constant (&const_vf) - && !LOOP_VINFO_FULLY_MASKED_P (loop_vinfo)) + if (vf.is_constant (&const_vf) && !LOOP_VINFO_PARTIAL_VECT_P (loop_vinfo)) { /* Create: niters >> log2(vf) */ /* If it's known that niters == number of latch executions + 1 doesn't @@ -2471,7 +2526,7 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1, poly_uint64 vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo); poly_uint64 bound_epilog = 0; - if (!LOOP_VINFO_FULLY_MASKED_P (loop_vinfo) + if (!LOOP_VINFO_PARTIAL_VECT_P (loop_vinfo) && LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo)) bound_epilog += vf - 1; if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)) @@ -2567,7 +2622,8 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1, if (vect_epilogues && LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo) && prolog_peeling >= 0 - && known_eq (vf, lowest_vf)) + && known_eq (vf, lowest_vf) + && !LOOP_VINFO_FULLY_WITH_LENGTH_P (epilogue_vinfo)) { unsigned HOST_WIDE_INT eiters = (LOOP_VINFO_INT_NITERS (loop_vinfo) diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c index 80e33b61be7..cbf498e87dd 100644 --- a/gcc/tree-vect-loop.c +++ b/gcc/tree-vect-loop.c @@ -815,6 +815,8 @@ _loop_vec_info::_loop_vec_info (class loop *loop_in, vec_info_shared *shared) vectorizable (false), can_fully_mask_p (true), fully_masked_p (false), + can_with_length_p (param_vect_with_length_scope != 0), + fully_with_length_p (false), peeling_for_gaps (false), peeling_for_niter (false), no_data_dependencies (false), @@ -880,13 +882,25 @@ _loop_vec_info::_loop_vec_info (class loop *loop_in, vec_info_shared *shared) void release_vec_loop_masks (vec_loop_masks *masks) { - rgroup_masks *rgm; + rgroup_objs *rgm; unsigned int i; FOR_EACH_VEC_ELT (*masks, i, rgm) - rgm->masks.release (); + rgm->objs.release (); masks->release (); } +/* Free all levels of LENS. */ + +void +release_vec_loop_lens (vec_loop_lens *lens) +{ + rgroup_objs *rgl; + unsigned int i; + FOR_EACH_VEC_ELT (*lens, i, rgl) + rgl->objs.release (); + lens->release (); +} + /* Free all memory used by the _loop_vec_info, as well as all the stmt_vec_info structs of all the stmts in the loop. */ @@ -895,6 +909,7 @@ _loop_vec_info::~_loop_vec_info () free (bbs); release_vec_loop_masks (&masks); + release_vec_loop_lens (&lens); delete ivexpr_map; delete scan_map; epilogue_vinfos.release (); @@ -935,7 +950,7 @@ cse_and_gimplify_to_preheader (loop_vec_info loop_vinfo, tree expr) static bool can_produce_all_loop_masks_p (loop_vec_info loop_vinfo, tree cmp_type) { - rgroup_masks *rgm; + rgroup_objs *rgm; unsigned int i; FOR_EACH_VEC_ELT (LOOP_VINFO_MASKS (loop_vinfo), i, rgm) if (rgm->mask_type != NULL_TREE @@ -954,12 +969,40 @@ vect_get_max_nscalars_per_iter (loop_vec_info loop_vinfo) { unsigned int res = 1; unsigned int i; - rgroup_masks *rgm; + rgroup_objs *rgm; FOR_EACH_VEC_ELT (LOOP_VINFO_MASKS (loop_vinfo), i, rgm) res = MAX (res, rgm->max_nscalars_per_iter); return res; } +/* Calculate the minimal bits necessary to represent the maximal iteration + count of loop with loop_vec_info LOOP_VINFO which is scaling with a given + factor FACTOR. */ + +static unsigned +min_prec_for_max_niters (loop_vec_info loop_vinfo, unsigned int factor) +{ + class loop *loop = LOOP_VINFO_LOOP (loop_vinfo); + + /* Get the maximum number of iterations that is representable + in the counter type. 
*/ + tree ni_type = TREE_TYPE (LOOP_VINFO_NITERSM1 (loop_vinfo)); + widest_int max_ni = wi::to_widest (TYPE_MAX_VALUE (ni_type)) + 1; + + /* Get a more refined estimate for the number of iterations. */ + widest_int max_back_edges; + if (max_loop_iterations (loop, &max_back_edges)) + max_ni = wi::smin (max_ni, max_back_edges + 1); + + /* Account for factor, in which each bit is replicated N times. */ + max_ni *= factor; + + /* Work out how many bits we need to represent the limit. */ + unsigned int min_ni_width = wi::min_precision (max_ni, UNSIGNED); + + return min_ni_width; +} + /* Each statement in LOOP_VINFO can be masked where necessary. Check whether we can actually generate the masks required. Return true if so, storing the type of the scalar IV in LOOP_VINFO_MASK_COMPARE_TYPE. */ @@ -967,7 +1010,6 @@ vect_get_max_nscalars_per_iter (loop_vec_info loop_vinfo) static bool vect_verify_full_masking (loop_vec_info loop_vinfo) { - class loop *loop = LOOP_VINFO_LOOP (loop_vinfo); unsigned int min_ni_width; unsigned int max_nscalars_per_iter = vect_get_max_nscalars_per_iter (loop_vinfo); @@ -978,27 +1020,14 @@ vect_verify_full_masking (loop_vec_info loop_vinfo) if (LOOP_VINFO_MASKS (loop_vinfo).is_empty ()) return false; - /* Get the maximum number of iterations that is representable - in the counter type. */ - tree ni_type = TREE_TYPE (LOOP_VINFO_NITERSM1 (loop_vinfo)); - widest_int max_ni = wi::to_widest (TYPE_MAX_VALUE (ni_type)) + 1; - - /* Get a more refined estimate for the number of iterations. */ - widest_int max_back_edges; - if (max_loop_iterations (loop, &max_back_edges)) - max_ni = wi::smin (max_ni, max_back_edges + 1); - - /* Account for rgroup masks, in which each bit is replicated N times. */ - max_ni *= max_nscalars_per_iter; - /* Work out how many bits we need to represent the limit. */ - min_ni_width = wi::min_precision (max_ni, UNSIGNED); + min_ni_width = min_prec_for_max_niters (loop_vinfo, max_nscalars_per_iter); /* Find a scalar mode for which WHILE_ULT is supported. */ opt_scalar_int_mode cmp_mode_iter; tree cmp_type = NULL_TREE; tree iv_type = NULL_TREE; - widest_int iv_limit = vect_iv_limit_for_full_masking (loop_vinfo); + widest_int iv_limit = vect_iv_limit_for_partial_vect (loop_vinfo); unsigned int iv_precision = UINT_MAX; if (iv_limit != -1) @@ -1056,6 +1085,33 @@ vect_verify_full_masking (loop_vec_info loop_vinfo) return true; } +/* Check whether we can use vector access with length based on precison + comparison. So far, to keep it simple, we only allow the case that the + precision of the target supported length is larger than the precision + required by loop niters. */ + +static bool +vect_verify_loop_lens (loop_vec_info loop_vinfo) +{ + vec_loop_lens *lens = &LOOP_VINFO_LENS (loop_vinfo); + + if (LOOP_VINFO_LENS (loop_vinfo).is_empty ()) + return false; + + /* The one which has the largest NV should have max bytes per iter. */ + rgroup_objs *rgl = &(*lens)[lens->length () - 1]; + + /* Work out how many bits we need to represent the limit. */ + unsigned int min_ni_width + = min_prec_for_max_niters (loop_vinfo, rgl->nbytes_per_iter); + + unsigned len_bits = GET_MODE_PRECISION (targetm.vectorize.length_mode); + if (len_bits < min_ni_width) + return false; + + return true; +} + /* Calculate the cost of one scalar iteration of the loop. 
*/ static void vect_compute_single_scalar_iteration_cost (loop_vec_info loop_vinfo) @@ -1628,9 +1684,9 @@ vect_analyze_loop_costing (loop_vec_info loop_vinfo) class loop *loop = LOOP_VINFO_LOOP (loop_vinfo); unsigned int assumed_vf = vect_vf_for_cost (loop_vinfo); - /* Only fully-masked loops can have iteration counts less than the - vectorization factor. */ - if (!LOOP_VINFO_FULLY_MASKED_P (loop_vinfo)) + /* Only fully-masked or fully with length loops can have iteration counts less + than the vectorization factor. */ + if (!LOOP_VINFO_PARTIAL_VECT_P (loop_vinfo)) { if (known_niters_smaller_than_vf (loop_vinfo)) { @@ -1858,7 +1914,7 @@ determine_peel_for_niter (loop_vec_info loop_vinfo) th = LOOP_VINFO_COST_MODEL_THRESHOLD (LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo)); - if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo)) + if (LOOP_VINFO_PARTIAL_VECT_P (loop_vinfo)) /* The main loop handles all iterations. */ LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo) = false; else if (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo) @@ -2048,6 +2104,7 @@ vect_analyze_loop_2 (loop_vec_info loop_vinfo, bool &fatal, unsigned *n_stmts) } bool saved_can_fully_mask_p = LOOP_VINFO_CAN_FULLY_MASK_P (loop_vinfo); + bool saved_can_with_length_p = LOOP_VINFO_CAN_WITH_LENGTH_P(loop_vinfo); /* We don't expect to have to roll back to anything other than an empty set of rgroups. */ @@ -2144,6 +2201,71 @@ start_over: "not using a fully-masked loop.\n"); } + /* Decide whether we can use vector access with length. */ + + if ((LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo) + || LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo)) + && LOOP_VINFO_CAN_WITH_LENGTH_P (loop_vinfo)) + { + if (dump_enabled_p ()) + dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, + "can't use vector access with length becuase peeling" + " for alignment or gaps is required.\n"); + LOOP_VINFO_CAN_WITH_LENGTH_P (loop_vinfo) = false; + } + + if (LOOP_VINFO_CAN_WITH_LENGTH_P (loop_vinfo) + && !vect_verify_loop_lens (loop_vinfo)) + { + if (dump_enabled_p ()) + dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, + "can't use vector access with length becuase the" + " length precision verification fail.\n"); + LOOP_VINFO_CAN_WITH_LENGTH_P (loop_vinfo) = false; + } + + if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo)) + { + if (dump_enabled_p ()) + dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, + "can't use vector access with length becuase the" + " loop will be fully-masked.\n"); + LOOP_VINFO_CAN_WITH_LENGTH_P (loop_vinfo) = false; + } + + if (LOOP_VINFO_CAN_WITH_LENGTH_P (loop_vinfo)) + { + /* One special case, the loop with max niters less than VF, we can simply + take it as body with length. */ + if (param_vect_with_length_scope == 1) + { + /* This is the epilogue, should be less than VF. */ + if (LOOP_VINFO_EPILOGUE_P (loop_vinfo)) + LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo) = true; + /* Otherwise, ensure the loop iteration less than VF. */ + else if (known_niters_smaller_than_vf (loop_vinfo)) + LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo) = true; + } + else + { + gcc_assert (param_vect_with_length_scope == 2); + LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo) = true; + } + } + else + /* Always set it as false in case previous tries set it. 
*/ + LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo) = false; + + if (dump_enabled_p ()) + { + if (LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo)) + dump_printf_loc (MSG_NOTE, vect_location, "using vector access with" + " length for loop fully.\n"); + else + dump_printf_loc (MSG_NOTE, vect_location, "not using vector access with" + " length for loop fully.\n"); + } + /* If epilog loop is required because of data accesses with gaps, one additional iteration needs to be peeled. Check if there is enough iterations for vectorization. */ @@ -2163,7 +2285,7 @@ start_over: /* If we're vectorizing an epilogue loop, we either need a fully-masked loop or a loop that has a lower VF than the main loop. */ if (LOOP_VINFO_EPILOGUE_P (loop_vinfo) - && !LOOP_VINFO_FULLY_MASKED_P (loop_vinfo) + && !LOOP_VINFO_PARTIAL_VECT_P (loop_vinfo) && maybe_ge (LOOP_VINFO_VECT_FACTOR (loop_vinfo), LOOP_VINFO_VECT_FACTOR (orig_loop_vinfo))) return opt_result::failure_at (vect_location, @@ -2362,12 +2484,14 @@ again: = init_cost (LOOP_VINFO_LOOP (loop_vinfo)); /* Reset accumulated rgroup information. */ release_vec_loop_masks (&LOOP_VINFO_MASKS (loop_vinfo)); + release_vec_loop_lens (&LOOP_VINFO_LENS (loop_vinfo)); /* Reset assorted flags. */ LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo) = false; LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo) = false; LOOP_VINFO_COST_MODEL_THRESHOLD (loop_vinfo) = 0; LOOP_VINFO_VERSIONING_THRESHOLD (loop_vinfo) = 0; LOOP_VINFO_CAN_FULLY_MASK_P (loop_vinfo) = saved_can_fully_mask_p; + LOOP_VINFO_CAN_WITH_LENGTH_P (loop_vinfo) = saved_can_with_length_p; goto start_over; } @@ -2646,8 +2770,10 @@ vect_analyze_loop (class loop *loop, vec_info_shared *shared) if (ordered_p (lowest_th, th)) lowest_th = ordered_min (lowest_th, th); } - else - delete loop_vinfo; + else { + delete loop_vinfo; + loop_vinfo = opt_loop_vec_info::success (NULL); + } /* Only vectorize epilogues if PARAM_VECT_EPILOGUES_NOMASK is enabled, SIMDUID is not set, it is the innermost loop and we have @@ -2672,6 +2798,7 @@ vect_analyze_loop (class loop *loop, vec_info_shared *shared) else { delete loop_vinfo; + loop_vinfo = opt_loop_vec_info::success (NULL); if (fatal) { gcc_checking_assert (first_loop_vinfo == NULL); @@ -2679,6 +2806,21 @@ vect_analyze_loop (class loop *loop, vec_info_shared *shared) } } + /* If the original loop can use vector access with length but we still + get true vect_epilogue here, it would try vector access with length + on epilogue and with the same mode. */ + if (vect_epilogues && loop_vinfo + && LOOP_VINFO_CAN_WITH_LENGTH_P (loop_vinfo)) + { + gcc_assert (!LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo)); + if (dump_enabled_p ()) + dump_printf_loc (MSG_NOTE, vect_location, + "***** Re-trying analysis with same vector" + " mode %s for epilogue with length.\n", + GET_MODE_NAME (loop_vinfo->vector_mode)); + continue; + } + if (mode_i < vector_modes.length () && VECTOR_MODE_P (autodetected_vector_mode) && (related_vector_mode (vector_modes[mode_i], @@ -3493,7 +3635,7 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo, /* Calculate how many masks we need to generate. 
*/ unsigned int num_masks = 0; - rgroup_masks *rgm; + rgroup_objs *rgm; unsigned int num_vectors_m1; FOR_EACH_VEC_ELT (LOOP_VINFO_MASKS (loop_vinfo), num_vectors_m1, rgm) if (rgm->mask_type) @@ -3519,6 +3661,11 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo, target_cost_data, num_masks - 1, vector_stmt, NULL, NULL_TREE, 0, vect_body); } + else if (LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo)) + { + peel_iters_prologue = 0; + peel_iters_epilogue = 0; + } else if (npeel < 0) { peel_iters_prologue = assumed_vf / 2; @@ -3808,7 +3955,7 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo, " Calculated minimum iters for profitability: %d\n", min_profitable_iters); - if (!LOOP_VINFO_FULLY_MASKED_P (loop_vinfo) + if (!LOOP_VINFO_PARTIAL_VECT_P (loop_vinfo) && min_profitable_iters < (assumed_vf + peel_iters_prologue)) /* We want the vectorized loop to execute at least once. */ min_profitable_iters = assumed_vf + peel_iters_prologue; @@ -6761,6 +6908,16 @@ vectorizable_reduction (loop_vec_info loop_vinfo, dump_printf_loc (MSG_NOTE, vect_location, "using an in-order (fold-left) reduction.\n"); STMT_VINFO_TYPE (orig_stmt_of_analysis) = cycle_phi_info_type; + + if (loop_vinfo && LOOP_VINFO_CAN_WITH_LENGTH_P (loop_vinfo)) + { + if (dump_enabled_p ()) + dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, + "can't use vector access with length due to" + " reduction operation.\n"); + LOOP_VINFO_CAN_WITH_LENGTH_P (loop_vinfo) = false; + } + /* All but single defuse-cycle optimized, lane-reducing and fold-left reductions go through their own vectorizable_* routines. */ if (!single_defuse_cycle @@ -8041,6 +8198,16 @@ vectorizable_live_operation (loop_vec_info loop_vinfo, 1, vectype, NULL); } } + + if (loop_vinfo && LOOP_VINFO_CAN_WITH_LENGTH_P (loop_vinfo)) + { + LOOP_VINFO_CAN_WITH_LENGTH_P (loop_vinfo) = false; + if (dump_enabled_p ()) + dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, + "can't use vector access with length due to" + " live operation.\n"); + } + return true; } @@ -8285,7 +8452,7 @@ vect_record_loop_mask (loop_vec_info loop_vinfo, vec_loop_masks *masks, gcc_assert (nvectors != 0); if (masks->length () < nvectors) masks->safe_grow_cleared (nvectors); - rgroup_masks *rgm = &(*masks)[nvectors - 1]; + rgroup_objs *rgm = &(*masks)[nvectors - 1]; /* The number of scalars per iteration and the number of vectors are both compile-time constants. */ unsigned int nscalars_per_iter @@ -8316,24 +8483,24 @@ tree vect_get_loop_mask (gimple_stmt_iterator *gsi, vec_loop_masks *masks, unsigned int nvectors, tree vectype, unsigned int index) { - rgroup_masks *rgm = &(*masks)[nvectors - 1]; + rgroup_objs *rgm = &(*masks)[nvectors - 1]; tree mask_type = rgm->mask_type; /* Populate the rgroup's mask array, if this is the first time we've used it. */ - if (rgm->masks.is_empty ()) + if (rgm->objs.is_empty ()) { - rgm->masks.safe_grow_cleared (nvectors); + rgm->objs.safe_grow_cleared (nvectors); for (unsigned int i = 0; i < nvectors; ++i) { tree mask = make_temp_ssa_name (mask_type, NULL, "loop_mask"); /* Provide a dummy definition until the real one is available. 
*/ SSA_NAME_DEF_STMT (mask) = gimple_build_nop (); - rgm->masks[i] = mask; + rgm->objs[i] = mask; } } - tree mask = rgm->masks[index]; + tree mask = rgm->objs[index]; if (maybe_ne (TYPE_VECTOR_SUBPARTS (mask_type), TYPE_VECTOR_SUBPARTS (vectype))) { @@ -8354,6 +8521,66 @@ vect_get_loop_mask (gimple_stmt_iterator *gsi, vec_loop_masks *masks, return mask; } +/* Record that LOOP_VINFO would need LENS to contain a sequence of NVECTORS + lengths for vector access with length that each control a vector of type + VECTYPE. */ + +void +vect_record_loop_len (loop_vec_info loop_vinfo, vec_loop_lens *lens, + unsigned int nvectors, tree vectype) +{ + gcc_assert (nvectors != 0); + if (lens->length () < nvectors) + lens->safe_grow_cleared (nvectors); + rgroup_objs *rgl = &(*lens)[nvectors - 1]; + + /* The number of scalars per iteration, total bytes of them and the number of + vectors are both compile-time constants. */ + poly_uint64 vector_size = GET_MODE_SIZE (TYPE_MODE (vectype)); + poly_uint64 vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo); + unsigned int nbytes_per_iter + = exact_div (nvectors * vector_size, vf).to_constant (); + + /* The one associated to the same nvectors should have the same bytes per + iteration. */ + if (!rgl->vec_type) + { + rgl->vec_type = vectype; + rgl->nbytes_per_iter = nbytes_per_iter; + } + else + gcc_assert (rgl->nbytes_per_iter == nbytes_per_iter); +} + +/* Given a complete set of length LENS, extract length number INDEX for an + rgroup that operates on NVECTORS vectors, where 0 <= INDEX < NVECTORS. */ + +tree +vect_get_loop_len (vec_loop_lens *lens, unsigned int nvectors, unsigned int index) +{ + rgroup_objs *rgl = &(*lens)[nvectors - 1]; + + /* Populate the rgroup's len array, if this is the first time we've + used it. */ + if (rgl->objs.is_empty ()) + { + rgl->objs.safe_grow_cleared (nvectors); + for (unsigned int i = 0; i < nvectors; ++i) + { + scalar_int_mode len_mode = targetm.vectorize.length_mode; + unsigned int len_prec = GET_MODE_PRECISION (len_mode); + tree len_type = build_nonstandard_integer_type (len_prec, true); + tree len = make_temp_ssa_name (len_type, NULL, "loop_len"); + + /* Provide a dummy definition until the real one is available. */ + SSA_NAME_DEF_STMT (len) = gimple_build_nop (); + rgl->objs[i] = len; + } + } + + return rgl->objs[index]; +} + /* Scale profiling counters by estimation for LOOP which is vectorized by factor VF. */ @@ -8713,7 +8940,7 @@ vect_transform_loop (loop_vec_info loop_vinfo, gimple *loop_vectorized_call) if (niters_vector == NULL_TREE) { if (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo) - && !LOOP_VINFO_FULLY_MASKED_P (loop_vinfo) + && !LOOP_VINFO_PARTIAL_VECT_P (loop_vinfo) && known_eq (lowest_vf, vf)) { niters_vector @@ -8881,7 +9108,7 @@ vect_transform_loop (loop_vec_info loop_vinfo, gimple *loop_vectorized_call) /* True if the final iteration might not handle a full vector's worth of scalar iterations. */ - bool final_iter_may_be_partial = LOOP_VINFO_FULLY_MASKED_P (loop_vinfo); + bool final_iter_may_be_partial = LOOP_VINFO_PARTIAL_VECT_P (loop_vinfo); /* The minimum number of iterations performed by the epilogue. This is 1 when peeling for gaps because we always need a final scalar iteration. */ @@ -9184,12 +9411,14 @@ optimize_mask_stores (class loop *loop) } /* Decide whether it is possible to use a zero-based induction variable - when vectorizing LOOP_VINFO with a fully-masked loop. If it is, - return the value that the induction variable must be able to hold - in order to ensure that the loop ends with an all-false mask. 
+ when vectorizing LOOP_VINFO with a fully-masked or fully with length + loop. If it is, return the value that the induction variable must + be able to hold in order to ensure that the loop ends with an + all-false mask or zero byte length. Return -1 otherwise. */ + widest_int -vect_iv_limit_for_full_masking (loop_vec_info loop_vinfo) +vect_iv_limit_for_partial_vect (loop_vec_info loop_vinfo) { tree niters_skip = LOOP_VINFO_MASK_SKIP_NITERS (loop_vinfo); class loop *loop = LOOP_VINFO_LOOP (loop_vinfo); diff --git a/gcc/tree-vect-stmts.c b/gcc/tree-vect-stmts.c index e7822c44951..d6be39e1831 100644 --- a/gcc/tree-vect-stmts.c +++ b/gcc/tree-vect-stmts.c @@ -1879,6 +1879,66 @@ check_load_store_masking (loop_vec_info loop_vinfo, tree vectype, gcc_unreachable (); } +/* Check whether a load or store statement in the loop described by + LOOP_VINFO is possible to go with length. This is testing whether + the vectorizer pass has the appropriate support, as well as whether + the target does. + + VLS_TYPE says whether the statement is a load or store and VECTYPE + is the type of the vector being loaded or stored. MEMORY_ACCESS_TYPE + says how the load or store is going to be implemented and GROUP_SIZE + is the number of load or store statements in the containing group. + + Clear LOOP_VINFO_CAN_WITH_LENGTH_P if it can't go with length, otherwise + record the required length types. */ + +static void +check_load_store_with_len (loop_vec_info loop_vinfo, tree vectype, + vec_load_store_type vls_type, int group_size, + vect_memory_access_type memory_access_type) +{ + /* Invariant loads need no special support. */ + if (memory_access_type == VMAT_INVARIANT) + return; + + if (memory_access_type != VMAT_CONTIGUOUS + && memory_access_type != VMAT_CONTIGUOUS_PERMUTE) + { + if (dump_enabled_p ()) + dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, + "can't use vector access with length" + " because an access isn't contiguous.\n"); + LOOP_VINFO_CAN_WITH_LENGTH_P (loop_vinfo) = false; + return; + } + + machine_mode vecmode = TYPE_MODE (vectype); + bool is_load = (vls_type == VLS_LOAD); + optab op = is_load ? lenload_optab : lenstore_optab; + + if (!VECTOR_MODE_P (vecmode) + || !convert_optab_handler (op, vecmode, targetm.vectorize.length_mode)) + { + if (dump_enabled_p ()) + dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, + "can't use vector access with length because" + " the target doesn't have the appropriate" + " load or store with length.\n"); + LOOP_VINFO_CAN_WITH_LENGTH_P (loop_vinfo) = false; + return; + } + + vec_loop_lens *lens = &LOOP_VINFO_LENS (loop_vinfo); + poly_uint64 nunits = TYPE_VECTOR_SUBPARTS (vectype); + poly_uint64 vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo); + unsigned int nvectors; + + if (can_div_away_from_zero_p (group_size * vf, nunits, &nvectors)) + vect_record_loop_len (loop_vinfo, lens, nvectors, vectype); + else + gcc_unreachable (); +} + /* Return the mask input to a masked load or store. VEC_MASK is the vectorized form of the scalar mask condition and LOOP_MASK, if nonnull, is the mask that needs to be applied to all loads and stores in a vectorized loop. 
@@ -7532,6 +7592,10 @@ vectorizable_store (vec_info *vinfo, check_load_store_masking (loop_vinfo, vectype, vls_type, group_size, memory_access_type, &gs_info, mask); + if (loop_vinfo && LOOP_VINFO_CAN_WITH_LENGTH_P (loop_vinfo)) + check_load_store_with_len (loop_vinfo, vectype, vls_type, group_size, + memory_access_type); + if (slp_node && !vect_maybe_update_slp_op_vectype (SLP_TREE_CHILDREN (slp_node)[0], vectype)) @@ -8068,6 +8132,15 @@ vectorizable_store (vec_info *vinfo, = (loop_vinfo && LOOP_VINFO_FULLY_MASKED_P (loop_vinfo) ? &LOOP_VINFO_MASKS (loop_vinfo) : NULL); + + vec_loop_lens *loop_lens + = (loop_vinfo && LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo) + ? &LOOP_VINFO_LENS (loop_vinfo) + : NULL); + + /* Shouldn't go with length if fully masked. */ + gcc_assert (!loop_lens || (loop_lens && !loop_masks)); + /* Targets with store-lane instructions must not require explicit realignment. vect_supportable_dr_alignment always returns either dr_aligned or dr_unaligned_supported for masked operations. */ @@ -8320,10 +8393,15 @@ vectorizable_store (vec_info *vinfo, unsigned HOST_WIDE_INT align; tree final_mask = NULL_TREE; + tree final_len = NULL_TREE; if (loop_masks) final_mask = vect_get_loop_mask (gsi, loop_masks, vec_num * ncopies, vectype, vec_num * j + i); + else if (loop_lens) + final_len = vect_get_loop_len (loop_lens, vec_num * ncopies, + vec_num * j + i); + if (vec_mask) final_mask = prepare_load_store_mask (mask_vectype, final_mask, vec_mask, gsi); @@ -8403,6 +8481,17 @@ vectorizable_store (vec_info *vinfo, new_stmt_info = vect_finish_stmt_generation (vinfo, stmt_info, call, gsi); } + else if (final_len) + { + align = least_bit_hwi (misalign | align); + tree ptr = build_int_cst (ref_type, align); + gcall *call + = gimple_build_call_internal (IFN_LEN_STORE, 4, dataref_ptr, + ptr, final_len, vec_oprnd); + gimple_call_set_nothrow (call, true); + new_stmt_info + = vect_finish_stmt_generation (vinfo, stmt_info, call, gsi); + } else { data_ref = fold_build2 (MEM_REF, vectype, @@ -8839,6 +8928,10 @@ vectorizable_load (vec_info *vinfo, check_load_store_masking (loop_vinfo, vectype, VLS_LOAD, group_size, memory_access_type, &gs_info, mask); + if (loop_vinfo && LOOP_VINFO_CAN_WITH_LENGTH_P (loop_vinfo)) + check_load_store_with_len (loop_vinfo, vectype, VLS_LOAD, group_size, + memory_access_type); + STMT_VINFO_TYPE (stmt_info) = load_vec_info_type; vect_model_load_cost (vinfo, stmt_info, ncopies, vf, memory_access_type, slp_node, cost_vec); @@ -8937,6 +9030,7 @@ vectorizable_load (vec_info *vinfo, gcc_assert (!LOOP_VINFO_FULLY_MASKED_P (loop_vinfo)); gcc_assert (!nested_in_vect_loop); + gcc_assert (!LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo)); if (grouped_load) { @@ -9234,6 +9328,15 @@ vectorizable_load (vec_info *vinfo, = (loop_vinfo && LOOP_VINFO_FULLY_MASKED_P (loop_vinfo) ? &LOOP_VINFO_MASKS (loop_vinfo) : NULL); + + vec_loop_lens *loop_lens + = (loop_vinfo && LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo) + ? &LOOP_VINFO_LENS (loop_vinfo) + : NULL); + + /* Shouldn't go with length if fully masked. */ + gcc_assert (!loop_lens || (loop_lens && !loop_masks)); + /* Targets with store-lane instructions must not require explicit realignment. vect_supportable_dr_alignment always returns either dr_aligned or dr_unaligned_supported for masked operations. 
*/ @@ -9555,15 +9658,20 @@ vectorizable_load (vec_info *vinfo, for (i = 0; i < vec_num; i++) { tree final_mask = NULL_TREE; + tree final_len = NULL_TREE; if (loop_masks && memory_access_type != VMAT_INVARIANT) final_mask = vect_get_loop_mask (gsi, loop_masks, vec_num * ncopies, vectype, vec_num * j + i); + else if (loop_lens && memory_access_type != VMAT_INVARIANT) + final_len = vect_get_loop_len (loop_lens, vec_num * ncopies, + vec_num * j + i); if (vec_mask) final_mask = prepare_load_store_mask (mask_vectype, final_mask, vec_mask, gsi); + if (i > 0) dataref_ptr = bump_vector_ptr (vinfo, dataref_ptr, ptr_incr, gsi, stmt_info, bump); @@ -9629,6 +9737,18 @@ vectorizable_load (vec_info *vinfo, new_stmt = call; data_ref = NULL_TREE; } + else if (final_len) + { + align = least_bit_hwi (misalign | align); + tree ptr = build_int_cst (ref_type, align); + gcall *call + = gimple_build_call_internal (IFN_LEN_LOAD, 3, + dataref_ptr, ptr, + final_len); + gimple_call_set_nothrow (call, true); + new_stmt = call; + data_ref = NULL_TREE; + } else { tree ltype = vectype; @@ -12480,3 +12600,35 @@ vect_get_vector_types_for_stmt (vec_info *vinfo, stmt_vec_info stmt_info, *nunits_vectype_out = nunits_vectype; return opt_result::success (); } + +/* Generate and return statement sequence that sets vector length LEN that is: + + min_of_start_and_end = min (START_INDEX, END_INDEX); + left_bytes = END_INDEX - min_of_start_and_end; + rhs = min (left_bytes, VECTOR_SIZE); + LEN = rhs; + + TODO: for now, rs6000 supported vector with length only cares 8-bits, which + means if we have left_bytes larger than 255, it can't be saturated to vector + size. One target hook can be provided if other ports don't suffer this. +*/ + +gimple_seq +vect_gen_len (tree len, tree start_index, tree end_index, tree vector_size) +{ + gimple_seq stmts = NULL; + tree len_type = TREE_TYPE (len); + gcc_assert (TREE_TYPE (start_index) == len_type); + + tree min = fold_build2 (MIN_EXPR, len_type, start_index, end_index); + tree left_bytes = fold_build2 (MINUS_EXPR, len_type, end_index, min); + left_bytes = fold_build2 (MIN_EXPR, len_type, left_bytes, vector_size); + + tree rhs = force_gimple_operand (left_bytes, &stmts, true, NULL_TREE); + gimple *new_stmt = gimple_build_assign (len, rhs); + gimple_stmt_iterator i = gsi_last (stmts); + gsi_insert_after_without_update (&i, new_stmt, GSI_CONTINUE_LINKING); + + return stmts; +} + diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h index 2eb3ab5d280..78e260e5611 100644 --- a/gcc/tree-vectorizer.h +++ b/gcc/tree-vectorizer.h @@ -461,20 +461,32 @@ is_a_helper <_bb_vec_info *>::test (vec_info *i) first level being indexed by nV - 1 (since nV == 0 doesn't exist) and the second being indexed by the mask index 0 <= i < nV. */ -/* The masks needed by rgroups with nV vectors, according to the - description above. */ -struct rgroup_masks { - /* The largest nS for all rgroups that use these masks. */ - unsigned int max_nscalars_per_iter; - - /* The type of mask to use, based on the highest nS recorded above. */ - tree mask_type; +/* The masks/lengths (called as objects) needed by rgroups with nV vectors, + according to the description above. */ +struct rgroup_objs { + union + { + /* The largest nS for all rgroups that use these masks. */ + unsigned int max_nscalars_per_iter; + /* The total bytes for any nS per iteration. */ + unsigned int nbytes_per_iter; + }; - /* A vector of nV masks, in iteration order. 
*/ - vec<tree> masks; + union + { + /* The type of mask to use, based on the highest nS recorded above. */ + tree mask_type; + /* Any vector type to use these lengths. */ + tree vec_type; + }; + + /* A vector of nV objs, in iteration order. */ + vec<tree> objs; }; -typedef auto_vec<rgroup_masks> vec_loop_masks; +typedef auto_vec<rgroup_objs> vec_loop_masks; + +typedef auto_vec<rgroup_objs> vec_loop_lens; typedef auto_vec<std::pair<data_reference*, tree> > drs_init_vec; @@ -523,6 +535,10 @@ public: on inactive scalars. */ vec_loop_masks masks; + /* The lengths that a loop with length should use to avoid operating + on inactive scalars. */ + vec_loop_lens lens; + /* Set of scalar conditions that have loop mask applied. */ scalar_cond_masked_set_type scalar_cond_masked_set; @@ -626,6 +642,12 @@ public: /* True if have decided to use a fully-masked loop. */ bool fully_masked_p; + /* Records whether we still have the option of using a length access loop. */ + bool can_with_length_p; + + /* True if have decided to use length access for the loop fully. */ + bool fully_with_length_p; + /* When we have grouped data accesses with gaps, we may introduce invalid memory accesses. We peel the last iteration of the loop to prevent this. */ @@ -689,6 +711,9 @@ public: #define LOOP_VINFO_VECTORIZABLE_P(L) (L)->vectorizable #define LOOP_VINFO_CAN_FULLY_MASK_P(L) (L)->can_fully_mask_p #define LOOP_VINFO_FULLY_MASKED_P(L) (L)->fully_masked_p +#define LOOP_VINFO_CAN_WITH_LENGTH_P(L) (L)->can_with_length_p +#define LOOP_VINFO_FULLY_WITH_LENGTH_P(L) (L)->fully_with_length_p +#define LOOP_VINFO_LENS(L) (L)->lens #define LOOP_VINFO_VECT_FACTOR(L) (L)->vectorization_factor #define LOOP_VINFO_MAX_VECT_FACTOR(L) (L)->max_vectorization_factor #define LOOP_VINFO_MASKS(L) (L)->masks @@ -741,6 +766,10 @@ public: || LOOP_REQUIRES_VERSIONING_FOR_NITERS (L) \ || LOOP_REQUIRES_VERSIONING_FOR_SIMD_IF_COND (L)) +/* Whether operates on partial vector. */ +#define LOOP_VINFO_PARTIAL_VECT_P(L) \ + (LOOP_VINFO_FULLY_MASKED_P (L) || LOOP_VINFO_FULLY_WITH_LENGTH_P (L)) + #define LOOP_VINFO_NITERS_KNOWN_P(L) \ (tree_fits_shwi_p ((L)->num_iters) && tree_to_shwi ((L)->num_iters) > 0) @@ -1824,7 +1853,7 @@ extern tree vect_create_addr_base_for_vector_ref (vec_info *, tree, tree = NULL_TREE); /* In tree-vect-loop.c. */ -extern widest_int vect_iv_limit_for_full_masking (loop_vec_info loop_vinfo); +extern widest_int vect_iv_limit_for_partial_vect (loop_vec_info loop_vinfo); /* Used in tree-vect-loop-manip.c */ extern void determine_peel_for_niter (loop_vec_info); /* Used in gimple-loop-interchange.c and tree-parloops.c. */ @@ -1842,6 +1871,10 @@ extern void vect_record_loop_mask (loop_vec_info, vec_loop_masks *, unsigned int, tree, tree); extern tree vect_get_loop_mask (gimple_stmt_iterator *, vec_loop_masks *, unsigned int, tree, unsigned int); +extern void vect_record_loop_len (loop_vec_info, vec_loop_lens *, unsigned int, + tree); +extern tree vect_get_loop_len (vec_loop_lens *, unsigned int, unsigned int); +extern gimple_seq vect_gen_len (tree, tree, tree, tree); extern stmt_vec_info info_for_reduction (vec_info *, stmt_vec_info); /* Drive for loop transformation stage. */ --
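The length that vect_gen_len above computes each iteration can be
checked with a standalone sketch of the same arithmetic; the function
len_for_iteration and the byte values below are invented for
illustration, while the real code emits an equivalent gimple sequence:

#include <stdio.h>

/* Mirrors the sequence built by vect_gen_len:
     min_of_start_and_end = MIN (start_index, end_index);
     left_bytes = end_index - min_of_start_and_end;
     len = MIN (left_bytes, vector_size);  */
static unsigned int
len_for_iteration (unsigned int start_index, unsigned int end_index,
                   unsigned int vector_size)
{
  unsigned int min_of_start_and_end
    = start_index < end_index ? start_index : end_index;
  unsigned int left_bytes = end_index - min_of_start_and_end;
  return left_bytes < vector_size ? left_bytes : vector_size;
}

int
main (void)
{
  /* Plenty of data left: a full 16-byte access.  */
  printf ("%u\n", len_for_iteration (0, 64, 16));  /* 16 */
  /* Tail iteration with only 7 bytes remaining.  */
  printf ("%u\n", len_for_iteration (57, 64, 16)); /* 7 */
  /* start_index at or past the end: zero length.  */
  printf ("%u\n", len_for_iteration (64, 64, 16)); /* 0 */
  return 0;
}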
"Kewen.Lin" <linkw@linux.ibm.com> writes: > Hi Richard, > > Thanks for your comments! > > on 2020/5/26 锟斤拷锟斤拷8:49, Richard Sandiford wrote: >> "Kewen.Lin" <linkw@linux.ibm.com> writes: >>> @@ -626,6 +645,12 @@ public: >>> /* True if have decided to use a fully-masked loop. */ >>> bool fully_masked_p; >>> >>> + /* Records whether we still have the option of using a length access loop. */ >>> + bool can_with_length_p; >>> + >>> + /* True if have decided to use length access for the loop fully. */ >>> + bool fully_with_length_p; >> >> Rather than duplicate the flags like this, I think we should have >> three bits of information: >> >> (1) Can the loop operate on partial vectors? Starts off optimistically >> assuming "yes", gets set to "no" when we find a counter-example. >> >> (2) If we do decide to use partial vectors, will we need loop masks? >> >> (3) If we do decide to use partial vectors, will we need lengths? >> >> Vectorisation using partial vectors succeeds if (1) && ((2) != (3)) >> >> LOOP_VINFO_CAN_FULLY_MASK_P currently tracks (1) and >> LOOP_VINFO_MASKS currently tracks (2). In pathological cases it's >> already possible to have (1) && !(2), see r9-6240 for an example. >> >> With the new support, LOOP_VINFO_LENS tracks (3). >> >> So I don't think we need the can_with_length_p. What is now >> LOOP_VINFO_CAN_FULLY_MASK_P can continue to track (1) for both >> approaches, with the final choice of approach only being made >> at the end. Maybe it would be worth renaming it to something >> more generic though, now that we have two approaches to partial >> vectorisation. > > I like this idea! I could be wrong, but I'm afraid that we > can not have one common flag to be shared for both approaches, > the check criterias could be different for both approaches, one > counter example for length could be acceptable for masking, such > as length can only allow CONTIGUOUS related modes, but masking > can support more. When we see acceptable VMAT_LOAD_STORE_LANES, > we leave LOOP_VINFO_CAN_FULLY_MASK_P true, later should length > checking turn it to false? I guess no, assuming still true, then > LOOP_VINFO_CAN_FULLY_MASK_P will mean only partial vectorization > for masking, not for both. We can probably clean LOOP_VINFO_LENS > when the length checking is false, but we just know the vec is empty, > not sure we are unable to do partial vectorization with length, > when we see LOOP_VINFO_CAN_FULLY_MASK_P true, we could still > record length into it if possible. Let's call the flag in (1) CAN_USE_PARTIAL_VECTORS_P rather than CAN_FULLY_MASK_P to (try to) avoid any confusion from the current name. What I meant is that each vectorizable_* routine has the responsibility of finding a way of coping with partial vectorisation, or setting CAN_USE_PARTIAL_VECTORS_P to false if it can't. vectorizable_load chooses the VMAT first, and then decides based on that whether partial vectorisation is supported. There's no influence in the other direction (partial vectorisation doesn't determine the VMAT). So once it has chosen a VMAT, vectorizable_load needs to try to find a way of handling the operation with partial vectorisation. Currently the only way of doing that for VMAT_LOAD_STORE_LANES is using masks. So at the moment there are two possible outcomes: - The target supports the necessary IFN_MASK_LOAD_LANES function. If so, we can use partial vectorisation for the statement, so we leave CAN_USE_PARTIAL_VECTORS_P true and record the necessary masks in LOOP_VINFO_MASKS. 
- The target doesn't support the necessary IFN_MASK_LOAD_LANES function. If so, we can't use partial vectorisation, so we clear CAN_USE_PARTIAL_VECTORS_P. That's how things work at the moment. It would work in the same way for lengths if we ever supported IFN_LEN_LOAD_LANES: we'd check whether IFN_LEN_LOAD_LANES is available and record the length in LOOP_VINFO_LENS if so. If partial vectorisation isn't supported (via masks or lengths), we'd continue to clear CAN_USE_PARTIAL_VECTORS_P. But equally, if we never add support for IFN_LEN_LOAD_LANES, the current code continues to work with length-based approaches. We'll continue to clear CAN_USE_PARTIAL_VECTORS_P for VMAT_LOAD_STORE_LANES when the target provides no IFN_MASK_LOAD_LANES function. As I say, this is all predicated on the assumption that we don't need to mix both masks and lengths in the same loop, and so can decide not to use partial vectorisation when both masks and lengths have been recorded. This is a check that would happen at the end, after all statements have been analysed. (There's no reason in principle why we *couldn't* support both approaches in the same loop, but it's not worth adding the code for that until there's a use case.) Thanks, Richard
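To make the bookkeeping described above concrete, here is a tiny stand-alone model (not GCC code; the struct and function names are invented for illustration) of the end-of-analysis decision: partial vectors are used only when the loop-wide flag survived analysis and exactly one of the two rgroup sets ended up non-empty, i.e. (1) && ((2) != (3)):

/* Toy model of the final partial-vectorization decision.  */
struct loop_model
{
  bool can_use_partial_vectors_p;	/* (1): no counter-example found.  */
  bool recorded_masks;			/* (2): some statement needs masks.  */
  bool recorded_lens;			/* (3): some statement needs lengths.  */
};

bool
use_partial_vectors_p (const struct loop_model *l)
{
  /* Punt if both masks and lengths were recorded (or neither was).  */
  return l->can_use_partial_vectors_p
	 && (l->recorded_masks != l->recorded_lens);
}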
on 2020/5/27 下午6:02, Richard Sandiford wrote: > "Kewen.Lin" <linkw@linux.ibm.com> writes: >> Hi Richard, >> >> Thanks for your comments! >> >> on 2020/5/26 锟斤拷锟斤拷8:49, Richard Sandiford wrote: >>> "Kewen.Lin" <linkw@linux.ibm.com> writes: >>>> @@ -626,6 +645,12 @@ public: >>>> /* True if have decided to use a fully-masked loop. */ >>>> bool fully_masked_p; >>>> >>>> + /* Records whether we still have the option of using a length access loop. */ >>>> + bool can_with_length_p; >>>> + >>>> + /* True if have decided to use length access for the loop fully. */ >>>> + bool fully_with_length_p; >>> >>> Rather than duplicate the flags like this, I think we should have >>> three bits of information: >>> >>> (1) Can the loop operate on partial vectors? Starts off optimistically >>> assuming "yes", gets set to "no" when we find a counter-example. >>> >>> (2) If we do decide to use partial vectors, will we need loop masks? >>> >>> (3) If we do decide to use partial vectors, will we need lengths? >>> >>> Vectorisation using partial vectors succeeds if (1) && ((2) != (3)) >>> >>> LOOP_VINFO_CAN_FULLY_MASK_P currently tracks (1) and >>> LOOP_VINFO_MASKS currently tracks (2). In pathological cases it's >>> already possible to have (1) && !(2), see r9-6240 for an example. >>> >>> With the new support, LOOP_VINFO_LENS tracks (3). >>> >>> So I don't think we need the can_with_length_p. What is now >>> LOOP_VINFO_CAN_FULLY_MASK_P can continue to track (1) for both >>> approaches, with the final choice of approach only being made >>> at the end. Maybe it would be worth renaming it to something >>> more generic though, now that we have two approaches to partial >>> vectorisation. >> >> I like this idea! I could be wrong, but I'm afraid that we >> can not have one common flag to be shared for both approaches, >> the check criterias could be different for both approaches, one >> counter example for length could be acceptable for masking, such >> as length can only allow CONTIGUOUS related modes, but masking >> can support more. When we see acceptable VMAT_LOAD_STORE_LANES, >> we leave LOOP_VINFO_CAN_FULLY_MASK_P true, later should length >> checking turn it to false? I guess no, assuming still true, then >> LOOP_VINFO_CAN_FULLY_MASK_P will mean only partial vectorization >> for masking, not for both. We can probably clean LOOP_VINFO_LENS >> when the length checking is false, but we just know the vec is empty, >> not sure we are unable to do partial vectorization with length, >> when we see LOOP_VINFO_CAN_FULLY_MASK_P true, we could still >> record length into it if possible. > > Let's call the flag in (1) CAN_USE_PARTIAL_VECTORS_P rather than > CAN_FULLY_MASK_P to (try to) avoid any confusion from the current name. > > What I meant is that each vectorizable_* routine has the responsibility > of finding a way of coping with partial vectorisation, or setting > CAN_USE_PARTIAL_VECTORS_P to false if it can't. > > vectorizable_load chooses the VMAT first, and then decides based on that > whether partial vectorisation is supported. There's no influence in > the other direction (partial vectorisation doesn't determine the VMAT). > > So once it has chosen a VMAT, vectorizable_load needs to try to find a way > of handling the operation with partial vectorisation. Currently the only > way of doing that for VMAT_LOAD_STORE_LANES is using masks. So at the > moment there are two possible outcomes: > > - The target supports the necessary IFN_MASK_LOAD_LANES function. 
> If so, we can use partial vectorisation for the statement, so we > leave CAN_USE_PARTIAL_VECTORS_P true and record the necessary masks > in LOOP_VINFO_MASKS. > > - The target doesn't support the necessary IFN_MASK_LOAD_LANES function. > If so, we can't use partial vectorisation, so we clear > CAN_USE_PARTIAL_VECTORS_P. > > That's how things work at the moment. It would work in the same way > for lengths if we ever supported IFN_LEN_LOAD_LANES: we'd check whether > IFN_LEN_LOAD_LANES is available and record the length in LOOP_VINFO_LENS > if so. If partial vectorisation isn't supported (via masks or lengths), > we'd continue to clear CAN_USE_PARTIAL_VECTORS_P. > > But equally, if we never add support for IFN_LEN_LOAD_LANES, the current > code continues to work with length-based approaches. We'll continue to > clear CAN_USE_PARTIAL_VECTORS_P for VMAT_LOAD_STORE_LANES when the > target provides no IFN_MASK_LOAD_LANES function. > Thanks a lot for your detailed explanation! This proposal looks good based on the current implementation of both masking and length. I may think too much, but I had a bit concern as below when some targets have both masking and length supports in future, such as ppc adds masking support like SVE. I assumed that you meant each vectorizable_* routine should record the objs for any available partial vectorisation approaches. If one target supports both, we would have both recorded but decide not to do partial vectorisation finally since both have records. The target can disable length like through optab to resolve it, but there is one possibility that the masking support can be imperfect initially since ISA support could be gradual, it further leads some vectorizable_* check or final verification to fail for masking, and length approach may work here but it gets disabled. We can miss to use partial vectorisation here. The other assumption is that each vectorizable_* routine record the first available partial vectorisation approach, let's assume masking takes preference, then it's fine to record just one here even if one target supports both approaches, but we still have the possiblity to miss the partial vectorisation chance as some check/verify fail with masking but fine with length. Does this concern make sense? BR, Kewen > As I say, this is all predicated on the assumption that we don't need > to mix both masks and lengths in the same loop, and so can decide not > to use partial vectorisation when both masks and lengths have been > recorded. This is a check that would happen at the end, after all > statements have been analysed. > > (There's no reason in principle why we *couldn't* support both > approaches in the same loop, but it's not worth adding the code > for that until there's a use case.) > > Thanks, > Richard >
"Kewen.Lin" <linkw@linux.ibm.com> writes: > on 2020/5/27 下午6:02, Richard Sandiford wrote: >> "Kewen.Lin" <linkw@linux.ibm.com> writes: >>> Hi Richard, >>> >>> Thanks for your comments! >>> >>> on 2020/5/26 锟斤拷锟斤拷8:49, Richard Sandiford wrote: >>>> "Kewen.Lin" <linkw@linux.ibm.com> writes: >>>>> @@ -626,6 +645,12 @@ public: >>>>> /* True if have decided to use a fully-masked loop. */ >>>>> bool fully_masked_p; >>>>> >>>>> + /* Records whether we still have the option of using a length access loop. */ >>>>> + bool can_with_length_p; >>>>> + >>>>> + /* True if have decided to use length access for the loop fully. */ >>>>> + bool fully_with_length_p; >>>> >>>> Rather than duplicate the flags like this, I think we should have >>>> three bits of information: >>>> >>>> (1) Can the loop operate on partial vectors? Starts off optimistically >>>> assuming "yes", gets set to "no" when we find a counter-example. >>>> >>>> (2) If we do decide to use partial vectors, will we need loop masks? >>>> >>>> (3) If we do decide to use partial vectors, will we need lengths? >>>> >>>> Vectorisation using partial vectors succeeds if (1) && ((2) != (3)) >>>> >>>> LOOP_VINFO_CAN_FULLY_MASK_P currently tracks (1) and >>>> LOOP_VINFO_MASKS currently tracks (2). In pathological cases it's >>>> already possible to have (1) && !(2), see r9-6240 for an example. >>>> >>>> With the new support, LOOP_VINFO_LENS tracks (3). >>>> >>>> So I don't think we need the can_with_length_p. What is now >>>> LOOP_VINFO_CAN_FULLY_MASK_P can continue to track (1) for both >>>> approaches, with the final choice of approach only being made >>>> at the end. Maybe it would be worth renaming it to something >>>> more generic though, now that we have two approaches to partial >>>> vectorisation. >>> >>> I like this idea! I could be wrong, but I'm afraid that we >>> can not have one common flag to be shared for both approaches, >>> the check criterias could be different for both approaches, one >>> counter example for length could be acceptable for masking, such >>> as length can only allow CONTIGUOUS related modes, but masking >>> can support more. When we see acceptable VMAT_LOAD_STORE_LANES, >>> we leave LOOP_VINFO_CAN_FULLY_MASK_P true, later should length >>> checking turn it to false? I guess no, assuming still true, then >>> LOOP_VINFO_CAN_FULLY_MASK_P will mean only partial vectorization >>> for masking, not for both. We can probably clean LOOP_VINFO_LENS >>> when the length checking is false, but we just know the vec is empty, >>> not sure we are unable to do partial vectorization with length, >>> when we see LOOP_VINFO_CAN_FULLY_MASK_P true, we could still >>> record length into it if possible. >> >> Let's call the flag in (1) CAN_USE_PARTIAL_VECTORS_P rather than >> CAN_FULLY_MASK_P to (try to) avoid any confusion from the current name. >> >> What I meant is that each vectorizable_* routine has the responsibility >> of finding a way of coping with partial vectorisation, or setting >> CAN_USE_PARTIAL_VECTORS_P to false if it can't. >> >> vectorizable_load chooses the VMAT first, and then decides based on that >> whether partial vectorisation is supported. There's no influence in >> the other direction (partial vectorisation doesn't determine the VMAT). >> >> So once it has chosen a VMAT, vectorizable_load needs to try to find a way >> of handling the operation with partial vectorisation. Currently the only >> way of doing that for VMAT_LOAD_STORE_LANES is using masks. 
So at the >> moment there are two possible outcomes: >> >> - The target supports the necessary IFN_MASK_LOAD_LANES function. >> If so, we can use partial vectorisation for the statement, so we >> leave CAN_USE_PARTIAL_VECTORS_P true and record the necessary masks >> in LOOP_VINFO_MASKS. >> >> - The target doesn't support the necessary IFN_MASK_LOAD_LANES function. >> If so, we can't use partial vectorisation, so we clear >> CAN_USE_PARTIAL_VECTORS_P. >> >> That's how things work at the moment. It would work in the same way >> for lengths if we ever supported IFN_LEN_LOAD_LANES: we'd check whether >> IFN_LEN_LOAD_LANES is available and record the length in LOOP_VINFO_LENS >> if so. If partial vectorisation isn't supported (via masks or lengths), >> we'd continue to clear CAN_USE_PARTIAL_VECTORS_P. >> >> But equally, if we never add support for IFN_LEN_LOAD_LANES, the current >> code continues to work with length-based approaches. We'll continue to >> clear CAN_USE_PARTIAL_VECTORS_P for VMAT_LOAD_STORE_LANES when the >> target provides no IFN_MASK_LOAD_LANES function. >> > > Thanks a lot for your detailed explanation! This proposal looks good > based on the current implementation of both masking and length. I may > think too much, but I had a bit concern as below when some targets have > both masking and length supports in future, such as ppc adds masking > support like SVE. > > I assumed that you meant each vectorizable_* routine should record the > objs for any available partial vectorisation approaches. If one target > supports both, we would have both recorded but decide not to do partial > vectorisation finally since both have records. The target can disable > length like through optab to resolve it, but there is one possibility > that the masking support can be imperfect initially since ISA support > could be gradual, it further leads some vectorizable_* check or final > verification to fail for masking, and length approach may work here but > it gets disabled. We can miss to use partial vectorisation here. > > The other assumption is that each vectorizable_* routine record the > first available partial vectorisation approach, let's assume masking > takes preference, then it's fine to record just one here even if one > target supports both approaches, but we still have the possiblity to > miss the partial vectorisation chance as some check/verify fail with > masking but fine with length. > > Does this concern make sense? There's nothing to stop us using masks and lengths in the same loop in future if we need to. It would “just” be a case of setting up both the masks and the lengths in vect_set_loop_condition. But the point is that doing that would be extra code, and there's no point writing that extra code until it's needed. If some future arch does support both mask-based and length-based approaches, I think that's even less reason to make a binary choice between them. How we prioritise the length and mask approaches when both are available is something that we'll have to decide at the time. If your concern is that the arch might support masked operations without wanting them to be used for loop control, we could test for that case by checking whether while_ult_optab is implemented. Thanks, Richard
Hi! On Fri, May 29, 2020 at 09:32:49AM +0100, Richard Sandiford wrote: > There's nothing to stop us using masks and lengths in the same loop > in future if we need to. It would “just” be a case of setting up both > the masks and the lengths in vect_set_loop_condition. But the point is > that doing that would be extra code, and there's no point writing that > extra code until it's needed. You won't ever get it right even, because you do not know exactly what will be needed :-) > If some future arch does support both mask-based and length-based > approaches, I think that's even less reason to make a binary choice > between them. How we prioritise the length and mask approaches when > both are available is something that we'll have to decide at the time. > > If your concern is that the arch might support masked operations > without wanting them to be used for loop control, we could test for > that case by checking whether while_ult_optab is implemented. Heh, sneaky. But at least for now it will work fine, and it is local, and not hard to change later. Segher
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi index 8b9935dfe65..ac765feab13 100644 --- a/gcc/doc/invoke.texi +++ b/gcc/doc/invoke.texi @@ -13079,6 +13079,13 @@ by the copy loop headers pass. @item vect-epilogues-nomask Enable loop epilogue vectorization using smaller vector size. +@item vect-with-length-scope +Control the scope of vector memory access with length exploitation. 0 means we +don't exploit any vector memory access with length, 1 means we only exploit +vector memory access with length for loops whose iteration count is less than +VF, such as very small loops or epilogues, 2 means we exploit vector memory +access with length for all loops where possible. + @item slp-max-insns-in-bb Maximum number of instructions in basic block to be considered for SLP vectorization. diff --git a/gcc/params.opt b/gcc/params.opt index 4aec480798b..d4309101067 100644 --- a/gcc/params.opt +++ b/gcc/params.opt @@ -964,4 +964,8 @@ Bound on number of runtime checks inserted by the vectorizer's loop versioning f Common Joined UInteger Var(param_vect_max_version_for_alignment_checks) Init(6) Param Optimization Bound on number of runtime checks inserted by the vectorizer's loop versioning for alignment check. +-param=vect-with-length-scope= +Common Joined UInteger Var(param_vect_with_length_scope) Init(0) IntegerRange(0, 2) Param Optimization +Control the scope of vector memory access with length exploitation. + ; This comment is to ensure we retain the blank line above. diff --git a/gcc/tree-vect-loop-manip.c b/gcc/tree-vect-loop-manip.c index 8c5e696b995..3d5dec6f65c 100644 --- a/gcc/tree-vect-loop-manip.c +++ b/gcc/tree-vect-loop-manip.c @@ -747,6 +747,263 @@ vect_set_loop_condition_masked (class loop *loop, loop_vec_info loop_vinfo, return cond_stmt; } +/* Helper for vect_set_loop_condition_len. Like vect_set_loop_masks_directly, + generate definitions for all the lengths in RGL and return a length that is + nonzero when the loop needs to iterate. Add any new preheader statements to + PREHEADER_SEQ. Use LOOP_COND_GSI to insert code before the exit gcond. + + RGL belongs to loop LOOP. The loop originally iterated NITERS + times and has been vectorized according to LOOP_VINFO. Each iteration + of the vectorized loop handles VF iterations of the scalar loop. + + IV_LIMIT is the limit that the induction variable can reach, which is used + to check whether the induction variable can wrap before hitting NITERS. */ + +static tree +vect_set_loop_lens_directly (class loop *loop, loop_vec_info loop_vinfo, + gimple_seq *preheader_seq, + gimple_stmt_iterator loop_cond_gsi, + rgroup_lens *rgl, tree niters, widest_int iv_limit) +{ + scalar_int_mode len_mode = targetm.vectorize.length_mode; + unsigned int len_prec = GET_MODE_PRECISION (len_mode); + tree len_type = build_nonstandard_integer_type (len_prec, true); + + tree vec_type = rgl->vec_type; + unsigned int nbytes_per_iter = rgl->nbytes_per_iter; + poly_uint64 vector_size = GET_MODE_SIZE (TYPE_MODE (vec_type)); + poly_uint64 vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo); + tree vec_size = build_int_cst (len_type, vector_size); + + /* See whether a zero-based IV would ever generate a zero length before + wrapping around.
*/ + bool might_wrap_p = (iv_limit == -1); + if (!might_wrap_p) + { + widest_int iv_limit_max = iv_limit * nbytes_per_iter; + might_wrap_p = wi::min_precision (iv_limit_max, UNSIGNED) > len_prec; + } + + /* Calculate the maximum number of bytes of scalars that the rgroup + handles in total, the number that it handles for each iteration + of the vector loop. */ + tree nbytes_total = niters; + tree nbytes_step = build_int_cst (len_type, vf); + if (nbytes_per_iter != 1) + { + tree factor = build_int_cst (len_type, nbytes_per_iter); + nbytes_total = gimple_build (preheader_seq, MULT_EXPR, len_type, + nbytes_total, factor); + nbytes_step = gimple_build (preheader_seq, MULT_EXPR, len_type, + nbytes_step, factor); + } + + /* Create an induction variable that counts the processed bytes of scalars. */ + tree index_before_incr, index_after_incr; + gimple_stmt_iterator incr_gsi; + bool insert_after; + standard_iv_increment_position (loop, &incr_gsi, &insert_after); + create_iv (build_int_cst (len_type, 0), nbytes_step, NULL_TREE, loop, + &incr_gsi, insert_after, &index_before_incr, &index_after_incr); + + tree zero_index = build_int_cst (len_type, 0); + tree test_index, test_limit, first_limit; + gimple_stmt_iterator *test_gsi; + + /* For the first iteration it doesn't matter whether the IV hits + a value above NBYTES_TOTAL. That only matters for the latch + condition. */ + first_limit = nbytes_total; + + if (might_wrap_p) + { + test_index = index_before_incr; + tree adjust = gimple_convert (preheader_seq, len_type, nbytes_step); + test_limit = gimple_build (preheader_seq, MAX_EXPR, len_type, + nbytes_total, adjust); + test_limit = gimple_build (preheader_seq, MINUS_EXPR, len_type, + test_limit, adjust); + test_gsi = &incr_gsi; + } + else + { + /* Test the incremented IV, which will always hit a value above + the bound before wrapping. */ + test_index = index_after_incr; + test_limit = nbytes_total; + test_gsi = &loop_cond_gsi; + } + + /* Provide a definition of each length in the group. */ + tree next_len = NULL_TREE; + tree len; + unsigned int i; + FOR_EACH_VEC_ELT_REVERSE (rgl->lens, i, len) + { + /* Previous lengths will cover BIAS scalars. This length covers the + next batch. Each batch's length should be vector_size. */ + poly_uint64 bias = vector_size * i; + tree bias_tree = build_int_cst (len_type, bias); + + /* See whether the first iteration of the vector loop is known + to have a full vector size. */ + poly_uint64 const_limit; + bool first_iteration_full + = (poly_int_tree_p (first_limit, &const_limit) + && known_ge (const_limit, (i + 1) * vector_size)); + + /* Rather than have a new IV that starts at BIAS and goes up to + TEST_LIMIT, prefer to use the same 0-based IV for each length + and adjust the bound down by BIAS. */ + tree this_test_limit = test_limit; + if (i != 0) + { + this_test_limit = gimple_build (preheader_seq, MAX_EXPR, len_type, + this_test_limit, bias_tree); + this_test_limit = gimple_build (preheader_seq, MINUS_EXPR, len_type, + this_test_limit, bias_tree); + } + + /* Create the initial length. First include all scalar bytes that + are within the loop limit. */ + tree init_len = NULL_TREE; + if (!first_iteration_full) + { + tree start, end; + if (first_limit == test_limit) + { + /* Use a natural test between zero (the initial IV value) + and the loop limit. The "else" block would be valid too, + but this choice can avoid the need to load BIAS_TREE into + a register. 
*/ + start = zero_index; + end = this_test_limit; + } + else + { + /* FIRST_LIMIT is the maximum number of scalar bytes handled by + the first iteration of the vector loop. Test the portion + associated with this length. */ + start = bias_tree; + end = first_limit; + } + + init_len = make_temp_ssa_name (len_type, NULL, "max_len"); + gimple_seq seq = vect_gen_len (init_len, start, end, vec_size); + gimple_seq_add_seq (preheader_seq, seq); + } + + /* First iteration is full. */ + if (!init_len) + init_len = vec_size; + + /* Get the length value for the next iteration of the loop. */ + next_len = make_temp_ssa_name (len_type, NULL, "next_len"); + tree end = this_test_limit; + gimple_seq seq = vect_gen_len (next_len, test_index, end, vec_size); + gsi_insert_seq_before (test_gsi, seq, GSI_SAME_STMT); + + /* Use mask routine for length. */ + vect_set_loop_mask (loop, len, init_len, next_len); + } + + return next_len; +} + +/* Like vect_set_loop_condition_masked, handle the case vector access with + length. */ + +static gcond * +vect_set_loop_condition_len (class loop *loop, loop_vec_info loop_vinfo, + tree niters, tree final_iv, + bool niters_maybe_zero, + gimple_stmt_iterator loop_cond_gsi) +{ + gimple_seq preheader_seq = NULL; + gimple_seq header_seq = NULL; + tree orig_niters = niters; + + /* Type of the initial value of NITERS. */ + tree ni_actual_type = TREE_TYPE (niters); + unsigned int ni_actual_prec = TYPE_PRECISION (ni_actual_type); + + /* Obtain target supported length type. */ + scalar_int_mode len_mode = targetm.vectorize.length_mode; + unsigned int len_prec = GET_MODE_PRECISION (len_mode); + tree len_type = build_nonstandard_integer_type (len_prec, true); + + /* Calculate the value that the induction variable must be able to hit in + order to ensure that we end the loop with an zero length. */ + widest_int iv_limit = -1; + unsigned HOST_WIDE_INT max_vf = vect_max_vf (loop_vinfo); + if (max_loop_iterations (loop, &iv_limit)) + { + /* Round this value down to the previous vector alignment boundary and + then add an extra full iteration. */ + poly_uint64 vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo); + iv_limit = (iv_limit & -(int) known_alignment (vf)) + max_vf; + } + + /* Convert NITERS to the same size as the length. */ + if (niters_maybe_zero || (len_prec > ni_actual_prec)) + { + /* We know that there is always at least one iteration, so if the + count is zero then it must have wrapped. Cope with this by + subtracting 1 before the conversion and adding 1 to the result. */ + gcc_assert (TYPE_UNSIGNED (ni_actual_type)); + niters = gimple_build (&preheader_seq, PLUS_EXPR, ni_actual_type, niters, + build_minus_one_cst (ni_actual_type)); + niters = gimple_convert (&preheader_seq, len_type, niters); + niters = gimple_build (&preheader_seq, PLUS_EXPR, len_type, niters, + build_one_cst (len_type)); + } + else + niters = gimple_convert (&preheader_seq, len_type, niters); + + /* Iterate over all the rgroups and fill in their lengths. We could use + the first length from any rgroup for the loop condition; here we + arbitrarily pick the last. */ + tree test_len = NULL_TREE; + rgroup_lens *rgl; + unsigned int i; + vec_loop_lens *lens = &LOOP_VINFO_LENS (loop_vinfo); + + FOR_EACH_VEC_ELT (*lens, i, rgl) + if (!rgl->lens.is_empty ()) + /* Set up all lens for this group. */ + test_len + = vect_set_loop_lens_directly (loop, loop_vinfo, &preheader_seq, + loop_cond_gsi, rgl, niters, iv_limit); + + /* Emit all accumulated statements. 
*/ + add_preheader_seq (loop, preheader_seq); + add_header_seq (loop, header_seq); + + /* Get a boolean result that tells us whether to iterate. */ + edge exit_edge = single_exit (loop); + tree_code code = (exit_edge->flags & EDGE_TRUE_VALUE) ? EQ_EXPR : NE_EXPR; + tree zero_len = build_zero_cst (TREE_TYPE (test_len)); + gcond *cond_stmt + = gimple_build_cond (code, test_len, zero_len, NULL_TREE, NULL_TREE); + gsi_insert_before (&loop_cond_gsi, cond_stmt, GSI_SAME_STMT); + + /* The loop iterates (NITERS - 1) / VF + 1 times. + Subtract one from this to get the latch count. */ + tree step = build_int_cst (len_type, LOOP_VINFO_VECT_FACTOR (loop_vinfo)); + tree niters_minus_one + = fold_build2 (PLUS_EXPR, len_type, niters, build_minus_one_cst (len_type)); + loop->nb_iterations + = fold_build2 (TRUNC_DIV_EXPR, len_type, niters_minus_one, step); + + if (final_iv) + { + gassign *assign = gimple_build_assign (final_iv, orig_niters); + gsi_insert_on_edge_immediate (single_exit (loop), assign); + } + + return cond_stmt; +} + /* Like vect_set_loop_condition, but handle the case in which there are no loop masks. */ @@ -916,6 +1173,10 @@ vect_set_loop_condition (class loop *loop, loop_vec_info loop_vinfo, cond_stmt = vect_set_loop_condition_masked (loop, loop_vinfo, niters, final_iv, niters_maybe_zero, loop_cond_gsi); + else if (loop_vinfo && LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo)) + cond_stmt = vect_set_loop_condition_len (loop, loop_vinfo, niters, + final_iv, niters_maybe_zero, + loop_cond_gsi); else cond_stmt = vect_set_loop_condition_unmasked (loop, niters, step, final_iv, niters_maybe_zero, @@ -1939,7 +2200,8 @@ vect_gen_vector_loop_niters (loop_vec_info loop_vinfo, tree niters, unsigned HOST_WIDE_INT const_vf; if (vf.is_constant (&const_vf) - && !LOOP_VINFO_FULLY_MASKED_P (loop_vinfo)) + && !LOOP_VINFO_FULLY_MASKED_P (loop_vinfo) + && !LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo)) { /* Create: niters >> log2(vf) */ /* If it's known that niters == number of latch executions + 1 doesn't @@ -2472,6 +2734,7 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1, poly_uint64 vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo); poly_uint64 bound_epilog = 0; if (!LOOP_VINFO_FULLY_MASKED_P (loop_vinfo) + && !LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo) && LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo)) bound_epilog += vf - 1; if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)) @@ -2567,7 +2830,8 @@ vect_do_peeling (loop_vec_info loop_vinfo, tree niters, tree nitersm1, if (vect_epilogues && LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo) && prolog_peeling >= 0 - && known_eq (vf, lowest_vf)) + && known_eq (vf, lowest_vf) + && !LOOP_VINFO_FULLY_WITH_LENGTH_P (epilogue_vinfo)) { unsigned HOST_WIDE_INT eiters = (LOOP_VINFO_INT_NITERS (loop_vinfo) diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c index 80e33b61be7..d61f46becfd 100644 --- a/gcc/tree-vect-loop.c +++ b/gcc/tree-vect-loop.c @@ -815,6 +815,8 @@ _loop_vec_info::_loop_vec_info (class loop *loop_in, vec_info_shared *shared) vectorizable (false), can_fully_mask_p (true), fully_masked_p (false), + can_with_length_p (param_vect_with_length_scope != 0), + fully_with_length_p (false), peeling_for_gaps (false), peeling_for_niter (false), no_data_dependencies (false), @@ -887,6 +889,18 @@ release_vec_loop_masks (vec_loop_masks *masks) masks->release (); } +/* Free all levels of LENS. 
*/ + +void +release_vec_loop_lens (vec_loop_lens *lens) +{ + rgroup_lens *rgl; + unsigned int i; + FOR_EACH_VEC_ELT (*lens, i, rgl) + rgl->lens.release (); + lens->release (); +} + /* Free all memory used by the _loop_vec_info, as well as all the stmt_vec_info structs of all the stmts in the loop. */ @@ -895,6 +909,7 @@ _loop_vec_info::~_loop_vec_info () free (bbs); release_vec_loop_masks (&masks); + release_vec_loop_lens (&lens); delete ivexpr_map; delete scan_map; epilogue_vinfos.release (); @@ -1056,6 +1071,44 @@ vect_verify_full_masking (loop_vec_info loop_vinfo) return true; } +/* Check whether we can use vector access with length based on precision + comparison. So far, to keep it simple, we only allow the case in which the + precision of the target supported length is at least the precision + required by the loop niters. */ + +static bool +vect_verify_loop_lens (loop_vec_info loop_vinfo) +{ + class loop *loop = LOOP_VINFO_LOOP (loop_vinfo); + vec_loop_lens *lens = &LOOP_VINFO_LENS (loop_vinfo); + + if (LOOP_VINFO_LENS (loop_vinfo).is_empty ()) + return false; + + /* Get the maximum number of iterations that is representable + in the counter type. */ + tree ni_type = TREE_TYPE (LOOP_VINFO_NITERSM1 (loop_vinfo)); + widest_int max_ni = wi::to_widest (TYPE_MAX_VALUE (ni_type)) + 1; + + /* Get a more refined estimate for the number of iterations. */ + widest_int max_back_edges; + if (max_loop_iterations (loop, &max_back_edges)) + max_ni = wi::smin (max_ni, max_back_edges + 1); + + /* Account for rgroup lengths, in which each scalar iteration covers + NBYTES_PER_ITER bytes. */ + rgroup_lens *rgl = &(*lens)[lens->length () - 1]; + max_ni *= rgl->nbytes_per_iter; + + /* Work out how many bits we need to represent the limit. */ + unsigned int min_ni_width = wi::min_precision (max_ni, UNSIGNED); + + unsigned len_bits = GET_MODE_PRECISION (targetm.vectorize.length_mode); + if (len_bits < min_ni_width) + return false; + + return true; +} + /* Calculate the cost of one scalar iteration of the loop. */ static void vect_compute_single_scalar_iteration_cost (loop_vec_info loop_vinfo) @@ -1630,7 +1683,8 @@ vect_analyze_loop_costing (loop_vec_info loop_vinfo) /* Only fully-masked loops can have iteration counts less than the vectorization factor. */ - if (!LOOP_VINFO_FULLY_MASKED_P (loop_vinfo)) + if (!LOOP_VINFO_FULLY_MASKED_P (loop_vinfo) + && !LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo)) { if (known_niters_smaller_than_vf (loop_vinfo)) { @@ -1858,7 +1912,8 @@ determine_peel_for_niter (loop_vec_info loop_vinfo) th = LOOP_VINFO_COST_MODEL_THRESHOLD (LOOP_VINFO_ORIG_LOOP_INFO (loop_vinfo)); - if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo)) + if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo) + || LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo)) /* The main loop handles all iterations. */ LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo) = false; else if (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo) @@ -2048,6 +2103,7 @@ vect_analyze_loop_2 (loop_vec_info loop_vinfo, bool &fatal, unsigned *n_stmts) } bool saved_can_fully_mask_p = LOOP_VINFO_CAN_FULLY_MASK_P (loop_vinfo); + bool saved_can_with_length_p = LOOP_VINFO_CAN_WITH_LENGTH_P (loop_vinfo); /* We don't expect to have to roll back to anything other than an empty set of rgroups. */ @@ -2144,6 +2200,71 @@ start_over: "not using a fully-masked loop.\n"); } + /* Decide whether we can use vector access with length.
*/ + + if ((LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo) + || LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo)) + && LOOP_VINFO_CAN_WITH_LENGTH_P (loop_vinfo)) + { + if (dump_enabled_p ()) + dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, + "can't use vector access with length because peeling" + " for alignment or gaps is required.\n"); + LOOP_VINFO_CAN_WITH_LENGTH_P (loop_vinfo) = false; + } + + if (LOOP_VINFO_CAN_WITH_LENGTH_P (loop_vinfo) + && !vect_verify_loop_lens (loop_vinfo)) + { + if (dump_enabled_p ()) + dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, + "can't use vector access with length because the" + " length precision verification fails.\n"); + LOOP_VINFO_CAN_WITH_LENGTH_P (loop_vinfo) = false; + } + + if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo)) + { + if (dump_enabled_p ()) + dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, + "can't use vector access with length because the" + " loop will be fully-masked.\n"); + LOOP_VINFO_CAN_WITH_LENGTH_P (loop_vinfo) = false; + } + + if (LOOP_VINFO_CAN_WITH_LENGTH_P (loop_vinfo)) + { + /* One special case: a loop whose max niters is less than VF can simply + have its whole body handled with length. */ + if (param_vect_with_length_scope == 1) + { + /* This is the epilogue, whose niters should be less than VF. */ + if (LOOP_VINFO_EPILOGUE_P (loop_vinfo)) + LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo) = true; + /* Otherwise, ensure the loop iteration count is less than VF. */ + else if (known_niters_smaller_than_vf (loop_vinfo)) + LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo) = true; + } + else + { + gcc_assert (param_vect_with_length_scope == 2); + LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo) = true; + } + } + else + /* Always reset it to false in case a previous try set it. */ + LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo) = false; + + if (dump_enabled_p ()) + { + if (LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo)) + dump_printf_loc (MSG_NOTE, vect_location, "using vector access with" + " length for the loop fully.\n"); + else + dump_printf_loc (MSG_NOTE, vect_location, "not using vector access with" + " length for the loop fully.\n"); + } + /* If epilog loop is required because of data accesses with gaps, one additional iteration needs to be peeled. Check if there is enough iterations for vectorization. */ @@ -2164,6 +2285,7 @@ start_over: loop or a loop that has a lower VF than the main loop. */ if (LOOP_VINFO_EPILOGUE_P (loop_vinfo) && !LOOP_VINFO_FULLY_MASKED_P (loop_vinfo) + && !LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo) && maybe_ge (LOOP_VINFO_VECT_FACTOR (loop_vinfo), LOOP_VINFO_VECT_FACTOR (orig_loop_vinfo))) return opt_result::failure_at (vect_location, @@ -2362,12 +2484,14 @@ again: = init_cost (LOOP_VINFO_LOOP (loop_vinfo)); /* Reset accumulated rgroup information. */ release_vec_loop_masks (&LOOP_VINFO_MASKS (loop_vinfo)); + release_vec_loop_lens (&LOOP_VINFO_LENS (loop_vinfo)); /* Reset assorted flags.
*/ LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo) = false; LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo) = false; LOOP_VINFO_COST_MODEL_THRESHOLD (loop_vinfo) = 0; LOOP_VINFO_VERSIONING_THRESHOLD (loop_vinfo) = 0; LOOP_VINFO_CAN_FULLY_MASK_P (loop_vinfo) = saved_can_fully_mask_p; + LOOP_VINFO_CAN_WITH_LENGTH_P (loop_vinfo) = saved_can_with_length_p; goto start_over; } @@ -2646,8 +2770,10 @@ vect_analyze_loop (class loop *loop, vec_info_shared *shared) if (ordered_p (lowest_th, th)) lowest_th = ordered_min (lowest_th, th); } - else - delete loop_vinfo; + else { + delete loop_vinfo; + loop_vinfo = opt_loop_vec_info::success (NULL); + } /* Only vectorize epilogues if PARAM_VECT_EPILOGUES_NOMASK is enabled, SIMDUID is not set, it is the innermost loop and we have @@ -2672,6 +2798,7 @@ vect_analyze_loop (class loop *loop, vec_info_shared *shared) else { delete loop_vinfo; + loop_vinfo = opt_loop_vec_info::success (NULL); if (fatal) { gcc_checking_assert (first_loop_vinfo == NULL); @@ -2679,6 +2806,21 @@ vect_analyze_loop (class loop *loop, vec_info_shared *shared) } } + /* If the original loop can use vector access with length but we still + get true vect_epilogue here, it would try vector access with length + on epilogue and with the same mode. */ + if (vect_epilogues && loop_vinfo + && LOOP_VINFO_CAN_WITH_LENGTH_P (loop_vinfo)) + { + gcc_assert (!LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo)); + if (dump_enabled_p ()) + dump_printf_loc (MSG_NOTE, vect_location, + "***** Re-trying analysis with same vector" + " mode %s for epilogue with length.\n", + GET_MODE_NAME (loop_vinfo->vector_mode)); + continue; + } + if (mode_i < vector_modes.length () && VECTOR_MODE_P (autodetected_vector_mode) && (related_vector_mode (vector_modes[mode_i], @@ -3519,6 +3661,11 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo, target_cost_data, num_masks - 1, vector_stmt, NULL, NULL_TREE, 0, vect_body); } + else if (LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo)) + { + peel_iters_prologue = 0; + peel_iters_epilogue = 0; + } else if (npeel < 0) { peel_iters_prologue = assumed_vf / 2; @@ -3809,6 +3956,7 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo, min_profitable_iters); if (!LOOP_VINFO_FULLY_MASKED_P (loop_vinfo) + && !LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo) && min_profitable_iters < (assumed_vf + peel_iters_prologue)) /* We want the vectorized loop to execute at least once. */ min_profitable_iters = assumed_vf + peel_iters_prologue; @@ -6761,6 +6909,16 @@ vectorizable_reduction (loop_vec_info loop_vinfo, dump_printf_loc (MSG_NOTE, vect_location, "using an in-order (fold-left) reduction.\n"); STMT_VINFO_TYPE (orig_stmt_of_analysis) = cycle_phi_info_type; + + if (loop_vinfo && LOOP_VINFO_CAN_WITH_LENGTH_P (loop_vinfo)) + { + if (dump_enabled_p ()) + dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, + "can't use vector access with length due to" + " reduction operation.\n"); + LOOP_VINFO_CAN_WITH_LENGTH_P (loop_vinfo) = false; + } + /* All but single defuse-cycle optimized, lane-reducing and fold-left reductions go through their own vectorizable_* routines. 
*/ if (!single_defuse_cycle @@ -8041,6 +8199,16 @@ vectorizable_live_operation (loop_vec_info loop_vinfo, 1, vectype, NULL); } } + + if (loop_vinfo && LOOP_VINFO_CAN_WITH_LENGTH_P (loop_vinfo)) + { + LOOP_VINFO_CAN_WITH_LENGTH_P (loop_vinfo) = false; + if (dump_enabled_p ()) + dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, + "can't use vector access with length due to" + " live operation.\n"); + } + return true; } @@ -8354,6 +8522,66 @@ vect_get_loop_mask (gimple_stmt_iterator *gsi, vec_loop_masks *masks, return mask; } +/* Record that LOOP_VINFO would need LENS to contain a sequence of NVECTORS + lengths for vector access with length that each control a vector of type + VECTYPE. */ + +void +vect_record_loop_len (loop_vec_info loop_vinfo, vec_loop_lens *lens, + unsigned int nvectors, tree vectype) +{ + gcc_assert (nvectors != 0); + if (lens->length () < nvectors) + lens->safe_grow_cleared (nvectors); + rgroup_lens *rgl = &(*lens)[nvectors - 1]; + + /* The number of scalars per iteration, total bytes of them and the number of + vectors are both compile-time constants. */ + poly_uint64 vector_size = GET_MODE_SIZE (TYPE_MODE (vectype)); + poly_uint64 vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo); + unsigned int nbytes_per_iter + = exact_div (nvectors * vector_size, vf).to_constant (); + + /* The one associated to the same nvectors should have the same bytes per + iteration. */ + if (!rgl->vec_type) + { + rgl->vec_type = vectype; + rgl->nbytes_per_iter = nbytes_per_iter; + } + else + gcc_assert (rgl->nbytes_per_iter == nbytes_per_iter); +} + +/* Given a complete set of length LENS, extract length number INDEX for an + rgroup that operates on NVECTORS vectors, where 0 <= INDEX < NVECTORS. */ + +tree +vect_get_loop_len (vec_loop_lens *lens, unsigned int nvectors, unsigned int index) +{ + rgroup_lens *rgl = &(*lens)[nvectors - 1]; + + /* Populate the rgroup's len array, if this is the first time we've + used it. */ + if (rgl->lens.is_empty ()) + { + rgl->lens.safe_grow_cleared (nvectors); + for (unsigned int i = 0; i < nvectors; ++i) + { + scalar_int_mode len_mode = targetm.vectorize.length_mode; + unsigned int len_prec = GET_MODE_PRECISION (len_mode); + tree len_type = build_nonstandard_integer_type (len_prec, true); + tree len = make_temp_ssa_name (len_type, NULL, "loop_len"); + + /* Provide a dummy definition until the real one is available. */ + SSA_NAME_DEF_STMT (len) = gimple_build_nop (); + rgl->lens[i] = len; + } + } + + return rgl->lens[index]; +} + /* Scale profiling counters by estimation for LOOP which is vectorized by factor VF. */ @@ -8714,6 +8942,7 @@ vect_transform_loop (loop_vec_info loop_vinfo, gimple *loop_vectorized_call) { if (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo) && !LOOP_VINFO_FULLY_MASKED_P (loop_vinfo) + && !LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo) && known_eq (lowest_vf, vf)) { niters_vector @@ -8881,7 +9110,9 @@ vect_transform_loop (loop_vec_info loop_vinfo, gimple *loop_vectorized_call) /* True if the final iteration might not handle a full vector's worth of scalar iterations. */ - bool final_iter_may_be_partial = LOOP_VINFO_FULLY_MASKED_P (loop_vinfo); + bool final_iter_may_be_partial + = LOOP_VINFO_FULLY_MASKED_P (loop_vinfo) + || LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo); /* The minimum number of iterations performed by the epilogue. This is 1 when peeling for gaps because we always need a final scalar iteration. 
*/ diff --git a/gcc/tree-vect-stmts.c b/gcc/tree-vect-stmts.c index e7822c44951..d6be39e1831 100644 --- a/gcc/tree-vect-stmts.c +++ b/gcc/tree-vect-stmts.c @@ -1879,6 +1879,66 @@ check_load_store_masking (loop_vec_info loop_vinfo, tree vectype, gcc_unreachable (); } +/* Check whether a load or store statement in the loop described by + LOOP_VINFO is possible to go with length. This is testing whether + the vectorizer pass has the appropriate support, as well as whether + the target does. + + VLS_TYPE says whether the statement is a load or store and VECTYPE + is the type of the vector being loaded or stored. MEMORY_ACCESS_TYPE + says how the load or store is going to be implemented and GROUP_SIZE + is the number of load or store statements in the containing group. + + Clear LOOP_VINFO_CAN_WITH_LENGTH_P if it can't go with length, otherwise + record the required length types. */ + +static void +check_load_store_with_len (loop_vec_info loop_vinfo, tree vectype, + vec_load_store_type vls_type, int group_size, + vect_memory_access_type memory_access_type) +{ + /* Invariant loads need no special support. */ + if (memory_access_type == VMAT_INVARIANT) + return; + + if (memory_access_type != VMAT_CONTIGUOUS + && memory_access_type != VMAT_CONTIGUOUS_PERMUTE) + { + if (dump_enabled_p ()) + dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, + "can't use vector access with length" + " because an access isn't contiguous.\n"); + LOOP_VINFO_CAN_WITH_LENGTH_P (loop_vinfo) = false; + return; + } + + machine_mode vecmode = TYPE_MODE (vectype); + bool is_load = (vls_type == VLS_LOAD); + optab op = is_load ? lenload_optab : lenstore_optab; + + if (!VECTOR_MODE_P (vecmode) + || !convert_optab_handler (op, vecmode, targetm.vectorize.length_mode)) + { + if (dump_enabled_p ()) + dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location, + "can't use vector access with length because" + " the target doesn't have the appropriate" + " load or store with length.\n"); + LOOP_VINFO_CAN_WITH_LENGTH_P (loop_vinfo) = false; + return; + } + + vec_loop_lens *lens = &LOOP_VINFO_LENS (loop_vinfo); + poly_uint64 nunits = TYPE_VECTOR_SUBPARTS (vectype); + poly_uint64 vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo); + unsigned int nvectors; + + if (can_div_away_from_zero_p (group_size * vf, nunits, &nvectors)) + vect_record_loop_len (loop_vinfo, lens, nvectors, vectype); + else + gcc_unreachable (); +} + /* Return the mask input to a masked load or store. VEC_MASK is the vectorized form of the scalar mask condition and LOOP_MASK, if nonnull, is the mask that needs to be applied to all loads and stores in a vectorized loop. @@ -7532,6 +7592,10 @@ vectorizable_store (vec_info *vinfo, check_load_store_masking (loop_vinfo, vectype, vls_type, group_size, memory_access_type, &gs_info, mask); + if (loop_vinfo && LOOP_VINFO_CAN_WITH_LENGTH_P (loop_vinfo)) + check_load_store_with_len (loop_vinfo, vectype, vls_type, group_size, + memory_access_type); + if (slp_node && !vect_maybe_update_slp_op_vectype (SLP_TREE_CHILDREN (slp_node)[0], vectype)) @@ -8068,6 +8132,15 @@ vectorizable_store (vec_info *vinfo, = (loop_vinfo && LOOP_VINFO_FULLY_MASKED_P (loop_vinfo) ? &LOOP_VINFO_MASKS (loop_vinfo) : NULL); + + vec_loop_lens *loop_lens + = (loop_vinfo && LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo) + ? &LOOP_VINFO_LENS (loop_vinfo) + : NULL); + + /* Shouldn't go with length if fully masked. */ + gcc_assert (!loop_lens || (loop_lens && !loop_masks)); + /* Targets with store-lane instructions must not require explicit realignment. 
vect_supportable_dr_alignment always returns either dr_aligned or dr_unaligned_supported for masked operations. */ @@ -8320,10 +8393,15 @@ vectorizable_store (vec_info *vinfo, unsigned HOST_WIDE_INT align; tree final_mask = NULL_TREE; + tree final_len = NULL_TREE; if (loop_masks) final_mask = vect_get_loop_mask (gsi, loop_masks, vec_num * ncopies, vectype, vec_num * j + i); + else if (loop_lens) + final_len = vect_get_loop_len (loop_lens, vec_num * ncopies, + vec_num * j + i); + if (vec_mask) final_mask = prepare_load_store_mask (mask_vectype, final_mask, vec_mask, gsi); @@ -8403,6 +8481,17 @@ vectorizable_store (vec_info *vinfo, new_stmt_info = vect_finish_stmt_generation (vinfo, stmt_info, call, gsi); } + else if (final_len) + { + align = least_bit_hwi (misalign | align); + tree ptr = build_int_cst (ref_type, align); + gcall *call + = gimple_build_call_internal (IFN_LEN_STORE, 4, dataref_ptr, + ptr, final_len, vec_oprnd); + gimple_call_set_nothrow (call, true); + new_stmt_info + = vect_finish_stmt_generation (vinfo, stmt_info, call, gsi); + } else { data_ref = fold_build2 (MEM_REF, vectype, @@ -8839,6 +8928,10 @@ vectorizable_load (vec_info *vinfo, check_load_store_masking (loop_vinfo, vectype, VLS_LOAD, group_size, memory_access_type, &gs_info, mask); + if (loop_vinfo && LOOP_VINFO_CAN_WITH_LENGTH_P (loop_vinfo)) + check_load_store_with_len (loop_vinfo, vectype, VLS_LOAD, group_size, + memory_access_type); + STMT_VINFO_TYPE (stmt_info) = load_vec_info_type; vect_model_load_cost (vinfo, stmt_info, ncopies, vf, memory_access_type, slp_node, cost_vec); @@ -8937,6 +9030,7 @@ vectorizable_load (vec_info *vinfo, gcc_assert (!LOOP_VINFO_FULLY_MASKED_P (loop_vinfo)); gcc_assert (!nested_in_vect_loop); + gcc_assert (!LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo)); if (grouped_load) { @@ -9234,6 +9328,15 @@ vectorizable_load (vec_info *vinfo, = (loop_vinfo && LOOP_VINFO_FULLY_MASKED_P (loop_vinfo) ? &LOOP_VINFO_MASKS (loop_vinfo) : NULL); + + vec_loop_lens *loop_lens + = (loop_vinfo && LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo) + ? &LOOP_VINFO_LENS (loop_vinfo) + : NULL); + + /* Shouldn't go with length if fully masked. */ + gcc_assert (!loop_lens || (loop_lens && !loop_masks)); + /* Targets with store-lane instructions must not require explicit realignment. vect_supportable_dr_alignment always returns either dr_aligned or dr_unaligned_supported for masked operations. 
*/ @@ -9555,15 +9658,20 @@ vectorizable_load (vec_info *vinfo, for (i = 0; i < vec_num; i++) { tree final_mask = NULL_TREE; + tree final_len = NULL_TREE; if (loop_masks && memory_access_type != VMAT_INVARIANT) final_mask = vect_get_loop_mask (gsi, loop_masks, vec_num * ncopies, vectype, vec_num * j + i); + else if (loop_lens && memory_access_type != VMAT_INVARIANT) + final_len = vect_get_loop_len (loop_lens, vec_num * ncopies, + vec_num * j + i); if (vec_mask) final_mask = prepare_load_store_mask (mask_vectype, final_mask, vec_mask, gsi); + if (i > 0) dataref_ptr = bump_vector_ptr (vinfo, dataref_ptr, ptr_incr, gsi, stmt_info, bump); @@ -9629,6 +9737,18 @@ vectorizable_load (vec_info *vinfo, new_stmt = call; data_ref = NULL_TREE; } + else if (final_len) + { + align = least_bit_hwi (misalign | align); + tree ptr = build_int_cst (ref_type, align); + gcall *call + = gimple_build_call_internal (IFN_LEN_LOAD, 3, + dataref_ptr, ptr, + final_len); + gimple_call_set_nothrow (call, true); + new_stmt = call; + data_ref = NULL_TREE; + } else { tree ltype = vectype; @@ -12480,3 +12600,35 @@ vect_get_vector_types_for_stmt (vec_info *vinfo, stmt_vec_info stmt_info, *nunits_vectype_out = nunits_vectype; return opt_result::success (); } + +/* Generate and return statement sequence that sets vector length LEN that is: + + min_of_start_and_end = min (START_INDEX, END_INDEX); + left_bytes = END_INDEX - min_of_start_and_end; + rhs = min (left_bytes, VECTOR_SIZE); + LEN = rhs; + + TODO: for now, rs6000 supported vector with length only cares 8-bits, which + means if we have left_bytes larger than 255, it can't be saturated to vector + size. One target hook can be provided if other ports don't suffer this. +*/ + +gimple_seq +vect_gen_len (tree len, tree start_index, tree end_index, tree vector_size) +{ + gimple_seq stmts = NULL; + tree len_type = TREE_TYPE (len); + gcc_assert (TREE_TYPE (start_index) == len_type); + + tree min = fold_build2 (MIN_EXPR, len_type, start_index, end_index); + tree left_bytes = fold_build2 (MINUS_EXPR, len_type, end_index, min); + left_bytes = fold_build2 (MIN_EXPR, len_type, left_bytes, vector_size); + + tree rhs = force_gimple_operand (left_bytes, &stmts, true, NULL_TREE); + gimple *new_stmt = gimple_build_assign (len, rhs); + gimple_stmt_iterator i = gsi_last (stmts); + gsi_insert_after_without_update (&i, new_stmt, GSI_CONTINUE_LINKING); + + return stmts; +} + diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h index 2eb3ab5d280..774d5025639 100644 --- a/gcc/tree-vectorizer.h +++ b/gcc/tree-vectorizer.h @@ -476,6 +476,21 @@ struct rgroup_masks { typedef auto_vec<rgroup_masks> vec_loop_masks; +/* Similar to masks above, the lengths needed by rgroups with nV vectors. */ +struct rgroup_lens +{ + /* The total bytes for any nS per iteration. */ + unsigned int nbytes_per_iter; + + /* Any vector type to use these lengths. */ + tree vec_type; + + /* A vector of nV lengths, in iteration order. */ + vec<tree> lens; +}; + +typedef auto_vec<rgroup_lens> vec_loop_lens; + typedef auto_vec<std::pair<data_reference*, tree> > drs_init_vec; /*-----------------------------------------------------------------*/ @@ -523,6 +538,10 @@ public: on inactive scalars. */ vec_loop_masks masks; + /* The lengths that a loop with length should use to avoid operating + on inactive scalars. */ + vec_loop_lens lens; + /* Set of scalar conditions that have loop mask applied. */ scalar_cond_masked_set_type scalar_cond_masked_set; @@ -626,6 +645,12 @@ public: /* True if have decided to use a fully-masked loop. 
*/ bool fully_masked_p; + /* Records whether we still have the option of using a length access loop. */ + bool can_with_length_p; + + /* True if have decided to use length access for the loop fully. */ + bool fully_with_length_p; + /* When we have grouped data accesses with gaps, we may introduce invalid memory accesses. We peel the last iteration of the loop to prevent this. */ @@ -689,6 +714,9 @@ public: #define LOOP_VINFO_VECTORIZABLE_P(L) (L)->vectorizable #define LOOP_VINFO_CAN_FULLY_MASK_P(L) (L)->can_fully_mask_p #define LOOP_VINFO_FULLY_MASKED_P(L) (L)->fully_masked_p +#define LOOP_VINFO_CAN_WITH_LENGTH_P(L) (L)->can_with_length_p +#define LOOP_VINFO_FULLY_WITH_LENGTH_P(L) (L)->fully_with_length_p +#define LOOP_VINFO_LENS(L) (L)->lens #define LOOP_VINFO_VECT_FACTOR(L) (L)->vectorization_factor #define LOOP_VINFO_MAX_VECT_FACTOR(L) (L)->max_vectorization_factor #define LOOP_VINFO_MASKS(L) (L)->masks @@ -1842,6 +1870,10 @@ extern void vect_record_loop_mask (loop_vec_info, vec_loop_masks *, unsigned int, tree, tree); extern tree vect_get_loop_mask (gimple_stmt_iterator *, vec_loop_masks *, unsigned int, tree, unsigned int); +extern void vect_record_loop_len (loop_vec_info, vec_loop_lens *, unsigned int, + tree); +extern tree vect_get_loop_len (vec_loop_lens *, unsigned int, unsigned int); +extern gimple_seq vect_gen_len (tree, tree, tree, tree); extern stmt_vec_info info_for_reduction (vec_info *, stmt_vec_info); /* Drive for loop transformation stage. */
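As a closing illustration of what the length-based approach buys, here is a hand-written sketch (not compiler output; the 16-byte vector size and the function name are assumptions made only for illustration) of the shape of a length-controlled loop: each iteration processes min (remaining, vector_size) bytes, so no scalar epilogue is needed:

/* Hand-written sketch of a length-controlled copy loop; a real target
   would use IFN_LEN_LOAD / IFN_LEN_STORE for the partial accesses.  */
void
copy_with_len (unsigned char *dst, const unsigned char *src, unsigned long n)
{
  const unsigned long vector_size = 16;	/* assumed vector size in bytes */
  for (unsigned long i = 0; i < n; i += vector_size)
    {
      unsigned long left = n - i;
      unsigned long len = left < vector_size ? left : vector_size;
      __builtin_memcpy (dst + i, src + i, len);
    }
}

The lengths here are byte counts rather than lane counts, matching the nbytes_per_iter bookkeeping in rgroup_lens above.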