Message ID | a06e714e-04c3-8a2f-fa1d-02a72aecf7f4@linux.ibm.com
State      | New
Series     | [v2] vect/rs6000: Support vector with length cost modeling
On Wed, Jul 22, 2020 at 3:26 AM Kewen.Lin <linkw@linux.ibm.com> wrote:
>
> Hi Richard,
>
> on 2020/7/21 3:57 PM, Richard Biener wrote:
> > On Tue, Jul 21, 2020 at 7:52 AM Kewen.Lin <linkw@linux.ibm.com> wrote:
> >>
> >> Hi,
> >>
> >> This patch adds the cost modeling for vector with length.  It mainly
> >> follows what we generate for vector with length in the functions
> >> vect_set_loop_controls_directly and vect_gen_len, assuming the worst
> >> case.
> >>
> >> For Power, the length is expected to be in bits 0-7 (high bits), so we
> >> have to model the cost of shifting the bits.  To allow other targets
> >> not to suffer this, I used one target hook to describe this extra
> >> cost; I'm not sure whether that is the correct way.
> >>
> >> Bootstrapped/regtested on powerpc64le-linux-gnu (P9) with explicit
> >> param vect-partial-vector-usage=1.
> >>
> >> Any comments/suggestions are highly appreciated!
> >
> > I don't like the introduction of an extra target hook for this.  All
> > vectorizer cost modeling should ideally go through
> > init_cost/add_stmt_cost/finish_cost.  If the extra costing is
> > not per stmt then either init_cost or finish_cost is appropriate.
> > Currently init_cost only gets a struct loop while we should
> > probably give it a vec_info * parameter so targets can
> > check LOOP_VINFO_USING_PARTIAL_VECTORS_P and friends.
>
> Thanks!  Nice, your suggested way looks better.  I've removed the hook
> and taken care of it in finish_cost.  The updated v2 is attached.
>
> Bootstrapped/regtested again on powerpc64le-linux-gnu (P9) with explicit
> param vect-partial-vector-usage=1.

LGTM (assuming the first larger hunk is mostly re-indenting under
LOOP_VINFO_USING_PARTIAL_VECTORS_P).

Thanks,
Richard.

> BR,
> Kewen
> -----
> gcc/ChangeLog:
>
>         * config/rs6000/rs6000.c (adjust_vect_cost): New function.
>         (rs6000_finish_cost): Call function adjust_vect_cost.
>         * tree-vect-loop.c (vect_estimate_min_profitable_iters): Add cost
>         modeling for vector with length.
Hi Richard,

on 2020/7/22 2:38 PM, Richard Biener wrote:
> On Wed, Jul 22, 2020 at 3:26 AM Kewen.Lin <linkw@linux.ibm.com> wrote:
>> [...]
>> Thanks!  Nice, your suggested way looks better.  I've removed the hook
>> and taken care of it in finish_cost.  The updated v2 is attached.
>>
>> Bootstrapped/regtested again on powerpc64le-linux-gnu (P9) with explicit
>> param vect-partial-vector-usage=1.
>
> LGTM (assuming the first larger hunk is mostly re-indenting
> under LOOP_VINFO_USING_PARTIAL_VECTORS_P).

Thanks for the review!
Yes, for the original LOOP_VINFO_FULLY_MASKED_P hunk, this patch moves
the handling of gap peeling so it is shared between masking and length,
and re-indents the remaining masking-specific code under an inner
LOOP_VINFO_FULLY_MASKED_P check.  The length-specific code goes into the
else arm.  It shouldn't change anything for masking; I'll run aarch64
regression testing to make sure. :)

BR,
Kewen
"Kewen.Lin" <linkw@linux.ibm.com> writes:
> [...]
> Thanks!  Nice, your suggested way looks better.  I've removed the hook
> and taken care of it in finish_cost.  The updated v2 is attached.
>
> Bootstrapped/regtested again on powerpc64le-linux-gnu (P9) with explicit
> param vect-partial-vector-usage=1.
>
> BR,
> Kewen
> -----
> gcc/ChangeLog:
>
>         * config/rs6000/rs6000.c (adjust_vect_cost): New function.
>         (rs6000_finish_cost): Call function adjust_vect_cost.
>         * tree-vect-loop.c (vect_estimate_min_profitable_iters): Add cost
>         modeling for vector with length.
> [... the full v2 patch was quoted here; trimmed, see the patch below ...]
> [...]
> +      else
> +        {
> +          gcc_assert (LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo));
> +
> +          /* Consider cost for LOOP_VINFO_PEELING_FOR_ALIGNMENT.  */
> +          if (npeel < 0)
> +            {
> +              peel_iters_prologue = assumed_vf / 2;
> +              /* See below, if peeled iterations are unknown, count a taken
> +                 branch and a not taken branch per peeled loop.  */
> +              (void) add_stmt_cost (loop_vinfo, target_cost_data, 1,
> +                                    cond_branch_taken, NULL, NULL_TREE, 0,
> +                                    vect_prologue);
> +              (void) add_stmt_cost (loop_vinfo, target_cost_data, 1,
> +                                    cond_branch_not_taken, NULL, NULL_TREE, 0,
> +                                    vect_prologue);
> +            }
> +          else
> +            {
> +              peel_iters_prologue = npeel;
> +              if (!LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo))
> +                /* See vect_get_known_peeling_cost, if peeled iterations are
> +                   known but number of scalar loop iterations are unknown,
> +                   count a taken branch per peeled loop.  */
> +                (void) add_stmt_cost (loop_vinfo, target_cost_data, 1,
> +                                      cond_branch_taken, NULL, NULL_TREE, 0,
> +                                      vect_prologue);
> +            }

I think it'd be good to avoid duplicating this.  How about the following
structure?

  if (vect_use_loop_mask_for_alignment_p (…))
    {
      peel_iters_prologue = 0;
      peel_iters_epilogue = 0;
    }
  else if (npeel < 0)
    {
      … // A
    }
  else
    {
      …vect_get_known_peeling_cost stuff…
    }

but in A and vect_get_known_peeling_cost, set peel_iters_epilogue to:

  LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo) ? 1 : 0

for LOOP_VINFO_USING_PARTIAL_VECTORS_P, instead of setting it to whatever
value we'd normally use.

Then wrap:

  (void) add_stmt_cost (loop_vinfo, target_cost_data, 1, cond_branch_taken,
                        NULL, NULL_TREE, 0, vect_epilogue);
  (void) add_stmt_cost (loop_vinfo, target_cost_data, 1,
                        cond_branch_not_taken, NULL, NULL_TREE, 0,
                        vect_epilogue);

in !LOOP_VINFO_USING_PARTIAL_VECTORS_P and make the other vect_epilogue
stuff in A conditional on peel_iters_epilogue != 0.

This will also remove the need for the existing LOOP_VINFO_FULLY_MASKED_P
code:

  if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo))
    {
      /* We need to peel exactly one iteration.  */
      peel_iters_epilogue += 1;
      stmt_info_for_cost *si;
      int j;
      FOR_EACH_VEC_ELT (LOOP_VINFO_SCALAR_ITERATION_COST (loop_vinfo),
                        j, si)
        (void) add_stmt_cost (loop_vinfo, target_cost_data, si->count,
                              si->kind, si->stmt_info, si->vectype,
                              si->misalign, vect_epilogue);
    }

Then, after the above, have:

  if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
    …add costs for mask overhead…
  else if (LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo))
    …add costs for lengths overhead…

So we'd have one block of code for estimating the prologue and epilogue
peeling cost, and a separate block of code for the loop control overhead.

Thanks,
Richard
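The epilogue rule in this suggestion can be stated compactly.  As a
hypothetical standalone model (plain C, not GCC code; the function and
parameter names are invented here):

```c
#include <stdbool.h>

/* Model of the suggested rule: with partial vectors the epilogue peels at
   most the one iteration needed for gaps; otherwise the normal estimate
   applies.  */
static unsigned int
model_peel_iters_epilogue (bool using_partial_vectors_p,
                           bool peeling_for_gaps_p,
                           unsigned int normal_epilogue_iters)
{
  if (using_partial_vectors_p)
    return peeling_for_gaps_p ? 1 : 0;
  return normal_epilogue_iters;
}
```

So with partial vectors the epilogue cost collapses to either zero or one
scalar iteration, which is why the prologue/epilogue peeling estimate and
the loop-control overhead can be computed in separate blocks.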
Hi!

On Wed, Jul 22, 2020 at 09:26:39AM +0800, Kewen.Lin wrote:
> +/* For some target specific vectorization cost which can't be handled per stmt,
> +   we check the requisite conditions and adjust the vectorization cost
> +   accordingly if satisfied.  One typical example is to model shift cost for
> +   vector with length by counting number of required lengths under condition
> +   LOOP_VINFO_FULLY_WITH_LENGTH_P.  */
> +
> +static void
> +adjust_vect_cost (rs6000_cost_data *data)
> +{

Maybe call it rs6000_adjust_vect_cost?  For consistency, but also it
could (in the future) collide with a global function of the same name
(it is a very non-specific name).

> +          /* Each length needs one shift to fill into bits 0-7.  */
> +          shift_cnt += (num_vectors_m1 + 1);

That doesn't need parentheses.

>    if (cost_data->loop_info)
> -    rs6000_density_test (cost_data);
> +    {
> +      adjust_vect_cost (cost_data);
> +      rs6000_density_test (cost_data);
> +    }

^^^ consistency :-)

The rs6000 parts are fine for trunk, thanks!


Segher
Hi Segher,

Thanks for the comments!

on 2020/7/23 1:49 AM, Segher Boessenkool wrote:
> Hi!
>
> On Wed, Jul 22, 2020 at 09:26:39AM +0800, Kewen.Lin wrote:
>> +static void
>> +adjust_vect_cost (rs6000_cost_data *data)
>> +{
>
> Maybe call it rs6000_adjust_vect_cost?  For consistency, but also it
> could (in the future) collide with a global function of the same name
> (it is a very non-specific name).

Done in v4, used rs6000_adjust_vect_cost_per_loop.

>> +          /* Each length needs one shift to fill into bits 0-7.  */
>> +          shift_cnt += (num_vectors_m1 + 1);
>
> That doesn't need parentheses.

Done in v4.

>>    if (cost_data->loop_info)
>> -    rs6000_density_test (cost_data);
>> +    {
>> +      adjust_vect_cost (cost_data);
>> +      rs6000_density_test (cost_data);
>> +    }
>
> ^^^ consistency :-)
>
> The rs6000 parts are fine for trunk, thanks!

Thanks!

BR,
Kewen
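To see the shift accounting Segher reviewed in isolation, here is a
hypothetical standalone C model (not the GCC code; struct `rgroup_model`
and `count_length_shifts` are invented names) of how adjust_vect_cost
counts one shift per vector in each active rgroup to move the length into
bits 0-7:

```c
#include <stddef.h>

/* Hypothetical stand-in for GCC's rgroup_controls: only the two fields
   the costing loop looks at.  */
struct rgroup_model
{
  int type;                     /* nonzero when the rgroup is in use */
  unsigned int num_vectors_m1;  /* number of vectors minus one */
};

/* Mirrors the loop in adjust_vect_cost: each active rgroup contributes
   one shift per vector, i.e. num_vectors_m1 + 1.  */
static unsigned int
count_length_shifts (const struct rgroup_model *rgs, size_t n)
{
  unsigned int shift_cnt = 0;
  for (size_t i = 0; i < n; i++)
    if (rgs[i].type)
      shift_cnt += rgs[i].num_vectors_m1 + 1;
  return shift_cnt;
}
```

For example, two active rgroups with one and two vectors respectively
yield three shifts; that count is what the patch feeds to
rs6000_add_stmt_cost as scalar_stmt costs in vect_body.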
diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
index 5a4f07d5810..f2724e792c9 100644
--- a/gcc/config/rs6000/rs6000.c
+++ b/gcc/config/rs6000/rs6000.c
@@ -5177,6 +5177,34 @@ rs6000_add_stmt_cost (class vec_info *vinfo, void *data, int count,
   return retval;
 }
 
+/* For some target specific vectorization cost which can't be handled per stmt,
+   we check the requisite conditions and adjust the vectorization cost
+   accordingly if satisfied.  One typical example is to model shift cost for
+   vector with length by counting number of required lengths under condition
+   LOOP_VINFO_FULLY_WITH_LENGTH_P.  */
+
+static void
+adjust_vect_cost (rs6000_cost_data *data)
+{
+  struct loop *loop = data->loop_info;
+  gcc_assert (loop);
+  loop_vec_info loop_vinfo = loop_vec_info_for_loop (loop);
+
+  if (LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo))
+    {
+      rgroup_controls *rgc;
+      unsigned int num_vectors_m1;
+      unsigned int shift_cnt = 0;
+      FOR_EACH_VEC_ELT (LOOP_VINFO_LENS (loop_vinfo), num_vectors_m1, rgc)
+        if (rgc->type)
+          /* Each length needs one shift to fill into bits 0-7.  */
+          shift_cnt += (num_vectors_m1 + 1);
+
+      rs6000_add_stmt_cost (loop_vinfo, (void *) data, shift_cnt, scalar_stmt,
+                            NULL, NULL_TREE, 0, vect_body);
+    }
+}
+
 /* Implement targetm.vectorize.finish_cost.  */
 
 static void
@@ -5186,7 +5214,10 @@ rs6000_finish_cost (void *data, unsigned *prologue_cost,
   rs6000_cost_data *cost_data = (rs6000_cost_data*) data;
 
   if (cost_data->loop_info)
-    rs6000_density_test (cost_data);
+    {
+      adjust_vect_cost (cost_data);
+      rs6000_density_test (cost_data);
+    }
 
   /* Don't vectorize minimum-vectorization-factor, simple copy loops
      that require versioning for any reason.
diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
index e933441b922..99e1fd7bdd0 100644
--- a/gcc/tree-vect-loop.c
+++ b/gcc/tree-vect-loop.c
@@ -3652,7 +3652,7 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
      TODO: Build an expression that represents peel_iters for prologue and
      epilogue to be used in a run-time test.  */
 
-  if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
+  if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo))
     {
       peel_iters_prologue = 0;
       peel_iters_epilogue = 0;
@@ -3663,45 +3663,145 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
           peel_iters_epilogue += 1;
           stmt_info_for_cost *si;
           int j;
-          FOR_EACH_VEC_ELT (LOOP_VINFO_SCALAR_ITERATION_COST (loop_vinfo),
-                            j, si)
+          FOR_EACH_VEC_ELT (LOOP_VINFO_SCALAR_ITERATION_COST (loop_vinfo), j,
+                            si)
             (void) add_stmt_cost (loop_vinfo, target_cost_data, si->count,
                                   si->kind, si->stmt_info, si->vectype,
                                   si->misalign, vect_epilogue);
         }
 
-      /* Calculate how many masks we need to generate.  */
-      unsigned int num_masks = 0;
-      rgroup_controls *rgm;
-      unsigned int num_vectors_m1;
-      FOR_EACH_VEC_ELT (LOOP_VINFO_MASKS (loop_vinfo), num_vectors_m1, rgm)
-        if (rgm->type)
-          num_masks += num_vectors_m1 + 1;
-      gcc_assert (num_masks > 0);
-
-      /* In the worst case, we need to generate each mask in the prologue
-         and in the loop body.  One of the loop body mask instructions
-         replaces the comparison in the scalar loop, and since we don't
-         count the scalar comparison against the scalar body, we shouldn't
-         count that vector instruction against the vector body either.
-
-         Sometimes we can use unpacks instead of generating prologue
-         masks and sometimes the prologue mask will fold to a constant,
-         so the actual prologue cost might be smaller.  However, it's
-         simpler and safer to use the worst-case cost; if this ends up
-         being the tie-breaker between vectorizing or not, then it's
-         probably better not to vectorize.  */
-      (void) add_stmt_cost (loop_vinfo,
-                            target_cost_data, num_masks, vector_stmt,
-                            NULL, NULL_TREE, 0, vect_prologue);
-      (void) add_stmt_cost (loop_vinfo,
-                            target_cost_data, num_masks - 1, vector_stmt,
-                            NULL, NULL_TREE, 0, vect_body);
-    }
-  else if (LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo))
-    {
-      peel_iters_prologue = 0;
-      peel_iters_epilogue = 0;
+      if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
+        {
+          /* Calculate how many masks we need to generate.  */
+          unsigned int num_masks = 0;
+          rgroup_controls *rgm;
+          unsigned int num_vectors_m1;
+          FOR_EACH_VEC_ELT (LOOP_VINFO_MASKS (loop_vinfo), num_vectors_m1, rgm)
+            if (rgm->type)
+              num_masks += num_vectors_m1 + 1;
+          gcc_assert (num_masks > 0);
+
+          /* In the worst case, we need to generate each mask in the prologue
+             and in the loop body.  One of the loop body mask instructions
+             replaces the comparison in the scalar loop, and since we don't
+             count the scalar comparison against the scalar body, we shouldn't
+             count that vector instruction against the vector body either.
+
+             Sometimes we can use unpacks instead of generating prologue
+             masks and sometimes the prologue mask will fold to a constant,
+             so the actual prologue cost might be smaller.  However, it's
+             simpler and safer to use the worst-case cost; if this ends up
+             being the tie-breaker between vectorizing or not, then it's
+             probably better not to vectorize.  */
+          (void) add_stmt_cost (loop_vinfo, target_cost_data, num_masks,
+                                vector_stmt, NULL, NULL_TREE, 0, vect_prologue);
+          (void) add_stmt_cost (loop_vinfo, target_cost_data, num_masks - 1,
+                                vector_stmt, NULL, NULL_TREE, 0, vect_body);
+        }
+      else
+        {
+          gcc_assert (LOOP_VINFO_FULLY_WITH_LENGTH_P (loop_vinfo));
+
+          /* Consider cost for LOOP_VINFO_PEELING_FOR_ALIGNMENT.  */
+          if (npeel < 0)
+            {
+              peel_iters_prologue = assumed_vf / 2;
+              /* See below, if peeled iterations are unknown, count a taken
+                 branch and a not taken branch per peeled loop.  */
+              (void) add_stmt_cost (loop_vinfo, target_cost_data, 1,
+                                    cond_branch_taken, NULL, NULL_TREE, 0,
+                                    vect_prologue);
+              (void) add_stmt_cost (loop_vinfo, target_cost_data, 1,
                                    cond_branch_not_taken, NULL, NULL_TREE, 0,
+                                    vect_prologue);
+            }
+          else
+            {
+              peel_iters_prologue = npeel;
+              if (!LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo))
+                /* See vect_get_known_peeling_cost, if peeled iterations are
+                   known but number of scalar loop iterations are unknown,
+                   count a taken branch per peeled loop.  */
+                (void) add_stmt_cost (loop_vinfo, target_cost_data, 1,
+                                      cond_branch_taken, NULL, NULL_TREE, 0,
+                                      vect_prologue);
+            }
+
+          stmt_info_for_cost *si;
+          int j;
+          FOR_EACH_VEC_ELT (LOOP_VINFO_SCALAR_ITERATION_COST (loop_vinfo), j,
+                            si)
+            (void) add_stmt_cost (loop_vinfo, target_cost_data,
+                                  si->count * peel_iters_prologue, si->kind,
+                                  si->stmt_info, si->vectype, si->misalign,
+                                  vect_prologue);
+
+          /* Refer to the functions vect_set_loop_condition_partial_vectors
+             and vect_set_loop_controls_directly, we need to generate each
+             length in the prologue and in the loop body if required.  Although
+             there are some possible optimization, we consider the worst case
+             here.  */
+
+          /* For now we only operate length-based partial vectors on Power,
+             which has constant VF all the time, we need some tweakings below
+             if it doesn't hold in future.  */
+          gcc_assert (LOOP_VINFO_VECT_FACTOR (loop_vinfo).is_constant ());
+
+          /* For wrap around checking.  */
+          tree compare_type = LOOP_VINFO_RGROUP_COMPARE_TYPE (loop_vinfo);
+          unsigned int compare_precision = TYPE_PRECISION (compare_type);
+          widest_int iv_limit = vect_iv_limit_for_partial_vectors (loop_vinfo);
+
+          bool niters_known_p = LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo);
+          bool need_iterate_p
+            = (!LOOP_VINFO_EPILOGUE_P (loop_vinfo)
+               && !vect_known_niters_smaller_than_vf (loop_vinfo));
+
+          /* Init min/max, shift and minus cost relative to single
+             scalar_stmt.  For now we only use length-based partial vectors on
+             Power, target specific cost tweaking may be needed for other
+             ports in future.  */
+          unsigned int min_max_cost = 2;
+          unsigned int shift_cost = 1, minus_cost = 1;
+
+          /* Init cost relative to single scalar_stmt.  */
+          unsigned int prol_cnt = 0;
+          unsigned int body_cnt = 0;
+
+          rgroup_controls *rgc;
+          unsigned int num_vectors_m1;
+          FOR_EACH_VEC_ELT (LOOP_VINFO_LENS (loop_vinfo), num_vectors_m1, rgc)
+            if (rgc->type)
+              {
+                unsigned nitems = rgc->max_nscalars_per_iter * rgc->factor;
+
+                /* Need one shift for niters_total computation.  */
+                if (!niters_known_p && nitems != 1)
+                  prol_cnt += shift_cost;
+
+                /* Need to handle wrap around.  */
+                if (iv_limit == -1
+                    || (wi::min_precision (iv_limit * nitems, UNSIGNED)
+                        > compare_precision))
+                  prol_cnt += (min_max_cost + minus_cost);
+
+                /* Need to handle batch limit excepting for the 1st one.  */
+                prol_cnt += (min_max_cost + minus_cost) * num_vectors_m1;
+
+                unsigned int num_vectors = num_vectors_m1 + 1;
+                /* Need to set up lengths in prologue, only one MIN required
+                   since start index is zero.  */
+                prol_cnt += min_max_cost * num_vectors;
+
+                /* Need to update lengths in body for next iteration.  */
+                if (need_iterate_p)
+                  body_cnt += (2 * min_max_cost + minus_cost) * num_vectors;
+              }
+
+          (void) add_stmt_cost (loop_vinfo, target_cost_data, prol_cnt,
+                                scalar_stmt, NULL, NULL_TREE, 0, vect_prologue);
+          (void) add_stmt_cost (loop_vinfo, target_cost_data, body_cnt,
+                                scalar_stmt, NULL, NULL_TREE, 0, vect_body);
+        }
     }
   else if (npeel < 0)
     {
@@ -3913,8 +4013,8 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
     }
 
   /* ??? The "if" arm is written to handle all cases; see below for what
-     we would do for !LOOP_VINFO_FULLY_MASKED_P.  */
-  if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
+     we would do for !LOOP_VINFO_USING_PARTIAL_VECTORS_P.  */
+  if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo))
     {
       /* Rewriting the condition above in terms of the number of
          vector iterations (vniters) rather than the number of
@@ -3941,7 +4041,7 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
         dump_printf (MSG_NOTE, "  Minimum number of vector iterations: %d\n",
                      min_vec_niters);
 
-      if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
+      if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo))
        {
          /* Now that we know the minimum number of vector iterations,
             find the minimum niters for which the scalar cost is larger:
@@ -3996,6 +4096,10 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
           && min_profitable_iters < (assumed_vf + peel_iters_prologue))
     /* We want the vectorized loop to execute at least once.  */
     min_profitable_iters = assumed_vf + peel_iters_prologue;
+  else if (min_profitable_iters < peel_iters_prologue)
+    /* For LOOP_VINFO_USING_PARTIAL_VECTORS_P, we need to ensure the
+       vectorized loop to execute at least once.  */
+    min_profitable_iters = peel_iters_prologue;
 
   if (dump_enabled_p ())
     dump_printf_loc (MSG_NOTE, vect_location,
@@ -4013,7 +4117,7 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
 
   if (vec_outside_cost <= 0)
     min_profitable_estimate = 0;
-  else if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
+  else if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo))
     {
       /* This is a repeat of the code above, but with + SOC rather
         than - SOC.  */
@@ -4025,7 +4129,7 @@ vect_estimate_min_profitable_iters (loop_vec_info loop_vinfo,
       if (outside_overhead > 0)
        min_vec_niters = outside_overhead / saving_per_viter + 1;
 
-      if (LOOP_VINFO_FULLY_MASKED_P (loop_vinfo))
+      if (LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo))
        {
          int threshold = (vec_inside_cost * min_vec_niters
                           + vec_outside_cost
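As a rough standalone model of the worst-case length overhead counted in
the tree-vect-loop.c hunk above, the following hypothetical C sketch
(invented names, not the GCC code; the wrap-around check is reduced to a
flag parameter) reproduces the per-rgroup arithmetic with the patch's
relative costs (MIN/MAX = 2, shift = 1, minus = 1):

```c
/* Hypothetical model of the per-rgroup length overhead counted in
   vect_estimate_min_profitable_iters.  Costs are relative to one
   scalar_stmt, matching the constants in the patch.  */
struct length_cost
{
  unsigned int prologue;  /* statements to set up lengths */
  unsigned int body;      /* statements to update lengths per iteration */
};

static struct length_cost
model_length_cost (unsigned int num_vectors_m1, unsigned int nitems,
                   int niters_known_p, int need_wrap_around_p,
                   int need_iterate_p)
{
  const unsigned int min_max_cost = 2, shift_cost = 1, minus_cost = 1;
  unsigned int num_vectors = num_vectors_m1 + 1;
  struct length_cost c = { 0, 0 };

  /* One shift for the niters_total computation when niters is unknown.  */
  if (!niters_known_p && nitems != 1)
    c.prologue += shift_cost;

  /* A MIN/MAX plus a minus to guard against wrap around, if needed.  */
  if (need_wrap_around_p)
    c.prologue += min_max_cost + minus_cost;

  /* Batch-limit handling for every vector except the first.  */
  c.prologue += (min_max_cost + minus_cost) * num_vectors_m1;

  /* One MIN per vector to set up the lengths (start index is zero).  */
  c.prologue += min_max_cost * num_vectors;

  /* Two MIN/MAX and a minus per vector to update lengths in the body.  */
  if (need_iterate_p)
    c.body += (2 * min_max_cost + minus_cost) * num_vectors;

  return c;
}
```

For instance, a single-vector rgroup with unknown niters, a needed
wrap-around guard, and body iteration costs 6 prologue and 5 body
scalar_stmt units under this model; the totals over all rgroups are what
the patch passes to add_stmt_cost as vect_prologue and vect_body costs.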