Message ID | 7570cb71-cb74-d97f-3b7a-b161631e36c5@linux.vnet.ibm.com
State      | New
Series     | Fix PR82255 (vectorizer cost model overcounts some vector load costs)
On 9/19/17 12:38 PM, Bill Schmidt wrote:
> Hi,
>
> https://gcc.gnu.org/PR82255 identifies a problem in the vector cost model
> where a vectorized load is treated as having the cost of a strided load
> in a case where we will not actually generate a strided load.  This is
> simply a mismatch between the conditions tested in the cost model and
> those tested in the code that generates vectorized instructions.  This
> patch fixes the problem by recognizing when only a single non-strided
> load will be generated and reporting the cost accordingly.
>
> I believe this patch is sufficient to catch all such cases, but I admit
> that the code in vectorizable_load is complex enough that I could have
> missed a trick.
>
> I've added a test in the PowerPC cost model subdirectory.  Even though
> this isn't a target-specific issue, the test does rely on a 16-byte
> vector size, so this seems safest.
>
> Bootstrapped and tested on powerpc64le-linux-gnu with no regressions.
> Is this ok for trunk?

After posting, I realized that I had wrongly recalculated stmt_info in the
patch.  Here's a new version (also passing regstrap) that removes that flaw.

[gcc]

2017-09-19  Bill Schmidt  <wschmidt@linux.vnet.ibm.com>

	PR tree-optimization/82255
	* tree-vect-stmts.c (vect_model_load_cost): Don't count
	vec_construct cost when a true strided load isn't present.

[gcc/testsuite]

2017-09-19  Bill Schmidt  <wschmidt@linux.vnet.ibm.com>

	PR tree-optimization/82255
	* gcc.dg/vect/costmodel/ppc/costmodel-pr82255.c: New file.

Index: gcc/testsuite/gcc.dg/vect/costmodel/ppc/costmodel-pr82255.c
===================================================================
--- gcc/testsuite/gcc.dg/vect/costmodel/ppc/costmodel-pr82255.c	(nonexistent)
+++ gcc/testsuite/gcc.dg/vect/costmodel/ppc/costmodel-pr82255.c	(working copy)
@@ -0,0 +1,30 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target vect_int } */
+
+/* PR82255: Ensure we don't require a vec_construct cost when we aren't
+   going to generate a strided load.  */
+
+extern int abs (int __x) __attribute__ ((__nothrow__, __leaf__)) __attribute__ ((__const__));
+
+static int
+foo (unsigned char *w, int i, unsigned char *x, int j)
+{
+  int tot = 0;
+  for (int a = 0; a < 16; a++)
+    {
+      for (int b = 0; b < 16; b++)
+	tot += abs (w[b] - x[b]);
+      w += i;
+      x += j;
+    }
+  return tot;
+}
+
+void
+bar (unsigned char *w, unsigned char *x, int i, int *result)
+{
+  *result = foo (w, 16, x, i);
+}
+
+/* { dg-final { scan-tree-dump-times "vec_construct required" 0 "vect" } } */

Index: gcc/tree-vect-stmts.c
===================================================================
--- gcc/tree-vect-stmts.c	(revision 252760)
+++ gcc/tree-vect-stmts.c	(working copy)
@@ -1091,8 +1091,19 @@ vect_model_load_cost (stmt_vec_info stmt_info, int
 			 prologue_cost_vec, body_cost_vec, true);
   if (memory_access_type == VMAT_ELEMENTWISE
       || memory_access_type == VMAT_STRIDED_SLP)
-    inside_cost += record_stmt_cost (body_cost_vec, ncopies, vec_construct,
-				     stmt_info, 0, vect_body);
+    {
+      int group_size = GROUP_SIZE (stmt_info);
+      int nunits = TYPE_VECTOR_SUBPARTS (STMT_VINFO_VECTYPE (stmt_info));
+      if (group_size < nunits)
+	{
+	  if (dump_enabled_p ())
+	    dump_printf_loc (MSG_NOTE, vect_location,
+			     "vect_model_load_cost: vec_construct required");
+	  inside_cost += record_stmt_cost (body_cost_vec, ncopies,
+					   vec_construct, stmt_info, 0,
+					   vect_body);
+	}
+    }

   if (dump_enabled_p ())
     dump_printf_loc (MSG_NOTE, vect_location,
Bill Schmidt <wschmidt@linux.vnet.ibm.com> writes:
> Index: gcc/tree-vect-stmts.c
> ===================================================================
> --- gcc/tree-vect-stmts.c	(revision 252760)
> +++ gcc/tree-vect-stmts.c	(working copy)
> @@ -1091,8 +1091,19 @@ vect_model_load_cost (stmt_vec_info stmt_info, int
> 			 prologue_cost_vec, body_cost_vec, true);
>    if (memory_access_type == VMAT_ELEMENTWISE
>        || memory_access_type == VMAT_STRIDED_SLP)
> -    inside_cost += record_stmt_cost (body_cost_vec, ncopies, vec_construct,
> -				     stmt_info, 0, vect_body);
> +    {
> +      int group_size = GROUP_SIZE (stmt_info);
> +      int nunits = TYPE_VECTOR_SUBPARTS (STMT_VINFO_VECTYPE (stmt_info));
> +      if (group_size < nunits)
> +	{
> +	  if (dump_enabled_p ())
> +	    dump_printf_loc (MSG_NOTE, vect_location,
> +			     "vect_model_load_cost: vec_construct required");
> +	  inside_cost += record_stmt_cost (body_cost_vec, ncopies,
> +					   vec_construct, stmt_info, 0,
> +					   vect_body);
> +	}
> +    }
>
>    if (dump_enabled_p ())
>      dump_printf_loc (MSG_NOTE, vect_location,

This feels like we've probably got the wrong memory_access_type.
If it's just a contiguous load then it should be VMAT_CONTIGUOUS
instead.

Thanks,
Richard
On Sep 19, 2017, at 3:58 PM, Richard Sandiford <richard.sandiford@linaro.org> wrote:
>
> Bill Schmidt <wschmidt@linux.vnet.ibm.com> writes:
>> Index: gcc/tree-vect-stmts.c
>> [...]
>
> This feels like we've probably got the wrong memory_access_type.
> If it's just a contiguous load then it should be VMAT_CONTIGUOUS
> instead.

I tend to agree -- that will take more surgery to the code that detects
strided loads in both the cost model analysis and in vectorizable_load.

Bill

>
> Thanks,
> Richard
On Sep 19, 2017, at 4:21 PM, Bill Schmidt <wschmidt@linux.vnet.ibm.com> wrote:
>
> On Sep 19, 2017, at 3:58 PM, Richard Sandiford <richard.sandiford@linaro.org> wrote:
>>
>> Bill Schmidt <wschmidt@linux.vnet.ibm.com> writes:
>>> Index: gcc/tree-vect-stmts.c
>>> [...]
>>
>> This feels like we've probably got the wrong memory_access_type.
>> If it's just a contiguous load then it should be VMAT_CONTIGUOUS
>> instead.
>
> I tend to agree -- that will take more surgery to the code that detects
> strided loads in both the cost model analysis and in vectorizable_load.
>
It's just a little less clear how to clean this up in
vect_analyze_data_refs.  The too-simplistic test there now is:

  else if (is_a <loop_vec_info> (vinfo)
	   && TREE_CODE (DR_STEP (dr)) != INTEGER_CST)
    {
      if (nested_in_vect_loop_p (loop, stmt))
	{
	  if (dump_enabled_p ())
	    {
	      dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
			       "not vectorized: not suitable for strided "
			       "load ");
	      dump_gimple_stmt (MSG_MISSED_OPTIMIZATION, TDF_SLIM, stmt, 0);
	    }
	  return false;
	}
      STMT_VINFO_STRIDED_P (stmt_info) = true;
    }

Furthermore, this is being set while processing each data reference, not
after all information has been gathered and analyzed.  So this is too
early.

Next place to fix this would be in vect_analyze_slp_cost_1, where

  vect_memory_access_type memory_access_type
    = (STMT_VINFO_STRIDED_P (stmt_info)
       ? VMAT_STRIDED_SLP
       : VMAT_CONTIGUOUS);

is too simplistic.  I guess we could do something here, but it would
require unpacking all the grouped accesses and duplicating all the
logic from scratch to determine whether it's a single load or not.

So it's going to be a little painful to do it "right."  I submitted this
patch because I felt it would be the simplest solution.

Thanks,
Bill

> Bill
>
>>
>> Thanks,
>> Richard
Bill Schmidt <wschmidt@linux.vnet.ibm.com> writes:
> On Sep 19, 2017, at 4:21 PM, Bill Schmidt <wschmidt@linux.vnet.ibm.com> wrote:
>> On Sep 19, 2017, at 3:58 PM, Richard Sandiford
>> <richard.sandiford@linaro.org> wrote:
>>> Bill Schmidt <wschmidt@linux.vnet.ibm.com> writes:
>>>> Index: gcc/tree-vect-stmts.c
>>>> [...]
>>>
>>> This feels like we've probably got the wrong memory_access_type.
>>> If it's just a contiguous load then it should be VMAT_CONTIGUOUS
>>> instead.
>>
>> I tend to agree -- that will take more surgery to the code that
>> detects strided loads in both the cost model analysis and in
>> vectorizable_load.
>>
> It's just a little less clear how to clean this up in
> vect_analyze_data_refs.  The too-simplistic test there now is:
>
>   else if (is_a <loop_vec_info> (vinfo)
> 	   && TREE_CODE (DR_STEP (dr)) != INTEGER_CST)
>     [...]
>       STMT_VINFO_STRIDED_P (stmt_info) = true;
>     }
>
> Furthermore, this is being set while processing each data reference, not
> after all information has been gathered and analyzed.  So this is too
> early.

Yeah, agree we can't tell at this stage.

> Next place to fix this would be in vect_analyze_slp_cost_1, where
>
>   vect_memory_access_type memory_access_type
>     = (STMT_VINFO_STRIDED_P (stmt_info)
>        ? VMAT_STRIDED_SLP
>        : VMAT_CONTIGUOUS);
>
> is too simplistic.  I guess we could do something here, but it would
> require unpacking all the grouped accesses and duplicating all the
> logic from scratch to determine whether it's a single load or not.

I think we can use:

  SLP_INSTANCE_UNROLLING_FACTOR (instance) * ncopies_for_cost

to calculate the number of times that the vector SLP statements repeat
the original scalar group.  We should then only use STRIDED_SLP if
that's greater than 1.

This affects stores as well as loads, so fixing it here will work
for both.

But I think this shows up another problem.  In the vectorised loop,
we have 1 copy of the load and 4 copies of the ABS (after unpacking).
But vect_analyze_slp_cost_1 is being called with ncopies_for_cost == 4
even for the loads.  So I think it's a double whammy: we're quadrupling
the real cost via the excessive ncopies_for_cost, and we're using the
wrong access type too.

The patch below should fix the second problem but isn't much use without
the first.  I think we need to adjust ncopies_for_cost in the recursive
calls as well.

Of course, fixing an overestimate here is likely to expose an
underestimate elsewhere. :-)

Interesting test case though.  It seems counterproductive to unroll
a loop like that before vectorisation.

Thanks,
Richard


Index: gcc/tree-vect-slp.c
===================================================================
--- gcc/tree-vect-slp.c	(revision 252986)
+++ gcc/tree-vect-slp.c	(working copy)
@@ -1687,8 +1687,10 @@
 	  stmt_info = vinfo_for_stmt (stmt);
 	  if (STMT_VINFO_GROUPED_ACCESS (stmt_info))
 	    {
+	      unsigned int group_ncopies = (SLP_INSTANCE_UNROLLING_FACTOR (instance)
+					    * ncopies_for_cost);
 	      vect_memory_access_type memory_access_type
-		= (STMT_VINFO_STRIDED_P (stmt_info)
+		= (STMT_VINFO_STRIDED_P (stmt_info) && group_ncopies > 1
 		   ? VMAT_STRIDED_SLP
 		   : VMAT_CONTIGUOUS);
 	      if (DR_IS_WRITE (STMT_VINFO_DATA_REF (stmt_info)))
On Sep 20, 2017, at 3:49 AM, Richard Sandiford <richard.sandiford@linaro.org> wrote:
>
> Bill Schmidt <wschmidt@linux.vnet.ibm.com> writes:
>> [...]
>> Next place to fix this would be in vect_analyze_slp_cost_1, where
>>
>>   vect_memory_access_type memory_access_type
>>     = (STMT_VINFO_STRIDED_P (stmt_info)
>>        ? VMAT_STRIDED_SLP
>>        : VMAT_CONTIGUOUS);
>>
>> is too simplistic.  I guess we could do something here, but it would
>> require unpacking all the grouped accesses and duplicating all the
>> logic from scratch to determine whether it's a single load or not.
>
> I think we can use:
>
>   SLP_INSTANCE_UNROLLING_FACTOR (instance) * ncopies_for_cost
>
> to calculate the number of times that the vector SLP statements repeat
> the original scalar group.  We should then only use STRIDED_SLP if
> that's greater than 1.
>
> This affects stores as well as loads, so fixing it here will work
> for both.

Thanks, Richard, that makes sense.  Let me play around with it a bit.

> But I think this shows up another problem.  In the vectorised loop,
> we have 1 copy of the load and 4 copies of the ABS (after unpacking).
> But vect_analyze_slp_cost_1 is being called with ncopies_for_cost == 4
> even for the loads.  So I think it's a double whammy: we're quadrupling
> the real cost via the excessive ncopies_for_cost, and we're using the
> wrong access type too.

Absolutely right, I failed to pay attention to this!  Good catch.

> The patch below should fix the second problem but isn't much use without
> the first.  I think we need to adjust ncopies_for_cost in the recursive
> calls as well.

Just to be clear, I think you mean it fixes the original problem (wrong
access type).  I'll look into how to fix the excessive ncopies_for_cost.

> Of course, fixing an overestimate here is likely to expose an
> underestimate elsewhere. :-)

Right...it was ever thus...

Very much appreciate your assistance!

Bill

> Interesting test case though.  It seems counterproductive to unroll
> a loop like that before vectorisation.
>
> Thanks,
> Richard
>
> Index: gcc/tree-vect-slp.c
> [...]
On Sep 20, 2017, at 3:49 AM, Richard Sandiford <richard.sandiford@linaro.org> wrote:
>
> But I think this shows up another problem.  In the vectorised loop,
> we have 1 copy of the load and 4 copies of the ABS (after unpacking).
> But vect_analyze_slp_cost_1 is being called with ncopies_for_cost == 4
> even for the loads.  So I think it's a double whammy: we're quadrupling
> the real cost via the excessive ncopies_for_cost, and we're using the
> wrong access type too.
>
> The patch below should fix the second problem but isn't much use without
> the first.  I think we need to adjust ncopies_for_cost in the recursive
> calls as well.

Unfortunately we have to deal with the fact that ncopies_for_cost is set
once for the whole SLP instance, which is not ideal since we know the
number of loads and stores doesn't follow the same rules.

What about the following?  I *think* that group_size / nunits is
sufficient for VMAT_STRIDED_SLP and VMAT_CONTIGUOUS, for which permuted
loads are not possible (already handled differently), but I could well
be missing something.

Index: gcc/tree-vect-slp.c
===================================================================
--- gcc/tree-vect-slp.c	(revision 252760)
+++ gcc/tree-vect-slp.c	(working copy)
@@ -1662,12 +1662,14 @@ vect_analyze_slp_cost_1 (slp_instance instance, sl
 	  stmt_info = vinfo_for_stmt (stmt);
 	  if (STMT_VINFO_GROUPED_ACCESS (stmt_info))
 	    {
+	      int group_size = GROUP_SIZE (stmt_info);
+	      int nunits = TYPE_VECTOR_SUBPARTS (STMT_VINFO_VECTYPE (stmt_info));
 	      vect_memory_access_type memory_access_type
-		= (STMT_VINFO_STRIDED_P (stmt_info)
+		= (STMT_VINFO_STRIDED_P (stmt_info) && group_size > nunits
 		   ? VMAT_STRIDED_SLP
 		   : VMAT_CONTIGUOUS);
 	      if (DR_IS_WRITE (STMT_VINFO_DATA_REF (stmt_info)))
-		vect_model_store_cost (stmt_info, ncopies_for_cost,
+		vect_model_store_cost (stmt_info, group_size / nunits,
 				       memory_access_type, vect_uninitialized_def,
 				       node, prologue_cost_vec, body_cost_vec);
 	      else
@@ -1714,6 +1716,8 @@ vect_analyze_slp_cost_1 (slp_instance instance, sl
 			       + nunits - 1) / nunits);
 	      ncopies_for_cost *= SLP_INSTANCE_UNROLLING_FACTOR (instance);
 	    }
+	  else
+	    ncopies_for_cost = group_size / nunits;
 	  /* Record the cost for the vector loads.  */
 	  vect_model_load_cost (stmt_info, ncopies_for_cost,
 				memory_access_type, node, prologue_cost_vec,

Bill
On Sep 20, 2017, at 2:14 PM, Bill Schmidt <wschmidt@linux.vnet.ibm.com> wrote:
>
> [...]
>
> What about the following?  I *think* that group_size / nunits is
> sufficient for VMAT_STRIDED_SLP and VMAT_CONTIGUOUS, for which permuted
> loads are not possible (already handled differently), but I could well
> be missing something.

Uh, I need to do the obvious rounding of (group_size + nunits - 1) / nunits...

Bill

> Index: gcc/tree-vect-slp.c
> [...]
On Wed, Sep 20, 2017 at 9:17 PM, Bill Schmidt
<wschmidt@linux.vnet.ibm.com> wrote:
> On Sep 20, 2017, at 2:14 PM, Bill Schmidt <wschmidt@linux.vnet.ibm.com> wrote:
>>
>> [...]
>>
>> Unfortunately we have to deal with the fact that ncopies_for_cost is set
>> once for the whole SLP instance, which is not ideal since we know the
>> number of loads and stores doesn't follow the same rules.
>>
>> What about the following?  I *think* that group_size / nunits is
>> sufficient for VMAT_STRIDED_SLP and VMAT_CONTIGUOUS, for which permuted
>> loads are not possible (already handled differently), but I could well
>> be missing something.
>
> Uh, I need to do the obvious rounding of (group_size + nunits - 1) / nunits...

I don't think any such simple approach will result in appropriate
costing.  The current costing is conservative and you definitely don't
want to be overly optimistic here ;)

You'd have to fully emulate the vectorizable_load code.  Which means we
should probably refactor the code-gen part

  if (memory_access_type == VMAT_ELEMENTWISE
      || memory_access_type == VMAT_STRIDED_SLP)
    {
      ...

where it computes ltype, nloads, etc., do that computation up-front in
the analysis phase, storing the result somewhere in stmt_vinfo (err, or
slp_node or both -- think of hybrid SLP and/or the stmt used in
different SLP contexts) and use the info during transform.  And hope to
have enough info to statically compute the number of loads and their
size/alignment at costing time.

Or go the full way of restructuring costing to have a
prologue/body_cost_vec in stmt_info and slp_node so that we can compute
and store the cost from vectorizable_* rather than duplicating the hard
parts in SLP costing code.

Richard.

> Bill
>>
>> Index: gcc/tree-vect-slp.c
>> [...]
On Sep 21, 2017, at 2:37 AM, Richard Biener <richard.guenther@gmail.com> wrote:
>
> [...]
>
> I don't think any such simple approach will result in appropriate
> costing.  The current costing is conservative and you definitely don't
> want to be overly optimistic here ;)
>
> You'd have to fully emulate the vectorizable_load code.  Which means we
> should probably refactor the code-gen part
>
>   if (memory_access_type == VMAT_ELEMENTWISE
>       || memory_access_type == VMAT_STRIDED_SLP)
>     {
>       ...
>
> where it computes ltype, nloads, etc., do that computation up-front in
> the analysis phase, storing the result somewhere in stmt_vinfo (err, or
> slp_node or both -- think of hybrid SLP and/or the stmt used in
> different SLP contexts) and use the info during transform.  And hope to
> have enough info to statically compute the number of loads and their
> size/alignment at costing time.

I suppose so.  I was thinking that if these are truly strided loads, then
the cost is predictable -- but only for targets that have hardware strided
loads available, which are few, so this isn't conservative enough.  The
code in question didn't deal with VMAT_ELEMENTWISE, so the factoring is
even more complicated -- the structures of the costing logic and the
code-gen logic don't match well at all.

I'm unlikely to have time for a complex solution right now, so I'll stick
a pointer to this conversation in the bugzilla and take my name off in
case somebody gets to it sooner.  But I'll probably keep poking at it in
spare moments.

Thanks,
Bill

> Or go the full way of restructuring costing to have a
> prologue/body_cost_vec in stmt_info and slp_node so that we can compute
> and store the cost from vectorizable_* rather than duplicating the hard
> parts in SLP costing code.
>
> Richard.
On Thu, Sep 21, 2017 at 3:15 PM, Bill Schmidt <wschmidt@linux.vnet.ibm.com> wrote: > On Sep 21, 2017, at 2:37 AM, Richard Biener <richard.guenther@gmail.com> wrote >> >> On Wed, Sep 20, 2017 at 9:17 PM, Bill Schmidt >> <wschmidt@linux.vnet.ibm.com> wrote: >>> On Sep 20, 2017, at 2:14 PM, Bill Schmidt <wschmidt@linux.vnet.ibm.com> wrote: >>>> >>>> On Sep 20, 2017, at 3:49 AM, Richard Sandiford <richard.sandiford@linaro.org> wrote: >>>>> >>>>> But I think this shows up another problem. In the vectorised loop, >>>>> we have 1 copy of the load and 4 copies of the ABS (after unpacking). >>>>> But vect_analyze_slp_cost_1 is being called with ncopies_for_cost == 4 >>>>> even for the loads. So I think it's a double whammy: we've quadrupling >>>>> the real cost via the excessive ncopies_for_cost, and we're using the >>>>> wrong access type too. >>>>> >>>>> The patch below should fix the second problem but isn't much use without >>>>> the first. I think we need to adjust ncopies_for_cost in the recursive >>>>> calls as well. >>>> >>>> Unfortunately we have to deal with the fact that ncopies_for_cost is set >>>> once for the whole SLP instance, which is not ideal since we know the >>>> number of loads and stores doesn't follow the same rules. >>>> >>>> What about the following? I *think* that group_size / nunits is sufficient for >>>> VMAT_STRIDED_SLP and VMAT_CONTIGUOUS for which permuted loads are >>>> not possible (already handled differently), but I could well be missing something. >>> >>> Uh, I need to do the obvious rounding of (group_size + nunits - 1) / nunits... >> >> I don't think any such simple approach will result in appropriate costing. The >> current costing is conservative and you definitely don't want to be >> overly optimistic >> here ;) >> >> You'd have to fully emulate the vectorizable_load code. 
>> Which means we should probably refactor the code-gen part
>>
>>   if (memory_access_type == VMAT_ELEMENTWISE
>>       || memory_access_type == VMAT_STRIDED_SLP)
>>     {
>> ...
>>
>> where it computes ltype, nloads, etc., do that computation up-front in
>> the analysis phase, storing the result somewhere in stmt_vinfo (err, or
>> slp_node or both -- think of hybrid SLP and/or the stmt used in
>> different SLP contexts) and use the info during transform.  And hope to
>> have enough info to statically compute the number of loads and their
>> size/alignment at costing time.
>
> I suppose so.  I was thinking that if these are truly strided loads,
> then the cost is predictable -- but only for targets that have hardware
> strided loads available, which are few, so this isn't conservative
> enough.  The code in question didn't deal with VMAT_ELEMENTWISE, so the
> factoring is even more complicated -- the structures of the costing
> logic and the code gen logic don't match well at all.
>
> I'm unlikely to have time for a complex solution right now, so I'll
> stick a pointer to this conversation in the bugzilla and take my name
> off in case somebody gets to it sooner.  But I'll probably keep poking
> at it in spare moments.

I hope to get to some costmodel cleanups for GCC 8.

Richard.

> Thanks,
> Bill
>>
>> Or go the full way of restructuring costing to have a
>> prologue/body_cost_vec in stmt_info and slp_node so that we can compute
>> and store the cost from vectorizable_* rather than duplicating the hard
>> parts in SLP costing code.
>>
>> Richard.
>>
>>> Bill
>>>>
>>>> Index: gcc/tree-vect-slp.c
>>>> ===================================================================
>>>> --- gcc/tree-vect-slp.c (revision 252760)
>>>> +++ gcc/tree-vect-slp.c (working copy)
>>>> @@ -1662,12 +1662,14 @@ vect_analyze_slp_cost_1 (slp_instance instance, sl
>>>>        stmt_info = vinfo_for_stmt (stmt);
>>>>        if (STMT_VINFO_GROUPED_ACCESS (stmt_info))
>>>>          {
>>>> +          int group_size = GROUP_SIZE (stmt_info);
>>>> +          int nunits = TYPE_VECTOR_SUBPARTS (STMT_VINFO_VECTYPE (stmt_info));
>>>>            vect_memory_access_type memory_access_type
>>>> -            = (STMT_VINFO_STRIDED_P (stmt_info)
>>>> +            = (STMT_VINFO_STRIDED_P (stmt_info) && group_size > nunits
>>>>                ? VMAT_STRIDED_SLP
>>>>                : VMAT_CONTIGUOUS);
>>>>            if (DR_IS_WRITE (STMT_VINFO_DATA_REF (stmt_info)))
>>>> -            vect_model_store_cost (stmt_info, ncopies_for_cost,
>>>> +            vect_model_store_cost (stmt_info, group_size / nunits,
>>>>                                     memory_access_type, vect_uninitialized_def,
>>>>                                     node, prologue_cost_vec, body_cost_vec);
>>>>            else
>>>> @@ -1714,6 +1716,8 @@ vect_analyze_slp_cost_1 (slp_instance instance, sl
>>>>                                + nunits - 1) / nunits);
>>>>            ncopies_for_cost *= SLP_INSTANCE_UNROLLING_FACTOR (instance);
>>>>          }
>>>> +      else
>>>> +        ncopies_for_cost = group_size / nunits;
>>>>        /* Record the cost for the vector loads.  */
>>>>        vect_model_load_cost (stmt_info, ncopies_for_cost,
>>>>                              memory_access_type, node, prologue_cost_vec,
>>>>
>>>> Bill
>>>>
>>>
>>
>
Index: gcc/testsuite/gcc.dg/vect/costmodel/ppc/costmodel-pr82255.c
===================================================================
--- gcc/testsuite/gcc.dg/vect/costmodel/ppc/costmodel-pr82255.c (nonexistent)
+++ gcc/testsuite/gcc.dg/vect/costmodel/ppc/costmodel-pr82255.c (working copy)
@@ -0,0 +1,30 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target vect_int } */
+
+/* PR82255: Ensure we don't require a vec_construct cost when we aren't
+   going to generate a strided load.  */
+
+extern int abs (int __x) __attribute__ ((__nothrow__, __leaf__)) __attribute__ ((__const__));
+
+static int
+foo (unsigned char *w, int i, unsigned char *x, int j)
+{
+  int tot = 0;
+  for (int a = 0; a < 16; a++)
+    {
+      for (int b = 0; b < 16; b++)
+        tot += abs (w[b] - x[b]);
+      w += i;
+      x += j;
+    }
+  return tot;
+}
+
+void
+bar (unsigned char *w, unsigned char *x, int i, int *result)
+{
+  *result = foo (w, 16, x, i);
+}
+
+/* { dg-final { scan-tree-dump-times "vec_construct required" 0 "vect" } } */

Index: gcc/tree-vect-stmts.c
===================================================================
--- gcc/tree-vect-stmts.c (revision 252760)
+++ gcc/tree-vect-stmts.c (working copy)
@@ -1091,8 +1091,20 @@ vect_model_load_cost (stmt_vec_info stmt_info, int
                                        prologue_cost_vec, body_cost_vec, true);
       if (memory_access_type == VMAT_ELEMENTWISE
           || memory_access_type == VMAT_STRIDED_SLP)
-        inside_cost += record_stmt_cost (body_cost_vec, ncopies, vec_construct,
-                                         stmt_info, 0, vect_body);
+        {
+          stmt_vec_info stmt_info = vinfo_for_stmt (first_stmt);
+          int group_size = GROUP_SIZE (stmt_info);
+          int nunits = TYPE_VECTOR_SUBPARTS (STMT_VINFO_VECTYPE (stmt_info));
+          if (group_size < nunits)
+            {
+              if (dump_enabled_p ())
+                dump_printf_loc (MSG_NOTE, vect_location,
+                                 "vect_model_load_cost: vec_construct required");
+              inside_cost += record_stmt_cost (body_cost_vec, ncopies,
+                                               vec_construct, stmt_info, 0,
+                                               vect_body);
+            }
+        }

       if (dump_enabled_p ())
         dump_printf_loc (MSG_NOTE, vect_location,