Message ID | 3424a3d3-fa4e-16f9-89c6-0b07beec957d@linux.ibm.com |
---|---|
State | New |
Headers | show |
Series | [v2] rs6000: Add load density heuristic | expand |
Hi, Gentle ping this: https://gcc.gnu.org/pipermail/gcc-patches/2021-May/571258.html BR, Kewen on 2021/5/26 上午10:59, Kewen.Lin via Gcc-patches wrote: > Hi, > > This is the updated version of patch to deal with the bwaves_r > degradation due to vector construction fed by strided loads. > > As Richi's comments [1], this follows the similar idea to over > price the vector construction fed by VMAT_ELEMENTWISE or > VMAT_STRIDED_SLP. Instead of adding the extra cost on vector > construction costing immediately, it firstly records how many > loads and vectorized statements in the given loop, later in > rs6000_density_test (called by finish_cost) it computes the > load density ratio against all vectorized stmts, and check > with the corresponding thresholds DENSITY_LOAD_NUM_THRESHOLD > and DENSITY_LOAD_PCT_THRESHOLD, do the actual extra pricing > if both thresholds are exceeded. > > Note that this new load density heuristic check is based on > some fields in target cost which are updated as needed when > scanning each add_stmt_cost entry, it's independent of the > current function rs6000_density_test which requires to scan > non_vect stmts. Since it's checking the load stmts count > vs. all vectorized stmts, it's kind of density, so I put > it in function rs6000_density_test. With the same reason to > keep it independent, I didn't put it as an else arm of the > current existing density threshold check hunk or before this > hunk. > > In the investigation of -1.04% degradation from 526.blender_r > on Power8, I noticed that the extra penalized cost 320 on one > single vector construction with type V16QI is much exaggerated, > which makes the final body cost unreliable, so this patch adds > one maximum bound for the extra penalized cost for each vector > construction statement. > > Bootstrapped/regtested on powerpc64le-linux-gnu P9. > > Full SPEC2017 performance evaluation on Power8/Power9 with > option combinations: > * -O2 -ftree-vectorize {,-fvect-cost-model=very-cheap} {,-ffast-math} > * {-O3, -Ofast} {,-funroll-loops} > > bwaves_r degradations on P8/P9 have been fixed, nothing else > remarkable was observed. > > Is it ok for trunk? > > [1] https://gcc.gnu.org/pipermail/gcc-patches/2021-May/570076.html > > BR, > Kewen > ----- > gcc/ChangeLog: > > * config/rs6000/rs6000.c (struct rs6000_cost_data): New members > nstmts, nloads and extra_ctor_cost. > (rs6000_density_test): Add load density related heuristics and the > checks, do extra costing on vector construction statements if need. > (rs6000_init_cost): Init new members. > (rs6000_update_target_cost_per_stmt): New function. > (rs6000_add_stmt_cost): Factor vect_nonmem hunk out to function > rs6000_update_target_cost_per_stmt and call it. >
Hi, Gentle ping this: https://gcc.gnu.org/pipermail/gcc-patches/2021-May/571258.html BR, Kewen on 2021/6/9 上午10:26, Kewen.Lin via Gcc-patches wrote: > Hi, > > Gentle ping this: > > https://gcc.gnu.org/pipermail/gcc-patches/2021-May/571258.html > > BR, > Kewen > > on 2021/5/26 上午10:59, Kewen.Lin via Gcc-patches wrote: >> Hi, >> >> This is the updated version of patch to deal with the bwaves_r >> degradation due to vector construction fed by strided loads. >> >> As Richi's comments [1], this follows the similar idea to over >> price the vector construction fed by VMAT_ELEMENTWISE or >> VMAT_STRIDED_SLP. Instead of adding the extra cost on vector >> construction costing immediately, it firstly records how many >> loads and vectorized statements in the given loop, later in >> rs6000_density_test (called by finish_cost) it computes the >> load density ratio against all vectorized stmts, and check >> with the corresponding thresholds DENSITY_LOAD_NUM_THRESHOLD >> and DENSITY_LOAD_PCT_THRESHOLD, do the actual extra pricing >> if both thresholds are exceeded. >> >> Note that this new load density heuristic check is based on >> some fields in target cost which are updated as needed when >> scanning each add_stmt_cost entry, it's independent of the >> current function rs6000_density_test which requires to scan >> non_vect stmts. Since it's checking the load stmts count >> vs. all vectorized stmts, it's kind of density, so I put >> it in function rs6000_density_test. With the same reason to >> keep it independent, I didn't put it as an else arm of the >> current existing density threshold check hunk or before this >> hunk. >> >> In the investigation of -1.04% degradation from 526.blender_r >> on Power8, I noticed that the extra penalized cost 320 on one >> single vector construction with type V16QI is much exaggerated, >> which makes the final body cost unreliable, so this patch adds >> one maximum bound for the extra penalized cost for each vector >> construction statement. >> >> Bootstrapped/regtested on powerpc64le-linux-gnu P9. >> >> Full SPEC2017 performance evaluation on Power8/Power9 with >> option combinations: >> * -O2 -ftree-vectorize {,-fvect-cost-model=very-cheap} {,-ffast-math} >> * {-O3, -Ofast} {,-funroll-loops} >> >> bwaves_r degradations on P8/P9 have been fixed, nothing else >> remarkable was observed. >> >> Is it ok for trunk? >> >> [1] https://gcc.gnu.org/pipermail/gcc-patches/2021-May/570076.html >> >> BR, >> Kewen >> ----- >> gcc/ChangeLog: >> >> * config/rs6000/rs6000.c (struct rs6000_cost_data): New members >> nstmts, nloads and extra_ctor_cost. >> (rs6000_density_test): Add load density related heuristics and the >> checks, do extra costing on vector construction statements if need. >> (rs6000_init_cost): Init new members. >> (rs6000_update_target_cost_per_stmt): New function. >> (rs6000_add_stmt_cost): Factor vect_nonmem hunk out to function >> rs6000_update_target_cost_per_stmt and call it. >> >
Hi, Gentle ping this: https://gcc.gnu.org/pipermail/gcc-patches/2021-May/571258.html BR, Kewen on 2021/6/28 下午3:01, Kewen.Lin via Gcc-patches wrote: > Hi, > > Gentle ping this: > > https://gcc.gnu.org/pipermail/gcc-patches/2021-May/571258.html > > BR, > Kewen > > on 2021/6/9 上午10:26, Kewen.Lin via Gcc-patches wrote: >> Hi, >> >> Gentle ping this: >> >> https://gcc.gnu.org/pipermail/gcc-patches/2021-May/571258.html >> >> BR, >> Kewen >> >> on 2021/5/26 上午10:59, Kewen.Lin via Gcc-patches wrote: >>> Hi, >>> >>> This is the updated version of patch to deal with the bwaves_r >>> degradation due to vector construction fed by strided loads. >>> >>> As Richi's comments [1], this follows the similar idea to over >>> price the vector construction fed by VMAT_ELEMENTWISE or >>> VMAT_STRIDED_SLP. Instead of adding the extra cost on vector >>> construction costing immediately, it firstly records how many >>> loads and vectorized statements in the given loop, later in >>> rs6000_density_test (called by finish_cost) it computes the >>> load density ratio against all vectorized stmts, and check >>> with the corresponding thresholds DENSITY_LOAD_NUM_THRESHOLD >>> and DENSITY_LOAD_PCT_THRESHOLD, do the actual extra pricing >>> if both thresholds are exceeded. >>> >>> Note that this new load density heuristic check is based on >>> some fields in target cost which are updated as needed when >>> scanning each add_stmt_cost entry, it's independent of the >>> current function rs6000_density_test which requires to scan >>> non_vect stmts. Since it's checking the load stmts count >>> vs. all vectorized stmts, it's kind of density, so I put >>> it in function rs6000_density_test. With the same reason to >>> keep it independent, I didn't put it as an else arm of the >>> current existing density threshold check hunk or before this >>> hunk. >>> >>> In the investigation of -1.04% degradation from 526.blender_r >>> on Power8, I noticed that the extra penalized cost 320 on one >>> single vector construction with type V16QI is much exaggerated, >>> which makes the final body cost unreliable, so this patch adds >>> one maximum bound for the extra penalized cost for each vector >>> construction statement. >>> >>> Bootstrapped/regtested on powerpc64le-linux-gnu P9. >>> >>> Full SPEC2017 performance evaluation on Power8/Power9 with >>> option combinations: >>> * -O2 -ftree-vectorize {,-fvect-cost-model=very-cheap} {,-ffast-math} >>> * {-O3, -Ofast} {,-funroll-loops} >>> >>> bwaves_r degradations on P8/P9 have been fixed, nothing else >>> remarkable was observed. >>> >>> Is it ok for trunk? >>> >>> [1] https://gcc.gnu.org/pipermail/gcc-patches/2021-May/570076.html >>> >>> BR, >>> Kewen >>> ----- >>> gcc/ChangeLog: >>> >>> * config/rs6000/rs6000.c (struct rs6000_cost_data): New members >>> nstmts, nloads and extra_ctor_cost. >>> (rs6000_density_test): Add load density related heuristics and the >>> checks, do extra costing on vector construction statements if need. >>> (rs6000_init_cost): Init new members. >>> (rs6000_update_target_cost_per_stmt): New function. >>> (rs6000_add_stmt_cost): Factor vect_nonmem hunk out to function >>> rs6000_update_target_cost_per_stmt and call it. >>> >>
On Wed, 2021-05-26 at 10:59 +0800, Kewen.Lin via Gcc-patches wrote: > Hi, > Hi, > This is the updated version of patch to deal with the bwaves_r > degradation due to vector construction fed by strided loads. > > As Richi's comments [1], this follows the similar idea to over > price the vector construction fed by VMAT_ELEMENTWISE or > VMAT_STRIDED_SLP. Instead of adding the extra cost on vector > construction costing immediately, it firstly records how many > loads and vectorized statements in the given loop, later in > rs6000_density_test (called by finish_cost) it computes the > load density ratio against all vectorized stmts, and check > with the corresponding thresholds DENSITY_LOAD_NUM_THRESHOLD > and DENSITY_LOAD_PCT_THRESHOLD, do the actual extra pricing > if both thresholds are exceeded. ok > > Note that this new load density heuristic check is based on > some fields in target cost which are updated as needed when > scanning each add_stmt_cost entry, it's independent of the > current function rs6000_density_test which requires to scan > non_vect stmts. Since it's checking the load stmts count > vs. all vectorized stmts, it's kind of density, so I put > it in function rs6000_density_test. With the same reason to > keep it independent, I didn't put it as an else arm of the > current existing density threshold check hunk or before this > hunk. ok > > In the investigation of -1.04% degradation from 526.blender_r > on Power8, I noticed that the extra penalized cost 320 on one > single vector construction with type V16QI is much exaggerated, > which makes the final body cost unreliable, so this patch adds > one maximum bound for the extra penalized cost for each vector > construction statement. ok > > Bootstrapped/regtested on powerpc64le-linux-gnu P9. > > Full SPEC2017 performance evaluation on Power8/Power9 with > option combinations: > * -O2 -ftree-vectorize {,-fvect-cost-model=very-cheap} {,-ffast-math} > * {-O3, -Ofast} {,-funroll-loops} > > bwaves_r degradations on P8/P9 have been fixed, nothing else > remarkable was observed. So, this fixes the "-1.04% degradation from 526.blender_r on Power8" degredation with no additional regressions. that sounds good. > > Is it ok for trunk? > > [1] https://gcc.gnu.org/pipermail/gcc-patches/2021-May/570076.html > > BR, > Kewen > ----- > gcc/ChangeLog: > > * config/rs6000/rs6000.c (struct rs6000_cost_data): New members > nstmts, nloads and extra_ctor_cost. > (rs6000_density_test): Add load density related heuristics and the > checks, do extra costing on vector construction statements if need. > (rs6000_init_cost): Init new members. > (rs6000_update_target_cost_per_stmt): New function. > (rs6000_add_stmt_cost): Factor vect_nonmem hunk out to function > rs6000_update_target_cost_per_stmt and call it. > > diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c > index 83d29cbfac1..806c3335cbc 100644 > --- a/gcc/config/rs6000/rs6000.c > +++ b/gcc/config/rs6000/rs6000.c > > @@ -5231,6 +5231,12 @@ typedef struct _rs6000_cost_data > { > struct loop *loop_info; > unsigned cost[3]; > + /* Total number of vectorized stmts (loop only). */ > + unsigned nstmts; > + /* Total number of loads (loop only). */ > + unsigned nloads; > + /* Possible extra penalized cost on vector construction (loop only). */ > + unsigned extra_ctor_cost; > > /* For each vectorized loop, this var holds TRUE iff a non-memory vector > instruction is needed by the vectorization. */ > bool vect_nonmem; > @@ -5292,9 +5298,45 @@ rs6000_density_test (rs6000_cost_data *data) > if (dump_enabled_p ()) > dump_printf_loc (MSG_NOTE, vect_location, > "density %d%%, cost %d exceeds threshold, penalizing " > - "loop body cost by %d%%", density_pct, > + "loop body cost by %d%%\n", density_pct, > vec_cost + not_vec_cost, DENSITY_PENALTY); > } > + > + /* Check if we need to penalize the body cost for latency and > + execution resources bound from strided or elementwise loads > + into a vector. */ > + if (data->extra_ctor_cost > 0) > + { > + /* Threshold for load stmts percentage in all vectorized stmts. */ > + const int DENSITY_LOAD_PCT_THRESHOLD = 45; > + /* Threshold for total number of load stmts. */ > + const int DENSITY_LOAD_NUM_THRESHOLD = 20; > + > + gcc_assert (data->nloads <= data->nstmts); > + unsigned int load_pct = (data->nloads * 100) / (data->nstmts); > + > + /* It's likely to be bounded by latency and execution resources > + from many scalar loads which are strided or elementwise loads > + into a vector if both conditions below are found: > + 1. there are many loads, it's easy to result in a long wait > + for load units; > + 2. load has a big proportion of all vectorized statements, > + it's not easy to schedule other statements to spread among > + the loads. > + One typical case is the innermost loop of the hotspot of SPEC2017 > + 503.bwaves_r without loop interchange. */ > + if (data->nloads > DENSITY_LOAD_NUM_THRESHOLD > + && load_pct > DENSITY_LOAD_PCT_THRESHOLD) > + { > + data->cost[vect_body] += data->extra_ctor_cost; > + if (dump_enabled_p ()) > + dump_printf_loc (MSG_NOTE, vect_location, > + "Found %u loads and load pct. %u%% exceed " > + "the threshold, penalizing loop body " > + "cost by extra cost %u for ctor.\n", > + data->nloads, load_pct, data->extra_ctor_cost); > + } > + } > > } > > /* Implement targetm.vectorize.init_cost. */ > @@ -5308,6 +5350,9 @@ rs6000_init_cost (struct loop *loop_info, bool costing_for_scalar) > data->cost[vect_body] = 0; > data->cost[vect_epilogue] = 0; > data->vect_nonmem = false; > + data->nstmts = 0; > + data->nloads = 0; > + data->extra_ctor_cost = 0; > > data->costing_for_scalar = costing_for_scalar; > return data; > } > @@ -5335,6 +5380,66 @@ rs6000_adjust_vect_cost_per_stmt (enum vect_cost_for_stmt kind, > return 0; > } > > +/* As a visitor function for each statement cost entry handled in > + function add_stmt_cost, gather some information and update its > + relevant fields in target cost accordingly. */ I got lost trying to parse that.. (could be just me :-) Possibly instead something like /* Helper function for add_stmt_cost ; gather information and update the target_cost fields accordingly. */ > +static void > +rs6000_update_target_cost_per_stmt (rs6000_cost_data *data, > + enum vect_cost_for_stmt kind, > + struct _stmt_vec_info *stmt_info, > + enum vect_cost_model_location where, > + int stmt_cost, unsigned int orig_count) > +{ > + > + /* Check whether we're doing something other than just a copy loop. > + Not all such loops may be profitably vectorized; see > + rs6000_finish_cost. */ > + if ((kind == vec_to_scalar || kind == vec_perm || kind == vec_promote_demote > + || kind == vec_construct || kind == scalar_to_vec) > + || (where == vect_body && kind == vector_stmt)) I thought I saw an alignment issue, then noticed the "(" that was hiding on me.. :-) Maybe clearer to read if you rearrange slightly and flatten it ? I defer to others on that.. if ((kind == vec_to_scalar || kind == vec_perm || kind == vec_promote_demote || kind == vec_construct || kind == scalar_to_vec) || (kind == vector_stmt && where == vect_body) > + data->vect_nonmem = true; > + > + /* Gather some information when we are costing the vector version for > + the statements locate in a loop body. */ s/version/instruction? operation?/ ? ? s/locate/located/ > + if (!data->costing_for_scalar && data->loop_info && where == vect_body) > + { > + data->nstmts += orig_count; > + > + if (kind == scalar_load || kind == vector_load || kind == unaligned_load > + || kind == vector_gather_load) Cosmetically, I'd move the second "||" to the next line to balance those two lines a bit more. > + data->nloads += orig_count; > + > + /* If we have strided or elementwise loads into a vector, it's > + possible to be bounded by latency and execution resources for > + many scalar loads. Try to account for this by scaling the > + construction cost by the number of elements involved, when > + handling each matching statement we record the possible extra > + penalized cost into target cost, in the end of costing for > + the whole loop, we do the actual penalization once some load > + density heuristics are satisfied. */ > + if (kind == vec_construct && stmt_info > + && STMT_VINFO_TYPE (stmt_info) == load_vec_info_type > + && (STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_ELEMENTWISE > + || STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_STRIDED_SLP)) > + { > + tree vectype = STMT_VINFO_VECTYPE (stmt_info); > + unsigned int nunits = vect_nunits_for_cost (vectype); > + unsigned int extra_cost = nunits * stmt_cost; > + /* As function rs6000_builtin_vectorization_cost shows, we have > + priced much on V16QI/V8HI vector construction as their units, > + if we penalize them with nunits * stmt_cost, it can result in > + a unreliable body cost, eg: for V16QI on Power8, stmt_cost is s/a/an/ > + 20 and nunits is 16, the extra cost is 320 which looks much > + exaggerated. So let's use one maximum bound for the extra > + penalized cost for vector construction here. */ > + const unsigned int MAX_PENALIZED_COST_FOR_CTOR = 12; > + if (extra_cost > MAX_PENALIZED_COST_FOR_CTOR) > + extra_cost = MAX_PENALIZED_COST_FOR_CTOR; > + data->extra_ctor_cost += extra_cost; > + } > + } > +} ok > + > > /* Implement targetm.vectorize.add_stmt_cost. */ > > static unsigned > @@ -5354,6 +5459,7 @@ rs6000_add_stmt_cost (class vec_info *vinfo, void *data, int count, > /* Statements in an inner loop relative to the loop being > vectorized are weighted more heavily. The value here is > arbitrary and could potentially be improved with analysis. */ > + unsigned int orig_count = count; I don't see any other assignments to orig_count. Does 'count' get updated elsewhere in rs6000_add_stmt_cost() that the new orig_count variable is necessary? > > if (where == vect_body && stmt_info > && stmt_in_inner_loop_p (vinfo, stmt_info)) > { > @@ -5365,14 +5471,8 @@ rs6000_add_stmt_cost (class vec_info *vinfo, void *data, int count, > retval = (unsigned) (count * stmt_cost); > cost_data->cost[where] += retval; > > - /* Check whether we're doing something other than just a copy loop. > - Not all such loops may be profitably vectorized; see > - rs6000_finish_cost. */ > - if ((kind == vec_to_scalar || kind == vec_perm > - || kind == vec_promote_demote || kind == vec_construct > - || kind == scalar_to_vec) > - || (where == vect_body && kind == vector_stmt)) > - cost_data->vect_nonmem = true; > + rs6000_update_target_cost_per_stmt (cost_data, kind, stmt_info, where, > + stmt_cost, orig_count); > > } > > return retval; >
Hi William, Thanks for the review comments! on 2021/7/28 上午6:25, will schmidt wrote: > On Wed, 2021-05-26 at 10:59 +0800, Kewen.Lin via Gcc-patches wrote: >> Hi, >> > > > Hi, > > >> This is the updated version of patch to deal with the bwaves_r >> degradation due to vector construction fed by strided loads. >> >> As Richi's comments [1], this follows the similar idea to over >> price the vector construction fed by VMAT_ELEMENTWISE or >> VMAT_STRIDED_SLP. Instead of adding the extra cost on vector >> construction costing immediately, it firstly records how many >> loads and vectorized statements in the given loop, later in >> rs6000_density_test (called by finish_cost) it computes the >> load density ratio against all vectorized stmts, and check >> with the corresponding thresholds DENSITY_LOAD_NUM_THRESHOLD >> and DENSITY_LOAD_PCT_THRESHOLD, do the actual extra pricing >> if both thresholds are exceeded. > > ok > >> >> Note that this new load density heuristic check is based on >> some fields in target cost which are updated as needed when >> scanning each add_stmt_cost entry, it's independent of the >> current function rs6000_density_test which requires to scan >> non_vect stmts. Since it's checking the load stmts count >> vs. all vectorized stmts, it's kind of density, so I put >> it in function rs6000_density_test. With the same reason to >> keep it independent, I didn't put it as an else arm of the >> current existing density threshold check hunk or before this >> hunk. > > ok > >> >> In the investigation of -1.04% degradation from 526.blender_r >> on Power8, I noticed that the extra penalized cost 320 on one >> single vector construction with type V16QI is much exaggerated, >> which makes the final body cost unreliable, so this patch adds >> one maximum bound for the extra penalized cost for each vector >> construction statement. > > ok > >> >> Bootstrapped/regtested on powerpc64le-linux-gnu P9. >> >> Full SPEC2017 performance evaluation on Power8/Power9 with >> option combinations: >> * -O2 -ftree-vectorize {,-fvect-cost-model=very-cheap} {,-ffast-math} >> * {-O3, -Ofast} {,-funroll-loops} >> >> bwaves_r degradations on P8/P9 have been fixed, nothing else >> remarkable was observed. > > So, this fixes the "-1.04% degradation from 526.blender_r on Power8" > degredation with no additional regressions. that sounds good. > Sorry for the confusion, the original intention of this patch is to fix the -8% degradation at -O2 -ftree-vectorize vs. -O2 on bwaves_r. (From the last evaluation based on r12-1442, P8 is about -10% while P9 is about -9%). The mentioned -1.04% degradation from 526.blender_r on P8 was expected to be a reason why we need the bound for the extra penalized cost adjustment. >> >> Is it ok for trunk? >> >> [1] https://gcc.gnu.org/pipermail/gcc-patches/2021-May/570076.html >> >> BR, >> Kewen >> ----- >> gcc/ChangeLog: >> >> * config/rs6000/rs6000.c (struct rs6000_cost_data): New members >> nstmts, nloads and extra_ctor_cost. >> (rs6000_density_test): Add load density related heuristics and the >> checks, do extra costing on vector construction statements if need. >> (rs6000_init_cost): Init new members. >> (rs6000_update_target_cost_per_stmt): New function. >> (rs6000_add_stmt_cost): Factor vect_nonmem hunk out to function >> rs6000_update_target_cost_per_stmt and call it. >> > >> diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c >> index 83d29cbfac1..806c3335cbc 100644 >> --- a/gcc/config/rs6000/rs6000.c >> +++ b/gcc/config/rs6000/rs6000.c >> >> @@ -5231,6 +5231,12 @@ typedef struct _rs6000_cost_data >> { >> struct loop *loop_info; >> unsigned cost[3]; >> + /* Total number of vectorized stmts (loop only). */ >> + unsigned nstmts; >> + /* Total number of loads (loop only). */ >> + unsigned nloads; >> + /* Possible extra penalized cost on vector construction (loop only). */ >> + unsigned extra_ctor_cost; >> >> /* For each vectorized loop, this var holds TRUE iff a non-memory vector >> instruction is needed by the vectorization. */ >> bool vect_nonmem; >> @@ -5292,9 +5298,45 @@ rs6000_density_test (rs6000_cost_data *data) >> if (dump_enabled_p ()) >> dump_printf_loc (MSG_NOTE, vect_location, >> "density %d%%, cost %d exceeds threshold, penalizing " >> - "loop body cost by %d%%", density_pct, >> + "loop body cost by %d%%\n", density_pct, >> vec_cost + not_vec_cost, DENSITY_PENALTY); >> } >> + >> + /* Check if we need to penalize the body cost for latency and >> + execution resources bound from strided or elementwise loads >> + into a vector. */ >> + if (data->extra_ctor_cost > 0) >> + { >> + /* Threshold for load stmts percentage in all vectorized stmts. */ >> + const int DENSITY_LOAD_PCT_THRESHOLD = 45; >> + /* Threshold for total number of load stmts. */ >> + const int DENSITY_LOAD_NUM_THRESHOLD = 20; >> + >> + gcc_assert (data->nloads <= data->nstmts); >> + unsigned int load_pct = (data->nloads * 100) / (data->nstmts); >> + >> + /* It's likely to be bounded by latency and execution resources >> + from many scalar loads which are strided or elementwise loads >> + into a vector if both conditions below are found: >> + 1. there are many loads, it's easy to result in a long wait >> + for load units; >> + 2. load has a big proportion of all vectorized statements, >> + it's not easy to schedule other statements to spread among >> + the loads. >> + One typical case is the innermost loop of the hotspot of SPEC2017 >> + 503.bwaves_r without loop interchange. */ >> + if (data->nloads > DENSITY_LOAD_NUM_THRESHOLD >> + && load_pct > DENSITY_LOAD_PCT_THRESHOLD) >> + { >> + data->cost[vect_body] += data->extra_ctor_cost; >> + if (dump_enabled_p ()) >> + dump_printf_loc (MSG_NOTE, vect_location, >> + "Found %u loads and load pct. %u%% exceed " >> + "the threshold, penalizing loop body " >> + "cost by extra cost %u for ctor.\n", >> + data->nloads, load_pct, data->extra_ctor_cost); >> + } >> + } >> >> } >> >> /* Implement targetm.vectorize.init_cost. */ >> @@ -5308,6 +5350,9 @@ rs6000_init_cost (struct loop *loop_info, bool costing_for_scalar) >> data->cost[vect_body] = 0; >> data->cost[vect_epilogue] = 0; >> data->vect_nonmem = false; >> + data->nstmts = 0; >> + data->nloads = 0; >> + data->extra_ctor_cost = 0; >> >> data->costing_for_scalar = costing_for_scalar; >> return data; >> } >> @@ -5335,6 +5380,66 @@ rs6000_adjust_vect_cost_per_stmt (enum vect_cost_for_stmt kind, >> return 0; >> } >> >> +/* As a visitor function for each statement cost entry handled in >> + function add_stmt_cost, gather some information and update its >> + relevant fields in target cost accordingly. */ > > I got lost trying to parse that.. (could be just me :-) > > Possibly instead something like > /* Helper function for add_stmt_cost ; gather information and update > the target_cost fields accordingly. */ > > OK, will update. I was thinking for each entry handled in function add_stmt_cost, this helper acts like a visitor, trying to visit each entry and take some actions if some conditions are satisifed. > >> +static void >> +rs6000_update_target_cost_per_stmt (rs6000_cost_data *data, >> + enum vect_cost_for_stmt kind, >> + struct _stmt_vec_info *stmt_info, >> + enum vect_cost_model_location where, >> + int stmt_cost, unsigned int orig_count) >> +{ >> + >> + /* Check whether we're doing something other than just a copy loop. >> + Not all such loops may be profitably vectorized; see >> + rs6000_finish_cost. */ >> + if ((kind == vec_to_scalar || kind == vec_perm || kind == vec_promote_demote >> + || kind == vec_construct || kind == scalar_to_vec) >> + || (where == vect_body && kind == vector_stmt)) > > I thought I saw an alignment issue, then noticed the "(" that was > hiding on me.. :-) > > Maybe clearer to read if you rearrange slightly and flatten it ? I > defer to others on that.. > > if ((kind == vec_to_scalar > || kind == vec_perm > || kind == vec_promote_demote > || kind == > vec_construct > || kind == scalar_to_vec) > || (kind == vector_stmt && where == vect_body) > > This hunk is factored out from function rs6000_add_stmt_cost, maybe I can keep the original formatting? The formatting tool isn't so smart, and sometimes rearrange things to become unexpected (although it meets the basic rule, not so elegant), sigh. >> + data->vect_nonmem = true; >> + >> + /* Gather some information when we are costing the vector version for >> + the statements locate in a loop body. */ > s/version/instruction? operation?/ ? ? > s/locate/located/ > Will fix. >> + if (!data->costing_for_scalar && data->loop_info && where == vect_body) >> + { >> + data->nstmts += orig_count; >> + >> + if (kind == scalar_load || kind == vector_load || kind == unaligned_load >> + || kind == vector_gather_load) > > Cosmetically, I'd move the second "||" to the next line to balance > those two lines a bit more. > Will fix. >> + data->nloads += orig_count; >> + >> + /* If we have strided or elementwise loads into a vector, it's >> + possible to be bounded by latency and execution resources for >> + many scalar loads. Try to account for this by scaling the >> + construction cost by the number of elements involved, when >> + handling each matching statement we record the possible extra >> + penalized cost into target cost, in the end of costing for >> + the whole loop, we do the actual penalization once some load >> + density heuristics are satisfied. */ >> + if (kind == vec_construct && stmt_info >> + && STMT_VINFO_TYPE (stmt_info) == load_vec_info_type >> + && (STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_ELEMENTWISE >> + || STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_STRIDED_SLP)) >> + { >> + tree vectype = STMT_VINFO_VECTYPE (stmt_info); >> + unsigned int nunits = vect_nunits_for_cost (vectype); >> + unsigned int extra_cost = nunits * stmt_cost; >> + /* As function rs6000_builtin_vectorization_cost shows, we have >> + priced much on V16QI/V8HI vector construction as their units, >> + if we penalize them with nunits * stmt_cost, it can result in >> + a unreliable body cost, eg: for V16QI on Power8, stmt_cost is > s/a/an/ Will fix. >> + 20 and nunits is 16, the extra cost is 320 which looks much >> + exaggerated. So let's use one maximum bound for the extra >> + penalized cost for vector construction here. */ >> + const unsigned int MAX_PENALIZED_COST_FOR_CTOR = 12; >> + if (extra_cost > MAX_PENALIZED_COST_FOR_CTOR) >> + extra_cost = MAX_PENALIZED_COST_FOR_CTOR; >> + data->extra_ctor_cost += extra_cost; >> + } >> + } >> +} > ok > >> + >> >> /* Implement targetm.vectorize.add_stmt_cost. */ >> >> static unsigned >> @@ -5354,6 +5459,7 @@ rs6000_add_stmt_cost (class vec_info *vinfo, void *data, int count, >> /* Statements in an inner loop relative to the loop being >> vectorized are weighted more heavily. The value here is >> arbitrary and could potentially be improved with analysis. */ >> + unsigned int orig_count = count; > > I don't see any other assignments to orig_count. Does 'count' get > updated elsewhere in rs6000_add_stmt_cost() that the new orig_count > variable is necessary? > Yeah, the 'count' could get updated but the default "git diff" doesn't show it, the codes omitted look as below: if (where == vect_body && stmt_info && stmt_in_inner_loop_p (vinfo, stmt_info)) { loop_vec_info loop_vinfo = dyn_cast<loop_vec_info> (vinfo); gcc_assert (loop_vinfo); count *= LOOP_VINFO_INNER_LOOP_COST_FACTOR (loop_vinfo); /* FIXME. */ } BR, Kewen >> >> if (where == vect_body && stmt_info >> && stmt_in_inner_loop_p (vinfo, stmt_info)) >> { >> @@ -5365,14 +5471,8 @@ rs6000_add_stmt_cost (class vec_info *vinfo, void *data, int count, >> retval = (unsigned) (count * stmt_cost); >> cost_data->cost[where] += retval; >> >> - /* Check whether we're doing something other than just a copy loop. >> - Not all such loops may be profitably vectorized; see >> - rs6000_finish_cost. */ >> - if ((kind == vec_to_scalar || kind == vec_perm >> - || kind == vec_promote_demote || kind == vec_construct >> - || kind == scalar_to_vec) >> - || (where == vect_body && kind == vector_stmt)) >> - cost_data->vect_nonmem = true; >> + rs6000_update_target_cost_per_stmt (cost_data, kind, stmt_info, where, >> + stmt_cost, orig_count); >> >> } >> >> return retval; >> > >
Hi! On Wed, Jul 28, 2021 at 10:59:50AM +0800, Kewen.Lin wrote: > >> +/* As a visitor function for each statement cost entry handled in > >> + function add_stmt_cost, gather some information and update its > >> + relevant fields in target cost accordingly. */ > > > > I got lost trying to parse that.. (could be just me :-) > > > > Possibly instead something like > > /* Helper function for add_stmt_cost ; gather information and update > > the target_cost fields accordingly. */ > > OK, will update. I was thinking for each entry handled in function > add_stmt_cost, this helper acts like a visitor, trying to visit each > entry and take some actions if some conditions are satisifed. It (thankfully!) has nothing to do with the "visitor pattern", so some other name might be better :-) > > Maybe clearer to read if you rearrange slightly and flatten it ? I > > defer to others on that.. > > > > if ((kind == vec_to_scalar > > || kind == vec_perm > > || kind == vec_promote_demote > > || kind == vec_construct > > || kind == scalar_to_vec) > > || (kind == vector_stmt && where == vect_body) > > This hunk is factored out from function rs6000_add_stmt_cost, maybe I > can keep the original formatting? The formatting tool isn't so smart, > and sometimes rearrange things to become unexpected (although it meets > the basic rule, not so elegant), sigh. It has too many parens, making grouping where there is none, that is the core issue. if (kind == vec_to_scalar || kind == vec_perm || kind == vec_promote_demote || kind == vec_construct || kind == scalar_to_vec || (kind == vector_stmt && where == vect_body)) Segher
Hi Segher, Thanks for the comments! on 2021/9/7 上午7:43, Segher Boessenkool wrote: > Hi! > > On Wed, Jul 28, 2021 at 10:59:50AM +0800, Kewen.Lin wrote: >>>> +/* As a visitor function for each statement cost entry handled in >>>> + function add_stmt_cost, gather some information and update its >>>> + relevant fields in target cost accordingly. */ >>> >>> I got lost trying to parse that.. (could be just me :-) >>> >>> Possibly instead something like >>> /* Helper function for add_stmt_cost ; gather information and update >>> the target_cost fields accordingly. */ >> >> OK, will update. I was thinking for each entry handled in function >> add_stmt_cost, this helper acts like a visitor, trying to visit each >> entry and take some actions if some conditions are satisifed. > > It (thankfully!) has nothing to do with the "visitor pattern", so some > other name might be better :-) > >>> Maybe clearer to read if you rearrange slightly and flatten it ? I >>> defer to others on that.. >>> >>> if ((kind == vec_to_scalar >>> || kind == vec_perm >>> || kind == vec_promote_demote >>> || kind == vec_construct >>> || kind == scalar_to_vec) >>> || (kind == vector_stmt && where == vect_body) >> >> This hunk is factored out from function rs6000_add_stmt_cost, maybe I >> can keep the original formatting? The formatting tool isn't so smart, >> and sometimes rearrange things to become unexpected (although it meets >> the basic rule, not so elegant), sigh. > > It has too many parens, making grouping where there is none, that is the > core issue. > > if (kind == vec_to_scalar > || kind == vec_perm > || kind == vec_promote_demote > || kind == vec_construct > || kind == scalar_to_vec > || (kind == vector_stmt && where == vect_body)) > > Good catch, I've updated it in V4. BR, Kewen
diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c index 83d29cbfac1..806c3335cbc 100644 --- a/gcc/config/rs6000/rs6000.c +++ b/gcc/config/rs6000/rs6000.c @@ -5231,6 +5231,12 @@ typedef struct _rs6000_cost_data { struct loop *loop_info; unsigned cost[3]; + /* Total number of vectorized stmts (loop only). */ + unsigned nstmts; + /* Total number of loads (loop only). */ + unsigned nloads; + /* Possible extra penalized cost on vector construction (loop only). */ + unsigned extra_ctor_cost; /* For each vectorized loop, this var holds TRUE iff a non-memory vector instruction is needed by the vectorization. */ bool vect_nonmem; @@ -5292,9 +5298,45 @@ rs6000_density_test (rs6000_cost_data *data) if (dump_enabled_p ()) dump_printf_loc (MSG_NOTE, vect_location, "density %d%%, cost %d exceeds threshold, penalizing " - "loop body cost by %d%%", density_pct, + "loop body cost by %d%%\n", density_pct, vec_cost + not_vec_cost, DENSITY_PENALTY); } + + /* Check if we need to penalize the body cost for latency and + execution resources bound from strided or elementwise loads + into a vector. */ + if (data->extra_ctor_cost > 0) + { + /* Threshold for load stmts percentage in all vectorized stmts. */ + const int DENSITY_LOAD_PCT_THRESHOLD = 45; + /* Threshold for total number of load stmts. */ + const int DENSITY_LOAD_NUM_THRESHOLD = 20; + + gcc_assert (data->nloads <= data->nstmts); + unsigned int load_pct = (data->nloads * 100) / (data->nstmts); + + /* It's likely to be bounded by latency and execution resources + from many scalar loads which are strided or elementwise loads + into a vector if both conditions below are found: + 1. there are many loads, it's easy to result in a long wait + for load units; + 2. load has a big proportion of all vectorized statements, + it's not easy to schedule other statements to spread among + the loads. + One typical case is the innermost loop of the hotspot of SPEC2017 + 503.bwaves_r without loop interchange. */ + if (data->nloads > DENSITY_LOAD_NUM_THRESHOLD + && load_pct > DENSITY_LOAD_PCT_THRESHOLD) + { + data->cost[vect_body] += data->extra_ctor_cost; + if (dump_enabled_p ()) + dump_printf_loc (MSG_NOTE, vect_location, + "Found %u loads and load pct. %u%% exceed " + "the threshold, penalizing loop body " + "cost by extra cost %u for ctor.\n", + data->nloads, load_pct, data->extra_ctor_cost); + } + } } /* Implement targetm.vectorize.init_cost. */ @@ -5308,6 +5350,9 @@ rs6000_init_cost (struct loop *loop_info, bool costing_for_scalar) data->cost[vect_body] = 0; data->cost[vect_epilogue] = 0; data->vect_nonmem = false; + data->nstmts = 0; + data->nloads = 0; + data->extra_ctor_cost = 0; data->costing_for_scalar = costing_for_scalar; return data; } @@ -5335,6 +5380,66 @@ rs6000_adjust_vect_cost_per_stmt (enum vect_cost_for_stmt kind, return 0; } +/* As a visitor function for each statement cost entry handled in + function add_stmt_cost, gather some information and update its + relevant fields in target cost accordingly. */ +static void +rs6000_update_target_cost_per_stmt (rs6000_cost_data *data, + enum vect_cost_for_stmt kind, + struct _stmt_vec_info *stmt_info, + enum vect_cost_model_location where, + int stmt_cost, unsigned int orig_count) +{ + + /* Check whether we're doing something other than just a copy loop. + Not all such loops may be profitably vectorized; see + rs6000_finish_cost. */ + if ((kind == vec_to_scalar || kind == vec_perm || kind == vec_promote_demote + || kind == vec_construct || kind == scalar_to_vec) + || (where == vect_body && kind == vector_stmt)) + data->vect_nonmem = true; + + /* Gather some information when we are costing the vector version for + the statements locate in a loop body. */ + if (!data->costing_for_scalar && data->loop_info && where == vect_body) + { + data->nstmts += orig_count; + + if (kind == scalar_load || kind == vector_load || kind == unaligned_load + || kind == vector_gather_load) + data->nloads += orig_count; + + /* If we have strided or elementwise loads into a vector, it's + possible to be bounded by latency and execution resources for + many scalar loads. Try to account for this by scaling the + construction cost by the number of elements involved, when + handling each matching statement we record the possible extra + penalized cost into target cost, in the end of costing for + the whole loop, we do the actual penalization once some load + density heuristics are satisfied. */ + if (kind == vec_construct && stmt_info + && STMT_VINFO_TYPE (stmt_info) == load_vec_info_type + && (STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_ELEMENTWISE + || STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_STRIDED_SLP)) + { + tree vectype = STMT_VINFO_VECTYPE (stmt_info); + unsigned int nunits = vect_nunits_for_cost (vectype); + unsigned int extra_cost = nunits * stmt_cost; + /* As function rs6000_builtin_vectorization_cost shows, we have + priced much on V16QI/V8HI vector construction as their units, + if we penalize them with nunits * stmt_cost, it can result in + a unreliable body cost, eg: for V16QI on Power8, stmt_cost is + 20 and nunits is 16, the extra cost is 320 which looks much + exaggerated. So let's use one maximum bound for the extra + penalized cost for vector construction here. */ + const unsigned int MAX_PENALIZED_COST_FOR_CTOR = 12; + if (extra_cost > MAX_PENALIZED_COST_FOR_CTOR) + extra_cost = MAX_PENALIZED_COST_FOR_CTOR; + data->extra_ctor_cost += extra_cost; + } + } +} + /* Implement targetm.vectorize.add_stmt_cost. */ static unsigned @@ -5354,6 +5459,7 @@ rs6000_add_stmt_cost (class vec_info *vinfo, void *data, int count, /* Statements in an inner loop relative to the loop being vectorized are weighted more heavily. The value here is arbitrary and could potentially be improved with analysis. */ + unsigned int orig_count = count; if (where == vect_body && stmt_info && stmt_in_inner_loop_p (vinfo, stmt_info)) { @@ -5365,14 +5471,8 @@ rs6000_add_stmt_cost (class vec_info *vinfo, void *data, int count, retval = (unsigned) (count * stmt_cost); cost_data->cost[where] += retval; - /* Check whether we're doing something other than just a copy loop. - Not all such loops may be profitably vectorized; see - rs6000_finish_cost. */ - if ((kind == vec_to_scalar || kind == vec_perm - || kind == vec_promote_demote || kind == vec_construct - || kind == scalar_to_vec) - || (where == vect_body && kind == vector_stmt)) - cost_data->vect_nonmem = true; + rs6000_update_target_cost_per_stmt (cost_data, kind, stmt_info, where, + stmt_cost, orig_count); } return retval;