[v2] rs6000: Add load density heuristic

Message ID	3424a3d3-fa4e-16f9-89c6-0b07beec957d@linux.ibm.com
State	New
Headers	show Return-Path: <gcc-patches-bounces@gcc.gnu.org> DMARC-Filter: OpenDMARC Filter v1.3.2 sourceware.org ED936394443A Subject: [PATCH v2] rs6000: Add load density heuristic To: GCC Patches <gcc-patches@gcc.gnu.org> References: <7b9f9bdf-1ed5-139b-de9c-511ee8454b85@linux.ibm.com> Message-ID: <3424a3d3-fa4e-16f9-89c6-0b07beec957d@linux.ibm.com> Date: Wed, 26 May 2021 10:59:57 +0800 User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0) Gecko/20100101 Thunderbird/78.10.0 In-Reply-To: <7b9f9bdf-1ed5-139b-de9c-511ee8454b85@linux.ibm.com> Content-Type: multipart/mixed; boundary="------------A1B16BC12EC2EE5FB543516B" Content-Language: en-US MIME-Version: 1.0 Precedence: list From: "Kewen.Lin via Gcc-patches" <gcc-patches@gcc.gnu.org> Reply-To: "Kewen.Lin" <linkw@linux.ibm.com> Cc: Bill Schmidt <wschmidt@linux.ibm.com>, Segher Boessenkool <segher@kernel.crashing.org>, David Edelsohn <dje.gcc@gmail.com> Errors-To: gcc-patches-bounces@gcc.gnu.org Sender: "Gcc-patches" <gcc-patches-bounces@gcc.gnu.org>
Series	[v2] rs6000: Add load density heuristic \| expand [v2] rs6000: Add load density heuristic

Kewen.Lin May 26, 2021, 2:59 a.m. UTC

Hi,

This is the updated version of patch to deal with the bwaves_r
degradation due to vector construction fed by strided loads.

As Richi's comments [1], this follows the similar idea to over
price the vector construction fed by VMAT_ELEMENTWISE or
VMAT_STRIDED_SLP.  Instead of adding the extra cost on vector
construction costing immediately, it firstly records how many
loads and vectorized statements in the given loop, later in
rs6000_density_test (called by finish_cost) it computes the
load density ratio against all vectorized stmts, and check
with the corresponding thresholds DENSITY_LOAD_NUM_THRESHOLD
and DENSITY_LOAD_PCT_THRESHOLD, do the actual extra pricing
if both thresholds are exceeded.

Note that this new load density heuristic check is based on
some fields in target cost which are updated as needed when
scanning each add_stmt_cost entry, it's independent of the
current function rs6000_density_test which requires to scan
non_vect stmts.  Since it's checking the load stmts count
vs. all vectorized stmts, it's kind of density, so I put
it in function rs6000_density_test.  With the same reason to
keep it independent, I didn't put it as an else arm of the
current existing density threshold check hunk or before this
hunk.

In the investigation of -1.04% degradation from 526.blender_r
on Power8, I noticed that the extra penalized cost 320 on one
single vector construction with type V16QI is much exaggerated,
which makes the final body cost unreliable, so this patch adds
one maximum bound for the extra penalized cost for each vector
construction statement.

Bootstrapped/regtested on powerpc64le-linux-gnu P9.

Full SPEC2017 performance evaluation on Power8/Power9 with
option combinations:
  * -O2 -ftree-vectorize {,-fvect-cost-model=very-cheap} {,-ffast-math}
  * {-O3, -Ofast} {,-funroll-loops}

bwaves_r degradations on P8/P9 have been fixed, nothing else
remarkable was observed.

Is it ok for trunk?

[1] https://gcc.gnu.org/pipermail/gcc-patches/2021-May/570076.html

BR,
Kewen
-----
gcc/ChangeLog:

	* config/rs6000/rs6000.c (struct rs6000_cost_data): New members
	nstmts, nloads and extra_ctor_cost.
	(rs6000_density_test): Add load density related heuristics and the
	checks, do extra costing on vector construction statements if need.
	(rs6000_init_cost): Init new members.
	(rs6000_update_target_cost_per_stmt): New function.
	(rs6000_add_stmt_cost): Factor vect_nonmem hunk out to function
	rs6000_update_target_cost_per_stmt and call it.

Kewen.Lin June 9, 2021, 2:26 a.m. UTC | #1

Hi,

Gentle ping this:

https://gcc.gnu.org/pipermail/gcc-patches/2021-May/571258.html

BR,
Kewen

on 2021/5/26 上午10:59, Kewen.Lin via Gcc-patches wrote:
> Hi,
> 
> This is the updated version of patch to deal with the bwaves_r
> degradation due to vector construction fed by strided loads.
> 
> As Richi's comments [1], this follows the similar idea to over
> price the vector construction fed by VMAT_ELEMENTWISE or
> VMAT_STRIDED_SLP.  Instead of adding the extra cost on vector
> construction costing immediately, it firstly records how many
> loads and vectorized statements in the given loop, later in
> rs6000_density_test (called by finish_cost) it computes the
> load density ratio against all vectorized stmts, and check
> with the corresponding thresholds DENSITY_LOAD_NUM_THRESHOLD
> and DENSITY_LOAD_PCT_THRESHOLD, do the actual extra pricing
> if both thresholds are exceeded.
> 
> Note that this new load density heuristic check is based on
> some fields in target cost which are updated as needed when
> scanning each add_stmt_cost entry, it's independent of the
> current function rs6000_density_test which requires to scan
> non_vect stmts.  Since it's checking the load stmts count
> vs. all vectorized stmts, it's kind of density, so I put
> it in function rs6000_density_test.  With the same reason to
> keep it independent, I didn't put it as an else arm of the
> current existing density threshold check hunk or before this
> hunk.
> 
> In the investigation of -1.04% degradation from 526.blender_r
> on Power8, I noticed that the extra penalized cost 320 on one
> single vector construction with type V16QI is much exaggerated,
> which makes the final body cost unreliable, so this patch adds
> one maximum bound for the extra penalized cost for each vector
> construction statement.
> 
> Bootstrapped/regtested on powerpc64le-linux-gnu P9.
> 
> Full SPEC2017 performance evaluation on Power8/Power9 with
> option combinations:
>   * -O2 -ftree-vectorize {,-fvect-cost-model=very-cheap} {,-ffast-math}
>   * {-O3, -Ofast} {,-funroll-loops}
> 
> bwaves_r degradations on P8/P9 have been fixed, nothing else
> remarkable was observed.
> 
> Is it ok for trunk?
> 
> [1] https://gcc.gnu.org/pipermail/gcc-patches/2021-May/570076.html
> 
> BR,
> Kewen
> -----
> gcc/ChangeLog:
> 
> 	* config/rs6000/rs6000.c (struct rs6000_cost_data): New members
> 	nstmts, nloads and extra_ctor_cost.
> 	(rs6000_density_test): Add load density related heuristics and the
> 	checks, do extra costing on vector construction statements if need.
> 	(rs6000_init_cost): Init new members.
> 	(rs6000_update_target_cost_per_stmt): New function.
> 	(rs6000_add_stmt_cost): Factor vect_nonmem hunk out to function
> 	rs6000_update_target_cost_per_stmt and call it.
>

Kewen.Lin June 28, 2021, 7:01 a.m. UTC | #2

Hi,

Gentle ping this:

https://gcc.gnu.org/pipermail/gcc-patches/2021-May/571258.html

BR,
Kewen

on 2021/6/9 上午10:26, Kewen.Lin via Gcc-patches wrote:
> Hi,
> 
> Gentle ping this:
> 
> https://gcc.gnu.org/pipermail/gcc-patches/2021-May/571258.html
> 
> BR,
> Kewen
> 
> on 2021/5/26 上午10:59, Kewen.Lin via Gcc-patches wrote:
>> Hi,
>>
>> This is the updated version of patch to deal with the bwaves_r
>> degradation due to vector construction fed by strided loads.
>>
>> As Richi's comments [1], this follows the similar idea to over
>> price the vector construction fed by VMAT_ELEMENTWISE or
>> VMAT_STRIDED_SLP.  Instead of adding the extra cost on vector
>> construction costing immediately, it firstly records how many
>> loads and vectorized statements in the given loop, later in
>> rs6000_density_test (called by finish_cost) it computes the
>> load density ratio against all vectorized stmts, and check
>> with the corresponding thresholds DENSITY_LOAD_NUM_THRESHOLD
>> and DENSITY_LOAD_PCT_THRESHOLD, do the actual extra pricing
>> if both thresholds are exceeded.
>>
>> Note that this new load density heuristic check is based on
>> some fields in target cost which are updated as needed when
>> scanning each add_stmt_cost entry, it's independent of the
>> current function rs6000_density_test which requires to scan
>> non_vect stmts.  Since it's checking the load stmts count
>> vs. all vectorized stmts, it's kind of density, so I put
>> it in function rs6000_density_test.  With the same reason to
>> keep it independent, I didn't put it as an else arm of the
>> current existing density threshold check hunk or before this
>> hunk.
>>
>> In the investigation of -1.04% degradation from 526.blender_r
>> on Power8, I noticed that the extra penalized cost 320 on one
>> single vector construction with type V16QI is much exaggerated,
>> which makes the final body cost unreliable, so this patch adds
>> one maximum bound for the extra penalized cost for each vector
>> construction statement.
>>
>> Bootstrapped/regtested on powerpc64le-linux-gnu P9.
>>
>> Full SPEC2017 performance evaluation on Power8/Power9 with
>> option combinations:
>>   * -O2 -ftree-vectorize {,-fvect-cost-model=very-cheap} {,-ffast-math}
>>   * {-O3, -Ofast} {,-funroll-loops}
>>
>> bwaves_r degradations on P8/P9 have been fixed, nothing else
>> remarkable was observed.
>>
>> Is it ok for trunk?
>>
>> [1] https://gcc.gnu.org/pipermail/gcc-patches/2021-May/570076.html
>>
>> BR,
>> Kewen
>> -----
>> gcc/ChangeLog:
>>
>> 	* config/rs6000/rs6000.c (struct rs6000_cost_data): New members
>> 	nstmts, nloads and extra_ctor_cost.
>> 	(rs6000_density_test): Add load density related heuristics and the
>> 	checks, do extra costing on vector construction statements if need.
>> 	(rs6000_init_cost): Init new members.
>> 	(rs6000_update_target_cost_per_stmt): New function.
>> 	(rs6000_add_stmt_cost): Factor vect_nonmem hunk out to function
>> 	rs6000_update_target_cost_per_stmt and call it.
>>
>

Kewen.Lin July 15, 2021, 1:59 a.m. UTC | #3

Hi,

Gentle ping this:

https://gcc.gnu.org/pipermail/gcc-patches/2021-May/571258.html

BR,
Kewen

on 2021/6/28 下午3:01, Kewen.Lin via Gcc-patches wrote:
> Hi,
> 
> Gentle ping this:
> 
> https://gcc.gnu.org/pipermail/gcc-patches/2021-May/571258.html
> 
> BR,
> Kewen
> 
> on 2021/6/9 上午10:26, Kewen.Lin via Gcc-patches wrote:
>> Hi,
>>
>> Gentle ping this:
>>
>> https://gcc.gnu.org/pipermail/gcc-patches/2021-May/571258.html
>>
>> BR,
>> Kewen
>>
>> on 2021/5/26 上午10:59, Kewen.Lin via Gcc-patches wrote:
>>> Hi,
>>>
>>> This is the updated version of patch to deal with the bwaves_r
>>> degradation due to vector construction fed by strided loads.
>>>
>>> As Richi's comments [1], this follows the similar idea to over
>>> price the vector construction fed by VMAT_ELEMENTWISE or
>>> VMAT_STRIDED_SLP.  Instead of adding the extra cost on vector
>>> construction costing immediately, it firstly records how many
>>> loads and vectorized statements in the given loop, later in
>>> rs6000_density_test (called by finish_cost) it computes the
>>> load density ratio against all vectorized stmts, and check
>>> with the corresponding thresholds DENSITY_LOAD_NUM_THRESHOLD
>>> and DENSITY_LOAD_PCT_THRESHOLD, do the actual extra pricing
>>> if both thresholds are exceeded.
>>>
>>> Note that this new load density heuristic check is based on
>>> some fields in target cost which are updated as needed when
>>> scanning each add_stmt_cost entry, it's independent of the
>>> current function rs6000_density_test which requires to scan
>>> non_vect stmts.  Since it's checking the load stmts count
>>> vs. all vectorized stmts, it's kind of density, so I put
>>> it in function rs6000_density_test.  With the same reason to
>>> keep it independent, I didn't put it as an else arm of the
>>> current existing density threshold check hunk or before this
>>> hunk.
>>>
>>> In the investigation of -1.04% degradation from 526.blender_r
>>> on Power8, I noticed that the extra penalized cost 320 on one
>>> single vector construction with type V16QI is much exaggerated,
>>> which makes the final body cost unreliable, so this patch adds
>>> one maximum bound for the extra penalized cost for each vector
>>> construction statement.
>>>
>>> Bootstrapped/regtested on powerpc64le-linux-gnu P9.
>>>
>>> Full SPEC2017 performance evaluation on Power8/Power9 with
>>> option combinations:
>>>   * -O2 -ftree-vectorize {,-fvect-cost-model=very-cheap} {,-ffast-math}
>>>   * {-O3, -Ofast} {,-funroll-loops}
>>>
>>> bwaves_r degradations on P8/P9 have been fixed, nothing else
>>> remarkable was observed.
>>>
>>> Is it ok for trunk?
>>>
>>> [1] https://gcc.gnu.org/pipermail/gcc-patches/2021-May/570076.html
>>>
>>> BR,
>>> Kewen
>>> -----
>>> gcc/ChangeLog:
>>>
>>> 	* config/rs6000/rs6000.c (struct rs6000_cost_data): New members
>>> 	nstmts, nloads and extra_ctor_cost.
>>> 	(rs6000_density_test): Add load density related heuristics and the
>>> 	checks, do extra costing on vector construction statements if need.
>>> 	(rs6000_init_cost): Init new members.
>>> 	(rs6000_update_target_cost_per_stmt): New function.
>>> 	(rs6000_add_stmt_cost): Factor vect_nonmem hunk out to function
>>> 	rs6000_update_target_cost_per_stmt and call it.
>>>
>>

will schmidt July 27, 2021, 10:25 p.m. UTC | #4

On Wed, 2021-05-26 at 10:59 +0800, Kewen.Lin via Gcc-patches wrote:
> Hi,
> 


Hi,


> This is the updated version of patch to deal with the bwaves_r
> degradation due to vector construction fed by strided loads.
> 
> As Richi's comments [1], this follows the similar idea to over
> price the vector construction fed by VMAT_ELEMENTWISE or
> VMAT_STRIDED_SLP.  Instead of adding the extra cost on vector
> construction costing immediately, it firstly records how many
> loads and vectorized statements in the given loop, later in
> rs6000_density_test (called by finish_cost) it computes the
> load density ratio against all vectorized stmts, and check
> with the corresponding thresholds DENSITY_LOAD_NUM_THRESHOLD
> and DENSITY_LOAD_PCT_THRESHOLD, do the actual extra pricing
> if both thresholds are exceeded.

ok

> 
> Note that this new load density heuristic check is based on
> some fields in target cost which are updated as needed when
> scanning each add_stmt_cost entry, it's independent of the
> current function rs6000_density_test which requires to scan
> non_vect stmts.  Since it's checking the load stmts count
> vs. all vectorized stmts, it's kind of density, so I put
> it in function rs6000_density_test.  With the same reason to
> keep it independent, I didn't put it as an else arm of the
> current existing density threshold check hunk or before this
> hunk.

ok

> 
> In the investigation of -1.04% degradation from 526.blender_r
> on Power8, I noticed that the extra penalized cost 320 on one
> single vector construction with type V16QI is much exaggerated,
> which makes the final body cost unreliable, so this patch adds
> one maximum bound for the extra penalized cost for each vector
> construction statement.

ok

> 
> Bootstrapped/regtested on powerpc64le-linux-gnu P9.
> 
> Full SPEC2017 performance evaluation on Power8/Power9 with
> option combinations:
>   * -O2 -ftree-vectorize {,-fvect-cost-model=very-cheap} {,-ffast-math}
>   * {-O3, -Ofast} {,-funroll-loops}
> 
> bwaves_r degradations on P8/P9 have been fixed, nothing else
> remarkable was observed.

So, this fixes the "-1.04% degradation from 526.blender_r on Power8"
degredation with no additional regressions.  that sounds good. 

> 
> Is it ok for trunk?
> 
> [1] https://gcc.gnu.org/pipermail/gcc-patches/2021-May/570076.html
> 
> BR,
> Kewen
> -----
> gcc/ChangeLog:
> 
> 	* config/rs6000/rs6000.c (struct rs6000_cost_data): New members
> 	nstmts, nloads and extra_ctor_cost.
> 	(rs6000_density_test): Add load density related heuristics and the
> 	checks, do extra costing on vector construction statements if need.
> 	(rs6000_init_cost): Init new members.
> 	(rs6000_update_target_cost_per_stmt): New function.
> 	(rs6000_add_stmt_cost): Factor vect_nonmem hunk out to function
> 	rs6000_update_target_cost_per_stmt and call it.
> 

> diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
> index 83d29cbfac1..806c3335cbc 100644
> --- a/gcc/config/rs6000/rs6000.c
> +++ b/gcc/config/rs6000/rs6000.c
> 
> @@ -5231,6 +5231,12 @@ typedef struct _rs6000_cost_data
>  {
>    struct loop *loop_info;
>    unsigned cost[3];
> +  /* Total number of vectorized stmts (loop only).  */
> +  unsigned nstmts;
> +  /* Total number of loads (loop only).  */
> +  unsigned nloads;
> +  /* Possible extra penalized cost on vector construction (loop only).  */
> +  unsigned extra_ctor_cost;
> 
>    /* For each vectorized loop, this var holds TRUE iff a non-memory vector
>       instruction is needed by the vectorization.  */
>    bool vect_nonmem;
> @@ -5292,9 +5298,45 @@ rs6000_density_test (rs6000_cost_data *data)
>        if (dump_enabled_p ())
>  	dump_printf_loc (MSG_NOTE, vect_location,
>  			 "density %d%%, cost %d exceeds threshold, penalizing "
> -			 "loop body cost by %d%%", density_pct,
> +			 "loop body cost by %d%%\n", density_pct,
>  			 vec_cost + not_vec_cost, DENSITY_PENALTY);
>      }
> +
> +  /* Check if we need to penalize the body cost for latency and
> +     execution resources bound from strided or elementwise loads
> +     into a vector.  */
> +  if (data->extra_ctor_cost > 0)
> +    {
> +      /* Threshold for load stmts percentage in all vectorized stmts.  */
> +      const int DENSITY_LOAD_PCT_THRESHOLD = 45;
> +      /* Threshold for total number of load stmts.  */
> +      const int DENSITY_LOAD_NUM_THRESHOLD = 20;
> +
> +      gcc_assert (data->nloads <= data->nstmts);
> +      unsigned int load_pct = (data->nloads * 100) / (data->nstmts);
> +
> +      /* It's likely to be bounded by latency and execution resources
> +	 from many scalar loads which are strided or elementwise loads
> +	 into a vector if both conditions below are found:
> +	   1. there are many loads, it's easy to result in a long wait
> +	      for load units;
> +	   2. load has a big proportion of all vectorized statements,
> +	      it's not easy to schedule other statements to spread among
> +	      the loads.
> +	 One typical case is the innermost loop of the hotspot of SPEC2017
> +	 503.bwaves_r without loop interchange.  */
> +      if (data->nloads > DENSITY_LOAD_NUM_THRESHOLD
> +	  && load_pct > DENSITY_LOAD_PCT_THRESHOLD)
> +	{
> +	  data->cost[vect_body] += data->extra_ctor_cost;
> +	  if (dump_enabled_p ())
> +	    dump_printf_loc (MSG_NOTE, vect_location,
> +			     "Found %u loads and load pct. %u%% exceed "
> +			     "the threshold, penalizing loop body "
> +			     "cost by extra cost %u for ctor.\n",
> +			     data->nloads, load_pct, data->extra_ctor_cost);
> +	}
> +    }
> 
>  }
> 
>  /* Implement targetm.vectorize.init_cost.  */
> @@ -5308,6 +5350,9 @@ rs6000_init_cost (struct loop *loop_info, bool costing_for_scalar)
>    data->cost[vect_body]     = 0;
>    data->cost[vect_epilogue] = 0;
>    data->vect_nonmem = false;
> +  data->nstmts = 0;
> +  data->nloads = 0;
> +  data->extra_ctor_cost = 0;
> 
>    data->costing_for_scalar = costing_for_scalar;
>    return data;
>  }
> @@ -5335,6 +5380,66 @@ rs6000_adjust_vect_cost_per_stmt (enum vect_cost_for_stmt kind,
>    return 0;
>  }
> 
> +/* As a visitor function for each statement cost entry handled in
> +   function add_stmt_cost, gather some information and update its
> +   relevant fields in target cost accordingly.  */

I got lost trying to parse that..  (could be just me :-) 

Possibly instead something like
/* Helper function for add_stmt_cost ; gather information and update
the target_cost fields accordingly.  */



> +static void
> +rs6000_update_target_cost_per_stmt (rs6000_cost_data *data,
> +				    enum vect_cost_for_stmt kind,
> +				    struct _stmt_vec_info *stmt_info,
> +				    enum vect_cost_model_location where,
> +				    int stmt_cost, unsigned int orig_count)
> +{
> +
> +  /* Check whether we're doing something other than just a copy loop.
> +     Not all such loops may be profitably vectorized; see
> +     rs6000_finish_cost.  */
> +  if ((kind == vec_to_scalar || kind == vec_perm || kind == vec_promote_demote
> +       || kind == vec_construct || kind == scalar_to_vec)
> +      || (where == vect_body && kind == vector_stmt))

I thought I saw an alignment issue, then noticed the "(" that was
hiding on me.. :-)    

Maybe clearer to read if you rearrange slightly and flatten it ?  I
defer to others on that..

	if ((kind == vec_to_scalar
	     || kind == vec_perm
	     || kind == vec_promote_demote
	     || kind ==
vec_construct	
	     || kind == scalar_to_vec)
	    || (kind == vector_stmt && where == vect_body)
	

> +    data->vect_nonmem = true;
> +
> +  /* Gather some information when we are costing the vector version for
> +     the statements locate in a loop body.  */
s/version/instruction? operation?/  ? ? 
s/locate/located/

> +  if (!data->costing_for_scalar && data->loop_info && where == vect_body)
> +    {
> +      data->nstmts += orig_count;
> +
> +      if (kind == scalar_load || kind == vector_load || kind == unaligned_load
> +	  || kind == vector_gather_load)

Cosmetically, I'd move the second "||" to the next line to balance
those two lines a bit more. 

> +	data->nloads += orig_count;
> +
> +      /* If we have strided or elementwise loads into a vector, it's
> +	 possible to be bounded by latency and execution resources for
> +	 many scalar loads.  Try to account for this by scaling the
> +	 construction cost by the number of elements involved, when
> +	 handling each matching statement we record the possible extra
> +	 penalized cost into target cost, in the end of costing for
> +	 the whole loop, we do the actual penalization once some load
> +	 density heuristics are satisfied.  */
> +      if (kind == vec_construct && stmt_info
> +	  && STMT_VINFO_TYPE (stmt_info) == load_vec_info_type
> +	  && (STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_ELEMENTWISE
> +	      || STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_STRIDED_SLP))
> +	{
> +	  tree vectype = STMT_VINFO_VECTYPE (stmt_info);
> +	  unsigned int nunits = vect_nunits_for_cost (vectype);
> +	  unsigned int extra_cost = nunits * stmt_cost;
> +	  /* As function rs6000_builtin_vectorization_cost shows, we have
> +	     priced much on V16QI/V8HI vector construction as their units,
> +	     if we penalize them with nunits * stmt_cost, it can result in
> +	     a unreliable body cost, eg: for V16QI on Power8, stmt_cost is
s/a/an/ 
> +	     20 and nunits is 16, the extra cost is 320 which looks much
> +	     exaggerated.  So let's use one maximum bound for the extra
> +	     penalized cost for vector construction here.  */
> +	  const unsigned int MAX_PENALIZED_COST_FOR_CTOR = 12;
> +	  if (extra_cost > MAX_PENALIZED_COST_FOR_CTOR)
> +	    extra_cost = MAX_PENALIZED_COST_FOR_CTOR;
> +	  data->extra_ctor_cost += extra_cost;
> +	}
> +    }
> +}
ok

> +
> 
>  /* Implement targetm.vectorize.add_stmt_cost.  */
> 
>  static unsigned
> @@ -5354,6 +5459,7 @@ rs6000_add_stmt_cost (class vec_info *vinfo, void *data, int count,
>        /* Statements in an inner loop relative to the loop being
>  	 vectorized are weighted more heavily.  The value here is
>  	 arbitrary and could potentially be improved with analysis.  */
> +      unsigned int orig_count = count;

I don't see any other assignments to orig_count.  Does 'count' get
updated elsewhere in rs6000_add_stmt_cost() that the new orig_count
variable is necessary?

> 
>        if (where == vect_body && stmt_info
>  	  && stmt_in_inner_loop_p (vinfo, stmt_info))
>  	{
> @@ -5365,14 +5471,8 @@ rs6000_add_stmt_cost (class vec_info *vinfo, void *data, int count,
>        retval = (unsigned) (count * stmt_cost);
>        cost_data->cost[where] += retval;
> 
> -      /* Check whether we're doing something other than just a copy loop.
> -	 Not all such loops may be profitably vectorized; see
> -	 rs6000_finish_cost.  */
> -      if ((kind == vec_to_scalar || kind == vec_perm
> -	   || kind == vec_promote_demote || kind == vec_construct
> -	   || kind == scalar_to_vec)
> -	  || (where == vect_body && kind == vector_stmt))
> -	cost_data->vect_nonmem = true;
> +      rs6000_update_target_cost_per_stmt (cost_data, kind, stmt_info, where,
> +					  stmt_cost, orig_count);
> 
>      }
> 
>    return retval;
>

Kewen.Lin July 28, 2021, 2:59 a.m. UTC | #5

Hi William,

Thanks for the review comments!

on 2021/7/28 上午6:25, will schmidt wrote:
> On Wed, 2021-05-26 at 10:59 +0800, Kewen.Lin via Gcc-patches wrote:
>> Hi,
>>
> 
> 
> Hi,
> 
> 
>> This is the updated version of patch to deal with the bwaves_r
>> degradation due to vector construction fed by strided loads.
>>
>> As Richi's comments [1], this follows the similar idea to over
>> price the vector construction fed by VMAT_ELEMENTWISE or
>> VMAT_STRIDED_SLP.  Instead of adding the extra cost on vector
>> construction costing immediately, it firstly records how many
>> loads and vectorized statements in the given loop, later in
>> rs6000_density_test (called by finish_cost) it computes the
>> load density ratio against all vectorized stmts, and check
>> with the corresponding thresholds DENSITY_LOAD_NUM_THRESHOLD
>> and DENSITY_LOAD_PCT_THRESHOLD, do the actual extra pricing
>> if both thresholds are exceeded.
> 
> ok
> 
>>
>> Note that this new load density heuristic check is based on
>> some fields in target cost which are updated as needed when
>> scanning each add_stmt_cost entry, it's independent of the
>> current function rs6000_density_test which requires to scan
>> non_vect stmts.  Since it's checking the load stmts count
>> vs. all vectorized stmts, it's kind of density, so I put
>> it in function rs6000_density_test.  With the same reason to
>> keep it independent, I didn't put it as an else arm of the
>> current existing density threshold check hunk or before this
>> hunk.
> 
> ok
> 
>>
>> In the investigation of -1.04% degradation from 526.blender_r
>> on Power8, I noticed that the extra penalized cost 320 on one
>> single vector construction with type V16QI is much exaggerated,
>> which makes the final body cost unreliable, so this patch adds
>> one maximum bound for the extra penalized cost for each vector
>> construction statement.
> 
> ok
> 
>>
>> Bootstrapped/regtested on powerpc64le-linux-gnu P9.
>>
>> Full SPEC2017 performance evaluation on Power8/Power9 with
>> option combinations:
>>   * -O2 -ftree-vectorize {,-fvect-cost-model=very-cheap} {,-ffast-math}
>>   * {-O3, -Ofast} {,-funroll-loops}
>>
>> bwaves_r degradations on P8/P9 have been fixed, nothing else
>> remarkable was observed.
> 
> So, this fixes the "-1.04% degradation from 526.blender_r on Power8"
> degredation with no additional regressions.  that sounds good. 
> 

Sorry for the confusion, the original intention of this patch is to
fix the -8% degradation at -O2 -ftree-vectorize vs. -O2 on bwaves_r.
(From the last evaluation based on r12-1442, P8 is about -10% while
P9 is about -9%).  The mentioned -1.04% degradation from 526.blender_r
on P8 was expected to be a reason why we need the bound for the extra
penalized cost adjustment.

>>
>> Is it ok for trunk?
>>
>> [1] https://gcc.gnu.org/pipermail/gcc-patches/2021-May/570076.html
>>
>> BR,
>> Kewen
>> -----
>> gcc/ChangeLog:
>>
>> 	* config/rs6000/rs6000.c (struct rs6000_cost_data): New members
>> 	nstmts, nloads and extra_ctor_cost.
>> 	(rs6000_density_test): Add load density related heuristics and the
>> 	checks, do extra costing on vector construction statements if need.
>> 	(rs6000_init_cost): Init new members.
>> 	(rs6000_update_target_cost_per_stmt): New function.
>> 	(rs6000_add_stmt_cost): Factor vect_nonmem hunk out to function
>> 	rs6000_update_target_cost_per_stmt and call it.
>>
> 
>> diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
>> index 83d29cbfac1..806c3335cbc 100644
>> --- a/gcc/config/rs6000/rs6000.c
>> +++ b/gcc/config/rs6000/rs6000.c
>>
>> @@ -5231,6 +5231,12 @@ typedef struct _rs6000_cost_data
>>  {
>>    struct loop *loop_info;
>>    unsigned cost[3];
>> +  /* Total number of vectorized stmts (loop only).  */
>> +  unsigned nstmts;
>> +  /* Total number of loads (loop only).  */
>> +  unsigned nloads;
>> +  /* Possible extra penalized cost on vector construction (loop only).  */
>> +  unsigned extra_ctor_cost;
>>
>>    /* For each vectorized loop, this var holds TRUE iff a non-memory vector
>>       instruction is needed by the vectorization.  */
>>    bool vect_nonmem;
>> @@ -5292,9 +5298,45 @@ rs6000_density_test (rs6000_cost_data *data)
>>        if (dump_enabled_p ())
>>  	dump_printf_loc (MSG_NOTE, vect_location,
>>  			 "density %d%%, cost %d exceeds threshold, penalizing "
>> -			 "loop body cost by %d%%", density_pct,
>> +			 "loop body cost by %d%%\n", density_pct,
>>  			 vec_cost + not_vec_cost, DENSITY_PENALTY);
>>      }
>> +
>> +  /* Check if we need to penalize the body cost for latency and
>> +     execution resources bound from strided or elementwise loads
>> +     into a vector.  */
>> +  if (data->extra_ctor_cost > 0)
>> +    {
>> +      /* Threshold for load stmts percentage in all vectorized stmts.  */
>> +      const int DENSITY_LOAD_PCT_THRESHOLD = 45;
>> +      /* Threshold for total number of load stmts.  */
>> +      const int DENSITY_LOAD_NUM_THRESHOLD = 20;
>> +
>> +      gcc_assert (data->nloads <= data->nstmts);
>> +      unsigned int load_pct = (data->nloads * 100) / (data->nstmts);
>> +
>> +      /* It's likely to be bounded by latency and execution resources
>> +	 from many scalar loads which are strided or elementwise loads
>> +	 into a vector if both conditions below are found:
>> +	   1. there are many loads, it's easy to result in a long wait
>> +	      for load units;
>> +	   2. load has a big proportion of all vectorized statements,
>> +	      it's not easy to schedule other statements to spread among
>> +	      the loads.
>> +	 One typical case is the innermost loop of the hotspot of SPEC2017
>> +	 503.bwaves_r without loop interchange.  */
>> +      if (data->nloads > DENSITY_LOAD_NUM_THRESHOLD
>> +	  && load_pct > DENSITY_LOAD_PCT_THRESHOLD)
>> +	{
>> +	  data->cost[vect_body] += data->extra_ctor_cost;
>> +	  if (dump_enabled_p ())
>> +	    dump_printf_loc (MSG_NOTE, vect_location,
>> +			     "Found %u loads and load pct. %u%% exceed "
>> +			     "the threshold, penalizing loop body "
>> +			     "cost by extra cost %u for ctor.\n",
>> +			     data->nloads, load_pct, data->extra_ctor_cost);
>> +	}
>> +    }
>>
>>  }
>>
>>  /* Implement targetm.vectorize.init_cost.  */
>> @@ -5308,6 +5350,9 @@ rs6000_init_cost (struct loop *loop_info, bool costing_for_scalar)
>>    data->cost[vect_body]     = 0;
>>    data->cost[vect_epilogue] = 0;
>>    data->vect_nonmem = false;
>> +  data->nstmts = 0;
>> +  data->nloads = 0;
>> +  data->extra_ctor_cost = 0;
>>
>>    data->costing_for_scalar = costing_for_scalar;
>>    return data;
>>  }
>> @@ -5335,6 +5380,66 @@ rs6000_adjust_vect_cost_per_stmt (enum vect_cost_for_stmt kind,
>>    return 0;
>>  }
>>
>> +/* As a visitor function for each statement cost entry handled in
>> +   function add_stmt_cost, gather some information and update its
>> +   relevant fields in target cost accordingly.  */
> 
> I got lost trying to parse that..  (could be just me :-) 
> 
> Possibly instead something like
> /* Helper function for add_stmt_cost ; gather information and update
> the target_cost fields accordingly.  */
> 
> 

OK, will update.  I was thinking for each entry handled in function
add_stmt_cost, this helper acts like a visitor, trying to visit each
entry and take some actions if some conditions are satisifed.

> 
>> +static void
>> +rs6000_update_target_cost_per_stmt (rs6000_cost_data *data,
>> +				    enum vect_cost_for_stmt kind,
>> +				    struct _stmt_vec_info *stmt_info,
>> +				    enum vect_cost_model_location where,
>> +				    int stmt_cost, unsigned int orig_count)
>> +{
>> +
>> +  /* Check whether we're doing something other than just a copy loop.
>> +     Not all such loops may be profitably vectorized; see
>> +     rs6000_finish_cost.  */
>> +  if ((kind == vec_to_scalar || kind == vec_perm || kind == vec_promote_demote
>> +       || kind == vec_construct || kind == scalar_to_vec)
>> +      || (where == vect_body && kind == vector_stmt))
> 
> I thought I saw an alignment issue, then noticed the "(" that was
> hiding on me.. :-)    
> 
> Maybe clearer to read if you rearrange slightly and flatten it ?  I
> defer to others on that..
> 
> 	if ((kind == vec_to_scalar
> 	     || kind == vec_perm
> 	     || kind == vec_promote_demote
> 	     || kind ==
> vec_construct	
> 	     || kind == scalar_to_vec)
> 	    || (kind == vector_stmt && where == vect_body)
> 	
> 

This hunk is factored out from function rs6000_add_stmt_cost, maybe I
can keep the original formatting?  The formatting tool isn't so smart,
and sometimes rearrange things to become unexpected (although it meets
the basic rule, not so elegant), sigh.

>> +    data->vect_nonmem = true;
>> +
>> +  /* Gather some information when we are costing the vector version for
>> +     the statements locate in a loop body.  */
> s/version/instruction? operation?/  ? ? 
> s/locate/located/
> 

Will fix.

>> +  if (!data->costing_for_scalar && data->loop_info && where == vect_body)
>> +    {
>> +      data->nstmts += orig_count;
>> +
>> +      if (kind == scalar_load || kind == vector_load || kind == unaligned_load
>> +	  || kind == vector_gather_load)
> 
> Cosmetically, I'd move the second "||" to the next line to balance
> those two lines a bit more. 
> 

Will fix.

>> +	data->nloads += orig_count;
>> +
>> +      /* If we have strided or elementwise loads into a vector, it's
>> +	 possible to be bounded by latency and execution resources for
>> +	 many scalar loads.  Try to account for this by scaling the
>> +	 construction cost by the number of elements involved, when
>> +	 handling each matching statement we record the possible extra
>> +	 penalized cost into target cost, in the end of costing for
>> +	 the whole loop, we do the actual penalization once some load
>> +	 density heuristics are satisfied.  */
>> +      if (kind == vec_construct && stmt_info
>> +	  && STMT_VINFO_TYPE (stmt_info) == load_vec_info_type
>> +	  && (STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_ELEMENTWISE
>> +	      || STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_STRIDED_SLP))
>> +	{
>> +	  tree vectype = STMT_VINFO_VECTYPE (stmt_info);
>> +	  unsigned int nunits = vect_nunits_for_cost (vectype);
>> +	  unsigned int extra_cost = nunits * stmt_cost;
>> +	  /* As function rs6000_builtin_vectorization_cost shows, we have
>> +	     priced much on V16QI/V8HI vector construction as their units,
>> +	     if we penalize them with nunits * stmt_cost, it can result in
>> +	     a unreliable body cost, eg: for V16QI on Power8, stmt_cost is
> s/a/an/ 

Will fix.

>> +	     20 and nunits is 16, the extra cost is 320 which looks much
>> +	     exaggerated.  So let's use one maximum bound for the extra
>> +	     penalized cost for vector construction here.  */
>> +	  const unsigned int MAX_PENALIZED_COST_FOR_CTOR = 12;
>> +	  if (extra_cost > MAX_PENALIZED_COST_FOR_CTOR)
>> +	    extra_cost = MAX_PENALIZED_COST_FOR_CTOR;
>> +	  data->extra_ctor_cost += extra_cost;
>> +	}
>> +    }
>> +}
> ok
> 
>> +
>>
>>  /* Implement targetm.vectorize.add_stmt_cost.  */
>>
>>  static unsigned
>> @@ -5354,6 +5459,7 @@ rs6000_add_stmt_cost (class vec_info *vinfo, void *data, int count,
>>        /* Statements in an inner loop relative to the loop being
>>  	 vectorized are weighted more heavily.  The value here is
>>  	 arbitrary and could potentially be improved with analysis.  */
>> +      unsigned int orig_count = count;
> 
> I don't see any other assignments to orig_count.  Does 'count' get
> updated elsewhere in rs6000_add_stmt_cost() that the new orig_count
> variable is necessary?
> 

Yeah, the 'count' could get updated but the default "git diff" doesn't
show it, the codes omitted look as below:

      if (where == vect_body && stmt_info
	  && stmt_in_inner_loop_p (vinfo, stmt_info))
	{
	  loop_vec_info loop_vinfo = dyn_cast<loop_vec_info> (vinfo);
	  gcc_assert (loop_vinfo);
	  count *= LOOP_VINFO_INNER_LOOP_COST_FACTOR (loop_vinfo); /* FIXME.  */
	}

BR,
Kewen

>>
>>        if (where == vect_body && stmt_info
>>  	  && stmt_in_inner_loop_p (vinfo, stmt_info))
>>  	{
>> @@ -5365,14 +5471,8 @@ rs6000_add_stmt_cost (class vec_info *vinfo, void *data, int count,
>>        retval = (unsigned) (count * stmt_cost);
>>        cost_data->cost[where] += retval;
>>
>> -      /* Check whether we're doing something other than just a copy loop.
>> -	 Not all such loops may be profitably vectorized; see
>> -	 rs6000_finish_cost.  */
>> -      if ((kind == vec_to_scalar || kind == vec_perm
>> -	   || kind == vec_promote_demote || kind == vec_construct
>> -	   || kind == scalar_to_vec)
>> -	  || (where == vect_body && kind == vector_stmt))
>> -	cost_data->vect_nonmem = true;
>> +      rs6000_update_target_cost_per_stmt (cost_data, kind, stmt_info, where,
>> +					  stmt_cost, orig_count);
>>
>>      }
>>
>>    return retval;
>>
> 
>

Segher Boessenkool Sept. 6, 2021, 11:43 p.m. UTC | #6

Hi!

On Wed, Jul 28, 2021 at 10:59:50AM +0800, Kewen.Lin wrote:
> >> +/* As a visitor function for each statement cost entry handled in
> >> +   function add_stmt_cost, gather some information and update its
> >> +   relevant fields in target cost accordingly.  */
> > 
> > I got lost trying to parse that..  (could be just me :-) 
> > 
> > Possibly instead something like
> > /* Helper function for add_stmt_cost ; gather information and update
> > the target_cost fields accordingly.  */
> 
> OK, will update.  I was thinking for each entry handled in function
> add_stmt_cost, this helper acts like a visitor, trying to visit each
> entry and take some actions if some conditions are satisifed.

It (thankfully!) has nothing to do with the "visitor pattern", so some
other name might be better :-)

> > Maybe clearer to read if you rearrange slightly and flatten it ?  I
> > defer to others on that..
> > 
> > 	if ((kind == vec_to_scalar
> > 	     || kind == vec_perm
> > 	     || kind == vec_promote_demote
> > 	     || kind == vec_construct
> > 	     || kind == scalar_to_vec)
> > 	    || (kind == vector_stmt && where == vect_body)
> 
> This hunk is factored out from function rs6000_add_stmt_cost, maybe I
> can keep the original formatting?  The formatting tool isn't so smart,
> and sometimes rearrange things to become unexpected (although it meets
> the basic rule, not so elegant), sigh.

It has too many parens, making grouping where there is none, that is the
core issue.

	if (kind == vec_to_scalar
	    || kind == vec_perm
	    || kind == vec_promote_demote
	    || kind == vec_construct
	    || kind == scalar_to_vec
	    || (kind == vector_stmt && where == vect_body))


Segher

Kewen.Lin Sept. 8, 2021, 7:01 a.m. UTC | #7

Hi Segher,

Thanks for the comments!

on 2021/9/7 上午7:43, Segher Boessenkool wrote:
> Hi!
> 
> On Wed, Jul 28, 2021 at 10:59:50AM +0800, Kewen.Lin wrote:
>>>> +/* As a visitor function for each statement cost entry handled in
>>>> +   function add_stmt_cost, gather some information and update its
>>>> +   relevant fields in target cost accordingly.  */
>>>
>>> I got lost trying to parse that..  (could be just me :-) 
>>>
>>> Possibly instead something like
>>> /* Helper function for add_stmt_cost ; gather information and update
>>> the target_cost fields accordingly.  */
>>
>> OK, will update.  I was thinking for each entry handled in function
>> add_stmt_cost, this helper acts like a visitor, trying to visit each
>> entry and take some actions if some conditions are satisifed.
> 
> It (thankfully!) has nothing to do with the "visitor pattern", so some
> other name might be better :-)
> 
>>> Maybe clearer to read if you rearrange slightly and flatten it ?  I
>>> defer to others on that..
>>>
>>> 	if ((kind == vec_to_scalar
>>> 	     || kind == vec_perm
>>> 	     || kind == vec_promote_demote
>>> 	     || kind == vec_construct
>>> 	     || kind == scalar_to_vec)
>>> 	    || (kind == vector_stmt && where == vect_body)
>>
>> This hunk is factored out from function rs6000_add_stmt_cost, maybe I
>> can keep the original formatting?  The formatting tool isn't so smart,
>> and sometimes rearrange things to become unexpected (although it meets
>> the basic rule, not so elegant), sigh.
> 
> It has too many parens, making grouping where there is none, that is the
> core issue.
> 
> 	if (kind == vec_to_scalar
> 	    || kind == vec_perm
> 	    || kind == vec_promote_demote
> 	    || kind == vec_construct
> 	    || kind == scalar_to_vec
> 	    || (kind == vector_stmt && where == vect_body))
> 
> 

Good catch, I've updated it in V4.

BR,
Kewen

[v2] rs6000: Add load density heuristic

Commit Message

Comments

Patch