diff mbox series

[tree-optimization,predcom] Improve unroll factor for predictive commoning

Message ID 32ad22b3-1fa7-47f6-b8e9-877cea9aee00@linux.ibm.com
State New
Headers show
Series [tree-optimization,predcom] Improve unroll factor for predictive commoning | expand

Commit Message

Ajit Agarwal July 11, 2024, 8:28 a.m. UTC
Hello All:

Unroll factor is determined with max distance across loop iterations.
The logic for determining the loop unroll factor is based on
data dependency across loop iterations.

The max distance across loop iterations is the unrolling factor
that helps in predictive commoning.

Bootstrapped and regtested on powerpc64-linux-gnu.

Thanks & Regards
Ajit

tree-optimization, predcom: Improve unroll factor for predictive commoning

Unroll factor is determined with max distance across loop iterations.
The logic for determining the loop unroll factor is based on
data dependency across loop iterations.

The max distance across loop iterations is the unrolling factor
that helps in predictive commoning.

2024-07-11  Ajit Kumar Agarwal  <aagarwa1@linux.ibm.com>

gcc/ChangeLog:

	* tree-predcom.cc: Change in determining unroll factor with
	data dependence across loop iterations.
---
 gcc/tree-predcom.cc | 51 ++++++++++++++++++++++++++++++++++-----------
 1 file changed, 39 insertions(+), 12 deletions(-)

Comments

Richard Biener July 11, 2024, 8:51 a.m. UTC | #1
On Thu, Jul 11, 2024 at 10:30 AM Ajit Agarwal <aagarwa1@linux.ibm.com> wrote:
>
> Hello All:
>
> Unroll factor is determined with max distance across loop iterations.
> The logic for determining the loop unroll factor is based on
> data dependency across loop iterations.
>
> The max distance across loop iterations is the unrolling factor
> that helps in predictive commoning.

The old comment in the code says

> -      /* The best unroll factor for this chain is equal to the number of
> -        temporary variables that we create for it.  */

why is that wrong and why is the max dependence distance more correct?

Do you have a testcase that shows how this makes a (positive) difference?

Richard.

> Bootstrapped and regtested on powerpc64-linux-gnu.
>
> Thanks & Regards
> Ajit
>
> tree-optimization, predcom: Improve unroll factor for predictive commoning
>
> Unroll factor is determined with max distance across loop iterations.
> The logic for determining the loop unroll factor is based on
> data dependency across loop iterations.
>
> The max distance across loop iterations is the unrolling factor
> that helps in predictive commoning.
>
> 2024-07-11  Ajit Kumar Agarwal  <aagarwa1@linux.ibm.com>
>
> gcc/ChangeLog:
>
>         * tree-predcom.cc: Change in determining unroll factor with
>         data dependence across loop iterations.
> ---
>  gcc/tree-predcom.cc | 51 ++++++++++++++++++++++++++++++++++-----------
>  1 file changed, 39 insertions(+), 12 deletions(-)
>
> diff --git a/gcc/tree-predcom.cc b/gcc/tree-predcom.cc
> index 9844fee1e97..029b02f5990 100644
> --- a/gcc/tree-predcom.cc
> +++ b/gcc/tree-predcom.cc
> @@ -409,6 +409,7 @@ public:
>    /* Perform the predictive commoning optimization for chains, make this
>       public for being called in callback execute_pred_commoning_cbck.  */
>    void execute_pred_commoning (bitmap tmp_vars);
> +  unsigned determine_unroll_factor (const vec<chain_p> &chains);
>
>  private:
>    /* The pointer to the given loop.  */
> @@ -2400,13 +2401,46 @@ pcom_worker::execute_pred_commoning_chain (chain_p chain,
>     copies as possible.  CHAINS is the list of chains that will be
>     optimized.  */
>
> -static unsigned
> -determine_unroll_factor (const vec<chain_p> &chains)
> +unsigned
> +pcom_worker::determine_unroll_factor (const vec<chain_p> &chains)
>  {
>    chain_p chain;
> -  unsigned factor = 1, af, nfactor, i;
> +  unsigned factor = 1, i;
>    unsigned max = param_max_unroll_times;
> +  struct data_dependence_relation *ddr;
> +  unsigned nfactor = 0;
> +  int nzfactor = 0;
> +
> +  /* Best unroll factor is the maximum distance across loop
> +     iterations.  */
> +  FOR_EACH_VEC_ELT (m_dependences, i, ddr)
> +    {
> +      for (unsigned j = 0; j < DDR_NUM_DIST_VECTS (ddr); j++)
> +       {
> +         lambda_vector vec = DDR_DIST_VECT (ddr, j);
> +         widest_int distance = vec[j];
> +         unsigned offset = distance.to_uhwi ();
> +         if (offset == 0)
> +           continue;
> +
> +         int dist = offset - nzfactor;
> +         if (dist  == 0)
> +           continue;
>
> +         if (nfactor == 0)
> +           {
> +             nfactor = offset;
> +             nzfactor = offset;
> +           }
> +         else if (dist <= nzfactor)
> +           nfactor = offset;
> +
> +         if (nfactor > 0 && nfactor <= max)
> +           factor = nfactor;
> +       }
> +    }
> +
> +  int max_use = 0;
>    FOR_EACH_VEC_ELT (chains, i, chain)
>      {
>        if (chain->type == CT_INVARIANT)
> @@ -2427,17 +2461,10 @@ determine_unroll_factor (const vec<chain_p> &chains)
>           continue;
>         }
>
> -      /* The best unroll factor for this chain is equal to the number of
> -        temporary variables that we create for it.  */
> -      af = chain->length;
>        if (chain->has_max_use_after)
> -       af++;
> -
> -      nfactor = factor * af / gcd (factor, af);
> -      if (nfactor <= max)
> -       factor = nfactor;
> +       max_use++;
>      }
> -
> +  factor += max_use;
>    return factor;
>  }
>
> --
> 2.43.5
>
Ajit Agarwal July 12, 2024, 10:09 a.m. UTC | #2
Hello Richard:

On 11/07/24 2:21 pm, Richard Biener wrote:
> On Thu, Jul 11, 2024 at 10:30 AM Ajit Agarwal <aagarwa1@linux.ibm.com> wrote:
>>
>> Hello All:
>>
>> Unroll factor is determined with max distance across loop iterations.
>> The logic for determining the loop unroll factor is based on
>> data dependency across loop iterations.
>>
>> The max distance across loop iterations is the unrolling factor
>> that helps in predictive commoning.
> 
> The old comment in the code says
> 
>> -      /* The best unroll factor for this chain is equal to the number of
>> -        temporary variables that we create for it.  */
> 
> why is that wrong and why is the max dependence distance more correct?
> 
> Do you have a testcase that shows how this makes a (positive) difference?
>

There is nothing wrong in the existing implementation of unroll
factor for predictive commoning.

But with max dependence distance we get performance improvement
with spec 2017 benchmarks (INT) of 0.01% (Geomean) with and without
changes. Improvement in benchmarks with max dependence distance
changes.

I have used the following flags:
-O2 -funroll-loops --param max-unroll-times=8 -fpredictive-commoning -fno-tree-pre

With above flags I ran with and without changes.

There is no degradation with spec 2017 (FP benchmarks).

Because in predictive commoning we reuse values computed in
earlier iterations of a loop in the later ones, max distance is the
better choice.

> Richard.
>

Thanks & Regards
Ajit
 
>> Bootstrapped and regtested on powerpc64-linux-gnu.
>>
>> Thanks & Regards
>> Ajit
>>
>> tree-optimization, predcom: Improve unroll factor for predictive commoning
>>
>> Unroll factor is determined with max distance across loop iterations.
>> The logic for determining the loop unroll factor is based on
>> data dependency across loop iterations.
>>
>> The max distance across loop iterations is the unrolling factor
>> that helps in predictive commoning.
>>
>> 2024-07-11  Ajit Kumar Agarwal  <aagarwa1@linux.ibm.com>
>>
>> gcc/ChangeLog:
>>
>>         * tree-predcom.cc: Change in determining unroll factor with
>>         data dependence across loop iterations.
>> ---
>>  gcc/tree-predcom.cc | 51 ++++++++++++++++++++++++++++++++++-----------
>>  1 file changed, 39 insertions(+), 12 deletions(-)
>>
>> diff --git a/gcc/tree-predcom.cc b/gcc/tree-predcom.cc
>> index 9844fee1e97..029b02f5990 100644
>> --- a/gcc/tree-predcom.cc
>> +++ b/gcc/tree-predcom.cc
>> @@ -409,6 +409,7 @@ public:
>>    /* Perform the predictive commoning optimization for chains, make this
>>       public for being called in callback execute_pred_commoning_cbck.  */
>>    void execute_pred_commoning (bitmap tmp_vars);
>> +  unsigned determine_unroll_factor (const vec<chain_p> &chains);
>>
>>  private:
>>    /* The pointer to the given loop.  */
>> @@ -2400,13 +2401,46 @@ pcom_worker::execute_pred_commoning_chain (chain_p chain,
>>     copies as possible.  CHAINS is the list of chains that will be
>>     optimized.  */
>>
>> -static unsigned
>> -determine_unroll_factor (const vec<chain_p> &chains)
>> +unsigned
>> +pcom_worker::determine_unroll_factor (const vec<chain_p> &chains)
>>  {
>>    chain_p chain;
>> -  unsigned factor = 1, af, nfactor, i;
>> +  unsigned factor = 1, i;
>>    unsigned max = param_max_unroll_times;
>> +  struct data_dependence_relation *ddr;
>> +  unsigned nfactor = 0;
>> +  int nzfactor = 0;
>> +
>> +  /* Best unroll factor is the maximum distance across loop
>> +     iterations.  */
>> +  FOR_EACH_VEC_ELT (m_dependences, i, ddr)
>> +    {
>> +      for (unsigned j = 0; j < DDR_NUM_DIST_VECTS (ddr); j++)
>> +       {
>> +         lambda_vector vec = DDR_DIST_VECT (ddr, j);
>> +         widest_int distance = vec[j];
>> +         unsigned offset = distance.to_uhwi ();
>> +         if (offset == 0)
>> +           continue;
>> +
>> +         int dist = offset - nzfactor;
>> +         if (dist  == 0)
>> +           continue;
>>
>> +         if (nfactor == 0)
>> +           {
>> +             nfactor = offset;
>> +             nzfactor = offset;
>> +           }
>> +         else if (dist <= nzfactor)
>> +           nfactor = offset;
>> +
>> +         if (nfactor > 0 && nfactor <= max)
>> +           factor = nfactor;
>> +       }
>> +    }
>> +
>> +  int max_use = 0;
>>    FOR_EACH_VEC_ELT (chains, i, chain)
>>      {
>>        if (chain->type == CT_INVARIANT)
>> @@ -2427,17 +2461,10 @@ determine_unroll_factor (const vec<chain_p> &chains)
>>           continue;
>>         }
>>
>> -      /* The best unroll factor for this chain is equal to the number of
>> -        temporary variables that we create for it.  */
>> -      af = chain->length;
>>        if (chain->has_max_use_after)
>> -       af++;
>> -
>> -      nfactor = factor * af / gcd (factor, af);
>> -      if (nfactor <= max)
>> -       factor = nfactor;
>> +       max_use++;
>>      }
>> -
>> +  factor += max_use;
>>    return factor;
>>  }
>>
>> --
>> 2.43.5
>>
Richard Biener July 12, 2024, 12:50 p.m. UTC | #3
On Fri, Jul 12, 2024 at 12:09 PM Ajit Agarwal <aagarwa1@linux.ibm.com> wrote:
>
> Hello Richard:
>
> On 11/07/24 2:21 pm, Richard Biener wrote:
> > On Thu, Jul 11, 2024 at 10:30 AM Ajit Agarwal <aagarwa1@linux.ibm.com> wrote:
> >>
> >> Hello All:
> >>
> >> Unroll factor is determined with max distance across loop iterations.
> >> The logic for determining the loop unroll factor is based on
> >> data dependency across loop iterations.
> >>
> >> The max distance across loop iterations is the unrolling factor
> >> that helps in predictive commoning.
> >
> > The old comment in the code says
> >
> >> -      /* The best unroll factor for this chain is equal to the number of
> >> -        temporary variables that we create for it.  */
> >
> > why is that wrong and why is the max dependence distance more correct?
> >
> > Do you have a testcase that shows how this makes a (positive) difference?
> >
>
> There is nothing wrong in the existing implementation of unroll
> factor for predictive commoning.
>
> But with max dependence distance we get performance improvement
> with spec 2017 benchmarks (INT) of 0.01% (Geomean) with and without
> changes. Improvement in benchmarks with max dependence distance
> changes.
>
> I have used the following flags:
> -O2 -funroll-loops --param max-unroll-times=8 -fpredictive-commoning -fno-tree-pre
>
> With above flags I ran with and without changes.

A 0.01% geomean improvement is noise.  Why did you disable PRE?

> There is no degradation with spec 2017 (FP benchmarks).
>
> Because in predictive commoning we reuse values computed in
> earlier iterations of a loop in the later ones, max distance is the
> better choice.

The re-use distance is the same though.  So your change merely increases
the unroll factor?  Or can you explain why there is more re-use with
your change.

Richard.

> > Richard.
> >
>
> Thanks & Regards
> Ajit
>
> >> Bootstrapped and regtested on powerpc64-linux-gnu.
> >>
> >> Thanks & Regards
> >> Ajit
> >>
> >> tree-optimization, predcom: Improve unroll factor for predictive commoning
> >>
> >> Unroll factor is determined with max distance across loop iterations.
> >> The logic for determining the loop unroll factor is based on
> >> data dependency across loop iterations.
> >>
> >> The max distance across loop iterations is the unrolling factor
> >> that helps in predictive commoning.
> >>
> >> 2024-07-11  Ajit Kumar Agarwal  <aagarwa1@linux.ibm.com>
> >>
> >> gcc/ChangeLog:
> >>
> >>         * tree-predcom.cc: Change in determining unroll factor with
> >>         data dependence across loop iterations.
> >> ---
> >>  gcc/tree-predcom.cc | 51 ++++++++++++++++++++++++++++++++++-----------
> >>  1 file changed, 39 insertions(+), 12 deletions(-)
> >>
> >> diff --git a/gcc/tree-predcom.cc b/gcc/tree-predcom.cc
> >> index 9844fee1e97..029b02f5990 100644
> >> --- a/gcc/tree-predcom.cc
> >> +++ b/gcc/tree-predcom.cc
> >> @@ -409,6 +409,7 @@ public:
> >>    /* Perform the predictive commoning optimization for chains, make this
> >>       public for being called in callback execute_pred_commoning_cbck.  */
> >>    void execute_pred_commoning (bitmap tmp_vars);
> >> +  unsigned determine_unroll_factor (const vec<chain_p> &chains);
> >>
> >>  private:
> >>    /* The pointer to the given loop.  */
> >> @@ -2400,13 +2401,46 @@ pcom_worker::execute_pred_commoning_chain (chain_p chain,
> >>     copies as possible.  CHAINS is the list of chains that will be
> >>     optimized.  */
> >>
> >> -static unsigned
> >> -determine_unroll_factor (const vec<chain_p> &chains)
> >> +unsigned
> >> +pcom_worker::determine_unroll_factor (const vec<chain_p> &chains)
> >>  {
> >>    chain_p chain;
> >> -  unsigned factor = 1, af, nfactor, i;
> >> +  unsigned factor = 1, i;
> >>    unsigned max = param_max_unroll_times;
> >> +  struct data_dependence_relation *ddr;
> >> +  unsigned nfactor = 0;
> >> +  int nzfactor = 0;
> >> +
> >> +  /* Best unroll factor is the maximum distance across loop
> >> +     iterations.  */
> >> +  FOR_EACH_VEC_ELT (m_dependences, i, ddr)
> >> +    {
> >> +      for (unsigned j = 0; j < DDR_NUM_DIST_VECTS (ddr); j++)
> >> +       {
> >> +         lambda_vector vec = DDR_DIST_VECT (ddr, j);
> >> +         widest_int distance = vec[j];
> >> +         unsigned offset = distance.to_uhwi ();
> >> +         if (offset == 0)
> >> +           continue;
> >> +
> >> +         int dist = offset - nzfactor;
> >> +         if (dist  == 0)
> >> +           continue;
> >>
> >> +         if (nfactor == 0)
> >> +           {
> >> +             nfactor = offset;
> >> +             nzfactor = offset;
> >> +           }
> >> +         else if (dist <= nzfactor)
> >> +           nfactor = offset;
> >> +
> >> +         if (nfactor > 0 && nfactor <= max)
> >> +           factor = nfactor;
> >> +       }
> >> +    }
> >> +
> >> +  int max_use = 0;
> >>    FOR_EACH_VEC_ELT (chains, i, chain)
> >>      {
> >>        if (chain->type == CT_INVARIANT)
> >> @@ -2427,17 +2461,10 @@ determine_unroll_factor (const vec<chain_p> &chains)
> >>           continue;
> >>         }
> >>
> >> -      /* The best unroll factor for this chain is equal to the number of
> >> -        temporary variables that we create for it.  */
> >> -      af = chain->length;
> >>        if (chain->has_max_use_after)
> >> -       af++;
> >> -
> >> -      nfactor = factor * af / gcd (factor, af);
> >> -      if (nfactor <= max)
> >> -       factor = nfactor;
> >> +       max_use++;
> >>      }
> >> -
> >> +  factor += max_use;
> >>    return factor;
> >>  }
> >>
> >> --
> >> 2.43.5
> >>
Ajit Agarwal July 13, 2024, 2:35 p.m. UTC | #4
Hello Richard:

On 12/07/24 6:20 pm, Richard Biener wrote:
> On Fri, Jul 12, 2024 at 12:09 PM Ajit Agarwal <aagarwa1@linux.ibm.com> wrote:
>>
>> Hello Richard:
>>
>> On 11/07/24 2:21 pm, Richard Biener wrote:
>>> On Thu, Jul 11, 2024 at 10:30 AM Ajit Agarwal <aagarwa1@linux.ibm.com> wrote:
>>>>
>>>> Hello All:
>>>>
>>>> Unroll factor is determined with max distance across loop iterations.
>>>> The logic for determining the loop unroll factor is based on
>>>> data dependency across loop iterations.
>>>>
>>>> The max distance across loop iterations is the unrolling factor
>>>> that helps in predictive commoning.
>>>
>>> The old comment in the code says
>>>
>>>> -      /* The best unroll factor for this chain is equal to the number of
>>>> -        temporary variables that we create for it.  */
>>>
>>> why is that wrong and why is the max dependence distance more correct?
>>>
>>> Do you have a testcase that shows how this makes a (positive) difference?
>>>
>>
>> There is nothing wrong in the existing implementation of unroll
>> factor for predictive commoning.
>>
>> But with max dependence distance we get performance improvement
>> with spec 2017 benchmarks (INT) of 0.01% (Geomean) with and without
>> changes. Improvement in benchmarks with max dependence distance
>> changes.
>>
>> I have used the following flags:
>> -O2 -funroll-loops --param max-unroll-times=8 -fpredictive-commoning -fno-tree-pre
>>
>> With above flags I ran with and without changes.
> 
> A 0.01% geomean improvement is noise.  Why did you disable PRE?
> 

I have changed the flags used. Now I change the flags to -O3 and
-fpredictive-commoning -funroll-loops.
I am measuring the performance  with the above flags for spec 2017
benchmarks and would let you know the performance by Monday.

>> There is no degradation with spec 2017 (FP benchmarks).
>>
>> Because in predictive commoning we reuse values computed in
>> earlier iterations of a loop in the later ones, max distance is the
>> better choice.
> 
> The re-use distance is the same though.  So your change merely increases
> the unroll factor?  Or can you explain why there is more re-use with
> your change.
>

With -O3 flag and -fpredictive-commoning -funroll-loops in spec 2017 
benchmarks many benchmarks increases the unroll factor with my changes
in predictive commoning determine_unroll_factor.

I have traversed the data dependence relation vector and get the distance
and the max data dependence distance is the unroll factor.

> Richard.
>

Thanks & Regards
Ajit
 
>>> Richard.
>>>
>>
>> Thanks & Regards
>> Ajit
>>
>>>> Bootstrapped and regtested on powerpc64-linux-gnu.
>>>>
>>>> Thanks & Regards
>>>> Ajit
>>>>
>>>> tree-optimization, predcom: Improve unroll factor for predictive commoning
>>>>
>>>> Unroll factor is determined with max distance across loop iterations.
>>>> The logic for determining the loop unroll factor is based on
>>>> data dependency across loop iterations.
>>>>
>>>> The max distance across loop iterations is the unrolling factor
>>>> that helps in predictive commoning.
>>>>
>>>> 2024-07-11  Ajit Kumar Agarwal  <aagarwa1@linux.ibm.com>
>>>>
>>>> gcc/ChangeLog:
>>>>
>>>>         * tree-predcom.cc: Change in determining unroll factor with
>>>>         data dependence across loop iterations.
>>>> ---
>>>>  gcc/tree-predcom.cc | 51 ++++++++++++++++++++++++++++++++++-----------
>>>>  1 file changed, 39 insertions(+), 12 deletions(-)
>>>>
>>>> diff --git a/gcc/tree-predcom.cc b/gcc/tree-predcom.cc
>>>> index 9844fee1e97..029b02f5990 100644
>>>> --- a/gcc/tree-predcom.cc
>>>> +++ b/gcc/tree-predcom.cc
>>>> @@ -409,6 +409,7 @@ public:
>>>>    /* Perform the predictive commoning optimization for chains, make this
>>>>       public for being called in callback execute_pred_commoning_cbck.  */
>>>>    void execute_pred_commoning (bitmap tmp_vars);
>>>> +  unsigned determine_unroll_factor (const vec<chain_p> &chains);
>>>>
>>>>  private:
>>>>    /* The pointer to the given loop.  */
>>>> @@ -2400,13 +2401,46 @@ pcom_worker::execute_pred_commoning_chain (chain_p chain,
>>>>     copies as possible.  CHAINS is the list of chains that will be
>>>>     optimized.  */
>>>>
>>>> -static unsigned
>>>> -determine_unroll_factor (const vec<chain_p> &chains)
>>>> +unsigned
>>>> +pcom_worker::determine_unroll_factor (const vec<chain_p> &chains)
>>>>  {
>>>>    chain_p chain;
>>>> -  unsigned factor = 1, af, nfactor, i;
>>>> +  unsigned factor = 1, i;
>>>>    unsigned max = param_max_unroll_times;
>>>> +  struct data_dependence_relation *ddr;
>>>> +  unsigned nfactor = 0;
>>>> +  int nzfactor = 0;
>>>> +
>>>> +  /* Best unroll factor is the maximum distance across loop
>>>> +     iterations.  */
>>>> +  FOR_EACH_VEC_ELT (m_dependences, i, ddr)
>>>> +    {
>>>> +      for (unsigned j = 0; j < DDR_NUM_DIST_VECTS (ddr); j++)
>>>> +       {
>>>> +         lambda_vector vec = DDR_DIST_VECT (ddr, j);
>>>> +         widest_int distance = vec[j];
>>>> +         unsigned offset = distance.to_uhwi ();
>>>> +         if (offset == 0)
>>>> +           continue;
>>>> +
>>>> +         int dist = offset - nzfactor;
>>>> +         if (dist  == 0)
>>>> +           continue;
>>>>
>>>> +         if (nfactor == 0)
>>>> +           {
>>>> +             nfactor = offset;
>>>> +             nzfactor = offset;
>>>> +           }
>>>> +         else if (dist <= nzfactor)
>>>> +           nfactor = offset;
>>>> +
>>>> +         if (nfactor > 0 && nfactor <= max)
>>>> +           factor = nfactor;
>>>> +       }
>>>> +    }
>>>> +
>>>> +  int max_use = 0;
>>>>    FOR_EACH_VEC_ELT (chains, i, chain)
>>>>      {
>>>>        if (chain->type == CT_INVARIANT)
>>>> @@ -2427,17 +2461,10 @@ determine_unroll_factor (const vec<chain_p> &chains)
>>>>           continue;
>>>>         }
>>>>
>>>> -      /* The best unroll factor for this chain is equal to the number of
>>>> -        temporary variables that we create for it.  */
>>>> -      af = chain->length;
>>>>        if (chain->has_max_use_after)
>>>> -       af++;
>>>> -
>>>> -      nfactor = factor * af / gcd (factor, af);
>>>> -      if (nfactor <= max)
>>>> -       factor = nfactor;
>>>> +       max_use++;
>>>>      }
>>>> -
>>>> +  factor += max_use;
>>>>    return factor;
>>>>  }
>>>>
>>>> --
>>>> 2.43.5
>>>>
Ajit Agarwal July 13, 2024, 2:46 p.m. UTC | #5
Hello Richard:

On 12/07/24 6:20 pm, Richard Biener wrote:
> On Fri, Jul 12, 2024 at 12:09 PM Ajit Agarwal <aagarwa1@linux.ibm.com> wrote:
>>
>> Hello Richard:
>>
>> On 11/07/24 2:21 pm, Richard Biener wrote:
>>> On Thu, Jul 11, 2024 at 10:30 AM Ajit Agarwal <aagarwa1@linux.ibm.com> wrote:
>>>>
>>>> Hello All:
>>>>
>>>> Unroll factor is determined with max distance across loop iterations.
>>>> The logic for determining the loop unroll factor is based on
>>>> data dependency across loop iterations.
>>>>
>>>> The max distance across loop iterations is the unrolling factor
>>>> that helps in predictive commoning.
>>>
>>> The old comment in the code says
>>>
>>>> -      /* The best unroll factor for this chain is equal to the number of
>>>> -        temporary variables that we create for it.  */
>>>
>>> why is that wrong and why is the max dependence distance more correct?
>>>
>>> Do you have a testcase that shows how this makes a (positive) difference?
>>>
>>
>> There is nothing wrong in the existing implementation of unroll
>> factor for predictive commoning.
>>
>> But with max dependence distance we get performance improvement
>> with spec 2017 benchmarks (INT) of 0.01% (Geomean) with and without
>> changes. Improvement in benchmarks with max dependence distance
>> changes.
>>
>> I have used the following flags:
>> -O2 -funroll-loops --param max-unroll-times=8 -fpredictive-commoning -fno-tree-pre
>>
>> With above flags I ran with and without changes.
> 
> A 0.01% geomean improvement is noise.  Why did you disable PRE?
> 

I have changed the flags. Now I changed the flags to -O3
-fpredictive-commoning -funroll-loops. With these flags
I am measuring the performance with spec 2017 benchmarks.
Would let you know the results by Monday.
 
>> There is no degradation with spec 2017 (FP benchmarks).
>>
>> Because in predictive commoning we reuse values computed in
>> earlier iterations of a loop in the later ones, max distance is the
>> better choice.
> 
> The re-use distance is the same though.  So your change merely increases
> the unroll factor?  Or can you explain why there is more re-use with
> your change.
>

With -O3 -fpredictive-commoning -funroll-loops many spec 2017 benchmarks
increases unroll factor with my changes.

I am traversing data dependence relation vector, get the distance
and max distance is the unroll factor.
 
> Richard.
> 
Thanks & Regards
Ajit
>>> Richard.
>>>
>>
>> Thanks & Regards
>> Ajit
>>
>>>> Bootstrapped and regtested on powerpc64-linux-gnu.
>>>>
>>>> Thanks & Regards
>>>> Ajit
>>>>
>>>> tree-optimization, predcom: Improve unroll factor for predictive commoning
>>>>
>>>> Unroll factor is determined with max distance across loop iterations.
>>>> The logic for determining the loop unroll factor is based on
>>>> data dependency across loop iterations.
>>>>
>>>> The max distance across loop iterations is the unrolling factor
>>>> that helps in predictive commoning.
>>>>
>>>> 2024-07-11  Ajit Kumar Agarwal  <aagarwa1@linux.ibm.com>
>>>>
>>>> gcc/ChangeLog:
>>>>
>>>>         * tree-predcom.cc: Change in determining unroll factor with
>>>>         data dependence across loop iterations.
>>>> ---
>>>>  gcc/tree-predcom.cc | 51 ++++++++++++++++++++++++++++++++++-----------
>>>>  1 file changed, 39 insertions(+), 12 deletions(-)
>>>>
>>>> diff --git a/gcc/tree-predcom.cc b/gcc/tree-predcom.cc
>>>> index 9844fee1e97..029b02f5990 100644
>>>> --- a/gcc/tree-predcom.cc
>>>> +++ b/gcc/tree-predcom.cc
>>>> @@ -409,6 +409,7 @@ public:
>>>>    /* Perform the predictive commoning optimization for chains, make this
>>>>       public for being called in callback execute_pred_commoning_cbck.  */
>>>>    void execute_pred_commoning (bitmap tmp_vars);
>>>> +  unsigned determine_unroll_factor (const vec<chain_p> &chains);
>>>>
>>>>  private:
>>>>    /* The pointer to the given loop.  */
>>>> @@ -2400,13 +2401,46 @@ pcom_worker::execute_pred_commoning_chain (chain_p chain,
>>>>     copies as possible.  CHAINS is the list of chains that will be
>>>>     optimized.  */
>>>>
>>>> -static unsigned
>>>> -determine_unroll_factor (const vec<chain_p> &chains)
>>>> +unsigned
>>>> +pcom_worker::determine_unroll_factor (const vec<chain_p> &chains)
>>>>  {
>>>>    chain_p chain;
>>>> -  unsigned factor = 1, af, nfactor, i;
>>>> +  unsigned factor = 1, i;
>>>>    unsigned max = param_max_unroll_times;
>>>> +  struct data_dependence_relation *ddr;
>>>> +  unsigned nfactor = 0;
>>>> +  int nzfactor = 0;
>>>> +
>>>> +  /* Best unroll factor is the maximum distance across loop
>>>> +     iterations.  */
>>>> +  FOR_EACH_VEC_ELT (m_dependences, i, ddr)
>>>> +    {
>>>> +      for (unsigned j = 0; j < DDR_NUM_DIST_VECTS (ddr); j++)
>>>> +       {
>>>> +         lambda_vector vec = DDR_DIST_VECT (ddr, j);
>>>> +         widest_int distance = vec[j];
>>>> +         unsigned offset = distance.to_uhwi ();
>>>> +         if (offset == 0)
>>>> +           continue;
>>>> +
>>>> +         int dist = offset - nzfactor;
>>>> +         if (dist  == 0)
>>>> +           continue;
>>>>
>>>> +         if (nfactor == 0)
>>>> +           {
>>>> +             nfactor = offset;
>>>> +             nzfactor = offset;
>>>> +           }
>>>> +         else if (dist <= nzfactor)
>>>> +           nfactor = offset;
>>>> +
>>>> +         if (nfactor > 0 && nfactor <= max)
>>>> +           factor = nfactor;
>>>> +       }
>>>> +    }
>>>> +
>>>> +  int max_use = 0;
>>>>    FOR_EACH_VEC_ELT (chains, i, chain)
>>>>      {
>>>>        if (chain->type == CT_INVARIANT)
>>>> @@ -2427,17 +2461,10 @@ determine_unroll_factor (const vec<chain_p> &chains)
>>>>           continue;
>>>>         }
>>>>
>>>> -      /* The best unroll factor for this chain is equal to the number of
>>>> -        temporary variables that we create for it.  */
>>>> -      af = chain->length;
>>>>        if (chain->has_max_use_after)
>>>> -       af++;
>>>> -
>>>> -      nfactor = factor * af / gcd (factor, af);
>>>> -      if (nfactor <= max)
>>>> -       factor = nfactor;
>>>> +       max_use++;
>>>>      }
>>>> -
>>>> +  factor += max_use;
>>>>    return factor;
>>>>  }
>>>>
>>>> --
>>>> 2.43.5
>>>>
Ajit Agarwal July 15, 2024, 12:37 p.m. UTC | #6
Hello Richard:

On 13/07/24 8:16 pm, Ajit Agarwal wrote:
> Hello Richard:
> 
> On 12/07/24 6:20 pm, Richard Biener wrote:
>> On Fri, Jul 12, 2024 at 12:09 PM Ajit Agarwal <aagarwa1@linux.ibm.com> wrote:
>>>
>>> Hello Richard:
>>>
>>> On 11/07/24 2:21 pm, Richard Biener wrote:
>>>> On Thu, Jul 11, 2024 at 10:30 AM Ajit Agarwal <aagarwa1@linux.ibm.com> wrote:
>>>>>
>>>>> Hello All:
>>>>>
>>>>> Unroll factor is determined with max distance across loop iterations.
>>>>> The logic for determining the loop unroll factor is based on
>>>>> data dependency across loop iterations.
>>>>>
>>>>> The max distance across loop iterations is the unrolling factor
>>>>> that helps in predictive commoning.
>>>>
>>>> The old comment in the code says
>>>>
>>>>> -      /* The best unroll factor for this chain is equal to the number of
>>>>> -        temporary variables that we create for it.  */
>>>>
>>>> why is that wrong and why is the max dependence distance more correct?
>>>>
>>>> Do you have a testcase that shows how this makes a (positive) difference?
>>>>
>>>
>>> There is nothing wrong in the existing implementation of unroll
>>> factor for predictive commoning.
>>>
>>> But with max dependence distance we get performance improvement
>>> with spec 2017 benchmarks (INT) of 0.01% (Geomean) with and without
>>> changes. Improvement in benchmarks with max dependence distance
>>> changes.
>>>
>>> I have used the following flags:
>>> -O2 -funroll-loops --param max-unroll-times=8 -fpredictive-commoning -fno-tree-pre
>>>
>>> With above flags I ran with and without changes.
>>
>> A 0.01% geomean improvement is noise.  Why did you disable PRE?
>>
> 
> I have changed the flags. Now I changed the flags to -O3
> -fpredictive-commoning -funroll-loops. With these flags
> I am measuring the performance with spec 2017 benchmarks.
> Would let you know the results by Monday.
>

With the changes in this patch, 500.perlbench_r gave gain of
2.56% (spec 2017 INT) benchmarks. There is no degradation
with other benchmarks.

Thanks & Regards
Ajit
  
>>> There is no degradation with spec 2017 (FP benchmarks).
>>>
>>> Because in predictive commoning we reuse values computed in
>>> earlier iterations of a loop in the later ones, max distance is the
>>> better choice.
>>
>> The re-use distance is the same though.  So your change merely increases
>> the unroll factor?  Or can you explain why there is more re-use with
>> your change.
>>
> 
> With -O3 -fpredictive-commoning -funroll-loops many spec 2017 benchmarks
> increases unroll factor with my changes.
> 
> I am traversing data dependence relation vector, get the distance
> and max distance is the unroll factor.
>  
>> Richard.
>>
> Thanks & Regards
> Ajit
>>>> Richard.
>>>>
>>>
>>> Thanks & Regards
>>> Ajit
>>>
>>>>> Bootstrapped and regtested on powerpc64-linux-gnu.
>>>>>
>>>>> Thanks & Regards
>>>>> Ajit
>>>>>
>>>>> tree-optimization, predcom: Improve unroll factor for predictive commoning
>>>>>
>>>>> Unroll factor is determined with max distance across loop iterations.
>>>>> The logic for determining the loop unroll factor is based on
>>>>> data dependency across loop iterations.
>>>>>
>>>>> The max distance across loop iterations is the unrolling factor
>>>>> that helps in predictive commoning.
>>>>>
>>>>> 2024-07-11  Ajit Kumar Agarwal  <aagarwa1@linux.ibm.com>
>>>>>
>>>>> gcc/ChangeLog:
>>>>>
>>>>>         * tree-predcom.cc: Change in determining unroll factor with
>>>>>         data dependence across loop iterations.
>>>>> ---
>>>>>  gcc/tree-predcom.cc | 51 ++++++++++++++++++++++++++++++++++-----------
>>>>>  1 file changed, 39 insertions(+), 12 deletions(-)
>>>>>
>>>>> diff --git a/gcc/tree-predcom.cc b/gcc/tree-predcom.cc
>>>>> index 9844fee1e97..029b02f5990 100644
>>>>> --- a/gcc/tree-predcom.cc
>>>>> +++ b/gcc/tree-predcom.cc
>>>>> @@ -409,6 +409,7 @@ public:
>>>>>    /* Perform the predictive commoning optimization for chains, make this
>>>>>       public for being called in callback execute_pred_commoning_cbck.  */
>>>>>    void execute_pred_commoning (bitmap tmp_vars);
>>>>> +  unsigned determine_unroll_factor (const vec<chain_p> &chains);
>>>>>
>>>>>  private:
>>>>>    /* The pointer to the given loop.  */
>>>>> @@ -2400,13 +2401,46 @@ pcom_worker::execute_pred_commoning_chain (chain_p chain,
>>>>>     copies as possible.  CHAINS is the list of chains that will be
>>>>>     optimized.  */
>>>>>
>>>>> -static unsigned
>>>>> -determine_unroll_factor (const vec<chain_p> &chains)
>>>>> +unsigned
>>>>> +pcom_worker::determine_unroll_factor (const vec<chain_p> &chains)
>>>>>  {
>>>>>    chain_p chain;
>>>>> -  unsigned factor = 1, af, nfactor, i;
>>>>> +  unsigned factor = 1, i;
>>>>>    unsigned max = param_max_unroll_times;
>>>>> +  struct data_dependence_relation *ddr;
>>>>> +  unsigned nfactor = 0;
>>>>> +  int nzfactor = 0;
>>>>> +
>>>>> +  /* Best unroll factor is the maximum distance across loop
>>>>> +     iterations.  */
>>>>> +  FOR_EACH_VEC_ELT (m_dependences, i, ddr)
>>>>> +    {
>>>>> +      for (unsigned j = 0; j < DDR_NUM_DIST_VECTS (ddr); j++)
>>>>> +       {
>>>>> +         lambda_vector vec = DDR_DIST_VECT (ddr, j);
>>>>> +         widest_int distance = vec[j];
>>>>> +         unsigned offset = distance.to_uhwi ();
>>>>> +         if (offset == 0)
>>>>> +           continue;
>>>>> +
>>>>> +         int dist = offset - nzfactor;
>>>>> +         if (dist  == 0)
>>>>> +           continue;
>>>>>
>>>>> +         if (nfactor == 0)
>>>>> +           {
>>>>> +             nfactor = offset;
>>>>> +             nzfactor = offset;
>>>>> +           }
>>>>> +         else if (dist <= nzfactor)
>>>>> +           nfactor = offset;
>>>>> +
>>>>> +         if (nfactor > 0 && nfactor <= max)
>>>>> +           factor = nfactor;
>>>>> +       }
>>>>> +    }
>>>>> +
>>>>> +  int max_use = 0;
>>>>>    FOR_EACH_VEC_ELT (chains, i, chain)
>>>>>      {
>>>>>        if (chain->type == CT_INVARIANT)
>>>>> @@ -2427,17 +2461,10 @@ determine_unroll_factor (const vec<chain_p> &chains)
>>>>>           continue;
>>>>>         }
>>>>>
>>>>> -      /* The best unroll factor for this chain is equal to the number of
>>>>> -        temporary variables that we create for it.  */
>>>>> -      af = chain->length;
>>>>>        if (chain->has_max_use_after)
>>>>> -       af++;
>>>>> -
>>>>> -      nfactor = factor * af / gcd (factor, af);
>>>>> -      if (nfactor <= max)
>>>>> -       factor = nfactor;
>>>>> +       max_use++;
>>>>>      }
>>>>> -
>>>>> +  factor += max_use;
>>>>>    return factor;
>>>>>  }
>>>>>
>>>>> --
>>>>> 2.43.5
>>>>>
diff mbox series

Patch

diff --git a/gcc/tree-predcom.cc b/gcc/tree-predcom.cc
index 9844fee1e97..029b02f5990 100644
--- a/gcc/tree-predcom.cc
+++ b/gcc/tree-predcom.cc
@@ -409,6 +409,7 @@  public:
   /* Perform the predictive commoning optimization for chains, make this
      public for being called in callback execute_pred_commoning_cbck.  */
   void execute_pred_commoning (bitmap tmp_vars);
+  unsigned determine_unroll_factor (const vec<chain_p> &chains);
 
 private:
   /* The pointer to the given loop.  */
@@ -2400,13 +2401,46 @@  pcom_worker::execute_pred_commoning_chain (chain_p chain,
    copies as possible.  CHAINS is the list of chains that will be
    optimized.  */
 
-static unsigned
-determine_unroll_factor (const vec<chain_p> &chains)
+unsigned
+pcom_worker::determine_unroll_factor (const vec<chain_p> &chains)
 {
   chain_p chain;
-  unsigned factor = 1, af, nfactor, i;
+  unsigned factor = 1, i;
   unsigned max = param_max_unroll_times;
+  struct data_dependence_relation *ddr;
+  unsigned nfactor = 0;
+  int nzfactor = 0;
+
+  /* Best unroll factor is the maximum distance across loop
+     iterations.  */
+  FOR_EACH_VEC_ELT (m_dependences, i, ddr)
+    {
+      for (unsigned j = 0; j < DDR_NUM_DIST_VECTS (ddr); j++)
+	{
+	  lambda_vector vec = DDR_DIST_VECT (ddr, j);
+	  widest_int distance = vec[j];
+	  unsigned offset = distance.to_uhwi ();
+	  if (offset == 0)
+	    continue;
+
+	  int dist = offset - nzfactor;
+	  if (dist  == 0)
+	    continue;
 
+	  if (nfactor == 0)
+	    {
+	      nfactor = offset;
+	      nzfactor = offset;
+	    }
+	  else if (dist <= nzfactor)
+	    nfactor = offset;
+
+	  if (nfactor > 0 && nfactor <= max)
+	    factor = nfactor;
+	}
+    }
+
+  int max_use = 0;
   FOR_EACH_VEC_ELT (chains, i, chain)
     {
       if (chain->type == CT_INVARIANT)
@@ -2427,17 +2461,10 @@  determine_unroll_factor (const vec<chain_p> &chains)
 	  continue;
 	}
 
-      /* The best unroll factor for this chain is equal to the number of
-	 temporary variables that we create for it.  */
-      af = chain->length;
       if (chain->has_max_use_after)
-	af++;
-
-      nfactor = factor * af / gcd (factor, af);
-      if (nfactor <= max)
-	factor = nfactor;
+	max_use++;
     }
-
+  factor += max_use;
   return factor;
 }