rs6000: Refine RTL unroll adjust hook

Message ID	20200706071313.2463011-1-guojiufu@linux.ibm.com
State	New
Headers	show Return-Path: <gcc-patches-bounces@gcc.gnu.org> DMARC-Filter: OpenDMARC Filter v1.3.2 sourceware.org 04A0E3858D35 To: gcc-patches@gcc.gnu.org Subject: [PATCH] rs6000: Refine RTL unroll adjust hook Date: Mon, 6 Jul 2020 15:13:13 +0800 Message-Id: <20200706071313.2463011-1-guojiufu@linux.ibm.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: list From: guojiufu via Gcc-patches <gcc-patches@gcc.gnu.org> Reply-To: guojiufu <guojiufu@linux.ibm.com> Cc: wschmidt@linux.ibm.com, segher@kernel.crashing.org Errors-To: gcc-patches-bounces@gcc.gnu.org Sender: "Gcc-patches" <gcc-patches-bounces@gcc.gnu.org>
Series	rs6000: Refine RTL unroll adjust hook \| expand rs6000: Refine RTL unroll adjust hook

Jiufu Guo July 6, 2020, 7:13 a.m. UTC

For very small loops (< 6 insns), it would be fine to unroll 4
times to use cache line better.  Like below loops:
 `while (i) a[--i] = NULL;   while (p < e)  *d++ = *p++;`

And for very complex loops which may cause negative impacts:
branch-miss or cache-miss. Like below loop: there are calls,
early exits and branches in loop.
```
  for (int i = 0; i < n; i++) {
              int e = a[I];
             ....
              if (function_call(e))  break;
             ....
  }
```

This patch enhances RTL unroll for small loops and prevent to
unroll complex loops.

gcc/ChangeLog
2020-07-03  Jiufu Guo  <guojiufu@linux.ibm.com>

        * config/rs6000/rs6000.c (rs6000_loop_unroll_adjust): Refine hook.
        (rs6000_complex_loop_p): New function.
        (num_loop_calls): New function.
---
 gcc/config/rs6000/rs6000.c | 46 +++++++++++++++++++++++++++++++++-----
 1 file changed, 40 insertions(+), 6 deletions(-)

will schmidt July 7, 2020, 4:10 p.m. UTC | #1

On Mon, 2020-07-06 at 15:13 +0800, guojiufu via Gcc-patches wrote:

Hi,

Assorted comments below.   thanks :-)

> For very small loops (< 6 insns), it would be fine to unroll 4
> times to use cache line better.  Like below loops:
>  `while (i) a[--i] = NULL;   while (p < e)  *d++ = *p++;`
> 
> And for very complex loops which may cause negative impacts:
> branch-miss or cache-miss. Like below loop: there are calls,
> early exits and branches in loop.
> ```
>   for (int i = 0; i < n; i++) {
>               int e = a[I];
>              ....
>               if (function_call(e))  break;
>              ....
>   }
> ```
> 
> This patch enhances RTL unroll for small loops and prevent to
> unroll complex loops.

ok.

> 
> gcc/ChangeLog
> 2020-07-03  Jiufu Guo  <guojiufu@linux.ibm.com>
> 
>         * config/rs6000/rs6000.c (rs6000_loop_unroll_adjust): Refine hook.
>         (rs6000_complex_loop_p): New function.
>         (num_loop_calls): New function.

Tabs versus spaces.

(num_loop_calls): New function.


> ---
>  gcc/config/rs6000/rs6000.c | 46 +++++++++++++++++++++++++++++++++---
> --
>  1 file changed, 40 insertions(+), 6 deletions(-)
> 
> diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
> index 58f5d780603..a4874fa0efc 100644
> --- a/gcc/config/rs6000/rs6000.c
> +++ b/gcc/config/rs6000/rs6000.c
> @@ -5130,22 +5130,56 @@ rs6000_destroy_cost_data (void *data)
>    free (data);
>  }
> 
> +/* Count the number of call insns in LOOP.  */
> +static unsigned int
> +num_loop_calls (struct loop *loop)
> +{
> +  basic_block *bbs;
> +  rtx_insn *insn;
> +  unsigned int i;
> +  unsigned int call_ins_num = 0;
> +
> +  bbs = get_loop_body (loop);
> +  for (i = 0; i < loop->num_nodes; i++)
> +    FOR_BB_INSNS (bbs[i], insn)
> +      if (CALL_P (insn))
> +	call_ins_num++;
> +
> +  free (bbs);
> +
> +  return call_ins_num;
> +}

ok.


> +
> +/* Return true if LOOP is too complex to be unrolled.  */
> +static bool
> +rs6000_complex_loop_p (struct loop *loop)
> +{
> +  unsigned call_num;
> +
> +  return loop->ninsns > 10
> +    && (call_num = num_loop_calls (loop)) > 0
> +    && (call_num + num_loop_branches (loop)) * 5 > loop->ninsns
> +    && !single_exit (loop);
> +}
> +


The assignment to call_num within the logic there concerns me.  I'd
break that out.

The 5 value is not explicitly mentioned elsewhere.  Contextually this
appears to be evaluating the ratio of branches versus instructions
within the loop.  Could use some clarity.  



>  /* Implement targetm.loop_unroll_adjust.  */
> 
>  static unsigned
>  rs6000_loop_unroll_adjust (unsigned nunroll, struct loop *loop)
>  {
> -   if (unroll_only_small_loops)
> +  if (unroll_only_small_loops)

indentation fix looks ok.

>      {
> -      /* TODO: This is hardcoded to 10 right now.  It can be refined, for
> -	 example we may want to unroll very small loops more times (4 perhaps).
> -	 We also should use a PARAM for this.  */

Still hardcoded values, and may still wish to eventually have this as a
tunable param.   Probably OK to drop the 2nd sentence, but first and
last sentences should probably stay.


> +      if (loop->ninsns <= 6)
> +	return MIN (4, nunroll);
>        if (loop->ninsns <= 10)
>  	return MIN (2, nunroll);
> -      else
> -	return 0;
> +
> +      return 0;
>      }


ok

> 
> +  if (rs6000_complex_loop_p (loop))
> +    return 0;
> +
>    return nunroll;
>  }
> 

ok

Segher Boessenkool July 8, 2020, 12:33 a.m. UTC | #2

Hi!

On Mon, Jul 06, 2020 at 03:13:13PM +0800, guojiufu wrote:
> For very small loops (< 6 insns), it would be fine to unroll 4
> times to use cache line better.  Like below loops:
>  `while (i) a[--i] = NULL;   while (p < e)  *d++ = *p++;`

Yes, definitely.

> And for very complex loops which may cause negative impacts:
> branch-miss or cache-miss. Like below loop: there are calls,
> early exits and branches in loop.
> ```
>   for (int i = 0; i < n; i++) {
>               int e = a[I];
>              ....
>               if (function_call(e))  break;
>              ....
>   }
> ```
> 
> This patch enhances RTL unroll for small loops and prevent to
> unroll complex loops.

I am not happy about what is considered "a complex loop" here.

> +/* Count the number of call insns in LOOP.  */
> +static unsigned int
> +num_loop_calls (struct loop *loop)
> +{
> +  basic_block *bbs;
> +  rtx_insn *insn;
> +  unsigned int i;
> +  unsigned int call_ins_num = 0;
> +
> +  bbs = get_loop_body (loop);
> +  for (i = 0; i < loop->num_nodes; i++)
> +    FOR_BB_INSNS (bbs[i], insn)
> +      if (CALL_P (insn))
> +	call_ins_num++;
> +
> +  free (bbs);
> +
> +  return call_ins_num;
> +}

This function belongs in cfgloop.c really?  (Or cfgloopanal.c).  Next to
num_loop_branches ;-)

> +/* Return true if LOOP is too complex to be unrolled.  */
> +static bool
> +rs6000_complex_loop_p (struct loop *loop)
> +{
> +  unsigned call_num;
> +
> +  return loop->ninsns > 10
> +    && (call_num = num_loop_calls (loop)) > 0
> +    && (call_num + num_loop_branches (loop)) * 5 > loop->ninsns
> +    && !single_exit (loop);
> +}

Don't do initialisation in conditionals please (or in loop conditions),
like Will said already.

This calls only very specific loops "complex".  We need a better way
to decide this :-(

>  static unsigned
>  rs6000_loop_unroll_adjust (unsigned nunroll, struct loop *loop)
>  {
> -   if (unroll_only_small_loops)
> +  if (unroll_only_small_loops)
>      {
> -      /* TODO: This is hardcoded to 10 right now.  It can be refined, for
> -	 example we may want to unroll very small loops more times (4 perhaps).
> -	 We also should use a PARAM for this.  */
> +      if (loop->ninsns <= 6)
> +	return MIN (4, nunroll);
>        if (loop->ninsns <= 10)
>  	return MIN (2, nunroll);
> -      else
> -	return 0;
> +
> +      return 0;
>      }

This part is fine.  It is independent of the rest AFAICS, so if you
agree, this part is preapproved for trunk.  Thanks!

(A PARAM would be nice, but too many of those isn't actually useful
either...  Next time, add one as soon as writing the code, at least it
is useful at that point in time, when you still need to experiment with
it :-) )

> +  if (rs6000_complex_loop_p (loop))
> +    return 0;
> +
>    return nunroll;
>  }

So, we use rs6000_complex_loop_p only to prevent all unrolling, never to
reduce the unrolling, and only in very specific cases.

Is there no middle road possible?  Say, don't unroll to more than 25
insns total (which is what the "only small loops" does, sort of -- it
also avoids unrolling 3x a bit, yes), and don't unroll to more than 2
calls, and not to more than 4 branches (I'm making up those numbers, of
course, and PARAMS would be helpful).  Some of this already does exist,
and might need retuning for us?

Segher

Jiufu Guo July 8, 2020, 3:19 a.m. UTC | #3

will schmidt <will_schmidt@vnet.ibm.com> writes:

Thanks!
> On Mon, 2020-07-06 at 15:13 +0800, guojiufu via Gcc-patches wrote:
>
> Hi,
>
> Assorted comments below.   thanks :-)
>
>> For very small loops (< 6 insns), it would be fine to unroll 4
>> times to use cache line better.  Like below loops:
>>  `while (i) a[--i] = NULL;   while (p < e)  *d++ = *p++;`
>> 
>> And for very complex loops which may cause negative impacts:
>> branch-miss or cache-miss. Like below loop: there are calls,
>> early exits and branches in loop.
>> ```
>>   for (int i = 0; i < n; i++) {
>>               int e = a[I];
>>              ....
>>               if (function_call(e))  break;
>>              ....
>>   }
>> ```
>> 
>> This patch enhances RTL unroll for small loops and prevent to
>> unroll complex loops.
>
> ok.
>
>> 
>> gcc/ChangeLog
>> 2020-07-03  Jiufu Guo  <guojiufu@linux.ibm.com>
>> 
>>         * config/rs6000/rs6000.c (rs6000_loop_unroll_adjust): Refine hook.
>>         (rs6000_complex_loop_p): New function.
>>         (num_loop_calls): New function.
>
> Tabs versus spaces.
oh, thanks!

>
> (num_loop_calls): New function.
>
>
>> ---
>>  gcc/config/rs6000/rs6000.c | 46 +++++++++++++++++++++++++++++++++---
>> --
>>  1 file changed, 40 insertions(+), 6 deletions(-)
>> 
>> diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
>> index 58f5d780603..a4874fa0efc 100644
>> --- a/gcc/config/rs6000/rs6000.c
>> +++ b/gcc/config/rs6000/rs6000.c
>> @@ -5130,22 +5130,56 @@ rs6000_destroy_cost_data (void *data)
>>    free (data);
>>  }
>> 
>> +/* Count the number of call insns in LOOP.  */
>> +static unsigned int
>> +num_loop_calls (struct loop *loop)
>> +{
>> +  basic_block *bbs;
>> +  rtx_insn *insn;
>> +  unsigned int i;
>> +  unsigned int call_ins_num = 0;
>> +
>> +  bbs = get_loop_body (loop);
>> +  for (i = 0; i < loop->num_nodes; i++)
>> +    FOR_BB_INSNS (bbs[i], insn)
>> +      if (CALL_P (insn))
>> +	call_ins_num++;
>> +
>> +  free (bbs);
>> +
>> +  return call_ins_num;
>> +}
>
> ok.
>
>
>> +
>> +/* Return true if LOOP is too complex to be unrolled.  */
>> +static bool
>> +rs6000_complex_loop_p (struct loop *loop)
>> +{
>> +  unsigned call_num;
>> +
>> +  return loop->ninsns > 10
>> +    && (call_num = num_loop_calls (loop)) > 0
>> +    && (call_num + num_loop_branches (loop)) * 5 > loop->ninsns
>> +    && !single_exit (loop);
>> +}
>> +
>
>
> The assignment to call_num within the logic there concerns me.  I'd
> break that out.
>
> The 5 value is not explicitly mentioned elsewhere.  Contextually this
> appears to be evaluating the ratio of branches versus instructions
> within the loop.  Could use some clarity.
Yes, it is 20%. Would make it clarity. Thanks, 
>
>
>
>>  /* Implement targetm.loop_unroll_adjust.  */
>> 
>>  static unsigned
>>  rs6000_loop_unroll_adjust (unsigned nunroll, struct loop *loop)
>>  {
>> -   if (unroll_only_small_loops)
>> +  if (unroll_only_small_loops)
>
> indentation fix looks ok.
>
>>      {
>> -      /* TODO: This is hardcoded to 10 right now.  It can be refined, for
>> -	 example we may want to unroll very small loops more times (4 perhaps).
>> -	 We also should use a PARAM for this.  */
>
> Still hardcoded values, and may still wish to eventually have this as a
> tunable param.   Probably OK to drop the 2nd sentence, but first and
> last sentences should probably stay.
>
>
>> +      if (loop->ninsns <= 6)
>> +	return MIN (4, nunroll);
>>        if (loop->ninsns <= 10)
>>  	return MIN (2, nunroll);
>> -      else
>> -	return 0;
>> +
>> +      return 0;
>>      }
>
>
> ok
>
>> 
>> +  if (rs6000_complex_loop_p (loop))
>> +    return 0;
>> +
>>    return nunroll;
>>  }
>> 
>
> ok

Jiufu Guo July 8, 2020, 3:39 a.m. UTC | #4

Segher Boessenkool <segher@kernel.crashing.org> writes:

Thanks all!
> Hi!
>
> On Mon, Jul 06, 2020 at 03:13:13PM +0800, guojiufu wrote:
>> For very small loops (< 6 insns), it would be fine to unroll 4
>> times to use cache line better.  Like below loops:
>>  `while (i) a[--i] = NULL;   while (p < e)  *d++ = *p++;`
>
> Yes, definitely.
>
>> And for very complex loops which may cause negative impacts:
>> branch-miss or cache-miss. Like below loop: there are calls,
>> early exits and branches in loop.
>> ```
>>   for (int i = 0; i < n; i++) {
>>               int e = a[I];
>>              ....
>>               if (function_call(e))  break;
>>              ....
>>   }
>> ```
>> 
>> This patch enhances RTL unroll for small loops and prevent to
>> unroll complex loops.
>
> I am not happy about what is considered "a complex loop" here.
For early exit, which may cause and *next* unrolled iterations may be
not executed, then unroll may be not benifit.
For too many branches (this patch say 20% of insns), may cause more
branch-misses even for unrolled loop.

From my intial intuition, I once think each condition may *define* loop
as complex. :-)

But for each single condition, loop unrolling may still be helpful.
While, if these conditions are all occur in a loop, it would be more
possible to get negative impacts after unrolled.

We could continue discuss and tune it, as you pointed out below.

>
>> +/* Count the number of call insns in LOOP.  */
>> +static unsigned int
>> +num_loop_calls (struct loop *loop)
>> +{
>> +  basic_block *bbs;
>> +  rtx_insn *insn;
>> +  unsigned int i;
>> +  unsigned int call_ins_num = 0;
>> +
>> +  bbs = get_loop_body (loop);
>> +  for (i = 0; i < loop->num_nodes; i++)
>> +    FOR_BB_INSNS (bbs[i], insn)
>> +      if (CALL_P (insn))
>> +	call_ins_num++;
>> +
>> +  free (bbs);
>> +
>> +  return call_ins_num;
>> +}
>
> This function belongs in cfgloop.c really?  (Or cfgloopanal.c).  Next to
> num_loop_branches ;-)
Thanks for great sugguestion!
>
>> +/* Return true if LOOP is too complex to be unrolled.  */
>> +static bool
>> +rs6000_complex_loop_p (struct loop *loop)
>> +{
>> +  unsigned call_num;
>> +
>> +  return loop->ninsns > 10
>> +    && (call_num = num_loop_calls (loop)) > 0
>> +    && (call_num + num_loop_branches (loop)) * 5 > loop->ninsns
>> +    && !single_exit (loop);
>> +}
>
> Don't do initialisation in conditionals please (or in loop conditions),
> like Will said already.
Thanks!
>
> This calls only very specific loops "complex".  We need a better way
> to decide this :-(
>
>>  static unsigned
>>  rs6000_loop_unroll_adjust (unsigned nunroll, struct loop *loop)
>>  {
>> -   if (unroll_only_small_loops)
>> +  if (unroll_only_small_loops)
>>      {
>> -      /* TODO: This is hardcoded to 10 right now.  It can be refined, for
>> -	 example we may want to unroll very small loops more times (4 perhaps).
>> -	 We also should use a PARAM for this.  */
>> +      if (loop->ninsns <= 6)
>> +	return MIN (4, nunroll);
>>        if (loop->ninsns <= 10)
>>  	return MIN (2, nunroll);
>> -      else
>> -	return 0;
>> +
>> +      return 0;
>>      }
>
> This part is fine.  It is independent of the rest AFAICS, so if you
> agree, this part is preapproved for trunk.  Thanks!
Sure, thanks.
>
> (A PARAM would be nice, but too many of those isn't actually useful
> either...  Next time, add one as soon as writing the code, at least it
> is useful at that point in time, when you still need to experiment with
> it :-) )
Yes, with PARAM, we can just change param to experiment:
"if (loop->ninsns <= 10) return 0;" (6insn - 10 insn) may just located
n one cache line, it may not need to unroll. But, after some tunning, I
chose 6/4,10/2 here. 4 hardcodes may be too many for PARAM.  2 PARAMs
(one for 6, one for 10) may be ok.  Any comments?
>
>> +  if (rs6000_complex_loop_p (loop))
>> +    return 0;
>> +
>>    return nunroll;
>>  }
>
> So, we use rs6000_complex_loop_p only to prevent all unrolling, never to
> reduce the unrolling, and only in very specific cases.
>
> Is there no middle road possible?  Say, don't unroll to more than 25
> insns total (which is what the "only small loops" does, sort of -- it
> also avoids unrolling 3x a bit, yes), and don't unroll to more than 2
> calls, and not to more than 4 branches (I'm making up those numbers, of
> course, and PARAMS would be helpful).  Some of this already does exist,
> and might need retuning for us?
It may make sense. There are param_max_unrolled_insns,
param_max_unroll_times, param_max_peel_branches(cunrol)...; we may add
calls number and branches number checking for rtl unroller.

While, actually, here we would need condition to define *complex* loop,
where contains call exist (may just 1), branch exist(may 2) and early
exit(may 1) at the same time, but each number is not large.
Any sugguestions? Thanks.

>
>
> Segher

Segher Boessenkool July 8, 2020, 11:13 p.m. UTC | #5

On Wed, Jul 08, 2020 at 11:39:56AM +0800, Jiufu Guo wrote:
> Segher Boessenkool <segher@kernel.crashing.org> writes:
> > I am not happy about what is considered "a complex loop" here.
> For early exit, which may cause and *next* unrolled iterations may be
> not executed, then unroll may be not benifit.

Yes, and it can well result in worse branch prediction.

> For too many branches (this patch say 20% of insns), may cause more
> branch-misses even for unrolled loop.
> 
> >From my intial intuition, I once think each condition may *define* loop
> as complex. :-)
> 
> But for each single condition, loop unrolling may still be helpful.
> While, if these conditions are all occur in a loop, it would be more
> possible to get negative impacts after unrolled.

Yes, but how many loops have *all* these conditions?  That is my problem
with it: it is only tested with one specific loop, and only benefits
that loop.

> > (A PARAM would be nice, but too many of those isn't actually useful
> > either...  Next time, add one as soon as writing the code, at least it
> > is useful at that point in time, when you still need to experiment with
> > it :-) )
> Yes, with PARAM, we can just change param to experiment:
> "if (loop->ninsns <= 10) return 0;" (6insn - 10 insn) may just located
> n one cache line, it may not need to unroll. But, after some tunning, I
> chose 6/4,10/2 here. 4 hardcodes may be too many for PARAM.  2 PARAMs
> (one for 6, one for 10) may be ok.  Any comments?

RTL insns are not one-to-one with machine insns.  One important reason
to unroll is for small loops, where we need to hide the latency of a
fetch redirect (say, three cycles); this is complicated by most insns
having a 2 cycle latency (or more for FP and such), and by different
SMT modes changing stuff here, oh and different CPUs of course.

So it is very important to unroll small loops enough, and it can be
beneficial for larger loops too, but it also hurts (in general, and
more for bigger loops, or loops with calls, or jumps).

It's not a science, more an art.  You'll just have to find something
that works well in practice.  But not something that looks at very
special cases only, preferably.

> > So, we use rs6000_complex_loop_p only to prevent all unrolling, never to
> > reduce the unrolling, and only in very specific cases.
> >
> > Is there no middle road possible?  Say, don't unroll to more than 25
> > insns total (which is what the "only small loops" does, sort of -- it
> > also avoids unrolling 3x a bit, yes), and don't unroll to more than 2
> > calls, and not to more than 4 branches (I'm making up those numbers, of
> > course, and PARAMS would be helpful).  Some of this already does exist,
> > and might need retuning for us?
> It may make sense. There are param_max_unrolled_insns,
> param_max_unroll_times, param_max_peel_branches(cunrol)...; we may add
> calls number and branches number checking for rtl unroller.

Hrm yes, that may be generally useful even.

> While, actually, here we would need condition to define *complex* loop,
> where contains call exist (may just 1), branch exist(may 2) and early
> exit(may 1) at the same time, but each number is not large.
> Any sugguestions? Thanks.

How many loops have you seen where all those conditions are true, but
the generic code still decides to unroll things?

Segher

Jiufu Guo July 9, 2020, 8:01 a.m. UTC | #6

Segher Boessenkool <segher@kernel.crashing.org> writes:

Hi,

> On Wed, Jul 08, 2020 at 11:39:56AM +0800, Jiufu Guo wrote:
>> Segher Boessenkool <segher@kernel.crashing.org> writes:
>> > I am not happy about what is considered "a complex loop" here.
>> For early exit, which may cause and *next* unrolled iterations may be
>> not executed, then unroll may be not benifit.
>
> Yes, and it can well result in worse branch prediction.
>
>> For too many branches (this patch say 20% of insns), may cause more
>> branch-misses even for unrolled loop.
>> 
>> >From my intial intuition, I once think each condition may *define* loop
>> as complex. :-)
>> 
>> But for each single condition, loop unrolling may still be helpful.
>> While, if these conditions are all occur in a loop, it would be more
>> possible to get negative impacts after unrolled.
>
> Yes, but how many loops have *all* these conditions?  That is my problem
> with it: it is only tested with one specific loop, and only benefits
> that loop.

I also encounter a few of this kind of loops, some in hot path of leela
and perlbench, and had negative impact leelar_r (~2%), perlbench(>0.5%),
and also gcc_r.  I had a quick count, there are ~500 this kind of loops
occur in specint build.

>
>> > (A PARAM would be nice, but too many of those isn't actually useful
>> > either...  Next time, add one as soon as writing the code, at least it
>> > is useful at that point in time, when you still need to experiment with
>> > it :-) )
>> Yes, with PARAM, we can just change param to experiment:
>> "if (loop->ninsns <= 10) return 0;" (6insn - 10 insn) may just located
>> n one cache line, it may not need to unroll. But, after some tunning, I
>> chose 6/4,10/2 here. 4 hardcodes may be too many for PARAM.  2 PARAMs
>> (one for 6, one for 10) may be ok.  Any comments?
>
> RTL insns are not one-to-one with machine insns.  One important reason
> to unroll is for small loops, where we need to hide the latency of a
> fetch redirect (say, three cycles); this is complicated by most insns
> having a 2 cycle latency (or more for FP and such), and by different
> SMT modes changing stuff here, oh and different CPUs of course.
>
> So it is very important to unroll small loops enough, and it can be
> beneficial for larger loops too, but it also hurts (in general, and
> more for bigger loops, or loops with calls, or jumps).
>
> It's not a science, more an art.  You'll just have to find something
> that works well in practice.  But not something that looks at very
> special cases only, preferably.
After unroll there are some optimizations which also could change rtl
insn sequence, and the number is not same with binary instrctions. 
Yes, some 'think so' theory is not `real so`. It is heuristic. :-).

>
>> > So, we use rs6000_complex_loop_p only to prevent all unrolling, never to
>> > reduce the unrolling, and only in very specific cases.
>> >
>> > Is there no middle road possible?  Say, don't unroll to more than 25
>> > insns total (which is what the "only small loops" does, sort of -- it
>> > also avoids unrolling 3x a bit, yes), and don't unroll to more than 2
>> > calls, and not to more than 4 branches (I'm making up those numbers, of
>> > course, and PARAMS would be helpful).  Some of this already does exist,
>> > and might need retuning for us?
>> It may make sense. There are param_max_unrolled_insns,
>> param_max_unroll_times, param_max_peel_branches(cunrol)...; we may add
>> calls number and branches number checking for rtl unroller.
>
> Hrm yes, that may be generally useful even.
>
>> While, actually, here we would need condition to define *complex* loop,
>> where contains call exist (may just 1), branch exist(may 2) and early
>> exit(may 1) at the same time, but each number is not large.
>> Any sugguestions? Thanks.
>
> How many loops have you seen where all those conditions are true, but
> the generic code still decides to unroll things?
Some occur as above said.  I use -fopt-info to compare the changed
unroll_adjust_hook to check loops of this kind.

Thanks a lot!
Jiufu

>
>
> Segher

Segher Boessenkool July 9, 2020, 4:54 p.m. UTC | #7

Hi Jiufu,

On Thu, Jul 09, 2020 at 04:01:38PM +0800, Jiufu Guo wrote:
> Segher Boessenkool <segher@kernel.crashing.org> writes:
> >> But for each single condition, loop unrolling may still be helpful.
> >> While, if these conditions are all occur in a loop, it would be more
> >> possible to get negative impacts after unrolled.
> >
> > Yes, but how many loops have *all* these conditions?  That is my problem
> > with it: it is only tested with one specific loop, and only benefits
> > that loop.
> 
> I also encounter a few of this kind of loops, some in hot path of leela
> and perlbench, and had negative impact leelar_r (~2%), perlbench(>0.5%),
> and also gcc_r.  I had a quick count, there are ~500 this kind of loops
> occur in specint build.

If the generic code decides to unroll big loops with calls *and* jumps,
there is a big problem there?

On most targets, it should not unroll loops with calls *at all*, for
example.

> >> While, actually, here we would need condition to define *complex* loop,
> >> where contains call exist (may just 1), branch exist(may 2) and early
> >> exit(may 1) at the same time, but each number is not large.
> >> Any sugguestions? Thanks.
> >
> > How many loops have you seen where all those conditions are true, but
> > the generic code still decides to unroll things?
> Some occur as above said.  I use -fopt-info to compare the changed
> unroll_adjust_hook to check loops of this kind.

But there are a few cases where we *do* want to unroll, you say?  What
is special about those cases, what do they do differently?


Segher

Jiufu Guo July 10, 2020, 1:16 p.m. UTC | #8

Hi,

Segher Boessenkool <segher@kernel.crashing.org> writes:

> Hi Jiufu,
>
> On Thu, Jul 09, 2020 at 04:01:38PM +0800, Jiufu Guo wrote:
>> Segher Boessenkool <segher@kernel.crashing.org> writes:
>> >> But for each single condition, loop unrolling may still be helpful.
>> >> While, if these conditions are all occur in a loop, it would be more
>> >> possible to get negative impacts after unrolled.
>> >
>> > Yes, but how many loops have *all* these conditions?  That is my problem
>> > with it: it is only tested with one specific loop, and only benefits
>> > that loop.
>> 
>> I also encounter a few of this kind of loops, some in hot path of leela
>> and perlbench, and had negative impact leelar_r (~2%), perlbench(>0.5%),
>> and also gcc_r.  I had a quick count, there are ~500 this kind of loops
>> occur in specint build.
>
> If the generic code decides to unroll big loops with calls *and* jumps,
> there is a big problem there?
>
> On most targets, it should not unroll loops with calls *at all*, for
> example.
>
>> >> While, actually, here we would need condition to define *complex* loop,
>> >> where contains call exist (may just 1), branch exist(may 2) and early
>> >> exit(may 1) at the same time, but each number is not large.
>> >> Any sugguestions? Thanks.
>> >
>> > How many loops have you seen where all those conditions are true, but
>> > the generic code still decides to unroll things?
>> Some occur as above said.  I use -fopt-info to compare the changed
>> unroll_adjust_hook to check loops of this kind.
>
> But there are a few cases where we *do* want to unroll, you say?  What
> is special about those cases, what do they do differently?
During tests, if avoid unroll loops if there is call (or early exit).
As I remember there are performance downgrade on 1 or 2 benchmarks, but
I forget which ones. I check and confirm.

BR.
Jiufu,

>
>
> Segher

Jiufu Guo July 10, 2020, 1:36 p.m. UTC | #9

Hi Segher,

Thanks a lot for your time and helpful comments!

Segher Boessenkool <segher@kernel.crashing.org> writes:

> Hi Jiufu,
>
> On Thu, Jul 09, 2020 at 04:01:38PM +0800, Jiufu Guo wrote:
>> Segher Boessenkool <segher@kernel.crashing.org> writes:
...
> If the generic code decides to unroll big loops with calls *and* jumps,
> there is a big problem there?
>
> On most targets, it should not unroll loops with calls *at all*, for
> example.
>
....
>> >
>
> But there are a few cases where we *do* want to unroll, you say?  What
> is special about those cases, what do they do differently?
Here are loops with early exit, but could be unroll.
        while (s < end && *s != '>') s++;
 IF ( setinitval .EQ. 1 .OR. setinitval .EQ. 3 ) grid%evbxy=initial_data_value
I will recheck to indendify/confirm which loops we want to unroll but
contains call or early exit.

BR.
Jiufu,
>
>
> Segher

rs6000: Refine RTL unroll adjust hook

Commit Message

Comments

Patch