[ARM] Switch to default sched pressure algorithm

Message ID	VI1PR0801MB2127648B6F8EA34101D444F683DD0@VI1PR0801MB2127.eurprd08.prod.outlook.com
State	New
Headers	show Return-Path: <gcc-patches-return-505772-incoming=patchwork.ozlabs.org@gcc.gnu.org> DomainKey-Signature: a=rsa-sha1; c=nofws; d=gcc.gnu.org; h=list-id :list-unsubscribe:list-archive:list-post:list-help:sender:from :to:cc:subject:date:message-id:content-type :content-transfer-encoding:mime-version; q=dns; s=default; b=Shw Nw2yaqomHoTOK6pZI3bJzOusVIvkwgVjB6P1zNlkiPbcuCfF0q8evLOnI6gwvdmt RwYlKG4AjbapHoLrKvrnainwOKHwxy6Q230GObLBsjMe9kqZ/c/Mo0ArZTFg4eja +Wd8n3o+Y/9iQTGaFkgT2VuYOJBHJCMPVnFrKRWc= Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm Precedence: bulk Sender: gcc-patches-owner@gcc.gnu.org Received-SPF: TempError (protection.outlook.com: error in processing during lookup of arm.com: DNS Timeout) From: Wilco Dijkstra <Wilco.Dijkstra@arm.com> To: GCC Patches <gcc-patches@gcc.gnu.org> CC: nd <nd@arm.com>, Richard Earnshaw <Richard.Earnshaw@arm.com>, "Kyrylo Tkachov" <Kyrylo.Tkachov@arm.com> Subject: [PATCH][ARM] Switch to default sched pressure algorithm Date: Mon, 29 Jul 2019 16:49:11 +0000 Message-ID: <VI1PR0801MB2127648B6F8EA34101D444F683DD0@VI1PR0801MB2127.eurprd08.prod.outlook.com> Authentication-Results-Original: spf=none (sender IP is ) smtp.mailfrom=Wilco.Dijkstra@arm.com; received-spf: None (protection.outlook.com: arm.com does not designate permitted sender hosts) Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: quoted-printable MIME-Version: 1.0 Original-Authentication-Results: spf=none (sender IP is ) smtp.mailfrom=Wilco.Dijkstra@arm.com; X-MS-Exchange-Transport-CrossTenantHeadersStripped: AM5EUR03FT017.eop-EUR03.prod.protection.outlook.com X-MS-Office365-Filtering-Correlation-Id-Prvs: ae7d788d-c414-450a-0d5c-08d71444ae8f
Series	[ARM] Switch to default sched pressure algorithm \| expand [ARM] Switch to default sched pressure algorithm

Wilco Dijkstra July 29, 2019, 4:49 p.m. UTC

Currently the Arm backend selects the alternative sched pressure algorithm.
The issue is that this doesn't take register pressure into account, and so
it causes significant additional spilling on Arm where there are only 14
allocatable registers.  SPEC2006 shows significant codesize reduction
with the default pressure algorithm, so switch back to that.  PR77308 shows
~800 fewer instructions.

SPECINT2006 is ~0.6% faster on Cortex-A57 together with the other DImode
patches. Overall SPEC codesize is 1.1% smaller.

Bootstrap & regress OK on arm-none-linux-gnueabihf --with-cpu=cortex-a57

ChangeLog:
2019-07-29  Wilco Dijkstra  <wdijkstr@arm.com>

	* config/arm/arm.c (arm_option_override): Don't override sched
	pressure algorithm.

--

Christophe Lyon July 30, 2019, 9:08 a.m. UTC | #1

On Mon, 29 Jul 2019 at 18:49, Wilco Dijkstra <Wilco.Dijkstra@arm.com> wrote:
>
> Currently the Arm backend selects the alternative sched pressure algorithm.
> The issue is that this doesn't take register pressure into account, and so
> it causes significant additional spilling on Arm where there are only 14
> allocatable registers.  SPEC2006 shows significant codesize reduction
> with the default pressure algorithm, so switch back to that.  PR77308 shows
> ~800 fewer instructions.
>
> SPECINT2006 is ~0.6% faster on Cortex-A57 together with the other DImode
> patches. Overall SPEC codesize is 1.1% smaller.
>

Hi Wilco,

Do you know which benchmarks were used when this was checked-in?
It isn't clear from https://gcc.gnu.org/ml/gcc-patches/2012-07/msg00706.html

Thanks,

Christophe

> Bootstrap & regress OK on arm-none-linux-gnueabihf --with-cpu=cortex-a57
>
> ChangeLog:
> 2019-07-29  Wilco Dijkstra  <wdijkstr@arm.com>
>
>         * config/arm/arm.c (arm_option_override): Don't override sched
>         pressure algorithm.
>
> --
>
> diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c
> index 81286cadf32f908e045d704128c5e06842e0cc92..628cf02f23fb29392a63d87f561c3ee2fb73a515 100644
> --- a/gcc/config/arm/arm.c
> +++ b/gcc/config/arm/arm.c
> @@ -3575,11 +3575,6 @@ arm_option_override (void)
>    if (use_neon_for_64bits == 1)
>       prefer_neon_for_64bits = true;
>
> -  /* Use the alternative scheduling-pressure algorithm by default.  */
> -  maybe_set_param_value (PARAM_SCHED_PRESSURE_ALGORITHM, SCHED_PRESSURE_MODEL,
> -                        global_options.x_param_values,
> -                        global_options_set.x_param_values);
> -
>    /* Look through ready list and all of queue for instructions
>       relevant for L2 auto-prefetcher.  */
>    int param_sched_autopref_queue_depth;
>

Ramana Radhakrishnan July 30, 2019, 9:31 a.m. UTC | #2

On 30/07/2019 10:08, Christophe Lyon wrote:
> On Mon, 29 Jul 2019 at 18:49, Wilco Dijkstra <Wilco.Dijkstra@arm.com> wrote:
>>
>> Currently the Arm backend selects the alternative sched pressure algorithm.
>> The issue is that this doesn't take register pressure into account, and so
>> it causes significant additional spilling on Arm where there are only 14
>> allocatable registers.  SPEC2006 shows significant codesize reduction
>> with the default pressure algorithm, so switch back to that.  PR77308 shows
>> ~800 fewer instructions.
>>
>> SPECINT2006 is ~0.6% faster on Cortex-A57 together with the other DImode
>> patches. Overall SPEC codesize is 1.1% smaller.
>>
> 
> Hi Wilco,
> 
> Do you know which benchmarks were used when this was checked-in?
> It isn't clear from https://gcc.gnu.org/ml/gcc-patches/2012-07/msg00706.html

It was from my time in Linaro and thus would have been a famous embedded 
benchmark, coremark , spec2000 - all tested probably on cortex-a9 and 
Cortex-A15. In addition to this I would like to see what the impact of 
this is on something like Cortex-A53 as the issue rates are likely to be 
different on the schedulers causing different behaviour.


I don't have all the notes today for that - maybe you can look into the 
linaro wiki.

I am concerned about taking this patch in without some more data across 
a variety of cores.

Thanks,
Ramana


> 
> Thanks,
> 
> Christophe
> 
>> Bootstrap & regress OK on arm-none-linux-gnueabihf --with-cpu=cortex-a57
>>
>> ChangeLog:
>> 2019-07-29  Wilco Dijkstra  <wdijkstr@arm.com>
>>
>>          * config/arm/arm.c (arm_option_override): Don't override sched
>>          pressure algorithm.
>>
>> --
>>
>> diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c
>> index 81286cadf32f908e045d704128c5e06842e0cc92..628cf02f23fb29392a63d87f561c3ee2fb73a515 100644
>> --- a/gcc/config/arm/arm.c
>> +++ b/gcc/config/arm/arm.c
>> @@ -3575,11 +3575,6 @@ arm_option_override (void)
>>     if (use_neon_for_64bits == 1)
>>        prefer_neon_for_64bits = true;
>>
>> -  /* Use the alternative scheduling-pressure algorithm by default.  */
>> -  maybe_set_param_value (PARAM_SCHED_PRESSURE_ALGORITHM, SCHED_PRESSURE_MODEL,
>> -                        global_options.x_param_values,
>> -                        global_options_set.x_param_values);
>> -
>>     /* Look through ready list and all of queue for instructions
>>        relevant for L2 auto-prefetcher.  */
>>     int param_sched_autopref_queue_depth;
>>

Richard Earnshaw (lists) July 30, 2019, 12:50 p.m. UTC | #3

On 30/07/2019 10:31, Ramana Radhakrishnan wrote:
> On 30/07/2019 10:08, Christophe Lyon wrote:
>> On Mon, 29 Jul 2019 at 18:49, Wilco Dijkstra <Wilco.Dijkstra@arm.com> 
>> wrote:
>>>
>>> Currently the Arm backend selects the alternative sched pressure 
>>> algorithm.
>>> The issue is that this doesn't take register pressure into account, 
>>> and so
>>> it causes significant additional spilling on Arm where there are only 14
>>> allocatable registers.  SPEC2006 shows significant codesize reduction
>>> with the default pressure algorithm, so switch back to that.  PR77308 
>>> shows
>>> ~800 fewer instructions.
>>>
>>> SPECINT2006 is ~0.6% faster on Cortex-A57 together with the other DImode
>>> patches. Overall SPEC codesize is 1.1% smaller.
>>>
>>
>> Hi Wilco,
>>
>> Do you know which benchmarks were used when this was checked-in?
>> It isn't clear from 
>> https://gcc.gnu.org/ml/gcc-patches/2012-07/msg00706.html
> 
> It was from my time in Linaro and thus would have been a famous embedded 
> benchmark, coremark , spec2000 - all tested probably on cortex-a9 and 
> Cortex-A15. In addition to this I would like to see what the impact of 
> this is on something like Cortex-A53 as the issue rates are likely to be 
> different on the schedulers causing different behaviour.
> 
> 
> I don't have all the notes today for that - maybe you can look into the 
> linaro wiki.
> 
> I am concerned about taking this patch in without some more data across 
> a variety of cores.
> 

My concern is the original patch 
(https://gcc.gnu.org/ml/gcc-patches/2012-07/msg00706.html) is lacking in 
any real detail as to the reasons for the choice of the second algorithm 
over the first.

- It's not clear what the win was
- It's not clear what outliers there were and whether they were significant.

And finally, it's not clear if, 7 years later, this is still the best 
choice.

If the second algorithm really is better, why is no other target using 
it by default?

I think we need a bit more information (both ways).  In particular I'm 
concerned not just by the overall benchmark average, but also the amount 
of variance across the benchmarks.  I think the default needs to avoid 
significant outliers if at all possible, even if it is marginally less 
good on the average.

R.

> Thanks,
> Ramana
> 
> 
>>
>> Thanks,
>>
>> Christophe
>>
>>> Bootstrap & regress OK on arm-none-linux-gnueabihf --with-cpu=cortex-a57
>>>
>>> ChangeLog:
>>> 2019-07-29  Wilco Dijkstra  <wdijkstr@arm.com>
>>>
>>>          * config/arm/arm.c (arm_option_override): Don't override sched
>>>          pressure algorithm.
>>>
>>> -- 
>>>
>>> diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c
>>> index 
>>> 81286cadf32f908e045d704128c5e06842e0cc92..628cf02f23fb29392a63d87f561c3ee2fb73a515 
>>> 100644
>>> --- a/gcc/config/arm/arm.c
>>> +++ b/gcc/config/arm/arm.c
>>> @@ -3575,11 +3575,6 @@ arm_option_override (void)
>>>     if (use_neon_for_64bits == 1)
>>>        prefer_neon_for_64bits = true;
>>>
>>> -  /* Use the alternative scheduling-pressure algorithm by default.  */
>>> -  maybe_set_param_value (PARAM_SCHED_PRESSURE_ALGORITHM, 
>>> SCHED_PRESSURE_MODEL,
>>> -                        global_options.x_param_values,
>>> -                        global_options_set.x_param_values);
>>> -
>>>     /* Look through ready list and all of queue for instructions
>>>        relevant for L2 auto-prefetcher.  */
>>>     int param_sched_autopref_queue_depth;
>>>
>

Wilco Dijkstra July 30, 2019, 3:15 p.m. UTC | #4

Hi all,

 >On 30/07/2019 10:31, Ramana Radhakrishnan wrote:
 >> On 30/07/2019 10:08, Christophe Lyon wrote:

 >>> Hi Wilco,
 >>>
 >>> Do you know which benchmarks were used when this was checked-in?
 >>> It isn't clear from 
 >>> https://gcc.gnu.org/ml/gcc-patches/2012-07/msg00706.html
 >> 
 >> It was from my time in Linaro and thus would have been a famous embedded 
 >> benchmark, coremark , spec2000 - all tested probably on cortex-a9 and 
 >> Cortex-A15. In addition to this I would like to see what the impact of 
 >> this is on something like Cortex-A53 as the issue rates are likely to be 
 >> different on the schedulers causing different behaviour.

Obviously there are differences between various schedulers, but the general
issue is that register pressure is increased many times beyond the spilling limit
(a few cases I looked at had a pressure well over 120 when there are only 14
integer registers - this causes panic spilling in the register allocator).

In fact the spilling overhead between the 2 algorithms is almost identical on
Cortex-A53 and Cortex-A57, so the issue isn't directly related to the pipeline
model used. It seems more related to the scheduler being too aggressive
and not caring about register pressure at all (for example lifting a load 100
instructions before its use so it must be spilled).

 >> I don't have all the notes today for that - maybe you can look into the 
 >> linaro wiki.
 >> 
 >> I am concerned about taking this patch in without some more data across 
 >> a variety of cores.
 >> 
 >
 > My concern is the original patch 
 > (https://gcc.gnu.org/ml/gcc-patches/2012-07/msg00706.html) is lacking in 
 > any real detail as to the reasons for the choice of the second algorithm 
 > over the first.
 > 
 > - It's not clear what the win was
 > - It's not clear what outliers there were and whether they were significant.
 >
 > And finally, it's not clear if, 7 years later, this is still the best 
 > choice.
 > 
 > If the second algorithm really is better, why is no other target using 
 > it by default?
 >
 > I think we need a bit more information (both ways).  In particular I'm 
 > concerned not just by the overall benchmark average, but also the amount 
 > of variance across the benchmarks.  I think the default needs to avoid 
 > significant outliers if at all possible, even if it is marginally less 
 > good on the average.

The results clearly show that algorithm 1 works best on Arm today - I haven't
seen a single benchmark where algorithm 2 results in less spilling. We could
tune algorithm 2 so it switches back to algorithm 1 when register pressure is
high or a basic block is large. However until it is fixed, the evidence is that
algorithm 1 is the best choice for current cores.

Wilco

Wilco Dijkstra Aug. 19, 2019, 3:51 p.m. UTC | #5

ping


  
Currently the Arm backend selects the alternative sched pressure algorithm.
 The issue is that this doesn't take register pressure into account, and so
 it causes significant additional spilling on Arm where there are only 14
 allocatable registers.  SPEC2006 shows significant codesize reduction
 with the default pressure algorithm, so switch back to that.  PR77308 shows
 ~800 fewer instructions.
 
 SPECINT2006 is ~0.6% faster on Cortex-A57 together with the other DImode
 patches. Overall SPEC codesize is 1.1% smaller.
 
 Bootstrap & regress OK on arm-none-linux-gnueabihf --with-cpu=cortex-a57
 
 ChangeLog:
 2019-07-29  Wilco Dijkstra  <wdijkstr@arm.com>
 
         * config/arm/arm.c (arm_option_override): Don't override sched
         pressure algorithm.
 
 --
 
 diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c
 index 81286cadf32f908e045d704128c5e06842e0cc92..628cf02f23fb29392a63d87f561c3ee2fb73a515 100644
 --- a/gcc/config/arm/arm.c
 +++ b/gcc/config/arm/arm.c
 @@ -3575,11 +3575,6 @@ arm_option_override (void)
    if (use_neon_for_64bits == 1)
       prefer_neon_for_64bits = true;
  
 -  /* Use the alternative scheduling-pressure algorithm by default.  */
 -  maybe_set_param_value (PARAM_SCHED_PRESSURE_ALGORITHM, SCHED_PRESSURE_MODEL,
 -                        global_options.x_param_values,
 -                        global_options_set.x_param_values);
 -
    /* Look through ready list and all of queue for instructions
       relevant for L2 auto-prefetcher.  */
    int param_sched_autopref_queue_depth;

Wilco Dijkstra Sept. 2, 2019, 12:11 p.m. UTC | #6

ping


 
   
 Currently the Arm backend selects the alternative sched pressure algorithm.
  The issue is that this doesn't take register pressure into account, and so
  it causes significant additional spilling on Arm where there are only 14
  allocatable registers.  SPEC2006 shows significant codesize reduction
  with the default pressure algorithm, so switch back to that.  PR77308 shows
  ~800 fewer instructions.
  
  SPECINT2006 is ~0.6% faster on Cortex-A57 together with the other DImode
  patches. Overall SPEC codesize is 1.1% smaller.
  
  Bootstrap & regress OK on arm-none-linux-gnueabihf --with-cpu=cortex-a57
  
  ChangeLog:
  2019-07-29  Wilco Dijkstra  <wdijkstr@arm.com>
  
          * config/arm/arm.c (arm_option_override): Don't override sched
          pressure algorithm.
  
  --
  
  diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c
  index 81286cadf32f908e045d704128c5e06842e0cc92..628cf02f23fb29392a63d87f561c3ee2fb73a515 100644
  --- a/gcc/config/arm/arm.c
  +++ b/gcc/config/arm/arm.c
  @@ -3575,11 +3575,6 @@ arm_option_override (void)
     if (use_neon_for_64bits == 1)
        prefer_neon_for_64bits = true;
   
  -  /* Use the alternative scheduling-pressure algorithm by default.  */
  -  maybe_set_param_value (PARAM_SCHED_PRESSURE_ALGORITHM, SCHED_PRESSURE_MODEL,
  -                        global_options.x_param_values,
  -                        global_options_set.x_param_values);
  -
     /* Look through ready list and all of queue for instructions
        relevant for L2 auto-prefetcher.  */
     int param_sched_autopref_queue_depth;

Wilco Dijkstra Sept. 9, 2019, 5:05 p.m. UTC | #7

ping
 
 
  
    
  Currently the Arm backend selects the alternative sched pressure algorithm.
   The issue is that this doesn't take register pressure into account, and so
   it causes significant additional spilling on Arm where there are only 14
   allocatable registers.  SPEC2006 shows significant codesize reduction
   with the default pressure algorithm, so switch back to that.  PR77308 shows
   ~800 fewer instructions.
   
   SPECINT2006 is ~0.6% faster on Cortex-A57 together with the other DImode
   patches. Overall SPEC codesize is 1.1% smaller.
   
   Bootstrap & regress OK on arm-none-linux-gnueabihf --with-cpu=cortex-a57
   
   ChangeLog:
   2019-07-29  Wilco Dijkstra  <wdijkstr@arm.com>
   
           * config/arm/arm.c (arm_option_override): Don't override sched
           pressure algorithm.
   
   --
   
   diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c
   index 81286cadf32f908e045d704128c5e06842e0cc92..628cf02f23fb29392a63d87f561c3ee2fb73a515 100644
   --- a/gcc/config/arm/arm.c
   +++ b/gcc/config/arm/arm.c
   @@ -3575,11 +3575,6 @@ arm_option_override (void)
      if (use_neon_for_64bits == 1)
         prefer_neon_for_64bits = true;
    
   -  /* Use the alternative scheduling-pressure algorithm by default.  */
   -  maybe_set_param_value (PARAM_SCHED_PRESSURE_ALGORITHM, SCHED_PRESSURE_MODEL,
   -                        global_options.x_param_values,
   -                        global_options_set.x_param_values);
   -
      /* Look through ready list and all of queue for instructions
         relevant for L2 auto-prefetcher.  */
      int param_sched_autopref_queue_depth;

Wilco Dijkstra Oct. 10, 2019, 5:24 p.m. UTC | #8

ping

   Currently the Arm backend selects the alternative sched pressure algorithm.
   The issue is that this doesn't take register pressure into account, and so
   it causes significant additional spilling on Arm where there are only 14
   allocatable registers.  SPEC2006 shows significant codesize reduction
   with the default pressure algorithm, so switch back to that.  PR77308 shows
   ~800 fewer instructions.
   
   SPECINT2006 is ~0.6% faster on Cortex-A57 together with the other DImode
   patches. Overall SPEC codesize is 1.1% smaller.
   
   Bootstrap & regress OK on arm-none-linux-gnueabihf --with-cpu=cortex-a57
   
   ChangeLog:
   2019-07-29  Wilco Dijkstra  <wdijkstr@arm.com>
   
           * config/arm/arm.c (arm_option_override): Don't override sched
           pressure algorithm.
   
   --
   
   diff --git a/gcc/config/arm/arm.c b/gcc/config/arm/arm.c
   index 81286cadf32f908e045d704128c5e06842e0cc92..628cf02f23fb29392a63d87f561c3ee2fb73a515 100644
   --- a/gcc/config/arm/arm.c
   +++ b/gcc/config/arm/arm.c
   @@ -3575,11 +3575,6 @@ arm_option_override (void)
      if (use_neon_for_64bits == 1)
         prefer_neon_for_64bits = true;
    
   -  /* Use the alternative scheduling-pressure algorithm by default.  */
   -  maybe_set_param_value (PARAM_SCHED_PRESSURE_ALGORITHM, SCHED_PRESSURE_MODEL,
   -                        global_options.x_param_values,
   -                        global_options_set.x_param_values);
   -
      /* Look through ready list and all of queue for instructions
         relevant for L2 auto-prefetcher.  */
      int param_sched_autopref_queue_depth;

Ramana Radhakrishnan Oct. 10, 2019, 9:19 p.m. UTC | #9

On Tue, Jul 30, 2019 at 4:16 PM Wilco Dijkstra <Wilco.Dijkstra@arm.com> wrote:
>
> Hi all,
>
>  >On 30/07/2019 10:31, Ramana Radhakrishnan wrote:
>  >> On 30/07/2019 10:08, Christophe Lyon wrote:
>
>  >>> Hi Wilco,
>  >>>
>  >>> Do you know which benchmarks were used when this was checked-in?
>  >>> It isn't clear from
>  >>> https://gcc.gnu.org/ml/gcc-patches/2012-07/msg00706.html
>  >>
>  >> It was from my time in Linaro and thus would have been a famous embedded
>  >> benchmark, coremark , spec2000 - all tested probably on cortex-a9 and
>  >> Cortex-A15. In addition to this I would like to see what the impact of
>  >> this is on something like Cortex-A53 as the issue rates are likely to be
>  >> different on the schedulers causing different behaviour.
>
> Obviously there are differences between various schedulers, but the general
> issue is that register pressure is increased many times beyond the spilling limit
> (a few cases I looked at had a pressure well over 120 when there are only 14
> integer registers - this causes panic spilling in the register allocator).
>
> In fact the spilling overhead between the 2 algorithms is almost identical on
> Cortex-A53 and Cortex-A57, so the issue isn't directly related to the pipeline
> model used. It seems more related to the scheduler being too aggressive
> and not caring about register pressure at all (for example lifting a load 100
> instructions before its use so it must be spilled).

In those days it would have been the Cortex-A8, Cortex-A9 schedulers
and the Cortex-A15
schedulers and IIRC the benchmarking would have been mostly on a
Cortex-A9 board or on some
Cortex-A15 boards we had (long gone now) inside Arm.

Can you see what happens with the Cortex-A8 or Cortex-A9 schedulers to
spread the range
across some v7-a CPUs as well ? While they aren't that popular today I
would suggest
you look at them because the defaults for v7-a are still to use the
Cortex-A8 scheduler and
the Cortex-A9 scheduler might well also get used in places given the
availability of hardware.


>
>  >> I don't have all the notes today for that - maybe you can look into the
>  >> linaro wiki.
>  >>
>  >> I am concerned about taking this patch in without some more data across
>  >> a variety of cores.
>  >>
>  >
>  > My concern is the original patch
>  > (https://gcc.gnu.org/ml/gcc-patches/2012-07/msg00706.html) is lacking in
>  > any real detail as to the reasons for the choice of the second algorithm
>  > over the first.
>  >
>  > - It's not clear what the win was
>  > - It's not clear what outliers there were and whether they were significant.
>  >
>  > And finally, it's not clear if, 7 years later, this is still the best
>  > choice.
>  >
>  > If the second algorithm really is better, why is no other target using
>  > it by default?
>  >
>  > I think we need a bit more information (both ways).  In particular I'm
>  > concerned not just by the overall benchmark average, but also the amount
>  > of variance across the benchmarks.  I think the default needs to avoid
>  > significant outliers if at all possible, even if it is marginally less
>  > good on the average.
>
> The results clearly show that algorithm 1 works best on Arm today - I haven't
> seen a single benchmark where algorithm 2 results in less spilling. We could
> tune algorithm 2 so it switches back to algorithm 1 when register pressure is
> high or a basic block is large. However until it is fixed, the evidence is that
> algorithm 1 is the best choice for current cores.

I'd be happy to move this forward if you could show if there is no
*increase* in spills
for the same range of benchmarks that you are doing for the Cortex-A8
and Cortex-A9
schedulers.

Sorry about the time it has taken. I've been a bit otherwise occupied recently.

regards
Ramana

>
> Wilco

Wilco Dijkstra Oct. 11, 2019, 5:19 p.m. UTC | #10

Hi Ramana, 

> Can you see what happens with the Cortex-A8 or Cortex-A9 schedulers to
> spread the range across some v7-a CPUs as well ? While they aren't that popular today I
> would suggest you look at them because the defaults for v7-a are still to use the
> Cortex-A8 scheduler and the Cortex-A9 scheduler might well also get used in places given
> the availability of hardware.

The results are practically identical to Cortex-A53 and A57 - there is a huge codesize win
across the board on SPEC2006, there isn't a single benchmark that is larger (ie. more
spilling).

> I'd be happy to move this forward if you could show if there is no *increase* in spills
> for the same range of benchmarks that you are doing for the Cortex-A8 and Cortex-A9
> schedulers.

There certainly isn't. I don't think results like these could be any more one-sided, it's a
significant win for every single benchmark, both for codesize and performance!

What isn't clear is whether something has gone horribly wrong in the scheduler which
could be fixed/reverted, but as it is right now I can't see it being useful at all. This means
we should also reevaluate whether pressure scheduling now hurts AArch64 too.

Cheers,
Wilco

Wilco Dijkstra Oct. 11, 2019, 9:42 p.m. UTC | #11

Hi,

 > the defaults for v7-a are still to use the
 > Cortex-A8 scheduler

I missed that part, but that's a serious bug btw - Cortex-A8 is 15 years old now so
way beyond obsolete. Even Cortex-A53 is ancient now, but it has an accurate scheduler
that performs surprisingly well on both in-order and out-of-order implementations.

Cheers,
Wilco

Ramana Radhakrishnan Oct. 12, 2019, 12:52 a.m. UTC | #12

On Fri, Oct 11, 2019 at 6:19 PM Wilco Dijkstra <Wilco.Dijkstra@arm.com> wrote:
>
> Hi Ramana,
>
> > Can you see what happens with the Cortex-A8 or Cortex-A9 schedulers to
> > spread the range across some v7-a CPUs as well ? While they aren't that popular today I
> > would suggest you look at them because the defaults for v7-a are still to use the
> > Cortex-A8 scheduler and the Cortex-A9 scheduler might well also get used in places given
> > the availability of hardware.
>
> The results are practically identical to Cortex-A53 and A57 - there is a huge codesize win
> across the board on SPEC2006, there isn't a single benchmark that is larger (ie. more
> spilling).
>
> > I'd be happy to move this forward if you could show if there is no *increase* in spills
> > for the same range of benchmarks that you are doing for the Cortex-A8 and Cortex-A9
> > schedulers.
>
> There certainly isn't. I don't think results like these could be any more one-sided, it's a
> significant win for every single benchmark, both for codesize and performance!
>

Ok go ahead - please be sensitive to testsuite regressions.

Ramana


> What isn't clear is whether something has gone horribly wrong in the scheduler which
> could be fixed/reverted, but as it is right now I can't see it being useful at all. This means
> we should also reevaluate whether pressure scheduling now hurts AArch64 too.
>
> Cheers,
> Wilco

Ramana Radhakrishnan Oct. 12, 2019, 12:58 a.m. UTC | #13

On Fri, Oct 11, 2019 at 10:42 PM Wilco Dijkstra <Wilco.Dijkstra@arm.com> wrote:
>
> Hi,
>
>  > the defaults for v7-a are still to use the
>  > Cortex-A8 scheduler
>
> I missed that part, but that's a serious bug btw - Cortex-A8 is 15 years old now so
> way beyond obsolete. Even Cortex-A53 is ancient now, but it has an accurate scheduler
> that performs surprisingly well on both in-order and out-of-order implementations.

On armv8-a we do use cortex-a53 as the default scheduler in the AArch32 backend.

regards
Ramana

>
> Cheers,
> Wilco

Christophe Lyon Oct. 15, 2019, 6:05 p.m. UTC | #14

On Sat, 12 Oct 2019 at 02:52, Ramana Radhakrishnan
<ramana.gcc@googlemail.com> wrote:
>
> On Fri, Oct 11, 2019 at 6:19 PM Wilco Dijkstra <Wilco.Dijkstra@arm.com> wrote:
> >
> > Hi Ramana,
> >
> > > Can you see what happens with the Cortex-A8 or Cortex-A9 schedulers to
> > > spread the range across some v7-a CPUs as well ? While they aren't that popular today I
> > > would suggest you look at them because the defaults for v7-a are still to use the
> > > Cortex-A8 scheduler and the Cortex-A9 scheduler might well also get used in places given
> > > the availability of hardware.
> >
> > The results are practically identical to Cortex-A53 and A57 - there is a huge codesize win
> > across the board on SPEC2006, there isn't a single benchmark that is larger (ie. more
> > spilling).
> >
> > > I'd be happy to move this forward if you could show if there is no *increase* in spills
> > > for the same range of benchmarks that you are doing for the Cortex-A8 and Cortex-A9
> > > schedulers.
> >
> > There certainly isn't. I don't think results like these could be any more one-sided, it's a
> > significant win for every single benchmark, both for codesize and performance!
> >
>
> Ok go ahead - please be sensitive to testsuite regressions.
>

Hi Wilco,

I've noticed that your patch caused a regression:
FAIL: gcc.dg/tree-prof/pr77698.c scan-rtl-dump-times alignments
"internal loop alignment added" 1

Christophe


when the compiler is configured --with-mode thumb (or forcing -mthumb
when running the tests)
> Ramana
>
>
> > What isn't clear is whether something has gone horribly wrong in the scheduler which
> > could be fixed/reverted, but as it is right now I can't see it being useful at all. This means
> > we should also reevaluate whether pressure scheduling now hurts AArch64 too.
> >
> > Cheers,
> > Wilco

Wilco Dijkstra Oct. 16, 2019, 12:13 p.m. UTC | #15

Hi Christophe,

> I've noticed that your patch caused a regression:
> FAIL: gcc.dg/tree-prof/pr77698.c scan-rtl-dump-times alignments
> "internal loop alignment added" 1

That's just a testism - it only tests for loop alignment and doesn't
consider the possibility of the loop being jumped into like this:

.L17:
        adds    r0, r0, #1
        b       .L27
.L6:
        ldr     r4, [r2, #12]
        adds    r0, r0, #4
        ldr     lr, [r1]
        str     lr, [r3, r4, lsl #2]
        ldr     r4, [r2, #12]
        ldr     lr, [r1]
        str     lr, [r3, r4, lsl #2]
        ldr     r4, [r2, #12]
        ldr     lr, [r1]
        str     lr, [r3, r4, lsl #2]
.L27:
        ldr     r4, [r2, #12]
        cmp     ip, r0
        ldr     lr, [r1]
        str     lr, [r3, r4, lsl #2]
        bne     .L6
        pop     {r4, pc}

It seems minor changes in scheduling allows blocks to be commoned or not.
The underlying issue is that commoning like this should not be allowed on blocks
with different profile stats - particularly on loops where it inhibits scheduling of
the loop itself.

Cheers,
Wilco

Richard Earnshaw (lists) Oct. 16, 2019, 3:41 p.m. UTC | #16

On 16/10/2019 13:13, Wilco Dijkstra wrote:
> Hi Christophe,
> 
>> I've noticed that your patch caused a regression:
>> FAIL: gcc.dg/tree-prof/pr77698.c scan-rtl-dump-times alignments
>> "internal loop alignment added" 1
> 
> That's just a testism - it only tests for loop alignment and doesn't
> consider the possibility of the loop being jumped into like this:
> 
> .L17:
>          adds    r0, r0, #1
>          b       .L27
> .L6:
>          ldr     r4, [r2, #12]
>          adds    r0, r0, #4
>          ldr     lr, [r1]
>          str     lr, [r3, r4, lsl #2]
>          ldr     r4, [r2, #12]
>          ldr     lr, [r1]
>          str     lr, [r3, r4, lsl #2]
>          ldr     r4, [r2, #12]
>          ldr     lr, [r1]
>          str     lr, [r3, r4, lsl #2]
> .L27:
>          ldr     r4, [r2, #12]
>          cmp     ip, r0
>          ldr     lr, [r1]
>          str     lr, [r3, r4, lsl #2]
>          bne     .L6
>          pop     {r4, pc}
> 
> It seems minor changes in scheduling allows blocks to be commoned or not.
> The underlying issue is that commoning like this should not be allowed on blocks
> with different profile stats - particularly on loops where it inhibits scheduling of
> the loop itself.
> 
> Cheers,
> Wilco
> 

So what's your proposed solution?  Leaving the test failing is not an 
option.

Wilco Dijkstra Dec. 19, 2019, 1:21 p.m. UTC | #17

Hi,

>> I've noticed that your patch caused a regression:
>> FAIL: gcc.dg/tree-prof/pr77698.c scan-rtl-dump-times alignments
>> "internal loop alignment added" 1

I've created https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93007

Cheers,
Wilco

[ARM] Switch to default sched pressure algorithm

Commit Message

Comments

Patch