diff mbox

[AArch64] Improve aarch64_case_values_threshold setting

Message ID AM3PR08MB008846A786228571DC6E0731836F0@AM3PR08MB0088.eurprd08.prod.outlook.com
State New
Headers show

Commit Message

Wilco Dijkstra April 22, 2016, 4:15 p.m. UTC
GCC expands switch statements in a very simplistic way and tries to use a table
expansion even when it is a bad idea for performance or codesize.
GCC typically emits extremely sparse tables that contain mostly default entries
(something which currently cannot be tuned by backends).  Additionally the 
computation of the minimum/maximum label offsets is too simplistic so the tables
are often twice as large as necessary.

The cost of a table switch is significant due to the setup overhead, the table
lookup (which due to being sparse and large adds unnecessary cachemisses)
and hard to predict indirect jump.  Therefore it is best to avoid using a table
unless there are many real case labels.

This patch fixes that by setting the default aarch64_case_values_threshold to
16 when the per-CPU tuning is not set.  On SPEC2006 this improves the switch
heavy benchmarks GCC and perlbench both in performance (1-2%) as well as size
(0.5-1% smaller).

OK for trunk?

ChangeLog:
2016-04-22  Wilco Dijkstra  <wdijkstr@arm.com>

    gcc/
        * config/aarch64/aarch64.c (aarch64_case_values_threshold):
        Return a better case_values_threshold when optimizing.

--

Comments

Evandro Menezes April 25, 2016, 6:58 p.m. UTC | #1
On 04/22/16 11:15, Wilco Dijkstra wrote:
> This patch fixes that by setting the default aarch64_case_values_threshold to
> 16 when the per-CPU tuning is not set.  On SPEC2006 this improves the switch
> heavy benchmarks GCC and perlbench both in performance (1-2%) as well as size
> (0.5-1% smaller).

I assume that you mean that such improvements are true for 
-mcpu=generic, yes?  On which target, A53 or A57 or other?

Otherwise, it seems to be a sensible change, but I'm trying to 
understand how generally beneficial it is.

Thank you,
Wilco Dijkstra April 25, 2016, 7:21 p.m. UTC | #2
Evandro Menezes wrote:
> I assume that you mean that such improvements are true for
> -mcpu=generic, yes?  On which target, A53 or A57 or other?

It's true for any CPU setting. The SPEC results are for Cortex-A57
however I wrote a microbenchmark that shows improvements on 
all targets I have access to. The GCC switch expansion is awful, so
even with a good indirect predictor it is better to use conditional
branches.

Wilco
Evandro Menezes April 25, 2016, 7:29 p.m. UTC | #3
On 04/25/16 14:21, Wilco Dijkstra wrote:
> Evandro Menezes wrote:
>> I assume that you mean that such improvements are true for
>> -mcpu=generic, yes?  On which target, A53 or A57 or other?
> It's true for any CPU setting. The SPEC results are for Cortex-A57
> however I wrote a microbenchmark that shows improvements on
> all targets I have access to. The GCC switch expansion is awful, so
> even with a good indirect predictor it is better to use conditional
> branches.

I agree with your assessment, but I'm more curious to understand how 
this change affects code built with the default -mcpu=generic when run 
on both A53 and A57, the typical configuration of big.LITTLE machines.

Thank you,
Wilco Dijkstra April 25, 2016, 7:58 p.m. UTC | #4
Evandro Menezes wrote:
> I agree with your assessment, but I'm more curious to understand how
> this change affects code built with the default -mcpu=generic when run
> on both A53 and A57, the typical configuration of big.LITTLE machines.

I wouldn't expect the result to be any different as the -mcpu setting makes
very little difference.

Wilco
Evandro Menezes April 25, 2016, 8:01 p.m. UTC | #5
On 04/25/16 14:58, Wilco Dijkstra wrote:
> Evandro Menezes wrote:
>> I agree with your assessment, but I'm more curious to understand how
>> this change affects code built with the default -mcpu=generic when run
>> on both A53 and A57, the typical configuration of big.LITTLE machines.
> I wouldn't expect the result to be any different as the -mcpu setting makes
> very little difference.

True, but the results when running on A53 could be quite different.

Thank you,
Wilco Dijkstra April 26, 2016, 4:14 p.m. UTC | #6
Evandro Menezes wrote:
>
> True, but the results when running on A53 could be quite different.

GCC is ~1.2% faster on Cortex-A53 built for generic, but there is no
difference in perlbench.

Wilco
Evandro Menezes April 26, 2016, 4:19 p.m. UTC | #7
On 04/26/16 11:14, Wilco Dijkstra wrote:
> Evandro Menezes wrote:
>> True, but the results when running on A53 could be quite different.
> GCC is ~1.2% faster on Cortex-A53 built for generic, but there is no
> difference in perlbench.

Looks good, then.  Fine by me.

Thanks for your patience,
Kyrill Tkachov April 27, 2016, 10:29 a.m. UTC | #8
Hi Wilco,

On 25/04/16 20:21, Wilco Dijkstra wrote:
> Evandro Menezes wrote:
>> I assume that you mean that such improvements are true for
>> -mcpu=generic, yes?  On which target, A53 or A57 or other?
> It's true for any CPU setting. The SPEC results are for Cortex-A57
> however I wrote a microbenchmark that shows improvements on
> all targets I have access to. The GCC switch expansion is awful, so
> even with a good indirect predictor it is better to use conditional
> branches.

In what way is it awful? If there's something we can do better at
can you file a bug report with a testcase so that we can work on
improving it rather than tweaking a heuristic in the backend.
(Not that I'm against your patch ;)

Cheers,
Kyrill

> Wilco
>
>
>
>
Wilco Dijkstra April 28, 2016, 5:58 p.m. UTC | #9
Kyrill Tkachov wrote:
> On 25/04/16 20:21, Wilco Dijkstra wrote:
> > The GCC switch expansion is awful, so
> > even with a good indirect predictor it is better to use conditional
> > branches.
> 
> In what way is it awful? If there's something we can do better at
> can you file a bug report with a testcase so that we can work on
> improving it rather than tweaking a heuristic in the backend.

In every way :-( 

Try this simple example and see whether you agree this is terrible,
especially on targets that don't use PC-relative tables...

int i;
int func(int a)
{
  switch(a)
  {
    case 0:   i = 20; break;
    case 1:   i = 50; break;
    case 2:   i = 29; break;
    case 3:   i = 20; break;
    case 4:   i = 50; break;
    case 5:   i = 29; break;
    case 6:   i = 20; break;
    case 7:   i = 50; break;
    case 8:   i = 29; break;
    case 9:   i = 79; break;
    case 110: i = 27; break;
    default:  i = 77; break;
  }
  return i;
}

See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70861.

It's also disappointing this doesn't get turned into:

return i = (unsigned)a <= 110 ? table[a] : 77;

Wilco
diff mbox

Patch

diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 0620f1e..a240635 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -3546,7 +3546,12 @@  aarch64_cannot_force_const_mem (machine_mode mode ATTRIBUTE_UNUSED, rtx x)
   return aarch64_tls_referenced_p (x);
 }
 
-/* Implement TARGET_CASE_VALUES_THRESHOLD.  */
+/* Implement TARGET_CASE_VALUES_THRESHOLD.
+   The expansion for a table switch is quite expensive due to the number
+   of instructions, the table lookup and hard to predict indirect jump.
+   When optimizing for speed, with -O3 use the per-core tuning if set,
+   otherwise use tables for > 16 cases as a tradeoff between size and
+   performance.  */
 
 static unsigned int
 aarch64_case_values_threshold (void)
@@ -3557,7 +3562,7 @@  aarch64_case_values_threshold (void)
       && selected_cpu->tune->max_case_values != 0)
     return selected_cpu->tune->max_case_values;
   else
-    return default_case_values_threshold ();
+    return optimize_size ? default_case_values_threshold () : 17;
 }