Message ID | AM3PR08MB008846A786228571DC6E0731836F0@AM3PR08MB0088.eurprd08.prod.outlook.com |
---|---|
State | New |
Headers | show |
On 04/22/16 11:15, Wilco Dijkstra wrote: > This patch fixes that by setting the default aarch64_case_values_threshold to > 16 when the per-CPU tuning is not set. On SPEC2006 this improves the switch > heavy benchmarks GCC and perlbench both in performance (1-2%) as well as size > (0.5-1% smaller). I assume that you mean that such improvements are true for -mcpu=generic, yes? On which target, A53 or A57 or other? Otherwise, it seems to be a sensible change, but I'm trying to understand how generally beneficial it is. Thank you,
Evandro Menezes wrote: > I assume that you mean that such improvements are true for > -mcpu=generic, yes? On which target, A53 or A57 or other? It's true for any CPU setting. The SPEC results are for Cortex-A57 however I wrote a microbenchmark that shows improvements on all targets I have access to. The GCC switch expansion is awful, so even with a good indirect predictor it is better to use conditional branches. Wilco
On 04/25/16 14:21, Wilco Dijkstra wrote: > Evandro Menezes wrote: >> I assume that you mean that such improvements are true for >> -mcpu=generic, yes? On which target, A53 or A57 or other? > It's true for any CPU setting. The SPEC results are for Cortex-A57 > however I wrote a microbenchmark that shows improvements on > all targets I have access to. The GCC switch expansion is awful, so > even with a good indirect predictor it is better to use conditional > branches. I agree with your assessment, but I'm more curious to understand how this change affects code built with the default -mcpu=generic when run on both A53 and A57, the typical configuration of big.LITTLE machines. Thank you,
Evandro Menezes wrote: > I agree with your assessment, but I'm more curious to understand how > this change affects code built with the default -mcpu=generic when run > on both A53 and A57, the typical configuration of big.LITTLE machines. I wouldn't expect the result to be any different as the -mcpu setting makes very little difference. Wilco
On 04/25/16 14:58, Wilco Dijkstra wrote: > Evandro Menezes wrote: >> I agree with your assessment, but I'm more curious to understand how >> this change affects code built with the default -mcpu=generic when run >> on both A53 and A57, the typical configuration of big.LITTLE machines. > I wouldn't expect the result to be any different as the -mcpu setting makes > very little difference. True, but the results when running on A53 could be quite different. Thank you,
Evandro Menezes wrote: > > True, but the results when running on A53 could be quite different. GCC is ~1.2% faster on Cortex-A53 built for generic, but there is no difference in perlbench. Wilco
On 04/26/16 11:14, Wilco Dijkstra wrote: > Evandro Menezes wrote: >> True, but the results when running on A53 could be quite different. > GCC is ~1.2% faster on Cortex-A53 built for generic, but there is no > difference in perlbench. Looks good, then. Fine by me. Thanks for your patience,
Hi Wilco, On 25/04/16 20:21, Wilco Dijkstra wrote: > Evandro Menezes wrote: >> I assume that you mean that such improvements are true for >> -mcpu=generic, yes? On which target, A53 or A57 or other? > It's true for any CPU setting. The SPEC results are for Cortex-A57 > however I wrote a microbenchmark that shows improvements on > all targets I have access to. The GCC switch expansion is awful, so > even with a good indirect predictor it is better to use conditional > branches. In what way is it awful? If there's something we can do better at can you file a bug report with a testcase so that we can work on improving it rather than tweaking a heuristic in the backend. (Not that I'm against your patch ;) Cheers, Kyrill > Wilco > > > >
Kyrill Tkachov wrote: > On 25/04/16 20:21, Wilco Dijkstra wrote: > > The GCC switch expansion is awful, so > > even with a good indirect predictor it is better to use conditional > > branches. > > In what way is it awful? If there's something we can do better at > can you file a bug report with a testcase so that we can work on > improving it rather than tweaking a heuristic in the backend. In every way :-( Try this simple example and see whether you agree this is terrible, especially on targets that don't use PC-relative tables... int i; int func(int a) { switch(a) { case 0: i = 20; break; case 1: i = 50; break; case 2: i = 29; break; case 3: i = 20; break; case 4: i = 50; break; case 5: i = 29; break; case 6: i = 20; break; case 7: i = 50; break; case 8: i = 29; break; case 9: i = 79; break; case 110: i = 27; break; default: i = 77; break; } return i; } See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70861. It's also disappointing this doesn't get turned into: return i = (unsigned)a <= 110 ? table[a] : 77; Wilco
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c index 0620f1e..a240635 100644 --- a/gcc/config/aarch64/aarch64.c +++ b/gcc/config/aarch64/aarch64.c @@ -3546,7 +3546,12 @@ aarch64_cannot_force_const_mem (machine_mode mode ATTRIBUTE_UNUSED, rtx x) return aarch64_tls_referenced_p (x); } -/* Implement TARGET_CASE_VALUES_THRESHOLD. */ +/* Implement TARGET_CASE_VALUES_THRESHOLD. + The expansion for a table switch is quite expensive due to the number + of instructions, the table lookup and hard to predict indirect jump. + When optimizing for speed, with -O3 use the per-core tuning if set, + otherwise use tables for > 16 cases as a tradeoff between size and + performance. */ static unsigned int aarch64_case_values_threshold (void) @@ -3557,7 +3562,7 @@ aarch64_case_values_threshold (void) && selected_cpu->tune->max_case_values != 0) return selected_cpu->tune->max_case_values; else - return default_case_values_threshold (); + return optimize_size ? default_case_values_threshold () : 17; }