Message ID | DB5PR08MB1030E3F6C94348770C9BB83283C60@DB5PR08MB1030.eurprd08.prod.outlook.com |
---|---|
State | New |
Headers | show |
Series | [AArch64] PR79262: Adjust vector cost | expand |
On Fri, Nov 09, 2018 at 08:14:27AM -0600, Wilco Dijkstra wrote: > PR79262 has been fixed for almost all AArch64 cpus, however the example is still > vectorized in a few cases, resulting in lower performance. Increase the cost of > vector-to-scalar moves so it is more similar to the other vector costs. As a result > -mcpu=cortex-a53 no longer vectorizes the testcase - libquantum and SPECv6 > performance improves. > > OK for commit? No. We have 7 unique target tuning structures in the AArch64 backend, of which only one has a 2x ratio between scalar_int_cost and vec_to_scalar_cost. Other ratios are 1, 3, 8, 3, 4, 6. What makes this choice correct? What makes it more correct than what we have now? On which of the 28 entries in config/aarch64/aarch64-cores.def does performance improve? Are the Spec benchmarks sufficiently representative to change the generic vectorisation costs? Please validate the performance effect of this patch, which changes default code generation for everyone, on more than one testcase in a bug report. Thanks, James > ChangeLog: > 2018-01-22 Wilco Dijkstra <wdijkstr@arm.com> > > PR target/79262 > * config/aarch64/aarch64.c (generic_vector_cost): Adjust vec_to_scalar_cost. > --
Hi James, > We have 7 unique target tuning structures in the AArch64 backend, of which > only one has a 2x ratio between scalar_int_cost and vec_to_scalar_cost. Other > ratios are 1, 3, 8, 3, 4, 6. I wouldn't read too much in the exact value here - the costs are simply relative to other values for the same tuning, ie. cores that use 4 or 6 also tend to use larger values for the other entries. > What makes this choice correct? What makes it more correct than what we > have now? On which of the 28 entries in config/aarch64/aarch64-cores.def does > performance improve? It's correct since it's the minimum value that stops vectorization for this particular bug without increasing costs too much and accidentally blocking useful loops from being vectorized (as confirmed by benchmarking). Given that all other vector cost tunings already block vectorization for this case, this is clearly what is required for best performance. So this improves performance for all 28 entries. > Please validate the performance effect of this patch, which changes default > code generation for everyone, on more than one testcase in a bug report. I did validate the performance like I mentioned in the patch. Since it has been a while, I can easily rerun the benchmarks using latest trunk to verify it's still a gain. > Are the Spec benchmarks sufficiently representative to change the generic > vectorisation costs? SPEC is sufficiently large and complex to show there is a gain without regressions. Do you have reason to believe other benchmarks might not see the same gain? Wilco
ping PR79262 has been fixed for almost all AArch64 cpus, however the example is still vectorized in a few cases, resulting in lower performance. Increase the cost of vector-to-scalar moves so it is more similar to the other vector costs. As a result -mcpu=cortex-a53 no longer vectorizes the testcase - libquantum and SPECv6 performance improves. OK for commit? ChangeLog: 2018-01-22 Wilco Dijkstra <wdijkstr@arm.com> PR target/79262 * config/aarch64/aarch64.c (generic_vector_cost): Adjust vec_to_scalar_cost. -- diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c index c6a83c881038873d8b68e36f906783be63ddde56..43f5b7162152ca92a916f4febee01f624c375202 100644 --- a/gcc/config/aarch64/aarch64.c +++ b/gcc/config/aarch64/aarch64.c @@ -403,7 +403,7 @@ static const struct cpu_vector_cost generic_vector_cost = 1, /* vec_int_stmt_cost */ 1, /* vec_fp_stmt_cost */ 2, /* vec_permute_cost */ - 1, /* vec_to_scalar_cost */ + 2, /* vec_to_scalar_cost */ 1, /* scalar_to_vec_cost */ 1, /* vec_align_load_cost */ 1, /* vec_unalign_load_cost */
Wilco Dijkstra <Wilco.Dijkstra@arm.com> writes: > ping > > PR79262 has been fixed for almost all AArch64 cpus, however the example is still > vectorized in a few cases, resulting in lower performance. Increase the cost of > vector-to-scalar moves so it is more similar to the other vector costs. As a result > -mcpu=cortex-a53 no longer vectorizes the testcase - libquantum and SPECv6 > performance improves. > > OK for commit? > > ChangeLog: > 2018-01-22 Wilco Dijkstra <wdijkstr@arm.com> > > PR target/79262 > * config/aarch64/aarch64.c (generic_vector_cost): Adjust vec_to_scalar_cost. OK, thanks, and sorry for the delay. qdf24xx_vector_cost is the only specific CPU cost table with a vec_to_scalar_cost as low as 1. It's not obvious how emphatic that choice is though. It looks like qdf24xx_vector_cost might (very reasonably!) have started out as a copy of the generic costs with some targeted changes. But even if 1 is accurate there from a h/w perspective, the problem is that the vectoriser's costings have a tendency to miss additional overhead involved in scalarisation. Although increasing the cost to avoid that might be a bit of a hack, it's the accepted hack. So I suspect in practice all CPUs will benefit from a higher cost, not just those whose CPU tables already have one. On that basis, increasing the generic cost by the smallest possible amount should be a good change across the board. If anyone finds a counter-example, please let us know or file a bug. Richard > -- > > diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c > index c6a83c881038873d8b68e36f906783be63ddde56..43f5b7162152ca92a916f4febee01f624c375202 100644 > --- a/gcc/config/aarch64/aarch64.c > +++ b/gcc/config/aarch64/aarch64.c > @@ -403,7 +403,7 @@ static const struct cpu_vector_cost generic_vector_cost = > 1, /* vec_int_stmt_cost */ > 1, /* vec_fp_stmt_cost */ > 2, /* vec_permute_cost */ > - 1, /* vec_to_scalar_cost */ > + 2, /* vec_to_scalar_cost */ > 1, /* scalar_to_vec_cost */ > 1, /* vec_align_load_cost */ > 1, /* vec_unalign_load_cost */
ping PR79262 has been fixed for almost all AArch64 cpus, however the example is still vectorized in a few cases, resulting in lower performance. Increase the cost of vector-to-scalar moves so it is more similar to the other vector costs. As a result -mcpu=cortex-a53 no longer vectorizes the testcase - libquantum and SPECv6 performance improves. OK for commit? ChangeLog: 2018-01-22 Wilco Dijkstra <wdijkstr@arm.com> PR target/79262 * config/aarch64/aarch64.c (generic_vector_cost): Adjust vec_to_scalar_cost. -- diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c index c6a83c881038873d8b68e36f906783be63ddde56..43f5b7162152ca92a916f4febee01f624c375202 100644 --- a/gcc/config/aarch64/aarch64.c +++ b/gcc/config/aarch64/aarch64.c @@ -403,7 +403,7 @@ static const struct cpu_vector_cost generic_vector_cost = 1, /* vec_int_stmt_cost */ 1, /* vec_fp_stmt_cost */ 2, /* vec_permute_cost */ - 1, /* vec_to_scalar_cost */ + 2, /* vec_to_scalar_cost */ 1, /* scalar_to_vec_cost */ 1, /* vec_align_load_cost */ 1, /* vec_unalign_load_cost */
Wilco Dijkstra <Wilco.Dijkstra@arm.com> writes: > ping I acked this here: https://gcc.gnu.org/ml/gcc-patches/2019-10/msg01229.html > > PR79262 has been fixed for almost all AArch64 cpus, however the example is still > vectorized in a few cases, resulting in lower performance. Increase the cost of > vector-to-scalar moves so it is more similar to the other vector costs. As a result > -mcpu=cortex-a53 no longer vectorizes the testcase - libquantum and SPECv6 > performance improves. > > OK for commit? > > ChangeLog: > 2018-01-22 Wilco Dijkstra <wdijkstr@arm.com> > > PR target/79262 > * config/aarch64/aarch64.c (generic_vector_cost): Adjust vec_to_scalar_cost. > -- > > diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c > index c6a83c881038873d8b68e36f906783be63ddde56..43f5b7162152ca92a916f4febee01f624c375202 100644 > --- a/gcc/config/aarch64/aarch64.c > +++ b/gcc/config/aarch64/aarch64.c > @@ -403,7 +403,7 @@ static const struct cpu_vector_cost generic_vector_cost = > 1, /* vec_int_stmt_cost */ > 1, /* vec_fp_stmt_cost */ > 2, /* vec_permute_cost */ > - 1, /* vec_to_scalar_cost */ > + 2, /* vec_to_scalar_cost */ > 1, /* scalar_to_vec_cost */ > 1, /* vec_align_load_cost */ > 1, /* vec_unalign_load_cost */
Hi Richard, > I acked this here: > https://gcc.gnu.org/ml/gcc-patches/2019-10/msg01229.html Thanks - I missed your email, but it's committed now. Yes we will need to look at the vector costs again and retune them based on recent vectorizer improvements and latest microarchitectures. Cheers, Wilco
diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c index c6a83c881038873d8b68e36f906783be63ddde56..43f5b7162152ca92a916f4febee01f624c375202 100644 --- a/gcc/config/aarch64/aarch64.c +++ b/gcc/config/aarch64/aarch64.c @@ -403,7 +403,7 @@ static const struct cpu_vector_cost generic_vector_cost = 1, /* vec_int_stmt_cost */ 1, /* vec_fp_stmt_cost */ 2, /* vec_permute_cost */ - 1, /* vec_to_scalar_cost */ + 2, /* vec_to_scalar_cost */ 1, /* scalar_to_vec_cost */ 1, /* vec_align_load_cost */ 1, /* vec_unalign_load_cost */