diff mbox

[AArch64] Change FP reassociation width

Message ID AM5PR0802MB2610946B99FCF37A2BB819BD83CD0@AM5PR0802MB2610.eurprd08.prod.outlook.com
State New
Headers show

Commit Message

Wilco Dijkstra June 12, 2017, 10:50 a.m. UTC
Currently the FP reassociation width is set to 4 on AArch64.  On recent
GCCs this has become more aggressive in splitting expressions.  This means
many FMAs are split into FMUL and FADD.  The reassociation increases register
pressure, in some benchmarks so much that inner loops start to spill.
This results in larger, slower code.  Benchmarking FP reassociation width=1
showed a ~0.5% gain on SPECFP2006 and similar gains on other benchmarks,
so change it to 1.

Passes regress & bootstrap, OK for commit?

ChangeLog:
2017-06-12  Wilco Dijkstra  <wdijkstr@arm.com>

	* gcc/config/aarch64/aarch64.c
	(generic_tuning): Set fp_reassoc_width to 1.
	(cortexa35_tunings): Likewise.
	(cortexa53_tunings): Likewise.
	(cortexa57_tunings): Likewise.
	(cortexa72_tunings): Likewise.
	(cortexa73_tunings): Likewise.
--

Comments

Richard Earnshaw (lists) June 13, 2017, 9:07 a.m. UTC | #1
On 12/06/17 11:50, Wilco Dijkstra wrote:
> Currently the FP reassociation width is set to 4 on AArch64.  On recent
> GCCs this has become more aggressive in splitting expressions.  This means
> many FMAs are split into FMUL and FADD.  The reassociation increases register
> pressure, in some benchmarks so much that inner loops start to spill.
> This results in larger, slower code.  Benchmarking FP reassociation width=1
> showed a ~0.5% gain on SPECFP2006 and similar gains on other benchmarks,
> so change it to 1.
> 

Why 1 and not 2?  Many processors have 2 fp pipes and forcing this down
to a sequential stream is not obviously the right thing.

If reassociation is is causing excess spilling, then the right fix for
that is to look at the pressure model, not hammer the problem away.

R.

> Passes regress & bootstrap, OK for commit?
> 
> ChangeLog:
> 2017-06-12  Wilco Dijkstra  <wdijkstr@arm.com>
> 
> 	* gcc/config/aarch64/aarch64.c
> 	(generic_tuning): Set fp_reassoc_width to 1.
> 	(cortexa35_tunings): Likewise.
> 	(cortexa53_tunings): Likewise.
> 	(cortexa57_tunings): Likewise.
> 	(cortexa72_tunings): Likewise.
> 	(cortexa73_tunings): Likewise.
> --
> 
> diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
> index 72a758642743025ac8974c8f7ad4c44c31a474d5..0998bf37b2abf399277d2f2a539295506085209f 100644
> --- a/gcc/config/aarch64/aarch64.c
> +++ b/gcc/config/aarch64/aarch64.c
> @@ -541,7 +541,7 @@ static const struct tune_params generic_tunings =
>    4,	/* jump_align.  */
>    8,	/* loop_align.  */
>    2,	/* int_reassoc_width.  */
> -  4,	/* fp_reassoc_width.  */
> +  1,	/* fp_reassoc_width.  */
>    1,	/* vec_reassoc_width.  */
>    2,	/* min_div_recip_mul_sf.  */
>    2,	/* min_div_recip_mul_df.  */
> @@ -567,7 +567,7 @@ static const struct tune_params cortexa35_tunings =
>    8,	/* jump_align.  */
>    8,	/* loop_align.  */
>    2,	/* int_reassoc_width.  */
> -  4,	/* fp_reassoc_width.  */
> +  1,	/* fp_reassoc_width.  */
>    1,	/* vec_reassoc_width.  */
>    2,	/* min_div_recip_mul_sf.  */
>    2,	/* min_div_recip_mul_df.  */
> @@ -593,7 +593,7 @@ static const struct tune_params cortexa53_tunings =
>    8,	/* jump_align.  */
>    8,	/* loop_align.  */
>    2,	/* int_reassoc_width.  */
> -  4,	/* fp_reassoc_width.  */
> +  1,	/* fp_reassoc_width.  */
>    1,	/* vec_reassoc_width.  */
>    2,	/* min_div_recip_mul_sf.  */
>    2,	/* min_div_recip_mul_df.  */
> @@ -619,7 +619,7 @@ static const struct tune_params cortexa57_tunings =
>    8,	/* jump_align.  */
>    8,	/* loop_align.  */
>    2,	/* int_reassoc_width.  */
> -  4,	/* fp_reassoc_width.  */
> +  1,	/* fp_reassoc_width.  */
>    1,	/* vec_reassoc_width.  */
>    2,	/* min_div_recip_mul_sf.  */
>    2,	/* min_div_recip_mul_df.  */
> @@ -645,7 +645,7 @@ static const struct tune_params cortexa72_tunings =
>    8,	/* jump_align.  */
>    8,	/* loop_align.  */
>    2,	/* int_reassoc_width.  */
> -  4,	/* fp_reassoc_width.  */
> +  1,	/* fp_reassoc_width.  */
>    1,	/* vec_reassoc_width.  */
>    2,	/* min_div_recip_mul_sf.  */
>    2,	/* min_div_recip_mul_df.  */
> @@ -671,7 +671,7 @@ static const struct tune_params cortexa73_tunings =
>    8,	/* jump_align.  */
>    8,	/* loop_align.  */
>    2,	/* int_reassoc_width.  */
> -  4,	/* fp_reassoc_width.  */
> +  1,	/* fp_reassoc_width.  */
>    1,	/* vec_reassoc_width.  */
>    2,	/* min_div_recip_mul_sf.  */
>    2,	/* min_div_recip_mul_df.  */
>
Wilco Dijkstra June 13, 2017, 9:43 a.m. UTC | #2
Richard Earnshaw (lists) wrote:
> 
> Why 1 and not 2?  Many processors have 2 fp pipes and forcing this down
> to a sequential stream is not obviously the right thing.

1 was faster than 2. Like I said, the reassociation is too aggressive and even
splits multiply-add rather than keeping them. Until reassociation is fixed and
able to split a complex expression into 2 independent chains of similar depth
while keeping multiply-accumulate operations, it is best to set it to 1 for now.

> If reassociation is is causing excess spilling, then the right fix for
> that is to look at the pressure model, not hammer the problem away.

The problem of the GCC scheduler hugely increasing register pressure has
existed for many years with no progress being made. Unless we're willing to
start a project improving this I do not believe there will be a solution any time
soon. So changing the association width is the best solution for the time being.

Wilco
James Greenhalgh June 14, 2017, 2:29 p.m. UTC | #3
On Tue, Jun 13, 2017 at 10:43:05AM +0100, Wilco Dijkstra wrote:
> Richard Earnshaw (lists) wrote:
> > 
> > Why 1 and not 2?  Many processors have 2 fp pipes and forcing this down
> > to a sequential stream is not obviously the right thing.
> 
> 1 was faster than 2. Like I said, the reassociation is too aggressive and even
> splits multiply-add rather than keeping them. Until reassociation is fixed and
> able to split a complex expression into 2 independent chains of similar depth
> while keeping multiply-accumulate operations, it is best to set it to 1 for now.
> 
> > If reassociation is is causing excess spilling, then the right fix for
> > that is to look at the pressure model, not hammer the problem away.
> 
> The problem of the GCC scheduler hugely increasing register pressure has
> existed for many years with no progress being made. Unless we're willing to
> start a project improving this I do not believe there will be a solution any time
> soon. So changing the association width is the best solution for the time being.

Are there bugs with testcases open in bugzilla for this? Was the regression
in "more recent" GCCs that are more aggressive at resassociation noted
somewhere?

It is hard to take this fix with the justification given, and no idea of
what compiler behaviour we're masking with this patch. I don't think I can
approve it with what is presented here, particularly so for "generic".
Richard's point of why "1" rather than "2" also needs a better answer than
given.

Thanks,
James
diff mbox

Patch

diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 72a758642743025ac8974c8f7ad4c44c31a474d5..0998bf37b2abf399277d2f2a539295506085209f 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -541,7 +541,7 @@  static const struct tune_params generic_tunings =
   4,	/* jump_align.  */
   8,	/* loop_align.  */
   2,	/* int_reassoc_width.  */
-  4,	/* fp_reassoc_width.  */
+  1,	/* fp_reassoc_width.  */
   1,	/* vec_reassoc_width.  */
   2,	/* min_div_recip_mul_sf.  */
   2,	/* min_div_recip_mul_df.  */
@@ -567,7 +567,7 @@  static const struct tune_params cortexa35_tunings =
   8,	/* jump_align.  */
   8,	/* loop_align.  */
   2,	/* int_reassoc_width.  */
-  4,	/* fp_reassoc_width.  */
+  1,	/* fp_reassoc_width.  */
   1,	/* vec_reassoc_width.  */
   2,	/* min_div_recip_mul_sf.  */
   2,	/* min_div_recip_mul_df.  */
@@ -593,7 +593,7 @@  static const struct tune_params cortexa53_tunings =
   8,	/* jump_align.  */
   8,	/* loop_align.  */
   2,	/* int_reassoc_width.  */
-  4,	/* fp_reassoc_width.  */
+  1,	/* fp_reassoc_width.  */
   1,	/* vec_reassoc_width.  */
   2,	/* min_div_recip_mul_sf.  */
   2,	/* min_div_recip_mul_df.  */
@@ -619,7 +619,7 @@  static const struct tune_params cortexa57_tunings =
   8,	/* jump_align.  */
   8,	/* loop_align.  */
   2,	/* int_reassoc_width.  */
-  4,	/* fp_reassoc_width.  */
+  1,	/* fp_reassoc_width.  */
   1,	/* vec_reassoc_width.  */
   2,	/* min_div_recip_mul_sf.  */
   2,	/* min_div_recip_mul_df.  */
@@ -645,7 +645,7 @@  static const struct tune_params cortexa72_tunings =
   8,	/* jump_align.  */
   8,	/* loop_align.  */
   2,	/* int_reassoc_width.  */
-  4,	/* fp_reassoc_width.  */
+  1,	/* fp_reassoc_width.  */
   1,	/* vec_reassoc_width.  */
   2,	/* min_div_recip_mul_sf.  */
   2,	/* min_div_recip_mul_df.  */
@@ -671,7 +671,7 @@  static const struct tune_params cortexa73_tunings =
   8,	/* jump_align.  */
   8,	/* loop_align.  */
   2,	/* int_reassoc_width.  */
-  4,	/* fp_reassoc_width.  */
+  1,	/* fp_reassoc_width.  */
   1,	/* vec_reassoc_width.  */
   2,	/* min_div_recip_mul_sf.  */
   2,	/* min_div_recip_mul_df.  */