diff mbox series

rs6000: Enable -fvariable-expansion-in-unroller by default

Message ID 81b9ed20-0a11-a921-0b52-c57b414abb14@linux.ibm.com
State New
Headers show
Series rs6000: Enable -fvariable-expansion-in-unroller by default | expand

Commit Message

Bill Schmidt June 27, 2019, 3:23 a.m. UTC
Hi,

We've done some experimenting and realized that the subject option almost
always provide improved performance for Power when the loop unroller is
enabled.  So this patch turns that flag on by default for us.

Bootstrapped and tested on powerpc64le-unknown-linux-gnu with no regressions.
Is this OK for trunk?

Thanks!
Bill


2019-06-27  Bill Schmidt  <wschmidt@linux.ibm.com>

	* config/rs6000/rs6000.c (rs6000_option_override_internal): Enable
	-fvariable-expansion-in-unroller by default.

Comments

Richard Biener June 27, 2019, 9:33 a.m. UTC | #1
On Thu, Jun 27, 2019 at 5:23 AM Bill Schmidt <wschmidt@linux.ibm.com> wrote:
>
> Hi,
>
> We've done some experimenting and realized that the subject option almost
> always provide improved performance for Power when the loop unroller is
> enabled.  So this patch turns that flag on by default for us.
>
> Bootstrapped and tested on powerpc64le-unknown-linux-gnu with no regressions.
> Is this OK for trunk?

I guess it creates more freedom for combine (more single-uses) and register
allocation.  I wonder in which cases this might pessimize things?  I guess
the pre-RA scheduler might make RAs life harder with creating overlapping
life-ranges.

I guess you didn't actually investigate the nature of the improvements you saw?

Do we want to adjust the flags documentation, saying whether this is enabled
by default depends on the target (or even list them)?

Thanks,
Richard.

> Thanks!
> Bill
>
>
> 2019-06-27  Bill Schmidt  <wschmidt@linux.ibm.com>
>
>         * config/rs6000/rs6000.c (rs6000_option_override_internal): Enable
>         -fvariable-expansion-in-unroller by default.
>
>
> Index: gcc/config/rs6000/rs6000.c
> ===================================================================
> --- gcc/config/rs6000/rs6000.c  (revision 272719)
> +++ gcc/config/rs6000/rs6000.c  (working copy)
> @@ -3616,6 +3616,11 @@ rs6000_option_override_internal (bool global_init_
>        && !global_options_set.x_flag_asynchronous_unwind_tables)
>      flag_asynchronous_unwind_tables = 1;
>
> +  /* -fvariable-expansion-in-unroller is a win for POWER whenever the
> +     loop unroller is active.  It is only checked during unrolling, so
> +     we can just set it on by default.  */
> +  flag_variable_expansion_in_unroller = 1;
> +
>    /* Set the pointer size.  */
>    if (TARGET_64BIT)
>      {
>
Segher Boessenkool June 27, 2019, 11:32 a.m. UTC | #2
Hi Bill,

On Wed, Jun 26, 2019 at 10:23:06PM -0500, Bill Schmidt wrote:
> 2019-06-27  Bill Schmidt  <wschmidt@linux.ibm.com>
> 
> 	* config/rs6000/rs6000.c (rs6000_option_override_internal): Enable
> 	-fvariable-expansion-in-unroller by default.
> 
> 
> --- gcc/config/rs6000/rs6000.c	(revision 272719)
> +++ gcc/config/rs6000/rs6000.c	(working copy)
> @@ -3616,6 +3616,11 @@ rs6000_option_override_internal (bool global_init_
>        && !global_options_set.x_flag_asynchronous_unwind_tables)
>      flag_asynchronous_unwind_tables = 1;
>  
> +  /* -fvariable-expansion-in-unroller is a win for POWER whenever the
> +     loop unroller is active.  It is only checked during unrolling, so
> +     we can just set it on by default.  */
> +  flag_variable_expansion_in_unroller = 1;

You should test global_options_set like the asynch unwind thing above,
so that a user can still turn off variable expansion if he/she so desires.


Okay with that change.  Thanks!


Segher
Segher Boessenkool June 27, 2019, 11:45 a.m. UTC | #3
On Thu, Jun 27, 2019 at 11:33:45AM +0200, Richard Biener wrote:
> On Thu, Jun 27, 2019 at 5:23 AM Bill Schmidt <wschmidt@linux.ibm.com> wrote:
> > We've done some experimenting and realized that the subject option almost
> > always provide improved performance for Power when the loop unroller is
> > enabled.  So this patch turns that flag on by default for us.
> 
> I guess it creates more freedom for combine (more single-uses) and register
> allocation.  I wonder in which cases this might pessimize things?  I guess
> the pre-RA scheduler might make RAs life harder with creating overlapping
> life-ranges.
> 
> I guess you didn't actually investigate the nature of the improvements you saw?

It breaks the length of dependency chains by a factor equal to the unroll
factor.  I do not know why this doesn't help a lot everywhere.  It of
course raises register pressure, maybe that is just it?

> Do we want to adjust the flags documentation, saying whether this is enabled
> by default depends on the target (or even list them)?

Good idea, thanks.


Segher
Bill Schmidt June 27, 2019, 12:19 p.m. UTC | #4
On 6/27/19 6:45 AM, Segher Boessenkool wrote:
> On Thu, Jun 27, 2019 at 11:33:45AM +0200, Richard Biener wrote:
>> On Thu, Jun 27, 2019 at 5:23 AM Bill Schmidt <wschmidt@linux.ibm.com> wrote:
>>> We've done some experimenting and realized that the subject option almost
>>> always provide improved performance for Power when the loop unroller is
>>> enabled.  So this patch turns that flag on by default for us.
>> I guess it creates more freedom for combine (more single-uses) and register
>> allocation.  I wonder in which cases this might pessimize things?  I guess
>> the pre-RA scheduler might make RAs life harder with creating overlapping
>> life-ranges.
>>
>> I guess you didn't actually investigate the nature of the improvements you saw?
> It breaks the length of dependency chains by a factor equal to the unroll
> factor.  I do not know why this doesn't help a lot everywhere.  It of
> course raises register pressure, maybe that is just it?

Right, it's all about breaking dependencies to more efficiently exploit
the microarchitecture.  By default, variable expansion in GCC is quite
conservative, creating only two reduction streams out of one, so it's
pretty rare for it to cause spill.  This can be adjusted upwards with
--param max-variable-expansions-in-unroller=n.  Our experiments show
that raising n to 4 starts to cause some minor degradations, which are
almost certainly due to pressure, so the default setting looks appropriate.
>
>> Do we want to adjust the flags documentation, saying whether this is enabled
>> by default depends on the target (or even list them)?
> Good idea, thanks.

OK, I'll update the docs and make the change that Segher requested. 
Thanks for the reviews!

Bill
>
>
> Segher
Richard Biener July 1, 2019, 10:30 a.m. UTC | #5
On Thu, Jun 27, 2019 at 2:19 PM Bill Schmidt <wschmidt@linux.ibm.com> wrote:
>
> On 6/27/19 6:45 AM, Segher Boessenkool wrote:
> > On Thu, Jun 27, 2019 at 11:33:45AM +0200, Richard Biener wrote:
> >> On Thu, Jun 27, 2019 at 5:23 AM Bill Schmidt <wschmidt@linux.ibm.com> wrote:
> >>> We've done some experimenting and realized that the subject option almost
> >>> always provide improved performance for Power when the loop unroller is
> >>> enabled.  So this patch turns that flag on by default for us.
> >> I guess it creates more freedom for combine (more single-uses) and register
> >> allocation.  I wonder in which cases this might pessimize things?  I guess
> >> the pre-RA scheduler might make RAs life harder with creating overlapping
> >> life-ranges.
> >>
> >> I guess you didn't actually investigate the nature of the improvements you saw?
> > It breaks the length of dependency chains by a factor equal to the unroll
> > factor.  I do not know why this doesn't help a lot everywhere.  It of
> > course raises register pressure, maybe that is just it?
>
> Right, it's all about breaking dependencies to more efficiently exploit
> the microarchitecture.  By default, variable expansion in GCC is quite
> conservative, creating only two reduction streams out of one, so it's
> pretty rare for it to cause spill.  This can be adjusted upwards with
> --param max-variable-expansions-in-unroller=n.  Our experiments show
> that raising n to 4 starts to cause some minor degradations, which are
> almost certainly due to pressure, so the default setting looks appropriate.

But it's probably only an issue for targets which enable pre-RA scheduling
by default?  It might also increase RA compile-time (more allocnos).

Richard.

> >> Do we want to adjust the flags documentation, saying whether this is enabled
> >> by default depends on the target (or even list them)?
> > Good idea, thanks.
>
> OK, I'll update the docs and make the change that Segher requested.
> Thanks for the reviews!
>
> Bill
> >
> >
> > Segher
>
Segher Boessenkool July 1, 2019, 12:11 p.m. UTC | #6
On Mon, Jul 01, 2019 at 12:30:41PM +0200, Richard Biener wrote:
> On Thu, Jun 27, 2019 at 2:19 PM Bill Schmidt <wschmidt@linux.ibm.com> wrote:
> >
> > On 6/27/19 6:45 AM, Segher Boessenkool wrote:
> > > On Thu, Jun 27, 2019 at 11:33:45AM +0200, Richard Biener wrote:
> > >> On Thu, Jun 27, 2019 at 5:23 AM Bill Schmidt <wschmidt@linux.ibm.com> wrote:
> > >>> We've done some experimenting and realized that the subject option almost
> > >>> always provide improved performance for Power when the loop unroller is
> > >>> enabled.  So this patch turns that flag on by default for us.
> > >> I guess it creates more freedom for combine (more single-uses) and register
> > >> allocation.  I wonder in which cases this might pessimize things?  I guess
> > >> the pre-RA scheduler might make RAs life harder with creating overlapping
> > >> life-ranges.
> > >>
> > >> I guess you didn't actually investigate the nature of the improvements you saw?
> > > It breaks the length of dependency chains by a factor equal to the unroll
> > > factor.  I do not know why this doesn't help a lot everywhere.  It of
> > > course raises register pressure, maybe that is just it?
> >
> > Right, it's all about breaking dependencies to more efficiently exploit
> > the microarchitecture.  By default, variable expansion in GCC is quite
> > conservative, creating only two reduction streams out of one, so it's
> > pretty rare for it to cause spill.  This can be adjusted upwards with
> > --param max-variable-expansions-in-unroller=n.  Our experiments show
> > that raising n to 4 starts to cause some minor degradations, which are
> > almost certainly due to pressure, so the default setting looks appropriate.
> 
> But it's probably only an issue for targets which enable pre-RA scheduling
> by default?  It might also increase RA compile-time (more allocnos).

It only helps super-scalar CPUs normally.  It doesn't increase register
use much at all, compared to only doing unrolling.  It helps scheduling
a lot, but I don't see where sched1 comes in here?

To see if it helps your arch, just try it out?


Segher
diff mbox series

Patch

Index: gcc/config/rs6000/rs6000.c
===================================================================
--- gcc/config/rs6000/rs6000.c	(revision 272719)
+++ gcc/config/rs6000/rs6000.c	(working copy)
@@ -3616,6 +3616,11 @@  rs6000_option_override_internal (bool global_init_
       && !global_options_set.x_flag_asynchronous_unwind_tables)
     flag_asynchronous_unwind_tables = 1;
 
+  /* -fvariable-expansion-in-unroller is a win for POWER whenever the
+     loop unroller is active.  It is only checked during unrolling, so
+     we can just set it on by default.  */
+  flag_variable_expansion_in_unroller = 1;
+
   /* Set the pointer size.  */
   if (TARGET_64BIT)
     {