Message ID | 81b9ed20-0a11-a921-0b52-c57b414abb14@linux.ibm.com |
---|---|
State | New |
Headers | show |
Series | rs6000: Enable -fvariable-expansion-in-unroller by default | expand |
On Thu, Jun 27, 2019 at 5:23 AM Bill Schmidt <wschmidt@linux.ibm.com> wrote: > > Hi, > > We've done some experimenting and realized that the subject option almost > always provide improved performance for Power when the loop unroller is > enabled. So this patch turns that flag on by default for us. > > Bootstrapped and tested on powerpc64le-unknown-linux-gnu with no regressions. > Is this OK for trunk? I guess it creates more freedom for combine (more single-uses) and register allocation. I wonder in which cases this might pessimize things? I guess the pre-RA scheduler might make RAs life harder with creating overlapping life-ranges. I guess you didn't actually investigate the nature of the improvements you saw? Do we want to adjust the flags documentation, saying whether this is enabled by default depends on the target (or even list them)? Thanks, Richard. > Thanks! > Bill > > > 2019-06-27 Bill Schmidt <wschmidt@linux.ibm.com> > > * config/rs6000/rs6000.c (rs6000_option_override_internal): Enable > -fvariable-expansion-in-unroller by default. > > > Index: gcc/config/rs6000/rs6000.c > =================================================================== > --- gcc/config/rs6000/rs6000.c (revision 272719) > +++ gcc/config/rs6000/rs6000.c (working copy) > @@ -3616,6 +3616,11 @@ rs6000_option_override_internal (bool global_init_ > && !global_options_set.x_flag_asynchronous_unwind_tables) > flag_asynchronous_unwind_tables = 1; > > + /* -fvariable-expansion-in-unroller is a win for POWER whenever the > + loop unroller is active. It is only checked during unrolling, so > + we can just set it on by default. */ > + flag_variable_expansion_in_unroller = 1; > + > /* Set the pointer size. */ > if (TARGET_64BIT) > { >
Hi Bill, On Wed, Jun 26, 2019 at 10:23:06PM -0500, Bill Schmidt wrote: > 2019-06-27 Bill Schmidt <wschmidt@linux.ibm.com> > > * config/rs6000/rs6000.c (rs6000_option_override_internal): Enable > -fvariable-expansion-in-unroller by default. > > > --- gcc/config/rs6000/rs6000.c (revision 272719) > +++ gcc/config/rs6000/rs6000.c (working copy) > @@ -3616,6 +3616,11 @@ rs6000_option_override_internal (bool global_init_ > && !global_options_set.x_flag_asynchronous_unwind_tables) > flag_asynchronous_unwind_tables = 1; > > + /* -fvariable-expansion-in-unroller is a win for POWER whenever the > + loop unroller is active. It is only checked during unrolling, so > + we can just set it on by default. */ > + flag_variable_expansion_in_unroller = 1; You should test global_options_set like the asynch unwind thing above, so that a user can still turn off variable expansion if he/she so desires. Okay with that change. Thanks! Segher
On Thu, Jun 27, 2019 at 11:33:45AM +0200, Richard Biener wrote: > On Thu, Jun 27, 2019 at 5:23 AM Bill Schmidt <wschmidt@linux.ibm.com> wrote: > > We've done some experimenting and realized that the subject option almost > > always provide improved performance for Power when the loop unroller is > > enabled. So this patch turns that flag on by default for us. > > I guess it creates more freedom for combine (more single-uses) and register > allocation. I wonder in which cases this might pessimize things? I guess > the pre-RA scheduler might make RAs life harder with creating overlapping > life-ranges. > > I guess you didn't actually investigate the nature of the improvements you saw? It breaks the length of dependency chains by a factor equal to the unroll factor. I do not know why this doesn't help a lot everywhere. It of course raises register pressure, maybe that is just it? > Do we want to adjust the flags documentation, saying whether this is enabled > by default depends on the target (or even list them)? Good idea, thanks. Segher
On 6/27/19 6:45 AM, Segher Boessenkool wrote: > On Thu, Jun 27, 2019 at 11:33:45AM +0200, Richard Biener wrote: >> On Thu, Jun 27, 2019 at 5:23 AM Bill Schmidt <wschmidt@linux.ibm.com> wrote: >>> We've done some experimenting and realized that the subject option almost >>> always provide improved performance for Power when the loop unroller is >>> enabled. So this patch turns that flag on by default for us. >> I guess it creates more freedom for combine (more single-uses) and register >> allocation. I wonder in which cases this might pessimize things? I guess >> the pre-RA scheduler might make RAs life harder with creating overlapping >> life-ranges. >> >> I guess you didn't actually investigate the nature of the improvements you saw? > It breaks the length of dependency chains by a factor equal to the unroll > factor. I do not know why this doesn't help a lot everywhere. It of > course raises register pressure, maybe that is just it? Right, it's all about breaking dependencies to more efficiently exploit the microarchitecture. By default, variable expansion in GCC is quite conservative, creating only two reduction streams out of one, so it's pretty rare for it to cause spill. This can be adjusted upwards with --param max-variable-expansions-in-unroller=n. Our experiments show that raising n to 4 starts to cause some minor degradations, which are almost certainly due to pressure, so the default setting looks appropriate. > >> Do we want to adjust the flags documentation, saying whether this is enabled >> by default depends on the target (or even list them)? > Good idea, thanks. OK, I'll update the docs and make the change that Segher requested. Thanks for the reviews! Bill > > > Segher
On Thu, Jun 27, 2019 at 2:19 PM Bill Schmidt <wschmidt@linux.ibm.com> wrote: > > On 6/27/19 6:45 AM, Segher Boessenkool wrote: > > On Thu, Jun 27, 2019 at 11:33:45AM +0200, Richard Biener wrote: > >> On Thu, Jun 27, 2019 at 5:23 AM Bill Schmidt <wschmidt@linux.ibm.com> wrote: > >>> We've done some experimenting and realized that the subject option almost > >>> always provide improved performance for Power when the loop unroller is > >>> enabled. So this patch turns that flag on by default for us. > >> I guess it creates more freedom for combine (more single-uses) and register > >> allocation. I wonder in which cases this might pessimize things? I guess > >> the pre-RA scheduler might make RAs life harder with creating overlapping > >> life-ranges. > >> > >> I guess you didn't actually investigate the nature of the improvements you saw? > > It breaks the length of dependency chains by a factor equal to the unroll > > factor. I do not know why this doesn't help a lot everywhere. It of > > course raises register pressure, maybe that is just it? > > Right, it's all about breaking dependencies to more efficiently exploit > the microarchitecture. By default, variable expansion in GCC is quite > conservative, creating only two reduction streams out of one, so it's > pretty rare for it to cause spill. This can be adjusted upwards with > --param max-variable-expansions-in-unroller=n. Our experiments show > that raising n to 4 starts to cause some minor degradations, which are > almost certainly due to pressure, so the default setting looks appropriate. But it's probably only an issue for targets which enable pre-RA scheduling by default? It might also increase RA compile-time (more allocnos). Richard. > >> Do we want to adjust the flags documentation, saying whether this is enabled > >> by default depends on the target (or even list them)? > > Good idea, thanks. > > OK, I'll update the docs and make the change that Segher requested. > Thanks for the reviews! > > Bill > > > > > > Segher >
On Mon, Jul 01, 2019 at 12:30:41PM +0200, Richard Biener wrote: > On Thu, Jun 27, 2019 at 2:19 PM Bill Schmidt <wschmidt@linux.ibm.com> wrote: > > > > On 6/27/19 6:45 AM, Segher Boessenkool wrote: > > > On Thu, Jun 27, 2019 at 11:33:45AM +0200, Richard Biener wrote: > > >> On Thu, Jun 27, 2019 at 5:23 AM Bill Schmidt <wschmidt@linux.ibm.com> wrote: > > >>> We've done some experimenting and realized that the subject option almost > > >>> always provide improved performance for Power when the loop unroller is > > >>> enabled. So this patch turns that flag on by default for us. > > >> I guess it creates more freedom for combine (more single-uses) and register > > >> allocation. I wonder in which cases this might pessimize things? I guess > > >> the pre-RA scheduler might make RAs life harder with creating overlapping > > >> life-ranges. > > >> > > >> I guess you didn't actually investigate the nature of the improvements you saw? > > > It breaks the length of dependency chains by a factor equal to the unroll > > > factor. I do not know why this doesn't help a lot everywhere. It of > > > course raises register pressure, maybe that is just it? > > > > Right, it's all about breaking dependencies to more efficiently exploit > > the microarchitecture. By default, variable expansion in GCC is quite > > conservative, creating only two reduction streams out of one, so it's > > pretty rare for it to cause spill. This can be adjusted upwards with > > --param max-variable-expansions-in-unroller=n. Our experiments show > > that raising n to 4 starts to cause some minor degradations, which are > > almost certainly due to pressure, so the default setting looks appropriate. > > But it's probably only an issue for targets which enable pre-RA scheduling > by default? It might also increase RA compile-time (more allocnos). It only helps super-scalar CPUs normally. It doesn't increase register use much at all, compared to only doing unrolling. It helps scheduling a lot, but I don't see where sched1 comes in here? To see if it helps your arch, just try it out? Segher
Index: gcc/config/rs6000/rs6000.c =================================================================== --- gcc/config/rs6000/rs6000.c (revision 272719) +++ gcc/config/rs6000/rs6000.c (working copy) @@ -3616,6 +3616,11 @@ rs6000_option_override_internal (bool global_init_ && !global_options_set.x_flag_asynchronous_unwind_tables) flag_asynchronous_unwind_tables = 1; + /* -fvariable-expansion-in-unroller is a win for POWER whenever the + loop unroller is active. It is only checked during unrolling, so + we can just set it on by default. */ + flag_variable_expansion_in_unroller = 1; + /* Set the pointer size. */ if (TARGET_64BIT) {