Message ID | 20180124052755.GA8250@ibm-tiger.the-meissners.org |
---|---|
State | New |
Headers | show |
Series | , PR target/81550, Rewrite PowerPC loop_align test so it still tests the original target hook | expand |
Hi! On Wed, Jan 24, 2018 at 12:27:55AM -0500, Michael Meissner wrote: > > As Segher and I were discussing over private IRC, the root cause of this bug is > the compiler no long generates the BDNZ instruction for a count down loop, > instead it decrements the index in a GPR and does a branch/comparison on it. Yes, ivopts makes a bad decision (it uses stride 8 for all IVs, it should keep one with stride -1 for the loop counter, for optimal code; it also does three separate increments for the three memory accesses, which is a bit excessive here). > In doing so, it now unrolls the loop twice, and and the resulting loop is too > big for the target hook TARGET_ASM_LOOP_ALIGN_MAX_SKIP. This means the loop > isn't aligned to a 32 byte boundary. It's not really unrolling, it is bb-reorder copying an RTL block. However, even if you disable it you still get 9 insns on some configurations, so your patch does not hide the problem :-( Although, hrm, in your patch you also change "int i" to "long i"; that alone seems to be enough to fix everything? Could you check that please? Segher
On Wed, Jan 24, 2018 at 12:35:38PM -0600, Segher Boessenkool wrote: > Hi! > > On Wed, Jan 24, 2018 at 12:27:55AM -0500, Michael Meissner wrote: > > > > As Segher and I were discussing over private IRC, the root cause of this bug is > > the compiler no long generates the BDNZ instruction for a count down loop, > > instead it decrements the index in a GPR and does a branch/comparison on it. > > Yes, ivopts makes a bad decision (it uses stride 8 for all IVs, it should > keep one with stride -1 for the loop counter, for optimal code; it also > does three separate increments for the three memory accesses, which is > a bit excessive here). > > > In doing so, it now unrolls the loop twice, and and the resulting loop is too > > big for the target hook TARGET_ASM_LOOP_ALIGN_MAX_SKIP. This means the loop > > isn't aligned to a 32 byte boundary. > > It's not really unrolling, it is bb-reorder copying an RTL block. However, > even if you disable it you still get 9 insns on some configurations, so > your patch does not hide the problem :-( > > Although, hrm, in your patch you also change "int i" to "long i"; that > alone seems to be enough to fix everything? Could you check that please? Changing i and n to either 'long' or 'long unsigned' makes the test work. It is interesting that -mcpu=power7 -mbig does not seem to be able to create LFDU and STFDU, but either setting cpu to power8/power9 or setting -mbig to -mlittle or -m32 it can generate those instructions.
On Wed, Jan 24, 2018 at 12:35:38PM -0600, Segher Boessenkool wrote: > Hi! > > On Wed, Jan 24, 2018 at 12:27:55AM -0500, Michael Meissner wrote: > > > > As Segher and I were discussing over private IRC, the root cause of this bug is > > the compiler no long generates the BDNZ instruction for a count down loop, > > instead it decrements the index in a GPR and does a branch/comparison on it. > > Yes, ivopts makes a bad decision (it uses stride 8 for all IVs, it should > keep one with stride -1 for the loop counter, for optimal code; it also > does three separate increments for the three memory accesses, which is > a bit excessive here). > > > In doing so, it now unrolls the loop twice, and and the resulting loop is too > > big for the target hook TARGET_ASM_LOOP_ALIGN_MAX_SKIP. This means the loop > > isn't aligned to a 32 byte boundary. > > It's not really unrolling, it is bb-reorder copying an RTL block. However, > even if you disable it you still get 9 insns on some configurations, so > your patch does not hide the problem :-( > > Although, hrm, in your patch you also change "int i" to "long i"; that > alone seems to be enough to fix everything? Could you check that please? Replacing 'int' with 'unsigned long' allows the test to succeed once again. I have checked this on a big endian power8 (both 32-bit and 64-bit) and on a little endian power8 (64-bit only), and it passes in all three environments. Can I install this on the trunk? [gcc/testsuite] 2018-01-24 Michael Meissner <meissner@linux.vnet.ibm.com> PR target/81550 * gcc.target/powerpc/loop_align.c: Use unsigned long for the loop index instead of int, which allows IVOPTs to properly optimize the loop. Index: gcc/testsuite/gcc.target/powerpc/loop_align.c =================================================================== --- gcc/testsuite/gcc.target/powerpc/loop_align.c (revision 256992) +++ gcc/testsuite/gcc.target/powerpc/loop_align.c (working copy) @@ -4,8 +4,8 @@ /* { dg-options "-O2 -mcpu=power7 -falign-functions=16" } */ /* { dg-final { scan-assembler ".p2align 5,,31" } } */ -void f(double *a, double *b, double *c, int n) { - int i; +void f(double *a, double *b, double *c, unsigned long n) { + unsigned long i; for (i=0; i < n; i++) a[i] = b[i] + c[i]; }
On Wed, Jan 24, 2018 at 03:19:00PM -0500, Michael Meissner wrote: > On Wed, Jan 24, 2018 at 12:35:38PM -0600, Segher Boessenkool wrote: > > Although, hrm, in your patch you also change "int i" to "long i"; that > > alone seems to be enough to fix everything? Could you check that please? > > Changing i and n to either 'long' or 'long unsigned' makes the test work. > > It is interesting that -mcpu=power7 -mbig does not seem to be able to create > LFDU and STFDU, but either setting cpu to power8/power9 or setting -mbig to > -mlittle or -m32 it can generate those instructions. Yeah, dunno... I suspect we have some target costs a bit wrong, influencing the ivopts etc. decisions. The auto_inc_dec pass says (-mcpu=power7 -mabi=elfv2 -mbig): 23: r147:DF=[r126:DI] found mem(23) *(r[126]+0) trying SIMPLE_PRE_INC cost failure old=16 new=408 (where -mlittle thinks it is fine; does not say what costs, but the 408 for -mbig looks suspicious of course -- sounds like the call to rs6000_slow_unaligned_access in rs6000_rtx_costs misfired. Yet another reason to use insn_cost instead ;-) ) Segher
On Wed, Jan 24, 2018 at 05:00:39PM -0500, Michael Meissner wrote: > Replacing 'int' with 'unsigned long' allows the test to succeed once again. I > have checked this on a big endian power8 (both 32-bit and 64-bit) and on a > little endian power8 (64-bit only), and it passes in all three environments. > Can I install this on the trunk? Yes, this is. Please install on trunk. Thanks! Segher > [gcc/testsuite] > 2018-01-24 Michael Meissner <meissner@linux.vnet.ibm.com> > > PR target/81550 > * gcc.target/powerpc/loop_align.c: Use unsigned long for the loop > index instead of int, which allows IVOPTs to properly optimize the > loop.
While we've 'fixed' the loop_align.c test to work with the current compiler, I did some digging into the effects of subversion id 240482, which removed the old debug options -m{,no-}upper-regs{,-df,-di,-sf}. These options controlled whether the DFmode, DImode, and SFmode types could go into the Altivec registers. In subversion id 240481, if you did not use the -mno-upper-regs* switches, on power7 DFmode and DImode could go into the Altivec registers. On power8, SFmode can also go into the Altivec registers. In subversion id 240481 for power7, the flags in the reg_addr array say that DFmode does not support ++/-- operations, due to the Altivec registers not having load and store with update variants. Similarly for power8, the flags in the reg_addr for SFmode says that it does not support ++/-- operations. In subversion id 240482, I allowed ++/-- for DFmode in power7 and ++/-- for SFmode in power8. By allowing the ++/-- support for int indexes, it causes IV-OPTS to not properly optimize the loop. This patch restores the 240841 behavior of not allowing ++/-- for scalar floating point types. The loop looks like: .L3: lfdx 0,4,9 lfdx 12,5,9 fadd 0,0,12 stfdx 0,3,9 addi 9,9,8 bdnz .L3 Which is small enough to align the loop to 32 bytes. However, it does cause slight changes in running Spec 2006 on a little endian power8 system: 471.omnetpp: 1.3% faster 434.zeusmp: 1.9% slower 435.gromacs: 2.0% faster 482.sphinx3: 1.4% faster I did the bootstrap and make check tests, and they all pass. If I replace the loop_align.c test with the old version it passes also. Even with the zeusmp regression, three other benchmarks speed up slightly, and the simple loop test also is better. Can I check this into the trunk? [gcc] 2018-01-27 Michael Meissner <meissner@linux.vnet.ibm.com> PR target/81550 * config/rs6000/rs6000.c (rs6000_setup_reg_addr_masks): If floating point scalars are allowed in traditional Altivec registers, don't allow PRE_{INC,DEC,MODIFY} on those floating point types.
Hi! On Sat, Jan 27, 2018 at 11:16:00AM -0500, Michael Meissner wrote: > By allowing the ++/-- support for int indexes, it causes IV-OPTS to not > properly optimize the loop. > > This patch restores the 240841 behavior of not allowing ++/-- for scalar > floating point types. The loop looks like: > > .L3: > lfdx 0,4,9 > lfdx 12,5,9 > fadd 0,0,12 > stfdx 0,3,9 > addi 9,9,8 > bdnz .L3 > > Which is small enough to align the loop to 32 bytes. More importantly ;-), this is better than doing three increments in the loop (as the other case does). Which then as a bonus makes an IV choice that allows bdnz win. > However, it does cause slight changes in running Spec 2006 on a little endian > power8 system: > > 471.omnetpp: 1.3% faster > 434.zeusmp: 1.9% slower > 435.gromacs: 2.0% faster > 482.sphinx3: 1.4% faster That looks good. Would be nice to figure out what causes the zeusmp regression. Also, with or without this patch, it seems we give pretty wrong costs to ivopts. We should investigate that, too (for GCC 9, unless there is some obvious and simple thing we do wrong currently). > PR target/81550 > * config/rs6000/rs6000.c (rs6000_setup_reg_addr_masks): If > floating point scalars are allowed in traditional Altivec > registers, don't allow PRE_{INC,DEC,MODIFY} on those floating > point types. > Index: gcc/config/rs6000/rs6000.c > =================================================================== > --- gcc/config/rs6000/rs6000.c (revision 257024) > +++ gcc/config/rs6000/rs6000.c (working copy) > @@ -2990,6 +2990,8 @@ rs6000_setup_reg_addr_masks (void) > && !VECTOR_MODE_P (m2) > && !FLOAT128_VECTOR_P (m2) > && !complex_p > + && (m != E_DFmode || !TARGET_VSX) > + && (m != E_SFmode || !TARGET_P8_VECTOR) > && !small_int_vsx_p) > { > addr_mask |= RELOAD_REG_PRE_INCDEC; (Maybe add a comment?) This is okay for trunk. Thanks! Segher
On Mon, Jan 29, 2018 at 04:00:59PM -0600, Segher Boessenkool wrote: > (Maybe add a comment?) > > This is okay for trunk. Thanks! This is the patch I applied: 2018-01-29 Michael Meissner <meissner@linux.vnet.ibm.com> PR target/81550 * config/rs6000/rs6000.c (rs6000_setup_reg_addr_masks): If DFmode and SFmode can go in Altivec registers (-mcpu=power7 for DFmode, -mcpu=power8 for SFmode) don't set the PRE_INCDEC or PRE_MODIFY flags. This restores the settings used before the 2017-07-24. Turning off pre increment/decrement/modify allows IVOPTS to optimize DF/SF loops where the index is an int. Index: gcc/config/rs6000/rs6000.c =================================================================== --- gcc/config/rs6000/rs6000.c (revision 257165) +++ gcc/config/rs6000/rs6000.c (working copy) @@ -2982,7 +2982,15 @@ rs6000_setup_reg_addr_masks (void) /* Figure out if we can do PRE_INC, PRE_DEC, or PRE_MODIFY addressing. If we allow scalars into Altivec registers, - don't allow PRE_INC, PRE_DEC, or PRE_MODIFY. */ + don't allow PRE_INC, PRE_DEC, or PRE_MODIFY. + + For VSX systems, we don't allow update addressing for + DFmode/SFmode if those registers can go in both the + traditional floating point registers and Altivec registers. + The load/store instructions for the Altivec registers do not + have update forms. If we allowed update addressing, it seems + to break IV-OPT code using floating point if the index type is + int instead of long (PR target/81550 and target/84042). */ if (TARGET_UPDATE && (rc == RELOAD_REG_GPR || rc == RELOAD_REG_FPR) @@ -2990,6 +2998,8 @@ rs6000_setup_reg_addr_masks (void) && !VECTOR_MODE_P (m2) && !FLOAT128_VECTOR_P (m2) && !complex_p + && (m != E_DFmode || !TARGET_VSX) + && (m != E_SFmode || !TARGET_P8_VECTOR) && !small_int_vsx_p) { addr_mask |= RELOAD_REG_PRE_INCDEC;
Index: gcc/testsuite/gcc.target/powerpc/loop_align.c =================================================================== --- gcc/testsuite/gcc.target/powerpc/loop_align.c (revision 256992) +++ gcc/testsuite/gcc.target/powerpc/loop_align.c (working copy) @@ -1,11 +1,16 @@ /* { dg-do compile { target { powerpc*-*-* } } } */ /* { dg-skip-if "" { powerpc*-*-darwin* powerpc-ibm-aix* } } */ /* { dg-skip-if "do not override -mcpu" { powerpc*-*-* } { "-mcpu=*" } { "-mcpu=power7" } } */ -/* { dg-options "-O2 -mcpu=power7 -falign-functions=16" } */ +/* { dg-options "-O2 -mcpu=power7 -falign-functions=16 -fno-reorder-blocks" } */ /* { dg-final { scan-assembler ".p2align 5,,31" } } */ -void f(double *a, double *b, double *c, int n) { - int i; +/* This test must be crafted so that the loop is less than 8 insns (and more + than 4) in order for the TARGET_ASM_LOOP_ALIGN_MAX_SKIP target hook to fire + and align the loop properly. Originally the loop used double, and various + optimizations caused the loop to double in size, and fail the test. */ + +void f(long *a, long *b, long *c, long n) { + long i; for (i=0; i < n; i++) a[i] = b[i] + c[i]; }