Message ID | 20180117035543.GA1373@ibm-tiger.the-meissners.org |
---|---|
State | New |
Series | Fix PR target/pr83862: Fix PowerPC long double signbit with -mabi=ieeelongdouble |
On Tue, Jan 16, 2018 at 10:55:43PM -0500, Michael Meissner wrote:
> PR target/83862 pointed out a problem I put into the 128-bit floating point
> type signbit optimization.  The issue is we want to avoid doing a load to a
> floating point/vector register and then a direct move to do signbit, so we
> change the load to load the upper 64-bits of the floating point value to get
> the sign bit.  Unfortunately, if the type is IEEE 128-bit and memory is
> addressed with an indexed address on a little endian system, it generates an
> illegal address and generates an internal compiler error.

So all this is caused by these splitters running after reload.  Why do
we have to do that?  Do we?  We should be able to just change it to a
subreg and shift, early already?


Segher

On Wed, Jan 17, 2018 at 04:09:57PM -0600, Segher Boessenkool wrote:
> On Tue, Jan 16, 2018 at 10:55:43PM -0500, Michael Meissner wrote:
> > PR target/83862 pointed out a problem I put into the 128-bit floating point
> > type signbit optimization.  The issue is we want to avoid doing a load to a
> > floating point/vector register and then a direct move to do signbit, so we
> > change the load to load the upper 64-bits of the floating point value to get
> > the sign bit.  Unfortunately, if the type is IEEE 128-bit and memory is
> > addressed with an indexed address on a little endian system, it generates an
> > illegal address and generates an internal compiler error.
>
> So all this is caused by these splitters running after reload.  Why do
> we have to do that?  Do we?  We should be able to just change it to a
> subreg and shift, early already?

The part that is failing is trying to optimize the case:

	x = signbit (*p)

Doing the code with just registers means that you will get:

	vr = load *p
	gr = direct move from vr
	shift

With the optimization you get:

	gr = 'upper' word
	shift

If the address is:

	base + index

In little endian, the compiler tried to generate:

	(base + index) + 8

And it didn't have a temporary register to put base+index.

It only shows up on a little endian system on a type that does REG+REG
addressing.  IBM extended double does not allow REG+REG addressing, so it
wasn't an issue for that.  But with GLIBC starting to test
-mabi=ieeelongdouble, it showed up.

I believe when I worked on it, we were mostly big endian, and the IEEE stuff
wasn't yet completely solid, so it was a thinko on my part.

I did the memory optimization to a GPR and avoid the direct move because
signbit is used frequently in the GLIBC math library (and direct move on
power8 is on the slow side).

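A minimal C reproducer for the failing case Mike describes, assuming a little
endian 64-bit target (the function name is illustrative; the actual test case
appears with the patch later in the thread):

int
sb_indexed (_Float128 *p, unsigned long n)
{
  /* p[n] is addressed as base + index (reg+reg); on little endian the high
     doubleword holding the sign bit is at offset 8, so the post-reload
     splitter tried to form the invalid address (base + index) + 8.  */
  return __builtin_signbit (p[n]);
}
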
On Wed, Jan 17, 2018 at 06:40:12PM -0500, Michael Meissner wrote:
> On Wed, Jan 17, 2018 at 04:09:57PM -0600, Segher Boessenkool wrote:
> > On Tue, Jan 16, 2018 at 10:55:43PM -0500, Michael Meissner wrote:
> > > PR target/83862 pointed out a problem I put into the 128-bit floating point
> > > type signbit optimization.  The issue is we want to avoid doing a load to a
> > > floating point/vector register and then a direct move to do signbit, so we
> > > change the load to load the upper 64-bits of the floating point value to get
> > > the sign bit.  Unfortunately, if the type is IEEE 128-bit and memory is
> > > addressed with an indexed address on a little endian system, it generates an
> > > illegal address and generates an internal compiler error.
> >
> > So all this is caused by these splitters running after reload.  Why do
> > we have to do that?  Do we?  We should be able to just change it to a
> > subreg and shift, early already?
>
> The part that is failing is trying to optimize the case:
>
> 	x = signbit (*p)
>
> Doing the code with just registers means that you will get:
>
> 	vr = load *p
> 	gr = direct move from vr
> 	shift
>
> With the optimization you get:
>
> 	gr = 'upper' word
> 	shift
>
> If the address is:
>
> 	base + index
>
> In little endian, the compiler tried to generate:
>
> 	(base + index) + 8
>
> And it didn't have a temporary register to put base+index.

Yes, but you won't have this problem if you do this before RA.  Is there
any good reason to do it only after RA?

If you do it before RA you immediately get

	some_di = load upper from *p
	shift

just as you want (and things might even be optimised further).


Segher

On Thu, Jan 18, 2018 at 12:39:05PM -0600, Segher Boessenkool wrote:
> On Wed, Jan 17, 2018 at 06:40:12PM -0500, Michael Meissner wrote:
> > On Wed, Jan 17, 2018 at 04:09:57PM -0600, Segher Boessenkool wrote:
> > > On Tue, Jan 16, 2018 at 10:55:43PM -0500, Michael Meissner wrote:
> > > > PR target/83862 pointed out a problem I put into the 128-bit floating point
> > > > type signbit optimization.  The issue is we want to avoid doing a load to a
> > > > floating point/vector register and then a direct move to do signbit, so we
> > > > change the load to load the upper 64-bits of the floating point value to get
> > > > the sign bit.  Unfortunately, if the type is IEEE 128-bit and memory is
> > > > addressed with an indexed address on a little endian system, it generates an
> > > > illegal address and generates an internal compiler error.
> > >
> > > So all this is caused by these splitters running after reload.  Why do
> > > we have to do that?  Do we?  We should be able to just change it to a
> > > subreg and shift, early already?
> >
> > The part that is failing is trying to optimize the case:
> >
> > 	x = signbit (*p)
> >
> > Doing the code with just registers means that you will get:
> >
> > 	vr = load *p
> > 	gr = direct move from vr
> > 	shift
> >
> > With the optimization you get:
> >
> > 	gr = 'upper' word
> > 	shift
> >
> > If the address is:
> >
> > 	base + index
> >
> > In little endian, the compiler tried to generate:
> >
> > 	(base + index) + 8
> >
> > And it didn't have a temporary register to put base+index.
>
> Yes, but you won't have this problem if you do this before RA.  Is there
> any good reason to do it only after RA?

Yes you will, because it is a memory address not a register.

> If you do it before RA you immediately get
>
> 	some_di = load upper from *p
> 	shift
>
> just as you want (and things might even be optimised further).

And in my experience, splitting such loads and changing the type before RA
often leads to errors.

I've reworked the patch for PR83862.  This bug is due to the compiler issuing
an internal error for code of the form on a little endian system:

	int sb (_Float128 *p, unsigned long n) { return __builtin_signbit (p[n]); }

The problem is that the memory optimization wants to load the high double word
from memory to isolate the signbit.  In little endian, the high word is at
offset 8 instead of 0.

The bug will also show up if you use the long double type, and you use the
-mabi=ieeelongdouble option.

Previously the code did one UNSPEC (signbit<mode>2_dm) that was done before
register allocation, and combined the value being in vector register, memory,
and GPR register along with the shift to isolate the signbit.  Then after the
split after register allocation, it created a second UNSPEC
(signbit<mode>2_dm2) that was just the direct move, or it did the memory and
GPR code path separately.

With these patches, the generator code issues an UNSPEC (signbit<mode>2_dm)
just to get the high double word, and the shift right is then emitted.  The
UNSPEC only handles the value being in either vector or GPR register.  There is
a second UNSPEC that is created by the combiner if the value is in memory.  On
little endian systems, the first split pass (before register allocation) will
allocate a pseudo register to hold the initial ADD of the base and index
registers for indexed loads, and then forms a LD reg,8(tmp) to load the high
word.  Just in case some code after register allocation reforms the UNSPEC, it
uses a base register for the load, and it can use the base register as needed
for the temporary.

I have run this code on both little endian and big endian power8 systems, doing
bootstrap and regression test.  There were no regressions.  Can I install this
patch on the trunk?

I don't believe the bug shows up in gcc 6/7 branches, but I will build these
and test to see if the code needs to be backported.

[gcc]
2018-01-21  Michael Meissner  <meissner@linux.vnet.ibm.com>

	PR target/83862
	* config/rs6000/rs6000-protos.h (rs6000_split_signbit): Delete,
	no longer used.
	* config/rs6000/rs6000.c (rs6000_split_signbit): Likewise.
	* config/rs6000/rs6000.md (signbit<mode>2): Change code for IEEE
	128-bit to produce an UNSPEC move to get the double word with the
	signbit and then a shift directly to do signbit.
	(signbit<mode>2_dm): Replace old IEEE 128-bit signbit
	implementation with a new version that just does either a direct
	move or a regular move.  Move memory interface to separate insns.
	Move insns so they are next to the expander.
	(signbit<mode>2_dm_mem_be): New combiner insns to combine load
	with signbit move.  Split big and little endian case.
	(signbit<mode>2_dm_mem_le): Likewise.
	(signbit<mode>2_dm_<su>ext): Delete, no longer used.
	(signbit<mode>2_dm2): Likewise.

[gcc/testsuite]
2018-01-22  Michael Meissner  <meissner@linux.vnet.ibm.com>

	PR target/83862
	* gcc.target/powerpc/pr83862.c: New test.

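As a rough sketch only (not the RTL the patch emits), the little endian
sequence described above amounts to computing the base + index sum into a
temporary first, then loading the high doubleword at offset 8 and shifting it
right by 63:

#include <stdint.h>
#include <string.h>

/* Sketch, assuming a little endian 64-bit target with _Float128 support.
   The patch does this in RTL via an UNSPEC plus LD reg,8(tmp); this C
   version only illustrates the address math and the shift.  */
static int
signbit_f128_sketch (const _Float128 *p, unsigned long n)
{
  const char *tmp = (const char *) p + n * sizeof (_Float128); /* base + index */
  uint64_t high;

  memcpy (&high, tmp + 8, sizeof high);	/* high doubleword at offset 8 (LE) */
  return (int) (high >> 63);		/* the shift isolates the sign bit */
}
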
Hi Mike,

Thanks for doing this!

On Sun, Jan 21, 2018 at 07:03:58PM -0500, Michael Meissner wrote:
> I've reworked the patch for PR83862.  This bug is due to the compiler issuing
> an internal error for code of the form on a little endian system:
>
> 	int sb (_Float128 *p, unsigned long n) { return __builtin_signbit (p[n]); }
>
> The problem is that the memory optimization wants to load the high double word
> from memory to isolate the signbit.  In little endian, the high word is at
> offset 8 instead of 0.
>
> The bug will also show up if you use the long double type, and you use the
> -mabi=ieeelongdouble option.
>
> Previously the code did one UNSPEC (signbit<mode>2_dm) that was done before
> register allocation, and combined the value being in vector register, memory,
> and GPR register along with the shift to isolate the signbit.  Then after the
> split after register allocation, it created a second UNSPEC
> (signbit<mode>2_dm2) that was just the direct move, or it did the memory and
> GPR code path separately.
>
> With these patches, the generator code issues an UNSPEC (signbit<mode>2_dm)
> just to get the high double word, and the shift right is then emitted.  The
> UNSPEC only handles the value being in either vector or GPR register.  There is
> a second UNSPEC that is created by the combiner if the value is in memory.  On
> little endian systems, the first split pass (before register allocation) will
> allocate a pseudo register to hold the initial ADD of the base and index
> registers for indexed loads, and then forms a LD reg,8(tmp) to load the high
> word.  Just in case some code after register allocation reforms the UNSPEC, it
> uses a base register for the load, and it can use the base register as needed
> for the temporary.

But does it have to do an unspec at all?  Can't it just immediately take
the relevant 64-bit half (during expand), no unspec in sight, and that
is that?

> I have run this code on both little endian and big endian power8 systems, doing
> bootstrap and regression test.  There were no regressions.  Can I install this
> patch on the trunk?
>
> I don't believe the bug shows up in gcc 6/7 branches, but I will build these
> and test to see if the code needs to be backported.

If it is needed on the branches, okay for there too (once 7 reopens).

> [gcc]
> 2018-01-21  Michael Meissner  <meissner@linux.vnet.ibm.com>
>
> 	PR target/83862
> 	* config/rs6000/rs6000-protos.h (rs6000_split_signbit): Delete,
> 	no longer used.
> 	* config/rs6000/rs6000.c (rs6000_split_signbit): Likewise.
> 	* config/rs6000/rs6000.md (signbit<mode>2): Change code for IEEE
> 	128-bit to produce an UNSPEC move to get the double word with the
> 	signbit and then a shift directly to do signbit.
> 	(signbit<mode>2_dm): Replace old IEEE 128-bit signbit
> 	implementation with a new version that just does either a direct
> 	move or a regular move.  Move memory interface to separate insns.
> 	Move insns so they are next to the expander.
> 	(signbit<mode>2_dm_mem_be): New combiner insns to combine load
> 	with signbit move.  Split big and little endian case.
> 	(signbit<mode>2_dm_mem_le): Likewise.
> 	(signbit<mode>2_dm_<su>ext): Delete, no longer used.
> 	(signbit<mode>2_dm2): Likewise.
>
> [gcc/testsuite]
> 2018-01-22  Michael Meissner  <meissner@linux.vnet.ibm.com>
>
> 	PR target/83862
> 	* gcc.target/powerpc/pr83862.c: New test.

The patch is okay for trunk, but I still think this can be a lot simpler.

> +;; We can't use SUBREG:DI of the IEEE 128-bit value before register
> +;; allocation, because KF/TFmode aren't tieable with DImode.  Also, having the
> +;; 64-bit scalar part in the high end of the register instead of the low end
> +;; also complicates things.  So instead, we use an UNSPEC to do the move of the
> +;; high double word to isolate the signbit.

I'll think about this.

Either way, thank you for your patience!  And the patch :-)


Segher

On Mon, Jan 22, 2018 at 12:32:13PM -0600, Segher Boessenkool wrote:
> Hi Mike,
>
> Thanks for doing this!
>
> On Sun, Jan 21, 2018 at 07:03:58PM -0500, Michael Meissner wrote:
> > With these patches, the generator code issues an UNSPEC (signbit<mode>2_dm)
> > just to get the high double word, and the shift right is then emitted.  The
> > UNSPEC only handles the value being in either vector or GPR register.  There is
> > a second UNSPEC that is created by the combiner if the value is in memory.  On
> > little endian systems, the first split pass (before register allocation) will
> > allocate a pseudo register to hold the initial ADD of the base and index
> > registers for indexed loads, and then forms a LD reg,8(tmp) to load the high
> > word.  Just in case some code after register allocation reforms the UNSPEC, it
> > uses a base register for the load, and it can use the base register as needed
> > for the temporary.
>
> But does it have to do an unspec at all?  Can't it just immediately take
> the relevant 64-bit half (during expand), no unspec in sight, and that
> is that?

Yes it needs the UNSPEC.  As I said, you can't use SUBREG due to a combination
of MODES_TIEABLE_P and the way things are represented in vector registers
(with the scalar part being in the upper part of the register).  I tried it,
and could not get it to work.

On Mon, Jan 22, 2018 at 12:32:13PM -0600, Segher Boessenkool wrote:
> Hi Mike,
>
> If it is needed on the branches, okay for there too (once 7 reopens).

It is needed on GCC 7 if you compile the test for -mcpu=power9 and use an
explicit __float128 instead of long double (-mabi=ieeelongdouble is not really
supported in GCC 7).  I'll check GCC 6 as well.

Index: gcc/config/rs6000/rs6000.c
===================================================================
--- gcc/config/rs6000/rs6000.c	(revision 256753)
+++ gcc/config/rs6000/rs6000.c	(working copy)
@@ -23341,9 +23341,27 @@ rs6000_split_signbit (rtx dest, rtx src)
 
   if (MEM_P (src))
     {
-      rtx mem = (WORDS_BIG_ENDIAN
-		 ? adjust_address (src, DImode, 0)
-		 : adjust_address (src, DImode, 8));
+      rtx addr = XEXP (src, 0);
+      rtx mem;
+
+      if (!WORDS_BIG_ENDIAN)
+	{
+	  /* Do not create an illegal address for indexed addressing when we
+	     add in the 8 to address the second word where the sign bit is.
+	     Instead use the destination register as a base register.  */
+	  if (GET_CODE (addr) == PLUS
+	      && !rs6000_legitimate_offset_address_p (DImode, addr, true, true))
+	    {
+	      emit_insn (gen_rtx_SET (dest, addr));
+	      mem = change_address (src, DImode,
+				    gen_rtx_PLUS (Pmode, dest, GEN_INT (8)));
+	    }
+	  else
+	    mem = adjust_address (src, DImode, 8);
+	}
+      else
+	mem = adjust_address (src, DImode, 0);
+
       emit_insn (gen_rtx_SET (dest_di, mem));
     }
 
Index: gcc/config/rs6000/rs6000.md
===================================================================
--- gcc/config/rs6000/rs6000.md	(revision 256753)
+++ gcc/config/rs6000/rs6000.md	(working copy)
@@ -4835,9 +4835,11 @@ (define_expand "copysign<mode>3"
 })
 
 ;; Optimize signbit on 64-bit systems with direct move to avoid doing the store
-;; and load.
+;; and load.  We restrict signbit coming from a load to use a base register for
+;; the destination, in case we need to use the base register as a temporary
+;; address register.
 (define_insn_and_split "signbit<mode>2_dm"
-  [(set (match_operand:SI 0 "gpc_reg_operand" "=r,r,r")
+  [(set (match_operand:SI 0 "gpc_reg_operand" "=r,&b,r")
 	(unspec:SI
 	 [(match_operand:SIGNBIT 1 "input_operand" "wa,m,r")]
 	 UNSPEC_SIGNBIT))]
@@ -4853,7 +4855,7 @@ (define_insn_and_split "signbit<mode>2_d
   (set_attr "type" "mftgpr,load,integer")])
 
 (define_insn_and_split "*signbit<mode>2_dm_<su>ext"
-  [(set (match_operand:DI 0 "gpc_reg_operand" "=r,r,r")
+  [(set (match_operand:DI 0 "gpc_reg_operand" "=r,&b,r")
 	(any_extend:DI
 	 (unspec:SI
 	  [(match_operand:SIGNBIT 1 "input_operand" "wa,m,r")]
Index: gcc/testsuite/gcc.target/powerpc/pr83862.c
===================================================================
--- gcc/testsuite/gcc.target/powerpc/pr83862.c	(nonexistent)
+++ gcc/testsuite/gcc.target/powerpc/pr83862.c	(working copy)
@@ -0,0 +1,21 @@
+/* PR target/83862.c */
+/* { dg-do compile { target { powerpc*-*-* && lp64 } } } */
+/* { dg-require-effective-target ppc_float128_sw } */
+/* { dg-require-effective-target powerpc_p8vector_ok } */
+/* { dg-options "-mpower8-vector -O2 -mabi=ieeelongdouble -Wno-psabi" } */
+
+/* On little endian systems, optimizing signbit of IEEE 128-bit values from
+   memory could abort if the memory address was indexed (reg+reg).  The
+   optimization is only on 64-bit machines with direct move.
+
+   Compile with -g -O2 -mabi=ieeelongdouble -Wno-psabi.  */
+
+#ifndef TYPE
+#define TYPE long double
+#endif
+
+int sbr (TYPE a) { return __builtin_signbit (a); }
+int sbm (TYPE *a) { return __builtin_signbit (*a); }
+int sbo (TYPE *a) { return __builtin_signbit (a[4]); }
+int sbi (TYPE *a, unsigned long n) { return __builtin_signbit (a[n]); }
+void sbs (int *p, TYPE a) { *p = __builtin_signbit (a); }