Message ID | 1505321438.2286.7.camel@cavium.com |
---|---|
State | New |
Headers | show |
Series | [aarch64] Fix target/pr77729 - missed optimization related to zero extension | expand |
Hi Steve, On 13/09/17 17:50, Steve Ellcey wrote: > This is a patch for PR target/77729 on aarch64. The code is doing an > unneeded zero extend ('uxtb' in the original report, 'and' in the ToT > sources). > > The patch looks a bit odd, it is a specialized define_insn for the combine > pass. At some point in combine (I never did find out where), the > zero_extend > is converted to an AND so my instruction looks for an OR of a constant > and an AND expression where one operand of the AND is a subreg and the > other > is a constant. If the two constants add up to 255 that means that the AND > is being used to mask out the upper bits of the register while not messing > up the constant we are using in the OR expression. > > I also had to recognize this in the aarch64 cost function or combine would > not use the new expression even when it recognized it as it thought it > cost > more than the original uncombined expressions. > > Tested on aarch64 with a bootstrap and testsuite run that had no > regressions. > > OK to checkin? > > > 2017-09-13 Steve Ellcey <sellcey@cavium.com> > > PR target/77729 > * config/aarch64/aarch64.c (aarch64_rtx_costs): > Handle cost of *iorqi3_uxtw instruction. > * config/aarch64/aarch64.md (*iorqi3_uxtw): New > instruction for combine phase. > > > 2017-09-13 Steve Ellcey <sellcey@cavium.com> > > * gcc.target/aarch64/pr77729.c: New test. +;; Specialized OR instruction for combiner. The AND is masking out bits +;; not needed in the OR (doing a zero_extend). The zero_extend is not +;; needed because we know from the subreg that the upper part of the reg +;; is zero. +(define_insn "*iorqi3_uxtw" + [(set (match_operand:SI 0 "register_operand" "=r") + (ior:SI (and:SI + (subreg:SI (match_operand:QI 1 "register_operand" "r") 0) + (match_operand:SI 2 "const_int_operand" "n")) + (match_operand:SI 3 "aarch64_logical_operand" "K")))] + "INTVAL (operands[2]) + INTVAL (operands[3]) == 255" + "orr\\t%w0, %w1, %3" + [(set_attr "type" "logic_imm")] +) We are usually hesitant to add explicit subreg matching in the MD pattern (though I don't remember if there's a hard rule against it). In this case this looks like a missing simplification from combine (simplify-rtx) so I think adding it there would be better. Also, in this pattern operands[3] has the aarch64_logical_operand predicate which allows registers but in the final instruction check below you unconditionally take its INTVAL which will throw an error if the compiler is configured with RTL checking enabled. Thanks, Kyrill
Hi! On Wed, Sep 13, 2017 at 06:13:50PM +0100, Kyrill Tkachov wrote: > +;; Specialized OR instruction for combiner. The AND is masking out bits > +;; not needed in the OR (doing a zero_extend). The zero_extend is not > +;; needed because we know from the subreg that the upper part of the reg > +;; is zero. > +(define_insn "*iorqi3_uxtw" > + [(set (match_operand:SI 0 "register_operand" "=r") > + (ior:SI (and:SI > + (subreg:SI (match_operand:QI 1 "register_operand" "r") 0) > + (match_operand:SI 2 "const_int_operand" "n")) > + (match_operand:SI 3 "aarch64_logical_operand" "K")))] > + "INTVAL (operands[2]) + INTVAL (operands[3]) == 255" > + "orr\\t%w0, %w1, %3" > + [(set_attr "type" "logic_imm")] > +) > > We are usually hesitant to add explicit subreg matching in the MD pattern > (though I don't remember if there's a hard rule against it). > In this case this looks like a missing simplification from combine > (simplify-rtx) so > I think adding it there would be better. Yes, it probably belongs as a generic simplification in simplify-rtx.c; if there is a reason not to do that, it can be done in combine.c instead. Segher
On Wed, 2017-09-13 at 14:46 -0500, Segher Boessenkool wrote: > On Wed, Sep 13, 2017 at 06:13:50PM +0100, Kyrill Tkachov wrote: > > > > We are usually hesitant to add explicit subreg matching in the MD pattern > > (though I don't remember if there's a hard rule against it). > > In this case this looks like a missing simplification from combine > > (simplify-rtx) so > > I think adding it there would be better. > Yes, it probably belongs as a generic simplification in simplify-rtx.c; > if there is a reason not to do that, it can be done in combine.c > instead. Actually, now that I look at it some more and compare it to the arm32 version (where we do not have this problem) I think the problem starts well before combine. In arm32 rtl expansion, when reading the QI memory location, I see these instructions get generated: (insn 10 3 11 2 (set (reg:SI 119) (zero_extend:SI (mem:QI (reg/v/f:SI 117 [ string ]) [0 *string_9(D)+0 S1 A8]))) "pr77729.c":4 -1 (nil)) (insn 11 10 12 2 (set (reg:QI 118) (subreg:QI (reg:SI 119) 0)) "pr77729.c":4 -1 (nil)) And in aarch64 rtl expansion I see: (insn 10 9 11 (set (reg:QI 81) (mem:QI (reg/v/f:DI 80 [ string ]) [0 *string_9(D)+0 S1 A8])) "pr77729.c":3 -1 (nil)) Both of these sequences expand to ldrb but in the arm32 case I know that I set all 32 bits of the register (even though I only want the bottom 8 bits), but for aarch64 I only know that I set the bottom 8 bits and I don't know anything about the higher bits, meaning I have to keep the AND instruction to mask out the upper bits on aarch64. I think we should change the movqi/movhi expansions on aarch64 to recognize that the ldrb/ldrh instructions zero out the upper bits in the register by generating rtl like arm32 does. Steve Ellcey sellcey@cavium.com
On 09/13/2017 03:46 PM, Steve Ellcey wrote: > On Wed, 2017-09-13 at 14:46 -0500, Segher Boessenkool wrote: >> On Wed, Sep 13, 2017 at 06:13:50PM +0100, Kyrill Tkachov wrote: >>> >>> We are usually hesitant to add explicit subreg matching in the MD pattern >>> (though I don't remember if there's a hard rule against it). >>> In this case this looks like a missing simplification from combine >>> (simplify-rtx) so >>> I think adding it there would be better. > >> Yes, it probably belongs as a generic simplification in simplify-rtx.c; >> if there is a reason not to do that, it can be done in combine.c >> instead. > > Actually, now that I look at it some more and compare it to the arm32 > version (where we do not have this problem) I think the problem starts > well before combine. > > In arm32 rtl expansion, when reading the QI memory location, I see > these instructions get generated: > > (insn 10 3 11 2 (set (reg:SI 119) > (zero_extend:SI (mem:QI (reg/v/f:SI 117 [ string ]) [0 *string_9(D)+0 S1 A8]))) "pr77729.c":4 -1 > (nil)) > (insn 11 10 12 2 (set (reg:QI 118) > (subreg:QI (reg:SI 119) 0)) "pr77729.c":4 -1 > (nil)) > > And in aarch64 rtl expansion I see: > > (insn 10 9 11 (set (reg:QI 81) > (mem:QI (reg/v/f:DI 80 [ string ]) [0 *string_9(D)+0 S1 A8])) "pr77729.c":3 -1 > (nil)) > > Both of these sequences expand to ldrb but in the arm32 case I know > that I set all 32 bits of the register (even though I only want the > bottom 8 bits), but for aarch64 I only know that I set the bottom 8 > bits and I don't know anything about the higher bits, meaning I have to > keep the AND instruction to mask out the upper bits on aarch64. It's one of the reasons I discourage subregs -- the number of cases where we can optimize based on the "don't care" semantics are relatively small in my experience and I consistently see cases where the "don't care" property of the subreg turns into "don't know" and suppresses downstream optimizations. It's always a judgment call, but more and more often I find myself pushing towards defining those bits using a zero/sign extension, bit operation or whatever rather than using subregs. > > I think we should change the movqi/movhi expansions on aarch64 to > recognize that the ldrb/ldrh instructions zero out the upper bits in > the register by generating rtl like arm32 does. Is LOAD_EXTEND_OP defined for aarch64? It may also be worth looking at ree.c -- my recollection is that it didn't handle subregs, but it could and probably should. jeff
On Thu, 2017-09-14 at 09:03 -0600, Jeff Law wrote: > On 09/13/2017 03:46 PM, Steve Ellcey wrote: > > > > In arm32 rtl expansion, when reading the QI memory location, I see > > these instructions get generated: > > > > (insn 10 3 11 2 (set (reg:SI 119) > > (zero_extend:SI (mem:QI (reg/v/f:SI 117 [ string ]) [0 > > *string_9(D)+0 S1 A8]))) "pr77729.c":4 -1 > > (nil)) > > (insn 11 10 12 2 (set (reg:QI 118) > > (subreg:QI (reg:SI 119) 0)) "pr77729.c":4 -1 > > (nil)) > > > > And in aarch64 rtl expansion I see: > > > > (insn 10 9 11 (set (reg:QI 81) > > (mem:QI (reg/v/f:DI 80 [ string ]) [0 *string_9(D)+0 S1 > > A8])) "pr77729.c":3 -1 > > (nil)) > > > > Both of these sequences expand to ldrb but in the arm32 case I know > > that I set all 32 bits of the register (even though I only want the > > bottom 8 bits), but for aarch64 I only know that I set the bottom 8 > > bits and I don't know anything about the higher bits, meaning I have to > > keep the AND instruction to mask out the upper bits on aarch64. > It's one of the reasons I discourage subregs -- the number of cases > where we can optimize based on the "don't care" semantics are relatively > small in my experience and I consistently see cases where the "don't > care" property of the subreg turns into "don't know" and suppresses > downstream optimizations. > > It's always a judgment call, but more and more often I find myself > pushing towards defining those bits using a zero/sign extension, bit > operation or whatever rather than using subregs. So if I were loading a QImode to a register (zeroing out the upper bits) would you generate something like: (truncate:QI (zero_extend:SI (reg:QI))) instead of: (subreg:QI (reg:SI)) > > I think we should change the movqi/movhi expansions on aarch64 to > > recognize that the ldrb/ldrh instructions zero out the upper bits in > > the register by generating rtl like arm32 does. > Is LOAD_EXTEND_OP defined for aarch64? Yes, aarch64 defines LOAD_EXTEND_OP to be ZERO_EXTEND. > It may also be worth looking at ree.c -- my recollection is that it > didn't handle subregs, but it could and probably should. I only see a couple of references to subregs in ree.c. I think they both involve searching for all uses of a register. Steve Ellcey
On 09/14/2017 10:33 AM, Steve Ellcey wrote: > On Thu, 2017-09-14 at 09:03 -0600, Jeff Law wrote: >> On 09/13/2017 03:46 PM, Steve Ellcey wrote: >>> >>> In arm32 rtl expansion, when reading the QI memory location, I see >>> these instructions get generated: >>> >>> (insn 10 3 11 2 (set (reg:SI 119) >>> (zero_extend:SI (mem:QI (reg/v/f:SI 117 [ string ]) [0 >>> *string_9(D)+0 S1 A8]))) "pr77729.c":4 -1 >>> (nil)) >>> (insn 11 10 12 2 (set (reg:QI 118) >>> (subreg:QI (reg:SI 119) 0)) "pr77729.c":4 -1 >>> (nil)) >>> >>> And in aarch64 rtl expansion I see: >>> >>> (insn 10 9 11 (set (reg:QI 81) >>> (mem:QI (reg/v/f:DI 80 [ string ]) [0 *string_9(D)+0 S1 >>> A8])) "pr77729.c":3 -1 >>> (nil)) >>> >>> Both of these sequences expand to ldrb but in the arm32 case I know >>> that I set all 32 bits of the register (even though I only want the >>> bottom 8 bits), but for aarch64 I only know that I set the bottom 8 >>> bits and I don't know anything about the higher bits, meaning I have to >>> keep the AND instruction to mask out the upper bits on aarch64. > >> It's one of the reasons I discourage subregs -- the number of cases >> where we can optimize based on the "don't care" semantics are relatively >> small in my experience and I consistently see cases where the "don't >> care" property of the subreg turns into "don't know" and suppresses >> downstream optimizations. >> >> It's always a judgment call, but more and more often I find myself >> pushing towards defining those bits using a zero/sign extension, bit >> operation or whatever rather than using subregs. > > So if I were loading a QImode to a register (zeroing out the upper > bits) would you generate something like: > > (truncate:QI (zero_extend:SI (reg:QI))) On a LOAD_EXTEND_OP target which zero extends and WORD_REGISTER_OPERATIONS as 1 I'd load memory with just (set (reg:QI) (mem:QI (whatever)) If that object is later used elsewhere it probably will be explicitly sign/zero extended. But combine ought to be able to eliminate that explicit extension. And I think that's starting to zero in on the problem -- WORD_REGISTER_OPERATIONS is zero on aarch64 as you don't get extension to word_mode for W form registers. I wonder if what needs to happen is somehow look to extend that code somehow so that combine and friends know that the value is zero extended to 32 bits, even if it's not extended to word_mode. Jeff
On 09/14/2017 10:33 AM, Steve Ellcey wrote: > On Thu, 2017-09-14 at 09:03 -0600, Jeff Law wrote: >> On 09/13/2017 03:46 PM, Steve Ellcey wrote: >>> >>> In arm32 rtl expansion, when reading the QI memory location, I see >>> these instructions get generated: >>> >>> (insn 10 3 11 2 (set (reg:SI 119) >>> (zero_extend:SI (mem:QI (reg/v/f:SI 117 [ string ]) [0 >>> *string_9(D)+0 S1 A8]))) "pr77729.c":4 -1 >>> (nil)) >>> (insn 11 10 12 2 (set (reg:QI 118) >>> (subreg:QI (reg:SI 119) 0)) "pr77729.c":4 -1 >>> (nil)) >>> >>> And in aarch64 rtl expansion I see: >>> >>> (insn 10 9 11 (set (reg:QI 81) >>> (mem:QI (reg/v/f:DI 80 [ string ]) [0 *string_9(D)+0 S1 >>> A8])) "pr77729.c":3 -1 >>> (nil)) >>> >>> Both of these sequences expand to ldrb but in the arm32 case I know >>> that I set all 32 bits of the register (even though I only want the >>> bottom 8 bits), but for aarch64 I only know that I set the bottom 8 >>> bits and I don't know anything about the higher bits, meaning I have to >>> keep the AND instruction to mask out the upper bits on aarch64. > >> It's one of the reasons I discourage subregs -- the number of cases >> where we can optimize based on the "don't care" semantics are relatively >> small in my experience and I consistently see cases where the "don't >> care" property of the subreg turns into "don't know" and suppresses >> downstream optimizations. >> >> It's always a judgment call, but more and more often I find myself >> pushing towards defining those bits using a zero/sign extension, bit >> operation or whatever rather than using subregs. > > So if I were loading a QImode to a register (zeroing out the upper > bits) would you generate something like: > > (truncate:QI (zero_extend:SI (reg:QI))) > > instead of: > > (subreg:QI (reg:SI)) > >>> I think we should change the movqi/movhi expansions on aarch64 to >>> recognize that the ldrb/ldrh instructions zero out the upper bits in >>> the register by generating rtl like arm32 does. > >> Is LOAD_EXTEND_OP defined for aarch64? > > Yes, aarch64 defines LOAD_EXTEND_OP to be ZERO_EXTEND. > >> It may also be worth looking at ree.c -- my recollection is that it >> didn't handle subregs, but it could and probably should. > > I only see a couple of references to subregs in ree.c. I think they > both involve searching for all uses of a register. Right. But the subreg expressions are also forms of extension -- we just don't know (or care) if it's zero or sign extension. We'd start by recognizing the paradoxical subreg in add_removable_extension as a form of an extension similar to zero_extend and sign_extend. When we go to combine the "extension" into the defining insn, we would test 3 forms (set (target) (zero_extend (exp))) (set (target) (sign_extend (exp))) (set (target) (subreg (exp))) If any form matches an insn on the target, then we're done. This may require adding some new patterns to aarch64 -- I believe we've got patterns on x86 to match some of these forms to aid redundant extension eliminmation. It might also be helpful to teach ree about LOAD_EXTEND_OP which would allow combining one of the extension forms with a memory reference. Jeff
On Thu, 2017-09-14 at 11:53 -0600, Jeff Law wrote: > > > And I think that's starting to zero in on the problem -- > WORD_REGISTER_OPERATIONS is zero on aarch64 as you don't get extension > to word_mode for W form registers. > > I wonder if what needs to happen is somehow look to extend that code > somehow so that combine and friends know that the value is zero extended > to 32 bits, even if it's not extended to word_mode. > > Jeff This might be a good long term direction to move but in the mean time it sure does seem a lot easier to just generate a subreg. Here is a patch that does that, it passes bootstrap and has no regressions and fixes the bug in question (and most likely improves other code as well). The "LOAD_EXTEND_OP (<MODE>mode) == ZERO_EXTEND" part of the if statement is not really necessary since we know this is true on aarch64 but I thought it helped make it clear what we were doing and the compiler should optimize it away anyway. OK to checkin this fix while we consider longer term options? Steve Ellcey sellcey@cavium.com 2017-09-14 Steve Ellcey <sellcey@cavium.com> PR target/77729 * config/aarch64/aarch64.md (mov<mode>): Generate subreg for short loads to reflect that upper bits are zeroed out on load. 2017-09-14 Steve Ellcey <sellcey@cavium.com> * gcc.target/aarch64/pr77729.c: New test. diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md index f8cdb06..bca4cf5 100644 --- a/gcc/config/aarch64/aarch64.md +++ b/gcc/config/aarch64/aarch64.md @@ -864,6 +864,15 @@ (match_operand:SHORT 1 "general_operand" ""))] "" " + if (LOAD_EXTEND_OP (<MODE>mode) == ZERO_EXTEND && MEM_P (operands[1]) + && can_create_pseudo_p () && optimize > 0) + { + /* Generate subreg of SImode so we know that the upper bits + of the reg are zero and do not need to masked out later. */ + rtx reg = gen_reg_rtx (SImode); + emit_insn (gen_zero_extend<mode>si2 (reg, operands[1])); + operands[1] = gen_lowpart (<MODE>mode, reg); + } if (GET_CODE (operands[0]) == MEM && operands[1] != const0_rtx) operands[1] = force_reg (<MODE>mode, operands[1]); " diff --git a/gcc/testsuite/gcc.target/aarch64/pr77729.c b/gcc/testsuite/gcc.target/aarch64/pr77729.c index e69de29..2fcda9a 100644 --- a/gcc/testsuite/gcc.target/aarch64/pr77729.c +++ b/gcc/testsuite/gcc.target/aarch64/pr77729.c @@ -0,0 +1,32 @@ +/* { dg-do compile } */ +/* { dg-options "-O2" } */ + +int TrieCase3_v1(const char *string) +{ + if((string[0] | 32) == 't') { + if((string[1] | 32) == 'a') { + if((string[2] | 32) == 'g') { + return 42; + } + } + } + return -1; +} + +int TrieCase3_v2(const char *string) +{ + switch(string[0] | 32) { + case 't': + switch(string[1] | 32) { + case 'a': + switch(string[2] | 32) { + case 'g': + return 42; + } + } + } + return -1; +} + +/* { dg-final { scan-assembler-not "and" } } */ +/* { dg-final { scan-assembler-not "uxtb" } } */
Ping. There was discussion of larger fixes for this including a type promotion pass but this patch seems small, safe, and in line with other platforms (like arm32). Steve Ellcey sellcey@cavium.com On Thu, 2017-09-14 at 11:43 -0700, Steve Ellcey wrote: > On Thu, 2017-09-14 at 11:53 -0600, Jeff Law wrote: > > > > > > > > And I think that's starting to zero in on the problem -- > > WORD_REGISTER_OPERATIONS is zero on aarch64 as you don't get > > extension > > to word_mode for W form registers. > > > > I wonder if what needs to happen is somehow look to extend that > > code > > somehow so that combine and friends know that the value is zero > > extended > > to 32 bits, even if it's not extended to word_mode. > > > > Jeff > This might be a good long term direction to move but in the mean time > it sure does seem a lot easier to just generate a subreg. Here is a > patch that does that, it passes bootstrap and has no regressions and > fixes the bug in question (and most likely improves other code as > well). > > The "LOAD_EXTEND_OP (<MODE>mode) == ZERO_EXTEND" part of the if > statement is not really necessary since we know this is true on > aarch64 > but I thought it helped make it clear what we were doing and the > compiler should optimize it away anyway. > > OK to checkin this fix while we consider longer term options? > > Steve Ellcey > sellcey@cavium.com > > > 2017-09-14 Steve Ellcey <sellcey@cavium.com> > > PR target/77729 > * config/aarch64/aarch64.md (mov<mode>): Generate subreg for > short loads to reflect that upper bits are zeroed out on load. > > > 2017-09-14 Steve Ellcey <sellcey@cavium.com> > > * gcc.target/aarch64/pr77729.c: New test. >
diff --git a/gcc/testsuite/gcc.target/aarch64/pr77729.c b/gcc/testsuite/gcc.target/aarch64/pr77729.c index e69de29..2fcda9a 100644 --- a/gcc/testsuite/gcc.target/aarch64/pr77729.c +++ b/gcc/testsuite/gcc.target/aarch64/pr77729.c @@ -0,0 +1,32 @@ +/* { dg-do compile } */ +/* { dg-options "-O2" } */ + +int TrieCase3_v1(const char *string) +{ + if((string[0] | 32) == 't') { + if((string[1] | 32) == 'a') { + if((string[2] | 32) == 'g') { + return 42; + } + } + } + return -1; +} + +int TrieCase3_v2(const char *string) +{ + switch(string[0] | 32) { + case 't': + switch(string[1] | 32) { + case 'a': + switch(string[2] | 32) { + case 'g': + return 42; + } + } + } + return -1; +} + +/* { dg-final { scan-assembler-not "and" } } */ +/* { dg-final { scan-assembler-not "uxtb" } } */