Message ID | 20160427082004.GE5082@linux.vnet.ibm.com |
---|---|
State | New |
Headers | show |
On 04/27/2016 02:20 AM, Dominik Vogt wrote: > The attached patch is a result of discussing an S/390 issue with > "and with complement" in some cases. > > https://gcc.gnu.org/ml/gcc/2016-03/msg00163.html > https://gcc.gnu.org/ml/gcc-patches/2016-04/msg01586.html > > Combine would merge a ZERO_EXTEND and a SET taking the known zero > bits into account, resulting in an AND. Later on, > make_compound_operation() fails to replace that with a ZERO_EXTEND > which we get for free on S/390 but leaves the AND, eventually > resulting in two consecutive AND instructions. > > The current code in make_compound_operation() that detects > opportunities for ZERO_EXTEND does not work here because it does > not take the known zero bits into account: > > /* If the constant is one less than a power of two, this might be > representable by an extraction even if no shift is present. > If it doesn't end up being a ZERO_EXTEND, we will ignore it unless > we are in a COMPARE. */ > else if ((i = exact_log2 (UINTVAL (XEXP (x, 1)) + 1)) >= 0) > new_rtx = make_extraction (mode, > make_compound_operation (XEXP (x, 0), > next_code), > 0, NULL_RTX, i, 1, 0, in_code == COMPARE); > > An attempt to use the zero bits in the above conditions resulted > in many situations that generated worse code, so the patch tries > to fix this in a more conservative way. While the effect is > completely positive on S/390, this will very likely have > unforeseeable consequences on other targets. > > Bootstrapped and regression tested on s390 and s390x only at the > moment. > > Ciao > > Dominik ^_^ ^_^ > > -- Dominik Vogt IBM Germany > > > 0001-ChangeLog > > > gcc/ChangeLog > > * combine.c (make_compound_operation): Take known zero bits into > account when checking for possible zero_extend. I'd strongly recommend writing some tests for this. Extra credit if they can be run on an x86 target which gets more testing than s390. If I go back to our original discussion, we have this going into combine: (insn 6 3 7 2 (parallel [ (set (reg:SI 64) (and:SI (mem:SI (reg/v/f:DI 63 [ a ]) [1 *a_2(D)+0 S4 A32]) (const_int -65521 [0xffffffffffff000f]))) (clobber (reg:CC 33 %cc)) ]) andc-immediate.c:21 1481 {*andsi3_zarch} (expr_list:REG_DEAD (reg/v/f:DI 63 [ a ]) (expr_list:REG_UNUSED (reg:CC 33 %cc) (nil)))) (insn 7 6 12 2 (set (reg:DI 65) (zero_extend:DI (reg:SI 64))) andc-immediate.c:21 1207 {*zero_extendsidi2} (expr_list:REG_DEAD (reg:SI 64) (nil))) (insn 12 7 13 2 (set (reg/i:DI 2 %r2) (reg:DI 65)) andc-immediate.c:22 1073 {*movdi_64} (expr_list:REG_DEAD (reg:DI 65) (nil))) Which combine turns into: (insn 6 3 7 2 (parallel [ (set (reg:SI 64) (and:SI (mem:SI (reg:DI 2 %r2 [ a ]) [1 *a_2(D)+0 S4 A32]) (const_int -65521 [0xffffffffffff000f]))) (clobber (reg:CC 33 %cc)) ]) andc-immediate.c:21 1481 {*andsi3_zarch} (expr_list:REG_DEAD (reg:DI 2 %r2 [ a ]) (expr_list:REG_UNUSED (reg:CC 33 %cc) (nil)))) (insn 12 7 13 2 (parallel [ (set (reg/i:DI 2 %r2) (and:DI (subreg:DI (reg:SI 64) 0) ^^^ (const_int 4294901775 [0xffff000f]))) ^^^^^^^^^^ (clobber (reg:CC 33 %cc)) ]) andc-immediate.c:22 1474 {*anddi3} (expr_list:REG_UNUSED (reg:CC 33 %cc) (expr_list:REG_DEAD (reg:SI 64) (nil)))) Instead you want insn 12 to use a zero-extend to extend (reg:SI 64) into (reg:DI 2)? Can't you achieve this in this clause: /* If the constant is one less than a power of two, this might be representable by an extraction even if no shift is present. If it doesn't end up being a ZERO_EXTEND, we will ignore it unless we are in a COMPARE. */ You extract the constant via UINTVAL (XEXP (x, 1)), then munge it based on nonzero_bits and pass the result to exact_log2? Though I do like how you've conditionalized on the cost of the and vs the cost of hte zero-extend. So maybe your approach is ultimately better. Still curious your thoughts on doing it by just munging the constant you pass off to exact_log2 in that earlier clause. > + /* If the one operand is a paradoxical subreg of a register or memory and > + the constant (limited to the smaller mode) has only zero bits where > + the sub expression has known zero bits, this can be expressed as > + a zero_extend. */ > + else if (GET_CODE (XEXP (x, 0)) == SUBREG) > + { > + rtx sub; > + > + sub = XEXP (XEXP (x, 0), 0); > + machine_mode sub_mode = GET_MODE (sub); > + if ((REG_P (sub) || MEM_P (sub)) > + && GET_MODE_PRECISION (sub_mode) < mode_width > + && (UINTVAL (XEXP (x, 1)) > + | (~nonzero_bits (sub, sub_mode) & GET_MODE_MASK (sub_mode)) > + ) == GET_MODE_MASK (sub_mode)) I'd either factor something out or write a nested conditional to avoid the horrid line wrapping here and hopefully make the conditions easier to read. Jeff
On Wed, Apr 27, 2016 at 10:24:21PM -0600, Jeff Law wrote: > On 04/27/2016 02:20 AM, Dominik Vogt wrote: > > * combine.c (make_compound_operation): Take known zero bits into > > account when checking for possible zero_extend. > I'd strongly recommend writing some tests for this. Extra credit if > they can be run on an x86 target which gets more testing than s390. I'll look into that later. > If I go back to our original discussion, we have this going into combine: > > (insn 6 3 7 2 (parallel [ > (set (reg:SI 64) > (and:SI (mem:SI (reg/v/f:DI 63 [ a ]) [1 *a_2(D)+0 S4 A32]) > (const_int -65521 [0xffffffffffff000f]))) > (clobber (reg:CC 33 %cc)) > ]) andc-immediate.c:21 1481 {*andsi3_zarch} > (expr_list:REG_DEAD (reg/v/f:DI 63 [ a ]) > (expr_list:REG_UNUSED (reg:CC 33 %cc) > (nil)))) > (insn 7 6 12 2 (set (reg:DI 65) > (zero_extend:DI (reg:SI 64))) andc-immediate.c:21 1207 > {*zero_extendsidi2} > (expr_list:REG_DEAD (reg:SI 64) > (nil))) > (insn 12 7 13 2 (set (reg/i:DI 2 %r2) > (reg:DI 65)) andc-immediate.c:22 1073 {*movdi_64} > (expr_list:REG_DEAD (reg:DI 65) > (nil))) > > Which combine turns into: > > (insn 6 3 7 2 (parallel [ > (set (reg:SI 64) > (and:SI (mem:SI (reg:DI 2 %r2 [ a ]) [1 *a_2(D)+0 S4 A32]) > (const_int -65521 [0xffffffffffff000f]))) > (clobber (reg:CC 33 %cc)) > ]) andc-immediate.c:21 1481 {*andsi3_zarch} > (expr_list:REG_DEAD (reg:DI 2 %r2 [ a ]) > (expr_list:REG_UNUSED (reg:CC 33 %cc) > (nil)))) > (insn 12 7 13 2 (parallel [ > (set (reg/i:DI 2 %r2) > (and:DI (subreg:DI (reg:SI 64) 0) > ^^^ > (const_int 4294901775 [0xffff000f]))) > ^^^^^^^^^^ > (clobber (reg:CC 33 %cc)) > ]) andc-immediate.c:22 1474 {*anddi3} > (expr_list:REG_UNUSED (reg:CC 33 %cc) > (expr_list:REG_DEAD (reg:SI 64) > (nil)))) > > > Instead you want insn 12 to use a zero-extend to extend (reg:SI 64) > into (reg:DI 2)? Yes, because we get the zero extend for free in this case (through the constant in the AND or because the input value is a function argument that is already zero extended). > Can't you achieve this in this clause: > > /* If the constant is one less than a power of two, this might be > representable by an extraction even if no shift is present. > If it doesn't end up being a ZERO_EXTEND, we will ignore it unless > we are in a COMPARE. */ > > You extract the constant via UINTVAL (XEXP (x, 1)), then munge it > based on nonzero_bits and pass the result to exact_log2? That's what we tried first, but it resulted in worse code in many places (saved about 250 instructions in the SPEC2006 testsuite but added about 42000 elsewhere). It was so bad that I didn't even bother to check what's going on. Probably this was triggered all over the place by small constants like 1, 3, 7 and the like where s390 has no cheap way for zero extension. So I limited this to constants that are actually mode masks, implicitly assuming that there are zero extend instructions only for known modes (which is true for s390 but may not for some other targets). Being conservative here shouldn't hurt; but I wonder whether there are targets where this condition still allows too much. > Though I do like how you've conditionalized on the cost of the and > vs the cost of hte zero-extend. So maybe your approach is > ultimately better. Actually we wanted to remove the call to rtx_cost() (because usually combine just assumes that a zero extend is cheaper). I've probably forgotten to remove it before posting the patch. For s390 it's meaningless whether rtx_cost() is called or not because at the moment it doesn't model the cost of zero extension (i.e. the cost of either way is just one instruction, and without context it's not possible to decide whether s390 needs a separate instruction for the zero extend or whether it comes for free). > Still curious your thoughts on doing it by just > munging the constant you pass off to exact_log2 in that earlier > clause. > >+ /* If the one operand is a paradoxical subreg of a register or memory and > >+ the constant (limited to the smaller mode) has only zero bits where > >+ the sub expression has known zero bits, this can be expressed as > >+ a zero_extend. */ > >+ else if (GET_CODE (XEXP (x, 0)) == SUBREG) > >+ { > >+ rtx sub; > >+ > >+ sub = XEXP (XEXP (x, 0), 0); > >+ machine_mode sub_mode = GET_MODE (sub); > >+ if ((REG_P (sub) || MEM_P (sub)) > >+ && GET_MODE_PRECISION (sub_mode) < mode_width > >+ && (UINTVAL (XEXP (x, 1)) > >+ | (~nonzero_bits (sub, sub_mode) & GET_MODE_MASK (sub_mode)) > >+ ) == GET_MODE_MASK (sub_mode)) > I'd either factor something out or write a nested conditional to > avoid the horrid line wrapping here and hopefully make the > conditions easier to read. Will do. Ciao Dominik ^_^ ^_^
On Wed, Apr 27, 2016 at 10:24:21PM -0600, Jeff Law wrote: > On 04/27/2016 02:20 AM, Dominik Vogt wrote: > >The attached patch is a result of discussing an S/390 issue with > >"and with complement" in some cases. > > > > https://gcc.gnu.org/ml/gcc/2016-03/msg00163.html > > https://gcc.gnu.org/ml/gcc-patches/2016-04/msg01586.html > > > >Combine would merge a ZERO_EXTEND and a SET taking the known zero > >bits into account, resulting in an AND. Later on, > >make_compound_operation() fails to replace that with a ZERO_EXTEND > >which we get for free on S/390 but leaves the AND, eventually > >resulting in two consecutive AND instructions. > > > >The current code in make_compound_operation() that detects > >opportunities for ZERO_EXTEND does not work here because it does > >not take the known zero bits into account: > > > > /* If the constant is one less than a power of two, this might be > > representable by an extraction even if no shift is present. > > If it doesn't end up being a ZERO_EXTEND, we will ignore it unless > > we are in a COMPARE. */ > > else if ((i = exact_log2 (UINTVAL (XEXP (x, 1)) + 1)) >= 0) > > new_rtx = make_extraction (mode, > > make_compound_operation (XEXP (x, 0), > > next_code), > > 0, NULL_RTX, i, 1, 0, in_code == COMPARE); > > > >An attempt to use the zero bits in the above conditions resulted > >in many situations that generated worse code, so the patch tries > >to fix this in a more conservative way. While the effect is > >completely positive on S/390, this will very likely have > >unforeseeable consequences on other targets. > > * combine.c (make_compound_operation): Take known zero bits into > > account when checking for possible zero_extend. > I'd strongly recommend writing some tests for this. This turns out to be quite difficult. A small test function effectively just returns the argument: unsigned long bar (unsigned long in) { if ((in & 1) == 0) in = (in & ~(unsigned long)1); return in; } However, Gcc does not notice that the AND is a no-op. As far as I understand, zero bit tracking is only done in "combine", so when folding the assignment statement the information that the lowest bit is zero is not available and therefore the no-op is not detected? (I've been trying to trigger the code from the patch with a function bases on the above construct.) Ciao Dominik ^_^ ^_^
On Mon, 9 May 2016, Dominik Vogt wrote: > This turns out to be quite difficult. A small test function > effectively just returns the argument: > > unsigned long bar (unsigned long in) > { > if ((in & 1) == 0) > in = (in & ~(unsigned long)1); > > return in; > } > > However, Gcc does not notice that the AND is a no-op. As far as I > understand, zero bit tracking is only done in "combine", so when > folding the assignment statement the information that the lowest > bit is zero is not available and therefore the no-op is not > detected? VRP is also supposed to track bits that may/must be non-zero. It may be possible to enhance it to handle this case.
On Mon, May 9, 2016 at 3:36 PM, Marc Glisse <marc.glisse@inria.fr> wrote: > On Mon, 9 May 2016, Dominik Vogt wrote: > >> This turns out to be quite difficult. A small test function >> effectively just returns the argument: >> >> unsigned long bar (unsigned long in) >> { >> if ((in & 1) == 0) >> in = (in & ~(unsigned long)1); >> >> return in; >> } >> >> However, Gcc does not notice that the AND is a no-op. As far as I >> understand, zero bit tracking is only done in "combine", so when >> folding the assignment statement the information that the lowest >> bit is zero is not available and therefore the no-op is not >> detected? > > > VRP is also supposed to track bits that may/must be non-zero. It may be > possible to enhance it to handle this case. Actually VRP only tracks value-ranges, CCP tracks may/must be non-zero bits but does not have a conditional lattice. Richard. > -- > Marc Glisse
On Tue, May 10, 2016 at 11:14:35AM +0200, Richard Biener wrote: > On Mon, May 9, 2016 at 3:36 PM, Marc Glisse <marc.glisse@inria.fr> wrote: > > On Mon, 9 May 2016, Dominik Vogt wrote: > > > >> This turns out to be quite difficult. A small test function > >> effectively just returns the argument: > >> > >> unsigned long bar (unsigned long in) > >> { > >> if ((in & 1) == 0) > >> in = (in & ~(unsigned long)1); > >> > >> return in; > >> } > >> > >> However, Gcc does not notice that the AND is a no-op. As far as I > >> understand, zero bit tracking is only done in "combine", so when > >> folding the assignment statement the information that the lowest > >> bit is zero is not available and therefore the no-op is not > >> detected? > > > > > > VRP is also supposed to track bits that may/must be non-zero. It may be > > possible to enhance it to handle this case. > > Actually VRP only tracks value-ranges, CCP tracks may/must be non-zero bits > but does not have a conditional lattice. Does it track known *one* bits too? Because the AND doesn't get eliminated here either: in = in | 3; in = in ^ 1; in = (in & ~(unsigned long)1); Ciao Dominik ^_^ ^_^
On Tue, May 10, 2016 at 12:07 PM, Dominik Vogt <vogt@linux.vnet.ibm.com> wrote: > On Tue, May 10, 2016 at 11:14:35AM +0200, Richard Biener wrote: >> On Mon, May 9, 2016 at 3:36 PM, Marc Glisse <marc.glisse@inria.fr> wrote: >> > On Mon, 9 May 2016, Dominik Vogt wrote: >> > >> >> This turns out to be quite difficult. A small test function >> >> effectively just returns the argument: >> >> >> >> unsigned long bar (unsigned long in) >> >> { >> >> if ((in & 1) == 0) >> >> in = (in & ~(unsigned long)1); >> >> >> >> return in; >> >> } >> >> >> >> However, Gcc does not notice that the AND is a no-op. As far as I >> >> understand, zero bit tracking is only done in "combine", so when >> >> folding the assignment statement the information that the lowest >> >> bit is zero is not available and therefore the no-op is not >> >> detected? >> > >> > >> > VRP is also supposed to track bits that may/must be non-zero. It may be >> > possible to enhance it to handle this case. >> >> Actually VRP only tracks value-ranges, CCP tracks may/must be non-zero bits >> but does not have a conditional lattice. > > Does it track known *one* bits too? Because the AND doesn't get > eliminated here either: > > in = in | 3; > in = in ^ 1; > in = (in & ~(unsigned long)1); Yes, but CCP itself does not eliminate this unless something in match.pd simplifies this (ccp_fold is now simply dispatching there) later. This means CCP doesn't properly identify the copy here: Visiting statement: in_2 = in_1(D) | 3; which is likely CONSTANT Lattice value changed to CONSTANT 0x3 (0x0fffffffffffffffc). Adding SSA edges to worklist. interesting_ssa_edges: adding SSA use in in_3 = in_2 ^ 1; marking stmt to be not simulated again Visiting statement: in_3 = in_2 ^ 1; which is likely CONSTANT Lattice value changed to CONSTANT 0x2 (0x0fffffffffffffffc). Adding SSA edges to worklist. interesting_ssa_edges: adding SSA use in in_4 = in_3 & 18446744073709551614; marking stmt to be not simulated again Visiting statement: in_4 = in_3 & 18446744073709551614; which is likely CONSTANT Lattice value changed to CONSTANT 0x2 (0x0fffffffffffffffc). Adding SSA edges to worklist. interesting_ssa_edges: adding SSA use in _5 = in_4; marking stmt to be not simulated again so it only computes new known bits but not whether the operation is value-preserving. Richard. > Ciao > > Dominik ^_^ ^_^ > > -- > > Dominik Vogt > IBM Germany >
On 04/29/2016 03:35 AM, Dominik Vogt wrote: > On Wed, Apr 27, 2016 at 10:24:21PM -0600, Jeff Law wrote: >> Instead you want insn 12 to use a zero-extend to extend (reg:SI 64) >> into (reg:DI 2)? > > Yes, because we get the zero extend for free in this case (through > the constant in the AND or because the input value is a function > argument that is already zero extended). > >> Can't you achieve this in this clause: >> >> /* If the constant is one less than a power of two, this might be >> representable by an extraction even if no shift is present. >> If it doesn't end up being a ZERO_EXTEND, we will ignore it unless >> we are in a COMPARE. */ >> >> You extract the constant via UINTVAL (XEXP (x, 1)), then munge it >> based on nonzero_bits and pass the result to exact_log2? > > That's what we tried first, but it resulted in worse code in many > places (saved about 250 instructions in the SPEC2006 testsuite but > added about 42000 elsewhere). It was so bad that I didn't even > bother to check what's going on. Probably this was triggered all > over the place by small constants like 1, 3, 7 and the like where > s390 has no cheap way for zero extension. So I limited this to > constants that are actually mode masks, implicitly assuming that > there are zero extend instructions only for known modes (which is > true for s390 but may not for some other targets). Being > conservative here shouldn't hurt; but I wonder whether there are > targets where this condition still allows too much. You're probably right. FWIW, I do believe a variety of targets can do these kind of zero extensions. The PA for example has extru which can extract a field from a general register zero extend it, then place the result, right justified into another register. We don't get them "for free", except as a component of a larger sequence of statements for bitfield extraction/manipulation. I believe PPC has similar capabilities. jeff
From e70e6e469200b53b3f4ae52a766cdd322a4d365d Mon Sep 17 00:00:00 2001 From: Dominik Vogt <vogt@linux.vnet.ibm.com> Date: Tue, 12 Apr 2016 09:53:46 +0100 Subject: [PATCH] Take known zero bits into account when checking extraction. Allows AND Insns with a const_int operand to be expressed as ZERO_EXTEND if the operand ist a power of 2 - 1 even with the known zero bits masked out. --- gcc/combine.c | 33 +++++++++++++++++++++++++++++++++ 1 file changed, 33 insertions(+) diff --git a/gcc/combine.c b/gcc/combine.c index 1d0e8be..44bb1b3 100644 --- a/gcc/combine.c +++ b/gcc/combine.c @@ -7988,6 +7988,39 @@ make_compound_operation (rtx x, enum rtx_code in_code) next_code), i, NULL_RTX, 1, 1, 0, 1); + /* If the one operand is a paradoxical subreg of a register or memory and + the constant (limited to the smaller mode) has only zero bits where + the sub expression has known zero bits, this can be expressed as + a zero_extend. */ + else if (GET_CODE (XEXP (x, 0)) == SUBREG) + { + rtx sub; + + sub = XEXP (XEXP (x, 0), 0); + machine_mode sub_mode = GET_MODE (sub); + if ((REG_P (sub) || MEM_P (sub)) + && GET_MODE_PRECISION (sub_mode) < mode_width + && (UINTVAL (XEXP (x, 1)) + | (~nonzero_bits (sub, sub_mode) & GET_MODE_MASK (sub_mode)) + ) == GET_MODE_MASK (sub_mode)) + { + bool speed_p = optimize_insn_for_speed_p (); + rtx temp = gen_rtx_ZERO_EXTEND (mode, sub); + int cost_of_and; + int cost_of_zero_extend; + + cost_of_and = rtx_cost (x, mode, in_code, 1, speed_p); + cost_of_zero_extend = rtx_cost (temp, mode, in_code, 1, speed_p); + if (cost_of_zero_extend <= cost_of_and) + { + new_rtx = make_compound_operation (sub, next_code); + new_rtx = make_extraction (mode, new_rtx, 0, 0, + GET_MODE_PRECISION (sub_mode), + 1, 0, in_code == COMPARE); + } + } + } + break; case LSHIFTRT: -- 2.3.0