Message ID | 021801d66663$e93294d0$bb97be70$@nextmovesoftware.com |
---|---|
State | New |
Headers | show |
Series | x86_64: Integer min/max improvements. | expand |
On Thu, Jul 30, 2020 at 1:24 PM Roger Sayle <roger@nextmovesoftware.com> wrote: > > > This patch tweaks the way that min and max are expanded, so that the > semantics of these operations are visible to the early RTL optimization > passes, until split into explicit comparison and conditional move > instructions. The good news is that i386.md already contains all of > the required logic (many thanks to Richard Biener and Uros Bizjak), > but this is currently only to enable scalar-to-vector (STV) synthesis > of min/max instructions. This change enables this functionality for > all TARGET_CMOVE architectures for SImode, HImode and DImode. > > My first attempt to support "min(reg,reg)" as an instruction revealed > why this functionality isn't already enabled: In PR rtl-optimization 91154 > we end up generating a cmp instead of a test in gcc.target/i386/minmax-3.c > which has poor performance on AMD Opteron. My solution to this is to > actually support "min(reg,general_operand)" allowing us to inspect > any immediate constant at the point this operation is split. This > allows us to use "test" instructions for min/max against 0, 1 and -1. > As an added benefit it allows us to CSE large immediate constants, > reducing code size. > > Previously, "int foo(int x) { return max(x,12345); }" would generate: > > foo: cmpl $12345, %edi > movl $12345, %eax > cmovge %edi, %eax > ret > > with this patch we instead generate: > > foo: movl $12345, %eax > cmpl %eax, %edi > cmovge %edi, %eax > ret > > > I've also included a peephole2 to avoid the "movl $0,%eax" instructions > being produced by the register allocator. Materializing constants as > late as possible reduces register pressure, but for const0_rtx on x86, > it's preferable to use "xor" by moving this set from between a flags > setting operation and its use. This should also help instruction macro > fusion on some microarchitectures, where test/cmp and the following > instruction can sometimes be combined. > > Previously, "int foo(int x) { return max(x,0); }" would generate: > > foo: testl %edi, %edi > movl $0, %eax > cmovns %edi, %eax > ret > > with this patch we instead generate: > foo: xorl %eax, %eax > testl %edi, %edi > cmovns %edi, %eax > ret > > The benefits of having min/max explicit at the RTL level can be seen > from compiling the following example with "-O2 -fno-tree-reassoc". > > > #define max(a,b) (((a) > (b))? (a) : (b)) > > int foo(int x) > { > int y = max(x,5); > return max(y,10); > } > > where GCC currently produces: > > foo: cmpl $5, %edi > movl $5, %eax > movl $10, %edx > cmovge %edi, %eax > cmpl $10, %eax > cmovl %edx, %eax > ret > > and with this patch it instead now produces: > > foo: movl $10, %eax > cmpl %eax, %edi > cmovge %edi, %eax > ret > > > The original motivation was from looking at a performance critical > function in a quantum mechanics code, that performed MIN_EXPR and > MAX_EXPR of the same arguments (effectively a two-element sort), > where GCC was performing the comparison twice. I'd hoped that it > might be possible to fuse these together, perhaps in combine, but > this patch is an intermediate step towards that goal. > > This patch has been tested on x86_64-pc-linux-gnu with a make > bootstrap followed by make -k check with no new regressions. > Ok for mainline? May I ask for a testcase or two? You have good examples above. Btw, I'm sure some variants of those are in bugzilla as well. Richard. > > 2020-07-30 Roger Sayle <roger@nextmovesoftware.com> > > * config/i386/i386.md (MAXMIN_IMODE): No longer needed. > (<maxmin><mode>3): Support SWI248 and general_operand for > second operand, when TARGET_CMOVE. > (<maxmin><mode>3_1 splitter): Optimize comparisons against > 0, 1 and -1 to use "test" instead of "cmp". > (*<maxmin>di3_doubleword): Likewise, allow general_operand > and enable on TARGET_CMOVE. > (peephole2): Convert clearing a register after a flag setting > instruction into an xor followed by the original flag setter. > > > Many thanks in advance. Almost all of the credit goes to Uros and > Richard for already implementing this feature, I've just copied the > transformations from optab expansion, that allow it to be enabled > without penalty (this late in the compilation). > > Roger > -- > Roger Sayle > NextMove Software > Cambridge, UK >
Hi Richard, >> This patch tweaks the way that min and max are expanded, so that the >> semantics of these operations are visible to the early RTL >> optimization passes, until split into explicit comparison and >> conditional move instructions. > > Btw, I'm sure some variants of those are in bugzilla as well. Now that you mention it, I'm not sure whether PR rtl-optimization 94543 is a bug at all, but with you and Richard Henderson weighing in, I suspect that I must be missing something subtle. The initial text of the bug report complains about an AND of 0xff not being optimized away, as a zero extension is still present. I believe that the AND with 0xff has been completely optimized away, and the remaining movzwl (not movzbl) is the zero extension of the short result to the integer ABI return type. If the result is written as a short to memory, there is no extension (or AND). But perhaps the problem isn't with min/and at all, but about whether function arguments and return values are guaranteed to be extended on input/return, so perhaps this is related to SUBREG_PROMOTED_P and friends? Thoughts? Cheers, Roger > 2020-07-30 Roger Sayle <roger@nextmovesoftware.com> > > * config/i386/i386.md (MAXMIN_IMODE): No longer needed. > (<maxmin><mode>3): Support SWI248 and general_operand for > second operand, when TARGET_CMOVE. > (<maxmin><mode>3_1 splitter): Optimize comparisons against > 0, 1 and -1 to use "test" instead of "cmp". > (*<maxmin>di3_doubleword): Likewise, allow general_operand > and enable on TARGET_CMOVE. > (peephole2): Convert clearing a register after a flag setting > instruction into an xor followed by the original flag setter. >
On Thu, Jul 30, 2020 at 01:55:03PM +0100, Roger Sayle wrote: > Now that you mention it, I'm not sure whether PR rtl-optimization 94543 > is a bug at all, but with you and Richard Henderson weighing in, I suspect > that I must be missing something subtle. > > The initial text of the bug report complains about an AND of 0xff not being > optimized away, as a zero extension is still present. I believe that the AND > with 0xff has been completely optimized away, and the remaining movzwl > (not movzbl) is the zero extension of the short result to the integer ABI return > type. If the result is written as a short to memory, there is no extension > (or AND). > > But perhaps the problem isn't with min/and at all, but about whether function > arguments and return values are guaranteed to be extended on input/return, > so perhaps this is related to SUBREG_PROMOTED_P and friends? > > Thoughts? Yeah, given that the RTL performs the computation in HImode rather than SImode, an extension is needed at least from the RTL POV, because there is no guarantee what behavior will be for the upper bits of the GPR. (insn 10 6 9 (set (reg:HI 88) (const_int 255 [0xff])) "x.c":1:56 -1 (nil)) ... (insn 11 9 12 (set (reg:HI 87) (if_then_else:HI (leu (reg:CC 17 flags) (const_int 0 [0])) (reg/v:HI 84 [ x ]) (reg:HI 88))) "x.c":1:56 -1 (nil)) If these two (or just one of them) insns were emitted as movw $255, %ax or cmova %ax, %di rather than movl $255, %eax or cmova %eax, %edi, then the upper 16 bits would contain the previous content rather than being cleared. So, if we want to get rid of the useless zero extension, we'd need some target specific pass or hook or whatever that would query instruction attributes/whatever to figure out what can or can't be trusted to be zero. E.g. the conditions on when movhi_internal uses mode attribute of SI vs. HI is quite complicated, but I guess hopefully it is reliable. I'm afraid for many insns it is not. And then there is also the on the side knowledge that pretty much all insns for TARGET_64BIT with GPRs zero extend if they are 32-bit insns, but the question is if one can trust mode attribute for those. Another possibility would be to move the zero extensions earlier in this case, so use set (reg:SI 88) (const_int 255) and 32-bit conditional move rather than 16-bit, and in those cases perhaps with help of range information during expansion we could create SUBREGs and use SUBREG_PROMOTED_* on those. Jakub
Hi Richard, As requested here are some additional test cases for this patch, derived from the examples in my post. For minmax-8 and minmax-9, the tests are run at -Os, to avoid limiting code generation options (for speed) on future architectures, i.e. CSE may or may not always be a win for performance, but it is always smaller. Tested on x86_64-pc-linux-gnu (with the above patch). 2020-08-03 Roger Sayle <roger@nextmovesoftware.com> gcc/testsuite/ChangeLog * gcc.target/i386/minmax-8.c: New test. * gcc.target/i386/minmax-9.c: New test. * gcc.target/i386/minmax-10.c: New test. * gcc.target/i386/minmax-11.c: New test. Cheers, Roger -- -----Original Message----- From: Richard Biener <richard.guenther@gmail.com> Sent: 30 July 2020 13:21 To: Roger Sayle <roger@nextmovesoftware.com> Cc: GCC Patches <gcc-patches@gcc.gnu.org>; Uros Bizjak <ubizjak@gmail.com> Subject: Re: [PATCH] x86_64: Integer min/max improvements. On Thu, Jul 30, 2020 at 1:24 PM Roger Sayle <roger@nextmovesoftware.com> wrote: > > > This patch tweaks the way that min and max are expanded, so that the > semantics of these operations are visible to the early RTL > optimization passes, until split into explicit comparison and > conditional move instructions. The good news is that i386.md already > contains all of the required logic (many thanks to Richard Biener and > Uros Bizjak), but this is currently only to enable scalar-to-vector > (STV) synthesis of min/max instructions. This change enables this > functionality for all TARGET_CMOVE architectures for SImode, HImode and DImode. > > My first attempt to support "min(reg,reg)" as an instruction revealed > why this functionality isn't already enabled: In PR rtl-optimization > 91154 we end up generating a cmp instead of a test in > gcc.target/i386/minmax-3.c which has poor performance on AMD Opteron. > My solution to this is to actually support "min(reg,general_operand)" > allowing us to inspect any immediate constant at the point this > operation is split. This allows us to use "test" instructions for min/max against 0, 1 and -1. > As an added benefit it allows us to CSE large immediate constants, > reducing code size. > > Previously, "int foo(int x) { return max(x,12345); }" would generate: > > foo: cmpl $12345, %edi > movl $12345, %eax > cmovge %edi, %eax > ret > > with this patch we instead generate: > > foo: movl $12345, %eax > cmpl %eax, %edi > cmovge %edi, %eax > ret > > > I've also included a peephole2 to avoid the "movl $0,%eax" > instructions being produced by the register allocator. Materializing > constants as late as possible reduces register pressure, but for > const0_rtx on x86, it's preferable to use "xor" by moving this set > from between a flags setting operation and its use. This should also > help instruction macro fusion on some microarchitectures, where > test/cmp and the following instruction can sometimes be combined. > > Previously, "int foo(int x) { return max(x,0); }" would generate: > > foo: testl %edi, %edi > movl $0, %eax > cmovns %edi, %eax > ret > > with this patch we instead generate: > foo: xorl %eax, %eax > testl %edi, %edi > cmovns %edi, %eax > ret > > The benefits of having min/max explicit at the RTL level can be seen > from compiling the following example with "-O2 -fno-tree-reassoc". > > > #define max(a,b) (((a) > (b))? (a) : (b)) > > int foo(int x) > { > int y = max(x,5); > return max(y,10); > } > > where GCC currently produces: > > foo: cmpl $5, %edi > movl $5, %eax > movl $10, %edx > cmovge %edi, %eax > cmpl $10, %eax > cmovl %edx, %eax > ret > > and with this patch it instead now produces: > > foo: movl $10, %eax > cmpl %eax, %edi > cmovge %edi, %eax > ret > > > The original motivation was from looking at a performance critical > function in a quantum mechanics code, that performed MIN_EXPR and > MAX_EXPR of the same arguments (effectively a two-element sort), where > GCC was performing the comparison twice. I'd hoped that it might be > possible to fuse these together, perhaps in combine, but this patch is > an intermediate step towards that goal. > > This patch has been tested on x86_64-pc-linux-gnu with a make > bootstrap followed by make -k check with no new regressions. > Ok for mainline? May I ask for a testcase or two? You have good examples above. Btw, I'm sure some variants of those are in bugzilla as well. Richard. > > 2020-07-30 Roger Sayle <roger@nextmovesoftware.com> > > * config/i386/i386.md (MAXMIN_IMODE): No longer needed. > (<maxmin><mode>3): Support SWI248 and general_operand for > second operand, when TARGET_CMOVE. > (<maxmin><mode>3_1 splitter): Optimize comparisons against > 0, 1 and -1 to use "test" instead of "cmp". > (*<maxmin>di3_doubleword): Likewise, allow general_operand > and enable on TARGET_CMOVE. > (peephole2): Convert clearing a register after a flag setting > instruction into an xor followed by the original flag setter. > > > Many thanks in advance. Almost all of the credit goes to Uros and > Richard for already implementing this feature, I've just copied the > transformations from optab expansion, that allow it to be enabled > without penalty (this late in the compilation). > > Roger > -- > Roger Sayle > NextMove Software > Cambridge, UK > diff --git a/gcc/testsuite/gcc.target/i386/minmax-8.c b/gcc/testsuite/gcc.target/i386/minmax-8.c new file mode 100644 index 0000000..1f7e466 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/minmax-8.c @@ -0,0 +1,18 @@ +/* { dg-do compile } */ +/* { dg-options "-Os" } */ + +#define max(a,b) (((a) > (b))? (a) : (b)) +#define min(a,b) (((a) < (b))? (a) : (b)) + +int foo(int x) +{ + return max(x,12345); +} + +int bar(int x) +{ + return min(x,87654); +} + +/* { dg-final { scan-assembler-times "12345" 1 } } */ +/* { dg-final { scan-assembler-times "87654" 1 } } */ diff --git a/gcc/testsuite/gcc.target/i386/minmax-9.c b/gcc/testsuite/gcc.target/i386/minmax-9.c new file mode 100644 index 0000000..3b94023 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/minmax-9.c @@ -0,0 +1,23 @@ +/* { dg-do compile } */ +/* { dg-options "-Os" } */ + +#define max(a,b) (((a) > (b))? (a) : (b)) +#define min(a,b) (((a) < (b))? (a) : (b)) + +int foo(int x) +{ + return max(x,0); +} + +int bar(int x) +{ + return min(x,0); +} + +unsigned int baz(unsigned int x) +{ + return min(x,1); +} + +/* { dg-final { scan-assembler-times "xor" 3 } } */ +/* { dg-final { scan-assembler-times "test" 3 } } */ diff --git a/gcc/testsuite/gcc.target/i386/minmax-10.c b/gcc/testsuite/gcc.target/i386/minmax-10.c new file mode 100644 index 0000000..b044462 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/minmax-10.c @@ -0,0 +1,38 @@ +/* { dg-do compile } */ +/* { dg-options "-O2" } */ + +#define max(a,b) (((a) > (b))? (a) : (b)) +#define min(a,b) (((a) < (b))? (a) : (b)) + +int smax1(int x) +{ + return max(x,1); +} + +int smin1(int x) +{ + return min(x,1); +} + +int smaxm1(int x) +{ + return max(x,-1); +} + +int sminm1(int x) +{ + return min(x,-1); +} + +unsigned int umax1(unsigned int x) +{ + return max(x,1); +} + +unsigned int umin1(unsigned int x) +{ + return min(x,1); +} + +/* { dg-final { scan-assembler-times "test" 6 } } */ +/* { dg-final { scan-assembler-not "cmp" } } */ diff --git a/gcc/testsuite/gcc.target/i386/minmax-11.c b/gcc/testsuite/gcc.target/i386/minmax-11.c new file mode 100644 index 0000000..a8c2df5 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/minmax-11.c @@ -0,0 +1,12 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -fno-tree-reassoc" } */ + +#define max(a,b) (((a) > (b))? (a) : (b)) + +int foo(int x) +{ + int y = max(x,12345); + return max(y,87654); +} + +/* { dg-final { scan-assembler-not "12345" } } */
On Thu, Jul 30, 2020 at 1:23 PM Roger Sayle <roger@nextmovesoftware.com> wrote: > > > This patch tweaks the way that min and max are expanded, so that the > semantics of these operations are visible to the early RTL optimization > passes, until split into explicit comparison and conditional move > instructions. The good news is that i386.md already contains all of > the required logic (many thanks to Richard Biener and Uros Bizjak), > but this is currently only to enable scalar-to-vector (STV) synthesis > of min/max instructions. This change enables this functionality for > all TARGET_CMOVE architectures for SImode, HImode and DImode. > > My first attempt to support "min(reg,reg)" as an instruction revealed > why this functionality isn't already enabled: In PR rtl-optimization 91154 > we end up generating a cmp instead of a test in gcc.target/i386/minmax-3.c > which has poor performance on AMD Opteron. My solution to this is to > actually support "min(reg,general_operand)" allowing us to inspect > any immediate constant at the point this operation is split. This > allows us to use "test" instructions for min/max against 0, 1 and -1. > As an added benefit it allows us to CSE large immediate constants, > reducing code size. > > Previously, "int foo(int x) { return max(x,12345); }" would generate: > > foo: cmpl $12345, %edi > movl $12345, %eax > cmovge %edi, %eax > ret > > with this patch we instead generate: > > foo: movl $12345, %eax > cmpl %eax, %edi > cmovge %edi, %eax > ret > > > I've also included a peephole2 to avoid the "movl $0,%eax" instructions > being produced by the register allocator. Materializing constants as > late as possible reduces register pressure, but for const0_rtx on x86, > it's preferable to use "xor" by moving this set from between a flags > setting operation and its use. This should also help instruction macro > fusion on some microarchitectures, where test/cmp and the following > instruction can sometimes be combined. > > Previously, "int foo(int x) { return max(x,0); }" would generate: > > foo: testl %edi, %edi > movl $0, %eax > cmovns %edi, %eax > ret > > with this patch we instead generate: > foo: xorl %eax, %eax > testl %edi, %edi > cmovns %edi, %eax > ret > > The benefits of having min/max explicit at the RTL level can be seen > from compiling the following example with "-O2 -fno-tree-reassoc". > > > #define max(a,b) (((a) > (b))? (a) : (b)) > > int foo(int x) > { > int y = max(x,5); > return max(y,10); > } > > where GCC currently produces: > > foo: cmpl $5, %edi > movl $5, %eax > movl $10, %edx > cmovge %edi, %eax > cmpl $10, %eax > cmovl %edx, %eax > ret > > and with this patch it instead now produces: > > foo: movl $10, %eax > cmpl %eax, %edi > cmovge %edi, %eax > ret > > > The original motivation was from looking at a performance critical > function in a quantum mechanics code, that performed MIN_EXPR and > MAX_EXPR of the same arguments (effectively a two-element sort), > where GCC was performing the comparison twice. I'd hoped that it > might be possible to fuse these together, perhaps in combine, but > this patch is an intermediate step towards that goal. > > This patch has been tested on x86_64-pc-linux-gnu with a make > bootstrap followed by make -k check with no new regressions. > Ok for mainline? > > > 2020-07-30 Roger Sayle <roger@nextmovesoftware.com> > > * config/i386/i386.md (MAXMIN_IMODE): No longer needed. > (<maxmin><mode>3): Support SWI248 and general_operand for > second operand, when TARGET_CMOVE. > (<maxmin><mode>3_1 splitter): Optimize comparisons against > 0, 1 and -1 to use "test" instead of "cmp". > (*<maxmin>di3_doubleword): Likewise, allow general_operand > and enable on TARGET_CMOVE. > (peephole2): Convert clearing a register after a flag setting > instruction into an xor followed by the original flag setter. > > > Many thanks in advance. Almost all of the credit goes to Uros and > Richard for already implementing this feature, I've just copied the > transformations from optab expansion, that allow it to be enabled > without penalty (this late in the compilation). LGTM, modulo the below part: +;; Avoid clearing a register between a flags setting comparison and its use, +;; i.e. prefer "xorl %eax,%eax; test/cmp" over "test/cmp; movl $0, %eax". +(define_peephole2 + [(set (reg FLAGS_REG) (match_operand 0)) + (set (match_operand:SWI48 1 "register_operand") (const_int 0))] + "peep2_regno_dead_p (0, FLAGS_REG) + && !reg_overlap_mentioned_p (operands[1], operands[0])" + [(parallel [(set (match_dup 1) (const_int 0)) + (clobber (reg:CC FLAGS_REG))]) + (set (match_dup 2) (match_dup 0))] +{ + operands[2] = gen_rtx_REG (GET_MODE (operands[0]), FLAGS_REG); +}) Please use ix86_expand_clear (operands[1]); in the preparation statement instead of emitting RTX pattern. This will take TARGET_USE_MOV0 and size optimizations into account. Uros.
Hi Uros, Many thanks for the review and feedback. Here's the final version as committed, with both the test cases requested by Richard Biener and your suggestion/request to use ix86_expand_clear. Tested again on x86_64-pc-linux-gnu. Thank you again for the fantastic ix86_expand_clear pointer, which cleared up one of two technical questions I had, and allowed this peephole2 to now also apply to QImode and HImode MOV0s, where my original version was limited to SImode and DImode. My two questions were (i) why a QImode set of 0 with a flags clobber isn't a recognized instruction? I'd assume that on some architectures "xorb dl,dl" might be an appropriate sequence to use. This is mostly answered by the use of ix86_expand_clear, which intelligently selects the correct form, but the lack of a *movqi_xor was previously odd. (ii) My other question, was that despite my best efforts I couldn't seem to convince GCC to generate/use a *movsi_or to load the constant -1. It was just a curiosity, but this would affect/benefit the smaxm1 and sminm1 examples in the new i386/minmax-10.c test program. Many thanks again for your help. Roger -- -----Original Message----- From: Uros Bizjak <ubizjak@gmail.com> Sent: 03 August 2020 11:29 To: Roger Sayle <roger@nextmovesoftware.com> Cc: GCC Patches <gcc-patches@gcc.gnu.org> Subject: Re: [PATCH] x86_64: Integer min/max improvements. On Thu, Jul 30, 2020 at 1:23 PM Roger Sayle <roger@nextmovesoftware.com> wrote: > > > This patch tweaks the way that min and max are expanded, so that the > semantics of these operations are visible to the early RTL > optimization passes, until split into explicit comparison and > conditional move instructions. The good news is that i386.md already > contains all of the required logic (many thanks to Richard Biener and > Uros Bizjak), but this is currently only to enable scalar-to-vector > (STV) synthesis of min/max instructions. This change enables this > functionality for all TARGET_CMOVE architectures for SImode, HImode and DImode. > > My first attempt to support "min(reg,reg)" as an instruction revealed > why this functionality isn't already enabled: In PR rtl-optimization > 91154 we end up generating a cmp instead of a test in > gcc.target/i386/minmax-3.c which has poor performance on AMD Opteron. > My solution to this is to actually support "min(reg,general_operand)" > allowing us to inspect any immediate constant at the point this > operation is split. This allows us to use "test" instructions for min/max against 0, 1 and -1. > As an added benefit it allows us to CSE large immediate constants, > reducing code size. > > Previously, "int foo(int x) { return max(x,12345); }" would generate: > > foo: cmpl $12345, %edi > movl $12345, %eax > cmovge %edi, %eax > ret > > with this patch we instead generate: > > foo: movl $12345, %eax > cmpl %eax, %edi > cmovge %edi, %eax > ret > > > I've also included a peephole2 to avoid the "movl $0,%eax" > instructions being produced by the register allocator. Materializing > constants as late as possible reduces register pressure, but for > const0_rtx on x86, it's preferable to use "xor" by moving this set > from between a flags setting operation and its use. This should also > help instruction macro fusion on some microarchitectures, where > test/cmp and the following instruction can sometimes be combined. > > Previously, "int foo(int x) { return max(x,0); }" would generate: > > foo: testl %edi, %edi > movl $0, %eax > cmovns %edi, %eax > ret > > with this patch we instead generate: > foo: xorl %eax, %eax > testl %edi, %edi > cmovns %edi, %eax > ret > > The benefits of having min/max explicit at the RTL level can be seen > from compiling the following example with "-O2 -fno-tree-reassoc". > > > #define max(a,b) (((a) > (b))? (a) : (b)) > > int foo(int x) > { > int y = max(x,5); > return max(y,10); > } > > where GCC currently produces: > > foo: cmpl $5, %edi > movl $5, %eax > movl $10, %edx > cmovge %edi, %eax > cmpl $10, %eax > cmovl %edx, %eax > ret > > and with this patch it instead now produces: > > foo: movl $10, %eax > cmpl %eax, %edi > cmovge %edi, %eax > ret > > > The original motivation was from looking at a performance critical > function in a quantum mechanics code, that performed MIN_EXPR and > MAX_EXPR of the same arguments (effectively a two-element sort), where > GCC was performing the comparison twice. I'd hoped that it might be > possible to fuse these together, perhaps in combine, but this patch is > an intermediate step towards that goal. > > This patch has been tested on x86_64-pc-linux-gnu with a make > bootstrap followed by make -k check with no new regressions. > Ok for mainline? > > > 2020-07-30 Roger Sayle <roger@nextmovesoftware.com> > > * config/i386/i386.md (MAXMIN_IMODE): No longer needed. > (<maxmin><mode>3): Support SWI248 and general_operand for > second operand, when TARGET_CMOVE. > (<maxmin><mode>3_1 splitter): Optimize comparisons against > 0, 1 and -1 to use "test" instead of "cmp". > (*<maxmin>di3_doubleword): Likewise, allow general_operand > and enable on TARGET_CMOVE. > (peephole2): Convert clearing a register after a flag setting > instruction into an xor followed by the original flag setter. > > > Many thanks in advance. Almost all of the credit goes to Uros and > Richard for already implementing this feature, I've just copied the > transformations from optab expansion, that allow it to be enabled > without penalty (this late in the compilation). LGTM, modulo the below part: +;; Avoid clearing a register between a flags setting comparison and its +use, ;; i.e. prefer "xorl %eax,%eax; test/cmp" over "test/cmp; movl $0, %eax". +(define_peephole2 + [(set (reg FLAGS_REG) (match_operand 0)) + (set (match_operand:SWI48 1 "register_operand") (const_int 0))] + "peep2_regno_dead_p (0, FLAGS_REG) + && !reg_overlap_mentioned_p (operands[1], operands[0])" + [(parallel [(set (match_dup 1) (const_int 0)) + (clobber (reg:CC FLAGS_REG))]) + (set (match_dup 2) (match_dup 0))] +{ + operands[2] = gen_rtx_REG (GET_MODE (operands[0]), FLAGS_REG); +}) Please use ix86_expand_clear (operands[1]); in the preparation statement instead of emitting RTX pattern. This will take TARGET_USE_MOV0 and size optimizations into account. Uros. diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md index b24a455..4e916bf 100644 --- a/gcc/config/i386/i386.md +++ b/gcc/config/i386/i386.md @@ -18809,45 +18809,68 @@ ;; min/max patterns -(define_mode_iterator MAXMIN_IMODE - [(SI "TARGET_SSE4_1") (DI "TARGET_AVX512VL")]) (define_code_attr maxmin_rel [(smax "GE") (smin "LE") (umax "GEU") (umin "LEU")]) (define_expand "<code><mode>3" [(parallel - [(set (match_operand:MAXMIN_IMODE 0 "register_operand") - (maxmin:MAXMIN_IMODE - (match_operand:MAXMIN_IMODE 1 "register_operand") - (match_operand:MAXMIN_IMODE 2 "nonimmediate_operand"))) + [(set (match_operand:SWI248 0 "register_operand") + (maxmin:SWI248 + (match_operand:SWI248 1 "register_operand") + (match_operand:SWI248 2 "general_operand"))) (clobber (reg:CC FLAGS_REG))])] - "TARGET_STV") + "TARGET_CMOVE") (define_insn_and_split "*<code><mode>3_1" - [(set (match_operand:MAXMIN_IMODE 0 "register_operand") - (maxmin:MAXMIN_IMODE - (match_operand:MAXMIN_IMODE 1 "register_operand") - (match_operand:MAXMIN_IMODE 2 "nonimmediate_operand"))) + [(set (match_operand:SWI248 0 "register_operand") + (maxmin:SWI248 + (match_operand:SWI248 1 "register_operand") + (match_operand:SWI248 2 "general_operand"))) (clobber (reg:CC FLAGS_REG))] - "(TARGET_64BIT || <MODE>mode != DImode) && TARGET_STV + "TARGET_CMOVE && ix86_pre_reload_split ()" "#" "&& 1" [(set (match_dup 0) - (if_then_else:MAXMIN_IMODE (match_dup 3) + (if_then_else:SWI248 (match_dup 3) (match_dup 1) (match_dup 2)))] { machine_mode mode = <MODE>mode; + rtx cmp_op = operands[2]; - if (!register_operand (operands[2], mode)) - operands[2] = force_reg (mode, operands[2]); + if (!register_operand (cmp_op, mode)) + operands[2] = force_reg (mode, cmp_op); enum rtx_code code = <maxmin_rel>; - machine_mode cmpmode = SELECT_CC_MODE (code, operands[1], operands[2]); + + if (cmp_op == const1_rtx) + { + /* Convert smax (x, 1) into (x > 0 ? x : 1). + Convert umax (x, 1) into (x != 0 ? x : 1). + Convert ?min (x, 1) into (x <= 0 ? x : 1). */ + cmp_op = const0_rtx; + if (code == GE) + code = GT; + else if (code == GEU) + code = NE; + } + /* Convert smin (x, -1) into (x < 0 ? x : -1). */ + else if (cmp_op == constm1_rtx && code == LE) + { + cmp_op = const0_rtx; + code = LT; + } + /* Convert smax (x, -1) into (x >= 0 ? x : -1). */ + else if (cmp_op == constm1_rtx && code == GE) + cmp_op = const0_rtx; + else if (cmp_op != const0_rtx) + cmp_op = operands[2]; + + machine_mode cmpmode = SELECT_CC_MODE (code, operands[1], cmp_op); rtx flags = gen_rtx_REG (cmpmode, FLAGS_REG); - rtx tmp = gen_rtx_COMPARE (cmpmode, operands[1], operands[2]); + rtx tmp = gen_rtx_COMPARE (cmpmode, operands[1], cmp_op); emit_insn (gen_rtx_SET (flags, tmp)); operands[3] = gen_rtx_fmt_ee (code, VOIDmode, flags, const0_rtx); @@ -18856,9 +18879,9 @@ (define_insn_and_split "*<code>di3_doubleword" [(set (match_operand:DI 0 "register_operand") (maxmin:DI (match_operand:DI 1 "register_operand") - (match_operand:DI 2 "nonimmediate_operand"))) + (match_operand:DI 2 "general_operand"))) (clobber (reg:CC FLAGS_REG))] - "!TARGET_64BIT && TARGET_STV && TARGET_AVX512VL + "!TARGET_64BIT && TARGET_CMOVE && ix86_pre_reload_split ()" "#" "&& 1" @@ -18910,6 +18933,19 @@ gcc_unreachable (); } }) + +;; Avoid clearing a register between a flags setting comparison and its use, +;; i.e. prefer "xorl %eax,%eax; test/cmp" over "test/cmp; movl $0, %eax". +(define_peephole2 + [(set (reg FLAGS_REG) (match_operand 0)) + (set (match_operand:SWI 1 "register_operand") (const_int 0))] + "peep2_regno_dead_p (0, FLAGS_REG) + && !reg_overlap_mentioned_p (operands[1], operands[0])" + [(set (match_dup 2) (match_dup 0))] +{ + operands[2] = gen_rtx_REG (GET_MODE (operands[0]), FLAGS_REG); + ix86_expand_clear (operands[1]); +}) ;; Misc patterns (?) diff --git a/gcc/testsuite/gcc.target/i386/minmax-8.c b/gcc/testsuite/gcc.target/i386/minmax-8.c new file mode 100644 index 0000000..1f7e466 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/minmax-8.c @@ -0,0 +1,18 @@ +/* { dg-do compile } */ +/* { dg-options "-Os" } */ + +#define max(a,b) (((a) > (b))? (a) : (b)) +#define min(a,b) (((a) < (b))? (a) : (b)) + +int foo(int x) +{ + return max(x,12345); +} + +int bar(int x) +{ + return min(x,87654); +} + +/* { dg-final { scan-assembler-times "12345" 1 } } */ +/* { dg-final { scan-assembler-times "87654" 1 } } */ diff --git a/gcc/testsuite/gcc.target/i386/minmax-9.c b/gcc/testsuite/gcc.target/i386/minmax-9.c new file mode 100644 index 0000000..3b94023 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/minmax-9.c @@ -0,0 +1,23 @@ +/* { dg-do compile } */ +/* { dg-options "-Os" } */ + +#define max(a,b) (((a) > (b))? (a) : (b)) +#define min(a,b) (((a) < (b))? (a) : (b)) + +int foo(int x) +{ + return max(x,0); +} + +int bar(int x) +{ + return min(x,0); +} + +unsigned int baz(unsigned int x) +{ + return min(x,1); +} + +/* { dg-final { scan-assembler-times "xor" 3 } } */ +/* { dg-final { scan-assembler-times "test" 3 } } */ diff --git a/gcc/testsuite/gcc.target/i386/minmax-10.c b/gcc/testsuite/gcc.target/i386/minmax-10.c new file mode 100644 index 0000000..b044462 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/minmax-10.c @@ -0,0 +1,38 @@ +/* { dg-do compile } */ +/* { dg-options "-O2" } */ + +#define max(a,b) (((a) > (b))? (a) : (b)) +#define min(a,b) (((a) < (b))? (a) : (b)) + +int smax1(int x) +{ + return max(x,1); +} + +int smin1(int x) +{ + return min(x,1); +} + +int smaxm1(int x) +{ + return max(x,-1); +} + +int sminm1(int x) +{ + return min(x,-1); +} + +unsigned int umax1(unsigned int x) +{ + return max(x,1); +} + +unsigned int umin1(unsigned int x) +{ + return min(x,1); +} + +/* { dg-final { scan-assembler-times "test" 6 } } */ +/* { dg-final { scan-assembler-not "cmp" } } */ diff --git a/gcc/testsuite/gcc.target/i386/minmax-11.c b/gcc/testsuite/gcc.target/i386/minmax-11.c new file mode 100644 index 0000000..a8c2df5 --- /dev/null +++ b/gcc/testsuite/gcc.target/i386/minmax-11.c @@ -0,0 +1,12 @@ +/* { dg-do compile } */ +/* { dg-options "-O2 -fno-tree-reassoc" } */ + +#define max(a,b) (((a) > (b))? (a) : (b)) + +int foo(int x) +{ + int y = max(x,12345); + return max(y,87654); +} + +/* { dg-final { scan-assembler-not "12345" } } */
On Thu, Aug 06, 2020 at 09:40:49AM +0100, Roger Sayle wrote: This test fails on i686-linux (or x86_64-linux when testing with -m32). make check-gcc RUNTESTFLAGS='--target_board=unix\{-m32,-m64\} i386.exp=minmax-9.c' Running /usr/src/gcc/gcc/testsuite/gcc.target/i386/i386.exp ... FAIL: gcc.target/i386/minmax-9.c scan-assembler-times test 3 > --- /dev/null > +++ b/gcc/testsuite/gcc.target/i386/minmax-9.c > @@ -0,0 +1,23 @@ > +/* { dg-do compile } */ > +/* { dg-options "-Os" } */ > + > +#define max(a,b) (((a) > (b))? (a) : (b)) > +#define min(a,b) (((a) < (b))? (a) : (b)) > + > +int foo(int x) > +{ > + return max(x,0); > +} > + > +int bar(int x) > +{ > + return min(x,0); > +} > + > +unsigned int baz(unsigned int x) > +{ > + return min(x,1); > +} > + > +/* { dg-final { scan-assembler-times "xor" 3 } } */ > +/* { dg-final { scan-assembler-times "test" 3 } } */ Jakub
Sorry for the inconvenience. I've just added the obligatory /* { dg-do compile { target { ! ia32 } } } */ to this new gcc.target/i386 test to resolve this failure. Please let me know if there's a better fix. 2020-08-06 Roger Sayle <roger@nextmovesoftware.com> gcc/testsuite/ChangeLog * gcc.target/i386/minmax-9.c: Restrict test to !ia32. My apologies again. Roger -- -----Original Message----- From: Jakub Jelinek <jakub@redhat.com> Sent: 06 August 2020 13:28 To: Roger Sayle <roger@nextmovesoftware.com> Cc: 'Uros Bizjak' <ubizjak@gmail.com>; 'GCC Patches' <gcc-patches@gcc.gnu.org> Subject: Re: [PATCH] x86_64: Integer min/max improvements. On Thu, Aug 06, 2020 at 09:40:49AM +0100, Roger Sayle wrote: This test fails on i686-linux (or x86_64-linux when testing with -m32). make check-gcc RUNTESTFLAGS='--target_board=unix\{-m32,-m64\} i386.exp=minmax-9.c' Running /usr/src/gcc/gcc/testsuite/gcc.target/i386/i386.exp ... FAIL: gcc.target/i386/minmax-9.c scan-assembler-times test 3 > --- /dev/null > +++ b/gcc/testsuite/gcc.target/i386/minmax-9.c > @@ -0,0 +1,23 @@ > +/* { dg-do compile } */ > +/* { dg-options "-Os" } */ > + > +#define max(a,b) (((a) > (b))? (a) : (b)) #define min(a,b) (((a) < > +(b))? (a) : (b)) > + > +int foo(int x) > +{ > + return max(x,0); > +} > + > +int bar(int x) > +{ > + return min(x,0); > +} > + > +unsigned int baz(unsigned int x) > +{ > + return min(x,1); > +} > + > +/* { dg-final { scan-assembler-times "xor" 3 } } */ > +/* { dg-final { scan-assembler-times "test" 3 } } */ Jakub
On Thu, Aug 6, 2020 at 10:40 AM Roger Sayle <roger@nextmovesoftware.com> wrote: > > > Hi Uros, > > Many thanks for the review and feedback. Here's the final version as committed, > with both the test cases requested by Richard Biener and your suggestion/request > to use ix86_expand_clear. Tested again on x86_64-pc-linux-gnu. > > Thank you again for the fantastic ix86_expand_clear pointer, which cleared up one > of two technical questions I had, and allowed this peephole2 to now also apply to > QImode and HImode MOV0s, where my original version was limited to SImode and > DImode. > > My two questions were (i) why a QImode set of 0 with a flags clobber isn't a recognized > instruction? I'd assume that on some architectures "xorb dl,dl" might be an appropriate > sequence to use. This is mostly answered by the use of ix86_expand_clear, which > intelligently selects the correct form, but the lack of a *movqi_xor was previously odd. XOR transformation is used mostly due to code size, where we have: 0: b0 00 mov $0x0,%al 2: 30 c0 xor %al,%al 4: bb 00 00 00 00 mov $0x0,%ebx 9: 31 db xor %ebx,%ebx So, as can be seen from the above example, there is no benefit for QImode, where 3 bytes can be saved for SImode. > (ii) My other question, was that despite my best efforts I couldn't seem to convince GCC > to generate/use a *movsi_or to load the constant -1. It was just a curiosity, but this > would affect/benefit the smaxm1 and sminm1 examples in the new i386/minmax-10.c This transformation is enabled only for -Os or DEF_TUNE (X86_TUNE_MOVE_M1_VIA_OR, "move_m1_via_or", m_PENT | m_LAKEMONT) However, clearing the register with xor reg,reg also prevents partial reg stall, where or -1, reg does not. Uros.
diff --git a/gcc/config/i386/i386.md b/gcc/config/i386/i386.md index b24a455..717d55f 100644 --- a/gcc/config/i386/i386.md +++ b/gcc/config/i386/i386.md @@ -18809,45 +18809,68 @@ ;; min/max patterns -(define_mode_iterator MAXMIN_IMODE - [(SI "TARGET_SSE4_1") (DI "TARGET_AVX512VL")]) (define_code_attr maxmin_rel [(smax "GE") (smin "LE") (umax "GEU") (umin "LEU")]) (define_expand "<code><mode>3" [(parallel - [(set (match_operand:MAXMIN_IMODE 0 "register_operand") - (maxmin:MAXMIN_IMODE - (match_operand:MAXMIN_IMODE 1 "register_operand") - (match_operand:MAXMIN_IMODE 2 "nonimmediate_operand"))) + [(set (match_operand:SWI248 0 "register_operand") + (maxmin:SWI248 + (match_operand:SWI248 1 "register_operand") + (match_operand:SWI248 2 "general_operand"))) (clobber (reg:CC FLAGS_REG))])] - "TARGET_STV") + "TARGET_CMOVE") (define_insn_and_split "*<code><mode>3_1" - [(set (match_operand:MAXMIN_IMODE 0 "register_operand") - (maxmin:MAXMIN_IMODE - (match_operand:MAXMIN_IMODE 1 "register_operand") - (match_operand:MAXMIN_IMODE 2 "nonimmediate_operand"))) + [(set (match_operand:SWI248 0 "register_operand") + (maxmin:SWI248 + (match_operand:SWI248 1 "register_operand") + (match_operand:SWI248 2 "general_operand"))) (clobber (reg:CC FLAGS_REG))] - "(TARGET_64BIT || <MODE>mode != DImode) && TARGET_STV + "TARGET_CMOVE && ix86_pre_reload_split ()" "#" "&& 1" [(set (match_dup 0) - (if_then_else:MAXMIN_IMODE (match_dup 3) + (if_then_else:SWI248 (match_dup 3) (match_dup 1) (match_dup 2)))] { machine_mode mode = <MODE>mode; + rtx cmp_op = operands[2]; - if (!register_operand (operands[2], mode)) - operands[2] = force_reg (mode, operands[2]); + if (!register_operand (cmp_op, mode)) + operands[2] = force_reg (mode, cmp_op); enum rtx_code code = <maxmin_rel>; - machine_mode cmpmode = SELECT_CC_MODE (code, operands[1], operands[2]); + + if (cmp_op == const1_rtx) + { + /* Convert smax (x, 1) into (x > 0 ? x : 1). + Convert umax (x, 1) into (x != 0 ? x : 1). + Convert ?min (x, 1) into (x <= 0 ? x : 1). */ + cmp_op = const0_rtx; + if (code == GE) + code = GT; + else if (code == GEU) + code = NE; + } + /* Convert smin (x, -1) into (x < 0 ? x : -1). */ + else if (cmp_op == constm1_rtx && code == LE) + { + cmp_op = const0_rtx; + code = LT; + } + /* Convert smax (x, -1) into (x >= 0 ? x : -1). */ + else if (cmp_op == constm1_rtx && code == GE) + cmp_op = const0_rtx; + else if (cmp_op != const0_rtx) + cmp_op = operands[2]; + + machine_mode cmpmode = SELECT_CC_MODE (code, operands[1], cmp_op); rtx flags = gen_rtx_REG (cmpmode, FLAGS_REG); - rtx tmp = gen_rtx_COMPARE (cmpmode, operands[1], operands[2]); + rtx tmp = gen_rtx_COMPARE (cmpmode, operands[1], cmp_op); emit_insn (gen_rtx_SET (flags, tmp)); operands[3] = gen_rtx_fmt_ee (code, VOIDmode, flags, const0_rtx); @@ -18856,9 +18879,9 @@ (define_insn_and_split "*<code>di3_doubleword" [(set (match_operand:DI 0 "register_operand") (maxmin:DI (match_operand:DI 1 "register_operand") - (match_operand:DI 2 "nonimmediate_operand"))) + (match_operand:DI 2 "general_operand"))) (clobber (reg:CC FLAGS_REG))] - "!TARGET_64BIT && TARGET_STV && TARGET_AVX512VL + "!TARGET_64BIT && TARGET_CMOVE && ix86_pre_reload_split ()" "#" "&& 1" @@ -18910,6 +18933,20 @@ gcc_unreachable (); } }) + +;; Avoid clearing a register between a flags setting comparison and its use, +;; i.e. prefer "xorl %eax,%eax; test/cmp" over "test/cmp; movl $0, %eax". +(define_peephole2 + [(set (reg FLAGS_REG) (match_operand 0)) + (set (match_operand:SWI48 1 "register_operand") (const_int 0))] + "peep2_regno_dead_p (0, FLAGS_REG) + && !reg_overlap_mentioned_p (operands[1], operands[0])" + [(parallel [(set (match_dup 1) (const_int 0)) + (clobber (reg:CC FLAGS_REG))]) + (set (match_dup 2) (match_dup 0))] +{ + operands[2] = gen_rtx_REG (GET_MODE (operands[0]), FLAGS_REG); +}) ;; Misc patterns (?)