Message ID | 00cd01d9f76b$3db62990$b9227cb0$@nextmovesoftware.com |
---|---|
State | New |
Series | [X86] Split lea into shorter left shift by 2 or 3 bits with -Oz. |
On Thu, Oct 5, 2023 at 11:06 AM Roger Sayle <roger@nextmovesoftware.com> wrote:
>
> This patch avoids long lea instructions for performing x<<2 and x<<3
> by splitting them into shorter sal and move (or xchg) instructions.
> Because this increases the number of instructions, but reduces the
> total size, it's suitable for -Oz (but not -Os).
>
> The impact can be seen in the new test case:
>
> int foo(int x) { return x<<2; }
> int bar(int x) { return x<<3; }
> long long fool(long long x) { return x<<2; }
> long long barl(long long x) { return x<<3; }
>
> where with -O2 we generate:
>
> foo:    lea    0x0(,%rdi,4),%eax    // 7 bytes
>         retq
> bar:    lea    0x0(,%rdi,8),%eax    // 7 bytes
>         retq
> fool:   lea    0x0(,%rdi,4),%rax    // 8 bytes
>         retq
> barl:   lea    0x0(,%rdi,8),%rax    // 8 bytes
>         retq
>
> and with -Oz we now generate:
>
> foo:    xchg   %eax,%edi            // 1 byte
>         shl    $0x2,%eax            // 3 bytes
>         retq
> bar:    xchg   %eax,%edi            // 1 byte
>         shl    $0x3,%eax            // 3 bytes
>         retq
> fool:   xchg   %rax,%rdi            // 2 bytes
>         shl    $0x2,%rax            // 4 bytes
>         retq
> barl:   xchg   %rax,%rdi            // 2 bytes
>         shl    $0x3,%rax            // 4 bytes
>         retq
>
> Over the entirety of the CSiBE code size benchmark this saves 1347
> bytes (0.037%) for x86_64, and 1312 bytes (0.036%) with -m32.
> Conveniently, there's already a backend function in i386.cc for
> deciding whether to split an lea into its component instructions,
> ix86_avoid_lea_for_addr; all that's required is an additional clause
> checking for -Oz (i.e. optimize_size > 1).
>
> This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> and make -k check, both with and without --target_board='unix{-m32}'
> with no new failures. Additional testing was performed by repeating
> these steps after removing the "optimize_size > 1" condition, so that
> suitable lea instructions were always split [-Oz is not heavily
> tested, so this invoked the new code during the bootstrap and
> regression testing], again with no regressions. Ok for mainline?
>
>
> 2023-10-05  Roger Sayle  <roger@nextmovesoftware.com>
>
> gcc/ChangeLog
>         * config/i386/i386.cc (ix86_avoid_lea_for_addr): Split LEAs used
>         to perform left shifts into shorter instructions with -Oz.
>
> gcc/testsuite/ChangeLog
>         * gcc.target/i386/lea-2.c: New test case.
>

OK, but ...

@@ -0,0 +1,7 @@
+/* { dg-do compile { target { ! ia32 } } } */

Is there a reason to avoid 32-bit targets? I'd expect that the
optimization also triggers on x86_32 for 32-bit integers.

+/* { dg-options "-Oz" } */
+int foo(int x) { return x<<2; }
+int bar(int x) { return x<<3; }
+long long fool(long long x) { return x<<2; }
+long long barl(long long x) { return x<<3; }
+/* { dg-final { scan-assembler-not "lea\[lq\]" } } */

Uros.

Hi Uros,
Very many thanks for the speedy reviews.

Uros Bizjak wrote:
> On Thu, Oct 5, 2023 at 11:06 AM Roger Sayle <roger@nextmovesoftware.com>
> wrote:
> > [...]
>
> OK, but ...
>
> @@ -0,0 +1,7 @@
> +/* { dg-do compile { target { ! ia32 } } } */
>
> Is there a reason to avoid 32-bit targets? I'd expect that the optimization also
> triggers on x86_32 for 32bit integers.

Good catch. You're 100% correct; because the test case just checks that an LEA
is not used, and not for the specific sequence of shift instructions used
instead, this test also passes with --target_board='unix{-m32}'. I'll remove
the target clause from the dg-do compile directive.

> +/* { dg-options "-Oz" } */
> +int foo(int x) { return x<<2; }
> +int bar(int x) { return x<<3; }
> +long long fool(long long x) { return x<<2; }
> +long long barl(long long x) { return x<<3; }
> +/* { dg-final { scan-assembler-not "lea\[lq\]" } } */

Thanks again.
Roger

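The byte counts quoted in the thread can be checked by hand. The scaled-index lea has no base register, and that addressing form always carries a 32-bit displacement, which is why it is so long. The following is a minimal sketch in C, with the encodings written out manually for illustration; the array and function names are hypothetical and not part of the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hand-written encodings of the instructions quoted in the thread.
   All names here are illustrative; nothing below comes from GCC.  */

/* lea 0x0(,%rdi,4),%eax: opcode, ModRM, SIB, then a mandatory disp32.  */
static const unsigned char lea_x4_32[] = { 0x8d, 0x04, 0xbd, 0, 0, 0, 0 };
/* xchg %eax,%edi: single-byte form available when one operand is %eax.  */
static const unsigned char xchg_32[] = { 0x97 };
/* shl $0x2,%eax: opcode, ModRM, imm8.  */
static const unsigned char shl2_32[] = { 0xc1, 0xe0, 0x02 };

/* The 64-bit variants each gain a REX.W prefix (0x48).  */
static const unsigned char lea_x4_64[] = { 0x48, 0x8d, 0x04, 0xbd, 0, 0, 0, 0 };
static const unsigned char xchg_64[] = { 0x48, 0x97 };
static const unsigned char shl2_64[] = { 0x48, 0xc1, 0xe0, 0x02 };

/* Bytes saved by replacing the 32-bit lea with xchg + shl.  */
size_t bytes_saved_32(void)
{
  size_t split = sizeof xchg_32 + sizeof shl2_32;
  assert(sizeof lea_x4_32 >= split);  /* guard against unsigned wrap-around */
  return sizeof lea_x4_32 - split;
}

/* Bytes saved by replacing the 64-bit lea with xchg + shl.  */
size_t bytes_saved_64(void)
{
  size_t split = sizeof xchg_64 + sizeof shl2_64;
  assert(sizeof lea_x4_64 >= split);
  return sizeof lea_x4_64 - split;
}
```

Under these encodings the split saves 3 bytes per 32-bit shift and 2 bytes per 64-bit shift, at the cost of one extra instruction. That trade-off is why the thread restricts it to -Oz.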
diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
index 477e6ce..9557bff 100644
--- a/gcc/config/i386/i386.cc
+++ b/gcc/config/i386/i386.cc
@@ -15543,6 +15543,13 @@ ix86_avoid_lea_for_addr (rtx_insn *insn, rtx operands[])
       && (regno0 == regno1 || regno0 == regno2))
     return true;
 
+  /* Split with -Oz if the encoding requires fewer bytes. */
+  if (optimize_size > 1
+      && parts.scale > 1
+      && !parts.base
+      && (!parts.disp || parts.disp == const0_rtx))
+    return true;
+
   /* Check we need to optimize. */
   if (!TARGET_AVOID_LEA_FOR_ADDR || optimize_function_for_size_p (cfun))
     return false;
diff --git a/gcc/testsuite/gcc.target/i386/lea-2.c b/gcc/testsuite/gcc.target/i386/lea-2.c
new file mode 100644
index 0000000..20aded8
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/lea-2.c
@@ -0,0 +1,7 @@
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-Oz" } */
+int foo(int x) { return x<<2; }
+int bar(int x) { return x<<3; }
+long long fool(long long x) { return x<<2; }
+long long barl(long long x) { return x<<3; }
+/* { dg-final { scan-assembler-not "lea\[lq\]" } } */
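
To make the new clause easier to reason about outside the compiler, here is a standalone sketch of the same condition in plain C. The struct `lea_parts`, the helper `make_parts`, and the function `should_split_lea` are hypothetical stand-ins: the real code inspects a decomposed address built from rtx values inside `ix86_avoid_lea_for_addr`, and only the shape of the test is mirrored here.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified model of a decomposed lea address.
   GCC's real structure stores rtx values, not plain flags.  */
struct lea_parts {
  bool has_base;   /* a base register is present */
  int  scale;      /* index scale factor: 1, 2, 4 or 8 */
  bool has_disp;   /* a displacement is present */
  long disp;       /* displacement value, meaningful if has_disp */
};

/* Illustrative constructor that rejects scale factors x86 cannot encode.  */
struct lea_parts make_parts(bool has_base, int scale, bool has_disp, long disp)
{
  assert(scale == 1 || scale == 2 || scale == 4 || scale == 8);
  struct lea_parts p = { has_base, scale, has_disp, disp };
  return p;
}

/* Mirrors the clause added by the patch: split only at -Oz
   (optimize_size > 1), only for a scaled index with no base register,
   and only when the displacement is absent or zero.  */
bool should_split_lea(int optimize_size, struct lea_parts p)
{
  return optimize_size > 1
         && p.scale > 1
         && !p.has_base
         && (!p.has_disp || p.disp == 0);
}
```

In this model, `should_split_lea(2, make_parts(false, 4, false, 0))` is true, while the same address at -Os (`optimize_size == 1`), or with a base register or a nonzero displacement, is left as a single lea.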