Message ID | CAAs8HmxC4KQSc5EWiux7syOj8+ghs_LY8zY-D4-Wjp+ZhHiDuw@mail.gmail.com |
---|---|
State | New |
On Thu, Apr 30, 2015 at 05:31:30PM -0700, Sriraman Tallam wrote:
> This comes with caveats. This cannot be generally done for all
> functions marked extern as it is impossible for the compiler to say if
> a function is "truly extern" (defined in a shared library). If a
> function is not truly extern(ends up defined in the final executable),
> then calling it indirectly is a performance penalty as it could have
> been a direct call. Further, the newly created GOT entries are fixed
> up at start-up and do not get lazily bound.

I've considered something similar for PowerPC (but didn't consider
doing so for a subset of calls). Losing lazy symbol resolution is
a real problem. The other problem you cite of indirect calls that
could be direct can be fixed in the linker relatively easily.
Edit this code

   0:   ff 15 00 00 00 00       callq  *0x0(%rip)        # 0x6
                        2: R_X86_64_GOTPCREL    foo-0x4
   6:   ff 25 00 00 00 00       jmpq   *0x0(%rip)        # 0xc
                        8: R_X86_64_GOTPCREL    foo-0x4

to this

   c:   e8 00 00 00 00          callq  0x11
                        d: R_X86_64_PC32        foo-0x4
  11:   90                      nop
  12:   e9 00 00 00 00          jmpq   0x17
                        13: R_X86_64_PC32       foo-0x4
  17:   90                      nop

You may need to have gcc or gas add a marker reloc to say exactly
where an instruction starts.
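The byte-level edit described above can be sketched in C. This is only an illustration of the rewrite, not code from any linker: the function name `relax_got_branch` and its parameters are hypothetical, and the check of the two bytes preceding the relocation is exactly the guess that a marker reloc would make unnecessary.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: rewrite "call/jmp *foo@GOTPCREL(%rip)" in place
   into "call/jmp foo; nop".  RELOC_OFF is the offset of the 4-byte
   displacement that carried the R_X86_64_GOTPCREL relocation.  On
   success, *NEW_RELOC_OFF is where an R_X86_64_PC32 relocation would
   have to be applied instead.  */
static bool
relax_got_branch (uint8_t *code, size_t code_len, size_t reloc_off,
                  size_t *new_reloc_off)
{
  assert (code != NULL && new_reloc_off != NULL);

  /* Need two opcode bytes before the field and four bytes in it.  */
  if (reloc_off < 2 || reloc_off + 4 > code_len)
    return false;

  uint8_t *insn = code + reloc_off - 2;
  uint8_t direct_opcode;

  /* Assumption: these two bytes really are the start of an instruction.
     Without a marker reloc this is only a guess.  */
  if (insn[0] != 0xff)
    return false;
  if (insn[1] == 0x15)          /* ff 15: call *disp32(%rip) */
    direct_opcode = 0xe8;       /* e8: call rel32 */
  else if (insn[1] == 0x25)     /* ff 25: jmp *disp32(%rip) */
    direct_opcode = 0xe9;       /* e9: jmp rel32 */
  else
    return false;

  insn[0] = direct_opcode;
  memset (insn + 1, 0, 4);      /* rel32 field, filled in by the reloc */
  insn[5] = 0x90;               /* pad the spare byte with a nop */
  *new_reloc_off = reloc_off - 1;
  return true;
}
```

The six-byte indirect form shrinks to a five-byte direct form, so the relocation moves one byte earlier and the last byte becomes a nop, matching the listing above.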
On Thu, Apr 30, 2015 at 8:21 PM, Alan Modra <amodra@gmail.com> wrote:
> On Thu, Apr 30, 2015 at 05:31:30PM -0700, Sriraman Tallam wrote:
>> This comes with caveats. This cannot be generally done for all
>> functions marked extern as it is impossible for the compiler to say if
>> a function is "truly extern" (defined in a shared library). If a
>> function is not truly extern(ends up defined in the final executable),
>> then calling it indirectly is a performance penalty as it could have
>> been a direct call. Further, the newly created GOT entries are fixed
>> up at start-up and do not get lazily bound.
>
> I've considered something similar for PowerPC (but didn't consider
> doing so for a subset of calls). Losing lazy symbol resolution is
> a real problem.

With the -fno-plt= option, you are choosing functions that are hot and
for which the PLT must be avoided. Losing lazy binding on these should
be perfectly fine because they would be called.

Thanks
Sri

> The other problem you cite of indirect calls that
> could be direct can be fixed in the linker relatively easily.
> Edit this code
>    0:   ff 15 00 00 00 00       callq  *0x0(%rip)        # 0x6
>                         2: R_X86_64_GOTPCREL    foo-0x4
>    6:   ff 25 00 00 00 00       jmpq   *0x0(%rip)        # 0xc
>                         8: R_X86_64_GOTPCREL    foo-0x4
> to this
>    c:   e8 00 00 00 00          callq  0x11
>                         d: R_X86_64_PC32        foo-0x4
>   11:   90                      nop
>   12:   e9 00 00 00 00          jmpq   0x17
>                         13: R_X86_64_PC32       foo-0x4
>   17:   90                      nop
> You may need to have gcc or gas add a marker reloc to say exactly
> where an instruction starts.
>
> --
> Alan Modra
> Australia Development Lab, IBM
Sriraman Tallam <tmsriram@google.com> writes:
>
> This comes with caveats. This cannot be generally done for all
> functions marked extern as it is impossible for the compiler to say if
> a function is "truly extern" (defined in a shared library). If a
> function is not truly extern(ends up defined in the final executable),
> then calling it indirectly is a performance penalty as it could have
> been a direct call. Further, the newly created GOT entries are fixed
> up at start-up and do not get lazily bound.

This means you need to make it depend on -fno-semantic-interposition?

> Given this, I propose adding a new option called
> -fno-plt=<function-name> to the compiler. This tells the compiler
> that we know that the function is truly extern and we want the
> indirect call only for these call-sites. I have attached a patch that
> adds -fno-plt= to GCC. Any number of "-fno-plt=" can be specified and
> all call-sites corresponding to these named functions will be done
> indirectly using the mechanism described above without the use of a
> PLT stub.

The argument seems awkward. The command line may get very long.
Better an attribute?

Longer term it would probably be better to support it properly
in the linker.

-Andi
On Fri, May 1, 2015 at 8:01 AM, Andi Kleen <andi@firstfloor.org> wrote:
> Sriraman Tallam <tmsriram@google.com> writes:
>>
>> This comes with caveats. This cannot be generally done for all
>> functions marked extern as it is impossible for the compiler to say if
>> a function is "truly extern" (defined in a shared library). If a
>> function is not truly extern(ends up defined in the final executable),
>> then calling it indirectly is a performance penalty as it could have
>> been a direct call. Further, the newly created GOT entries are fixed
>> up at start-up and do not get lazily bound.
>
> This means you need to make it depend on -fno-semantic-interposition?
>
>> Given this, I propose adding a new option called
>> -fno-plt=<function-name> to the compiler. This tells the compiler
>> that we know that the function is truly extern and we want the
>> indirect call only for these call-sites. I have attached a patch that
>> adds -fno-plt= to GCC. Any number of "-fno-plt=" can be specified and
>> all call-sites corresponding to these named functions will be done
>> indirectly using the mechanism described above without the use of a
>> PLT stub.
>
> The argument seems awkward. The command line may get very long.
> Better an attribute?

They are complementary. Perhaps another option like the linker's
--dynamic-list=<> that can take a file specifying the list of symbols.

>
> Longer term it would probably be better to support it properly
> in the linker.
>

The linker solution has its own downside -- it requires reserving more
space conservatively for many callsites which end up being direct
calls.

David

> -Andi
>
> --
> ak@linux.intel.com -- Speaking for myself only
On Fri, May 1, 2015 at 9:19 AM, Xinliang David Li <davidxl@google.com> wrote:
> On Fri, May 1, 2015 at 8:01 AM, Andi Kleen <andi@firstfloor.org> wrote:
>> Sriraman Tallam <tmsriram@google.com> writes:
>>>
>>> This comes with caveats. This cannot be generally done for all
>>> functions marked extern as it is impossible for the compiler to say if
>>> a function is "truly extern" (defined in a shared library). If a
>>> function is not truly extern(ends up defined in the final executable),
>>> then calling it indirectly is a performance penalty as it could have
>>> been a direct call. Further, the newly created GOT entries are fixed
>>> up at start-up and do not get lazily bound.
>>
>> This means you need to make it depend on -fno-semantic-interposition?
>>
>>> Given this, I propose adding a new option called
>>> -fno-plt=<function-name> to the compiler. This tells the compiler
>>> that we know that the function is truly extern and we want the
>>> indirect call only for these call-sites. I have attached a patch that
>>> adds -fno-plt= to GCC. Any number of "-fno-plt=" can be specified and
>>> all call-sites corresponding to these named functions will be done
>>> indirectly using the mechanism described above without the use of a
>>> PLT stub.
>>
>> The argument seems awkward. The command line may get very long.
>> Better an attribute?
>
> They are complementary. Perhaps another option like the linker's
> --dynamic-list=<> that can take a file specifying the list of symbols.
>
>>
>> Longer term it would probably be better to support it properly
>> in the linker.
>>
>
> The linker solution has its own downside -- it requires reserving more
> space conservatively for many callsites which end up being direct
> calls.
>

Can we do it automatically for LTO?
yes -- it is good to turn this on by default in LTO mode without
requiring user to specify the option.

David

On Fri, May 1, 2015 at 9:23 AM, H.J. Lu <hjl.tools@gmail.com> wrote:
> On Fri, May 1, 2015 at 9:19 AM, Xinliang David Li <davidxl@google.com> wrote:
>> On Fri, May 1, 2015 at 8:01 AM, Andi Kleen <andi@firstfloor.org> wrote:
>>> Sriraman Tallam <tmsriram@google.com> writes:
>>>>
>>>> This comes with caveats. This cannot be generally done for all
>>>> functions marked extern as it is impossible for the compiler to say if
>>>> a function is "truly extern" (defined in a shared library). If a
>>>> function is not truly extern(ends up defined in the final executable),
>>>> then calling it indirectly is a performance penalty as it could have
>>>> been a direct call. Further, the newly created GOT entries are fixed
>>>> up at start-up and do not get lazily bound.
>>>
>>> This means you need to make it depend on -fno-semantic-interposition?
>>>
>>>> Given this, I propose adding a new option called
>>>> -fno-plt=<function-name> to the compiler. This tells the compiler
>>>> that we know that the function is truly extern and we want the
>>>> indirect call only for these call-sites. I have attached a patch that
>>>> adds -fno-plt= to GCC. Any number of "-fno-plt=" can be specified and
>>>> all call-sites corresponding to these named functions will be done
>>>> indirectly using the mechanism described above without the use of a
>>>> PLT stub.
>>>
>>> The argument seems awkward. The command line may get very long.
>>> Better an attribute?
>>
>> They are complementary. Perhaps another option like the linker's
>> --dynamic-list=<> that can take a file specifying the list of symbols.
>>
>>>
>>> Longer term it would probably be better to support it properly
>>> in the linker.
>>>
>>
>> The linker solution has its own downside -- it requires reserving more
>> space conservatively for many callsites which end up being direct
>> calls.
>>
>
> Can we do it automatically for LTO?
>
>
> --
> H.J.
On Fri, May 1, 2015 at 8:01 AM, Andi Kleen <andi@firstfloor.org> wrote:
> Sriraman Tallam <tmsriram@google.com> writes:
>>
>> This comes with caveats. This cannot be generally done for all
>> functions marked extern as it is impossible for the compiler to say if
>> a function is "truly extern" (defined in a shared library). If a
>> function is not truly extern(ends up defined in the final executable),
>> then calling it indirectly is a performance penalty as it could have
>> been a direct call. Further, the newly created GOT entries are fixed
>> up at start-up and do not get lazily bound.
>
> This means you need to make it depend on -fno-semantic-interposition?

Please correct me if I am wrong, but I do not see any dependency on
semantic interposition. The GOT entry created for the function pointer
(whose PLT has been eliminated) has a dynamic relocation against it to
fix up the address at run-time, and the dynamic linker fills it with the
right address. This is not a new mechanism. The same mechanism is used
when we access function pointers with PIE, for instance.

Thanks
Sri

>
>> Given this, I propose adding a new option called
>> -fno-plt=<function-name> to the compiler. This tells the compiler
>> that we know that the function is truly extern and we want the
>> indirect call only for these call-sites. I have attached a patch that
>> adds -fno-plt= to GCC. Any number of "-fno-plt=" can be specified and
>> all call-sites corresponding to these named functions will be done
>> indirectly using the mechanism described above without the use of a
>> PLT stub.
>
> The argument seems awkward. The command line may get very long.
> Better an attribute?
>
> Longer term it would probably be better to support it properly
> in the linker.
>
> -Andi
>
> --
> ak@linux.intel.com -- Speaking for myself only
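The function-pointer case mentioned above has a rough source-level analogue, sketched here in C. The names `memcmp_slot` and `compare_via_slot` are made up for illustration. The assumption is a PIE build against a shared libc, where the pointer's initial value has to be supplied by a dynamic relocation at start-up; the exact relocation and code sequence depend on the toolchain and are not shown.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef int (*memcmp_fn) (const void *, const void *, size_t);

/* Stand-in for a GOT entry: a pointer in data whose value is the
   address of an external function.  It is volatile so the compiler
   has to load it and emit an indirect call rather than folding it
   back into a direct call to memcmp.  */
static volatile memcmp_fn memcmp_slot = memcmp;

/* Calls memcmp through the pointer, with no PLT stub on the path
   from this call site to the callee.  */
static int
compare_via_slot (const void *a, const void *b, size_t n)
{
  memcmp_fn fn = memcmp_slot;
  assert (fn != NULL);
  return fn (a, b, n);
}
```

Whatever address ends up in the slot is the one the dynamic linker resolved for the symbol, so an interposed definition would be picked up in the same way as for any other address-taken function.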
On Fri, May 1, 2015 at 9:26 AM, Xinliang David Li <davidxl@google.com> wrote:
> yes -- it is good to turn this on by default in LTO mode without
> requiring user to specify the option.

Yes, with LTO, we would exactly know what the "truly extern" functions
are, and PLT stubs can be eliminated for all extern functions when early
binding is specified. With lazy binding, we can eliminate the PLT stubs
selectively for the hot extern functions.

Thanks
Sri

>
> David
>
> On Fri, May 1, 2015 at 9:23 AM, H.J. Lu <hjl.tools@gmail.com> wrote:
>> On Fri, May 1, 2015 at 9:19 AM, Xinliang David Li <davidxl@google.com> wrote:
>>> On Fri, May 1, 2015 at 8:01 AM, Andi Kleen <andi@firstfloor.org> wrote:
>>>> Sriraman Tallam <tmsriram@google.com> writes:
>>>>>
>>>>> This comes with caveats. This cannot be generally done for all
>>>>> functions marked extern as it is impossible for the compiler to say if
>>>>> a function is "truly extern" (defined in a shared library). If a
>>>>> function is not truly extern(ends up defined in the final executable),
>>>>> then calling it indirectly is a performance penalty as it could have
>>>>> been a direct call. Further, the newly created GOT entries are fixed
>>>>> up at start-up and do not get lazily bound.
>>>>
>>>> This means you need to make it depend on -fno-semantic-interposition?
>>>>
>>>>> Given this, I propose adding a new option called
>>>>> -fno-plt=<function-name> to the compiler. This tells the compiler
>>>>> that we know that the function is truly extern and we want the
>>>>> indirect call only for these call-sites. I have attached a patch that
>>>>> adds -fno-plt= to GCC. Any number of "-fno-plt=" can be specified and
>>>>> all call-sites corresponding to these named functions will be done
>>>>> indirectly using the mechanism described above without the use of a
>>>>> PLT stub.
>>>>
>>>> The argument seems awkward. The command line may get very long.
>>>> Better an attribute?
>>>
>>> They are complementary. Perhaps another option like the linker's
>>> --dynamic-list=<> that can take a file specifying the list of symbols.
>>>
>>>>
>>>> Longer term it would probably be better to support it properly
>>>> in the linker.
>>>>
>>>
>>> The linker solution has its own downside -- it requires reserving more
>>> space conservatively for many callsites which end up being direct
>>> calls.
>>>
>>
>> Can we do it automatically for LTO?
>>
>>
>> --
>> H.J.
On Fri, May 01, 2015 at 11:05:58AM -0700, Sriraman Tallam wrote:
> On Fri, May 1, 2015 at 9:26 AM, Xinliang David Li <davidxl@google.com> wrote:
> > yes -- it is good to turn this on by default in LTO mode without
> > requiring user to specify the option.
>
> Yes, with LTO, we would exactly know what the "truly extern" functions
> are

... unless a function is overridden somewhere else at dynamic link time.

That's why you may need -fno-semantic...

-Andi
Hi,

On Thu, 30 Apr 2015, Sriraman Tallam wrote:

> We noticed that one of our benchmarks sped-up by ~1% when we eliminated
> PLT stubs for some of the hot external library functions like memcmp,
> pow. The win was from better icache and itlb performance. The main
> reason was that the PLT stubs had no spatial locality with the
> call-sites. I have started looking at ways to tell the compiler to
> eliminate PLT stubs (in-effect inline them) for specified external
> functions, for x86_64. I have a proposal and a patch and I would like to
> hear what you think.
>
> This comes with caveats. This cannot be generally done for all
> functions marked extern as it is impossible for the compiler to say if a
> function is "truly extern" (defined in a shared library). If a function
> is not truly extern(ends up defined in the final executable), then
> calling it indirectly is a performance penalty as it could have been a
> direct call.

This can be fixed by Alan's idea.

> Further, the newly created GOT entries are fixed up at
> start-up and do not get lazily bound.

And this can be fixed by some enhancements in the linker and dynamic
linker. The idea is to still generate a PLT stub and make its GOT entry
point to it initially (like a normal got.plt slot). Then the first
indirect call will use the address of the PLT entry (starting lazy
resolution) and update the GOT slot with the real address, so further
indirect calls will go directly to the function.

This requires a new asm marker (and hence a new reloc) as normally if
there's a GOT slot it's filled with the real symbol's address, unlike if
there's only a got.plt slot. E.g. a

  call *foo@GOTPLT(%rip)

would generate a GOT slot (and fill its address into the above call
insn), but generate a JUMP_SLOT reloc in the final executable, not a
GLOB_DAT one.

Ciao,
Michael.
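The slot behaviour described above can be simulated in plain C to make the control flow concrete. This is only a model under stated assumptions: `foo_slot` stands in for the GOT slot, `foo_lazy_stub` for the PLT entry plus the dynamic linker's resolver, and `real_foo` for the function in the shared library. All of these names are hypothetical and nothing here touches a real GOT.

```c
#include <assert.h>
#include <stddef.h>

typedef int (*foo_fn) (int);

/* Stand-in for the real definition living in a shared library.  */
static int
real_foo (int x)
{
  return x + 1;
}

static int foo_lazy_stub (int x);

/* Stand-in for the GOT slot: it starts out pointing at the stub,
   like a normal got.plt slot points back into its PLT entry.  */
static foo_fn foo_slot = foo_lazy_stub;

/* Counts how often resolution ran; it should happen exactly once.  */
static int resolve_count;

/* Stand-in for PLT entry + resolver: look up the symbol, patch the
   slot with the real address, then continue to the target.  */
static int
foo_lazy_stub (int x)
{
  resolve_count++;
  foo_slot = real_foo;
  return real_foo (x);
}

/* Models "call *foo@GOTPLT(%rip)": always an indirect call through
   the slot, whatever it currently holds.  */
static int
call_foo (int x)
{
  assert (foo_slot != NULL);
  return foo_slot (x);
}
```

The first `call_foo` goes through the stub and pays for resolution; every later call loads the patched slot and lands in `real_foo` directly, which is the property the JUMP_SLOT-style relocation is meant to preserve.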
The use case proposed by Sri allows user to selectively eliminate PLT
overhead for hot external calls only. In such scenarios, lazy binding
won't be something that matters to the user.

David

On Mon, May 4, 2015 at 7:45 AM, Michael Matz <matz@suse.de> wrote:
> Hi,
>
> On Thu, 30 Apr 2015, Sriraman Tallam wrote:
>
>> We noticed that one of our benchmarks sped-up by ~1% when we eliminated
>> PLT stubs for some of the hot external library functions like memcmp,
>> pow. The win was from better icache and itlb performance. The main
>> reason was that the PLT stubs had no spatial locality with the
>> call-sites. I have started looking at ways to tell the compiler to
>> eliminate PLT stubs (in-effect inline them) for specified external
>> functions, for x86_64. I have a proposal and a patch and I would like to
>> hear what you think.
>>
>> This comes with caveats. This cannot be generally done for all
>> functions marked extern as it is impossible for the compiler to say if a
>> function is "truly extern" (defined in a shared library). If a function
>> is not truly extern(ends up defined in the final executable), then
>> calling it indirectly is a performance penalty as it could have been a
>> direct call.
>
> This can be fixed by Alan's idea.
>
>> Further, the newly created GOT entries are fixed up at
>> start-up and do not get lazily bound.
>
> And this can be fixed by some enhancements in the linker and dynamic
> linker. The idea is to still generate a PLT stub and make its GOT entry
> point to it initially (like a normal got.plt slot). Then the first
> indirect call will use the address of PLT entry (starting lazy resolution)
> and update the GOT slot with the real address, so further indirect calls
> will directly go to the function.
>
> This requires a new asm marker (and hence new reloc) as normally if
> there's a GOT slot it's filled by the real symbols address, unlike if
> there's only a got.plt slot. E.g. a
>
>   call *foo@GOTPLT(%rip)
>
> would generate a GOT slot (and fill its address into above call insn), but
> generate a JUMP_SLOT reloc in the final executable, not a GLOB_DAT one.
>
>
> Ciao,
> Michael.
Hi,

On Mon, 4 May 2015, Xinliang David Li wrote:

> The use case proposed by Sri allows user to selectively eliminate PLT
> overhead for hot external calls only.

Yes, but only _because_ his approach doesn't use lazy binding. With the
full solution such restriction to a subset of functions isn't necessary.
And we should strive for going the full way, instead of adding hacks,
shouldn't we?

Ciao,
Michael.
yes -- a full solution that supports lazy binding will be nice.

David

On Mon, May 4, 2015 at 9:58 AM, Michael Matz <matz@suse.de> wrote:
> Hi,
>
> On Mon, 4 May 2015, Xinliang David Li wrote:
>
>> The use case proposed by Sri allows user to selectively eliminate PLT
>> overhead for hot external calls only.
>
> Yes, but only _because_ his approach doesn't use lazy binding. With the
> full solution such restriction to a subset of functions isn't necessary.
> And we should strive for going the full way, instead of adding hacks,
> shouldn't we?
>
>
> Ciao,
> Michael.
On Mon, May 4, 2015 at 7:45 AM, Michael Matz <matz@suse.de> wrote:
> Hi,
>
> On Thu, 30 Apr 2015, Sriraman Tallam wrote:
>
>> We noticed that one of our benchmarks sped-up by ~1% when we eliminated
>> PLT stubs for some of the hot external library functions like memcmp,
>> pow. The win was from better icache and itlb performance. The main
>> reason was that the PLT stubs had no spatial locality with the
>> call-sites. I have started looking at ways to tell the compiler to
>> eliminate PLT stubs (in-effect inline them) for specified external
>> functions, for x86_64. I have a proposal and a patch and I would like to
>> hear what you think.
>>
>> This comes with caveats. This cannot be generally done for all
>> functions marked extern as it is impossible for the compiler to say if a
>> function is "truly extern" (defined in a shared library). If a function
>> is not truly extern(ends up defined in the final executable), then
>> calling it indirectly is a performance penalty as it could have been a
>> direct call.
>
> This can be fixed by Alan's idea.
>
>> Further, the newly created GOT entries are fixed up at
>> start-up and do not get lazily bound.
>
> And this can be fixed by some enhancements in the linker and dynamic
> linker. The idea is to still generate a PLT stub and make its GOT entry
> point to it initially (like a normal got.plt slot). Then the first
> indirect call will use the address of PLT entry (starting lazy resolution)
> and update the GOT slot with the real address, so further indirect calls
> will directly go to the function.
>
> This requires a new asm marker (and hence new reloc) as normally if
> there's a GOT slot it's filled by the real symbols address, unlike if
> there's only a got.plt slot. E.g. a
>
>   call *foo@GOTPLT(%rip)
>
> would generate a GOT slot (and fill its address into above call insn), but
> generate a JUMP_SLOT reloc in the final executable, not a GLOB_DAT one.
>

I added the "relax" prefix support to the x86 assembler on the
users/hjl/relax branch at

https://sourceware.org/git/?p=binutils-gdb.git;a=summary

[hjl@gnu-tools-1 relax-3]$ cat r.S
        .text
        relax jmp foo
        relax call foo
        relax jmp foo@plt
        relax call foo@plt
[hjl@gnu-tools-1 relax-3]$ ./as -o r.o r.S
[hjl@gnu-tools-1 relax-3]$ ./objdump -drw r.o

r.o:     file format elf64-x86-64

Disassembly of section .text:

0000000000000000 <.text>:
   0:   66 e9 00 00 00 00       data16 jmpq 0x6     2: R_X86_64_RELAX_PC32 foo-0x4
   6:   66 e8 00 00 00 00       data16 callq 0xc    8: R_X86_64_RELAX_PC32 foo-0x4
   c:   66 e9 00 00 00 00       data16 jmpq 0x12    e: R_X86_64_RELAX_PLT32 foo-0x4
  12:   66 e8 00 00 00 00       data16 callq 0x18   14: R_X86_64_RELAX_PLT32 foo-0x4
[hjl@gnu-tools-1 relax-3]$

Right now, the relax relocations are treated as PC32/PLT32 relocations.
I am working on linker support.
Index: common.opt
===================================================================
--- common.opt  (revision 222641)
+++ common.opt  (working copy)
@@ -1087,6 +1087,11 @@ fdbg-cnt=
 Common RejectNegative Joined Var(common_deferred_options) Defer
 -fdbg-cnt=<counter>:<limit>[,<counter>:<limit>,...]    Set the debug counter limit.
 
+fno-plt=
+Common RejectNegative Joined Var(common_deferred_options) Defer
+-fno-plt=<symbol1>     Avoid going through the PLT when calling the specified function.
+Allow multiple instances of this option with different function names.
+
 fdebug-prefix-map=
 Common Joined RejectNegative Var(common_deferred_options) Defer
 Map one directory name to another in debug information
Index: config/i386/i386.c
===================================================================
--- config/i386/i386.c  (revision 222641)
+++ config/i386/i386.c  (working copy)
@@ -25282,6 +25282,25 @@ ix86_expand_call (rtx retval, rtx fnaddr, rtx call
   return call;
 }
 
+extern htab_t avoid_plt_fnsymbol_names_tab;
+/* If the function referenced by call_op is to a external function
+   and calls via PLT must be avoided as specified by -fno-plt=, then
+   return true.  */
+
+static int
+avoid_plt_to_call(rtx call_op)
+{
+  const char *name;
+  if (GET_CODE (call_op) != SYMBOL_REF
+      || SYMBOL_REF_LOCAL_P (call_op)
+      || avoid_plt_fnsymbol_names_tab == NULL)
+    return 0;
+  name = XSTR (call_op, 0);
+  if (htab_find_slot (avoid_plt_fnsymbol_names_tab, name, NO_INSERT) != NULL)
+    return 1;
+  return 0;
+}
+
 /* Output the assembly for a call instruction.  */
 
 const char *
@@ -25294,7 +25313,12 @@ ix86_output_call_insn (rtx insn, rtx call_op)
   if (SIBLING_CALL_P (insn))
     {
       if (direct_p)
-        xasm = "jmp\t%P0";
+        {
+          if (avoid_plt_to_call (call_op))
+            xasm = "jmp\t*%p0@GOTPCREL(%%rip)";
+          else
+            xasm = "jmp\t%P0";
+        }
       /* SEH epilogue detection requires the indirect branch case
          to include REX.W.  */
       else if (TARGET_SEH)
@@ -25346,9 +25370,15 @@ ix86_output_call_insn (rtx insn, rtx call_op)
     }
 
   if (direct_p)
-    xasm = "call\t%P0";
+    {
+      if (avoid_plt_to_call (call_op))
+        xasm = "call\t*%p0@GOTPCREL(%%rip)";
+      else
+        xasm = "call\t%P0";
+    }
   else
     xasm = "call\t%A0";
+
 
   output_asm_insn (xasm, &call_op);
 
Index: opts-global.c
===================================================================
--- opts-global.c       (revision 222641)
+++ opts-global.c       (working copy)
@@ -47,6 +47,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "xregex.h"
 #include "attribs.h"
 #include "stringpool.h"
+#include "hash-table.h"
 
 typedef const char *const_char_p; /* For DEF_VEC_P.  */
 
@@ -420,6 +421,17 @@ decode_options (struct gcc_options *opts, struct g
   finish_options (opts, opts_set, loc);
 }
 
+/* Helper function for the hash table that compares the
+   existing entry (S1) with the given string (S2).  */
+
+static int
+htab_str_eq (const void *s1, const void *s2)
+{
+  return !strcmp ((const char *)s1, (const char *) s2);
+}
+
+htab_t avoid_plt_fnsymbol_names_tab = NULL;
+
 /* Process common options that have been deferred until after the
    handlers have been called for all options.  */
 
@@ -539,6 +551,15 @@ handle_common_deferred_options (void)
           stack_limit_rtx = gen_rtx_SYMBOL_REF (Pmode, ggc_strdup (opt->arg));
           break;
 
+        case OPT_fno_plt_:
+          void **slot;
+          if (avoid_plt_fnsymbol_names_tab == NULL)
+            avoid_plt_fnsymbol_names_tab = htab_create (10, htab_hash_string,
+                                                        htab_str_eq, NULL);
+          slot = htab_find_slot (avoid_plt_fnsymbol_names_tab, opt->arg, INSERT);
+          *slot = (void *)opt->arg;
+          break;
+
         default:
           gcc_unreachable ();
         }
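For readers without a GCC tree at hand, the decision the patch makes at each call site can be sketched as a standalone C program. This is not the patch's code: it swaps libiberty's hash table for a small fixed array with a linear scan, and `no_plt_add`, `no_plt_contains`, `call_template`, and `MAX_NO_PLT_NAMES` are names invented for the sketch. Only the two output templates are taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical capacity; the patch uses a growable hash table.  */
#define MAX_NO_PLT_NAMES 64

/* Names collected from repeated -fno-plt=<name> options.  */
static const char *no_plt_names[MAX_NO_PLT_NAMES];
static size_t no_plt_count;

/* True if NAME was registered.  Linear scan for simplicity.  */
static bool
no_plt_contains (const char *name)
{
  assert (name != NULL);
  for (size_t i = 0; i < no_plt_count; i++)
    if (strcmp (no_plt_names[i], name) == 0)
      return true;
  return false;
}

/* Record one -fno-plt=<name> argument.  Returns false when full.  */
static bool
no_plt_add (const char *name)
{
  assert (name != NULL);
  if (no_plt_contains (name))
    return true;
  if (no_plt_count == MAX_NO_PLT_NAMES)
    return false;
  no_plt_names[no_plt_count++] = name;
  return true;
}

/* Pick the output template for a direct call, mirroring the check
   in the patch: only non-local, explicitly named callees get the
   indirect call through the GOT.  */
static const char *
call_template (const char *callee, bool is_local)
{
  if (!is_local && no_plt_contains (callee))
    return "call\t*%p0@GOTPCREL(%%rip)";
  return "call\t%P0";
}
```

With `memcmp` registered, a non-local call to it selects the GOTPCREL template, while any other callee, or a local definition of the same name, keeps the ordinary direct call.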