From patchwork Fri May 1 00:31:30 2015
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Patchwork-Submitter: Sriraman Tallam
X-Patchwork-Id: 466778
Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm
Delivered-To: mailing list gcc-patches@gcc.gnu.org
Date: Thu, 30 Apr 2015 17:31:30 -0700
Subject: [RFC][PATCH][X86_64] Eliminate PLT stubs for specified external functions via -fno-plt=
From: Sriraman Tallam
To: GCC Patches, "H.J. Lu", David Li

Hi,

We noticed that one of our benchmarks sped up by ~1% when we eliminated
PLT stubs for some of the hot external library functions such as memcmp
and pow. The win came from better icache and iTLB performance, mainly
because the PLT stubs had no spatial locality with the call sites.

I have started looking at ways to tell the compiler to eliminate PLT
stubs (in effect, inline them) for specified external functions on
x86_64. I have a proposal and a patch, and I would like to hear what
you think.

Here is a summary of what happens currently. A call to an external
function is a direct call, but its target is the PLT stub, which then
jumps indirectly through the GOT entry. If the direct call to the PLT
stub is replaced with an indirect call through a GOT entry that holds
the address of the external function, the PLT stub is no longer needed.

Here is an example:

foo.cc
======

extern int foo ();  // Truly external library function, defined in a shared library.

int main()
{
  foo();
  ...
}

Currently, foo.s looks like this:

main:
  .....
  callq _Z3foov

but the linker redirects this call to the PLT stub of foo. Function
main calls the PLT stub directly:

0000000000400766 <main>:
  ....
  40076a: e8 71 fe ff ff   callq 4005e0 <_Z3foov@plt>

and the PLT stub does this:

00000000004005e0 <_Z3foov@plt>:
  4005e0: jmpq *0x15d2(%rip)   # 401bb8 <_GLOBAL_OFFSET_TABLE_+0x28>
  4005e6: pushq $0x2
  4005eb: jmpq 4005b0 <_init+0x28>

The GOT entry at address 0x401bb8 contains the address of foo, which
is lazily bound.

My proposal is to change foo.s to look like this:

  callq *_Z3foov@GOTPCREL(%rip)

which calls foo indirectly through a GOT entry that contains the
address of foo. The address in the GOT entry is fixed up at load time,
and the linker creates only one GOT entry per function irrespective of
the number of callers. a.out now looks like this:

0000000000400746 <main>:
  ...
  40074a: ff 15 20 14 00 00   callq *0x1420(%rip)   # 401b70 <_DYNAMIC+0x1e8>
  ...

Function main indirectly calls foo using the contents of location
0x401b70, which is actually a GOT entry containing the address of foo.
Notice that we have in effect inlined the PLT stub.

This comes with caveats. It cannot be done in general for all
functions marked extern, because the compiler cannot tell whether a
function is "truly extern" (defined in a shared library). If a
function is not truly extern (it ends up defined in the final
executable), then calling it indirectly is a performance penalty, as
it could have been a direct call. Further, the newly created GOT
entries are fixed up at start-up and are not lazily bound.

Given this, I propose adding a new option called -fno-plt= to the
compiler. It tells the compiler that we know the named function is
truly extern and that we want the indirect call only for its call
sites.

I have attached a patch that adds -fno-plt= to GCC. Any number of
"-fno-plt=" options can be specified, and all call sites of the named
functions will use the indirect call described above, without a PLT
stub.

Alternatively, we can do this entirely in the linker. We can introduce
a new relocation type to tell the linker to convert all direct calls
to truly extern functions into indirect calls via GOT entries. The GCC
patch just seems simpler.

We could also link statically, which we do not want, or copy the
specific external functions into our executable. That might work for
executable A, but a different set of external functions might be hot
for executable B. We want a more general solution.

Please let me know what you think.

Thanks
Sri

	* common.opt (-fno-plt=): New option.
	* config/i386/i386.c (avoid_plt_to_call): New function.
	(ix86_output_call_insn): Check if PLT needs to be avoided
	and call or jump indirectly if true.
	* opts-global.c (htab_str_eq): New function.
	(avoid_plt_fnsymbol_names_tab): New htab.
	(handle_common_deferred_options): Handle -fno-plt=

Index: common.opt
===================================================================
--- common.opt	(revision 222641)
+++ common.opt	(working copy)
@@ -1087,6 +1087,11 @@ fdbg-cnt=
 Common RejectNegative Joined Var(common_deferred_options) Defer
 -fdbg-cnt=<counter>:<limit>[,<counter>:<limit>,...]	Set the debug counter limit.
 
+fno-plt=
+Common RejectNegative Joined Var(common_deferred_options) Defer
+-fno-plt=	Avoid going through the PLT when calling the specified function.
+Allow multiple instances of this option with different function names.
+
 fdebug-prefix-map=
 Common Joined RejectNegative Var(common_deferred_options) Defer
 Map one directory name to another in debug information
Index: config/i386/i386.c
===================================================================
--- config/i386/i386.c	(revision 222641)
+++ config/i386/i386.c	(working copy)
@@ -25282,6 +25282,25 @@ ix86_expand_call (rtx retval, rtx fnaddr, rtx call
   return call;
 }
 
+extern htab_t avoid_plt_fnsymbol_names_tab;
+/* If the function referenced by call_op is to an external function
+   and calls via PLT must be avoided as specified by -fno-plt=, then
+   return true.  */
+
+static int
+avoid_plt_to_call(rtx call_op)
+{
+  const char *name;
+  if (GET_CODE (call_op) != SYMBOL_REF
+      || SYMBOL_REF_LOCAL_P (call_op)
+      || avoid_plt_fnsymbol_names_tab == NULL)
+    return 0;
+  name = XSTR (call_op, 0);
+  if (htab_find_slot (avoid_plt_fnsymbol_names_tab, name, NO_INSERT) != NULL)
+    return 1;
+  return 0;
+}
+
 /* Output the assembly for a call instruction.  */
 
 const char *
@@ -25294,7 +25313,12 @@ ix86_output_call_insn (rtx insn, rtx call_op)
   if (SIBLING_CALL_P (insn))
     {
       if (direct_p)
-        xasm = "jmp\t%P0";
+        {
+          if (avoid_plt_to_call (call_op))
+            xasm = "jmp\t*%p0@GOTPCREL(%%rip)";
+          else
+            xasm = "jmp\t%P0";
+        }
       /* SEH epilogue detection requires the indirect branch case
          to include REX.W.  */
       else if (TARGET_SEH)
@@ -25346,9 +25370,15 @@ ix86_output_call_insn (rtx insn, rtx call_op)
     }
 
   if (direct_p)
-    xasm = "call\t%P0";
+    {
+      if (avoid_plt_to_call (call_op))
+        xasm = "call\t*%p0@GOTPCREL(%%rip)";
+      else
+        xasm = "call\t%P0";
+    }
   else
     xasm = "call\t%A0";
+
 
   output_asm_insn (xasm, &call_op);
 
Index: opts-global.c
===================================================================
--- opts-global.c	(revision 222641)
+++ opts-global.c	(working copy)
@@ -47,6 +47,7 @@ along with GCC; see the file COPYING3.  If not see
 #include "xregex.h"
 #include "attribs.h"
 #include "stringpool.h"
+#include "hash-table.h"
 
 typedef const char *const_char_p; /* For DEF_VEC_P.  */
 
@@ -420,6 +421,17 @@ decode_options (struct gcc_options *opts, struct g
   finish_options (opts, opts_set, loc);
 }
 
+/* Helper function for the hash table that compares the
+   existing entry (S1) with the given string (S2).  */
+
+static int
+htab_str_eq (const void *s1, const void *s2)
+{
+  return !strcmp ((const char *)s1, (const char *) s2);
+}
+
+htab_t avoid_plt_fnsymbol_names_tab = NULL;
+
 /* Process common options that have been deferred until after the
    handlers have been called for all options.  */
 
@@ -539,6 +551,15 @@ handle_common_deferred_options (void)
             stack_limit_rtx = gen_rtx_SYMBOL_REF (Pmode, ggc_strdup (opt->arg));
           break;
 
+        case OPT_fno_plt_:
+          void **slot;
+          if (avoid_plt_fnsymbol_names_tab == NULL)
+            avoid_plt_fnsymbol_names_tab = htab_create (10, htab_hash_string,
+                                                        htab_str_eq, NULL);
+          slot = htab_find_slot (avoid_plt_fnsymbol_names_tab, opt->arg, INSERT);
+          *slot = (void *)opt->arg;
+          break;
+
         default:
           gcc_unreachable ();
         }
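
For readers who want to see the decision logic in isolation, here is a
minimal standalone sketch in plain C. It is not the patch and does not
use GCC internals: the fixed-size table and the names
register_no_plt_name, avoid_plt_for and call_text are hypothetical
stand-ins for the hash table, avoid_plt_to_call and the xasm selection,
and the returned strings are simplified placeholders rather than real
output templates. It only illustrates the rule that a symbol gets the
GOT-indirect form when it is not locally bound and its name was listed.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the set of names given via -fno-plt=.  */
#define MAX_NO_PLT_NAMES 16
const char *no_plt_names[MAX_NO_PLT_NAMES];
size_t no_plt_count;

/* Record one symbol name, as each occurrence of the option would.  */
void
register_no_plt_name (const char *name)
{
  assert (no_plt_count < MAX_NO_PLT_NAMES);
  no_plt_names[no_plt_count++] = name;
}

/* Return 1 if calls to NAME should bypass the PLT.  A symbol that
   binds locally never qualifies, mirroring the SYMBOL_REF_LOCAL_P
   check in the patch.  */
int
avoid_plt_for (const char *name, int binds_locally)
{
  size_t i;

  if (binds_locally)
    return 0;
  for (i = 0; i < no_plt_count; i++)
    if (strcmp (no_plt_names[i], name) == 0)
      return 1;
  return 0;
}

/* Pick a (simplified, illustrative) instruction form for a direct
   call to NAME.  SYM is a placeholder for the symbol operand.  */
const char *
call_text (const char *name, int binds_locally)
{
  return avoid_plt_for (name, binds_locally)
         ? "call *SYM@GOTPCREL(%rip)"
         : "call SYM";
}
```

With "_Z3foov" registered, call_text ("_Z3foov", 0) selects the
GOT-indirect form, while an unlisted or locally bound symbol keeps the
ordinary direct call. Note that the lookup is by assembler name, so in
this sketch a C++ function has to be listed by its mangled name.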