[v2,0/7] Implement inline static calls on PPC32 - v2

Message ID	cover.1657301423.git.christophe.leroy@csgroup.eu (mailing list archive)
Headers	show Return-Path: <linuxppc-dev-bounces+incoming=patchwork.ozlabs.org@lists.ozlabs.org> From: Christophe Leroy <christophe.leroy@csgroup.eu> To: Michael Ellerman <mpe@ellerman.id.au>, Nicholas Piggin <npiggin@gmail.com>, sv@linux.ibm.com, agust@denx.de, jpoimboe@kernel.org, peterz@infradead.org, jbaron@akamai.com, rostedt@goodmis.org, ardb@kernel.org, tglx@linutronix.de, mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com, hpa@zytor.com Subject: [PATCH v2 0/7] Implement inline static calls on PPC32 - v2 Date: Fri, 8 Jul 2022 19:31:19 +0200 Message-Id: <cover.1657301423.git.christophe.leroy@csgroup.eu> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: list Cc: chenzhongjin@huawei.com, x86@kernel.org, linuxppc-dev@lists.ozlabs.org, linux-kernel@vger.kernel.org Errors-To: linuxppc-dev-bounces+incoming=patchwork.ozlabs.org@lists.ozlabs.org Sender: "Linuxppc-dev" <linuxppc-dev-bounces+incoming=patchwork.ozlabs.org@lists.ozlabs.org>
Series	Implement inline static calls on PPC32 - v2 \| expand [v2,0/7] Implement inline static calls on PPC32 - v2 [v2,1/7] powerpc: Add missing asm/asm.h for objtool [v2,2/7] objtool/powerpc: Activate objtool on PPC32 [v2,3/7] objtool: Add architecture specific R_REL32 macro [v2,4/7] objtool/powerpc: Add necessary support for inline static calls [v2,5/7] init: Call static_call_init() from start_kernel() [v2,6/7] static_call_inline: Provide trampoline address when updating sites [v2,7/7] powerpc/static_call: Implement inline static calls

Message ID

cover.1657301423.git.christophe.leroy@csgroup.eu (mailing list archive)

Headers

From: Christophe Leroy <christophe.leroy@csgroup.eu>
To: Michael Ellerman <mpe@ellerman.id.au>,
 Nicholas Piggin <npiggin@gmail.com>,
        sv@linux.ibm.com, agust@denx.de, jpoimboe@kernel.org,
        peterz@infradead.org, jbaron@akamai.com, rostedt@goodmis.org,
        ardb@kernel.org, tglx@linutronix.de, mingo@redhat.com, bp@alien8.de,
        dave.hansen@linux.intel.com, hpa@zytor.com
Subject: [PATCH v2 0/7] Implement inline static calls on PPC32 - v2
Date: Fri,  8 Jul 2022 19:31:19 +0200
Message-Id: <cover.1657301423.git.christophe.leroy@csgroup.eu>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Precedence: list
Cc: chenzhongjin@huawei.com, x86@kernel.org, linuxppc-dev@lists.ozlabs.org,
 linux-kernel@vger.kernel.org
Errors-To: linuxppc-dev-bounces+incoming=patchwork.ozlabs.org@lists.ozlabs.org
Sender: "Linuxppc-dev"
 <linuxppc-dev-bounces+incoming=patchwork.ozlabs.org@lists.ozlabs.org>

Series

Implement inline static calls on PPC32 - v2 | expand

Message

Christophe Leroy July 8, 2022, 5:31 p.m. UTC

This series applies on top of the series v3 "objtool: Enable and
implement --mcount option on powerpc" [1] rebased on powerpc-next branch

A few modifications are done to core parts to enable powerpc
implementation:
- R_X86_64_PC32 is abstracted to R_REL32 so that it can then be
redefined as R_PPC_REL32.
- A call to static_call_init() is added to start_kernel() to avoid
every architecture to have to call it
- Trampoline address is provided to arch_static_call_transform() even
when setting a site to fallback on a call to the trampoline when the
target is too far.

[1] https://lore.kernel.org/lkml/70b6d08d-aced-7f4e-b958-a3c7ae1a9319@csgroup.eu/T/#rb3a073c54aba563a135fba891e0c34c46e47beef

Christophe Leroy (7):
  powerpc: Add missing asm/asm.h for objtool
  objtool/powerpc: Activate objtool on PPC32
  objtool: Add architecture specific R_REL32 macro
  objtool/powerpc: Add necessary support for inline static calls
  init: Call static_call_init() from start_kernel()
  static_call_inline: Provide trampoline address when updating sites
  powerpc/static_call: Implement inline static calls

 arch/powerpc/Kconfig                          |  3 +-
 arch/powerpc/include/asm/asm.h                |  7 +++
 arch/powerpc/include/asm/static_call.h        |  2 +
 arch/powerpc/kernel/cpu_setup_6xx.S           | 26 ++++++---
 arch/powerpc/kernel/cpu_setup_fsl_booke.S     |  8 ++-
 arch/powerpc/kernel/entry_32.S                |  8 ++-
 arch/powerpc/kernel/head_40x.S                |  5 +-
 arch/powerpc/kernel/head_8xx.S                |  5 +-
 arch/powerpc/kernel/head_book3s_32.S          | 29 +++++++---
 arch/powerpc/kernel/head_fsl_booke.S          |  5 +-
 arch/powerpc/kernel/static_call.c             | 56 ++++++++++++++-----
 arch/powerpc/kernel/swsusp_32.S               |  5 +-
 arch/powerpc/kvm/fpu.S                        | 17 ++++--
 arch/powerpc/platforms/52xx/lite5200_sleep.S  | 15 +++--
 arch/x86/kernel/static_call.c                 |  2 +-
 init/main.c                                   |  1 +
 kernel/static_call_inline.c                   |  2 +-
 tools/objtool/arch/powerpc/decode.c           | 16 ++++--
 tools/objtool/arch/powerpc/include/arch/elf.h |  1 +
 tools/objtool/arch/x86/include/arch/elf.h     |  1 +
 tools/objtool/check.c                         | 10 ++--
 tools/objtool/orc_gen.c                       |  2 +-
 22 files changed, 162 insertions(+), 64 deletions(-)
 create mode 100644 arch/powerpc/include/asm/asm.h

Comments

Ard Biesheuvel July 9, 2022, 6:52 a.m. UTC | #1

Hello Christophe,

On Fri, 8 Jul 2022 at 19:32, Christophe Leroy
<christophe.leroy@csgroup.eu> wrote:
>
> This series applies on top of the series v3 "objtool: Enable and
> implement --mcount option on powerpc" [1] rebased on powerpc-next branch
>
> A few modifications are done to core parts to enable powerpc
> implementation:
> - R_X86_64_PC32 is abstracted to R_REL32 so that it can then be
> redefined as R_PPC_REL32.
> - A call to static_call_init() is added to start_kernel() to avoid
> every architecture to have to call it
> - Trampoline address is provided to arch_static_call_transform() even
> when setting a site to fallback on a call to the trampoline when the
> target is too far.
>
> [1] https://lore.kernel.org/lkml/70b6d08d-aced-7f4e-b958-a3c7ae1a9319@csgroup.eu/T/#rb3a073c54aba563a135fba891e0c34c46e47beef
>
> Christophe Leroy (7):
>   powerpc: Add missing asm/asm.h for objtool
>   objtool/powerpc: Activate objtool on PPC32
>   objtool: Add architecture specific R_REL32 macro
>   objtool/powerpc: Add necessary support for inline static calls
>   init: Call static_call_init() from start_kernel()
>   static_call_inline: Provide trampoline address when updating sites
>   powerpc/static_call: Implement inline static calls
>

Could you quantify the performance gains of moving from out-of-line,
patched tail-call branch instructions to full-fledged inline static
calls? On x86, the retpoline problem makes this glaringly obvious, but
on other architectures, the complexity of supporting this model may
outweigh the performance advantages.

Christophe Leroy Sept. 1, 2022, 4:46 p.m. UTC | #2

Le 09/07/2022 à 08:52, Ard Biesheuvel a écrit :
> Hello Christophe,
> 
> On Fri, 8 Jul 2022 at 19:32, Christophe Leroy
> <christophe.leroy@csgroup.eu> wrote:
>>
>> This series applies on top of the series v3 "objtool: Enable and
>> implement --mcount option on powerpc" [1] rebased on powerpc-next branch
>>
>> A few modifications are done to core parts to enable powerpc
>> implementation:
>> - R_X86_64_PC32 is abstracted to R_REL32 so that it can then be
>> redefined as R_PPC_REL32.
>> - A call to static_call_init() is added to start_kernel() to avoid
>> every architecture to have to call it
>> - Trampoline address is provided to arch_static_call_transform() even
>> when setting a site to fallback on a call to the trampoline when the
>> target is too far.
>>
>> [1] https://lore.kernel.org/lkml/70b6d08d-aced-7f4e-b958-a3c7ae1a9319@csgroup.eu/T/#rb3a073c54aba563a135fba891e0c34c46e47beef
>>
>> Christophe Leroy (7):
>>    powerpc: Add missing asm/asm.h for objtool
>>    objtool/powerpc: Activate objtool on PPC32
>>    objtool: Add architecture specific R_REL32 macro
>>    objtool/powerpc: Add necessary support for inline static calls
>>    init: Call static_call_init() from start_kernel()
>>    static_call_inline: Provide trampoline address when updating sites
>>    powerpc/static_call: Implement inline static calls
>>
> 
> Could you quantify the performance gains of moving from out-of-line,
> patched tail-call branch instructions to full-fledged inline static
> calls? On x86, the retpoline problem makes this glaringly obvious, but
> on other architectures, the complexity of supporting this model may
> outweigh the performance advantages.

Surprisingly, I get worst performance with inline static call than with 
out of line static call:

No static call:

root@vgoip:~# perf stat -r 10 ./hackbench 1
Running with 1*40 (== 40) tasks.
Time: 17.186
Running with 1*40 (== 40) tasks.
Time: 16.738
Running with 1*40 (== 40) tasks.
Time: 16.579
Running with 1*40 (== 40) tasks.
Time: 16.838
Running with 1*40 (== 40) tasks.
Time: 16.652
Running with 1*40 (== 40) tasks.
Time: 17.380
Running with 1*40 (== 40) tasks.
Time: 16.630
Running with 1*40 (== 40) tasks.
Time: 16.850
Running with 1*40 (== 40) tasks.
Time: 17.161
Running with 1*40 (== 40) tasks.
Time: 16.722

  Performance counter stats for './hackbench 1' (10 runs):

           17019.55 msec task-clock                #    0.980 CPUs 
utilized            ( +-  0.51% )
               4847      context-switches          #  282.280 /sec 
               ( +-  6.32% )
                  0      cpu-migrations            #    0.000 /sec
               1249      page-faults               #   72.739 /sec 
               ( +-  0.49% )
         2245344976      cycles                    #    0.131 GHz 
               ( +-  0.51% )
          727437072      instructions              #    0.32  insn per 
cycle           ( +-  0.40% )
    <not supported>      branches
    <not supported>      branch-misses

            17.3585 +- 0.0909 seconds time elapsed  ( +-  0.52% )


Outline static call:

root@vgoip:~# perf stat -r 10 ./hackbench 1
Running with 1*40 (== 40) tasks.
Time: 15.892
Running with 1*40 (== 40) tasks.
Time: 15.731
Running with 1*40 (== 40) tasks.
Time: 15.507
Running with 1*40 (== 40) tasks.
Time: 16.269
Running with 1*40 (== 40) tasks.
Time: 15.934
Running with 1*40 (== 40) tasks.
Time: 16.048
Running with 1*40 (== 40) tasks.
Time: 15.700
Running with 1*40 (== 40) tasks.
Time: 16.063
Running with 1*40 (== 40) tasks.
Time: 15.852
Running with 1*40 (== 40) tasks.
Time: 15.941

  Performance counter stats for './hackbench 1' (10 runs):

           16227.32 msec task-clock                #    0.992 CPUs 
utilized            ( +-  0.42% )
               3732      context-switches          #  230.525 /sec 
               ( +-  6.42% )
                  0      cpu-migrations            #    0.000 /sec
               1244      page-faults               #   76.842 /sec 
               ( +-  0.11% )
         2141094288      cycles                    #    0.132 GHz 
               ( +-  0.42% )
          712598441      instructions              #    0.33  insn per 
cycle           ( +-  0.29% )
    <not supported>      branches
    <not supported>      branch-misses

            16.3539 +- 0.0675 seconds time elapsed  ( +-  0.41% )


Inline static call:

root@vgoip:~# perf stat -r 10 ./hackbench 1
Running with 1*40 (== 40) tasks.
Time: 17.512
Running with 1*40 (== 40) tasks.
Time: 17.240
Running with 1*40 (== 40) tasks.
Time: 16.901
Running with 1*40 (== 40) tasks.
Time: 17.125
Running with 1*40 (== 40) tasks.
Time: 17.262
Running with 1*40 (== 40) tasks.
Time: 17.298
Running with 1*40 (== 40) tasks.
Time: 17.182
Running with 1*40 (== 40) tasks.
Time: 16.988
Running with 1*40 (== 40) tasks.
Time: 17.102
Running with 1*40 (== 40) tasks.
Time: 16.669

  Performance counter stats for './hackbench 1' (10 runs):

           16976.76 msec task-clock                #    0.964 CPUs 
utilized            ( +-  0.44% )
               4760      context-switches          #  273.007 /sec 
               ( +-  4.93% )
                  0      cpu-migrations            #    0.000 /sec
               1252      page-faults               #   71.808 /sec 
               ( +-  0.35% )
         2239986112      cycles                    #    0.128 GHz 
               ( +-  0.44% )
          721540184      instructions              #    0.31  insn per 
cycle           ( +-  0.31% )
    <not supported>      branches
    <not supported>      branch-misses

            17.6126 +- 0.0762 seconds time elapsed  ( +-  0.43% )


Summary:

No static calls:
            17.3585 +- 0.0909 seconds time elapsed  ( +-  0.52% )
Out-of-line static calls:
            16.3539 +- 0.0675 seconds time elapsed  ( +-  0.41% )
Inline static calls:
            17.6126 +- 0.0762 seconds time elapsed  ( +-  0.43% )

Is there anything wrong with inline statica calls ?

Christophe

Benjamin Gray Sept. 8, 2022, 12:13 a.m. UTC | #3

On Thu, 2022-09-01 at 16:46 +0000, Christophe Leroy wrote:
> Surprisingly, I get worst performance with inline static call than
> with 
> out of line static call:

I'm not sure what hackbench is doing, but when microbenchmarking 64 bit
out-of-line calls in a loop I saw a similar thing where adding more
indirection improved the performance despite doing more work. The cause
seemed to be a combination of using older hardware and the target being
too short (just an integer increment). Moving to a newer machine and
adding a lot of NOPs to the target made the performance make sense.

Christophe Leroy Sept. 8, 2022, 6:11 a.m. UTC | #4

Le 08/09/2022 à 02:13, Benjamin Gray a écrit :
> On Thu, 2022-09-01 at 16:46 +0000, Christophe Leroy wrote:
>> Surprisingly, I get worst performance with inline static call than
>> with
>> out of line static call:
> 
> I'm not sure what hackbench is doing, but when microbenchmarking 64 bit
> out-of-line calls in a loop I saw a similar thing where adding more
> indirection improved the performance despite doing more work. The cause
> seemed to be a combination of using older hardware and the target being
> too short (just an integer increment). Moving to a newer machine and
> adding a lot of NOPs to the target made the performance make sense.

Yes might be.

I think I'll first do new tests with CONFIG_DEBUG_FORCE_FUNCTION_ALIGN_64B

Christophe