Message ID | cover.1657301423.git.christophe.leroy@csgroup.eu (mailing list archive) |
---|---|
Headers | show |
Series | Implement inline static calls on PPC32 - v2 | expand |
Hello Christophe, On Fri, 8 Jul 2022 at 19:32, Christophe Leroy <christophe.leroy@csgroup.eu> wrote: > > This series applies on top of the series v3 "objtool: Enable and > implement --mcount option on powerpc" [1] rebased on powerpc-next branch > > A few modifications are done to core parts to enable powerpc > implementation: > - R_X86_64_PC32 is abstracted to R_REL32 so that it can then be > redefined as R_PPC_REL32. > - A call to static_call_init() is added to start_kernel() to avoid > every architecture to have to call it > - Trampoline address is provided to arch_static_call_transform() even > when setting a site to fallback on a call to the trampoline when the > target is too far. > > [1] https://lore.kernel.org/lkml/70b6d08d-aced-7f4e-b958-a3c7ae1a9319@csgroup.eu/T/#rb3a073c54aba563a135fba891e0c34c46e47beef > > Christophe Leroy (7): > powerpc: Add missing asm/asm.h for objtool > objtool/powerpc: Activate objtool on PPC32 > objtool: Add architecture specific R_REL32 macro > objtool/powerpc: Add necessary support for inline static calls > init: Call static_call_init() from start_kernel() > static_call_inline: Provide trampoline address when updating sites > powerpc/static_call: Implement inline static calls > Could you quantify the performance gains of moving from out-of-line, patched tail-call branch instructions to full-fledged inline static calls? On x86, the retpoline problem makes this glaringly obvious, but on other architectures, the complexity of supporting this model may outweigh the performance advantages.
Le 09/07/2022 à 08:52, Ard Biesheuvel a écrit : > Hello Christophe, > > On Fri, 8 Jul 2022 at 19:32, Christophe Leroy > <christophe.leroy@csgroup.eu> wrote: >> >> This series applies on top of the series v3 "objtool: Enable and >> implement --mcount option on powerpc" [1] rebased on powerpc-next branch >> >> A few modifications are done to core parts to enable powerpc >> implementation: >> - R_X86_64_PC32 is abstracted to R_REL32 so that it can then be >> redefined as R_PPC_REL32. >> - A call to static_call_init() is added to start_kernel() to avoid >> every architecture to have to call it >> - Trampoline address is provided to arch_static_call_transform() even >> when setting a site to fallback on a call to the trampoline when the >> target is too far. >> >> [1] https://lore.kernel.org/lkml/70b6d08d-aced-7f4e-b958-a3c7ae1a9319@csgroup.eu/T/#rb3a073c54aba563a135fba891e0c34c46e47beef >> >> Christophe Leroy (7): >> powerpc: Add missing asm/asm.h for objtool >> objtool/powerpc: Activate objtool on PPC32 >> objtool: Add architecture specific R_REL32 macro >> objtool/powerpc: Add necessary support for inline static calls >> init: Call static_call_init() from start_kernel() >> static_call_inline: Provide trampoline address when updating sites >> powerpc/static_call: Implement inline static calls >> > > Could you quantify the performance gains of moving from out-of-line, > patched tail-call branch instructions to full-fledged inline static > calls? On x86, the retpoline problem makes this glaringly obvious, but > on other architectures, the complexity of supporting this model may > outweigh the performance advantages. Surprisingly, I get worst performance with inline static call than with out of line static call: No static call: root@vgoip:~# perf stat -r 10 ./hackbench 1 Running with 1*40 (== 40) tasks. Time: 17.186 Running with 1*40 (== 40) tasks. Time: 16.738 Running with 1*40 (== 40) tasks. Time: 16.579 Running with 1*40 (== 40) tasks. Time: 16.838 Running with 1*40 (== 40) tasks. Time: 16.652 Running with 1*40 (== 40) tasks. Time: 17.380 Running with 1*40 (== 40) tasks. Time: 16.630 Running with 1*40 (== 40) tasks. Time: 16.850 Running with 1*40 (== 40) tasks. Time: 17.161 Running with 1*40 (== 40) tasks. Time: 16.722 Performance counter stats for './hackbench 1' (10 runs): 17019.55 msec task-clock # 0.980 CPUs utilized ( +- 0.51% ) 4847 context-switches # 282.280 /sec ( +- 6.32% ) 0 cpu-migrations # 0.000 /sec 1249 page-faults # 72.739 /sec ( +- 0.49% ) 2245344976 cycles # 0.131 GHz ( +- 0.51% ) 727437072 instructions # 0.32 insn per cycle ( +- 0.40% ) <not supported> branches <not supported> branch-misses 17.3585 +- 0.0909 seconds time elapsed ( +- 0.52% ) Outline static call: root@vgoip:~# perf stat -r 10 ./hackbench 1 Running with 1*40 (== 40) tasks. Time: 15.892 Running with 1*40 (== 40) tasks. Time: 15.731 Running with 1*40 (== 40) tasks. Time: 15.507 Running with 1*40 (== 40) tasks. Time: 16.269 Running with 1*40 (== 40) tasks. Time: 15.934 Running with 1*40 (== 40) tasks. Time: 16.048 Running with 1*40 (== 40) tasks. Time: 15.700 Running with 1*40 (== 40) tasks. Time: 16.063 Running with 1*40 (== 40) tasks. Time: 15.852 Running with 1*40 (== 40) tasks. Time: 15.941 Performance counter stats for './hackbench 1' (10 runs): 16227.32 msec task-clock # 0.992 CPUs utilized ( +- 0.42% ) 3732 context-switches # 230.525 /sec ( +- 6.42% ) 0 cpu-migrations # 0.000 /sec 1244 page-faults # 76.842 /sec ( +- 0.11% ) 2141094288 cycles # 0.132 GHz ( +- 0.42% ) 712598441 instructions # 0.33 insn per cycle ( +- 0.29% ) <not supported> branches <not supported> branch-misses 16.3539 +- 0.0675 seconds time elapsed ( +- 0.41% ) Inline static call: root@vgoip:~# perf stat -r 10 ./hackbench 1 Running with 1*40 (== 40) tasks. Time: 17.512 Running with 1*40 (== 40) tasks. Time: 17.240 Running with 1*40 (== 40) tasks. Time: 16.901 Running with 1*40 (== 40) tasks. Time: 17.125 Running with 1*40 (== 40) tasks. Time: 17.262 Running with 1*40 (== 40) tasks. Time: 17.298 Running with 1*40 (== 40) tasks. Time: 17.182 Running with 1*40 (== 40) tasks. Time: 16.988 Running with 1*40 (== 40) tasks. Time: 17.102 Running with 1*40 (== 40) tasks. Time: 16.669 Performance counter stats for './hackbench 1' (10 runs): 16976.76 msec task-clock # 0.964 CPUs utilized ( +- 0.44% ) 4760 context-switches # 273.007 /sec ( +- 4.93% ) 0 cpu-migrations # 0.000 /sec 1252 page-faults # 71.808 /sec ( +- 0.35% ) 2239986112 cycles # 0.128 GHz ( +- 0.44% ) 721540184 instructions # 0.31 insn per cycle ( +- 0.31% ) <not supported> branches <not supported> branch-misses 17.6126 +- 0.0762 seconds time elapsed ( +- 0.43% ) Summary: No static calls: 17.3585 +- 0.0909 seconds time elapsed ( +- 0.52% ) Out-of-line static calls: 16.3539 +- 0.0675 seconds time elapsed ( +- 0.41% ) Inline static calls: 17.6126 +- 0.0762 seconds time elapsed ( +- 0.43% ) Is there anything wrong with inline statica calls ? Christophe
On Thu, 2022-09-01 at 16:46 +0000, Christophe Leroy wrote: > Surprisingly, I get worst performance with inline static call than > with > out of line static call: I'm not sure what hackbench is doing, but when microbenchmarking 64 bit out-of-line calls in a loop I saw a similar thing where adding more indirection improved the performance despite doing more work. The cause seemed to be a combination of using older hardware and the target being too short (just an integer increment). Moving to a newer machine and adding a lot of NOPs to the target made the performance make sense.
Le 08/09/2022 à 02:13, Benjamin Gray a écrit : > On Thu, 2022-09-01 at 16:46 +0000, Christophe Leroy wrote: >> Surprisingly, I get worst performance with inline static call than >> with >> out of line static call: > > I'm not sure what hackbench is doing, but when microbenchmarking 64 bit > out-of-line calls in a loop I saw a similar thing where adding more > indirection improved the performance despite doing more work. The cause > seemed to be a combination of using older hardware and the target being > too short (just an integer increment). Moving to a newer machine and > adding a lot of NOPs to the target made the performance make sense. Yes might be. I think I'll first do new tests with CONFIG_DEBUG_FORCE_FUNCTION_ALIGN_64B Christophe