[v7,5/5] powerpc/64s: enable MMU_LAZY_TLB_SHOOTDOWN

Message ID: 20230203071837.1136453-6-npiggin@gmail.com
State: Not Applicable
Series: shoot lazy tlbs (lazy tlb refcount scalability improvement)

Checks

Context                                       Check    Description
snowpatch_ozlabs/github-powerpc_selftests     success  Successfully ran 8 jobs.
snowpatch_ozlabs/github-powerpc_ppctests      success  Successfully ran 8 jobs.
snowpatch_ozlabs/github-powerpc_clang         success  Successfully ran 6 jobs.
snowpatch_ozlabs/github-powerpc_sparse        success  Successfully ran 4 jobs.
snowpatch_ozlabs/github-powerpc_kernel_qemu   success  Successfully ran 24 jobs.

Commit Message

Nicholas Piggin Feb. 3, 2023, 7:18 a.m. UTC
On a 16-socket 192-core POWER8 system running the context_switch1_threads
benchmark from will-it-scale (see earlier changelog), upstream can
achieve a rate of only about 1 million context switches per second, due
to contention on the mm refcount.

64s meets the prerequisites for CONFIG_MMU_LAZY_TLB_SHOOTDOWN, so enable
the option. This increases the above benchmark to 118 million context
switches per second.

This generates 314 additional IPI interrupts on a 144 CPU system doing
a kernel compile, which is in the noise in terms of kernel cycles.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 arch/powerpc/Kconfig | 1 +
 1 file changed, 1 insertion(+)
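
The trade-off described in the commit message (per-switch updates to a shared mm refcount versus a handful of IPIs when the mm is torn down) can be illustrated with a small user-space model. This is only a sketch under stated assumptions, not kernel code: `toy_mm`, `toy_result`, and `toy_simulate` are hypothetical names, a real refcount is an atomic counter rather than a tally, and the IPI handler is reduced to a pointer assignment.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_NR_CPUS 8

/* Hypothetical, heavily simplified stand-in for the kernel's struct mm_struct. */
struct toy_mm {
	bool cpumask[TOY_NR_CPUS];	/* models mm_cpumask(mm) */
};

struct toy_result {
	long refcount_writes;	/* updates to the shared mm refcount */
	int ipis;		/* IPIs sent at final teardown of the mm */
	int lazy_users_left;	/* CPUs still using the dead mm; must be 0 */
};

/*
 * Model ncpus CPUs that each do switches_per_cpu round trips between a user
 * thread and a kernel thread, then sit in a kernel thread (lazily borrowing
 * the user mm) while the process exits.
 */
struct toy_result toy_simulate(bool shootdown, int ncpus, long switches_per_cpu)
{
	struct toy_mm user_mm = { { false } };
	struct toy_mm init_mm = { { false } };
	struct toy_mm *active_mm[TOY_NR_CPUS];
	struct toy_result res = { 0, 0, 0 };
	int cpu;

	assert(ncpus > 0 && ncpus <= TOY_NR_CPUS);

	/* One thread of the process runs on each CPU. */
	for (cpu = 0; cpu < ncpus; cpu++) {
		active_mm[cpu] = &user_mm;
		user_mm.cpumask[cpu] = true;
	}

	/*
	 * With refcounting, every user -> kthread -> user round trip costs
	 * one grab and one drop on the counter shared by all CPUs.
	 * With shootdown, the round trips do not touch the counter at all.
	 */
	if (!shootdown)
		res.refcount_writes += 2 * switches_per_cpu * ncpus;

	/* Every CPU is now lazily on user_mm and the process exits. */
	for (cpu = 0; cpu < ncpus; cpu++) {
		if (shootdown) {
			/* Teardown IPIs each CPU in the mask; a lazy CPU moves to init_mm. */
			if (user_mm.cpumask[cpu]) {
				res.ipis++;
				if (active_mm[cpu] == &user_mm)
					active_mm[cpu] = &init_mm;
			}
		} else {
			/* Lazy reference grabbed on entry, dropped at the next switch. */
			res.refcount_writes += 2;
			active_mm[cpu] = &init_mm;
		}
	}

	for (cpu = 0; cpu < ncpus; cpu++)
		if (active_mm[cpu] == &user_mm)
			res.lazy_users_left++;
	return res;
}
```

In this model, `toy_simulate(false, 4, 1000)` reports 8008 refcount updates and no IPIs, while `toy_simulate(true, 4, 1000)` reports no refcount updates and 4 IPIs. In both modes no CPU is left pointing at the dead mm.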

Comments

Andrew Morton Feb. 26, 2023, 10:12 p.m. UTC | #1
On Fri,  3 Feb 2023 17:18:37 +1000 Nicholas Piggin <npiggin@gmail.com> wrote:

> On a 16-socket 192-core POWER8 system, the context_switch1_threads
> benchmark from will-it-scale (see earlier changelog), upstream can
> achieve a rate of about 1 million context switches per second, due to
> contention on the mm refcount.
> 
> 64s meets the prerequisites for CONFIG_MMU_LAZY_TLB_SHOOTDOWN, so enable
> the option. This increases the above benchmark to 118 million context
> switches per second.

Is that the best you can do ;)

> This generates 314 additional IPI interrupts on a 144 CPU system doing
> a kernel compile, which is in the noise in terms of kernel cycles.
> 
> ...
>
> --- a/arch/powerpc/Kconfig
> +++ b/arch/powerpc/Kconfig
> @@ -265,6 +265,7 @@ config PPC
>  	select MMU_GATHER_PAGE_SIZE
>  	select MMU_GATHER_RCU_TABLE_FREE
>  	select MMU_GATHER_MERGE_VMAS
> +	select MMU_LAZY_TLB_SHOOTDOWN		if PPC_BOOK3S_64
>  	select MODULES_USE_ELF_RELA
>  	select NEED_DMA_MAP_STATE		if PPC64 || NOT_COHERENT_CACHE
>  	select NEED_PER_CPU_EMBED_FIRST_CHUNK	if PPC64

Can we please have a summary of which other architectures might benefit
from this, and what must they do?

As this is powerpc-only, I expect it won't get a lot of testing in
mm.git or in linux-next.  The powerpc maintainers might choose to merge
in the mm-stable branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm if this is a
concern.

Peter Zijlstra Feb. 27, 2023, 1:33 p.m. UTC | #2
On Sun, Feb 26, 2023 at 02:12:38PM -0800, Andrew Morton wrote:
> On Fri,  3 Feb 2023 17:18:37 +1000 Nicholas Piggin <npiggin@gmail.com> wrote:
> 
> > On a 16-socket 192-core POWER8 system, the context_switch1_threads
> > benchmark from will-it-scale (see earlier changelog), upstream can
> > achieve a rate of about 1 million context switches per second, due to
> > contention on the mm refcount.
> > 
> > 64s meets the prerequisites for CONFIG_MMU_LAZY_TLB_SHOOTDOWN, so enable
> > the option. This increases the above benchmark to 118 million context
> > switches per second.
> 
> Is that the best you can do ;)
> 
> > This generates 314 additional IPI interrupts on a 144 CPU system doing
> > a kernel compile, which is in the noise in terms of kernel cycles.
> > 
> > ...
> >
> > --- a/arch/powerpc/Kconfig
> > +++ b/arch/powerpc/Kconfig
> > @@ -265,6 +265,7 @@ config PPC
> >  	select MMU_GATHER_PAGE_SIZE
> >  	select MMU_GATHER_RCU_TABLE_FREE
> >  	select MMU_GATHER_MERGE_VMAS
> > +	select MMU_LAZY_TLB_SHOOTDOWN		if PPC_BOOK3S_64
> >  	select MODULES_USE_ELF_RELA
> >  	select NEED_DMA_MAP_STATE		if PPC64 || NOT_COHERENT_CACHE
> >  	select NEED_PER_CPU_EMBED_FIRST_CHUNK	if PPC64
> 
> Can we please have a summary of which other architectures might benefit
> from this, and what must they do?
> 
> As this is powerpc-only, I expect it won't get a lot of testing in
> mm.git or in linux-next.  The powerpc maintainers might choose to merge
> in the mm-stable branch at
> git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm if this is a
> concern.

I haven't really had time to page all of this back in, but x86 is very
close to being able to use this; it mostly just needs some accidental
active_mm usage cleaned up.

I've got a branch here:

  https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/log/?h=x86/lazy

That's mostly Nick's patches with a bunch of Andy's old patches stuck on
top. I also have a pile of notes, but alas, not finished in any way.

Nicholas Piggin March 21, 2023, 3:54 a.m. UTC | #3
On Mon Feb 27, 2023 at 11:33 PM AEST, Peter Zijlstra wrote:
> On Sun, Feb 26, 2023 at 02:12:38PM -0800, Andrew Morton wrote:
> > On Fri,  3 Feb 2023 17:18:37 +1000 Nicholas Piggin <npiggin@gmail.com> wrote:
> > 
> > > On a 16-socket 192-core POWER8 system, the context_switch1_threads
> > > benchmark from will-it-scale (see earlier changelog), upstream can
> > > achieve a rate of about 1 million context switches per second, due to
> > > contention on the mm refcount.
> > > 
> > > 64s meets the prerequisites for CONFIG_MMU_LAZY_TLB_SHOOTDOWN, so enable
> > > the option. This increases the above benchmark to 118 million context
> > > switches per second.
> > 
> > Is that the best you can do ;)
> > 
> > > This generates 314 additional IPI interrupts on a 144 CPU system doing
> > > a kernel compile, which is in the noise in terms of kernel cycles.
> > > 
> > > ...
> > >
> > > --- a/arch/powerpc/Kconfig
> > > +++ b/arch/powerpc/Kconfig
> > > @@ -265,6 +265,7 @@ config PPC
> > >  	select MMU_GATHER_PAGE_SIZE
> > >  	select MMU_GATHER_RCU_TABLE_FREE
> > >  	select MMU_GATHER_MERGE_VMAS
> > > +	select MMU_LAZY_TLB_SHOOTDOWN		if PPC_BOOK3S_64
> > >  	select MODULES_USE_ELF_RELA
> > >  	select NEED_DMA_MAP_STATE		if PPC64 || NOT_COHERENT_CACHE
> > >  	select NEED_PER_CPU_EMBED_FIRST_CHUNK	if PPC64
> > 
> > Can we please have a summary of which other architectures might benefit
> > from this, and what must they do?

Coming back to this... The recipes to enable this are somewhat documented
in Kconfig. If those weren't clear I can improve them, or... I'm not sure
where else to add this stuff. It would be nice if all these options had
more explanation and requirements; I'm just not sure what's going to work
best (beyond what I did in Kconfig).

Not much noise from other archs so far, so I'll take a guess and say
archs that have large SMP systems might benefit: x86 and s390 perhaps.
There seems to be some work still ongoing in the x86 branch. I didn't hear
whether you found the docs inadequate, or any suggestions to improve
understanding. Some were very confused by it, but I was never able to help
them grasp the concepts or get to the bottom of what the problem was, so
that was a dead end unfortunately.


> > 
> > As this is powerpc-only, I expect it won't get a lot of testing in
> > mm.git or in linux-next.  The powerpc maintainers might choose to merge
> > in the mm-stable branch at
> > git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm if this is a
> > concern.
>
> I haven't really had time to page all of this back in, but x86 is very
> close to being able to use this; it mostly just needs some accidental
> active_mm usage cleaned up.
>
> I've got a branch here:
>
>   https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/log/?h=x86/lazy
>
> That's mostly Nick's patches with a bunch of Andy's old patches stuck on
> top. I also have a pile of notes, but alas, not finished in any way.

Great that a proof of concept shows it can work for x86, I guess
that's an ack for this series from x86? :)

The x86 implementation presumably won't be merged until the objectionable
active_mm usage and other core code that makes things difficult for the
arch are cleaned up, so we don't again get into the situation where crap
keeps getting built on crap and everybody else's nice clean patches get
nacked for years because one arch is festering. It will be great to see
those cleanups.

Thanks,
Nick
Patch

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index b8c4ac56bddc..600ace5a7f1a 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -265,6 +265,7 @@  config PPC
 	select MMU_GATHER_PAGE_SIZE
 	select MMU_GATHER_RCU_TABLE_FREE
 	select MMU_GATHER_MERGE_VMAS
+	select MMU_LAZY_TLB_SHOOTDOWN		if PPC_BOOK3S_64
 	select MODULES_USE_ELF_RELA
 	select NEED_DMA_MAP_STATE		if PPC64 || NOT_COHERENT_CACHE
 	select NEED_PER_CPU_EMBED_FIRST_CHUNK	if PPC64