mbox series

[00/18] KVM: Consolidate and optimize MMU notifiers

Message ID 20210326021957.1424875-1-seanjc@google.com
Headers show
Series KVM: Consolidate and optimize MMU notifiers | expand

Message

Sean Christopherson March 26, 2021, 2:19 a.m. UTC
The end goal of this series is to optimize the MMU notifiers to take
mmu_lock if and only if the notification is relevant to KVM, i.e. the hva
range overlaps a memslot.   Large VMs (hundreds of vCPUs) are very
sensitive to mmu_lock being taken for write at inopportune times, and
such VMs also tend to be "static", e.g. backed by HugeTLB with minimal
page shenanigans.  The vast majority of notifications for these VMs will
be spurious (for KVM), and eliding mmu_lock for spurious notifications
avoids an otherwise unacceptable disruption to the guest.

To get there without potentially degrading performance, e.g. due to
multiple memslot lookups, especially on non-x86 where the use cases are
largely unknown (from my perspective), first consolidate the MMU notifier
logic by moving the hva->gfn lookups into common KVM.

Applies on my TDP MMU TLB flushing bug fixes[*], which conflict horribly
with the TDP MMU changes in this series.  That code applies on kvm/queue
(commit 4a98623d5d90, "KVM: x86/mmu: Mark the PAE roots as decrypted for
shadow paging").

Speaking of conflicts, Ben will soon be posting a series to convert a
bunch of TDP MMU flows to take mmu_lock only for read.  Presumably there
will be an absurd number of conflicts; Ben and I will sort out the
conflicts in whichever series loses the race.

Well tested on Intel and AMD.  Compile tested for arm64, MIPS, PPC,
PPC e500, and s390.  Absolutely needs to be tested for real on non-x86,
I give it even odds that I introduced an off-by-one bug somewhere.

[*] https://lkml.kernel.org/r/20210325200119.1359384-1-seanjc@google.com


Patches 1-7 are x86 specific prep patches to play nice with moving
the hva->gfn memslot lookups into common code.  There ended up being waaay
more of these than I expected/wanted, but I had a hell of a time getting
the flushing logic right when shuffling the memslot and address space
loops.  In the end, I was more confident I got things correct by batching
the flushes.

Patch 8 moves the existing API prototypes into common code.  It could
technically be dropped since the old APIs are gone in the end, but I
thought the switch to the new APIs would suck a bit less this way.

Patch 9 moves arm64's MMU notifier tracepoints into common code so that
they are not lost when arm64 is converted to the new APIs, and so that all
architectures can benefit.

Patch 10 moves x86's memslot walkers into common KVM.  I chose x86 purely
because I could actually test it.  All architectures use nearly identical
code, so I don't think it actually matters in the end.

Patches 11-13 move arm64, MIPS, and PPC to the new APIs.

Patch 14 yanks out the old APIs.

Patch 15 adds the mmu_lock elision, but only for unpaired notifications.

Patch 16 adds mmu_lock elision for paired .invalidate_range_{start,end}().
This is quite nasty and no small part of me thinks the patch should be
burned with fire (I won't spoil it any further), but it's also the most
problematic scenario for our particular use case.  :-/

Patches 17-18 are additional x86 cleanups.

Sean Christopherson (18):
  KVM: x86/mmu: Coalesce TDP MMU TLB flushes when zapping collapsible
    SPTEs
  KVM: x86/mmu: Move flushing for "slot" handlers to caller for legacy
    MMU
  KVM: x86/mmu: Coalesce TLB flushes when zapping collapsible SPTEs
  KVM: x86/mmu: Coalesce TLB flushes across address spaces for gfn range
    zap
  KVM: x86/mmu: Pass address space ID to __kvm_tdp_mmu_zap_gfn_range()
  KVM: x86/mmu: Pass address space ID to TDP MMU root walkers
  KVM: x86/mmu: Use leaf-only loop for walking TDP SPTEs when changing
    SPTE
  KVM: Move prototypes for MMU notifier callbacks to generic code
  KVM: Move arm64's MMU notifier trace events to generic code
  KVM: Move x86's MMU notifier memslot walkers to generic code
  KVM: arm64: Convert to the gfn-based MMU notifier callbacks
  KVM: MIPS/MMU: Convert to the gfn-based MMU notifier callbacks
  KVM: PPC: Convert to the gfn-based MMU notifier callbacks
  KVM: Kill off the old hva-based MMU notifier callbacks
  KVM: Take mmu_lock when handling MMU notifier iff the hva hits a
    memslot
  KVM: Don't take mmu_lock for range invalidation unless necessary
  KVM: x86/mmu: Allow yielding during MMU notifier unmap/zap, if
    possible
  KVM: x86/mmu: Drop trace_kvm_age_page() tracepoint

 arch/arm64/include/asm/kvm_host.h             |   5 -
 arch/arm64/kvm/mmu.c                          | 118 ++----
 arch/arm64/kvm/trace_arm.h                    |  66 ----
 arch/mips/include/asm/kvm_host.h              |   5 -
 arch/mips/kvm/mmu.c                           |  97 +----
 arch/powerpc/include/asm/kvm_book3s.h         |  12 +-
 arch/powerpc/include/asm/kvm_host.h           |   7 -
 arch/powerpc/include/asm/kvm_ppc.h            |   9 +-
 arch/powerpc/kvm/book3s.c                     |  18 +-
 arch/powerpc/kvm/book3s.h                     |  10 +-
 arch/powerpc/kvm/book3s_64_mmu_hv.c           |  98 ++---
 arch/powerpc/kvm/book3s_64_mmu_radix.c        |  25 +-
 arch/powerpc/kvm/book3s_hv.c                  |  12 +-
 arch/powerpc/kvm/book3s_pr.c                  |  56 +--
 arch/powerpc/kvm/e500_mmu_host.c              |  29 +-
 arch/powerpc/kvm/trace_booke.h                |  15 -
 arch/x86/include/asm/kvm_host.h               |   6 +-
 arch/x86/kvm/mmu/mmu.c                        | 180 ++++-----
 arch/x86/kvm/mmu/mmu_internal.h               |  10 +
 arch/x86/kvm/mmu/tdp_mmu.c                    | 344 +++++++-----------
 arch/x86/kvm/mmu/tdp_mmu.h                    |  31 +-
 include/linux/kvm_host.h                      |  22 +-
 include/trace/events/kvm.h                    |  90 +++--
 tools/testing/selftests/kvm/lib/kvm_util.c    |   4 -
 .../selftests/kvm/lib/x86_64/processor.c      |   2 +
 virt/kvm/kvm_main.c                           | 312 ++++++++++++----
 26 files changed, 697 insertions(+), 886 deletions(-)

Comments

Ben Gardon March 30, 2021, 6:32 p.m. UTC | #1
On Thu, Mar 25, 2021 at 7:20 PM Sean Christopherson <seanjc@google.com> wrote:
>
> The end goal of this series is to optimize the MMU notifiers to take
> mmu_lock if and only if the notification is relevant to KVM, i.e. the hva
> range overlaps a memslot.   Large VMs (hundreds of vCPUs) are very
> sensitive to mmu_lock being taken for write at inopportune times, and
> such VMs also tend to be "static", e.g. backed by HugeTLB with minimal
> page shenanigans.  The vast majority of notifications for these VMs will
> be spurious (for KVM), and eliding mmu_lock for spurious notifications
> avoids an otherwise unacceptable disruption to the guest.
>
> To get there without potentially degrading performance, e.g. due to
> multiple memslot lookups, especially on non-x86 where the use cases are
> largely unknown (from my perspective), first consolidate the MMU notifier
> logic by moving the hva->gfn lookups into common KVM.
>
> Applies on my TDP MMU TLB flushing bug fixes[*], which conflict horribly
> with the TDP MMU changes in this series.  That code applies on kvm/queue
> (commit 4a98623d5d90, "KVM: x86/mmu: Mark the PAE roots as decrypted for
> shadow paging").
>
> Speaking of conflicts, Ben will soon be posting a series to convert a
> bunch of TDP MMU flows to take mmu_lock only for read.  Presumably there
> will be an absurd number of conflicts; Ben and I will sort out the
> conflicts in whichever series loses the race.
>
> Well tested on Intel and AMD.  Compile tested for arm64, MIPS, PPC,
> PPC e500, and s390.  Absolutely needs to be tested for real on non-x86,
> I give it even odds that I introduced an off-by-one bug somewhere.
>
> [*] https://lkml.kernel.org/r/20210325200119.1359384-1-seanjc@google.com
>
>
> Patches 1-7 are x86 specific prep patches to play nice with moving
> the hva->gfn memslot lookups into common code.  There ended up being waaay
> more of these than I expected/wanted, but I had a hell of a time getting
> the flushing logic right when shuffling the memslot and address space
> loops.  In the end, I was more confident I got things correct by batching
> the flushes.
>
> Patch 8 moves the existing API prototypes into common code.  It could
> technically be dropped since the old APIs are gone in the end, but I
> thought the switch to the new APIs would suck a bit less this way.

Patches 1-8 look good to me. Feel free to add my Reviewed-by tag to those.
I appreciate the care you took to make all those changes tiny and reviewable.

>
> Patch 9 moves arm64's MMU notifier tracepoints into common code so that
> they are not lost when arm64 is converted to the new APIs, and so that all
> architectures can benefit.
>
> Patch 10 moves x86's memslot walkers into common KVM.  I chose x86 purely
> because I could actually test it.  All architectures use nearly identical
> code, so I don't think it actually matters in the end.

I'm still reviewing 10 and 14-18. 10 is a huge change and the diff is
pretty hard to parse.

>
> Patches 11-13 move arm64, MIPS, and PPC to the new APIs.
>
> Patch 14 yanks out the old APIs.
>
> Patch 15 adds the mmu_lock elision, but only for unpaired notifications.

Reading through all this code and considering the changes I'm
preparing for the TDP MMU have me wondering if it might help to have a
more general purpose MMU lock context struct which could be embedded
in the structs added in this patch. I'm thinking something like:
enum kvm_mmu_lock_mode {
    KVM_MMU_LOCK_NONE,
    KVM_MMU_LOCK_READ,
    KVM_MMU_LOCK_WRITE,
};

struct kvm_mmu_lock_context {
    enum kvm_mmu_lock_mode lock_mode;
    bool can_block;
    bool can_yield;
    bool flush;
};

This could yield some grossly long lines, but it would also have
potential to unify a bunch of ad-hoc handling.
The above struct could also fit into a single byte, so it'd be pretty
easy to pass it around.

>
> Patch 16 adds mmu_lock elision for paired .invalidate_range_{start,end}().
> This is quite nasty and no small part of me thinks the patch should be
> burned with fire (I won't spoil it any further), but it's also the most
> problematic scenario for our particular use case.  :-/
>
> Patches 17-18 are additional x86 cleanups.
>
> Sean Christopherson (18):
>   KVM: x86/mmu: Coalesce TDP MMU TLB flushes when zapping collapsible
>     SPTEs
>   KVM: x86/mmu: Move flushing for "slot" handlers to caller for legacy
>     MMU
>   KVM: x86/mmu: Coalesce TLB flushes when zapping collapsible SPTEs
>   KVM: x86/mmu: Coalesce TLB flushes across address spaces for gfn range
>     zap
>   KVM: x86/mmu: Pass address space ID to __kvm_tdp_mmu_zap_gfn_range()
>   KVM: x86/mmu: Pass address space ID to TDP MMU root walkers
>   KVM: x86/mmu: Use leaf-only loop for walking TDP SPTEs when changing
>     SPTE
>   KVM: Move prototypes for MMU notifier callbacks to generic code
>   KVM: Move arm64's MMU notifier trace events to generic code
>   KVM: Move x86's MMU notifier memslot walkers to generic code
>   KVM: arm64: Convert to the gfn-based MMU notifier callbacks
>   KVM: MIPS/MMU: Convert to the gfn-based MMU notifier callbacks
>   KVM: PPC: Convert to the gfn-based MMU notifier callbacks
>   KVM: Kill off the old hva-based MMU notifier callbacks
>   KVM: Take mmu_lock when handling MMU notifier iff the hva hits a
>     memslot
>   KVM: Don't take mmu_lock for range invalidation unless necessary
>   KVM: x86/mmu: Allow yielding during MMU notifier unmap/zap, if
>     possible
>   KVM: x86/mmu: Drop trace_kvm_age_page() tracepoint
>
>  arch/arm64/include/asm/kvm_host.h             |   5 -
>  arch/arm64/kvm/mmu.c                          | 118 ++----
>  arch/arm64/kvm/trace_arm.h                    |  66 ----
>  arch/mips/include/asm/kvm_host.h              |   5 -
>  arch/mips/kvm/mmu.c                           |  97 +----
>  arch/powerpc/include/asm/kvm_book3s.h         |  12 +-
>  arch/powerpc/include/asm/kvm_host.h           |   7 -
>  arch/powerpc/include/asm/kvm_ppc.h            |   9 +-
>  arch/powerpc/kvm/book3s.c                     |  18 +-
>  arch/powerpc/kvm/book3s.h                     |  10 +-
>  arch/powerpc/kvm/book3s_64_mmu_hv.c           |  98 ++---
>  arch/powerpc/kvm/book3s_64_mmu_radix.c        |  25 +-
>  arch/powerpc/kvm/book3s_hv.c                  |  12 +-
>  arch/powerpc/kvm/book3s_pr.c                  |  56 +--
>  arch/powerpc/kvm/e500_mmu_host.c              |  29 +-
>  arch/powerpc/kvm/trace_booke.h                |  15 -
>  arch/x86/include/asm/kvm_host.h               |   6 +-
>  arch/x86/kvm/mmu/mmu.c                        | 180 ++++-----
>  arch/x86/kvm/mmu/mmu_internal.h               |  10 +
>  arch/x86/kvm/mmu/tdp_mmu.c                    | 344 +++++++-----------
>  arch/x86/kvm/mmu/tdp_mmu.h                    |  31 +-
>  include/linux/kvm_host.h                      |  22 +-
>  include/trace/events/kvm.h                    |  90 +++--
>  tools/testing/selftests/kvm/lib/kvm_util.c    |   4 -
>  .../selftests/kvm/lib/x86_64/processor.c      |   2 +
>  virt/kvm/kvm_main.c                           | 312 ++++++++++++----
>  26 files changed, 697 insertions(+), 886 deletions(-)
>
> --
> 2.31.0.291.g576ba9dcdaf-goog
>
Paolo Bonzini March 30, 2021, 7:48 p.m. UTC | #2
On 30/03/21 20:32, Ben Gardon wrote:
>> Patches 1-7 are x86 specific prep patches to play nice with moving
>> the hva->gfn memslot lookups into common code.  There ended up being waaay
>> more of these than I expected/wanted, but I had a hell of a time getting
>> the flushing logic right when shuffling the memslot and address space
>> loops.  In the end, I was more confident I got things correct by batching
>> the flushes.
>>
>> Patch 8 moves the existing API prototypes into common code.  It could
>> technically be dropped since the old APIs are gone in the end, but I
>> thought the switch to the new APIs would suck a bit less this way.
> Patches 1-8 look good to me. Feel free to add my Reviewed-by tag to those.
> I appreciate the care you took to make all those changes tiny and reviewable.
> 

Just finished reviewing that part too, they were very nice and I've 
queued them.  I'll continue tomorrow with the rest.

Paolo
Sean Christopherson March 30, 2021, 7:58 p.m. UTC | #3
On Tue, Mar 30, 2021, Ben Gardon wrote:
> On Thu, Mar 25, 2021 at 7:20 PM Sean Christopherson <seanjc@google.com> wrote:
> > Patch 10 moves x86's memslot walkers into common KVM.  I chose x86 purely
> > because I could actually test it.  All architectures use nearly identical
> > code, so I don't think it actually matters in the end.
> 
> I'm still reviewing 10 and 14-18. 10 is a huge change and the diff is
> pretty hard to parse.

Ya :-/  I don't see an easy way to break it up without creating a massive diff,
e.g. it could be staged in x86 and moved to common, but I don't think that would
fundamentally change the diff.  Although I admittedly didn't spend _that_ much
time thinking about how to break it up.

> > Patches 11-13 move arm64, MIPS, and PPC to the new APIs.
> >
> > Patch 14 yanks out the old APIs.
> >
> > Patch 15 adds the mmu_lock elision, but only for unpaired notifications.
> 
> Reading through all this code and considering the changes I'm
> preparing for the TDP MMU have me wondering if it might help to have a
> more general purpose MMU lock context struct which could be embedded
> in the structs added in this patch. I'm thinking something like:
> enum kvm_mmu_lock_mode {
>     KVM_MMU_LOCK_NONE,
>     KVM_MMU_LOCK_READ,
>     KVM_MMU_LOCK_WRITE,
> };
> 
> struct kvm_mmu_lock_context {
>     enum kvm_mmu_lock_mode lock_mode;
>     bool can_block;
>     bool can_yield;

Not that it matters right now, but can_block and can_yield are the same thing.
I considered s/can_yield/can_block to make it all consistent, but that felt like
unnecessary thrash.

>     bool flush;

Drat.  This made me realize that the 'struct kvm_gfn_range' passed to arch code
isn't tagged 'const'.  I thought I had done that, but obviously not.

Anyways, what I was going to say before that realization is that the downside to
putting flush into kvm_gfn_range is that it would have to lose its 'const'
qualifier.  That's all a moot point if it's not easily constified though.

Const aside, my gut reaction is that it will probably be cleaner to keep the
flush stuff in arch code, separate from the kvm_gfn_range passed in by common
KVM.  Looping 'flush' back into the helpers is x86 specific at this point, and
AFAICT that's not likely to change any time soon.

For rwlock support, if we get to the point where kvm_age_gfn() and/or
kvm_test_age_gfn() can take mmu_lock for read, then it definitely makes sense to
track locking in kvm_gfn_range, assuming it isn't the sole entity that prevents
consifying kvm_range_range.

> };
> 
> This could yield some grossly long lines, but it would also have
> potential to unify a bunch of ad-hoc handling.
> The above struct could also fit into a single byte, so it'd be pretty
> easy to pass it around.
Paolo Bonzini March 31, 2021, 7:57 a.m. UTC | #4
On 26/03/21 03:19, Sean Christopherson wrote:
> The end goal of this series is to optimize the MMU notifiers to take
> mmu_lock if and only if the notification is relevant to KVM, i.e. the hva
> range overlaps a memslot.   Large VMs (hundreds of vCPUs) are very
> sensitive to mmu_lock being taken for write at inopportune times, and
> such VMs also tend to be "static", e.g. backed by HugeTLB with minimal
> page shenanigans.  The vast majority of notifications for these VMs will
> be spurious (for KVM), and eliding mmu_lock for spurious notifications
> avoids an otherwise unacceptable disruption to the guest.
> 
> To get there without potentially degrading performance, e.g. due to
> multiple memslot lookups, especially on non-x86 where the use cases are
> largely unknown (from my perspective), first consolidate the MMU notifier
> logic by moving the hva->gfn lookups into common KVM.
> 
> Applies on my TDP MMU TLB flushing bug fixes[*], which conflict horribly
> with the TDP MMU changes in this series.  That code applies on kvm/queue
> (commit 4a98623d5d90, "KVM: x86/mmu: Mark the PAE roots as decrypted for
> shadow paging").
> 
> Speaking of conflicts, Ben will soon be posting a series to convert a
> bunch of TDP MMU flows to take mmu_lock only for read.  Presumably there
> will be an absurd number of conflicts; Ben and I will sort out the
> conflicts in whichever series loses the race.
> 
> Well tested on Intel and AMD.  Compile tested for arm64, MIPS, PPC,
> PPC e500, and s390.  Absolutely needs to be tested for real on non-x86,
> I give it even odds that I introduced an off-by-one bug somewhere.
> 
> [*] https://lkml.kernel.org/r/20210325200119.1359384-1-seanjc@google.com
> 
> 
> Patches 1-7 are x86 specific prep patches to play nice with moving
> the hva->gfn memslot lookups into common code.  There ended up being waaay
> more of these than I expected/wanted, but I had a hell of a time getting
> the flushing logic right when shuffling the memslot and address space
> loops.  In the end, I was more confident I got things correct by batching
> the flushes.
> 
> Patch 8 moves the existing API prototypes into common code.  It could
> technically be dropped since the old APIs are gone in the end, but I
> thought the switch to the new APIs would suck a bit less this way.
> 
> Patch 9 moves arm64's MMU notifier tracepoints into common code so that
> they are not lost when arm64 is converted to the new APIs, and so that all
> architectures can benefit.
> 
> Patch 10 moves x86's memslot walkers into common KVM.  I chose x86 purely
> because I could actually test it.  All architectures use nearly identical
> code, so I don't think it actually matters in the end.
> 
> Patches 11-13 move arm64, MIPS, and PPC to the new APIs.
> 
> Patch 14 yanks out the old APIs.
> 
> Patch 15 adds the mmu_lock elision, but only for unpaired notifications.
> 
> Patch 16 adds mmu_lock elision for paired .invalidate_range_{start,end}().
> This is quite nasty and no small part of me thinks the patch should be
> burned with fire (I won't spoil it any further), but it's also the most
> problematic scenario for our particular use case.  :-/
> 
> Patches 17-18 are additional x86 cleanups.

Queued and 1-9 and 18, thanks.  There's a small issue in patch 10 that 
prevented me from committing 10-15, but they mostly look good.

Paolo
Marc Zyngier March 31, 2021, 9:34 a.m. UTC | #5
On 2021-03-31 08:57, Paolo Bonzini wrote:

> Queued and 1-9 and 18, thanks.  There's a small issue in patch 10 that
> prevented me from committing 10-15, but they mostly look good.

Can you please push the resulting merge somewhere?

I'm concerned that it will conflict in interesting way with other stuff
that is on its way on the arm64 side, not to mentiobn that this hasn't
been tested at all on anything but x86 (and given the series was posted
on Friday, that's a bit of a short notice).

Thanks,

         M.
Paolo Bonzini March 31, 2021, 9:41 a.m. UTC | #6
On 31/03/21 11:34, Marc Zyngier wrote:
> 
>> Queued and 1-9 and 18, thanks.  There's a small issue in patch 10 that
>> prevented me from committing 10-15, but they mostly look good.
> 
> Can you please push the resulting merge somewhere?
> 
> I'm concerned that it will conflict in interesting way with other stuff
> that is on its way on the arm64 side, not to mentiobn that this hasn't
> been tested at all on anything but x86 (and given the series was posted
> on Friday, that's a bit of a short notice).

Yes, I will push it shortly to kvm/queue.  Note that the patches I have 
pushed are x86 only apart from changes to tracepoints.  The rest will 
certainly need a lot more baking, which is also why I got rid quickly of 
the easy ones.

Paolo