
[v3,0/4] Migrate non-migrated pages of a SVM.

Message ID 1592606622-29884-1-git-send-email-linuxram@us.ibm.com (mailing list archive)

Message

Ram Pai June 19, 2020, 10:43 p.m. UTC
The time taken to switch a VM to a Secure-VM increases with the size of the VM.
A 100GB VM takes about 7 minutes. This is unacceptable.  This linear increase is
caused by suboptimal behavior in the Ultravisor and the Hypervisor.  The
Ultravisor unnecessarily migrates all the GFNs of the VM from normal memory to
secure memory; it only needs to migrate the necessary and sufficient GFNs.

However, when this optimization is incorporated in the Ultravisor, the
Hypervisor starts misbehaving. The Hypervisor has an inbuilt assumption that
the Ultravisor will explicitly request migration of each and every GFN of
the VM. If only the necessary and sufficient GFNs are requested for
migration, the Hypervisor continues to manage the remaining GFNs as normal
GFNs. This leads to memory corruption, manifested consistently when the SVM
reboots.

The same is true when a memory slot is hotplugged into an SVM. The
Hypervisor expects the Ultravisor to request migration of all GFNs to secure
GFNs, but at the same time it is unable to handle any H_SVM_PAGE_IN requests
from the Ultravisor made in the context of the UV_REGISTER_MEM_SLOT ucall.
This problem manifests as random errors in the SVM when a memory slot is
hotplugged.

This patch series automatically migrates the non-migrated pages of an SVM,
and thus solves the problem.
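
In rough outline, the new H_SVM_INIT_DONE path walks every registered
memslot and pages in whatever is still normal. A simplified sketch of the
idea (the helper names below are illustrative, not the actual functions in
book3s_hv_uvmem.c):

/*
 * Sketch only: walk a memslot and migrate the GFNs that were never
 * explicitly paged in. The real code must also hold the appropriate
 * locks around this walk.
 */
static int uv_migrate_mem_slot(struct kvm *kvm,
			       const struct kvm_memory_slot *memslot)
{
	unsigned long gfn;

	for (gfn = memslot->base_gfn;
	     gfn < memslot->base_gfn + memslot->npages; gfn++) {
		/* Secure, shared and paged-out GFNs are skipped;
		 * only normal GFNs still need migration. */
		if (!kvmppc_gfn_is_normal(kvm, gfn))	/* illustrative */
			continue;
		/* Transition this GFN to secure via UV_PAGE_IN. */
		if (kvmppc_svm_page_in_gfn(kvm, memslot, gfn))	/* illustrative */
			return H_STATE;
	}
	return H_SUCCESS;
}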

Testing: Passed rigorous SVM tests with SVMs of various sizes.

Changelog:

v2: . fixed a bug observed by Laurent. The state of the GFNs associated
	with Secure-VMs was not reset during memslot flush.
    . Re-organized the code for easier review.
    . Better description of the patch series.

v1: fixed a bug observed by Bharata. Pages that were paged-in and later
paged-out must also be skipped from migration during H_SVM_INIT_DONE.

Laurent Dufour (1):
  KVM: PPC: Book3S HV: migrate hot plugged memory

Ram Pai (3):
  KVM: PPC: Book3S HV: Fix function definition in book3s_hv_uvmem.c
  KVM: PPC: Book3S HV: track the state GFNs associated with secure VMs
  KVM: PPC: Book3S HV: migrate remaining normal-GFNs to secure-GFNs in
    H_SVM_INIT_DONE

 Documentation/powerpc/ultravisor.rst        |   2 +
 arch/powerpc/include/asm/kvm_book3s_uvmem.h |   2 +
 arch/powerpc/kvm/book3s_hv.c                |  10 +-
 arch/powerpc/kvm/book3s_hv_uvmem.c          | 360 +++++++++++++++++++++++-----
 4 files changed, 315 insertions(+), 59 deletions(-)

Comments

Bharata B Rao June 28, 2020, 4:11 p.m. UTC | #1
On Fri, Jun 19, 2020 at 03:43:38PM -0700, Ram Pai wrote:
> The time taken to switch a VM to a Secure-VM increases with the size of the VM.
> A 100GB VM takes about 7 minutes. This is unacceptable.  This linear increase is
> caused by suboptimal behavior in the Ultravisor and the Hypervisor.  The
> Ultravisor unnecessarily migrates all the GFNs of the VM from normal memory to
> secure memory; it only needs to migrate the necessary and sufficient GFNs.
> 
> However, when this optimization is incorporated in the Ultravisor, the
> Hypervisor starts misbehaving. The Hypervisor has an inbuilt assumption that
> the Ultravisor will explicitly request migration of each and every GFN of
> the VM. If only the necessary and sufficient GFNs are requested for
> migration, the Hypervisor continues to manage the remaining GFNs as normal
> GFNs. This leads to memory corruption, manifested consistently when the SVM
> reboots.
> 
> The same is true when a memory slot is hotplugged into an SVM. The
> Hypervisor expects the Ultravisor to request migration of all GFNs to secure
> GFNs, but at the same time it is unable to handle any H_SVM_PAGE_IN requests
> from the Ultravisor made in the context of the UV_REGISTER_MEM_SLOT ucall.
> This problem manifests as random errors in the SVM when a memory slot is
> hotplugged.
> 
> This patch series automatically migrates the non-migrated pages of an SVM,
> and thus solves the problem.

So this is what I understand as the objective of this patchset:

1. Getting all the pages into the secure memory right when the guest
   transitions into secure mode is expensive. Ultravisor wants to just get
   the necessary and sufficient pages in and put the onus on the Hypervisor
   to mark the remaining pages (w/o actual page-in) as secure during
   H_SVM_INIT_DONE.
2. During H_SVM_INIT_DONE, you want a way to differentiate the pages that
   are already secure from the pages that are shared and that are paged-out.
   For this you are introducing all these new states in HV.

UV knows about the shared GFNs and maintains their state. Hence let HV send
all the pages (minus the already secured pages) via H_SVM_PAGE_IN, and if UV
finds any shared pages among them, let it fail the uv-page-in call. Then HV
can fail the migration for that page, and it remains shared. With this, you
don't need to maintain a state for secured GFNs in HV.

In the unlikely case of sending a paged-out page to UV during
H_SVM_INIT_DONE, let the page-in succeed and HV will fault on it again
if required. With this, you don't need a state in HV to identify a
paged-out-but-encrypted state.

Doesn't the above work? If so, we can avoid all those extra states
in HV. That way HV can continue to differentiate only between two types
of pages: secure and not-secure. The rest of the states (shared,
paged-out-encrypted) actually belong to the SVM/UV, so let UV take care
of them.
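
To make the contrast concrete, here is roughly the state split under
discussion (a sketch; the identifiers are made up, not the actual names in
the patches):

/* Illustrative per-GFN states tracked by HV in this series. */
enum gfn_state {
	GFN_NORMAL,	/* backed by normal memory, never paged in */
	GFN_SECURE,	/* migrated to secure memory via uv-page-in */
	/* The alternative proposed above would drop the two states
	 * below from HV and leave them entirely to UV: */
	GFN_SHARED,	/* explicitly shared between SVM and HV */
	GFN_PAGED_OUT,	/* was secure; paged out, contents encrypted */
};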

Or did I miss something?

Regards,
Bharata.
Bharata B Rao June 29, 2020, 1:53 a.m. UTC | #2
On Sun, Jun 28, 2020 at 09:41:53PM +0530, Bharata B Rao wrote:
> On Fri, Jun 19, 2020 at 03:43:38PM -0700, Ram Pai wrote:
> > The time taken to switch a VM to a Secure-VM increases with the size of the VM.
> > A 100GB VM takes about 7 minutes. This is unacceptable.  This linear increase is
> > caused by suboptimal behavior in the Ultravisor and the Hypervisor.  The
> > Ultravisor unnecessarily migrates all the GFNs of the VM from normal memory to
> > secure memory; it only needs to migrate the necessary and sufficient GFNs.
> > 
> > However, when this optimization is incorporated in the Ultravisor, the
> > Hypervisor starts misbehaving. The Hypervisor has an inbuilt assumption that
> > the Ultravisor will explicitly request migration of each and every GFN of
> > the VM. If only the necessary and sufficient GFNs are requested for
> > migration, the Hypervisor continues to manage the remaining GFNs as normal
> > GFNs. This leads to memory corruption, manifested consistently when the SVM
> > reboots.
> > 
> > The same is true when a memory slot is hotplugged into an SVM. The
> > Hypervisor expects the Ultravisor to request migration of all GFNs to secure
> > GFNs, but at the same time it is unable to handle any H_SVM_PAGE_IN requests
> > from the Ultravisor made in the context of the UV_REGISTER_MEM_SLOT ucall.
> > This problem manifests as random errors in the SVM when a memory slot is
> > hotplugged.
> > 
> > This patch series automatically migrates the non-migrated pages of an SVM,
> > and thus solves the problem.
> 
> So this is what I understand as the objective of this patchset:
> 
> 1. Getting all the pages into the secure memory right when the guest
>    transitions into secure mode is expensive. Ultravisor wants to just get
>    the necessary and sufficient pages in and put the onus on the Hypervisor
>    to mark the remaining pages (w/o actual page-in) as secure during
>    H_SVM_INIT_DONE.
> 2. During H_SVM_INIT_DONE, you want a way to differentiate the pages that
>    are already secure from the pages that are shared and that are paged-out.
>    For this you are introducing all these new states in HV.
> 
> UV knows about the shared GFNs and maintains their state. Hence let HV send
> all the pages (minus the already secured pages) via H_SVM_PAGE_IN, and if UV
> finds any shared pages among them, let it fail the uv-page-in call. Then HV
> can fail the migration for that page, and it remains shared. With this, you
> don't need to maintain a state for secured GFNs in HV.
> 
> In the unlikely case of sending a paged-out page to UV during
> H_SVM_INIT_DONE, let the page-in succeed and HV will fault on it again
> if required. With this, you don't need a state in HV to identify a
> paged-out-but-encrypted state.
> 
> Doesn't the above work?

I see that you in fact want to skip the uv-page-in calls from H_SVM_INIT_DONE.
So that would need the extra states in HV that you are proposing here.

Regards,
Bharata.
Ram Pai June 29, 2020, 7:25 p.m. UTC | #3
On Mon, Jun 29, 2020 at 07:23:30AM +0530, Bharata B Rao wrote:
> On Sun, Jun 28, 2020 at 09:41:53PM +0530, Bharata B Rao wrote:
> > On Fri, Jun 19, 2020 at 03:43:38PM -0700, Ram Pai wrote:
> > > The time taken to switch a VM to a Secure-VM increases with the size of the VM.
> > > A 100GB VM takes about 7 minutes. This is unacceptable.  This linear increase is
> > > caused by suboptimal behavior in the Ultravisor and the Hypervisor.  The
> > > Ultravisor unnecessarily migrates all the GFNs of the VM from normal memory to
> > > secure memory; it only needs to migrate the necessary and sufficient GFNs.
> > > 
> > > However, when this optimization is incorporated in the Ultravisor, the
> > > Hypervisor starts misbehaving. The Hypervisor has an inbuilt assumption that
> > > the Ultravisor will explicitly request migration of each and every GFN of
> > > the VM. If only the necessary and sufficient GFNs are requested for
> > > migration, the Hypervisor continues to manage the remaining GFNs as normal
> > > GFNs. This leads to memory corruption, manifested consistently when the SVM
> > > reboots.
> > > 
> > > The same is true when a memory slot is hotplugged into an SVM. The
> > > Hypervisor expects the Ultravisor to request migration of all GFNs to secure
> > > GFNs, but at the same time it is unable to handle any H_SVM_PAGE_IN requests
> > > from the Ultravisor made in the context of the UV_REGISTER_MEM_SLOT ucall.
> > > This problem manifests as random errors in the SVM when a memory slot is
> > > hotplugged.
> > > 
> > > This patch series automatically migrates the non-migrated pages of an SVM,
> > > and thus solves the problem.
> > 
> > So this is what I understand as the objective of this patchset:
> > 
> > 1. Getting all the pages into the secure memory right when the guest
> >    transitions into secure mode is expensive. Ultravisor wants to just get
> >    the necessary and sufficient pages in and put the onus on the Hypervisor
> >    to mark the remaining pages (w/o actual page-in) as secure during
> >    H_SVM_INIT_DONE.
> > 2. During H_SVM_INIT_DONE, you want a way to differentiate the pages that
> >    are already secure from the pages that are shared and that are paged-out.
> >    For this you are introducing all these new states in HV.
> > 
> > UV knows about the shared GFNs and maintains their state. Hence let HV send
> > all the pages (minus the already secured pages) via H_SVM_PAGE_IN, and if UV
> > finds any shared pages among them, let it fail the uv-page-in call. Then HV
> > can fail the migration for that page, and it remains shared. With this, you
> > don't need to maintain a state for secured GFNs in HV.
> > 
> > In the unlikely case of sending a paged-out page to UV during
> > H_SVM_INIT_DONE, let the page-in succeed and HV will fault on it again
> > if required. With this, you don't need a state in HV to identify a
> > paged-out-but-encrypted state.
> > 
> > Doesn't the above work?
> 
> I see that you in fact want to skip the uv-page-in calls from H_SVM_INIT_DONE.
> So that would need the extra states in HV that you are proposing here.

Yes. I want to skip them to speed up the overall ESM switch.

RP