[v3,0/4] Balloon inhibit enhancements, vfio restriction

Message-Id: <20180807193125.30378-1-alex.williamson@redhat.com>
From: Alex Williamson <alex.williamson@redhat.com>
To: alex.williamson@redhat.com, qemu-devel@nongnu.org
Cc: Cornelia Huck <cohuck@redhat.com>, david@redhat.com, kvm@vger.kernel.org, peterx@redhat.com, "Michael S. Tsirkin" <mst@redhat.com>
Date: Tue, 7 Aug 2018 13:31:21 -0600
Subject: [Qemu-devel] [PATCH v3 0/4] Balloon inhibit enhancements, vfio restriction

Alex Williamson Aug. 7, 2018, 7:31 p.m. UTC

v3:
 - Drop "nested" term in commit log (David)
 - Adopt suggested wording in ccw code (Cornelia)
 - Explain balloon inhibitor usage in vfio common (Peter)
 - Fix to call inhibitor prior to re-using existing containers,
   to avoid a gap in which pinning may have occurred in the set
   container ioctl (self) - Peter, this change is the reason I
   didn't include your R-b.
 - Add R-b to patches 1 & 2

v2:
 - Use atomic ops for balloon inhibit counter (Peter)
 - Allow endpoint driver opt-in for ballooning, vfio-ccw opt-in by
   default, vfio-pci opt-in by device option, only allowed for mdev
   devices, no support added for platform as there are no platform
   mdev devices.

See patch 3/4 for detailed explanation why ballooning and device
assignment typically don't mix.  If this eventually changes, flags
on the iommu info struct or perhaps device info struct can inform
us for automatic opt-in.  Thanks,

Alex

Alex Williamson (4):
  balloon: Allow multiple inhibit users
  kvm: Use inhibit to prevent ballooning without synchronous mmu
  vfio: Inhibit ballooning based on group attachment to a container
  vfio/ccw/pci: Allow devices to opt-in for ballooning

 accel/kvm/kvm-all.c           |  4 +++
 balloon.c                     | 13 ++++++---
 hw/vfio/ccw.c                 |  9 +++++++
 hw/vfio/common.c              | 51 +++++++++++++++++++++++++++++++++++
 hw/vfio/pci.c                 | 26 +++++++++++++++++-
 hw/vfio/trace-events          |  1 +
 hw/virtio/virtio-balloon.c    |  4 +--
 include/hw/vfio/vfio-common.h |  2 ++
 8 files changed, 103 insertions(+), 7 deletions(-)
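
The "multiple inhibit users" and "atomic ops for balloon inhibit counter"
items above can be pictured as a reference count: each user adds or drops
one inhibit, and ballooning stays blocked while the count is non-zero.  A
minimal sketch, assuming C11 atomics; the names balloon_inhibit,
balloon_is_inhibited and inhibit_count are hypothetical and are not taken
from the patches:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical counter: number of users currently inhibiting ballooning. */
static atomic_int inhibit_count;

/* A user passes true to take an inhibit and false to drop it again. */
void balloon_inhibit(bool state)
{
    int old = atomic_fetch_add(&inhibit_count, state ? 1 : -1);

    /* Dropping an inhibit that was never taken would be a caller bug. */
    assert(state || old > 0);
}

/* Ballooning stays blocked while at least one user holds an inhibit. */
bool balloon_is_inhibited(void)
{
    return atomic_load(&inhibit_count) > 0;
}
```

With a counter rather than a single flag, one user dropping its inhibit
does not re-enable ballooning while another user still needs it blocked.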

Michael S. Tsirkin Aug. 7, 2018, 7:44 p.m. UTC | #1

On Tue, Aug 07, 2018 at 01:31:21PM -0600, Alex Williamson wrote:
> v3:
>  - Drop "nested" term in commit log (David)
>  - Adopt suggested wording in ccw code (Cornelia)
>  - Explain balloon inhibitor usage in vfio common (Peter)
>  - Fix to call inhibitor prior to re-using existing containers
>    to avoid gap that pinning may have occurred in set container
>    ioctl (self) - Peter, this change is the reason I didn't
>    include your R-b.
>  - Add R-b to patches 1 & 2
> 
> v2:
>  - Use atomic ops for balloon inhibit counter (Peter)
>  - Allow endpoint driver opt-in for ballooning, vfio-ccw opt-in by
>    default, vfio-pci opt-in by device option, only allowed for mdev
>    devices, no support added for platform as there are no platform
>    mdev devices.
> 
> See patch 3/4 for detailed explanation why ballooning and device
> assignment typically don't mix.  If this eventually changes, flags
> on the iommu info struct or perhaps device info struct can inform
> us for automatic opt-in.  Thanks,
> 
> Alex

One of the issues with pass-through is that it breaks overcommit
through swap.  Ballooning seems to offer one solution; instead of
making it work, this patch just attempts to block ballooning.

I guess it's better than corrupting memory but I personally find this
approach disappointing.


> Alex Williamson (4):
>   balloon: Allow multiple inhibit users
>   kvm: Use inhibit to prevent ballooning without synchronous mmu
>   vfio: Inhibit ballooning based on group attachment to a container
>   vfio/ccw/pci: Allow devices to opt-in for ballooning
> 
>  accel/kvm/kvm-all.c           |  4 +++
>  balloon.c                     | 13 ++++++---
>  hw/vfio/ccw.c                 |  9 +++++++
>  hw/vfio/common.c              | 51 +++++++++++++++++++++++++++++++++++
>  hw/vfio/pci.c                 | 26 +++++++++++++++++-
>  hw/vfio/trace-events          |  1 +
>  hw/virtio/virtio-balloon.c    |  4 +--
>  include/hw/vfio/vfio-common.h |  2 ++
>  8 files changed, 103 insertions(+), 7 deletions(-)
> 
> -- 
> 2.18.0

Alex Williamson Aug. 7, 2018, 7:53 p.m. UTC | #2

On Tue, 7 Aug 2018 22:44:56 +0300
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Tue, Aug 07, 2018 at 01:31:21PM -0600, Alex Williamson wrote:
> > v3:
> >  - Drop "nested" term in commit log (David)
> >  - Adopt suggested wording in ccw code (Cornelia)
> >  - Explain balloon inhibitor usage in vfio common (Peter)
> >  - Fix to call inhibitor prior to re-using existing containers
> >    to avoid gap that pinning may have occurred in set container
> >    ioctl (self) - Peter, this change is the reason I didn't
> >    include your R-b.
> >  - Add R-b to patches 1 & 2
> > 
> > v2:
> >  - Use atomic ops for balloon inhibit counter (Peter)
> >  - Allow endpoint driver opt-in for ballooning, vfio-ccw opt-in by
> >    default, vfio-pci opt-in by device option, only allowed for mdev
> >    devices, no support added for platform as there are no platform
> >    mdev devices.
> > 
> > See patch 3/4 for detailed explanation why ballooning and device
> > assignment typically don't mix.  If this eventually changes, flags
> > on the iommu info struct or perhaps device info struct can inform
> > us for automatic opt-in.  Thanks,
> > 
> > Alex  
> 
> One of the issues with pass-through is that it breaks overcommit
> through swap. ballooning seems to offer one solution, instead of
> making it work this patch just attempts to block ballooning.
> 
> I guess it's better than corrupting memory but I personally find this
> approach disappointing.

Memory hotplug is the way to achieve variable density with assigned
device VMs, otherwise look towards approaches like mdev and shared
virtual addresses with PASID support.  We cannot shoehorn page faulting
without both hardware and software support.  Some class of "legacy"
device assignment will always have this incompatibility.  Thanks,

Alex

Michael S. Tsirkin Aug. 7, 2018, 9:58 p.m. UTC | #3

On Tue, Aug 07, 2018 at 01:53:03PM -0600, Alex Williamson wrote:
> On Tue, 7 Aug 2018 22:44:56 +0300
> "Michael S. Tsirkin" <mst@redhat.com> wrote:
> 
> > On Tue, Aug 07, 2018 at 01:31:21PM -0600, Alex Williamson wrote:
> > > v3:
> > >  - Drop "nested" term in commit log (David)
> > >  - Adopt suggested wording in ccw code (Cornelia)
> > >  - Explain balloon inhibitor usage in vfio common (Peter)
> > >  - Fix to call inhibitor prior to re-using existing containers
> > >    to avoid gap that pinning may have occurred in set container
> > >    ioctl (self) - Peter, this change is the reason I didn't
> > >    include your R-b.
> > >  - Add R-b to patches 1 & 2
> > > 
> > > v2:
> > >  - Use atomic ops for balloon inhibit counter (Peter)
> > >  - Allow endpoint driver opt-in for ballooning, vfio-ccw opt-in by
> > >    default, vfio-pci opt-in by device option, only allowed for mdev
> > >    devices, no support added for platform as there are no platform
> > >    mdev devices.
> > > 
> > > See patch 3/4 for detailed explanation why ballooning and device
> > > assignment typically don't mix.  If this eventually changes, flags
> > > on the iommu info struct or perhaps device info struct can inform
> > > us for automatic opt-in.  Thanks,
> > > 
> > > Alex  
> > 
> > One of the issues with pass-through is that it breaks overcommit
> > through swap. ballooning seems to offer one solution, instead of
> > making it work this patch just attempts to block ballooning.
> > 
> > I guess it's better than corrupting memory but I personally find this
> > approach disappointing.
> 
> Memory hotplug is the way to achieve variable density with assigned
> device VMs, otherwise look towards approaches like mdev and shared
> virtual addresses with PASID support.  We cannot shoehorn page faulting
> without both hardware and software support.  Some class of "legacy"
> device assignment will always have this incompatibility.  Thanks,
> 
> Alex

I'm not sure I agree.

At least with VTD, it seems entirely possible to change e.g. a PMD
atomically to point to a different set of PTEs, then flush.
That will allow removing memory at high granularity for
an arbitrary device without mdev or PASID dependency.

I suspect most IOMMUs are like this.

IIUC doing that within guest right now will cause a range to be unmapped
and then mapped again, which I suspect only works if we are lucky and
device does not access the range during this time.

So at some level it's a theoretical bug we would do well to fix,
and then we can support ballooning better.
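
The difference between an atomic switch and an unmap followed by a map can
be shown with a toy model.  This is only a sketch in ordinary memory,
assuming C11 atomics: it does not use the VT-d page table format, it does
not model the IOTLB flush, and all names (pte_table, dir_entry, translate,
swap_table) are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define PTES_PER_TABLE 512

/* Toy leaf table: each entry holds a page frame number, 0 = not present. */
typedef struct {
    uint64_t pte[PTES_PER_TABLE];
} pte_table;

/* Toy directory entry: one atomically updated pointer to a leaf table. */
typedef struct {
    _Atomic(pte_table *) table;
} dir_entry;

/* What a device access would see: walk through the current leaf table. */
uint64_t translate(dir_entry *pmd, size_t idx)
{
    assert(idx < PTES_PER_TABLE);
    pte_table *t = atomic_load(&pmd->table);
    return t ? t->pte[idx] : 0;
}

/*
 * Atomic switch: readers see either the old table or the new one, never
 * an empty entry.  The old table is returned so the caller can release
 * it once any cached translations have been flushed.
 */
pte_table *swap_table(dir_entry *pmd, pte_table *new_table)
{
    return atomic_exchange(&pmd->table, new_table);
}
```

In this model, storing NULL into the entry and then storing the new table
is the unmap-then-map sequence: between the two stores translate() returns
0 for every index, including pages that are valid in both tables.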

Alex Williamson Aug. 7, 2018, 10:40 p.m. UTC | #4

On Wed, 8 Aug 2018 00:58:32 +0300
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Tue, Aug 07, 2018 at 01:53:03PM -0600, Alex Williamson wrote:
> > On Tue, 7 Aug 2018 22:44:56 +0300
> > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> >   
> > > On Tue, Aug 07, 2018 at 01:31:21PM -0600, Alex Williamson wrote:  
> > > > v3:
> > > >  - Drop "nested" term in commit log (David)
> > > >  - Adopt suggested wording in ccw code (Cornelia)
> > > >  - Explain balloon inhibitor usage in vfio common (Peter)
> > > >  - Fix to call inhibitor prior to re-using existing containers
> > > >    to avoid gap that pinning may have occurred in set container
> > > >    ioctl (self) - Peter, this change is the reason I didn't
> > > >    include your R-b.
> > > >  - Add R-b to patches 1 & 2
> > > > 
> > > > v2:
> > > >  - Use atomic ops for balloon inhibit counter (Peter)
> > > >  - Allow endpoint driver opt-in for ballooning, vfio-ccw opt-in by
> > > >    default, vfio-pci opt-in by device option, only allowed for mdev
> > > >    devices, no support added for platform as there are no platform
> > > >    mdev devices.
> > > > 
> > > > See patch 3/4 for detailed explanation why ballooning and device
> > > > assignment typically don't mix.  If this eventually changes, flags
> > > > on the iommu info struct or perhaps device info struct can inform
> > > > us for automatic opt-in.  Thanks,
> > > > 
> > > > Alex    
> > > 
> > > One of the issues with pass-through is that it breaks overcommit
> > > through swap. ballooning seems to offer one solution, instead of
> > > making it work this patch just attempts to block ballooning.
> > > 
> > > I guess it's better than corrupting memory but I personally find this
> > > approach disappointing.  
> > 
> > Memory hotplug is the way to achieve variable density with assigned
> > device VMs, otherwise look towards approaches like mdev and shared
> > virtual addresses with PASID support.  We cannot shoehorn page faulting
> > without both hardware and software support.  Some class of "legacy"
> > device assignment will always have this incompatibility.  Thanks,
> > 
> > Alex  
> 
> I'm not sure I agree.
> 
> At least with VTD, it seems entirely possible to change e.g. a PMD
> atomically to point to a different set of PTEs, then flush.
> That will allow removing memory at high granularity for
> an arbitrary device without mdev or PASID dependency.
> 
> I suspect most IOMMUs are like this.
> 
> IIUC doing that within guest right now will cause a range to be unmapped
> and then mapped again, which I suspect only works if we are lucky and
> device does not access the range during this time.
> 
> So at some level it's a theoretical bug we would do well to fix,
> and then we can support ballooning better.

Being able to unmap the page atomically from the IOMMU is one aspect,
the other is re-mapping the page when the balloon is deflated, which is
currently done only via a page fault.  We cannot guarantee that a vCPU
will touch a page before the IO device does, so something needs to
fault in that page for the IOMMU.  So we have:

 - How do we handle re-mapping pages as the balloon is deflated?
   - IOMMU page faults?  Requires PRI, IOMMU & endpoint support.
   - Some new MMU notifier hook?  Not sure WILLNEED is appropriate here.

 - How do we handle un-mapping pages as the balloon is inflated?
   - Rewrite the kernel IOMMU API and IOMMU drivers to allow unmapping
     sub-pages within previous mappings.
   - MMU notifier hook to trigger above non-existent code?
   - Alternatively, sacrificing IOTLB performance and probably kernel
     bloat by using only PAGE_SIZE IOMMU mappings.

Maybe some of these will evolve over time, SVA efforts are working on
some of these interfaces, but apparently device assignment users have
been getting along just fine without ballooning for many years.  With
physical devices, or even modern VFs, it seems hard to push density
beyond what we can handle with memory hotplug.  Perhaps as we get into
scalable IOV type approaches we can opt-in more mediated devices by
default.  It seems like we're just going around in circles here, though;
anything more than preventing QEMU from shooting itself is a long-term
goal touching multiple levels of the stack.  Thanks,

Alex
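
To put a rough number on the cost of using only PAGE_SIZE IOMMU mappings,
the count of mappings needed to cover a region can be computed directly.
A small sketch, assuming a 4 KiB PAGE_SIZE and 2 MiB / 1 GiB superpages;
mappings_needed is a hypothetical helper, not a kernel or QEMU function:

```c
#include <assert.h>
#include <stdint.h>

/* Number of mappings needed to cover region_bytes with one page size. */
uint64_t mappings_needed(uint64_t region_bytes, uint64_t page_bytes)
{
    assert(page_bytes > 0);
    /* Round up so a partially covered tail still gets a mapping. */
    return (region_bytes + page_bytes - 1) / page_bytes;
}
```

Under these assumptions, 1 GiB of guest RAM takes 262144 mappings at
4 KiB, 512 at 2 MiB, and a single mapping at 1 GiB.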

Michael S. Tsirkin Aug. 8, 2018, 12:02 a.m. UTC | #5

On Tue, Aug 07, 2018 at 04:40:33PM -0600, Alex Williamson wrote:
> Maybe some of these will evolve over time, SVA efforts are working on
> some of these interfaces, but apparently device assignment users have
> been getting along just fine without ballooning for many years.

But not any more I think.  It takes all the running you can do, to keep
in the same place.

Overcommit with device specific drivers is one of the things that
containers do better than VMs. If VMs had a better overcommit story with
PT devices, it would be interesting IMHO.

> With
> physical devices, or even modern VFs, it seems hard to push density
> beyond what we can handle with memory hotplug.  Perhaps as we get into
> scalable IOV type approaches we can opt-in more mediated devices by
> default.

I'm not sure what mediated has to do with it, though.  It seems
weird to fix internal Linux or even system call interfaces being
inadequate with custom hardware.

> It seems like we're just going around in circles here though,
> anything more than preventing QEMU from shooting itself is a long term
> goal touching multiple levels of the stack.

It's just QEMU and the kernel, I don't see why any other levels would be
involved. And it looks like we both agree it is a bug in the current VTD
emulation even though current guests do not trigger it.

I agree it's more work than just blocking things out, I am not making an
argument for nacking this specific patch, but I do hope this thread
motivates someone to look into it.

Peter Xu Aug. 8, 2018, 3:45 a.m. UTC | #6

On Wed, Aug 08, 2018 at 12:58:32AM +0300, Michael S. Tsirkin wrote:
> At least with VTD, it seems entirely possible to change e.g. a PMD
> atomically to point to a different set of PTEs, then flush.
> That will allow removing memory at high granularity for
> an arbitrary device without mdev or PASID dependency.

My understanding is that the guest driver should prohibit this kind of
operation (say, modifying PMD).  Actually I don't see how it can
happen in Linux if the kernel drivers always call the IOMMU API since
there are only map/unmap APIs rather than this atomic-modify API.

The thing is that IMHO it's the guest driver's responsibility to make
sure the pages will never be used by the device before it removes the
entry (including modifying the PMD since that actually removes all the
entries on the old PMD).  If not, I would see it as a guest kernel bug
rather than a bug in the emulation code.

Thanks,

Alex Williamson Aug. 8, 2018, 10:23 p.m. UTC | #7

On Wed, 8 Aug 2018 11:45:43 +0800
Peter Xu <peterx@redhat.com> wrote:

> On Wed, Aug 08, 2018 at 12:58:32AM +0300, Michael S. Tsirkin wrote:
> > At least with VTD, it seems entirely possible to change e.g. a PMD
> > atomically to point to a different set of PTEs, then flush.
> > That will allow removing memory at high granularity for
> > an arbitrary device without mdev or PASID dependency.  
> 
> My understanding is that the guest driver should prohibit this kind of
> operation (say, modifying PMD).

There's currently no need for this sort of operation within the dma api
and the iommu api doesn't offer it either.

> Actually I don't see how it can
> happen in Linux if the kernel drivers always call the IOMMU API since
> there are only map/unmap APIs rather than this atomic-modify API.

Exactly, the vfio dma mapping api is just an extension of the iommu api
and there's only map and unmap.  Furthermore, unmap can currently return
more than requested if the original mapping made use of superpages in
the iommu, so the only way to achieve page level granularity is to make
only page size mappings.  Otherwise we're talking about new apis
across the board.

> The thing is that IMHO it's the guest driver's responsibility to make
> sure the pages will never be used by the device before it removes the
> entry (including modifying the PMD since that actually removes all the
> entries on the old PMD).  If not, I would see it a guest kernel bug
> instead of the bug in the emulation code.

This is why there is no atomic modify in the dma api: we have drivers
that directly manage the buffers for a device and know when it's in use
and when it's not.  There's never a need, currently, to replace the iova
mapping for a single page within a larger buffer.  Maybe the dma api
could also find use for it, but it seems more unique to the iommu api
that we have a "buffer", which happens to be a contiguous RAM region
for the VM, where we do want to change the mapping of a single page.
That single page might currently be mapped by a 2MB or 1GB page in the
case of Intel, or by an arbitrary page size in the case of AMD.  vfio
is the driver managing these mappings, but unlike users of the dma api, we
don't have any insight into the device behavior, including inflight dma.  We can
stop all dma for the device, but not without interfering and potentially
breaking the behavior of the device.

So again, I think this comes down to new iommu driver support and new
iommu apis and new vfio apis to enable some sort of atomic update
interface, or sacrificing performance and adding bloat by forcing page
size mappings.  Thanks,

Alex
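
The point that unmap can return more than requested when the original
mapping used a superpage can be sketched with a toy model.  This is an
assumption-laden illustration, not the kernel iommu api: toy_mapping and
toy_unmap are hypothetical, and each mapping is treated as one indivisible
IOMMU page.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy mapping: one indivisible IOMMU page of the given size. */
typedef struct {
    uint64_t iova;
    uint64_t size;
    int present;
} toy_mapping;

/*
 * Remove every mapping that overlaps [iova, iova + size) and return the
 * number of bytes actually unmapped.  A mapping cannot be split, so a
 * 4 KiB request that lands inside a 2 MiB mapping removes all 2 MiB.
 */
uint64_t toy_unmap(toy_mapping *maps, size_t n, uint64_t iova, uint64_t size)
{
    uint64_t unmapped = 0;

    assert(size > 0);
    for (size_t i = 0; i < n; i++) {
        toy_mapping *m = &maps[i];
        int overlaps = m->present &&
                       m->iova < iova + size &&
                       iova < m->iova + m->size;

        if (overlaps) {
            m->present = 0;
            unmapped += m->size;
        }
    }
    return unmapped;
}
```

In this model, page-level granularity is only available when every mapping
was created at page size in the first place, which is the trade-off
described above.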

Michael S. Tsirkin Aug. 9, 2018, 9:20 a.m. UTC | #8

On Wed, Aug 08, 2018 at 04:23:04PM -0600, Alex Williamson wrote:
> So again, I think this comes down to new iommu driver support and new
> iommu apis and new vfio apis to enable some sort of atomic update
> interface,

Oh absolutely. My point is some guest OS can start using atomic updates
at any time since it's something IOMMU hardware supports.  Adherence to
a hardware spec would be preferable to adherence to an internal Linux
API.  I appreciate it's not an easy task involving host Linux and QEMU
changes.

Michael S. Tsirkin Aug. 9, 2018, 9:23 a.m. UTC | #9

On Wed, Aug 08, 2018 at 11:45:43AM +0800, Peter Xu wrote:
> On Wed, Aug 08, 2018 at 12:58:32AM +0300, Michael S. Tsirkin wrote:
> > At least with VTD, it seems entirely possible to change e.g. a PMD
> > atomically to point to a different set of PTEs, then flush.
> > That will allow removing memory at high granularity for
> > an arbitrary device without mdev or PASID dependency.
> 
> My understanding is that the guest driver should prohibit this kind of
> operation (say, modifying PMD).

Interesting.  Which part of the VTD spec prohibits this?

> Actually I don't see how it can
> happen in Linux if the kernel drivers always call the IOMMU API since
> there are only map/unmap APIs rather than this atomic-modify API.

It could happen with a non-Linux guest which might have a different API.

> The thing is that IMHO it's the guest driver's responsibility to make
> sure the pages will never be used by the device before it removes the
> entry (including modifying the PMD since that actually removes all the
> entries on the old PMD).

If you switch PMDs atomically from one set of valid PTEs to another,
then flush, then as far as I could see it just works in the hardware
VTD, but not in the emulated VTD. So that's a difference in
behaviour. Maybe we are lucky and no one does that.

>  If not, I would see it a guest kernel bug
> instead of the bug in the emulation code.
> 
> Thanks,
> 
> -- 
> Peter Xu

Peter Xu Aug. 9, 2018, 9:37 a.m. UTC | #10

On Thu, Aug 09, 2018 at 12:23:43PM +0300, Michael S. Tsirkin wrote:
> On Wed, Aug 08, 2018 at 11:45:43AM +0800, Peter Xu wrote:
> > On Wed, Aug 08, 2018 at 12:58:32AM +0300, Michael S. Tsirkin wrote:
> > > At least with VTD, it seems entirely possible to change e.g. a PMD
> > > atomically to point to a different set of PTEs, then flush.
> > > That will allow removing memory at high granularity for
> > > an arbitrary device without mdev or PASID dependency.
> > 
> > My understanding is that the guest driver should prohibit this kind of
> > operation (say, modifying PMD).
> 
> Interesting.  Which part of the VTD spec prohibits this?
> 
> > Actually I don't see how it can
> > happen in Linux if the kernel drivers always call the IOMMU API since
> > there are only map/unmap APIs rather than this atomic-modify API.
> 
> It could happen with a non-Linux guest which might have a different API.
> 
> > The thing is that IMHO it's the guest driver's responsibility to make
> > sure the pages will never be used by the device before it removes the
> > entry (including modifying the PMD since that actually removes all the
> > entries on the old PMD).
> 
> If you switch PMDs atomically from one set of valid PTEs to another,
> then flush, then as far as I could see it just works in the hardware
> VTD, but not in the emulated VTD. So that's a difference in
> behaviour. Maybe we are lucky and no one does that.

Yes, but AFAICT that's also the best we can have now since the
userspace QEMU (or say, the VT-d emulation code) cannot really modify
a real PMD that the hardware uses - it can only call the VFIO APIs,
and finally it boils down again to the host kernel IOMMU APIs to do
map or unmap only.  So it's an impossible task until we provide such an
interface through the whole IOMMU/VFIO/... stack just like what you
have discussed in the other thread.

Thanks,

Michael S. Tsirkin Aug. 9, 2018, 10:13 a.m. UTC | #11

On Thu, Aug 09, 2018 at 05:37:58PM +0800, Peter Xu wrote:
> On Thu, Aug 09, 2018 at 12:23:43PM +0300, Michael S. Tsirkin wrote:
> > On Wed, Aug 08, 2018 at 11:45:43AM +0800, Peter Xu wrote:
> > > On Wed, Aug 08, 2018 at 12:58:32AM +0300, Michael S. Tsirkin wrote:
> > > > At least with VTD, it seems entirely possible to change e.g. a PMD
> > > > atomically to point to a different set of PTEs, then flush.
> > > > That will allow removing memory at high granularity for
> > > > an arbitrary device without mdev or PASID dependency.
> > > 
> > > My understanding is that the guest driver should prohibit this kind of
> > > operation (say, modifying PMD).
> > 
> > Interesting.  Which part of the VTD spec prohibits this?
> > 
> > > Actually I don't see how it can
> > > happen in Linux if the kernel drivers always call the IOMMU API since
> > > there are only map/unmap APIs rather than this atomic-modify API.
> > 
> > It could happen with a non-Linux guest which might have a different API.
> > 
> > > The thing is that IMHO it's the guest driver's responsibility to make
> > > sure the pages will never be used by the device before it removes the
> > > entry (including modifying the PMD since that actually removes all the
> > > entries on the old PMD).
> > 
> > If you switch PMDs atomically from one set of valid PTEs to another,
> > then flush, then as far as I could see it just works in the hardware
> > VTD, but not in the emulated VTD. So that's a difference in
> > behaviour. Maybe we are lucky and no one does that.
> 
> Yes, but AFAICT that's also the best we can have now since the
> userspace QEMU (or say, the VT-d emulation code) cannot really modify
> a real PMD that the hardware uses - it can only call the VFIO APIs,
> and finally it boils down again to the host kernel IOMMU APIs to do
> map or unmap only.  So it's an impossible task until we provide such an
> interface through the whole IOMMU/VFIO/... stack just like what you
> have discussed in the other thread.
> 
> Thanks,


This would need host kernel support, yes.

> -- 
> Peter Xu
