Message ID | 1349962023-560-8-git-send-email-avi@redhat.com |
---|---|
State | New |
On Thu, Oct 11, 2012 at 03:44:10PM +0200, Avi Kivity wrote:
> On 10/11/2012 03:44 PM, Michael S. Tsirkin wrote:
> > On Thu, Oct 11, 2012 at 03:34:54PM +0200, Avi Kivity wrote:
> >> On 10/11/2012 03:31 PM, Michael S. Tsirkin wrote:
> >> > On Thu, Oct 11, 2012 at 03:27:03PM +0200, Avi Kivity wrote:
> >> >> vhost doesn't support guest iommus yet, indicate it to the user
> >> >> by gently depositing a core on their disk.
> >> >>
> >> >> Signed-off-by: Avi Kivity <avi@redhat.com>
> >> >
> >> > Actually there is no problem. virtio bypasses an IOMMU,
> >> > so vhost works fine by writing into guest memory directly.
> >> >
> >> > So I don't think we need this patch.
> >>
> >> The pci subsystem should set up the iommu so that it ignores virtio
> >> devices.  If it does, an emulated iommu will not reach vhost.  If it
> >> doesn't, then it will, and the assert() will alert us that we have a bug.
> >
> > You mean pci subsystem in the guest? I'm pretty sure that's not
> > the case at the moment: iommu is on by default and applies
> > to all devices unless you do something special.
> > I see where you are coming from but it does
> > not look right to break all existing guests.
>
> No, qemu should configure virtio devices to bypass the iommu, even if it
> is on.

Okay so there will be some API that virtio devices should call
to achieve this?

> > Also - I see no reason to single out vhost - I think same applies with
> > any virtio device, since it doesn't use the DMA API.
>
> True.
>
> --
> error compiling committee.c: too many arguments to function
On 10/11/2012 04:35 PM, Michael S. Tsirkin wrote:
>> No, qemu should configure virtio devices to bypass the iommu, even if it
>> is on.
>
> Okay so there will be some API that virtio devices should call
> to achieve this?

The iommu should probably call pci_device_bypasses_iommu() to check for
such devices.
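To make the proposal concrete, here is a minimal sketch of how an emulated IOMMU's translate path could consult such a check. This is only an illustration of the idea: pci_device_bypasses_iommu() is merely proposed in this thread, and the ToyPCIDevice/IOMMUTranslation types are invented stand-ins, not the QEMU memory API.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t hwaddr;

/* Toy stand-ins for illustration only; not QEMU types. */
typedef struct {
    bool bypass_iommu;        /* set for virtio devices at device setup time */
} ToyPCIDevice;

typedef struct {
    hwaddr translated_addr;   /* bus address after (possible) translation */
    bool   valid;
} IOMMUTranslation;

/* Stands in for the pci_device_bypasses_iommu() helper proposed above. */
bool pci_device_bypasses_iommu(ToyPCIDevice *dev)
{
    return dev->bypass_iommu;
}

/* Placeholder page walk (think of the XOR test IOMMU mentioned later). */
IOMMUTranslation toy_iommu_page_walk(hwaddr addr)
{
    return (IOMMUTranslation){ .translated_addr = addr ^ 0x1000, .valid = true };
}

/* Translate a DMA address issued by @dev. */
IOMMUTranslation toy_iommu_translate(ToyPCIDevice *dev, hwaddr addr)
{
    if (pci_device_bypasses_iommu(dev)) {
        /* Virtio-style device: identity (1:1) mapping, no page walk. */
        return (IOMMUTranslation){ .translated_addr = addr, .valid = true };
    }
    return toy_iommu_page_walk(addr);
}
```

With a check like this in the translate path, the guest can leave its IOMMU enabled globally while virtio devices still see guest-physical addresses, which is what the vhost backend assumes.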
On Thu, Oct 11, 2012 at 04:35:23PM +0200, Avi Kivity wrote:
> On 10/11/2012 04:35 PM, Michael S. Tsirkin wrote:
>
> >> No, qemu should configure virtio devices to bypass the iommu, even if it
> >> is on.
> >
> > Okay so there will be some API that virtio devices should call
> > to achieve this?
>
> The iommu should probably call pci_device_bypasses_iommu() to check for
> such devices.

So maybe this patch should depend on the introduction of such
an API.

> --
> error compiling committee.c: too many arguments to function
On 10/11/2012 05:34 PM, Michael S. Tsirkin wrote:
> On Thu, Oct 11, 2012 at 04:35:23PM +0200, Avi Kivity wrote:
>> On 10/11/2012 04:35 PM, Michael S. Tsirkin wrote:
>>
>> >> No, qemu should configure virtio devices to bypass the iommu, even if it
>> >> is on.
>> >
>> > Okay so there will be some API that virtio devices should call
>> > to achieve this?
>>
>> The iommu should probably call pci_device_bypasses_iommu() to check for
>> such devices.
>
> So maybe this patch should depend on the introduction of such
> an API.

I've dropped it for now.

In fact, virtio/vhost are safe since they use cpu_physical_memory_rw()
and the memory listener watches address_space_memory, no iommu there.
vfio needs to change to listen to pci_dev->bus_master_as, and need
special handling for iommu regions (abort for now, type 2 iommu later).
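A rough sketch of the "abort for now" handling described here for a vfio-style listener. The types are invented for illustration (the real code would use MemoryListener and MemoryRegionSection); the shape mirrors the vhost patch at the bottom of this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-ins; the real code would use MemoryRegionSection et al. */
typedef struct {
    bool     is_iommu;        /* region is an IOMMU translation region */
    uint64_t offset_within_address_space;
    uint64_t size;
} ToySection;

/* Called for every region in the device's bus-master address space. */
void toy_vfio_region_add(ToySection *section)
{
    /*
     * "Abort for now": a guest IOMMU in front of an assigned device
     * cannot be honoured yet, so fail loudly instead of silently
     * DMA-ing to the wrong addresses.  Later, a type 2 iommu backend
     * could program the mapping into the host IOMMU instead.
     */
    assert(!section->is_iommu);

    /* ... pin the pages and map the offset for device DMA ... */
}
```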
On Thu, 2012-10-11 at 17:48 +0200, Avi Kivity wrote:
> On 10/11/2012 05:34 PM, Michael S. Tsirkin wrote:
> > On Thu, Oct 11, 2012 at 04:35:23PM +0200, Avi Kivity wrote:
> >> On 10/11/2012 04:35 PM, Michael S. Tsirkin wrote:
> >>
> >> >> No, qemu should configure virtio devices to bypass the iommu, even if it
> >> >> is on.
> >> >
> >> > Okay so there will be some API that virtio devices should call
> >> > to achieve this?
> >>
> >> The iommu should probably call pci_device_bypasses_iommu() to check for
> >> such devices.
> >
> > So maybe this patch should depend on the introduction of such
> > an API.
>
> I've dropped it for now.
>
> In fact, virtio/vhost are safe since they use cpu_physical_memory_rw()
> and the memory listener watches address_space_memory, no iommu there.
> vfio needs to change to listen to pci_dev->bus_master_as, and need
> special handling for iommu regions (abort for now, type 2 iommu later).

I don't see how we can ever support an assigned device with the
translate function.  Don't we want a flat address space at run time
anyway?  IOMMU drivers go to pains to make IOTLB updates efficient and
drivers optimize for long running translations, but here we impose a
penalty on every access.  I think we'd be more efficient and better able
to support assigned devices if the per device/bus address space was
updated and flattened when it changes.  Being able to implement an XOR
IOMMU is impressive, but is it practical?  We could be doing much more
practical things like nested device assignment with a flattened
translation ;)  Thanks,

Alex
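For reference, the "flatten when it changes" approach amounts to rebuilding a sorted table of (iova, target, length) ranges whenever the guest updates its IO page tables, so each DMA access costs a lookup rather than a page-table walk. A toy sketch with invented types, not the actual QEMU or VFIO API:

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint64_t iova;     /* device-visible address */
    uint64_t target;   /* translated (guest-physical) address */
    uint64_t len;
} FlatRange;

/* Rebuilt by the IOMMU emulation whenever the guest changes its mappings. */
typedef struct {
    FlatRange *ranges;  /* sorted by iova, non-overlapping */
    size_t     nr;
} FlatView;

/* Per-access cost: binary search instead of a guest page-table walk. */
int flatview_translate(const FlatView *fv, uint64_t iova, uint64_t *target)
{
    size_t lo = 0, hi = fv->nr;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        const FlatRange *r = &fv->ranges[mid];

        if (iova < r->iova) {
            hi = mid;
        } else if (iova >= r->iova + r->len) {
            lo = mid + 1;
        } else {
            *target = r->target + (iova - r->iova);
            return 0;
        }
    }
    return -1;  /* no mapping: would raise an IOMMU fault */
}

int main(void)
{
    FlatRange ranges[] = {
        { .iova = 0x1000, .target = 0x40000, .len = 0x2000 },
        { .iova = 0x8000, .target = 0x90000, .len = 0x1000 },
    };
    FlatView fv = { .ranges = ranges, .nr = 2 };
    uint64_t target;

    (void)flatview_translate(&fv, 0x1800, &target);  /* target == 0x40800 */
    return 0;
}
```

Avi's objection below is that keeping such a view current from qemu means walking all the guest IO page tables on every change, which can be both frequent and huge.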
On Thu, Oct 11, 2012 at 11:48 PM, Avi Kivity <avi@redhat.com> wrote:
> On 10/11/2012 05:34 PM, Michael S. Tsirkin wrote:
>> On Thu, Oct 11, 2012 at 04:35:23PM +0200, Avi Kivity wrote:
>>> On 10/11/2012 04:35 PM, Michael S. Tsirkin wrote:
>>>
>>> >> No, qemu should configure virtio devices to bypass the iommu, even if it
>>> >> is on.
>>> >
>>> > Okay so there will be some API that virtio devices should call
>>> > to achieve this?
>>>
>>> The iommu should probably call pci_device_bypasses_iommu() to check for
>>> such devices.
>>
>> So maybe this patch should depend on the introduction of such
>> an API.
>
> I've dropped it for now.
>
> In fact, virtio/vhost are safe since they use cpu_physical_memory_rw()
> and the memory listener watches address_space_memory, no iommu there.

Not quite sure what you mean. My understanding is that, as a PCI
device, vhost can lie behind an iommu in the topology, which means the
transactions it launches can be intercepted by the emulated iommu. But
we make an exception for the vhost device and enforce
address_space_rw(address_space_memory, ..), NOT
address_space_rw(pci_dev->bus_master_as, ..), so we bypass the iommu.
Right?

Regards,
pingfan

> vfio needs to change to listen to pci_dev->bus_master_as, and need
> special handling for iommu regions (abort for now, type 2 iommu later).
>
> --
> error compiling committee.c: too many arguments to function
On 10/11/2012 09:38 PM, Alex Williamson wrote:
> On Thu, 2012-10-11 at 17:48 +0200, Avi Kivity wrote:
>> On 10/11/2012 05:34 PM, Michael S. Tsirkin wrote:
>> > On Thu, Oct 11, 2012 at 04:35:23PM +0200, Avi Kivity wrote:
>> >> On 10/11/2012 04:35 PM, Michael S. Tsirkin wrote:
>> >>
>> >> >> No, qemu should configure virtio devices to bypass the iommu, even if it
>> >> >> is on.
>> >> >
>> >> > Okay so there will be some API that virtio devices should call
>> >> > to achieve this?
>> >>
>> >> The iommu should probably call pci_device_bypasses_iommu() to check for
>> >> such devices.
>> >
>> > So maybe this patch should depend on the introduction of such
>> > an API.
>>
>> I've dropped it for now.
>>
>> In fact, virtio/vhost are safe since they use cpu_physical_memory_rw()
>> and the memory listener watches address_space_memory, no iommu there.
>> vfio needs to change to listen to pci_dev->bus_master_as, and need
>> special handling for iommu regions (abort for now, type 2 iommu later).
>
> I don't see how we can ever support an assigned device with the
> translate function.

We cannot.

> Don't we want a flat address space at run time
> anyway?

Not if we want vfio-in-the-guest (for nested virt or OS bypass).

> IOMMU drivers go to pains to make IOTLB updates efficient and
> drivers optimize for long running translations, but here we impose a
> penalty on every access.  I think we'd be more efficient and better able
> to support assigned devices if the per device/bus address space was
> updated and flattened when it changes.

A flattened address space cannot be efficiently implemented with a
->translate() callback.  Describing the transformed address space
requires walking all the iommu page tables; these can change very
frequently for some use cases, and the io page tables can be built after
the iommu is configured but before dma is initiated, so you have no hook
from which to call ->translate(); and the representation of the address
space can be huge.

> Being able to implement an XOR
> IOMMU is impressive, but is it practical?

The XOR IOMMU is just a way for me to test and demonstrate the API.

> We could be doing much more
> practical things like nested device assignment with a flattened
> translation ;)  Thanks,

No, a flattened translation is impractical, at least when driven from
qemu.

My plans wrt vfio/kvm here are to have memory_region_init_iommu()
provide, in addition to ->translate(), a declarative description of the
translation function.  In practical terms, this means that the API will
receive the name of the spec that the iommu implements:

  MemoryRegionIOMMUOps amd_iommu_v2_ops = {
      .translate = amd_iommu_v2_translate,
      .translation_type = IOMMU_AMD_V2,
  };

qemu-side vfio would then match ->translation_type with what the kernel
provides, and configure the kernel for this type of translation.  As
some v2 hardware supports two levels of translations, all vfio has to do
is to set up the lower translation level to match the guest->host
translation (which it does already), and to set up the upper translation
level to follow the guest configuration.  From then on the hardware does
the rest.

If the hardware supports only one translation level, we may still be
able to implement a nested iommu using the same techniques we use for
the processor page tables - shadowing.  kvm would write-protect the
iommu page tables and pass any updates to vfio, which would update the
shadow io page tables that implement the ngpa->gpa->hpa translation.

However, given the complexity and performance problems on one side, and
the size of the niche that nested device assignment serves, we'll
probably limit ourselves to hardware that supports two levels of
translations.  If nested virtualization really takes off we can use
shadowing to provide the guest with emulated hardware that supports two
translation levels (the solution above uses host hardware with two
levels to expose guest hardware with one level).
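A hedged sketch of the "declarative description" idea above: the emulated IOMMU declares which spec it implements, and qemu-side vfio accepts it only if the host kernel reports matching nesting-capable hardware. Apart from the amd_iommu_v2_ops example quoted from the mail, every name here (the enum values, ToyIOMMUOps, host_nested_iommu_type()) is invented for illustration.

```c
#include <stdbool.h>
#include <stdio.h>

/* Invented identifiers for the spec an emulated IOMMU implements. */
typedef enum {
    IOMMU_NONE,
    IOMMU_AMD_V2,
    IOMMU_INTEL_VTD,
} IOMMUTranslationType;

typedef struct {
    /* .translate callback omitted in this sketch */
    IOMMUTranslationType translation_type;
} ToyIOMMUOps;

/* Pretend query of what the host kernel/hardware can nest (assumption). */
IOMMUTranslationType host_nested_iommu_type(void)
{
    return IOMMU_AMD_V2;
}

/* vfio-side check: can the guest-visible IOMMU be backed by host hardware? */
bool vfio_iommu_compatible(const ToyIOMMUOps *guest_ops)
{
    return guest_ops->translation_type != IOMMU_NONE &&
           guest_ops->translation_type == host_nested_iommu_type();
}

int main(void)
{
    ToyIOMMUOps amd_iommu_v2_ops = { .translation_type = IOMMU_AMD_V2 };

    if (vfio_iommu_compatible(&amd_iommu_v2_ops)) {
        /* Lower level already holds gpa->hpa; the hardware walks the
           guest's upper-level tables for the ngpa->gpa step. */
        printf("nested translation: hardware-backed\n");
    } else {
        printf("nested translation: needs shadowing or is unsupported\n");
    }
    return 0;
}
```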
On 10/15/2012 10:44 AM, liu ping fan wrote:
> On Thu, Oct 11, 2012 at 11:48 PM, Avi Kivity <avi@redhat.com> wrote:
>> On 10/11/2012 05:34 PM, Michael S. Tsirkin wrote:
>>> On Thu, Oct 11, 2012 at 04:35:23PM +0200, Avi Kivity wrote:
>>>> On 10/11/2012 04:35 PM, Michael S. Tsirkin wrote:
>>>>
>>>> >> No, qemu should configure virtio devices to bypass the iommu, even if it
>>>> >> is on.
>>>> >
>>>> > Okay so there will be some API that virtio devices should call
>>>> > to achieve this?
>>>>
>>>> The iommu should probably call pci_device_bypasses_iommu() to check for
>>>> such devices.
>>>
>>> So maybe this patch should depend on the introduction of such
>>> an API.
>>
>> I've dropped it for now.
>>
>> In fact, virtio/vhost are safe since they use cpu_physical_memory_rw()
>> and the memory listener watches address_space_memory, no iommu there.
>
> Not quite sure what you mean. My understanding is that, as a PCI
> device, vhost can lie behind an iommu in the topology, which means the
> transactions it launches can be intercepted by the emulated iommu. But
> we make an exception for the vhost device and enforce
> address_space_rw(address_space_memory, ..), NOT
> address_space_rw(pci_dev->bus_master_as, ..), so we bypass the iommu.
> Right?

The exception is not just for vhost, but for every virtio device.  So
the iommu needs to be aware of that, and if it manages a virtio device,
it needs to provide a 1:1 translation.
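Restating the exchange above as code: the bypass is a property of the device's DMA path rather than a vhost-only hack, so either virtio devices do their DMA through the plain memory address space, or an emulated iommu that finds them behind it must return an identity mapping. A toy sketch with invented names (the address spaces are reduced to an enum; this is not the real address_space_rw() API):

```c
#include <stdbool.h>
#include <stdio.h>

typedef enum {
    AS_MEMORY,       /* global guest memory, never behind an iommu */
    AS_BUS_MASTER,   /* per-device view, may be translated by an iommu */
} ToyAddressSpace;

typedef struct {
    bool is_virtio;
} ToyDevice;

/* Pick which address space a device's DMA should go through. */
ToyAddressSpace dma_address_space(const ToyDevice *dev)
{
    /*
     * Virtio devices do not honour the guest iommu (the guest driver
     * does not use the DMA API for them), so their backends read and
     * write guest memory directly.  Everything else must go through
     * its bus-master view.
     */
    return dev->is_virtio ? AS_MEMORY : AS_BUS_MASTER;
}

int main(void)
{
    ToyDevice virtio_net = { .is_virtio = true };
    ToyDevice assigned_nic = { .is_virtio = false };

    printf("virtio-net DMA via %s\n",
           dma_address_space(&virtio_net) == AS_MEMORY ? "memory" : "bus master");
    printf("assigned NIC DMA via %s\n",
           dma_address_space(&assigned_nic) == AS_MEMORY ? "memory" : "bus master");
    return 0;
}
```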
```diff
diff --git a/hw/vhost.c b/hw/vhost.c
index 0b4ac3f..cd5d9f5 100644
--- a/hw/vhost.c
+++ b/hw/vhost.c
@@ -451,6 +451,8 @@ static void vhost_region_add(MemoryListener *listener,
     struct vhost_dev *dev = container_of(listener, struct vhost_dev,
                                          memory_listener);
 
+    assert(!memory_region_is_iommu(section->mr));
+
     if (!vhost_section(section)) {
         return;
     }
```
vhost doesn't support guest iommus yet, indicate it to the user
by gently depositing a core on their disk.

Signed-off-by: Avi Kivity <avi@redhat.com>
---
 hw/vhost.c | 2 ++
 1 file changed, 2 insertions(+)