[RFC,v4,18/20] intel_iommu: enable vfio devices

Message ID	1484917736-32056-19-git-send-email-peterx@redhat.com
State	New
Headers
  Return-Path: <qemu-devel-bounces+incoming=patchwork.ozlabs.org@nongnu.org>
  From: Peter Xu <peterx@redhat.com>
  To: qemu-devel@nongnu.org
  Date: Fri, 20 Jan 2017 21:08:54 +0800
  Message-Id: <1484917736-32056-19-git-send-email-peterx@redhat.com>
  In-Reply-To: <1484917736-32056-1-git-send-email-peterx@redhat.com>
  References: <1484917736-32056-1-git-send-email-peterx@redhat.com>
  Subject: [Qemu-devel] [PATCH RFC v4 18/20] intel_iommu: enable vfio devices
  Precedence: list
  Cc: tianyu.lan@intel.com, kevin.tian@intel.com, mst@redhat.com, jan.kiszka@siemens.com, jasowang@redhat.com, peterx@redhat.com, alex.williamson@redhat.com, bd.aviv@gmail.com
  Errors-To: qemu-devel-bounces+incoming=patchwork.ozlabs.org@nongnu.org
  Sender: "Qemu-devel" <qemu-devel-bounces+incoming=patchwork.ozlabs.org@nongnu.org>

Peter Xu Jan. 20, 2017, 1:08 p.m. UTC

This patch is based on Aviv Ben-David (<bd.aviv@gmail.com>)'s patch
upstream:

  "IOMMU: enable intel_iommu map and unmap notifiers"
  https://lists.gnu.org/archive/html/qemu-devel/2016-11/msg01453.html

However, I removed or fixed some of its content and added my own code.

Instead of calling translate() on every page for IOTLB invalidations
(which is slower), we walk the pages when needed and notify from a hook
function.

This patch enables vfio devices for VT-d emulation.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 hw/i386/intel_iommu.c         | 66 +++++++++++++++++++++++++++++++++++++------
 include/hw/i386/intel_iommu.h |  8 ++++++
 2 files changed, 65 insertions(+), 9 deletions(-)

Jason Wang Jan. 22, 2017, 8:08 a.m. UTC | #1

On 2017-01-20 21:08, Peter Xu wrote:
> This patch is based on Aviv Ben-David (<bd.aviv@gmail.com>)'s patch
> upstream:
>
>    "IOMMU: enable intel_iommu map and unmap notifiers"
>    https://lists.gnu.org/archive/html/qemu-devel/2016-11/msg01453.html
>
> However I removed/fixed some content, and added my own codes.
>
> Instead of translate() every page for iotlb invalidations (which is
> slower), we walk the pages when needed and notify in a hook function.
>
> This patch enables vfio devices for VT-d emulation.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>   hw/i386/intel_iommu.c         | 66 +++++++++++++++++++++++++++++++++++++------
>   include/hw/i386/intel_iommu.h |  8 ++++++
>   2 files changed, 65 insertions(+), 9 deletions(-)
>
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index 83a2e1f..7cbf057 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -833,7 +833,8 @@ next:
>    * @private: private data for the hook function
>    */
>   static int vtd_page_walk(VTDContextEntry *ce, uint64_t start, uint64_t end,
> -                         vtd_page_walk_hook hook_fn, void *private)
> +                         vtd_page_walk_hook hook_fn, void *private,
> +                         bool notify_unmap)
>   {
>       dma_addr_t addr = vtd_get_slpt_base_from_context(ce);
>       uint32_t level = vtd_get_level_from_context_entry(ce);
> @@ -852,7 +853,7 @@ static int vtd_page_walk(VTDContextEntry *ce, uint64_t start, uint64_t end,
>       trace_vtd_page_walk_level(addr, level, start, end);
>   
>       return vtd_page_walk_level(addr, start, end, hook_fn, private,
> -                               level, true, true, NULL, false);
> +                               level, true, true, NULL, notify_unmap);
>   }
>   
>   /* Map a device to its corresponding domain (context-entry) */
> @@ -1205,6 +1206,33 @@ static void vtd_iotlb_domain_invalidate(IntelIOMMUState *s, uint16_t domain_id)
>                                   &domain_id);
>   }
>   
> +static int vtd_page_invalidate_notify_hook(IOMMUTLBEntry *entry,
> +                                           void *private)
> +{
> +    memory_region_notify_iommu((MemoryRegion *)private, *entry);
> +    return 0;
> +}
> +
> +static void vtd_iotlb_page_invalidate_notify(IntelIOMMUState *s,
> +                                           uint16_t domain_id, hwaddr addr,
> +                                           uint8_t am)
> +{
> +    IntelIOMMUNotifierNode *node;
> +    VTDContextEntry ce;
> +    int ret;
> +
> +    QLIST_FOREACH(node, &(s->notifiers_list), next) {
> +        VTDAddressSpace *vtd_as = node->vtd_as;
> +        ret = vtd_dev_to_context_entry(s, pci_bus_num(vtd_as->bus),
> +                                       vtd_as->devfn, &ce);
> +        if (!ret && domain_id == VTD_CONTEXT_ENTRY_DID(ce.hi)) {
> +            vtd_page_walk(&ce, addr, addr + (1 << am) * VTD_PAGE_SIZE,
> +                          vtd_page_invalidate_notify_hook,
> +                          (void *)&vtd_as->iommu, true);

Why not simply trigger the notifier here? (Or is this required by vfio?)

> +        }
> +    }
> +}
> +
>   static void vtd_iotlb_page_invalidate(IntelIOMMUState *s, uint16_t domain_id,
>                                         hwaddr addr, uint8_t am)
>   {
> @@ -1215,6 +1243,7 @@ static void vtd_iotlb_page_invalidate(IntelIOMMUState *s, uint16_t domain_id,
>       info.addr = addr;
>       info.mask = ~((1 << am) - 1);
>       g_hash_table_foreach_remove(s->iotlb, vtd_hash_remove_by_page, &info);
> +    vtd_iotlb_page_invalidate_notify(s, domain_id, addr, am);

I think it's better to squash DSI and GLOBAL invalidation into this 
patch, otherwise the patch is buggy.

Thanks

Peter Xu Jan. 22, 2017, 9:04 a.m. UTC | #2

On Sun, Jan 22, 2017 at 04:08:04PM +0800, Jason Wang wrote:

[...]

> >+static void vtd_iotlb_page_invalidate_notify(IntelIOMMUState *s,
> >+                                           uint16_t domain_id, hwaddr addr,
> >+                                           uint8_t am)
> >+{
> >+    IntelIOMMUNotifierNode *node;
> >+    VTDContextEntry ce;
> >+    int ret;
> >+
> >+    QLIST_FOREACH(node, &(s->notifiers_list), next) {
> >+        VTDAddressSpace *vtd_as = node->vtd_as;
> >+        ret = vtd_dev_to_context_entry(s, pci_bus_num(vtd_as->bus),
> >+                                       vtd_as->devfn, &ce);
> >+        if (!ret && domain_id == VTD_CONTEXT_ENTRY_DID(ce.hi)) {
> >+            vtd_page_walk(&ce, addr, addr + (1 << am) * VTD_PAGE_SIZE,
> >+                          vtd_page_invalidate_notify_hook,
> >+                          (void *)&vtd_as->iommu, true);
> 
> Why not simply trigger the notifier here? (or is this vfio required?)

Because we may only want to notify part of the region: what we have
here is a mask, not an exact size.

Consider this: the guest (with caching mode) maps 12K of memory (three
4K pages), and the guest extends the mask to cover 16K. In that case we
need to explicitly walk the page entries to learn that the 4th page
should not be notified.
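
To illustrate the point above, here is a minimal sketch of a walk over a
masked range that notifies only the pages that are actually present. It is
not the QEMU implementation: the flat ToyPageTable, toy_page_walk() and
toy_psi_notify() are hypothetical names invented for this example, and a
real VT-d walk traverses a multi-level second-level page table instead.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_SIZE 4096ULL
#define TOY_NUM_PAGES 16

/* Hypothetical flat page table: present[i] is true if IOVA page i is mapped. */
typedef struct {
    bool present[TOY_NUM_PAGES];
} ToyPageTable;

/* Hook invoked once per present page, in the spirit of a page-walk hook. */
typedef void (*toy_hook)(uint64_t iova, void *opaque);

/*
 * Walk [start, end) one page at a time and call the hook only for pages
 * that are present. Returns the number of pages that were notified.
 */
static int toy_page_walk(const ToyPageTable *pt, uint64_t start, uint64_t end,
                         toy_hook hook, void *opaque)
{
    int notified = 0;

    assert(pt != NULL);
    for (uint64_t iova = start; iova < end; iova += TOY_PAGE_SIZE) {
        uint64_t idx = iova / TOY_PAGE_SIZE;

        if (idx >= TOY_NUM_PAGES) {
            break; /* outside the toy table */
        }
        if (pt->present[idx]) {
            if (hook != NULL) {
                hook(iova, opaque);
            }
            notified++;
        }
    }
    return notified;
}

/*
 * Handle a page-selective invalidation: the address mask "am" means the
 * request covers (1 << am) pages starting at addr, which may be more
 * pages than the guest really mapped.
 */
static int toy_psi_notify(const ToyPageTable *pt, uint64_t addr, uint8_t am,
                          toy_hook hook, void *opaque)
{
    assert(am <= 4); /* (1 << 4) pages is the whole toy table */
    return toy_page_walk(pt, addr, addr + (1ULL << am) * TOY_PAGE_SIZE,
                         hook, opaque);
}
```

With pages 0 to 2 present and a request of am = 2 (four pages), the walk
notifies three pages and skips the fourth, which is the behavior described
in the 12K example.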

> 
> >+        }
> >+    }
> >+}
> >+
> >  static void vtd_iotlb_page_invalidate(IntelIOMMUState *s, uint16_t domain_id,
> >                                        hwaddr addr, uint8_t am)
> >  {
> >@@ -1215,6 +1243,7 @@ static void vtd_iotlb_page_invalidate(IntelIOMMUState *s, uint16_t domain_id,
> >      info.addr = addr;
> >      info.mask = ~((1 << am) - 1);
> >      g_hash_table_foreach_remove(s->iotlb, vtd_hash_remove_by_page, &info);
> >+    vtd_iotlb_page_invalidate_notify(s, domain_id, addr, am);
> 
> I think it's better to squash DSI and GLOBAL invalidation into this patch,
> otherwise the patch is buggy.

I can do this. Thanks,

-- peterx

Jason Wang Jan. 23, 2017, 1:55 a.m. UTC | #3

On 2017-01-22 17:04, Peter Xu wrote:
> On Sun, Jan 22, 2017 at 04:08:04PM +0800, Jason Wang wrote:
>
> [...]
>
>>> +static void vtd_iotlb_page_invalidate_notify(IntelIOMMUState *s,
>>> +                                           uint16_t domain_id, hwaddr addr,
>>> +                                           uint8_t am)
>>> +{
>>> +    IntelIOMMUNotifierNode *node;
>>> +    VTDContextEntry ce;
>>> +    int ret;
>>> +
>>> +    QLIST_FOREACH(node, &(s->notifiers_list), next) {
>>> +        VTDAddressSpace *vtd_as = node->vtd_as;
>>> +        ret = vtd_dev_to_context_entry(s, pci_bus_num(vtd_as->bus),
>>> +                                       vtd_as->devfn, &ce);
>>> +        if (!ret && domain_id == VTD_CONTEXT_ENTRY_DID(ce.hi)) {
>>> +            vtd_page_walk(&ce, addr, addr + (1 << am) * VTD_PAGE_SIZE,
>>> +                          vtd_page_invalidate_notify_hook,
>>> +                          (void *)&vtd_as->iommu, true);
>> Why not simply trigger the notifier here? (or is this vfio required?)
> Because we may only want to notify part of the region - we are with
> mask here, but not exact size.
>
> Consider this: guest (with caching mode) maps 12K memory (4K*3 pages),
> the mask will be extended to 16K in the guest. In that case, we need
> to explicitly go over the page entry to know that the 4th page should
> not be notified.

I see. Then it is required by vfio only. I think we can add a fast path
for !CM in this case by triggering the notifier directly.

Another possible issue: consider (with CM) a 16K contiguous iova range
whose last page has already been mapped. In this case, if we want to map
the first three pages, then when handling the IOTLB invalidation, am
would cover 16K, so the last page will be mapped twice. Can this lead to
some issue?

Thanks

Jason Wang Jan. 23, 2017, 2:01 a.m. UTC | #4

On 2017-01-20 21:08, Peter Xu wrote:
> This patch is based on Aviv Ben-David (<bd.aviv@gmail.com>)'s patch
> upstream:
>
>    "IOMMU: enable intel_iommu map and unmap notifiers"
>    https://lists.gnu.org/archive/html/qemu-devel/2016-11/msg01453.html
>
> However I removed/fixed some content, and added my own codes.
>
> Instead of translate() every page for iotlb invalidations (which is
> slower), we walk the pages when needed and notify in a hook function.
>
> This patch enables vfio devices for VT-d emulation.
>
> Signed-off-by: Peter Xu<peterx@redhat.com>
> ---
>   hw/i386/intel_iommu.c         | 66 +++++++++++++++++++++++++++++++++++++------
>   include/hw/i386/intel_iommu.h |  8 ++++++
>   2 files changed, 65 insertions(+), 9 deletions(-)

A good side effect of this patch is that it makes vhost device IOTLB
work without ATS (though it may be slow). We probably need a better title :)

And I think we should block notifiers during PSI/DSI/GLOBAL for devices
with ATS enabled.

Thanks

Jason Wang Jan. 23, 2017, 2:17 a.m. UTC | #5

On 2017-01-23 10:01, Jason Wang wrote:
> On 2017-01-20 21:08, Peter Xu wrote:
>> This patch is based on Aviv Ben-David (<bd.aviv@gmail.com>)'s patch
>> upstream:
>>
>>    "IOMMU: enable intel_iommu map and unmap notifiers"
>> https://lists.gnu.org/archive/html/qemu-devel/2016-11/msg01453.html
>>
>> However I removed/fixed some content, and added my own codes.
>>
>> Instead of translate() every page for iotlb invalidations (which is
>> slower), we walk the pages when needed and notify in a hook function.
>>
>> This patch enables vfio devices for VT-d emulation.
>>
>> Signed-off-by: Peter Xu<peterx@redhat.com>
>> ---
>>   hw/i386/intel_iommu.c         | 66 +++++++++++++++++++++++++++++++++++++------
>>   include/hw/i386/intel_iommu.h |  8 ++++++
>>   2 files changed, 65 insertions(+), 9 deletions(-)
>
> A good side effect of this patch is that it makes vhost device IOTLB 
> works without ATS (though may be slow). We probably need a better 
> title :)

Probably something like "remote IOMMU/IOTLB" support.

Thanks

Peter Xu Jan. 23, 2017, 3:34 a.m. UTC | #6

On Mon, Jan 23, 2017 at 09:55:39AM +0800, Jason Wang wrote:
> 
> 
> On 2017-01-22 17:04, Peter Xu wrote:
> >On Sun, Jan 22, 2017 at 04:08:04PM +0800, Jason Wang wrote:
> >
> >[...]
> >
> >>>+static void vtd_iotlb_page_invalidate_notify(IntelIOMMUState *s,
> >>>+                                           uint16_t domain_id, hwaddr addr,
> >>>+                                           uint8_t am)
> >>>+{
> >>>+    IntelIOMMUNotifierNode *node;
> >>>+    VTDContextEntry ce;
> >>>+    int ret;
> >>>+
> >>>+    QLIST_FOREACH(node, &(s->notifiers_list), next) {
> >>>+        VTDAddressSpace *vtd_as = node->vtd_as;
> >>>+        ret = vtd_dev_to_context_entry(s, pci_bus_num(vtd_as->bus),
> >>>+                                       vtd_as->devfn, &ce);
> >>>+        if (!ret && domain_id == VTD_CONTEXT_ENTRY_DID(ce.hi)) {
> >>>+            vtd_page_walk(&ce, addr, addr + (1 << am) * VTD_PAGE_SIZE,
> >>>+                          vtd_page_invalidate_notify_hook,
> >>>+                          (void *)&vtd_as->iommu, true);
> >>Why not simply trigger the notifier here? (or is this vfio required?)
> >Because we may only want to notify part of the region - we are with
> >mask here, but not exact size.
> >
> >Consider this: guest (with caching mode) maps 12K memory (4K*3 pages),
> >the mask will be extended to 16K in the guest. In that case, we need
> >to explicitly go over the page entry to know that the 4th page should
> >not be notified.
> 
> I see. Then it was required by vfio only, I think we can add a fast path for
> !CM in this case by triggering the notifier directly.

I noted this down (to be investigated further, in my todo list), but I
don't know whether it can work, because I think it is still legal for
the guest to merge more than one PSI into one. For example, I don't know
whether the below is legal:

- guest invalidates page (0, 4k)
- guest maps new page (4k, 8k)
- guest sends a single PSI for (0, 8k)

In that case, the PSI covers both a map and an unmap, and it does not
seem to disobey the spec either?
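
The merged-PSI scenario above can be sketched as follows, assuming the
emulation looks at each page in the range separately. This is only an
illustration: ToyEvent and toy_classify_psi() are hypothetical names, and
the flat "present" array stands in for whatever the real page walk finds
in the guest page table at the time the PSI is processed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Kind of notification a page in the invalidated range would produce. */
typedef enum {
    TOY_EVENT_UNMAP,
    TOY_EVENT_MAP
} ToyEvent;

/*
 * For every page covered by one PSI, decide between a map and an unmap
 * notification from the current state of its (hypothetical) page table
 * entry: present pages are reported as maps, missing pages as unmaps.
 */
static void toy_classify_psi(const bool *present, size_t npages,
                             ToyEvent *events)
{
    assert(present != NULL && events != NULL);
    for (size_t i = 0; i < npages; i++) {
        events[i] = present[i] ? TOY_EVENT_MAP : TOY_EVENT_UNMAP;
    }
}
```

For the sequence in the list, page (0, 4k) is no longer present and page
(4k, 8k) is, so a single PSI for (0, 8k) yields one unmap and one map. A
fast path that fires one notification for the whole range could not make
that distinction.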

> 
> Another possible issue is, consider (with CM) a 16K contiguous iova with the
> last page has already been mapped. In this case, if we want to map first
> three pages, when handling IOTLB invalidation, am would be 16K, then the
> last page will be mapped twice. Can this lead some issue?

I don't know whether the guest has special handling for this kind of
request.

Besides, imho to completely solve this problem, we still need that
per-domain tree. Considering that the tree currently lives inside vfio,
I don't see this as a big issue. In that case, the mapping request for
the last page will fail (we might see one error line on QEMU's stderr);
however, that will not matter much, since vfio currently allows that
failure to happen (the ioctl fails, but the page is still mapped, which
is what we wanted).

(But of course the above error message can be used by an in-guest
 attacker as well, just like the general error_report() issues reported
 before, though again I would appreciate it if we can have this series
 functionally working first :)

And I should be able to emulate this behavior in the guest with a tiny C
program to make sure of it, possibly after this series if allowed.

Thanks,

-- peterx

Peter Xu Jan. 23, 2017, 3:40 a.m. UTC | #7

On Mon, Jan 23, 2017 at 10:01:11AM +0800, Jason Wang wrote:
> 
> 
> On 2017-01-20 21:08, Peter Xu wrote:
> >This patch is based on Aviv Ben-David (<bd.aviv@gmail.com>)'s patch
> >upstream:
> >
> >   "IOMMU: enable intel_iommu map and unmap notifiers"
> >   https://lists.gnu.org/archive/html/qemu-devel/2016-11/msg01453.html
> >
> >However I removed/fixed some content, and added my own codes.
> >
> >Instead of translate() every page for iotlb invalidations (which is
> >slower), we walk the pages when needed and notify in a hook function.
> >
> >This patch enables vfio devices for VT-d emulation.
> >
> >Signed-off-by: Peter Xu<peterx@redhat.com>
> >---
> >  hw/i386/intel_iommu.c         | 66 +++++++++++++++++++++++++++++++++++++------
> >  include/hw/i386/intel_iommu.h |  8 ++++++
> >  2 files changed, 65 insertions(+), 9 deletions(-)
> 
> A good side effect of this patch is that it makes vhost device IOTLB works
> without ATS (though may be slow). We probably need a better title :)

How about I mention it in the commit message at the end? Like:

"And, since we already have vhost DMAR support via device-iotlb, a
 natural benefit that this patch brings is that vt-d enabled vhost can
 live even without ATS capability now. Though more tests are needed."

> 
> And I think we should block notifiers during PSI/DSI/GLOBAL for device with
> ATS enabled.

Again, would it be okay if I note this in my todo list? :)

Thanks,

-- peterx

Jason Wang Jan. 23, 2017, 10:23 a.m. UTC | #8

On 2017-01-23 11:34, Peter Xu wrote:
> On Mon, Jan 23, 2017 at 09:55:39AM +0800, Jason Wang wrote:
>>
>> On 2017-01-22 17:04, Peter Xu wrote:
>>> On Sun, Jan 22, 2017 at 04:08:04PM +0800, Jason Wang wrote:
>>>
>>> [...]
>>>
>>>>> +static void vtd_iotlb_page_invalidate_notify(IntelIOMMUState *s,
>>>>> +                                           uint16_t domain_id, hwaddr addr,
>>>>> +                                           uint8_t am)
>>>>> +{
>>>>> +    IntelIOMMUNotifierNode *node;
>>>>> +    VTDContextEntry ce;
>>>>> +    int ret;
>>>>> +
>>>>> +    QLIST_FOREACH(node, &(s->notifiers_list), next) {
>>>>> +        VTDAddressSpace *vtd_as = node->vtd_as;
>>>>> +        ret = vtd_dev_to_context_entry(s, pci_bus_num(vtd_as->bus),
>>>>> +                                       vtd_as->devfn, &ce);
>>>>> +        if (!ret && domain_id == VTD_CONTEXT_ENTRY_DID(ce.hi)) {
>>>>> +            vtd_page_walk(&ce, addr, addr + (1 << am) * VTD_PAGE_SIZE,
>>>>> +                          vtd_page_invalidate_notify_hook,
>>>>> +                          (void *)&vtd_as->iommu, true);
>>>> Why not simply trigger the notifier here? (or is this vfio required?)
>>> Because we may only want to notify part of the region - we are with
>>> mask here, but not exact size.
>>>
>>> Consider this: guest (with caching mode) maps 12K memory (4K*3 pages),
>>> the mask will be extended to 16K in the guest. In that case, we need
>>> to explicitly go over the page entry to know that the 4th page should
>>> not be notified.
>> I see. Then it was required by vfio only, I think we can add a fast path for
>> !CM in this case by triggering the notifier directly.
> I noted this down (to be further investigated in my todo), but I don't
> know whether this can work, due to the fact that I think it is still
> legal that guest merge more than one PSIs into one. For example, I
> don't know whether below is legal:
>
> - guest invalidate page (0, 4k)
> - guest map new page (4k, 8k)
> - guest send single PSI of (0, 8k)
>
> In that case, it contains both map/unmap, and looks like it didn't
> disobay the spec as well?

Not sure I get your meaning. You mean just sending a single PSI instead of two?

>
>> Another possible issue is, consider (with CM) a 16K contiguous iova with the
>> last page has already been mapped. In this case, if we want to map first
>> three pages, when handling IOTLB invalidation, am would be 16K, then the
>> last page will be mapped twice. Can this lead some issue?
> I don't know whether guest has special handling of this kind of
> request.

This seems quite usual, I think? E.g. iommu_flush_iotlb_psi() does:

static void iommu_flush_iotlb_psi(struct intel_iommu *iommu,
                   struct dmar_domain *domain,
                   unsigned long pfn, unsigned int pages,
                   int ih, int map)
{
     unsigned int mask = ilog2(__roundup_pow_of_two(pages));
     uint64_t addr = (uint64_t)pfn << VTD_PAGE_SHIFT;
     u16 did = domain->iommu_did[iommu->seq_id];
...
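
As a rough userspace illustration of the rounding in the snippet above,
the sketch below computes the address mask for a given page count. The
helper names toy_roundup_pow_of_two(), toy_ilog2() and toy_psi_mask() are
made up for this example; they only mimic what the kernel helpers of
similar name are understood to do and are not the kernel implementation.

```c
#include <assert.h>

/* Smallest power of two that is >= n (n must be in 1 .. 2^31). */
static unsigned int toy_roundup_pow_of_two(unsigned int n)
{
    unsigned int p = 1;

    assert(n > 0 && n <= (1u << 31));
    while (p < n) {
        p <<= 1;
    }
    return p;
}

/* Floor of log2(n) for n > 0. */
static unsigned int toy_ilog2(unsigned int n)
{
    unsigned int r = 0;

    assert(n > 0);
    while (n >>= 1) {
        r++;
    }
    return r;
}

/* Address mask for a PSI covering "pages" pages: log2 of the rounded-up count. */
static unsigned int toy_psi_mask(unsigned int pages)
{
    return toy_ilog2(toy_roundup_pow_of_two(pages));
}
```

Under these assumptions a 3-page request gets mask 2, that is, the
invalidation covers 4 pages (16K with 4K pages), which is how a 12K
mapping ends up with a 16K invalidation range.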


>
> Besides, imho to completely solve this problem, we still need that
> per-domain tree. Considering that currently the tree is inside vfio, I
> see this not a big issue as well.

Another issue I found: with this series, VFIO_IOMMU_MAP_DMA seems to
become guest-triggerable. And since VFIO allocates its own structures to
record DMA mappings, this seems to open a window for an evil guest to
exhaust host memory, which is even worse.

>   In that case, the last page mapping
> request will fail (we might see one error line from QEMU stderr),
> however that'll not affect too much since currently vfio allows that
> failure to happen (ioctl fail, but that page is still mapped, which is
> what we wanted).

It works, but it is sub-optimal or maybe even buggy.

>
> (But of course above error message can be used by an in-guest attacker
>   as well just like general error_report() issues reported before,
>   though again I will appreciate if we can have this series
>   functionally work first :)
>
> And, I should be able to emulate this behavior in guest with a tiny C
> program to make sure of it, possibly after this series if allowed.

Or through your vtd unittest :) ?

Thanks

>
> Thanks,
>
> -- peterx

Jason Wang Jan. 23, 2017, 10:27 a.m. UTC | #9

On 2017-01-23 11:40, Peter Xu wrote:
> On Mon, Jan 23, 2017 at 10:01:11AM +0800, Jason Wang wrote:
>>
>> On 2017-01-20 21:08, Peter Xu wrote:
>>> This patch is based on Aviv Ben-David (<bd.aviv@gmail.com>)'s patch
>>> upstream:
>>>
>>>    "IOMMU: enable intel_iommu map and unmap notifiers"
>>>    https://lists.gnu.org/archive/html/qemu-devel/2016-11/msg01453.html
>>>
>>> However I removed/fixed some content, and added my own codes.
>>>
>>> Instead of translate() every page for iotlb invalidations (which is
>>> slower), we walk the pages when needed and notify in a hook function.
>>>
>>> This patch enables vfio devices for VT-d emulation.
>>>
>>> Signed-off-by: Peter Xu<peterx@redhat.com>
>>> ---
>>>   hw/i386/intel_iommu.c         | 66 +++++++++++++++++++++++++++++++++++++------
>>>   include/hw/i386/intel_iommu.h |  8 ++++++
>>>   2 files changed, 65 insertions(+), 9 deletions(-)
>> A good side effect of this patch is that it makes vhost device IOTLB works
>> without ATS (though may be slow). We probably need a better title :)
> How about I mention it in the commit message at the end? Like:
>
> "And, since we already have vhost DMAR support via device-iotlb, a
>   natural benefit that this patch brings is that vt-d enabled vhost can
>   live even without ATS capability now. Though more tests are needed."
>

Ok for me.

>> And I think we should block notifiers during PSI/DSI/GLOBAL for device with
>> ATS enabled.
> Again, would that be okay I note this in my todo list? :)
>
> Thanks,
>
> -- peterx

Yes, on top.

Thanks

Alex Williamson Jan. 23, 2017, 6:03 p.m. UTC | #10

On Mon, 23 Jan 2017 11:34:29 +0800
Peter Xu <peterx@redhat.com> wrote:

> On Mon, Jan 23, 2017 at 09:55:39AM +0800, Jason Wang wrote:
> > 
> > 
> > On 2017-01-22 17:04, Peter Xu wrote:
> > >On Sun, Jan 22, 2017 at 04:08:04PM +0800, Jason Wang wrote:
> > >
> > >[...]
> > >  
> > >>>+static void vtd_iotlb_page_invalidate_notify(IntelIOMMUState *s,
> > >>>+                                           uint16_t domain_id, hwaddr addr,
> > >>>+                                           uint8_t am)
> > >>>+{
> > >>>+    IntelIOMMUNotifierNode *node;
> > >>>+    VTDContextEntry ce;
> > >>>+    int ret;
> > >>>+
> > >>>+    QLIST_FOREACH(node, &(s->notifiers_list), next) {
> > >>>+        VTDAddressSpace *vtd_as = node->vtd_as;
> > >>>+        ret = vtd_dev_to_context_entry(s, pci_bus_num(vtd_as->bus),
> > >>>+                                       vtd_as->devfn, &ce);
> > >>>+        if (!ret && domain_id == VTD_CONTEXT_ENTRY_DID(ce.hi)) {
> > >>>+            vtd_page_walk(&ce, addr, addr + (1 << am) * VTD_PAGE_SIZE,
> > >>>+                          vtd_page_invalidate_notify_hook,
> > >>>+                          (void *)&vtd_as->iommu, true);  
> > >>Why not simply trigger the notifier here? (or is this vfio required?)  
> > >Because we may only want to notify part of the region - we are with
> > >mask here, but not exact size.
> > >
> > >Consider this: guest (with caching mode) maps 12K memory (4K*3 pages),
> > >the mask will be extended to 16K in the guest. In that case, we need
> > >to explicitly go over the page entry to know that the 4th page should
> > >not be notified.  
> > 
> > I see. Then it was required by vfio only, I think we can add a fast path for
> > !CM in this case by triggering the notifier directly.  
> 
> I noted this down (to be further investigated in my todo), but I don't
> know whether this can work, due to the fact that I think it is still
> legal that guest merge more than one PSIs into one. For example, I
> don't know whether below is legal:
> 
> - guest invalidate page (0, 4k)
> - guest map new page (4k, 8k)
> - guest send single PSI of (0, 8k)
> 
> In that case, it contains both map/unmap, and looks like it didn't
> disobay the spec as well?

The topic of mapping and invalidation granularity also makes me
slightly concerned with the abstraction we use for the type1 IOMMU
backend.  With the "v2" type1 configuration we currently use in QEMU,
the user may only unmap with the same minimum granularity with which
the original mapping was created.  For instance if an iommu notifier
map request gets to vfio with an 8k range, the resulting mapping can
only be removed by an invalidation covering the full range.  Trying to
bisect that original mapping by only invalidating 4k of the range will
generate an error.

I would think (but please confirm), that when we're only tracking
mappings generated by the guest OS that this works.  If the guest OS
maps with 4k pages, we get map notifies for each of those 4k pages.  If
they use 2MB pages, we get 2MB ranges and invalidations will come in
the same granularity.

An area of concern, though, is the replay mechanism in QEMU. I'll need to
look for it in the code, but replaying an IOMMU domain into a new
container *cannot* coalesce mappings or else it limits the granularity
with which we can later accept unmaps.  Take for instance a guest that
has mapped a contiguous 2MB range with 4K pages.  They can unmap any 4K
page within that range.  However if vfio gets a single 2MB mapping
rather than 512 4K mappings, then the host IOMMU may use a hugepage
mapping where our granularity is now 2MB.  Thanks,

Alex
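
The granularity rule described in the message above can be modeled with a
small sketch. This is a toy model written for this thread, not the vfio
type1 code: ToyContainer, toy_map() and toy_unmap() are hypothetical, and
rejecting overlapping map requests is an assumption of the model.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_MAPS 8

/* One tracked mapping, recorded exactly as it was requested. */
typedef struct {
    uint64_t iova;
    uint64_t size;
    bool used;
} ToyMap;

typedef struct {
    ToyMap maps[TOY_MAX_MAPS];
} ToyContainer;

static bool toy_overlaps(const ToyMap *m, uint64_t iova, uint64_t size)
{
    return m->used && iova < m->iova + m->size && m->iova < iova + size;
}

/* Record a mapping; fails (-1) on overlap or when the table is full. */
static int toy_map(ToyContainer *c, uint64_t iova, uint64_t size)
{
    assert(c != NULL && size > 0);
    for (size_t i = 0; i < TOY_MAX_MAPS; i++) {
        if (toy_overlaps(&c->maps[i], iova, size)) {
            return -1; /* model assumption: no double mapping */
        }
    }
    for (size_t i = 0; i < TOY_MAX_MAPS; i++) {
        if (!c->maps[i].used) {
            c->maps[i] = (ToyMap){ .iova = iova, .size = size, .used = true };
            return 0;
        }
    }
    return -1;
}

/*
 * Unmap [iova, iova + size). The request fails (-1) if it would bisect
 * any existing mapping; otherwise every mapping inside it is removed.
 */
static int toy_unmap(ToyContainer *c, uint64_t iova, uint64_t size)
{
    uint64_t end = iova + size;

    assert(c != NULL && size > 0);
    for (size_t i = 0; i < TOY_MAX_MAPS; i++) {
        const ToyMap *m = &c->maps[i];

        if (toy_overlaps(m, iova, size) &&
            (m->iova < iova || m->iova + m->size > end)) {
            return -1; /* would split an original mapping */
        }
    }
    for (size_t i = 0; i < TOY_MAX_MAPS; i++) {
        if (toy_overlaps(&c->maps[i], iova, size)) {
            c->maps[i].used = false;
        }
    }
    return 0;
}
```

In this model an 8k mapping cannot be removed 4k at a time, while two
separate 4k mappings can. That is why a replay that coalesces mappings
would restrict which later unmaps can succeed.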

Alex Williamson Jan. 23, 2017, 7:40 p.m. UTC | #11

On Mon, 23 Jan 2017 18:23:44 +0800
Jason Wang <jasowang@redhat.com> wrote:

> On 2017-01-23 11:34, Peter Xu wrote:
> > On Mon, Jan 23, 2017 at 09:55:39AM +0800, Jason Wang wrote:  
> >>
> >> On 2017-01-22 17:04, Peter Xu wrote:
> >>> On Sun, Jan 22, 2017 at 04:08:04PM +0800, Jason Wang wrote:
> >>>
> >>> [...]
> >>>  
> >>>>> +static void vtd_iotlb_page_invalidate_notify(IntelIOMMUState *s,
> >>>>> +                                           uint16_t domain_id, hwaddr addr,
> >>>>> +                                           uint8_t am)
> >>>>> +{
> >>>>> +    IntelIOMMUNotifierNode *node;
> >>>>> +    VTDContextEntry ce;
> >>>>> +    int ret;
> >>>>> +
> >>>>> +    QLIST_FOREACH(node, &(s->notifiers_list), next) {
> >>>>> +        VTDAddressSpace *vtd_as = node->vtd_as;
> >>>>> +        ret = vtd_dev_to_context_entry(s, pci_bus_num(vtd_as->bus),
> >>>>> +                                       vtd_as->devfn, &ce);
> >>>>> +        if (!ret && domain_id == VTD_CONTEXT_ENTRY_DID(ce.hi)) {
> >>>>> +            vtd_page_walk(&ce, addr, addr + (1 << am) * VTD_PAGE_SIZE,
> >>>>> +                          vtd_page_invalidate_notify_hook,
> >>>>> +                          (void *)&vtd_as->iommu, true);  
> >>>> Why not simply trigger the notifier here? (or is this vfio required?)  
> >>> Because we may only want to notify part of the region - we are with
> >>> mask here, but not exact size.
> >>>
> >>> Consider this: guest (with caching mode) maps 12K memory (4K*3 pages),
> >>> the mask will be extended to 16K in the guest. In that case, we need
> >>> to explicitly go over the page entry to know that the 4th page should
> >>> not be notified.  
> >> I see. Then it was required by vfio only, I think we can add a fast path for
> >> !CM in this case by triggering the notifier directly.  
> > I noted this down (to be further investigated in my todo), but I don't
> > know whether this can work, due to the fact that I think it is still
> > legal that guest merge more than one PSIs into one. For example, I
> > don't know whether below is legal:
> >
> > - guest invalidate page (0, 4k)
> > - guest map new page (4k, 8k)
> > - guest send single PSI of (0, 8k)
> >
> > In that case, it contains both map/unmap, and looks like it didn't
> > disobay the spec as well?  
> 
> Not sure I get your meaning, you mean just send single PSI instead of two?
> 
> >  
> >> Another possible issue is, consider (with CM) a 16K contiguous iova with the
> >> last page has already been mapped. In this case, if we want to map first
> >> three pages, when handling IOTLB invalidation, am would be 16K, then the
> >> last page will be mapped twice. Can this lead some issue?  
> > I don't know whether guest has special handling of this kind of
> > request.  
> 
> This seems quite usual I think? E.g iommu_flush_iotlb_psi() did:
> 
> static void iommu_flush_iotlb_psi(struct intel_iommu *iommu,
>                    struct dmar_domain *domain,
>                    unsigned long pfn, unsigned int pages,
>                    int ih, int map)
> {
>      unsigned int mask = ilog2(__roundup_pow_of_two(pages));
>      uint64_t addr = (uint64_t)pfn << VTD_PAGE_SHIFT;
>      u16 did = domain->iommu_did[iommu->seq_id];
> ...
> 
> 
> >
> > Besides, imho to completely solve this problem, we still need that
> > per-domain tree. Considering that currently the tree is inside vfio, I
> > see this not a big issue as well.  
> 
> Another issue I found is: with this series, VFIO_IOMMU_MAP_DMA seems 
> become guest trigger-able. And since VFIO allocate its own structure to 
> record dma mapping, this seems open a window for evil guest to exhaust 
> host memory which is even worse.

You're thinking of pci-assign; vfio does page accounting such that a
user can only lock pages up to their locked memory limit. From a vfio
perspective, exposing the mapping ioctl within the guest is no different
a problem from exposing the ioctl to the host user. Thanks,

Alex
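
As a loose illustration of the locked-memory limit mentioned above, the
sketch below reads the process limit with getrlimit(RLIMIT_MEMLOCK) and
applies a simple budget check. The helpers toy_memlock_limit() and
toy_can_lock() are hypothetical and only model the idea of accounting
against a limit; they do not reproduce how vfio performs its accounting.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/resource.h>

/*
 * Read the soft locked-memory limit of the calling process in bytes.
 * An unlimited setting is reported as UINT64_MAX. Returns 0 on success
 * and -1 if getrlimit() fails.
 */
static int toy_memlock_limit(uint64_t *limit_out)
{
    struct rlimit rl;

    assert(limit_out != NULL);
    if (getrlimit(RLIMIT_MEMLOCK, &rl) != 0) {
        return -1;
    }
    *limit_out = (rl.rlim_cur == RLIM_INFINITY) ? UINT64_MAX
                                                : (uint64_t)rl.rlim_cur;
    return 0;
}

/*
 * Toy accounting check: may "request" more bytes be locked when "locked"
 * bytes are already accounted? Written to avoid integer overflow.
 */
static bool toy_can_lock(uint64_t locked, uint64_t request, uint64_t limit)
{
    return locked <= limit && request <= limit - locked;
}
```

In this model, a guest that keeps issuing map requests is bounded by the
same budget as any host user of the ioctl: once the accounted total
reaches the limit, further requests are refused.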

Peter Xu Jan. 24, 2017, 4:42 a.m. UTC | #12

On Mon, Jan 23, 2017 at 06:23:44PM +0800, Jason Wang wrote:
> 
> 
> On 2017-01-23 11:34, Peter Xu wrote:
> >On Mon, Jan 23, 2017 at 09:55:39AM +0800, Jason Wang wrote:
> >>
> >>On 2017-01-22 17:04, Peter Xu wrote:
> >>>On Sun, Jan 22, 2017 at 04:08:04PM +0800, Jason Wang wrote:
> >>>
> >>>[...]
> >>>
> >>>>>+static void vtd_iotlb_page_invalidate_notify(IntelIOMMUState *s,
> >>>>>+                                           uint16_t domain_id, hwaddr addr,
> >>>>>+                                           uint8_t am)
> >>>>>+{
> >>>>>+    IntelIOMMUNotifierNode *node;
> >>>>>+    VTDContextEntry ce;
> >>>>>+    int ret;
> >>>>>+
> >>>>>+    QLIST_FOREACH(node, &(s->notifiers_list), next) {
> >>>>>+        VTDAddressSpace *vtd_as = node->vtd_as;
> >>>>>+        ret = vtd_dev_to_context_entry(s, pci_bus_num(vtd_as->bus),
> >>>>>+                                       vtd_as->devfn, &ce);
> >>>>>+        if (!ret && domain_id == VTD_CONTEXT_ENTRY_DID(ce.hi)) {
> >>>>>+            vtd_page_walk(&ce, addr, addr + (1 << am) * VTD_PAGE_SIZE,
> >>>>>+                          vtd_page_invalidate_notify_hook,
> >>>>>+                          (void *)&vtd_as->iommu, true);
> >>>>Why not simply trigger the notifier here? (or is this vfio required?)
> >>>Because we may only want to notify part of the region - we are with
> >>>mask here, but not exact size.
> >>>
> >>>Consider this: guest (with caching mode) maps 12K memory (4K*3 pages),
> >>>the mask will be extended to 16K in the guest. In that case, we need
> >>>to explicitly go over the page entry to know that the 4th page should
> >>>not be notified.
> >>I see. Then it was required by vfio only, I think we can add a fast path for
> >>!CM in this case by triggering the notifier directly.
> >I noted this down (to be further investigated in my todo), but I don't
> >know whether this can work, due to the fact that I think it is still
> >legal that guest merge more than one PSIs into one. For example, I
> >don't know whether below is legal:
> >
> >- guest invalidate page (0, 4k)
> >- guest map new page (4k, 8k)
> >- guest send single PSI of (0, 8k)
> >
> >In that case, it contains both map/unmap, and looks like it didn't
> >disobay the spec as well?
> 
> Not sure I get your meaning, you mean just send single PSI instead of two?

Yes, and looks like that still doesn't violate the spec?

Actually, for now I think the best way forward with this series is to
merge it first, so that advanced users can start to use it and play
with it. Then we can gather more feedback and solve the critical
issues that matter to customers and users.

For the above, I think the per-page walk is the safest approach for
now. I can investigate (as I mentioned) in the future whether we can
make it faster, following your suggestion. However, it would be nice
to do that after we have some real use cases for this series, so that
we can make sure the enhancement boosts performance without breaking
anything.
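
To make the per-page walk concrete, here is a minimal sketch of the
idea, not the code from this patch. It assumes a toy page table that is
just an array of "present" flags; toy_page_walk() and toy_count_hook()
are hypothetical names for illustration only. The hook fires only for
pages that are actually mapped, so the 4th page of a 16K PSI that
covers a 12K mapping is skipped.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_SIZE 4096u /* 4K pages, as in the example above */

/* Hypothetical hook type: called once per present page. */
typedef void (*toy_hook_fn)(uint64_t iova, void *opaque);

/*
 * Toy stand-in for a page table walk: present[i] says whether page i
 * is mapped.  Walks [start, end) page by page and calls the hook only
 * for present pages.  Returns the number of pages notified.
 */
static size_t toy_page_walk(const bool *present, size_t npages,
                            uint64_t start, uint64_t end,
                            toy_hook_fn hook, void *opaque)
{
    size_t notified = 0;
    uint64_t iova;

    assert(start % TOY_PAGE_SIZE == 0 && end % TOY_PAGE_SIZE == 0);

    for (iova = start; iova < end; iova += TOY_PAGE_SIZE) {
        size_t idx = (size_t)(iova / TOY_PAGE_SIZE);

        if (idx < npages && present[idx]) {
            hook(iova, opaque);
            notified++;
        }
    }
    return notified;
}

/* Example hook: just counts how many times it was called. */
static void toy_count_hook(uint64_t iova, void *opaque)
{
    (void)iova;
    (*(unsigned int *)opaque)++;
}
```

With present = { true, true, true, false } and a walk over the full
16K range, the hook runs three times and the unmapped page is left
alone.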

But of course I would like to listen to the maintainer's opinion on
this...

> 
> >
> >>Another possible issue is, consider (with CM) a 16K contiguous iova with the
> >>last page has already been mapped. In this case, if we want to map first
> >>three pages, when handling IOTLB invalidation, am would be 16K, then the
> >>last page will be mapped twice. Can this lead some issue?
> >I don't know whether guest has special handling of this kind of
> >request.
> 
> This seems quite usual I think? E.g iommu_flush_iotlb_psi() did:
> 
> static void iommu_flush_iotlb_psi(struct intel_iommu *iommu,
>                   struct dmar_domain *domain,
>                   unsigned long pfn, unsigned int pages,
>                   int ih, int map)
> {
>     unsigned int mask = ilog2(__roundup_pow_of_two(pages));
>     uint64_t addr = (uint64_t)pfn << VTD_PAGE_SHIFT;
>     u16 did = domain->iommu_did[iommu->seq_id];
> ...

Yes, rounding up should be the only thing to do when we have an
unaligned size.
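
As a rough illustration of that rounding (a sketch only, with
hypothetical helper names psi_addr_mask() and psi_span_bytes(), and
assuming 4K pages): the address mask is the smallest power-of-two
exponent that covers the page count, so 3 pages give am = 2 and the PSI
spans 16K.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT 12 /* 4K pages, an assumption for this sketch */

/*
 * Smallest am such that (1 << am) >= pages.  For pages >= 1 this gives
 * the same result as ilog2(__roundup_pow_of_two(pages)).
 */
static unsigned int psi_addr_mask(unsigned int pages)
{
    unsigned int am = 0;

    assert(pages >= 1 && pages <= (1u << 31));

    while ((1u << am) < pages) {
        am++;
    }
    return am;
}

/* Number of bytes covered by a PSI with the given address mask. */
static uint64_t psi_span_bytes(unsigned int am)
{
    return (uint64_t)1 << (am + TOY_PAGE_SHIFT);
}
```

So a 12K mapping ends up invalidated with a 16K span, which is why the
extra page has to be filtered out somewhere on the QEMU side.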

> 
> 
> >
> >Besides, imho to completely solve this problem, we still need that
> >per-domain tree. Considering that currently the tree is inside vfio, I
> >see this not a big issue as well.
> 
> Another issue I found is: with this series, VFIO_IOMMU_MAP_DMA seems become
> guest trigger-able. And since VFIO allocate its own structure to record dma
> mapping, this seems open a window for evil guest to exhaust host memory
> which is even worse.

(I see Alex replied in another email, so will skip this one)

> 
> >  In that case, the last page mapping
> >request will fail (we might see one error line from QEMU stderr),
> >however that'll not affect too much since currently vfio allows that
> >failure to happen (ioctl fail, but that page is still mapped, which is
> >what we wanted).
> 
> Works but sub-optimal or maybe even buggy.

Again, to finally solve this, I think we need a tree. But I don't
think that's a good idea for this series, considering that we already
have one in the kernel. I don't see this issue as a critical blocker
(unless you disagree), since it should work for our goal, which is
either nested device assignment or dpdk applications in general.

I think users' feedback is really important for this series. So again,
I'll request that we postpone some issues as todo, rather than solving
all of them in this series before merge.

> 
> >
> >(But of course above error message can be used by an in-guest attacker
> >  as well just like general error_report() issues reported before,
> >  though again I will appreciate if we can have this series
> >  functionally work first :)
> >
> >And, I should be able to emulate this behavior in guest with a tiny C
> >program to make sure of it, possibly after this series if allowed.
> 
> Or through your vtd unittest :) ?

Yes, or even easier: just write a program in a guest running Linux
that sends the corresponding VFIO_IOMMU_MAP_DMA ioctl()s.

Thanks,

-- peterx

Peter Xu Jan. 24, 2017, 7:22 a.m. UTC | #13

On Mon, Jan 23, 2017 at 11:03:08AM -0700, Alex Williamson wrote:
> On Mon, 23 Jan 2017 11:34:29 +0800
> Peter Xu <peterx@redhat.com> wrote:
> 
> > On Mon, Jan 23, 2017 at 09:55:39AM +0800, Jason Wang wrote:
> > > 
> > > 
> > > On 2017年01月22日 17:04, Peter Xu wrote:  
> > > >On Sun, Jan 22, 2017 at 04:08:04PM +0800, Jason Wang wrote:
> > > >
> > > >[...]
> > > >  
> > > >>>+static void vtd_iotlb_page_invalidate_notify(IntelIOMMUState *s,
> > > >>>+                                           uint16_t domain_id, hwaddr addr,
> > > >>>+                                           uint8_t am)
> > > >>>+{
> > > >>>+    IntelIOMMUNotifierNode *node;
> > > >>>+    VTDContextEntry ce;
> > > >>>+    int ret;
> > > >>>+
> > > >>>+    QLIST_FOREACH(node, &(s->notifiers_list), next) {
> > > >>>+        VTDAddressSpace *vtd_as = node->vtd_as;
> > > >>>+        ret = vtd_dev_to_context_entry(s, pci_bus_num(vtd_as->bus),
> > > >>>+                                       vtd_as->devfn, &ce);
> > > >>>+        if (!ret && domain_id == VTD_CONTEXT_ENTRY_DID(ce.hi)) {
> > > >>>+            vtd_page_walk(&ce, addr, addr + (1 << am) * VTD_PAGE_SIZE,
> > > >>>+                          vtd_page_invalidate_notify_hook,
> > > >>>+                          (void *)&vtd_as->iommu, true);  
> > > >>Why not simply trigger the notifier here? (or is this vfio required?)  
> > > >Because we may only want to notify part of the region - we are with
> > > >mask here, but not exact size.
> > > >
> > > >Consider this: guest (with caching mode) maps 12K memory (4K*3 pages),
> > > >the mask will be extended to 16K in the guest. In that case, we need
> > > >to explicitly go over the page entry to know that the 4th page should
> > > >not be notified.  
> > > 
> > > I see. Then it was required by vfio only, I think we can add a fast path for
> > > !CM in this case by triggering the notifier directly.  
> > 
> > I noted this down (to be further investigated in my todo), but I don't
> > know whether this can work, due to the fact that I think it is still
> > legal that guest merge more than one PSIs into one. For example, I
> > don't know whether below is legal:
> > 
> > - guest invalidate page (0, 4k)
> > - guest map new page (4k, 8k)
> > - guest send single PSI of (0, 8k)
> > 
> > In that case, it contains both map/unmap, and looks like it didn't
> > disobey the spec as well?
> 
> The topic of mapping and invalidation granularity also makes me
> slightly concerned with the abstraction we use for the type1 IOMMU
> backend.  With the "v2" type1 configuration we currently use in QEMU,
> the user may only unmap with the same minimum granularity with which
> the original mapping was created.  For instance if an iommu notifier
> map request gets to vfio with an 8k range, the resulting mapping can
> only be removed by an invalidation covering the full range.  Trying to
> bisect that original mapping by only invalidating 4k of the range will
> generate an error.

I see. Then this will be a strict requirement that we cannot do
coalescing during page walk, at least for mappings.

I didn't notice this before, but luckily the current series follows
the rule above - we are basically doing the mapping in units of
pages. Normally we should always be mapping with 4K pages; only if
the guest provides huge pages in the VT-d page table would we notify
a map with >4K, and then it can only be 2M or 1G, never any other
value.

The point is, the guest should be aware of the existence of the above
huge pages, so it won't unmap (for example) a single 4k region within
a 2M huge page range. It'll either keep the huge page, or unmap the
whole huge page. In that sense, we are quite safe.
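
To spell out the rule as I understand it from the description above,
here is a toy model (not the kernel's type1 implementation; struct
toy_container, toy_map_dma() and toy_unmap_dma() are hypothetical names,
and overlap checks on map are left out): an unmap is refused if it
would cut through any existing mapping, and otherwise removes every
mapping that lies fully inside the range.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_MAPS 8

struct toy_map {
    uint64_t iova;
    uint64_t size;
    bool used;
};

struct toy_container {
    struct toy_map maps[TOY_MAX_MAPS];
};

/* Record one mapping exactly as requested.  Returns 0, or -1 if full. */
static int toy_map_dma(struct toy_container *c, uint64_t iova, uint64_t size)
{
    size_t i;

    assert(size > 0);

    for (i = 0; i < TOY_MAX_MAPS; i++) {
        if (!c->maps[i].used) {
            c->maps[i] = (struct toy_map){ iova, size, true };
            return 0;
        }
    }
    return -1;
}

/*
 * Unmap [iova, iova + size).  Returns -1 without changing anything if
 * the range would bisect a mapping; otherwise drops every mapping that
 * is fully covered and returns 0.
 */
static int toy_unmap_dma(struct toy_container *c, uint64_t iova, uint64_t size)
{
    uint64_t end = iova + size;
    size_t i;

    /* First pass: refuse if an overlapping mapping is only partly covered. */
    for (i = 0; i < TOY_MAX_MAPS; i++) {
        const struct toy_map *m = &c->maps[i];
        uint64_t m_end = m->iova + m->size;

        if (!m->used || m_end <= iova || m->iova >= end) {
            continue; /* unused slot or no overlap */
        }
        if (m->iova < iova || m_end > end) {
            return -1; /* would bisect this mapping */
        }
    }

    /* Second pass: remove the mappings that lie inside the range. */
    for (i = 0; i < TOY_MAX_MAPS; i++) {
        struct toy_map *m = &c->maps[i];

        if (m->used && m->iova >= iova && m->iova + m->size <= end) {
            m->used = false;
        }
    }
    return 0;
}
```

In this model a single 8K map can only be removed by an 8K unmap, while
two separate 4K maps can be removed one at a time, which is why
notifying per page keeps us on the safe side.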

(For my own curiosity, and off topic: could I ask why we can't do
 that? E.g., we map 4K*2 pages, then we unmap the first 4K page?)

> 
> I would think (but please confirm), that when we're only tracking
> mappings generated by the guest OS that this works.  If the guest OS
> maps with 4k pages, we get map notifies for each of those 4k pages.  If
> they use 2MB pages, we get 2MB ranges and invalidations will come in
> the same granularity.

I would agree (I haven't thought of a case that this might be a
problem).

> 
> An area of concern though is the replay mechanism in QEMU, I'll need to
> look for it in the code, but replaying an IOMMU domain into a new
> container *cannot* coalesce mappings or else it limits the granularity
> with which we can later accept unmaps. Take for instance a guest that
> has mapped a contiguous 2MB range with 4K pages.  They can unmap any 4K
> page within that range.  However if vfio gets a single 2MB mapping
> rather than 512 4K mappings, then the host IOMMU may use a hugepage
> mapping where our granularity is now 2MB.  Thanks,

Is this the answer to my question above (the one for my own
curiosity)? If so, that kind of explains it.

If it's just because vfio is smart enough to automatically use huge
pages when applicable (I believe it's for performance's sake), I
wonder whether we can introduce an ioctl() to set up the iova_pgsizes
bitmap, as long as it is a subset of the supported iova_pgsizes (from
VFIO_IOMMU_GET_INFO) - then when people want to get rid of the above
limitation, they can explicitly set the iova_pgsizes to only allow 4K
pages.

But, of course, this series can live well without it at least for now.

Thanks,

-- peterx

Alex Williamson Jan. 24, 2017, 4:24 p.m. UTC | #14

On Tue, 24 Jan 2017 15:22:15 +0800
Peter Xu <peterx@redhat.com> wrote:

> On Mon, Jan 23, 2017 at 11:03:08AM -0700, Alex Williamson wrote:
> > On Mon, 23 Jan 2017 11:34:29 +0800
> > Peter Xu <peterx@redhat.com> wrote:
> >   
> > > On Mon, Jan 23, 2017 at 09:55:39AM +0800, Jason Wang wrote:  
> > > > 
> > > > 
> > > > On 2017年01月22日 17:04, Peter Xu wrote:    
> > > > >On Sun, Jan 22, 2017 at 04:08:04PM +0800, Jason Wang wrote:
> > > > >
> > > > >[...]
> > > > >    
> > > > >>>+static void vtd_iotlb_page_invalidate_notify(IntelIOMMUState *s,
> > > > >>>+                                           uint16_t domain_id, hwaddr addr,
> > > > >>>+                                           uint8_t am)
> > > > >>>+{
> > > > >>>+    IntelIOMMUNotifierNode *node;
> > > > >>>+    VTDContextEntry ce;
> > > > >>>+    int ret;
> > > > >>>+
> > > > >>>+    QLIST_FOREACH(node, &(s->notifiers_list), next) {
> > > > >>>+        VTDAddressSpace *vtd_as = node->vtd_as;
> > > > >>>+        ret = vtd_dev_to_context_entry(s, pci_bus_num(vtd_as->bus),
> > > > >>>+                                       vtd_as->devfn, &ce);
> > > > >>>+        if (!ret && domain_id == VTD_CONTEXT_ENTRY_DID(ce.hi)) {
> > > > >>>+            vtd_page_walk(&ce, addr, addr + (1 << am) * VTD_PAGE_SIZE,
> > > > >>>+                          vtd_page_invalidate_notify_hook,
> > > > >>>+                          (void *)&vtd_as->iommu, true);    
> > > > >>Why not simply trigger the notifier here? (or is this vfio required?)    
> > > > >Because we may only want to notify part of the region - we are with
> > > > >mask here, but not exact size.
> > > > >
> > > > >Consider this: guest (with caching mode) maps 12K memory (4K*3 pages),
> > > > >the mask will be extended to 16K in the guest. In that case, we need
> > > > >to explicitly go over the page entry to know that the 4th page should
> > > > >not be notified.    
> > > > 
> > > > I see. Then it was required by vfio only, I think we can add a fast path for
> > > > !CM in this case by triggering the notifier directly.    
> > > 
> > > I noted this down (to be further investigated in my todo), but I don't
> > > know whether this can work, due to the fact that I think it is still
> > > legal that guest merge more than one PSIs into one. For example, I
> > > don't know whether below is legal:
> > > 
> > > - guest invalidate page (0, 4k)
> > > - guest map new page (4k, 8k)
> > > - guest send single PSI of (0, 8k)
> > > 
> > > In that case, it contains both map/unmap, and looks like it didn't
> > > > disobey the spec as well?
> > 
> > The topic of mapping and invalidation granularity also makes me
> > slightly concerned with the abstraction we use for the type1 IOMMU
> > backend.  With the "v2" type1 configuration we currently use in QEMU,
> > the user may only unmap with the same minimum granularity with which
> > the original mapping was created.  For instance if an iommu notifier
> > map request gets to vfio with an 8k range, the resulting mapping can
> > only be removed by an invalidation covering the full range.  Trying to
> > bisect that original mapping by only invalidating 4k of the range will
> > generate an error.  
> 
> I see. Then this will be a strict requirement that we cannot do
> coalescing during page walk, at least for mappings.
> 
> I didn't notice this before, but luckily current series is following
> the rule above - we are basically doing the mapping in the unit of
> pages. Normally, we should always be mapping with 4K pages, only if
> guest provides huge pages in the VT-d page table, would we notify map
> with >4K, though of course it can be either 2M/1G but never other
> values.
> 
> The point is, guest should be aware of the existence of the above huge
> pages, so it won't unmap (for example) a single 4k region within a 2M
> huge page range. It'll either keep the huge page, or unmap the whole
> huge page. In that sense, we are quite safe.
> 
> (for my own curiosity and off topic: could I ask why we can't do
>  that? e.g., we map 4K*2 pages, then we unmap the first 4K page?)

You understand why we can't do this in the hugepage case, right?  A
hugepage means that at least one entire level of the page table is
missing and that in order to unmap a subsection of it, we actually need
to replace it with a new page table level, which cannot be done
atomically relative to the rest of the PTEs in that entry.  Now what if
we don't assume that hugepages are only the Intel defined 2MB & 1GB?
AMD-Vi supports effectively arbitrary power of two page table entries.
So what if we've passed a 2x 4K mapping where the physical pages were
contiguous and vfio passed it as a direct 8K mapping to the IOMMU and
the IOMMU has native support for 8K mappings.  We're in a similar
scenario as the 2MB page, different page table layout though.
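
As a sketch of what "physically contiguous" means for that 2x 4K
example (illustrative only; toy_contig_run() is a hypothetical helper,
not vfio code, and 4K pages are assumed): given the physical address
backing each consecutive iova page, it returns how many pages starting
at a given index form one contiguous run that could be handed to the
IOMMU as a single larger mapping.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_SIZE 4096u /* 4K pages, an assumption for this sketch */

/*
 * phys[i] is the physical address backing the i-th consecutive iova
 * page.  Returns the number of pages, starting at 'first', whose
 * physical addresses are back to back.
 */
static size_t toy_contig_run(const uint64_t *phys, size_t npages, size_t first)
{
    size_t n = 1;

    assert(first < npages);

    while (first + n < npages &&
           phys[first + n] == phys[first] + (uint64_t)n * TOY_PAGE_SIZE) {
        n++;
    }
    return n;
}
```

For phys = { 0x10000, 0x11000, 0x40000 } the first two pages form one
8K run, so they could reach the IOMMU as a single 8K mapping, and the
original 4K boundary between them is no longer visible there.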

> > I would think (but please confirm), that when we're only tracking
> > mappings generated by the guest OS that this works.  If the guest OS
> > maps with 4k pages, we get map notifies for each of those 4k pages.  If
> > they use 2MB pages, we get 2MB ranges and invalidations will come in
> > the same granularity.  
> 
> I would agree (I haven't thought of a case that this might be a
> problem).
> 
> > 
> > An area of concern though is the replay mechanism in QEMU, I'll need to
> > look for it in the code, but replaying an IOMMU domain into a new
> > container *cannot* coalesce mappings or else it limits the granularity
> > with which we can later accept unmaps. Take for instance a guest that
> > has mapped a contiguous 2MB range with 4K pages.  They can unmap any 4K
> > page within that range.  However if vfio gets a single 2MB mapping
> > rather than 512 4K mappings, then the host IOMMU may use a hugepage
> > mapping where our granularity is now 2MB.  Thanks,  
> 
> Is this the answer of my above question (which is for my own
> curiosity)? If so, that'll kind of explain.
> 
> If it's just because vfio is smart enough on automatically using huge
> pages when applicable (I believe it's for performance's sake), not
> sure whether we can introduce a ioctl() to setup the iova_pgsizes
> bitmap, as long as it is a subset of supported iova_pgsizes (from
> VFIO_IOMMU_GET_INFO) - then when people wants to get rid of above
> limitation, they can explicitly set the iova_pgsizes to only allow 4K
> pages.
> 
> But, of course, this series can live well without it at least for now.

Yes, this is part of how vfio transparently makes use of hugepages in
the IOMMU, we effectively disregard the supported page sizes bitmap
(it's useless for anything other than determining the minimum page size
anyway), and instead pass through the largest range of iovas which are
physically contiguous.  The IOMMU driver can then make use of hugepages
where available.  The VFIO_IOMMU_MAP_DMA ioctl does include a flags
field where we could appropriate a bit to indicate map with minimum
granularity, but that would not be as simple as triggering the
disable_hugepages mapping path because the type1 driver would also need
to flag the internal vfio_dma as being bisectable, if not simply
converted to multiple vfio_dma structs internally.  Thanks,

Alex

Jason Wang Jan. 25, 2017, 1:19 a.m. UTC | #15

On 2017-01-24 03:40, Alex Williamson wrote:
> On Mon, 23 Jan 2017 18:23:44 +0800
> Jason Wang<jasowang@redhat.com>  wrote:
>
>> On 2017年01月23日 11:34, Peter Xu wrote:
>>> On Mon, Jan 23, 2017 at 09:55:39AM +0800, Jason Wang wrote:
>>>> On 2017年01月22日 17:04, Peter Xu wrote:
>>>>> On Sun, Jan 22, 2017 at 04:08:04PM +0800, Jason Wang wrote:
>>>>>
>>>>> [...]
>>>>>   
>>>>>>> +static void vtd_iotlb_page_invalidate_notify(IntelIOMMUState *s,
>>>>>>> +                                           uint16_t domain_id, hwaddr addr,
>>>>>>> +                                           uint8_t am)
>>>>>>> +{
>>>>>>> +    IntelIOMMUNotifierNode *node;
>>>>>>> +    VTDContextEntry ce;
>>>>>>> +    int ret;
>>>>>>> +
>>>>>>> +    QLIST_FOREACH(node, &(s->notifiers_list), next) {
>>>>>>> +        VTDAddressSpace *vtd_as = node->vtd_as;
>>>>>>> +        ret = vtd_dev_to_context_entry(s, pci_bus_num(vtd_as->bus),
>>>>>>> +                                       vtd_as->devfn, &ce);
>>>>>>> +        if (!ret && domain_id == VTD_CONTEXT_ENTRY_DID(ce.hi)) {
>>>>>>> +            vtd_page_walk(&ce, addr, addr + (1 << am) * VTD_PAGE_SIZE,
>>>>>>> +                          vtd_page_invalidate_notify_hook,
>>>>>>> +                          (void *)&vtd_as->iommu, true);
>>>>>> Why not simply trigger the notifier here? (or is this vfio required?)
>>>>> Because we may only want to notify part of the region - we are with
>>>>> mask here, but not exact size.
>>>>>
>>>>> Consider this: guest (with caching mode) maps 12K memory (4K*3 pages),
>>>>> the mask will be extended to 16K in the guest. In that case, we need
>>>>> to explicitly go over the page entry to know that the 4th page should
>>>>> not be notified.
>>>> I see. Then it was required by vfio only, I think we can add a fast path for
>>>> !CM in this case by triggering the notifier directly.
>>> I noted this down (to be further investigated in my todo), but I don't
>>> know whether this can work, due to the fact that I think it is still
>>> legal that guest merge more than one PSIs into one. For example, I
>>> don't know whether below is legal:
>>>
>>> - guest invalidate page (0, 4k)
>>> - guest map new page (4k, 8k)
>>> - guest send single PSI of (0, 8k)
>>>
>>> In that case, it contains both map/unmap, and looks like it didn't
>>> disobey the spec as well?
>> Not sure I get your meaning, you mean just send single PSI instead of two?
>>
>>>   
>>>> Another possible issue is, consider (with CM) a 16K contiguous iova with the
>>>> last page has already been mapped. In this case, if we want to map first
>>>> three pages, when handling IOTLB invalidation, am would be 16K, then the
>>>> last page will be mapped twice. Can this lead some issue?
>>> I don't know whether guest has special handling of this kind of
>>> request.
>> This seems quite usual I think? E.g iommu_flush_iotlb_psi() did:
>>
>> static void iommu_flush_iotlb_psi(struct intel_iommu *iommu,
>>                     struct dmar_domain *domain,
>>                     unsigned long pfn, unsigned int pages,
>>                     int ih, int map)
>> {
>>       unsigned int mask = ilog2(__roundup_pow_of_two(pages));
>>       uint64_t addr = (uint64_t)pfn << VTD_PAGE_SHIFT;
>>       u16 did = domain->iommu_did[iommu->seq_id];
>> ...
>>
>>
>>> Besides, imho to completely solve this problem, we still need that
>>> per-domain tree. Considering that currently the tree is inside vfio, I
>>> see this not a big issue as well.
>> Another issue I found is: with this series, VFIO_IOMMU_MAP_DMA seems
>> become guest trigger-able. And since VFIO allocate its own structure to
>> record dma mapping, this seems open a window for evil guest to exhaust
>> host memory which is even worse.
> You're thinking of pci-assign, vfio does page accounting such that a
> user can only lock pages up to their locked memory limit.  Exposing the
> mapping ioctl within the guest is not a different problem from exposing
> the ioctl to the host user from a vfio perspective.  Thanks,
>
> Alex
>

Yes, but what if an evil guest maps all iovas to the same gpa?

Thanks

Alex Williamson Jan. 25, 2017, 1:31 a.m. UTC | #16

On Wed, 25 Jan 2017 09:19:25 +0800
Jason Wang <jasowang@redhat.com> wrote:

> On 2017年01月24日 03:40, Alex Williamson wrote:
> > On Mon, 23 Jan 2017 18:23:44 +0800
> > Jason Wang<jasowang@redhat.com>  wrote:
> >  
> >> On 2017年01月23日 11:34, Peter Xu wrote:  
> >>> On Mon, Jan 23, 2017 at 09:55:39AM +0800, Jason Wang wrote:  
> >>>> On 2017年01月22日 17:04, Peter Xu wrote:  
> >>>>> On Sun, Jan 22, 2017 at 04:08:04PM +0800, Jason Wang wrote:
> >>>>>
> >>>>> [...]
> >>>>>     
> >>>>>>> +static void vtd_iotlb_page_invalidate_notify(IntelIOMMUState *s,
> >>>>>>> +                                           uint16_t domain_id, hwaddr addr,
> >>>>>>> +                                           uint8_t am)
> >>>>>>> +{
> >>>>>>> +    IntelIOMMUNotifierNode *node;
> >>>>>>> +    VTDContextEntry ce;
> >>>>>>> +    int ret;
> >>>>>>> +
> >>>>>>> +    QLIST_FOREACH(node, &(s->notifiers_list), next) {
> >>>>>>> +        VTDAddressSpace *vtd_as = node->vtd_as;
> >>>>>>> +        ret = vtd_dev_to_context_entry(s, pci_bus_num(vtd_as->bus),
> >>>>>>> +                                       vtd_as->devfn, &ce);
> >>>>>>> +        if (!ret && domain_id == VTD_CONTEXT_ENTRY_DID(ce.hi)) {
> >>>>>>> +            vtd_page_walk(&ce, addr, addr + (1 << am) * VTD_PAGE_SIZE,
> >>>>>>> +                          vtd_page_invalidate_notify_hook,
> >>>>>>> +                          (void *)&vtd_as->iommu, true);  
> >>>>>> Why not simply trigger the notifier here? (or is this vfio required?)  
> >>>>> Because we may only want to notify part of the region - we are with
> >>>>> mask here, but not exact size.
> >>>>>
> >>>>> Consider this: guest (with caching mode) maps 12K memory (4K*3 pages),
> >>>>> the mask will be extended to 16K in the guest. In that case, we need
> >>>>> to explicitly go over the page entry to know that the 4th page should
> >>>>> not be notified.  
> >>>> I see. Then it was required by vfio only, I think we can add a fast path for
> >>>> !CM in this case by triggering the notifier directly.  
> >>> I noted this down (to be further investigated in my todo), but I don't
> >>> know whether this can work, due to the fact that I think it is still
> >>> legal that guest merge more than one PSIs into one. For example, I
> >>> don't know whether below is legal:
> >>>
> >>> - guest invalidate page (0, 4k)
> >>> - guest map new page (4k, 8k)
> >>> - guest send single PSI of (0, 8k)
> >>>
> >>> In that case, it contains both map/unmap, and looks like it didn't
> >>> disobey the spec as well?
> >> Not sure I get your meaning, you mean just send single PSI instead of two?
> >>  
> >>>     
> >>>> Another possible issue is, consider (with CM) a 16K contiguous iova with the
> >>>> last page has already been mapped. In this case, if we want to map first
> >>>> three pages, when handling IOTLB invalidation, am would be 16K, then the
> >>>> last page will be mapped twice. Can this lead some issue?  
> >>> I don't know whether guest has special handling of this kind of
> >>> request.  
> >> This seems quite usual I think? E.g iommu_flush_iotlb_psi() did:
> >>
> >> static void iommu_flush_iotlb_psi(struct intel_iommu *iommu,
> >>                     struct dmar_domain *domain,
> >>                     unsigned long pfn, unsigned int pages,
> >>                     int ih, int map)
> >> {
> >>       unsigned int mask = ilog2(__roundup_pow_of_two(pages));
> >>       uint64_t addr = (uint64_t)pfn << VTD_PAGE_SHIFT;
> >>       u16 did = domain->iommu_did[iommu->seq_id];
> >> ...
> >>
> >>  
> >>> Besides, imho to completely solve this problem, we still need that
> >>> per-domain tree. Considering that currently the tree is inside vfio, I
> >>> see this not a big issue as well.  
> >> Another issue I found is: with this series, VFIO_IOMMU_MAP_DMA seems
> >> become guest trigger-able. And since VFIO allocate its own structure to
> >> record dma mapping, this seems open a window for evil guest to exhaust
> >> host memory which is even worse.  
> > You're thinking of pci-assign, vfio does page accounting such that a
> > user can only lock pages up to their locked memory limit.  Exposing the
> > mapping ioctl within the guest is not a different problem from exposing
> > the ioctl to the host user from a vfio perspective.  Thanks,
> >
> > Alex
> >  
> 
> Yes, but what if an evil guest that maps all iovas to the same gpa?

Doesn't matter, we'd account that gpa each time it's mapped, so
effectively the locked memory limit is equal to the iova size the user
can map.  Thanks,

Alex

Peter Xu Jan. 25, 2017, 4:04 a.m. UTC | #17

On Tue, Jan 24, 2017 at 09:24:29AM -0700, Alex Williamson wrote:

[...]

> > I see. Then this will be a strict requirement that we cannot do
> > coalescing during page walk, at least for mappings.
> > 
> > I didn't notice this before, but luckily current series is following
> > the rule above - we are basically doing the mapping in the unit of
> > pages. Normally, we should always be mapping with 4K pages, only if
> > guest provides huge pages in the VT-d page table, would we notify map
> > with >4K, though of course it can be either 2M/1G but never other
> > values.
> > 
> > The point is, guest should be aware of the existence of the above huge
> > pages, so it won't unmap (for example) a single 4k region within a 2M
> > huge page range. It'll either keep the huge page, or unmap the whole
> > huge page. In that sense, we are quite safe.
> > 
> > (for my own curiosity and off topic: could I ask why we can't do
> >  that? e.g., we map 4K*2 pages, then we unmap the first 4K page?)
> 
> You understand why we can't do this in the hugepage case, right?  A
> hugepage means that at least one entire level of the page table is
> missing and that in order to unmap a subsection of it, we actually need
> to replace it with a new page table level, which cannot be done
> atomically relative to the rest of the PTEs in that entry.  Now what if
> we don't assume that hugepages are only the Intel defined 2MB & 1GB?
> AMD-Vi supports effectively arbitrary power of two page table entries.
> So what if we've passed a 2x 4K mapping where the physical pages were
> contiguous and vfio passed it as a direct 8K mapping to the IOMMU and
> the IOMMU has native support for 8K mappings.  We're in a similar
> scenario as the 2MB page, different page table layout though.

Thanks for the explanation. The AMD example is clear.

> 
> > > I would think (but please confirm), that when we're only tracking
> > > mappings generated by the guest OS that this works.  If the guest OS
> > > maps with 4k pages, we get map notifies for each of those 4k pages.  If
> > > they use 2MB pages, we get 2MB ranges and invalidations will come in
> > > the same granularity.  
> > 
> > I would agree (I haven't thought of a case that this might be a
> > problem).
> > 
> > > 
> > > An area of concern though is the replay mechanism in QEMU, I'll need to
> > > look for it in the code, but replaying an IOMMU domain into a new
> > > container *cannot* coalesce mappings or else it limits the granularity
> > > with which we can later accept unmaps. Take for instance a guest that
> > > has mapped a contiguous 2MB range with 4K pages.  They can unmap any 4K
> > > page within that range.  However if vfio gets a single 2MB mapping
> > > rather than 512 4K mappings, then the host IOMMU may use a hugepage
> > > mapping where our granularity is now 2MB.  Thanks,  
> > 
> > Is this the answer of my above question (which is for my own
> > curiosity)? If so, that'll kind of explain.
> > 
> > If it's just because vfio is smart enough on automatically using huge
> > pages when applicable (I believe it's for performance's sake), not
> > sure whether we can introduce a ioctl() to setup the iova_pgsizes
> > bitmap, as long as it is a subset of supported iova_pgsizes (from
> > VFIO_IOMMU_GET_INFO) - then when people wants to get rid of above
> > limitation, they can explicitly set the iova_pgsizes to only allow 4K
> > pages.
> > 
> > But, of course, this series can live well without it at least for now.
> 
> Yes, this is part of how vfio transparently makes use of hugepages in
> the IOMMU, we effectively disregard the supported page sizes bitmap
> (it's useless for anything other than determining the minimum page size
> anyway), and instead pass through the largest range of iovas which are
> physically contiguous.  The IOMMU driver can then make use of hugepages
> where available.  The VFIO_IOMMU_MAP_DMA ioctl does include a flags
> field where we could appropriate a bit to indicate map with minimum
> granularity, but that would not be as simple as triggering the
> disable_hugepages mapping path because the type1 driver would also need
> to flag the internal vfio_dma as being bisectable, if not simply
> converted to multiple vfio_dma structs internally.  Thanks,

I see, thanks!

-- peterx

Jason Wang Jan. 25, 2017, 7:41 a.m. UTC | #18

On 2017-01-25 09:31, Alex Williamson wrote:
> On Wed, 25 Jan 2017 09:19:25 +0800
> Jason Wang <jasowang@redhat.com> wrote:
>
>> On 2017年01月24日 03:40, Alex Williamson wrote:
>>> On Mon, 23 Jan 2017 18:23:44 +0800
>>> Jason Wang<jasowang@redhat.com>  wrote:
>>>   
>>>> On 2017年01月23日 11:34, Peter Xu wrote:
>>>>> On Mon, Jan 23, 2017 at 09:55:39AM +0800, Jason Wang wrote:
>>>>>> On 2017年01月22日 17:04, Peter Xu wrote:
>>>>>>> On Sun, Jan 22, 2017 at 04:08:04PM +0800, Jason Wang wrote:
>>>>>>>
>>>>>>> [...]
>>>>>>>      
>>>>>>>>> +static void vtd_iotlb_page_invalidate_notify(IntelIOMMUState *s,
>>>>>>>>> +                                           uint16_t domain_id, hwaddr addr,
>>>>>>>>> +                                           uint8_t am)
>>>>>>>>> +{
>>>>>>>>> +    IntelIOMMUNotifierNode *node;
>>>>>>>>> +    VTDContextEntry ce;
>>>>>>>>> +    int ret;
>>>>>>>>> +
>>>>>>>>> +    QLIST_FOREACH(node, &(s->notifiers_list), next) {
>>>>>>>>> +        VTDAddressSpace *vtd_as = node->vtd_as;
>>>>>>>>> +        ret = vtd_dev_to_context_entry(s, pci_bus_num(vtd_as->bus),
>>>>>>>>> +                                       vtd_as->devfn, &ce);
>>>>>>>>> +        if (!ret && domain_id == VTD_CONTEXT_ENTRY_DID(ce.hi)) {
>>>>>>>>> +            vtd_page_walk(&ce, addr, addr + (1 << am) * VTD_PAGE_SIZE,
>>>>>>>>> +                          vtd_page_invalidate_notify_hook,
>>>>>>>>> +                          (void *)&vtd_as->iommu, true);
>>>>>>>> Why not simply trigger the notifier here? (or is this vfio required?)
>>>>>>> Because we may only want to notify part of the region - we are with
>>>>>>> mask here, but not exact size.
>>>>>>>
>>>>>>> Consider this: guest (with caching mode) maps 12K memory (4K*3 pages),
>>>>>>> the mask will be extended to 16K in the guest. In that case, we need
>>>>>>> to explicitly go over the page entry to know that the 4th page should
>>>>>>> not be notified.
>>>>>> I see. Then it was required by vfio only, I think we can add a fast path for
>>>>>> !CM in this case by triggering the notifier directly.
>>>>> I noted this down (to be further investigated in my todo), but I don't
>>>>> know whether this can work, due to the fact that I think it is still
>>>>> legal that guest merge more than one PSIs into one. For example, I
>>>>> don't know whether below is legal:
>>>>>
>>>>> - guest invalidate page (0, 4k)
>>>>> - guest map new page (4k, 8k)
>>>>> - guest send single PSI of (0, 8k)
>>>>>
>>>>> In that case, it contains both map/unmap, and looks like it didn't
>>>>> disobey the spec as well?
>>>> Not sure I get your meaning, you mean just send single PSI instead of two?
>>>>   
>>>>>      
>>>>>> Another possible issue is, consider (with CM) a 16K contiguous iova range
>>>>>> whose last page has already been mapped. In this case, if we want to map the
>>>>>> first three pages, then when handling the IOTLB invalidation am would cover
>>>>>> 16K, so the last page will be mapped twice. Can this lead to some issue?
>>>>> I don't know whether the guest has special handling of this kind of
>>>>> request.
>>>> This seems quite usual I think? E.g. iommu_flush_iotlb_psi() does:
>>>>
>>>> static void iommu_flush_iotlb_psi(struct intel_iommu *iommu,
>>>>                      struct dmar_domain *domain,
>>>>                      unsigned long pfn, unsigned int pages,
>>>>                      int ih, int map)
>>>> {
>>>>        unsigned int mask = ilog2(__roundup_pow_of_two(pages));
>>>>        uint64_t addr = (uint64_t)pfn << VTD_PAGE_SHIFT;
>>>>        u16 did = domain->iommu_did[iommu->seq_id];
>>>> ...
>>>>
>>>>   
>>>>> Besides, imho to completely solve this problem, we still need that
>>>>> per-domain tree. Considering that currently the tree is inside vfio, I
>>>>> don't see this as a big issue either.
>>>> Another issue I found is: with this series, VFIO_IOMMU_MAP_DMA seems
>>>> to become guest-triggerable. And since VFIO allocates its own structure to
>>>> record dma mappings, this seems to open a window for an evil guest to
>>>> exhaust host memory, which is even worse.
>>> You're thinking of pci-assign; vfio does page accounting such that a
>>> user can only lock pages up to their locked memory limit.  Exposing the
>>> mapping ioctl within the guest is not a different problem from exposing
>>> the ioctl to the host user from a vfio perspective.  Thanks,
>>>
>>> Alex
>>>   
>> Yes, but what about an evil guest that maps all iovas to the same gpa?
> Doesn't matter, we'd account that gpa each time it's mapped, so
> effectively the locked memory limit is equal to the iova size the user
> can map.  Thanks,
>
> Alex

I see. Good to know this.

Thanks
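
For reference, the 12K/16K rounding discussed above can be sketched in a few lines of C. This is only an illustration under stated assumptions, not code from the patch or from the Linux driver: the helper names (psi_address_mask, psi_range_size, pages_to_notify) and the 64-bit "present" bitmap standing in for the guest page table are all hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12   /* 4K pages, as in the example in the thread */
#define SKETCH_PAGE_SIZE  ((uint64_t)1 << SKETCH_PAGE_SHIFT)

/* Address mask for a PSI covering `pages` pages: log2 of the page count
 * rounded up to the next power of two (hypothetical helper). */
static unsigned int psi_address_mask(unsigned int pages)
{
    unsigned int am = 0;

    assert(pages > 0);
    while (((uint64_t)1 << am) < pages) {
        am++;
    }
    return am;
}

/* Number of bytes covered by an invalidation with address mask `am`. */
static uint64_t psi_range_size(unsigned int am)
{
    return ((uint64_t)1 << am) * SKETCH_PAGE_SIZE;
}

/* Toy stand-in for a page walk: bit N of `present` says whether page N is
 * mapped. Only mapped pages inside the masked range are counted, which is
 * why a walk is needed instead of notifying the whole range. */
static unsigned int pages_to_notify(uint64_t present, unsigned int first_page,
                                    unsigned int am)
{
    unsigned int count = 0;

    for (unsigned int i = 0; i < (1u << am); i++) {
        unsigned int page = first_page + i;

        if (page < 64 && ((present >> page) & 1)) {
            count++;
        }
    }
    return count;
}
```

With three mapped pages the mask is 2, so the invalidation covers four pages (16K), while the walk still finds only the three pages that are actually present.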
