diff mbox series

[v4,05/12] vfio/iommufd: Introduce auto domain creation

Message ID 20240712114704.8708-6-joao.m.martins@oracle.com
State New
Headers show
Series hw/iommufd: IOMMUFD Dirty Tracking | expand

Commit Message

Joao Martins July 12, 2024, 11:46 a.m. UTC
There's generally two modes of operation for IOMMUFD:

* The simple user API which intends to perform relatively simple things
with IOMMUs e.g. DPDK. It generally creates an IOAS and attach to VFIO
and mainly performs IOAS_MAP and UNMAP.

* The native IOMMUFD API where you have fine grained control of the
IOMMU domain and model it accordingly. This is where most new feature
are being steered to.

For dirty tracking 2) is required, as it needs to ensure that
the stage-2/parent IOMMU domain will only attach devices
that support dirty tracking (so far it is all homogeneous in x86, likely
not the case for smmuv3). Such invariant on dirty tracking provides a
useful guarantee to VMMs that will refuse incompatible device
attachments for IOMMU domains.

Dirty tracking insurance is enforced via HWPT_ALLOC, which is
responsible for creating an IOMMU domain. This is contrast to the
'simple API' where the IOMMU domain is created by IOMMUFD automatically
when it attaches to VFIO (usually referred as autodomains) but it has
the needed handling for mdevs.

To support dirty tracking with the advanced IOMMUFD API, it needs
similar logic, where IOMMU domains are created and devices attached to
compatible domains. Essentially mimmicing kernel
iommufd_device_auto_get_domain(). With mdevs given there's no IOMMU domain
it falls back to IOAS attach.

The auto domain logic allows different IOMMU domains to be created when
DMA dirty tracking is not desired (and VF can provide it), and others where
it is. Here is not used in this way here given how VFIODevice migration
state is initialized after the device attachment. But such mixed mode of
IOMMU dirty tracking + device dirty tracking is an improvement that can
be added on. Keep the 'all of nothing' of type1 approach that we have
been using so far between container vs device dirty tracking.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 include/hw/vfio/vfio-common.h |  9 ++++
 include/sysemu/iommufd.h      |  5 +++
 backends/iommufd.c            | 30 +++++++++++++
 hw/vfio/iommufd.c             | 82 +++++++++++++++++++++++++++++++++++
 backends/trace-events         |  1 +
 5 files changed, 127 insertions(+)

Comments

Cédric Le Goater July 16, 2024, 9:39 a.m. UTC | #1
On 7/12/24 13:46, Joao Martins wrote:
> There's generally two modes of operation for IOMMUFD:
> 
> * The simple user API which intends to perform relatively simple things
> with IOMMUs e.g. DPDK. It generally creates an IOAS and attach to VFIO
> and mainly performs IOAS_MAP and UNMAP.
> 
> * The native IOMMUFD API where you have fine grained control of the
> IOMMU domain and model it accordingly. This is where most new feature
> are being steered to.
> 
> For dirty tracking 2) 

I suppose 1) and 2) are the bullets above ?

> is required, as it needs to ensure that
> the stage-2/parent IOMMU domain will only attach devices
> that support dirty tracking (so far it is all homogeneous in x86, likely
> not the case for smmuv3). Such invariant on dirty tracking provides a
> useful guarantee to VMMs that will refuse incompatible device
> attachments for IOMMU domains.
> 
> Dirty tracking insurance is enforced via HWPT_ALLOC, which is
> responsible for creating an IOMMU domain. This is contrast to the
> 'simple API' where the IOMMU domain is created by IOMMUFD automatically
> when it attaches to VFIO (usually referred as autodomains) but it has
> the needed handling for mdevs.
> 
> To support dirty tracking with the advanced IOMMUFD API, it needs
> similar logic, where IOMMU domains are created and devices attached to
> compatible domains. Essentially mimmicing kernel

mimmicing -> mimicking, I think.

> iommufd_device_auto_get_domain(). With mdevs given there's no IOMMU domain
> it falls back to IOAS attach.
> 
> The auto domain logic allows different IOMMU domains to be created when
> DMA dirty tracking is not desired (and VF can provide it), and others where
> it is. Here is not used in this way here given how VFIODevice migration
> state is initialized after the device attachment. But such mixed mode of
> IOMMU dirty tracking + device dirty tracking is an improvement that can
> be added on. Keep the 'all of nothing' of type1 approach that we have
> been using so far between container vs device dirty tracking.
>
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>

Reviewed-by: Cédric Le Goater <clg@redhat.com>

This needs feedback from IOMMUFD experts also.

Thanks,

C.


> ---
>   include/hw/vfio/vfio-common.h |  9 ++++
>   include/sysemu/iommufd.h      |  5 +++
>   backends/iommufd.c            | 30 +++++++++++++
>   hw/vfio/iommufd.c             | 82 +++++++++++++++++++++++++++++++++++
>   backends/trace-events         |  1 +
>   5 files changed, 127 insertions(+)
> 
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index 7419466bca92..2dd468ce3c02 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -95,10 +95,17 @@ typedef struct VFIOHostDMAWindow {
>   
>   typedef struct IOMMUFDBackend IOMMUFDBackend;
>   
> +typedef struct VFIOIOASHwpt {
> +    uint32_t hwpt_id;
> +    QLIST_HEAD(, VFIODevice) device_list;
> +    QLIST_ENTRY(VFIOIOASHwpt) next;
> +} VFIOIOASHwpt;
> +
>   typedef struct VFIOIOMMUFDContainer {
>       VFIOContainerBase bcontainer;
>       IOMMUFDBackend *be;
>       uint32_t ioas_id;
> +    QLIST_HEAD(, VFIOIOASHwpt) hwpt_list;
>   } VFIOIOMMUFDContainer;
>   
>   OBJECT_DECLARE_SIMPLE_TYPE(VFIOIOMMUFDContainer, VFIO_IOMMU_IOMMUFD);
> @@ -135,6 +142,8 @@ typedef struct VFIODevice {
>       HostIOMMUDevice *hiod;
>       int devid;
>       IOMMUFDBackend *iommufd;
> +    VFIOIOASHwpt *hwpt;
> +    QLIST_ENTRY(VFIODevice) hwpt_next;
>   } VFIODevice;
>   
>   struct VFIODeviceOps {
> diff --git a/include/sysemu/iommufd.h b/include/sysemu/iommufd.h
> index 57d502a1c79a..e917e7591d05 100644
> --- a/include/sysemu/iommufd.h
> +++ b/include/sysemu/iommufd.h
> @@ -50,6 +50,11 @@ int iommufd_backend_unmap_dma(IOMMUFDBackend *be, uint32_t ioas_id,
>   bool iommufd_backend_get_device_info(IOMMUFDBackend *be, uint32_t devid,
>                                        uint32_t *type, void *data, uint32_t len,
>                                        uint64_t *caps, Error **errp);
> +bool iommufd_backend_alloc_hwpt(IOMMUFDBackend *be, uint32_t dev_id,
> +                                uint32_t pt_id, uint32_t flags,
> +                                uint32_t data_type, uint32_t data_len,
> +                                void *data_ptr, uint32_t *out_hwpt,
> +                                Error **errp);
>   
>   #define TYPE_HOST_IOMMU_DEVICE_IOMMUFD TYPE_HOST_IOMMU_DEVICE "-iommufd"
>   #endif
> diff --git a/backends/iommufd.c b/backends/iommufd.c
> index 2b3d51af26d2..5d3dfa917415 100644
> --- a/backends/iommufd.c
> +++ b/backends/iommufd.c
> @@ -208,6 +208,36 @@ int iommufd_backend_unmap_dma(IOMMUFDBackend *be, uint32_t ioas_id,
>       return ret;
>   }
>   
> +bool iommufd_backend_alloc_hwpt(IOMMUFDBackend *be, uint32_t dev_id,
> +                                uint32_t pt_id, uint32_t flags,
> +                                uint32_t data_type, uint32_t data_len,
> +                                void *data_ptr, uint32_t *out_hwpt,
> +                                Error **errp)
> +{
> +    int ret, fd = be->fd;
> +    struct iommu_hwpt_alloc alloc_hwpt = {
> +        .size = sizeof(struct iommu_hwpt_alloc),
> +        .flags = flags,
> +        .dev_id = dev_id,
> +        .pt_id = pt_id,
> +        .data_type = data_type,
> +        .data_len = data_len,
> +        .data_uptr = (uint64_t)data_ptr,
> +    };
> +
> +    ret = ioctl(fd, IOMMU_HWPT_ALLOC, &alloc_hwpt);
> +    trace_iommufd_backend_alloc_hwpt(fd, dev_id, pt_id, flags, data_type,
> +                                     data_len, (uint64_t)data_ptr,
> +                                     alloc_hwpt.out_hwpt_id, ret);
> +    if (ret) {
> +        error_setg_errno(errp, errno, "Failed to allocate hwpt");
> +        return false;
> +    }
> +
> +    *out_hwpt = alloc_hwpt.out_hwpt_id;
> +    return true;
> +}
> +
>   bool iommufd_backend_get_device_info(IOMMUFDBackend *be, uint32_t devid,
>                                        uint32_t *type, void *data, uint32_t len,
>                                        uint64_t *caps, Error **errp)
> diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
> index 077dea8f1b64..325c7598d5a1 100644
> --- a/hw/vfio/iommufd.c
> +++ b/hw/vfio/iommufd.c
> @@ -212,10 +212,86 @@ static bool iommufd_cdev_detach_ioas_hwpt(VFIODevice *vbasedev, Error **errp)
>       return true;
>   }
>   
> +static bool iommufd_cdev_autodomains_get(VFIODevice *vbasedev,
> +                                         VFIOIOMMUFDContainer *container,
> +                                         Error **errp)
> +{
> +    IOMMUFDBackend *iommufd = vbasedev->iommufd;
> +    uint32_t flags = 0;
> +    VFIOIOASHwpt *hwpt;
> +    uint32_t hwpt_id;
> +    int ret;
> +
> +    /* Try to find a domain */
> +    QLIST_FOREACH(hwpt, &container->hwpt_list, next) {
> +        ret = iommufd_cdev_attach_ioas_hwpt(vbasedev, hwpt->hwpt_id, errp);
> +        if (ret) {
> +            /* -EINVAL means the domain is incompatible with the device. */
> +            if (ret == -EINVAL) {
> +                /*
> +                 * It is an expected failure and it just means we will try
> +                 * another domain, or create one if no existing compatible
> +                 * domain is found. Hence why the error is discarded below.
> +                 */
> +                error_free(*errp);
> +                *errp = NULL;
> +                continue;
> +            }
> +
> +            return false;
> +        } else {
> +            vbasedev->hwpt = hwpt;
> +            QLIST_INSERT_HEAD(&hwpt->device_list, vbasedev, hwpt_next);
> +            return true;
> +        }
> +    }
> +
> +    if (!iommufd_backend_alloc_hwpt(iommufd, vbasedev->devid,
> +                                    container->ioas_id, flags,
> +                                    IOMMU_HWPT_DATA_NONE, 0, NULL,
> +                                    &hwpt_id, errp)) {
> +        return false;
> +    }
> +
> +    hwpt = g_malloc0(sizeof(*hwpt));
> +    hwpt->hwpt_id = hwpt_id;
> +    QLIST_INIT(&hwpt->device_list);
> +
> +    ret = iommufd_cdev_attach_ioas_hwpt(vbasedev, hwpt->hwpt_id, errp);
> +    if (ret) {
> +        iommufd_backend_free_id(container->be, hwpt->hwpt_id);
> +        g_free(hwpt);
> +        return false;
> +    }
> +
> +    vbasedev->hwpt = hwpt;
> +    QLIST_INSERT_HEAD(&hwpt->device_list, vbasedev, hwpt_next);
> +    QLIST_INSERT_HEAD(&container->hwpt_list, hwpt, next);
> +    return true;
> +}
> +
> +static void iommufd_cdev_autodomains_put(VFIODevice *vbasedev,
> +                                         VFIOIOMMUFDContainer *container)
> +{
> +    VFIOIOASHwpt *hwpt = vbasedev->hwpt;
> +
> +    QLIST_REMOVE(vbasedev, hwpt_next);
> +    if (QLIST_EMPTY(&hwpt->device_list)) {
> +        QLIST_REMOVE(hwpt, next);
> +        iommufd_backend_free_id(container->be, hwpt->hwpt_id);
> +        g_free(hwpt);
> +    }
> +}
> +
>   static bool iommufd_cdev_attach_container(VFIODevice *vbasedev,
>                                             VFIOIOMMUFDContainer *container,
>                                             Error **errp)
>   {
> +    /* mdevs aren't physical devices and will fail with auto domains */
> +    if (!vbasedev->mdev) {
> +        return iommufd_cdev_autodomains_get(vbasedev, container, errp);
> +    }
> +
>       return !iommufd_cdev_attach_ioas_hwpt(vbasedev, container->ioas_id, errp);
>   }
>   
> @@ -224,6 +300,11 @@ static void iommufd_cdev_detach_container(VFIODevice *vbasedev,
>   {
>       Error *err = NULL;
>   
> +    if (vbasedev->hwpt) {
> +        iommufd_cdev_autodomains_put(vbasedev, container);
> +        return;
> +    }
> +
>       if (!iommufd_cdev_detach_ioas_hwpt(vbasedev, &err)) {
>           error_report_err(err);
>       }
> @@ -354,6 +435,7 @@ static bool iommufd_cdev_attach(const char *name, VFIODevice *vbasedev,
>       container = VFIO_IOMMU_IOMMUFD(object_new(TYPE_VFIO_IOMMU_IOMMUFD));
>       container->be = vbasedev->iommufd;
>       container->ioas_id = ioas_id;
> +    QLIST_INIT(&container->hwpt_list);
>   
>       bcontainer = &container->bcontainer;
>       vfio_address_space_insert(space, bcontainer);
> diff --git a/backends/trace-events b/backends/trace-events
> index 211e6f374adc..4d8ac02fe7d6 100644
> --- a/backends/trace-events
> +++ b/backends/trace-events
> @@ -14,4 +14,5 @@ iommufd_backend_map_dma(int iommufd, uint32_t ioas, uint64_t iova, uint64_t size
>   iommufd_backend_unmap_dma_non_exist(int iommufd, uint32_t ioas, uint64_t iova, uint64_t size, int ret) " Unmap nonexistent mapping: iommufd=%d ioas=%d iova=0x%"PRIx64" size=0x%"PRIx64" (%d)"
>   iommufd_backend_unmap_dma(int iommufd, uint32_t ioas, uint64_t iova, uint64_t size, int ret) " iommufd=%d ioas=%d iova=0x%"PRIx64" size=0x%"PRIx64" (%d)"
>   iommufd_backend_alloc_ioas(int iommufd, uint32_t ioas) " iommufd=%d ioas=%d"
> +iommufd_backend_alloc_hwpt(int iommufd, uint32_t dev_id, uint32_t pt_id, uint32_t flags, uint32_t hwpt_type, uint32_t len, uint64_t data_ptr, uint32_t out_hwpt_id, int ret) " iommufd=%d dev_id=%u pt_id=%u flags=0x%x hwpt_type=%u len=%u data_ptr=0x%"PRIx64" out_hwpt=%u (%d)"
>   iommufd_backend_free_id(int iommufd, uint32_t id, int ret) " iommufd=%d id=%d (%d)"
Joao Martins July 16, 2024, 9:47 a.m. UTC | #2
On 16/07/2024 10:39, Cédric Le Goater wrote:
> On 7/12/24 13:46, Joao Martins wrote:
>> There's generally two modes of operation for IOMMUFD:
>>
>> * The simple user API which intends to perform relatively simple things
>> with IOMMUs e.g. DPDK. It generally creates an IOAS and attach to VFIO
>> and mainly performs IOAS_MAP and UNMAP.
>>
>> * The native IOMMUFD API where you have fine grained control of the
>> IOMMU domain and model it accordingly. This is where most new feature
>> are being steered to.
>>
>> For dirty tracking 2) 
> 
> I suppose 1) and 2) are the bullets above ?
> 
yeah

>> is required, as it needs to ensure that
>> the stage-2/parent IOMMU domain will only attach devices
>> that support dirty tracking (so far it is all homogeneous in x86, likely
>> not the case for smmuv3). Such invariant on dirty tracking provides a
>> useful guarantee to VMMs that will refuse incompatible device
>> attachments for IOMMU domains.
>>
>> Dirty tracking insurance is enforced via HWPT_ALLOC, which is
>> responsible for creating an IOMMU domain. This is contrast to the
>> 'simple API' where the IOMMU domain is created by IOMMUFD automatically
>> when it attaches to VFIO (usually referred as autodomains) but it has
>> the needed handling for mdevs.
>>
>> To support dirty tracking with the advanced IOMMUFD API, it needs
>> similar logic, where IOMMU domains are created and devices attached to
>> compatible domains. Essentially mimmicing kernel
> 
> mimmicing -> mimicking, I think.
> 

Ack

>> iommufd_device_auto_get_domain(). With mdevs given there's no IOMMU domain
>> it falls back to IOAS attach.
>>
>> The auto domain logic allows different IOMMU domains to be created when
>> DMA dirty tracking is not desired (and VF can provide it), and others where
>> it is. Here is not used in this way here given how VFIODevice migration
>> state is initialized after the device attachment. But such mixed mode of
>> IOMMU dirty tracking + device dirty tracking is an improvement that can
>> be added on. Keep the 'all of nothing' of type1 approach that we have
>> been using so far between container vs device dirty tracking.
>>
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> 
> Reviewed-by: Cédric Le Goater <clg@redhat.com>
> 
> This needs feedback from IOMMUFD experts also.

I take it that by IOMMUFD experts you the ones more familiar with Qemu side
(Zhenzhong and/or Yi)

	Joao
Cédric Le Goater July 16, 2024, 12:54 p.m. UTC | #3
On 7/12/24 13:46, Joao Martins wrote:
> There's generally two modes of operation for IOMMUFD:
> 
> * The simple user API which intends to perform relatively simple things
> with IOMMUs e.g. DPDK. It generally creates an IOAS and attach to VFIO
> and mainly performs IOAS_MAP and UNMAP.
> 
> * The native IOMMUFD API where you have fine grained control of the
> IOMMU domain and model it accordingly. This is where most new feature
> are being steered to.
> 
> For dirty tracking 2) is required, as it needs to ensure that
> the stage-2/parent IOMMU domain will only attach devices
> that support dirty tracking (so far it is all homogeneous in x86, likely
> not the case for smmuv3). Such invariant on dirty tracking provides a
> useful guarantee to VMMs that will refuse incompatible device
> attachments for IOMMU domains.
> 
> Dirty tracking insurance is enforced via HWPT_ALLOC, which is
> responsible for creating an IOMMU domain. This is contrast to the
> 'simple API' where the IOMMU domain is created by IOMMUFD automatically
> when it attaches to VFIO (usually referred as autodomains) but it has
> the needed handling for mdevs.
> 
> To support dirty tracking with the advanced IOMMUFD API, it needs
> similar logic, where IOMMU domains are created and devices attached to
> compatible domains. Essentially mimmicing kernel
> iommufd_device_auto_get_domain(). With mdevs given there's no IOMMU domain
> it falls back to IOAS attach.
> 
> The auto domain logic allows different IOMMU domains to be created when
> DMA dirty tracking is not desired (and VF can provide it), and others where
> it is. Here is not used in this way here given how VFIODevice migration
> state is initialized after the device attachment. But such mixed mode of
> IOMMU dirty tracking + device dirty tracking is an improvement that can
> be added on. Keep the 'all of nothing' of type1 approach that we have
> been using so far between container vs device dirty tracking.
> 
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
>   include/hw/vfio/vfio-common.h |  9 ++++
>   include/sysemu/iommufd.h      |  5 +++
>   backends/iommufd.c            | 30 +++++++++++++
>   hw/vfio/iommufd.c             | 82 +++++++++++++++++++++++++++++++++++
>   backends/trace-events         |  1 +
>   5 files changed, 127 insertions(+)
> 
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index 7419466bca92..2dd468ce3c02 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -95,10 +95,17 @@ typedef struct VFIOHostDMAWindow {
>   
>   typedef struct IOMMUFDBackend IOMMUFDBackend;
>   
> +typedef struct VFIOIOASHwpt {
> +    uint32_t hwpt_id;
> +    QLIST_HEAD(, VFIODevice) device_list;
> +    QLIST_ENTRY(VFIOIOASHwpt) next;
> +} VFIOIOASHwpt;
> +
>   typedef struct VFIOIOMMUFDContainer {
>       VFIOContainerBase bcontainer;
>       IOMMUFDBackend *be;
>       uint32_t ioas_id;
> +    QLIST_HEAD(, VFIOIOASHwpt) hwpt_list;
>   } VFIOIOMMUFDContainer;
>   
>   OBJECT_DECLARE_SIMPLE_TYPE(VFIOIOMMUFDContainer, VFIO_IOMMU_IOMMUFD);
> @@ -135,6 +142,8 @@ typedef struct VFIODevice {
>       HostIOMMUDevice *hiod;
>       int devid;
>       IOMMUFDBackend *iommufd;
> +    VFIOIOASHwpt *hwpt;
> +    QLIST_ENTRY(VFIODevice) hwpt_next;
>   } VFIODevice;
>   
>   struct VFIODeviceOps {
> diff --git a/include/sysemu/iommufd.h b/include/sysemu/iommufd.h
> index 57d502a1c79a..e917e7591d05 100644
> --- a/include/sysemu/iommufd.h
> +++ b/include/sysemu/iommufd.h
> @@ -50,6 +50,11 @@ int iommufd_backend_unmap_dma(IOMMUFDBackend *be, uint32_t ioas_id,
>   bool iommufd_backend_get_device_info(IOMMUFDBackend *be, uint32_t devid,
>                                        uint32_t *type, void *data, uint32_t len,
>                                        uint64_t *caps, Error **errp);
> +bool iommufd_backend_alloc_hwpt(IOMMUFDBackend *be, uint32_t dev_id,
> +                                uint32_t pt_id, uint32_t flags,
> +                                uint32_t data_type, uint32_t data_len,
> +                                void *data_ptr, uint32_t *out_hwpt,
> +                                Error **errp);
>   
>   #define TYPE_HOST_IOMMU_DEVICE_IOMMUFD TYPE_HOST_IOMMU_DEVICE "-iommufd"
>   #endif
> diff --git a/backends/iommufd.c b/backends/iommufd.c
> index 2b3d51af26d2..5d3dfa917415 100644
> --- a/backends/iommufd.c
> +++ b/backends/iommufd.c
> @@ -208,6 +208,36 @@ int iommufd_backend_unmap_dma(IOMMUFDBackend *be, uint32_t ioas_id,
>       return ret;
>   }
>   
> +bool iommufd_backend_alloc_hwpt(IOMMUFDBackend *be, uint32_t dev_id,
> +                                uint32_t pt_id, uint32_t flags,
> +                                uint32_t data_type, uint32_t data_len,
> +                                void *data_ptr, uint32_t *out_hwpt,
> +                                Error **errp)
> +{
> +    int ret, fd = be->fd;
> +    struct iommu_hwpt_alloc alloc_hwpt = {
> +        .size = sizeof(struct iommu_hwpt_alloc),
> +        .flags = flags,
> +        .dev_id = dev_id,
> +        .pt_id = pt_id,
> +        .data_type = data_type,
> +        .data_len = data_len,
> +        .data_uptr = (uint64_t)data_ptr,

The type cast should be (uintptr_t)

> +    };
> +
> +    ret = ioctl(fd, IOMMU_HWPT_ALLOC, &alloc_hwpt);
> +    trace_iommufd_backend_alloc_hwpt(fd, dev_id, pt_id, flags, data_type,
> +                                     data_len, (uint64_t)data_ptr,

same here.


Thanks,

C.



> +                                     alloc_hwpt.out_hwpt_id, ret);
> +    if (ret) {
> +        error_setg_errno(errp, errno, "Failed to allocate hwpt");
> +        return false;
> +    }
> +
> +    *out_hwpt = alloc_hwpt.out_hwpt_id;
> +    return true;
> +}
> +
>   bool iommufd_backend_get_device_info(IOMMUFDBackend *be, uint32_t devid,
>                                        uint32_t *type, void *data, uint32_t len,
>                                        uint64_t *caps, Error **errp)
> diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
> index 077dea8f1b64..325c7598d5a1 100644
> --- a/hw/vfio/iommufd.c
> +++ b/hw/vfio/iommufd.c
> @@ -212,10 +212,86 @@ static bool iommufd_cdev_detach_ioas_hwpt(VFIODevice *vbasedev, Error **errp)
>       return true;
>   }
>   
> +static bool iommufd_cdev_autodomains_get(VFIODevice *vbasedev,
> +                                         VFIOIOMMUFDContainer *container,
> +                                         Error **errp)
> +{
> +    IOMMUFDBackend *iommufd = vbasedev->iommufd;
> +    uint32_t flags = 0;
> +    VFIOIOASHwpt *hwpt;
> +    uint32_t hwpt_id;
> +    int ret;
> +
> +    /* Try to find a domain */
> +    QLIST_FOREACH(hwpt, &container->hwpt_list, next) {
> +        ret = iommufd_cdev_attach_ioas_hwpt(vbasedev, hwpt->hwpt_id, errp);
> +        if (ret) {
> +            /* -EINVAL means the domain is incompatible with the device. */
> +            if (ret == -EINVAL) {
> +                /*
> +                 * It is an expected failure and it just means we will try
> +                 * another domain, or create one if no existing compatible
> +                 * domain is found. Hence why the error is discarded below.
> +                 */
> +                error_free(*errp);
> +                *errp = NULL;
> +                continue;
> +            }
> +
> +            return false;
> +        } else {
> +            vbasedev->hwpt = hwpt;
> +            QLIST_INSERT_HEAD(&hwpt->device_list, vbasedev, hwpt_next);
> +            return true;
> +        }
> +    }
> +
> +    if (!iommufd_backend_alloc_hwpt(iommufd, vbasedev->devid,
> +                                    container->ioas_id, flags,
> +                                    IOMMU_HWPT_DATA_NONE, 0, NULL,
> +                                    &hwpt_id, errp)) {
> +        return false;
> +    }
> +
> +    hwpt = g_malloc0(sizeof(*hwpt));
> +    hwpt->hwpt_id = hwpt_id;
> +    QLIST_INIT(&hwpt->device_list);
> +
> +    ret = iommufd_cdev_attach_ioas_hwpt(vbasedev, hwpt->hwpt_id, errp);
> +    if (ret) {
> +        iommufd_backend_free_id(container->be, hwpt->hwpt_id);
> +        g_free(hwpt);
> +        return false;
> +    }
> +
> +    vbasedev->hwpt = hwpt;
> +    QLIST_INSERT_HEAD(&hwpt->device_list, vbasedev, hwpt_next);
> +    QLIST_INSERT_HEAD(&container->hwpt_list, hwpt, next);
> +    return true;
> +}
> +
> +static void iommufd_cdev_autodomains_put(VFIODevice *vbasedev,
> +                                         VFIOIOMMUFDContainer *container)
> +{
> +    VFIOIOASHwpt *hwpt = vbasedev->hwpt;
> +
> +    QLIST_REMOVE(vbasedev, hwpt_next);
> +    if (QLIST_EMPTY(&hwpt->device_list)) {
> +        QLIST_REMOVE(hwpt, next);
> +        iommufd_backend_free_id(container->be, hwpt->hwpt_id);
> +        g_free(hwpt);
> +    }
> +}
> +
>   static bool iommufd_cdev_attach_container(VFIODevice *vbasedev,
>                                             VFIOIOMMUFDContainer *container,
>                                             Error **errp)
>   {
> +    /* mdevs aren't physical devices and will fail with auto domains */
> +    if (!vbasedev->mdev) {
> +        return iommufd_cdev_autodomains_get(vbasedev, container, errp);
> +    }
> +
>       return !iommufd_cdev_attach_ioas_hwpt(vbasedev, container->ioas_id, errp);
>   }
>   
> @@ -224,6 +300,11 @@ static void iommufd_cdev_detach_container(VFIODevice *vbasedev,
>   {
>       Error *err = NULL;
>   
> +    if (vbasedev->hwpt) {
> +        iommufd_cdev_autodomains_put(vbasedev, container);
> +        return;
> +    }
> +
>       if (!iommufd_cdev_detach_ioas_hwpt(vbasedev, &err)) {
>           error_report_err(err);
>       }
> @@ -354,6 +435,7 @@ static bool iommufd_cdev_attach(const char *name, VFIODevice *vbasedev,
>       container = VFIO_IOMMU_IOMMUFD(object_new(TYPE_VFIO_IOMMU_IOMMUFD));
>       container->be = vbasedev->iommufd;
>       container->ioas_id = ioas_id;
> +    QLIST_INIT(&container->hwpt_list);
>   
>       bcontainer = &container->bcontainer;
>       vfio_address_space_insert(space, bcontainer);
> diff --git a/backends/trace-events b/backends/trace-events
> index 211e6f374adc..4d8ac02fe7d6 100644
> --- a/backends/trace-events
> +++ b/backends/trace-events
> @@ -14,4 +14,5 @@ iommufd_backend_map_dma(int iommufd, uint32_t ioas, uint64_t iova, uint64_t size
>   iommufd_backend_unmap_dma_non_exist(int iommufd, uint32_t ioas, uint64_t iova, uint64_t size, int ret) " Unmap nonexistent mapping: iommufd=%d ioas=%d iova=0x%"PRIx64" size=0x%"PRIx64" (%d)"
>   iommufd_backend_unmap_dma(int iommufd, uint32_t ioas, uint64_t iova, uint64_t size, int ret) " iommufd=%d ioas=%d iova=0x%"PRIx64" size=0x%"PRIx64" (%d)"
>   iommufd_backend_alloc_ioas(int iommufd, uint32_t ioas) " iommufd=%d ioas=%d"
> +iommufd_backend_alloc_hwpt(int iommufd, uint32_t dev_id, uint32_t pt_id, uint32_t flags, uint32_t hwpt_type, uint32_t len, uint64_t data_ptr, uint32_t out_hwpt_id, int ret) " iommufd=%d dev_id=%u pt_id=%u flags=0x%x hwpt_type=%u len=%u data_ptr=0x%"PRIx64" out_hwpt=%u (%d)"
>   iommufd_backend_free_id(int iommufd, uint32_t id, int ret) " iommufd=%d id=%d (%d)"
Eric Auger July 16, 2024, 4:04 p.m. UTC | #4
Hi Joao,

On 7/12/24 13:46, Joao Martins wrote:
> There's generally two modes of operation for IOMMUFD:
>
> * The simple user API which intends to perform relatively simple things
> with IOMMUs e.g. DPDK. It generally creates an IOAS and attach to VFIO

It generally creates? can you explicit what is "it"

I am confused by this automatic terminology again (not your fault). the doc says:
"

  *

    Automatic domain - refers to an iommu domain created automatically
    when attaching a device to an IOAS object. This is compatible to the
    semantics of VFIO type1.

  *

    Manual domain - refers to an iommu domain designated by the user as
    the target pagetable to be attached to by a device. Though currently
    there are no uAPIs to directly create such domain, the datastructure
    and algorithms are ready for handling that use case.

"


in 1) the device is attached to the ioas id (using the auto domain if I am not wrong)
Here you attach to an hwpt id. Isn't it a manual domain?

> and mainly performs IOAS_MAP and UNMAP.
>
> * The native IOMMUFD API where you have fine grained control of the
> IOMMU domain and model it accordingly. This is where most new feature
> are being steered to.
>
> For dirty tracking 2) is required, as it needs to ensure that
> the stage-2/parent IOMMU domain will only attach devices
> that support dirty tracking (so far it is all homogeneous in x86, likely
> not the case for smmuv3). Such invariant on dirty tracking provides a
> useful guarantee to VMMs that will refuse incompatible device
> attachments for IOMMU domains.
>
> Dirty tracking insurance is enforced via HWPT_ALLOC, which is
> responsible for creating an IOMMU domain. This is contrast to the
> 'simple API' where the IOMMU domain is created by IOMMUFD automatically
> when it attaches to VFIO (usually referred as autodomains) but it has
> the needed handling for mdevs.
>
> To support dirty tracking with the advanced IOMMUFD API, it needs
> similar logic, where IOMMU domains are created and devices attached to
> compatible domains. Essentially mimmicing kernel
> iommufd_device_auto_get_domain(). With mdevs given there's no IOMMU domain
> it falls back to IOAS attach.
>
> The auto domain logic allows different IOMMU domains to be created when
> DMA dirty tracking is not desired (and VF can provide it), and others where
> it is. Here is not used in this way here given how VFIODevice migration

Here is not used in this way here ?

> state is initialized after the device attachment. But such mixed mode of
> IOMMU dirty tracking + device dirty tracking is an improvement that can
> be added on. Keep the 'all of nothing' of type1 approach that we have
> been using so far between container vs device dirty tracking.
>
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
> ---
>  include/hw/vfio/vfio-common.h |  9 ++++
>  include/sysemu/iommufd.h      |  5 +++
>  backends/iommufd.c            | 30 +++++++++++++
>  hw/vfio/iommufd.c             | 82 +++++++++++++++++++++++++++++++++++
>  backends/trace-events         |  1 +
>  5 files changed, 127 insertions(+)
>
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index 7419466bca92..2dd468ce3c02 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -95,10 +95,17 @@ typedef struct VFIOHostDMAWindow {
>  
>  typedef struct IOMMUFDBackend IOMMUFDBackend;
>  
> +typedef struct VFIOIOASHwpt {
> +    uint32_t hwpt_id;
> +    QLIST_HEAD(, VFIODevice) device_list;
> +    QLIST_ENTRY(VFIOIOASHwpt) next;
> +} VFIOIOASHwpt;
> +
>  typedef struct VFIOIOMMUFDContainer {
>      VFIOContainerBase bcontainer;
>      IOMMUFDBackend *be;
>      uint32_t ioas_id;
> +    QLIST_HEAD(, VFIOIOASHwpt) hwpt_list;
>  } VFIOIOMMUFDContainer;
>  
>  OBJECT_DECLARE_SIMPLE_TYPE(VFIOIOMMUFDContainer, VFIO_IOMMU_IOMMUFD);
> @@ -135,6 +142,8 @@ typedef struct VFIODevice {
>      HostIOMMUDevice *hiod;
>      int devid;
>      IOMMUFDBackend *iommufd;
> +    VFIOIOASHwpt *hwpt;
> +    QLIST_ENTRY(VFIODevice) hwpt_next;
>  } VFIODevice;
>  
>  struct VFIODeviceOps {
> diff --git a/include/sysemu/iommufd.h b/include/sysemu/iommufd.h
> index 57d502a1c79a..e917e7591d05 100644
> --- a/include/sysemu/iommufd.h
> +++ b/include/sysemu/iommufd.h
> @@ -50,6 +50,11 @@ int iommufd_backend_unmap_dma(IOMMUFDBackend *be, uint32_t ioas_id,
>  bool iommufd_backend_get_device_info(IOMMUFDBackend *be, uint32_t devid,
>                                       uint32_t *type, void *data, uint32_t len,
>                                       uint64_t *caps, Error **errp);
> +bool iommufd_backend_alloc_hwpt(IOMMUFDBackend *be, uint32_t dev_id,
> +                                uint32_t pt_id, uint32_t flags,
> +                                uint32_t data_type, uint32_t data_len,
> +                                void *data_ptr, uint32_t *out_hwpt,
> +                                Error **errp);
>  
>  #define TYPE_HOST_IOMMU_DEVICE_IOMMUFD TYPE_HOST_IOMMU_DEVICE "-iommufd"
>  #endif
> diff --git a/backends/iommufd.c b/backends/iommufd.c
> index 2b3d51af26d2..5d3dfa917415 100644
> --- a/backends/iommufd.c
> +++ b/backends/iommufd.c
> @@ -208,6 +208,36 @@ int iommufd_backend_unmap_dma(IOMMUFDBackend *be, uint32_t ioas_id,
>      return ret;
>  }
>  
> +bool iommufd_backend_alloc_hwpt(IOMMUFDBackend *be, uint32_t dev_id,
> +                                uint32_t pt_id, uint32_t flags,
> +                                uint32_t data_type, uint32_t data_len,
> +                                void *data_ptr, uint32_t *out_hwpt,
> +                                Error **errp)
> +{
> +    int ret, fd = be->fd;
> +    struct iommu_hwpt_alloc alloc_hwpt = {
> +        .size = sizeof(struct iommu_hwpt_alloc),
> +        .flags = flags,
> +        .dev_id = dev_id,
> +        .pt_id = pt_id,
> +        .data_type = data_type,
> +        .data_len = data_len,
> +        .data_uptr = (uint64_t)data_ptr,
> +    };
> +
> +    ret = ioctl(fd, IOMMU_HWPT_ALLOC, &alloc_hwpt);
> +    trace_iommufd_backend_alloc_hwpt(fd, dev_id, pt_id, flags, data_type,
> +                                     data_len, (uint64_t)data_ptr,
> +                                     alloc_hwpt.out_hwpt_id, ret);
> +    if (ret) {
> +        error_setg_errno(errp, errno, "Failed to allocate hwpt");
> +        return false;
> +    }
> +
> +    *out_hwpt = alloc_hwpt.out_hwpt_id;
> +    return true;
> +}
> +
>  bool iommufd_backend_get_device_info(IOMMUFDBackend *be, uint32_t devid,
>                                       uint32_t *type, void *data, uint32_t len,
>                                       uint64_t *caps, Error **errp)
> diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
> index 077dea8f1b64..325c7598d5a1 100644
> --- a/hw/vfio/iommufd.c
> +++ b/hw/vfio/iommufd.c
> @@ -212,10 +212,86 @@ static bool iommufd_cdev_detach_ioas_hwpt(VFIODevice *vbasedev, Error **errp)
>      return true;
>  }
>  
> +static bool iommufd_cdev_autodomains_get(VFIODevice *vbasedev,
> +                                         VFIOIOMMUFDContainer *container,
> +                                         Error **errp)
> +{
> +    IOMMUFDBackend *iommufd = vbasedev->iommufd;
> +    uint32_t flags = 0;
> +    VFIOIOASHwpt *hwpt;
> +    uint32_t hwpt_id;
> +    int ret;
> +
> +    /* Try to find a domain */
> +    QLIST_FOREACH(hwpt, &container->hwpt_list, next) {
> +        ret = iommufd_cdev_attach_ioas_hwpt(vbasedev, hwpt->hwpt_id, errp);
> +        if (ret) {
> +            /* -EINVAL means the domain is incompatible with the device. */
> +            if (ret == -EINVAL) {
> +                /*
> +                 * It is an expected failure and it just means we will try
> +                 * another domain, or create one if no existing compatible
> +                 * domain is found. Hence why the error is discarded below.
> +                 */
> +                error_free(*errp);
> +                *errp = NULL;
> +                continue;
> +            }
> +
> +            return false;
> +        } else {
> +            vbasedev->hwpt = hwpt;
> +            QLIST_INSERT_HEAD(&hwpt->device_list, vbasedev, hwpt_next);
> +            return true;
> +        }
> +    }
> +
> +    if (!iommufd_backend_alloc_hwpt(iommufd, vbasedev->devid,
> +                                    container->ioas_id, flags,
> +                                    IOMMU_HWPT_DATA_NONE, 0, NULL,
> +                                    &hwpt_id, errp)) {
> +        return false;
> +    }
> +
> +    hwpt = g_malloc0(sizeof(*hwpt));
> +    hwpt->hwpt_id = hwpt_id;
> +    QLIST_INIT(&hwpt->device_list);
> +
> +    ret = iommufd_cdev_attach_ioas_hwpt(vbasedev, hwpt->hwpt_id, errp);
> +    if (ret) {
> +        iommufd_backend_free_id(container->be, hwpt->hwpt_id);
> +        g_free(hwpt);
> +        return false;
> +    }
> +
> +    vbasedev->hwpt = hwpt;
> +    QLIST_INSERT_HEAD(&hwpt->device_list, vbasedev, hwpt_next);
> +    QLIST_INSERT_HEAD(&container->hwpt_list, hwpt, next);
> +    return true;
> +}
> +
> +static void iommufd_cdev_autodomains_put(VFIODevice *vbasedev,
> +                                         VFIOIOMMUFDContainer *container)
> +{
> +    VFIOIOASHwpt *hwpt = vbasedev->hwpt;
> +
> +    QLIST_REMOVE(vbasedev, hwpt_next);
don't you want to reset vbasedev->hwpt = NULL too?


> +    if (QLIST_EMPTY(&hwpt->device_list)) {
> +        QLIST_REMOVE(hwpt, next);
> +        iommufd_backend_free_id(container->be, hwpt->hwpt_id);
> +        g_free(hwpt);
> +    }
> +}
> +
>  static bool iommufd_cdev_attach_container(VFIODevice *vbasedev,
>                                            VFIOIOMMUFDContainer *container,
>                                            Error **errp)
>  {
> +    /* mdevs aren't physical devices and will fail with auto domains */
> +    if (!vbasedev->mdev) {
> +        return iommufd_cdev_autodomains_get(vbasedev, container, errp);
> +    }
> +
>      return !iommufd_cdev_attach_ioas_hwpt(vbasedev, container->ioas_id, errp);
>  }
>  
> @@ -224,6 +300,11 @@ static void iommufd_cdev_detach_container(VFIODevice *vbasedev,
>  {
>      Error *err = NULL;
>  
> +    if (vbasedev->hwpt) {
> +        iommufd_cdev_autodomains_put(vbasedev, container);
> +        return;
Where do we detach the device from the hwpt?

Thanks

Eric
> +    }
> +
>      if (!iommufd_cdev_detach_ioas_hwpt(vbasedev, &err)) {
>          error_report_err(err);
>      }
> @@ -354,6 +435,7 @@ static bool iommufd_cdev_attach(const char *name, VFIODevice *vbasedev,
>      container = VFIO_IOMMU_IOMMUFD(object_new(TYPE_VFIO_IOMMU_IOMMUFD));
>      container->be = vbasedev->iommufd;
>      container->ioas_id = ioas_id;
> +    QLIST_INIT(&container->hwpt_list);
>  
>      bcontainer = &container->bcontainer;
>      vfio_address_space_insert(space, bcontainer);
> diff --git a/backends/trace-events b/backends/trace-events
> index 211e6f374adc..4d8ac02fe7d6 100644
> --- a/backends/trace-events
> +++ b/backends/trace-events
> @@ -14,4 +14,5 @@ iommufd_backend_map_dma(int iommufd, uint32_t ioas, uint64_t iova, uint64_t size
>  iommufd_backend_unmap_dma_non_exist(int iommufd, uint32_t ioas, uint64_t iova, uint64_t size, int ret) " Unmap nonexistent mapping: iommufd=%d ioas=%d iova=0x%"PRIx64" size=0x%"PRIx64" (%d)"
>  iommufd_backend_unmap_dma(int iommufd, uint32_t ioas, uint64_t iova, uint64_t size, int ret) " iommufd=%d ioas=%d iova=0x%"PRIx64" size=0x%"PRIx64" (%d)"
>  iommufd_backend_alloc_ioas(int iommufd, uint32_t ioas) " iommufd=%d ioas=%d"
> +iommufd_backend_alloc_hwpt(int iommufd, uint32_t dev_id, uint32_t pt_id, uint32_t flags, uint32_t hwpt_type, uint32_t len, uint64_t data_ptr, uint32_t out_hwpt_id, int ret) " iommufd=%d dev_id=%u pt_id=%u flags=0x%x hwpt_type=%u len=%u data_ptr=0x%"PRIx64" out_hwpt=%u (%d)"
>  iommufd_backend_free_id(int iommufd, uint32_t id, int ret) " iommufd=%d id=%d (%d)"
Joao Martins July 16, 2024, 4:44 p.m. UTC | #5
On 16/07/2024 17:04, Eric Auger wrote:
> Hi Joao,
> 
> On 7/12/24 13:46, Joao Martins wrote:
>> There's generally two modes of operation for IOMMUFD:
>>
>> * The simple user API which intends to perform relatively simple things
>> with IOMMUs e.g. DPDK. It generally creates an IOAS and attach to VFIO
> 
> It generally creates? can you explicit what is "it"
> 
'It' here refers to the process/API-user

> I am confused by this automatic terminology again (not your fault). the doc says:
> "
> 
>   *
> 
>     Automatic domain - refers to an iommu domain created automatically
>     when attaching a device to an IOAS object. This is compatible to the
>     semantics of VFIO type1.
> 
>   *
> 
>     Manual domain - refers to an iommu domain designated by the user as
>     the target pagetable to be attached to by a device. Though currently
>     there are no uAPIs to directly create such domain, the datastructure
>     and algorithms are ready for handling that use case.
> 
> "
> 
> 
> in 1) the device is attached to the ioas id (using the auto domain if I am not wrong)
> Here you attach to an hwpt id. Isn't it a manual domain?
>

Correct.

The 'auto domains' generally refers to the kernel-equivalent own automatic
attaching to a new pagetable.

Here I call 'auto domains' in the userspace version too because we are doing the
exact same but from userspace, using the manual API in IOMMUFD.

>> and mainly performs IOAS_MAP and UNMAP.
>>
>> * The native IOMMUFD API where you have fine grained control of the
>> IOMMU domain and model it accordingly. This is where most new feature
>> are being steered to.
>>
>> For dirty tracking 2) is required, as it needs to ensure that
>> the stage-2/parent IOMMU domain will only attach devices
>> that support dirty tracking (so far it is all homogeneous in x86, likely
>> not the case for smmuv3). Such invariant on dirty tracking provides a
>> useful guarantee to VMMs that will refuse incompatible device
>> attachments for IOMMU domains.
>>
>> Dirty tracking insurance is enforced via HWPT_ALLOC, which is
>> responsible for creating an IOMMU domain. This is contrast to the
>> 'simple API' where the IOMMU domain is created by IOMMUFD automatically
>> when it attaches to VFIO (usually referred as autodomains) but it has
>> the needed handling for mdevs.
>>
>> To support dirty tracking with the advanced IOMMUFD API, it needs
>> similar logic, where IOMMU domains are created and devices attached to
>> compatible domains. Essentially mimmicing kernel
>> iommufd_device_auto_get_domain(). With mdevs given there's no IOMMU domain
>> it falls back to IOAS attach.
>>
>> The auto domain logic allows different IOMMU domains to be created when
>> DMA dirty tracking is not desired (and VF can provide it), and others where
>> it is. Here is not used in this way here given how VFIODevice migration
> 
> Here is not used in this way here ?
> 

I meant, 'Here it is not used in this way given (...)'

>> state is initialized after the device attachment. But such mixed mode of
>> IOMMU dirty tracking + device dirty tracking is an improvement that can
>> be added on. Keep the 'all of nothing' of type1 approach that we have
>> been using so far between container vs device dirty tracking.
>>
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>> ---
>>  include/hw/vfio/vfio-common.h |  9 ++++
>>  include/sysemu/iommufd.h      |  5 +++
>>  backends/iommufd.c            | 30 +++++++++++++
>>  hw/vfio/iommufd.c             | 82 +++++++++++++++++++++++++++++++++++
>>  backends/trace-events         |  1 +
>>  5 files changed, 127 insertions(+)
>>
>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
>> index 7419466bca92..2dd468ce3c02 100644
>> --- a/include/hw/vfio/vfio-common.h
>> +++ b/include/hw/vfio/vfio-common.h
>> @@ -95,10 +95,17 @@ typedef struct VFIOHostDMAWindow {
>>  
>>  typedef struct IOMMUFDBackend IOMMUFDBackend;
>>  
>> +typedef struct VFIOIOASHwpt {
>> +    uint32_t hwpt_id;
>> +    QLIST_HEAD(, VFIODevice) device_list;
>> +    QLIST_ENTRY(VFIOIOASHwpt) next;
>> +} VFIOIOASHwpt;
>> +
>>  typedef struct VFIOIOMMUFDContainer {
>>      VFIOContainerBase bcontainer;
>>      IOMMUFDBackend *be;
>>      uint32_t ioas_id;
>> +    QLIST_HEAD(, VFIOIOASHwpt) hwpt_list;
>>  } VFIOIOMMUFDContainer;
>>  
>>  OBJECT_DECLARE_SIMPLE_TYPE(VFIOIOMMUFDContainer, VFIO_IOMMU_IOMMUFD);
>> @@ -135,6 +142,8 @@ typedef struct VFIODevice {
>>      HostIOMMUDevice *hiod;
>>      int devid;
>>      IOMMUFDBackend *iommufd;
>> +    VFIOIOASHwpt *hwpt;
>> +    QLIST_ENTRY(VFIODevice) hwpt_next;
>>  } VFIODevice;
>>  
>>  struct VFIODeviceOps {
>> diff --git a/include/sysemu/iommufd.h b/include/sysemu/iommufd.h
>> index 57d502a1c79a..e917e7591d05 100644
>> --- a/include/sysemu/iommufd.h
>> +++ b/include/sysemu/iommufd.h
>> @@ -50,6 +50,11 @@ int iommufd_backend_unmap_dma(IOMMUFDBackend *be, uint32_t ioas_id,
>>  bool iommufd_backend_get_device_info(IOMMUFDBackend *be, uint32_t devid,
>>                                       uint32_t *type, void *data, uint32_t len,
>>                                       uint64_t *caps, Error **errp);
>> +bool iommufd_backend_alloc_hwpt(IOMMUFDBackend *be, uint32_t dev_id,
>> +                                uint32_t pt_id, uint32_t flags,
>> +                                uint32_t data_type, uint32_t data_len,
>> +                                void *data_ptr, uint32_t *out_hwpt,
>> +                                Error **errp);
>>  
>>  #define TYPE_HOST_IOMMU_DEVICE_IOMMUFD TYPE_HOST_IOMMU_DEVICE "-iommufd"
>>  #endif
>> diff --git a/backends/iommufd.c b/backends/iommufd.c
>> index 2b3d51af26d2..5d3dfa917415 100644
>> --- a/backends/iommufd.c
>> +++ b/backends/iommufd.c
>> @@ -208,6 +208,36 @@ int iommufd_backend_unmap_dma(IOMMUFDBackend *be, uint32_t ioas_id,
>>      return ret;
>>  }
>>  
>> +bool iommufd_backend_alloc_hwpt(IOMMUFDBackend *be, uint32_t dev_id,
>> +                                uint32_t pt_id, uint32_t flags,
>> +                                uint32_t data_type, uint32_t data_len,
>> +                                void *data_ptr, uint32_t *out_hwpt,
>> +                                Error **errp)
>> +{
>> +    int ret, fd = be->fd;
>> +    struct iommu_hwpt_alloc alloc_hwpt = {
>> +        .size = sizeof(struct iommu_hwpt_alloc),
>> +        .flags = flags,
>> +        .dev_id = dev_id,
>> +        .pt_id = pt_id,
>> +        .data_type = data_type,
>> +        .data_len = data_len,
>> +        .data_uptr = (uint64_t)data_ptr,
>> +    };
>> +
>> +    ret = ioctl(fd, IOMMU_HWPT_ALLOC, &alloc_hwpt);
>> +    trace_iommufd_backend_alloc_hwpt(fd, dev_id, pt_id, flags, data_type,
>> +                                     data_len, (uint64_t)data_ptr,
>> +                                     alloc_hwpt.out_hwpt_id, ret);
>> +    if (ret) {
>> +        error_setg_errno(errp, errno, "Failed to allocate hwpt");
>> +        return false;
>> +    }
>> +
>> +    *out_hwpt = alloc_hwpt.out_hwpt_id;
>> +    return true;
>> +}
>> +
>>  bool iommufd_backend_get_device_info(IOMMUFDBackend *be, uint32_t devid,
>>                                       uint32_t *type, void *data, uint32_t len,
>>                                       uint64_t *caps, Error **errp)
>> diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
>> index 077dea8f1b64..325c7598d5a1 100644
>> --- a/hw/vfio/iommufd.c
>> +++ b/hw/vfio/iommufd.c
>> @@ -212,10 +212,86 @@ static bool iommufd_cdev_detach_ioas_hwpt(VFIODevice *vbasedev, Error **errp)
>>      return true;
>>  }
>>  
>> +static bool iommufd_cdev_autodomains_get(VFIODevice *vbasedev,
>> +                                         VFIOIOMMUFDContainer *container,
>> +                                         Error **errp)
>> +{
>> +    IOMMUFDBackend *iommufd = vbasedev->iommufd;
>> +    uint32_t flags = 0;
>> +    VFIOIOASHwpt *hwpt;
>> +    uint32_t hwpt_id;
>> +    int ret;
>> +
>> +    /* Try to find a domain */
>> +    QLIST_FOREACH(hwpt, &container->hwpt_list, next) {
>> +        ret = iommufd_cdev_attach_ioas_hwpt(vbasedev, hwpt->hwpt_id, errp);
>> +        if (ret) {
>> +            /* -EINVAL means the domain is incompatible with the device. */
>> +            if (ret == -EINVAL) {
>> +                /*
>> +                 * It is an expected failure and it just means we will try
>> +                 * another domain, or create one if no existing compatible
>> +                 * domain is found. Hence why the error is discarded below.
>> +                 */
>> +                error_free(*errp);
>> +                *errp = NULL;
>> +                continue;
>> +            }
>> +
>> +            return false;
>> +        } else {
>> +            vbasedev->hwpt = hwpt;
>> +            QLIST_INSERT_HEAD(&hwpt->device_list, vbasedev, hwpt_next);
>> +            return true;
>> +        }
>> +    }
>> +
>> +    if (!iommufd_backend_alloc_hwpt(iommufd, vbasedev->devid,
>> +                                    container->ioas_id, flags,
>> +                                    IOMMU_HWPT_DATA_NONE, 0, NULL,
>> +                                    &hwpt_id, errp)) {
>> +        return false;
>> +    }
>> +
>> +    hwpt = g_malloc0(sizeof(*hwpt));
>> +    hwpt->hwpt_id = hwpt_id;
>> +    QLIST_INIT(&hwpt->device_list);
>> +
>> +    ret = iommufd_cdev_attach_ioas_hwpt(vbasedev, hwpt->hwpt_id, errp);
>> +    if (ret) {
>> +        iommufd_backend_free_id(container->be, hwpt->hwpt_id);
>> +        g_free(hwpt);
>> +        return false;
>> +    }
>> +
>> +    vbasedev->hwpt = hwpt;
>> +    QLIST_INSERT_HEAD(&hwpt->device_list, vbasedev, hwpt_next);
>> +    QLIST_INSERT_HEAD(&container->hwpt_list, hwpt, next);
>> +    return true;
>> +}
>> +
>> +static void iommufd_cdev_autodomains_put(VFIODevice *vbasedev,
>> +                                         VFIOIOMMUFDContainer *container)
>> +{
>> +    VFIOIOASHwpt *hwpt = vbasedev->hwpt;
>> +
>> +    QLIST_REMOVE(vbasedev, hwpt_next);
> don't you want to reset vbasedev->hwpt = NULL too?
> 
Yeap, Thanks for catching that

> 
>> +    if (QLIST_EMPTY(&hwpt->device_list)) {
>> +        QLIST_REMOVE(hwpt, next);
>> +        iommufd_backend_free_id(container->be, hwpt->hwpt_id);
>> +        g_free(hwpt);
>> +    }
>> +}
>> +
>>  static bool iommufd_cdev_attach_container(VFIODevice *vbasedev,
>>                                            VFIOIOMMUFDContainer *container,
>>                                            Error **errp)
>>  {
>> +    /* mdevs aren't physical devices and will fail with auto domains */
>> +    if (!vbasedev->mdev) {
>> +        return iommufd_cdev_autodomains_get(vbasedev, container, errp);
>> +    }
>> +
>>      return !iommufd_cdev_attach_ioas_hwpt(vbasedev, container->ioas_id, errp);
>>  }
>>  
>> @@ -224,6 +300,11 @@ static void iommufd_cdev_detach_container(VFIODevice *vbasedev,
>>  {
>>      Error *err = NULL;
>>  
>> +    if (vbasedev->hwpt) {
>> +        iommufd_cdev_autodomains_put(vbasedev, container);
>> +        return;
> Where do we detach the device from the hwpt?
> 
In iommufd_backend_free_id() for auto domains

> Thanks
> 
> Eric
>> +    }
>> +
>>      if (!iommufd_cdev_detach_ioas_hwpt(vbasedev, &err)) {
>>          error_report_err(err);
>>      }
>> @@ -354,6 +435,7 @@ static bool iommufd_cdev_attach(const char *name, VFIODevice *vbasedev,
>>      container = VFIO_IOMMU_IOMMUFD(object_new(TYPE_VFIO_IOMMU_IOMMUFD));
>>      container->be = vbasedev->iommufd;
>>      container->ioas_id = ioas_id;
>> +    QLIST_INIT(&container->hwpt_list);
>>  
>>      bcontainer = &container->bcontainer;
>>      vfio_address_space_insert(space, bcontainer);
>> diff --git a/backends/trace-events b/backends/trace-events
>> index 211e6f374adc..4d8ac02fe7d6 100644
>> --- a/backends/trace-events
>> +++ b/backends/trace-events
>> @@ -14,4 +14,5 @@ iommufd_backend_map_dma(int iommufd, uint32_t ioas, uint64_t iova, uint64_t size
>>  iommufd_backend_unmap_dma_non_exist(int iommufd, uint32_t ioas, uint64_t iova, uint64_t size, int ret) " Unmap nonexistent mapping: iommufd=%d ioas=%d iova=0x%"PRIx64" size=0x%"PRIx64" (%d)"
>>  iommufd_backend_unmap_dma(int iommufd, uint32_t ioas, uint64_t iova, uint64_t size, int ret) " iommufd=%d ioas=%d iova=0x%"PRIx64" size=0x%"PRIx64" (%d)"
>>  iommufd_backend_alloc_ioas(int iommufd, uint32_t ioas) " iommufd=%d ioas=%d"
>> +iommufd_backend_alloc_hwpt(int iommufd, uint32_t dev_id, uint32_t pt_id, uint32_t flags, uint32_t hwpt_type, uint32_t len, uint64_t data_ptr, uint32_t out_hwpt_id, int ret) " iommufd=%d dev_id=%u pt_id=%u flags=0x%x hwpt_type=%u len=%u data_ptr=0x%"PRIx64" out_hwpt=%u (%d)"
>>  iommufd_backend_free_id(int iommufd, uint32_t id, int ret) " iommufd=%d id=%d (%d)"
>
Joao Martins July 16, 2024, 4:46 p.m. UTC | #6
On 16/07/2024 17:44, Joao Martins wrote:
> On 16/07/2024 17:04, Eric Auger wrote:
>> Hi Joao,
>>
>> On 7/12/24 13:46, Joao Martins wrote:
>>> There's generally two modes of operation for IOMMUFD:
>>>
>>> * The simple user API which intends to perform relatively simple things
>>> with IOMMUs e.g. DPDK. It generally creates an IOAS and attach to VFIO
>>
>> It generally creates? can you explicit what is "it"
>>
> 'It' here refers to the process/API-user
> 
>> I am confused by this automatic terminology again (not your fault). the doc says:
>> "
>>
>>   *
>>
>>     Automatic domain - refers to an iommu domain created automatically
>>     when attaching a device to an IOAS object. This is compatible to the
>>     semantics of VFIO type1.
>>
>>   *
>>
>>     Manual domain - refers to an iommu domain designated by the user as
>>     the target pagetable to be attached to by a device. Though currently
>>     there are no uAPIs to directly create such domain, the datastructure
>>     and algorithms are ready for handling that use case.
>>
>> "
>>
>>
>> in 1) the device is attached to the ioas id (using the auto domain if I am not wrong)
>> Here you attach to an hwpt id. Isn't it a manual domain?
>>
> 
> Correct.
> 
> The 'auto domains' generally refers to the kernel-equivalent own automatic
> attaching to a new pagetable.
> 
> Here I call 'auto domains' in the userspace version too because we are doing the
> exact same but from userspace, using the manual API in IOMMUFD.
> 
>>> and mainly performs IOAS_MAP and UNMAP.
>>>
>>> * The native IOMMUFD API where you have fine grained control of the
>>> IOMMU domain and model it accordingly. This is where most new feature
>>> are being steered to.
>>>
>>> For dirty tracking 2) is required, as it needs to ensure that
>>> the stage-2/parent IOMMU domain will only attach devices
>>> that support dirty tracking (so far it is all homogeneous in x86, likely
>>> not the case for smmuv3). Such invariant on dirty tracking provides a
>>> useful guarantee to VMMs that will refuse incompatible device
>>> attachments for IOMMU domains.
>>>
>>> Dirty tracking insurance is enforced via HWPT_ALLOC, which is
>>> responsible for creating an IOMMU domain. This is contrast to the
>>> 'simple API' where the IOMMU domain is created by IOMMUFD automatically
>>> when it attaches to VFIO (usually referred as autodomains) but it has
>>> the needed handling for mdevs.
>>>
>>> To support dirty tracking with the advanced IOMMUFD API, it needs
>>> similar logic, where IOMMU domains are created and devices attached to
>>> compatible domains. Essentially mimmicing kernel
>>> iommufd_device_auto_get_domain(). With mdevs given there's no IOMMU domain
>>> it falls back to IOAS attach.
>>>
>>> The auto domain logic allows different IOMMU domains to be created when
>>> DMA dirty tracking is not desired (and VF can provide it), and others where
>>> it is. Here is not used in this way here given how VFIODevice migration
>>
>> Here is not used in this way here ?
>>
> 
> I meant, 'Here it is not used in this way given (...)'
> 
>>> state is initialized after the device attachment. But such mixed mode of
>>> IOMMU dirty tracking + device dirty tracking is an improvement that can
>>> be added on. Keep the 'all of nothing' of type1 approach that we have
>>> been using so far between container vs device dirty tracking.
>>>
>>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>>> ---
>>>  include/hw/vfio/vfio-common.h |  9 ++++
>>>  include/sysemu/iommufd.h      |  5 +++
>>>  backends/iommufd.c            | 30 +++++++++++++
>>>  hw/vfio/iommufd.c             | 82 +++++++++++++++++++++++++++++++++++
>>>  backends/trace-events         |  1 +
>>>  5 files changed, 127 insertions(+)
>>>
>>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
>>> index 7419466bca92..2dd468ce3c02 100644
>>> --- a/include/hw/vfio/vfio-common.h
>>> +++ b/include/hw/vfio/vfio-common.h
>>> @@ -95,10 +95,17 @@ typedef struct VFIOHostDMAWindow {
>>>  
>>>  typedef struct IOMMUFDBackend IOMMUFDBackend;
>>>  
>>> +typedef struct VFIOIOASHwpt {
>>> +    uint32_t hwpt_id;
>>> +    QLIST_HEAD(, VFIODevice) device_list;
>>> +    QLIST_ENTRY(VFIOIOASHwpt) next;
>>> +} VFIOIOASHwpt;
>>> +
>>>  typedef struct VFIOIOMMUFDContainer {
>>>      VFIOContainerBase bcontainer;
>>>      IOMMUFDBackend *be;
>>>      uint32_t ioas_id;
>>> +    QLIST_HEAD(, VFIOIOASHwpt) hwpt_list;
>>>  } VFIOIOMMUFDContainer;
>>>  
>>>  OBJECT_DECLARE_SIMPLE_TYPE(VFIOIOMMUFDContainer, VFIO_IOMMU_IOMMUFD);
>>> @@ -135,6 +142,8 @@ typedef struct VFIODevice {
>>>      HostIOMMUDevice *hiod;
>>>      int devid;
>>>      IOMMUFDBackend *iommufd;
>>> +    VFIOIOASHwpt *hwpt;
>>> +    QLIST_ENTRY(VFIODevice) hwpt_next;
>>>  } VFIODevice;
>>>  
>>>  struct VFIODeviceOps {
>>> diff --git a/include/sysemu/iommufd.h b/include/sysemu/iommufd.h
>>> index 57d502a1c79a..e917e7591d05 100644
>>> --- a/include/sysemu/iommufd.h
>>> +++ b/include/sysemu/iommufd.h
>>> @@ -50,6 +50,11 @@ int iommufd_backend_unmap_dma(IOMMUFDBackend *be, uint32_t ioas_id,
>>>  bool iommufd_backend_get_device_info(IOMMUFDBackend *be, uint32_t devid,
>>>                                       uint32_t *type, void *data, uint32_t len,
>>>                                       uint64_t *caps, Error **errp);
>>> +bool iommufd_backend_alloc_hwpt(IOMMUFDBackend *be, uint32_t dev_id,
>>> +                                uint32_t pt_id, uint32_t flags,
>>> +                                uint32_t data_type, uint32_t data_len,
>>> +                                void *data_ptr, uint32_t *out_hwpt,
>>> +                                Error **errp);
>>>  
>>>  #define TYPE_HOST_IOMMU_DEVICE_IOMMUFD TYPE_HOST_IOMMU_DEVICE "-iommufd"
>>>  #endif
>>> diff --git a/backends/iommufd.c b/backends/iommufd.c
>>> index 2b3d51af26d2..5d3dfa917415 100644
>>> --- a/backends/iommufd.c
>>> +++ b/backends/iommufd.c
>>> @@ -208,6 +208,36 @@ int iommufd_backend_unmap_dma(IOMMUFDBackend *be, uint32_t ioas_id,
>>>      return ret;
>>>  }
>>>  
>>> +bool iommufd_backend_alloc_hwpt(IOMMUFDBackend *be, uint32_t dev_id,
>>> +                                uint32_t pt_id, uint32_t flags,
>>> +                                uint32_t data_type, uint32_t data_len,
>>> +                                void *data_ptr, uint32_t *out_hwpt,
>>> +                                Error **errp)
>>> +{
>>> +    int ret, fd = be->fd;
>>> +    struct iommu_hwpt_alloc alloc_hwpt = {
>>> +        .size = sizeof(struct iommu_hwpt_alloc),
>>> +        .flags = flags,
>>> +        .dev_id = dev_id,
>>> +        .pt_id = pt_id,
>>> +        .data_type = data_type,
>>> +        .data_len = data_len,
>>> +        .data_uptr = (uint64_t)data_ptr,
>>> +    };
>>> +
>>> +    ret = ioctl(fd, IOMMU_HWPT_ALLOC, &alloc_hwpt);
>>> +    trace_iommufd_backend_alloc_hwpt(fd, dev_id, pt_id, flags, data_type,
>>> +                                     data_len, (uint64_t)data_ptr,
>>> +                                     alloc_hwpt.out_hwpt_id, ret);
>>> +    if (ret) {
>>> +        error_setg_errno(errp, errno, "Failed to allocate hwpt");
>>> +        return false;
>>> +    }
>>> +
>>> +    *out_hwpt = alloc_hwpt.out_hwpt_id;
>>> +    return true;
>>> +}
>>> +
>>>  bool iommufd_backend_get_device_info(IOMMUFDBackend *be, uint32_t devid,
>>>                                       uint32_t *type, void *data, uint32_t len,
>>>                                       uint64_t *caps, Error **errp)
>>> diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
>>> index 077dea8f1b64..325c7598d5a1 100644
>>> --- a/hw/vfio/iommufd.c
>>> +++ b/hw/vfio/iommufd.c
>>> @@ -212,10 +212,86 @@ static bool iommufd_cdev_detach_ioas_hwpt(VFIODevice *vbasedev, Error **errp)
>>>      return true;
>>>  }
>>>  
>>> +static bool iommufd_cdev_autodomains_get(VFIODevice *vbasedev,
>>> +                                         VFIOIOMMUFDContainer *container,
>>> +                                         Error **errp)
>>> +{
>>> +    IOMMUFDBackend *iommufd = vbasedev->iommufd;
>>> +    uint32_t flags = 0;
>>> +    VFIOIOASHwpt *hwpt;
>>> +    uint32_t hwpt_id;
>>> +    int ret;
>>> +
>>> +    /* Try to find a domain */
>>> +    QLIST_FOREACH(hwpt, &container->hwpt_list, next) {
>>> +        ret = iommufd_cdev_attach_ioas_hwpt(vbasedev, hwpt->hwpt_id, errp);
>>> +        if (ret) {
>>> +            /* -EINVAL means the domain is incompatible with the device. */
>>> +            if (ret == -EINVAL) {
>>> +                /*
>>> +                 * It is an expected failure and it just means we will try
>>> +                 * another domain, or create one if no existing compatible
>>> +                 * domain is found. Hence why the error is discarded below.
>>> +                 */
>>> +                error_free(*errp);
>>> +                *errp = NULL;
>>> +                continue;
>>> +            }
>>> +
>>> +            return false;
>>> +        } else {
>>> +            vbasedev->hwpt = hwpt;
>>> +            QLIST_INSERT_HEAD(&hwpt->device_list, vbasedev, hwpt_next);
>>> +            return true;
>>> +        }
>>> +    }
>>> +
>>> +    if (!iommufd_backend_alloc_hwpt(iommufd, vbasedev->devid,
>>> +                                    container->ioas_id, flags,
>>> +                                    IOMMU_HWPT_DATA_NONE, 0, NULL,
>>> +                                    &hwpt_id, errp)) {
>>> +        return false;
>>> +    }
>>> +
>>> +    hwpt = g_malloc0(sizeof(*hwpt));
>>> +    hwpt->hwpt_id = hwpt_id;
>>> +    QLIST_INIT(&hwpt->device_list);
>>> +
>>> +    ret = iommufd_cdev_attach_ioas_hwpt(vbasedev, hwpt->hwpt_id, errp);
>>> +    if (ret) {
>>> +        iommufd_backend_free_id(container->be, hwpt->hwpt_id);
>>> +        g_free(hwpt);
>>> +        return false;
>>> +    }
>>> +
>>> +    vbasedev->hwpt = hwpt;
>>> +    QLIST_INSERT_HEAD(&hwpt->device_list, vbasedev, hwpt_next);
>>> +    QLIST_INSERT_HEAD(&container->hwpt_list, hwpt, next);
>>> +    return true;
>>> +}
>>> +
>>> +static void iommufd_cdev_autodomains_put(VFIODevice *vbasedev,
>>> +                                         VFIOIOMMUFDContainer *container)
>>> +{
>>> +    VFIOIOASHwpt *hwpt = vbasedev->hwpt;
>>> +
>>> +    QLIST_REMOVE(vbasedev, hwpt_next);
>> don't you want to reset vbasedev->hwpt = NULL too?
>>
> Yeap, Thanks for catching that
> 
>>
>>> +    if (QLIST_EMPTY(&hwpt->device_list)) {
>>> +        QLIST_REMOVE(hwpt, next);
>>> +        iommufd_backend_free_id(container->be, hwpt->hwpt_id);
>>> +        g_free(hwpt);
>>> +    }
>>> +}
>>> +
>>>  static bool iommufd_cdev_attach_container(VFIODevice *vbasedev,
>>>                                            VFIOIOMMUFDContainer *container,
>>>                                            Error **errp)
>>>  {
>>> +    /* mdevs aren't physical devices and will fail with auto domains */
>>> +    if (!vbasedev->mdev) {
>>> +        return iommufd_cdev_autodomains_get(vbasedev, container, errp);
>>> +    }
>>> +
>>>      return !iommufd_cdev_attach_ioas_hwpt(vbasedev, container->ioas_id, errp);
>>>  }
>>>  
>>> @@ -224,6 +300,11 @@ static void iommufd_cdev_detach_container(VFIODevice *vbasedev,
>>>  {
>>>      Error *err = NULL;
>>>  
>>> +    if (vbasedev->hwpt) {
>>> +        iommufd_cdev_autodomains_put(vbasedev, container);
>>> +        return;
>> Where do we detach the device from the hwpt?
>>
> In iommufd_backend_free_id() for auto domains
> 

to clarify here I meant *userspace* auto domains

*kernel* auto domains (mdev) goes via DETACH_IOMMUFD_PT

>> Thanks
>>
>> Eric
>>> +    }
>>> +
>>>      if (!iommufd_cdev_detach_ioas_hwpt(vbasedev, &err)) {
>>>          error_report_err(err);
>>>      }
>>> @@ -354,6 +435,7 @@ static bool iommufd_cdev_attach(const char *name, VFIODevice *vbasedev,
>>>      container = VFIO_IOMMU_IOMMUFD(object_new(TYPE_VFIO_IOMMU_IOMMUFD));
>>>      container->be = vbasedev->iommufd;
>>>      container->ioas_id = ioas_id;
>>> +    QLIST_INIT(&container->hwpt_list);
>>>  
>>>      bcontainer = &container->bcontainer;
>>>      vfio_address_space_insert(space, bcontainer);
>>> diff --git a/backends/trace-events b/backends/trace-events
>>> index 211e6f374adc..4d8ac02fe7d6 100644
>>> --- a/backends/trace-events
>>> +++ b/backends/trace-events
>>> @@ -14,4 +14,5 @@ iommufd_backend_map_dma(int iommufd, uint32_t ioas, uint64_t iova, uint64_t size
>>>  iommufd_backend_unmap_dma_non_exist(int iommufd, uint32_t ioas, uint64_t iova, uint64_t size, int ret) " Unmap nonexistent mapping: iommufd=%d ioas=%d iova=0x%"PRIx64" size=0x%"PRIx64" (%d)"
>>>  iommufd_backend_unmap_dma(int iommufd, uint32_t ioas, uint64_t iova, uint64_t size, int ret) " iommufd=%d ioas=%d iova=0x%"PRIx64" size=0x%"PRIx64" (%d)"
>>>  iommufd_backend_alloc_ioas(int iommufd, uint32_t ioas) " iommufd=%d ioas=%d"
>>> +iommufd_backend_alloc_hwpt(int iommufd, uint32_t dev_id, uint32_t pt_id, uint32_t flags, uint32_t hwpt_type, uint32_t len, uint64_t data_ptr, uint32_t out_hwpt_id, int ret) " iommufd=%d dev_id=%u pt_id=%u flags=0x%x hwpt_type=%u len=%u data_ptr=0x%"PRIx64" out_hwpt=%u (%d)"
>>>  iommufd_backend_free_id(int iommufd, uint32_t id, int ret) " iommufd=%d id=%d (%d)"
>>
>
Eric Auger July 16, 2024, 5:32 p.m. UTC | #7
On 7/16/24 18:44, Joao Martins wrote:
> On 16/07/2024 17:04, Eric Auger wrote:
>> Hi Joao,
>>
>> On 7/12/24 13:46, Joao Martins wrote:
>>> There's generally two modes of operation for IOMMUFD:
>>>
>>> * The simple user API which intends to perform relatively simple things
>>> with IOMMUs e.g. DPDK. It generally creates an IOAS and attach to VFIO
>> It generally creates? can you explicit what is "it"
>>
> 'It' here refers to the process/API-user
OK
>
>> I am confused by this automatic terminology again (not your fault). the doc says:
>> "
>>
>>   *
>>
>>     Automatic domain - refers to an iommu domain created automatically
>>     when attaching a device to an IOAS object. This is compatible to the
>>     semantics of VFIO type1.
>>
>>   *
>>
>>     Manual domain - refers to an iommu domain designated by the user as
>>     the target pagetable to be attached to by a device. Though currently
>>     there are no uAPIs to directly create such domain, the datastructure
>>     and algorithms are ready for handling that use case.
>>
>> "
>>
>>
>> in 1) the device is attached to the ioas id (using the auto domain if I am not wrong)
>> Here you attach to an hwpt id. Isn't it a manual domain?
>>
> Correct.
>
> The 'auto domains' generally refers to the kernel-equivalent own automatic
> attaching to a new pagetable.
>
> Here I call 'auto domains' in the userspace version too because we are doing the
> exact same but from userspace, using the manual API in IOMMUFD.
OK
>
>>> and mainly performs IOAS_MAP and UNMAP.
>>>
>>> * The native IOMMUFD API where you have fine grained control of the
>>> IOMMU domain and model it accordingly. This is where most new feature
>>> are being steered to.
>>>
>>> For dirty tracking 2) is required, as it needs to ensure that
>>> the stage-2/parent IOMMU domain will only attach devices
>>> that support dirty tracking (so far it is all homogeneous in x86, likely
>>> not the case for smmuv3). Such invariant on dirty tracking provides a
>>> useful guarantee to VMMs that will refuse incompatible device
>>> attachments for IOMMU domains.
>>>
>>> Dirty tracking insurance is enforced via HWPT_ALLOC, which is
>>> responsible for creating an IOMMU domain. This is contrast to the
>>> 'simple API' where the IOMMU domain is created by IOMMUFD automatically
>>> when it attaches to VFIO (usually referred as autodomains) but it has
>>> the needed handling for mdevs.
>>>
>>> To support dirty tracking with the advanced IOMMUFD API, it needs
>>> similar logic, where IOMMU domains are created and devices attached to
>>> compatible domains. Essentially mimmicing kernel
>>> iommufd_device_auto_get_domain(). With mdevs given there's no IOMMU domain
>>> it falls back to IOAS attach.
>>>
>>> The auto domain logic allows different IOMMU domains to be created when
>>> DMA dirty tracking is not desired (and VF can provide it), and others where
>>> it is. Here is not used in this way here given how VFIODevice migration
>> Here is not used in this way here ?
>>
> I meant, 'Here it is not used in this way given (...)'
OK
>
>>> state is initialized after the device attachment. But such mixed mode of
>>> IOMMU dirty tracking + device dirty tracking is an improvement that can
>>> be added on. Keep the 'all of nothing' of type1 approach that we have
>>> been using so far between container vs device dirty tracking.
>>>
>>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>>> ---
>>>  include/hw/vfio/vfio-common.h |  9 ++++
>>>  include/sysemu/iommufd.h      |  5 +++
>>>  backends/iommufd.c            | 30 +++++++++++++
>>>  hw/vfio/iommufd.c             | 82 +++++++++++++++++++++++++++++++++++
>>>  backends/trace-events         |  1 +
>>>  5 files changed, 127 insertions(+)
>>>
>>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
>>> index 7419466bca92..2dd468ce3c02 100644
>>> --- a/include/hw/vfio/vfio-common.h
>>> +++ b/include/hw/vfio/vfio-common.h
>>> @@ -95,10 +95,17 @@ typedef struct VFIOHostDMAWindow {
>>>  
>>>  typedef struct IOMMUFDBackend IOMMUFDBackend;
>>>  
>>> +typedef struct VFIOIOASHwpt {
>>> +    uint32_t hwpt_id;
>>> +    QLIST_HEAD(, VFIODevice) device_list;
>>> +    QLIST_ENTRY(VFIOIOASHwpt) next;
>>> +} VFIOIOASHwpt;
>>> +
>>>  typedef struct VFIOIOMMUFDContainer {
>>>      VFIOContainerBase bcontainer;
>>>      IOMMUFDBackend *be;
>>>      uint32_t ioas_id;
>>> +    QLIST_HEAD(, VFIOIOASHwpt) hwpt_list;
>>>  } VFIOIOMMUFDContainer;
>>>  
>>>  OBJECT_DECLARE_SIMPLE_TYPE(VFIOIOMMUFDContainer, VFIO_IOMMU_IOMMUFD);
>>> @@ -135,6 +142,8 @@ typedef struct VFIODevice {
>>>      HostIOMMUDevice *hiod;
>>>      int devid;
>>>      IOMMUFDBackend *iommufd;
>>> +    VFIOIOASHwpt *hwpt;
>>> +    QLIST_ENTRY(VFIODevice) hwpt_next;
>>>  } VFIODevice;
>>>  
>>>  struct VFIODeviceOps {
>>> diff --git a/include/sysemu/iommufd.h b/include/sysemu/iommufd.h
>>> index 57d502a1c79a..e917e7591d05 100644
>>> --- a/include/sysemu/iommufd.h
>>> +++ b/include/sysemu/iommufd.h
>>> @@ -50,6 +50,11 @@ int iommufd_backend_unmap_dma(IOMMUFDBackend *be, uint32_t ioas_id,
>>>  bool iommufd_backend_get_device_info(IOMMUFDBackend *be, uint32_t devid,
>>>                                       uint32_t *type, void *data, uint32_t len,
>>>                                       uint64_t *caps, Error **errp);
>>> +bool iommufd_backend_alloc_hwpt(IOMMUFDBackend *be, uint32_t dev_id,
>>> +                                uint32_t pt_id, uint32_t flags,
>>> +                                uint32_t data_type, uint32_t data_len,
>>> +                                void *data_ptr, uint32_t *out_hwpt,
>>> +                                Error **errp);
>>>  
>>>  #define TYPE_HOST_IOMMU_DEVICE_IOMMUFD TYPE_HOST_IOMMU_DEVICE "-iommufd"
>>>  #endif
>>> diff --git a/backends/iommufd.c b/backends/iommufd.c
>>> index 2b3d51af26d2..5d3dfa917415 100644
>>> --- a/backends/iommufd.c
>>> +++ b/backends/iommufd.c
>>> @@ -208,6 +208,36 @@ int iommufd_backend_unmap_dma(IOMMUFDBackend *be, uint32_t ioas_id,
>>>      return ret;
>>>  }
>>>  
>>> +bool iommufd_backend_alloc_hwpt(IOMMUFDBackend *be, uint32_t dev_id,
>>> +                                uint32_t pt_id, uint32_t flags,
>>> +                                uint32_t data_type, uint32_t data_len,
>>> +                                void *data_ptr, uint32_t *out_hwpt,
>>> +                                Error **errp)
>>> +{
>>> +    int ret, fd = be->fd;
>>> +    struct iommu_hwpt_alloc alloc_hwpt = {
>>> +        .size = sizeof(struct iommu_hwpt_alloc),
>>> +        .flags = flags,
>>> +        .dev_id = dev_id,
>>> +        .pt_id = pt_id,
>>> +        .data_type = data_type,
>>> +        .data_len = data_len,
>>> +        .data_uptr = (uint64_t)data_ptr,
>>> +    };
>>> +
>>> +    ret = ioctl(fd, IOMMU_HWPT_ALLOC, &alloc_hwpt);
>>> +    trace_iommufd_backend_alloc_hwpt(fd, dev_id, pt_id, flags, data_type,
>>> +                                     data_len, (uint64_t)data_ptr,
>>> +                                     alloc_hwpt.out_hwpt_id, ret);
>>> +    if (ret) {
>>> +        error_setg_errno(errp, errno, "Failed to allocate hwpt");
>>> +        return false;
>>> +    }
>>> +
>>> +    *out_hwpt = alloc_hwpt.out_hwpt_id;
>>> +    return true;
>>> +}
>>> +
>>>  bool iommufd_backend_get_device_info(IOMMUFDBackend *be, uint32_t devid,
>>>                                       uint32_t *type, void *data, uint32_t len,
>>>                                       uint64_t *caps, Error **errp)
>>> diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
>>> index 077dea8f1b64..325c7598d5a1 100644
>>> --- a/hw/vfio/iommufd.c
>>> +++ b/hw/vfio/iommufd.c
>>> @@ -212,10 +212,86 @@ static bool iommufd_cdev_detach_ioas_hwpt(VFIODevice *vbasedev, Error **errp)
>>>      return true;
>>>  }
>>>  
>>> +static bool iommufd_cdev_autodomains_get(VFIODevice *vbasedev,
>>> +                                         VFIOIOMMUFDContainer *container,
>>> +                                         Error **errp)
>>> +{
>>> +    IOMMUFDBackend *iommufd = vbasedev->iommufd;
>>> +    uint32_t flags = 0;
>>> +    VFIOIOASHwpt *hwpt;
>>> +    uint32_t hwpt_id;
>>> +    int ret;
>>> +
>>> +    /* Try to find a domain */
>>> +    QLIST_FOREACH(hwpt, &container->hwpt_list, next) {
>>> +        ret = iommufd_cdev_attach_ioas_hwpt(vbasedev, hwpt->hwpt_id, errp);
>>> +        if (ret) {
>>> +            /* -EINVAL means the domain is incompatible with the device. */
>>> +            if (ret == -EINVAL) {
>>> +                /*
>>> +                 * It is an expected failure and it just means we will try
>>> +                 * another domain, or create one if no existing compatible
>>> +                 * domain is found. Hence why the error is discarded below.
>>> +                 */
>>> +                error_free(*errp);
>>> +                *errp = NULL;
>>> +                continue;
>>> +            }
>>> +
>>> +            return false;
>>> +        } else {
>>> +            vbasedev->hwpt = hwpt;
>>> +            QLIST_INSERT_HEAD(&hwpt->device_list, vbasedev, hwpt_next);
>>> +            return true;
>>> +        }
>>> +    }
>>> +
>>> +    if (!iommufd_backend_alloc_hwpt(iommufd, vbasedev->devid,
>>> +                                    container->ioas_id, flags,
>>> +                                    IOMMU_HWPT_DATA_NONE, 0, NULL,
>>> +                                    &hwpt_id, errp)) {
>>> +        return false;
>>> +    }
>>> +
>>> +    hwpt = g_malloc0(sizeof(*hwpt));
>>> +    hwpt->hwpt_id = hwpt_id;
>>> +    QLIST_INIT(&hwpt->device_list);
>>> +
>>> +    ret = iommufd_cdev_attach_ioas_hwpt(vbasedev, hwpt->hwpt_id, errp);
>>> +    if (ret) {
>>> +        iommufd_backend_free_id(container->be, hwpt->hwpt_id);
>>> +        g_free(hwpt);
>>> +        return false;
>>> +    }
>>> +
>>> +    vbasedev->hwpt = hwpt;
>>> +    QLIST_INSERT_HEAD(&hwpt->device_list, vbasedev, hwpt_next);
>>> +    QLIST_INSERT_HEAD(&container->hwpt_list, hwpt, next);
>>> +    return true;
>>> +}
>>> +
>>> +static void iommufd_cdev_autodomains_put(VFIODevice *vbasedev,
>>> +                                         VFIOIOMMUFDContainer *container)
>>> +{
>>> +    VFIOIOASHwpt *hwpt = vbasedev->hwpt;
>>> +
>>> +    QLIST_REMOVE(vbasedev, hwpt_next);
>> don't you want to reset vbasedev->hwpt = NULL too?
>>
> Yeap, Thanks for catching that
>
>>> +    if (QLIST_EMPTY(&hwpt->device_list)) {
>>> +        QLIST_REMOVE(hwpt, next);
>>> +        iommufd_backend_free_id(container->be, hwpt->hwpt_id);
>>> +        g_free(hwpt);
>>> +    }
>>> +}
>>> +
>>>  static bool iommufd_cdev_attach_container(VFIODevice *vbasedev,
>>>                                            VFIOIOMMUFDContainer *container,
>>>                                            Error **errp)
>>>  {
>>> +    /* mdevs aren't physical devices and will fail with auto domains */
>>> +    if (!vbasedev->mdev) {
>>> +        return iommufd_cdev_autodomains_get(vbasedev, container, errp);
>>> +    }
>>> +
>>>      return !iommufd_cdev_attach_ioas_hwpt(vbasedev, container->ioas_id, errp);
>>>  }
>>>  
>>> @@ -224,6 +300,11 @@ static void iommufd_cdev_detach_container(VFIODevice *vbasedev,
>>>  {
>>>      Error *err = NULL;
>>>  
>>> +    if (vbasedev->hwpt) {
>>> +        iommufd_cdev_autodomains_put(vbasedev, container);
>>> +        return;
>> Where do we detach the device from the hwpt?
>>
> In iommufd_backend_free_id() for auto domains
Hum I see iommufd_backend_free_id frees the object. I guess the detach
then is done on kernel side...

Eric
>
>> Thanks
>>
>> Eric
>>> +    }
>>> +
>>>      if (!iommufd_cdev_detach_ioas_hwpt(vbasedev, &err)) {
>>>          error_report_err(err);
>>>      }
>>> @@ -354,6 +435,7 @@ static bool iommufd_cdev_attach(const char *name, VFIODevice *vbasedev,
>>>      container = VFIO_IOMMU_IOMMUFD(object_new(TYPE_VFIO_IOMMU_IOMMUFD));
>>>      container->be = vbasedev->iommufd;
>>>      container->ioas_id = ioas_id;
>>> +    QLIST_INIT(&container->hwpt_list);
>>>  
>>>      bcontainer = &container->bcontainer;
>>>      vfio_address_space_insert(space, bcontainer);
>>> diff --git a/backends/trace-events b/backends/trace-events
>>> index 211e6f374adc..4d8ac02fe7d6 100644
>>> --- a/backends/trace-events
>>> +++ b/backends/trace-events
>>> @@ -14,4 +14,5 @@ iommufd_backend_map_dma(int iommufd, uint32_t ioas, uint64_t iova, uint64_t size
>>>  iommufd_backend_unmap_dma_non_exist(int iommufd, uint32_t ioas, uint64_t iova, uint64_t size, int ret) " Unmap nonexistent mapping: iommufd=%d ioas=%d iova=0x%"PRIx64" size=0x%"PRIx64" (%d)"
>>>  iommufd_backend_unmap_dma(int iommufd, uint32_t ioas, uint64_t iova, uint64_t size, int ret) " iommufd=%d ioas=%d iova=0x%"PRIx64" size=0x%"PRIx64" (%d)"
>>>  iommufd_backend_alloc_ioas(int iommufd, uint32_t ioas) " iommufd=%d ioas=%d"
>>> +iommufd_backend_alloc_hwpt(int iommufd, uint32_t dev_id, uint32_t pt_id, uint32_t flags, uint32_t hwpt_type, uint32_t len, uint64_t data_ptr, uint32_t out_hwpt_id, int ret) " iommufd=%d dev_id=%u pt_id=%u flags=0x%x hwpt_type=%u len=%u data_ptr=0x%"PRIx64" out_hwpt=%u (%d)"
>>>  iommufd_backend_free_id(int iommufd, uint32_t id, int ret) " iommufd=%d id=%d (%d)"
Duan, Zhenzhong July 17, 2024, 2:18 a.m. UTC | #8
>-----Original Message-----
>From: Joao Martins <joao.m.martins@oracle.com>
>Subject: [PATCH v4 05/12] vfio/iommufd: Introduce auto domain creation
>
>There's generally two modes of operation for IOMMUFD:
>
>* The simple user API which intends to perform relatively simple things
>with IOMMUs e.g. DPDK. It generally creates an IOAS and attach to VFIO
>and mainly performs IOAS_MAP and UNMAP.
>
>* The native IOMMUFD API where you have fine grained control of the
>IOMMU domain and model it accordingly. This is where most new feature
>are being steered to.
>
>For dirty tracking 2) is required, as it needs to ensure that
>the stage-2/parent IOMMU domain will only attach devices
>that support dirty tracking (so far it is all homogeneous in x86, likely
>not the case for smmuv3). Such invariant on dirty tracking provides a
>useful guarantee to VMMs that will refuse incompatible device
>attachments for IOMMU domains.
>
>Dirty tracking insurance is enforced via HWPT_ALLOC, which is
>responsible for creating an IOMMU domain. This is contrast to the
>'simple API' where the IOMMU domain is created by IOMMUFD
>automatically
>when it attaches to VFIO (usually referred as autodomains) but it has
>the needed handling for mdevs.
>
>To support dirty tracking with the advanced IOMMUFD API, it needs
>similar logic, where IOMMU domains are created and devices attached to
>compatible domains. Essentially mimmicing kernel
>iommufd_device_auto_get_domain(). With mdevs given there's no IOMMU
>domain
>it falls back to IOAS attach.
>
>The auto domain logic allows different IOMMU domains to be created when
>DMA dirty tracking is not desired (and VF can provide it), and others where
>it is. Here is not used in this way here given how VFIODevice migration
>state is initialized after the device attachment. But such mixed mode of
>IOMMU dirty tracking + device dirty tracking is an improvement that can
>be added on. Keep the 'all of nothing' of type1 approach that we have
>been using so far between container vs device dirty tracking.
>
>Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>---
> include/hw/vfio/vfio-common.h |  9 ++++
> include/sysemu/iommufd.h      |  5 +++
> backends/iommufd.c            | 30 +++++++++++++
> hw/vfio/iommufd.c             | 82
>+++++++++++++++++++++++++++++++++++
> backends/trace-events         |  1 +
> 5 files changed, 127 insertions(+)
>
>diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-
>common.h
>index 7419466bca92..2dd468ce3c02 100644
>--- a/include/hw/vfio/vfio-common.h
>+++ b/include/hw/vfio/vfio-common.h
>@@ -95,10 +95,17 @@ typedef struct VFIOHostDMAWindow {
>
> typedef struct IOMMUFDBackend IOMMUFDBackend;
>
>+typedef struct VFIOIOASHwpt {
>+    uint32_t hwpt_id;
>+    QLIST_HEAD(, VFIODevice) device_list;
>+    QLIST_ENTRY(VFIOIOASHwpt) next;
>+} VFIOIOASHwpt;
>+
> typedef struct VFIOIOMMUFDContainer {
>     VFIOContainerBase bcontainer;
>     IOMMUFDBackend *be;
>     uint32_t ioas_id;
>+    QLIST_HEAD(, VFIOIOASHwpt) hwpt_list;
> } VFIOIOMMUFDContainer;
>
> OBJECT_DECLARE_SIMPLE_TYPE(VFIOIOMMUFDContainer,
>VFIO_IOMMU_IOMMUFD);
>@@ -135,6 +142,8 @@ typedef struct VFIODevice {
>     HostIOMMUDevice *hiod;
>     int devid;
>     IOMMUFDBackend *iommufd;
>+    VFIOIOASHwpt *hwpt;
>+    QLIST_ENTRY(VFIODevice) hwpt_next;
> } VFIODevice;
>
> struct VFIODeviceOps {
>diff --git a/include/sysemu/iommufd.h b/include/sysemu/iommufd.h
>index 57d502a1c79a..e917e7591d05 100644
>--- a/include/sysemu/iommufd.h
>+++ b/include/sysemu/iommufd.h
>@@ -50,6 +50,11 @@ int
>iommufd_backend_unmap_dma(IOMMUFDBackend *be, uint32_t ioas_id,
> bool iommufd_backend_get_device_info(IOMMUFDBackend *be, uint32_t
>devid,
>                                      uint32_t *type, void *data, uint32_t len,
>                                      uint64_t *caps, Error **errp);
>+bool iommufd_backend_alloc_hwpt(IOMMUFDBackend *be, uint32_t
>dev_id,
>+                                uint32_t pt_id, uint32_t flags,
>+                                uint32_t data_type, uint32_t data_len,
>+                                void *data_ptr, uint32_t *out_hwpt,
>+                                Error **errp);
>
> #define TYPE_HOST_IOMMU_DEVICE_IOMMUFD
>TYPE_HOST_IOMMU_DEVICE "-iommufd"
> #endif
>diff --git a/backends/iommufd.c b/backends/iommufd.c
>index 2b3d51af26d2..5d3dfa917415 100644
>--- a/backends/iommufd.c
>+++ b/backends/iommufd.c
>@@ -208,6 +208,36 @@ int
>iommufd_backend_unmap_dma(IOMMUFDBackend *be, uint32_t ioas_id,
>     return ret;
> }
>
>+bool iommufd_backend_alloc_hwpt(IOMMUFDBackend *be, uint32_t
>dev_id,
>+                                uint32_t pt_id, uint32_t flags,
>+                                uint32_t data_type, uint32_t data_len,
>+                                void *data_ptr, uint32_t *out_hwpt,
>+                                Error **errp)
>+{
>+    int ret, fd = be->fd;
>+    struct iommu_hwpt_alloc alloc_hwpt = {
>+        .size = sizeof(struct iommu_hwpt_alloc),
>+        .flags = flags,
>+        .dev_id = dev_id,
>+        .pt_id = pt_id,
>+        .data_type = data_type,
>+        .data_len = data_len,
>+        .data_uptr = (uint64_t)data_ptr,
>+    };
>+
>+    ret = ioctl(fd, IOMMU_HWPT_ALLOC, &alloc_hwpt);
>+    trace_iommufd_backend_alloc_hwpt(fd, dev_id, pt_id, flags, data_type,
>+                                     data_len, (uint64_t)data_ptr,
>+                                     alloc_hwpt.out_hwpt_id, ret);
>+    if (ret) {
>+        error_setg_errno(errp, errno, "Failed to allocate hwpt");
>+        return false;
>+    }
>+
>+    *out_hwpt = alloc_hwpt.out_hwpt_id;
>+    return true;
>+}
>+
> bool iommufd_backend_get_device_info(IOMMUFDBackend *be, uint32_t
>devid,
>                                      uint32_t *type, void *data, uint32_t len,
>                                      uint64_t *caps, Error **errp)
>diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
>index 077dea8f1b64..325c7598d5a1 100644
>--- a/hw/vfio/iommufd.c
>+++ b/hw/vfio/iommufd.c
>@@ -212,10 +212,86 @@ static bool
>iommufd_cdev_detach_ioas_hwpt(VFIODevice *vbasedev, Error **errp)
>     return true;
> }
>
>+static bool iommufd_cdev_autodomains_get(VFIODevice *vbasedev,
>+                                         VFIOIOMMUFDContainer *container,
>+                                         Error **errp)
>+{
>+    IOMMUFDBackend *iommufd = vbasedev->iommufd;
>+    uint32_t flags = 0;
>+    VFIOIOASHwpt *hwpt;
>+    uint32_t hwpt_id;
>+    int ret;
>+
>+    /* Try to find a domain */
>+    QLIST_FOREACH(hwpt, &container->hwpt_list, next) {
>+        ret = iommufd_cdev_attach_ioas_hwpt(vbasedev, hwpt->hwpt_id,
>errp);

If there is already an hwpt that supports dirty tracking.
Another device that doesn't support dirty tracking attaches to this hwpt, will it succeed?

If existing hwpt doesn't support dirty tracking.
Another device supporting dirty tracking attaches to that hwpt, what will happen?

Thanks
Zhenzhong

>+        if (ret) {
>+            /* -EINVAL means the domain is incompatible with the device. */
>+            if (ret == -EINVAL) {
>+                /*
>+                 * It is an expected failure and it just means we will try
>+                 * another domain, or create one if no existing compatible
>+                 * domain is found. Hence why the error is discarded below.
>+                 */
>+                error_free(*errp);
>+                *errp = NULL;
>+                continue;
>+            }
>+
>+            return false;
>+        } else {
>+            vbasedev->hwpt = hwpt;
>+            QLIST_INSERT_HEAD(&hwpt->device_list, vbasedev, hwpt_next);
>+            return true;
>+        }
>+    }
>+
>+    if (!iommufd_backend_alloc_hwpt(iommufd, vbasedev->devid,
>+                                    container->ioas_id, flags,
>+                                    IOMMU_HWPT_DATA_NONE, 0, NULL,
>+                                    &hwpt_id, errp)) {
>+        return false;
>+    }
>+
>+    hwpt = g_malloc0(sizeof(*hwpt));
>+    hwpt->hwpt_id = hwpt_id;
>+    QLIST_INIT(&hwpt->device_list);
>+
>+    ret = iommufd_cdev_attach_ioas_hwpt(vbasedev, hwpt->hwpt_id, errp);
>+    if (ret) {
>+        iommufd_backend_free_id(container->be, hwpt->hwpt_id);
>+        g_free(hwpt);
>+        return false;
>+    }
>+
>+    vbasedev->hwpt = hwpt;
>+    QLIST_INSERT_HEAD(&hwpt->device_list, vbasedev, hwpt_next);
>+    QLIST_INSERT_HEAD(&container->hwpt_list, hwpt, next);
>+    return true;
>+}
>+
>+static void iommufd_cdev_autodomains_put(VFIODevice *vbasedev,
>+                                         VFIOIOMMUFDContainer *container)
>+{
>+    VFIOIOASHwpt *hwpt = vbasedev->hwpt;
>+
>+    QLIST_REMOVE(vbasedev, hwpt_next);
>+    if (QLIST_EMPTY(&hwpt->device_list)) {
>+        QLIST_REMOVE(hwpt, next);
>+        iommufd_backend_free_id(container->be, hwpt->hwpt_id);
>+        g_free(hwpt);
>+    }
>+}
>+
> static bool iommufd_cdev_attach_container(VFIODevice *vbasedev,
>                                           VFIOIOMMUFDContainer *container,
>                                           Error **errp)
> {
>+    /* mdevs aren't physical devices and will fail with auto domains */
>+    if (!vbasedev->mdev) {
>+        return iommufd_cdev_autodomains_get(vbasedev, container, errp);
>+    }
>+
>     return !iommufd_cdev_attach_ioas_hwpt(vbasedev, container->ioas_id,
>errp);
> }
>
>@@ -224,6 +300,11 @@ static void
>iommufd_cdev_detach_container(VFIODevice *vbasedev,
> {
>     Error *err = NULL;
>
>+    if (vbasedev->hwpt) {
>+        iommufd_cdev_autodomains_put(vbasedev, container);
>+        return;
>+    }
>+
>     if (!iommufd_cdev_detach_ioas_hwpt(vbasedev, &err)) {
>         error_report_err(err);
>     }
>@@ -354,6 +435,7 @@ static bool iommufd_cdev_attach(const char *name,
>VFIODevice *vbasedev,
>     container =
>VFIO_IOMMU_IOMMUFD(object_new(TYPE_VFIO_IOMMU_IOMMUFD));
>     container->be = vbasedev->iommufd;
>     container->ioas_id = ioas_id;
>+    QLIST_INIT(&container->hwpt_list);
>
>     bcontainer = &container->bcontainer;
>     vfio_address_space_insert(space, bcontainer);
>diff --git a/backends/trace-events b/backends/trace-events
>index 211e6f374adc..4d8ac02fe7d6 100644
>--- a/backends/trace-events
>+++ b/backends/trace-events
>@@ -14,4 +14,5 @@ iommufd_backend_map_dma(int iommufd, uint32_t
>ioas, uint64_t iova, uint64_t size
> iommufd_backend_unmap_dma_non_exist(int iommufd, uint32_t ioas,
>uint64_t iova, uint64_t size, int ret) " Unmap nonexistent mapping:
>iommufd=%d ioas=%d iova=0x%"PRIx64" size=0x%"PRIx64" (%d)"
> iommufd_backend_unmap_dma(int iommufd, uint32_t ioas, uint64_t iova,
>uint64_t size, int ret) " iommufd=%d ioas=%d iova=0x%"PRIx64"
>size=0x%"PRIx64" (%d)"
> iommufd_backend_alloc_ioas(int iommufd, uint32_t ioas) " iommufd=%d
>ioas=%d"
>+iommufd_backend_alloc_hwpt(int iommufd, uint32_t dev_id, uint32_t
>pt_id, uint32_t flags, uint32_t hwpt_type, uint32_t len, uint64_t data_ptr,
>uint32_t out_hwpt_id, int ret) " iommufd=%d dev_id=%u pt_id=%u
>flags=0x%x hwpt_type=%u len=%u data_ptr=0x%"PRIx64" out_hwpt=%u
>(%d)"
> iommufd_backend_free_id(int iommufd, uint32_t id, int ret) " iommufd=%d
>id=%d (%d)"
>--
>2.17.2
Duan, Zhenzhong July 17, 2024, 2:52 a.m. UTC | #9
>-----Original Message-----
>From: Joao Martins <joao.m.martins@oracle.com>
>Subject: Re: [PATCH v4 05/12] vfio/iommufd: Introduce auto domain
>creation
>
>On 16/07/2024 17:44, Joao Martins wrote:
>> On 16/07/2024 17:04, Eric Auger wrote:
>>> Hi Joao,
>>>
>>> On 7/12/24 13:46, Joao Martins wrote:
>>>> There's generally two modes of operation for IOMMUFD:
>>>>
>>>> * The simple user API which intends to perform relatively simple things
>>>> with IOMMUs e.g. DPDK. It generally creates an IOAS and attach to VFIO
>>>
>>> It generally creates? can you explicit what is "it"
>>>
>> 'It' here refers to the process/API-user
>>
>>> I am confused by this automatic terminology again (not your fault). the
>doc says:
>>> "
>>>
>>>   *
>>>
>>>     Automatic domain - refers to an iommu domain created automatically
>>>     when attaching a device to an IOAS object. This is compatible to the
>>>     semantics of VFIO type1.
>>>
>>>   *
>>>
>>>     Manual domain - refers to an iommu domain designated by the user as
>>>     the target pagetable to be attached to by a device. Though currently
>>>     there are no uAPIs to directly create such domain, the datastructure
>>>     and algorithms are ready for handling that use case.
>>>
>>> "
>>>
>>>
>>> in 1) the device is attached to the ioas id (using the auto domain if I am
>not wrong)
>>> Here you attach to an hwpt id. Isn't it a manual domain?
>>>
>>
>> Correct.
>>
>> The 'auto domains' generally refers to the kernel-equivalent own
>automatic
>> attaching to a new pagetable.
>>
>> Here I call 'auto domains' in the userspace version too because we are
>doing the
>> exact same but from userspace, using the manual API in IOMMUFD.
>>
>>>> and mainly performs IOAS_MAP and UNMAP.
>>>>
>>>> * The native IOMMUFD API where you have fine grained control of the
>>>> IOMMU domain and model it accordingly. This is where most new
>feature
>>>> are being steered to.
>>>>
>>>> For dirty tracking 2) is required, as it needs to ensure that
>>>> the stage-2/parent IOMMU domain will only attach devices
>>>> that support dirty tracking (so far it is all homogeneous in x86, likely
>>>> not the case for smmuv3). Such invariant on dirty tracking provides a
>>>> useful guarantee to VMMs that will refuse incompatible device
>>>> attachments for IOMMU domains.
>>>>
>>>> Dirty tracking insurance is enforced via HWPT_ALLOC, which is
>>>> responsible for creating an IOMMU domain. This is contrast to the
>>>> 'simple API' where the IOMMU domain is created by IOMMUFD
>automatically
>>>> when it attaches to VFIO (usually referred as autodomains) but it has
>>>> the needed handling for mdevs.
>>>>
>>>> To support dirty tracking with the advanced IOMMUFD API, it needs
>>>> similar logic, where IOMMU domains are created and devices attached
>to
>>>> compatible domains. Essentially mimmicing kernel
>>>> iommufd_device_auto_get_domain(). With mdevs given there's no
>IOMMU domain
>>>> it falls back to IOAS attach.
>>>>
>>>> The auto domain logic allows different IOMMU domains to be created
>when
>>>> DMA dirty tracking is not desired (and VF can provide it), and others
>where
>>>> it is. Here is not used in this way here given how VFIODevice migration
>>>
>>> Here is not used in this way here ?
>>>
>>
>> I meant, 'Here it is not used in this way given (...)'
>>
>>>> state is initialized after the device attachment. But such mixed mode of
>>>> IOMMU dirty tracking + device dirty tracking is an improvement that
>can
>>>> be added on. Keep the 'all of nothing' of type1 approach that we have
>>>> been using so far between container vs device dirty tracking.
>>>>
>>>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>>>> ---
>>>>  include/hw/vfio/vfio-common.h |  9 ++++
>>>>  include/sysemu/iommufd.h      |  5 +++
>>>>  backends/iommufd.c            | 30 +++++++++++++
>>>>  hw/vfio/iommufd.c             | 82
>+++++++++++++++++++++++++++++++++++
>>>>  backends/trace-events         |  1 +
>>>>  5 files changed, 127 insertions(+)
>>>>
>>>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-
>common.h
>>>> index 7419466bca92..2dd468ce3c02 100644
>>>> --- a/include/hw/vfio/vfio-common.h
>>>> +++ b/include/hw/vfio/vfio-common.h
>>>> @@ -95,10 +95,17 @@ typedef struct VFIOHostDMAWindow {
>>>>
>>>>  typedef struct IOMMUFDBackend IOMMUFDBackend;
>>>>
>>>> +typedef struct VFIOIOASHwpt {
>>>> +    uint32_t hwpt_id;
>>>> +    QLIST_HEAD(, VFIODevice) device_list;
>>>> +    QLIST_ENTRY(VFIOIOASHwpt) next;
>>>> +} VFIOIOASHwpt;
>>>> +
>>>>  typedef struct VFIOIOMMUFDContainer {
>>>>      VFIOContainerBase bcontainer;
>>>>      IOMMUFDBackend *be;
>>>>      uint32_t ioas_id;
>>>> +    QLIST_HEAD(, VFIOIOASHwpt) hwpt_list;
>>>>  } VFIOIOMMUFDContainer;
>>>>
>>>>  OBJECT_DECLARE_SIMPLE_TYPE(VFIOIOMMUFDContainer,
>VFIO_IOMMU_IOMMUFD);
>>>> @@ -135,6 +142,8 @@ typedef struct VFIODevice {
>>>>      HostIOMMUDevice *hiod;
>>>>      int devid;
>>>>      IOMMUFDBackend *iommufd;
>>>> +    VFIOIOASHwpt *hwpt;
>>>> +    QLIST_ENTRY(VFIODevice) hwpt_next;
>>>>  } VFIODevice;
>>>>
>>>>  struct VFIODeviceOps {
>>>> diff --git a/include/sysemu/iommufd.h b/include/sysemu/iommufd.h
>>>> index 57d502a1c79a..e917e7591d05 100644
>>>> --- a/include/sysemu/iommufd.h
>>>> +++ b/include/sysemu/iommufd.h
>>>> @@ -50,6 +50,11 @@ int
>iommufd_backend_unmap_dma(IOMMUFDBackend *be, uint32_t ioas_id,
>>>>  bool iommufd_backend_get_device_info(IOMMUFDBackend *be,
>uint32_t devid,
>>>>                                       uint32_t *type, void *data, uint32_t len,
>>>>                                       uint64_t *caps, Error **errp);
>>>> +bool iommufd_backend_alloc_hwpt(IOMMUFDBackend *be, uint32_t
>dev_id,
>>>> +                                uint32_t pt_id, uint32_t flags,
>>>> +                                uint32_t data_type, uint32_t data_len,
>>>> +                                void *data_ptr, uint32_t *out_hwpt,
>>>> +                                Error **errp);
>>>>
>>>>  #define TYPE_HOST_IOMMU_DEVICE_IOMMUFD
>TYPE_HOST_IOMMU_DEVICE "-iommufd"
>>>>  #endif
>>>> diff --git a/backends/iommufd.c b/backends/iommufd.c
>>>> index 2b3d51af26d2..5d3dfa917415 100644
>>>> --- a/backends/iommufd.c
>>>> +++ b/backends/iommufd.c
>>>> @@ -208,6 +208,36 @@ int
>iommufd_backend_unmap_dma(IOMMUFDBackend *be, uint32_t ioas_id,
>>>>      return ret;
>>>>  }
>>>>
>>>> +bool iommufd_backend_alloc_hwpt(IOMMUFDBackend *be, uint32_t
>dev_id,
>>>> +                                uint32_t pt_id, uint32_t flags,
>>>> +                                uint32_t data_type, uint32_t data_len,
>>>> +                                void *data_ptr, uint32_t *out_hwpt,
>>>> +                                Error **errp)
>>>> +{
>>>> +    int ret, fd = be->fd;
>>>> +    struct iommu_hwpt_alloc alloc_hwpt = {
>>>> +        .size = sizeof(struct iommu_hwpt_alloc),
>>>> +        .flags = flags,
>>>> +        .dev_id = dev_id,
>>>> +        .pt_id = pt_id,
>>>> +        .data_type = data_type,
>>>> +        .data_len = data_len,
>>>> +        .data_uptr = (uint64_t)data_ptr,
>>>> +    };
>>>> +
>>>> +    ret = ioctl(fd, IOMMU_HWPT_ALLOC, &alloc_hwpt);
>>>> +    trace_iommufd_backend_alloc_hwpt(fd, dev_id, pt_id, flags,
>data_type,
>>>> +                                     data_len, (uint64_t)data_ptr,
>>>> +                                     alloc_hwpt.out_hwpt_id, ret);
>>>> +    if (ret) {
>>>> +        error_setg_errno(errp, errno, "Failed to allocate hwpt");
>>>> +        return false;
>>>> +    }
>>>> +
>>>> +    *out_hwpt = alloc_hwpt.out_hwpt_id;
>>>> +    return true;
>>>> +}
>>>> +
>>>>  bool iommufd_backend_get_device_info(IOMMUFDBackend *be,
>uint32_t devid,
>>>>                                       uint32_t *type, void *data, uint32_t len,
>>>>                                       uint64_t *caps, Error **errp)
>>>> diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
>>>> index 077dea8f1b64..325c7598d5a1 100644
>>>> --- a/hw/vfio/iommufd.c
>>>> +++ b/hw/vfio/iommufd.c
>>>> @@ -212,10 +212,86 @@ static bool
>iommufd_cdev_detach_ioas_hwpt(VFIODevice *vbasedev, Error **errp)
>>>>      return true;
>>>>  }
>>>>
>>>> +static bool iommufd_cdev_autodomains_get(VFIODevice *vbasedev,
>>>> +                                         VFIOIOMMUFDContainer *container,
>>>> +                                         Error **errp)
>>>> +{
>>>> +    IOMMUFDBackend *iommufd = vbasedev->iommufd;
>>>> +    uint32_t flags = 0;
>>>> +    VFIOIOASHwpt *hwpt;
>>>> +    uint32_t hwpt_id;
>>>> +    int ret;
>>>> +
>>>> +    /* Try to find a domain */
>>>> +    QLIST_FOREACH(hwpt, &container->hwpt_list, next) {
>>>> +        ret = iommufd_cdev_attach_ioas_hwpt(vbasedev, hwpt->hwpt_id,
>errp);
>>>> +        if (ret) {
>>>> +            /* -EINVAL means the domain is incompatible with the device.
>*/
>>>> +            if (ret == -EINVAL) {
>>>> +                /*
>>>> +                 * It is an expected failure and it just means we will try
>>>> +                 * another domain, or create one if no existing compatible
>>>> +                 * domain is found. Hence why the error is discarded below.
>>>> +                 */
>>>> +                error_free(*errp);
>>>> +                *errp = NULL;
>>>> +                continue;
>>>> +            }
>>>> +
>>>> +            return false;
>>>> +        } else {
>>>> +            vbasedev->hwpt = hwpt;
>>>> +            QLIST_INSERT_HEAD(&hwpt->device_list, vbasedev, hwpt_next);
>>>> +            return true;
>>>> +        }
>>>> +    }
>>>> +
>>>> +    if (!iommufd_backend_alloc_hwpt(iommufd, vbasedev->devid,
>>>> +                                    container->ioas_id, flags,
>>>> +                                    IOMMU_HWPT_DATA_NONE, 0, NULL,
>>>> +                                    &hwpt_id, errp)) {
>>>> +        return false;
>>>> +    }
>>>> +
>>>> +    hwpt = g_malloc0(sizeof(*hwpt));
>>>> +    hwpt->hwpt_id = hwpt_id;
>>>> +    QLIST_INIT(&hwpt->device_list);
>>>> +
>>>> +    ret = iommufd_cdev_attach_ioas_hwpt(vbasedev, hwpt->hwpt_id,
>errp);
>>>> +    if (ret) {
>>>> +        iommufd_backend_free_id(container->be, hwpt->hwpt_id);
>>>> +        g_free(hwpt);
>>>> +        return false;
>>>> +    }
>>>> +
>>>> +    vbasedev->hwpt = hwpt;
>>>> +    QLIST_INSERT_HEAD(&hwpt->device_list, vbasedev, hwpt_next);
>>>> +    QLIST_INSERT_HEAD(&container->hwpt_list, hwpt, next);
>>>> +    return true;
>>>> +}
>>>> +
>>>> +static void iommufd_cdev_autodomains_put(VFIODevice *vbasedev,
>>>> +                                         VFIOIOMMUFDContainer *container)
>>>> +{
>>>> +    VFIOIOASHwpt *hwpt = vbasedev->hwpt;
>>>> +
>>>> +    QLIST_REMOVE(vbasedev, hwpt_next);
>>> don't you want to reset vbasedev->hwpt = NULL too?
>>>
>> Yeap, Thanks for catching that
>>
>>>
>>>> +    if (QLIST_EMPTY(&hwpt->device_list)) {
>>>> +        QLIST_REMOVE(hwpt, next);
>>>> +        iommufd_backend_free_id(container->be, hwpt->hwpt_id);
>>>> +        g_free(hwpt);
>>>> +    }
>>>> +}
>>>> +
>>>>  static bool iommufd_cdev_attach_container(VFIODevice *vbasedev,
>>>>                                            VFIOIOMMUFDContainer *container,
>>>>                                            Error **errp)
>>>>  {
>>>> +    /* mdevs aren't physical devices and will fail with auto domains */
>>>> +    if (!vbasedev->mdev) {
>>>> +        return iommufd_cdev_autodomains_get(vbasedev, container,
>errp);
>>>> +    }
>>>> +
>>>>      return !iommufd_cdev_attach_ioas_hwpt(vbasedev, container-
>>ioas_id, errp);
>>>>  }
>>>>
>>>> @@ -224,6 +300,11 @@ static void
>iommufd_cdev_detach_container(VFIODevice *vbasedev,
>>>>  {
>>>>      Error *err = NULL;
>>>>
>>>> +    if (vbasedev->hwpt) {
>>>> +        iommufd_cdev_autodomains_put(vbasedev, container);
>>>> +        return;
>>> Where do we detach the device from the hwpt?
>>>
>> In iommufd_backend_free_id() for auto domains
>>
>
>to clarify here I meant *userspace* auto domains
>
>*kernel* auto domains (mdev) goes via DETACH_IOMMUFD_PT

If the device is still attached to the hwpt, will iommufd_backend_free_id() succeed?
Have you tried the hot unplug?

Thanks
Zhenzhong

>
>>> Thanks
>>>
>>> Eric
>>>> +    }
>>>> +
>>>>      if (!iommufd_cdev_detach_ioas_hwpt(vbasedev, &err)) {
>>>>          error_report_err(err);
>>>>      }
>>>> @@ -354,6 +435,7 @@ static bool iommufd_cdev_attach(const char
>*name, VFIODevice *vbasedev,
>>>>      container =
>VFIO_IOMMU_IOMMUFD(object_new(TYPE_VFIO_IOMMU_IOMMUFD));
>>>>      container->be = vbasedev->iommufd;
>>>>      container->ioas_id = ioas_id;
>>>> +    QLIST_INIT(&container->hwpt_list);
>>>>
>>>>      bcontainer = &container->bcontainer;
>>>>      vfio_address_space_insert(space, bcontainer);
>>>> diff --git a/backends/trace-events b/backends/trace-events
>>>> index 211e6f374adc..4d8ac02fe7d6 100644
>>>> --- a/backends/trace-events
>>>> +++ b/backends/trace-events
>>>> @@ -14,4 +14,5 @@ iommufd_backend_map_dma(int iommufd,
>uint32_t ioas, uint64_t iova, uint64_t size
>>>>  iommufd_backend_unmap_dma_non_exist(int iommufd, uint32_t ioas,
>uint64_t iova, uint64_t size, int ret) " Unmap nonexistent mapping:
>iommufd=%d ioas=%d iova=0x%"PRIx64" size=0x%"PRIx64" (%d)"
>>>>  iommufd_backend_unmap_dma(int iommufd, uint32_t ioas, uint64_t
>iova, uint64_t size, int ret) " iommufd=%d ioas=%d iova=0x%"PRIx64"
>size=0x%"PRIx64" (%d)"
>>>>  iommufd_backend_alloc_ioas(int iommufd, uint32_t ioas) "
>iommufd=%d ioas=%d"
>>>> +iommufd_backend_alloc_hwpt(int iommufd, uint32_t dev_id,
>uint32_t pt_id, uint32_t flags, uint32_t hwpt_type, uint32_t len, uint64_t
>data_ptr, uint32_t out_hwpt_id, int ret) " iommufd=%d dev_id=%u
>pt_id=%u flags=0x%x hwpt_type=%u len=%u data_ptr=0x%"PRIx64"
>out_hwpt=%u (%d)"
>>>>  iommufd_backend_free_id(int iommufd, uint32_t id, int ret) "
>iommufd=%d id=%d (%d)"
>>>
>>
Joao Martins July 17, 2024, 9:04 a.m. UTC | #10
On 17/07/2024 03:18, Duan, Zhenzhong wrote:
> 
> 
>> -----Original Message-----
>> From: Joao Martins <joao.m.martins@oracle.com>
>> Subject: [PATCH v4 05/12] vfio/iommufd: Introduce auto domain creation
>>
>> There's generally two modes of operation for IOMMUFD:
>>
>> * The simple user API which intends to perform relatively simple things
>> with IOMMUs e.g. DPDK. It generally creates an IOAS and attach to VFIO
>> and mainly performs IOAS_MAP and UNMAP.
>>
>> * The native IOMMUFD API where you have fine grained control of the
>> IOMMU domain and model it accordingly. This is where most new feature
>> are being steered to.
>>
>> For dirty tracking 2) is required, as it needs to ensure that
>> the stage-2/parent IOMMU domain will only attach devices
>> that support dirty tracking (so far it is all homogeneous in x86, likely
>> not the case for smmuv3). Such invariant on dirty tracking provides a
>> useful guarantee to VMMs that will refuse incompatible device
>> attachments for IOMMU domains.
>>
>> Dirty tracking insurance is enforced via HWPT_ALLOC, which is
>> responsible for creating an IOMMU domain. This is contrast to the
>> 'simple API' where the IOMMU domain is created by IOMMUFD
>> automatically
>> when it attaches to VFIO (usually referred as autodomains) but it has
>> the needed handling for mdevs.
>>
>> To support dirty tracking with the advanced IOMMUFD API, it needs
>> similar logic, where IOMMU domains are created and devices attached to
>> compatible domains. Essentially mimmicing kernel
>> iommufd_device_auto_get_domain(). With mdevs given there's no IOMMU
>> domain
>> it falls back to IOAS attach.
>>
>> The auto domain logic allows different IOMMU domains to be created when
>> DMA dirty tracking is not desired (and VF can provide it), and others where
>> it is. Here is not used in this way here given how VFIODevice migration
>> state is initialized after the device attachment. But such mixed mode of
>> IOMMU dirty tracking + device dirty tracking is an improvement that can
>> be added on. Keep the 'all of nothing' of type1 approach that we have
>> been using so far between container vs device dirty tracking.
>>
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>> ---
>> include/hw/vfio/vfio-common.h |  9 ++++
>> include/sysemu/iommufd.h      |  5 +++
>> backends/iommufd.c            | 30 +++++++++++++
>> hw/vfio/iommufd.c             | 82
>> +++++++++++++++++++++++++++++++++++
>> backends/trace-events         |  1 +
>> 5 files changed, 127 insertions(+)
>>
>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-
>> common.h
>> index 7419466bca92..2dd468ce3c02 100644
>> --- a/include/hw/vfio/vfio-common.h
>> +++ b/include/hw/vfio/vfio-common.h
>> @@ -95,10 +95,17 @@ typedef struct VFIOHostDMAWindow {
>>
>> typedef struct IOMMUFDBackend IOMMUFDBackend;
>>
>> +typedef struct VFIOIOASHwpt {
>> +    uint32_t hwpt_id;
>> +    QLIST_HEAD(, VFIODevice) device_list;
>> +    QLIST_ENTRY(VFIOIOASHwpt) next;
>> +} VFIOIOASHwpt;
>> +
>> typedef struct VFIOIOMMUFDContainer {
>>     VFIOContainerBase bcontainer;
>>     IOMMUFDBackend *be;
>>     uint32_t ioas_id;
>> +    QLIST_HEAD(, VFIOIOASHwpt) hwpt_list;
>> } VFIOIOMMUFDContainer;
>>
>> OBJECT_DECLARE_SIMPLE_TYPE(VFIOIOMMUFDContainer,
>> VFIO_IOMMU_IOMMUFD);
>> @@ -135,6 +142,8 @@ typedef struct VFIODevice {
>>     HostIOMMUDevice *hiod;
>>     int devid;
>>     IOMMUFDBackend *iommufd;
>> +    VFIOIOASHwpt *hwpt;
>> +    QLIST_ENTRY(VFIODevice) hwpt_next;
>> } VFIODevice;
>>
>> struct VFIODeviceOps {
>> diff --git a/include/sysemu/iommufd.h b/include/sysemu/iommufd.h
>> index 57d502a1c79a..e917e7591d05 100644
>> --- a/include/sysemu/iommufd.h
>> +++ b/include/sysemu/iommufd.h
>> @@ -50,6 +50,11 @@ int
>> iommufd_backend_unmap_dma(IOMMUFDBackend *be, uint32_t ioas_id,
>> bool iommufd_backend_get_device_info(IOMMUFDBackend *be, uint32_t
>> devid,
>>                                      uint32_t *type, void *data, uint32_t len,
>>                                      uint64_t *caps, Error **errp);
>> +bool iommufd_backend_alloc_hwpt(IOMMUFDBackend *be, uint32_t
>> dev_id,
>> +                                uint32_t pt_id, uint32_t flags,
>> +                                uint32_t data_type, uint32_t data_len,
>> +                                void *data_ptr, uint32_t *out_hwpt,
>> +                                Error **errp);
>>
>> #define TYPE_HOST_IOMMU_DEVICE_IOMMUFD
>> TYPE_HOST_IOMMU_DEVICE "-iommufd"
>> #endif
>> diff --git a/backends/iommufd.c b/backends/iommufd.c
>> index 2b3d51af26d2..5d3dfa917415 100644
>> --- a/backends/iommufd.c
>> +++ b/backends/iommufd.c
>> @@ -208,6 +208,36 @@ int
>> iommufd_backend_unmap_dma(IOMMUFDBackend *be, uint32_t ioas_id,
>>     return ret;
>> }
>>
>> +bool iommufd_backend_alloc_hwpt(IOMMUFDBackend *be, uint32_t
>> dev_id,
>> +                                uint32_t pt_id, uint32_t flags,
>> +                                uint32_t data_type, uint32_t data_len,
>> +                                void *data_ptr, uint32_t *out_hwpt,
>> +                                Error **errp)
>> +{
>> +    int ret, fd = be->fd;
>> +    struct iommu_hwpt_alloc alloc_hwpt = {
>> +        .size = sizeof(struct iommu_hwpt_alloc),
>> +        .flags = flags,
>> +        .dev_id = dev_id,
>> +        .pt_id = pt_id,
>> +        .data_type = data_type,
>> +        .data_len = data_len,
>> +        .data_uptr = (uint64_t)data_ptr,
>> +    };
>> +
>> +    ret = ioctl(fd, IOMMU_HWPT_ALLOC, &alloc_hwpt);
>> +    trace_iommufd_backend_alloc_hwpt(fd, dev_id, pt_id, flags, data_type,
>> +                                     data_len, (uint64_t)data_ptr,
>> +                                     alloc_hwpt.out_hwpt_id, ret);
>> +    if (ret) {
>> +        error_setg_errno(errp, errno, "Failed to allocate hwpt");
>> +        return false;
>> +    }
>> +
>> +    *out_hwpt = alloc_hwpt.out_hwpt_id;
>> +    return true;
>> +}
>> +
>> bool iommufd_backend_get_device_info(IOMMUFDBackend *be, uint32_t
>> devid,
>>                                      uint32_t *type, void *data, uint32_t len,
>>                                      uint64_t *caps, Error **errp)
>> diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
>> index 077dea8f1b64..325c7598d5a1 100644
>> --- a/hw/vfio/iommufd.c
>> +++ b/hw/vfio/iommufd.c
>> @@ -212,10 +212,86 @@ static bool
>> iommufd_cdev_detach_ioas_hwpt(VFIODevice *vbasedev, Error **errp)
>>     return true;
>> }
>>
>> +static bool iommufd_cdev_autodomains_get(VFIODevice *vbasedev,
>> +                                         VFIOIOMMUFDContainer *container,
>> +                                         Error **errp)
>> +{
>> +    IOMMUFDBackend *iommufd = vbasedev->iommufd;
>> +    uint32_t flags = 0;
>> +    VFIOIOASHwpt *hwpt;
>> +    uint32_t hwpt_id;
>> +    int ret;
>> +
>> +    /* Try to find a domain */
>> +    QLIST_FOREACH(hwpt, &container->hwpt_list, next) {
>> +        ret = iommufd_cdev_attach_ioas_hwpt(vbasedev, hwpt->hwpt_id,
>> errp);
> 
> If there is already an hwpt that supports dirty tracking.
> Another device that doesn't support dirty tracking attaches to this hwpt, will it succeed?
>

It returns -EINVAL, and we handle that right after this statement. Which means
another HWPT is created.

> If existing hwpt doesn't support dirty tracking.
> Another device supporting dirty tracking attaches to that hwpt, what will happen?
> 

Hmm, It succeeds as there's no incompatbility. At the very least I plan on
blocking migration if the device neither has VF dirty tracking, nor IOMMU dirty
tracking (and patch 11 needs to be adjusted to check hwpt_flags instead of
container).

Qemu right now doesn't handle heteregenous environment, it's all of nothing
approach even before this patchset. Additionally, I am not sure server
environments are applicable here. So essentially I kept the status quo -- more
follow-up is needed to support a mix and match of IOMMU + VF dirty tracking too.
The challenge is having the migration state of VFIO device initialized early
enough that we can make all sort of decisions whether IOMMU dirty tracking is
desired on a per-device basis.
Joao Martins July 17, 2024, 9:09 a.m. UTC | #11
On 17/07/2024 03:52, Duan, Zhenzhong wrote:
> 
> 
>> -----Original Message-----
>> From: Joao Martins <joao.m.martins@oracle.com>
>> Subject: Re: [PATCH v4 05/12] vfio/iommufd: Introduce auto domain
>> creation
>>
>> On 16/07/2024 17:44, Joao Martins wrote:
>>> On 16/07/2024 17:04, Eric Auger wrote:
>>>> Hi Joao,
>>>>
>>>> On 7/12/24 13:46, Joao Martins wrote:
>>>>> There's generally two modes of operation for IOMMUFD:
>>>>>
>>>>> * The simple user API which intends to perform relatively simple things
>>>>> with IOMMUs e.g. DPDK. It generally creates an IOAS and attach to VFIO
>>>>
>>>> It generally creates? can you explicit what is "it"
>>>>
>>> 'It' here refers to the process/API-user
>>>
>>>> I am confused by this automatic terminology again (not your fault). the
>> doc says:
>>>> "
>>>>
>>>>   *
>>>>
>>>>     Automatic domain - refers to an iommu domain created automatically
>>>>     when attaching a device to an IOAS object. This is compatible to the
>>>>     semantics of VFIO type1.
>>>>
>>>>   *
>>>>
>>>>     Manual domain - refers to an iommu domain designated by the user as
>>>>     the target pagetable to be attached to by a device. Though currently
>>>>     there are no uAPIs to directly create such domain, the datastructure
>>>>     and algorithms are ready for handling that use case.
>>>>
>>>> "
>>>>
>>>>
>>>> in 1) the device is attached to the ioas id (using the auto domain if I am
>> not wrong)
>>>> Here you attach to an hwpt id. Isn't it a manual domain?
>>>>
>>>
>>> Correct.
>>>
>>> The 'auto domains' generally refers to the kernel-equivalent own
>> automatic
>>> attaching to a new pagetable.
>>>
>>> Here I call 'auto domains' in the userspace version too because we are
>> doing the
>>> exact same but from userspace, using the manual API in IOMMUFD.
>>>
>>>>> and mainly performs IOAS_MAP and UNMAP.
>>>>>
>>>>> * The native IOMMUFD API where you have fine grained control of the
>>>>> IOMMU domain and model it accordingly. This is where most new
>> feature
>>>>> are being steered to.
>>>>>
>>>>> For dirty tracking 2) is required, as it needs to ensure that
>>>>> the stage-2/parent IOMMU domain will only attach devices
>>>>> that support dirty tracking (so far it is all homogeneous in x86, likely
>>>>> not the case for smmuv3). Such invariant on dirty tracking provides a
>>>>> useful guarantee to VMMs that will refuse incompatible device
>>>>> attachments for IOMMU domains.
>>>>>
>>>>> Dirty tracking insurance is enforced via HWPT_ALLOC, which is
>>>>> responsible for creating an IOMMU domain. This is contrast to the
>>>>> 'simple API' where the IOMMU domain is created by IOMMUFD
>> automatically
>>>>> when it attaches to VFIO (usually referred as autodomains) but it has
>>>>> the needed handling for mdevs.
>>>>>
>>>>> To support dirty tracking with the advanced IOMMUFD API, it needs
>>>>> similar logic, where IOMMU domains are created and devices attached
>> to
>>>>> compatible domains. Essentially mimmicing kernel
>>>>> iommufd_device_auto_get_domain(). With mdevs given there's no
>> IOMMU domain
>>>>> it falls back to IOAS attach.
>>>>>
>>>>> The auto domain logic allows different IOMMU domains to be created
>> when
>>>>> DMA dirty tracking is not desired (and VF can provide it), and others
>> where
>>>>> it is. Here is not used in this way here given how VFIODevice migration
>>>>
>>>> Here is not used in this way here ?
>>>>
>>>
>>> I meant, 'Here it is not used in this way given (...)'
>>>
>>>>> state is initialized after the device attachment. But such mixed mode of
>>>>> IOMMU dirty tracking + device dirty tracking is an improvement that
>> can
>>>>> be added on. Keep the 'all of nothing' of type1 approach that we have
>>>>> been using so far between container vs device dirty tracking.
>>>>>
>>>>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>>>>> ---
>>>>>  include/hw/vfio/vfio-common.h |  9 ++++
>>>>>  include/sysemu/iommufd.h      |  5 +++
>>>>>  backends/iommufd.c            | 30 +++++++++++++
>>>>>  hw/vfio/iommufd.c             | 82
>> +++++++++++++++++++++++++++++++++++
>>>>>  backends/trace-events         |  1 +
>>>>>  5 files changed, 127 insertions(+)
>>>>>
>>>>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-
>> common.h
>>>>> index 7419466bca92..2dd468ce3c02 100644
>>>>> --- a/include/hw/vfio/vfio-common.h
>>>>> +++ b/include/hw/vfio/vfio-common.h
>>>>> @@ -95,10 +95,17 @@ typedef struct VFIOHostDMAWindow {
>>>>>
>>>>>  typedef struct IOMMUFDBackend IOMMUFDBackend;
>>>>>
>>>>> +typedef struct VFIOIOASHwpt {
>>>>> +    uint32_t hwpt_id;
>>>>> +    QLIST_HEAD(, VFIODevice) device_list;
>>>>> +    QLIST_ENTRY(VFIOIOASHwpt) next;
>>>>> +} VFIOIOASHwpt;
>>>>> +
>>>>>  typedef struct VFIOIOMMUFDContainer {
>>>>>      VFIOContainerBase bcontainer;
>>>>>      IOMMUFDBackend *be;
>>>>>      uint32_t ioas_id;
>>>>> +    QLIST_HEAD(, VFIOIOASHwpt) hwpt_list;
>>>>>  } VFIOIOMMUFDContainer;
>>>>>
>>>>>  OBJECT_DECLARE_SIMPLE_TYPE(VFIOIOMMUFDContainer,
>> VFIO_IOMMU_IOMMUFD);
>>>>> @@ -135,6 +142,8 @@ typedef struct VFIODevice {
>>>>>      HostIOMMUDevice *hiod;
>>>>>      int devid;
>>>>>      IOMMUFDBackend *iommufd;
>>>>> +    VFIOIOASHwpt *hwpt;
>>>>> +    QLIST_ENTRY(VFIODevice) hwpt_next;
>>>>>  } VFIODevice;
>>>>>
>>>>>  struct VFIODeviceOps {
>>>>> diff --git a/include/sysemu/iommufd.h b/include/sysemu/iommufd.h
>>>>> index 57d502a1c79a..e917e7591d05 100644
>>>>> --- a/include/sysemu/iommufd.h
>>>>> +++ b/include/sysemu/iommufd.h
>>>>> @@ -50,6 +50,11 @@ int
>> iommufd_backend_unmap_dma(IOMMUFDBackend *be, uint32_t ioas_id,
>>>>>  bool iommufd_backend_get_device_info(IOMMUFDBackend *be,
>> uint32_t devid,
>>>>>                                       uint32_t *type, void *data, uint32_t len,
>>>>>                                       uint64_t *caps, Error **errp);
>>>>> +bool iommufd_backend_alloc_hwpt(IOMMUFDBackend *be, uint32_t
>> dev_id,
>>>>> +                                uint32_t pt_id, uint32_t flags,
>>>>> +                                uint32_t data_type, uint32_t data_len,
>>>>> +                                void *data_ptr, uint32_t *out_hwpt,
>>>>> +                                Error **errp);
>>>>>
>>>>>  #define TYPE_HOST_IOMMU_DEVICE_IOMMUFD
>> TYPE_HOST_IOMMU_DEVICE "-iommufd"
>>>>>  #endif
>>>>> diff --git a/backends/iommufd.c b/backends/iommufd.c
>>>>> index 2b3d51af26d2..5d3dfa917415 100644
>>>>> --- a/backends/iommufd.c
>>>>> +++ b/backends/iommufd.c
>>>>> @@ -208,6 +208,36 @@ int
>> iommufd_backend_unmap_dma(IOMMUFDBackend *be, uint32_t ioas_id,
>>>>>      return ret;
>>>>>  }
>>>>>
>>>>> +bool iommufd_backend_alloc_hwpt(IOMMUFDBackend *be, uint32_t
>> dev_id,
>>>>> +                                uint32_t pt_id, uint32_t flags,
>>>>> +                                uint32_t data_type, uint32_t data_len,
>>>>> +                                void *data_ptr, uint32_t *out_hwpt,
>>>>> +                                Error **errp)
>>>>> +{
>>>>> +    int ret, fd = be->fd;
>>>>> +    struct iommu_hwpt_alloc alloc_hwpt = {
>>>>> +        .size = sizeof(struct iommu_hwpt_alloc),
>>>>> +        .flags = flags,
>>>>> +        .dev_id = dev_id,
>>>>> +        .pt_id = pt_id,
>>>>> +        .data_type = data_type,
>>>>> +        .data_len = data_len,
>>>>> +        .data_uptr = (uint64_t)data_ptr,
>>>>> +    };
>>>>> +
>>>>> +    ret = ioctl(fd, IOMMU_HWPT_ALLOC, &alloc_hwpt);
>>>>> +    trace_iommufd_backend_alloc_hwpt(fd, dev_id, pt_id, flags,
>> data_type,
>>>>> +                                     data_len, (uint64_t)data_ptr,
>>>>> +                                     alloc_hwpt.out_hwpt_id, ret);
>>>>> +    if (ret) {
>>>>> +        error_setg_errno(errp, errno, "Failed to allocate hwpt");
>>>>> +        return false;
>>>>> +    }
>>>>> +
>>>>> +    *out_hwpt = alloc_hwpt.out_hwpt_id;
>>>>> +    return true;
>>>>> +}
>>>>> +
>>>>>  bool iommufd_backend_get_device_info(IOMMUFDBackend *be,
>> uint32_t devid,
>>>>>                                       uint32_t *type, void *data, uint32_t len,
>>>>>                                       uint64_t *caps, Error **errp)
>>>>> diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
>>>>> index 077dea8f1b64..325c7598d5a1 100644
>>>>> --- a/hw/vfio/iommufd.c
>>>>> +++ b/hw/vfio/iommufd.c
>>>>> @@ -212,10 +212,86 @@ static bool
>> iommufd_cdev_detach_ioas_hwpt(VFIODevice *vbasedev, Error **errp)
>>>>>      return true;
>>>>>  }
>>>>>
>>>>> +static bool iommufd_cdev_autodomains_get(VFIODevice *vbasedev,
>>>>> +                                         VFIOIOMMUFDContainer *container,
>>>>> +                                         Error **errp)
>>>>> +{
>>>>> +    IOMMUFDBackend *iommufd = vbasedev->iommufd;
>>>>> +    uint32_t flags = 0;
>>>>> +    VFIOIOASHwpt *hwpt;
>>>>> +    uint32_t hwpt_id;
>>>>> +    int ret;
>>>>> +
>>>>> +    /* Try to find a domain */
>>>>> +    QLIST_FOREACH(hwpt, &container->hwpt_list, next) {
>>>>> +        ret = iommufd_cdev_attach_ioas_hwpt(vbasedev, hwpt->hwpt_id,
>> errp);
>>>>> +        if (ret) {
>>>>> +            /* -EINVAL means the domain is incompatible with the device.
>> */
>>>>> +            if (ret == -EINVAL) {
>>>>> +                /*
>>>>> +                 * It is an expected failure and it just means we will try
>>>>> +                 * another domain, or create one if no existing compatible
>>>>> +                 * domain is found. Hence why the error is discarded below.
>>>>> +                 */
>>>>> +                error_free(*errp);
>>>>> +                *errp = NULL;
>>>>> +                continue;
>>>>> +            }
>>>>> +
>>>>> +            return false;
>>>>> +        } else {
>>>>> +            vbasedev->hwpt = hwpt;
>>>>> +            QLIST_INSERT_HEAD(&hwpt->device_list, vbasedev, hwpt_next);
>>>>> +            return true;
>>>>> +        }
>>>>> +    }
>>>>> +
>>>>> +    if (!iommufd_backend_alloc_hwpt(iommufd, vbasedev->devid,
>>>>> +                                    container->ioas_id, flags,
>>>>> +                                    IOMMU_HWPT_DATA_NONE, 0, NULL,
>>>>> +                                    &hwpt_id, errp)) {
>>>>> +        return false;
>>>>> +    }
>>>>> +
>>>>> +    hwpt = g_malloc0(sizeof(*hwpt));
>>>>> +    hwpt->hwpt_id = hwpt_id;
>>>>> +    QLIST_INIT(&hwpt->device_list);
>>>>> +
>>>>> +    ret = iommufd_cdev_attach_ioas_hwpt(vbasedev, hwpt->hwpt_id,
>> errp);
>>>>> +    if (ret) {
>>>>> +        iommufd_backend_free_id(container->be, hwpt->hwpt_id);
>>>>> +        g_free(hwpt);
>>>>> +        return false;
>>>>> +    }
>>>>> +
>>>>> +    vbasedev->hwpt = hwpt;
>>>>> +    QLIST_INSERT_HEAD(&hwpt->device_list, vbasedev, hwpt_next);
>>>>> +    QLIST_INSERT_HEAD(&container->hwpt_list, hwpt, next);
>>>>> +    return true;
>>>>> +}
>>>>> +
>>>>> +static void iommufd_cdev_autodomains_put(VFIODevice *vbasedev,
>>>>> +                                         VFIOIOMMUFDContainer *container)
>>>>> +{
>>>>> +    VFIOIOASHwpt *hwpt = vbasedev->hwpt;
>>>>> +
>>>>> +    QLIST_REMOVE(vbasedev, hwpt_next);
>>>> don't you want to reset vbasedev->hwpt = NULL too?
>>>>
>>> Yeap, Thanks for catching that
>>>
>>>>
>>>>> +    if (QLIST_EMPTY(&hwpt->device_list)) {
>>>>> +        QLIST_REMOVE(hwpt, next);
>>>>> +        iommufd_backend_free_id(container->be, hwpt->hwpt_id);
>>>>> +        g_free(hwpt);
>>>>> +    }
>>>>> +}
>>>>> +
>>>>>  static bool iommufd_cdev_attach_container(VFIODevice *vbasedev,
>>>>>                                            VFIOIOMMUFDContainer *container,
>>>>>                                            Error **errp)
>>>>>  {
>>>>> +    /* mdevs aren't physical devices and will fail with auto domains */
>>>>> +    if (!vbasedev->mdev) {
>>>>> +        return iommufd_cdev_autodomains_get(vbasedev, container,
>> errp);
>>>>> +    }
>>>>> +
>>>>>      return !iommufd_cdev_attach_ioas_hwpt(vbasedev, container-
>>> ioas_id, errp);
>>>>>  }
>>>>>
>>>>> @@ -224,6 +300,11 @@ static void
>> iommufd_cdev_detach_container(VFIODevice *vbasedev,
>>>>>  {
>>>>>      Error *err = NULL;
>>>>>
>>>>> +    if (vbasedev->hwpt) {
>>>>> +        iommufd_cdev_autodomains_put(vbasedev, container);
>>>>> +        return;
>>>> Where do we detach the device from the hwpt?
>>>>
>>> In iommufd_backend_free_id() for auto domains
>>>
>>
>> to clarify here I meant *userspace* auto domains
>>
>> *kernel* auto domains (mdev) goes via DETACH_IOMMUFD_PT
> 
> If the device is still attached to the hwpt, will iommufd_backend_free_id() succeed?
> Have you tried the hot unplug?
> 

I have but I didn't see any errors. But I will check again for v5 as it could
also be my oversight.

I was thinking about Eric's remark overnight and I think what I am doing is not
correct regardless of the above.

I should be calling DETACH_IOMMUFD_PT pairing with ATTACH_IOMMUFD_PT, and the
iommufd_backend_free_id() is to drop the final reference pairing with
alloc_hwpt() when the device list is empty i.e. when there's no more devices in
that vdev::hwpt.

DETACH_IOMMUFD_PT decrement the hwpt refcount and it doesn't differentiate
between auto domains vs manual domains.

The code is already there anyhow it just has the order of
iommufd_cdev_autodomains_put vs detach invocation reversed; I'll fix that for
next version.
Cédric Le Goater July 17, 2024, 9:28 a.m. UTC | #12
On 7/17/24 11:09, Joao Martins wrote:
> On 17/07/2024 03:52, Duan, Zhenzhong wrote:
>>
>>
>>> -----Original Message-----
>>> From: Joao Martins <joao.m.martins@oracle.com>
>>> Subject: Re: [PATCH v4 05/12] vfio/iommufd: Introduce auto domain
>>> creation
>>>
>>> On 16/07/2024 17:44, Joao Martins wrote:
>>>> On 16/07/2024 17:04, Eric Auger wrote:
>>>>> Hi Joao,
>>>>>
>>>>> On 7/12/24 13:46, Joao Martins wrote:
>>>>>> There's generally two modes of operation for IOMMUFD:
>>>>>>
>>>>>> * The simple user API which intends to perform relatively simple things
>>>>>> with IOMMUs e.g. DPDK. It generally creates an IOAS and attach to VFIO
>>>>>
>>>>> It generally creates? can you explicit what is "it"
>>>>>
>>>> 'It' here refers to the process/API-user
>>>>
>>>>> I am confused by this automatic terminology again (not your fault). the
>>> doc says:
>>>>> "
>>>>>
>>>>>    *
>>>>>
>>>>>      Automatic domain - refers to an iommu domain created automatically
>>>>>      when attaching a device to an IOAS object. This is compatible to the
>>>>>      semantics of VFIO type1.
>>>>>
>>>>>    *
>>>>>
>>>>>      Manual domain - refers to an iommu domain designated by the user as
>>>>>      the target pagetable to be attached to by a device. Though currently
>>>>>      there are no uAPIs to directly create such domain, the datastructure
>>>>>      and algorithms are ready for handling that use case.
>>>>>
>>>>> "
>>>>>
>>>>>
>>>>> in 1) the device is attached to the ioas id (using the auto domain if I am
>>> not wrong)
>>>>> Here you attach to an hwpt id. Isn't it a manual domain?
>>>>>
>>>>
>>>> Correct.
>>>>
>>>> The 'auto domains' generally refers to the kernel-equivalent own
>>> automatic
>>>> attaching to a new pagetable.
>>>>
>>>> Here I call 'auto domains' in the userspace version too because we are
>>> doing the
>>>> exact same but from userspace, using the manual API in IOMMUFD.
>>>>
>>>>>> and mainly performs IOAS_MAP and UNMAP.
>>>>>>
>>>>>> * The native IOMMUFD API where you have fine grained control of the
>>>>>> IOMMU domain and model it accordingly. This is where most new
>>> feature
>>>>>> are being steered to.
>>>>>>
>>>>>> For dirty tracking 2) is required, as it needs to ensure that
>>>>>> the stage-2/parent IOMMU domain will only attach devices
>>>>>> that support dirty tracking (so far it is all homogeneous in x86, likely
>>>>>> not the case for smmuv3). Such invariant on dirty tracking provides a
>>>>>> useful guarantee to VMMs that will refuse incompatible device
>>>>>> attachments for IOMMU domains.
>>>>>>
>>>>>> Dirty tracking insurance is enforced via HWPT_ALLOC, which is
>>>>>> responsible for creating an IOMMU domain. This is contrast to the
>>>>>> 'simple API' where the IOMMU domain is created by IOMMUFD
>>> automatically
>>>>>> when it attaches to VFIO (usually referred as autodomains) but it has
>>>>>> the needed handling for mdevs.
>>>>>>
>>>>>> To support dirty tracking with the advanced IOMMUFD API, it needs
>>>>>> similar logic, where IOMMU domains are created and devices attached
>>> to
>>>>>> compatible domains. Essentially mimmicing kernel
>>>>>> iommufd_device_auto_get_domain(). With mdevs given there's no
>>> IOMMU domain
>>>>>> it falls back to IOAS attach.
>>>>>>
>>>>>> The auto domain logic allows different IOMMU domains to be created
>>> when
>>>>>> DMA dirty tracking is not desired (and VF can provide it), and others
>>> where
>>>>>> it is. Here is not used in this way here given how VFIODevice migration
>>>>>
>>>>> Here is not used in this way here ?
>>>>>
>>>>
>>>> I meant, 'Here it is not used in this way given (...)'
>>>>
>>>>>> state is initialized after the device attachment. But such mixed mode of
>>>>>> IOMMU dirty tracking + device dirty tracking is an improvement that
>>> can
>>>>>> be added on. Keep the 'all of nothing' of type1 approach that we have
>>>>>> been using so far between container vs device dirty tracking.
>>>>>>
>>>>>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>>>>>> ---
>>>>>>   include/hw/vfio/vfio-common.h |  9 ++++
>>>>>>   include/sysemu/iommufd.h      |  5 +++
>>>>>>   backends/iommufd.c            | 30 +++++++++++++
>>>>>>   hw/vfio/iommufd.c             | 82
>>> +++++++++++++++++++++++++++++++++++
>>>>>>   backends/trace-events         |  1 +
>>>>>>   5 files changed, 127 insertions(+)
>>>>>>
>>>>>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-
>>> common.h
>>>>>> index 7419466bca92..2dd468ce3c02 100644
>>>>>> --- a/include/hw/vfio/vfio-common.h
>>>>>> +++ b/include/hw/vfio/vfio-common.h
>>>>>> @@ -95,10 +95,17 @@ typedef struct VFIOHostDMAWindow {
>>>>>>
>>>>>>   typedef struct IOMMUFDBackend IOMMUFDBackend;
>>>>>>
>>>>>> +typedef struct VFIOIOASHwpt {
>>>>>> +    uint32_t hwpt_id;
>>>>>> +    QLIST_HEAD(, VFIODevice) device_list;
>>>>>> +    QLIST_ENTRY(VFIOIOASHwpt) next;
>>>>>> +} VFIOIOASHwpt;
>>>>>> +
>>>>>>   typedef struct VFIOIOMMUFDContainer {
>>>>>>       VFIOContainerBase bcontainer;
>>>>>>       IOMMUFDBackend *be;
>>>>>>       uint32_t ioas_id;
>>>>>> +    QLIST_HEAD(, VFIOIOASHwpt) hwpt_list;
>>>>>>   } VFIOIOMMUFDContainer;
>>>>>>
>>>>>>   OBJECT_DECLARE_SIMPLE_TYPE(VFIOIOMMUFDContainer,
>>> VFIO_IOMMU_IOMMUFD);
>>>>>> @@ -135,6 +142,8 @@ typedef struct VFIODevice {
>>>>>>       HostIOMMUDevice *hiod;
>>>>>>       int devid;
>>>>>>       IOMMUFDBackend *iommufd;
>>>>>> +    VFIOIOASHwpt *hwpt;
>>>>>> +    QLIST_ENTRY(VFIODevice) hwpt_next;
>>>>>>   } VFIODevice;
>>>>>>
>>>>>>   struct VFIODeviceOps {
>>>>>> diff --git a/include/sysemu/iommufd.h b/include/sysemu/iommufd.h
>>>>>> index 57d502a1c79a..e917e7591d05 100644
>>>>>> --- a/include/sysemu/iommufd.h
>>>>>> +++ b/include/sysemu/iommufd.h
>>>>>> @@ -50,6 +50,11 @@ int
>>> iommufd_backend_unmap_dma(IOMMUFDBackend *be, uint32_t ioas_id,
>>>>>>   bool iommufd_backend_get_device_info(IOMMUFDBackend *be,
>>> uint32_t devid,
>>>>>>                                        uint32_t *type, void *data, uint32_t len,
>>>>>>                                        uint64_t *caps, Error **errp);
>>>>>> +bool iommufd_backend_alloc_hwpt(IOMMUFDBackend *be, uint32_t
>>> dev_id,
>>>>>> +                                uint32_t pt_id, uint32_t flags,
>>>>>> +                                uint32_t data_type, uint32_t data_len,
>>>>>> +                                void *data_ptr, uint32_t *out_hwpt,
>>>>>> +                                Error **errp);
>>>>>>
>>>>>>   #define TYPE_HOST_IOMMU_DEVICE_IOMMUFD
>>> TYPE_HOST_IOMMU_DEVICE "-iommufd"
>>>>>>   #endif
>>>>>> diff --git a/backends/iommufd.c b/backends/iommufd.c
>>>>>> index 2b3d51af26d2..5d3dfa917415 100644
>>>>>> --- a/backends/iommufd.c
>>>>>> +++ b/backends/iommufd.c
>>>>>> @@ -208,6 +208,36 @@ int
>>> iommufd_backend_unmap_dma(IOMMUFDBackend *be, uint32_t ioas_id,
>>>>>>       return ret;
>>>>>>   }
>>>>>>
>>>>>> +bool iommufd_backend_alloc_hwpt(IOMMUFDBackend *be, uint32_t
>>> dev_id,
>>>>>> +                                uint32_t pt_id, uint32_t flags,
>>>>>> +                                uint32_t data_type, uint32_t data_len,
>>>>>> +                                void *data_ptr, uint32_t *out_hwpt,
>>>>>> +                                Error **errp)
>>>>>> +{
>>>>>> +    int ret, fd = be->fd;
>>>>>> +    struct iommu_hwpt_alloc alloc_hwpt = {
>>>>>> +        .size = sizeof(struct iommu_hwpt_alloc),
>>>>>> +        .flags = flags,
>>>>>> +        .dev_id = dev_id,
>>>>>> +        .pt_id = pt_id,
>>>>>> +        .data_type = data_type,
>>>>>> +        .data_len = data_len,
>>>>>> +        .data_uptr = (uint64_t)data_ptr,
>>>>>> +    };
>>>>>> +
>>>>>> +    ret = ioctl(fd, IOMMU_HWPT_ALLOC, &alloc_hwpt);
>>>>>> +    trace_iommufd_backend_alloc_hwpt(fd, dev_id, pt_id, flags,
>>> data_type,
>>>>>> +                                     data_len, (uint64_t)data_ptr,
>>>>>> +                                     alloc_hwpt.out_hwpt_id, ret);
>>>>>> +    if (ret) {
>>>>>> +        error_setg_errno(errp, errno, "Failed to allocate hwpt");
>>>>>> +        return false;
>>>>>> +    }
>>>>>> +
>>>>>> +    *out_hwpt = alloc_hwpt.out_hwpt_id;
>>>>>> +    return true;
>>>>>> +}
>>>>>> +
>>>>>>   bool iommufd_backend_get_device_info(IOMMUFDBackend *be,
>>> uint32_t devid,
>>>>>>                                        uint32_t *type, void *data, uint32_t len,
>>>>>>                                        uint64_t *caps, Error **errp)
>>>>>> diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
>>>>>> index 077dea8f1b64..325c7598d5a1 100644
>>>>>> --- a/hw/vfio/iommufd.c
>>>>>> +++ b/hw/vfio/iommufd.c
>>>>>> @@ -212,10 +212,86 @@ static bool
>>> iommufd_cdev_detach_ioas_hwpt(VFIODevice *vbasedev, Error **errp)
>>>>>>       return true;
>>>>>>   }
>>>>>>
>>>>>> +static bool iommufd_cdev_autodomains_get(VFIODevice *vbasedev,
>>>>>> +                                         VFIOIOMMUFDContainer *container,
>>>>>> +                                         Error **errp)
>>>>>> +{
>>>>>> +    IOMMUFDBackend *iommufd = vbasedev->iommufd;
>>>>>> +    uint32_t flags = 0;
>>>>>> +    VFIOIOASHwpt *hwpt;
>>>>>> +    uint32_t hwpt_id;
>>>>>> +    int ret;
>>>>>> +
>>>>>> +    /* Try to find a domain */
>>>>>> +    QLIST_FOREACH(hwpt, &container->hwpt_list, next) {
>>>>>> +        ret = iommufd_cdev_attach_ioas_hwpt(vbasedev, hwpt->hwpt_id,
>>> errp);
>>>>>> +        if (ret) {
>>>>>> +            /* -EINVAL means the domain is incompatible with the device.
>>> */
>>>>>> +            if (ret == -EINVAL) {
>>>>>> +                /*
>>>>>> +                 * It is an expected failure and it just means we will try
>>>>>> +                 * another domain, or create one if no existing compatible
>>>>>> +                 * domain is found. Hence why the error is discarded below.
>>>>>> +                 */
>>>>>> +                error_free(*errp);
>>>>>> +                *errp = NULL;
>>>>>> +                continue;
>>>>>> +            }
>>>>>> +
>>>>>> +            return false;
>>>>>> +        } else {
>>>>>> +            vbasedev->hwpt = hwpt;
>>>>>> +            QLIST_INSERT_HEAD(&hwpt->device_list, vbasedev, hwpt_next);
>>>>>> +            return true;
>>>>>> +        }
>>>>>> +    }
>>>>>> +
>>>>>> +    if (!iommufd_backend_alloc_hwpt(iommufd, vbasedev->devid,
>>>>>> +                                    container->ioas_id, flags,
>>>>>> +                                    IOMMU_HWPT_DATA_NONE, 0, NULL,
>>>>>> +                                    &hwpt_id, errp)) {
>>>>>> +        return false;
>>>>>> +    }
>>>>>> +
>>>>>> +    hwpt = g_malloc0(sizeof(*hwpt));
>>>>>> +    hwpt->hwpt_id = hwpt_id;
>>>>>> +    QLIST_INIT(&hwpt->device_list);
>>>>>> +
>>>>>> +    ret = iommufd_cdev_attach_ioas_hwpt(vbasedev, hwpt->hwpt_id,
>>> errp);
>>>>>> +    if (ret) {
>>>>>> +        iommufd_backend_free_id(container->be, hwpt->hwpt_id);
>>>>>> +        g_free(hwpt);
>>>>>> +        return false;
>>>>>> +    }
>>>>>> +
>>>>>> +    vbasedev->hwpt = hwpt;
>>>>>> +    QLIST_INSERT_HEAD(&hwpt->device_list, vbasedev, hwpt_next);
>>>>>> +    QLIST_INSERT_HEAD(&container->hwpt_list, hwpt, next);
>>>>>> +    return true;
>>>>>> +}
>>>>>> +
>>>>>> +static void iommufd_cdev_autodomains_put(VFIODevice *vbasedev,
>>>>>> +                                         VFIOIOMMUFDContainer *container)
>>>>>> +{
>>>>>> +    VFIOIOASHwpt *hwpt = vbasedev->hwpt;
>>>>>> +
>>>>>> +    QLIST_REMOVE(vbasedev, hwpt_next);
>>>>> don't you want to reset vbasedev->hwpt = NULL too?
>>>>>
>>>> Yeap, Thanks for catching that
>>>>
>>>>>
>>>>>> +    if (QLIST_EMPTY(&hwpt->device_list)) {
>>>>>> +        QLIST_REMOVE(hwpt, next);
>>>>>> +        iommufd_backend_free_id(container->be, hwpt->hwpt_id);
>>>>>> +        g_free(hwpt);
>>>>>> +    }
>>>>>> +}
>>>>>> +
>>>>>>   static bool iommufd_cdev_attach_container(VFIODevice *vbasedev,
>>>>>>                                             VFIOIOMMUFDContainer *container,
>>>>>>                                             Error **errp)
>>>>>>   {
>>>>>> +    /* mdevs aren't physical devices and will fail with auto domains */
>>>>>> +    if (!vbasedev->mdev) {
>>>>>> +        return iommufd_cdev_autodomains_get(vbasedev, container,
>>> errp);
>>>>>> +    }
>>>>>> +
>>>>>>       return !iommufd_cdev_attach_ioas_hwpt(vbasedev, container-
>>>> ioas_id, errp);
>>>>>>   }
>>>>>>
>>>>>> @@ -224,6 +300,11 @@ static void
>>> iommufd_cdev_detach_container(VFIODevice *vbasedev,
>>>>>>   {
>>>>>>       Error *err = NULL;
>>>>>>
>>>>>> +    if (vbasedev->hwpt) {
>>>>>> +        iommufd_cdev_autodomains_put(vbasedev, container);
>>>>>> +        return;
>>>>> Where do we detach the device from the hwpt?
>>>>>
>>>> In iommufd_backend_free_id() for auto domains
>>>>
>>>
>>> to clarify here I meant *userspace* auto domains
>>>
>>> *kernel* auto domains (mdev) goes via DETACH_IOMMUFD_PT
>>
>> If the device is still attached to the hwpt, will iommufd_backend_free_id() succeed?
>> Have you tried the hot unplug?
>>
> 
> I have but I didn't see any errors. But I will check again for v5 as it could
> also be my oversight.
> 
> I was thinking about Eric's remark overnight and I think what I am doing is not
> correct regardless of the above.
> 
> I should be calling DETACH_IOMMUFD_PT pairing with ATTACH_IOMMUFD_PT, and the
> iommufd_backend_free_id() is to drop the final reference pairing with
> alloc_hwpt() when the device list is empty i.e. when there's no more devices in
> that vdev::hwpt.
> 
> DETACH_IOMMUFD_PT decrement the hwpt refcount and it doesn't differentiate
> between auto domains vs manual domains.
> 
> The code is already there anyhow it just has the order of
> iommufd_cdev_autodomains_put vs detach invocation reversed; I'll fix that for
> next version.

While at it, could you please move these routines :

   iommufd_cdev_detach_ioas_hwpt
   iommufd_cdev_attach_ioas_hwpt
  
under backends/iommufd.c ? I think that's where they belong.



Thanks,

C.
Joao Martins July 17, 2024, 9:31 a.m. UTC | #13
On 17/07/2024 10:28, Cédric Le Goater wrote:
>>>>>>> @@ -224,6 +300,11 @@ static void
>>>> iommufd_cdev_detach_container(VFIODevice *vbasedev,
>>>>>>>   {
>>>>>>>       Error *err = NULL;
>>>>>>>
>>>>>>> +    if (vbasedev->hwpt) {
>>>>>>> +        iommufd_cdev_autodomains_put(vbasedev, container);
>>>>>>> +        return;
>>>>>> Where do we detach the device from the hwpt?
>>>>>>
>>>>> In iommufd_backend_free_id() for auto domains
>>>>>
>>>>
>>>> to clarify here I meant *userspace* auto domains
>>>>
>>>> *kernel* auto domains (mdev) goes via DETACH_IOMMUFD_PT
>>>
>>> If the device is still attached to the hwpt, will iommufd_backend_free_id()
>>> succeed?
>>> Have you tried the hot unplug?
>>>
>>
>> I have but I didn't see any errors. But I will check again for v5 as it could
>> also be my oversight.
>>
>> I was thinking about Eric's remark overnight and I think what I am doing is not
>> correct regardless of the above.
>>
>> I should be calling DETACH_IOMMUFD_PT pairing with ATTACH_IOMMUFD_PT, and the
>> iommufd_backend_free_id() is to drop the final reference pairing with
>> alloc_hwpt() when the device list is empty i.e. when there's no more devices in
>> that vdev::hwpt.
>>
>> DETACH_IOMMUFD_PT decrement the hwpt refcount and it doesn't differentiate
>> between auto domains vs manual domains.
>>
>> The code is already there anyhow it just has the order of
>> iommufd_cdev_autodomains_put vs detach invocation reversed; I'll fix that for
>> next version.
> 
> While at it, could you please move these routines :
> 
>   iommufd_cdev_detach_ioas_hwpt
>   iommufd_cdev_attach_ioas_hwpt
>  
> under backends/iommufd.c ? I think that's where they belong.

OK
Duan, Zhenzhong July 17, 2024, 9:48 a.m. UTC | #14
>-----Original Message-----
>From: Joao Martins <joao.m.martins@oracle.com>
>Subject: Re: [PATCH v4 05/12] vfio/iommufd: Introduce auto domain
>creation
>
>On 17/07/2024 03:52, Duan, Zhenzhong wrote:
>>
>>
>>> -----Original Message-----
>>> From: Joao Martins <joao.m.martins@oracle.com>
>>> Subject: Re: [PATCH v4 05/12] vfio/iommufd: Introduce auto domain
>>> creation
>>>
>>> On 16/07/2024 17:44, Joao Martins wrote:
>>>> On 16/07/2024 17:04, Eric Auger wrote:
>>>>> Hi Joao,
>>>>>
>>>>> On 7/12/24 13:46, Joao Martins wrote:
>>>>>> There's generally two modes of operation for IOMMUFD:
>>>>>>
>>>>>> * The simple user API which intends to perform relatively simple
>things
>>>>>> with IOMMUs e.g. DPDK. It generally creates an IOAS and attach to
>VFIO
>>>>>
>>>>> It generally creates? can you explicit what is "it"
>>>>>
>>>> 'It' here refers to the process/API-user
>>>>
>>>>> I am confused by this automatic terminology again (not your fault). the
>>> doc says:
>>>>> "
>>>>>
>>>>>   *
>>>>>
>>>>>     Automatic domain - refers to an iommu domain created
>automatically
>>>>>     when attaching a device to an IOAS object. This is compatible to the
>>>>>     semantics of VFIO type1.
>>>>>
>>>>>   *
>>>>>
>>>>>     Manual domain - refers to an iommu domain designated by the user
>as
>>>>>     the target pagetable to be attached to by a device. Though currently
>>>>>     there are no uAPIs to directly create such domain, the datastructure
>>>>>     and algorithms are ready for handling that use case.
>>>>>
>>>>> "
>>>>>
>>>>>
>>>>> in 1) the device is attached to the ioas id (using the auto domain if I am
>>> not wrong)
>>>>> Here you attach to an hwpt id. Isn't it a manual domain?
>>>>>
>>>>
>>>> Correct.
>>>>
>>>> The 'auto domains' generally refers to the kernel-equivalent own
>>> automatic
>>>> attaching to a new pagetable.
>>>>
>>>> Here I call 'auto domains' in the userspace version too because we are
>>> doing the
>>>> exact same but from userspace, using the manual API in IOMMUFD.
>>>>
>>>>>> and mainly performs IOAS_MAP and UNMAP.
>>>>>>
>>>>>> * The native IOMMUFD API where you have fine grained control of
>the
>>>>>> IOMMU domain and model it accordingly. This is where most new
>>> feature
>>>>>> are being steered to.
>>>>>>
>>>>>> For dirty tracking 2) is required, as it needs to ensure that
>>>>>> the stage-2/parent IOMMU domain will only attach devices
>>>>>> that support dirty tracking (so far it is all homogeneous in x86, likely
>>>>>> not the case for smmuv3). Such invariant on dirty tracking provides a
>>>>>> useful guarantee to VMMs that will refuse incompatible device
>>>>>> attachments for IOMMU domains.
>>>>>>
>>>>>> Dirty tracking insurance is enforced via HWPT_ALLOC, which is
>>>>>> responsible for creating an IOMMU domain. This is contrast to the
>>>>>> 'simple API' where the IOMMU domain is created by IOMMUFD
>>> automatically
>>>>>> when it attaches to VFIO (usually referred as autodomains) but it has
>>>>>> the needed handling for mdevs.
>>>>>>
>>>>>> To support dirty tracking with the advanced IOMMUFD API, it needs
>>>>>> similar logic, where IOMMU domains are created and devices
>attached
>>> to
>>>>>> compatible domains. Essentially mimmicing kernel
>>>>>> iommufd_device_auto_get_domain(). With mdevs given there's no
>>> IOMMU domain
>>>>>> it falls back to IOAS attach.
>>>>>>
>>>>>> The auto domain logic allows different IOMMU domains to be created
>>> when
>>>>>> DMA dirty tracking is not desired (and VF can provide it), and others
>>> where
>>>>>> it is. Here is not used in this way here given how VFIODevice
>migration
>>>>>
>>>>> Here is not used in this way here ?
>>>>>
>>>>
>>>> I meant, 'Here it is not used in this way given (...)'
>>>>
>>>>>> state is initialized after the device attachment. But such mixed mode
>of
>>>>>> IOMMU dirty tracking + device dirty tracking is an improvement that
>>> can
>>>>>> be added on. Keep the 'all of nothing' of type1 approach that we have
>>>>>> been using so far between container vs device dirty tracking.
>>>>>>
>>>>>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>>>>>> ---
>>>>>>  include/hw/vfio/vfio-common.h |  9 ++++
>>>>>>  include/sysemu/iommufd.h      |  5 +++
>>>>>>  backends/iommufd.c            | 30 +++++++++++++
>>>>>>  hw/vfio/iommufd.c             | 82
>>> +++++++++++++++++++++++++++++++++++
>>>>>>  backends/trace-events         |  1 +
>>>>>>  5 files changed, 127 insertions(+)
>>>>>>
>>>>>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-
>>> common.h
>>>>>> index 7419466bca92..2dd468ce3c02 100644
>>>>>> --- a/include/hw/vfio/vfio-common.h
>>>>>> +++ b/include/hw/vfio/vfio-common.h
>>>>>> @@ -95,10 +95,17 @@ typedef struct VFIOHostDMAWindow {
>>>>>>
>>>>>>  typedef struct IOMMUFDBackend IOMMUFDBackend;
>>>>>>
>>>>>> +typedef struct VFIOIOASHwpt {
>>>>>> +    uint32_t hwpt_id;
>>>>>> +    QLIST_HEAD(, VFIODevice) device_list;
>>>>>> +    QLIST_ENTRY(VFIOIOASHwpt) next;
>>>>>> +} VFIOIOASHwpt;
>>>>>> +
>>>>>>  typedef struct VFIOIOMMUFDContainer {
>>>>>>      VFIOContainerBase bcontainer;
>>>>>>      IOMMUFDBackend *be;
>>>>>>      uint32_t ioas_id;
>>>>>> +    QLIST_HEAD(, VFIOIOASHwpt) hwpt_list;
>>>>>>  } VFIOIOMMUFDContainer;
>>>>>>
>>>>>>  OBJECT_DECLARE_SIMPLE_TYPE(VFIOIOMMUFDContainer,
>>> VFIO_IOMMU_IOMMUFD);
>>>>>> @@ -135,6 +142,8 @@ typedef struct VFIODevice {
>>>>>>      HostIOMMUDevice *hiod;
>>>>>>      int devid;
>>>>>>      IOMMUFDBackend *iommufd;
>>>>>> +    VFIOIOASHwpt *hwpt;
>>>>>> +    QLIST_ENTRY(VFIODevice) hwpt_next;
>>>>>>  } VFIODevice;
>>>>>>
>>>>>>  struct VFIODeviceOps {
>>>>>> diff --git a/include/sysemu/iommufd.h b/include/sysemu/iommufd.h
>>>>>> index 57d502a1c79a..e917e7591d05 100644
>>>>>> --- a/include/sysemu/iommufd.h
>>>>>> +++ b/include/sysemu/iommufd.h
>>>>>> @@ -50,6 +50,11 @@ int
>>> iommufd_backend_unmap_dma(IOMMUFDBackend *be, uint32_t
>ioas_id,
>>>>>>  bool iommufd_backend_get_device_info(IOMMUFDBackend *be,
>>> uint32_t devid,
>>>>>>                                       uint32_t *type, void *data, uint32_t len,
>>>>>>                                       uint64_t *caps, Error **errp);
>>>>>> +bool iommufd_backend_alloc_hwpt(IOMMUFDBackend *be,
>uint32_t
>>> dev_id,
>>>>>> +                                uint32_t pt_id, uint32_t flags,
>>>>>> +                                uint32_t data_type, uint32_t data_len,
>>>>>> +                                void *data_ptr, uint32_t *out_hwpt,
>>>>>> +                                Error **errp);
>>>>>>
>>>>>>  #define TYPE_HOST_IOMMU_DEVICE_IOMMUFD
>>> TYPE_HOST_IOMMU_DEVICE "-iommufd"
>>>>>>  #endif
>>>>>> diff --git a/backends/iommufd.c b/backends/iommufd.c
>>>>>> index 2b3d51af26d2..5d3dfa917415 100644
>>>>>> --- a/backends/iommufd.c
>>>>>> +++ b/backends/iommufd.c
>>>>>> @@ -208,6 +208,36 @@ int
>>> iommufd_backend_unmap_dma(IOMMUFDBackend *be, uint32_t
>ioas_id,
>>>>>>      return ret;
>>>>>>  }
>>>>>>
>>>>>> +bool iommufd_backend_alloc_hwpt(IOMMUFDBackend *be,
>uint32_t
>>> dev_id,
>>>>>> +                                uint32_t pt_id, uint32_t flags,
>>>>>> +                                uint32_t data_type, uint32_t data_len,
>>>>>> +                                void *data_ptr, uint32_t *out_hwpt,
>>>>>> +                                Error **errp)
>>>>>> +{
>>>>>> +    int ret, fd = be->fd;
>>>>>> +    struct iommu_hwpt_alloc alloc_hwpt = {
>>>>>> +        .size = sizeof(struct iommu_hwpt_alloc),
>>>>>> +        .flags = flags,
>>>>>> +        .dev_id = dev_id,
>>>>>> +        .pt_id = pt_id,
>>>>>> +        .data_type = data_type,
>>>>>> +        .data_len = data_len,
>>>>>> +        .data_uptr = (uint64_t)data_ptr,
>>>>>> +    };
>>>>>> +
>>>>>> +    ret = ioctl(fd, IOMMU_HWPT_ALLOC, &alloc_hwpt);
>>>>>> +    trace_iommufd_backend_alloc_hwpt(fd, dev_id, pt_id, flags,
>>> data_type,
>>>>>> +                                     data_len, (uint64_t)data_ptr,
>>>>>> +                                     alloc_hwpt.out_hwpt_id, ret);
>>>>>> +    if (ret) {
>>>>>> +        error_setg_errno(errp, errno, "Failed to allocate hwpt");
>>>>>> +        return false;
>>>>>> +    }
>>>>>> +
>>>>>> +    *out_hwpt = alloc_hwpt.out_hwpt_id;
>>>>>> +    return true;
>>>>>> +}
>>>>>> +
>>>>>>  bool iommufd_backend_get_device_info(IOMMUFDBackend *be,
>>> uint32_t devid,
>>>>>>                                       uint32_t *type, void *data, uint32_t len,
>>>>>>                                       uint64_t *caps, Error **errp)
>>>>>> diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
>>>>>> index 077dea8f1b64..325c7598d5a1 100644
>>>>>> --- a/hw/vfio/iommufd.c
>>>>>> +++ b/hw/vfio/iommufd.c
>>>>>> @@ -212,10 +212,86 @@ static bool
>>> iommufd_cdev_detach_ioas_hwpt(VFIODevice *vbasedev, Error **errp)
>>>>>>      return true;
>>>>>>  }
>>>>>>
>>>>>> +static bool iommufd_cdev_autodomains_get(VFIODevice *vbasedev,
>>>>>> +                                         VFIOIOMMUFDContainer *container,
>>>>>> +                                         Error **errp)
>>>>>> +{
>>>>>> +    IOMMUFDBackend *iommufd = vbasedev->iommufd;
>>>>>> +    uint32_t flags = 0;
>>>>>> +    VFIOIOASHwpt *hwpt;
>>>>>> +    uint32_t hwpt_id;
>>>>>> +    int ret;
>>>>>> +
>>>>>> +    /* Try to find a domain */
>>>>>> +    QLIST_FOREACH(hwpt, &container->hwpt_list, next) {
>>>>>> +        ret = iommufd_cdev_attach_ioas_hwpt(vbasedev, hwpt-
>>hwpt_id,
>>> errp);
>>>>>> +        if (ret) {
>>>>>> +            /* -EINVAL means the domain is incompatible with the device.
>>> */
>>>>>> +            if (ret == -EINVAL) {
>>>>>> +                /*
>>>>>> +                 * It is an expected failure and it just means we will try
>>>>>> +                 * another domain, or create one if no existing compatible
>>>>>> +                 * domain is found. Hence why the error is discarded below.
>>>>>> +                 */
>>>>>> +                error_free(*errp);
>>>>>> +                *errp = NULL;
>>>>>> +                continue;
>>>>>> +            }
>>>>>> +
>>>>>> +            return false;
>>>>>> +        } else {
>>>>>> +            vbasedev->hwpt = hwpt;
>>>>>> +            QLIST_INSERT_HEAD(&hwpt->device_list, vbasedev,
>hwpt_next);
>>>>>> +            return true;
>>>>>> +        }
>>>>>> +    }
>>>>>> +
>>>>>> +    if (!iommufd_backend_alloc_hwpt(iommufd, vbasedev->devid,
>>>>>> +                                    container->ioas_id, flags,
>>>>>> +                                    IOMMU_HWPT_DATA_NONE, 0, NULL,
>>>>>> +                                    &hwpt_id, errp)) {
>>>>>> +        return false;
>>>>>> +    }
>>>>>> +
>>>>>> +    hwpt = g_malloc0(sizeof(*hwpt));
>>>>>> +    hwpt->hwpt_id = hwpt_id;
>>>>>> +    QLIST_INIT(&hwpt->device_list);
>>>>>> +
>>>>>> +    ret = iommufd_cdev_attach_ioas_hwpt(vbasedev, hwpt-
>>hwpt_id,
>>> errp);
>>>>>> +    if (ret) {
>>>>>> +        iommufd_backend_free_id(container->be, hwpt->hwpt_id);
>>>>>> +        g_free(hwpt);
>>>>>> +        return false;
>>>>>> +    }
>>>>>> +
>>>>>> +    vbasedev->hwpt = hwpt;
>>>>>> +    QLIST_INSERT_HEAD(&hwpt->device_list, vbasedev, hwpt_next);
>>>>>> +    QLIST_INSERT_HEAD(&container->hwpt_list, hwpt, next);
>>>>>> +    return true;
>>>>>> +}
>>>>>> +
>>>>>> +static void iommufd_cdev_autodomains_put(VFIODevice *vbasedev,
>>>>>> +                                         VFIOIOMMUFDContainer *container)
>>>>>> +{
>>>>>> +    VFIOIOASHwpt *hwpt = vbasedev->hwpt;
>>>>>> +
>>>>>> +    QLIST_REMOVE(vbasedev, hwpt_next);
>>>>> don't you want to reset vbasedev->hwpt = NULL too?
>>>>>
>>>> Yeap, Thanks for catching that
>>>>
>>>>>
>>>>>> +    if (QLIST_EMPTY(&hwpt->device_list)) {
>>>>>> +        QLIST_REMOVE(hwpt, next);
>>>>>> +        iommufd_backend_free_id(container->be, hwpt->hwpt_id);
>>>>>> +        g_free(hwpt);
>>>>>> +    }
>>>>>> +}
>>>>>> +
>>>>>>  static bool iommufd_cdev_attach_container(VFIODevice *vbasedev,
>>>>>>                                            VFIOIOMMUFDContainer *container,
>>>>>>                                            Error **errp)
>>>>>>  {
>>>>>> +    /* mdevs aren't physical devices and will fail with auto domains
>*/
>>>>>> +    if (!vbasedev->mdev) {
>>>>>> +        return iommufd_cdev_autodomains_get(vbasedev, container,
>>> errp);
>>>>>> +    }
>>>>>> +
>>>>>>      return !iommufd_cdev_attach_ioas_hwpt(vbasedev, container-
>>>> ioas_id, errp);
>>>>>>  }
>>>>>>
>>>>>> @@ -224,6 +300,11 @@ static void
>>> iommufd_cdev_detach_container(VFIODevice *vbasedev,
>>>>>>  {
>>>>>>      Error *err = NULL;
>>>>>>
>>>>>> +    if (vbasedev->hwpt) {
>>>>>> +        iommufd_cdev_autodomains_put(vbasedev, container);
>>>>>> +        return;
>>>>> Where do we detach the device from the hwpt?
>>>>>
>>>> In iommufd_backend_free_id() for auto domains
>>>>
>>>
>>> to clarify here I meant *userspace* auto domains
>>>
>>> *kernel* auto domains (mdev) goes via DETACH_IOMMUFD_PT
>>
>> If the device is still attached to the hwpt, will iommufd_backend_free_id()
>succeed?
>> Have you tried the hot unplug?
>>
>
>I have but I didn't see any errors. But I will check again for v5 as it could
>also be my oversight.
>
>I was thinking about Eric's remark overnight and I think what I am doing is
>not
>correct regardless of the above.
>
>I should be calling DETACH_IOMMUFD_PT pairing with
>ATTACH_IOMMUFD_PT, and the
>iommufd_backend_free_id() is to drop the final reference pairing with
>alloc_hwpt() when the device list is empty i.e. when there's no more devices
>in
>that vdev::hwpt.
>
>DETACH_IOMMUFD_PT decrement the hwpt refcount and it doesn't
>differentiate
>between auto domains vs manual domains.

Yes, missing DETACH_IOMMUFD_PT so ref count isn't decreased.
My understanding is freeing hwpt will fails become device is still attached, such as return -EBUSY,
But may be I understand wrong as you didn't see that failure.

Thanks
Zhenzhong

>
>The code is already there anyhow it just has the order of
>iommufd_cdev_autodomains_put vs detach invocation reversed; I'll fix that
>for
>next version.
Joao Martins July 17, 2024, 9:53 a.m. UTC | #15
On 17/07/2024 10:48, Duan, Zhenzhong wrote:
> 
> 
>> -----Original Message-----
>> From: Joao Martins <joao.m.martins@oracle.com>
>> Subject: Re: [PATCH v4 05/12] vfio/iommufd: Introduce auto domain
>> creation
>>
>> On 17/07/2024 03:52, Duan, Zhenzhong wrote:
>>>
>>>
>>>> -----Original Message-----
>>>> From: Joao Martins <joao.m.martins@oracle.com>
>>>> Subject: Re: [PATCH v4 05/12] vfio/iommufd: Introduce auto domain
>>>> creation
>>>>
>>>> On 16/07/2024 17:44, Joao Martins wrote:
>>>>> On 16/07/2024 17:04, Eric Auger wrote:
>>>>>> Hi Joao,
>>>>>>
>>>>>> On 7/12/24 13:46, Joao Martins wrote:
>>>>>>> There's generally two modes of operation for IOMMUFD:
>>>>>>>
>>>>>>> * The simple user API which intends to perform relatively simple
>> things
>>>>>>> with IOMMUs e.g. DPDK. It generally creates an IOAS and attach to
>> VFIO
>>>>>>
>>>>>> It generally creates? can you explicit what is "it"
>>>>>>
>>>>> 'It' here refers to the process/API-user
>>>>>
>>>>>> I am confused by this automatic terminology again (not your fault). the
>>>> doc says:
>>>>>> "
>>>>>>
>>>>>>   *
>>>>>>
>>>>>>     Automatic domain - refers to an iommu domain created
>> automatically
>>>>>>     when attaching a device to an IOAS object. This is compatible to the
>>>>>>     semantics of VFIO type1.
>>>>>>
>>>>>>   *
>>>>>>
>>>>>>     Manual domain - refers to an iommu domain designated by the user
>> as
>>>>>>     the target pagetable to be attached to by a device. Though currently
>>>>>>     there are no uAPIs to directly create such domain, the datastructure
>>>>>>     and algorithms are ready for handling that use case.
>>>>>>
>>>>>> "
>>>>>>
>>>>>>
>>>>>> in 1) the device is attached to the ioas id (using the auto domain if I am
>>>> not wrong)
>>>>>> Here you attach to an hwpt id. Isn't it a manual domain?
>>>>>>
>>>>>
>>>>> Correct.
>>>>>
>>>>> The 'auto domains' generally refers to the kernel-equivalent own
>>>> automatic
>>>>> attaching to a new pagetable.
>>>>>
>>>>> Here I call 'auto domains' in the userspace version too because we are
>>>> doing the
>>>>> exact same but from userspace, using the manual API in IOMMUFD.
>>>>>
>>>>>>> and mainly performs IOAS_MAP and UNMAP.
>>>>>>>
>>>>>>> * The native IOMMUFD API where you have fine grained control of
>> the
>>>>>>> IOMMU domain and model it accordingly. This is where most new
>>>> feature
>>>>>>> are being steered to.
>>>>>>>
>>>>>>> For dirty tracking 2) is required, as it needs to ensure that
>>>>>>> the stage-2/parent IOMMU domain will only attach devices
>>>>>>> that support dirty tracking (so far it is all homogeneous in x86, likely
>>>>>>> not the case for smmuv3). Such invariant on dirty tracking provides a
>>>>>>> useful guarantee to VMMs that will refuse incompatible device
>>>>>>> attachments for IOMMU domains.
>>>>>>>
>>>>>>> Dirty tracking insurance is enforced via HWPT_ALLOC, which is
>>>>>>> responsible for creating an IOMMU domain. This is contrast to the
>>>>>>> 'simple API' where the IOMMU domain is created by IOMMUFD
>>>> automatically
>>>>>>> when it attaches to VFIO (usually referred as autodomains) but it has
>>>>>>> the needed handling for mdevs.
>>>>>>>
>>>>>>> To support dirty tracking with the advanced IOMMUFD API, it needs
>>>>>>> similar logic, where IOMMU domains are created and devices
>> attached
>>>> to
>>>>>>> compatible domains. Essentially mimmicing kernel
>>>>>>> iommufd_device_auto_get_domain(). With mdevs given there's no
>>>> IOMMU domain
>>>>>>> it falls back to IOAS attach.
>>>>>>>
>>>>>>> The auto domain logic allows different IOMMU domains to be created
>>>> when
>>>>>>> DMA dirty tracking is not desired (and VF can provide it), and others
>>>> where
>>>>>>> it is. Here is not used in this way here given how VFIODevice
>> migration
>>>>>>
>>>>>> Here is not used in this way here ?
>>>>>>
>>>>>
>>>>> I meant, 'Here it is not used in this way given (...)'
>>>>>
>>>>>>> state is initialized after the device attachment. But such mixed mode
>> of
>>>>>>> IOMMU dirty tracking + device dirty tracking is an improvement that
>>>> can
>>>>>>> be added on. Keep the 'all of nothing' of type1 approach that we have
>>>>>>> been using so far between container vs device dirty tracking.
>>>>>>>
>>>>>>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>>>>>>> ---
>>>>>>>  include/hw/vfio/vfio-common.h |  9 ++++
>>>>>>>  include/sysemu/iommufd.h      |  5 +++
>>>>>>>  backends/iommufd.c            | 30 +++++++++++++
>>>>>>>  hw/vfio/iommufd.c             | 82
>>>> +++++++++++++++++++++++++++++++++++
>>>>>>>  backends/trace-events         |  1 +
>>>>>>>  5 files changed, 127 insertions(+)
>>>>>>>
>>>>>>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-
>>>> common.h
>>>>>>> index 7419466bca92..2dd468ce3c02 100644
>>>>>>> --- a/include/hw/vfio/vfio-common.h
>>>>>>> +++ b/include/hw/vfio/vfio-common.h
>>>>>>> @@ -95,10 +95,17 @@ typedef struct VFIOHostDMAWindow {
>>>>>>>
>>>>>>>  typedef struct IOMMUFDBackend IOMMUFDBackend;
>>>>>>>
>>>>>>> +typedef struct VFIOIOASHwpt {
>>>>>>> +    uint32_t hwpt_id;
>>>>>>> +    QLIST_HEAD(, VFIODevice) device_list;
>>>>>>> +    QLIST_ENTRY(VFIOIOASHwpt) next;
>>>>>>> +} VFIOIOASHwpt;
>>>>>>> +
>>>>>>>  typedef struct VFIOIOMMUFDContainer {
>>>>>>>      VFIOContainerBase bcontainer;
>>>>>>>      IOMMUFDBackend *be;
>>>>>>>      uint32_t ioas_id;
>>>>>>> +    QLIST_HEAD(, VFIOIOASHwpt) hwpt_list;
>>>>>>>  } VFIOIOMMUFDContainer;
>>>>>>>
>>>>>>>  OBJECT_DECLARE_SIMPLE_TYPE(VFIOIOMMUFDContainer,
>>>> VFIO_IOMMU_IOMMUFD);
>>>>>>> @@ -135,6 +142,8 @@ typedef struct VFIODevice {
>>>>>>>      HostIOMMUDevice *hiod;
>>>>>>>      int devid;
>>>>>>>      IOMMUFDBackend *iommufd;
>>>>>>> +    VFIOIOASHwpt *hwpt;
>>>>>>> +    QLIST_ENTRY(VFIODevice) hwpt_next;
>>>>>>>  } VFIODevice;
>>>>>>>
>>>>>>>  struct VFIODeviceOps {
>>>>>>> diff --git a/include/sysemu/iommufd.h b/include/sysemu/iommufd.h
>>>>>>> index 57d502a1c79a..e917e7591d05 100644
>>>>>>> --- a/include/sysemu/iommufd.h
>>>>>>> +++ b/include/sysemu/iommufd.h
>>>>>>> @@ -50,6 +50,11 @@ int
>>>> iommufd_backend_unmap_dma(IOMMUFDBackend *be, uint32_t
>> ioas_id,
>>>>>>>  bool iommufd_backend_get_device_info(IOMMUFDBackend *be,
>>>> uint32_t devid,
>>>>>>>                                       uint32_t *type, void *data, uint32_t len,
>>>>>>>                                       uint64_t *caps, Error **errp);
>>>>>>> +bool iommufd_backend_alloc_hwpt(IOMMUFDBackend *be,
>> uint32_t
>>>> dev_id,
>>>>>>> +                                uint32_t pt_id, uint32_t flags,
>>>>>>> +                                uint32_t data_type, uint32_t data_len,
>>>>>>> +                                void *data_ptr, uint32_t *out_hwpt,
>>>>>>> +                                Error **errp);
>>>>>>>
>>>>>>>  #define TYPE_HOST_IOMMU_DEVICE_IOMMUFD
>>>> TYPE_HOST_IOMMU_DEVICE "-iommufd"
>>>>>>>  #endif
>>>>>>> diff --git a/backends/iommufd.c b/backends/iommufd.c
>>>>>>> index 2b3d51af26d2..5d3dfa917415 100644
>>>>>>> --- a/backends/iommufd.c
>>>>>>> +++ b/backends/iommufd.c
>>>>>>> @@ -208,6 +208,36 @@ int
>>>> iommufd_backend_unmap_dma(IOMMUFDBackend *be, uint32_t
>> ioas_id,
>>>>>>>      return ret;
>>>>>>>  }
>>>>>>>
>>>>>>> +bool iommufd_backend_alloc_hwpt(IOMMUFDBackend *be,
>> uint32_t
>>>> dev_id,
>>>>>>> +                                uint32_t pt_id, uint32_t flags,
>>>>>>> +                                uint32_t data_type, uint32_t data_len,
>>>>>>> +                                void *data_ptr, uint32_t *out_hwpt,
>>>>>>> +                                Error **errp)
>>>>>>> +{
>>>>>>> +    int ret, fd = be->fd;
>>>>>>> +    struct iommu_hwpt_alloc alloc_hwpt = {
>>>>>>> +        .size = sizeof(struct iommu_hwpt_alloc),
>>>>>>> +        .flags = flags,
>>>>>>> +        .dev_id = dev_id,
>>>>>>> +        .pt_id = pt_id,
>>>>>>> +        .data_type = data_type,
>>>>>>> +        .data_len = data_len,
>>>>>>> +        .data_uptr = (uint64_t)data_ptr,
>>>>>>> +    };
>>>>>>> +
>>>>>>> +    ret = ioctl(fd, IOMMU_HWPT_ALLOC, &alloc_hwpt);
>>>>>>> +    trace_iommufd_backend_alloc_hwpt(fd, dev_id, pt_id, flags,
>>>> data_type,
>>>>>>> +                                     data_len, (uint64_t)data_ptr,
>>>>>>> +                                     alloc_hwpt.out_hwpt_id, ret);
>>>>>>> +    if (ret) {
>>>>>>> +        error_setg_errno(errp, errno, "Failed to allocate hwpt");
>>>>>>> +        return false;
>>>>>>> +    }
>>>>>>> +
>>>>>>> +    *out_hwpt = alloc_hwpt.out_hwpt_id;
>>>>>>> +    return true;
>>>>>>> +}
>>>>>>> +
>>>>>>>  bool iommufd_backend_get_device_info(IOMMUFDBackend *be,
>>>> uint32_t devid,
>>>>>>>                                       uint32_t *type, void *data, uint32_t len,
>>>>>>>                                       uint64_t *caps, Error **errp)
>>>>>>> diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
>>>>>>> index 077dea8f1b64..325c7598d5a1 100644
>>>>>>> --- a/hw/vfio/iommufd.c
>>>>>>> +++ b/hw/vfio/iommufd.c
>>>>>>> @@ -212,10 +212,86 @@ static bool
>>>> iommufd_cdev_detach_ioas_hwpt(VFIODevice *vbasedev, Error **errp)
>>>>>>>      return true;
>>>>>>>  }
>>>>>>>
>>>>>>> +static bool iommufd_cdev_autodomains_get(VFIODevice *vbasedev,
>>>>>>> +                                         VFIOIOMMUFDContainer *container,
>>>>>>> +                                         Error **errp)
>>>>>>> +{
>>>>>>> +    IOMMUFDBackend *iommufd = vbasedev->iommufd;
>>>>>>> +    uint32_t flags = 0;
>>>>>>> +    VFIOIOASHwpt *hwpt;
>>>>>>> +    uint32_t hwpt_id;
>>>>>>> +    int ret;
>>>>>>> +
>>>>>>> +    /* Try to find a domain */
>>>>>>> +    QLIST_FOREACH(hwpt, &container->hwpt_list, next) {
>>>>>>> +        ret = iommufd_cdev_attach_ioas_hwpt(vbasedev, hwpt-
>>> hwpt_id,
>>>> errp);
>>>>>>> +        if (ret) {
>>>>>>> +            /* -EINVAL means the domain is incompatible with the device.
>>>> */
>>>>>>> +            if (ret == -EINVAL) {
>>>>>>> +                /*
>>>>>>> +                 * It is an expected failure and it just means we will try
>>>>>>> +                 * another domain, or create one if no existing compatible
>>>>>>> +                 * domain is found. Hence why the error is discarded below.
>>>>>>> +                 */
>>>>>>> +                error_free(*errp);
>>>>>>> +                *errp = NULL;
>>>>>>> +                continue;
>>>>>>> +            }
>>>>>>> +
>>>>>>> +            return false;
>>>>>>> +        } else {
>>>>>>> +            vbasedev->hwpt = hwpt;
>>>>>>> +            QLIST_INSERT_HEAD(&hwpt->device_list, vbasedev,
>> hwpt_next);
>>>>>>> +            return true;
>>>>>>> +        }
>>>>>>> +    }
>>>>>>> +
>>>>>>> +    if (!iommufd_backend_alloc_hwpt(iommufd, vbasedev->devid,
>>>>>>> +                                    container->ioas_id, flags,
>>>>>>> +                                    IOMMU_HWPT_DATA_NONE, 0, NULL,
>>>>>>> +                                    &hwpt_id, errp)) {
>>>>>>> +        return false;
>>>>>>> +    }
>>>>>>> +
>>>>>>> +    hwpt = g_malloc0(sizeof(*hwpt));
>>>>>>> +    hwpt->hwpt_id = hwpt_id;
>>>>>>> +    QLIST_INIT(&hwpt->device_list);
>>>>>>> +
>>>>>>> +    ret = iommufd_cdev_attach_ioas_hwpt(vbasedev, hwpt-
>>> hwpt_id,
>>>> errp);
>>>>>>> +    if (ret) {
>>>>>>> +        iommufd_backend_free_id(container->be, hwpt->hwpt_id);
>>>>>>> +        g_free(hwpt);
>>>>>>> +        return false;
>>>>>>> +    }
>>>>>>> +
>>>>>>> +    vbasedev->hwpt = hwpt;
>>>>>>> +    QLIST_INSERT_HEAD(&hwpt->device_list, vbasedev, hwpt_next);
>>>>>>> +    QLIST_INSERT_HEAD(&container->hwpt_list, hwpt, next);
>>>>>>> +    return true;
>>>>>>> +}
>>>>>>> +
>>>>>>> +static void iommufd_cdev_autodomains_put(VFIODevice *vbasedev,
>>>>>>> +                                         VFIOIOMMUFDContainer *container)
>>>>>>> +{
>>>>>>> +    VFIOIOASHwpt *hwpt = vbasedev->hwpt;
>>>>>>> +
>>>>>>> +    QLIST_REMOVE(vbasedev, hwpt_next);
>>>>>> don't you want to reset vbasedev->hwpt = NULL too?
>>>>>>
>>>>> Yeap, Thanks for catching that
>>>>>
>>>>>>
>>>>>>> +    if (QLIST_EMPTY(&hwpt->device_list)) {
>>>>>>> +        QLIST_REMOVE(hwpt, next);
>>>>>>> +        iommufd_backend_free_id(container->be, hwpt->hwpt_id);
>>>>>>> +        g_free(hwpt);
>>>>>>> +    }
>>>>>>> +}
>>>>>>> +
>>>>>>>  static bool iommufd_cdev_attach_container(VFIODevice *vbasedev,
>>>>>>>                                            VFIOIOMMUFDContainer *container,
>>>>>>>                                            Error **errp)
>>>>>>>  {
>>>>>>> +    /* mdevs aren't physical devices and will fail with auto domains
>> */
>>>>>>> +    if (!vbasedev->mdev) {
>>>>>>> +        return iommufd_cdev_autodomains_get(vbasedev, container,
>>>> errp);
>>>>>>> +    }
>>>>>>> +
>>>>>>>      return !iommufd_cdev_attach_ioas_hwpt(vbasedev, container-
>>>>> ioas_id, errp);
>>>>>>>  }
>>>>>>>
>>>>>>> @@ -224,6 +300,11 @@ static void
>>>> iommufd_cdev_detach_container(VFIODevice *vbasedev,
>>>>>>>  {
>>>>>>>      Error *err = NULL;
>>>>>>>
>>>>>>> +    if (vbasedev->hwpt) {
>>>>>>> +        iommufd_cdev_autodomains_put(vbasedev, container);
>>>>>>> +        return;
>>>>>> Where do we detach the device from the hwpt?
>>>>>>
>>>>> In iommufd_backend_free_id() for auto domains
>>>>>
>>>>
>>>> to clarify here I meant *userspace* auto domains
>>>>
>>>> *kernel* auto domains (mdev) goes via DETACH_IOMMUFD_PT
>>>
>>> If the device is still attached to the hwpt, will iommufd_backend_free_id()
>> succeed?
>>> Have you tried the hot unplug?
>>>
>>
>> I have but I didn't see any errors. But I will check again for v5 as it could
>> also be my oversight.
>>
>> I was thinking about Eric's remark overnight and I think what I am doing is
>> not
>> correct regardless of the above.
>>
>> I should be calling DETACH_IOMMUFD_PT pairing with
>> ATTACH_IOMMUFD_PT, and the
>> iommufd_backend_free_id() is to drop the final reference pairing with
>> alloc_hwpt() when the device list is empty i.e. when there's no more devices
>> in
>> that vdev::hwpt.
>>
>> DETACH_IOMMUFD_PT decrement the hwpt refcount and it doesn't
>> differentiate
>> between auto domains vs manual domains.
> 
> Yes, missing DETACH_IOMMUFD_PT so ref count isn't decreased.
> My understanding is freeing hwpt will fails become device is still attached, such as return -EBUSY,
> But may be I understand wrong as you didn't see that failure.
> 

I recall exercising *hotunplug/hotplug* that working it out, but the error
likely could go silent as it doesn't does fail higher levels. So part of the
reason that although it seemed to work it might be my oversight in seeing that
the errno was returned from free id operation.
Duan, Zhenzhong July 17, 2024, 10:05 a.m. UTC | #16
>-----Original Message-----
>From: Joao Martins <joao.m.martins@oracle.com>
>Subject: Re: [PATCH v4 05/12] vfio/iommufd: Introduce auto domain
>creation
>
>On 17/07/2024 03:18, Duan, Zhenzhong wrote:
>>
>>
>>> -----Original Message-----
>>> From: Joao Martins <joao.m.martins@oracle.com>
>>> Subject: [PATCH v4 05/12] vfio/iommufd: Introduce auto domain
>creation
>>>
>>> There's generally two modes of operation for IOMMUFD:
>>>
>>> * The simple user API which intends to perform relatively simple things
>>> with IOMMUs e.g. DPDK. It generally creates an IOAS and attach to VFIO
>>> and mainly performs IOAS_MAP and UNMAP.
>>>
>>> * The native IOMMUFD API where you have fine grained control of the
>>> IOMMU domain and model it accordingly. This is where most new feature
>>> are being steered to.
>>>
>>> For dirty tracking 2) is required, as it needs to ensure that
>>> the stage-2/parent IOMMU domain will only attach devices
>>> that support dirty tracking (so far it is all homogeneous in x86, likely
>>> not the case for smmuv3). Such invariant on dirty tracking provides a
>>> useful guarantee to VMMs that will refuse incompatible device
>>> attachments for IOMMU domains.
>>>
>>> Dirty tracking insurance is enforced via HWPT_ALLOC, which is
>>> responsible for creating an IOMMU domain. This is contrast to the
>>> 'simple API' where the IOMMU domain is created by IOMMUFD
>>> automatically
>>> when it attaches to VFIO (usually referred as autodomains) but it has
>>> the needed handling for mdevs.
>>>
>>> To support dirty tracking with the advanced IOMMUFD API, it needs
>>> similar logic, where IOMMU domains are created and devices attached to
>>> compatible domains. Essentially mimmicing kernel
>>> iommufd_device_auto_get_domain(). With mdevs given there's no
>IOMMU
>>> domain
>>> it falls back to IOAS attach.
>>>
>>> The auto domain logic allows different IOMMU domains to be created
>when
>>> DMA dirty tracking is not desired (and VF can provide it), and others
>where
>>> it is. Here is not used in this way here given how VFIODevice migration
>>> state is initialized after the device attachment. But such mixed mode of
>>> IOMMU dirty tracking + device dirty tracking is an improvement that can
>>> be added on. Keep the 'all of nothing' of type1 approach that we have
>>> been using so far between container vs device dirty tracking.
>>>
>>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>>> ---
>>> include/hw/vfio/vfio-common.h |  9 ++++
>>> include/sysemu/iommufd.h      |  5 +++
>>> backends/iommufd.c            | 30 +++++++++++++
>>> hw/vfio/iommufd.c             | 82
>>> +++++++++++++++++++++++++++++++++++
>>> backends/trace-events         |  1 +
>>> 5 files changed, 127 insertions(+)
>>>
>>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-
>>> common.h
>>> index 7419466bca92..2dd468ce3c02 100644
>>> --- a/include/hw/vfio/vfio-common.h
>>> +++ b/include/hw/vfio/vfio-common.h
>>> @@ -95,10 +95,17 @@ typedef struct VFIOHostDMAWindow {
>>>
>>> typedef struct IOMMUFDBackend IOMMUFDBackend;
>>>
>>> +typedef struct VFIOIOASHwpt {
>>> +    uint32_t hwpt_id;
>>> +    QLIST_HEAD(, VFIODevice) device_list;
>>> +    QLIST_ENTRY(VFIOIOASHwpt) next;
>>> +} VFIOIOASHwpt;
>>> +
>>> typedef struct VFIOIOMMUFDContainer {
>>>     VFIOContainerBase bcontainer;
>>>     IOMMUFDBackend *be;
>>>     uint32_t ioas_id;
>>> +    QLIST_HEAD(, VFIOIOASHwpt) hwpt_list;
>>> } VFIOIOMMUFDContainer;
>>>
>>> OBJECT_DECLARE_SIMPLE_TYPE(VFIOIOMMUFDContainer,
>>> VFIO_IOMMU_IOMMUFD);
>>> @@ -135,6 +142,8 @@ typedef struct VFIODevice {
>>>     HostIOMMUDevice *hiod;
>>>     int devid;
>>>     IOMMUFDBackend *iommufd;
>>> +    VFIOIOASHwpt *hwpt;
>>> +    QLIST_ENTRY(VFIODevice) hwpt_next;
>>> } VFIODevice;
>>>
>>> struct VFIODeviceOps {
>>> diff --git a/include/sysemu/iommufd.h b/include/sysemu/iommufd.h
>>> index 57d502a1c79a..e917e7591d05 100644
>>> --- a/include/sysemu/iommufd.h
>>> +++ b/include/sysemu/iommufd.h
>>> @@ -50,6 +50,11 @@ int
>>> iommufd_backend_unmap_dma(IOMMUFDBackend *be, uint32_t
>ioas_id,
>>> bool iommufd_backend_get_device_info(IOMMUFDBackend *be,
>uint32_t
>>> devid,
>>>                                      uint32_t *type, void *data, uint32_t len,
>>>                                      uint64_t *caps, Error **errp);
>>> +bool iommufd_backend_alloc_hwpt(IOMMUFDBackend *be, uint32_t
>>> dev_id,
>>> +                                uint32_t pt_id, uint32_t flags,
>>> +                                uint32_t data_type, uint32_t data_len,
>>> +                                void *data_ptr, uint32_t *out_hwpt,
>>> +                                Error **errp);
>>>
>>> #define TYPE_HOST_IOMMU_DEVICE_IOMMUFD
>>> TYPE_HOST_IOMMU_DEVICE "-iommufd"
>>> #endif
>>> diff --git a/backends/iommufd.c b/backends/iommufd.c
>>> index 2b3d51af26d2..5d3dfa917415 100644
>>> --- a/backends/iommufd.c
>>> +++ b/backends/iommufd.c
>>> @@ -208,6 +208,36 @@ int
>>> iommufd_backend_unmap_dma(IOMMUFDBackend *be, uint32_t
>ioas_id,
>>>     return ret;
>>> }
>>>
>>> +bool iommufd_backend_alloc_hwpt(IOMMUFDBackend *be, uint32_t
>>> dev_id,
>>> +                                uint32_t pt_id, uint32_t flags,
>>> +                                uint32_t data_type, uint32_t data_len,
>>> +                                void *data_ptr, uint32_t *out_hwpt,
>>> +                                Error **errp)
>>> +{
>>> +    int ret, fd = be->fd;
>>> +    struct iommu_hwpt_alloc alloc_hwpt = {
>>> +        .size = sizeof(struct iommu_hwpt_alloc),
>>> +        .flags = flags,
>>> +        .dev_id = dev_id,
>>> +        .pt_id = pt_id,
>>> +        .data_type = data_type,
>>> +        .data_len = data_len,
>>> +        .data_uptr = (uint64_t)data_ptr,
>>> +    };
>>> +
>>> +    ret = ioctl(fd, IOMMU_HWPT_ALLOC, &alloc_hwpt);
>>> +    trace_iommufd_backend_alloc_hwpt(fd, dev_id, pt_id, flags,
>data_type,
>>> +                                     data_len, (uint64_t)data_ptr,
>>> +                                     alloc_hwpt.out_hwpt_id, ret);
>>> +    if (ret) {
>>> +        error_setg_errno(errp, errno, "Failed to allocate hwpt");
>>> +        return false;
>>> +    }
>>> +
>>> +    *out_hwpt = alloc_hwpt.out_hwpt_id;
>>> +    return true;
>>> +}
>>> +
>>> bool iommufd_backend_get_device_info(IOMMUFDBackend *be,
>uint32_t
>>> devid,
>>>                                      uint32_t *type, void *data, uint32_t len,
>>>                                      uint64_t *caps, Error **errp)
>>> diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
>>> index 077dea8f1b64..325c7598d5a1 100644
>>> --- a/hw/vfio/iommufd.c
>>> +++ b/hw/vfio/iommufd.c
>>> @@ -212,10 +212,86 @@ static bool
>>> iommufd_cdev_detach_ioas_hwpt(VFIODevice *vbasedev, Error **errp)
>>>     return true;
>>> }
>>>
>>> +static bool iommufd_cdev_autodomains_get(VFIODevice *vbasedev,
>>> +                                         VFIOIOMMUFDContainer *container,
>>> +                                         Error **errp)
>>> +{
>>> +    IOMMUFDBackend *iommufd = vbasedev->iommufd;
>>> +    uint32_t flags = 0;
>>> +    VFIOIOASHwpt *hwpt;
>>> +    uint32_t hwpt_id;
>>> +    int ret;
>>> +
>>> +    /* Try to find a domain */
>>> +    QLIST_FOREACH(hwpt, &container->hwpt_list, next) {
>>> +        ret = iommufd_cdev_attach_ioas_hwpt(vbasedev, hwpt->hwpt_id,
>>> errp);
>>
>> If there is already an hwpt that supports dirty tracking.
>> Another device that doesn't support dirty tracking attaches to this hwpt,
>will it succeed?
>>
>
>It returns -EINVAL, and we handle that right after this statement. Which
>means
>another HWPT is created.

Looked into kernel code, I didn't see the check about dirty tracking between device and hwpt, do you know which func does that?

>
>> If existing hwpt doesn't support dirty tracking.
>> Another device supporting dirty tracking attaches to that hwpt, what will
>happen?
>>
>
>Hmm, It succeeds as there's no incompatbility. At the very least I plan on
>blocking migration if the device neither has VF dirty tracking, nor IOMMU
>dirty
>tracking (and patch 11 needs to be adjusted to check hwpt_flags instead of
>container).

When bcontainer->dirty_pages_supported is true, I think that container should only contains hwpt list that support dirty tracking. All hwpt not supporting dirty tracking should be in other container.

If device supports dirty tracking, it should bypass attaching container that doesn't support dirty tracking. Vise versa.
This way we can support the mixing environment.

Thanks
Zhenzhong 

>
>Qemu right now doesn't handle heteregenous environment, it's all of
>nothing
>approach even before this patchset. Additionally, I am not sure server
>environments are applicable here. So essentially I kept the status quo --
>more
>follow-up is needed to support a mix and match of IOMMU + VF dirty
>tracking too.
>The challenge is having the migration state of VFIO device initialized early
>enough that we can make all sort of decisions whether IOMMU dirty tracking
>is
>desired on a per-device basis.
Joao Martins July 17, 2024, 11:04 a.m. UTC | #17
On 17/07/2024 11:05, Duan, Zhenzhong wrote:
>> -----Original Message-----
>> From: Joao Martins <joao.m.martins@oracle.com>
>> Subject: Re: [PATCH v4 05/12] vfio/iommufd: Introduce auto domain
>> creation
>>
>> On 17/07/2024 03:18, Duan, Zhenzhong wrote:
>>>
>>>
>>>> -----Original Message-----
>>>> From: Joao Martins <joao.m.martins@oracle.com>
>>>> Subject: [PATCH v4 05/12] vfio/iommufd: Introduce auto domain
>> creation
>>>>
>>>> There's generally two modes of operation for IOMMUFD:
>>>>
>>>> * The simple user API which intends to perform relatively simple things
>>>> with IOMMUs e.g. DPDK. It generally creates an IOAS and attach to VFIO
>>>> and mainly performs IOAS_MAP and UNMAP.
>>>>
>>>> * The native IOMMUFD API where you have fine grained control of the
>>>> IOMMU domain and model it accordingly. This is where most new feature
>>>> are being steered to.
>>>>
>>>> For dirty tracking 2) is required, as it needs to ensure that
>>>> the stage-2/parent IOMMU domain will only attach devices
>>>> that support dirty tracking (so far it is all homogeneous in x86, likely
>>>> not the case for smmuv3). Such invariant on dirty tracking provides a
>>>> useful guarantee to VMMs that will refuse incompatible device
>>>> attachments for IOMMU domains.
>>>>
>>>> Dirty tracking insurance is enforced via HWPT_ALLOC, which is
>>>> responsible for creating an IOMMU domain. This is contrast to the
>>>> 'simple API' where the IOMMU domain is created by IOMMUFD
>>>> automatically
>>>> when it attaches to VFIO (usually referred as autodomains) but it has
>>>> the needed handling for mdevs.
>>>>
>>>> To support dirty tracking with the advanced IOMMUFD API, it needs
>>>> similar logic, where IOMMU domains are created and devices attached to
>>>> compatible domains. Essentially mimmicing kernel
>>>> iommufd_device_auto_get_domain(). With mdevs given there's no
>> IOMMU
>>>> domain
>>>> it falls back to IOAS attach.
>>>>
>>>> The auto domain logic allows different IOMMU domains to be created
>> when
>>>> DMA dirty tracking is not desired (and VF can provide it), and others
>> where
>>>> it is. Here is not used in this way here given how VFIODevice migration
>>>> state is initialized after the device attachment. But such mixed mode of
>>>> IOMMU dirty tracking + device dirty tracking is an improvement that can
>>>> be added on. Keep the 'all of nothing' of type1 approach that we have
>>>> been using so far between container vs device dirty tracking.
>>>>
>>>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>>>> ---
>>>> include/hw/vfio/vfio-common.h |  9 ++++
>>>> include/sysemu/iommufd.h      |  5 +++
>>>> backends/iommufd.c            | 30 +++++++++++++
>>>> hw/vfio/iommufd.c             | 82
>>>> +++++++++++++++++++++++++++++++++++
>>>> backends/trace-events         |  1 +
>>>> 5 files changed, 127 insertions(+)
>>>>
>>>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-
>>>> common.h
>>>> index 7419466bca92..2dd468ce3c02 100644
>>>> --- a/include/hw/vfio/vfio-common.h
>>>> +++ b/include/hw/vfio/vfio-common.h
>>>> @@ -95,10 +95,17 @@ typedef struct VFIOHostDMAWindow {
>>>>
>>>> typedef struct IOMMUFDBackend IOMMUFDBackend;
>>>>
>>>> +typedef struct VFIOIOASHwpt {
>>>> +    uint32_t hwpt_id;
>>>> +    QLIST_HEAD(, VFIODevice) device_list;
>>>> +    QLIST_ENTRY(VFIOIOASHwpt) next;
>>>> +} VFIOIOASHwpt;
>>>> +
>>>> typedef struct VFIOIOMMUFDContainer {
>>>>     VFIOContainerBase bcontainer;
>>>>     IOMMUFDBackend *be;
>>>>     uint32_t ioas_id;
>>>> +    QLIST_HEAD(, VFIOIOASHwpt) hwpt_list;
>>>> } VFIOIOMMUFDContainer;
>>>>
>>>> OBJECT_DECLARE_SIMPLE_TYPE(VFIOIOMMUFDContainer,
>>>> VFIO_IOMMU_IOMMUFD);
>>>> @@ -135,6 +142,8 @@ typedef struct VFIODevice {
>>>>     HostIOMMUDevice *hiod;
>>>>     int devid;
>>>>     IOMMUFDBackend *iommufd;
>>>> +    VFIOIOASHwpt *hwpt;
>>>> +    QLIST_ENTRY(VFIODevice) hwpt_next;
>>>> } VFIODevice;
>>>>
>>>> struct VFIODeviceOps {
>>>> diff --git a/include/sysemu/iommufd.h b/include/sysemu/iommufd.h
>>>> index 57d502a1c79a..e917e7591d05 100644
>>>> --- a/include/sysemu/iommufd.h
>>>> +++ b/include/sysemu/iommufd.h
>>>> @@ -50,6 +50,11 @@ int
>>>> iommufd_backend_unmap_dma(IOMMUFDBackend *be, uint32_t
>> ioas_id,
>>>> bool iommufd_backend_get_device_info(IOMMUFDBackend *be,
>> uint32_t
>>>> devid,
>>>>                                      uint32_t *type, void *data, uint32_t len,
>>>>                                      uint64_t *caps, Error **errp);
>>>> +bool iommufd_backend_alloc_hwpt(IOMMUFDBackend *be, uint32_t
>>>> dev_id,
>>>> +                                uint32_t pt_id, uint32_t flags,
>>>> +                                uint32_t data_type, uint32_t data_len,
>>>> +                                void *data_ptr, uint32_t *out_hwpt,
>>>> +                                Error **errp);
>>>>
>>>> #define TYPE_HOST_IOMMU_DEVICE_IOMMUFD
>>>> TYPE_HOST_IOMMU_DEVICE "-iommufd"
>>>> #endif
>>>> diff --git a/backends/iommufd.c b/backends/iommufd.c
>>>> index 2b3d51af26d2..5d3dfa917415 100644
>>>> --- a/backends/iommufd.c
>>>> +++ b/backends/iommufd.c
>>>> @@ -208,6 +208,36 @@ int
>>>> iommufd_backend_unmap_dma(IOMMUFDBackend *be, uint32_t
>> ioas_id,
>>>>     return ret;
>>>> }
>>>>
>>>> +bool iommufd_backend_alloc_hwpt(IOMMUFDBackend *be, uint32_t
>>>> dev_id,
>>>> +                                uint32_t pt_id, uint32_t flags,
>>>> +                                uint32_t data_type, uint32_t data_len,
>>>> +                                void *data_ptr, uint32_t *out_hwpt,
>>>> +                                Error **errp)
>>>> +{
>>>> +    int ret, fd = be->fd;
>>>> +    struct iommu_hwpt_alloc alloc_hwpt = {
>>>> +        .size = sizeof(struct iommu_hwpt_alloc),
>>>> +        .flags = flags,
>>>> +        .dev_id = dev_id,
>>>> +        .pt_id = pt_id,
>>>> +        .data_type = data_type,
>>>> +        .data_len = data_len,
>>>> +        .data_uptr = (uint64_t)data_ptr,
>>>> +    };
>>>> +
>>>> +    ret = ioctl(fd, IOMMU_HWPT_ALLOC, &alloc_hwpt);
>>>> +    trace_iommufd_backend_alloc_hwpt(fd, dev_id, pt_id, flags,
>> data_type,
>>>> +                                     data_len, (uint64_t)data_ptr,
>>>> +                                     alloc_hwpt.out_hwpt_id, ret);
>>>> +    if (ret) {
>>>> +        error_setg_errno(errp, errno, "Failed to allocate hwpt");
>>>> +        return false;
>>>> +    }
>>>> +
>>>> +    *out_hwpt = alloc_hwpt.out_hwpt_id;
>>>> +    return true;
>>>> +}
>>>> +
>>>> bool iommufd_backend_get_device_info(IOMMUFDBackend *be,
>> uint32_t
>>>> devid,
>>>>                                      uint32_t *type, void *data, uint32_t len,
>>>>                                      uint64_t *caps, Error **errp)
>>>> diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
>>>> index 077dea8f1b64..325c7598d5a1 100644
>>>> --- a/hw/vfio/iommufd.c
>>>> +++ b/hw/vfio/iommufd.c
>>>> @@ -212,10 +212,86 @@ static bool
>>>> iommufd_cdev_detach_ioas_hwpt(VFIODevice *vbasedev, Error **errp)
>>>>     return true;
>>>> }
>>>>
>>>> +static bool iommufd_cdev_autodomains_get(VFIODevice *vbasedev,
>>>> +                                         VFIOIOMMUFDContainer *container,
>>>> +                                         Error **errp)
>>>> +{
>>>> +    IOMMUFDBackend *iommufd = vbasedev->iommufd;
>>>> +    uint32_t flags = 0;
>>>> +    VFIOIOASHwpt *hwpt;
>>>> +    uint32_t hwpt_id;
>>>> +    int ret;
>>>> +
>>>> +    /* Try to find a domain */
>>>> +    QLIST_FOREACH(hwpt, &container->hwpt_list, next) {
>>>> +        ret = iommufd_cdev_attach_ioas_hwpt(vbasedev, hwpt->hwpt_id,
>>>> errp);
>>>
>>> If there is already an hwpt that supports dirty tracking.
>>> Another device that doesn't support dirty tracking attaches to this hwpt,
>> will it succeed?
>>>
>>
>> It returns -EINVAL, and we handle that right after this statement. Which
>> means
>> another HWPT is created.
> 
> Looked into kernel code, I didn't see the check about dirty tracking between device and hwpt, do you know which func does that?
> 

A device is associated with a group (aka IOMMU instance) and those checks
happens when the device in a group is firstly being attached the first time or
belongs to some *other* group and gets attach to this domain with dirty tracking
enforced. If the device belongs to the same group that had a device attached
already there's just a bump in the refcount and device is added to the /same
group/ device list. Otherwise the device belongs to a different group and it's
being attached to a domain and the various checks get triggered (dirty tracking
being one of them). These attachment validation checks are part of the iommu
driver, not core (the core just sees a .attach_dev() failure).

Usually follows this codepath when the group attachment checks are firstly being
done:

vfio_iommufd_physical_attach_ioas()
 iommufd_device_attach()
  iommufd_device_do_attach()
   iommufd_hw_pagetable_attach()
    iommu_attach_group()
    ...
     __iommu_attach_device()

Then each iommu driver defines the compatibility checks and if the domain has
dirty_ops set (that comes from this ALLOC_DIRTY_TRACKING flag) and the IOMMU
backing the device doesn't have dirty tracking the driver returns -EINVAL
e.g. on Intel IOMMU:

intel_iommu_attach_device()
  prepare_domain_attach_device():
    domain->dirty_ops && !ssads_supported(iommu)
	return -EINVAL;


>>
>>> If existing hwpt doesn't support dirty tracking.
>>> Another device supporting dirty tracking attaches to that hwpt, what will
>> happen?
>>>
>>
>> Hmm, It succeeds as there's no incompatbility. At the very least I plan on
>> blocking migration if the device neither has VF dirty tracking, nor IOMMU
>> dirty
>> tracking (and patch 11 needs to be adjusted to check hwpt_flags instead of
>> container).
> 
> When bcontainer->dirty_pages_supported is true, I think that container should only contains hwpt list that support dirty tracking. All hwpt not supporting dirty tracking should be in other container.
> 
Well but we are adopting this auto domains scheme and works for any device,
dirty tracking or not. We already track hwpt flags so we know which ones support
dirty tracking. This differentiation would (IMHO) complicate more and I am not
sure the gain

> If device supports dirty tracking, it should bypass attaching container that doesn't support dirty tracking. Vise versa.
> This way we can support the mixing environment.
> 

It's not that easy as the whole flow doesn't handle this mixed mode (even
excluding this series). We would to have device-dirty-tracking start all
non-disabled device trackers first [and stop them as well], and then we would
always iterate those first (if device dirty trackers are active), and then defer
to IOMMU tracker for those who don't.

But given this mixed mode might be prone to regressions plus with me being
dangerously close to softfreeze too, I was deeming it follow-up. And hence
hoping I improve detection when the IOMMU doesn't provide the lowest common
denominator for the 'all or nothing' mode then it would block migration. I can
turn that if statement in {start,query}_dirty_tracking into an assert if that
improves things.

> 
>>
>> Qemu right now doesn't handle heteregenous environment, it's all of
>> nothing
>> approach even before this patchset. Additionally, I am not sure server
>> environments are applicable here. So essentially I kept the status quo --
>> more
>> follow-up is needed to support a mix and match of IOMMU + VF dirty
>> tracking too.
>> The challenge is having the migration state of VFIO device initialized early
>> enough that we can make all sort of decisions whether IOMMU dirty tracking
>> is
>> desired on a per-device basis.
Duan, Zhenzhong July 18, 2024, 7:44 a.m. UTC | #18
>-----Original Message-----
>From: Joao Martins <joao.m.martins@oracle.com>
>Subject: Re: [PATCH v4 05/12] vfio/iommufd: Introduce auto domain
>creation
>
>On 17/07/2024 11:05, Duan, Zhenzhong wrote:
>>> -----Original Message-----
>>> From: Joao Martins <joao.m.martins@oracle.com>
>>> Subject: Re: [PATCH v4 05/12] vfio/iommufd: Introduce auto domain
>>> creation
>>>
>>> On 17/07/2024 03:18, Duan, Zhenzhong wrote:
>>>>
>>>>
>>>>> -----Original Message-----
>>>>> From: Joao Martins <joao.m.martins@oracle.com>
>>>>> Subject: [PATCH v4 05/12] vfio/iommufd: Introduce auto domain
>>> creation
>>>>>
>>>>> There's generally two modes of operation for IOMMUFD:
>>>>>
>>>>> * The simple user API which intends to perform relatively simple things
>>>>> with IOMMUs e.g. DPDK. It generally creates an IOAS and attach to
>VFIO
>>>>> and mainly performs IOAS_MAP and UNMAP.
>>>>>
>>>>> * The native IOMMUFD API where you have fine grained control of the
>>>>> IOMMU domain and model it accordingly. This is where most new
>feature
>>>>> are being steered to.
>>>>>
>>>>> For dirty tracking 2) is required, as it needs to ensure that
>>>>> the stage-2/parent IOMMU domain will only attach devices
>>>>> that support dirty tracking (so far it is all homogeneous in x86, likely
>>>>> not the case for smmuv3). Such invariant on dirty tracking provides a
>>>>> useful guarantee to VMMs that will refuse incompatible device
>>>>> attachments for IOMMU domains.
>>>>>
>>>>> Dirty tracking insurance is enforced via HWPT_ALLOC, which is
>>>>> responsible for creating an IOMMU domain. This is contrast to the
>>>>> 'simple API' where the IOMMU domain is created by IOMMUFD
>>>>> automatically
>>>>> when it attaches to VFIO (usually referred as autodomains) but it has
>>>>> the needed handling for mdevs.
>>>>>
>>>>> To support dirty tracking with the advanced IOMMUFD API, it needs
>>>>> similar logic, where IOMMU domains are created and devices attached
>to
>>>>> compatible domains. Essentially mimmicing kernel
>>>>> iommufd_device_auto_get_domain(). With mdevs given there's no
>>> IOMMU
>>>>> domain
>>>>> it falls back to IOAS attach.
>>>>>
>>>>> The auto domain logic allows different IOMMU domains to be created
>>> when
>>>>> DMA dirty tracking is not desired (and VF can provide it), and others
>>> where
>>>>> it is. Here is not used in this way here given how VFIODevice migration
>>>>> state is initialized after the device attachment. But such mixed mode of
>>>>> IOMMU dirty tracking + device dirty tracking is an improvement that
>can
>>>>> be added on. Keep the 'all of nothing' of type1 approach that we have
>>>>> been using so far between container vs device dirty tracking.
>>>>>
>>>>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>>>>> ---
>>>>> include/hw/vfio/vfio-common.h |  9 ++++
>>>>> include/sysemu/iommufd.h      |  5 +++
>>>>> backends/iommufd.c            | 30 +++++++++++++
>>>>> hw/vfio/iommufd.c             | 82
>>>>> +++++++++++++++++++++++++++++++++++
>>>>> backends/trace-events         |  1 +
>>>>> 5 files changed, 127 insertions(+)
>>>>>
>>>>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-
>>>>> common.h
>>>>> index 7419466bca92..2dd468ce3c02 100644
>>>>> --- a/include/hw/vfio/vfio-common.h
>>>>> +++ b/include/hw/vfio/vfio-common.h
>>>>> @@ -95,10 +95,17 @@ typedef struct VFIOHostDMAWindow {
>>>>>
>>>>> typedef struct IOMMUFDBackend IOMMUFDBackend;
>>>>>
>>>>> +typedef struct VFIOIOASHwpt {
>>>>> +    uint32_t hwpt_id;
>>>>> +    QLIST_HEAD(, VFIODevice) device_list;
>>>>> +    QLIST_ENTRY(VFIOIOASHwpt) next;
>>>>> +} VFIOIOASHwpt;
>>>>> +
>>>>> typedef struct VFIOIOMMUFDContainer {
>>>>>     VFIOContainerBase bcontainer;
>>>>>     IOMMUFDBackend *be;
>>>>>     uint32_t ioas_id;
>>>>> +    QLIST_HEAD(, VFIOIOASHwpt) hwpt_list;
>>>>> } VFIOIOMMUFDContainer;
>>>>>
>>>>> OBJECT_DECLARE_SIMPLE_TYPE(VFIOIOMMUFDContainer,
>>>>> VFIO_IOMMU_IOMMUFD);
>>>>> @@ -135,6 +142,8 @@ typedef struct VFIODevice {
>>>>>     HostIOMMUDevice *hiod;
>>>>>     int devid;
>>>>>     IOMMUFDBackend *iommufd;
>>>>> +    VFIOIOASHwpt *hwpt;
>>>>> +    QLIST_ENTRY(VFIODevice) hwpt_next;
>>>>> } VFIODevice;
>>>>>
>>>>> struct VFIODeviceOps {
>>>>> diff --git a/include/sysemu/iommufd.h b/include/sysemu/iommufd.h
>>>>> index 57d502a1c79a..e917e7591d05 100644
>>>>> --- a/include/sysemu/iommufd.h
>>>>> +++ b/include/sysemu/iommufd.h
>>>>> @@ -50,6 +50,11 @@ int
>>>>> iommufd_backend_unmap_dma(IOMMUFDBackend *be, uint32_t
>>> ioas_id,
>>>>> bool iommufd_backend_get_device_info(IOMMUFDBackend *be,
>>> uint32_t
>>>>> devid,
>>>>>                                      uint32_t *type, void *data, uint32_t len,
>>>>>                                      uint64_t *caps, Error **errp);
>>>>> +bool iommufd_backend_alloc_hwpt(IOMMUFDBackend *be,
>uint32_t
>>>>> dev_id,
>>>>> +                                uint32_t pt_id, uint32_t flags,
>>>>> +                                uint32_t data_type, uint32_t data_len,
>>>>> +                                void *data_ptr, uint32_t *out_hwpt,
>>>>> +                                Error **errp);
>>>>>
>>>>> #define TYPE_HOST_IOMMU_DEVICE_IOMMUFD
>>>>> TYPE_HOST_IOMMU_DEVICE "-iommufd"
>>>>> #endif
>>>>> diff --git a/backends/iommufd.c b/backends/iommufd.c
>>>>> index 2b3d51af26d2..5d3dfa917415 100644
>>>>> --- a/backends/iommufd.c
>>>>> +++ b/backends/iommufd.c
>>>>> @@ -208,6 +208,36 @@ int
>>>>> iommufd_backend_unmap_dma(IOMMUFDBackend *be, uint32_t
>>> ioas_id,
>>>>>     return ret;
>>>>> }
>>>>>
>>>>> +bool iommufd_backend_alloc_hwpt(IOMMUFDBackend *be,
>uint32_t
>>>>> dev_id,
>>>>> +                                uint32_t pt_id, uint32_t flags,
>>>>> +                                uint32_t data_type, uint32_t data_len,
>>>>> +                                void *data_ptr, uint32_t *out_hwpt,
>>>>> +                                Error **errp)
>>>>> +{
>>>>> +    int ret, fd = be->fd;
>>>>> +    struct iommu_hwpt_alloc alloc_hwpt = {
>>>>> +        .size = sizeof(struct iommu_hwpt_alloc),
>>>>> +        .flags = flags,
>>>>> +        .dev_id = dev_id,
>>>>> +        .pt_id = pt_id,
>>>>> +        .data_type = data_type,
>>>>> +        .data_len = data_len,
>>>>> +        .data_uptr = (uint64_t)data_ptr,
>>>>> +    };
>>>>> +
>>>>> +    ret = ioctl(fd, IOMMU_HWPT_ALLOC, &alloc_hwpt);
>>>>> +    trace_iommufd_backend_alloc_hwpt(fd, dev_id, pt_id, flags,
>>> data_type,
>>>>> +                                     data_len, (uint64_t)data_ptr,
>>>>> +                                     alloc_hwpt.out_hwpt_id, ret);
>>>>> +    if (ret) {
>>>>> +        error_setg_errno(errp, errno, "Failed to allocate hwpt");
>>>>> +        return false;
>>>>> +    }
>>>>> +
>>>>> +    *out_hwpt = alloc_hwpt.out_hwpt_id;
>>>>> +    return true;
>>>>> +}
>>>>> +
>>>>> bool iommufd_backend_get_device_info(IOMMUFDBackend *be,
>>> uint32_t
>>>>> devid,
>>>>>                                      uint32_t *type, void *data, uint32_t len,
>>>>>                                      uint64_t *caps, Error **errp)
>>>>> diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
>>>>> index 077dea8f1b64..325c7598d5a1 100644
>>>>> --- a/hw/vfio/iommufd.c
>>>>> +++ b/hw/vfio/iommufd.c
>>>>> @@ -212,10 +212,86 @@ static bool
>>>>> iommufd_cdev_detach_ioas_hwpt(VFIODevice *vbasedev, Error
>**errp)
>>>>>     return true;
>>>>> }
>>>>>
>>>>> +static bool iommufd_cdev_autodomains_get(VFIODevice *vbasedev,
>>>>> +                                         VFIOIOMMUFDContainer *container,
>>>>> +                                         Error **errp)
>>>>> +{
>>>>> +    IOMMUFDBackend *iommufd = vbasedev->iommufd;
>>>>> +    uint32_t flags = 0;
>>>>> +    VFIOIOASHwpt *hwpt;
>>>>> +    uint32_t hwpt_id;
>>>>> +    int ret;
>>>>> +
>>>>> +    /* Try to find a domain */
>>>>> +    QLIST_FOREACH(hwpt, &container->hwpt_list, next) {
>>>>> +        ret = iommufd_cdev_attach_ioas_hwpt(vbasedev, hwpt-
>>hwpt_id,
>>>>> errp);
>>>>
>>>> If there is already an hwpt that supports dirty tracking.
>>>> Another device that doesn't support dirty tracking attaches to this hwpt,
>>> will it succeed?
>>>>
>>>
>>> It returns -EINVAL, and we handle that right after this statement. Which
>>> means
>>> another HWPT is created.
>>
>> Looked into kernel code, I didn't see the check about dirty tracking
>between device and hwpt, do you know which func does that?
>>
>
>A device is associated with a group (aka IOMMU instance) and those checks
>happens when the device in a group is firstly being attached the first time or
>belongs to some *other* group and gets attach to this domain with dirty
>tracking
>enforced. If the device belongs to the same group that had a device attached
>already there's just a bump in the refcount and device is added to the /same
>group/ device list. Otherwise the device belongs to a different group and it's
>being attached to a domain and the various checks get triggered (dirty
>tracking
>being one of them). These attachment validation checks are part of the
>iommu
>driver, not core (the core just sees a .attach_dev() failure).
>
>Usually follows this codepath when the group attachment checks are firstly
>being
>done:
>
>vfio_iommufd_physical_attach_ioas()
> iommufd_device_attach()
>  iommufd_device_do_attach()
>   iommufd_hw_pagetable_attach()
>    iommu_attach_group()
>    ...
>     __iommu_attach_device()
>
>Then each iommu driver defines the compatibility checks and if the domain
>has
>dirty_ops set (that comes from this ALLOC_DIRTY_TRACKING flag) and the
>IOMMU
>backing the device doesn't have dirty tracking the driver returns -EINVAL
>e.g. on Intel IOMMU:
>
>intel_iommu_attach_device()
>  prepare_domain_attach_device():
>    domain->dirty_ops && !ssads_supported(iommu)
>	return -EINVAL;

Understood, thanks.

>
>
>>>
>>>> If existing hwpt doesn't support dirty tracking.
>>>> Another device supporting dirty tracking attaches to that hwpt, what
>will
>>> happen?
>>>>
>>>
>>> Hmm, It succeeds as there's no incompatbility. At the very least I plan on
>>> blocking migration if the device neither has VF dirty tracking, nor IOMMU
>>> dirty
>>> tracking (and patch 11 needs to be adjusted to check hwpt_flags instead
>of
>>> container).
>>
>> When bcontainer->dirty_pages_supported is true, I think that container
>should only contains hwpt list that support dirty tracking. All hwpt not
>supporting dirty tracking should be in other container.
>>
>Well but we are adopting this auto domains scheme and works for any
>device,
>dirty tracking or not. We already track hwpt flags so we know which ones
>support
>dirty tracking. This differentiation would (IMHO) complicate more and I am
>not
>sure the gain

OK, I was trying to make bcontainer->dirty_pages_supported  accurate because it is used in many functions such as vfio_get_dirty_bitmap() which require an accurate value. If there is mix of hwpt in that container, that's impossible.

But as you say you want to address the mix issue in a follow-up and presume all are homogeneous hw for now, then OK, there is no conflict.

>
>> If device supports dirty tracking, it should bypass attaching container that
>doesn't support dirty tracking. Vise versa.
>> This way we can support the mixing environment.
>>
>
>It's not that easy as the whole flow doesn't handle this mixed mode (even
>excluding this series). We would to have device-dirty-tracking start all
>non-disabled device trackers first [and stop them as well], and then we
>would
>always iterate those first (if device dirty trackers are active), and then defer
>to IOMMU tracker for those who don't.

Why is device-dirty-tracking preferred over IOMMU dirty tracking?
Imagine if many devices attached to same domain.

>
>But given this mixed mode might be prone to regressions plus with me being
>dangerously close to softfreeze too, I was deeming it follow-up. And hence
>hoping I improve detection when the IOMMU doesn't provide the lowest
>common
>denominator for the 'all or nothing' mode then it would block migration. I
>can
>turn that if statement in {start,query}_dirty_tracking into an assert if that
>improves things.

OK

>
>>
>>>
>>> Qemu right now doesn't handle heteregenous environment, it's all of
>>> nothing
>>> approach even before this patchset. Additionally, I am not sure server
>>> environments are applicable here. So essentially I kept the status quo --
>>> more
>>> follow-up is needed to support a mix and match of IOMMU + VF dirty
>>> tracking too.
>>> The challenge is having the migration state of VFIO device initialized early
>>> enough that we can make all sort of decisions whether IOMMU dirty
>tracking
>>> is
>>> desired on a per-device basis.

OK.

Thanks
Zhenzhong
Joao Martins July 18, 2024, 9:16 a.m. UTC | #19
On 18/07/2024 08:44, Duan, Zhenzhong wrote:
>>>>> If existing hwpt doesn't support dirty tracking.
>>>>> Another device supporting dirty tracking attaches to that hwpt, what
>> will
>>>> happen?
>>>>>
>>>>
>>>> Hmm, It succeeds as there's no incompatbility. At the very least I plan on
>>>> blocking migration if the device neither has VF dirty tracking, nor IOMMU
>>>> dirty
>>>> tracking (and patch 11 needs to be adjusted to check hwpt_flags instead
>> of
>>>> container).
>>>
>>> When bcontainer->dirty_pages_supported is true, I think that container
>> should only contains hwpt list that support dirty tracking. All hwpt not
>> supporting dirty tracking should be in other container.
>>>
>> Well but we are adopting this auto domains scheme and works for any
>> device,
>> dirty tracking or not. We already track hwpt flags so we know which ones
>> support
>> dirty tracking. This differentiation would (IMHO) complicate more and I am
>> not
>> sure the gain
> 
> OK, I was trying to make bcontainer->dirty_pages_supported  accurate because it is used in many functions such as vfio_get_dirty_bitmap() which require an accurate value. If there is mix of hwpt in that container, that's impossible.
> 
> But as you say you want to address the mix issue in a follow-up and presume all are homogeneous hw for now, then OK, there is no conflict.
> 

Right

>>
>>> If device supports dirty tracking, it should bypass attaching container that
>> doesn't support dirty tracking. Vise versa.
>>> This way we can support the mixing environment.
>>>
>>
>> It's not that easy as the whole flow doesn't handle this mixed mode (even
>> excluding this series). We would to have device-dirty-tracking start all
>> non-disabled device trackers first [and stop them as well], and then we
>> would
>> always iterate those first (if device dirty trackers are active), and then defer
>> to IOMMU tracker for those who don't.
> 
> Why is device-dirty-tracking preferred over IOMMU dirty tracking?
> Imagine if many devices attached to same domain.
> 

The heuristic or expectation is that device dirty tracking doesn't involve a
compromise for SW because it can a) perform lowest granularity of IOVA range
being dirty with b) no DMA penalty. With IOMMU though, SW needs to worry about
managing page tables to dictate the granularity and those take time to walk the
deeper the level we descend into. I used to think that IOMMU we have DMA penalty
(because of the IOTLB flushes to clear dirty bit, and IOTLB cache misses) but I
haven't yet that materialized in the field yet (at least for 100Gbit/s rates).

TL;DR At the end of the day with device dirty tracking you have less to worry
about, and it's the VF doing most of the heavy lifting. In theory with device
dirty tracking you could even perform sub basepage tracking if the device allows
it to do so.
Joao Martins July 18, 2024, 1:47 p.m. UTC | #20
On 17/07/2024 10:31, Joao Martins wrote:
> On 17/07/2024 10:28, Cédric Le Goater wrote:
>>>>>>>> @@ -224,6 +300,11 @@ static void
>>>>> iommufd_cdev_detach_container(VFIODevice *vbasedev,
>>>>>>>>   {
>>>>>>>>       Error *err = NULL;
>>>>>>>>
>>>>>>>> +    if (vbasedev->hwpt) {
>>>>>>>> +        iommufd_cdev_autodomains_put(vbasedev, container);
>>>>>>>> +        return;
>>>>>>> Where do we detach the device from the hwpt?
>>>>>>>
>>>>>> In iommufd_backend_free_id() for auto domains
>>>>>>
>>>>>
>>>>> to clarify here I meant *userspace* auto domains
>>>>>
>>>>> *kernel* auto domains (mdev) goes via DETACH_IOMMUFD_PT
>>>>
>>>> If the device is still attached to the hwpt, will iommufd_backend_free_id()
>>>> succeed?
>>>> Have you tried the hot unplug?
>>>>
>>>
>>> I have but I didn't see any errors. But I will check again for v5 as it could
>>> also be my oversight.
>>>
>>> I was thinking about Eric's remark overnight and I think what I am doing is not
>>> correct regardless of the above.
>>>
>>> I should be calling DETACH_IOMMUFD_PT pairing with ATTACH_IOMMUFD_PT, and the
>>> iommufd_backend_free_id() is to drop the final reference pairing with
>>> alloc_hwpt() when the device list is empty i.e. when there's no more devices in
>>> that vdev::hwpt.
>>>
>>> DETACH_IOMMUFD_PT decrement the hwpt refcount and it doesn't differentiate
>>> between auto domains vs manual domains.
>>>
>>> The code is already there anyhow it just has the order of
>>> iommufd_cdev_autodomains_put vs detach invocation reversed; I'll fix that for
>>> next version.
>>
>> While at it, could you please move these routines :
>>
>>   iommufd_cdev_detach_ioas_hwpt
>>   iommufd_cdev_attach_ioas_hwpt
>>  
>> under backends/iommufd.c ? I think that's where they belong.
> 
> OK

At the first glance I thought this was a good idea. But these functions while
they attach an IOMMUFD they do not really talk to an IOMMUFD backend, but to a
VFIO device file descriptor. Now I think they are in the right place here and we
would leave IOMMUFD uAPI things to backends/iommufd and VFIO APIs in hw/vfio/.

It also uses a lot of VFIODevice* which requires some funny includes in
sysemu/iommufd.h.

Do you still want me to go ahead with it? Here's a snip below of the change
involved:

diff --git a/backends/iommufd.c b/backends/iommufd.c
index 2b3d51af26d2..19d1e430ef48 100644
--- a/backends/iommufd.c
+++ b/backends/iommufd.c
@@ -20,6 +20,7 @@
 #include "trace.h"
 #include <sys/ioctl.h>
 #include <linux/iommufd.h>
+#include <linux/vfio.h>

 static void iommufd_backend_init(Object *obj)
 {
@@ -232,6 +233,46 @@ bool iommufd_backend_get_device_info(IOMMUFDBackend *be,
uint32_t devid,
     return true;
 }

+int iommufd_cdev_attach_ioas_hwpt(VFIODevice *vbasedev, uint32_t id,
+                                  Error **errp)
+{
+    int iommufd = vbasedev->iommufd->fd;
+    struct vfio_device_attach_iommufd_pt attach_data = {
+        .argsz = sizeof(attach_data),
+        .flags = 0,
+        .pt_id = id,
+    };
+
+    /* Attach device to an IOAS or hwpt within iommufd */
+    if (ioctl(vbasedev->fd, VFIO_DEVICE_ATTACH_IOMMUFD_PT, &attach_data)) {
+        error_setg_errno(errp, errno,
+                         "[iommufd=%d] error attach %s (%d) to id=%d",
+                         iommufd, vbasedev->name, vbasedev->fd, id);
+        return -errno;
+    }
+
+    trace_iommufd_cdev_attach_ioas_hwpt(iommufd, vbasedev->name,
+                                        vbasedev->fd, id);
+    return 0;
+}
+
+bool iommufd_cdev_detach_ioas_hwpt(VFIODevice *vbasedev, Error **errp)
+{
+    int iommufd = vbasedev->iommufd->fd;
+    struct vfio_device_detach_iommufd_pt detach_data = {
+        .argsz = sizeof(detach_data),
+        .flags = 0,
+    };
+
+    if (ioctl(vbasedev->fd, VFIO_DEVICE_DETACH_IOMMUFD_PT, &detach_data)) {
+        error_setg_errno(errp, errno, "detach %s failed", vbasedev->name);
+        return false;
+    }
+
+    trace_iommufd_cdev_detach_ioas_hwpt(iommufd, vbasedev->name);
+    return true;
+}
+
 static int hiod_iommufd_get_cap(HostIOMMUDevice *hiod, int cap, Error **errp)
 {
     HostIOMMUDeviceCaps *caps = &hiod->caps;
diff --git a/backends/trace-events b/backends/trace-events
index 211e6f374adc..2fee8e0af20e 100644
--- a/backends/trace-events
+++ b/backends/trace-events
@@ -15,3 +15,5 @@ iommufd_backend_unmap_dma_non_exist(int iommufd, uint32_t
ioas, uint64_t iova, u
 iommufd_backend_unmap_dma(int iommufd, uint32_t ioas, uint64_t iova, uint64_t
size, int ret) " iommufd=%d ioas=%d iova=0x%"PRIx64" s
ize=0x%"PRIx64" (%d)"
 iommufd_backend_alloc_ioas(int iommufd, uint32_t ioas) " iommufd=%d ioas=%d"
 iommufd_backend_free_id(int iommufd, uint32_t id, int ret) " iommufd=%d id=%d (%d)"
+iommufd_cdev_attach_ioas_hwpt(int iommufd, const char *name, int devfd, int id)
" [iommufd=%d] Successfully attached device %s (%d)
to id=%d"
+iommufd_cdev_detach_ioas_hwpt(int iommufd, const char *name) " [iommufd=%d]
Successfully detached %s"
diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
index 077dea8f1b64..5a6d56c915e2 100644
--- a/hw/vfio/iommufd.c
+++ b/hw/vfio/iommufd.c
@@ -172,46 +172,6 @@ out:
     return ret;
 }

-static int iommufd_cdev_attach_ioas_hwpt(VFIODevice *vbasedev, uint32_t id,
-                                         Error **errp)
-{
-    int iommufd = vbasedev->iommufd->fd;
-    struct vfio_device_attach_iommufd_pt attach_data = {
-        .argsz = sizeof(attach_data),
-        .flags = 0,
-        .pt_id = id,
-    };
-
-    /* Attach device to an IOAS or hwpt within iommufd */
-    if (ioctl(vbasedev->fd, VFIO_DEVICE_ATTACH_IOMMUFD_PT, &attach_data)) {
-        error_setg_errno(errp, errno,
-                         "[iommufd=%d] error attach %s (%d) to id=%d",
-                         iommufd, vbasedev->name, vbasedev->fd, id);
-        return -errno;
-    }
-
-    trace_iommufd_cdev_attach_ioas_hwpt(iommufd, vbasedev->name,
-                                        vbasedev->fd, id);
-    return 0;
-}
-
-static bool iommufd_cdev_detach_ioas_hwpt(VFIODevice *vbasedev, Error **errp)
-{
-    int iommufd = vbasedev->iommufd->fd;
-    struct vfio_device_detach_iommufd_pt detach_data = {
-        .argsz = sizeof(detach_data),
-        .flags = 0,
-    };
-
-    if (ioctl(vbasedev->fd, VFIO_DEVICE_DETACH_IOMMUFD_PT, &detach_data)) {
-        error_setg_errno(errp, errno, "detach %s failed", vbasedev->name);
-        return false;
-    }
-
-    trace_iommufd_cdev_detach_ioas_hwpt(iommufd, vbasedev->name);
-    return true;
-}
-
 static bool iommufd_cdev_attach_container(VFIODevice *vbasedev,
                                           VFIOIOMMUFDContainer *container,
                                           Error **errp)
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index e16179b507ed..24fde6270112 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -170,8 +170,6 @@ vfio_vmstate_change_prepare(const char *name, int running,
const char *reason, c

 iommufd_cdev_connect_and_bind(int iommufd, const char *name, int devfd, int
devid) " [iommufd=%d] Successfully bound device %s (fd=%
d): output devid=%d"
 iommufd_cdev_getfd(const char *dev, int devfd) " %s (fd=%d)"
-iommufd_cdev_attach_ioas_hwpt(int iommufd, const char *name, int devfd, int id)
" [iommufd=%d] Successfully attached device %s (%d)
to id=%d"
-iommufd_cdev_detach_ioas_hwpt(int iommufd, const char *name) " [iommufd=%d]
Successfully detached %s"
 iommufd_cdev_fail_attach_existing_container(const char *msg) " %s"
 iommufd_cdev_alloc_ioas(int iommufd, int ioas_id) " [iommufd=%d] new IOMMUFD
container with ioasid=%d"
 iommufd_cdev_device_info(char *name, int devfd, int num_irqs, int num_regions,
int flags) " %s (%d) num_irqs=%d num_regions=%d flags
=%d"
diff --git a/include/sysemu/iommufd.h b/include/sysemu/iommufd.h
index 57d502a1c79a..89780669118f 100644
--- a/include/sysemu/iommufd.h
+++ b/include/sysemu/iommufd.h
@@ -18,6 +18,8 @@
 #include "exec/hwaddr.h"
 #include "exec/cpu-common.h"
 #include "sysemu/host_iommu_device.h"
+#include "hw/vfio/vfio-common.h"
+#include "hw/vfio/vfio-container-base.h"

 #define TYPE_IOMMUFD_BACKEND "iommufd"
 OBJECT_DECLARE_TYPE(IOMMUFDBackend, IOMMUFDBackendClass, IOMMUFD_BACKEND)
@@ -51,5 +53,9 @@ bool iommufd_backend_get_device_info(IOMMUFDBackend *be,
uint32_t devid,
                                      uint32_t *type, void *data, uint32_t len,
                                      uint64_t *caps, Error **errp);

+bool iommufd_cdev_detach_ioas_hwpt(VFIODevice *vbasedev, Error **errp);
+int iommufd_cdev_attach_ioas_hwpt(VFIODevice *vbasedev, uint32_t id,
+                                  Error **errp);
+
 #define TYPE_HOST_IOMMU_DEVICE_IOMMUFD TYPE_HOST_IOMMU_DEVICE "-iommufd"
 #endif
Duan, Zhenzhong July 19, 2024, 2:36 a.m. UTC | #21
>-----Original Message-----
>From: Joao Martins <joao.m.martins@oracle.com>
>Subject: Re: [PATCH v4 05/12] vfio/iommufd: Introduce auto domain
>creation
>
>On 18/07/2024 08:44, Duan, Zhenzhong wrote:
>>>>>> If existing hwpt doesn't support dirty tracking.
>>>>>> Another device supporting dirty tracking attaches to that hwpt, what
>>> will
>>>>> happen?
>>>>>>
>>>>>
>>>>> Hmm, It succeeds as there's no incompatbility. At the very least I plan
>on
>>>>> blocking migration if the device neither has VF dirty tracking, nor
>IOMMU
>>>>> dirty
>>>>> tracking (and patch 11 needs to be adjusted to check hwpt_flags
>instead
>>> of
>>>>> container).
>>>>
>>>> When bcontainer->dirty_pages_supported is true, I think that container
>>> should only contains hwpt list that support dirty tracking. All hwpt not
>>> supporting dirty tracking should be in other container.
>>>>
>>> Well but we are adopting this auto domains scheme and works for any
>>> device,
>>> dirty tracking or not. We already track hwpt flags so we know which ones
>>> support
>>> dirty tracking. This differentiation would (IMHO) complicate more and I
>am
>>> not
>>> sure the gain
>>
>> OK, I was trying to make bcontainer->dirty_pages_supported  accurate
>because it is used in many functions such as vfio_get_dirty_bitmap() which
>require an accurate value. If there is mix of hwpt in that container, that's
>impossible.
>>
>> But as you say you want to address the mix issue in a follow-up and
>presume all are homogeneous hw for now, then OK, there is no conflict.
>>
>
>Right
>
>>>
>>>> If device supports dirty tracking, it should bypass attaching container
>that
>>> doesn't support dirty tracking. Vise versa.
>>>> This way we can support the mixing environment.
>>>>
>>>
>>> It's not that easy as the whole flow doesn't handle this mixed mode (even
>>> excluding this series). We would to have device-dirty-tracking start all
>>> non-disabled device trackers first [and stop them as well], and then we
>>> would
>>> always iterate those first (if device dirty trackers are active), and then
>defer
>>> to IOMMU tracker for those who don't.
>>
>> Why is device-dirty-tracking preferred over IOMMU dirty tracking?
>> Imagine if many devices attached to same domain.
>>
>
>The heuristic or expectation is that device dirty tracking doesn't involve a
>compromise for SW because it can a) perform lowest granularity of IOVA
>range
>being dirty with b) no DMA penalty. With IOMMU though, SW needs to
>worry about
>managing page tables to dictate the granularity and those take time to walk
>the
>deeper the level we descend into. I used to think that IOMMU we have DMA
>penalty
>(because of the IOTLB flushes to clear dirty bit, and IOTLB cache misses) but I
>haven't yet that materialized in the field yet (at least for 100Gbit/s rates).
>
>TL;DR At the end of the day with device dirty tracking you have less to worry
>about, and it's the VF doing most of the heavy lifting. In theory with device
>dirty tracking you could even perform sub basepage tracking if the device
>allows
>it to do so.

Clear, thanks Joao.

BRs.
Zhenzhong
Cédric Le Goater July 19, 2024, 6:06 a.m. UTC | #22
On 7/18/24 15:47, Joao Martins wrote:
> On 17/07/2024 10:31, Joao Martins wrote:
>> On 17/07/2024 10:28, Cédric Le Goater wrote:
>>>>>>>>> @@ -224,6 +300,11 @@ static void
>>>>>> iommufd_cdev_detach_container(VFIODevice *vbasedev,
>>>>>>>>>    {
>>>>>>>>>        Error *err = NULL;
>>>>>>>>>
>>>>>>>>> +    if (vbasedev->hwpt) {
>>>>>>>>> +        iommufd_cdev_autodomains_put(vbasedev, container);
>>>>>>>>> +        return;
>>>>>>>> Where do we detach the device from the hwpt?
>>>>>>>>
>>>>>>> In iommufd_backend_free_id() for auto domains
>>>>>>>
>>>>>>
>>>>>> to clarify here I meant *userspace* auto domains
>>>>>>
>>>>>> *kernel* auto domains (mdev) goes via DETACH_IOMMUFD_PT
>>>>>
>>>>> If the device is still attached to the hwpt, will iommufd_backend_free_id()
>>>>> succeed?
>>>>> Have you tried the hot unplug?
>>>>>
>>>>
>>>> I have but I didn't see any errors. But I will check again for v5 as it could
>>>> also be my oversight.
>>>>
>>>> I was thinking about Eric's remark overnight and I think what I am doing is not
>>>> correct regardless of the above.
>>>>
>>>> I should be calling DETACH_IOMMUFD_PT pairing with ATTACH_IOMMUFD_PT, and the
>>>> iommufd_backend_free_id() is to drop the final reference pairing with
>>>> alloc_hwpt() when the device list is empty i.e. when there's no more devices in
>>>> that vdev::hwpt.
>>>>
>>>> DETACH_IOMMUFD_PT decrement the hwpt refcount and it doesn't differentiate
>>>> between auto domains vs manual domains.
>>>>
>>>> The code is already there anyhow it just has the order of
>>>> iommufd_cdev_autodomains_put vs detach invocation reversed; I'll fix that for
>>>> next version.
>>>
>>> While at it, could you please move these routines :
>>>
>>>    iommufd_cdev_detach_ioas_hwpt
>>>    iommufd_cdev_attach_ioas_hwpt
>>>   
>>> under backends/iommufd.c ? I think that's where they belong.
>>
>> OK
> 
> At the first glance I thought this was a good idea. But these functions while
> they attach an IOMMUFD they do not really talk to an IOMMUFD backend, but to a
> VFIO device file descriptor. Now I think they are in the right place here and we
> would leave IOMMUFD uAPI things to backends/iommufd and VFIO APIs in hw/vfio/.

yep. I was misled by vbasedev->iommufd->fd which is only used in the trace event.
Let's keep things how they are. Thanks for looking,

C.



> It also uses a lot of VFIODevice* which requires some funny includes in
> sysemu/iommufd.h.
> 
> Do you still want me to go ahead with it? Here's a snip below of the change
> involved:
> 
> diff --git a/backends/iommufd.c b/backends/iommufd.c
> index 2b3d51af26d2..19d1e430ef48 100644
> --- a/backends/iommufd.c
> +++ b/backends/iommufd.c
> @@ -20,6 +20,7 @@
>   #include "trace.h"
>   #include <sys/ioctl.h>
>   #include <linux/iommufd.h>
> +#include <linux/vfio.h>
> 
>   static void iommufd_backend_init(Object *obj)
>   {
> @@ -232,6 +233,46 @@ bool iommufd_backend_get_device_info(IOMMUFDBackend *be,
> uint32_t devid,
>       return true;
>   }
> 
> +int iommufd_cdev_attach_ioas_hwpt(VFIODevice *vbasedev, uint32_t id,
> +                                  Error **errp)
> +{
> +    int iommufd = vbasedev->iommufd->fd;
> +    struct vfio_device_attach_iommufd_pt attach_data = {
> +        .argsz = sizeof(attach_data),
> +        .flags = 0,
> +        .pt_id = id,
> +    };
> +
> +    /* Attach device to an IOAS or hwpt within iommufd */
> +    if (ioctl(vbasedev->fd, VFIO_DEVICE_ATTACH_IOMMUFD_PT, &attach_data)) {
> +        error_setg_errno(errp, errno,
> +                         "[iommufd=%d] error attach %s (%d) to id=%d",
> +                         iommufd, vbasedev->name, vbasedev->fd, id);
> +        return -errno;
> +    }
> +
> +    trace_iommufd_cdev_attach_ioas_hwpt(iommufd, vbasedev->name,
> +                                        vbasedev->fd, id);
> +    return 0;
> +}
> +
> +bool iommufd_cdev_detach_ioas_hwpt(VFIODevice *vbasedev, Error **errp)
> +{
> +    int iommufd = vbasedev->iommufd->fd;
> +    struct vfio_device_detach_iommufd_pt detach_data = {
> +        .argsz = sizeof(detach_data),
> +        .flags = 0,
> +    };
> +
> +    if (ioctl(vbasedev->fd, VFIO_DEVICE_DETACH_IOMMUFD_PT, &detach_data)) {
> +        error_setg_errno(errp, errno, "detach %s failed", vbasedev->name);
> +        return false;
> +    }
> +
> +    trace_iommufd_cdev_detach_ioas_hwpt(iommufd, vbasedev->name);
> +    return true;
> +}
> +
>   static int hiod_iommufd_get_cap(HostIOMMUDevice *hiod, int cap, Error **errp)
>   {
>       HostIOMMUDeviceCaps *caps = &hiod->caps;
> diff --git a/backends/trace-events b/backends/trace-events
> index 211e6f374adc..2fee8e0af20e 100644
> --- a/backends/trace-events
> +++ b/backends/trace-events
> @@ -15,3 +15,5 @@ iommufd_backend_unmap_dma_non_exist(int iommufd, uint32_t
> ioas, uint64_t iova, u
>   iommufd_backend_unmap_dma(int iommufd, uint32_t ioas, uint64_t iova, uint64_t
> size, int ret) " iommufd=%d ioas=%d iova=0x%"PRIx64" s
> ize=0x%"PRIx64" (%d)"
>   iommufd_backend_alloc_ioas(int iommufd, uint32_t ioas) " iommufd=%d ioas=%d"
>   iommufd_backend_free_id(int iommufd, uint32_t id, int ret) " iommufd=%d id=%d (%d)"
> +iommufd_cdev_attach_ioas_hwpt(int iommufd, const char *name, int devfd, int id)
> " [iommufd=%d] Successfully attached device %s (%d)
> to id=%d"
> +iommufd_cdev_detach_ioas_hwpt(int iommufd, const char *name) " [iommufd=%d]
> Successfully detached %s"
> diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
> index 077dea8f1b64..5a6d56c915e2 100644
> --- a/hw/vfio/iommufd.c
> +++ b/hw/vfio/iommufd.c
> @@ -172,46 +172,6 @@ out:
>       return ret;
>   }
> 
> -static int iommufd_cdev_attach_ioas_hwpt(VFIODevice *vbasedev, uint32_t id,
> -                                         Error **errp)
> -{
> -    int iommufd = vbasedev->iommufd->fd;
> -    struct vfio_device_attach_iommufd_pt attach_data = {
> -        .argsz = sizeof(attach_data),
> -        .flags = 0,
> -        .pt_id = id,
> -    };
> -
> -    /* Attach device to an IOAS or hwpt within iommufd */
> -    if (ioctl(vbasedev->fd, VFIO_DEVICE_ATTACH_IOMMUFD_PT, &attach_data)) {
> -        error_setg_errno(errp, errno,
> -                         "[iommufd=%d] error attach %s (%d) to id=%d",
> -                         iommufd, vbasedev->name, vbasedev->fd, id);
> -        return -errno;
> -    }
> -
> -    trace_iommufd_cdev_attach_ioas_hwpt(iommufd, vbasedev->name,
> -                                        vbasedev->fd, id);
> -    return 0;
> -}
> -
> -static bool iommufd_cdev_detach_ioas_hwpt(VFIODevice *vbasedev, Error **errp)
> -{
> -    int iommufd = vbasedev->iommufd->fd;
> -    struct vfio_device_detach_iommufd_pt detach_data = {
> -        .argsz = sizeof(detach_data),
> -        .flags = 0,
> -    };
> -
> -    if (ioctl(vbasedev->fd, VFIO_DEVICE_DETACH_IOMMUFD_PT, &detach_data)) {
> -        error_setg_errno(errp, errno, "detach %s failed", vbasedev->name);
> -        return false;
> -    }
> -
> -    trace_iommufd_cdev_detach_ioas_hwpt(iommufd, vbasedev->name);
> -    return true;
> -}
> -
>   static bool iommufd_cdev_attach_container(VFIODevice *vbasedev,
>                                             VFIOIOMMUFDContainer *container,
>                                             Error **errp)
> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> index e16179b507ed..24fde6270112 100644
> --- a/hw/vfio/trace-events
> +++ b/hw/vfio/trace-events
> @@ -170,8 +170,6 @@ vfio_vmstate_change_prepare(const char *name, int running,
> const char *reason, c
> 
>   iommufd_cdev_connect_and_bind(int iommufd, const char *name, int devfd, int
> devid) " [iommufd=%d] Successfully bound device %s (fd=%
> d): output devid=%d"
>   iommufd_cdev_getfd(const char *dev, int devfd) " %s (fd=%d)"
> -iommufd_cdev_attach_ioas_hwpt(int iommufd, const char *name, int devfd, int id)
> " [iommufd=%d] Successfully attached device %s (%d)
> to id=%d"
> -iommufd_cdev_detach_ioas_hwpt(int iommufd, const char *name) " [iommufd=%d]
> Successfully detached %s"
>   iommufd_cdev_fail_attach_existing_container(const char *msg) " %s"
>   iommufd_cdev_alloc_ioas(int iommufd, int ioas_id) " [iommufd=%d] new IOMMUFD
> container with ioasid=%d"
>   iommufd_cdev_device_info(char *name, int devfd, int num_irqs, int num_regions,
> int flags) " %s (%d) num_irqs=%d num_regions=%d flags
> =%d"
> diff --git a/include/sysemu/iommufd.h b/include/sysemu/iommufd.h
> index 57d502a1c79a..89780669118f 100644
> --- a/include/sysemu/iommufd.h
> +++ b/include/sysemu/iommufd.h
> @@ -18,6 +18,8 @@
>   #include "exec/hwaddr.h"
>   #include "exec/cpu-common.h"
>   #include "sysemu/host_iommu_device.h"
> +#include "hw/vfio/vfio-common.h"
> +#include "hw/vfio/vfio-container-base.h"
> 
>   #define TYPE_IOMMUFD_BACKEND "iommufd"
>   OBJECT_DECLARE_TYPE(IOMMUFDBackend, IOMMUFDBackendClass, IOMMUFD_BACKEND)
> @@ -51,5 +53,9 @@ bool iommufd_backend_get_device_info(IOMMUFDBackend *be,
> uint32_t devid,
>                                        uint32_t *type, void *data, uint32_t len,
>                                        uint64_t *caps, Error **errp);
> 
> +bool iommufd_cdev_detach_ioas_hwpt(VFIODevice *vbasedev, Error **errp);
> +int iommufd_cdev_attach_ioas_hwpt(VFIODevice *vbasedev, uint32_t id,
> +                                  Error **errp);
> +
>   #define TYPE_HOST_IOMMU_DEVICE_IOMMUFD TYPE_HOST_IOMMU_DEVICE "-iommufd"
>   #endif
> 
>
diff mbox series

Patch

diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 7419466bca92..2dd468ce3c02 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -95,10 +95,17 @@  typedef struct VFIOHostDMAWindow {
 
 typedef struct IOMMUFDBackend IOMMUFDBackend;
 
+typedef struct VFIOIOASHwpt {
+    uint32_t hwpt_id;
+    QLIST_HEAD(, VFIODevice) device_list;
+    QLIST_ENTRY(VFIOIOASHwpt) next;
+} VFIOIOASHwpt;
+
 typedef struct VFIOIOMMUFDContainer {
     VFIOContainerBase bcontainer;
     IOMMUFDBackend *be;
     uint32_t ioas_id;
+    QLIST_HEAD(, VFIOIOASHwpt) hwpt_list;
 } VFIOIOMMUFDContainer;
 
 OBJECT_DECLARE_SIMPLE_TYPE(VFIOIOMMUFDContainer, VFIO_IOMMU_IOMMUFD);
@@ -135,6 +142,8 @@  typedef struct VFIODevice {
     HostIOMMUDevice *hiod;
     int devid;
     IOMMUFDBackend *iommufd;
+    VFIOIOASHwpt *hwpt;
+    QLIST_ENTRY(VFIODevice) hwpt_next;
 } VFIODevice;
 
 struct VFIODeviceOps {
diff --git a/include/sysemu/iommufd.h b/include/sysemu/iommufd.h
index 57d502a1c79a..e917e7591d05 100644
--- a/include/sysemu/iommufd.h
+++ b/include/sysemu/iommufd.h
@@ -50,6 +50,11 @@  int iommufd_backend_unmap_dma(IOMMUFDBackend *be, uint32_t ioas_id,
 bool iommufd_backend_get_device_info(IOMMUFDBackend *be, uint32_t devid,
                                      uint32_t *type, void *data, uint32_t len,
                                      uint64_t *caps, Error **errp);
+bool iommufd_backend_alloc_hwpt(IOMMUFDBackend *be, uint32_t dev_id,
+                                uint32_t pt_id, uint32_t flags,
+                                uint32_t data_type, uint32_t data_len,
+                                void *data_ptr, uint32_t *out_hwpt,
+                                Error **errp);
 
 #define TYPE_HOST_IOMMU_DEVICE_IOMMUFD TYPE_HOST_IOMMU_DEVICE "-iommufd"
 #endif
diff --git a/backends/iommufd.c b/backends/iommufd.c
index 2b3d51af26d2..5d3dfa917415 100644
--- a/backends/iommufd.c
+++ b/backends/iommufd.c
@@ -208,6 +208,36 @@  int iommufd_backend_unmap_dma(IOMMUFDBackend *be, uint32_t ioas_id,
     return ret;
 }
 
+bool iommufd_backend_alloc_hwpt(IOMMUFDBackend *be, uint32_t dev_id,
+                                uint32_t pt_id, uint32_t flags,
+                                uint32_t data_type, uint32_t data_len,
+                                void *data_ptr, uint32_t *out_hwpt,
+                                Error **errp)
+{
+    int ret, fd = be->fd;
+    struct iommu_hwpt_alloc alloc_hwpt = {
+        .size = sizeof(struct iommu_hwpt_alloc),
+        .flags = flags,
+        .dev_id = dev_id,
+        .pt_id = pt_id,
+        .data_type = data_type,
+        .data_len = data_len,
+        .data_uptr = (uint64_t)data_ptr,
+    };
+
+    ret = ioctl(fd, IOMMU_HWPT_ALLOC, &alloc_hwpt);
+    trace_iommufd_backend_alloc_hwpt(fd, dev_id, pt_id, flags, data_type,
+                                     data_len, (uint64_t)data_ptr,
+                                     alloc_hwpt.out_hwpt_id, ret);
+    if (ret) {
+        error_setg_errno(errp, errno, "Failed to allocate hwpt");
+        return false;
+    }
+
+    *out_hwpt = alloc_hwpt.out_hwpt_id;
+    return true;
+}
+
 bool iommufd_backend_get_device_info(IOMMUFDBackend *be, uint32_t devid,
                                      uint32_t *type, void *data, uint32_t len,
                                      uint64_t *caps, Error **errp)
diff --git a/hw/vfio/iommufd.c b/hw/vfio/iommufd.c
index 077dea8f1b64..325c7598d5a1 100644
--- a/hw/vfio/iommufd.c
+++ b/hw/vfio/iommufd.c
@@ -212,10 +212,86 @@  static bool iommufd_cdev_detach_ioas_hwpt(VFIODevice *vbasedev, Error **errp)
     return true;
 }
 
+static bool iommufd_cdev_autodomains_get(VFIODevice *vbasedev,
+                                         VFIOIOMMUFDContainer *container,
+                                         Error **errp)
+{
+    IOMMUFDBackend *iommufd = vbasedev->iommufd;
+    uint32_t flags = 0;
+    VFIOIOASHwpt *hwpt;
+    uint32_t hwpt_id;
+    int ret;
+
+    /* Try to find a domain */
+    QLIST_FOREACH(hwpt, &container->hwpt_list, next) {
+        ret = iommufd_cdev_attach_ioas_hwpt(vbasedev, hwpt->hwpt_id, errp);
+        if (ret) {
+            /* -EINVAL means the domain is incompatible with the device. */
+            if (ret == -EINVAL) {
+                /*
+                 * It is an expected failure and it just means we will try
+                 * another domain, or create one if no existing compatible
+                 * domain is found. Hence why the error is discarded below.
+                 */
+                error_free(*errp);
+                *errp = NULL;
+                continue;
+            }
+
+            return false;
+        } else {
+            vbasedev->hwpt = hwpt;
+            QLIST_INSERT_HEAD(&hwpt->device_list, vbasedev, hwpt_next);
+            return true;
+        }
+    }
+
+    if (!iommufd_backend_alloc_hwpt(iommufd, vbasedev->devid,
+                                    container->ioas_id, flags,
+                                    IOMMU_HWPT_DATA_NONE, 0, NULL,
+                                    &hwpt_id, errp)) {
+        return false;
+    }
+
+    hwpt = g_malloc0(sizeof(*hwpt));
+    hwpt->hwpt_id = hwpt_id;
+    QLIST_INIT(&hwpt->device_list);
+
+    ret = iommufd_cdev_attach_ioas_hwpt(vbasedev, hwpt->hwpt_id, errp);
+    if (ret) {
+        iommufd_backend_free_id(container->be, hwpt->hwpt_id);
+        g_free(hwpt);
+        return false;
+    }
+
+    vbasedev->hwpt = hwpt;
+    QLIST_INSERT_HEAD(&hwpt->device_list, vbasedev, hwpt_next);
+    QLIST_INSERT_HEAD(&container->hwpt_list, hwpt, next);
+    return true;
+}
+
+static void iommufd_cdev_autodomains_put(VFIODevice *vbasedev,
+                                         VFIOIOMMUFDContainer *container)
+{
+    VFIOIOASHwpt *hwpt = vbasedev->hwpt;
+
+    QLIST_REMOVE(vbasedev, hwpt_next);
+    if (QLIST_EMPTY(&hwpt->device_list)) {
+        QLIST_REMOVE(hwpt, next);
+        iommufd_backend_free_id(container->be, hwpt->hwpt_id);
+        g_free(hwpt);
+    }
+}
+
 static bool iommufd_cdev_attach_container(VFIODevice *vbasedev,
                                           VFIOIOMMUFDContainer *container,
                                           Error **errp)
 {
+    /* mdevs aren't physical devices and will fail with auto domains */
+    if (!vbasedev->mdev) {
+        return iommufd_cdev_autodomains_get(vbasedev, container, errp);
+    }
+
     return !iommufd_cdev_attach_ioas_hwpt(vbasedev, container->ioas_id, errp);
 }
 
@@ -224,6 +300,11 @@  static void iommufd_cdev_detach_container(VFIODevice *vbasedev,
 {
     Error *err = NULL;
 
+    if (vbasedev->hwpt) {
+        iommufd_cdev_autodomains_put(vbasedev, container);
+        return;
+    }
+
     if (!iommufd_cdev_detach_ioas_hwpt(vbasedev, &err)) {
         error_report_err(err);
     }
@@ -354,6 +435,7 @@  static bool iommufd_cdev_attach(const char *name, VFIODevice *vbasedev,
     container = VFIO_IOMMU_IOMMUFD(object_new(TYPE_VFIO_IOMMU_IOMMUFD));
     container->be = vbasedev->iommufd;
     container->ioas_id = ioas_id;
+    QLIST_INIT(&container->hwpt_list);
 
     bcontainer = &container->bcontainer;
     vfio_address_space_insert(space, bcontainer);
diff --git a/backends/trace-events b/backends/trace-events
index 211e6f374adc..4d8ac02fe7d6 100644
--- a/backends/trace-events
+++ b/backends/trace-events
@@ -14,4 +14,5 @@  iommufd_backend_map_dma(int iommufd, uint32_t ioas, uint64_t iova, uint64_t size
 iommufd_backend_unmap_dma_non_exist(int iommufd, uint32_t ioas, uint64_t iova, uint64_t size, int ret) " Unmap nonexistent mapping: iommufd=%d ioas=%d iova=0x%"PRIx64" size=0x%"PRIx64" (%d)"
 iommufd_backend_unmap_dma(int iommufd, uint32_t ioas, uint64_t iova, uint64_t size, int ret) " iommufd=%d ioas=%d iova=0x%"PRIx64" size=0x%"PRIx64" (%d)"
 iommufd_backend_alloc_ioas(int iommufd, uint32_t ioas) " iommufd=%d ioas=%d"
+iommufd_backend_alloc_hwpt(int iommufd, uint32_t dev_id, uint32_t pt_id, uint32_t flags, uint32_t hwpt_type, uint32_t len, uint64_t data_ptr, uint32_t out_hwpt_id, int ret) " iommufd=%d dev_id=%u pt_id=%u flags=0x%x hwpt_type=%u len=%u data_ptr=0x%"PRIx64" out_hwpt=%u (%d)"
 iommufd_backend_free_id(int iommufd, uint32_t id, int ret) " iommufd=%d id=%d (%d)"