[RFCv1,08/14] iommufd: Add IOMMU_VIOMMU_SET_DEV_ID ioctl

Message ID	c97a98a72ee3498c587e5898d6b899553ccd9b27.1712978212.git.nicolinc@nvidia.com
State	Handled Elsewhere
Headers	show Return-Path: <linux-tegra+bounces-1618-incoming=patchwork.ozlabs.org@vger.kernel.org> Received-SPF: Pass (protection.outlook.com: domain of nvidia.com designates 216.228.118.233 as permitted sender) receiver=protection.outlook.com; client-ip=216.228.118.233; helo=mail.nvidia.com; pr=C From: Nicolin Chen <nicolinc@nvidia.com> To: <will@kernel.org>, <robin.murphy@arm.com>, <jgg@nvidia.com>, <kevin.tian@intel.com>, <suravee.suthikulpanit@amd.com> CC: <joro@8bytes.org>, <linux-kernel@vger.kernel.org>, <iommu@lists.linux.dev>, <linux-arm-kernel@lists.infradead.org>, <linux-tegra@vger.kernel.org>, <yi.l.liu@intel.com>, <eric.auger@redhat.com>, <vasant.hegde@amd.com>, <jon.grimm@amd.com>, <santosh.shukla@amd.com>, <Dhaval.Giani@amd.com>, <shameerali.kolothum.thodi@huawei.com> Subject: [PATCH RFCv1 08/14] iommufd: Add IOMMU_VIOMMU_SET_DEV_ID ioctl Date: Fri, 12 Apr 2024 20:47:05 -0700 Message-ID: <c97a98a72ee3498c587e5898d6b899553ccd9b27.1712978212.git.nicolinc@nvidia.com> In-Reply-To: <cover.1712978212.git.nicolinc@nvidia.com> References: <cover.1712978212.git.nicolinc@nvidia.com> Precedence: bulk MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Content-Type: text/plain
Series	Add Tegra241 (Grace) CMDQV Support (part 2/2) \| expand [RFCv1,00/14] Add Tegra241 (Grace) CMDQV Support (part 2/2) [RFCv1,01/14] iommufd: Move iommufd_object to public iommufd header [RFCv1,02/14] iommufd: Swap _iommufd_object_alloc and __iommufd_object_alloc [RFCv1,03/14] iommufd: Prepare for viommu structures and functions [RFCv1,04/14] iommufd: Add struct iommufd_viommu and iommufd_viommu_ops [RFCv1,05/14] iommufd: Add IOMMUFD_OBJ_VIOMMU and IOMMUFD_CMD_VIOMMU_ALLOC [RFCv1,06/14] iommufd/selftest: Add IOMMU_VIOMMU_ALLOC test coverage [RFCv1,07/14] iommufd: Add viommu set/unset_dev_id ops [RFCv1,08/14] iommufd: Add IOMMU_VIOMMU_SET_DEV_ID ioctl [RFCv1,09/14] iommufd/selftest: Add IOMMU_VIOMMU_SET_DEV_ID test coverage [RFCv1,10/14] iommufd/selftest: Add IOMMU_TEST_OP_MV_CHECK_DEV_ID [RFCv1,11/14] iommufd: Add struct iommufd_vqueue and its related viommu ops [RFCv1,12/14] iommufd: Add IOMMUFD_OBJ_VQUEUE and IOMMUFD_CMD_VQUEUE_ALLOC [RFCv1,13/14] iommufd: Add mmap infrastructure [RFCv1,14/14] iommu/tegra241-cmdqv: Add user-space use support

Nicolin Chen April 13, 2024, 3:47 a.m. UTC

Introduce a new ioctl to set a per-viommu device virtual id that should be
linked to the physical device id (or just a struct device pointer).

Since a viommu (user space IOMMU instance) can have multiple devices while
it's not ideal to confine a device to one single user space IOMMU instance
either, these two shouldn't just do a 1:1 mapping. Add two xarrays in their
structures to bind them bidirectionally.

Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
---
 drivers/iommu/iommufd/device.c          | 11 +++++
 drivers/iommu/iommufd/iommufd_private.h | 10 +++++
 drivers/iommu/iommufd/main.c            |  2 +
 drivers/iommu/iommufd/viommu.c          | 58 +++++++++++++++++++++++++
 include/linux/iommufd.h                 |  2 +
 include/uapi/linux/iommufd.h            | 20 +++++++++
 6 files changed, 103 insertions(+)

Jason Gunthorpe May 12, 2024, 2:58 p.m. UTC | #1

On Fri, Apr 12, 2024 at 08:47:05PM -0700, Nicolin Chen wrote:
> Introduce a new ioctl to set a per-viommu device virtual id that should be
> linked to the physical device id (or just a struct device pointer).
> 
> Since a viommu (user space IOMMU instance) can have multiple devices while
> it's not ideal to confine a device to one single user space IOMMU instance
> either, these two shouldn't just do a 1:1 mapping. Add two xarrays in their
> structures to bind them bidirectionally.

Since I would like to retain the dev_id, I think this is probably
better done with an allocated struct per dev-id:

struct dev_id {
    struct iommufd_device *idev;
    struct iommufd_viommu *viommu;
    u64 vdev_id;
    u64 driver_private; // Ie the driver can store the pSID here
    struct list_head idev_entry;
};

> @@ -135,7 +135,16 @@ void iommufd_device_destroy(struct iommufd_object *obj)
>  {
>  	struct iommufd_device *idev =
>  		container_of(obj, struct iommufd_device, obj);
> +	struct iommufd_viommu *viommu;
> +	unsigned long index;
>  
> +	xa_for_each(&idev->viommus, index, viommu) {
> +		if (viommu->ops->unset_dev_id)
> +			viommu->ops->unset_dev_id(viommu, idev->dev);
> +		xa_erase(&viommu->idevs, idev->obj.id);
> +		xa_erase(&idev->viommus, index);
> +	}

Then this turns into list_for_each(idev->viommu_vdevid_list)

> +int iommufd_viommu_set_device_id(struct iommufd_ucmd *ucmd)
> +{
> +	struct iommu_viommu_set_dev_id *cmd = ucmd->cmd;
> +	unsigned int dev_id, viommu_id;
> +	struct iommufd_viommu *viommu;
> +	struct iommufd_device *idev;
> +	int rc;
> +
> +	idev = iommufd_get_device(ucmd, cmd->dev_id);
> +	if (IS_ERR(idev))
> +		return PTR_ERR(idev);
> +	dev_id = idev->obj.id;
> +
> +	viommu = iommufd_get_viommu(ucmd, cmd->viommu_id);
> +	if (IS_ERR(viommu)) {
> +		rc = PTR_ERR(viommu);
> +		goto out_put_idev;
> +	}
> +
> +	if (viommu->iommu_dev != idev->dev->iommu->iommu_dev) {
> +		rc = -EINVAL;
> +		goto out_put_viommu;
> +	}
> +
> +	if (!viommu->ops->set_dev_id || !viommu->ops->unset_dev_id) {
> +		rc = -EOPNOTSUPP;
> +		goto out_put_viommu;
> +	}
> +
> +	rc = xa_alloc(&idev->viommus, &viommu_id, viommu,
> +		      XA_LIMIT(viommu->obj.id, viommu->obj.id),
> +		      GFP_KERNEL_ACCOUNT);
> +	if (rc)
> +		goto out_put_viommu;
> +
> +	rc = xa_alloc(&viommu->idevs, &dev_id, idev,
> +		      XA_LIMIT(dev_id, dev_id), GFP_KERNEL_ACCOUNT);
> +	if (rc)
> +		goto out_xa_erase_viommu;

Both of these are API mis-uses, you don't want an allocating xarray
you just want to use xa_cmpxchg

Put an xarray in the viommu object and fill it with pointers to the
struct dev_id thing above

The driver can piggyback on this xarray too if it wants, so we only
need one.

xa_cmpchg to install the user requests vdev_id only if the vdev_id is
already unused.

> @@ -51,6 +51,7 @@ enum {
>  	IOMMUFD_CMD_HWPT_GET_DIRTY_BITMAP,
>  	IOMMUFD_CMD_HWPT_INVALIDATE,
>  	IOMMUFD_CMD_VIOMMU_ALLOC,
> +	IOMMUFD_CMD_VIOMMU_SET_DEV_ID,
>  };

We almost certainly will need a REMOVE_DEV_ID so userspace can have
sensible error handling.

> +
> +/**
> + * struct iommu_viommu_set_dev_id - ioctl(IOMMU_VIOMMU_SET_DEV_ID)
> + * @size: sizeof(struct iommu_viommu_set_dev_id)
> + * @viommu_id: viommu ID to associate with the device to store its virtual ID
> + * @dev_id: device ID to set a device virtual ID
> + * @__reserved: Must be 0
> + * @id: Device virtual ID
> + *
> + * Set a viommu-specific virtual ID of a device
> + */
> +struct iommu_viommu_set_dev_id {
> +	__u32 size;
> +	__u32 viommu_id;
> +	__u32 dev_id;
> +	__u32 __reserved;
> +	__aligned_u64 id;

I think I'd consistently call this vdev_id throughout the API

Jason

Nicolin Chen May 13, 2024, 5:24 a.m. UTC | #2

On Sun, May 12, 2024 at 11:58:27AM -0300, Jason Gunthorpe wrote:
> On Fri, Apr 12, 2024 at 08:47:05PM -0700, Nicolin Chen wrote:
> > Introduce a new ioctl to set a per-viommu device virtual id that should be
> > linked to the physical device id (or just a struct device pointer).
> > 
> > Since a viommu (user space IOMMU instance) can have multiple devices while
> > it's not ideal to confine a device to one single user space IOMMU instance
> > either, these two shouldn't just do a 1:1 mapping. Add two xarrays in their
> > structures to bind them bidirectionally.
> 
> Since I would like to retain the dev_id, I think this is probably
> better done with an allocated struct per dev-id:
> 
> struct dev_id {
>     struct iommufd_device *idev;
>     struct iommufd_viommu *viommu;
>     u64 vdev_id;
>     u64 driver_private; // Ie the driver can store the pSID here
>     struct list_head idev_entry;
> };

Oh, I needed a better solution to store the HW slot number of a
VINTF configuration that links a pSID and a vSID, which I hacked
into struct arm_smmu_stream for now. So, with this struct dev_id,
likely I can tie it to this structure.

For a driver, pSID is known with a device pointer, while likely
no use to the iommufd core?

Also, I will address the other comments in your reply.

Thanks
Nicolin

Nicolin Chen May 17, 2024, 5:14 a.m. UTC | #3

On Sun, May 12, 2024 at 11:58:27AM -0300, Jason Gunthorpe wrote:
> On Fri, Apr 12, 2024 at 08:47:05PM -0700, Nicolin Chen wrote:
> > Introduce a new ioctl to set a per-viommu device virtual id that should be
> > linked to the physical device id (or just a struct device pointer).
> > 
> > Since a viommu (user space IOMMU instance) can have multiple devices while
> > it's not ideal to confine a device to one single user space IOMMU instance
> > either, these two shouldn't just do a 1:1 mapping. Add two xarrays in their
> > structures to bind them bidirectionally.
> 
> Since I would like to retain the dev_id, I think this is probably
> better done with an allocated struct per dev-id:
> 
> struct dev_id {
>     struct iommufd_device *idev;
>     struct iommufd_viommu *viommu;
>     u64 vdev_id;
>     u64 driver_private; // Ie the driver can store the pSID here
>     struct list_head idev_entry;
> };

I implemented it with a small tweak, to align with viommu_alloc
and vqueue_alloc:

	// core
	struct iommufd_vdev_id {
		struct iommufd_viommu *viommu;
		struct device *dev;
		u64 vdev_id;
		struct list_head idev_item;
	};

	// driver
	struct my_driver_vdev_id {
	    struct iommufd_vdev_id core;
	    unsigned int private_attrs;
	};

	static struct iommufd_vdev_id *
	my_driver_set_vdev_id(struct iommufd_viommu *viommu,
			      struct device *dev, u64 id)
	{
	    struct my_driver_vdev_id *my_vdev_id;

	    my_vdev_id = kzalloc(sizeof(*my_vdev_id), GFP_KERNEL);
	    .... /* set private_attrs */
	    return &my_driver_vdev_id->core;
	}

	static void my_driver_unset_vdev_id(struct iommufd_vdev_id *vdev_id)
	{
	    struct my_driver_vdev_id *my_vdev_id =
		    container_of(vdev_id, struct my_driver_vdev_id, core);
	    .... /* unset private_attrs */
	}

Please let me know if you like it inverted as you wrote above.

> > @@ -135,7 +135,16 @@ void iommufd_device_destroy(struct iommufd_object *obj)
> >  {
> >  	struct iommufd_device *idev =
> >  		container_of(obj, struct iommufd_device, obj);
> > +	struct iommufd_viommu *viommu;
> > +	unsigned long index;
> >  
> > +	xa_for_each(&idev->viommus, index, viommu) {
> > +		if (viommu->ops->unset_dev_id)
> > +			viommu->ops->unset_dev_id(viommu, idev->dev);
> > +		xa_erase(&viommu->idevs, idev->obj.id);
> > +		xa_erase(&idev->viommus, index);
> > +	}
> 
> Then this turns into list_for_each(idev->viommu_vdevid_list)

Done.

> > +int iommufd_viommu_set_device_id(struct iommufd_ucmd *ucmd)
> > +{
...
> > +	rc = xa_alloc(&idev->viommus, &viommu_id, viommu,
> > +		      XA_LIMIT(viommu->obj.id, viommu->obj.id),
> > +		      GFP_KERNEL_ACCOUNT);
> > +	if (rc)
> > +		goto out_put_viommu;
> > +
> > +	rc = xa_alloc(&viommu->idevs, &dev_id, idev,
> > +		      XA_LIMIT(dev_id, dev_id), GFP_KERNEL_ACCOUNT);
> > +	if (rc)
> > +		goto out_xa_erase_viommu;
> 
> Both of these are API mis-uses, you don't want an allocating xarray
> you just want to use xa_cmpxchg
> 
> Put an xarray in the viommu object and fill it with pointers to the
> struct dev_id thing above
> 
> The driver can piggyback on this xarray too if it wants, so we only
> need one.

I also moved xa_cmpxchg to the driver, as I feel that the vdev_id
here is very driver/iommu specific. There can be some complication
if iommufd core handles this u64 vdev_id: most likely we will use
this u64 vdev_id to index the xarray that takes an unsigned-long
xa_index for a fast vdev_id-to-pdev_id lookup, while only a driver
knows whether u64 vdev_id is compatible with unsigned long or not.

And, we have a list_head in the structure idev, so a device unbind
will for-each the list and unset all the vdev_ids in it, meanwhile
the viommu stays. I wonder if we need to add another list_head in
the structure viommu, so a viommu tear down will for-each its list
and unset all the vdev_ids on its side while a device (idev) stays.
I don't see a use case of that though..any thought?

Thanks
Nicolin

Jason Gunthorpe May 21, 2024, 6:30 p.m. UTC | #4

On Thu, May 16, 2024 at 10:14:38PM -0700, Nicolin Chen wrote:
> I implemented it with a small tweak, to align with viommu_alloc
> and vqueue_alloc:
> 
> 	// core
> 	struct iommufd_vdev_id {
> 		struct iommufd_viommu *viommu;
> 		struct device *dev;
> 		u64 vdev_id;
> 		struct list_head idev_item;
> 	};
> 
> 	// driver
> 	struct my_driver_vdev_id {
> 	    struct iommufd_vdev_id core;
> 	    unsigned int private_attrs;
> 	};
> 
> 	static struct iommufd_vdev_id *
> 	my_driver_set_vdev_id(struct iommufd_viommu *viommu,
> 			      struct device *dev, u64 id)
> 	{
> 	    struct my_driver_vdev_id *my_vdev_id;
> 
> 	    my_vdev_id = kzalloc(sizeof(*my_vdev_id), GFP_KERNEL);
> 	    .... /* set private_attrs */
> 	    return &my_driver_vdev_id->core;
> 	}
> 
> 	static void my_driver_unset_vdev_id(struct iommufd_vdev_id *vdev_id)
> 	{
> 	    struct my_driver_vdev_id *my_vdev_id =
> 		    container_of(vdev_id, struct my_driver_vdev_id, core);
> 	    .... /* unset private_attrs */
> 	}
> 
> Please let me know if you like it inverted as you wrote above.

Seems like a reasonable direction

> I also moved xa_cmpxchg to the driver, as I feel that the vdev_id
> here is very driver/iommu specific. There can be some complication
> if iommufd core handles this u64 vdev_id: most likely we will use
> this u64 vdev_id to index the xarray that takes an unsigned-long
> xa_index for a fast vdev_id-to-pdev_id lookup, while only a driver
> knows whether u64 vdev_id is compatible with unsigned long or not.

This seems like the most general part.. The core code should just have
a check like:

  if (vdevid >= ULONG_MAX) return -EINVAL;

And if someone wants to make 32 bit kernels support a 64bit vdevid
then they can sort it out someday :) I think this is not a big issue
as all iommus seem to have some kind of radix lookup for vdevid and
want it to be small.

Matthew has been talking about support for 64bit indexes in
maple/xarray or something for a bit so it might sort itself out.

> And, we have a list_head in the structure idev, so a device unbind
> will for-each the list and unset all the vdev_ids in it, meanwhile
> the viommu stays. I wonder if we need to add another list_head in
> the structure viommu, so a viommu tear down will for-each its list
> and unset all the vdev_ids on its side while a device (idev) stays.
> I don't see a use case of that though..any thought?

I think you need to support viommu teardown, at least from a
correctness perspective. The API permits it. But isn't it just
list_del everything in the xarray and that will clean it up enough?

Jason

Nicolin Chen May 22, 2024, 2:15 a.m. UTC | #5

On Tue, May 21, 2024 at 03:30:11PM -0300, Jason Gunthorpe wrote:
> On Thu, May 16, 2024 at 10:14:38PM -0700, Nicolin Chen wrote:
> > I also moved xa_cmpxchg to the driver, as I feel that the vdev_id
> > here is very driver/iommu specific. There can be some complication
> > if iommufd core handles this u64 vdev_id: most likely we will use
> > this u64 vdev_id to index the xarray that takes an unsigned-long
> > xa_index for a fast vdev_id-to-pdev_id lookup, while only a driver
> > knows whether u64 vdev_id is compatible with unsigned long or not.
> 
> This seems like the most general part.. The core code should just have
> a check like:
> 
>   if (vdevid >= ULONG_MAX) return -EINVAL;

Ack.

> And if someone wants to make 32 bit kernels support a 64bit vdevid
> then they can sort it out someday :) I think this is not a big issue
> as all iommus seem to have some kind of radix lookup for vdevid and
> want it to be small.
> 
> Matthew has been talking about support for 64bit indexes in
> maple/xarray or something for a bit so it might sort itself out.

OK. In that case, the core can do the xarray maintenance.

And then..

> > And, we have a list_head in the structure idev, so a device unbind
> > will for-each the list and unset all the vdev_ids in it, meanwhile
> > the viommu stays. I wonder if we need to add another list_head in
> > the structure viommu, so a viommu tear down will for-each its list
> > and unset all the vdev_ids on its side while a device (idev) stays.
> > I don't see a use case of that though..any thought?
> 
> I think you need to support viommu teardown, at least from a
> correctness perspective. The API permits it. But isn't it just
> list_del everything in the xarray and that will clean it up enough?

... viommu tear down can xa_for_each to call unset_vdev_id().

Thanks
Nicolin

Tian, Kevin May 23, 2024, 6:42 a.m. UTC | #6

> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Saturday, April 13, 2024 11:47 AM
> 
> Introduce a new ioctl to set a per-viommu device virtual id that should be
> linked to the physical device id (or just a struct device pointer).
> 
> Since a viommu (user space IOMMU instance) can have multiple devices

this is true...

> while
> it's not ideal to confine a device to one single user space IOMMU instance
> either, these two shouldn't just do a 1:1 mapping. Add two xarrays in 

...but why would one device link to multiple viommu instances?

Or is it referring to Tegra194 as arm-smmu-nvidia.c tries to support?

In any case a real example might be useful as this may not be a given
knowledge to all folks.

btw there is a check in the following code:

+	if (viommu->iommu_dev != idev->dev->iommu->iommu_dev) {
+		rc = -EINVAL;
+		goto out_put_viommu;
+	}

I vaguely remember an earlier discussion about multiple vSMMU instances
following the physical SMMU topology, but don't quite recall the exact
reason.

What is the actual technical obstacle prohibiting one to put multiple
VCMDQ's from different SMMU's into one vIOMMU instance?

Nicolin Chen May 24, 2024, 5:40 a.m. UTC | #7

On Thu, May 23, 2024 at 06:42:56AM +0000, Tian, Kevin wrote:
> > From: Nicolin Chen <nicolinc@nvidia.com>
> > Sent: Saturday, April 13, 2024 11:47 AM
> >
> > Introduce a new ioctl to set a per-viommu device virtual id that should be
> > linked to the physical device id (or just a struct device pointer).
> >
> > Since a viommu (user space IOMMU instance) can have multiple devices
> 
> this is true...
> 
> > while
> > it's not ideal to confine a device to one single user space IOMMU instance
> > either, these two shouldn't just do a 1:1 mapping. Add two xarrays in
> 
> ...but why would one device link to multiple viommu instances?

That's a suggestion from Jason, IIRC, to avoid limiting a device
to a single viommu, though I can't find out the source at this
moment.

Jason, would you mind shed some light here?

> Or is it referring to Tegra194 as arm-smmu-nvidia.c tries to support?

Not actual. It's an SMMUv2 driver, which is not in our plan for
virtualization at this moment. And that driver is essentially a
different "compatible" string as a unique SMMUv2 implementation.

> btw there is a check in the following code:
> 
> +       if (viommu->iommu_dev != idev->dev->iommu->iommu_dev) {
> +               rc = -EINVAL;
> +               goto out_put_viommu;
> +       }
> 
> I vaguely remember an earlier discussion about multiple vSMMU instances
> following the physical SMMU topology, but don't quite recall the exact
> reason.
> 
> What is the actual technical obstacle prohibiting one to put multiple
> VCMDQ's from different SMMU's into one vIOMMU instance?

Because a VCMDQ passthrough involves a direct mmap of a HW MMIO
page to the guest-level MMIO region. The MMIO page provides the
read/write of queue's head and tail indexes.

With a single pSMMU and a single vSMMU, it's simply 1:1 mapping.

With a multi-pSMMU and a single vSMMU, the single vSMMU will see
one guest-level MMIO region backed by multiple physical pages.
Since we are talking about MMIO, technically it cannot select the
corresponding MMIO page to the device, not to mention we don't
really want VMM to involve, i.e. no VM exist, when using VCMDQ.

So, there must be some kind of multi-instanced carriers to hold
those MMIO pages, by attaching devices behind different pSMMUs to
corresponding carriers. And today we have VIOMMU as the carrier.

One step back, even without VCMDQ feature, a multi-pSMMU setup
will have multiple viommus (with our latest design) being added
to a viommu list of a single vSMMU's. Yet, vSMMU in this case
always traps regular SMMU CMDQ, so it can do viommu selection
or even broadcast (if it has to).

Thanks
Nicolin

Tian, Kevin May 24, 2024, 7:13 a.m. UTC | #8

> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Friday, May 24, 2024 1:40 PM
> 
> On Thu, May 23, 2024 at 06:42:56AM +0000, Tian, Kevin wrote:
> > btw there is a check in the following code:
> >
> > +       if (viommu->iommu_dev != idev->dev->iommu->iommu_dev) {
> > +               rc = -EINVAL;
> > +               goto out_put_viommu;
> > +       }
> >
> > I vaguely remember an earlier discussion about multiple vSMMU instances
> > following the physical SMMU topology, but don't quite recall the exact
> > reason.
> >
> > What is the actual technical obstacle prohibiting one to put multiple
> > VCMDQ's from different SMMU's into one vIOMMU instance?
> 
> Because a VCMDQ passthrough involves a direct mmap of a HW MMIO
> page to the guest-level MMIO region. The MMIO page provides the
> read/write of queue's head and tail indexes.
> 
> With a single pSMMU and a single vSMMU, it's simply 1:1 mapping.
> 
> With a multi-pSMMU and a single vSMMU, the single vSMMU will see
> one guest-level MMIO region backed by multiple physical pages.
> Since we are talking about MMIO, technically it cannot select the
> corresponding MMIO page to the device, not to mention we don't
> really want VMM to involve, i.e. no VM exist, when using VCMDQ.

can a vSMMU report to support multiple CMDQ's then there are
several MMIO regions each mapped to a different backend VCMDQ?

but I guess even if it's supported there is still a problem describing
the association between assigned devices and the CMDQ's of the
single vIOMMU instance. On bare metal a feature of multiple CMDQ
is more for load-balancing so there won't be a fixed association
between CMDQ/device. But the fixed association exists if we mixes
multiple VCMDQ's in a single vIOMMU instance then there may
lack of a way in spec to describe it.

> 
> So, there must be some kind of multi-instanced carriers to hold
> those MMIO pages, by attaching devices behind different pSMMUs to
> corresponding carriers. And today we have VIOMMU as the carrier.
> 
> One step back, even without VCMDQ feature, a multi-pSMMU setup
> will have multiple viommus (with our latest design) being added
> to a viommu list of a single vSMMU's. Yet, vSMMU in this case
> always traps regular SMMU CMDQ, so it can do viommu selection
> or even broadcast (if it has to).
> 

I'm curious to learn the real reason of that design. Is it because you
want to do certain load-balance between viommu's or due to other
reasons in the kernel smmuv3 driver which e.g. cannot support a
viommu spanning multiple pSMMU?

Jason Gunthorpe May 24, 2024, 1:19 p.m. UTC | #9

On Fri, May 24, 2024 at 07:13:23AM +0000, Tian, Kevin wrote:
> > From: Nicolin Chen <nicolinc@nvidia.com>
> > Sent: Friday, May 24, 2024 1:40 PM
> > 
> > On Thu, May 23, 2024 at 06:42:56AM +0000, Tian, Kevin wrote:
> > > btw there is a check in the following code:
> > >
> > > +       if (viommu->iommu_dev != idev->dev->iommu->iommu_dev) {
> > > +               rc = -EINVAL;
> > > +               goto out_put_viommu;
> > > +       }
> > >
> > > I vaguely remember an earlier discussion about multiple vSMMU instances
> > > following the physical SMMU topology, but don't quite recall the exact
> > > reason.
> > >
> > > What is the actual technical obstacle prohibiting one to put multiple
> > > VCMDQ's from different SMMU's into one vIOMMU instance?
> > 
> > Because a VCMDQ passthrough involves a direct mmap of a HW MMIO
> > page to the guest-level MMIO region. The MMIO page provides the
> > read/write of queue's head and tail indexes.
> > 
> > With a single pSMMU and a single vSMMU, it's simply 1:1 mapping.
> > 
> > With a multi-pSMMU and a single vSMMU, the single vSMMU will see
> > one guest-level MMIO region backed by multiple physical pages.
> > Since we are talking about MMIO, technically it cannot select the
> > corresponding MMIO page to the device, not to mention we don't
> > really want VMM to involve, i.e. no VM exist, when using VCMDQ.
> 
> can a vSMMU report to support multiple CMDQ's then there are
> several MMIO regions each mapped to a different backend VCMDQ?

This is something I want to support in the API, regardless vCMDQ uses
it or not..

> but I guess even if it's supported there is still a problem describing
> the association between assigned devices and the CMDQ's of the
> single vIOMMU instance. 

CMDQ should be linked to the VIOMMU instance only and not per-device,
I think that is a very important property of the system.

This means if there are multiple pSMMU instances then we need a
matching set of vSMMU instances so that the devices are grouped on the
proper vSMMU instance so they are compatible with the vCMDQ.

> I'm curious to learn the real reason of that design. Is it because you
> want to do certain load-balance between viommu's or due to other
> reasons in the kernel smmuv3 driver which e.g. cannot support a
> viommu spanning multiple pSMMU?

Yeah, there is no concept of support for a SMMUv3 instance where it's
command Q's can only work on a subset of devices.

My expectation was that VIOMMU would be 1:1 with physical iommu
instances, I think AMD needs this too??

Jason

Jason Gunthorpe May 24, 2024, 1:20 p.m. UTC | #10

On Thu, May 23, 2024 at 10:40:07PM -0700, Nicolin Chen wrote:
> On Thu, May 23, 2024 at 06:42:56AM +0000, Tian, Kevin wrote:
> > > From: Nicolin Chen <nicolinc@nvidia.com>
> > > Sent: Saturday, April 13, 2024 11:47 AM
> > >
> > > Introduce a new ioctl to set a per-viommu device virtual id that should be
> > > linked to the physical device id (or just a struct device pointer).
> > >
> > > Since a viommu (user space IOMMU instance) can have multiple devices
> > 
> > this is true...
> > 
> > > while
> > > it's not ideal to confine a device to one single user space IOMMU instance
> > > either, these two shouldn't just do a 1:1 mapping. Add two xarrays in
> > 
> > ...but why would one device link to multiple viommu instances?
> 
> That's a suggestion from Jason, IIRC, to avoid limiting a device
> to a single viommu, though I can't find out the source at this
> moment.
> 
> Jason, would you mind shed some light here?

We could probably do it either way, it just doesn't seem like there is
a strong reason to limit a device to a single viommu even if in
practice that is how everyone will use it.

Jason

Tian, Kevin May 27, 2024, 1:08 a.m. UTC | #11

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Friday, May 24, 2024 9:19 PM
> 
> On Fri, May 24, 2024 at 07:13:23AM +0000, Tian, Kevin wrote:
> > I'm curious to learn the real reason of that design. Is it because you
> > want to do certain load-balance between viommu's or due to other
> > reasons in the kernel smmuv3 driver which e.g. cannot support a
> > viommu spanning multiple pSMMU?
> 
> Yeah, there is no concept of support for a SMMUv3 instance where it's
> command Q's can only work on a subset of devices.
> 
> My expectation was that VIOMMU would be 1:1 with physical iommu
> instances, I think AMD needs this too??
> 

Yes this part is clear now regarding to VCMDQ.

But Nicoline said:

"
One step back, even without VCMDQ feature, a multi-pSMMU setup
will have multiple viommus (with our latest design) being added
to a viommu list of a single vSMMU's. Yet, vSMMU in this case
always traps regular SMMU CMDQ, so it can do viommu selection
or even broadcast (if it has to).
"

I don't think there is an arch limitation mandating that?

Nicolin Chen May 28, 2024, 8:22 p.m. UTC | #12

On Mon, May 27, 2024 at 01:08:43AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Friday, May 24, 2024 9:19 PM
> >
> > On Fri, May 24, 2024 at 07:13:23AM +0000, Tian, Kevin wrote:
> > > I'm curious to learn the real reason of that design. Is it because you
> > > want to do certain load-balance between viommu's or due to other
> > > reasons in the kernel smmuv3 driver which e.g. cannot support a
> > > viommu spanning multiple pSMMU?
> >
> > Yeah, there is no concept of support for a SMMUv3 instance where it's
> > command Q's can only work on a subset of devices.
> >
> > My expectation was that VIOMMU would be 1:1 with physical iommu
> > instances, I think AMD needs this too??
> >
> 
> Yes this part is clear now regarding to VCMDQ.
> 
> But Nicoline said:
> 
> "
> One step back, even without VCMDQ feature, a multi-pSMMU setup
> will have multiple viommus (with our latest design) being added
> to a viommu list of a single vSMMU's. Yet, vSMMU in this case
> always traps regular SMMU CMDQ, so it can do viommu selection
> or even broadcast (if it has to).
> "
> 
> I don't think there is an arch limitation mandating that?

What I mean is for regular vSMMU. Without VCMDQ, a regular vSMMU
on a multi-pSMMU setup will look like (e.g. three devices behind
different SMMUs):
|<------ VMM ------->|<------ kernel ------>|
       |-- viommu0 --|-- pSMMU0 --|
vSMMU--|-- viommu1 --|-- pSMMU1 --|--s2_hwpt
       |-- viommu2 --|-- pSMMU2 --|

And device would attach to:
|<---- guest ---->|<--- VMM --->|<- kernel ->|
       |-- dev0 --|-- viommu0 --|-- pSMMU0 --|
vSMMU--|-- dev1 --|-- viommu1 --|-- pSMMU1 --|
       |-- dev2 --|-- viommu2 --|-- pSMMU2 --|

When trapping a device cache invalidation: it is straightforward
by deciphering the virtual device ID to pick the viommu that the
device is attached to.

When doing iotlb invalidation, a command may or may not contain
an ASID (a domain ID, and nested domain in this case):
a) if a command doesn't have an ASID, VMM needs to broadcast the
   command to all viommus (i.e. pSMMUs)
b) if a command has an ASID, VMM needs to initially maintain an
   S1 HWPT list by linking an ASID when adding an HWPT entry to
   the list, by deciphering vSTE and its linked CD. Then it needs
   to go through the S1 list with the ASID in the command, and to
   find all corresponding HWPTs to issue/broadcast the command.

Thanks
Nicolin

Nicolin Chen May 28, 2024, 8:30 p.m. UTC | #13

On Mon, May 27, 2024 at 01:08:43AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Friday, May 24, 2024 9:19 PM
> >
> > On Fri, May 24, 2024 at 07:13:23AM +0000, Tian, Kevin wrote:
> > > I'm curious to learn the real reason of that design. Is it because you
> > > want to do certain load-balance between viommu's or due to other
> > > reasons in the kernel smmuv3 driver which e.g. cannot support a
> > > viommu spanning multiple pSMMU?
> >
> > Yeah, there is no concept of support for a SMMUv3 instance where it's
> > command Q's can only work on a subset of devices.
> >
> > My expectation was that VIOMMU would be 1:1 with physical iommu
> > instances, I think AMD needs this too??
> >
>
> Yes this part is clear now regarding to VCMDQ.
>
> But Nicoline said:
>
> "
> One step back, even without VCMDQ feature, a multi-pSMMU setup
> will have multiple viommus (with our latest design) being added
> to a viommu list of a single vSMMU's. Yet, vSMMU in this case
> always traps regular SMMU CMDQ, so it can do viommu selection
> or even broadcast (if it has to).
> "
>
> I don't think there is an arch limitation mandating that?

What I mean is for regular a nested SMMU case. Without VCMDQ, a
regular vSMMU on a multi-pSMMU setup will look like (e.g. three
devices behind different SMMUs):
|<---- guest ---->|<--------- VMM -------->|<- kernel ->|
       |-- dev0 --|-- viommu0 (s2_hwpt0) --|-- pSMMU0 --|
vSMMU--|-- dev1 --|-- viommu1 (s2_hwpt0) --|-- pSMMU1 --|
       |-- dev2 --|-- viommu2 (s2_hwpt0) --|-- pSMMU2 --|

When trapping a device cache invalidation: it is straightforward
by deciphering the virtual device ID to pick the viommu that the
device is attached to. So, only one pSMMU would receive the user
invalidation request.

When doing iotlb invalidation, a command may or may not contain
an ASID (a domain ID, and nested domain in this case):
a) if a command doesn't have an ASID, VMM needs to broadcast the
   command to all viommus (i.e. pSMMUs)
b) if a command has an ASID, VMM needs to initially maintain an
   S1 HWPT list by linking an ASID when adding an HWPT entry to
   the list, by deciphering vSTE and its linked CD. Then it needs
   to go through the S1 list with the ASID in the command, and to
   find all corresponding HWPTs to issue/broadcast the command.

Thanks
Nicolin

Nicolin Chen May 28, 2024, 8:33 p.m. UTC | #14

On Tue, May 28, 2024 at 01:22:46PM -0700, Nicolin Chen wrote:
> On Mon, May 27, 2024 at 01:08:43AM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > Sent: Friday, May 24, 2024 9:19 PM
> > >
> > > On Fri, May 24, 2024 at 07:13:23AM +0000, Tian, Kevin wrote:
> > > > I'm curious to learn the real reason of that design. Is it because you
> > > > want to do certain load-balance between viommu's or due to other
> > > > reasons in the kernel smmuv3 driver which e.g. cannot support a
> > > > viommu spanning multiple pSMMU?
> > >
> > > Yeah, there is no concept of support for a SMMUv3 instance where it's
> > > command Q's can only work on a subset of devices.
> > >
> > > My expectation was that VIOMMU would be 1:1 with physical iommu
> > > instances, I think AMD needs this too??
> > >
> >
> > Yes this part is clear now regarding to VCMDQ.
> >
> > But Nicoline said:
> >
> > "
> > One step back, even without VCMDQ feature, a multi-pSMMU setup
> > will have multiple viommus (with our latest design) being added
> > to a viommu list of a single vSMMU's. Yet, vSMMU in this case
> > always traps regular SMMU CMDQ, so it can do viommu selection
> > or even broadcast (if it has to).
> > "
> >
> > I don't think there is an arch limitation mandating that?
> 
> What I mean is for regular vSMMU. Without VCMDQ, a regular vSMMU
> on a multi-pSMMU setup will look like (e.g. three devices behind
> different SMMUs):
> |<------ VMM ------->|<------ kernel ------>|
>        |-- viommu0 --|-- pSMMU0 --|
> vSMMU--|-- viommu1 --|-- pSMMU1 --|--s2_hwpt
>        |-- viommu2 --|-- pSMMU2 --|
> 
> And device would attach to:
> |<---- guest ---->|<--- VMM --->|<- kernel ->|
>        |-- dev0 --|-- viommu0 --|-- pSMMU0 --|
> vSMMU--|-- dev1 --|-- viommu1 --|-- pSMMU1 --|
>        |-- dev2 --|-- viommu2 --|-- pSMMU2 --|

I accidentally sent a duplicated one.. Please ignore this reply
and check the other one. Thanks!

Tian, Kevin May 29, 2024, 2:58 a.m. UTC | #15

> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Wednesday, May 29, 2024 4:23 AM
> 
> On Mon, May 27, 2024 at 01:08:43AM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > Sent: Friday, May 24, 2024 9:19 PM
> > >
> > > On Fri, May 24, 2024 at 07:13:23AM +0000, Tian, Kevin wrote:
> > > > I'm curious to learn the real reason of that design. Is it because you
> > > > want to do certain load-balance between viommu's or due to other
> > > > reasons in the kernel smmuv3 driver which e.g. cannot support a
> > > > viommu spanning multiple pSMMU?
> > >
> > > Yeah, there is no concept of support for a SMMUv3 instance where it's
> > > command Q's can only work on a subset of devices.
> > >
> > > My expectation was that VIOMMU would be 1:1 with physical iommu
> > > instances, I think AMD needs this too??
> > >
> >
> > Yes this part is clear now regarding to VCMDQ.
> >
> > But Nicoline said:
> >
> > "
> > One step back, even without VCMDQ feature, a multi-pSMMU setup
> > will have multiple viommus (with our latest design) being added
> > to a viommu list of a single vSMMU's. Yet, vSMMU in this case
> > always traps regular SMMU CMDQ, so it can do viommu selection
> > or even broadcast (if it has to).
> > "
> >
> > I don't think there is an arch limitation mandating that?
> 
> What I mean is for regular vSMMU. Without VCMDQ, a regular vSMMU
> on a multi-pSMMU setup will look like (e.g. three devices behind
> different SMMUs):
> |<------ VMM ------->|<------ kernel ------>|
>        |-- viommu0 --|-- pSMMU0 --|
> vSMMU--|-- viommu1 --|-- pSMMU1 --|--s2_hwpt
>        |-- viommu2 --|-- pSMMU2 --|
> 
> And device would attach to:
> |<---- guest ---->|<--- VMM --->|<- kernel ->|
>        |-- dev0 --|-- viommu0 --|-- pSMMU0 --|
> vSMMU--|-- dev1 --|-- viommu1 --|-- pSMMU1 --|
>        |-- dev2 --|-- viommu2 --|-- pSMMU2 --|
> 
> When trapping a device cache invalidation: it is straightforward
> by deciphering the virtual device ID to pick the viommu that the
> device is attached to.

I understand how above works.

My question is why that option is chosen instead of going with 1:1
mapping between vSMMU and viommu i.e. letting the kernel to
figure out which pSMMU should be sent an invalidation cmd to, as
how VT-d is virtualized.

I want to know whether doing so is simply to be compatible with
what VCMDQ requires, or due to another untold reason.

> 
> When doing iotlb invalidation, a command may or may not contain
> an ASID (a domain ID, and nested domain in this case):
> a) if a command doesn't have an ASID, VMM needs to broadcast the
>    command to all viommus (i.e. pSMMUs)
> b) if a command has an ASID, VMM needs to initially maintain an
>    S1 HWPT list by linking an ASID when adding an HWPT entry to
>    the list, by deciphering vSTE and its linked CD. Then it needs
>    to go through the S1 list with the ASID in the command, and to
>    find all corresponding HWPTs to issue/broadcast the command.
> 
> Thanks
> Nicolin

Nicolin Chen May 29, 2024, 3:20 a.m. UTC | #16

On Wed, May 29, 2024 at 02:58:11AM +0000, Tian, Kevin wrote:
> > From: Nicolin Chen <nicolinc@nvidia.com>
> > Sent: Wednesday, May 29, 2024 4:23 AM

> > What I mean is for regular vSMMU. Without VCMDQ, a regular vSMMU
> > on a multi-pSMMU setup will look like (e.g. three devices behind
> > different SMMUs):
> > |<------ VMM ------->|<------ kernel ------>|
> >        |-- viommu0 --|-- pSMMU0 --|
> > vSMMU--|-- viommu1 --|-- pSMMU1 --|--s2_hwpt
> >        |-- viommu2 --|-- pSMMU2 --|
> >
> > And device would attach to:
> > |<---- guest ---->|<--- VMM --->|<- kernel ->|
> >        |-- dev0 --|-- viommu0 --|-- pSMMU0 --|
> > vSMMU--|-- dev1 --|-- viommu1 --|-- pSMMU1 --|
> >        |-- dev2 --|-- viommu2 --|-- pSMMU2 --|
> >
> > When trapping a device cache invalidation: it is straightforward
> > by deciphering the virtual device ID to pick the viommu that the
> > device is attached to.
> 
> I understand how above works.
> 
> My question is why that option is chosen instead of going with 1:1
> mapping between vSMMU and viommu i.e. letting the kernel to
> figure out which pSMMU should be sent an invalidation cmd to, as
> how VT-d is virtualized.
> 
> I want to know whether doing so is simply to be compatible with
> what VCMDQ requires, or due to another untold reason.

Because we use viommu as a VMID holder for SMMU. So a pSMMU must
have its own viommu to store its VMID for a shared s2_hwpt:
        |-- viommu0 (VMIDx) --|-- pSMMU0 --|
 vSMMU--|-- viommu1 (VMIDy) --|-- pSMMU1 --|--s2_hwpt
        |-- viommu2 (VMIDz) --|-- pSMMU2 --|

Thanks
Nicolin

Tian, Kevin May 30, 2024, 12:28 a.m. UTC | #17

> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Wednesday, May 29, 2024 11:21 AM
> 
> On Wed, May 29, 2024 at 02:58:11AM +0000, Tian, Kevin wrote:
> > > From: Nicolin Chen <nicolinc@nvidia.com>
> > > Sent: Wednesday, May 29, 2024 4:23 AM
> 
> > > What I mean is for regular vSMMU. Without VCMDQ, a regular vSMMU
> > > on a multi-pSMMU setup will look like (e.g. three devices behind
> > > different SMMUs):
> > > |<------ VMM ------->|<------ kernel ------>|
> > >        |-- viommu0 --|-- pSMMU0 --|
> > > vSMMU--|-- viommu1 --|-- pSMMU1 --|--s2_hwpt
> > >        |-- viommu2 --|-- pSMMU2 --|
> > >
> > > And device would attach to:
> > > |<---- guest ---->|<--- VMM --->|<- kernel ->|
> > >        |-- dev0 --|-- viommu0 --|-- pSMMU0 --|
> > > vSMMU--|-- dev1 --|-- viommu1 --|-- pSMMU1 --|
> > >        |-- dev2 --|-- viommu2 --|-- pSMMU2 --|
> > >
> > > When trapping a device cache invalidation: it is straightforward
> > > by deciphering the virtual device ID to pick the viommu that the
> > > device is attached to.
> >
> > I understand how above works.
> >
> > My question is why that option is chosen instead of going with 1:1
> > mapping between vSMMU and viommu i.e. letting the kernel to
> > figure out which pSMMU should be sent an invalidation cmd to, as
> > how VT-d is virtualized.
> >
> > I want to know whether doing so is simply to be compatible with
> > what VCMDQ requires, or due to another untold reason.
> 
> Because we use viommu as a VMID holder for SMMU. So a pSMMU must
> have its own viommu to store its VMID for a shared s2_hwpt:
>         |-- viommu0 (VMIDx) --|-- pSMMU0 --|
>  vSMMU--|-- viommu1 (VMIDy) --|-- pSMMU1 --|--s2_hwpt
>         |-- viommu2 (VMIDz) --|-- pSMMU2 --|
> 

there are other options, e.g. you can have one viommu holding multiple
VMIDs each associating to a pSMMU.

so it's really an implementation choice that you want a same viommu
topology scheme w/ or w/o VCMDQ.

we all know there are some constraints of copying the physical topology,
e.g. hotplugging a device or migration. for VCMDQ it's clearly an
acceptable tradeoff for performance. w/o VCMDQ I'm not sure whether
you want to keep more flexibility for the user. 😊

Nicolin Chen May 30, 2024, 12:58 a.m. UTC | #18

On Thu, May 30, 2024 at 12:28:43AM +0000, Tian, Kevin wrote:
> > From: Nicolin Chen <nicolinc@nvidia.com>
> > Sent: Wednesday, May 29, 2024 11:21 AM
> > On Wed, May 29, 2024 at 02:58:11AM +0000, Tian, Kevin wrote:
> > > My question is why that option is chosen instead of going with 1:1
> > > mapping between vSMMU and viommu i.e. letting the kernel to
> > > figure out which pSMMU should be sent an invalidation cmd to, as
> > > how VT-d is virtualized.
> > >
> > > I want to know whether doing so is simply to be compatible with
> > > what VCMDQ requires, or due to another untold reason.
> >
> > Because we use viommu as a VMID holder for SMMU. So a pSMMU must
> > have its own viommu to store its VMID for a shared s2_hwpt:
> >         |-- viommu0 (VMIDx) --|-- pSMMU0 --|
> >  vSMMU--|-- viommu1 (VMIDy) --|-- pSMMU1 --|--s2_hwpt
> >         |-- viommu2 (VMIDz) --|-- pSMMU2 --|
> >
> 
> there are other options, e.g. you can have one viommu holding multiple
> VMIDs each associating to a pSMMU.

Well, possibly. But everything previously in a viommu would have
to be a list (for number of VMIDs) in the kernel level: not only
a VMID list, but also a 2D virtual ID lists, something like:

struct xarray vdev_ids[num_of_vmid]; // per-IOMMU vID to pID lookup

And a driver in this case would be difficult to get a complete
concept of a viommu object since it's core object and shared by
all kernel-level IOMMU instances. If a driver wants to extend a
viommu object for some additional feature, e.g. VINTF in this
series, it would likely have to create another per-driver object
and again another list of this kind of objects in struct viommu.

At the end of day, we have to duplicate it one way or another for
the amount of physical IOMMUs. And it seems to cleaner by doing
it with multiple viommu objects.

> so it's really an implementation choice that you want a same viommu
> topology scheme w/ or w/o VCMDQ.
> 
> we all know there are some constraints of copying the physical topology,
> e.g. hotplugging a device or migration. for VCMDQ it's clearly an
> acceptable tradeoff for performance. w/o VCMDQ I'm not sure whether
> you want to keep more flexibility for the user. 😊

Oh. With regular nested SMMU, there is only one virtual SMMU in
the guest VM. No need of copying physical topology. Just the VMM
needs to allocate three viommus to add them to a list of its own.

Thanks
Nicolin

Tian, Kevin May 30, 2024, 3:05 a.m. UTC | #19

> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Thursday, May 30, 2024 8:59 AM
> 
> On Thu, May 30, 2024 at 12:28:43AM +0000, Tian, Kevin wrote:
> > > From: Nicolin Chen <nicolinc@nvidia.com>
> > > Sent: Wednesday, May 29, 2024 11:21 AM
> > > On Wed, May 29, 2024 at 02:58:11AM +0000, Tian, Kevin wrote:
> > > > My question is why that option is chosen instead of going with 1:1
> > > > mapping between vSMMU and viommu i.e. letting the kernel to
> > > > figure out which pSMMU should be sent an invalidation cmd to, as
> > > > how VT-d is virtualized.
> > > >
> > > > I want to know whether doing so is simply to be compatible with
> > > > what VCMDQ requires, or due to another untold reason.
> > >
> > > Because we use viommu as a VMID holder for SMMU. So a pSMMU must
> > > have its own viommu to store its VMID for a shared s2_hwpt:
> > >         |-- viommu0 (VMIDx) --|-- pSMMU0 --|
> > >  vSMMU--|-- viommu1 (VMIDy) --|-- pSMMU1 --|--s2_hwpt
> > >         |-- viommu2 (VMIDz) --|-- pSMMU2 --|
> > >
> >
> > there are other options, e.g. you can have one viommu holding multiple
> > VMIDs each associating to a pSMMU.
> 
> Well, possibly. But everything previously in a viommu would have
> to be a list (for number of VMIDs) in the kernel level: not only
> a VMID list, but also a 2D virtual ID lists, something like:
> 
> struct xarray vdev_ids[num_of_vmid]; // per-IOMMU vID to pID lookup

ah does it mean that dev ID space on ARM is local to SMMU?

It's not the case on x86 platforms where the RID is platform-wise.

> 
> And a driver in this case would be difficult to get a complete
> concept of a viommu object since it's core object and shared by
> all kernel-level IOMMU instances. If a driver wants to extend a
> viommu object for some additional feature, e.g. VINTF in this
> series, it would likely have to create another per-driver object
> and again another list of this kind of objects in struct viommu.
> 
> At the end of day, we have to duplicate it one way or another for
> the amount of physical IOMMUs. And it seems to cleaner by doing
> it with multiple viommu objects.
> 
> > so it's really an implementation choice that you want a same viommu
> > topology scheme w/ or w/o VCMDQ.
> >
> > we all know there are some constraints of copying the physical topology,
> > e.g. hotplugging a device or migration. for VCMDQ it's clearly an
> > acceptable tradeoff for performance. w/o VCMDQ I'm not sure whether
> > you want to keep more flexibility for the user. 😊
> 
> Oh. With regular nested SMMU, there is only one virtual SMMU in
> the guest VM. No need of copying physical topology. Just the VMM
> needs to allocate three viommus to add them to a list of its own.
> 

Okay I missed this point. Then the number of viommus is really about
the contract between the vmm and the kernel, which is invisible to
the guest or the admin who launches the Qemu.

but wait, isn't there another problem - does the VMM have the
permission to enumerate the topology of physical SMMUs? Or probably
the VMM only needs to know the number of relevant SMMUs for
assigned devices and such info can be indirectly composed by extending  GET_HW_INFO...

Nicolin Chen May 30, 2024, 4:26 a.m. UTC | #20

On Thu, May 30, 2024 at 03:05:05AM +0000, Tian, Kevin wrote:
> > From: Nicolin Chen <nicolinc@nvidia.com>
> > Sent: Thursday, May 30, 2024 8:59 AM
> > On Thu, May 30, 2024 at 12:28:43AM +0000, Tian, Kevin wrote:
> > > > From: Nicolin Chen <nicolinc@nvidia.com>
> > > > Sent: Wednesday, May 29, 2024 11:21 AM
> > > > On Wed, May 29, 2024 at 02:58:11AM +0000, Tian, Kevin wrote:
> > > > > My question is why that option is chosen instead of going with 1:1
> > > > > mapping between vSMMU and viommu i.e. letting the kernel to
> > > > > figure out which pSMMU should be sent an invalidation cmd to, as
> > > > > how VT-d is virtualized.
> > > > >
> > > > > I want to know whether doing so is simply to be compatible with
> > > > > what VCMDQ requires, or due to another untold reason.
> > > >
> > > > Because we use viommu as a VMID holder for SMMU. So a pSMMU must
> > > > have its own viommu to store its VMID for a shared s2_hwpt:
> > > >         |-- viommu0 (VMIDx) --|-- pSMMU0 --|
> > > >  vSMMU--|-- viommu1 (VMIDy) --|-- pSMMU1 --|--s2_hwpt
> > > >         |-- viommu2 (VMIDz) --|-- pSMMU2 --|
> > > >
> > >
> > > there are other options, e.g. you can have one viommu holding multiple
> > > VMIDs each associating to a pSMMU.
> >
> > Well, possibly. But everything previously in a viommu would have
> > to be a list (for number of VMIDs) in the kernel level: not only
> > a VMID list, but also a 2D virtual ID lists, something like:
> >
> > struct xarray vdev_ids[num_of_vmid]; // per-IOMMU vID to pID lookup
> 
> ah does it mean that dev ID space on ARM is local to SMMU?
> 
> It's not the case on x86 platforms where the RID is platform-wise.

Actually I had a second thought after I replied. Yea, we only
support PCI devices at this moment, so their RIDs come from BDF
numbers, and then should be platform-wise as you pointed out:

|<-------VMM-------->|<--------------- kernel ------------>|
 vRID_a --|                       |-- VMIDx / SMMU0 -- pRID_A
 vRID_b --| => vSMMU => viommu => |-- VMIDy / SMMU1 -- pRID_B
 vRID_c --|                       |-- VMIDz / SMMU2 -- pRID_C

# x/y/z can be identical, while a/b/c and A/B/C must be unique.

So likely a single lookup list can be enough. That still can't
avoid the list of per-pIOMMU objects for some driver feature
though. So, I think having a per-pIOMMU viommu also adds a bit
of flexibility for the kernel.

Overall, indeed an implementation choice, as you mentioned :)

> > And a driver in this case would be difficult to get a complete
> > concept of a viommu object since it's core object and shared by
> > all kernel-level IOMMU instances. If a driver wants to extend a
> > viommu object for some additional feature, e.g. VINTF in this
> > series, it would likely have to create another per-driver object
> > and again another list of this kind of objects in struct viommu.
> >
> > At the end of day, we have to duplicate it one way or another for
> > the amount of physical IOMMUs. And it seems to cleaner by doing
> > it with multiple viommu objects.
> >
> > > so it's really an implementation choice that you want a same viommu
> > > topology scheme w/ or w/o VCMDQ.
> > >
> > > we all know there are some constraints of copying the physical topology,
> > > e.g. hotplugging a device or migration. for VCMDQ it's clearly an
> > > acceptable tradeoff for performance. w/o VCMDQ I'm not sure whether
> > > you want to keep more flexibility for the user. 😊
> >
> > Oh. With regular nested SMMU, there is only one virtual SMMU in
> > the guest VM. No need of copying physical topology. Just the VMM
> > needs to allocate three viommus to add them to a list of its own.
> >
> 
> Okay I missed this point. Then the number of viommus is really about
> the contract between the vmm and the kernel, which is invisible to
> the guest or the admin who launches the Qemu.

Yes. Everything should be behind the scene, since VMM can trap
and select, unlike VCMDQ doing direct MMIO to the HW without a
chance of VM Exits.

> but wait, isn't there another problem - does the VMM have the
> permission to enumerate the topology of physical SMMUs? Or probably
> the VMM only needs to know the number of relevant SMMUs for
> assigned devices and such info can be indirectly composed by extending  GET_HW_INFO...

I think VMM already has some kinda permission reading the number
of IOMMUs from "/sys/class/iommu"?

That being said, in a regular nesting case w/o VCMDQ, there is
no need to instantiate all vSMMUs to the number of pSMMUs unless
there is at least one device behind each pSMMU attaches.

So, the current implementation allocates an s2_hwpt/viommu only
on demand, when the first device from a physical SMMU attaches,
meaning there'll be only one viommu in the list until the other
first device from another pSMMU attaches. And the actual attach
logical always tries attaching a device to the existing viommus
from the list, and only allocates a new viommu (or s2_hwpt) when
all of them failed. FWIW, host-kernel/driver has a full control
against these attaches.

Thanks
Nicolin

Jason Gunthorpe June 1, 2024, 9:45 p.m. UTC | #21

On Wed, May 29, 2024 at 05:58:39PM -0700, Nicolin Chen wrote:
> On Thu, May 30, 2024 at 12:28:43AM +0000, Tian, Kevin wrote:
> > > From: Nicolin Chen <nicolinc@nvidia.com>
> > > Sent: Wednesday, May 29, 2024 11:21 AM
> > > On Wed, May 29, 2024 at 02:58:11AM +0000, Tian, Kevin wrote:
> > > > My question is why that option is chosen instead of going with 1:1
> > > > mapping between vSMMU and viommu i.e. letting the kernel to
> > > > figure out which pSMMU should be sent an invalidation cmd to, as
> > > > how VT-d is virtualized.
> > > >
> > > > I want to know whether doing so is simply to be compatible with
> > > > what VCMDQ requires, or due to another untold reason.
> > >
> > > Because we use viommu as a VMID holder for SMMU. So a pSMMU must
> > > have its own viommu to store its VMID for a shared s2_hwpt:
> > >         |-- viommu0 (VMIDx) --|-- pSMMU0 --|
> > >  vSMMU--|-- viommu1 (VMIDy) --|-- pSMMU1 --|--s2_hwpt
> > >         |-- viommu2 (VMIDz) --|-- pSMMU2 --|
> > >
> > 
> > there are other options, e.g. you can have one viommu holding multiple
> > VMIDs each associating to a pSMMU.
> 
> Well, possibly. But everything previously in a viommu would have
> to be a list (for number of VMIDs) in the kernel level: not only
> a VMID list, but also a 2D virtual ID lists, something like:
> 
> struct xarray vdev_ids[num_of_vmid]; // per-IOMMU vID to pID lookup

I feel it makes most sense that ARM (and maybe everyone) just have a
viommu per piommu.

The main argument against is we haven't made it efficient for the VMM
to support multiple piommus. It has to do a system call per piommu
each time it processes the cmdq.

But, on the other hand, if you care about invalidation efficiency it
is kind of silly not to expose the piommus to the guest so that the
invalidation scope can be appropriately narrowed. Replicating all ASID
invalidations to all piommus doesn't make alot of sense if the guest
can know that only one piommu actually needs invalidation.

> And a driver in this case would be difficult to get a complete
> concept of a viommu object since it's core object and shared by
> all kernel-level IOMMU instances. If a driver wants to extend a
> viommu object for some additional feature, e.g. VINTF in this
> series, it would likely have to create another per-driver object
> and again another list of this kind of objects in struct viommu.

Right, we need some kind of per-piommu object because we have
per-piommu data.

> Oh. With regular nested SMMU, there is only one virtual SMMU in
> the guest VM. No need of copying physical topology. Just the VMM
> needs to allocate three viommus to add them to a list of its own.

I understand the appeal of doing this has been to minimize qemu
changes in its ACPI parts, if we tackle that instead maybe we should
just not implement viommu to multiple piommu. It is somewhat
complicated.

Jason

Nicolin Chen June 3, 2024, 3:25 a.m. UTC | #22

On Sat, Jun 01, 2024 at 06:45:01PM -0300, Jason Gunthorpe wrote:
> On Wed, May 29, 2024 at 05:58:39PM -0700, Nicolin Chen wrote:
> > On Thu, May 30, 2024 at 12:28:43AM +0000, Tian, Kevin wrote:
> > > > From: Nicolin Chen <nicolinc@nvidia.com>
> > > > Sent: Wednesday, May 29, 2024 11:21 AM
> > > > On Wed, May 29, 2024 at 02:58:11AM +0000, Tian, Kevin wrote:
> > > > > My question is why that option is chosen instead of going with 1:1
> > > > > mapping between vSMMU and viommu i.e. letting the kernel to
> > > > > figure out which pSMMU should be sent an invalidation cmd to, as
> > > > > how VT-d is virtualized.
> > > > >
> > > > > I want to know whether doing so is simply to be compatible with
> > > > > what VCMDQ requires, or due to another untold reason.
> > > >
> > > > Because we use viommu as a VMID holder for SMMU. So a pSMMU must
> > > > have its own viommu to store its VMID for a shared s2_hwpt:
> > > >         |-- viommu0 (VMIDx) --|-- pSMMU0 --|
> > > >  vSMMU--|-- viommu1 (VMIDy) --|-- pSMMU1 --|--s2_hwpt
> > > >         |-- viommu2 (VMIDz) --|-- pSMMU2 --|
> > > >
> > > 
> > > there are other options, e.g. you can have one viommu holding multiple
> > > VMIDs each associating to a pSMMU.
> > 
> > Well, possibly. But everything previously in a viommu would have
> > to be a list (for number of VMIDs) in the kernel level: not only
> > a VMID list, but also a 2D virtual ID lists, something like:
> > 
> > struct xarray vdev_ids[num_of_vmid]; // per-IOMMU vID to pID lookup
> 
> I feel it makes most sense that ARM (and maybe everyone) just have a
> viommu per piommu.
> 
> The main argument against is we haven't made it efficient for the VMM
> to support multiple piommus. It has to do a system call per piommu
> each time it processes the cmdq.
> 
> But, on the other hand, if you care about invalidation efficiency it
> is kind of silly not to expose the piommus to the guest so that the
> invalidation scope can be appropriately narrowed. Replicating all ASID
> invalidations to all piommus doesn't make alot of sense if the guest
> can know that only one piommu actually needs invalidation.

Yea, that'd be pretty slow, though a broadcast would be still
inevitable when an invalidation only has an address range w/o
an ASID, CMD_TLBI_NH_VAA for example.

In fact, there should always be a dispatcher (v.s. broadcast):
 - in the one-viommu-per-pIOMMU case (#1), it's in the VMM
 - in the one-viommu-per-vIOMMU case (#2), it's in the kernel

One of them has to take the role to burn some CPUs for-eaching
the hwpt list to identify the iommu to forward. The design #1,
simply makes the kernel easier.

The design #2, on the other hand, would not only require some
lists and new objects that we just discussed, yet also pair of
VIOMMU_SET/UNSET_HWPT_ID ioctls, though it also makes sense as
we choose IOMMU_VIOMMU_INVALIDATE over IOMMU_DEV_INVALIDATE by
adding VIOMMU_SET/UNSET_VDEV_ID?

> > And a driver in this case would be difficult to get a complete
> > concept of a viommu object since it's core object and shared by
> > all kernel-level IOMMU instances. If a driver wants to extend a
> > viommu object for some additional feature, e.g. VINTF in this
> > series, it would likely have to create another per-driver object
> > and again another list of this kind of objects in struct viommu.
> 
> Right, we need some kind of per-piommu object because we have
> per-piommu data.
>
> > Oh. With regular nested SMMU, there is only one virtual SMMU in
> > the guest VM. No need of copying physical topology. Just the VMM
> > needs to allocate three viommus to add them to a list of its own.
> 
> I understand the appeal of doing this has been to minimize qemu
> changes in its ACPI parts if we tackle that instead maybe we should
> just not implement viommu to multiple piommu. It is somewhat
> complicated.

Would you please clarify that suggestion "not implement viommu
to multiple piommu"?

For regular nesting (SMMU), we are still doing one vSMMU in the
VMM, though VCMDQ case would be an exception....

Thanks
Nicolin

Jason Gunthorpe June 6, 2024, 6:24 p.m. UTC | #23

On Sun, Jun 02, 2024 at 08:25:34PM -0700, Nicolin Chen wrote:

> > I understand the appeal of doing this has been to minimize qemu
> > changes in its ACPI parts if we tackle that instead maybe we should
> > just not implement viommu to multiple piommu. It is somewhat
> > complicated.
> 
> Would you please clarify that suggestion "not implement viommu
> to multiple piommu"?
> 
> For regular nesting (SMMU), we are still doing one vSMMU in the
> VMM, though VCMDQ case would be an exception....

This is what I mean, always do multiple vSMMU if there are multiple
physical pSMMUs. Don't replicate any virtual commands across pSMMUs.

Jason

Nicolin Chen June 6, 2024, 6:44 p.m. UTC | #24

On Thu, Jun 06, 2024 at 03:24:23PM -0300, Jason Gunthorpe wrote:
> On Sun, Jun 02, 2024 at 08:25:34PM -0700, Nicolin Chen wrote:
> 
> > > I understand the appeal of doing this has been to minimize qemu
> > > changes in its ACPI parts if we tackle that instead maybe we should
> > > just not implement viommu to multiple piommu. It is somewhat
> > > complicated.
> > 
> > Would you please clarify that suggestion "not implement viommu
> > to multiple piommu"?
> > 
> > For regular nesting (SMMU), we are still doing one vSMMU in the
> > VMM, though VCMDQ case would be an exception....
> 
> This is what I mean, always do multiple vSMMU if there are multiple
> physical pSMMUs. Don't replicate any virtual commands across pSMMUs.

Thanks for clarifying. That also means you'd prefer putting the
command dispatcher in VMM, which is what we have at this moment.

Nicolin

Jason Gunthorpe June 7, 2024, 12:27 a.m. UTC | #25

On Thu, Jun 06, 2024 at 11:44:58AM -0700, Nicolin Chen wrote:
> On Thu, Jun 06, 2024 at 03:24:23PM -0300, Jason Gunthorpe wrote:
> > On Sun, Jun 02, 2024 at 08:25:34PM -0700, Nicolin Chen wrote:
> > 
> > > > I understand the appeal of doing this has been to minimize qemu
> > > > changes in its ACPI parts if we tackle that instead maybe we should
> > > > just not implement viommu to multiple piommu. It is somewhat
> > > > complicated.
> > > 
> > > Would you please clarify that suggestion "not implement viommu
> > > to multiple piommu"?
> > > 
> > > For regular nesting (SMMU), we are still doing one vSMMU in the
> > > VMM, though VCMDQ case would be an exception....
> > 
> > This is what I mean, always do multiple vSMMU if there are multiple
> > physical pSMMUs. Don't replicate any virtual commands across pSMMUs.
> 
> Thanks for clarifying. That also means you'd prefer putting the
> command dispatcher in VMM, which is what we have at this moment.

Unless someone knows a reason why we should strive hard to have only a
single vSMMU and accept some invalidation inefficiency?

Jason

Tian, Kevin June 7, 2024, 12:36 a.m. UTC | #26

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Friday, June 7, 2024 8:27 AM
> 
> On Thu, Jun 06, 2024 at 11:44:58AM -0700, Nicolin Chen wrote:
> > On Thu, Jun 06, 2024 at 03:24:23PM -0300, Jason Gunthorpe wrote:
> > > On Sun, Jun 02, 2024 at 08:25:34PM -0700, Nicolin Chen wrote:
> > >
> > > > > I understand the appeal of doing this has been to minimize qemu
> > > > > changes in its ACPI parts if we tackle that instead maybe we should
> > > > > just not implement viommu to multiple piommu. It is somewhat
> > > > > complicated.
> > > >
> > > > Would you please clarify that suggestion "not implement viommu
> > > > to multiple piommu"?
> > > >
> > > > For regular nesting (SMMU), we are still doing one vSMMU in the
> > > > VMM, though VCMDQ case would be an exception....
> > >
> > > This is what I mean, always do multiple vSMMU if there are multiple
> > > physical pSMMUs. Don't replicate any virtual commands across pSMMUs.
> >
> > Thanks for clarifying. That also means you'd prefer putting the
> > command dispatcher in VMM, which is what we have at this moment.
> 
> Unless someone knows a reason why we should strive hard to have only a
> single vSMMU and accept some invalidation inefficiency?
> 

migration? a single vSMMU provides better compatibility between
src/dest...

Jason Gunthorpe June 7, 2024, 2:49 p.m. UTC | #27

On Fri, Jun 07, 2024 at 12:36:46AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Friday, June 7, 2024 8:27 AM
> > 
> > On Thu, Jun 06, 2024 at 11:44:58AM -0700, Nicolin Chen wrote:
> > > On Thu, Jun 06, 2024 at 03:24:23PM -0300, Jason Gunthorpe wrote:
> > > > On Sun, Jun 02, 2024 at 08:25:34PM -0700, Nicolin Chen wrote:
> > > >
> > > > > > I understand the appeal of doing this has been to minimize qemu
> > > > > > changes in its ACPI parts if we tackle that instead maybe we should
> > > > > > just not implement viommu to multiple piommu. It is somewhat
> > > > > > complicated.
> > > > >
> > > > > Would you please clarify that suggestion "not implement viommu
> > > > > to multiple piommu"?
> > > > >
> > > > > For regular nesting (SMMU), we are still doing one vSMMU in the
> > > > > VMM, though VCMDQ case would be an exception....
> > > >
> > > > This is what I mean, always do multiple vSMMU if there are multiple
> > > > physical pSMMUs. Don't replicate any virtual commands across pSMMUs.
> > >
> > > Thanks for clarifying. That also means you'd prefer putting the
> > > command dispatcher in VMM, which is what we have at this moment.
> > 
> > Unless someone knows a reason why we should strive hard to have only a
> > single vSMMU and accept some invalidation inefficiency?
> > 
> 
> migration? a single vSMMU provides better compatibility between
> src/dest...

Maybe, though I think we can safely split a single pSMMU into multiple
vSMMUs using the IOMMUFD vIOMMU interface. So if your machine model
has two vSMMUs and your physical HW has less we can still make that
work.

IOTLB efficiency will suffer though when splitting 1p -> 2v while
invalidation performance will suffer when joining 2p -> 1v.

Jason

Nicolin Chen June 7, 2024, 9:19 p.m. UTC | #28

On Fri, Jun 07, 2024 at 11:49:17AM -0300, Jason Gunthorpe wrote:
> On Fri, Jun 07, 2024 at 12:36:46AM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > Sent: Friday, June 7, 2024 8:27 AM
> > > 
> > > On Thu, Jun 06, 2024 at 11:44:58AM -0700, Nicolin Chen wrote:
> > > > On Thu, Jun 06, 2024 at 03:24:23PM -0300, Jason Gunthorpe wrote:
> > > > > On Sun, Jun 02, 2024 at 08:25:34PM -0700, Nicolin Chen wrote:
> > > > >
> > > > > > > I understand the appeal of doing this has been to minimize qemu
> > > > > > > changes in its ACPI parts if we tackle that instead maybe we should
> > > > > > > just not implement viommu to multiple piommu. It is somewhat
> > > > > > > complicated.
> > > > > >
> > > > > > Would you please clarify that suggestion "not implement viommu
> > > > > > to multiple piommu"?
> > > > > >
> > > > > > For regular nesting (SMMU), we are still doing one vSMMU in the
> > > > > > VMM, though VCMDQ case would be an exception....
> > > > >
> > > > > This is what I mean, always do multiple vSMMU if there are multiple
> > > > > physical pSMMUs. Don't replicate any virtual commands across pSMMUs.
> > > >
> > > > Thanks for clarifying. That also means you'd prefer putting the
> > > > command dispatcher in VMM, which is what we have at this moment.
> > > 
> > > Unless someone knows a reason why we should strive hard to have only a
> > > single vSMMU and accept some invalidation inefficiency?
> > > 
> > 
> > migration? a single vSMMU provides better compatibility between
> > src/dest...
> 
> Maybe, though I think we can safely split a single pSMMU into multiple
> vSMMUs using the IOMMUFD vIOMMU interface. So if your machine model
> has two vSMMUs and your physical HW has less we can still make that
> work.
> 
> IOTLB efficiency will suffer though when splitting 1p -> 2v while
> invalidation performance will suffer when joining 2p -> 1v.

I think the invalidation efficiency is actually solvable. So,
basically viommu_invalidate would receive a whole batch of cmds
and dispatch them to different pSMMUs (nested_domains/devices).
We already have a vdev_id table for devices, yet we just need a
new vasid table for nested_domains. Right?

The immediate benefit is that VMMs won't need to duplicate each
other's dispatcher pieces, and likely helps migrations as Kevin
pointed out.

With that being said, it would make the kernel design a bit more
complicated. And the VMM still has to separate the commands for
passthrough devices (HW iotlb) from commands for emulated devices
(emulated iotlb), unless we further split the topology at the VM
level to have a dedicated vSMMU for all passthrough devices --
then VMM could just forward its entire cmdq to the kernel without
deciphering every command (likely?).

Thanks
Nicolin

Jason Gunthorpe June 10, 2024, 12:04 p.m. UTC | #29

On Fri, Jun 07, 2024 at 02:19:21PM -0700, Nicolin Chen wrote:

> > IOTLB efficiency will suffer though when splitting 1p -> 2v while
> > invalidation performance will suffer when joining 2p -> 1v.
> 
> I think the invalidation efficiency is actually solvable. So,
> basically viommu_invalidate would receive a whole batch of cmds
> and dispatch them to different pSMMUs (nested_domains/devices).
> We already have a vdev_id table for devices, yet we just need a
> new vasid table for nested_domains. Right?

You can't know the ASID usage of the hypervisor from the VM, unless
you also inspect the CD table memory in the guest. That seems like
something we should try hard to avoid.

> With that being said, it would make the kernel design a bit more
> complicated. And the VMM still has to separate the commands for
> passthrough devices (HW iotlb) from commands for emulated devices
> (emulated iotlb), unless we further split the topology at the VM
> level to have a dedicated vSMMU for all passthrough devices --
> then VMM could just forward its entire cmdq to the kernel without
> deciphering every command (likely?).

I would not include the emulated devices in a shared SMMU.. For the
same reason, we should try hard to avoid inspecting the page table
memory.

If a viommu is needed for emulated then virtio-iommu may be more
appropriate..

That said I'm sure someone will want to do this, so as long as it is
possible in the VMM, as slow as it may be, then it is fine.

Jason

Nicolin Chen June 10, 2024, 8:01 p.m. UTC | #30

On Mon, Jun 10, 2024 at 09:04:46AM -0300, Jason Gunthorpe wrote:
> On Fri, Jun 07, 2024 at 02:19:21PM -0700, Nicolin Chen wrote:
> 
> > > IOTLB efficiency will suffer though when splitting 1p -> 2v while
> > > invalidation performance will suffer when joining 2p -> 1v.
> > 
> > I think the invalidation efficiency is actually solvable. So,
> > basically viommu_invalidate would receive a whole batch of cmds
> > and dispatch them to different pSMMUs (nested_domains/devices).
> > We already have a vdev_id table for devices, yet we just need a
> > new vasid table for nested_domains. Right?
> 
> You can't know the ASID usage of the hypervisor from the VM, unless
> you also inspect the CD table memory in the guest. That seems like
> something we should try hard to avoid.

Actually, even now as we put a dispatcher in VMM, VMM still does
decode the CD table to link ASID to s1_hwpt. Otherwise, it could
only broadcast a TLBI cmd to all pSMMUs.

Doing in the other way by moving it to the kernel, we'd just need
a pair of new ioctls and use them when VMM traps CFGI_CD cmds, so
kernel driver instead of VMM user driver manages the links between
ASIDs to nested domains. Either a master ASID or SVA ASIDs can be
linked to the same nested_domain that's allocated per vSTE.

> > With that being said, it would make the kernel design a bit more
> > complicated. And the VMM still has to separate the commands for
> > passthrough devices (HW iotlb) from commands for emulated devices
> > (emulated iotlb), unless we further split the topology at the VM
> > level to have a dedicated vSMMU for all passthrough devices --
> > then VMM could just forward its entire cmdq to the kernel without
> > deciphering every command (likely?).
> 
> I would not include the emulated devices in a shared SMMU.. For the
> same reason, we should try hard to avoid inspecting the page table
> memory.

I wouldn't like the idea of attaching emulated devices to a shared
vSMMU. Yet, mind elaborating why this would inspect the page table
memory? Or do you mean we should avoid VMM inspecting user tables?

> If a viommu is needed for emulated then virtio-iommu may be more
> appropriate..
> 
> That said I'm sure someone will want to do this, so as long as it is
> possible in the VMM, as slow as it may be, then it is fine.

Eric hasn't replied my previous query regarding how to design this,
yet I guess the same. And looks like Intel is doing so for emulated
devices, since there is only one intel_iommu instance in a VM.

Thanks
Nicolin

Jason Gunthorpe June 10, 2024, 10:01 p.m. UTC | #31

On Mon, Jun 10, 2024 at 01:01:32PM -0700, Nicolin Chen wrote:
> On Mon, Jun 10, 2024 at 09:04:46AM -0300, Jason Gunthorpe wrote:
> > On Fri, Jun 07, 2024 at 02:19:21PM -0700, Nicolin Chen wrote:
> > 
> > > > IOTLB efficiency will suffer though when splitting 1p -> 2v while
> > > > invalidation performance will suffer when joining 2p -> 1v.
> > > 
> > > I think the invalidation efficiency is actually solvable. So,
> > > basically viommu_invalidate would receive a whole batch of cmds
> > > and dispatch them to different pSMMUs (nested_domains/devices).
> > > We already have a vdev_id table for devices, yet we just need a
> > > new vasid table for nested_domains. Right?
> > 
> > You can't know the ASID usage of the hypervisor from the VM, unless
> > you also inspect the CD table memory in the guest. That seems like
> > something we should try hard to avoid.
> 
> Actually, even now as we put a dispatcher in VMM, VMM still does
> decode the CD table to link ASID to s1_hwpt. Otherwise, it could
> only broadcast a TLBI cmd to all pSMMUs.

No, there should be no CD table decoding and no linking ASID to
anything by the VMM.

The ARM architecture is clean, the ASID can remain private to the VM,
there is no reason for the VMM to understand it.

The s1_hwpt is derived only from the vSTE and nothing more. It would
be fine for all the devices to have their own s1_hwpts with their own
vSTE's inside it.

> > > With that being said, it would make the kernel design a bit more
> > > complicated. And the VMM still has to separate the commands for
> > > passthrough devices (HW iotlb) from commands for emulated devices
> > > (emulated iotlb), unless we further split the topology at the VM
> > > level to have a dedicated vSMMU for all passthrough devices --
> > > then VMM could just forward its entire cmdq to the kernel without
> > > deciphering every command (likely?).
> > 
> > I would not include the emulated devices in a shared SMMU.. For the
> > same reason, we should try hard to avoid inspecting the page table
> > memory.
> 
> I wouldn't like the idea of attaching emulated devices to a shared
> vSMMU. Yet, mind elaborating why this would inspect the page table
> memory? Or do you mean we should avoid VMM inspecting user tables?

Emulated devices can't use the HW page table walker in the SMMU since
they won't get a clean CD linkage they can use. They have to manually
walk the page tables and convert them into an IOAS. It is a big PITA,
best to be avoided.

> > If a viommu is needed for emulated then virtio-iommu may be more
> > appropriate..
> > 
> > That said I'm sure someone will want to do this, so as long as it is
> > possible in the VMM, as slow as it may be, then it is fine.
> 
> Eric hasn't replied my previous query regarding how to design this,
> yet I guess the same. And looks like Intel is doing so for emulated
> devices, since there is only one intel_iommu instance in a VM.

Yes, Intel has long has this code to build VFIO containers from page
table walks.

Jason

Nicolin Chen June 10, 2024, 11:04 p.m. UTC | #32

On Mon, Jun 10, 2024 at 07:01:10PM -0300, Jason Gunthorpe wrote:
> On Mon, Jun 10, 2024 at 01:01:32PM -0700, Nicolin Chen wrote:
> > On Mon, Jun 10, 2024 at 09:04:46AM -0300, Jason Gunthorpe wrote:
> > > On Fri, Jun 07, 2024 at 02:19:21PM -0700, Nicolin Chen wrote:
> > > 
> > > > > IOTLB efficiency will suffer though when splitting 1p -> 2v while
> > > > > invalidation performance will suffer when joining 2p -> 1v.
> > > > 
> > > > I think the invalidation efficiency is actually solvable. So,
> > > > basically viommu_invalidate would receive a whole batch of cmds
> > > > and dispatch them to different pSMMUs (nested_domains/devices).
> > > > We already have a vdev_id table for devices, yet we just need a
> > > > new vasid table for nested_domains. Right?
> > > 
> > > You can't know the ASID usage of the hypervisor from the VM, unless
> > > you also inspect the CD table memory in the guest. That seems like
> > > something we should try hard to avoid.
> > 
> > Actually, even now as we put a dispatcher in VMM, VMM still does
> > decode the CD table to link ASID to s1_hwpt. Otherwise, it could
> > only broadcast a TLBI cmd to all pSMMUs.
> 
> No, there should be no CD table decoding and no linking ASID to
> anything by the VMM.
> 
> The ARM architecture is clean, the ASID can remain private to the VM,
> there is no reason for the VMM to understand it.

But a guest-level TLBI command usually has only ASID available to
know which pSMMU to dispatch the command. Without an ASID lookup
table, how could VMM then dispatch a command to the corresponding
pSMMU?

> The s1_hwpt is derived only from the vSTE and nothing more. It would
> be fine for all the devices to have their own s1_hwpts with their own
> vSTE's inside it.
> 
> > > > With that being said, it would make the kernel design a bit more
> > > > complicated. And the VMM still has to separate the commands for
> > > > passthrough devices (HW iotlb) from commands for emulated devices
> > > > (emulated iotlb), unless we further split the topology at the VM
> > > > level to have a dedicated vSMMU for all passthrough devices --
> > > > then VMM could just forward its entire cmdq to the kernel without
> > > > deciphering every command (likely?).
> > > 
> > > I would not include the emulated devices in a shared SMMU.. For the
> > > same reason, we should try hard to avoid inspecting the page table
> > > memory.
> > 
> > I wouldn't like the idea of attaching emulated devices to a shared
> > vSMMU. Yet, mind elaborating why this would inspect the page table
> > memory? Or do you mean we should avoid VMM inspecting user tables?
> 
> Emulated devices can't use the HW page table walker in the SMMU since
> they won't get a clean CD linkage they can use. They have to manually
> walk the page tables and convert them into an IOAS. It is a big PITA,
> best to be avoided.

Oh..the mainline QEMU smmu code already has that:
https://github.com/qemu/qemu/blob/master/hw/arm/smmu-common.c#L522

Surely, it's a pathway that passthrough devices shouldn't run into.

Thanks
Nicolin

Jason Gunthorpe June 11, 2024, 12:28 a.m. UTC | #33

On Mon, Jun 10, 2024 at 04:04:30PM -0700, Nicolin Chen wrote:

> > > Actually, even now as we put a dispatcher in VMM, VMM still does
> > > decode the CD table to link ASID to s1_hwpt. Otherwise, it could
> > > only broadcast a TLBI cmd to all pSMMUs.
> > 
> > No, there should be no CD table decoding and no linking ASID to
> > anything by the VMM.
> > 
> > The ARM architecture is clean, the ASID can remain private to the VM,
> > there is no reason for the VMM to understand it.
> 
> But a guest-level TLBI command usually has only ASID available to
> know which pSMMU to dispatch the command. Without an ASID lookup
> table, how could VMM then dispatch a command to the corresponding
> pSMMU?

It can broadcast. The ARM architecture does not expect a N:1 mapping
of SMMUs. This is why I think it is not such a good idea..

Yes the VMM could walk the CD tables too and build up a bitmap of what
ASIDs are being used by what pSMMUs, and that would be fine for the
VMM to do, but I wouldn't necessarily recommend it :)

Jason

Nicolin Chen June 11, 2024, 12:44 a.m. UTC | #34

On Mon, Jun 10, 2024 at 09:28:39PM -0300, Jason Gunthorpe wrote:
> On Mon, Jun 10, 2024 at 04:04:30PM -0700, Nicolin Chen wrote:
> 
> > > > Actually, even now as we put a dispatcher in VMM, VMM still does
> > > > decode the CD table to link ASID to s1_hwpt. Otherwise, it could
> > > > only broadcast a TLBI cmd to all pSMMUs.
> > > 
> > > No, there should be no CD table decoding and no linking ASID to
> > > anything by the VMM.
> > > 
> > > The ARM architecture is clean, the ASID can remain private to the VM,
> > > there is no reason for the VMM to understand it.
> > 
> > But a guest-level TLBI command usually has only ASID available to
> > know which pSMMU to dispatch the command. Without an ASID lookup
> > table, how could VMM then dispatch a command to the corresponding
> > pSMMU?
> 
> It can broadcast. The ARM architecture does not expect a N:1 mapping
> of SMMUs. This is why I think it is not such a good idea..

Hmm, I thought we had an agreed idea that we shouldn't broadcast
a TLBI (except global NH_ALL/VAA) for invalidation performance?

> Yes the VMM could walk the CD tables too and build up a bitmap of what
> ASIDs are being used by what pSMMUs, and that would be fine for the
> VMM to do, but I wouldn't necessarily recommend it :)

CD table walkthrough would be always done only by VMM, while the
lookup table could be created/maintained by the kernel. I feel a
vasid table could make sense since we maintain the vdev_id table
in the kernel space too.

Anyway, it is still an implementation choice, as Kevin remarked.

Thanks
Nicolin

Jason Gunthorpe June 11, 2024, 12:17 p.m. UTC | #35

On Mon, Jun 10, 2024 at 05:44:16PM -0700, Nicolin Chen wrote:
> On Mon, Jun 10, 2024 at 09:28:39PM -0300, Jason Gunthorpe wrote:
> > On Mon, Jun 10, 2024 at 04:04:30PM -0700, Nicolin Chen wrote:
> > 
> > > > > Actually, even now as we put a dispatcher in VMM, VMM still does
> > > > > decode the CD table to link ASID to s1_hwpt. Otherwise, it could
> > > > > only broadcast a TLBI cmd to all pSMMUs.
> > > > 
> > > > No, there should be no CD table decoding and no linking ASID to
> > > > anything by the VMM.
> > > > 
> > > > The ARM architecture is clean, the ASID can remain private to the VM,
> > > > there is no reason for the VMM to understand it.
> > > 
> > > But a guest-level TLBI command usually has only ASID available to
> > > know which pSMMU to dispatch the command. Without an ASID lookup
> > > table, how could VMM then dispatch a command to the corresponding
> > > pSMMU?
> > 
> > It can broadcast. The ARM architecture does not expect a N:1 mapping
> > of SMMUs. This is why I think it is not such a good idea..
> 
> Hmm, I thought we had an agreed idea that we shouldn't broadcast
> a TLBI (except global NH_ALL/VAA) for invalidation performance?

I wouldn't say agree, there are just lots of different trade offs to
be made here. You can reduce broadcast by parsing the CD table from
the VMM. You can reduce broadcast with multiple vSMMUs.

VMM needs to pick a design. I favour multiple vSMMUs.

> CD table walkthrough would be always done only by VMM, while the
> lookup table could be created/maintained by the kernel. I feel a
> vasid table could make sense since we maintain the vdev_id table
> in the kernel space too.

I'm not convinced we should put such a micro optimization in the
kernel. If the VMM cares about performance then split the vSMMU,
otherwise lets just put all the mess in the VMM and give it the tools
to manage the invalidation distribution.

In the long run things like vCMDQ are going to force the choice to
multi-smmu, which is why I don't see too much value with investing in
optimizing the single vSMMU case. The optimization can be done later
if someone does have a use case.

Jason

Nicolin Chen June 11, 2024, 7:11 p.m. UTC | #36

On Tue, Jun 11, 2024 at 09:17:56AM -0300, Jason Gunthorpe wrote:
> On Mon, Jun 10, 2024 at 05:44:16PM -0700, Nicolin Chen wrote:
> > On Mon, Jun 10, 2024 at 09:28:39PM -0300, Jason Gunthorpe wrote:
> > > On Mon, Jun 10, 2024 at 04:04:30PM -0700, Nicolin Chen wrote:
> > > 
> > > > > > Actually, even now as we put a dispatcher in VMM, VMM still does
> > > > > > decode the CD table to link ASID to s1_hwpt. Otherwise, it could
> > > > > > only broadcast a TLBI cmd to all pSMMUs.
> > > > > 
> > > > > No, there should be no CD table decoding and no linking ASID to
> > > > > anything by the VMM.
> > > > > 
> > > > > The ARM architecture is clean, the ASID can remain private to the VM,
> > > > > there is no reason for the VMM to understand it.
> > > > 
> > > > But a guest-level TLBI command usually has only ASID available to
> > > > know which pSMMU to dispatch the command. Without an ASID lookup
> > > > table, how could VMM then dispatch a command to the corresponding
> > > > pSMMU?
> > > 
> > > It can broadcast. The ARM architecture does not expect a N:1 mapping
> > > of SMMUs. This is why I think it is not such a good idea..
> > 
> > Hmm, I thought we had an agreed idea that we shouldn't broadcast
> > a TLBI (except global NH_ALL/VAA) for invalidation performance?
> 
> I wouldn't say agree, there are just lots of different trade offs to
> be made here. You can reduce broadcast by parsing the CD table from
> the VMM. You can reduce broadcast with multiple vSMMUs.
> 
> VMM needs to pick a design. I favour multiple vSMMUs.

Yea, having multiple vSMMUs for nesting too seems to be a cleaner
design. The thing is that we have to put a certain complexity in
the VMM, and it should be more efficient by having it at the boot
stage (creating multi-vSMMUs/PCIs and IORT nodes) v.s. runtime
(trappings and distributing at every command).

> > CD table walkthrough would be always done only by VMM, while the
> > lookup table could be created/maintained by the kernel. I feel a
> > vasid table could make sense since we maintain the vdev_id table
> > in the kernel space too.
> 
> I'm not convinced we should put such a micro optimization in the
> kernel. If the VMM cares about performance then split the vSMMU,
> otherwise lets just put all the mess in the VMM and give it the tools
> to manage the invalidation distribution.
> 
> In the long run things like vCMDQ are going to force the choice to
> multi-smmu, which is why I don't see too much value with investing in
> optimizing the single vSMMU case. The optimization can be done later
> if someone does have a use case.

I see. That makes sense to me.

Nicolin

[RFCv1,08/14] iommufd: Add IOMMU_VIOMMU_SET_DEV_ID ioctl

Commit Message

Comments

Patch