Message ID | c97a98a72ee3498c587e5898d6b899553ccd9b27.1712978212.git.nicolinc@nvidia.com
---|---
State | Handled Elsewhere
Series | Add Tegra241 (Grace) CMDQV Support (part 2/2)
On Fri, Apr 12, 2024 at 08:47:05PM -0700, Nicolin Chen wrote:
> Introduce a new ioctl to set a per-viommu device virtual id that should be
> linked to the physical device id (or just a struct device pointer).
>
> Since a viommu (user space IOMMU instance) can have multiple devices while
> it's not ideal to confine a device to one single user space IOMMU instance
> either, these two shouldn't just do a 1:1 mapping. Add two xarrays in their
> structures to bind them bidirectionally.

Since I would like to retain the dev_id, I think this is probably
better done with an allocated struct per dev-id:

 struct dev_id {
     struct iommufd_device *idev;
     struct iommufd_viommu *viommu;
     u64 vdev_id;
     u64 driver_private; // Ie the driver can store the pSID here
     struct list_head idev_entry;
 };

> @@ -135,7 +135,16 @@ void iommufd_device_destroy(struct iommufd_object *obj)
>  {
>      struct iommufd_device *idev =
>          container_of(obj, struct iommufd_device, obj);
> +    struct iommufd_viommu *viommu;
> +    unsigned long index;
>
> +    xa_for_each(&idev->viommus, index, viommu) {
> +        if (viommu->ops->unset_dev_id)
> +            viommu->ops->unset_dev_id(viommu, idev->dev);
> +        xa_erase(&viommu->idevs, idev->obj.id);
> +        xa_erase(&idev->viommus, index);
> +    }

Then this turns into list_for_each(idev->viommu_vdevid_list)

> +int iommufd_viommu_set_device_id(struct iommufd_ucmd *ucmd)
> +{
> +    struct iommu_viommu_set_dev_id *cmd = ucmd->cmd;
> +    unsigned int dev_id, viommu_id;
> +    struct iommufd_viommu *viommu;
> +    struct iommufd_device *idev;
> +    int rc;
> +
> +    idev = iommufd_get_device(ucmd, cmd->dev_id);
> +    if (IS_ERR(idev))
> +        return PTR_ERR(idev);
> +    dev_id = idev->obj.id;
> +
> +    viommu = iommufd_get_viommu(ucmd, cmd->viommu_id);
> +    if (IS_ERR(viommu)) {
> +        rc = PTR_ERR(viommu);
> +        goto out_put_idev;
> +    }
> +
> +    if (viommu->iommu_dev != idev->dev->iommu->iommu_dev) {
> +        rc = -EINVAL;
> +        goto out_put_viommu;
> +    }
> +
> +    if (!viommu->ops->set_dev_id || !viommu->ops->unset_dev_id) {
> +        rc = -EOPNOTSUPP;
> +        goto out_put_viommu;
> +    }
> +
> +    rc = xa_alloc(&idev->viommus, &viommu_id, viommu,
> +                  XA_LIMIT(viommu->obj.id, viommu->obj.id),
> +                  GFP_KERNEL_ACCOUNT);
> +    if (rc)
> +        goto out_put_viommu;
> +
> +    rc = xa_alloc(&viommu->idevs, &dev_id, idev,
> +                  XA_LIMIT(dev_id, dev_id), GFP_KERNEL_ACCOUNT);
> +    if (rc)
> +        goto out_xa_erase_viommu;

Both of these are API mis-uses, you don't want an allocating xarray,
you just want to use xa_cmpxchg.

Put an xarray in the viommu object and fill it with pointers to the
struct dev_id thing above.

The driver can piggyback on this xarray too if it wants, so we only
need one.

xa_cmpxchg to install the user-requested vdev_id only if that vdev_id
is still unused.

> @@ -51,6 +51,7 @@ enum {
>      IOMMUFD_CMD_HWPT_GET_DIRTY_BITMAP,
>      IOMMUFD_CMD_HWPT_INVALIDATE,
>      IOMMUFD_CMD_VIOMMU_ALLOC,
> +    IOMMUFD_CMD_VIOMMU_SET_DEV_ID,
>  };

We almost certainly will need a REMOVE_DEV_ID so userspace can have
sensible error handling.

> +
> +/**
> + * struct iommu_viommu_set_dev_id - ioctl(IOMMU_VIOMMU_SET_DEV_ID)
> + * @size: sizeof(struct iommu_viommu_set_dev_id)
> + * @viommu_id: viommu ID to associate with the device to store its virtual ID
> + * @dev_id: device ID to set a device virtual ID
> + * @__reserved: Must be 0
> + * @id: Device virtual ID
> + *
> + * Set a viommu-specific virtual ID of a device
> + */
> +struct iommu_viommu_set_dev_id {
> +    __u32 size;
> +    __u32 viommu_id;
> +    __u32 dev_id;
> +    __u32 __reserved;
> +    __aligned_u64 id;

I think I'd consistently call this vdev_id throughout the API

Jason
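A minimal sketch of the xa_cmpxchg pattern suggested above, assuming a
vdev_ids xarray in the viommu object and the struct dev_id proposed in
this mail; neither the field name nor the helper is existing iommufd code:

    /*
     * Sketch only: install a user-requested vdev_id into an assumed
     * viommu->vdev_ids xarray, succeeding only if that index is unused.
     * "struct dev_id" is the structure proposed in this mail, not a
     * merged kernel type.
     */
    static int viommu_install_vdev_id(struct iommufd_viommu *viommu,
                                      struct dev_id *did,
                                      unsigned long vdev_id)
    {
        void *old;

        old = xa_cmpxchg(&viommu->vdev_ids, vdev_id, NULL, did,
                         GFP_KERNEL_ACCOUNT);
        if (xa_is_err(old))
            return xa_err(old);
        if (old)
            return -EBUSY;    /* that vdev_id is already in use */
        did->vdev_id = vdev_id;
        return 0;
    }

Unlike xa_alloc(), which picks the index itself, xa_cmpxchg() lets
userspace supply the index and only has to reject collisions, which is
what the ioctl needs here.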
On Sun, May 12, 2024 at 11:58:27AM -0300, Jason Gunthorpe wrote: > On Fri, Apr 12, 2024 at 08:47:05PM -0700, Nicolin Chen wrote: > > Introduce a new ioctl to set a per-viommu device virtual id that should be > > linked to the physical device id (or just a struct device pointer). > > > > Since a viommu (user space IOMMU instance) can have multiple devices while > > it's not ideal to confine a device to one single user space IOMMU instance > > either, these two shouldn't just do a 1:1 mapping. Add two xarrays in their > > structures to bind them bidirectionally. > > Since I would like to retain the dev_id, I think this is probably > better done with an allocated struct per dev-id: > > struct dev_id { > struct iommufd_device *idev; > struct iommufd_viommu *viommu; > u64 vdev_id; > u64 driver_private; // Ie the driver can store the pSID here > struct list_head idev_entry; > }; Oh, I needed a better solution to store the HW slot number of a VINTF configuration that links a pSID and a vSID, which I hacked into struct arm_smmu_stream for now. So, with this struct dev_id, likely I can tie it to this structure. For a driver, pSID is known with a device pointer, while likely no use to the iommufd core? Also, I will address the other comments in your reply. Thanks Nicolin
On Sun, May 12, 2024 at 11:58:27AM -0300, Jason Gunthorpe wrote:
> On Fri, Apr 12, 2024 at 08:47:05PM -0700, Nicolin Chen wrote:
> > Introduce a new ioctl to set a per-viommu device virtual id that should be
> > linked to the physical device id (or just a struct device pointer).
> >
> > Since a viommu (user space IOMMU instance) can have multiple devices while
> > it's not ideal to confine a device to one single user space IOMMU instance
> > either, these two shouldn't just do a 1:1 mapping. Add two xarrays in their
> > structures to bind them bidirectionally.
>
> Since I would like to retain the dev_id, I think this is probably
> better done with an allocated struct per dev-id:
>
>  struct dev_id {
>      struct iommufd_device *idev;
>      struct iommufd_viommu *viommu;
>      u64 vdev_id;
>      u64 driver_private; // Ie the driver can store the pSID here
>      struct list_head idev_entry;
>  };

I implemented it with a small tweak, to align with viommu_alloc
and vqueue_alloc:

    // core
    struct iommufd_vdev_id {
        struct iommufd_viommu *viommu;
        struct device *dev;
        u64 vdev_id;
        struct list_head idev_item;
    };

    // driver
    struct my_driver_vdev_id {
        struct iommufd_vdev_id core;
        unsigned int private_attrs;
    };

    static struct iommufd_vdev_id *
    my_driver_set_vdev_id(struct iommufd_viommu *viommu,
                          struct device *dev, u64 id)
    {
        struct my_driver_vdev_id *my_vdev_id;

        my_vdev_id = kzalloc(sizeof(*my_vdev_id), GFP_KERNEL);
        .... /* set private_attrs */
        return &my_vdev_id->core;
    }

    static void my_driver_unset_vdev_id(struct iommufd_vdev_id *vdev_id)
    {
        struct my_driver_vdev_id *my_vdev_id =
            container_of(vdev_id, struct my_driver_vdev_id, core);
        .... /* unset private_attrs */
    }

Please let me know if you like it inverted as you wrote above.

> > @@ -135,7 +135,16 @@ void iommufd_device_destroy(struct iommufd_object *obj)
> > {
> >     struct iommufd_device *idev =
> >         container_of(obj, struct iommufd_device, obj);
> > +   struct iommufd_viommu *viommu;
> > +   unsigned long index;
> >
> > +   xa_for_each(&idev->viommus, index, viommu) {
> > +       if (viommu->ops->unset_dev_id)
> > +           viommu->ops->unset_dev_id(viommu, idev->dev);
> > +       xa_erase(&viommu->idevs, idev->obj.id);
> > +       xa_erase(&idev->viommus, index);
> > +   }
>
> Then this turns into list_for_each(idev->viommu_vdevid_list)

Done.

> > +int iommufd_viommu_set_device_id(struct iommufd_ucmd *ucmd)
> > +{
...
> > +   rc = xa_alloc(&idev->viommus, &viommu_id, viommu,
> > +                 XA_LIMIT(viommu->obj.id, viommu->obj.id),
> > +                 GFP_KERNEL_ACCOUNT);
> > +   if (rc)
> > +       goto out_put_viommu;
> > +
> > +   rc = xa_alloc(&viommu->idevs, &dev_id, idev,
> > +                 XA_LIMIT(dev_id, dev_id), GFP_KERNEL_ACCOUNT);
> > +   if (rc)
> > +       goto out_xa_erase_viommu;
>
> Both of these are API mis-uses, you don't want an allocating xarray
> you just want to use xa_cmpxchg
>
> Put an xarray in the viommu object and fill it with pointers to the
> struct dev_id thing above
>
> The driver can piggyback on this xarray too if it wants, so we only
> need one.

I also moved xa_cmpxchg to the driver, as I feel that the vdev_id
here is very driver/iommu specific. There can be some complication
if the iommufd core handles this u64 vdev_id: most likely we will use
this u64 vdev_id to index the xarray that takes an unsigned-long
xa_index for a fast vdev_id-to-pdev_id lookup, while only a driver
knows whether the u64 vdev_id is compatible with unsigned long or not.

And, we have a list_head in the structure idev, so a device unbind
will for-each the list and unset all the vdev_ids in it, meanwhile
the viommu stays. I wonder if we need to add another list_head in
the structure viommu, so a viommu tear down will for-each its list
and unset all the vdev_ids on its side while a device (idev) stays.
I don't see a use case of that though.. any thought?

Thanks
Nicolin
On Thu, May 16, 2024 at 10:14:38PM -0700, Nicolin Chen wrote:
> I implemented it with a small tweak, to align with viommu_alloc
> and vqueue_alloc:
>
>     // core
>     struct iommufd_vdev_id {
>         struct iommufd_viommu *viommu;
>         struct device *dev;
>         u64 vdev_id;
>         struct list_head idev_item;
>     };
>
>     // driver
>     struct my_driver_vdev_id {
>         struct iommufd_vdev_id core;
>         unsigned int private_attrs;
>     };
>
>     static struct iommufd_vdev_id *
>     my_driver_set_vdev_id(struct iommufd_viommu *viommu,
>                           struct device *dev, u64 id)
>     {
>         struct my_driver_vdev_id *my_vdev_id;
>
>         my_vdev_id = kzalloc(sizeof(*my_vdev_id), GFP_KERNEL);
>         .... /* set private_attrs */
>         return &my_vdev_id->core;
>     }
>
>     static void my_driver_unset_vdev_id(struct iommufd_vdev_id *vdev_id)
>     {
>         struct my_driver_vdev_id *my_vdev_id =
>             container_of(vdev_id, struct my_driver_vdev_id, core);
>         .... /* unset private_attrs */
>     }
>
> Please let me know if you like it inverted as you wrote above.

Seems like a reasonable direction

> I also moved xa_cmpxchg to the driver, as I feel that the vdev_id
> here is very driver/iommu specific. There can be some complication
> if the iommufd core handles this u64 vdev_id: most likely we will use
> this u64 vdev_id to index the xarray that takes an unsigned-long
> xa_index for a fast vdev_id-to-pdev_id lookup, while only a driver
> knows whether the u64 vdev_id is compatible with unsigned long or not.

This seems like the most general part.. The core code should just have
a check like:

    if (vdevid >= ULONG_MAX) return -EINVAL;

And if someone wants to make 32 bit kernels support a 64bit vdevid
then they can sort it out someday :) I think this is not a big issue
as all iommus seem to have some kind of radix lookup for vdevid and
want it to be small.

Matthew has been talking about support for 64bit indexes in
maple/xarray or something for a bit so it might sort itself out.

> And, we have a list_head in the structure idev, so a device unbind
> will for-each the list and unset all the vdev_ids in it, meanwhile
> the viommu stays. I wonder if we need to add another list_head in
> the structure viommu, so a viommu tear down will for-each its list
> and unset all the vdev_ids on its side while a device (idev) stays.
> I don't see a use case of that though.. any thought?

I think you need to support viommu teardown, at least from a
correctness perspective. The API permits it. But isn't it just
list_del everything in the xarray and that will clean it up enough?

Jason
On Tue, May 21, 2024 at 03:30:11PM -0300, Jason Gunthorpe wrote: > On Thu, May 16, 2024 at 10:14:38PM -0700, Nicolin Chen wrote: > > I also moved xa_cmpxchg to the driver, as I feel that the vdev_id > > here is very driver/iommu specific. There can be some complication > > if iommufd core handles this u64 vdev_id: most likely we will use > > this u64 vdev_id to index the xarray that takes an unsigned-long > > xa_index for a fast vdev_id-to-pdev_id lookup, while only a driver > > knows whether u64 vdev_id is compatible with unsigned long or not. > > This seems like the most general part.. The core code should just have > a check like: > > if (vdevid >= ULONG_MAX) return -EINVAL; Ack. > And if someone wants to make 32 bit kernels support a 64bit vdevid > then they can sort it out someday :) I think this is not a big issue > as all iommus seem to have some kind of radix lookup for vdevid and > want it to be small. > > Matthew has been talking about support for 64bit indexes in > maple/xarray or something for a bit so it might sort itself out. OK. In that case, the core can do the xarray maintenance. And then.. > > And, we have a list_head in the structure idev, so a device unbind > > will for-each the list and unset all the vdev_ids in it, meanwhile > > the viommu stays. I wonder if we need to add another list_head in > > the structure viommu, so a viommu tear down will for-each its list > > and unset all the vdev_ids on its side while a device (idev) stays. > > I don't see a use case of that though..any thought? > > I think you need to support viommu teardown, at least from a > correctness perspective. The API permits it. But isn't it just > list_del everything in the xarray and that will clean it up enough? ... viommu tear down can xa_for_each to call unset_vdev_id(). Thanks Nicolin
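A minimal sketch of the teardown direction agreed on here, assuming the
viommu keeps an xarray of the per-vdev_id structures from the earlier
mails; the field and op names (vdev_ids, idev_item, ->unset_vdev_id)
follow this thread's sketches and are not a merged API:

    /* Sketch only: tear down a viommu by undoing every installed vdev_id. */
    static void iommufd_viommu_destroy_vdev_ids(struct iommufd_viommu *viommu)
    {
        struct iommufd_vdev_id *vdev_id;
        unsigned long index;

        xa_for_each(&viommu->vdev_ids, index, vdev_id) {
            list_del(&vdev_id->idev_item);        /* unlink from the idev's list */
            viommu->ops->unset_vdev_id(vdev_id);  /* driver frees its container */
            xa_erase(&viommu->vdev_ids, index);
        }
    }

The mirror image would run at device unbind: walk the idev's list and
erase each entry from the owning viommu's xarray while the viommu stays.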
> From: Nicolin Chen <nicolinc@nvidia.com> > Sent: Saturday, April 13, 2024 11:47 AM > > Introduce a new ioctl to set a per-viommu device virtual id that should be > linked to the physical device id (or just a struct device pointer). > > Since a viommu (user space IOMMU instance) can have multiple devices this is true... > while > it's not ideal to confine a device to one single user space IOMMU instance > either, these two shouldn't just do a 1:1 mapping. Add two xarrays in ...but why would one device link to multiple viommu instances? Or is it referring to Tegra194 as arm-smmu-nvidia.c tries to support? In any case a real example might be useful as this may not be a given knowledge to all folks. btw there is a check in the following code: + if (viommu->iommu_dev != idev->dev->iommu->iommu_dev) { + rc = -EINVAL; + goto out_put_viommu; + } I vaguely remember an earlier discussion about multiple vSMMU instances following the physical SMMU topology, but don't quite recall the exact reason. What is the actual technical obstacle prohibiting one to put multiple VCMDQ's from different SMMU's into one vIOMMU instance?
On Thu, May 23, 2024 at 06:42:56AM +0000, Tian, Kevin wrote:
> > From: Nicolin Chen <nicolinc@nvidia.com>
> > Sent: Saturday, April 13, 2024 11:47 AM
> >
> > Introduce a new ioctl to set a per-viommu device virtual id that should be
> > linked to the physical device id (or just a struct device pointer).
> >
> > Since a viommu (user space IOMMU instance) can have multiple devices
>
> this is true...
>
> > while
> > it's not ideal to confine a device to one single user space IOMMU instance
> > either, these two shouldn't just do a 1:1 mapping. Add two xarrays in
>
> ...but why would one device link to multiple viommu instances?

That's a suggestion from Jason, IIRC, to avoid limiting a device
to a single viommu, though I can't find the source at this moment.

Jason, would you mind shedding some light here?

> Or is it referring to Tegra194 as arm-smmu-nvidia.c tries to support?

Not actually. That is an SMMUv2 driver, which is not in our plan for
virtualization at this moment. And that driver is essentially a
different "compatible" string for a unique SMMUv2 implementation.

> btw there is a check in the following code:
>
> +    if (viommu->iommu_dev != idev->dev->iommu->iommu_dev) {
> +        rc = -EINVAL;
> +        goto out_put_viommu;
> +    }
>
> I vaguely remember an earlier discussion about multiple vSMMU instances
> following the physical SMMU topology, but don't quite recall the exact
> reason.
>
> What is the actual technical obstacle prohibiting one to put multiple
> VCMDQ's from different SMMU's into one vIOMMU instance?

Because a VCMDQ passthrough involves a direct mmap of a HW MMIO
page to the guest-level MMIO region. The MMIO page provides the
read/write of the queue's head and tail indexes.

With a single pSMMU and a single vSMMU, it's simply a 1:1 mapping.

With multiple pSMMUs and a single vSMMU, the single vSMMU would see
one guest-level MMIO region backed by multiple physical pages.
Since we are talking about MMIO, technically it cannot select the
corresponding MMIO page for a given device, not to mention we don't
really want the VMM to be involved, i.e. no VM Exits, when using
VCMDQ.

So, there must be some kind of multi-instanced carriers to hold
those MMIO pages, by attaching devices behind different pSMMUs to
the corresponding carriers. And today we have VIOMMU as the carrier.

One step back, even without the VCMDQ feature, a multi-pSMMU setup
will have multiple viommus (with our latest design) being added
to a viommu list of a single vSMMU's. Yet, the vSMMU in this case
always traps the regular SMMU CMDQ, so it can do viommu selection
or even broadcast (if it has to).

Thanks
Nicolin
> From: Nicolin Chen <nicolinc@nvidia.com> > Sent: Friday, May 24, 2024 1:40 PM > > On Thu, May 23, 2024 at 06:42:56AM +0000, Tian, Kevin wrote: > > btw there is a check in the following code: > > > > + if (viommu->iommu_dev != idev->dev->iommu->iommu_dev) { > > + rc = -EINVAL; > > + goto out_put_viommu; > > + } > > > > I vaguely remember an earlier discussion about multiple vSMMU instances > > following the physical SMMU topology, but don't quite recall the exact > > reason. > > > > What is the actual technical obstacle prohibiting one to put multiple > > VCMDQ's from different SMMU's into one vIOMMU instance? > > Because a VCMDQ passthrough involves a direct mmap of a HW MMIO > page to the guest-level MMIO region. The MMIO page provides the > read/write of queue's head and tail indexes. > > With a single pSMMU and a single vSMMU, it's simply 1:1 mapping. > > With a multi-pSMMU and a single vSMMU, the single vSMMU will see > one guest-level MMIO region backed by multiple physical pages. > Since we are talking about MMIO, technically it cannot select the > corresponding MMIO page to the device, not to mention we don't > really want VMM to involve, i.e. no VM exist, when using VCMDQ. can a vSMMU report to support multiple CMDQ's then there are several MMIO regions each mapped to a different backend VCMDQ? but I guess even if it's supported there is still a problem describing the association between assigned devices and the CMDQ's of the single vIOMMU instance. On bare metal a feature of multiple CMDQ is more for load-balancing so there won't be a fixed association between CMDQ/device. But the fixed association exists if we mixes multiple VCMDQ's in a single vIOMMU instance then there may lack of a way in spec to describe it. > > So, there must be some kind of multi-instanced carriers to hold > those MMIO pages, by attaching devices behind different pSMMUs to > corresponding carriers. And today we have VIOMMU as the carrier. > > One step back, even without VCMDQ feature, a multi-pSMMU setup > will have multiple viommus (with our latest design) being added > to a viommu list of a single vSMMU's. Yet, vSMMU in this case > always traps regular SMMU CMDQ, so it can do viommu selection > or even broadcast (if it has to). > I'm curious to learn the real reason of that design. Is it because you want to do certain load-balance between viommu's or due to other reasons in the kernel smmuv3 driver which e.g. cannot support a viommu spanning multiple pSMMU?
On Fri, May 24, 2024 at 07:13:23AM +0000, Tian, Kevin wrote: > > From: Nicolin Chen <nicolinc@nvidia.com> > > Sent: Friday, May 24, 2024 1:40 PM > > > > On Thu, May 23, 2024 at 06:42:56AM +0000, Tian, Kevin wrote: > > > btw there is a check in the following code: > > > > > > + if (viommu->iommu_dev != idev->dev->iommu->iommu_dev) { > > > + rc = -EINVAL; > > > + goto out_put_viommu; > > > + } > > > > > > I vaguely remember an earlier discussion about multiple vSMMU instances > > > following the physical SMMU topology, but don't quite recall the exact > > > reason. > > > > > > What is the actual technical obstacle prohibiting one to put multiple > > > VCMDQ's from different SMMU's into one vIOMMU instance? > > > > Because a VCMDQ passthrough involves a direct mmap of a HW MMIO > > page to the guest-level MMIO region. The MMIO page provides the > > read/write of queue's head and tail indexes. > > > > With a single pSMMU and a single vSMMU, it's simply 1:1 mapping. > > > > With a multi-pSMMU and a single vSMMU, the single vSMMU will see > > one guest-level MMIO region backed by multiple physical pages. > > Since we are talking about MMIO, technically it cannot select the > > corresponding MMIO page to the device, not to mention we don't > > really want VMM to involve, i.e. no VM exist, when using VCMDQ. > > can a vSMMU report to support multiple CMDQ's then there are > several MMIO regions each mapped to a different backend VCMDQ? This is something I want to support in the API, regardless vCMDQ uses it or not.. > but I guess even if it's supported there is still a problem describing > the association between assigned devices and the CMDQ's of the > single vIOMMU instance. CMDQ should be linked to the VIOMMU instance only and not per-device, I think that is a very important property of the system. This means if there are multiple pSMMU instances then we need a matching set of vSMMU instances so that the devices are grouped on the proper vSMMU instance so they are compatible with the vCMDQ. > I'm curious to learn the real reason of that design. Is it because you > want to do certain load-balance between viommu's or due to other > reasons in the kernel smmuv3 driver which e.g. cannot support a > viommu spanning multiple pSMMU? Yeah, there is no concept of support for a SMMUv3 instance where it's command Q's can only work on a subset of devices. My expectation was that VIOMMU would be 1:1 with physical iommu instances, I think AMD needs this too?? Jason
On Thu, May 23, 2024 at 10:40:07PM -0700, Nicolin Chen wrote: > On Thu, May 23, 2024 at 06:42:56AM +0000, Tian, Kevin wrote: > > > From: Nicolin Chen <nicolinc@nvidia.com> > > > Sent: Saturday, April 13, 2024 11:47 AM > > > > > > Introduce a new ioctl to set a per-viommu device virtual id that should be > > > linked to the physical device id (or just a struct device pointer). > > > > > > Since a viommu (user space IOMMU instance) can have multiple devices > > > > this is true... > > > > > while > > > it's not ideal to confine a device to one single user space IOMMU instance > > > either, these two shouldn't just do a 1:1 mapping. Add two xarrays in > > > > ...but why would one device link to multiple viommu instances? > > That's a suggestion from Jason, IIRC, to avoid limiting a device > to a single viommu, though I can't find out the source at this > moment. > > Jason, would you mind shed some light here? We could probably do it either way, it just doesn't seem like there is a strong reason to limit a device to a single viommu even if in practice that is how everyone will use it. Jason
> From: Jason Gunthorpe <jgg@nvidia.com> > Sent: Friday, May 24, 2024 9:19 PM > > On Fri, May 24, 2024 at 07:13:23AM +0000, Tian, Kevin wrote: > > I'm curious to learn the real reason of that design. Is it because you > > want to do certain load-balance between viommu's or due to other > > reasons in the kernel smmuv3 driver which e.g. cannot support a > > viommu spanning multiple pSMMU? > > Yeah, there is no concept of support for a SMMUv3 instance where it's > command Q's can only work on a subset of devices. > > My expectation was that VIOMMU would be 1:1 with physical iommu > instances, I think AMD needs this too?? > Yes this part is clear now regarding to VCMDQ. But Nicoline said: " One step back, even without VCMDQ feature, a multi-pSMMU setup will have multiple viommus (with our latest design) being added to a viommu list of a single vSMMU's. Yet, vSMMU in this case always traps regular SMMU CMDQ, so it can do viommu selection or even broadcast (if it has to). " I don't think there is an arch limitation mandating that?
On Mon, May 27, 2024 at 01:08:43AM +0000, Tian, Kevin wrote: > > From: Jason Gunthorpe <jgg@nvidia.com> > > Sent: Friday, May 24, 2024 9:19 PM > > > > On Fri, May 24, 2024 at 07:13:23AM +0000, Tian, Kevin wrote: > > > I'm curious to learn the real reason of that design. Is it because you > > > want to do certain load-balance between viommu's or due to other > > > reasons in the kernel smmuv3 driver which e.g. cannot support a > > > viommu spanning multiple pSMMU? > > > > Yeah, there is no concept of support for a SMMUv3 instance where it's > > command Q's can only work on a subset of devices. > > > > My expectation was that VIOMMU would be 1:1 with physical iommu > > instances, I think AMD needs this too?? > > > > Yes this part is clear now regarding to VCMDQ. > > But Nicoline said: > > " > One step back, even without VCMDQ feature, a multi-pSMMU setup > will have multiple viommus (with our latest design) being added > to a viommu list of a single vSMMU's. Yet, vSMMU in this case > always traps regular SMMU CMDQ, so it can do viommu selection > or even broadcast (if it has to). > " > > I don't think there is an arch limitation mandating that? What I mean is for regular vSMMU. Without VCMDQ, a regular vSMMU on a multi-pSMMU setup will look like (e.g. three devices behind different SMMUs): |<------ VMM ------->|<------ kernel ------>| |-- viommu0 --|-- pSMMU0 --| vSMMU--|-- viommu1 --|-- pSMMU1 --|--s2_hwpt |-- viommu2 --|-- pSMMU2 --| And device would attach to: |<---- guest ---->|<--- VMM --->|<- kernel ->| |-- dev0 --|-- viommu0 --|-- pSMMU0 --| vSMMU--|-- dev1 --|-- viommu1 --|-- pSMMU1 --| |-- dev2 --|-- viommu2 --|-- pSMMU2 --| When trapping a device cache invalidation: it is straightforward by deciphering the virtual device ID to pick the viommu that the device is attached to. When doing iotlb invalidation, a command may or may not contain an ASID (a domain ID, and nested domain in this case): a) if a command doesn't have an ASID, VMM needs to broadcast the command to all viommus (i.e. pSMMUs) b) if a command has an ASID, VMM needs to initially maintain an S1 HWPT list by linking an ASID when adding an HWPT entry to the list, by deciphering vSTE and its linked CD. Then it needs to go through the S1 list with the ASID in the command, and to find all corresponding HWPTs to issue/broadcast the command. Thanks Nicolin
On Mon, May 27, 2024 at 01:08:43AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Friday, May 24, 2024 9:19 PM
> >
> > On Fri, May 24, 2024 at 07:13:23AM +0000, Tian, Kevin wrote:
> > > I'm curious to learn the real reason of that design. Is it because you
> > > want to do certain load-balance between viommu's or due to other
> > > reasons in the kernel smmuv3 driver which e.g. cannot support a
> > > viommu spanning multiple pSMMU?
> >
> > Yeah, there is no concept of support for a SMMUv3 instance where it's
> > command Q's can only work on a subset of devices.
> >
> > My expectation was that VIOMMU would be 1:1 with physical iommu
> > instances, I think AMD needs this too??
> >
>
> Yes this part is clear now regarding to VCMDQ.
>
> But Nicoline said:
>
> "
> One step back, even without VCMDQ feature, a multi-pSMMU setup
> will have multiple viommus (with our latest design) being added
> to a viommu list of a single vSMMU's. Yet, vSMMU in this case
> always traps regular SMMU CMDQ, so it can do viommu selection
> or even broadcast (if it has to).
> "
>
> I don't think there is an arch limitation mandating that?

What I mean is for a regular nested SMMU case. Without VCMDQ, a
regular vSMMU on a multi-pSMMU setup will look like (e.g. three
devices behind different SMMUs):

          |<---- guest ---->|<------- VMM ------->|<- kernel ->|
          |---- dev0 ----|-- viommu0 (s2_hwpt0) --|-- pSMMU0 --|
   vSMMU--|---- dev1 ----|-- viommu1 (s2_hwpt0) --|-- pSMMU1 --|
          |---- dev2 ----|-- viommu2 (s2_hwpt0) --|-- pSMMU2 --|

When trapping a device cache invalidation: it is straightforward
by deciphering the virtual device ID to pick the viommu that the
device is attached to. So, only one pSMMU would receive the user
invalidation request.

When doing an iotlb invalidation, a command may or may not contain
an ASID (a domain ID, a nested domain in this case):
 a) if a command doesn't have an ASID, the VMM needs to broadcast
    the command to all viommus (i.e. pSMMUs)
 b) if a command has an ASID, the VMM needs to initially maintain
    an S1 HWPT list by linking an ASID when adding an HWPT entry
    to the list, by deciphering the vSTE and its linked CD. Then
    it needs to go through the S1 list with the ASID in the
    command, and find all corresponding HWPTs to issue/broadcast
    the command.

Thanks
Nicolin
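To make the a)/b) dispatch rules above concrete, here is a small
self-contained sketch of the per-command decision a VMM-side dispatcher
would make. The opcode names follow SMMUv3 commands mentioned in this
thread, but the enum, structures, and forward_*() helpers are locally
defined illustrations, not QEMU or kernel API:

    /* Sketch only: VMM-side dispatch of one trapped guest CMDQ command. */
    #include <stdint.h>
    #include <stdio.h>

    enum vcmd_op {
        CMD_CFGI_CD, CMD_ATC_INV,            /* carry a vRID */
        CMD_TLBI_NH_ASID, CMD_TLBI_NH_VA,    /* carry an ASID */
        CMD_TLBI_NH_VAA,                     /* address only, no ASID */
    };

    struct vcmd { enum vcmd_op op; uint32_t vrid; uint32_t asid; };

    /* Stand-ins for issuing IOMMU_VIOMMU_INVALIDATE against kernel objects */
    static void forward_to_viommu_of_vrid(uint32_t vrid, struct vcmd *c)
    { printf("-> the one viommu/pSMMU owning vRID %u\n", vrid); }
    static void forward_to_hwpts_with_asid(uint32_t asid, struct vcmd *c)
    { printf("-> every S1 HWPT recorded with ASID %u\n", asid); }
    static void broadcast_to_all_viommus(struct vcmd *c)
    { printf("-> all viommus (one per pSMMU)\n"); }

    static void dispatch(struct vcmd *c)
    {
        switch (c->op) {
        case CMD_CFGI_CD:
        case CMD_ATC_INV:
            /* Device cache: the vRID identifies exactly one viommu */
            forward_to_viommu_of_vrid(c->vrid, c);
            break;
        case CMD_TLBI_NH_ASID:
        case CMD_TLBI_NH_VA:
            /* Case b): ASID present, narrow to the HWPTs using it */
            forward_to_hwpts_with_asid(c->asid, c);
            break;
        default:
            /* Case a), e.g. CMD_TLBI_NH_VAA: no ASID, must broadcast */
            broadcast_to_all_viommus(c);
            break;
        }
    }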
On Tue, May 28, 2024 at 01:22:46PM -0700, Nicolin Chen wrote: > On Mon, May 27, 2024 at 01:08:43AM +0000, Tian, Kevin wrote: > > > From: Jason Gunthorpe <jgg@nvidia.com> > > > Sent: Friday, May 24, 2024 9:19 PM > > > > > > On Fri, May 24, 2024 at 07:13:23AM +0000, Tian, Kevin wrote: > > > > I'm curious to learn the real reason of that design. Is it because you > > > > want to do certain load-balance between viommu's or due to other > > > > reasons in the kernel smmuv3 driver which e.g. cannot support a > > > > viommu spanning multiple pSMMU? > > > > > > Yeah, there is no concept of support for a SMMUv3 instance where it's > > > command Q's can only work on a subset of devices. > > > > > > My expectation was that VIOMMU would be 1:1 with physical iommu > > > instances, I think AMD needs this too?? > > > > > > > Yes this part is clear now regarding to VCMDQ. > > > > But Nicoline said: > > > > " > > One step back, even without VCMDQ feature, a multi-pSMMU setup > > will have multiple viommus (with our latest design) being added > > to a viommu list of a single vSMMU's. Yet, vSMMU in this case > > always traps regular SMMU CMDQ, so it can do viommu selection > > or even broadcast (if it has to). > > " > > > > I don't think there is an arch limitation mandating that? > > What I mean is for regular vSMMU. Without VCMDQ, a regular vSMMU > on a multi-pSMMU setup will look like (e.g. three devices behind > different SMMUs): > |<------ VMM ------->|<------ kernel ------>| > |-- viommu0 --|-- pSMMU0 --| > vSMMU--|-- viommu1 --|-- pSMMU1 --|--s2_hwpt > |-- viommu2 --|-- pSMMU2 --| > > And device would attach to: > |<---- guest ---->|<--- VMM --->|<- kernel ->| > |-- dev0 --|-- viommu0 --|-- pSMMU0 --| > vSMMU--|-- dev1 --|-- viommu1 --|-- pSMMU1 --| > |-- dev2 --|-- viommu2 --|-- pSMMU2 --| I accidentally sent a duplicated one.. Please ignore this reply and check the other one. Thanks!
> From: Nicolin Chen <nicolinc@nvidia.com> > Sent: Wednesday, May 29, 2024 4:23 AM > > On Mon, May 27, 2024 at 01:08:43AM +0000, Tian, Kevin wrote: > > > From: Jason Gunthorpe <jgg@nvidia.com> > > > Sent: Friday, May 24, 2024 9:19 PM > > > > > > On Fri, May 24, 2024 at 07:13:23AM +0000, Tian, Kevin wrote: > > > > I'm curious to learn the real reason of that design. Is it because you > > > > want to do certain load-balance between viommu's or due to other > > > > reasons in the kernel smmuv3 driver which e.g. cannot support a > > > > viommu spanning multiple pSMMU? > > > > > > Yeah, there is no concept of support for a SMMUv3 instance where it's > > > command Q's can only work on a subset of devices. > > > > > > My expectation was that VIOMMU would be 1:1 with physical iommu > > > instances, I think AMD needs this too?? > > > > > > > Yes this part is clear now regarding to VCMDQ. > > > > But Nicoline said: > > > > " > > One step back, even without VCMDQ feature, a multi-pSMMU setup > > will have multiple viommus (with our latest design) being added > > to a viommu list of a single vSMMU's. Yet, vSMMU in this case > > always traps regular SMMU CMDQ, so it can do viommu selection > > or even broadcast (if it has to). > > " > > > > I don't think there is an arch limitation mandating that? > > What I mean is for regular vSMMU. Without VCMDQ, a regular vSMMU > on a multi-pSMMU setup will look like (e.g. three devices behind > different SMMUs): > |<------ VMM ------->|<------ kernel ------>| > |-- viommu0 --|-- pSMMU0 --| > vSMMU--|-- viommu1 --|-- pSMMU1 --|--s2_hwpt > |-- viommu2 --|-- pSMMU2 --| > > And device would attach to: > |<---- guest ---->|<--- VMM --->|<- kernel ->| > |-- dev0 --|-- viommu0 --|-- pSMMU0 --| > vSMMU--|-- dev1 --|-- viommu1 --|-- pSMMU1 --| > |-- dev2 --|-- viommu2 --|-- pSMMU2 --| > > When trapping a device cache invalidation: it is straightforward > by deciphering the virtual device ID to pick the viommu that the > device is attached to. I understand how above works. My question is why that option is chosen instead of going with 1:1 mapping between vSMMU and viommu i.e. letting the kernel to figure out which pSMMU should be sent an invalidation cmd to, as how VT-d is virtualized. I want to know whether doing so is simply to be compatible with what VCMDQ requires, or due to another untold reason. > > When doing iotlb invalidation, a command may or may not contain > an ASID (a domain ID, and nested domain in this case): > a) if a command doesn't have an ASID, VMM needs to broadcast the > command to all viommus (i.e. pSMMUs) > b) if a command has an ASID, VMM needs to initially maintain an > S1 HWPT list by linking an ASID when adding an HWPT entry to > the list, by deciphering vSTE and its linked CD. Then it needs > to go through the S1 list with the ASID in the command, and to > find all corresponding HWPTs to issue/broadcast the command. > > Thanks > Nicolin
On Wed, May 29, 2024 at 02:58:11AM +0000, Tian, Kevin wrote:
> > From: Nicolin Chen <nicolinc@nvidia.com>
> > Sent: Wednesday, May 29, 2024 4:23 AM
> >
> > What I mean is for regular vSMMU. Without VCMDQ, a regular vSMMU
> > on a multi-pSMMU setup will look like (e.g. three devices behind
> > different SMMUs):
> >
> >         |<------ VMM ------->|<------ kernel ------>|
> >         |-- viommu0 --|-- pSMMU0 --|
> >  vSMMU--|-- viommu1 --|-- pSMMU1 --|--s2_hwpt
> >         |-- viommu2 --|-- pSMMU2 --|
> >
> > And device would attach to:
> >
> >         |<---- guest ---->|<--- VMM --->|<- kernel ->|
> >         |-- dev0 --|-- viommu0 --|-- pSMMU0 --|
> >  vSMMU--|-- dev1 --|-- viommu1 --|-- pSMMU1 --|
> >         |-- dev2 --|-- viommu2 --|-- pSMMU2 --|
> >
> > When trapping a device cache invalidation: it is straightforward
> > by deciphering the virtual device ID to pick the viommu that the
> > device is attached to.
>
> I understand how above works.
>
> My question is why that option is chosen instead of going with 1:1
> mapping between vSMMU and viommu i.e. letting the kernel to
> figure out which pSMMU should be sent an invalidation cmd to, as
> how VT-d is virtualized.
>
> I want to know whether doing so is simply to be compatible with
> what VCMDQ requires, or due to another untold reason.

Because we use viommu as a VMID holder for SMMU. So a pSMMU must
have its own viommu to store its VMID for a shared s2_hwpt:

        |-- viommu0 (VMIDx) --|-- pSMMU0 --|
 vSMMU--|-- viommu1 (VMIDy) --|-- pSMMU1 --|--s2_hwpt
        |-- viommu2 (VMIDz) --|-- pSMMU2 --|

Thanks
Nicolin
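A minimal sketch of what "viommu as a VMID holder" could mean in driver
terms, assuming a driver-specific container around the core viommu
object in the style of the earlier my_driver_vdev_id sketch; the names
below are illustrative, not the actual arm-smmu-v3 structures:

    /*
     * Sketch only: each viommu wraps one pSMMU instance and owns a VMID
     * allocated from that pSMMU's VMID space, so a single S2 HWPT can
     * be shared across pSMMUs while every pSMMU keeps its own VMID.
     */
    struct my_smmu_viommu {
        struct iommufd_viommu core;
        struct arm_smmu_device *smmu;   /* backing physical SMMU instance */
        u16 vmid;                       /* from this pSMMU's VMID allocator */
    };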
> From: Nicolin Chen <nicolinc@nvidia.com> > Sent: Wednesday, May 29, 2024 11:21 AM > > On Wed, May 29, 2024 at 02:58:11AM +0000, Tian, Kevin wrote: > > > From: Nicolin Chen <nicolinc@nvidia.com> > > > Sent: Wednesday, May 29, 2024 4:23 AM > > > > What I mean is for regular vSMMU. Without VCMDQ, a regular vSMMU > > > on a multi-pSMMU setup will look like (e.g. three devices behind > > > different SMMUs): > > > |<------ VMM ------->|<------ kernel ------>| > > > |-- viommu0 --|-- pSMMU0 --| > > > vSMMU--|-- viommu1 --|-- pSMMU1 --|--s2_hwpt > > > |-- viommu2 --|-- pSMMU2 --| > > > > > > And device would attach to: > > > |<---- guest ---->|<--- VMM --->|<- kernel ->| > > > |-- dev0 --|-- viommu0 --|-- pSMMU0 --| > > > vSMMU--|-- dev1 --|-- viommu1 --|-- pSMMU1 --| > > > |-- dev2 --|-- viommu2 --|-- pSMMU2 --| > > > > > > When trapping a device cache invalidation: it is straightforward > > > by deciphering the virtual device ID to pick the viommu that the > > > device is attached to. > > > > I understand how above works. > > > > My question is why that option is chosen instead of going with 1:1 > > mapping between vSMMU and viommu i.e. letting the kernel to > > figure out which pSMMU should be sent an invalidation cmd to, as > > how VT-d is virtualized. > > > > I want to know whether doing so is simply to be compatible with > > what VCMDQ requires, or due to another untold reason. > > Because we use viommu as a VMID holder for SMMU. So a pSMMU must > have its own viommu to store its VMID for a shared s2_hwpt: > |-- viommu0 (VMIDx) --|-- pSMMU0 --| > vSMMU--|-- viommu1 (VMIDy) --|-- pSMMU1 --|--s2_hwpt > |-- viommu2 (VMIDz) --|-- pSMMU2 --| > there are other options, e.g. you can have one viommu holding multiple VMIDs each associating to a pSMMU. so it's really an implementation choice that you want a same viommu topology scheme w/ or w/o VCMDQ. we all know there are some constraints of copying the physical topology, e.g. hotplugging a device or migration. for VCMDQ it's clearly an acceptable tradeoff for performance. w/o VCMDQ I'm not sure whether you want to keep more flexibility for the user. 😊
On Thu, May 30, 2024 at 12:28:43AM +0000, Tian, Kevin wrote: > > From: Nicolin Chen <nicolinc@nvidia.com> > > Sent: Wednesday, May 29, 2024 11:21 AM > > On Wed, May 29, 2024 at 02:58:11AM +0000, Tian, Kevin wrote: > > > My question is why that option is chosen instead of going with 1:1 > > > mapping between vSMMU and viommu i.e. letting the kernel to > > > figure out which pSMMU should be sent an invalidation cmd to, as > > > how VT-d is virtualized. > > > > > > I want to know whether doing so is simply to be compatible with > > > what VCMDQ requires, or due to another untold reason. > > > > Because we use viommu as a VMID holder for SMMU. So a pSMMU must > > have its own viommu to store its VMID for a shared s2_hwpt: > > |-- viommu0 (VMIDx) --|-- pSMMU0 --| > > vSMMU--|-- viommu1 (VMIDy) --|-- pSMMU1 --|--s2_hwpt > > |-- viommu2 (VMIDz) --|-- pSMMU2 --| > > > > there are other options, e.g. you can have one viommu holding multiple > VMIDs each associating to a pSMMU. Well, possibly. But everything previously in a viommu would have to be a list (for number of VMIDs) in the kernel level: not only a VMID list, but also a 2D virtual ID lists, something like: struct xarray vdev_ids[num_of_vmid]; // per-IOMMU vID to pID lookup And a driver in this case would be difficult to get a complete concept of a viommu object since it's core object and shared by all kernel-level IOMMU instances. If a driver wants to extend a viommu object for some additional feature, e.g. VINTF in this series, it would likely have to create another per-driver object and again another list of this kind of objects in struct viommu. At the end of day, we have to duplicate it one way or another for the amount of physical IOMMUs. And it seems to cleaner by doing it with multiple viommu objects. > so it's really an implementation choice that you want a same viommu > topology scheme w/ or w/o VCMDQ. > > we all know there are some constraints of copying the physical topology, > e.g. hotplugging a device or migration. for VCMDQ it's clearly an > acceptable tradeoff for performance. w/o VCMDQ I'm not sure whether > you want to keep more flexibility for the user. 😊 Oh. With regular nested SMMU, there is only one virtual SMMU in the guest VM. No need of copying physical topology. Just the VMM needs to allocate three viommus to add them to a list of its own. Thanks Nicolin
> From: Nicolin Chen <nicolinc@nvidia.com> > Sent: Thursday, May 30, 2024 8:59 AM > > On Thu, May 30, 2024 at 12:28:43AM +0000, Tian, Kevin wrote: > > > From: Nicolin Chen <nicolinc@nvidia.com> > > > Sent: Wednesday, May 29, 2024 11:21 AM > > > On Wed, May 29, 2024 at 02:58:11AM +0000, Tian, Kevin wrote: > > > > My question is why that option is chosen instead of going with 1:1 > > > > mapping between vSMMU and viommu i.e. letting the kernel to > > > > figure out which pSMMU should be sent an invalidation cmd to, as > > > > how VT-d is virtualized. > > > > > > > > I want to know whether doing so is simply to be compatible with > > > > what VCMDQ requires, or due to another untold reason. > > > > > > Because we use viommu as a VMID holder for SMMU. So a pSMMU must > > > have its own viommu to store its VMID for a shared s2_hwpt: > > > |-- viommu0 (VMIDx) --|-- pSMMU0 --| > > > vSMMU--|-- viommu1 (VMIDy) --|-- pSMMU1 --|--s2_hwpt > > > |-- viommu2 (VMIDz) --|-- pSMMU2 --| > > > > > > > there are other options, e.g. you can have one viommu holding multiple > > VMIDs each associating to a pSMMU. > > Well, possibly. But everything previously in a viommu would have > to be a list (for number of VMIDs) in the kernel level: not only > a VMID list, but also a 2D virtual ID lists, something like: > > struct xarray vdev_ids[num_of_vmid]; // per-IOMMU vID to pID lookup ah does it mean that dev ID space on ARM is local to SMMU? It's not the case on x86 platforms where the RID is platform-wise. > > And a driver in this case would be difficult to get a complete > concept of a viommu object since it's core object and shared by > all kernel-level IOMMU instances. If a driver wants to extend a > viommu object for some additional feature, e.g. VINTF in this > series, it would likely have to create another per-driver object > and again another list of this kind of objects in struct viommu. > > At the end of day, we have to duplicate it one way or another for > the amount of physical IOMMUs. And it seems to cleaner by doing > it with multiple viommu objects. > > > so it's really an implementation choice that you want a same viommu > > topology scheme w/ or w/o VCMDQ. > > > > we all know there are some constraints of copying the physical topology, > > e.g. hotplugging a device or migration. for VCMDQ it's clearly an > > acceptable tradeoff for performance. w/o VCMDQ I'm not sure whether > > you want to keep more flexibility for the user. 😊 > > Oh. With regular nested SMMU, there is only one virtual SMMU in > the guest VM. No need of copying physical topology. Just the VMM > needs to allocate three viommus to add them to a list of its own. > Okay I missed this point. Then the number of viommus is really about the contract between the vmm and the kernel, which is invisible to the guest or the admin who launches the Qemu. but wait, isn't there another problem - does the VMM have the permission to enumerate the topology of physical SMMUs? Or probably the VMM only needs to know the number of relevant SMMUs for assigned devices and such info can be indirectly composed by extending GET_HW_INFO...
On Thu, May 30, 2024 at 03:05:05AM +0000, Tian, Kevin wrote:
> > From: Nicolin Chen <nicolinc@nvidia.com>
> > Sent: Thursday, May 30, 2024 8:59 AM
> > On Thu, May 30, 2024 at 12:28:43AM +0000, Tian, Kevin wrote:
> > > > From: Nicolin Chen <nicolinc@nvidia.com>
> > > > Sent: Wednesday, May 29, 2024 11:21 AM
> > > > On Wed, May 29, 2024 at 02:58:11AM +0000, Tian, Kevin wrote:
> > > > > My question is why that option is chosen instead of going with 1:1
> > > > > mapping between vSMMU and viommu i.e. letting the kernel to
> > > > > figure out which pSMMU should be sent an invalidation cmd to, as
> > > > > how VT-d is virtualized.
> > > > >
> > > > > I want to know whether doing so is simply to be compatible with
> > > > > what VCMDQ requires, or due to another untold reason.
> > > >
> > > > Because we use viommu as a VMID holder for SMMU. So a pSMMU must
> > > > have its own viommu to store its VMID for a shared s2_hwpt:
> > > >
> > > >         |-- viommu0 (VMIDx) --|-- pSMMU0 --|
> > > >  vSMMU--|-- viommu1 (VMIDy) --|-- pSMMU1 --|--s2_hwpt
> > > >         |-- viommu2 (VMIDz) --|-- pSMMU2 --|
> > > >
> > >
> > > there are other options, e.g. you can have one viommu holding multiple
> > > VMIDs each associating to a pSMMU.
> >
> > Well, possibly. But everything previously in a viommu would have
> > to be a list (for number of VMIDs) in the kernel level: not only
> > a VMID list, but also a 2D virtual ID lists, something like:
> >
> >     struct xarray vdev_ids[num_of_vmid]; // per-IOMMU vID to pID lookup
>
> ah does it mean that dev ID space on ARM is local to SMMU?
>
> It's not the case on x86 platforms where the RID is platform-wise.

Actually I had a second thought after I replied. Yea, we only support
PCI devices at this moment, so their RIDs come from BDF numbers, and
then should be platform-wise as you pointed out:

           |<-------VMM-------->|<--------------- kernel ------------>|
 vRID_a --|                        |-- VMIDx / SMMU0 -- pRID_A
 vRID_b --| => vSMMU => viommu =>  |-- VMIDy / SMMU1 -- pRID_B
 vRID_c --|                        |-- VMIDz / SMMU2 -- pRID_C

 # x/y/z can be identical, while a/b/c and A/B/C must be unique.

So likely a single lookup list can be enough. That still can't avoid
the list of per-pIOMMU objects for some driver feature though. So, I
think having a per-pIOMMU viommu also adds a bit of flexibility for
the kernel. Overall, indeed an implementation choice, as you
mentioned :)

> > And a driver in this case would be difficult to get a complete
> > concept of a viommu object since it's core object and shared by
> > all kernel-level IOMMU instances. If a driver wants to extend a
> > viommu object for some additional feature, e.g. VINTF in this
> > series, it would likely have to create another per-driver object
> > and again another list of this kind of objects in struct viommu.
> >
> > At the end of day, we have to duplicate it one way or another for
> > the amount of physical IOMMUs. And it seems to cleaner by doing
> > it with multiple viommu objects.
> >
> > > so it's really an implementation choice that you want a same viommu
> > > topology scheme w/ or w/o VCMDQ.
> > >
> > > we all know there are some constraints of copying the physical topology,
> > > e.g. hotplugging a device or migration. for VCMDQ it's clearly an
> > > acceptable tradeoff for performance. w/o VCMDQ I'm not sure whether
> > > you want to keep more flexibility for the user. 😊
> >
> > Oh. With regular nested SMMU, there is only one virtual SMMU in
> > the guest VM. No need of copying physical topology. Just the VMM
> > needs to allocate three viommus to add them to a list of its own.
>
> Okay I missed this point. Then the number of viommus is really about
> the contract between the vmm and the kernel, which is invisible to
> the guest or the admin who launches the Qemu.

Yes. Everything should be behind the scenes, since the VMM can trap
and select, unlike VCMDQ doing direct MMIO to the HW without a chance
of VM Exits.

> but wait, isn't there another problem - does the VMM have the
> permission to enumerate the topology of physical SMMUs? Or probably
> the VMM only needs to know the number of relevant SMMUs for
> assigned devices and such info can be indirectly composed by
> extending GET_HW_INFO...

I think the VMM already has some kind of permission to read the number
of IOMMUs from "/sys/class/iommu"?

That being said, in a regular nesting case w/o VCMDQ, there is no need
to instantiate as many vSMMUs as pSMMUs unless at least one device
behind each pSMMU attaches. So, the current implementation allocates
an s2_hwpt/viommu only on demand, when the first device from a
physical SMMU attaches, meaning there'll be only one viommu in the
list until the first device from another pSMMU attaches. And the
actual attach logic always tries attaching a device to the existing
viommus from the list, and only allocates a new viommu (or s2_hwpt)
when all of them fail. FWIW, the host kernel/driver has full control
over these attaches.

Thanks
Nicolin
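A hedged illustration of the "try existing viommus first, allocate
lazily" attach flow described above, from the VMM side. Every type and
helper here (struct vsmmu, try_attach(), alloc_viommu(), MAX_PIOMMUS)
is an assumption about VMM-internal bookkeeping, not an existing QEMU
or iommufd interface:

    /* Sketch only: on-demand per-pSMMU viommu allocation in the VMM. */
    #define MAX_PIOMMUS 8   /* arbitrary bound for the sketch */

    struct viommu;                                  /* opaque handle */
    int try_attach(struct viommu *viommu, int dev_id);   /* fails if pSMMU differs */
    struct viommu *alloc_viommu(int dev_id);        /* new viommu (+ s2_hwpt) */

    struct vsmmu {
        struct viommu *viommus[MAX_PIOMMUS];
        int nr_viommus;
    };

    static int vsmmu_attach_device(struct vsmmu *s, int dev_id)
    {
        int i;

        /* Reuse any viommu already allocated for a matching pSMMU */
        for (i = 0; i < s->nr_viommus; i++)
            if (try_attach(s->viommus[i], dev_id) == 0)
                return 0;

        /* First device behind a new pSMMU: allocate a viommu on demand */
        s->viommus[s->nr_viommus] = alloc_viommu(dev_id);
        if (!s->viommus[s->nr_viommus])
            return -1;
        return try_attach(s->viommus[s->nr_viommus++], dev_id);
    }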
On Wed, May 29, 2024 at 05:58:39PM -0700, Nicolin Chen wrote: > On Thu, May 30, 2024 at 12:28:43AM +0000, Tian, Kevin wrote: > > > From: Nicolin Chen <nicolinc@nvidia.com> > > > Sent: Wednesday, May 29, 2024 11:21 AM > > > On Wed, May 29, 2024 at 02:58:11AM +0000, Tian, Kevin wrote: > > > > My question is why that option is chosen instead of going with 1:1 > > > > mapping between vSMMU and viommu i.e. letting the kernel to > > > > figure out which pSMMU should be sent an invalidation cmd to, as > > > > how VT-d is virtualized. > > > > > > > > I want to know whether doing so is simply to be compatible with > > > > what VCMDQ requires, or due to another untold reason. > > > > > > Because we use viommu as a VMID holder for SMMU. So a pSMMU must > > > have its own viommu to store its VMID for a shared s2_hwpt: > > > |-- viommu0 (VMIDx) --|-- pSMMU0 --| > > > vSMMU--|-- viommu1 (VMIDy) --|-- pSMMU1 --|--s2_hwpt > > > |-- viommu2 (VMIDz) --|-- pSMMU2 --| > > > > > > > there are other options, e.g. you can have one viommu holding multiple > > VMIDs each associating to a pSMMU. > > Well, possibly. But everything previously in a viommu would have > to be a list (for number of VMIDs) in the kernel level: not only > a VMID list, but also a 2D virtual ID lists, something like: > > struct xarray vdev_ids[num_of_vmid]; // per-IOMMU vID to pID lookup I feel it makes most sense that ARM (and maybe everyone) just have a viommu per piommu. The main argument against is we haven't made it efficient for the VMM to support multiple piommus. It has to do a system call per piommu each time it processes the cmdq. But, on the other hand, if you care about invalidation efficiency it is kind of silly not to expose the piommus to the guest so that the invalidation scope can be appropriately narrowed. Replicating all ASID invalidations to all piommus doesn't make alot of sense if the guest can know that only one piommu actually needs invalidation. > And a driver in this case would be difficult to get a complete > concept of a viommu object since it's core object and shared by > all kernel-level IOMMU instances. If a driver wants to extend a > viommu object for some additional feature, e.g. VINTF in this > series, it would likely have to create another per-driver object > and again another list of this kind of objects in struct viommu. Right, we need some kind of per-piommu object because we have per-piommu data. > Oh. With regular nested SMMU, there is only one virtual SMMU in > the guest VM. No need of copying physical topology. Just the VMM > needs to allocate three viommus to add them to a list of its own. I understand the appeal of doing this has been to minimize qemu changes in its ACPI parts, if we tackle that instead maybe we should just not implement viommu to multiple piommu. It is somewhat complicated. Jason
On Sat, Jun 01, 2024 at 06:45:01PM -0300, Jason Gunthorpe wrote: > On Wed, May 29, 2024 at 05:58:39PM -0700, Nicolin Chen wrote: > > On Thu, May 30, 2024 at 12:28:43AM +0000, Tian, Kevin wrote: > > > > From: Nicolin Chen <nicolinc@nvidia.com> > > > > Sent: Wednesday, May 29, 2024 11:21 AM > > > > On Wed, May 29, 2024 at 02:58:11AM +0000, Tian, Kevin wrote: > > > > > My question is why that option is chosen instead of going with 1:1 > > > > > mapping between vSMMU and viommu i.e. letting the kernel to > > > > > figure out which pSMMU should be sent an invalidation cmd to, as > > > > > how VT-d is virtualized. > > > > > > > > > > I want to know whether doing so is simply to be compatible with > > > > > what VCMDQ requires, or due to another untold reason. > > > > > > > > Because we use viommu as a VMID holder for SMMU. So a pSMMU must > > > > have its own viommu to store its VMID for a shared s2_hwpt: > > > > |-- viommu0 (VMIDx) --|-- pSMMU0 --| > > > > vSMMU--|-- viommu1 (VMIDy) --|-- pSMMU1 --|--s2_hwpt > > > > |-- viommu2 (VMIDz) --|-- pSMMU2 --| > > > > > > > > > > there are other options, e.g. you can have one viommu holding multiple > > > VMIDs each associating to a pSMMU. > > > > Well, possibly. But everything previously in a viommu would have > > to be a list (for number of VMIDs) in the kernel level: not only > > a VMID list, but also a 2D virtual ID lists, something like: > > > > struct xarray vdev_ids[num_of_vmid]; // per-IOMMU vID to pID lookup > > I feel it makes most sense that ARM (and maybe everyone) just have a > viommu per piommu. > > The main argument against is we haven't made it efficient for the VMM > to support multiple piommus. It has to do a system call per piommu > each time it processes the cmdq. > > But, on the other hand, if you care about invalidation efficiency it > is kind of silly not to expose the piommus to the guest so that the > invalidation scope can be appropriately narrowed. Replicating all ASID > invalidations to all piommus doesn't make alot of sense if the guest > can know that only one piommu actually needs invalidation. Yea, that'd be pretty slow, though a broadcast would be still inevitable when an invalidation only has an address range w/o an ASID, CMD_TLBI_NH_VAA for example. In fact, there should always be a dispatcher (v.s. broadcast): - in the one-viommu-per-pIOMMU case (#1), it's in the VMM - in the one-viommu-per-vIOMMU case (#2), it's in the kernel One of them has to take the role to burn some CPUs for-eaching the hwpt list to identify the iommu to forward. The design #1, simply makes the kernel easier. The design #2, on the other hand, would not only require some lists and new objects that we just discussed, yet also pair of VIOMMU_SET/UNSET_HWPT_ID ioctls, though it also makes sense as we choose IOMMU_VIOMMU_INVALIDATE over IOMMU_DEV_INVALIDATE by adding VIOMMU_SET/UNSET_VDEV_ID? > > And a driver in this case would be difficult to get a complete > > concept of a viommu object since it's core object and shared by > > all kernel-level IOMMU instances. If a driver wants to extend a > > viommu object for some additional feature, e.g. VINTF in this > > series, it would likely have to create another per-driver object > > and again another list of this kind of objects in struct viommu. > > Right, we need some kind of per-piommu object because we have > per-piommu data. > > > Oh. With regular nested SMMU, there is only one virtual SMMU in > > the guest VM. No need of copying physical topology. 
Just the VMM > > needs to allocate three viommus to add them to a list of its own. > > I understand the appeal of doing this has been to minimize qemu > changes in its ACPI parts if we tackle that instead maybe we should > just not implement viommu to multiple piommu. It is somewhat > complicated. Would you please clarify that suggestion "not implement viommu to multiple piommu"? For regular nesting (SMMU), we are still doing one vSMMU in the VMM, though VCMDQ case would be an exception.... Thanks Nicolin
On Sun, Jun 02, 2024 at 08:25:34PM -0700, Nicolin Chen wrote: > > I understand the appeal of doing this has been to minimize qemu > > changes in its ACPI parts if we tackle that instead maybe we should > > just not implement viommu to multiple piommu. It is somewhat > > complicated. > > Would you please clarify that suggestion "not implement viommu > to multiple piommu"? > > For regular nesting (SMMU), we are still doing one vSMMU in the > VMM, though VCMDQ case would be an exception.... This is what I mean, always do multiple vSMMU if there are multiple physical pSMMUs. Don't replicate any virtual commands across pSMMUs. Jason
On Thu, Jun 06, 2024 at 03:24:23PM -0300, Jason Gunthorpe wrote: > On Sun, Jun 02, 2024 at 08:25:34PM -0700, Nicolin Chen wrote: > > > > I understand the appeal of doing this has been to minimize qemu > > > changes in its ACPI parts if we tackle that instead maybe we should > > > just not implement viommu to multiple piommu. It is somewhat > > > complicated. > > > > Would you please clarify that suggestion "not implement viommu > > to multiple piommu"? > > > > For regular nesting (SMMU), we are still doing one vSMMU in the > > VMM, though VCMDQ case would be an exception.... > > This is what I mean, always do multiple vSMMU if there are multiple > physical pSMMUs. Don't replicate any virtual commands across pSMMUs. Thanks for clarifying. That also means you'd prefer putting the command dispatcher in VMM, which is what we have at this moment. Nicolin
On Thu, Jun 06, 2024 at 11:44:58AM -0700, Nicolin Chen wrote: > On Thu, Jun 06, 2024 at 03:24:23PM -0300, Jason Gunthorpe wrote: > > On Sun, Jun 02, 2024 at 08:25:34PM -0700, Nicolin Chen wrote: > > > > > > I understand the appeal of doing this has been to minimize qemu > > > > changes in its ACPI parts if we tackle that instead maybe we should > > > > just not implement viommu to multiple piommu. It is somewhat > > > > complicated. > > > > > > Would you please clarify that suggestion "not implement viommu > > > to multiple piommu"? > > > > > > For regular nesting (SMMU), we are still doing one vSMMU in the > > > VMM, though VCMDQ case would be an exception.... > > > > This is what I mean, always do multiple vSMMU if there are multiple > > physical pSMMUs. Don't replicate any virtual commands across pSMMUs. > > Thanks for clarifying. That also means you'd prefer putting the > command dispatcher in VMM, which is what we have at this moment. Unless someone knows a reason why we should strive hard to have only a single vSMMU and accept some invalidation inefficiency? Jason
> From: Jason Gunthorpe <jgg@nvidia.com> > Sent: Friday, June 7, 2024 8:27 AM > > On Thu, Jun 06, 2024 at 11:44:58AM -0700, Nicolin Chen wrote: > > On Thu, Jun 06, 2024 at 03:24:23PM -0300, Jason Gunthorpe wrote: > > > On Sun, Jun 02, 2024 at 08:25:34PM -0700, Nicolin Chen wrote: > > > > > > > > I understand the appeal of doing this has been to minimize qemu > > > > > changes in its ACPI parts if we tackle that instead maybe we should > > > > > just not implement viommu to multiple piommu. It is somewhat > > > > > complicated. > > > > > > > > Would you please clarify that suggestion "not implement viommu > > > > to multiple piommu"? > > > > > > > > For regular nesting (SMMU), we are still doing one vSMMU in the > > > > VMM, though VCMDQ case would be an exception.... > > > > > > This is what I mean, always do multiple vSMMU if there are multiple > > > physical pSMMUs. Don't replicate any virtual commands across pSMMUs. > > > > Thanks for clarifying. That also means you'd prefer putting the > > command dispatcher in VMM, which is what we have at this moment. > > Unless someone knows a reason why we should strive hard to have only a > single vSMMU and accept some invalidation inefficiency? > migration? a single vSMMU provides better compatibility between src/dest...
On Fri, Jun 07, 2024 at 12:36:46AM +0000, Tian, Kevin wrote: > > From: Jason Gunthorpe <jgg@nvidia.com> > > Sent: Friday, June 7, 2024 8:27 AM > > > > On Thu, Jun 06, 2024 at 11:44:58AM -0700, Nicolin Chen wrote: > > > On Thu, Jun 06, 2024 at 03:24:23PM -0300, Jason Gunthorpe wrote: > > > > On Sun, Jun 02, 2024 at 08:25:34PM -0700, Nicolin Chen wrote: > > > > > > > > > > I understand the appeal of doing this has been to minimize qemu > > > > > > changes in its ACPI parts if we tackle that instead maybe we should > > > > > > just not implement viommu to multiple piommu. It is somewhat > > > > > > complicated. > > > > > > > > > > Would you please clarify that suggestion "not implement viommu > > > > > to multiple piommu"? > > > > > > > > > > For regular nesting (SMMU), we are still doing one vSMMU in the > > > > > VMM, though VCMDQ case would be an exception.... > > > > > > > > This is what I mean, always do multiple vSMMU if there are multiple > > > > physical pSMMUs. Don't replicate any virtual commands across pSMMUs. > > > > > > Thanks for clarifying. That also means you'd prefer putting the > > > command dispatcher in VMM, which is what we have at this moment. > > > > Unless someone knows a reason why we should strive hard to have only a > > single vSMMU and accept some invalidation inefficiency? > > > > migration? a single vSMMU provides better compatibility between > src/dest... Maybe, though I think we can safely split a single pSMMU into multiple vSMMUs using the IOMMUFD vIOMMU interface. So if your machine model has two vSMMUs and your physical HW has less we can still make that work. IOTLB efficiency will suffer though when splitting 1p -> 2v while invalidation performance will suffer when joining 2p -> 1v. Jason
On Fri, Jun 07, 2024 at 11:49:17AM -0300, Jason Gunthorpe wrote: > On Fri, Jun 07, 2024 at 12:36:46AM +0000, Tian, Kevin wrote: > > > From: Jason Gunthorpe <jgg@nvidia.com> > > > Sent: Friday, June 7, 2024 8:27 AM > > > > > > On Thu, Jun 06, 2024 at 11:44:58AM -0700, Nicolin Chen wrote: > > > > On Thu, Jun 06, 2024 at 03:24:23PM -0300, Jason Gunthorpe wrote: > > > > > On Sun, Jun 02, 2024 at 08:25:34PM -0700, Nicolin Chen wrote: > > > > > > > > > > > > I understand the appeal of doing this has been to minimize qemu > > > > > > > changes in its ACPI parts if we tackle that instead maybe we should > > > > > > > just not implement viommu to multiple piommu. It is somewhat > > > > > > > complicated. > > > > > > > > > > > > Would you please clarify that suggestion "not implement viommu > > > > > > to multiple piommu"? > > > > > > > > > > > > For regular nesting (SMMU), we are still doing one vSMMU in the > > > > > > VMM, though VCMDQ case would be an exception.... > > > > > > > > > > This is what I mean, always do multiple vSMMU if there are multiple > > > > > physical pSMMUs. Don't replicate any virtual commands across pSMMUs. > > > > > > > > Thanks for clarifying. That also means you'd prefer putting the > > > > command dispatcher in VMM, which is what we have at this moment. > > > > > > Unless someone knows a reason why we should strive hard to have only a > > > single vSMMU and accept some invalidation inefficiency? > > > > > > > migration? a single vSMMU provides better compatibility between > > src/dest... > > Maybe, though I think we can safely split a single pSMMU into multiple > vSMMUs using the IOMMUFD vIOMMU interface. So if your machine model > has two vSMMUs and your physical HW has less we can still make that > work. > > IOTLB efficiency will suffer though when splitting 1p -> 2v while > invalidation performance will suffer when joining 2p -> 1v. I think the invalidation efficiency is actually solvable. So, basically viommu_invalidate would receive a whole batch of cmds and dispatch them to different pSMMUs (nested_domains/devices). We already have a vdev_id table for devices, yet we just need a new vasid table for nested_domains. Right? The immediate benefit is that VMMs won't need to duplicate each other's dispatcher pieces, and likely helps migrations as Kevin pointed out. With that being said, it would make the kernel design a bit more complicated. And the VMM still has to separate the commands for passthrough devices (HW iotlb) from commands for emulated devices (emulated iotlb), unless we further split the topology at the VM level to have a dedicated vSMMU for all passthrough devices -- then VMM could just forward its entire cmdq to the kernel without deciphering every command (likely?). Thanks Nicolin
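A minimal sketch of what such a kernel-side dispatcher could look like, assuming a per-viommu "vasid table" (an xarray, say viommu->vasids, mapping a guest ASID to the pSMMU currently using it); every helper named here (cmd_has_asid(), cmd_get_asid(), piommu_issue_cmd(), viommu_broadcast_cmd()) and the vasids field are assumptions, not existing code:

static int viommu_dispatch_invalidations(struct iommufd_viommu *viommu,
					 u64 *cmds, unsigned int n)
{
	unsigned int i;

	for (i = 0; i < n; i++) {
		u64 *cmd = &cmds[i * 2];	/* each SMMUv3 command is two 64-bit words */
		struct piommu *piommu;

		if (!cmd_has_asid(cmd)) {
			/* e.g. CMD_TLBI_NH_VAA: no ASID, broadcast is unavoidable */
			viommu_broadcast_cmd(viommu, cmd);
			continue;
		}

		/* vASID -> pSMMU lookup via the proposed vasid table */
		piommu = xa_load(&viommu->vasids, cmd_get_asid(cmd));
		if (!piommu)
			return -EINVAL;
		piommu_issue_cmd(piommu, cmd);
	}
	return 0;
}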
On Fri, Jun 07, 2024 at 02:19:21PM -0700, Nicolin Chen wrote: > > IOTLB efficiency will suffer though when splitting 1p -> 2v while > > invalidation performance will suffer when joining 2p -> 1v. > > I think the invalidation efficiency is actually solvable. So, > basically viommu_invalidate would receive a whole batch of cmds > and dispatch them to different pSMMUs (nested_domains/devices). > We already have a vdev_id table for devices, yet we just need a > new vasid table for nested_domains. Right? You can't know the ASID usage of the hypervisor from the VM, unless you also inspect the CD table memory in the guest. That seems like something we should try hard to avoid. > With that being said, it would make the kernel design a bit more > complicated. And the VMM still has to separate the commands for > passthrough devices (HW iotlb) from commands for emulated devices > (emulated iotlb), unless we further split the topology at the VM > level to have a dedicated vSMMU for all passthrough devices -- > then VMM could just forward its entire cmdq to the kernel without > deciphering every command (likely?). I would not include the emulated devices in a shared SMMU.. For the same reason, we should try hard to avoid inspecting the page table memory. If a viommu is needed for emulated then virtio-iommu may be more appropriate.. That said I'm sure someone will want to do this, so as long as it is possible in the VMM, as slow as it may be, then it is fine. Jason
On Mon, Jun 10, 2024 at 09:04:46AM -0300, Jason Gunthorpe wrote: > On Fri, Jun 07, 2024 at 02:19:21PM -0700, Nicolin Chen wrote: > > > > IOTLB efficiency will suffer though when splitting 1p -> 2v while > > > invalidation performance will suffer when joining 2p -> 1v. > > > > I think the invalidation efficiency is actually solvable. So, > > basically viommu_invalidate would receive a whole batch of cmds > > and dispatch them to different pSMMUs (nested_domains/devices). > > We already have a vdev_id table for devices, yet we just need a > > new vasid table for nested_domains. Right? > > You can't know the ASID usage of the hypervisor from the VM, unless > you also inspect the CD table memory in the guest. That seems like > something we should try hard to avoid. Actually, even now as we put a dispatcher in VMM, VMM still does decode the CD table to link ASID to s1_hwpt. Otherwise, it could only broadcast a TLBI cmd to all pSMMUs. Doing in the other way by moving it to the kernel, we'd just need a pair of new ioctls and use them when VMM traps CFGI_CD cmds, so kernel driver instead of VMM user driver manages the links between ASIDs to nested domains. Either a master ASID or SVA ASIDs can be linked to the same nested_domain that's allocated per vSTE. > > With that being said, it would make the kernel design a bit more > > complicated. And the VMM still has to separate the commands for > > passthrough devices (HW iotlb) from commands for emulated devices > > (emulated iotlb), unless we further split the topology at the VM > > level to have a dedicated vSMMU for all passthrough devices -- > > then VMM could just forward its entire cmdq to the kernel without > > deciphering every command (likely?). > > I would not include the emulated devices in a shared SMMU.. For the > same reason, we should try hard to avoid inspecting the page table > memory. I wouldn't like the idea of attaching emulated devices to a shared vSMMU. Yet, mind elaborating why this would inspect the page table memory? Or do you mean we should avoid VMM inspecting user tables? > If a viommu is needed for emulated then virtio-iommu may be more > appropriate.. > > That said I'm sure someone will want to do this, so as long as it is > possible in the VMM, as slow as it may be, then it is fine. Eric hasn't replied my previous query regarding how to design this, yet I guess the same. And looks like Intel is doing so for emulated devices, since there is only one intel_iommu instance in a VM. Thanks Nicolin
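If the kernel were to own those links, the "pair of new ioctls" mentioned above might look roughly like the sketch below, issued by the VMM when it traps a CFGI_CD command; this is purely illustrative and not part of the posted series:

struct iommu_viommu_set_vasid {
	__u32 size;
	__u32 viommu_id;
	__u32 hwpt_id;		/* the nested domain allocated per vSTE */
	__u32 __reserved;
	__aligned_u64 vasid;	/* guest ASID taken from the trapped CD */
};
/* plus a matching IOMMU_VIOMMU_UNSET_VASID to drop the link again */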
On Mon, Jun 10, 2024 at 01:01:32PM -0700, Nicolin Chen wrote: > On Mon, Jun 10, 2024 at 09:04:46AM -0300, Jason Gunthorpe wrote: > > On Fri, Jun 07, 2024 at 02:19:21PM -0700, Nicolin Chen wrote: > > > > > > IOTLB efficiency will suffer though when splitting 1p -> 2v while > > > > invalidation performance will suffer when joining 2p -> 1v. > > > > > > I think the invalidation efficiency is actually solvable. So, > > > basically viommu_invalidate would receive a whole batch of cmds > > > and dispatch them to different pSMMUs (nested_domains/devices). > > > We already have a vdev_id table for devices, yet we just need a > > > new vasid table for nested_domains. Right? > > > > You can't know the ASID usage of the hypervisor from the VM, unless > > you also inspect the CD table memory in the guest. That seems like > > something we should try hard to avoid. > > Actually, even now as we put a dispatcher in VMM, VMM still does > decode the CD table to link ASID to s1_hwpt. Otherwise, it could > only broadcast a TLBI cmd to all pSMMUs. No, there should be no CD table decoding and no linking ASID to anything by the VMM. The ARM architecture is clean, the ASID can remain private to the VM, there is no reason for the VMM to understand it. The s1_hwpt is derived only from the vSTE and nothing more. It would be fine for all the devices to have their own s1_hwpts with their own vSTE's inside it. > > > With that being said, it would make the kernel design a bit more > > > complicated. And the VMM still has to separate the commands for > > > passthrough devices (HW iotlb) from commands for emulated devices > > > (emulated iotlb), unless we further split the topology at the VM > > > level to have a dedicated vSMMU for all passthrough devices -- > > > then VMM could just forward its entire cmdq to the kernel without > > > deciphering every command (likely?). > > > > I would not include the emulated devices in a shared SMMU.. For the > > same reason, we should try hard to avoid inspecting the page table > > memory. > > I wouldn't like the idea of attaching emulated devices to a shared > vSMMU. Yet, mind elaborating why this would inspect the page table > memory? Or do you mean we should avoid VMM inspecting user tables? Emulated devices can't use the HW page table walker in the SMMU since they won't get a clean CD linkage they can use. They have to manually walk the page tables and convert them into an IOAS. It is a big PITA, best to be avoided. > > If a viommu is needed for emulated then virtio-iommu may be more > > appropriate.. > > > > That said I'm sure someone will want to do this, so as long as it is > > possible in the VMM, as slow as it may be, then it is fine. > > Eric hasn't replied my previous query regarding how to design this, > yet I guess the same. And looks like Intel is doing so for emulated > devices, since there is only one intel_iommu instance in a VM. Yes, Intel has long has this code to build VFIO containers from page table walks. Jason
On Mon, Jun 10, 2024 at 07:01:10PM -0300, Jason Gunthorpe wrote: > On Mon, Jun 10, 2024 at 01:01:32PM -0700, Nicolin Chen wrote: > > On Mon, Jun 10, 2024 at 09:04:46AM -0300, Jason Gunthorpe wrote: > > > On Fri, Jun 07, 2024 at 02:19:21PM -0700, Nicolin Chen wrote: > > > > > > > > IOTLB efficiency will suffer though when splitting 1p -> 2v while > > > > > invalidation performance will suffer when joining 2p -> 1v. > > > > > > > > I think the invalidation efficiency is actually solvable. So, > > > > basically viommu_invalidate would receive a whole batch of cmds > > > > and dispatch them to different pSMMUs (nested_domains/devices). > > > > We already have a vdev_id table for devices, yet we just need a > > > > new vasid table for nested_domains. Right? > > > > > > You can't know the ASID usage of the hypervisor from the VM, unless > > > you also inspect the CD table memory in the guest. That seems like > > > something we should try hard to avoid. > > > > Actually, even now as we put a dispatcher in VMM, VMM still does > > decode the CD table to link ASID to s1_hwpt. Otherwise, it could > > only broadcast a TLBI cmd to all pSMMUs. > > No, there should be no CD table decoding and no linking ASID to > anything by the VMM. > > The ARM architecture is clean, the ASID can remain private to the VM, > there is no reason for the VMM to understand it. But a guest-level TLBI command usually has only ASID available to know which pSMMU to dispatch the command. Without an ASID lookup table, how could VMM then dispatch a command to the corresponding pSMMU? > The s1_hwpt is derived only from the vSTE and nothing more. It would > be fine for all the devices to have their own s1_hwpts with their own > vSTE's inside it. > > > > > With that being said, it would make the kernel design a bit more > > > > complicated. And the VMM still has to separate the commands for > > > > passthrough devices (HW iotlb) from commands for emulated devices > > > > (emulated iotlb), unless we further split the topology at the VM > > > > level to have a dedicated vSMMU for all passthrough devices -- > > > > then VMM could just forward its entire cmdq to the kernel without > > > > deciphering every command (likely?). > > > > > > I would not include the emulated devices in a shared SMMU.. For the > > > same reason, we should try hard to avoid inspecting the page table > > > memory. > > > > I wouldn't like the idea of attaching emulated devices to a shared > > vSMMU. Yet, mind elaborating why this would inspect the page table > > memory? Or do you mean we should avoid VMM inspecting user tables? > > Emulated devices can't use the HW page table walker in the SMMU since > they won't get a clean CD linkage they can use. They have to manually > walk the page tables and convert them into an IOAS. It is a big PITA, > best to be avoided. Oh..the mainline QEMU smmu code already has that: https://github.com/qemu/qemu/blob/master/hw/arm/smmu-common.c#L522 Surely, it's a pathway that passthrough devices shouldn't run into. Thanks Nicolin
On Mon, Jun 10, 2024 at 04:04:30PM -0700, Nicolin Chen wrote: > > > Actually, even now as we put a dispatcher in VMM, VMM still does > > > decode the CD table to link ASID to s1_hwpt. Otherwise, it could > > > only broadcast a TLBI cmd to all pSMMUs. > > > > No, there should be no CD table decoding and no linking ASID to > > anything by the VMM. > > > > The ARM architecture is clean, the ASID can remain private to the VM, > > there is no reason for the VMM to understand it. > > But a guest-level TLBI command usually has only ASID available to > know which pSMMU to dispatch the command. Without an ASID lookup > table, how could VMM then dispatch a command to the corresponding > pSMMU? It can broadcast. The ARM architecture does not expect a N:1 mapping of SMMUs. This is why I think it is not such a good idea.. Yes the VMM could walk the CD tables too and build up a bitmap of what ASIDs are being used by what pSMMUs, and that would be fine for the VMM to do, but I wouldn't necessarily recommend it :) Jason
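For completeness, the VMM-side "ASID bitmap" idea mentioned above could look something like the sketch below; vdev_id_to_psmmu(), vmm_broadcast_cmd() and vmm_issue_cmd_to_psmmu() are hypothetical helpers, and as noted this is not necessarily the recommended direction:

#include <stdint.h>

#define NR_ASIDS	(1 << 16)

/* one bit per pSMMU, indexed by guest ASID, filled while trapping CFGI_CD */
static uint32_t asid_psmmu_map[NR_ASIDS];

static void vmm_record_cfgi_cd(uint32_t vsid, uint16_t asid)
{
	/* vdev_id_to_psmmu() is an assumed lookup over the VMM's vdev_id records */
	asid_psmmu_map[asid] |= 1u << vdev_id_to_psmmu(vsid);
}

static void vmm_dispatch_tlbi_nh_asid(uint16_t asid, void *cmd)
{
	uint32_t map = asid_psmmu_map[asid];
	int p;

	if (!map) {
		vmm_broadcast_cmd(cmd);			/* unknown ASID: fall back */
		return;
	}
	for (p = 0; map; p++, map >>= 1)
		if (map & 1)
			vmm_issue_cmd_to_psmmu(p, cmd);	/* per-pSMMU invalidation */
}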
On Mon, Jun 10, 2024 at 09:28:39PM -0300, Jason Gunthorpe wrote: > On Mon, Jun 10, 2024 at 04:04:30PM -0700, Nicolin Chen wrote: > > > > > Actually, even now as we put a dispatcher in VMM, VMM still does > > > > decode the CD table to link ASID to s1_hwpt. Otherwise, it could > > > > only broadcast a TLBI cmd to all pSMMUs. > > > > > > No, there should be no CD table decoding and no linking ASID to > > > anything by the VMM. > > > > > > The ARM architecture is clean, the ASID can remain private to the VM, > > > there is no reason for the VMM to understand it. > > > > But a guest-level TLBI command usually has only ASID available to > > know which pSMMU to dispatch the command. Without an ASID lookup > > table, how could VMM then dispatch a command to the corresponding > > pSMMU? > > It can broadcast. The ARM architecture does not expect a N:1 mapping > of SMMUs. This is why I think it is not such a good idea.. Hmm, I thought we had an agreed idea that we shouldn't broadcast a TLBI (except global NH_ALL/VAA) for invalidation performance? > Yes the VMM could walk the CD tables too and build up a bitmap of what > ASIDs are being used by what pSMMUs, and that would be fine for the > VMM to do, but I wouldn't necessarily recommend it :) CD table walkthrough would be always done only by VMM, while the lookup table could be created/maintained by the kernel. I feel a vasid table could make sense since we maintain the vdev_id table in the kernel space too. Anyway, it is still an implementation choice, as Kevin remarked. Thanks Nicolin
On Mon, Jun 10, 2024 at 05:44:16PM -0700, Nicolin Chen wrote: > On Mon, Jun 10, 2024 at 09:28:39PM -0300, Jason Gunthorpe wrote: > > On Mon, Jun 10, 2024 at 04:04:30PM -0700, Nicolin Chen wrote: > > > > > > > Actually, even now as we put a dispatcher in VMM, VMM still does > > > > > decode the CD table to link ASID to s1_hwpt. Otherwise, it could > > > > > only broadcast a TLBI cmd to all pSMMUs. > > > > > > > > No, there should be no CD table decoding and no linking ASID to > > > > anything by the VMM. > > > > > > > > The ARM architecture is clean, the ASID can remain private to the VM, > > > > there is no reason for the VMM to understand it. > > > > > > But a guest-level TLBI command usually has only ASID available to > > > know which pSMMU to dispatch the command. Without an ASID lookup > > > table, how could VMM then dispatch a command to the corresponding > > > pSMMU? > > > > It can broadcast. The ARM architecture does not expect a N:1 mapping > > of SMMUs. This is why I think it is not such a good idea.. > > Hmm, I thought we had an agreed idea that we shouldn't broadcast > a TLBI (except global NH_ALL/VAA) for invalidation performance? I wouldn't say agree, there are just lots of different trade offs to be made here. You can reduce broadcast by parsing the CD table from the VMM. You can reduce broadcast with multiple vSMMUs. VMM needs to pick a design. I favour multiple vSMMUs. > CD table walkthrough would be always done only by VMM, while the > lookup table could be created/maintained by the kernel. I feel a > vasid table could make sense since we maintain the vdev_id table > in the kernel space too. I'm not convinced we should put such a micro optimization in the kernel. If the VMM cares about performance then split the vSMMU, otherwise lets just put all the mess in the VMM and give it the tools to manage the invalidation distribution. In the long run things like vCMDQ are going to force the choice to multi-smmu, which is why I don't see too much value with investing in optimizing the single vSMMU case. The optimization can be done later if someone does have a use case. Jason
On Tue, Jun 11, 2024 at 09:17:56AM -0300, Jason Gunthorpe wrote: > On Mon, Jun 10, 2024 at 05:44:16PM -0700, Nicolin Chen wrote: > > On Mon, Jun 10, 2024 at 09:28:39PM -0300, Jason Gunthorpe wrote: > > > On Mon, Jun 10, 2024 at 04:04:30PM -0700, Nicolin Chen wrote: > > > > > > > > > Actually, even now as we put a dispatcher in VMM, VMM still does > > > > > > decode the CD table to link ASID to s1_hwpt. Otherwise, it could > > > > > > only broadcast a TLBI cmd to all pSMMUs. > > > > > > > > > > No, there should be no CD table decoding and no linking ASID to > > > > > anything by the VMM. > > > > > > > > > > The ARM architecture is clean, the ASID can remain private to the VM, > > > > > there is no reason for the VMM to understand it. > > > > But a guest-level TLBI command usually has only ASID available to > > > > know which pSMMU to dispatch the command. Without an ASID lookup > > > > table, how could VMM then dispatch a command to the corresponding > > > > pSMMU? > > > > > > It can broadcast. The ARM architecture does not expect a N:1 mapping > > > of SMMUs. This is why I think it is not such a good idea.. > > > > Hmm, I thought we had an agreed idea that we shouldn't broadcast > > a TLBI (except global NH_ALL/VAA) for invalidation performance? > > I wouldn't say agree, there are just lots of different trade offs to > be made here. You can reduce broadcast by parsing the CD table from > the VMM. You can reduce broadcast with multiple vSMMUs. > > VMM needs to pick a design. I favour multiple vSMMUs. Yea, having multiple vSMMUs for nesting too seems to be a cleaner design. The thing is that we have to put a certain amount of complexity in the VMM, and it should be more efficient to take it at the boot stage (creating multiple vSMMUs/PCIs and IORT nodes) vs. at runtime (trapping and dispatching every command). > > CD table walkthrough would be always done only by VMM, while the > > lookup table could be created/maintained by the kernel. I feel a > > vasid table could make sense since we maintain the vdev_id table > > in the kernel space too. > > I'm not convinced we should put such a micro optimization in the > kernel. If the VMM cares about performance then split the vSMMU, > otherwise lets just put all the mess in the VMM and give it the tools > to manage the invalidation distribution. > > In the long run things like vCMDQ are going to force the choice to > multi-smmu, which is why I don't see too much value with investing in > optimizing the single vSMMU case. The optimization can be done later > if someone does have a use case. I see. That makes sense to me. Nicolin
diff --git a/drivers/iommu/iommufd/device.c b/drivers/iommu/iommufd/device.c index 873630c111c1..68086f3074b6 100644 --- a/drivers/iommu/iommufd/device.c +++ b/drivers/iommu/iommufd/device.c @@ -135,7 +135,16 @@ void iommufd_device_destroy(struct iommufd_object *obj) { struct iommufd_device *idev = container_of(obj, struct iommufd_device, obj); + struct iommufd_viommu *viommu; + unsigned long index; + xa_for_each(&idev->viommus, index, viommu) { + if (viommu->ops->unset_dev_id) + viommu->ops->unset_dev_id(viommu, idev->dev); + xa_erase(&viommu->idevs, idev->obj.id); + xa_erase(&idev->viommus, index); + } + xa_destroy(&idev->viommus); iommu_device_release_dma_owner(idev->dev); iommufd_put_group(idev->igroup); if (!iommufd_selftest_is_mock_dev(idev->dev)) @@ -216,6 +225,8 @@ struct iommufd_device *iommufd_device_bind(struct iommufd_ctx *ictx, /* igroup refcount moves into iommufd_device */ idev->igroup = igroup; + xa_init_flags(&idev->viommus, XA_FLAGS_ALLOC1); + /* * If the caller fails after this success it must call * iommufd_unbind_device() which is safe since we hold this refcount. diff --git a/drivers/iommu/iommufd/iommufd_private.h b/drivers/iommu/iommufd/iommufd_private.h index ae90b4493109..9ba8f4ecc221 100644 --- a/drivers/iommu/iommufd/iommufd_private.h +++ b/drivers/iommu/iommufd/iommufd_private.h @@ -392,6 +392,7 @@ struct iommufd_device { struct list_head group_item; /* always the physical device */ struct device *dev; + struct xarray viommus; bool enforce_cache_coherency; }; @@ -424,8 +425,17 @@ void iopt_remove_access(struct io_pagetable *iopt, u32 iopt_access_list_id); void iommufd_access_destroy_object(struct iommufd_object *obj); +static inline struct iommufd_viommu * +iommufd_get_viommu(struct iommufd_ucmd *ucmd, u32 id) +{ + return container_of(iommufd_get_object(ucmd->ictx, id, + IOMMUFD_OBJ_VIOMMU), + struct iommufd_viommu, obj); +} + int iommufd_viommu_alloc_ioctl(struct iommufd_ucmd *ucmd); void iommufd_viommu_destroy(struct iommufd_object *obj); +int iommufd_viommu_set_device_id(struct iommufd_ucmd *ucmd); #ifdef CONFIG_IOMMUFD_TEST int iommufd_test(struct iommufd_ucmd *ucmd); diff --git a/drivers/iommu/iommufd/main.c b/drivers/iommu/iommufd/main.c index 9de7e3e63ce4..16efc3346a2a 100644 --- a/drivers/iommu/iommufd/main.c +++ b/drivers/iommu/iommufd/main.c @@ -381,6 +381,8 @@ static const struct iommufd_ioctl_op iommufd_ioctl_ops[] = { __reserved), IOCTL_OP(IOMMU_VIOMMU_ALLOC, iommufd_viommu_alloc_ioctl, struct iommu_viommu_alloc, out_viommu_id), + IOCTL_OP(IOMMU_VIOMMU_SET_DEV_ID, iommufd_viommu_set_device_id, + struct iommu_viommu_set_dev_id, id), #ifdef CONFIG_IOMMUFD_TEST IOCTL_OP(IOMMU_TEST_CMD, iommufd_test, struct iommu_test_cmd, last), #endif diff --git a/drivers/iommu/iommufd/viommu.c b/drivers/iommu/iommufd/viommu.c index 079e0ff79942..71baca0c75de 100644 --- a/drivers/iommu/iommufd/viommu.c +++ b/drivers/iommu/iommufd/viommu.c @@ -69,6 +69,7 @@ int iommufd_viommu_alloc_ioctl(struct iommufd_ucmd *ucmd) rc = PTR_ERR(viommu); goto out_put_hwpt; } + xa_init_flags(&viommu->idevs, XA_FLAGS_ALLOC1); /* iommufd_object_finalize will store the viommu->obj.id */ rc = xa_alloc(&ucmd->ictx->objects, &viommu->obj.id, XA_ZERO_ENTRY, @@ -102,3 +103,60 @@ int iommufd_viommu_alloc_ioctl(struct iommufd_ucmd *ucmd) iommufd_put_object(ucmd->ictx, &idev->obj); return rc; } + +int iommufd_viommu_set_device_id(struct iommufd_ucmd *ucmd) +{ + struct iommu_viommu_set_dev_id *cmd = ucmd->cmd; + unsigned int dev_id, viommu_id; + struct iommufd_viommu *viommu; + struct iommufd_device 
*idev; + int rc; + + idev = iommufd_get_device(ucmd, cmd->dev_id); + if (IS_ERR(idev)) + return PTR_ERR(idev); + dev_id = idev->obj.id; + + viommu = iommufd_get_viommu(ucmd, cmd->viommu_id); + if (IS_ERR(viommu)) { + rc = PTR_ERR(viommu); + goto out_put_idev; + } + + if (viommu->iommu_dev != idev->dev->iommu->iommu_dev) { + rc = -EINVAL; + goto out_put_viommu; + } + + if (!viommu->ops->set_dev_id || !viommu->ops->unset_dev_id) { + rc = -EOPNOTSUPP; + goto out_put_viommu; + } + + rc = xa_alloc(&idev->viommus, &viommu_id, viommu, + XA_LIMIT(viommu->obj.id, viommu->obj.id), + GFP_KERNEL_ACCOUNT); + if (rc) + goto out_put_viommu; + + rc = xa_alloc(&viommu->idevs, &dev_id, idev, + XA_LIMIT(dev_id, dev_id), GFP_KERNEL_ACCOUNT); + if (rc) + goto out_xa_erase_viommu; + + rc = viommu->ops->set_dev_id(viommu, idev->dev, cmd->id); + if (rc) + goto out_xa_erase_idev; + + goto out_put_viommu; + +out_xa_erase_idev: + xa_erase(&viommu->idevs, idev->obj.id); +out_xa_erase_viommu: + xa_erase(&idev->viommus, viommu->obj.id); +out_put_viommu: + iommufd_put_object(ucmd->ictx, &viommu->obj); +out_put_idev: + iommufd_put_object(ucmd->ictx, &idev->obj); + return rc; +} diff --git a/include/linux/iommufd.h b/include/linux/iommufd.h index ca6ac8a1ffd0..2be302b82f47 100644 --- a/include/linux/iommufd.h +++ b/include/linux/iommufd.h @@ -10,6 +10,7 @@ #include <linux/errno.h> #include <linux/err.h> #include <linux/refcount.h> +#include <linux/xarray.h> struct device; struct iommufd_device; @@ -88,6 +89,7 @@ struct iommufd_viommu { const struct iommufd_viommu_ops *ops; unsigned int type; + struct xarray idevs; }; /** diff --git a/include/uapi/linux/iommufd.h b/include/uapi/linux/iommufd.h index 2b0825d69846..eaa192de63d3 100644 --- a/include/uapi/linux/iommufd.h +++ b/include/uapi/linux/iommufd.h @@ -51,6 +51,7 @@ enum { IOMMUFD_CMD_HWPT_GET_DIRTY_BITMAP, IOMMUFD_CMD_HWPT_INVALIDATE, IOMMUFD_CMD_VIOMMU_ALLOC, + IOMMUFD_CMD_VIOMMU_SET_DEV_ID, }; /** @@ -722,4 +723,23 @@ struct iommu_viommu_alloc { __u32 out_viommu_id; }; #define IOMMU_VIOMMU_ALLOC _IO(IOMMUFD_TYPE, IOMMUFD_CMD_VIOMMU_ALLOC) + +/** + * struct iommu_viommu_set_dev_id - ioctl(IOMMU_VIOMMU_SET_DEV_ID) + * @size: sizeof(struct iommu_viommu_set_dev_id) + * @viommu_id: viommu ID to associate with the device to store its virtual ID + * @dev_id: device ID to set a device virtual ID + * @__reserved: Must be 0 + * @id: Device virtual ID + * + * Set a viommu-specific virtual ID of a device + */ +struct iommu_viommu_set_dev_id { + __u32 size; + __u32 viommu_id; + __u32 dev_id; + __u32 __reserved; + __aligned_u64 id; +}; +#define IOMMU_VIOMMU_SET_DEV_ID _IO(IOMMUFD_TYPE, IOMMUFD_CMD_VIOMMU_SET_DEV_ID) #endif
Introduce a new ioctl to set a per-viommu device virtual id that should be linked to the physical device id (or just a struct device pointer). Since a viommu (user space IOMMU instance) can have multiple devices while it's not ideal to confine a device to one single user space IOMMU instance either, these two shouldn't just do a 1:1 mapping. Add two xarrays in their structures to bind them bidirectionally. Signed-off-by: Nicolin Chen <nicolinc@nvidia.com> --- drivers/iommu/iommufd/device.c | 11 +++++ drivers/iommu/iommufd/iommufd_private.h | 10 +++++ drivers/iommu/iommufd/main.c | 2 + drivers/iommu/iommufd/viommu.c | 58 +++++++++++++++++++++++++ include/linux/iommufd.h | 2 + include/uapi/linux/iommufd.h | 20 +++++++++ 6 files changed, 103 insertions(+)
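As a usage sketch of the uAPI added by the patch above: once the VMM has a viommu_id from IOMMU_VIOMMU_ALLOC and a dev_id from binding the device, it could install the guest-visible virtual device ID (for SMMUv3, the vSID) roughly like this. The wrapper name is made up, the header is assumed to come from a kernel with this series applied, and error handling is omitted:

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/iommufd.h>	/* from a kernel with this series applied */

static int viommu_set_vdev_id(int iommufd, uint32_t viommu_id,
			      uint32_t dev_id, uint64_t virt_id)
{
	struct iommu_viommu_set_dev_id cmd = {
		.size = sizeof(cmd),
		.viommu_id = viommu_id,	/* from IOMMU_VIOMMU_ALLOC */
		.dev_id = dev_id,	/* from iommufd device binding */
		.id = virt_id,		/* e.g. the guest-visible vSID */
	};

	return ioctl(iommufd, IOMMU_VIOMMU_SET_DEV_ID, &cmd);
}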