| Message ID | 20240823132137.336874-12-aik@amd.com |
|---|---|
| State | New |
| Series | Secure VFIO, TDISP, SEV TIO |
> diff --git a/virt/kvm/vfio.c b/virt/kvm/vfio.c
> index 76b7f6085dcd..a4e9db212adc 100644
> --- a/virt/kvm/vfio.c
> +++ b/virt/kvm/vfio.c
> [...]
> +static int kvm_vfio_set_device(struct kvm_device *dev, long attr,
> +			       void __user *arg)
> +{
> +	switch (attr) {
> +	case KVM_DEV_VFIO_DEVICE_TDI_BIND:
> +		return kvm_dev_tsm_bind(dev, arg);

I think the TDI bind operation should be under the control of the device
owner (i.e. the VFIO driver), rather than in this bridge driver. TDI bind
transitions the TDI to the CONFIG_LOCKED state, and a number of device
configuration operations break that state (TDISP spec 11.4.5/8/9).

So the VFIO driver should be fully aware of the TDI bind and manage
unwanted breakage.

Thanks,
Yilun
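For concreteness, a minimal userspace sketch of the alternative being
suggested here: a bind owned by VFIO rather than by the KVM-VFIO bridge,
expressed through the existing VFIO_DEVICE_FEATURE ioctl. The
VFIO_DEVICE_FEATURE_TSM_BIND feature number and its payload are
hypothetical, invented for illustration only; nothing like them is defined
by this series.

	#include <linux/vfio.h>
	#include <string.h>
	#include <sys/ioctl.h>

	#define VFIO_DEVICE_FEATURE_TSM_BIND	12	/* hypothetical */

	struct vfio_tsm_bind_payload {			/* hypothetical */
		__u32 guest_rid;	/* guest BDF, as in kvm_vfio_tsm_bind */
	};

	static int vfio_tdi_bind(int device_fd, __u32 guest_rid)
	{
		char buf[sizeof(struct vfio_device_feature) +
			 sizeof(struct vfio_tsm_bind_payload)];
		struct vfio_device_feature *feature = (void *)buf;
		struct vfio_tsm_bind_payload *bind = (void *)feature->data;

		memset(buf, 0, sizeof(buf));
		feature->argsz = sizeof(buf);
		feature->flags = VFIO_DEVICE_FEATURE_SET |
				 VFIO_DEVICE_FEATURE_TSM_BIND;
		bind->guest_rid = guest_rid;

		/*
		 * VFIO, as the device owner, would lock the TDI here and
		 * could then refuse config operations that break
		 * CONFIG_LOCKED.
		 */
		return ioctl(device_fd, VFIO_DEVICE_FEATURE, feature);
	}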
On 29/8/24 20:08, Xu Yilun wrote:
>> [...]
>> +static int kvm_vfio_set_device(struct kvm_device *dev, long attr,
>> +			       void __user *arg)
>> +{
>> +	switch (attr) {
>> +	case KVM_DEV_VFIO_DEVICE_TDI_BIND:
>> +		return kvm_dev_tsm_bind(dev, arg);
>
> I think the TDI bind operation should be under the control of the device
> owner (i.e. the VFIO driver), rather than in this bridge driver.

This is a valid point, although this means teaching VFIO about the KVM
lifetime (and KVM already holds references to VFIO groups) and about guest
BDFns (which have no meaning for VFIO in the host kernel).

> TDI bind transitions the TDI to the CONFIG_LOCKED state, and a number of
> device configuration operations break that state (TDISP spec
> 11.4.5/8/9). So the VFIO driver should be fully aware of the TDI bind
> and manage unwanted breakage.

VFIO has no control over the TDI anyway; it cannot even know what state
the TDI is in without talking to the firmware. When the TDI goes into
ERROR, this needs to be propagated to the VM. At the moment (afaik) VFIO
does not tell the userspace/guest about IOMMU errors; it probably should,
but the existing mechanism should be able to do so. Thanks,
On Fri, Aug 30, 2024 at 02:00:30PM +1000, Alexey Kardashevskiy wrote:
> On 29/8/24 20:08, Xu Yilun wrote:
>> [...]
>> I think the TDI bind operation should be under the control of the
>> device owner (i.e. the VFIO driver), rather than in this bridge driver.
>
> This is a valid point, although this means teaching VFIO about the KVM
> lifetime (and KVM already holds references to VFIO groups)

Not sure if I understand; VFIO already knows the KVM lifetime via
vfio_device_get_kvm_safe(), doesn't it?

> and about guest BDFns (which have no meaning for VFIO in the host
> kernel).

KVM is not aware of the guest BDF today.

I think we need to pass a firmware-recognizable TDI identifier, which is
actually a magic number and specific to each vendor. For TDX, it is the
FUNCTION_ID. So I didn't think too much about to whom the identifier is
meaningful.

> VFIO has no control over the TDI anyway; it cannot even know what state
> the TDI is in without talking to the firmware.

I think VFIO could talk to the firmware; that's part of the reason we are
working on a TSM module independent of KVM.

> When the TDI goes into ERROR, this needs to be propagated to the VM. At
> the moment (afaik) VFIO does not tell the userspace/guest about IOMMU
> errors; it probably should, but the existing mechanism should be able
> to do so.

I assume that when a TDISP ERROR happens, an interrupt (e.g. AER) is sent
to the OS, and the VFIO driver is the one who handles it in the first
place. So maybe there has to be some TDI stuff in VFIO?

Thanks,
Yilun
On 30/8/24 17:02, Xu Yilun wrote:
> [...]
> Not sure if I understand; VFIO already knows the KVM lifetime via
> vfio_device_get_kvm_safe(), doesn't it?

Yeah, you're right.

>> and about guest BDFns (which have no meaning for VFIO in the host
>> kernel).
>
> KVM is not aware of the guest BDF today.
>
> I think we need to pass a firmware-recognizable TDI identifier, which is
> actually a magic number and specific to each vendor. For TDX, it is the
> FUNCTION_ID. So I didn't think too much about to whom the identifier is
> meaningful.

It needs to be the same id for the "bind" operation (binding a TDI to a
VM, the op performed by QEMU) and for GUEST_REQUEST (a VMGEXIT from the
VM, so the id comes from the guest). The host kernel is not going to
parse it, just pass it to the firmware, so I guess it can be just a u32.

>> VFIO has no control over the TDI anyway; it cannot even know what state
>> the TDI is in without talking to the firmware.
>
> I think VFIO could talk to the firmware; that's part of the reason we
> are working on a TSM module independent of KVM.
>
> I assume that when a TDISP ERROR happens, an interrupt (e.g. AER) is
> sent to the OS, and the VFIO driver is the one who handles it in the
> first place. So maybe there has to be some TDI stuff in VFIO?

Sounds reasonable; my test device just does not do this, so I have not
poked at the error handling much :)

Thanks,
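The guest_rid the thread settles on is a plain PCI routing ID
(bus << 8 | devfn) carried as an opaque u32; the host only ever decodes it
for diagnostics, using the standard <linux/pci.h> helpers the way the
sev.c hunks in this patch do. A small illustration (the helper name is
made up):

	#include <linux/pci.h>

	/*
	 * Decode a u32 routing ID purely for log messages, mirroring
	 * sev_tsm_bind() in this patch; e.g. guest_rid 0x0310 prints
	 * as "3:02.0".
	 */
	static void tdi_log_guest_rid(struct device *dev, u32 guest_rid)
	{
		dev_info(dev, "TDI guest rid %x:%02x.%d\n",
			 PCI_BUS_NUM(guest_rid),  /* (guest_rid >> 8) & 0xff */
			 PCI_SLOT(guest_rid),     /* (guest_rid >> 3) & 0x1f */
			 PCI_FUNC(guest_rid));    /* guest_rid & 0x7 */
	}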
On Fri, 23 Aug 2024 23:21:25 +1000 Alexey Kardashevskiy <aik@amd.com> wrote:

> The SEV TIO spec defines a new TIO_GUEST_MESSAGE message to
> provide a secure communication channel between a SNP VM and
> the PSP.
>
> The defined messages provide a way to read TDI info and do secure
> MMIO/DMA setup.
>
> On top of this, GHCB defines an extension to return certificates/
> measurements/report and TDI run status to the VM.
>
> The TIO_GUEST_MESSAGE handler also checks if a specific TDI is bound
> to the VM and exits KVM to allow the userspace to bind it.

Out of curiosity, do we have to handle the TDI bind/unbind in kernel
space? It seems this makes the relationship between the modules more
complicated. What is the design concern with letting QEMU handle the TDI
bind/unbind message, given that QEMU can talk to VFIO/KVM and also to the
TSM?

> Skip adjust_direct_map() in rmpupdate() for now as it fails on MMIO.
>
> Signed-off-by: Alexey Kardashevskiy <aik@amd.com>
> [...]
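For reference, the flow being debated, as this revision implements it:
QEMU creates the KVM-VFIO pseudo device and issues the new device
attribute. A minimal userspace sketch, assuming the KVM_DEV_VFIO_DEVICE
group and struct kvm_vfio_tsm_bind proposed in this patch (fd setup and
error handling omitted):

	#include <linux/kvm.h>
	#include <sys/ioctl.h>

	/*
	 * "kvm_vfio_dev_fd" is the fd returned by KVM_CREATE_DEVICE with
	 * KVM_DEV_TYPE_VFIO; "vfio_dev_fd" is an open VFIO device fd.
	 */
	static int tdi_bind(int kvm_vfio_dev_fd, int vfio_dev_fd,
			    __u32 guest_rid)
	{
		struct kvm_vfio_tsm_bind tb = {
			.guest_rid = guest_rid,	/* guest PCI routing id */
			.devfd = vfio_dev_fd,
		};
		struct kvm_device_attr attr = {
			.group = KVM_DEV_VFIO_DEVICE,
			.attr = KVM_DEV_VFIO_DEVICE_TDI_BIND,
			.addr = (__u64)(unsigned long)&tb,
		};

		return ioctl(kvm_vfio_dev_fd, KVM_SET_DEVICE_ATTR, &attr);
	}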
Zhi Wang wrote:
> On Fri, 23 Aug 2024 23:21:25 +1000
> Alexey Kardashevskiy <aik@amd.com> wrote:
> [...]
> Out of curiosity, do we have to handle the TDI bind/unbind in kernel
> space? It seems this makes the relationship between the modules more
> complicated. What is the design concern with letting QEMU handle the TDI
> bind/unbind message, given that QEMU can talk to VFIO/KVM and also to
> the TSM?

Hmm, the flow I have in mind is:

Guest GHCx(BIND) => KVM => TSM GHCx handler => VFIO state update + TSM low-level BIND

vs this (if I understand your question correctly?):

Guest GHCx(BIND) => KVM => TSM GHCx handler => QEMU => VFIO => TSM low-level BIND

Why exit to QEMU only to turn around and call back into the kernel? VFIO
should already have the context from establishing the vPCI device as
"bind-capable" at setup time.

Maybe I misunderstood your complication concern?
> From: Dan Williams <dan.j.williams@intel.com>
> Sent: Saturday, September 14, 2024 6:09 AM
>
> Zhi Wang wrote:
> [...]
> Hmm, the flow I have in mind is:
>
> Guest GHCx(BIND) => KVM => TSM GHCx handler => VFIO state update + TSM
> low-level BIND
>
> vs this (if I understand your question correctly?):
>
> Guest GHCx(BIND) => KVM => TSM GHCx handler => QEMU => VFIO => TSM
> low-level BIND

Reading this patch, it appears to be implemented this way, except that
QEMU calls a KVM_DEV uAPI instead of going through VFIO (as Yilun
suggested).

> Why exit to QEMU only to turn around and call back into the kernel? VFIO
> should already have the context from establishing the vPCI device as
> "bind-capable" at setup time.

The general practice in VFIO is to design things around userspace driver
control over the device w/o assuming the existence of KVM. When the VMM
comes into the picture, the interaction with KVM is minimized unless there
are functional or performance reasons for it.

e.g. KVM needs to know whether an assigned device allows non-coherent DMA
for proper cache control, or a mdev/new vIOMMU object needs a reference to
struct kvm, etc.

Sometimes frequent trap-and-emulate is too costly, and then KVM/VFIO may
enable in-kernel acceleration to skip QEMU via eventfd, but in that case
the slow path via QEMU has been implemented first.

Ideally BIND/UNBIND is not a frequent operation, so falling back to QEMU
on a longer path is not a real problem. If there is no specific functional
or security reason for doing it in-kernel, I'm inclined to agree with Zhi
here (though not about complexity).
On Sat, 14 Sep 2024 02:47:27 +0000 "Tian, Kevin" <kevin.tian@intel.com> wrote:

> [...]
> Ideally BIND/UNBIND is not a frequent operation, so falling back to QEMU
> on a longer path is not a real problem. If there is no specific
> functional or security reason for doing it in-kernel, I'm inclined to
> agree with Zhi here (though not about complexity).

Exactly what I was thinking. Folks have spent quite some effort on keeping
VFIO and KVM independent. The existing shortcut calls between the two
modules are there because there was no better way to do it.

TSM BIND/UNBIND should not be a performance-critical path, so falling back
to QEMU would be fine. Besides, and I am not sure about others' opinion, I
don't think adding tsm_{bind,unbind} to kvm_x86_ops is a good idea.

If we have to stick to the current approach, I think we need more
justification.
On Sat, Sep 14, 2024 at 08:19:46AM +0300, Zhi Wang wrote:
> On Sat, 14 Sep 2024 02:47:27 +0000
> "Tian, Kevin" <kevin.tian@intel.com> wrote:
> [...]
> > > Why exit to QEMU only to turn around and call back into the kernel?
> > > VFIO should already have the context from establishing the vPCI
> > > device as "bind-capable" at setup time.

Previously we tried to do the host-side "bind-capable" setup (the TDI
context creation required by the firmware, but no LOCK) at setup time, but
didn't see enough value in it; it only made the error recovery flow more
complex. So now I actually don't see much work to do for "bind-capable",
other than marking the device as can-be-private. I.e. the host-side TDI
context for the vPCI device is established in the GHCx BIND phase.

> > The general practice in VFIO is to design things around userspace
> > driver control over the device w/o assuming the existence of KVM.
> > [...]
> > Ideally BIND/UNBIND is not a frequent operation, so falling back to
> > QEMU on a longer path is not a real problem.

I agree GHCx BIND/UNBIND should be routed to QEMU, because BIND/UNBIND
requires host-side cross-module management, e.g. IOMMUFD page table
switching, and the VFIO-side settings that build the host-side TDI context
and LOCK the TDI.

But I do support other GHCx calls between BIND and UNBIND being routed
directly to the low-level TSM, e.g. get device interface report, get
device certificates/measurements, TDISP RUN. These communications are
purely between the CoCo VM, the firmware and the TDI. The host has no
business in them, and there is no value in passing them through QEMU/VFIO
only to come back into the low-level TSM.

Thanks,
Yilun
> From: Xu Yilun <yilun.xu@linux.intel.com>
> Sent: Wednesday, September 18, 2024 6:45 PM
> [...]
> But I do support other GHCx calls between BIND and UNBIND being routed
> directly to the low-level TSM, e.g. get device interface report, get
> device certificates/measurements, TDISP RUN. These communications are
> purely between the CoCo VM, the firmware and the TDI.

Sure. If VFIO is conceptually irrelevant to an operation, it's certainly
right to skip it.
diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index 68ad4f923664..80e8176a4ea0 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -139,6 +139,8 @@ KVM_X86_OP_OPTIONAL(alloc_apic_backing_page)
 KVM_X86_OP_OPTIONAL_RET0(gmem_prepare)
 KVM_X86_OP_OPTIONAL_RET0(private_max_mapping_level)
 KVM_X86_OP_OPTIONAL(gmem_invalidate)
+KVM_X86_OP_OPTIONAL(tsm_bind)
+KVM_X86_OP_OPTIONAL(tsm_unbind)
 
 #undef KVM_X86_OP
 #undef KVM_X86_OP_OPTIONAL
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 4a68cb3eba78..80bdac4e47ac 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1830,6 +1830,8 @@ struct kvm_x86_ops {
 	int (*gmem_prepare)(struct kvm *kvm, kvm_pfn_t pfn, gfn_t gfn, int max_order);
 	void (*gmem_invalidate)(kvm_pfn_t start, kvm_pfn_t end);
 	int (*private_max_mapping_level)(struct kvm *kvm, kvm_pfn_t pfn);
+	int (*tsm_bind)(struct kvm *kvm, struct device *dev, u32 guest_rid);
+	void (*tsm_unbind)(struct kvm *kvm, struct device *dev);
 };
 
 struct kvm_x86_nested_ops {
diff --git a/arch/x86/include/asm/sev.h b/arch/x86/include/asm/sev.h
index 80d9aa16fe61..8edd7bccabf2 100644
--- a/arch/x86/include/asm/sev.h
+++ b/arch/x86/include/asm/sev.h
@@ -464,6 +464,7 @@ int snp_lookup_rmpentry(u64 pfn, bool *assigned, int *level);
 void snp_dump_hva_rmpentry(unsigned long address);
 int psmash(u64 pfn);
 int rmp_make_private(u64 pfn, u64 gpa, enum pg_level level, u32 asid, bool immutable);
+int rmp_make_private_mmio(u64 pfn, u64 gpa, enum pg_level level, u32 asid, bool immutable);
 int rmp_make_shared(u64 pfn, enum pg_level level);
 void snp_leak_pages(u64 pfn, unsigned int npages);
 void kdump_sev_callback(void);
diff --git a/arch/x86/include/uapi/asm/svm.h b/arch/x86/include/uapi/asm/svm.h
index 1814b413fd57..ac90a69e6327 100644
--- a/arch/x86/include/uapi/asm/svm.h
+++ b/arch/x86/include/uapi/asm/svm.h
@@ -116,6 +116,7 @@
 #define SVM_VMGEXIT_AP_CREATE			1
 #define SVM_VMGEXIT_AP_DESTROY			2
 #define SVM_VMGEXIT_SNP_RUN_VMPL		0x80000018
+#define SVM_VMGEXIT_SEV_TIO_GUEST_REQUEST	0x80000020
 #define SVM_VMGEXIT_HV_FEATURES			0x8000fffd
 #define SVM_VMGEXIT_TERM_REQUEST		0x8000fffe
 #define SVM_VMGEXIT_TERM_REASON(reason_set, reason_code)	\
@@ -237,6 +238,7 @@
 	{ SVM_VMGEXIT_GUEST_REQUEST,	"vmgexit_guest_request" }, \
 	{ SVM_VMGEXIT_EXT_GUEST_REQUEST, "vmgexit_ext_guest_request" }, \
 	{ SVM_VMGEXIT_AP_CREATION,	"vmgexit_ap_creation" }, \
+	{ SVM_VMGEXIT_SEV_TIO_GUEST_REQUEST, "vmgexit_sev_tio_guest_request" }, \
 	{ SVM_VMGEXIT_HV_FEATURES,	"vmgexit_hypervisor_feature" }, \
 	{ SVM_EXIT_ERR,         "invalid_guest_state" }
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 76107c7d0595..d04d583c1741 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -749,6 +749,8 @@ void sev_snp_init_protected_guest_state(struct kvm_vcpu *vcpu);
 int sev_gmem_prepare(struct kvm *kvm, kvm_pfn_t pfn, gfn_t gfn, int max_order);
 void sev_gmem_invalidate(kvm_pfn_t start, kvm_pfn_t end);
 int sev_private_max_mapping_level(struct kvm *kvm, kvm_pfn_t pfn);
+int sev_tsm_bind(struct kvm *kvm, struct device *dev, u32 guest_rid);
+void sev_tsm_unbind(struct kvm *kvm, struct device *dev);
 #else
 static inline struct page *snp_safe_alloc_page_node(int node, gfp_t gfp)
 {
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index d004d96c2ace..fdb331b3e0d3 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2497,5 +2497,7 @@ void kvm_arch_gmem_invalidate(kvm_pfn_t start, kvm_pfn_t end);
 long kvm_arch_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
 				    struct kvm_pre_fault_memory *range);
 #endif
 
+int kvm_arch_tsm_bind(struct kvm *kvm, struct device *dev, u32 guest_rid);
+void kvm_arch_tsm_unbind(struct kvm *kvm, struct device *dev);
 #endif
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 637efc055145..37f76bbdfa9b 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -135,6 +135,17 @@ struct kvm_xen_exit {
 	} u;
 };
 
+struct kvm_user_vmgexit {
+#define KVM_USER_VMGEXIT_TIO_REQ	4
+	__u32 type; /* KVM_USER_VMGEXIT_* type */
+	union {
+		struct {
+			__u32 guest_rid;
+			__u32 ret;
+		} tio_req;
+	};
+} __packed;
+
 #define KVM_S390_GET_SKEYS_NONE   1
 #define KVM_S390_SKEYS_MAX        1048576
 
@@ -178,6 +189,7 @@ struct kvm_xen_exit {
 #define KVM_EXIT_NOTIFY           37
 #define KVM_EXIT_LOONGARCH_IOCSR  38
 #define KVM_EXIT_MEMORY_FAULT     39
+#define KVM_EXIT_VMGEXIT          40
 
 /* For KVM_EXIT_INTERNAL_ERROR */
 /* Emulate instruction failed. */
@@ -446,6 +458,7 @@ struct kvm_run {
 			__u64 gpa;
 			__u64 size;
 		} memory_fault;
+		struct kvm_user_vmgexit vmgexit;
 		/* Fix the size of the union. */
 		char padding[256];
 	};
@@ -1166,6 +1179,22 @@ struct kvm_vfio_spapr_tce {
 	__s32	tablefd;
 };
 
+#define KVM_DEV_VFIO_DEVICE			2
+#define   KVM_DEV_VFIO_DEVICE_TDI_BIND		1
+#define   KVM_DEV_VFIO_DEVICE_TDI_UNBIND	2
+
+/*
+ * struct kvm_vfio_tsm_bind
+ *
+ * @guest_rid: Hypervisor provided identifier used by the guest to identify
+ *             the TDI in guest messages
+ * @devfd: a fd of VFIO device
+ */
+struct kvm_vfio_tsm_bind {
+	__u32 guest_rid;
+	__s32 devfd;
+} __packed;
+
 /*
  * KVM_CREATE_VCPU receives as a parameter the vcpu slot, and returns
  * a vcpu fd.
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index 9badf4fa7e1d..e36b93b9cc2b 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -20,6 +20,8 @@
 #include <linux/processor.h>
 #include <linux/trace_events.h>
 #include <uapi/linux/sev-guest.h>
+#include <linux/tsm.h>
+#include <linux/pci.h>
 
 #include <asm/pkru.h>
 #include <asm/trapnr.h>
@@ -3413,6 +3415,8 @@ static int sev_es_validate_vmgexit(struct vcpu_svm *svm)
 		    control->exit_info_1 == control->exit_info_2)
 			goto vmgexit_err;
 		break;
+	case SVM_VMGEXIT_SEV_TIO_GUEST_REQUEST:
+		break;
 	default:
 		reason = GHCB_ERR_INVALID_EVENT;
 		goto vmgexit_err;
@@ -4128,6 +4132,182 @@ static int snp_handle_ext_guest_req(struct vcpu_svm *svm, gpa_t req_gpa, gpa_t r
 	return 1; /* resume guest */
 }
 
+static int tio_make_mmio_private(struct vcpu_svm *svm, struct pci_dev *pdev,
+				 phys_addr_t mmio_gpa, phys_addr_t mmio_size,
+				 unsigned int rangeid)
+{
+	int ret = 0;
+
+	if (!mmio_gpa || !mmio_size || mmio_size != pci_resource_len(pdev, rangeid)) {
+		pci_err(pdev, "Invalid MMIO #%d gpa=%llx..%llx\n",
+			rangeid, mmio_gpa, mmio_gpa + mmio_size);
+		return SEV_RET_INVALID_PARAM;
+	}
+
+	/* Could as well exit to the userspace and ioctl(KVM_MEMORY_ATTRIBUTE_PRIVATE) */
+	ret = kvm_vm_set_mem_attributes(svm->vcpu.kvm, mmio_gpa >> PAGE_SHIFT,
+					(mmio_gpa + mmio_size) >> PAGE_SHIFT,
+					KVM_MEMORY_ATTRIBUTE_PRIVATE);
+	if (ret)
+		pci_err(pdev, "Failed to mark MMIO #%d gpa=%llx..%llx as private, ret=%d\n",
+			rangeid, mmio_gpa, mmio_gpa + mmio_size, ret);
+	else
+		pci_notice(pdev, "Marked MMIO#%d gpa=%llx..%llx as private\n",
+			   rangeid, mmio_gpa, mmio_gpa + mmio_size);
+
+	for (phys_addr_t off = 0; off < mmio_size; off += PAGE_SIZE) {
+		ret = rmp_make_private_mmio((pci_resource_start(pdev, rangeid) + off) >> PAGE_SHIFT,
+					    (mmio_gpa + off), PG_LEVEL_4K, svm->asid,
+					    false/*Immutable*/);
+		if (ret)
+			pci_err(pdev,
+				"Failed to map TIO #%d %pR +%llx %llx -> gpa=%llx ret=%d\n",
+				rangeid, pci_resource_n(pdev, rangeid), off, mmio_size,
+				mmio_gpa + off, ret);
+	}
+
+	return SEV_RET_SUCCESS;
+}
+
+static int snp_complete_sev_tio_guest_request(struct kvm_vcpu *vcpu, struct tsm_tdi *tdi)
+{
+	struct vcpu_svm *svm = to_svm(vcpu);
+	struct vmcb_control_area *control = &svm->vmcb->control;
+	struct kvm *kvm = vcpu->kvm;
+	struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
+	enum tsm_tdisp_state state = TDISP_STATE_UNAVAIL;
+	unsigned long exitcode = 0, data_npages;
+	struct tio_guest_request tioreq = { 0 };
+	struct snp_guest_msg_hdr *req_hdr;
+	gpa_t req_gpa, resp_gpa;
+	struct fd sevfd;
+	u64 data_gpa;
+	int ret;
+
+	if (!sev_snp_guest(kvm))
+		return -EINVAL;
+
+	mutex_lock(&sev->guest_req_mutex);
+
+	req_gpa = control->exit_info_1;
+	resp_gpa = control->exit_info_2;
+
+	ret = kvm_read_guest(kvm, req_gpa, sev->guest_req_buf, PAGE_SIZE);
+	if (ret)
+		goto out_unlock;
+
+	tioreq.data.gctx_paddr = __psp_pa(sev->snp_context);
+	tioreq.data.req_paddr = __psp_pa(sev->guest_req_buf);
+	tioreq.data.res_paddr = __psp_pa(sev->guest_resp_buf);
+
+	sevfd = fdget(sev->fd);
+	if (!sevfd.file)
+		goto out_unlock;
+
+	req_hdr = sev->guest_req_buf;
+	if (req_hdr->msg_type == TIO_MSG_MMIO_VALIDATE_REQ) {
+		const u64 raw_gpa = vcpu->arch.regs[VCPU_REGS_RDX];
+
+		ret = tio_make_mmio_private(svm, tdi->pdev,
+					    MMIO_VALIDATE_GPA(raw_gpa),
+					    MMIO_VALIDATE_LEN(raw_gpa),
+					    MMIO_VALIDATE_RANGEID(raw_gpa));
+		if (ret != SEV_RET_SUCCESS)
+			goto put_unlock;
+	}
+
+	ret = tsm_guest_request(tdi,
+				(req_hdr->msg_type == TIO_MSG_TDI_INFO_REQ) ? &state : NULL,
+				&tioreq);
+	if (ret)
+		goto put_unlock;
+
+	struct tio_blob_table_entry t[4] = {
+		{ .guid = TIO_GUID_MEASUREMENTS,
+		  .offset = sizeof(t),
+		  .length = tdi->tdev->meas ? tdi->tdev->meas->len : 0 },
+		{ .guid = TIO_GUID_CERTIFICATES,
+		  .offset = sizeof(t) + t[0].length,
+		  .length = tdi->tdev->certs ? tdi->tdev->certs->len : 0 },
+		{ .guid = TIO_GUID_REPORT,
+		  .offset = sizeof(t) + t[0].length + t[1].length,
+		  .length = tdi->report ? tdi->report->len : 0 },
+		{ .guid.b = { 0 } }
+	};
+	void *tp[4] = {
+		tdi->tdev->meas ? tdi->tdev->meas->data : NULL,
+		tdi->tdev->certs ? tdi->tdev->certs->data : NULL,
+		tdi->report ? tdi->report->data : NULL
+	};
+
+	data_gpa = vcpu->arch.regs[VCPU_REGS_RAX];
+	data_npages = vcpu->arch.regs[VCPU_REGS_RBX];
+	vcpu->arch.regs[VCPU_REGS_RBX] = PAGE_ALIGN(t[0].length + t[1].length +
+						    t[2].length + sizeof(t)) >> PAGE_SHIFT;
+	if (data_gpa && ((data_npages << PAGE_SHIFT) >= vcpu->arch.regs[VCPU_REGS_RBX])) {
+		if (kvm_write_guest(kvm, data_gpa + 0, &t, sizeof(t)) ||
+		    kvm_write_guest(kvm, data_gpa + t[0].offset, tp[0], t[0].length) ||
+		    kvm_write_guest(kvm, data_gpa + t[1].offset, tp[1], t[1].length) ||
+		    kvm_write_guest(kvm, data_gpa + t[2].offset, tp[2], t[2].length))
+			exitcode = SEV_RET_INVALID_ADDRESS;
+	}
+
+	if (req_hdr->msg_type == TIO_MSG_TDI_INFO_REQ)
+		vcpu->arch.regs[VCPU_REGS_RDX] = state;
+
+	ret = kvm_write_guest(kvm, resp_gpa, sev->guest_resp_buf, PAGE_SIZE);
+	if (ret)
+		goto put_unlock;
+
+	ret = 1; /* Resume guest */
+
+	ghcb_set_sw_exit_info_2(svm->sev_es.ghcb, SNP_GUEST_ERR(0, tioreq.fw_err));
+
+put_unlock:
+	fdput(sevfd);
+out_unlock:
+	mutex_unlock(&sev->guest_req_mutex);
+
+	return ret;
+}
+
+static int snp_try_complete_sev_tio_guest_request(struct kvm_vcpu *vcpu)
+{
+	struct kvm_sev_info *sev = &to_kvm_svm(vcpu->kvm)->sev_info;
+	u32 guest_rid = vcpu->arch.regs[VCPU_REGS_RCX];
+	struct tsm_tdi *tdi = tsm_tdi_find(guest_rid, (u64) __psp_pa(sev->snp_context));
+
+	if (!tdi) {
+		pr_err("TDI is not bound to %x:%02x.%d\n",
+		       PCI_BUS_NUM(guest_rid), PCI_SLOT(guest_rid), PCI_FUNC(guest_rid));
+		return 1; /* Resume guest */
+	}
+
+	return snp_complete_sev_tio_guest_request(vcpu, tdi);
+}
+
+static int snp_sev_tio_guest_request(struct kvm_vcpu *vcpu)
+{
+	u32 guest_rid = vcpu->arch.regs[VCPU_REGS_RCX];
+	struct kvm *kvm = vcpu->kvm;
+	struct kvm_sev_info *sev;
+	struct tsm_tdi *tdi;
+
+	if (!sev_snp_guest(kvm))
+		return SEV_RET_INVALID_GUEST;
+
+	sev = &to_kvm_svm(kvm)->sev_info;
+	tdi = tsm_tdi_find(guest_rid, (u64) __psp_pa(sev->snp_context));
+	if (!tdi) {
+		vcpu->run->exit_reason = KVM_EXIT_VMGEXIT;
+		vcpu->run->vmgexit.type = KVM_USER_VMGEXIT_TIO_REQ;
+		vcpu->run->vmgexit.tio_req.guest_rid = guest_rid;
+		vcpu->arch.complete_userspace_io = snp_try_complete_sev_tio_guest_request;
+		return 0; /* Exit KVM */
+	}
+
+	return snp_complete_sev_tio_guest_request(vcpu, tdi);
+}
+
 static int sev_handle_vmgexit_msr_protocol(struct vcpu_svm *svm)
 {
 	struct vmcb_control_area *control = &svm->vmcb->control;
@@ -4408,6 +4588,9 @@ int sev_handle_vmgexit(struct kvm_vcpu *vcpu)
 	case SVM_VMGEXIT_EXT_GUEST_REQUEST:
 		ret = snp_handle_ext_guest_req(svm, control->exit_info_1, control->exit_info_2);
 		break;
+	case SVM_VMGEXIT_SEV_TIO_GUEST_REQUEST:
+		ret = snp_sev_tio_guest_request(vcpu);
+		break;
 	case SVM_VMGEXIT_UNSUPPORTED_EVENT:
 		vcpu_unimpl(vcpu,
 			    "vmgexit: unsupported event - exit_info_1=%#llx, exit_info_2=%#llx\n",
@@ -5000,3 +5183,37 @@ int sev_private_max_mapping_level(struct kvm *kvm, kvm_pfn_t pfn)
 
 	return level;
 }
+
+int sev_tsm_bind(struct kvm *kvm, struct device *dev, u32 guest_rid)
+{
+	struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info;
+	struct tsm_tdi *tdi = tsm_tdi_get(dev);
+	struct fd sevfd;
+	int ret;
+
+	if (!tdi)
+		return -ENODEV;
+
+	sevfd = fdget(sev->fd);
+	if (!sevfd.file)
+		return -EPERM;
+
+	dev_info(dev, "Binding guest=%x:%02x.%d\n",
+		 PCI_BUS_NUM(guest_rid), PCI_SLOT(guest_rid), PCI_FUNC(guest_rid));
+	ret = tsm_tdi_bind(tdi, guest_rid, (u64) __psp_pa(sev->snp_context), sev->asid);
+	fdput(sevfd);
+
+	return ret;
+}
+
+void sev_tsm_unbind(struct kvm *kvm, struct device *dev)
+{
+	struct tsm_tdi *tdi = tsm_tdi_get(dev);
+
+	if (!tdi)
+		return;
+
+	dev_notice(dev, "Unbinding guest=%x:%02x.%d\n",
+		   PCI_BUS_NUM(tdi->guest_rid), PCI_SLOT(tdi->guest_rid), PCI_FUNC(tdi->guest_rid));
+	tsm_tdi_unbind(tdi);
+}
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index d6f252555ab3..ab6e41eed697 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -5093,6 +5093,9 @@ static struct kvm_x86_ops svm_x86_ops __initdata = {
 	.vm_copy_enc_context_from = sev_vm_copy_enc_context_from,
 	.vm_move_enc_context_from = sev_vm_move_enc_context_from,
+
+	.tsm_bind = sev_tsm_bind,
+	.tsm_unbind = sev_tsm_unbind,
 #endif
 
 	.check_emulate_instruction = svm_check_emulate_instruction,
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 70219e406987..97261cffa9ad 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -14055,3 +14055,15 @@ static void __exit kvm_x86_exit(void)
 	WARN_ON_ONCE(static_branch_unlikely(&kvm_has_noapic_vcpu));
 }
 module_exit(kvm_x86_exit);
+
+int kvm_arch_tsm_bind(struct kvm *kvm, struct device *dev, u32 guest_rid)
+{
+	return static_call(kvm_x86_tsm_bind)(kvm, dev, guest_rid);
+}
+EXPORT_SYMBOL_GPL(kvm_arch_tsm_bind);
+
+void kvm_arch_tsm_unbind(struct kvm *kvm, struct device *dev)
+{
+	static_call(kvm_x86_tsm_unbind)(kvm, dev);
+}
+EXPORT_SYMBOL_GPL(kvm_arch_tsm_unbind);
diff --git a/arch/x86/virt/svm/sev.c b/arch/x86/virt/svm/sev.c
index 44e7609c9bd6..91f5729dfcad 100644
--- a/arch/x86/virt/svm/sev.c
+++ b/arch/x86/virt/svm/sev.c
@@ -945,7 +945,7 @@ static int adjust_direct_map(u64 pfn, int rmp_level)
  * The optimal solution would be range locking to avoid locking disjoint
  * regions unnecessarily but there's no support for that yet.
  */
-static int rmpupdate(u64 pfn, struct rmp_state *state)
+static int rmpupdate(u64 pfn, struct rmp_state *state, bool mmio)
 {
 	unsigned long paddr = pfn << PAGE_SHIFT;
 	int ret, level;
@@ -955,7 +955,7 @@ static int rmpupdate(u64 pfn, struct rmp_state *state)
 
 	level = RMP_TO_PG_LEVEL(state->pagesize);
 
-	if (adjust_direct_map(pfn, level))
+	if (!mmio && adjust_direct_map(pfn, level))
 		return -EFAULT;
 
 	do {
@@ -989,10 +989,25 @@ int rmp_make_private(u64 pfn, u64 gpa, enum pg_level level, u32 asid, bool immut
 	state.gpa = gpa;
 	state.pagesize = PG_LEVEL_TO_RMP(level);
 
-	return rmpupdate(pfn, &state);
+	return rmpupdate(pfn, &state, false);
 }
 EXPORT_SYMBOL_GPL(rmp_make_private);
 
+int rmp_make_private_mmio(u64 pfn, u64 gpa, enum pg_level level, u32 asid, bool immutable)
+{
+	struct rmp_state state;
+
+	memset(&state, 0, sizeof(state));
+	state.assigned = 1;
+	state.asid = asid;
+	state.immutable = immutable;
+	state.gpa = gpa;
+	state.pagesize = PG_LEVEL_TO_RMP(level);
+
+	return rmpupdate(pfn, &state, true);
+}
+EXPORT_SYMBOL_GPL(rmp_make_private_mmio);
+
 /* Transition a page to hypervisor-owned/shared state in the RMP table. */
 int rmp_make_shared(u64 pfn, enum pg_level level)
 {
@@ -1001,7 +1016,7 @@ int rmp_make_shared(u64 pfn, enum pg_level level)
 	memset(&state, 0, sizeof(state));
 	state.pagesize = PG_LEVEL_TO_RMP(level);
 
-	return rmpupdate(pfn, &state);
+	return rmpupdate(pfn, &state, false);
 }
 EXPORT_SYMBOL_GPL(rmp_make_shared);
diff --git a/virt/kvm/vfio.c b/virt/kvm/vfio.c
index 76b7f6085dcd..a4e9db212adc 100644
--- a/virt/kvm/vfio.c
+++ b/virt/kvm/vfio.c
@@ -15,6 +15,7 @@
 #include <linux/slab.h>
 #include <linux/uaccess.h>
 #include <linux/vfio.h>
+#include <linux/tsm.h>
 #include "vfio.h"
 
 #ifdef CONFIG_SPAPR_TCE_IOMMU
@@ -29,8 +30,14 @@ struct kvm_vfio_file {
 #endif
 };
 
+struct kvm_vfio_tdi {
+	struct list_head node;
+	struct vfio_device *vdev;
+};
+
 struct kvm_vfio {
 	struct list_head file_list;
+	struct list_head tdi_list;
 	struct mutex lock;
 	bool noncoherent;
 };
@@ -80,6 +87,22 @@ static bool kvm_vfio_file_is_valid(struct file *file)
 	return ret;
 }
 
+static struct vfio_device *kvm_vfio_file_device(struct file *file)
+{
+	struct vfio_device *(*fn)(struct file *file);
+	struct vfio_device *ret;
+
+	fn = symbol_get(vfio_file_device);
+	if (!fn)
+		return NULL;
+
+	ret = fn(file);
+
+	symbol_put(vfio_file_device);
+
+	return ret;
+}
+
 #ifdef CONFIG_SPAPR_TCE_IOMMU
 static struct iommu_group *kvm_vfio_file_iommu_group(struct file *file)
 {
@@ -297,6 +320,103 @@ static int kvm_vfio_set_file(struct kvm_device *dev, long attr,
 	return -ENXIO;
 }
 
+static int kvm_dev_tsm_bind(struct kvm_device *dev, void __user *arg)
+{
+	struct kvm_vfio *kv = dev->private;
+	struct kvm_vfio_tsm_bind tb;
+	struct kvm_vfio_tdi *ktdi;
+	struct vfio_device *vdev;
+	struct fd fdev;
+	int ret;
+
+	if (copy_from_user(&tb, arg, sizeof(tb)))
+		return -EFAULT;
+
+	ktdi = kzalloc(sizeof(*ktdi), GFP_KERNEL_ACCOUNT);
+	if (!ktdi)
+		return -ENOMEM;
+
+	fdev = fdget(tb.devfd);
+	if (!fdev.file)
+		return -EBADF;
+
+	ret = -ENOENT;
+
+	mutex_lock(&kv->lock);
+
+	vdev = kvm_vfio_file_device(fdev.file);
+	if (vdev) {
+		ret = kvm_arch_tsm_bind(dev->kvm, vdev->dev, tb.guest_rid);
+		if (!ret) {
+			ktdi->vdev = vdev;
+			list_add_tail(&ktdi->node, &kv->tdi_list);
+		} else {
+			vfio_put_device(vdev);
+		}
+	}
+
+	fdput(fdev);
+	mutex_unlock(&kv->lock);
+	if (ret)
+		kfree(ktdi);
+
+	return ret;
+}
+
+static int kvm_dev_tsm_unbind(struct kvm_device *dev, void __user *arg)
+{
+	struct kvm_vfio *kv = dev->private;
+	struct kvm_vfio_tsm_bind tb;
+	struct kvm_vfio_tdi *ktdi;
+	struct vfio_device *vdev;
+	struct fd fdev;
+	int ret;
+
+	if (copy_from_user(&tb, arg, sizeof(tb)))
+		return -EFAULT;
+
+	fdev = fdget(tb.devfd);
+	if (!fdev.file)
+		return -EBADF;
+
+	ret = -ENOENT;
+
+	mutex_lock(&kv->lock);
+
+	vdev = kvm_vfio_file_device(fdev.file);
+	if (vdev) {
+		list_for_each_entry(ktdi, &kv->tdi_list, node) {
+			if (ktdi->vdev != vdev)
+				continue;
+
+			kvm_arch_tsm_unbind(dev->kvm, vdev->dev);
+			list_del(&ktdi->node);
+			kfree(ktdi);
+			vfio_put_device(vdev);
+			ret = 0;
+			break;
+		}
+		vfio_put_device(vdev);
+	}
+
+	fdput(fdev);
+	mutex_unlock(&kv->lock);
+	return ret;
+}
+
+static int kvm_vfio_set_device(struct kvm_device *dev, long attr,
+			       void __user *arg)
+{
+	switch (attr) {
+	case KVM_DEV_VFIO_DEVICE_TDI_BIND:
+		return kvm_dev_tsm_bind(dev, arg);
+	case KVM_DEV_VFIO_DEVICE_TDI_UNBIND:
+		return kvm_dev_tsm_unbind(dev, arg);
+	}
+
+	return -ENXIO;
+}
+
 static int kvm_vfio_set_attr(struct kvm_device *dev,
 			     struct kvm_device_attr *attr)
 {
@@ -304,6 +424,9 @@ static int kvm_vfio_set_attr(struct kvm_device *dev,
 	case KVM_DEV_VFIO_FILE:
 		return kvm_vfio_set_file(dev, attr->attr,
 					 u64_to_user_ptr(attr->addr));
+	case KVM_DEV_VFIO_DEVICE:
+		return kvm_vfio_set_device(dev, attr->attr,
+					   u64_to_user_ptr(attr->addr));
 	}
 
 	return -ENXIO;
@@ -323,6 +446,13 @@ static int kvm_vfio_has_attr(struct kvm_device *dev,
 			return 0;
 		}
 
+		break;
+	case KVM_DEV_VFIO_DEVICE:
+		switch (attr->attr) {
+		case KVM_DEV_VFIO_DEVICE_TDI_BIND:
+		case KVM_DEV_VFIO_DEVICE_TDI_UNBIND:
+			return 0;
+		}
 		break;
 	}
 
@@ -332,8 +462,16 @@ static int kvm_vfio_has_attr(struct kvm_device *dev,
 static void kvm_vfio_release(struct kvm_device *dev)
 {
 	struct kvm_vfio *kv = dev->private;
+	struct kvm_vfio_tdi *ktdi, *tmp2;
 	struct kvm_vfio_file *kvf, *tmp;
 
+	list_for_each_entry_safe(ktdi, tmp2, &kv->tdi_list, node) {
+		kvm_arch_tsm_unbind(dev->kvm, ktdi->vdev->dev);
+		list_del(&ktdi->node);
+		vfio_put_device(ktdi->vdev);
+		kfree(ktdi);
+	}
+
 	list_for_each_entry_safe(kvf, tmp, &kv->file_list, node) {
 #ifdef CONFIG_SPAPR_TCE_IOMMU
 		kvm_spapr_tce_release_vfio_group(dev->kvm, kvf);
@@ -379,6 +517,7 @@ static int kvm_vfio_create(struct kvm_device *dev, u32 type)
 
 	INIT_LIST_HEAD(&kv->file_list);
 	mutex_init(&kv->lock);
+	INIT_LIST_HEAD(&kv->tdi_list);
 
 	dev->private = kv;
 
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index 472a1537b7a9..5e07a1fddb67 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -143,6 +143,7 @@ config KVM_AMD_SEV
 	select KVM_GENERIC_PRIVATE_MEM
 	select HAVE_KVM_ARCH_GMEM_PREPARE
 	select HAVE_KVM_ARCH_GMEM_INVALIDATE
+	select KVM_VFIO
 	help
 	  Provides support for launching Encrypted VMs (SEV) and Encrypted VMs
 	  with Encrypted State (SEV-ES) on AMD processors.
The SEV TIO spec defines a new TIO_GUEST_MESSAGE message to provide
a secure communication channel between an SNP VM and the PSP.

The defined messages provide a way to read TDI info and to do secure
MMIO/DMA setup.

On top of this, the GHCB defines an extension to return certificates/
measurements/report and the TDI run status to the VM.

The TIO_GUEST_MESSAGE handler also checks whether the specific TDI is
bound to the VM and, if not, exits KVM to allow userspace to bind it.

Skip adjust_direct_map() in rmpupdate() for now as it fails on MMIO.

Signed-off-by: Alexey Kardashevskiy <aik@amd.com>
---
 arch/x86/include/asm/kvm-x86-ops.h |   2 +
 arch/x86/include/asm/kvm_host.h    |   2 +
 arch/x86/include/asm/sev.h         |   1 +
 arch/x86/include/uapi/asm/svm.h    |   2 +
 arch/x86/kvm/svm/svm.h             |   2 +
 include/linux/kvm_host.h           |   2 +
 include/uapi/linux/kvm.h           |  29 +++
 arch/x86/kvm/svm/sev.c             | 217 ++++++++++++++++++++
 arch/x86/kvm/svm/svm.c             |   3 +
 arch/x86/kvm/x86.c                 |  12 ++
 arch/x86/virt/svm/sev.c            |  23 ++-
 virt/kvm/vfio.c                    | 139 +++++++++++++
 arch/x86/kvm/Kconfig               |   1 +
 13 files changed, 431 insertions(+), 4 deletions(-)
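[Editor's note] The commit message says the TIO_GUEST_MESSAGE handler "exits KVM to allow userspace to bind it"; below is a hedged sketch of how a VMM run loop might service that new KVM_EXIT_VMGEXIT exit. Both vmm_find_vfio_devfd() (a guest-BDF to VFIO device fd lookup) and tdi_bind() (sketched earlier) are hypothetical helpers, and the interpretation of tio_req.ret as the bind outcome is an assumption, not spelled out by the patch. After userspace returns to KVM_RUN, KVM retries the TDI lookup via the complete_userspace_io callback (snp_try_complete_sev_tio_guest_request()).

```c
#include <stdint.h>
#include <linux/kvm.h>

/* Hypothetical VMM-side helpers, not part of this patch. */
extern int vmm_find_vfio_devfd(uint32_t guest_rid);
extern int tdi_bind(int vm_fd, int vfio_device_fd, uint32_t guest_rid);

/* Service the TIO_REQ exit: bind the TDI named by guest_rid, then return
 * to KVM_RUN so the in-kernel handler can find the now-bound TDI. */
static void handle_tio_req(int vm_fd, struct kvm_run *run)
{
	if (run->exit_reason != KVM_EXIT_VMGEXIT ||
	    run->vmgexit.type != KVM_USER_VMGEXIT_TIO_REQ)
		return;

	uint32_t guest_rid = run->vmgexit.tio_req.guest_rid;

	/* Assumption: tio_req.ret carries the bind outcome back to KVM */
	run->vmgexit.tio_req.ret =
		tdi_bind(vm_fd, vmm_find_vfio_devfd(guest_rid), guest_rid);
}
```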