diff mbox

[1/2] KVM: page track: add a new notifier type: track_flush_slot

Message ID 5800B579.9000705@intel.com
State New
Headers show

Commit Message

Jike Song Oct. 14, 2016, 10:37 a.m. UTC
On 10/11/2016 05:47 PM, Paolo Bonzini wrote:
> 
> 
> On 11/10/2016 11:21, Xiao Guangrong wrote:
>>
>>
>> On 10/11/2016 04:54 PM, Paolo Bonzini wrote:
>>>
>>>
>>> On 11/10/2016 04:39, Xiao Guangrong wrote:
>>>>
>>>>
>>>> On 10/11/2016 02:32 AM, Paolo Bonzini wrote:
>>>>>
>>>>>
>>>>> On 10/10/2016 20:01, Neo Jia wrote:
>>>>>>> Hi Neo,
>>>>>>>
>>>>>>> AFAIK this is needed because KVMGT doesn't paravirtualize the PPGTT,
>>>>>>> while nVidia does.
>>>>>>
>>>>>> Hi Paolo and Xiaoguang,
>>>>>>
>>>>>> I am just wondering how device driver can register a notifier so he
>>>>>> can be
>>>>>> notified for write-protected pages when writes are happening.
>>>>>
>>>>> It can't yet, but the API is ready for that.  kvm_vfio_set_group is
>>>>> currently where a struct kvm_device* and struct vfio_group* touch.
>>>>> Given
>>>>> a struct kvm_device*, dev->kvm provides the struct kvm to be passed to
>>>>> kvm_page_track_register_notifier.  So I guess you could add a callback
>>>>> that passes the struct kvm_device* to the mdev device.
>>>>>
>>>>> Xiaoguang and Guangrong, what were your plans?  We discussed it briefly
>>>>> at KVM Forum but I don't remember the details.
>>>>
>>>> Your suggestion was that pass kvm fd to KVMGT via VFIO, so that we can
>>>> figure out the kvm instance based on the fd.
>>>>
>>>> We got a new idea, how about search the kvm instance by mm_struct, it
>>>> can work as KVMGT is running in the vcpu context and it is much more
>>>> straightforward.
>>>
>>> Perhaps I didn't understand your suggestion, but the same mm_struct can
>>> have more than 1 struct kvm so I'm not sure that it can work.
>>
>> vcpu->pid is valid during vcpu running so that it can be used to figure
>> out which kvm instance owns the vcpu whose pid is the one as current
>> thread, i think it can work. :)
> 
> No, don't do that.  There's no reason for a thread to run a single VCPU,
> and if you can have multiple VCPUs you can also have multiple VCPUs from
> multiple VMs.
> 
> Passing file descriptors around are the right way to connect subsystems.

[CC Alex, Kevin and Qemu-devel]

Hi Paolo & Alex,

IIUC, passing file descriptors means touching QEMU and the UAPI between
QEMU and VFIO. Would you guys have a look at below draft patch? If it's
on the correct direction, I'll send the split ones. Thanks!

--
Thanks,
Jike

Comments

Paolo Bonzini Oct. 14, 2016, 10:43 a.m. UTC | #1
On 14/10/2016 12:37, Jike Song wrote:
> Hi Paolo & Alex,
> 
> IIUC, passing file descriptors means touching QEMU and the UAPI between
> QEMU and VFIO. Would you guys have a look at below draft patch? If it's
> on the correct direction, I'll send the split ones. Thanks!
> 
> --
> Thanks,
> Jike
> 
> 
> diff --git a/hw/vfio/pci-quirks.c b/hw/vfio/pci-quirks.c
> index bec694c..f715d37 100644
> --- a/hw/vfio/pci-quirks.c
> +++ b/hw/vfio/pci-quirks.c
> @@ -10,12 +10,14 @@
>   * the COPYING file in the top-level directory.
>   */
>  
> +#include <sys/ioctl.h>
>  #include "qemu/osdep.h"
>  #include "qemu/error-report.h"
>  #include "qemu/range.h"
>  #include "qapi/error.h"
>  #include "hw/nvram/fw_cfg.h"
>  #include "pci.h"
> +#include "sysemu/kvm.h"
>  #include "trace.h"
>  
>  /* Use uin32_t for vendor & device so PCI_ANY_ID expands and cannot match hw */
> @@ -1844,3 +1846,15 @@ void vfio_setup_resetfn_quirk(VFIOPCIDevice *vdev)
>          break;
>      }
>  }
> +
> +void vfio_quirk_kvmgt(VFIOPCIDevice *vdev)
> +{
> +    int vmfd;
> +
> +    if (!kvm_enabled() || !vdev->kvmgt)
> +        return;
> +
> +    /* Tell the device what KVM it attached */
> +    vmfd = kvm_get_vmfd(kvm_state);
> +    ioctl(vdev->vbasedev.fd, VFIO_SET_KVMFD, vmfd);
> +}

vfio_kvm_device_add_group is already telling the group id file
descriptor to KVM.  You can use that existing hook (whose kernel side is
virt/kvm/vfio.c).

Paolo
Jike Song Oct. 14, 2016, 12:26 p.m. UTC | #2
On 10/14/2016 06:43 PM, Paolo Bonzini wrote:
> 
> 
> On 14/10/2016 12:37, Jike Song wrote:
>> Hi Paolo & Alex,
>>
>> IIUC, passing file descriptors means touching QEMU and the UAPI between
>> QEMU and VFIO. Would you guys have a look at below draft patch? If it's
>> on the correct direction, I'll send the split ones. Thanks!
>>
>> --
>> Thanks,
>> Jike
>>
>>
>> diff --git a/hw/vfio/pci-quirks.c b/hw/vfio/pci-quirks.c
>> index bec694c..f715d37 100644
>> --- a/hw/vfio/pci-quirks.c
>> +++ b/hw/vfio/pci-quirks.c
>> @@ -10,12 +10,14 @@
>>   * the COPYING file in the top-level directory.
>>   */
>>  
>> +#include <sys/ioctl.h>
>>  #include "qemu/osdep.h"
>>  #include "qemu/error-report.h"
>>  #include "qemu/range.h"
>>  #include "qapi/error.h"
>>  #include "hw/nvram/fw_cfg.h"
>>  #include "pci.h"
>> +#include "sysemu/kvm.h"
>>  #include "trace.h"
>>  
>>  /* Use uin32_t for vendor & device so PCI_ANY_ID expands and cannot match hw */
>> @@ -1844,3 +1846,15 @@ void vfio_setup_resetfn_quirk(VFIOPCIDevice *vdev)
>>          break;
>>      }
>>  }
>> +
>> +void vfio_quirk_kvmgt(VFIOPCIDevice *vdev)
>> +{
>> +    int vmfd;
>> +
>> +    if (!kvm_enabled() || !vdev->kvmgt)
>> +        return;
>> +
>> +    /* Tell the device what KVM it attached */
>> +    vmfd = kvm_get_vmfd(kvm_state);
>> +    ioctl(vdev->vbasedev.fd, VFIO_SET_KVMFD, vmfd);
>> +}
> 
> vfio_kvm_device_add_group is already telling the group id file
> descriptor to KVM.  You can use that existing hook (whose kernel side is
> virt/kvm/vfio.c).

Thanks for quick reply. I'll do some homework and report back :)

--
Thanks,
Jike
Alex Williamson Oct. 14, 2016, 2:41 p.m. UTC | #3
On Fri, 14 Oct 2016 18:37:45 +0800
Jike Song <jike.song@intel.com> wrote:

> On 10/11/2016 05:47 PM, Paolo Bonzini wrote:
> > 
> > 
> > On 11/10/2016 11:21, Xiao Guangrong wrote:  
> >>
> >>
> >> On 10/11/2016 04:54 PM, Paolo Bonzini wrote:  
> >>>
> >>>
> >>> On 11/10/2016 04:39, Xiao Guangrong wrote:  
> >>>>
> >>>>
> >>>> On 10/11/2016 02:32 AM, Paolo Bonzini wrote:  
> >>>>>
> >>>>>
> >>>>> On 10/10/2016 20:01, Neo Jia wrote:  
> >>>>>>> Hi Neo,
> >>>>>>>
> >>>>>>> AFAIK this is needed because KVMGT doesn't paravirtualize the PPGTT,
> >>>>>>> while nVidia does.  
> >>>>>>
> >>>>>> Hi Paolo and Xiaoguang,
> >>>>>>
> >>>>>> I am just wondering how device driver can register a notifier so he
> >>>>>> can be
> >>>>>> notified for write-protected pages when writes are happening.  
> >>>>>
> >>>>> It can't yet, but the API is ready for that.  kvm_vfio_set_group is
> >>>>> currently where a struct kvm_device* and struct vfio_group* touch.
> >>>>> Given
> >>>>> a struct kvm_device*, dev->kvm provides the struct kvm to be passed to
> >>>>> kvm_page_track_register_notifier.  So I guess you could add a callback
> >>>>> that passes the struct kvm_device* to the mdev device.
> >>>>>
> >>>>> Xiaoguang and Guangrong, what were your plans?  We discussed it briefly
> >>>>> at KVM Forum but I don't remember the details.  
> >>>>
> >>>> Your suggestion was that pass kvm fd to KVMGT via VFIO, so that we can
> >>>> figure out the kvm instance based on the fd.
> >>>>
> >>>> We got a new idea, how about search the kvm instance by mm_struct, it
> >>>> can work as KVMGT is running in the vcpu context and it is much more
> >>>> straightforward.  
> >>>
> >>> Perhaps I didn't understand your suggestion, but the same mm_struct can
> >>> have more than 1 struct kvm so I'm not sure that it can work.  
> >>
> >> vcpu->pid is valid during vcpu running so that it can be used to figure
> >> out which kvm instance owns the vcpu whose pid is the one as current
> >> thread, i think it can work. :)  
> > 
> > No, don't do that.  There's no reason for a thread to run a single VCPU,
> > and if you can have multiple VCPUs you can also have multiple VCPUs from
> > multiple VMs.
> > 
> > Passing file descriptors around are the right way to connect subsystems.  
> 
> [CC Alex, Kevin and Qemu-devel]
> 
> Hi Paolo & Alex,
> 
> IIUC, passing file descriptors means touching QEMU and the UAPI between
> QEMU and VFIO. Would you guys have a look at below draft patch? If it's
> on the correct direction, I'll send the split ones. Thanks!
> 
> --
> Thanks,
> Jike
> 
> 
> diff --git a/hw/vfio/pci-quirks.c b/hw/vfio/pci-quirks.c
> index bec694c..f715d37 100644
> --- a/hw/vfio/pci-quirks.c
> +++ b/hw/vfio/pci-quirks.c
> @@ -10,12 +10,14 @@
>   * the COPYING file in the top-level directory.
>   */
>  
> +#include <sys/ioctl.h>
>  #include "qemu/osdep.h"
>  #include "qemu/error-report.h"
>  #include "qemu/range.h"
>  #include "qapi/error.h"
>  #include "hw/nvram/fw_cfg.h"
>  #include "pci.h"
> +#include "sysemu/kvm.h"
>  #include "trace.h"
>  
>  /* Use uin32_t for vendor & device so PCI_ANY_ID expands and cannot match hw */
> @@ -1844,3 +1846,15 @@ void vfio_setup_resetfn_quirk(VFIOPCIDevice *vdev)
>          break;
>      }
>  }
> +
> +void vfio_quirk_kvmgt(VFIOPCIDevice *vdev)
> +{
> +    int vmfd;
> +
> +    if (!kvm_enabled() || !vdev->kvmgt)
> +        return;
> +
> +    /* Tell the device what KVM it attached */
> +    vmfd = kvm_get_vmfd(kvm_state);
> +    ioctl(vdev->vbasedev.fd, VFIO_SET_KVMFD, vmfd);
> +}
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index a5a620a..8732552 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -2561,6 +2561,8 @@ static int vfio_initfn(PCIDevice *pdev)
>          return ret;
>      }
>  
> +    vfio_quirk_kvmgt(vdev);
> +
>      /* Get a copy of config space */
>      ret = pread(vdev->vbasedev.fd, vdev->pdev.config,
>                  MIN(pci_config_size(&vdev->pdev), vdev->config_size),
> @@ -2832,6 +2834,7 @@ static Property vfio_pci_dev_properties[] = {
>      DEFINE_PROP_UINT32("x-pci-sub-device-id", VFIOPCIDevice,
>                         sub_device_id, PCI_ANY_ID),
>      DEFINE_PROP_UINT32("x-igd-gms", VFIOPCIDevice, igd_gms, 0),
> +    DEFINE_PROP_BOOL("kvmgt", VFIOPCIDevice, kvmgt, false),

Just a side note, device options are a headache, users are prone to get
them wrong and minimally it requires an entire round to get libvirt
support.  We should be able to detect from the device or vfio API
whether such a call is required.  Obviously if we can use the existing
kvm-vfio device, that's the better option anyway.  Thanks,

Alex

>      /*
>       * TODO - support passed fds... is this necessary?
>       * DEFINE_PROP_STRING("vfiofd", VFIOPCIDevice, vfiofd_name),
> diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
> index 7d482d9..813832c 100644
> --- a/hw/vfio/pci.h
> +++ b/hw/vfio/pci.h
> @@ -143,6 +143,7 @@ typedef struct VFIOPCIDevice {
>      bool no_kvm_intx;
>      bool no_kvm_msi;
>      bool no_kvm_msix;
> +    bool kvmgt;
>  } VFIOPCIDevice;
>  
>  uint32_t vfio_pci_read_config(PCIDevice *pdev, uint32_t addr, int len);
> @@ -166,4 +167,6 @@ int vfio_populate_vga(VFIOPCIDevice *vdev);
>  int vfio_pci_igd_opregion_init(VFIOPCIDevice *vdev,
>                                 struct vfio_region_info *info);
>  
> +void vfio_quirk_kvmgt(VFIOPCIDevice *vdev);
> +
>  #endif /* HW_VFIO_VFIO_PCI_H */
> diff --git a/include/sysemu/kvm.h b/include/sysemu/kvm.h
> index df67cc0..dd8320a 100644
> --- a/include/sysemu/kvm.h
> +++ b/include/sysemu/kvm.h
> @@ -254,6 +254,7 @@ void phys_mem_set_alloc(void *(*alloc)(size_t, uint64_t *align));
>  int kvm_ioctl(KVMState *s, int type, ...);
>  
>  int kvm_vm_ioctl(KVMState *s, int type, ...);
> +int kvm_get_vmfd(KVMState *s);
>  
>  int kvm_vcpu_ioctl(CPUState *cpu, int type, ...);
>  
> diff --git a/kvm-all.c b/kvm-all.c
> index efb5fe3..bd72ce3 100644
> --- a/kvm-all.c
> +++ b/kvm-all.c
> @@ -2065,6 +2065,11 @@ int kvm_vm_ioctl(KVMState *s, int type, ...)
>      return ret;
>  }
>  
> +int kvm_get_vmfd(KVMState *s)
> +{
> +	return s->vmfd;
> +}
> +
>  int kvm_vcpu_ioctl(CPUState *cpu, int type, ...)
>  {
>      int ret;
> diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
> index 759b850..952303f 100644
> --- a/linux-headers/linux/vfio.h
> +++ b/linux-headers/linux/vfio.h
> @@ -686,6 +686,12 @@ struct vfio_iommu_spapr_tce_remove {
>  };
>  #define VFIO_IOMMU_SPAPR_TCE_REMOVE	_IO(VFIO_TYPE, VFIO_BASE + 20)
>  
> +
> +/**
> + * VFIO_SET_KVMFD - _IO(VFIO_TYPE, VFIO_BASE + 21, __u32)
> + */
> +#define VFIO_SET_KVMFD		_IO(VFIO_TYPE, VFIO_BASE + 21)
> +
>  /* ***************************************************************** */
>  
>  #endif /* VFIO_H */
>
Alex Williamson Oct. 14, 2016, 2:46 p.m. UTC | #4
On Fri, 14 Oct 2016 08:41:58 -0600
Alex Williamson <alex.williamson@redhat.com> wrote:

> On Fri, 14 Oct 2016 18:37:45 +0800
> Jike Song <jike.song@intel.com> wrote:
> 
> > On 10/11/2016 05:47 PM, Paolo Bonzini wrote:  
> > > 
> > > 
> > > On 11/10/2016 11:21, Xiao Guangrong wrote:    
> > >>
> > >>
> > >> On 10/11/2016 04:54 PM, Paolo Bonzini wrote:    
> > >>>
> > >>>
> > >>> On 11/10/2016 04:39, Xiao Guangrong wrote:    
> > >>>>
> > >>>>
> > >>>> On 10/11/2016 02:32 AM, Paolo Bonzini wrote:    
> > >>>>>
> > >>>>>
> > >>>>> On 10/10/2016 20:01, Neo Jia wrote:    
> > >>>>>>> Hi Neo,
> > >>>>>>>
> > >>>>>>> AFAIK this is needed because KVMGT doesn't paravirtualize the PPGTT,
> > >>>>>>> while nVidia does.    
> > >>>>>>
> > >>>>>> Hi Paolo and Xiaoguang,
> > >>>>>>
> > >>>>>> I am just wondering how device driver can register a notifier so he
> > >>>>>> can be
> > >>>>>> notified for write-protected pages when writes are happening.    
> > >>>>>
> > >>>>> It can't yet, but the API is ready for that.  kvm_vfio_set_group is
> > >>>>> currently where a struct kvm_device* and struct vfio_group* touch.
> > >>>>> Given
> > >>>>> a struct kvm_device*, dev->kvm provides the struct kvm to be passed to
> > >>>>> kvm_page_track_register_notifier.  So I guess you could add a callback
> > >>>>> that passes the struct kvm_device* to the mdev device.
> > >>>>>
> > >>>>> Xiaoguang and Guangrong, what were your plans?  We discussed it briefly
> > >>>>> at KVM Forum but I don't remember the details.    
> > >>>>
> > >>>> Your suggestion was that pass kvm fd to KVMGT via VFIO, so that we can
> > >>>> figure out the kvm instance based on the fd.
> > >>>>
> > >>>> We got a new idea, how about search the kvm instance by mm_struct, it
> > >>>> can work as KVMGT is running in the vcpu context and it is much more
> > >>>> straightforward.    
> > >>>
> > >>> Perhaps I didn't understand your suggestion, but the same mm_struct can
> > >>> have more than 1 struct kvm so I'm not sure that it can work.    
> > >>
> > >> vcpu->pid is valid during vcpu running so that it can be used to figure
> > >> out which kvm instance owns the vcpu whose pid is the one as current
> > >> thread, i think it can work. :)    
> > > 
> > > No, don't do that.  There's no reason for a thread to run a single VCPU,
> > > and if you can have multiple VCPUs you can also have multiple VCPUs from
> > > multiple VMs.
> > > 
> > > Passing file descriptors around are the right way to connect subsystems.    
> > 
> > [CC Alex, Kevin and Qemu-devel]
> > 
> > Hi Paolo & Alex,
> > 
> > IIUC, passing file descriptors means touching QEMU and the UAPI between
> > QEMU and VFIO. Would you guys have a look at below draft patch? If it's
> > on the correct direction, I'll send the split ones. Thanks!
> > 
> > --
> > Thanks,
> > Jike
> > 
> > 
> > diff --git a/hw/vfio/pci-quirks.c b/hw/vfio/pci-quirks.c
> > index bec694c..f715d37 100644
> > --- a/hw/vfio/pci-quirks.c
> > +++ b/hw/vfio/pci-quirks.c
> > @@ -10,12 +10,14 @@
> >   * the COPYING file in the top-level directory.
> >   */
> >  
> > +#include <sys/ioctl.h>
> >  #include "qemu/osdep.h"
> >  #include "qemu/error-report.h"
> >  #include "qemu/range.h"
> >  #include "qapi/error.h"
> >  #include "hw/nvram/fw_cfg.h"
> >  #include "pci.h"
> > +#include "sysemu/kvm.h"
> >  #include "trace.h"
> >  
> >  /* Use uin32_t for vendor & device so PCI_ANY_ID expands and cannot match hw */
> > @@ -1844,3 +1846,15 @@ void vfio_setup_resetfn_quirk(VFIOPCIDevice *vdev)
> >          break;
> >      }
> >  }
> > +
> > +void vfio_quirk_kvmgt(VFIOPCIDevice *vdev)
> > +{
> > +    int vmfd;
> > +
> > +    if (!kvm_enabled() || !vdev->kvmgt)
> > +        return;
> > +
> > +    /* Tell the device what KVM it attached */
> > +    vmfd = kvm_get_vmfd(kvm_state);
> > +    ioctl(vdev->vbasedev.fd, VFIO_SET_KVMFD, vmfd);
> > +}
> > diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> > index a5a620a..8732552 100644
> > --- a/hw/vfio/pci.c
> > +++ b/hw/vfio/pci.c
> > @@ -2561,6 +2561,8 @@ static int vfio_initfn(PCIDevice *pdev)
> >          return ret;
> >      }
> >  
> > +    vfio_quirk_kvmgt(vdev);
> > +
> >      /* Get a copy of config space */
> >      ret = pread(vdev->vbasedev.fd, vdev->pdev.config,
> >                  MIN(pci_config_size(&vdev->pdev), vdev->config_size),
> > @@ -2832,6 +2834,7 @@ static Property vfio_pci_dev_properties[] = {
> >      DEFINE_PROP_UINT32("x-pci-sub-device-id", VFIOPCIDevice,
> >                         sub_device_id, PCI_ANY_ID),
> >      DEFINE_PROP_UINT32("x-igd-gms", VFIOPCIDevice, igd_gms, 0),
> > +    DEFINE_PROP_BOOL("kvmgt", VFIOPCIDevice, kvmgt, false),  
> 
> Just a side note, device options are a headache, users are prone to get
> them wrong and minimally it requires an entire round to get libvirt
> support.  We should be able to detect from the device or vfio API
> whether such a call is required.  Obviously if we can use the existing
> kvm-vfio device, that's the better option anyway.  Thanks,

Also, vfio devices currently have no hard dependencies on KVM, if kvmgt
does, it needs to produce a device failure when unavailable.  Thanks,

Alex

> >      /*
> >       * TODO - support passed fds... is this necessary?
> >       * DEFINE_PROP_STRING("vfiofd", VFIOPCIDevice, vfiofd_name),
> > diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
> > index 7d482d9..813832c 100644
> > --- a/hw/vfio/pci.h
> > +++ b/hw/vfio/pci.h
> > @@ -143,6 +143,7 @@ typedef struct VFIOPCIDevice {
> >      bool no_kvm_intx;
> >      bool no_kvm_msi;
> >      bool no_kvm_msix;
> > +    bool kvmgt;
> >  } VFIOPCIDevice;
> >  
> >  uint32_t vfio_pci_read_config(PCIDevice *pdev, uint32_t addr, int len);
> > @@ -166,4 +167,6 @@ int vfio_populate_vga(VFIOPCIDevice *vdev);
> >  int vfio_pci_igd_opregion_init(VFIOPCIDevice *vdev,
> >                                 struct vfio_region_info *info);
> >  
> > +void vfio_quirk_kvmgt(VFIOPCIDevice *vdev);
> > +
> >  #endif /* HW_VFIO_VFIO_PCI_H */
> > diff --git a/include/sysemu/kvm.h b/include/sysemu/kvm.h
> > index df67cc0..dd8320a 100644
> > --- a/include/sysemu/kvm.h
> > +++ b/include/sysemu/kvm.h
> > @@ -254,6 +254,7 @@ void phys_mem_set_alloc(void *(*alloc)(size_t, uint64_t *align));
> >  int kvm_ioctl(KVMState *s, int type, ...);
> >  
> >  int kvm_vm_ioctl(KVMState *s, int type, ...);
> > +int kvm_get_vmfd(KVMState *s);
> >  
> >  int kvm_vcpu_ioctl(CPUState *cpu, int type, ...);
> >  
> > diff --git a/kvm-all.c b/kvm-all.c
> > index efb5fe3..bd72ce3 100644
> > --- a/kvm-all.c
> > +++ b/kvm-all.c
> > @@ -2065,6 +2065,11 @@ int kvm_vm_ioctl(KVMState *s, int type, ...)
> >      return ret;
> >  }
> >  
> > +int kvm_get_vmfd(KVMState *s)
> > +{
> > +	return s->vmfd;
> > +}
> > +
> >  int kvm_vcpu_ioctl(CPUState *cpu, int type, ...)
> >  {
> >      int ret;
> > diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
> > index 759b850..952303f 100644
> > --- a/linux-headers/linux/vfio.h
> > +++ b/linux-headers/linux/vfio.h
> > @@ -686,6 +686,12 @@ struct vfio_iommu_spapr_tce_remove {
> >  };
> >  #define VFIO_IOMMU_SPAPR_TCE_REMOVE	_IO(VFIO_TYPE, VFIO_BASE + 20)
> >  
> > +
> > +/**
> > + * VFIO_SET_KVMFD - _IO(VFIO_TYPE, VFIO_BASE + 21, __u32)
> > + */
> > +#define VFIO_SET_KVMFD		_IO(VFIO_TYPE, VFIO_BASE + 21)
> > +
> >  /* ***************************************************************** */
> >  
> >  #endif /* VFIO_H */
> >   
>
Neo Jia Oct. 14, 2016, 4:35 p.m. UTC | #5
On Fri, Oct 14, 2016 at 08:46:01AM -0600, Alex Williamson wrote:
> On Fri, 14 Oct 2016 08:41:58 -0600
> Alex Williamson <alex.williamson@redhat.com> wrote:
> 
> > On Fri, 14 Oct 2016 18:37:45 +0800
> > Jike Song <jike.song@intel.com> wrote:
> > 
> > > On 10/11/2016 05:47 PM, Paolo Bonzini wrote:  
> > > > 
> > > > 
> > > > On 11/10/2016 11:21, Xiao Guangrong wrote:    
> > > >>
> > > >>
> > > >> On 10/11/2016 04:54 PM, Paolo Bonzini wrote:    
> > > >>>
> > > >>>
> > > >>> On 11/10/2016 04:39, Xiao Guangrong wrote:    
> > > >>>>
> > > >>>>
> > > >>>> On 10/11/2016 02:32 AM, Paolo Bonzini wrote:    
> > > >>>>>
> > > >>>>>
> > > >>>>> On 10/10/2016 20:01, Neo Jia wrote:    
> > > >>>>>>> Hi Neo,
> > > >>>>>>>
> > > >>>>>>> AFAIK this is needed because KVMGT doesn't paravirtualize the PPGTT,
> > > >>>>>>> while nVidia does.    
> > > >>>>>>
> > > >>>>>> Hi Paolo and Xiaoguang,
> > > >>>>>>
> > > >>>>>> I am just wondering how device driver can register a notifier so he
> > > >>>>>> can be
> > > >>>>>> notified for write-protected pages when writes are happening.    
> > > >>>>>
> > > >>>>> It can't yet, but the API is ready for that.  kvm_vfio_set_group is
> > > >>>>> currently where a struct kvm_device* and struct vfio_group* touch.
> > > >>>>> Given
> > > >>>>> a struct kvm_device*, dev->kvm provides the struct kvm to be passed to
> > > >>>>> kvm_page_track_register_notifier.  So I guess you could add a callback
> > > >>>>> that passes the struct kvm_device* to the mdev device.
> > > >>>>>
> > > >>>>> Xiaoguang and Guangrong, what were your plans?  We discussed it briefly
> > > >>>>> at KVM Forum but I don't remember the details.    
> > > >>>>
> > > >>>> Your suggestion was that pass kvm fd to KVMGT via VFIO, so that we can
> > > >>>> figure out the kvm instance based on the fd.
> > > >>>>
> > > >>>> We got a new idea, how about search the kvm instance by mm_struct, it
> > > >>>> can work as KVMGT is running in the vcpu context and it is much more
> > > >>>> straightforward.    
> > > >>>
> > > >>> Perhaps I didn't understand your suggestion, but the same mm_struct can
> > > >>> have more than 1 struct kvm so I'm not sure that it can work.    
> > > >>
> > > >> vcpu->pid is valid during vcpu running so that it can be used to figure
> > > >> out which kvm instance owns the vcpu whose pid is the one as current
> > > >> thread, i think it can work. :)    
> > > > 
> > > > No, don't do that.  There's no reason for a thread to run a single VCPU,
> > > > and if you can have multiple VCPUs you can also have multiple VCPUs from
> > > > multiple VMs.
> > > > 
> > > > Passing file descriptors around are the right way to connect subsystems.    
> > > 
> > > [CC Alex, Kevin and Qemu-devel]
> > > 
> > > Hi Paolo & Alex,
> > > 
> > > IIUC, passing file descriptors means touching QEMU and the UAPI between
> > > QEMU and VFIO. Would you guys have a look at below draft patch? If it's
> > > on the correct direction, I'll send the split ones. Thanks!
> > > 
> > > --
> > > Thanks,
> > > Jike
> > > 
> > > 
> > > diff --git a/hw/vfio/pci-quirks.c b/hw/vfio/pci-quirks.c
> > > index bec694c..f715d37 100644
> > > --- a/hw/vfio/pci-quirks.c
> > > +++ b/hw/vfio/pci-quirks.c
> > > @@ -10,12 +10,14 @@
> > >   * the COPYING file in the top-level directory.
> > >   */
> > >  
> > > +#include <sys/ioctl.h>
> > >  #include "qemu/osdep.h"
> > >  #include "qemu/error-report.h"
> > >  #include "qemu/range.h"
> > >  #include "qapi/error.h"
> > >  #include "hw/nvram/fw_cfg.h"
> > >  #include "pci.h"
> > > +#include "sysemu/kvm.h"
> > >  #include "trace.h"
> > >  
> > >  /* Use uin32_t for vendor & device so PCI_ANY_ID expands and cannot match hw */
> > > @@ -1844,3 +1846,15 @@ void vfio_setup_resetfn_quirk(VFIOPCIDevice *vdev)
> > >          break;
> > >      }
> > >  }
> > > +
> > > +void vfio_quirk_kvmgt(VFIOPCIDevice *vdev)
> > > +{
> > > +    int vmfd;
> > > +
> > > +    if (!kvm_enabled() || !vdev->kvmgt)
> > > +        return;
> > > +
> > > +    /* Tell the device what KVM it attached */
> > > +    vmfd = kvm_get_vmfd(kvm_state);
> > > +    ioctl(vdev->vbasedev.fd, VFIO_SET_KVMFD, vmfd);
> > > +}
> > > diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> > > index a5a620a..8732552 100644
> > > --- a/hw/vfio/pci.c
> > > +++ b/hw/vfio/pci.c
> > > @@ -2561,6 +2561,8 @@ static int vfio_initfn(PCIDevice *pdev)
> > >          return ret;
> > >      }
> > >  
> > > +    vfio_quirk_kvmgt(vdev);
> > > +
> > >      /* Get a copy of config space */
> > >      ret = pread(vdev->vbasedev.fd, vdev->pdev.config,
> > >                  MIN(pci_config_size(&vdev->pdev), vdev->config_size),
> > > @@ -2832,6 +2834,7 @@ static Property vfio_pci_dev_properties[] = {
> > >      DEFINE_PROP_UINT32("x-pci-sub-device-id", VFIOPCIDevice,
> > >                         sub_device_id, PCI_ANY_ID),
> > >      DEFINE_PROP_UINT32("x-igd-gms", VFIOPCIDevice, igd_gms, 0),
> > > +    DEFINE_PROP_BOOL("kvmgt", VFIOPCIDevice, kvmgt, false),  
> > 
> > Just a side note, device options are a headache, users are prone to get
> > them wrong and minimally it requires an entire round to get libvirt
> > support.  We should be able to detect from the device or vfio API
> > whether such a call is required.  Obviously if we can use the existing
> > kvm-vfio device, that's the better option anyway.  Thanks,
> 
> Also, vfio devices currently have no hard dependencies on KVM, if kvmgt
> does, it needs to produce a device failure when unavailable.  Thanks,

Also, I would like to see this as an generic feature instead of
kvmgt specific interface, so we don't have to add new options to QEMU and it is
up to the vendor driver to proceed with or without it.

Thanks,
Neo

> 
> Alex
> 
> > >      /*
> > >       * TODO - support passed fds... is this necessary?
> > >       * DEFINE_PROP_STRING("vfiofd", VFIOPCIDevice, vfiofd_name),
> > > diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
> > > index 7d482d9..813832c 100644
> > > --- a/hw/vfio/pci.h
> > > +++ b/hw/vfio/pci.h
> > > @@ -143,6 +143,7 @@ typedef struct VFIOPCIDevice {
> > >      bool no_kvm_intx;
> > >      bool no_kvm_msi;
> > >      bool no_kvm_msix;
> > > +    bool kvmgt;
> > >  } VFIOPCIDevice;
> > >  
> > >  uint32_t vfio_pci_read_config(PCIDevice *pdev, uint32_t addr, int len);
> > > @@ -166,4 +167,6 @@ int vfio_populate_vga(VFIOPCIDevice *vdev);
> > >  int vfio_pci_igd_opregion_init(VFIOPCIDevice *vdev,
> > >                                 struct vfio_region_info *info);
> > >  
> > > +void vfio_quirk_kvmgt(VFIOPCIDevice *vdev);
> > > +
> > >  #endif /* HW_VFIO_VFIO_PCI_H */
> > > diff --git a/include/sysemu/kvm.h b/include/sysemu/kvm.h
> > > index df67cc0..dd8320a 100644
> > > --- a/include/sysemu/kvm.h
> > > +++ b/include/sysemu/kvm.h
> > > @@ -254,6 +254,7 @@ void phys_mem_set_alloc(void *(*alloc)(size_t, uint64_t *align));
> > >  int kvm_ioctl(KVMState *s, int type, ...);
> > >  
> > >  int kvm_vm_ioctl(KVMState *s, int type, ...);
> > > +int kvm_get_vmfd(KVMState *s);
> > >  
> > >  int kvm_vcpu_ioctl(CPUState *cpu, int type, ...);
> > >  
> > > diff --git a/kvm-all.c b/kvm-all.c
> > > index efb5fe3..bd72ce3 100644
> > > --- a/kvm-all.c
> > > +++ b/kvm-all.c
> > > @@ -2065,6 +2065,11 @@ int kvm_vm_ioctl(KVMState *s, int type, ...)
> > >      return ret;
> > >  }
> > >  
> > > +int kvm_get_vmfd(KVMState *s)
> > > +{
> > > +	return s->vmfd;
> > > +}
> > > +
> > >  int kvm_vcpu_ioctl(CPUState *cpu, int type, ...)
> > >  {
> > >      int ret;
> > > diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
> > > index 759b850..952303f 100644
> > > --- a/linux-headers/linux/vfio.h
> > > +++ b/linux-headers/linux/vfio.h
> > > @@ -686,6 +686,12 @@ struct vfio_iommu_spapr_tce_remove {
> > >  };
> > >  #define VFIO_IOMMU_SPAPR_TCE_REMOVE	_IO(VFIO_TYPE, VFIO_BASE + 20)
> > >  
> > > +
> > > +/**
> > > + * VFIO_SET_KVMFD - _IO(VFIO_TYPE, VFIO_BASE + 21, __u32)
> > > + */
> > > +#define VFIO_SET_KVMFD		_IO(VFIO_TYPE, VFIO_BASE + 21)
> > > +
> > >  /* ***************************************************************** */
> > >  
> > >  #endif /* VFIO_H */
> > >   
> > 
>
Alex Williamson Oct. 14, 2016, 4:51 p.m. UTC | #6
On Fri, 14 Oct 2016 09:35:45 -0700
Neo Jia <cjia@nvidia.com> wrote:

> On Fri, Oct 14, 2016 at 08:46:01AM -0600, Alex Williamson wrote:
> > On Fri, 14 Oct 2016 08:41:58 -0600
> > Alex Williamson <alex.williamson@redhat.com> wrote:
> >   
> > > On Fri, 14 Oct 2016 18:37:45 +0800
> > > Jike Song <jike.song@intel.com> wrote:
> > >   
> > > > On 10/11/2016 05:47 PM, Paolo Bonzini wrote:    
> > > > > 
> > > > > 
> > > > > On 11/10/2016 11:21, Xiao Guangrong wrote:      
> > > > >>
> > > > >>
> > > > >> On 10/11/2016 04:54 PM, Paolo Bonzini wrote:      
> > > > >>>
> > > > >>>
> > > > >>> On 11/10/2016 04:39, Xiao Guangrong wrote:      
> > > > >>>>
> > > > >>>>
> > > > >>>> On 10/11/2016 02:32 AM, Paolo Bonzini wrote:      
> > > > >>>>>
> > > > >>>>>
> > > > >>>>> On 10/10/2016 20:01, Neo Jia wrote:      
> > > > >>>>>>> Hi Neo,
> > > > >>>>>>>
> > > > >>>>>>> AFAIK this is needed because KVMGT doesn't paravirtualize the PPGTT,
> > > > >>>>>>> while nVidia does.      
> > > > >>>>>>
> > > > >>>>>> Hi Paolo and Xiaoguang,
> > > > >>>>>>
> > > > >>>>>> I am just wondering how device driver can register a notifier so he
> > > > >>>>>> can be
> > > > >>>>>> notified for write-protected pages when writes are happening.      
> > > > >>>>>
> > > > >>>>> It can't yet, but the API is ready for that.  kvm_vfio_set_group is
> > > > >>>>> currently where a struct kvm_device* and struct vfio_group* touch.
> > > > >>>>> Given
> > > > >>>>> a struct kvm_device*, dev->kvm provides the struct kvm to be passed to
> > > > >>>>> kvm_page_track_register_notifier.  So I guess you could add a callback
> > > > >>>>> that passes the struct kvm_device* to the mdev device.
> > > > >>>>>
> > > > >>>>> Xiaoguang and Guangrong, what were your plans?  We discussed it briefly
> > > > >>>>> at KVM Forum but I don't remember the details.      
> > > > >>>>
> > > > >>>> Your suggestion was that pass kvm fd to KVMGT via VFIO, so that we can
> > > > >>>> figure out the kvm instance based on the fd.
> > > > >>>>
> > > > >>>> We got a new idea, how about search the kvm instance by mm_struct, it
> > > > >>>> can work as KVMGT is running in the vcpu context and it is much more
> > > > >>>> straightforward.      
> > > > >>>
> > > > >>> Perhaps I didn't understand your suggestion, but the same mm_struct can
> > > > >>> have more than 1 struct kvm so I'm not sure that it can work.      
> > > > >>
> > > > >> vcpu->pid is valid during vcpu running so that it can be used to figure
> > > > >> out which kvm instance owns the vcpu whose pid is the one as current
> > > > >> thread, i think it can work. :)      
> > > > > 
> > > > > No, don't do that.  There's no reason for a thread to run a single VCPU,
> > > > > and if you can have multiple VCPUs you can also have multiple VCPUs from
> > > > > multiple VMs.
> > > > > 
> > > > > Passing file descriptors around are the right way to connect subsystems.      
> > > > 
> > > > [CC Alex, Kevin and Qemu-devel]
> > > > 
> > > > Hi Paolo & Alex,
> > > > 
> > > > IIUC, passing file descriptors means touching QEMU and the UAPI between
> > > > QEMU and VFIO. Would you guys have a look at below draft patch? If it's
> > > > on the correct direction, I'll send the split ones. Thanks!
> > > > 
> > > > --
> > > > Thanks,
> > > > Jike
> > > > 
> > > > 
> > > > diff --git a/hw/vfio/pci-quirks.c b/hw/vfio/pci-quirks.c
> > > > index bec694c..f715d37 100644
> > > > --- a/hw/vfio/pci-quirks.c
> > > > +++ b/hw/vfio/pci-quirks.c
> > > > @@ -10,12 +10,14 @@
> > > >   * the COPYING file in the top-level directory.
> > > >   */
> > > >  
> > > > +#include <sys/ioctl.h>
> > > >  #include "qemu/osdep.h"
> > > >  #include "qemu/error-report.h"
> > > >  #include "qemu/range.h"
> > > >  #include "qapi/error.h"
> > > >  #include "hw/nvram/fw_cfg.h"
> > > >  #include "pci.h"
> > > > +#include "sysemu/kvm.h"
> > > >  #include "trace.h"
> > > >  
> > > >  /* Use uin32_t for vendor & device so PCI_ANY_ID expands and cannot match hw */
> > > > @@ -1844,3 +1846,15 @@ void vfio_setup_resetfn_quirk(VFIOPCIDevice *vdev)
> > > >          break;
> > > >      }
> > > >  }
> > > > +
> > > > +void vfio_quirk_kvmgt(VFIOPCIDevice *vdev)
> > > > +{
> > > > +    int vmfd;
> > > > +
> > > > +    if (!kvm_enabled() || !vdev->kvmgt)
> > > > +        return;
> > > > +
> > > > +    /* Tell the device what KVM it attached */
> > > > +    vmfd = kvm_get_vmfd(kvm_state);
> > > > +    ioctl(vdev->vbasedev.fd, VFIO_SET_KVMFD, vmfd);
> > > > +}
> > > > diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> > > > index a5a620a..8732552 100644
> > > > --- a/hw/vfio/pci.c
> > > > +++ b/hw/vfio/pci.c
> > > > @@ -2561,6 +2561,8 @@ static int vfio_initfn(PCIDevice *pdev)
> > > >          return ret;
> > > >      }
> > > >  
> > > > +    vfio_quirk_kvmgt(vdev);
> > > > +
> > > >      /* Get a copy of config space */
> > > >      ret = pread(vdev->vbasedev.fd, vdev->pdev.config,
> > > >                  MIN(pci_config_size(&vdev->pdev), vdev->config_size),
> > > > @@ -2832,6 +2834,7 @@ static Property vfio_pci_dev_properties[] = {
> > > >      DEFINE_PROP_UINT32("x-pci-sub-device-id", VFIOPCIDevice,
> > > >                         sub_device_id, PCI_ANY_ID),
> > > >      DEFINE_PROP_UINT32("x-igd-gms", VFIOPCIDevice, igd_gms, 0),
> > > > +    DEFINE_PROP_BOOL("kvmgt", VFIOPCIDevice, kvmgt, false),    
> > > 
> > > Just a side note, device options are a headache, users are prone to get
> > > them wrong and minimally it requires an entire round to get libvirt
> > > support.  We should be able to detect from the device or vfio API
> > > whether such a call is required.  Obviously if we can use the existing
> > > kvm-vfio device, that's the better option anyway.  Thanks,  
> > 
> > Also, vfio devices currently have no hard dependencies on KVM, if kvmgt
> > does, it needs to produce a device failure when unavailable.  Thanks,  
> 
> Also, I would like to see this as an generic feature instead of
> kvmgt specific interface, so we don't have to add new options to QEMU and it is
> up to the vendor driver to proceed with or without it.

In general this should be decided by lack of some required feature
exclusively provided by KVM.  I would not want to add a generic opt-out
for mdev vendor drivers to decide that they arbitrarily want to disable
that path.  Thanks,

Alex
Neo Jia Oct. 14, 2016, 10:19 p.m. UTC | #7
On Fri, Oct 14, 2016 at 10:51:24AM -0600, Alex Williamson wrote:
> On Fri, 14 Oct 2016 09:35:45 -0700
> Neo Jia <cjia@nvidia.com> wrote:
> 
> > On Fri, Oct 14, 2016 at 08:46:01AM -0600, Alex Williamson wrote:
> > > On Fri, 14 Oct 2016 08:41:58 -0600
> > > Alex Williamson <alex.williamson@redhat.com> wrote:
> > >   
> > > > On Fri, 14 Oct 2016 18:37:45 +0800
> > > > Jike Song <jike.song@intel.com> wrote:
> > > >   
> > > > > On 10/11/2016 05:47 PM, Paolo Bonzini wrote:    
> > > > > > 
> > > > > > 
> > > > > > On 11/10/2016 11:21, Xiao Guangrong wrote:      
> > > > > >>
> > > > > >>
> > > > > >> On 10/11/2016 04:54 PM, Paolo Bonzini wrote:      
> > > > > >>>
> > > > > >>>
> > > > > >>> On 11/10/2016 04:39, Xiao Guangrong wrote:      
> > > > > >>>>
> > > > > >>>>
> > > > > >>>> On 10/11/2016 02:32 AM, Paolo Bonzini wrote:      
> > > > > >>>>>
> > > > > >>>>>
> > > > > >>>>> On 10/10/2016 20:01, Neo Jia wrote:      
> > > > > >>>>>>> Hi Neo,
> > > > > >>>>>>>
> > > > > >>>>>>> AFAIK this is needed because KVMGT doesn't paravirtualize the PPGTT,
> > > > > >>>>>>> while nVidia does.      
> > > > > >>>>>>
> > > > > >>>>>> Hi Paolo and Xiaoguang,
> > > > > >>>>>>
> > > > > >>>>>> I am just wondering how device driver can register a notifier so he
> > > > > >>>>>> can be
> > > > > >>>>>> notified for write-protected pages when writes are happening.      
> > > > > >>>>>
> > > > > >>>>> It can't yet, but the API is ready for that.  kvm_vfio_set_group is
> > > > > >>>>> currently where a struct kvm_device* and struct vfio_group* touch.
> > > > > >>>>> Given
> > > > > >>>>> a struct kvm_device*, dev->kvm provides the struct kvm to be passed to
> > > > > >>>>> kvm_page_track_register_notifier.  So I guess you could add a callback
> > > > > >>>>> that passes the struct kvm_device* to the mdev device.
> > > > > >>>>>
> > > > > >>>>> Xiaoguang and Guangrong, what were your plans?  We discussed it briefly
> > > > > >>>>> at KVM Forum but I don't remember the details.      
> > > > > >>>>
> > > > > >>>> Your suggestion was that pass kvm fd to KVMGT via VFIO, so that we can
> > > > > >>>> figure out the kvm instance based on the fd.
> > > > > >>>>
> > > > > >>>> We got a new idea, how about search the kvm instance by mm_struct, it
> > > > > >>>> can work as KVMGT is running in the vcpu context and it is much more
> > > > > >>>> straightforward.      
> > > > > >>>
> > > > > >>> Perhaps I didn't understand your suggestion, but the same mm_struct can
> > > > > >>> have more than 1 struct kvm so I'm not sure that it can work.      
> > > > > >>
> > > > > >> vcpu->pid is valid during vcpu running so that it can be used to figure
> > > > > >> out which kvm instance owns the vcpu whose pid is the one as current
> > > > > >> thread, i think it can work. :)      
> > > > > > 
> > > > > > No, don't do that.  There's no reason for a thread to run a single VCPU,
> > > > > > and if you can have multiple VCPUs you can also have multiple VCPUs from
> > > > > > multiple VMs.
> > > > > > 
> > > > > > Passing file descriptors around are the right way to connect subsystems.      
> > > > > 
> > > > > [CC Alex, Kevin and Qemu-devel]
> > > > > 
> > > > > Hi Paolo & Alex,
> > > > > 
> > > > > IIUC, passing file descriptors means touching QEMU and the UAPI between
> > > > > QEMU and VFIO. Would you guys have a look at below draft patch? If it's
> > > > > on the correct direction, I'll send the split ones. Thanks!
> > > > > 
> > > > > --
> > > > > Thanks,
> > > > > Jike
> > > > > 
> > > > > 
> > > > > diff --git a/hw/vfio/pci-quirks.c b/hw/vfio/pci-quirks.c
> > > > > index bec694c..f715d37 100644
> > > > > --- a/hw/vfio/pci-quirks.c
> > > > > +++ b/hw/vfio/pci-quirks.c
> > > > > @@ -10,12 +10,14 @@
> > > > >   * the COPYING file in the top-level directory.
> > > > >   */
> > > > >  
> > > > > +#include <sys/ioctl.h>
> > > > >  #include "qemu/osdep.h"
> > > > >  #include "qemu/error-report.h"
> > > > >  #include "qemu/range.h"
> > > > >  #include "qapi/error.h"
> > > > >  #include "hw/nvram/fw_cfg.h"
> > > > >  #include "pci.h"
> > > > > +#include "sysemu/kvm.h"
> > > > >  #include "trace.h"
> > > > >  
> > > > >  /* Use uin32_t for vendor & device so PCI_ANY_ID expands and cannot match hw */
> > > > > @@ -1844,3 +1846,15 @@ void vfio_setup_resetfn_quirk(VFIOPCIDevice *vdev)
> > > > >          break;
> > > > >      }
> > > > >  }
> > > > > +
> > > > > +void vfio_quirk_kvmgt(VFIOPCIDevice *vdev)
> > > > > +{
> > > > > +    int vmfd;
> > > > > +
> > > > > +    if (!kvm_enabled() || !vdev->kvmgt)
> > > > > +        return;
> > > > > +
> > > > > +    /* Tell the device what KVM it attached */
> > > > > +    vmfd = kvm_get_vmfd(kvm_state);
> > > > > +    ioctl(vdev->vbasedev.fd, VFIO_SET_KVMFD, vmfd);
> > > > > +}
> > > > > diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> > > > > index a5a620a..8732552 100644
> > > > > --- a/hw/vfio/pci.c
> > > > > +++ b/hw/vfio/pci.c
> > > > > @@ -2561,6 +2561,8 @@ static int vfio_initfn(PCIDevice *pdev)
> > > > >          return ret;
> > > > >      }
> > > > >  
> > > > > +    vfio_quirk_kvmgt(vdev);
> > > > > +
> > > > >      /* Get a copy of config space */
> > > > >      ret = pread(vdev->vbasedev.fd, vdev->pdev.config,
> > > > >                  MIN(pci_config_size(&vdev->pdev), vdev->config_size),
> > > > > @@ -2832,6 +2834,7 @@ static Property vfio_pci_dev_properties[] = {
> > > > >      DEFINE_PROP_UINT32("x-pci-sub-device-id", VFIOPCIDevice,
> > > > >                         sub_device_id, PCI_ANY_ID),
> > > > >      DEFINE_PROP_UINT32("x-igd-gms", VFIOPCIDevice, igd_gms, 0),
> > > > > +    DEFINE_PROP_BOOL("kvmgt", VFIOPCIDevice, kvmgt, false),    
> > > > 
> > > > Just a side note, device options are a headache, users are prone to get
> > > > them wrong and minimally it requires an entire round to get libvirt
> > > > support.  We should be able to detect from the device or vfio API
> > > > whether such a call is required.  Obviously if we can use the existing
> > > > kvm-vfio device, that's the better option anyway.  Thanks,  
> > > 
> > > Also, vfio devices currently have no hard dependencies on KVM, if kvmgt
> > > does, it needs to produce a device failure when unavailable.  Thanks,  
> > 
> > Also, I would like to see this as an generic feature instead of
> > kvmgt specific interface, so we don't have to add new options to QEMU and it is
> > up to the vendor driver to proceed with or without it.
> 
> In general this should be decided by lack of some required feature
> exclusively provided by KVM.  I would not want to add a generic opt-out
> for mdev vendor drivers to decide that they arbitrarily want to disable
> that path.  Thanks,

IIUC, you are suggesting that this path should be controlled by KVM feature cap
and it will be accessible to VFIO users when such checking is satisfied.

Thanks,
Neo

> 
> Alex
Alex Williamson Oct. 17, 2016, 4:02 p.m. UTC | #8
On Fri, 14 Oct 2016 15:19:01 -0700
Neo Jia <cjia@nvidia.com> wrote:

> On Fri, Oct 14, 2016 at 10:51:24AM -0600, Alex Williamson wrote:
> > On Fri, 14 Oct 2016 09:35:45 -0700
> > Neo Jia <cjia@nvidia.com> wrote:
> >   
> > > On Fri, Oct 14, 2016 at 08:46:01AM -0600, Alex Williamson wrote:  
> > > > On Fri, 14 Oct 2016 08:41:58 -0600
> > > > Alex Williamson <alex.williamson@redhat.com> wrote:
> > > >     
> > > > > On Fri, 14 Oct 2016 18:37:45 +0800
> > > > > Jike Song <jike.song@intel.com> wrote:
> > > > >     
> > > > > > On 10/11/2016 05:47 PM, Paolo Bonzini wrote:      
> > > > > > > 
> > > > > > > 
> > > > > > > On 11/10/2016 11:21, Xiao Guangrong wrote:        
> > > > > > >>
> > > > > > >>
> > > > > > >> On 10/11/2016 04:54 PM, Paolo Bonzini wrote:        
> > > > > > >>>
> > > > > > >>>
> > > > > > >>> On 11/10/2016 04:39, Xiao Guangrong wrote:        
> > > > > > >>>>
> > > > > > >>>>
> > > > > > >>>> On 10/11/2016 02:32 AM, Paolo Bonzini wrote:        
> > > > > > >>>>>
> > > > > > >>>>>
> > > > > > >>>>> On 10/10/2016 20:01, Neo Jia wrote:        
> > > > > > >>>>>>> Hi Neo,
> > > > > > >>>>>>>
> > > > > > >>>>>>> AFAIK this is needed because KVMGT doesn't paravirtualize the PPGTT,
> > > > > > >>>>>>> while nVidia does.        
> > > > > > >>>>>>
> > > > > > >>>>>> Hi Paolo and Xiaoguang,
> > > > > > >>>>>>
> > > > > > >>>>>> I am just wondering how device driver can register a notifier so he
> > > > > > >>>>>> can be
> > > > > > >>>>>> notified for write-protected pages when writes are happening.        
> > > > > > >>>>>
> > > > > > >>>>> It can't yet, but the API is ready for that.  kvm_vfio_set_group is
> > > > > > >>>>> currently where a struct kvm_device* and struct vfio_group* touch.
> > > > > > >>>>> Given
> > > > > > >>>>> a struct kvm_device*, dev->kvm provides the struct kvm to be passed to
> > > > > > >>>>> kvm_page_track_register_notifier.  So I guess you could add a callback
> > > > > > >>>>> that passes the struct kvm_device* to the mdev device.
> > > > > > >>>>>
> > > > > > >>>>> Xiaoguang and Guangrong, what were your plans?  We discussed it briefly
> > > > > > >>>>> at KVM Forum but I don't remember the details.        
> > > > > > >>>>
> > > > > > >>>> Your suggestion was that pass kvm fd to KVMGT via VFIO, so that we can
> > > > > > >>>> figure out the kvm instance based on the fd.
> > > > > > >>>>
> > > > > > >>>> We got a new idea, how about search the kvm instance by mm_struct, it
> > > > > > >>>> can work as KVMGT is running in the vcpu context and it is much more
> > > > > > >>>> straightforward.        
> > > > > > >>>
> > > > > > >>> Perhaps I didn't understand your suggestion, but the same mm_struct can
> > > > > > >>> have more than 1 struct kvm so I'm not sure that it can work.        
> > > > > > >>
> > > > > > >> vcpu->pid is valid during vcpu running so that it can be used to figure
> > > > > > >> out which kvm instance owns the vcpu whose pid is the one as current
> > > > > > >> thread, i think it can work. :)        
> > > > > > > 
> > > > > > > No, don't do that.  There's no reason for a thread to run a single VCPU,
> > > > > > > and if you can have multiple VCPUs you can also have multiple VCPUs from
> > > > > > > multiple VMs.
> > > > > > > 
> > > > > > > Passing file descriptors around are the right way to connect subsystems.        
> > > > > > 
> > > > > > [CC Alex, Kevin and Qemu-devel]
> > > > > > 
> > > > > > Hi Paolo & Alex,
> > > > > > 
> > > > > > IIUC, passing file descriptors means touching QEMU and the UAPI between
> > > > > > QEMU and VFIO. Would you guys have a look at below draft patch? If it's
> > > > > > on the correct direction, I'll send the split ones. Thanks!
> > > > > > 
> > > > > > --
> > > > > > Thanks,
> > > > > > Jike
> > > > > > 
> > > > > > 
> > > > > > diff --git a/hw/vfio/pci-quirks.c b/hw/vfio/pci-quirks.c
> > > > > > index bec694c..f715d37 100644
> > > > > > --- a/hw/vfio/pci-quirks.c
> > > > > > +++ b/hw/vfio/pci-quirks.c
> > > > > > @@ -10,12 +10,14 @@
> > > > > >   * the COPYING file in the top-level directory.
> > > > > >   */
> > > > > >  
> > > > > > +#include <sys/ioctl.h>
> > > > > >  #include "qemu/osdep.h"
> > > > > >  #include "qemu/error-report.h"
> > > > > >  #include "qemu/range.h"
> > > > > >  #include "qapi/error.h"
> > > > > >  #include "hw/nvram/fw_cfg.h"
> > > > > >  #include "pci.h"
> > > > > > +#include "sysemu/kvm.h"
> > > > > >  #include "trace.h"
> > > > > >  
> > > > > >  /* Use uin32_t for vendor & device so PCI_ANY_ID expands and cannot match hw */
> > > > > > @@ -1844,3 +1846,15 @@ void vfio_setup_resetfn_quirk(VFIOPCIDevice *vdev)
> > > > > >          break;
> > > > > >      }
> > > > > >  }
> > > > > > +
> > > > > > +void vfio_quirk_kvmgt(VFIOPCIDevice *vdev)
> > > > > > +{
> > > > > > +    int vmfd;
> > > > > > +
> > > > > > +    if (!kvm_enabled() || !vdev->kvmgt)
> > > > > > +        return;
> > > > > > +
> > > > > > +    /* Tell the device what KVM it attached */
> > > > > > +    vmfd = kvm_get_vmfd(kvm_state);
> > > > > > +    ioctl(vdev->vbasedev.fd, VFIO_SET_KVMFD, vmfd);
> > > > > > +}
> > > > > > diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> > > > > > index a5a620a..8732552 100644
> > > > > > --- a/hw/vfio/pci.c
> > > > > > +++ b/hw/vfio/pci.c
> > > > > > @@ -2561,6 +2561,8 @@ static int vfio_initfn(PCIDevice *pdev)
> > > > > >          return ret;
> > > > > >      }
> > > > > >  
> > > > > > +    vfio_quirk_kvmgt(vdev);
> > > > > > +
> > > > > >      /* Get a copy of config space */
> > > > > >      ret = pread(vdev->vbasedev.fd, vdev->pdev.config,
> > > > > >                  MIN(pci_config_size(&vdev->pdev), vdev->config_size),
> > > > > > @@ -2832,6 +2834,7 @@ static Property vfio_pci_dev_properties[] = {
> > > > > >      DEFINE_PROP_UINT32("x-pci-sub-device-id", VFIOPCIDevice,
> > > > > >                         sub_device_id, PCI_ANY_ID),
> > > > > >      DEFINE_PROP_UINT32("x-igd-gms", VFIOPCIDevice, igd_gms, 0),
> > > > > > +    DEFINE_PROP_BOOL("kvmgt", VFIOPCIDevice, kvmgt, false),      
> > > > > 
> > > > > Just a side note, device options are a headache, users are prone to get
> > > > > them wrong and minimally it requires an entire round to get libvirt
> > > > > support.  We should be able to detect from the device or vfio API
> > > > > whether such a call is required.  Obviously if we can use the existing
> > > > > kvm-vfio device, that's the better option anyway.  Thanks,    
> > > > 
> > > > Also, vfio devices currently have no hard dependencies on KVM, if kvmgt
> > > > does, it needs to produce a device failure when unavailable.  Thanks,    
> > > 
> > > Also, I would like to see this as an generic feature instead of
> > > kvmgt specific interface, so we don't have to add new options to QEMU and it is
> > > up to the vendor driver to proceed with or without it.  
> > 
> > In general this should be decided by lack of some required feature
> > exclusively provided by KVM.  I would not want to add a generic opt-out
> > for mdev vendor drivers to decide that they arbitrarily want to disable
> > that path.  Thanks,  
> 
> IIUC, you are suggesting that this path should be controlled by KVM feature cap
> and it will be accessible to VFIO users when such checking is satisfied.

Maybe we're getting too loose with our pronouns here, I'm starting to
lose track of what "this" is referring to.  I agree that there's no
reason for the ioctl, as proposed to be kvmgt specific.  I would hope
that going through the kvm-vfio device to create that linkage would
eliminate that, but we'll need to see what Jike can come up with to
plumb between KVM and vfio.  Vendor drivers can implement their own
ioctls, now that we pass them through the mdev layer, but someone needs
to call those ioctls.  Ideally we want something programmatic to
trigger that, without requiring a user to pass an extra device
parameter.  Additionally, if there is any hope of making use of the
device with userspace drivers other than QEMU, hard dependencies on KVM
should be avoided.  Thanks,

Alex
diff mbox

Patch

diff --git a/hw/vfio/pci-quirks.c b/hw/vfio/pci-quirks.c
index bec694c..f715d37 100644
--- a/hw/vfio/pci-quirks.c
+++ b/hw/vfio/pci-quirks.c
@@ -10,12 +10,14 @@ 
  * the COPYING file in the top-level directory.
  */
 
+#include <sys/ioctl.h>
 #include "qemu/osdep.h"
 #include "qemu/error-report.h"
 #include "qemu/range.h"
 #include "qapi/error.h"
 #include "hw/nvram/fw_cfg.h"
 #include "pci.h"
+#include "sysemu/kvm.h"
 #include "trace.h"
 
 /* Use uin32_t for vendor & device so PCI_ANY_ID expands and cannot match hw */
@@ -1844,3 +1846,15 @@  void vfio_setup_resetfn_quirk(VFIOPCIDevice *vdev)
         break;
     }
 }
+
+void vfio_quirk_kvmgt(VFIOPCIDevice *vdev)
+{
+    int vmfd;
+
+    if (!kvm_enabled() || !vdev->kvmgt)
+        return;
+
+    /* Tell the device what KVM it attached */
+    vmfd = kvm_get_vmfd(kvm_state);
+    ioctl(vdev->vbasedev.fd, VFIO_SET_KVMFD, vmfd);
+}
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index a5a620a..8732552 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -2561,6 +2561,8 @@  static int vfio_initfn(PCIDevice *pdev)
         return ret;
     }
 
+    vfio_quirk_kvmgt(vdev);
+
     /* Get a copy of config space */
     ret = pread(vdev->vbasedev.fd, vdev->pdev.config,
                 MIN(pci_config_size(&vdev->pdev), vdev->config_size),
@@ -2832,6 +2834,7 @@  static Property vfio_pci_dev_properties[] = {
     DEFINE_PROP_UINT32("x-pci-sub-device-id", VFIOPCIDevice,
                        sub_device_id, PCI_ANY_ID),
     DEFINE_PROP_UINT32("x-igd-gms", VFIOPCIDevice, igd_gms, 0),
+    DEFINE_PROP_BOOL("kvmgt", VFIOPCIDevice, kvmgt, false),
     /*
      * TODO - support passed fds... is this necessary?
      * DEFINE_PROP_STRING("vfiofd", VFIOPCIDevice, vfiofd_name),
diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
index 7d482d9..813832c 100644
--- a/hw/vfio/pci.h
+++ b/hw/vfio/pci.h
@@ -143,6 +143,7 @@  typedef struct VFIOPCIDevice {
     bool no_kvm_intx;
     bool no_kvm_msi;
     bool no_kvm_msix;
+    bool kvmgt;
 } VFIOPCIDevice;
 
 uint32_t vfio_pci_read_config(PCIDevice *pdev, uint32_t addr, int len);
@@ -166,4 +167,6 @@  int vfio_populate_vga(VFIOPCIDevice *vdev);
 int vfio_pci_igd_opregion_init(VFIOPCIDevice *vdev,
                                struct vfio_region_info *info);
 
+void vfio_quirk_kvmgt(VFIOPCIDevice *vdev);
+
 #endif /* HW_VFIO_VFIO_PCI_H */
diff --git a/include/sysemu/kvm.h b/include/sysemu/kvm.h
index df67cc0..dd8320a 100644
--- a/include/sysemu/kvm.h
+++ b/include/sysemu/kvm.h
@@ -254,6 +254,7 @@  void phys_mem_set_alloc(void *(*alloc)(size_t, uint64_t *align));
 int kvm_ioctl(KVMState *s, int type, ...);
 
 int kvm_vm_ioctl(KVMState *s, int type, ...);
+int kvm_get_vmfd(KVMState *s);
 
 int kvm_vcpu_ioctl(CPUState *cpu, int type, ...);
 
diff --git a/kvm-all.c b/kvm-all.c
index efb5fe3..bd72ce3 100644
--- a/kvm-all.c
+++ b/kvm-all.c
@@ -2065,6 +2065,11 @@  int kvm_vm_ioctl(KVMState *s, int type, ...)
     return ret;
 }
 
+int kvm_get_vmfd(KVMState *s)
+{
+	return s->vmfd;
+}
+
 int kvm_vcpu_ioctl(CPUState *cpu, int type, ...)
 {
     int ret;
diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
index 759b850..952303f 100644
--- a/linux-headers/linux/vfio.h
+++ b/linux-headers/linux/vfio.h
@@ -686,6 +686,12 @@  struct vfio_iommu_spapr_tce_remove {
 };
 #define VFIO_IOMMU_SPAPR_TCE_REMOVE	_IO(VFIO_TYPE, VFIO_BASE + 20)
 
+
+/**
+ * VFIO_SET_KVMFD - _IO(VFIO_TYPE, VFIO_BASE + 21, __u32)
+ */
+#define VFIO_SET_KVMFD		_IO(VFIO_TYPE, VFIO_BASE + 21)
+
 /* ***************************************************************** */
 
 #endif /* VFIO_H */