From patchwork Wed Dec 21 21:42:12 2011
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Alex Williamson <alex.williamson@redhat.com>
X-Patchwork-Id: 132739
Return-Path: <qemu-devel-bounces+incoming=patchwork.ozlabs.org@nongnu.org>
X-Original-To: incoming@patchwork.ozlabs.org
Delivered-To: patchwork-incoming@bilbo.ozlabs.org
Received: from lists.gnu.org (lists.gnu.org [140.186.70.17])
	(using TLSv1 with cipher AES256-SHA (256/256 bits))
	(Client did not present a certificate)
	by ozlabs.org (Postfix) with ESMTPS id 944DFB7138
	for <incoming@patchwork.ozlabs.org>;
	Thu, 22 Dec 2011 08:43:17 +1100 (EST)
Received: from localhost ([::1]:41248 helo=lists.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.71) (envelope-from
	<qemu-devel-bounces+incoming=patchwork.ozlabs.org@nongnu.org>)
	id 1RdTwJ-0007JC-D0
	for incoming@patchwork.ozlabs.org; Wed, 21 Dec 2011 16:43:07 -0500
Received: from eggs.gnu.org ([140.186.70.92]:44121)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <alex.williamson@redhat.com>) id 1RdTvg-00078T-7W
	for qemu-devel@nongnu.org; Wed, 21 Dec 2011 16:42:30 -0500
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <alex.williamson@redhat.com>) id 1RdTvc-0003Bh-TJ
	for qemu-devel@nongnu.org; Wed, 21 Dec 2011 16:42:28 -0500
Received: from mx1.redhat.com ([209.132.183.28]:54109)
	by eggs.gnu.org with esmtp (Exim 4.71)
	(envelope-from <alex.williamson@redhat.com>) id 1RdTvc-0003Ba-Gh
	for qemu-devel@nongnu.org; Wed, 21 Dec 2011 16:42:24 -0500
Received: from int-mx01.intmail.prod.int.phx2.redhat.com
	(int-mx01.intmail.prod.int.phx2.redhat.com [10.5.11.11])
	by mx1.redhat.com (8.14.4/8.14.4) with ESMTP id pBLLgERH011161
	(version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=OK);
	Wed, 21 Dec 2011 16:42:14 -0500
Received: from bling.home (ovpn-113-45.phx2.redhat.com [10.3.113.45])
	by int-mx01.intmail.prod.int.phx2.redhat.com (8.13.8/8.13.8) with
	ESMTP id pBLLgC1V005189; Wed, 21 Dec 2011 16:42:12 -0500
From: Alex Williamson <alex.williamson@redhat.com>
To: chrisw@sous-sol.org, aik@ozlabs.ru, david@gibson.dropbear.id.au,
	joerg.roedel@amd.com, agraf@suse.de, benve@cisco.com,
	aafabbri@cisco.com, B08248@freescale.com, B07421@freescale.com,
	avi@redhat.com, kvm@vger.kernel.org, qemu-devel@nongnu.org,
	iommu@lists.linux-foundation.org, linux-pci@vger.kernel.org,
	linux-kernel@vger.kernel.org
Date: Wed, 21 Dec 2011 14:42:12 -0700
Message-ID: <20111221214212.27028.78947.stgit@bling.home>
In-Reply-To: <20111221213019.27028.26890.stgit@bling.home>
References: <20111221213019.27028.26890.stgit@bling.home>
User-Agent: StGIT/0.14.3
MIME-Version: 1.0
X-Scanned-By: MIMEDefang 2.67 on 10.5.11.11
X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.6 (newer, 3)
X-Received-From: 209.132.183.28
Subject: [Qemu-devel] [PATCH 1/5] vfio: Introduce documentation for VFIO
	driver
X-BeenThere: qemu-devel@nongnu.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: <qemu-devel.nongnu.org>
List-Unsubscribe: <https://lists.nongnu.org/mailman/options/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=unsubscribe>
List-Archive: <http://lists.nongnu.org/archive/html/qemu-devel>
List-Post: <mailto:qemu-devel@nongnu.org>
List-Help: <mailto:qemu-devel-request@nongnu.org?subject=help>
List-Subscribe: <https://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=subscribe>
Errors-To: qemu-devel-bounces+incoming=patchwork.ozlabs.org@nongnu.org
Sender: qemu-devel-bounces+incoming=patchwork.ozlabs.org@nongnu.org

Including rationale for design, example usage and API description.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---

 Documentation/vfio.txt |  352 ++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 352 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/vfio.txt

diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt
new file mode 100644
index 0000000..09a5a5b
--- /dev/null
+++ b/Documentation/vfio.txt
@@ -0,0 +1,352 @@
+VFIO - "Virtual Function I/O"[1]
+-------------------------------------------------------------------------------
+Many modern system now provide DMA and interrupt remapping facilities
+to help ensure I/O devices behave within the boundaries they've been
+allotted.  This includes x86 hardware with AMD-Vi and Intel VT-d,
+POWER systems with Partitionable Endpoints (PEs) and embedded PowerPC
+systems such as Freescale PAMU.  The VFIO driver is an IOMMU/device
+agnostic framework for exposing direct device access to userspace, in
+a secure, IOMMU protected environment.  In other words, this allows
+safe[2], non-privileged, userspace drivers.
+
+Why do we want that?  Virtual machines often make use of direct device
+access ("device assignment") when configured for the highest possible
+I/O performance.  From a device and host perspective, this simply
+turns the VM into a userspace driver, with the benefits of
+significantly reduced latency, higher bandwidth, and direct use of
+bare-metal device drivers[3].
+
+Some applications, particularly in the high performance computing
+field, also benefit from low-overhead, direct device access from
+userspace.  Examples include network adapters (often non-TCP/IP based)
+and compute accelerators.  Prior to VFIO, these drivers had to either
+go through the full development cycle to become proper upstream
+driver, be maintained out of tree, or make use of the UIO framework,
+which has no notion of IOMMU protection, limited interrupt support,
+and requires root privileges to access things like PCI configuration
+space.
+
+The VFIO driver framework intends to unify these, replacing both the
+KVM PCI specific device assignment code as well as provide a more
+secure, more featureful userspace driver environment than UIO.
+
+Groups, Devices, and IOMMUs
+-------------------------------------------------------------------------------
+
+Userspace drivers are primarily concerned with manipulating individual
+devices and setting up mappings in the IOMMU for those devices.
+Unfortunately, the IOMMU doesn't always have the granularity to track
+mappings for an individual device.  Sometimes this is a topology
+barrier, such as a PCIe-to-PCI bridge interposing the device and
+IOMMU, other times this is an IOMMU limitation.  In any case, the
+reality is that devices are not always independent with respect to the
+IOMMU.  Translations setup for one device can be used by another
+device in these scenarios.
+
+The IOMMU API exposes these relationships by identifying an "IOMMU
+group" for these dependent devices.  Devices on the same bus with the
+same IOMMU group (or just "group" for this document) are not isolated
+from each other with respect to DMA mappings.  For userspace usage,
+this logically means that instead of being able to grant ownership of
+an individual device, we must grant ownership of a group, which may
+contain one or more devices.
+
+These groups therefore become a fundamental component of VFIO and the
+working unit we use for exposing devices and granting permissions to
+userspace.  In addition, VFIO make efforts to ensure the integrity of
+the group for user access.  This includes ensuring that all devices
+within the group are controlled by VFIO (vs native host drivers)
+before allowing a user to access any member of the group or the IOMMU
+mappings, as well as maintaining the group viability as devices are
+dynamically added or removed from the system.
+
+To access a device through VFIO, a user must open a character device
+for the group that the device belongs to and then issue an ioctl to
+retrieve a file descriptor for the individual device.  This ensures
+that the user has permissions to the group (file based access to the
+/dev entry) and allows a check point at which VFIO can deny access to
+the device if the group is not viable (all devices within the group
+controlled by VFIO).  A file descriptor for the IOMMU is obtain in the
+same fashion.
+
+VFIO defines a standard set of APIs for access to devices and a
+modular interface for adding new, bus-specific VFIO device drivers.
+We call these "VFIO bus drivers".  The vfio-pci module is an example
+of a bus driver for exposing PCI devices.  When the bus driver module
+is loaded it enumerates all of the devices for it's bus, registering
+each device with the vfio core along with a set of callbacks.  For
+buses that support hotplug, the bus driver also adds itself to the
+notification chain for such events.  The callbacks registered with
+each device implement the VFIO device access API for that bus.
+
+The VFIO device API includes ioctls for describing the device, the I/O
+regions and their read/write/mmap offsets on the device descriptor, as
+well as mechanisms for describing and registering interrupt
+notifications.
+
+The VFIO IOMMU object is accessed in a similar way; an ioctl on the
+group provides a file descriptor for programming the IOMMU.  Like
+devices, the IOMMU file descriptor is only accessible when a group is
+viable.  The API for the IOMMU is effectively a userspace extension of
+the kernel IOMMU API.  The IOMMU provides an ioctl to describe the
+IOMMU domain as well as to setup and teardown DMA mappings.  As the
+IOMMU API is extended to support more esoteric IOMMU implementations,
+it's expected that the VFIO interface will also evolve.
+
+To facilitate this evolution, all of the VFIO interfaces are designed
+for extensions.  Particularly, for all structures passed via ioctl, we
+include a structure size and flags field.  We also define the ioctl
+request to be independent of passed structure size.  This allows us to
+later add structure fields and define flags as necessary.  It's
+expected that each additional field will have an associated flag to
+indicate whether the data is valid.  Additionally, we provide an
+"info" ioctl for each file descriptor, which allows us to flag new
+features as they're added (ex. an IOMMU domain configuration ioctl).
+
+The final aspect of VFIO is the notion of merging groups.  In both the
+assignment of devices to virtual machines and the pure userspace
+driver model, it's expect that a single user instance is likely to
+have multiple groups in use simultaneously.  If these groups are all
+using the same set of IOMMU mappings, the overhead of userspace
+setting up and tearing down the mappings, as well as the internal
+IOMMU driver overhead of managing those mappings can be non-trivial.
+Some IOMMU implementations are able to easily reduce this overhead by
+simply using the same set of page tables across multiple groups.
+VFIO allows users to take advantage of this option by merging groups
+together, effectively creating a super group (IOMMU groups only define
+the minimum granularity).
+
+A user can attempt to merge groups together by calling the merge ioctl
+on one group (the "merger") and passing a file descriptor for the
+group to be merged in (the "mergee").  Note that existing DMA mappings
+cannot be atomically merged between groups, it's therefore a
+requirement that the mergee group is not in use.  This is enforced by
+not allowing open device or iommu file descriptors on the mergee group
+at the time of merging.  The merger group can be actively in use at
+the time of merging.  Likewise, to unmerge a group, none of the device
+file descriptors for the group being removed can be in use.  The
+remaining merged group can be actively in use.
+
+If the groups cannot be merged, the ioctl will fail and the user will
+need to manage the groups independently.  Users should have no
+expectation for group merging to be successful.  Some platforms may
+not support it at all, others may only enable merging of sufficiently
+similar groups.  If the ioctl succeeds, then the group file
+descriptors are effectively fungible between the groups.  That is,
+instead of their actions being isolated to the individual group, each
+of them are gateways into the combined, merged group.  For instance,
+retrieving an IOMMU file descriptor from any group returns a reference
+to the same object, mappings to that IOMMU descriptor are visible to
+all devices in the merged group, and device descriptors can be
+retrieved for any device in the merged group from any one of the group
+file descriptors.  In effect, a user can manage devices and the IOMMU
+of a merged group using a single file descriptor (saving the merged
+groups file descriptors away only for unmerged) without the
+permission complications of creating a separate "super group" character
+device.
+
+VFIO Usage Example
+-------------------------------------------------------------------------------
+
+Assume user wants to access PCI device 0000:06:0d.0
+
+$ cat /sys/bus/pci/devices/0000:06:0d.0/iommu_group
+240
+
+Since this device is on the "pci" bus, the user can then find the
+character device for interacting with the VFIO group as:
+
+$ ls -l /dev/vfio/pci:240
+crw-rw---- 1 root root 252, 27 Dec 15 15:13 /dev/vfio/pci:240
+
+We can also examine other members of the group through sysfs:
+
+$ ls -l /sys/devices/virtual/vfio/pci:240/devices/
+total 0
+lrwxrwxrwx 1 root root 0 Dec 20 12:01 0000:06:0d.0 -> \
+		../../../../pci0000:00/0000:00:1e.0/0000:06:0d.0
+lrwxrwxrwx 1 root root 0 Dec 20 12:01 0000:06:0d.1 -> \
+		../../../../pci0000:00/0000:00:1e.0/0000:06:0d.1
+
+This group therefore contains two devices[4].  VFIO will prevent
+device or iommu manipulation unless all group members are attached to
+the vfio bus driver, so we simply unbind the devices from their
+current driver and rebind them to vfio:
+
+# for i in /sys/devices/virtual/vfio/pci:240/devices/*; do
+	dir=$(readlink -f $i)
+	if [ -L $dir/driver ]; then
+		echo $(basename $i) > $dir/driver/unbind
+	fi
+	vendor=$(cat $dir/vendor)
+	device=$(cat $dir/device)
+	echo $vendor $device > /sys/bus/pci/drivers/vfio/new_id
+	echo $(basename $i) > /sys/bus/pci/drivers/vfio/bind
+done
+
+# chown user:user /dev/vfio/pci:240
+
+The user now has full access to all the devices and the iommu for this
+group and can access them as follows:
+
+	int group, iommu, device, i;
+	struct vfio_group_info group_info = { .argsz = sizeof(group_info) };
+	struct vfio_iommu_info iommu_info = { .argsz = sizeof(iommu_info) };
+	struct vfio_dma_map dma_map = { .argsz = sizeof(dma_map) };
+	struct vfio_device_info device_info = { .argsz = sizeof(device_info) };
+
+	/* Open the group */
+	group = open("/dev/vfio/pci:240", O_RDWR);
+
+	/* Test the group is viable and available */
+	ioctl(group, VFIO_GROUP_GET_INFO, &group_info);
+
+	if (!(group_info.flags & VFIO_GROUP_FLAGS_VIABLE))
+		/* Group is not viable */
+
+	if ((group_info.flags & VFIO_GROUP_FLAGS_MM_LOCKED))
+		/* Already in use by someone else */
+
+	/* Get a file descriptor for the IOMMU */
+	iommu = ioctl(group, VFIO_GROUP_GET_IOMMU_FD);
+
+	/* Test the IOMMU is what we expect */
+	ioctl(iommu, VFIO_IOMMU_GET_INFO, &iommu_info);
+
+	/* Allocate some space and setup a DMA mapping */
+	dma_map.vaddr = mmap(0, 1024 * 1024, PROT_READ | PROT_WRITE,
+			     MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);
+	dma_map.size = 1024 * 1024;
+	dma_map.iova = 0; /* 1MB starting at 0x0 from device view */
+	dma_map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
+
+	ioctl(iommu, VFIO_IOMMU_MAP_DMA, &dma_map);
+
+	/* Get a file descriptor for the device */
+	device = ioctl(group, VFIO_GROUP_GET_DEVICE_FD, "0000:06:0d.0");
+
+	/* Test and setup the device */
+	ioctl(device, VFIO_DEVICE_GET_INFO, &device_info);
+
+	for (i = 0; i < device_info.num_regions; i++) {
+		struct vfio_region_info reg = { .argsz = sizeof(reg) };
+
+		reg.index = i;
+
+		ioctl(device, VFIO_DEVICE_GET_REGION_INFO, &reg);
+
+		/* Setup mappings... read/write offsets, mmaps
+		 * For PCI devices, config space is a region */
+	}
+
+	for (i = 0; i < device_info.num_irqs; i++) {
+		struct vfio_irq_info irq = { .argsz = sizeof(irq) };
+
+		irq.index = i;
+
+		ioctl(device, VFIO_DEVICE_GET_IRQ_INFO, &reg);
+
+		/* Setup IRQs... eventfds, VFIO_DEVICE_SET_IRQ_EVENTFDS */
+	}
+
+	/* Gratuitous device reset and go... */
+	ioctl(device, VFIO_DEVICE_RESET);
+
+VFIO User API
+-------------------------------------------------------------------------------
+
+Please see include/linux/vfio.h for complete API documentation.
+
+VFIO bus driver API
+-------------------------------------------------------------------------------
+
+Bus drivers, such as PCI, have three jobs:
+ 1) Add/remove devices from vfio
+ 2) Provide vfio_device_ops for device access
+ 3) Device binding and unbinding
+
+When initialized, the bus driver should enumerate the devices on its
+bus and call vfio_group_add_dev() for each device.  If the bus
+supports hotplug, notifiers should be enabled to track devices being
+added and removed.  vfio_group_del_dev() removes a previously added
+device from vfio.
+
+extern int vfio_group_add_dev(struct device *dev,
+                              const struct vfio_device_ops *ops);
+extern void vfio_group_del_dev(struct device *dev);
+
+Adding a device registers a vfio_device_ops function pointer structure
+for the device:
+
+struct vfio_device_ops {
+	bool	(*match)(struct device *dev, char *buf);
+	int	(*claim)(struct device *dev);
+	int	(*open)(void *device_data);
+	void	(*release)(void *device_data);
+	ssize_t	(*read)(void *device_data, char __user *buf,
+			size_t count, loff_t *ppos);
+	ssize_t	(*write)(void *device_data, const char __user *buf,
+			 size_t size, loff_t *ppos);
+	long	(*ioctl)(void *device_data, unsigned int cmd,
+			 unsigned long arg);
+	int	(*mmap)(void *device_data, struct vm_area_struct *vma);
+};
+
+For buses supporting hotplug, all functions are required to be
+implemented.  Non-hotplug buses do not need to implement claim().
+
+match() provides a device specific method for associating a struct
+device to a user provided string.  Many drivers may simply strcmp the
+buffer to dev_name().
+
+claim() is used when a device is hot-added to a group that is already
+in use.  This is how VFIO requests that a bus driver manually takes
+ownership of a device.  The expected call path for this is triggered
+from the bus add notifier.  The bus driver calls vfio_group_add_dev for
+the newly added device, vfio-core determines this group is already in
+use and calls claim on the bus driver.  This triggers the bus driver
+to call it's own probe function, including calling vfio_bind_dev to
+mark the device as controlled by vfio.  The device is then available
+for use by the group.
+
+The remaining vfio_device_ops are similar to a simplified struct
+file_operations except a device_data pointer is provided rather than a
+file pointer.  The device_data is an opaque structure registered by
+the bus driver when a device is bound to the vfio bus driver:
+
+extern int vfio_bind_dev(struct device *dev, void *device_data);
+extern void *vfio_unbind_dev(struct device *dev);
+
+When the device is unbound from the driver, the bus driver will call
+vfio_unbind_dev() which will return the device_data for any bus driver
+specific cleanup and freeing of the structure.  The vfio_unbind_dev
+call may block if the group is currently in use.
+
+-------------------------------------------------------------------------------
+
+[1] VFIO was originally an acronym for "Virtual Function I/O" in it's
+initial implementation by Tom Lyon while as Cisco.  We've since
+outgrown the acronym, but it's catchy.
+
+[2] "safe" also depends upon a device being "well behaved".  It's
+possible for multi-function devices to have backdoors between
+functions and even for single function devices to have alternative
+access to things like PCI config space through MMIO registers.  To
+guard against the former we can include additional precautions in the
+IOMMU driver to group multi-function PCI devices together
+(iommu=group_mf).  The latter we can't prevent, but the IOMMU should
+still provide isolation.  For PCI, Virtual Functions are the best
+indicator of "well behaved", as these are designed for virtualization
+usage models.
+
+[3] As always there are trade-offs to virtual machine device
+assignment that are beyond the scope of VFIO.  It's expected that
+future IOMMU technologies will reduce some, but maybe not all, of
+these trade-offs.
+
+[4] In this case the device is below a PCI bridge:
+
+-[0000:00]-+-1e.0-[06]--+-0d.0
+                        \-0d.1
+
+00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev 90)