[qemu,v6] spapr: Support NVIDIA V100 GPU with NVLink2

NVIDIA V100 GPUs have on-board RAM which is mapped into the host memory
space and accessible as normal RAM via an NVLink bus. The VFIO-PCI driver
implements special regions for such GPUs and emulates an NVLink bridge.
NVLink2-enabled POWER9 CPUs also provide address translation services
which includes an ATS shootdown (ATSD) register exported via the NVLink
bridge device.

This adds a quirk to VFIO to map the GPU memory and create an MR;
the new MR is stored in a PCI device as a QOM link. The sPAPR PCI uses
this to get the MR and map it to the system address space.
Another quirk does the same for ATSD.

This adds additional steps to sPAPR PHB setup:

1. Search for specific GPUs and NPUs, collect findings in
sPAPRPHBState::nvgpus, manage system address space mappings;

2. Add device-specific properties such as "ibm,npu", "ibm,gpu",
"memory-block", "link-speed" to advertise the NVLink2 function to
the guest;

3. Add "mmio-atsd" to vPHB to advertise the ATSD capability;

4. Add new memory blocks (with extra "linux,memory-usable" to prevent
the guest OS from accessing the new memory until it is onlined) and
npuphb# nodes representing an NPU unit for every vPHB as the GPU driver
uses it for link discovery.

This allocates space for GPU RAM and ATSD like we do for MMIOs by
adding 2 new parameters to the phb_placement() hook. Older machine types
set these to zero.

This puts new memory nodes in a separate NUMA node to as the GPU RAM
needs to be configured equally distant from any other node in the system.
Unlike the host setup which assigns numa ids from 255 downwards, this
adds new NUMA nodes after the user configures nodes or from 1 if none
were configured.

This adds requirement similar to EEH - one IOMMU group per vPHB.
The reason for this is that ATSD registers belong to a physical NPU
so they cannot invalidate translations on GPUs attached to another NPU.
It is guaranteed by the host platform as it does not mix NVLink bridges
or GPUs from different NPU in the same IOMMU group. If more than one
IOMMU group is detected on a vPHB, this disables ATSD support for that
vPHB and prints a warning.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
[aw: for vfio portions]
Acked-by: Alex Williamson <alex.williamson@redhat.com>
---

This is based on David's ppc-for-4.0 and acked "vfio_info_cap public" from
https://patchwork.ozlabs.org/patch/1052645/

Changes:
v6:
* changed error handling in spapr-nvlink code
* fixed mmap error checking from NULL to MAP_FAILED
* changed NUMA node ids and changed the commit log about it

v5:
* converted MRs to VFIOQuirk - this fixed leaks

v4:
* fixed ATSD placement
* fixed spapr_phb_unrealize() to do nvgpu cleanup
* replaced warn_report() with Error*

v3:
* moved GPU RAM above PCI MMIO limit
* renamed QOM property to nvlink2-tgt
* moved nvlink2 code to its own file

---

The example command line for redbud system:

pbuild/qemu-aiku1804le-ppc64/ppc64-softmmu/qemu-system-ppc64 \
-nodefaults \
-chardev stdio,id=STDIO0,signal=off,mux=on \
-device spapr-vty,id=svty0,reg=0x71000110,chardev=STDIO0 \
-mon id=MON0,chardev=STDIO0,mode=readline -nographic -vga none \
-enable-kvm -m 384G \
-chardev socket,id=SOCKET0,server,nowait,host=localhost,port=40000 \
-mon chardev=SOCKET0,mode=control \
-smp 80,sockets=1,threads=4 \
-netdev "tap,id=TAP0,helper=/home/aik/qemu-bridge-helper --br=br0" \
-device "virtio-net-pci,id=vnet0,mac=52:54:00:12:34:56,netdev=TAP0" \
img/vdisk0.img \
-device "vfio-pci,id=vfio0004_04_00_0,host=0004:04:00.0" \
-device "vfio-pci,id=vfio0006_00_00_0,host=0006:00:00.0" \
-device "vfio-pci,id=vfio0006_00_00_1,host=0006:00:00.1" \
-device "vfio-pci,id=vfio0006_00_00_2,host=0006:00:00.2" \
-device "vfio-pci,id=vfio0004_05_00_0,host=0004:05:00.0" \
-device "vfio-pci,id=vfio0006_00_01_0,host=0006:00:01.0" \
-device "vfio-pci,id=vfio0006_00_01_1,host=0006:00:01.1" \
-device "vfio-pci,id=vfio0006_00_01_2,host=0006:00:01.2" \
-device spapr-pci-host-bridge,id=phb1,index=1 \
-device "vfio-pci,id=vfio0035_03_00_0,host=0035:03:00.0" \
-device "vfio-pci,id=vfio0007_00_00_0,host=0007:00:00.0" \
-device "vfio-pci,id=vfio0007_00_00_1,host=0007:00:00.1" \
-device "vfio-pci,id=vfio0007_00_00_2,host=0007:00:00.2" \
-device "vfio-pci,id=vfio0035_04_00_0,host=0035:04:00.0" \
-device "vfio-pci,id=vfio0007_00_01_0,host=0007:00:01.0" \
-device "vfio-pci,id=vfio0007_00_01_1,host=0007:00:01.1" \
-device "vfio-pci,id=vfio0007_00_01_2,host=0007:00:01.2" -snapshot \
-machine pseries \
-L /home/aik/t/qemu-ppc64-bios/ -d guest_errors

Note that QEMU attaches PCI devices to the last added vPHB so first
8 devices - 4:04:00.0 till 6:00:01.2 - go to the default vPHB, and
35:03:00.0..7:00:01.2 to the vPHB with id=phb1.
---
 hw/ppc/Makefile.objs        |   2 +-
 hw/vfio/pci.h               |   2 +
 include/hw/pci-host/spapr.h |  45 ++++
 include/hw/ppc/spapr.h      |   5 +-
 hw/ppc/spapr.c              |  48 +++-
 hw/ppc/spapr_pci.c          |  19 ++
 hw/ppc/spapr_pci_nvlink2.c  | 450 ++++++++++++++++++++++++++++++++++++
 hw/vfio/pci-quirks.c        | 131 +++++++++++
 hw/vfio/pci.c               |  14 ++
 hw/vfio/trace-events        |   4 +
 10 files changed, 711 insertions(+), 9 deletions(-)
 create mode 100644 hw/ppc/spapr_pci_nvlink2.c

Message ID	20190312044221.99392-1-aik@ozlabs.ru
State	New
Headers	show Return-Path: <qemu-devel-bounces+incoming=patchwork.ozlabs.org@nongnu.org> X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@bilbo.ozlabs.org Authentication-Results: ozlabs.org; spf=pass (mailfrom) smtp.mailfrom=nongnu.org (client-ip=209.51.188.17; helo=lists.gnu.org; envelope-from=qemu-devel-bounces+incoming=patchwork.ozlabs.org@nongnu.org; receiver=<UNKNOWN>) Authentication-Results: ozlabs.org; dmarc=none (p=none dis=none) header.from=ozlabs.ru Received: from lists.gnu.org (lists.gnu.org [209.51.188.17]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by ozlabs.org (Postfix) with ESMTPS id 44JN4h6qXmz9s70 for <incoming@patchwork.ozlabs.org>; Tue, 12 Mar 2019 15:57:21 +1100 (AEDT) Received: from localhost ([127.0.0.1]:45191 helo=lists.gnu.org) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from <qemu-devel-bounces+incoming=patchwork.ozlabs.org@nongnu.org>) id 1h3ZTa-0006aY-7n for incoming@patchwork.ozlabs.org; Tue, 12 Mar 2019 00:57:18 -0400 Received: from eggs.gnu.org ([209.51.188.92]:42040) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from <aik@ozlabs.ru>) id 1h3ZSW-0006WI-Io for qemu-devel@nongnu.org; Tue, 12 Mar 2019 00:56:16 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from <aik@ozlabs.ru>) id 1h3ZFL-0001xV-32 for qemu-devel@nongnu.org; Tue, 12 Mar 2019 00:42:38 -0400 Received: from ozlabs.ru ([107.173.13.209]:58725) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from <aik@ozlabs.ru>) id 1h3ZFJ-0001uK-4Z; Tue, 12 Mar 2019 00:42:35 -0400 Received: from fstn1-p1.ozlabs.ibm.com (localhost [IPv6:::1]) by ozlabs.ru (Postfix) with ESMTP id BE130AE8001F; Tue, 12 Mar 2019 00:42:23 -0400 (EDT) From: Alexey Kardashevskiy <aik@ozlabs.ru> To: qemu-devel@nongnu.org Date: Tue, 12 Mar 2019 15:42:21 +1100 Message-Id: <20190312044221.99392-1-aik@ozlabs.ru> X-Mailer: git-send-email 2.17.1 X-detected-operating-system: by eggs.gnu.org: GNU/Linux 3.x X-Received-From: 107.173.13.209 Subject: [Qemu-devel] [PATCH qemu v6] spapr: Support NVIDIA V100 GPU with NVLink2 X-BeenThere: qemu-devel@nongnu.org X-Mailman-Version: 2.1.21 Precedence: list List-Id: <qemu-devel.nongnu.org> List-Unsubscribe: <https://lists.nongnu.org/mailman/options/qemu-devel>, <mailto:qemu-devel-request@nongnu.org?subject=unsubscribe> List-Archive: <http://lists.nongnu.org/archive/html/qemu-devel/> List-Post: <mailto:qemu-devel@nongnu.org> List-Help: <mailto:qemu-devel-request@nongnu.org?subject=help> List-Subscribe: <https://lists.nongnu.org/mailman/listinfo/qemu-devel>, <mailto:qemu-devel-request@nongnu.org?subject=subscribe> Cc: Jose Ricardo Ziviani <joserz@linux.ibm.com>, Alexey Kardashevskiy <aik@ozlabs.ru>, Daniel Henrique Barboza <danielhb413@gmail.com>, Alex Williamson <alex.williamson@redhat.com>, Sam Bobroff <sbobroff@linux.ibm.com>, Piotr Jaroszynski <pjaroszynski@nvidia.com>, qemu-ppc@nongnu.org, =?utf-8?q?Leonardo_Au?= =?utf-8?q?gusto_Guimar=C3=A3es_Garcia?= <lagarcia@br.ibm.com>, Reza Arbab <arbab@linux.ibm.com>, David Gibson <david@gibson.dropbear.id.au> Errors-To: qemu-devel-bounces+incoming=patchwork.ozlabs.org@nongnu.org Sender: "Qemu-devel" <qemu-devel-bounces+incoming=patchwork.ozlabs.org@nongnu.org>
Series	[qemu,v6] spapr: Support NVIDIA V100 GPU with NVLink2 \| expand [qemu,v6] spapr: Support NVIDIA V100 GPU with NVLink2

[qemu,v6] spapr: Support NVIDIA V100 GPU with NVLink2

Commit Message

Comments

Patch