From patchwork Fri Mar 15 08:18:34 2019
X-Patchwork-Submitter: Alexey Kardashevskiy
X-Patchwork-Id: 1056898
From: Alexey Kardashevskiy
To: linuxppc-dev@lists.ozlabs.org
Cc: Alexey Kardashevskiy, David Gibson, kvm-ppc@vger.kernel.org,
    Piotr Jaroszynski, Leonardo Augusto Guimarães Garcia,
    Jose Ricardo Ziviani, Daniel Henrique Barboza, Alex Williamson
Subject: [PATCH kernel RFC 1/2] vfio_pci: Allow device specific error handlers
Date: Fri, 15 Mar 2019 19:18:34 +1100
Message-Id: <20190315081835.14083-2-aik@ozlabs.ru>
In-Reply-To: <20190315081835.14083-1-aik@ozlabs.ru>
References: <20190315081835.14083-1-aik@ozlabs.ru>
X-Mailing-List: kvm-ppc@vger.kernel.org

PCI device drivers can define their own pci_error_handlers, which are
called on errors or before/after a reset; the VFIO PCI driver defines
one as well.

This adds a vfio_pci_error_handlers struct to VFIO PCI which is
dispatched from the driver-wide vfio_err_handlers. At the moment it
defines a single hook, reset_done(): it is called right after a device
reset and can be used for device-specific tweaking before userspace
gets a chance to use the device.
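For illustration only, here is a minimal sketch (not part of this patch) of
how a device-specific init path could use the new hook, in the same way
patch 2/2 does for NVLink2; the my_device_* names are placeholders:

static void my_device_reset_done(struct vfio_pci_device *vdev)
{
	/* Re-apply device-specific state once the reset has completed */
}

static struct vfio_pci_error_handlers my_device_error_handlers = {
	.reset_done = my_device_reset_done,
};

int my_device_init(struct vfio_pci_device *vdev)
{
	/*
	 * Registering the per-device handlers makes vfio_pci_reset_done()
	 * (added to vfio_err_handlers below) call back into this code.
	 */
	vdev->error_handlers = &my_device_error_handlers;
	return 0;
}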
Signed-off-by: Alexey Kardashevskiy
---
 drivers/vfio/pci/vfio_pci_private.h |  5 +++++
 drivers/vfio/pci/vfio_pci.c         | 17 +++++++++++++++++
 2 files changed, 22 insertions(+)

diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
index 1812cf22fc4f..aff96fa28726 100644
--- a/drivers/vfio/pci/vfio_pci_private.h
+++ b/drivers/vfio/pci/vfio_pci_private.h
@@ -87,8 +87,13 @@ struct vfio_pci_reflck {
 	struct mutex lock;
 };
 
+struct vfio_pci_error_handlers {
+	void (*reset_done)(struct vfio_pci_device *vdev);
+};
+
 struct vfio_pci_device {
 	struct pci_dev *pdev;
+	struct vfio_pci_error_handlers *error_handlers;
 	void __iomem *barmap[PCI_STD_RESOURCE_END + 1];
 	bool bar_mmap_supported[PCI_STD_RESOURCE_END + 1];
 	u8 *pci_config_map;
diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 5bd97fa632d3..6ebc441d91c3 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -1434,8 +1434,25 @@ static pci_ers_result_t vfio_pci_aer_err_detected(struct pci_dev *pdev,
 	return PCI_ERS_RESULT_CAN_RECOVER;
 }
 
+static void vfio_pci_reset_done(struct pci_dev *dev)
+{
+	struct vfio_pci_device *vdev;
+	struct vfio_device *device;
+
+	device = vfio_device_get_from_dev(&dev->dev);
+	if (device == NULL)
+		return;
+
+	vdev = vfio_device_data(device);
+	if (vdev && vdev->error_handlers && vdev->error_handlers->reset_done)
+		vdev->error_handlers->reset_done(vdev);
+
+	vfio_device_put(device);
+}
+
 static const struct pci_error_handlers vfio_err_handlers = {
 	.error_detected = vfio_pci_aer_err_detected,
+	.reset_done = vfio_pci_reset_done,
 };
 
 static struct pci_driver vfio_pci_driver = {

From patchwork Fri Mar 15 08:18:35 2019
X-Patchwork-Submitter: Alexey Kardashevskiy
X-Patchwork-Id: 1056899
From: Alexey Kardashevskiy
To: linuxppc-dev@lists.ozlabs.org
Cc: Alexey Kardashevskiy, David Gibson, kvm-ppc@vger.kernel.org,
    Piotr Jaroszynski, Leonardo Augusto Guimarães Garcia,
    Jose Ricardo Ziviani, Daniel Henrique Barboza, Alex Williamson
Subject: [PATCH kernel RFC 2/2] vfio-pci-nvlink2: Implement interconnect isolation
Date: Fri, 15 Mar 2019 19:18:35 +1100
Message-Id: <20190315081835.14083-3-aik@ozlabs.ru>
In-Reply-To: <20190315081835.14083-1-aik@ozlabs.ru>
References: <20190315081835.14083-1-aik@ozlabs.ru>
X-Mailing-List: kvm-ppc@vger.kernel.org
The NVIDIA V100 SXM2 GPUs are connected to the CPU via PCIe links and
(on POWER9) NVLinks. In addition to that, the GPUs themselves have
direct peer-to-peer NVLinks in groups of 2 to 4 GPUs. At the moment the
POWERNV platform puts all interconnected GPUs into the same IOMMU group.

However the user may want to pass individual GPUs to userspace, and in
order to do so we need to put them into separate IOMMU groups and cut
off the interconnects. Thankfully V100 GPUs implement an interface to do
so by programming a link-disabling mask into BAR0 of the GPU. Once a
link is disabled in a GPU using this interface, it cannot be re-enabled
until a secondary bus reset is issued to the GPU.

This defines a reset_done() handler for the V100 NVLink2 device which
determines what links need to be disabled. This relies on the presence
of the new "ibm,nvlink-peers" device tree property of a GPU telling
which PCI peers it is connected to (which includes NVLink bridges or
peer GPUs).

This does not change the existing behaviour; instead it adds a new
"isolate_nvlink" kernel parameter to allow such isolation.

The alternative approaches would be:

1. do this in the system firmware (skiboot), but for that we would need
to tell skiboot via an additional OPAL call whether or not we want this
isolation - skiboot is unaware of IOMMU groups;

2. do this in the secondary bus reset handler in the POWERNV platform -
the problem with that is that at that point the device is not enabled,
i.e. config space is not restored, so we would need to enable the device
(i.e. set the MMIO bit in the CMD register and program a valid address
into BAR0) in order to disable the links, and then perhaps undo all this
initialization to bring the device back to the state where
pci_try_reset_function() expects it to be.

Signed-off-by: Alexey Kardashevskiy
---
 arch/powerpc/platforms/powernv/npu-dma.c | 24 +++++-
 drivers/vfio/pci/vfio_pci_nvlink2.c      | 98 ++++++++++++++++++++++++
 2 files changed, 120 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/npu-dma.c b/arch/powerpc/platforms/powernv/npu-dma.c
index 3a102378c8dc..6f5c769b6fc8 100644
--- a/arch/powerpc/platforms/powernv/npu-dma.c
+++ b/arch/powerpc/platforms/powernv/npu-dma.c
@@ -441,6 +441,23 @@ static void pnv_comp_attach_table_group(struct npu_comp *npucomp,
 	++npucomp->pe_num;
 }
 
+static bool isolate_nvlink;
+
+static int __init parse_isolate_nvlink(char *p)
+{
+	bool val;
+
+	if (!p)
+		val = true;
+	else if (kstrtobool(p, &val))
+		return -EINVAL;
+
+	isolate_nvlink = val;
+
+	return 0;
+}
+early_param("isolate_nvlink", parse_isolate_nvlink);
+
 struct iommu_table_group *pnv_try_setup_npu_table_group(struct pnv_ioda_pe *pe)
 {
 	struct iommu_table_group *table_group;
@@ -463,7 +480,7 @@ struct iommu_table_group *pnv_try_setup_npu_table_group(struct pnv_ioda_pe *pe)
 	hose = pci_bus_to_host(npdev->bus);
 	phb = hose->private_data;
 
-	if (hose->npu) {
+	if (hose->npu && !isolate_nvlink) {
 		if (!phb->npucomp) {
 			phb->npucomp = kzalloc(sizeof(struct npu_comp),
 					GFP_KERNEL);
@@ -477,7 +494,10 @@ struct iommu_table_group *pnv_try_setup_npu_table_group(struct pnv_ioda_pe *pe)
 					pe->pe_number);
 		}
 	} else {
-		/* Create a group for 1 GPU and attached NPUs for POWER8 */
+		/*
+		 * Create a group for 1 GPU and attached NPUs for
+		 * POWER8 (always) or POWER9 (when isolate_nvlink).
+		 */
 		pe->npucomp = kzalloc(sizeof(*pe->npucomp), GFP_KERNEL);
 		table_group = &pe->npucomp->table_group;
 		table_group->ops = &pnv_npu_peers_ops;
diff --git a/drivers/vfio/pci/vfio_pci_nvlink2.c b/drivers/vfio/pci/vfio_pci_nvlink2.c
index 32f695ffe128..bb6bba762f46 100644
--- a/drivers/vfio/pci/vfio_pci_nvlink2.c
+++ b/drivers/vfio/pci/vfio_pci_nvlink2.c
@@ -206,6 +206,102 @@ static int vfio_pci_nvgpu_group_notifier(struct notifier_block *nb,
 	return NOTIFY_OK;
 }
 
+static int vfio_pci_nvdia_v100_is_ph_in_group(struct device *dev, void *data)
+{
+	return dev->of_node->phandle == *(phandle *) data;
+}
+
+static u32 vfio_pci_nvdia_v100_get_disable_mask(struct device *dev)
+{
+	int npu, peer;
+	u32 mask;
+	struct device_node *dn;
+	struct iommu_group *group;
+
+	dn = dev->of_node;
+	if (!of_find_property(dn, "ibm,nvlink-peers", NULL))
+		return 0;
+
+	group = iommu_group_get(dev);
+	if (!group)
+		return 0;
+
+	/*
+	 * Collect links to keep which includes links to NPU and links to
+	 * other GPUs in the same IOMMU group.
+	 */
+	for (npu = 0, mask = 0; ; ++npu) {
+		u32 npuph = 0;
+
+		if (of_property_read_u32_index(dn, "ibm,npu", npu, &npuph))
+			break;
+
+		for (peer = 0; ; ++peer) {
+			u32 peerph = 0;
+
+			if (of_property_read_u32_index(dn, "ibm,nvlink-peers",
+					peer, &peerph))
+				break;
+
+			if (peerph != npuph &&
+				!iommu_group_for_each_dev(group, &peerph,
+					vfio_pci_nvdia_v100_is_ph_in_group))
+				continue;
+
+			mask |= 1 << (peer + 16);
+		}
+	}
+	iommu_group_put(group);
+
+	/* Disabling mechanism takes links to disable so invert it here */
+	mask = ~mask & 0x3F0000;
+
+	return mask;
+}
+
+static void vfio_pci_nvdia_v100_nvlink2_reset_done(struct vfio_pci_device *vdev)
+{
+	struct pci_dev *pdev = vdev->pdev;
+	u16 cmd = 0, cmdmask;
+	u32 mask, val;
+	void __iomem *bar0;
+
+	bar0 = vdev->barmap[0];
+	if (!bar0)
+		return;
+
+	mask = vfio_pci_nvdia_v100_get_disable_mask(&pdev->dev);
+	if (!mask)
+		return;
+
+	pci_read_config_word(pdev, PCI_COMMAND, &cmd);
+	cmdmask = PCI_COMMAND_MEMORY | PCI_COMMAND_MASTER | PCI_COMMAND_PARITY;
+	if ((cmd & cmdmask) != cmdmask)
+		pci_write_config_word(pdev, PCI_COMMAND, cmd | cmdmask);
+
+	/*
+	 * The sequence is from
+	 * Tesla P100 and V100 SXM2 NVLink Isolation on Multi-Tenant Systems.
+	 * The register names are not provided there either, hence raw values.
+	 */
+	iowrite32(0x4, bar0 + 0x12004C);
+	iowrite32(0x2, bar0 + 0x122204);
+	val = ioread32(bar0 + 0x200);
+	val |= 0x02000000;
+	iowrite32(val, bar0 + 0x200);
+	val = ioread32(bar0 + 0xA00148);
+	val |= mask;
+	iowrite32(val, bar0 + 0xA00148);
+	val = ioread32(bar0 + 0xA00148);
+
+	if ((cmd | cmdmask) != cmd)
+		pci_write_config_word(pdev, PCI_COMMAND, cmd);
+}
+
+static struct vfio_pci_error_handlers vfio_pci_nvdia_v100_error_handlers = {
+	.reset_done = vfio_pci_nvdia_v100_nvlink2_reset_done,
+};
+
 int vfio_pci_nvdia_v100_nvlink2_init(struct vfio_pci_device *vdev)
 {
 	int ret;
@@ -286,6 +382,8 @@ int vfio_pci_nvdia_v100_nvlink2_init(struct vfio_pci_device *vdev)
 	if (ret)
 		goto free_exit;
 
+	vdev->error_handlers = &vfio_pci_nvdia_v100_error_handlers;
+
 	return 0;
 free_exit:
 	kfree(data);
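With the series applied, the expected flow is: boot the host with
"isolate_nvlink" (or "isolate_nvlink=1") on the kernel command line so
pnv_try_setup_npu_table_group() creates a separate IOMMU group per GPU,
then let the VFIO user reset the function; the reset_done() hook above
re-disables the unwanted NVLinks right after the reset. For illustration
only, a minimal userspace sketch (not part of this series) of triggering
such a reset through the standard VFIO API; the device fd is assumed to
have been obtained via VFIO_GROUP_GET_DEVICE_FD and error handling is
stripped down:

#include <sys/ioctl.h>
#include <linux/vfio.h>

/*
 * Hypothetical illustration: ask the kernel to reset the VFIO PCI function.
 * After the reset completes, the pci_error_handlers reset_done() callback
 * runs, which for an NVLink2 V100 writes the link-disable mask to BAR0.
 */
static int reset_vfio_device(int device_fd)
{
	return ioctl(device_fd, VFIO_DEVICE_RESET);
}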