From patchwork Thu Jun 22 21:48:30 2023
X-Patchwork-Submitter: Joao Martins
X-Patchwork-Id: 1798715
From: Joao Martins
To: qemu-devel@nongnu.org
Cc: Alex Williamson, Cedric Le Goater, Paolo Bonzini, Peter Xu,
 David Hildenbrand, Philippe Mathieu-Daude, "Michael S. Tsirkin",
 Marcel Apfelbaum, Jason Wang, Richard Henderson, Eduardo Habkost,
 Avihai Horon, Jason Gunthorpe, Joao Martins, Yi Liu
Subject: [PATCH v4 00/15] vfio: VFIO migration support with vIOMMU
Date: Thu, 22 Jun 2023 22:48:30 +0100
Message-Id: <20230622214845.3980-1-joao.m.martins@oracle.com>

Hey,

This series introduces support for vIOMMU with VFIO device migration,
particularly related to how we do the dirty page tracking.
Today vIOMMUs serve two purposes: 1) enable interrupt remapping 2)
provide DMA translation services for guests, to provide some form of
guest kernel managed DMA e.g. for nested virt based usage; (1) is
especially required for big VMs with VFs with more than 255 vcpus. We
tackle both and remove the migration blocker when vIOMMU is present
provided the conditions are met. I have both use-cases here in one
series, but I am happy to tackle them in separate series.

As I found out we don't necessarily need to expose the whole vIOMMU
functionality in order to just support interrupt remapping. x86 IOMMUs
on Windows Server 2018[2] and Linux >=5.10, with qemu 7.1+ (or really
Linux guests with commit c40aaaac10 and since qemu commit 8646d9c773d8)
can instantiate an IOMMU just for interrupt remapping without it needing
to advertise/support DMA translation. AMD IOMMU in theory can provide
the same, but Linux doesn't quite support the IR-only part there yet,
only intel-iommu.

The series is organized as follows:

Patches 1-5: Today we can't gather vIOMMU details before the guest
establishes its first DMA mapping via the vIOMMU. So these first four
patches add a way for vIOMMUs to be asked for their properties at start
of day. I chose the least-churn approach for now (as opposed to a
treewide conversion), which allows easy conversion a posteriori. As
suggested by Peter Xu[7], I have resurrected Yi's patches[5][6], which
allow us to fetch PCI backing vIOMMU attributes without necessarily
tying the caller (VFIO or anyone else) to an IOMMU MR like I was doing
in v3.

Patches 6-8: Handle configs with vIOMMU interrupt remapping but without
DMA translation allowed. Today the 'dma-translation' attribute is
x86-iommu only, but the way this series is structured nothing stops
other vIOMMUs from supporting it too, as long as they use
pci_setup_iommu_ops() and the necessary IOMMU MR get_attr attributes
are handled.
The blocker is thus relaxed when vIOMMUs are able to toggle/report the
DMA_TRANSLATION attribute. With the patches up to this set, we've then
tackled item (1) of the second paragraph.

Patches 9-15: Simplified a lot from v2 (patch 9) to only track the
complete IOVA address space, leveraging the logic we use to compose the
dirty ranges. The blocker is once again relaxed for vIOMMUs that
advertise their IOVA addressing limits. This tackles item (2). So far I
mainly use it with intel-iommu, although I have a small set of patches
for virtio-iommu per Alex's suggestion in v2.

Comments, suggestions welcome. Thanks for the review!

Regards,
	Joao

Changes since v3[8]:
* Pick up Yi's patches[5][6], and rework the first four patches.
  These patches are a bit better split compared to the original ones,
  and make the new iommu_ops *optional* as opposed to a treewide
  conversion. Rather than returning an IOMMU MR and letting VFIO operate
  on it to fetch attributes, we instead let the underlying IOMMU driver
  fetch the desired IOMMU MR and ask for the desired IOMMU attribute.
  Callers only care about PCI Device backing vIOMMU attributes
  regardless of its topology/association. (Peter Xu)
  I've kept all the same authorship and noted the changes from the
  original where applicable.
* Because of the rework of the first four patches, switch to
  individual attributes in the VFIOSpace that track dma_translation
  and the max_iova. All are expected to be unused when zero to retain
  the defaults of today in common code.
* Improve the migration blocker message of the last patch to make it
  more obvious that the vIOMMU migration blocker is added when no vIOMMU
  address space limits are advertised. (Patch 15)
* Cast to uintptr_t in IOMMUAttr data in intel-iommu (Philippe).
* Switch to MAKE_64BIT_MASK() instead of plain left shift (Philippe).
* Change diffstat of patches with scripts/git.orderfile (Philippe).
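
On the MAKE_64BIT_MASK() item above: a plain `(1ULL << width) - 1` is
undefined behaviour in C when width is 64, which is why a shift-right
based mask is the safer way to turn an address width into a limit. The
sketch below is illustrative only and is not code from this series. The
macro is a local stand-in written to have the same shape as QEMU's
MAKE_64BIT_MASK(shift, length), and sketch_max_iova() is a hypothetical
helper; that the series uses the mask for exactly this purpose is an
assumption.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Local stand-in (assumption: same shape as QEMU's MAKE_64BIT_MASK).
 * Valid for 1 <= length <= 64; length == 0 would shift by 64.
 */
#define SKETCH_MAKE_64BIT_MASK(shift, length) \
    (((~0ULL) >> (64 - (length))) << (shift))

/*
 * Hypothetical helper: highest IOVA addressable with aw_bits address
 * bits. Unlike (1ULL << aw_bits) - 1, this is well defined for 64.
 */
static uint64_t sketch_max_iova(unsigned aw_bits)
{
    assert(aw_bits >= 1 && aw_bits <= 64);
    return SKETCH_MAKE_64BIT_MASK(0, aw_bits);
}
```

With a 39-bit width this yields 0x7fffffffff, and with 64 bits it yields
UINT64_MAX without relying on an out-of-range shift.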

Changes since v2[3]:
* New patches 1-9 to be able to handle vIOMMUs without DMA translation,
  and introduce ways to know various IOMMU model attributes via the
  IOMMU MR. This is partly meant to address a comment in previous
  versions that we can't access the IOMMU MR prior to the DMA mapping
  happening. Before this series the vfio giommu_list only tracks
  'mapped GIOVA', and that is controlled by the guest. It also better
  tackles IOMMU usage for interrupt-remapping-only purposes.
* Dropped Peter Xu's ack on patch 9 given that the code changed a bit.
* Adjust patch 14 to account for the VFIO bitmaps no longer being
  pointers.
* The patches that existed in v2 of vIOMMU dirty tracking are mostly
  untouched, except patch 12 which was greatly simplified.

Changes since v1[4]:
- Rebased on latest master branch. As part of it, made some changes in
  pre-copy to adjust it to Juan's new patches:
  1. Added a new patch that passes threshold_size parameter to
     .state_pending_{estimate,exact}() handlers.
  2. Added a new patch that refactors vfio_save_block().
  3. Changed the pre-copy patch to cache and report pending pre-copy
     size in the .state_pending_estimate() handler.
- Removed unnecessary P2P code. This should be added later on when P2P
  support is added. (Alex)
- Moved the dirty sync to be after the DMA unmap in vfio_dma_unmap()
  (patch #11). (Alex)
- Stored vfio_devices_all_device_dirty_tracking()'s value in a local
  variable in vfio_get_dirty_bitmap() so it can be re-used (patch #11).
- Refactored the viommu device dirty tracking ranges creation code to
  make it clearer (patch #15).
- Changed overflow check in vfio_iommu_range_is_device_tracked() to
  emphasize that we specifically check for 2^64 wrap around (patch #15).
- Added R-bs / Acks.
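
To make the 2^64 wrap-around item above concrete, here is a minimal
sketch of such a check. It is not the body of
vfio_iommu_range_is_device_tracked(); the function name, its parameters
and the choice to reject empty ranges are all assumptions made for
illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical check: true when [iova, iova + size) neither wraps past
 * 2^64 nor extends beyond max_iova (an inclusive limit).
 */
static bool sketch_range_is_tracked(uint64_t iova, uint64_t size,
                                    uint64_t max_iova)
{
    uint64_t last;

    if (size == 0) {
        return false;           /* assumption: empty ranges are rejected */
    }

    /* Unsigned arithmetic is modular, so a wrap shows up as last < iova. */
    last = iova + size - 1;
    if (last < iova) {
        return false;           /* wrapped around 2^64 */
    }

    return last <= max_iova;
}
```

Computing the last byte (iova + size - 1) rather than the exclusive end
keeps a range that finishes exactly at the top of the 64-bit space from
being misreported as an overflow.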

[0] https://lore.kernel.org/qemu-devel/20230222174915.5647-1-avihaih@nvidia.com/
[1] https://lore.kernel.org/qemu-devel/c66d2d8e-f042-964a-a797-a3d07c260a3b@oracle.com/
[2] https://learn.microsoft.com/en-us/windows-hardware/design/device-experiences/oem-kernel-dma-protection
[3] https://lore.kernel.org/qemu-devel/20230222174915.5647-1-avihaih@nvidia.com/
[4] https://lore.kernel.org/qemu-devel/20230126184948.10478-1-avihaih@nvidia.com/
[5] https://lore.kernel.org/all/20210302203827.437645-5-yi.l.liu@intel.com/
[6] https://lore.kernel.org/all/20210302203827.437645-6-yi.l.liu@intel.com/
[7] https://lore.kernel.org/qemu-devel/ZH9Kr6mrKNqUgcYs@x1n/
[8] https://lore.kernel.org/qemu-devel/20230530175937.24202-1-joao.m.martins@oracle.com/

Avihai Horon (4):
  memory/iommu: Add IOMMU_ATTR_MAX_IOVA attribute
  intel-iommu: Implement IOMMU_ATTR_MAX_IOVA get_attr() attribute
  vfio/common: Extract vIOMMU code from vfio_sync_dirty_bitmap()
  vfio/common: Optimize device dirty page tracking with vIOMMU

Joao Martins (7):
  memory/iommu: Add IOMMU_ATTR_DMA_TRANSLATION attribute
  intel-iommu: Implement get_attr() method
  vfio/common: Track whether DMA Translation is enabled on the vIOMMU
  vfio/common: Relax vIOMMU detection when DMA translation is off
  vfio/common: Move dirty tracking ranges update to helper
  vfio/common: Support device dirty page tracking with vIOMMU
  vfio/common: Block migration with vIOMMUs without address width limits

Yi Liu (4):
  hw/pci: Add a pci_setup_iommu_ops() helper
  hw/pci: Refactor pci_device_iommu_address_space()
  hw/pci: Introduce pci_device_iommu_get_attr()
  intel-iommu: Switch to pci_setup_iommu_ops()

 include/exec/memory.h         |   4 +-
 include/hw/pci/pci.h          |  11 ++
 include/hw/pci/pci_bus.h      |   1 +
 include/hw/vfio/vfio-common.h |   2 +
 hw/i386/intel_iommu.c         |  53 +++++++-
 hw/pci/pci.c                  |  58 +++++++-
 hw/vfio/common.c              | 241 ++++++++++++++++++++++++++--------
 hw/vfio/pci.c                 |  22 +++-
 8 files changed, 329 insertions(+), 63 deletions(-)