From patchwork Fri Nov 4 19:33:53 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: John Cabaj X-Patchwork-Id: 1699859 Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@legolas.ozlabs.org Authentication-Results: legolas.ozlabs.org; spf=pass (sender SPF authorized) smtp.mailfrom=lists.ubuntu.com (client-ip=91.189.94.19; helo=huckleberry.canonical.com; envelope-from=kernel-team-bounces@lists.ubuntu.com; receiver=) Authentication-Results: legolas.ozlabs.org; dkim=fail reason="signature verification failed" (2048-bit key; unprotected) header.d=canonical.com header.i=@canonical.com header.a=rsa-sha256 header.s=20210705 header.b=t45pyhTu; dkim-atps=neutral Received: from huckleberry.canonical.com (huckleberry.canonical.com [91.189.94.19]) (using TLSv1.2 with cipher ECDHE-ECDSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by legolas.ozlabs.org (Postfix) with ESMTPS id 4N3rRH3x9jz23lQ for ; Sat, 5 Nov 2022 06:34:18 +1100 (AEDT) Received: from localhost ([127.0.0.1] helo=huckleberry.canonical.com) by huckleberry.canonical.com with esmtp (Exim 4.86_2) (envelope-from ) id 1or2Rz-0002NS-A8; Fri, 04 Nov 2022 19:33:59 +0000 Received: from smtp-relay-internal-1.internal ([10.131.114.114] helo=smtp-relay-internal-1.canonical.com) by huckleberry.canonical.com with esmtps (TLS1.2:ECDHE_RSA_AES_128_GCM_SHA256:128) (Exim 4.86_2) (envelope-from ) id 1or2Rw-0002N6-OC for kernel-team@lists.ubuntu.com; Fri, 04 Nov 2022 19:33:56 +0000 Received: from mail-wm1-f72.google.com (mail-wm1-f72.google.com [209.85.128.72]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by smtp-relay-internal-1.canonical.com (Postfix) with ESMTPS id 6A32D412D4 for ; Fri, 4 Nov 2022 19:33:56 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=canonical.com; s=20210705; t=1667590436; bh=cLcrBPLXVeLtIBjZ0YEb5k7OgfqWyRJ4P+hOIqlVBDI=; h=From:To:Subject:Date:Message-Id:MIME-Version; b=t45pyhTuBPNsDKpHb6JMoL4PIb5K0o32x8KgDpbRX3K8IilSI4fweXV1knn7UMDV0 9sGQvhIB8Evlp7NqL5tOgZcUwbPfdi12GDdyBUOtk6/yPgXWXRksL5YOhXEWFZ4JjF JIdfflmjt1RT73dPua3z/PLY0H4pSg9QJ0b4T3mnjng9WMn2YZNvplPKzgBKNVrEgs lTIAOMC73yQ7dY82KAUXoYd1Paf/DxKffCd3KTrPAYyTAkRlN8X5KosUK1MLXGOW5/ 5Dg7H1Xqb3+KhHfSPBIEXx2dfCsapMw7/+EMQr8hBF9I+h7aWxxpJ4LdRrfHbtsLHw tBi1wEzAPWQ6Q== Received: by mail-wm1-f72.google.com with SMTP id o18-20020a05600c4fd200b003c6ceb1339bso3960466wmq.1 for ; Fri, 04 Nov 2022 12:33:56 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=content-transfer-encoding:mime-version:message-id:date:subject:to :from:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=cLcrBPLXVeLtIBjZ0YEb5k7OgfqWyRJ4P+hOIqlVBDI=; b=vRqzz4I9UtRJaE4PXDRxREmUOAASuDG44q8N8SA/LfcbIYcZiBw91PxXlqWqXtbBT/ uEURT1ALWGsDIuxygd/CRM462qvr1ZSYDYcJqPk3/pEOaeaemgbSoImxzGkD+ART/w29 SuOMXYDgRSEXwpDBsgWVD/LmDzb0Fou8klvFw2VjDdkyRbvKr8aPEeE8PonoXduNAQ3f X2MSNV6zbeJcUgN/iQFiflj8u+AHIRr/FIKhwA9oygOGOZmJpUi07Zo6nLVPFGOn0yqy Qg8QvxMjcpyX6jr+jAwPCyZShZnZEqhcML8n0w8WKvM7YDMn8HxOSItXBEX6bFsi+fW3 FLqA== X-Gm-Message-State: ACrzQf2LNL79BYyFmBEAhS+MFLMYVTe97thYnAuibbgVHABsXM8WcQR9 ms2f97M3MX3XQ5THUx4KgKn9OYKNwGrwGs1lLvYXl2qaIkV5XIuSANXUzcbJOoeJcRSXvsVQV3i H8Rpkmg3VsGldkGlHSbyBhX39MZZgRE7jBLbNiPNKXA== X-Received: by 2002:a5d:4e4f:0:b0:236:7a40:ed01 with SMTP id r15-20020a5d4e4f000000b002367a40ed01mr23132801wrt.402.1667590435562; Fri, 04 Nov 2022 12:33:55 -0700 (PDT) X-Google-Smtp-Source: AMsMyM6/UwHDNjvMmqoyrG2jK1DRJ/iDiVqj5Pkdan8iOjG9OSt2HM1U3zYlDsO28yI9ZfCrXHYDwA== X-Received: by 2002:a5d:4e4f:0:b0:236:7a40:ed01 with SMTP id r15-20020a5d4e4f000000b002367a40ed01mr23132785wrt.402.1667590435146; Fri, 04 Nov 2022 12:33:55 -0700 (PDT) Received: from smtp.gmail.com ([62.168.35.125]) by smtp.gmail.com with ESMTPSA id b6-20020a5d6346000000b0022e66749437sm124463wrw.93.2022.11.04.12.33.54 for (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 04 Nov 2022 12:33:54 -0700 (PDT) From: John Cabaj To: kernel-team@lists.ubuntu.com Subject: [SRU][kinetic:linux-azure][PATCH] UBUNTU: SAUCE: PCI: hv: Only reuse existing IRTE allocation for Multi-MSI Date: Fri, 4 Nov 2022 14:33:53 -0500 Message-Id: <20221104193353.26851-1-john.cabaj@canonical.com> X-Mailer: git-send-email 2.34.1 MIME-Version: 1.0 X-BeenThere: kernel-team@lists.ubuntu.com X-Mailman-Version: 2.1.20 Precedence: list List-Id: Kernel team discussions List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: kernel-team-bounces@lists.ubuntu.com Sender: "kernel-team" From: Dexuan Cui BugLink: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1995408 Jeffrey added Multi-MSI support to the pci-hyperv driver by the 4 patches: 08e61e861a0e ("PCI: hv: Fix multi-MSI to allow more than one MSI vector") 455880dfe292 ("PCI: hv: Fix hv_arch_irq_unmask() for multi-MSI") b4b77778ecc5 ("PCI: hv: Reuse existing IRTE allocation in compose_msi_msg()") a2bad844a67b ("PCI: hv: Fix interrupt mapping for multi-MSI") It turns out that the third patch (b4b77778ecc5) causes a performance regression because all the interrupts now happen on 1 physical CPU (or two pCPUs, if one pCPU doesn't have enough vectors). When a guest has many PCI devices, it may suffer from soft lockups if the workload is heavy, e.g., see https://lwn.net/ml/linux-kernel/20220804025104.15673-1-decui@microsoft.com/ Commit b4b77778ecc5 itself is good. The real issue is that the hypercall in hv_irq_unmask() -> hv_arch_irq_unmask() -> hv_do_hypercall(HVCALL_RETARGET_INTERRUPT...) only changes the target virtual CPU rather than physical CPU; with b4b77778ecc5, the pCPU is determined only once in hv_compose_msi_msg() where only vCPU0 is specified; consequently the hypervisor only uses 1 target pCPU for all the interrupts. Note: before b4b77778ecc5, the pCPU is determined twice, and when the pCPU is determinted the second time, the vCPU in the effective affinity mask is used (i.e., it isn't always vCPU0), so the hypervisor chooses a different pCPU for each interrupt. The hypercall will be fixed in future to update the pCPU as well, but that will take quite a while, so let's restore the old behavior in hv_compose_msi_msg(), i.e., don't reuse the existing IRTE allocation for single-MSI and MSI-X; for multi-MSI, we choose the vCPU in a round-robin manner for each PCI device, so the interrupts of different devices can happen on different pCPUs, though the interrupts of each device happen on some single pCPU. The hypercall fix may not be backported to all old versions of Hyper-V, so we want to have this guest side change for ever (or at least till we're sure the old affected versions of Hyper-V are no longer supported). Fixes: b4b77778ecc5 ("PCI: hv: Reuse existing IRTE allocation in compose_msi_msg()") Co-developed-by: Jeffrey Hugo Signed-off-by: Jeffrey Hugo Co-developed-by: Carl Vanderlip Signed-off-by: Carl Vanderlip Signed-off-by: Dexuan Cui Signed-off-by: John Cabaj --- drivers/pci/controller/pci-hyperv.c | 61 ++++++++++++++++++++++------- 1 file changed, 46 insertions(+), 15 deletions(-) diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c index db814f7b93ba..c10548243d65 100644 --- a/drivers/pci/controller/pci-hyperv.c +++ b/drivers/pci/controller/pci-hyperv.c @@ -1613,7 +1613,7 @@ static void hv_pci_compose_compl(void *context, struct pci_response *resp, } static u32 hv_compose_msi_req_v1( - struct pci_create_interrupt *int_pkt, struct cpumask *affinity, + struct pci_create_interrupt *int_pkt, u32 slot, u8 vector, u8 vector_count) { int_pkt->message_type.type = PCI_CREATE_INTERRUPT_MESSAGE; @@ -1640,18 +1640,39 @@ static int hv_compose_msi_req_get_cpu(struct cpumask *affinity) return cpumask_first_and(affinity, cpu_online_mask); } -static u32 hv_compose_msi_req_v2( - struct pci_create_interrupt2 *int_pkt, struct cpumask *affinity, - u32 slot, u8 vector, u8 vector_count) +/* + * Make sure the dummy vCPU values for multi-MSI don't all point to vCPU0. + */ +static int hv_compose_multi_msi_req_get_cpu(void) { + static DEFINE_SPINLOCK(multi_msi_cpu_lock); + + /* -1 means starting with CPU 0 */ + static int cpu_next = -1; + + unsigned long flags; int cpu; + spin_lock_irqsave(&multi_msi_cpu_lock, flags); + + cpu_next = cpumask_next_wrap(cpu_next, cpu_online_mask, nr_cpu_ids, + false); + cpu = cpu_next; + + spin_unlock_irqrestore(&multi_msi_cpu_lock, flags); + + return cpu; +} + +static u32 hv_compose_msi_req_v2( + struct pci_create_interrupt2 *int_pkt, int cpu, + u32 slot, u8 vector, u8 vector_count) +{ int_pkt->message_type.type = PCI_CREATE_INTERRUPT_MESSAGE2; int_pkt->wslot.slot = slot; int_pkt->int_desc.vector = vector; int_pkt->int_desc.vector_count = vector_count; int_pkt->int_desc.delivery_mode = DELIVERY_MODE; - cpu = hv_compose_msi_req_get_cpu(affinity); int_pkt->int_desc.processor_array[0] = hv_cpu_number_to_vp_number(cpu); int_pkt->int_desc.processor_count = 1; @@ -1660,18 +1681,15 @@ static u32 hv_compose_msi_req_v2( } static u32 hv_compose_msi_req_v3( - struct pci_create_interrupt3 *int_pkt, struct cpumask *affinity, + struct pci_create_interrupt3 *int_pkt, int cpu, u32 slot, u32 vector, u8 vector_count) { - int cpu; - int_pkt->message_type.type = PCI_CREATE_INTERRUPT_MESSAGE3; int_pkt->wslot.slot = slot; int_pkt->int_desc.vector = vector; int_pkt->int_desc.reserved = 0; int_pkt->int_desc.vector_count = vector_count; int_pkt->int_desc.delivery_mode = DELIVERY_MODE; - cpu = hv_compose_msi_req_get_cpu(affinity); int_pkt->int_desc.processor_array[0] = hv_cpu_number_to_vp_number(cpu); int_pkt->int_desc.processor_count = 1; @@ -1710,12 +1728,18 @@ static void hv_compose_msi_msg(struct irq_data *data, struct msi_msg *msg) struct pci_create_interrupt3 v3; } int_pkts; } __packed ctxt; + bool multi_msi; u64 trans_id; u32 size; int ret; + int cpu; + + msi_desc = irq_data_get_msi_desc(data); + multi_msi = !msi_desc->pci.msi_attrib.is_msix && + msi_desc->nvec_used > 1; /* Reuse the previous allocation */ - if (data->chip_data) { + if (data->chip_data && multi_msi) { int_desc = data->chip_data; msg->address_hi = int_desc->address >> 32; msg->address_lo = int_desc->address & 0xffffffff; @@ -1723,7 +1747,6 @@ static void hv_compose_msi_msg(struct irq_data *data, struct msi_msg *msg) return; } - msi_desc = irq_data_get_msi_desc(data); pdev = msi_desc_to_pci_dev(msi_desc); dest = irq_data_get_effective_affinity_mask(data); pbus = pdev->bus; @@ -1733,11 +1756,18 @@ static void hv_compose_msi_msg(struct irq_data *data, struct msi_msg *msg) if (!hpdev) goto return_null_message; + /* Free any previous message that might have already been composed. */ + if (data->chip_data && !multi_msi) { + int_desc = data->chip_data; + data->chip_data = NULL; + hv_int_desc_free(hpdev, int_desc); + } + int_desc = kzalloc(sizeof(*int_desc), GFP_ATOMIC); if (!int_desc) goto drop_reference; - if (!msi_desc->pci.msi_attrib.is_msix && msi_desc->nvec_used > 1) { + if (multi_msi) { /* * If this is not the first MSI of Multi MSI, we already have * a mapping. Can exit early. @@ -1762,9 +1792,11 @@ static void hv_compose_msi_msg(struct irq_data *data, struct msi_msg *msg) */ vector = 32; vector_count = msi_desc->nvec_used; + cpu = hv_compose_multi_msi_req_get_cpu(); } else { vector = hv_msi_get_int_vector(data); vector_count = 1; + cpu = hv_compose_msi_req_get_cpu(dest); } memset(&ctxt, 0, sizeof(ctxt)); @@ -1775,7 +1807,6 @@ static void hv_compose_msi_msg(struct irq_data *data, struct msi_msg *msg) switch (hbus->protocol_version) { case PCI_PROTOCOL_VERSION_1_1: size = hv_compose_msi_req_v1(&ctxt.int_pkts.v1, - dest, hpdev->desc.win_slot.slot, vector, vector_count); @@ -1784,7 +1815,7 @@ static void hv_compose_msi_msg(struct irq_data *data, struct msi_msg *msg) case PCI_PROTOCOL_VERSION_1_2: case PCI_PROTOCOL_VERSION_1_3: size = hv_compose_msi_req_v2(&ctxt.int_pkts.v2, - dest, + cpu, hpdev->desc.win_slot.slot, vector, vector_count); @@ -1792,7 +1823,7 @@ static void hv_compose_msi_msg(struct irq_data *data, struct msi_msg *msg) case PCI_PROTOCOL_VERSION_1_4: size = hv_compose_msi_req_v3(&ctxt.int_pkts.v3, - dest, + cpu, hpdev->desc.win_slot.slot, vector, vector_count);