Message ID | 20221104222953.11356-1-decui@microsoft.com |
---|---|
State | New |
Headers | show |
Series | [v3] PCI: hv: Only reuse existing IRTE allocation for Multi-MSI | expand |
From: Dexuan Cui <decui@microsoft.com> Sent: Friday, November 4, 2022 3:30 PM > > Jeffrey added Multi-MSI support to the pci-hyperv driver by the 4 patches: > 08e61e861a0e ("PCI: hv: Fix multi-MSI to allow more than one MSI vector") > 455880dfe292 ("PCI: hv: Fix hv_arch_irq_unmask() for multi-MSI") > b4b77778ecc5 ("PCI: hv: Reuse existing IRTE allocation in compose_msi_msg()") > a2bad844a67b ("PCI: hv: Fix interrupt mapping for multi-MSI") > > It turns out that the third patch (b4b77778ecc5) causes a performance > regression because all the interrupts now happen on 1 physical CPU (or two > pCPUs, if one pCPU doesn't have enough vectors). When a guest has many PCI > devices, it may suffer from soft lockups if the workload is heavy, e.g., > see https://lwn.net/ml/linux-kernel/20220804025104.15673-1-decui@microsoft.com/ > > Commit b4b77778ecc5 itself is good. The real issue is that the hypercall in > hv_irq_unmask() -> hv_arch_irq_unmask() -> > hv_do_hypercall(HVCALL_RETARGET_INTERRUPT...) only changes the target > virtual CPU rather than physical CPU; with b4b77778ecc5, the pCPU is > determined only once in hv_compose_msi_msg() where only vCPU0 is specified; > consequently the hypervisor only uses 1 target pCPU for all the interrupts. > > Note: before b4b77778ecc5, the pCPU is determined twice, and when the pCPU > is determinted the second time, the vCPU in the effective affinity mask is > used (i.e., it isn't always vCPU0), so the hypervisor chooses different > pCPU for each interrupt. > > The hypercall will be fixed in future to update the pCPU as well, but > that will take quite a while, so let's restore the old behavior in > hv_compose_msi_msg(), i.e., don't reuse the existing IRTE allocation for > single-MSI and MSI-X; for multi-MSI, we choose the vCPU in a round-robin > manner for each PCI device, so the interrupts of different devices can > happen on different pCPUs, though the interrupts of each device happen on > some single pCPU. > > The hypercall fix may not be backported to all old versions of Hyper-V, so > we want to have this guest side change for ever (or at least till we're sure > the old affected versions of Hyper-V are no longer supported). > > Fixes: b4b77778ecc5 ("PCI: hv: Reuse existing IRTE allocation in compose_msi_msg()") > Co-developed-by: Jeffrey Hugo <quic_jhugo@quicinc.com> > Signed-off-by: Jeffrey Hugo <quic_jhugo@quicinc.com> > Co-developed-by: Carl Vanderlip <quic_carlv@quicinc.com> > Signed-off-by: Carl Vanderlip <quic_carlv@quicinc.com> > Signed-off-by: Dexuan Cui <decui@microsoft.com> > > --- > > v1 is here: > > https://lwn.net/ml/linux-kernel/20220804025104.15673-1-decui@microsoft.com/ > > Changes in v2: > round-robin the vCPU for multi-MSI. > The commit message is re-worked. > Added Jeff and Carl's Co-developed-by and Signed-off-by. > > Changes in v3: > Michael Kelley kindly helped to make a great comment, and I added the > comment before hv_compose_msi_req_get_cpu(). Thank you, Michael! > > Rebased to Hyper-V tree's "hyperv-fixes" branch: > > https://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux.git/log/?h=hyperv-fixes > > Bjorn, Lorenzo, it would be great to have your Ack. The patch needs to go > through the Hyper-V tree because it's rebased to another hv_pci patch (which > only exists in the Hyper-V tree for now): > e70af8d040d2 ("PCI: hv: Fix the definition of vector in hv_compose_msi_msg()") > > BTW, Michael has some other hv_pci patches, which would also need go through > the Hyper-V tree: > > https://lwn.net/ml/linux-kernel/1666288635-72591-1-git-send-email-mikelley%40microsoft.com/ > > > drivers/pci/controller/pci-hyperv.c | 90 ++++++++++++++++++++++++----- > 1 file changed, 75 insertions(+), 15 deletions(-) > > diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c > index ba64284eaf9f..fa5a1ba35a82 100644 > --- a/drivers/pci/controller/pci-hyperv.c > +++ b/drivers/pci/controller/pci-hyperv.c > @@ -1613,7 +1613,7 @@ static void hv_pci_compose_compl(void *context, struct > pci_response *resp, > } > > static u32 hv_compose_msi_req_v1( > - struct pci_create_interrupt *int_pkt, const struct cpumask *affinity, > + struct pci_create_interrupt *int_pkt, > u32 slot, u8 vector, u16 vector_count) > { > int_pkt->message_type.type = PCI_CREATE_INTERRUPT_MESSAGE; > @@ -1631,6 +1631,35 @@ static u32 hv_compose_msi_req_v1( > return sizeof(*int_pkt); > } > > +/* > + * The vCPU selected by hv_compose_multi_msi_req_get_cpu() and > + * hv_compose_msi_req_get_cpu() is a "dummy" vCPU because the final vCPU to be > + * interrupted is specified later in hv_irq_unmask() and communicated to Hyper-V > + * via the HVCALL_RETARGET_INTERRUPT hypercall. But the choice of dummy vCPU is > + * not irrelevant because Hyper-V chooses the physical CPU to handle the > + * interrupts based on the vCPU specified in message sent to the vPCI VSP in > + * hv_compose_msi_msg(). Hyper-V's choice of pCPU is not visible to the guest, > + * but assigning too many vPCI device interrupts to the same pCPU can cause a > + * performance bottleneck. So we spread out the dummy vCPUs to influence Hyper- > V > + * to spread out the pCPUs that it selects. > + * > + * For the single-MSI and MSI-X cases, it's OK for hv_compose_msi_req_get_cpu() > + * to always return the same dummy vCPU, because a second call to > + * hv_compose_msi_msg() contains the "real" vCPU, causing Hyper-V to choose a > + * new pCPU for the interrupt. But for the multi-MSI case, the second call to > + * hv_compose_msi_msg() exits without sending a message to the vPCI VSP, so the > + * original dummy vCPU is used. This dummy vCPU must be round-robin'ed so that > + * the pCPUs are spread out. All interrupts for a multi-MSI device end up using > + * the same pCPU, even though the vCPUs will be spread out by later calls > + * to hv_irq_unmask(), but that is the best we can do now. > + * > + * With current Hyper-V, the HVCALL_RETARGET_INTERRUPT hypercall does *not* > + * cause Hyper-V to reselect the pCPU based on the specified vCPU. Such an > + * enhancement is planned for a future version. With that enhancement, the > + * dummy vCPU selection won't matter, and interrupts for the same multi-MSI > + * device will be spread across multiple pCPUs. > + */ > + > /* > * Create MSI w/ dummy vCPU set targeting just one vCPU, overwritten > * by subsequent retarget in hv_irq_unmask(). > @@ -1640,18 +1669,39 @@ static int hv_compose_msi_req_get_cpu(const struct > cpumask *affinity) > return cpumask_first_and(affinity, cpu_online_mask); > } > > -static u32 hv_compose_msi_req_v2( > - struct pci_create_interrupt2 *int_pkt, const struct cpumask *affinity, > - u32 slot, u8 vector, u16 vector_count) > +/* > + * Make sure the dummy vCPU values for multi-MSI don't all point to vCPU0. > + */ > +static int hv_compose_multi_msi_req_get_cpu(void) > { > + static DEFINE_SPINLOCK(multi_msi_cpu_lock); > + > + /* -1 means starting with CPU 0 */ > + static int cpu_next = -1; > + > + unsigned long flags; > int cpu; > > + spin_lock_irqsave(&multi_msi_cpu_lock, flags); > + > + cpu_next = cpumask_next_wrap(cpu_next, cpu_online_mask, nr_cpu_ids, > + false); > + cpu = cpu_next; > + > + spin_unlock_irqrestore(&multi_msi_cpu_lock, flags); > + > + return cpu; > +} > + > +static u32 hv_compose_msi_req_v2( > + struct pci_create_interrupt2 *int_pkt, int cpu, > + u32 slot, u8 vector, u16 vector_count) > +{ > int_pkt->message_type.type = PCI_CREATE_INTERRUPT_MESSAGE2; > int_pkt->wslot.slot = slot; > int_pkt->int_desc.vector = vector; > int_pkt->int_desc.vector_count = vector_count; > int_pkt->int_desc.delivery_mode = DELIVERY_MODE; > - cpu = hv_compose_msi_req_get_cpu(affinity); > int_pkt->int_desc.processor_array[0] = > hv_cpu_number_to_vp_number(cpu); > int_pkt->int_desc.processor_count = 1; > @@ -1660,18 +1710,15 @@ static u32 hv_compose_msi_req_v2( > } > > static u32 hv_compose_msi_req_v3( > - struct pci_create_interrupt3 *int_pkt, const struct cpumask *affinity, > + struct pci_create_interrupt3 *int_pkt, int cpu, > u32 slot, u32 vector, u16 vector_count) > { > - int cpu; > - > int_pkt->message_type.type = PCI_CREATE_INTERRUPT_MESSAGE3; > int_pkt->wslot.slot = slot; > int_pkt->int_desc.vector = vector; > int_pkt->int_desc.reserved = 0; > int_pkt->int_desc.vector_count = vector_count; > int_pkt->int_desc.delivery_mode = DELIVERY_MODE; > - cpu = hv_compose_msi_req_get_cpu(affinity); > int_pkt->int_desc.processor_array[0] = > hv_cpu_number_to_vp_number(cpu); > int_pkt->int_desc.processor_count = 1; > @@ -1715,12 +1762,18 @@ static void hv_compose_msi_msg(struct irq_data *data, > struct msi_msg *msg) > struct pci_create_interrupt3 v3; > } int_pkts; > } __packed ctxt; > + bool multi_msi; > u64 trans_id; > u32 size; > int ret; > + int cpu; > + > + msi_desc = irq_data_get_msi_desc(data); > + multi_msi = !msi_desc->pci.msi_attrib.is_msix && > + msi_desc->nvec_used > 1; > > /* Reuse the previous allocation */ > - if (data->chip_data) { > + if (data->chip_data && multi_msi) { > int_desc = data->chip_data; > msg->address_hi = int_desc->address >> 32; > msg->address_lo = int_desc->address & 0xffffffff; > @@ -1728,7 +1781,6 @@ static void hv_compose_msi_msg(struct irq_data *data, > struct msi_msg *msg) > return; > } > > - msi_desc = irq_data_get_msi_desc(data); > pdev = msi_desc_to_pci_dev(msi_desc); > dest = irq_data_get_effective_affinity_mask(data); > pbus = pdev->bus; > @@ -1738,11 +1790,18 @@ static void hv_compose_msi_msg(struct irq_data *data, > struct msi_msg *msg) > if (!hpdev) > goto return_null_message; > > + /* Free any previous message that might have already been composed. */ > + if (data->chip_data && !multi_msi) { > + int_desc = data->chip_data; > + data->chip_data = NULL; > + hv_int_desc_free(hpdev, int_desc); > + } > + > int_desc = kzalloc(sizeof(*int_desc), GFP_ATOMIC); > if (!int_desc) > goto drop_reference; > > - if (!msi_desc->pci.msi_attrib.is_msix && msi_desc->nvec_used > 1) { > + if (multi_msi) { > /* > * If this is not the first MSI of Multi MSI, we already have > * a mapping. Can exit early. > @@ -1767,9 +1826,11 @@ static void hv_compose_msi_msg(struct irq_data *data, > struct msi_msg *msg) > */ > vector = 32; > vector_count = msi_desc->nvec_used; > + cpu = hv_compose_multi_msi_req_get_cpu(); > } else { > vector = hv_msi_get_int_vector(data); > vector_count = 1; > + cpu = hv_compose_msi_req_get_cpu(dest); > } > > /* > @@ -1785,7 +1846,6 @@ static void hv_compose_msi_msg(struct irq_data *data, > struct msi_msg *msg) > switch (hbus->protocol_version) { > case PCI_PROTOCOL_VERSION_1_1: > size = hv_compose_msi_req_v1(&ctxt.int_pkts.v1, > - dest, > hpdev->desc.win_slot.slot, > (u8)vector, > vector_count); > @@ -1794,7 +1854,7 @@ static void hv_compose_msi_msg(struct irq_data *data, > struct msi_msg *msg) > case PCI_PROTOCOL_VERSION_1_2: > case PCI_PROTOCOL_VERSION_1_3: > size = hv_compose_msi_req_v2(&ctxt.int_pkts.v2, > - dest, > + cpu, > hpdev->desc.win_slot.slot, > (u8)vector, > vector_count); > @@ -1802,7 +1862,7 @@ static void hv_compose_msi_msg(struct irq_data *data, > struct msi_msg *msg) > > case PCI_PROTOCOL_VERSION_1_4: > size = hv_compose_msi_req_v3(&ctxt.int_pkts.v3, > - dest, > + cpu, > hpdev->desc.win_slot.slot, > vector, > vector_count); > -- > 2.25.1 Reviewed-by: Michael Kelley <mikelley@microsoft.com>
> Sent: Friday, November 4, 2022 4:38 PM > > ... > > Bjorn, Lorenzo, it would be great to have your Ack. The patch needs to go > > through the Hyper-V tree because it's rebased to another hv_pci patch > > (which only exists in the Hyper-V tree for now): > > e70af8d040d2 ("PCI: hv: Fix the definition of vector in > > hv_compose_msi_msg()") > > > > BTW, Michael has some other hv_pci patches, which would also need go > > through the Hyper-V tree: > > > Reviewed-by: Michael Kelley <mikelley@microsoft.com> Hi Bjorn, Lorenzo, if you have no objection to the patch, I suggest Wei merge it through the Hyper-V tree early next week.
On Fri, Nov 04, 2022 at 03:29:53PM -0700, Dexuan Cui wrote: > Jeffrey added Multi-MSI support to the pci-hyperv driver by the 4 patches: > 08e61e861a0e ("PCI: hv: Fix multi-MSI to allow more than one MSI vector") > 455880dfe292 ("PCI: hv: Fix hv_arch_irq_unmask() for multi-MSI") > b4b77778ecc5 ("PCI: hv: Reuse existing IRTE allocation in compose_msi_msg()") > a2bad844a67b ("PCI: hv: Fix interrupt mapping for multi-MSI") > > It turns out that the third patch (b4b77778ecc5) causes a performance > regression because all the interrupts now happen on 1 physical CPU (or two > pCPUs, if one pCPU doesn't have enough vectors). When a guest has many PCI > devices, it may suffer from soft lockups if the workload is heavy, e.g., > see https://lwn.net/ml/linux-kernel/20220804025104.15673-1-decui@microsoft.com/ > > Commit b4b77778ecc5 itself is good. The real issue is that the hypercall in > hv_irq_unmask() -> hv_arch_irq_unmask() -> > hv_do_hypercall(HVCALL_RETARGET_INTERRUPT...) only changes the target > virtual CPU rather than physical CPU; with b4b77778ecc5, the pCPU is > determined only once in hv_compose_msi_msg() where only vCPU0 is specified; > consequently the hypervisor only uses 1 target pCPU for all the interrupts. > > Note: before b4b77778ecc5, the pCPU is determined twice, and when the pCPU > is determinted the second time, the vCPU in the effective affinity mask is s/determinted/determined/ > used (i.e., it isn't always vCPU0), so the hypervisor chooses different > pCPU for each interrupt. > > The hypercall will be fixed in future to update the pCPU as well, but > that will take quite a while, so let's restore the old behavior in > hv_compose_msi_msg(), i.e., don't reuse the existing IRTE allocation for > single-MSI and MSI-X; for multi-MSI, we choose the vCPU in a round-robin > manner for each PCI device, so the interrupts of different devices can > happen on different pCPUs, though the interrupts of each device happen on > some single pCPU. > > The hypercall fix may not be backported to all old versions of Hyper-V, so > we want to have this guest side change for ever (or at least till we're sure s/for ever/forever/ > the old affected versions of Hyper-V are no longer supported). > > Fixes: b4b77778ecc5 ("PCI: hv: Reuse existing IRTE allocation in compose_msi_msg()") > Co-developed-by: Jeffrey Hugo <quic_jhugo@quicinc.com> > Signed-off-by: Jeffrey Hugo <quic_jhugo@quicinc.com> > Co-developed-by: Carl Vanderlip <quic_carlv@quicinc.com> > Signed-off-by: Carl Vanderlip <quic_carlv@quicinc.com> > Signed-off-by: Dexuan Cui <decui@microsoft.com> > > --- > > v1 is here: > https://lwn.net/ml/linux-kernel/20220804025104.15673-1-decui@microsoft.com/ > > Changes in v2: > round-robin the vCPU for multi-MSI. > The commit message is re-worked. > Added Jeff and Carl's Co-developed-by and Signed-off-by. > > Changes in v3: > Michael Kelley kindly helped to make a great comment, and I added the > comment before hv_compose_msi_req_get_cpu(). Thank you, Michael! > > Rebased to Hyper-V tree's "hyperv-fixes" branch: > https://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux.git/log/?h=hyperv-fixes > > Bjorn, Lorenzo, it would be great to have your Ack. The patch needs to go > through the Hyper-V tree because it's rebased to another hv_pci patch (which > only exists in the Hyper-V tree for now): > e70af8d040d2 ("PCI: hv: Fix the definition of vector in hv_compose_msi_msg()") Fine with me, but it's Lorenzo's area so I don't want to preemptively ack it.
> From: Bjorn Helgaas <helgaas@kernel.org> > Sent: Thursday, November 10, 2022 1:44 PM > > ... > > Note: before b4b77778ecc5, the pCPU is determined twice, and when the > > pCPU is determinted the second time, the vCPU in the effective affinity > > mask is > > s/determinted/determined/ Thanks, Bjorn! I suppose Wei can help fix this :-) > > The hypercall fix may not be backported to all old versions of Hyper-V, so > > we want to have this guest side change for ever (or at least till we're sure > > s/for ever/forever/ ditto. > > Bjorn, Lorenzo, it would be great to have your Ack. The patch needs to go > > through the Hyper-V tree because it's rebased to another hv_pci patch > > (which only exists in the Hyper-V tree for now): > > e70af8d040d2 ("PCI: hv: Fix the definition of vector in > > hv_compose_msi_msg()") > > Fine with me, but it's Lorenzo's area so I don't want to preemptively > ack it. Thanks. Hopefully Lorenzo can take a look soon.
On Fri, Nov 04, 2022 at 03:29:53PM -0700, Dexuan Cui wrote: > Jeffrey added Multi-MSI support to the pci-hyperv driver by the 4 patches: > 08e61e861a0e ("PCI: hv: Fix multi-MSI to allow more than one MSI vector") > 455880dfe292 ("PCI: hv: Fix hv_arch_irq_unmask() for multi-MSI") > b4b77778ecc5 ("PCI: hv: Reuse existing IRTE allocation in compose_msi_msg()") > a2bad844a67b ("PCI: hv: Fix interrupt mapping for multi-MSI") > > It turns out that the third patch (b4b77778ecc5) causes a performance > regression because all the interrupts now happen on 1 physical CPU (or two > pCPUs, if one pCPU doesn't have enough vectors). When a guest has many PCI > devices, it may suffer from soft lockups if the workload is heavy, e.g., > see https://lwn.net/ml/linux-kernel/20220804025104.15673-1-decui@microsoft.com/ > > Commit b4b77778ecc5 itself is good. The real issue is that the hypercall in > hv_irq_unmask() -> hv_arch_irq_unmask() -> > hv_do_hypercall(HVCALL_RETARGET_INTERRUPT...) only changes the target > virtual CPU rather than physical CPU; with b4b77778ecc5, the pCPU is > determined only once in hv_compose_msi_msg() where only vCPU0 is specified; > consequently the hypervisor only uses 1 target pCPU for all the interrupts. > > Note: before b4b77778ecc5, the pCPU is determined twice, and when the pCPU > is determinted the second time, the vCPU in the effective affinity mask is > used (i.e., it isn't always vCPU0), so the hypervisor chooses different > pCPU for each interrupt. > > The hypercall will be fixed in future to update the pCPU as well, but > that will take quite a while, so let's restore the old behavior in > hv_compose_msi_msg(), i.e., don't reuse the existing IRTE allocation for > single-MSI and MSI-X; for multi-MSI, we choose the vCPU in a round-robin > manner for each PCI device, so the interrupts of different devices can > happen on different pCPUs, though the interrupts of each device happen on > some single pCPU. > > The hypercall fix may not be backported to all old versions of Hyper-V, so > we want to have this guest side change for ever (or at least till we're sure > the old affected versions of Hyper-V are no longer supported). > > Fixes: b4b77778ecc5 ("PCI: hv: Reuse existing IRTE allocation in compose_msi_msg()") > Co-developed-by: Jeffrey Hugo <quic_jhugo@quicinc.com> > Signed-off-by: Jeffrey Hugo <quic_jhugo@quicinc.com> > Co-developed-by: Carl Vanderlip <quic_carlv@quicinc.com> > Signed-off-by: Carl Vanderlip <quic_carlv@quicinc.com> > Signed-off-by: Dexuan Cui <decui@microsoft.com> > > --- > > v1 is here: > https://lwn.net/ml/linux-kernel/20220804025104.15673-1-decui@microsoft.com/ > > Changes in v2: > round-robin the vCPU for multi-MSI. > The commit message is re-worked. > Added Jeff and Carl's Co-developed-by and Signed-off-by. > > Changes in v3: > Michael Kelley kindly helped to make a great comment, and I added the > comment before hv_compose_msi_req_get_cpu(). Thank you, Michael! > > Rebased to Hyper-V tree's "hyperv-fixes" branch: > https://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux.git/log/?h=hyperv-fixes > > Bjorn, Lorenzo, it would be great to have your Ack. The patch needs to go > through the Hyper-V tree because it's rebased to another hv_pci patch (which > only exists in the Hyper-V tree for now): > e70af8d040d2 ("PCI: hv: Fix the definition of vector in hv_compose_msi_msg()") > > BTW, Michael has some other hv_pci patches, which would also need go through > the Hyper-V tree: > https://lwn.net/ml/linux-kernel/1666288635-72591-1-git-send-email-mikelley%40microsoft.com/ > > > drivers/pci/controller/pci-hyperv.c | 90 ++++++++++++++++++++++++----- > 1 file changed, 75 insertions(+), 15 deletions(-) > > diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c > index ba64284eaf9f..fa5a1ba35a82 100644 > --- a/drivers/pci/controller/pci-hyperv.c > +++ b/drivers/pci/controller/pci-hyperv.c > @@ -1613,7 +1613,7 @@ static void hv_pci_compose_compl(void *context, struct pci_response *resp, > } > > static u32 hv_compose_msi_req_v1( > - struct pci_create_interrupt *int_pkt, const struct cpumask *affinity, > + struct pci_create_interrupt *int_pkt, > u32 slot, u8 vector, u16 vector_count) > { > int_pkt->message_type.type = PCI_CREATE_INTERRUPT_MESSAGE; > @@ -1631,6 +1631,35 @@ static u32 hv_compose_msi_req_v1( > return sizeof(*int_pkt); > } > > +/* > + * The vCPU selected by hv_compose_multi_msi_req_get_cpu() and > + * hv_compose_msi_req_get_cpu() is a "dummy" vCPU because the final vCPU to be > + * interrupted is specified later in hv_irq_unmask() and communicated to Hyper-V > + * via the HVCALL_RETARGET_INTERRUPT hypercall. But the choice of dummy vCPU is > + * not irrelevant because Hyper-V chooses the physical CPU to handle the > + * interrupts based on the vCPU specified in message sent to the vPCI VSP in > + * hv_compose_msi_msg(). Hyper-V's choice of pCPU is not visible to the guest, > + * but assigning too many vPCI device interrupts to the same pCPU can cause a > + * performance bottleneck. So we spread out the dummy vCPUs to influence Hyper-V > + * to spread out the pCPUs that it selects. > + * > + * For the single-MSI and MSI-X cases, it's OK for hv_compose_msi_req_get_cpu() > + * to always return the same dummy vCPU, because a second call to > + * hv_compose_msi_msg() contains the "real" vCPU, causing Hyper-V to choose a > + * new pCPU for the interrupt. But for the multi-MSI case, the second call to > + * hv_compose_msi_msg() exits without sending a message to the vPCI VSP, so the Why ? Can't you fix _that_ ? Why can't the initial call to hv_compose_msi_msg() determine the _real_ target vCPU ? > + * original dummy vCPU is used. This dummy vCPU must be round-robin'ed so that > + * the pCPUs are spread out. All interrupts for a multi-MSI device end up using > + * the same pCPU, even though the vCPUs will be spread out by later calls > + * to hv_irq_unmask(), but that is the best we can do now. > + * > + * With current Hyper-V, the HVCALL_RETARGET_INTERRUPT hypercall does *not* "current" Hyper-V means nothing, remove it or provide versioning information. Imagine yourself reading this comment some time in the future. I can't claim to understand how this MSI vCPU to pCPU mapping is made to work in current code but I can't ack this patch sorry, if you feel like it is good to merge it it is your and Hyper-V maintainers call, feel free to go ahead - I can review PCI hyper-V changes that affect PCI and IRQs core APIs, I don't know Hyper-V internals. Lorenzo > + * cause Hyper-V to reselect the pCPU based on the specified vCPU. Such an > + * enhancement is planned for a future version. With that enhancement, the > + * dummy vCPU selection won't matter, and interrupts for the same multi-MSI > + * device will be spread across multiple pCPUs. > + */ > + > /* > * Create MSI w/ dummy vCPU set targeting just one vCPU, overwritten > * by subsequent retarget in hv_irq_unmask(). > @@ -1640,18 +1669,39 @@ static int hv_compose_msi_req_get_cpu(const struct cpumask *affinity) > return cpumask_first_and(affinity, cpu_online_mask); > } > > -static u32 hv_compose_msi_req_v2( > - struct pci_create_interrupt2 *int_pkt, const struct cpumask *affinity, > - u32 slot, u8 vector, u16 vector_count) > +/* > + * Make sure the dummy vCPU values for multi-MSI don't all point to vCPU0. > + */ > +static int hv_compose_multi_msi_req_get_cpu(void) > { > + static DEFINE_SPINLOCK(multi_msi_cpu_lock); > + > + /* -1 means starting with CPU 0 */ > + static int cpu_next = -1; > + > + unsigned long flags; > int cpu; > > + spin_lock_irqsave(&multi_msi_cpu_lock, flags); > + > + cpu_next = cpumask_next_wrap(cpu_next, cpu_online_mask, nr_cpu_ids, > + false); > + cpu = cpu_next; > + > + spin_unlock_irqrestore(&multi_msi_cpu_lock, flags); > + > + return cpu; > +} > + > +static u32 hv_compose_msi_req_v2( > + struct pci_create_interrupt2 *int_pkt, int cpu, > + u32 slot, u8 vector, u16 vector_count) > +{ > int_pkt->message_type.type = PCI_CREATE_INTERRUPT_MESSAGE2; > int_pkt->wslot.slot = slot; > int_pkt->int_desc.vector = vector; > int_pkt->int_desc.vector_count = vector_count; > int_pkt->int_desc.delivery_mode = DELIVERY_MODE; > - cpu = hv_compose_msi_req_get_cpu(affinity); > int_pkt->int_desc.processor_array[0] = > hv_cpu_number_to_vp_number(cpu); > int_pkt->int_desc.processor_count = 1; > @@ -1660,18 +1710,15 @@ static u32 hv_compose_msi_req_v2( > } > > static u32 hv_compose_msi_req_v3( > - struct pci_create_interrupt3 *int_pkt, const struct cpumask *affinity, > + struct pci_create_interrupt3 *int_pkt, int cpu, > u32 slot, u32 vector, u16 vector_count) > { > - int cpu; > - > int_pkt->message_type.type = PCI_CREATE_INTERRUPT_MESSAGE3; > int_pkt->wslot.slot = slot; > int_pkt->int_desc.vector = vector; > int_pkt->int_desc.reserved = 0; > int_pkt->int_desc.vector_count = vector_count; > int_pkt->int_desc.delivery_mode = DELIVERY_MODE; > - cpu = hv_compose_msi_req_get_cpu(affinity); > int_pkt->int_desc.processor_array[0] = > hv_cpu_number_to_vp_number(cpu); > int_pkt->int_desc.processor_count = 1; > @@ -1715,12 +1762,18 @@ static void hv_compose_msi_msg(struct irq_data *data, struct msi_msg *msg) > struct pci_create_interrupt3 v3; > } int_pkts; > } __packed ctxt; > + bool multi_msi; > u64 trans_id; > u32 size; > int ret; > + int cpu; > + > + msi_desc = irq_data_get_msi_desc(data); > + multi_msi = !msi_desc->pci.msi_attrib.is_msix && > + msi_desc->nvec_used > 1; > > /* Reuse the previous allocation */ > - if (data->chip_data) { > + if (data->chip_data && multi_msi) { > int_desc = data->chip_data; > msg->address_hi = int_desc->address >> 32; > msg->address_lo = int_desc->address & 0xffffffff; > @@ -1728,7 +1781,6 @@ static void hv_compose_msi_msg(struct irq_data *data, struct msi_msg *msg) > return; > } > > - msi_desc = irq_data_get_msi_desc(data); > pdev = msi_desc_to_pci_dev(msi_desc); > dest = irq_data_get_effective_affinity_mask(data); > pbus = pdev->bus; > @@ -1738,11 +1790,18 @@ static void hv_compose_msi_msg(struct irq_data *data, struct msi_msg *msg) > if (!hpdev) > goto return_null_message; > > + /* Free any previous message that might have already been composed. */ > + if (data->chip_data && !multi_msi) { > + int_desc = data->chip_data; > + data->chip_data = NULL; > + hv_int_desc_free(hpdev, int_desc); > + } > + > int_desc = kzalloc(sizeof(*int_desc), GFP_ATOMIC); > if (!int_desc) > goto drop_reference; > > - if (!msi_desc->pci.msi_attrib.is_msix && msi_desc->nvec_used > 1) { > + if (multi_msi) { > /* > * If this is not the first MSI of Multi MSI, we already have > * a mapping. Can exit early. > @@ -1767,9 +1826,11 @@ static void hv_compose_msi_msg(struct irq_data *data, struct msi_msg *msg) > */ > vector = 32; > vector_count = msi_desc->nvec_used; > + cpu = hv_compose_multi_msi_req_get_cpu(); > } else { > vector = hv_msi_get_int_vector(data); > vector_count = 1; > + cpu = hv_compose_msi_req_get_cpu(dest); > } > > /* > @@ -1785,7 +1846,6 @@ static void hv_compose_msi_msg(struct irq_data *data, struct msi_msg *msg) > switch (hbus->protocol_version) { > case PCI_PROTOCOL_VERSION_1_1: > size = hv_compose_msi_req_v1(&ctxt.int_pkts.v1, > - dest, > hpdev->desc.win_slot.slot, > (u8)vector, > vector_count); > @@ -1794,7 +1854,7 @@ static void hv_compose_msi_msg(struct irq_data *data, struct msi_msg *msg) > case PCI_PROTOCOL_VERSION_1_2: > case PCI_PROTOCOL_VERSION_1_3: > size = hv_compose_msi_req_v2(&ctxt.int_pkts.v2, > - dest, > + cpu, > hpdev->desc.win_slot.slot, > (u8)vector, > vector_count); > @@ -1802,7 +1862,7 @@ static void hv_compose_msi_msg(struct irq_data *data, struct msi_msg *msg) > > case PCI_PROTOCOL_VERSION_1_4: > size = hv_compose_msi_req_v3(&ctxt.int_pkts.v3, > - dest, > + cpu, > hpdev->desc.win_slot.slot, > vector, > vector_count); > -- > 2.25.1 >
On Fri, Nov 11, 2022 at 10:56:28AM +0100, Lorenzo Pieralisi wrote: > On Fri, Nov 04, 2022 at 03:29:53PM -0700, Dexuan Cui wrote: > > Jeffrey added Multi-MSI support to the pci-hyperv driver by the 4 patches: > > 08e61e861a0e ("PCI: hv: Fix multi-MSI to allow more than one MSI vector") > > 455880dfe292 ("PCI: hv: Fix hv_arch_irq_unmask() for multi-MSI") > > b4b77778ecc5 ("PCI: hv: Reuse existing IRTE allocation in compose_msi_msg()") > > a2bad844a67b ("PCI: hv: Fix interrupt mapping for multi-MSI") > > > > It turns out that the third patch (b4b77778ecc5) causes a performance > > regression because all the interrupts now happen on 1 physical CPU (or two > > pCPUs, if one pCPU doesn't have enough vectors). When a guest has many PCI > > devices, it may suffer from soft lockups if the workload is heavy, e.g., > > see https://lwn.net/ml/linux-kernel/20220804025104.15673-1-decui@microsoft.com/ > > > > Commit b4b77778ecc5 itself is good. The real issue is that the hypercall in > > hv_irq_unmask() -> hv_arch_irq_unmask() -> > > hv_do_hypercall(HVCALL_RETARGET_INTERRUPT...) only changes the target > > virtual CPU rather than physical CPU; with b4b77778ecc5, the pCPU is > > determined only once in hv_compose_msi_msg() where only vCPU0 is specified; > > consequently the hypervisor only uses 1 target pCPU for all the interrupts. > > > > Note: before b4b77778ecc5, the pCPU is determined twice, and when the pCPU > > is determinted the second time, the vCPU in the effective affinity mask is > > used (i.e., it isn't always vCPU0), so the hypervisor chooses different > > pCPU for each interrupt. > > > > The hypercall will be fixed in future to update the pCPU as well, but > > that will take quite a while, so let's restore the old behavior in > > hv_compose_msi_msg(), i.e., don't reuse the existing IRTE allocation for > > single-MSI and MSI-X; for multi-MSI, we choose the vCPU in a round-robin > > manner for each PCI device, so the interrupts of different devices can > > happen on different pCPUs, though the interrupts of each device happen on > > some single pCPU. > > > > The hypercall fix may not be backported to all old versions of Hyper-V, so > > we want to have this guest side change for ever (or at least till we're sure > > the old affected versions of Hyper-V are no longer supported). > > > > Fixes: b4b77778ecc5 ("PCI: hv: Reuse existing IRTE allocation in compose_msi_msg()") > > Co-developed-by: Jeffrey Hugo <quic_jhugo@quicinc.com> > > Signed-off-by: Jeffrey Hugo <quic_jhugo@quicinc.com> > > Co-developed-by: Carl Vanderlip <quic_carlv@quicinc.com> > > Signed-off-by: Carl Vanderlip <quic_carlv@quicinc.com> > > Signed-off-by: Dexuan Cui <decui@microsoft.com> > > > > --- > > > > v1 is here: > > https://lwn.net/ml/linux-kernel/20220804025104.15673-1-decui@microsoft.com/ > > > > Changes in v2: > > round-robin the vCPU for multi-MSI. > > The commit message is re-worked. > > Added Jeff and Carl's Co-developed-by and Signed-off-by. > > > > Changes in v3: > > Michael Kelley kindly helped to make a great comment, and I added the > > comment before hv_compose_msi_req_get_cpu(). Thank you, Michael! > > > > Rebased to Hyper-V tree's "hyperv-fixes" branch: > > https://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux.git/log/?h=hyperv-fixes > > > > Bjorn, Lorenzo, it would be great to have your Ack. The patch needs to go > > through the Hyper-V tree because it's rebased to another hv_pci patch (which > > only exists in the Hyper-V tree for now): > > e70af8d040d2 ("PCI: hv: Fix the definition of vector in hv_compose_msi_msg()") > > > > BTW, Michael has some other hv_pci patches, which would also need go through > > the Hyper-V tree: > > https://lwn.net/ml/linux-kernel/1666288635-72591-1-git-send-email-mikelley%40microsoft.com/ > > > > > > drivers/pci/controller/pci-hyperv.c | 90 ++++++++++++++++++++++++----- > > 1 file changed, 75 insertions(+), 15 deletions(-) > > > > diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c > > index ba64284eaf9f..fa5a1ba35a82 100644 > > --- a/drivers/pci/controller/pci-hyperv.c > > +++ b/drivers/pci/controller/pci-hyperv.c > > @@ -1613,7 +1613,7 @@ static void hv_pci_compose_compl(void *context, struct pci_response *resp, > > } > > > > static u32 hv_compose_msi_req_v1( > > - struct pci_create_interrupt *int_pkt, const struct cpumask *affinity, > > + struct pci_create_interrupt *int_pkt, > > u32 slot, u8 vector, u16 vector_count) > > { > > int_pkt->message_type.type = PCI_CREATE_INTERRUPT_MESSAGE; > > @@ -1631,6 +1631,35 @@ static u32 hv_compose_msi_req_v1( > > return sizeof(*int_pkt); > > } > > > > +/* > > + * The vCPU selected by hv_compose_multi_msi_req_get_cpu() and > > + * hv_compose_msi_req_get_cpu() is a "dummy" vCPU because the final vCPU to be > > + * interrupted is specified later in hv_irq_unmask() and communicated to Hyper-V > > + * via the HVCALL_RETARGET_INTERRUPT hypercall. But the choice of dummy vCPU is > > + * not irrelevant because Hyper-V chooses the physical CPU to handle the > > + * interrupts based on the vCPU specified in message sent to the vPCI VSP in > > + * hv_compose_msi_msg(). Hyper-V's choice of pCPU is not visible to the guest, > > + * but assigning too many vPCI device interrupts to the same pCPU can cause a > > + * performance bottleneck. So we spread out the dummy vCPUs to influence Hyper-V > > + * to spread out the pCPUs that it selects. > > + * > > + * For the single-MSI and MSI-X cases, it's OK for hv_compose_msi_req_get_cpu() > > + * to always return the same dummy vCPU, because a second call to > > + * hv_compose_msi_msg() contains the "real" vCPU, causing Hyper-V to choose a > > + * new pCPU for the interrupt. But for the multi-MSI case, the second call to > > + * hv_compose_msi_msg() exits without sending a message to the vPCI VSP, so the > > Why ? Can't you fix _that_ ? Why can't the initial call to > hv_compose_msi_msg() determine the _real_ target vCPU ? Dexuan, any comment on this? > > > + * original dummy vCPU is used. This dummy vCPU must be round-robin'ed so that > > + * the pCPUs are spread out. All interrupts for a multi-MSI device end up using > > + * the same pCPU, even though the vCPUs will be spread out by later calls > > + * to hv_irq_unmask(), but that is the best we can do now. > > + * > > + * With current Hyper-V, the HVCALL_RETARGET_INTERRUPT hypercall does *not* > > "current" Hyper-V means nothing, remove it or provide versioning > information. Imagine yourself reading this comment some time > in the future. > And this? Wei.
> From: Lorenzo Pieralisi <lpieralisi@kernel.org> > Sent: Friday, November 11, 2022 1:56 AM > > ... > > + * For the single-MSI and MSI-X cases, it's OK for > > hv_compose_msi_req_get_cpu() > > + * to always return the same dummy vCPU, because a second call to > > + * hv_compose_msi_msg() contains the "real" vCPU, causing Hyper-V to > > choose a > > + * new pCPU for the interrupt. But for the multi-MSI case, the second call to > > + * hv_compose_msi_msg() exits without sending a message to the vPCI VSP, > > so the > > Why ? Can't you fix _that_ ? Why can't the initial call to > hv_compose_msi_msg() determine the _real_ target vCPU ? Unluckily I can't fix this because of the way it works in x86's irq management code. This is out of the control of the pci-hyperv driver. hv_compose_msi_msg() uses irq_data_get_effective_affinity_mask() to get the "effective"mask. On x86, when the irq is initialized, irq_data_update_effective_affinity() is called from apic_update_irq_cfg() -- please refer to the below debug code. When the initial call to hv_compose_msi_msg() is invoked, it's from pci_alloc_irq_vectors(), and the x86 irq code always passes CPU0 to pci-hyperv. Please see the below "cpumask_first(cpu_online_mask)" in vector_assign_managed_shutdown(). When hv_compose_msi_msg() is invoked the second time, it's from request_irq(), and the x86 irq code passes the real effectivey CPU to pci-hyperv. I added the below debug code and pasted the trimmed output below. --- a/arch/x86/kernel/apic/vector.c +++ b/arch/x86/kernel/apic/vector.c @@ -179,6 +179,7 @@ static void vector_assign_managed_shutdown(struct irq_data *irqd) unsigned int cpu = cpumask_first(cpu_online_mask); apic_update_irq_cfg(irqd, MANAGED_IRQ_SHUTDOWN_VECTOR, cpu); + WARN(irqd->irq >= 24, "cdx: vector_assign_managed_shutdown: irq=%d, cpu=%d\n", irqd->irq, cpu); } static int reserve_managed_vector(struct irq_data *irqd) @@ -251,6 +252,7 @@ assign_vector_locked(struct irq_data *irqd, const struct cpumask *dest) return vector; apic_update_vector(irqd, vector, cpu); apic_update_irq_cfg(irqd, vector, cpu); + WARN(irqd->irq >= 24, "cdx: assign_vector_locked: irq=%d, cpu=%d\n", irqd->irq, cpu); return 0; } @@ -328,6 +330,7 @@ assign_managed_vector(struct irq_data *irqd, const struct cpumask *dest) return vector; apic_update_vector(irqd, vector, cpu); apic_update_irq_cfg(irqd, vector, cpu); + WARN(irqd->irq >= 24, "cdx: assign_managed_vector: irq=%d, cpu=%d\n", irqd->irq, cpu); return 0; } --- a/drivers/pci/controller/pci-hyperv.c +++ b/drivers/pci/controller/pci-hyperv.c @@ -1803,6 +1803,7 @@ static void hv_compose_msi_msg(struct irq_data *data, struct msi_msg *msg) u32 size; int ret; + WARN(1, "cdx: hv_compose_msi_msg: irq=%d\n", data->irq); /* Reuse the previous allocation */ if (data->chip_data) { int_desc = data->chip_data; 1 cdx: vector_assign_managed_shutdown: irq=24, cpu=0 2 WARNING: CPU: 3 PID: 87 at arch/x86/kernel/apic/vector.c:182 vector_assign_managed_shutdown+0x53/0x60 3 RIP: 0010:vector_assign_managed_shutdown+0x53/0x60 4 reserve_irq_vector_locked+0x41/0xa0 5 x86_vector_alloc_irqs+0x298/0x460 6 irq_domain_alloc_irqs_hierarchy+0x1b/0x50 7 irq_domain_alloc_irqs_parent+0x17/0x30 8 msi_domain_alloc+0x83/0x150 9 irq_domain_alloc_irqs_hierarchy+0x1b/0x50 10 __irq_domain_alloc_irqs+0xdf/0x320 11 __msi_domain_alloc_irqs+0x103/0x3e0 12 msi_domain_alloc_irqs_descs_locked+0x3e/0x90 13 pci_msi_setup_msi_irqs+0x2d/0x40 14 __pci_enable_msix_range+0x2fd/0x420 15 pci_alloc_irq_vectors_affinity+0xb0/0x110 16 nvme_reset_work+0x1cf/0x1170 17 process_one_work+0x21f/0x3f0 18 worker_thread+0x4a/0x3c0 19 kthread+0xff/0x130 20 ret_from_fork+0x22/0x30 21 22 cdx: vector_assign_managed_shutdown: irq=24, cpu=0 23 WARNING: CPU: 3 PID: 87 at arch/x86/kernel/apic/vector.c:182 vector_assign_managed_shutdown+0x53/0x60 24 RIP: 0010:vector_assign_managed_shutdown+0x53/0x60 25 x86_vector_activate+0x149/0x1e0 26 __irq_domain_activate_irq+0x58/0x90 27 __irq_domain_activate_irq+0x38/0x90 28 irq_domain_activate_irq+0x2d/0x50 29 __msi_domain_alloc_irqs+0x1bb/0x3e0 30 msi_domain_alloc_irqs_descs_locked+0x3e/0x90 31 pci_msi_setup_msi_irqs+0x2d/0x40 32 __pci_enable_msix_range+0x2fd/0x420 33 pci_alloc_irq_vectors_affinity+0xb0/0x110 34 nvme_reset_work+0x1cf/0x1170 35 process_one_work+0x21f/0x3f0 36 worker_thread+0x4a/0x3c0 37 kthread+0xff/0x130 38 ret_from_fork+0x22/0x30 39 40 41 cdx: hv_compose_msi_msg: irq=24 42 WARNING: CPU: 3 PID: 87 at drivers/pci/controller/pci-hyperv.c:1806 hv_compose_msi_msg+0x3f/0x5d0 [pci_hyperv] 43 RIP: 0010:hv_compose_msi_msg+0x3f/0x5d0 [pci_hyperv] 44 irq_chip_compose_msi_msg+0x32/0x50 45 msi_domain_activate+0x45/0xa0 46 __irq_domain_activate_irq+0x58/0x90 47 irq_domain_activate_irq+0x2d/0x50 48 __msi_domain_alloc_irqs+0x1bb/0x3e0 49 msi_domain_alloc_irqs_descs_locked+0x3e/0x90 50 pci_msi_setup_msi_irqs+0x2d/0x40 51 __pci_enable_msix_range+0x2fd/0x420 52 pci_alloc_irq_vectors_affinity+0xb0/0x110 53 nvme_reset_work+0x1cf/0x1170 54 process_one_work+0x21f/0x3f0 55 worker_thread+0x4a/0x3c0 56 kthread+0xff/0x130 57 ret_from_fork+0x22/0x30 58 59 60 61 cdx: assign_vector_locked: irq=24, cpu=3 62 WARNING: CPU: 0 PID: 87 at arch/x86/kernel/apic/vector.c:255 assign_vector_locked+0x160/0x170 63 RIP: 0010:assign_vector_locked+0x160/0x170 64 assign_irq_vector_any_locked+0x6a/0x150 65 x86_vector_activate+0x93/0x1e0 66 __irq_domain_activate_irq+0x58/0x90 67 __irq_domain_activate_irq+0x38/0x90 68 irq_domain_activate_irq+0x2d/0x50 69 irq_activate+0x29/0x30 70 __setup_irq+0x2e5/0x780 71 request_threaded_irq+0x112/0x180 72 pci_request_irq+0xa3/0xf0 73 queue_request_irq+0x74/0x80 74 nvme_reset_work+0x408/0x1170 75 process_one_work+0x21f/0x3f0 76 worker_thread+0x4a/0x3c0 77 kthread+0xff/0x130 78 ret_from_fork+0x22/0x30 79 80 cdx: hv_compose_msi_msg: irq=24 81 WARNING: CPU: 0 PID: 87 at drivers/pci/controller/pci-hyperv.c:1806 hv_compose_msi_msg+0x3f/0x5d0 [pci_hyperv] 82 RIP: 0010:hv_compose_msi_msg+0x3f/0x5d0 [pci_hyperv] 83 irq_chip_compose_msi_msg+0x32/0x50 84 msi_domain_activate+0x45/0xa0 85 __irq_domain_activate_irq+0x58/0x90 86 irq_domain_activate_irq+0x2d/0x50 87 irq_activate+0x29/0x30 88 __setup_irq+0x2e5/0x780 89 request_threaded_irq+0x112/0x180 90 pci_request_irq+0xa3/0xf0 91 queue_request_irq+0x74/0x80 92 nvme_reset_work+0x408/0x1170 93 process_one_work+0x21f/0x3f0 94 worker_thread+0x4a/0x3c0 95 kthread+0xff/0x130 96 ret_from_fork+0x22/0x30 > > + * original dummy vCPU is used. This dummy vCPU must be round-robin'ed > > so that > > + * the pCPUs are spread out. All interrupts for a multi-MSI device end up > > using > > + * the same pCPU, even though the vCPUs will be spread out by later calls > > + * to hv_irq_unmask(), but that is the best we can do now. > > + * > > + * With current Hyper-V, the HVCALL_RETARGET_INTERRUPT hypercall does > *not* > > "current" Hyper-V means nothing, remove it or provide versioning > information. Imagine yourself reading this comment some time > in the future. Good point. @Wei, can you please help fix this? I think we can change "With current Hyper-V" To "With Hyper-V in Nov 2022". BTW, it's hard to provide the exact versioning info, because technically there are many versions of Hyper-V... > I can't claim to understand how this MSI vCPU to pCPU mapping is made to > work in current code but I can't ack this patch sorry, if you feel like > it is good to merge it it is your and Hyper-V maintainers call, feel > free to go ahead - I can review PCI hyper-V changes that affect PCI > and IRQs core APIs, I don't know Hyper-V internals. > > Lorenzo I understand. Thanks! I discussed the issue with Hyper-V team. I believe I understood it and this patch is the right thing to have. @Wei, Bjorn spotted two typos in the commit message, and Lorenzo suggested a change to the above "current". Can you please help fix these and merge the patch? Or, let me know if it'd be easier if I should send v4. Thanks, Dexuan
On Sat, Nov 12, 2022 at 12:58:33AM +0000, Dexuan Cui wrote: [...] > > I discussed the issue with Hyper-V team. I believe I understood it and > this patch is the right thing to have. > > @Wei, Bjorn spotted two typos in the commit message, and Lorenzo > suggested a change to the above "current". Can you please help > fix these and merge the patch? Or, let me know if it'd be easier if > I should send v4. All fixed and patch applied to hyperv-fixes. Thanks, Wei. > > Thanks, > Dexuan >
diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c index ba64284eaf9f..fa5a1ba35a82 100644 --- a/drivers/pci/controller/pci-hyperv.c +++ b/drivers/pci/controller/pci-hyperv.c @@ -1613,7 +1613,7 @@ static void hv_pci_compose_compl(void *context, struct pci_response *resp, } static u32 hv_compose_msi_req_v1( - struct pci_create_interrupt *int_pkt, const struct cpumask *affinity, + struct pci_create_interrupt *int_pkt, u32 slot, u8 vector, u16 vector_count) { int_pkt->message_type.type = PCI_CREATE_INTERRUPT_MESSAGE; @@ -1631,6 +1631,35 @@ static u32 hv_compose_msi_req_v1( return sizeof(*int_pkt); } +/* + * The vCPU selected by hv_compose_multi_msi_req_get_cpu() and + * hv_compose_msi_req_get_cpu() is a "dummy" vCPU because the final vCPU to be + * interrupted is specified later in hv_irq_unmask() and communicated to Hyper-V + * via the HVCALL_RETARGET_INTERRUPT hypercall. But the choice of dummy vCPU is + * not irrelevant because Hyper-V chooses the physical CPU to handle the + * interrupts based on the vCPU specified in message sent to the vPCI VSP in + * hv_compose_msi_msg(). Hyper-V's choice of pCPU is not visible to the guest, + * but assigning too many vPCI device interrupts to the same pCPU can cause a + * performance bottleneck. So we spread out the dummy vCPUs to influence Hyper-V + * to spread out the pCPUs that it selects. + * + * For the single-MSI and MSI-X cases, it's OK for hv_compose_msi_req_get_cpu() + * to always return the same dummy vCPU, because a second call to + * hv_compose_msi_msg() contains the "real" vCPU, causing Hyper-V to choose a + * new pCPU for the interrupt. But for the multi-MSI case, the second call to + * hv_compose_msi_msg() exits without sending a message to the vPCI VSP, so the + * original dummy vCPU is used. This dummy vCPU must be round-robin'ed so that + * the pCPUs are spread out. All interrupts for a multi-MSI device end up using + * the same pCPU, even though the vCPUs will be spread out by later calls + * to hv_irq_unmask(), but that is the best we can do now. + * + * With current Hyper-V, the HVCALL_RETARGET_INTERRUPT hypercall does *not* + * cause Hyper-V to reselect the pCPU based on the specified vCPU. Such an + * enhancement is planned for a future version. With that enhancement, the + * dummy vCPU selection won't matter, and interrupts for the same multi-MSI + * device will be spread across multiple pCPUs. + */ + /* * Create MSI w/ dummy vCPU set targeting just one vCPU, overwritten * by subsequent retarget in hv_irq_unmask(). @@ -1640,18 +1669,39 @@ static int hv_compose_msi_req_get_cpu(const struct cpumask *affinity) return cpumask_first_and(affinity, cpu_online_mask); } -static u32 hv_compose_msi_req_v2( - struct pci_create_interrupt2 *int_pkt, const struct cpumask *affinity, - u32 slot, u8 vector, u16 vector_count) +/* + * Make sure the dummy vCPU values for multi-MSI don't all point to vCPU0. + */ +static int hv_compose_multi_msi_req_get_cpu(void) { + static DEFINE_SPINLOCK(multi_msi_cpu_lock); + + /* -1 means starting with CPU 0 */ + static int cpu_next = -1; + + unsigned long flags; int cpu; + spin_lock_irqsave(&multi_msi_cpu_lock, flags); + + cpu_next = cpumask_next_wrap(cpu_next, cpu_online_mask, nr_cpu_ids, + false); + cpu = cpu_next; + + spin_unlock_irqrestore(&multi_msi_cpu_lock, flags); + + return cpu; +} + +static u32 hv_compose_msi_req_v2( + struct pci_create_interrupt2 *int_pkt, int cpu, + u32 slot, u8 vector, u16 vector_count) +{ int_pkt->message_type.type = PCI_CREATE_INTERRUPT_MESSAGE2; int_pkt->wslot.slot = slot; int_pkt->int_desc.vector = vector; int_pkt->int_desc.vector_count = vector_count; int_pkt->int_desc.delivery_mode = DELIVERY_MODE; - cpu = hv_compose_msi_req_get_cpu(affinity); int_pkt->int_desc.processor_array[0] = hv_cpu_number_to_vp_number(cpu); int_pkt->int_desc.processor_count = 1; @@ -1660,18 +1710,15 @@ static u32 hv_compose_msi_req_v2( } static u32 hv_compose_msi_req_v3( - struct pci_create_interrupt3 *int_pkt, const struct cpumask *affinity, + struct pci_create_interrupt3 *int_pkt, int cpu, u32 slot, u32 vector, u16 vector_count) { - int cpu; - int_pkt->message_type.type = PCI_CREATE_INTERRUPT_MESSAGE3; int_pkt->wslot.slot = slot; int_pkt->int_desc.vector = vector; int_pkt->int_desc.reserved = 0; int_pkt->int_desc.vector_count = vector_count; int_pkt->int_desc.delivery_mode = DELIVERY_MODE; - cpu = hv_compose_msi_req_get_cpu(affinity); int_pkt->int_desc.processor_array[0] = hv_cpu_number_to_vp_number(cpu); int_pkt->int_desc.processor_count = 1; @@ -1715,12 +1762,18 @@ static void hv_compose_msi_msg(struct irq_data *data, struct msi_msg *msg) struct pci_create_interrupt3 v3; } int_pkts; } __packed ctxt; + bool multi_msi; u64 trans_id; u32 size; int ret; + int cpu; + + msi_desc = irq_data_get_msi_desc(data); + multi_msi = !msi_desc->pci.msi_attrib.is_msix && + msi_desc->nvec_used > 1; /* Reuse the previous allocation */ - if (data->chip_data) { + if (data->chip_data && multi_msi) { int_desc = data->chip_data; msg->address_hi = int_desc->address >> 32; msg->address_lo = int_desc->address & 0xffffffff; @@ -1728,7 +1781,6 @@ static void hv_compose_msi_msg(struct irq_data *data, struct msi_msg *msg) return; } - msi_desc = irq_data_get_msi_desc(data); pdev = msi_desc_to_pci_dev(msi_desc); dest = irq_data_get_effective_affinity_mask(data); pbus = pdev->bus; @@ -1738,11 +1790,18 @@ static void hv_compose_msi_msg(struct irq_data *data, struct msi_msg *msg) if (!hpdev) goto return_null_message; + /* Free any previous message that might have already been composed. */ + if (data->chip_data && !multi_msi) { + int_desc = data->chip_data; + data->chip_data = NULL; + hv_int_desc_free(hpdev, int_desc); + } + int_desc = kzalloc(sizeof(*int_desc), GFP_ATOMIC); if (!int_desc) goto drop_reference; - if (!msi_desc->pci.msi_attrib.is_msix && msi_desc->nvec_used > 1) { + if (multi_msi) { /* * If this is not the first MSI of Multi MSI, we already have * a mapping. Can exit early. @@ -1767,9 +1826,11 @@ static void hv_compose_msi_msg(struct irq_data *data, struct msi_msg *msg) */ vector = 32; vector_count = msi_desc->nvec_used; + cpu = hv_compose_multi_msi_req_get_cpu(); } else { vector = hv_msi_get_int_vector(data); vector_count = 1; + cpu = hv_compose_msi_req_get_cpu(dest); } /* @@ -1785,7 +1846,6 @@ static void hv_compose_msi_msg(struct irq_data *data, struct msi_msg *msg) switch (hbus->protocol_version) { case PCI_PROTOCOL_VERSION_1_1: size = hv_compose_msi_req_v1(&ctxt.int_pkts.v1, - dest, hpdev->desc.win_slot.slot, (u8)vector, vector_count); @@ -1794,7 +1854,7 @@ static void hv_compose_msi_msg(struct irq_data *data, struct msi_msg *msg) case PCI_PROTOCOL_VERSION_1_2: case PCI_PROTOCOL_VERSION_1_3: size = hv_compose_msi_req_v2(&ctxt.int_pkts.v2, - dest, + cpu, hpdev->desc.win_slot.slot, (u8)vector, vector_count); @@ -1802,7 +1862,7 @@ static void hv_compose_msi_msg(struct irq_data *data, struct msi_msg *msg) case PCI_PROTOCOL_VERSION_1_4: size = hv_compose_msi_req_v3(&ctxt.int_pkts.v3, - dest, + cpu, hpdev->desc.win_slot.slot, vector, vector_count);