diff mbox series

[SRU,kinetic:linux-azure,v2] UBUNTU: SAUCE: PCI: hv: Only reuse existing IRTE allocation for Multi-MSI

Message ID 20221108191641.47619-1-john.cabaj@canonical.com
State New
Headers show
Series [SRU,kinetic:linux-azure,v2] UBUNTU: SAUCE: PCI: hv: Only reuse existing IRTE allocation for Multi-MSI | expand

Commit Message

John Cabaj Nov. 8, 2022, 7:16 p.m. UTC
From: Dexuan Cui <decui@microsoft.com>

BugLink: https://bugs.launchpad.net/bugs/1995408

Jeffrey added Multi-MSI support to the pci-hyperv driver by the 4 patches:
08e61e861a0e ("PCI: hv: Fix multi-MSI to allow more than one MSI vector")
455880dfe292 ("PCI: hv: Fix hv_arch_irq_unmask() for multi-MSI")
b4b77778ecc5 ("PCI: hv: Reuse existing IRTE allocation in compose_msi_msg()")
a2bad844a67b ("PCI: hv: Fix interrupt mapping for multi-MSI")

It turns out that the third patch (b4b77778ecc5) causes a performance
regression because all the interrupts now happen on 1 physical CPU (or two
pCPUs, if one pCPU doesn't have enough vectors). When a guest has many PCI
devices, it may suffer from soft lockups if the workload is heavy, e.g.,
see https://lwn.net/ml/linux-kernel/20220804025104.15673-1-decui@microsoft.com/

Commit b4b77778ecc5 itself is good. The real issue is that the hypercall in
hv_irq_unmask() -> hv_arch_irq_unmask() ->
hv_do_hypercall(HVCALL_RETARGET_INTERRUPT...) only changes the target
virtual CPU rather than physical CPU; with b4b77778ecc5, the pCPU is
determined only once in hv_compose_msi_msg() where only vCPU0 is specified;
consequently the hypervisor only uses 1 target pCPU for all the interrupts.

Note: before b4b77778ecc5, the pCPU is determined twice, and when the pCPU
is determinted the second time, the vCPU in the effective affinity mask is
used (i.e., it isn't always vCPU0), so the hypervisor chooses a different
pCPU for each interrupt.

The hypercall will be fixed in future to update the pCPU as well, but
that will take quite a while, so let's restore the old behavior in
hv_compose_msi_msg(), i.e., don't reuse the existing IRTE allocation for
single-MSI and MSI-X; for multi-MSI, we choose the vCPU in a round-robin
manner for each PCI device, so the interrupts of different devices can
happen on different pCPUs, though the interrupts of each device happen on
some single pCPU.

The hypercall fix may not be backported to all old versions of Hyper-V, so
we want to have this guest side change for ever (or at least till we're sure
the old affected versions of Hyper-V are no longer supported).

Fixes: b4b77778ecc5 ("PCI: hv: Reuse existing IRTE allocation in compose_msi_msg()")
Co-developed-by: Jeffrey Hugo <quic_jhugo@quicinc.com>
Signed-off-by: Jeffrey Hugo <quic_jhugo@quicinc.com>
Co-developed-by: Carl Vanderlip <quic_carlv@quicinc.com>
Signed-off-by: Carl Vanderlip <quic_carlv@quicinc.com>
Signed-off-by: Dexuan Cui <decui@microsoft.com>
Signed-off-by: John Cabaj <john.cabaj@canonical.com>
---
 drivers/pci/controller/pci-hyperv.c | 61 ++++++++++++++++++++++-------
 1 file changed, 46 insertions(+), 15 deletions(-)

Comments

Tim Gardner Nov. 8, 2022, 7:24 p.m. UTC | #1
On 11/8/22 12:16 PM, John Cabaj wrote:
> From: Dexuan Cui <decui@microsoft.com>
> 
> BugLink: https://bugs.launchpad.net/bugs/1995408
> 
> Jeffrey added Multi-MSI support to the pci-hyperv driver by the 4 patches:
> 08e61e861a0e ("PCI: hv: Fix multi-MSI to allow more than one MSI vector")
> 455880dfe292 ("PCI: hv: Fix hv_arch_irq_unmask() for multi-MSI")
> b4b77778ecc5 ("PCI: hv: Reuse existing IRTE allocation in compose_msi_msg()")
> a2bad844a67b ("PCI: hv: Fix interrupt mapping for multi-MSI")
> 
> It turns out that the third patch (b4b77778ecc5) causes a performance
> regression because all the interrupts now happen on 1 physical CPU (or two
> pCPUs, if one pCPU doesn't have enough vectors). When a guest has many PCI
> devices, it may suffer from soft lockups if the workload is heavy, e.g.,
> see https://lwn.net/ml/linux-kernel/20220804025104.15673-1-decui@microsoft.com/
> 
> Commit b4b77778ecc5 itself is good. The real issue is that the hypercall in
> hv_irq_unmask() -> hv_arch_irq_unmask() ->
> hv_do_hypercall(HVCALL_RETARGET_INTERRUPT...) only changes the target
> virtual CPU rather than physical CPU; with b4b77778ecc5, the pCPU is
> determined only once in hv_compose_msi_msg() where only vCPU0 is specified;
> consequently the hypervisor only uses 1 target pCPU for all the interrupts.
> 
> Note: before b4b77778ecc5, the pCPU is determined twice, and when the pCPU
> is determinted the second time, the vCPU in the effective affinity mask is
> used (i.e., it isn't always vCPU0), so the hypervisor chooses a different
> pCPU for each interrupt.
> 
> The hypercall will be fixed in future to update the pCPU as well, but
> that will take quite a while, so let's restore the old behavior in
> hv_compose_msi_msg(), i.e., don't reuse the existing IRTE allocation for
> single-MSI and MSI-X; for multi-MSI, we choose the vCPU in a round-robin
> manner for each PCI device, so the interrupts of different devices can
> happen on different pCPUs, though the interrupts of each device happen on
> some single pCPU.
> 
> The hypercall fix may not be backported to all old versions of Hyper-V, so
> we want to have this guest side change for ever (or at least till we're sure
> the old affected versions of Hyper-V are no longer supported).
> 
> Fixes: b4b77778ecc5 ("PCI: hv: Reuse existing IRTE allocation in compose_msi_msg()")
> Co-developed-by: Jeffrey Hugo <quic_jhugo@quicinc.com>
> Signed-off-by: Jeffrey Hugo <quic_jhugo@quicinc.com>
> Co-developed-by: Carl Vanderlip <quic_carlv@quicinc.com>
> Signed-off-by: Carl Vanderlip <quic_carlv@quicinc.com>
> Signed-off-by: Dexuan Cui <decui@microsoft.com>
> Signed-off-by: John Cabaj <john.cabaj@canonical.com>
> ---
>   drivers/pci/controller/pci-hyperv.c | 61 ++++++++++++++++++++++-------
>   1 file changed, 46 insertions(+), 15 deletions(-)
> 
> diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
> index db814f7b93ba..c10548243d65 100644
> --- a/drivers/pci/controller/pci-hyperv.c
> +++ b/drivers/pci/controller/pci-hyperv.c
> @@ -1613,7 +1613,7 @@ static void hv_pci_compose_compl(void *context, struct pci_response *resp,
>   }
>   
>   static u32 hv_compose_msi_req_v1(
> -	struct pci_create_interrupt *int_pkt, struct cpumask *affinity,
> +	struct pci_create_interrupt *int_pkt,
>   	u32 slot, u8 vector, u8 vector_count)
>   {
>   	int_pkt->message_type.type = PCI_CREATE_INTERRUPT_MESSAGE;
> @@ -1640,18 +1640,39 @@ static int hv_compose_msi_req_get_cpu(struct cpumask *affinity)
>   	return cpumask_first_and(affinity, cpu_online_mask);
>   }
>   
> -static u32 hv_compose_msi_req_v2(
> -	struct pci_create_interrupt2 *int_pkt, struct cpumask *affinity,
> -	u32 slot, u8 vector, u8 vector_count)
> +/*
> + * Make sure the dummy vCPU values for multi-MSI don't all point to vCPU0.
> + */
> +static int hv_compose_multi_msi_req_get_cpu(void)
>   {
> +	static DEFINE_SPINLOCK(multi_msi_cpu_lock);
> +
> +	/* -1 means starting with CPU 0 */
> +	static int cpu_next = -1;
> +
> +	unsigned long flags;
>   	int cpu;
>   
> +	spin_lock_irqsave(&multi_msi_cpu_lock, flags);
> +
> +	cpu_next = cpumask_next_wrap(cpu_next, cpu_online_mask, nr_cpu_ids,
> +				     false);
> +	cpu = cpu_next;
> +
> +	spin_unlock_irqrestore(&multi_msi_cpu_lock, flags);
> +
> +	return cpu;
> +}
> +
> +static u32 hv_compose_msi_req_v2(
> +	struct pci_create_interrupt2 *int_pkt, int cpu,
> +	u32 slot, u8 vector, u8 vector_count)
> +{
>   	int_pkt->message_type.type = PCI_CREATE_INTERRUPT_MESSAGE2;
>   	int_pkt->wslot.slot = slot;
>   	int_pkt->int_desc.vector = vector;
>   	int_pkt->int_desc.vector_count = vector_count;
>   	int_pkt->int_desc.delivery_mode = DELIVERY_MODE;
> -	cpu = hv_compose_msi_req_get_cpu(affinity);
>   	int_pkt->int_desc.processor_array[0] =
>   		hv_cpu_number_to_vp_number(cpu);
>   	int_pkt->int_desc.processor_count = 1;
> @@ -1660,18 +1681,15 @@ static u32 hv_compose_msi_req_v2(
>   }
>   
>   static u32 hv_compose_msi_req_v3(
> -	struct pci_create_interrupt3 *int_pkt, struct cpumask *affinity,
> +	struct pci_create_interrupt3 *int_pkt, int cpu,
>   	u32 slot, u32 vector, u8 vector_count)
>   {
> -	int cpu;
> -
>   	int_pkt->message_type.type = PCI_CREATE_INTERRUPT_MESSAGE3;
>   	int_pkt->wslot.slot = slot;
>   	int_pkt->int_desc.vector = vector;
>   	int_pkt->int_desc.reserved = 0;
>   	int_pkt->int_desc.vector_count = vector_count;
>   	int_pkt->int_desc.delivery_mode = DELIVERY_MODE;
> -	cpu = hv_compose_msi_req_get_cpu(affinity);
>   	int_pkt->int_desc.processor_array[0] =
>   		hv_cpu_number_to_vp_number(cpu);
>   	int_pkt->int_desc.processor_count = 1;
> @@ -1710,12 +1728,18 @@ static void hv_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
>   			struct pci_create_interrupt3 v3;
>   		} int_pkts;
>   	} __packed ctxt;
> +	bool multi_msi;
>   	u64 trans_id;
>   	u32 size;
>   	int ret;
> +	int cpu;
> +
> +	msi_desc  = irq_data_get_msi_desc(data);
> +	multi_msi = !msi_desc->pci.msi_attrib.is_msix &&
> +		    msi_desc->nvec_used > 1;
>   
>   	/* Reuse the previous allocation */
> -	if (data->chip_data) {
> +	if (data->chip_data && multi_msi) {
>   		int_desc = data->chip_data;
>   		msg->address_hi = int_desc->address >> 32;
>   		msg->address_lo = int_desc->address & 0xffffffff;
> @@ -1723,7 +1747,6 @@ static void hv_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
>   		return;
>   	}
>   
> -	msi_desc  = irq_data_get_msi_desc(data);
>   	pdev = msi_desc_to_pci_dev(msi_desc);
>   	dest = irq_data_get_effective_affinity_mask(data);
>   	pbus = pdev->bus;
> @@ -1733,11 +1756,18 @@ static void hv_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
>   	if (!hpdev)
>   		goto return_null_message;
>   
> +	/* Free any previous message that might have already been composed. */
> +	if (data->chip_data && !multi_msi) {
> +		int_desc = data->chip_data;
> +		data->chip_data = NULL;
> +		hv_int_desc_free(hpdev, int_desc);
> +	}
> +
>   	int_desc = kzalloc(sizeof(*int_desc), GFP_ATOMIC);
>   	if (!int_desc)
>   		goto drop_reference;
>   
> -	if (!msi_desc->pci.msi_attrib.is_msix && msi_desc->nvec_used > 1) {
> +	if (multi_msi) {
>   		/*
>   		 * If this is not the first MSI of Multi MSI, we already have
>   		 * a mapping.  Can exit early.
> @@ -1762,9 +1792,11 @@ static void hv_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
>   		 */
>   		vector = 32;
>   		vector_count = msi_desc->nvec_used;
> +		cpu = hv_compose_multi_msi_req_get_cpu();
>   	} else {
>   		vector = hv_msi_get_int_vector(data);
>   		vector_count = 1;
> +		cpu = hv_compose_msi_req_get_cpu(dest);
>   	}
>   
>   	memset(&ctxt, 0, sizeof(ctxt));
> @@ -1775,7 +1807,6 @@ static void hv_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
>   	switch (hbus->protocol_version) {
>   	case PCI_PROTOCOL_VERSION_1_1:
>   		size = hv_compose_msi_req_v1(&ctxt.int_pkts.v1,
> -					dest,
>   					hpdev->desc.win_slot.slot,
>   					vector,
>   					vector_count);
> @@ -1784,7 +1815,7 @@ static void hv_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
>   	case PCI_PROTOCOL_VERSION_1_2:
>   	case PCI_PROTOCOL_VERSION_1_3:
>   		size = hv_compose_msi_req_v2(&ctxt.int_pkts.v2,
> -					dest,
> +					cpu,
>   					hpdev->desc.win_slot.slot,
>   					vector,
>   					vector_count);
> @@ -1792,7 +1823,7 @@ static void hv_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
>   
>   	case PCI_PROTOCOL_VERSION_1_4:
>   		size = hv_compose_msi_req_v3(&ctxt.int_pkts.v3,
> -					dest,
> +					cpu,
>   					hpdev->desc.win_slot.slot,
>   					vector,
>   					vector_count);
Acked-by: Tim Gardner <tim.gardner@canonical.com>
Stefan Bader Nov. 9, 2022, 10:03 a.m. UTC | #2
On 08.11.22 20:16, John Cabaj wrote:
> From: Dexuan Cui <decui@microsoft.com>
> 
> BugLink: https://bugs.launchpad.net/bugs/1995408
> 
> Jeffrey added Multi-MSI support to the pci-hyperv driver by the 4 patches:
> 08e61e861a0e ("PCI: hv: Fix multi-MSI to allow more than one MSI vector")
> 455880dfe292 ("PCI: hv: Fix hv_arch_irq_unmask() for multi-MSI")
> b4b77778ecc5 ("PCI: hv: Reuse existing IRTE allocation in compose_msi_msg()")
> a2bad844a67b ("PCI: hv: Fix interrupt mapping for multi-MSI")
> 
> It turns out that the third patch (b4b77778ecc5) causes a performance
> regression because all the interrupts now happen on 1 physical CPU (or two
> pCPUs, if one pCPU doesn't have enough vectors). When a guest has many PCI
> devices, it may suffer from soft lockups if the workload is heavy, e.g.,
> see https://lwn.net/ml/linux-kernel/20220804025104.15673-1-decui@microsoft.com/
> 
> Commit b4b77778ecc5 itself is good. The real issue is that the hypercall in
> hv_irq_unmask() -> hv_arch_irq_unmask() ->
> hv_do_hypercall(HVCALL_RETARGET_INTERRUPT...) only changes the target
> virtual CPU rather than physical CPU; with b4b77778ecc5, the pCPU is
> determined only once in hv_compose_msi_msg() where only vCPU0 is specified;
> consequently the hypervisor only uses 1 target pCPU for all the interrupts.
> 
> Note: before b4b77778ecc5, the pCPU is determined twice, and when the pCPU
> is determinted the second time, the vCPU in the effective affinity mask is
> used (i.e., it isn't always vCPU0), so the hypervisor chooses a different
> pCPU for each interrupt.
> 
> The hypercall will be fixed in future to update the pCPU as well, but
> that will take quite a while, so let's restore the old behavior in
> hv_compose_msi_msg(), i.e., don't reuse the existing IRTE allocation for
> single-MSI and MSI-X; for multi-MSI, we choose the vCPU in a round-robin
> manner for each PCI device, so the interrupts of different devices can
> happen on different pCPUs, though the interrupts of each device happen on
> some single pCPU.
> 
> The hypercall fix may not be backported to all old versions of Hyper-V, so
> we want to have this guest side change for ever (or at least till we're sure
> the old affected versions of Hyper-V are no longer supported).
> 
> Fixes: b4b77778ecc5 ("PCI: hv: Reuse existing IRTE allocation in compose_msi_msg()")
> Co-developed-by: Jeffrey Hugo <quic_jhugo@quicinc.com>
> Signed-off-by: Jeffrey Hugo <quic_jhugo@quicinc.com>
> Co-developed-by: Carl Vanderlip <quic_carlv@quicinc.com>
> Signed-off-by: Carl Vanderlip <quic_carlv@quicinc.com>
> Signed-off-by: Dexuan Cui <decui@microsoft.com>

???
> Signed-off-by: John Cabaj <john.cabaj@canonical.com>
Acked-by: Stefan Bader <stefan.bader@canonical.com>
> ---

Tim seems happy and its for a cloud kernel. Still I would add something along 
the lines of a "(cherry picked from vendor tree or discussion)" there when 
applying. Looks like at some point the parent kernel might get some kind of fix 
which may or may not be the same but likely clashing with this. Then any piece 
of info about origins helps to decide what to do. Yes there is the bug report 
but that is some indirection and often also is to cryptic.

I also would have used a cover email here, even though it is just one patch. 
Just to pass on some of the meta info about this. Like why is it fixed SAUCE. 
Might be noteworthy that cover email and bug report info target different 
people. For simplicity the SRU justification is often copied into the cover 
email but essentially the justification in the bug report is for the SRU team, 
which is more distro towards archive admins than kernel developers. While the 
cover email is helping other kernel devs to decide whether they can ACK this.

I ACK this because following all trails this sounds like a customer requested 
quick fix and was tested ok in the target environment. But there could have been 
a little more help to make my task easier.

-Stefan

>   drivers/pci/controller/pci-hyperv.c | 61 ++++++++++++++++++++++-------
>   1 file changed, 46 insertions(+), 15 deletions(-)
> 
> diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
> index db814f7b93ba..c10548243d65 100644
> --- a/drivers/pci/controller/pci-hyperv.c
> +++ b/drivers/pci/controller/pci-hyperv.c
> @@ -1613,7 +1613,7 @@ static void hv_pci_compose_compl(void *context, struct pci_response *resp,
>   }
>   
>   static u32 hv_compose_msi_req_v1(
> -	struct pci_create_interrupt *int_pkt, struct cpumask *affinity,
> +	struct pci_create_interrupt *int_pkt,
>   	u32 slot, u8 vector, u8 vector_count)
>   {
>   	int_pkt->message_type.type = PCI_CREATE_INTERRUPT_MESSAGE;
> @@ -1640,18 +1640,39 @@ static int hv_compose_msi_req_get_cpu(struct cpumask *affinity)
>   	return cpumask_first_and(affinity, cpu_online_mask);
>   }
>   
> -static u32 hv_compose_msi_req_v2(
> -	struct pci_create_interrupt2 *int_pkt, struct cpumask *affinity,
> -	u32 slot, u8 vector, u8 vector_count)
> +/*
> + * Make sure the dummy vCPU values for multi-MSI don't all point to vCPU0.
> + */
> +static int hv_compose_multi_msi_req_get_cpu(void)
>   {
> +	static DEFINE_SPINLOCK(multi_msi_cpu_lock);
> +
> +	/* -1 means starting with CPU 0 */
> +	static int cpu_next = -1;
> +
> +	unsigned long flags;
>   	int cpu;
>   
> +	spin_lock_irqsave(&multi_msi_cpu_lock, flags);
> +
> +	cpu_next = cpumask_next_wrap(cpu_next, cpu_online_mask, nr_cpu_ids,
> +				     false);
> +	cpu = cpu_next;
> +
> +	spin_unlock_irqrestore(&multi_msi_cpu_lock, flags);
> +
> +	return cpu;
> +}
> +
> +static u32 hv_compose_msi_req_v2(
> +	struct pci_create_interrupt2 *int_pkt, int cpu,
> +	u32 slot, u8 vector, u8 vector_count)
> +{
>   	int_pkt->message_type.type = PCI_CREATE_INTERRUPT_MESSAGE2;
>   	int_pkt->wslot.slot = slot;
>   	int_pkt->int_desc.vector = vector;
>   	int_pkt->int_desc.vector_count = vector_count;
>   	int_pkt->int_desc.delivery_mode = DELIVERY_MODE;
> -	cpu = hv_compose_msi_req_get_cpu(affinity);
>   	int_pkt->int_desc.processor_array[0] =
>   		hv_cpu_number_to_vp_number(cpu);
>   	int_pkt->int_desc.processor_count = 1;
> @@ -1660,18 +1681,15 @@ static u32 hv_compose_msi_req_v2(
>   }
>   
>   static u32 hv_compose_msi_req_v3(
> -	struct pci_create_interrupt3 *int_pkt, struct cpumask *affinity,
> +	struct pci_create_interrupt3 *int_pkt, int cpu,
>   	u32 slot, u32 vector, u8 vector_count)
>   {
> -	int cpu;
> -
>   	int_pkt->message_type.type = PCI_CREATE_INTERRUPT_MESSAGE3;
>   	int_pkt->wslot.slot = slot;
>   	int_pkt->int_desc.vector = vector;
>   	int_pkt->int_desc.reserved = 0;
>   	int_pkt->int_desc.vector_count = vector_count;
>   	int_pkt->int_desc.delivery_mode = DELIVERY_MODE;
> -	cpu = hv_compose_msi_req_get_cpu(affinity);
>   	int_pkt->int_desc.processor_array[0] =
>   		hv_cpu_number_to_vp_number(cpu);
>   	int_pkt->int_desc.processor_count = 1;
> @@ -1710,12 +1728,18 @@ static void hv_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
>   			struct pci_create_interrupt3 v3;
>   		} int_pkts;
>   	} __packed ctxt;
> +	bool multi_msi;
>   	u64 trans_id;
>   	u32 size;
>   	int ret;
> +	int cpu;
> +
> +	msi_desc  = irq_data_get_msi_desc(data);
> +	multi_msi = !msi_desc->pci.msi_attrib.is_msix &&
> +		    msi_desc->nvec_used > 1;
>   
>   	/* Reuse the previous allocation */
> -	if (data->chip_data) {
> +	if (data->chip_data && multi_msi) {
>   		int_desc = data->chip_data;
>   		msg->address_hi = int_desc->address >> 32;
>   		msg->address_lo = int_desc->address & 0xffffffff;
> @@ -1723,7 +1747,6 @@ static void hv_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
>   		return;
>   	}
>   
> -	msi_desc  = irq_data_get_msi_desc(data);
>   	pdev = msi_desc_to_pci_dev(msi_desc);
>   	dest = irq_data_get_effective_affinity_mask(data);
>   	pbus = pdev->bus;
> @@ -1733,11 +1756,18 @@ static void hv_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
>   	if (!hpdev)
>   		goto return_null_message;
>   
> +	/* Free any previous message that might have already been composed. */
> +	if (data->chip_data && !multi_msi) {
> +		int_desc = data->chip_data;
> +		data->chip_data = NULL;
> +		hv_int_desc_free(hpdev, int_desc);
> +	}
> +
>   	int_desc = kzalloc(sizeof(*int_desc), GFP_ATOMIC);
>   	if (!int_desc)
>   		goto drop_reference;
>   
> -	if (!msi_desc->pci.msi_attrib.is_msix && msi_desc->nvec_used > 1) {
> +	if (multi_msi) {
>   		/*
>   		 * If this is not the first MSI of Multi MSI, we already have
>   		 * a mapping.  Can exit early.
> @@ -1762,9 +1792,11 @@ static void hv_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
>   		 */
>   		vector = 32;
>   		vector_count = msi_desc->nvec_used;
> +		cpu = hv_compose_multi_msi_req_get_cpu();
>   	} else {
>   		vector = hv_msi_get_int_vector(data);
>   		vector_count = 1;
> +		cpu = hv_compose_msi_req_get_cpu(dest);
>   	}
>   
>   	memset(&ctxt, 0, sizeof(ctxt));
> @@ -1775,7 +1807,6 @@ static void hv_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
>   	switch (hbus->protocol_version) {
>   	case PCI_PROTOCOL_VERSION_1_1:
>   		size = hv_compose_msi_req_v1(&ctxt.int_pkts.v1,
> -					dest,
>   					hpdev->desc.win_slot.slot,
>   					vector,
>   					vector_count);
> @@ -1784,7 +1815,7 @@ static void hv_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
>   	case PCI_PROTOCOL_VERSION_1_2:
>   	case PCI_PROTOCOL_VERSION_1_3:
>   		size = hv_compose_msi_req_v2(&ctxt.int_pkts.v2,
> -					dest,
> +					cpu,
>   					hpdev->desc.win_slot.slot,
>   					vector,
>   					vector_count);
> @@ -1792,7 +1823,7 @@ static void hv_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
>   
>   	case PCI_PROTOCOL_VERSION_1_4:
>   		size = hv_compose_msi_req_v3(&ctxt.int_pkts.v3,
> -					dest,
> +					cpu,
>   					hpdev->desc.win_slot.slot,
>   					vector,
>   					vector_count);
Tim Gardner Nov. 9, 2022, 2:09 p.m. UTC | #3
On 11/8/22 12:16 PM, John Cabaj wrote:
> From: Dexuan Cui <decui@microsoft.com>
> 
> BugLink: https://bugs.launchpad.net/bugs/1995408
> 
> Jeffrey added Multi-MSI support to the pci-hyperv driver by the 4 patches:
> 08e61e861a0e ("PCI: hv: Fix multi-MSI to allow more than one MSI vector")
> 455880dfe292 ("PCI: hv: Fix hv_arch_irq_unmask() for multi-MSI")
> b4b77778ecc5 ("PCI: hv: Reuse existing IRTE allocation in compose_msi_msg()")
> a2bad844a67b ("PCI: hv: Fix interrupt mapping for multi-MSI")
> 
> It turns out that the third patch (b4b77778ecc5) causes a performance
> regression because all the interrupts now happen on 1 physical CPU (or two
> pCPUs, if one pCPU doesn't have enough vectors). When a guest has many PCI
> devices, it may suffer from soft lockups if the workload is heavy, e.g.,
> see https://lwn.net/ml/linux-kernel/20220804025104.15673-1-decui@microsoft.com/
> 
> Commit b4b77778ecc5 itself is good. The real issue is that the hypercall in
> hv_irq_unmask() -> hv_arch_irq_unmask() ->
> hv_do_hypercall(HVCALL_RETARGET_INTERRUPT...) only changes the target
> virtual CPU rather than physical CPU; with b4b77778ecc5, the pCPU is
> determined only once in hv_compose_msi_msg() where only vCPU0 is specified;
> consequently the hypervisor only uses 1 target pCPU for all the interrupts.
> 
> Note: before b4b77778ecc5, the pCPU is determined twice, and when the pCPU
> is determinted the second time, the vCPU in the effective affinity mask is
> used (i.e., it isn't always vCPU0), so the hypervisor chooses a different
> pCPU for each interrupt.
> 
> The hypercall will be fixed in future to update the pCPU as well, but
> that will take quite a while, so let's restore the old behavior in
> hv_compose_msi_msg(), i.e., don't reuse the existing IRTE allocation for
> single-MSI and MSI-X; for multi-MSI, we choose the vCPU in a round-robin
> manner for each PCI device, so the interrupts of different devices can
> happen on different pCPUs, though the interrupts of each device happen on
> some single pCPU.
> 
> The hypercall fix may not be backported to all old versions of Hyper-V, so
> we want to have this guest side change for ever (or at least till we're sure
> the old affected versions of Hyper-V are no longer supported).
> 
> Fixes: b4b77778ecc5 ("PCI: hv: Reuse existing IRTE allocation in compose_msi_msg()")
> Co-developed-by: Jeffrey Hugo <quic_jhugo@quicinc.com>
> Signed-off-by: Jeffrey Hugo <quic_jhugo@quicinc.com>
> Co-developed-by: Carl Vanderlip <quic_carlv@quicinc.com>
> Signed-off-by: Carl Vanderlip <quic_carlv@quicinc.com>
> Signed-off-by: Dexuan Cui <decui@microsoft.com>
> Signed-off-by: John Cabaj <john.cabaj@canonical.com>
> ---
>   drivers/pci/controller/pci-hyperv.c | 61 ++++++++++++++++++++++-------
>   1 file changed, 46 insertions(+), 15 deletions(-)
> 
> diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
> index db814f7b93ba..c10548243d65 100644
> --- a/drivers/pci/controller/pci-hyperv.c
> +++ b/drivers/pci/controller/pci-hyperv.c
> @@ -1613,7 +1613,7 @@ static void hv_pci_compose_compl(void *context, struct pci_response *resp,
>   }
>   
>   static u32 hv_compose_msi_req_v1(
> -	struct pci_create_interrupt *int_pkt, struct cpumask *affinity,
> +	struct pci_create_interrupt *int_pkt,
>   	u32 slot, u8 vector, u8 vector_count)
>   {
>   	int_pkt->message_type.type = PCI_CREATE_INTERRUPT_MESSAGE;
> @@ -1640,18 +1640,39 @@ static int hv_compose_msi_req_get_cpu(struct cpumask *affinity)
>   	return cpumask_first_and(affinity, cpu_online_mask);
>   }
>   
> -static u32 hv_compose_msi_req_v2(
> -	struct pci_create_interrupt2 *int_pkt, struct cpumask *affinity,
> -	u32 slot, u8 vector, u8 vector_count)
> +/*
> + * Make sure the dummy vCPU values for multi-MSI don't all point to vCPU0.
> + */
> +static int hv_compose_multi_msi_req_get_cpu(void)
>   {
> +	static DEFINE_SPINLOCK(multi_msi_cpu_lock);
> +
> +	/* -1 means starting with CPU 0 */
> +	static int cpu_next = -1;
> +
> +	unsigned long flags;
>   	int cpu;
>   
> +	spin_lock_irqsave(&multi_msi_cpu_lock, flags);
> +
> +	cpu_next = cpumask_next_wrap(cpu_next, cpu_online_mask, nr_cpu_ids,
> +				     false);
> +	cpu = cpu_next;
> +
> +	spin_unlock_irqrestore(&multi_msi_cpu_lock, flags);
> +
> +	return cpu;
> +}
> +
> +static u32 hv_compose_msi_req_v2(
> +	struct pci_create_interrupt2 *int_pkt, int cpu,
> +	u32 slot, u8 vector, u8 vector_count)
> +{
>   	int_pkt->message_type.type = PCI_CREATE_INTERRUPT_MESSAGE2;
>   	int_pkt->wslot.slot = slot;
>   	int_pkt->int_desc.vector = vector;
>   	int_pkt->int_desc.vector_count = vector_count;
>   	int_pkt->int_desc.delivery_mode = DELIVERY_MODE;
> -	cpu = hv_compose_msi_req_get_cpu(affinity);
>   	int_pkt->int_desc.processor_array[0] =
>   		hv_cpu_number_to_vp_number(cpu);
>   	int_pkt->int_desc.processor_count = 1;
> @@ -1660,18 +1681,15 @@ static u32 hv_compose_msi_req_v2(
>   }
>   
>   static u32 hv_compose_msi_req_v3(
> -	struct pci_create_interrupt3 *int_pkt, struct cpumask *affinity,
> +	struct pci_create_interrupt3 *int_pkt, int cpu,
>   	u32 slot, u32 vector, u8 vector_count)
>   {
> -	int cpu;
> -
>   	int_pkt->message_type.type = PCI_CREATE_INTERRUPT_MESSAGE3;
>   	int_pkt->wslot.slot = slot;
>   	int_pkt->int_desc.vector = vector;
>   	int_pkt->int_desc.reserved = 0;
>   	int_pkt->int_desc.vector_count = vector_count;
>   	int_pkt->int_desc.delivery_mode = DELIVERY_MODE;
> -	cpu = hv_compose_msi_req_get_cpu(affinity);
>   	int_pkt->int_desc.processor_array[0] =
>   		hv_cpu_number_to_vp_number(cpu);
>   	int_pkt->int_desc.processor_count = 1;
> @@ -1710,12 +1728,18 @@ static void hv_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
>   			struct pci_create_interrupt3 v3;
>   		} int_pkts;
>   	} __packed ctxt;
> +	bool multi_msi;
>   	u64 trans_id;
>   	u32 size;
>   	int ret;
> +	int cpu;
> +
> +	msi_desc  = irq_data_get_msi_desc(data);
> +	multi_msi = !msi_desc->pci.msi_attrib.is_msix &&
> +		    msi_desc->nvec_used > 1;
>   
>   	/* Reuse the previous allocation */
> -	if (data->chip_data) {
> +	if (data->chip_data && multi_msi) {
>   		int_desc = data->chip_data;
>   		msg->address_hi = int_desc->address >> 32;
>   		msg->address_lo = int_desc->address & 0xffffffff;
> @@ -1723,7 +1747,6 @@ static void hv_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
>   		return;
>   	}
>   
> -	msi_desc  = irq_data_get_msi_desc(data);
>   	pdev = msi_desc_to_pci_dev(msi_desc);
>   	dest = irq_data_get_effective_affinity_mask(data);
>   	pbus = pdev->bus;
> @@ -1733,11 +1756,18 @@ static void hv_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
>   	if (!hpdev)
>   		goto return_null_message;
>   
> +	/* Free any previous message that might have already been composed. */
> +	if (data->chip_data && !multi_msi) {
> +		int_desc = data->chip_data;
> +		data->chip_data = NULL;
> +		hv_int_desc_free(hpdev, int_desc);
> +	}
> +
>   	int_desc = kzalloc(sizeof(*int_desc), GFP_ATOMIC);
>   	if (!int_desc)
>   		goto drop_reference;
>   
> -	if (!msi_desc->pci.msi_attrib.is_msix && msi_desc->nvec_used > 1) {
> +	if (multi_msi) {
>   		/*
>   		 * If this is not the first MSI of Multi MSI, we already have
>   		 * a mapping.  Can exit early.
> @@ -1762,9 +1792,11 @@ static void hv_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
>   		 */
>   		vector = 32;
>   		vector_count = msi_desc->nvec_used;
> +		cpu = hv_compose_multi_msi_req_get_cpu();
>   	} else {
>   		vector = hv_msi_get_int_vector(data);
>   		vector_count = 1;
> +		cpu = hv_compose_msi_req_get_cpu(dest);
>   	}
>   
>   	memset(&ctxt, 0, sizeof(ctxt));
> @@ -1775,7 +1807,6 @@ static void hv_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
>   	switch (hbus->protocol_version) {
>   	case PCI_PROTOCOL_VERSION_1_1:
>   		size = hv_compose_msi_req_v1(&ctxt.int_pkts.v1,
> -					dest,
>   					hpdev->desc.win_slot.slot,
>   					vector,
>   					vector_count);
> @@ -1784,7 +1815,7 @@ static void hv_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
>   	case PCI_PROTOCOL_VERSION_1_2:
>   	case PCI_PROTOCOL_VERSION_1_3:
>   		size = hv_compose_msi_req_v2(&ctxt.int_pkts.v2,
> -					dest,
> +					cpu,
>   					hpdev->desc.win_slot.slot,
>   					vector,
>   					vector_count);
> @@ -1792,7 +1823,7 @@ static void hv_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
>   
>   	case PCI_PROTOCOL_VERSION_1_4:
>   		size = hv_compose_msi_req_v3(&ctxt.int_pkts.v3,
> -					dest,
> +					cpu,
>   					hpdev->desc.win_slot.slot,
>   					vector,
>   					vector_count);
Applied to kinetic:linux-azure master-next. Thanks.

Stefan - You are correct that having inside knowledge of this patch I 
might not have been as picky as I should have. Hopefully it is a 
temporary fix. When it eventually lands upstream we should see a 
conflict with a stable update and know to remove this SAUCE patch.

-rtg
Tim Gardner Nov. 9, 2022, 2:10 p.m. UTC | #4
> Applied to kinetic:linux-azure master-next. Thanks.
> 
> Stefan - You are correct that having inside knowledge of this patch I 
> might not have been as picky as I should have. Hopefully it is a 
> temporary fix. When it eventually lands upstream we should see a 
> conflict with a stable update and know to remove this SAUCE patch.
> 
> -rtg
diff mbox series

Patch

diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
index db814f7b93ba..c10548243d65 100644
--- a/drivers/pci/controller/pci-hyperv.c
+++ b/drivers/pci/controller/pci-hyperv.c
@@ -1613,7 +1613,7 @@  static void hv_pci_compose_compl(void *context, struct pci_response *resp,
 }
 
 static u32 hv_compose_msi_req_v1(
-	struct pci_create_interrupt *int_pkt, struct cpumask *affinity,
+	struct pci_create_interrupt *int_pkt,
 	u32 slot, u8 vector, u8 vector_count)
 {
 	int_pkt->message_type.type = PCI_CREATE_INTERRUPT_MESSAGE;
@@ -1640,18 +1640,39 @@  static int hv_compose_msi_req_get_cpu(struct cpumask *affinity)
 	return cpumask_first_and(affinity, cpu_online_mask);
 }
 
-static u32 hv_compose_msi_req_v2(
-	struct pci_create_interrupt2 *int_pkt, struct cpumask *affinity,
-	u32 slot, u8 vector, u8 vector_count)
+/*
+ * Make sure the dummy vCPU values for multi-MSI don't all point to vCPU0.
+ */
+static int hv_compose_multi_msi_req_get_cpu(void)
 {
+	static DEFINE_SPINLOCK(multi_msi_cpu_lock);
+
+	/* -1 means starting with CPU 0 */
+	static int cpu_next = -1;
+
+	unsigned long flags;
 	int cpu;
 
+	spin_lock_irqsave(&multi_msi_cpu_lock, flags);
+
+	cpu_next = cpumask_next_wrap(cpu_next, cpu_online_mask, nr_cpu_ids,
+				     false);
+	cpu = cpu_next;
+
+	spin_unlock_irqrestore(&multi_msi_cpu_lock, flags);
+
+	return cpu;
+}
+
+static u32 hv_compose_msi_req_v2(
+	struct pci_create_interrupt2 *int_pkt, int cpu,
+	u32 slot, u8 vector, u8 vector_count)
+{
 	int_pkt->message_type.type = PCI_CREATE_INTERRUPT_MESSAGE2;
 	int_pkt->wslot.slot = slot;
 	int_pkt->int_desc.vector = vector;
 	int_pkt->int_desc.vector_count = vector_count;
 	int_pkt->int_desc.delivery_mode = DELIVERY_MODE;
-	cpu = hv_compose_msi_req_get_cpu(affinity);
 	int_pkt->int_desc.processor_array[0] =
 		hv_cpu_number_to_vp_number(cpu);
 	int_pkt->int_desc.processor_count = 1;
@@ -1660,18 +1681,15 @@  static u32 hv_compose_msi_req_v2(
 }
 
 static u32 hv_compose_msi_req_v3(
-	struct pci_create_interrupt3 *int_pkt, struct cpumask *affinity,
+	struct pci_create_interrupt3 *int_pkt, int cpu,
 	u32 slot, u32 vector, u8 vector_count)
 {
-	int cpu;
-
 	int_pkt->message_type.type = PCI_CREATE_INTERRUPT_MESSAGE3;
 	int_pkt->wslot.slot = slot;
 	int_pkt->int_desc.vector = vector;
 	int_pkt->int_desc.reserved = 0;
 	int_pkt->int_desc.vector_count = vector_count;
 	int_pkt->int_desc.delivery_mode = DELIVERY_MODE;
-	cpu = hv_compose_msi_req_get_cpu(affinity);
 	int_pkt->int_desc.processor_array[0] =
 		hv_cpu_number_to_vp_number(cpu);
 	int_pkt->int_desc.processor_count = 1;
@@ -1710,12 +1728,18 @@  static void hv_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
 			struct pci_create_interrupt3 v3;
 		} int_pkts;
 	} __packed ctxt;
+	bool multi_msi;
 	u64 trans_id;
 	u32 size;
 	int ret;
+	int cpu;
+
+	msi_desc  = irq_data_get_msi_desc(data);
+	multi_msi = !msi_desc->pci.msi_attrib.is_msix &&
+		    msi_desc->nvec_used > 1;
 
 	/* Reuse the previous allocation */
-	if (data->chip_data) {
+	if (data->chip_data && multi_msi) {
 		int_desc = data->chip_data;
 		msg->address_hi = int_desc->address >> 32;
 		msg->address_lo = int_desc->address & 0xffffffff;
@@ -1723,7 +1747,6 @@  static void hv_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
 		return;
 	}
 
-	msi_desc  = irq_data_get_msi_desc(data);
 	pdev = msi_desc_to_pci_dev(msi_desc);
 	dest = irq_data_get_effective_affinity_mask(data);
 	pbus = pdev->bus;
@@ -1733,11 +1756,18 @@  static void hv_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
 	if (!hpdev)
 		goto return_null_message;
 
+	/* Free any previous message that might have already been composed. */
+	if (data->chip_data && !multi_msi) {
+		int_desc = data->chip_data;
+		data->chip_data = NULL;
+		hv_int_desc_free(hpdev, int_desc);
+	}
+
 	int_desc = kzalloc(sizeof(*int_desc), GFP_ATOMIC);
 	if (!int_desc)
 		goto drop_reference;
 
-	if (!msi_desc->pci.msi_attrib.is_msix && msi_desc->nvec_used > 1) {
+	if (multi_msi) {
 		/*
 		 * If this is not the first MSI of Multi MSI, we already have
 		 * a mapping.  Can exit early.
@@ -1762,9 +1792,11 @@  static void hv_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
 		 */
 		vector = 32;
 		vector_count = msi_desc->nvec_used;
+		cpu = hv_compose_multi_msi_req_get_cpu();
 	} else {
 		vector = hv_msi_get_int_vector(data);
 		vector_count = 1;
+		cpu = hv_compose_msi_req_get_cpu(dest);
 	}
 
 	memset(&ctxt, 0, sizeof(ctxt));
@@ -1775,7 +1807,6 @@  static void hv_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
 	switch (hbus->protocol_version) {
 	case PCI_PROTOCOL_VERSION_1_1:
 		size = hv_compose_msi_req_v1(&ctxt.int_pkts.v1,
-					dest,
 					hpdev->desc.win_slot.slot,
 					vector,
 					vector_count);
@@ -1784,7 +1815,7 @@  static void hv_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
 	case PCI_PROTOCOL_VERSION_1_2:
 	case PCI_PROTOCOL_VERSION_1_3:
 		size = hv_compose_msi_req_v2(&ctxt.int_pkts.v2,
-					dest,
+					cpu,
 					hpdev->desc.win_slot.slot,
 					vector,
 					vector_count);
@@ -1792,7 +1823,7 @@  static void hv_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
 
 	case PCI_PROTOCOL_VERSION_1_4:
 		size = hv_compose_msi_req_v3(&ctxt.int_pkts.v3,
-					dest,
+					cpu,
 					hpdev->desc.win_slot.slot,
 					vector,
 					vector_count);