Message ID: KU1P153MB0120D20BC6ED8CF54168EEE6BF0D0@KU1P153MB0120.APCP153.PROD.OUTLOOK.COM
State:      New
Series:     irq_build_affinity_masks() allocates improper affinity if num_possible_cpus() > num_present_cpus()?
On Tue, 2020-10-06 at 06:47 +0000, Dexuan Cui wrote:
> Hi all,
> I'm running a single-CPU Linux VM on Hyper-V. The Linux kernel is v5.9-rc7
> and I have CONFIG_NR_CPUS=256.
>
> The Hyper-V host (Version 17763-10.0-1-0.1457) provides a guest firmware
> which always reports 128 Local APIC entries in the ACPI MADT table. Only
> the first Local APIC entry's "Processor Enabled" bit is 1, since this
> Linux VM is configured to have only 1 CPU. This means that in the Linux
> kernel, "cpu_present_mask" and "cpu_online_mask" have only 1 CPU (i.e. CPU0),
> while "cpu_possible_mask" has 128 CPUs, and "nr_cpu_ids" is 128.
>
> I pass through an MSI-X-capable PCI device to the Linux VM (which has
> only 1 virtual CPU), and the below code does *not* report any error
> (i.e. pci_alloc_irq_vectors_affinity() returns 2, and request_irq()
> returns 0), but the code does not work: the second MSI-X interrupt never
> fires, while the first interrupt works fine.
>
> 	int nr_irqs = 2;
> 	int i, nvec, irq;
>
> 	nvec = pci_alloc_irq_vectors_affinity(pdev, nr_irqs, nr_irqs,
> 			PCI_IRQ_MSIX | PCI_IRQ_AFFINITY, NULL);
>
> 	for (i = 0; i < nvec; i++) {
> 		irq = pci_irq_vector(pdev, i);
> 		err = request_irq(irq, test_intr, 0, "test_intr", &intr_cxt[i]);
> 	}
>
> It turns out that pci_alloc_irq_vectors_affinity() -> ... ->
> irq_create_affinity_masks() allocates an improper affinity for the second
> interrupt. The printk() below shows that the second interrupt's affinity is
> 1-64, but only CPU0 is present in the system! As a result, later,
> request_irq() -> ... -> irq_startup() -> __irq_startup_managed() returns
> IRQ_STARTUP_ABORT because cpumask_any_and(aff, cpu_online_mask) is
> empty (i.e. >= nr_cpu_ids), and irq_startup() *silently* fails (i.e. "return 0;"),
> since __irq_startup() is only called for IRQ_STARTUP_MANAGED and
> IRQ_STARTUP_NORMAL.
>
> --- a/kernel/irq/affinity.c
> +++ b/kernel/irq/affinity.c
> @@ -484,6 +484,9 @@ struct irq_affinity_desc *
>  	for (i = affd->pre_vectors; i < nvecs - affd->post_vectors; i++)
>  		masks[i].is_managed = 1;
>
> +	for (i = 0; i < nvecs; i++)
> +		printk("i=%d, affi = %*pbl\n", i,
> +		       cpumask_pr_args(&masks[i].mask));
>  	return masks;
>  }
>
> [   43.770477] i=0, affi = 0,65-127
> [   43.770484] i=1, affi = 1-64
>
> Though here the issue happens on a Linux VM on Hyper-V, I think the same
> issue can also happen on a physical machine, if the physical machine also
> uses a lot of static MADT entries, of which only the entries of the present
> CPUs are marked "Processor Enabled == 1".
>
> I think pci_alloc_irq_vectors_affinity() -> __pci_enable_msix_range() ->
> irq_calc_affinity_vectors() -> cpumask_weight(cpu_possible_mask) should
> use cpu_present_mask rather than cpu_possible_mask, so here
> irq_calc_affinity_vectors() would return 1, and __pci_enable_msix_range()
> would immediately return -ENOSPC to avoid a *silent* failure.
>
> However, git log shows that this 2018 commit intentionally changed
> cpu_present_mask to cpu_possible_mask:
> 84676c1f21e8 ("genirq/affinity: assign vectors to all possible CPUs")
>
> so I'm not sure whether (and how?) we should address the *silent* failure.
>
> BTW, here I use a single-CPU VM to simplify the discussion. Actually,
> if the VM has n CPUs, with the above usage of pci_alloc_irq_vectors_affinity()
> (which might seem incorrect, but my point is that it's really not good to
> have a silent failure, which makes it a lot more difficult to figure out
> what goes wrong), it looks like only the first n MSI-X interrupts can work,
> and the (n+1)'th MSI-X interrupt cannot work due to the improperly
> allocated affinity.
>
> According to my tests, if we need n+1 MSI-X interrupts in such a VM that
> has n CPUs, it looks like we have 2 options (the second should be better):
>
> 1. Do not use the PCI_IRQ_AFFINITY flag, i.e.
>    pci_alloc_irq_vectors_affinity(pdev, n+1, n+1, PCI_IRQ_MSIX, NULL);
>
> 2. Use the PCI_IRQ_AFFINITY flag, and pass a struct irq_affinity affd,
>    which tells the API that we don't care about the first interrupt's
>    affinity:
>
> 	struct irq_affinity affd = {
> 		.pre_vectors = 1,
> 		...
> 	};
>
> 	pci_alloc_irq_vectors_affinity(pdev, n+1, n+1,
> 			PCI_IRQ_MSIX | PCI_IRQ_AFFINITY, &affd);
>
> PS, irq_create_affinity_masks() is complicated. Let me know if you're
> interested in how it allocates the invalid affinity "1-64" for the
> second MSI-X interrupt.

Go on. It'll save me a cup of coffee or two...

> PS2, the latest Hyper-V provides only one ACPI MADT entry to a 1-CPU VM,
> so the issue described above cannot be reproduced there.

It seems fairly easy to reproduce in qemu with -smp 1,maxcpus=128 and a
virtio-blk drive, having commented out the 'desc->pre_vectors++' around
line 130 of virtio_pci_common.c so that it does actually spread them.

[    0.836252] i=0, affi = 0,65-127
[    0.836672] i=1, affi = 1-64
[    0.837905] virtio_blk virtio1: [vda] 41943040 512-byte logical blocks (21.5 GB/20.0 GiB)
[    0.839080] vda: detected capacity change from 0 to 21474836480

In my build I had to add 'nox2apic' because I think I actually already
fixed this for the x2apic + no-irq-remapping case with the max_affinity
patch series¹. But mostly by accident.

¹ https://git.infradead.org/users/dwmw2/linux.git/shortlog/refs/heads/irqaffinity

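To see option 2 in one piece, here is a minimal sketch of what such a probe-time
helper could look like. It reuses the hypothetical test_intr() handler and
intr_cxt[] array from the test code quoted above; the function name and the
omitted error unwinding (free_irq()/pci_free_irq_vectors()) are illustrative
only, not code from any driver in this thread.

	#include <linux/pci.h>
	#include <linux/interrupt.h>

	static int test_setup_irqs(struct pci_dev *pdev, int n)
	{
		/*
		 * Keep vector 0 out of the managed spread: with
		 * .pre_vectors = 1 it gets the default (non-managed)
		 * affinity, while the remaining n vectors are spread as
		 * managed interrupts.
		 */
		struct irq_affinity affd = {
			.pre_vectors = 1,
		};
		int i, nvec, irq, err;

		nvec = pci_alloc_irq_vectors_affinity(pdev, n + 1, n + 1,
				PCI_IRQ_MSIX | PCI_IRQ_AFFINITY, &affd);
		if (nvec < 0)
			return nvec;

		for (i = 0; i < nvec; i++) {
			irq = pci_irq_vector(pdev, i);
			err = request_irq(irq, test_intr, 0, "test_intr",
					  &intr_cxt[i]);
			if (err)
				return err;
		}
		return 0;
	}

Since vector 0 is not managed, it can always be targeted at an online CPU,
which is what makes this variant usable even when only CPU0 is online.
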
On Tue, 2020-10-06 at 09:37 +0100, David Woodhouse wrote:
> On Tue, 2020-10-06 at 06:47 +0000, Dexuan Cui wrote:
> > Hi all,
> > I'm running a single-CPU Linux VM on Hyper-V. The Linux kernel is v5.9-rc7
> > and I have CONFIG_NR_CPUS=256.
> >
> > The Hyper-V host (Version 17763-10.0-1-0.1457) provides a guest firmware
> > which always reports 128 Local APIC entries in the ACPI MADT table. Only
> > the first Local APIC entry's "Processor Enabled" bit is 1, since this
> > Linux VM is configured to have only 1 CPU. This means that in the Linux
> > kernel, "cpu_present_mask" and "cpu_online_mask" have only 1 CPU (i.e. CPU0),
> > while "cpu_possible_mask" has 128 CPUs, and "nr_cpu_ids" is 128.
> >
> > I pass through an MSI-X-capable PCI device to the Linux VM (which has
> > only 1 virtual CPU), and the below code does *not* report any error
> > (i.e. pci_alloc_irq_vectors_affinity() returns 2, and request_irq()
> > returns 0), but the code does not work: the second MSI-X interrupt never
> > fires, while the first interrupt works fine.
> >
> > 	int nr_irqs = 2;
> > 	int i, nvec, irq;
> >
> > 	nvec = pci_alloc_irq_vectors_affinity(pdev, nr_irqs, nr_irqs,
> > 			PCI_IRQ_MSIX | PCI_IRQ_AFFINITY, NULL);
> >
> > 	for (i = 0; i < nvec; i++) {
> > 		irq = pci_irq_vector(pdev, i);
> > 		err = request_irq(irq, test_intr, 0, "test_intr", &intr_cxt[i]);
> > 	}
> >
> > It turns out that pci_alloc_irq_vectors_affinity() -> ... ->
> > irq_create_affinity_masks() allocates an improper affinity for the second
> > interrupt. The printk() below shows that the second interrupt's affinity is
> > 1-64, but only CPU0 is present in the system! As a result, later,
> > request_irq() -> ... -> irq_startup() -> __irq_startup_managed() returns
> > IRQ_STARTUP_ABORT because cpumask_any_and(aff, cpu_online_mask) is
> > empty (i.e. >= nr_cpu_ids), and irq_startup() *silently* fails (i.e. "return 0;"),
> > since __irq_startup() is only called for IRQ_STARTUP_MANAGED and
> > IRQ_STARTUP_NORMAL.
> >
> > --- a/kernel/irq/affinity.c
> > +++ b/kernel/irq/affinity.c
> > @@ -484,6 +484,9 @@ struct irq_affinity_desc *
> >  	for (i = affd->pre_vectors; i < nvecs - affd->post_vectors; i++)
> >  		masks[i].is_managed = 1;
> >
> > +	for (i = 0; i < nvecs; i++)
> > +		printk("i=%d, affi = %*pbl\n", i,
> > +		       cpumask_pr_args(&masks[i].mask));
> >  	return masks;
> >  }
> >
> > [   43.770477] i=0, affi = 0,65-127
> > [   43.770484] i=1, affi = 1-64
> >
> > Though here the issue happens on a Linux VM on Hyper-V, I think the same
> > issue can also happen on a physical machine, if the physical machine also
> > uses a lot of static MADT entries, of which only the entries of the present
> > CPUs are marked "Processor Enabled == 1".
> >
> > I think pci_alloc_irq_vectors_affinity() -> __pci_enable_msix_range() ->
> > irq_calc_affinity_vectors() -> cpumask_weight(cpu_possible_mask) should
> > use cpu_present_mask rather than cpu_possible_mask, so here
> > irq_calc_affinity_vectors() would return 1, and __pci_enable_msix_range()
> > would immediately return -ENOSPC to avoid a *silent* failure.
> >
> > However, git log shows that this 2018 commit intentionally changed
> > cpu_present_mask to cpu_possible_mask:
> > 84676c1f21e8 ("genirq/affinity: assign vectors to all possible CPUs")
> >
> > so I'm not sure whether (and how?) we should address the *silent* failure.
> >
> > BTW, here I use a single-CPU VM to simplify the discussion. Actually,
> > if the VM has n CPUs, with the above usage of pci_alloc_irq_vectors_affinity()
> > (which might seem incorrect, but my point is that it's really not good to
> > have a silent failure, which makes it a lot more difficult to figure out
> > what goes wrong), it looks like only the first n MSI-X interrupts can work,
> > and the (n+1)'th MSI-X interrupt cannot work due to the improperly
> > allocated affinity.
> >
> > According to my tests, if we need n+1 MSI-X interrupts in such a VM that
> > has n CPUs, it looks like we have 2 options (the second should be better):
> >
> > 1. Do not use the PCI_IRQ_AFFINITY flag, i.e.
> >    pci_alloc_irq_vectors_affinity(pdev, n+1, n+1, PCI_IRQ_MSIX, NULL);
> >
> > 2. Use the PCI_IRQ_AFFINITY flag, and pass a struct irq_affinity affd,
> >    which tells the API that we don't care about the first interrupt's
> >    affinity:
> >
> > 	struct irq_affinity affd = {
> > 		.pre_vectors = 1,
> > 		...
> > 	};
> >
> > 	pci_alloc_irq_vectors_affinity(pdev, n+1, n+1,
> > 			PCI_IRQ_MSIX | PCI_IRQ_AFFINITY, &affd);
> >
> > PS, irq_create_affinity_masks() is complicated. Let me know if you're
> > interested in how it allocates the invalid affinity "1-64" for the
> > second MSI-X interrupt.
>
> Go on. It'll save me a cup of coffee or two...
>
> > PS2, the latest Hyper-V provides only one ACPI MADT entry to a 1-CPU VM,
> > so the issue described above cannot be reproduced there.
>
> It seems fairly easy to reproduce in qemu with -smp 1,maxcpus=128 and a
> virtio-blk drive, having commented out the 'desc->pre_vectors++' around
> line 130 of virtio_pci_common.c so that it does actually spread them.
>
> [    0.836252] i=0, affi = 0,65-127
> [    0.836672] i=1, affi = 1-64
> [    0.837905] virtio_blk virtio1: [vda] 41943040 512-byte logical blocks (21.5 GB/20.0 GiB)
> [    0.839080] vda: detected capacity change from 0 to 21474836480
>
> In my build I had to add 'nox2apic' because I think I actually already
> fixed this for the x2apic + no-irq-remapping case with the max_affinity
> patch series¹. But mostly by accident.
>
> ¹ https://git.infradead.org/users/dwmw2/linux.git/shortlog/refs/heads/irqaffinity

Is it fixed by
https://git.infradead.org/users/dwmw2/linux.git/commitdiff/41cfe6d54e5?

---
 kernel/irq/affinity.c | 17 +++++++++++------
 1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
index 6d7dbcf91061..00aa0ba6b32a 100644
--- a/kernel/irq/affinity.c
+++ b/kernel/irq/affinity.c
@@ -364,12 +364,17 @@ static int irq_build_affinity_masks(unsigned int startvec, unsigned int numvecs,
 	cpumask_copy(npresmsk, cpu_present_mask);
 
 	/* Spread on present CPUs starting from affd->pre_vectors */
-	ret = __irq_build_affinity_masks(curvec, numvecs, firstvec,
-					 node_to_cpumask, cpu_present_mask,
-					 nmsk, masks);
-	if (ret < 0)
-		goto fail_build_affinity;
-	nr_present = ret;
+	while (nr_present < numvecs) {
+		curvec = firstvec + nr_present;
+		ret = __irq_build_affinity_masks(curvec, numvecs, firstvec,
+						 node_to_cpumask, npresmsk,
+						 nmsk, masks);
+		if (ret < 0)
+			goto fail_build_affinity;
+		if (!ret)
+			break;
+		nr_present += ret;
+	}
 
 	/*
 	 * Spread on non present CPUs starting from the next vector to be

On Tue, Oct 06 2020 at 06:47, Dexuan Cui wrote:
> I'm running a single-CPU Linux VM on Hyper-V. The Linux kernel is v5.9-rc7
> and I have CONFIG_NR_CPUS=256.
>
> The Hyper-V host (Version 17763-10.0-1-0.1457) provides a guest firmware
> which always reports 128 Local APIC entries in the ACPI MADT table. Only
> the first Local APIC entry's "Processor Enabled" bit is 1, since this
> Linux VM is configured to have only 1 CPU. This means that in the Linux
> kernel, "cpu_present_mask" and "cpu_online_mask" have only 1 CPU (i.e. CPU0),
> while "cpu_possible_mask" has 128 CPUs, and "nr_cpu_ids" is 128.
>
> I pass through an MSI-X-capable PCI device to the Linux VM (which has
> only 1 virtual CPU), and the below code does *not* report any error
> (i.e. pci_alloc_irq_vectors_affinity() returns 2, and request_irq()
> returns 0), but the code does not work: the second MSI-X interrupt never
> fires, while the first interrupt works fine.
>
> 	int nr_irqs = 2;
> 	int i, nvec, irq;
>
> 	nvec = pci_alloc_irq_vectors_affinity(pdev, nr_irqs, nr_irqs,
> 			PCI_IRQ_MSIX | PCI_IRQ_AFFINITY, NULL);

Why should it return an error?

> 	for (i = 0; i < nvec; i++) {
> 		irq = pci_irq_vector(pdev, i);
> 		err = request_irq(irq, test_intr, 0, "test_intr", &intr_cxt[i]);
> 	}

And why do you expect that the second interrupt works?

This is about managed interrupts, and the spreading code has two vectors
to which it can spread the interrupts. One is assigned to one half of the
possible CPUs and the other one to the other half. Now you have only one
CPU online, so only the interrupt which has the online CPU in its assigned
affinity mask is started up.

That's how managed interrupts work. If you don't want managed interrupts
then don't use them.

Thanks,

        tglx

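In other words, whether a given managed vector is started at request_irq()
time comes down to whether its pre-computed affinity mask contains an online
CPU. The helper below is only an illustration of that check as described
earlier in the thread, not the actual __irq_startup_managed() code;
cpumask_any_and(), cpu_online_mask and nr_cpu_ids are the real kernel symbols.

	#include <linux/cpumask.h>

	/*
	 * Illustration only: a managed vector whose affinity mask contains
	 * no online CPU is left shut down (IRQ_STARTUP_ABORT) and is only
	 * started later, when a CPU from its mask comes online; request_irq()
	 * itself still returns 0.
	 */
	static bool managed_vector_would_start(const struct cpumask *aff)
	{
		/*
		 * cpumask_any_and() returns a value >= nr_cpu_ids when the
		 * two masks do not intersect.
		 */
		return cpumask_any_and(aff, cpu_online_mask) < nr_cpu_ids;
	}
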
On Tue, Oct 06 2020 at 09:37, David Woodhouse wrote:
> On Tue, 2020-10-06 at 06:47 +0000, Dexuan Cui wrote:
>> PS2, the latest Hyper-V provides only one ACPI MADT entry to a 1-CPU VM,
>> so the issue described above cannot be reproduced there.
>
> It seems fairly easy to reproduce in qemu with -smp 1,maxcpus=128 and a
> virtio-blk drive, having commented out the 'desc->pre_vectors++' around
> line 130 of virtio_pci_common.c so that it does actually spread them.
>
> [    0.836252] i=0, affi = 0,65-127
> [    0.836672] i=1, affi = 1-64
> [    0.837905] virtio_blk virtio1: [vda] 41943040 512-byte logical blocks (21.5 GB/20.0 GiB)
> [    0.839080] vda: detected capacity change from 0 to 21474836480
>
> In my build I had to add 'nox2apic' because I think I actually already
> fixed this for the x2apic + no-irq-remapping case with the max_affinity
> patch series¹. But mostly by accident.

There is nothing to fix. It's intentional behaviour. Managed interrupts
and their spreading (aside from the rather odd spread here) work that way.
And virtio-blk works perfectly fine with that.

Thanks,

        tglx

> From: Thomas Gleixner <tglx@linutronix.de>
> Sent: Tuesday, October 6, 2020 11:58 AM
>
> ...
> > I pass through an MSI-X-capable PCI device to the Linux VM (which has
> > only 1 virtual CPU), and the below code does *not* report any error
> > (i.e. pci_alloc_irq_vectors_affinity() returns 2, and request_irq()
> > returns 0), but the code does not work: the second MSI-X interrupt never
> > fires, while the first interrupt works fine.
> >
> > 	int nr_irqs = 2;
> > 	int i, nvec, irq;
> >
> > 	nvec = pci_alloc_irq_vectors_affinity(pdev, nr_irqs, nr_irqs,
> > 			PCI_IRQ_MSIX | PCI_IRQ_AFFINITY, NULL);
>
> Why should it return an error?

The above code returns -ENOSPC if num_possible_cpus() is also 1, and
returns 0 if num_possible_cpus() is 128. So it looks like the above code
is not using the API correctly, and hence gets undefined results.

> > 	for (i = 0; i < nvec; i++) {
> > 		irq = pci_irq_vector(pdev, i);
> > 		err = request_irq(irq, test_intr, 0, "test_intr", &intr_cxt[i]);
> > 	}
>
> And why do you expect that the second interrupt works?
>
> This is about managed interrupts, and the spreading code has two vectors
> to which it can spread the interrupts. One is assigned to one half of the
> possible CPUs and the other one to the other half. Now you have only one
> CPU online, so only the interrupt which has the online CPU in its assigned
> affinity mask is started up.
>
> That's how managed interrupts work. If you don't want managed interrupts
> then don't use them.
>
> Thanks,
>
>         tglx

Thanks for the clarification!

It looks like with PCI_IRQ_AFFINITY the kernel guarantees that the allocated
interrupts are 1:1 bound to CPUs, and userspace is unable to change the
affinities. This is very useful for supporting per-CPU I/O queues.

Thanks,
-- Dexuan

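As a rough sketch of that per-CPU I/O queue pattern (illustrative only, not
taken from any driver in this thread): the kernel computes and pins the
spread, and the driver merely reads it back to decide which queue serves
which CPUs. test_setup_queues() and nr_queues are made-up names;
pci_alloc_irq_vectors_affinity(), pci_irq_vector() and pci_irq_get_affinity()
are real kernel APIs.

	#include <linux/pci.h>

	static int test_setup_queues(struct pci_dev *pdev, int nr_queues)
	{
		const struct cpumask *mask;
		int i, nvec;

		/*
		 * One managed vector per queue; the kernel decides the CPU
		 * spread and userspace cannot change it later.
		 */
		nvec = pci_alloc_irq_vectors_affinity(pdev, 1, nr_queues,
				PCI_IRQ_MSIX | PCI_IRQ_AFFINITY, NULL);
		if (nvec < 0)
			return nvec;

		for (i = 0; i < nvec; i++) {
			/* Read back the spread chosen by the kernel. */
			mask = pci_irq_get_affinity(pdev, i);
			if (mask)
				pr_info("queue %d -> CPUs %*pbl (irq %d)\n",
					i, cpumask_pr_args(mask),
					pci_irq_vector(pdev, i));
		}
		return nvec;
	}
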