[V4,00/21] genirq, irqchip: Convert ARM MSI handling to per device MSI domains

Message ID 20240623142137.448898081@linutronix.de

Message

Thomas Gleixner June 23, 2024, 3:18 p.m. UTC
This is version 4 of the series to convert ARM MSI handling over to
per device MSI domains. Version 3 can be found here:

  https://lore.kernel.org/lkml/20240614102403.13610-1-shivamurthy.shastri@linutronix.de

The conversion aims to replace the existing platform MSI mechanism and
enables ARM to support the future PCI/IMS mechanism.

The infrastructure to replace the platform MSI mechanism is already
upstream and in use by RISC-V and has been tested on various ARM platforms
during the V2 development.

Changes vs. V3:

    - Fix the conversion of the GIC V3 MBI driver - Marc

    - Dropped a few stray MSI_FLAG_PCI_MSI_MASK_PARENT flags

    - Dropped the trivial cleanup patches as they have been merged

    - Picked up tags

The series is only lightly tested due to lack of hardware, so we rely on
the people who have access to affected machines to help with testing.

If there are no major objections raised or testing fallout reported, I'm
aiming this series for the next merge window.

The series is also available from git:

  git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git devmsi-arm-v4

Thanks,

	tglx
---
 b/drivers/base/platform-msi.c                 |  350 --------------------------
 b/drivers/irqchip/Kconfig                     |    8 
 b/drivers/irqchip/Makefile                    |    4 
 b/drivers/irqchip/irq-gic-common.h            |    3 
 b/drivers/irqchip/irq-gic-v2m.c               |   80 +----
 b/drivers/irqchip/irq-gic-v3-its-msi-parent.c |  210 +++++++++++++++
 b/drivers/irqchip/irq-gic-v3-its.c            |    5 
 b/drivers/irqchip/irq-gic-v3-mbi.c            |  130 +++------
 b/drivers/irqchip/irq-imx-mu-msi.c            |   48 +--
 b/drivers/irqchip/irq-mbigen.c                |   96 ++-----
 b/drivers/irqchip/irq-msi-lib.c               |  135 ++++++++++
 b/drivers/irqchip/irq-msi-lib.h               |   27 ++
 b/drivers/irqchip/irq-mvebu-gicp.c            |   44 +--
 b/drivers/irqchip/irq-mvebu-icu.c             |  275 ++++++++------------
 b/drivers/irqchip/irq-mvebu-odmi.c            |   37 +-
 b/drivers/irqchip/irq-mvebu-sei.c             |   52 +--
 b/drivers/pci/msi/irqdomain.c                 |   21 +
 b/include/linux/msi.h                         |   52 ---
 b/kernel/irq/msi.c                            |   95 +------
 drivers/irqchip/irq-gic-v3-its-pci-msi.c      |  202 ---------------
 drivers/irqchip/irq-gic-v3-its-platform-msi.c |  163 ------------
 21 files changed, 738 insertions(+), 1299 deletions(-)

Comments

Rob Herring (Arm) June 25, 2024, 7:46 p.m. UTC | #1
On Sun, Jun 23, 2024 at 05:18:31PM +0200, Thomas Gleixner wrote:
> This is version 4 of the series to convert ARM MSI handling over to
> per device MSI domains. Version 3 can be found here:
> 
>   https://lore.kernel.org/lkml/20240614102403.13610-1-shivamurthy.shastri@linutronix.de
> 
> The conversion aims to replace the existing platform MSI mechanism and
> enables ARM to support the future PCI/IMS mechanism.
> 
> The infrastructure to replace the platform MSI mechanism is already
> upstream and in use by RISC-V and has been tested on various ARM platforms
> during the V2 development.
> 
> Changes vs. V3:
> 
>     - Fix the conversion of the GIC V3 MBI driver - Marc
> 
>     - Dropped a few stray MSI_FLAG_PCI_MSI_MASK_PARENT flags
> 
>     - Dropped the trivial cleanup patches as they have been merged
> 
>     - Picked up tags
> 
> The series is only lightly tested due to lack of hardware, so we rely on
> the people who have access to affected machines to help with testing.
> 
> If there are no major objections raised or testing fallout reported, I'm
> aiming this series for the next merge window.
> 
> The series is also available from git:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git devmsi-arm-v4

Running this through KernelCI shows some failures on x86 QEMU boots[1].
Here's the backtrace:

<1>[    2.199948] BUG: kernel NULL pointer dereference, address: 0000000000000000
<1>[    2.199948] #PF: supervisor instruction fetch in kernel mode
<1>[    2.199948] #PF: error_code(0x0010) - not-present page
<6>[    2.199948] PGD 0 P4D 0 
<4>[    2.199948] Oops: Oops: 0010 [#1] PREEMPT SMP NOPTI
<4>[    2.199948] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 6.10.0-rc3 #1
<4>[    2.199948] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
<4>[    2.199948] RIP: 0010:0x0
<4>[    2.199948] Code: Unable to access opcode bytes at 0xffffffffffffffd6.
<4>[    2.199948] RSP: 0018:ffffa7ac80013a90 EFLAGS: 00000002
<4>[    2.199948] RAX: 0000000000000000 RBX: ffffa4050333d600 RCX: 0000000000000000
<4>[    2.199948] RDX: ffffa4050333d430 RSI: 0000000000000001 RDI: ffffa40502ff3100
<4>[    2.199948] RBP: ffffa4050333d600 R08: ffffa405032f1c00 R09: 0000000000000000
<4>[    2.199948] R10: 0000000000000246 R11: ffffa405032f1d80 R12: ffffa405032f1d80
<4>[    2.199948] R13: 0000000000000001 R14: 0000000000000000 R15: ffffa4050333d760
<4>[    2.199948] FS:  0000000000000000(0000) GS:ffffa4053e400000(0000) knlGS:0000000000000000
<4>[    2.199948] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4>[    2.199948] CR2: ffffffffffffffd6 CR3: 000000002a22e000 CR4: 00000000000006f0
<4>[    2.199948] Call Trace:
<4>[    2.199948]  <TASK>
<4>[    2.199948]  ? __die+0x1f/0x70
<4>[    2.199948]  ? page_fault_oops+0x155/0x440
<4>[    2.199948]  ? ondemand_readahead+0x2c0/0x370
<4>[    2.199948]  ? bitmap_find_next_zero_area_off+0x7b/0x90
<4>[    2.199948]  ? exc_page_fault+0x69/0x150
<4>[    2.199948]  ? asm_exc_page_fault+0x26/0x30
<4>[    2.199948]  pci_irq_unmask_msix+0x53/0x60
<4>[    2.199948]  irq_enable+0x32/0x80
<4>[    2.199948]  __irq_startup+0x51/0x70
<4>[    2.199948]  irq_startup+0x62/0x120
<4>[    2.199948]  __setup_irq+0x326/0x730
<4>[    2.199948]  ? __pfx_vp_config_changed+0x10/0x10
<4>[    2.199948]  request_threaded_irq+0x10b/0x180
<4>[    2.199948]  vp_find_vqs_msix+0x16b/0x470
<4>[    2.199948]  vp_find_vqs+0x34/0x1a0
<4>[    2.199948]  vp_modern_find_vqs+0x16/0x60
<4>[    2.199948]  init_vqs+0x3ee/0x690
<4>[    2.199948]  virtnet_probe+0x50c/0xd10
<4>[    2.199948]  virtio_dev_probe+0x1dd/0x2b0
<4>[    2.199948]  really_probe+0xbc/0x2b0
<4>[    2.199948]  __driver_probe_device+0x6e/0x120
<4>[    2.199948]  driver_probe_device+0x19/0xe0
<4>[    2.199948]  __driver_attach+0x85/0x180
<4>[    2.199948]  ? __pfx___driver_attach+0x10/0x10
<4>[    2.199948]  bus_for_each_dev+0x76/0xd0
<4>[    2.199948]  bus_add_driver+0xe3/0x210
<4>[    2.199948]  driver_register+0x5b/0x110
<4>[    2.199948]  ? __pfx_virtio_net_driver_init+0x10/0x10
<4>[    2.199948]  virtio_net_driver_init+0x8b/0xb0
<4>[    2.199948]  ? __pfx_virtio_net_driver_init+0x10/0x10
<4>[    2.199948]  do_one_initcall+0x43/0x210
<4>[    2.199948]  kernel_init_freeable+0x19b/0x2d0
<4>[    2.199948]  ? __pfx_kernel_init+0x10/0x10
<4>[    2.199948]  kernel_init+0x15/0x1c0
<4>[    2.199948]  ret_from_fork+0x2f/0x50
<4>[    2.199948]  ? __pfx_kernel_init+0x10/0x10
<4>[    2.199948]  ret_from_fork_asm+0x1a/0x30
<4>[    2.199948]  </TASK>
<4>[    2.199948] Modules linked in:
<4>[    2.199948] CR2: 0000000000000000
<4>[    2.199948] ---[ end trace 0000000000000000 ]---


Rob

[1] https://linux.kernelci.org/test/job/robh/branch/for-kernelci/kernel/v6.10-rc3-21-gd27f9f4a2dd80/plan/baseline/
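
The oops above reports `RIP: 0010:0x0` together with `#PF: supervisor instruction fetch` and `error_code(0x0010)`. In other words, the CPU tried to execute code at address zero. That is the typical signature of an indirect call through a NULL function pointer, reached here from `pci_irq_unmask_msix()`. The userspace sketch below illustrates only that failure mode and the usual guard for an optional callback. `struct mock_chip`, `vector_unmask()` and `safe_unmask()` are hypothetical. They are not the code in this series, and the sketch makes no claim about what the updated branch changed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for an interrupt chip with optional callbacks. */
struct mock_chip {
	const char *name;
	void (*irq_unmask)(unsigned int irq);	/* may legitimately be NULL */
};

static void vector_unmask(unsigned int irq)
{
	printf("unmask irq %u\n", irq);
}

/*
 * Calling chip->irq_unmask(irq) unconditionally would typically branch to
 * address 0 when the callback is not implemented; on x86 that is reported
 * as "RIP: 0010:0x0" with an instruction-fetch page fault. Checking the
 * pointer first turns it into a reportable condition instead.
 */
static bool safe_unmask(const struct mock_chip *chip, unsigned int irq)
{
	assert(chip != NULL);
	if (!chip->irq_unmask) {
		fprintf(stderr, "%s: no irq_unmask callback\n", chip->name);
		return false;
	}
	chip->irq_unmask(irq);
	return true;
}
```
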
Thomas Gleixner June 26, 2024, 7:03 p.m. UTC | #2
On Tue, Jun 25 2024 at 13:46, Rob Herring wrote:
> Running this thru kernelCI has some failures on x86 QEMU boots[1].

oops

> <4>[    2.199948]  pci_irq_unmask_msix+0x53/0x60
> <4>[    2.199948]  irq_enable+0x32/0x80

I'm sure that I fixed that before.

Updated branch:

   git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git devmsi-arm-v4-1

Thanks,

        tglx
Johan Hovold July 15, 2024, 11:18 a.m. UTC | #3
On Sun, Jun 23, 2024 at 05:18:31PM +0200, Thomas Gleixner wrote:
> This is version 4 of the series to convert ARM MSI handling over to
> per device MSI domains.

> The conversion aims to replace the existing platform MSI mechanism and
> enables ARM to support the future PCI/IMS mechanism.

> The series is only lightly tested due to lack of hardware, so we rely on
> the people who have access to affected machines to help with testing.
> 
> If there are no major objections raised or testing fallout reported, I'm
> aiming this series for the next merge window.

This series only showed up in linux-next last Friday and broke interrupt
handling on Qualcomm platforms like sc8280xp (e.g. Lenovo ThinkPad X13s)
and x1e80100 that use the GIC ITS for PCIe MSIs.

I've applied the series (21 commits from linux-next) on top of 6.10 and
can confirm that the breakage is caused by commits:

	3d1c927c08fc ("irqchip/gic-v3-its: Switch platform MSI to MSI parent")
	233db05bc37f ("irqchip/gic-v3-its: Provide MSI parent for PCI/MSI[-X]")

Applying the series up until the change before 3d1c927c08fc unbreaks the
wifi on one machine:

	ath11k_pci 0006:01:00.0: failed to enable msi: -22
	ath11k_pci 0006:01:00.0: probe with driver ath11k_pci failed with error -22

and backing up until the commit before 233db05bc37f makes the NVMe come
up again during boot on another.

I have not tried to debug this further.

Johan
Marc Zyngier July 15, 2024, 12:58 p.m. UTC | #4
On Mon, 15 Jul 2024 12:18:47 +0100,
Johan Hovold <johan@kernel.org> wrote:
> 
> On Sun, Jun 23, 2024 at 05:18:31PM +0200, Thomas Gleixner wrote:
> > This is version 4 of the series to convert ARM MSI handling over to
> > per device MSI domains.
> 
> > The conversion aims to replace the existing platform MSI mechanism and
> > enables ARM to support the future PCI/IMS mechanism.
> 
> > The series is only lightly tested due to lack of hardware, so we rely on
> > the people who have access to affected machines to help with testing.
> > 
> > If there are no major objections raised or testing fallout reported, I'm
> > aiming this series for the next merge window.
> 
> This series only showed up in linux-next last Friday and broke interrupt
> handling on Qualcomm platforms like sc8280xp (e.g. Lenovo ThinkPad X13s)
> and x1e80100 that use the GIC ITS for PCIe MSIs.
> 
> I've applied the series (21 commits from linux-next) on top of 6.10 and
> can confirm that the breakage is caused by commits:
> 
> 	3d1c927c08fc ("irqchip/gic-v3-its: Switch platform MSI to MSI parent")
> 	233db05bc37f ("irqchip/gic-v3-its: Provide MSI parent for PCI/MSI[-X]")
> 
> Applying the series up until the change before 3d1c927c08fc unbreaks the
> wifi on one machine:
> 
> 	ath11k_pci 0006:01:00.0: failed to enable msi: -22
> 	ath11k_pci 0006:01:00.0: probe with driver ath11k_pci failed with error -22
>
> and backing up until the commit before 233db05bc37f makes the NVMe come
> up again during boot on another.
> 
> I have not tried to debug this further.

I need a few things from you though, because you're not giving much to
help you (and I'm travelling, which doesn't help).

Can you at least investigate what in ath11k_pci_alloc_msi() causes the
wifi driver to be upset? Does it normally use a single MSI vector or
MSI-X? How about your NVMe device?

It would also help if you could define the DEBUG symbol at the very
top of irq-gic-v3-its.c and report the debug information that the ITS
driver dumps.

Thanks,

	M.
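
Marc's request relies on how `pr_debug()` behaves in the kernel. If a file defines `DEBUG` before its headers are included, its `pr_debug()` calls print. With `CONFIG_DYNAMIC_DEBUG`, the calls are compiled in either way, and defining `DEBUG` only enables them by default. The userspace mock below imitates just the compile-time switch. `mock_pr_debug()`, `report_itt()` and the capture buffer are hypothetical and are not the kernel's implementation. The message format imitates the shape of the `ITT ... entries, ... bits` lines that appear in the ITS debug output later in this thread.

```c
/* Plays the role of "#define DEBUG" at the very top of irq-gic-v3-its.c. */
#define DEBUG

#include <assert.h>
#include <stdio.h>
#include <string.h>

static char debug_log[256];	/* hypothetical capture buffer for the mock */

#ifdef DEBUG
/* Debug build: messages are formatted into the capture buffer. */
#define mock_pr_debug(...) \
	snprintf(debug_log, sizeof(debug_log), __VA_ARGS__)
#else
/* Non-debug build: the call is type-checked but never executed. */
#define mock_pr_debug(...) \
	do { if (0) snprintf(debug_log, sizeof(debug_log), __VA_ARGS__); } while (0)
#endif

/* Emits a line shaped like the ITS allocation debug output. */
static void report_itt(int entries, int bits)
{
	assert(entries > 0 && bits >= 0);
	mock_pr_debug("ITT %d entries, %d bits", entries, bits);
}
```
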
Thomas Gleixner July 15, 2024, 1:10 p.m. UTC | #5
On Mon, Jul 15 2024 at 13:18, Johan Hovold wrote:
> I've applied the series (21 commits from linux-next) on top of 6.10 and
> can confirm that the breakage is caused by commits:
>
> 	3d1c927c08fc ("irqchip/gic-v3-its: Switch platform MSI to MSI parent")
> 	233db05bc37f ("irqchip/gic-v3-its: Provide MSI parent for PCI/MSI[-X]")
>
> Applying the series up until the change before 3d1c927c08fc unbreaks the
> wifi on one machine:
>
> 	ath11k_pci 0006:01:00.0: failed to enable msi: -22
> 	ath11k_pci 0006:01:00.0: probe with driver ath11k_pci failed with error -22

3d1c927c08fc converts the platform MSI stuff over which is unrelated to
PCI/MSI. I'm confused how this affects PCI/MSI of the WIFI card.

> and backing up until the commit before 233db05bc37f makes the NVMe come
> up again during boot on another.

So that undoes the PCI/MSI change. Hrm.

> I have not tried to debug this further.

Any hint would be appreciated.

Thanks,

        tglx
Johan Hovold July 15, 2024, 2:10 p.m. UTC | #6
On Mon, Jul 15, 2024 at 01:58:13PM +0100, Marc Zyngier wrote:
> On Mon, 15 Jul 2024 12:18:47 +0100,
> Johan Hovold <johan@kernel.org> wrote:
> > On Sun, Jun 23, 2024 at 05:18:31PM +0200, Thomas Gleixner wrote:
> > > This is version 4 of the series to convert ARM MSI handling over to
> > > per device MSI domains.

> > This series only showed up in linux-next last Friday and broke interrupt
> > handling on Qualcomm platforms like sc8280xp (e.g. Lenovo ThinkPad X13s)
> > and x1e80100 that use the GIC ITS for PCIe MSIs.
> > 
> > I've applied the series (21 commits from linux-next) on top of 6.10 and
> > can confirm that the breakage is caused by commits:
> > 
> > 	3d1c927c08fc ("irqchip/gic-v3-its: Switch platform MSI to MSI parent")
> > 	233db05bc37f ("irqchip/gic-v3-its: Provide MSI parent for PCI/MSI[-X]")
> > 
> > Applying the series up until the change before 3d1c927c08fc unbreaks the
> > wifi on one machine:
> > 
> > 	ath11k_pci 0006:01:00.0: failed to enable msi: -22
> > 	ath11k_pci 0006:01:00.0: probe with driver ath11k_pci failed with error -22
> >
> > and backing up until the commit before 233db05bc37f makes the NVMe come
> > up again during boot on another.
> > 
> > I have not tried to debug this further.
> 
> I need a few things from you though, because you're not giving much to
> help you (and I'm travelling, which doesn't help).

Yeah, this was just an early heads up.

> Can you at least investigate what in ath11k_pci_alloc_msi() causes the
> wifi driver to be upset? Does it normally use a single MSI vector or
> MSI-X? How about your NVMe device?

It uses multiple vectors, but now it falls back to trying to allocate a
single one and even that fails with -ENOSPC:

	ath11k_pci 0006:01:00.0: ath11k_pci_alloc_msi - requesting one vector failed: -28
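
Johan describes a driver that first asks for its full set of MSI vectors and then retries with a single vector. That pattern can be modelled as below. This is a hedged userspace sketch. `mock_alloc_vectors()` stands in for the PCI core's vector allocation; in the kernel, `pci_alloc_irq_vectors()` takes a minimum and maximum count and returns either the number allocated or a negative errno. The free-vector pool is an assumption, and the sketch does not reproduce ath11k's actual code.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical number of vectors the mock interrupt domain can still hand out. */
static int mock_free_vectors;

/*
 * Mock of a min/max vector allocation: grants up to max_vecs if at least
 * min_vecs are free, otherwise fails with -ENOSPC.
 */
static int mock_alloc_vectors(int min_vecs, int max_vecs)
{
	int n;

	assert(min_vecs > 0 && min_vecs <= max_vecs);
	if (mock_free_vectors < min_vecs)
		return -ENOSPC;
	n = mock_free_vectors < max_vecs ? mock_free_vectors : max_vecs;
	mock_free_vectors -= n;
	return n;
}

/* Try the full set first, then fall back to a single shared vector. */
static int driver_alloc_msi(int wanted)
{
	int ret = mock_alloc_vectors(wanted, wanted);

	if (ret >= 0)
		return ret;
	return mock_alloc_vectors(1, 1);
}
```

In the report above, even the single-vector retry fails with -28, which corresponds to the last case in the test (`-ENOSPC` from both attempts).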

The NVMe is similar: it normally uses multiple vectors, but now only
the AER interrupts appear to be allocated for each controller, and there
is a GICv3 interrupt for the NVMe:

208:          0          0          0          0          0          0          0          0  ITS-PCI-MSI-0006:00:00.0   0 Edge      PCIe PME, aerdrv
212:          0          0          0          0          0          0          0          0  ITS-PCI-MSI-0004:00:00.0   0 Edge      PCIe PME, aerdrv
214:        161          0          0          0          0          0          0          0     GICv3 562 Level     nvme0q0, nvme0q1
215:          0          0          0          0          0          0          0          0  ITS-PCI-MSI-0002:00:00.0   0 Edge      PCIe PME, aerdrv

On the next boot, after disabling asynchronous probing of the PCIe controllers, the NVMe gets an MSI-X interrupt?!:

201:          0          0          0          0          0          0          0          0  ITS-PCI-MSI-0006:00:00.0   0 Edge      PCIe PME, aerdrv
203:          0          0          0          0          0          0          0          0  ITS-PCI-MSI-0004:00:00.0   0 Edge      PCIe PME, aerdrv
205:          0          0          0          0          0          0          0          0  ITS-PCI-MSI-0002:00:00.0   0 Edge      PCIe PME, aerdrv
206:          0          0          0          0          0          0          0          0  ITS-PCI-MSIX-0002:01:00.0   0 Edge      nvme0q0

This time ath11k vector allocation succeeded, but the driver times out
eventually:

[    8.984619] ath11k_pci 0006:01:00.0: MSI vectors: 32
[   29.690841] ath11k_pci 0006:01:00.0: failed to power up mhi: -110
[   29.697136] ath11k_pci 0006:01:00.0: failed to start mhi: -110
[   29.703153] ath11k_pci 0006:01:00.0: failed to power up :-110
[   29.732144] ath11k_pci 0006:01:00.0: failed to create soc core: -110
[   29.738694] ath11k_pci 0006:01:00.0: failed to init core: -110
[   32.841758] ath11k_pci 0006:01:00.0: probe with driver ath11k_pci failed with error -110
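
The negative numbers in these logs are kernel error codes, that is, negated `errno` values. For reference, the small helper below maps the three codes that appear in this thread to their names. The values are the Linux ones (`EINVAL` = 22, `ENOSPC` = 28, `ETIMEDOUT` = 110). The helper `kerr_name()` is only an illustration.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/* Map the negative error codes seen in this thread to their symbolic names. */
static const char *kerr_name(int err)
{
	assert(err < 0);	/* kernel functions return -errno on failure */
	switch (-err) {
	case EINVAL:
		return "EINVAL";	/* -22: "failed to enable msi" */
	case ENOSPC:
		return "ENOSPC";	/* -28: single-vector request failed */
	case ETIMEDOUT:
		return "ETIMEDOUT";	/* -110: MHI did not enter READY state */
	default:
		return "unknown";
	}
}
```
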

> It would also help if you could define the DEBUG symbol at the very
> top of irq-gic-v3-its.c and report the debug information that the ITS
> driver dumps.

See below (with synchronous probing of the pcie controllers).

Johan
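
In the ITS debug output below, each `ITS: alloc BASE:COUNT` line is followed by `ITT N entries, B bits`. The numbers are consistent with two rules: the vector count is rounded up to a power of two, and B = log2(N). For example, the mhi-pci-generic device maps only IDs 0 to 4, yet it gets `alloc 8224:8` and `ITT 8 entries, 3 bits`. The sketch below reproduces only that arithmetic. `itt_entries()` and `itt_bits()` are hypothetical helpers, and the real driver may apply further rules (for example, minimum sizes) that are not modelled here.

```c
#include <assert.h>

/* Round a vector count up to the next power of two (assumed sizing rule). */
static unsigned int itt_entries(unsigned int nvec)
{
	unsigned int n = 1;

	assert(nvec > 0 && nvec <= (1u << 31));
	while (n < nvec)
		n <<= 1;
	return n;
}

/* Number of EventID bits needed to index a power-of-two sized ITT. */
static unsigned int itt_bits(unsigned int entries)
{
	unsigned int bits = 0;

	assert(entries && !(entries & (entries - 1)));	/* power of two */
	while ((1u << bits) < entries)
		bits++;
	return bits;
}
```
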

[    0.000000] NR_IRQS: 64, nr_irqs: 64, preallocated irqs: 0
[    0.000000] GICv3: 960 SPIs implemented
[    0.000000] GICv3: 0 Extended SPIs implemented
[    0.000000] Root IRQ handler: gic_handle_irq
[    0.000000] GICv3: GICv3 features: 16 PPIs
[    0.000000] GICv3: CPU0: found redistributor 0 region 0:0x0000000017a60000
[    0.000000] ITS [mem 0x17a40000-0x17a5ffff]
[    0.000000] ITS@0x0000000017a40000: allocated 8192 Devices @100100000 (indirect, esz 8, psz 64K, shr 1)
[    0.000000] ITS@0x0000000017a40000: allocated 32768 Interrupt Collections @100110000 (flat, esz 2, psz 64K, shr 1)
[    0.000000] GICv3: using LPI property table @0x0000000100120000
[    0.000000] ITS: Allocator initialized for 57344 LPIs
[    0.000000] GICv3: CPU0: using allocated LPI pending table @0x0000000100130000

[    0.010428] GICv3: CPU1: found redistributor 100 region 0:0x0000000017a80000
[    0.010438] GICv3: CPU1: using allocated LPI pending table @0x0000000100140000
[    0.010477] CPU1: Booted secondary processor 0x0000000100 [0x410fd4b0]
[    0.011496] Detected PIPT I-cache on CPU2
[    0.011535] GICv3: CPU2: found redistributor 200 region 0:0x0000000017aa0000
[    0.011545] GICv3: CPU2: using allocated LPI pending table @0x0000000100150000
[    0.011576] CPU2: Booted secondary processor 0x0000000200 [0x410fd4b0]
[    0.012593] Detected PIPT I-cache on CPU3
[    0.012631] GICv3: CPU3: found redistributor 300 region 0:0x0000000017ac0000
[    0.012641] GICv3: CPU3: using allocated LPI pending table @0x0000000100160000
[    0.012671] CPU3: Booted secondary processor 0x0000000300 [0x410fd4b0]
[    0.015590] Detected PIPT I-cache on CPU4
[    0.015637] GICv3: CPU4: found redistributor 400 region 0:0x0000000017ae0000
[    0.015647] GICv3: CPU4: using allocated LPI pending table @0x0000000100170000
[    0.015675] CPU4: Booted secondary processor 0x0000000400 [0x410fd4c0]
[    0.016698] Detected PIPT I-cache on CPU5
[    0.016733] GICv3: CPU5: found redistributor 500 region 0:0x0000000017b00000
[    0.016742] GICv3: CPU5: using allocated LPI pending table @0x0000000100180000
[    0.016772] CPU5: Booted secondary processor 0x0000000500 [0x410fd4c0]
[    0.020807] Detected PIPT I-cache on CPU6
[    0.020841] GICv3: CPU6: found redistributor 600 region 0:0x0000000017b20000
[    0.020851] GICv3: CPU6: using allocated LPI pending table @0x0000000100190000
[    0.020879] CPU6: Booted secondary processor 0x0000000600 [0x410fd4c0]
[    0.021878] Detected PIPT I-cache on CPU7
[    0.021914] GICv3: CPU7: found redistributor 700 region 0:0x0000000017b40000
[    0.021922] GICv3: CPU7: using allocated LPI pending table @0x00000001001a0000
[    0.021952] CPU7: Booted secondary processor 0x0000000700 [0x410fd4c0]

[    8.358586] qcom-pcie 1c00000.pcie: host bridge /soc@0/pcie@1c00000 ranges:
[    8.365787] qcom-pcie 1c00000.pcie:       IO 0x0030200000..0x00302fffff -> 0x0000000000
[    8.381670] qcom-pcie 1c00000.pcie:      MEM 0x0030300000..0x0031ffffff -> 0x0030300000
[    8.507519] qcom-pcie 1c00000.pcie: iATU: unroll T, 8 ob, 8 ib, align 4K, limit 1024G
[    8.603797] qcom-pcie 1c00000.pcie: PCIe Gen.2 x1 link up
[    8.610023] qcom-pcie 1c00000.pcie: PCI host bridge to bus 0006:00
[    8.616805] pci_bus 0006:00: root bus resource [bus 00-ff]
[    8.622872] pci_bus 0006:00: root bus resource [io  0x0000-0xfffff]
[    8.629844] pci_bus 0006:00: root bus resource [mem 0x30300000-0x31ffffff]
[    8.636981] pci 0006:00:00.0: [17cb:010e] type 01 class 0x060400 PCIe Root Port
[    8.655493] pci 0006:00:00.0: BAR 0 [mem 0x00000000-0x00000fff]
[    8.672909] pci 0006:00:00.0: PCI bridge to [bus 01-ff]
[    8.688721] pci 0006:00:00.0:   bridge window [io  0x0000-0x0fff]
[    8.703805] pci 0006:00:00.0:   bridge window [mem 0x00000000-0x000fffff]
[    8.719789] pci 0006:00:00.0:   bridge window [mem 0x00000000-0x000fffff 64bit pref]
[    8.736680] pci 0006:00:00.0: PME# supported from D0 D3hot D3cold
[    8.745548] pci 0006:01:00.0: [17cb:1103] type 00 class 0x028000 PCIe Endpoint
[    8.745646] pci 0006:01:00.0: BAR 0 [mem 0x00000000-0x001fffff 64bit]
[    8.746274] pci 0006:01:00.0: PME# supported from D0 D3hot D3cold
[    8.746442] pci 0006:01:00.0: 4.000 Gb/s available PCIe bandwidth, limited by 5.0 GT/s PCIe x1 link at 0006:00:00.0 (capable of 7.876 Gb/s with 8.0 GT/s PCIe x1 link)
[    8.836195] pci 0006:00:00.0: bridge window [mem 0x30400000-0x305fffff]: assigned
[    8.853287] pci 0006:00:00.0: BAR 0 [mem 0x30300000-0x30300fff]: assigned
[    8.870163] pci 0006:01:00.0: BAR 0 [mem 0x30400000-0x305fffff 64bit]: assigned
[    8.887617] pci 0006:00:00.0: PCI bridge to [bus 01-ff]
[    8.902850] pci 0006:00:00.0:   bridge window [mem 0x30400000-0x305fffff]
[    8.933586] ITS: alloc 8192:32
[    8.933599] ITT 32 entries, 5 bits
[    8.951573] ID:0 pID:8192 vID:201
[    8.951585] ID:1 pID:8193 vID:202
[    8.951591] ID:2 pID:8194 vID:203
[    8.951597] ID:3 pID:8195 vID:204
[    8.951603] ID:4 pID:8196 vID:205
[    8.951609] ID:5 pID:8197 vID:206
[    8.951615] ID:6 pID:8198 vID:207
[    8.951621] ID:7 pID:8199 vID:208
[    8.951627] ID:8 pID:8200 vID:209
[    8.951633] ID:9 pID:8201 vID:210
[    8.951639] ID:10 pID:8202 vID:211
[    8.951645] ID:11 pID:8203 vID:212
[    8.951650] ID:12 pID:8204 vID:213
[    8.951656] ID:13 pID:8205 vID:214
[    8.951662] ID:14 pID:8206 vID:215
[    8.951667] ID:15 pID:8207 vID:216
[    8.951673] ID:16 pID:8208 vID:217
[    8.951679] ID:17 pID:8209 vID:218
[    8.951685] ID:18 pID:8210 vID:219
[    8.951691] ID:19 pID:8211 vID:220
[    8.951696] ID:20 pID:8212 vID:221
[    8.951702] ID:21 pID:8213 vID:222
[    8.951708] ID:22 pID:8214 vID:223
[    8.951714] ID:23 pID:8215 vID:224
[    8.951720] ID:24 pID:8216 vID:225
[    8.951725] ID:25 pID:8217 vID:226
[    8.951772] ID:26 pID:8218 vID:227
[    8.951778] ID:27 pID:8219 vID:228
[    8.951784] ID:28 pID:8220 vID:229
[    8.951790] ID:29 pID:8221 vID:230
[    8.951796] ID:30 pID:8222 vID:231
[    8.951802] ID:31 pID:8223 vID:232
[    8.951919] IRQ201 -> 0-7 CPU0
[    8.951940] IRQ202 -> 0-7 CPU1
[    8.951952] IRQ203 -> 0-7 CPU2
[    8.951963] IRQ204 -> 0-7 CPU3
[    8.951975] IRQ205 -> 0-7 CPU4
[    8.951987] IRQ206 -> 0-7 CPU5
[    8.951998] IRQ207 -> 0-7 CPU6
[    8.952010] IRQ208 -> 0-7 CPU7
[    8.952022] IRQ209 -> 0-7 CPU0
[    8.952033] IRQ210 -> 0-7 CPU1
[    8.952045] IRQ211 -> 0-7 CPU2
[    8.952056] IRQ212 -> 0-7 CPU3
[    8.952068] IRQ213 -> 0-7 CPU4
[    8.952079] IRQ214 -> 0-7 CPU5
[    8.952091] IRQ215 -> 0-7 CPU6
[    8.952103] IRQ216 -> 0-7 CPU7
[    8.952115] IRQ217 -> 0-7 CPU0
[    8.952126] IRQ218 -> 0-7 CPU1
[    8.952138] IRQ219 -> 0-7 CPU2
[    8.952150] IRQ220 -> 0-7 CPU3
[    8.952162] IRQ221 -> 0-7 CPU4
[    8.952174] IRQ222 -> 0-7 CPU5
[    8.952185] IRQ223 -> 0-7 CPU6
[    8.952197] IRQ224 -> 0-7 CPU7
[    8.952209] IRQ225 -> 0-7 CPU0
[    8.952220] IRQ226 -> 0-7 CPU1
[    8.952232] IRQ227 -> 0-7 CPU2
[    8.952244] IRQ228 -> 0-7 CPU3
[    8.952255] IRQ229 -> 0-7 CPU4
[    8.952267] IRQ230 -> 0-7 CPU5
[    8.952278] IRQ231 -> 0-7 CPU6
[    8.952290] IRQ232 -> 0-7 CPU7
[    8.954072] ITS: alloc 8192:32
[    8.954081] ITT 32 entries, 5 bits
[    8.954128] ID:0 pID:8192 vID:201
[    8.954137] IRQ201 -> 0-7 CPU0
[    8.954328] IRQ201 -> 0-7 CPU0
[    8.954357] pcieport 0006:00:00.0: PME: Signaling with IRQ 201
[    8.960980] pcieport 0006:00:00.0: AER: enabled with IRQ 201
[    8.967607] ath11k_pci 0006:01:00.0: BAR 0 [mem 0x30400000-0x305fffff 64bit]: assigned
[    8.976146] ath11k_pci 0006:01:00.0: enabling device (0000 -> 0002)
[    8.983071] ITS: alloc 8224:32
[    8.983080] ITT 32 entries, 5 bits
[    8.983842] ID:0 pID:8224 vID:202
[    8.983849] ID:1 pID:8225 vID:203
[    8.983855] ID:2 pID:8226 vID:204
[    8.983861] ID:3 pID:8227 vID:205
[    8.983867] ID:4 pID:8228 vID:206
[    8.983873] ID:5 pID:8229 vID:207
[    8.983878] ID:6 pID:8230 vID:208
[    8.983884] ID:7 pID:8231 vID:209
[    8.983890] ID:8 pID:8232 vID:210
[    8.983895] ID:9 pID:8233 vID:211
[    8.983901] ID:10 pID:8234 vID:212
[    8.983907] ID:11 pID:8235 vID:213
[    8.983913] ID:12 pID:8236 vID:214
[    8.983919] ID:13 pID:8237 vID:215
[    8.983925] ID:14 pID:8238 vID:216
[    8.983931] ID:15 pID:8239 vID:217
[    8.983937] ID:16 pID:8240 vID:218
[    8.983942] ID:17 pID:8241 vID:219
[    8.983948] ID:18 pID:8242 vID:220
[    8.983954] ID:19 pID:8243 vID:221
[    8.983960] ID:20 pID:8244 vID:222
[    8.983965] ID:21 pID:8245 vID:223
[    8.983971] ID:22 pID:8246 vID:224
[    8.983977] ID:23 pID:8247 vID:225
[    8.983983] ID:24 pID:8248 vID:226
[    8.983989] ID:25 pID:8249 vID:227
[    8.983995] ID:26 pID:8250 vID:228
[    8.984000] ID:27 pID:8251 vID:229
[    8.984006] ID:28 pID:8252 vID:230
[    8.984012] ID:29 pID:8253 vID:231
[    8.984018] ID:30 pID:8254 vID:232
[    8.984024] ID:31 pID:8255 vID:233
[    8.984102] IRQ202 -> 0-7 CPU1
[    8.984148] IRQ203 -> 0-7 CPU2
[    8.984160] IRQ204 -> 0-7 CPU3
[    8.984172] IRQ205 -> 0-7 CPU4
[    8.984184] IRQ206 -> 0-7 CPU5
[    8.984196] IRQ207 -> 0-7 CPU6
[    8.984208] IRQ208 -> 0-7 CPU7
[    8.984220] IRQ209 -> 0-7 CPU0
[    8.984231] IRQ210 -> 0-7 CPU1
[    8.984243] IRQ211 -> 0-7 CPU2
[    8.984255] IRQ212 -> 0-7 CPU3
[    8.984267] IRQ213 -> 0-7 CPU4
[    8.984279] IRQ214 -> 0-7 CPU5
[    8.984291] IRQ215 -> 0-7 CPU6
[    8.984303] IRQ216 -> 0-7 CPU7
[    8.984315] IRQ217 -> 0-7 CPU0
[    8.984326] IRQ218 -> 0-7 CPU1
[    8.984338] IRQ219 -> 0-7 CPU2
[    8.984350] IRQ220 -> 0-7 CPU3
[    8.984362] IRQ221 -> 0-7 CPU4
[    8.984373] IRQ222 -> 0-7 CPU5
[    8.984385] IRQ223 -> 0-7 CPU6
[    8.984398] IRQ224 -> 0-7 CPU7
[    8.984409] IRQ225 -> 0-7 CPU0
[    8.984422] IRQ226 -> 0-7 CPU1
[    8.984434] IRQ227 -> 0-7 CPU2
[    8.984445] IRQ228 -> 0-7 CPU3
[    8.984457] IRQ229 -> 0-7 CPU4
[    8.984469] IRQ230 -> 0-7 CPU5
[    8.984481] IRQ231 -> 0-7 CPU6
[    8.984492] IRQ232 -> 0-7 CPU7
[    8.984504] IRQ233 -> 0-7 CPU0
[    8.984619] ath11k_pci 0006:01:00.0: MSI vectors: 32
[    8.990070] ath11k_pci 0006:01:00.0: wcn6855 hw2.0
[    8.998289] IRQ202 -> 0-7 CPU1
[    8.998348] IRQ203 -> 0-7 CPU2
[    8.998376] IRQ204 -> 0-7 CPU3
[    9.001890] IRQ205 -> 0-7 CPU4
[    9.001923] IRQ206 -> 0-7 CPU5
[    9.001953] IRQ207 -> 0-7 CPU6
[    9.001977] IRQ208 -> 0-7 CPU7
[    9.002003] IRQ209 -> 0-7 CPU0
[    9.002031] IRQ210 -> 0-7 CPU1
[    9.002055] IRQ211 -> 0-7 CPU2
[    9.002117] IRQ216 -> 0-7 CPU7
[    9.002168] IRQ217 -> 0-7 CPU0
[    9.002210] IRQ218 -> 0-7 CPU1
[    9.002257] IRQ220 -> 0-7 CPU3
[    9.002296] IRQ221 -> 0-7 CPU4
[    9.002337] IRQ222 -> 0-7 CPU5
[    9.002381] IRQ223 -> 0-7 CPU6
[    9.002421] IRQ224 -> 0-7 CPU7
[    9.002460] IRQ225 -> 0-7 CPU0
[    9.002499] IRQ226 -> 0-7 CPU1
[    9.162382] mhi mhi0: Requested to power ON
[    9.167114] mhi mhi0: Power on setup success

[   29.680356] mhi mhi0: Device link is not accessible
[   29.685437] mhi mhi0: MHI did not enter READY state
[   29.690841] ath11k_pci 0006:01:00.0: failed to power up mhi: -110
[   29.697136] ath11k_pci 0006:01:00.0: failed to start mhi: -110
[   29.703153] ath11k_pci 0006:01:00.0: failed to power up :-110
[   29.732144] ath11k_pci 0006:01:00.0: failed to create soc core: -110
[   29.738694] ath11k_pci 0006:01:00.0: failed to init core: -110
[   32.841758] ath11k_pci 0006:01:00.0: probe with driver ath11k_pci failed with error -110
[   32.852799] qcom-pcie 1c10000.pcie: supply vdda not found, using dummy regulator
[   32.860924] qcom-pcie 1c10000.pcie: host bridge /soc@0/pcie@1c10000 ranges:
[   32.868157] qcom-pcie 1c10000.pcie:       IO 0x0034200000..0x00342fffff -> 0x0000000000
[   32.876428] qcom-pcie 1c10000.pcie:      MEM 0x0034300000..0x0035ffffff -> 0x0034300000
[   33.001705] qcom-pcie 1c10000.pcie: iATU: unroll T, 8 ob, 8 ib, align 4K, limit 1024G
[   33.111456] qcom-pcie 1c10000.pcie: PCIe Gen.3 x2 link up
[   33.117554] qcom-pcie 1c10000.pcie: PCI host bridge to bus 0004:00
[   33.124000] pci_bus 0004:00: root bus resource [bus 00-ff]
[   33.129745] pci_bus 0004:00: root bus resource [io  0x100000-0x1fffff] (bus address [0x0000-0xfffff])
[   33.139324] pci_bus 0004:00: root bus resource [mem 0x34300000-0x35ffffff]
[   33.146525] pci 0004:00:00.0: [17cb:010e] type 01 class 0x060400 PCIe Root Port
[   33.154167] pci 0004:00:00.0: BAR 0 [mem 0x00000000-0x00000fff]
[   33.160373] pci 0004:00:00.0: PCI bridge to [bus 01-ff]
[   33.165804] pci 0004:00:00.0:   bridge window [io  0x100000-0x100fff]
[   33.172482] pci 0004:00:00.0:   bridge window [mem 0x00000000-0x000fffff]
[   33.179515] pci 0004:00:00.0:   bridge window [mem 0x00000000-0x000fffff 64bit pref]
[   33.187622] pci 0004:00:00.0: PME# supported from D0 D3hot D3cold
[   33.195555] pci 0004:01:00.0: [17cb:0306] type 00 class 0xff0000 PCIe Endpoint
[   33.203462] pci 0004:01:00.0: BAR 0 [mem 0x00000000-0x00000fff 64bit]
[   33.210163] pci 0004:01:00.0: BAR 2 [mem 0x00000000-0x00000fff 64bit]
[   33.217379] pci 0004:01:00.0: PME# supported from D0 D3hot D3cold
[   33.223825] pci 0004:01:00.0: 15.752 Gb/s available PCIe bandwidth, limited by 8.0 GT/s PCIe x2 link at 0004:00:00.0 (capable of 31.506 Gb/s with 16.0 GT/s PCIe x2 link)
[   33.251876] pci 0004:00:00.0: bridge window [mem 0x34300000-0x343fffff]: assigned
[   33.259599] pci 0004:00:00.0: BAR 0 [mem 0x34400000-0x34400fff]: assigned
[   33.266621] pci 0004:01:00.0: BAR 0 [mem 0x34300000-0x34300fff 64bit]: assigned
[   33.274186] pci 0004:01:00.0: BAR 2 [mem 0x34301000-0x34301fff 64bit]: assigned
[   33.281748] pci 0004:00:00.0: PCI bridge to [bus 01-ff]
[   33.287133] pci 0004:00:00.0:   bridge window [mem 0x34300000-0x343fffff]
[   33.294322] Reusing ITT for devID 0
[   33.296005] Reusing ITT for devID 0
[   33.296053] ID:1 pID:8193 vID:203
[   33.296066] IRQ203 -> 0-7 CPU1
[   33.296176] IRQ203 -> 0-7 CPU1
[   33.296240] pcieport 0004:00:00.0: PME: Signaling with IRQ 203
[   33.302538] pcieport 0004:00:00.0: AER: enabled with IRQ 203
[   33.308587] mhi-pci-generic 0004:01:00.0: MHI PCI device found: foxconn-sdx55
[   33.315945] mhi-pci-generic 0004:01:00.0: BAR 0 [mem 0x34300000-0x34300fff 64bit]: assigned
[   33.324583] mhi-pci-generic 0004:01:00.0: enabling device (0000 -> 0002)
[   33.331610] ITS: alloc 8224:8
[   33.331619] ITT 8 entries, 3 bits
[   33.331750] ID:0 pID:8224 vID:204
[   33.331756] ID:1 pID:8225 vID:205
[   33.331762] ID:2 pID:8226 vID:206
[   33.331769] ID:3 pID:8227 vID:207
[   33.331774] ID:4 pID:8228 vID:208
[   33.331791] IRQ204 -> 0-7 CPU2
[   33.331837] IRQ205 -> 0-7 CPU3
[   33.331848] IRQ206 -> 0-7 CPU4
[   33.331860] IRQ207 -> 0-7 CPU5
[   33.331872] IRQ208 -> 0-7 CPU6
[   33.332711] IRQ204 -> 0-7 CPU2
[   33.333016] IRQ205 -> 0-7 CPU3
[   33.333042] IRQ206 -> 0-7 CPU4
[   33.333066] IRQ207 -> 0-7 CPU5
[   33.333090] IRQ208 -> 0-7 CPU6
[   33.335976] mhi mhi0: Requested to power ON
[   33.340327] mhi mhi0: Power on setup success
[   54.242353] mhi-pci-generic 0004:01:00.0: failed to power up MHI controller
[   54.251547] mhi-pci-generic 0004:01:00.0: probe with driver mhi-pci-generic failed with error -110
[   54.262662] qcom-pcie 1c20000.pcie: supply vdda not found, using dummy regulator
[   54.270794] qcom-pcie 1c20000.pcie: host bridge /soc@0/pcie@1c20000 ranges:
[   54.278042] qcom-pcie 1c20000.pcie:       IO 0x003c200000..0x003c2fffff -> 0x0000000000
[   54.286340] qcom-pcie 1c20000.pcie:      MEM 0x003c300000..0x003dffffff -> 0x003c300000
[   54.409356] qcom-pcie 1c20000.pcie: iATU: unroll T, 8 ob, 8 ib, align 4K, limit 1024G
[   54.519604] qcom-pcie 1c20000.pcie: PCIe Gen.3 x4 link up
[   54.525609] qcom-pcie 1c20000.pcie: PCI host bridge to bus 0002:00
[   54.532017] pci_bus 0002:00: root bus resource [bus 00-ff]
[   54.537732] pci_bus 0002:00: root bus resource [io  0x200000-0x2fffff] (bus address [0x0000-0xfffff])
[   54.547830] pci_bus 0002:00: root bus resource [mem 0x3c300000-0x3dffffff]
[   54.555523] pci 0002:00:00.0: [17cb:010e] type 01 class 0x060400 PCIe Root Port
[   54.563629] pci 0002:00:00.0: BAR 0 [mem 0x00000000-0x00000fff]
[   54.570244] pci 0002:00:00.0: PCI bridge to [bus 01-ff]
[   54.576099] pci 0002:00:00.0:   bridge window [io  0x200000-0x200fff]
[   54.583121] pci 0002:00:00.0:   bridge window [mem 0x00000000-0x000fffff]
[   54.590473] pci 0002:00:00.0:   bridge window [mem 0x00000000-0x000fffff 64bit pref]
[   54.598841] pci 0002:00:00.0: PME# supported from D0 D3hot D3cold
[   54.606657] pci 0002:01:00.0: [1e0f:0001] type 00 class 0x010802 PCIe Endpoint
[   54.614458] pci 0002:01:00.0: BAR 0 [mem 0x00000000-0x00003fff 64bit]
[   54.621900] pci 0002:01:00.0: PME# supported from D0 D3hot
[   54.635232] sd 0:0:0:0: [sda] Starting disk
[   54.641117] pci 0002:00:00.0: bridge window [mem 0x3c300000-0x3c3fffff]: assigned
[   54.649086] pci 0002:00:00.0: BAR 0 [mem 0x3c400000-0x3c400fff]: assigned
[   54.656299] pci 0002:01:00.0: BAR 0 [mem 0x3c300000-0x3c303fff 64bit]: assigned
[   54.664083] pci 0002:00:00.0: PCI bridge to [bus 01-ff]
[   54.669688] pci 0002:00:00.0:   bridge window [mem 0x3c300000-0x3c3fffff]
[   54.677113] Reusing ITT for devID 0
[   54.678960] Reusing ITT for devID 0
[   54.678994] ID:2 pID:8194 vID:205
[   54.679005] IRQ205 -> 0-7 CPU2
[   54.679103] IRQ205 -> 0-7 CPU2
[   54.679123] pcieport 0002:00:00.0: PME: Signaling with IRQ 205
[   54.685994] pcieport 0002:00:00.0: AER: enabled with IRQ 205
[   54.693042] nvme nvme0: pci function 0002:01:00.0
[   54.698150] nvme 0002:01:00.0: enabling device (0000 -> 0002)
[   54.704457] Reusing ITT for devID 100
[   54.704500] ID:0 pID:8224 vID:206
[   54.704509] IRQ206 -> 0-7 CPU3
[   54.706919] IRQ206 -> 0-7 CPU3

[  115.695904] nvme nvme0: I/O tag 0 (1000) QID 0 timeout, completion polled
[  177.135829] nvme nvme0: I/O tag 1 (1001) QID 0 timeout, completion polled
[  238.575830] nvme nvme0: I/O tag 2 (1002) QID 0 timeout, completion polled
[  300.023834] nvme nvme0: I/O tag 3 (1003) QID 0 timeout, completion polled
[  300.055992] nvme nvme0: allocated 61 MiB host memory buffer.
Marc Zyngier July 16, 2024, 10:30 a.m. UTC | #7
On Mon, 15 Jul 2024 15:10:01 +0100,
Johan Hovold <johan@kernel.org> wrote:
> 
> On Mon, Jul 15, 2024 at 01:58:13PM +0100, Marc Zyngier wrote:
> > On Mon, 15 Jul 2024 12:18:47 +0100,
> > Johan Hovold <johan@kernel.org> wrote:
> > > On Sun, Jun 23, 2024 at 05:18:31PM +0200, Thomas Gleixner wrote:
> > > > This is version 4 of the series to convert ARM MSI handling over to
> > > > per device MSI domains.
> 
> > > This series only showed up in linux-next last Friday and broke interrupt
> > > handling on Qualcomm platforms like sc8280xp (e.g. Lenovo ThinkPad X13s)
> > > and x1e80100 that use the GIC ITS for PCIe MSIs.
> > > 
> > > I've applied the series (21 commits from linux-next) on top of 6.10 and
> > > can confirm that the breakage is caused by commits:
> > > 
> > > 	3d1c927c08fc ("irqchip/gic-v3-its: Switch platform MSI to MSI parent")
> > > 	233db05bc37f ("irqchip/gic-v3-its: Provide MSI parent for PCI/MSI[-X]")
> > > 
> > > Applying the series up until the change before 3d1c927c08fc unbreaks the
> > > wifi on one machine:
> > > 
> > > 	ath11k_pci 0006:01:00.0: failed to enable msi: -22
> > > 	ath11k_pci 0006:01:00.0: probe with driver ath11k_pci failed with error -22
> > >
> > > and backing up until the commit before 233db05bc37f makes the NVMe come
> > > up again during boot on another.
> > > 
> > > I have not tried to debug this further.
> > 
> > I need a few things from you though, because you're not giving much to
> > help you (and I'm travelling, which doesn't help).
> 
> Yeah, this was just an early heads up.
> 
> > Can you at least investigate what in ath11k_pci_alloc_msi() causes the
> > wifi driver to be upset? Does it normally use a single MSI vector or
> > MSI-X? How about your NVMe device?
> 
> It uses multiple vectors, but now it falls back to trying to allocate a
> single one and even that fails with -ENOSPC:
> 
> 	ath11k_pci 0006:01:00.0: ath11k_pci_alloc_msi - requesting one vector failed: -28
> 
> Similar for the NVMe, it uses multiple vectors normally, but now only
> the AER interrupts appear to be allocated for each controller and there
> is a GICv3 interrupt for the NVMe:
> 
> 208:          0          0          0          0          0          0          0          0  ITS-PCI-MSI-0006:00:00.0   0 Edge      PCIe PME, aerdrv
> 212:          0          0          0          0          0          0          0          0  ITS-PCI-MSI-0004:00:00.0   0 Edge      PCIe PME, aerdrv
> 214:        161          0          0          0          0          0          0          0     GICv3 562 Level     nvme0q0, nvme0q1
> 215:          0          0          0          0          0          0          0          0  ITS-PCI-MSI-0002:00:00.0   0 Edge      PCIe PME, aerdrv
>

That's an indication of the driver having failed its MSI allocation
and gone back to INTx signalling.
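A rough model of that fallback follows. It assumes the driver requests its vectors with all interrupt types allowed; the NVMe driver's initial single-vector request appears to work this way. In that case the PCI core tries MSI-X first, then MSI, and only then the legacy INTx line. A failed MSI allocation therefore shows up as a GICv3 level-triggered interrupt rather than as a probe error. The struct and function names below are hypothetical. They only simulate the decision order and are not the kernel's `pci_alloc_irq_vectors()` implementation.

```c
#include <assert.h>  /* used by the checks that exercise this sketch */
#include <errno.h>
#include <stdbool.h>

/* Hypothetical: which interrupt mechanism the simulated device ended up with. */
enum sim_irq_kind { SIM_IRQ_NONE = 0, SIM_IRQ_MSIX, SIM_IRQ_MSI, SIM_IRQ_INTX };

/* Hypothetical: what would succeed if the core tried it. */
struct sim_dev {
	bool msix_ok;   /* would an MSI-X allocation succeed? */
	bool msi_ok;    /* would an MSI allocation succeed? */
	bool has_intx;  /* is a legacy INTx line wired up? */
};

/*
 * Try MSI-X, then MSI, then INTx, stopping at the first that works.
 * Returns the number of vectors obtained, or -ENOSPC if nothing fits.
 */
static int sim_alloc_irq_vectors(const struct sim_dev *dev, unsigned int nvec,
				 enum sim_irq_kind *kind)
{
	if (dev->msix_ok) {
		*kind = SIM_IRQ_MSIX;
		return (int)nvec;
	}
	if (dev->msi_ok) {
		*kind = SIM_IRQ_MSI;
		return (int)nvec;
	}
	/* INTx is a single line, so it can only satisfy a one-vector request. */
	if (nvec == 1 && dev->has_intx) {
		*kind = SIM_IRQ_INTX;
		return 1;
	}
	*kind = SIM_IRQ_NONE;
	return -ENOSPC;
}
```

With `msix_ok` and `msi_ok` both false and `has_intx` true, a one-vector request lands on INTx. That matches the `GICv3 562 Level nvme0q0, nvme0q1` line in the excerpt above. A multi-vector request in the same situation gets -ENOSPC instead.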

> Next boot, after disabling PCIe controller async probing, it's an MSI-X?!:
> 
> 201:          0          0          0          0          0          0          0          0  ITS-PCI-MSI-0006:00:00.0   0 Edge      PCIe PME, aerdrv
> 203:          0          0          0          0          0          0          0          0  ITS-PCI-MSI-0004:00:00.0   0 Edge      PCIe PME, aerdrv
> 205:          0          0          0          0          0          0          0          0  ITS-PCI-MSI-0002:00:00.0   0 Edge      PCIe PME, aerdrv
> 206:          0          0          0          0          0          0          0          0  ITS-PCI-MSIX-0002:01:00.0   0 Edge      nvme0q0
>

So is this issue actually tied to the async probing? Does it always
work if you disable it?

> This time ath11k vector allocation succeeded, but the driver times out
> eventually:
> 
> [    8.984619] ath11k_pci 0006:01:00.0: MSI vectors: 32
> [   29.690841] ath11k_pci 0006:01:00.0: failed to power up mhi: -110
> [   29.697136] ath11k_pci 0006:01:00.0: failed to start mhi: -110
> [   29.703153] ath11k_pci 0006:01:00.0: failed to power up :-110
> [   29.732144] ath11k_pci 0006:01:00.0: failed to create soc core: -110
> [   29.738694] ath11k_pci 0006:01:00.0: failed to init core: -110
> [   32.841758] ath11k_pci 0006:01:00.0: probe with driver ath11k_pci failed with error -110
> 
> > It would also help if you could define the DEBUG symbol at the very
> > top of irq-gic-v3-its.c and report the debug information that the ITS
> > driver dumps.
> 
> See below (with synchronous probing of the pcie controllers).

I don't see much going wrong there, and the ITS driver correctly
dishes out interrupts. I'll take the current -next for a ride on my
own HW and see what happens.
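The ITS debug lines in Johan's logs appear to follow a simple relationship, sketched below under stated assumptions. The LPI block ("ITS: alloc base:count") and the ITT seem to be sized to the next power of two of the number of events. The event-ID width is log2 of that size, and each event ID maps to pID = base + ID. For example, the modem had five events mapped (ID:0 to ID:4) alongside "ITS: alloc 8224:8" and "ITT 8 entries, 3 bits", and ID:4 went to pID 8228. The helper names are hypothetical. The sketch only reproduces the numbers visible in the logs; the real ITS driver may apply further constraints that are not modelled here.

```c
#include <assert.h>  /* used by the checks that exercise this sketch */

/* Round n (n >= 1) up to the next power of two. */
static unsigned int sim_roundup_pow_of_two(unsigned int n)
{
	unsigned int p = 1;

	while (p < n)
		p <<= 1;
	return p;
}

/* Integer log2 of n (n >= 1). */
static unsigned int sim_ilog2(unsigned int n)
{
	unsigned int bits = 0;

	while (n > 1) {
		n >>= 1;
		bits++;
	}
	return bits;
}

/* Hypothetical: ITT size and event-ID width for a device with nvecs events. */
static void sim_itt_size(unsigned int nvecs, unsigned int *entries,
			 unsigned int *bits)
{
	*entries = sim_roundup_pow_of_two(nvecs);
	*bits = sim_ilog2(*entries);
}

/* Hypothetical: event ID -> LPI number (the "pID" in the debug output). */
static unsigned int sim_event_to_lpi(unsigned int lpi_base, unsigned int event_id)
{
	return lpi_base + event_id;
}
```

Five events give 8 entries and 3 bits, and 32 events give 32 entries and 5 bits. Both pairs match the "ITT ... entries, ... bits" lines in the logs.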

	M.
Johan Hovold July 16, 2024, 2:53 p.m. UTC | #8
On Tue, Jul 16, 2024 at 11:30:05AM +0100, Marc Zyngier wrote:
> On Mon, 15 Jul 2024 15:10:01 +0100,
> Johan Hovold <johan@kernel.org> wrote:
> > On Mon, Jul 15, 2024 at 01:58:13PM +0100, Marc Zyngier wrote:
> > > On Mon, 15 Jul 2024 12:18:47 +0100,
> > > Johan Hovold <johan@kernel.org> wrote:
> > > > On Sun, Jun 23, 2024 at 05:18:31PM +0200, Thomas Gleixner wrote:
> > > > > This is version 4 of the series to convert ARM MSI handling over to
> > > > > per device MSI domains.
> > 
> > > > This series only showed up in linux-next last Friday and broke interrupt
> > > > handling on Qualcomm platforms like sc8280xp (e.g. Lenovo ThinkPad X13s)
> > > > and x1e80100 that use the GIC ITS for PCIe MSIs.
> > > > 
> > > > I've applied the series (21 commits from linux-next) on top of 6.10 and
> > > > can confirm that the breakage is caused by commits:
> > > > 
> > > > 	3d1c927c08fc ("irqchip/gic-v3-its: Switch platform MSI to MSI parent")
> > > > 	233db05bc37f ("irqchip/gic-v3-its: Provide MSI parent for PCI/MSI[-X]")
> > > > 
> > > > Applying the series up until the change before 3d1c927c08fc unbreaks the
> > > > wifi on one machine:
> > > > 
> > > > 	ath11k_pci 0006:01:00.0: failed to enable msi: -22
> > > > 	ath11k_pci 0006:01:00.0: probe with driver ath11k_pci failed with error -22

Correction, this doesn't fix the wifi, but I'm not seeing these errors
with the commit before cc23d1dfc959 as the ath11k driver doesn't get
this far (or doesn't probe at all).

> > > > and backing up until the commit before 233db05bc37f makes the NVMe come
> > > > up again during boot on another.
> > > > 
> > > > I have not tried to debug this further.
> > > 
> > > I need a few things from you though, because you're not giving much to
> > > help you (and I'm travelling, which doesn't help).
> > 
> > Yeah, this was just an early heads up.
> > 
> > > Can you at least investigate what in ath11k_pci_alloc_msi() causes the
> > > wifi driver to be upset? Does it normally use a single MSI vector or
> > > MSI-X? How about your NVMe device?
> > 
> > It uses multiple vectors, but now it falls back to trying to allocate a
> > single one and even that fails with -ENOSPC:
> > 
> > 	ath11k_pci 0006:01:00.0: ath11k_pci_alloc_msi - requesting one vector failed: -28
> > 
> > Similar for the NVMe, it uses multiple vectors normally, but now only
> > the AER interrupts appear to be allocated for each controller and there
> > is a GICv3 interrupt for the NVMe:
> > 
> > 208:          0          0          0          0          0          0          0          0  ITS-PCI-MSI-0006:00:00.0   0 Edge      PCIe PME, aerdrv
> > 212:          0          0          0          0          0          0          0          0  ITS-PCI-MSI-0004:00:00.0   0 Edge      PCIe PME, aerdrv
> > 214:        161          0          0          0          0          0          0          0     GICv3 562 Level     nvme0q0, nvme0q1
> > 215:          0          0          0          0          0          0          0          0  ITS-PCI-MSI-0002:00:00.0   0 Edge      PCIe PME, aerdrv
> >
> 
> That's an indication of the driver having failed its MSI allocation
> and gone back to INTx signalling.
> 
> > Next boot, after disabling PCIe controller async probing, it's an MSI-X?!:
> > 
> > 201:          0          0          0          0          0          0          0          0  ITS-PCI-MSI-0006:00:00.0   0 Edge      PCIe PME, aerdrv
> > 203:          0          0          0          0          0          0          0          0  ITS-PCI-MSI-0004:00:00.0   0 Edge      PCIe PME, aerdrv
> > 205:          0          0          0          0          0          0          0          0  ITS-PCI-MSI-0002:00:00.0   0 Edge      PCIe PME, aerdrv
> > 206:          0          0          0          0          0          0          0          0  ITS-PCI-MSIX-0002:01:00.0   0 Edge      nvme0q0
> >
> 
> So is this issue actually tied to the async probing? Does it always
> work if you disable it?

There seem to be multiple issues here.

With the full series applied and normal async (i.e. parallel) probing of
the PCIe controllers I sometimes see allocation failing with -ENOSPC
(e.g. the above ath11k errors). This seems to indicate broken locking
somewhere.

With synchronous probing, allocation always seems to succeed but the
ath11k (and modem) drivers time out as no interrupts are received.

The NVMe driver sometimes falls back to INTx signalling and can access
the drive, but often ends up with an MSI-X (?!) allocation and then fails
to probe:

	[  132.084740] nvme nvme0: I/O tag 17 (1011) QID 0 timeout, completion polled

Johan
Marc Zyngier July 16, 2024, 6:21 p.m. UTC | #9
[Dropping shivamurthy.shastri@linutronix.de who is now bouncing...]

On Tue, 16 Jul 2024 15:53:28 +0100,
Johan Hovold <johan@kernel.org> wrote:
> 
> On Tue, Jul 16, 2024 at 11:30:05AM +0100, Marc Zyngier wrote:
> > On Mon, 15 Jul 2024 15:10:01 +0100,
> > Johan Hovold <johan@kernel.org> wrote:
> > > On Mon, Jul 15, 2024 at 01:58:13PM +0100, Marc Zyngier wrote:
> > > > On Mon, 15 Jul 2024 12:18:47 +0100,
> > > > Johan Hovold <johan@kernel.org> wrote:
> > > > > On Sun, Jun 23, 2024 at 05:18:31PM +0200, Thomas Gleixner wrote:
> > > > > > This is version 4 of the series to convert ARM MSI handling over to
> > > > > > per device MSI domains.
> > > 
> > > > > This series only showed up in linux-next last Friday and broke interrupt
> > > > > handling on Qualcomm platforms like sc8280xp (e.g. Lenovo ThinkPad X13s)
> > > > > and x1e80100 that use the GIC ITS for PCIe MSIs.
> > > > > 
> > > > > I've applied the series (21 commits from linux-next) on top of 6.10 and
> > > > > can confirm that the breakage is caused by commits:
> > > > > 
> > > > > 	3d1c927c08fc ("irqchip/gic-v3-its: Switch platform MSI to MSI parent")
> > > > > 	233db05bc37f ("irqchip/gic-v3-its: Provide MSI parent for PCI/MSI[-X]")
> > > > > 
> > > > > Applying the series up until the change before 3d1c927c08fc unbreaks the
> > > > > wifi on one machine:
> > > > > 
> > > > > 	ath11k_pci 0006:01:00.0: failed to enable msi: -22
> > > > > 	ath11k_pci 0006:01:00.0: probe with driver ath11k_pci failed with error -22
> 
> Correction, this doesn't fix the wifi, but I'm not seeing these errors
> with the commit before cc23d1dfc959 as the ath11k driver doesn't get
> this far (or doesn't probe at all).

I think we need to track one thing at a time. The wifi and nvme
problems seem subtly different... Which is the exact commit that
breaks nvme on your machine?

[...]

> > So is this issue actually tied to the async probing? Does it always
> > work if you disable it?
> 
> There seem to be multiple issues here.
> 
> With the full series applied and normal async (i.e. parallel) probing of
> the PCIe controllers I sometimes see allocation failing with -ENOSPC
> (e.g. the above ath11k errors). This seems to indicate broken locking
> somewhere.

Your log doesn't support this theory. At least not from an ITS
perspective, as it keeps dishing out INTIDs (and it is very hard to
run out of IRQs with the ITS).

>
> With synchronous probing, allocation always seems to succeed but the
> ath11k (and modem) drivers time out as no interrupts are received.
> 
> The NVMe driver sometimes falls back to INTx signalling and can access
> the drive, but often ends up with an MSI-X (?!) allocation and then fails
> to probe:
> 
> 	[  132.084740] nvme nvme0: I/O tag 17 (1011) QID 0 timeout, completion polled

So one of my test boxes (ThunderX) fails this exact way, while another
(Synquacer) is pretty happy. Still trying to understand the difference
in behaviour.

How do you enforce synchronous probing?

	M.
Johan Hovold July 17, 2024, 7:23 a.m. UTC | #10
On Tue, Jul 16, 2024 at 07:21:39PM +0100, Marc Zyngier wrote:
> On Tue, 16 Jul 2024 15:53:28 +0100,
> Johan Hovold <johan@kernel.org> wrote:
> > On Tue, Jul 16, 2024 at 11:30:05AM +0100, Marc Zyngier wrote:
> > > On Mon, 15 Jul 2024 15:10:01 +0100,
> > > Johan Hovold <johan@kernel.org> wrote:
> > > > On Mon, Jul 15, 2024 at 01:58:13PM +0100, Marc Zyngier wrote:
> > > > > On Mon, 15 Jul 2024 12:18:47 +0100,
> > > > > Johan Hovold <johan@kernel.org> wrote:

> > > > > > This series only showed up in linux-next last Friday and broke interrupt
> > > > > > handling on Qualcomm platforms like sc8280xp (e.g. Lenovo ThinkPad X13s)
> > > > > > and x1e80100 that use the GIC ITS for PCIe MSIs.
> > > > > > 
> > > > > > I've applied the series (21 commits from linux-next) on top of 6.10 and
> > > > > > can confirm that the breakage is caused by commits:
> > > > > > 
> > > > > > 	3d1c927c08fc ("irqchip/gic-v3-its: Switch platform MSI to MSI parent")
> > > > > > 	233db05bc37f ("irqchip/gic-v3-its: Provide MSI parent for PCI/MSI[-X]")
> > > > > > 
> > > > > > Applying the series up until the change before 3d1c927c08fc unbreaks the
> > > > > > wifi on one machine:
> > > > > > 
> > > > > > 	ath11k_pci 0006:01:00.0: failed to enable msi: -22
> > > > > > 	ath11k_pci 0006:01:00.0: probe with driver ath11k_pci failed with error -22
> > 
> > Correction, this doesn't fix the wifi, but I'm not seeing these errors
> > with the commit before cc23d1dfc959 as the ath11k driver doesn't get

[ This was supposed to say 3d1c927c08fc, which is the mainline hash,
sorry. ]

> > this far (or doesn't probe at all).
> 
> I think we need to track one thing at a time. The wifi and nvme
> problems seem subtly different... Which is the exact commit that
> breaks nvme on your machine?

Yeah, forget about 3d1c927c08fc for now, which may have been a red
herring since we also appear to be dealing with some sort of race and
(some) symptoms keep changing from boot to boot. The only things that are
certain are that the series breaks MSI and that the NVMe breaks with
commit 233db05bc37f ("irqchip/gic-v3-its: Provide MSI parent for
PCI/MSI[-X]").

> > > So is this issue actually tied to the async probing? Does it always
> > > work if you disable it?
> > 
> > There seem to be multiple issues here.
> > 
> > With the full series applied and normal async (i.e. parallel) probing of
> > the PCIe controllers I sometimes see allocation failing with -ENOSPC
> > (e.g. the above ath11k errors). This seems to indicate broken locking
> > somewhere.
> 
> Your log doesn't support this theory. At least not from an ITS
> perspective, as it keeps dishing out INTIDs (and it is very hard to
> run out of IRQs with the ITS).

The log I shared was with synchronous probing which takes parallel
allocation out of the equation (and gives more readable logs) so that is
expected. See below for a log with normal async probing that may give
some more insight into the race as well (i.e. when ath11k allocation
fails with -ENOSPC).

> > With synchronous probing, allocation always seems to succeed but the
> > ath11k (and modem) drivers time out as no interrupts are received.
> > 
> > The NVMe driver sometimes falls back to INTx signalling and can access
> > the drive, but often ends up with an MSI-X (?!) allocation and then fails
> > to probe:
> > 
> > 	[  132.084740] nvme nvme0: I/O tag 17 (1011) QID 0 timeout, completion polled
> 
> So one of my test boxes (ThunderX) fails this exact way, while another
> (Synquacer) is pretty happy. Still trying to understand the difference
> in behaviour.
> 
> How do you enforce synchronous probing?

I believe there is a kernel parameter for this (e.g.
module.async_probe), but I just disable async probing for the Qualcomm
PCIe driver I'm using:

--- a/drivers/pci/controller/dwc/pcie-qcom.c
+++ b/drivers/pci/controller/dwc/pcie-qcom.c
@@ -1684,7 +1684,7 @@ static struct platform_driver qcom_pcie_driver = {
                .name = "qcom-pcie",
                .of_match_table = qcom_pcie_match,
                .pm = &qcom_pcie_pm_ops,
-               .probe_type = PROBE_PREFER_ASYNCHRONOUS,
+               //.probe_type = PROBE_PREFER_ASYNCHRONOUS,
        },
 };

Johan


[    8.323957] qcom-pcie 1c00000.pcie: host bridge /soc@0/pcie@1c00000 ranges:
[    8.334800] qcom-pcie 1c00000.pcie:       IO 0x0030200000..0x00302fffff -> 0x0000000000
[    8.348124] qcom-pcie 1c00000.pcie:      MEM 0x0030300000..0x0031ffffff -> 0x0030300000
[    8.378334] qcom-pcie 1c10000.pcie: host bridge /soc@0/pcie@1c10000 ranges:
[    8.378632] qcom-pcie 1c20000.pcie: host bridge /soc@0/pcie@1c20000 ranges:
[    8.378654] qcom-pcie 1c20000.pcie:       IO 0x003c200000..0x003c2fffff -> 0x0000000000
[    8.378666] qcom-pcie 1c20000.pcie:      MEM 0x003c300000..0x003dffffff -> 0x003c300000
[    8.391084] qcom-pcie 1c10000.pcie:       IO 0x0034200000..0x00342fffff -> 0x0000000000
[    8.419252] qcom-pcie 1c10000.pcie:      MEM 0x0034300000..0x0035ffffff -> 0x0034300000
[    8.477255] qcom-pcie 1c00000.pcie: iATU: unroll T, 8 ob, 8 ib, align 4K, limit 1024G
[    8.497259] qcom-pcie 1c20000.pcie: iATU: unroll T, 8 ob, 8 ib, align 4K, limit 1024G
[    8.537258] qcom-pcie 1c10000.pcie: iATU: unroll T, 8 ob, 8 ib, align 4K, limit 1024G
[    8.583746] qcom-pcie 1c00000.pcie: PCIe Gen.2 x1 link up
[    8.590079] qcom-pcie 1c00000.pcie: PCI host bridge to bus 0006:00
[    8.596838] pci_bus 0006:00: root bus resource [bus 00-ff]
[    8.602874] pci_bus 0006:00: root bus resource [io  0x0000-0xfffff]
[    8.603809] qcom-pcie 1c20000.pcie: PCIe Gen.3 x4 link up
[    8.609322] pci_bus 0006:00: root bus resource [mem 0x30300000-0x31ffffff]
[    8.609393] pci 0006:00:00.0: [17cb:010e] type 01 class 0x060400 PCIe Root Port
[    8.615040] qcom-pcie 1c20000.pcie: PCI host bridge to bus 0002:00
[    8.621951] pci 0006:00:00.0: BAR 0 [mem 0x00000000-0x00000fff]
[    8.629452] pci_bus 0002:00: root bus resource [bus 00-ff]
[    8.629706] pci_bus 0002:00: root bus resource [io  0x100000-0x1fffff] (bus address [0x0000-0xfffff])
[    8.635822] pci 0006:00:00.0: PCI bridge to [bus 01-ff]
[    8.641903] pci_bus 0002:00: root bus resource [mem 0x3c300000-0x3dffffff]
[    8.643728] qcom-pcie 1c10000.pcie: PCIe Gen.3 x2 link up
[    8.643851] qcom-pcie 1c10000.pcie: PCI host bridge to bus 0004:00
[    8.643854] pci_bus 0004:00: root bus resource [bus 00-ff]
[    8.643857] pci_bus 0004:00: root bus resource [io  0x200000-0x2fffff] (bus address [0x0000-0xfffff])
[    8.643859] pci_bus 0004:00: root bus resource [mem 0x34300000-0x35ffffff]
[    8.643873] pci 0004:00:00.0: [17cb:010e] type 01 class 0x060400 PCIe Root Port
[    8.643881] pci 0004:00:00.0: BAR 0 [mem 0x00000000-0x00000fff]
[    8.643890] pci 0004:00:00.0: PCI bridge to [bus 01-ff]
[    8.643894] pci 0004:00:00.0:   bridge window [io  0x200000-0x200fff]
[    8.643897] pci 0004:00:00.0:   bridge window [mem 0x00000000-0x000fffff]
[    8.643903] pci 0004:00:00.0:   bridge window [mem 0x00000000-0x000fffff 64bit pref]
[    8.643982] pci 0004:00:00.0: PME# supported from D0 D3hot D3cold
[    8.644933] pci 0004:01:00.0: [17cb:0306] type 00 class 0xff0000 PCIe Endpoint
[    8.645012] pci 0004:01:00.0: BAR 0 [mem 0x00000000-0x00000fff 64bit]
[    8.645063] pci 0004:01:00.0: BAR 2 [mem 0x00000000-0x00000fff 64bit]
[    8.645614] pci 0004:01:00.0: PME# supported from D0 D3hot D3cold
[    8.645768] pci 0004:01:00.0: 15.752 Gb/s available PCIe bandwidth, limited by 8.0 GT/s PCIe x2 link at 0004:00:00.0 (capable of 31.506 Gb/s with 16.0 GT/s PCIe x2 link)
[    8.647523] pci 0006:00:00.0:   bridge window [io  0x0000-0x0fff]
[    8.659851] pci 0004:00:00.0: bridge window [mem 0x34300000-0x343fffff]: assigned
[    8.659862] pci 0002:00:00.0: [17cb:010e] type 01 class 0x060400 PCIe Root Port
[    8.659873] pci 0002:00:00.0: BAR 0 [mem 0x00000000-0x00000fff]
[    8.659883] pci 0002:00:00.0: PCI bridge to [bus 01-ff]
[    8.659889] pci 0002:00:00.0:   bridge window [io  0x100000-0x100fff]
[    8.659893] pci 0002:00:00.0:   bridge window [mem 0x00000000-0x000fffff]
[    8.659900] pci 0002:00:00.0:   bridge window [mem 0x00000000-0x000fffff 64bit pref]
[    8.659962] pci 0002:00:00.0: PME# supported from D0 D3hot D3cold
[    8.661170] pci 0002:01:00.0: [1e0f:0001] type 00 class 0x010802 PCIe Endpoint
[    8.661259] pci 0002:01:00.0: BAR 0 [mem 0x00000000-0x00003fff 64bit]
[    8.661825] pci 0002:01:00.0: PME# supported from D0 D3hot
[    8.662365] pci 0006:00:00.0:   bridge window [mem 0x00000000-0x000fffff]
[    8.669410] pci 0004:00:00.0: BAR 0 [mem 0x34400000-0x34400fff]: assigned
[    8.671873] pci 0002:00:00.0: bridge window [mem 0x3c300000-0x3c3fffff]: assigned
[    8.671879] pci 0002:00:00.0: BAR 0 [mem 0x3c400000-0x3c400fff]: assigned
[    8.671887] pci 0002:01:00.0: BAR 0 [mem 0x3c300000-0x3c303fff 64bit]: assigned
[    8.671931] pci 0002:00:00.0: PCI bridge to [bus 01-ff]
[    8.671936] pci 0002:00:00.0:   bridge window [mem 0x3c300000-0x3c3fffff]
[    8.672719] ITS: alloc 8192:32
[    8.672737] ITT 32 entries, 5 bits
[    8.674420] ID:0 pID:8192 vID:196
[    8.674436] ID:1 pID:8193 vID:197
[    8.674444] ID:2 pID:8194 vID:198
[    8.674452] ID:3 pID:8195 vID:199
[    8.674461] ID:4 pID:8196 vID:200
[    8.674469] ID:5 pID:8197 vID:201
[    8.674476] ID:6 pID:8198 vID:202
[    8.674485] ID:7 pID:8199 vID:203
[    8.674493] ID:8 pID:8200 vID:204
[    8.674501] ID:9 pID:8201 vID:205
[    8.674508] ID:10 pID:8202 vID:206
[    8.674517] ID:11 pID:8203 vID:207
[    8.674525] ID:12 pID:8204 vID:208
[    8.674532] ID:13 pID:8205 vID:209
[    8.674540] ID:14 pID:8206 vID:210
[    8.674548] ID:15 pID:8207 vID:211
[    8.674556] ID:16 pID:8208 vID:212
[    8.674564] ID:17 pID:8209 vID:213
[    8.674572] ID:18 pID:8210 vID:214
[    8.674580] ID:19 pID:8211 vID:215
[    8.674588] ID:20 pID:8212 vID:216
[    8.674596] ID:21 pID:8213 vID:217
[    8.674604] ID:22 pID:8214 vID:218
[    8.674612] ID:23 pID:8215 vID:219
[    8.674620] ID:24 pID:8216 vID:220
[    8.674628] ID:25 pID:8217 vID:221
[    8.674636] ID:26 pID:8218 vID:222
[    8.674643] ID:27 pID:8219 vID:223
[    8.674651] ID:28 pID:8220 vID:224
[    8.674659] ID:29 pID:8221 vID:225
[    8.674667] ID:30 pID:8222 vID:226
[    8.674675] ID:31 pID:8223 vID:227
[    8.674824] IRQ196 -> 0-7 CPU0
[    8.674850] IRQ197 -> 0-7 CPU1
[    8.674864] IRQ198 -> 0-7 CPU2
[    8.674878] IRQ199 -> 0-7 CPU3
[    8.674891] IRQ200 -> 0-7 CPU4
[    8.674905] IRQ201 -> 0-7 CPU5
[    8.674918] IRQ202 -> 0-7 CPU6
[    8.674932] IRQ203 -> 0-7 CPU7
[    8.674945] IRQ204 -> 0-7 CPU0
[    8.674951] pci 0006:00:00.0:   bridge window [mem 0x00000000-0x000fffff 64bit pref]
[    8.675005] pci 0006:00:00.0: PME# supported from D0 D3hot D3cold
[    8.675887] pci 0006:01:00.0: [17cb:1103] type 00 class 0x028000 PCIe Endpoint
[    8.675983] pci 0006:01:00.0: BAR 0 [mem 0x00000000-0x001fffff 64bit]
[    8.676613] pci 0006:01:00.0: PME# supported from D0 D3hot D3cold
[    8.676779] pci 0006:01:00.0: 4.000 Gb/s available PCIe bandwidth, limited by 5.0 GT/s PCIe x1 link at 0006:00:00.0 (capable of 7.876 Gb/s with 8.0 GT/s PCIe x1 link)
[    8.681292] pci 0004:01:00.0: BAR 0 [mem 0x34300000-0x34300fff 64bit]: assigned
[    8.681332] pci 0004:01:00.0: BAR 2 [mem 0x34301000-0x34301fff 64bit]: assigned
[    8.686968] IRQ205 -> 0-7 CPU1
[    8.691823] pci 0006:00:00.0: bridge window [mem 0x30400000-0x305fffff]: assigned
[    8.691825] pci 0006:00:00.0: BAR 0 [mem 0x30300000-0x30300fff]: assigned
[    8.691829] pci 0006:01:00.0: BAR 0 [mem 0x30400000-0x305fffff 64bit]: assigned
[    8.691877] pci 0006:00:00.0: PCI bridge to [bus 01-ff]
[    8.691880] pci 0006:00:00.0:   bridge window [mem 0x30400000-0x305fffff]
[    8.692011] Reusing ITT for devID 0
[    8.693668] Reusing ITT for devID 0
[    8.693871] pcieport 0006:00:00.0: PME: Signaling with IRQ 228
[    8.694116] pcieport 0006:00:00.0: AER: enabled with IRQ 228
[    8.696453] pci 0004:00:00.0: PCI bridge to [bus 01-ff]
[    8.703760] IRQ206 -> 0-7 CPU2
[    8.710986] pci 0004:00:00.0:   bridge window [mem 0x34300000-0x343fffff]
[    8.711136] Reusing ITT for devID 0
[    8.717093] IRQ207 -> 0-7 CPU3
[    8.723889] Reusing ITT for devID 0
[    8.729600] IRQ208 -> 0-7 CPU4
[    8.736507] pcieport 0004:00:00.0: PME: Signaling with IRQ 229
[    8.744261] IRQ209 -> 0-7 CPU5
[    8.750757] pcieport 0004:00:00.0: AER: enabled with IRQ 229
[    8.758038] IRQ210 -> 0-7 CPU6
[    9.071793] IRQ211 -> 0-7 CPU7
[    9.071807] IRQ212 -> 0-7 CPU0
[    9.071819] IRQ213 -> 0-7 CPU1
[    9.071831] IRQ214 -> 0-7 CPU2
[    9.071842] IRQ215 -> 0-7 CPU3
[    9.071852] IRQ216 -> 0-7 CPU4
[    9.071863] IRQ217 -> 0-7 CPU5
[    9.071875] IRQ218 -> 0-7 CPU6
[    9.071886] IRQ219 -> 0-7 CPU7
[    9.071897] IRQ220 -> 0-7 CPU0
[    9.071907] IRQ221 -> 0-7 CPU1
[    9.071920] IRQ222 -> 0-7 CPU2
[    9.071930] IRQ223 -> 0-7 CPU3
[    9.071941] IRQ224 -> 0-7 CPU4
[    9.071952] IRQ225 -> 0-7 CPU5
[    9.071962] IRQ226 -> 0-7 CPU6
[    9.071973] IRQ227 -> 0-7 CPU7
[    9.073568] Reusing ITT for devID 0
[    9.073607] ID:0 pID:8192 vID:196
[    9.073618] IRQ196 -> 0-7 CPU0
[    9.073717] IRQ196 -> 0-7 CPU0
[    9.073737] pcieport 0002:00:00.0: PME: Signaling with IRQ 196
[    9.086532] pcieport 0002:00:00.0: AER: enabled with IRQ 196
[    9.102057] mhi-pci-generic 0004:01:00.0: MHI PCI device found: foxconn-sdx55
[    9.109830] mhi-pci-generic 0004:01:00.0: BAR 0 [mem 0x34300000-0x34300fff 64bit]: assigned
[    9.119027] mhi-pci-generic 0004:01:00.0: enabling device (0000 -> 0002)
[    9.127271] ITS: alloc 8224:8
[    9.141500] ITT 8 entries, 3 bits
[    9.144502] ID:0 pID:8224 vID:198
[    9.144597] ID:1 pID:8225 vID:199
[    9.144605] ID:2 pID:8226 vID:200
[    9.144612] ID:3 pID:8227 vID:201
[    9.144619] ID:4 pID:8228 vID:202
[    9.144689] IRQ198 -> 0-7 CPU1
[    9.144888] IRQ199 -> 0-7 CPU2
[    9.144901] IRQ200 -> 0-7 CPU3
[    9.144914] IRQ201 -> 0-7 CPU4
[    9.144927] IRQ202 -> 0-7 CPU5
[    9.151264] IRQ198 -> 0-7 CPU1
[    9.151479] IRQ199 -> 0-7 CPU2
[    9.151673] IRQ200 -> 0-7 CPU3
[    9.151849] IRQ201 -> 0-7 CPU4
[    9.152056] IRQ202 -> 0-7 CPU5
[    9.159972] mhi mhi0: Requested to power ON
[    9.165275] mhi mhi0: Power on setup success
[    9.279951] ath11k_pci 0006:01:00.0: BAR 0 [mem 0x30400000-0x305fffff 64bit]: assigned
[    9.288208] ath11k_pci 0006:01:00.0: enabling device (0000 -> 0002)
[    9.301708] nvme nvme0: pci function 0002:01:00.0
[    9.307052] Reusing ITT for devID 100
[    9.315457] nvme 0002:01:00.0: enabling device (0000 -> 0002)
[    9.326554] Reusing ITT for devID 100
[    9.336332] ath11k_pci 0006:01:00.0: ath11k_pci_alloc_msi - requesting one vector failed: -28
[    9.344362] Reusing ITT for devID 100
[    9.351639] ath11k_pci 0006:01:00.0: failed to enable msi: -22
[    9.351866] ath11k_pci 0006:01:00.0: probe with driver ath11k_pci failed with error -22
[    9.360327] Reusing ITT for devID 100
[    9.654429] nvme nvme0: allocated 61 MiB host memory buffer.
[    9.814664] Reusing ITT for devID 100
[    9.815000] Reusing ITT for devID 100
[    9.815553] Reusing ITT for devID 100
[    9.843417] nvme nvme0: 1/0/0 default/read/poll queues
[    9.875782]  nvme0n1: p1

[   29.666877] mhi-pci-generic 0004:01:00.0: failed to power up MHI controller
[   29.681492] mhi-pci-generic 0004:01:00.0: probe with driver mhi-pci-generic failed with error -110
Marc Zyngier July 17, 2024, 12:54 p.m. UTC | #11
On Wed, 17 Jul 2024 08:23:39 +0100,
Johan Hovold <johan@kernel.org> wrote:
> 
> On Tue, Jul 16, 2024 at 07:21:39PM +0100, Marc Zyngier wrote:
> > On Tue, 16 Jul 2024 15:53:28 +0100,
> > Johan Hovold <johan@kernel.org> wrote:
> > > On Tue, Jul 16, 2024 at 11:30:05AM +0100, Marc Zyngier wrote:
> > > > On Mon, 15 Jul 2024 15:10:01 +0100,
> > > > Johan Hovold <johan@kernel.org> wrote:
> > > > > On Mon, Jul 15, 2024 at 01:58:13PM +0100, Marc Zyngier wrote:
> > > > > > On Mon, 15 Jul 2024 12:18:47 +0100,
> > > > > > Johan Hovold <johan@kernel.org> wrote:
> 
> > > > > > > This series only showed up in linux-next last Friday and broke interrupt
> > > > > > > handling on Qualcomm platforms like sc8280xp (e.g. Lenovo ThinkPad X13s)
> > > > > > > and x1e80100 that use the GIC ITS for PCIe MSIs.
> > > > > > > 
> > > > > > > I've applied the series (21 commits from linux-next) on top of 6.10 and
> > > > > > > can confirm that the breakage is caused by commits:
> > > > > > > 
> > > > > > > 	3d1c927c08fc ("irqchip/gic-v3-its: Switch platform MSI to MSI parent")
> > > > > > > 	233db05bc37f ("irqchip/gic-v3-its: Provide MSI parent for PCI/MSI[-X]")
> > > > > > > 
> > > > > > > Applying the series up until the change before 3d1c927c08fc unbreaks the
> > > > > > > wifi on one machine:
> > > > > > > 
> > > > > > > 	ath11k_pci 0006:01:00.0: failed to enable msi: -22
> > > > > > > 	ath11k_pci 0006:01:00.0: probe with driver ath11k_pci failed with error -22
> > > 
> > > Correction, this doesn't fix the wifi, but I'm not seeing these errors
> > > with the commit before cc23d1dfc959 as the ath11k driver doesn't get
> 
> [ This was supposed to say 3d1c927c08fc, which is the mainline hash,
> sorry. ]
> 
> > > this far (or doesn't probe at all).
> > 
> > I think we need to track one thing at a time. The wifi and nvme
> > problems seem subtly different... Which is the exact commit that
> > breaks nvme on your machine?
> 
> Yeah, forget about 3d1c927c08fc for now, which may have been a red
> herring since we also appear to be dealing with some sort of race and
> (some) symptoms keep changing from boot to boot. The only thing that is
> certain is that the series breaks MSI and that the NVMe breaks with
> commit 233db05bc37f ("irqchip/gic-v3-its: Provide MSI parent for
> PCI/MSI[-X]").
> 
> > > > So is this issue actually tied to the async probing? Does it always
> > > > work if you disable it?
> > > 
> > > There seem to be multiple issues here.
> > > 
> > > With the full series applied and normal async (i.e. parallel) probing of
> > > the PCIe controllers I sometimes see allocation failing with -ENOSPC
> > > (e.g. the above ath11k errors). This seems to indicate broken locking
> > > somewhere.
> > 
> > Your log doesn't support this theory. At least not from an ITS
> > perspective, as it keeps dishing out INTIDs (and it is very hard to
> > run out of IRQs with the ITS).
> 
> The log I shared was with synchronous probing which takes parallel
> allocation out of the equation (and gives more readable logs) so that is
> expected. See below for a log with normal async probing that may give
> some more insight into the race as well (i.e. when ath11k allocation
> fails with -ENOSPC.)

Huh, this log is actually pointing at something very ugly. Not a race,
but some horrible ID confusion. See below.

> 
> > > With synchronous probing, allocation always seems to succeed but the
> > > ath11k (and modem) drivers time out as no interrupts are received.
> > > 
> > > The NVMe driver sometimes falls back to INTx signalling and can access
> > > the drive, but often ends up with an MSI-X (?!) allocation and then fails
> > > to probe:
> > > 
> > > 	[  132.084740] nvme nvme0: I/O tag 17 (1011) QID 0 timeout, completion polled
> > 
> > So one of my test boxes (ThunderX) fails this exact way, while another
> > (Synquacer) is pretty happy. Still trying to understand the difference
> > in behaviour.
> > 
> > How do you enforce synchronous probing?
> 
> I believe there is a kernel parameter for this (e.g.
> module.async_probe), but I just disable async probing for the Qualcomm
> PCIe driver I'm using:

I had tried this module parameter, but it didn't change anything on my
end.

> 
> --- a/drivers/pci/controller/dwc/pcie-qcom.c
> +++ b/drivers/pci/controller/dwc/pcie-qcom.c
> @@ -1684,7 +1684,7 @@ static struct platform_driver qcom_pcie_driver = {
>                 .name = "qcom-pcie",
>                 .of_match_table = qcom_pcie_match,
>                 .pm = &qcom_pcie_pm_ops,
> -               .probe_type = PROBE_PREFER_ASYNCHRONOUS,
> +               //.probe_type = PROBE_PREFER_ASYNCHRONOUS,
>         },
>  };

I'll have a look whether the TX1 PCIe driver uses this. It's
positively ancient, so I wouldn't bet that it has been touched
significantly in the past 5 years.

[...]

> [    8.692011] Reusing ITT for devID 0
> [    8.693668] Reusing ITT for devID 0

This is really odd. It indicates that you have several devices sharing
the same DeviceID, which I seriously doubt is the case in a
laptop. Do you have any non-transparent bridge here? lspci would help.

> [    8.693871] pcieport 0006:00:00.0: PME: Signaling with IRQ 228
> [    8.694116] pcieport 0006:00:00.0: AER: enabled with IRQ 228
> [    8.696453] pci 0004:00:00.0: PCI bridge to [bus 01-ff]
> [    8.703760] IRQ206 -> 0-7 CPU2
> [    8.710986] pci 0004:00:00.0:   bridge window [mem 0x34300000-0x343fffff]
> [    8.711136] Reusing ITT for devID 0

Where has the bus number gone?

> [    8.717093] IRQ207 -> 0-7 CPU3
> [    8.723889] Reusing ITT for devID 0
> [    8.729600] IRQ208 -> 0-7 CPU4
> [    8.736507] pcieport 0004:00:00.0: PME: Signaling with IRQ 229
> [    8.744261] IRQ209 -> 0-7 CPU5
> [    8.750757] pcieport 0004:00:00.0: AER: enabled with IRQ 229
> [    8.758038] IRQ210 -> 0-7 CPU6
> [    9.071793] IRQ211 -> 0-7 CPU7
> [    9.071807] IRQ212 -> 0-7 CPU0
> [    9.071819] IRQ213 -> 0-7 CPU1
> [    9.071831] IRQ214 -> 0-7 CPU2
> [    9.071842] IRQ215 -> 0-7 CPU3
> [    9.071852] IRQ216 -> 0-7 CPU4
> [    9.071863] IRQ217 -> 0-7 CPU5
> [    9.071875] IRQ218 -> 0-7 CPU6
> [    9.071886] IRQ219 -> 0-7 CPU7
> [    9.071897] IRQ220 -> 0-7 CPU0
> [    9.071907] IRQ221 -> 0-7 CPU1
> [    9.071920] IRQ222 -> 0-7 CPU2
> [    9.071930] IRQ223 -> 0-7 CPU3
> [    9.071941] IRQ224 -> 0-7 CPU4
> [    9.071952] IRQ225 -> 0-7 CPU5
> [    9.071962] IRQ226 -> 0-7 CPU6
> [    9.071973] IRQ227 -> 0-7 CPU7
> [    9.073568] Reusing ITT for devID 0
> [    9.073607] ID:0 pID:8192 vID:196
> [    9.073618] IRQ196 -> 0-7 CPU0
> [    9.073717] IRQ196 -> 0-7 CPU0
> [    9.073737] pcieport 0002:00:00.0: PME: Signaling with IRQ 196
> [    9.086532] pcieport 0002:00:00.0: AER: enabled with IRQ 196
> [    9.102057] mhi-pci-generic 0004:01:00.0: MHI PCI device found: foxconn-sdx55
> [    9.109830] mhi-pci-generic 0004:01:00.0: BAR 0 [mem 0x34300000-0x34300fff 64bit]: assigned
> [    9.119027] mhi-pci-generic 0004:01:00.0: enabling device (0000 -> 0002)
> [    9.127271] ITS: alloc 8224:8
> [    9.141500] ITT 8 entries, 3 bits
> [    9.144502] ID:0 pID:8224 vID:198
> [    9.144597] ID:1 pID:8225 vID:199
> [    9.144605] ID:2 pID:8226 vID:200
> [    9.144612] ID:3 pID:8227 vID:201
> [    9.144619] ID:4 pID:8228 vID:202
> [    9.144689] IRQ198 -> 0-7 CPU1
> [    9.144888] IRQ199 -> 0-7 CPU2
> [    9.144901] IRQ200 -> 0-7 CPU3
> [    9.144914] IRQ201 -> 0-7 CPU4
> [    9.144927] IRQ202 -> 0-7 CPU5
> [    9.151264] IRQ198 -> 0-7 CPU1
> [    9.151479] IRQ199 -> 0-7 CPU2
> [    9.151673] IRQ200 -> 0-7 CPU3
> [    9.151849] IRQ201 -> 0-7 CPU4
> [    9.152056] IRQ202 -> 0-7 CPU5
> [    9.159972] mhi mhi0: Requested to power ON
> [    9.165275] mhi mhi0: Power on setup success
> [    9.279951] ath11k_pci 0006:01:00.0: BAR 0 [mem 0x30400000-0x305fffff 64bit]: assigned
> [    9.288208] ath11k_pci 0006:01:00.0: enabling device (0000 -> 0002)
> [    9.301708] nvme nvme0: pci function 0002:01:00.0
> [    9.307052] Reusing ITT for devID 100
> [    9.315457] nvme 0002:01:00.0: enabling device (0000 -> 0002)

This is device 0002:01:00.0...

> [    9.326554] Reusing ITT for devID 100

... seen as device 0000:01:00.0. WTF???

> [    9.336332] ath11k_pci 0006:01:00.0: ath11k_pci_alloc_msi - requesting one vector failed: -28

I'm starting to suspect that the new code doesn't carry all the
required bits for the DevID, and that we end up trying to allocate
interrupts from the pool allocated to another device, which can never
be a good thing, and would explain why everything dies a painful
death.

Can you run the same trace with the whole thing reverted? I think
we're onto something here.

Thanks,

	M.
Johan Hovold July 17, 2024, 1:38 p.m. UTC | #12
On Wed, Jul 17, 2024 at 01:54:40PM +0100, Marc Zyngier wrote:
> On Wed, 17 Jul 2024 08:23:39 +0100,
> Johan Hovold <johan@kernel.org> wrote:

> > I believe there is a kernel parameter for this (e.g.
> > module.async_probe), but I just disable async probing for the Qualcomm
> > PCIe driver I'm using:
> 
> I had tried this module parameter, but it didn't change anything on my
> end.

> I'll have a look whether the TX1 PCIe driver uses this. It's
> positively ancient, so I wouldn't bet that it has been touched
> significantly in the past 5 years.

Perhaps async probing just changes the symptoms; the NVMe and wifi
don't work in either case.

> > [    8.692011] Reusing ITT for devID 0
> > [    8.693668] Reusing ITT for devID 0
> 
> This is really odd. It indicates that you have several devices sharing
> the same DeviceID, which I seriously doubt is the case in a
> laptop. Do you have any non-transparent bridge here? lspci would help.

Yeah, and these messages do not show up without the series (see log
below). They are there in the previous synchronous log however.

0002:00:00.0 PCI bridge: Qualcomm Technologies, Inc SC8280XP PCI Express Root Port
0002:01:00.0 Non-Volatile memory controller: KIOXIA Corporation NVMe SSD Controller BG4 (DRAM-less)
0004:00:00.0 PCI bridge: Qualcomm Technologies, Inc SC8280XP PCI Express Root Port
0004:01:00.0 Unassigned class [ff00]: Qualcomm Technologies, Inc SDX55 [Snapdragon X55 5G]
0006:00:00.0 PCI bridge: Qualcomm Technologies, Inc SC8280XP PCI Express Root Port
0006:01:00.0 Network controller: Qualcomm Technologies, Inc QCNFA765 Wireless Network Adapter (rev 01)

> I'm starting to suspect that the new code doesn't carry all the
> required bits for the DevID, and that we end up trying to allocate
> interrupts from the pool allocated to another device, which can never
> be a good thing, and would explain why everything dies a painful
> death.
> 
> Can you run the same trace with the whole thing reverted? I think
> we're onto something here.

See below, using normal asynchronous probing like the previous log.

Johan


[    8.129424] qcom-pcie 1c10000.pcie: host bridge /soc@0/pcie@1c10000 ranges:
[    8.136886] qcom-pcie 1c10000.pcie:       IO 0x0034200000..0x00342fffff -> 0x0000000000
[    8.145351] qcom-pcie 1c00000.pcie: host bridge /soc@0/pcie@1c00000 ranges:
[    8.145372] qcom-pcie 1c10000.pcie:      MEM 0x0034300000..0x0035ffffff -> 0x0034300000
[    8.146042] qcom-pcie 1c20000.pcie: host bridge /soc@0/pcie@1c20000 ranges:
[    8.146063] qcom-pcie 1c20000.pcie:       IO 0x003c200000..0x003c2fffff -> 0x0000000000
[    8.146073] qcom-pcie 1c20000.pcie:      MEM 0x003c300000..0x003dffffff -> 0x003c300000
[    8.152546] qcom-pcie 1c00000.pcie:       IO 0x0030200000..0x00302fffff -> 0x0000000000
[    8.176372] qcom-pcie 1c00000.pcie:      MEM 0x0030300000..0x0031ffffff -> 0x0030300000
[    8.266560] qcom-pcie 1c20000.pcie: iATU: unroll T, 8 ob, 8 ib, align 4K, limit 1024G
[    8.298587] qcom-pcie 1c10000.pcie: iATU: unroll T, 8 ob, 8 ib, align 4K, limit 1024G
[    8.318753] qcom-pcie 1c00000.pcie: iATU: unroll T, 8 ob, 8 ib, align 4K, limit 1024G
[    8.377720] qcom-pcie 1c20000.pcie: PCIe Gen.3 x4 link up
[    8.384650] qcom-pcie 1c20000.pcie: PCI host bridge to bus 0002:00
[    8.392099] pci_bus 0002:00: root bus resource [bus 00-ff]
[    8.398766] pci_bus 0002:00: root bus resource [io  0x100000-0x1fffff] (bus address [0x0000-0xfffff])
[    8.405033] qcom-pcie 1c10000.pcie: PCIe Gen.3 x2 link up
[    8.408250] pci_bus 0002:00: root bus resource [mem 0x3c300000-0x3dffffff]
[    8.413899] qcom-pcie 1c10000.pcie: PCI host bridge to bus 0004:00
[    8.420959] pci 0002:00:00.0: [17cb:010e] type 01 class 0x060400 PCIe Root Port
[    8.427201] pci_bus 0004:00: root bus resource [bus 00-ff]
[    8.427204] pci_bus 0004:00: root bus resource [io  0x0000-0xfffff]
[    8.427206] pci_bus 0004:00: root bus resource [mem 0x34300000-0x35ffffff]
[    8.427219] pci 0004:00:00.0: [17cb:010e] type 01 class 0x060400 PCIe Root Port
[    8.430158] qcom-pcie 1c00000.pcie: PCIe Gen.2 x1 link up
[    8.430263] qcom-pcie 1c00000.pcie: PCI host bridge to bus 0006:00
[    8.430266] pci_bus 0006:00: root bus resource [bus 00-ff]
[    8.430269] pci_bus 0006:00: root bus resource [io  0x200000-0x2fffff] (bus address [0x0000-0xfffff])
[    8.430271] pci_bus 0006:00: root bus resource [mem 0x30300000-0x31ffffff]
[    8.430285] pci 0006:00:00.0: [17cb:010e] type 01 class 0x060400 PCIe Root Port
[    8.430297] pci 0006:00:00.0: BAR 0 [mem 0x00000000-0x00000fff]
[    8.430307] pci 0006:00:00.0: PCI bridge to [bus 01-ff]
[    8.430313] pci 0006:00:00.0:   bridge window [io  0x200000-0x200fff]
[    8.430317] pci 0006:00:00.0:   bridge window [mem 0x00000000-0x000fffff]
[    8.430324] pci 0006:00:00.0:   bridge window [mem 0x00000000-0x000fffff 64bit pref]
[    8.430414] pci 0006:00:00.0: PME# supported from D0 D3hot D3cold
[    8.431430] pci 0006:01:00.0: [17cb:1103] type 00 class 0x028000 PCIe Endpoint
[    8.431526] pci 0006:01:00.0: BAR 0 [mem 0x00000000-0x001fffff 64bit]
[    8.432154] pci 0006:01:00.0: PME# supported from D0 D3hot D3cold
[    8.432320] pci 0006:01:00.0: 4.000 Gb/s available PCIe bandwidth, limited by 5.0 GT/s PCIe x1 link at 0006:00:00.0 (capable of 7.876 Gb/s with 8.0 GT/s PCIe x1 link)
[    8.434723] pci 0002:00:00.0: BAR 0 [mem 0x00000000-0x00000fff]
[    8.440358] pci 0004:00:00.0: BAR 0 [mem 0x00000000-0x00000fff]
[    8.445157] pci 0006:00:00.0: bridge window [mem 0x30400000-0x305fffff]: assigned
[    8.445160] pci 0006:00:00.0: BAR 0 [mem 0x30300000-0x30300fff]: assigned
[    8.445163] pci 0006:01:00.0: BAR 0 [mem 0x30400000-0x305fffff 64bit]: assigned
[    8.445211] pci 0006:00:00.0: PCI bridge to [bus 01-ff]
[    8.445214] pci 0006:00:00.0:   bridge window [mem 0x30400000-0x305fffff]
[    8.445526] ITS: alloc 8192:32
[    8.445537] ITT 32 entries, 5 bits
[    8.446675] ID:0 pID:8192 vID:196
[    8.446697] ID:1 pID:8193 vID:197
[    8.446702] ID:2 pID:8194 vID:198
[    8.446707] ID:3 pID:8195 vID:199
[    8.446712] ID:4 pID:8196 vID:200
[    8.446718] ID:5 pID:8197 vID:201
[    8.446722] ID:6 pID:8198 vID:202
[    8.446727] ID:7 pID:8199 vID:203
[    8.446732] ID:8 pID:8200 vID:204
[    8.446738] ID:9 pID:8201 vID:205
[    8.446743] ID:10 pID:8202 vID:206
[    8.446748] ID:11 pID:8203 vID:207
[    8.446753] ID:12 pID:8204 vID:208
[    8.446758] ID:13 pID:8205 vID:209
[    8.446763] ID:14 pID:8206 vID:210
[    8.446768] ID:15 pID:8207 vID:211
[    8.446773] ID:16 pID:8208 vID:212
[    8.446777] ID:17 pID:8209 vID:213
[    8.446783] ID:18 pID:8210 vID:214
[    8.446788] ID:19 pID:8211 vID:215
[    8.446805] pci 0002:00:00.0: PCI bridge to [bus 01-ff]
[    8.446812] pci 0002:00:00.0:   bridge window [io  0x100000-0x100fff]
[    8.446817] pci 0002:00:00.0:   bridge window [mem 0x00000000-0x000fffff]
[    8.446827] pci 0002:00:00.0:   bridge window [mem 0x00000000-0x000fffff 64bit pref]
[    8.446899] pci 0002:00:00.0: PME# supported from D0 D3hot D3cold
[    8.448399] pci 0002:01:00.0: [1e0f:0001] type 00 class 0x010802 PCIe Endpoint
[    8.448489] pci 0002:01:00.0: BAR 0 [mem 0x00000000-0x00003fff 64bit]
[    8.449076] pci 0002:01:00.0: PME# supported from D0 D3hot
[    8.453855] pci 0004:00:00.0: PCI bridge to [bus 01-ff]
[    8.453860] pci 0004:00:00.0:   bridge window [io  0x0000-0x0fff]
[    8.461133] pci 0002:00:00.0: bridge window [mem 0x3c300000-0x3c3fffff]: assigned
[    8.461137] pci 0002:00:00.0: BAR 0 [mem 0x3c400000-0x3c400fff]: assigned
[    8.461141] pci 0002:01:00.0: BAR 0 [mem 0x3c300000-0x3c303fff 64bit]: assigned
[    8.461182] pci 0002:00:00.0: PCI bridge to [bus 01-ff]
[    8.461185] pci 0002:00:00.0:   bridge window [mem 0x3c300000-0x3c3fffff]
[    8.461378] ID:20 pID:8212 vID:216
[    8.466916] pci 0004:00:00.0:   bridge window [mem 0x00000000-0x000fffff]
[    8.473265] ID:21 pID:8213 vID:217
[    8.478893] pci 0004:00:00.0:   bridge window [mem 0x00000000-0x000fffff 64bit pref]
[    8.488351] ID:22 pID:8214 vID:218
[    8.495446] pci 0004:00:00.0: PME# supported from D0 D3hot D3cold
[    8.502905] ID:23 pID:8215 vID:219
[    8.509868] pci 0004:01:00.0: [17cb:0306] type 00 class 0xff0000 PCIe Endpoint
[    8.514345] ID:24 pID:8216 vID:220
[    8.521029] pci 0004:01:00.0: BAR 0 [mem 0x00000000-0x00000fff 64bit]
[    8.527916] ID:25 pID:8217 vID:221
[    8.535900] pci 0004:01:00.0: BAR 2 [mem 0x00000000-0x00000fff 64bit]
[    8.542116] ID:26 pID:8218 vID:222
[    8.550074] pci 0004:01:00.0: PME# supported from D0 D3hot D3cold
[    8.556138] ID:27 pID:8219 vID:223
[    8.562538] pci 0004:01:00.0: 15.752 Gb/s available PCIe bandwidth, limited by 8.0 GT/s PCIe x2 link at 0004:00:00.0 (capable of 31.506 Gb/s with 16.0 GT/s PCIe x2 link)
[    8.577637] ID:28 pID:8220 vID:224
[    8.597112] pci 0004:00:00.0: bridge window [mem 0x34300000-0x343fffff]: assigned
[    8.597753] ID:29 pID:8221 vID:225
[    8.604711] pci 0004:00:00.0: BAR 0 [mem 0x34400000-0x34400fff]: assigned
[    8.612214] ID:30 pID:8222 vID:226
[    8.617572] pci 0004:01:00.0: BAR 0 [mem 0x34300000-0x34300fff 64bit]: assigned
[    8.624536] ID:31 pID:8223 vID:227
[    8.624836] pci 0004:01:00.0: BAR 2 [mem 0x34301000-0x34301fff 64bit]: assigned
[    8.625174] IRQ196 -> 0-7 CPU0
[    8.625221] ITS: alloc 8224:32
[    8.625230] ITT 32 entries, 5 bits
[    8.625370] pci 0004:00:00.0: PCI bridge to [bus 01-ff]
[    8.625633] IRQ197 -> 0-7 CPU1
[    8.625888] pci 0004:00:00.0:   bridge window [mem 0x34300000-0x343fffff]
[    8.626014] ID:0 pID:8224 vID:229
[    8.626020] ID:1 pID:8225 vID:230
[    8.626025] ID:2 pID:8226 vID:231
[    8.626031] ID:3 pID:8227 vID:232
[    8.626036] ID:4 pID:8228 vID:233
[    8.626041] ID:5 pID:8229 vID:234
[    8.626046] ID:6 pID:8230 vID:235
[    8.626051] ID:7 pID:8231 vID:236
[    8.626056] ID:8 pID:8232 vID:237
[    8.626061] ID:9 pID:8233 vID:238
[    8.626066] ID:10 pID:8234 vID:239
[    8.626071] ID:11 pID:8235 vID:240
[    8.626076] ID:12 pID:8236 vID:241
[    8.626081] ID:13 pID:8237 vID:242
[    8.626086] ID:14 pID:8238 vID:243
[    8.626092] ID:15 pID:8239 vID:244
[    8.626097] ID:16 pID:8240 vID:245
[    8.626102] ID:17 pID:8241 vID:246
[    8.626107] ID:18 pID:8242 vID:247
[    8.626112] ID:19 pID:8243 vID:248
[    8.626117] ID:20 pID:8244 vID:249
[    8.626122] ID:21 pID:8245 vID:250
[    8.626127] ID:22 pID:8246 vID:251
[    8.626132] ID:23 pID:8247 vID:252
[    8.626137] ID:24 pID:8248 vID:253
[    8.626143] ID:25 pID:8249 vID:254
[    8.626148] ID:26 pID:8250 vID:255
[    8.626153] ID:27 pID:8251 vID:256
[    8.626158] ID:28 pID:8252 vID:257
[    8.626166] IRQ198 -> 0-7 CPU2
[    8.626177] IRQ199 -> 0-7 CPU3
[    8.626188] IRQ200 -> 0-7 CPU4
[    8.626199] IRQ201 -> 0-7 CPU5
[    8.626210] IRQ202 -> 0-7 CPU6
[    8.626221] IRQ203 -> 0-7 CPU7
[    8.626232] IRQ204 -> 0-7 CPU0
[    8.626243] IRQ205 -> 0-7 CPU1
[    8.626254] IRQ206 -> 0-7 CPU2
[    8.626264] IRQ207 -> 0-7 CPU3
[    8.626275] IRQ208 -> 0-7 CPU4
[    8.626286] IRQ209 -> 0-7 CPU5
[    8.626297] IRQ210 -> 0-7 CPU6
[    8.626308] IRQ211 -> 0-7 CPU7
[    8.626319] IRQ212 -> 0-7 CPU0
[    8.626330] IRQ213 -> 0-7 CPU1
[    8.626341] IRQ214 -> 0-7 CPU2
[    8.626352] IRQ215 -> 0-7 CPU3
[    8.626363] IRQ216 -> 0-7 CPU4
[    8.626374] IRQ217 -> 0-7 CPU5
[    8.626385] IRQ218 -> 0-7 CPU6
[    8.626396] IRQ219 -> 0-7 CPU7
[    8.626407] IRQ220 -> 0-7 CPU0
[    8.626418] IRQ221 -> 0-7 CPU1
[    8.626429] IRQ222 -> 0-7 CPU2
[    8.626704] ID:29 pID:8253 vID:258
[    8.626965] IRQ223 -> 0-7 CPU3
[    8.627214] ID:30 pID:8254 vID:259
[    8.627467] IRQ224 -> 0-7 CPU4
[    8.627722] ID:31 pID:8255 vID:260
[    8.627977] IRQ225 -> 0-7 CPU5
[    8.628312] IRQ229 -> 0-7 CPU5
[    8.628372] ITS: alloc 8256:32
[    8.628380] ITT 32 entries, 5 bits
[    8.628479] IRQ226 -> 0-7 CPU6
[    8.628723] IRQ230 -> 0-7 CPU6
[    8.628957] IRQ227 -> 0-7 CPU7
[    8.629094] ID:0 pID:8256 vID:262
[    8.629099] ID:1 pID:8257 vID:263
[    8.629104] ID:2 pID:8258 vID:264
[    8.629109] ID:3 pID:8259 vID:265
[    8.629114] ID:4 pID:8260 vID:266
[    8.629119] ID:5 pID:8261 vID:267
[    8.629124] ID:6 pID:8262 vID:268
[    8.629129] ID:7 pID:8263 vID:269
[    8.629134] ID:8 pID:8264 vID:270
[    8.629139] ID:9 pID:8265 vID:271
[    8.629144] ID:10 pID:8266 vID:272
[    8.629149] ID:11 pID:8267 vID:273
[    8.629153] ID:12 pID:8268 vID:274
[    8.629158] ID:13 pID:8269 vID:275
[    8.629163] ID:14 pID:8270 vID:276
[    8.629168] ID:15 pID:8271 vID:277
[    8.629173] ID:16 pID:8272 vID:278
[    8.629178] ID:17 pID:8273 vID:279
[    8.629183] ID:18 pID:8274 vID:280
[    8.629188] ID:19 pID:8275 vID:281
[    8.629200] IRQ231 -> 0-7 CPU7
[    8.629211] IRQ232 -> 0-7 CPU0
[    8.629222] IRQ233 -> 0-7 CPU1
[    8.629233] IRQ234 -> 0-7 CPU2
[    8.629244] IRQ235 -> 0-7 CPU3
[    8.629255] IRQ236 -> 0-7 CPU4
[    8.629266] IRQ237 -> 0-7 CPU7
[    8.629277] IRQ238 -> 0-7 CPU0
[    8.629287] IRQ239 -> 0-7 CPU1
[    8.629298] IRQ240 -> 0-7 CPU2
[    8.629309] IRQ241 -> 0-7 CPU3
[    8.629319] IRQ242 -> 0-7 CPU4
[    8.629336] IRQ243 -> 0-7 CPU5
[    8.629346] IRQ244 -> 0-7 CPU6
[    8.629357] IRQ245 -> 0-7 CPU7
[    8.629368] IRQ246 -> 0-7 CPU0
[    8.629379] IRQ247 -> 0-7 CPU1
[    8.629390] IRQ248 -> 0-7 CPU2
[    8.629401] IRQ249 -> 0-7 CPU3
[    8.629411] IRQ250 -> 0-7 CPU4
[    8.629422] IRQ251 -> 0-7 CPU5
[    8.629433] IRQ252 -> 0-7 CPU6
[    8.629670] ID:20 pID:8276 vID:282
[    8.629908] IRQ253 -> 0-7 CPU0
[    8.630134] ID:21 pID:8277 vID:283
[    8.635511] IRQ254 -> 0-7 CPU1
[    8.642115] ID:22 pID:8278 vID:284
[    8.649085] IRQ255 -> 0-7 CPU2
[    8.657029] ID:23 pID:8279 vID:285
[    8.663285] IRQ256 -> 0-7 CPU3
[    8.670689] ID:24 pID:8280 vID:286
[    8.677302] IRQ257 -> 0-7 CPU4
[    8.682925] ID:25 pID:8281 vID:287
[    8.688293] IRQ258 -> 0-7 CPU5
[    8.694547] ID:26 pID:8282 vID:288
[    8.702234] IRQ259 -> 0-7 CPU6
[    8.709197] ID:27 pID:8283 vID:289
[    8.709204] ID:28 pID:8284 vID:290
[    8.716722] IRQ260 -> 0-7 CPU7
[    8.722081] ID:29 pID:8285 vID:291
[    8.842813] ID:30 pID:8286 vID:292
[    8.842818] ID:31 pID:8287 vID:293
[    8.842966] IRQ262 -> 0-7 CPU0
[    8.842982] IRQ263 -> 0-7 CPU1
[    8.842993] IRQ264 -> 0-7 CPU2
[    8.843004] IRQ265 -> 0-7 CPU3
[    8.843016] IRQ266 -> 0-7 CPU4
[    8.843028] IRQ267 -> 0-7 CPU5
[    8.843040] IRQ268 -> 0-7 CPU6
[    8.843051] IRQ269 -> 0-7 CPU7
[    8.843063] IRQ270 -> 0-7 CPU0
[    8.843075] IRQ271 -> 0-7 CPU1
[    8.843087] IRQ272 -> 0-7 CPU2
[    8.843098] IRQ273 -> 0-7 CPU3
[    8.843110] IRQ274 -> 0-7 CPU4
[    8.843122] IRQ275 -> 0-7 CPU5
[    8.843133] IRQ276 -> 0-7 CPU6
[    8.843145] IRQ277 -> 0-7 CPU7
[    8.843157] IRQ278 -> 0-7 CPU0
[    8.843168] IRQ279 -> 0-7 CPU1
[    8.843180] IRQ280 -> 0-7 CPU2
[    8.843192] IRQ281 -> 0-7 CPU3
[    8.843203] IRQ282 -> 0-7 CPU4
[    8.843215] IRQ283 -> 0-7 CPU5
[    8.843227] IRQ284 -> 0-7 CPU6
[    8.843238] IRQ285 -> 0-7 CPU7
[    8.843250] IRQ286 -> 0-7 CPU0
[    8.843262] IRQ287 -> 0-7 CPU1
[    8.843273] IRQ288 -> 0-7 CPU2
[    8.843284] IRQ289 -> 0-7 CPU3
[    8.843296] IRQ290 -> 0-7 CPU4
[    8.843308] IRQ291 -> 0-7 CPU5
[    8.843319] IRQ292 -> 0-7 CPU6
[    8.843331] IRQ293 -> 0-7 CPU7
[    8.844444] ITS: alloc 8192:1
[    8.844455] ITT 1 entries, 0 bits
[    8.845389] ID:0 pID:8192 vID:196
[    8.845395] ITS: alloc 8193:1
[    8.845403] IRQ196 -> 0-7 CPU0
[    8.845405] ITT 1 entries, 0 bits
[    8.845604] IRQ196 -> 0-7 CPU0
[    8.845631] pcieport 0006:00:00.0: PME: Signaling with IRQ 196
[    8.846380] ID:0 pID:8193 vID:197
[    8.846414] ITS: alloc 8194:1
[    8.846423] ITT 1 entries, 0 bits
[    8.857408] IRQ197 -> 0-7 CPU1
[    8.857440] ID:0 pID:8194 vID:198
[    8.857450] IRQ198 -> 0-7 CPU2
[    8.857499] IRQ197 -> 0-7 CPU1
[    8.857515] pcieport 0002:00:00.0: PME: Signaling with IRQ 197
[    8.857529] IRQ198 -> 0-7 CPU2
[    8.858291] pcieport 0006:00:00.0: AER: enabled with IRQ 196
[    8.866563] pcieport 0002:00:00.0: AER: enabled with IRQ 197
[    8.872342] pcieport 0004:00:00.0: PME: Signaling with IRQ 198
[    8.885618] pcieport 0004:00:00.0: AER: enabled with IRQ 198
[    8.909946] mhi-pci-generic 0004:01:00.0: MHI PCI device found: foxconn-sdx55
[    8.914659] nvme nvme0: pci function 0002:01:00.0
[    8.917541] mhi-pci-generic 0004:01:00.0: BAR 0 [mem 0x34300000-0x34300fff 64bit]: assigned
[    8.922185] nvme 0002:01:00.0: enabling device (0000 -> 0002)
[    8.930939] mhi-pci-generic 0004:01:00.0: enabling device (0000 -> 0002)
[    8.937318] ITS: alloc 8195:1
[    8.944985] ITT 1 entries, 0 bits
[    8.945289] ITS: alloc 8196:8
[    8.945303] ITT 8 entries, 3 bits
[    8.947818] ID:0 pID:8195 vID:201
[    8.947910] IRQ201 -> 0-7 CPU3
[    8.948702] ID:0 pID:8196 vID:202
[    8.948720] ID:1 pID:8197 vID:203
[    8.950480] IRQ201 -> 0-7 CPU3
[    8.965330] ID:2 pID:8198 vID:204
[    8.974909] ID:3 pID:8199 vID:205
[    8.987215] ID:4 pID:8200 vID:206
[    9.001562] IRQ202 -> 0-7 CPU4
[    9.001759] IRQ203 -> 0-7 CPU5
[    9.001771] IRQ204 -> 0-7 CPU6
[    9.001849] IRQ205 -> 0-7 CPU7
[    9.001862] IRQ206 -> 0-7 CPU0
[    9.003223] IRQ202 -> 0-7 CPU4
[    9.003449] IRQ203 -> 0-7 CPU5
[    9.003638] IRQ204 -> 0-7 CPU6
[    9.003836] IRQ205 -> 0-7 CPU7
[    9.004007] IRQ206 -> 0-7 CPU0
[    9.005127] mhi mhi0: Requested to power ON
[    9.009901] mhi mhi0: Power on setup success
[    9.015403] nvme nvme0: allocated 61 MiB host memory buffer.
[    9.169296] ITS: alloc 8204:16
[    9.169319] ITT 16 entries, 4 bits
[    9.169492] ID:0 pID:8204 vID:201
[    9.169516] IRQ201 -> 0-7 CPU3
[    9.169620] ID:1 pID:8205 vID:211
[    9.169633] IRQ211 -> 0-7 CPU0
[    9.169702] ID:2 pID:8206 vID:212
[    9.169713] IRQ212 -> 0-7 CPU1
[    9.169904] ID:3 pID:8207 vID:213
[    9.169917] IRQ213 -> 0-7 CPU2
[    9.169982] ID:4 pID:8208 vID:214
[    9.169993] IRQ214 -> 0-7 CPU3
[    9.170070] ID:5 pID:8209 vID:215
[    9.170082] IRQ215 -> 0-7 CPU4
[    9.170143] ID:6 pID:8210 vID:216
[    9.170155] IRQ216 -> 0-7 CPU5
[    9.170221] ID:7 pID:8211 vID:217
[    9.170232] IRQ217 -> 0-7 CPU6
[    9.170294] ID:8 pID:8212 vID:218
[    9.170319] IRQ218 -> 0-7 CPU7
[    9.170460] IRQ201 -> 0-7 CPU3
[    9.179969] IRQ211 -> 0 CPU0
[    9.180329] IRQ212 -> 1 CPU1
[    9.180663] IRQ213 -> 2 CPU2
[    9.181001] IRQ214 -> 3 CPU3
[    9.181355] IRQ215 -> 4 CPU4
[    9.181702] IRQ216 -> 5 CPU5
[    9.188542] IRQ217 -> 6 CPU6
[    9.196576] IRQ218 -> 7 CPU7
[    9.196623] nvme nvme0: 8/0/0 default/read/poll queues
[    9.206751]  nvme0n1: p1
[    9.278797] ath11k_pci 0006:01:00.0: BAR 0 [mem 0x30400000-0x305fffff 64bit]: assigned
[    9.294555] ath11k_pci 0006:01:00.0: enabling device (0000 -> 0002)
[    9.295634] wwan wwan0: port wwan0qcdm0 attached
[    9.296105] wwan wwan0: port wwan0mbim0 attached
[    9.296789] wwan wwan0: port wwan0at0 attached
[    9.304915] ITS: alloc 8220:32
[    9.314316] ITT 32 entries, 5 bits
[    9.324270] ID:0 pID:8220 vID:262
[    9.338759] ID:1 pID:8221 vID:263
[    9.338765] ID:2 pID:8222 vID:264
[    9.338770] ID:3 pID:8223 vID:265
[    9.338775] ID:4 pID:8224 vID:266
[    9.338779] ID:5 pID:8225 vID:267
[    9.338784] ID:6 pID:8226 vID:268
[    9.338789] ID:7 pID:8227 vID:269
[    9.338794] ID:8 pID:8228 vID:270
[    9.338798] ID:9 pID:8229 vID:271
[    9.338803] ID:10 pID:8230 vID:272
[    9.338808] ID:11 pID:8231 vID:273
[    9.338812] ID:12 pID:8232 vID:274
[    9.338817] ID:13 pID:8233 vID:275
[    9.338821] ID:14 pID:8234 vID:276
[    9.338826] ID:15 pID:8235 vID:277
[    9.338831] ID:16 pID:8236 vID:278
[    9.338836] ID:17 pID:8237 vID:279
[    9.338841] ID:18 pID:8238 vID:280
[    9.338845] ID:19 pID:8239 vID:281
[    9.338850] ID:20 pID:8240 vID:282
[    9.338855] ID:21 pID:8241 vID:283
[    9.338859] ID:22 pID:8242 vID:284
[    9.338864] ID:23 pID:8243 vID:285
[    9.338868] ID:24 pID:8244 vID:286
[    9.338873] ID:25 pID:8245 vID:287
[    9.338877] ID:26 pID:8246 vID:288
[    9.338882] ID:27 pID:8247 vID:289
[    9.338887] ID:28 pID:8248 vID:290
[    9.338891] ID:29 pID:8249 vID:291
[    9.338896] ID:30 pID:8250 vID:292
[    9.338900] ID:31 pID:8251 vID:293
[    9.338980] IRQ262 -> 0-7 CPU1
[    9.362613] IRQ263 -> 0-7 CPU2
[    9.370142] IRQ264 -> 0-7 CPU3
[    9.377656] IRQ265 -> 0-7 CPU4
[    9.400274] IRQ266 -> 0-7 CPU5
[    9.409009] IRQ267 -> 0-7 CPU6
[    9.409021] IRQ268 -> 0-7 CPU7
[    9.409033] IRQ269 -> 0-7 CPU0
[    9.409044] IRQ270 -> 0-7 CPU1
[    9.409056] IRQ271 -> 0-7 CPU2
[    9.409067] IRQ272 -> 0-7 CPU3
[    9.409078] IRQ273 -> 0-7 CPU4
[    9.409089] IRQ274 -> 0-7 CPU5
[    9.409100] IRQ275 -> 0-7 CPU6
[    9.409111] IRQ276 -> 0-7 CPU7
[    9.409123] IRQ277 -> 0-7 CPU0
[    9.409134] IRQ278 -> 0-7 CPU1
[    9.409145] IRQ279 -> 0-7 CPU2
[    9.409157] IRQ280 -> 0-7 CPU3
[    9.409168] IRQ281 -> 0-7 CPU4
[    9.409179] IRQ282 -> 0-7 CPU5
[    9.409190] IRQ283 -> 0-7 CPU6
[    9.409201] IRQ284 -> 0-7 CPU7
[    9.409213] IRQ285 -> 0-7 CPU0
[    9.409224] IRQ286 -> 0-7 CPU1
[    9.409235] IRQ287 -> 0-7 CPU2
[    9.409247] IRQ288 -> 0-7 CPU3
[    9.409258] IRQ289 -> 0-7 CPU4
[    9.409270] IRQ290 -> 0-7 CPU5
[    9.409281] IRQ291 -> 0-7 CPU6
[    9.409292] IRQ292 -> 0-7 CPU7
[    9.409303] IRQ293 -> 0-7 CPU0
[    9.409438] ath11k_pci 0006:01:00.0: MSI vectors: 32
[    9.426507] ath11k_pci 0006:01:00.0: wcn6855 hw2.0
[    9.456885] IRQ262 -> 0-7 CPU1
[    9.467067] IRQ263 -> 0-7 CPU2
[    9.481466] IRQ264 -> 0-7 CPU3
[    9.630594] IRQ265 -> 0-7 CPU4
[    9.630629] IRQ266 -> 0-7 CPU5
[    9.630655] IRQ267 -> 0-7 CPU6
[    9.630682] IRQ268 -> 0-7 CPU7
[    9.630709] IRQ269 -> 0-7 CPU0
[    9.630735] IRQ270 -> 0-7 CPU1
[    9.630764] IRQ271 -> 0-7 CPU2
[    9.640971] IRQ276 -> 0-7 CPU7
[    9.641039] IRQ277 -> 0-7 CPU0
[    9.641088] IRQ278 -> 0-7 CPU1
[    9.641138] IRQ280 -> 0-7 CPU3
[    9.641182] IRQ281 -> 0-7 CPU4
[    9.641227] IRQ282 -> 0-7 CPU5
[    9.651400] IRQ283 -> 0-7 CPU6
[    9.651442] IRQ284 -> 0-7 CPU7
[    9.651490] IRQ285 -> 0-7 CPU0
[    9.651534] IRQ286 -> 0-7 CPU1
[    9.813900] mhi mhi1: Requested to power ON
[    9.818607] mhi mhi1: Power on setup success
[   10.017482] mhi mhi1: Wait for device to enter SBL or Mission mode
[   10.862765] ath11k_pci 0006:01:00.0: chip_id 0x2 chip_family 0xb board_id 0x8c soc_id 0x400c0200
[   10.872101] ath11k_pci 0006:01:00.0: fw_version 0x1106196e fw_build_timestamp 2024-01-12 11:30 fw_build_id WLAN.HSP.1.1-03125-QCAHSPSWPL_V1_V2_SILICONZ_LITE-3.6510.37
Marc Zyngier July 17, 2024, 6:07 p.m. UTC | #13
On Wed, 17 Jul 2024 14:38:59 +0100,
Johan Hovold <johan@kernel.org> wrote:
> 
> On Wed, Jul 17, 2024 at 01:54:40PM +0100, Marc Zyngier wrote:
> > On Wed, 17 Jul 2024 08:23:39 +0100,
> > Johan Hovold <johan@kernel.org> wrote:
> 
> > > I believe there is a kernel parameter for this (e.g.
> > > module.async_probe), but I just disable async probing for the Qualcomm
> > > PCIe driver I'm using:
> > 
> > I had tried this module parameter, but it didn't change anything on my
> > end.
> 
> > I'll have a look whether the TX1 PCIe driver uses this. It's
> > positively ancient, so I wouldn't bet that it has been touched
> > significantly in the past 5 years.
> 
> Perhaps async probing just changes the symptoms; the NVMe and wifi
> don't work in either case.

Yeah, my impression is that this changes the order in which LPIs get
allocated, but the core symptom is the same.

> 
> > > [    8.692011] Reusing ITT for devID 0
> > > [    8.693668] Reusing ITT for devID 0
> > 
> > This is really odd. It indicates that you have several devices sharing
> > the same DeviceID, which I seriously doubt is the case in a
> > laptop. Do you have any non-transparent bridge here? lspci would help.
> 
> Yeah, and these messages do not show up without the series (see log
> below). They are there in the previous synchronous log however.
> 
> 0002:00:00.0 PCI bridge: Qualcomm Technologies, Inc SC8280XP PCI Express Root Port
> 0002:01:00.0 Non-Volatile memory controller: KIOXIA Corporation NVMe SSD Controller BG4 (DRAM-less)
> 0004:00:00.0 PCI bridge: Qualcomm Technologies, Inc SC8280XP PCI Express Root Port
> 0004:01:00.0 Unassigned class [ff00]: Qualcomm Technologies, Inc SDX55 [Snapdragon X55 5G]
> 0006:00:00.0 PCI bridge: Qualcomm Technologies, Inc SC8280XP PCI Express Root Port
> 0006:01:00.0 Network controller: Qualcomm Technologies, Inc QCNFA765 Wireless Network Adapter (rev 01)

Right, this is a very straightforward setup, Design-crap-ware-style.
Nothing that would alias any device.

>
> > I'm starting to suspect that the new code doesn't carry all the
> > required bits for the DevID, and that we end up trying to allocate
> > interrupts from the pool allocated to another device, which can never
> > be a good thing, and would explain why everything dies a painful
> > death.
> > 
> > Can you run the same trace with the whole thing reverted? I think
> > we're onto something here.
> 
> See below, using normal asynchronous probing like the previous log.

And as expected, no aliasing showing up in this log. Somehow, we're
not able to distinguish between the different PCI domains anymore,
leading to all sorts of funnies.

For the record, I've added some extra debug output to the ITS driver and ran
the result on TX1, old and new kernels.

Before this series:

[   10.139806] nvme nvme0: pci function 0006:58:00.0
[   10.158599] nvme 0006:58:00.0: devid = 35800


With this series:

[   10.143729] nvme nvme0: pci function 0006:58:00.0
[   10.181775] nvme 0006:58:00.0: devid = 5800

Clearly, we've lost something in the battle. I'll keep digging.
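Assuming the printed devid values are hexadecimal (which matches bus 0x58 giving a Requester ID of 0x5800), the two values differ by exactly 0x30000. That would fit the old path applying a firmware ID translation on top of the RID, and the new one appearing to hand the raw RID to the ITS. The C sketch below models such a translation as a single msi-map-style range (output = out_base + (rid - rid_base)). The 0x30000 base is inferred only from these two log lines and is hypothetical, as are the names `id_map`, `map_rid()` and `tx1_devid()`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* PCI Requester ID: bus[15:8] | device[7:3] | function[2:0] */
static uint32_t pci_rid(unsigned int bus, unsigned int dev, unsigned int fn)
{
	assert(bus < 256 && dev < 32 && fn < 8);
	return (bus << 8) | (dev << 3) | fn;
}

/* One msi-map style range: [rid_base, rid_base + length) -> out_base + offset */
struct id_map {
	uint32_t rid_base;
	uint32_t out_base;
	uint32_t length;
};

/* Translate a RID; returns false (and leaves *out untouched) if not covered. */
static bool map_rid(const struct id_map *m, uint32_t rid, uint32_t *out)
{
	if (rid < m->rid_base || rid - m->rid_base >= m->length)
		return false;
	*out = m->out_base + (rid - m->rid_base);
	return true;
}

/* Hypothetical range, inferred only from 0x35800 - 0x5800 in the logs above. */
static const struct id_map tx1_guess = {
	.rid_base = 0x0000,
	.out_base = 0x30000,
	.length   = 0x10000,
};

/* DeviceID the ITS would see, with or without the translation applied. */
uint32_t tx1_devid(unsigned int bus, unsigned int dev, unsigned int fn, bool translate)
{
	uint32_t devid = pci_rid(bus, dev, fn);

	/* If the RID is not covered by the range, devid stays the raw RID. */
	if (translate)
		map_rid(&tx1_guess, devid, &devid);
	return devid;
}
```

With the translation, 0006:58:00.0 yields 0x35800; without it, 0x5800, matching the "before" and "after" debug lines respectively.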

	M.
Marc Zyngier July 17, 2024, 8:10 p.m. UTC | #14
On Wed, 17 Jul 2024 14:38:59 +0100,
Johan Hovold <johan@kernel.org> wrote:
> 
> On Wed, Jul 17, 2024 at 01:54:40PM +0100, Marc Zyngier wrote:
> > On Wed, 17 Jul 2024 08:23:39 +0100,
> > Johan Hovold <johan@kernel.org> wrote:
> 
> > > [    8.692011] Reusing ITT for devID 0
> > > [    8.693668] Reusing ITT for devID 0
> > 
> > This is really odd. It indicates that you have several devices sharing
> > the same DeviceID, which I seriously doubt is the case in a
> > laptop. Do you have any non-transparent bridge here? lspci would help.
> 
> Yeah, and these messages do not show up without the series (see log
> below). They are there in the previous synchronous log, however.

I think I've finally nailed the sucker, and posted a potential fix[1].

It definitely restores my TX1 to a state that is no worse than normal,
so something must be less wrong there. I'm pretty sure that the
platform-msi equivalent is equally broken, but I don't have the energy
to verify/debug that tonight.

Thomas, feel free to squash this into your series or keep it as is, as
you prefer.

	M.

[1] https://lore.kernel.org/r/20240717195937.2240400-1-maz@kernel.org
Johan Hovold July 18, 2024, 7:30 a.m. UTC | #15
On Wed, Jul 17, 2024 at 09:10:02PM +0100, Marc Zyngier wrote:

> I think I've finally nailed the sucker, and posted a potential fix[1].
> 
> It definitely restores my TX1 to a state that is no worse than normal,
> so something must be less wrong there. I'm pretty sure that the
> platform-msi equivalent is equally broken, but I don't have the energy
> to verify/debug that tonight.

> [1] https://lore.kernel.org/r/20240717195937.2240400-1-maz@kernel.org

This seems to fix the regression here too, thanks!

201:          0       ...         0  ITS-PCI-MSI-0006:00:00.0   0 Edge      PCIe PME, aerdrv
202:          0                   0  ITS-PCI-MSI-0006:01:00.0   0 Edge      bhi
203:          0                   0  ITS-PCI-MSI-0006:01:00.0   1 Edge      mhi
204:          0                   0  ITS-PCI-MSI-0006:01:00.0   2 Edge      mhi
205:          0                   0  ITS-PCI-MSI-0006:01:00.0   3 Edge      ce0
206:          0                   0  ITS-PCI-MSI-0006:01:00.0   4 Edge      ce1
207:          0                   0  ITS-PCI-MSI-0006:01:00.0   5 Edge      ce2
208:          0                   2  ITS-PCI-MSI-0006:01:00.0   6 Edge      ce3
209:          2                   0  ITS-PCI-MSI-0006:01:00.0   7 Edge      ce5
210:          0                   0  ITS-PCI-MSI-0006:01:00.0   8 Edge      ce7
211:          0                   0  ITS-PCI-MSI-0006:01:00.0   9 Edge      ce8
216:          0                   0  ITS-PCI-MSI-0006:01:00.0  14 Edge      DP_EXT_IRQ
217:          0                   0  ITS-PCI-MSI-0006:01:00.0  15 Edge      DP_EXT_IRQ
218:          0                   0  ITS-PCI-MSI-0006:01:00.0  16 Edge      DP_EXT_IRQ
220:          0                   0  ITS-PCI-MSI-0006:01:00.0  18 Edge      DP_EXT_IRQ
221:          0                   0  ITS-PCI-MSI-0006:01:00.0  19 Edge      DP_EXT_IRQ
222:          0                   0  ITS-PCI-MSI-0006:01:00.0  20 Edge      DP_EXT_IRQ
223:          0                   0  ITS-PCI-MSI-0006:01:00.0  21 Edge      DP_EXT_IRQ
224:          0                   0  ITS-PCI-MSI-0006:01:00.0  22 Edge      DP_EXT_IRQ
225:          0                   0  ITS-PCI-MSI-0006:01:00.0  23 Edge      DP_EXT_IRQ
226:          0                   0  ITS-PCI-MSI-0006:01:00.0  24 Edge      DP_EXT_IRQ
235:          0                   0  ITS-PCI-MSI-0004:00:00.0   0 Edge      PCIe PME, aerdrv
236:          0                   0  ITS-PCI-MSI-0004:01:00.0   0 Edge      bhi
237:          0                   0  ITS-PCI-MSI-0004:01:00.0   1 Edge      mhi
238:          0                   0  ITS-PCI-MSI-0004:01:00.0   2 Edge      mhi
239:          0                   0  ITS-PCI-MSI-0004:01:00.0   3 Edge      mhi
240:          0                   0  ITS-PCI-MSI-0004:01:00.0   4 Edge      mhi
242:          0                   0  ITS-PCI-MSI-0002:00:00.0   0 Edge      PCIe PME, aerdrv
243:         22                   0  ITS-PCI-MSIX-0002:01:00.0   0 Edge      nvme0q0
244:          0                   0  ITS-PCI-MSIX-0002:01:00.0   1 Edge      nvme0q1
245:          0                   0  ITS-PCI-MSIX-0002:01:00.0   2 Edge      nvme0q2
246:          0                   0  ITS-PCI-MSIX-0002:01:00.0   3 Edge      nvme0q3
247:          0                   0  ITS-PCI-MSIX-0002:01:00.0   4 Edge      nvme0q4
248:          0                   0  ITS-PCI-MSIX-0002:01:00.0   5 Edge      nvme0q5
249:          0                   0  ITS-PCI-MSIX-0002:01:00.0   6 Edge      nvme0q6
250:          0                   0  ITS-PCI-MSIX-0002:01:00.0   7 Edge      nvme0q7
251:          0                   0  ITS-PCI-MSIX-0002:01:00.0   8 Edge      nvme0q8

Johan