Message ID | 47833bde-a89a-988a-6350-6e6ec90048b4@nvidia.com |
---|---|
State | Not Applicable |
Series | [RFC] PCI/MSI: Warning observed for NVMe with ACPI |
Hi Jon,

On Fri, 10 Dec 2021 10:48:22 +0000,
Jon Hunter <jonathanh@nvidia.com> wrote:
>
> Hi all,
>
> Since Linux v5.13, we have noticed that following warning splat when
> booting Tegra (ARM64) with ACPI ...
>
> [ 2.725479] WARNING: CPU: 0 PID: 94 at include/linux/msi.h:264 free_msi_irqs+0x84/0x188
> [ 2.736137] Modules linked in:
> [ 2.736147] CPU: 0 PID: 94 Comm: kworker/u16:1 Tainted: G W 5.12.0-rc2-00008-g658376bd3e5-dirty #36
> [ 2.736160] Workqueue: nvme-reset-wq nvme_reset_work
> [ 2.746470] pstate: 60400009 (nZCv daif +PAN -UAO -TCO BTYPE=--)
> [ 2.757713] pc : free_msi_irqs+0x84/0x188
> [ 2.757726] lr : __pci_enable_msix_range+0x380/0x530
> [ 2.757735] sp : ffff800012813b00
> [ 2.757739] x29: ffff800012813b00
> [ 2.768371] x28: 00000000ffffffed
> [ 2.768382] x27: 0000000000000001 x26: 0000000000000000
> [ 2.768393] x25: ffff0000809362e8 x24: 0000000000000000
> [ 2.768407] x23: 000000000000000c x22: ffff000080936000
> [ 2.768418] x21: ffff0000809362e8 x20: ffff0000809362e8
> [ 2.775320] x19: ffff000080936000
> [ 2.785950] x18: ffffffffffffffff
> [ 2.785961] x17: 0000000000000007 x16: 0000000000000001
> [ 2.785975] x15: ffff800011bf9948
> [ 2.793997] x14: ffff8000928137e7
> [ 2.794009] x13: ffff8000128137f5 x12: ffff800011c19640
> [ 2.794023] x11: fffffffffffe5788 x10: 0000000005f5e0ff
> [ 2.794034] x9 : 00000000ffffffd0 x8 : 203a737542204f49
> [ 2.803737] x7 : 444d206465786946 x6 : ffff800011ee1fd7
> [ 2.803750] x5 : 0000000000000000 x4 : 0000000000000000
> [ 2.815286] x3 : 00000000ffffffff x2 : ffff0000809362e8
> [ 2.815300] x1 : ffff0000809362e8 x0 : 0000000000000000
> [ 2.825270] Call trace:
> [ 2.825275] free_msi_irqs+0x84/0x188
> [ 2.825288] __pci_enable_msix_range+0x380/0x530
> [ 2.825299] pci_alloc_irq_vectors_affinity+0x158/0x168
> [ 2.825309] nvme_reset_work+0x214/0x15b8
> [ 2.829340] dwc-eth-dwmac NVDA1160:00: SPH feature enabled
> [ 2.832986] process_one_work+0x1cc/0x360
> [ 2.833002] worker_thread+0x48/0x450
> [ 2.833012] kthread+0x120/0x150
> [ 2.833020] ret_from_fork+0x10/0x18
>
>
> Bisecting this I found that started to occur because with Linux v5.13,
> CONFIG_PCI_MSI_ARCH_FALLBACKS was no longer enabled by default and only
> happened to be enabled because Renesas R-Car was enabling it.
>
> When booting with ACPI, I see that when pci_msi_setup_msi_irqs() is
> called, it ends up calling arch_setup_msi_irqs() and if
> CONFIG_PCI_MSI_ARCH_FALLBACKS is not enabled, then this will call
> WARN_ON_ONCE(1).
>
> So the question is, should this be enabled by default for ARM64? I see
> a lot of other architectures enabling this when PCI_MSI is enabled. So
> I am wondering if we should be doing something like ...
>
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index 1f212b47a48a..4bbd81bab809 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -202,6 +202,7 @@ config ARM64
>          select PCI_DOMAINS_GENERIC if PCI
>          select PCI_ECAM if (ACPI && PCI)
>          select PCI_SYSCALL if PCI
> +        select PCI_MSI_ARCH_FALLBACKS if PCI_MSI
>          select POWER_RESET
>          select POWER_SUPPLY
>          select SPARSE_IRQ

+Thomas, as he's neck-deep in the MSI rework.

No, this definitely is the wrong solution.

arm64 doesn't need any arch fallback (I actually went out of my way to
kill them on this architecture), and requires the individual MSI
controller drivers to do the right thing by using MSI domains. Adding
this config option makes the warning disappear, but the core issue is
that you have a device that doesn't have a MSI domain associated with
it.

So either your device isn't MSI capable (odd), your host bridge
doesn't make the link with the MSI controller to advertise the MSI
domain (this should normally be dealt with via IORT), or there is a
bug of a similar sort somewhere else.

Getting to the root of this issue would be the right thing to do.

Thanks,

	M.
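
[Illustration, not part of the original mail.] The dispatch described above can be sketched in plain user-space C: a device with a hierarchical MSI domain gets its vectors from the domain, and anything else falls through to the arch hook, which only warns and fails when the fallbacks are compiled out. Every toy_* name below is hypothetical and the structures are stand-ins, not the kernel's. The -ENODEV return value is an assumption; the thread itself only says that the stub calls WARN_ON_ONCE(1).

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins; these are NOT the kernel's structures. */
struct toy_irq_domain {
	bool hierarchical;
};

struct toy_pci_dev {
	struct toy_irq_domain *msi_domain;	/* NULL: no MSI domain linked */
};

/* Counts how often the simulated warning path is taken. */
static int toy_warn_count;

/* Models the arch hook when the arch fallbacks are not built in. */
static int toy_arch_setup_msi_irqs(struct toy_pci_dev *dev, int nvec)
{
	(void)dev;
	(void)nvec;
	toy_warn_count++;	/* stands in for WARN_ON_ONCE(1) */
	return -ENODEV;		/* assumed error code */
}

/* Models allocation through an MSI domain; always succeeds for nvec > 0. */
static int toy_domain_alloc_irqs(struct toy_irq_domain *domain, int nvec)
{
	(void)domain;
	return nvec > 0 ? 0 : -EINVAL;
}

/* Models the choice between the domain path and the arch fallback. */
static int toy_pci_msi_setup_msi_irqs(struct toy_pci_dev *dev, int nvec)
{
	assert(dev != NULL);

	if (dev->msi_domain && dev->msi_domain->hierarchical)
		return toy_domain_alloc_irqs(dev->msi_domain, nvec);

	return toy_arch_setup_msi_irqs(dev, nvec);
}
```

In this model, selecting the fallbacks would only swap the warning stub for a real arch hook. The device would still have no MSI domain, which is the condition Marc asks to be root-caused.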
Hi Marc,

On 10/12/2021 11:39, Marc Zyngier wrote:

...

>> So the question is, should this be enabled by default for ARM64? I see
>> a lot of other architectures enabling this when PCI_MSI is enabled. So
>> I am wondering if we should be doing something like ...
>>
>> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
>> index 1f212b47a48a..4bbd81bab809 100644
>> --- a/arch/arm64/Kconfig
>> +++ b/arch/arm64/Kconfig
>> @@ -202,6 +202,7 @@ config ARM64
>>          select PCI_DOMAINS_GENERIC if PCI
>>          select PCI_ECAM if (ACPI && PCI)
>>          select PCI_SYSCALL if PCI
>> +        select PCI_MSI_ARCH_FALLBACKS if PCI_MSI
>>          select POWER_RESET
>>          select POWER_SUPPLY
>>          select SPARSE_IRQ
>
> +Thomas, as he's neck-deep in the MSI rework.
>
> No, this definitely is the wrong solution.
>
> arm64 doesn't need any arch fallback (I actually went out of my way to
> kill them on this architecture), and requires the individual MSI
> controller drivers to do the right thing by using MSI domains. Adding
> this config option makes the warning disappear, but the core issue is
> that you have a device that doesn't have a MSI domain associated with
> it.
>
> So either your device isn't MSI capable (odd), your host bridge
> doesn't make the link with the MSI controller to advertise the MSI
> domain (this should normally be dealt with via IORT), or there is a
> bug of a similar sort somewhere else.
>
> Getting to the root of this issue would be the right thing to do.

Thanks! I will chat with Sagar about this and see what we are missing.

Jon
On Fri, Dec 10 2021 at 11:39, Marc Zyngier wrote:
> On Fri, 10 Dec 2021 10:48:22 +0000,
> Jon Hunter <jonathanh@nvidia.com> wrote:
>> +        select PCI_MSI_ARCH_FALLBACKS if PCI_MSI
>>          select POWER_RESET
>>          select POWER_SUPPLY
>>          select SPARSE_IRQ
>
> +Thomas, as he's neck-deep in the MSI rework.
>
> No, this definitely is the wrong solution.

Correct.

> arm64 doesn't need any arch fallback (I actually went out of my way to
> kill them on this architecture), and requires the individual MSI
> controller drivers to do the right thing by using MSI domains. Adding
> this config option makes the warning disappear, but the core issue is
> that you have a device that doesn't have a MSI domain associated with
> it.
>
> So either your device isn't MSI capable (odd), your host bridge
> doesn't make the link with the MSI controller to advertise the MSI
> domain (this should normally be dealt with via IORT), or there is a
> bug of a similar sort somewhere else.

What's even more odd is:

>> [ 2.725479] WARNING: CPU: 0 PID: 94 at include/linux/msi.h:264 free_msi_irqs+0x84/0x188
>> [ 2.825275] free_msi_irqs+0x84/0x188
>> [ 2.825288] __pci_enable_msix_range+0x380/0x530
>> [ 2.825299] pci_alloc_irq_vectors_affinity+0x158/0x168

From __pci_enable_msix_range() there are two ways to reach free_msi_irqs():

1) pci_alloc_irq_vectors_affinity()
     __pci_enable_msix_range()
       __pci_enable_msix()
         msix_capability_init()
           msix_setup_entries()
             alloc_msi_entry()          -> allocation fail

2) pci_alloc_irq_vectors_affinity()
     __pci_enable_msix_range()
       __pci_enable_msix()
         msix_capability_init()
           pci_msi_setup_msi_irqs();    -> any failure after this succeeded

#1 is unlikely

#2 is odd because if the irqdomain of the device is not hierarchical,
then the same warning should trigger already in
pci_msi_setup_msi_irqs() via arch_setup_msi_irqs().

Strange.

Thanks,

        tglx
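
[Illustration, not part of the original mail.] The two routes into free_msi_irqs() listed above can be modelled with a small user-space C sketch that records which simulated warning sites fire. All toy_* names and fields are hypothetical. Two things in the model are assumptions rather than statements from the thread: that a failure of the setup step itself also unwinds through the free path, and the specific error codes.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy device description; NOT the kernel's struct pci_dev. */
struct toy_dev {
	bool has_hier_domain;	/* device has a hierarchical MSI domain */
	bool entry_alloc_fails;	/* route 1: entry allocation fails */
	bool late_failure;	/* route 2: something fails after setup */
};

/* Records which simulated warning sites were hit. */
struct toy_trace {
	bool warned_in_setup;
	bool warned_in_teardown;
	bool freed;
};

/* Setup: the domain path succeeds, otherwise the arch stub warns. */
static int toy_setup_irqs(const struct toy_dev *dev, struct toy_trace *t)
{
	if (dev->has_hier_domain)
		return 0;
	t->warned_in_setup = true;	/* stands in for the setup stub */
	return -ENODEV;			/* assumed error code */
}

/* Teardown: without a hierarchical domain, the teardown stub warns. */
static void toy_free_msi_irqs(const struct toy_dev *dev, struct toy_trace *t)
{
	if (!dev->has_hier_domain)
		t->warned_in_teardown = true;
	t->freed = true;
}

/* Models the error unwinding of the capability init step. */
static int toy_msix_capability_init(const struct toy_dev *dev,
				    struct toy_trace *t)
{
	int ret;

	assert(dev != NULL && t != NULL);

	if (dev->entry_alloc_fails) {	/* route 1 */
		ret = -ENOMEM;
		goto out_free;
	}

	ret = toy_setup_irqs(dev, t);	/* assumed to unwind on failure too */
	if (ret)
		goto out_free;

	if (dev->late_failure) {	/* route 2 */
		ret = -EIO;		/* arbitrary error for the model */
		goto out_free;
	}

	return 0;

out_free:
	toy_free_msi_irqs(dev, t);
	return ret;
}
```

In the model, a device without a hierarchical domain cannot reach the teardown warning through the setup step without tripping the setup warning first, which is the oddity tglx points out about the single reported splat.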
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 1f212b47a48a..4bbd81bab809 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -202,6 +202,7 @@ config ARM64
 	select PCI_DOMAINS_GENERIC if PCI
 	select PCI_ECAM if (ACPI && PCI)
 	select PCI_SYSCALL if PCI
+	select PCI_MSI_ARCH_FALLBACKS if PCI_MSI
 	select POWER_RESET
 	select POWER_SUPPLY
 	select SPARSE_IRQ