PCI: Fix devres regression in pci_intx()

Message ID: 20240725120729.59788-2-pstanner@redhat.com
State: New
Series: PCI: Fix devres regression in pci_intx()

Commit Message

Philipp Stanner July 25, 2024, 12:07 p.m. UTC
pci_intx() is a function that becomes managed if pcim_enable_device()
has been called in advance. Commit 25216afc9db5 ("PCI: Add managed
pcim_intx()") changed this behavior so that pci_intx() always leads to
creation of a separate device resource for itself, whereas earlier, a
shared resource was used for all PCI devres operations.

Unfortunately, pci_intx() seems to be used in some drivers' remove()
paths; in the managed case this causes a device resource to be created
on driver detach.

Fix the regression by only redirecting pci_intx() to its managed twin
pcim_intx() if the pci_command changes.

Fixes: 25216afc9db5 ("PCI: Add managed pcim_intx()")
Reported-by: Damien Le Moal <dlemoal@kernel.org>
Closes: https://lore.kernel.org/all/b8f4ba97-84fc-4b7e-ba1a-99de2d9f0118@kernel.org/
Signed-off-by: Philipp Stanner <pstanner@redhat.com>
---
Alright, I reproduced this with QEMU as Damien described and this here
fixes the issue on my side. Feedback welcome. Thank you very much,
Damien.

It seems that this might yet again be the issue of drivers not being
aware that pci_intx() might become managed, so they use it in their
unwind path (rightfully so; there probably was no alternative back
then).

That will make the long term cleanup difficult. But I think this for now
is the most elegant possible workaround.
---
 drivers/pci/pci.c | 15 ++++++++-------
 1 file changed, 8 insertions(+), 7 deletions(-)

Comments

Christoph Hellwig July 25, 2024, 2:26 p.m. UTC | #1
Can we please fix this to never silently redirect to a managed version
and add a proper pcim_intx instead, and use that where people actually
want to use the crazy devres voodoo?  Doing this silently
underneath will always create problems.
Philipp Stanner July 25, 2024, 3:21 p.m. UTC | #2
On Thu, 2024-07-25 at 07:26 -0700, Christoph Hellwig wrote:
> Can we please fix this to never silently redirect to a managed
> version

It is neither this fix nor the recent changes to PCI devres (which the
fix corrects) that are doing that. pci_intx() has been "silently
redirect[ing] to a managed version" for 15 years.

The changes merged into v6.11 attempted to keep this behavior exactly
identical as a preparation for later cleanups. The fix here only
corrects the position of the redirection to where the "crazy devres
voodoo" had always been:

void pci_intx(struct pci_dev *pdev, int enable)
{
	u16 pci_command, new;

	pci_read_config_word(pdev, PCI_COMMAND, &pci_command);

	if (enable)
		new = pci_command & ~PCI_COMMAND_INTX_DISABLE;
	else
		new = pci_command | PCI_COMMAND_INTX_DISABLE;

	if (new != pci_command) {
		struct pci_devres *dr;

		pci_write_config_word(pdev, PCI_COMMAND, new);

		/* voodoo_begin */
		dr = find_pci_dr(pdev);
		if (dr && !dr->restore_intx) {
			dr->restore_intx = 1;
			dr->orig_intx = !enable;
		}
		/* voodoo_end */
	}
}
EXPORT_SYMBOL_GPL(pci_intx);

> and add a proper pcim_intx instead 

That has already been done. pcim_intx() sits in drivers/pci/devres.c

> and use that where people actually
> want to use the crazy devres voodoo?  Doing this silently
> underneath will always create problems.

That's precisely what all my work is about. The hybrid nature of
pci_intx(), pci_set_mwi() and all pci*request*() functions needs to be
removed.

However, that will take us some while, because the APIs are partly
ossified and every user that relies on implicit crazy devres voodoo
has to be identified and then ported to *explicit* half-crazy devres
voodoo.

More details here:
https://lore.kernel.org/all/20240613115032.29098-1-pstanner@redhat.com/

P.

> 
>
Christoph Hellwig July 25, 2024, 3:47 p.m. UTC | #3
> That's precisely what all my work is about. The hybrid nature of
> pci_intx(), pci_set_mwi() and all pci*request*() functions needs to be
> removed.

Ok.  Thanks for doing the work.
Bjorn Helgaas July 25, 2024, 9 p.m. UTC | #4
[+cc Christoph]

On Thu, Jul 25, 2024 at 02:07:30PM +0200, Philipp Stanner wrote:
> pci_intx() is a function that becomes managed if pcim_enable_device()
> has been called in advance. Commit 25216afc9db5 ("PCI: Add managed
> pcim_intx()") changed this behavior so that pci_intx() always leads to
> creation of a separate device resource for itself, whereas earlier, a
> shared resource was used for all PCI devres operations.
> 
> Unfortunately, pci_intx() seems to be used in some drivers' remove()
> paths; in the managed case this causes a device resource to be created
> on driver detach.
> 
> Fix the regression by only redirecting pci_intx() to its managed twin
> pcim_intx() if the pci_command changes.
> 
> Fixes: 25216afc9db5 ("PCI: Add managed pcim_intx()")
> Reported-by: Damien Le Moal <dlemoal@kernel.org>
> Closes: https://lore.kernel.org/all/b8f4ba97-84fc-4b7e-ba1a-99de2d9f0118@kernel.org/
> Signed-off-by: Philipp Stanner <pstanner@redhat.com>

Applied to for-linus for v6.11, thanks!

> ---
> Alright, I reproduced this with QEMU as Damien described and this here
> fixes the issue on my side. Feedback welcome. Thank you very much,
> Damien.
> 
> It seems that this might yet again be the issue of drivers not being
> aware that pci_intx() might become managed, so they use it in their
> unwind path (rightfully so; there probably was no alternative back
> then).
> 
> That will make the long term cleanup difficult. But I think this for now
> is the most elegant possible workaround.
> ---
>  drivers/pci/pci.c | 15 ++++++++-------
>  1 file changed, 8 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index e3a49f66982d..ffaaca0978cb 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -4477,12 +4477,6 @@ void pci_intx(struct pci_dev *pdev, int enable)
>  {
>  	u16 pci_command, new;
>  
> -	/* Preserve the "hybrid" behavior for backwards compatibility */
> -	if (pci_is_managed(pdev)) {
> -		WARN_ON_ONCE(pcim_intx(pdev, enable) != 0);
> -		return;
> -	}
> -
>  	pci_read_config_word(pdev, PCI_COMMAND, &pci_command);
>  
>  	if (enable)
> @@ -4490,8 +4484,15 @@ void pci_intx(struct pci_dev *pdev, int enable)
>  	else
>  		new = pci_command | PCI_COMMAND_INTX_DISABLE;
>  
> -	if (new != pci_command)
> +	if (new != pci_command) {
> +		/* Preserve the "hybrid" behavior for backwards compatibility */
> +		if (pci_is_managed(pdev)) {
> +			WARN_ON_ONCE(pcim_intx(pdev, enable) != 0);
> +			return;
> +		}
> +
>  		pci_write_config_word(pdev, PCI_COMMAND, new);
> +	}
>  }
>  EXPORT_SYMBOL_GPL(pci_intx);
>  
> -- 
> 2.45.2
>
Damien Le Moal July 26, 2024, 12:19 a.m. UTC | #5
On 7/25/24 21:07, Philipp Stanner wrote:
> pci_intx() is a function that becomes managed if pcim_enable_device()
> has been called in advance. Commit 25216afc9db5 ("PCI: Add managed
> pcim_intx()") changed this behavior so that pci_intx() always leads to
> creation of a separate device resource for itself, whereas earlier, a
> shared resource was used for all PCI devres operations.
> 
> Unfortunately, pci_intx() seems to be used in some drivers' remove()
> paths; in the managed case this causes a device resource to be created
> on driver detach.
> 
> Fix the regression by only redirecting pci_intx() to its managed twin
> pcim_intx() if the pci_command changes.
> 
> Fixes: 25216afc9db5 ("PCI: Add managed pcim_intx()")
> Reported-by: Damien Le Moal <dlemoal@kernel.org>
> Closes: https://lore.kernel.org/all/b8f4ba97-84fc-4b7e-ba1a-99de2d9f0118@kernel.org/
> Signed-off-by: Philipp Stanner <pstanner@redhat.com>
> ---
> Alright, I reproduced this with QEMU as Damien described and this here
> fixes the issue on my side. Feedback welcome. Thank you very much,
> Damien.

This works and is cleaner than what I had :)

Tested-by: Damien Le Moal <dlemoal@kernel.org>

> It seems that this might yet again be the issue of drivers not being
> aware that pci_intx() might become managed, so they use it in their
> unwind path (rightfully so; there probably was no alternative back
> then).

At least for the ahci driver, with which I found the issue, what is odd is that
there is only a single call to pci_intx(), to *enable* intx, and that call is in
the probe path. With QEMU, this call is not even made, as the QEMU AHCI supports MSI.

Adding a WARN_ON(!enable) at the beginning of pci_intx(), we can see that what
happens is that during device probe, we get this backtrace:

[   34.658988] WARNING: CPU: 0 PID: 999 at drivers/pci/pci.c:4480 pci_intx+0x7f/0xc0
[   34.660799] Modules linked in: ahci(+) rpcsec_gss_krb5 auth_rpcgss nfsv4
dns_resolver nfs lockd grace netfs]
[   34.673784] CPU: 0 UID: 0 PID: 999 Comm: modprobe Tainted: G        W
 6.10.0+ #1948
[   34.675961] Tainted: [W]=WARN
[   34.676891] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS
1.16.3-2.fc40 04/01/2014
[   34.679197] RIP: 0010:pci_intx+0x7f/0xc0
[   34.680348] Code: b7 d2 be 04 00 00 00 48 89 df e8 0c 84 ff ff 48 8b 44 24 08
65 48 2b 04 25 28 00 00 00 756
[   34.685015] RSP: 0018:ffffb60f40e2f7f0 EFLAGS: 00010246
[   34.686436] RAX: 0000000000000000 RBX: ffff9dbb81237000 RCX: ffffb60f40e2f64c
[   34.688294] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff9dbb81237000
[   34.690120] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000001
[   34.691986] R10: ffff9dbb88883538 R11: 0000000000000001 R12: 0000000000000001
[   34.693687] R13: ffff9dbb812370c8 R14: ffff9dbb86eeaa00 R15: 0000000000000000
[   34.695140] FS:  00007f9d81465740(0000) GS:ffff9dbcf7c00000(0000)
knlGS:0000000000000000
[   34.696884] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   34.698107] CR2: 00007ffc786ed8b8 CR3: 00000001088da000 CR4: 0000000000350ef0
[   34.699562] Call Trace:
[   34.700215]  <TASK>
[   34.700802]  ? pci_intx+0x7f/0xc0
[   34.701607]  ? __warn.cold+0xa5/0x13c
[   34.702448]  ? pci_intx+0x7f/0xc0
[   34.703257]  ? report_bug+0xff/0x140
[   34.704105]  ? handle_bug+0x3a/0x70
[   34.704938]  ? exc_invalid_op+0x17/0x70
[   34.705826]  ? asm_exc_invalid_op+0x1a/0x20
[   34.706593]  ? pci_intx+0x7f/0xc0
[   34.707086]  msi_capability_init+0x35a/0x370
[   34.707723]  __pci_enable_msi_range+0x187/0x240
[   34.708356]  pci_alloc_irq_vectors_affinity+0xc4/0x110
[   34.709058]  ahci_init_one+0x6ec/0xcc0 [ahci]
[   34.709692]  ? __pm_runtime_resume+0x58/0x90
[   34.710311]  local_pci_probe+0x45/0x90
[   34.710865]  pci_device_probe+0xbb/0x230
[   34.711433]  really_probe+0xcc/0x350
[   34.711976]  ? pm_runtime_barrier+0x54/0x90
[   34.712569]  ? __pfx___driver_attach+0x10/0x10
[   34.713206]  __driver_probe_device+0x78/0x110
[   34.713837]  driver_probe_device+0x1f/0xa0
[   34.714427]  __driver_attach+0xbe/0x1d0
[   34.715001]  bus_for_each_dev+0x92/0xe0
[   34.715563]  bus_add_driver+0x115/0x200
[   34.716136]  driver_register+0x72/0xd0
[   34.716704]  ? __pfx_ahci_pci_driver_init+0x10/0x10 [ahci]
[   34.717457]  do_one_initcall+0x76/0x3a0
[   34.718036]  do_init_module+0x60/0x210
[   34.718616]  init_module_from_file+0x86/0xc0
[   34.719243]  idempotent_init_module+0x127/0x2c0
[   34.719913]  __x64_sys_finit_module+0x5e/0xb0
[   34.720546]  do_syscall_64+0x7d/0x160
[   34.721100]  ? srso_return_thunk+0x5/0x5f
[   34.721695]  ? do_syscall_64+0x89/0x160
[   34.722258]  ? srso_return_thunk+0x5/0x5f
[   34.722846]  ? do_sys_openat2+0x9c/0xe0
[   34.723421]  ? srso_return_thunk+0x5/0x5f
[   34.724012]  ? syscall_exit_to_user_mode+0x64/0x1f0
[   34.724703]  ? srso_return_thunk+0x5/0x5f
[   34.725293]  ? do_syscall_64+0x89/0x160
[   34.725883]  ? srso_return_thunk+0x5/0x5f
[   34.726467]  ? syscall_exit_to_user_mode+0x64/0x1f0
[   34.727159]  ? srso_return_thunk+0x5/0x5f
[   34.727764]  ? do_syscall_64+0x89/0x160
[   34.728341]  ? srso_return_thunk+0x5/0x5f
[   34.728937]  ? exc_page_fault+0x6c/0x200
[   34.729511]  ? srso_return_thunk+0x5/0x5f
[   34.730109]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[   34.730837] RIP: 0033:0x7f9d80d281dd
[   34.731375] Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8
48 89 f7 48 89 d6 48 89 ca 4d8
[   34.733796] RSP: 002b:00007ffc786f0898 EFLAGS: 00000246 ORIG_RAX:
0000000000000139
[   34.734894] RAX: ffffffffffffffda RBX: 00005617347f09a0 RCX: 00007f9d80d281dd
[   34.735850] RDX: 0000000000000000 RSI: 0000561715fe5e79 RDI: 0000000000000003
[   34.736812] RBP: 00007ffc786f0950 R08: 00007f9d80df6b20 R09: 00007ffc786f08e0
[   34.737769] R10: 00005617347f13e0 R11: 0000000000000246 R12: 0000561715fe5e79
[   34.738725] R13: 0000000000040000 R14: 00005617347f8990 R15: 00005617347f8b20
[   34.739695]  </TASK>
[   34.740075] ---[ end trace 0000000000000000 ]---

So it is msi_capability_init() that is the problem: that function calls
pci_intx_for_msi(dev, 0) which then calls pci_intx(dev, 0), thus creating the
intx devres for the device despite the driver code not touching intx at all.
The driver is fine! It is MSI touching INTX that is messing things up.

That said, I do not see that as an issue in itself. What I fail to understand
though is why that intx devres is not deleted on device teardown. I think this
may have something to do with the fact that pcim_intx() always does
"res->orig_intx = !enable;", that is, it assumes that if there is a call to
pcim_intx(dev, 0), then it is because intx was enabled already, which I do not
think is true for most drivers... So we end up with INTX being wrongly enabled on
device teardown by pcim_intx_restore(), and because of that, the intx resource
is not deleted?

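For reference, a condensed sketch of the pcim_intx() logic under discussion
(paraphrased from drivers/pci/devres.c as of 25216afc9db5; helper names and
details may differ slightly):

struct pcim_intx_devres {
	int orig_intx;
};

static void pcim_intx_restore(struct device *dev, void *data)
{
	struct pci_dev *pdev = to_pci_dev(dev);
	struct pcim_intx_devres *res = data;

	/* Runs on driver detach and restores the recorded state */
	__pcim_intx(pdev, res->orig_intx);
}

int pcim_intx(struct pci_dev *pdev, int enable)
{
	struct pcim_intx_devres *res;

	res = get_or_create_intx_devres(&pdev->dev);
	if (!res)
		return -ENOMEM;

	/* Derived from the requested state only, never read from PCI_COMMAND */
	res->orig_intx = !enable;
	__pcim_intx(pdev, enable);	/* toggles PCI_COMMAND_INTX_DISABLE */

	return 0;
}
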
Re-enabling intx on teardown is wrong I think, but that still does not explain
why that resource is not deleted. I fail to see why.
Philipp Stanner July 26, 2024, 6:43 p.m. UTC | #6
On Fri, 2024-07-26 at 09:19 +0900, Damien Le Moal wrote:
> On 7/25/24 21:07, Philipp Stanner wrote:
> > pci_intx() is a function that becomes managed if
> > pcim_enable_device()
> > has been called in advance. Commit 25216afc9db5 ("PCI: Add managed
> > pcim_intx()") changed this behavior so that pci_intx() always leads
> > to
> > creation of a separate device resource for itself, whereas earlier,
> > a
> > shared resource was used for all PCI devres operations.
> > 
> > Unfortunately, pci_intx() seems to be used in some drivers'
> > remove()
> > paths; in the managed case this causes a device resource to be
> > created
> > on driver detach.
> > 
> > Fix the regression by only redirecting pci_intx() to its managed
> > twin
> > pcim_intx() if the pci_command changes.
> > 
> > Fixes: 25216afc9db5 ("PCI: Add managed pcim_intx()")
> > Reported-by: Damien Le Moal <dlemoal@kernel.org>
> > Closes:
> > https://lore.kernel.org/all/b8f4ba97-84fc-4b7e-ba1a-99de2d9f0118@kernel.org/
> > Signed-off-by: Philipp Stanner <pstanner@redhat.com>
> > ---
> > Alright, I reproduced this with QEMU as Damien described and this
> > here
> > fixes the issue on my side. Feedback welcome. Thank you very much,
> > Damien.
> 
> This works and is cleaner than what I had :)

The fundamental idea is mostly identical to yours – you likely just
didn't see it because your attention was directed towards the code in
devres.c ;)

> 
> Tested-by: Damien Le Moal <dlemoal@kernel.org>

You might wanna ping Bjorn about that in case he didn't see.

> 
> > It seems that this might yet again be the issue of drivers not
> > being
> > aware that pci_intx() might become managed, so they use it in their
> > unwind path (rightfully so; there probably was no alternative back
> > then).
> 
> At least for the ahci driver, with which I found the issue, what is odd
> is that
> there is only a single call to pci_intx() 

There is only a single _directly visible_ call :]

> to *enable* intx, and that call is in
> the probe path. With QEMU, this call is not even made, as the QEMU
> AHCI supports MSI.

Hmm, MSI...

> 
> Adding a WARN_ON(!enable) at the beginning of pci_intx(), we can see
> that what
> happens is that during device probe, we get this backtrace:
> 
> [   34.658988] WARNING: CPU: 0 PID: 999 at drivers/pci/pci.c:4480
> pci_intx+0x7f/0xc0
> [   34.660799] Modules linked in: ahci(+) rpcsec_gss_krb5 auth_rpcgss
> nfsv4
> dns_resolver nfs lockd grace netfs]
> [   34.673784] CPU: 0 UID: 0 PID: 999 Comm: modprobe Tainted:
> G        W
>  6.10.0+ #1948
> [   34.675961] Tainted: [W]=WARN
> [   34.676891] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009),
> BIOS
> 1.16.3-2.fc40 04/01/2014
> [   34.679197] RIP: 0010:pci_intx+0x7f/0xc0
> [   34.680348] Code: b7 d2 be 04 00 00 00 48 89 df e8 0c 84 ff ff 48
> 8b 44 24 08
> 65 48 2b 04 25 28 00 00 00 756
> [   34.685015] RSP: 0018:ffffb60f40e2f7f0 EFLAGS: 00010246
> [   34.686436] RAX: 0000000000000000 RBX: ffff9dbb81237000 RCX:
> ffffb60f40e2f64c
> [   34.688294] RDX: 0000000000000000 RSI: 0000000000000000 RDI:
> ffff9dbb81237000
> [   34.690120] RBP: 0000000000000000 R08: 0000000000000000 R09:
> 0000000000000001
> [   34.691986] R10: ffff9dbb88883538 R11: 0000000000000001 R12:
> 0000000000000001
> [   34.693687] R13: ffff9dbb812370c8 R14: ffff9dbb86eeaa00 R15:
> 0000000000000000
> [   34.695140] FS:  00007f9d81465740(0000) GS:ffff9dbcf7c00000(0000)
> knlGS:0000000000000000
> [   34.696884] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   34.698107] CR2: 00007ffc786ed8b8 CR3: 00000001088da000 CR4:
> 0000000000350ef0
> [   34.699562] Call Trace:
> [   34.700215]  <TASK>
> [   34.700802]  ? pci_intx+0x7f/0xc0
> [   34.701607]  ? __warn.cold+0xa5/0x13c
> [   34.702448]  ? pci_intx+0x7f/0xc0
> [   34.703257]  ? report_bug+0xff/0x140
> [   34.704105]  ? handle_bug+0x3a/0x70
> [   34.704938]  ? exc_invalid_op+0x17/0x70
> [   34.705826]  ? asm_exc_invalid_op+0x1a/0x20
> [   34.706593]  ? pci_intx+0x7f/0xc0
> [   34.707086]  msi_capability_init+0x35a/0x370
> [   34.707723]  __pci_enable_msi_range+0x187/0x240
> [   34.708356]  pci_alloc_irq_vectors_affinity+0xc4/0x110
> [   34.709058]  ahci_init_one+0x6ec/0xcc0 [ahci]
> [   34.709692]  ? __pm_runtime_resume+0x58/0x90
> [   34.710311]  local_pci_probe+0x45/0x90
> [   34.710865]  pci_device_probe+0xbb/0x230
> [   34.711433]  really_probe+0xcc/0x350
> [   34.711976]  ? pm_runtime_barrier+0x54/0x90
> [   34.712569]  ? __pfx___driver_attach+0x10/0x10
> [   34.713206]  __driver_probe_device+0x78/0x110
> [   34.713837]  driver_probe_device+0x1f/0xa0
> [   34.714427]  __driver_attach+0xbe/0x1d0
> [   34.715001]  bus_for_each_dev+0x92/0xe0
> [   34.715563]  bus_add_driver+0x115/0x200
> [   34.716136]  driver_register+0x72/0xd0
> [   34.716704]  ? __pfx_ahci_pci_driver_init+0x10/0x10 [ahci]
> [   34.717457]  do_one_initcall+0x76/0x3a0
> [   34.718036]  do_init_module+0x60/0x210
> [   34.718616]  init_module_from_file+0x86/0xc0
> [   34.719243]  idempotent_init_module+0x127/0x2c0
> [   34.719913]  __x64_sys_finit_module+0x5e/0xb0
> [   34.720546]  do_syscall_64+0x7d/0x160
> [   34.721100]  ? srso_return_thunk+0x5/0x5f
> [   34.721695]  ? do_syscall_64+0x89/0x160
> [   34.722258]  ? srso_return_thunk+0x5/0x5f
> [   34.722846]  ? do_sys_openat2+0x9c/0xe0
> [   34.723421]  ? srso_return_thunk+0x5/0x5f
> [   34.724012]  ? syscall_exit_to_user_mode+0x64/0x1f0
> [   34.724703]  ? srso_return_thunk+0x5/0x5f
> [   34.725293]  ? do_syscall_64+0x89/0x160
> [   34.725883]  ? srso_return_thunk+0x5/0x5f
> [   34.726467]  ? syscall_exit_to_user_mode+0x64/0x1f0
> [   34.727159]  ? srso_return_thunk+0x5/0x5f
> [   34.727764]  ? do_syscall_64+0x89/0x160
> [   34.728341]  ? srso_return_thunk+0x5/0x5f
> [   34.728937]  ? exc_page_fault+0x6c/0x200
> [   34.729511]  ? srso_return_thunk+0x5/0x5f
> [   34.730109]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
> [   34.730837] RIP: 0033:0x7f9d80d281dd
> [   34.731375] Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e
> fa 48 89 f8
> 48 89 f7 48 89 d6 48 89 ca 4d8
> [   34.733796] RSP: 002b:00007ffc786f0898 EFLAGS: 00000246 ORIG_RAX:
> 0000000000000139
> [   34.734894] RAX: ffffffffffffffda RBX: 00005617347f09a0 RCX:
> 00007f9d80d281dd
> [   34.735850] RDX: 0000000000000000 RSI: 0000561715fe5e79 RDI:
> 0000000000000003
> [   34.736812] RBP: 00007ffc786f0950 R08: 00007f9d80df6b20 R09:
> 00007ffc786f08e0
> [   34.737769] R10: 00005617347f13e0 R11: 0000000000000246 R12:
> 0000561715fe5e79
> [   34.738725] R13: 0000000000040000 R14: 00005617347f8990 R15:
> 00005617347f8b20
> [   34.739695]  </TASK>
> [   34.740075] ---[ end trace 0000000000000000 ]---
> 
> So it is msi_capability_init() that is the problem: that function
> calls
> pci_intx_for_msi(dev, 0) which then calls pci_intx(dev, 0), thus
> creating the
> intx devres for the device despite the driver code not touching intx
> at all.

Nope, that is not the problem – as you correctly point out below, that
device resource should be destroyed again.

> The driver is fine! It is MSI touching INTX that is messing things
> up.

Yes, many drivers are more or less innocent with regard to all of
that. As Christoph rightfully pointed out, an API should never behave
like that and do _implicit_ magic behind your back. That's the entire
philosophy of the C Language.

> 
> That said, I do not see that as an issue in itself. What I fail to
> understand
> though is why that intx devres is not deleted on device teardown. I
> think this
> may have something to do with the fact that pcim_intx() always does
> "res->orig_intx = !enable;", that is, it assumes that if there is a
> call to
> pcim_intx(dev, 0), then it is because intx was enabled already,
> which I do not
> think is true for most drivers... So we end up with INTX being wrongly
> enabled on
> device teardown by pcim_intx_restore(), and because of that, the intx
> resource
> is not deleted ?

Spoiler: The device resources that have initially been created do get
deleted. Devres works as intended. The issue is that the forces of evil
invoke pci_intx() through another path, hidden behind an API, through
another devres callback.

So the device resource never gets deleted because it is *created* on
driver detach, when devres already ran.

> 
> Re-enabling intx on teardown is wrong I think, but that still does
> not explain
> why that resource is not deleted. I fail to see why.

You came very close to the truth ;)

With some help from my favorite coworker I did some tracing today and
found this when doing `rmmod ahci`:

=> pci_intx
=> pci_msi_shutdown
=> pci_disable_msi
=> devres_release_all
=> device_unbind_cleanup
=> device_release_driver_internal
=> driver_detach
=> bus_remove_driver
=> pci_unregister_driver
=> __do_sys_delete_module
=> do_syscall_64
=> entry_SYSCALL_64_after_hwframe

The SYSCALL is my `rmmod`.

As you can see, pci_intx() is invoked indirectly through
pci_disable_msi() – which gets invoked by devres, which is precisely
one reason why you could not find the suspicious pci_intx() call in the
ahci code base.

Now the question is: Who set up that devres callback which indirectly
calls pci_intx()?

It is indeed MSI, here in msi/msi.c:

static void pcim_msi_release(void *pcidev)
{
	struct pci_dev *dev = pcidev;

	dev->is_msi_managed = false;
	pci_free_irq_vectors(dev); // <-- calls pci_disable_msi(), which calls pci_intx(), which re-registers yet another devres callback
}

/*
 * Needs to be separate from pcim_release to prevent an ordering problem

==> Oh, here they even had a warning about that interacting with devres somehow...

 * vs. msi_device_data_release() in the MSI core code.
 */
static int pcim_setup_msi_release(struct pci_dev *dev)
{
	int ret;

	if (!pci_is_managed(dev) || dev->is_msi_managed)
		return 0;

	ret = devm_add_action(&dev->dev, pcim_msi_release, dev);
	if (ret)
		return ret;

	dev->is_msi_managed = true;
	return 0;
}

I don't know enough about AHCI to see where exactly it jumps into
these, but a candidate would be:
 * pci_enable_msi(), called among others in acard-ahci.c

Another path is:
   1. ahci_init_one() calls
   2. ahci_init_msi() calls
   3. pci_alloc_irq_vectors() calls
   4. pci_alloc_irq_vectors_affinity() calls
   5. __pci_enable_msi_range() OR __pci_enable_msix_range() call
   6. pci_setup_msi_context() calls
   7. pcim_setup_msi_release() which registers the callback to
      pci_intx()


Ha!

I think I earned myself a Friday evening beer now 8-)


Now the interesting question will be how the heck we're supposed to
clean that up.

Another interesting question is: Did that only work by coincidence
during the last 15 years, or is it by design that the check in
pci_intx():

if (new != pci_command)

only evaluates to true if we are not in a detach path.

If it were coincidence, it would not have caused faults as it did now
with my recent work, because the old version did not allocate in
pci_intx().

But it could certainly have been racy and might run into a UAF since
the old pci_intx() would have worked on memory that is also managed by
devres, but has been registered at a different place. I guess that is
what that comment in the MSI code quoted above is hinting at.


Christoph indeed rightfully called it voodoo ^^


Cheers,
P.
Bjorn Helgaas July 26, 2024, 6:59 p.m. UTC | #7
On Fri, Jul 26, 2024 at 08:43:12PM +0200, pstanner@redhat.com wrote:
> On Fri, 2024-07-26 at 09:19 +0900, Damien Le Moal wrote:

> > Tested-by: Damien Le Moal <dlemoal@kernel.org>
> 
> You might wanna ping Bjorn about that in case he didn't see.

I amended the commit to add this, thanks Damien!
Damien Le Moal July 29, 2024, 11:29 a.m. UTC | #8
On 7/27/24 03:43, pstanner@redhat.com wrote:
>> That said, I do not see that as an issue in itself. What I fail to
>> understand
>> though is why that intx devres is not deleted on device teardown. I
>> think this
>> may have something to do with the fact that pcim_intx() always does
>> "res->orig_intx = !enable;", that is, it assumes that if there is a
>> call to
>> pcim_intx(dev, 0), then it is because intx was enabled already,
>> which I do not
>> think is true for most drivers... So we end up with INTX being wrongly
>> enabled on
>> device teardown by pcim_intx_restore(), and because of that, the intx
>> resource
>> is not deleted ?
> 
> Spoiler: The device resources that have initially been created do get
> deleted. Devres works as intended. The issue is that the forces of evil
> invoke pci_intx() through another path, hidden behind an API, through
> another devres callback.
> 
> So the device resource never gets deleted because it is *created* on
> driver detach, when devres already ran.

That explains the issue :)

>> Re-enabling intx on teardown is wrong I think, but that still does
>> not explain
>> why that resource is not deleted. I fail to see why.
> 
> You came very close to the truth ;)
> 
> With some help from my favorite coworker I did some tracing today and
> found this when doing `rmmod ahci`:
> 
> => pci_intx
> => pci_msi_shutdown
> => pci_disable_msi
> => devres_release_all
> => device_unbind_cleanup
> => device_release_driver_internal
> => driver_detach
> => bus_remove_driver
> => pci_unregister_driver
> => __do_sys_delete_module
> => do_syscall_64
> => entry_SYSCALL_64_after_hwframe
> 
> The SYSCALL is my `rmmod`.
> 
> As you can see, pci_intx() is invoked indirectly through
> pci_disable_msi() – which gets invoked by devres, which is precisely
> one reason why you could not find the suspicious pci_intx() call in the
> ahci code base.
> 
> Now the question is: Who set up that devres callback which indirectly
> calls pci_intx()?
> 
> It is indeed MSI, here in msi/msi.c:
> 
> static void pcim_msi_release(void *pcidev)
> {
>  struct pci_dev *dev = pcidev;
> 
>  dev->is_msi_managed = false;
>  pci_free_irq_vectors(dev); // <-- calls pci_disable_msi(), which calls pci_intx(), which re-registers yet another devres callback
> }
> 
> /*
>  * Needs to be separate from pcim_release to prevent an ordering problem
> 
> ==> Oh, here they even had a warning about that interacting with devres somehow...
> 
>  * vs. msi_device_data_release() in the MSI core code.
>  */
> static int pcim_setup_msi_release(struct pci_dev *dev)
> {
>  int ret;
> 
>  if (!pci_is_managed(dev) || dev->is_msi_managed)
>  return 0;
> 
>  ret = devm_add_action(&dev->dev, pcim_msi_release, dev);
>  if (ret)
>  return ret;
> 
>  dev->is_msi_managed = true;
>  return 0;
> }
> 
> I don't know enough about AHCI to see where exactly it jumps into
> these, but a candidate would be:
>  * pci_enable_msi(), called among others in acard-ahci.c
> 
> Another path is:
>    1. ahci_init_one() calls
>    2. ahci_init_msi() calls
>    3. pci_alloc_irq_vectors() calls
>    4. pci_alloc_irq_vectors_affinity() calls
>    5. __pci_enable_msi_range() OR __pci_enable_msix_range() call
>    6. pci_setup_msi_context() calls
>    7. pcim_setup_msi_release() which registers the callback to
>       pci_intx()

ahci_init_one() is the function used by the default AHCI driver (ahci.ko), so
this path is correct.

> Ha!
> 
> I think I earned myself a Friday evening beer now 8-)

I was way ahead of you :)

> Now the interesting question will be how the heck we're supposed to
> clean that up.

If pcim_intx() always gets called on device release, AND MSI/MSIX are also
managed interrupts with a devres action on release, I would say that
pcim_msi_release() should NOT lead to a path calling pci_intx(dev, 0), as that
would create the intx devres again. But given the comment you point out above,
it seems that there are some ordering constraints between disabling msi and
intx? If that is the case, then maybe this will indeed be tricky.

> Another interesting question is: Did that only work by coincidence
> during the last 15 years, or is it by design that the check in
> pci_intx():
> 
> if (new != pci_command)
> 
> only evaluates to true if we are not in a detach path.

I would say that it evaluates to true for any device using intx, which tends to
be rare these days. In that case, enabling on device attach and disabling
on detach would lead to new != pci_command, including in the hidden intx disable
path triggered by disabling msi. But the devres being recreated on detach
definitely seems to depend on the disabling order, though. So we may have been
lucky indeed.

> 
> If it were coincidence, it would not have caused faults as it did now
> with my recent work, because the old version did not allocate in
> pci_intx().
> 
> But it could certainly have been racy and might run into a UAF since
> the old pci_intx() would have worked on memory that is also managed by
> devres, but has been registered at a different place. I guess that is
> what that comment in the MSI code quoted above is hinting at.
> 
> 
> Christoph indeed rightfully called it voodoo ^^
> 
> 
> Cheers,
> P.
> 
>
Philipp Stanner July 29, 2024, 3:45 p.m. UTC | #9
On Mon, 2024-07-29 at 20:29 +0900, Damien Le Moal wrote:
> On 7/27/24 03:43, pstanner@redhat.com wrote:
> > > That said, I do not see that as an issue in itself. What I fail
> > > to
> > > understand
> > > though is why that intx devres is not deleted on device teardown.
> > > I
> > > think this
> > > may have something to do with the fact that pcim_intx() always
> > > does
> > > "res->orig_intx = !enable;", that is, it assumes that if there is
> > > a
> > > call to
> > > pcim_intx(dev, 0), then it is because intx was enabled already,
> > > which I do not
> > > think is true for most drivers... So we end up with INTX being
> > > wrongly
> > > enabled on
> > > device teardown by pcim_intx_restore(), and because of that, the
> > > intx
> > > resource
> > > is not deleted ?
> > 
> > Spoiler: The device resources that have initially been created do
> > get
> > deleted. Devres works as intended. The issue is that the forces of
> > evil
> > invoke pci_intx() through another path, hidden behind an API,
> > through
> > another devres callback.
> > 
> > So the device resource never gets deleted because it is *created*
> > on
> > driver detach, when devres already ran.
> 
> That explains the issue :)
> 
> > > Re-enabling intx on teardown is wrong I think, but that still
> > > does
> > > not explain
> > > why that resource is not deleted. I fail to see why.
> > 
> > You came very close to the truth ;)
> > 
> > With some help from my favorite coworker I did some tracing today
> > and
> > found this when doing `rmmod ahci`:
> > 
> > => pci_intx
> > => pci_msi_shutdown
> > => pci_disable_msi
> > => devres_release_all
> > => device_unbind_cleanup
> > => device_release_driver_internal
> > => driver_detach
> > => bus_remove_driver
> > => pci_unregister_driver
> > => __do_sys_delete_module
> > => do_syscall_64
> > => entry_SYSCALL_64_after_hwframe
> > 
> > The SYSCALL is my `rmmod`.
> > 
> > As you can see, pci_intx() is invoked indirectly through
> > pci_disable_msi() – which gets invoked by devres, which is
> > precisely
> > one reason why you could not find the suspicious pci_intx() call in
> > the
> > ahci code base.
> > 
> > Now the question is: Who set up that devres callback which
> > indirectly
> > calls pci_intx()?
> > 
> > It is indeed MSI, here in msi/msi.c:
> > 
> > static void pcim_msi_release(void *pcidev)
> > {
> >  struct pci_dev *dev = pcidev;
> > 
> >  dev->is_msi_managed = false;
> >  pci_free_irq_vectors(dev); // <-- calls pci_disable_msi(), which
> > calls pci_intx(), which re-registers yet another devres callback
> > }
> > 
> > /*
> >  * Needs to be separate from pcim_release to prevent an ordering
> > problem
> > 
> > ==> Oh, here they even had a warning about that interacting with
> > devres somehow...
> > 
> >  * vs. msi_device_data_release() in the MSI core code.
> >  */
> > static int pcim_setup_msi_release(struct pci_dev *dev)
> > {
> >  int ret;
> > 
> >  if (!pci_is_managed(dev) || dev->is_msi_managed)
> >  return 0;
> > 
> >  ret = devm_add_action(&dev->dev, pcim_msi_release, dev);
> >  if (ret)
> >  return ret;
> > 
> >  dev->is_msi_managed = true;
> >  return 0;
> > }
> > 
> > I don't know enough about AHCI to see where exactly it jumps into
> > these, but a candidate would be:
> >  * pci_enable_msi(), called among others in acard-ahci.c
> > 
> > Another path is:
> >    1. ahci_init_one() calls
> >    2. ahci_init_msi() calls
> >    3. pci_alloc_irq_vectors() calls
> >    4. pci_alloc_irq_vectors_affinity() calls
> >    5. __pci_enable_msi_range() OR __pci_enable_msix_range() call
> >    6. pci_setup_msi_context() calls
> >    7. pcim_setup_msi_release() which registers the callback to
> >       pci_intx()
> 
> ahci_init_one() is the function used by the default AHCI driver
> (ahci.ko), so
> this path is correct.
> 
> > Ha!
> > 
> > I think I earned myself a Friday evening beer now 8-)
> 
> I was way ahead of you :)
> 
> > Now the interesting question will be how the heck we're supposed to
> > clean that up.
> 
> If pcim_intx() always gets called on device release, AND MSI/MSIX are
> also
> managed interrupts with a devres action on release, I would say that
> pcim_msi_release() should NOT lead to a path calling pci_intx(dev,
> 0), as that
> would create the intx devres again.

Yes, that is obviously wrong – the thing is that we cannot remove the
devres aspect of pci_intx() as long as the other callers (especially
drivers) rely on it.

And, most notably, we (= I) don't know if *MSI* relies on the devres
aspect being present in pci_intx() in MSI's cleanup path.

Looking at what pci_intx_for_msi() is good for and where it's used
might give a clue.

@Krzysztof: Aren't you the MSI expert? Any idea what this might be all
about?

There are definitely some hacks one could do to remove the devres
aspect from pci_intx() _just for MSI_, for example:
 * set struct pci_dev.is_managed = false in the disable path in MSI
 * Use yet another function, pci_intx_unmanaged(), in MSI

It's obviously work and all that and I don't think one should even try
it without an understanding of what MSI is supposed to do.
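
For illustration, the second option would essentially be the raw register
toggle from the pci_intx() body quoted earlier in this thread, without any
devres involvement (a sketch, not an actual mainline function):

void pci_intx_unmanaged(struct pci_dev *pdev, int enable)
{
	u16 pci_command, new;

	pci_read_config_word(pdev, PCI_COMMAND, &pci_command);

	if (enable)
		new = pci_command & ~PCI_COMMAND_INTX_DISABLE;
	else
		new = pci_command | PCI_COMMAND_INTX_DISABLE;

	/* No devres bookkeeping here, so MSI teardown could call this safely */
	if (new != pci_command)
		pci_write_config_word(pdev, PCI_COMMAND, new);
}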


> But given the comment you point out above,
> it seems that there are some ordering constraints between disabling
> msi and intx? If that is the case, then maybe this will indeed be
> tricky.

No, I don't think that's related.
I think the ordering issue should have disappeared now in v6.11
(although I wouldn't bet my hand on it)

The issue was that up to and including v6.10, struct pci_devres bundled
everything needed by several managed PCI functions into just that one
struct. It is handled by one creator function and one cleanup function.
(It should never have been implemented that way, since it violates the
basic concepts of devres, but here we are; should, would, could...)
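
For reference, the old shared structure looked roughly like this (condensed
from v6.10's drivers/pci/pci.c; the field list here may be slightly
abbreviated):

struct pci_devres {
	unsigned int enabled:1;		/* pci_enable_device() was called */
	unsigned int pinned:1;		/* don't disable on release */
	unsigned int orig_intx:1;	/* INTx state to restore */
	unsigned int restore_intx:1;	/* restore INTx on release */
	unsigned int mwi:1;
	u32 region_mask;		/* requested BAR regions */
};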

I presume the author was then about to register that separate cleanup
devres callback in MSI and discovered that it would be racy if that MSI
code uses the same struct pci_devres, which is administered by a
different function. It could have been freed already when the MSI
callback awakes -> UAF.

> 
> > Another interesting question is: Did that only work by coincidence
> > during the last 15 years, or is it by design that the check in
> > pci_intx():
> > 
> > if (new != pci_command)
> > 
> > only evaluates to true if we are not in a detach path.
> 
> I would say that it evaluates to true for any device using intx,
> which tends to
> be rare these days. In that case, enabling on device attach and
> disabling
> on detach would lead to new != pci_command, including in the hidden
> intx disable
> path triggered by disabling msi. But the devres being recreated on
> detach
> definitely seems to depend on the disabling order, though. So we may
> have been
> lucky indeed.

Hm. I hope you are right. If not, then other drivers would experience
the same regression you found in AHCI.

Let's keep our eyes open...

P.

> 
> > 
> > If it were coincidence, it would not have caused faults as it did
> > now
> > with my recent work, because the old version did not allocate in
> > pci_intx().
> > 
> > But it could certainly have been racy and might run into a UAF
> > since
> > the old pci_intx() would have worked on memory that is also managed
> > by
> > devres, but has been registered at a different place. I guess that
> > is
> > what that comment in the MSI code quoted above is hinting at.
> > 
> > 
> > Christoph indeed rightfully called it voodoo ^^
> > 
> > 
> > Cheers,
> > P.
> > 
> > 
>
Alex Williamson Sept. 3, 2024, 3:44 p.m. UTC | #10
On Thu, 25 Jul 2024 14:07:30 +0200
Philipp Stanner <pstanner@redhat.com> wrote:

> pci_intx() is a function that becomes managed if pcim_enable_device()
> has been called in advance. Commit 25216afc9db5 ("PCI: Add managed
> pcim_intx()") changed this behavior so that pci_intx() always leads to
> creation of a separate device resource for itself, whereas earlier, a
> shared resource was used for all PCI devres operations.
> 
> Unfortunately, pci_intx() seems to be used in some drivers' remove()
> paths; in the managed case this causes a device resource to be created
> on driver detach.
> 
> Fix the regression by only redirecting pci_intx() to its managed twin
> pcim_intx() if the pci_command changes.
> 
> Fixes: 25216afc9db5 ("PCI: Add managed pcim_intx()")

I'm seeing another issue from this, which is maybe a more general
problem with managed mode.  In my case I'm using vfio-pci to assign an
ahci controller to a VM.  ahci_init_one() calls pcim_enable_device()
which sets is_managed = true.  I notice that nothing ever sets
is_managed to false.  Therefore now when I call pci_intx() from vfio-pci
under spinlock, I get a lockdep warning, as I now go through pcim_intx()
code after 25216afc9db5 since the previous driver was managed.  It seems
like we should be setting is_managed to false in the driver release
path, right?  Thanks,

Alex
Philipp Stanner Sept. 4, 2024, 7:06 a.m. UTC | #11
On Tue, 2024-09-03 at 09:44 -0600, Alex Williamson wrote:
> On Thu, 25 Jul 2024 14:07:30 +0200
> Philipp Stanner <pstanner@redhat.com> wrote:
> 
> > pci_intx() is a function that becomes managed if
> > pcim_enable_device()
> > has been called in advance. Commit 25216afc9db5 ("PCI: Add managed
> > pcim_intx()") changed this behavior so that pci_intx() always leads
> > to
> > creation of a separate device resource for itself, whereas earlier,
> > a
> > shared resource was used for all PCI devres operations.
> > 
> > Unfortunately, pci_intx() seems to be used in some drivers'
> > remove()
> > paths; in the managed case this causes a device resource to be
> > created
> > on driver detach.
> > 
> > Fix the regression by only redirecting pci_intx() to its managed
> > twin
> > pcim_intx() if the pci_command changes.
> > 
> > Fixes: 25216afc9db5 ("PCI: Add managed pcim_intx()")
> 
> I'm seeing another issue from this, which is maybe a more general
> problem with managed mode.  In my case I'm using vfio-pci to assign
> an
> ahci controller to a VM.

"In my case" doesn't mean OOT, does it? I can't fully follow.

>   ahci_init_one() calls pcim_enable_device()
> which sets is_managed = true.  I notice that nothing ever sets
> is_managed to false.  Therefore now when I call pci_intx() from vfio-
> pci
> under spinlock, I get a lockdep warning

I suppose you see the lockdep warning because the new pcim_intx() can 
now allocate, whereas before 25216afc9db5 it was pcim_enable_device()
which allocated *everything* related to PCI devres.

>  as I now go through pcim_intx()
> code after 25216afc9db5 

You always went through pcim_intx()'s logic. The issue seems to be that
the allocation step was moved.

> since the previous driver was managed.

what do you mean by "previous driver"?

>   It seems
> like we should be setting is_managed to false in the driver release
> path, right?

So the issue seems to be that the same struct pci_dev can be used by
different drivers, is that correct?

If so, I think that can be addressed through having
pcim_disable_device() set is_managed to false as you suggest.
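
A minimal sketch of that fix (assuming pcim_disable_device() keeps its
current shape as the devres callback registered by pcim_enable_device();
the pinned handling is paraphrased):

static void pcim_disable_device(void *pdev_raw)
{
	struct pci_dev *pdev = pdev_raw;

	if (!pdev->pinned)
		pci_disable_device(pdev);

	/*
	 * Stop pci_intx() & friends from taking the managed path for
	 * whatever driver binds this device next.
	 */
	pdev->is_managed = false;
}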

Another solution one could at least consider would be to use
GFP_ATOMIC for allocation in get_or_create_intx_devres().

I suppose your solution is the better one, though.


P.

>   Thanks,
> 
> Alex
>
Damien Le Moal Sept. 4, 2024, 8:25 a.m. UTC | #12
On 2024/09/04 16:06, Philipp Stanner wrote:
> On Tue, 2024-09-03 at 09:44 -0600, Alex Williamson wrote:
>> On Thu, 25 Jul 2024 14:07:30 +0200
>> Philipp Stanner <pstanner@redhat.com> wrote:
>>
>>> pci_intx() is a function that becomes managed if
>>> pcim_enable_device()
>>> has been called in advance. Commit 25216afc9db5 ("PCI: Add managed
>>> pcim_intx()") changed this behavior so that pci_intx() always leads
>>> to
>>> creation of a separate device resource for itself, whereas earlier,
>>> a
>>> shared resource was used for all PCI devres operations.
>>>
>>> Unfortunately, pci_intx() seems to be used in some drivers'
>>> remove()
>>> paths; in the managed case this causes a device resource to be
>>> created
>>> on driver detach.
>>>
>>> Fix the regression by only redirecting pci_intx() to its managed
>>> twin
>>> pcim_intx() if the pci_command changes.
>>>
>>> Fixes: 25216afc9db5 ("PCI: Add managed pcim_intx()")
>>
>> I'm seeing another issue from this, which is maybe a more general
>> problem with managed mode.  In my case I'm using vfio-pci to assign
>> an
>> ahci controller to a VM.
> 
> "In my case" doesn't mean OOT, does it? I can't fully follow.
> 
>>   ahci_init_one() calls pcim_enable_device()
>> which sets is_managed = true.  I notice that nothing ever sets
>> is_managed to false.  Therefore now when I call pci_intx() from vfio-
>> pci
>> under spinlock, I get a lockdep warning
> 
> I suppose you see the lockdep warning because the new pcim_intx() can 
> now allocate, whereas before 25216afc9db5 it was pcim_enable_device()
> which allocated *everything* related to PCI devres.
> 
>>  as I now go through pcim_intx()
>> code after 25216afc9db5 
> 
> You always went through pcim_intx()'s logic. The issue seems to be that
> the allocation step was moved.
> 
>> since the previous driver was managed.
> 
> what do you mean by "previous driver"?

The AHCI driver... When attaching a PCI dev to vfio, e.g. to pass it through to
a VM, the device driver must first be unbound and the device bound to vfio-pci.
So we switch from the ahci/libata driver to vfio. When vfio tries to enable intx
with is_managed still true from the use of the device by ahci, problems happen.

> 
>>   It seems
>> like we should be setting is_managed to false in the driver release
>> path, right?
> 
> So the issue seems to be that the same struct pci_dev can be used by
> different drivers, is that correct?
> 
> If so, I think that can be addressed through having
> pcim_disable_device() set is_managed to false as you suggest.
> 
> Another solution one could at least consider would be to use
> GFP_ATOMIC for allocation in get_or_create_intx_devres().

If it is allowed to call pci_intx() under a spin_lock, then we need GFP_ATOMIC.
If not, then vfio-pci needs to move the call out of the spinlock.
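
For the former, the change would be confined to the devres helper (a sketch,
assuming the helper shape discussed earlier in this thread; not actual
mainline code):

static struct pcim_intx_devres *get_or_create_intx_devres(struct device *dev)
{
	struct pcim_intx_devres *res;

	res = devres_find(dev, pcim_intx_restore, NULL, NULL);
	if (res)
		return res;

	/* GFP_ATOMIC so callers may hold a spinlock or run with IRQs off */
	res = devres_alloc(pcim_intx_restore, sizeof(*res), GFP_ATOMIC);
	if (res)
		devres_add(dev, res);

	return res;
}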

Either solution must be implemented regardless of the fix to set is_managed to
false.

So what context is allowed to call pci_intx()? The current kdoc comment does
not say...
Alex Williamson Sept. 4, 2024, 12:57 p.m. UTC | #13
On Wed, 04 Sep 2024 09:06:37 +0200
Philipp Stanner <pstanner@redhat.com> wrote:

> On Tue, 2024-09-03 at 09:44 -0600, Alex Williamson wrote:
> > On Thu, 25 Jul 2024 14:07:30 +0200
> > Philipp Stanner <pstanner@redhat.com> wrote:
> >   
> > > pci_intx() is a function that becomes managed if
> > > pcim_enable_device()
> > > has been called in advance. Commit 25216afc9db5 ("PCI: Add managed
> > > pcim_intx()") changed this behavior so that pci_intx() always leads
> > > to
> > > creation of a separate device resource for itself, whereas earlier,
> > > a
> > > shared resource was used for all PCI devres operations.
> > > 
> > > Unfortunately, pci_intx() seems to be used in some drivers'
> > > remove()
> > > paths; in the managed case this causes a device resource to be
> > > created
> > > on driver detach.
> > > 
> > > Fix the regression by only redirecting pci_intx() to its managed
> > > twin
> > > pcim_intx() if the pci_command changes.
> > > 
> > > Fixes: 25216afc9db5 ("PCI: Add managed pcim_intx()")  
> > 
> > I'm seeing another issue from this, which is maybe a more general
> > problem with managed mode.  In my case I'm using vfio-pci to assign
> > an
> > ahci controller to a VM.  
> 
> "In my case" doesn't mean OOT, does it? I can't fully follow.

"OOT" Out Of Tree?  No, "In my case" is simply introducing the scenario
in which I see the issue.  vfio-pci is an in-tree driver used to attach
devices to userspace drivers, such as QEMU.  The ahci driver is loaded
during system boot, setting the is_managed flag.  The ahci driver is
then unbound from the device and the vfio-pci driver is bound.  The
vfio-pci driver provides a uAPI for userspace drivers to operate a
device in an IOMMU-protected context.
 
> >   ahci_init_one() calls pcim_enable_device()
> > which sets is_managed = true.  I notice that nothing ever sets
> > is_managed to false.  Therefore now when I call pci_intx() from vfio-
> > pci
> > under spinlock, I get a lockdep warning  
> 
> I suppose you see the lockdep warning because the new pcim_intx() can 
> now allocate, whereas before 25216afc9db5 it was pcim_enable_device()
> which allocated *everything* related to PCI devres.
> 
> >  as I now go through pcim_intx()
> > code after 25216afc9db5   
> 
> You always went through pcim_intx()'s logic. The issue seems to be that
> the allocation step was moved.

Unintentionally, yes, I believe so.  vfio-pci is not a managed, devres
driver and therefore had no expectation of using the managed code path.

> > since the previous driver was managed.  
> 
> what do you mean by "previous driver"?

As noted, the ahci driver is first bound to the device at boot,
unbound, and the vfio-pci driver bound to the device.  The ahci driver
is the previous driver.

> >   It seems
> > like we should be setting is_managed to false in the driver release
> > path, right?  
> 
> So the issue seems to be that the same struct pci_dev can be used by
> different drivers, is that correct?

Yes, and more generically, the driver release should undo everything
that has been configured by the driver probe.

> If so, I think that can be addressed through having
> pcim_disable_device() set is_managed to false as you suggest.

If that's sufficient and drivers only call pcim_disable_device() in
their release function.  I also note that f748a07a0b64 ("PCI: Remove
legacy pcim_release()") claims that:

  Thanks to preceding cleanup steps, pcim_release() is now not needed
  anymore and can be replaced by pcim_disable_device(), which is the
  exact counterpart to pcim_enable_device().

However, that's not accurate as pcim_enable_device() adds a devm
action, unconditionally calls pci_enable_device() and sets is_managed
to true.  If we assume pcim_pin_device() is a valid concept, don't we
still need to remove the devm action as well?

> Another solution one could at least consider would be to use
> GFP_ATOMIC for allocation in get_or_create_intx_devres().

If we look at what pci_intx() does without devres, it's simply reading
and setting or clearing a bit in config space.  I can attest that a
driver author would have no expectation that such a function allocates
memory and there are scenarios where we want to call this with
interrupts disabled, such as within an interrupt context.  So, TBH, it
might make sense to consider whether an allocation in this path is
appropriate at all, but I'm obviously no expert in devres.

> I suppose your solution is the better one, though.

I see you've posted a patch, I'll test it as soon as I'm able.  Thanks,

Alex
Philipp Stanner Sept. 4, 2024, 1:29 p.m. UTC | #14
On Wed, 2024-09-04 at 06:57 -0600, Alex Williamson wrote:
> On Wed, 04 Sep 2024 09:06:37 +0200
> Philipp Stanner <pstanner@redhat.com> wrote:
> 
> > On Tue, 2024-09-03 at 09:44 -0600, Alex Williamson wrote:
> > > On Thu, 25 Jul 2024 14:07:30 +0200
> > > Philipp Stanner <pstanner@redhat.com> wrote:
> > >   
> > > > pci_intx() is a function that becomes managed if
> > > > pcim_enable_device()
> > > > has been called in advance. Commit 25216afc9db5 ("PCI: Add
> > > > managed
> > > > pcim_intx()") changed this behavior so that pci_intx() always
> > > > leads
> > > > to
> > > > creation of a separate device resource for itself, whereas
> > > > earlier,
> > > > a
> > > > shared resource was used for all PCI devres operations.
> > > > 
> > > > Unfortunately, pci_intx() seems to be used in some drivers'
> > > > remove()
> > > > paths; in the managed case this causes a device resource to be
> > > > created
> > > > on driver detach.
> > > > 
> > > > Fix the regression by only redirecting pci_intx() to its
> > > > managed
> > > > twin
> > > > pcim_intx() if the pci_command changes.
> > > > 
> > > > Fixes: 25216afc9db5 ("PCI: Add managed pcim_intx()")  
> > > 
> > > I'm seeing another issue from this, which is maybe a more general
> > > problem with managed mode.  In my case I'm using vfio-pci to
> > > assign
> > > an
> > > ahci controller to a VM.  
> > 
> > "In my case" doesn't mean OOT, does it? I can't fully follow.
> 
> "OOT" Out Of Tree?  No, "In my case" is simply introducing the
> scenario
> in which I see the issue.  vfio-pci is an in-tree driver used to
> attach
> devices to userspace drivers, such as QEMU.  The ahci driver is
> loaded
> during system boot, setting the is_managed flag.  The ahci driver is
> then unbound from the device and the vfio-pci driver is bound.  The
> vfio-pci driver provides a uAPI for userspace drivers to operate a
> device in an IOMMU-protected context.

Alright, thx for the clarification.

>  
> > >   ahci_init_one() calls pcim_enable_device()
> > > which sets is_managed = true.  I notice that nothing ever sets
> > > is_managed to false.  Therefore now when I call pci_intx() from
> > > vfio-
> > > pci
> > > under spinlock, I get a lockdep warning  
> > 
> > I suppose you see the lockdep warning because the new pcim_intx()
> > can 
> > now allocate, whereas before 25216afc9db5 it was
> > pcim_enable_device()
> > which allocated *everything* related to PCI devres.
> > 
> > >  as I now go through pcim_intx()
> > > code after 25216afc9db5   
> > 
> > You always went through pcim_intx()'s logic. The issue seems to be
> > that
> > the allocation step was moved.
> 
> Unintentionally, yes, I believe so.  vfio-pci is not a managed,
> devres
> driver and therefore had no expectation of using the managed code
> path.

Yes, I agree this needs to be fixed through the solution you proposed.

> 
> > > since the previous driver was managed.  
> > 
> > what do you mean by "previous driver"?
> 
> As noted, the ahci driver is first bound to the device at boot,
> unbound, and the vfio-pci driver bound to the device.  The ahci
> driver
> is the previous driver.
> 
> > >   It seems
> > > like we should be setting is_managed to false in the driver
> > > release
> > > path, right?  
> > 
> > So the issue seems to be that the same struct pci_dev can be used
> > by
> > different drivers, is that correct?
> 
> Yes, and more generically, the driver release should undo everything
> that has been configured by the driver probe.
> 
> > If so, I think that can be addressed through having
> > pcim_disable_device() set is_managed to false as you suggest.
> 
> If that's sufficient and drivers only call pcim_disable_device() in
> their release function.

pcim_disable_device() is not intended to be used directly by drivers.
It's basically the devres callback for pcim_enable_device() and is
called in everyone's release path automatically by devres.
(I agree that the naming is not superb.)

>   I also note that f748a07a0b64 ("PCI: Remove
> legacy pcim_release()") claims that:
> 
>   Thanks to preceding cleanup steps, pcim_release() is now not needed
>   anymore and can be replaced by pcim_disable_device(), which is the
>   exact counterpart to pcim_enable_device().
> 
> However, that's not accurate as pcim_enable_device() adds a devm
> action, unconditionally calls pci_enable_device() and sets is_managed
> to true.

It's not accurate with regard to is_managed.

The rest is fine IMO. The devres callback shall be added, and the
unconditional call to pci_enable_device() is also desired.

>   If we assume pcim_pin_device() is a valid concept, don't we
> still need to remove the devm action as well?

No. As pcim_disable_device() is the very devres callback, it does not
need to remove itself. Devres calls it once and then it's gone.

However, pcim_pin_device() IMO is not a valid concept. Whoever wants
such behavior shall use pci_enable_device() and pci_disable_device()
manually.
I proposed removing it here:
https://lore.kernel.org/all/20240822073815.12365-2-pstanner@redhat.com/

(Note that previously it could not be removed because
pcim_enable_device() also allocated all the stuff needed by
pci_request_region() etc.)

> 
> > Another solution one could at least consider would be to use
> > GFP_ATOMIC for allocation in get_or_create_intx_devres().
> 
> If we look at what pci_intx() does without devres, it's simply
> reading
> and setting or clearing a bit in config space.  I can attest that a
> driver author would have no expectation that such a function
> allocates
> memory

A driver author would not have any expectation of a function implicitly
doing anything with devres. That's the reason why I did all this work
in the first place: to, ultimately, remove this hybrid behavior from
all pci_ functions, so that devres functions are clearly marked with
the pcim_ prefix.

That is, however, not that easy because everyone who uses
pcim_enable_device() in combination with pci_intx() first has to be
ported to a pci_intx()-version that has nothing to do with devres.


>  and there are scenarios where we want to call this with
> interrupts disabled, such as within an interrupt context.  So, TBH,
> it
> might make sense to consider whether an allocation in this path is
> appropriate at all, but I'm obviously no expert in devres.

The entire hybrid nature from pci_intx() should be removed.
I'm working on that.

The hybrid nature was always there, but the allocation was not. It
would be removed later together with the hybrid devres usage.

> 
> > I suppose your solution is the better one, though.
> 
> I see you've posted a patch, I'll test it as soon as I'm able. 

If it works, then from what I understand it should solve those issues
for now, until we can phase out pci_intx() usage that relies on the
device resource.

---
Btw, I just looked into the old code and found that it also already
had a similar half-bug with is_managed. It never sets it to false
again, but pci_intx() runs into:

static struct pci_devres *find_pci_dr(struct pci_dev *pdev)
{
	if (pci_is_managed(pdev))
		return devres_find(&pdev->dev, pcim_release, NULL, NULL);
	return NULL;
}


So in your case, pci_is_managed() would also have been true, and the
only reason no problem occurred is that devres_find() no longer finds
the device resource of the previous driver, so pci_intx() won't use
that memory.
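
Put as a timeline (illustration):

/*
 * 1. ahci: pcim_enable_device()   -> pdev->is_managed = true
 * 2. ahci unbind: devres frees struct pci_devres, but is_managed
 *    stays true
 * 3. vfio-pci bind (no pcim_enable_device())
 * 4. vfio-pci: pci_intx() -> find_pci_dr() -> devres_find() == NULL
 *
 * So the stale is_managed flag stayed harmless only by accident.
 */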
---


Thanks for debugging,
P.

> Thanks,
> 
> Alex
>
Philipp Stanner Sept. 4, 2024, 1:37 p.m. UTC | #15
On Wed, 2024-09-04 at 17:25 +0900, Damien Le Moal wrote:
> On 2024/09/04 16:06, Philipp Stanner wrote:
> > On Tue, 2024-09-03 at 09:44 -0600, Alex Williamson wrote:
> > > On Thu, 25 Jul 2024 14:07:30 +0200
> > > Philipp Stanner <pstanner@redhat.com> wrote:
> > > 
> > > > pci_intx() is a function that becomes managed if
> > > > pcim_enable_device()
> > > > has been called in advance. Commit 25216afc9db5 ("PCI: Add
> > > > managed
> > > > pcim_intx()") changed this behavior so that pci_intx() always
> > > > leads
> > > > to
> > > > creation of a separate device resource for itself, whereas
> > > > earlier,
> > > > a
> > > > shared resource was used for all PCI devres operations.
> > > > 
> > > > Unfortunately, pci_intx() seems to be used in some drivers'
> > > > remove()
> > > > paths; in the managed case this causes a device resource to be
> > > > created
> > > > on driver detach.
> > > > 
> > > > Fix the regression by only redirecting pci_intx() to its
> > > > managed
> > > > twin
> > > > pcim_intx() if the pci_command changes.
> > > > 
> > > > Fixes: 25216afc9db5 ("PCI: Add managed pcim_intx()")
> > > 
> > > I'm seeing another issue from this, which is maybe a more general
> > > problem with managed mode.  In my case I'm using vfio-pci to
> > > assign
> > > an
> > > ahci controller to a VM.
> > 
> > "In my case" doesn't mean OOT, does it? I can't fully follow.
> > 
> > >   ahci_init_one() calls pcim_enable_device()
> > > which sets is_managed = true.  I notice that nothing ever sets
> > > is_managed to false.  Therefore now when I call pci_intx() from
> > > vfio-
> > > pci
> > > under spinlock, I get a lockdep warning
> > 
> > I suppose you see the lockdep warning because the new pcim_intx()
> > can 
> > now allocate, whereas before 25216afc9db5 it was
> > pcim_enable_device()
> > which allocated *everything* related to PCI devres.
> > 
> > >  as I now go through pcim_intx()
> > > code after 25216afc9db5 
> > 
> > You always went through pcim_intx()'s logic. The issue seems to be
> > that
> > the allocation step was moved.
> > 
> > > since the previous driver was managed.
> > 
> > what do you mean by "previous driver"?
> 
> The AHCI driver... When attaching a PCI dev to vfio to e.g.
> passthrough to a VM,
> the device driver must first be unbound and the device bound to vfio-
> pci. So we
> switch from ahci/libata driver to vfio. When vfio tries to enable
> intx with
> is_managed still true from the use of the device by ahci, problems
> happen.
> 
> > 
> > >   It seems
> > > like we should be setting is_managed to false in the driver
> > > release
> > > path, right?
> > 
> > So the issue seems to be that the same struct pci_dev can be used
> > by
> > different drivers, is that correct?
> > 
> > If so, I think that can be addressed through having
> > pcim_disable_device() set is_managed to false as you suggest.
> > 
> > Another solution we could at least consider would be to use a
> > GFP_ATOMIC for allocation in get_or_create_intx_devres().
> 
> If it is allowed to call pci_intx() under a spin_lock, then we need
> GFP_ATOMIC.
> If not, then vfio-pci needs to move the call out of the spinlock.

If vfio-pci can get rid of pci_intx() altogether, that might be a good
thing. As far as I understood Andy Shevchenko, pci_intx() is outdated.
There's only a handful of users anyway.


The best solution would be to avoid GFP_ATOMIC and first see whether
setting is_managed = false solves the reported problem for now.

Other problematic users should hopefully be found through lockdep, too,
though I think they are unlikely to occur.
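
Sketched, the is_managed part would amount to this (assuming the
current shape of pcim_disable_device() in drivers/pci/devres.c):

static void pcim_disable_device(void *pdev_raw)
{
	struct pci_dev *pdev = pdev_raw;

	if (!pdev->pinned)
		pci_disable_device(pdev);

	/* Proposed: forget the managed state on driver detach so the
	 * next driver does not inherit devres behavior by accident. */
	pdev->is_managed = false;
}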

> 
> Either solution must be implemented regardless of the fix to set
> is_managed to
> false.

Yes

> 
> So what context is allowed to call pci_intx() ? The current kdoc
> comment does
> not say...

The old pci_intx() did not allocate.
It only called pci_read_config_word() and pci_write_config_word(). If
those cannot block etc., it should be safe to call from any context.

Though I'd like to hear from one of the maintainers about it.

The new version allocates on its first run if pcim_enable_device() was
called beforehand. That first run would then be illegal in must-not-
sleep contexts.
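
To illustrate the problematic pattern (hypothetical caller; struct
my_dev and its irqlock are made up for this sketch):

static void mask_intx_atomically(struct my_dev *mydev)
{
	unsigned long flags;

	spin_lock_irqsave(&mydev->irqlock, flags);
	/*
	 * If pcim_enable_device() ran earlier and this is the first
	 * pci_intx() call, pcim_intx() may allocate with GFP_KERNEL
	 * here -> "sleeping function called from invalid context".
	 */
	pci_intx(mydev->pdev, 0);
	spin_unlock_irqrestore(&mydev->irqlock, flags);
}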


P.
Alex Williamson Sept. 4, 2024, 6:07 p.m. UTC | #16
On Wed, 04 Sep 2024 15:37:25 +0200
Philipp Stanner <pstanner@redhat.com> wrote:

> On Wed, 2024-09-04 at 17:25 +0900, Damien Le Moal wrote:
> > On 2024/09/04 16:06, Philipp Stanner wrote:  
> > > On Tue, 2024-09-03 at 09:44 -0600, Alex Williamson wrote:  
> > > > On Thu, 25 Jul 2024 14:07:30 +0200
> > > > Philipp Stanner <pstanner@redhat.com> wrote:
> > > >   
> > > > > pci_intx() is a function that becomes managed if
> > > > > pcim_enable_device()
> > > > > has been called in advance. Commit 25216afc9db5 ("PCI: Add
> > > > > managed
> > > > > pcim_intx()") changed this behavior so that pci_intx() always
> > > > > leads
> > > > > to
> > > > > creation of a separate device resource for itself, whereas
> > > > > earlier,
> > > > > a
> > > > > shared resource was used for all PCI devres operations.
> > > > > 
> > > > > Unfortunately, pci_intx() seems to be used in some drivers'
> > > > > remove()
> > > > > paths; in the managed case this causes a device resource to be
> > > > > created
> > > > > on driver detach.
> > > > > 
> > > > > Fix the regression by only redirecting pci_intx() to its
> > > > > managed
> > > > > twin
> > > > > pcim_intx() if the pci_command changes.
> > > > > 
> > > > > Fixes: 25216afc9db5 ("PCI: Add managed pcim_intx()")  
> > > > 
> > > > I'm seeing another issue from this, which is maybe a more general
> > > > problem with managed mode.  In my case I'm using vfio-pci to
> > > > assign
> > > > an
> > > > ahci controller to a VM.  
> > > 
> > > "In my case" doesn't mean OOT, does it? I can't fully follow.
> > >   
> > > >   ahci_init_one() calls pcim_enable_device()
> > > > which sets is_managed = true.  I notice that nothing ever sets
> > > > is_managed to false.  Therefore now when I call pci_intx() from
> > > > vfio-
> > > > pci
> > > > under spinlock, I get a lockdep warning  
> > > 
> > > I suppose you see the lockdep warning because the new pcim_intx()
> > > can 
> > > now allocate, whereas before 25216afc9db5 it was
> > > pcim_enable_device()
> > > which allocated *everything* related to PCI devres.
> > >   
> > > >  as I now go through pcim_intx()
> > > > code after 25216afc9db5   
> > > 
> > > You always went through pcim_intx()'s logic. The issue seems to be
> > > that
> > > the allocation step was moved.
> > >   
> > > > since the previous driver was managed.  
> > > 
> > > what do you mean by "previous driver"?  
> > 
> > The AHCI driver... When attaching a PCI dev to vfio to e.g.
> > passthrough to a VM,
> > the device driver must first be unbound and the device bound to vfio-
> > pci. So we
> > switch from ahci/libata driver to vfio. When vfio tries to enable
> > intx with
> > is_managed still true from the use of the device by ahci, problems
> > happen.
> >   
> > >   
> > > >   It seems
> > > > like we should be setting is_managed to false in the driver
> > > > release
> > > > path, right?  
> > > 
> > > So the issue seems to be that the same struct pci_dev can be used
> > > by
> > > different drivers, is that correct?
> > > 
> > > If so, I think that can be addressed through having
> > > pcim_disable_device() set is_managed to false as you suggest.
> > > 
> > > Another solution we could at least consider would be to use a
> > > GFP_ATOMIC for allocation in get_or_create_intx_devres().  
> > 
> > If it is allowed to call pci_intx() under a spin_lock, then we need
> > GFP_ATOMIC.
> > If not, then vfio-pci needs to move the call out of the spinlock.  
> 
> If vfio-pci can get rid of pci_intx() altogether, that might be a good
> thing. As far as I understood Andy Shevchenko, pci_intx() is outdated.
> There's only a handful of users anyway.

What's the alternative?  vfio-pci has a potentially unique requirement
here, we don't know how to handle the device interrupt, we only forward
it to the userspace driver.  As a level triggered interrupt, INTx will
continue to assert until that userspace driver handles the device.
That's obviously unacceptable from a host perspective, so INTx is
masked at the device via pci_intx() where available, or at the
interrupt controller otherwise.  The API with the userspace driver
requires that driver to unmask the interrupt, again resulting in a call
to pci_intx() or unmasking the interrupt controller, in order to receive
further interrupts from the device.  Thanks,

Alex
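
A sketch of that flow (simplified; the names are invented and this is
not vfio-pci's actual code):

static irqreturn_t intx_host_handler(int irq, void *data)
{
	struct my_vfio_dev *vdev = data;

	/* Mask at the device so the level-triggered line deasserts. */
	spin_lock(&vdev->irqlock);
	pci_intx(vdev->pdev, 0);
	vdev->masked = true;
	spin_unlock(&vdev->irqlock);

	/* ... signal the userspace driver, e.g. via an eventfd ... */
	return IRQ_HANDLED;
}

/* Called when the userspace driver asks to unmask again: */
static void intx_unmask(struct my_vfio_dev *vdev)
{
	unsigned long flags;

	spin_lock_irqsave(&vdev->irqlock, flags);
	if (vdev->masked) {
		pci_intx(vdev->pdev, 1);
		vdev->masked = false;
	}
	spin_unlock_irqrestore(&vdev->irqlock, flags);
}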
Andy Shevchenko Sept. 4, 2024, 8:24 p.m. UTC | #17
Wed, Sep 04, 2024 at 12:07:21PM -0600, Alex Williamson kirjoitti:
> On Wed, 04 Sep 2024 15:37:25 +0200
> Philipp Stanner <pstanner@redhat.com> wrote:
> > On Wed, 2024-09-04 at 17:25 +0900, Damien Le Moal wrote:

...

> > If vfio-pci can get rid of pci_intx() altogether, that might be a good
> > thing. As far as I understood Andy Shevchenko, pci_intx() is outdated.
> > There's only a handful of users anyway.
> 
> What's the alternative?

From an API perspective, pci_alloc_irq_vectors() & Co should be used.
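
Roughly, the pattern would be (sketch; my_handler and my_cookie are
placeholders):

int nvec, ret;

nvec = pci_alloc_irq_vectors(pdev, 1, 1, PCI_IRQ_ALL_TYPES);
if (nvec < 0)
	return nvec;

ret = request_irq(pci_irq_vector(pdev, 0), my_handler, IRQF_SHARED,
		  "my_dev", my_cookie);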

> vfio-pci has a potentially unique requirement
> here, we don't know how to handle the device interrupt, we only forward
> it to the userspace driver.  As a level triggered interrupt, INTx will
> continue to assert until that userspace driver handles the device.
> That's obviously unacceptable from a host perspective, so INTx is
> masked at the device via pci_intx() where available, or at the
> interrupt controller otherwise.  The API with the userspace driver
> requires that driver to unmask the interrupt, again resulting in a call
> to pci_intx() or unmasking the interrupt controller, in order to receive
> further interrupts from the device.  Thanks,

I briefly read the discussion and if I understand it correctly the problem here
is in the flow: when the above mentioned API is being called. Hence it's a
design (or architectural) level issue and changing a call from foo() to bar()
won't magically make the problem go away. But I might be mistaken.
Alex Williamson Sept. 4, 2024, 9:10 p.m. UTC | #18
On Wed, 4 Sep 2024 23:24:53 +0300
Andy Shevchenko <andy.shevchenko@gmail.com> wrote:

> Wed, Sep 04, 2024 at 12:07:21PM -0600, Alex Williamson kirjoitti:
> > On Wed, 04 Sep 2024 15:37:25 +0200
> > Philipp Stanner <pstanner@redhat.com> wrote:  
> > > On Wed, 2024-09-04 at 17:25 +0900, Damien Le Moal wrote:  
> 
> ...
> 
> > > If vfio-pci can get rid of pci_intx() altogether, that might be a good
> > > thing. As far as I understood Andy Shevchenko, pci_intx() is outdated.
> > > There's only a handful of users anyway.
> > 
> > What's the alternative?  
> 
> From an API perspective, pci_alloc_irq_vectors() & Co should be used.

We can't replace a device level INTx control with a vector allocation
function.
 
> > vfio-pci has a potentially unique requirement
> > here, we don't know how to handle the device interrupt, we only forward
> > it to the userspace driver.  As a level triggered interrupt, INTx will
> > continue to assert until that userspace driver handles the device.
> > That's obviously unacceptable from a host perspective, so INTx is
> > masked at the device via pci_intx() where available, or at the
> > interrupt controller otherwise.  The API with the userspace driver
> > requires that driver to unmask the interrupt, again resulting in a call
> > to pci_intx() or unmasking the interrupt controller, in order to receive
> > further interrupts from the device.  Thanks,  
> 
> I briefly read the discussion and if I understand it correctly the problem here
> is in the flow: when the above mentioned API is being called. Hence it's design
> (or architectural) level of issue and changing call from foo() to bar() won't
> magically make problem go away. But I might be mistaken.

Certainly from a vector allocation standpoint we can change to whatever
is preferred, but the direct INTx manipulation functions are a
different thing entirely and afaik there's nothing else that can
replace them at a low level, nor can we just get rid of our calls to
pci_intx().  Thanks,

Alex
Damien Le Moal Sept. 5, 2024, 12:33 a.m. UTC | #19
On 2024/09/05 6:10, Alex Williamson wrote:
> On Wed, 4 Sep 2024 23:24:53 +0300
> Andy Shevchenko <andy.shevchenko@gmail.com> wrote:
> 
>> Wed, Sep 04, 2024 at 12:07:21PM -0600, Alex Williamson kirjoitti:
>>> On Wed, 04 Sep 2024 15:37:25 +0200
>>> Philipp Stanner <pstanner@redhat.com> wrote:  
>>>> On Wed, 2024-09-04 at 17:25 +0900, Damien Le Moal wrote:  
>>
>> ...
>>
>>>> If vfio-pci can get rid of pci_intx() altogether, that might be a good
>>>> thing. As far as I understood Andy Shevchenko, pci_intx() is outdated.
>>>> There's only a handful of users anyway.
>>>
>>> What's the alternative?  
>>
>> From an API perspective, pci_alloc_irq_vectors() & Co should be used.
> 
> We can't replace a device level INTx control with a vector allocation
> function.
>  
>>> vfio-pci has a potentially unique requirement
>>> here, we don't know how to handle the device interrupt, we only forward
>>> it to the userspace driver.  As a level triggered interrupt, INTx will
>>> continue to assert until that userspace driver handles the device.
>>> That's obviously unacceptable from a host perspective, so INTx is
>>> masked at the device via pci_intx() where available, or at the
>>> interrupt controller otherwise.  The API with the userspace driver
>>> requires that driver to unmask the interrupt, again resulting in a call
>>> to pci_intx() or unmasking the interrupt controller, in order to receive
>>> further interrupts from the device.  Thanks,  
>>
>> I briefly read the discussion and if I understand it correctly the problem here
>> is in the flow: when the above mentioned API is being called. Hence it's design
>> (or architectural) level of issue and changing call from foo() to bar() won't
>> magically make problem go away. But I might be mistaken.
> 
> Certainly from a vector allocation standpoint we can change to whatever
> is preferred, but the direct INTx manipulation functions are a
> different thing entirely and afaik there's nothing else that can
> replace them at a low level, nor can we just get rid of our calls to
> pci_intx().  Thanks,

But can these calls be moved out of the spinlock context? If not, then we need
to clarify that pci_intx() can be called from any context, which will require
changing to a GFP_ATOMIC for the resource allocation, even if the use case
cannot trigger the allocation. This is needed to ensure the correctness of the
pci_intx() function use. Frankly, I am surprised that the might-sleep splat you
got was not already reported before (fuzzing, static analyzers might eventually
catch that though).

The other solution would be a version of pci_intx() that has a gfp flags
argument to allow callers to use the right gfp flags for the call context.
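
E.g. something like this (purely hypothetical; neither the name nor the
signature exists in the tree, and vdev->irqlock is made up):

int pci_intx_flags(struct pci_dev *pdev, int enable, gfp_t gfp);

/* which would make a call under a spinlock legal: */
spin_lock_irqsave(&vdev->irqlock, flags);
pci_intx_flags(pdev, 0, GFP_ATOMIC);
spin_unlock_irqrestore(&vdev->irqlock, flags);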


> 
> Alex
>
Alex Williamson Sept. 5, 2024, 1:56 a.m. UTC | #20
On Thu, 5 Sep 2024 09:33:35 +0900
Damien Le Moal <dlemoal@kernel.org> wrote:

> On 2024/09/05 6:10, Alex Williamson wrote:
> > On Wed, 4 Sep 2024 23:24:53 +0300
> > Andy Shevchenko <andy.shevchenko@gmail.com> wrote:
> >   
> >> Wed, Sep 04, 2024 at 12:07:21PM -0600, Alex Williamson kirjoitti:  
> >>> On Wed, 04 Sep 2024 15:37:25 +0200
> >>> Philipp Stanner <pstanner@redhat.com> wrote:    
> >>>> On Wed, 2024-09-04 at 17:25 +0900, Damien Le Moal wrote:    
> >>
> >> ...
> >>  
> >>>> If vfio-pci can get rid of pci_intx() altogether, that might be a good
> >>>> thing. As far as I understood Andy Shevchenko, pci_intx() is outdated.
> >>>> There's only a handful of users anyway.
> >>>
> >>> What's the alternative?    
> >>
> >> From an API perspective, pci_alloc_irq_vectors() & Co should be used.
> > 
> > We can't replace a device level INTx control with a vector allocation
> > function.
> >    
> >>> vfio-pci has a potentially unique requirement
> >>> here, we don't know how to handle the device interrupt, we only forward
> >>> it to the userspace driver.  As a level triggered interrupt, INTx will
> >>> continue to assert until that userspace driver handles the device.
> >>> That's obviously unacceptable from a host perspective, so INTx is
> >>> masked at the device via pci_intx() where available, or at the
> >>> interrupt controller otherwise.  The API with the userspace driver
> >>> requires that driver to unmask the interrupt, again resulting in a call
> >>> to pci_intx() or unmasking the interrupt controller, in order to receive
> >>> further interrupts from the device.  Thanks,    
> >>
> >> I briefly read the discussion and if I understand it correctly the problem here
> >> is in the flow: when the above mentioned API is being called. Hence it's design
> >> (or architectural) level of issue and changing call from foo() to bar() won't
> >> magically make problem go away. But I might be mistaken.  
> > 
> > Certainly from a vector allocation standpoint we can change to whatever
> > is preferred, but the direct INTx manipulation functions are a
> > different thing entirely and afaik there's nothing else that can
> > replace them at a low level, nor can we just get rid of our calls to
> > pci_intx().  Thanks,  
> 
> But can these calls be moved out of the spinlock context ? If not, then we need
> to clarify that pci_intx() can be called from any context, which will require
> changing to a GFP_ATOMIC for the resource allocation, even if the use case
> cannot trigger the allocation. This is needed to ensure the correctness of the
> pci_intx() function use. Frankly, I am surprised that the might-sleep splat you
> got was not already reported before (fuzzing, static analyzers might eventually
> catch that though).
> 
> The other solution would be a version of pci_intx() that has a gfp flags
> argument to allow callers to use the right gfp flags for the call context.

In vfio-pci we're trying to achieve mutual exclusion of the device
interrupt masking between IRQ context and userspace context, so the
problem really does not lend itself to pulling the pci_intx() call out
of an atomic context.  I'll also note again that from a non-devres
perspective, pci_intx() is only setting or clearing a bit in the
command register, so it's a hefty imposition to restrict the function
in general for an allocation in the devres path.  Thanks,

Alex
Philipp Stanner Sept. 5, 2024, 7:13 a.m. UTC | #21
On Thu, 2024-09-05 at 09:33 +0900, Damien Le Moal wrote:
> On 2024/09/05 6:10, Alex Williamson wrote:
> > On Wed, 4 Sep 2024 23:24:53 +0300
> > Andy Shevchenko <andy.shevchenko@gmail.com> wrote:
> > 
> > > Wed, Sep 04, 2024 at 12:07:21PM -0600, Alex Williamson kirjoitti:
> > > > On Wed, 04 Sep 2024 15:37:25 +0200
> > > > Philipp Stanner <pstanner@redhat.com> wrote:  
> > > > > On Wed, 2024-09-04 at 17:25 +0900, Damien Le Moal wrote:  
> > > 
> > > ...
> > > 
> > > > > If vfio-pci can get rid of pci_intx() altogether, that might
> > > > > be a good
> > > > > thing. As far as I understood Andy Shevchenko, pci_intx() is
> > > > > outdated.
> > > > > There's only a handful of users anyway.
> > > > 
> > > > What's the alternative?  
> > > 
> > > From an API perspective, pci_alloc_irq_vectors() & Co should be
> > > used.
> > 
> > We can't replace a device level INTx control with a vector
> > allocation
> > function.
> >  
> > > > vfio-pci has a potentially unique requirement
> > > > here, we don't know how to handle the device interrupt, we only
> > > > forward
> > > > it to the userspace driver.  As a level triggered interrupt,
> > > > INTx will
> > > > continue to assert until that userspace driver handles the
> > > > device.
> > > > That's obviously unacceptable from a host perspective, so INTx
> > > > is
> > > > masked at the device via pci_intx() where available, or at the
> > > > interrupt controller otherwise.  The API with the userspace
> > > > driver
> > > > requires that driver to unmask the interrupt, again resulting
> > > > in a call
> > > > to pci_intx() or unmasking the interrupt controller, in order
> > > > to receive
> > > > further interrupts from the device.  Thanks,  
> > > 
> > > I briefly read the discussion and if I understand it correctly
> > > the problem here
> > > is in the flow: when the above mentioned API is being called.
> > > Hence it's design
> > > (or architectural) level of issue and changing call from foo() to
> > > bar() won't
> > > magically make problem go away. But I might be mistaken.
> > 
> > Certainly from a vector allocation standpoint we can change to
> > whatever
> > is preferred, but the direct INTx manipulation functions are a
> > different thing entirely and afaik there's nothing else that can
> > replace them at a low level, nor can we just get rid of our calls
> > to
> > pci_intx().  Thanks,
> 
> But can these calls be moved out of the spinlock context ? If not,
> then we need
> to clarify that pci_intx() can be called from any context, which will
> require
> changing to a GFP_ATOMIC for the resource allocation, even if the use
> case
> cannot trigger the allocation. This is needed to ensure the
> correctness of the
> pci_intx() function use.

We could do that I guess. As I keep saying, it's not intended to have
pci_intx() allocate _permanently_. This is a temporary situation.
pci_intx() should have neither devres nor allocation.

> Frankly, I am surprised that the might-sleep splat you
> got was not already reported before (fuzzing, static analyzers might
> eventually
> catch that though).

It's a super rare situation:
 * pci_intx() has very few callers.
 * It only allocates if pcim_enable_device() instead of
   pci_enable_device() ran.
 * It only allocates when it's called for the FIRST TIME.
 * All of the above is only a problem while you hold a lock.

> 
> The other solution would be a version of pci_intx() that has a gfp
> flags
> argument to allow callers to use the right gfp flags for the call
> context.

I don't think that's a good idea. As I said, I want to clean up all
that in the mid term.

As a matter of fact, there is already __pcim_intx() in pci/devres.c as
a pure unmanaged pci_intx() as a means to split and then clean up the
APIs.

One path towards getting the hybrid behavior out of pci_intx() could be
to rename __pcim_intx() to pci_intx_unmanaged() and port everyone who
uses pci_enable_device() + pci_intx() to that version. That would be
better than to have a third version with a gfp_t argument.
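
In driver terms, the end state would look roughly like this (call sites
made up for illustration):

/* A driver that wants no devres keeps full manual control: */
pci_enable_device(pdev);
/* ... */
pci_intx_unmanaged(pdev, 1);	/* plain config space write */

/* A driver that wants the managed behavior says so explicitly: */
pcim_enable_device(pdev);
/* ... */
pcim_intx(pdev, 1);		/* devres-backed, restored on detach */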


P.

> 
> 
> > 
> > Alex
> > 
>
Damien Le Moal Sept. 6, 2024, 12:37 a.m. UTC | #22
On 9/5/24 16:13, Philipp Stanner wrote:
> On Thu, 2024-09-05 at 09:33 +0900, Damien Le Moal wrote:
>> On 2024/09/05 6:10, Alex Williamson wrote:
>>> On Wed, 4 Sep 2024 23:24:53 +0300
>>> Andy Shevchenko <andy.shevchenko@gmail.com> wrote:
>>>
>>>> Wed, Sep 04, 2024 at 12:07:21PM -0600, Alex Williamson kirjoitti:
>>>>> On Wed, 04 Sep 2024 15:37:25 +0200
>>>>> Philipp Stanner <pstanner@redhat.com> wrote:  
>>>>>> On Wed, 2024-09-04 at 17:25 +0900, Damien Le Moal wrote:  
>>>>
>>>> ...
>>>>
>>>>>> If vfio-pci can get rid of pci_intx() altogether, that might
>>>>>> be a good
>>>>>> thing. As far as I understood Andy Shevchenko, pci_intx() is
>>>>>> outdated.
>>>>>> There's only a handful of users anyway.
>>>>>
>>>>> What's the alternative?  
>>>>
>>>> From an API perspective, pci_alloc_irq_vectors() & Co should be
>>>> used.
>>>
>>> We can't replace a device level INTx control with a vector
>>> allocation
>>> function.
>>>  
>>>>> vfio-pci has a potentially unique requirement
>>>>> here, we don't know how to handle the device interrupt, we only
>>>>> forward
>>>>> it to the userspace driver.  As a level triggered interrupt,
>>>>> INTx will
>>>>> continue to assert until that userspace driver handles the
>>>>> device.
>>>>> That's obviously unacceptable from a host perspective, so INTx
>>>>> is
>>>>> masked at the device via pci_intx() where available, or at the
>>>>> interrupt controller otherwise.  The API with the userspace
>>>>> driver
>>>>> requires that driver to unmask the interrupt, again resulting
>>>>> in a call
>>>>> to pci_intx() or unmasking the interrupt controller, in order
>>>>> to receive
>>>>> further interrupts from the device.  Thanks,  
>>>>
>>>> I briefly read the discussion and if I understand it correctly
>>>> the problem here
>>>> is in the flow: when the above mentioned API is being called.
>>>> Hence it's design
>>>> (or architectural) level of issue and changing call from foo() to
>>>> bar() won't
>>>> magically make problem go away. But I might be mistaken.
>>>
>>> Certainly from a vector allocation standpoint we can change to
>>> whatever
>>> is preferred, but the direct INTx manipulation functions are a
>>> different thing entirely and afaik there's nothing else that can
>>> replace them at a low level, nor can we just get rid of our calls
>>> to
>>> pci_intx().  Thanks,
>>
>> But can these calls be moved out of the spinlock context ? If not,
>> then we need
>> to clarify that pci_intx() can be called from any context, which will
>> require
>> changing to a GFP_ATOMIC for the resource allocation, even if the use
>> case
>> cannot trigger the allocation. This is needed to ensure the
>> correctness of the
>> pci_intx() function use.
> 
> We could do that I guess. As I keep saying, it's not intended to have
> pci_intx() allocate _permanently_. This is a temporary situation.
> pci_intx() should have neither devres nor allocation.
> 
>> Frankly, I am surprised that the might-sleep splat you
>> got was not already reported before (fuzzing, static analyzers might
>> eventually
>> catch that though).
> 
> It's a super rare situation:
>  * pci_intx() has very few callers
>  * It only allocates if pcim_enable_device() instead of
>    pci_enable_device() ran.
>  * It only allocates when it's called for the FIRST TIME
>  * All of the above is only a problem while you hold a lock
> 
>>
>> The other solution would be a version of pci_intx() that has a gfp
>> flags
>> argument to allow callers to use the right gfp flags for the call
>> context.
> 
> I don't think that's a good idea. As I said, I want to clean up all
> that in the mid term.
> 
> As a matter of fact, there is already __pcim_intx() in pci/devres.c as
> a pure unmanaged pci_intx() as a means to split and then clean up the
> APIs.

Yeah. That naming is in fact confusing. __pcim_intx() should really be named
pci_intx()...

> One path towards getting the hybrid behavior out of pci_intx() could be
> to rename __pcim_intx() to pci_intx_unmanaged() and port everyone who
> uses pci_enable_device() + pci_intx() to that version. That would be
> better than to have a third version with a gfp_t argument.

Sounds good. But ideally, all users that rely on the managed variant should be
converted to use pcim_intx() and pci_intx() changed to not call into devres. But
that may be too much code churn (I have not checked).
Philipp Stanner Sept. 6, 2024, 6:45 a.m. UTC | #23
On Fri, 2024-09-06 at 09:37 +0900, Damien Le Moal wrote:
> On 9/5/24 16:13, Philipp Stanner wrote:
> > On Thu, 2024-09-05 at 09:33 +0900, Damien Le Moal wrote:
> > > On 2024/09/05 6:10, Alex Williamson wrote:
> > > > On Wed, 4 Sep 2024 23:24:53 +0300
> > > > Andy Shevchenko <andy.shevchenko@gmail.com> wrote:
> > > > 
> > > > > Wed, Sep 04, 2024 at 12:07:21PM -0600, Alex Williamson
> > > > > kirjoitti:
> > > > > > On Wed, 04 Sep 2024 15:37:25 +0200
> > > > > > Philipp Stanner <pstanner@redhat.com> wrote:  
> > > > > > > On Wed, 2024-09-04 at 17:25 +0900, Damien Le Moal wrote: 
> > > > > 
> > > > > ...
> > > > > 
> > > > > > > If vfio-pci can get rid of pci_intx() altogether, that
> > > > > > > might
> > > > > > > be a good
> > > > > > > thing. As far as I understood Andy Shevchenko, pci_intx()
> > > > > > > is
> > > > > > > outdated.
> > > > > > > There's only a handful of users anyway.
> > > > > > 
> > > > > > What's the alternative?  
> > > > > 
> > > > > From an API perspective, pci_alloc_irq_vectors() & Co should
> > > > > be
> > > > > used.
> > > > 
> > > > We can't replace a device level INTx control with a vector
> > > > allocation
> > > > function.
> > > >  
> > > > > > vfio-pci has a potentially unique requirement
> > > > > > here, we don't know how to handle the device interrupt, we
> > > > > > only
> > > > > > forward
> > > > > > it to the userspace driver.  As a level triggered
> > > > > > interrupt,
> > > > > > INTx will
> > > > > > continue to assert until that userspace driver handles the
> > > > > > device.
> > > > > > That's obviously unacceptable from a host perspective, so
> > > > > > INTx
> > > > > > is
> > > > > > masked at the device via pci_intx() where available, or at
> > > > > > the
> > > > > > interrupt controller otherwise.  The API with the userspace
> > > > > > driver
> > > > > > requires that driver to unmask the interrupt, again
> > > > > > resulting
> > > > > > in a call
> > > > > > to pci_intx() or unmasking the interrupt controller, in
> > > > > > order
> > > > > > to receive
> > > > > > further interrupts from the device.  Thanks,  
> > > > > 
> > > > > I briefly read the discussion and if I understand it
> > > > > correctly
> > > > > the problem here
> > > > > is in the flow: when the above mentioned API is being called.
> > > > > Hence it's design
> > > > > (or architectural) level of issue and changing call from
> > > > > foo() to
> > > > > bar() won't
> > > > > magically make problem go away. But I might be mistaken.
> > > > 
> > > > Certainly from a vector allocation standpoint we can change to
> > > > whatever
> > > > is preferred, but the direct INTx manipulation functions are a
> > > > different thing entirely and afaik there's nothing else that
> > > > can
> > > > replace them at a low level, nor can we just get rid of our
> > > > calls
> > > > to
> > > > pci_intx().  Thanks,
> > > 
> > > But can these calls be moved out of the spinlock context ? If
> > > not,
> > > then we need
> > > to clarify that pci_intx() can be called from any context, which
> > > will
> > > require
> > > changing to a GFP_ATOMIC for the resource allocation, even if the
> > > use
> > > case
> > > cannot trigger the allocation. This is needed to ensure the
> > > correctness of the
> > > pci_intx() function use.
> > 
> > We could do that I guess. As I keep saying, it's not intended to
> > have
> > pci_intx() allocate _permanently_. This is a temporary situation.
> > pci_intx() should have neither devres nor allocation.
> > 
> > > Frankly, I am surprised that the might-sleep splat you
> > > got was not already reported before (fuzzing, static analyzers
> > > might
> > > eventually
> > > catch that though).
> > 
> > It's a super rare situation:
> >  * pci_intx() has very few callers
> >  * It only allocates if pcim_enable_device() instead of
> >    pci_enable_device() ran.
> >  * It only allocates when it's called for the FIRST TIME
> >  * All of the above is only a problem while you hold a lock
> > 
> > > 
> > > The other solution would be a version of pci_intx() that has a
> > > gfp
> > > flags
> > > argument to allow callers to use the right gfp flags for the call
> > > context.
> > 
> > I don't think that's a good idea. As I said, I want to clean up all
> > that in the mid term.
> > 
> > As a matter of fact, there is already __pcim_intx() in pci/devres.c
> > as
> > a pure unmanaged pci_intx() as a means to split and then clean up
> > the
> > APIs.
> 
> Yeah. That naming is in fact confusing. __pcim_intx() should really
> be named
> pci_intx()...
> 
> > One path towards getting the hybrid behavior out of pci_intx()
> > could be
> > to rename __pcim_intx() to pci_intx_unmanaged() and port everyone
> > who
> > uses pci_enable_device() + pci_intx() to that version. That would
> > be
> > better than to have a third version with a gfp_t argument.
> 
> Sounds good. But ideally, all users that rely on the managed variant
> should be
> converted to use pcim_intx() and pci_intx() changed to not call into
> devres. But
> that may be too much code churn (I have not checked).

Exactly that is the plan.

I looked into that yesterday and my idea is to publish __pcim_intx()
under the name pci_intx_unmanaged(), then port everyone who doesn't
rely on managed behavior to that function and then port everyone who
does rely on it to pcim_intx() (as you suggest).
Afterwards you can remove pci_intx_unmanaged() again and also remove
the hybrid behavior from pci_intx().

It's doable with not that much code. Getting it merged might be
politically difficult, though.

The only thing I'm really a bit afraid of is the pci_intx() user in
pci/msi/ – MSI calls that through yet another implicit devres call,
pcim_msi_release().
I *suspect* that pci_intx() in MSI can just be made
pci_intx_unmanaged(), but that should be checked carefully. The doubly
implicit devres magic caused some trouble in the past already, as we
had discussed here [1].


P.

[1] https://lore.kernel.org/all/ee44ea7ac760e73edad3f20b30b4d2fff66c1a85.camel@redhat.com/

Patch

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index e3a49f66982d..ffaaca0978cb 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -4477,12 +4477,6 @@  void pci_intx(struct pci_dev *pdev, int enable)
 {
 	u16 pci_command, new;
 
-	/* Preserve the "hybrid" behavior for backwards compatibility */
-	if (pci_is_managed(pdev)) {
-		WARN_ON_ONCE(pcim_intx(pdev, enable) != 0);
-		return;
-	}
-
 	pci_read_config_word(pdev, PCI_COMMAND, &pci_command);
 
 	if (enable)
@@ -4490,8 +4484,15 @@  void pci_intx(struct pci_dev *pdev, int enable)
 	else
 		new = pci_command | PCI_COMMAND_INTX_DISABLE;
 
-	if (new != pci_command)
+	if (new != pci_command) {
+		/* Preserve the "hybrid" behavior for backwards compatibility */
+		if (pci_is_managed(pdev)) {
+			WARN_ON_ONCE(pcim_intx(pdev, enable) != 0);
+			return;
+		}
+
 		pci_write_config_word(pdev, PCI_COMMAND, new);
+	}
 }
 EXPORT_SYMBOL_GPL(pci_intx);