Message ID | 20240802183327.1309020-1-amachhiw@linux.ibm.com |
---|---|
State | New |
Headers | show |
Series | [v3] PCI: Fix crash during pci_dev hot-unplug on pseries KVM guest | expand |
On Sat, Aug 03, 2024 at 12:03:25AM +0530, Amit Machhiwal wrote: > With CONFIG_PCI_DYNAMIC_OF_NODES [1], a hot-plug and hot-unplug sequence > of a PCI device attached to a PCI-bridge causes following kernel Oops on > a pseries KVM guest: What is unique about pseries here? There's nothing specific to pseries in the patch, so I would expect this to be a generic problem on any arch. > RTAS: event: 2, Type: Hotplug Event (229), Severity: 1 > Kernel attempted to read user page (10ec00000048) - exploit attempt? (uid: 0) > BUG: Unable to handle kernel data access on read at 0x10ec00000048 Weird address. I would expect NULL or something. Where did this non-NULL pointer come from? > Faulting instruction address: 0xc0000000012d8728 > Oops: Kernel access of bad area, sig: 11 [#1] > LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries > <snip> > NIP [c0000000012d8728] __of_changeset_entry_invert+0x10/0x1ac > LR [c0000000012da7f0] __of_changeset_revert_entries+0x98/0x180 > Call Trace: > [c00000000bcc3970] [c0000000012daa60] of_changeset_revert+0x58/0xd8 > [c00000000bcc39c0] [c000000000d0ed78] of_pci_remove_node+0x74/0xb0 > [c00000000bcc39f0] [c000000000cdcfe0] pci_stop_bus_device+0xf4/0x138 > [c00000000bcc3a30] [c000000000cdd140] pci_stop_and_remove_bus_device_locked+0x34/0x64 > [c00000000bcc3a60] [c000000000cf3780] remove_store+0xf0/0x108 > [c00000000bcc3ab0] [c000000000e89e04] dev_attr_store+0x34/0x78 > [c00000000bcc3ad0] [c0000000007f8dd4] sysfs_kf_write+0x70/0xa4 > [c00000000bcc3af0] [c0000000007f7248] kernfs_fop_write_iter+0x1d0/0x2e0 > [c00000000bcc3b40] [c0000000006c9b08] vfs_write+0x27c/0x558 > [c00000000bcc3bf0] [c0000000006ca168] ksys_write+0x90/0x170 > [c00000000bcc3c40] [c000000000033248] system_call_exception+0xf8/0x290 > [c00000000bcc3e50] [c00000000000d05c] system_call_vectored_common+0x15c/0x2ec > <snip> > > A git bisect pointed this regression to be introduced via [1] that added > a mechanism to create device tree nodes for parent PCI bridges when a > PCI device is hot-plugged. > > The Oops is caused when `pci_stop_dev()` tries to remove a non-existing > device-tree node associated with the pci_dev that was earlier > hot-plugged and was attached under a pci-bridge. The PCI dev header > `dev->hdr_type` being 0, results a conditional check done with > `pci_is_bridge()` into false. Consequently, a call to > `of_pci_make_dev_node()` to create a device node is never made. When at > a later point in time, in the device node removal path, a memcpy is > attempted in `__of_changeset_entry_invert()`; since the device node was > never created, results in an Oops due to kernel read access to a bad > address. I'm sure this description is 100% correct, but it's at such a low level that it doesn't really help understand the underlying design problem. Will need an ack from Rob. > To fix this issue, the patch introduces a new flag OF_CREATE_WITH_CSET > to keep track of device nodes created via `of_pci_make_dev_node()` and > later attempt to destroy only such device nodes which have this flag > set. > > [1] commit 407d1a51921e ("PCI: Create device tree node for bridge") > > Fixes: 407d1a51921e ("PCI: Create device tree node for bridge") > Reported-by: Kowshik Jois B S <kowsjois@linux.ibm.com> > Signed-off-by: Lizhi Hou <lizhi.hou@amd.com> > Signed-off-by: Amit Machhiwal <amachhiw@linux.ibm.com> > --- > Changes since v2: > * Drop v2 changes and introduce a different approach from Lizhi discussed > over the v2 of this patch > * V2: https://lore.kernel.org/all/20240715080726.2496198-1-amachhiw@linux.ibm.com/ > Changes since v1: > * Included Lizhi's suggested changes on V1 > * Fixed below two warnings from Lizhi's changes and rearranged the cleanup > part a bit in `of_pci_make_dev_node` > drivers/pci/of.c:611:6: warning: no previous prototype for ‘of_pci_free_node’ [-Wmissing-prototypes] > 611 | void of_pci_free_node(struct device_node *np) > | ^~~~~~~~~~~~~~~~ > drivers/pci/of.c: In function ‘of_pci_make_dev_node’: > drivers/pci/of.c:696:1: warning: label ‘out_destroy_cset’ defined but not used [-Wunused-label] > 696 | out_destroy_cset: > | ^~~~~~~~~~~~~~~~ > * V1: https://lore.kernel.org/all/20240703141634.2974589-1-amachhiw@linux.ibm.com/ > > drivers/pci/of.c | 3 ++- > include/linux/of.h | 1 + > 2 files changed, 3 insertions(+), 1 deletion(-) > > diff --git a/drivers/pci/of.c b/drivers/pci/of.c > index dacea3fc5128..bc455370143e 100644 > --- a/drivers/pci/of.c > +++ b/drivers/pci/of.c > @@ -653,7 +653,7 @@ void of_pci_remove_node(struct pci_dev *pdev) > struct device_node *np; > > np = pci_device_to_OF_node(pdev); > - if (!np || !of_node_check_flag(np, OF_DYNAMIC)) > + if (!np || !of_node_check_flag(np, OF_CREATE_WITH_CSET)) > return; > pdev->dev.of_node = NULL; of_pci_remove_node() goes on to call of_changeset_revert() and of_changeset_destroy(). of_pci_remove_node() has nothing PCI-specific in it. There are a few other callers of of_changeset_destroy(), but they don't look anything like of_pci_remove_node(). Seems like there should be some similarity across the callers, so it makes me a little nervous that there isn't. > @@ -712,6 +712,7 @@ void of_pci_make_dev_node(struct pci_dev *pdev) > if (ret) > goto out_free_node; > > + of_node_set_flag(np, OF_CREATE_WITH_CSET); > np->data = cset; > pdev->dev.of_node = np; > kfree(name); > diff --git a/include/linux/of.h b/include/linux/of.h > index 85b60ac9eec5..5faa5a1198c6 100644 > --- a/include/linux/of.h > +++ b/include/linux/of.h > @@ -153,6 +153,7 @@ extern struct device_node *of_stdout; > #define OF_POPULATED_BUS 4 /* platform bus created for children */ > #define OF_OVERLAY 5 /* allocated for an overlay */ > #define OF_OVERLAY_FREE_CSET 6 /* in overlay cset being freed */ > +#define OF_CREATE_WITH_CSET 7 /* Created by of_changeset_create_node */ Follow existing capitalization style.
On Tue, Aug 06, 2024 at 03:00:59PM -0500, Bjorn Helgaas wrote: > On Sat, Aug 03, 2024 at 12:03:25AM +0530, Amit Machhiwal wrote: > > With CONFIG_PCI_DYNAMIC_OF_NODES [1], a hot-plug and hot-unplug sequence > > of a PCI device attached to a PCI-bridge causes following kernel Oops on > > a pseries KVM guest: > > What is unique about pseries here? There's nothing specific to > pseries in the patch, so I would expect this to be a generic problem > on any arch. Only pseries does PCI hotplug with DT I think. It would happen if another system did though, but I think documenting the exact system is good. No reason we can't do both though. Rob
On Sat, 03 Aug 2024 00:03:25 +0530, Amit Machhiwal wrote: > With CONFIG_PCI_DYNAMIC_OF_NODES [1], a hot-plug and hot-unplug sequence > of a PCI device attached to a PCI-bridge causes following kernel Oops on > a pseries KVM guest: > > RTAS: event: 2, Type: Hotplug Event (229), Severity: 1 > Kernel attempted to read user page (10ec00000048) - exploit attempt? (uid: 0) > BUG: Unable to handle kernel data access on read at 0x10ec00000048 > Faulting instruction address: 0xc0000000012d8728 > Oops: Kernel access of bad area, sig: 11 [#1] > LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries > <snip> > NIP [c0000000012d8728] __of_changeset_entry_invert+0x10/0x1ac > LR [c0000000012da7f0] __of_changeset_revert_entries+0x98/0x180 > Call Trace: > [c00000000bcc3970] [c0000000012daa60] of_changeset_revert+0x58/0xd8 > [c00000000bcc39c0] [c000000000d0ed78] of_pci_remove_node+0x74/0xb0 > [c00000000bcc39f0] [c000000000cdcfe0] pci_stop_bus_device+0xf4/0x138 > [c00000000bcc3a30] [c000000000cdd140] pci_stop_and_remove_bus_device_locked+0x34/0x64 > [c00000000bcc3a60] [c000000000cf3780] remove_store+0xf0/0x108 > [c00000000bcc3ab0] [c000000000e89e04] dev_attr_store+0x34/0x78 > [c00000000bcc3ad0] [c0000000007f8dd4] sysfs_kf_write+0x70/0xa4 > [c00000000bcc3af0] [c0000000007f7248] kernfs_fop_write_iter+0x1d0/0x2e0 > [c00000000bcc3b40] [c0000000006c9b08] vfs_write+0x27c/0x558 > [c00000000bcc3bf0] [c0000000006ca168] ksys_write+0x90/0x170 > [c00000000bcc3c40] [c000000000033248] system_call_exception+0xf8/0x290 > [c00000000bcc3e50] [c00000000000d05c] system_call_vectored_common+0x15c/0x2ec > <snip> > > A git bisect pointed this regression to be introduced via [1] that added > a mechanism to create device tree nodes for parent PCI bridges when a > PCI device is hot-plugged. > > The Oops is caused when `pci_stop_dev()` tries to remove a non-existing > device-tree node associated with the pci_dev that was earlier > hot-plugged and was attached under a pci-bridge. The PCI dev header > `dev->hdr_type` being 0, results a conditional check done with > `pci_is_bridge()` into false. Consequently, a call to > `of_pci_make_dev_node()` to create a device node is never made. When at > a later point in time, in the device node removal path, a memcpy is > attempted in `__of_changeset_entry_invert()`; since the device node was > never created, results in an Oops due to kernel read access to a bad > address. > > To fix this issue, the patch introduces a new flag OF_CREATE_WITH_CSET > to keep track of device nodes created via `of_pci_make_dev_node()` and > later attempt to destroy only such device nodes which have this flag > set. > > [1] commit 407d1a51921e ("PCI: Create device tree node for bridge") > > Fixes: 407d1a51921e ("PCI: Create device tree node for bridge") > Reported-by: Kowshik Jois B S <kowsjois@linux.ibm.com> > Signed-off-by: Lizhi Hou <lizhi.hou@amd.com> > Signed-off-by: Amit Machhiwal <amachhiw@linux.ibm.com> > --- > Changes since v2: > * Drop v2 changes and introduce a different approach from Lizhi discussed > over the v2 of this patch > * V2: https://lore.kernel.org/all/20240715080726.2496198-1-amachhiw@linux.ibm.com/ > Changes since v1: > * Included Lizhi's suggested changes on V1 > * Fixed below two warnings from Lizhi's changes and rearranged the cleanup > part a bit in `of_pci_make_dev_node` > drivers/pci/of.c:611:6: warning: no previous prototype for ‘of_pci_free_node’ [-Wmissing-prototypes] > 611 | void of_pci_free_node(struct device_node *np) > | ^~~~~~~~~~~~~~~~ > drivers/pci/of.c: In function ‘of_pci_make_dev_node’: > drivers/pci/of.c:696:1: warning: label ‘out_destroy_cset’ defined but not used [-Wunused-label] > 696 | out_destroy_cset: > | ^~~~~~~~~~~~~~~~ > * V1: https://lore.kernel.org/all/20240703141634.2974589-1-amachhiw@linux.ibm.com/ > > drivers/pci/of.c | 3 ++- > include/linux/of.h | 1 + > 2 files changed, 3 insertions(+), 1 deletion(-) > Reviewed-by: Rob Herring (Arm) <robh@kernel.org>
Bjorn Helgaas <helgaas@kernel.org> writes: > On Sat, Aug 03, 2024 at 12:03:25AM +0530, Amit Machhiwal wrote: >> With CONFIG_PCI_DYNAMIC_OF_NODES [1], a hot-plug and hot-unplug sequence >> of a PCI device attached to a PCI-bridge causes following kernel Oops on >> a pseries KVM guest: > > What is unique about pseries here? There's nothing specific to > pseries in the patch, so I would expect this to be a generic problem > on any arch. > >> RTAS: event: 2, Type: Hotplug Event (229), Severity: 1 >> Kernel attempted to read user page (10ec00000048) - exploit attempt? (uid: 0) >> BUG: Unable to handle kernel data access on read at 0x10ec00000048 > > Weird address. I would expect NULL or something. Where did this > non-NULL pointer come from? It originally comes from np->data, which is supposed to be an of_changeset. The powerpc code also uses np->data for the struct pci_dn pointer, see pci_add_device_node_info(). I wonder if that's why it's non-NULL? Amit, do we have exact steps to reproduce this? I poked around a bit but couldn't get it to trigger. cheers
Hi Michael, On 2024/08/15 01:20 PM, Michael Ellerman wrote: > Bjorn Helgaas <helgaas@kernel.org> writes: > > On Sat, Aug 03, 2024 at 12:03:25AM +0530, Amit Machhiwal wrote: > >> With CONFIG_PCI_DYNAMIC_OF_NODES [1], a hot-plug and hot-unplug sequence > >> of a PCI device attached to a PCI-bridge causes following kernel Oops on > >> a pseries KVM guest: > > > > What is unique about pseries here? There's nothing specific to > > pseries in the patch, so I would expect this to be a generic problem > > on any arch. > > > >> RTAS: event: 2, Type: Hotplug Event (229), Severity: 1 > >> Kernel attempted to read user page (10ec00000048) - exploit attempt? (uid: 0) > >> BUG: Unable to handle kernel data access on read at 0x10ec00000048 > > > > Weird address. I would expect NULL or something. Where did this > > non-NULL pointer come from? > > It originally comes from np->data, which is supposed to be an > of_changeset. > > The powerpc code also uses np->data for the struct pci_dn pointer, see > pci_add_device_node_info(). > > I wonder if that's why it's non-NULL? I'm also looking into the code to figure out where's that value coming from. I will update as soon as I get there. > > Amit, do we have exact steps to reproduce this? I poked around a bit but > couldn't get it to trigger. Sure, below are the steps: 1. Set CONFIG_PCI_DYNAMIC_OF_NODES=y in the kernel config and compile (Fedora has it disabled in it's distro config, Ubuntu has it enabled but will have it disabled in the next update) 2. If you are using Fedora cloud images, make sure you've these packages installed: $ rpm -qa | grep -e 'ppc64-diag\|powerpc-utils' powerpc-utils-core-1.3.11-6.fc40.ppc64le powerpc-utils-1.3.11-6.fc40.ppc64le ppc64-diag-rtas-2.7.9-6.fc40.ppc64le ppc64-diag-2.7.9-6.fc40.ppc64le 3. Hotplug a pci device as follows: virsh attach-interface <domain_name> bridge --source virbr0 4. Check if the pci device was added by running `ip a s` 5. Try hot-unplug of that device by supplying the MAC, which should trigger the Oops virsh detach-interface <domain_name> bridge <mac_addr> Thanks, Amit > cheers
Amit Machhiwal <amachhiw@linux.ibm.com> writes: > On 2024/08/15 01:20 PM, Michael Ellerman wrote: >> Bjorn Helgaas <helgaas@kernel.org> writes: >> > On Sat, Aug 03, 2024 at 12:03:25AM +0530, Amit Machhiwal wrote: >> >> With CONFIG_PCI_DYNAMIC_OF_NODES [1], a hot-plug and hot-unplug sequence >> >> of a PCI device attached to a PCI-bridge causes following kernel Oops on >> >> a pseries KVM guest: >> > >> > What is unique about pseries here? There's nothing specific to >> > pseries in the patch, so I would expect this to be a generic problem >> > on any arch. >> > >> >> RTAS: event: 2, Type: Hotplug Event (229), Severity: 1 >> >> Kernel attempted to read user page (10ec00000048) - exploit attempt? (uid: 0) >> >> BUG: Unable to handle kernel data access on read at 0x10ec00000048 >> > >> > Weird address. I would expect NULL or something. Where did this >> > non-NULL pointer come from? >> >> It originally comes from np->data, which is supposed to be an >> of_changeset. >> >> The powerpc code also uses np->data for the struct pci_dn pointer, see >> pci_add_device_node_info(). >> >> I wonder if that's why it's non-NULL? > > I'm also looking into the code to figure out where's that value coming from. I > will update as soon as I get there. Thanks. >> Amit, do we have exact steps to reproduce this? I poked around a bit but >> couldn't get it to trigger. > > Sure, below are the steps: > > 1. Set CONFIG_PCI_DYNAMIC_OF_NODES=y in the kernel config and compile (Fedora > has it disabled in it's distro config, Ubuntu has it enabled but will have it > disabled in the next update) > > 2. If you are using Fedora cloud images, make sure you've these packages > installed: > $ rpm -qa | grep -e 'ppc64-diag\|powerpc-utils' > powerpc-utils-core-1.3.11-6.fc40.ppc64le > powerpc-utils-1.3.11-6.fc40.ppc64le > ppc64-diag-rtas-2.7.9-6.fc40.ppc64le > ppc64-diag-2.7.9-6.fc40.ppc64le > > 3. Hotplug a pci device as follows: > virsh attach-interface <domain_name> bridge --source virbr0 I don't use virsh :) Any idea how to do it with just qemu monitor commands? cheers
Hi Michael, On 2024/08/17 08:59 AM, Michael Ellerman wrote: > Amit Machhiwal <amachhiw@linux.ibm.com> writes: > > On 2024/08/15 01:20 PM, Michael Ellerman wrote: > >> Bjorn Helgaas <helgaas@kernel.org> writes: > >> > On Sat, Aug 03, 2024 at 12:03:25AM +0530, Amit Machhiwal wrote: > >> >> With CONFIG_PCI_DYNAMIC_OF_NODES [1], a hot-plug and hot-unplug sequence > >> >> of a PCI device attached to a PCI-bridge causes following kernel Oops on > >> >> a pseries KVM guest: > >> > > >> > What is unique about pseries here? There's nothing specific to > >> > pseries in the patch, so I would expect this to be a generic problem > >> > on any arch. > >> > > >> >> RTAS: event: 2, Type: Hotplug Event (229), Severity: 1 > >> >> Kernel attempted to read user page (10ec00000048) - exploit attempt? (uid: 0) > >> >> BUG: Unable to handle kernel data access on read at 0x10ec00000048 > >> > > >> > Weird address. I would expect NULL or something. Where did this > >> > non-NULL pointer come from? > >> > >> It originally comes from np->data, which is supposed to be an > >> of_changeset. > >> > >> The powerpc code also uses np->data for the struct pci_dn pointer, see > >> pci_add_device_node_info(). > >> > >> I wonder if that's why it's non-NULL? > > > > I'm also looking into the code to figure out where's that value coming from. I > > will update as soon as I get there. > > Thanks. > > >> Amit, do we have exact steps to reproduce this? I poked around a bit but > >> couldn't get it to trigger. > > > > Sure, below are the steps: > > > > 1. Set CONFIG_PCI_DYNAMIC_OF_NODES=y in the kernel config and compile (Fedora > > has it disabled in it's distro config, Ubuntu has it enabled but will have it > > disabled in the next update) > > > > 2. If you are using Fedora cloud images, make sure you've these packages > > installed: > > $ rpm -qa | grep -e 'ppc64-diag\|powerpc-utils' > > powerpc-utils-core-1.3.11-6.fc40.ppc64le > > powerpc-utils-1.3.11-6.fc40.ppc64le > > ppc64-diag-rtas-2.7.9-6.fc40.ppc64le > > ppc64-diag-2.7.9-6.fc40.ppc64le > > > > 3. Hotplug a pci device as follows: > > virsh attach-interface <domain_name> bridge --source virbr0 > > I don't use virsh :) No worries. Fortunately, we do have a way to do it with qemu monitor. > > Any idea how to do it with just qemu monitor commands? > 1. Boot the guest with the below included in the qemu cmdline: -netdev bridge,id=<net_name>,br=virbr0,helper=/usr/libexec/qemu-bridge-helper 2. Once the guest boots, run the below command on qemu monitor to hot-plug a pci device: device_add rtl8139,netdev=<net_name>,mac=52:54:00:88:31:28,id=<net_id> dmesg ===== [ 116.968210] pci 0000:00:01.0: [10ec:8139] type 00 class 0x020000 conventional PCI endpoint [ 116.969260] pci 0000:00:01.0: BAR 0 [io 0x10000-0x100ff] [ 116.969904] pci 0000:00:01.0: BAR 1 [mem 0x00000000-0x000000ff] [ 116.970745] pci 0000:00:01.0: ROM [mem 0x00000000-0x0003ffff pref] [ 116.971456] pci 0000:00:01.0: No hypervisor support for SR-IOV on this device, IOV BARs disabled. [ 116.972583] pci 0000:00:01.0: Adding to iommu group 0 [ 116.978466] pci 0000:00:01.0: ROM [mem 0x200080080000-0x2000800bffff pref]: assigned [ 116.979347] pci 0000:00:01.0: BAR 0 [io 0x10400-0x104ff]: assigned [ 116.980063] pci 0000:00:01.0: BAR 1 [mem 0x200080001000-0x2000800010ff]: assigned [ 117.017187] 8139cp: 8139cp: 10/100 PCI Ethernet driver v1.3 (Mar 22, 2004) [ 117.018577] 8139cp 0000:00:01.0: enabling device (0000 -> 0003) [ 117.025414] 8139cp 0000:00:01.0 eth1: RTL-8139C+ at 0x00000000fbf09e59, 52:54:00:88:31:28, IRQ 26 [ 117.051028] 8139too: 8139too Fast Ethernet driver 0.9.28 [ 117.076577] 8139cp 0000:00:01.0 eth1: link up, 100Mbps, full-duplex, lpa 0x05E1 3. Try hot-unplug of the device to recreate the kernel Oops. device_del <net_id> Thanks, Amit > cheers
diff --git a/drivers/pci/of.c b/drivers/pci/of.c index dacea3fc5128..bc455370143e 100644 --- a/drivers/pci/of.c +++ b/drivers/pci/of.c @@ -653,7 +653,7 @@ void of_pci_remove_node(struct pci_dev *pdev) struct device_node *np; np = pci_device_to_OF_node(pdev); - if (!np || !of_node_check_flag(np, OF_DYNAMIC)) + if (!np || !of_node_check_flag(np, OF_CREATE_WITH_CSET)) return; pdev->dev.of_node = NULL; @@ -712,6 +712,7 @@ void of_pci_make_dev_node(struct pci_dev *pdev) if (ret) goto out_free_node; + of_node_set_flag(np, OF_CREATE_WITH_CSET); np->data = cset; pdev->dev.of_node = np; kfree(name); diff --git a/include/linux/of.h b/include/linux/of.h index 85b60ac9eec5..5faa5a1198c6 100644 --- a/include/linux/of.h +++ b/include/linux/of.h @@ -153,6 +153,7 @@ extern struct device_node *of_stdout; #define OF_POPULATED_BUS 4 /* platform bus created for children */ #define OF_OVERLAY 5 /* allocated for an overlay */ #define OF_OVERLAY_FREE_CSET 6 /* in overlay cset being freed */ +#define OF_CREATE_WITH_CSET 7 /* Created by of_changeset_create_node */ #define OF_BAD_ADDR ((u64)-1)