diff mbox series

PCI: Fix crash during pci_dev hot-unplug on pseries KVM guest

Message ID 20240703141634.2974589-1-amachhiw@linux.ibm.com (mailing list archive)
State Handled Elsewhere, archived
Headers show
Series PCI: Fix crash during pci_dev hot-unplug on pseries KVM guest | expand

Checks

Context Check Description
snowpatch_ozlabs/github-powerpc_sparse success Successfully ran 4 jobs.
snowpatch_ozlabs/github-powerpc_clang success Successfully ran 5 jobs.
snowpatch_ozlabs/github-powerpc_kernel_qemu success Successfully ran 21 jobs.

Commit Message

Amit Machhiwal July 3, 2024, 2:16 p.m. UTC
With CONFIG_PCI_DYNAMIC_OF_NODES [1], a hot-plug and hot-unplug sequence
of a PCI device attached to a PCI-bridge causes following kernel Oops on
a pseries KVM guest:

 RTAS: event: 2, Type: Hotplug Event (229), Severity: 1
 Kernel attempted to read user page (10ec00000048) - exploit attempt? (uid: 0)
 BUG: Unable to handle kernel data access on read at 0x10ec00000048
 Faulting instruction address: 0xc0000000012d8728
 Oops: Kernel access of bad area, sig: 11 [#1]
 LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries
<snip>
 NIP [c0000000012d8728] __of_changeset_entry_invert+0x10/0x1ac
 LR [c0000000012da7f0] __of_changeset_revert_entries+0x98/0x180
 Call Trace:
 [c00000000bcc3970] [c0000000012daa60] of_changeset_revert+0x58/0xd8
 [c00000000bcc39c0] [c000000000d0ed78] of_pci_remove_node+0x74/0xb0
 [c00000000bcc39f0] [c000000000cdcfe0] pci_stop_bus_device+0xf4/0x138
 [c00000000bcc3a30] [c000000000cdd140] pci_stop_and_remove_bus_device_locked+0x34/0x64
 [c00000000bcc3a60] [c000000000cf3780] remove_store+0xf0/0x108
 [c00000000bcc3ab0] [c000000000e89e04] dev_attr_store+0x34/0x78
 [c00000000bcc3ad0] [c0000000007f8dd4] sysfs_kf_write+0x70/0xa4
 [c00000000bcc3af0] [c0000000007f7248] kernfs_fop_write_iter+0x1d0/0x2e0
 [c00000000bcc3b40] [c0000000006c9b08] vfs_write+0x27c/0x558
 [c00000000bcc3bf0] [c0000000006ca168] ksys_write+0x90/0x170
 [c00000000bcc3c40] [c000000000033248] system_call_exception+0xf8/0x290
 [c00000000bcc3e50] [c00000000000d05c] system_call_vectored_common+0x15c/0x2ec
<snip>

A git bisect pointed this regression to be introduced via [1] that added
a mechanism to create device tree nodes for parent PCI bridges when a
PCI device is hot-plugged.

The Oops is caused when `pci_stop_dev()` tries to remove a non-existing
device-tree node associated with the pci_dev that was earlier
hot-plugged and was attached under a pci-bridge. The PCI dev header
`dev->hdr_type` being 0, results a conditional check done with
`pci_is_bridge()` into false. Consequently, a call to
`of_pci_make_dev_node()` to create a device node is never made. When at
a later point in time, in the device node removal path, a memcpy is
attempted in `__of_changeset_entry_invert()`; since the device node was
never created, results in an Oops due to kernel read access to a bad
address.

To fix this issue the patch updates `pci_stop_dev()` to ensure that a
call to `of_pci_remove_node()` is only made for pci-bridge devices.

[1] commit 407d1a51921e ("PCI: Create device tree node for bridge")

Fixes: 407d1a51921e ("PCI: Create device tree node for bridge")
Reported-by: Kowshik Jois B S <kowsjois@linux.ibm.com>
Tested-by: Kowshik Jois B S <kowsjois@linux.ibm.com>
Signed-off-by: Amit Machhiwal <amachhiw@linux.ibm.com>
---
 drivers/pci/remove.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)


base-commit: e9d22f7a6655941fc8b2b942ed354ec780936b3e

Comments

Bjorn Helgaas July 5, 2024, 7:20 p.m. UTC | #1
[+cc Lukas, FYI]

On Wed, Jul 03, 2024 at 07:46:34PM +0530, Amit Machhiwal wrote:
> With CONFIG_PCI_DYNAMIC_OF_NODES [1], a hot-plug and hot-unplug sequence
> of a PCI device attached to a PCI-bridge causes following kernel Oops on
> a pseries KVM guest:
> 
>  RTAS: event: 2, Type: Hotplug Event (229), Severity: 1
>  Kernel attempted to read user page (10ec00000048) - exploit attempt? (uid: 0)
>  BUG: Unable to handle kernel data access on read at 0x10ec00000048
>  Faulting instruction address: 0xc0000000012d8728
>  Oops: Kernel access of bad area, sig: 11 [#1]
>  LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries
> <snip>
>  NIP [c0000000012d8728] __of_changeset_entry_invert+0x10/0x1ac
>  LR [c0000000012da7f0] __of_changeset_revert_entries+0x98/0x180
>  Call Trace:
>  [c00000000bcc3970] [c0000000012daa60] of_changeset_revert+0x58/0xd8
>  [c00000000bcc39c0] [c000000000d0ed78] of_pci_remove_node+0x74/0xb0
>  [c00000000bcc39f0] [c000000000cdcfe0] pci_stop_bus_device+0xf4/0x138
>  [c00000000bcc3a30] [c000000000cdd140] pci_stop_and_remove_bus_device_locked+0x34/0x64
>  [c00000000bcc3a60] [c000000000cf3780] remove_store+0xf0/0x108
>  [c00000000bcc3ab0] [c000000000e89e04] dev_attr_store+0x34/0x78
>  [c00000000bcc3ad0] [c0000000007f8dd4] sysfs_kf_write+0x70/0xa4
>  [c00000000bcc3af0] [c0000000007f7248] kernfs_fop_write_iter+0x1d0/0x2e0
>  [c00000000bcc3b40] [c0000000006c9b08] vfs_write+0x27c/0x558
>  [c00000000bcc3bf0] [c0000000006ca168] ksys_write+0x90/0x170
>  [c00000000bcc3c40] [c000000000033248] system_call_exception+0xf8/0x290
>  [c00000000bcc3e50] [c00000000000d05c] system_call_vectored_common+0x15c/0x2ec
> <snip>
> 
> A git bisect pointed this regression to be introduced via [1] that added
> a mechanism to create device tree nodes for parent PCI bridges when a
> PCI device is hot-plugged.
> 
> The Oops is caused when `pci_stop_dev()` tries to remove a non-existing
> device-tree node associated with the pci_dev that was earlier
> hot-plugged and was attached under a pci-bridge. The PCI dev header
> `dev->hdr_type` being 0, results a conditional check done with
> `pci_is_bridge()` into false. Consequently, a call to
> `of_pci_make_dev_node()` to create a device node is never made. When at
> a later point in time, in the device node removal path, a memcpy is
> attempted in `__of_changeset_entry_invert()`; since the device node was
> never created, results in an Oops due to kernel read access to a bad
> address.
> 
> To fix this issue the patch updates `pci_stop_dev()` to ensure that a
> call to `of_pci_remove_node()` is only made for pci-bridge devices.
> 
> [1] commit 407d1a51921e ("PCI: Create device tree node for bridge")
> 
> Fixes: 407d1a51921e ("PCI: Create device tree node for bridge")
> Reported-by: Kowshik Jois B S <kowsjois@linux.ibm.com>
> Tested-by: Kowshik Jois B S <kowsjois@linux.ibm.com>
> Signed-off-by: Amit Machhiwal <amachhiw@linux.ibm.com>

Thanks for the patch and testing!  Would like a reviewed-by from
Lizhi.

> ---
>  drivers/pci/remove.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/pci/remove.c b/drivers/pci/remove.c
> index d749ea8250d6..4e51c64af416 100644
> --- a/drivers/pci/remove.c
> +++ b/drivers/pci/remove.c
> @@ -22,7 +22,8 @@ static void pci_stop_dev(struct pci_dev *dev)
>  		device_release_driver(&dev->dev);
>  		pci_proc_detach_device(dev);
>  		pci_remove_sysfs_dev_files(dev);
> -		of_pci_remove_node(dev);
> +		if (pci_is_bridge(dev))
> +			of_pci_remove_node(dev);

IIUC, this basically undoes the work that was done by
of_pci_make_dev_node().

The call of of_pci_make_dev_node() from pci_bus_add_device() was added
by 407d1a51921e and is conditional on pci_is_bridge(), so it makes
sense to me that the remove needs a similar condition.

>  		pci_dev_assign_added(dev, false);
>  	}
> 
> base-commit: e9d22f7a6655941fc8b2b942ed354ec780936b3e
> -- 
> 2.45.2
>
Lizhi Hou July 11, 2024, 4:48 a.m. UTC | #2
On 7/5/24 12:20, Bjorn Helgaas wrote:
> [+cc Lukas, FYI]
>
> On Wed, Jul 03, 2024 at 07:46:34PM +0530, Amit Machhiwal wrote:
>> With CONFIG_PCI_DYNAMIC_OF_NODES [1], a hot-plug and hot-unplug sequence
>> of a PCI device attached to a PCI-bridge causes following kernel Oops on
>> a pseries KVM guest:
>>
>>   RTAS: event: 2, Type: Hotplug Event (229), Severity: 1
>>   Kernel attempted to read user page (10ec00000048) - exploit attempt? (uid: 0)
>>   BUG: Unable to handle kernel data access on read at 0x10ec00000048
>>   Faulting instruction address: 0xc0000000012d8728
>>   Oops: Kernel access of bad area, sig: 11 [#1]
>>   LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries
>> <snip>
>>   NIP [c0000000012d8728] __of_changeset_entry_invert+0x10/0x1ac
>>   LR [c0000000012da7f0] __of_changeset_revert_entries+0x98/0x180
>>   Call Trace:
>>   [c00000000bcc3970] [c0000000012daa60] of_changeset_revert+0x58/0xd8
>>   [c00000000bcc39c0] [c000000000d0ed78] of_pci_remove_node+0x74/0xb0
>>   [c00000000bcc39f0] [c000000000cdcfe0] pci_stop_bus_device+0xf4/0x138
>>   [c00000000bcc3a30] [c000000000cdd140] pci_stop_and_remove_bus_device_locked+0x34/0x64
>>   [c00000000bcc3a60] [c000000000cf3780] remove_store+0xf0/0x108
>>   [c00000000bcc3ab0] [c000000000e89e04] dev_attr_store+0x34/0x78
>>   [c00000000bcc3ad0] [c0000000007f8dd4] sysfs_kf_write+0x70/0xa4
>>   [c00000000bcc3af0] [c0000000007f7248] kernfs_fop_write_iter+0x1d0/0x2e0
>>   [c00000000bcc3b40] [c0000000006c9b08] vfs_write+0x27c/0x558
>>   [c00000000bcc3bf0] [c0000000006ca168] ksys_write+0x90/0x170
>>   [c00000000bcc3c40] [c000000000033248] system_call_exception+0xf8/0x290
>>   [c00000000bcc3e50] [c00000000000d05c] system_call_vectored_common+0x15c/0x2ec
>> <snip>
>>
>> A git bisect pointed this regression to be introduced via [1] that added
>> a mechanism to create device tree nodes for parent PCI bridges when a
>> PCI device is hot-plugged.
>>
>> The Oops is caused when `pci_stop_dev()` tries to remove a non-existing
>> device-tree node associated with the pci_dev that was earlier
>> hot-plugged and was attached under a pci-bridge. The PCI dev header
>> `dev->hdr_type` being 0, results a conditional check done with
>> `pci_is_bridge()` into false. Consequently, a call to
>> `of_pci_make_dev_node()` to create a device node is never made. When at
>> a later point in time, in the device node removal path, a memcpy is
>> attempted in `__of_changeset_entry_invert()`; since the device node was
>> never created, results in an Oops due to kernel read access to a bad
>> address.
>>
>> To fix this issue the patch updates `pci_stop_dev()` to ensure that a
>> call to `of_pci_remove_node()` is only made for pci-bridge devices.
>>
>> [1] commit 407d1a51921e ("PCI: Create device tree node for bridge")
>>
>> Fixes: 407d1a51921e ("PCI: Create device tree node for bridge")
>> Reported-by: Kowshik Jois B S <kowsjois@linux.ibm.com>
>> Tested-by: Kowshik Jois B S <kowsjois@linux.ibm.com>
>> Signed-off-by: Amit Machhiwal <amachhiw@linux.ibm.com>
> Thanks for the patch and testing!  Would like a reviewed-by from
> Lizhi.

of_pci_make_dev_node() will create of nodes for some endpoint devices 
(not a bridge) as well. And actually this is the main purpose.

Maybe the patch as below would resolve the Oops?

diff --git a/drivers/of/dynamic.c b/drivers/of/dynamic.c
index dda6092e6d3a..3c693b091ecf 100644
--- a/drivers/of/dynamic.c
+++ b/drivers/of/dynamic.c
@@ -492,21 +492,29 @@ struct device_node *__of_node_dup(const struct 
device_node *np,
   * a given changeset.
   *
   * @ocs: Pointer to changeset
+ * @np: Pointer to device node. If it is not null, init it directly 
instead of
+ *      allocate a new node.
   * @parent: Pointer to parent device node
   * @full_name: Node full name
   *
   * Return: Pointer to the created device node or NULL in case of an error.
   */
  struct device_node *of_changeset_create_node(struct of_changeset *ocs,
+                                            struct device_node *np,
                                              struct device_node *parent,
                                              const char *full_name)
  {
-       struct device_node *np;
         int ret;

-       np = __of_node_dup(NULL, full_name);
-       if (!np)
-               return NULL;
+       if (!np) {
+               np = __of_node_dup(NULL, full_name);
+               if (!np)
+                       return NULL;
+       } else {
+               of_node_set_flag(np, OF_DYNAMIC);
+               of_node_set_flag(np, OF_DETACHED);
+       }
+
         np->parent = parent;

         ret = of_changeset_attach_node(ocs, np);
diff --git a/drivers/of/unittest.c b/drivers/of/unittest.c
index 445ad13dab98..087de26852cc 100644
--- a/drivers/of/unittest.c
+++ b/drivers/of/unittest.c
@@ -871,7 +871,7 @@ static void __init of_unittest_changeset(void)
         unittest(!of_changeset_add_property(&chgset, parent, ppadd), 
"fail add prop prop-add\n");
         unittest(!of_changeset_update_property(&chgset, parent, 
ppupdate), "fail update prop\n");
         unittest(!of_changeset_remove_property(&chgset, parent, 
ppremove), "fail remove prop\n");
-       n22 = of_changeset_create_node(&chgset, n2, "n22");
+       n22 = of_changeset_create_node(&chgset, NULL, n2, "n22");
         unittest(n22, "fail create n22\n");
         unittest(!of_changeset_add_prop_string(&chgset, n22, 
"prop-str", "abcd"),
                  "fail add prop prop-str");
diff --git a/drivers/pci/of.c b/drivers/pci/of.c
index 51e3dd0ea5ab..92c079b2e570 100644
--- a/drivers/pci/of.c
+++ b/drivers/pci/of.c
@@ -608,18 +608,28 @@ int devm_of_pci_bridge_init(struct device *dev, 
struct pci_host_bridge *bridge)

  #ifdef CONFIG_PCI_DYNAMIC_OF_NODES

+void of_pci_free_node(struct device_node *np)
+{
+       struct of_changeset *cset;
+
+       cset = (struct of_changeset *)(np + 1);
+
+       np->data = NULL;
+       of_changeset_revert(cset);
+       of_changeset_destroy(cset);
+       of_node_put(np);
+}
+
  void of_pci_remove_node(struct pci_dev *pdev)
  {
         struct device_node *np;

         np = pci_device_to_OF_node(pdev);
-       if (!np || !of_node_check_flag(np, OF_DYNAMIC))
+       if (!np || np->data != of_pci_free_node)
                 return;
         pdev->dev.of_node = NULL;

-       of_changeset_revert(np->data);
-       of_changeset_destroy(np->data);
-       of_node_put(np);
+       of_pci_free_node(np);
  }

  void of_pci_make_dev_node(struct pci_dev *pdev)
@@ -655,14 +665,18 @@ void of_pci_make_dev_node(struct pci_dev *pdev)
         if (!name)
                 return;

-       cset = kmalloc(sizeof(*cset), GFP_KERNEL);
-       if (!cset)
+       np = kzalloc(sizeof(*np) + sizeof(*cset), GFP_KERNEL);
+       if (!np)
                 goto out_free_name;
+       np->full_name = name;
+       of_node_init(np);
+
+       cset = (struct of_changeset *)(np + 1);
         of_changeset_init(cset);

-       np = of_changeset_create_node(cset, ppnode, name);
+       np = of_changeset_create_node(cset, np, ppnode, NULL);
         if (!np)
-               goto out_destroy_cset;
+               goto out_free_node;

         ret = of_pci_add_properties(pdev, cset, np);
         if (ret)
@@ -672,9 +686,8 @@ void of_pci_make_dev_node(struct pci_dev *pdev)
         if (ret)
                 goto out_free_node;

-       np->data = cset;
+       np->data = of_pci_free_node;
         pdev->dev.of_node = np;
-       kfree(name);

         return;

diff --git a/include/linux/of.h b/include/linux/of.h
index a0bedd038a05..f774459d0d84 100644
--- a/include/linux/of.h
+++ b/include/linux/of.h
@@ -1631,6 +1631,7 @@ static inline int 
of_changeset_update_property(struct of_changeset *ocs,
  }

  struct device_node *of_changeset_create_node(struct of_changeset *ocs,
+                                            struct device_node *np,
                                              struct device_node *parent,
                                              const char *full_name);
  int of_changeset_add_prop_string(struct of_changeset *ocs,

Thanks,

Lizhi

>
>> ---
>>   drivers/pci/remove.c | 3 ++-
>>   1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/pci/remove.c b/drivers/pci/remove.c
>> index d749ea8250d6..4e51c64af416 100644
>> --- a/drivers/pci/remove.c
>> +++ b/drivers/pci/remove.c
>> @@ -22,7 +22,8 @@ static void pci_stop_dev(struct pci_dev *dev)
>>   		device_release_driver(&dev->dev);
>>   		pci_proc_detach_device(dev);
>>   		pci_remove_sysfs_dev_files(dev);
>> -		of_pci_remove_node(dev);
>> +		if (pci_is_bridge(dev))
>> +			of_pci_remove_node(dev);
> IIUC, this basically undoes the work that was done by
> of_pci_make_dev_node().
>
> The call of of_pci_make_dev_node() from pci_bus_add_device() was added
> by 407d1a51921e and is conditional on pci_is_bridge(), so it makes
> sense to me that the remove needs a similar condition.
>
>>   		pci_dev_assign_added(dev, false);
>>   	}
>>
>> base-commit: e9d22f7a6655941fc8b2b942ed354ec780936b3e
>> -- 
>> 2.45.2
>>
Rob Herring (Arm) July 11, 2024, 12:20 p.m. UTC | #3
On Wed, Jul 3, 2024 at 8:17 AM Amit Machhiwal <amachhiw@linux.ibm.com> wrote:
>
> With CONFIG_PCI_DYNAMIC_OF_NODES [1], a hot-plug and hot-unplug sequence
> of a PCI device attached to a PCI-bridge causes following kernel Oops on
> a pseries KVM guest:

Can I ask why you have this option on in the first place? Do you have
a use for it or it's just a case of distros turn on every kconfig
option.

Rob
Amit Machhiwal July 11, 2024, 2:21 p.m. UTC | #4
Hi Rob,

On 2024/07/11 06:20 AM, Rob Herring wrote:
> On Wed, Jul 3, 2024 at 8:17 AM Amit Machhiwal <amachhiw@linux.ibm.com> wrote:
> >
> > With CONFIG_PCI_DYNAMIC_OF_NODES [1], a hot-plug and hot-unplug sequence
> > of a PCI device attached to a PCI-bridge causes following kernel Oops on
> > a pseries KVM guest:
> 
> Can I ask why you have this option on in the first place? Do you have
> a use for it or it's just a case of distros turn on every kconfig
> option.

Yes, this option is turned on in Ubuntu's distro kernel config where the issue
was originally reported, while Fedora is keeping this turned off.

    root@ubuntu:~# cat /boot/config-6.8.0-38-generic | grep PCI_DYN
    CONFIG_PCI_DYNAMIC_OF_NODES=y

    root@fedora:~# cat /boot/config-6.9.7-200.fc40.ppc64le | grep PCI_DYN
    # CONFIG_PCI_DYNAMIC_OF_NODES is not set

Thanks,
Amit

> 
> Rob
Amit Machhiwal July 11, 2024, 6:48 p.m. UTC | #5
On 2024/07/10 09:48 PM, Lizhi Hou wrote:
> 
> On 7/5/24 12:20, Bjorn Helgaas wrote:
> > [+cc Lukas, FYI]
> > 
> > On Wed, Jul 03, 2024 at 07:46:34PM +0530, Amit Machhiwal wrote:
> > > With CONFIG_PCI_DYNAMIC_OF_NODES [1], a hot-plug and hot-unplug sequence
> > > of a PCI device attached to a PCI-bridge causes following kernel Oops on
> > > a pseries KVM guest:
> > > 
> > >   RTAS: event: 2, Type: Hotplug Event (229), Severity: 1
> > >   Kernel attempted to read user page (10ec00000048) - exploit attempt? (uid: 0)
> > >   BUG: Unable to handle kernel data access on read at 0x10ec00000048
> > >   Faulting instruction address: 0xc0000000012d8728
> > >   Oops: Kernel access of bad area, sig: 11 [#1]
> > >   LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries
> > > <snip>
> > >   NIP [c0000000012d8728] __of_changeset_entry_invert+0x10/0x1ac
> > >   LR [c0000000012da7f0] __of_changeset_revert_entries+0x98/0x180
> > >   Call Trace:
> > >   [c00000000bcc3970] [c0000000012daa60] of_changeset_revert+0x58/0xd8
> > >   [c00000000bcc39c0] [c000000000d0ed78] of_pci_remove_node+0x74/0xb0
> > >   [c00000000bcc39f0] [c000000000cdcfe0] pci_stop_bus_device+0xf4/0x138
> > >   [c00000000bcc3a30] [c000000000cdd140] pci_stop_and_remove_bus_device_locked+0x34/0x64
> > >   [c00000000bcc3a60] [c000000000cf3780] remove_store+0xf0/0x108
> > >   [c00000000bcc3ab0] [c000000000e89e04] dev_attr_store+0x34/0x78
> > >   [c00000000bcc3ad0] [c0000000007f8dd4] sysfs_kf_write+0x70/0xa4
> > >   [c00000000bcc3af0] [c0000000007f7248] kernfs_fop_write_iter+0x1d0/0x2e0
> > >   [c00000000bcc3b40] [c0000000006c9b08] vfs_write+0x27c/0x558
> > >   [c00000000bcc3bf0] [c0000000006ca168] ksys_write+0x90/0x170
> > >   [c00000000bcc3c40] [c000000000033248] system_call_exception+0xf8/0x290
> > >   [c00000000bcc3e50] [c00000000000d05c] system_call_vectored_common+0x15c/0x2ec
> > > <snip>
> > > 
> > > A git bisect pointed this regression to be introduced via [1] that added
> > > a mechanism to create device tree nodes for parent PCI bridges when a
> > > PCI device is hot-plugged.
> > > 
> > > The Oops is caused when `pci_stop_dev()` tries to remove a non-existing
> > > device-tree node associated with the pci_dev that was earlier
> > > hot-plugged and was attached under a pci-bridge. The PCI dev header
> > > `dev->hdr_type` being 0, results a conditional check done with
> > > `pci_is_bridge()` into false. Consequently, a call to
> > > `of_pci_make_dev_node()` to create a device node is never made. When at
> > > a later point in time, in the device node removal path, a memcpy is
> > > attempted in `__of_changeset_entry_invert()`; since the device node was
> > > never created, results in an Oops due to kernel read access to a bad
> > > address.
> > > 
> > > To fix this issue the patch updates `pci_stop_dev()` to ensure that a
> > > call to `of_pci_remove_node()` is only made for pci-bridge devices.
> > > 
> > > [1] commit 407d1a51921e ("PCI: Create device tree node for bridge")
> > > 
> > > Fixes: 407d1a51921e ("PCI: Create device tree node for bridge")
> > > Reported-by: Kowshik Jois B S <kowsjois@linux.ibm.com>
> > > Tested-by: Kowshik Jois B S <kowsjois@linux.ibm.com>
> > > Signed-off-by: Amit Machhiwal <amachhiw@linux.ibm.com>
> > Thanks for the patch and testing!  Would like a reviewed-by from
> > Lizhi.
> 
> of_pci_make_dev_node() will create of nodes for some endpoint devices (not a
> bridge) as well. And actually this is the main purpose.
> 
> Maybe the patch as below would resolve the Oops?

Thanks for the patch, Lizhi! I tried out this patch and don't see the issue with
the same. The hot-plug and hot-unplug of PCI device seem to work fine as
expected.

~ Amit

> 
> diff --git a/drivers/of/dynamic.c b/drivers/of/dynamic.c
> index dda6092e6d3a..3c693b091ecf 100644
> --- a/drivers/of/dynamic.c
> +++ b/drivers/of/dynamic.c
> @@ -492,21 +492,29 @@ struct device_node *__of_node_dup(const struct
> device_node *np,
>   * a given changeset.
>   *
>   * @ocs: Pointer to changeset
> + * @np: Pointer to device node. If it is not null, init it directly instead
> of
> + *      allocate a new node.
>   * @parent: Pointer to parent device node
>   * @full_name: Node full name
>   *
>   * Return: Pointer to the created device node or NULL in case of an error.
>   */
>  struct device_node *of_changeset_create_node(struct of_changeset *ocs,
> +                                            struct device_node *np,
>                                              struct device_node *parent,
>                                              const char *full_name)
>  {
> -       struct device_node *np;
>         int ret;
> 
> -       np = __of_node_dup(NULL, full_name);
> -       if (!np)
> -               return NULL;
> +       if (!np) {
> +               np = __of_node_dup(NULL, full_name);
> +               if (!np)
> +                       return NULL;
> +       } else {
> +               of_node_set_flag(np, OF_DYNAMIC);
> +               of_node_set_flag(np, OF_DETACHED);
> +       }
> +
>         np->parent = parent;
> 
>         ret = of_changeset_attach_node(ocs, np);
> diff --git a/drivers/of/unittest.c b/drivers/of/unittest.c
> index 445ad13dab98..087de26852cc 100644
> --- a/drivers/of/unittest.c
> +++ b/drivers/of/unittest.c
> @@ -871,7 +871,7 @@ static void __init of_unittest_changeset(void)
>         unittest(!of_changeset_add_property(&chgset, parent, ppadd), "fail
> add prop prop-add\n");
>         unittest(!of_changeset_update_property(&chgset, parent, ppupdate),
> "fail update prop\n");
>         unittest(!of_changeset_remove_property(&chgset, parent, ppremove),
> "fail remove prop\n");
> -       n22 = of_changeset_create_node(&chgset, n2, "n22");
> +       n22 = of_changeset_create_node(&chgset, NULL, n2, "n22");
>         unittest(n22, "fail create n22\n");
>         unittest(!of_changeset_add_prop_string(&chgset, n22, "prop-str",
> "abcd"),
>                  "fail add prop prop-str");
> diff --git a/drivers/pci/of.c b/drivers/pci/of.c
> index 51e3dd0ea5ab..92c079b2e570 100644
> --- a/drivers/pci/of.c
> +++ b/drivers/pci/of.c
> @@ -608,18 +608,28 @@ int devm_of_pci_bridge_init(struct device *dev, struct
> pci_host_bridge *bridge)
> 
>  #ifdef CONFIG_PCI_DYNAMIC_OF_NODES
> 
> +void of_pci_free_node(struct device_node *np)
> +{
> +       struct of_changeset *cset;
> +
> +       cset = (struct of_changeset *)(np + 1);
> +
> +       np->data = NULL;
> +       of_changeset_revert(cset);
> +       of_changeset_destroy(cset);
> +       of_node_put(np);
> +}
> +
>  void of_pci_remove_node(struct pci_dev *pdev)
>  {
>         struct device_node *np;
> 
>         np = pci_device_to_OF_node(pdev);
> -       if (!np || !of_node_check_flag(np, OF_DYNAMIC))
> +       if (!np || np->data != of_pci_free_node)
>                 return;
>         pdev->dev.of_node = NULL;
> 
> -       of_changeset_revert(np->data);
> -       of_changeset_destroy(np->data);
> -       of_node_put(np);
> +       of_pci_free_node(np);
>  }
> 
>  void of_pci_make_dev_node(struct pci_dev *pdev)
> @@ -655,14 +665,18 @@ void of_pci_make_dev_node(struct pci_dev *pdev)
>         if (!name)
>                 return;
> 
> -       cset = kmalloc(sizeof(*cset), GFP_KERNEL);
> -       if (!cset)
> +       np = kzalloc(sizeof(*np) + sizeof(*cset), GFP_KERNEL);
> +       if (!np)
>                 goto out_free_name;
> +       np->full_name = name;
> +       of_node_init(np);
> +
> +       cset = (struct of_changeset *)(np + 1);
>         of_changeset_init(cset);
> 
> -       np = of_changeset_create_node(cset, ppnode, name);
> +       np = of_changeset_create_node(cset, np, ppnode, NULL);
>         if (!np)
> -               goto out_destroy_cset;
> +               goto out_free_node;
> 
>         ret = of_pci_add_properties(pdev, cset, np);
>         if (ret)
> @@ -672,9 +686,8 @@ void of_pci_make_dev_node(struct pci_dev *pdev)
>         if (ret)
>                 goto out_free_node;
> 
> -       np->data = cset;
> +       np->data = of_pci_free_node;
>         pdev->dev.of_node = np;
> -       kfree(name);
> 
>         return;
> 
> diff --git a/include/linux/of.h b/include/linux/of.h
> index a0bedd038a05..f774459d0d84 100644
> --- a/include/linux/of.h
> +++ b/include/linux/of.h
> @@ -1631,6 +1631,7 @@ static inline int of_changeset_update_property(struct
> of_changeset *ocs,
>  }
> 
>  struct device_node *of_changeset_create_node(struct of_changeset *ocs,
> +                                            struct device_node *np,
>                                              struct device_node *parent,
>                                              const char *full_name);
>  int of_changeset_add_prop_string(struct of_changeset *ocs,
> 
> Thanks,
> 
> Lizhi
> 
> > 
> > > ---
> > >   drivers/pci/remove.c | 3 ++-
> > >   1 file changed, 2 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/drivers/pci/remove.c b/drivers/pci/remove.c
> > > index d749ea8250d6..4e51c64af416 100644
> > > --- a/drivers/pci/remove.c
> > > +++ b/drivers/pci/remove.c
> > > @@ -22,7 +22,8 @@ static void pci_stop_dev(struct pci_dev *dev)
> > >   		device_release_driver(&dev->dev);
> > >   		pci_proc_detach_device(dev);
> > >   		pci_remove_sysfs_dev_files(dev);
> > > -		of_pci_remove_node(dev);
> > > +		if (pci_is_bridge(dev))
> > > +			of_pci_remove_node(dev);
> > IIUC, this basically undoes the work that was done by
> > of_pci_make_dev_node().
> > 
> > The call of of_pci_make_dev_node() from pci_bus_add_device() was added
> > by 407d1a51921e and is conditional on pci_is_bridge(), so it makes
> > sense to me that the remove needs a similar condition.
> > 
> > >   		pci_dev_assign_added(dev, false);
> > >   	}
> > > 
> > > base-commit: e9d22f7a6655941fc8b2b942ed354ec780936b3e
> > > -- 
> > > 2.45.2
> > >
Rob Herring (Arm) July 11, 2024, 7:34 p.m. UTC | #6
+Ubuntu kernel team

On Thu, Jul 11, 2024 at 8:21 AM Amit Machhiwal <amachhiw@linux.ibm.com> wrote:
>
> Hi Rob,
>
> On 2024/07/11 06:20 AM, Rob Herring wrote:
> > On Wed, Jul 3, 2024 at 8:17 AM Amit Machhiwal <amachhiw@linux.ibm.com> wrote:
> > >
> > > With CONFIG_PCI_DYNAMIC_OF_NODES [1], a hot-plug and hot-unplug sequence
> > > of a PCI device attached to a PCI-bridge causes following kernel Oops on
> > > a pseries KVM guest:
> >
> > Can I ask why you have this option on in the first place? Do you have
> > a use for it or it's just a case of distros turn on every kconfig
> > option.
>
> Yes, this option is turned on in Ubuntu's distro kernel config where the issue
> was originally reported, while Fedora is keeping this turned off.
>
>     root@ubuntu:~# cat /boot/config-6.8.0-38-generic | grep PCI_DYN
>     CONFIG_PCI_DYNAMIC_OF_NODES=y

Ubuntu should turn off this option. For starters, it is not complete
to be usable. Eventually, it should get removed in favor of some TBD
runtime option.

(And we should fix the crash too)

Rob
Lizhi Hou July 11, 2024, 9:18 p.m. UTC | #7
On 7/11/24 11:48, Amit Machhiwal wrote:
> On 2024/07/10 09:48 PM, Lizhi Hou wrote:
>> On 7/5/24 12:20, Bjorn Helgaas wrote:
>>> [+cc Lukas, FYI]
>>>
>>> On Wed, Jul 03, 2024 at 07:46:34PM +0530, Amit Machhiwal wrote:
>>>> With CONFIG_PCI_DYNAMIC_OF_NODES [1], a hot-plug and hot-unplug sequence
>>>> of a PCI device attached to a PCI-bridge causes following kernel Oops on
>>>> a pseries KVM guest:
>>>>
>>>>    RTAS: event: 2, Type: Hotplug Event (229), Severity: 1
>>>>    Kernel attempted to read user page (10ec00000048) - exploit attempt? (uid: 0)
>>>>    BUG: Unable to handle kernel data access on read at 0x10ec00000048
>>>>    Faulting instruction address: 0xc0000000012d8728
>>>>    Oops: Kernel access of bad area, sig: 11 [#1]
>>>>    LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries
>>>> <snip>
>>>>    NIP [c0000000012d8728] __of_changeset_entry_invert+0x10/0x1ac
>>>>    LR [c0000000012da7f0] __of_changeset_revert_entries+0x98/0x180
>>>>    Call Trace:
>>>>    [c00000000bcc3970] [c0000000012daa60] of_changeset_revert+0x58/0xd8
>>>>    [c00000000bcc39c0] [c000000000d0ed78] of_pci_remove_node+0x74/0xb0
>>>>    [c00000000bcc39f0] [c000000000cdcfe0] pci_stop_bus_device+0xf4/0x138
>>>>    [c00000000bcc3a30] [c000000000cdd140] pci_stop_and_remove_bus_device_locked+0x34/0x64
>>>>    [c00000000bcc3a60] [c000000000cf3780] remove_store+0xf0/0x108
>>>>    [c00000000bcc3ab0] [c000000000e89e04] dev_attr_store+0x34/0x78
>>>>    [c00000000bcc3ad0] [c0000000007f8dd4] sysfs_kf_write+0x70/0xa4
>>>>    [c00000000bcc3af0] [c0000000007f7248] kernfs_fop_write_iter+0x1d0/0x2e0
>>>>    [c00000000bcc3b40] [c0000000006c9b08] vfs_write+0x27c/0x558
>>>>    [c00000000bcc3bf0] [c0000000006ca168] ksys_write+0x90/0x170
>>>>    [c00000000bcc3c40] [c000000000033248] system_call_exception+0xf8/0x290
>>>>    [c00000000bcc3e50] [c00000000000d05c] system_call_vectored_common+0x15c/0x2ec
>>>> <snip>
>>>>
>>>> A git bisect pointed this regression to be introduced via [1] that added
>>>> a mechanism to create device tree nodes for parent PCI bridges when a
>>>> PCI device is hot-plugged.
>>>>
>>>> The Oops is caused when `pci_stop_dev()` tries to remove a non-existing
>>>> device-tree node associated with the pci_dev that was earlier
>>>> hot-plugged and was attached under a pci-bridge. The PCI dev header
>>>> `dev->hdr_type` being 0, results a conditional check done with
>>>> `pci_is_bridge()` into false. Consequently, a call to
>>>> `of_pci_make_dev_node()` to create a device node is never made. When at
>>>> a later point in time, in the device node removal path, a memcpy is
>>>> attempted in `__of_changeset_entry_invert()`; since the device node was
>>>> never created, results in an Oops due to kernel read access to a bad
>>>> address.
>>>>
>>>> To fix this issue the patch updates `pci_stop_dev()` to ensure that a
>>>> call to `of_pci_remove_node()` is only made for pci-bridge devices.
>>>>
>>>> [1] commit 407d1a51921e ("PCI: Create device tree node for bridge")
>>>>
>>>> Fixes: 407d1a51921e ("PCI: Create device tree node for bridge")
>>>> Reported-by: Kowshik Jois B S <kowsjois@linux.ibm.com>
>>>> Tested-by: Kowshik Jois B S <kowsjois@linux.ibm.com>
>>>> Signed-off-by: Amit Machhiwal <amachhiw@linux.ibm.com>
>>> Thanks for the patch and testing!  Would like a reviewed-by from
>>> Lizhi.
>> of_pci_make_dev_node() will create of nodes for some endpoint devices (not a
>> bridge) as well. And actually this is the main purpose.
>>
>> Maybe the patch as below would resolve the Oops?
> Thanks for the patch, Lizhi! I tried out this patch and don't see the issue with
> the same. The hot-plug and hot-unplug of PCI device seem to work fine as
> expected.

Cool! Thanks for trying it. Will you re-spin the patch or you would like 
me to create a patch?

Lizhi

>
> ~ Amit
>
>> diff --git a/drivers/of/dynamic.c b/drivers/of/dynamic.c
>> index dda6092e6d3a..3c693b091ecf 100644
>> --- a/drivers/of/dynamic.c
>> +++ b/drivers/of/dynamic.c
>> @@ -492,21 +492,29 @@ struct device_node *__of_node_dup(const struct
>> device_node *np,
>>    * a given changeset.
>>    *
>>    * @ocs: Pointer to changeset
>> + * @np: Pointer to device node. If it is not null, init it directly instead
>> of
>> + *      allocate a new node.
>>    * @parent: Pointer to parent device node
>>    * @full_name: Node full name
>>    *
>>    * Return: Pointer to the created device node or NULL in case of an error.
>>    */
>>   struct device_node *of_changeset_create_node(struct of_changeset *ocs,
>> +                                            struct device_node *np,
>>                                               struct device_node *parent,
>>                                               const char *full_name)
>>   {
>> -       struct device_node *np;
>>          int ret;
>>
>> -       np = __of_node_dup(NULL, full_name);
>> -       if (!np)
>> -               return NULL;
>> +       if (!np) {
>> +               np = __of_node_dup(NULL, full_name);
>> +               if (!np)
>> +                       return NULL;
>> +       } else {
>> +               of_node_set_flag(np, OF_DYNAMIC);
>> +               of_node_set_flag(np, OF_DETACHED);
>> +       }
>> +
>>          np->parent = parent;
>>
>>          ret = of_changeset_attach_node(ocs, np);
>> diff --git a/drivers/of/unittest.c b/drivers/of/unittest.c
>> index 445ad13dab98..087de26852cc 100644
>> --- a/drivers/of/unittest.c
>> +++ b/drivers/of/unittest.c
>> @@ -871,7 +871,7 @@ static void __init of_unittest_changeset(void)
>>          unittest(!of_changeset_add_property(&chgset, parent, ppadd), "fail
>> add prop prop-add\n");
>>          unittest(!of_changeset_update_property(&chgset, parent, ppupdate),
>> "fail update prop\n");
>>          unittest(!of_changeset_remove_property(&chgset, parent, ppremove),
>> "fail remove prop\n");
>> -       n22 = of_changeset_create_node(&chgset, n2, "n22");
>> +       n22 = of_changeset_create_node(&chgset, NULL, n2, "n22");
>>          unittest(n22, "fail create n22\n");
>>          unittest(!of_changeset_add_prop_string(&chgset, n22, "prop-str",
>> "abcd"),
>>                   "fail add prop prop-str");
>> diff --git a/drivers/pci/of.c b/drivers/pci/of.c
>> index 51e3dd0ea5ab..92c079b2e570 100644
>> --- a/drivers/pci/of.c
>> +++ b/drivers/pci/of.c
>> @@ -608,18 +608,28 @@ int devm_of_pci_bridge_init(struct device *dev, struct
>> pci_host_bridge *bridge)
>>
>>   #ifdef CONFIG_PCI_DYNAMIC_OF_NODES
>>
>> +void of_pci_free_node(struct device_node *np)
>> +{
>> +       struct of_changeset *cset;
>> +
>> +       cset = (struct of_changeset *)(np + 1);
>> +
>> +       np->data = NULL;
>> +       of_changeset_revert(cset);
>> +       of_changeset_destroy(cset);
>> +       of_node_put(np);
>> +}
>> +
>>   void of_pci_remove_node(struct pci_dev *pdev)
>>   {
>>          struct device_node *np;
>>
>>          np = pci_device_to_OF_node(pdev);
>> -       if (!np || !of_node_check_flag(np, OF_DYNAMIC))
>> +       if (!np || np->data != of_pci_free_node)
>>                  return;
>>          pdev->dev.of_node = NULL;
>>
>> -       of_changeset_revert(np->data);
>> -       of_changeset_destroy(np->data);
>> -       of_node_put(np);
>> +       of_pci_free_node(np);
>>   }
>>
>>   void of_pci_make_dev_node(struct pci_dev *pdev)
>> @@ -655,14 +665,18 @@ void of_pci_make_dev_node(struct pci_dev *pdev)
>>          if (!name)
>>                  return;
>>
>> -       cset = kmalloc(sizeof(*cset), GFP_KERNEL);
>> -       if (!cset)
>> +       np = kzalloc(sizeof(*np) + sizeof(*cset), GFP_KERNEL);
>> +       if (!np)
>>                  goto out_free_name;
>> +       np->full_name = name;
>> +       of_node_init(np);
>> +
>> +       cset = (struct of_changeset *)(np + 1);
>>          of_changeset_init(cset);
>>
>> -       np = of_changeset_create_node(cset, ppnode, name);
>> +       np = of_changeset_create_node(cset, np, ppnode, NULL);
>>          if (!np)
>> -               goto out_destroy_cset;
>> +               goto out_free_node;
>>
>>          ret = of_pci_add_properties(pdev, cset, np);
>>          if (ret)
>> @@ -672,9 +686,8 @@ void of_pci_make_dev_node(struct pci_dev *pdev)
>>          if (ret)
>>                  goto out_free_node;
>>
>> -       np->data = cset;
>> +       np->data = of_pci_free_node;
>>          pdev->dev.of_node = np;
>> -       kfree(name);
>>
>>          return;
>>
>> diff --git a/include/linux/of.h b/include/linux/of.h
>> index a0bedd038a05..f774459d0d84 100644
>> --- a/include/linux/of.h
>> +++ b/include/linux/of.h
>> @@ -1631,6 +1631,7 @@ static inline int of_changeset_update_property(struct
>> of_changeset *ocs,
>>   }
>>
>>   struct device_node *of_changeset_create_node(struct of_changeset *ocs,
>> +                                            struct device_node *np,
>>                                               struct device_node *parent,
>>                                               const char *full_name);
>>   int of_changeset_add_prop_string(struct of_changeset *ocs,
>>
>> Thanks,
>>
>> Lizhi
>>
>>>> ---
>>>>    drivers/pci/remove.c | 3 ++-
>>>>    1 file changed, 2 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/drivers/pci/remove.c b/drivers/pci/remove.c
>>>> index d749ea8250d6..4e51c64af416 100644
>>>> --- a/drivers/pci/remove.c
>>>> +++ b/drivers/pci/remove.c
>>>> @@ -22,7 +22,8 @@ static void pci_stop_dev(struct pci_dev *dev)
>>>>    		device_release_driver(&dev->dev);
>>>>    		pci_proc_detach_device(dev);
>>>>    		pci_remove_sysfs_dev_files(dev);
>>>> -		of_pci_remove_node(dev);
>>>> +		if (pci_is_bridge(dev))
>>>> +			of_pci_remove_node(dev);
>>> IIUC, this basically undoes the work that was done by
>>> of_pci_make_dev_node().
>>>
>>> The call of of_pci_make_dev_node() from pci_bus_add_device() was added
>>> by 407d1a51921e and is conditional on pci_is_bridge(), so it makes
>>> sense to me that the remove needs a similar condition.
>>>
>>>>    		pci_dev_assign_added(dev, false);
>>>>    	}
>>>>
>>>> base-commit: e9d22f7a6655941fc8b2b942ed354ec780936b3e
>>>> -- 
>>>> 2.45.2
>>>>
Amit Machhiwal July 12, 2024, 5:19 p.m. UTC | #8
Hi Lizhi,

On 2024/07/11 02:18 PM, Lizhi Hou wrote:
> 
> On 7/11/24 11:48, Amit Machhiwal wrote:
> > On 2024/07/10 09:48 PM, Lizhi Hou wrote:
> > > On 7/5/24 12:20, Bjorn Helgaas wrote:
> > > > [+cc Lukas, FYI]
> > > > 
> > > > On Wed, Jul 03, 2024 at 07:46:34PM +0530, Amit Machhiwal wrote:
> > > > > With CONFIG_PCI_DYNAMIC_OF_NODES [1], a hot-plug and hot-unplug sequence
> > > > > of a PCI device attached to a PCI-bridge causes following kernel Oops on
> > > > > a pseries KVM guest:
> > > > > 
> > > > >    RTAS: event: 2, Type: Hotplug Event (229), Severity: 1
> > > > >    Kernel attempted to read user page (10ec00000048) - exploit attempt? (uid: 0)
> > > > >    BUG: Unable to handle kernel data access on read at 0x10ec00000048
> > > > >    Faulting instruction address: 0xc0000000012d8728
> > > > >    Oops: Kernel access of bad area, sig: 11 [#1]
> > > > >    LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries
> > > > > <snip>
> > > > >    NIP [c0000000012d8728] __of_changeset_entry_invert+0x10/0x1ac
> > > > >    LR [c0000000012da7f0] __of_changeset_revert_entries+0x98/0x180
> > > > >    Call Trace:
> > > > >    [c00000000bcc3970] [c0000000012daa60] of_changeset_revert+0x58/0xd8
> > > > >    [c00000000bcc39c0] [c000000000d0ed78] of_pci_remove_node+0x74/0xb0
> > > > >    [c00000000bcc39f0] [c000000000cdcfe0] pci_stop_bus_device+0xf4/0x138
> > > > >    [c00000000bcc3a30] [c000000000cdd140] pci_stop_and_remove_bus_device_locked+0x34/0x64
> > > > >    [c00000000bcc3a60] [c000000000cf3780] remove_store+0xf0/0x108
> > > > >    [c00000000bcc3ab0] [c000000000e89e04] dev_attr_store+0x34/0x78
> > > > >    [c00000000bcc3ad0] [c0000000007f8dd4] sysfs_kf_write+0x70/0xa4
> > > > >    [c00000000bcc3af0] [c0000000007f7248] kernfs_fop_write_iter+0x1d0/0x2e0
> > > > >    [c00000000bcc3b40] [c0000000006c9b08] vfs_write+0x27c/0x558
> > > > >    [c00000000bcc3bf0] [c0000000006ca168] ksys_write+0x90/0x170
> > > > >    [c00000000bcc3c40] [c000000000033248] system_call_exception+0xf8/0x290
> > > > >    [c00000000bcc3e50] [c00000000000d05c] system_call_vectored_common+0x15c/0x2ec
> > > > > <snip>
> > > > > 
> > > > > A git bisect pointed this regression to be introduced via [1] that added
> > > > > a mechanism to create device tree nodes for parent PCI bridges when a
> > > > > PCI device is hot-plugged.
> > > > > 
> > > > > The Oops is caused when `pci_stop_dev()` tries to remove a non-existing
> > > > > device-tree node associated with the pci_dev that was earlier
> > > > > hot-plugged and was attached under a pci-bridge. The PCI dev header
> > > > > `dev->hdr_type` being 0, results a conditional check done with
> > > > > `pci_is_bridge()` into false. Consequently, a call to
> > > > > `of_pci_make_dev_node()` to create a device node is never made. When at
> > > > > a later point in time, in the device node removal path, a memcpy is
> > > > > attempted in `__of_changeset_entry_invert()`; since the device node was
> > > > > never created, results in an Oops due to kernel read access to a bad
> > > > > address.
> > > > > 
> > > > > To fix this issue the patch updates `pci_stop_dev()` to ensure that a
> > > > > call to `of_pci_remove_node()` is only made for pci-bridge devices.
> > > > > 
> > > > > [1] commit 407d1a51921e ("PCI: Create device tree node for bridge")
> > > > > 
> > > > > Fixes: 407d1a51921e ("PCI: Create device tree node for bridge")
> > > > > Reported-by: Kowshik Jois B S <kowsjois@linux.ibm.com>
> > > > > Tested-by: Kowshik Jois B S <kowsjois@linux.ibm.com>
> > > > > Signed-off-by: Amit Machhiwal <amachhiw@linux.ibm.com>
> > > > Thanks for the patch and testing!  Would like a reviewed-by from
> > > > Lizhi.
> > > of_pci_make_dev_node() will create of nodes for some endpoint devices (not a
> > > bridge) as well. And actually this is the main purpose.
> > > 
> > > Maybe the patch as below would resolve the Oops?
> > Thanks for the patch, Lizhi! I tried out this patch and don't see the issue with
> > the same. The hot-plug and hot-unplug of PCI device seem to work fine as
> > expected.
> 
> Cool! Thanks for trying it. Will you re-spin the patch or you would like me
> to create a patch?

Sure, I'll re-spin and send the V2 of the patch.

~ Amit

> 
> Lizhi
> 
> > 
> > ~ Amit
> > 
> > > diff --git a/drivers/of/dynamic.c b/drivers/of/dynamic.c
> > > index dda6092e6d3a..3c693b091ecf 100644
> > > --- a/drivers/of/dynamic.c
> > > +++ b/drivers/of/dynamic.c
> > > @@ -492,21 +492,29 @@ struct device_node *__of_node_dup(const struct
> > > device_node *np,
> > >    * a given changeset.
> > >    *
> > >    * @ocs: Pointer to changeset
> > > + * @np: Pointer to device node. If it is not null, init it directly instead
> > > of
> > > + *      allocate a new node.
> > >    * @parent: Pointer to parent device node
> > >    * @full_name: Node full name
> > >    *
> > >    * Return: Pointer to the created device node or NULL in case of an error.
> > >    */
> > >   struct device_node *of_changeset_create_node(struct of_changeset *ocs,
> > > +                                            struct device_node *np,
> > >                                               struct device_node *parent,
> > >                                               const char *full_name)
> > >   {
> > > -       struct device_node *np;
> > >          int ret;
> > > 
> > > -       np = __of_node_dup(NULL, full_name);
> > > -       if (!np)
> > > -               return NULL;
> > > +       if (!np) {
> > > +               np = __of_node_dup(NULL, full_name);
> > > +               if (!np)
> > > +                       return NULL;
> > > +       } else {
> > > +               of_node_set_flag(np, OF_DYNAMIC);
> > > +               of_node_set_flag(np, OF_DETACHED);
> > > +       }
> > > +
> > >          np->parent = parent;
> > > 
> > >          ret = of_changeset_attach_node(ocs, np);
> > > diff --git a/drivers/of/unittest.c b/drivers/of/unittest.c
> > > index 445ad13dab98..087de26852cc 100644
> > > --- a/drivers/of/unittest.c
> > > +++ b/drivers/of/unittest.c
> > > @@ -871,7 +871,7 @@ static void __init of_unittest_changeset(void)
> > >          unittest(!of_changeset_add_property(&chgset, parent, ppadd), "fail
> > > add prop prop-add\n");
> > >          unittest(!of_changeset_update_property(&chgset, parent, ppupdate),
> > > "fail update prop\n");
> > >          unittest(!of_changeset_remove_property(&chgset, parent, ppremove),
> > > "fail remove prop\n");
> > > -       n22 = of_changeset_create_node(&chgset, n2, "n22");
> > > +       n22 = of_changeset_create_node(&chgset, NULL, n2, "n22");
> > >          unittest(n22, "fail create n22\n");
> > >          unittest(!of_changeset_add_prop_string(&chgset, n22, "prop-str",
> > > "abcd"),
> > >                   "fail add prop prop-str");
> > > diff --git a/drivers/pci/of.c b/drivers/pci/of.c
> > > index 51e3dd0ea5ab..92c079b2e570 100644
> > > --- a/drivers/pci/of.c
> > > +++ b/drivers/pci/of.c
> > > @@ -608,18 +608,28 @@ int devm_of_pci_bridge_init(struct device *dev, struct
> > > pci_host_bridge *bridge)
> > > 
> > >   #ifdef CONFIG_PCI_DYNAMIC_OF_NODES
> > > 
> > > +void of_pci_free_node(struct device_node *np)
> > > +{
> > > +       struct of_changeset *cset;
> > > +
> > > +       cset = (struct of_changeset *)(np + 1);
> > > +
> > > +       np->data = NULL;
> > > +       of_changeset_revert(cset);
> > > +       of_changeset_destroy(cset);
> > > +       of_node_put(np);
> > > +}
> > > +
> > >   void of_pci_remove_node(struct pci_dev *pdev)
> > >   {
> > >          struct device_node *np;
> > > 
> > >          np = pci_device_to_OF_node(pdev);
> > > -       if (!np || !of_node_check_flag(np, OF_DYNAMIC))
> > > +       if (!np || np->data != of_pci_free_node)
> > >                  return;
> > >          pdev->dev.of_node = NULL;
> > > 
> > > -       of_changeset_revert(np->data);
> > > -       of_changeset_destroy(np->data);
> > > -       of_node_put(np);
> > > +       of_pci_free_node(np);
> > >   }
> > > 
> > >   void of_pci_make_dev_node(struct pci_dev *pdev)
> > > @@ -655,14 +665,18 @@ void of_pci_make_dev_node(struct pci_dev *pdev)
> > >          if (!name)
> > >                  return;
> > > 
> > > -       cset = kmalloc(sizeof(*cset), GFP_KERNEL);
> > > -       if (!cset)
> > > +       np = kzalloc(sizeof(*np) + sizeof(*cset), GFP_KERNEL);
> > > +       if (!np)
> > >                  goto out_free_name;
> > > +       np->full_name = name;
> > > +       of_node_init(np);
> > > +
> > > +       cset = (struct of_changeset *)(np + 1);
> > >          of_changeset_init(cset);
> > > 
> > > -       np = of_changeset_create_node(cset, ppnode, name);
> > > +       np = of_changeset_create_node(cset, np, ppnode, NULL);
> > >          if (!np)
> > > -               goto out_destroy_cset;
> > > +               goto out_free_node;
> > > 
> > >          ret = of_pci_add_properties(pdev, cset, np);
> > >          if (ret)
> > > @@ -672,9 +686,8 @@ void of_pci_make_dev_node(struct pci_dev *pdev)
> > >          if (ret)
> > >                  goto out_free_node;
> > > 
> > > -       np->data = cset;
> > > +       np->data = of_pci_free_node;
> > >          pdev->dev.of_node = np;
> > > -       kfree(name);
> > > 
> > >          return;
> > > 
> > > diff --git a/include/linux/of.h b/include/linux/of.h
> > > index a0bedd038a05..f774459d0d84 100644
> > > --- a/include/linux/of.h
> > > +++ b/include/linux/of.h
> > > @@ -1631,6 +1631,7 @@ static inline int of_changeset_update_property(struct
> > > of_changeset *ocs,
> > >   }
> > > 
> > >   struct device_node *of_changeset_create_node(struct of_changeset *ocs,
> > > +                                            struct device_node *np,
> > >                                               struct device_node *parent,
> > >                                               const char *full_name);
> > >   int of_changeset_add_prop_string(struct of_changeset *ocs,
> > > 
> > > Thanks,
> > > 
> > > Lizhi
> > > 
> > > > > ---
> > > > >    drivers/pci/remove.c | 3 ++-
> > > > >    1 file changed, 2 insertions(+), 1 deletion(-)
> > > > > 
> > > > > diff --git a/drivers/pci/remove.c b/drivers/pci/remove.c
> > > > > index d749ea8250d6..4e51c64af416 100644
> > > > > --- a/drivers/pci/remove.c
> > > > > +++ b/drivers/pci/remove.c
> > > > > @@ -22,7 +22,8 @@ static void pci_stop_dev(struct pci_dev *dev)
> > > > >    		device_release_driver(&dev->dev);
> > > > >    		pci_proc_detach_device(dev);
> > > > >    		pci_remove_sysfs_dev_files(dev);
> > > > > -		of_pci_remove_node(dev);
> > > > > +		if (pci_is_bridge(dev))
> > > > > +			of_pci_remove_node(dev);
> > > > IIUC, this basically undoes the work that was done by
> > > > of_pci_make_dev_node().
> > > > 
> > > > The call of of_pci_make_dev_node() from pci_bus_add_device() was added
> > > > by 407d1a51921e and is conditional on pci_is_bridge(), so it makes
> > > > sense to me that the remove needs a similar condition.
> > > > 
> > > > >    		pci_dev_assign_added(dev, false);
> > > > >    	}
> > > > > 
> > > > > base-commit: e9d22f7a6655941fc8b2b942ed354ec780936b3e
> > > > > -- 
> > > > > 2.45.2
> > > > >
diff mbox series

Patch

diff --git a/drivers/pci/remove.c b/drivers/pci/remove.c
index d749ea8250d6..4e51c64af416 100644
--- a/drivers/pci/remove.c
+++ b/drivers/pci/remove.c
@@ -22,7 +22,8 @@  static void pci_stop_dev(struct pci_dev *dev)
 		device_release_driver(&dev->dev);
 		pci_proc_detach_device(dev);
 		pci_remove_sysfs_dev_files(dev);
-		of_pci_remove_node(dev);
+		if (pci_is_bridge(dev))
+			of_pci_remove_node(dev);
 
 		pci_dev_assign_added(dev, false);
 	}