[3/3] amba: Properly handle device probe without IRQ domain

Message ID	20210816074619.177383-4-wangkefeng.wang@huawei.com
State	Not Applicable, archived
Headers	show Return-Path: <devicetree-owner@vger.kernel.org> From: Kefeng Wang <wangkefeng.wang@huawei.com> To: <linux-kernel@vger.kernel.org>, Rob Herring <robh+dt@kernel.org>, "Frank Rowand" <frowand.list@gmail.com> CC: <devicetree@vger.kernel.org>, Russell King <linux@armlinux.org.uk>, "Linus Walleij" <linus.walleij@linaro.org>, <linux-arm-kernel@lists.infradead.org>, Kefeng Wang <wangkefeng.wang@huawei.com>, Ruizhe Lin <linruizhe@huawei.com> Subject: [PATCH 3/3] amba: Properly handle device probe without IRQ domain Date: Mon, 16 Aug 2021 15:46:19 +0800 Message-ID: <20210816074619.177383-4-wangkefeng.wang@huawei.com> In-Reply-To: <20210816074619.177383-1-wangkefeng.wang@huawei.com> References: <20210816074619.177383-1-wangkefeng.wang@huawei.com> MIME-Version: 1.0 Content-Transfer-Encoding: 7BIT Content-Type: text/plain; charset=US-ASCII Precedence: bulk
Series	amba: Properly handle device probe without IRQ domain \| expand [0/3] amba: Properly handle device probe without IRQ domain [3/3] amba: Properly handle device probe without IRQ domain

Context	Check	Description
robh/checkpatch	success

Kefeng Wang Aug. 16, 2021, 7:46 a.m. UTC

of_amba_device_create() uses irq_of_parse_and_map() to translate
a DT interrupt specification into a Linux virtual interrupt number.

But it doesn't properly handle the case where the interrupt controller
is not yet available, eg, when pl011 interrupt is connected to MBIGEN
interrupt controller, because the mbigen initialization is too late,
which will lead to no IRQ due to no IRQ domain found, log is shown below,
  "irq: no irq domain found for uart0 !"

use of_irq_get() to return -EPROBE_DEFER as above, and in the function
amba_device_try_add()/amba_device_add(), it will properly handle in such
case, also return 0 in other fail cases to be consistent as before.

Cc: Russell King <linux@armlinux.org.uk>
Cc: Rob Herring <robh+dt@kernel.org>
Cc: Frank Rowand <frowand.list@gmail.com>
Reported-by: Ruizhe Lin <linruizhe@huawei.com>
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
---
 drivers/amba/bus.c    | 27 +++++++++++++++++++++++++++
 drivers/of/platform.c |  6 +-----
 2 files changed, 28 insertions(+), 5 deletions(-)

Rob Herring Aug. 24, 2021, 8:05 p.m. UTC | #1

+Saravana

Saravana mentioned to me there may be some issues with this one...


On Mon, Aug 16, 2021 at 2:43 AM Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
>
> of_amba_device_create() uses irq_of_parse_and_map() to translate
> a DT interrupt specification into a Linux virtual interrupt number.
>
> But it doesn't properly handle the case where the interrupt controller
> is not yet available, eg, when pl011 interrupt is connected to MBIGEN
> interrupt controller, because the mbigen initialization is too late,
> which will lead to no IRQ due to no IRQ domain found, log is shown below,
>   "irq: no irq domain found for uart0 !"
>
> use of_irq_get() to return -EPROBE_DEFER as above, and in the function
> amba_device_try_add()/amba_device_add(), it will properly handle in such
> case, also return 0 in other fail cases to be consistent as before.
>
> Cc: Russell King <linux@armlinux.org.uk>
> Cc: Rob Herring <robh+dt@kernel.org>
> Cc: Frank Rowand <frowand.list@gmail.com>
> Reported-by: Ruizhe Lin <linruizhe@huawei.com>
> Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
> ---
>  drivers/amba/bus.c    | 27 +++++++++++++++++++++++++++
>  drivers/of/platform.c |  6 +-----
>  2 files changed, 28 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/amba/bus.c b/drivers/amba/bus.c
> index 36f2f42c8014..720aa6cdd402 100644
> --- a/drivers/amba/bus.c
> +++ b/drivers/amba/bus.c
> @@ -19,6 +19,7 @@
>  #include <linux/clk/clk-conf.h>
>  #include <linux/platform_device.h>
>  #include <linux/reset.h>
> +#include <linux/of_irq.h>
>
>  #include <asm/irq.h>
>
> @@ -371,12 +372,38 @@ static void amba_device_release(struct device *dev)
>         kfree(d);
>  }
>
> +static int of_amba_device_decode_irq(struct amba_device *dev)
> +{
> +       struct device_node *node = dev->dev.of_node;
> +       int i, irq = 0;
> +
> +       if (IS_ENABLED(CONFIG_OF_IRQ) && node) {
> +               /* Decode the IRQs and address ranges */
> +               for (i = 0; i < AMBA_NR_IRQS; i++) {
> +                       irq = of_irq_get(node, i);
> +                       if (irq < 0) {
> +                               if (irq == -EPROBE_DEFER)
> +                                       return irq;
> +                               irq = 0;
> +                       }
> +
> +                       dev->irq[i] = irq;
> +               }
> +       }
> +
> +       return 0;
> +}
> +
>  static int amba_device_try_add(struct amba_device *dev, struct resource *parent)
>  {
>         u32 size;
>         void __iomem *tmp;
>         int i, ret;
>
> +       ret = of_amba_device_decode_irq(dev);
> +       if (ret)
> +               goto err_out;
> +
>         ret = request_resource(parent, &dev->res);
>         if (ret)
>                 goto err_out;
> diff --git a/drivers/of/platform.c b/drivers/of/platform.c
> index 74afbb7a4f5e..32d5ff8df747 100644
> --- a/drivers/of/platform.c
> +++ b/drivers/of/platform.c
> @@ -222,7 +222,7 @@ static struct amba_device *of_amba_device_create(struct device_node *node,
>  {
>         struct amba_device *dev;
>         const void *prop;
> -       int i, ret;
> +       int ret;
>
>         pr_debug("Creating amba device %pOF\n", node);
>
> @@ -253,10 +253,6 @@ static struct amba_device *of_amba_device_create(struct device_node *node,
>         if (prop)
>                 dev->periphid = of_read_ulong(prop, 1);
>
> -       /* Decode the IRQs and address ranges */
> -       for (i = 0; i < AMBA_NR_IRQS; i++)
> -               dev->irq[i] = irq_of_parse_and_map(node, i);
> -
>         ret = of_address_to_resource(node, 0, &dev->res);
>         if (ret) {
>                 pr_err("amba: of_address_to_resource() failed (%d) for %pOF\n",
> --
> 2.26.2
>

Saravana Kannan Aug. 24, 2021, 8:08 p.m. UTC | #2

On Tue, Aug 24, 2021 at 1:05 PM Rob Herring <robh+dt@kernel.org> wrote:
>
> +Saravana
>
> Saravana mentioned to me there may be some issues with this one...
>
>
> On Mon, Aug 16, 2021 at 2:43 AM Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
> >
> > of_amba_device_create() uses irq_of_parse_and_map() to translate
> > a DT interrupt specification into a Linux virtual interrupt number.
> >
> > But it doesn't properly handle the case where the interrupt controller
> > is not yet available, eg, when pl011 interrupt is connected to MBIGEN
> > interrupt controller, because the mbigen initialization is too late,
> > which will lead to no IRQ due to no IRQ domain found, log is shown below,
> >   "irq: no irq domain found for uart0 !"
> >
> > use of_irq_get() to return -EPROBE_DEFER as above, and in the function
> > amba_device_try_add()/amba_device_add(), it will properly handle in such
> > case, also return 0 in other fail cases to be consistent as before.
> >
> > Cc: Russell King <linux@armlinux.org.uk>
> > Cc: Rob Herring <robh+dt@kernel.org>
> > Cc: Frank Rowand <frowand.list@gmail.com>
> > Reported-by: Ruizhe Lin <linruizhe@huawei.com>
> > Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
> > ---
> >  drivers/amba/bus.c    | 27 +++++++++++++++++++++++++++
> >  drivers/of/platform.c |  6 +-----
> >  2 files changed, 28 insertions(+), 5 deletions(-)
> >
> > diff --git a/drivers/amba/bus.c b/drivers/amba/bus.c
> > index 36f2f42c8014..720aa6cdd402 100644
> > --- a/drivers/amba/bus.c
> > +++ b/drivers/amba/bus.c
> > @@ -19,6 +19,7 @@
> >  #include <linux/clk/clk-conf.h>
> >  #include <linux/platform_device.h>
> >  #include <linux/reset.h>
> > +#include <linux/of_irq.h>
> >
> >  #include <asm/irq.h>
> >
> > @@ -371,12 +372,38 @@ static void amba_device_release(struct device *dev)
> >         kfree(d);
> >  }
> >
> > +static int of_amba_device_decode_irq(struct amba_device *dev)
> > +{
> > +       struct device_node *node = dev->dev.of_node;
> > +       int i, irq = 0;
> > +
> > +       if (IS_ENABLED(CONFIG_OF_IRQ) && node) {
> > +               /* Decode the IRQs and address ranges */
> > +               for (i = 0; i < AMBA_NR_IRQS; i++) {
> > +                       irq = of_irq_get(node, i);
> > +                       if (irq < 0) {
> > +                               if (irq == -EPROBE_DEFER)
> > +                                       return irq;
> > +                               irq = 0;
> > +                       }
> > +
> > +                       dev->irq[i] = irq;
> > +               }
> > +       }
> > +
> > +       return 0;
> > +}
> > +
> >  static int amba_device_try_add(struct amba_device *dev, struct resource *parent)
> >  {
> >         u32 size;
> >         void __iomem *tmp;
> >         int i, ret;
> >
> > +       ret = of_amba_device_decode_irq(dev);
> > +       if (ret)
> > +               goto err_out;
> > +

Similar to other resources the AMBA bus "gets" for the device, I think
this should be moved into amba_probe() and not here. There's no reason
to delay the addition of the device (and loading its module) because
the IRQ isn't ready yet.

-Saravana

> >         ret = request_resource(parent, &dev->res);
> >         if (ret)
> >                 goto err_out;
> > diff --git a/drivers/of/platform.c b/drivers/of/platform.c
> > index 74afbb7a4f5e..32d5ff8df747 100644
> > --- a/drivers/of/platform.c
> > +++ b/drivers/of/platform.c
> > @@ -222,7 +222,7 @@ static struct amba_device *of_amba_device_create(struct device_node *node,
> >  {
> >         struct amba_device *dev;
> >         const void *prop;
> > -       int i, ret;
> > +       int ret;
> >
> >         pr_debug("Creating amba device %pOF\n", node);
> >
> > @@ -253,10 +253,6 @@ static struct amba_device *of_amba_device_create(struct device_node *node,
> >         if (prop)
> >                 dev->periphid = of_read_ulong(prop, 1);
> >
> > -       /* Decode the IRQs and address ranges */
> > -       for (i = 0; i < AMBA_NR_IRQS; i++)
> > -               dev->irq[i] = irq_of_parse_and_map(node, i);
> > -
> >         ret = of_address_to_resource(node, 0, &dev->res);
> >         if (ret) {
> >                 pr_err("amba: of_address_to_resource() failed (%d) for %pOF\n",
> > --
> > 2.26.2
> >

Kefeng Wang Aug. 25, 2021, 4:05 a.m. UTC | #3

On 2021/8/25 4:08, Saravana Kannan wrote:
> On Tue, Aug 24, 2021 at 1:05 PM Rob Herring <robh+dt@kernel.org> wrote:
>> +Saravana
>>
>> Saravana mentioned to me there may be some issues with this one...
>>
>>
>> On Mon, Aug 16, 2021 at 2:43 AM Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
>>> of_amba_device_create() uses irq_of_parse_and_map() to translate
>>> a DT interrupt specification into a Linux virtual interrupt number.
>>>
>>> But it doesn't properly handle the case where the interrupt controller
>>> is not yet available, eg, when pl011 interrupt is connected to MBIGEN
>>> interrupt controller, because the mbigen initialization is too late,
>>> which will lead to no IRQ due to no IRQ domain found, log is shown below,
>>>    "irq: no irq domain found for uart0 !"
>>>
>>> use of_irq_get() to return -EPROBE_DEFER as above, and in the function
>>> amba_device_try_add()/amba_device_add(), it will properly handle in such
>>> case, also return 0 in other fail cases to be consistent as before.
>>>
>>> Cc: Russell King <linux@armlinux.org.uk>
>>> Cc: Rob Herring <robh+dt@kernel.org>
>>> Cc: Frank Rowand <frowand.list@gmail.com>
>>> Reported-by: Ruizhe Lin <linruizhe@huawei.com>
>>> Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
>>> ---
>>>   drivers/amba/bus.c    | 27 +++++++++++++++++++++++++++
>>>   drivers/of/platform.c |  6 +-----
>>>   2 files changed, 28 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/drivers/amba/bus.c b/drivers/amba/bus.c
>>> index 36f2f42c8014..720aa6cdd402 100644
>>> --- a/drivers/amba/bus.c
>>> +++ b/drivers/amba/bus.c
>>> @@ -19,6 +19,7 @@
>>>   #include <linux/clk/clk-conf.h>
>>>   #include <linux/platform_device.h>
>>>   #include <linux/reset.h>
>>> +#include <linux/of_irq.h>
>>>
>>>   #include <asm/irq.h>
>>>
>>> @@ -371,12 +372,38 @@ static void amba_device_release(struct device *dev)
>>>          kfree(d);
>>>   }
>>>
>>> +static int of_amba_device_decode_irq(struct amba_device *dev)
>>> +{
>>> +       struct device_node *node = dev->dev.of_node;
>>> +       int i, irq = 0;
>>> +
>>> +       if (IS_ENABLED(CONFIG_OF_IRQ) && node) {
>>> +               /* Decode the IRQs and address ranges */
>>> +               for (i = 0; i < AMBA_NR_IRQS; i++) {
>>> +                       irq = of_irq_get(node, i);
>>> +                       if (irq < 0) {
>>> +                               if (irq == -EPROBE_DEFER)
>>> +                                       return irq;
>>> +                               irq = 0;
>>> +                       }
>>> +
>>> +                       dev->irq[i] = irq;
>>> +               }
>>> +       }
>>> +
>>> +       return 0;
>>> +}
>>> +
>>>   static int amba_device_try_add(struct amba_device *dev, struct resource *parent)
>>>   {
>>>          u32 size;
>>>          void __iomem *tmp;
>>>          int i, ret;
>>>
>>> +       ret = of_amba_device_decode_irq(dev);
>>> +       if (ret)
>>> +               goto err_out;
>>> +
> Similar to other resources the AMBA bus "gets" for the device, I think
> this should be moved into amba_probe() and not here. There's no reason
> to delay the addition of the device (and loading its module) because
> the IRQ isn't ready yet.

The following code in the amba_device_try_add() will be called, it uses irq[0]
and irq[1], so I put of_amba_device_decode_irq() into amba_device_try_add().

470         if (dev->irq[0])
471                 ret = device_create_file(&dev->dev, &dev_attr_irq0);
472         if (ret == 0 && dev->irq[1])
473                 ret = device_create_file(&dev->dev, &dev_attr_irq1);
474         if (ret == 0)
475                 return ret;

of_amba_device_decode_irq() in amba_device_try_add() won't lead to issue,
only delay the device add, right?

If make it into amba_probe(), the above code should be moved too, could we
make a new patch to move both of them, or don't move them?

Saravana Kannan Aug. 25, 2021, 8:04 a.m. UTC | #4

On Tue, Aug 24, 2021 at 9:05 PM Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
>
>
> On 2021/8/25 4:08, Saravana Kannan wrote:
> > On Tue, Aug 24, 2021 at 1:05 PM Rob Herring <robh+dt@kernel.org> wrote:
> >> +Saravana
> >>
> >> Saravana mentioned to me there may be some issues with this one...
> >>
> >>
> >> On Mon, Aug 16, 2021 at 2:43 AM Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
> >>> of_amba_device_create() uses irq_of_parse_and_map() to translate
> >>> a DT interrupt specification into a Linux virtual interrupt number.
> >>>
> >>> But it doesn't properly handle the case where the interrupt controller
> >>> is not yet available, eg, when pl011 interrupt is connected to MBIGEN
> >>> interrupt controller, because the mbigen initialization is too late,
> >>> which will lead to no IRQ due to no IRQ domain found, log is shown below,
> >>>    "irq: no irq domain found for uart0 !"
> >>>
> >>> use of_irq_get() to return -EPROBE_DEFER as above, and in the function
> >>> amba_device_try_add()/amba_device_add(), it will properly handle in such
> >>> case, also return 0 in other fail cases to be consistent as before.
> >>>
> >>> Cc: Russell King <linux@armlinux.org.uk>
> >>> Cc: Rob Herring <robh+dt@kernel.org>
> >>> Cc: Frank Rowand <frowand.list@gmail.com>
> >>> Reported-by: Ruizhe Lin <linruizhe@huawei.com>
> >>> Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
> >>> ---
> >>>   drivers/amba/bus.c    | 27 +++++++++++++++++++++++++++
> >>>   drivers/of/platform.c |  6 +-----
> >>>   2 files changed, 28 insertions(+), 5 deletions(-)
> >>>
> >>> diff --git a/drivers/amba/bus.c b/drivers/amba/bus.c
> >>> index 36f2f42c8014..720aa6cdd402 100644
> >>> --- a/drivers/amba/bus.c
> >>> +++ b/drivers/amba/bus.c
> >>> @@ -19,6 +19,7 @@
> >>>   #include <linux/clk/clk-conf.h>
> >>>   #include <linux/platform_device.h>
> >>>   #include <linux/reset.h>
> >>> +#include <linux/of_irq.h>
> >>>
> >>>   #include <asm/irq.h>
> >>>
> >>> @@ -371,12 +372,38 @@ static void amba_device_release(struct device *dev)
> >>>          kfree(d);
> >>>   }
> >>>
> >>> +static int of_amba_device_decode_irq(struct amba_device *dev)
> >>> +{
> >>> +       struct device_node *node = dev->dev.of_node;
> >>> +       int i, irq = 0;
> >>> +
> >>> +       if (IS_ENABLED(CONFIG_OF_IRQ) && node) {
> >>> +               /* Decode the IRQs and address ranges */
> >>> +               for (i = 0; i < AMBA_NR_IRQS; i++) {
> >>> +                       irq = of_irq_get(node, i);
> >>> +                       if (irq < 0) {
> >>> +                               if (irq == -EPROBE_DEFER)
> >>> +                                       return irq;
> >>> +                               irq = 0;
> >>> +                       }
> >>> +
> >>> +                       dev->irq[i] = irq;
> >>> +               }
> >>> +       }
> >>> +
> >>> +       return 0;
> >>> +}
> >>> +
> >>>   static int amba_device_try_add(struct amba_device *dev, struct resource *parent)
> >>>   {
> >>>          u32 size;
> >>>          void __iomem *tmp;
> >>>          int i, ret;
> >>>
> >>> +       ret = of_amba_device_decode_irq(dev);
> >>> +       if (ret)
> >>> +               goto err_out;
> >>> +
> > Similar to other resources the AMBA bus "gets" for the device, I think
> > this should be moved into amba_probe() and not here. There's no reason
> > to delay the addition of the device (and loading its module) because
> > the IRQ isn't ready yet.
>
> The following code in the amba_device_try_add() will be called, it uses irq[0]
> and irq[1], so I put of_amba_device_decode_irq() into amba_device_try_add().
>
> 470         if (dev->irq[0])
> 471                 ret = device_create_file(&dev->dev, &dev_attr_irq0);
> 472         if (ret == 0 && dev->irq[1])
> 473                 ret = device_create_file(&dev->dev, &dev_attr_irq1);
> 474         if (ret == 0)
> 475                 return ret;
>
> of_amba_device_decode_irq() in amba_device_try_add() won't lead to issue,
> only delay the device add, right?

But delaying the device add is the issue. For example, adding a device
could trigger the loading of the corresponding module using uevents.
But now this change would delay that step. That can have other
unintended consequences -- slowing down boot, what if the driver was
working fine without the IRQ, etc.

> If make it into amba_probe(), the above code should be moved too, could we
> make a new patch to move both of them, or don't move them?

I'd say move them both. If Russell hasn't already picked this up, then
I'd say redo your Patch 3/3.

Btw, I've been working on [1] cleaning up the one-off deferred probe
solution that we have for amba devices. That causes a bunch of other
headaches. Your patch 3/3 takes us further in the wrong direction by
adding more reasons for delaying the addition of the device.

-Saravana

[1] - https://lore.kernel.org/lkml/CAGETcx8b228nDUho3cX9AAQ-pXOfZTMv8cj2vhdx9yc_pk8q+A@mail.gmail.com/

Kefeng Wang Aug. 25, 2021, 8:39 a.m. UTC | #5

On 2021/8/25 16:04, Saravana Kannan wrote:
> On Tue, Aug 24, 2021 at 9:05 PM Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
>>
>> On 2021/8/25 4:08, Saravana Kannan wrote:
>>> On Tue, Aug 24, 2021 at 1:05 PM Rob Herring <robh+dt@kernel.org> wrote:
>>>> +Saravana
>>>>
>>>> Saravana mentioned to me there may be some issues with this one...
>>>>
>>>>
>>>> On Mon, Aug 16, 2021 at 2:43 AM Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
>>>>> of_amba_device_create() uses irq_of_parse_and_map() to translate
>>>>> a DT interrupt specification into a Linux virtual interrupt number.
>>>>>
>>>>> But it doesn't properly handle the case where the interrupt controller
>>>>> is not yet available, eg, when pl011 interrupt is connected to MBIGEN
>>>>> interrupt controller, because the mbigen initialization is too late,
>>>>> which will lead to no IRQ due to no IRQ domain found, log is shown below,
>>>>>     "irq: no irq domain found for uart0 !"
>>>>>
>>>>> use of_irq_get() to return -EPROBE_DEFER as above, and in the function
>>>>> amba_device_try_add()/amba_device_add(), it will properly handle in such
>>>>> case, also return 0 in other fail cases to be consistent as before.
>>>>>
>>>>> Cc: Russell King <linux@armlinux.org.uk>
>>>>> Cc: Rob Herring <robh+dt@kernel.org>
>>>>> Cc: Frank Rowand <frowand.list@gmail.com>
>>>>> Reported-by: Ruizhe Lin <linruizhe@huawei.com>
>>>>> Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
>>>>> ---
>>>>>    drivers/amba/bus.c    | 27 +++++++++++++++++++++++++++
>>>>>    drivers/of/platform.c |  6 +-----
>>>>>    2 files changed, 28 insertions(+), 5 deletions(-)
>>>>>
>>>>> diff --git a/drivers/amba/bus.c b/drivers/amba/bus.c
>>>>> index 36f2f42c8014..720aa6cdd402 100644
>>>>> --- a/drivers/amba/bus.c
>>>>> +++ b/drivers/amba/bus.c
>>>>> @@ -19,6 +19,7 @@
>>>>>    #include <linux/clk/clk-conf.h>
>>>>>    #include <linux/platform_device.h>
>>>>>    #include <linux/reset.h>
>>>>> +#include <linux/of_irq.h>
>>>>>
>>>>>    #include <asm/irq.h>
>>>>>
>>>>> @@ -371,12 +372,38 @@ static void amba_device_release(struct device *dev)
>>>>>           kfree(d);
>>>>>    }
>>>>>
>>>>> +static int of_amba_device_decode_irq(struct amba_device *dev)
>>>>> +{
>>>>> +       struct device_node *node = dev->dev.of_node;
>>>>> +       int i, irq = 0;
>>>>> +
>>>>> +       if (IS_ENABLED(CONFIG_OF_IRQ) && node) {
>>>>> +               /* Decode the IRQs and address ranges */
>>>>> +               for (i = 0; i < AMBA_NR_IRQS; i++) {
>>>>> +                       irq = of_irq_get(node, i);
>>>>> +                       if (irq < 0) {
>>>>> +                               if (irq == -EPROBE_DEFER)
>>>>> +                                       return irq;
>>>>> +                               irq = 0;
>>>>> +                       }
>>>>> +
>>>>> +                       dev->irq[i] = irq;
>>>>> +               }
>>>>> +       }
>>>>> +
>>>>> +       return 0;
>>>>> +}
>>>>> +
>>>>>    static int amba_device_try_add(struct amba_device *dev, struct resource *parent)
>>>>>    {
>>>>>           u32 size;
>>>>>           void __iomem *tmp;
>>>>>           int i, ret;
>>>>>
>>>>> +       ret = of_amba_device_decode_irq(dev);
>>>>> +       if (ret)
>>>>> +               goto err_out;
>>>>> +
>>> Similar to other resources the AMBA bus "gets" for the device, I think
>>> this should be moved into amba_probe() and not here. There's no reason
>>> to delay the addition of the device (and loading its module) because
>>> the IRQ isn't ready yet.
>> The following code in the amba_device_try_add() will be called, it uses irq[0]
>> and irq[1], so I put of_amba_device_decode_irq() into amba_device_try_add().
>>
>> 470         if (dev->irq[0])
>> 471                 ret = device_create_file(&dev->dev, &dev_attr_irq0);
>> 472         if (ret == 0 && dev->irq[1])
>> 473                 ret = device_create_file(&dev->dev, &dev_attr_irq1);
>> 474         if (ret == 0)
>> 475                 return ret;
>>
>> of_amba_device_decode_irq() in amba_device_try_add() won't lead to issue,
>> only delay the device add, right?
> But delaying the device add is the issue. For example, adding a device
> could trigger the loading of the corresponding module using uevents.
> But now this change would delay that step. That can have other
> unintended consequences -- slowing down boot, what if the driver was
> working fine without the IRQ, etc.
>
>> If make it into amba_probe(), the above code should be moved too, could we
>> make a new patch to move both of them, or don't move them?
> I'd say move them both. If Russell hasn't already picked this up, then
> I'd say redo your Patch 3/3.


Sure，I will update it and resend.

>
> Btw, I've been working on [1] cleaning up the one-off deferred probe
> solution that we have for amba devices. That causes a bunch of other
> headaches. Your patch 3/3 takes us further in the wrong direction by
> adding more reasons for delaying the addition of the device.
Thanks for your explanation.
> -Saravana
>
> [1] - https://lore.kernel.org/lkml/CAGETcx8b228nDUho3cX9AAQ-pXOfZTMv8cj2vhdx9yc_pk8q+A@mail.gmail.com/
> .
>

Rob Herring Aug. 25, 2021, 12:33 p.m. UTC | #6

On Tue, Aug 24, 2021 at 11:05 PM Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
>
>
> On 2021/8/25 4:08, Saravana Kannan wrote:
> > On Tue, Aug 24, 2021 at 1:05 PM Rob Herring <robh+dt@kernel.org> wrote:
> >> +Saravana
> >>
> >> Saravana mentioned to me there may be some issues with this one...
> >>
> >>
> >> On Mon, Aug 16, 2021 at 2:43 AM Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
> >>> of_amba_device_create() uses irq_of_parse_and_map() to translate
> >>> a DT interrupt specification into a Linux virtual interrupt number.
> >>>
> >>> But it doesn't properly handle the case where the interrupt controller
> >>> is not yet available, eg, when pl011 interrupt is connected to MBIGEN
> >>> interrupt controller, because the mbigen initialization is too late,
> >>> which will lead to no IRQ due to no IRQ domain found, log is shown below,
> >>>    "irq: no irq domain found for uart0 !"
> >>>
> >>> use of_irq_get() to return -EPROBE_DEFER as above, and in the function
> >>> amba_device_try_add()/amba_device_add(), it will properly handle in such
> >>> case, also return 0 in other fail cases to be consistent as before.
> >>>
> >>> Cc: Russell King <linux@armlinux.org.uk>
> >>> Cc: Rob Herring <robh+dt@kernel.org>
> >>> Cc: Frank Rowand <frowand.list@gmail.com>
> >>> Reported-by: Ruizhe Lin <linruizhe@huawei.com>
> >>> Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
> >>> ---
> >>>   drivers/amba/bus.c    | 27 +++++++++++++++++++++++++++
> >>>   drivers/of/platform.c |  6 +-----
> >>>   2 files changed, 28 insertions(+), 5 deletions(-)
> >>>
> >>> diff --git a/drivers/amba/bus.c b/drivers/amba/bus.c
> >>> index 36f2f42c8014..720aa6cdd402 100644
> >>> --- a/drivers/amba/bus.c
> >>> +++ b/drivers/amba/bus.c
> >>> @@ -19,6 +19,7 @@
> >>>   #include <linux/clk/clk-conf.h>
> >>>   #include <linux/platform_device.h>
> >>>   #include <linux/reset.h>
> >>> +#include <linux/of_irq.h>
> >>>
> >>>   #include <asm/irq.h>
> >>>
> >>> @@ -371,12 +372,38 @@ static void amba_device_release(struct device *dev)
> >>>          kfree(d);
> >>>   }
> >>>
> >>> +static int of_amba_device_decode_irq(struct amba_device *dev)
> >>> +{
> >>> +       struct device_node *node = dev->dev.of_node;
> >>> +       int i, irq = 0;
> >>> +
> >>> +       if (IS_ENABLED(CONFIG_OF_IRQ) && node) {
> >>> +               /* Decode the IRQs and address ranges */
> >>> +               for (i = 0; i < AMBA_NR_IRQS; i++) {
> >>> +                       irq = of_irq_get(node, i);
> >>> +                       if (irq < 0) {
> >>> +                               if (irq == -EPROBE_DEFER)
> >>> +                                       return irq;
> >>> +                               irq = 0;
> >>> +                       }
> >>> +
> >>> +                       dev->irq[i] = irq;
> >>> +               }
> >>> +       }
> >>> +
> >>> +       return 0;
> >>> +}
> >>> +
> >>>   static int amba_device_try_add(struct amba_device *dev, struct resource *parent)
> >>>   {
> >>>          u32 size;
> >>>          void __iomem *tmp;
> >>>          int i, ret;
> >>>
> >>> +       ret = of_amba_device_decode_irq(dev);
> >>> +       if (ret)
> >>> +               goto err_out;
> >>> +
> > Similar to other resources the AMBA bus "gets" for the device, I think
> > this should be moved into amba_probe() and not here. There's no reason
> > to delay the addition of the device (and loading its module) because
> > the IRQ isn't ready yet.
>
> The following code in the amba_device_try_add() will be called, it uses irq[0]
> and irq[1], so I put of_amba_device_decode_irq() into amba_device_try_add().
>
> 470         if (dev->irq[0])
> 471                 ret = device_create_file(&dev->dev, &dev_attr_irq0);
> 472         if (ret == 0 && dev->irq[1])
> 473                 ret = device_create_file(&dev->dev, &dev_attr_irq1);
> 474         if (ret == 0)
> 475                 return ret;

I wonder if we could just remove these. Why does userspace need them
in the first place? It's only an ABI if someone notices. Looking at
the history, AMBA bus was added in 2003 with just 'irq' and then
changed (ABI break) in 2004 to 'irq0' and 'irq1'.

Rob

[1] https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git/log/arch/arm/common/amba.c

Kefeng Wang Aug. 26, 2021, 2:45 a.m. UTC | #7

On 2021/8/25 16:04, Saravana Kannan wrote:
> On Tue, Aug 24, 2021 at 9:05 PM Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
>>
>> On 2021/8/25 4:08, Saravana Kannan wrote:
>>> On Tue, Aug 24, 2021 at 1:05 PM Rob Herring <robh+dt@kernel.org> wrote:
>>>> +Saravana
>>>>
>>>> Saravana mentioned to me there may be some issues with this one...
>>>>
>>>>
>>>> On Mon, Aug 16, 2021 at 2:43 AM Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
>>>>> of_amba_device_create() uses irq_of_parse_and_map() to translate
>>>>> a DT interrupt specification into a Linux virtual interrupt number.
>>>>>
>>>>> But it doesn't properly handle the case where the interrupt controller
>>>>> is not yet available, eg, when pl011 interrupt is connected to MBIGEN
>>>>> interrupt controller, because the mbigen initialization is too late,
>>>>> which will lead to no IRQ due to no IRQ domain found, log is shown below,
>>>>>     "irq: no irq domain found for uart0 !"
>>>>>
>>>>> use of_irq_get() to return -EPROBE_DEFER as above, and in the function
>>>>> amba_device_try_add()/amba_device_add(), it will properly handle in such
>>>>> case, also return 0 in other fail cases to be consistent as before.
>>>>>
>>>>> Cc: Russell King <linux@armlinux.org.uk>
>>>>> Cc: Rob Herring <robh+dt@kernel.org>
>>>>> Cc: Frank Rowand <frowand.list@gmail.com>
>>>>> Reported-by: Ruizhe Lin <linruizhe@huawei.com>
>>>>> Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
>>>>> ---
>>>>>    drivers/amba/bus.c    | 27 +++++++++++++++++++++++++++
>>>>>    drivers/of/platform.c |  6 +-----
>>>>>    2 files changed, 28 insertions(+), 5 deletions(-)
>>>>>
>>>>> diff --git a/drivers/amba/bus.c b/drivers/amba/bus.c
>>>>> index 36f2f42c8014..720aa6cdd402 100644
>>>>> --- a/drivers/amba/bus.c
>>>>> +++ b/drivers/amba/bus.c
>>>>> @@ -19,6 +19,7 @@
>>>>>    #include <linux/clk/clk-conf.h>
>>>>>    #include <linux/platform_device.h>
>>>>>    #include <linux/reset.h>
>>>>> +#include <linux/of_irq.h>
>>>>>
>>>>>    #include <asm/irq.h>
>>>>>
>>>>> @@ -371,12 +372,38 @@ static void amba_device_release(struct device *dev)
>>>>>           kfree(d);
>>>>>    }
>>>>>
>>>>> +static int of_amba_device_decode_irq(struct amba_device *dev)
>>>>> +{
>>>>> +       struct device_node *node = dev->dev.of_node;
>>>>> +       int i, irq = 0;
>>>>> +
>>>>> +       if (IS_ENABLED(CONFIG_OF_IRQ) && node) {
>>>>> +               /* Decode the IRQs and address ranges */
>>>>> +               for (i = 0; i < AMBA_NR_IRQS; i++) {
>>>>> +                       irq = of_irq_get(node, i);
>>>>> +                       if (irq < 0) {
>>>>> +                               if (irq == -EPROBE_DEFER)
>>>>> +                                       return irq;
>>>>> +                               irq = 0;
>>>>> +                       }
>>>>> +
>>>>> +                       dev->irq[i] = irq;
>>>>> +               }
>>>>> +       }
>>>>> +
>>>>> +       return 0;
>>>>> +}
>>>>> +
>>>>>    static int amba_device_try_add(struct amba_device *dev, struct resource *parent)
>>>>>    {
>>>>>           u32 size;
>>>>>           void __iomem *tmp;
>>>>>           int i, ret;
>>>>>
>>>>> +       ret = of_amba_device_decode_irq(dev);
>>>>> +       if (ret)
>>>>> +               goto err_out;
>>>>> +
>>> Similar to other resources the AMBA bus "gets" for the device, I think
>>> this should be moved into amba_probe() and not here. There's no reason
>>> to delay the addition of the device (and loading its module) because
>>> the IRQ isn't ready yet.
>> The following code in the amba_device_try_add() will be called, it uses irq[0]
>> and irq[1], so I put of_amba_device_decode_irq() into amba_device_try_add().
>>
>> 470         if (dev->irq[0])
>> 471                 ret = device_create_file(&dev->dev, &dev_attr_irq0);
>> 472         if (ret == 0 && dev->irq[1])
>> 473                 ret = device_create_file(&dev->dev, &dev_attr_irq1);
>> 474         if (ret == 0)
>> 475                 return ret;
>>
>> of_amba_device_decode_irq() in amba_device_try_add() won't lead to issue,
>> only delay the device add, right?
> But delaying the device add is the issue. For example, adding a device
> could trigger the loading of the corresponding module using uevents.
> But now this change would delay that step. That can have other
> unintended consequences -- slowing down boot, what if the driver was
> working fine without the IRQ, etc.
>
>> If make it into amba_probe(), the above code should be moved too, could we
>> make a new patch to move both of them, or don't move them?
> I'd say move them both. If Russell hasn't already picked this up, then
> I'd say redo your Patch 3/3.
I will resend with put it into amba_probe.
>
> Btw, I've been working on [1] cleaning up the one-off deferred probe
> solution that we have for amba devices. That causes a bunch of other
> headaches. Your patch 3/3 takes us further in the wrong direction by
> adding more reasons for delaying the addition of the device.

Got it,  and I could resend all combine your patch(due to context conflict

when changing same function) if you no object.


>
> -Saravana
>
> [1] - https://lore.kernel.org/lkml/CAGETcx8b228nDUho3cX9AAQ-pXOfZTMv8cj2vhdx9yc_pk8q+A@mail.gmail.com/
> .
>

Saravana Kannan Aug. 26, 2021, 4:45 a.m. UTC | #8

On Wed, Aug 25, 2021 at 7:45 PM Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
>
>
> On 2021/8/25 16:04, Saravana Kannan wrote:
> > On Tue, Aug 24, 2021 at 9:05 PM Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
> >>
> >> On 2021/8/25 4:08, Saravana Kannan wrote:
> >>> On Tue, Aug 24, 2021 at 1:05 PM Rob Herring <robh+dt@kernel.org> wrote:
> >>>> +Saravana
> >>>>
> >>>> Saravana mentioned to me there may be some issues with this one...
> >>>>
> >>>>
> >>>> On Mon, Aug 16, 2021 at 2:43 AM Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
> >>>>> of_amba_device_create() uses irq_of_parse_and_map() to translate
> >>>>> a DT interrupt specification into a Linux virtual interrupt number.
> >>>>>
> >>>>> But it doesn't properly handle the case where the interrupt controller
> >>>>> is not yet available, eg, when pl011 interrupt is connected to MBIGEN
> >>>>> interrupt controller, because the mbigen initialization is too late,
> >>>>> which will lead to no IRQ due to no IRQ domain found, log is shown below,
> >>>>>     "irq: no irq domain found for uart0 !"
> >>>>>
> >>>>> use of_irq_get() to return -EPROBE_DEFER as above, and in the function
> >>>>> amba_device_try_add()/amba_device_add(), it will properly handle in such
> >>>>> case, also return 0 in other fail cases to be consistent as before.
> >>>>>
> >>>>> Cc: Russell King <linux@armlinux.org.uk>
> >>>>> Cc: Rob Herring <robh+dt@kernel.org>
> >>>>> Cc: Frank Rowand <frowand.list@gmail.com>
> >>>>> Reported-by: Ruizhe Lin <linruizhe@huawei.com>
> >>>>> Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
> >>>>> ---
> >>>>>    drivers/amba/bus.c    | 27 +++++++++++++++++++++++++++
> >>>>>    drivers/of/platform.c |  6 +-----
> >>>>>    2 files changed, 28 insertions(+), 5 deletions(-)
> >>>>>
> >>>>> diff --git a/drivers/amba/bus.c b/drivers/amba/bus.c
> >>>>> index 36f2f42c8014..720aa6cdd402 100644
> >>>>> --- a/drivers/amba/bus.c
> >>>>> +++ b/drivers/amba/bus.c
> >>>>> @@ -19,6 +19,7 @@
> >>>>>    #include <linux/clk/clk-conf.h>
> >>>>>    #include <linux/platform_device.h>
> >>>>>    #include <linux/reset.h>
> >>>>> +#include <linux/of_irq.h>
> >>>>>
> >>>>>    #include <asm/irq.h>
> >>>>>
> >>>>> @@ -371,12 +372,38 @@ static void amba_device_release(struct device *dev)
> >>>>>           kfree(d);
> >>>>>    }
> >>>>>
> >>>>> +static int of_amba_device_decode_irq(struct amba_device *dev)
> >>>>> +{
> >>>>> +       struct device_node *node = dev->dev.of_node;
> >>>>> +       int i, irq = 0;
> >>>>> +
> >>>>> +       if (IS_ENABLED(CONFIG_OF_IRQ) && node) {
> >>>>> +               /* Decode the IRQs and address ranges */
> >>>>> +               for (i = 0; i < AMBA_NR_IRQS; i++) {
> >>>>> +                       irq = of_irq_get(node, i);
> >>>>> +                       if (irq < 0) {
> >>>>> +                               if (irq == -EPROBE_DEFER)
> >>>>> +                                       return irq;
> >>>>> +                               irq = 0;
> >>>>> +                       }
> >>>>> +
> >>>>> +                       dev->irq[i] = irq;
> >>>>> +               }
> >>>>> +       }
> >>>>> +
> >>>>> +       return 0;
> >>>>> +}
> >>>>> +
> >>>>>    static int amba_device_try_add(struct amba_device *dev, struct resource *parent)
> >>>>>    {
> >>>>>           u32 size;
> >>>>>           void __iomem *tmp;
> >>>>>           int i, ret;
> >>>>>
> >>>>> +       ret = of_amba_device_decode_irq(dev);
> >>>>> +       if (ret)
> >>>>> +               goto err_out;
> >>>>> +
> >>> Similar to other resources the AMBA bus "gets" for the device, I think
> >>> this should be moved into amba_probe() and not here. There's no reason
> >>> to delay the addition of the device (and loading its module) because
> >>> the IRQ isn't ready yet.
> >> The following code in the amba_device_try_add() will be called, it uses irq[0]
> >> and irq[1], so I put of_amba_device_decode_irq() into amba_device_try_add().
> >>
> >> 470         if (dev->irq[0])
> >> 471                 ret = device_create_file(&dev->dev, &dev_attr_irq0);
> >> 472         if (ret == 0 && dev->irq[1])
> >> 473                 ret = device_create_file(&dev->dev, &dev_attr_irq1);
> >> 474         if (ret == 0)
> >> 475                 return ret;
> >>
> >> of_amba_device_decode_irq() in amba_device_try_add() won't lead to issue,
> >> only delay the device add, right?
> > But delaying the device add is the issue. For example, adding a device
> > could trigger the loading of the corresponding module using uevents.
> > But now this change would delay that step. That can have other
> > unintended consequences -- slowing down boot, what if the driver was
> > working fine without the IRQ, etc.
> >
> >> If make it into amba_probe(), the above code should be moved too, could we
> >> make a new patch to move both of them, or don't move them?
> > I'd say move them both. If Russell hasn't already picked this up, then
> > I'd say redo your Patch 3/3.
> I will resend with put it into amba_probe.
> >
> > Btw, I've been working on [1] cleaning up the one-off deferred probe
> > solution that we have for amba devices. That causes a bunch of other
> > headaches. Your patch 3/3 takes us further in the wrong direction by
> > adding more reasons for delaying the addition of the device.
>
> Got it,  and I could resend all combine your patch(due to context conflict
>
> when changing same function) if you no object.

If you want to resolve the conflict with my patch and resend it while
keeping me as the author, I would definitely appreciate it.

-Saravana
>
>
> >
> > -Saravana
> >
> > [1] - https://lore.kernel.org/lkml/CAGETcx8b228nDUho3cX9AAQ-pXOfZTMv8cj2vhdx9yc_pk8q+A@mail.gmail.com/
> > .
> >

Kefeng Wang Aug. 26, 2021, 6:22 a.m. UTC | #9

On 2021/8/26 12:45, Saravana Kannan wrote:
> On Wed, Aug 25, 2021 at 7:45 PM Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
>>
>> On 2021/8/25 16:04, Saravana Kannan wrote:
>>> On Tue, Aug 24, 2021 at 9:05 PM Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
>>>> On 2021/8/25 4:08, Saravana Kannan wrote:
>>>>> On Tue, Aug 24, 2021 at 1:05 PM Rob Herring <robh+dt@kernel.org> wrote:
>>>>>> +Saravana
>>>>>>
>>>>>> Saravana mentioned to me there may be some issues with this one...
>>>>>>
>>>>>>
>>>>>> On Mon, Aug 16, 2021 at 2:43 AM Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
>>>>>>> of_amba_device_create() uses irq_of_parse_and_map() to translate
>>>>>>> a DT interrupt specification into a Linux virtual interrupt number.
>>>>>>>
>>>>>>> But it doesn't properly handle the case where the interrupt controller
>>>>>>> is not yet available, eg, when pl011 interrupt is connected to MBIGEN
>>>>>>> interrupt controller, because the mbigen initialization is too late,
>>>>>>> which will lead to no IRQ due to no IRQ domain found, log is shown below,
>>>>>>>      "irq: no irq domain found for uart0 !"
>>>>>>>
>>>>>>> use of_irq_get() to return -EPROBE_DEFER as above, and in the function
>>>>>>> amba_device_try_add()/amba_device_add(), it will properly handle in such
>>>>>>> case, also return 0 in other fail cases to be consistent as before.
>>>>>>>
>>>>>>> Cc: Russell King <linux@armlinux.org.uk>
>>>>>>> Cc: Rob Herring <robh+dt@kernel.org>
>>>>>>> Cc: Frank Rowand <frowand.list@gmail.com>
>>>>>>> Reported-by: Ruizhe Lin <linruizhe@huawei.com>
>>>>>>> Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
>>>>>>> ---
...
>>>>> Similar to other resources the AMBA bus "gets" for the device, I think
>>>>> this should be moved into amba_probe() and not here. There's no reason
>>>>> to delay the addition of the device (and loading its module) because
>>>>> the IRQ isn't ready yet.
>>>> The following code in the amba_device_try_add() will be called, it uses irq[0]
>>>> and irq[1], so I put of_amba_device_decode_irq() into amba_device_try_add().
>>>>
>>>> 470         if (dev->irq[0])
>>>> 471                 ret = device_create_file(&dev->dev, &dev_attr_irq0);
>>>> 472         if (ret == 0 && dev->irq[1])
>>>> 473                 ret = device_create_file(&dev->dev, &dev_attr_irq1);
>>>> 474         if (ret == 0)
>>>> 475                 return ret;
>>>>
>>>> of_amba_device_decode_irq() in amba_device_try_add() won't lead to issue,
>>>> only delay the device add, right?
>>> But delaying the device add is the issue. For example, adding a device
>>> could trigger the loading of the corresponding module using uevents.
>>> But now this change would delay that step. That can have other
>>> unintended consequences -- slowing down boot, what if the driver was
>>> working fine without the IRQ, etc.
>>>
>>>> If make it into amba_probe(), the above code should be moved too, could we
>>>> make a new patch to move both of them, or don't move them?
>>> I'd say move them both. If Russell hasn't already picked this up, then
>>> I'd say redo your Patch 3/3.
>> I will resend with put it into amba_probe.
>>> Btw, I've been working on [1] cleaning up the one-off deferred probe
>>> solution that we have for amba devices. That causes a bunch of other
>>> headaches. Your patch 3/3 takes us further in the wrong direction by
>>> adding more reasons for delaying the addition of the device.
>> Got it,  and I could resend all combine your patch(due to context conflict
>>
>> when changing same function) if you no object.
> If you want to resolve the conflict with my patch and resend it while
> keeping me as the author, I would definitely appreciate it.
Yes, I will keep it, and rebase my patch based on it.
>
> -Saravana
>>
>>> -Saravana
>>>
>>> [1] - https://lore.kernel.org/lkml/CAGETcx8b228nDUho3cX9AAQ-pXOfZTMv8cj2vhdx9yc_pk8q+A@mail.gmail.com/
>>> .
>>>
> .
>

Kefeng Wang Aug. 26, 2021, 8:22 a.m. UTC | #10

>>> Btw, I've been working on [1] cleaning up the one-off deferred probe
>>> solution that we have for amba devices. That causes a bunch of other
>>> headaches. Your patch 3/3 takes us further in the wrong direction by
>>> adding more reasons for delaying the addition of the device.

Hi Saravana, I try the link[1], but with it, there is a crash when boot 
(qemu-system-arm -M vexpress-a15),

without it, boot successfully.

[    2.246057] aaci-pl041 1c040000.aaci: ARM AC'97 Interface PL041 rev0 
at 0x1c040000, irq 36
[    2.246357] aaci-pl041 1c040000.aaci: FIFO 512 entries
[    2.248617] NET: Registered PF_PACKET protocol family
[    2.250481] 9pnet: Installing 9P2000 support
[    2.251474] Registering SWP/SWPB emulation handler
[    2.284374] 1c090000.serial: ttyAMA0 at MMIO 0x1c090000 (irq = 41, 
base_baud = 0) is a PL011 rev1
[    2.287797] printk: console [ttyAMA0] enabled
[    2.287797] printk: console [ttyAMA0] enabled
[    2.288262] input: AT Raw Set 2 keyboard as 
/devices/platform/bus@8000000/bus@8000000:motherboard-bus/bus@8000000:motherboard-bus:iofpga-bus@300000000/1c060000.kmi/serio0/input/input0
[    2.288262] input: AT Raw Set 2 keyboard as 
/devices/platform/bus@8000000/bus@8000000:motherboard-bus/bus@8000000:motherboard-bus:iofpga-bus@300000000/1c060000.kmi/serio0/input/input0
[    2.288755] printk: bootconsole [earlycon0] disabled
[    2.288755] printk: bootconsole [earlycon0] disabled
[    2.294507] 1c0a0000.serial: ttyAMA1 at MMIO 0x1c0a0000 (irq = 42, 
base_baud = 0) is a PL011 rev1
[    2.296950] 1c0b0000.serial: ttyAMA2 at MMIO 0x1c0b0000 (irq = 43, 
base_baud = 0) is a PL011 rev1
[    2.298636] 1c0c0000.serial: ttyAMA3 at MMIO 0x1c0c0000 (irq = 44, 
base_baud = 0) is a PL011 rev1
[    2.300496] 8<--- cut here ---
[    2.300775] ------------[ cut here ]------------
[    2.301260] Unable to handle kernel NULL pointer dereference at 
virtual address 00000000
[    2.300928] WARNING: CPU: 1 PID: 27 at 
/home/wkf/work/hulk/lib/refcount.c:25 refcount_warn_saturate+0x108/0x174
[    2.301700] pgd = (ptrval)
[    2.302002] refcount_t: addition on 0; use-after-free.
[    2.302184] [00000000] *pgd=00000000
[    2.302363] Modules linked in:
[    2.302384]
[    2.302753]
[    2.303501] CPU: 1 PID: 27 Comm: kworker/u4:1 Not tainted 5.14.0-rc7+ 
#193
[    2.303990] Hardware name: ARM-Versatile Express
[    2.304198] Internal error: Oops: 5 [#1] SMP ARM
[    2.304537] Workqueue: events_unbound deferred_probe_work_func
[    2.304829] Modules linked in:
[    2.304865]
[    2.305133]
[    2.305401] Backtrace:
[    2.305562] CPU: 0 PID: 41 Comm: kworker/0:2 Not tainted 5.14.0-rc7+ #193
[    2.305614] Hardware name: ARM-Versatile Express
[    2.305576]
[    2.305781] [<c010c780>] (dump_backtrace) from [<c010cacc>] 
(show_stack+0x20/0x24)
[    2.306266] Workqueue: events_long serio_handle_event
[    2.306732]  r7:00000009 r6:00000000 r5:c0b1efb8 r4:600b0093
[    2.306889] PC is at strcmp+0x18/0x44
[    2.307115] [<c010caac>] (show_stack) from [<c091eba4>] 
(dump_stack_lvl+0x48/0x54)
[    2.307263] LR is at platform_match+0xb8/0xcc
[    2.307498] [<c091eb5c>] (dump_stack_lvl) from [<c091ebc8>] 
(dump_stack+0x18/0x1c)
[    2.307739] pc : [<c0560aec>]    lr : [<c064626c>]    psr: 60000013
[    2.307988]  r5:c0b4ef98 r4:c165ddc4
[    2.308317] sp : c1675d70  ip : c1675d80  fp : c1675d7c
[    2.308433] [<c091ebb0>] (dump_stack) from [<c01289c4>] 
(__warn+0x110/0x114)
[    2.308743] r10: 00000000  r9 : 00000000  r8 : 00000001
[    2.308961] [<c01288b4>] (__warn) from [<c0128a4c>] 
(warn_slowpath_fmt+0x84/0xc0)
[    2.309252] r7 : c0d04d08  r6 : c13aed18  r5 : c1090fc0  r4 : c13aed18
[    2.309547]  r9:00000009 r8:c0504f10 r7:00000019 r6:c0b4ef98 
r5:c0b4efbc r4:c0d04d08
[    2.309975] [<c01289cc>] (warn_slowpath_fmt) from [<c0504f10>] 
(refcount_warn_saturate+0x108/0x174)
[    2.310531] r3 : c0a5e1c0  r2 : 00000002  r1 : c0b82860  r0 : 00000000
[    2.311263] Flags: nZCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  
Segment none
[    2.311298]  r9:600b0013 r8:00000000 r7:00000000 r6:c1787eb8 
r5:c165de3c r4:c1702b38
[    2.311899] Control: 10c5387d  Table: 8000406a  DAC: 00000051
[    2.312117] [<c0504e08>] (refcount_warn_saturate) from [<c055657c>] 
(klist_next+0x134/0x138)
[    2.312755] Register r0 information:
[    2.312964] [<c0556448>] (klist_next) from [<c06411dc>] 
(bus_for_each_drv+0x74/0xc8)
[    2.313345]  NULL pointer
[    2.313690] Register r1 information:
[    2.313736]  r9:00000000 r8:00000001 r7:c0d04d08 r6:c064373c 
r5:c165de6c r4:00000000
[    2.313933]  non-slab/vmalloc memory
[    2.314172] [<c0641168>] (bus_for_each_drv) from [<c0642f38>] 
(__device_attach+0xf0/0x15c)
[    2.315042]  r7:c10846b8 r6:c13ae444 r5:c13ae400 r4:c0d04d08
[    2.315060] Register r2 information:
[    2.315243] [<c0642e48>] (__device_attach) from [<c064390c>] 
(device_initial_probe+0x1c/0x20)
[    2.315276]  non-paged memory
[    2.315569]  r8:00000000 r7:c10846b8 r6:c13ae400 r5:c107d690 r4:c13ae400
[    2.315593] [<c06438f0>] (device_initial_probe) from [<c0642080>] 
(bus_probe_device+0x94/0x9c)
[    2.316192] [<c0641fec>] (bus_probe_device) from [<c064259c>] 
(deferred_probe_work_func+0x8c/0xb8)
[    2.316939]  r7:c10846b8 r6:c10846a4 r5:c10846a4 r4:c13ae400
[    2.317573] [<c0642510>] (deferred_probe_work_func) from [<c01475a4>] 
(process_one_work+0x238/0x594)
[    2.318513]  r9:00000000 r8:00000000 r7:c1225b00 r6:c1206200 
r5:c16b6f80 r4:c10846d4
[    2.318931] Register r3 information: non-slab/vmalloc memory
[    2.319218] [<c014736c>] (process_one_work) from [<c0147bc4>] 
(worker_thread+0x2c4/0x5f4)
[    2.320001]  r10:c0d03d00 r9:00000088 r8:ffffe000 r7:c1206218 
r6:c16b6f94 r5:c1206200
[    2.320280] Register r4 information:
[    2.320614]  r4:c16b6f80
[    2.320942] [<c0147900>] (worker_thread) from [<c014feb4>] 
(kthread+0x178/0x194)
[    2.321111]  slab kmalloc-1k
[    2.321403]  r10:c165c000 r9:c1313e74 r8:00000000 r7:c16b6f80 
r6:c0147900 r5:c16b5580
[    2.321391]  start c13aec00 pointer offset 280
[    2.321993]  r4:c13d4980
[    2.322006]  size 1024
[    2.322176] [<c014fd3c>] (kthread) from [<c0100150>] 
(ret_from_fork+0x14/0x24)
[    2.323165] Exception stack(0xc165dfb0 to 0xc165dff8)
[    2.323187]
[    2.323371] Register r5 information: non-slab/vmalloc memory
[    2.323535] Register r6 information: slab kmalloc-1k start c13aec00 
pointer offset 280 size 1024
[    2.324594] Register r7 information:
[    2.324597] dfa0:                                     00000000 
00000000 00000000 00000000
[    2.325507]  non-slab/vmalloc memory
[    2.325830] Register r8 information: non-paged memory
[    2.326267] Register r9 information:
[    2.326274] dfc0: 00000000 00000000 00000000 00000000 00000000 
00000000 00000000 00000000
[    2.326453]  NULL pointer
[    2.326942] Register r10 information: NULL pointer
[    2.327159] dfe0: 00000000 00000000 00000000 00000000 00000013 00000000
[    2.327258] Register r11 information: non-slab/vmalloc memory
[    2.327904] Register r12 information:
[    2.327928]  r10:00000000 r9:00000000 r8:00000000 r7:00000000 
r6:00000000 r5:c014fd3c
[    2.327937]  non-slab/vmalloc memory
[    2.328057] Process kworker/0:2 (pid: 41, stack limit = 0x(ptrval))
[    2.328204] Stack: (0xc1675d70 to 0xc1676000)
[    2.328479]  r4:c16b5580
[    2.329851] ---[ end trace f293b13f591ee203 ]---
[    2.330027] 5d60:                                     c1675d9c 
c1675d80 c064626c c0560ae0
[    2.331070] 5d80: c1090fc0 c1675df4 c13aed18 c0d04d08 c1675dbc 
c1675da0 c0643778 c06461c0
[    2.331120] 8<--- cut here ---
[    2.331739] Unhandled fault: page domain fault (0x01b) at 0x00000010
[    2.331915] 5da0: 00000000 c1675df4 c064373c c0d04d08 c1675dec 
c1675dc0 c06411d0 c0643748
[    2.332379] pgd = (ptrval)
[    2.332408] 5dc0: c1675dec c16bf06c c1787eb8 76076098 c0d04d08 
c13aed18 c13aed5c c13aa800
[    2.332889] [00000010] *pgd=00000000
[    2.332951] 5de0: c1675e24 c1675df0 c0642f38 c0641174 c1675e44 
c13aed18 00000001 76076098
[    2.333535] 5e00: 00000000 c13aed18 c108ea84 c13aed18 c13aa800 
c10843a0 c1675e34 c1675e28
[    2.334201] 5e20: c064390c c0642e54 c1675e54 c1675e38 c0642080 
c06438fc c13aed18 00000000
[    2.334845] 5e40: c0d04d08 c13aa800 c1675eb4 c1675e58 c063fa1c 
c0641ff8 c13a5180 c1706380
[    2.335464] 5e60: eee2e0c0 c1201180 c071c008 c11ea558 c0b7f260 
c0b7f28c c1675eb4 c1675e88
[    2.336088] 5e80: c02e6368 76076098 00000001 c1706f4c c13aec00 
c108ea54 c1706f40 c11ea558
[    2.336758] 5ea0: c0b7f260 c0b7f28c c1675ef4 c1675eb8 c071c0e4 
c063f620 c0923300 c0158300
[    2.337399] 5ec0: 00000000 c13aed18 c1675ef4 c108ea70 c1702e80 
effc4400 effc7700 00000000
[    2.338025] 5ee0: 00000000 c10b2580 c1675f34 c1675ef8 c01475a4 
c071bf38 c13a5100 ffffe000
[    2.338663] 5f00: c1675f1c c1675f10 c0149250 c1702e80 effc4400 
c1702e94 effc4418 ffffe000
[    2.339296] 5f20: 00000008 c0d03d00 c1675f74 c1675f38 c014795c 
c0147378 c130fe74 c0b0c2c4
[    2.340102] 5f40: c10b1d2a effc4400 c1675f74 c1706000 c1706880 
c0147900 c1702e80 00000000
[    2.340736] 5f60: c130fe74 c1674000 c1675fac c1675f78 c014feb4 
c014790c c1706024 c1706024
[    2.341315] 5f80: c1675fac c1706880 c014fd3c 00000000 00000000 
00000000 00000000 00000000
[    2.341852] 5fa0: 00000000 c1675fb0 c0100150 c014fd48 00000000 
00000000 00000000 00000000
[    2.342407] 5fc0: 00000000 00000000 00000000 00000000 00000000 
00000000 00000000 00000000
[    2.342957] 5fe0: 00000000 00000000 00000000 00000000 00000013 
00000000 00000000 00000000
[    2.343361] Backtrace:
[    2.343573] [<c0560ad4>] (strcmp) from [<c064626c>] 
(platform_match+0xb8/0xcc)
[    2.343954] [<c06461b4>] (platform_match) from [<c0643778>] 
(__device_attach_driver+0x3c/0xc4)
[    2.344369]  r7:c0d04d08 r6:c13aed18 r5:c1675df4 r4:c1090fc0
[    2.344615] [<c064373c>] (__device_attach_driver) from [<c06411d0>] 
(bus_for_each_drv+0x68/0xc8)
[    2.345015]  r7:c0d04d08 r6:c064373c r5:c1675df4 r4:00000000
[    2.345283] [<c0641168>] (bus_for_each_drv) from [<c0642f38>] 
(__device_attach+0xf0/0x15c)
[    2.345654]  r7:c13aa800 r6:c13aed5c r5:c13aed18 r4:c0d04d08
[    2.345903] [<c0642e48>] (__device_attach) from [<c064390c>] 
(device_initial_probe+0x1c/0x20)
[    2.346325]  r8:c10843a0 r7:c13aa800 r6:c13aed18 r5:c108ea84 r4:c13aed18
[    2.346635] [<c06438f0>] (device_initial_probe) from [<c0642080>] 
(bus_probe_device+0x94/0x9c)
[    2.347038] [<c0641fec>] (bus_probe_device) from [<c063fa1c>] 
(device_add+0x408/0x8b8)
[    2.347419]  r7:c13aa800 r6:c0d04d08 r5:00000000 r4:c13aed18
[    2.347695] [<c063f614>] (device_add) from [<c071c0e4>] 
(serio_handle_event+0x1b8/0x234)
[    2.348094]  r10:c0b7f28c r9:c0b7f260 r8:c11ea558 r7:c1706f40 
r6:c108ea54 r5:c13aec00
[    2.348453]  r4:c1706f4c
[    2.348604] [<c071bf2c>] (serio_handle_event) from [<c01475a4>] 
(process_one_work+0x238/0x594)
[    2.348968]  r10:c10b2580 r9:00000000 r8:00000000 r7:effc7700 
r6:effc4400 r5:c1702e80
[    2.349315]  r4:c108ea70
[    2.349468] [<c014736c>] (process_one_work) from [<c014795c>] 
(worker_thread+0x5c/0x5f4)
[    2.349875]  r10:c0d03d00 r9:00000008 r8:ffffe000 r7:effc4418 
r6:c1702e94 r5:effc4400
[    2.350169]  r4:c1702e80
[    2.350315] [<c0147900>] (worker_thread) from [<c014feb4>] 
(kthread+0x178/0x194)
[    2.350687]  r10:c1674000 r9:c130fe74 r8:00000000 r7:c1702e80 
r6:c0147900 r5:c1706880
[    2.351038]  r4:c1706000
[    2.351182] [<c014fd3c>] (kthread) from [<c0100150>] 
(ret_from_fork+0x14/0x24)
[    2.351500] Exception stack(0xc1675fb0 to 0xc1675ff8)
[    2.351855] 5fa0:                                     00000000 
00000000 00000000 00000000
[    2.352415] 5fc0: 00000000 00000000 00000000 00000000 00000000 
00000000 00000000 00000000
[    2.352923] 5fe0: 00000000 00000000 00000000 00000000 00000013 00000000
[    2.353283]  r10:00000000 r9:00000000 r8:00000000 r7:00000000 
r6:00000000 r5:c014fd3c
[    2.353618]  r4:c1706880
[    2.354146] Code: e24cb004 ea000001 e3530000 0a000006 (e4d03001)
[    2.354860] Internal error: : 1b [#2] SMP ARM
[    2.355172] Modules linked in:
[    2.355254] ---[ end trace f293b13f591ee204 ]---
[    2.355650] CPU: 1 PID: 27 Comm: kworker/u4:1 Tainted: G      D 
W         5.14.0-rc7+ #193
[    2.355888] Kernel panic - not syncing: Fatal exception
[    2.355990] Hardware name: ARM-Versatile Express
[    2.356735] Workqueue: events_unbound deferred_probe_work_func
[    2.357217] PC is at klist_put+0x20/0xa4
[    2.357537] LR is at klist_iter_exit+0x24/0x30
[    2.357872] pc : [<c0556280>]    lr : [<c0556340>]    psr: a00b0013
[    2.358299] sp : c165de00  ip : c165de20  fp : c165de1c
[    2.358655] r10: c10b2580  r9 : 00000000  r8 : 00000001
[    2.359009] r7 : c1702b38  r6 : 00000000  r5 : c165de6c  r4 : 00000000
[    2.359440] r3 : 00000000  r2 : 76076098  r1 : 00000000  r0 : 00000000
[    2.359893] Flags: NzCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  
Segment none
[    2.360368] Control: 10c5387d  Table: 8177806a  DAC: 00000051
[    2.360759] Register r0 information: NULL pointer
[    2.361126] Register r1 information: NULL pointer
[    2.361477] Register r2 information: non-paged memory
[    2.361835] Register r3 information: NULL pointer
[    2.362162] Register r4 information: NULL pointer
[    2.362486] Register r5 information: non-slab/vmalloc memory
[    2.362887] Register r6 information: NULL pointer
[    2.363226] Register r7 information: slab kmalloc-128 start c1702b00 
pointer offset 56 size 128
[    2.363919] Register r8 information: non-paged memory
[    2.364299] Register r9 information: NULL pointer
[    2.364668] Register r10 information: non-slab/vmalloc memory
[    2.365089] Register r11 information: non-slab/vmalloc memory
[    2.365466] Register r12 information: non-slab/vmalloc memory
[    2.365872] Process kworker/u4:1 (pid: 27, stack limit = 0x(ptrval))
[    2.366297] Stack: (0xc165de00 to 0xc165e000)
[    2.366769] de00: c165de3c c165de6c c064373c c0d04d08 c165de34 
c165de20 c0556340 c055626c
[    2.367478] de20: 00000000 c165de6c c165de64 c165de38 c0641208 
c0556328 c165de64 c136996c
[    2.368192] de40: c1702b38 76076098 c0d04d08 c13ae400 c13ae444 
c10846b8 c165de9c c165de68
[    2.368934] de60: c0642f38 c0641174 c063c4c4 c13ae400 00000001 
76076098 00000000 c13ae400
[    2.369654] de80: c107d690 c13ae400 c10846b8 00000000 c165deac 
c165dea0 c064390c c0642e54
[    2.370367] dea0: c165decc c165deb0 c0642080 c06438fc c13ae400 
c10846a4 c10846a4 c10846b8
[    2.371046] dec0: c165def4 c165ded0 c064259c c0641ff8 c10846d4 
c16b6f80 c1206200 c1225b00
[    2.371774] dee0: 00000000 00000000 c165df34 c165def8 c01475a4 
c064251c c13906c0 ffffe000
[    2.372470] df00: c165df1c c165df10 c0149250 c16b6f80 c1206200 
c16b6f94 c1206218 ffffe000
[    2.373189] df20: 00000088 c0d03d00 c165df74 c165df38 c0147bc4 
c0147378 c1313e74 c0b0c2c4
[    2.373925] df40: c10b1d2a c1206200 c165df74 c13d4980 c16b5580 
c0147900 c16b6f80 00000000
[    2.374628] df60: c1313e74 c165c000 c165dfac c165df78 c014feb4 
c014790c c13d49a4 c13d49a4
[    2.375345] df80: c165dfac c16b5580 c014fd3c 00000000 00000000 
00000000 00000000 00000000
[    2.376080] dfa0: 00000000 c165dfb0 c0100150 c014fd48 00000000 
00000000 00000000 00000000
[    2.376807] dfc0: 00000000 00000000 00000000 00000000 00000000 
00000000 00000000 00000000
[    2.377534] dfe0: 00000000 00000000 00000000 00000000 00000013 
00000000 00000000 00000000
[    2.378080] Backtrace:
[    2.378297] [<c0556260>] (klist_put) from [<c0556340>] 
(klist_iter_exit+0x24/0x30)
[    2.378819]  r7:c0d04d08 r6:c064373c r5:c165de6c r4:c165de3c
[    2.379199] [<c055631c>] (klist_iter_exit) from [<c0641208>] 
(bus_for_each_drv+0xa0/0xc8)
[    2.379701]  r5:c165de6c r4:00000000
[    2.379932] [<c0641168>] (bus_for_each_drv) from [<c0642f38>] 
(__device_attach+0xf0/0x15c)
[    2.380482]  r7:c10846b8 r6:c13ae444 r5:c13ae400 r4:c0d04d08
[    2.380871] [<c0642e48>] (__device_attach) from [<c064390c>] 
(device_initial_probe+0x1c/0x20)
[    2.381453]  r8:00000000 r7:c10846b8 r6:c13ae400 r5:c107d690 r4:c13ae400
[    2.381851] [<c06438f0>] (device_initial_probe) from [<c0642080>] 
(bus_probe_device+0x94/0x9c)
[    2.382364] [<c0641fec>] (bus_probe_device) from [<c064259c>] 
(deferred_probe_work_func+0x8c/0xb8)
[    2.382936]  r7:c10846b8 r6:c10846a4 r5:c10846a4 r4:c13ae400
[    2.383266] [<c0642510>] (deferred_probe_work_func) from [<c01475a4>] 
(process_one_work+0x238/0x594)
[    2.383885]  r9:00000000 r8:00000000 r7:c1225b00 r6:c1206200 
r5:c16b6f80 r4:c10846d4
[    2.384344] [<c014736c>] (process_one_work) from [<c0147bc4>] 
(worker_thread+0x2c4/0x5f4)
[    2.384893]  r10:c0d03d00 r9:00000088 r8:ffffe000 r7:c1206218 
r6:c16b6f94 r5:c1206200
[    2.385341]  r4:c16b6f80
[    2.385536] [<c0147900>] (worker_thread) from [<c014feb4>] 
(kthread+0x178/0x194)
[    2.386030]  r10:c165c000 r9:c1313e74 r8:00000000 r7:c16b6f80 
r6:c0147900 r5:c16b5580
[    2.386540]  r4:c13d4980
[    2.386742] [<c014fd3c>] (kthread) from [<c0100150>] 
(ret_from_fork+0x14/0x24)
[    2.387211] Exception stack(0xc165dfb0 to 0xc165dff8)
[    2.387653] dfa0:                                     00000000 
00000000 00000000 00000000
[    2.388374] dfc0: 00000000 00000000 00000000 00000000 00000000 
00000000 00000000 00000000
[    2.389071] dfe0: 00000000 00000000 00000000 00000000 00000013 00000000
[    2.389546]  r10:00000000 r9:00000000 r8:00000000 r7:00000000 
r6:00000000 r5:c014fd3c
[    2.390037]  r4:c16b5580
[    2.390376] Code: e1a07000 e1a06001 e3c44001 e1a00004 (e5945010)
[    2.390836] ---[ end trace f293b13f591ee205 ]---
[    2.391580] ---[ end Kernel panic - not syncing: Fatal exception ]---


>>> -Saravana
>>>
>>> [1] - https://lore.kernel.org/lkml/CAGETcx8b228nDUho3cX9AAQ-pXOfZTMv8cj2vhdx9yc_pk8q+A@mail.gmail.com/
>>> .
>>>
> .
>

Saravana Kannan Aug. 27, 2021, 12:04 a.m. UTC | #11

On Thu, Aug 26, 2021 at 1:22 AM Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
>
>
> >>> Btw, I've been working on [1] cleaning up the one-off deferred probe
> >>> solution that we have for amba devices. That causes a bunch of other
> >>> headaches. Your patch 3/3 takes us further in the wrong direction by
> >>> adding more reasons for delaying the addition of the device.
>
> Hi Saravana, I try the link[1], but with it, there is a crash when boot
> (qemu-system-arm -M vexpress-a15),

Hi,

It's hard to make sense of the logs. Looks like two different threads
might be printing to the log at the same time? Can you please enable
the config that prints the thread ID (forgot what it's called) and
collect this again? With what I could tell the crash seems to be
happening somewhere in platform_match(), but that's not related to
this patch at all?

-Saravana

>
> without it, boot successfully.
>
> [    2.246057] aaci-pl041 1c040000.aaci: ARM AC'97 Interface PL041 rev0
> at 0x1c040000, irq 36
> [    2.246357] aaci-pl041 1c040000.aaci: FIFO 512 entries
> [    2.248617] NET: Registered PF_PACKET protocol family
> [    2.250481] 9pnet: Installing 9P2000 support
> [    2.251474] Registering SWP/SWPB emulation handler
> [    2.284374] 1c090000.serial: ttyAMA0 at MMIO 0x1c090000 (irq = 41,
> base_baud = 0) is a PL011 rev1
> [    2.287797] printk: console [ttyAMA0] enabled
> [    2.287797] printk: console [ttyAMA0] enabled
> [    2.288262] input: AT Raw Set 2 keyboard as
> /devices/platform/bus@8000000/bus@8000000:motherboard-bus/bus@8000000:motherboard-bus:iofpga-bus@300000000/1c060000.kmi/serio0/input/input0
> [    2.288262] input: AT Raw Set 2 keyboard as
> /devices/platform/bus@8000000/bus@8000000:motherboard-bus/bus@8000000:motherboard-bus:iofpga-bus@300000000/1c060000.kmi/serio0/input/input0
> [    2.288755] printk: bootconsole [earlycon0] disabled
> [    2.288755] printk: bootconsole [earlycon0] disabled
> [    2.294507] 1c0a0000.serial: ttyAMA1 at MMIO 0x1c0a0000 (irq = 42,
> base_baud = 0) is a PL011 rev1
> [    2.296950] 1c0b0000.serial: ttyAMA2 at MMIO 0x1c0b0000 (irq = 43,
> base_baud = 0) is a PL011 rev1
> [    2.298636] 1c0c0000.serial: ttyAMA3 at MMIO 0x1c0c0000 (irq = 44,
> base_baud = 0) is a PL011 rev1
> [    2.300496] 8<--- cut here ---
> [    2.300775] ------------[ cut here ]------------
> [    2.301260] Unable to handle kernel NULL pointer dereference at
> virtual address 00000000
> [    2.300928] WARNING: CPU: 1 PID: 27 at
> /home/wkf/work/hulk/lib/refcount.c:25 refcount_warn_saturate+0x108/0x174
> [    2.301700] pgd = (ptrval)
> [    2.302002] refcount_t: addition on 0; use-after-free.
> [    2.302184] [00000000] *pgd=00000000
> [    2.302363] Modules linked in:
> [    2.302384]
> [    2.302753]
> [    2.303501] CPU: 1 PID: 27 Comm: kworker/u4:1 Not tainted 5.14.0-rc7+
> #193
> [    2.303990] Hardware name: ARM-Versatile Express
> [    2.304198] Internal error: Oops: 5 [#1] SMP ARM
> [    2.304537] Workqueue: events_unbound deferred_probe_work_func
> [    2.304829] Modules linked in:
> [    2.304865]
> [    2.305133]
> [    2.305401] Backtrace:
> [    2.305562] CPU: 0 PID: 41 Comm: kworker/0:2 Not tainted 5.14.0-rc7+ #193
> [    2.305614] Hardware name: ARM-Versatile Express
> [    2.305576]
> [    2.305781] [<c010c780>] (dump_backtrace) from [<c010cacc>]
> (show_stack+0x20/0x24)
> [    2.306266] Workqueue: events_long serio_handle_event
> [    2.306732]  r7:00000009 r6:00000000 r5:c0b1efb8 r4:600b0093
> [    2.306889] PC is at strcmp+0x18/0x44
> [    2.307115] [<c010caac>] (show_stack) from [<c091eba4>]
> (dump_stack_lvl+0x48/0x54)
> [    2.307263] LR is at platform_match+0xb8/0xcc
> [    2.307498] [<c091eb5c>] (dump_stack_lvl) from [<c091ebc8>]
> (dump_stack+0x18/0x1c)
> [    2.307739] pc : [<c0560aec>]    lr : [<c064626c>]    psr: 60000013
> [    2.307988]  r5:c0b4ef98 r4:c165ddc4
> [    2.308317] sp : c1675d70  ip : c1675d80  fp : c1675d7c
> [    2.308433] [<c091ebb0>] (dump_stack) from [<c01289c4>]
> (__warn+0x110/0x114)
> [    2.308743] r10: 00000000  r9 : 00000000  r8 : 00000001
> [    2.308961] [<c01288b4>] (__warn) from [<c0128a4c>]
> (warn_slowpath_fmt+0x84/0xc0)
> [    2.309252] r7 : c0d04d08  r6 : c13aed18  r5 : c1090fc0  r4 : c13aed18
> [    2.309547]  r9:00000009 r8:c0504f10 r7:00000019 r6:c0b4ef98
> r5:c0b4efbc r4:c0d04d08
> [    2.309975] [<c01289cc>] (warn_slowpath_fmt) from [<c0504f10>]
> (refcount_warn_saturate+0x108/0x174)
> [    2.310531] r3 : c0a5e1c0  r2 : 00000002  r1 : c0b82860  r0 : 00000000
> [    2.311263] Flags: nZCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM
> Segment none
> [    2.311298]  r9:600b0013 r8:00000000 r7:00000000 r6:c1787eb8
> r5:c165de3c r4:c1702b38
> [    2.311899] Control: 10c5387d  Table: 8000406a  DAC: 00000051
> [    2.312117] [<c0504e08>] (refcount_warn_saturate) from [<c055657c>]
> (klist_next+0x134/0x138)
> [    2.312755] Register r0 information:
> [    2.312964] [<c0556448>] (klist_next) from [<c06411dc>]
> (bus_for_each_drv+0x74/0xc8)
> [    2.313345]  NULL pointer
> [    2.313690] Register r1 information:
> [    2.313736]  r9:00000000 r8:00000001 r7:c0d04d08 r6:c064373c
> r5:c165de6c r4:00000000
> [    2.313933]  non-slab/vmalloc memory
> [    2.314172] [<c0641168>] (bus_for_each_drv) from [<c0642f38>]
> (__device_attach+0xf0/0x15c)
> [    2.315042]  r7:c10846b8 r6:c13ae444 r5:c13ae400 r4:c0d04d08
> [    2.315060] Register r2 information:
> [    2.315243] [<c0642e48>] (__device_attach) from [<c064390c>]
> (device_initial_probe+0x1c/0x20)
> [    2.315276]  non-paged memory
> [    2.315569]  r8:00000000 r7:c10846b8 r6:c13ae400 r5:c107d690 r4:c13ae400
> [    2.315593] [<c06438f0>] (device_initial_probe) from [<c0642080>]
> (bus_probe_device+0x94/0x9c)
> [    2.316192] [<c0641fec>] (bus_probe_device) from [<c064259c>]
> (deferred_probe_work_func+0x8c/0xb8)
> [    2.316939]  r7:c10846b8 r6:c10846a4 r5:c10846a4 r4:c13ae400
> [    2.317573] [<c0642510>] (deferred_probe_work_func) from [<c01475a4>]
> (process_one_work+0x238/0x594)
> [    2.318513]  r9:00000000 r8:00000000 r7:c1225b00 r6:c1206200
> r5:c16b6f80 r4:c10846d4
> [    2.318931] Register r3 information: non-slab/vmalloc memory
> [    2.319218] [<c014736c>] (process_one_work) from [<c0147bc4>]
> (worker_thread+0x2c4/0x5f4)
> [    2.320001]  r10:c0d03d00 r9:00000088 r8:ffffe000 r7:c1206218
> r6:c16b6f94 r5:c1206200
> [    2.320280] Register r4 information:
> [    2.320614]  r4:c16b6f80
> [    2.320942] [<c0147900>] (worker_thread) from [<c014feb4>]
> (kthread+0x178/0x194)
> [    2.321111]  slab kmalloc-1k
> [    2.321403]  r10:c165c000 r9:c1313e74 r8:00000000 r7:c16b6f80
> r6:c0147900 r5:c16b5580
> [    2.321391]  start c13aec00 pointer offset 280
> [    2.321993]  r4:c13d4980
> [    2.322006]  size 1024
> [    2.322176] [<c014fd3c>] (kthread) from [<c0100150>]
> (ret_from_fork+0x14/0x24)
> [    2.323165] Exception stack(0xc165dfb0 to 0xc165dff8)
> [    2.323187]
> [    2.323371] Register r5 information: non-slab/vmalloc memory
> [    2.323535] Register r6 information: slab kmalloc-1k start c13aec00
> pointer offset 280 size 1024
> [    2.324594] Register r7 information:
> [    2.324597] dfa0:                                     00000000
> 00000000 00000000 00000000
> [    2.325507]  non-slab/vmalloc memory
> [    2.325830] Register r8 information: non-paged memory
> [    2.326267] Register r9 information:
> [    2.326274] dfc0: 00000000 00000000 00000000 00000000 00000000
> 00000000 00000000 00000000
> [    2.326453]  NULL pointer
> [    2.326942] Register r10 information: NULL pointer
> [    2.327159] dfe0: 00000000 00000000 00000000 00000000 00000013 00000000
> [    2.327258] Register r11 information: non-slab/vmalloc memory
> [    2.327904] Register r12 information:
> [    2.327928]  r10:00000000 r9:00000000 r8:00000000 r7:00000000
> r6:00000000 r5:c014fd3c
> [    2.327937]  non-slab/vmalloc memory
> [    2.328057] Process kworker/0:2 (pid: 41, stack limit = 0x(ptrval))
> [    2.328204] Stack: (0xc1675d70 to 0xc1676000)
> [    2.328479]  r4:c16b5580
> [    2.329851] ---[ end trace f293b13f591ee203 ]---
> [    2.330027] 5d60:                                     c1675d9c
> c1675d80 c064626c c0560ae0
> [    2.331070] 5d80: c1090fc0 c1675df4 c13aed18 c0d04d08 c1675dbc
> c1675da0 c0643778 c06461c0
> [    2.331120] 8<--- cut here ---
> [    2.331739] Unhandled fault: page domain fault (0x01b) at 0x00000010
> [    2.331915] 5da0: 00000000 c1675df4 c064373c c0d04d08 c1675dec
> c1675dc0 c06411d0 c0643748
> [    2.332379] pgd = (ptrval)
> [    2.332408] 5dc0: c1675dec c16bf06c c1787eb8 76076098 c0d04d08
> c13aed18 c13aed5c c13aa800
> [    2.332889] [00000010] *pgd=00000000
> [    2.332951] 5de0: c1675e24 c1675df0 c0642f38 c0641174 c1675e44
> c13aed18 00000001 76076098
> [    2.333535] 5e00: 00000000 c13aed18 c108ea84 c13aed18 c13aa800
> c10843a0 c1675e34 c1675e28
> [    2.334201] 5e20: c064390c c0642e54 c1675e54 c1675e38 c0642080
> c06438fc c13aed18 00000000
> [    2.334845] 5e40: c0d04d08 c13aa800 c1675eb4 c1675e58 c063fa1c
> c0641ff8 c13a5180 c1706380
> [    2.335464] 5e60: eee2e0c0 c1201180 c071c008 c11ea558 c0b7f260
> c0b7f28c c1675eb4 c1675e88
> [    2.336088] 5e80: c02e6368 76076098 00000001 c1706f4c c13aec00
> c108ea54 c1706f40 c11ea558
> [    2.336758] 5ea0: c0b7f260 c0b7f28c c1675ef4 c1675eb8 c071c0e4
> c063f620 c0923300 c0158300
> [    2.337399] 5ec0: 00000000 c13aed18 c1675ef4 c108ea70 c1702e80
> effc4400 effc7700 00000000
> [    2.338025] 5ee0: 00000000 c10b2580 c1675f34 c1675ef8 c01475a4
> c071bf38 c13a5100 ffffe000
> [    2.338663] 5f00: c1675f1c c1675f10 c0149250 c1702e80 effc4400
> c1702e94 effc4418 ffffe000
> [    2.339296] 5f20: 00000008 c0d03d00 c1675f74 c1675f38 c014795c
> c0147378 c130fe74 c0b0c2c4
> [    2.340102] 5f40: c10b1d2a effc4400 c1675f74 c1706000 c1706880
> c0147900 c1702e80 00000000
> [    2.340736] 5f60: c130fe74 c1674000 c1675fac c1675f78 c014feb4
> c014790c c1706024 c1706024
> [    2.341315] 5f80: c1675fac c1706880 c014fd3c 00000000 00000000
> 00000000 00000000 00000000
> [    2.341852] 5fa0: 00000000 c1675fb0 c0100150 c014fd48 00000000
> 00000000 00000000 00000000
> [    2.342407] 5fc0: 00000000 00000000 00000000 00000000 00000000
> 00000000 00000000 00000000
> [    2.342957] 5fe0: 00000000 00000000 00000000 00000000 00000013
> 00000000 00000000 00000000
> [    2.343361] Backtrace:
> [    2.343573] [<c0560ad4>] (strcmp) from [<c064626c>]
> (platform_match+0xb8/0xcc)
> [    2.343954] [<c06461b4>] (platform_match) from [<c0643778>]
> (__device_attach_driver+0x3c/0xc4)
> [    2.344369]  r7:c0d04d08 r6:c13aed18 r5:c1675df4 r4:c1090fc0
> [    2.344615] [<c064373c>] (__device_attach_driver) from [<c06411d0>]
> (bus_for_each_drv+0x68/0xc8)
> [    2.345015]  r7:c0d04d08 r6:c064373c r5:c1675df4 r4:00000000
> [    2.345283] [<c0641168>] (bus_for_each_drv) from [<c0642f38>]
> (__device_attach+0xf0/0x15c)
> [    2.345654]  r7:c13aa800 r6:c13aed5c r5:c13aed18 r4:c0d04d08
> [    2.345903] [<c0642e48>] (__device_attach) from [<c064390c>]
> (device_initial_probe+0x1c/0x20)
> [    2.346325]  r8:c10843a0 r7:c13aa800 r6:c13aed18 r5:c108ea84 r4:c13aed18
> [    2.346635] [<c06438f0>] (device_initial_probe) from [<c0642080>]
> (bus_probe_device+0x94/0x9c)
> [    2.347038] [<c0641fec>] (bus_probe_device) from [<c063fa1c>]
> (device_add+0x408/0x8b8)
> [    2.347419]  r7:c13aa800 r6:c0d04d08 r5:00000000 r4:c13aed18
> [    2.347695] [<c063f614>] (device_add) from [<c071c0e4>]
> (serio_handle_event+0x1b8/0x234)
> [    2.348094]  r10:c0b7f28c r9:c0b7f260 r8:c11ea558 r7:c1706f40
> r6:c108ea54 r5:c13aec00
> [    2.348453]  r4:c1706f4c
> [    2.348604] [<c071bf2c>] (serio_handle_event) from [<c01475a4>]
> (process_one_work+0x238/0x594)
> [    2.348968]  r10:c10b2580 r9:00000000 r8:00000000 r7:effc7700
> r6:effc4400 r5:c1702e80
> [    2.349315]  r4:c108ea70
> [    2.349468] [<c014736c>] (process_one_work) from [<c014795c>]
> (worker_thread+0x5c/0x5f4)
> [    2.349875]  r10:c0d03d00 r9:00000008 r8:ffffe000 r7:effc4418
> r6:c1702e94 r5:effc4400
> [    2.350169]  r4:c1702e80
> [    2.350315] [<c0147900>] (worker_thread) from [<c014feb4>]
> (kthread+0x178/0x194)
> [    2.350687]  r10:c1674000 r9:c130fe74 r8:00000000 r7:c1702e80
> r6:c0147900 r5:c1706880
> [    2.351038]  r4:c1706000
> [    2.351182] [<c014fd3c>] (kthread) from [<c0100150>]
> (ret_from_fork+0x14/0x24)
> [    2.351500] Exception stack(0xc1675fb0 to 0xc1675ff8)
> [    2.351855] 5fa0:                                     00000000
> 00000000 00000000 00000000
> [    2.352415] 5fc0: 00000000 00000000 00000000 00000000 00000000
> 00000000 00000000 00000000
> [    2.352923] 5fe0: 00000000 00000000 00000000 00000000 00000013 00000000
> [    2.353283]  r10:00000000 r9:00000000 r8:00000000 r7:00000000
> r6:00000000 r5:c014fd3c
> [    2.353618]  r4:c1706880
> [    2.354146] Code: e24cb004 ea000001 e3530000 0a000006 (e4d03001)
> [    2.354860] Internal error: : 1b [#2] SMP ARM
> [    2.355172] Modules linked in:
> [    2.355254] ---[ end trace f293b13f591ee204 ]---
> [    2.355650] CPU: 1 PID: 27 Comm: kworker/u4:1 Tainted: G      D
> W         5.14.0-rc7+ #193
> [    2.355888] Kernel panic - not syncing: Fatal exception
> [    2.355990] Hardware name: ARM-Versatile Express
> [    2.356735] Workqueue: events_unbound deferred_probe_work_func
> [    2.357217] PC is at klist_put+0x20/0xa4
> [    2.357537] LR is at klist_iter_exit+0x24/0x30
> [    2.357872] pc : [<c0556280>]    lr : [<c0556340>]    psr: a00b0013
> [    2.358299] sp : c165de00  ip : c165de20  fp : c165de1c
> [    2.358655] r10: c10b2580  r9 : 00000000  r8 : 00000001
> [    2.359009] r7 : c1702b38  r6 : 00000000  r5 : c165de6c  r4 : 00000000
> [    2.359440] r3 : 00000000  r2 : 76076098  r1 : 00000000  r0 : 00000000
> [    2.359893] Flags: NzCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM
> Segment none
> [    2.360368] Control: 10c5387d  Table: 8177806a  DAC: 00000051
> [    2.360759] Register r0 information: NULL pointer
> [    2.361126] Register r1 information: NULL pointer
> [    2.361477] Register r2 information: non-paged memory
> [    2.361835] Register r3 information: NULL pointer
> [    2.362162] Register r4 information: NULL pointer
> [    2.362486] Register r5 information: non-slab/vmalloc memory
> [    2.362887] Register r6 information: NULL pointer
> [    2.363226] Register r7 information: slab kmalloc-128 start c1702b00
> pointer offset 56 size 128
> [    2.363919] Register r8 information: non-paged memory
> [    2.364299] Register r9 information: NULL pointer
> [    2.364668] Register r10 information: non-slab/vmalloc memory
> [    2.365089] Register r11 information: non-slab/vmalloc memory
> [    2.365466] Register r12 information: non-slab/vmalloc memory
> [    2.365872] Process kworker/u4:1 (pid: 27, stack limit = 0x(ptrval))
> [    2.366297] Stack: (0xc165de00 to 0xc165e000)
> [    2.366769] de00: c165de3c c165de6c c064373c c0d04d08 c165de34
> c165de20 c0556340 c055626c
> [    2.367478] de20: 00000000 c165de6c c165de64 c165de38 c0641208
> c0556328 c165de64 c136996c
> [    2.368192] de40: c1702b38 76076098 c0d04d08 c13ae400 c13ae444
> c10846b8 c165de9c c165de68
> [    2.368934] de60: c0642f38 c0641174 c063c4c4 c13ae400 00000001
> 76076098 00000000 c13ae400
> [    2.369654] de80: c107d690 c13ae400 c10846b8 00000000 c165deac
> c165dea0 c064390c c0642e54
> [    2.370367] dea0: c165decc c165deb0 c0642080 c06438fc c13ae400
> c10846a4 c10846a4 c10846b8
> [    2.371046] dec0: c165def4 c165ded0 c064259c c0641ff8 c10846d4
> c16b6f80 c1206200 c1225b00
> [    2.371774] dee0: 00000000 00000000 c165df34 c165def8 c01475a4
> c064251c c13906c0 ffffe000
> [    2.372470] df00: c165df1c c165df10 c0149250 c16b6f80 c1206200
> c16b6f94 c1206218 ffffe000
> [    2.373189] df20: 00000088 c0d03d00 c165df74 c165df38 c0147bc4
> c0147378 c1313e74 c0b0c2c4
> [    2.373925] df40: c10b1d2a c1206200 c165df74 c13d4980 c16b5580
> c0147900 c16b6f80 00000000
> [    2.374628] df60: c1313e74 c165c000 c165dfac c165df78 c014feb4
> c014790c c13d49a4 c13d49a4
> [    2.375345] df80: c165dfac c16b5580 c014fd3c 00000000 00000000
> 00000000 00000000 00000000
> [    2.376080] dfa0: 00000000 c165dfb0 c0100150 c014fd48 00000000
> 00000000 00000000 00000000
> [    2.376807] dfc0: 00000000 00000000 00000000 00000000 00000000
> 00000000 00000000 00000000
> [    2.377534] dfe0: 00000000 00000000 00000000 00000000 00000013
> 00000000 00000000 00000000
> [    2.378080] Backtrace:
> [    2.378297] [<c0556260>] (klist_put) from [<c0556340>]
> (klist_iter_exit+0x24/0x30)
> [    2.378819]  r7:c0d04d08 r6:c064373c r5:c165de6c r4:c165de3c
> [    2.379199] [<c055631c>] (klist_iter_exit) from [<c0641208>]
> (bus_for_each_drv+0xa0/0xc8)
> [    2.379701]  r5:c165de6c r4:00000000
> [    2.379932] [<c0641168>] (bus_for_each_drv) from [<c0642f38>]
> (__device_attach+0xf0/0x15c)
> [    2.380482]  r7:c10846b8 r6:c13ae444 r5:c13ae400 r4:c0d04d08
> [    2.380871] [<c0642e48>] (__device_attach) from [<c064390c>]
> (device_initial_probe+0x1c/0x20)
> [    2.381453]  r8:00000000 r7:c10846b8 r6:c13ae400 r5:c107d690 r4:c13ae400
> [    2.381851] [<c06438f0>] (device_initial_probe) from [<c0642080>]
> (bus_probe_device+0x94/0x9c)
> [    2.382364] [<c0641fec>] (bus_probe_device) from [<c064259c>]
> (deferred_probe_work_func+0x8c/0xb8)
> [    2.382936]  r7:c10846b8 r6:c10846a4 r5:c10846a4 r4:c13ae400
> [    2.383266] [<c0642510>] (deferred_probe_work_func) from [<c01475a4>]
> (process_one_work+0x238/0x594)
> [    2.383885]  r9:00000000 r8:00000000 r7:c1225b00 r6:c1206200
> r5:c16b6f80 r4:c10846d4
> [    2.384344] [<c014736c>] (process_one_work) from [<c0147bc4>]
> (worker_thread+0x2c4/0x5f4)
> [    2.384893]  r10:c0d03d00 r9:00000088 r8:ffffe000 r7:c1206218
> r6:c16b6f94 r5:c1206200
> [    2.385341]  r4:c16b6f80
> [    2.385536] [<c0147900>] (worker_thread) from [<c014feb4>]
> (kthread+0x178/0x194)
> [    2.386030]  r10:c165c000 r9:c1313e74 r8:00000000 r7:c16b6f80
> r6:c0147900 r5:c16b5580
> [    2.386540]  r4:c13d4980
> [    2.386742] [<c014fd3c>] (kthread) from [<c0100150>]
> (ret_from_fork+0x14/0x24)
> [    2.387211] Exception stack(0xc165dfb0 to 0xc165dff8)
> [    2.387653] dfa0:                                     00000000
> 00000000 00000000 00000000
> [    2.388374] dfc0: 00000000 00000000 00000000 00000000 00000000
> 00000000 00000000 00000000
> [    2.389071] dfe0: 00000000 00000000 00000000 00000000 00000013 00000000
> [    2.389546]  r10:00000000 r9:00000000 r8:00000000 r7:00000000
> r6:00000000 r5:c014fd3c
> [    2.390037]  r4:c16b5580
> [    2.390376] Code: e1a07000 e1a06001 e3c44001 e1a00004 (e5945010)
> [    2.390836] ---[ end trace f293b13f591ee205 ]---
> [    2.391580] ---[ end Kernel panic - not syncing: Fatal exception ]---
>
>
> >>> -Saravana
> >>>
> >>> [1] - https://lore.kernel.org/lkml/CAGETcx8b228nDUho3cX9AAQ-pXOfZTMv8cj2vhdx9yc_pk8q+A@mail.gmail.com/
> >>> .
> >>>
> > .
> >

Saravana Kannan Aug. 27, 2021, 7:09 p.m. UTC | #12

On Fri, Aug 27, 2021 at 7:38 AM Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
>
>
> On 2021/8/27 8:04, Saravana Kannan wrote:
> > On Thu, Aug 26, 2021 at 1:22 AM Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
> >>
> >>>>> Btw, I've been working on [1] cleaning up the one-off deferred probe
> >>>>> solution that we have for amba devices. That causes a bunch of other
> >>>>> headaches. Your patch 3/3 takes us further in the wrong direction by
> >>>>> adding more reasons for delaying the addition of the device.
> >> Hi Saravana, I try the link[1], but with it, there is a crash when boot
> >> (qemu-system-arm -M vexpress-a15),

I'm assuming it's this one?
arch/arm/boot/dts/vexpress-v2p-ca15_a7.dts

> > Hi,
> >
> > It's hard to make sense of the logs. Looks like two different threads
> > might be printing to the log at the same time? Can you please enable
> > the config that prints the thread ID (forgot what it's called) and
> > collect this again? With what I could tell the crash seems to be
> > happening somewhere in platform_match(), but that's not related to
> > this patch at all?
>
> Can you reproduce it? it is very likely related(without your patch, the
> boot is fine),

Sorry, I haven't ever setup qemu and booted vexpress. Thanks for your help.

> the NULL ptr is about serio, it is registed from amba driver.
>
> ambakmi_driver_init
>
>   -- amba_kmi_probe
>
>     -- __serio_register_port

Thanks for the pointer. I took a look at the logs and the code. It's
very strange. As you can see from the backtrace, platform_match() is
being called for the device_add() from serio_handle_event(). But the
device that gets added there is on the serio_bus which obviously
should be using the serio_bus_match.

>
> +Dmitry and input maillist, is there some known issue about serio ?
>
> I add some debug, the full log is attached.
>
> [    2.958355][   T41] input: AT Raw Set 2 keyboard as
> /devices/platform/bus@8000000/bus@8000000:motherboard-bus/bus@8000000:motherboard-bus:iofpga-bus@300000000/1c060000.kmi/serio0/input/input0
> [    2.977441][   T41] serio serio1: pdev c1e05508, pdev->name (null),
> drv c1090fc0, drv->name vexpress-reset

Based on the logs you added, it's pretty clear we are getting to
platform_match(). It's also strange that the drv->name is
vexpress-reset

> [    2.977928][   T41] 8<--- cut here ---
> [    2.978162][   T41] Unhandled fault: page domain fault (0x01b) at
> 0x00000000
> [    2.978494][   T41] pgd = (ptrval)
> [    2.978819][   T41] [00000000] *pgd=00000000
> [    2.979881][   T41] Internal error: : 1b [#1] SMP ARM
> [    2.980385][   T41] Modules linked in:
> [    2.980810][   T41] CPU: 0 PID: 41 Comm: kworker/0:2 Not tainted
> 5.14.0-rc7+ #213
> [    2.981229][   T41] Hardware name: ARM-Versatile Express
> [    2.981780][   T41] Workqueue: events_long serio_handle_event
> [    2.982737][   T41] PC is at strcmp+0x18/0x44
> [    2.983030][   T41] LR is at platform_match+0xdc/0xf0
> [    2.983283][   T41] pc : [<c0560bcc>]    lr : [<c0646358>]    psr:
> 600b0013
> [    2.983572][   T41] sp : c1675d68  ip : c1675d78  fp : c1675d74
> [    2.983832][   T41] r10: 00000000  r9 : 00000000  r8 : 00000001
> [    2.984095][   T41] r7 : c1e05518  r6 : c1675df4  r5 : c1e05518  r4 :
> c1090fc0
> [    2.984395][   T41] r3 : c0a5e180  r2 : 6bede3db  r1 : c0b82a04  r0 :
> 00000000
> [    2.984906][   T41] Flags: nZCv  IRQs on  FIQs on  Mode SVC_32 ISA
> ARM  Segment none

---- 8< ---- cleaning up a bunch of register dumps

> [    3.003113][   T41] Backtrace:
> [    3.003451][   T41] [<c0560bb4>] (strcmp) from [<c0646358>] (platform_match+0xdc/0xf0)
> [    3.003963][   T41] [<c064627c>] (platform_match) from [<c06437d4>] (__device_attach_driver+0x3c/0xf4)
> [    3.004769][   T41] [<c0643798>] (__device_attach_driver) from [<c0641180>] (bus_for_each_drv+0x68/0xc8)
> [    3.005481][   T41] [<c0641118>] (bus_for_each_drv) from [<c0642f40>] (__device_attach+0xf0/0x16c)
> [    3.006152][   T41] [<c0642e50>] (__device_attach) from [<c06439d4>] (device_initial_probe+0x1c/0x20)
> [    3.006853][   T41] [<c06439b8>] (device_initial_probe) from [<c0642030>] (bus_probe_device+0x94/0x9c)
> [    3.007259][   T41] [<c0641f9c>] (bus_probe_device) from [<c063f9cc>] (device_add+0x408/0x8b8)
> [    3.007900][   T41] [<c063f5c4>] (device_add) from [<c071c1cc>] (serio_handle_event+0x1b8/0x234)
> [    3.008824][   T41] [<c071c014>] (serio_handle_event) from [<c01475a4>] (process_one_work+0x238/0x594)
> [    3.009737][   T41] [<c014736c>] (process_one_work) from [<c014795c>] (worker_thread+0x5c/0x5f4)
> [    3.010638][   T41] [<c0147900>] (worker_thread) from [<c014feb4>] (kthread+0x178/0x194)
> [    3.011496][   T41] [<c014fd3c>] (kthread) from [<c0100150>] (ret_from_fork+0x14/0x24)
> [    3.011860][   T41] Exception stack(0xc1675fb0 to 0xc1675ff8)

But the platform_match() is happening for the device_add() from
serio_event_handle() that's adding a device to the serio_bus and it
should be using serio_bus_match().

I haven't reached any conclusion yet, but my current thought process
is that it's either:
1. My patch is somehow causing list corruption. But I don't directly
touch any list in my change (other than deleting a list entirely), so
it's not clear how that would be happening.
2. Without my patch, these AMBA device's probe would be delayed at
least until 5 seconds or possibly later. I'm wondering if my patch is
catching some bad timing assumptions in other code.

You might be able to test out theory (2) by DEFERRED_DEVICE_TIMEOUT to
a much smaller number. Say 500ms or 100ms. If it doesn't crash, it
doesn't mean it's not (2), but if it does, then we know for sure it's
(2).

I'll continue debugging further.

-Saravana

>
> diff --git a/drivers/amba/bus.c b/drivers/amba/bus.c
> index 836d6d23bba3..883a58c658c2 100644
> --- a/drivers/amba/bus.c
> +++ b/drivers/amba/bus.c
> @@ -237,6 +237,7 @@ static int amba_match(struct device *dev, struct
> device_driver *drv)
>
>          if (!pcdev->periphid) {
>                  int ret = amba_read_periphid(pcdev);
> +               dev_info(dev, "%s, amba_read_periphid ret = %d\n",
> __func__, ret);
>
>                  if (ret)
>                          return ret;
> @@ -522,6 +523,7 @@ int amba_device_add(struct amba_device *dev, struct
> resource *parent)
>          /* If primecell ID isn't hard-coded, figure it out */
>          if (!dev->periphid) {
>                  ret = amba_read_periphid(dev);
> +               dev_info(&dev->dev, "%s, amba_read_periphid ret = %d\n",
> __func__, ret);
>                  if (ret && ret != -EPROBE_DEFER)
>                          goto err_release;
>                  /*
> diff --git a/drivers/base/platform.c b/drivers/base/platform.c
> index 8640578f45e9..f7c1933c56b5 100644
> --- a/drivers/base/platform.c
> +++ b/drivers/base/platform.c
> @@ -1360,6 +1360,7 @@ static int platform_match(struct device *dev,
> struct device_driver *drv)
>          struct platform_device *pdev = to_platform_device(dev);
>          struct platform_driver *pdrv = to_platform_driver(drv);
>
> +       dev_info(dev, "pdev %px, pdev->name %s, drv %px, drv->name %s",
> pdev, pdev->name, drv, drv->name);
>          /* When driver_override is set, only bind to the matching driver */
>          if (pdev->driver_override)
>                  return !strcmp(pdev->driver_override, drv->name);
>
>
> > [1] - https://lore.kernel.org/lkml/CAGETcx8b228nDUho3cX9AAQ-pXOfZTMv8cj2vhdx9yc_pk8q+A@mail.gmail.com/
> > .
> >
> >>> .
> >>>
> > .
> >

Kefeng Wang Aug. 28, 2021, 1:09 a.m. UTC | #13

On 2021/8/28 3:09, Saravana Kannan wrote:
> On Fri, Aug 27, 2021 at 7:38 AM Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
>>
>> On 2021/8/27 8:04, Saravana Kannan wrote:
>>> On Thu, Aug 26, 2021 at 1:22 AM Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
>>>>>>> Btw, I've been working on [1] cleaning up the one-off deferred probe
>>>>>>> solution that we have for amba devices. That causes a bunch of other
>>>>>>> headaches. Your patch 3/3 takes us further in the wrong direction by
>>>>>>> adding more reasons for delaying the addition of the device.
>>>> Hi Saravana, I try the link[1], but with it, there is a crash when boot
>>>> (qemu-system-arm -M vexpress-a15),
> I'm assuming it's this one?
> arch/arm/boot/dts/vexpress-v2p-ca15_a7.dts

I use arch/arm/boot/dts/vexpress-v2p-ca15-tc1.dts.

qemu-system-arm -M vexpress-a15 -dtb vexpress-v2p-ca15-tc1.dtb -cpu 
cortex-a15 -smp 2 -m size=3G -kernel zImage -rtc base=localtime -initrd 
initrd-arm32 -append 'console=ttyAMA0 cma=0 kfence.sample_interval=0 
earlyprintk debug ' -device virtio-net-device,netdev=net8 -netdev 
type=tap,id=net8,script=/etc/qemu-ifup,downscript=/etc/qemu-ifdown 
-nographic

>
>>> Hi,
>>>
>>> It's hard to make sense of the logs. Looks like two different threads
>>> might be printing to the log at the same time? Can you please enable
>>> the config that prints the thread ID (forgot what it's called) and
>>> collect this again? With what I could tell the crash seems to be
>>> happening somewhere in platform_match(), but that's not related to
>>> this patch at all?
>> Can you reproduce it? it is very likely related(without your patch, the
>> boot is fine),
> Sorry, I haven't ever setup qemu and booted vexpress. Thanks for your help.
>
>> the NULL ptr is about serio, it is registed from amba driver.
>>
>> ambakmi_driver_init
>>
>>    -- amba_kmi_probe
>>
>>      -- __serio_register_port
> Thanks for the pointer. I took a look at the logs and the code. It's
> very strange. As you can see from the backtrace, platform_match() is
> being called for the device_add() from serio_handle_event(). But the
> device that gets added there is on the serio_bus which obviously
> should be using the serio_bus_match.
Yes, I am confused too.
>
>> +Dmitry and input maillist, is there some known issue about serio ?
>>
>> I add some debug, the full log is attached.
>>
>> [    2.958355][   T41] input: AT Raw Set 2 keyboard as
>> /devices/platform/bus@8000000/bus@8000000:motherboard-bus/bus@8000000:motherboard-bus:iofpga-bus@300000000/1c060000.kmi/serio0/input/input0
>> [    2.977441][   T41] serio serio1: pdev c1e05508, pdev->name (null),
>> drv c1090fc0, drv->name vexpress-reset
> Based on the logs you added, it's pretty clear we are getting to
> platform_match(). It's also strange that the drv->name is
> vexpress-reset
...
>
>> [    3.003113][   T41] Backtrace:
>> [    3.003451][   T41] [<c0560bb4>] (strcmp) from [<c0646358>] (platform_match+0xdc/0xf0)
>> [    3.003963][   T41] [<c064627c>] (platform_match) from [<c06437d4>] (__device_attach_driver+0x3c/0xf4)
>> [    3.004769][   T41] [<c0643798>] (__device_attach_driver) from [<c0641180>] (bus_for_each_drv+0x68/0xc8)
>> [    3.005481][   T41] [<c0641118>] (bus_for_each_drv) from [<c0642f40>] (__device_attach+0xf0/0x16c)
>> [    3.006152][   T41] [<c0642e50>] (__device_attach) from [<c06439d4>] (device_initial_probe+0x1c/0x20)
>> [    3.006853][   T41] [<c06439b8>] (device_initial_probe) from [<c0642030>] (bus_probe_device+0x94/0x9c)
>> [    3.007259][   T41] [<c0641f9c>] (bus_probe_device) from [<c063f9cc>] (device_add+0x408/0x8b8)
>> [    3.007900][   T41] [<c063f5c4>] (device_add) from [<c071c1cc>] (serio_handle_event+0x1b8/0x234)
>> [    3.008824][   T41] [<c071c014>] (serio_handle_event) from [<c01475a4>] (process_one_work+0x238/0x594)
>> [    3.009737][   T41] [<c014736c>] (process_one_work) from [<c014795c>] (worker_thread+0x5c/0x5f4)
>> [    3.010638][   T41] [<c0147900>] (worker_thread) from [<c014feb4>] (kthread+0x178/0x194)
>> [    3.011496][   T41] [<c014fd3c>] (kthread) from [<c0100150>] (ret_from_fork+0x14/0x24)
>> [    3.011860][   T41] Exception stack(0xc1675fb0 to 0xc1675ff8)
> But the platform_match() is happening for the device_add() from
> serio_event_handle() that's adding a device to the serio_bus and it
> should be using serio_bus_match().
>
> I haven't reached any conclusion yet, but my current thought process
> is that it's either:
> 1. My patch is somehow causing list corruption. But I don't directly
> touch any list in my change (other than deleting a list entirely), so
> it's not clear how that would be happening.

Maybe some concurrent driver load？

> 2. Without my patch, these AMBA device's probe would be delayed at
> least until 5 seconds or possibly later. I'm wondering if my patch is
> catching some bad timing assumptions in other code.

After Rob's patch, It will retry soon.

commit 039599c92d3b2e73689e8b6e519d653fd4770abb
Author: Rob Herring <robh@kernel.org>
Date:   Wed Apr 29 15:58:12 2020 -0500

     amba: Retry adding deferred devices at late_initcall

     If amba bus devices defer when adding, the amba bus code simply retries
     adding the devices every 5 seconds. This doesn't work well as it
     completely unsynchronized with starting the init process which can
     happen in less than 5 secs. Add a retry during late_initcall. If the
     amba devices are added, then deferred probe takes over. If the
     dependencies have not probed at this point, then there's no improvement
     over previous behavior. To completely solve this, we'd need to retry
     after every successful probe as deferred probe does.

     The list_empty() check now happens outside the mutex, but the mutex
     wasn't necessary in the first place.

     This needed to use deferred probe instead of fragile initcall ordering
     on 32-bit VExpress systems where the apb_pclk has a number of probe
     dependencies (vexpress-sysregs, vexpress-config).


>
> You might be able to test out theory (2) by DEFERRED_DEVICE_TIMEOUT to
> a much smaller number. Say 500ms or 100ms. If it doesn't crash, it
> doesn't mean it's not (2), but if it does, then we know for sure it's
> (2).
ok, I will try this one, but due to above patch, it may not work.

>
> I'll continue debugging further.
>
> -Saravana

Saravana Kannan Sept. 9, 2021, 3:30 a.m. UTC | #14

On Fri, Aug 27, 2021 at 6:09 PM Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
>
>
> On 2021/8/28 3:09, Saravana Kannan wrote:
> > On Fri, Aug 27, 2021 at 7:38 AM Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
> >>
> >> On 2021/8/27 8:04, Saravana Kannan wrote:
> >>> On Thu, Aug 26, 2021 at 1:22 AM Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
> >>>>>>> Btw, I've been working on [1] cleaning up the one-off deferred probe
> >>>>>>> solution that we have for amba devices. That causes a bunch of other
> >>>>>>> headaches. Your patch 3/3 takes us further in the wrong direction by
> >>>>>>> adding more reasons for delaying the addition of the device.
> >>>> Hi Saravana, I try the link[1], but with it, there is a crash when boot
> >>>> (qemu-system-arm -M vexpress-a15),
> > I'm assuming it's this one?
> > arch/arm/boot/dts/vexpress-v2p-ca15_a7.dts
>
> I use arch/arm/boot/dts/vexpress-v2p-ca15-tc1.dts.
>
> qemu-system-arm -M vexpress-a15 -dtb vexpress-v2p-ca15-tc1.dtb -cpu
> cortex-a15 -smp 2 -m size=3G -kernel zImage -rtc base=localtime -initrd
> initrd-arm32 -append 'console=ttyAMA0 cma=0 kfence.sample_interval=0
> earlyprintk debug ' -device virtio-net-device,netdev=net8 -netdev
> type=tap,id=net8,script=/etc/qemu-ifup,downscript=/etc/qemu-ifdown
> -nographic
>
> >
> >>> Hi,
> >>>
> >>> It's hard to make sense of the logs. Looks like two different threads
> >>> might be printing to the log at the same time? Can you please enable
> >>> the config that prints the thread ID (forgot what it's called) and
> >>> collect this again? With what I could tell the crash seems to be
> >>> happening somewhere in platform_match(), but that's not related to
> >>> this patch at all?
> >> Can you reproduce it? it is very likely related(without your patch, the
> >> boot is fine),
> > Sorry, I haven't ever setup qemu and booted vexpress. Thanks for your help.
> >
> >> the NULL ptr is about serio, it is registed from amba driver.
> >>
> >> ambakmi_driver_init
> >>
> >>    -- amba_kmi_probe
> >>
> >>      -- __serio_register_port
> > Thanks for the pointer. I took a look at the logs and the code. It's
> > very strange. As you can see from the backtrace, platform_match() is
> > being called for the device_add() from serio_handle_event(). But the
> > device that gets added there is on the serio_bus which obviously
> > should be using the serio_bus_match.
> Yes, I am confused too.
> >
> >> +Dmitry and input maillist, is there some known issue about serio ?
> >>
> >> I add some debug, the full log is attached.
> >>
> >> [    2.958355][   T41] input: AT Raw Set 2 keyboard as
> >> /devices/platform/bus@8000000/bus@8000000:motherboard-bus/bus@8000000:motherboard-bus:iofpga-bus@300000000/1c060000.kmi/serio0/input/input0
> >> [    2.977441][   T41] serio serio1: pdev c1e05508, pdev->name (null),
> >> drv c1090fc0, drv->name vexpress-reset
> > Based on the logs you added, it's pretty clear we are getting to
> > platform_match(). It's also strange that the drv->name is
> > vexpress-reset
> ...
> >
> >> [    3.003113][   T41] Backtrace:
> >> [    3.003451][   T41] [<c0560bb4>] (strcmp) from [<c0646358>] (platform_match+0xdc/0xf0)
> >> [    3.003963][   T41] [<c064627c>] (platform_match) from [<c06437d4>] (__device_attach_driver+0x3c/0xf4)
> >> [    3.004769][   T41] [<c0643798>] (__device_attach_driver) from [<c0641180>] (bus_for_each_drv+0x68/0xc8)
> >> [    3.005481][   T41] [<c0641118>] (bus_for_each_drv) from [<c0642f40>] (__device_attach+0xf0/0x16c)
> >> [    3.006152][   T41] [<c0642e50>] (__device_attach) from [<c06439d4>] (device_initial_probe+0x1c/0x20)
> >> [    3.006853][   T41] [<c06439b8>] (device_initial_probe) from [<c0642030>] (bus_probe_device+0x94/0x9c)
> >> [    3.007259][   T41] [<c0641f9c>] (bus_probe_device) from [<c063f9cc>] (device_add+0x408/0x8b8)
> >> [    3.007900][   T41] [<c063f5c4>] (device_add) from [<c071c1cc>] (serio_handle_event+0x1b8/0x234)
> >> [    3.008824][   T41] [<c071c014>] (serio_handle_event) from [<c01475a4>] (process_one_work+0x238/0x594)
> >> [    3.009737][   T41] [<c014736c>] (process_one_work) from [<c014795c>] (worker_thread+0x5c/0x5f4)
> >> [    3.010638][   T41] [<c0147900>] (worker_thread) from [<c014feb4>] (kthread+0x178/0x194)
> >> [    3.011496][   T41] [<c014fd3c>] (kthread) from [<c0100150>] (ret_from_fork+0x14/0x24)
> >> [    3.011860][   T41] Exception stack(0xc1675fb0 to 0xc1675ff8)
> > But the platform_match() is happening for the device_add() from
> > serio_event_handle() that's adding a device to the serio_bus and it
> > should be using serio_bus_match().
> >
> > I haven't reached any conclusion yet, but my current thought process
> > is that it's either:
> > 1. My patch is somehow causing list corruption. But I don't directly
> > touch any list in my change (other than deleting a list entirely), so
> > it's not clear how that would be happening.
>
> Maybe some concurrent driver load？
>
> > 2. Without my patch, these AMBA device's probe would be delayed at
> > least until 5 seconds or possibly later. I'm wondering if my patch is
> > catching some bad timing assumptions in other code.
>
> After Rob's patch, It will retry soon.
>
> commit 039599c92d3b2e73689e8b6e519d653fd4770abb
> Author: Rob Herring <robh@kernel.org>
> Date:   Wed Apr 29 15:58:12 2020 -0500
>
>      amba: Retry adding deferred devices at late_initcall
>
>      If amba bus devices defer when adding, the amba bus code simply retries
>      adding the devices every 5 seconds. This doesn't work well as it
>      completely unsynchronized with starting the init process which can
>      happen in less than 5 secs. Add a retry during late_initcall. If the
>      amba devices are added, then deferred probe takes over. If the
>      dependencies have not probed at this point, then there's no improvement
>      over previous behavior. To completely solve this, we'd need to retry
>      after every successful probe as deferred probe does.
>
>      The list_empty() check now happens outside the mutex, but the mutex
>      wasn't necessary in the first place.
>
>      This needed to use deferred probe instead of fragile initcall ordering
>      on 32-bit VExpress systems where the apb_pclk has a number of probe
>      dependencies (vexpress-sysregs, vexpress-config).
>
>
> >
> > You might be able to test out theory (2) by DEFERRED_DEVICE_TIMEOUT to
> > a much smaller number. Say 500ms or 100ms. If it doesn't crash, it
> > doesn't mean it's not (2), but if it does, then we know for sure it's
> > (2).
> ok, I will try this one, but due to above patch, it may not work.

Were you able to find anything more?

-Saravana

Kefeng Wang Sept. 10, 2021, 7:59 a.m. UTC | #15

On 2021/9/9 11:30, Saravana Kannan wrote:
> On Fri, Aug 27, 2021 at 6:09 PM Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
>>
>> On 2021/8/28 3:09, Saravana Kannan wrote:
>>> On Fri, Aug 27, 2021 at 7:38 AM Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
>>>> On 2021/8/27 8:04, Saravana Kannan wrote:
>>>>> On Thu, Aug 26, 2021 at 1:22 AM Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
>>>>>>>>> Btw, I've been working on [1] cleaning up the one-off deferred probe
>>>>>>>>> solution that we have for amba devices. That causes a bunch of other
>>>>>>>>> headaches. Your patch 3/3 takes us further in the wrong direction by
>>>>>>>>> adding more reasons for delaying the addition of the device.
>>>>>> Hi Saravana, I try the link[1], but with it, there is a crash when boot
>>>>>> (qemu-system-arm -M vexpress-a15),
>>> I'm assuming it's this one?
>>> arch/arm/boot/dts/vexpress-v2p-ca15_a7.dts
>> I use arch/arm/boot/dts/vexpress-v2p-ca15-tc1.dts.
>>
>> qemu-system-arm -M vexpress-a15 -dtb vexpress-v2p-ca15-tc1.dtb -cpu
>> cortex-a15 -smp 2 -m size=3G -kernel zImage -rtc base=localtime -initrd
>> initrd-arm32 -append 'console=ttyAMA0 cma=0 kfence.sample_interval=0
>> earlyprintk debug ' -device virtio-net-device,netdev=net8 -netdev
>> type=tap,id=net8,script=/etc/qemu-ifup,downscript=/etc/qemu-ifdown
>> -nographic
>>
>>>>> Hi,
>>>>>
>>>>> It's hard to make sense of the logs. Looks like two different threads
>>>>> might be printing to the log at the same time? Can you please enable
>>>>> the config that prints the thread ID (forgot what it's called) and
>>>>> collect this again? With what I could tell the crash seems to be
>>>>> happening somewhere in platform_match(), but that's not related to
>>>>> this patch at all?
>>>> Can you reproduce it? it is very likely related(without your patch, the
>>>> boot is fine),
>>> Sorry, I haven't ever setup qemu and booted vexpress. Thanks for your help.
>>>
>>>> the NULL ptr is about serio, it is registed from amba driver.
>>>>
>>>> ambakmi_driver_init
>>>>
>>>>     -- amba_kmi_probe
>>>>
>>>>       -- __serio_register_port
>>> Thanks for the pointer. I took a look at the logs and the code. It's
>>> very strange. As you can see from the backtrace, platform_match() is
>>> being called for the device_add() from serio_handle_event(). But the
>>> device that gets added there is on the serio_bus which obviously
>>> should be using the serio_bus_match.
>> Yes, I am confused too.
>>>> +Dmitry and input maillist, is there some known issue about serio ?
>>>>
>>>> I add some debug, the full log is attached.
>>>>
>>>> [    2.958355][   T41] input: AT Raw Set 2 keyboard as
>>>> /devices/platform/bus@8000000/bus@8000000:motherboard-bus/bus@8000000:motherboard-bus:iofpga-bus@300000000/1c060000.kmi/serio0/input/input0
>>>> [    2.977441][   T41] serio serio1: pdev c1e05508, pdev->name (null),
>>>> drv c1090fc0, drv->name vexpress-reset
>>> Based on the logs you added, it's pretty clear we are getting to
>>> platform_match(). It's also strange that the drv->name is
>>> vexpress-reset
>> ...
>>>> [    3.003113][   T41] Backtrace:
>>>> [    3.003451][   T41] [<c0560bb4>] (strcmp) from [<c0646358>] (platform_match+0xdc/0xf0)
>>>> [    3.003963][   T41] [<c064627c>] (platform_match) from [<c06437d4>] (__device_attach_driver+0x3c/0xf4)
>>>> [    3.004769][   T41] [<c0643798>] (__device_attach_driver) from [<c0641180>] (bus_for_each_drv+0x68/0xc8)
>>>> [    3.005481][   T41] [<c0641118>] (bus_for_each_drv) from [<c0642f40>] (__device_attach+0xf0/0x16c)
>>>> [    3.006152][   T41] [<c0642e50>] (__device_attach) from [<c06439d4>] (device_initial_probe+0x1c/0x20)
>>>> [    3.006853][   T41] [<c06439b8>] (device_initial_probe) from [<c0642030>] (bus_probe_device+0x94/0x9c)
>>>> [    3.007259][   T41] [<c0641f9c>] (bus_probe_device) from [<c063f9cc>] (device_add+0x408/0x8b8)
>>>> [    3.007900][   T41] [<c063f5c4>] (device_add) from [<c071c1cc>] (serio_handle_event+0x1b8/0x234)
>>>> [    3.008824][   T41] [<c071c014>] (serio_handle_event) from [<c01475a4>] (process_one_work+0x238/0x594)
>>>> [    3.009737][   T41] [<c014736c>] (process_one_work) from [<c014795c>] (worker_thread+0x5c/0x5f4)
>>>> [    3.010638][   T41] [<c0147900>] (worker_thread) from [<c014feb4>] (kthread+0x178/0x194)
>>>> [    3.011496][   T41] [<c014fd3c>] (kthread) from [<c0100150>] (ret_from_fork+0x14/0x24)
>>>> [    3.011860][   T41] Exception stack(0xc1675fb0 to 0xc1675ff8)
>>> But the platform_match() is happening for the device_add() from
>>> serio_event_handle() that's adding a device to the serio_bus and it
>>> should be using serio_bus_match().
>>>
>>> I haven't reached any conclusion yet, but my current thought process
>>> is that it's either:
>>> 1. My patch is somehow causing list corruption. But I don't directly
>>> touch any list in my change (other than deleting a list entirely), so
>>> it's not clear how that would be happening.
>> Maybe some concurrent driver load？
>>
>>> 2. Without my patch, these AMBA device's probe would be delayed at
>>> least until 5 seconds or possibly later. I'm wondering if my patch is
>>> catching some bad timing assumptions in other code.
>> After Rob's patch, It will retry soon.
>>
>> commit 039599c92d3b2e73689e8b6e519d653fd4770abb
>> Author: Rob Herring <robh@kernel.org>
>> Date:   Wed Apr 29 15:58:12 2020 -0500
>>
>>       amba: Retry adding deferred devices at late_initcall
>>
>>       If amba bus devices defer when adding, the amba bus code simply retries
>>       adding the devices every 5 seconds. This doesn't work well as it
>>       completely unsynchronized with starting the init process which can
>>       happen in less than 5 secs. Add a retry during late_initcall. If the
>>       amba devices are added, then deferred probe takes over. If the
>>       dependencies have not probed at this point, then there's no improvement
>>       over previous behavior. To completely solve this, we'd need to retry
>>       after every successful probe as deferred probe does.
>>
>>       The list_empty() check now happens outside the mutex, but the mutex
>>       wasn't necessary in the first place.
>>
>>       This needed to use deferred probe instead of fragile initcall ordering
>>       on 32-bit VExpress systems where the apb_pclk has a number of probe
>>       dependencies (vexpress-sysregs, vexpress-config).
>>
>>
>>> You might be able to test out theory (2) by DEFERRED_DEVICE_TIMEOUT to
>>> a much smaller number. Say 500ms or 100ms. If it doesn't crash, it
>>> doesn't mean it's not (2), but if it does, then we know for sure it's
>>> (2).
>> ok, I will try this one, but due to above patch, it may not work.
> Were you able to find anything more?
I can't find any clue， and have no time to check this for now, is there 
any news from your side?
>
> -Saravana
> .
>

Saravana Kannan July 5, 2022, 7:25 p.m. UTC | #16

On Fri, Sep 10, 2021 at 12:59 AM Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
>
>
> On 2021/9/9 11:30, Saravana Kannan wrote:
> > On Fri, Aug 27, 2021 at 6:09 PM Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
> >>
> >> On 2021/8/28 3:09, Saravana Kannan wrote:
> >>> On Fri, Aug 27, 2021 at 7:38 AM Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
> >>>> On 2021/8/27 8:04, Saravana Kannan wrote:
> >>>>> On Thu, Aug 26, 2021 at 1:22 AM Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
> >>>>>>>>> Btw, I've been working on [1] cleaning up the one-off deferred probe
> >>>>>>>>> solution that we have for amba devices. That causes a bunch of other
> >>>>>>>>> headaches. Your patch 3/3 takes us further in the wrong direction by
> >>>>>>>>> adding more reasons for delaying the addition of the device.
> >>>>>> Hi Saravana, I try the link[1], but with it, there is a crash when boot
> >>>>>> (qemu-system-arm -M vexpress-a15),
> >>> I'm assuming it's this one?
> >>> arch/arm/boot/dts/vexpress-v2p-ca15_a7.dts
> >> I use arch/arm/boot/dts/vexpress-v2p-ca15-tc1.dts.
> >>
> >> qemu-system-arm -M vexpress-a15 -dtb vexpress-v2p-ca15-tc1.dtb -cpu
> >> cortex-a15 -smp 2 -m size=3G -kernel zImage -rtc base=localtime -initrd
> >> initrd-arm32 -append 'console=ttyAMA0 cma=0 kfence.sample_interval=0
> >> earlyprintk debug ' -device virtio-net-device,netdev=net8 -netdev
> >> type=tap,id=net8,script=/etc/qemu-ifup,downscript=/etc/qemu-ifdown
> >> -nographic
> >>
> >>>>> Hi,
> >>>>>
> >>>>> It's hard to make sense of the logs. Looks like two different threads
> >>>>> might be printing to the log at the same time? Can you please enable
> >>>>> the config that prints the thread ID (forgot what it's called) and
> >>>>> collect this again? With what I could tell the crash seems to be
> >>>>> happening somewhere in platform_match(), but that's not related to
> >>>>> this patch at all?
> >>>> Can you reproduce it? it is very likely related(without your patch, the
> >>>> boot is fine),
> >>> Sorry, I haven't ever setup qemu and booted vexpress. Thanks for your help.
> >>>
> >>>> the NULL ptr is about serio, it is registed from amba driver.
> >>>>
> >>>> ambakmi_driver_init
> >>>>
> >>>>     -- amba_kmi_probe
> >>>>
> >>>>       -- __serio_register_port
> >>> Thanks for the pointer. I took a look at the logs and the code. It's
> >>> very strange. As you can see from the backtrace, platform_match() is
> >>> being called for the device_add() from serio_handle_event(). But the
> >>> device that gets added there is on the serio_bus which obviously
> >>> should be using the serio_bus_match.
> >> Yes, I am confused too.
> >>>> +Dmitry and input maillist, is there some known issue about serio ?
> >>>>
> >>>> I add some debug, the full log is attached.
> >>>>
> >>>> [    2.958355][   T41] input: AT Raw Set 2 keyboard as
> >>>> /devices/platform/bus@8000000/bus@8000000:motherboard-bus/bus@8000000:motherboard-bus:iofpga-bus@300000000/1c060000.kmi/serio0/input/input0
> >>>> [    2.977441][   T41] serio serio1: pdev c1e05508, pdev->name (null),
> >>>> drv c1090fc0, drv->name vexpress-reset
> >>> Based on the logs you added, it's pretty clear we are getting to
> >>> platform_match(). It's also strange that the drv->name is
> >>> vexpress-reset
> >> ...
> >>>> [    3.003113][   T41] Backtrace:
> >>>> [    3.003451][   T41] [<c0560bb4>] (strcmp) from [<c0646358>] (platform_match+0xdc/0xf0)
> >>>> [    3.003963][   T41] [<c064627c>] (platform_match) from [<c06437d4>] (__device_attach_driver+0x3c/0xf4)
> >>>> [    3.004769][   T41] [<c0643798>] (__device_attach_driver) from [<c0641180>] (bus_for_each_drv+0x68/0xc8)
> >>>> [    3.005481][   T41] [<c0641118>] (bus_for_each_drv) from [<c0642f40>] (__device_attach+0xf0/0x16c)
> >>>> [    3.006152][   T41] [<c0642e50>] (__device_attach) from [<c06439d4>] (device_initial_probe+0x1c/0x20)
> >>>> [    3.006853][   T41] [<c06439b8>] (device_initial_probe) from [<c0642030>] (bus_probe_device+0x94/0x9c)
> >>>> [    3.007259][   T41] [<c0641f9c>] (bus_probe_device) from [<c063f9cc>] (device_add+0x408/0x8b8)
> >>>> [    3.007900][   T41] [<c063f5c4>] (device_add) from [<c071c1cc>] (serio_handle_event+0x1b8/0x234)
> >>>> [    3.008824][   T41] [<c071c014>] (serio_handle_event) from [<c01475a4>] (process_one_work+0x238/0x594)
> >>>> [    3.009737][   T41] [<c014736c>] (process_one_work) from [<c014795c>] (worker_thread+0x5c/0x5f4)
> >>>> [    3.010638][   T41] [<c0147900>] (worker_thread) from [<c014feb4>] (kthread+0x178/0x194)
> >>>> [    3.011496][   T41] [<c014fd3c>] (kthread) from [<c0100150>] (ret_from_fork+0x14/0x24)
> >>>> [    3.011860][   T41] Exception stack(0xc1675fb0 to 0xc1675ff8)
> >>> But the platform_match() is happening for the device_add() from
> >>> serio_event_handle() that's adding a device to the serio_bus and it
> >>> should be using serio_bus_match().
> >>>
> >>> I haven't reached any conclusion yet, but my current thought process
> >>> is that it's either:
> >>> 1. My patch is somehow causing list corruption. But I don't directly
> >>> touch any list in my change (other than deleting a list entirely), so
> >>> it's not clear how that would be happening.
> >> Maybe some concurrent driver load？
> >>
> >>> 2. Without my patch, these AMBA device's probe would be delayed at
> >>> least until 5 seconds or possibly later. I'm wondering if my patch is
> >>> catching some bad timing assumptions in other code.
> >> After Rob's patch, It will retry soon.
> >>
> >> commit 039599c92d3b2e73689e8b6e519d653fd4770abb
> >> Author: Rob Herring <robh@kernel.org>
> >> Date:   Wed Apr 29 15:58:12 2020 -0500
> >>
> >>       amba: Retry adding deferred devices at late_initcall
> >>
> >>       If amba bus devices defer when adding, the amba bus code simply retries
> >>       adding the devices every 5 seconds. This doesn't work well as it
> >>       completely unsynchronized with starting the init process which can
> >>       happen in less than 5 secs. Add a retry during late_initcall. If the
> >>       amba devices are added, then deferred probe takes over. If the
> >>       dependencies have not probed at this point, then there's no improvement
> >>       over previous behavior. To completely solve this, we'd need to retry
> >>       after every successful probe as deferred probe does.
> >>
> >>       The list_empty() check now happens outside the mutex, but the mutex
> >>       wasn't necessary in the first place.
> >>
> >>       This needed to use deferred probe instead of fragile initcall ordering
> >>       on 32-bit VExpress systems where the apb_pclk has a number of probe
> >>       dependencies (vexpress-sysregs, vexpress-config).
> >>
> >>
> >>> You might be able to test out theory (2) by DEFERRED_DEVICE_TIMEOUT to
> >>> a much smaller number. Say 500ms or 100ms. If it doesn't crash, it
> >>> doesn't mean it's not (2), but if it does, then we know for sure it's
> >>> (2).
> >> ok, I will try this one, but due to above patch, it may not work.
> > Were you able to find anything more?
> I can't find any clue， and have no time to check this for now, is there
> any news from your side?

To close out this thread, the issue was due to a UAF bug in driver
core that was fixed by:
https://lore.kernel.org/all/20220513112444.45112-1-schspa@gmail.com/

With that fix, there wouldn't have been a crash, but amba driver
registration would have failed (because match returned
non-EPROBE_DEFER error).

-Saravana

Leizhen (ThunderTown) Aug. 27, 2022, 10:26 a.m. UTC | #17

On 2022/7/6 3:25, Saravana Kannan wrote:
> On Fri, Sep 10, 2021 at 12:59 AM Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
>>
>>
>> On 2021/9/9 11:30, Saravana Kannan wrote:
>>> On Fri, Aug 27, 2021 at 6:09 PM Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
>>>>
>>>> On 2021/8/28 3:09, Saravana Kannan wrote:
>>>>> On Fri, Aug 27, 2021 at 7:38 AM Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
>>>>>> On 2021/8/27 8:04, Saravana Kannan wrote:
>>>>>>> On Thu, Aug 26, 2021 at 1:22 AM Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
>>>>>>>>>>> Btw, I've been working on [1] cleaning up the one-off deferred probe
>>>>>>>>>>> solution that we have for amba devices. That causes a bunch of other
>>>>>>>>>>> headaches. Your patch 3/3 takes us further in the wrong direction by
>>>>>>>>>>> adding more reasons for delaying the addition of the device.
>>>>>>>> Hi Saravana, I try the link[1], but with it, there is a crash when boot
>>>>>>>> (qemu-system-arm -M vexpress-a15),
>>>>> I'm assuming it's this one?
>>>>> arch/arm/boot/dts/vexpress-v2p-ca15_a7.dts
>>>> I use arch/arm/boot/dts/vexpress-v2p-ca15-tc1.dts.
>>>>
>>>> qemu-system-arm -M vexpress-a15 -dtb vexpress-v2p-ca15-tc1.dtb -cpu
>>>> cortex-a15 -smp 2 -m size=3G -kernel zImage -rtc base=localtime -initrd
>>>> initrd-arm32 -append 'console=ttyAMA0 cma=0 kfence.sample_interval=0
>>>> earlyprintk debug ' -device virtio-net-device,netdev=net8 -netdev
>>>> type=tap,id=net8,script=/etc/qemu-ifup,downscript=/etc/qemu-ifdown
>>>> -nographic
>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> It's hard to make sense of the logs. Looks like two different threads
>>>>>>> might be printing to the log at the same time? Can you please enable
>>>>>>> the config that prints the thread ID (forgot what it's called) and
>>>>>>> collect this again? With what I could tell the crash seems to be
>>>>>>> happening somewhere in platform_match(), but that's not related to
>>>>>>> this patch at all?
>>>>>> Can you reproduce it? it is very likely related(without your patch, the
>>>>>> boot is fine),
>>>>> Sorry, I haven't ever setup qemu and booted vexpress. Thanks for your help.
>>>>>
>>>>>> the NULL ptr is about serio, it is registed from amba driver.
>>>>>>
>>>>>> ambakmi_driver_init
>>>>>>
>>>>>>     -- amba_kmi_probe
>>>>>>
>>>>>>       -- __serio_register_port
>>>>> Thanks for the pointer. I took a look at the logs and the code. It's
>>>>> very strange. As you can see from the backtrace, platform_match() is
>>>>> being called for the device_add() from serio_handle_event(). But the
>>>>> device that gets added there is on the serio_bus which obviously
>>>>> should be using the serio_bus_match.
>>>> Yes, I am confused too.
>>>>>> +Dmitry and input maillist, is there some known issue about serio ?
>>>>>>
>>>>>> I add some debug, the full log is attached.
>>>>>>
>>>>>> [    2.958355][   T41] input: AT Raw Set 2 keyboard as
>>>>>> /devices/platform/bus@8000000/bus@8000000:motherboard-bus/bus@8000000:motherboard-bus:iofpga-bus@300000000/1c060000.kmi/serio0/input/input0
>>>>>> [    2.977441][   T41] serio serio1: pdev c1e05508, pdev->name (null),
>>>>>> drv c1090fc0, drv->name vexpress-reset
>>>>> Based on the logs you added, it's pretty clear we are getting to
>>>>> platform_match(). It's also strange that the drv->name is
>>>>> vexpress-reset
>>>> ...
>>>>>> [    3.003113][   T41] Backtrace:
>>>>>> [    3.003451][   T41] [<c0560bb4>] (strcmp) from [<c0646358>] (platform_match+0xdc/0xf0)
>>>>>> [    3.003963][   T41] [<c064627c>] (platform_match) from [<c06437d4>] (__device_attach_driver+0x3c/0xf4)
>>>>>> [    3.004769][   T41] [<c0643798>] (__device_attach_driver) from [<c0641180>] (bus_for_each_drv+0x68/0xc8)
>>>>>> [    3.005481][   T41] [<c0641118>] (bus_for_each_drv) from [<c0642f40>] (__device_attach+0xf0/0x16c)
>>>>>> [    3.006152][   T41] [<c0642e50>] (__device_attach) from [<c06439d4>] (device_initial_probe+0x1c/0x20)
>>>>>> [    3.006853][   T41] [<c06439b8>] (device_initial_probe) from [<c0642030>] (bus_probe_device+0x94/0x9c)
>>>>>> [    3.007259][   T41] [<c0641f9c>] (bus_probe_device) from [<c063f9cc>] (device_add+0x408/0x8b8)
>>>>>> [    3.007900][   T41] [<c063f5c4>] (device_add) from [<c071c1cc>] (serio_handle_event+0x1b8/0x234)
>>>>>> [    3.008824][   T41] [<c071c014>] (serio_handle_event) from [<c01475a4>] (process_one_work+0x238/0x594)
>>>>>> [    3.009737][   T41] [<c014736c>] (process_one_work) from [<c014795c>] (worker_thread+0x5c/0x5f4)
>>>>>> [    3.010638][   T41] [<c0147900>] (worker_thread) from [<c014feb4>] (kthread+0x178/0x194)
>>>>>> [    3.011496][   T41] [<c014fd3c>] (kthread) from [<c0100150>] (ret_from_fork+0x14/0x24)
>>>>>> [    3.011860][   T41] Exception stack(0xc1675fb0 to 0xc1675ff8)
>>>>> But the platform_match() is happening for the device_add() from
>>>>> serio_event_handle() that's adding a device to the serio_bus and it
>>>>> should be using serio_bus_match().
>>>>>
>>>>> I haven't reached any conclusion yet, but my current thought process
>>>>> is that it's either:
>>>>> 1. My patch is somehow causing list corruption. But I don't directly
>>>>> touch any list in my change (other than deleting a list entirely), so
>>>>> it's not clear how that would be happening.
>>>> Maybe some concurrent driver load？
>>>>
>>>>> 2. Without my patch, these AMBA device's probe would be delayed at
>>>>> least until 5 seconds or possibly later. I'm wondering if my patch is
>>>>> catching some bad timing assumptions in other code.
>>>> After Rob's patch, It will retry soon.
>>>>
>>>> commit 039599c92d3b2e73689e8b6e519d653fd4770abb
>>>> Author: Rob Herring <robh@kernel.org>
>>>> Date:   Wed Apr 29 15:58:12 2020 -0500
>>>>
>>>>       amba: Retry adding deferred devices at late_initcall
>>>>
>>>>       If amba bus devices defer when adding, the amba bus code simply retries
>>>>       adding the devices every 5 seconds. This doesn't work well as it
>>>>       completely unsynchronized with starting the init process which can
>>>>       happen in less than 5 secs. Add a retry during late_initcall. If the
>>>>       amba devices are added, then deferred probe takes over. If the
>>>>       dependencies have not probed at this point, then there's no improvement
>>>>       over previous behavior. To completely solve this, we'd need to retry
>>>>       after every successful probe as deferred probe does.
>>>>
>>>>       The list_empty() check now happens outside the mutex, but the mutex
>>>>       wasn't necessary in the first place.
>>>>
>>>>       This needed to use deferred probe instead of fragile initcall ordering
>>>>       on 32-bit VExpress systems where the apb_pclk has a number of probe
>>>>       dependencies (vexpress-sysregs, vexpress-config).
>>>>
>>>>
>>>>> You might be able to test out theory (2) by DEFERRED_DEVICE_TIMEOUT to
>>>>> a much smaller number. Say 500ms or 100ms. If it doesn't crash, it
>>>>> doesn't mean it's not (2), but if it does, then we know for sure it's
>>>>> (2).
>>>> ok, I will try this one, but due to above patch, it may not work.
>>> Were you able to find anything more?
>> I can't find any clue， and have no time to check this for now, is there
>> any news from your side?

Hi, Saravana and Kefeng:
  I've spent the whole afternoon trying to figure this out, and the fix
patch has been cc you two.

> 
> To close out this thread, the issue was due to a UAF bug in driver
> core that was fixed by:
> https://lore.kernel.org/all/20220513112444.45112-1-schspa@gmail.com/
> 
> With that fix, there wouldn't have been a crash, but amba driver
> registration would have failed (because match returned
> non-EPROBE_DEFER error).
> 
> -Saravana
> 
> _______________________________________________
> linux-arm-kernel mailing list
> linux-arm-kernel@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
>

[3/3] amba: Properly handle device probe without IRQ domain

Checks

Commit Message

Comments

Patch