
[SRU,J:linux-bluefield,v1,0/1] UBUNTU: SAUCE: gpio-mlxbf3: During reboot test, ipmb driver fails to load intermittently

Message ID 20240520220104.3602-1-asmaa@nvidia.com

Message

Asmaa Mnebhi May 20, 2024, 10:01 p.m. UTC
BugLink: https://bugs.launchpad.net/bugs/2066198

SRU Justification:

[Impact]

    The ipmb driver fails to load because i2c-mlxbf does not receive
    interrupts.
    In fact, any driver that depends on the i2c-mlxbf driver will not work.

    How to reproduce this issue?

    - modprobe gpio-mlxbf3
    - modprobe pwr-mlxbf
    - modprobe mlxbf-gige -> this calls into the gpio driver which enables the PHY interrupt (gpio10)
    - reboot Linux
      -> a graceful reboot does not remove modules, so the PHY interrupt is never disabled via
         mlxbf3_gpio_irq_disable. Hence, the interrupt remains enabled.
    - In anolis, we don't enforce the dependency between gpio-mlxbf3 and mlxbf-gige.
      So if the next boot loads the drivers in the following order, we encounter the issue:
    - modprobe mlxbf-gige. The gige driver uses polling when it loads before the gpio
      driver. Note that the interrupt at GPIO10 is still enabled at this point, so if the interrupt
      triggers, there is nothing to clear it.
    - modprobe gpio-mlxbf3
    - modprobe i2c-mlxbf. The interrupt does not work here because it is shared with the gpio
      interrupt, which was never cleared.

[Fix]

* The solution is to add a shutdown function to the gpio driver to clear and disable all interrupts.
* Also make sure to clear the interrupt after disabling it in the disable irq function.
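
For illustration only, the failure and the fix can be modelled in a small userspace C sketch. This is not the driver patch: the struct, its two fields, and every function name below are hypothetical stand-ins for the real gpio-mlxbf3 registers and callbacks, and the model assumes that the enable and cause state survives a reboot, as described in [Impact].

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical model of the GPIO interrupt block. The field names and
 * layout are assumptions for this sketch, not the real register map.
 */
struct gpio_irq_regs {
	uint32_t enable; /* bit n set: interrupt for GPIO n is enabled */
	uint32_t cause;  /* bit n set: interrupt for GPIO n is latched */
};

/* Enable the interrupt for one pin (what loading mlxbf-gige triggers). */
void gpio_irq_enable(struct gpio_irq_regs *r, unsigned int pin)
{
	r->enable |= UINT32_C(1) << pin;
}

/* Model of the hardware: an event latches only on an enabled pin. */
void hw_event(struct gpio_irq_regs *r, unsigned int pin)
{
	if (r->enable & (UINT32_C(1) << pin))
		r->cause |= UINT32_C(1) << pin;
}

/* The shared interrupt line stays asserted while any cause bit is set. */
bool shared_irq_stuck(const struct gpio_irq_regs *r)
{
	return r->cause != 0;
}

/* Second item of the fix: disable the pin, then clear its latched cause. */
void gpio_irq_disable(struct gpio_irq_regs *r, unsigned int pin)
{
	r->enable &= ~(UINT32_C(1) << pin);
	r->cause &= ~(UINT32_C(1) << pin);
}

/* First item of the fix: on shutdown, disable and clear every interrupt. */
void gpio_shutdown(struct gpio_irq_regs *r)
{
	r->enable = 0;
	r->cause = 0;
}
```

In this model, enabling pin 10 and rebooting without calling gpio_shutdown() leaves the enable bit set, so a later hw_event() latches a cause that nothing clears and shared_irq_stuck() returns true. Calling gpio_shutdown() before the reboot leaves both fields at zero, so the same event is ignored.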

[Test Case]

* Run the reboot test (2000-3000 iterations)
* Check that all of the following drivers load without errors: gpio-mlxbf3, pwr-mlxbf, mlxbf-gige, i2c-mlxbf
* Check that the ipmb drivers are loaded and functional (send an ipmb command to the BMC and vice versa)
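
As a rough aid for the driver-load check, the C sketch below looks up a module name in text formatted like /proc/modules. The helper names are hypothetical and not part of any test suite. It relies on the kernel reporting module names with underscores in place of hyphens, so gpio-mlxbf3 appears as gpio_mlxbf3. It only shows that a module is listed, not that it probed without errors, so the kernel log still needs to be checked.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Copy name into out, mapping '-' to '_' the way /proc/modules lists names. */
void normalize_module_name(const char *name, char *out, size_t out_len)
{
	size_t i = 0;

	if (out_len == 0)
		return;
	for (; name[i] != '\0' && i + 1 < out_len; i++)
		out[i] = (name[i] == '-') ? '_' : name[i];
	out[i] = '\0';
}

/*
 * Return true if name is the first space-terminated field of any line in
 * listing, where listing holds text in /proc/modules format.
 */
bool module_listed(const char *listing, const char *name)
{
	char want[64];
	size_t want_len;
	const char *line = listing;

	normalize_module_name(name, want, sizeof(want));
	want_len = strlen(want);

	while (line != NULL && *line != '\0') {
		/* Require a full-field match, not just a common prefix. */
		if (strncmp(line, want, want_len) == 0 && line[want_len] == ' ')
			return true;
		line = strchr(line, '\n');
		if (line != NULL)
			line++; /* step past the newline to the next entry */
	}
	return false;
}
```

A caller would read /proc/modules into a buffer after each reboot iteration and call module_listed() once for each of the four driver names in the list above.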

[Regression Potential]

* No known regression.

Comments

Tim Gardner May 23, 2024, 3:33 p.m. UTC | #1
Acked-by: Tim Gardner <tim.gardner@canonical.com>
Bartlomiej Zolnierkiewicz June 3, 2024, 9:27 a.m. UTC | #2
Acked-by: Bartlomiej Zolnierkiewicz <bartlomiej.zolnierkiewicz@canonical.com>

Bartlomiej Zolnierkiewicz June 3, 2024, 10:55 a.m. UTC | #3
Applied to jammy:linux-bluefield/master-next. Thanks.

--
Best regards,
Bartlomiej
