[SRU,F,0/1] net/mlx5: Avoid processing commands before cmdif is ready (LP: 1987287)

Message ID 20220901165337.602338-1-frank.heimes@canonical.com

Message

Frank Heimes Sept. 1, 2022, 4:53 p.m. UTC
BugLink: https://bugs.launchpad.net/bugs/1987287

SRU Justification:

[Impact] 

 * If the mlx5 driver is reloaded while the recovery flow is in progress
   and it receives new commands before the command interface is up
   again, this can lead to a NULL pointer dereference when the driver
   accesses uninitialized command structures.

 * So it's required to avoid processing commands before the command
   interface is up again.

 * This is accomplished by a new cmdif state that helps to avoid
   processing commands while cmdif is not ready.
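
The gating described above can be sketched in plain user-space C. This is
only an illustration of the idea, not the driver code: every identifier
below (cmdif_state, cmd_exec, cmdif_set_state, struct device) is
hypothetical, and the -EIO return value is an assumption rather than what
the backported patch returns.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical names for illustration; not the driver's identifiers. */
enum cmdif_state {
    CMDIF_UNINITIALIZED, /* command structures not yet set up */
    CMDIF_UP,            /* commands may be processed */
    CMDIF_DOWN,          /* teardown or recovery in progress */
};

struct cmd_queue {
    int pending; /* stand-in for the real command bookkeeping */
};

struct device {
    enum cmdif_state state;
    struct cmd_queue *queue; /* NULL until the interface is initialized */
};

/* Record the new interface state (at init, bring-up and teardown). */
static void cmdif_set_state(struct device *dev, enum cmdif_state state)
{
    assert(dev != NULL);
    dev->state = state;
}

/*
 * Submit a command. Returns 0 on success, or -EIO (assumed error code)
 * if the interface is not up. In the latter case dev->queue is never
 * touched, so a NULL queue cannot be dereferenced.
 */
static int cmd_exec(struct device *dev)
{
    assert(dev != NULL);
    if (dev->state != CMDIF_UP)
        return -EIO;
    dev->queue->pending++;
    return 0;
}
```

In this sketch a command submitted while the state is CMDIF_UNINITIALIZED
or CMDIF_DOWN is rejected up front, which corresponds to the window during
a driver reload in which the command structures are not usable.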

[Fix]

 * Backport of f7936ddd35d8 (f7936ddd35d8b849daf0372770c7c9dbe7910fca) "net/mlx5: Avoid processing commands before cmdif is ready"

[Test Plan]

 * An Ubuntu Server for s390x 18.04 or 20.04 LPAR or z/VM installation
   is needed that has Mellanox cards (RoCE Express 2.1) assigned, 
   configured and enabled and that runs a 5.4 kernel (on bionic hwe-5.4).

 * Now trigger a recovery (presumably this can be done at the Support
   Element) and reload the driver at the same time.

 * Make sure the module/driver mlx5 is loaded and in use
   (otherwise it can't be removed/unloaded).

 * Now remove/unload the module with:
   sudo modprobe -r mlx5
   and (re-)load it again with:
   sudo modprobe mlx5

 * Due to the lack of RoCE Express 2.1 hardware,
   IBM needs to do the verification.

[Where problems could occur]

 * If there is an issue with 'cmdif', it might not have the correct
   interface state, in which case:
   - either commands are not properly blocked
     and the situation is similar to before,
   - or commands are always blocked,
     which renders the hardware useless,
   - or commands are blocked in the wrong situation,
     which will cause unexpected issues and broken behavior.

 * Since the patch was accepted upstream with v5.7-rc7, it's
   not new to the kernel, was already part of groovy (and above)
   and is therefore already in use by newer Ubuntu releases.

[Other Info]
 
 * Since the patch has been upstream since v5.7-rc7,
   it's already included in jammy and kinetic.

 * Since the upstream patch includes the line:
   Fixes: e126ba97dba9 ("mlx5: Add driver for Mellanox
   Connect-IB adapters") it looks to me as if marking the patch
   for upstream stable updates was forgotten.

 * Such SRUs for focal's 5.4 will automatically land in bionic's
   hwe-5.4, too. But since this was especially requested for
   bionic's hwe-5.4, I wanted to mention this here.

Eran Ben Elisha (1):
  net/mlx5: Avoid processing commands before cmdif is ready

 drivers/net/ethernet/mellanox/mlx5/core/cmd.c  | 10 ++++++++++
 drivers/net/ethernet/mellanox/mlx5/core/main.c |  4 ++++
 include/linux/mlx5/driver.h                    |  9 +++++++++
 3 files changed, 23 insertions(+)

Comments

Tim Gardner Sept. 1, 2022, 5:53 p.m. UTC | #1
Acked-by: Tim Gardner <tim.gardner@canonical.com>

Stefan Bader Sept. 14, 2022, 2:10 p.m. UTC | #2
Applied to focal:linux/master-next. Thanks.

-Stefan