Message ID | 20240111150334.760997-1-robert.malz@canonical.com |
---|---|
Series | Intel E810 transmit hang with bonding enabled |
On 1/11/24 08:03, Robert Malz wrote:
> BugLink: https://bugs.launchpad.net/bugs/2036239
>
> [Impact]
> * Issue is causing transmit hang on E810 ports with bonding enabled.
> * Based on the provided logs, the TX hang can last for up to a couple of minutes, but in most scenarios the network recovers after the ice driver performs a PF reset (TX hang handler routine).
> * Originally, the issue was observed during Tempest tests on a newly created OpenStack cluster, resulting in a lack of certification.
>
> [Fix]
> * Initially, a workaround was proposed by Intel engineers to disable LAG initialization [1].
>   This change has been tested in an environment where reproduction is easily achieved.
>   After multiple iterations, no reproduction has been observed.
> * Shortly after, Intel proposed a patch [2] to disable LAG initialization if the NVM does not expose proper capabilities.
>
> [Test Plan]
> * To reproduce the issue, a cluster of over 20 nodes with Ceph-based storage was used. The problem could sometimes manifest while deploying a cluster or after the cluster was already deployed, during the Tempest test run.
> * The issue could appear on a random node, making reproduction hard to achieve.
> * Multiple stress tests on a single host with a similar configuration did not trigger a reproduction.
>
> [Where problems could occur]
> * All ice drivers with ice_lag_event_handler registered can expose the issue. This handler is not implemented in 20.04.
> * CVL4.2 and older NVM images for E810 do not expose SRIOV LAG capabilities (CVL4.3 wasn't checked), meaning at some point an NVM with this capability will be released.
>   Although the issue is potentially caused by using features without proper FW support [2], we want to take a closer look once NVMs with proper support are introduced.
>
> [1] - https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2036239/comments/40
> [2] - https://lists.osuosl.org/pipermail/intel-wired-lan/Week-of-Mon-20231211/038588.html 4d50fcdc2476eef94c14c6761073af5667bb43b6
>
>
> Dave Ertman (2):
>   [SRU][M][PATCH v2 1/2] ice: Add driver support for firmware changes
>     for LAG
>   [SRU][M][PATCH v2 2/2] ice: alter feature support check for SRIOV and
>     LAG
>
>  drivers/net/ethernet/intel/ice/ice.h          |  5 ++
>  .../net/ethernet/intel/ice/ice_adminq_cmd.h   |  3 +
>  drivers/net/ethernet/intel/ice/ice_common.c   |  8 +++
>  drivers/net/ethernet/intel/ice/ice_lag.c      | 55 ++++++++++---------
>  drivers/net/ethernet/intel/ice/ice_lib.c      |  2 +-
>  drivers/net/ethernet/intel/ice/ice_lib.h      |  1 +
>  drivers/net/ethernet/intel/ice/ice_main.c     | 22 +++++++-
>  drivers/net/ethernet/intel/ice/ice_type.h     |  2 +
>  8 files changed, 68 insertions(+), 30 deletions(-)
>

Acked-by: Tim Gardner <tim.gardner@canonical.com>
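
The [Fix] section of the cover letter describes skipping LAG initialization when the NVM does not expose the required capability. A minimal sketch of that gating pattern follows. It is not the ice driver code: `struct fw_caps`, `struct dev_state`, and `lag_init_if_supported()` are hypothetical names chosen for illustration, and the actual patches may structure the check differently.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical capability record, filled in from firmware/NVM at probe time. */
struct fw_caps {
	bool sriov_lag; /* true only if the NVM advertises SRIOV LAG support */
};

/* Hypothetical per-device state; not the ice driver's own structures. */
struct dev_state {
	struct fw_caps caps;
	bool lag_enabled; /* set once LAG event handling is registered */
};

/*
 * Initialize LAG handling only when firmware advertises the capability.
 * Returns true if LAG was initialized, false if it was skipped.
 */
static bool lag_init_if_supported(struct dev_state *dev)
{
	assert(dev != NULL);

	if (!dev->caps.sriov_lag) {
		/* Older NVM: leave LAG off so no bonding events are handled. */
		dev->lag_enabled = false;
		return false;
	}

	/* A real driver would register its bonding event handler here. */
	dev->lag_enabled = true;
	return true;
}
```

With this shape, a device whose firmware does not report the capability never registers the handler, which matches the behaviour the cover letter describes for CVL4.2 and older NVM images.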

Acked-by: Jacob Martin <jacob.martin@canonical.com>

On 1/11/24 9:03 AM, Robert Malz wrote:
> [...]

On 24/01/11 04:03PM, Robert Malz wrote:
> [...]

The bug report should include the target series. Could you update it accordingly? Also, I would have benefited from a v2 description in the cover letter.

Acked-by: Andrei Gherzan <andrei.gherzan@canonical.com>

On 11.01.24 16:03, Robert Malz wrote:
> [...]

Applied to mantic:linux/master-next. Thanks.

-Stefan