From patchwork Tue Mar 15 16:35:24 2022
X-Patchwork-Submitter: Asmaa Mnebhi
X-Patchwork-Id: 1605734
From: Asmaa Mnebhi
To: kernel-team@lists.ubuntu.com
Subject: [SRU][F:linux-bluefield][PATCH v1 1/1] UBUNTU: SAUCE: Fix OOB handling RX packets in heavy traffic
Date: Tue, 15 Mar 2022 12:35:24 -0400
Message-Id: <20220315163524.3547-2-asmaa@nvidia.com>
X-Mailer: git-send-email 2.30.1
In-Reply-To:
<20220315163524.3547-1-asmaa@nvidia.com>
References: <20220315163524.3547-1-asmaa@nvidia.com>
Cc: meriton@nvidia.com, khoav@nvidia.com, asmaa@nvidia.com, David Thompson

BugLink: https://bugs.launchpad.net/bugs/1964984

This is reproducible on systems which already have heavy background
traffic. On top of that, the user issues one of the two docker pulls
below:

docker pull nvcr.io/ea-doca-hbn/hbn/hbn:latest
OR
docker pull gitlab-master.nvidia.com:5005/dl/dgx/tritonserver:22.02-py3-qa

The second one is a very large container (17GB).

When they run docker pull, the OOB interface stops being pingable, and
the docker pull is either interrupted for a very long time (more than
3 minutes) or times out.

The root cause of the above is that RX PI == RX CI. I have verified
that by reading RX_CQE_PACKET_CI and RX_WQE_PI. This means the WQEs
are full and HW has nowhere else to put the RX packets. I believe
there is a race condition after SW receives an RX interrupt and the
interrupt is disabled: HW still tries to add RX packets to the RX
WQEs. So we need to stop the RX traffic by disabling the RX DMA.

Also, move reading the RX CI to before writing the increased value of
RX PI to MLXBF_GIGE_RX_WQE_PI. Normally RX PI should always be > RX CI.
I suspect that when entering mlxbf_gige_rx_packet, for example we have:

MLXBF_GIGE_RX_WQE_PI = 128
RX_CQE_PACKET_CI = 128 (128 being the max size of the WQE)

Then this code will make MLXBF_GIGE_RX_WQE_PI = 129:

	rx_pi++;

	/* Ensure completion of all writes before notifying HW of replenish */
	wmb();
	writeq(rx_pi, priv->base + MLXBF_GIGE_RX_WQE_PI);

which means HW has one more slot to populate, and in that time span
HW populates that WQE and increases RX_CQE_PACKET_CI to 129.
Then this code is subject to a race condition:

	rx_ci = readq(priv->base + MLXBF_GIGE_RX_CQE_PACKET_CI);
	rx_ci_rem = rx_ci % priv->rx_q_entries;

	return rx_pi_rem != rx_ci_rem;

because rx_pi_rem will be equal to rx_ci_rem, so remaining_pkts will
be 0 and we will exit mlxbf_gige_poll even though unprocessed packets
remain in the queue.

Signed-off-by: Asmaa Mnebhi
Reviewed-by: David Thompson
Signed-off-by: Asmaa Mnebhi
Acked-by: Stefan Bader
---
 .../ethernet/mellanox/mlxbf_gige/mlxbf_gige_main.c |  2 +-
 .../ethernet/mellanox/mlxbf_gige/mlxbf_gige_rx.c   | 13 +++++++++++--
 2 files changed, 12 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxbf_gige/mlxbf_gige_main.c b/drivers/net/ethernet/mellanox/mlxbf_gige/mlxbf_gige_main.c
index d0871014d4e9..9ef883b90aee 100644
--- a/drivers/net/ethernet/mellanox/mlxbf_gige/mlxbf_gige_main.c
+++ b/drivers/net/ethernet/mellanox/mlxbf_gige/mlxbf_gige_main.c
@@ -20,7 +20,7 @@
 #include "mlxbf_gige_regs.h"

 #define DRV_NAME    "mlxbf_gige"
-#define DRV_VERSION 1.25
+#define DRV_VERSION 1.26

 /* This setting defines the version of the ACPI table
  * content that is compatible with this driver version.
diff --git a/drivers/net/ethernet/mellanox/mlxbf_gige/mlxbf_gige_rx.c b/drivers/net/ethernet/mellanox/mlxbf_gige/mlxbf_gige_rx.c
index afa3b92a6905..96230763cf6c 100644
--- a/drivers/net/ethernet/mellanox/mlxbf_gige/mlxbf_gige_rx.c
+++ b/drivers/net/ethernet/mellanox/mlxbf_gige/mlxbf_gige_rx.c
@@ -266,6 +266,9 @@ static bool mlxbf_gige_rx_packet(struct mlxbf_gige *priv, int *rx_pkts)
 		priv->stats.rx_truncate_errors++;
 	}

+	rx_ci = readq(priv->base + MLXBF_GIGE_RX_CQE_PACKET_CI);
+	rx_ci_rem = rx_ci % priv->rx_q_entries;
+
 	/* Let hardware know we've replenished one buffer */
 	rx_pi++;
@@ -278,8 +281,6 @@ static bool mlxbf_gige_rx_packet(struct mlxbf_gige *priv, int *rx_pkts)
 	rx_pi_rem = rx_pi % priv->rx_q_entries;
 	if (rx_pi_rem == 0)
 		priv->valid_polarity ^= 1;
-	rx_ci = readq(priv->base + MLXBF_GIGE_RX_CQE_PACKET_CI);
-	rx_ci_rem = rx_ci % priv->rx_q_entries;

 	if (skb)
 		netif_receive_skb(skb);
@@ -299,6 +300,10 @@ int mlxbf_gige_poll(struct napi_struct *napi, int budget)

 	mlxbf_gige_handle_tx_complete(priv);

+	data = readq(priv->base + MLXBF_GIGE_RX_DMA);
+	data &= ~MLXBF_GIGE_RX_DMA_EN;
+	writeq(data, priv->base + MLXBF_GIGE_RX_DMA);
+
 	do {
 		remaining_pkts = mlxbf_gige_rx_packet(priv, &work_done);
 	} while (remaining_pkts && work_done < budget);
@@ -314,6 +319,10 @@ int mlxbf_gige_poll(struct napi_struct *napi, int budget)
 		data = readq(priv->base + MLXBF_GIGE_INT_MASK);
 		data &= ~MLXBF_GIGE_INT_MASK_RX_RECEIVE_PACKET;
 		writeq(data, priv->base + MLXBF_GIGE_INT_MASK);
+
+	 	data = readq(priv->base + MLXBF_GIGE_RX_DMA);
+	 	data |= MLXBF_GIGE_RX_DMA_EN;
+	 	writeq(data, priv->base + MLXBF_GIGE_RX_DMA);
 	}

 	return work_done;