From patchwork Wed Oct 16 13:37:01 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Jacob Martin
X-Patchwork-Id: 1998082
From: Jacob Martin
To: kernel-team@lists.ubuntu.com
Subject: [SRU][N:nvidia][PATCH 3/8] net: mana: Fix race of mana_hwc_post_rx_wqe
 and new hwc response
Date: Wed, 16 Oct 2024 08:37:01 -0500
Message-ID: <20241016133706.173515-4-jacob.martin@canonical.com>
X-Mailer: git-send-email 2.43.0
In-Reply-To: <20241016133706.173515-1-jacob.martin@canonical.com>
References: <20241016133706.173515-1-jacob.martin@canonical.com>
MIME-Version: 1.0
X-BeenThere: kernel-team@lists.ubuntu.com
X-Mailman-Version: 2.1.20
Precedence: list
List-Id: Kernel team discussions
Errors-To: kernel-team-bounces@lists.ubuntu.com
Sender: "kernel-team"

From: Haiyang Zhang

BugLink: https://bugs.launchpad.net/bugs/2084598

The mana_hwc_rx_event_handler() / mana_hwc_handle_resp() path calls
complete(&ctx->comp_event) before posting the rx wqe back. It is possible
for other callers, like mana_create_txq(), to start the next round of
mana_hwc_send_request() before the wqe is posted. If the HW is fast enough
to respond, it can hit a no_wqe error on the HW channel, and the response
message is lost. The mana driver may then fail to create queues and open,
because it times out waiting for the HW response. Sample dmesg:

[ 528.610840] mana 39d4:00:02.0: HWC: Request timed out!
[ 528.614452] mana 39d4:00:02.0: Failed to send mana message: -110, 0x0
[ 528.618326] mana 39d4:00:02.0 enP14804s2: Failed to create WQ object: -110

To fix it, move the posting of the rx wqe before complete(&ctx->comp_event).

Cc: stable@vger.kernel.org
Fixes: ca9c54d2d6a5 ("net: mana: Add a driver for Microsoft Azure Network Adapter (MANA)")
Signed-off-by: Haiyang Zhang
Reviewed-by: Long Li
Signed-off-by: David S. Miller
(cherry picked from commit 8af174ea863c72f25ce31cee3baad8a301c0cf0f netdev)
Signed-off-by: Vinicius Peixoto
Acked-by: Aaron Jauregui
Acked-by: Thibault Ferrante
Signed-off-by: John Cabaj
(cherry picked from commit 40a4338c667f2c84e84b1fd5484e46ae3321e34f noble:linux-azure/master-next)
Signed-off-by: Jacob Martin
---
 .../net/ethernet/microsoft/mana/hw_channel.c | 62 ++++++++++---------
 1 file changed, 34 insertions(+), 28 deletions(-)

diff --git a/drivers/net/ethernet/microsoft/mana/hw_channel.c b/drivers/net/ethernet/microsoft/mana/hw_channel.c
index 4dd43ac5a3cb..48b899834d0a 100644
--- a/drivers/net/ethernet/microsoft/mana/hw_channel.c
+++ b/drivers/net/ethernet/microsoft/mana/hw_channel.c
@@ -51,9 +51,33 @@ static int mana_hwc_verify_resp_msg(const struct hwc_caller_ctx *caller_ctx,
         return 0;
 }
 
+static int mana_hwc_post_rx_wqe(const struct hwc_wq *hwc_rxq,
+                                struct hwc_work_request *req)
+{
+        struct device *dev = hwc_rxq->hwc->dev;
+        struct gdma_sge *sge;
+        int err;
+
+        sge = &req->sge;
+        sge->address = (u64)req->buf_sge_addr;
+        sge->mem_key = hwc_rxq->msg_buf->gpa_mkey;
+        sge->size = req->buf_len;
+
+        memset(&req->wqe_req, 0, sizeof(struct gdma_wqe_request));
+        req->wqe_req.sgl = sge;
+        req->wqe_req.num_sge = 1;
+        req->wqe_req.client_data_unit = 0;
+
+        err = mana_gd_post_and_ring(hwc_rxq->gdma_wq, &req->wqe_req, NULL);
+        if (err)
+                dev_err(dev, "Failed to post WQE on HWC RQ: %d\n", err);
+        return err;
+}
+
 static void mana_hwc_handle_resp(struct hw_channel_context *hwc, u32 resp_len,
-                                 const struct gdma_resp_hdr *resp_msg)
+                                 struct hwc_work_request *rx_req)
 {
+        const struct gdma_resp_hdr *resp_msg = rx_req->buf_va;
         struct hwc_caller_ctx *ctx;
         int err;
 
@@ -61,6 +85,7 @@ static void mana_hwc_handle_resp(struct hw_channel_context *hwc, u32 resp_len,
                       hwc->inflight_msg_res.map)) {
                 dev_err(hwc->dev, "hwc_rx: invalid msg_id = %u\n",
                         resp_msg->response.hwc_msg_id);
+                mana_hwc_post_rx_wqe(hwc->rxq, rx_req);
                 return;
         }
 
@@ -74,30 +99,13 @@ static void mana_hwc_handle_resp(struct hw_channel_context *hwc, u32 resp_len,
         memcpy(ctx->output_buf, resp_msg, resp_len);
 out:
         ctx->error = err;
-        complete(&ctx->comp_event);
-}
-
-static int mana_hwc_post_rx_wqe(const struct hwc_wq *hwc_rxq,
-                                struct hwc_work_request *req)
-{
-        struct device *dev = hwc_rxq->hwc->dev;
-        struct gdma_sge *sge;
-        int err;
-
-        sge = &req->sge;
-        sge->address = (u64)req->buf_sge_addr;
-        sge->mem_key = hwc_rxq->msg_buf->gpa_mkey;
-        sge->size = req->buf_len;
 
-        memset(&req->wqe_req, 0, sizeof(struct gdma_wqe_request));
-        req->wqe_req.sgl = sge;
-        req->wqe_req.num_sge = 1;
-        req->wqe_req.client_data_unit = 0;
+        /* Must post rx wqe before complete(), otherwise the next rx may
+         * hit no_wqe error.
+         */
+        mana_hwc_post_rx_wqe(hwc->rxq, rx_req);
 
-        err = mana_gd_post_and_ring(hwc_rxq->gdma_wq, &req->wqe_req, NULL);
-        if (err)
-                dev_err(dev, "Failed to post WQE on HWC RQ: %d\n", err);
-        return err;
+        complete(&ctx->comp_event);
 }
 
 static void mana_hwc_init_event_handler(void *ctx, struct gdma_queue *q_self,
@@ -234,14 +242,12 @@ static void mana_hwc_rx_event_handler(void *ctx, u32 gdma_rxq_id,
                 return;
         }
 
-        mana_hwc_handle_resp(hwc, rx_oob->tx_oob_data_size, resp);
+        mana_hwc_handle_resp(hwc, rx_oob->tx_oob_data_size, rx_req);
 
-        /* Do no longer use 'resp', because the buffer is posted to the HW
-         * in the below mana_hwc_post_rx_wqe().
+        /* Can no longer use 'resp', because the buffer is posted to the HW
+         * in mana_hwc_handle_resp() above.
          */
         resp = NULL;
-
-        mana_hwc_post_rx_wqe(hwc_rxq, rx_req);
 }
 
 static void mana_hwc_tx_event_handler(void *ctx, u32 gdma_txq_id,
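
For reference, the ordering rule enforced by the mana_hwc_handle_resp() hunk above
can be summarized in the simplified C sketch below. It is an illustration of the
idea only, not code from the driver: handle_resp_sketch() is a made-up name and the
surrounding body is elided, while mana_hwc_post_rx_wqe(), complete(), hwc->rxq,
rx_req and ctx->comp_event are taken from the patch.

/* Illustrative sketch (not the actual driver code): why the rx wqe must be
 * re-posted before the waiter is woken.
 */
static void handle_resp_sketch(struct hw_channel_context *hwc,
                               struct hwc_work_request *rx_req,
                               struct hwc_caller_ctx *ctx)
{
        /* ... the response has already been copied out of rx_req->buf_va ... */

        /* Re-post the receive WQE first, so the HW channel always has a
         * buffer ready for the next response.
         */
        mana_hwc_post_rx_wqe(hwc->rxq, rx_req);

        /* Only then wake the waiter. The waiter may immediately issue the
         * next mana_hwc_send_request(), and the HW may answer at once; with
         * the old order (complete() before re-posting), that answer could
         * arrive while the RQ had no WQE posted and the response was lost.
         */
        complete(&ctx->comp_event);
}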