From patchwork Thu Jan 10 10:29:05 2019
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Saeed Mahameed
X-Patchwork-Id: 1022847
X-Patchwork-Delegate: davem@davemloft.net
Return-Path:
X-Original-To: patchwork-incoming-netdev@ozlabs.org
Delivered-To: patchwork-incoming-netdev@ozlabs.org
Authentication-Results: ozlabs.org; spf=none (mailfrom)
 smtp.mailfrom=vger.kernel.org (client-ip=209.132.180.67;
 helo=vger.kernel.org; envelope-from=netdev-owner@vger.kernel.org;
 receiver=)
Authentication-Results: ozlabs.org; dmarc=fail (p=none dis=none)
 header.from=mellanox.com
Received: from vger.kernel.org (vger.kernel.org [209.132.180.67])
 by ozlabs.org (Postfix) with ESMTP id 43b2Ld3mdxz9sMQ
 for ; Thu, 10 Jan 2019 21:30:01 +1100 (AEDT)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
 id S1728208AbfAJKaA (ORCPT );
 Thu, 10 Jan 2019 05:30:00 -0500
Received: from mail-il-dmz.mellanox.com ([193.47.165.129]:41035 "EHLO
 mellanox.co.il" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP
 id S1726255AbfAJKaA (ORCPT );
 Thu, 10 Jan 2019 05:30:00 -0500
Received: from Internal Mail-Server by MTLPINE1 (envelope-from
 saeedm@mellanox.com) with ESMTPS (AES256-SHA encrypted);
 10 Jan 2019 12:29:56 +0200
Received: from sx1.mtl.com ([172.16.5.7])
 by labmailer.mlnx (8.13.8/8.13.8) with ESMTP id x0AATDGN031830;
 Thu, 10 Jan 2019 12:29:53 +0200
From: Saeed Mahameed
To: "David S. Miller"
Cc: netdev@vger.kernel.org, Feras Daoud, Saeed Mahameed
Subject: [net-next 9/9] net/mlx5: Protect against infinite recovery requests
Date: Thu, 10 Jan 2019 12:29:05 +0200
Message-Id: <20190110102906.3751-10-saeedm@mellanox.com>
X-Mailer: git-send-email 2.20.1
In-Reply-To: <20190110102906.3751-1-saeedm@mellanox.com>
References: <20190110102906.3751-1-saeedm@mellanox.com>
MIME-Version: 1.0
Sender: netdev-owner@vger.kernel.org
Precedence: bulk
List-ID:
X-Mailing-List: netdev@vger.kernel.org

From: Feras Daoud

Buggy HW may cause an infinite loop of recovery requests that can
bring down the driver. Protect against this by adding a timestamp
variable that records the time of the last recovery request, and
allow recovery only if more than 20 minutes have passed since the
previous request.

Signed-off-by: Feras Daoud
Signed-off-by: Saeed Mahameed
---
 .../net/ethernet/mellanox/mlx5/core/health.c  | 24 +++++++++++++++++++
 include/linux/mlx5/driver.h                   |  1 +
 2 files changed, 25 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/health.c b/drivers/net/ethernet/mellanox/mlx5/core/health.c
index 74de30246eee..b43070e4f519 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/health.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/health.c
@@ -191,9 +191,26 @@ static bool reset_fw_if_needed(struct mlx5_core_dev *dev)
 
 #define MLX5_CRDUMP_WAIT_MS	60000
 #define MLX5_FW_RESET_WAIT_MS	1000
+#define MLX5_RECOVERY_TIMEOUT_MS	1200000
+
+static bool mlx5_health_allow_recover(struct mlx5_core_health *health)
+{
+	bool ret = false;
+
+	ret = health->last_recover_tstamp ?
+		time_after(jiffies, health->last_recover_tstamp +
+			   msecs_to_jiffies(MLX5_RECOVERY_TIMEOUT_MS)) :
+		true;
+
+	health->last_recover_tstamp = jiffies;
+
+	return ret;
+}
+
 void mlx5_enter_error_state(struct mlx5_core_dev *dev, bool force)
 {
 	unsigned long end, delay_ms = MLX5_FW_RESET_WAIT_MS;
+	struct mlx5_core_health *health = &dev->priv.health;
 	u32 fatal_error, err;
 	int lock = -EBUSY;
 
@@ -212,6 +229,12 @@ void mlx5_enter_error_state(struct mlx5_core_dev *dev, bool force)
 
 	fatal_error = check_fatal_sensors(dev);
 
+	if (fatal_error == MLX5_SENSOR_FW_SYND_RFR &&
+	    !mlx5_health_allow_recover(health)) {
+		mlx5_core_warn_once(dev, "Device recovery ignored\n");
+		goto err_state_done;
+	}
+
 	if (fatal_error || force) {
 		dev->state = MLX5_DEVICE_STATE_INTERNAL_ERROR;
 		mlx5_cmd_trigger_completions(dev);
@@ -560,6 +583,7 @@ int mlx5_health_init(struct mlx5_core_dev *dev)
 	INIT_WORK(&health->work, health_care);
 	INIT_DELAYED_WORK(&health->recover_work, health_recover);
 	health->crdump = NULL;
+	health->last_recover_tstamp = 0;
 
 	return 0;
 }
diff --git a/include/linux/mlx5/driver.h b/include/linux/mlx5/driver.h
index 2ea6732c1d4d..06ab2647f790 100644
--- a/include/linux/mlx5/driver.h
+++ b/include/linux/mlx5/driver.h
@@ -442,6 +442,7 @@ struct mlx5_core_health {
 	struct work_struct	work;
 	struct delayed_work	recover_work;
 	struct mlx5_fw_crdump	*crdump;
+	unsigned long		last_recover_tstamp;
 };
 
 struct mlx5_qp_table {
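
For readers who want to try the gating logic outside the kernel, the
following is a minimal userspace sketch of what mlx5_health_allow_recover()
does. It is not part of the patch. The names tick_t, health_model,
tick_after() and allow_recover() are hypothetical stand-ins, the caller
passes the current tick explicitly instead of reading jiffies, and the
timeout assumes 1000 ticks per second so that 1200000 ticks correspond to
the 20 minute window.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the kernel's jiffies counter type. */
typedef unsigned long tick_t;

/* 20 minutes, assuming 1000 ticks per second (an assumption of this sketch). */
#define RECOVERY_TIMEOUT_TICKS 1200000UL

/* Hypothetical model of the one field the patch adds to mlx5_core_health. */
struct health_model {
	tick_t last_recover_tstamp;	/* 0 means no request has been seen yet */
};

/*
 * Wrap-safe "a is later than b", using the same signed-difference idea as
 * the kernel's time_after(). Relies on the usual two's complement
 * conversion from unsigned long to long.
 */
static bool tick_after(tick_t a, tick_t b)
{
	return (long)(b - a) < 0;
}

/*
 * Mirrors the patch: the first request is always allowed, later requests
 * only if the timeout has elapsed. The timestamp is refreshed on every
 * request, including the ones that are rejected.
 */
static bool allow_recover(struct health_model *h, tick_t now)
{
	bool ret = h->last_recover_tstamp ?
		tick_after(now, h->last_recover_tstamp + RECOVERY_TIMEOUT_TICKS) :
		true;

	h->last_recover_tstamp = now;

	return ret;
}
```

Because the timestamp is updated even when a request is rejected, a device
that keeps asking for recovery more often than once per window is never
granted one; recovery is allowed again only after a gap of more than
20 minutes between two consecutive requests, which matches the commit
message.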