Message ID | 1599360937-26197-3-git-send-email-michael.chan@broadcom.com |
---|---|
State | Accepted |
Delegated to: | David Miller |
Headers | show |
Series | [net,1/2] bnxt_en: Avoid sending firmware messages when AER error is detected. | expand |
On Sat, 5 Sep 2020 22:55:37 -0400 Michael Chan wrote: > From: Vasundhara Volam <vasundhara-v.volam@broadcom.com> > > bnxt_fw_reset_task() which runs from a workqueue can race with > bnxt_remove_one(). For example, if firmware reset and VF FLR are > happening at about the same time. > > bnxt_remove_one() already cancels the workqueue and waits for it > to finish, but we need to do this earlier before the devlink > reporters are destroyed. This will guarantee that > the devlink reporters will always be valid when bnxt_fw_reset_task() > is still running. > > Fixes: b148bb238c02 ("bnxt_en: Fix possible crash in bnxt_fw_reset_task().") > Reviewed-by: Edwin Peer <edwin.peer@broadcom.com> > Signed-off-by: Vasundhara Volam <vasundhara-v.volam@broadcom.com> > Signed-off-by: Michael Chan <michael.chan@broadcom.com> > --- > drivers/net/ethernet/broadcom/bnxt/bnxt.c | 7 ++++--- > 1 file changed, 4 insertions(+), 3 deletions(-) > > diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c > index 619eb55..8eb73fe 100644 > --- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c > +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c > @@ -11779,6 +11779,10 @@ static void bnxt_remove_one(struct pci_dev *pdev) > if (BNXT_PF(bp)) > bnxt_sriov_disable(bp); > > + clear_bit(BNXT_STATE_IN_FW_RESET, &bp->state); > + bnxt_cancel_sp_work(bp); > + bp->sp_event = 0; > + > bnxt_dl_fw_reporters_destroy(bp, true); > if (BNXT_PF(bp)) > devlink_port_type_clear(&bp->dl_port); > @@ -11786,9 +11790,6 @@ static void bnxt_remove_one(struct pci_dev *pdev) > unregister_netdev(dev); > bnxt_dl_unregister(bp); > bnxt_shutdown_tc(bp); > - clear_bit(BNXT_STATE_IN_FW_RESET, &bp->state); > - bnxt_cancel_sp_work(bp); > - bp->sp_event = 0; > > bnxt_clear_int_mode(bp); > bnxt_hwrm_func_drv_unrgtr(bp); devlink can itself scheduler a recovery via: bnxt_fw_fatal_recover() -> bnxt_fw_reset() no? Maybe don't make the devlink recovery path need to go via a workqueue?
On Sun, Sep 6, 2020 at 12:25 PM Jakub Kicinski <kuba@kernel.org> wrote: > > devlink can itself scheduler a recovery via: > > bnxt_fw_fatal_recover() -> bnxt_fw_reset() > Yes, this is how it is initiated when we call devlink_health_report() to report the error condition. From bnxt_fw_reset(), we use a workqueue because we have to go through many states, requiring sleeping/polling to transition through the states. > no? Maybe don't make the devlink recovery path need to go via a > workqueue? Current implementation is going through a work queue.
On Sun, 6 Sep 2020 15:07:02 -0700 Michael Chan wrote: > On Sun, Sep 6, 2020 at 12:25 PM Jakub Kicinski <kuba@kernel.org> wrote: > > > > devlink can itself scheduler a recovery via: > > > > bnxt_fw_fatal_recover() -> bnxt_fw_reset() > > > > Yes, this is how it is initiated when we call devlink_health_report() > to report the error condition. From bnxt_fw_reset(), we use a > workqueue because we have to go through many states, requiring > sleeping/polling to transition through the states. > > > no? Maybe don't make the devlink recovery path need to go via a > > workqueue? > > Current implementation is going through a work queue. What I'm saying is the code looks like this after this patch: + clear_bit(BNXT_STATE_IN_FW_RESET, &bp->state); + bnxt_cancel_sp_work(bp); + bp->sp_event = 0; + bnxt_dl_fw_reporters_destroy(bp, true); It cancels the work, _then_ destroys the reporter. But I think the reported can be used to schedule a recovery from command line. So the work may get re-scheduled after it has been canceled. devlink_nl_cmd_health_reporter_recover_doit() -> bnxt_fw_fatal_recover() -> bnxt_fw_reset() -> bnxt_queue_fw_reset_work() What am I missing?
On Sun, Sep 6, 2020 at 8:13 PM Jakub Kicinski <kuba@kernel.org> wrote: > > On Sun, 6 Sep 2020 15:07:02 -0700 Michael Chan wrote: > > On Sun, Sep 6, 2020 at 12:25 PM Jakub Kicinski <kuba@kernel.org> wrote: > > > > > > devlink can itself scheduler a recovery via: > > > > > > bnxt_fw_fatal_recover() -> bnxt_fw_reset() > > > > > > > Yes, this is how it is initiated when we call devlink_health_report() > > to report the error condition. From bnxt_fw_reset(), we use a > > workqueue because we have to go through many states, requiring > > sleeping/polling to transition through the states. > > > > > no? Maybe don't make the devlink recovery path need to go via a > > > workqueue? > > > > Current implementation is going through a work queue. > > What I'm saying is the code looks like this after this patch: > > + clear_bit(BNXT_STATE_IN_FW_RESET, &bp->state); > + bnxt_cancel_sp_work(bp); > + bp->sp_event = 0; > + > bnxt_dl_fw_reporters_destroy(bp, true); > > It cancels the work, _then_ destroys the reporter. But I think the > reported can be used to schedule a recovery from command line. So the > work may get re-scheduled after it has been canceled. bnxt_en does not support recovery from the command line. We return -EOPNOTSUPP when it comes from the command line. Recovery has to be triggered from a firmware reported error or a driver detected error.
On Sun, 6 Sep 2020 20:48:04 -0700 Michael Chan wrote: > On Sun, Sep 6, 2020 at 8:13 PM Jakub Kicinski <kuba@kernel.org> wrote: > > > > On Sun, 6 Sep 2020 15:07:02 -0700 Michael Chan wrote: > > > On Sun, Sep 6, 2020 at 12:25 PM Jakub Kicinski <kuba@kernel.org> wrote: > > > > > > > > devlink can itself scheduler a recovery via: > > > > > > > > bnxt_fw_fatal_recover() -> bnxt_fw_reset() > > > > > > > > > > Yes, this is how it is initiated when we call devlink_health_report() > > > to report the error condition. From bnxt_fw_reset(), we use a > > > workqueue because we have to go through many states, requiring > > > sleeping/polling to transition through the states. > > > > > > > no? Maybe don't make the devlink recovery path need to go via a > > > > workqueue? > > > > > > Current implementation is going through a work queue. > > > > What I'm saying is the code looks like this after this patch: > > > > + clear_bit(BNXT_STATE_IN_FW_RESET, &bp->state); > > + bnxt_cancel_sp_work(bp); > > + bp->sp_event = 0; > > + > > bnxt_dl_fw_reporters_destroy(bp, true); > > > > It cancels the work, _then_ destroys the reporter. But I think the > > reported can be used to schedule a recovery from command line. So the > > work may get re-scheduled after it has been canceled. > > bnxt_en does not support recovery from the command line. We return > -EOPNOTSUPP when it comes from the command line. > > Recovery has to be triggered from a firmware reported error or a > driver detected error. I see it now, thanks.
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c index 619eb55..8eb73fe 100644 --- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c @@ -11779,6 +11779,10 @@ static void bnxt_remove_one(struct pci_dev *pdev) if (BNXT_PF(bp)) bnxt_sriov_disable(bp); + clear_bit(BNXT_STATE_IN_FW_RESET, &bp->state); + bnxt_cancel_sp_work(bp); + bp->sp_event = 0; + bnxt_dl_fw_reporters_destroy(bp, true); if (BNXT_PF(bp)) devlink_port_type_clear(&bp->dl_port); @@ -11786,9 +11790,6 @@ static void bnxt_remove_one(struct pci_dev *pdev) unregister_netdev(dev); bnxt_dl_unregister(bp); bnxt_shutdown_tc(bp); - clear_bit(BNXT_STATE_IN_FW_RESET, &bp->state); - bnxt_cancel_sp_work(bp); - bp->sp_event = 0; bnxt_clear_int_mode(bp); bnxt_hwrm_func_drv_unrgtr(bp);