Message ID | 163767273634.1368569.7327743414665595326.stgit@jupiter (mailing list archive)
---|---
State | Changes Requested
Series | powerpc/eeh: Delay slot presence check once driver is notified about the pci error.
Context | Check | Description
---|---|---
snowpatch_ozlabs/github-powerpc_ppctests | success | Successfully ran 8 jobs.
snowpatch_ozlabs/github-powerpc_selftests | success | Successfully ran 8 jobs.
snowpatch_ozlabs/github-powerpc_sparse | success | Successfully ran 4 jobs.
snowpatch_ozlabs/github-powerpc_kernel_qemu | success | Successfully ran 24 jobs.
snowpatch_ozlabs/github-powerpc_clang | success | Successfully ran 7 jobs.
Mahesh Salgaonkar <mahesh@linux.ibm.com> writes:
> When certain PHB HW failure causes phyp to recover the PHB, it marks the PE
> state as temporarily unavailable until recovery is complete. This also
> triggers an EEH handler in Linux which needs to notify drivers and perform
> recovery. But before notifying the driver about the pci error, it uses the
> get_adapter_status() -> get-sensor-state() operation of the hotplug_slot to
> determine if the slot contains a device or not. If the slot is empty, the
> recovery is skipped entirely.
>
> However, on certain PHB failures, the rtas call get-sensor-state() returns
> an extended busy error (9902) until the PHB is recovered by phyp. Once the
> PHB is recovered, get-sensor-state() returns success with the correct
> presence status. The rtas call interface rtas_get_sensor() loops over the
> rtas call on the extended delay return code (9902) until the return value
> is either success (0) or error (-1). This causes the EEH handler to get
> stuck for ~6 seconds before it can notify that the pci error has been
> detected and stop any active operations. Hence, with running I/O traffic,
> during these 6 seconds the network driver continues its operation and hits
> a timeout (netdev watchdog). On timeouts, the network driver goes into ffdc
> capture mode and its reset path, assuming the PCI device is in a fatal
> condition. This causes EEH recovery to fail and sometimes leads to a
> system hang or crash.
>
> ------------
> [52732.244731] DEBUG: ibm_read_slot_reset_state2()
> [52732.244762] DEBUG: ret = 0, rets[0]=5, rets[1]=1, rets[2]=4000, rets[3]=0x0
> [52732.244798] DEBUG: in eeh_slot_presence_check
> [52732.244804] DEBUG: error state check
> [52732.244807] DEBUG: Is slot hotpluggable
> [52732.244810] DEBUG: hotpluggable ops ?
> [52732.244953] DEBUG: Calling ops->get_adapter_status
> [52732.244958] DEBUG: calling rpaphp_get_sensor_state
> [52736.564262] ------------[ cut here ]------------
> [52736.564299] NETDEV WATCHDOG: enP64p1s0f3 (tg3): transmit queue 0 timed out
> [52736.564324] WARNING: CPU: 1442 PID: 0 at net/sched/sch_generic.c:478 dev_watchdog+0x438/0x440
> [...]
> [52736.564505] NIP [c000000000c32368] dev_watchdog+0x438/0x440
> [52736.564513] LR [c000000000c32364] dev_watchdog+0x434/0x440
> ------------
>
> To fix this issue, delay the slot presence check until after notifying the
> driver about the pci error.

How does this interact with the commit that put the slot presence check
there in the first place:

b104af5a7687 ("powerpc/eeh: Check slot presence state in eeh_handle_normal_event()")

It seems like delaying the slot presence check will effectively revert
that commit?

cheers
On 2021-11-24 10:14:30 Wed, Michael Ellerman wrote:
> Mahesh Salgaonkar <mahesh@linux.ibm.com> writes:
> > When certain PHB HW failure causes phyp to recover PHB, it marks the PE
> > state as temporarily unavailable until recovery is complete.
> >
> > [...]
> >
> > To fix this issue, delay the slot presence check after notifying the driver
> > about the pci error.
>
> How does this interact with the commit that put the slot presence check
> there in the first place:
>
> b104af5a7687 ("powerpc/eeh: Check slot presence state in eeh_handle_normal_event()")
>
> It seems like delaying the slot presence check will effectively revert
> that commit?

No it doesn't. We will still do a presence check before the recovery
process starts. This patch moves the check to after notifying the driver
to stop active I/O operations. If the presence check finds the device
isn't present, we will skip the EEH recovery. However, on a surprise
hotplug, the user will see the EEH messages on the console before it
finds there is nothing to recover.
Current EEH behaviour:

  EEH event
  -> eeh_handle_normal_event
       /* Check for adapter status */
       eeh_slot_presence_check()
       if (!present)
               bail out early

       /* Report the error */
       eeh_report_error()   <- notify driver about error
         driver->err_handler->error_detected()
         /* Any active I/O will be stopped now */

       /* Start the recovery process */
       eeh_reset_device()
       eeh_report_resume()
       /* Recovery done */

With this patch:

  EEH event
  -> eeh_handle_normal_event
       /* Report the error */
       eeh_report_error()   <- notify driver about error
         driver->err_handler->error_detected()
         /* Any active I/O will be stopped now */

       /* Check for adapter status */
       eeh_slot_presence_check()
       if (!present)
               bail out early

       /* Start the recovery process */
       eeh_reset_device()
       eeh_report_resume()
       /* Recovery done */

Thanks,
-Mahesh.
On Wed, Nov 24, 2021 at 7:45 PM Mahesh J Salgaonkar
<mahesh@linux.ibm.com> wrote:
>
> No it doesn't. We will still do a presence check before the recovery
> process starts. This patch moves the check after notifying the driver to
> stop active I/O operations. If a presence check finds the device isn't
> present, we will skip the EEH recovery. However, on a surprise hotplug,
> the user will see the EEH messages on the console before it finds there
> is nothing to recover.

Suppressing the spurious EEH messages was part of why I added that check
in the first place. If you want to defer the presence check until later
you should move the stack trace printing, etc to after we've confirmed
there are still devices present. Considering the motivation for this
patch is to avoid spurious warnings from the driver I don't think
printing spurious EEH messages is much of an improvement.

The other option would be returning an error from the pseries hotplug
driver. IIRC that's what pnv_php / OPAL does if the PHB is fenced and we
can't check the slot presence state.
On Wed, Nov 24, 2021 at 12:05 AM Mahesh Salgaonkar <mahesh@linux.ibm.com> wrote:
>
> *snip*
>
> This causes the EEH handler to get stuck for ~6
> seconds before it could notify that the pci error has been detected and
> stop any active operations. Hence with running I/O traffic, during this 6
> seconds, the network driver continues its operation and hits a timeout
> (netdev watchdog). On timeouts, network driver go into ffdc capture mode
> and reset path assuming the PCI device is in fatal condition. This causes
> EEH recovery to fail and sometimes it leads to system hang or crash.

Whatever is causing that crash is the real issue IMO. PCI error
reporting is fundamentally asynchronous and the driver always has to
tolerate some amount of latency between the error occurring and being
reported. Six seconds is admittedly an eternity, but it should not cause
a system crash under any circumstances. Printing a warning due to a
timeout is annoying, but it's not the end of the world.
On 2021-11-24 22:57:13 Wed, Oliver O'Halloran wrote:
> On Wed, Nov 24, 2021 at 7:45 PM Mahesh J Salgaonkar
> <mahesh@linux.ibm.com> wrote:
> >
> > No it doesn't. We will still do a presence check before the recovery
> > process starts. [...]
>
> Suppressing the spurious EEH messages was part of why I added that
> check in the first place. If you want to defer the presence check
> until later you should move the stack trace printing, etc to after
> we've confirmed there are still devices present.

That will help suppress the spurious EEH messages.

> Considering the motivation for this patch is to avoid spurious
> warnings from the driver I don't think printing spurious EEH messages
> is much of an improvement.

Agree.

> The other option would be returning an error from the pseries hotplug
> driver. IIRC that's what pnv_php / OPAL does if the PHB is fenced and
> we can't check the slot presence state.

Yeah. I can change rpaphp_get_sensor_state() to use the
rtas_get_sensor_fast() variant, which will return immediately with an
error on extended busy. That way we don't need to move the slot presence
check at all. I did test that and it does fix the problem. But I wasn't
sure if that would have any implications on hotplug driver behaviour. If
pnv_php / OPAL does the same thing then this would be a cleaner approach
to fix this issue.

Let me send out the patch with this other option to fix the pseries
hotplug driver instead.

Thanks,
-Mahesh.
On 2021-11-24 23:01:45 Wed, Oliver O'Halloran wrote:
> On Wed, Nov 24, 2021 at 12:05 AM Mahesh Salgaonkar <mahesh@linux.ibm.com> wrote:
> >
> > *snip*
> >
> > This causes the EEH handler to get stuck for ~6
> > seconds before it could notify that the pci error has been detected and
> > stop any active operations. [...]
>
> Whatever is causing that crash is the real issue IMO. PCI error

I have seen the crash only once, but that was triggered by the HTX tool
and may not be related. However, the major concern here is the EEH
failure. I will correct the above statement in my next patch.

> reporting is fundamentally asynchronous and the driver always has to
> tolerate some amount of latency between the error occurring and being
> reported. Six seconds is admittedly an eternity, but it should not
> cause a system crash under any circumstances. Printing a warning due
> to a timeout is annoying, but it's not the end of the world.

Yeah, but due to the timeout the driver sometimes gets into a situation
where, when EEH recovery kicks in, it is unable to recover the device.
Thus EEH recovery fails and disconnects the pci device even when it
could have been recovered. To recover, we need to either reboot the lpar
or re-assign the I/O adapter from the HMC to get it back into working
condition.
[16532.212197] EEH: PCI-E AER 30: 00000000 00000000
[16532.213207] EEH: Reset without hotplug activity
[16534.229469] bnx2x: [bnx2x_clean_tx_queue:1203(enP22p1s0f1)]timeout waiting for queue[2]: txdata->tx_pkt_prod(37003) != txdata->tx_pkt_cons(36996)
[16534.385484] EEH: Beginning: 'slot_reset'
[16534.385489] PCI 0016:01:00.0#10000: EEH: Invoking bnx2x->slot_reset()
[16536.229469] bnx2x: [bnx2x_clean_tx_queue:1203(enP22p1s0f1)]timeout waiting for queue[4]: txdata->tx_pkt_prod(64894) != txdata->tx_pkt_cons(64891)
[...]
[16623.571502] bnx2x: [bnx2x_nic_load_request:2342(enP22p1s0f1)]MCP response failure, aborting
[16623.571507] bnx2x: [bnx2x_acquire_hw_lock:2019(enP22p1s0f1)]lock_status 0xffffffff resource_bit 0x800
[16623.571881] bnx2x: [bnx2x_io_slot_reset:14359(enP22p1s0f0)]IO slot reset initializing...
[16623.571976] bnx2x 0016:01:00.0: enabling device (0140 -> 0142)
[16623.576169] bnx2x: [bnx2x_io_slot_reset:14375(enP22p1s0f0)]IO slot reset --> driver unload
[16623.576174] PCI 0016:01:00.0#10000: EEH: bnx2x driver reports: 'disconnect'
[16623.576177] PCI 0016:01:00.1#10000: EEH: Invoking bnx2x->slot_reset()
[16623.576179] bnx2x: [bnx2x_io_slot_reset:14359(enP22p1s0f1)]IO slot reset initializing...
[16623.576239] bnx2x 0016:01:00.1: enabling device (0140 -> 0142)
[16623.580241] bnx2x: [bnx2x_io_slot_reset:14375(enP22p1s0f1)]IO slot reset --> driver unload
[16623.580245] PCI 0016:01:00.1#10000: EEH: bnx2x driver reports: 'disconnect'
[16623.580246] EEH: Finished:'slot_reset' with aggregate recovery state:'disconnect'
[16623.580250] EEH: Unable to recover from failure from PHB#16-PE#10000.

Thanks,
-Mahesh.
diff --git a/arch/powerpc/kernel/eeh_driver.c b/arch/powerpc/kernel/eeh_driver.c
index 350dab18e1373..a4a80451d50f7 100644
--- a/arch/powerpc/kernel/eeh_driver.c
+++ b/arch/powerpc/kernel/eeh_driver.c
@@ -851,26 +851,6 @@ void eeh_handle_normal_event(struct eeh_pe *pe)
 		return;
 	}
 
-	/*
-	 * When devices are hot-removed we might get an EEH due to
-	 * a driver attempting to touch the MMIO space of a removed
-	 * device. In this case we don't have a device to recover
-	 * so suppress the event if we can't find any present devices.
-	 *
-	 * The hotplug driver should take care of tearing down the
-	 * device itself.
-	 */
-	eeh_for_each_pe(pe, tmp_pe)
-		eeh_pe_for_each_dev(tmp_pe, edev, tmp)
-			if (eeh_slot_presence_check(edev->pdev))
-				devices++;
-
-	if (!devices) {
-		pr_debug("EEH: Frozen PHB#%x-PE#%x is empty!\n",
-			 pe->phb->global_number, pe->addr);
-		goto out; /* nothing to recover */
-	}
-
 	/* Log the event */
 	if (pe->type & EEH_PE_PHB) {
 		pr_err("EEH: Recovering PHB#%x, location: %s\n",
@@ -942,6 +922,28 @@ void eeh_handle_normal_event(struct eeh_pe *pe)
 		result = PCI_ERS_RESULT_NEED_RESET;
 	}
 
+	/*
+	 * When devices are hot-removed we might get an EEH due to
+	 * a driver attempting to touch the MMIO space of a removed
+	 * device. In this case we don't have a device to recover,
+	 * bail out early as there is nothing to recover.
+	 *
+	 * The hotplug driver should take care of tearing down the
+	 * device itself.
+	 */
+	eeh_for_each_pe(pe, tmp_pe)
+		eeh_pe_for_each_dev(tmp_pe, edev, tmp)
+			if (eeh_slot_presence_check(edev->pdev))
+				devices++;
+
+	if (!devices) {
+		pr_info("EEH: Frozen PHB#%x-PE#%x is empty!\n",
+			pe->phb->global_number, pe->addr);
+		pr_info("EEH: Surprise hotplug remove. nothing to recover.\n");
+		eeh_set_channel_state(pe, pci_channel_io_perm_failure);
+		goto out; /* nothing to recover */
+	}
+
 	/* Get the current PCI slot state. This can take a long time,
 	 * sometimes over 300 seconds for certain systems.
 	 */
When certain PHB HW failure causes phyp to recover the PHB, it marks the PE
state as temporarily unavailable until recovery is complete. This also
triggers an EEH handler in Linux which needs to notify drivers and perform
recovery. But before notifying the driver about the pci error, it uses the
get_adapter_status() -> get-sensor-state() operation of the hotplug_slot to
determine if the slot contains a device or not. If the slot is empty, the
recovery is skipped entirely.

However, on certain PHB failures, the rtas call get-sensor-state() returns
an extended busy error (9902) until the PHB is recovered by phyp. Once the
PHB is recovered, get-sensor-state() returns success with the correct
presence status. The rtas call interface rtas_get_sensor() loops over the
rtas call on the extended delay return code (9902) until the return value
is either success (0) or error (-1). This causes the EEH handler to get
stuck for ~6 seconds before it can notify that the pci error has been
detected and stop any active operations. Hence, with running I/O traffic,
during these 6 seconds the network driver continues its operation and hits
a timeout (netdev watchdog). On timeouts, the network driver goes into ffdc
capture mode and its reset path, assuming the PCI device is in a fatal
condition. This causes EEH recovery to fail and sometimes leads to a system
hang or crash.

------------
[52732.244731] DEBUG: ibm_read_slot_reset_state2()
[52732.244762] DEBUG: ret = 0, rets[0]=5, rets[1]=1, rets[2]=4000, rets[3]=0x0
[52732.244798] DEBUG: in eeh_slot_presence_check
[52732.244804] DEBUG: error state check
[52732.244807] DEBUG: Is slot hotpluggable
[52732.244810] DEBUG: hotpluggable ops ?
[52732.244953] DEBUG: Calling ops->get_adapter_status
[52732.244958] DEBUG: calling rpaphp_get_sensor_state
[52736.564262] ------------[ cut here ]------------
[52736.564299] NETDEV WATCHDOG: enP64p1s0f3 (tg3): transmit queue 0 timed out
[52736.564324] WARNING: CPU: 1442 PID: 0 at net/sched/sch_generic.c:478 dev_watchdog+0x438/0x440
[...]
[52736.564505] NIP [c000000000c32368] dev_watchdog+0x438/0x440
[52736.564513] LR [c000000000c32364] dev_watchdog+0x434/0x440
------------

To fix this issue, delay the slot presence check until after notifying the
driver about the pci error.

Signed-off-by: Mahesh Salgaonkar <mahesh@linux.ibm.com>
---
 arch/powerpc/kernel/eeh_driver.c | 42 ++++++++++++++++++++------------------
 1 file changed, 22 insertions(+), 20 deletions(-)