Message ID | 20230530174605.2772054-6-jacob.e.keller@intel.com |
---|---|
State | Changes Requested |
Headers | show |
Series | ice: Improve miscellaneous interrupt code | expand |
On Tue, May 30, 2023 at 7:46 PM Jacob Keller <jacob.e.keller@intel.com> wrote: > > At the end of ice_misc_intr() the driver calls ice_irq_dynamic_ena() to > re-enable the miscellaneous interrupt. This is done before the > ice_misc_intr_thread_fn can run and complete. > > According to the kernel function comment documentation for > request_threaded_irq(), the interrupt should remain disabled until the > thread function completes its task. > > By re-enabling the interrupt at the end of the hard IRQ, it is possible for > a new interrupt to trigger while the thread function is processing. This is > problematic for PTP Tx timestamps. > > For E822 devices, the hardware in the PHY keeps track of how many > outstanding timestamps are generated and how many timestamps are read from > the PHY. > > This counter is incremented once for each timestamp that is captured by > hardware, and decremented once each time a timestamp is read from the PHY. > The PHY will not generate a new interrupt unless this internal counter is > zero before the most recently captured timestamp. > > Because of this counter behavior, a race with the hard IRQ and threaded IRQ > function can result in the potential for the counter to get stuck such that > no new interrupts will be triggered until the device is reset. > > Consider the following flow: > > 1 -> Tx timestamp completes in hardware > 2 -> timestamp interrupt occurs > 3 -> ice_misc_intr() re-enables timestamp interrupt, and wakes the > thread_fn > 4 -> thread_fn is running and processing Tx timestamp > 5 -> the Tx timestamp is read from PHY, clearing the counter > 6 -> a new Tx timestamp completes in hardware, triggering interrupt > 7 -> the thread_fn hasn't exited and reported IRQ handled > 8 -> ice_misc_intr() triggers and sees PTP interrupt, so tries to wake thread > 9 -> thread_fn is already running (IRQTF_RUNTHREAD is set still!) so we > skip running the thread... There is a misunderstanding here. IRQTF_RUNTHREAD does not remain set for the execution of thread_fn. The IRQ thread clears the flag *before* it starts executing your thread_fn. See kernel/irq/manage.c. The IRQ thread's main loop is in irq_thread(). Every iteration of the loop starts with irq_wait_for_interrupt(). It uses "test_and_clear_bit(IRQTF_RUNTHREAD, ...)" to check if there's work to do. So it's not possible for step 9 to forget the interrupt like that. If thread_fn is still executing, the setting of IRQTF_RUNTHREAD by the hard interrupt handler will tell the IRQ thread to go through the loop again. Michal > 10 -> an outstanding timestamp is remaining but we never read it > 11 -> interrupt never triggers again > > The fix for this complicated race condition is simple: do not re-enable the > miscellaneous interrupt until *after* the thread function completes. If a > new timestamp event triggers while the interrupt is disabled, it will be > remembered and should cause the interrupt to trigger again immediately > after we re-enable the interrupt. > > Fixes: 1229b33973c7 ("ice: Add low latency Tx timestamp read") > Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> > --- > drivers/net/ethernet/intel/ice/ice_main.c | 9 +++++---- > 1 file changed, 5 insertions(+), 4 deletions(-) > > diff --git a/drivers/net/ethernet/intel/ice/ice_main.c b/drivers/net/ethernet/intel/ice/ice_main.c > index 72e1b919b2d3..51fe3da0d54f 100644 > --- a/drivers/net/ethernet/intel/ice/ice_main.c > +++ b/drivers/net/ethernet/intel/ice/ice_main.c > @@ -3179,8 +3179,6 @@ static irqreturn_t ice_misc_intr(int __always_unused irq, void *data) > } > } > > - ice_irq_dynamic_ena(hw, NULL, NULL); > - > return IRQ_WAKE_THREAD; > } > > @@ -3192,6 +3190,9 @@ static irqreturn_t ice_misc_intr(int __always_unused irq, void *data) > static irqreturn_t ice_misc_intr_thread_fn(int __always_unused irq, void *data) > { > struct ice_pf *pf = data; > + struct ice_hw *hw; > + > + hw = &pf->hw; > > if (ice_is_reset_in_progress(pf->state)) > return IRQ_HANDLED; > @@ -3204,8 +3205,6 @@ static irqreturn_t ice_misc_intr_thread_fn(int __always_unused irq, void *data) > } > > if (pf->oicr_misc & PFINT_OICR_TSYN_TX_M) { > - struct ice_hw *hw = &pf->hw; > - > /* Process outstanding Tx timestamps. If there is more work, > * re-arm the interrupt to trigger again. > */ > @@ -3217,6 +3216,8 @@ static irqreturn_t ice_misc_intr_thread_fn(int __always_unused irq, void *data) > pf->oicr_misc &= ~PFINT_OICR_TSYN_TX_M; > } > > + ice_irq_dynamic_ena(hw, NULL, NULL); > + > return IRQ_HANDLED; > } > > -- > 2.40.0.471.gbd7f14d9353b >
On 5/31/2023 7:19 AM, Michal Schmidt wrote: > On Tue, May 30, 2023 at 7:46 PM Jacob Keller <jacob.e.keller@intel.com> wrote: >> >> At the end of ice_misc_intr() the driver calls ice_irq_dynamic_ena() to >> re-enable the miscellaneous interrupt. This is done before the >> ice_misc_intr_thread_fn can run and complete. >> >> According to the kernel function comment documentation for >> request_threaded_irq(), the interrupt should remain disabled until the >> thread function completes its task. >> >> By re-enabling the interrupt at the end of the hard IRQ, it is possible for >> a new interrupt to trigger while the thread function is processing. This is >> problematic for PTP Tx timestamps. >> >> For E822 devices, the hardware in the PHY keeps track of how many >> outstanding timestamps are generated and how many timestamps are read from >> the PHY. >> >> This counter is incremented once for each timestamp that is captured by >> hardware, and decremented once each time a timestamp is read from the PHY. >> The PHY will not generate a new interrupt unless this internal counter is >> zero before the most recently captured timestamp. >> >> Because of this counter behavior, a race with the hard IRQ and threaded IRQ >> function can result in the potential for the counter to get stuck such that >> no new interrupts will be triggered until the device is reset. >> >> Consider the following flow: >> >> 1 -> Tx timestamp completes in hardware >> 2 -> timestamp interrupt occurs >> 3 -> ice_misc_intr() re-enables timestamp interrupt, and wakes the >> thread_fn >> 4 -> thread_fn is running and processing Tx timestamp >> 5 -> the Tx timestamp is read from PHY, clearing the counter >> 6 -> a new Tx timestamp completes in hardware, triggering interrupt >> 7 -> the thread_fn hasn't exited and reported IRQ handled >> 8 -> ice_misc_intr() triggers and sees PTP interrupt, so tries to wake thread >> 9 -> thread_fn is already running (IRQTF_RUNTHREAD is set still!) so we >> skip running the thread... > > There is a misunderstanding here. IRQTF_RUNTHREAD does not remain set > for the execution of thread_fn. The IRQ thread clears the flag > *before* it starts executing your thread_fn. See kernel/irq/manage.c. > The IRQ thread's main loop is in irq_thread(). Every iteration of the > loop starts with irq_wait_for_interrupt(). It uses > "test_and_clear_bit(IRQTF_RUNTHREAD, ...)" to check if there's work to > do. > > So it's not possible for step 9 to forget the interrupt like that. If > thread_fn is still executing, the setting of IRQTF_RUNTHREAD by the > hard interrupt handler will tell the IRQ thread to go through the loop > again. > > Michal > Ok, so my original understanding was flawed, but I think I see the actual race that happened. I'll try to explain it here, and see if you agree with the outline, then I can update the commit message. The PHYs keep track of how many outstanding timestamps are in memory, and only generate an interrupt if the count of timestamps goes from below a threshold to above a threshold value (set by the driver to be 1, so that it will interrupt immediately on the first timestamp). As long as there are unread timestamps in the memory bank, it will not generate a new interrupt. Somehow, the device gets in a state where it failed to read the timestamps from the PHY memory *after* the interrupt occurred. When this happens, we no longer get a new interrupt, because the PHY sees that there are still unread timestamps. (Yes, I know, this is a poor design). Because of this, we stop processing all future timestamps. The actual race is possibly something like the following: same steps up from 1-7, then: 8 -> ice_misc_intr tiriggers and sees PTP interrupt, so it sets PFINT_OICR_TSYN_TX_M in pf->oicr_misc. 9 -> unfortunately, ice_misc_intr_thread_fn then *clears* the bit after exiting from its processing loop. 10 -> once thread_fn exits, the threaded IRQ logic re-runs the function. However, due to the way that we interact between the main and thread function, the second run sees that PFINT_OICR_TSYN_TX_M is unset, so it doesn't run the loop again. This patch fixes things by ensuring that the hardware won't even generate a hard IRQ interrupt until the thread_fn completes. We could also instead fix this by improving the communication between the handler function and the thread function by using atomic test_and_clear(), or by using an atomic counter. I can send a v3 with appropriate fixes and an updated commit message, once this description makes sense to you.
On Thu, Jun 1, 2023 at 8:46 PM Jacob Keller <jacob.e.keller@intel.com> wrote: > On 5/31/2023 7:19 AM, Michal Schmidt wrote: > > On Tue, May 30, 2023 at 7:46 PM Jacob Keller <jacob.e.keller@intel.com> wrote: > >> > >> At the end of ice_misc_intr() the driver calls ice_irq_dynamic_ena() to > >> re-enable the miscellaneous interrupt. This is done before the > >> ice_misc_intr_thread_fn can run and complete. > >> > >> According to the kernel function comment documentation for > >> request_threaded_irq(), the interrupt should remain disabled until the > >> thread function completes its task. > >> > >> By re-enabling the interrupt at the end of the hard IRQ, it is possible for > >> a new interrupt to trigger while the thread function is processing. This is > >> problematic for PTP Tx timestamps. > >> > >> For E822 devices, the hardware in the PHY keeps track of how many > >> outstanding timestamps are generated and how many timestamps are read from > >> the PHY. > >> > >> This counter is incremented once for each timestamp that is captured by > >> hardware, and decremented once each time a timestamp is read from the PHY. > >> The PHY will not generate a new interrupt unless this internal counter is > >> zero before the most recently captured timestamp. > >> > >> Because of this counter behavior, a race with the hard IRQ and threaded IRQ > >> function can result in the potential for the counter to get stuck such that > >> no new interrupts will be triggered until the device is reset. > >> > >> Consider the following flow: > >> > >> 1 -> Tx timestamp completes in hardware > >> 2 -> timestamp interrupt occurs > >> 3 -> ice_misc_intr() re-enables timestamp interrupt, and wakes the > >> thread_fn > >> 4 -> thread_fn is running and processing Tx timestamp > >> 5 -> the Tx timestamp is read from PHY, clearing the counter > >> 6 -> a new Tx timestamp completes in hardware, triggering interrupt > >> 7 -> the thread_fn hasn't exited and reported IRQ handled > >> 8 -> ice_misc_intr() triggers and sees PTP interrupt, so tries to wake thread > >> 9 -> thread_fn is already running (IRQTF_RUNTHREAD is set still!) so we > >> skip running the thread... > > > > There is a misunderstanding here. IRQTF_RUNTHREAD does not remain set > > for the execution of thread_fn. The IRQ thread clears the flag > > *before* it starts executing your thread_fn. See kernel/irq/manage.c. > > The IRQ thread's main loop is in irq_thread(). Every iteration of the > > loop starts with irq_wait_for_interrupt(). It uses > > "test_and_clear_bit(IRQTF_RUNTHREAD, ...)" to check if there's work to > > do. > > > > So it's not possible for step 9 to forget the interrupt like that. If > > thread_fn is still executing, the setting of IRQTF_RUNTHREAD by the > > hard interrupt handler will tell the IRQ thread to go through the loop > > again. > > > > Michal > > > > Ok, so my original understanding was flawed, but I think I see the > actual race that happened. > > I'll try to explain it here, and see if you agree with the outline, then > I can update the commit message. > > The PHYs keep track of how many outstanding timestamps are in memory, > and only generate an interrupt if the count of timestamps goes from > below a threshold to above a threshold value (set by the driver to be 1, > so that it will interrupt immediately on the first timestamp). > > As long as there are unread timestamps in the memory bank, it will not > generate a new interrupt. > > Somehow, the device gets in a state where it failed to read the > timestamps from the PHY memory *after* the interrupt occurred. When this > happens, we no longer get a new interrupt, because the PHY sees that > there are still unread timestamps. (Yes, I know, this is a poor design). > > Because of this, we stop processing all future timestamps. > > The actual race is possibly something like the following: > > same steps up from 1-7, then: > > 8 -> ice_misc_intr tiriggers and sees PTP interrupt, so it sets > PFINT_OICR_TSYN_TX_M in pf->oicr_misc. > 9 -> unfortunately, ice_misc_intr_thread_fn then *clears* the bit after > exiting from its processing loop. > 10 -> once thread_fn exits, the threaded IRQ logic re-runs the function. > However, due to the way that we interact between the main and thread > function, the second run sees that PFINT_OICR_TSYN_TX_M is unset, so it > doesn't run the loop again. > > This patch fixes things by ensuring that the hardware won't even > generate a hard IRQ interrupt until the thread_fn completes. We could > also instead fix this by improving the communication between the handler > function and the thread function by using atomic test_and_clear(), or by > using an atomic counter. > > I can send a v3 with appropriate fixes and an updated commit message, > once this description makes sense to you. OK, the new explanation seems plausible. Michal
On 6/1/2023 12:18 PM, Michal Schmidt wrote: > On Thu, Jun 1, 2023 at 8:46 PM Jacob Keller <jacob.e.keller@intel.com> wrote: >> On 5/31/2023 7:19 AM, Michal Schmidt wrote: >>> On Tue, May 30, 2023 at 7:46 PM Jacob Keller <jacob.e.keller@intel.com> wrote: >>>> >>>> At the end of ice_misc_intr() the driver calls ice_irq_dynamic_ena() to >>>> re-enable the miscellaneous interrupt. This is done before the >>>> ice_misc_intr_thread_fn can run and complete. >>>> >>>> According to the kernel function comment documentation for >>>> request_threaded_irq(), the interrupt should remain disabled until the >>>> thread function completes its task. >>>> >>>> By re-enabling the interrupt at the end of the hard IRQ, it is possible for >>>> a new interrupt to trigger while the thread function is processing. This is >>>> problematic for PTP Tx timestamps. >>>> >>>> For E822 devices, the hardware in the PHY keeps track of how many >>>> outstanding timestamps are generated and how many timestamps are read from >>>> the PHY. >>>> >>>> This counter is incremented once for each timestamp that is captured by >>>> hardware, and decremented once each time a timestamp is read from the PHY. >>>> The PHY will not generate a new interrupt unless this internal counter is >>>> zero before the most recently captured timestamp. >>>> >>>> Because of this counter behavior, a race with the hard IRQ and threaded IRQ >>>> function can result in the potential for the counter to get stuck such that >>>> no new interrupts will be triggered until the device is reset. >>>> >>>> Consider the following flow: >>>> >>>> 1 -> Tx timestamp completes in hardware >>>> 2 -> timestamp interrupt occurs >>>> 3 -> ice_misc_intr() re-enables timestamp interrupt, and wakes the >>>> thread_fn >>>> 4 -> thread_fn is running and processing Tx timestamp >>>> 5 -> the Tx timestamp is read from PHY, clearing the counter >>>> 6 -> a new Tx timestamp completes in hardware, triggering interrupt >>>> 7 -> the thread_fn hasn't exited and reported IRQ handled >>>> 8 -> ice_misc_intr() triggers and sees PTP interrupt, so tries to wake thread >>>> 9 -> thread_fn is already running (IRQTF_RUNTHREAD is set still!) so we >>>> skip running the thread... >>> >>> There is a misunderstanding here. IRQTF_RUNTHREAD does not remain set >>> for the execution of thread_fn. The IRQ thread clears the flag >>> *before* it starts executing your thread_fn. See kernel/irq/manage.c. >>> The IRQ thread's main loop is in irq_thread(). Every iteration of the >>> loop starts with irq_wait_for_interrupt(). It uses >>> "test_and_clear_bit(IRQTF_RUNTHREAD, ...)" to check if there's work to >>> do. >>> >>> So it's not possible for step 9 to forget the interrupt like that. If >>> thread_fn is still executing, the setting of IRQTF_RUNTHREAD by the >>> hard interrupt handler will tell the IRQ thread to go through the loop >>> again. >>> >>> Michal >>> >> >> Ok, so my original understanding was flawed, but I think I see the >> actual race that happened. >> >> I'll try to explain it here, and see if you agree with the outline, then >> I can update the commit message. >> >> The PHYs keep track of how many outstanding timestamps are in memory, >> and only generate an interrupt if the count of timestamps goes from >> below a threshold to above a threshold value (set by the driver to be 1, >> so that it will interrupt immediately on the first timestamp). >> >> As long as there are unread timestamps in the memory bank, it will not >> generate a new interrupt. >> >> Somehow, the device gets in a state where it failed to read the >> timestamps from the PHY memory *after* the interrupt occurred. When this >> happens, we no longer get a new interrupt, because the PHY sees that >> there are still unread timestamps. (Yes, I know, this is a poor design). >> >> Because of this, we stop processing all future timestamps. >> >> The actual race is possibly something like the following: >> >> same steps up from 1-7, then: >> >> 8 -> ice_misc_intr tiriggers and sees PTP interrupt, so it sets >> PFINT_OICR_TSYN_TX_M in pf->oicr_misc. >> 9 -> unfortunately, ice_misc_intr_thread_fn then *clears* the bit after >> exiting from its processing loop. >> 10 -> once thread_fn exits, the threaded IRQ logic re-runs the function. >> However, due to the way that we interact between the main and thread >> function, the second run sees that PFINT_OICR_TSYN_TX_M is unset, so it >> doesn't run the loop again. >> >> This patch fixes things by ensuring that the hardware won't even >> generate a hard IRQ interrupt until the thread_fn completes. We could >> also instead fix this by improving the communication between the handler >> function and the thread function by using atomic test_and_clear(), or by >> using an atomic counter. >> >> I can send a v3 with appropriate fixes and an updated commit message, >> once this description makes sense to you. > > OK, the new explanation seems plausible. > Michal > Really appreciate the careful review and helping me understand some of the areas I lacked sufficient knowledge in. I'll send a v3 with improved messages and I will also convert to using atomics instead of bare u32 values for passing data between the functions. I think thats better even if we leave the OICR disabled until the thread function exits.
diff --git a/drivers/net/ethernet/intel/ice/ice_main.c b/drivers/net/ethernet/intel/ice/ice_main.c index 72e1b919b2d3..51fe3da0d54f 100644 --- a/drivers/net/ethernet/intel/ice/ice_main.c +++ b/drivers/net/ethernet/intel/ice/ice_main.c @@ -3179,8 +3179,6 @@ static irqreturn_t ice_misc_intr(int __always_unused irq, void *data) } } - ice_irq_dynamic_ena(hw, NULL, NULL); - return IRQ_WAKE_THREAD; } @@ -3192,6 +3190,9 @@ static irqreturn_t ice_misc_intr(int __always_unused irq, void *data) static irqreturn_t ice_misc_intr_thread_fn(int __always_unused irq, void *data) { struct ice_pf *pf = data; + struct ice_hw *hw; + + hw = &pf->hw; if (ice_is_reset_in_progress(pf->state)) return IRQ_HANDLED; @@ -3204,8 +3205,6 @@ static irqreturn_t ice_misc_intr_thread_fn(int __always_unused irq, void *data) } if (pf->oicr_misc & PFINT_OICR_TSYN_TX_M) { - struct ice_hw *hw = &pf->hw; - /* Process outstanding Tx timestamps. If there is more work, * re-arm the interrupt to trigger again. */ @@ -3217,6 +3216,8 @@ static irqreturn_t ice_misc_intr_thread_fn(int __always_unused irq, void *data) pf->oicr_misc &= ~PFINT_OICR_TSYN_TX_M; } + ice_irq_dynamic_ena(hw, NULL, NULL); + return IRQ_HANDLED; }
At the end of ice_misc_intr() the driver calls ice_irq_dynamic_ena() to re-enable the miscellaneous interrupt. This is done before the ice_misc_intr_thread_fn can run and complete. According to the kernel function comment documentation for request_threaded_irq(), the interrupt should remain disabled until the thread function completes its task. By re-enabling the interrupt at the end of the hard IRQ, it is possible for a new interrupt to trigger while the thread function is processing. This is problematic for PTP Tx timestamps. For E822 devices, the hardware in the PHY keeps track of how many outstanding timestamps are generated and how many timestamps are read from the PHY. This counter is incremented once for each timestamp that is captured by hardware, and decremented once each time a timestamp is read from the PHY. The PHY will not generate a new interrupt unless this internal counter is zero before the most recently captured timestamp. Because of this counter behavior, a race with the hard IRQ and threaded IRQ function can result in the potential for the counter to get stuck such that no new interrupts will be triggered until the device is reset. Consider the following flow: 1 -> Tx timestamp completes in hardware 2 -> timestamp interrupt occurs 3 -> ice_misc_intr() re-enables timestamp interrupt, and wakes the thread_fn 4 -> thread_fn is running and processing Tx timestamp 5 -> the Tx timestamp is read from PHY, clearing the counter 6 -> a new Tx timestamp completes in hardware, triggering interrupt 7 -> the thread_fn hasn't exited and reported IRQ handled 8 -> ice_misc_intr() triggers and sees PTP interrupt, so tries to wake thread 9 -> thread_fn is already running (IRQTF_RUNTHREAD is set still!) so we skip running the thread... 10 -> an outstanding timestamp is remaining but we never read it 11 -> interrupt never triggers again The fix for this complicated race condition is simple: do not re-enable the miscellaneous interrupt until *after* the thread function completes. If a new timestamp event triggers while the interrupt is disabled, it will be remembered and should cause the interrupt to trigger again immediately after we re-enable the interrupt. Fixes: 1229b33973c7 ("ice: Add low latency Tx timestamp read") Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> --- drivers/net/ethernet/intel/ice/ice_main.c | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-)