diff mbox series

[iwl-next,v2,5/5] ice: do not re-enable miscellaneous interrupt until thread_fn completes

Message ID 20230530174605.2772054-6-jacob.e.keller@intel.com
State Changes Requested
Headers show
Series ice: Improve miscellaneous interrupt code | expand

Commit Message

Jacob Keller May 30, 2023, 5:46 p.m. UTC
At the end of ice_misc_intr() the driver calls ice_irq_dynamic_ena() to
re-enable the miscellaneous interrupt. This is done before the
ice_misc_intr_thread_fn can run and complete.

According to the kernel function comment documentation for
request_threaded_irq(), the interrupt should remain disabled until the
thread function completes its task.

By re-enabling the interrupt at the end of the hard IRQ, it is possible for
a new interrupt to trigger while the thread function is processing. This is
problematic for PTP Tx timestamps.

For E822 devices, the hardware in the PHY keeps track of how many
outstanding timestamps are generated and how many timestamps are read from
the PHY.

This counter is incremented once for each timestamp that is captured by
hardware, and decremented once each time a timestamp is read from the PHY.
The PHY will not generate a new interrupt unless this internal counter is
zero before the most recently captured timestamp.

Because of this counter behavior, a race with the hard IRQ and threaded IRQ
function can result in the potential for the counter to get stuck such that
no new interrupts will be triggered until the device is reset.

Consider the following flow:

 1 -> Tx timestamp completes in hardware
 2 -> timestamp interrupt occurs
 3 -> ice_misc_intr() re-enables timestamp interrupt, and wakes the
      thread_fn
 4 -> thread_fn is running and processing Tx timestamp
 5 -> the Tx timestamp is read from PHY, clearing the counter
 6 -> a new Tx timestamp completes in hardware, triggering interrupt
 7 -> the thread_fn hasn't exited and reported IRQ handled
 8 -> ice_misc_intr() triggers and sees PTP interrupt, so tries to wake thread
 9 -> thread_fn is already running (IRQTF_RUNTHREAD is set still!) so we
      skip running the thread...
10 -> an outstanding timestamp is remaining but we never read it
11 -> interrupt never triggers again

The fix for this complicated race condition is simple: do not re-enable the
miscellaneous interrupt until *after* the thread function completes. If a
new timestamp event triggers while the interrupt is disabled, it will be
remembered and should cause the interrupt to trigger again immediately
after we re-enable the interrupt.

Fixes: 1229b33973c7 ("ice: Add low latency Tx timestamp read")
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
---
 drivers/net/ethernet/intel/ice/ice_main.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

Comments

Michal Schmidt May 31, 2023, 2:19 p.m. UTC | #1
On Tue, May 30, 2023 at 7:46 PM Jacob Keller <jacob.e.keller@intel.com> wrote:
>
> At the end of ice_misc_intr() the driver calls ice_irq_dynamic_ena() to
> re-enable the miscellaneous interrupt. This is done before the
> ice_misc_intr_thread_fn can run and complete.
>
> According to the kernel function comment documentation for
> request_threaded_irq(), the interrupt should remain disabled until the
> thread function completes its task.
>
> By re-enabling the interrupt at the end of the hard IRQ, it is possible for
> a new interrupt to trigger while the thread function is processing. This is
> problematic for PTP Tx timestamps.
>
> For E822 devices, the hardware in the PHY keeps track of how many
> outstanding timestamps are generated and how many timestamps are read from
> the PHY.
>
> This counter is incremented once for each timestamp that is captured by
> hardware, and decremented once each time a timestamp is read from the PHY.
> The PHY will not generate a new interrupt unless this internal counter is
> zero before the most recently captured timestamp.
>
> Because of this counter behavior, a race with the hard IRQ and threaded IRQ
> function can result in the potential for the counter to get stuck such that
> no new interrupts will be triggered until the device is reset.
>
> Consider the following flow:
>
>  1 -> Tx timestamp completes in hardware
>  2 -> timestamp interrupt occurs
>  3 -> ice_misc_intr() re-enables timestamp interrupt, and wakes the
>       thread_fn
>  4 -> thread_fn is running and processing Tx timestamp
>  5 -> the Tx timestamp is read from PHY, clearing the counter
>  6 -> a new Tx timestamp completes in hardware, triggering interrupt
>  7 -> the thread_fn hasn't exited and reported IRQ handled
>  8 -> ice_misc_intr() triggers and sees PTP interrupt, so tries to wake thread
>  9 -> thread_fn is already running (IRQTF_RUNTHREAD is set still!) so we
>       skip running the thread...

There is a misunderstanding here. IRQTF_RUNTHREAD does not remain set
for the execution of thread_fn. The IRQ thread clears the flag
*before* it starts executing your thread_fn. See kernel/irq/manage.c.
The IRQ thread's main loop is in irq_thread(). Every iteration of the
loop starts with irq_wait_for_interrupt(). It uses
"test_and_clear_bit(IRQTF_RUNTHREAD, ...)" to check if there's work to
do.

So it's not possible for step 9 to forget the interrupt like that. If
thread_fn is still executing, the setting of IRQTF_RUNTHREAD by the
hard interrupt handler will tell the IRQ thread to go through the loop
again.

Michal

> 10 -> an outstanding timestamp is remaining but we never read it
> 11 -> interrupt never triggers again
>
> The fix for this complicated race condition is simple: do not re-enable the
> miscellaneous interrupt until *after* the thread function completes. If a
> new timestamp event triggers while the interrupt is disabled, it will be
> remembered and should cause the interrupt to trigger again immediately
> after we re-enable the interrupt.
>
> Fixes: 1229b33973c7 ("ice: Add low latency Tx timestamp read")
> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
> ---
>  drivers/net/ethernet/intel/ice/ice_main.c | 9 +++++----
>  1 file changed, 5 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/net/ethernet/intel/ice/ice_main.c b/drivers/net/ethernet/intel/ice/ice_main.c
> index 72e1b919b2d3..51fe3da0d54f 100644
> --- a/drivers/net/ethernet/intel/ice/ice_main.c
> +++ b/drivers/net/ethernet/intel/ice/ice_main.c
> @@ -3179,8 +3179,6 @@ static irqreturn_t ice_misc_intr(int __always_unused irq, void *data)
>                 }
>         }
>
> -       ice_irq_dynamic_ena(hw, NULL, NULL);
> -
>         return IRQ_WAKE_THREAD;
>  }
>
> @@ -3192,6 +3190,9 @@ static irqreturn_t ice_misc_intr(int __always_unused irq, void *data)
>  static irqreturn_t ice_misc_intr_thread_fn(int __always_unused irq, void *data)
>  {
>         struct ice_pf *pf = data;
> +       struct ice_hw *hw;
> +
> +       hw = &pf->hw;
>
>         if (ice_is_reset_in_progress(pf->state))
>                 return IRQ_HANDLED;
> @@ -3204,8 +3205,6 @@ static irqreturn_t ice_misc_intr_thread_fn(int __always_unused irq, void *data)
>         }
>
>         if (pf->oicr_misc & PFINT_OICR_TSYN_TX_M) {
> -               struct ice_hw *hw = &pf->hw;
> -
>                 /* Process outstanding Tx timestamps. If there is more work,
>                  * re-arm the interrupt to trigger again.
>                  */
> @@ -3217,6 +3216,8 @@ static irqreturn_t ice_misc_intr_thread_fn(int __always_unused irq, void *data)
>                 pf->oicr_misc &= ~PFINT_OICR_TSYN_TX_M;
>         }
>
> +       ice_irq_dynamic_ena(hw, NULL, NULL);
> +
>         return IRQ_HANDLED;
>  }
>
> --
> 2.40.0.471.gbd7f14d9353b
>
Jacob Keller June 1, 2023, 6:46 p.m. UTC | #2
On 5/31/2023 7:19 AM, Michal Schmidt wrote:
> On Tue, May 30, 2023 at 7:46 PM Jacob Keller <jacob.e.keller@intel.com> wrote:
>>
>> At the end of ice_misc_intr() the driver calls ice_irq_dynamic_ena() to
>> re-enable the miscellaneous interrupt. This is done before the
>> ice_misc_intr_thread_fn can run and complete.
>>
>> According to the kernel function comment documentation for
>> request_threaded_irq(), the interrupt should remain disabled until the
>> thread function completes its task.
>>
>> By re-enabling the interrupt at the end of the hard IRQ, it is possible for
>> a new interrupt to trigger while the thread function is processing. This is
>> problematic for PTP Tx timestamps.
>>
>> For E822 devices, the hardware in the PHY keeps track of how many
>> outstanding timestamps are generated and how many timestamps are read from
>> the PHY.
>>
>> This counter is incremented once for each timestamp that is captured by
>> hardware, and decremented once each time a timestamp is read from the PHY.
>> The PHY will not generate a new interrupt unless this internal counter is
>> zero before the most recently captured timestamp.
>>
>> Because of this counter behavior, a race with the hard IRQ and threaded IRQ
>> function can result in the potential for the counter to get stuck such that
>> no new interrupts will be triggered until the device is reset.
>>
>> Consider the following flow:
>>
>>  1 -> Tx timestamp completes in hardware
>>  2 -> timestamp interrupt occurs
>>  3 -> ice_misc_intr() re-enables timestamp interrupt, and wakes the
>>       thread_fn
>>  4 -> thread_fn is running and processing Tx timestamp
>>  5 -> the Tx timestamp is read from PHY, clearing the counter
>>  6 -> a new Tx timestamp completes in hardware, triggering interrupt
>>  7 -> the thread_fn hasn't exited and reported IRQ handled
>>  8 -> ice_misc_intr() triggers and sees PTP interrupt, so tries to wake thread
>>  9 -> thread_fn is already running (IRQTF_RUNTHREAD is set still!) so we
>>       skip running the thread...
> 
> There is a misunderstanding here. IRQTF_RUNTHREAD does not remain set
> for the execution of thread_fn. The IRQ thread clears the flag
> *before* it starts executing your thread_fn. See kernel/irq/manage.c.
> The IRQ thread's main loop is in irq_thread(). Every iteration of the
> loop starts with irq_wait_for_interrupt(). It uses
> "test_and_clear_bit(IRQTF_RUNTHREAD, ...)" to check if there's work to
> do.
> 
> So it's not possible for step 9 to forget the interrupt like that. If
> thread_fn is still executing, the setting of IRQTF_RUNTHREAD by the
> hard interrupt handler will tell the IRQ thread to go through the loop
> again.
> 
> Michal
> 

Ok, so my original understanding was flawed, but I think I see the
actual race that happened.

I'll try to explain it here, and see if you agree with the outline, then
I can update the commit message.

The PHYs keep track of how many outstanding timestamps are in memory,
and only generate an interrupt if the count of timestamps goes from
below a threshold to above a threshold value (set by the driver to be 1,
so that it will interrupt immediately on the first timestamp).

As long as there are unread timestamps in the memory bank, it will not
generate a new interrupt.

Somehow, the device gets in a state where it failed to read the
timestamps from the PHY memory *after* the interrupt occurred. When this
happens, we no longer get a new interrupt, because the PHY sees that
there are still unread timestamps. (Yes, I know, this is a poor design).

Because of this, we stop processing all future timestamps.

The actual race is possibly something like the following:

same steps up from 1-7, then:

8 -> ice_misc_intr tiriggers and sees PTP interrupt, so it sets
PFINT_OICR_TSYN_TX_M in pf->oicr_misc.
9 -> unfortunately, ice_misc_intr_thread_fn then *clears* the bit after
exiting from its processing loop.
10 -> once thread_fn exits, the threaded IRQ logic re-runs the function.
However, due to the way that we interact between the main and thread
function, the second run sees that PFINT_OICR_TSYN_TX_M is unset, so it
doesn't run the loop again.

This patch fixes things by ensuring that the hardware won't even
generate a hard IRQ interrupt until the thread_fn completes. We could
also instead fix this by improving the communication between the handler
function and the thread function by using atomic test_and_clear(), or by
using an atomic counter.

I can send a v3 with appropriate fixes and an updated commit message,
once this description makes sense to you.
Michal Schmidt June 1, 2023, 7:18 p.m. UTC | #3
On Thu, Jun 1, 2023 at 8:46 PM Jacob Keller <jacob.e.keller@intel.com> wrote:
> On 5/31/2023 7:19 AM, Michal Schmidt wrote:
> > On Tue, May 30, 2023 at 7:46 PM Jacob Keller <jacob.e.keller@intel.com> wrote:
> >>
> >> At the end of ice_misc_intr() the driver calls ice_irq_dynamic_ena() to
> >> re-enable the miscellaneous interrupt. This is done before the
> >> ice_misc_intr_thread_fn can run and complete.
> >>
> >> According to the kernel function comment documentation for
> >> request_threaded_irq(), the interrupt should remain disabled until the
> >> thread function completes its task.
> >>
> >> By re-enabling the interrupt at the end of the hard IRQ, it is possible for
> >> a new interrupt to trigger while the thread function is processing. This is
> >> problematic for PTP Tx timestamps.
> >>
> >> For E822 devices, the hardware in the PHY keeps track of how many
> >> outstanding timestamps are generated and how many timestamps are read from
> >> the PHY.
> >>
> >> This counter is incremented once for each timestamp that is captured by
> >> hardware, and decremented once each time a timestamp is read from the PHY.
> >> The PHY will not generate a new interrupt unless this internal counter is
> >> zero before the most recently captured timestamp.
> >>
> >> Because of this counter behavior, a race with the hard IRQ and threaded IRQ
> >> function can result in the potential for the counter to get stuck such that
> >> no new interrupts will be triggered until the device is reset.
> >>
> >> Consider the following flow:
> >>
> >>  1 -> Tx timestamp completes in hardware
> >>  2 -> timestamp interrupt occurs
> >>  3 -> ice_misc_intr() re-enables timestamp interrupt, and wakes the
> >>       thread_fn
> >>  4 -> thread_fn is running and processing Tx timestamp
> >>  5 -> the Tx timestamp is read from PHY, clearing the counter
> >>  6 -> a new Tx timestamp completes in hardware, triggering interrupt
> >>  7 -> the thread_fn hasn't exited and reported IRQ handled
> >>  8 -> ice_misc_intr() triggers and sees PTP interrupt, so tries to wake thread
> >>  9 -> thread_fn is already running (IRQTF_RUNTHREAD is set still!) so we
> >>       skip running the thread...
> >
> > There is a misunderstanding here. IRQTF_RUNTHREAD does not remain set
> > for the execution of thread_fn. The IRQ thread clears the flag
> > *before* it starts executing your thread_fn. See kernel/irq/manage.c.
> > The IRQ thread's main loop is in irq_thread(). Every iteration of the
> > loop starts with irq_wait_for_interrupt(). It uses
> > "test_and_clear_bit(IRQTF_RUNTHREAD, ...)" to check if there's work to
> > do.
> >
> > So it's not possible for step 9 to forget the interrupt like that. If
> > thread_fn is still executing, the setting of IRQTF_RUNTHREAD by the
> > hard interrupt handler will tell the IRQ thread to go through the loop
> > again.
> >
> > Michal
> >
>
> Ok, so my original understanding was flawed, but I think I see the
> actual race that happened.
>
> I'll try to explain it here, and see if you agree with the outline, then
> I can update the commit message.
>
> The PHYs keep track of how many outstanding timestamps are in memory,
> and only generate an interrupt if the count of timestamps goes from
> below a threshold to above a threshold value (set by the driver to be 1,
> so that it will interrupt immediately on the first timestamp).
>
> As long as there are unread timestamps in the memory bank, it will not
> generate a new interrupt.
>
> Somehow, the device gets in a state where it failed to read the
> timestamps from the PHY memory *after* the interrupt occurred. When this
> happens, we no longer get a new interrupt, because the PHY sees that
> there are still unread timestamps. (Yes, I know, this is a poor design).
>
> Because of this, we stop processing all future timestamps.
>
> The actual race is possibly something like the following:
>
> same steps up from 1-7, then:
>
> 8 -> ice_misc_intr tiriggers and sees PTP interrupt, so it sets
> PFINT_OICR_TSYN_TX_M in pf->oicr_misc.
> 9 -> unfortunately, ice_misc_intr_thread_fn then *clears* the bit after
> exiting from its processing loop.
> 10 -> once thread_fn exits, the threaded IRQ logic re-runs the function.
> However, due to the way that we interact between the main and thread
> function, the second run sees that PFINT_OICR_TSYN_TX_M is unset, so it
> doesn't run the loop again.
>
> This patch fixes things by ensuring that the hardware won't even
> generate a hard IRQ interrupt until the thread_fn completes. We could
> also instead fix this by improving the communication between the handler
> function and the thread function by using atomic test_and_clear(), or by
> using an atomic counter.
>
> I can send a v3 with appropriate fixes and an updated commit message,
> once this description makes sense to you.

OK, the new explanation seems plausible.
Michal
Jacob Keller June 1, 2023, 7:45 p.m. UTC | #4
On 6/1/2023 12:18 PM, Michal Schmidt wrote:
> On Thu, Jun 1, 2023 at 8:46 PM Jacob Keller <jacob.e.keller@intel.com> wrote:
>> On 5/31/2023 7:19 AM, Michal Schmidt wrote:
>>> On Tue, May 30, 2023 at 7:46 PM Jacob Keller <jacob.e.keller@intel.com> wrote:
>>>>
>>>> At the end of ice_misc_intr() the driver calls ice_irq_dynamic_ena() to
>>>> re-enable the miscellaneous interrupt. This is done before the
>>>> ice_misc_intr_thread_fn can run and complete.
>>>>
>>>> According to the kernel function comment documentation for
>>>> request_threaded_irq(), the interrupt should remain disabled until the
>>>> thread function completes its task.
>>>>
>>>> By re-enabling the interrupt at the end of the hard IRQ, it is possible for
>>>> a new interrupt to trigger while the thread function is processing. This is
>>>> problematic for PTP Tx timestamps.
>>>>
>>>> For E822 devices, the hardware in the PHY keeps track of how many
>>>> outstanding timestamps are generated and how many timestamps are read from
>>>> the PHY.
>>>>
>>>> This counter is incremented once for each timestamp that is captured by
>>>> hardware, and decremented once each time a timestamp is read from the PHY.
>>>> The PHY will not generate a new interrupt unless this internal counter is
>>>> zero before the most recently captured timestamp.
>>>>
>>>> Because of this counter behavior, a race with the hard IRQ and threaded IRQ
>>>> function can result in the potential for the counter to get stuck such that
>>>> no new interrupts will be triggered until the device is reset.
>>>>
>>>> Consider the following flow:
>>>>
>>>>  1 -> Tx timestamp completes in hardware
>>>>  2 -> timestamp interrupt occurs
>>>>  3 -> ice_misc_intr() re-enables timestamp interrupt, and wakes the
>>>>       thread_fn
>>>>  4 -> thread_fn is running and processing Tx timestamp
>>>>  5 -> the Tx timestamp is read from PHY, clearing the counter
>>>>  6 -> a new Tx timestamp completes in hardware, triggering interrupt
>>>>  7 -> the thread_fn hasn't exited and reported IRQ handled
>>>>  8 -> ice_misc_intr() triggers and sees PTP interrupt, so tries to wake thread
>>>>  9 -> thread_fn is already running (IRQTF_RUNTHREAD is set still!) so we
>>>>       skip running the thread...
>>>
>>> There is a misunderstanding here. IRQTF_RUNTHREAD does not remain set
>>> for the execution of thread_fn. The IRQ thread clears the flag
>>> *before* it starts executing your thread_fn. See kernel/irq/manage.c.
>>> The IRQ thread's main loop is in irq_thread(). Every iteration of the
>>> loop starts with irq_wait_for_interrupt(). It uses
>>> "test_and_clear_bit(IRQTF_RUNTHREAD, ...)" to check if there's work to
>>> do.
>>>
>>> So it's not possible for step 9 to forget the interrupt like that. If
>>> thread_fn is still executing, the setting of IRQTF_RUNTHREAD by the
>>> hard interrupt handler will tell the IRQ thread to go through the loop
>>> again.
>>>
>>> Michal
>>>
>>
>> Ok, so my original understanding was flawed, but I think I see the
>> actual race that happened.
>>
>> I'll try to explain it here, and see if you agree with the outline, then
>> I can update the commit message.
>>
>> The PHYs keep track of how many outstanding timestamps are in memory,
>> and only generate an interrupt if the count of timestamps goes from
>> below a threshold to above a threshold value (set by the driver to be 1,
>> so that it will interrupt immediately on the first timestamp).
>>
>> As long as there are unread timestamps in the memory bank, it will not
>> generate a new interrupt.
>>
>> Somehow, the device gets in a state where it failed to read the
>> timestamps from the PHY memory *after* the interrupt occurred. When this
>> happens, we no longer get a new interrupt, because the PHY sees that
>> there are still unread timestamps. (Yes, I know, this is a poor design).
>>
>> Because of this, we stop processing all future timestamps.
>>
>> The actual race is possibly something like the following:
>>
>> same steps up from 1-7, then:
>>
>> 8 -> ice_misc_intr tiriggers and sees PTP interrupt, so it sets
>> PFINT_OICR_TSYN_TX_M in pf->oicr_misc.
>> 9 -> unfortunately, ice_misc_intr_thread_fn then *clears* the bit after
>> exiting from its processing loop.
>> 10 -> once thread_fn exits, the threaded IRQ logic re-runs the function.
>> However, due to the way that we interact between the main and thread
>> function, the second run sees that PFINT_OICR_TSYN_TX_M is unset, so it
>> doesn't run the loop again.
>>
>> This patch fixes things by ensuring that the hardware won't even
>> generate a hard IRQ interrupt until the thread_fn completes. We could
>> also instead fix this by improving the communication between the handler
>> function and the thread function by using atomic test_and_clear(), or by
>> using an atomic counter.
>>
>> I can send a v3 with appropriate fixes and an updated commit message,
>> once this description makes sense to you.
> 
> OK, the new explanation seems plausible.
> Michal
> 

Really appreciate the careful review and helping me understand some of
the areas I lacked sufficient knowledge in. I'll send a v3 with improved
messages and I will also convert to using atomics instead of bare u32
values for passing data between the functions. I think thats better even
if we leave the OICR disabled until the thread function exits.
diff mbox series

Patch

diff --git a/drivers/net/ethernet/intel/ice/ice_main.c b/drivers/net/ethernet/intel/ice/ice_main.c
index 72e1b919b2d3..51fe3da0d54f 100644
--- a/drivers/net/ethernet/intel/ice/ice_main.c
+++ b/drivers/net/ethernet/intel/ice/ice_main.c
@@ -3179,8 +3179,6 @@  static irqreturn_t ice_misc_intr(int __always_unused irq, void *data)
 		}
 	}
 
-	ice_irq_dynamic_ena(hw, NULL, NULL);
-
 	return IRQ_WAKE_THREAD;
 }
 
@@ -3192,6 +3190,9 @@  static irqreturn_t ice_misc_intr(int __always_unused irq, void *data)
 static irqreturn_t ice_misc_intr_thread_fn(int __always_unused irq, void *data)
 {
 	struct ice_pf *pf = data;
+	struct ice_hw *hw;
+
+	hw = &pf->hw;
 
 	if (ice_is_reset_in_progress(pf->state))
 		return IRQ_HANDLED;
@@ -3204,8 +3205,6 @@  static irqreturn_t ice_misc_intr_thread_fn(int __always_unused irq, void *data)
 	}
 
 	if (pf->oicr_misc & PFINT_OICR_TSYN_TX_M) {
-		struct ice_hw *hw = &pf->hw;
-
 		/* Process outstanding Tx timestamps. If there is more work,
 		 * re-arm the interrupt to trigger again.
 		 */
@@ -3217,6 +3216,8 @@  static irqreturn_t ice_misc_intr_thread_fn(int __always_unused irq, void *data)
 		pf->oicr_misc &= ~PFINT_OICR_TSYN_TX_M;
 	}
 
+	ice_irq_dynamic_ena(hw, NULL, NULL);
+
 	return IRQ_HANDLED;
 }