powerpc: irq work racing with timer interrupt can result in timer interrupt hang

Message ID 1399695993.4481.47.camel@pasglop (mailing list archive)
State Not Applicable

Commit Message

Benjamin Herrenschmidt May 10, 2014, 4:26 a.m. UTC
On Fri, 2014-05-09 at 15:22 +0530, Preeti U Murthy wrote:
> in __timer_interrupt() outside the _else_ loop? This will ensure that no
> matter what, before exiting timer interrupt handler we check for pending
> irq work.

We still need to make sure that set_next_event() doesn't move the
dec beyond the next tick if there is a pending timer... maybe we
can fix it like this:

static int decrementer_set_next_event(unsigned long evt,
				      struct clock_event_device *dev)
{
	__get_cpu_var(decrementers_next_tb) = get_tb_or_rtc() + evt;

	/* Don't adjust the decrementer if some irq work is pending */
	if (!test_irq_work_pending())
		set_dec(evt);

	return 0;
}

Along with a single occurrence of:

	if (test_irq_work_pending())
		set_dec(1);

At the end of __timer_interrupt(), outside of the current else {}
case, this should work, don't you think ?

What about this completely untested patch ?
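
The proposal above can be modelled in userspace to make the intended behaviour concrete. This is only a sketch, not kernel code: `struct cpu_model` and the `model_*` functions are hypothetical stand-ins for the per-CPU state, decrementer_set_next_event() and the tail of __timer_interrupt(), and plain assignments stand in for set_dec().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical userspace stand-ins for per-CPU kernel state. */
struct cpu_model {
	uint64_t tb;            /* current timebase, as get_tb_or_rtc() would return */
	uint64_t next_tb;       /* stand-in for decrementers_next_tb */
	uint64_t dec;           /* last value written with set_dec() */
	bool irq_work_pending;  /* what test_irq_work_pending() would report */
};

/*
 * Proposed decrementer_set_next_event(): always record the deadline,
 * but leave the decrementer alone while irq work is pending.
 */
static int model_set_next_event(struct cpu_model *cpu, uint64_t evt)
{
	assert(cpu);
	cpu->next_tb = cpu->tb + evt;
	if (!cpu->irq_work_pending)
		cpu->dec = evt;
	return 0;
}

/* Proposed tail of __timer_interrupt(): one check outside the else branch. */
static void model_timer_interrupt_tail(struct cpu_model *cpu)
{
	assert(cpu);
	if (cpu->irq_work_pending)
		cpu->dec = 1;
}
```

With irq work pending, a call such as model_set_next_event(&cpu, 500) moves next_tb forward but leaves dec untouched, which is the property the proposal is after.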

Comments

Preeti U Murthy May 10, 2014, 3:36 p.m. UTC | #1
On 05/10/2014 09:56 AM, Benjamin Herrenschmidt wrote:
> On Fri, 2014-05-09 at 15:22 +0530, Preeti U Murthy wrote:
>> in __timer_interrupt() outside the _else_ loop? This will ensure that no
>> matter what, before exiting timer interrupt handler we check for pending
>> irq work.
> 
> We still need to make sure that set_next_event() doesn't move the
> dec beyond the next tick if there is a pending timer... maybe we

Sorry, but didn't get this. s/if there is pending timer/if there is
pending irq work ?

> can fix it like this:

We can call set_next_event() from events like hrtimer_cancel() or
hrtimer_forward() as well. In that case we don't come to
decrementer_set_next_event() from __timer_interrupt(). Then, if we race
with irq work, we *do not do* a set_dec(1) ( I am referring to the patch
below ), we might never set the decrementer to fire immediately right?

Or does this scenario never arise?

Regards
Preeti U Murthy
> 
> static int decrementer_set_next_event(unsigned long evt,
> 				      struct clock_event_device *dev)
> {
> 	__get_cpu_var(decrementers_next_tb) = get_tb_or_rtc() + evt;
> 
> 	/* Don't adjust the decrementer if some irq work is pending */
> 	if (!test_irq_work_pending())
> 		set_dec(evt);
> 
> 	return 0;
> }
> 
> Along with a single occurrence of:
> 
> 	if (test_irq_work_pending())
> 		set_dec(1);
> 
> At the end of __timer_interrupt(), outside of the current else {}
> case, this should work, don't you think ?
> 
> What about this completely untested patch ?
> 
> diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
> index 122a580..ba7e83b 100644
> --- a/arch/powerpc/kernel/time.c
> +++ b/arch/powerpc/kernel/time.c
> @@ -503,12 +503,13 @@ void __timer_interrupt(void)
>                 now = *next_tb - now;
>                 if (now <= DECREMENTER_MAX)
>                         set_dec((int)now);
> -               /* We may have raced with new irq work */
> -               if (test_irq_work_pending())
> -                       set_dec(1);
>                 __get_cpu_var(irq_stat).timer_irqs_others++;
>         }
> 
> +       /* We may have raced with new irq work */
> +       if (test_irq_work_pending())
> +               set_dec(1);
> +
>  #ifdef CONFIG_PPC64
>         /* collect purr register values often, for accurate calculations */
>         if (firmware_has_feature(FW_FEATURE_SPLPAR)) {
> @@ -813,15 +814,11 @@ static void __init clocksource_init(void)
>  static int decrementer_set_next_event(unsigned long evt,
>                                       struct clock_event_device *dev)
>  {
> -       /* Don't adjust the decrementer if some irq work is pending */
> -       if (test_irq_work_pending())
> -               return 0;
>         __get_cpu_var(decrementers_next_tb) = get_tb_or_rtc() + evt;
> -       set_dec(evt);
> 
> -       /* We may have raced with new irq work */
> -       if (test_irq_work_pending())
> -               set_dec(1);
> +       /* Don't adjust the decrementer if some irq work is pending */
> +       if (!test_irq_work_pending())
> +               set_dec(evt);
> 
>         return 0;
>  }
> 
> 
> 
>
Benjamin Herrenschmidt May 10, 2014, 10:25 p.m. UTC | #2
On Sat, 2014-05-10 at 21:06 +0530, Preeti U Murthy wrote:
> On 05/10/2014 09:56 AM, Benjamin Herrenschmidt wrote:
> > On Fri, 2014-05-09 at 15:22 +0530, Preeti U Murthy wrote:
> >> in __timer_interrupt() outside the _else_ loop? This will ensure that no
> >> matter what, before exiting timer interrupt handler we check for pending
> >> irq work.
> > 
> > We still need to make sure that set_next_event() doesn't move the
> > dec beyond the next tick if there is a pending timer... maybe we
> 
> Sorry, but didn't get this. s/if there is pending timer/if there is
> pending irq work ?

Yes, sorry :-) That's what I meant.

> > can fix it like this:
> 
> We can call set_next_event() from events like hrtimer_cancel() or
> hrtimer_forward() as well. In that case we don't come to
> decrementer_set_next_event() from __timer_interrupt(). Then, if we race
> with irq work, we *do not do* a set_dec(1) ( I am referring to the patch
> below ), we might never set the decrementer to fire immediately right?
> 
> Or does this scenario never arise?

So my proposed patch handles that no ?

With that patch, we do the set_dec(1) in two cases:

 - The existing arch_irq_work_raise() which is unchanged

 - At the end of __timer_interrupt() if an irq work is still pending

And the patch also makes decrementer_set_next_event() not modify the
decrementer if an irq work is pending, but *still* adjust next_tb unlike
what the code does now.

Thus the timer interrupt, when it happens, will re-adjust the dec
properly using next_tb.
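
To illustrate the re-adjustment step, here is a small userspace sketch of the else branch of __timer_interrupt() shown in the diff. The function name and the MODEL_DECREMENTER_MAX value are assumptions made for the example; only the arithmetic mirrors the quoted code.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed to mirror the kernel's DECREMENTER_MAX; the value is an assumption. */
#define MODEL_DECREMENTER_MAX 0x7fffffffULL

/*
 * Models the else branch of __timer_interrupt(): the deadline has not
 * been reached yet, so reprogram the decrementer with the time left.
 */
static void model_rearm_from_next_tb(uint64_t now, uint64_t next_tb,
				     uint64_t *dec)
{
	assert(now < next_tb);  /* the else branch only runs before the deadline */

	uint64_t remaining = next_tb - now;

	if (remaining <= MODEL_DECREMENTER_MAX)
		*dec = remaining;
}
```

For example, if next_tb was recorded as 1500 while irq work kept the decrementer at 1, and the early interrupt runs at timebase 1200, the decrementer is reprogrammed with the remaining 300 ticks.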

Do we still miss a case ?

Cheers,
Ben.

Preeti U Murthy May 11, 2014, 8:15 a.m. UTC | #3
On 05/11/2014 03:55 AM, Benjamin Herrenschmidt wrote:
> On Sat, 2014-05-10 at 21:06 +0530, Preeti U Murthy wrote:
>> On 05/10/2014 09:56 AM, Benjamin Herrenschmidt wrote:
>>> On Fri, 2014-05-09 at 15:22 +0530, Preeti U Murthy wrote:
>>>> in __timer_interrupt() outside the _else_ loop? This will ensure that no
>>>> matter what, before exiting timer interrupt handler we check for pending
>>>> irq work.
>>>
>>> We still need to make sure that set_next_event() doesn't move the
>>> dec beyond the next tick if there is a pending timer... maybe we
>>
>> Sorry, but didn't get this. s/if there is pending timer/if there is
>> pending irq work ?
> 
> Yes, sorry :-) That's what I meant.
> 
>>> can fix it like this:
>>
>> We can call set_next_event() from events like hrtimer_cancel() or
>> hrtimer_forward() as well. In that case we don't come to
>> decrementer_set_next_event() from __timer_interrupt(). Then, if we race
>> with irq work, we *do not do* a set_dec(1) ( I am referring to the patch
>> below ), we might never set the decrementer to fire immediately right?
>>
>> Or does this scenario never arise?
> 
> So my proposed patch handles that no ?
> 
> With that patch, we do the set_dec(1) in two cases:
> 
>  - The existing arch_irq_work_raise() which is unchanged
> 
>  - At the end of __timer_interrupt() if an irq work is still pending
> 
> And the patch also makes decrementer_set_next_event() not modify the
> decrementer if an irq work is pending, but *still* adjust next_tb unlike
> what the code does now.
> 
> Thus the timer interrupt, when it happens, will re-adjust the dec
> properly using next_tb.
> 
> Do we still miss a case ?

I was thinking something like the below in decrementer_set_next_event().
See last line in particular :

 -       /* Don't adjust the decrementer if some irq work is pending */
 -       if (test_irq_work_pending())
 -               return 0;
         __get_cpu_var(decrementers_next_tb) = get_tb_or_rtc() + evt;
 -       set_dec(evt);

 -       /* We may have raced with new irq work */
 -       if (test_irq_work_pending())
 -               set_dec(1);
 +       /* Don't adjust the decrementer if some irq work is pending */
 +       if (!test_irq_work_pending())
 +               set_dec(evt);
 +       else
 +               set_dec(1);

                  ^^^^^ your patch currently does not have this explicit
set_dec(1) here. Will that create a problem? If any irq work is
pending at this point, will someone set the decrementer to fire
immediately after this point? The current code in
decrementer_set_next_event() calls set_dec(1) explicitly in case of
pending irq work.

Regards
Preeti U Murthy
Benjamin Herrenschmidt May 11, 2014, 8:37 a.m. UTC | #4
On Sun, 2014-05-11 at 13:45 +0530, Preeti U Murthy wrote:
>  +       /* Don't adjust the decrementer if some irq work is pending */
>  +       if (!test_irq_work_pending())
>  +               set_dec(evt);
>  +       else
>  +               set_dec(1);
> 
>                   ^^^^^ your patch currently does not have this explicit
> set_dec(1) here. Will that create a problem?
>
> If there is any irq work pending at this point, will someone set the
> decrementer to fire immediately after this point? The current code in
> decrementer_set_next_event() sets set_dec(1) explicitly in case of
> pending irq work.

Hrm, actually this is an interesting point. The problem isn't that
*someone* will do a set_dec(); nobody else that matters should.

The problem is that irq_work can typically be queued from NMIs or
similar, which means that it might be queued between the
test_irq_work_pending() and the set_dec(), thus causing a race.

So basically Anton's original patch is fine :-) I had missed that
we already did a post-set_dec() test in decrementer_set_next_event(),
so as far as I can tell, removing the pre-test, which is what Anton
does, is really all we need.
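
The window described above can be sketched in userspace. The model below is an assumption-laden illustration, not kernel code: the globals stand in for the pending flag and the decrementer, the `nmi_*` parameters are a hypothetical way to force irq work to be queued at the interesting point, and model_irq_work_raise() mirrors the set_dec(1) that arch_irq_work_raise() is said to do earlier in the thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

static bool irq_work_pending;  /* what test_irq_work_pending() would report */
static uint64_t dec;           /* last value written with set_dec() */

/* Models arch_irq_work_raise(): flag the work, then set_dec(1). */
static void model_irq_work_raise(void)
{
	irq_work_pending = true;
	dec = 1;
}

/* Test, then set: work queued inside the window is overwritten by evt. */
static void model_test_then_set(uint64_t evt, bool nmi_in_window)
{
	if (irq_work_pending)
		return;
	if (nmi_in_window)
		model_irq_work_raise();  /* hypothetical NMI between test and set */
	dec = evt;
}

/* Set, then test: the post-test catches work queued before the set. */
static void model_set_then_test(uint64_t evt, bool nmi_before_set)
{
	if (nmi_before_set)
		model_irq_work_raise();
	dec = evt;
	if (irq_work_pending)
		dec = 1;
	/* Invariant the thread is after: pending work implies dec == 1. */
	assert(!irq_work_pending || dec == 1);
}
```

In the test-then-set variant the pending work is left waiting until evt expires, while the set-then-test ordering ends with the decrementer at 1 whenever work is pending.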

Cheers,
Ben.
Preeti U Murthy May 11, 2014, 8:43 a.m. UTC | #5
On 05/11/2014 02:07 PM, Benjamin Herrenschmidt wrote:
> On Sun, 2014-05-11 at 13:45 +0530, Preeti U Murthy wrote:
>>  +       /* Don't adjust the decrementer if some irq work is pending */
>>  +       if (!test_irq_work_pending())
>>  +               set_dec(evt);
>>  +       else
>>  +               set_dec(1);
>>
>>                   ^^^^^ your patch currently does not have this explicit
>> set_dec(1) here. Will that create a problem?
>>
>> If there is any irq work pending at this point, will someone set the
>> decrementer to fire immediately after this point? The current code in
>> decrementer_set_next_event() sets set_dec(1) explicitly in case of
>> pending irq work.
> 
> Hrm, actually this is an interesting point. The problem isn't that
> *someone* will do a set_dec(); nobody else that matters should.
> 
> The problem is that irq_work can typically be queued from NMIs or
> similar, which means that it might be queued between the
> test_irq_work_pending() and the set_dec(), thus causing a race.
> 
> So basically Anton's original patch is fine :-) I had missed that
> we already did a post-set_dec() test in decrementer_set_next_event(),
> so as far as I can tell, removing the pre-test, which is what Anton
> does, is really all we need.

Isn't this patch required too?

@@ -503,12 +503,13 @@ void __timer_interrupt(void)
                now = *next_tb - now;
                if (now <= DECREMENTER_MAX)
                        set_dec((int)now);
-               /* We may have raced with new irq work */
-               if (test_irq_work_pending())
-                       set_dec(1);
                __get_cpu_var(irq_stat).timer_irqs_others++;
        }

+       /* We may have raced with new irq work */
+       if (test_irq_work_pending())
+               set_dec(1);
+

The event_handler cannot be relied upon to call
decrementer_set_next_event() all the time. This is in the case where
there are no pending timers. In that case we need to have the check on
irq work pending at the end of __timer_interrupt() no?

Regards
Preeti U Murthy
Benjamin Herrenschmidt May 11, 2014, 9:03 a.m. UTC | #6
On Sun, 2014-05-11 at 14:13 +0530, Preeti U Murthy wrote:
> 
> Isn't this patch required too?
> 
> @@ -503,12 +503,13 @@ void __timer_interrupt(void)
>                 now = *next_tb - now;
>                 if (now <= DECREMENTER_MAX)
>                         set_dec((int)now);
> -               /* We may have raced with new irq work */
> -               if (test_irq_work_pending())
> -                       set_dec(1);
>                 __get_cpu_var(irq_stat).timer_irqs_others++;
>         }
>
> +       /* We may have raced with new irq work */
> +       if (test_irq_work_pending())
> +               set_dec(1);
> +
> 
> The event_handler cannot be relied upon to call
> decrementer_set_next_event() all the time. This is in the case where
> there are no pending timers. In that case we need to have the check on
> irq work pending at the end of __timer_interrupt() no?

I don't think we need to move the test, no. If there's a pending
irq_work at that point, it will already have done set_dec(1) when it
was queued. So we only care about cases where we might change the
decrementer.

If the event handler doesn't call decrementer_set_next_event() then
nothing will modify the decrementer and it will still trigger soon.
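
A rough userspace sketch of that argument follows. `struct cpu_model`, model_irq_work_raise() and model_event_handler() are hypothetical names; the handler only reprograms the decrementer when a timer is pending, and in that case it applies the existing post-set_dec() test.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical userspace stand-ins for per-CPU kernel state. */
struct cpu_model {
	uint64_t dec;           /* last value written with set_dec() */
	bool irq_work_pending;  /* what test_irq_work_pending() would report */
};

/* Models arch_irq_work_raise(): flag the work, then set_dec(1). */
static void model_irq_work_raise(struct cpu_model *cpu)
{
	cpu->irq_work_pending = true;
	cpu->dec = 1;
}

/* Models the event handler: it reprograms only when a timer is pending. */
static void model_event_handler(struct cpu_model *cpu, bool timer_pending,
				uint64_t evt)
{
	if (timer_pending) {
		cpu->dec = evt;             /* set_dec(evt) */
		if (cpu->irq_work_pending)  /* existing post-set_dec() test */
			cpu->dec = 1;
	}
	/* With no timer pending, nothing here writes the decrementer. */
	assert(!cpu->irq_work_pending || cpu->dec == 1);
}
```

Either way, irq work queued before the handler runs leaves the decrementer at 1, so no extra check at the end of __timer_interrupt() is needed in this model.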

Cheers,
Ben.
Preeti U Murthy May 11, 2014, 9:07 a.m. UTC | #7
On 05/11/2014 02:33 PM, Benjamin Herrenschmidt wrote:
> On Sun, 2014-05-11 at 14:13 +0530, Preeti U Murthy wrote:
>>
>> Isn't this patch required too?
>>
>> @@ -503,12 +503,13 @@ void __timer_interrupt(void)
>>                 now = *next_tb - now;
>>                 if (now <= DECREMENTER_MAX)
>>                         set_dec((int)now);
>> -               /* We may have raced with new irq work */
>> -               if (test_irq_work_pending())
>> -                       set_dec(1);
>>                 __get_cpu_var(irq_stat).timer_irqs_others++;
>>         }
>>
>> +       /* We may have raced with new irq work */
>> +       if (test_irq_work_pending())
>> +               set_dec(1);
>> +
>>
>> The event_handler cannot be relied upon to call
>> decrementer_set_next_event() all the time. This is in the case where
>> there are no pending timers. In that case we need to have the check on
>> irq work pending at the end of __timer_interrupt() no?
> 
> I don't think we need to move the test no. If there's a pending
> irq_work, at that point, it will have done set_dec when being queued up.
> So we only care about cases where we might change the decrementer.
> 
> If the event handler doesn't call decrementer_set_next_event() then
> nothing will modify the decrementer and it will still trigger soon.

Hmm ok. Then Anton's patch covers all cases :)

Thanks!

Reviewed-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>

Regards
Preeti U Murthy
Patch

diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index 122a580..ba7e83b 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -503,12 +503,13 @@  void __timer_interrupt(void)
                now = *next_tb - now;
                if (now <= DECREMENTER_MAX)
                        set_dec((int)now);
-               /* We may have raced with new irq work */
-               if (test_irq_work_pending())
-                       set_dec(1);
                __get_cpu_var(irq_stat).timer_irqs_others++;
        }
 
+       /* We may have raced with new irq work */
+       if (test_irq_work_pending())
+               set_dec(1);
+
 #ifdef CONFIG_PPC64
        /* collect purr register values often, for accurate calculations */
        if (firmware_has_feature(FW_FEATURE_SPLPAR)) {
@@ -813,15 +814,11 @@  static void __init clocksource_init(void)
 static int decrementer_set_next_event(unsigned long evt,
                                      struct clock_event_device *dev)
 {
-       /* Don't adjust the decrementer if some irq work is pending */
-       if (test_irq_work_pending())
-               return 0;
        __get_cpu_var(decrementers_next_tb) = get_tb_or_rtc() + evt;
-       set_dec(evt);
 
-       /* We may have raced with new irq work */
-       if (test_irq_work_pending())
-               set_dec(1);
+       /* Don't adjust the decrementer if some irq work is pending */
+       if (!test_irq_work_pending())
+               set_dec(evt);
 
        return 0;
 }