Message ID | 1399695993.4481.47.camel@pasglop (mailing list archive) |
---|---|
State | Not Applicable |
On 05/10/2014 09:56 AM, Benjamin Herrenschmidt wrote:
> On Fri, 2014-05-09 at 15:22 +0530, Preeti U Murthy wrote:
>> in __timer_interrupt() outside the _else_ loop? This will ensure that no
>> matter what, before exiting timer interrupt handler we check for pending
>> irq work.
>
> We still need to make sure that set_next_event() doesn't move the
> dec beyond the next tick if there is a pending timer... maybe we

Sorry, but didn't get this. s/if there is pending timer/if there is
pending irq work ?

> can fix it like this:

We can call set_next_event() from events like hrtimer_cancel() or
hrtimer_forward() as well. In that case we don't come to
decrementer_set_next_event() from __timer_interrupt(). Then, if we race
with irq work, we *do not do* a set_dec(1) ( I am referring to the patch
below ), we might never set the decrementer to fire immediately right?

Or does this scenario never arise?

Regards
Preeti U Murthy

>
> static int decrementer_set_next_event(unsigned long evt,
> 				      struct clock_event_device *dev)
> {
> 	__get_cpu_var(decrementers_next_tb) = get_tb_or_rtc() + evt;
>
> 	/* Don't adjust the decrementer if some irq work is pending */
> 	if (!test_irq_work_pending())
> 		set_dec(evt);
>
> 	return 0;
> }
>
> Along with a single occurrence of:
>
> 	if (test_irq_work_pending())
> 		set_dec(1);
>
> At the end of __timer_interrupt(), outside if the current else {}
> case, this should work, don't you think ?
>
> What about this completely untested patch ?
>
> diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
> index 122a580..ba7e83b 100644
> --- a/arch/powerpc/kernel/time.c
> +++ b/arch/powerpc/kernel/time.c
> @@ -503,12 +503,13 @@ void __timer_interrupt(void)
>  		now = *next_tb - now;
>  		if (now <= DECREMENTER_MAX)
>  			set_dec((int)now);
> -		/* We may have raced with new irq work */
> -		if (test_irq_work_pending())
> -			set_dec(1);
>  		__get_cpu_var(irq_stat).timer_irqs_others++;
>  	}
>
> +	/* We may have raced with new irq work */
> +	if (test_irq_work_pending())
> +		set_dec(1);
> +
>  #ifdef CONFIG_PPC64
>  	/* collect purr register values often, for accurate calculations */
>  	if (firmware_has_feature(FW_FEATURE_SPLPAR)) {
> @@ -813,15 +814,11 @@ static void __init clocksource_init(void)
>  static int decrementer_set_next_event(unsigned long evt,
>  				      struct clock_event_device *dev)
>  {
> -	/* Don't adjust the decrementer if some irq work is pending */
> -	if (test_irq_work_pending())
> -		return 0;
>  	__get_cpu_var(decrementers_next_tb) = get_tb_or_rtc() + evt;
> -	set_dec(evt);
>
> -	/* We may have raced with new irq work */
> -	if (test_irq_work_pending())
> -		set_dec(1);
> +	/* Don't adjust the decrementer if some irq work is pending */
> +	if (!test_irq_work_pending())
> +		set_dec(evt);
>
>  	return 0;
>  }
On Sat, 2014-05-10 at 21:06 +0530, Preeti U Murthy wrote:
> On 05/10/2014 09:56 AM, Benjamin Herrenschmidt wrote:
> > On Fri, 2014-05-09 at 15:22 +0530, Preeti U Murthy wrote:
> >> in __timer_interrupt() outside the _else_ loop? This will ensure that no
> >> matter what, before exiting timer interrupt handler we check for pending
> >> irq work.
> >
> > We still need to make sure that set_next_event() doesn't move the
> > dec beyond the next tick if there is a pending timer... maybe we
>
> Sorry, but didn't get this. s/if there is pending timer/if there is
> pending irq work ?

Yes, sorry :-) That's what I meant.

> > can fix it like this:
>
> We can call set_next_event() from events like hrtimer_cancel() or
> hrtimer_forward() as well. In that case we don't come to
> decrementer_set_next_event() from __timer_interrupt(). Then, if we race
> with irq work, we *do not do* a set_dec(1) ( I am referring to the patch
> below ), we might never set the decrementer to fire immediately right?
>
> Or does this scenario never arise?

So my proposed patch handles that no ?

With that patch, we do the set_dec(1) in two cases:

 - The existing arch_irq_work_raise() which is unchanged

 - At the end of __timer_interrupt() if an irq work is still pending

And the patch also makes decrementer_set_next_event() not modify the
decrementer if an irq work is pending, but *still* adjust next_tb unlike
what the code does now.

Thus the timer interrupt, when it happens, will re-adjust the dec
properly using next_tb.

Do we still miss a case ?

Cheers,
Ben.
> Regards
> Preeti U Murthy
> >
> > static int decrementer_set_next_event(unsigned long evt,
> > 				      struct clock_event_device *dev)
> > {
> > 	__get_cpu_var(decrementers_next_tb) = get_tb_or_rtc() + evt;
> >
> > 	/* Don't adjust the decrementer if some irq work is pending */
> > 	if (!test_irq_work_pending())
> > 		set_dec(evt);
> >
> > 	return 0;
> > }
> >
> > Along with a single occurrence of:
> >
> > 	if (test_irq_work_pending())
> > 		set_dec(1);
> >
> > At the end of __timer_interrupt(), outside if the current else {}
> > case, this should work, don't you think ?
> >
> > What about this completely untested patch ?
> >
> > diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
> > index 122a580..ba7e83b 100644
> > --- a/arch/powerpc/kernel/time.c
> > +++ b/arch/powerpc/kernel/time.c
> > @@ -503,12 +503,13 @@ void __timer_interrupt(void)
> >  		now = *next_tb - now;
> >  		if (now <= DECREMENTER_MAX)
> >  			set_dec((int)now);
> > -		/* We may have raced with new irq work */
> > -		if (test_irq_work_pending())
> > -			set_dec(1);
> >  		__get_cpu_var(irq_stat).timer_irqs_others++;
> >  	}
> >
> > +	/* We may have raced with new irq work */
> > +	if (test_irq_work_pending())
> > +		set_dec(1);
> > +
> >  #ifdef CONFIG_PPC64
> >  	/* collect purr register values often, for accurate calculations */
> >  	if (firmware_has_feature(FW_FEATURE_SPLPAR)) {
> > @@ -813,15 +814,11 @@ static void __init clocksource_init(void)
> >  static int decrementer_set_next_event(unsigned long evt,
> >  				      struct clock_event_device *dev)
> >  {
> > -	/* Don't adjust the decrementer if some irq work is pending */
> > -	if (test_irq_work_pending())
> > -		return 0;
> >  	__get_cpu_var(decrementers_next_tb) = get_tb_or_rtc() + evt;
> > -	set_dec(evt);
> >
> > -	/* We may have raced with new irq work */
> > -	if (test_irq_work_pending())
> > -		set_dec(1);
> > +	/* Don't adjust the decrementer if some irq work is pending */
> > +	if (!test_irq_work_pending())
> > +		set_dec(evt);
> >
> >  	return 0;
> >  }
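Ben's argument in the reply above is that recording next_tb even while irq work is pending lets the next timer interrupt re-arm the decrementer correctly. The following is a minimal user-space sketch of that reasoning, not kernel code: every name in it (tb_now, next_tb, dec, the model_* functions) is a hypothetical stand-in, and model_timer_interrupt() is a heavily simplified assumption about what __timer_interrupt() does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical user-space stand-ins for the per-CPU kernel state. */
static uint64_t tb_now;        /* models get_tb_or_rtc() */
static uint64_t next_tb;       /* models decrementers_next_tb */
static uint64_t dec;           /* models the decrementer register */
static bool irq_work_pending;  /* models test_irq_work_pending() */

/* Models arch_irq_work_raise(): flag the work and fire the dec at once. */
static void model_irq_work_raise(void)
{
	irq_work_pending = true;
	dec = 1;
}

/*
 * Models the draft decrementer_set_next_event(): always record the
 * deadline, but leave the decrementer alone while irq work is pending.
 */
static int model_set_next_event(uint64_t evt)
{
	next_tb = tb_now + evt;
	if (!irq_work_pending)
		dec = evt;
	return 0;
}

/*
 * Simplified model of __timer_interrupt(): run pending irq work, then
 * re-arm from next_tb if the deadline has not been reached yet.
 */
static void model_timer_interrupt(void)
{
	if (irq_work_pending)
		irq_work_pending = false;  /* stands in for irq_work_run() */

	if (tb_now < next_tb)
		dec = next_tb - tb_now;

	if (irq_work_pending)              /* raced with new irq work */
		dec = 1;
}

/* Walk-through: irq work is pending when a timer gets programmed. */
static uint64_t model_scenario(void)
{
	tb_now = 100;
	model_irq_work_raise();            /* dec == 1 */
	model_set_next_event(500);         /* next_tb == 600, dec untouched */
	assert(dec == 1 && next_tb == 600);

	tb_now = 101;                      /* the dec fires one tick later */
	model_timer_interrupt();
	return dec;                        /* ticks remaining until next_tb */
}
```

In this model, model_scenario() leaves the decrementer at the 499 ticks remaining until next_tb, so the timer deadline is not lost even though set_dec(evt) was skipped. The model does not cover the NMI window between the pending test and the decrementer write.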
On 05/11/2014 03:55 AM, Benjamin Herrenschmidt wrote:
> On Sat, 2014-05-10 at 21:06 +0530, Preeti U Murthy wrote:
>> On 05/10/2014 09:56 AM, Benjamin Herrenschmidt wrote:
>>> On Fri, 2014-05-09 at 15:22 +0530, Preeti U Murthy wrote:
>>>> in __timer_interrupt() outside the _else_ loop? This will ensure that no
>>>> matter what, before exiting timer interrupt handler we check for pending
>>>> irq work.
>>>
>>> We still need to make sure that set_next_event() doesn't move the
>>> dec beyond the next tick if there is a pending timer... maybe we
>>
>> Sorry, but didn't get this. s/if there is pending timer/if there is
>> pending irq work ?
>
> Yes, sorry :-) That's what I meant.
>
>>> can fix it like this:
>>
>> We can call set_next_event() from events like hrtimer_cancel() or
>> hrtimer_forward() as well. In that case we don't come to
>> decrementer_set_next_event() from __timer_interrupt(). Then, if we race
>> with irq work, we *do not do* a set_dec(1) ( I am referring to the patch
>> below ), we might never set the decrementer to fire immediately right?
>>
>> Or does this scenario never arise?
>
> So my proposed patch handles that no ?
>
> With that patch, we do the set_dec(1) in two cases:
>
>  - The existing arch_irq_work_raise() which is unchanged
>
>  - At the end of __timer_interrupt() if an irq work is still pending
>
> And the patch also makes decrementer_set_next_event() not modify the
> decrementer if an irq work is pending, but *still* adjust next_tb unlike
> what the code does now.
>
> Thus the timer interrupt, when it happens, will re-adjust the dec
> properly using next_tb.
>
> Do we still miss a case ?

I was thinking something like the below in decrementer_set_next_event().
See last line in particular :

-	/* Don't adjust the decrementer if some irq work is pending */
-	if (test_irq_work_pending())
-		return 0;
 	__get_cpu_var(decrementers_next_tb) = get_tb_or_rtc() + evt;
-	set_dec(evt);

-	/* We may have raced with new irq work */
-	if (test_irq_work_pending())
-		set_dec(1);
+	/* Don't adjust the decrementer if some irq work is pending */
+	if (!test_irq_work_pending())
+		set_dec(evt);
+	else
+		set_dec(1);

		^^^^^ your patch currently does not have this explicit
set_dec(1) here. Will that create a problem?

If there is any irq work pending at this point, will someone set the
decrementer to fire immediately after this point? The current code in
decrementer_set_next_event() sets set_dec(1) explicitly in case of
pending irq work.

Regards
Preeti U Murthy
>
> Cheers,
> Ben.
>
>> Regards
>> Preeti U Murthy
>>>
>>> static int decrementer_set_next_event(unsigned long evt,
>>> 				      struct clock_event_device *dev)
>>> {
>>> 	__get_cpu_var(decrementers_next_tb) = get_tb_or_rtc() + evt;
>>>
>>> 	/* Don't adjust the decrementer if some irq work is pending */
>>> 	if (!test_irq_work_pending())
>>> 		set_dec(evt);
>>>
>>> 	return 0;
>>> }
>>>
>>> Along with a single occurrence of:
>>>
>>> 	if (test_irq_work_pending())
>>> 		set_dec(1);
>>>
>>> At the end of __timer_interrupt(), outside if the current else {}
>>> case, this should work, don't you think ?
>>>
>>> What about this completely untested patch ?
>>>
>>> diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
>>> index 122a580..ba7e83b 100644
>>> --- a/arch/powerpc/kernel/time.c
>>> +++ b/arch/powerpc/kernel/time.c
>>> @@ -503,12 +503,13 @@ void __timer_interrupt(void)
>>>  		now = *next_tb - now;
>>>  		if (now <= DECREMENTER_MAX)
>>>  			set_dec((int)now);
>>> -		/* We may have raced with new irq work */
>>> -		if (test_irq_work_pending())
>>> -			set_dec(1);
>>>  		__get_cpu_var(irq_stat).timer_irqs_others++;
>>>  	}
>>>
>>> +	/* We may have raced with new irq work */
>>> +	if (test_irq_work_pending())
>>> +		set_dec(1);
>>> +
>>>  #ifdef CONFIG_PPC64
>>>  	/* collect purr register values often, for accurate calculations */
>>>  	if (firmware_has_feature(FW_FEATURE_SPLPAR)) {
>>> @@ -813,15 +814,11 @@ static void __init clocksource_init(void)
>>>  static int decrementer_set_next_event(unsigned long evt,
>>>  				      struct clock_event_device *dev)
>>>  {
>>> -	/* Don't adjust the decrementer if some irq work is pending */
>>> -	if (test_irq_work_pending())
>>> -		return 0;
>>>  	__get_cpu_var(decrementers_next_tb) = get_tb_or_rtc() + evt;
>>> -	set_dec(evt);
>>>
>>> -	/* We may have raced with new irq work */
>>> -	if (test_irq_work_pending())
>>> -		set_dec(1);
>>> +	/* Don't adjust the decrementer if some irq work is pending */
>>> +	if (!test_irq_work_pending())
>>> +		set_dec(evt);
>>>
>>>  	return 0;
>>>  }
On Sun, 2014-05-11 at 13:45 +0530, Preeti U Murthy wrote:
> +	/* Don't adjust the decrementer if some irq work is pending */
> +	if (!test_irq_work_pending())
> +		set_dec(evt);
> +	else
> +		set_dec(1);
>
> 		^^^^^ your patch currently does not have this explicit
> set_dec(1) here. Will that create a problem?
>
> If there is any irq work pending at this point, will someone set the
> decrementer to fire immediately after this point? The current code in
> decrementer_set_next_event() sets set_dec(1) explicitly in case of
> pending irq work.

Hrm, actually this is an interesting point. The problem isn't that
*someone* will do a set_dec, nobody else should that matters.

The problem is that irq_work can be triggered typically by NMIs or
similar, which means that it might be queued between the
test_irq_work_pending() and the set_dec(), thus causing a race.

So basically Anton's original patch is fine :-) I had missed that
we did a post-set_dec() test already in decrementer_next_event()
so as far as I can tell, removing the pre-test, which is what Anton
does, is really all we need.

Cheers,
Ben.
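The window Ben describes, where an NMI queues irq work between test_irq_work_pending() and set_dec(), can be illustrated with a small user-space model. This is only a sketch under stated assumptions: the sim_* names and the one-shot sim_nmi_hook are hypothetical, and the "NMI" is simply a callback invoked at the point where the real race window would be.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the decrementer and the pending flag. */
static unsigned long sim_dec;
static bool sim_irq_work_pending;
static void (*sim_nmi_hook)(void);  /* one-shot "NMI" injected in the window */

static void sim_set_dec(unsigned long v) { sim_dec = v; }
static bool sim_test_irq_work_pending(void) { return sim_irq_work_pending; }

/* Models arch_irq_work_raise(): set the flag and fire the dec at once. */
static void sim_irq_work_raise(void)
{
	sim_irq_work_pending = true;
	sim_set_dec(1);
}

/* Marks the spot where an NMI could arrive; runs the hook at most once. */
static void sim_window(void)
{
	if (sim_nmi_hook) {
		void (*hook)(void) = sim_nmi_hook;
		sim_nmi_hook = NULL;
		hook();
	}
}

/* Variant A: test first, then program the decrementer (racy). */
static void set_next_event_check_first(unsigned long evt)
{
	if (!sim_test_irq_work_pending()) {
		sim_window();           /* irq work queued right here */
		sim_set_dec(evt);       /* overwrites the set_dec(1) */
	}
}

/* Variant B: program the decrementer, then re-test afterwards. */
static void set_next_event_check_after(unsigned long evt)
{
	sim_window();                   /* irq work queued before the write */
	sim_set_dec(evt);
	if (sim_test_irq_work_pending())
		sim_set_dec(1);         /* the post-test repairs the value */
}

/* Runs one variant with an NMI injected into its window. */
static unsigned long run_with_nmi(void (*fn)(unsigned long),
				  unsigned long evt)
{
	sim_dec = 0;
	sim_irq_work_pending = false;
	sim_nmi_hook = sim_irq_work_raise;
	fn(evt);
	assert(sim_irq_work_pending);   /* the NMI did queue work */
	return sim_dec;
}
```

With an injected NMI, run_with_nmi(set_next_event_check_first, 1000) returns 1000, so the pending irq work would wait a full timer period, while run_with_nmi(set_next_event_check_after, 1000) returns 1. That matches the conclusion in the message above that the post-set_dec() test is the one that matters.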
On 05/11/2014 02:07 PM, Benjamin Herrenschmidt wrote:
> On Sun, 2014-05-11 at 13:45 +0530, Preeti U Murthy wrote:
>> +	/* Don't adjust the decrementer if some irq work is pending */
>> +	if (!test_irq_work_pending())
>> +		set_dec(evt);
>> +	else
>> +		set_dec(1);
>>
>> 		^^^^^ your patch currently does not have this explicit
>> set_dec(1) here. Will that create a problem?
>>
>> If there is any irq work pending at this point, will someone set the
>> decrementer to fire immediately after this point? The current code in
>> decrementer_set_next_event() sets set_dec(1) explicitly in case of
>> pending irq work.
>
> Hrm, actually this is an interesting point. The problem isn't that
> *someone* will do a set_dec, nobody else should that matters.
>
> The problem is that irq_work can be triggered typically by NMIs or
> similar, which means that it might be queued between the
> test_irq_work_pending() and the set_dec(), thus causing a race.
>
> So basically Anton's original patch is fine :-) I had missed that
> we did a post-set_dec() test already in decrementer_next_event()
> so as far as I can tell, removing the pre-test, which is what Anton
> does, is really all we need.

Isn't this patch required too?

@@ -503,12 +503,13 @@ void __timer_interrupt(void)
 		now = *next_tb - now;
 		if (now <= DECREMENTER_MAX)
 			set_dec((int)now);
-		/* We may have raced with new irq work */
-		if (test_irq_work_pending())
-			set_dec(1);
 		__get_cpu_var(irq_stat).timer_irqs_others++;
 	}

+	/* We may have raced with new irq work */
+	if (test_irq_work_pending())
+		set_dec(1);
+

The event_handler cannot be relied upon to call
decrementer_set_next_event() all the time. This is in the case where
there are no pending timers. In that case we need to have the check on
irq work pending at the end of __timer_interrupt() no?

Regards
Preeti U Murthy
>
> Cheers,
> Ben.
On Sun, 2014-05-11 at 14:13 +0530, Preeti U Murthy wrote:
>
> Isn't this patch required too?
>
> @@ -503,12 +503,13 @@ void __timer_interrupt(void)
>  		now = *next_tb - now;
>  		if (now <= DECREMENTER_MAX)
>  			set_dec((int)now);
> -		/* We may have raced with new irq work */
> -		if (test_irq_work_pending())
> -			set_dec(1);
>  		__get_cpu_var(irq_stat).timer_irqs_others++;
>  	}
>
> +	/* We may have raced with new irq work */
> +	if (test_irq_work_pending())
> +		set_dec(1);
> +
>
> The event_handler cannot be relied upon to call
> decrementer_set_next_event() all the time. This is in the case where
> there are no pending timers. In that case we need to have the check on
> irq work pending at the end of __timer_interrupt() no?

I don't think we need to move the test no. If there's a pending
irq_work, at that point, it will have done set_dec when being queued up.
So we only care about cases where we might change the decrementer.

If the event handler doesn't call decrementer_set_next_event() then
nothing will modify the decrementer and it will still trigger soon.

Cheers,
Ben.
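Ben's point above is that only code which writes the decrementer needs to re-check for pending irq work, because queuing the work already does a set_dec(1). A rough user-space sketch of the two event-handler cases follows. All m_* names are hypothetical, and m_set_next_event() models decrementer_set_next_event() with only the post-set_dec() test, which is how the thread describes Anton's patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the decrementer and the pending flag. */
static uint64_t m_dec;
static bool m_pending;

/* Models arch_irq_work_raise(): queuing work itself does set_dec(1). */
static void m_raise(void)
{
	m_pending = true;
	m_dec = 1;
}

/* Models decrementer_set_next_event() with only the post-test kept. */
static void m_set_next_event(uint64_t evt)
{
	m_dec = evt;
	if (m_pending)          /* we may have raced with new irq work */
		m_dec = 1;
}

/* Event handler with no timer to re-arm: it never writes the dec. */
static void m_handler_idle(void)
{
	m_raise();              /* irq work queued while the handler runs */
}

/* Event handler that re-arms a timer after irq work was queued. */
static void m_handler_rearm(void)
{
	m_raise();
	m_set_next_event(5000);
}

/* Runs one handler from a clean state, returns the final dec value. */
static uint64_t m_run(void (*handler)(void))
{
	m_dec = 1000;
	m_pending = false;
	handler();
	return m_dec;
}
```

In both cases the model ends with the decrementer at 1: m_run(m_handler_idle) because nothing overwrote the value set when the work was queued, and m_run(m_handler_rearm) because the only writer re-checks the flag afterwards.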
On 05/11/2014 02:33 PM, Benjamin Herrenschmidt wrote:
> On Sun, 2014-05-11 at 14:13 +0530, Preeti U Murthy wrote:
>>
>> Isn't this patch required too?
>>
>> @@ -503,12 +503,13 @@ void __timer_interrupt(void)
>>  		now = *next_tb - now;
>>  		if (now <= DECREMENTER_MAX)
>>  			set_dec((int)now);
>> -		/* We may have raced with new irq work */
>> -		if (test_irq_work_pending())
>> -			set_dec(1);
>>  		__get_cpu_var(irq_stat).timer_irqs_others++;
>>  	}
>>
>> +	/* We may have raced with new irq work */
>> +	if (test_irq_work_pending())
>> +		set_dec(1);
>> +
>>
>> The event_handler cannot be relied upon to call
>> decrementer_set_next_event() all the time. This is in the case where
>> there are no pending timers. In that case we need to have the check on
>> irq work pending at the end of __timer_interrupt() no?
>
> I don't think we need to move the test no. If there's a pending
> irq_work, at that point, it will have done set_dec when being queued up.
> So we only care about cases where we might change the decrementer.
>
> If the event handler doesn't call decrementer_set_next_event() then
> nothing will modify the decrementer and it will still trigger soon.

Hmm ok. Then Anton's patch covers all cases :)

Thanks!

Reviewed-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>

Regards
Preeti U Murthy
>
> Cheers,
> Ben.
diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index 122a580..ba7e83b 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -503,12 +503,13 @@ void __timer_interrupt(void)
 		now = *next_tb - now;
 		if (now <= DECREMENTER_MAX)
 			set_dec((int)now);
-		/* We may have raced with new irq work */
-		if (test_irq_work_pending())
-			set_dec(1);
 		__get_cpu_var(irq_stat).timer_irqs_others++;
 	}
 
+	/* We may have raced with new irq work */
+	if (test_irq_work_pending())
+		set_dec(1);
+
 #ifdef CONFIG_PPC64
 	/* collect purr register values often, for accurate calculations */
 	if (firmware_has_feature(FW_FEATURE_SPLPAR)) {
@@ -813,15 +814,11 @@ static void __init clocksource_init(void)
 static int decrementer_set_next_event(unsigned long evt,
 				      struct clock_event_device *dev)
 {
-	/* Don't adjust the decrementer if some irq work is pending */
-	if (test_irq_work_pending())
-		return 0;
 	__get_cpu_var(decrementers_next_tb) = get_tb_or_rtc() + evt;
-	set_dec(evt);
 
-	/* We may have raced with new irq work */
-	if (test_irq_work_pending())
-		set_dec(1);
+	/* Don't adjust the decrementer if some irq work is pending */
+	if (!test_irq_work_pending())
+		set_dec(evt);
 
 	return 0;
 }