Message ID | 20211110030416.2085103-1-npiggin@gmail.com (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
Series | [v1] powerpc/watchdog: help remote CPUs to flush NMI printk output | expand |
Hi, > The printk layer at the moment does not seem to have a good way to force > flush printk messages that are created in NMI context, except in the > panic path. > > NMI-context printk messages normally get to the console with irq_work, > but that won't help if the CPU is stuck with irqs disabled, as can be > the case for hard lockup watchdog messages. > > The watchdog currently flushes the printk buffers after detecting a > lockup on remote CPUs, but they may not have processed their NMI IPI > yet by that stage, or they may have self-detected a lockup in which > case they won't go via this NMI IPI path. > > Improve the situation by having NMI-context mark a flag if it called > printk, and have watchdog timer interrupts check if that flag was set > and try to flush if it was. Latency is not a big problem because we > were already stuck for a while, just need to try to make sure the > messages eventually make it out. Initially I was surprised that this doesn't affect the printk code itself, just the powerpc code... > > Cc: Laurent Dufour <ldufour@linux.ibm.com> > Signed-off-by: Nicholas Piggin <npiggin@gmail.com> > --- > This patch is actually based on top of this one which is planned to go > upstream in rc1/2. https://marc.info/?l=linux-kernel&m=163626070312052&w=2 > > Prior to commit 93d102f094be that is fixed by the above, we had a printk > flush function with a different name but basically does the same job, so > this patch can be backported, just needs some care. I'm posting it for > review now for feedback so it's ready to go when the printk patch is > upstream. > > Thanks, > Nick > > arch/powerpc/kernel/watchdog.c | 29 +++++++++++++++++++++++------ > 1 file changed, 23 insertions(+), 6 deletions(-) > > diff --git a/arch/powerpc/kernel/watchdog.c b/arch/powerpc/kernel/watchdog.c > index b6533539386b..a7b6b0691203 100644 > --- a/arch/powerpc/kernel/watchdog.c > +++ b/arch/powerpc/kernel/watchdog.c > @@ -86,6 +86,7 @@ static DEFINE_PER_CPU(u64, wd_timer_tb); > /* SMP checker bits */ > static unsigned long __wd_smp_lock; > static unsigned long __wd_reporting; > +static unsigned long __wd_nmi_output; > static cpumask_t wd_smp_cpus_pending; > static cpumask_t wd_smp_cpus_stuck; > static u64 wd_smp_last_reset_tb; > @@ -154,6 +155,18 @@ static void wd_lockup_ipi(struct pt_regs *regs) > else > dump_stack(); > > + /* > + * We printk()ed from NMI context, the contents may not get flushed > + * if we return to a context with interrupts disabled because > + * printk uses irq_work to schedule flushes of NMI output. > + * __wd_nmi_output says there has been output from NMI context, so > + * other CPUs are recruited to help flush it. > + * > + * xchg is not needed here (it could be a simple atomic store), but > + * it gives the memory ordering and atomicity required. > + */ > + xchg(&__wd_nmi_output, 1); > + > /* Do not panic from here because that can recurse into NMI IPI layer */ > } I think, looking at this and the other site where __wd_nmi_output is set, that this works because you set the flag only when you are done printing from the non-panic lockup context on this CPU. I was initially worried that you set this flag part way through printing, and then it might get cleared by another CPU while you're still trying to print. However, in this function it's right at the end - there's nothing else left to do, and ... > DEFINE_INTERRUPT_HANDLER_NMI(soft_nmi_interrupt) > @@ -386,6 +401,8 @@ DEFINE_INTERRUPT_HANDLER_NMI(soft_nmi_interrupt) > print_irqtrace_events(current); > show_regs(regs); > > + xchg(&__wd_nmi_output, 1); // see wd_lockup_ipi > + > if (sysctl_hardlockup_all_cpu_backtrace) > trigger_allbutself_cpu_backtrace(); in this one, the only things that can happen afterwards are - a panic, which does its own flushing, and - trigger_allbutself_cpu_backtrace(), which seems to just send IPIs, not do any printing of its own. All of which is fine, but I wonder if we need a more descriptive variable name or if the comment needs to specify that the flag should only be set at certain times. Kind regards, Daniel
Excerpts from Daniel Axtens's message of November 12, 2021 4:08 pm: > Hi, > >> The printk layer at the moment does not seem to have a good way to force >> flush printk messages that are created in NMI context, except in the >> panic path. >> >> NMI-context printk messages normally get to the console with irq_work, >> but that won't help if the CPU is stuck with irqs disabled, as can be >> the case for hard lockup watchdog messages. >> >> The watchdog currently flushes the printk buffers after detecting a >> lockup on remote CPUs, but they may not have processed their NMI IPI >> yet by that stage, or they may have self-detected a lockup in which >> case they won't go via this NMI IPI path. >> >> Improve the situation by having NMI-context mark a flag if it called >> printk, and have watchdog timer interrupts check if that flag was set >> and try to flush if it was. Latency is not a big problem because we >> were already stuck for a while, just need to try to make sure the >> messages eventually make it out. Sorry just coming back to this now the printk fix was merged upstream. > Initially I was surprised that this doesn't affect the printk code > itself, just the powerpc code... Yeah I'm actually not sure how other watchdogs handle this. If they use nmi_trigger_cpumask_backtrace() then that does have a flush. >> Cc: Laurent Dufour <ldufour@linux.ibm.com> >> Signed-off-by: Nicholas Piggin <npiggin@gmail.com> >> --- >> This patch is actually based on top of this one which is planned to go >> upstream in rc1/2. https://marc.info/?l=linux-kernel&m=163626070312052&w=2 >> >> Prior to commit 93d102f094be that is fixed by the above, we had a printk >> flush function with a different name but basically does the same job, so >> this patch can be backported, just needs some care. I'm posting it for >> review now for feedback so it's ready to go when the printk patch is >> upstream. >> >> Thanks, >> Nick >> >> arch/powerpc/kernel/watchdog.c | 29 +++++++++++++++++++++++------ >> 1 file changed, 23 insertions(+), 6 deletions(-) >> >> diff --git a/arch/powerpc/kernel/watchdog.c b/arch/powerpc/kernel/watchdog.c >> index b6533539386b..a7b6b0691203 100644 >> --- a/arch/powerpc/kernel/watchdog.c >> +++ b/arch/powerpc/kernel/watchdog.c >> @@ -86,6 +86,7 @@ static DEFINE_PER_CPU(u64, wd_timer_tb); >> /* SMP checker bits */ >> static unsigned long __wd_smp_lock; >> static unsigned long __wd_reporting; >> +static unsigned long __wd_nmi_output; >> static cpumask_t wd_smp_cpus_pending; >> static cpumask_t wd_smp_cpus_stuck; >> static u64 wd_smp_last_reset_tb; >> @@ -154,6 +155,18 @@ static void wd_lockup_ipi(struct pt_regs *regs) >> else >> dump_stack(); >> >> + /* >> + * We printk()ed from NMI context, the contents may not get flushed >> + * if we return to a context with interrupts disabled because >> + * printk uses irq_work to schedule flushes of NMI output. >> + * __wd_nmi_output says there has been output from NMI context, so >> + * other CPUs are recruited to help flush it. >> + * >> + * xchg is not needed here (it could be a simple atomic store), but >> + * it gives the memory ordering and atomicity required. >> + */ >> + xchg(&__wd_nmi_output, 1); >> + >> /* Do not panic from here because that can recurse into NMI IPI layer */ >> } > > I think, looking at this and the other site where __wd_nmi_output is > set, that this works because you set the flag only when you are done > printing from the non-panic lockup context on this CPU. I was initially > worried that you set this flag part way through printing, and then it > might get cleared by another CPU while you're still trying to print. > However, in this function it's right at the end - there's nothing else > left to do, and ... > >> DEFINE_INTERRUPT_HANDLER_NMI(soft_nmi_interrupt) >> @@ -386,6 +401,8 @@ DEFINE_INTERRUPT_HANDLER_NMI(soft_nmi_interrupt) >> print_irqtrace_events(current); >> show_regs(regs); >> >> + xchg(&__wd_nmi_output, 1); // see wd_lockup_ipi >> + >> if (sysctl_hardlockup_all_cpu_backtrace) >> trigger_allbutself_cpu_backtrace(); > > in this one, the only things that can happen afterwards are > - a panic, which does its own flushing, and > > - trigger_allbutself_cpu_backtrace(), which seems to just send IPIs, not > do any printing of its own. Yeah, on powerpc that actually ends up not using NMI IPIs because that's dangerous and core code uses it for less severe things than a hard lockup so we disabled it, so if you can take a regular IPI then you can run irq work when it returns and print things. If we changed it to use NMI IPIs I guess we might want to do something, maybe. > All of which is fine, but I wonder if we need a more descriptive > variable name or if the comment needs to specify that the flag should > only be set at certain times. I did try to add some explanation in that comment. I'll see if that can be made a bit more readable and expand on it a bit. Thanks, Nick
diff --git a/arch/powerpc/kernel/watchdog.c b/arch/powerpc/kernel/watchdog.c index b6533539386b..a7b6b0691203 100644 --- a/arch/powerpc/kernel/watchdog.c +++ b/arch/powerpc/kernel/watchdog.c @@ -86,6 +86,7 @@ static DEFINE_PER_CPU(u64, wd_timer_tb); /* SMP checker bits */ static unsigned long __wd_smp_lock; static unsigned long __wd_reporting; +static unsigned long __wd_nmi_output; static cpumask_t wd_smp_cpus_pending; static cpumask_t wd_smp_cpus_stuck; static u64 wd_smp_last_reset_tb; @@ -154,6 +155,18 @@ static void wd_lockup_ipi(struct pt_regs *regs) else dump_stack(); + /* + * We printk()ed from NMI context, the contents may not get flushed + * if we return to a context with interrupts disabled because + * printk uses irq_work to schedule flushes of NMI output. + * __wd_nmi_output says there has been output from NMI context, so + * other CPUs are recruited to help flush it. + * + * xchg is not needed here (it could be a simple atomic store), but + * it gives the memory ordering and atomicity required. + */ + xchg(&__wd_nmi_output, 1); + /* Do not panic from here because that can recurse into NMI IPI layer */ } @@ -227,12 +240,6 @@ static void watchdog_smp_panic(int cpu) cpumask_clear(&wd_smp_cpus_ipi); } - /* - * Force flush any remote buffers that might be stuck in IRQ context - * and therefore could not run their irq_work. - */ - printk_trigger_flush(); - if (hardlockup_panic) nmi_panic(NULL, "Hard LOCKUP"); @@ -337,6 +344,14 @@ static void watchdog_timer_interrupt(int cpu) if ((s64)(tb - wd_smp_last_reset_tb) >= (s64)wd_smp_panic_timeout_tb) watchdog_smp_panic(cpu); + + if (__wd_nmi_output && xchg(&__wd_nmi_output, 0)) { + /* + * This triggers a flush to try get stuck NMI printk output + * from remote CPUs out to the console. See wd_lockup_ipi. + */ + printk_trigger_flush(); + } } DEFINE_INTERRUPT_HANDLER_NMI(soft_nmi_interrupt) @@ -386,6 +401,8 @@ DEFINE_INTERRUPT_HANDLER_NMI(soft_nmi_interrupt) print_irqtrace_events(current); show_regs(regs); + xchg(&__wd_nmi_output, 1); // see wd_lockup_ipi + if (sysctl_hardlockup_all_cpu_backtrace) trigger_allbutself_cpu_backtrace();
The printk layer at the moment does not seem to have a good way to force flush printk messages that are created in NMI context, except in the panic path. NMI-context printk messages normally get to the console with irq_work, but that won't help if the CPU is stuck with irqs disabled, as can be the case for hard lockup watchdog messages. The watchdog currently flushes the printk buffers after detecting a lockup on remote CPUs, but they may not have processed their NMI IPI yet by that stage, or they may have self-detected a lockup in which case they won't go via this NMI IPI path. Improve the situation by having NMI-context mark a flag if it called printk, and have watchdog timer interrupts check if that flag was set and try to flush if it was. Latency is not a big problem because we were already stuck for a while, just need to try to make sure the messages eventually make it out. Cc: Laurent Dufour <ldufour@linux.ibm.com> Signed-off-by: Nicholas Piggin <npiggin@gmail.com> --- This patch is actually based on top of this one which is planned to go upstream in rc1/2. https://marc.info/?l=linux-kernel&m=163626070312052&w=2 Prior to commit 93d102f094be that is fixed by the above, we had a printk flush function with a different name but basically does the same job, so this patch can be backported, just needs some care. I'm posting it for review now for feedback so it's ready to go when the printk patch is upstream. Thanks, Nick arch/powerpc/kernel/watchdog.c | 29 +++++++++++++++++++++++------ 1 file changed, 23 insertions(+), 6 deletions(-)