Message ID | 1547657264-28761-1-git-send-email-leitao@debian.org (mailing list archive) |
---|---|
State | Accepted |
Commit | e620d45065c7b5b8d6ae11217c09c09380103b83 |
Headers | show |
Series | powerpc/tm: Avoid machine crash on rt_sigreturn | expand |
Context | Check | Description |
---|---|---|
snowpatch_ozlabs/apply_patch | success | next/apply_patch Successfully applied |
snowpatch_ozlabs/build-ppc64le | success | build succeeded & removed 0 sparse warning(s) |
snowpatch_ozlabs/build-ppc64be | success | build succeeded & removed 0 sparse warning(s) |
snowpatch_ozlabs/build-ppc64e | success | build succeeded & removed 0 sparse warning(s) |
snowpatch_ozlabs/build-pmac32 | success | build succeeded & removed 0 sparse warning(s) |
snowpatch_ozlabs/checkpatch | warning | total: 0 errors, 1 warnings, 0 checks, 39 lines checked |
On Wed, 2019-01-16 at 14:47 -0200, Breno Leitao wrote: > There is a kernel crash that happens if rt_sigreturn is called inside a > transactional block. > > This crash happens if the kernel hits an in-kernel page fault when > accessing userspace memory, usually through copy_ckvsx_to_user(). A major > page fault calls might_sleep() function, which can cause a task reschedule. > A task reschedule (switch_to()) reclaim and recheckpoint the TM states, > but, in the signal return path, the checkpointed memory was already > reclaimed, thus the exception stack has MSR that points to MSR[TS]=0. > > When the code returns from might_sleep() and a task reschedule happened, > then this task is returned with the memory recheckpointed, and > CPU MSR[TS] = suspended. > > This means that there is a side effect at might_sleep() if it is called > with CPU MSR[TS] = 0 and the task has regs->msr[TS] != 0. > > This side effect can cause a TM bad thing, since at the exception entrance, > the stack saves MSR[TS]=0, and this is what will be used at RFID, but, > the processor has MSR[TS] = Suspended, and this transition will be invalid > and a TM Bad thing will be raised, causing the following crash: > > Unexpected TM Bad Thing exception at c00000000000e9ec (msr > 0x8000000302a03031) tm_scratch=800000010280b033 > cpu 0xc: Vector: 700 (Program Check) at [c00000003ff1fd70] > pc: c00000000000e9ec: fast_exception_return+0x100/0x1bc > lr: c000000000032948: handle_rt_signal64+0xb8/0xaf0 > sp: c0000004263ebc40 > msr: 8000000302a03031 > current = 0xc000000415050300 > paca = 0xc00000003ffc4080 irqmask: 0x03 irq_happened: 0x01 > pid = 25006, comm = sigfuz > Linux version 5.0.0-rc1-00001-g3bd6e94bec12 (breno@debian) (gcc version > 8.2.0 (Debian 8.2.0-3)) #899 SMP Mon Jan 7 11:30:07 EST 2019 > WARNING: exception is not recoverable, can't continue > enter ? for help > [c0000004263ebc40] c000000000032948 handle_rt_signal64+0xb8/0xaf0 > (unreliable) > [c0000004263ebd30] c000000000022780 do_notify_resume+0x2f0/0x430 > [c0000004263ebe20] c00000000000e844 ret_from_except_lite+0x70/0x74 > --- Exception: c00 (System Call) at 00007fffbaac400c > SP (7fffeca90f40) is in userspace > > The solution for this problem is running the sigreturn code with > regs->msr[TS] disabled, thus, avoiding hitting the side effect above. This > does not seem to be a problem since regs->msr will be replaced by the > ucontext value, so, it is being flushed already. In this case, it is > flushed earlier. > > Signed-off-by: Breno Leitao <leitao@debian.org> Acked-by: Michael Neuling <mikey@neuling.org> This still applies on powerpc/next so just acking rather than reposting > --- > arch/powerpc/kernel/signal_64.c | 27 ++++++++++++++++++++++++++- > 1 file changed, 26 insertions(+), 1 deletion(-) > > diff --git a/arch/powerpc/kernel/signal_64.c b/arch/powerpc/kernel/signal_64.c > index 6794466f6420..06c299ef6132 100644 > --- a/arch/powerpc/kernel/signal_64.c > +++ b/arch/powerpc/kernel/signal_64.c > @@ -565,7 +565,7 @@ static long restore_tm_sigcontexts(struct task_struct > *tsk, > preempt_disable(); > > /* pull in MSR TS bits from user context */ > - regs->msr = (regs->msr & ~MSR_TS_MASK) | (msr & MSR_TS_MASK); > + regs->msr |= msr & MSR_TS_MASK; > > /* > * Ensure that TM is enabled in regs->msr before we leave the signal > @@ -745,6 +745,31 @@ SYSCALL_DEFINE0(rt_sigreturn) > if (MSR_TM_SUSPENDED(mfmsr())) > tm_reclaim_current(0); > > + /* > + * Disable MSR[TS] bit also, so, if there is an exception in the > + * code below (as a page fault in copy_ckvsx_to_user()), it does > + * not recheckpoint this task if there was a context switch inside > + * the exception. > + * > + * A major page fault can indirectly call schedule(). A reschedule > + * process in the middle of an exception can have a side effect > + * (Changing the CPU MSR[TS] state), since schedule() is called > + * with the CPU MSR[TS] disable and returns with MSR[TS]=Suspended > + * (switch_to() calls tm_recheckpoint() for the 'new' process). In > + * this case, the process continues to be the same in the CPU, but > + * the CPU state just changed. > + * > + * This can cause a TM Bad Thing, since the MSR in the stack will > + * have the MSR[TS]=0, and this is what will be used to RFID. > + * > + * Clearing MSR[TS] state here will avoid a recheckpoint if there > + * is any process reschedule in kernel space. The MSR[TS] state > + * does not need to be saved also, since it will be replaced with > + * the MSR[TS] that came from user context later, at > + * restore_tm_sigcontexts. > + */ > + regs->msr &= ~MSR_TS_MASK; > + > if (__get_user(msr, &uc->uc_mcontext.gp_regs[PT_MSR])) > goto badframe; > if (MSR_TM_ACTIVE(msr)) {
diff --git a/arch/powerpc/kernel/signal_64.c b/arch/powerpc/kernel/signal_64.c index 6794466f6420..06c299ef6132 100644 --- a/arch/powerpc/kernel/signal_64.c +++ b/arch/powerpc/kernel/signal_64.c @@ -565,7 +565,7 @@ static long restore_tm_sigcontexts(struct task_struct *tsk, preempt_disable(); /* pull in MSR TS bits from user context */ - regs->msr = (regs->msr & ~MSR_TS_MASK) | (msr & MSR_TS_MASK); + regs->msr |= msr & MSR_TS_MASK; /* * Ensure that TM is enabled in regs->msr before we leave the signal @@ -745,6 +745,31 @@ SYSCALL_DEFINE0(rt_sigreturn) if (MSR_TM_SUSPENDED(mfmsr())) tm_reclaim_current(0); + /* + * Disable MSR[TS] bit also, so, if there is an exception in the + * code below (as a page fault in copy_ckvsx_to_user()), it does + * not recheckpoint this task if there was a context switch inside + * the exception. + * + * A major page fault can indirectly call schedule(). A reschedule + * process in the middle of an exception can have a side effect + * (Changing the CPU MSR[TS] state), since schedule() is called + * with the CPU MSR[TS] disable and returns with MSR[TS]=Suspended + * (switch_to() calls tm_recheckpoint() for the 'new' process). In + * this case, the process continues to be the same in the CPU, but + * the CPU state just changed. + * + * This can cause a TM Bad Thing, since the MSR in the stack will + * have the MSR[TS]=0, and this is what will be used to RFID. + * + * Clearing MSR[TS] state here will avoid a recheckpoint if there + * is any process reschedule in kernel space. The MSR[TS] state + * does not need to be saved also, since it will be replaced with + * the MSR[TS] that came from user context later, at + * restore_tm_sigcontexts. + */ + regs->msr &= ~MSR_TS_MASK; + if (__get_user(msr, &uc->uc_mcontext.gp_regs[PT_MSR])) goto badframe; if (MSR_TM_ACTIVE(msr)) {
There is a kernel crash that happens if rt_sigreturn is called inside a transactional block. This crash happens if the kernel hits an in-kernel page fault when accessing userspace memory, usually through copy_ckvsx_to_user(). A major page fault calls might_sleep() function, which can cause a task reschedule. A task reschedule (switch_to()) reclaim and recheckpoint the TM states, but, in the signal return path, the checkpointed memory was already reclaimed, thus the exception stack has MSR that points to MSR[TS]=0. When the code returns from might_sleep() and a task reschedule happened, then this task is returned with the memory recheckpointed, and CPU MSR[TS] = suspended. This means that there is a side effect at might_sleep() if it is called with CPU MSR[TS] = 0 and the task has regs->msr[TS] != 0. This side effect can cause a TM bad thing, since at the exception entrance, the stack saves MSR[TS]=0, and this is what will be used at RFID, but, the processor has MSR[TS] = Suspended, and this transition will be invalid and a TM Bad thing will be raised, causing the following crash: Unexpected TM Bad Thing exception at c00000000000e9ec (msr 0x8000000302a03031) tm_scratch=800000010280b033 cpu 0xc: Vector: 700 (Program Check) at [c00000003ff1fd70] pc: c00000000000e9ec: fast_exception_return+0x100/0x1bc lr: c000000000032948: handle_rt_signal64+0xb8/0xaf0 sp: c0000004263ebc40 msr: 8000000302a03031 current = 0xc000000415050300 paca = 0xc00000003ffc4080 irqmask: 0x03 irq_happened: 0x01 pid = 25006, comm = sigfuz Linux version 5.0.0-rc1-00001-g3bd6e94bec12 (breno@debian) (gcc version 8.2.0 (Debian 8.2.0-3)) #899 SMP Mon Jan 7 11:30:07 EST 2019 WARNING: exception is not recoverable, can't continue enter ? for help [c0000004263ebc40] c000000000032948 handle_rt_signal64+0xb8/0xaf0 (unreliable) [c0000004263ebd30] c000000000022780 do_notify_resume+0x2f0/0x430 [c0000004263ebe20] c00000000000e844 ret_from_except_lite+0x70/0x74 --- Exception: c00 (System Call) at 00007fffbaac400c SP (7fffeca90f40) is in userspace The solution for this problem is running the sigreturn code with regs->msr[TS] disabled, thus, avoiding hitting the side effect above. This does not seem to be a problem since regs->msr will be replaced by the ucontext value, so, it is being flushed already. In this case, it is flushed earlier. Signed-off-by: Breno Leitao <leitao@debian.org> --- arch/powerpc/kernel/signal_64.c | 27 ++++++++++++++++++++++++++- 1 file changed, 26 insertions(+), 1 deletion(-)