Message ID | 201407271519.22272.cat.schulze@alice-dsl.net |
---|---|
State | Changes Requested |
Delegated to: | David Miller |
Headers | show |
From: Christopher Alexander Tobias Schulze <cat.schulze@alice-dsl.net> Date: Sun, 27 Jul 2014 15:19:21 +0200 Again, please format your subject line properly, the subsystem prefix here could be "sunsab: ". > Handling of a BREAK on a serial line is currently broken in the sunsab serial > driver. Due to this defect, it is not possible to break to the firmware prompt > or trigger a Magic SysRq on these machines (which would be quite useful for > debugging kernel problems on headless machines). > > This problem is caused by the fact that the count of characters to be processed > might be 0 when a BREAK interrupt is triggered. The code in the driver contains > a loop that processes the pending characters - with count set to 0 no processing > takes place at all. Unfortunately, the call of uart_handle_break() is contained > inside this loop and therefore skipped completely. > > This patch moves the check for BREAK status in front of the loop and executes it > only when count == 0, i.e. when it would not be done inside the loop. > > With this patch, Magic SysRq works again. However, there seems to be another > problem that causes an Oops when a register dump is requested by SysRq. This > is caused when perf_event_print_debug() is called from sysrq_handle_showregs() > in drivers/tty/sysrq.c. This patch provides no solution for this problem; my > workaround was to eliminate the call to perf_event_print_debug() in > sysrq_handle_showregs(). This problem was observed on a SunBlade 2000 with an > UltraSPARC III Cu CPU and might be specific to this CPU model (the UltraSPARC III > is the first CPU to support perf_event.). > > The patch was originally developed against a 3.13 backport kernel from Debian. > Both the patch against 3.13 and a recent 3.16-rc6 are included below. Please note > that I could only test the 3.13 version as I do not have access to the affected > machine anymore. Please provide a proper "Singed-off-by: " as per Documentation/SubmittingPatches. > + if(count == 0) { Space between "if" and "(". > + if(unlikely(stat->sreg.isr1 & SAB82532_ISR1_BRK)) { Likewise. -- To unsubscribe from this list: send the line "unsubscribe sparclinux" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
From: Christopher Alexander Tobias Schulze <cat.schulze@alice-dsl.net> Date: Sun, 27 Jul 2014 15:19:21 +0200 > However, there seems to be another problem that causes an Oops when > a register dump is requested by SysRq. This is caused when > perf_event_print_debug() is called from sysrq_handle_showregs() in > drivers/tty/sysrq.c. Did you save the OOPS? It should be trivial to debug with that information. perf_event_print_debug() doesn't do anything fancy, it disables CPU interrupts and reads the PCR and PIC registers. UltraSPARC-III should be using direct_pcr_ops for this, therefore: __asm__ __volatile__("rd %%pcr, %0" : "=r" (val)); and __asm__ __volatile__("rd %%pic, %0" : "=r" (val)); are the two operations it will perform. -- To unsubscribe from this list: send the line "unsubscribe sparclinux" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tuesday 29 July 2014 00:54, David Miller wrote: > From: Christopher Alexander Tobias Schulze <cat.schulze@alice-dsl.net> > Date: Sun, 27 Jul 2014 15:19:21 +0200 > > > However, there seems to be another problem that causes an Oops when > > a register dump is requested by SysRq. This is caused when > > perf_event_print_debug() is called from sysrq_handle_showregs() in > > drivers/tty/sysrq.c. > > Did you save the OOPS? It should be trivial to debug with that > information. We had a serial console attached to the SunBlade during debugging and development of the patches, and serial output was logged to a file. I think we should therefore have captured the OOPS. Norman Beck (in CC) should be able to extract the OOPS from this log and send it to you. (I do not have access to this SunBlade anymore since July.) > perf_event_print_debug() doesn't do anything fancy, it disables CPU > interrupts and reads the PCR and PIC registers. > > UltraSPARC-III should be using direct_pcr_ops for this, therefore: > > __asm__ __volatile__("rd %%pcr, %0" : "=r" (val)); > > and > > __asm__ __volatile__("rd %%pic, %0" : "=r" (val)); > > are the two operations it will perform. IIRC it was a null pointer dereference in perf_event_print_debug(), but I am not quite sure, as we quickly disabled this call to continue working (getting BREAK on the serial console to work was top priority then). We will have to wait for Norman to retrieve this information - he is currently on vacation, however, so it will take another week. Regards, Alexander Schulze -- To unsubscribe from this list: send the line "unsubscribe sparclinux" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
From: Christopher Alexander Tobias Schulze <cat.schulze@alice-dsl.net> Date: Sun, 3 Aug 2014 14:25:35 +0200 > IIRC it was a null pointer dereference in perf_event_print_debug(), > but I am not quite sure, as we quickly disabled this call to continue > working (getting BREAK on the serial console to work was top priority > then). We will have to wait for Norman to retrieve this information - > he is currently on vacation, however, so it will take another week. If it's in perf_event_print_debug() there isn't much to check. sparc_pmu is derefernced in the loops, but the first thing this function does is check it against NULL and return immediately if it is NULL. pcr_ops is dereferenced also, but pcr_ops should never be NULL if sparc_pmu is not NULL. Hmmm... that's setup by pcr_arch_init() which I only see being invoked by smp_cpus_done(). Is this a uniprocessor build? -- To unsubscribe from this list: send the line "unsubscribe sparclinux" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tuesday 05 August 2014 01:04:23 David Miller wrote: > From: Christopher Alexander Tobias Schulze <cat.schulze@alice-dsl.net> > Date: Sun, 3 Aug 2014 14:25:35 +0200 > > > IIRC it was a null pointer dereference in perf_event_print_debug(), > > but I am not quite sure, as we quickly disabled this call to continue > > working (getting BREAK on the serial console to work was top priority > > then). We will have to wait for Norman to retrieve this information - > > he is currently on vacation, however, so it will take another week. > > If it's in perf_event_print_debug() there isn't much to check. > > sparc_pmu is derefernced in the loops, but the first thing this > function does is check it against NULL and return immediately > if it is NULL. > > pcr_ops is dereferenced also, but pcr_ops should never be NULL > if sparc_pmu is not NULL. > > Hmmm... that's setup by pcr_arch_init() which I only see being > invoked by smp_cpus_done(). > > Is this a uniprocessor build? > > My colleague has returned from his vacation and provided me the serial console logs which contain the stacktrace for the crash in perf_event_print_debug() that was triggered by the register dump request using SysRq: [ 192.934865] Unable to handle kernel NULL pointer dereference [ 193.002524] tsk->{mm,active_mm}->context = 0000000000001009 [ 193.069192] tsk->{mm,active_mm}->pgd = fffffc003aec2000 [ 193.131694] \|/ ____ \|/ [ 193.131694] "@'/ .. \`@" [ 193.131694] /_| \__/ |_\ [ 193.131694] \__U_/ [ 193.307769] swapper(0): Oops [#1] [ 193.347330] CPU: 0 PID: 0 Comm: swapper Not tainted 3.13.10 #4 [ 193.417122] task: 000000000092d7e0 ti: 0000000000918000 task.ti: 0000000000918000 [ 193.506717] TSTATE: 0000009980e01604 TPC: 000000000044d630 TNPC: 000000000044d634 Y: 00000000 Not tainted [ 193.624441] TPC: <perf_event_print_debug+0x50/0x100> [ 193.683796] g0: 0000000001016460 g1: 0000000000000000 g2: 0000000000000007 g3: 0000000000000001 [ 193.787975] g4: 000000000092d7e0 g5: 0000000000000038 g6: 0000000000918000 g7: 0000000000000720 [ 193.892144] o0: 0000000000000000 o1: 000000000089d910 o2: 0000000000000000 o3: 0000000000001fff [ 193.996313] o4: 000000000089d8a0 o5: 0000000000000010 sp: fffffc003febb231 ret_pc: 000000000044d610 [ 194.104653] RPC: <perf_event_print_debug+0x30/0x100> [ 194.164018] l0: 0000000000001000 l1: 00000000ffff5511 l2: 00000000004208b0 l3: 0000000000000000 [ 194.268197] l4: 0000000000001000 l5: 0000000000000004 l6: 0000000000918000 l7: 0000000080001004 [ 194.372368] i0: 0000000000000070 i1: 00000000009cd800 i2: 000000000089d300 i3: 00000000009240d0 [ 194.476537] i4: 000000000000000e i5: 0000000000000001 i6: fffffc003febb2f1 i7: 000000000068b2a0 [ 194.580715] I7: <__handle_sysrq+0xc0/0x1a0> [ 194.630694] Call Trace: [ 194.659865] [000000000068b2a0] __handle_sysrq+0xc0/0x1a0 [ 194.724458] [00000000006a5908] sunsab_interrupt+0x468/0x8a0 [ 194.792165] [0000000000498d18] handle_irq_event_percpu+0x78/0x220 [ 194.866125] [0000000000498ee8] handle_irq_event+0x28/0x60 [ 194.931754] [000000000049b7dc] handle_fasteoi_irq+0xfc/0x1a0 [ 195.000503] [0000000000498560] generic_handle_irq+0x40/0x60 [ 195.068214] [000000000042b5d4] handler_irq+0x94/0xc0 [ 195.128636] [00000000004208b4] tl0_irq5+0x14/0x20 [ 195.185924] [000000000042bbfc] arch_cpu_idle+0x5c/0x80 [ 195.248426] [000000000049843c] cpu_startup_entry+0x19c/0x240 [ 195.317179] [000000000099a988] start_kernel+0x4f0/0x520 [ 195.380724] [00000000007e74ac] tlb_fixup_done+0x80/0x94 [ 195.444263] [0000000000000000] (null) [ 195.499473] Disabling lock debugging due to kernel taint [ 195.563022] Caller[000000000068b2a0]: __handle_sysrq+0xc0/0x1a0 [ 195.633855] Caller[00000000006a5908]: sunsab_interrupt+0x468/0x8a0 [ 195.707816] Caller[0000000000498d18]: handle_irq_event_percpu+0x78/0x220 [ 195.788028] Caller[0000000000498ee8]: handle_irq_event+0x28/0x60 [ 195.859905] Caller[000000000049b7dc]: handle_fasteoi_irq+0xfc/0x1a0 [ 195.934905] Caller[0000000000498560]: generic_handle_irq+0x40/0x60 [ 196.008866] Caller[000000000042b5d4]: handler_irq+0x94/0xc0 [ 196.075534] Caller[00000000004208b4]: tl0_irq5+0x14/0x20 [ 196.139079] Caller[000000000042bbb8]: arch_cpu_idle+0x18/0x80 [ 196.207831] Caller[000000000049843c]: cpu_startup_entry+0x19c/0x240 [ 196.282833] Caller[000000000099a988]: start_kernel+0x4f0/0x520 [ 196.352626] Caller[00000000007e74ac]: tlb_fixup_done+0x80/0x94 [ 196.422418] Caller[0000000000000000]: (null) [ 196.483877] Instruction DUMP: 953f6000 ba076001 9010000a <c2584000> 9fc04000 d477a7f7 d45fa7f7 96100008 92102000 [ 196.613045] Kernel panic - not syncing: Aiee, killing interrupt handler! [ 196.693264] Press Stop-A (L1-A) to return to the boot prom Unfortunately, this is less information than I hoped: the only new information is [ 193.624441] TPC: <perf_event_print_debug+0x50/0x100> [ 194.104653] RPC: <perf_event_print_debug+0x30/0x100> and I am not able to provide any further details (e.g., disassembly of the code at that location). The machine is a uniprocessor system. The kernel configuration was taken from Debian and slightly modified, but I am not sure whether it was based on an SMP kernel package or not (I would guess that it was a uniprocessor kernel, however). I hope this can provide at least some further information to you. Regards, Alexander Schulze -- To unsubscribe from this list: send the line "unsubscribe sparclinux" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
From: Christopher Alexander Tobias Schulze <cat.schulze@alice-dsl.net> Date: Mon, 11 Aug 2014 19:49:48 +0200 > [ 196.483877] Instruction DUMP: 953f6000 ba076001 9010000a <c2584000> 9fc04000 d477a7f7 d45fa7f7 96100008 92102000 ... > and I am not able to provide any further details (e.g., disassembly of the code > at that location). The instruction dump is in the OOPS, as quoted above, and this disassembles to: 0: 95 3f 60 00 sra %i5, 0, %o2 4: ba 07 60 01 inc %i5 8: 90 10 00 0a mov %o2, %o0 crash_pc: c: c2 58 40 00 ldx [ %g1 ], %g1 10: 9f c0 40 00 call %g1 14: d4 77 a7 f7 stx %o2, [ %fp + 0x7f7 ] 18: d4 5f a7 f7 ldx [ %fp + 0x7f7 ], %o2 1c: 96 10 00 08 mov %o0, %o3 20: 92 10 20 00 clr %o1 This is an indirect call, and the only indirect calls in perf_event_print_debug() are the pcr_ops->() method invocations, that is what we are executing here, therefore pcr_ops is NULL at this location. It confirms my suspicions and I'll therefore work on a fix, thanks! -- To unsubscribe from this list: send the line "unsubscribe sparclinux" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff -Naupr linux-source-3.13-orig/drivers/tty/serial/sunsab.c linux-source-3.13-patched/drivers/tty/serial/sunsab.c --- linux-source-3.13-orig/drivers/tty/serial/sunsab.c 2014-04-14 15:48:24.000000000 +0200 +++ linux-source-3.13-patched/drivers/tty/serial/sunsab.c 2014-07-27 14:29:58.000000000 +0200 @@ -157,6 +157,14 @@ receive_chars(struct uart_sunsab_port *u (up->port.line == up->port.cons->index)) saw_console_brk = 1; + if(count == 0) { + if(unlikely(stat->sreg.isr1 & SAB82532_ISR1_BRK)) { + stat->sreg.isr0 &= ~(SAB82532_ISR0_PERR | + SAB82532_ISR0_FERR); + uart_handle_break(&up->port); + } + } + for (i = 0; i < count; i++) { unsigned char ch = buf[i], flag; PATCH 2 - KERNEL VERSION 3.16 ##################################################### diff -Naupr linux-3.16-rc6-orig/drivers/tty/serial/sunsab.c linux-3.16-rc6-patched/drivers/tty/serial/sunsab.c --- linux-3.16-rc6-orig/drivers/tty/serial/sunsab.c 2014-07-27 11:50:32.000000000 +0200 +++ linux-3.16-rc6-patched/drivers/tty/serial/sunsab.c 2014-07-27 14:28:12.000000000 +0200 @@ -157,6 +157,14 @@ receive_chars(struct uart_sunsab_port *u (up->port.line == up->port.cons->index)) saw_console_brk = 1; + if(count == 0) { + if(unlikely(stat->sreg.isr1 & SAB82532_ISR1_BRK)) { + stat->sreg.isr0 &= ~(SAB82532_ISR0_PERR | + SAB82532_ISR0_FERR); + uart_handle_break(&up->port); + } + } + for (i = 0; i < count; i++) { unsigned char ch = buf[i], flag;