Message ID | f8816223098569a9b2f478caa5b4a7a0c27dda00.1461875890.git.jpoimboe@redhat.com (mailing list archive) |
---|---|
State | Not Applicable |
Headers | show |
On Thu, Apr 28, 2016 at 1:44 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote: > A preempted function might not have had a chance to save the frame > pointer to the stack yet, which can result in its caller getting skipped > on a stack trace. > > Add a flag to indicate when the task has been preempted so that stack > dump code can determine whether the stack trace is reliable. I think I like this, but how do you handle the rather similar case in which a task goes to sleep because it's waiting on IO that happened in response to get_user, put_user, copy_from_user, etc? --Andy
On Fri, Apr 29, 2016 at 11:06:53AM -0700, Andy Lutomirski wrote: > On Thu, Apr 28, 2016 at 1:44 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote: > > A preempted function might not have had a chance to save the frame > > pointer to the stack yet, which can result in its caller getting skipped > > on a stack trace. > > > > Add a flag to indicate when the task has been preempted so that stack > > dump code can determine whether the stack trace is reliable. > > I think I like this, but how do you handle the rather similar case in > which a task goes to sleep because it's waiting on IO that happened in > response to get_user, put_user, copy_from_user, etc? Hm, good question. I was thinking that page faults had a dedicated stack, but now looking at the entry and traps code, that doesn't seem to be the case. Anyway I think it shouldn't be a problem if we make sure that any kernel function which might trigger a valid page fault (e.g., copy_user_generic_string) do the proper frame pointer setup first. Then the stack should still be reliable. In fact I might be able to teach objtool to enforce that: any function which uses an exception table should create a stack frame. Or alternatively, maybe set some kind of flag for page faults, similar to what I did with this patch.
On Fri, Apr 29, 2016 at 1:11 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote: > On Fri, Apr 29, 2016 at 11:06:53AM -0700, Andy Lutomirski wrote: >> On Thu, Apr 28, 2016 at 1:44 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote: >> > A preempted function might not have had a chance to save the frame >> > pointer to the stack yet, which can result in its caller getting skipped >> > on a stack trace. >> > >> > Add a flag to indicate when the task has been preempted so that stack >> > dump code can determine whether the stack trace is reliable. >> >> I think I like this, but how do you handle the rather similar case in >> which a task goes to sleep because it's waiting on IO that happened in >> response to get_user, put_user, copy_from_user, etc? > > Hm, good question. I was thinking that page faults had a dedicated > stack, but now looking at the entry and traps code, that doesn't seem to > be the case. > > Anyway I think it shouldn't be a problem if we make sure that any kernel > function which might trigger a valid page fault (e.g., > copy_user_generic_string) do the proper frame pointer setup first. Then > the stack should still be reliable. > > In fact I might be able to teach objtool to enforce that: any function > which uses an exception table should create a stack frame. > > Or alternatively, maybe set some kind of flag for page faults, similar > to what I did with this patch. > How about doing it the other way around: teach the unwinder to detect when it hits a non-outermost entry (i.e. it lands in idtentry, etc) and use some reasonable heuristic as to whether it's okay to keep unwinding. You should be able to handle preemption like that, too -- the unwind process will end up in an IRQ frame. --Andy
On Fri, Apr 29, 2016 at 01:19:23PM -0700, Andy Lutomirski wrote: > On Fri, Apr 29, 2016 at 1:11 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote: > > On Fri, Apr 29, 2016 at 11:06:53AM -0700, Andy Lutomirski wrote: > >> On Thu, Apr 28, 2016 at 1:44 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote: > >> > A preempted function might not have had a chance to save the frame > >> > pointer to the stack yet, which can result in its caller getting skipped > >> > on a stack trace. > >> > > >> > Add a flag to indicate when the task has been preempted so that stack > >> > dump code can determine whether the stack trace is reliable. > >> > >> I think I like this, but how do you handle the rather similar case in > >> which a task goes to sleep because it's waiting on IO that happened in > >> response to get_user, put_user, copy_from_user, etc? > > > > Hm, good question. I was thinking that page faults had a dedicated > > stack, but now looking at the entry and traps code, that doesn't seem to > > be the case. > > > > Anyway I think it shouldn't be a problem if we make sure that any kernel > > function which might trigger a valid page fault (e.g., > > copy_user_generic_string) do the proper frame pointer setup first. Then > > the stack should still be reliable. > > > > In fact I might be able to teach objtool to enforce that: any function > > which uses an exception table should create a stack frame. > > > > Or alternatively, maybe set some kind of flag for page faults, similar > > to what I did with this patch. > > > > How about doing it the other way around: teach the unwinder to detect > when it hits a non-outermost entry (i.e. it lands in idtentry, etc) > and use some reasonable heuristic as to whether it's okay to keep > unwinding. You should be able to handle preemption like that, too -- > the unwind process will end up in an IRQ frame. How exactly would the unwinder detect if a text address is in an idtentry? Maybe put all the idt entries in a special ELF section?
On Fri, Apr 29, 2016 at 1:27 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote: > On Fri, Apr 29, 2016 at 01:19:23PM -0700, Andy Lutomirski wrote: >> On Fri, Apr 29, 2016 at 1:11 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote: >> > On Fri, Apr 29, 2016 at 11:06:53AM -0700, Andy Lutomirski wrote: >> >> On Thu, Apr 28, 2016 at 1:44 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote: >> >> > A preempted function might not have had a chance to save the frame >> >> > pointer to the stack yet, which can result in its caller getting skipped >> >> > on a stack trace. >> >> > >> >> > Add a flag to indicate when the task has been preempted so that stack >> >> > dump code can determine whether the stack trace is reliable. >> >> >> >> I think I like this, but how do you handle the rather similar case in >> >> which a task goes to sleep because it's waiting on IO that happened in >> >> response to get_user, put_user, copy_from_user, etc? >> > >> > Hm, good question. I was thinking that page faults had a dedicated >> > stack, but now looking at the entry and traps code, that doesn't seem to >> > be the case. >> > >> > Anyway I think it shouldn't be a problem if we make sure that any kernel >> > function which might trigger a valid page fault (e.g., >> > copy_user_generic_string) do the proper frame pointer setup first. Then >> > the stack should still be reliable. >> > >> > In fact I might be able to teach objtool to enforce that: any function >> > which uses an exception table should create a stack frame. >> > >> > Or alternatively, maybe set some kind of flag for page faults, similar >> > to what I did with this patch. >> > >> >> How about doing it the other way around: teach the unwinder to detect >> when it hits a non-outermost entry (i.e. it lands in idtentry, etc) >> and use some reasonable heuristic as to whether it's okay to keep >> unwinding. You should be able to handle preemption like that, too -- >> the unwind process will end up in an IRQ frame. > > How exactly would the unwinder detect if a text address is in an > idtentry? Maybe put all the idt entries in a special ELF section? > Hmm. What actually happens when you unwind all the way into the entry code? Don't you end up in something that isn't in an ELF function? Can you detect that? Ideally, the unwinder could actually detect that it's hit a pt_regs struct and report that. If used for stack dumps, it could display some indication of this and then continue its unwinding by decoding the pt_regs. If used for patching, it could take some other appropriate action. I would have no objection to annotating all the pt_regs-style entry code, whether by putting it in a separate section or by making a table of addresses. There are a couple of nasty cases if NMI or MCE is involved but, as of 4.6, outside of NMI, MCE, and vmalloc faults (ugh!), there should always be a complete pt_regs on the stack before interrupts get enabled for each entry. Of course, finding the thing may be nontrivial in case other things were pushed. I suppose we could try to rejigger the code so that rbp points to pt_regs or similar. --Andy
On Fri, Apr 29, 2016 at 01:32:53PM -0700, Andy Lutomirski wrote: > On Fri, Apr 29, 2016 at 1:27 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote: > > On Fri, Apr 29, 2016 at 01:19:23PM -0700, Andy Lutomirski wrote: > >> On Fri, Apr 29, 2016 at 1:11 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote: > >> > On Fri, Apr 29, 2016 at 11:06:53AM -0700, Andy Lutomirski wrote: > >> >> On Thu, Apr 28, 2016 at 1:44 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote: > >> >> > A preempted function might not have had a chance to save the frame > >> >> > pointer to the stack yet, which can result in its caller getting skipped > >> >> > on a stack trace. > >> >> > > >> >> > Add a flag to indicate when the task has been preempted so that stack > >> >> > dump code can determine whether the stack trace is reliable. > >> >> > >> >> I think I like this, but how do you handle the rather similar case in > >> >> which a task goes to sleep because it's waiting on IO that happened in > >> >> response to get_user, put_user, copy_from_user, etc? > >> > > >> > Hm, good question. I was thinking that page faults had a dedicated > >> > stack, but now looking at the entry and traps code, that doesn't seem to > >> > be the case. > >> > > >> > Anyway I think it shouldn't be a problem if we make sure that any kernel > >> > function which might trigger a valid page fault (e.g., > >> > copy_user_generic_string) do the proper frame pointer setup first. Then > >> > the stack should still be reliable. > >> > > >> > In fact I might be able to teach objtool to enforce that: any function > >> > which uses an exception table should create a stack frame. > >> > > >> > Or alternatively, maybe set some kind of flag for page faults, similar > >> > to what I did with this patch. > >> > > >> > >> How about doing it the other way around: teach the unwinder to detect > >> when it hits a non-outermost entry (i.e. it lands in idtentry, etc) > >> and use some reasonable heuristic as to whether it's okay to keep > >> unwinding. You should be able to handle preemption like that, too -- > >> the unwind process will end up in an IRQ frame. > > > > How exactly would the unwinder detect if a text address is in an > > idtentry? Maybe put all the idt entries in a special ELF section? > > > > Hmm. > > What actually happens when you unwind all the way into the entry code? > Don't you end up in something that isn't in an ELF function? Can you > detect that? For entry from user space (e.g., syscalls), it's easy to detect because there's always a pt_regs at the bottom of the stack. So if the unwinder reaches the stack address at (thread.sp0 - sizeof(pt_regs)), it knows it's done. But for nested entry (e.g. in-kernel irqs/exceptions like preemption and page faults which don't have dedicated stacks), where the pt_regs is stored somewhere in the middle of the stack instead of the bottom, there's no reliable way to detect that. > Ideally, the unwinder could actually detect that it's > hit a pt_regs struct and report that. If used for stack dumps, it > could display some indication of this and then continue its unwinding > by decoding the pt_regs. If used for patching, it could take some > other appropriate action. > > I would have no objection to annotating all the pt_regs-style entry > code, whether by putting it in a separate section or by making a table > of addresses. I think the easiest way to make it work would be to modify the idtentry macro to put all the idt entries in a dedicated section. Then the unwinder could easily detect any calls from that code. > There are a couple of nasty cases if NMI or MCE is involved but, as of > 4.6, outside of NMI, MCE, and vmalloc faults (ugh!), there should > always be a complete pt_regs on the stack before interrupts get > enabled for each entry. Of course, finding the thing may be > nontrivial in case other things were pushed. NMI, MCE and interrupts aren't a problem because they have dedicated stacks, which are easy to detect. If the tasks' stack is on an exception stack or an irq stack, we consider it unreliable. And also, they don't sleep. The stack of any running task (other than current) is automatically considered unreliable anyway, since they could be modifying it while we're reading it. > I suppose we could try to rejigger the code so that rbp points to > pt_regs or similar. I think we should avoid doing something like that because it would break gdb and all the other unwinders who don't know about it.
On Fri, Apr 29, 2016 at 2:25 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote: > On Fri, Apr 29, 2016 at 01:32:53PM -0700, Andy Lutomirski wrote: >> On Fri, Apr 29, 2016 at 1:27 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote: >> > On Fri, Apr 29, 2016 at 01:19:23PM -0700, Andy Lutomirski wrote: >> >> On Fri, Apr 29, 2016 at 1:11 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote: >> >> > On Fri, Apr 29, 2016 at 11:06:53AM -0700, Andy Lutomirski wrote: >> >> >> On Thu, Apr 28, 2016 at 1:44 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote: >> >> >> > A preempted function might not have had a chance to save the frame >> >> >> > pointer to the stack yet, which can result in its caller getting skipped >> >> >> > on a stack trace. >> >> >> > >> >> >> > Add a flag to indicate when the task has been preempted so that stack >> >> >> > dump code can determine whether the stack trace is reliable. >> >> >> >> >> >> I think I like this, but how do you handle the rather similar case in >> >> >> which a task goes to sleep because it's waiting on IO that happened in >> >> >> response to get_user, put_user, copy_from_user, etc? >> >> > >> >> > Hm, good question. I was thinking that page faults had a dedicated >> >> > stack, but now looking at the entry and traps code, that doesn't seem to >> >> > be the case. >> >> > >> >> > Anyway I think it shouldn't be a problem if we make sure that any kernel >> >> > function which might trigger a valid page fault (e.g., >> >> > copy_user_generic_string) do the proper frame pointer setup first. Then >> >> > the stack should still be reliable. >> >> > >> >> > In fact I might be able to teach objtool to enforce that: any function >> >> > which uses an exception table should create a stack frame. >> >> > >> >> > Or alternatively, maybe set some kind of flag for page faults, similar >> >> > to what I did with this patch. >> >> > >> >> >> >> How about doing it the other way around: teach the unwinder to detect >> >> when it hits a non-outermost entry (i.e. it lands in idtentry, etc) >> >> and use some reasonable heuristic as to whether it's okay to keep >> >> unwinding. You should be able to handle preemption like that, too -- >> >> the unwind process will end up in an IRQ frame. >> > >> > How exactly would the unwinder detect if a text address is in an >> > idtentry? Maybe put all the idt entries in a special ELF section? >> > >> >> Hmm. >> >> What actually happens when you unwind all the way into the entry code? >> Don't you end up in something that isn't in an ELF function? Can you >> detect that? > > For entry from user space (e.g., syscalls), it's easy to detect because > there's always a pt_regs at the bottom of the stack. So if the unwinder > reaches the stack address at (thread.sp0 - sizeof(pt_regs)), it knows > it's done. > > But for nested entry (e.g. in-kernel irqs/exceptions like preemption and > page faults which don't have dedicated stacks), where the pt_regs is > stored somewhere in the middle of the stack instead of the bottom, > there's no reliable way to detect that. > >> Ideally, the unwinder could actually detect that it's >> hit a pt_regs struct and report that. If used for stack dumps, it >> could display some indication of this and then continue its unwinding >> by decoding the pt_regs. If used for patching, it could take some >> other appropriate action. >> >> I would have no objection to annotating all the pt_regs-style entry >> code, whether by putting it in a separate section or by making a table >> of addresses. > > I think the easiest way to make it work would be to modify the idtentry > macro to put all the idt entries in a dedicated section. Then the > unwinder could easily detect any calls from that code. That would work. Would it make sense to do the same for the irq entries? I'd be glad to review a patch. It should be straightforward. > >> There are a couple of nasty cases if NMI or MCE is involved but, as of >> 4.6, outside of NMI, MCE, and vmalloc faults (ugh!), there should >> always be a complete pt_regs on the stack before interrupts get >> enabled for each entry. Of course, finding the thing may be >> nontrivial in case other things were pushed. > > NMI, MCE and interrupts aren't a problem because they have dedicated > stacks, which are easy to detect. If the tasks' stack is on an > exception stack or an irq stack, we consider it unreliable. Only on x86_64. > > And also, they don't sleep. The stack of any running task (other than > current) is automatically considered unreliable anyway, since they could > be modifying it while we're reading it. True. > >> I suppose we could try to rejigger the code so that rbp points to >> pt_regs or similar. > > I think we should avoid doing something like that because it would break > gdb and all the other unwinders who don't know about it. How so? Currently, rbp in the entry code is meaningless. I'm suggesting that, when we do, for example, 'call \do_sym' in idtentry, we point rbp to the pt_regs. Currently it points to something stale (which the dump_stack code might be relying on. Hmm.) But it's probably also safe to assume that if you unwind to the 'call \do_sym', then pt_regs is the next thing on the stack, so just doing the section thing would work. We should really re-add DWARF some day. --Andy
On Fri, 29 Apr 2016, Andy Lutomirski wrote: > > NMI, MCE and interrupts aren't a problem because they have dedicated > > stacks, which are easy to detect. If the tasks' stack is on an > > exception stack or an irq stack, we consider it unreliable. > > Only on x86_64. Well, MCEs are more or less x86-specific as well. But otherwise good point, thanks Andy. So, how does stack layout generally look like in case when NMI is actually running on proper kernel stack? I thought it's guaranteed to contain pt_regs anyway in all cases. Is that not guaranteed to be the case? Thanks,
On Fri, Apr 29, 2016 at 02:37:41PM -0700, Andy Lutomirski wrote: > On Fri, Apr 29, 2016 at 2:25 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote: > > I think the easiest way to make it work would be to modify the idtentry > > macro to put all the idt entries in a dedicated section. Then the > > unwinder could easily detect any calls from that code. > > That would work. Would it make sense to do the same for the irq entries? Yes, I think so. > >> I suppose we could try to rejigger the code so that rbp points to > >> pt_regs or similar. > > > > I think we should avoid doing something like that because it would break > > gdb and all the other unwinders who don't know about it. > > How so? > > Currently, rbp in the entry code is meaningless. I'm suggesting that, > when we do, for example, 'call \do_sym' in idtentry, we point rbp to > the pt_regs. Currently it points to something stale (which the > dump_stack code might be relying on. Hmm.) But it's probably also > safe to assume that if you unwind to the 'call \do_sym', then pt_regs > is the next thing on the stack, so just doing the section thing would > work. Yes, rbp is meaningless on the entry from user space. But if an in-kernel interrupt occurs (e.g. page fault, preemption) and you have nested entry, rbp keeps its old value, right? So the unwinder can walk past the nested entry frame and keep going until it gets to the original entry. > We should really re-add DWARF some day. Working on it :-)
On Sat, Apr 30, 2016 at 12:11:45AM +0200, Jiri Kosina wrote: > On Fri, 29 Apr 2016, Andy Lutomirski wrote: > > > NMI, MCE and interrupts aren't a problem because they have dedicated > > > stacks, which are easy to detect. If the tasks' stack is on an > > > exception stack or an irq stack, we consider it unreliable. > > > > Only on x86_64. > > Well, MCEs are more or less x86-specific as well. But otherwise good > point, thanks Andy. > > So, how does stack layout generally look like in case when NMI is actually > running on proper kernel stack? I thought it's guaranteed to contain > pt_regs anyway in all cases. Is that not guaranteed to be the case? If the NMI were using the normal kernel stack and it interrupted kernel space, pt_regs would be placed in the "middle" of the stack rather than the bottom, and there's currently no way to detect that. However, NMIs don't sleep, and we only consider sleeping tasks for stack reliability, so it wouldn't be an issue anyway.
On Apr 29, 2016 3:41 PM, "Josh Poimboeuf" <jpoimboe@redhat.com> wrote: > > On Fri, Apr 29, 2016 at 02:37:41PM -0700, Andy Lutomirski wrote: > > On Fri, Apr 29, 2016 at 2:25 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote: > > > I think the easiest way to make it work would be to modify the idtentry > > > macro to put all the idt entries in a dedicated section. Then the > > > unwinder could easily detect any calls from that code. > > > > That would work. Would it make sense to do the same for the irq entries? > > Yes, I think so. > > > >> I suppose we could try to rejigger the code so that rbp points to > > >> pt_regs or similar. > > > > > > I think we should avoid doing something like that because it would break > > > gdb and all the other unwinders who don't know about it. > > > > How so? > > > > Currently, rbp in the entry code is meaningless. I'm suggesting that, > > when we do, for example, 'call \do_sym' in idtentry, we point rbp to > > the pt_regs. Currently it points to something stale (which the > > dump_stack code might be relying on. Hmm.) But it's probably also > > safe to assume that if you unwind to the 'call \do_sym', then pt_regs > > is the next thing on the stack, so just doing the section thing would > > work. > > Yes, rbp is meaningless on the entry from user space. But if an > in-kernel interrupt occurs (e.g. page fault, preemption) and you have > nested entry, rbp keeps its old value, right? So the unwinder can walk > past the nested entry frame and keep going until it gets to the original > entry. Yes. It would be nice if we could do better, though, and actually notice the pt_regs and identify the entry. For example, I'd love to see "page fault, RIP=xyz" printed in the middle of a stack dump on a crash. Also, I think that just following rbp links will lose the actual function that took the page fault (or whatever function pt_regs->ip actually points to). > > > We should really re-add DWARF some day. > > Working on it :-) Excellent. Have you looked at my vdso unwinding test at all? If we could do something similar for the kernel, IMO it would make testing much more pleasant. --Andy
On Apr 29, 2016 3:11 PM, "Jiri Kosina" <jikos@kernel.org> wrote: > > On Fri, 29 Apr 2016, Andy Lutomirski wrote: > > > > NMI, MCE and interrupts aren't a problem because they have dedicated > > > stacks, which are easy to detect. If the tasks' stack is on an > > > exception stack or an irq stack, we consider it unreliable. > > > > Only on x86_64. > > Well, MCEs are more or less x86-specific as well. But otherwise good > point, thanks Andy. > > So, how does stack layout generally look like in case when NMI is actually > running on proper kernel stack? I thought it's guaranteed to contain > pt_regs anyway in all cases. Is that not guaranteed to be the case? > On x86, at least, there will still be pt_regs for the NMI. For the interrupted state, though, there might not be pt_regs, as the NMI might have happened while still populating pt_regs. In fact, the NMI stack could overlap task_pt_regs. For x86_32, there's no guarantee that pt_regs contains sp due to hardware silliness. You need to parse it more carefully, as, !user_mode(regs), then the old sp is just above pt_regs. --Andy
On Fri, Apr 29, 2016 at 05:08:50PM -0700, Andy Lutomirski wrote: > On Apr 29, 2016 3:41 PM, "Josh Poimboeuf" <jpoimboe@redhat.com> wrote: > > > > On Fri, Apr 29, 2016 at 02:37:41PM -0700, Andy Lutomirski wrote: > > > On Fri, Apr 29, 2016 at 2:25 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote: > > > >> I suppose we could try to rejigger the code so that rbp points to > > > >> pt_regs or similar. > > > > > > > > I think we should avoid doing something like that because it would break > > > > gdb and all the other unwinders who don't know about it. > > > > > > How so? > > > > > > Currently, rbp in the entry code is meaningless. I'm suggesting that, > > > when we do, for example, 'call \do_sym' in idtentry, we point rbp to > > > the pt_regs. Currently it points to something stale (which the > > > dump_stack code might be relying on. Hmm.) But it's probably also > > > safe to assume that if you unwind to the 'call \do_sym', then pt_regs > > > is the next thing on the stack, so just doing the section thing would > > > work. > > > > Yes, rbp is meaningless on the entry from user space. But if an > > in-kernel interrupt occurs (e.g. page fault, preemption) and you have > > nested entry, rbp keeps its old value, right? So the unwinder can walk > > past the nested entry frame and keep going until it gets to the original > > entry. > > Yes. > > It would be nice if we could do better, though, and actually notice > the pt_regs and identify the entry. For example, I'd love to see > "page fault, RIP=xyz" printed in the middle of a stack dump on a > crash. > > Also, I think that just following rbp links will lose the > actual function that took the page fault (or whatever function > pt_regs->ip actually points to). Hm. I think we could fix all that in a more standard way. Whenever a new pt_regs frame gets saved on entry, we could also create a new stack frame which points to a fake kernel_entry() function. That would tell the unwinder there's a pt_regs frame without otherwise breaking frame pointers across the frame. Then I guess we wouldn't need my other solution of putting the idt entries in a special section. How does that sound? > Have you looked at my vdso unwinding test at all? If we could do > something similar for the kernel, IMO it would make testing much more > pleasant. I found it, but I'm not sure what it would mean to do something similar for the kernel. Do you mean doing something like an NMI sampling-based approach where we periodically do a random stack sanity check? (If so, I do have something like that planned.)
On Mon, May 2, 2016 at 6:52 AM, Josh Poimboeuf <jpoimboe@redhat.com> wrote: > On Fri, Apr 29, 2016 at 05:08:50PM -0700, Andy Lutomirski wrote: >> On Apr 29, 2016 3:41 PM, "Josh Poimboeuf" <jpoimboe@redhat.com> wrote: >> > >> > On Fri, Apr 29, 2016 at 02:37:41PM -0700, Andy Lutomirski wrote: >> > > On Fri, Apr 29, 2016 at 2:25 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote: >> > > >> I suppose we could try to rejigger the code so that rbp points to >> > > >> pt_regs or similar. >> > > > >> > > > I think we should avoid doing something like that because it would break >> > > > gdb and all the other unwinders who don't know about it. >> > > >> > > How so? >> > > >> > > Currently, rbp in the entry code is meaningless. I'm suggesting that, >> > > when we do, for example, 'call \do_sym' in idtentry, we point rbp to >> > > the pt_regs. Currently it points to something stale (which the >> > > dump_stack code might be relying on. Hmm.) But it's probably also >> > > safe to assume that if you unwind to the 'call \do_sym', then pt_regs >> > > is the next thing on the stack, so just doing the section thing would >> > > work. >> > >> > Yes, rbp is meaningless on the entry from user space. But if an >> > in-kernel interrupt occurs (e.g. page fault, preemption) and you have >> > nested entry, rbp keeps its old value, right? So the unwinder can walk >> > past the nested entry frame and keep going until it gets to the original >> > entry. >> >> Yes. >> >> It would be nice if we could do better, though, and actually notice >> the pt_regs and identify the entry. For example, I'd love to see >> "page fault, RIP=xyz" printed in the middle of a stack dump on a >> crash. >> >> Also, I think that just following rbp links will lose the >> actual function that took the page fault (or whatever function >> pt_regs->ip actually points to). > > Hm. I think we could fix all that in a more standard way. Whenever a > new pt_regs frame gets saved on entry, we could also create a new stack > frame which points to a fake kernel_entry() function. That would tell > the unwinder there's a pt_regs frame without otherwise breaking frame > pointers across the frame. > > Then I guess we wouldn't need my other solution of putting the idt > entries in a special section. > > How does that sound? Let me try to understand. The normal call sequence is call; push %rbp; mov %rsp, %rbp. So rbp points to (prev rbp, prev rip) on the stack, and you can follow the chain back. Right now, on a user access page fault or similar, we have rbp (probably) pointing to the interrupted frame, and the interrupted rip isn't saved anywhere that a naive unwinder can find it. (It's in pt_regs, but the rbp chain skips right over that.) We could change the entry code so that an interrupt / idtentry does: push pt_regs push kernel_entry push %rbp mov %rsp, %rbp call handler pop %rbp addq $8, %rsp or similar. That would make it appear that the actual C handler was caused by a dummy function "kernel_entry". Now the unwinder would get to kernel_entry, but it *still* wouldn't find its way to the calling frame, which only solves part of the problem. We could at least teach the unwinder how kernel_entry works and let it decode pt_regs to continue unwinding. This would be nice, and I think it could work. I think I like this, except that, if it used a separate section, it could potentially be faster, as, for each actual entry type, the offset from the C handler frame to pt_regs is a foregone conclusion. But this is pretty simple and performance is already abysmal in most handlers. There's an added benefit to using a separate section, though: we could also annotate the calls with what type of entry they were so the unwinder could print it out nicely. I could be convinced either way. > >> Have you looked at my vdso unwinding test at all? If we could do >> something similar for the kernel, IMO it would make testing much more >> pleasant. > > I found it, but I'm not sure what it would mean to do something similar > for the kernel. Do you mean doing something like an NMI sampling-based > approach where we periodically do a random stack sanity check? I was imagining something a little more strict: single-step interesting parts of the kernel and make sure that each step unwinds correctly. That could detect missing frames and similar.
On Mon, May 02, 2016 at 08:52:41AM -0700, Andy Lutomirski wrote: > On Mon, May 2, 2016 at 6:52 AM, Josh Poimboeuf <jpoimboe@redhat.com> wrote: > > On Fri, Apr 29, 2016 at 05:08:50PM -0700, Andy Lutomirski wrote: > >> On Apr 29, 2016 3:41 PM, "Josh Poimboeuf" <jpoimboe@redhat.com> wrote: > >> > > >> > On Fri, Apr 29, 2016 at 02:37:41PM -0700, Andy Lutomirski wrote: > >> > > On Fri, Apr 29, 2016 at 2:25 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote: > >> > > >> I suppose we could try to rejigger the code so that rbp points to > >> > > >> pt_regs or similar. > >> > > > > >> > > > I think we should avoid doing something like that because it would break > >> > > > gdb and all the other unwinders who don't know about it. > >> > > > >> > > How so? > >> > > > >> > > Currently, rbp in the entry code is meaningless. I'm suggesting that, > >> > > when we do, for example, 'call \do_sym' in idtentry, we point rbp to > >> > > the pt_regs. Currently it points to something stale (which the > >> > > dump_stack code might be relying on. Hmm.) But it's probably also > >> > > safe to assume that if you unwind to the 'call \do_sym', then pt_regs > >> > > is the next thing on the stack, so just doing the section thing would > >> > > work. > >> > > >> > Yes, rbp is meaningless on the entry from user space. But if an > >> > in-kernel interrupt occurs (e.g. page fault, preemption) and you have > >> > nested entry, rbp keeps its old value, right? So the unwinder can walk > >> > past the nested entry frame and keep going until it gets to the original > >> > entry. > >> > >> Yes. > >> > >> It would be nice if we could do better, though, and actually notice > >> the pt_regs and identify the entry. For example, I'd love to see > >> "page fault, RIP=xyz" printed in the middle of a stack dump on a > >> crash. > >> > >> Also, I think that just following rbp links will lose the > >> actual function that took the page fault (or whatever function > >> pt_regs->ip actually points to). > > > > Hm. I think we could fix all that in a more standard way. Whenever a > > new pt_regs frame gets saved on entry, we could also create a new stack > > frame which points to a fake kernel_entry() function. That would tell > > the unwinder there's a pt_regs frame without otherwise breaking frame > > pointers across the frame. > > > > Then I guess we wouldn't need my other solution of putting the idt > > entries in a special section. > > > > How does that sound? > > Let me try to understand. > > The normal call sequence is call; push %rbp; mov %rsp, %rbp. So rbp > points to (prev rbp, prev rip) on the stack, and you can follow the > chain back. Right now, on a user access page fault or similar, we > have rbp (probably) pointing to the interrupted frame, and the > interrupted rip isn't saved anywhere that a naive unwinder can find > it. (It's in pt_regs, but the rbp chain skips right over that.) > > We could change the entry code so that an interrupt / idtentry does: > > push pt_regs > push kernel_entry > push %rbp > mov %rsp, %rbp > call handler > pop %rbp > addq $8, %rsp > > or similar. That would make it appear that the actual C handler was > caused by a dummy function "kernel_entry". Now the unwinder would get > to kernel_entry, but it *still* wouldn't find its way to the calling > frame, which only solves part of the problem. We could at least teach > the unwinder how kernel_entry works and let it decode pt_regs to > continue unwinding. This would be nice, and I think it could work. Yeah, that's about what I had in mind. > I think I like this, except that, if it used a separate section, it > could potentially be faster, as, for each actual entry type, the > offset from the C handler frame to pt_regs is a foregone conclusion. Hm, this I don't really follow. It's true that the unwinder can easily find RIP from pt_regs, which will always be a known offset from the kernel_entry pointer on the stack. But why would having the entry code in a separate section make that faster? > But this is pretty simple and performance is already abysmal in most > handlers. > > There's an added benefit to using a separate section, though: we could > also annotate the calls with what type of entry they were so the > unwinder could print it out nicely. Yeah, that could be a nice feature... but doesn't printing the name of the C handler pretty much already give that information? In any case, once we have a working DWARF unwinder, I think it will show the name of the idt entry anyway. > >> Have you looked at my vdso unwinding test at all? If we could do > >> something similar for the kernel, IMO it would make testing much more > >> pleasant. > > > > I found it, but I'm not sure what it would mean to do something similar > > for the kernel. Do you mean doing something like an NMI sampling-based > > approach where we periodically do a random stack sanity check? > > I was imagining something a little more strict: single-step > interesting parts of the kernel and make sure that each step unwinds > correctly. That could detect missing frames and similar. Interesting idea, though I wonder how hard it would be to reliably distinguish a missing frame from the case where gcc decides to inline a function. Another idea to detect missing frames: for each return address on the stack, ensure there's a corresponding "call <func>" instruction immediately preceding the return location, where <func> matches what's on the stack. That wouldn't work so well for indirect calls using function pointers, but even then maybe we could use the DWARF CFI to find the function pointer value and validate that it matches the stack function.
On Mon, May 2, 2016 at 10:31 AM, Josh Poimboeuf <jpoimboe@redhat.com> wrote: > On Mon, May 02, 2016 at 08:52:41AM -0700, Andy Lutomirski wrote: >> On Mon, May 2, 2016 at 6:52 AM, Josh Poimboeuf <jpoimboe@redhat.com> wrote: >> > On Fri, Apr 29, 2016 at 05:08:50PM -0700, Andy Lutomirski wrote: >> >> On Apr 29, 2016 3:41 PM, "Josh Poimboeuf" <jpoimboe@redhat.com> wrote: >> >> > >> >> > On Fri, Apr 29, 2016 at 02:37:41PM -0700, Andy Lutomirski wrote: >> >> > > On Fri, Apr 29, 2016 at 2:25 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote: >> >> > > >> I suppose we could try to rejigger the code so that rbp points to >> >> > > >> pt_regs or similar. >> >> > > > >> >> > > > I think we should avoid doing something like that because it would break >> >> > > > gdb and all the other unwinders who don't know about it. >> >> > > >> >> > > How so? >> >> > > >> >> > > Currently, rbp in the entry code is meaningless. I'm suggesting that, >> >> > > when we do, for example, 'call \do_sym' in idtentry, we point rbp to >> >> > > the pt_regs. Currently it points to something stale (which the >> >> > > dump_stack code might be relying on. Hmm.) But it's probably also >> >> > > safe to assume that if you unwind to the 'call \do_sym', then pt_regs >> >> > > is the next thing on the stack, so just doing the section thing would >> >> > > work. >> >> > >> >> > Yes, rbp is meaningless on the entry from user space. But if an >> >> > in-kernel interrupt occurs (e.g. page fault, preemption) and you have >> >> > nested entry, rbp keeps its old value, right? So the unwinder can walk >> >> > past the nested entry frame and keep going until it gets to the original >> >> > entry. >> >> >> >> Yes. >> >> >> >> It would be nice if we could do better, though, and actually notice >> >> the pt_regs and identify the entry. For example, I'd love to see >> >> "page fault, RIP=xyz" printed in the middle of a stack dump on a >> >> crash. >> >> >> >> Also, I think that just following rbp links will lose the >> >> actual function that took the page fault (or whatever function >> >> pt_regs->ip actually points to). >> > >> > Hm. I think we could fix all that in a more standard way. Whenever a >> > new pt_regs frame gets saved on entry, we could also create a new stack >> > frame which points to a fake kernel_entry() function. That would tell >> > the unwinder there's a pt_regs frame without otherwise breaking frame >> > pointers across the frame. >> > >> > Then I guess we wouldn't need my other solution of putting the idt >> > entries in a special section. >> > >> > How does that sound? >> >> Let me try to understand. >> >> The normal call sequence is call; push %rbp; mov %rsp, %rbp. So rbp >> points to (prev rbp, prev rip) on the stack, and you can follow the >> chain back. Right now, on a user access page fault or similar, we >> have rbp (probably) pointing to the interrupted frame, and the >> interrupted rip isn't saved anywhere that a naive unwinder can find >> it. (It's in pt_regs, but the rbp chain skips right over that.) >> >> We could change the entry code so that an interrupt / idtentry does: >> >> push pt_regs >> push kernel_entry >> push %rbp >> mov %rsp, %rbp >> call handler >> pop %rbp >> addq $8, %rsp >> >> or similar. That would make it appear that the actual C handler was >> caused by a dummy function "kernel_entry". Now the unwinder would get >> to kernel_entry, but it *still* wouldn't find its way to the calling >> frame, which only solves part of the problem. We could at least teach >> the unwinder how kernel_entry works and let it decode pt_regs to >> continue unwinding. This would be nice, and I think it could work. > > Yeah, that's about what I had in mind. FWIW, I just tried this: static bool is_entry_text(unsigned long addr) { return addr >= (unsigned long)__entry_text_start && addr < (unsigned long)__entry_text_end; } it works. So the entry code is already annotated reasonably well :) I just hacked it up here: https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=stack&id=085eacfe0edfc18768e48340084415dba9a6bd21 and it seems to work, at least for page faults. A better implementation would print out the entire contents of pt_regs so that people reading the stack trace will know the registers at the time of the exception, which might be helpful. > >> I think I like this, except that, if it used a separate section, it >> could potentially be faster, as, for each actual entry type, the >> offset from the C handler frame to pt_regs is a foregone conclusion. > > Hm, this I don't really follow. It's true that the unwinder can easily > find RIP from pt_regs, which will always be a known offset from the > kernel_entry pointer on the stack. But why would having the entry code > in a separate section make that faster? It doesn't make the unwinder faster -- it makes the entry code faster. > >> But this is pretty simple and performance is already abysmal in most >> handlers. >> >> There's an added benefit to using a separate section, though: we could >> also annotate the calls with what type of entry they were so the >> unwinder could print it out nicely. > > Yeah, that could be a nice feature... but doesn't printing the name of > the C handler pretty much already give that information? > > In any case, once we have a working DWARF unwinder, I think it will show > the name of the idt entry anyway. True. And it'll automatically follow pt_regs. > >> >> Have you looked at my vdso unwinding test at all? If we could do >> >> something similar for the kernel, IMO it would make testing much more >> >> pleasant. >> > >> > I found it, but I'm not sure what it would mean to do something similar >> > for the kernel. Do you mean doing something like an NMI sampling-based >> > approach where we periodically do a random stack sanity check? >> >> I was imagining something a little more strict: single-step >> interesting parts of the kernel and make sure that each step unwinds >> correctly. That could detect missing frames and similar. > > Interesting idea, though I wonder how hard it would be to reliably > distinguish a missing frame from the case where gcc decides to inline a > function. > > Another idea to detect missing frames: for each return address on the > stack, ensure there's a corresponding "call <func>" instruction > immediately preceding the return location, where <func> matches what's > on the stack. Hmm, interesting. I hope your plans include rewriting the current stack unwinder completely. The thing in print_context_stack is (a) hard-to-understand and hard-to-modify crap and (b) is called in a loop from another file using totally ridiculous conventions. --Andy
* Andy Lutomirski <luto@amacapital.net> wrote: > > Another idea to detect missing frames: for each return address on the stack, > > ensure there's a corresponding "call <func>" instruction immediately preceding > > the return location, where <func> matches what's on the stack. > > Hmm, interesting. > > I hope your plans include rewriting the current stack unwinder completely. The > thing in print_context_stack is (a) hard-to-understand and hard-to-modify crap > and (b) is called in a loop from another file using totally ridiculous > conventions. So we had several attempts at making it better, any further improvements (including radical rewrites) are more than welcome! The generalization between the various stack walking methods certainly didn't make things easier to read - we might want to eliminate that by using better primitives to iterate over the stack frame. Thanks, Ingo
On Mon, May 02, 2016 at 11:12:39AM -0700, Andy Lutomirski wrote: > On Mon, May 2, 2016 at 10:31 AM, Josh Poimboeuf <jpoimboe@redhat.com> wrote: > > On Mon, May 02, 2016 at 08:52:41AM -0700, Andy Lutomirski wrote: > >> On Mon, May 2, 2016 at 6:52 AM, Josh Poimboeuf <jpoimboe@redhat.com> wrote: > >> > On Fri, Apr 29, 2016 at 05:08:50PM -0700, Andy Lutomirski wrote: > >> >> On Apr 29, 2016 3:41 PM, "Josh Poimboeuf" <jpoimboe@redhat.com> wrote: > >> >> > > >> >> > On Fri, Apr 29, 2016 at 02:37:41PM -0700, Andy Lutomirski wrote: > >> >> > > On Fri, Apr 29, 2016 at 2:25 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote: > >> >> > > >> I suppose we could try to rejigger the code so that rbp points to > >> >> > > >> pt_regs or similar. > >> >> > > > > >> >> > > > I think we should avoid doing something like that because it would break > >> >> > > > gdb and all the other unwinders who don't know about it. > >> >> > > > >> >> > > How so? > >> >> > > > >> >> > > Currently, rbp in the entry code is meaningless. I'm suggesting that, > >> >> > > when we do, for example, 'call \do_sym' in idtentry, we point rbp to > >> >> > > the pt_regs. Currently it points to something stale (which the > >> >> > > dump_stack code might be relying on. Hmm.) But it's probably also > >> >> > > safe to assume that if you unwind to the 'call \do_sym', then pt_regs > >> >> > > is the next thing on the stack, so just doing the section thing would > >> >> > > work. > >> >> > > >> >> > Yes, rbp is meaningless on the entry from user space. But if an > >> >> > in-kernel interrupt occurs (e.g. page fault, preemption) and you have > >> >> > nested entry, rbp keeps its old value, right? So the unwinder can walk > >> >> > past the nested entry frame and keep going until it gets to the original > >> >> > entry. > >> >> > >> >> Yes. > >> >> > >> >> It would be nice if we could do better, though, and actually notice > >> >> the pt_regs and identify the entry. For example, I'd love to see > >> >> "page fault, RIP=xyz" printed in the middle of a stack dump on a > >> >> crash. > >> >> > >> >> Also, I think that just following rbp links will lose the > >> >> actual function that took the page fault (or whatever function > >> >> pt_regs->ip actually points to). > >> > > >> > Hm. I think we could fix all that in a more standard way. Whenever a > >> > new pt_regs frame gets saved on entry, we could also create a new stack > >> > frame which points to a fake kernel_entry() function. That would tell > >> > the unwinder there's a pt_regs frame without otherwise breaking frame > >> > pointers across the frame. > >> > > >> > Then I guess we wouldn't need my other solution of putting the idt > >> > entries in a special section. > >> > > >> > How does that sound? > >> > >> Let me try to understand. > >> > >> The normal call sequence is call; push %rbp; mov %rsp, %rbp. So rbp > >> points to (prev rbp, prev rip) on the stack, and you can follow the > >> chain back. Right now, on a user access page fault or similar, we > >> have rbp (probably) pointing to the interrupted frame, and the > >> interrupted rip isn't saved anywhere that a naive unwinder can find > >> it. (It's in pt_regs, but the rbp chain skips right over that.) > >> > >> We could change the entry code so that an interrupt / idtentry does: > >> > >> push pt_regs > >> push kernel_entry > >> push %rbp > >> mov %rsp, %rbp > >> call handler > >> pop %rbp > >> addq $8, %rsp > >> > >> or similar. That would make it appear that the actual C handler was > >> caused by a dummy function "kernel_entry". Now the unwinder would get > >> to kernel_entry, but it *still* wouldn't find its way to the calling > >> frame, which only solves part of the problem. We could at least teach > >> the unwinder how kernel_entry works and let it decode pt_regs to > >> continue unwinding. This would be nice, and I think it could work. > > > > Yeah, that's about what I had in mind. > > FWIW, I just tried this: > > static bool is_entry_text(unsigned long addr) > { > return addr >= (unsigned long)__entry_text_start && > addr < (unsigned long)__entry_text_end; > } > > it works. So the entry code is already annotated reasonably well :) > > I just hacked it up here: > > https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=stack&id=085eacfe0edfc18768e48340084415dba9a6bd21 > > and it seems to work, at least for page faults. A better > implementation would print out the entire contents of pt_regs so that > people reading the stack trace will know the registers at the time of > the exception, which might be helpful. I still think we would need more specific annotations to do that reliably: a call from entry code doesn't necessarily correlate with a pt_regs frame. > >> I think I like this, except that, if it used a separate section, it > >> could potentially be faster, as, for each actual entry type, the > >> offset from the C handler frame to pt_regs is a foregone conclusion. > > > > Hm, this I don't really follow. It's true that the unwinder can easily > > find RIP from pt_regs, which will always be a known offset from the > > kernel_entry pointer on the stack. But why would having the entry code > > in a separate section make that faster? > > It doesn't make the unwinder faster -- it makes the entry code faster. Oh, right. But I don't think a few extra frame pointer instructions are much of an issue if you already have CONFIG_FRAME_POINTER enabled. Anyway I'm not sure which way is better. I'll think about it. > I hope your plans include rewriting the current stack unwinder > completely. The thing in print_context_stack is (a) > hard-to-understand and hard-to-modify crap and (b) is called in a loop > from another file using totally ridiculous conventions. I agree, that code is quite confusing. I haven't really thought about how specifically it could be improved or replaced though. Along those lines, I think it would be awesome if we could have an arch-independent DWARF unwinder so that most of the stack dumping code could be shared amongst all the arches.
On Mon, 2 May 2016, Andy Lutomirski wrote: > FWIW, I just tried this: > > static bool is_entry_text(unsigned long addr) > { > return addr >= (unsigned long)__entry_text_start && > addr < (unsigned long)__entry_text_end; > } > > it works. So the entry code is already annotated reasonably well :) > > I just hacked it up here: > > https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=stack&id=085eacfe0edfc18768e48340084415dba9a6bd21 > > and it seems to work, at least for page faults. A better > implementation would print out the entire contents of pt_regs so that > people reading the stack trace will know the registers at the time of > the exception, which might be helpful. Sorry for being dense, but how do you distinguish here between a "real" kernel entry, that pushes pt_regs, and any "non-entry" function call that passes pt_regs around?
On Mon, 2 May 2016, Jiri Kosina wrote: > > FWIW, I just tried this: > > > > static bool is_entry_text(unsigned long addr) > > { > > return addr >= (unsigned long)__entry_text_start && > > addr < (unsigned long)__entry_text_end; > > } > > > > it works. So the entry code is already annotated reasonably well :) > > > > I just hacked it up here: > > > > https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=stack&id=085eacfe0edfc18768e48340084415dba9a6bd21 > > > > and it seems to work, at least for page faults. A better > > implementation would print out the entire contents of pt_regs so that > > people reading the stack trace will know the registers at the time of > > the exception, which might be helpful. > > Sorry for being dense, but how do you distinguish here between a "real" > kernel entry, that pushes pt_regs, and any "non-entry" function call that > passes pt_regs around? Umm, actually, the more tricky part is the other way around -- how do you make sure that whenever you are calling out from a code between __entry_text_start and __entry_text_end, pt_regs will be at the place you're looking for it? How's that guaranteed? Thanks,
On Mon, May 2, 2016 at 1:00 PM, Jiri Kosina <jikos@kernel.org> wrote: > On Mon, 2 May 2016, Jiri Kosina wrote: > >> > FWIW, I just tried this: >> > >> > static bool is_entry_text(unsigned long addr) >> > { >> > return addr >= (unsigned long)__entry_text_start && >> > addr < (unsigned long)__entry_text_end; >> > } >> > >> > it works. So the entry code is already annotated reasonably well :) >> > >> > I just hacked it up here: >> > >> > https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=stack&id=085eacfe0edfc18768e48340084415dba9a6bd21 >> > >> > and it seems to work, at least for page faults. A better >> > implementation would print out the entire contents of pt_regs so that >> > people reading the stack trace will know the registers at the time of >> > the exception, which might be helpful. >> >> Sorry for being dense, but how do you distinguish here between a "real" >> kernel entry, that pushes pt_regs, and any "non-entry" function call that >> passes pt_regs around? > > Umm, actually, the more tricky part is the other way around -- how do you > make sure that whenever you are calling out from a code between > __entry_text_start and __entry_text_end, pt_regs will be at the place > you're looking for it? How's that guaranteed? It's not guaranteed in my code. I think we'd want to add a little table of call sites and their pt_regs offsets. This was just meant to test that the general idea works (and it does indeed generate better traces than the stock kernel, which gets it unconditionally wrong). --Andy > > Thanks, > > -- > Jiri Kosina > SUSE Labs >
From: Andy Lutomirski > Sent: 02 May 2016 19:13 ... > I hope your plans include rewriting the current stack unwinder > completely. The thing in print_context_stack is (a) > hard-to-understand and hard-to-modify crap and (b) is called in a loop > from another file using totally ridiculous conventions. I've seen a 'stack unwinder' that parsed the instruction stream forwards looking for 'return' instructions. I fixed it to add a few extra instructions needs to sort out the exit path from hardware interrupts. It only had to understand instructions that modified %sp and %bp and remember which branch instructions and branch targets it had used in order to find the correct exit path from a function. Worked reasonably well without any debug info or guaranteed frame pointers. It did have to fall back on scanning the stack if it was inside an infinite loop. Even on x86 it is reasonably possible to check for 'call' instructions in this case. David
diff --git a/include/linux/sched.h b/include/linux/sched.h index 3d31572..fb364a0 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -2137,6 +2137,7 @@ extern void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut, #define PF_SWAPWRITE 0x00800000 /* Allowed to write to swap */ #define PF_NO_SETAFFINITY 0x04000000 /* Userland is not allowed to meddle with cpus_allowed */ #define PF_MCE_EARLY 0x08000000 /* Early kill for mce process policy */ +#define PF_PREEMPT_IRQ 0x10000000 /* Thread is preempted by an irq */ #define PF_MUTEX_TESTER 0x20000000 /* Thread belongs to the rt mutex tester */ #define PF_FREEZER_SKIP 0x40000000 /* Freezer should not count it as freezable */ #define PF_SUSPEND_TASK 0x80000000 /* this thread called freeze_processes and should not be frozen */ diff --git a/kernel/fork.c b/kernel/fork.c index b73a539..d2fe04a 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -1373,7 +1373,7 @@ static struct task_struct *copy_process(unsigned long clone_flags, goto bad_fork_cleanup_count; delayacct_tsk_init(p); /* Must remain after dup_task_struct() */ - p->flags &= ~(PF_SUPERPRIV | PF_WQ_WORKER); + p->flags &= ~(PF_SUPERPRIV | PF_WQ_WORKER | PF_PREEMPT_IRQ); p->flags |= PF_FORKNOEXEC; INIT_LIST_HEAD(&p->children); INIT_LIST_HEAD(&p->sibling); diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 9d84d60..7594267 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -3422,6 +3422,8 @@ asmlinkage __visible void __sched preempt_schedule_irq(void) prev_state = exception_enter(); + current->flags |= PF_PREEMPT_IRQ; + do { preempt_disable(); local_irq_enable(); @@ -3430,6 +3432,8 @@ asmlinkage __visible void __sched preempt_schedule_irq(void) sched_preempt_enable_no_resched(); } while (need_resched()); + current->flags &= ~PF_PREEMPT_IRQ; + exception_exit(prev_state); }
A preempted function might not have had a chance to save the frame pointer to the stack yet, which can result in its caller getting skipped on a stack trace. Add a flag to indicate when the task has been preempted so that stack dump code can determine whether the stack trace is reliable. Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> --- include/linux/sched.h | 1 + kernel/fork.c | 2 +- kernel/sched/core.c | 4 ++++ 3 files changed, 6 insertions(+), 1 deletion(-)