diff mbox

[RFC,v2,05/18] sched: add task flag for preempt IRQ tracking

Message ID f8816223098569a9b2f478caa5b4a7a0c27dda00.1461875890.git.jpoimboe@redhat.com (mailing list archive)
State Not Applicable
Headers show

Commit Message

Josh Poimboeuf April 28, 2016, 8:44 p.m. UTC
A preempted function might not have had a chance to save the frame
pointer to the stack yet, which can result in its caller getting skipped
on a stack trace.

Add a flag to indicate when the task has been preempted so that stack
dump code can determine whether the stack trace is reliable.

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
---
 include/linux/sched.h | 1 +
 kernel/fork.c         | 2 +-
 kernel/sched/core.c   | 4 ++++
 3 files changed, 6 insertions(+), 1 deletion(-)

Comments

Andy Lutomirski April 29, 2016, 6:06 p.m. UTC | #1
On Thu, Apr 28, 2016 at 1:44 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> A preempted function might not have had a chance to save the frame
> pointer to the stack yet, which can result in its caller getting skipped
> on a stack trace.
>
> Add a flag to indicate when the task has been preempted so that stack
> dump code can determine whether the stack trace is reliable.

I think I like this, but how do you handle the rather similar case in
which a task goes to sleep because it's waiting on IO that happened in
response to get_user, put_user, copy_from_user, etc?

--Andy
Josh Poimboeuf April 29, 2016, 8:11 p.m. UTC | #2
On Fri, Apr 29, 2016 at 11:06:53AM -0700, Andy Lutomirski wrote:
> On Thu, Apr 28, 2016 at 1:44 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> > A preempted function might not have had a chance to save the frame
> > pointer to the stack yet, which can result in its caller getting skipped
> > on a stack trace.
> >
> > Add a flag to indicate when the task has been preempted so that stack
> > dump code can determine whether the stack trace is reliable.
> 
> I think I like this, but how do you handle the rather similar case in
> which a task goes to sleep because it's waiting on IO that happened in
> response to get_user, put_user, copy_from_user, etc?

Hm, good question.  I was thinking that page faults had a dedicated
stack, but now looking at the entry and traps code, that doesn't seem to
be the case.

Anyway I think it shouldn't be a problem if we make sure that any kernel
function which might trigger a valid page fault (e.g.,
copy_user_generic_string) do the proper frame pointer setup first.  Then
the stack should still be reliable.

In fact I might be able to teach objtool to enforce that: any function
which uses an exception table should create a stack frame.

Or alternatively, maybe set some kind of flag for page faults, similar
to what I did with this patch.
Andy Lutomirski April 29, 2016, 8:19 p.m. UTC | #3
On Fri, Apr 29, 2016 at 1:11 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> On Fri, Apr 29, 2016 at 11:06:53AM -0700, Andy Lutomirski wrote:
>> On Thu, Apr 28, 2016 at 1:44 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
>> > A preempted function might not have had a chance to save the frame
>> > pointer to the stack yet, which can result in its caller getting skipped
>> > on a stack trace.
>> >
>> > Add a flag to indicate when the task has been preempted so that stack
>> > dump code can determine whether the stack trace is reliable.
>>
>> I think I like this, but how do you handle the rather similar case in
>> which a task goes to sleep because it's waiting on IO that happened in
>> response to get_user, put_user, copy_from_user, etc?
>
> Hm, good question.  I was thinking that page faults had a dedicated
> stack, but now looking at the entry and traps code, that doesn't seem to
> be the case.
>
> Anyway I think it shouldn't be a problem if we make sure that any kernel
> function which might trigger a valid page fault (e.g.,
> copy_user_generic_string) do the proper frame pointer setup first.  Then
> the stack should still be reliable.
>
> In fact I might be able to teach objtool to enforce that: any function
> which uses an exception table should create a stack frame.
>
> Or alternatively, maybe set some kind of flag for page faults, similar
> to what I did with this patch.
>

How about doing it the other way around: teach the unwinder to detect
when it hits a non-outermost entry (i.e. it lands in idtentry, etc)
and use some reasonable heuristic as to whether it's okay to keep
unwinding.  You should be able to handle preemption like that, too --
the unwind process will end up in an IRQ frame.

--Andy
Josh Poimboeuf April 29, 2016, 8:27 p.m. UTC | #4
On Fri, Apr 29, 2016 at 01:19:23PM -0700, Andy Lutomirski wrote:
> On Fri, Apr 29, 2016 at 1:11 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> > On Fri, Apr 29, 2016 at 11:06:53AM -0700, Andy Lutomirski wrote:
> >> On Thu, Apr 28, 2016 at 1:44 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> >> > A preempted function might not have had a chance to save the frame
> >> > pointer to the stack yet, which can result in its caller getting skipped
> >> > on a stack trace.
> >> >
> >> > Add a flag to indicate when the task has been preempted so that stack
> >> > dump code can determine whether the stack trace is reliable.
> >>
> >> I think I like this, but how do you handle the rather similar case in
> >> which a task goes to sleep because it's waiting on IO that happened in
> >> response to get_user, put_user, copy_from_user, etc?
> >
> > Hm, good question.  I was thinking that page faults had a dedicated
> > stack, but now looking at the entry and traps code, that doesn't seem to
> > be the case.
> >
> > Anyway I think it shouldn't be a problem if we make sure that any kernel
> > function which might trigger a valid page fault (e.g.,
> > copy_user_generic_string) do the proper frame pointer setup first.  Then
> > the stack should still be reliable.
> >
> > In fact I might be able to teach objtool to enforce that: any function
> > which uses an exception table should create a stack frame.
> >
> > Or alternatively, maybe set some kind of flag for page faults, similar
> > to what I did with this patch.
> >
> 
> How about doing it the other way around: teach the unwinder to detect
> when it hits a non-outermost entry (i.e. it lands in idtentry, etc)
> and use some reasonable heuristic as to whether it's okay to keep
> unwinding.  You should be able to handle preemption like that, too --
> the unwind process will end up in an IRQ frame.

How exactly would the unwinder detect if a text address is in an
idtentry?  Maybe put all the idt entries in a special ELF section?
Andy Lutomirski April 29, 2016, 8:32 p.m. UTC | #5
On Fri, Apr 29, 2016 at 1:27 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> On Fri, Apr 29, 2016 at 01:19:23PM -0700, Andy Lutomirski wrote:
>> On Fri, Apr 29, 2016 at 1:11 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
>> > On Fri, Apr 29, 2016 at 11:06:53AM -0700, Andy Lutomirski wrote:
>> >> On Thu, Apr 28, 2016 at 1:44 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
>> >> > A preempted function might not have had a chance to save the frame
>> >> > pointer to the stack yet, which can result in its caller getting skipped
>> >> > on a stack trace.
>> >> >
>> >> > Add a flag to indicate when the task has been preempted so that stack
>> >> > dump code can determine whether the stack trace is reliable.
>> >>
>> >> I think I like this, but how do you handle the rather similar case in
>> >> which a task goes to sleep because it's waiting on IO that happened in
>> >> response to get_user, put_user, copy_from_user, etc?
>> >
>> > Hm, good question.  I was thinking that page faults had a dedicated
>> > stack, but now looking at the entry and traps code, that doesn't seem to
>> > be the case.
>> >
>> > Anyway I think it shouldn't be a problem if we make sure that any kernel
>> > function which might trigger a valid page fault (e.g.,
>> > copy_user_generic_string) do the proper frame pointer setup first.  Then
>> > the stack should still be reliable.
>> >
>> > In fact I might be able to teach objtool to enforce that: any function
>> > which uses an exception table should create a stack frame.
>> >
>> > Or alternatively, maybe set some kind of flag for page faults, similar
>> > to what I did with this patch.
>> >
>>
>> How about doing it the other way around: teach the unwinder to detect
>> when it hits a non-outermost entry (i.e. it lands in idtentry, etc)
>> and use some reasonable heuristic as to whether it's okay to keep
>> unwinding.  You should be able to handle preemption like that, too --
>> the unwind process will end up in an IRQ frame.
>
> How exactly would the unwinder detect if a text address is in an
> idtentry?  Maybe put all the idt entries in a special ELF section?
>

Hmm.

What actually happens when you unwind all the way into the entry code?
 Don't you end up in something that isn't in an ELF function?  Can you
detect that?  Ideally, the unwinder could actually detect that it's
hit a pt_regs struct and report that.  If used for stack dumps, it
could display some indication of this and then continue its unwinding
by decoding the pt_regs.  If used for patching, it could take some
other appropriate action.

I would have no objection to annotating all the pt_regs-style entry
code, whether by putting it in a separate section or by making a table
of addresses.

There are a couple of nasty cases if NMI or MCE is involved but, as of
4.6, outside of NMI, MCE, and vmalloc faults (ugh!), there should
always be a complete pt_regs on the stack before interrupts get
enabled for each entry.  Of course, finding the thing may be
nontrivial in case other things were pushed.  I suppose we could try
to rejigger the code so that rbp points to pt_regs or similar.

--Andy
Josh Poimboeuf April 29, 2016, 9:25 p.m. UTC | #6
On Fri, Apr 29, 2016 at 01:32:53PM -0700, Andy Lutomirski wrote:
> On Fri, Apr 29, 2016 at 1:27 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> > On Fri, Apr 29, 2016 at 01:19:23PM -0700, Andy Lutomirski wrote:
> >> On Fri, Apr 29, 2016 at 1:11 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> >> > On Fri, Apr 29, 2016 at 11:06:53AM -0700, Andy Lutomirski wrote:
> >> >> On Thu, Apr 28, 2016 at 1:44 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> >> >> > A preempted function might not have had a chance to save the frame
> >> >> > pointer to the stack yet, which can result in its caller getting skipped
> >> >> > on a stack trace.
> >> >> >
> >> >> > Add a flag to indicate when the task has been preempted so that stack
> >> >> > dump code can determine whether the stack trace is reliable.
> >> >>
> >> >> I think I like this, but how do you handle the rather similar case in
> >> >> which a task goes to sleep because it's waiting on IO that happened in
> >> >> response to get_user, put_user, copy_from_user, etc?
> >> >
> >> > Hm, good question.  I was thinking that page faults had a dedicated
> >> > stack, but now looking at the entry and traps code, that doesn't seem to
> >> > be the case.
> >> >
> >> > Anyway I think it shouldn't be a problem if we make sure that any kernel
> >> > function which might trigger a valid page fault (e.g.,
> >> > copy_user_generic_string) do the proper frame pointer setup first.  Then
> >> > the stack should still be reliable.
> >> >
> >> > In fact I might be able to teach objtool to enforce that: any function
> >> > which uses an exception table should create a stack frame.
> >> >
> >> > Or alternatively, maybe set some kind of flag for page faults, similar
> >> > to what I did with this patch.
> >> >
> >>
> >> How about doing it the other way around: teach the unwinder to detect
> >> when it hits a non-outermost entry (i.e. it lands in idtentry, etc)
> >> and use some reasonable heuristic as to whether it's okay to keep
> >> unwinding.  You should be able to handle preemption like that, too --
> >> the unwind process will end up in an IRQ frame.
> >
> > How exactly would the unwinder detect if a text address is in an
> > idtentry?  Maybe put all the idt entries in a special ELF section?
> >
> 
> Hmm.
> 
> What actually happens when you unwind all the way into the entry code?
>  Don't you end up in something that isn't in an ELF function?  Can you
> detect that?

For entry from user space (e.g., syscalls), it's easy to detect because
there's always a pt_regs at the bottom of the stack.  So if the unwinder
reaches the stack address at (thread.sp0 - sizeof(pt_regs)), it knows
it's done.

But for nested entry (e.g. in-kernel irqs/exceptions like preemption and
page faults which don't have dedicated stacks), where the pt_regs is
stored somewhere in the middle of the stack instead of the bottom,
there's no reliable way to detect that.

> Ideally, the unwinder could actually detect that it's
> hit a pt_regs struct and report that.  If used for stack dumps, it
> could display some indication of this and then continue its unwinding
> by decoding the pt_regs.  If used for patching, it could take some
> other appropriate action.
>
> I would have no objection to annotating all the pt_regs-style entry
> code, whether by putting it in a separate section or by making a table
> of addresses.

I think the easiest way to make it work would be to modify the idtentry
macro to put all the idt entries in a dedicated section.  Then the
unwinder could easily detect any calls from that code.

> There are a couple of nasty cases if NMI or MCE is involved but, as of
> 4.6, outside of NMI, MCE, and vmalloc faults (ugh!), there should
> always be a complete pt_regs on the stack before interrupts get
> enabled for each entry.  Of course, finding the thing may be
> nontrivial in case other things were pushed.

NMI, MCE and interrupts aren't a problem because they have dedicated
stacks, which are easy to detect.  If the tasks' stack is on an
exception stack or an irq stack, we consider it unreliable.

And also, they don't sleep.  The stack of any running task (other than
current) is automatically considered unreliable anyway, since they could
be modifying it while we're reading it.

> I suppose we could try to rejigger the code so that rbp points to
> pt_regs or similar.

I think we should avoid doing something like that because it would break
gdb and all the other unwinders who don't know about it.
Andy Lutomirski April 29, 2016, 9:37 p.m. UTC | #7
On Fri, Apr 29, 2016 at 2:25 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> On Fri, Apr 29, 2016 at 01:32:53PM -0700, Andy Lutomirski wrote:
>> On Fri, Apr 29, 2016 at 1:27 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
>> > On Fri, Apr 29, 2016 at 01:19:23PM -0700, Andy Lutomirski wrote:
>> >> On Fri, Apr 29, 2016 at 1:11 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
>> >> > On Fri, Apr 29, 2016 at 11:06:53AM -0700, Andy Lutomirski wrote:
>> >> >> On Thu, Apr 28, 2016 at 1:44 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
>> >> >> > A preempted function might not have had a chance to save the frame
>> >> >> > pointer to the stack yet, which can result in its caller getting skipped
>> >> >> > on a stack trace.
>> >> >> >
>> >> >> > Add a flag to indicate when the task has been preempted so that stack
>> >> >> > dump code can determine whether the stack trace is reliable.
>> >> >>
>> >> >> I think I like this, but how do you handle the rather similar case in
>> >> >> which a task goes to sleep because it's waiting on IO that happened in
>> >> >> response to get_user, put_user, copy_from_user, etc?
>> >> >
>> >> > Hm, good question.  I was thinking that page faults had a dedicated
>> >> > stack, but now looking at the entry and traps code, that doesn't seem to
>> >> > be the case.
>> >> >
>> >> > Anyway I think it shouldn't be a problem if we make sure that any kernel
>> >> > function which might trigger a valid page fault (e.g.,
>> >> > copy_user_generic_string) do the proper frame pointer setup first.  Then
>> >> > the stack should still be reliable.
>> >> >
>> >> > In fact I might be able to teach objtool to enforce that: any function
>> >> > which uses an exception table should create a stack frame.
>> >> >
>> >> > Or alternatively, maybe set some kind of flag for page faults, similar
>> >> > to what I did with this patch.
>> >> >
>> >>
>> >> How about doing it the other way around: teach the unwinder to detect
>> >> when it hits a non-outermost entry (i.e. it lands in idtentry, etc)
>> >> and use some reasonable heuristic as to whether it's okay to keep
>> >> unwinding.  You should be able to handle preemption like that, too --
>> >> the unwind process will end up in an IRQ frame.
>> >
>> > How exactly would the unwinder detect if a text address is in an
>> > idtentry?  Maybe put all the idt entries in a special ELF section?
>> >
>>
>> Hmm.
>>
>> What actually happens when you unwind all the way into the entry code?
>>  Don't you end up in something that isn't in an ELF function?  Can you
>> detect that?
>
> For entry from user space (e.g., syscalls), it's easy to detect because
> there's always a pt_regs at the bottom of the stack.  So if the unwinder
> reaches the stack address at (thread.sp0 - sizeof(pt_regs)), it knows
> it's done.
>
> But for nested entry (e.g. in-kernel irqs/exceptions like preemption and
> page faults which don't have dedicated stacks), where the pt_regs is
> stored somewhere in the middle of the stack instead of the bottom,
> there's no reliable way to detect that.

>
>> Ideally, the unwinder could actually detect that it's
>> hit a pt_regs struct and report that.  If used for stack dumps, it
>> could display some indication of this and then continue its unwinding
>> by decoding the pt_regs.  If used for patching, it could take some
>> other appropriate action.
>>
>> I would have no objection to annotating all the pt_regs-style entry
>> code, whether by putting it in a separate section or by making a table
>> of addresses.
>
> I think the easiest way to make it work would be to modify the idtentry
> macro to put all the idt entries in a dedicated section.  Then the
> unwinder could easily detect any calls from that code.

That would work.  Would it make sense to do the same for the irq entries?

I'd be glad to review a patch.  It should be straightforward.

>
>> There are a couple of nasty cases if NMI or MCE is involved but, as of
>> 4.6, outside of NMI, MCE, and vmalloc faults (ugh!), there should
>> always be a complete pt_regs on the stack before interrupts get
>> enabled for each entry.  Of course, finding the thing may be
>> nontrivial in case other things were pushed.
>
> NMI, MCE and interrupts aren't a problem because they have dedicated
> stacks, which are easy to detect.  If the tasks' stack is on an
> exception stack or an irq stack, we consider it unreliable.

Only on x86_64.

>
> And also, they don't sleep.  The stack of any running task (other than
> current) is automatically considered unreliable anyway, since they could
> be modifying it while we're reading it.

True.

>
>> I suppose we could try to rejigger the code so that rbp points to
>> pt_regs or similar.
>
> I think we should avoid doing something like that because it would break
> gdb and all the other unwinders who don't know about it.

How so?

Currently, rbp in the entry code is meaningless.  I'm suggesting that,
when we do, for example, 'call \do_sym' in idtentry, we point rbp to
the pt_regs.  Currently it points to something stale (which the
dump_stack code might be relying on.  Hmm.)  But it's probably also
safe to assume that if you unwind to the 'call \do_sym', then pt_regs
is the next thing on the stack, so just doing the section thing would
work.

We should really re-add DWARF some day.

--Andy
Jiri Kosina April 29, 2016, 10:11 p.m. UTC | #8
On Fri, 29 Apr 2016, Andy Lutomirski wrote:

> > NMI, MCE and interrupts aren't a problem because they have dedicated
> > stacks, which are easy to detect.  If the tasks' stack is on an
> > exception stack or an irq stack, we consider it unreliable.
> 
> Only on x86_64.

Well, MCEs are more or less x86-specific as well. But otherwise good 
point, thanks Andy.

So, how does stack layout generally look like in case when NMI is actually 
running on proper kernel stack? I thought it's guaranteed to contain 
pt_regs anyway in all cases. Is that not guaranteed to be the case?

Thanks,
Josh Poimboeuf April 29, 2016, 10:41 p.m. UTC | #9
On Fri, Apr 29, 2016 at 02:37:41PM -0700, Andy Lutomirski wrote:
> On Fri, Apr 29, 2016 at 2:25 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> > I think the easiest way to make it work would be to modify the idtentry
> > macro to put all the idt entries in a dedicated section.  Then the
> > unwinder could easily detect any calls from that code.
> 
> That would work.  Would it make sense to do the same for the irq entries?

Yes, I think so.

> >> I suppose we could try to rejigger the code so that rbp points to
> >> pt_regs or similar.
> >
> > I think we should avoid doing something like that because it would break
> > gdb and all the other unwinders who don't know about it.
> 
> How so?
> 
> Currently, rbp in the entry code is meaningless.  I'm suggesting that,
> when we do, for example, 'call \do_sym' in idtentry, we point rbp to
> the pt_regs.  Currently it points to something stale (which the
> dump_stack code might be relying on.  Hmm.)  But it's probably also
> safe to assume that if you unwind to the 'call \do_sym', then pt_regs
> is the next thing on the stack, so just doing the section thing would
> work.

Yes, rbp is meaningless on the entry from user space.  But if an
in-kernel interrupt occurs (e.g. page fault, preemption) and you have
nested entry, rbp keeps its old value, right?  So the unwinder can walk
past the nested entry frame and keep going until it gets to the original
entry.

> We should really re-add DWARF some day.

Working on it :-)
Josh Poimboeuf April 29, 2016, 10:57 p.m. UTC | #10
On Sat, Apr 30, 2016 at 12:11:45AM +0200, Jiri Kosina wrote:
> On Fri, 29 Apr 2016, Andy Lutomirski wrote:
> > > NMI, MCE and interrupts aren't a problem because they have dedicated
> > > stacks, which are easy to detect.  If the tasks' stack is on an
> > > exception stack or an irq stack, we consider it unreliable.
> > 
> > Only on x86_64.
> 
> Well, MCEs are more or less x86-specific as well. But otherwise good 
> point, thanks Andy.
> 
> So, how does stack layout generally look like in case when NMI is actually 
> running on proper kernel stack? I thought it's guaranteed to contain 
> pt_regs anyway in all cases. Is that not guaranteed to be the case?

If the NMI were using the normal kernel stack and it interrupted kernel
space, pt_regs would be placed in the "middle" of the stack rather than
the bottom, and there's currently no way to detect that.

However, NMIs don't sleep, and we only consider sleeping tasks for stack
reliability, so it wouldn't be an issue anyway.
Andy Lutomirski April 30, 2016, 12:08 a.m. UTC | #11
On Apr 29, 2016 3:41 PM, "Josh Poimboeuf" <jpoimboe@redhat.com> wrote:
>
> On Fri, Apr 29, 2016 at 02:37:41PM -0700, Andy Lutomirski wrote:
> > On Fri, Apr 29, 2016 at 2:25 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> > > I think the easiest way to make it work would be to modify the idtentry
> > > macro to put all the idt entries in a dedicated section.  Then the
> > > unwinder could easily detect any calls from that code.
> >
> > That would work.  Would it make sense to do the same for the irq entries?
>
> Yes, I think so.
>
> > >> I suppose we could try to rejigger the code so that rbp points to
> > >> pt_regs or similar.
> > >
> > > I think we should avoid doing something like that because it would break
> > > gdb and all the other unwinders who don't know about it.
> >
> > How so?
> >
> > Currently, rbp in the entry code is meaningless.  I'm suggesting that,
> > when we do, for example, 'call \do_sym' in idtentry, we point rbp to
> > the pt_regs.  Currently it points to something stale (which the
> > dump_stack code might be relying on.  Hmm.)  But it's probably also
> > safe to assume that if you unwind to the 'call \do_sym', then pt_regs
> > is the next thing on the stack, so just doing the section thing would
> > work.
>
> Yes, rbp is meaningless on the entry from user space.  But if an
> in-kernel interrupt occurs (e.g. page fault, preemption) and you have
> nested entry, rbp keeps its old value, right?  So the unwinder can walk
> past the nested entry frame and keep going until it gets to the original
> entry.

Yes.

It would be nice if we could do better, though, and actually notice
the pt_regs and identify the entry.  For example, I'd love to see
"page fault, RIP=xyz" printed in the middle of a stack dump on a
crash.  Also, I think that just following rbp links will lose the
actual function that took the page fault (or whatever function
pt_regs->ip actually points to).

>
> > We should really re-add DWARF some day.
>
> Working on it :-)

Excellent.

Have you looked at my vdso unwinding test at all?  If we could do
something similar for the kernel, IMO it would make testing much more
pleasant.

--Andy
Andy Lutomirski April 30, 2016, 12:09 a.m. UTC | #12
On Apr 29, 2016 3:11 PM, "Jiri Kosina" <jikos@kernel.org> wrote:
>
> On Fri, 29 Apr 2016, Andy Lutomirski wrote:
>
> > > NMI, MCE and interrupts aren't a problem because they have dedicated
> > > stacks, which are easy to detect.  If the tasks' stack is on an
> > > exception stack or an irq stack, we consider it unreliable.
> >
> > Only on x86_64.
>
> Well, MCEs are more or less x86-specific as well. But otherwise good
> point, thanks Andy.
>
> So, how does stack layout generally look like in case when NMI is actually
> running on proper kernel stack? I thought it's guaranteed to contain
> pt_regs anyway in all cases. Is that not guaranteed to be the case?
>

On x86, at least, there will still be pt_regs for the NMI.  For the
interrupted state, though, there might not be pt_regs, as the NMI
might have happened while still populating pt_regs.  In fact, the NMI
stack could overlap task_pt_regs.

For x86_32, there's no guarantee that pt_regs contains sp due to
hardware silliness.  You need to parse it more carefully, as,
!user_mode(regs), then the old sp is just above pt_regs.

--Andy
Josh Poimboeuf May 2, 2016, 1:52 p.m. UTC | #13
On Fri, Apr 29, 2016 at 05:08:50PM -0700, Andy Lutomirski wrote:
> On Apr 29, 2016 3:41 PM, "Josh Poimboeuf" <jpoimboe@redhat.com> wrote:
> >
> > On Fri, Apr 29, 2016 at 02:37:41PM -0700, Andy Lutomirski wrote:
> > > On Fri, Apr 29, 2016 at 2:25 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> > > >> I suppose we could try to rejigger the code so that rbp points to
> > > >> pt_regs or similar.
> > > >
> > > > I think we should avoid doing something like that because it would break
> > > > gdb and all the other unwinders who don't know about it.
> > >
> > > How so?
> > >
> > > Currently, rbp in the entry code is meaningless.  I'm suggesting that,
> > > when we do, for example, 'call \do_sym' in idtentry, we point rbp to
> > > the pt_regs.  Currently it points to something stale (which the
> > > dump_stack code might be relying on.  Hmm.)  But it's probably also
> > > safe to assume that if you unwind to the 'call \do_sym', then pt_regs
> > > is the next thing on the stack, so just doing the section thing would
> > > work.
> >
> > Yes, rbp is meaningless on the entry from user space.  But if an
> > in-kernel interrupt occurs (e.g. page fault, preemption) and you have
> > nested entry, rbp keeps its old value, right?  So the unwinder can walk
> > past the nested entry frame and keep going until it gets to the original
> > entry.
> 
> Yes.
> 
> It would be nice if we could do better, though, and actually notice
> the pt_regs and identify the entry.  For example, I'd love to see
> "page fault, RIP=xyz" printed in the middle of a stack dump on a
> crash.
>
> Also, I think that just following rbp links will lose the
> actual function that took the page fault (or whatever function
> pt_regs->ip actually points to).

Hm.  I think we could fix all that in a more standard way.  Whenever a
new pt_regs frame gets saved on entry, we could also create a new stack
frame which points to a fake kernel_entry() function.  That would tell
the unwinder there's a pt_regs frame without otherwise breaking frame
pointers across the frame.

Then I guess we wouldn't need my other solution of putting the idt
entries in a special section.

How does that sound?

> Have you looked at my vdso unwinding test at all?  If we could do
> something similar for the kernel, IMO it would make testing much more
> pleasant.

I found it, but I'm not sure what it would mean to do something similar
for the kernel.  Do you mean doing something like an NMI sampling-based
approach where we periodically do a random stack sanity check?

(If so, I do have something like that planned.)
Andy Lutomirski May 2, 2016, 3:52 p.m. UTC | #14
On Mon, May 2, 2016 at 6:52 AM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> On Fri, Apr 29, 2016 at 05:08:50PM -0700, Andy Lutomirski wrote:
>> On Apr 29, 2016 3:41 PM, "Josh Poimboeuf" <jpoimboe@redhat.com> wrote:
>> >
>> > On Fri, Apr 29, 2016 at 02:37:41PM -0700, Andy Lutomirski wrote:
>> > > On Fri, Apr 29, 2016 at 2:25 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
>> > > >> I suppose we could try to rejigger the code so that rbp points to
>> > > >> pt_regs or similar.
>> > > >
>> > > > I think we should avoid doing something like that because it would break
>> > > > gdb and all the other unwinders who don't know about it.
>> > >
>> > > How so?
>> > >
>> > > Currently, rbp in the entry code is meaningless.  I'm suggesting that,
>> > > when we do, for example, 'call \do_sym' in idtentry, we point rbp to
>> > > the pt_regs.  Currently it points to something stale (which the
>> > > dump_stack code might be relying on.  Hmm.)  But it's probably also
>> > > safe to assume that if you unwind to the 'call \do_sym', then pt_regs
>> > > is the next thing on the stack, so just doing the section thing would
>> > > work.
>> >
>> > Yes, rbp is meaningless on the entry from user space.  But if an
>> > in-kernel interrupt occurs (e.g. page fault, preemption) and you have
>> > nested entry, rbp keeps its old value, right?  So the unwinder can walk
>> > past the nested entry frame and keep going until it gets to the original
>> > entry.
>>
>> Yes.
>>
>> It would be nice if we could do better, though, and actually notice
>> the pt_regs and identify the entry.  For example, I'd love to see
>> "page fault, RIP=xyz" printed in the middle of a stack dump on a
>> crash.
>>
>> Also, I think that just following rbp links will lose the
>> actual function that took the page fault (or whatever function
>> pt_regs->ip actually points to).
>
> Hm.  I think we could fix all that in a more standard way.  Whenever a
> new pt_regs frame gets saved on entry, we could also create a new stack
> frame which points to a fake kernel_entry() function.  That would tell
> the unwinder there's a pt_regs frame without otherwise breaking frame
> pointers across the frame.
>
> Then I guess we wouldn't need my other solution of putting the idt
> entries in a special section.
>
> How does that sound?

Let me try to understand.

The normal call sequence is call; push %rbp; mov %rsp, %rbp.  So rbp
points to (prev rbp, prev rip) on the stack, and you can follow the
chain back.  Right now, on a user access page fault or similar, we
have rbp (probably) pointing to the interrupted frame, and the
interrupted rip isn't saved anywhere that a naive unwinder can find
it.  (It's in pt_regs, but the rbp chain skips right over that.)

We could change the entry code so that an interrupt / idtentry does:

push pt_regs
push kernel_entry
push %rbp
mov %rsp, %rbp
call handler
pop %rbp
addq $8, %rsp

or similar.  That would make it appear that the actual C handler was
caused by a dummy function "kernel_entry".  Now the unwinder would get
to kernel_entry, but it *still* wouldn't find its way to the calling
frame, which only solves part of the problem.  We could at least teach
the unwinder how kernel_entry works and let it decode pt_regs to
continue unwinding.  This would be nice, and I think it could work.

I think I like this, except that, if it used a separate section, it
could potentially be faster, as, for each actual entry type, the
offset from the C handler frame to pt_regs is a foregone conclusion.
But this is pretty simple and performance is already abysmal in most
handlers.

There's an added benefit to using a separate section, though: we could
also annotate the calls with what type of entry they were so the
unwinder could print it out nicely.

I could be convinced either way.


>
>> Have you looked at my vdso unwinding test at all?  If we could do
>> something similar for the kernel, IMO it would make testing much more
>> pleasant.
>
> I found it, but I'm not sure what it would mean to do something similar
> for the kernel.  Do you mean doing something like an NMI sampling-based
> approach where we periodically do a random stack sanity check?

I was imagining something a little more strict: single-step
interesting parts of the kernel and make sure that each step unwinds
correctly.  That could detect missing frames and similar.
Josh Poimboeuf May 2, 2016, 5:31 p.m. UTC | #15
On Mon, May 02, 2016 at 08:52:41AM -0700, Andy Lutomirski wrote:
> On Mon, May 2, 2016 at 6:52 AM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> > On Fri, Apr 29, 2016 at 05:08:50PM -0700, Andy Lutomirski wrote:
> >> On Apr 29, 2016 3:41 PM, "Josh Poimboeuf" <jpoimboe@redhat.com> wrote:
> >> >
> >> > On Fri, Apr 29, 2016 at 02:37:41PM -0700, Andy Lutomirski wrote:
> >> > > On Fri, Apr 29, 2016 at 2:25 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> >> > > >> I suppose we could try to rejigger the code so that rbp points to
> >> > > >> pt_regs or similar.
> >> > > >
> >> > > > I think we should avoid doing something like that because it would break
> >> > > > gdb and all the other unwinders who don't know about it.
> >> > >
> >> > > How so?
> >> > >
> >> > > Currently, rbp in the entry code is meaningless.  I'm suggesting that,
> >> > > when we do, for example, 'call \do_sym' in idtentry, we point rbp to
> >> > > the pt_regs.  Currently it points to something stale (which the
> >> > > dump_stack code might be relying on.  Hmm.)  But it's probably also
> >> > > safe to assume that if you unwind to the 'call \do_sym', then pt_regs
> >> > > is the next thing on the stack, so just doing the section thing would
> >> > > work.
> >> >
> >> > Yes, rbp is meaningless on the entry from user space.  But if an
> >> > in-kernel interrupt occurs (e.g. page fault, preemption) and you have
> >> > nested entry, rbp keeps its old value, right?  So the unwinder can walk
> >> > past the nested entry frame and keep going until it gets to the original
> >> > entry.
> >>
> >> Yes.
> >>
> >> It would be nice if we could do better, though, and actually notice
> >> the pt_regs and identify the entry.  For example, I'd love to see
> >> "page fault, RIP=xyz" printed in the middle of a stack dump on a
> >> crash.
> >>
> >> Also, I think that just following rbp links will lose the
> >> actual function that took the page fault (or whatever function
> >> pt_regs->ip actually points to).
> >
> > Hm.  I think we could fix all that in a more standard way.  Whenever a
> > new pt_regs frame gets saved on entry, we could also create a new stack
> > frame which points to a fake kernel_entry() function.  That would tell
> > the unwinder there's a pt_regs frame without otherwise breaking frame
> > pointers across the frame.
> >
> > Then I guess we wouldn't need my other solution of putting the idt
> > entries in a special section.
> >
> > How does that sound?
> 
> Let me try to understand.
> 
> The normal call sequence is call; push %rbp; mov %rsp, %rbp.  So rbp
> points to (prev rbp, prev rip) on the stack, and you can follow the
> chain back.  Right now, on a user access page fault or similar, we
> have rbp (probably) pointing to the interrupted frame, and the
> interrupted rip isn't saved anywhere that a naive unwinder can find
> it.  (It's in pt_regs, but the rbp chain skips right over that.)
> 
> We could change the entry code so that an interrupt / idtentry does:
> 
> push pt_regs
> push kernel_entry
> push %rbp
> mov %rsp, %rbp
> call handler
> pop %rbp
> addq $8, %rsp
> 
> or similar.  That would make it appear that the actual C handler was
> caused by a dummy function "kernel_entry".  Now the unwinder would get
> to kernel_entry, but it *still* wouldn't find its way to the calling
> frame, which only solves part of the problem.  We could at least teach
> the unwinder how kernel_entry works and let it decode pt_regs to
> continue unwinding.  This would be nice, and I think it could work.

Yeah, that's about what I had in mind.

> I think I like this, except that, if it used a separate section, it
> could potentially be faster, as, for each actual entry type, the
> offset from the C handler frame to pt_regs is a foregone conclusion.

Hm, this I don't really follow.  It's true that the unwinder can easily
find RIP from pt_regs, which will always be a known offset from the
kernel_entry pointer on the stack.  But why would having the entry code
in a separate section make that faster?

> But this is pretty simple and performance is already abysmal in most
> handlers.
> 
> There's an added benefit to using a separate section, though: we could
> also annotate the calls with what type of entry they were so the
> unwinder could print it out nicely.

Yeah, that could be a nice feature... but doesn't printing the name of
the C handler pretty much already give that information?

In any case, once we have a working DWARF unwinder, I think it will show
the name of the idt entry anyway.

> >> Have you looked at my vdso unwinding test at all?  If we could do
> >> something similar for the kernel, IMO it would make testing much more
> >> pleasant.
> >
> > I found it, but I'm not sure what it would mean to do something similar
> > for the kernel.  Do you mean doing something like an NMI sampling-based
> > approach where we periodically do a random stack sanity check?
> 
> I was imagining something a little more strict: single-step
> interesting parts of the kernel and make sure that each step unwinds
> correctly.  That could detect missing frames and similar.

Interesting idea, though I wonder how hard it would be to reliably
distinguish a missing frame from the case where gcc decides to inline a
function.

Another idea to detect missing frames: for each return address on the
stack, ensure there's a corresponding "call <func>" instruction
immediately preceding the return location, where <func> matches what's
on the stack.

That wouldn't work so well for indirect calls using function pointers,
but even then maybe we could use the DWARF CFI to find the function
pointer value and validate that it matches the stack function.
Andy Lutomirski May 2, 2016, 6:12 p.m. UTC | #16
On Mon, May 2, 2016 at 10:31 AM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> On Mon, May 02, 2016 at 08:52:41AM -0700, Andy Lutomirski wrote:
>> On Mon, May 2, 2016 at 6:52 AM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
>> > On Fri, Apr 29, 2016 at 05:08:50PM -0700, Andy Lutomirski wrote:
>> >> On Apr 29, 2016 3:41 PM, "Josh Poimboeuf" <jpoimboe@redhat.com> wrote:
>> >> >
>> >> > On Fri, Apr 29, 2016 at 02:37:41PM -0700, Andy Lutomirski wrote:
>> >> > > On Fri, Apr 29, 2016 at 2:25 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
>> >> > > >> I suppose we could try to rejigger the code so that rbp points to
>> >> > > >> pt_regs or similar.
>> >> > > >
>> >> > > > I think we should avoid doing something like that because it would break
>> >> > > > gdb and all the other unwinders who don't know about it.
>> >> > >
>> >> > > How so?
>> >> > >
>> >> > > Currently, rbp in the entry code is meaningless.  I'm suggesting that,
>> >> > > when we do, for example, 'call \do_sym' in idtentry, we point rbp to
>> >> > > the pt_regs.  Currently it points to something stale (which the
>> >> > > dump_stack code might be relying on.  Hmm.)  But it's probably also
>> >> > > safe to assume that if you unwind to the 'call \do_sym', then pt_regs
>> >> > > is the next thing on the stack, so just doing the section thing would
>> >> > > work.
>> >> >
>> >> > Yes, rbp is meaningless on the entry from user space.  But if an
>> >> > in-kernel interrupt occurs (e.g. page fault, preemption) and you have
>> >> > nested entry, rbp keeps its old value, right?  So the unwinder can walk
>> >> > past the nested entry frame and keep going until it gets to the original
>> >> > entry.
>> >>
>> >> Yes.
>> >>
>> >> It would be nice if we could do better, though, and actually notice
>> >> the pt_regs and identify the entry.  For example, I'd love to see
>> >> "page fault, RIP=xyz" printed in the middle of a stack dump on a
>> >> crash.
>> >>
>> >> Also, I think that just following rbp links will lose the
>> >> actual function that took the page fault (or whatever function
>> >> pt_regs->ip actually points to).
>> >
>> > Hm.  I think we could fix all that in a more standard way.  Whenever a
>> > new pt_regs frame gets saved on entry, we could also create a new stack
>> > frame which points to a fake kernel_entry() function.  That would tell
>> > the unwinder there's a pt_regs frame without otherwise breaking frame
>> > pointers across the frame.
>> >
>> > Then I guess we wouldn't need my other solution of putting the idt
>> > entries in a special section.
>> >
>> > How does that sound?
>>
>> Let me try to understand.
>>
>> The normal call sequence is call; push %rbp; mov %rsp, %rbp.  So rbp
>> points to (prev rbp, prev rip) on the stack, and you can follow the
>> chain back.  Right now, on a user access page fault or similar, we
>> have rbp (probably) pointing to the interrupted frame, and the
>> interrupted rip isn't saved anywhere that a naive unwinder can find
>> it.  (It's in pt_regs, but the rbp chain skips right over that.)
>>
>> We could change the entry code so that an interrupt / idtentry does:
>>
>> push pt_regs
>> push kernel_entry
>> push %rbp
>> mov %rsp, %rbp
>> call handler
>> pop %rbp
>> addq $8, %rsp
>>
>> or similar.  That would make it appear that the actual C handler was
>> caused by a dummy function "kernel_entry".  Now the unwinder would get
>> to kernel_entry, but it *still* wouldn't find its way to the calling
>> frame, which only solves part of the problem.  We could at least teach
>> the unwinder how kernel_entry works and let it decode pt_regs to
>> continue unwinding.  This would be nice, and I think it could work.
>
> Yeah, that's about what I had in mind.

FWIW, I just tried this:

static bool is_entry_text(unsigned long addr)
{
    return addr >= (unsigned long)__entry_text_start &&
        addr < (unsigned long)__entry_text_end;
}

it works.  So the entry code is already annotated reasonably well :)

I just hacked it up here:

https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=stack&id=085eacfe0edfc18768e48340084415dba9a6bd21

and it seems to work, at least for page faults.  A better
implementation would print out the entire contents of pt_regs so that
people reading the stack trace will know the registers at the time of
the exception, which might be helpful.

>
>> I think I like this, except that, if it used a separate section, it
>> could potentially be faster, as, for each actual entry type, the
>> offset from the C handler frame to pt_regs is a foregone conclusion.
>
> Hm, this I don't really follow.  It's true that the unwinder can easily
> find RIP from pt_regs, which will always be a known offset from the
> kernel_entry pointer on the stack.  But why would having the entry code
> in a separate section make that faster?

It doesn't make the unwinder faster -- it makes the entry code faster.

>
>> But this is pretty simple and performance is already abysmal in most
>> handlers.
>>
>> There's an added benefit to using a separate section, though: we could
>> also annotate the calls with what type of entry they were so the
>> unwinder could print it out nicely.
>
> Yeah, that could be a nice feature... but doesn't printing the name of
> the C handler pretty much already give that information?
>
> In any case, once we have a working DWARF unwinder, I think it will show
> the name of the idt entry anyway.

True.  And it'll automatically follow pt_regs.

>
>> >> Have you looked at my vdso unwinding test at all?  If we could do
>> >> something similar for the kernel, IMO it would make testing much more
>> >> pleasant.
>> >
>> > I found it, but I'm not sure what it would mean to do something similar
>> > for the kernel.  Do you mean doing something like an NMI sampling-based
>> > approach where we periodically do a random stack sanity check?
>>
>> I was imagining something a little more strict: single-step
>> interesting parts of the kernel and make sure that each step unwinds
>> correctly.  That could detect missing frames and similar.
>
> Interesting idea, though I wonder how hard it would be to reliably
> distinguish a missing frame from the case where gcc decides to inline a
> function.
>
> Another idea to detect missing frames: for each return address on the
> stack, ensure there's a corresponding "call <func>" instruction
> immediately preceding the return location, where <func> matches what's
> on the stack.

Hmm, interesting.

I hope your plans include rewriting the current stack unwinder
completely.  The thing in print_context_stack is (a)
hard-to-understand and hard-to-modify crap and (b) is called in a loop
from another file using totally ridiculous conventions.

--Andy
Ingo Molnar May 2, 2016, 6:34 p.m. UTC | #17
* Andy Lutomirski <luto@amacapital.net> wrote:

> > Another idea to detect missing frames: for each return address on the stack, 
> > ensure there's a corresponding "call <func>" instruction immediately preceding 
> > the return location, where <func> matches what's on the stack.
> 
> Hmm, interesting.
> 
> I hope your plans include rewriting the current stack unwinder completely.  The 
> thing in print_context_stack is (a) hard-to-understand and hard-to-modify crap 
> and (b) is called in a loop from another file using totally ridiculous 
> conventions.

So we had several attempts at making it better, any further improvements 
(including radical rewrites) are more than welcome!

The generalization between the various stack walking methods certainly didn't make 
things easier to read - we might want to eliminate that by using better primitives 
to iterate over the stack frame.

Thanks,

	Ingo
Josh Poimboeuf May 2, 2016, 7:44 p.m. UTC | #18
On Mon, May 02, 2016 at 11:12:39AM -0700, Andy Lutomirski wrote:
> On Mon, May 2, 2016 at 10:31 AM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> > On Mon, May 02, 2016 at 08:52:41AM -0700, Andy Lutomirski wrote:
> >> On Mon, May 2, 2016 at 6:52 AM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> >> > On Fri, Apr 29, 2016 at 05:08:50PM -0700, Andy Lutomirski wrote:
> >> >> On Apr 29, 2016 3:41 PM, "Josh Poimboeuf" <jpoimboe@redhat.com> wrote:
> >> >> >
> >> >> > On Fri, Apr 29, 2016 at 02:37:41PM -0700, Andy Lutomirski wrote:
> >> >> > > On Fri, Apr 29, 2016 at 2:25 PM, Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> >> >> > > >> I suppose we could try to rejigger the code so that rbp points to
> >> >> > > >> pt_regs or similar.
> >> >> > > >
> >> >> > > > I think we should avoid doing something like that because it would break
> >> >> > > > gdb and all the other unwinders who don't know about it.
> >> >> > >
> >> >> > > How so?
> >> >> > >
> >> >> > > Currently, rbp in the entry code is meaningless.  I'm suggesting that,
> >> >> > > when we do, for example, 'call \do_sym' in idtentry, we point rbp to
> >> >> > > the pt_regs.  Currently it points to something stale (which the
> >> >> > > dump_stack code might be relying on.  Hmm.)  But it's probably also
> >> >> > > safe to assume that if you unwind to the 'call \do_sym', then pt_regs
> >> >> > > is the next thing on the stack, so just doing the section thing would
> >> >> > > work.
> >> >> >
> >> >> > Yes, rbp is meaningless on the entry from user space.  But if an
> >> >> > in-kernel interrupt occurs (e.g. page fault, preemption) and you have
> >> >> > nested entry, rbp keeps its old value, right?  So the unwinder can walk
> >> >> > past the nested entry frame and keep going until it gets to the original
> >> >> > entry.
> >> >>
> >> >> Yes.
> >> >>
> >> >> It would be nice if we could do better, though, and actually notice
> >> >> the pt_regs and identify the entry.  For example, I'd love to see
> >> >> "page fault, RIP=xyz" printed in the middle of a stack dump on a
> >> >> crash.
> >> >>
> >> >> Also, I think that just following rbp links will lose the
> >> >> actual function that took the page fault (or whatever function
> >> >> pt_regs->ip actually points to).
> >> >
> >> > Hm.  I think we could fix all that in a more standard way.  Whenever a
> >> > new pt_regs frame gets saved on entry, we could also create a new stack
> >> > frame which points to a fake kernel_entry() function.  That would tell
> >> > the unwinder there's a pt_regs frame without otherwise breaking frame
> >> > pointers across the frame.
> >> >
> >> > Then I guess we wouldn't need my other solution of putting the idt
> >> > entries in a special section.
> >> >
> >> > How does that sound?
> >>
> >> Let me try to understand.
> >>
> >> The normal call sequence is call; push %rbp; mov %rsp, %rbp.  So rbp
> >> points to (prev rbp, prev rip) on the stack, and you can follow the
> >> chain back.  Right now, on a user access page fault or similar, we
> >> have rbp (probably) pointing to the interrupted frame, and the
> >> interrupted rip isn't saved anywhere that a naive unwinder can find
> >> it.  (It's in pt_regs, but the rbp chain skips right over that.)
> >>
> >> We could change the entry code so that an interrupt / idtentry does:
> >>
> >> push pt_regs
> >> push kernel_entry
> >> push %rbp
> >> mov %rsp, %rbp
> >> call handler
> >> pop %rbp
> >> addq $8, %rsp
> >>
> >> or similar.  That would make it appear that the actual C handler was
> >> caused by a dummy function "kernel_entry".  Now the unwinder would get
> >> to kernel_entry, but it *still* wouldn't find its way to the calling
> >> frame, which only solves part of the problem.  We could at least teach
> >> the unwinder how kernel_entry works and let it decode pt_regs to
> >> continue unwinding.  This would be nice, and I think it could work.
> >
> > Yeah, that's about what I had in mind.
> 
> FWIW, I just tried this:
> 
> static bool is_entry_text(unsigned long addr)
> {
>     return addr >= (unsigned long)__entry_text_start &&
>         addr < (unsigned long)__entry_text_end;
> }
> 
> it works.  So the entry code is already annotated reasonably well :)
> 
> I just hacked it up here:
> 
> https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=stack&id=085eacfe0edfc18768e48340084415dba9a6bd21
> 
> and it seems to work, at least for page faults.  A better
> implementation would print out the entire contents of pt_regs so that
> people reading the stack trace will know the registers at the time of
> the exception, which might be helpful.

I still think we would need more specific annotations to do that
reliably: a call from entry code doesn't necessarily correlate with a
pt_regs frame.

> >> I think I like this, except that, if it used a separate section, it
> >> could potentially be faster, as, for each actual entry type, the
> >> offset from the C handler frame to pt_regs is a foregone conclusion.
> >
> > Hm, this I don't really follow.  It's true that the unwinder can easily
> > find RIP from pt_regs, which will always be a known offset from the
> > kernel_entry pointer on the stack.  But why would having the entry code
> > in a separate section make that faster?
> 
> It doesn't make the unwinder faster -- it makes the entry code faster.

Oh, right.  But I don't think a few extra frame pointer instructions are
much of an issue if you already have CONFIG_FRAME_POINTER enabled.

Anyway I'm not sure which way is better.  I'll think about it.

> I hope your plans include rewriting the current stack unwinder
> completely.  The thing in print_context_stack is (a)
> hard-to-understand and hard-to-modify crap and (b) is called in a loop
> from another file using totally ridiculous conventions.

I agree, that code is quite confusing.  I haven't really thought about
how specifically it could be improved or replaced though.

Along those lines, I think it would be awesome if we could have an
arch-independent DWARF unwinder so that most of the stack dumping code
could be shared amongst all the arches.
Jiri Kosina May 2, 2016, 7:54 p.m. UTC | #19
On Mon, 2 May 2016, Andy Lutomirski wrote:

> FWIW, I just tried this:
> 
> static bool is_entry_text(unsigned long addr)
> {
>     return addr >= (unsigned long)__entry_text_start &&
>         addr < (unsigned long)__entry_text_end;
> }
> 
> it works.  So the entry code is already annotated reasonably well :)
> 
> I just hacked it up here:
> 
> https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=stack&id=085eacfe0edfc18768e48340084415dba9a6bd21
> 
> and it seems to work, at least for page faults.  A better
> implementation would print out the entire contents of pt_regs so that
> people reading the stack trace will know the registers at the time of
> the exception, which might be helpful.

Sorry for being dense, but how do you distinguish here between a "real" 
kernel entry, that pushes pt_regs, and any "non-entry" function call that 
passes pt_regs around?
Jiri Kosina May 2, 2016, 8 p.m. UTC | #20
On Mon, 2 May 2016, Jiri Kosina wrote:

> > FWIW, I just tried this:
> > 
> > static bool is_entry_text(unsigned long addr)
> > {
> >     return addr >= (unsigned long)__entry_text_start &&
> >         addr < (unsigned long)__entry_text_end;
> > }
> > 
> > it works.  So the entry code is already annotated reasonably well :)
> > 
> > I just hacked it up here:
> > 
> > https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=stack&id=085eacfe0edfc18768e48340084415dba9a6bd21
> > 
> > and it seems to work, at least for page faults.  A better
> > implementation would print out the entire contents of pt_regs so that
> > people reading the stack trace will know the registers at the time of
> > the exception, which might be helpful.
> 
> Sorry for being dense, but how do you distinguish here between a "real" 
> kernel entry, that pushes pt_regs, and any "non-entry" function call that 
> passes pt_regs around?

Umm, actually, the more tricky part is the other way around -- how do you 
make sure that whenever you are calling out from a code between 
__entry_text_start and __entry_text_end, pt_regs will be at the place 
you're looking for it? How's that guaranteed?

Thanks,
Andy Lutomirski May 3, 2016, 12:39 a.m. UTC | #21
On Mon, May 2, 2016 at 1:00 PM, Jiri Kosina <jikos@kernel.org> wrote:
> On Mon, 2 May 2016, Jiri Kosina wrote:
>
>> > FWIW, I just tried this:
>> >
>> > static bool is_entry_text(unsigned long addr)
>> > {
>> >     return addr >= (unsigned long)__entry_text_start &&
>> >         addr < (unsigned long)__entry_text_end;
>> > }
>> >
>> > it works.  So the entry code is already annotated reasonably well :)
>> >
>> > I just hacked it up here:
>> >
>> > https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=stack&id=085eacfe0edfc18768e48340084415dba9a6bd21
>> >
>> > and it seems to work, at least for page faults.  A better
>> > implementation would print out the entire contents of pt_regs so that
>> > people reading the stack trace will know the registers at the time of
>> > the exception, which might be helpful.
>>
>> Sorry for being dense, but how do you distinguish here between a "real"
>> kernel entry, that pushes pt_regs, and any "non-entry" function call that
>> passes pt_regs around?
>
> Umm, actually, the more tricky part is the other way around -- how do you
> make sure that whenever you are calling out from a code between
> __entry_text_start and __entry_text_end, pt_regs will be at the place
> you're looking for it? How's that guaranteed?

It's not guaranteed in my code.  I think we'd want to add a little
table of call sites and their pt_regs offsets.  This was just meant to
test that the general idea works (and it does indeed generate better
traces than the stock kernel, which gets it unconditionally wrong).

--Andy

>
> Thanks,
>
> --
> Jiri Kosina
> SUSE Labs
>
David Laight May 4, 2016, 3:16 p.m. UTC | #22
From: Andy Lutomirski

> Sent: 02 May 2016 19:13

...
> I hope your plans include rewriting the current stack unwinder

> completely.  The thing in print_context_stack is (a)

> hard-to-understand and hard-to-modify crap and (b) is called in a loop

> from another file using totally ridiculous conventions.


I've seen a 'stack unwinder' that parsed the instruction stream
forwards looking for 'return' instructions.
I fixed it to add a few extra instructions needs to sort out the
exit path from hardware interrupts.

It only had to understand instructions that modified %sp and %bp
and remember which branch instructions and branch targets it had
used in order to find the correct exit path from a function.

Worked reasonably well without any debug info or guaranteed frame
pointers.
It did have to fall back on scanning the stack if it was inside
an infinite loop. Even on x86 it is reasonably possible to check
for 'call' instructions in this case.

	David
diff mbox

Patch

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 3d31572..fb364a0 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2137,6 +2137,7 @@  extern void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut,
 #define PF_SWAPWRITE	0x00800000	/* Allowed to write to swap */
 #define PF_NO_SETAFFINITY 0x04000000	/* Userland is not allowed to meddle with cpus_allowed */
 #define PF_MCE_EARLY    0x08000000      /* Early kill for mce process policy */
+#define PF_PREEMPT_IRQ	0x10000000	/* Thread is preempted by an irq */
 #define PF_MUTEX_TESTER	0x20000000	/* Thread belongs to the rt mutex tester */
 #define PF_FREEZER_SKIP	0x40000000	/* Freezer should not count it as freezable */
 #define PF_SUSPEND_TASK 0x80000000      /* this thread called freeze_processes and should not be frozen */
diff --git a/kernel/fork.c b/kernel/fork.c
index b73a539..d2fe04a 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1373,7 +1373,7 @@  static struct task_struct *copy_process(unsigned long clone_flags,
 		goto bad_fork_cleanup_count;
 
 	delayacct_tsk_init(p);	/* Must remain after dup_task_struct() */
-	p->flags &= ~(PF_SUPERPRIV | PF_WQ_WORKER);
+	p->flags &= ~(PF_SUPERPRIV | PF_WQ_WORKER | PF_PREEMPT_IRQ);
 	p->flags |= PF_FORKNOEXEC;
 	INIT_LIST_HEAD(&p->children);
 	INIT_LIST_HEAD(&p->sibling);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 9d84d60..7594267 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3422,6 +3422,8 @@  asmlinkage __visible void __sched preempt_schedule_irq(void)
 
 	prev_state = exception_enter();
 
+	current->flags |= PF_PREEMPT_IRQ;
+
 	do {
 		preempt_disable();
 		local_irq_enable();
@@ -3430,6 +3432,8 @@  asmlinkage __visible void __sched preempt_schedule_irq(void)
 		sched_preempt_enable_no_resched();
 	} while (need_resched());
 
+	current->flags &= ~PF_PREEMPT_IRQ;
+
 	exception_exit(prev_state);
 }