Message ID:    1497416757-9894-1-git-send-email-rob.gardner@oracle.com
State:         Changes Requested
Delegated to:  David Miller
From: Rob Gardner <rob.gardner@oracle.com>
Date: Tue, 13 Jun 2017 23:05:57 -0600

> One particular critical section occurs in switch_mm:
>
>	spin_lock_irqsave(&mm->context.lock, flags);
>	...
>	load_secondary_context(mm);
>	tsb_context_switch(mm);
>	...
>	spin_unlock_irqrestore(&mm->context.lock, flags);
>
> If a perf interrupt arrives in between load_secondary_context() and
> tsb_context_switch(), then perf_callchain_user() could execute with
> the context ID of one process, but with an active TSB for a different
> process. When the user stack is accessed, it is very likely to incur
> a TLB miss, since the h/w context ID has been changed. The TLB will
> then be reloaded with a translation from the TSB for one process, but
> using a context ID for another process. This exposes memory from one
> process to another, and since it is a mapping for stack memory, this
> usually causes the new process to crash quickly.

Ahhh, good catch.

> Some potential solutions are:
>
> 1) Make critical sections run at PIL_NMI instead of PIL_NORMAL_MAX.
> This would certainly eliminate the problem, but it would also prevent
> perf from having any visibility into code running in these critical
> sections, and it seems clear that PIL_NORMAL_MAX is used for just
> this reason.
>
> 2) Protect this particular critical section by masking all interrupts,
> either by setting %pil to PIL_NMI or by clearing pstate.ie around the
> calls to load_secondary_context() and tsb_context_switch(). This
> approach has a few drawbacks:
>
>    - It would only address this particular critical section, and would
>      have to be repeated in several other known places. There might be
>      other such critical sections that are not known.
>
>    - It has a performance cost which would be incurred at every
>      context switch, since it would require additional accesses to
>      %pil or %pstate.
>
>    - Turning off pstate.ie would require changing __tsb_context_switch(),
>      which expects to be called with pstate.ie on.
>
> 3) Avoid user space MMU activity entirely in perf_callchain_user() by
> implementing a new copy_from_user() function that accesses the user
> stack via physical addresses. This works, but requires quite a bit of
> new code to get it to perform reasonably, i.e., caching of
> translations, etc.
>
> 4) Allow the perf interrupt to happen in existing critical sections as
> it does now, but have perf code detect that this is happening, and
> skip any user callchain processing. This approach was deemed best, as
> the change is extremely localized and covers both known and unknown
> instances of perf interrupting critical sections. Perf has interrupted
> a critical section when %pil == PIL_NORMAL_MAX at the time of the perf
> interrupt.
>
> Ordinarily, a function that has a pt_regs passed in can simply examine
> (regs->tstate & TSTATE_PIL) to extract the interrupt level that was in
> effect at the time of the perf interrupt. However, when the perf
> interrupt occurs while executing in the kernel, the pt_regs pointer is
> replaced with task_pt_regs(current) in perf_callchain(), and so
> perf_callchain_user() does not have access to the state of the machine
> at the time of the perf interrupt. To work around this, we check
> (regs->tstate & TSTATE_PIL) in perf_event_nmi_handler() before calling
> in to the arch-independent part of perf, and temporarily change the
> event attributes so that user callchain processing is avoided. Though
> a user stack sample is not collected, this loss is not statistically
> significant. Kernel call graph collection is not affected.

I tried to figure out how x86 handles this, and gave up after some
time :-)  Can you figure it out?

PowerPC handles this by hard disabling IRQs from switch_mm until
switch_to completes.

I think excluding all local_irq_disabled() paths from having stack
backtraces is not good. I'd rather see PIL_MAX get set from switch_mm
until switch_to completes.

Ok?
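
[Editor's note: a minimal sketch of the approach David suggests here,
i.e. raising %pil across the inconsistent window so that even the
level-15 perf interrupt is masked. The helper names are hypothetical
and this is not the patch that was merged; the inline asm follows the
style used by arch/sparc/include/asm/irqflags_64.h.]

	/* Hypothetical helpers, illustration only. */
	static inline unsigned long pil_raise_to_nmi(void)
	{
		unsigned long old_pil;

		/* Save the current %pil, then mask everything up to
		 * and including the level-15 perf/NMI interrupt. */
		__asm__ __volatile__("rdpr	%%pil, %0\n\t"
				     "wrpr	%1, %%pil"
				     : "=&r" (old_pil)
				     : "i" (PIL_NMI)
				     : "memory");
		return old_pil;
	}

	static inline void pil_restore(unsigned long old_pil)
	{
		__asm__ __volatile__("wrpr	%0, %%pil"
				     : /* no outputs */
				     : "r" (old_pil)
				     : "memory");
	}

[switch_mm() would then bracket load_secondary_context() and
tsb_context_switch() with pil_raise_to_nmi()/pil_restore(), with the
restore deferred until after switch_to completes, per the suggestion
above.]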

On 06/14/2017 10:25 AM, David Miller wrote:
> From: Rob Gardner <rob.gardner@oracle.com>
> Date: Tue, 13 Jun 2017 23:05:57 -0600
>
> [... full quote of the original patch description snipped ...]
>
> I tried to figure out how x86 handles this, and gave up after some
> time :-)  Can you figure it out?

I could look.

> PowerPC handles this by hard disabling IRQs from switch_mm until
> switch_to completes.
>
> I think excluding all local_irq_disabled() paths from having stack
> backtraces is not good.

Just to clarify, this does not exclude local_irq_disabled() paths from
having stack backtraces. It only disables collection of *user* stack
backtraces during these (very rare) times. Collection of *kernel*
stack traces is not affected, so we still get the same visibility from
perf for kernel call graphs. I do not believe we sacrifice anything
here in terms of user stack collection, since we'll only skip it when:

1) we're already executing in the kernel when the perf interrupt
   arrives, which is already (what should be) a rare event;

2) and the kernel is executing in a local_irq_disabled() path, which
   should also be very rare.

> I'd rather see PIL_MAX get set from switch_mm
> until switch_to completes.
>
> Ok?

In light of the above, what is your preference?

Rob

From: Rob Gardner <rob.gardner@oracle.com>
Date: Wed, 14 Jun 2017 10:47:08 -0600

> On 06/14/2017 10:25 AM, David Miller wrote:
> [... earlier quoted discussion snipped ...]
>
>> I think excluding all local_irq_disabled() paths from having stack
>> backtraces is not good.
>
> Just to clarify, this does not exclude local_irq_disabled() paths from
> having stack backtraces. It only disables collection of *user* stack
> backtraces during these (very rare) times. Collection of *kernel*
> stack traces is not affected, so we still get the same visibility from
> perf for kernel call graphs. I do not believe we sacrifice anything
> here in terms of user stack collection, since we'll only skip it when:
>
> 1) we're already executing in the kernel when the perf interrupt
>    arrives, which is already (what should be) a rare event;
>
> 2) and the kernel is executing in a local_irq_disabled() path, which
>    should also be very rare.

I understood this from the beginning.

>> I'd rather see PIL_MAX get set from switch_mm
>> until switch_to completes.
>>
>> Ok?
>
> In light of the above, what is your preference?

I still have the same position.

That lack of user backtrace is a loss of information. And in order to
make it work, we must have the case that the PGD/TSB/mmu-context/stack
all match up precisely.

On 06/14/2017 11:05 AM, David Miller wrote:
> [... earlier quoted discussion snipped ...]
>
> I still have the same position.
>
> That lack of user backtrace is a loss of information. And in order to
> make it work, we must have the case that the PGD/TSB/mmu-context/stack
> all match up precisely.

OK, thanks for the guidance. We'll work up a new patch.

Rob

From: Rob Gardner <rob.gardner@oracle.com>
Date: Wed, 14 Jun 2017 11:24:36 -0600

> OK, thanks for the guidance. We'll work up a new patch.

Thanks a lot for working on this.

> I tried to figure out how x86 handles this, and gave up after some
> time :-)  Can you figure it out?

I've had a look. What I found is that on x86,
event->attr.exclude_callchain_user is not set from arch/x86/*, but
perf unconditionally uses the NMI-specific version of copy_from_user
to prevent perf_callchain_user() from generating a fault:

	arch/x86/include/asm/perf_event.h:
	#define arch_perf_out_copy_user copy_from_user_nmi

where copy_from_user_nmi() is defined in arch/x86/lib/usercopy.c as
follows:

	/*
	 * We rely on the nested NMI work to allow atomic faults from the NMI
	 * path; the nested NMI paths are careful to preserve CR2.
	 */
	unsigned long
	copy_from_user_nmi(void *to, const void __user *from, unsigned long n)
	{
		unsigned long ret;

		if (__range_not_ok(from, n, TASK_SIZE))
			return 0;

		/*
		 * Even though this function is typically called from NMI/IRQ
		 * context, disable pagefaults so that its behaviour is
		 * consistent even when called from other contexts.
		 */
		pagefault_disable();
		ret = __copy_from_user_inatomic(to, from, n);
		pagefault_enable();

		return ret;
	}
	EXPORT_SYMBOL_GPL(copy_from_user_nmi);

And for other archs, the kernel falls back to a version very similar
to copy_from_user_nmi(), in kernel/events/internal.h, where faults are
likewise disabled:

	#ifndef arch_perf_out_copy_user
	#define arch_perf_out_copy_user arch_perf_out_copy_user

	static inline unsigned long
	arch_perf_out_copy_user(void *dst, const void *src, unsigned long n)
	{
		unsigned long ret;

		pagefault_disable();
		ret = __copy_from_user_inatomic(dst, src, n);
		pagefault_enable();

		return ret;
	}
	#endif

	DEFINE_OUTPUT_COPY(__output_copy_user, arch_perf_out_copy_user)

I believe that's how faults are prevented from perf.

-eric
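
[Editor's note: a minimal sketch, assuming the x86 pattern were simply
carried over to sparc; sparc_copy_from_user_nmi is a hypothetical name
and this is not code from the thread or the tree. Note that
pagefault_disable() would likely not by itself prevent the stale TLB
load described at the top of this thread, since on sparc64 the
problematic fill happens in the low-level TSB miss handlers before the
Linux fault path runs; this appears to be why option 3 proposed
accessing the user stack via physical addresses.]

	unsigned long
	sparc_copy_from_user_nmi(void *to, const void __user *from,
				 unsigned long n)
	{
		unsigned long ret;

		/* Refuse kernel addresses outright. */
		if (!access_ok(VERIFY_READ, from, n))
			return 0;

		/* Mirror the generic fallback: no sleeping fault
		 * handling from NMI context. */
		pagefault_disable();
		ret = __copy_from_user_inatomic(to, from, n);
		pagefault_enable();

		return ret;
	}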

diff --git a/arch/sparc/kernel/perf_event.c b/arch/sparc/kernel/perf_event.c
index 710f327..5295dda 100644
--- a/arch/sparc/kernel/perf_event.c
+++ b/arch/sparc/kernel/perf_event.c
@@ -1602,6 +1602,7 @@ static int __kprobes perf_event_nmi_handler(struct notifier_block *self,
 	struct perf_sample_data data;
 	struct cpu_hw_events *cpuc;
 	struct pt_regs *regs;
+	bool irq_disabled;
 	int i;
 
 	if (!atomic_read(&active_events))
@@ -1630,6 +1631,7 @@ static int __kprobes perf_event_nmi_handler(struct notifier_block *self,
 	    sparc_pmu->num_pcrs == 1)
 		pcr_ops->write_pcr(0, cpuc->pcr[0]);
 
+	irq_disabled = ((regs->tstate & TSTATE_PIL) >> 20) == PIL_NORMAL_MAX;
 	for (i = 0; i < cpuc->n_events; i++) {
 		struct perf_event *event = cpuc->event[i];
 		int idx = cpuc->current_idx[i];
@@ -1649,8 +1651,16 @@ static int __kprobes perf_event_nmi_handler(struct notifier_block *self,
 		if (!sparc_perf_event_set_period(event, hwc, idx))
 			continue;
 
-		if (perf_event_overflow(event, &data, regs))
+		if (unlikely(irq_disabled)) {
+			u64 saved_exclude = event->attr.exclude_callchain_user;
+
+			event->attr.exclude_callchain_user = 1;
+			if (perf_event_overflow(event, &data, regs))
+				sparc_pmu_stop(event, 0);
+			event->attr.exclude_callchain_user = saved_exclude;
+		} else if (perf_event_overflow(event, &data, regs)) {
 			sparc_pmu_stop(event, 0);
+		}
 	}
 
 	return NOTIFY_STOP;
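
[Editor's gloss, not part of the thread: the test the patch adds works
because TSTATE_PIL masks bits 23:20 of the saved %tstate, so shifting
right by 20 recovers the PIL that was in effect when the perf NMI
fired, and PIL_NORMAL_MAX (14) is the level that local_irq_disable()
writes into %pil on sparc64. Expressed with a hypothetical helper:]

	/* Hypothetical helper, equivalent to the open-coded check in
	 * the hunk above. TSTATE_PIL is 0x0000000000f00000UL, per
	 * arch/sparc/include/uapi/asm/pstate.h. */
	static inline unsigned long tstate_pil(unsigned long tstate)
	{
		return (tstate & TSTATE_PIL) >> 20;	/* bits 23:20 */
	}

	/* ... in perf_event_nmi_handler(): */
	irq_disabled = tstate_pil(regs->tstate) == PIL_NORMAL_MAX;

[So the handler suppresses user callchain collection exactly when the
interrupted code was running with local IRQs disabled, which covers
the switch_mm window described at the top of the thread.]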