Message ID | 20230517152654.7193-2-mathieu.desnoyers@efficios.com |
---|---|
State | New |
Headers | show |
Series | Extend rseq with sched_state field | expand |
On Wed, 17 May 2023, Mathieu Desnoyers wrote: >Expose the "on-cpu" state for each thread through struct rseq to allow >adaptative mutexes to decide more accurately between busy-waiting and >calling sys_futex() to release the CPU, based on the on-cpu state of the >mutex owner. Oh yeah moving the spin stuff out of the kernel is much nicer. >It is only provided as an optimization hint, because there is no >guarantee that the page containing this field is in the page cache, and >therefore the scheduler may very well fail to clear the on-cpu state on >preemption. This is expected to be rare though, and is resolved as soon >as the task returns to user-space. > >The goal is to improve use-cases where the duration of the critical >sections for a given lock follows a multi-modal distribution, preventing >statistical guesses from doing a good job at choosing between busy-wait >and futex wait behavior. > >Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> >Cc: Peter Zijlstra (Intel) <peterz@infradead.org> >Cc: Jonathan Corbet <corbet@lwn.net> >Cc: Steven Rostedt (Google) <rostedt@goodmis.org> >Cc: Carlos O'Donell <carlos@redhat.com> >Cc: Florian Weimer <fweimer@redhat.com> >Cc: libc-alpha@sourceware.org >--- > include/linux/sched.h | 12 ++++++++++++ > include/uapi/linux/rseq.h | 17 +++++++++++++++++ > kernel/rseq.c | 14 ++++++++++++++ > 3 files changed, 43 insertions(+) Ie: previous efforts kernel/futex.c | 675 ++++++++++++++++++++++++++++++++++++++------ kernel/futex.c | 572 ++++++++++++++++++++++++++++++++++++------------- Thanks, Davidlohr
On Wed, May 17, 2023 at 11:26:51AM -0400, Mathieu Desnoyers wrote: > Expose the "on-cpu" state for each thread through struct rseq to allow > adaptative mutexes to decide more accurately between busy-waiting and > calling sys_futex() to release the CPU, based on the on-cpu state of the > mutex owner. > > It is only provided as an optimization hint, because there is no > guarantee that the page containing this field is in the page cache, and > therefore the scheduler may very well fail to clear the on-cpu state on > preemption. This is expected to be rare though, and is resolved as soon > as the task returns to user-space. > > The goal is to improve use-cases where the duration of the critical > sections for a given lock follows a multi-modal distribution, preventing > statistical guesses from doing a good job at choosing between busy-wait > and futex wait behavior. > > Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> > Cc: Peter Zijlstra (Intel) <peterz@infradead.org> > Cc: Jonathan Corbet <corbet@lwn.net> > Cc: Steven Rostedt (Google) <rostedt@goodmis.org> > Cc: Carlos O'Donell <carlos@redhat.com> > Cc: Florian Weimer <fweimer@redhat.com> > Cc: libc-alpha@sourceware.org > --- > include/linux/sched.h | 12 ++++++++++++ > include/uapi/linux/rseq.h | 17 +++++++++++++++++ > kernel/rseq.c | 14 ++++++++++++++ > 3 files changed, 43 insertions(+) > > diff --git a/include/linux/sched.h b/include/linux/sched.h > index eed5d65b8d1f..c7e9248134c1 100644 > --- a/include/linux/sched.h > +++ b/include/linux/sched.h > @@ -2351,11 +2351,20 @@ static inline void rseq_signal_deliver(struct ksignal *ksig, > rseq_handle_notify_resume(ksig, regs); > } > > +void __rseq_set_sched_state(struct task_struct *t, unsigned int state); > + > +static inline void rseq_set_sched_state(struct task_struct *t, unsigned int state) > +{ > + if (t->rseq) > + __rseq_set_sched_state(t, state); > +} > + > /* rseq_preempt() requires preemption to be disabled. */ > static inline void rseq_preempt(struct task_struct *t) > { > __set_bit(RSEQ_EVENT_PREEMPT_BIT, &t->rseq_event_mask); > rseq_set_notify_resume(t); > + rseq_set_sched_state(t, 0); > } > > /* rseq_migrate() requires preemption to be disabled. */ > @@ -2405,6 +2414,9 @@ static inline void rseq_signal_deliver(struct ksignal *ksig, > struct pt_regs *regs) > { > } > +static inline void rseq_set_sched_state(struct task_struct *t, unsigned int state) > +{ > +} > static inline void rseq_preempt(struct task_struct *t) > { > } > diff --git a/include/uapi/linux/rseq.h b/include/uapi/linux/rseq.h > index c233aae5eac9..c6d8537e23ca 100644 > --- a/include/uapi/linux/rseq.h > +++ b/include/uapi/linux/rseq.h > @@ -37,6 +37,13 @@ enum rseq_cs_flags { > (1U << RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT), > }; > > +enum rseq_sched_state { > + /* > + * Task is currently running on a CPU if bit is set. > + */ > + RSEQ_SCHED_STATE_ON_CPU = (1U << 0), > +}; > + > /* > * struct rseq_cs is aligned on 4 * 8 bytes to ensure it is always > * contained within a single cache-line. It is usually declared as > @@ -148,6 +155,16 @@ struct rseq { > */ > __u32 mm_cid; > > + /* > + * Restartable sequences sched_state field. Updated by the kernel. Read > + * by user-space with single-copy atomicity semantics. This fields can > + * be read by any userspace thread. Aligned on 32-bit. Contains a Maybe this is a premature optimization, but since most of the time the bit would be read by another thread, does it make sense putting the "sched_state" into a different cache line to avoid false sharing? Also We could have a "sched_state_local" and "sched_state_remote" for different usages (local reads vs remote reads). Regards, Boqun > + * bitmask of enum rseq_sched_state. This field is provided as a hint > + * by the scheduler, and requires that the page holding struct rseq is > + * faulted-in for the state update to be performed by the scheduler. > + */ > + __u32 sched_state; > + > /* > * Flexible array member at end of structure, after last feature field. > */ > diff --git a/kernel/rseq.c b/kernel/rseq.c > index 9de6e35fe679..b2eb3bbaa9ef 100644 > --- a/kernel/rseq.c > +++ b/kernel/rseq.c > @@ -91,6 +91,7 @@ static int rseq_update_cpu_node_id(struct task_struct *t) > u32 cpu_id = raw_smp_processor_id(); > u32 node_id = cpu_to_node(cpu_id); > u32 mm_cid = task_mm_cid(t); > + u32 sched_state = RSEQ_SCHED_STATE_ON_CPU; > > WARN_ON_ONCE((int) mm_cid < 0); > if (!user_write_access_begin(rseq, t->rseq_len)) > @@ -99,6 +100,7 @@ static int rseq_update_cpu_node_id(struct task_struct *t) > unsafe_put_user(cpu_id, &rseq->cpu_id, efault_end); > unsafe_put_user(node_id, &rseq->node_id, efault_end); > unsafe_put_user(mm_cid, &rseq->mm_cid, efault_end); > + unsafe_put_user(sched_state, &rseq->sched_state, efault_end); > /* > * Additional feature fields added after ORIG_RSEQ_SIZE > * need to be conditionally updated only if > @@ -339,6 +341,18 @@ void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs) > force_sigsegv(sig); > } > > +/* > + * Attempt to update rseq scheduler state. > + */ > +void __rseq_set_sched_state(struct task_struct *t, unsigned int state) > +{ > + if (unlikely(t->flags & PF_EXITING)) > + return; > + pagefault_disable(); > + (void) put_user(state, &t->rseq->sched_state); > + pagefault_enable(); > +} > + > #ifdef CONFIG_DEBUG_RSEQ > > /* > -- > 2.25.1 >
On 2023-05-18 17:49, Boqun Feng wrote: > On Wed, May 17, 2023 at 11:26:51AM -0400, Mathieu Desnoyers wrote: [...] >> diff --git a/include/uapi/linux/rseq.h b/include/uapi/linux/rseq.h >> index c233aae5eac9..c6d8537e23ca 100644 >> --- a/include/uapi/linux/rseq.h >> +++ b/include/uapi/linux/rseq.h >> @@ -37,6 +37,13 @@ enum rseq_cs_flags { >> (1U << RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT), >> }; >> >> +enum rseq_sched_state { >> + /* >> + * Task is currently running on a CPU if bit is set. >> + */ >> + RSEQ_SCHED_STATE_ON_CPU = (1U << 0), >> +}; [...] >> >> + /* >> + * Restartable sequences sched_state field. Updated by the kernel. Read >> + * by user-space with single-copy atomicity semantics. This fields can >> + * be read by any userspace thread. Aligned on 32-bit. Contains a > > Maybe this is a premature optimization, but since most of the time the > bit would be read by another thread, does it make sense putting the > "sched_state" into a different cache line to avoid false sharing? I'm puzzled by your optimization proposal, so I'll say it outright: I'm probably missing something. I agree that false-sharing would be an issue if various threads would contend for updating any field within this cache line. But the only thread responsible for updating this cache line's fields is the current thread, either from userspace (stores to rseq_abi->rseq_cs) or from the kernel (usually on return to userspace, except for this new ON_CPU bit clear on context switch). The other threads busy-waiting on the content of this sched_state field will only load from it, never store. And they will only busy-wait on it as long as the current task runs. When that task gets preempted, other threads will notice the flag change and use sys_futex instead. So the very worse I can think of in terms of pattern causing cache coherency traffic due to false-sharing is if the lock owner happens to have lots of rseq critical sections as well, causing it to repeatedly store to the rseq_abi->rseq_cs field, which is in the same cache line. But even then I'm wondering if this really matters, because each of those stores to rseq_cs would only slow down loads from other threads which will need to retry busy-wait anyway if the on-cpu flag is still set. So, what am I missing ? Is this heavy use of rseq critical sections while the lock is held the scenario you are concerned about ? Note that the heavy cache-line bouncing in my test-case happens on the lock structure (cmpxchg expecting NULL, setting the current thread rseq_get_abi() pointer on success). There are probably better ways to implement that part, it is currently just a simple prototype showcasing the approach. Thanks, Mathieu
On Fri, May 19, 2023 at 10:15:11AM -0400, Mathieu Desnoyers wrote: > On 2023-05-18 17:49, Boqun Feng wrote: > > On Wed, May 17, 2023 at 11:26:51AM -0400, Mathieu Desnoyers wrote: > [...] > > > diff --git a/include/uapi/linux/rseq.h b/include/uapi/linux/rseq.h > > > index c233aae5eac9..c6d8537e23ca 100644 > > > --- a/include/uapi/linux/rseq.h > > > +++ b/include/uapi/linux/rseq.h > > > @@ -37,6 +37,13 @@ enum rseq_cs_flags { > > > (1U << RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT), > > > }; > > > +enum rseq_sched_state { > > > + /* > > > + * Task is currently running on a CPU if bit is set. > > > + */ > > > + RSEQ_SCHED_STATE_ON_CPU = (1U << 0), > > > +}; > [...] > > > + /* > > > + * Restartable sequences sched_state field. Updated by the kernel. Read > > > + * by user-space with single-copy atomicity semantics. This fields can > > > + * be read by any userspace thread. Aligned on 32-bit. Contains a > > > > Maybe this is a premature optimization, but since most of the time the > > bit would be read by another thread, does it make sense putting the > > "sched_state" into a different cache line to avoid false sharing? > > I'm puzzled by your optimization proposal, so I'll say it outright: I'm > probably missing something. > Maybe it's me who is missing something ;-) > I agree that false-sharing would be an issue if various threads would > contend for updating any field within this cache line. > > But the only thread responsible for updating this cache line's fields is the > current thread, either from userspace (stores to rseq_abi->rseq_cs) or from > the kernel (usually on return to userspace, except for this new ON_CPU bit > clear on context switch). > > The other threads busy-waiting on the content of this sched_state field will > only load from it, never store. And they will only busy-wait on it as long But their loads can change the cache line state from "Exclusive" to "Shared" (using MESI terms), right? And that could delay the stores of the current thread. > as the current task runs. When that task gets preempted, other threads will > notice the flag change and use sys_futex instead. > > So the very worse I can think of in terms of pattern causing cache coherency > traffic due to false-sharing is if the lock owner happens to have lots of > rseq critical sections as well, causing it to repeatedly store to the > rseq_abi->rseq_cs field, which is in the same cache line. > > But even then I'm wondering if this really matters, because each of those > stores to rseq_cs would only slow down loads from other threads which will > need to retry busy-wait anyway if the on-cpu flag is still set. > > So, what am I missing ? Is this heavy use of rseq critical sections while > the lock is held the scenario you are concerned about ? > The case in my mind is the opposite direction: the loads from other threads delay the stores to rseq_cs on the current thread, which I assume are usually a fast path. For example: CPU 1 CPU 2 lock(foo); // holding a lock rseq_start(): <CPU 1 own the cache line exclusively> lock(foo): <fail to get foo> <check whether the lock owner is on CPU> <cache line becames shared> ->rseq_cs = .. // Need to invalidate the cache line on other CPU But as you mentioned, there is only one updater here (the current thread), so maybe it doesn't matter... but since it's a userspace ABI, so I cannot help thinking "what if there is another bit that has a different usage pattern introduced in the future", so.. > Note that the heavy cache-line bouncing in my test-case happens on the lock > structure (cmpxchg expecting NULL, setting the current thread rseq_get_abi() > pointer on success). There are probably better ways to implement that part, > it is currently just a simple prototype showcasing the approach. > Yeah.. that's a little strange, I guess you can just read the lock owner's rseq_abi, for example: rseq_lock_slowpath() { struct rseq_abi *other_rseq = lock->owner; if (RSEQ_ACCESS_ONCE(other_rseq->sched_state)) { ... } } ? Regards, Boqun > Thanks, > > Mathieu > > -- > Mathieu Desnoyers > EfficiOS Inc. > https://www.efficios.com >
On Wed, May 17, 2023 at 10:28 AM Mathieu Desnoyers via Libc-alpha <libc-alpha@sourceware.org> wrote: > > Expose the "on-cpu" state for each thread through struct rseq to allow > adaptative mutexes to decide more accurately between busy-waiting and > calling sys_futex() to release the CPU, based on the on-cpu state of the > mutex owner. > > It is only provided as an optimization hint, because there is no > guarantee that the page containing this field is in the page cache, and > therefore the scheduler may very well fail to clear the on-cpu state on > preemption. This is expected to be rare though, and is resolved as soon > as the task returns to user-space. > > The goal is to improve use-cases where the duration of the critical > sections for a given lock follows a multi-modal distribution, preventing > statistical guesses from doing a good job at choosing between busy-wait > and futex wait behavior. > > Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> > Cc: Peter Zijlstra (Intel) <peterz@infradead.org> > Cc: Jonathan Corbet <corbet@lwn.net> > Cc: Steven Rostedt (Google) <rostedt@goodmis.org> > Cc: Carlos O'Donell <carlos@redhat.com> > Cc: Florian Weimer <fweimer@redhat.com> > Cc: libc-alpha@sourceware.org > --- > include/linux/sched.h | 12 ++++++++++++ > include/uapi/linux/rseq.h | 17 +++++++++++++++++ > kernel/rseq.c | 14 ++++++++++++++ > 3 files changed, 43 insertions(+) > > diff --git a/include/linux/sched.h b/include/linux/sched.h > index eed5d65b8d1f..c7e9248134c1 100644 > --- a/include/linux/sched.h > +++ b/include/linux/sched.h > @@ -2351,11 +2351,20 @@ static inline void rseq_signal_deliver(struct ksignal *ksig, > rseq_handle_notify_resume(ksig, regs); > } > > +void __rseq_set_sched_state(struct task_struct *t, unsigned int state); > + > +static inline void rseq_set_sched_state(struct task_struct *t, unsigned int state) > +{ > + if (t->rseq) > + __rseq_set_sched_state(t, state); > +} > + > /* rseq_preempt() requires preemption to be disabled. */ > static inline void rseq_preempt(struct task_struct *t) > { > __set_bit(RSEQ_EVENT_PREEMPT_BIT, &t->rseq_event_mask); > rseq_set_notify_resume(t); > + rseq_set_sched_state(t, 0); Should rseq_migrate also be made to update the cpu_id of the new core? I imagine the usage of this will be something along the lines of: if(!on_cpu(mutex->owner_rseq_struct) && cpu(mutex->owner_rseq_struct) == this_threads_cpu) // goto futex So I would think updating on migrate would be useful as well. > } > > /* rseq_migrate() requires preemption to be disabled. */ > @@ -2405,6 +2414,9 @@ static inline void rseq_signal_deliver(struct ksignal *ksig, > struct pt_regs *regs) > { > } > +static inline void rseq_set_sched_state(struct task_struct *t, unsigned int state) > +{ > +} > static inline void rseq_preempt(struct task_struct *t) > { > } > diff --git a/include/uapi/linux/rseq.h b/include/uapi/linux/rseq.h > index c233aae5eac9..c6d8537e23ca 100644 > --- a/include/uapi/linux/rseq.h > +++ b/include/uapi/linux/rseq.h > @@ -37,6 +37,13 @@ enum rseq_cs_flags { > (1U << RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT), > }; > > +enum rseq_sched_state { > + /* > + * Task is currently running on a CPU if bit is set. > + */ > + RSEQ_SCHED_STATE_ON_CPU = (1U << 0), > +}; > + > /* > * struct rseq_cs is aligned on 4 * 8 bytes to ensure it is always > * contained within a single cache-line. It is usually declared as > @@ -148,6 +155,16 @@ struct rseq { > */ > __u32 mm_cid; > > + /* > + * Restartable sequences sched_state field. Updated by the kernel. Read > + * by user-space with single-copy atomicity semantics. This fields can > + * be read by any userspace thread. Aligned on 32-bit. Contains a > + * bitmask of enum rseq_sched_state. This field is provided as a hint > + * by the scheduler, and requires that the page holding struct rseq is > + * faulted-in for the state update to be performed by the scheduler. > + */ > + __u32 sched_state; > + > /* > * Flexible array member at end of structure, after last feature field. > */ > diff --git a/kernel/rseq.c b/kernel/rseq.c > index 9de6e35fe679..b2eb3bbaa9ef 100644 > --- a/kernel/rseq.c > +++ b/kernel/rseq.c > @@ -91,6 +91,7 @@ static int rseq_update_cpu_node_id(struct task_struct *t) > u32 cpu_id = raw_smp_processor_id(); > u32 node_id = cpu_to_node(cpu_id); > u32 mm_cid = task_mm_cid(t); > + u32 sched_state = RSEQ_SCHED_STATE_ON_CPU; > > WARN_ON_ONCE((int) mm_cid < 0); > if (!user_write_access_begin(rseq, t->rseq_len)) > @@ -99,6 +100,7 @@ static int rseq_update_cpu_node_id(struct task_struct *t) > unsafe_put_user(cpu_id, &rseq->cpu_id, efault_end); > unsafe_put_user(node_id, &rseq->node_id, efault_end); > unsafe_put_user(mm_cid, &rseq->mm_cid, efault_end); > + unsafe_put_user(sched_state, &rseq->sched_state, efault_end); > /* > * Additional feature fields added after ORIG_RSEQ_SIZE > * need to be conditionally updated only if > @@ -339,6 +341,18 @@ void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs) > force_sigsegv(sig); > } > > +/* > + * Attempt to update rseq scheduler state. > + */ > +void __rseq_set_sched_state(struct task_struct *t, unsigned int state) > +{ > + if (unlikely(t->flags & PF_EXITING)) > + return; > + pagefault_disable(); > + (void) put_user(state, &t->rseq->sched_state); > + pagefault_enable(); > +} > + > #ifdef CONFIG_DEBUG_RSEQ > > /* > -- > 2.25.1 >
On 2023-05-19 16:51, Noah Goldstein wrote: > On Wed, May 17, 2023 at 10:28 AM Mathieu Desnoyers via Libc-alpha > <libc-alpha@sourceware.org> wrote: >> >> Expose the "on-cpu" state for each thread through struct rseq to allow >> adaptative mutexes to decide more accurately between busy-waiting and >> calling sys_futex() to release the CPU, based on the on-cpu state of the >> mutex owner. >> >> It is only provided as an optimization hint, because there is no >> guarantee that the page containing this field is in the page cache, and >> therefore the scheduler may very well fail to clear the on-cpu state on >> preemption. This is expected to be rare though, and is resolved as soon >> as the task returns to user-space. >> >> The goal is to improve use-cases where the duration of the critical >> sections for a given lock follows a multi-modal distribution, preventing >> statistical guesses from doing a good job at choosing between busy-wait >> and futex wait behavior. >> >> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> >> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> >> Cc: Jonathan Corbet <corbet@lwn.net> >> Cc: Steven Rostedt (Google) <rostedt@goodmis.org> >> Cc: Carlos O'Donell <carlos@redhat.com> >> Cc: Florian Weimer <fweimer@redhat.com> >> Cc: libc-alpha@sourceware.org >> --- >> include/linux/sched.h | 12 ++++++++++++ >> include/uapi/linux/rseq.h | 17 +++++++++++++++++ >> kernel/rseq.c | 14 ++++++++++++++ >> 3 files changed, 43 insertions(+) >> >> diff --git a/include/linux/sched.h b/include/linux/sched.h >> index eed5d65b8d1f..c7e9248134c1 100644 >> --- a/include/linux/sched.h >> +++ b/include/linux/sched.h >> @@ -2351,11 +2351,20 @@ static inline void rseq_signal_deliver(struct ksignal *ksig, >> rseq_handle_notify_resume(ksig, regs); >> } >> >> +void __rseq_set_sched_state(struct task_struct *t, unsigned int state); >> + >> +static inline void rseq_set_sched_state(struct task_struct *t, unsigned int state) >> +{ >> + if (t->rseq) >> + __rseq_set_sched_state(t, state); >> +} >> + >> /* rseq_preempt() requires preemption to be disabled. */ >> static inline void rseq_preempt(struct task_struct *t) >> { >> __set_bit(RSEQ_EVENT_PREEMPT_BIT, &t->rseq_event_mask); >> rseq_set_notify_resume(t); >> + rseq_set_sched_state(t, 0); > > Should rseq_migrate also be made to update the cpu_id of the new core? > I imagine the usage of this will be something along the lines of: > > if(!on_cpu(mutex->owner_rseq_struct) && > cpu(mutex->owner_rseq_struct) == this_threads_cpu) > // goto futex > > So I would think updating on migrate would be useful as well. I don't think we want to act differently based on the cpu on which the owner is queued. If the mutex owner is not on-cpu, and queued on the same cpu as the current thread, we indeed want to call sys_futex WAIT. If the mutex owner is not on-cpu, but queued on a different cpu than the current thread, we *still* want to call sys_futex WAIT, because busy-waiting for a thread which is queued but not currently running is wasteful. Or am I missing something ? Thanks, Mathieu
On 2023-05-19 13:18, Boqun Feng wrote: [...] > > The case in my mind is the opposite direction: the loads from other > threads delay the stores to rseq_cs on the current thread, which I > assume are usually a fast path. For example: Yes, OK, you are correct. And I just validated on my end that busy-waiting repeatedly loading from a cache line does slow down the concurrent stores to other variables on that cache line significantly (at least on my Intel(R) Core(TM) i7-8650U). Small reproducer provided at the end of this email. Results: compudj@thinkos:~/test$ time ./test-cacheline -d thread id : 140242706274048, pid 16940 thread id : 140242697881344, pid 16940 real 0m4.145s user 0m8.289s sys 0m0.000s compudj@thinkos:~/test$ time ./test-cacheline -s thread id : 139741482387200, pid 16950 thread id : 139741473994496, pid 16950 real 0m4.573s user 0m9.147s sys 0m0.000s > > CPU 1 CPU 2 > > lock(foo); // holding a lock > rseq_start(): > <CPU 1 own the cache line exclusively> > lock(foo): > <fail to get foo> > <check whether the lock owner is on CPU> > <cache line becames shared> > ->rseq_cs = .. // Need to invalidate the cache line on other CPU > > But as you mentioned, there is only one updater here (the current > thread), so maybe it doesn't matter... but since it's a userspace ABI, > so I cannot help thinking "what if there is another bit that has a > different usage pattern introduced in the future", so.. Yes, however we have to be careful about how we introduce this considering that the rseq feature extensions are "append only" to the structure feature size exported by the kernel to userspace through getauxval(3). So if we decide that we create a big hole right in the middle of the rseq_abi for cacheline alignment, that's a possibility, but we'd really be wasting an entire cacheline for a single bit. Another possibility would be to add a level of indirection: we could have a field in struct rseq which is either a pointer or offset from the thread_pointer() to the on-cpu bit, which would sit in a different cache line. It would be up to glibc to allocate space for it, possibly at the end of the rseq_abi field. > >> Note that the heavy cache-line bouncing in my test-case happens on the lock >> structure (cmpxchg expecting NULL, setting the current thread rseq_get_abi() >> pointer on success). There are probably better ways to implement that part, >> it is currently just a simple prototype showcasing the approach. >> > > Yeah.. that's a little strange, I guess you can just read the lock > owner's rseq_abi, for example: > > rseq_lock_slowpath() { > struct rseq_abi *other_rseq = lock->owner; > > if (RSEQ_ACCESS_ONCE(other_rseq->sched_state)) { > ... > } > } Yes, I don't think the load of the owner pointer needs to be part of the cmpxchg per se. It could be done from a load on the slow-path. This way we would not require that the owner id and the lock state be the same content, and this would allow much more freedom for the fast-path semantic. Thanks, Mathieu > > ? > > Regards, > Boqun > >> Thanks, >> >> Mathieu >> >> -- >> Mathieu Desnoyers >> EfficiOS Inc. >> https://www.efficios.com >> Reproducer: /* * cacheline testing (exclusive vs shared store speed) * * build with gcc -O2 -pthread -o test-cacheline test-cacheline.c * * Mathieu Desnoyers <mathieu.desnoyers@efficios.com> * License: MIT */ #include <stdio.h> #include <pthread.h> #include <stdlib.h> #include <sys/types.h> #include <unistd.h> #include <rseq/rseq.h> #define NR_THREADS 2 struct test { int a; int b; } __attribute__((aligned(256))); enum testcase { TEST_SAME_CACHELINE, TEST_OTHER_CACHELINE, }; static enum testcase testcase; static int test_stop, test_go; static struct test test, test2; static void *testthread(void *arg) { long nr = (long)arg; printf("thread id : %lu, pid %lu\n", pthread_self(), getpid()); __atomic_add_fetch(&test_go, 1, __ATOMIC_RELAXED); while (RSEQ_READ_ONCE(test_go) < NR_THREADS) rseq_barrier(); if (nr == 0) { switch (testcase) { case TEST_SAME_CACHELINE: while (!RSEQ_READ_ONCE(test_stop)) (void) RSEQ_READ_ONCE(test.a); break; case TEST_OTHER_CACHELINE: while (!RSEQ_READ_ONCE(test_stop)) (void) RSEQ_READ_ONCE(test2.a); break; } } else if (nr == 1) { unsigned long long i; for (i = 0; i < 16000000000UL; i++) RSEQ_WRITE_ONCE(test.b, i); RSEQ_WRITE_ONCE(test_stop, 1); } return ((void*)0); } static void show_usage(char **argv) { fprintf(stderr, "Usage: %s <OPTIONS>\n", argv[0]); fprintf(stderr, "OPTIONS:\n"); fprintf(stderr, " [-s] Same cacheline\n"); fprintf(stderr, " [-d] Different cacheline\n"); } static int parse_args(int argc, char **argv) { if (argc != 2 || argv[1][0] != '-') { show_usage(argv); return -1; } switch (argv[1][1]) { case 's': testcase = TEST_SAME_CACHELINE; break; case 'd': testcase = TEST_OTHER_CACHELINE; break; default: show_usage(argv); return -1; } return 0; } int main(int argc, char **argv) { pthread_t testid[NR_THREADS]; void *tret; int i, err; if (parse_args(argc, argv)) exit(1); for (i = 0; i < NR_THREADS; i++) { err = pthread_create(&testid[i], NULL, testthread, (void *)(long)i); if (err != 0) exit(1); } for (i = 0; i < NR_THREADS; i++) { err = pthread_join(testid[i], &tret); if (err != 0) exit(1); } return 0; }
On Tue, May 23, 2023 at 7:49 AM Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote: > > On 2023-05-19 16:51, Noah Goldstein wrote: > > On Wed, May 17, 2023 at 10:28 AM Mathieu Desnoyers via Libc-alpha > > <libc-alpha@sourceware.org> wrote: > >> > >> Expose the "on-cpu" state for each thread through struct rseq to allow > >> adaptative mutexes to decide more accurately between busy-waiting and > >> calling sys_futex() to release the CPU, based on the on-cpu state of the > >> mutex owner. > >> > >> It is only provided as an optimization hint, because there is no > >> guarantee that the page containing this field is in the page cache, and > >> therefore the scheduler may very well fail to clear the on-cpu state on > >> preemption. This is expected to be rare though, and is resolved as soon > >> as the task returns to user-space. > >> > >> The goal is to improve use-cases where the duration of the critical > >> sections for a given lock follows a multi-modal distribution, preventing > >> statistical guesses from doing a good job at choosing between busy-wait > >> and futex wait behavior. > >> > >> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> > >> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> > >> Cc: Jonathan Corbet <corbet@lwn.net> > >> Cc: Steven Rostedt (Google) <rostedt@goodmis.org> > >> Cc: Carlos O'Donell <carlos@redhat.com> > >> Cc: Florian Weimer <fweimer@redhat.com> > >> Cc: libc-alpha@sourceware.org > >> --- > >> include/linux/sched.h | 12 ++++++++++++ > >> include/uapi/linux/rseq.h | 17 +++++++++++++++++ > >> kernel/rseq.c | 14 ++++++++++++++ > >> 3 files changed, 43 insertions(+) > >> > >> diff --git a/include/linux/sched.h b/include/linux/sched.h > >> index eed5d65b8d1f..c7e9248134c1 100644 > >> --- a/include/linux/sched.h > >> +++ b/include/linux/sched.h > >> @@ -2351,11 +2351,20 @@ static inline void rseq_signal_deliver(struct ksignal *ksig, > >> rseq_handle_notify_resume(ksig, regs); > >> } > >> > >> +void __rseq_set_sched_state(struct task_struct *t, unsigned int state); > >> + > >> +static inline void rseq_set_sched_state(struct task_struct *t, unsigned int state) > >> +{ > >> + if (t->rseq) > >> + __rseq_set_sched_state(t, state); > >> +} > >> + > >> /* rseq_preempt() requires preemption to be disabled. */ > >> static inline void rseq_preempt(struct task_struct *t) > >> { > >> __set_bit(RSEQ_EVENT_PREEMPT_BIT, &t->rseq_event_mask); > >> rseq_set_notify_resume(t); > >> + rseq_set_sched_state(t, 0); > > > > Should rseq_migrate also be made to update the cpu_id of the new core? > > I imagine the usage of this will be something along the lines of: > > > > if(!on_cpu(mutex->owner_rseq_struct) && > > cpu(mutex->owner_rseq_struct) == this_threads_cpu) > > // goto futex > > > > So I would think updating on migrate would be useful as well. > > I don't think we want to act differently based on the cpu on which the > owner is queued. > > If the mutex owner is not on-cpu, and queued on the same cpu as the > current thread, we indeed want to call sys_futex WAIT. > > If the mutex owner is not on-cpu, but queued on a different cpu than the > current thread, we *still* want to call sys_futex WAIT, because > busy-waiting for a thread which is queued but not currently running is > wasteful. > I think this is less clear. In some cases sure but not always. Going to the futex has more latency that userland waits, and if the system is not busy (other than the one process) most likely less latency that yield. Also going to the futex requires a syscall on unlock. For example if the critical section is expected to be very small, it would be easy to imagine the lock be better implemented with: while(is_locked) if (owner->on_cpu || owner->cpu != my_cpu) exponential backoff else yield Its not that "just go to futex" doesn't ever make sense, but I don't think its fair to say that *always* the case. Looking at the kernel code, it doesn't seem to be a particularly high cost to keep the CPU field updated during migration so seems like a why not kind of question. > Or am I missing something ? > > Thanks, > > Mathieu > > -- > Mathieu Desnoyers > EfficiOS Inc. > https://www.efficios.com >
On 2023-05-23 12:32, Noah Goldstein wrote: > On Tue, May 23, 2023 at 7:49 AM Mathieu Desnoyers > <mathieu.desnoyers@efficios.com> wrote: >> >> On 2023-05-19 16:51, Noah Goldstein wrote: >>> On Wed, May 17, 2023 at 10:28 AM Mathieu Desnoyers via Libc-alpha >>> <libc-alpha@sourceware.org> wrote: >>>> >>>> Expose the "on-cpu" state for each thread through struct rseq to allow >>>> adaptative mutexes to decide more accurately between busy-waiting and >>>> calling sys_futex() to release the CPU, based on the on-cpu state of the >>>> mutex owner. >>>> >>>> It is only provided as an optimization hint, because there is no >>>> guarantee that the page containing this field is in the page cache, and >>>> therefore the scheduler may very well fail to clear the on-cpu state on >>>> preemption. This is expected to be rare though, and is resolved as soon >>>> as the task returns to user-space. >>>> >>>> The goal is to improve use-cases where the duration of the critical >>>> sections for a given lock follows a multi-modal distribution, preventing >>>> statistical guesses from doing a good job at choosing between busy-wait >>>> and futex wait behavior. >>>> >>>> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> >>>> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> >>>> Cc: Jonathan Corbet <corbet@lwn.net> >>>> Cc: Steven Rostedt (Google) <rostedt@goodmis.org> >>>> Cc: Carlos O'Donell <carlos@redhat.com> >>>> Cc: Florian Weimer <fweimer@redhat.com> >>>> Cc: libc-alpha@sourceware.org >>>> --- >>>> include/linux/sched.h | 12 ++++++++++++ >>>> include/uapi/linux/rseq.h | 17 +++++++++++++++++ >>>> kernel/rseq.c | 14 ++++++++++++++ >>>> 3 files changed, 43 insertions(+) >>>> >>>> diff --git a/include/linux/sched.h b/include/linux/sched.h >>>> index eed5d65b8d1f..c7e9248134c1 100644 >>>> --- a/include/linux/sched.h >>>> +++ b/include/linux/sched.h >>>> @@ -2351,11 +2351,20 @@ static inline void rseq_signal_deliver(struct ksignal *ksig, >>>> rseq_handle_notify_resume(ksig, regs); >>>> } >>>> >>>> +void __rseq_set_sched_state(struct task_struct *t, unsigned int state); >>>> + >>>> +static inline void rseq_set_sched_state(struct task_struct *t, unsigned int state) >>>> +{ >>>> + if (t->rseq) >>>> + __rseq_set_sched_state(t, state); >>>> +} >>>> + >>>> /* rseq_preempt() requires preemption to be disabled. */ >>>> static inline void rseq_preempt(struct task_struct *t) >>>> { >>>> __set_bit(RSEQ_EVENT_PREEMPT_BIT, &t->rseq_event_mask); >>>> rseq_set_notify_resume(t); >>>> + rseq_set_sched_state(t, 0); >>> >>> Should rseq_migrate also be made to update the cpu_id of the new core? >>> I imagine the usage of this will be something along the lines of: >>> >>> if(!on_cpu(mutex->owner_rseq_struct) && >>> cpu(mutex->owner_rseq_struct) == this_threads_cpu) >>> // goto futex >>> >>> So I would think updating on migrate would be useful as well. >> >> I don't think we want to act differently based on the cpu on which the >> owner is queued. >> >> If the mutex owner is not on-cpu, and queued on the same cpu as the >> current thread, we indeed want to call sys_futex WAIT. >> >> If the mutex owner is not on-cpu, but queued on a different cpu than the >> current thread, we *still* want to call sys_futex WAIT, because >> busy-waiting for a thread which is queued but not currently running is >> wasteful. >> > I think this is less clear. In some cases sure but not always. Going > to the futex > has more latency that userland waits, and if the system is not busy (other than > the one process) most likely less latency that yield. Also going to the futex > requires a syscall on unlock. > > For example if the critical section is expected to be very small, it > would be easy > to imagine the lock be better implemented with: > while(is_locked) > if (owner->on_cpu || owner->cpu != my_cpu) > exponential backoff > else > yield > > Its not that "just go to futex" doesn't ever make sense, but I don't > think its fair > to say that *always* the case. > > Looking at the kernel code, it doesn't seem to be a particularly high cost to > keep the CPU field updated during migration so seems like a why not > kind of question. We already have the owner rseq_abi cpu_id field populated on every return-to-userspace. I wonder if it's really relevant that migration populates an updated value in this field immediately ? It's another case where this would be provided as a hint updated only if the struct rseq is in the page cache, because AFAIU the scheduler migration path cannot take a page fault. Also, if a thread bounces around many runqueues before being scheduled again, we would be adding those useless stores to the rseq_abi structure at each migration between runqueues. Given this would add some complexity to the scheduler migration code, I would want to see metrics/benchmarks showing that it indeed improves real-world use-cases before adding this to the rseq ABI. It's not only a question of added lines of code as of today, but also a question of added userspace ABI guarantees which can prevent future scheduler optimizations. I'm *very* careful about keeping those to a strict minimum, which I hope Peter Zijlstra appreciates. Thanks, Mathieu >> Or am I missing something ? >> >> Thanks, >> >> Mathieu >> >> -- >> Mathieu Desnoyers >> EfficiOS Inc. >> https://www.efficios.com >>
On Tue, May 23, 2023 at 12:30 PM Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote: > > On 2023-05-23 12:32, Noah Goldstein wrote: > > On Tue, May 23, 2023 at 7:49 AM Mathieu Desnoyers > > <mathieu.desnoyers@efficios.com> wrote: > >> > >> On 2023-05-19 16:51, Noah Goldstein wrote: > >>> On Wed, May 17, 2023 at 10:28 AM Mathieu Desnoyers via Libc-alpha > >>> <libc-alpha@sourceware.org> wrote: > >>>> > >>>> Expose the "on-cpu" state for each thread through struct rseq to allow > >>>> adaptative mutexes to decide more accurately between busy-waiting and > >>>> calling sys_futex() to release the CPU, based on the on-cpu state of the > >>>> mutex owner. > >>>> > >>>> It is only provided as an optimization hint, because there is no > >>>> guarantee that the page containing this field is in the page cache, and > >>>> therefore the scheduler may very well fail to clear the on-cpu state on > >>>> preemption. This is expected to be rare though, and is resolved as soon > >>>> as the task returns to user-space. > >>>> > >>>> The goal is to improve use-cases where the duration of the critical > >>>> sections for a given lock follows a multi-modal distribution, preventing > >>>> statistical guesses from doing a good job at choosing between busy-wait > >>>> and futex wait behavior. > >>>> > >>>> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> > >>>> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> > >>>> Cc: Jonathan Corbet <corbet@lwn.net> > >>>> Cc: Steven Rostedt (Google) <rostedt@goodmis.org> > >>>> Cc: Carlos O'Donell <carlos@redhat.com> > >>>> Cc: Florian Weimer <fweimer@redhat.com> > >>>> Cc: libc-alpha@sourceware.org > >>>> --- > >>>> include/linux/sched.h | 12 ++++++++++++ > >>>> include/uapi/linux/rseq.h | 17 +++++++++++++++++ > >>>> kernel/rseq.c | 14 ++++++++++++++ > >>>> 3 files changed, 43 insertions(+) > >>>> > >>>> diff --git a/include/linux/sched.h b/include/linux/sched.h > >>>> index eed5d65b8d1f..c7e9248134c1 100644 > >>>> --- a/include/linux/sched.h > >>>> +++ b/include/linux/sched.h > >>>> @@ -2351,11 +2351,20 @@ static inline void rseq_signal_deliver(struct ksignal *ksig, > >>>> rseq_handle_notify_resume(ksig, regs); > >>>> } > >>>> > >>>> +void __rseq_set_sched_state(struct task_struct *t, unsigned int state); > >>>> + > >>>> +static inline void rseq_set_sched_state(struct task_struct *t, unsigned int state) > >>>> +{ > >>>> + if (t->rseq) > >>>> + __rseq_set_sched_state(t, state); > >>>> +} > >>>> + > >>>> /* rseq_preempt() requires preemption to be disabled. */ > >>>> static inline void rseq_preempt(struct task_struct *t) > >>>> { > >>>> __set_bit(RSEQ_EVENT_PREEMPT_BIT, &t->rseq_event_mask); > >>>> rseq_set_notify_resume(t); > >>>> + rseq_set_sched_state(t, 0); > >>> > >>> Should rseq_migrate also be made to update the cpu_id of the new core? > >>> I imagine the usage of this will be something along the lines of: > >>> > >>> if(!on_cpu(mutex->owner_rseq_struct) && > >>> cpu(mutex->owner_rseq_struct) == this_threads_cpu) > >>> // goto futex > >>> > >>> So I would think updating on migrate would be useful as well. > >> > >> I don't think we want to act differently based on the cpu on which the > >> owner is queued. > >> > >> If the mutex owner is not on-cpu, and queued on the same cpu as the > >> current thread, we indeed want to call sys_futex WAIT. > >> > >> If the mutex owner is not on-cpu, but queued on a different cpu than the > >> current thread, we *still* want to call sys_futex WAIT, because > >> busy-waiting for a thread which is queued but not currently running is > >> wasteful. > >> > > I think this is less clear. In some cases sure but not always. Going > > to the futex > > has more latency that userland waits, and if the system is not busy (other than > > the one process) most likely less latency that yield. Also going to the futex > > requires a syscall on unlock. > > > > For example if the critical section is expected to be very small, it > > would be easy > > to imagine the lock be better implemented with: > > while(is_locked) > > if (owner->on_cpu || owner->cpu != my_cpu) > > exponential backoff > > else > > yield > > > > Its not that "just go to futex" doesn't ever make sense, but I don't > > think its fair > > to say that *always* the case. > > > > Looking at the kernel code, it doesn't seem to be a particularly high cost to > > keep the CPU field updated during migration so seems like a why not > > kind of question. > > We already have the owner rseq_abi cpu_id field populated on every > return-to-userspace. I wonder if it's really relevant that migration > populates an updated value in this field immediately ? It's another case > where this would be provided as a hint updated only if the struct rseq > is in the page cache, because AFAIU the scheduler migration path cannot > take a page fault. > Ah, thats a good point. And probably as probability the page is in the cache goes down a fair bit as the task is idle / bounced around for longer. > Also, if a thread bounces around many runqueues before being scheduled > again, we would be adding those useless stores to the rseq_abi structure > at each migration between runqueues. > > Given this would add some complexity to the scheduler migration code, I > would want to see metrics/benchmarks showing that it indeed improves > real-world use-cases before adding this to the rseq ABI. > > It's not only a question of added lines of code as of today, but also a > question of added userspace ABI guarantees which can prevent future > scheduler optimizations. I'm *very* careful about keeping those to a > strict minimum, which I hope Peter Zijlstra appreciates. Well, this entire thing is moreso a hint than a guarantee. Even on_cpu is only updated if the page happens to be in the pagecache so I don't see how you could ever be *having* to do anything. But fair enough, thought I'd throw the idea out there, but enough valid concerns seem to make it not such a good idea. > > Thanks, > > Mathieu > > > >> Or am I missing something ? > >> > >> Thanks, > >> > >> Mathieu > >> > >> -- > >> Mathieu Desnoyers > >> EfficiOS Inc. > >> https://www.efficios.com > >> > > -- > Mathieu Desnoyers > EfficiOS Inc. > https://www.efficios.com >
diff --git a/include/linux/sched.h b/include/linux/sched.h index eed5d65b8d1f..c7e9248134c1 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -2351,11 +2351,20 @@ static inline void rseq_signal_deliver(struct ksignal *ksig, rseq_handle_notify_resume(ksig, regs); } +void __rseq_set_sched_state(struct task_struct *t, unsigned int state); + +static inline void rseq_set_sched_state(struct task_struct *t, unsigned int state) +{ + if (t->rseq) + __rseq_set_sched_state(t, state); +} + /* rseq_preempt() requires preemption to be disabled. */ static inline void rseq_preempt(struct task_struct *t) { __set_bit(RSEQ_EVENT_PREEMPT_BIT, &t->rseq_event_mask); rseq_set_notify_resume(t); + rseq_set_sched_state(t, 0); } /* rseq_migrate() requires preemption to be disabled. */ @@ -2405,6 +2414,9 @@ static inline void rseq_signal_deliver(struct ksignal *ksig, struct pt_regs *regs) { } +static inline void rseq_set_sched_state(struct task_struct *t, unsigned int state) +{ +} static inline void rseq_preempt(struct task_struct *t) { } diff --git a/include/uapi/linux/rseq.h b/include/uapi/linux/rseq.h index c233aae5eac9..c6d8537e23ca 100644 --- a/include/uapi/linux/rseq.h +++ b/include/uapi/linux/rseq.h @@ -37,6 +37,13 @@ enum rseq_cs_flags { (1U << RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT), }; +enum rseq_sched_state { + /* + * Task is currently running on a CPU if bit is set. + */ + RSEQ_SCHED_STATE_ON_CPU = (1U << 0), +}; + /* * struct rseq_cs is aligned on 4 * 8 bytes to ensure it is always * contained within a single cache-line. It is usually declared as @@ -148,6 +155,16 @@ struct rseq { */ __u32 mm_cid; + /* + * Restartable sequences sched_state field. Updated by the kernel. Read + * by user-space with single-copy atomicity semantics. This fields can + * be read by any userspace thread. Aligned on 32-bit. Contains a + * bitmask of enum rseq_sched_state. This field is provided as a hint + * by the scheduler, and requires that the page holding struct rseq is + * faulted-in for the state update to be performed by the scheduler. + */ + __u32 sched_state; + /* * Flexible array member at end of structure, after last feature field. */ diff --git a/kernel/rseq.c b/kernel/rseq.c index 9de6e35fe679..b2eb3bbaa9ef 100644 --- a/kernel/rseq.c +++ b/kernel/rseq.c @@ -91,6 +91,7 @@ static int rseq_update_cpu_node_id(struct task_struct *t) u32 cpu_id = raw_smp_processor_id(); u32 node_id = cpu_to_node(cpu_id); u32 mm_cid = task_mm_cid(t); + u32 sched_state = RSEQ_SCHED_STATE_ON_CPU; WARN_ON_ONCE((int) mm_cid < 0); if (!user_write_access_begin(rseq, t->rseq_len)) @@ -99,6 +100,7 @@ static int rseq_update_cpu_node_id(struct task_struct *t) unsafe_put_user(cpu_id, &rseq->cpu_id, efault_end); unsafe_put_user(node_id, &rseq->node_id, efault_end); unsafe_put_user(mm_cid, &rseq->mm_cid, efault_end); + unsafe_put_user(sched_state, &rseq->sched_state, efault_end); /* * Additional feature fields added after ORIG_RSEQ_SIZE * need to be conditionally updated only if @@ -339,6 +341,18 @@ void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs) force_sigsegv(sig); } +/* + * Attempt to update rseq scheduler state. + */ +void __rseq_set_sched_state(struct task_struct *t, unsigned int state) +{ + if (unlikely(t->flags & PF_EXITING)) + return; + pagefault_disable(); + (void) put_user(state, &t->rseq->sched_state); + pagefault_enable(); +} + #ifdef CONFIG_DEBUG_RSEQ /*
Expose the "on-cpu" state for each thread through struct rseq to allow adaptative mutexes to decide more accurately between busy-waiting and calling sys_futex() to release the CPU, based on the on-cpu state of the mutex owner. It is only provided as an optimization hint, because there is no guarantee that the page containing this field is in the page cache, and therefore the scheduler may very well fail to clear the on-cpu state on preemption. This is expected to be rare though, and is resolved as soon as the task returns to user-space. The goal is to improve use-cases where the duration of the critical sections for a given lock follows a multi-modal distribution, preventing statistical guesses from doing a good job at choosing between busy-wait and futex wait behavior. Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Steven Rostedt (Google) <rostedt@goodmis.org> Cc: Carlos O'Donell <carlos@redhat.com> Cc: Florian Weimer <fweimer@redhat.com> Cc: libc-alpha@sourceware.org --- include/linux/sched.h | 12 ++++++++++++ include/uapi/linux/rseq.h | 17 +++++++++++++++++ kernel/rseq.c | 14 ++++++++++++++ 3 files changed, 43 insertions(+)