Message ID: 20230529191416.53955-2-mathieu.desnoyers@efficios.com
State: New
Series: Extend rseq with sched_state_ptr field
* Mathieu Desnoyers:

> +/*
> + * rseq_sched_state should be aligned on the cache line size.
> + */
> +struct rseq_sched_state {
> +	/*
> +	 * Version of this structure. Populated by the kernel, read by
> +	 * user-space.
> +	 */
> +	__u32 version;
> +	/*
> +	 * The state is updated by the kernel. Read by user-space with
> +	 * single-copy atomicity semantics. This field can be read by any
> +	 * userspace thread. Aligned on 32-bit. Contains a bitmask of enum
> +	 * rseq_sched_state_flags. This field is provided as a hint by the
> +	 * scheduler, and requires that the page holding this state is
> +	 * faulted-in for the state update to be performed by the scheduler.
> +	 */
> +	__u32 state;
> +	/*
> +	 * Thread ID associated with the thread registering this structure.
> +	 * Initialized by user-space before registration.
> +	 */
> +	__u32 tid;
> +};

How does the version handshake protocol work in practice, given that this
structure is user-allocated?

I don't see why we can't stick this directly into struct rseq because
it's all public anyway.

The TID field would be useful in its own right.

Thanks,
Florian
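As a concrete illustration of how a waiter could consume the state field
above, a minimal sketch, assuming RSEQ_SCHED_STATE_FLAG_ON_CPU sits in
bit 0 as used later in this thread; the helper name is illustrative and
not part of the proposed ABI:

	#include <stdint.h>

	enum rseq_sched_state_flags {
		/* Task is currently running on a CPU (assumed to be bit 0). */
		RSEQ_SCHED_STATE_FLAG_ON_CPU = (1U << 0),
	};

	struct rseq_sched_state {
		uint32_t version;
		uint32_t state;	/* Bitmask of enum rseq_sched_state_flags, written by the kernel. */
		uint32_t tid;
	};

	/* Single-copy-atomic read of the owner's scheduling state. */
	static inline int rseq_sched_state_on_cpu(const struct rseq_sched_state *s)
	{
		return __atomic_load_n(&s->state, __ATOMIC_RELAXED) &
		       RSEQ_SCHED_STATE_FLAG_ON_CPU;
	}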
On 5/29/23 15:35, Florian Weimer wrote: > * Mathieu Desnoyers: > >> +/* >> + * rseq_sched_state should be aligned on the cache line size. >> + */ >> +struct rseq_sched_state { >> + /* >> + * Version of this structure. Populated by the kernel, read by >> + * user-space. >> + */ >> + __u32 version; >> + /* >> + * The state is updated by the kernel. Read by user-space with >> + * single-copy atomicity semantics. This field can be read by any >> + * userspace thread. Aligned on 32-bit. Contains a bitmask of enum >> + * rseq_sched_state_flags. This field is provided as a hint by the >> + * scheduler, and requires that the page holding this state is >> + * faulted-in for the state update to be performed by the scheduler. >> + */ >> + __u32 state; >> + /* >> + * Thread ID associated with the thread registering this structure. >> + * Initialized by user-space before registration. >> + */ >> + __u32 tid; >> +}; > > How does the version handshake protocol in practice? Given that this > user-allocated? Good point, I must admit that I have not thought this specific version protocol through. :) As you say, userspace is responsible for allocation, and the kernel is responsible for implementing features. Let's first see if we can get away with embedding these fields in struct rseq. > > I don't see why we can't stick this directly into struct rseq because > it's all public anyway. The motivation for moving this to a different cache line is to handle the prior comment from Boqun, who is concerned that busy-waiting repeatedly loading a field from struct rseq will cause false-sharing and make other stores to that cache line slower, especially stores to rseq_cs to begin rseq critical sections, thus slightly increasing the overhead of rseq critical sections taken while mutexes are held. If we want to embed this field into struct rseq with its own cache line, then we need to add a lot of padding, which is inconvenient. That being said, perhaps this is premature optimization, what do you think ? > > The TID field would be useful in its own right. Indeed, good point. While we are there, I wonder if we should use the thread_pointer() as lock identifier, or if the address of struct rseq is fine ? Thanks, Mathieu > > Thanks, > Florian >
* Mathieu Desnoyers: >> I don't see why we can't stick this directly into struct rseq because >> it's all public anyway. > > The motivation for moving this to a different cache line is to handle > the prior comment from Boqun, who is concerned that busy-waiting > repeatedly loading a field from struct rseq will cause false-sharing > and make other stores to that cache line slower, especially stores to > rseq_cs to begin rseq critical sections, thus slightly increasing the > overhead of rseq critical sections taken while mutexes are held. Hmm. For context, in glibc, we have to place struct rseq on a fixed offset from the start of a page (or even some larger alignment) for all threads. In the future (once we move the thread control block off the top of the userspace stack, where it resides since the LinuxThreads days), it is likely that the pointer difference between different threads will also be a multiple of a fairly large power of two (something like 2**20 might be common). Maybe this will make caching even more difficult? > If we want to embed this field into struct rseq with its own cache > line, then we need to add a lot of padding, which is inconvenient. > > That being said, perhaps this is premature optimization, what do you > think ? Maybe? I don't know how the access patterns will look like. But I suspect that once we hit this case, performance will not be great anyway, so the optimization is perhaps unnecessary? The challenge is that once we put stuff at fixed offsets, we can't transparently fix it later. It would need more auxv entries with further offsets, or accessing this data through some indirection, perhaps via vDSO helpers. >> The TID field would be useful in its own right. > > Indeed, good point. > > While we are there, I wonder if we should use the thread_pointer() as > lock identifier, or if the address of struct rseq is fine ? Hard to tell until we'll see what the futex integration looks like, I think. Thanks, Florian
On 5/30/23 04:20, Florian Weimer wrote: > * Mathieu Desnoyers: > >>> I don't see why we can't stick this directly into struct rseq because >>> it's all public anyway. >> >> The motivation for moving this to a different cache line is to handle >> the prior comment from Boqun, who is concerned that busy-waiting >> repeatedly loading a field from struct rseq will cause false-sharing >> and make other stores to that cache line slower, especially stores to >> rseq_cs to begin rseq critical sections, thus slightly increasing the >> overhead of rseq critical sections taken while mutexes are held. > > Hmm. For context, in glibc, we have to place struct rseq on a fixed > offset from the start of a page (or even some larger alignment) for all > threads. In the future (once we move the thread control block off the > top of the userspace stack, where it resides since the LinuxThreads > days), it is likely that the pointer difference between different > threads will also be a multiple of a fairly large power of two > (something like 2**20 might be common). Maybe this will make caching > even more difficult? > >> If we want to embed this field into struct rseq with its own cache >> line, then we need to add a lot of padding, which is inconvenient. >> >> That being said, perhaps this is premature optimization, what do you >> think ? > > Maybe? I don't know how the access patterns will look like. But I > suspect that once we hit this case, performance will not be great > anyway, so the optimization is perhaps unnecessary? What I dislike though is that contention for any lock which busy-waits on the rseq sched_state would slow down all rseq critical sections of that thread, which is a side-effect we want to avoid. I've done some more additional benchmarks on my 8-core AMD laptop, and I notice that things get especially bad whenever the store to rseq_abi->rseq_cs is surrounded by other instructions that need to be ordered with that store, e.g. a for loop doing 10 stores to a local variables. If it's surrounded by instructions that don't need to be ordered wrt that store (e.g. a for loop of 10 iterations issuing barrier() "memory" asm clobbers), then the overhead cannot be noticed anymore. > > The challenge is that once we put stuff at fixed offsets, we can't > transparently fix it later. It would need more auxv entries with > further offsets, or accessing this data through some indirection, > perhaps via vDSO helpers. Perhaps this is more flexibility/complexity than we really need. One possible approach would be to split struct rseq into sub-structures, e.g.: rseq_len = overall size of all sub-structures. auxv AT_RSEQ_ALIGN = 256 auxv AT_RSEQ_FEATURE_SIZE = size of first portion of struct rseq, at most 256 bytes, meant to contain fields stored/loaded from the thread doing the registration. auxv AT_RSEQ_SHARED_FEATURE_SIZE = size of 2nd portion of struct rseq, starts at offset 256, at most 256 bytes, meant to contain fields stored/loaded by any thread. Then we have this layout: struct rseq { struct rseq_local { /* Fields accessed from local thread. */ } __attribute__((aligned((256)); struct rseq_shared { /* Shared fields. */ } __attribute__((aligned(256)); } __attribute__((aligned(256)); And if someday AT_RSEQ_FEATURE_SIZE needs to grow over 256 bytes (32 * u64), we can still extend with a new auxv entry after the "shared" features. > >>> The TID field would be useful in its own right. >> >> Indeed, good point. 
>> >> While we are there, I wonder if we should use the thread_pointer() as >> lock identifier, or if the address of struct rseq is fine ? > > Hard to tell until we'll see what the futex integration looks like, I > think. Good point. I can choose one way or another for the prototype, and then we'll see how things go with futex integration. Thanks, Mathieu
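For clarity, a sketch of the split layout proposed above, with the
attribute parentheses balanced and member names added; the field contents
are placeholders, and 256 matches the AT_RSEQ_ALIGN value suggested above:

	#include <linux/types.h>

	struct rseq {
		struct rseq_local {
			/* Fields accessed only from the registering thread, e.g. rseq_cs. */
			__u64 rseq_cs;
			__u32 cpu_id_start;
			__u32 cpu_id;
		} local __attribute__((aligned(256)));
		struct rseq_shared {
			/* Fields stored/loaded by any thread, e.g. the sched_state bits. */
			__u32 sched_state;
			__u32 tid;
		} shared __attribute__((aligned(256)));
	} __attribute__((aligned(256)));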
On 5/30/23 10:25, Mathieu Desnoyers wrote: > On 5/30/23 04:20, Florian Weimer wrote: [...] >> >> The challenge is that once we put stuff at fixed offsets, we can't >> transparently fix it later. It would need more auxv entries with >> further offsets, or accessing this data through some indirection, >> perhaps via vDSO helpers. > > Perhaps this is more flexibility/complexity than we really need. One > possible approach would be to split struct rseq into sub-structures, e.g.: > > rseq_len = overall size of all sub-structures. > auxv AT_RSEQ_ALIGN = 256 > > auxv AT_RSEQ_FEATURE_SIZE = size of first portion of struct rseq, > at most 256 bytes, meant to contain fields > stored/loaded from the thread doing the > registration. > auxv AT_RSEQ_SHARED_FEATURE_SIZE = > size of 2nd portion of struct rseq, > starts at offset 256, at most 256 bytes, > meant to contain fields stored/loaded by > any thread. > > Then we have this layout: > > struct rseq { > struct rseq_local { > /* Fields accessed from local thread. */ > > } __attribute__((aligned((256)); > struct rseq_shared { > /* Shared fields. */ > > } __attribute__((aligned(256)); > } __attribute__((aligned(256)); > > And if someday AT_RSEQ_FEATURE_SIZE needs to grow over 256 bytes > (32 * u64), we can still extend with a new auxv entry after the "shared" > features. Actually, after giving it some more thoughts, I think we can do better: - Add a sys_rseq() rseq_flag RSEQ_FLAG_SHARED, which changes the behavior of sys_rseq() to expect an additional "struct rseq_shared *" argument. - Introduce auxv AT_RSEQ_SHARED_FEATURE_SIZE. This way, it's up to the libc to decide how to allocate its private vs shared rseq structures. The auxv "AT_RSEQ_ALIGN" would dictate the minimal alignment required for both private and shared rseq structures. I don't think we need to express the size of the rseq_shared memory area allocated by libc because we know that it needs to be large enough to handle the shared feature size. Thoughts ? Thanks, Mathieu
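To make the RSEQ_FLAG_SHARED idea above concrete, a registration sketch;
the flag value, the extra syscall argument, and struct rseq_shared itself
are all part of this proposal rather than an existing kernel ABI:

	#include <stdint.h>
	#include <sys/syscall.h>
	#include <unistd.h>

	struct rseq;		/* Per-thread structure (existing ABI). */
	struct rseq_shared;	/* Proposed shared structure, allocated by libc. */

	#define RSEQ_FLAG_SHARED	(1U << 1)	/* Hypothetical flag value. */

	/*
	 * Proposed form: when RSEQ_FLAG_SHARED is set, sys_rseq() takes an
	 * additional "struct rseq_shared *" argument.
	 */
	static int rseq_register_shared(struct rseq *rs, uint32_t rseq_len,
					uint32_t sig, struct rseq_shared *shared)
	{
		return syscall(__NR_rseq, rs, rseq_len, RSEQ_FLAG_SHARED, sig, shared);
	}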
>> I don't see why we can't stick this directly into struct rseq because
>> it's all public anyway.
>
> The motivation for moving this to a different cache line is to handle
> the prior comment from Boqun, who is concerned that busy-waiting
> repeatedly loading a field from struct rseq will cause false-sharing and
> make other stores to that cache line slower, especially stores to
> rseq_cs to begin rseq critical sections, thus slightly increasing the
> overhead of rseq critical sections taken while mutexes are held.
>
> If we want to embed this field into struct rseq with its own cache line,
> then we need to add a lot of padding, which is inconvenient.
>
> That being said, perhaps this is premature optimization, what do you think ?

Hi Mathieu, Florian,

This is exciting!

I thought the motivation for moving rseq_sched_state out of struct rseq
was the lifetime management problem. I assume that when a thread locks a
mutex, it stores a pointer to its rseq_sched_state in the mutex state for
other threads to poll. So the waiting thread would do something along the
following lines:

	rseq_sched_state *state = __atomic_load_n(&mutex->sched_state, __ATOMIC_RELAXED);
	if (state && !(state->state & RSEQ_SCHED_STATE_FLAG_ON_CPU))
		futex_wait();

Now if the state is struct rseq, which is stored in TLS, then the owning
thread can unlock the mutex, exit and unmap its TLS in between.
Consequently, the load of state->state will cause a page fault.

And we do want rseq in TLS to save one indirection.

If rseq_sched_state is separated from struct rseq, then it can be
allocated in type-stable memory that is never unmapped.

What am I missing here?

However, if we can store this state in struct rseq, then an alternative
interface would be for the kernel to do:

	rseq->cpu_id = -1;

to denote that the thread is not running on any CPU. I think it kinda
makes sense: rseq->cpu_id is the thread's current CPU, and -1 naturally
means "not running at all". And we already store -1 right after init, so
it shouldn't be a surprising value.
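A sketch of the rseq->cpu_id == -1 alternative described above, using the
existing UAPI struct rseq; the mutex layout is illustrative, and the
lifetime caveat from this email still applies because the owner's struct
rseq lives in its TLS:

	#include <linux/rseq.h>		/* UAPI struct rseq */
	#include <stdint.h>

	/* Illustrative: the lock owner publishes a pointer to its rseq area. */
	struct my_mutex {
		uint32_t lock_word;
		struct rseq *owner_rseq;	/* NULL when unlocked. */
	};

	static int owner_is_on_cpu(struct my_mutex *m)
	{
		struct rseq *owner = __atomic_load_n(&m->owner_rseq, __ATOMIC_RELAXED);

		if (!owner)
			return 0;
		/* -1 would mean "not running on any CPU" under this proposal. */
		return (int32_t)__atomic_load_n(&owner->cpu_id, __ATOMIC_RELAXED) >= 0;
	}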
On Tue, 26 Sept 2023 at 13:52, Dmitry Vyukov <dvyukov@google.com> wrote: > > >> I don't see why we can't stick this directly into struct rseq because > >> it's all public anyway. > > > > The motivation for moving this to a different cache line is to handle > > the prior comment from Boqun, who is concerned that busy-waiting > > repeatedly loading a field from struct rseq will cause false-sharing and > > make other stores to that cache line slower, especially stores to > > rseq_cs to begin rseq critical sections, thus slightly increasing the > > overhead of rseq critical sections taken while mutexes are held. > > > > If we want to embed this field into struct rseq with its own cache line, > > then we need to add a lot of padding, which is inconvenient. > > > > That being said, perhaps this is premature optimization, what do you think ? > > Hi Mathieu, Florian, > > This is exciting! > > I thought the motivation for moving rseq_sched_state out of struct rseq > is lifetime management problem. I assume when a thread locks a mutex, > it stores pointer to rseq_sched_state in the mutex state for other > threads to poll. So the waiting thread would do something along the following > lines: > > rseq_sched_state* state = __atomic_load_n(mutex->sched_state, __ATOMIC_RELAXED); > if (state && !(state->state & RSEQ_SCHED_STATE_FLAG_ON_CPU)) > futex_wait(); > > Now if the state is struct rseq, which is stored in TLS, > then the owning thread can unlock the mutex, exit and unmap TLS in between. > Consequently, load of state->state will cause a paging fault. > > And we do want rseq in TLS to save 1 indirection. > > If rseq_sched_state is separated from struct rseq, then it can be allocated > in type stable memory that is never unmapped. > > What am I missing here? > > However, if we can store this state in struct rseq, then an alternative > interface would for the kernel to do: > > rseq->cpu_id = -1; > > to denote that the thread is not running on any CPU. > I think it kinda makes sense, rseq->cpu_id is the thread's current CPU, > and -1 naturally means "not running at all". And we already store -1 > right after init, so it shouldn't be a surprising value. As you may know we experimented with "virtual CPUs" in tcmalloc. The extension allows kernel to assign dense virtual CPU numbers to running threads instead of real sparse CPU numbers: https://github.com/google/tcmalloc/blob/229908285e216cca8b844c1781bf16b838128d1b/tcmalloc/internal/linux_syscall_support.h#L30-L41 Recently I added another change that [ab]uses rseq in an interesting way. We want to get notifications about thread re-scheduling. A bit simplified version of this is as follows: we don't use rseq.cpu_id_start for its original purpose, so instead we store something else there with a high bit set. Real CPU numbers don't have a high bit set (at least while you have less than 2B CPUs :)). This allows us to distinguish the value we stored in rseq.cpu_id_start from real CPU id stored by the kernel. Inside of rseq critical section we check if rseq.cpu_id_start has high bit set, and if not, then we know that we were just rescheduled, so we can do some additional work and update rseq.cpu_id_start to have high bit set. 
In reality it's a bit more involved, since the field is actually 8 bytes
and only partially overlaps with rseq.cpu_id_start (it's an 8-byte pointer
whose high 4 bytes overlap rseq.cpu_id_start):

https://github.com/google/tcmalloc/blob/229908285e216cca8b844c1781bf16b838128d1b/tcmalloc/internal/percpu.h#L101-L165

I am wondering whether we could extend the currently proposed interface in
a way that would be more flexible and would satisfy all of these use cases
(spinlocks, and the possibility of using virtual CPUs and rescheduling
notifications). In the end they all need a very similar thing: the kernel
writing some value at some user address when a thread is de-scheduled.

The minimal support we need for tcmalloc is an 8-byte user address + the
kernel writing 0 at that address when a thread is descheduled.

The most flexible option to support multiple users
(malloc/spinlocks/something else) would be as follows:

User-space passes an array of structs with address + size (1/2/4/8 bytes)
+ value. The kernel iterates over the array when the thread is
de-scheduled and writes the specified value at the provided address/size.
Something along the following lines (pseudo-code):

	struct rseq {
		...
		struct rseq_desched_notif_t *desched_notifs;
		int desched_notif_count;
	};

	struct rseq_desched_notif_t {
		void *addr;
		uint64_t value;
		int size;
	};

	static inline void rseq_preempt(struct task_struct *t)
	{
		...
		for (int i = 0; i < t->rseq->desched_notif_count; i++) {
			switch (t->rseq->desched_notifs[i].size) {
			case 1: put_user1(t->rseq->desched_notifs[i].addr,
					  t->rseq->desched_notifs[i].value);
				break;
			case 2: put_user2(t->rseq->desched_notifs[i].addr,
					  t->rseq->desched_notifs[i].value);
				break;
			case 4: put_user4(t->rseq->desched_notifs[i].addr,
					  t->rseq->desched_notifs[i].value);
				break;
			case 8: put_user8(t->rseq->desched_notifs[i].addr,
					  t->rseq->desched_notifs[i].value);
				break;
			}
		}
	}
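To show how user-space might consume the interface proposed above, a
registration-side sketch; rseq_desched_notif_t and the desched_notifs
fields exist only in this proposal, and how the array pointer would
actually be handed to the kernel is left open:

	#include <stdint.h>

	/* Hypothetical entry type from the proposal above. */
	struct rseq_desched_notif_t {
		void *addr;	/* User address the kernel writes on deschedule. */
		uint64_t value;	/* Value to write, e.g. 0 to invalidate a cached hint. */
		int size;	/* 1, 2, 4 or 8 bytes. */
	};

	/* Per-thread pointer a malloc implementation might want cleared on deschedule. */
	static __thread void *percpu_slab_hint;

	static __thread struct rseq_desched_notif_t desched_notifs[1];

	static void setup_desched_notif(void)
	{
		desched_notifs[0].addr = &percpu_slab_hint;
		desched_notifs[0].value = 0;
		desched_notifs[0].size = sizeof(percpu_slab_hint);
		/*
		 * Hypothetical: publish desched_notifs and the count through
		 * the extended struct rseq fields sketched above.
		 */
	}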
> +void __rseq_set_sched_state(struct task_struct *t, unsigned int state)
> +{
> +	if (unlikely(t->flags & PF_EXITING))
> +		return;

Don't we want to do this even if the task is exiting? If rseq is read only
by the current task, then it does not matter (an exiting task won't read
its own rseq anymore). But if we are extending the expected use cases to
tasks reading the rseq of other tasks, we may need to update it even in
the PF_EXITING case.

> +	pagefault_disable();

This is a bit concerning. Does this mean updates are not done if the page
is currently not in memory? Can we make it as reliable as the main rseq
data (cpu_id)?

For tcmalloc uses (and other uses that require reliable information) this
may be problematic. User-space may start thinking that all "CPU slots" are
busy and that there are more threads running than there are CPUs.

> +	(void) put_user(state, &t->rseq_sched_state->state);
> +	pagefault_enable();
> +}
* Dmitry Vyukov: > In reality it's a bit more involved since the field is actually 8 > bytes and only partially overlaps with rseq.cpu_id_start (it's an > 8-byte pointer with high 4 bytes overlap rseq.cpu_id_start): > > https://github.com/google/tcmalloc/blob/229908285e216cca8b844c1781bf16b838128d1b/tcmalloc/internal/percpu.h#L101-L165 This does not compose with other rseq users, as noted in the sources: // Note: this makes __rseq_abi.cpu_id_start unusable for its original purpose. For a core library such a malloc replacement, that is a very bad trap. Thanks, Florian
On Tue, 26 Sept 2023 at 21:51, Florian Weimer <fweimer@redhat.com> wrote: > > * Dmitry Vyukov: > > > In reality it's a bit more involved since the field is actually 8 > > bytes and only partially overlaps with rseq.cpu_id_start (it's an > > 8-byte pointer with high 4 bytes overlap rseq.cpu_id_start): > > > > https://github.com/google/tcmalloc/blob/229908285e216cca8b844c1781bf16b838128d1b/tcmalloc/internal/percpu.h#L101-L165 > > This does not compose with other rseq users, as noted in the sources: > > // Note: this makes __rseq_abi.cpu_id_start unusable for its original purpose. > > For a core library such a malloc replacement, that is a very bad trap. I agree. I wouldn't do this if there were other options. That's why I am looking for official kernel support for this. If we would have a separate 8 bytes that are overwritten with 0 when a thread is descheduled, that would be perfect.
* Dmitry Vyukov: > On Tue, 26 Sept 2023 at 21:51, Florian Weimer <fweimer@redhat.com> wrote: >> >> * Dmitry Vyukov: >> >> > In reality it's a bit more involved since the field is actually 8 >> > bytes and only partially overlaps with rseq.cpu_id_start (it's an >> > 8-byte pointer with high 4 bytes overlap rseq.cpu_id_start): >> > >> > https://github.com/google/tcmalloc/blob/229908285e216cca8b844c1781bf16b838128d1b/tcmalloc/internal/percpu.h#L101-L165 >> >> This does not compose with other rseq users, as noted in the sources: >> >> // Note: this makes __rseq_abi.cpu_id_start unusable for its original purpose. >> >> For a core library such a malloc replacement, that is a very bad trap. > > I agree. I wouldn't do this if there were other options. That's why I > am looking for official kernel support for this. > If we would have a separate 8 bytes that are overwritten with 0 when a > thread is descheduled, that would be perfect. That only solves part of the problem because these fields would still have to be locked to tcmalloc. I think you'd need a rescheduling counter, then every library could keep their reference values in library-private thread-local storage. Thanks, Florian
On Mon, May 29, 2023 at 03:14:13PM -0400, Mathieu Desnoyers wrote: > Expose the "on-cpu" state for each thread through struct rseq to allow > adaptative mutexes to decide more accurately between busy-waiting and > calling sys_futex() to release the CPU, based on the on-cpu state of the > mutex owner. > > It is only provided as an optimization hint, because there is no > guarantee that the page containing this field is in the page cache, and > therefore the scheduler may very well fail to clear the on-cpu state on > preemption. This is expected to be rare though, and is resolved as soon > as the task returns to user-space. > > The goal is to improve use-cases where the duration of the critical > sections for a given lock follows a multi-modal distribution, preventing > statistical guesses from doing a good job at choosing between busy-wait > and futex wait behavior. As always, are syscalls really *that* expensive? Why can't we busy wait in the kernel instead? I mean, sure, meltdown sucked, but most people should now be running chips that are not affected by that particular horror show, no?
From: Peter Zijlstra > Sent: 28 September 2023 11:39 > > On Mon, May 29, 2023 at 03:14:13PM -0400, Mathieu Desnoyers wrote: > > Expose the "on-cpu" state for each thread through struct rseq to allow > > adaptative mutexes to decide more accurately between busy-waiting and > > calling sys_futex() to release the CPU, based on the on-cpu state of the > > mutex owner. Are you trying to avoid spinning when the owning process is sleeping? Or trying to avoid the system call when it will find that the futex is no longer held? The latter is really horribly detremental. > > > > It is only provided as an optimization hint, because there is no > > guarantee that the page containing this field is in the page cache, and > > therefore the scheduler may very well fail to clear the on-cpu state on > > preemption. This is expected to be rare though, and is resolved as soon > > as the task returns to user-space. > > > > The goal is to improve use-cases where the duration of the critical > > sections for a given lock follows a multi-modal distribution, preventing > > statistical guesses from doing a good job at choosing between busy-wait > > and futex wait behavior. > > As always, are syscalls really *that* expensive? Why can't we busy wait > in the kernel instead? > > I mean, sure, meltdown sucked, but most people should now be running > chips that are not affected by that particular horror show, no? IIRC 'page table separation' which is what makes system calls expensive is only a compile-time option. So is likely to be enabled on any 'distro' kernel. But a lot of other mitigations (eg RSB stuffing) are also pretty detrimental. OTOH if you have a 'hot' userspace mutex you are going to lose whatever. All that needs to happen is for a ethernet interrupt to decide to discard completed transmits and refill the rx ring, and then for the softint code to free a load of stuff deferred by rcu while you've grabbed the mutex and no matter how short the user-space code path the mutex won't be released for absolutely ages. I had to change a load of code to use arrays and atomic increments to avoid delays acquiring mutex. David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales)
On 9/28/23 07:22, David Laight wrote: > From: Peter Zijlstra >> Sent: 28 September 2023 11:39 >> >> On Mon, May 29, 2023 at 03:14:13PM -0400, Mathieu Desnoyers wrote: >>> Expose the "on-cpu" state for each thread through struct rseq to allow >>> adaptative mutexes to decide more accurately between busy-waiting and >>> calling sys_futex() to release the CPU, based on the on-cpu state of the >>> mutex owner. > > Are you trying to avoid spinning when the owning process is sleeping? Yes, this is my main intent. > Or trying to avoid the system call when it will find that the futex > is no longer held? > > The latter is really horribly detremental. That's a good questions. What should we do in those three situations when trying to grab the lock: 1) Lock has no owner We probably want to simply grab the lock with an atomic instruction. But then if other threads are queued on sys_futex and did not manage to grab the lock yet, this would be detrimental to fairness. 2) Lock owner is running: The lock owner is certainly running on another cpu (I'm using the term "cpu" here as logical cpu). I guess we could either decide to bypass sys_futex entirely and try to grab the lock with an atomic, or we go through sys_futex nevertheless to allow futex to guarantee some fairness across threads. 3) Lock owner is sleeping: The lock owner may be either tied to the same cpu as the requester, or a different cpu. Here calling FUTEX_WAIT and friends is pretty much required. Can you elaborate on why skipping sys_futex in scenario (2) would be so bad ? I wonder if we could get away with skipping futex entirely in this scenario and still guarantee fairness by implementing MCS locking or ticket locks in userspace. Basically, if userspace queues itself on the lock through either MCS locking or ticket locks, it could guarantee fairness on its own. Of course things are more complicated with PI-futex, is that what you have in mind ? > >>> >>> It is only provided as an optimization hint, because there is no >>> guarantee that the page containing this field is in the page cache, and >>> therefore the scheduler may very well fail to clear the on-cpu state on >>> preemption. This is expected to be rare though, and is resolved as soon >>> as the task returns to user-space. >>> >>> The goal is to improve use-cases where the duration of the critical >>> sections for a given lock follows a multi-modal distribution, preventing >>> statistical guesses from doing a good job at choosing between busy-wait >>> and futex wait behavior. >> >> As always, are syscalls really *that* expensive? Why can't we busy wait >> in the kernel instead? >> >> I mean, sure, meltdown sucked, but most people should now be running >> chips that are not affected by that particular horror show, no? > > IIRC 'page table separation' which is what makes system calls expensive > is only a compile-time option. So is likely to be enabled on any 'distro' > kernel. > But a lot of other mitigations (eg RSB stuffing) are also pretty detrimental. > > OTOH if you have a 'hot' userspace mutex you are going to lose whatever. > All that needs to happen is for a ethernet interrupt to decide to discard > completed transmits and refill the rx ring, and then for the softint code > to free a load of stuff deferred by rcu while you've grabbed the mutex > and no matter how short the user-space code path the mutex won't be > released for absolutely ages. > > I had to change a load of code to use arrays and atomic increments > to avoid delays acquiring mutex. That's good input, thanks! 
I mostly defer to André Almeida on the use-case motivation. I mostly provided this POC patch to show that it _can_ be done with sys_rseq(2). Thanks! Mathieu > > David > > - > Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK > Registration No: 1397386 (Wales) >
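Putting the three cases above together, a sketch of the acquisition path
an adaptive userspace mutex could use with the proposed on-CPU hint; futex
wake-up bookkeeping is elided, the spin bound is arbitrary, and
rseq_sched_state_on_cpu() is the illustrative helper sketched earlier in
the thread:

	#include <linux/futex.h>
	#include <stdint.h>
	#include <sys/syscall.h>
	#include <unistd.h>

	struct rseq_sched_state;
	extern int rseq_sched_state_on_cpu(const struct rseq_sched_state *s);

	#define SPIN_ATTEMPTS	100	/* Arbitrary bound for the illustration. */

	struct adaptive_mutex {
		uint32_t lock_word;			/* 0 = unlocked, 1 = locked. */
		struct rseq_sched_state *owner_state;	/* Published by the lock owner. */
	};

	static void adaptive_lock(struct adaptive_mutex *m, struct rseq_sched_state *self)
	{
		for (;;) {
			uint32_t expected = 0;

			/* 1) Lock has no owner: grab it with an atomic. */
			if (__atomic_compare_exchange_n(&m->lock_word, &expected, 1, 0,
							__ATOMIC_ACQUIRE, __ATOMIC_RELAXED)) {
				__atomic_store_n(&m->owner_state, self, __ATOMIC_RELEASE);
				return;
			}

			/* 2) Owner appears to be on-CPU: busy-wait for a bounded time. */
			struct rseq_sched_state *owner =
				__atomic_load_n(&m->owner_state, __ATOMIC_ACQUIRE);
			int spins = SPIN_ATTEMPTS;
			while (owner && rseq_sched_state_on_cpu(owner) && spins--) {
				if (!__atomic_load_n(&m->lock_word, __ATOMIC_RELAXED))
					break;	/* Released while spinning; retry the CAS. */
			}
			if (!__atomic_load_n(&m->lock_word, __ATOMIC_RELAXED))
				continue;

			/* 3) Owner is (probably) sleeping: block in the kernel. */
			syscall(SYS_futex, &m->lock_word, FUTEX_WAIT_PRIVATE, 1, NULL, NULL, 0);
		}
	}

A real implementation would also need to track waiters so that unlock
knows when to issue FUTEX_WAKE, and, as discussed above, fairness would
have to come from the queueing scheme (MCS or ticket locks) rather than
from the futex itself.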
On Thu, Sep 28, 2023 at 09:20:36AM -0400, Mathieu Desnoyers wrote: > Of course things are more complicated with PI-futex, is that what you have > in mind ? PI futex has to go through the kernel, also, they already spin-wait on owner.
From: Mathieu Desnoyers > Sent: 28 September 2023 14:21 > > On 9/28/23 07:22, David Laight wrote: > > From: Peter Zijlstra > >> Sent: 28 September 2023 11:39 > >> > >> On Mon, May 29, 2023 at 03:14:13PM -0400, Mathieu Desnoyers wrote: > >>> Expose the "on-cpu" state for each thread through struct rseq to allow > >>> adaptative mutexes to decide more accurately between busy-waiting and > >>> calling sys_futex() to release the CPU, based on the on-cpu state of the > >>> mutex owner. > > > > Are you trying to avoid spinning when the owning process is sleeping? > > Yes, this is my main intent. > > > Or trying to avoid the system call when it will find that the futex > > is no longer held? > > > > The latter is really horribly detremental. > > That's a good questions. What should we do in those three situations > when trying to grab the lock: > > 1) Lock has no owner > > We probably want to simply grab the lock with an atomic instruction. But > then if other threads are queued on sys_futex and did not manage to grab > the lock yet, this would be detrimental to fairness. > > 2) Lock owner is running: > > The lock owner is certainly running on another cpu (I'm using the term > "cpu" here as logical cpu). > > I guess we could either decide to bypass sys_futex entirely and try to > grab the lock with an atomic, or we go through sys_futex nevertheless to > allow futex to guarantee some fairness across threads. I'd not worry about 'fairness'. If the mutex is that contended you've already lost! I had a big problem trying to avoid the existing 'fairness' code. Consider 30 RT threads blocked in cv_wait() on the same condvar. Something does cv_broadcast() and you want them all to wakeup. They'll all release the mutex pretty quickly - it doesn't matter is they spin. But what actually happens is one thread is woken up. Once it has been scheduled (after the cpu has come out of a sleep state and/or any hardware interrupts completed (etc) then next thread is woken. If you are lucky it'll 'only' take a few ms to get them all running. Not good when you are trying to process audio every 10ms. I had to use a separate cv for each thread and get the woken threads to help with the wakeups. Gog knows what happens with 256 threads! > 3) Lock owner is sleeping: > > The lock owner may be either tied to the same cpu as the requester, or a > different cpu. Here calling FUTEX_WAIT and friends is pretty much required. You'd need the 'holding process is sleeping' test to be significantly faster then the 'optimistic spin hoping the mutex will be released'. And for the 'spin' to be longer than the syscall time for futex. Otherwise you are optimising an already slow path. If the thread is going to have to sleep until the thread that owns a mutex wakes up then I can't imagine performance mattering. OTOH it is much more usual for the owning thread to be running and release the mutex quickly. I wouldn't have thought it was really worth optimising for the 'lock owner is sleeping' case. David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales)
On Thu, 28 Sep 2023 12:39:26 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> As always, are syscalls really *that* expensive? Why can't we busy wait
> in the kernel instead?

Yes, syscalls are that expensive. Several years ago I had a good talk with
Robert Haas (one of the PostgreSQL maintainers) at Linux Plumbers, and I
asked him if they used futexes. His answer was "no". He told me how they
did several benchmarks and it was a huge performance hit (and this was
before Spectre/Meltdown made things much worse). He explained to me that
most locks are taken just to flip a few bits. Going into the kernel and
coming back was orders of magnitude longer than the critical sections. By
going into the kernel, it caused a ripple effect and led to even more
contention. Their answer was to implement their locking completely in user
space without any help from the kernel.

This is when I thought that having an adaptive spinner that could get
hints from the kernel via memory mapping would be extremely useful.

The obvious problem with their implementation is that if the owner is
sleeping, there's no point in spinning. Worse, the owner may even be
waiting for the spinner to get off the CPU before it can run again. But
according to Robert, the gain in general performance greatly outweighed
the few times this happened in practice.

But still, if userspace could figure out if the owner is running on
another CPU or not, to act just like the adaptive mutexes in the kernel,
that would prevent the problem of a spinner keeping the owner from
running.

-- Steve
On Thu, 28 Sept 2023 at 01:52, Florian Weimer <fweimer@redhat.com> wrote: > > * Dmitry Vyukov: > > > On Tue, 26 Sept 2023 at 21:51, Florian Weimer <fweimer@redhat.com> wrote: > >> > >> * Dmitry Vyukov: > >> > >> > In reality it's a bit more involved since the field is actually 8 > >> > bytes and only partially overlaps with rseq.cpu_id_start (it's an > >> > 8-byte pointer with high 4 bytes overlap rseq.cpu_id_start): > >> > > >> > https://github.com/google/tcmalloc/blob/229908285e216cca8b844c1781bf16b838128d1b/tcmalloc/internal/percpu.h#L101-L165 > >> > >> This does not compose with other rseq users, as noted in the sources: > >> > >> // Note: this makes __rseq_abi.cpu_id_start unusable for its original purpose. > >> > >> For a core library such a malloc replacement, that is a very bad trap. > > > > I agree. I wouldn't do this if there were other options. That's why I > > am looking for official kernel support for this. > > If we would have a separate 8 bytes that are overwritten with 0 when a > > thread is descheduled, that would be perfect. > > That only solves part of the problem because these fields would still > have to be locked to tcmalloc. I think you'd need a rescheduling > counter, then every library could keep their reference values in > library-private thread-local storage. This unfortunatly won't work for tcmalloc. This data is accessed on the very hot path of malloc/free. We need a ready to use pointer in TLS, which is reset by the kernel to 0 (or some user-space specified value). Doing to separate loads for counters in different cache lines would be too expensive. It may be possible to make several libraries use this feature with an array of notifications (see rseq_desched_notif_t in my previous email).
On Tue, 26 Sept 2023 at 16:49, Dmitry Vyukov <dvyukov@google.com> wrote: > > > >> I don't see why we can't stick this directly into struct rseq because > > >> it's all public anyway. > > > > > > The motivation for moving this to a different cache line is to handle > > > the prior comment from Boqun, who is concerned that busy-waiting > > > repeatedly loading a field from struct rseq will cause false-sharing and > > > make other stores to that cache line slower, especially stores to > > > rseq_cs to begin rseq critical sections, thus slightly increasing the > > > overhead of rseq critical sections taken while mutexes are held. > > > > > > If we want to embed this field into struct rseq with its own cache line, > > > then we need to add a lot of padding, which is inconvenient. > > > > > > That being said, perhaps this is premature optimization, what do you think ? > > > > Hi Mathieu, Florian, > > > > This is exciting! > > > > I thought the motivation for moving rseq_sched_state out of struct rseq > > is lifetime management problem. I assume when a thread locks a mutex, > > it stores pointer to rseq_sched_state in the mutex state for other > > threads to poll. So the waiting thread would do something along the following > > lines: > > > > rseq_sched_state* state = __atomic_load_n(mutex->sched_state, __ATOMIC_RELAXED); > > if (state && !(state->state & RSEQ_SCHED_STATE_FLAG_ON_CPU)) > > futex_wait(); > > > > Now if the state is struct rseq, which is stored in TLS, > > then the owning thread can unlock the mutex, exit and unmap TLS in between. > > Consequently, load of state->state will cause a paging fault. > > > > And we do want rseq in TLS to save 1 indirection. > > > > If rseq_sched_state is separated from struct rseq, then it can be allocated > > in type stable memory that is never unmapped. > > > > What am I missing here? > > > > However, if we can store this state in struct rseq, then an alternative > > interface would for the kernel to do: > > > > rseq->cpu_id = -1; > > > > to denote that the thread is not running on any CPU. > > I think it kinda makes sense, rseq->cpu_id is the thread's current CPU, > > and -1 naturally means "not running at all". And we already store -1 > > right after init, so it shouldn't be a surprising value. > > As you may know we experimented with "virtual CPUs" in tcmalloc. The > extension allows kernel to assign dense virtual CPU numbers to running > threads instead of real sparse CPU numbers: > > https://github.com/google/tcmalloc/blob/229908285e216cca8b844c1781bf16b838128d1b/tcmalloc/internal/linux_syscall_support.h#L30-L41 > > Recently I added another change that [ab]uses rseq in an interesting > way. We want to get notifications about thread re-scheduling. A bit > simplified version of this is as follows: > we don't use rseq.cpu_id_start for its original purpose, so instead we > store something else there with a high bit set. Real CPU numbers don't > have a high bit set (at least while you have less than 2B CPUs :)). > This allows us to distinguish the value we stored in rseq.cpu_id_start > from real CPU id stored by the kernel. > Inside of rseq critical section we check if rseq.cpu_id_start has high > bit set, and if not, then we know that we were just rescheduled, so we > can do some additional work and update rseq.cpu_id_start to have high > bit set. 
> > In reality it's a bit more involved since the field is actually 8 > bytes and only partially overlaps with rseq.cpu_id_start (it's an > 8-byte pointer with high 4 bytes overlap rseq.cpu_id_start): > > https://github.com/google/tcmalloc/blob/229908285e216cca8b844c1781bf16b838128d1b/tcmalloc/internal/percpu.h#L101-L165 > > I am thinking if we could extend the current proposed interface in a > way that would be more flexible and would satisfy all of these use > cases (spinlocks, and possibility of using virtual CPUs and > rescheduling notifications). In the end they all need a very similar > thing: kernel writing some value at some user address when a thread is > de-scheduled. > > The minimal support we need for tcmalloc is an 8-byte user address + > kernel writing 0 at that address when a thread is descheduled. > > The most flexible option to support multiple users > (malloc/spinlocks/something else) would be as follows: > > User-space passes an array of structs with address + size (1/2/4/8 > bytes) + value. > Kernel intereates over the array when the thread is de-scheduled and > writes the specified value at the provided address/size. > Something along the following lines (pseudo-code): > > struct rseq { > ... > struct rseq_desched_notif_t* desched_notifs; > int desched_notif_count; > }; > > struct rseq_desched_notif_t { > void* addr; > uint64_t value; > int size; > }; > > static inline void rseq_preempt(struct task_struct *t) > { > ... > for (int i = 0; i < t->rseq->desched_notif_count; i++) { > switch (t->rseq->desched_notifs[i].size) { > case 1: put_user1(t->rseq->desched_notifs[i].addr, > t->rseq->desched_notifs[i].value); > case 2: put_user2(t->rseq->desched_notifs[i].addr, > t->rseq->desched_notifs[i].value); > case 4: put_user4(t->rseq->desched_notifs[i].addr, > t->rseq->desched_notifs[i].value); > case 8: put_user8(t->rseq->desched_notifs[i].addr, > t->rseq->desched_notifs[i].value); > } > } > } One thing I forgot to mention: ideally the kernel also writes a timestamp of descheduling somewhere. We are using this logic to assign per-CPU malloc caches to threads, and it's useful to know which caches were used very recently (still hot in cache) and which ones were not used for a long time.
On 9/28/23 15:20, Mathieu Desnoyers wrote: > On 9/28/23 07:22, David Laight wrote: >> From: Peter Zijlstra >>> Sent: 28 September 2023 11:39 >>> >>> On Mon, May 29, 2023 at 03:14:13PM -0400, Mathieu Desnoyers wrote: >>>> Expose the "on-cpu" state for each thread through struct rseq to allow >>>> adaptative mutexes to decide more accurately between busy-waiting and >>>> calling sys_futex() to release the CPU, based on the on-cpu state >>>> of the >>>> mutex owner. >> >> Are you trying to avoid spinning when the owning process is sleeping? > > Yes, this is my main intent. > >> Or trying to avoid the system call when it will find that the futex >> is no longer held? >> >> The latter is really horribly detremental. > > That's a good questions. What should we do in those three situations > when trying to grab the lock: > > 1) Lock has no owner > > We probably want to simply grab the lock with an atomic instruction. > But then if other threads are queued on sys_futex and did not manage > to grab the lock yet, this would be detrimental to fairness. > > 2) Lock owner is running: > > The lock owner is certainly running on another cpu (I'm using the term > "cpu" here as logical cpu). > > I guess we could either decide to bypass sys_futex entirely and try to > grab the lock with an atomic, or we go through sys_futex nevertheless > to allow futex to guarantee some fairness across threads. About the fairness part: Even if you enqueue everyone, the futex syscall doesn't provide any guarantee about the order of the wake. The current implementation tries to be fair, but I don't think it works for every case. I wouldn't be much concern about being fair here, given that it's an inherent problem of the futex anyway. From the man pages: "No guarantee is provided about which waiters are awoken" > > 3) Lock owner is sleeping: > > The lock owner may be either tied to the same cpu as the requester, or > a different cpu. Here calling FUTEX_WAIT and friends is pretty much > required. > > Can you elaborate on why skipping sys_futex in scenario (2) would be > so bad ? I wonder if we could get away with skipping futex entirely in > this scenario and still guarantee fairness by implementing MCS locking > or ticket locks in userspace. Basically, if userspace queues itself on > the lock through either MCS locking or ticket locks, it could > guarantee fairness on its own. > > Of course things are more complicated with PI-futex, is that what you > have in mind ? > >> >>>> >>>> It is only provided as an optimization hint, because there is no >>>> guarantee that the page containing this field is in the page cache, >>>> and >>>> therefore the scheduler may very well fail to clear the on-cpu >>>> state on >>>> preemption. This is expected to be rare though, and is resolved as >>>> soon >>>> as the task returns to user-space. >>>> >>>> The goal is to improve use-cases where the duration of the critical >>>> sections for a given lock follows a multi-modal distribution, >>>> preventing >>>> statistical guesses from doing a good job at choosing between >>>> busy-wait >>>> and futex wait behavior. >>> >>> As always, are syscalls really *that* expensive? Why can't we busy wait >>> in the kernel instead? >>> >>> I mean, sure, meltdown sucked, but most people should now be running >>> chips that are not affected by that particular horror show, no? >> >> IIRC 'page table separation' which is what makes system calls expensive >> is only a compile-time option. So is likely to be enabled on any >> 'distro' >> kernel. 
>> But a lot of other mitigations (eg RSB stuffing) are also pretty >> detrimental. >> >> OTOH if you have a 'hot' userspace mutex you are going to lose whatever. >> All that needs to happen is for a ethernet interrupt to decide to >> discard >> completed transmits and refill the rx ring, and then for the softint >> code >> to free a load of stuff deferred by rcu while you've grabbed the mutex >> and no matter how short the user-space code path the mutex won't be >> released for absolutely ages. >> >> I had to change a load of code to use arrays and atomic increments >> to avoid delays acquiring mutex. > > That's good input, thanks! I mostly defer to André Almeida on the > use-case motivation. I mostly provided this POC patch to show that it > _can_ be done with sys_rseq(2). > > Thanks! > > Mathieu > >> >> David >> >> - >> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, >> MK1 1PT, UK >> Registration No: 1397386 (Wales) >> >
From: Steven Rostedt > Sent: 28 September 2023 15:43 > > On Thu, 28 Sep 2023 12:39:26 +0200 > Peter Zijlstra <peterz@infradead.org> wrote: > > > As always, are syscalls really *that* expensive? Why can't we busy wait > > in the kernel instead? > > Yes syscalls are that expensive. Several years ago I had a good talk > with Robert Haas (one of the PostgreSQL maintainers) at Linux Plumbers, > and I asked him if they used futexes. His answer was "no". He told me > how they did several benchmarks and it was a huge performance hit (and > this was before Spectre/Meltdown made things much worse). He explained > to me that most locks are taken just to flip a few bits. Going into the > kernel and coming back was orders of magnitude longer than the critical > sections. By going into the kernel, it caused a ripple effect and lead > to even more contention. There answer was to implement their locking > completely in user space without any help from the kernel. That matches what I found with code that was using a mutex to take work items off a global list. Although the mutex was only held for a few instructions (probably several 100 because the list wasn't that well written), what happened was that as soon as there was any contention (which might start with a hardware interrupt) performance when through the floor. The fix was to replace the linked list with and array and use atomic add to 'grab' blocks of entries. (Even the atomic operations slowed things down.) > This is when I thought that having an adaptive spinner that could get > hints from the kernel via memory mapping would be extremely useful. Did you consider writing a timestamp into the mutex when it was acquired - or even as the 'acquired' value? A 'moderately synched TSC' should do. Then the waiter should be able to tell how long the mutex has been held for - and then not spin if it had been held ages. > The obvious problem with their implementation is that if the owner is > sleeping, there's no point in spinning. Worse, the owner may even be > waiting for the spinner to get off the CPU before it can run again. But > according to Robert, the gain in the general performance greatly > outweighed the few times this happened in practice. Unless you can use atomics (ok for bits and linked lists) you always have the problem that userspace can't disable interrupts. So, unlike the kernel, you can't implement a proper spinlock. I've NFI how CONFIG_RT manages to get anything done with all the spinlocks replaced by sleep locks. Clearly there are a spinlocks that are held for far too long. But you really do want to spin most of the time. ... David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales)
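For illustration, a sketch of the timestamp idea above, assuming x86 and a
moderately synchronized TSC; the threshold and layout are arbitrary, and
picking that threshold is exactly the heuristic problem raised in the
follow-up reply:

	#include <stdint.h>
	#include <x86intrin.h>	/* __rdtsc() */

	#define SPIN_TSC_THRESHOLD	20000ULL	/* Arbitrary "held too long" cutoff. */

	struct tsc_mutex {
		uint64_t acquired_tsc;	/* 0 = unlocked, otherwise TSC at acquisition. */
	};

	static int tsc_trylock(struct tsc_mutex *m)
	{
		uint64_t expected = 0;

		/*
		 * Use the acquisition timestamp itself as the "acquired" value,
		 * forced non-zero so 0 can keep meaning "unlocked".
		 */
		return __atomic_compare_exchange_n(&m->acquired_tsc, &expected,
						   __rdtsc() | 1, 0,
						   __ATOMIC_ACQUIRE, __ATOMIC_RELAXED);
	}

	/* Waiter side: only worth spinning if the lock was taken recently. */
	static int worth_spinning(const struct tsc_mutex *m)
	{
		uint64_t t = __atomic_load_n(&m->acquired_tsc, __ATOMIC_RELAXED);

		if (!t)
			return 0;	/* Unlocked: just try to acquire instead. */
		return __rdtsc() - t < SPIN_TSC_THRESHOLD;
	}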
On Mon, May 29 2023 at 15:14, Mathieu Desnoyers wrote: > +void __rseq_set_sched_state(struct task_struct *t, unsigned int state); > + > +static inline void rseq_set_sched_state(struct task_struct *t, unsigned int state) > +{ > + if (t->rseq_sched_state) > + __rseq_set_sched_state(t, state); This is invoked on every context switch and writes over that state unconditionally even in the case that the state was already cleared. There are enough situations where tasks are scheduled out several times while being in the kernel. > /* rseq_preempt() requires preemption to be disabled. */ > static inline void rseq_preempt(struct task_struct *t) > { > __set_bit(RSEQ_EVENT_PREEMPT_BIT, &t->rseq_event_mask); > rseq_set_notify_resume(t); > + rseq_set_sched_state(t, 0); This code is already stupid to begin with. __set_bit() is cheap, but rseq_set_notify_resume() is not as it has a conditional and a locked instruction and now you add two more conditionals into the context switch path. Thanks, tglx
On 9/28/23 16:21, Thomas Gleixner wrote: > On Mon, May 29 2023 at 15:14, Mathieu Desnoyers wrote: >> +void __rseq_set_sched_state(struct task_struct *t, unsigned int state); >> + >> +static inline void rseq_set_sched_state(struct task_struct *t, unsigned int state) >> +{ >> + if (t->rseq_sched_state) >> + __rseq_set_sched_state(t, state); > > This is invoked on every context switch and writes over that state > unconditionally even in the case that the state was already > cleared. There are enough situations where tasks are scheduled out > several times while being in the kernel. Right, if this becomes more than a PoC, I'll make sure to keep track of the current state within the task struct, and only update userspace on state transition. > >> /* rseq_preempt() requires preemption to be disabled. */ >> static inline void rseq_preempt(struct task_struct *t) >> { >> __set_bit(RSEQ_EVENT_PREEMPT_BIT, &t->rseq_event_mask); >> rseq_set_notify_resume(t); >> + rseq_set_sched_state(t, 0); > > This code is already stupid to begin with. __set_bit() is cheap, but > rseq_set_notify_resume() is not as it has a conditional and a locked > instruction What alternative would you recommend to set a per-thread state that has the same effect as TIF_NOTIFY_RESUME ? Indeed all it really needs to synchronize with is the thread owning the flags, but AFAIU having this flag part of the TIF flags requires use of an atomic instruction to synchronize updates against concurrent threads. If we move this thread flag into a separate field of struct thread_info, then we could turn this atomic set bit into a more lightweight store, but then we'd need to check against an extra field on return to userspace. And if we want to remove the conditional branch on the scheduler fast-path, we could always load and test both the task struct's rseq pointer and the thread_info "preempted" state on return to userspace. The tradeoff there would be to add extra loads and conditional branches on return to userspace to speed up the scheduler fast-path. Is this what you have in mind or am I missing your point ? > and now you add two more conditionals into the context > switch path. I'm open to suggestions on how to improve this if this goes beyond PoC stage and we observe measurable benefits on the userspace side. Thanks, Mathieu > > Thanks, > > tglx
On Mon, May 29 2023 at 15:14, Mathieu Desnoyers wrote: > +/* > + * rseq_sched_state should be aligned on the cache line size. > + */ > +struct rseq_sched_state { > + /* > + * Version of this structure. Populated by the kernel, read by > + * user-space. > + */ > + __u32 version; > + /* > + * The state is updated by the kernel. Read by user-space with > + * single-copy atomicity semantics. This field can be read by any > + * userspace thread. Aligned on 32-bit. Contains a bitmask of enum > + * rseq_sched_state_flags. This field is provided as a hint by the > + * scheduler, and requires that the page holding this state is > + * faulted-in for the state update to be performed by the scheduler. > + */ > + __u32 state; > + /* > + * Thread ID associated with the thread registering this structure. > + * Initialized by user-space before registration. > + */ > + __u32 tid; > +}; > + > /* > * struct rseq is aligned on 4 * 8 bytes to ensure it is always > * contained within a single cache-line. > @@ -148,6 +180,15 @@ struct rseq { > */ > __u32 mm_cid; > > + __u32 padding1; > + > + /* > + * Restartable sequences sched_state_ptr field. Initialized by > + * userspace to the address at which the struct rseq_sched_state is > + * located. Read by the kernel on rseq registration. > + */ > + __u64 sched_state_ptr; > + Why is this a separate entity instead of being simply embedded into struct rseq? Neither the code comment nor the changelog tells anything about that. If your intention was to provide a solution for process shared futexes then you completely failed to understand how process shared futexes work. If not, then please explain why you need a pointer and the associated hackery to deal with it. Thanks, tglx
On 9/28/23 16:54, Thomas Gleixner wrote:
> On Mon, May 29 2023 at 15:14, Mathieu Desnoyers wrote:
>> +/*
>> + * rseq_sched_state should be aligned on the cache line size.
>> + */
>> +struct rseq_sched_state {
>> +	/*
>> +	 * Version of this structure. Populated by the kernel, read by
>> +	 * user-space.
>> +	 */
>> +	__u32 version;
>> +	/*
>> +	 * The state is updated by the kernel. Read by user-space with
>> +	 * single-copy atomicity semantics. This field can be read by any
>> +	 * userspace thread. Aligned on 32-bit. Contains a bitmask of enum
>> +	 * rseq_sched_state_flags. This field is provided as a hint by the
>> +	 * scheduler, and requires that the page holding this state is
>> +	 * faulted-in for the state update to be performed by the scheduler.
>> +	 */
>> +	__u32 state;
>> +	/*
>> +	 * Thread ID associated with the thread registering this structure.
>> +	 * Initialized by user-space before registration.
>> +	 */
>> +	__u32 tid;
>> +};
>> +
>>  /*
>>   * struct rseq is aligned on 4 * 8 bytes to ensure it is always
>>   * contained within a single cache-line.
>> @@ -148,6 +180,15 @@ struct rseq {
>>  	 */
>>  	__u32 mm_cid;
>>
>> +	__u32 padding1;
>> +
>> +	/*
>> +	 * Restartable sequences sched_state_ptr field. Initialized by
>> +	 * userspace to the address at which the struct rseq_sched_state is
>> +	 * located. Read by the kernel on rseq registration.
>> +	 */
>> +	__u64 sched_state_ptr;
>> +
>
> Why is this a separate entity instead of being simply embedded into
> struct rseq?
>
> Neither the code comment nor the changelog tells anything about that.

Here is the email thread from v1 that led to this:

https://lore.kernel.org/lkml/ZGevZxOjJLMO9zlM@boqun-archlinux/

The reason for moving this sched_state to its own structure was to
optimize for a scenario where we have:

- many threads contending for the lock, thus loading the value of the
  struct rseq "sched_state",
- the lock owner thread doing rseq critical sections with the lock held,
  thus updating the value of the struct rseq "rseq_cs" field (in the
  same cache line).

The loads of the sched_state from other threads would slow down (even
slightly) the update to the rseq_cs field, thus causing the critical
sections to take a few more cycles.

I am not convinced that the complexity vs speed tradeoff of moving this
into its own separate structure is worth it though, especially given
that it would require userspace to wire things up with an additional
side-structure, rather than just extend naturally with the
extensible-rseq ABI.

Another approach would be to cacheline-align this field within struct
rseq and waste space to padding.

I am inclined to just embed this into struct rseq without caring too
much about the slight overhead caused by sharing the cache line with
other fields.

>
> If your intention was to provide a solution for process shared futexes
> then you completely failed to understand how process shared futexes
> work. If not, then please explain why you need a pointer and the
> associated hackery to deal with it.

I have not given a single thought to shared futexes in this PoC so
far. :)

So let's see: to support shared futexes, we could move the information
about the owner's struct rseq (which could be the thread pointer of the
owner thread) to a field separate from the "lock" state (which is ABI,
as you pointed out). Grabbing the lock would then be done with an atomic
instruction, and setting this extra information about the lock owner
would be done with a simple store.

Given that the lock owner information is only used to statistically
decide whether other threads should spin or block when contended, we
don't really care if this information is set a few instructions after
acquiring the lock.

Thoughts?

Thanks for the feedback,

Mathieu

>
> Thanks,
>
>         tglx
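[To make the scheme sketched in the reply above concrete, here is a minimal illustration of what such a lock layout and acquisition path could look like. It is only a sketch: struct adaptive_mutex and its field names are hypothetical, and nothing in it is part of the proposed ABI; only the idea of "atomic acquire on the futex word, plain store for the owner hint" comes from the discussion.]

#include <stdint.h>

/*
 * Hypothetical layout: the futex word keeps the plain 32-bit ABI, the
 * owner hint sits in a separate field next to it.
 */
struct adaptive_mutex {
	uint32_t futex_word;		/* 0 = unlocked, 1 = locked */
	uint64_t owner_sched_state;	/* address of owner's struct rseq_sched_state */
};

static int adaptive_mutex_trylock(struct adaptive_mutex *m, uint64_t my_sched_state)
{
	uint32_t unlocked = 0;

	/* The lock itself is taken with a single atomic on the futex word. */
	if (!__atomic_compare_exchange_n(&m->futex_word, &unlocked, 1, 0,
					 __ATOMIC_ACQUIRE, __ATOMIC_RELAXED))
		return 0;
	/*
	 * The owner hint is published with a plain relaxed store: waiters
	 * only consult it statistically to choose between spinning and
	 * blocking, so lagging the acquisition by a few instructions is
	 * harmless.
	 */
	__atomic_store_n(&m->owner_sched_state, my_sched_state, __ATOMIC_RELAXED);
	return 1;
}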
On Thu, 28 Sep 2023 15:51:47 +0000
David Laight <David.Laight@ACULAB.COM> wrote:

> > This is when I thought that having an adaptive spinner that could get
> > hints from the kernel via memory mapping would be extremely useful.
>
> Did you consider writing a timestamp into the mutex when it was
> acquired - or even as the 'acquired' value?
> A 'moderately synched TSC' should do.
> Then the waiter should be able to tell how long the mutex
> has been held for - and then not spin if it had been held ages.

And what heuristic would you use? In my experience, picking a "time to
spin" may work for one workload but cause major regressions in another.
I came to the conclusion to "hate" heuristics and NACK them whenever
someone suggested adding them to the rt_mutex in the kernel (back before
adaptive mutexes were introduced).

> > The obvious problem with their implementation is that if the owner is
> > sleeping, there's no point in spinning. Worse, the owner may even be
> > waiting for the spinner to get off the CPU before it can run again. But
> > according to Robert, the gain in the general performance greatly
> > outweighed the few times this happened in practice.
>
> Unless you can use atomics (ok for bits and linked lists) you
> always have the problem that userspace can't disable interrupts.
> So, unlike the kernel, you can't implement a proper spinlock.

Why do you need to disable interrupts? If you know the owner is running on
the CPU, you know it's not trying to run on the CPU that is acquiring the
lock. Heck, there are normal spin locks outside of PREEMPT_RT that do not
disable interrupts. The only time you need to disable interrupts is if the
interrupt itself takes the spin lock, and that's just to prevent deadlocks.

> I've NFI how CONFIG_RT manages to get anything done with all
> the spinlocks replaced by sleep locks.
> Clearly there are spinlocks that are held for far too long.
> But you really do want to spin most of the time.

It spins as long as the owner of the lock is running on the CPU. This is
what we are looking to get from this patch series for user space.

Back in 2007, we had an issue with scaling on SMP machines. The RT kernel
with the sleeping spin locks would start to slow down exponentially the
more CPUs you had. Once we hit more than 16 CPUs, booting an RT kernel
took tens of minutes, when the normal CONFIG_PREEMPT kernel would only
take a couple of minutes. The more CPUs you added, the worse it became.

Then SUSE submitted a patch to have the rt_mutex spin only if the owner of
the mutex was still running on another CPU. This actually mimics a real
spin lock (because that's exactly what they do, they spin while the owner
is running on a CPU). The difference between a true spin lock and an
rt_mutex was that the spinner would stop spinning if the owner was
preempted (a true spin lock owner could not be preempted).

After applying the adaptive spinning, we were able to scale PREEMPT_RT to
any number of CPUs that the normal kernel could do with just a linear
performance hit.

This is why I'm very much interested in getting the same ability into user
space spin locks.

-- Steve
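[For illustration, the user-space side of the policy described above -- spin only while the lock owner is observed on a CPU, otherwise block in futex -- could look roughly like the sketch below. It assumes a kernel and <linux/rseq.h> with this series applied, so that struct rseq_sched_state and RSEQ_SCHED_STATE_FLAG_ON_CPU exist; how the waiter finds the owner's rseq_sched_state is left out, and the futex protocol is reduced to a bare minimum with no contended state or wake-up bookkeeping.]

#include <linux/futex.h>
#include <linux/rseq.h>		/* assumes this series is applied */
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

static void futex_wait(uint32_t *addr, uint32_t val)
{
	syscall(SYS_futex, addr, FUTEX_WAIT_PRIVATE, val, NULL, NULL, 0);
}

static int owner_on_cpu(const struct rseq_sched_state *owner)
{
	/* Single-copy atomic load of the hint published by the scheduler. */
	return __atomic_load_n(&owner->state, __ATOMIC_RELAXED) &
	       RSEQ_SCHED_STATE_FLAG_ON_CPU;
}

/*
 * Contended path: spin while the current owner is observed on a CPU,
 * otherwise block in FUTEX_WAIT. A real implementation would also add
 * a cpu_relax()-style pause, a spin cap, and wake-up handling.
 */
static void lock_contended(uint32_t *futex_word,
			   const struct rseq_sched_state *owner)
{
	for (;;) {
		uint32_t unlocked = 0;

		if (__atomic_compare_exchange_n(futex_word, &unlocked, 1, 0,
						__ATOMIC_ACQUIRE, __ATOMIC_RELAXED))
			return;			/* lock acquired */

		if (owner_on_cpu(owner))
			continue;		/* owner is running: keep spinning */

		/* Owner is off-CPU: give up the CPU instead of burning cycles. */
		futex_wait(futex_word, 1);
	}
}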
From: Steven Rostedt
> Sent: 02 October 2023 17:51
>
> On Thu, 28 Sep 2023 15:51:47 +0000
> David Laight <David.Laight@ACULAB.COM> wrote:
>
> > > This is when I thought that having an adaptive spinner that could get
> > > hints from the kernel via memory mapping would be extremely useful.
> >
> > Did you consider writing a timestamp into the mutex when it was
> > acquired - or even as the 'acquired' value?
> > A 'moderately synched TSC' should do.
> > Then the waiter should be able to tell how long the mutex
> > has been held for - and then not spin if it had been held ages.
>
> And what heuristic would you use? In my experience, picking a "time to
> spin" may work for one workload but cause major regressions in another.
> I came to the conclusion to "hate" heuristics and NACK them whenever
> someone suggested adding them to the rt_mutex in the kernel (back before
> adaptive mutexes were introduced).

Isn't that exactly what an adaptive mutex does?
Spin 'for a bit' before sleeping.

> > > The obvious problem with their implementation is that if the owner is
> > > sleeping, there's no point in spinning. Worse, the owner may even be
> > > waiting for the spinner to get off the CPU before it can run again. But
> > > according to Robert, the gain in the general performance greatly
> > > outweighed the few times this happened in practice.
> >
> > Unless you can use atomics (ok for bits and linked lists) you
> > always have the problem that userspace can't disable interrupts.
> > So, unlike the kernel, you can't implement a proper spinlock.
>
> Why do you need to disable interrupts? If you know the owner is running on
> the CPU, you know it's not trying to run on the CPU that is acquiring the
> lock. Heck, there are normal spin locks outside of PREEMPT_RT that do not
> disable interrupts. The only time you need to disable interrupts is if the
> interrupt itself takes the spin lock, and that's just to prevent deadlocks.

You need to disable interrupts in order to bound the time the
spinlock is held for.
If all you are doing is a dozen instructions (e.g. to remove an
item from a list) then you really don't want an interrupt coming in
while you have the spinlock held.
It isn't the cost of the ISR - that has to happen sometime, but that
the cpu waiting for the spinlock also takes the cost of the ISR.

A network+softint ISR can run for a long time - I'm sure I've
seen a good fraction of a millisecond.
You really don't want another (or many other) cpu spinning while
that is going on.
Which (to my mind) pretty much means that you always want to
disable interrupts on a spinlock.
If the architecture makes masking ISR expensive then I've seen schemes
that let the hardware interrupt happen, then disable it and rerun later.

> > I've NFI how CONFIG_RT manages to get anything done with all
> > the spinlocks replaced by sleep locks.
> > Clearly there are spinlocks that are held for far too long.
> > But you really do want to spin most of the time.
>
> It spins as long as the owner of the lock is running on the CPU. This is
> what we are looking to get from this patch series for user space.

I think you'd need to detect that the cpu was in-kernel running an ISR.

But the multithreaded audio app I was 'fixing' basically failed
as soon as it had to sleep on one of the futexes.
The real problem was ISR while the mutex was held.
So deciding to sleep because the lock owner isn't running (in user)
would already be delaying things too much.

> Back in 2007, we had an issue with scaling on SMP machines. The RT kernel
> with the sleeping spin locks would start to slow down exponentially the
> more CPUs you had. Once we hit more than 16 CPUs, booting an RT kernel
> took tens of minutes, when the normal CONFIG_PREEMPT kernel would only
> take a couple of minutes. The more CPUs you added, the worse it became.
>
> Then SUSE submitted a patch to have the rt_mutex spin only if the owner of
> the mutex was still running on another CPU. This actually mimics a real
> spin lock (because that's exactly what they do, they spin while the owner
> is running on a CPU). The difference between a true spin lock and an
> rt_mutex was that the spinner would stop spinning if the owner was
> preempted (a true spin lock owner could not be preempted).
>
> After applying the adaptive spinning, we were able to scale PREEMPT_RT to
> any number of CPUs that the normal kernel could do with just a linear
> performance hit.

Sounds like it was spinning for far too long at the best of times.
But analysing these sorts of latencies is hard.

	David

> This is why I'm very much interested in getting the same ability into user
> space spin locks.
>
> -- Steve

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
On Mon, 2 Oct 2023 17:22:34 +0000
David Laight <David.Laight@ACULAB.COM> wrote:

> > And what heuristic would you use? In my experience, picking a "time to
> > spin" may work for one workload but cause major regressions in another.
> > I came to the conclusion to "hate" heuristics and NACK them whenever
> > someone suggested adding them to the rt_mutex in the kernel (back before
> > adaptive mutexes were introduced).
>
> Isn't that exactly what an adaptive mutex does?
> Spin 'for a bit' before sleeping.

But it's not some arbitrary time to spin. Technically, a kernel spin lock
is spinning on the heuristic of ownership. "Spin until the lock is
released" is a heuristic!

> > > > The obvious problem with their implementation is that if the owner is
> > > > sleeping, there's no point in spinning. Worse, the owner may even be
> > > > waiting for the spinner to get off the CPU before it can run again. But
> > > > according to Robert, the gain in the general performance greatly
> > > > outweighed the few times this happened in practice.
> > >
> > > Unless you can use atomics (ok for bits and linked lists) you
> > > always have the problem that userspace can't disable interrupts.
> > > So, unlike the kernel, you can't implement a proper spinlock.
> >
> > Why do you need to disable interrupts? If you know the owner is running on
> > the CPU, you know it's not trying to run on the CPU that is acquiring the
> > lock. Heck, there are normal spin locks outside of PREEMPT_RT that do not
> > disable interrupts. The only time you need to disable interrupts is if the
> > interrupt itself takes the spin lock, and that's just to prevent deadlocks.
>
> You need to disable interrupts in order to bound the time the
> spinlock is held for.
> If all you are doing is a dozen instructions (e.g. to remove an
> item from a list) then you really don't want an interrupt coming in
> while you have the spinlock held.

That's just the noise of normal processing. What's the difference between
it happening while spinning and it happening during normal execution?

> It isn't the cost of the ISR - that has to happen sometime, but that
> the cpu waiting for the spinlock also takes the cost of the ISR.

As opposed to just going into the kernel? So it wastes some of its quota.
It's not stopping anything else from running more than normal.

> A network+softint ISR can run for a long time - I'm sure I've
> seen a good fraction of a millisecond.
> You really don't want another (or many other) cpu spinning while
> that is going on.

Why not? The current user-space-only code does that now (and it will even
spin if the owner is preempted). What we are talking about implementing is
a big improvement to what is currently done.

> Which (to my mind) pretty much means that you always want to
> disable interrupts on a spinlock.

The benchmarks say otherwise. Sure, once in a while you may spin longer
because of an interrupt, but that's a very rare occurrence compared to the
normal taking of spin locks. Disabling interrupts is an expensive
operation. The savings you get from "not waiting for a softirq to finish"
will be drowned out by the added overhead of disabling interrupts at every
acquire.

> If the architecture makes masking ISR expensive then I've seen schemes
> that let the hardware interrupt happen, then disable it and rerun later.
>
> > > I've NFI how CONFIG_RT manages to get anything done with all
> > > the spinlocks replaced by sleep locks.
> > > Clearly there are spinlocks that are held for far too long.
> > > But you really do want to spin most of the time.
> >
> > It spins as long as the owner of the lock is running on the CPU. This is
> > what we are looking to get from this patch series for user space.
>
> I think you'd need to detect that the cpu was in-kernel running an ISR.

For the few times that might happen, it's not worth it.

> But the multithreaded audio app I was 'fixing' basically failed
> as soon as it had to sleep on one of the futexes.
> The real problem was ISR while the mutex was held.
> So deciding to sleep because the lock owner isn't running (in user)
> would already be delaying things too much.

That doesn't sound like the use case we are fixing. If your audio app
failed because it had to sleep, that tells me it would fail regardless.

> > Back in 2007, we had an issue with scaling on SMP machines. The RT kernel
> > with the sleeping spin locks would start to slow down exponentially the
> > more CPUs you had. Once we hit more than 16 CPUs, booting an RT kernel
> > took tens of minutes, when the normal CONFIG_PREEMPT kernel would only
> > take a couple of minutes. The more CPUs you added, the worse it became.
> >
> > Then SUSE submitted a patch to have the rt_mutex spin only if the owner of
> > the mutex was still running on another CPU. This actually mimics a real
> > spin lock (because that's exactly what they do, they spin while the owner
> > is running on a CPU). The difference between a true spin lock and an
> > rt_mutex was that the spinner would stop spinning if the owner was
> > preempted (a true spin lock owner could not be preempted).
> >
> > After applying the adaptive spinning, we were able to scale PREEMPT_RT to
> > any number of CPUs that the normal kernel could do with just a linear
> > performance hit.
>
> Sounds like it was spinning for far too long at the best of times.
> But analysing these sorts of latencies is hard.

It wasn't spinning at all! The problem was that all rt_mutexes would
immediately sleep on any contention. This caused a ripple effect that
would increase the time locks were held, which would increase contention,
which increased the hold times even more. A very bad feedback loop.

This was all very well traced and studied. That analysis was not hard at
all. We know exactly what the cause was and why adaptive mutexes fixed the
situation.

And this is why I'm excited about this current work.

-- Steve
diff --git a/include/linux/sched.h b/include/linux/sched.h
index eed5d65b8d1f..7741ff10136a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1311,6 +1311,7 @@ struct task_struct {
 	 * with respect to preemption.
 	 */
 	unsigned long rseq_event_mask;
+	struct rseq_sched_state __user *rseq_sched_state;
 #endif
 
 #ifdef CONFIG_SCHED_MM_CID
@@ -2351,11 +2352,20 @@ static inline void rseq_signal_deliver(struct ksignal *ksig,
 	rseq_handle_notify_resume(ksig, regs);
 }
 
+void __rseq_set_sched_state(struct task_struct *t, unsigned int state);
+
+static inline void rseq_set_sched_state(struct task_struct *t, unsigned int state)
+{
+	if (t->rseq_sched_state)
+		__rseq_set_sched_state(t, state);
+}
+
 /* rseq_preempt() requires preemption to be disabled. */
 static inline void rseq_preempt(struct task_struct *t)
 {
 	__set_bit(RSEQ_EVENT_PREEMPT_BIT, &t->rseq_event_mask);
 	rseq_set_notify_resume(t);
+	rseq_set_sched_state(t, 0);
 }
 
 /* rseq_migrate() requires preemption to be disabled. */
@@ -2376,11 +2386,13 @@ static inline void rseq_fork(struct task_struct *t, unsigned long clone_flags)
 		t->rseq_len = 0;
 		t->rseq_sig = 0;
 		t->rseq_event_mask = 0;
+		t->rseq_sched_state = NULL;
 	} else {
 		t->rseq = current->rseq;
 		t->rseq_len = current->rseq_len;
 		t->rseq_sig = current->rseq_sig;
 		t->rseq_event_mask = current->rseq_event_mask;
+		t->rseq_sched_state = current->rseq_sched_state;
 	}
 }
 
@@ -2390,6 +2402,7 @@ static inline void rseq_execve(struct task_struct *t)
 	t->rseq_len = 0;
 	t->rseq_sig = 0;
 	t->rseq_event_mask = 0;
+	t->rseq_sched_state = NULL;
 }
 
 #else
@@ -2405,6 +2418,9 @@ static inline void rseq_signal_deliver(struct ksignal *ksig,
 				       struct pt_regs *regs)
 {
 }
+static inline void rseq_set_sched_state(struct task_struct *t, unsigned int state)
+{
+}
 static inline void rseq_preempt(struct task_struct *t)
 {
 }
diff --git a/include/uapi/linux/rseq.h b/include/uapi/linux/rseq.h
index c233aae5eac9..b28588225fa7 100644
--- a/include/uapi/linux/rseq.h
+++ b/include/uapi/linux/rseq.h
@@ -37,6 +37,13 @@ enum rseq_cs_flags {
 		(1U << RSEQ_CS_FLAG_NO_RESTART_ON_MIGRATE_BIT),
 };
 
+enum rseq_sched_state_flags {
+	/*
+	 * Task is currently running on a CPU if bit is set.
+	 */
+	RSEQ_SCHED_STATE_FLAG_ON_CPU		= (1U << 0),
+};
+
 /*
  * struct rseq_cs is aligned on 4 * 8 bytes to ensure it is always
  * contained within a single cache-line. It is usually declared as
@@ -53,6 +60,31 @@ struct rseq_cs {
 	__u64 abort_ip;
 } __attribute__((aligned(4 * sizeof(__u64))));
 
+/*
+ * rseq_sched_state should be aligned on the cache line size.
+ */
+struct rseq_sched_state {
+	/*
+	 * Version of this structure. Populated by the kernel, read by
+	 * user-space.
+	 */
+	__u32 version;
+	/*
+	 * The state is updated by the kernel. Read by user-space with
+	 * single-copy atomicity semantics. This field can be read by any
+	 * userspace thread. Aligned on 32-bit. Contains a bitmask of enum
+	 * rseq_sched_state_flags. This field is provided as a hint by the
+	 * scheduler, and requires that the page holding this state is
+	 * faulted-in for the state update to be performed by the scheduler.
+	 */
+	__u32 state;
+	/*
+	 * Thread ID associated with the thread registering this structure.
+	 * Initialized by user-space before registration.
+	 */
+	__u32 tid;
+};
+
 /*
  * struct rseq is aligned on 4 * 8 bytes to ensure it is always
  * contained within a single cache-line.
@@ -148,6 +180,15 @@ struct rseq {
 	 */
 	__u32 mm_cid;
 
+	__u32 padding1;
+
+	/*
+	 * Restartable sequences sched_state_ptr field. Initialized by
+	 * userspace to the address at which the struct rseq_sched_state is
+	 * located. Read by the kernel on rseq registration.
+	 */
+	__u64 sched_state_ptr;
+
 	/*
 	 * Flexible array member at end of structure, after last feature field.
 	 */
diff --git a/kernel/rseq.c b/kernel/rseq.c
index 9de6e35fe679..e36d6deeae77 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -87,10 +87,12 @@ static int rseq_update_cpu_node_id(struct task_struct *t)
 {
+	struct rseq_sched_state __user *rseq_sched_state = t->rseq_sched_state;
 	struct rseq __user *rseq = t->rseq;
 	u32 cpu_id = raw_smp_processor_id();
 	u32 node_id = cpu_to_node(cpu_id);
 	u32 mm_cid = task_mm_cid(t);
+	u32 sched_state = RSEQ_SCHED_STATE_FLAG_ON_CPU;
 
 	WARN_ON_ONCE((int) mm_cid < 0);
 	if (!user_write_access_begin(rseq, t->rseq_len))
@@ -99,6 +101,7 @@ static int rseq_update_cpu_node_id(struct task_struct *t)
 	unsafe_put_user(cpu_id, &rseq->cpu_id, efault_end);
 	unsafe_put_user(node_id, &rseq->node_id, efault_end);
 	unsafe_put_user(mm_cid, &rseq->mm_cid, efault_end);
+	unsafe_put_user(sched_state, &rseq_sched_state->state, efault_end);
 	/*
 	 * Additional feature fields added after ORIG_RSEQ_SIZE
 	 * need to be conditionally updated only if
@@ -339,6 +342,18 @@ void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs)
 	force_sigsegv(sig);
 }
 
+/*
+ * Attempt to update rseq scheduler state.
+ */
+void __rseq_set_sched_state(struct task_struct *t, unsigned int state)
+{
+	if (unlikely(t->flags & PF_EXITING))
+		return;
+	pagefault_disable();
+	(void) put_user(state, &t->rseq_sched_state->state);
+	pagefault_enable();
+}
+
 #ifdef CONFIG_DEBUG_RSEQ
 
 /*
@@ -359,6 +374,29 @@ void rseq_syscall(struct pt_regs *regs)
 
 #endif
 
+static int rseq_get_sched_state_ptr(struct rseq __user *rseq, u32 rseq_len,
+				    struct rseq_sched_state __user **_sched_state_ptr)
+{
+	struct rseq_sched_state __user *sched_state_ptr;
+	u64 sched_state_ptr_value;
+	u32 version = 0;
+	int ret;
+
+	if (rseq_len < offsetofend(struct rseq, sched_state_ptr))
+		return 0;
+	ret = get_user(sched_state_ptr_value, &rseq->sched_state_ptr);
+	if (ret)
+		return ret;
+	sched_state_ptr = (struct rseq_sched_state __user *)(unsigned long)sched_state_ptr_value;
+	if (!sched_state_ptr)
+		return 0;
+	ret = put_user(version, &sched_state_ptr->version);
+	if (ret)
+		return ret;
+	*_sched_state_ptr = sched_state_ptr;
+	return 0;
+}
+
 /*
  * sys_rseq - setup restartable sequences for caller thread.
  */
@@ -366,6 +404,7 @@ SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len,
 		int, flags, u32, sig)
 {
 	int ret;
+	struct rseq_sched_state __user *sched_state_ptr = NULL;
 
 	if (flags & RSEQ_FLAG_UNREGISTER) {
 		if (flags & ~RSEQ_FLAG_UNREGISTER)
@@ -383,6 +422,7 @@ SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len,
 		current->rseq = NULL;
 		current->rseq_sig = 0;
 		current->rseq_len = 0;
+		current->rseq_sched_state = NULL;
 		return 0;
 	}
 
@@ -420,9 +460,12 @@ SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len,
 		return -EINVAL;
 	if (!access_ok(rseq, rseq_len))
 		return -EFAULT;
+	if (rseq_get_sched_state_ptr(rseq, rseq_len, &sched_state_ptr))
+		return -EFAULT;
 	current->rseq = rseq;
 	current->rseq_len = rseq_len;
 	current->rseq_sig = sig;
+	current->rseq_sched_state = sched_state_ptr;
 	/*
 	 * If rseq was previously inactive, and has just been
 	 * registered, ensure the cpu_id_start and cpu_id fields
Expose the "on-cpu" state for each thread through struct rseq to allow adaptative mutexes to decide more accurately between busy-waiting and calling sys_futex() to release the CPU, based on the on-cpu state of the mutex owner. It is only provided as an optimization hint, because there is no guarantee that the page containing this field is in the page cache, and therefore the scheduler may very well fail to clear the on-cpu state on preemption. This is expected to be rare though, and is resolved as soon as the task returns to user-space. The goal is to improve use-cases where the duration of the critical sections for a given lock follows a multi-modal distribution, preventing statistical guesses from doing a good job at choosing between busy-wait and futex wait behavior. Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Steven Rostedt (Google) <rostedt@goodmis.org> Cc: Carlos O'Donell <carlos@redhat.com> Cc: Florian Weimer <fweimer@redhat.com> Cc: libc-alpha@sourceware.org --- include/linux/sched.h | 16 +++++++++++++++ include/uapi/linux/rseq.h | 41 +++++++++++++++++++++++++++++++++++++ kernel/rseq.c | 43 +++++++++++++++++++++++++++++++++++++++ 3 files changed, 100 insertions(+)