Message ID | 20240918140500.26365-2-Jason@zx2c4.com |
---|---|
State | New |
Headers | show |
Series | [v7] linux: Add support for getrandom vDSO | expand |
On Wed, 2024-09-18 at 16:01 +0200, Jason A. Donenfeld wrote: > Linux 6.11 has getrandom() in vDSO. It operates on a thread-local opaque > state allocated with mmap using flags specified by the vDSO. > > Multiple states are allocated at once, as many as fit into a page, and > these are held in an array of available states to be doled out to each > thread upon first use, and recycled when a thread terminates. As these > states run low, more are allocated. > > To make this procedure async-signal-safe, a simple guard is used in the > LSB of the opaque state address, falling back to the syscall if there's > reentrancy contention. > > Also, _Fork() is handled by blocking signals on opaque state allocation > (so _Fork() always sees a consistent state even if it interrupts a > getrandom() call) and by iterating over the thread stack cache on > reclaim_stack. Each opaque state will be in the free states list > (grnd_alloc.states) or allocated to a running thread. > > The cancellation is handled by always using GRND_NONBLOCK flags while > calling the vDSO, and falling back to the cancellable syscall if the > kernel returns EAGAIN (would block). Since getrandom is not defined by > POSIX and cancellation is supported as an extension, the cancellation is > handled as 'may occur' instead of 'shall occur' [1], meaning that if > vDSO does not block (the expected behavior) getrandom will not act as a > cancellation entrypoint. It avoids a pthread_testcancel call on the fast > path (different than 'shall occur' functions, like sem_wait()). > > It is currently enabled for x86_64, which is available in Linux 6.11, > and aarch64, powerpc32, powerpc64, loongarch64, and s390x, which are > available in Linux 6.12. > > Link: https://pubs.opengroup.org/onlinepubs/9799919799/nframe.html [1] > Co-developed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org> > Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Tested on x86_64, aarch64, and loongarch64. Glibc test suite passes and gdb shows vdso is really used. Tested-by: Xi Ruoyao <xry111@xry111.site>
On 18.09.24 21:22, Xi Ruoyao wrote: > On Wed, 2024-09-18 at 16:01 +0200, Jason A. Donenfeld wrote: >> Linux 6.11 has getrandom() in vDSO. It operates on a thread-local opaque >> state allocated with mmap using flags specified by the vDSO. >> >> Multiple states are allocated at once, as many as fit into a page, and >> these are held in an array of available states to be doled out to each >> thread upon first use, and recycled when a thread terminates. As these >> states run low, more are allocated. >> >> To make this procedure async-signal-safe, a simple guard is used in the >> LSB of the opaque state address, falling back to the syscall if there's >> reentrancy contention. >> >> Also, _Fork() is handled by blocking signals on opaque state allocation >> (so _Fork() always sees a consistent state even if it interrupts a >> getrandom() call) and by iterating over the thread stack cache on >> reclaim_stack. Each opaque state will be in the free states list >> (grnd_alloc.states) or allocated to a running thread. >> >> The cancellation is handled by always using GRND_NONBLOCK flags while >> calling the vDSO, and falling back to the cancellable syscall if the >> kernel returns EAGAIN (would block). Since getrandom is not defined by >> POSIX and cancellation is supported as an extension, the cancellation is >> handled as 'may occur' instead of 'shall occur' [1], meaning that if >> vDSO does not block (the expected behavior) getrandom will not act as a >> cancellation entrypoint. It avoids a pthread_testcancel call on the fast >> path (different than 'shall occur' functions, like sem_wait()). >> >> It is currently enabled for x86_64, which is available in Linux 6.11, >> and aarch64, powerpc32, powerpc64, loongarch64, and s390x, which are >> available in Linux 6.12. >> >> Link: https://pubs.opengroup.org/onlinepubs/9799919799/nframe.html [1] >> Co-developed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org> >> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> > > Tested on x86_64, aarch64, and loongarch64. Glibc test suite passes and > gdb shows vdso is really used. Same on s390x. > > Tested-by: Xi Ruoyao <xry111@xry111.site> >
On Thu, Sep 19, 2024 at 01:30:38PM +0200, Stefan Liebler wrote: > On 18.09.24 21:22, Xi Ruoyao wrote: > > On Wed, 2024-09-18 at 16:01 +0200, Jason A. Donenfeld wrote: > >> Linux 6.11 has getrandom() in vDSO. It operates on a thread-local opaque > >> state allocated with mmap using flags specified by the vDSO. > >> > >> Multiple states are allocated at once, as many as fit into a page, and > >> these are held in an array of available states to be doled out to each > >> thread upon first use, and recycled when a thread terminates. As these > >> states run low, more are allocated. > >> > >> To make this procedure async-signal-safe, a simple guard is used in the > >> LSB of the opaque state address, falling back to the syscall if there's > >> reentrancy contention. > >> > >> Also, _Fork() is handled by blocking signals on opaque state allocation > >> (so _Fork() always sees a consistent state even if it interrupts a > >> getrandom() call) and by iterating over the thread stack cache on > >> reclaim_stack. Each opaque state will be in the free states list > >> (grnd_alloc.states) or allocated to a running thread. > >> > >> The cancellation is handled by always using GRND_NONBLOCK flags while > >> calling the vDSO, and falling back to the cancellable syscall if the > >> kernel returns EAGAIN (would block). Since getrandom is not defined by > >> POSIX and cancellation is supported as an extension, the cancellation is > >> handled as 'may occur' instead of 'shall occur' [1], meaning that if > >> vDSO does not block (the expected behavior) getrandom will not act as a > >> cancellation entrypoint. It avoids a pthread_testcancel call on the fast > >> path (different than 'shall occur' functions, like sem_wait()). > >> > >> It is currently enabled for x86_64, which is available in Linux 6.11, > >> and aarch64, powerpc32, powerpc64, loongarch64, and s390x, which are > >> available in Linux 6.12. > >> > >> Link: https://pubs.opengroup.org/onlinepubs/9799919799/nframe.html [1] > >> Co-developed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org> > >> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> > > > > Tested on x86_64, aarch64, and loongarch64. Glibc test suite passes and > > gdb shows vdso is really used. > Same on s390x. I'll extract your 'Tested-by', then, if that's okay.
Hey Christophe, On Sat, Sep 21, 2024 at 12:21:14AM +0200, Jason A. Donenfeld wrote: > On Thu, Sep 19, 2024 at 01:30:38PM +0200, Stefan Liebler wrote: > > On 18.09.24 21:22, Xi Ruoyao wrote: > > > On Wed, 2024-09-18 at 16:01 +0200, Jason A. Donenfeld wrote: > > >> Linux 6.11 has getrandom() in vDSO. It operates on a thread-local opaque > > >> state allocated with mmap using flags specified by the vDSO. > > >> > > >> Multiple states are allocated at once, as many as fit into a page, and > > >> these are held in an array of available states to be doled out to each > > >> thread upon first use, and recycled when a thread terminates. As these > > >> states run low, more are allocated. > > >> > > >> To make this procedure async-signal-safe, a simple guard is used in the > > >> LSB of the opaque state address, falling back to the syscall if there's > > >> reentrancy contention. > > >> > > >> Also, _Fork() is handled by blocking signals on opaque state allocation > > >> (so _Fork() always sees a consistent state even if it interrupts a > > >> getrandom() call) and by iterating over the thread stack cache on > > >> reclaim_stack. Each opaque state will be in the free states list > > >> (grnd_alloc.states) or allocated to a running thread. > > >> > > >> The cancellation is handled by always using GRND_NONBLOCK flags while > > >> calling the vDSO, and falling back to the cancellable syscall if the > > >> kernel returns EAGAIN (would block). Since getrandom is not defined by > > >> POSIX and cancellation is supported as an extension, the cancellation is > > >> handled as 'may occur' instead of 'shall occur' [1], meaning that if > > >> vDSO does not block (the expected behavior) getrandom will not act as a > > >> cancellation entrypoint. It avoids a pthread_testcancel call on the fast > > >> path (different than 'shall occur' functions, like sem_wait()). > > >> > > >> It is currently enabled for x86_64, which is available in Linux 6.11, > > >> and aarch64, powerpc32, powerpc64, loongarch64, and s390x, which are > > >> available in Linux 6.12. > > >> > > >> Link: https://pubs.opengroup.org/onlinepubs/9799919799/nframe.html [1] > > >> Co-developed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org> > > >> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> > > > > > > Tested on x86_64, aarch64, and loongarch64. Glibc test suite passes and > > > gdb shows vdso is really used. > > Same on s390x. > > I'll extract your 'Tested-by', then, if that's okay. Right now for testing, I've got these tags: Tested-by: Jason A. Donenfeld <Jason@zx2c4.com> # x86_64 Tested-by: Adhemerval Zanella<adhemerval.zanella@linaro.org> # x86_64, aarch64 Tested-by: Xi Ruoyao <xry111@xry111.site> # x86_64, aarch64, loongarch64 Tested-by: Stefan Liebler <stli@linux.ibm.com> # s390x If you have a moment, do you think you could give it a go on powerpc32 and 64, so I can add those archs to the tested list? Trying to make this real easy for Florian to merge when he's back from Vienna :). Jason
On 21.09.24 00:21, Jason A. Donenfeld wrote: > On Thu, Sep 19, 2024 at 01:30:38PM +0200, Stefan Liebler wrote: >> On 18.09.24 21:22, Xi Ruoyao wrote: >>> On Wed, 2024-09-18 at 16:01 +0200, Jason A. Donenfeld wrote: >>>> Linux 6.11 has getrandom() in vDSO. It operates on a thread-local opaque >>>> state allocated with mmap using flags specified by the vDSO. >>>> >>>> Multiple states are allocated at once, as many as fit into a page, and >>>> these are held in an array of available states to be doled out to each >>>> thread upon first use, and recycled when a thread terminates. As these >>>> states run low, more are allocated. >>>> >>>> To make this procedure async-signal-safe, a simple guard is used in the >>>> LSB of the opaque state address, falling back to the syscall if there's >>>> reentrancy contention. >>>> >>>> Also, _Fork() is handled by blocking signals on opaque state allocation >>>> (so _Fork() always sees a consistent state even if it interrupts a >>>> getrandom() call) and by iterating over the thread stack cache on >>>> reclaim_stack. Each opaque state will be in the free states list >>>> (grnd_alloc.states) or allocated to a running thread. >>>> >>>> The cancellation is handled by always using GRND_NONBLOCK flags while >>>> calling the vDSO, and falling back to the cancellable syscall if the >>>> kernel returns EAGAIN (would block). Since getrandom is not defined by >>>> POSIX and cancellation is supported as an extension, the cancellation is >>>> handled as 'may occur' instead of 'shall occur' [1], meaning that if >>>> vDSO does not block (the expected behavior) getrandom will not act as a >>>> cancellation entrypoint. It avoids a pthread_testcancel call on the fast >>>> path (different than 'shall occur' functions, like sem_wait()). >>>> >>>> It is currently enabled for x86_64, which is available in Linux 6.11, >>>> and aarch64, powerpc32, powerpc64, loongarch64, and s390x, which are >>>> available in Linux 6.12. >>>> >>>> Link: https://pubs.opengroup.org/onlinepubs/9799919799/nframe.html [1] >>>> Co-developed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org> >>>> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> >>> >>> Tested on x86_64, aarch64, and loongarch64. Glibc test suite passes and >>> gdb shows vdso is really used. >> Same on s390x. > > I'll extract your 'Tested-by', then, if that's okay. This is fine for me. Thanks
* Jason A. Donenfeld: > diff --git a/include/sys/random.h b/include/sys/random.h > index 6aa313d35d..35f64a0339 100644 > --- a/include/sys/random.h > +++ b/include/sys/random.h > @@ -1,8 +1,12 @@ > #ifndef _SYS_RANDOM_H > #include <stdlib/sys/random.h> > > +#include_next <sys/random.h> This #include_next seems brittle/wrong, especially after including <stdlib/sys/random.h>. I would just add a new internal header, say sysdeps/unix/sysv/linux/include/getrandom-fork.h, and include it in all the relevant places, and leave include/sys/random.h unchanged. > diff --git a/sysdeps/nptl/_Fork.c b/sysdeps/nptl/_Fork.c > index ef199ddbc3..adb7c18b29 100644 > --- a/sysdeps/nptl/_Fork.c > +++ b/sysdeps/nptl/_Fork.c > @@ -18,6 +18,7 @@ > > #include <arch-fork.h> > #include <pthreadP.h> > +#include <sys/random.h> > > pid_t > _Fork (void) > @@ -43,6 +44,7 @@ _Fork (void) > self->robust_head.list = &self->robust_head; > INTERNAL_SYSCALL_CALL (set_robust_list, &self->robust_head, > sizeof (struct robust_list_head)); > + call_function_static_weak (__getrandom_fork_subprocess); > } > return pid; > } This unusual change is required to make getrandom work after _Fork. There is a small window for deadlock if a signal arrives before __getrandom_fork_subprocess resets the lock. Nothing is supposed to send any signals at this point. This could be plugged by disabling signals around the fork system call. I think it's required to avoid problems with the robust list, too, so it probably should be a separate change prior to this one. Filed a bug: Signal handlers after fork may encounter missing robust mutex list <https://sourceware.org/bugzilla/show_bug.cgi?id=32215> > diff --git a/sysdeps/nptl/fork.h b/sysdeps/nptl/fork.h > index 7643926df9..106b2cf71d 100644 > --- a/sysdeps/nptl/fork.h > +++ b/sysdeps/nptl/fork.h > @@ -26,6 +26,7 @@ > @@ -128,9 +130,19 @@ reclaim_stacks (void) > curp->specific_used = true; > } > } > + > + call_function_static_weak (__getrandom_reset_state, curp); > } > } > > + /* Also reset stale getrandom states for user stack threads. */ > + list_for_each (runp, &GL (dl_stack_user)) > + { > + struct pthread *curp = list_entry (runp, struct pthread, list); > + if (curp != self) > + call_function_static_weak (__getrandom_reset_state, curp); > + } I've reviewed the concurrency management for dl_stack_user, and this iteration should be safe for the same reason that dl_stack_cache is safe (but I haven't reviewed the actual approach). > diff --git a/sysdeps/unix/sysv/linux/dl-vdso-setup.c b/sysdeps/unix/sysv/linux/dl-vdso-setup.c > index 3a44944dbb..b117a25922 100644 > --- a/sysdeps/unix/sysv/linux/dl-vdso-setup.c > +++ b/sysdeps/unix/sysv/linux/dl-vdso-setup.c > @@ -66,6 +66,18 @@ PROCINFO_CLASS int (*_dl_vdso_clock_getres) (clockid_t, > PROCINFO_CLASS int (*_dl_vdso_clock_getres_time64) (clockid_t, > struct __timespec64 *) RELRO; > # endif > +# ifdef HAVE_GETRANDOM_VSYSCALL > +PROCINFO_CLASS ssize_t (*_dl_vdso_getrandom) (void *buffer, size_t len, > + unsigned int flags, void *state, > + size_t state_len) RELRO; > +/* These values will be initialized at loading time by calling the > + _dl_vdso_getrandom with a special value. The 'state_size' is the opaque > + state size per-thread allocated with a mmap using 'mmap_prot' and > + 'mmap_flags' argument. */ > +PROCINFO_CLASS uint32_t _dl_vdso_getrandom_state_size RELRO; > +PROCINFO_CLASS uint32_t _dl_vdso_getrandom_mmap_prot RELRO; > +PROCINFO_CLASS uint32_t _dl_vdso_getrandom_mmap_flags RELRO; > +# endif I think this should be in libc, not ld.so. Initialization can happen as part of __libc_early_init, but only for the initial libc. This is required to address a dlmopen incompatibility. With the current code, this test will result in a buffer overflow in __getrandom_vdso_release: #include <gnu/lib-names.h> #include <support/check.h> #include <support/xdlfcn.h> #include <support/xthread.h> #include <sys/random.h> static __typeof (getrandom) *getrandom_ptr; static void * threadfunc (void *ignored) { char buffer; TEST_COMPARE (getrandom_ptr (&buffer, 1, 0), 1); return NULL; } static int do_test (void) { void *handle = xdlmopen (LM_ID_NEWLM, LIBC_SO, RTLD_NOW); getrandom_ptr = xdlsym (handle, "getrandom"); for (int i = 0; i < 1000; ++i) xpthread_join (xpthread_create (NULL, threadfunc, NULL)); return 0; } #include <support/test-driver.c> We could probably safely use an already-allocated state from the TCB from an inner libc after dlmopen, but I'm not sure if it's worth the additional complexity. > --- a/sysdeps/unix/sysv/linux/dl-vdso-setup.h > +++ b/sysdeps/unix/sysv/linux/dl-vdso-setup.h > @@ -50,6 +54,19 @@ setup_vdso_pointers (void) > #ifdef HAVE_RISCV_HWPROBE > GLRO(dl_vdso_riscv_hwprobe) = dl_vdso_vsym (HAVE_RISCV_HWPROBE); > #endif > +#ifdef HAVE_GETRANDOM_VSYSCALL > + GLRO(dl_vdso_getrandom) = dl_vdso_vsym (HAVE_GETRANDOM_VSYSCALL); > + if (GLRO(dl_vdso_getrandom) != NULL) > + { > + struct vgetrandom_opaque_params params; > + if (GLRO(dl_vdso_getrandom) (NULL, 0, 0, ¶ms, ~0UL) == 0) > + { > + GLRO(dl_vdso_getrandom_state_size) = params.size_of_opaque_state; > + GLRO(dl_vdso_getrandom_mmap_prot) = params.mmap_prot; > + GLRO(dl_vdso_getrandom_mmap_flags) = params.mmap_flags; > + } > + } > +#endif On my test system, size_of_opaque_state is 144 bytes, which is not a multiple of the cache line size. I think we need to round it up in userspace. > diff --git a/sysdeps/unix/sysv/linux/getrandom.c b/sysdeps/unix/sysv/linux/getrandom.c > index 777d1decf0..d6025199dc 100644 > --- a/sysdeps/unix/sysv/linux/getrandom.c > +++ b/sysdeps/unix/sysv/linux/getrandom.c > @@ -21,12 +21,247 @@ > #include <unistd.h> > #include <sysdep-cancel.h> > > +static inline ssize_t > +getrandom_syscall (void *buffer, size_t length, unsigned int flags, > + bool cancel) > +{ > + return cancel > + ? SYSCALL_CANCEL (getrandom, buffer, length, flags) > + : INLINE_SYSCALL_CALL (getrandom, buffer, length, flags); > +} > + > +#ifdef HAVE_GETRANDOM_VSYSCALL > +# include <getrandom_vdso.h> > +# include <ldsodefs.h> > +# include <libc-lock.h> > +# include <list.h> > +# include <setvmaname.h> > +# include <sys/mman.h> > +# include <sys/sysinfo.h> > +# include <tls-internal.h> > + > +# define ALIGN_PAGE(p) PTR_ALIGN_UP (p, GLRO (dl_pagesize)) Not sure if this macro adds value. > +# define READ_ONCE(p) (*((volatile typeof (p) *) (&(p)))) > +# define WRITE_ONCE(p, v) (*((volatile typeof (p) *) (&(p))) = (v)) Please use atomic_load_relaxed and atomic_store_relaxed. > +# define RESERVE_PTR(p) ((void *) ((uintptr_t) (p) | 1UL)) > +# define RELEASE_PTR(p) ((void *) ((uintptr_t) (p) & ~1UL)) > +# define IS_RESERVED_PTR(p) (!!((uintptr_t) (p) & 1UL)) Please turn those into type-safe inline functions, with a comment how the LSB is used and why. > +static struct > +{ > + __libc_lock_define (, lock); > + > + void **states; /* Queue of opaque states allocated with the kernel > + provided flags and used on getrandom vDSO call. */ > + size_t len; /* Number of available free states in the queue. */ > + size_t total; /* Number of states allocated from the kernel. */ > + size_t cap; /* Total numver of states that 'states' can hold before > + needed to be resized. */ > +} grnd_alloc = { > + .lock = LLL_LOCK_INITIALIZER > +}; > + > +static bool > +vgetrandom_get_state_alloc (void) > +{ > + size_t num = __get_nprocs (); /* Just a decent heuristic. */ Isn't scaling by CPU count problematic if the system as a whole doesn't do thread-based scaling, but process-based scaling? Then you could end up allocating states quadratic in the number of CPUs in the system. Maybe start out with one page of pointers and one page of states? Call __get_nprocs is likely to cause weird issues with browser sandboxes because signals (including SIGSYS) are disabled at this point, and __get_nprocs calls open internally. > + size_t block_size = ALIGN_PAGE (num * GLRO(dl_vdso_getrandom_state_size)); > + num = (GLRO (dl_pagesize) / GLRO(dl_vdso_getrandom_state_size)) * > + (block_size / GLRO (dl_pagesize)); > + void *block = __mmap (NULL, block_size, GLRO(dl_vdso_getrandom_mmap_prot), > + GLRO(dl_vdso_getrandom_mmap_flags), -1, 0); > + if (block == MAP_FAILED) > + return false; > + __set_vma_name (block, block_size, " glibc: getrandom"); > + > + if (grnd_alloc.total + num > grnd_alloc.cap) > + { > + /* Use a new mmap instead of trying to mremap. It avoids a > + potential multithread fork issue where fork is called just after > + mremap returns but before assigning to the grnd_alloc.states, > + thus making the its value invalid in the child. */ > + void *old_states = grnd_alloc.states; > + size_t old_states_size = ALIGN_PAGE (sizeof (*grnd_alloc.states) * > + grnd_alloc.total + num); > + size_t states_size; > + if (grnd_alloc.states == NULL) > + states_size = old_states_size; > + else > + states_size = ALIGN_PAGE (sizeof (*grnd_alloc.states) > + * grnd_alloc.cap); > + > + void **states = __mmap (NULL, states_size, PROT_READ | PROT_WRITE, > + MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); > + if (states == MAP_FAILED) > + { > + __munmap (block, block_size); > + return false; > + } This could have a comment somewhere why memcpy is not required here. I think it's because grnd_alloc.len is 0. It's still necessary to allocate extra pointers because the deallocation routine needs to have space to store a pointer, and it cannot allocate itself. > + /* Atomically replace the old state, so if a fork happens the child > + process will see a consistent free state buffer. The size might > + not be updated, but it does not really matter since the buffer is > + always increased. */ > + atomic_store_relaxed (&grnd_alloc.states, states); The comment is good, but I think it means you should use release MO here. > + if (old_states != NULL) > + __munmap (old_states, old_states_size); > + > + __set_vma_name (states, states_size, " glibc: getrandom states"); > + grnd_alloc.cap = states_size / sizeof (*grnd_alloc.states); > + } > + > + for (size_t i = 0; i < num; ++i) > + { > + /* States should not straddle a page. */ > + if (((uintptr_t) block & (GLRO (dl_pagesize) - 1)) + > + GLRO(dl_vdso_getrandom_state_size) > GLRO (dl_pagesize)) > + block = ALIGN_PAGE (block); > + grnd_alloc.states[i] = block; > + block += GLRO(dl_vdso_getrandom_state_size); > + } > + grnd_alloc.len = num; Probably should have release MO here as well due to potentially concurrent fork that shouldn't observe the previous pointer values in the array. > + grnd_alloc.total += num; > + > + return true; > +} > + > +/* Allocate an opaque state for vgetrandom. If the grnd_alloc does not have > + any, mmap() another page of them using the vgetrandom parameters. */ > +static void * > +vgetrandom_get_state (void) > +{ > + void *state = NULL; > + > + /* The signal blocking avoid the potential issue where _Fork() (which is > + async-signal-safe) is called with the lock taken. The function is > + called only once during thread lifetime, so the overhead should be > + minimal. */ > + internal_sigset_t set; > + internal_signal_block_all (&set); > + __libc_lock_lock (grnd_alloc.lock); > + > + if (grnd_alloc.len > 0 || vgetrandom_get_state_alloc ()) > + state = grnd_alloc.states[--grnd_alloc.len]; > + > + __libc_lock_unlock (grnd_alloc.lock); > + internal_signal_restore_set (&set); > + > + return state; > +} This async-signal-safe locking looks okay to me (assuming the fix in _Fork discussed above). > +/* Returns true when vgetrandom is used successfully. Returns false if the > + syscall fallback should be issued in the case the vDSO is not present, in > + the case of reentrancy, or if any memory allocation fails. */ > +static ssize_t > +getrandom_vdso (void *buffer, size_t length, unsigned int flags, bool cancel) > +{ > + if (GLRO (dl_vdso_getrandom_state_size) == 0) > + return getrandom_syscall (buffer, length, flags, cancel); > + > + struct pthread *self = THREAD_SELF; > + /* Since the vDSO fallback does not issue the syscall with the cancellation The vDSO *implementation*? > + bridge (__syscall_cancel_arch), use GRND_NONBLOCK so there is no > + potential unbounded blocking in the kernel. It should be a rare > + situation, only at system startup when RNG is not initialized. */ > + ssize_t ret = GLRO (dl_vdso_getrandom) (buffer, > + length, > + flags | GRND_NONBLOCK, > + state, > + GLRO(dl_vdso_getrandom_state_size)); > + if (INTERNAL_SYSCALL_ERROR_P (ret)) > + { > + /* Fallback to the syscall if the kernel would block. */ > + int err = INTERNAL_SYSCALL_ERRNO (ret); > + if (err == EAGAIN && !(flags & GRND_NONBLOCK)) > + goto out; > + > + __set_errno (err); > + ret = -1; > + } > + r = true; > + > +out: > + WRITE_ONCE (self->getrandom_buf, state); > + return r ? ret : getrandom_syscall (buffer, length, flags, cancel); > +} I think we should conditionally invoke __pthread_testcancel if getrandom_syscall is not called. If performance is a concern, we can provide an internal inlineable implementation of its fast path. It's just a memory load, compare and well-predictable conditional branch, so it really shouldn't matter. > +#endif > + > +/* Re-add the state state from CURP on the free list. */ > +void > +__getrandom_reset_state (struct pthread *curp) > +{ > +#ifdef HAVE_GETRANDOM_VSYSCALL > + if (grnd_alloc.states == NULL || curp->getrandom_buf == NULL) > + return; > + grnd_alloc.states[grnd_alloc.len++] = RELEASE_PTR (curp->getrandom_buf); > + curp->getrandom_buf = NULL; > +#endif > +} This could have an assert for grnd_alloc.len, to make sure it's within bounds. I think it's missing locking. It's tempting to push that into the caller, like for __getrandom_vdso_release. > +/* Called when a thread terminates, and adds its random buffer back into the > + allocator pool for use in a future thread. */ > +void > +__getrandom_vdso_release (struct pthread *curp) > +{ > +#ifdef HAVE_GETRANDOM_VSYSCALL > + if (curp->getrandom_buf == NULL) > + return; > + Should have a comment here that signals have already been disabled at this point during thread exit. > + __libc_lock_lock (grnd_alloc.lock); > + grnd_alloc.states[grnd_alloc.len++] = curp->getrandom_buf; > + __libc_lock_unlock (grnd_alloc.lock); > +#endif > +} Maybe put in an assert here, considering the buffer overflow described above? So that we see that there's another buffer management issue? > diff --git a/sysdeps/unix/sysv/linux/include/sys/random.h b/sysdeps/unix/sysv/linux/include/sys/random.h > new file mode 100644 > index 0000000000..5a48de2d29 > --- /dev/null > +++ b/sysdeps/unix/sysv/linux/include/sys/random.h As discussed above, this should go into a separate header that does not mess with <sys/random.h>. Performance is very good, not too far off from the random function actually. Thanks, Florian
On 26/09/24 07:28, Florian Weimer wrote: > * Jason A. Donenfeld: > >> diff --git a/include/sys/random.h b/include/sys/random.h >> index 6aa313d35d..35f64a0339 100644 >> --- a/include/sys/random.h >> +++ b/include/sys/random.h >> @@ -1,8 +1,12 @@ >> #ifndef _SYS_RANDOM_H >> #include <stdlib/sys/random.h> >> >> +#include_next <sys/random.h> > > This #include_next seems brittle/wrong, especially after including > <stdlib/sys/random.h>. I would just add a new internal header, say > sysdeps/unix/sysv/linux/include/getrandom-fork.h, and include it in all > the relevant places, and leave include/sys/random.h unchanged. Ack. > >> diff --git a/sysdeps/nptl/_Fork.c b/sysdeps/nptl/_Fork.c >> index ef199ddbc3..adb7c18b29 100644 >> --- a/sysdeps/nptl/_Fork.c >> +++ b/sysdeps/nptl/_Fork.c >> @@ -18,6 +18,7 @@ >> >> #include <arch-fork.h> >> #include <pthreadP.h> >> +#include <sys/random.h> >> >> pid_t >> _Fork (void) >> @@ -43,6 +44,7 @@ _Fork (void) >> self->robust_head.list = &self->robust_head; >> INTERNAL_SYSCALL_CALL (set_robust_list, &self->robust_head, >> sizeof (struct robust_list_head)); >> + call_function_static_weak (__getrandom_fork_subprocess); >> } >> return pid; >> } > > This unusual change is required to make getrandom work after _Fork. > > There is a small window for deadlock if a signal arrives before > __getrandom_fork_subprocess resets the lock. Nothing is supposed to > send any signals at this point. This could be plugged by disabling > signals around the fork system call. I think it's required to avoid > problems with the robust list, too, so it probably should be a separate > change prior to this one. Filed a bug: > > Signal handlers after fork may encounter missing robust mutex list > <https://sourceware.org/bugzilla/show_bug.cgi?id=32215> Ack, I will review this patch. > >> diff --git a/sysdeps/nptl/fork.h b/sysdeps/nptl/fork.h >> index 7643926df9..106b2cf71d 100644 >> --- a/sysdeps/nptl/fork.h >> +++ b/sysdeps/nptl/fork.h >> @@ -26,6 +26,7 @@ > >> @@ -128,9 +130,19 @@ reclaim_stacks (void) >> curp->specific_used = true; >> } >> } >> + >> + call_function_static_weak (__getrandom_reset_state, curp); >> } >> } >> >> + /* Also reset stale getrandom states for user stack threads. */ >> + list_for_each (runp, &GL (dl_stack_user)) >> + { >> + struct pthread *curp = list_entry (runp, struct pthread, list); >> + if (curp != self) >> + call_function_static_weak (__getrandom_reset_state, curp); >> + } > > I've reviewed the concurrency management for dl_stack_user, and this > iteration should be safe for the same reason that dl_stack_cache is safe > (but I haven't reviewed the actual approach). Ack. > >> diff --git a/sysdeps/unix/sysv/linux/dl-vdso-setup.c b/sysdeps/unix/sysv/linux/dl-vdso-setup.c >> index 3a44944dbb..b117a25922 100644 >> --- a/sysdeps/unix/sysv/linux/dl-vdso-setup.c >> +++ b/sysdeps/unix/sysv/linux/dl-vdso-setup.c >> @@ -66,6 +66,18 @@ PROCINFO_CLASS int (*_dl_vdso_clock_getres) (clockid_t, >> PROCINFO_CLASS int (*_dl_vdso_clock_getres_time64) (clockid_t, >> struct __timespec64 *) RELRO; >> # endif >> +# ifdef HAVE_GETRANDOM_VSYSCALL >> +PROCINFO_CLASS ssize_t (*_dl_vdso_getrandom) (void *buffer, size_t len, >> + unsigned int flags, void *state, >> + size_t state_len) RELRO; >> +/* These values will be initialized at loading time by calling the >> + _dl_vdso_getrandom with a special value. The 'state_size' is the opaque >> + state size per-thread allocated with a mmap using 'mmap_prot' and >> + 'mmap_flags' argument. */ >> +PROCINFO_CLASS uint32_t _dl_vdso_getrandom_state_size RELRO; >> +PROCINFO_CLASS uint32_t _dl_vdso_getrandom_mmap_prot RELRO; >> +PROCINFO_CLASS uint32_t _dl_vdso_getrandom_mmap_flags RELRO; >> +# endif > > I think this should be in libc, not ld.so. Initialization can happen as > part of __libc_early_init, but only for the initial libc. This is > required to address a dlmopen incompatibility. With the current code, > this test will result in a buffer overflow in __getrandom_vdso_release: > > #include <gnu/lib-names.h> > #include <support/check.h> > #include <support/xdlfcn.h> > #include <support/xthread.h> > #include <sys/random.h> > > static __typeof (getrandom) *getrandom_ptr; > > static void * > threadfunc (void *ignored) > { > char buffer; > TEST_COMPARE (getrandom_ptr (&buffer, 1, 0), 1); > return NULL; > } > > static int > do_test (void) > { > void *handle = xdlmopen (LM_ID_NEWLM, LIBC_SO, RTLD_NOW); > getrandom_ptr = xdlsym (handle, "getrandom"); > for (int i = 0; i < 1000; ++i) > xpthread_join (xpthread_create (NULL, threadfunc, NULL)); > return 0; > } > > #include <support/test-driver.c> > > We could probably safely use an already-allocated state from the TCB > from an inner libc after dlmopen, but I'm not sure if it's worth the > additional complexity. Right, I have added this test on next version. The secondary libc.so is indeed something that I have not factored in this implementation but I also think there is no need to move the vDSO setup to libc.so. Besides not following the other vDSO setup (and thus splitting it in multiple parts, adding some complexity), there still the issue where _dl_call_libc_early_init is called *after* _dl_relocate_object, which is responsible to setup RELRO. It means that we will need to either refactor on how __libc_early_init is called, or make the internal vDSO pointer not RELRO (and maybe XOR mangling the pointer against). I don't really like neither option. So I added a __getrandom_early_init, which then sets a libc.so internal variable that is checked to issue the vDSO instead of GLRO (dl_vdso_getrandom_state_size). > >> --- a/sysdeps/unix/sysv/linux/dl-vdso-setup.h >> +++ b/sysdeps/unix/sysv/linux/dl-vdso-setup.h > >> @@ -50,6 +54,19 @@ setup_vdso_pointers (void) >> #ifdef HAVE_RISCV_HWPROBE >> GLRO(dl_vdso_riscv_hwprobe) = dl_vdso_vsym (HAVE_RISCV_HWPROBE); >> #endif >> +#ifdef HAVE_GETRANDOM_VSYSCALL >> + GLRO(dl_vdso_getrandom) = dl_vdso_vsym (HAVE_GETRANDOM_VSYSCALL); >> + if (GLRO(dl_vdso_getrandom) != NULL) >> + { >> + struct vgetrandom_opaque_params params; >> + if (GLRO(dl_vdso_getrandom) (NULL, 0, 0, ¶ms, ~0UL) == 0) >> + { >> + GLRO(dl_vdso_getrandom_state_size) = params.size_of_opaque_state; >> + GLRO(dl_vdso_getrandom_mmap_prot) = params.mmap_prot; >> + GLRO(dl_vdso_getrandom_mmap_flags) = params.mmap_flags; >> + } >> + } >> +#endif > > On my test system, size_of_opaque_state is 144 bytes, which is not a > multiple of the cache line size. I think we need to round it up in > userspace. Do you mean to avoid false sharing in multithread environment? I will check if this really make any difference. > >> diff --git a/sysdeps/unix/sysv/linux/getrandom.c b/sysdeps/unix/sysv/linux/getrandom.c >> index 777d1decf0..d6025199dc 100644 >> --- a/sysdeps/unix/sysv/linux/getrandom.c >> +++ b/sysdeps/unix/sysv/linux/getrandom.c >> @@ -21,12 +21,247 @@ >> #include <unistd.h> >> #include <sysdep-cancel.h> >> >> +static inline ssize_t >> +getrandom_syscall (void *buffer, size_t length, unsigned int flags, >> + bool cancel) >> +{ >> + return cancel >> + ? SYSCALL_CANCEL (getrandom, buffer, length, flags) >> + : INLINE_SYSCALL_CALL (getrandom, buffer, length, flags); >> +} >> + >> +#ifdef HAVE_GETRANDOM_VSYSCALL >> +# include <getrandom_vdso.h> >> +# include <ldsodefs.h> >> +# include <libc-lock.h> >> +# include <list.h> >> +# include <setvmaname.h> >> +# include <sys/mman.h> >> +# include <sys/sysinfo.h> >> +# include <tls-internal.h> >> + >> +# define ALIGN_PAGE(p) PTR_ALIGN_UP (p, GLRO (dl_pagesize)) > > Not sure if this macro adds value. It does not, I will remove it. > >> +# define READ_ONCE(p) (*((volatile typeof (p) *) (&(p)))) >> +# define WRITE_ONCE(p, v) (*((volatile typeof (p) *) (&(p))) = (v)) > > Please use atomic_load_relaxed and atomic_store_relaxed. Ack. > >> +# define RESERVE_PTR(p) ((void *) ((uintptr_t) (p) | 1UL)) >> +# define RELEASE_PTR(p) ((void *) ((uintptr_t) (p) & ~1UL)) >> +# define IS_RESERVED_PTR(p) (!!((uintptr_t) (p) & 1UL)) > > Please turn those into type-safe inline functions, with a comment how > the LSB is used and why. Ack. > >> +static struct >> +{ >> + __libc_lock_define (, lock); >> + >> + void **states; /* Queue of opaque states allocated with the kernel >> + provided flags and used on getrandom vDSO call. */ >> + size_t len; /* Number of available free states in the queue. */ >> + size_t total; /* Number of states allocated from the kernel. */ >> + size_t cap; /* Total numver of states that 'states' can hold before >> + needed to be resized. */ >> +} grnd_alloc = { >> + .lock = LLL_LOCK_INITIALIZER >> +}; >> + >> +static bool >> +vgetrandom_get_state_alloc (void) >> +{ >> + size_t num = __get_nprocs (); /* Just a decent heuristic. */ > > Isn't scaling by CPU count problematic if the system as a whole doesn't > do thread-based scaling, but process-based scaling? Then you could end > up allocating states quadratic in the number of CPUs in the system. > Maybe start out with one page of pointers and one page of states? > > Call __get_nprocs is likely to cause weird issues with browser sandboxes > because signals (including SIGSYS) are disabled at this point, and > __get_nprocs calls open internally. Indeed, starting with one page and avoiding __get_nproc seems more reasonable. > >> + size_t block_size = ALIGN_PAGE (num * GLRO(dl_vdso_getrandom_state_size)); >> + num = (GLRO (dl_pagesize) / GLRO(dl_vdso_getrandom_state_size)) * >> + (block_size / GLRO (dl_pagesize)); >> + void *block = __mmap (NULL, block_size, GLRO(dl_vdso_getrandom_mmap_prot), >> + GLRO(dl_vdso_getrandom_mmap_flags), -1, 0); >> + if (block == MAP_FAILED) >> + return false; >> + __set_vma_name (block, block_size, " glibc: getrandom"); >> + >> + if (grnd_alloc.total + num > grnd_alloc.cap) >> + { >> + /* Use a new mmap instead of trying to mremap. It avoids a >> + potential multithread fork issue where fork is called just after >> + mremap returns but before assigning to the grnd_alloc.states, >> + thus making the its value invalid in the child. */ >> + void *old_states = grnd_alloc.states; >> + size_t old_states_size = ALIGN_PAGE (sizeof (*grnd_alloc.states) * >> + grnd_alloc.total + num); >> + size_t states_size; >> + if (grnd_alloc.states == NULL) >> + states_size = old_states_size; >> + else >> + states_size = ALIGN_PAGE (sizeof (*grnd_alloc.states) >> + * grnd_alloc.cap); >> + >> + void **states = __mmap (NULL, states_size, PROT_READ | PROT_WRITE, >> + MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); >> + if (states == MAP_FAILED) >> + { >> + __munmap (block, block_size); >> + return false; >> + } > > This could have a comment somewhere why memcpy is not required here. I > think it's because grnd_alloc.len is 0. It's still necessary to > allocate extra pointers because the deallocation routine needs to have > space to store a pointer, and it cannot allocate itself. The memcpy is not required because all the allocated opaque states are assigned to running threads (meaning that if we iterate over them we can reconstruct the state list). And I don't think there is need to allocate any extra space: the initial mmap already takes in account for all allocate opaque states. > >> + /* Atomically replace the old state, so if a fork happens the child >> + process will see a consistent free state buffer. The size might >> + not be updated, but it does not really matter since the buffer is >> + always increased. */ >> + atomic_store_relaxed (&grnd_alloc.states, states); > > The comment is good, but I think it means you should use release MO > here. Ack, I was not really sure about it but it make sense (the NPTL stack cache is not clear about the memory semantic it uses for fork synchronization). > >> + if (old_states != NULL) >> + __munmap (old_states, old_states_size); >> + >> + __set_vma_name (states, states_size, " glibc: getrandom states"); >> + grnd_alloc.cap = states_size / sizeof (*grnd_alloc.states); >> + } >> + >> + for (size_t i = 0; i < num; ++i) >> + { >> + /* States should not straddle a page. */ >> + if (((uintptr_t) block & (GLRO (dl_pagesize) - 1)) + >> + GLRO(dl_vdso_getrandom_state_size) > GLRO (dl_pagesize)) >> + block = ALIGN_PAGE (block); >> + grnd_alloc.states[i] = block; >> + block += GLRO(dl_vdso_getrandom_state_size); >> + } >> + grnd_alloc.len = num; > > Probably should have release MO here as well due to potentially > concurrent fork that shouldn't observe the previous pointer values in > the array. Ack, makes sense. > >> + grnd_alloc.total += num; >> + >> + return true; >> +} >> + >> +/* Allocate an opaque state for vgetrandom. If the grnd_alloc does not have >> + any, mmap() another page of them using the vgetrandom parameters. */ >> +static void * >> +vgetrandom_get_state (void) >> +{ >> + void *state = NULL; >> + >> + /* The signal blocking avoid the potential issue where _Fork() (which is >> + async-signal-safe) is called with the lock taken. The function is >> + called only once during thread lifetime, so the overhead should be >> + minimal. */ >> + internal_sigset_t set; >> + internal_signal_block_all (&set); >> + __libc_lock_lock (grnd_alloc.lock); >> + >> + if (grnd_alloc.len > 0 || vgetrandom_get_state_alloc ()) >> + state = grnd_alloc.states[--grnd_alloc.len]; >> + >> + __libc_lock_unlock (grnd_alloc.lock); >> + internal_signal_restore_set (&set); >> + >> + return state; >> +} > > This async-signal-safe locking looks okay to me (assuming the fix in > _Fork discussed above). > >> +/* Returns true when vgetrandom is used successfully. Returns false if the >> + syscall fallback should be issued in the case the vDSO is not present, in >> + the case of reentrancy, or if any memory allocation fails. */ >> +static ssize_t >> +getrandom_vdso (void *buffer, size_t length, unsigned int flags, bool cancel) >> +{ >> + if (GLRO (dl_vdso_getrandom_state_size) == 0) >> + return getrandom_syscall (buffer, length, flags, cancel); >> + >> + struct pthread *self = THREAD_SELF; > >> + /* Since the vDSO fallback does not issue the syscall with the cancellation > > The vDSO *implementation*? Ack. > >> + bridge (__syscall_cancel_arch), use GRND_NONBLOCK so there is no >> + potential unbounded blocking in the kernel. It should be a rare >> + situation, only at system startup when RNG is not initialized. */ >> + ssize_t ret = GLRO (dl_vdso_getrandom) (buffer, >> + length, >> + flags | GRND_NONBLOCK, >> + state, >> + GLRO(dl_vdso_getrandom_state_size)); >> + if (INTERNAL_SYSCALL_ERROR_P (ret)) >> + { >> + /* Fallback to the syscall if the kernel would block. */ >> + int err = INTERNAL_SYSCALL_ERRNO (ret); >> + if (err == EAGAIN && !(flags & GRND_NONBLOCK)) >> + goto out; >> + >> + __set_errno (err); >> + ret = -1; >> + } >> + r = true; >> + >> +out: >> + WRITE_ONCE (self->getrandom_buf, state); >> + return r ? ret : getrandom_syscall (buffer, length, flags, cancel); >> +} > > I think we should conditionally invoke __pthread_testcancel if > getrandom_syscall is not called. If performance is a concern, we can > provide an internal inlineable implementation of its fast path. It's > just a memory load, compare and well-predictable conditional branch, so > it really shouldn't matter. I think we already have addressed why __pthread_testcancel is not required here, from the commit message: The cancellation is handled by always using GRND_NONBLOCK flags while calling the vDSO, and falling back to the cancellable syscall if the kernel returns EAGAIN (would block). Since getrandom is not defined by POSIX and cancellation is supported as an extension, the cancellation is handled as 'may occur' instead of 'shall occur' [1], meaning that if vDSO does not block (the expected behavior) getrandom will not act as a cancellation entrypoint. It avoids a pthread_testcancel call on the fast path (different than 'shall occur' functions, like sem_wait()). I still think we should not make getrandom a 'shall occur' cancellation entrypoint, on the grounds that recent kernel blocking is really rare and it simplifies the vDSO implementation. > >> +#endif >> + >> +/* Re-add the state state from CURP on the free list. */ >> +void >> +__getrandom_reset_state (struct pthread *curp) >> +{ >> +#ifdef HAVE_GETRANDOM_VSYSCALL >> + if (grnd_alloc.states == NULL || curp->getrandom_buf == NULL) >> + return; >> + grnd_alloc.states[grnd_alloc.len++] = RELEASE_PTR (curp->getrandom_buf); >> + curp->getrandom_buf = NULL; >> +#endif >> +} > > This could have an assert for grnd_alloc.len, to make sure it's within > bounds. > Ack. > I think it's missing locking. It's tempting to push that into the > caller, like for __getrandom_vdso_release. > I don't think there is need for locking here, __getrandom_reset_state is only called *after* fork returns and thus in single-thread mode. I will add a comment. >> +/* Called when a thread terminates, and adds its random buffer back into the >> + allocator pool for use in a future thread. */ >> +void >> +__getrandom_vdso_release (struct pthread *curp) >> +{ >> +#ifdef HAVE_GETRANDOM_VSYSCALL >> + if (curp->getrandom_buf == NULL) >> + return; >> + > > Should have a comment here that signals have already been disabled at > this point during thread exit. Ack. > >> + __libc_lock_lock (grnd_alloc.lock); >> + grnd_alloc.states[grnd_alloc.len++] = curp->getrandom_buf; >> + __libc_lock_unlock (grnd_alloc.lock); >> +#endif >> +} > > Maybe put in an assert here, considering the buffer overflow described > above? So that we see that there's another buffer management issue? Not sure if this worths the checking, it means that something is really wrong on the initial state allocation. The opaque state buffer is essentially a stack. > >> diff --git a/sysdeps/unix/sysv/linux/include/sys/random.h b/sysdeps/unix/sysv/linux/include/sys/random.h >> new file mode 100644 >> index 0000000000..5a48de2d29 >> --- /dev/null >> +++ b/sysdeps/unix/sysv/linux/include/sys/random.h > > As discussed above, this should go into a separate header that does not > mess with <sys/random.h>. Ack. > > Performance is very good, not too far off from the random function > actually. > > Thanks, > Florian >
* Adhemerval Zanella Netto: > Right, I have added this test on next version. The secondary libc.so is > indeed something that I have not factored in this implementation but I > also think there is no need to move the vDSO setup to libc.so. > > Besides not following the other vDSO setup (and thus splitting it in multiple > parts, adding some complexity), there still the issue where > _dl_call_libc_early_init is called *after* _dl_relocate_object, which is > responsible to setup RELRO. It means that we will need to either refactor > on how __libc_early_init is called, or make the internal vDSO pointer not > RELRO (and maybe XOR mangling the pointer against). I don't really like > neither option. I meant this part: PROCINFO_CLASS uint32_t _dl_vdso_getrandom_state_size RELRO; PROCINFO_CLASS uint32_t _dl_vdso_getrandom_mmap_prot RELRO; PROCINFO_CLASS uint32_t _dl_vdso_getrandom_mmap_flags RELRO; That really doesn't have to be in GLRO, I think. The function pointer can remain there. Moving the variables will save one indirection, and it avoids growing the GLIBC_PRIVATE ABI footprint. >> On my test system, size_of_opaque_state is 144 bytes, which is not a >> multiple of the cache line size. I think we need to round it up in >> userspace. > > Do you mean to avoid false sharing in multithread environment? I will check > if this really make any difference. Yes, it's just surprising that the cache line padding is missing. It could be really, really bad on a NUMA machine. >> I think we should conditionally invoke __pthread_testcancel if >> getrandom_syscall is not called. If performance is a concern, we can >> provide an internal inlineable implementation of its fast path. It's >> just a memory load, compare and well-predictable conditional branch, so >> it really shouldn't matter. > > I think we already have addressed why __pthread_testcancel is not > required here, from the commit message: > > The cancellation is handled by always using GRND_NONBLOCK flags while > calling the vDSO, and falling back to the cancellable syscall if the > kernel returns EAGAIN (would block). Since getrandom is not defined by > POSIX and cancellation is supported as an extension, the cancellation is > handled as 'may occur' instead of 'shall occur' [1], meaning that if > vDSO does not block (the expected behavior) getrandom will not act as a > cancellation entrypoint. It avoids a pthread_testcancel call on the fast > path (different than 'shall occur' functions, like sem_wait()). > > I still think we should not make getrandom a 'shall occur' cancellation > entrypoint, on the grounds that recent kernel blocking is really rare > and it simplifies the vDSO implementation. Isn't it one today? I want to avoid an unnecessary API change. >> I think it's missing locking. It's tempting to push that into the >> caller, like for __getrandom_vdso_release. >> > > I don't think there is need for locking here, __getrandom_reset_state is only > called *after* fork returns and thus in single-thread mode. I will add a > comment. Right, thanks. Thanks, Florian
On 26/09/24 14:22, Adhemerval Zanella Netto wrote: > >> >>> + /* Atomically replace the old state, so if a fork happens the child >>> + process will see a consistent free state buffer. The size might >>> + not be updated, but it does not really matter since the buffer is >>> + always increased. */ >>> + atomic_store_relaxed (&grnd_alloc.states, states); >> >> The comment is good, but I think it means you should use release MO >> here. > > Ack, I was not really sure about it but it make sense (the NPTL stack > cache is not clear about the memory semantic it uses for fork synchronization). So Jason pointed out that release does not make much sense without pairing all other 'states' access to acquire memory semantic. The 'state' is really change only with the lock taken (the thread release only change an index). So I am not sure about this change. > >> >>> + if (old_states != NULL) >>> + __munmap (old_states, old_states_size); >>> + >>> + __set_vma_name (states, states_size, " glibc: getrandom states"); >>> + grnd_alloc.cap = states_size / sizeof (*grnd_alloc.states); >>> + } >>> + >>> + for (size_t i = 0; i < num; ++i) >>> + { >>> + /* States should not straddle a page. */ >>> + if (((uintptr_t) block & (GLRO (dl_pagesize) - 1)) + >>> + GLRO(dl_vdso_getrandom_state_size) > GLRO (dl_pagesize)) >>> + block = ALIGN_PAGE (block); >>> + grnd_alloc.states[i] = block; >>> + block += GLRO(dl_vdso_getrandom_state_size); >>> + } >>> + grnd_alloc.len = num; >> >> Probably should have release MO here as well due to potentially >> concurrent fork that shouldn't observe the previous pointer values in >> the array. > > Ack, makes sense. Same as before.
* Adhemerval Zanella Netto: > On 26/09/24 14:22, Adhemerval Zanella Netto wrote: > >> >>> >>>> + /* Atomically replace the old state, so if a fork happens the child >>>> + process will see a consistent free state buffer. The size might >>>> + not be updated, but it does not really matter since the buffer is >>>> + always increased. */ >>>> + atomic_store_relaxed (&grnd_alloc.states, states); >>> >>> The comment is good, but I think it means you should use release MO >>> here. >> >> Ack, I was not really sure about it but it make sense (the NPTL stack >> cache is not clear about the memory semantic it uses for fork synchronization). > > So Jason pointed out that release does not make much sense without pairing > all other 'states' access to acquire memory semantic. The 'state' is really > change only with the lock taken (the thread release only change an index). > So I am not sure about this change. Doesn't fork do the equivalent of an acquire load? We need some sort of barrier so that the allocated size is at least as large as the specified size. Without the barrier, even the compiler can reorder the pointer update and the capacity field update. Thanks, Florian
On 26/09/24 14:31, Florian Weimer wrote: > * Adhemerval Zanella Netto: > >> Right, I have added this test on next version. The secondary libc.so is >> indeed something that I have not factored in this implementation but I >> also think there is no need to move the vDSO setup to libc.so. >> >> Besides not following the other vDSO setup (and thus splitting it in multiple >> parts, adding some complexity), there still the issue where >> _dl_call_libc_early_init is called *after* _dl_relocate_object, which is >> responsible to setup RELRO. It means that we will need to either refactor >> on how __libc_early_init is called, or make the internal vDSO pointer not >> RELRO (and maybe XOR mangling the pointer against). I don't really like >> neither option. > > I meant this part: > > PROCINFO_CLASS uint32_t _dl_vdso_getrandom_state_size RELRO; > PROCINFO_CLASS uint32_t _dl_vdso_getrandom_mmap_prot RELRO; > PROCINFO_CLASS uint32_t _dl_vdso_getrandom_mmap_flags RELRO; > > That really doesn't have to be in GLRO, I think. The function pointer > can remain there. Moving the variables will save one indirection, and > it avoids growing the GLIBC_PRIVATE ABI footprint. Ack. >>> I think we should conditionally invoke __pthread_testcancel if >>> getrandom_syscall is not called. If performance is a concern, we can >>> provide an internal inlineable implementation of its fast path. It's >>> just a memory load, compare and well-predictable conditional branch, so >>> it really shouldn't matter. >> >> I think we already have addressed why __pthread_testcancel is not >> required here, from the commit message: >> >> The cancellation is handled by always using GRND_NONBLOCK flags while >> calling the vDSO, and falling back to the cancellable syscall if the >> kernel returns EAGAIN (would block). Since getrandom is not defined by >> POSIX and cancellation is supported as an extension, the cancellation is >> handled as 'may occur' instead of 'shall occur' [1], meaning that if >> vDSO does not block (the expected behavior) getrandom will not act as a >> cancellation entrypoint. It avoids a pthread_testcancel call on the fast >> path (different than 'shall occur' functions, like sem_wait()). >> >> I still think we should not make getrandom a 'shall occur' cancellation >> entrypoint, on the grounds that recent kernel blocking is really rare >> and it simplifies the vDSO implementation. > > Isn't it one today? I want to avoid an unnecessary API change. I think the question is whether we should add this an extension, not the current behavior. Besides the performance implication, I don't think it makes much sense with current Linux implementation where blocking only really happens in very rare cases. > >>> I think it's missing locking. It's tempting to push that into the >>> caller, like for __getrandom_vdso_release. >>> >> >> I don't think there is need for locking here, __getrandom_reset_state is only >> called *after* fork returns and thus in single-thread mode. I will add a >> comment. > > Right, thanks. > > Thanks, > Florian >
On 27/09/24 11:57, Florian Weimer wrote: > * Adhemerval Zanella Netto: > >> On 26/09/24 14:22, Adhemerval Zanella Netto wrote: >> >>> >>>> >>>>> + /* Atomically replace the old state, so if a fork happens the child >>>>> + process will see a consistent free state buffer. The size might >>>>> + not be updated, but it does not really matter since the buffer is >>>>> + always increased. */ >>>>> + atomic_store_relaxed (&grnd_alloc.states, states); >>>> >>>> The comment is good, but I think it means you should use release MO >>>> here. >>> >>> Ack, I was not really sure about it but it make sense (the NPTL stack >>> cache is not clear about the memory semantic it uses for fork synchronization). >> >> So Jason pointed out that release does not make much sense without pairing >> all other 'states' access to acquire memory semantic. The 'state' is really >> change only with the lock taken (the thread release only change an index). >> So I am not sure about this change. > > Doesn't fork do the equivalent of an acquire load? > > We need some sort of barrier so that the allocated size is at least as > large as the specified size. Without the barrier, even the compiler can > reorder the pointer update and the capacity field update. I am not sure which memory ordering fork() guarantee, however nptl-stack.c uses atomic_write_barrier() for every write on the global list: static inline void list_add (list_t *newp, list_t *head) { newp->next = head->next; newp->prev = head; head->next->prev = newp; atomic_write_barrier (); head->next = newp; } void __nptl_stack_list_add (list_t *elem, list_t *list) { GL (dl_in_flight_stack) = (uintptr_t) elem | 1; atomic_write_barrier (); list_add (elem, list); atomic_write_barrier (); GL (dl_in_flight_stack) = 0; } Maybe we should do the same here.
* Adhemerval Zanella Netto: > I think the question is whether we should add this an extension, not the > current behavior. Besides the performance implication, I don't think it > makes much sense with current Linux implementation where blocking only > really happens in very rare cases. Doesn't the current implementation perform the implicit pthread_testcancel check regardless of blocking? Thanks, Florian
* Adhemerval Zanella Netto: > I am not sure which memory ordering fork() guarantee, however nptl-stack.c > uses atomic_write_barrier() for every write on the global list: That works, but it's from the legacy atomics. I think the new name is atomic_thread_fence_seq_cst (mirroring C11 atomics). Thanks, Florian
On 27/09/24 13:32, Florian Weimer wrote: > * Adhemerval Zanella Netto: > >> I think the question is whether we should add this an extension, not the >> current behavior. Besides the performance implication, I don't think it >> makes much sense with current Linux implementation where blocking only >> really happens in very rare cases. > > Doesn't the current implementation perform the implicit > pthread_testcancel check regardless of blocking? Yes, as some other 'may occur' entrypoints. With this rationale, all current 'may occur' will need to be 'shall occur' in future changes. Not sure if we should bound by this constraint.
On Thu, Sep 26, 2024 at 12:28:15PM +0200, Florian Weimer wrote: > On my test system, size_of_opaque_state is 144 bytes, which is not a > multiple of the cache line size. I think we need to round it up in > userspace. This nearly doubles the performance in the Go implementation: https://go-review.googlesource.com/c/go/+/616535 We've done this now for v8 of the glibc patch. Seems like a good improvement. Jason
diff --git a/include/sys/random.h b/include/sys/random.h index 6aa313d35d..35f64a0339 100644 --- a/include/sys/random.h +++ b/include/sys/random.h @@ -1,8 +1,12 @@ #ifndef _SYS_RANDOM_H #include <stdlib/sys/random.h> +#include_next <sys/random.h> + # ifndef _ISOMAC +# include <stdbool.h> + extern ssize_t __getrandom (void *__buffer, size_t __length, unsigned int __flags) __wur; libc_hidden_proto (__getrandom) diff --git a/malloc/malloc.c b/malloc/malloc.c index bcb6e5b83c..9e577ab900 100644 --- a/malloc/malloc.c +++ b/malloc/malloc.c @@ -3140,8 +3140,8 @@ static void tcache_key_initialize (void) { /* We need to use the _nostatus version here, see BZ 29624. */ - if (__getrandom_nocancel_nostatus (&tcache_key, sizeof(tcache_key), - GRND_NONBLOCK) + if (__getrandom_nocancel_nostatus_direct (&tcache_key, sizeof(tcache_key), + GRND_NONBLOCK) != sizeof (tcache_key)) { tcache_key = random_bits (); diff --git a/nptl/allocatestack.c b/nptl/allocatestack.c index 2cb562f8ea..d9adb5856c 100644 --- a/nptl/allocatestack.c +++ b/nptl/allocatestack.c @@ -132,6 +132,8 @@ get_cached_stack (size_t *sizep, void **memp) __libc_lock_init (result->exit_lock); memset (&result->tls_state, 0, sizeof result->tls_state); + result->getrandom_buf = NULL; + /* Clear the DTV. */ dtv_t *dtv = GET_DTV (TLS_TPADJ (result)); for (size_t cnt = 0; cnt < dtv[-1].counter; ++cnt) diff --git a/nptl/descr.h b/nptl/descr.h index 65d3baaee3..989995262b 100644 --- a/nptl/descr.h +++ b/nptl/descr.h @@ -404,6 +404,9 @@ struct pthread /* Used on strsignal. */ struct tls_internal_t tls_state; + /* getrandom vDSO per-thread opaque state. */ + void *getrandom_buf; + /* rseq area registered with the kernel. Use a custom definition here to isolate from kernel struct rseq changes. The implementation of sched_getcpu needs acccess to the cpu_id field; diff --git a/nptl/pthread_create.c b/nptl/pthread_create.c index 1d3665d5ed..d1f5568b3b 100644 --- a/nptl/pthread_create.c +++ b/nptl/pthread_create.c @@ -38,6 +38,7 @@ #include <version.h> #include <clone_internal.h> #include <futex-internal.h> +#include <sys/random.h> #include <shlib-compat.h> @@ -549,6 +550,10 @@ start_thread (void *arg) } #endif + /* Release the vDSO getrandom per-thread buffer with all signal blocked, + to avoid creating a new free-state block during thread release. */ + __getrandom_vdso_release (pd); + if (!pd->user_stack) advise_stack_range (pd->stackblock, pd->stackblock_size, (uintptr_t) pd, pd->guardsize); diff --git a/sysdeps/generic/not-cancel.h b/sysdeps/generic/not-cancel.h index 2dd1064600..8e3f49cc07 100644 --- a/sysdeps/generic/not-cancel.h +++ b/sysdeps/generic/not-cancel.h @@ -51,7 +51,9 @@ __fcntl64 (fd, cmd, __VA_ARGS__) #define __getrandom_nocancel(buf, size, flags) \ __getrandom (buf, size, flags) -#define __getrandom_nocancel_nostatus(buf, size, flags) \ +#define __getrandom_nocancel_direct(buf, size, flags) \ + __getrandom (buf, size, flags) +#define __getrandom_nocancel_nostatus_direct(buf, size, flags) \ __getrandom (buf, size, flags) #define __poll_infinity_nocancel(fds, nfds) \ __poll (fds, nfds, -1) diff --git a/sysdeps/mach/hurd/not-cancel.h b/sysdeps/mach/hurd/not-cancel.h index 69fb3c00ef..ec5f5aa895 100644 --- a/sysdeps/mach/hurd/not-cancel.h +++ b/sysdeps/mach/hurd/not-cancel.h @@ -79,7 +79,7 @@ __typeof (__fcntl) __fcntl_nocancel; /* Non cancellable getrandom syscall that does not also set errno in case of failure. */ static inline ssize_t -__getrandom_nocancel_nostatus (void *buf, size_t buflen, unsigned int flags) +__getrandom_nocancel_nostatus_direct (void *buf, size_t buflen, unsigned int flags) { int save_errno = errno; ssize_t r = __getrandom (buf, buflen, flags); @@ -90,6 +90,8 @@ __getrandom_nocancel_nostatus (void *buf, size_t buflen, unsigned int flags) #define __getrandom_nocancel(buf, size, flags) \ __getrandom (buf, size, flags) +#define __getrandom_nocancel_direct(buf, size, flags) \ + __getrandom (buf, size, flags) #define __poll_infinity_nocancel(fds, nfds) \ __poll (fds, nfds, -1) diff --git a/sysdeps/nptl/_Fork.c b/sysdeps/nptl/_Fork.c index ef199ddbc3..adb7c18b29 100644 --- a/sysdeps/nptl/_Fork.c +++ b/sysdeps/nptl/_Fork.c @@ -18,6 +18,7 @@ #include <arch-fork.h> #include <pthreadP.h> +#include <sys/random.h> pid_t _Fork (void) @@ -43,6 +44,7 @@ _Fork (void) self->robust_head.list = &self->robust_head; INTERNAL_SYSCALL_CALL (set_robust_list, &self->robust_head, sizeof (struct robust_list_head)); + call_function_static_weak (__getrandom_fork_subprocess); } return pid; } diff --git a/sysdeps/nptl/fork.h b/sysdeps/nptl/fork.h index 7643926df9..106b2cf71d 100644 --- a/sysdeps/nptl/fork.h +++ b/sysdeps/nptl/fork.h @@ -26,6 +26,7 @@ #include <mqueue.h> #include <pthreadP.h> #include <sysdep.h> +#include <sys/random.h> static inline void fork_system_setup (void) @@ -46,6 +47,7 @@ fork_system_setup_after_fork (void) call_function_static_weak (__mq_notify_fork_subprocess); call_function_static_weak (__timer_fork_subprocess); + call_function_static_weak (__getrandom_fork_subprocess); } /* In case of a fork() call the memory allocation in the child will be @@ -128,9 +130,19 @@ reclaim_stacks (void) curp->specific_used = true; } } + + call_function_static_weak (__getrandom_reset_state, curp); } } + /* Also reset stale getrandom states for user stack threads. */ + list_for_each (runp, &GL (dl_stack_user)) + { + struct pthread *curp = list_entry (runp, struct pthread, list); + if (curp != self) + call_function_static_weak (__getrandom_reset_state, curp); + } + /* Add the stack of all running threads to the cache. */ list_splice (&GL (dl_stack_used), &GL (dl_stack_cache)); diff --git a/sysdeps/unix/sysv/linux/aarch64/sysdep.h b/sysdeps/unix/sysv/linux/aarch64/sysdep.h index bbbe35723c..974b503b2f 100644 --- a/sysdeps/unix/sysv/linux/aarch64/sysdep.h +++ b/sysdeps/unix/sysv/linux/aarch64/sysdep.h @@ -164,6 +164,7 @@ # define HAVE_CLOCK_GETRES64_VSYSCALL "__kernel_clock_getres" # define HAVE_CLOCK_GETTIME64_VSYSCALL "__kernel_clock_gettime" # define HAVE_GETTIMEOFDAY_VSYSCALL "__kernel_gettimeofday" +# define HAVE_GETRANDOM_VSYSCALL "__kernel_getrandom" # define HAVE_CLONE3_WRAPPER 1 diff --git a/sysdeps/unix/sysv/linux/dl-vdso-setup.c b/sysdeps/unix/sysv/linux/dl-vdso-setup.c index 3a44944dbb..b117a25922 100644 --- a/sysdeps/unix/sysv/linux/dl-vdso-setup.c +++ b/sysdeps/unix/sysv/linux/dl-vdso-setup.c @@ -66,6 +66,18 @@ PROCINFO_CLASS int (*_dl_vdso_clock_getres) (clockid_t, PROCINFO_CLASS int (*_dl_vdso_clock_getres_time64) (clockid_t, struct __timespec64 *) RELRO; # endif +# ifdef HAVE_GETRANDOM_VSYSCALL +PROCINFO_CLASS ssize_t (*_dl_vdso_getrandom) (void *buffer, size_t len, + unsigned int flags, void *state, + size_t state_len) RELRO; +/* These values will be initialized at loading time by calling the + _dl_vdso_getrandom with a special value. The 'state_size' is the opaque + state size per-thread allocated with a mmap using 'mmap_prot' and + 'mmap_flags' argument. */ +PROCINFO_CLASS uint32_t _dl_vdso_getrandom_state_size RELRO; +PROCINFO_CLASS uint32_t _dl_vdso_getrandom_mmap_prot RELRO; +PROCINFO_CLASS uint32_t _dl_vdso_getrandom_mmap_flags RELRO; +# endif /* PowerPC specific ones. */ # ifdef HAVE_GET_TBFREQ diff --git a/sysdeps/unix/sysv/linux/dl-vdso-setup.h b/sysdeps/unix/sysv/linux/dl-vdso-setup.h index 8aee5a8212..c63b7689e5 100644 --- a/sysdeps/unix/sysv/linux/dl-vdso-setup.h +++ b/sysdeps/unix/sysv/linux/dl-vdso-setup.h @@ -19,6 +19,10 @@ #ifndef _DL_VDSO_INIT_H #define _DL_VDSO_INIT_H +#ifdef HAVE_GETRANDOM_VSYSCALL +# include <getrandom_vdso.h> +#endif + /* Initialize the VDSO functions pointers. */ static inline void __attribute__ ((always_inline)) setup_vdso_pointers (void) @@ -50,6 +54,19 @@ setup_vdso_pointers (void) #ifdef HAVE_RISCV_HWPROBE GLRO(dl_vdso_riscv_hwprobe) = dl_vdso_vsym (HAVE_RISCV_HWPROBE); #endif +#ifdef HAVE_GETRANDOM_VSYSCALL + GLRO(dl_vdso_getrandom) = dl_vdso_vsym (HAVE_GETRANDOM_VSYSCALL); + if (GLRO(dl_vdso_getrandom) != NULL) + { + struct vgetrandom_opaque_params params; + if (GLRO(dl_vdso_getrandom) (NULL, 0, 0, ¶ms, ~0UL) == 0) + { + GLRO(dl_vdso_getrandom_state_size) = params.size_of_opaque_state; + GLRO(dl_vdso_getrandom_mmap_prot) = params.mmap_prot; + GLRO(dl_vdso_getrandom_mmap_flags) = params.mmap_flags; + } + } +#endif } #endif diff --git a/sysdeps/unix/sysv/linux/getrandom.c b/sysdeps/unix/sysv/linux/getrandom.c index 777d1decf0..d6025199dc 100644 --- a/sysdeps/unix/sysv/linux/getrandom.c +++ b/sysdeps/unix/sysv/linux/getrandom.c @@ -21,12 +21,247 @@ #include <unistd.h> #include <sysdep-cancel.h> +static inline ssize_t +getrandom_syscall (void *buffer, size_t length, unsigned int flags, + bool cancel) +{ + return cancel + ? SYSCALL_CANCEL (getrandom, buffer, length, flags) + : INLINE_SYSCALL_CALL (getrandom, buffer, length, flags); +} + +#ifdef HAVE_GETRANDOM_VSYSCALL +# include <getrandom_vdso.h> +# include <ldsodefs.h> +# include <libc-lock.h> +# include <list.h> +# include <setvmaname.h> +# include <sys/mman.h> +# include <sys/sysinfo.h> +# include <tls-internal.h> + +# define ALIGN_PAGE(p) PTR_ALIGN_UP (p, GLRO (dl_pagesize)) +# define READ_ONCE(p) (*((volatile typeof (p) *) (&(p)))) +# define WRITE_ONCE(p, v) (*((volatile typeof (p) *) (&(p))) = (v)) +# define RESERVE_PTR(p) ((void *) ((uintptr_t) (p) | 1UL)) +# define RELEASE_PTR(p) ((void *) ((uintptr_t) (p) & ~1UL)) +# define IS_RESERVED_PTR(p) (!!((uintptr_t) (p) & 1UL)) + +static struct +{ + __libc_lock_define (, lock); + + void **states; /* Queue of opaque states allocated with the kernel + provided flags and used on getrandom vDSO call. */ + size_t len; /* Number of available free states in the queue. */ + size_t total; /* Number of states allocated from the kernel. */ + size_t cap; /* Total numver of states that 'states' can hold before + needed to be resized. */ +} grnd_alloc = { + .lock = LLL_LOCK_INITIALIZER +}; + +static bool +vgetrandom_get_state_alloc (void) +{ + size_t num = __get_nprocs (); /* Just a decent heuristic. */ + + size_t block_size = ALIGN_PAGE (num * GLRO(dl_vdso_getrandom_state_size)); + num = (GLRO (dl_pagesize) / GLRO(dl_vdso_getrandom_state_size)) * + (block_size / GLRO (dl_pagesize)); + void *block = __mmap (NULL, block_size, GLRO(dl_vdso_getrandom_mmap_prot), + GLRO(dl_vdso_getrandom_mmap_flags), -1, 0); + if (block == MAP_FAILED) + return false; + __set_vma_name (block, block_size, " glibc: getrandom"); + + if (grnd_alloc.total + num > grnd_alloc.cap) + { + /* Use a new mmap instead of trying to mremap. It avoids a + potential multithread fork issue where fork is called just after + mremap returns but before assigning to the grnd_alloc.states, + thus making the its value invalid in the child. */ + void *old_states = grnd_alloc.states; + size_t old_states_size = ALIGN_PAGE (sizeof (*grnd_alloc.states) * + grnd_alloc.total + num); + size_t states_size; + if (grnd_alloc.states == NULL) + states_size = old_states_size; + else + states_size = ALIGN_PAGE (sizeof (*grnd_alloc.states) + * grnd_alloc.cap); + + void **states = __mmap (NULL, states_size, PROT_READ | PROT_WRITE, + MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); + if (states == MAP_FAILED) + { + __munmap (block, block_size); + return false; + } + + /* Atomically replace the old state, so if a fork happens the child + process will see a consistent free state buffer. The size might + not be updated, but it does not really matter since the buffer is + always increased. */ + atomic_store_relaxed (&grnd_alloc.states, states); + if (old_states != NULL) + __munmap (old_states, old_states_size); + + __set_vma_name (states, states_size, " glibc: getrandom states"); + grnd_alloc.cap = states_size / sizeof (*grnd_alloc.states); + } + + for (size_t i = 0; i < num; ++i) + { + /* States should not straddle a page. */ + if (((uintptr_t) block & (GLRO (dl_pagesize) - 1)) + + GLRO(dl_vdso_getrandom_state_size) > GLRO (dl_pagesize)) + block = ALIGN_PAGE (block); + grnd_alloc.states[i] = block; + block += GLRO(dl_vdso_getrandom_state_size); + } + grnd_alloc.len = num; + grnd_alloc.total += num; + + return true; +} + +/* Allocate an opaque state for vgetrandom. If the grnd_alloc does not have + any, mmap() another page of them using the vgetrandom parameters. */ +static void * +vgetrandom_get_state (void) +{ + void *state = NULL; + + /* The signal blocking avoid the potential issue where _Fork() (which is + async-signal-safe) is called with the lock taken. The function is + called only once during thread lifetime, so the overhead should be + minimal. */ + internal_sigset_t set; + internal_signal_block_all (&set); + __libc_lock_lock (grnd_alloc.lock); + + if (grnd_alloc.len > 0 || vgetrandom_get_state_alloc ()) + state = grnd_alloc.states[--grnd_alloc.len]; + + __libc_lock_unlock (grnd_alloc.lock); + internal_signal_restore_set (&set); + + return state; +} + +/* Returns true when vgetrandom is used successfully. Returns false if the + syscall fallback should be issued in the case the vDSO is not present, in + the case of reentrancy, or if any memory allocation fails. */ +static ssize_t +getrandom_vdso (void *buffer, size_t length, unsigned int flags, bool cancel) +{ + if (GLRO (dl_vdso_getrandom_state_size) == 0) + return getrandom_syscall (buffer, length, flags, cancel); + + struct pthread *self = THREAD_SELF; + + /* If the LSB of getrandom_buf is set, then this function is already being + called, and we have a reentrant call from a signal handler. In this case + fallback to the syscall. */ + void *state = READ_ONCE (self->getrandom_buf); + if (IS_RESERVED_PTR (state)) + return getrandom_syscall (buffer, length, flags, cancel); + WRITE_ONCE (self->getrandom_buf, RESERVE_PTR (state)); + + bool r = false; + if (state == NULL) + { + state = vgetrandom_get_state (); + if (state == NULL) + goto out; + } + + /* Since the vDSO fallback does not issue the syscall with the cancellation + bridge (__syscall_cancel_arch), use GRND_NONBLOCK so there is no + potential unbounded blocking in the kernel. It should be a rare + situation, only at system startup when RNG is not initialized. */ + ssize_t ret = GLRO (dl_vdso_getrandom) (buffer, + length, + flags | GRND_NONBLOCK, + state, + GLRO(dl_vdso_getrandom_state_size)); + if (INTERNAL_SYSCALL_ERROR_P (ret)) + { + /* Fallback to the syscall if the kernel would block. */ + int err = INTERNAL_SYSCALL_ERRNO (ret); + if (err == EAGAIN && !(flags & GRND_NONBLOCK)) + goto out; + + __set_errno (err); + ret = -1; + } + r = true; + +out: + WRITE_ONCE (self->getrandom_buf, state); + return r ? ret : getrandom_syscall (buffer, length, flags, cancel); +} +#endif + +/* Re-add the state state from CURP on the free list. */ +void +__getrandom_reset_state (struct pthread *curp) +{ +#ifdef HAVE_GETRANDOM_VSYSCALL + if (grnd_alloc.states == NULL || curp->getrandom_buf == NULL) + return; + grnd_alloc.states[grnd_alloc.len++] = RELEASE_PTR (curp->getrandom_buf); + curp->getrandom_buf = NULL; +#endif +} + +/* Called when a thread terminates, and adds its random buffer back into the + allocator pool for use in a future thread. */ +void +__getrandom_vdso_release (struct pthread *curp) +{ +#ifdef HAVE_GETRANDOM_VSYSCALL + if (curp->getrandom_buf == NULL) + return; + + __libc_lock_lock (grnd_alloc.lock); + grnd_alloc.states[grnd_alloc.len++] = curp->getrandom_buf; + __libc_lock_unlock (grnd_alloc.lock); +#endif +} + +/* Reset the internal lock state in case another thread has locked while + this thread calls fork. The stale thread states will be handled by + reclaim_stacks which calls __getrandom_reset_state on each thread. */ +void +__getrandom_fork_subprocess (void) +{ +#ifdef HAVE_GETRANDOM_VSYSCALL + grnd_alloc.lock = LLL_LOCK_INITIALIZER; +#endif +} + +ssize_t +__getrandom_nocancel (void *buffer, size_t length, unsigned int flags) +{ +#ifdef HAVE_GETRANDOM_VSYSCALL + return getrandom_vdso (buffer, length, flags, false); +#else + return getrandom_syscall (buffer, length, flags, false); +#endif +} + /* Write up to LENGTH bytes of randomness starting at BUFFER. Return the number of bytes written, or -1 on error. */ ssize_t __getrandom (void *buffer, size_t length, unsigned int flags) { - return SYSCALL_CANCEL (getrandom, buffer, length, flags); +#ifdef HAVE_GETRANDOM_VSYSCALL + return getrandom_vdso (buffer, length, flags, true); +#else + return getrandom_syscall (buffer, length, flags, true); +#endif } libc_hidden_def (__getrandom) weak_alias (__getrandom, getrandom) diff --git a/sysdeps/unix/sysv/linux/getrandom_vdso.h b/sysdeps/unix/sysv/linux/getrandom_vdso.h new file mode 100644 index 0000000000..d1ef690e50 --- /dev/null +++ b/sysdeps/unix/sysv/linux/getrandom_vdso.h @@ -0,0 +1,36 @@ +/* Linux getrandom vDSO support. + Copyright (C) 2024 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. */ + +#ifndef _GETRANDOM_VDSO_H +#define _GETRANDOM_VDSO_H + +#include <stddef.h> +#include <stdint.h> +#include <sys/types.h> + +/* Used to query the vDSO for the required mmap flags and the opaque + per-thread state size Defined by linux/random.h. */ +struct vgetrandom_opaque_params +{ + uint32_t size_of_opaque_state; + uint32_t mmap_prot; + uint32_t mmap_flags; + uint32_t reserved[13]; +}; + +#endif diff --git a/sysdeps/unix/sysv/linux/include/sys/random.h b/sysdeps/unix/sysv/linux/include/sys/random.h new file mode 100644 index 0000000000..5a48de2d29 --- /dev/null +++ b/sysdeps/unix/sysv/linux/include/sys/random.h @@ -0,0 +1,29 @@ +/* Internal definitions for Linux getrandom implementation. + Copyright (C) 2024 Free Software Foundation, Inc. + This file is part of the GNU C Library. + + The GNU C Library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + The GNU C Library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with the GNU C Library; if not, see + <https://www.gnu.org/licenses/>. */ + +#ifndef _LINUX_SYS_RANDOM_H +#define _LINUX_SYS_RANDOM_H + +# ifndef _ISOMAC +# include <pthreadP.h> + +extern void __getrandom_fork_subprocess (void) attribute_hidden; +extern void __getrandom_vdso_release (struct pthread *curp) attribute_hidden; +extern void __getrandom_reset_state (struct pthread *curp) attribute_hidden; +# endif +#endif diff --git a/sysdeps/unix/sysv/linux/loongarch/sysdep.h b/sysdeps/unix/sysv/linux/loongarch/sysdep.h index eb0ba790da..e2d853ae3e 100644 --- a/sysdeps/unix/sysv/linux/loongarch/sysdep.h +++ b/sysdeps/unix/sysv/linux/loongarch/sysdep.h @@ -119,6 +119,7 @@ #define HAVE_CLOCK_GETTIME64_VSYSCALL "__vdso_clock_gettime" #define HAVE_GETTIMEOFDAY_VSYSCALL "__vdso_gettimeofday" #define HAVE_GETCPU_VSYSCALL "__vdso_getcpu" +#define HAVE_GETRANDOM_VSYSCALL "__vdso_getrandom" #define HAVE_CLONE3_WRAPPER 1 diff --git a/sysdeps/unix/sysv/linux/not-cancel.h b/sysdeps/unix/sysv/linux/not-cancel.h index 2a7585b73f..12f26912d3 100644 --- a/sysdeps/unix/sysv/linux/not-cancel.h +++ b/sysdeps/unix/sysv/linux/not-cancel.h @@ -27,6 +27,7 @@ #include <sys/syscall.h> #include <sys/wait.h> #include <time.h> +#include <sys/random.h> /* Non cancellable open syscall. */ __typeof (open) __open_nocancel; @@ -84,15 +85,17 @@ __writev_nocancel_nostatus (int fd, const struct iovec *iov, int iovcnt) } static inline ssize_t -__getrandom_nocancel (void *buf, size_t buflen, unsigned int flags) +__getrandom_nocancel_direct (void *buf, size_t buflen, unsigned int flags) { return INLINE_SYSCALL_CALL (getrandom, buf, buflen, flags); } +__typeof (getrandom) __getrandom_nocancel attribute_hidden; + /* Non cancellable getrandom syscall that does not also set errno in case of failure. */ static inline ssize_t -__getrandom_nocancel_nostatus (void *buf, size_t buflen, unsigned int flags) +__getrandom_nocancel_nostatus_direct (void *buf, size_t buflen, unsigned int flags) { return INTERNAL_SYSCALL_CALL (getrandom, buf, buflen, flags); } diff --git a/sysdeps/unix/sysv/linux/powerpc/sysdep.h b/sysdeps/unix/sysv/linux/powerpc/sysdep.h index a69b7db338..48f3d0d1b2 100644 --- a/sysdeps/unix/sysv/linux/powerpc/sysdep.h +++ b/sysdeps/unix/sysv/linux/powerpc/sysdep.h @@ -223,5 +223,6 @@ #define HAVE_TIME_VSYSCALL "__kernel_time" #define HAVE_GETTIMEOFDAY_VSYSCALL "__kernel_gettimeofday" #define HAVE_GET_TBFREQ "__kernel_get_tbfreq" +#define HAVE_GETRANDOM_VSYSCALL "__kernel_getrandom" #endif /* _LINUX_POWERPC_SYSDEP_H */ diff --git a/sysdeps/unix/sysv/linux/s390/sysdep.h b/sysdeps/unix/sysv/linux/s390/sysdep.h index 9b3000ca62..9698c57a03 100644 --- a/sysdeps/unix/sysv/linux/s390/sysdep.h +++ b/sysdeps/unix/sysv/linux/s390/sysdep.h @@ -72,6 +72,7 @@ #ifdef __s390x__ #define HAVE_CLOCK_GETRES64_VSYSCALL "__kernel_clock_getres" #define HAVE_CLOCK_GETTIME64_VSYSCALL "__kernel_clock_gettime" +#define HAVE_GETRANDOM_VSYSCALL "__kernel_getrandom" #else #define HAVE_CLOCK_GETRES_VSYSCALL "__kernel_clock_getres" #define HAVE_CLOCK_GETTIME_VSYSCALL "__kernel_clock_gettime" diff --git a/sysdeps/unix/sysv/linux/x86_64/sysdep.h b/sysdeps/unix/sysv/linux/x86_64/sysdep.h index a2b021bd86..7dc072ae2d 100644 --- a/sysdeps/unix/sysv/linux/x86_64/sysdep.h +++ b/sysdeps/unix/sysv/linux/x86_64/sysdep.h @@ -376,6 +376,7 @@ # define HAVE_TIME_VSYSCALL "__vdso_time" # define HAVE_GETCPU_VSYSCALL "__vdso_getcpu" # define HAVE_CLOCK_GETRES64_VSYSCALL "__vdso_clock_getres" +# define HAVE_GETRANDOM_VSYSCALL "__vdso_getrandom" # define HAVE_CLONE3_WRAPPER 1
Linux 6.11 has getrandom() in vDSO. It operates on a thread-local opaque state allocated with mmap using flags specified by the vDSO. Multiple states are allocated at once, as many as fit into a page, and these are held in an array of available states to be doled out to each thread upon first use, and recycled when a thread terminates. As these states run low, more are allocated. To make this procedure async-signal-safe, a simple guard is used in the LSB of the opaque state address, falling back to the syscall if there's reentrancy contention. Also, _Fork() is handled by blocking signals on opaque state allocation (so _Fork() always sees a consistent state even if it interrupts a getrandom() call) and by iterating over the thread stack cache on reclaim_stack. Each opaque state will be in the free states list (grnd_alloc.states) or allocated to a running thread. The cancellation is handled by always using GRND_NONBLOCK flags while calling the vDSO, and falling back to the cancellable syscall if the kernel returns EAGAIN (would block). Since getrandom is not defined by POSIX and cancellation is supported as an extension, the cancellation is handled as 'may occur' instead of 'shall occur' [1], meaning that if vDSO does not block (the expected behavior) getrandom will not act as a cancellation entrypoint. It avoids a pthread_testcancel call on the fast path (different than 'shall occur' functions, like sem_wait()). It is currently enabled for x86_64, which is available in Linux 6.11, and aarch64, powerpc32, powerpc64, loongarch64, and s390x, which are available in Linux 6.12. Link: https://pubs.opengroup.org/onlinepubs/9799919799/nframe.html [1] Co-developed-by: Adhemerval Zanella <adhemerval.zanella@linaro.org> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> --- Adhemerval is out for a bit, so I'm sending in this week's installment of the patch instead. Changes v6->v7: * Add support for aarch64, powerpc32, powerpc64, loongarch64, and s390x, now that support for those has been merged into Linus' tree. Changes v5->v6: * Makes getrandom_vdso a tail call to improve latency. * Update the list of upcoming supported architectures with s390x. Changes v4->v5: * Handle cancellation by using GRND_NONBLOCK. Changes v3->v4: * Query the vgetrandom mmap parameters at loading time, instead of each block allocation. Changes v2->v3: * Move the getrandom opaque state buffer to 'struct pthread'. * Move the state release to start_thread and after signals are blocked, to avoid a possible concurrency update. * Move the state reset on fork() to reclaim_stack. This makes all the logic of handling threads on one place, and simplify the __getrandom_fork_subprocess (no need to extra argument). * Fixed some style issue on comments. * Do not use mremap to reallocate the free states, it avoids a possible fork() issue where the allocator state pointer update is interrupted just after the mremap call returns, but before the assignment. This is done by mmap/munmap a a new buffer with the new size, a fork() will not invalidate the old buffer. * Block all signals before taking the lock to avoid the _Fork() issue if it interrupts getrandom while the lock is taken. include/sys/random.h | 4 + malloc/malloc.c | 4 +- nptl/allocatestack.c | 2 + nptl/descr.h | 3 + nptl/pthread_create.c | 5 + sysdeps/generic/not-cancel.h | 4 +- sysdeps/mach/hurd/not-cancel.h | 4 +- sysdeps/nptl/_Fork.c | 2 + sysdeps/nptl/fork.h | 12 + sysdeps/unix/sysv/linux/aarch64/sysdep.h | 1 + sysdeps/unix/sysv/linux/dl-vdso-setup.c | 12 + sysdeps/unix/sysv/linux/dl-vdso-setup.h | 17 ++ sysdeps/unix/sysv/linux/getrandom.c | 237 ++++++++++++++++++- sysdeps/unix/sysv/linux/getrandom_vdso.h | 36 +++ sysdeps/unix/sysv/linux/include/sys/random.h | 29 +++ sysdeps/unix/sysv/linux/loongarch/sysdep.h | 1 + sysdeps/unix/sysv/linux/not-cancel.h | 7 +- sysdeps/unix/sysv/linux/powerpc/sysdep.h | 1 + sysdeps/unix/sysv/linux/s390/sysdep.h | 1 + sysdeps/unix/sysv/linux/x86_64/sysdep.h | 1 + 20 files changed, 376 insertions(+), 7 deletions(-) create mode 100644 sysdeps/unix/sysv/linux/getrandom_vdso.h create mode 100644 sysdeps/unix/sysv/linux/include/sys/random.h