Message ID | PAWPR08MB898203ABD80D81B91E5398B4832D2@PAWPR08MB8982.eurprd08.prod.outlook.com |
---|---|
State | New |
Headers | show |
Series | [1/2] Add random benchmark | expand |
On Mon, Mar 18, 2024 at 12:21 PM Wilco Dijkstra <Wilco.Dijkstra@arm.com> wrote: > > Improve performance of rand() and __random() by adding a single-threaded fast > path. Bench-random-lock shows about 5x speedup on Neoverse V1. > > OK for commit? So the generalized lock-omitting code in __libc_lock_lock didn't flew.. what a new fast path macro function like (sorry I suck at naming things) __libc_maybe_lock_lock that only locks if not single thread ?
Hi Cristian, > So the generalized lock-omitting code in __libc_lock_lock didn't > flew.. There were some comments on it. One issue is that it would affect practically every lock in GLIBC, thus adding code and branches even for locks that don't benefit from single-threaded optimization. Hence the idea of first sorting out rand() since it is a frequent performance complaint (unfortunately). > what a new fast path macro function like (sorry I suck at > naming things) __libc_maybe_lock_lock that only locks if not single > thread ? Yes despite the huge mess of locking macros in use, adding a new set of macros may be better overall. We could say that the critical section must not create a new thread itself. This means we can avoid checking the lock value and compile the lock macro like: #define __libc_fast_lock(NAME) if (SINGLE_THREAD_P || !__libc_lock_lock (NAME)) { Cheers, Wilco
On 19/03/24 12:44, Wilco Dijkstra wrote: > Hi Cristian, > >> So the generalized lock-omitting code in __libc_lock_lock didn't >> flew.. > > There were some comments on it. One issue is that it would affect > practically every lock in GLIBC, thus adding code and branches even for > locks that don't benefit from single-threaded optimization. > > Hence the idea of first sorting out rand() since it is a frequent performance > complaint (unfortunately). I don't have a strong opinion on this rand patch, if this idea is to have it as an workbench for a possible single-thread lock optimization it should be fine. It is just that I don't see much gain in optimizing such a bad interface (although we still lack a proper userland PRNG). > >> what a new fast path macro function like (sorry I suck at >> naming things) __libc_maybe_lock_lock that only locks if not single >> thread ? > > Yes despite the huge mess of locking macros in use, adding a new set of > macros may be better overall. We could say that the critical section must not > create a new thread itself. This means we can avoid checking the lock value > and compile the lock macro like: > > #define __libc_fast_lock(NAME) if (SINGLE_THREAD_P || !__libc_lock_lock (NAME)) { This sounds reasonable, maybe we use a static inline instead to enforce the lock type (not sure if this really required).
On Wed, Mar 20, 2024 at 9:31 AM Adhemerval Zanella Netto <adhemerval.zanella@linaro.org> wrote: > I don't have a strong opinion on this rand patch, if this idea is to > have it as an workbench for a possible single-thread lock optimization > it should be fine. It is just that I don't see much gain in optimizing > such a bad interface (although we still lack a proper userland PRNG). Yeah, it should be no surprise this interfaces are bad, I thought this was common knowledge. we need something like https://github.com/C2SP/C2SP/blob/main/chacha8rand.md which prety much outperforms even non-CS algorithms in at least 64 bit x86. but the question of the state remains.global? TLS? how to discard it in all the appropriate occasions?
On 20/03/24 11:18, Cristian Rodríguez wrote: > On Wed, Mar 20, 2024 at 9:31 AM Adhemerval Zanella Netto > <adhemerval.zanella@linaro.org> wrote: > > >> I don't have a strong opinion on this rand patch, if this idea is to >> have it as an workbench for a possible single-thread lock optimization >> it should be fine. It is just that I don't see much gain in optimizing >> such a bad interface (although we still lack a proper userland PRNG). > > Yeah, it should be no surprise this interfaces are bad, > I thought this was common knowledge. > > we need something like https://github.com/C2SP/C2SP/blob/main/chacha8rand.md > which prety much outperforms even non-CS algorithms in at least 64 bit x86. > but the question of the state remains.global? TLS? how to discard it > in all the appropriate occasions? And this is the arc4random in userspace discussion all over again.
Hi Cristian, > Yeah, it should be no surprise this interfaces are bad, > I thought this was common knowledge. It's bad but we also made it thread-safe via locks rather than add its state to TLS. > we need something like https://github.com/C2SP/C2SP/blob/main/chacha8rand.md > which prety much outperforms even non-CS algorithms in at least 64 bit x86. > but the question of the state remains.global? TLS? how to discard it > in all the appropriate occasions? I believe the current implementation is already way overkill. All you need is a simple LFSR random generator with 64 bits of state and a decently large cycle. It's not meant to be a serious random number generator - it's invariably used to generate randomized inputs for testing and benchmarking. That's all. The complaints are due to it being too slow due to all the locking. Cheers, Wilco
On Wed, 2024-03-20 at 14:28 +0000, Wilco Dijkstra wrote: > Hi Cristian, > > > Yeah, it should be no surprise this interfaces are bad, > > I thought this was common knowledge. > > It's bad but we also made it thread-safe via locks rather than add its state to TLS. > > > we need something like https://github.com/C2SP/C2SP/blob/main/chacha8rand.md > > which prety much outperforms even non-CS algorithms in at least 64 bit x86. > > but the question of the state remains.global? TLS? how to discard it > > in all the appropriate occasions? > > I believe the current implementation is already way overkill. All you need is a simple > LFSR random generator with 64 bits of state and a decently large cycle. It's not > meant to be a serious random number generator - it's invariably used to generate > randomized inputs for testing and benchmarking. That's all. The complaints are > due to it being too slow due to all the locking. Yes, we don't need a *CS*PRNG, just a PRNG. Thus no "oops, the user space cannot know whether to reseed" stuff.
* Adhemerval Zanella Netto: > On 20/03/24 11:18, Cristian Rodríguez wrote: >> On Wed, Mar 20, 2024 at 9:31 AM Adhemerval Zanella Netto >> <adhemerval.zanella@linaro.org> wrote: >> >> >>> I don't have a strong opinion on this rand patch, if this idea is to >>> have it as an workbench for a possible single-thread lock optimization >>> it should be fine. It is just that I don't see much gain in optimizing >>> such a bad interface (although we still lack a proper userland PRNG). >> >> Yeah, it should be no surprise this interfaces are bad, >> I thought this was common knowledge. >> >> we need something like https://github.com/C2SP/C2SP/blob/main/chacha8rand.md >> which prety much outperforms even non-CS algorithms in at least 64 bit x86. >> but the question of the state remains.global? TLS? how to discard it >> in all the appropriate occasions? > > And this is the arc4random in userspace discussion all over again. Agreed. But what has changed that we know now that Linux won't provide us with vDSO acceleration for arc4random. So I think it wouldn't be unreasonable to roll our own. Right now, the switch to arc4random provided by glibc is a massive performance regression compared to other implementations. Thanks, Florian
On Thu, Mar 21, 2024 at 4:39 AM Florian Weimer <fweimer@redhat.com> wrote: > > * Adhemerval Zanella Netto: > > > On 20/03/24 11:18, Cristian Rodríguez wrote: > >> On Wed, Mar 20, 2024 at 9:31 AM Adhemerval Zanella Netto > >> <adhemerval.zanella@linaro.org> wrote: > >> > >> > >>> I don't have a strong opinion on this rand patch, if this idea is to > >>> have it as an workbench for a possible single-thread lock optimization > >>> it should be fine. It is just that I don't see much gain in optimizing > >>> such a bad interface (although we still lack a proper userland PRNG). > >> > >> Yeah, it should be no surprise this interfaces are bad, > >> I thought this was common knowledge. > >> > >> we need something like https://github.com/C2SP/C2SP/blob/main/chacha8rand.md > >> which prety much outperforms even non-CS algorithms in at least 64 bit x86. > >> but the question of the state remains.global? TLS? how to discard it > >> in all the appropriate occasions? > > > > And this is the arc4random in userspace discussion all over again. > > Agreed. But what has changed that we know now that Linux won't provide > us with vDSO acceleration for arc4random. So I think it wouldn't be > unreasonable to roll our own. Right now, the switch to arc4random > provided by glibc is a massive performance regression compared to other > implementations. Afaik the other path that hasn't been tried is rseq no ? could the kernel provide a random state this way that it is cleared on the right conditions..
On 2024-03-21 09:33, Cristian Rodríguez wrote: > On Thu, Mar 21, 2024 at 4:39 AM Florian Weimer <fweimer@redhat.com> wrote: >> >> * Adhemerval Zanella Netto: >> >>> On 20/03/24 11:18, Cristian Rodríguez wrote: >>>> On Wed, Mar 20, 2024 at 9:31 AM Adhemerval Zanella Netto >>>> <adhemerval.zanella@linaro.org> wrote: >>>> >>>> >>>>> I don't have a strong opinion on this rand patch, if this idea is to >>>>> have it as an workbench for a possible single-thread lock optimization >>>>> it should be fine. It is just that I don't see much gain in optimizing >>>>> such a bad interface (although we still lack a proper userland PRNG). >>>> >>>> Yeah, it should be no surprise this interfaces are bad, >>>> I thought this was common knowledge. >>>> >>>> we need something like https://github.com/C2SP/C2SP/blob/main/chacha8rand.md >>>> which prety much outperforms even non-CS algorithms in at least 64 bit x86. >>>> but the question of the state remains.global? TLS? how to discard it >>>> in all the appropriate occasions? >>> >>> And this is the arc4random in userspace discussion all over again. >> >> Agreed. But what has changed that we know now that Linux won't provide >> us with vDSO acceleration for arc4random. So I think it wouldn't be >> unreasonable to roll our own. Right now, the switch to arc4random >> provided by glibc is a massive performance regression compared to other >> implementations. > > > Afaik the other path that hasn't been tried is rseq no ? could the > kernel provide a random state this way that it is cleared on the right > conditions.. Letting rseq know about userspace random state would add a lot of coupling between kernel and userspace. I wonder if we can achieve the speed up you are looking for using existing rseq/membarrier features. At the last GNU Cauldron, I created a biased locking prototype for Florian based on rseq: https://github.com/compudj/librseq/tree/biased-lock https://github.com/compudj/librseq/blob/biased-lock/tests/rseq_biased_lock.h https://github.com/compudj/librseq/blob/biased-lock/tests/rseq_biased_lock.c https://github.com/compudj/librseq/blob/biased-lock/tests/test_rseq_biased_lock.c The original purpose of this prototype was to replace the FILE lock by a biased lock, which acts as a very fast lock (a small rseq assembly sequence of loads, comparisons, stores) as long as the thread accessing the FILE is the thread owner thread whom created it. If another thread attempts to access the FILE, then the internal state machine of the biased lock transitions from RSEQ_BIASED_LOCK_STATE_ST to RSEQ_BIASED_LOCK_STATE_MT_STARTED, issues a membarrier RSEQ_PRIVATE_EXPEDITED (also referred to as "rseq fence") to abort all pre-existing rseq critical sections, and then transition to RSEQ_BIASED_LOCK_STATE_MT_READY before taking the lock. After the biased lock has transitioned to RSEQ_BIASED_LOCK_STATE_MT_READY it is meant to stay in that state forever. So as long as only the owner thread uses the lock, it stays fast. This means that even a multi-threaded process with short-lived FILE that are only used by a single thread in their lifetime will benefit from the biased lock fast-path. I suspect we could adapt this scheme to the srand()/rand() lock. Here is how I see this unfold for both srand() and rand(): - On first use (if rseq biased lock owner is NULL): - use rseq_biased_lock_try_set_fast_thread to set lock owner with a compare-and-swap (expecting NULL). - Take the rseq biased lock: - This rseq biased lock will take care of transitioning to a scheme which is MT-aware if it detects that the thread trying to access the biased lock is not the owner. So this scheme applied to both srand() and rand() should take care of making things really fast as long as only the biased lock owner thread uses srand()/rand(). Its overhead would be a single system call (membarrier) issued the first time the biased lock is accessed from a thread which differs from the first thread which used the lock. Thoughts ? Thanks, Mathieu
The 03/21/2024 10:35, Mathieu Desnoyers wrote: > On 2024-03-21 09:33, Cristian Rodríguez wrote: > > Afaik the other path that hasn't been tried is rseq no ? could the > > kernel provide a random state this way that it is cleared on the right > > conditions.. > > Letting rseq know about userspace random state would add a lot of coupling > between kernel and userspace. I wonder if we can achieve the speed up you > are looking for using existing rseq/membarrier features. > > At the last GNU Cauldron, I created a biased locking prototype for Florian > based on rseq: ... i think rand() should be a 2 line function, without any locking, since it is *not* required to be thread-safe nor crypto quality. but i guess for glibc it's safer to keep the current implementation just with single-thread checks. either way, it is not the best candidate for rseq optimizations, as that complexity is not justified, to speed up code with undefined behaviour..
Hi, > i think rand() should be a 2 line function, without any > locking, since it is *not* required to be thread-safe nor > crypto quality. Absolutely. > but i guess for glibc it's safer to keep the current > implementation just with single-thread checks. I think it is useful to rewrite it using a much simpler implementation that just has 64 bits of global state. Then you don't need any locks at all, we could use a relaxed atomic operation for updates or add it to TLS since its small enough. > either way, it is not the best candidate for rseq > optimizations, as that complexity is not justified, to > speed up code with undefined behaviour.. Agreed. The complexity of the current random algorithm and all the associated locking is already way out of order. Programming is all about keeping stuff as simple and fast as possible. Cheers, Wilco
On 21/03/24 10:33, Cristian Rodríguez wrote: > On Thu, Mar 21, 2024 at 4:39 AM Florian Weimer <fweimer@redhat.com> wrote: >> >> * Adhemerval Zanella Netto: >> >>> On 20/03/24 11:18, Cristian Rodríguez wrote: >>>> On Wed, Mar 20, 2024 at 9:31 AM Adhemerval Zanella Netto >>>> <adhemerval.zanella@linaro.org> wrote: >>>> >>>> >>>>> I don't have a strong opinion on this rand patch, if this idea is to >>>>> have it as an workbench for a possible single-thread lock optimization >>>>> it should be fine. It is just that I don't see much gain in optimizing >>>>> such a bad interface (although we still lack a proper userland PRNG). >>>> >>>> Yeah, it should be no surprise this interfaces are bad, >>>> I thought this was common knowledge. >>>> >>>> we need something like https://github.com/C2SP/C2SP/blob/main/chacha8rand.md >>>> which prety much outperforms even non-CS algorithms in at least 64 bit x86. >>>> but the question of the state remains.global? TLS? how to discard it >>>> in all the appropriate occasions? >>> >>> And this is the arc4random in userspace discussion all over again. >> >> Agreed. But what has changed that we know now that Linux won't provide >> us with vDSO acceleration for arc4random. So I think it wouldn't be >> unreasonable to roll our own. Right now, the switch to arc4random >> provided by glibc is a massive performance regression compared to other >> implementations. > > > Afaik the other path that hasn't been tried is rseq no ? could the > kernel provide a random state this way that it is cleared on the right > conditions.. If I recall correctly the main issues with initial arc4random implementation, and which lead us to roll back the chacha20 optimized ones (similar to what other OS do); was when and how to reset the state (there are other minor hardening, like to no spill out state on memory that vDSO experiment also did). The initial attempt used some heuristics, like when it reached a certain limits of byes; plus some obvious points (like fork); that might not be sufficient for all cases. And even if arc4random is explicit a non CPRNG, there were some worries that users might misuse the interface and thus add some security issues. I had to check again, but last time I checked on how OpenBSD uses this interface was mainly for hardening (for instance on malloc and other stuff) and to provide fast PRNG for simulation and such.
On Thu, Mar 21, 2024, at 11:53 AM, Adhemerval Zanella Netto wrote: > And even if arc4random is explicit a non CPRNG, there were some worries that > users might misuse the interface and thus add some security issues. No opinion about anything else in this thread, but if we add arc4random at all it MUST be a CSPRNG. That's a documented guarantee on all the systems that do have it, and applications rely on it. zw
On 22/03/24 11:27, Zack Weinberg wrote: > On Thu, Mar 21, 2024, at 11:53 AM, Adhemerval Zanella Netto wrote: >> And even if arc4random is explicit a non CPRNG, there were some worries that >> users might misuse the interface and thus add some security issues. > > No opinion about anything else in this thread, but if we add arc4random at all > it MUST be a CSPRNG. That's a documented guarantee on all the systems that > do have it, and applications rely on it. Yeah, this is another point of contention where one might consider that a userland CPRNG that has no feedback from kernel to where/how to properly reseed might not be considered a CPRNG.
On Fri, Mar 22, 2024, at 10:46 AM, Adhemerval Zanella Netto wrote: > On 22/03/24 11:27, Zack Weinberg wrote: >> On Thu, Mar 21, 2024, at 11:53 AM, Adhemerval Zanella Netto wrote: >>> And even if arc4random is explicit a non CPRNG, there were some worries that >>> users might misuse the interface and thus add some security issues. >> >> No opinion about anything else in this thread, but if we add arc4random at all >> it MUST be a CSPRNG. That's a documented guarantee on all the systems that >> do have it, and applications rely on it. > > Yeah, this is another point of contention where one might consider that a > userland CPRNG that has no feedback from kernel to where/how to properly > reseed might not be considered a CPRNG. I would describe that as a "CSPRNG with a known bug that makes it unsuitable for use under some conditions", but not as "not a CSPRNG". I would only call it "not a CSPRNG" if the cryptographic primitives were no good (e.g. RC4 or Xorshift or something even more predictable) or if there was a way to leak or clone the state *in a single-threaded program that does not fork*. On a related note, why is MADV_WIPEONFORK not adequate "feedback from the kernel"? zw
On 22/03/24 12:30, Zack Weinberg wrote: > On Fri, Mar 22, 2024, at 10:46 AM, Adhemerval Zanella Netto wrote: >> On 22/03/24 11:27, Zack Weinberg wrote: >>> On Thu, Mar 21, 2024, at 11:53 AM, Adhemerval Zanella Netto wrote: >>>> And even if arc4random is explicit a non CPRNG, there were some worries that >>>> users might misuse the interface and thus add some security issues. >>> >>> No opinion about anything else in this thread, but if we add arc4random at all >>> it MUST be a CSPRNG. That's a documented guarantee on all the systems that >>> do have it, and applications rely on it. >> >> Yeah, this is another point of contention where one might consider that a >> userland CPRNG that has no feedback from kernel to where/how to properly >> reseed might not be considered a CPRNG. > > I would describe that as a "CSPRNG with a known bug that makes it unsuitable > for use under some conditions", but not as "not a CSPRNG". I would only > call it "not a CSPRNG" if the cryptographic primitives were no good > (e.g. RC4 or Xorshift or something even more predictable) or if there was > a way to leak or clone the state *in a single-threaded program that does > not fork*. I tend to agree, but the contention point was really 'that makes it unsuitable for use under some conditions' was a deal breaker in face that kernel provides an API with better guarantees. > > On a related note, why is MADV_WIPEONFORK not adequate "feedback from the > kernel"? If I recall correctly, the problem was not only state wipe on fork (with MADV_WIPEONFORK should take care), but rather when the state needs to be reseed due various situations outside of the userland knowledge (on the arc4random thread Jason gave us some examples, I don't really recall all of them by hearth). That's why the idea of providing the arc4random through a vDSO primitive (where kernel can reseed any time it likes).
On 2024-03-22 14:05, Adhemerval Zanella Netto wrote: > > > On 22/03/24 12:30, Zack Weinberg wrote: >> On Fri, Mar 22, 2024, at 10:46 AM, Adhemerval Zanella Netto wrote: >>> On 22/03/24 11:27, Zack Weinberg wrote: >>>> On Thu, Mar 21, 2024, at 11:53 AM, Adhemerval Zanella Netto wrote: >>>>> And even if arc4random is explicit a non CPRNG, there were some worries that >>>>> users might misuse the interface and thus add some security issues. >>>> >>>> No opinion about anything else in this thread, but if we add arc4random at all >>>> it MUST be a CSPRNG. That's a documented guarantee on all the systems that >>>> do have it, and applications rely on it. >>> >>> Yeah, this is another point of contention where one might consider that a >>> userland CPRNG that has no feedback from kernel to where/how to properly >>> reseed might not be considered a CPRNG. >> >> I would describe that as a "CSPRNG with a known bug that makes it unsuitable >> for use under some conditions", but not as "not a CSPRNG". I would only >> call it "not a CSPRNG" if the cryptographic primitives were no good >> (e.g. RC4 or Xorshift or something even more predictable) or if there was >> a way to leak or clone the state *in a single-threaded program that does >> not fork*. > > I tend to agree, but the contention point was really 'that makes it unsuitable > for use under some conditions' was a deal breaker in face that kernel provides > an API with better guarantees. > >> >> On a related note, why is MADV_WIPEONFORK not adequate "feedback from the >> kernel"? > > If I recall correctly, the problem was not only state wipe on fork (with > MADV_WIPEONFORK should take care), but rather when the state needs to be > reseed due various situations outside of the userland knowledge (on the > arc4random thread Jason gave us some examples, I don't really recall all > of them by hearth). That's why the idea of providing the arc4random through > a vDSO primitive (where kernel can reseed any time it likes). If the goal is to let userspace know that it needs to reseed due to various kernel events happening, one way I see we could extend rseq to support this would be to add a 64-bit "seed generation counter" in the struct rseq per-thread area which would be incremented by the kernel when the seed needs to be updated in userspace. This would allow many userspace PRNG libraries to use this facility within a process, as there would be no single per-library location the kernel needs to reset, and would remove all knowledge of the PRNG internal state from the kernel. Thoughts ? Thanks, Mathieu
On Fri, Mar 22, 2024 at 4:47 PM Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote: > > On 2024-03-22 14:05, Adhemerval Zanella Netto wrote: > > > > > > On 22/03/24 12:30, Zack Weinberg wrote: > >> On Fri, Mar 22, 2024, at 10:46 AM, Adhemerval Zanella Netto wrote: > >>> On 22/03/24 11:27, Zack Weinberg wrote: > >>>> On Thu, Mar 21, 2024, at 11:53 AM, Adhemerval Zanella Netto wrote: > If the goal is to let userspace know that it needs to reseed due to > various kernel events happening, Yes, this is the missing information.. "some event happened that the kernel knows about we need to drop the state and start over" being fork, suspend, resume, whatever future unknown event the kernel might consider.
On Fri, Mar 22, 2024, at 3:47 PM, Mathieu Desnoyers wrote: > On 2024-03-22 14:05, Adhemerval Zanella Netto wrote: >> On 22/03/24 12:30, Zack Weinberg wrote: >>> I would describe that as a "CSPRNG with a known bug that makes it >>> unsuitable for use under some conditions", but not as "not a CSPRNG". ... >> I tend to agree, but the contention point was really 'that makes it >> unsuitable for use under some conditions' was a deal breaker in face that >> kernel provides an API with better guarantees. How strong exactly are the guarantees that OpenBSD provides for its arc4random? I don't think we *need* to do any better than that, although obviously we should if we can. >>> On a related note, why is MADV_WIPEONFORK not adequate "feedback from the >>> kernel"? >> If I recall correctly, the problem was not only state wipe on fork (with >> MADV_WIPEONFORK should take care), but rather when the state needs to be >> reseed due various situations outside of the userland knowledge ... > If the goal is to let userspace know that it needs to reseed due to > various kernel events happening, one way I see we could extend rseq > to support this would be to add a 64-bit "seed generation counter" > in the struct rseq per-thread area which would be incremented by the > kernel when the seed needs to be updated in userspace. I don't know hardly anything about rseq. I think that sounds workable from libc's side of the fence; the remaining questions I see are 1) Will the kernel take your patch? 2) Is it OK for us to provide an arc4random implementation that uses this generation counter when available, but, when it's not available, doesn't reseed on these events that are invisible to user space? --- Independently, I propose that the existing non-cryptographic PRNGs (rand(), random(), etc.) should all be changed to run off a thread-local scrambled-linear generator (https://vigna.di.unimi.it/ftp/papers/ScrambledLinear.pdf). These have better statistical properties than anything we currently offer, and a state space that's small enough (256 bits) that it's reasonable for us to have one per thread, obviating locking concerns. zw
On 2024-03-23 10:01, Zack Weinberg wrote: [...] > ... >> If the goal is to let userspace know that it needs to reseed due to >> various kernel events happening, one way I see we could extend rseq >> to support this would be to add a 64-bit "seed generation counter" >> in the struct rseq per-thread area which would be incremented by the >> kernel when the seed needs to be updated in userspace. > > I don't know hardly anything about rseq. I think that sounds workable > from libc's side of the fence; the remaining questions I see are > > 1) Will the kernel take your patch? I can start by creating a proof-of-concept patch. If there are use-cases justifying its integration, I don't see why the other rseq co-maintainers would object. As maintainer of the Linux kernel rseq system call, I would be OK with it as long as it fits in the RSEQ design constraints: each new rseq field should support many users per process (executable and libraries), and we should try to minimize coupling between kernel and user-space. There are a few things I would need to know to create a prototype: - Do we need a 64-bit counter for this, or is 32-bit enough ? - What kernel events are we interested in ? I suspect that some are global (e.g. sleep, hibernate) and some are per-process (e.g. fork/clone). Are there other events I am missing here ? - At which point would the generation counter be validated ? Would that be before generating a PRN or after ? If it's before, then what happens to the validity of this PNG if the kernel event happens exactly while the PRN is being generated ? > 2) Is it OK for us to provide an arc4random implementation that uses > this generation counter when available, but, when it's not available, > doesn't reseed on these events that are invisible to user space? That's up to you really. Or you could make this configurable: a user could request a PRNG with security guarantees, or not. rseq provides interfaces to query which fields are supported, so depending on the user needs, your library could either return an error or allow generating a less-secure random number. > > --- > > Independently, I propose that the existing non-cryptographic PRNGs > (rand(), random(), etc.) should all be changed to run off a thread-local > scrambled-linear generator > (https://vigna.di.unimi.it/ftp/papers/ScrambledLinear.pdf). These have > better statistical properties than anything we currently offer, and a > state space that's small enough (256 bits) that it's reasonable for us > to have one per thread, obviating locking concerns. I've been wondering if doing so would break the PRNG guarantees from a whole-program (multithreaded) perspective: let's suppose we have many threads which are synchronized in such a way that the order of execution of statements across threads is guaranteed to be the same across runs. Should they expect a global PRNG with the same initial seed to have the same behavior ? How would the per-thread seeds allow the global PRNG walk to be reproducible across runs ? Thanks, Mathieu
On Sat, Mar 23, 2024 at 11:02 AM Zack Weinberg <zack@owlfolio.org> wrote: > Independently, I propose that the existing non-cryptographic PRNGs > (rand(), random(), etc.) Afaik that's possible .. except the *48 interfaces that are explicit on what algorithm to use..
* Zack Weinberg: > On Fri, Mar 22, 2024, at 3:47 PM, Mathieu Desnoyers wrote: >> On 2024-03-22 14:05, Adhemerval Zanella Netto wrote: >>> On 22/03/24 12:30, Zack Weinberg wrote: > >>>> I would describe that as a "CSPRNG with a known bug that makes it >>>> unsuitable for use under some conditions", but not as "not a CSPRNG". > ... >>> I tend to agree, but the contention point was really 'that makes it >>> unsuitable for use under some conditions' was a deal breaker in face that >>> kernel provides an API with better guarantees. > > How strong exactly are the guarantees that OpenBSD provides for its > arc4random? I don't think we *need* to do any better than that, > although obviously we should if we can. I don't think OpenBSD deals with virtualization in this context. I don't know their reasons, but the use case must be vanishingly small. I don't expect that there are many who worry about key disclosure due to VM snapshots and live migration, and, at the same time, are fine with virtualization itself as potential source of leaks. > Independently, I propose that the existing non-cryptographic PRNGs > (rand(), random(), etc.) should all be changed to run off a thread-local > scrambled-linear generator > (https://vigna.di.unimi.it/ftp/papers/ScrambledLinear.pdf). These have > better statistical properties than anything we currently offer, and a > state space that's small enough (256 bits) that it's reasonable for us > to have one per thread, obviating locking concerns. I think that's only possible if the process has not called srand. Thanks, Florian
On 2024-03-23 11:23, Mathieu Desnoyers wrote: > On 2024-03-23 10:01, Zack Weinberg wrote: > > [...] > >> ... >>> If the goal is to let userspace know that it needs to reseed due to >>> various kernel events happening, one way I see we could extend rseq >>> to support this would be to add a 64-bit "seed generation counter" >>> in the struct rseq per-thread area which would be incremented by the >>> kernel when the seed needs to be updated in userspace. >> >> I don't know hardly anything about rseq. I think that sounds workable >> from libc's side of the fence; the remaining questions I see are >> >> 1) Will the kernel take your patch? > > I can start by creating a proof-of-concept patch. If there are > use-cases justifying its integration, I don't see why the other > rseq co-maintainers would object. Reading the follow-up discussion on the thread, I now understand that a generation counter is probably not sufficient. Or maybe there are more than one use-case here ? If the intended purpose is to have the kernel wipe out the keys before going to sleep/hibernate, then you'll want an integration similar to madvise MADV_WIPEONFORK, but with a new semantic that also wipes on sleep/hibernate and other relevant events. If the intent is to make sure that random numbers get re-seeded after those kernel events, then rseq with a generation counter would work. Thanks, Mathieu
On 25/03/24 11:09, Mathieu Desnoyers wrote: > On 2024-03-23 11:23, Mathieu Desnoyers wrote: >> On 2024-03-23 10:01, Zack Weinberg wrote: >> >> [...] >> >>> ... >>>> If the goal is to let userspace know that it needs to reseed due to >>>> various kernel events happening, one way I see we could extend rseq >>>> to support this would be to add a 64-bit "seed generation counter" >>>> in the struct rseq per-thread area which would be incremented by the >>>> kernel when the seed needs to be updated in userspace. >>> >>> I don't know hardly anything about rseq. I think that sounds workable >>> from libc's side of the fence; the remaining questions I see are >>> >>> 1) Will the kernel take your patch? >> >> I can start by creating a proof-of-concept patch. If there are >> use-cases justifying its integration, I don't see why the other >> rseq co-maintainers would object. > > Reading the follow-up discussion on the thread, I now understand > that a generation counter is probably not sufficient. Or maybe there > are more than one use-case here ? > > If the intended purpose is to have the kernel wipe out the keys > before going to sleep/hibernate, then you'll want an integration > similar to madvise MADV_WIPEONFORK, but with a new semantic that > also wipes on sleep/hibernate and other relevant events. I don't fully recall all the details, but I think that's was Jason idea with the getrandom vDSO strategy. It used a new syscall to allocate the state and this syscall used internally a new mmap flag to trigger the required wipe/reset on the required events. He also wanted to prevent userland to incorrectly use the syscall, so he moved the buffer allocation to vDSO as well. The vDSO implementation used the same cypher as the kernel one (chacha20, with a extra constraint to avoid leak internal state on the stack), and subdivides the state region in backets that were addressed based on thread pointer. Each bucket kept a generation counter, which was updated by the kernel. Any failure (either in generation check or contention on backets) fallback to getrandom syscall. > > If the intent is to make sure that random numbers get re-seeded > after those kernel events, then rseq with a generation counter > would work. > > Thanks,
diff --git a/stdlib/random.c b/stdlib/random.c index 62f22fac8d58c7977f09c134bf80a797750da645..174603a8915fd8aa4b3ae64d023003c9e2c038f2 100644 --- a/stdlib/random.c +++ b/stdlib/random.c @@ -51,6 +51,7 @@ SUCH DAMAGE.*/ #include <libc-lock.h> +#include <sys/single_threaded.h> #include <limits.h> #include <stddef.h> #include <stdlib.h> @@ -288,6 +289,12 @@ __random (void) { int32_t retval; + if (SINGLE_THREAD_P) + { + (void) __random_r (&unsafe_state, &retval); + return retval; + } + __libc_lock_lock (lock); (void) __random_r (&unsafe_state, &retval);