
[DRAFT] random32: make prandom_u32() output unpredictable

Message ID 20200809065744.GA17668@SDF.ORG
State Not Applicable
Delegated to: David Miller
Series: [DRAFT] random32: make prandom_u32() output unpredictable

Commit Message

George Spelvin Aug. 9, 2020, 6:57 a.m. UTC
Non-cryptographic PRNGs may have great statistical properties, but
are usually trivially predictable to someone who knows the algorithm,
given a small sample of their output.  An LFSR like prandom_u32() is
particularly simple, even if the sample is widely scattered bits.

It turns out the network stack uses prandom_u32() for some things like
random port numbers which it would prefer are *not* trivially predictable.
Predictability led to a practical DNS spoofing attack.  Oops.

This patch replaces the LFSR with a homebrew cryptographic PRNG based
on the SipHash round function, which is in turn seeded with 128 bits
of strong random key.  (The authors of SipHash have *not* been consulted
about this abuse of their algorithm.)  Speed is prioritized over security;
attacks are rare, while performance is always wanted.

Replacing all callers of prandom_u32() is the quick fix.
Whether to reinstate a weaker PRNG for uses which can tolerate it
is an open question.

TODO: The prandom_seed() function is currently a no-op stub.
It's not obvious how to use this additional data, so suggestions are
invited.

f227e3ec3b5c ("random32: update the net random state on interrupt and
activity") was an earlier attempt at a solution.  This patch
replaces it; revert it before applying this one.

Reported-by: Amit Klein <aksecurity@gmail.com>
Cc: Willy Tarreau <w@1wt.eu>
Cc: Eric Dumazet <edumazet@google.com>
Cc: "Jason A. Donenfeld" <Jason@zx2c4.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: <stable@vger.kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Fixes: f227e3ec3b5c ("random32: update the net random state on interrupt and activity")
Replaces: f227e3ec3b5c ("random32: update the net random state on interrupt and activity")
Signed-off-by: George Spelvin <lkml@sdf.org>
---
 lib/random32.c | 437 ++++++++++++++++++++++++++++++-------------------
 1 file changed, 267 insertions(+), 170 deletions(-)

I haven't yet rebooted, so this is only build-tested, but it should illustrate
the idea.

Comments

Willy Tarreau Aug. 9, 2020, 9:38 a.m. UTC | #1
Hi George,

On Sun, Aug 09, 2020 at 06:57:44AM +0000, George Spelvin wrote:
> +/*
> + * This is the core CPRNG function.  As "pseudorandom", this is not used
> + * for truly valuable things, just intended to be a PITA to guess.
> + * For maximum speed, we do just two SipHash rounds per word.  This is
> + * the same rate as 4 rounds per 64 bits that SipHash normally uses,
> + * so hopefully it's reasonably secure.
> + *
> + * There are two changes from the official SipHash finalization:
> + * - We omit some constants XORed with v2 in the SipHash spec as irrelevant;
> + *   they are there only to make the output rounds distinct from the input
> + *   rounds, and this application has no input rounds.
> + * - Rather than returning v0^v1^v2^v3, return v1+v3.
> + *   If you look at the SipHash round, the last operation on v3 is
> + *   "v3 ^= v0", so "v0 ^ v3" just undoes that, a waste of time.
> + *   Likewise "v1 ^= v2".  (The rotate of v2 makes a difference, but
> + *   it still cancels out half of the bits in v2 for no benefit.)
> + *   Second, since the last combining operation was xor, continue the
> + *   pattern of alternating xor/add for a tiny bit of extra non-linearity.
> + */
> +static u32 siprand_u32(struct siprand_state *s)
> +{
> +	unsigned long v0 = s->v[0], v1 = s->v[1], v2 = s->v[2], v3 = s->v[3];
> +
> +	SIPROUND(v0, v1, v2, v3);
> +	SIPROUND(v0, v1, v2, v3);
> +	s->v[0] = v0;  s->v[1] = v1;  s->v[2] = v2;  s->v[3] = v3;
> +	return v1 + v3;
> +}
> +
> +
> +/**
> + *	prandom_u32 - pseudo random number generator
> + *
> + *	A 32 bit pseudo-random number is generated using a fast
> + *	algorithm suitable for simulation. This algorithm is NOT
> + *	considered safe for cryptographic use.
> + */
> +u32 prandom_u32(void)
> +{
> +	struct siprand_state *state = get_cpu_ptr(&net_rand_state);
> +	u32 res = siprand_u32(state);
> +
> +	put_cpu_ptr(&net_rand_state);
> +	return res;
> +}

So I gave it a quick test under Qemu and it didn't show any obvious
performance difference compared to Tausworthe, which is a good thing,
even though there's a significant amount of measurement noise in each
case.

However it keeps the problem that the whole sequence is entirely
determined at the moment of reseeding, so if one were to be able to
access the state, e.g. using l1tf/spectre/meltdown/whatever, then
this state could be used to predict the whole ongoing sequence for
the next minute. What some view as a security feature, others will
see as a backdoor :-/  That's why I really like the noise approach.
Even just the below would significantly harm that capability because
that state alone isn't sufficient anymore to pre-compute all future
values:

--- a/lib/random32.c
+++ b/lib/random32.c
@@ -375,6 +375,7 @@ static u32 siprand_u32(struct siprand_state *s)
 {
        unsigned long v0 = s->v[0], v1 = s->v[1], v2 = s->v[2], v3 = s->v[3];
 
+       v0 += get_cycles();
        SIPROUND(v0, v1, v2, v3);
        SIPROUND(v0, v1, v2, v3);
        s->v[0] = v0;  s->v[1] = v1;  s->v[2] = v2;  s->v[3] = v3;

Regards,
Willy
Randy Dunlap Aug. 9, 2020, 1:50 p.m. UTC | #2
On 8/8/20 11:57 PM, George Spelvin wrote:

> +#elif BITS_PER_LONG == 23

s/23/32/ ???
George Spelvin Aug. 9, 2020, 4:08 p.m. UTC | #3
On Sun, Aug 09, 2020 at 11:26:31AM +0300, Amit Klein wrote:
> BITS_PER_LONG==23 ???
> Should be ==32, I guess.

Of course.  I've been testing on a 64-bit system so hadn't
exercised that branch yet, and it's a case of "seeing what
I expect to see".
George Spelvin Aug. 9, 2020, 5:06 p.m. UTC | #4
On Sun, Aug 09, 2020 at 11:38:05AM +0200, Willy Tarreau wrote:
> So I gave it a quick test under Qemu and it didn't show any obvious
> performance difference compared to Tausworthe, which is a good thing,
> even though there's a significant amount of measurement noise in each
> case.

Thank you very much!  I'm not quite sure how to benchmark this.
The whole idea is that it's *not* used in a tight cache-hot loop.
Hopefully someone already has a test setup so I don't have to invent
one.

> However it keeps the problem that the whole sequence is entirely
> determined at the moment of reseeding, so if one were to be able to
> access the state, e.g. using l1tf/spectre/meltdown/whatever, then
> this state could be used to predict the whole ongoing sequence for
> the next minute. What some view as a security feature, others will
> see as a backdoor :-/  That's why I really like the noise approach.
> Even just the below would significantly harm that capability because
> that state alone isn't sufficient anymore to pre-compute all future
> values:
> 
> --- a/lib/random32.c
> +++ b/lib/random32.c
> @@ -375,6 +375,7 @@ static u32 siprand_u32(struct siprand_state *s)
>  {
>         unsigned long v0 = s->v[0], v1 = s->v[1], v2 = s->v[2], v3 = s->v[3];
>  
> +       v0 += get_cycles();
>         SIPROUND(v0, v1, v2, v3);
>         SIPROUND(v0, v1, v2, v3);
>         s->v[0] = v0;  s->v[1] = v1;  s->v[2] = v2;  s->v[3] = v3;

As long as:
1) The periodic catastrophic reseeding remains, and
2) You use fresh measurements, not the exact same bits
   that add_*_randomness feeds into /dev/random
then it doesn't do any real harm, so if it makes you feel better...

But I really want to stress how weak a design drip-reseeding is.

If an attacker has enough local machine access to do a meltdown-style attack,
then they can calibrate the TSC used in get_cycles very accurately, so the
remaining timing uncertainty is very low.  This makes a brute-force attack on
one or two reseedings quite easy.  I.e. if you can see every other output,
it's straightforward to figure out the ones in between.

I wonder if, on general principles, it would be better to use a more
SipHash style mixing in of the sample:
	m = get_cycles();
	v3 ^= m;
	SIPROUND(v0, v1, v2, v3);
	SIPROUND(v0, v1, v2, v3);
	v0 ^= m;

Not sure if it's worth the extra register (and associated spill/fill).
Willy Tarreau Aug. 9, 2020, 5:33 p.m. UTC | #5
On Sun, Aug 09, 2020 at 05:06:39PM +0000, George Spelvin wrote:
> On Sun, Aug 09, 2020 at 11:38:05AM +0200, Willy Tarreau wrote:
> > So I gave it a quick test under Qemu and it didn't show any obvious
> > performance difference compared to Tausworthe, which is a good thing,
> > even though there's a significant amount of measurement noise in each
> > case.
> 
> Thank you very much!  I'm not quite sure how to benchmark this.
> The whole idea is that it's *not* used in a tight cache-hot loop.
> Hopefully someone already has a test setup so I don't have to invent
> one.

Due to limited access to representative hardware, the two main tests
I've been running were on my laptop in qemu, and consisted of:

   - a connect-accept-close test to stress the accept() code and
     verify we don't observe a significant drop. The thing is that
     connect() usually is much slower and running the two on the
     same machine tends to significantly soften the differences
     compared to what a real machine would see when handling a
     DDoS for example.

   - a packet rate test through this rule (which uses prandom_u32()
     for each packet and which matches what can be done in packet
     schedulers or just by users having to deal with random drop):

     iptables -I INPUT -m statistic --probability 0.5 -j ACCEPT

While these tests are not very representative, especially in a VM, the
absence of significant variations suggests that we should not see a
catastrophic loss.
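For reference, the first test is roughly this shape (a throwaway sketch,
not the actual tool; error handling and TIME_WAIT management omitted):

	#include <arpa/inet.h>
	#include <netinet/in.h>
	#include <stdio.h>
	#include <sys/socket.h>
	#include <unistd.h>

	int main(void)
	{
		struct sockaddr_in sa = { .sin_family = AF_INET,
					  .sin_port = htons(12345) };
		int lfd = socket(AF_INET, SOCK_STREAM, 0), one = 1;
		unsigned long n;

		sa.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
		setsockopt(lfd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));
		bind(lfd, (struct sockaddr *)&sa, sizeof(sa));
		listen(lfd, 128);

		for (n = 0; ; n++) {
			/* each connect() consumes one random source port */
			int cfd = socket(AF_INET, SOCK_STREAM, 0);

			connect(cfd, (struct sockaddr *)&sa, sizeof(sa));
			close(accept(lfd, NULL, NULL));
			close(cfd);
			if (!(n % 100000))
				printf("%lu connections\n", n);
		}
	}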

> > However it keeps the problem that the whole sequence is entirely
> > determined at the moment of reseeding, so if one were to be able to
> > access the state, e.g. using l1tf/spectre/meltdown/whatever, then
> > this state could be used to predict the whole ongoing sequence for
> > the next minute. What some view as a security feature, others will
> > see as a backdoor :-/  That's why I really like the noise approach.
> > Even just the below would significantly harm that capability because
> > that state alone isn't sufficient anymore to pre-compute all future
> > values:
> > 
> > --- a/lib/random32.c
> > +++ b/lib/random32.c
> > @@ -375,6 +375,7 @@ static u32 siprand_u32(struct siprand_state *s)
> >  {
> >         unsigned long v0 = s->v[0], v1 = s->v[1], v2 = s->v[2], v3 = s->v[3];
> >  
> > +       v0 += get_cycles();
> >         SIPROUND(v0, v1, v2, v3);
> >         SIPROUND(v0, v1, v2, v3);
> >         s->v[0] = v0;  s->v[1] = v1;  s->v[2] = v2;  s->v[3] = v3;
> 
> As long as:
> 1) The periodic catastrophic reseeding remains, and
> 2) You use fresh measurements, not the exact same bits
>    that add_*_randomness feeds into /dev/random
> then it doesn't do any real harm, so if it makes you feel better...
> 
> But I really want to stress how weak a design drip-reseeding is.
> 
> If an attacker has enough local machine access to do a meltdown-style attack,
> then they can calibrate the TSC used in get_cycles very accurately,

Absolutely.

> so the
> remaining timing uncertainty is very low.

Not that low in fact, because they don't know precisely when the call is
made. I mean, let's say we're in the worst case, with two VMs running on
two siblings of the same core, with the same TSC, on a 3 GHz machine. The
attacker can stress the victim at 100k probes per second. That's still
about 15 bits of uncertainty (3 GHz / 100k probes/s = 30,000 cycles, i.e.
roughly 2^15) on the TSC value estimation which is added to each call.
Even on the first call this is enough to make a source port unguessable,
and it prevents the attacker from staying synchronized with its victim.
And I'm only speaking about an idle remote machine, not even one taking
unobservable traffic, which further adds to the difficulty.

> This makes a brute-force attack on
> one or two reseedings quite easy.  I.e. if you can see every other output,
> it's straightforward to figure out the ones in between.

But they already become useless because you're only observing stuff of
the past.

> I wonder if, on general principles, it would be better to use a more
> SipHash style mixing in of the sample:
> 	m = get_cycles();
> 	v3 ^= m;
> 	SIPROUND(v0, v1, v2, v3);
> 	SIPROUND(v0, v1, v2, v3);
> 	v0 ^= m;

Probably, if it's the recommended way to mix in other values, yes.

> Not sure if it's worth the extra register (and associated spill/fill).

If this makes the hash better, maybe. I can run some tests on this as
well. I'd really need to try on a cleaner setup, I have remote machines
at the office but I don't feel safe enough to remotely reboot them and
risk to lose them :-/

I'll also test on arm/arm64 to make sure we don't introduce a significant
cost there.

Willy
George Spelvin Aug. 9, 2020, 6:30 p.m. UTC | #6
On Sun, Aug 09, 2020 at 07:33:03PM +0200, Willy Tarreau wrote:
> Not that low in fact because they don't know precisely when the call is
> made. I mean, let's say we're in the worst case, with two VMs running on
> two siblings of the same core, with the same TSC, on a 3 GHz machine. The
> attacker can stress the victim at 100k probes per second. That's still
> 15 bits of uncertainty on the TSC value estimation which is added to each
> call. Even on the first call this is enough to make a source port
> unguessable, and preventing the attacker from staying synchronized with
> its victim. And I'm only speaking about an idle remote machine, not even
> one taking unobservable traffic, which further adds to the difficulty.

I'm trying to understand your attack scenario.  I'm assuming that an
attacker can call prandom_u32() locally.  (I don't have a specific code
path, but given the number of uses in the kernel, I assume *one* of
them will leak the output directly.)  And repeat the call fast
enough that there's at most *one* other user between our calls.

If an attacker knows the initial state, does an rdtsc, prandom_u32(),
and a second rdtsc, then they can guess the TSC value used in that
prandom_u32() quite accurately (4-6 bits fuzz, perhaps).  This is
trivial to brute force.

The fun comes if someone else does a prandom_u32() call in between.

All of a sudden, the 4-6 bit brute force of one get_cycles() value
fails to find a solution.  Someone else has called prandom_u32()!

Now we have 15 bits of uncertainty about that other call, and 5 bits
of uncertainty about our call.  2^20 possibilities only takes a few
milliseconds to test, and the 32-bit output of prandom_u32() can
verify a guess with minimal probability of error.
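Concretely, that search is about this much work (a userspace sketch only,
assuming Willy's "v0 += get_cycles()" variant and a fully known prior
state; guess_deltas() and the 15/5-bit windows are illustrative, and the
timing model is deliberately simplified):

	#include <stdint.h>

	#define ROL64(x, r) (((x) << (r)) | ((x) >> (64 - (r))))
	#define SIPROUND(v0, v1, v2, v3) ( \
		v0 += v1, v1 = ROL64(v1, 13),  v2 += v3, v3 = ROL64(v3, 16), \
		v1 ^= v0, v0 = ROL64(v0, 32),  v3 ^= v2,                     \
		v0 += v3, v3 = ROL64(v3, 21),  v2 += v1, v1 = ROL64(v1, 17), \
		v3 ^= v0,                      v1 ^= v2, v2 = ROL64(v2, 32)  )

	/* One generator step with a guessed TSC value mixed in. */
	static uint32_t step(uint64_t v[4], uint64_t tsc)
	{
		v[0] += tsc;
		SIPROUND(v[0], v[1], v[2], v[3]);
		SIPROUND(v[0], v[1], v[2], v[3]);
		return (uint32_t)(v[1] + v[3]);
	}

	/*
	 * Try every (their delta, our delta) pair: 2^15 * 2^5 = 2^20
	 * candidates, each verified against our observed 32-bit output,
	 * so false positives are ~2^-12 per search.
	 */
	static int guess_deltas(const uint64_t known[4], uint64_t tsc_base,
				uint32_t observed, uint64_t out[2])
	{
		uint64_t d1, d2;

		for (d1 = 0; d1 < (1UL << 15); d1++) {
			for (d2 = 0; d2 < (1UL << 5); d2++) {
				uint64_t v[4] = { known[0], known[1],
						  known[2], known[3] };

				step(v, tsc_base + d1);          /* their call */
				if (step(v, tsc_base + d1 + d2) == observed) {
					out[0] = d1;
					out[1] = d2;
					return 1;
				}
			}
		}
		return 0;
	}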

(Note that, to maintain tracking, we have to keep hammering
prandom_u32() *during* the search, but we can just buffer the results
and process them after the expensive search is complete.)

What you can see here is the incredible power of *multiple* unobserved
seedings.  As long as an attacker can limit things to one unobserved
prandom_u32(), it's a simple brute force.

If there are more than one, the additional bits of uncertainty
quickly make things impractical.

This is why I'm so keen on less frequent, more catastrophic,
reseeding.  Yes, the delay means an attacker who has captured
the state retains full knowledge for longer.  But they get
kicked off as soon as the catastrophe happens.  Without it, they
can keep tracking the state indefinitely.

Even something simple like buffering 8 TSC samples, and adding them
at 32-bit offsets across the state every 8th call, would make a huge
difference.

Even if 7 of the timestamps are from attacker calls (4 bits
uncertainty), the time of the target call is 8x less known
(so it goes from 15 to 18 bits), and the total becomes
46 bits.  *So* much better.

> I can run some tests on this as
> well. I'd really need to try on a cleaner setup, I have remote machines
> at the office but I don't feel safe enough to remotely reboot them and
> risk to lose them :-/

Yeah, test kernels are nervous-making that way.

> I'll also test on arm/arm64 to make sure we don't introduce a significant
> cost there.

I don't expect a problem, but SipHash is optimized for 4-issue processors,
and on narrower machines fewer instructions are "free".
Willy Tarreau Aug. 9, 2020, 7:16 p.m. UTC | #7
On Sun, Aug 09, 2020 at 06:30:17PM +0000, George Spelvin wrote:
> I'm trying to understand your attack scenario.  I'm assuming that an
> attacker can call prandom_u32() locally.

Well, first, I'm mostly focusing on remote attacks, because if the
attacker has permanent local access, he might as well just bind()
a source port instead of having to guess the next one that will be
used. Second, I'm not aware of direct access from userland, so the
calls to prandom_u32() are expected to be done only via specific
code parts. The worst case (in my opinion) is when the attacker
runs on the same physical CPU and exploits some memory leakage to
find the internal state and uses the same TSC, but can only trigger
prandom_u32() via the network, hence with much less precision.

(...)
> Even something simple like buffering 8 TSC samples, and adding them
> at 32-bit offsets across the state every 8th call, would make a huge
> difference.
> 
> Even if 7 of the timestamps are from attacker calls (4 bits
> uncertainty), the time of the target call is 8x less known
> (so it goes from 15 to 18 bits), and the total becomes
> 46 bits.  *So* much better.

Maybe. I'm just suggesting that we add *some* noise to keep things from
being exploitable if the state is stolen once in a while (which would
already be a big problem, admittedly).

> > I can run some tests on this as
> > well. I'd really need to try on a cleaner setup, I have remote machines
> > at the office but I don't feel safe enough to remotely reboot them and
> > risk to lose them :-/
> 
> Yeah, test kernels are nervous-making that way.

In fact I haven't booted a 5.8 kernel on the machine I'm thinking about
yet, and I can't remote-control its power supply, so I'm worried about
a dirty hang at boot due to a missing config entry. The risk is low but
that's annoying when it happens.

> > I'll also test on arm/arm64 to make sure we don't introduce a significant
> > cost there.
> 
> I don't expect a problem, but SipHash is optimized for 4-issue processors,
> and on narrower machines fewer instructions are "free".

OK.

Willy
Marc Plumb Aug. 9, 2020, 9:10 p.m. UTC | #8
(Resending since I accidentally sent it as HTML, which the netdev
mailing list doesn't like)

On 2020-08-09 2:05 p.m., Marc Plumb wrote:
>
> Willy,
>
>
> On 2020-08-07 3:19 p.m., Willy Tarreau wrote:
>>> If I can figure the state out once,
>> Yes but how do you take that as granted ? This state doesn't appear
>> without its noise counterpart, so taking as a prerequisite that you may
>> guess one separately obviously indicates that you then just have to
>> deduce the other, but the point of mixing precisely is that we do not
>> expose individual parts.
>
>
> On 2020-08-09 2:38 a.m., Willy Tarreau wrote:
>> However it keeps the problem that the whole sequence is entirely
>> determined at the moment of reseeding, so if one were to be able to
>> access the state, e.g. using l1tf/spectre/meltdown/whatever, then
>> this state could be used to predict the whole ongoing sequence for
>> the next minute. What some view as a security feature, others will
>> see as a backdoor :-/  That's why I really like the noise approach.
>> Even just the below would significantly harm that capability because
>> that state alone isn't sufficient anymore to pre-compute all future
>> values:
>
>
> So two days ago I was unreasonable for assuming an attacker could 
> recover the entire state, yet now you're assuming the same thing? As I 
> said before, if an attacker can recover the complete state, then 
> slowly adding noise doesn't help significantly since an attacker can 
> brute force the noise added (even if a perfect CPRNG is used).
>
> However, I think I'm starting to see your underlying assumptions. 
> You're thinking that raw noise data are the only truly unpredictable 
> thing so you think that adding it is a defense against attacks like 
> foreshadow/spectre/meltdown. You aren't entirely wrong -- if there was 
> a fast noise source then it might be a good option. Just if the noise 
> source is slow, you're just causing far more damage to a far more 
> critical crypto function to get very little benefit.
>
> If you want to add noise to the network PRNG, then you can't put the 
> same noise into the /dev/random CPRNG at all, in any way. I've tried 
> explaining the crypto reasons for this, and George has added to that, 
> so let me try a different approach: It breaks FIPS 140-2 for all of 
> Linux. While FIPS certification isn't a key driver, it is a 
> consideration for the crypt modules. FIPS references NIST.SP.800-90B 
> (which is specifically about Recommendation for the Entropy Sources 
> Used for Random Bit Generation) which has a requirement that the noise 
> source not pass any data used for crypto operations to anything 
> outside of the defined security boundary. If you want to feed a noise 
> measurement into the network PRNG, then you can't feed it into the 
> /dev/random pool also. You have to decide where the different 
> measurements are going to go and keep them completely separate.
>
> I'm not intimately familiar with the standards so I spoke with someone 
> who does FIPS 140-x certification and was told "I don't think the 
> standards even considered the idea that someone might pass data from a 
> conditioning pool to other functions. It completely violates the 
> security boundary concept so is just prohibited ... that type of 
> change would absolutely be a problem."
>
>
> Marc
>
Linus Torvalds Aug. 9, 2020, 9:48 p.m. UTC | #9
On Sun, Aug 9, 2020 at 2:10 PM Marc Plumb <lkml.mplumb@gmail.com> wrote:
>
> However, I think I'm starting to see your underlying assumptions.
> You're thinking that raw noise data are the only truly unpredictable
> thing so you think that adding it is a defense against attacks like
> foreshadow/spectre/meltdown. You aren't entirely wrong -- if there was
> a fast noise source then it might be a good option. Just if the noise
> source is slow, you're just causing far more damage to a far more
> critical crypto function to get very little benefit.

The only truly good noise source we have is any CPU native randomness.

Sadly, that is usually never really "fast". But we do use that for any
actual real /dev/random reading. We still whiten it through our
hashing, and we mix it in rather than use the raw CPU hw-provided
values (because not everybody trusts the CPU vendors either), but
/dev/random will most certainly take advantage of it as one source of
noise.

(Honesty in advertising: you can also use other interfaces that don't
bother with the remixing, and _will_ just return the raw CPU
randomness).

So if you make the (imho reasonable) assumption that you're running on
a modern enough CPU, and that the CPU hw randomness is of reasonable
quality, then you never need to worry about /dev/random. Every single
time you extract something from one of the pools, I think we mix in
that CPU randomness if it's available.

But the CPU randomness is too slow for the prandom code to use at
extraction time. It's on the order of a couple of hundred to a couple
of thousand cycles. That's peanuts for /dev/random, but quite a lot
for prandom.

In fact, at the slow end it is slow enough that you don't want to do
it at any fast granularity (ie not "every interrupt"), it's the "when
reseeding once a second" kind of slow.
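In other words, it only fits at reseed granularity, something like this
(sketch only, assuming the draft's siprand_state; prandom_reseed_cpu_noise()
is a made-up name, not anything in the posted patch):

	static void prandom_reseed_cpu_noise(struct siprand_state *state)
	{
		unsigned long seed;

		/*
		 * A few hundred/thousand cycles is fine once a minute,
		 * but not on every prandom_u32() call.
		 */
		if (!arch_get_random_seed_long(&seed) &&
		    !arch_get_random_long(&seed))
			seed = get_random_long();	/* pool fallback */

		state->v[3] ^= seed;
	}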

arm64 has randomness these days too, but that's only of the
"really slow, useful for seeding" variety. And I'm not sure which (if
any) CPU implementations out there actually do it yet.

Anyway, I suspect /dev/random has been over-engineered (I think we
have something like three layers of mixing bits _and_ the CPU
randomness _and_ all the interrupt randomness), and prandom may have
been left alone too much.

             Linus
Willy Tarreau Aug. 10, 2020, 11:47 a.m. UTC | #10
Hi George,

On Sun, Aug 09, 2020 at 06:30:17PM +0000, George Spelvin wrote:
> Even something simple like buffering 8 TSC samples, and adding them
> at 32-bit offsets across the state every 8th call, would make a huge
> difference.

Doing testing on real hardware showed that retrieving the TSC on every
call had a non negligible cost, causing a loss of 2.5% on the accept()
rate and 4% on packet rate when using iptables -m statistics. However
I reused your idea of accumulating old TSCs to increase the uncertainty
about their exact value, except that I retrieve it only on 1/8 calls
and use the previous noise in this case. With this I observe the same
performance as plain 5.8. Below are the connection rates accepted on
a single core :

        5.8           5.8+patch     5.8+patch+tsc
   192900-197900   188800->192200   194500-197500  (conn/s)

This was on a core i7-8700K. I looked at the asm code for the function
and it remains reasonably light, in the same order of complexity as the
original one, so I think we could go with that.

My proposed change is below, in case you have any improvements to suggest.

Regards,
Willy


diff --git a/lib/random32.c b/lib/random32.c
index 2b048e2ea99f..a12d63028106 100644
--- a/lib/random32.c
+++ b/lib/random32.c
@@ -317,6 +317,8 @@ static void __init prandom_state_selftest(void)
 
 struct siprand_state {
 	unsigned long v[4];
+	unsigned long noise;
+	unsigned long count;
 };
 
 static DEFINE_PER_CPU(struct siprand_state, net_rand_state) __latent_entropy;
@@ -334,7 +336,7 @@ static DEFINE_PER_CPU(struct siprand_state, net_rand_state) __latent_entropy;
 #define K0 (0x736f6d6570736575 ^ 0x6c7967656e657261 )
 #define K1 (0x646f72616e646f6d ^ 0x7465646279746573 )
 
-#elif BITS_PER_LONG == 23
+#elif BITS_PER_LONG == 32
 /*
  * On 32-bit machines, we use HSipHash, a reduced-width version of SipHash.
  * This is weaker, but 32-bit machines are not used for high-traffic
@@ -375,6 +377,12 @@ static u32 siprand_u32(struct siprand_state *s)
 {
 	unsigned long v0 = s->v[0], v1 = s->v[1], v2 = s->v[2], v3 = s->v[3];
 
+	if (++s->count >= 8) {
+		v3 ^= s->noise;
+		s->noise += random_get_entropy();
+		s->count = 0;
+	}
+
 	SIPROUND(v0, v1, v2, v3);
 	SIPROUND(v0, v1, v2, v3);
 	s->v[0] = v0;  s->v[1] = v1;  s->v[2] = v2;  s->v[3] = v3;
David Laight Aug. 10, 2020, 12:01 p.m. UTC | #11
From: Willy Tarreau
> Sent: 10 August 2020 12:47
> On Sun, Aug 09, 2020 at 06:30:17PM +0000, George Spelvin wrote:
> > Even something simple like buffering 8 TSC samples, and adding them
> > at 32-bit offsets across the state every 8th call, would make a huge
> > difference.
> 
> Doing testing on real hardware showed that retrieving the TSC on every
> call had a non negligible cost, causing a loss of 2.5% on the accept()
> rate and 4% on packet rate when using iptables -m statistics. However
> I reused your idea of accumulating old TSCs to increase the uncertainty
> about their exact value, except that I retrieve it only on 1/8 calls
> and use the previous noise in this case. With this I observe the same
> performance as plain 5.8. Below are the connection rates accepted on
> a single core :
> 
>         5.8           5.8+patch     5.8+patch+tsc
>    192900-197900   188800->192200   194500-197500  (conn/s)
> 
> This was on a core i7-8700K. I looked at the asm code for the function
> and it remains reasonably light, in the same order of complexity as the
> original one, so I think we could go with that.
> 
> My proposed change is below, in case you have any improvements to suggest.
> 
> Regards,
> Willy
> 
> 
> diff --git a/lib/random32.c b/lib/random32.c
> index 2b048e2ea99f..a12d63028106 100644
> --- a/lib/random32.c
> +++ b/lib/random32.c
> @@ -317,6 +317,8 @@ static void __init prandom_state_selftest(void)
> 
>  struct siprand_state {
>  	unsigned long v[4];
> +	unsigned long noise;
> +	unsigned long count;
>  };
> 
>  static DEFINE_PER_CPU(struct siprand_state, net_rand_state) __latent_entropy;
> @@ -334,7 +336,7 @@ static DEFINE_PER_CPU(struct siprand_state, net_rand_state) __latent_entropy;
>  #define K0 (0x736f6d6570736575 ^ 0x6c7967656e657261 )
>  #define K1 (0x646f72616e646f6d ^ 0x7465646279746573 )
> 
> -#elif BITS_PER_LONG == 23
> +#elif BITS_PER_LONG == 32
>  /*
>   * On 32-bit machines, we use HSipHash, a reduced-width version of SipHash.
>   * This is weaker, but 32-bit machines are not used for high-traffic
> @@ -375,6 +377,12 @@ static u32 siprand_u32(struct siprand_state *s)
>  {
>  	unsigned long v0 = s->v[0], v1 = s->v[1], v2 = s->v[2], v3 = s->v[3];
> 
> +	if (++s->count >= 8) {
> +		v3 ^= s->noise;
> +		s->noise += random_get_entropy();
> +		s->count = 0;
> +	}
> +

Using:
	if (s->count-- <= 0) {
		...
		s->count = 8;
	}
probably generates better code.
Although you may want to use a 'signed int' instead of 'unsigned long'.

	David

Florian Westphal Aug. 10, 2020, 12:03 p.m. UTC | #12
Willy Tarreau <w@1wt.eu> wrote:
> On Sun, Aug 09, 2020 at 06:30:17PM +0000, George Spelvin wrote:
> > Even something simple like buffering 8 TSC samples, and adding them
> > at 32-bit offsets across the state every 8th call, would make a huge
> > difference.
> 
> Doing testing on real hardware showed that retrieving the TSC on every
> call had a non negligible cost, causing a loss of 2.5% on the accept()
> rate and 4% on packet rate when using iptables -m statistics. However
> I reused your idea of accumulating old TSCs to increase the uncertainty
> about their exact value, except that I retrieve it only on 1/8 calls
> and use the previous noise in this case. With this I observe the same
> performance as plain 5.8. Below are the connection rates accepted on
> a single core :
> 
>         5.8           5.8+patch     5.8+patch+tsc
>    192900-197900   188800->192200   194500-197500  (conn/s)
> 
> This was on a core i7-8700K. I looked at the asm code for the function
> and it remains reasonably light, in the same order of complexity as the
> original one, so I think we could go with that.
> 
> My proposed change is below, in case you have any improvements to suggest.

As this relates to networking, you could also hook perturbation into rx/tx
softirq processing.  E.g. once for each new napi poll round or only once
for each softnet invocation, depending on cost.

IIRC the proposed draft left an unused prandom_seed() stub around; you could
re-use that to stash, in percpu data, extra material to include in the hash.
Willy Tarreau Aug. 10, 2020, 2:48 p.m. UTC | #13
On Mon, Aug 10, 2020 at 12:01:11PM +0000, David Laight wrote:
> >  /*
> >   * On 32-bit machines, we use HSipHash, a reduced-width version of SipHash.
> >   * This is weaker, but 32-bit machines are not used for high-traffic
> > @@ -375,6 +377,12 @@ static u32 siprand_u32(struct siprand_state *s)
> >  {
> >  	unsigned long v0 = s->v[0], v1 = s->v[1], v2 = s->v[2], v3 = s->v[3];
> > 
> > +	if (++s->count >= 8) {
> > +		v3 ^= s->noise;
> > +		s->noise += random_get_entropy();
> > +		s->count = 0;
> > +	}
> > +
> 
> Using:
> 	if (s->count-- <= 0) {
> 		...
> 		s->count = 8;
> 	}
> probably generates better code.
> Although you may want to use a 'signed int' instead of 'unsigned long'.

Yeah I know, it's just because I only slightly changed the previous code
there. I had an earlier version that kept the rand state fully padded
when storing intermediate values. That's among the final cleanups I'll
bring if we go down that route.

Thanks!

Willy
Willy Tarreau Aug. 10, 2020, 2:53 p.m. UTC | #14
On Mon, Aug 10, 2020 at 02:03:02PM +0200, Florian Westphal wrote:
> As this relates to networking, you could also hook perturbation into rx/tx
> softirq processing.  E.g. once for each new napi poll round or only once
> for each softnet invocation, depending on cost.

I was wondering what/where to add some processing. I was thinking about
having a per_cpu "noise" variable that would get mixed with the randoms
when generated. This "noise" would need to be global so that we can easily
append to it. For example on the network path it would be nice to use
checksums but nowadays they're not calculated anymore.

> IIRC the proposed draft left a unused prandom_seed() stub around, you could
> re-use that to place extra data to include in the hash in percpu data.

Probably. What I saw was that prandom_seed() expected to perform a full
(hence slow) reseed. Instead I'd like to do something cheap. I like the
principle of "noise" in that it doesn't promise to bring any security,
only to perturb a little bit.

Willy
Linus Torvalds Aug. 10, 2020, 4:31 p.m. UTC | #15
On Mon, Aug 10, 2020 at 4:47 AM Willy Tarreau <w@1wt.eu> wrote:
>
> Doing testing on real hardware showed that retrieving the TSC on every
> call had a non negligible cost, causing a loss of 2.5% on the accept()
> rate and 4% on packet rate when using iptables -m statistics.

And by "real hardware" I assume you mean x86, with a fairly fast and
high-performance TSC for random_get_entropy().

Reading the TSC takes on the order of 20-50 cycles, iirc.

But it can actually be *much* more expensive. On non-x86, it can be an
IO cycle to external chips.

And on older hardware VM's in x86, it can be a vm exit etc, so
thousands of cycles. I hope nobody uses those VM's any more, but it
would be a reasonable test-case for some non-x86 implementations, so..

IOW, no. You guys are - once again - ignoring reality.

            Linus
Willy Tarreau Aug. 10, 2020, 4:58 p.m. UTC | #16
On Mon, Aug 10, 2020 at 09:31:48AM -0700, Linus Torvalds wrote:
> On Mon, Aug 10, 2020 at 4:47 AM Willy Tarreau <w@1wt.eu> wrote:
> >
> > Doing testing on real hardware showed that retrieving the TSC on every
> > call had a non negligible cost, causing a loss of 2.5% on the accept()
> > rate and 4% on packet rate when using iptables -m statistics.
> 
> And by "real hardware" I assume you mean x86, with a fairly fast and
> high-performance TSC for get_random_entropy().

Yep.

> Reading the TSC takes on the order of 20-50 cycles, iirc.
> 
> But it can actually be *much* more expensive. On non-x86, it can be an
> IO cycle to external chips.

I took what we were already using in add_interrupt_randomness() since
I considered that if it was acceptable there, it probably was elsewhere.

> And on older hardware VM's in x86, it can be a vm exit etc, so
> thousands of cycles. I hope nobody uses those VM's any more, but it
> would be a reasonable test-case for some non-x86 implementations, so..

Yes, I remember these ones, they were not fun at all.

> IOW, no. You guys are - once again - ignoring reality.

I'm not ignoring reality, quite the opposite, trying to take all knowledge
into account. If I'm missing some points, fine. But if we were already
calling that in the interrupt handler I expected that this would be OK.

The alternative Florian talked about is quite interesting as well,
which is to collect some cheap noise in the network rx/tx paths since
these are the areas we care about.

Willy
Linus Torvalds Aug. 10, 2020, 5:45 p.m. UTC | #17
On Mon, Aug 10, 2020 at 9:59 AM Willy Tarreau <w@1wt.eu> wrote:
>
> I took what we were already using in add_interrupt_randomness() since
> I considered that if it was acceptable there, it probably was elsewhere.

Once you've taken an interrupt, you're doing IO anyway, and the
interrupt costs will dominate anything you do.

But the prandom_u32() interface is potentially done many times per
interrupt. For all I know it's done inside fairly critical locks etc
too.

So I don't think one usage translates to another very well.

              Linus
Willy Tarreau Aug. 10, 2020, 6:01 p.m. UTC | #18
On Mon, Aug 10, 2020 at 10:45:26AM -0700, Linus Torvalds wrote:
> On Mon, Aug 10, 2020 at 9:59 AM Willy Tarreau <w@1wt.eu> wrote:
> >
> > I took what we were already using in add_interrupt_randomness() since
> > I considered that if it was acceptable there, it probably was elsewhere.
> 
> Once you've taken an interrupt, you're doing IO anyway, and the
> interrupt costs will dominate anything you do.
> 
> But the prandom_u32() interface is potentially done many times per
> interrupt. For all I know it's done inside fairly critical locks etc
> too.
> 
> So I don't think one usage translates to another very well.

Possible, hence the better solution is to just feed noise in the hot paths.
Using jiffies and skb pointer in xmit is better than nothing anyway.
I'm not seeking anything extremist, I just want to make sure we don't
get yet-another-report on this area next summer just because some
researcher using two VMs discovers that attacking a 100% idle VM from
another one running on the same CPU core is trivial after having stolen
some memory data.  If at least such an attack is boring and rarely
works, the rest of the job is provided by siphash and we should be fine.

Willy
Willy Tarreau Aug. 10, 2020, 9:04 p.m. UTC | #19
Linus, George, Florian,

would something in this vein be OK in your opinion?

- update_process_times still updates the noise
- we don't touch the fast_pool anymore
- we don't read any TSC on hot paths
- we update the noise in xmit from jiffies and a few pointer values instead

I've applied it on top of George's patch rebased to mainline for simplicity.
I've used a separate per_cpu noise variable to keep the net_rand_state static
with its __latent_entropy.

With this I'm even getting very slightly better performance than with
the previous patch (195.7 - 197.8k cps).

What could be improved is the way the input values are mixed (just
added hence commutative for now). I didn't want to call a siphash
round on the hot paths, but just shifting the previous noise value
before adding would work, such as the following for example:

  void prandom_u32_add_noise(a, b, c, d)
  {
  	unsigned long *noise = get_cpu_ptr(&net_rand_noise);

  #if BITS_PER_LONG == 64
	*noise = rol64(*noise, 7) + a + b + c + d;
  #else
	*noise = rol32(*noise, 7) + a + b + c + d;
  #endif
  	put_cpu_ptr(&net_rand_noise);

  }

Thanks,
Willy

---


diff --git a/drivers/char/random.c b/drivers/char/random.c
index d20ba1b104ca..2a41b21623ae 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -1277,7 +1277,6 @@ void add_interrupt_randomness(int irq, int irq_flags)
 
 	fast_mix(fast_pool);
 	add_interrupt_bench(cycles);
-	this_cpu_add(net_rand_state.s1, fast_pool->pool[cycles & 3]);
 
 	if (unlikely(crng_init == 0)) {
 		if ((fast_pool->count >= 64) &&
diff --git a/include/linux/prandom.h b/include/linux/prandom.h
index aa16e6468f91..e2b4990f2126 100644
--- a/include/linux/prandom.h
+++ b/include/linux/prandom.h
@@ -20,7 +20,7 @@ struct rnd_state {
 	__u32 s1, s2, s3, s4;
 };
 
-DECLARE_PER_CPU(struct rnd_state, net_rand_state);
+DECLARE_PER_CPU(unsigned long, net_rand_noise);
 
 u32 prandom_u32_state(struct rnd_state *state);
 void prandom_bytes_state(struct rnd_state *state, void *buf, size_t nbytes);
@@ -67,6 +67,7 @@ static inline void prandom_seed_state(struct rnd_state *state, u64 seed)
 	state->s2 = __seed(i,   8U);
 	state->s3 = __seed(i,  16U);
 	state->s4 = __seed(i, 128U);
+	__this_cpu_add(net_rand_noise, i);
 }
 
 /* Pseudo random number generator from numerical recipes. */
diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index ae5029f984a8..2f07c569af77 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -1721,7 +1721,7 @@ void update_process_times(int user_tick)
 	 * non-constant value that's not affine to the number of calls to make
 	 * sure it's updated when there's some activity (we don't care in idle).
 	 */
-	this_cpu_add(net_rand_state.s1, rol32(jiffies, 24) + user_tick);
+	__this_cpu_add(net_rand_noise, rol32(jiffies, 24) + user_tick);
 }
 
 /**
diff --git a/lib/random32.c b/lib/random32.c
index 2b048e2ea99f..d74cf1db8cc9 100644
--- a/lib/random32.c
+++ b/lib/random32.c
@@ -320,6 +320,8 @@ struct siprand_state {
 };
 
 static DEFINE_PER_CPU(struct siprand_state, net_rand_state) __latent_entropy;
+DEFINE_PER_CPU(unsigned long, net_rand_noise);
+EXPORT_SYMBOL(net_rand_noise);
 
 #if BITS_PER_LONG == 64
 /*
@@ -334,7 +336,7 @@ static DEFINE_PER_CPU(struct siprand_state, net_rand_state) __latent_entropy;
 #define K0 (0x736f6d6570736575 ^ 0x6c7967656e657261 )
 #define K1 (0x646f72616e646f6d ^ 0x7465646279746573 )
 
-#elif BITS_PER_LONG == 23
+#elif BITS_PER_LONG == 32
 /*
  * On 32-bit machines, we use HSipHash, a reduced-width version of SipHash.
  * This is weaker, but 32-bit machines are not used for high-traffic
@@ -374,9 +376,12 @@ static DEFINE_PER_CPU(struct siprand_state, net_rand_state) __latent_entropy;
 static u32 siprand_u32(struct siprand_state *s)
 {
 	unsigned long v0 = s->v[0], v1 = s->v[1], v2 = s->v[2], v3 = s->v[3];
+	unsigned long n = __this_cpu_read(net_rand_noise);
 
+	v3 ^= n;
 	SIPROUND(v0, v1, v2, v3);
 	SIPROUND(v0, v1, v2, v3);
+	v0 ^= n;
 	s->v[0] = v0;  s->v[1] = v1;  s->v[2] = v2;  s->v[3] = v3;
 	return v1 + v3;
 }
diff --git a/net/core/dev.c b/net/core/dev.c
index 7df6c9617321..55a2471cd75b 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -144,6 +144,7 @@
 #include <linux/indirect_call_wrapper.h>
 #include <net/devlink.h>
 #include <linux/pm_runtime.h>
+#include <linux/prandom.h>
 
 #include "net-sysfs.h"
 
@@ -3557,6 +3558,7 @@ static int xmit_one(struct sk_buff *skb, struct net_device *dev,
 		dev_queue_xmit_nit(skb, dev);
 
 	len = skb->len;
+	__this_cpu_add(net_rand_noise, (long)skb + (long)dev + (long)txq + len + jiffies);
 	trace_net_dev_start_xmit(skb, dev);
 	rc = netdev_start_xmit(skb, dev, txq, more);
 	trace_net_dev_xmit(skb, rc, dev, len);
@@ -4129,6 +4131,7 @@ static int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
 			if (!skb)
 				goto out;
 
+			__this_cpu_add(net_rand_noise, (long)skb + (long)dev + (long)txq + jiffies);
 			HARD_TX_LOCK(dev, txq, cpu);
 
 			if (!netif_xmit_stopped(txq)) {
@@ -4194,6 +4197,7 @@ int dev_direct_xmit(struct sk_buff *skb, u16 queue_id)
 
 	skb_set_queue_mapping(skb, queue_id);
 	txq = skb_get_tx_queue(dev, skb);
+	__this_cpu_add(net_rand_noise, (long)skb + (long)dev + (long)txq + jiffies);
 
 	local_bh_disable();
George Spelvin Aug. 11, 2020, 3:47 a.m. UTC | #20
On Mon, Aug 10, 2020 at 01:47:00PM +0200, Willy Tarreau wrote:
> except that I retrieve it only on 1/8 calls
> and use the previous noise in this case.

Er... that's quite different.  I was saying you measure them all, and do:

 struct siprand_state {
 	...
+	uint32_t noise[8];
+	unsigned counter;
 }
 	...
+	s->noise[--s->counter] = random_get_entropy();
+
+	if (!s->counter) {
+		for (i = 0; i < 4; i++)
+			s->v[i] += s->noise[2*i] +
+				((unsigned long)s->noise[2*i+1] << BITS_PER_LONG/2);
+		s->counter = 8;
+	}

What you're doing is just decreasing the amount of seeding by a factor
of 8.  (Roughly.  You do gain log2(8)/2 = 1.5 bits because the sum of
8 random values has a standard deviation sqrt(8) times as large as
the inputs.)

> diff --git a/lib/random32.c b/lib/random32.c
> index 2b048e2ea99f..a12d63028106 100644
> --- a/lib/random32.c
> +++ b/lib/random32.c
> @@ -317,6 +317,8 @@ static void __init prandom_state_selftest(void)
>  
>  struct siprand_state {
>  	unsigned long v[4];
> +	unsigned long noise;
> +	unsigned long count;
>  };
>  
>  static DEFINE_PER_CPU(struct siprand_state, net_rand_state) __latent_entropy;
> @@ -334,7 +336,7 @@ static DEFINE_PER_CPU(struct siprand_state, net_rand_state) __latent_entropy;
>  #define K0 (0x736f6d6570736575 ^ 0x6c7967656e657261 )
>  #define K1 (0x646f72616e646f6d ^ 0x7465646279746573 )
>  
> -#elif BITS_PER_LONG == 23
> +#elif BITS_PER_LONG == 32
>  /*
>   * On 32-bit machines, we use HSipHash, a reduced-width version of SipHash.
>   * This is weaker, but 32-bit machines are not used for high-traffic
> @@ -375,6 +377,12 @@ static u32 siprand_u32(struct siprand_state *s)
>  {
>  	unsigned long v0 = s->v[0], v1 = s->v[1], v2 = s->v[2], v3 = s->v[3];
>  
> +	if (++s->count >= 8) {
> +		v3 ^= s->noise;
> +		s->noise += random_get_entropy();
> +		s->count = 0;
> +	}
> +

- Can you explain why you save the "noise" until next time?  Is this meant to
  make it harder for an attacker to observe the time?
- How about doing away with s->count and making it statistical:

+	if ((v3 & 7) == 0)
+		v3 ^= random_get_entropy();

That still does the seed 1/8 of the time, but in a much less regular pattern.
(Admittedly, it will totally break the branch predictor.  An unlikely()
might help.)
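I.e. something like this on top of the draft's siprand_u32() (sketch only):

	static u32 siprand_u32(struct siprand_state *s)
	{
		unsigned long v0 = s->v[0], v1 = s->v[1], v2 = s->v[2], v3 = s->v[3];

		/* Statistical ~1/8 reseed; unlikely() keeps the fast path short. */
		if (unlikely((v3 & 7) == 0))
			v3 ^= random_get_entropy();

		SIPROUND(v0, v1, v2, v3);
		SIPROUND(v0, v1, v2, v3);
		s->v[0] = v0;  s->v[1] = v1;  s->v[2] = v2;  s->v[3] = v3;
		return v1 + v3;
	}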
Willy Tarreau Aug. 11, 2020, 3:58 a.m. UTC | #21
On Tue, Aug 11, 2020 at 03:47:24AM +0000, George Spelvin wrote:
> On Mon, Aug 10, 2020 at 01:47:00PM +0200, Willy Tarreau wrote:
> > except that I retrieve it only on 1/8 calls
> > and use the previous noise in this case.
> 
> Er... that's quite different.  I was saying you measure them all, and do:

That was my first approach and it resulted in a significant performance
loss, hence the change (and the resulting ugly construct with the counter).

> > +	if (++s->count >= 8) {
> > +		v3 ^= s->noise;
> > +		s->noise += random_get_entropy();
> > +		s->count = 0;
> > +	}
> > +
> 
> - Can you explain why you save the "noise" until next time?  Is this meant to
>   make it harder for an attacker to observe the time?

Just to make the observable call not depend on immediately measurable TSC
values. It's weak, but the point was that when attack traffic is mixed with
regular traffic, if you can measure the TSC timing variations to know when
it was sampled but don't see its resulting effect at the same time, it's
harder to analyse than when you have both at once.

> - How about doing away with s->count and making it statistical:
> 
> +	if ((v3 & 7) == 0)
> +		v3 ^= random_get_entropy();

Absolutely. I just kept the counter from the previous attempt. But Linus
prefers that we completely remove TSC reads from these direct calls, due to
old VMs that were slow at this. We could collect it somewhere else once
in a while instead.

Willy
George Spelvin Aug. 11, 2020, 5:26 a.m. UTC | #22
On Mon, Aug 10, 2020 at 11:04:55PM +0200, Willy Tarreau wrote:
> What could be improved is the way the input values are mixed (just
> added hence commutative for now). I didn't want to call a siphash
> round on the hot paths, but just shifting the previous noise value
> before adding would work, such as the following for example:
> 
>   void prandom_u32_add_noise(a, b, c, d)
>   {
>   	unsigned long *noise = get_cpu_ptr(&net_rand_noise);
> 
>   #if BITS_PER_LONG == 64
> 	*noise = rol64(*noise, 7) + a + b + c + d;
>   #else
> 	*noise = rol32(*noise, 7) + a + b + c + d;
>   #endif
>   	put_cpu_ptr(&net_rand_noise);
> 
>   }

If you think this is enough seed material, I'm fine with it.

I don't hugely like the fact that you sum all the inputs, since
entropy tends to be concentrated in the low-order words, and summing
risks cancellation.

You can't afford even one SIPROUND as a non-cryptographic hash?  E.g.

DEFINE_PER_CPU(unsigned long[4], net_rand_noise);
EXPORT_SYMBOL(net_rand_noise);

void prandom_u32_add_noise(a, b, c, d)
{
	unsigned long *noise = get_cpu_ptr(&net_rand_noise);

	a ^= noise[0];
	b ^= noise[1];
	c ^= noise[2];
	d ^= noise[3];
	/*
	 * This is not used cryptographically; it's just
	 * a convenient 4-word hash function.
	 */
	SIPROUND(a, b, c, d);
	noise[0] = a;
	noise[1] = b;
	noise[2] = c;
	put_cpu_ptr(&net_rand_noise);
}

(And then you mix in net_rand_noise[0].)

Other options are HASH_MIX() from fs/namei.c, but that's
more sequential.

There's also a simple Xorshift generator.
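For reference, a minimal sketch of that last option (Marsaglia's xorshift64
shift constants; mix_noise() is a made-up name, and this assumes 64-bit
longs):

	static unsigned long mix_noise(unsigned long *noise, unsigned long in)
	{
		unsigned long x = *noise ^ in;

		/* xorshift64: cheap, non-cryptographic scrambling. */
		x ^= x << 13;
		x ^= x >> 7;
		x ^= x << 17;
		return *noise = x;
	}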
Willy Tarreau Aug. 11, 2020, 5:37 a.m. UTC | #23
On Tue, Aug 11, 2020 at 05:26:49AM +0000, George Spelvin wrote:
> On Mon, Aug 10, 2020 at 11:04:55PM +0200, Willy Tarreau wrote:
> > What could be improved is the way the input values are mixed (just
> > added hence commutative for now). I didn't want to call a siphash
> > round on the hot paths, but just shifting the previous noise value
> > before adding would work, such as the following for example:
> > 
> >   void prandom_u32_add_noise(a, b, c, d)
> >   {
> >   	unsigned long *noise = get_cpu_ptr(&net_rand_noise);
> > 
> >   #if BITS_PER_LONG == 64
> > 	*noise = rol64(*noise, 7) + a + b + c + d;
> >   #else
> > 	*noise = rol32(*noise, 7) + a + b + c + d;
> >   #endif
> >   	put_cpu_ptr(&net_rand_noise);
> > 
> >   }
> 
> If you think this is enough seed material, I'm fine with it.
> 
> I don't hugely like the fact that you sum all the inputs, since
> entropy tends to be concentrated in the low-order words, and summing
> risks cancellation.

Yes, I had figured this. But I thought it was still better than
a pure xor, which would cancel out the high bits of the pointers.

> You can't afford even one SIPROUND as a non-cryptographic hash?  E.g.

That's what I mentioned above, I'm still hesitating. I need to test.

> 
> DEFINE_PER_CPU(unsigned long[4], net_rand_noise);
> EXPORT_SYMBOL(net_rand_noise);
> 
> void prandom_u32_add_noise(a, b, c, d)
> {
> 	unsigned long *noise = get_cpu_ptr(&net_rand_noise);
> 
> 	a ^= noise[0];
> 	b ^= noise[1];
> 	c ^= noise[2];
> 	d ^= noise[3];
> 	/*
> 	 * This is not used cryptographically; it's just
> 	 * a convenient 4-word hash function.
> 	 */
> 	SIPROUND(a, b, c, d);
> 	noise[0] = a;
> 	noise[1] = b;
> 	noise[2] = c;
> 	put_cpu_ptr(&net_rand_noise);
> }
> 
> (And then you mix in net_rand_noise[0].)
> 
> Other options are HASH_MIX() from fs/namei.c, but that's
> more sequential.
> 
> There's also a simple Xorshift generator.

I think a xorshift on each value will have roughly the same cost
as a single SIPROUND. But I haven't ruled any of these options out yet;
I still need to test. If we lose a few cycles per packet, that
might be OK.

Willy

Patch

diff --git a/lib/random32.c b/lib/random32.c
index 763b920a6206c..2b048e2ea99f6 100644
--- a/lib/random32.c
+++ b/lib/random32.c
@@ -48,8 +48,6 @@  static inline void prandom_state_selftest(void)
 }
 #endif
 
-static DEFINE_PER_CPU(struct rnd_state, net_rand_state) __latent_entropy;
-
 /**
  *	prandom_u32_state - seeded pseudo-random number generator.
  *	@state: pointer to state structure holding seeded state.
@@ -69,25 +67,6 @@  u32 prandom_u32_state(struct rnd_state *state)
 }
 EXPORT_SYMBOL(prandom_u32_state);
 
-/**
- *	prandom_u32 - pseudo random number generator
- *
- *	A 32 bit pseudo-random number is generated using a fast
- *	algorithm suitable for simulation. This algorithm is NOT
- *	considered safe for cryptographic use.
- */
-u32 prandom_u32(void)
-{
-	struct rnd_state *state = &get_cpu_var(net_rand_state);
-	u32 res;
-
-	res = prandom_u32_state(state);
-	put_cpu_var(net_rand_state);
-
-	return res;
-}
-EXPORT_SYMBOL(prandom_u32);
-
 /**
  *	prandom_bytes_state - get the requested number of pseudo-random bytes
  *
@@ -119,20 +98,6 @@  void prandom_bytes_state(struct rnd_state *state, void *buf, size_t bytes)
 }
 EXPORT_SYMBOL(prandom_bytes_state);
 
-/**
- *	prandom_bytes - get the requested number of pseudo-random bytes
- *	@buf: where to copy the pseudo-random bytes to
- *	@bytes: the requested number of bytes
- */
-void prandom_bytes(void *buf, size_t bytes)
-{
-	struct rnd_state *state = &get_cpu_var(net_rand_state);
-
-	prandom_bytes_state(state, buf, bytes);
-	put_cpu_var(net_rand_state);
-}
-EXPORT_SYMBOL(prandom_bytes);
-
 static void prandom_warmup(struct rnd_state *state)
 {
 	/* Calling RNG ten times to satisfy recurrence condition */
@@ -148,96 +113,6 @@  static void prandom_warmup(struct rnd_state *state)
 	prandom_u32_state(state);
 }
 
-static u32 __extract_hwseed(void)
-{
-	unsigned int val = 0;
-
-	(void)(arch_get_random_seed_int(&val) ||
-	       arch_get_random_int(&val));
-
-	return val;
-}
-
-static void prandom_seed_early(struct rnd_state *state, u32 seed,
-			       bool mix_with_hwseed)
-{
-#define LCG(x)	 ((x) * 69069U)	/* super-duper LCG */
-#define HWSEED() (mix_with_hwseed ? __extract_hwseed() : 0)
-	state->s1 = __seed(HWSEED() ^ LCG(seed),        2U);
-	state->s2 = __seed(HWSEED() ^ LCG(state->s1),   8U);
-	state->s3 = __seed(HWSEED() ^ LCG(state->s2),  16U);
-	state->s4 = __seed(HWSEED() ^ LCG(state->s3), 128U);
-}
-
-/**
- *	prandom_seed - add entropy to pseudo random number generator
- *	@entropy: entropy value
- *
- *	Add some additional entropy to the prandom pool.
- */
-void prandom_seed(u32 entropy)
-{
-	int i;
-	/*
-	 * No locking on the CPUs, but then somewhat random results are, well,
-	 * expected.
-	 */
-	for_each_possible_cpu(i) {
-		struct rnd_state *state = &per_cpu(net_rand_state, i);
-
-		state->s1 = __seed(state->s1 ^ entropy, 2U);
-		prandom_warmup(state);
-	}
-}
-EXPORT_SYMBOL(prandom_seed);
-
-/*
- *	Generate some initially weak seeding values to allow
- *	to start the prandom_u32() engine.
- */
-static int __init prandom_init(void)
-{
-	int i;
-
-	prandom_state_selftest();
-
-	for_each_possible_cpu(i) {
-		struct rnd_state *state = &per_cpu(net_rand_state, i);
-		u32 weak_seed = (i + jiffies) ^ random_get_entropy();
-
-		prandom_seed_early(state, weak_seed, true);
-		prandom_warmup(state);
-	}
-
-	return 0;
-}
-core_initcall(prandom_init);
-
-static void __prandom_timer(struct timer_list *unused);
-
-static DEFINE_TIMER(seed_timer, __prandom_timer);
-
-static void __prandom_timer(struct timer_list *unused)
-{
-	u32 entropy;
-	unsigned long expires;
-
-	get_random_bytes(&entropy, sizeof(entropy));
-	prandom_seed(entropy);
-
-	/* reseed every ~60 seconds, in [40 .. 80) interval with slack */
-	expires = 40 + prandom_u32_max(40);
-	seed_timer.expires = jiffies + msecs_to_jiffies(expires * MSEC_PER_SEC);
-
-	add_timer(&seed_timer);
-}
-
-static void __init __prandom_start_seed_timer(void)
-{
-	seed_timer.expires = jiffies + msecs_to_jiffies(40 * MSEC_PER_SEC);
-	add_timer(&seed_timer);
-}
-
 void prandom_seed_full_state(struct rnd_state __percpu *pcpu_state)
 {
 	int i;
@@ -257,51 +132,6 @@  void prandom_seed_full_state(struct rnd_state __percpu *pcpu_state)
 }
 EXPORT_SYMBOL(prandom_seed_full_state);
 
-/*
- *	Generate better values after random number generator
- *	is fully initialized.
- */
-static void __prandom_reseed(bool late)
-{
-	unsigned long flags;
-	static bool latch = false;
-	static DEFINE_SPINLOCK(lock);
-
-	/* Asking for random bytes might result in bytes getting
-	 * moved into the nonblocking pool and thus marking it
-	 * as initialized. In this case we would double back into
-	 * this function and attempt to do a late reseed.
-	 * Ignore the pointless attempt to reseed again if we're
-	 * already waiting for bytes when the nonblocking pool
-	 * got initialized.
-	 */
-
-	/* only allow initial seeding (late == false) once */
-	if (!spin_trylock_irqsave(&lock, flags))
-		return;
-
-	if (latch && !late)
-		goto out;
-
-	latch = true;
-	prandom_seed_full_state(&net_rand_state);
-out:
-	spin_unlock_irqrestore(&lock, flags);
-}
-
-void prandom_reseed_late(void)
-{
-	__prandom_reseed(true);
-}
-
-static int __init prandom_reseed(void)
-{
-	__prandom_reseed(false);
-	__prandom_start_seed_timer();
-	return 0;
-}
-late_initcall(prandom_reseed);
-
 #ifdef CONFIG_RANDOM32_SELFTEST
 static struct prandom_test1 {
 	u32 seed;
@@ -463,3 +293,270 @@  static void __init prandom_state_selftest(void)
 		pr_info("prandom: %d self tests passed\n", runs);
 }
 #endif
+
+
+
+/*
+ * The prandom_u32() implementation is now completely separate from the
+ * prandom_state() functions, which are retained (for now) for compatibility.
+ *
+ * Because of (ab)use in the networking code for choosing random TCP/UDP port
+ * numbers, which open DoS possibilities if guessable, we want something
+ * stronger than a standard PRNG.  But the performance requirements of
+ * the network code do not allow robust crypto for this application.
+ *
+ * So this is a homebrew Junior Spaceman implementation, based on the
+ * lowest-latency trustworthy crypto primitive available, SipHash.
+ * (The authors of SipHash have not been consulted about this abuse of
+ * their work.)
+ *
+ * Standard SipHash-2-4 uses 2n+4 rounds to hash n words of input to
+ * one word of output.  This abbreviated version uses 2 rounds per word
+ * of output.
+ */
+
+struct siprand_state {
+	unsigned long v[4];
+};
+
+static DEFINE_PER_CPU(struct siprand_state, net_rand_state) __latent_entropy;
+
+#if BITS_PER_LONG == 64
+/*
+ * The core SipHash round function.  Each line can be executed in
+ * parallel given enough CPU resources.
+ */
+#define SIPROUND(v0,v1,v2,v3) ( \
+	v0 += v1, v1 = rol64(v1, 13),  v2 += v3, v3 = rol64(v3, 16), \
+	v1 ^= v0, v0 = rol64(v0, 32),  v3 ^= v2,                     \
+	v0 += v3, v3 = rol64(v3, 21),  v2 += v1, v1 = rol64(v1, 17), \
+	v3 ^= v0,                      v1 ^= v2, v2 = rol64(v2, 32)  )
+#define K0 (0x736f6d6570736575 ^ 0x6c7967656e657261 )
+#define K1 (0x646f72616e646f6d ^ 0x7465646279746573 )
+
+#elif BITS_PER_LONG == 23
+/*
+ * On 32-bit machines, we use HSipHash, a reduced-width version of SipHash.
+ * This is weaker, but 32-bit machines are not used for high-traffic
+ * applications, so there is less output for an attacker to analyze.
+ */
+#define SIPROUND(v0,v1,v2,v3) ( \
+	v0 += v1, v1 = rol32(v1,  5),  v2 += v3, v3 = rol32(v3,  8), \
+	v1 ^= v0, v0 = rol32(v0, 16),  v3 ^= v2,                     \
+	v0 += v3, v3 = rol32(v3,  7),  v2 += v1, v1 = rol32(v1, 13), \
+	v3 ^= v0,                      v1 ^= v2, v2 = rol32(v2, 16)  )
+#define K0 0x6c796765
+#define K1 0x74656462
+
+#else
+#error Unsupported BITS_PER_LONG
+#endif
+
+/*
+ * This is the core CPRNG function.  As "pseudorandom", this is not used
+ * for truly valuable things, just intended to be a PITA to guess.
+ * For maximum speed, we do just two SipHash rounds per word.  This is
+ * the same rate as 4 rounds per 64 bits that SipHash normally uses,
+ * so hopefully it's reasonably secure.
+ *
+ * There are two changes from the official SipHash finalization:
+ * - We omit some constants XORed with v2 in the SipHash spec as irrelevant;
+ *   they are there only to make the output rounds distinct from the input
+ *   rounds, and this application has no input rounds.
+ * - Rather than returning v0^v1^v2^v3, return v1+v3.
+ *   If you look at the SipHash round, the last operation on v3 is
+ *   "v3 ^= v0", so "v0 ^ v3" just undoes that, a waste of time.
+ *   Likewise "v1 ^= v2".  (The rotate of v2 makes a difference, but
+ *   it still cancels out half of the bits in v2 for no benefit.)
+ *   Second, since the last combining operation was xor, continue the
+ *   pattern of alternating xor/add for a tiny bit of extra non-linearity.
+ */
+static u32 siprand_u32(struct siprand_state *s)
+{
+	unsigned long v0 = s->v[0], v1 = s->v[1], v2 = s->v[2], v3 = s->v[3];
+
+	SIPROUND(v0, v1, v2, v3);
+	SIPROUND(v0, v1, v2, v3);
+	s->v[0] = v0;  s->v[1] = v1;  s->v[2] = v2;  s->v[3] = v3;
+	return v1 + v3;
+}
+
+
+/**
+ *	prandom_u32 - pseudo random number generator
+ *
+ *	A 32 bit pseudo-random number is generated using a fast
+ *	algorithm suitable for simulation. This algorithm is NOT
+ *	considered safe for cryptographic use.
+ */
+u32 prandom_u32(void)
+{
+	struct siprand_state *state = get_cpu_ptr(&net_rand_state);
+	u32 res = siprand_u32(state);
+
+	put_cpu_ptr(&net_rand_state);
+	return res;
+}
+EXPORT_SYMBOL(prandom_u32);
+
+/**
+ *	prandom_bytes - get the requested number of pseudo-random bytes
+ *	@buf: where to copy the pseudo-random bytes to
+ *	@bytes: the requested number of bytes
+ */
+void prandom_bytes(void *buf, size_t bytes)
+{
+	struct siprand_state *state = get_cpu_ptr(&net_rand_state);
+	u8 *ptr = buf;
+
+	while (bytes >= sizeof(u32)) {
+		put_unaligned(siprand_u32(state), (u32 *)ptr);
+		ptr += sizeof(u32);
+		bytes -= sizeof(u32);
+	}
+
+	if (bytes > 0) {
+		u32 rem = siprand_u32(state);
+		do {
+			*ptr++ = (u8)rem;
+			rem >>= BITS_PER_BYTE;
+		} while (--bytes > 0);
+	}
+	put_cpu_ptr(&net_rand_state);
+}
+EXPORT_SYMBOL(prandom_bytes);
+
+/**
+ *	prandom_seed - add entropy to pseudo random number generator
+ *	@entropy: entropy value
+ *
+ *	Add some additional seed material to the prandom pool.
+ *	The "entropy" is actually our IP address (the only caller is
+ *	the network code), not for unpredictability, but to ensure that
+ *	different machines are initialized differently.
+ */
+void prandom_seed(u32 entropy)
+{
+	/* FIXME: Where to feed this in? */
+}
+EXPORT_SYMBOL(prandom_seed);
+
+/*
+ *	Generate some initially weak seeding values to allow
+ *	the prandom_u32() engine to be started.
+ */
+static int __init prandom_init_early(void)
+{
+	int i;
+	unsigned long v0, v1, v2, v3;
+
+	if (!arch_get_random_long(&v0))
+		v0 = jiffies;
+	if (!arch_get_random_long(&v1))
+		v1 = random_get_entropy();
+	v2 = v0 ^ K0;
+	v3 = v1 ^ K1;
+
+	for_each_possible_cpu(i) {
+		struct siprand_state *state;
+
+		v3 ^= i;
+		SIPROUND(v0, v1, v2, v3);
+		SIPROUND(v0, v1, v2, v3);
+		v0 ^= i;
+
+		state = per_cpu_ptr(&net_rand_state, i);
+		state->v[0] = v0;  state->v[1] = v1;
+		state->v[2] = v2;  state->v[3] = v3;
+	}
+
+	return 0;
+}
+core_initcall(prandom_init_early);
+
+
+/* Stronger reseeding when available, and periodically thereafter. */
+static void prandom_reseed(struct timer_list *unused);
+
+static DEFINE_TIMER(seed_timer, prandom_reseed);
+
+static void prandom_reseed(struct timer_list *unused)
+{
+	unsigned long expires;
+	int i;
+
+	/*
+	 * Reinitialize each CPU's PRNG with 128 bits of key.
+	 * No locking on the CPUs, but then somewhat random results are,
+	 * well, expected.
+	 */
+	for_each_possible_cpu(i) {
+		struct siprand_state *state;
+		unsigned long v0 = get_random_long(), v2 = v0 ^ K0;
+		unsigned long v1 = get_random_long(), v3 = v1 ^ K1;
+#if BITS_PER_LONG == 32
+		int j;
+
+		/*
+		 * On 32-bit machines, hash in two extra words to
+		 * approximate 128-bit key length.  Not that the hash
+		 * has that much security, but this prevents a trivial
+		 * 64-bit brute force.
+		 */
+		for (j = 0; j < 2; j++) {
+			unsigned long m = get_random_long();
+			v3 ^= m;
+			SIPROUND(v0, v1, v2, v3);
+			SIPROUND(v0, v1, v2, v3);
+			v0 ^= m;
+		}
+#endif
+		/*
+		 * Probably impossible in practice, but there is a
+		 * theoretical risk that a race between this reseeding
+		 * and the target CPU writing its state back could
+		 * create the all-zero SipHash fixed point.
+		 *
+		 * To ensure that never happens, ensure the state
+		 * we write contains no zero words.
+		 */
+		state = per_cpu_ptr(&net_rand_state, i);
+		WRITE_ONCE(state->v[0], v0 ? v0 : -1ul);
+		WRITE_ONCE(state->v[1], v1 ? v1 : -1ul);
+		WRITE_ONCE(state->v[2], v2 ? v2 : -1ul);
+		WRITE_ONCE(state->v[3], v3 ? v3 : -1ul);
+	}
+
+	/* reseed every ~60 seconds, in [40 .. 80) interval with slack */
+	expires = round_jiffies(jiffies + 40 * HZ + prandom_u32_max(40 * HZ));
+	mod_timer(&seed_timer, expires);
+}
+
+/*
+ * The random ready callback can be called from almost any interrupt.
+ * To avoid worrying about whether it's safe to delay that interrupt
+ * long enough to seed all CPUs, just schedule an immediate timer event.
+ */
+static void prandom_timer_start(struct random_ready_callback *unused)
+{
+	mod_timer(&seed_timer, jiffies);
+}
+
+/*
+ * Start periodic full reseeding as soon as strong
+ * random numbers are available.
+ */
+static int __init prandom_init_late(void)
+{
+	static struct random_ready_callback random_ready = {
+	        .func = prandom_timer_start
+	};
+	int ret = add_random_ready_callback(&random_ready);
+
+	if (ret == -EALREADY) {
+		prandom_timer_start(&random_ready);
+		ret = 0;
+	}
+	return ret;
+}
+late_initcall(prandom_init_late);