[SRU,F/aws,0/5] AWS: fix out of entropy on Graviton 2 instance types (m6g.*)

Message ID 20210507081547.6945-1-andrea.righi@canonical.com

Message

Andrea Righi May 7, 2021, 8:15 a.m. UTC
BugLink: https://bugs.launchpad.net/bugs/1927692

[Impact]

AWS Graviton 2 instances do not have enough entropy available at boot,
so any task that requires entropy (even reading a few bytes from
/dev/random) will be stuck forever.
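
As a rough diagnostic (not part of this series), the kernel's current
entropy estimate can be read from /proc/sys/kernel/random/entropy_avail.
The following is a minimal Python sketch, assuming a Linux system with
procfs mounted; the helper name entropy_avail_bits is made up for
illustration, and how the estimate is computed differs between kernel
versions.

```python
from pathlib import Path

# Standard procfs location of the kernel's entropy estimate on Linux.
ENTROPY_AVAIL = "/proc/sys/kernel/random/entropy_avail"


def entropy_avail_bits(path=ENTROPY_AVAIL):
    """Return the entropy estimate (in bits) stored in the given file.

    Hypothetical helper: `path` is a parameter only so that the function
    can be pointed at a regular file when no procfs is available.
    """
    return int(Path(path).read_text().strip())
```

Calling entropy_avail_bits() shortly after boot, before and after applying
the series, gives a number to compare against the behaviour of the read
test; it does not by itself prove that a read would block.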

[Fix]

The proper fix for this problem is to correctly refill the entropy pool
with some real random data using some hardware-generated randomness.

In the meantime, a reasonable workaround is to apply the following
upstream commits:

 30c08efec888 random: make /dev/random be almost like /dev/urandom
 48446f198f9a random: ignore GRND_RANDOM in getentropy(2)
 75551dbf112c random: add GRND_INSECURE to return best-effort non-cryptographic bytes
 c6f1deb15878 random: Add a urandom_read_nowait() for random APIs that don't warn
 4c8d062186d9 random: Don't wake crng_init_wait when crng_init == 1

With these commits applied, the system will not run out of entropy and
will be able to provide best-effort randomness in any case, preventing
the out-of-entropy issue on the AWS Graviton 2 instances.
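
For illustration only, this is how userspace might ask for the best-effort
bytes that GRND_INSECURE provides. It is a Python sketch, not code from the
series. The helper name best_effort_bytes is hypothetical. The flag value
0x0004 is taken from include/uapi/linux/random.h. Falling back to
os.urandom() on kernels that reject the flag is an assumption about what a
caller would want.

```python
import errno
import os

# Flag value from include/uapi/linux/random.h. Kernels that lack the
# GRND_INSECURE commit reject it with EINVAL.
GRND_INSECURE = 0x0004


def best_effort_bytes(n):
    """Return n random bytes, preferring the never-blocking GRND_INSECURE path."""
    if not hasattr(os, "getrandom"):
        # Assumption: on platforms without getrandom(2), os.urandom() will do.
        return os.urandom(n)
    out = bytearray()
    while len(out) < n:
        try:
            # getrandom(2) may return fewer bytes than requested, so loop.
            out += os.getrandom(n - len(out), GRND_INSECURE)
        except OSError as exc:
            if exc.errno != errno.EINVAL:
                raise
            # Flag not supported: fall back to os.urandom(), which may block
            # until the CRNG is initialized.
            out += os.urandom(n - len(out))
    return bytes(out)
```

As the commit title says, bytes obtained this way are best-effort and
non-cryptographic, so a caller that needs key material should not use this
path.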

[Test plan]

Execute the following command on any m6g instance:

  dd bs=32 count=1 if=/dev/random of=/dev/null

This should return quickly; if it does not, the system does not have
enough entropy available. When the problem happens, this command hangs
forever.
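
Since a failing run hangs forever, an automated check needs a timeout. The
sketch below wraps the same dd invocation in Python. The helper name
read_returns_quickly and the 5 second limit are arbitrary choices made
here, not part of the test plan.

```python
import subprocess


def read_returns_quickly(device="/dev/random", nbytes=32, timeout_s=5.0):
    """Return True if a single nbytes read from device finishes in time.

    Runs the dd command from the test plan; if dd has not exited after
    timeout_s seconds it is killed and the read is reported as blocked.
    """
    cmd = ["dd", f"bs={nbytes}", "count=1", f"if={device}", "of=/dev/null"]
    try:
        subprocess.run(cmd, check=True, capture_output=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return False
    return True
```

On an affected kernel, read_returns_quickly() would be expected to return
False shortly after boot, and True once the series is applied.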

[Where problems could occur]

These changes make the read semantics of /dev/random the same as those
of /dev/urandom, except that reads will block until the CRNG is ready.
This should not materially break any API. Any code that worked without
these changes should work at least as well as before. However,
applications that have strict randomness requirements might be affected
by the provided best-effort randomness, so we may need to apply more
commits/changes to introduce proper hardware entropy support on
Graviton 2 instances to provide a better quality of randomness. In the
meantime, these upstream changes constitute a reasonable workaround to
prevent applications from hanging forever on the m6g.* instances.

----------------------------------------------------------------
Andy Lutomirski (5):
      random: add GRND_INSECURE to return best-effort non-cryptographic bytes
      random: Don't wake crng_init_wait when crng_init == 1
      random: Add a urandom_read_nowait() for random APIs that don't warn
      random: ignore GRND_RANDOM in getentropy(2)
      random: make /dev/random be almost like /dev/urandom

 drivers/char/random.c       | 81 +++++++++++++++++++++++++++++++++------------------------------------------------
 include/uapi/linux/random.h |  4 +++-
 2 files changed, 36 insertions(+), 49 deletions(-)

Comments

Guilherme G. Piccoli May 7, 2021, 10:58 a.m. UTC | #1
Thanks Andrea, LGTM. I wonder if we plan to apply these commits to all
5.4-based kernels - are they in 5.8? If so, I feel it is worth adding
them to all 5.4-based kernels; entropy blocking is a PITA and usually
leads to multiple complaints due to boot problems. I understand, though,
that this is more urgent for AWS... so mandatory to apply in F/AWS!
That said:

Acked-by: Guilherme G. Piccoli <gpiccoli@canonical.com>
Tim Gardner May 7, 2021, 11:31 a.m. UTC | #2
Acked-by: Tim Gardner <tim.gardner@canonical.com>

I'm not sure I fully understand patch 5, but it is a clean cherry-pick
and testing shows that it at least no longer blocks. As for how random
the returned data is, I can't say.

Andrea Righi May 7, 2021, 1:03 p.m. UTC | #3
On Fri, May 07, 2021 at 07:58:09AM -0300, Guilherme Piccoli wrote:
> Thanks Andrea, LGTM. I wonder if we plan to apply these commits to all
> 5.4-based kernels - are they in 5.8? If so, I feel it is worth adding
> them to all 5.4-based kernels; entropy blocking is a PITA and usually
> leads to multiple complaints due to boot problems. I understand, though,
> that this is more urgent for AWS... so mandatory to apply in F/AWS!
> That said:
> 
> Acked-by: Guilherme G. Piccoli <gpiccoli@canonical.com>

Thanks for the review, Guilherme. These commits are already applied to
all our kernels >= 5.8, and I agree that this patch set should probably
target all 5.4 kernels (especially the cloud kernels, which can easily
run out of entropy). However, I would do more tests and more
investigation before applying it across the board, since it doesn't seem
to be a blocker for the other kernels.

-Andrea
Andrea Righi May 7, 2021, 1:26 p.m. UTC | #4
On Fri, May 07, 2021 at 05:31:09AM -0600, Tim Gardner wrote:
> Acked-by: Tim Gardner <tim.gardner@canonical.com>
> 
> I'm not sure I fully understand patch 5, but it is a clean cherry-pick and
> testing shows it to at least not block anymore. As for how random the
> information is that is returned I can't say.

Thanks for the review, Tim.

Patch 5 changes the read semantics of /dev/random.

Before, the kernel was using two separate pools of random data: one for
/dev/random and another for /dev/urandom. The pool for
/dev/random was a blocking pool (reads blocked until enough entropy was
available) filled with "real" random data.

After the change, the blocking pool is no longer used by /dev/random
reads; reads will only block until the CRNG (cryptographic
random-number generator) has been initialized (see crng_ready()).
Once the CRNG is initialized, reads from /dev/random will never
block and will consume data generated by the CRNG and real random
events.

Basically, after the change the kernel trusts the numbers generated by
the CRNG, whereas before we were trusting only numbers generated by
truly random events.

This change is covered very well in this article:
https://lwn.net/Articles/808575/
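
The before/after blocking rule described above can be written down as a
toy model. This is a deliberately simplified Python sketch, not the
kernel's logic: the class ToyRng and its fields are invented for
illustration, and details such as short reads and entropy accounting in
drivers/char/random.c are ignored.

```python
from dataclasses import dataclass


@dataclass
class ToyRng:
    """Toy state: an entropy estimate for the old blocking pool and a CRNG flag."""

    entropy_bits: int  # estimate credited to the old blocking pool
    crng_ready: bool   # True once the CRNG has been initialized

    def old_random_read_blocks(self, nbytes):
        # Before the change: a /dev/random read waits while the blocking
        # pool's estimate cannot cover the request (simplified).
        return self.entropy_bits < nbytes * 8

    def new_random_read_blocks(self, nbytes):
        # After the change: the only wait is for CRNG initialization;
        # the request size no longer matters.
        return not self.crng_ready
```

For example, ToyRng(entropy_bits=0, crng_ready=True) blocks a 32-byte read
under the old rule but not under the new one, which matches the symptom
in the bug report.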

-Andrea

Guilherme G. Piccoli May 7, 2021, 1:33 p.m. UTC | #5
On Fri, May 7, 2021 at 10:03 AM Andrea Righi <andrea.righi@canonical.com> wrote:
> Thanks for the review, Guilherme. These commits are already applied to
> all our kernels >= 5.8, and I agree that this patch set should probably
> target all 5.4 kernels (especially the cloud kernels, which can easily
> run out of entropy). However, I would do more tests and more
> investigation before applying it across the board, since it doesn't seem
> to be a blocker for the other kernels.
>
> -Andrea

Makes sense, thank you Andrea =)
Tim Gardner May 7, 2021, 2:59 p.m. UTC | #6
On 5/7/21 7:26 AM, Andrea Righi wrote:
> On Fri, May 07, 2021 at 05:31:09AM -0600, Tim Gardner wrote:
>> Acked-by: Tim Gardner <tim.gardner@canonical.com>
>>
>> I'm not sure I fully understand patch 5, but it is a clean cherry-pick and
>> testing shows it to at least not block anymore. As for how random the
>> information is that is returned I can't say.
> 
> Thanks for the review, Tim.
> 
> Patch 5 changes the read semantics of /dev/random.
> 
> Before, the kernel was using two separate pools of random data: one for
> /dev/random and another for /dev/urandom. The pool for
> /dev/random was a blocking pool (reads blocked until enough entropy was
> available) filled with "real" random data.
> 
> After the change, the blocking pool is no longer used by /dev/random
> reads; reads will only block until the CRNG (cryptographic
> random-number generator) has been initialized (see crng_ready()).
> Once the CRNG is initialized, reads from /dev/random will never
> block and will consume data generated by the CRNG and real random
> events.
> 
> Basically, after the change the kernel trusts the numbers generated by
> the CRNG, whereas before we were trusting only numbers generated by
> truly random events.
> 
> This change is covered very well in this article:
> https://lwn.net/Articles/808575/
> 
> -Andrea
Thanks for the pointer. That was quite informative.

rtg
-----------
Tim Gardner
Canonical, Inc
Tim Gardner May 7, 2021, 4:19 p.m. UTC | #7
Applied to focal:linux-aws master. Thanks.

-rtg
