diff mbox series

vsprintf: Do not break early boot with probing addresses

Message ID 20190509121923.8339-1-pmladek@suse.com (mailing list archive)
State Not Applicable
Headers show
Series vsprintf: Do not break early boot with probing addresses | expand

Checks

Context Check Description
snowpatch_ozlabs/apply_patch warning Failed to apply on branch next (8150a153c013aa2dd1ffae43370b89ac1347a7fb)
snowpatch_ozlabs/apply_patch fail Failed to apply to any branch

Commit Message

Petr Mladek May 9, 2019, 12:19 p.m. UTC
The commit 3e5903eb9cff70730 ("vsprintf: Prevent crash when dereferencing
invalid pointers") broke boot on several architectures. The common
pattern is that probe_kernel_read() is not working during early
boot because userspace access framework is not ready.

The check is only the best effort. Let's not rush with it during
the early boot.

Details:

1. Report on Power:

Kernel crashes very early during boot with with CONFIG_PPC_KUAP and
CONFIG_JUMP_LABEL_FEATURE_CHECK_DEBUG

The problem is the combination of some new code called via printk(),
check_pointer() which calls probe_kernel_read(). That then calls
allow_user_access() (PPC_KUAP) and that uses mmu_has_feature() too early
(before we've patched features). With the JUMP_LABEL debug enabled that
causes us to call printk() & dump_stack() and we end up recursing and
overflowing the stack.

Because it happens so early you don't get any output, just an apparently
dead system.

The stack trace (which you don't see) is something like:

  ...
  dump_stack+0xdc
  probe_kernel_read+0x1a4
  check_pointer+0x58
  string+0x3c
  vsnprintf+0x1bc
  vscnprintf+0x20
  printk_safe_log_store+0x7c
  printk+0x40
  dump_stack_print_info+0xbc
  dump_stack+0x8
  probe_kernel_read+0x1a4
  probe_kernel_read+0x19c
  check_pointer+0x58
  string+0x3c
  vsnprintf+0x1bc
  vscnprintf+0x20
  vprintk_store+0x6c
  vprintk_emit+0xec
  vprintk_func+0xd4
  printk+0x40
  cpufeatures_process_feature+0xc8
  scan_cpufeatures_subnodes+0x380
  of_scan_flat_dt_subnodes+0xb4
  dt_cpu_ftrs_scan_callback+0x158
  of_scan_flat_dt+0xf0
  dt_cpu_ftrs_scan+0x3c
  early_init_devtree+0x360
  early_setup+0x9c

2. Report on s390:

vsnprintf invocations, are broken on s390. For example, the early boot
output now looks like this where the first (efault) should be
the linux_banner:

[    0.099985] (efault)
[    0.099985] setup: Linux is running as a z/VM guest operating system in 64-bit mode
[    0.100066] setup: The maximum memory size is 8192MB
[    0.100070] cma: Reserved 4 MiB at (efault)
[    0.100100] numa: NUMA mode: (efault)

The reason for this, is that the code assumes that
probe_kernel_address() works very early. This however is not true on
at least s390. Uaccess on KERNEL_DS works only after page tables have
been setup on s390, which happens with setup_arch()->paging_init().

Any probe_kernel_address() invocation before that will return -EFAULT.

Fixes: 3e5903eb9cff70730 ("vsprintf: Prevent crash when dereferencing invalid pointers")
Signed-off-by: Petr Mladek <pmladek@suse.com>
---
 lib/vsprintf.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

Comments

Andy Shevchenko May 9, 2019, 1:05 p.m. UTC | #1
On Thu, May 09, 2019 at 02:19:23PM +0200, Petr Mladek wrote:
> The commit 3e5903eb9cff70730 ("vsprintf: Prevent crash when dereferencing
> invalid pointers") broke boot on several architectures. The common
> pattern is that probe_kernel_read() is not working during early
> boot because userspace access framework is not ready.
> 
> The check is only the best effort. Let's not rush with it during
> the early boot.
> 
> Details:
> 
> 1. Report on Power:
> 
> Kernel crashes very early during boot with with CONFIG_PPC_KUAP and
> CONFIG_JUMP_LABEL_FEATURE_CHECK_DEBUG
> 
> The problem is the combination of some new code called via printk(),
> check_pointer() which calls probe_kernel_read(). That then calls
> allow_user_access() (PPC_KUAP) and that uses mmu_has_feature() too early
> (before we've patched features). With the JUMP_LABEL debug enabled that
> causes us to call printk() & dump_stack() and we end up recursing and
> overflowing the stack.
> 
> Because it happens so early you don't get any output, just an apparently
> dead system.
> 
> The stack trace (which you don't see) is something like:
> 
>   ...
>   dump_stack+0xdc
>   probe_kernel_read+0x1a4
>   check_pointer+0x58
>   string+0x3c
>   vsnprintf+0x1bc
>   vscnprintf+0x20
>   printk_safe_log_store+0x7c
>   printk+0x40
>   dump_stack_print_info+0xbc
>   dump_stack+0x8
>   probe_kernel_read+0x1a4
>   probe_kernel_read+0x19c
>   check_pointer+0x58
>   string+0x3c
>   vsnprintf+0x1bc
>   vscnprintf+0x20
>   vprintk_store+0x6c
>   vprintk_emit+0xec
>   vprintk_func+0xd4
>   printk+0x40
>   cpufeatures_process_feature+0xc8
>   scan_cpufeatures_subnodes+0x380
>   of_scan_flat_dt_subnodes+0xb4
>   dt_cpu_ftrs_scan_callback+0x158
>   of_scan_flat_dt+0xf0
>   dt_cpu_ftrs_scan+0x3c
>   early_init_devtree+0x360
>   early_setup+0x9c
> 
> 2. Report on s390:
> 
> vsnprintf invocations, are broken on s390. For example, the early boot
> output now looks like this where the first (efault) should be
> the linux_banner:
> 
> [    0.099985] (efault)
> [    0.099985] setup: Linux is running as a z/VM guest operating system in 64-bit mode
> [    0.100066] setup: The maximum memory size is 8192MB
> [    0.100070] cma: Reserved 4 MiB at (efault)
> [    0.100100] numa: NUMA mode: (efault)
> 
> The reason for this, is that the code assumes that
> probe_kernel_address() works very early. This however is not true on
> at least s390. Uaccess on KERNEL_DS works only after page tables have
> been setup on s390, which happens with setup_arch()->paging_init().
> 
> Any probe_kernel_address() invocation before that will return -EFAULT.
> 

It's seems as a good enough fix.
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>

Though in all cases would be nice to distinguish error pointers as well.
Something like

if (IS_ERR(ptr))
	return err_pointer_str(ptr);

in check_pointer_msg().

> Fixes: 3e5903eb9cff70730 ("vsprintf: Prevent crash when dereferencing invalid pointers")
> Signed-off-by: Petr Mladek <pmladek@suse.com>
> ---
>  lib/vsprintf.c | 9 +++++++--
>  1 file changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/lib/vsprintf.c b/lib/vsprintf.c
> index 7b0a6140bfad..8b43a883be6b 100644
> --- a/lib/vsprintf.c
> +++ b/lib/vsprintf.c
> @@ -640,8 +640,13 @@ static const char *check_pointer_msg(const void *ptr)
>  	if (!ptr)
>  		return "(null)";
>  
> -	if (probe_kernel_address(ptr, byte))
> -		return "(efault)";
> +	/* User space address handling is not ready during early boot. */
> +	if (system_state <= SYSTEM_BOOTING) {
> +		if ((unsigned long)ptr < PAGE_SIZE)
> +			return "(efault)";
> +	} else {
> +		if (probe_kernel_address(ptr, byte))
> +			return "(efault)";
>  
>  	return NULL;
>  }
> -- 
> 2.16.4
>
Steven Rostedt May 9, 2019, 1:13 p.m. UTC | #2
On Thu,  9 May 2019 14:19:23 +0200
Petr Mladek <pmladek@suse.com> wrote:

> The commit 3e5903eb9cff70730 ("vsprintf: Prevent crash when dereferencing
> invalid pointers") broke boot on several architectures. The common
> pattern is that probe_kernel_read() is not working during early
> boot because userspace access framework is not ready.
> 
> The check is only the best effort. Let's not rush with it during
> the early boot.
> 
> Details:
> 
> 1. Report on Power:
> 
> Kernel crashes very early during boot with with CONFIG_PPC_KUAP and
> CONFIG_JUMP_LABEL_FEATURE_CHECK_DEBUG
> 
> The problem is the combination of some new code called via printk(),
> check_pointer() which calls probe_kernel_read(). That then calls
> allow_user_access() (PPC_KUAP) and that uses mmu_has_feature() too early
> (before we've patched features). With the JUMP_LABEL debug enabled that
> causes us to call printk() & dump_stack() and we end up recursing and
> overflowing the stack.
> 
> Because it happens so early you don't get any output, just an apparently
> dead system.
> 
> The stack trace (which you don't see) is something like:
> 
>   ...
>   dump_stack+0xdc
>   probe_kernel_read+0x1a4
>   check_pointer+0x58
>   string+0x3c
>   vsnprintf+0x1bc
>   vscnprintf+0x20
>   printk_safe_log_store+0x7c
>   printk+0x40
>   dump_stack_print_info+0xbc
>   dump_stack+0x8
>   probe_kernel_read+0x1a4
>   probe_kernel_read+0x19c
>   check_pointer+0x58
>   string+0x3c
>   vsnprintf+0x1bc
>   vscnprintf+0x20
>   vprintk_store+0x6c
>   vprintk_emit+0xec
>   vprintk_func+0xd4
>   printk+0x40
>   cpufeatures_process_feature+0xc8
>   scan_cpufeatures_subnodes+0x380
>   of_scan_flat_dt_subnodes+0xb4
>   dt_cpu_ftrs_scan_callback+0x158
>   of_scan_flat_dt+0xf0
>   dt_cpu_ftrs_scan+0x3c
>   early_init_devtree+0x360
>   early_setup+0x9c
> 
> 2. Report on s390:
> 
> vsnprintf invocations, are broken on s390. For example, the early boot
> output now looks like this where the first (efault) should be
> the linux_banner:
> 
> [    0.099985] (efault)
> [    0.099985] setup: Linux is running as a z/VM guest operating system in 64-bit mode
> [    0.100066] setup: The maximum memory size is 8192MB
> [    0.100070] cma: Reserved 4 MiB at (efault)
> [    0.100100] numa: NUMA mode: (efault)
> 
> The reason for this, is that the code assumes that
> probe_kernel_address() works very early. This however is not true on
> at least s390. Uaccess on KERNEL_DS works only after page tables have
> been setup on s390, which happens with setup_arch()->paging_init().
> 
> Any probe_kernel_address() invocation before that will return -EFAULT.

Hmm, this sounds to me that probe_kernel_address() is broken for these
architectures. Perhaps the system_state check should be in
probe_kernel_address() for those architectures?


> 
> Fixes: 3e5903eb9cff70730 ("vsprintf: Prevent crash when dereferencing invalid pointers")
> Signed-off-by: Petr Mladek <pmladek@suse.com>
> ---
>  lib/vsprintf.c | 9 +++++++--
>  1 file changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/lib/vsprintf.c b/lib/vsprintf.c
> index 7b0a6140bfad..8b43a883be6b 100644
> --- a/lib/vsprintf.c
> +++ b/lib/vsprintf.c
> @@ -640,8 +640,13 @@ static const char *check_pointer_msg(const void *ptr)
>  	if (!ptr)
>  		return "(null)";
>  
> -	if (probe_kernel_address(ptr, byte))
> -		return "(efault)";

Either that, or we add a macro to those architectures and add:

#ifdef ARCH_NO_EARLY_PROBE_KERNEL_ADDRESS

> +	/* User space address handling is not ready during early boot. */
> +	if (system_state <= SYSTEM_BOOTING) {
> +		if ((unsigned long)ptr < PAGE_SIZE)
> +			return "(efault)";
> +	} else {

#endif

Why add an unnecessary branch for archs this is not a problem for?

-- Steve

> +		if (probe_kernel_address(ptr, byte))
> +			return "(efault)";
>  
>  	return NULL;
>  }
Michal Suchanek May 9, 2019, 1:38 p.m. UTC | #3
On Thu,  9 May 2019 14:19:23 +0200
Petr Mladek <pmladek@suse.com> wrote:

> The commit 3e5903eb9cff70730 ("vsprintf: Prevent crash when dereferencing
> invalid pointers") broke boot on several architectures. The common
> pattern is that probe_kernel_read() is not working during early
> boot because userspace access framework is not ready.
> 
> The check is only the best effort. Let's not rush with it during
> the early boot.
> 
> Details:
> 
> 1. Report on Power:
> 
> Kernel crashes very early during boot with with CONFIG_PPC_KUAP and
> CONFIG_JUMP_LABEL_FEATURE_CHECK_DEBUG
> 
> The problem is the combination of some new code called via printk(),
> check_pointer() which calls probe_kernel_read(). That then calls
> allow_user_access() (PPC_KUAP) and that uses mmu_has_feature() too early
> (before we've patched features).

There is early_mmu_has_feature for this case. mmu_has_feature does not
work before patching so parts of kernel that can run before patching
must use the early_ variant which actually runs code reading the
feature bitmap to determine the answer.

Thanks

Michal
David Laight May 9, 2019, 1:46 p.m. UTC | #4
From: Michal Suchánek
> Sent: 09 May 2019 14:38
...
> > The problem is the combination of some new code called via printk(),
> > check_pointer() which calls probe_kernel_read(). That then calls
> > allow_user_access() (PPC_KUAP) and that uses mmu_has_feature() too early
> > (before we've patched features).
> 
> There is early_mmu_has_feature for this case. mmu_has_feature does not
> work before patching so parts of kernel that can run before patching
> must use the early_ variant which actually runs code reading the
> feature bitmap to determine the answer.

Does the early_ variant get patched so the it is reasonably
efficient after the 'patching' is done?
Or should there be a third version which gets patched across?

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
Petr Mladek May 9, 2019, 2:06 p.m. UTC | #5
On Thu 2019-05-09 09:13:57, Steven Rostedt wrote:
> On Thu,  9 May 2019 14:19:23 +0200
> Petr Mladek <pmladek@suse.com> wrote:
> 
> > The commit 3e5903eb9cff70730 ("vsprintf: Prevent crash when dereferencing
> > invalid pointers") broke boot on several architectures. The common
> > pattern is that probe_kernel_read() is not working during early
> > boot because userspace access framework is not ready.
> > 
> > The check is only the best effort. Let's not rush with it during
> > the early boot.
> > 
> > Details:
> > 
> > 1. Report on Power:
> > 
> > Kernel crashes very early during boot with with CONFIG_PPC_KUAP and
> > CONFIG_JUMP_LABEL_FEATURE_CHECK_DEBUG
> > 
> > The problem is the combination of some new code called via printk(),
> > check_pointer() which calls probe_kernel_read(). That then calls
> > allow_user_access() (PPC_KUAP) and that uses mmu_has_feature() too early
> > (before we've patched features). With the JUMP_LABEL debug enabled that
> > causes us to call printk() & dump_stack() and we end up recursing and
> > overflowing the stack.
> > 
> > Because it happens so early you don't get any output, just an apparently
> > dead system.
> > 
> > The stack trace (which you don't see) is something like:
> > 
> >   ...
> >   dump_stack+0xdc
> >   probe_kernel_read+0x1a4
> >   check_pointer+0x58
> >   string+0x3c
> >   vsnprintf+0x1bc
> >   vscnprintf+0x20
> >   printk_safe_log_store+0x7c
> >   printk+0x40
> >   dump_stack_print_info+0xbc
> >   dump_stack+0x8
> >   probe_kernel_read+0x1a4
> >   probe_kernel_read+0x19c
> >   check_pointer+0x58
> >   string+0x3c
> >   vsnprintf+0x1bc
> >   vscnprintf+0x20
> >   vprintk_store+0x6c
> >   vprintk_emit+0xec
> >   vprintk_func+0xd4
> >   printk+0x40
> >   cpufeatures_process_feature+0xc8
> >   scan_cpufeatures_subnodes+0x380
> >   of_scan_flat_dt_subnodes+0xb4
> >   dt_cpu_ftrs_scan_callback+0x158
> >   of_scan_flat_dt+0xf0
> >   dt_cpu_ftrs_scan+0x3c
> >   early_init_devtree+0x360
> >   early_setup+0x9c
> > 
> > 2. Report on s390:
> > 
> > vsnprintf invocations, are broken on s390. For example, the early boot
> > output now looks like this where the first (efault) should be
> > the linux_banner:
> > 
> > [    0.099985] (efault)
> > [    0.099985] setup: Linux is running as a z/VM guest operating system in 64-bit mode
> > [    0.100066] setup: The maximum memory size is 8192MB
> > [    0.100070] cma: Reserved 4 MiB at (efault)
> > [    0.100100] numa: NUMA mode: (efault)
> > 
> > The reason for this, is that the code assumes that
> > probe_kernel_address() works very early. This however is not true on
> > at least s390. Uaccess on KERNEL_DS works only after page tables have
> > been setup on s390, which happens with setup_arch()->paging_init().
> > 
> > Any probe_kernel_address() invocation before that will return -EFAULT.
> 
> Hmm, this sounds to me that probe_kernel_address() is broken for these
> architectures. Perhaps the system_state check should be in
> probe_kernel_address() for those architectures?

Yeah. Well, these problems are hard to debug. It left a dead power
system with a blank screen. I am not sure if the added check is
worth the pain.

I hope that the check would help to debug problems. But it is yet
another complexity in printk() path. I think that it is fine
to keep it enabled only on the booted system for a while
and get some more feedback.

Best Regards,
Petr
Sergey Senozhatsky May 10, 2019, 4:32 a.m. UTC | #6
On (05/09/19 14:19), Petr Mladek wrote:
> 1. Report on Power:
> 
> Kernel crashes very early during boot with with CONFIG_PPC_KUAP and
> CONFIG_JUMP_LABEL_FEATURE_CHECK_DEBUG
> 
> The problem is the combination of some new code called via printk(),
> check_pointer() which calls probe_kernel_read(). That then calls
> allow_user_access() (PPC_KUAP) and that uses mmu_has_feature() too early
> (before we've patched features). With the JUMP_LABEL debug enabled that
> causes us to call printk() & dump_stack() and we end up recursing and
> overflowing the stack.

Hmm... hmm... PPC does an .opd-based symbol dereference, which
eventually probe_kernel_read()-s. So early printk(%pS) will do

	printk(%pS)
	 dereference_function_descriptor()
	  probe_kernel_address()
	   dump_stack()
	    printk(%pS)
	     dereference_function_descriptor()
	      probe_kernel_address()
	       dump_stack()
	        printk(%pS)
	         ...

I'd say... that it's not vsprintf that we want to fix, it's
the idea that probe_kernel_address() can dump_stack() on any
platform. On some archs probe_kernel_address()->dump_stack()
is going nowhere:
 dump_stack() does probe_kernel_address(), which calls dump_stack(),
 which calls printk(%pS)->probe_kernel_address() again and again,
 and again.

	-ss
Linus Torvalds May 10, 2019, 4:47 a.m. UTC | #7
[ Sorry about html and mobile crud, I'm not at the computer right now ]

How about we just undo the whole misguided probe_kernel_address() thing?

The excuse for is was that it would avoid crashes.

It turns out that was wrong, and that it just made things much worse.
Honestly, we haven't really had the crashes that the logic was supposed to
protect against in the first place, so by now it's clear that the whole
thing was a stupid and horrible mistake in the first place.

So let's admit that and just go back to the old sane model.

Seriously, we've never needed that odd probing. It causes issues. It's
wrong and pointless.

       Linus

On Thu, May 9, 2019, 21:32 Sergey Senozhatsky <
sergey.senozhatsky.work@gmail.com> wrote:

> On (05/09/19 14:19), Petr Mladek wrote:
> > 1. Report on Power:
> >
> > Kernel crashes very early during boot with with CONFIG_PPC_KUAP and
> > CONFIG_JUMP_LABEL_FEATURE_CHECK_DEBUG
> >
> > The problem is the combination of some new code called via printk(),
> > check_pointer() which calls probe_kernel_read(). That then calls
> > allow_user_access() (PPC_KUAP) and that uses mmu_has_feature() too early
> > (before we've patched features). With the JUMP_LABEL debug enabled that
> > causes us to call printk() & dump_stack() and we end up recursing and
> > overflowing the stack.
>
> Hmm... hmm... PPC does an .opd-based symbol dereference, which
> eventually probe_kernel_read()-s. So early printk(%pS) will do
>
>         printk(%pS)
>          dereference_function_descriptor()
>           probe_kernel_address()
>            dump_stack()
>             printk(%pS)
>              dereference_function_descriptor()
>               probe_kernel_address()
>                dump_stack()
>                 printk(%pS)
>                  ...
>
> I'd say... that it's not vsprintf that we want to fix, it's
> the idea that probe_kernel_address() can dump_stack() on any
> platform. On some archs probe_kernel_address()->dump_stack()
> is going nowhere:
>  dump_stack() does probe_kernel_address(), which calls dump_stack(),
>  which calls printk(%pS)->probe_kernel_address() again and again,
>  and again.
>
>         -ss
>
Michael Ellerman May 10, 2019, 10:21 a.m. UTC | #8
David Laight <David.Laight@ACULAB.COM> writes:
> From: Michal Suchánek
>> Sent: 09 May 2019 14:38
> ...
>> > The problem is the combination of some new code called via printk(),
>> > check_pointer() which calls probe_kernel_read(). That then calls
>> > allow_user_access() (PPC_KUAP) and that uses mmu_has_feature() too early
>> > (before we've patched features).
>> 
>> There is early_mmu_has_feature for this case. mmu_has_feature does not
>> work before patching so parts of kernel that can run before patching
>> must use the early_ variant which actually runs code reading the
>> feature bitmap to determine the answer.
>
> Does the early_ variant get patched so the it is reasonably
> efficient after the 'patching' is done?

No they don't get patched ever. The name is a bit misleading I guess.

> Or should there be a third version which gets patched across?

For a case like this it's entirely safe to just skip the code early in
boot, so if it was a static_key_false everything would just work.

Unfortunately the way the code is currently written we would have to
change all MMU features to static_key_false and that risks breaking
something else.

We have a long standing TODO to rework all our feature logic and unify
CPU/MMU/firmware/etc. features. Possibly as part of that we can come up
with a scheme where the default value is per-feature bit.

Having said all that, in this case the overhead of the test and branch
is small compared to the cost of writing to the SPR which controls user
access and then doing an isync, so it's all somewhat premature
optimisation.

cheers
diff mbox series

Patch

diff --git a/lib/vsprintf.c b/lib/vsprintf.c
index 7b0a6140bfad..8b43a883be6b 100644
--- a/lib/vsprintf.c
+++ b/lib/vsprintf.c
@@ -640,8 +640,13 @@  static const char *check_pointer_msg(const void *ptr)
 	if (!ptr)
 		return "(null)";
 
-	if (probe_kernel_address(ptr, byte))
-		return "(efault)";
+	/* User space address handling is not ready during early boot. */
+	if (system_state <= SYSTEM_BOOTING) {
+		if ((unsigned long)ptr < PAGE_SIZE)
+			return "(efault)";
+	} else {
+		if (probe_kernel_address(ptr, byte))
+			return "(efault)";
 
 	return NULL;
 }