diff mbox series

[v2,1/4] mm: Add optional close() to struct vm_special_mapping

Message ID 20240812082605.743814-1-mpe@ellerman.id.au (mailing list archive)
State Handled Elsewhere
Headers show
Series [v2,1/4] mm: Add optional close() to struct vm_special_mapping | expand

Commit Message

Michael Ellerman Aug. 12, 2024, 8:26 a.m. UTC
Add an optional close() callback to struct vm_special_mapping. It will
be used, by powerpc at least, to handle unmapping of the VDSO.

Although support for unmapping the VDSO was initially added
for CRIU[1], it is not desirable to guard that support behind
CONFIG_CHECKPOINT_RESTORE.

There are other known users of unmapping the VDSO which are not related
to CRIU, eg. Valgrind [2] and void-ship [3].

The powerpc arch_unmap() hook has been in place for ~9 years, with no
ifdef, so there may be other unknown users that have come to rely on
unmapping the VDSO. Even if the code was behind an ifdef, major distros
enable CHECKPOINT_RESTORE so users may not realise unmapping the VDSO
depends on that configuration option.

It's also undesirable to have such core mm behaviour behind a relatively
obscure CONFIG option.

Longer term the unmap behaviour should be standardised across
architectures, however that is complicated by the fact the VDSO pointer
is stored differently across architectures. There was a previous attempt
to unify that handling [4], which could be revived.

See [5] for further discussion.

[1]: commit 83d3f0e90c6c ("powerpc/mm: tracking vDSO remap")
[2]: https://sourceware.org/git/?p=valgrind.git;a=commit;h=3a004915a2cbdcdebafc1612427576bf3321eef5
[3]: https://github.com/insanitybit/void-ship
[4]: https://lore.kernel.org/lkml/20210611180242.711399-17-dima@arista.com/
[5]: https://lore.kernel.org/linuxppc-dev/shiq5v3jrmyi6ncwke7wgl76ojysgbhrchsk32q4lbx2hadqqc@kzyy2igem256

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: David Hildenbrand <david@redhat.com>
---
 include/linux/mm_types.h | 3 +++
 mm/mmap.c                | 6 ++++++
 2 files changed, 9 insertions(+)

v2:
- Add some blank lines as requested.
- Expand special_mapping_close() comment.
- Add David's reviewed-by.
- Expand change log to capture review discussion.

Comments

Liam R. Howlett Aug. 12, 2024, 8:41 p.m. UTC | #1
* Michael Ellerman <mpe@ellerman.id.au> [240812 04:26]:
> Add an optional close() callback to struct vm_special_mapping. It will
> be used, by powerpc at least, to handle unmapping of the VDSO.
> 
> Although support for unmapping the VDSO was initially added
> for CRIU[1], it is not desirable to guard that support behind
> CONFIG_CHECKPOINT_RESTORE.
> 
> There are other known users of unmapping the VDSO which are not related
> to CRIU, eg. Valgrind [2] and void-ship [3].
> 
> The powerpc arch_unmap() hook has been in place for ~9 years, with no
> ifdef, so there may be other unknown users that have come to rely on
> unmapping the VDSO. Even if the code was behind an ifdef, major distros
> enable CHECKPOINT_RESTORE so users may not realise unmapping the VDSO
> depends on that configuration option.
> 
> It's also undesirable to have such core mm behaviour behind a relatively
> obscure CONFIG option.
> 
> Longer term the unmap behaviour should be standardised across
> architectures, however that is complicated by the fact the VDSO pointer
> is stored differently across architectures. There was a previous attempt
> to unify that handling [4], which could be revived.
> 
> See [5] for further discussion.

Thank you for tackling this issue, it's much improved.

For the whole series:

Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>

> 
> [1]: commit 83d3f0e90c6c ("powerpc/mm: tracking vDSO remap")
> [2]: https://sourceware.org/git/?p=valgrind.git;a=commit;h=3a004915a2cbdcdebafc1612427576bf3321eef5
> [3]: https://github.com/insanitybit/void-ship
> [4]: https://lore.kernel.org/lkml/20210611180242.711399-17-dima@arista.com/
> [5]: https://lore.kernel.org/linuxppc-dev/shiq5v3jrmyi6ncwke7wgl76ojysgbhrchsk32q4lbx2hadqqc@kzyy2igem256
> 
> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
> Reviewed-by: David Hildenbrand <david@redhat.com>
> ---
>  include/linux/mm_types.h | 3 +++
>  mm/mmap.c                | 6 ++++++
>  2 files changed, 9 insertions(+)
> 
> v2:
> - Add some blank lines as requested.
> - Expand special_mapping_close() comment.
> - Add David's reviewed-by.
> - Expand change log to capture review discussion.
> 
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 485424979254..78bdfc59abe5 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -1313,6 +1313,9 @@ struct vm_special_mapping {
>  
>  	int (*mremap)(const struct vm_special_mapping *sm,
>  		     struct vm_area_struct *new_vma);
> +
> +	void (*close)(const struct vm_special_mapping *sm,
> +		      struct vm_area_struct *vma);
>  };
>  
>  enum tlb_flush_reason {
> diff --git a/mm/mmap.c b/mm/mmap.c
> index d0dfc85b209b..af4dbf0d3bd4 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -3620,10 +3620,16 @@ void vm_stat_account(struct mm_struct *mm, vm_flags_t flags, long npages)
>  static vm_fault_t special_mapping_fault(struct vm_fault *vmf);
>  
>  /*
> + * Close hook, called for unmap() and on the old vma for mremap().
> + *
>   * Having a close hook prevents vma merging regardless of flags.
>   */
>  static void special_mapping_close(struct vm_area_struct *vma)
>  {
> +	const struct vm_special_mapping *sm = vma->vm_private_data;
> +
> +	if (sm->close)
> +		sm->close(sm, vma);
>  }
>  
>  static const char *special_mapping_name(struct vm_area_struct *vma)
> -- 
> 2.45.2
>
Nathan Chancellor Aug. 19, 2024, 6:52 p.m. UTC | #2
Hi Michael,

On Mon, Aug 12, 2024 at 06:26:02PM +1000, Michael Ellerman wrote:
> Add an optional close() callback to struct vm_special_mapping. It will
> be used, by powerpc at least, to handle unmapping of the VDSO.
> 
> Although support for unmapping the VDSO was initially added
> for CRIU[1], it is not desirable to guard that support behind
> CONFIG_CHECKPOINT_RESTORE.
> 
> There are other known users of unmapping the VDSO which are not related
> to CRIU, eg. Valgrind [2] and void-ship [3].
> 
> The powerpc arch_unmap() hook has been in place for ~9 years, with no
> ifdef, so there may be other unknown users that have come to rely on
> unmapping the VDSO. Even if the code was behind an ifdef, major distros
> enable CHECKPOINT_RESTORE so users may not realise unmapping the VDSO
> depends on that configuration option.
> 
> It's also undesirable to have such core mm behaviour behind a relatively
> obscure CONFIG option.
> 
> Longer term the unmap behaviour should be standardised across
> architectures, however that is complicated by the fact the VDSO pointer
> is stored differently across architectures. There was a previous attempt
> to unify that handling [4], which could be revived.
> 
> See [5] for further discussion.
> 
> [1]: commit 83d3f0e90c6c ("powerpc/mm: tracking vDSO remap")
> [2]: https://sourceware.org/git/?p=valgrind.git;a=commit;h=3a004915a2cbdcdebafc1612427576bf3321eef5
> [3]: https://github.com/insanitybit/void-ship
> [4]: https://lore.kernel.org/lkml/20210611180242.711399-17-dima@arista.com/
> [5]: https://lore.kernel.org/linuxppc-dev/shiq5v3jrmyi6ncwke7wgl76ojysgbhrchsk32q4lbx2hadqqc@kzyy2igem256
> 
> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
> Reviewed-by: David Hildenbrand <david@redhat.com>
> ---
>  include/linux/mm_types.h | 3 +++
>  mm/mmap.c                | 6 ++++++
>  2 files changed, 9 insertions(+)
> 
> v2:
> - Add some blank lines as requested.
> - Expand special_mapping_close() comment.
> - Add David's reviewed-by.
> - Expand change log to capture review discussion.
> 
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 485424979254..78bdfc59abe5 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -1313,6 +1313,9 @@ struct vm_special_mapping {
>  
>  	int (*mremap)(const struct vm_special_mapping *sm,
>  		     struct vm_area_struct *new_vma);
> +
> +	void (*close)(const struct vm_special_mapping *sm,
> +		      struct vm_area_struct *vma);
>  };
>  
>  enum tlb_flush_reason {
> diff --git a/mm/mmap.c b/mm/mmap.c
> index d0dfc85b209b..af4dbf0d3bd4 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -3620,10 +3620,16 @@ void vm_stat_account(struct mm_struct *mm, vm_flags_t flags, long npages)
>  static vm_fault_t special_mapping_fault(struct vm_fault *vmf);
>  
>  /*
> + * Close hook, called for unmap() and on the old vma for mremap().
> + *
>   * Having a close hook prevents vma merging regardless of flags.
>   */
>  static void special_mapping_close(struct vm_area_struct *vma)
>  {
> +	const struct vm_special_mapping *sm = vma->vm_private_data;
> +
> +	if (sm->close)
> +		sm->close(sm, vma);
>  }
>  
>  static const char *special_mapping_name(struct vm_area_struct *vma)
> -- 
> 2.45.2
> 

This change is now in -next and I bisected a crash that our CI sees with
ARCH=um to it:

$ make -skj"$(nproc)" ARCH=um CROSS_COMPILE=x86_64-linux- defconfig linux

$ ./linux ubd0=$PWD/rootfs.ext4
...
Linux version 6.11.0-rc4-next-20240819 (nathan@thelio-3990X) (x86_64-linux-gcc (GCC) 14.2.0, GNU ld (GNU Binutils) 2.42) #1 Mon Aug 19 11:42:20 MST 2024
...
Run /sbin/init as init process

Modules linked in:
Pid: 24, comm: mount Not tainted 6.11.0-rc4-next-20240819
RIP: 0033:0x68006f6c
RSP: 000000006c8bfc68  EFLAGS: 00010206
RAX: 0000000068006f6c RBX: 0000000068a0aa18 RCX: 00000000600d8b09
RDX: 0000000000000000 RSI: 0000000068a0aa18 RDI: 0000000068805120
RBP: 000000006c8bfc70 R08: 0000000000000001 R09: 0000000068ae0308
R10: 000000000000000e R11: ffffffffffffffff R12: 0000000000000001
R13: 0000000068a0aa18 R14: 0000000000000015 R15: 0000000068944a88
Kernel panic - not syncing: Segfault with no mm
CPU: 0 UID: 0 PID: 24 Comm: mount Not tainted 6.11.0-rc4-next-20240819 #1
Stack:
 600caeff 6c8bfc90 600d8b2a 68944a80
 00000047 6c8bfda0 600cbfd9 6c8bfd50
 68944ad0 68944a88 7f7ffff000 7f7fffffff
Call Trace:
 [<600caeff>] ? special_mapping_close+0x16/0x19
 [<600d8b2a>] remove_vma+0x21/0x59
 [<600cbfd9>] exit_mmap+0x1f3/0x2bc
 [<60032a0c>] ? unblock_signals+0x0/0xbd
 [<600329fd>] ? block_signals+0x0/0xf
 [<6003831c>] __mmput+0x24/0x94
 [<60067262>] ? up_read+0x0/0x2c
 [<600383a1>] mmput+0x15/0x18
 [<6003ce97>] do_exit+0x381/0x9b8
 [<600e4b8d>] ? kfree+0x107/0x11b
 [<6003d752>] sys_exit_group+0x0/0x16
 [<6003d768>] pid_child_should_wake+0x0/0x42
 [<60022e7a>] handle_syscall+0x79/0xa7
 [<600358de>] userspace+0x4d3/0x505
 [<60020927>] fork_handler+0x84/0x8b

Passing this through scripts/decode_stacktrace.sh results in

? special_mapping_close (mm/mmap.c:2056)
remove_vma (mm/vma.c:144)
exit_mmap (include/linux/sched.h:2049 mm/mmap.c:1947)
? unblock_signals (arch/um/os-Linux/signal.c:296)
? block_signals (arch/um/os-Linux/signal.c:282)
__mmput (kernel/fork.c:1349)
? up_read (arch/x86/include/asm/atomic64_64.h:79 (discriminator 5) include/linux/atomic/atomic-arch-fallback.h:2749 (discriminator 5) include/linux/atomic/atomic-long.h:184 (discriminator 5) include/linux/atomic/atomic-instrumented.h:3317 (discriminator 5) kernel/locking/rwsem.c:1347 (discriminator 5) kernel/locking/rwsem.c:1622 (discriminator 5))
mmput (kernel/fork.c:1370)
do_exit (arch/um/include/asm/thread_info.h:46 kernel/exit.c:572 kernel/exit.c:926)
? kfree (mm/slub.c:4482 (discriminator 2) mm/slub.c:4522 (discriminator 2) mm/slub.c:4669 (discriminator 2))
sys_exit_group (kernel/exit.c:1099 kernel/exit.c:1097)
pid_child_should_wake (kernel/exit.c:1106 kernel/exit.c:1565)
handle_syscall (arch/um/kernel/skas/syscall.c:45 (discriminator 1))
userspace (arch/um/os-Linux/skas/process.c:466)
fork_handler (arch/um/kernel/process.c:137)

This change seems pretty innocuous but the bisect log does not lie :) I
am guessing UML is just special here somehow?

# bad: [367b5c3d53e57d51a5878816804652963da90950] Add linux-next specific files for 20240816
# good: [e724918b3786252b985b0c2764c16a57d1937707] Merge tag 'hardening-v6.11-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
git bisect start '367b5c3d53e57d51a5878816804652963da90950' 'e724918b3786252b985b0c2764c16a57d1937707'
# bad: [b12bdbe2615f5426953ae1e64d74176674618edb] Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next.git
git bisect bad b12bdbe2615f5426953ae1e64d74176674618edb
# bad: [9ad9c8d6eea9063fe7309cdc8e76bd12377cd613] Merge branch 'for-next' of https://github.com/sophgo/linux.git
git bisect bad 9ad9c8d6eea9063fe7309cdc8e76bd12377cd613
# bad: [57c53c832b28ca79eddca47c5b599036be10d347] Merge branch 'perf-tools-next' of git://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git
git bisect bad 57c53c832b28ca79eddca47c5b599036be10d347
# bad: [cbaf19e941bcd83cf50f569b3888f7db6dcaccfc] foo
git bisect bad cbaf19e941bcd83cf50f569b3888f7db6dcaccfc
# good: [cdb0e8eb648858f37bbe1d6245c3a3c49f265c1c] fixup! selftests/mm: Add mseal test for no-discard madvise
git bisect good cdb0e8eb648858f37bbe1d6245c3a3c49f265c1c
# bad: [4fdacc9ec44f04a9edc4ddd0c782ab698cd15257] mm: shmem: support large folio allocation for shmem_replace_folio()
git bisect bad 4fdacc9ec44f04a9edc4ddd0c782ab698cd15257
# good: [90f91965eee8256ffad811a6da097bc13b66aa2e] mm: reduce deferred struct page init ifdeffery
git bisect good 90f91965eee8256ffad811a6da097bc13b66aa2e
# good: [5ae759160c5df466f4ae7cb89c05cd963e91cc3c] mm: introduce a pageflag for partially mapped folios
git bisect good 5ae759160c5df466f4ae7cb89c05cd963e91cc3c
# good: [03683572685d2f8febfc022b758fdb4bddf8d783] maple_tree: fix comment typo with corresponding maple_status
git bisect good 03683572685d2f8febfc022b758fdb4bddf8d783
# bad: [74ef5018120b2a441428400a5f92891307d41b82] powerpc/vdso: refactor error handling
git bisect bad 74ef5018120b2a441428400a5f92891307d41b82
# bad: [5077f828c08424b81279341813a18b8923ebd42e] mm: add optional close() to struct vm_special_mapping
git bisect bad 5077f828c08424b81279341813a18b8923ebd42e
# good: [0ebac8817b5dce7b3a1afd6ff7197a75829d50ad] kfence: save freeing stack trace at calling time instead of freeing time
git bisect good 0ebac8817b5dce7b3a1afd6ff7197a75829d50ad
# first bad commit: [5077f828c08424b81279341813a18b8923ebd42e] mm: add optional close() to struct vm_special_mapping

The rootfs is available from [1] in case it matters
(x86_64-rootfs.ext4.zst, decompress it with zstd first); it just shuts
down the machine on boot.

Cheers,
Nathan

[1]: https://github.com/ClangBuiltLinux/boot-utils/releases/latest
Linus Torvalds Aug. 19, 2024, 7:29 p.m. UTC | #3
On Mon, 19 Aug 2024 at 11:53, Nathan Chancellor <nathan@kernel.org> wrote:
>
>
> Modules linked in:
> Pid: 24, comm: mount Not tainted 6.11.0-rc4-next-20240819
> RIP: 0033:0x68006f6c
> RSP: 000000006c8bfc68  EFLAGS: 00010206
> RAX: 0000000068006f6c RBX: 0000000068a0aa18 RCX: 00000000600d8b09
> RDX: 0000000000000000 RSI: 0000000068a0aa18 RDI: 0000000068805120
> RBP: 000000006c8bfc70 R08: 0000000000000001 R09: 0000000068ae0308
> R10: 000000000000000e R11: ffffffffffffffff R12: 0000000000000001
> R13: 0000000068a0aa18 R14: 0000000000000015 R15: 0000000068944a88
> Kernel panic - not syncing: Segfault with no mm
> CPU: 0 UID: 0 PID: 24 Comm: mount Not tainted 6.11.0-rc4-next-20240819 #1
> Stack:
>  600caeff 6c8bfc90 600d8b2a 68944a80
>  00000047 6c8bfda0 600cbfd9 6c8bfd50
>  68944ad0 68944a88 7f7ffff000 7f7fffffff
> Call Trace:
>  [<600caeff>] ? special_mapping_close+0x16/0x19

Hmm. No "Code:" line? Did you just edit that out, or maybe UML doesn't
print one out?

Anyway, for me that special_mapping_close() disassembles to


 <+0>:  mov    %rdi,%rsi
 <+3>:  mov    0x78(%rdi),%rdi
 <+7>:  mov    0x20(%rdi),%rax
 <+11>: test   %rax,%rax
 <+14>: je     0x600caa11 <special_mapping_close+24>
 <+16>: push   %rbp
 <+17>: mov    %rsp,%rbp
 <+20>: call   *%rax
 <+22>: pop    %rbp
 <+23>: ret
 <+24>: ret

which may just match yours, because special_mapping_close+0x16 is
obviously that +22, and it's the return point for that call.

And your %rax value does match that invalid %rip value of 0x68006f6c.

So it does look like it's jumping off to la-la-land, and the problem is the code

        const struct vm_special_mapping *sm = vma->vm_private_data;

        if (sm->close)
                sm->close(sm, vma);

where presumably 'vm_private_data' isn't a "struct vm_special_mapping *" at all.

And I think I see the problem.

When we have that 'legacy_special_mapping_vmops', then the
vm_private_data field actually points to 'pages'.

So the 'legacy_special_mapping_vmops' can *only* contain the '.fault'
handler, not the other handlers.

IOW, does something like this fix it?

  --- a/mm/mmap.c
  +++ b/mm/mmap.c
  @@ -2095,7 +2095,6 @@ static const struct vm_operations_struct
special_mapping_vmops = {
   };

   static const struct vm_operations_struct legacy_special_mapping_vmops = {
  -       .close = special_mapping_close,
          .fault = special_mapping_fault,
   };

and honestly, we should have different 'fault' functions instead of
having the same fault function and then making the function
dynamically see which vm_operations_struct it was. But that's a
separate issue.

And oh Christ, the difference between "_install_special_mapping()"
(which installs the modern style special mapping) and
"install_special_mapping()" (which installs the legacy special
mapping) is truly horrendously disgusting.

And yes, UML uses that legacy crap, which explains why it happens on
UML but not on sane architectures.

As does csky, hexagon, and nios2.

We should get rid of the legacy case entirely, and remove that insane
difference between _install_special_mapping() and
install_special_mapping().

But in the meantime, the one-liner fix *should* fix it. The four
architectures that uses the legacy crud don't care about the close
function anyway.

What a horrid thing.

                Linus
Nathan Chancellor Aug. 19, 2024, 7:51 p.m. UTC | #4
On Mon, Aug 19, 2024 at 12:29:34PM -0700, Linus Torvalds wrote:
> On Mon, 19 Aug 2024 at 11:53, Nathan Chancellor <nathan@kernel.org> wrote:
> >
> >
> > Modules linked in:
> > Pid: 24, comm: mount Not tainted 6.11.0-rc4-next-20240819
> > RIP: 0033:0x68006f6c
> > RSP: 000000006c8bfc68  EFLAGS: 00010206
> > RAX: 0000000068006f6c RBX: 0000000068a0aa18 RCX: 00000000600d8b09
> > RDX: 0000000000000000 RSI: 0000000068a0aa18 RDI: 0000000068805120
> > RBP: 000000006c8bfc70 R08: 0000000000000001 R09: 0000000068ae0308
> > R10: 000000000000000e R11: ffffffffffffffff R12: 0000000000000001
> > R13: 0000000068a0aa18 R14: 0000000000000015 R15: 0000000068944a88
> > Kernel panic - not syncing: Segfault with no mm
> > CPU: 0 UID: 0 PID: 24 Comm: mount Not tainted 6.11.0-rc4-next-20240819 #1
> > Stack:
> >  600caeff 6c8bfc90 600d8b2a 68944a80
> >  00000047 6c8bfda0 600cbfd9 6c8bfd50
> >  68944ad0 68944a88 7f7ffff000 7f7fffffff
> > Call Trace:
> >  [<600caeff>] ? special_mapping_close+0x16/0x19
> 
> Hmm. No "Code:" line? Did you just edit that out, or maybe UML doesn't
> print one out?

Nope, no editing, it is straight from my terminal. I guess UML just doesn't
print one.

> Anyway, for me that special_mapping_close() disassembles to
> 
> 
>  <+0>:  mov    %rdi,%rsi
>  <+3>:  mov    0x78(%rdi),%rdi
>  <+7>:  mov    0x20(%rdi),%rax
>  <+11>: test   %rax,%rax
>  <+14>: je     0x600caa11 <special_mapping_close+24>
>  <+16>: push   %rbp
>  <+17>: mov    %rsp,%rbp
>  <+20>: call   *%rax
>  <+22>: pop    %rbp
>  <+23>: ret
>  <+24>: ret
> 
> which may just match yours, because special_mapping_close+0x16 is
> obviously that +22, and it's the return point for that call.

Yeah seems like it, objdump -dr shows:

0000000000000027 <special_mapping_close>:
      27:   48 89 fe                mov    %rdi,%rsi
      2a:   48 8b 7f 78             mov    0x78(%rdi),%rdi
      2e:   48 8b 47 20             mov    0x20(%rdi),%rax
      32:   48 85 c0                test   %rax,%rax
      35:   74 08                   je     3f <special_mapping_close+0x18>
      37:   55                      push   %rbp
      38:   48 89 e5                mov    %rsp,%rbp
      3b:   ff d0                   call   *%rax
      3d:   5d                      pop    %rbp
      3e:   c3                      ret
      3f:   c3                      ret

> And your %rax value does match that invalid %rip value of 0x68006f6c.
> 
> So it does look like it's jumping off to la-la-land, and the problem is the code
> 
>         const struct vm_special_mapping *sm = vma->vm_private_data;
> 
>         if (sm->close)
>                 sm->close(sm, vma);
> 
> where presumably 'vm_private_data' isn't a "struct vm_special_mapping *" at all.
> 
> And I think I see the problem.
> 
> When we have that 'legacy_special_mapping_vmops', then the
> vm_private_data field actually points to 'pages'.
> 
> So the 'legacy_special_mapping_vmops' can *only* contain the '.fault'
> handler, not the other handlers.
> 
> IOW, does something like this fix it?
> 
>   --- a/mm/mmap.c
>   +++ b/mm/mmap.c
>   @@ -2095,7 +2095,6 @@ static const struct vm_operations_struct
> special_mapping_vmops = {
>    };
> 
>    static const struct vm_operations_struct legacy_special_mapping_vmops = {
>   -       .close = special_mapping_close,
>           .fault = special_mapping_fault,
>    };

Yes, that appears to fix it for me. I don't have much to say about the
rest but others might :)

Cheers,
Nathan
Linus Torvalds Aug. 19, 2024, 8:15 p.m. UTC | #5
On Mon, 19 Aug 2024 at 12:51, Nathan Chancellor <nathan@kernel.org> wrote:
>
> Yes, that appears to fix it for me. I don't have much to say about the
> rest but others might :)

Ok, I did a quick hack-job to remove that disgusting
install_special_mapping() legacy case.

With this, the old "install_special_mapping()" mess no longer exists,
but I haven't even attempted to compile the result, because I don't
have cross-compile environments for any of the affected architectures.

Except UML. I did at least build it there, but it's not like I tested it.

Adding architecture maintainers and more architecture lists to the
participants. It would be good to actually get this patch tested.
Context for newly added people:

   https://lore.kernel.org/all/CAHk-=wj9QPhG4CjiX8YLRC1wHj_Qs-T8wJi0WEhkfp0cszvB9w@mail.gmail.com/

NOTE! This patch is against my current tree, not the linux-next
changes. But it should entirely remove the case that caused problems
in linux-next.

                      Linus
Linus Torvalds Aug. 19, 2024, 8:16 p.m. UTC | #6
On Mon, 19 Aug 2024 at 13:15, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Ok, I did a quick hack-job to remove that disgusting
> install_special_mapping() legacy case.
>
> With this [..]

I forgot to actually attach that "this". Here it is. For real, this time.

                   Linus
Andrew Morton Aug. 20, 2024, 1:05 a.m. UTC | #7
On Mon, 19 Aug 2024 13:16:32 -0700 Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Mon, 19 Aug 2024 at 13:15, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > Ok, I did a quick hack-job to remove that disgusting
> > install_special_mapping() legacy case.
> >
> > With this [..]
> 
> I forgot to actually attach that "this". Here it is. For real, this time.

Thanks.  Do you think your one-liner remains desirable with this fix in
place?
Linus Torvalds Aug. 20, 2024, 1:11 a.m. UTC | #8
On Mon, 19 Aug 2024 at 18:05, Andrew Morton <akpm@linux-foundation.org> wrote:
>
> >
> > I forgot to actually attach that "this". Here it is. For real, this time.
>
> Thanks.  Do you think your one-liner remains desirable with this fix in
> place?

That patch actually removes the whole 'legacy_special_mapping_vmops'
that my one-liner then removed the '.close' field from, so no - my
one-liner just becomes a non-issue.

NOTE! That patch of mine *will* conflict with Michael's patch series,
since my patch to remove legacy_special_mapping_vmops was done on top
of my current -git tree. But it should be an obvious conflict, in that
it just means that the addition of .close never happens.

               Linus
Michael Ellerman Aug. 20, 2024, 6:26 a.m. UTC | #9
Linus Torvalds <torvalds@linux-foundation.org> writes:
> On Mon, 19 Aug 2024 at 13:15, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>
>> Ok, I did a quick hack-job to remove that disgusting
>> install_special_mapping() legacy case.
>>
>> With this [..]
>
> I forgot to actually attach that "this". Here it is. For real, this time.
>
>                    Linus
>  arch/csky/kernel/vdso.c            | 28 +++++++++++++++++++++-------
>  arch/hexagon/kernel/vdso.c         | 14 ++++++++++----
>  arch/nios2/mm/init.c               | 12 ++++++++----
>  arch/sh/kernel/vsyscall/vsyscall.c | 14 +++++++++++---
>  arch/x86/um/vdso/vma.c             | 12 ++++++++----
>  include/linux/mm.h                 |  4 ----
>  mm/mmap.c                          | 32 +++++---------------------------
>  7 files changed, 63 insertions(+), 53 deletions(-)
>
...
> index 1bd85a6949c4..5e68ab7a8898 100644
> --- a/arch/sh/kernel/vsyscall/vsyscall.c
> +++ b/arch/sh/kernel/vsyscall/vsyscall.c
> @@ -36,6 +36,10 @@ __setup("vdso=", vdso_setup);
>   */
>  extern const char vsyscall_trapa_start, vsyscall_trapa_end;
>  static struct page *syscall_pages[1];
> +static struct vm_special_mapping vdso_mapping = {
> +	.name = "[vdso]",
> +	.pages = syscall_pages;
                              ^
                              should be ,
> +};

cheers
Linus Torvalds Aug. 20, 2024, 3:31 p.m. UTC | #10
On Mon, 19 Aug 2024 at 23:26, Michael Ellerman <mpe@ellerman.id.au> wrote:
>
> Linus Torvalds <torvalds@linux-foundation.org> writes:
> > +static struct vm_special_mapping vdso_mapping = {
> > +     .name = "[vdso]",
> > +     .pages = syscall_pages;
>                               ^
>                               should be ,

Ack. Changed here locally. But I assume you also don't actually test sh...

It would be good to get acks from the architectures that still used
the legacy interface.

              Linus
Linus Torvalds Aug. 20, 2024, 9:31 p.m. UTC | #11
On Tue, 20 Aug 2024 at 14:17, Rob Landley <rob@landley.net> wrote:
>
> Hexagon also has &&vdso_page which I don't understand (but have a toolchain for
> somewhere to at least smoketest...)

The '&&' is just a typo. It should obviously be just a single '&'. As
mentioned, the only testing that patch got was a x86-64 UML build
test.

Fixed locally.

               Linus
Rob Landley Aug. 20, 2024, 9:31 p.m. UTC | #12
On 8/20/24 10:31, Linus Torvalds wrote:
> On Mon, 19 Aug 2024 at 23:26, Michael Ellerman <mpe@ellerman.id.au> wrote:
>>
>> Linus Torvalds <torvalds@linux-foundation.org> writes:
>> > +static struct vm_special_mapping vdso_mapping = {
>> > +     .name = "[vdso]",
>> > +     .pages = syscall_pages;
>>                               ^
>>                               should be ,
> 
> Ack. Changed here locally.

Hexagon also has &&vdso_page which I don't understand (but have a toolchain for
somewhere to at least smoketest...)

> But I assume you also don't actually test sh...

But I do.

Aside: arch/sh smoketests easily under qemu, here's a relocatable binary toolchain:

  wget https://landley.net/bin/toolchains/latest/sh4-linux-musl-cross.tar.xz
  tar xvf sh4-linux-musl-cross-tar.xz
  CROSS_COMPILE=$PWD/sh4-linux-musl-cross/bin/sh4-linux-musl-

And https://landley.net/bin/mkroot/latest/sh4.tgz is a tiny qemu-system-sh4
system with kernel + initramfs.cpio.gz + run-qemu.sh with the config used to
build it under docs/ and as the first three lines of docs/linux-miniconfig
record, the kernel is arch/sh/boot/zImage.

You can keep the initramfs.cpio.gz and run-qemu.sh and swap out the kernel for a
quick boot to shell prompt under qemu. Serial console is qemu's stdin/stdout,
"exit" to shut down the emulator.

The build script, if you care, is 400 lines of bash:

  https://github.com/landley/toybox/blob/master/mkroot/mkroot.sh

> It would be good to get acks from the architectures that still used
> the legacy interface.

I'll give arch/sh a whirl. Can somebody answer my above hexagon question?

>               Linus

Rob
Rob Landley Aug. 20, 2024, 10:10 p.m. UTC | #13
On 8/20/24 16:31, Linus Torvalds wrote:
> On Tue, 20 Aug 2024 at 14:17, Rob Landley <rob@landley.net> wrote:
>>
>> Hexagon also has &&vdso_page which I don't understand (but have a toolchain for
>> somewhere to at least smoketest...)
> 
> The '&&' is just a typo. It should obviously be just a single '&'. As
> mentioned, the only testing that patch got was a x86-64 UML build
> test.
> 
> Fixed locally.

I deleted the extra ; and arch/sh4 built and qemu-system-sh4 booted to shell prompt:

Freeing initrd memory: 556K
Freeing unused kernel image (initmem) memory: 132K
This architecture does not have kernel memory protection.
Run /init as init process
8139cp 0000:00:02.0 eth0: link up, 100Mbps, full-duplex, lpa 0x05E1
Type exit when done.
# cat /proc/version
Linux version 6.11.0-rc4 (landley@driftwood) (sh4-linux-musl-gcc (GCC) 11.2.0,
GNU ld (GNU Binutils) 2.33.1) #1 Tue Aug 20 16:45:25 CDT 2024
# head -n 3 /proc/cpuinfo
machine		: RTS7751R2D
processor	: 0
cpu family	: sh4

Tested-by: Rob Landley <rob@landley.net>

Rob
Linus Torvalds Aug. 20, 2024, 11:14 p.m. UTC | #14
On Tue, 20 Aug 2024 at 14:56, Rob Landley <rob@landley.net> wrote:
>
> Tested-by: Rob Landley <rob@landley.net>

Ok, here's the patch with the fixes for problems that were pointed
out, and with the beginnings of a commit message.

I'm going to be traveling for the next ~10 days, and while I'll have
my laptop and internet, I'm unlikely to work on this any more.

Andrew - I think this is good, but there may be other issues lurking.
Do with it as you see fit,

            Linus
Andrew Morton Aug. 21, 2024, 1:18 a.m. UTC | #15
On Tue, 20 Aug 2024 16:14:29 -0700 Linus Torvalds <torvalds@linux-foundation.org> wrote:

> Andrew - I think this is good, but there may be other issues lurking.
> Do with it as you see fit,

Hey you know me, I'll merge any old thing if I think it'll help nurture
newbies.
Sven Schnelle Sept. 2, 2024, 7:06 p.m. UTC | #16
Nathan Chancellor <nathan@kernel.org> writes:

> Hi Michael,
>
> On Mon, Aug 12, 2024 at 06:26:02PM +1000, Michael Ellerman wrote:
>> Add an optional close() callback to struct vm_special_mapping. It will
>> be used, by powerpc at least, to handle unmapping of the VDSO.
>>
>> Although support for unmapping the VDSO was initially added
>> for CRIU[1], it is not desirable to guard that support behind
>> CONFIG_CHECKPOINT_RESTORE.
>>
>> There are other known users of unmapping the VDSO which are not related
>> to CRIU, eg. Valgrind [2] and void-ship [3].
>>
>> The powerpc arch_unmap() hook has been in place for ~9 years, with no
>> ifdef, so there may be other unknown users that have come to rely on
>> unmapping the VDSO. Even if the code was behind an ifdef, major distros
>> enable CHECKPOINT_RESTORE so users may not realise unmapping the VDSO
>> depends on that configuration option.
>>
>> It's also undesirable to have such core mm behaviour behind a relatively
>> obscure CONFIG option.
>>
>> Longer term the unmap behaviour should be standardised across
>> architectures, however that is complicated by the fact the VDSO pointer
>> is stored differently across architectures. There was a previous attempt
>> to unify that handling [4], which could be revived.
>>
>> See [5] for further discussion.
>>
>> [1]: commit 83d3f0e90c6c ("powerpc/mm: tracking vDSO remap")
>> [2]: https://sourceware.org/git/?p=valgrind.git;a=commit;h=3a004915a2cbdcdebafc1612427576bf3321eef5
>> [3]: https://github.com/insanitybit/void-ship
>> [4]: https://lore.kernel.org/lkml/20210611180242.711399-17-dima@arista.com/
>> [5]: https://lore.kernel.org/linuxppc-dev/shiq5v3jrmyi6ncwke7wgl76ojysgbhrchsk32q4lbx2hadqqc@kzyy2igem256
>>
>> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
>> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
>> Reviewed-by: David Hildenbrand <david@redhat.com>
>> ---
>>  include/linux/mm_types.h | 3 +++
>>  mm/mmap.c                | 6 ++++++
>>  2 files changed, 9 insertions(+)
>>
>> v2:
>> - Add some blank lines as requested.
>> - Expand special_mapping_close() comment.
>> - Add David's reviewed-by.
>> - Expand change log to capture review discussion.
>>
>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
>> index 485424979254..78bdfc59abe5 100644
>> --- a/include/linux/mm_types.h
>> +++ b/include/linux/mm_types.h
>> @@ -1313,6 +1313,9 @@ struct vm_special_mapping {
>>
>>  	int (*mremap)(const struct vm_special_mapping *sm,
>>  		     struct vm_area_struct *new_vma);
>> +
>> +	void (*close)(const struct vm_special_mapping *sm,
>> +		      struct vm_area_struct *vma);
>>  };
>>
>>  enum tlb_flush_reason {
>> diff --git a/mm/mmap.c b/mm/mmap.c
>> index d0dfc85b209b..af4dbf0d3bd4 100644
>> --- a/mm/mmap.c
>> +++ b/mm/mmap.c
>> @@ -3620,10 +3620,16 @@ void vm_stat_account(struct mm_struct *mm, vm_flags_t flags, long npages)
>>  static vm_fault_t special_mapping_fault(struct vm_fault *vmf);
>>
>>  /*
>> + * Close hook, called for unmap() and on the old vma for mremap().
>> + *
>>   * Having a close hook prevents vma merging regardless of flags.
>>   */
>>  static void special_mapping_close(struct vm_area_struct *vma)
>>  {
>> +	const struct vm_special_mapping *sm = vma->vm_private_data;
>> +
>> +	if (sm->close)
>> +		sm->close(sm, vma);
>>  }
>>
>>  static const char *special_mapping_name(struct vm_area_struct *vma)
>> --
>> 2.45.2
>>
>
> This change is now in -next and I bisected a crash that our CI sees with
> ARCH=um to it:

I see another crash on s390, which seems related, but rather an issue in
uprobe. This can be reproduced by

# cd linux-next/tools/testing/selftests/ftrace
# ./ftracetest ./test.d/dynevent/add_remove_uprobe.tc

The 'mm: Add optional close() to struct vm_special_mapping' patch just
makes it visible. I enabled KASAN, and that shows me:

[   44.505448] ==================================================================                                                                      20:37:27 [3421/145075]
[   44.505455] BUG: KASAN: slab-use-after-free in special_mapping_close+0x9c/0xc8
[   44.505471] Read of size 8 at addr 00000000868dac48 by task sh/1384
[   44.505479]
[   44.505486] CPU: 51 UID: 0 PID: 1384 Comm: sh Not tainted 6.11.0-rc6-next-20240902-dirty #1496
[   44.505503] Hardware name: IBM 3931 A01 704 (z/VM 7.3.0)
[   44.505508] Call Trace:
[   44.505511]  [<000b0324d2f78080>] dump_stack_lvl+0xd0/0x108
[   44.505521]  [<000b0324d2f5435c>] print_address_description.constprop.0+0x34/0x2e0
[   44.505529]  [<000b0324d2f5464c>] print_report+0x44/0x138
[   44.505536]  [<000b0324d1383192>] kasan_report+0xc2/0x140
[   44.505543]  [<000b0324d2f52904>] special_mapping_close+0x9c/0xc8
[   44.505550]  [<000b0324d12c7978>] remove_vma+0x78/0x120
[   44.505557]  [<000b0324d128a2c6>] exit_mmap+0x326/0x750
[   44.505563]  [<000b0324d0ba655a>] __mmput+0x9a/0x370
[   44.505570]  [<000b0324d0bbfbe0>] exit_mm+0x240/0x340
[   44.505575]  [<000b0324d0bc0228>] do_exit+0x548/0xd70
[   44.505580]  [<000b0324d0bc1102>] do_group_exit+0x132/0x390
[   44.505586]  [<000b0324d0bc13b6>] __s390x_sys_exit_group+0x56/0x60
[   44.505592]  [<000b0324d0adcbd6>] do_syscall+0x2f6/0x430
[   44.505599]  [<000b0324d2f78434>] __do_syscall+0xa4/0x170
[   44.505606]  [<000b0324d2f9454c>] system_call+0x74/0x98
[   44.505614]
[   44.505616] Allocated by task 1384:
[   44.505621]  kasan_save_stack+0x40/0x70
[   44.505630]  kasan_save_track+0x28/0x40
[   44.505636]  __kasan_kmalloc+0xa0/0xc0
[   44.505642]  __create_xol_area+0xfa/0x410
[   44.505648]  get_xol_area+0xb0/0xf0
[   44.505652]  uprobe_notify_resume+0x27a/0x470
[   44.505657]  irqentry_exit_to_user_mode+0x15e/0x1d0
[   44.505664]  pgm_check_handler+0x122/0x170
[   44.505670]
[   44.505672] Freed by task 1384:
[   44.505676]  kasan_save_stack+0x40/0x70
[   44.505682]  kasan_save_track+0x28/0x40
[   44.505687]  kasan_save_free_info+0x4a/0x70
[   44.505693]  __kasan_slab_free+0x5a/0x70
[   44.505698]  kfree+0xe8/0x3f0
[   44.505704]  __mmput+0x20/0x370
[   44.505709]  exit_mm+0x240/0x340
[   44.505713]  do_exit+0x548/0xd70
[   44.505718]  do_group_exit+0x132/0x390
[   44.505722]  __s390x_sys_exit_group+0x56/0x60
[   44.505727]  do_syscall+0x2f6/0x430
[   44.505732]  __do_syscall+0xa4/0x170
[   44.505738]  system_call+0x74/0x98

The problem is that uprobe_clear_state() kfree's struct xol_area, which
contains struct vm_special_mapping *xol_mapping. This one is passed to
_install_special_mapping() in xol_add_vma().

__mput reads:

static inline void __mmput(struct mm_struct *mm)
{
	VM_BUG_ON(atomic_read(&mm->mm_users));

	uprobe_clear_state(mm);
	exit_aio(mm);
	ksm_exit(mm);
	khugepaged_exit(mm); /* must run before exit_mmap */
	exit_mmap(mm);
	...
}

So uprobe_clear_state() in the beginning free's the memory area
containing the vm_special_mapping data, but exit_mmap() uses this
address later via vma->vm_private_data (which was set in _install_special_mapping().

The following change fixes this for me, but i'm not sure about any side
effects:

diff --git a/kernel/fork.c b/kernel/fork.c
index df8e4575ff01..cfcabba36c93 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1340,11 +1340,11 @@ static inline void __mmput(struct mm_struct *mm)
 {
        VM_BUG_ON(atomic_read(&mm->mm_users));
 
-       uprobe_clear_state(mm);
        exit_aio(mm);
        ksm_exit(mm);
        khugepaged_exit(mm); /* must run before exit_mmap */
        exit_mmap(mm);
+       uprobe_clear_state(mm);
        mm_put_huge_zero_folio(mm);
        set_mm_exe_file(mm, NULL);
        if (!list_empty(&mm->mmlist)) {


Any thoughts?
Andrew Morton Sept. 2, 2024, 8:49 p.m. UTC | #17
On Mon, 02 Sep 2024 21:06:48 +0200 Sven Schnelle <svens@linux.ibm.com> wrote:

> So uprobe_clear_state() in the beginning free's the memory area
> containing the vm_special_mapping data, but exit_mmap() uses this
> address later via vma->vm_private_data (which was set in _install_special_mapping().
> 
> The following change fixes this for me, but i'm not sure about any side
> effects:
> 
> diff --git a/kernel/fork.c b/kernel/fork.c
> index df8e4575ff01..cfcabba36c93 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1340,11 +1340,11 @@ static inline void __mmput(struct mm_struct *mm)
>  {
>         VM_BUG_ON(atomic_read(&mm->mm_users));
>  
> -       uprobe_clear_state(mm);
>         exit_aio(mm);
>         ksm_exit(mm);
>         khugepaged_exit(mm); /* must run before exit_mmap */
>         exit_mmap(mm);
> +       uprobe_clear_state(mm);
>         mm_put_huge_zero_folio(mm);
>         set_mm_exe_file(mm, NULL);
>         if (!list_empty(&mm->mmlist)) {

uprobe_clear_state() is a pretty simple low-level thing.  Side-effects
seem unlikely?
Linus Torvalds Sept. 2, 2024, 9:02 p.m. UTC | #18
On Mon, 2 Sept 2024 at 13:49, Andrew Morton <akpm@linux-foundation.org> wrote:
>
> uprobe_clear_state() is a pretty simple low-level thing.  Side-effects
> seem unlikely?

I think uprobe_clear_state() should be removed from fork.c entirely,
made 'static', and then we'd have

        area->xol_mapping.close = uprobe_clear_state;

in __create_xol_area() instead (ok, the arguments change, instead of
looking up "mm->uprobes_state.xol_area", it would get it as the vma
argument)

That's how it should always have been, except we didn't have a close() function.

Hmm?

            Linus
Sven Schnelle Sept. 3, 2024, 6:27 a.m. UTC | #19
Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Mon, 2 Sept 2024 at 13:49, Andrew Morton <akpm@linux-foundation.org> wrote:
>>
>> uprobe_clear_state() is a pretty simple low-level thing.  Side-effects
>> seem unlikely?
>
> I think uprobe_clear_state() should be removed from fork.c entirely,
> made 'static', and then we'd have
>
>         area->xol_mapping.close = uprobe_clear_state;
>
> in __create_xol_area() instead (ok, the arguments change, instead of
> looking up "mm->uprobes_state.xol_area", it would get it as the vma
> argument)
>
> That's how it should always have been, except we didn't have a close() function.
>
> Hmm?

Indeed, that's much better. I'll prepare a patch.

Thanks!
diff mbox series

Patch

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 485424979254..78bdfc59abe5 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -1313,6 +1313,9 @@  struct vm_special_mapping {
 
 	int (*mremap)(const struct vm_special_mapping *sm,
 		     struct vm_area_struct *new_vma);
+
+	void (*close)(const struct vm_special_mapping *sm,
+		      struct vm_area_struct *vma);
 };
 
 enum tlb_flush_reason {
diff --git a/mm/mmap.c b/mm/mmap.c
index d0dfc85b209b..af4dbf0d3bd4 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3620,10 +3620,16 @@  void vm_stat_account(struct mm_struct *mm, vm_flags_t flags, long npages)
 static vm_fault_t special_mapping_fault(struct vm_fault *vmf);
 
 /*
+ * Close hook, called for unmap() and on the old vma for mremap().
+ *
  * Having a close hook prevents vma merging regardless of flags.
  */
 static void special_mapping_close(struct vm_area_struct *vma)
 {
+	const struct vm_special_mapping *sm = vma->vm_private_data;
+
+	if (sm->close)
+		sm->close(sm, vma);
 }
 
 static const char *special_mapping_name(struct vm_area_struct *vma)