Message ID | 20240812082605.743814-1-mpe@ellerman.id.au (mailing list archive) |
---|---|
State | Handled Elsewhere |
Headers | show |
Series | [v2,1/4] mm: Add optional close() to struct vm_special_mapping | expand |
* Michael Ellerman <mpe@ellerman.id.au> [240812 04:26]: > Add an optional close() callback to struct vm_special_mapping. It will > be used, by powerpc at least, to handle unmapping of the VDSO. > > Although support for unmapping the VDSO was initially added > for CRIU[1], it is not desirable to guard that support behind > CONFIG_CHECKPOINT_RESTORE. > > There are other known users of unmapping the VDSO which are not related > to CRIU, eg. Valgrind [2] and void-ship [3]. > > The powerpc arch_unmap() hook has been in place for ~9 years, with no > ifdef, so there may be other unknown users that have come to rely on > unmapping the VDSO. Even if the code was behind an ifdef, major distros > enable CHECKPOINT_RESTORE so users may not realise unmapping the VDSO > depends on that configuration option. > > It's also undesirable to have such core mm behaviour behind a relatively > obscure CONFIG option. > > Longer term the unmap behaviour should be standardised across > architectures, however that is complicated by the fact the VDSO pointer > is stored differently across architectures. There was a previous attempt > to unify that handling [4], which could be revived. > > See [5] for further discussion. Thank you for tackling this issue, it's much improved. For the whole series: Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> > > [1]: commit 83d3f0e90c6c ("powerpc/mm: tracking vDSO remap") > [2]: https://sourceware.org/git/?p=valgrind.git;a=commit;h=3a004915a2cbdcdebafc1612427576bf3321eef5 > [3]: https://github.com/insanitybit/void-ship > [4]: https://lore.kernel.org/lkml/20210611180242.711399-17-dima@arista.com/ > [5]: https://lore.kernel.org/linuxppc-dev/shiq5v3jrmyi6ncwke7wgl76ojysgbhrchsk32q4lbx2hadqqc@kzyy2igem256 > > Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> > Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> > Reviewed-by: David Hildenbrand <david@redhat.com> > --- > include/linux/mm_types.h | 3 +++ > mm/mmap.c | 6 ++++++ > 2 files changed, 9 insertions(+) > > v2: > - Add some blank lines as requested. > - Expand special_mapping_close() comment. > - Add David's reviewed-by. > - Expand change log to capture review discussion. > > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h > index 485424979254..78bdfc59abe5 100644 > --- a/include/linux/mm_types.h > +++ b/include/linux/mm_types.h > @@ -1313,6 +1313,9 @@ struct vm_special_mapping { > > int (*mremap)(const struct vm_special_mapping *sm, > struct vm_area_struct *new_vma); > + > + void (*close)(const struct vm_special_mapping *sm, > + struct vm_area_struct *vma); > }; > > enum tlb_flush_reason { > diff --git a/mm/mmap.c b/mm/mmap.c > index d0dfc85b209b..af4dbf0d3bd4 100644 > --- a/mm/mmap.c > +++ b/mm/mmap.c > @@ -3620,10 +3620,16 @@ void vm_stat_account(struct mm_struct *mm, vm_flags_t flags, long npages) > static vm_fault_t special_mapping_fault(struct vm_fault *vmf); > > /* > + * Close hook, called for unmap() and on the old vma for mremap(). > + * > * Having a close hook prevents vma merging regardless of flags. > */ > static void special_mapping_close(struct vm_area_struct *vma) > { > + const struct vm_special_mapping *sm = vma->vm_private_data; > + > + if (sm->close) > + sm->close(sm, vma); > } > > static const char *special_mapping_name(struct vm_area_struct *vma) > -- > 2.45.2 >
Hi Michael, On Mon, Aug 12, 2024 at 06:26:02PM +1000, Michael Ellerman wrote: > Add an optional close() callback to struct vm_special_mapping. It will > be used, by powerpc at least, to handle unmapping of the VDSO. > > Although support for unmapping the VDSO was initially added > for CRIU[1], it is not desirable to guard that support behind > CONFIG_CHECKPOINT_RESTORE. > > There are other known users of unmapping the VDSO which are not related > to CRIU, eg. Valgrind [2] and void-ship [3]. > > The powerpc arch_unmap() hook has been in place for ~9 years, with no > ifdef, so there may be other unknown users that have come to rely on > unmapping the VDSO. Even if the code was behind an ifdef, major distros > enable CHECKPOINT_RESTORE so users may not realise unmapping the VDSO > depends on that configuration option. > > It's also undesirable to have such core mm behaviour behind a relatively > obscure CONFIG option. > > Longer term the unmap behaviour should be standardised across > architectures, however that is complicated by the fact the VDSO pointer > is stored differently across architectures. There was a previous attempt > to unify that handling [4], which could be revived. > > See [5] for further discussion. > > [1]: commit 83d3f0e90c6c ("powerpc/mm: tracking vDSO remap") > [2]: https://sourceware.org/git/?p=valgrind.git;a=commit;h=3a004915a2cbdcdebafc1612427576bf3321eef5 > [3]: https://github.com/insanitybit/void-ship > [4]: https://lore.kernel.org/lkml/20210611180242.711399-17-dima@arista.com/ > [5]: https://lore.kernel.org/linuxppc-dev/shiq5v3jrmyi6ncwke7wgl76ojysgbhrchsk32q4lbx2hadqqc@kzyy2igem256 > > Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> > Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> > Reviewed-by: David Hildenbrand <david@redhat.com> > --- > include/linux/mm_types.h | 3 +++ > mm/mmap.c | 6 ++++++ > 2 files changed, 9 insertions(+) > > v2: > - Add some blank lines as requested. > - Expand special_mapping_close() comment. > - Add David's reviewed-by. > - Expand change log to capture review discussion. > > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h > index 485424979254..78bdfc59abe5 100644 > --- a/include/linux/mm_types.h > +++ b/include/linux/mm_types.h > @@ -1313,6 +1313,9 @@ struct vm_special_mapping { > > int (*mremap)(const struct vm_special_mapping *sm, > struct vm_area_struct *new_vma); > + > + void (*close)(const struct vm_special_mapping *sm, > + struct vm_area_struct *vma); > }; > > enum tlb_flush_reason { > diff --git a/mm/mmap.c b/mm/mmap.c > index d0dfc85b209b..af4dbf0d3bd4 100644 > --- a/mm/mmap.c > +++ b/mm/mmap.c > @@ -3620,10 +3620,16 @@ void vm_stat_account(struct mm_struct *mm, vm_flags_t flags, long npages) > static vm_fault_t special_mapping_fault(struct vm_fault *vmf); > > /* > + * Close hook, called for unmap() and on the old vma for mremap(). > + * > * Having a close hook prevents vma merging regardless of flags. > */ > static void special_mapping_close(struct vm_area_struct *vma) > { > + const struct vm_special_mapping *sm = vma->vm_private_data; > + > + if (sm->close) > + sm->close(sm, vma); > } > > static const char *special_mapping_name(struct vm_area_struct *vma) > -- > 2.45.2 > This change is now in -next and I bisected a crash that our CI sees with ARCH=um to it: $ make -skj"$(nproc)" ARCH=um CROSS_COMPILE=x86_64-linux- defconfig linux $ ./linux ubd0=$PWD/rootfs.ext4 ... Linux version 6.11.0-rc4-next-20240819 (nathan@thelio-3990X) (x86_64-linux-gcc (GCC) 14.2.0, GNU ld (GNU Binutils) 2.42) #1 Mon Aug 19 11:42:20 MST 2024 ... Run /sbin/init as init process Modules linked in: Pid: 24, comm: mount Not tainted 6.11.0-rc4-next-20240819 RIP: 0033:0x68006f6c RSP: 000000006c8bfc68 EFLAGS: 00010206 RAX: 0000000068006f6c RBX: 0000000068a0aa18 RCX: 00000000600d8b09 RDX: 0000000000000000 RSI: 0000000068a0aa18 RDI: 0000000068805120 RBP: 000000006c8bfc70 R08: 0000000000000001 R09: 0000000068ae0308 R10: 000000000000000e R11: ffffffffffffffff R12: 0000000000000001 R13: 0000000068a0aa18 R14: 0000000000000015 R15: 0000000068944a88 Kernel panic - not syncing: Segfault with no mm CPU: 0 UID: 0 PID: 24 Comm: mount Not tainted 6.11.0-rc4-next-20240819 #1 Stack: 600caeff 6c8bfc90 600d8b2a 68944a80 00000047 6c8bfda0 600cbfd9 6c8bfd50 68944ad0 68944a88 7f7ffff000 7f7fffffff Call Trace: [<600caeff>] ? special_mapping_close+0x16/0x19 [<600d8b2a>] remove_vma+0x21/0x59 [<600cbfd9>] exit_mmap+0x1f3/0x2bc [<60032a0c>] ? unblock_signals+0x0/0xbd [<600329fd>] ? block_signals+0x0/0xf [<6003831c>] __mmput+0x24/0x94 [<60067262>] ? up_read+0x0/0x2c [<600383a1>] mmput+0x15/0x18 [<6003ce97>] do_exit+0x381/0x9b8 [<600e4b8d>] ? kfree+0x107/0x11b [<6003d752>] sys_exit_group+0x0/0x16 [<6003d768>] pid_child_should_wake+0x0/0x42 [<60022e7a>] handle_syscall+0x79/0xa7 [<600358de>] userspace+0x4d3/0x505 [<60020927>] fork_handler+0x84/0x8b Passing this through scripts/decode_stacktrace.sh results in ? special_mapping_close (mm/mmap.c:2056) remove_vma (mm/vma.c:144) exit_mmap (include/linux/sched.h:2049 mm/mmap.c:1947) ? unblock_signals (arch/um/os-Linux/signal.c:296) ? block_signals (arch/um/os-Linux/signal.c:282) __mmput (kernel/fork.c:1349) ? up_read (arch/x86/include/asm/atomic64_64.h:79 (discriminator 5) include/linux/atomic/atomic-arch-fallback.h:2749 (discriminator 5) include/linux/atomic/atomic-long.h:184 (discriminator 5) include/linux/atomic/atomic-instrumented.h:3317 (discriminator 5) kernel/locking/rwsem.c:1347 (discriminator 5) kernel/locking/rwsem.c:1622 (discriminator 5)) mmput (kernel/fork.c:1370) do_exit (arch/um/include/asm/thread_info.h:46 kernel/exit.c:572 kernel/exit.c:926) ? kfree (mm/slub.c:4482 (discriminator 2) mm/slub.c:4522 (discriminator 2) mm/slub.c:4669 (discriminator 2)) sys_exit_group (kernel/exit.c:1099 kernel/exit.c:1097) pid_child_should_wake (kernel/exit.c:1106 kernel/exit.c:1565) handle_syscall (arch/um/kernel/skas/syscall.c:45 (discriminator 1)) userspace (arch/um/os-Linux/skas/process.c:466) fork_handler (arch/um/kernel/process.c:137) This change seems pretty innocuous but the bisect log does not lie :) I am guessing UML is just special here somehow? # bad: [367b5c3d53e57d51a5878816804652963da90950] Add linux-next specific files for 20240816 # good: [e724918b3786252b985b0c2764c16a57d1937707] Merge tag 'hardening-v6.11-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux git bisect start '367b5c3d53e57d51a5878816804652963da90950' 'e724918b3786252b985b0c2764c16a57d1937707' # bad: [b12bdbe2615f5426953ae1e64d74176674618edb] Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next.git git bisect bad b12bdbe2615f5426953ae1e64d74176674618edb # bad: [9ad9c8d6eea9063fe7309cdc8e76bd12377cd613] Merge branch 'for-next' of https://github.com/sophgo/linux.git git bisect bad 9ad9c8d6eea9063fe7309cdc8e76bd12377cd613 # bad: [57c53c832b28ca79eddca47c5b599036be10d347] Merge branch 'perf-tools-next' of git://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git git bisect bad 57c53c832b28ca79eddca47c5b599036be10d347 # bad: [cbaf19e941bcd83cf50f569b3888f7db6dcaccfc] foo git bisect bad cbaf19e941bcd83cf50f569b3888f7db6dcaccfc # good: [cdb0e8eb648858f37bbe1d6245c3a3c49f265c1c] fixup! selftests/mm: Add mseal test for no-discard madvise git bisect good cdb0e8eb648858f37bbe1d6245c3a3c49f265c1c # bad: [4fdacc9ec44f04a9edc4ddd0c782ab698cd15257] mm: shmem: support large folio allocation for shmem_replace_folio() git bisect bad 4fdacc9ec44f04a9edc4ddd0c782ab698cd15257 # good: [90f91965eee8256ffad811a6da097bc13b66aa2e] mm: reduce deferred struct page init ifdeffery git bisect good 90f91965eee8256ffad811a6da097bc13b66aa2e # good: [5ae759160c5df466f4ae7cb89c05cd963e91cc3c] mm: introduce a pageflag for partially mapped folios git bisect good 5ae759160c5df466f4ae7cb89c05cd963e91cc3c # good: [03683572685d2f8febfc022b758fdb4bddf8d783] maple_tree: fix comment typo with corresponding maple_status git bisect good 03683572685d2f8febfc022b758fdb4bddf8d783 # bad: [74ef5018120b2a441428400a5f92891307d41b82] powerpc/vdso: refactor error handling git bisect bad 74ef5018120b2a441428400a5f92891307d41b82 # bad: [5077f828c08424b81279341813a18b8923ebd42e] mm: add optional close() to struct vm_special_mapping git bisect bad 5077f828c08424b81279341813a18b8923ebd42e # good: [0ebac8817b5dce7b3a1afd6ff7197a75829d50ad] kfence: save freeing stack trace at calling time instead of freeing time git bisect good 0ebac8817b5dce7b3a1afd6ff7197a75829d50ad # first bad commit: [5077f828c08424b81279341813a18b8923ebd42e] mm: add optional close() to struct vm_special_mapping The rootfs is available from [1] in case it matters (x86_64-rootfs.ext4.zst, decompress it with zstd first); it just shuts down the machine on boot. Cheers, Nathan [1]: https://github.com/ClangBuiltLinux/boot-utils/releases/latest
On Mon, 19 Aug 2024 at 11:53, Nathan Chancellor <nathan@kernel.org> wrote: > > > Modules linked in: > Pid: 24, comm: mount Not tainted 6.11.0-rc4-next-20240819 > RIP: 0033:0x68006f6c > RSP: 000000006c8bfc68 EFLAGS: 00010206 > RAX: 0000000068006f6c RBX: 0000000068a0aa18 RCX: 00000000600d8b09 > RDX: 0000000000000000 RSI: 0000000068a0aa18 RDI: 0000000068805120 > RBP: 000000006c8bfc70 R08: 0000000000000001 R09: 0000000068ae0308 > R10: 000000000000000e R11: ffffffffffffffff R12: 0000000000000001 > R13: 0000000068a0aa18 R14: 0000000000000015 R15: 0000000068944a88 > Kernel panic - not syncing: Segfault with no mm > CPU: 0 UID: 0 PID: 24 Comm: mount Not tainted 6.11.0-rc4-next-20240819 #1 > Stack: > 600caeff 6c8bfc90 600d8b2a 68944a80 > 00000047 6c8bfda0 600cbfd9 6c8bfd50 > 68944ad0 68944a88 7f7ffff000 7f7fffffff > Call Trace: > [<600caeff>] ? special_mapping_close+0x16/0x19 Hmm. No "Code:" line? Did you just edit that out, or maybe UML doesn't print one out? Anyway, for me that special_mapping_close() disassembles to <+0>: mov %rdi,%rsi <+3>: mov 0x78(%rdi),%rdi <+7>: mov 0x20(%rdi),%rax <+11>: test %rax,%rax <+14>: je 0x600caa11 <special_mapping_close+24> <+16>: push %rbp <+17>: mov %rsp,%rbp <+20>: call *%rax <+22>: pop %rbp <+23>: ret <+24>: ret which may just match yours, because special_mapping_close+0x16 is obviously that +22, and it's the return point for that call. And your %rax value does match that invalid %rip value of 0x68006f6c. So it does look like it's jumping off to la-la-land, and the problem is the code const struct vm_special_mapping *sm = vma->vm_private_data; if (sm->close) sm->close(sm, vma); where presumably 'vm_private_data' isn't a "struct vm_special_mapping *" at all. And I think I see the problem. When we have that 'legacy_special_mapping_vmops', then the vm_private_data field actually points to 'pages'. So the 'legacy_special_mapping_vmops' can *only* contain the '.fault' handler, not the other handlers. IOW, does something like this fix it? --- a/mm/mmap.c +++ b/mm/mmap.c @@ -2095,7 +2095,6 @@ static const struct vm_operations_struct special_mapping_vmops = { }; static const struct vm_operations_struct legacy_special_mapping_vmops = { - .close = special_mapping_close, .fault = special_mapping_fault, }; and honestly, we should have different 'fault' functions instead of having the same fault function and then making the function dynamically see which vm_operations_struct it was. But that's a separate issue. And oh Christ, the difference between "_install_special_mapping()" (which installs the modern style special mapping) and "install_special_mapping()" (which installs the legacy special mapping) is truly horrendously disgusting. And yes, UML uses that legacy crap, which explains why it happens on UML but not on sane architectures. As does csky, hexagon, and nios2. We should get rid of the legacy case entirely, and remove that insane difference between _install_special_mapping() and install_special_mapping(). But in the meantime, the one-liner fix *should* fix it. The four architectures that uses the legacy crud don't care about the close function anyway. What a horrid thing. Linus
On Mon, Aug 19, 2024 at 12:29:34PM -0700, Linus Torvalds wrote: > On Mon, 19 Aug 2024 at 11:53, Nathan Chancellor <nathan@kernel.org> wrote: > > > > > > Modules linked in: > > Pid: 24, comm: mount Not tainted 6.11.0-rc4-next-20240819 > > RIP: 0033:0x68006f6c > > RSP: 000000006c8bfc68 EFLAGS: 00010206 > > RAX: 0000000068006f6c RBX: 0000000068a0aa18 RCX: 00000000600d8b09 > > RDX: 0000000000000000 RSI: 0000000068a0aa18 RDI: 0000000068805120 > > RBP: 000000006c8bfc70 R08: 0000000000000001 R09: 0000000068ae0308 > > R10: 000000000000000e R11: ffffffffffffffff R12: 0000000000000001 > > R13: 0000000068a0aa18 R14: 0000000000000015 R15: 0000000068944a88 > > Kernel panic - not syncing: Segfault with no mm > > CPU: 0 UID: 0 PID: 24 Comm: mount Not tainted 6.11.0-rc4-next-20240819 #1 > > Stack: > > 600caeff 6c8bfc90 600d8b2a 68944a80 > > 00000047 6c8bfda0 600cbfd9 6c8bfd50 > > 68944ad0 68944a88 7f7ffff000 7f7fffffff > > Call Trace: > > [<600caeff>] ? special_mapping_close+0x16/0x19 > > Hmm. No "Code:" line? Did you just edit that out, or maybe UML doesn't > print one out? Nope, no editing, it is straight from my terminal. I guess UML just doesn't print one. > Anyway, for me that special_mapping_close() disassembles to > > > <+0>: mov %rdi,%rsi > <+3>: mov 0x78(%rdi),%rdi > <+7>: mov 0x20(%rdi),%rax > <+11>: test %rax,%rax > <+14>: je 0x600caa11 <special_mapping_close+24> > <+16>: push %rbp > <+17>: mov %rsp,%rbp > <+20>: call *%rax > <+22>: pop %rbp > <+23>: ret > <+24>: ret > > which may just match yours, because special_mapping_close+0x16 is > obviously that +22, and it's the return point for that call. Yeah seems like it, objdump -dr shows: 0000000000000027 <special_mapping_close>: 27: 48 89 fe mov %rdi,%rsi 2a: 48 8b 7f 78 mov 0x78(%rdi),%rdi 2e: 48 8b 47 20 mov 0x20(%rdi),%rax 32: 48 85 c0 test %rax,%rax 35: 74 08 je 3f <special_mapping_close+0x18> 37: 55 push %rbp 38: 48 89 e5 mov %rsp,%rbp 3b: ff d0 call *%rax 3d: 5d pop %rbp 3e: c3 ret 3f: c3 ret > And your %rax value does match that invalid %rip value of 0x68006f6c. > > So it does look like it's jumping off to la-la-land, and the problem is the code > > const struct vm_special_mapping *sm = vma->vm_private_data; > > if (sm->close) > sm->close(sm, vma); > > where presumably 'vm_private_data' isn't a "struct vm_special_mapping *" at all. > > And I think I see the problem. > > When we have that 'legacy_special_mapping_vmops', then the > vm_private_data field actually points to 'pages'. > > So the 'legacy_special_mapping_vmops' can *only* contain the '.fault' > handler, not the other handlers. > > IOW, does something like this fix it? > > --- a/mm/mmap.c > +++ b/mm/mmap.c > @@ -2095,7 +2095,6 @@ static const struct vm_operations_struct > special_mapping_vmops = { > }; > > static const struct vm_operations_struct legacy_special_mapping_vmops = { > - .close = special_mapping_close, > .fault = special_mapping_fault, > }; Yes, that appears to fix it for me. I don't have much to say about the rest but others might :) Cheers, Nathan
On Mon, 19 Aug 2024 at 12:51, Nathan Chancellor <nathan@kernel.org> wrote: > > Yes, that appears to fix it for me. I don't have much to say about the > rest but others might :) Ok, I did a quick hack-job to remove that disgusting install_special_mapping() legacy case. With this, the old "install_special_mapping()" mess no longer exists, but I haven't even attempted to compile the result, because I don't have cross-compile environments for any of the affected architectures. Except UML. I did at least build it there, but it's not like I tested it. Adding architecture maintainers and more architecture lists to the participants. It would be good to actually get this patch tested. Context for newly added people: https://lore.kernel.org/all/CAHk-=wj9QPhG4CjiX8YLRC1wHj_Qs-T8wJi0WEhkfp0cszvB9w@mail.gmail.com/ NOTE! This patch is against my current tree, not the linux-next changes. But it should entirely remove the case that caused problems in linux-next. Linus
On Mon, 19 Aug 2024 at 13:15, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > Ok, I did a quick hack-job to remove that disgusting > install_special_mapping() legacy case. > > With this [..] I forgot to actually attach that "this". Here it is. For real, this time. Linus
On Mon, 19 Aug 2024 13:16:32 -0700 Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Mon, 19 Aug 2024 at 13:15, Linus Torvalds > <torvalds@linux-foundation.org> wrote: > > > > Ok, I did a quick hack-job to remove that disgusting > > install_special_mapping() legacy case. > > > > With this [..] > > I forgot to actually attach that "this". Here it is. For real, this time. Thanks. Do you think your one-liner remains desirable with this fix in place?
On Mon, 19 Aug 2024 at 18:05, Andrew Morton <akpm@linux-foundation.org> wrote: > > > > > I forgot to actually attach that "this". Here it is. For real, this time. > > Thanks. Do you think your one-liner remains desirable with this fix in > place? That patch actually removes the whole 'legacy_special_mapping_vmops' that my one-liner then removed the '.close' field from, so no - my one-liner just becomes a non-issue. NOTE! That patch of mine *will* conflict with Michael's patch series, since my patch to remove legacy_special_mapping_vmops was done on top of my current -git tree. But it should be an obvious conflict, in that it just means that the addition of .close never happens. Linus
Linus Torvalds <torvalds@linux-foundation.org> writes: > On Mon, 19 Aug 2024 at 13:15, Linus Torvalds > <torvalds@linux-foundation.org> wrote: >> >> Ok, I did a quick hack-job to remove that disgusting >> install_special_mapping() legacy case. >> >> With this [..] > > I forgot to actually attach that "this". Here it is. For real, this time. > > Linus > arch/csky/kernel/vdso.c | 28 +++++++++++++++++++++------- > arch/hexagon/kernel/vdso.c | 14 ++++++++++---- > arch/nios2/mm/init.c | 12 ++++++++---- > arch/sh/kernel/vsyscall/vsyscall.c | 14 +++++++++++--- > arch/x86/um/vdso/vma.c | 12 ++++++++---- > include/linux/mm.h | 4 ---- > mm/mmap.c | 32 +++++--------------------------- > 7 files changed, 63 insertions(+), 53 deletions(-) > ... > index 1bd85a6949c4..5e68ab7a8898 100644 > --- a/arch/sh/kernel/vsyscall/vsyscall.c > +++ b/arch/sh/kernel/vsyscall/vsyscall.c > @@ -36,6 +36,10 @@ __setup("vdso=", vdso_setup); > */ > extern const char vsyscall_trapa_start, vsyscall_trapa_end; > static struct page *syscall_pages[1]; > +static struct vm_special_mapping vdso_mapping = { > + .name = "[vdso]", > + .pages = syscall_pages; ^ should be , > +}; cheers
On Mon, 19 Aug 2024 at 23:26, Michael Ellerman <mpe@ellerman.id.au> wrote: > > Linus Torvalds <torvalds@linux-foundation.org> writes: > > +static struct vm_special_mapping vdso_mapping = { > > + .name = "[vdso]", > > + .pages = syscall_pages; > ^ > should be , Ack. Changed here locally. But I assume you also don't actually test sh... It would be good to get acks from the architectures that still used the legacy interface. Linus
On Tue, 20 Aug 2024 at 14:17, Rob Landley <rob@landley.net> wrote: > > Hexagon also has &&vdso_page which I don't understand (but have a toolchain for > somewhere to at least smoketest...) The '&&' is just a typo. It should obviously be just a single '&'. As mentioned, the only testing that patch got was a x86-64 UML build test. Fixed locally. Linus
On 8/20/24 10:31, Linus Torvalds wrote: > On Mon, 19 Aug 2024 at 23:26, Michael Ellerman <mpe@ellerman.id.au> wrote: >> >> Linus Torvalds <torvalds@linux-foundation.org> writes: >> > +static struct vm_special_mapping vdso_mapping = { >> > + .name = "[vdso]", >> > + .pages = syscall_pages; >> ^ >> should be , > > Ack. Changed here locally. Hexagon also has &&vdso_page which I don't understand (but have a toolchain for somewhere to at least smoketest...) > But I assume you also don't actually test sh... But I do. Aside: arch/sh smoketests easily under qemu, here's a relocatable binary toolchain: wget https://landley.net/bin/toolchains/latest/sh4-linux-musl-cross.tar.xz tar xvf sh4-linux-musl-cross-tar.xz CROSS_COMPILE=$PWD/sh4-linux-musl-cross/bin/sh4-linux-musl- And https://landley.net/bin/mkroot/latest/sh4.tgz is a tiny qemu-system-sh4 system with kernel + initramfs.cpio.gz + run-qemu.sh with the config used to build it under docs/ and as the first three lines of docs/linux-miniconfig record, the kernel is arch/sh/boot/zImage. You can keep the initramfs.cpio.gz and run-qemu.sh and swap out the kernel for a quick boot to shell prompt under qemu. Serial console is qemu's stdin/stdout, "exit" to shut down the emulator. The build script, if you care, is 400 lines of bash: https://github.com/landley/toybox/blob/master/mkroot/mkroot.sh > It would be good to get acks from the architectures that still used > the legacy interface. I'll give arch/sh a whirl. Can somebody answer my above hexagon question? > Linus Rob
On 8/20/24 16:31, Linus Torvalds wrote: > On Tue, 20 Aug 2024 at 14:17, Rob Landley <rob@landley.net> wrote: >> >> Hexagon also has &&vdso_page which I don't understand (but have a toolchain for >> somewhere to at least smoketest...) > > The '&&' is just a typo. It should obviously be just a single '&'. As > mentioned, the only testing that patch got was a x86-64 UML build > test. > > Fixed locally. I deleted the extra ; and arch/sh4 built and qemu-system-sh4 booted to shell prompt: Freeing initrd memory: 556K Freeing unused kernel image (initmem) memory: 132K This architecture does not have kernel memory protection. Run /init as init process 8139cp 0000:00:02.0 eth0: link up, 100Mbps, full-duplex, lpa 0x05E1 Type exit when done. # cat /proc/version Linux version 6.11.0-rc4 (landley@driftwood) (sh4-linux-musl-gcc (GCC) 11.2.0, GNU ld (GNU Binutils) 2.33.1) #1 Tue Aug 20 16:45:25 CDT 2024 # head -n 3 /proc/cpuinfo machine : RTS7751R2D processor : 0 cpu family : sh4 Tested-by: Rob Landley <rob@landley.net> Rob
On Tue, 20 Aug 2024 at 14:56, Rob Landley <rob@landley.net> wrote: > > Tested-by: Rob Landley <rob@landley.net> Ok, here's the patch with the fixes for problems that were pointed out, and with the beginnings of a commit message. I'm going to be traveling for the next ~10 days, and while I'll have my laptop and internet, I'm unlikely to work on this any more. Andrew - I think this is good, but there may be other issues lurking. Do with it as you see fit, Linus
On Tue, 20 Aug 2024 16:14:29 -0700 Linus Torvalds <torvalds@linux-foundation.org> wrote: > Andrew - I think this is good, but there may be other issues lurking. > Do with it as you see fit, Hey you know me, I'll merge any old thing if I think it'll help nurture newbies.
Nathan Chancellor <nathan@kernel.org> writes: > Hi Michael, > > On Mon, Aug 12, 2024 at 06:26:02PM +1000, Michael Ellerman wrote: >> Add an optional close() callback to struct vm_special_mapping. It will >> be used, by powerpc at least, to handle unmapping of the VDSO. >> >> Although support for unmapping the VDSO was initially added >> for CRIU[1], it is not desirable to guard that support behind >> CONFIG_CHECKPOINT_RESTORE. >> >> There are other known users of unmapping the VDSO which are not related >> to CRIU, eg. Valgrind [2] and void-ship [3]. >> >> The powerpc arch_unmap() hook has been in place for ~9 years, with no >> ifdef, so there may be other unknown users that have come to rely on >> unmapping the VDSO. Even if the code was behind an ifdef, major distros >> enable CHECKPOINT_RESTORE so users may not realise unmapping the VDSO >> depends on that configuration option. >> >> It's also undesirable to have such core mm behaviour behind a relatively >> obscure CONFIG option. >> >> Longer term the unmap behaviour should be standardised across >> architectures, however that is complicated by the fact the VDSO pointer >> is stored differently across architectures. There was a previous attempt >> to unify that handling [4], which could be revived. >> >> See [5] for further discussion. >> >> [1]: commit 83d3f0e90c6c ("powerpc/mm: tracking vDSO remap") >> [2]: https://sourceware.org/git/?p=valgrind.git;a=commit;h=3a004915a2cbdcdebafc1612427576bf3321eef5 >> [3]: https://github.com/insanitybit/void-ship >> [4]: https://lore.kernel.org/lkml/20210611180242.711399-17-dima@arista.com/ >> [5]: https://lore.kernel.org/linuxppc-dev/shiq5v3jrmyi6ncwke7wgl76ojysgbhrchsk32q4lbx2hadqqc@kzyy2igem256 >> >> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> >> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> >> Reviewed-by: David Hildenbrand <david@redhat.com> >> --- >> include/linux/mm_types.h | 3 +++ >> mm/mmap.c | 6 ++++++ >> 2 files changed, 9 insertions(+) >> >> v2: >> - Add some blank lines as requested. >> - Expand special_mapping_close() comment. >> - Add David's reviewed-by. >> - Expand change log to capture review discussion. >> >> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h >> index 485424979254..78bdfc59abe5 100644 >> --- a/include/linux/mm_types.h >> +++ b/include/linux/mm_types.h >> @@ -1313,6 +1313,9 @@ struct vm_special_mapping { >> >> int (*mremap)(const struct vm_special_mapping *sm, >> struct vm_area_struct *new_vma); >> + >> + void (*close)(const struct vm_special_mapping *sm, >> + struct vm_area_struct *vma); >> }; >> >> enum tlb_flush_reason { >> diff --git a/mm/mmap.c b/mm/mmap.c >> index d0dfc85b209b..af4dbf0d3bd4 100644 >> --- a/mm/mmap.c >> +++ b/mm/mmap.c >> @@ -3620,10 +3620,16 @@ void vm_stat_account(struct mm_struct *mm, vm_flags_t flags, long npages) >> static vm_fault_t special_mapping_fault(struct vm_fault *vmf); >> >> /* >> + * Close hook, called for unmap() and on the old vma for mremap(). >> + * >> * Having a close hook prevents vma merging regardless of flags. >> */ >> static void special_mapping_close(struct vm_area_struct *vma) >> { >> + const struct vm_special_mapping *sm = vma->vm_private_data; >> + >> + if (sm->close) >> + sm->close(sm, vma); >> } >> >> static const char *special_mapping_name(struct vm_area_struct *vma) >> -- >> 2.45.2 >> > > This change is now in -next and I bisected a crash that our CI sees with > ARCH=um to it: I see another crash on s390, which seems related, but rather an issue in uprobe. This can be reproduced by # cd linux-next/tools/testing/selftests/ftrace # ./ftracetest ./test.d/dynevent/add_remove_uprobe.tc The 'mm: Add optional close() to struct vm_special_mapping' patch just makes it visible. I enabled KASAN, and that shows me: [ 44.505448] ================================================================== 20:37:27 [3421/145075] [ 44.505455] BUG: KASAN: slab-use-after-free in special_mapping_close+0x9c/0xc8 [ 44.505471] Read of size 8 at addr 00000000868dac48 by task sh/1384 [ 44.505479] [ 44.505486] CPU: 51 UID: 0 PID: 1384 Comm: sh Not tainted 6.11.0-rc6-next-20240902-dirty #1496 [ 44.505503] Hardware name: IBM 3931 A01 704 (z/VM 7.3.0) [ 44.505508] Call Trace: [ 44.505511] [<000b0324d2f78080>] dump_stack_lvl+0xd0/0x108 [ 44.505521] [<000b0324d2f5435c>] print_address_description.constprop.0+0x34/0x2e0 [ 44.505529] [<000b0324d2f5464c>] print_report+0x44/0x138 [ 44.505536] [<000b0324d1383192>] kasan_report+0xc2/0x140 [ 44.505543] [<000b0324d2f52904>] special_mapping_close+0x9c/0xc8 [ 44.505550] [<000b0324d12c7978>] remove_vma+0x78/0x120 [ 44.505557] [<000b0324d128a2c6>] exit_mmap+0x326/0x750 [ 44.505563] [<000b0324d0ba655a>] __mmput+0x9a/0x370 [ 44.505570] [<000b0324d0bbfbe0>] exit_mm+0x240/0x340 [ 44.505575] [<000b0324d0bc0228>] do_exit+0x548/0xd70 [ 44.505580] [<000b0324d0bc1102>] do_group_exit+0x132/0x390 [ 44.505586] [<000b0324d0bc13b6>] __s390x_sys_exit_group+0x56/0x60 [ 44.505592] [<000b0324d0adcbd6>] do_syscall+0x2f6/0x430 [ 44.505599] [<000b0324d2f78434>] __do_syscall+0xa4/0x170 [ 44.505606] [<000b0324d2f9454c>] system_call+0x74/0x98 [ 44.505614] [ 44.505616] Allocated by task 1384: [ 44.505621] kasan_save_stack+0x40/0x70 [ 44.505630] kasan_save_track+0x28/0x40 [ 44.505636] __kasan_kmalloc+0xa0/0xc0 [ 44.505642] __create_xol_area+0xfa/0x410 [ 44.505648] get_xol_area+0xb0/0xf0 [ 44.505652] uprobe_notify_resume+0x27a/0x470 [ 44.505657] irqentry_exit_to_user_mode+0x15e/0x1d0 [ 44.505664] pgm_check_handler+0x122/0x170 [ 44.505670] [ 44.505672] Freed by task 1384: [ 44.505676] kasan_save_stack+0x40/0x70 [ 44.505682] kasan_save_track+0x28/0x40 [ 44.505687] kasan_save_free_info+0x4a/0x70 [ 44.505693] __kasan_slab_free+0x5a/0x70 [ 44.505698] kfree+0xe8/0x3f0 [ 44.505704] __mmput+0x20/0x370 [ 44.505709] exit_mm+0x240/0x340 [ 44.505713] do_exit+0x548/0xd70 [ 44.505718] do_group_exit+0x132/0x390 [ 44.505722] __s390x_sys_exit_group+0x56/0x60 [ 44.505727] do_syscall+0x2f6/0x430 [ 44.505732] __do_syscall+0xa4/0x170 [ 44.505738] system_call+0x74/0x98 The problem is that uprobe_clear_state() kfree's struct xol_area, which contains struct vm_special_mapping *xol_mapping. This one is passed to _install_special_mapping() in xol_add_vma(). __mput reads: static inline void __mmput(struct mm_struct *mm) { VM_BUG_ON(atomic_read(&mm->mm_users)); uprobe_clear_state(mm); exit_aio(mm); ksm_exit(mm); khugepaged_exit(mm); /* must run before exit_mmap */ exit_mmap(mm); ... } So uprobe_clear_state() in the beginning free's the memory area containing the vm_special_mapping data, but exit_mmap() uses this address later via vma->vm_private_data (which was set in _install_special_mapping(). The following change fixes this for me, but i'm not sure about any side effects: diff --git a/kernel/fork.c b/kernel/fork.c index df8e4575ff01..cfcabba36c93 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -1340,11 +1340,11 @@ static inline void __mmput(struct mm_struct *mm) { VM_BUG_ON(atomic_read(&mm->mm_users)); - uprobe_clear_state(mm); exit_aio(mm); ksm_exit(mm); khugepaged_exit(mm); /* must run before exit_mmap */ exit_mmap(mm); + uprobe_clear_state(mm); mm_put_huge_zero_folio(mm); set_mm_exe_file(mm, NULL); if (!list_empty(&mm->mmlist)) { Any thoughts?
On Mon, 02 Sep 2024 21:06:48 +0200 Sven Schnelle <svens@linux.ibm.com> wrote: > So uprobe_clear_state() in the beginning free's the memory area > containing the vm_special_mapping data, but exit_mmap() uses this > address later via vma->vm_private_data (which was set in _install_special_mapping(). > > The following change fixes this for me, but i'm not sure about any side > effects: > > diff --git a/kernel/fork.c b/kernel/fork.c > index df8e4575ff01..cfcabba36c93 100644 > --- a/kernel/fork.c > +++ b/kernel/fork.c > @@ -1340,11 +1340,11 @@ static inline void __mmput(struct mm_struct *mm) > { > VM_BUG_ON(atomic_read(&mm->mm_users)); > > - uprobe_clear_state(mm); > exit_aio(mm); > ksm_exit(mm); > khugepaged_exit(mm); /* must run before exit_mmap */ > exit_mmap(mm); > + uprobe_clear_state(mm); > mm_put_huge_zero_folio(mm); > set_mm_exe_file(mm, NULL); > if (!list_empty(&mm->mmlist)) { uprobe_clear_state() is a pretty simple low-level thing. Side-effects seem unlikely?
On Mon, 2 Sept 2024 at 13:49, Andrew Morton <akpm@linux-foundation.org> wrote: > > uprobe_clear_state() is a pretty simple low-level thing. Side-effects > seem unlikely? I think uprobe_clear_state() should be removed from fork.c entirely, made 'static', and then we'd have area->xol_mapping.close = uprobe_clear_state; in __create_xol_area() instead (ok, the arguments change, instead of looking up "mm->uprobes_state.xol_area", it would get it as the vma argument) That's how it should always have been, except we didn't have a close() function. Hmm? Linus
Linus Torvalds <torvalds@linux-foundation.org> writes: > On Mon, 2 Sept 2024 at 13:49, Andrew Morton <akpm@linux-foundation.org> wrote: >> >> uprobe_clear_state() is a pretty simple low-level thing. Side-effects >> seem unlikely? > > I think uprobe_clear_state() should be removed from fork.c entirely, > made 'static', and then we'd have > > area->xol_mapping.close = uprobe_clear_state; > > in __create_xol_area() instead (ok, the arguments change, instead of > looking up "mm->uprobes_state.xol_area", it would get it as the vma > argument) > > That's how it should always have been, except we didn't have a close() function. > > Hmm? Indeed, that's much better. I'll prepare a patch. Thanks!
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 485424979254..78bdfc59abe5 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -1313,6 +1313,9 @@ struct vm_special_mapping { int (*mremap)(const struct vm_special_mapping *sm, struct vm_area_struct *new_vma); + + void (*close)(const struct vm_special_mapping *sm, + struct vm_area_struct *vma); }; enum tlb_flush_reason { diff --git a/mm/mmap.c b/mm/mmap.c index d0dfc85b209b..af4dbf0d3bd4 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -3620,10 +3620,16 @@ void vm_stat_account(struct mm_struct *mm, vm_flags_t flags, long npages) static vm_fault_t special_mapping_fault(struct vm_fault *vmf); /* + * Close hook, called for unmap() and on the old vma for mremap(). + * * Having a close hook prevents vma merging regardless of flags. */ static void special_mapping_close(struct vm_area_struct *vma) { + const struct vm_special_mapping *sm = vma->vm_private_data; + + if (sm->close) + sm->close(sm, vma); } static const char *special_mapping_name(struct vm_area_struct *vma)