Message ID | 20240403062754.124752-1-anton.ivanov@cambridgegreys.com |
---|---|
State | Changes Requested |
Headers | show |
Series | [v9] um: Enable preemption in UML | expand |
Hi, On Wed, 2024-04-03 at 07:27 +0100, anton.ivanov@cambridgegreys.com wrote: > From: Anton Ivanov <anton.ivanov@cambridgegreys.com> > > 1. Preemption requires saving/restoring FPU state. This patch > adds support for it using GCC intrinsics as well as appropriate > storage space in the thread structure. We reuse the space > which is already allocated for the userspace threads in the > thread_info structure. Do we need to store the FPU state even? After all, we cannot simply overwrite userspace FPU registers in UML. Seriously wondering about that. It seems to me it would be sufficient to ensure that only one kernel task is in a kern_fpu_begin()/kern_fpu_end() section at any point in time. Now, I don't know how preempt_disable()/preempt_enable() behaves exactly. But I would assume it makes such a guarantee and then we can simply map kern_fpu_begin to preempt_disable and kern_fpu_end to preempt_enable. > 2. irq critical sections need preempt_disable()/preempt_enable(). > > 3. TLB critical sections need preempt_disable()/preempt_enable(). > > 4. UML TLB flush is also invoked during a fork. This happens > with interrupts and preempt disabled which disagrees with the > standard mm locking via rwsem. The mm lock for this code path > had to be replaced with an rcu. Hopefully that flush during fork will be gone soon :-) Benjamin > [...]
On 19/04/2024 14:47, Benjamin Berg wrote: > Hi, > > On Wed, 2024-04-03 at 07:27 +0100, anton.ivanov@cambridgegreys.com > wrote: >> From: Anton Ivanov <anton.ivanov@cambridgegreys.com> >> >> 1. Preemption requires saving/restoring FPU state. This patch >> adds support for it using GCC intrinsics as well as appropriate >> storage space in the thread structure. We reuse the space >> which is already allocated for the userspace threads in the >> thread_info structure. > > Do we need to store the FPU state even? Everybody else does :) > After all, we cannot simply > overwrite userspace FPU registers in UML. Correct, but you can have kernel threads as well. > > Seriously wondering about that. It seems to me it would be sufficient > to ensure that only one kernel task is in a > kern_fpu_begin()/kern_fpu_end() section at any point in time. So what happens if you have a task which wants fpu and cannot get it at this point? > > Now, I don't know how preempt_disable()/preempt_enable() behaves > exactly. But I would assume it makes such a guarantee and then we can > simply map kern_fpu_begin to preempt_disable and kern_fpu_end to > preempt_enable. > >> 2. irq critical sections need preempt_disable()/preempt_enable(). Yes. Otherwise RAID, crypto, ipsec can mess up things. >> >> 3. TLB critical sections need preempt_disable()/preempt_enable(). I think I added those. >> >> 4. UML TLB flush is also invoked during a fork. This happens >> with interrupts and preempt disabled which disagrees with the >> standard mm locking via rwsem. The mm lock for this code path >> had to be replaced with an rcu. > > Hopefully that flush during fork will be gone soon :-) Hurrah!!! > > Benjamin > >> [...] > >
Hi, On Sat, 2024-04-20 at 13:22 +0100, Anton Ivanov wrote: > On 19/04/2024 14:47, Benjamin Berg wrote: > > Hi, > > > > On Wed, 2024-04-03 at 07:27 +0100, anton.ivanov@cambridgegreys.com > > wrote: > > > From: Anton Ivanov <anton.ivanov@cambridgegreys.com> > > > > > > 1. Preemption requires saving/restoring FPU state. This patch > > > adds support for it using GCC intrinsics as well as appropriate > > > storage space in the thread structure. We reuse the space > > > which is already allocated for the userspace threads in the > > > thread_info structure. > > > > Do we need to store the FPU state even? > > Everybody else does :) As far as I can tell, not really. The way I understand the x86 codeis that it tries to leave the FPU untouched in the hope that it will return to the same userspace task. However, if the kernel starts using the FPU, then it saves the FPU registers into the current task struct and sets a flag that the FPU state needs to be restored when returning to userspace. See TIF_NEED_FPU_LOAD, switch_fpu_prepare and kernel_fpu_begin_mask. We have multiple things confirming this for me: * storing only happens if "current" is not a KTHREAD or USER_WORKER (i.e. proper userspace, we e.g. in a syscall, not a kernel thread) * we set TIF_NEED_FPU_LOAD, which is documented with "load FPU on return to userspace" * the in_kernel_fpu per-CPU boolean is set and must be unset later * __schedule must only be called with preemption disabled * switch_fpu_prepare does not save FPU registers for KTHREAD or USER_WORKER either So, for me, all the effort is purely an optimization for the common case that a userspace process is switching into kernel space for a syscall (or pagefault), and we are coming back to the same process afterwards without scheduling. In that case, we can just leave the entire FPU registers untouched and everything is great. But, that case is irrelevant for us, because we always store the FPU registers of userspace processes anyway (equivalent to just setting the TIF_NEED_FPU_LOAD flag). And, for kernel tasks we never do that, but neither does x86. So, yeah, the more I look at it, the more certain I am that we should basically just do: #define kernel_fpu_begin preempt_disable #define kernel_fpu_end preempt_enable And, of course, add the preempt_* in various locations as you already have done. Benjamin > > > After all, we cannot simply > > overwrite userspace FPU registers in UML. > > Correct, but you can have kernel threads as well. > > > > Seriously wondering about that. It seems to me it would be > > sufficient > > to ensure that only one kernel task is in a > > kern_fpu_begin()/kern_fpu_end() section at any point in time. > > So what happens if you have a task which wants fpu and cannot get it > at > this point? > > > > > Now, I don't know how preempt_disable()/preempt_enable() behaves > > exactly. But I would assume it makes such a guarantee and then we > > can > > simply map kern_fpu_begin to preempt_disable and kern_fpu_end to > > preempt_enable. > > > > > 2. irq critical sections need preempt_disable()/preempt_enable(). > > Yes. Otherwise RAID, crypto, ipsec can mess up things. > > > > > > > 3. TLB critical sections need preempt_disable()/preempt_enable(). > > I think I added those. > > > > > > > 4. UML TLB flush is also invoked during a fork. This happens > > > with interrupts and preempt disabled which disagrees with the > > > standard mm locking via rwsem. The mm lock for this code path > > > had to be replaced with an rcu. > > > > Hopefully that flush during fork will be gone soon :-) > > Hurrah!!! > > > > > Benjamin > > > > > [...] > > > > >
> arch/um/Kconfig | 2 +- > arch/um/include/asm/fpu/api.h | 9 ++-- > arch/um/include/asm/thread_info.h | 4 +- > arch/um/kernel/Makefile | 4 ++ > arch/um/kernel/fpu.c | 71 +++++++++++++++++++++++++++++++ > arch/um/kernel/irq.c | 2 + > arch/um/kernel/tlb.c | 13 ++++++ In addition to what Benjamin pointed out (FPU state is in the MM process, so we probably don't care about save/restore), this also doesn't build any more, at least it needs: --- a/arch/x86/um/Makefile +++ b/arch/x86/um/Makefile @@ -31,7 +31,6 @@ obj-y += syscalls_64.o vdso/ subarch-y = ../lib/csum-partial_64.o ../lib/memcpy_64.o \ ../lib/memmove_64.o ../lib/memset_64.o -subarch-$(CONFIG_PREEMPTION) += ../entry/thunk_64.o endif git am also complained about the extra blank line at EOF here: > +EXPORT_SYMBOL_GPL(kernel_fpu_end); > + johannes
Hmm. I also see a ton of this: BUG: sleeping function called from invalid context at kernel/locking/rwsem.c:1525 in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 538, name: chan-switch.sh preempt_count: 2, expected: 0 RCU nest depth: 0, expected: 0 no locks held by chan-switch.sh/538. irq event stamp: 0 hardirqs last enabled at (0): [<0000000000000000>] 0x0 hardirqs last disabled at (0): [<000000006004b961>] copy_process+0xbb7/0x2790 softirqs last enabled at (0): [<000000006004b961>] copy_process+0xbb7/0x2790 softirqs last disabled at (0): [<0000000000000000>] 0x0 Preemption disabled at: [<0000000060084b04>] preempt_count_add+0x11f/0x126 CPU: 0 PID: 538 Comm: chan-switch.sh Tainted: G W O 6.10.0-rc2 #11 Stack: 714c7e40 605f2a9d 00000000 00000000 ffffff00 607d2edc 00000000 60db7a48 714c7e70 60620823 0000022c 606161d1 Call Trace: [<606161d1>] ? _printk+0x0/0x94 [<6002fa2f>] show_stack+0xfe/0x159 [<605f2a9d>] ? dump_stack_print_info+0xe1/0xf0 [<60620823>] dump_stack_lvl+0x77/0xbe [<606161d1>] ? _printk+0x0/0x94 [<60084b04>] ? preempt_count_add+0x11f/0x126 [<60620884>] dump_stack+0x1a/0x1c [<60085b82>] __might_resched+0x3b0/0x3d2 [<60085ca0>] __might_sleep+0xfc/0x107 [<60627212>] down_read+0x34/0x224 [<60032256>] force_flush_all+0x71/0x10d [<6002e95d>] fork_handler+0x14/0x94 johannes
On Mon, 2024-07-01 at 17:56 +0200, Johannes Berg wrote: > Hmm. I also see a ton of this: > > BUG: sleeping function called from invalid context at kernel/locking/rwsem.c:1525 > in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 538, name: chan-switch.sh > preempt_count: 2, expected: 0 > RCU nest depth: 0, expected: 0 > no locks held by chan-switch.sh/538. > irq event stamp: 0 > hardirqs last enabled at (0): [<0000000000000000>] 0x0 > hardirqs last disabled at (0): [<000000006004b961>] copy_process+0xbb7/0x2790 > softirqs last enabled at (0): [<000000006004b961>] copy_process+0xbb7/0x2790 > softirqs last disabled at (0): [<0000000000000000>] 0x0 > Preemption disabled at: > [<0000000060084b04>] preempt_count_add+0x11f/0x126 > CPU: 0 PID: 538 Comm: chan-switch.sh Tainted: G W O 6.10.0-rc2 #11 > Stack: > 714c7e40 605f2a9d 00000000 00000000 > ffffff00 607d2edc 00000000 60db7a48 > 714c7e70 60620823 0000022c 606161d1 > Call Trace: > [<606161d1>] ? _printk+0x0/0x94 > [<6002fa2f>] show_stack+0xfe/0x159 > [<605f2a9d>] ? dump_stack_print_info+0xe1/0xf0 > [<60620823>] dump_stack_lvl+0x77/0xbe > [<606161d1>] ? _printk+0x0/0x94 > [<60084b04>] ? preempt_count_add+0x11f/0x126 > [<60620884>] dump_stack+0x1a/0x1c > [<60085b82>] __might_resched+0x3b0/0x3d2 > [<60085ca0>] __might_sleep+0xfc/0x107 > [<60627212>] down_read+0x34/0x224 > [<60032256>] force_flush_all+0x71/0x10d Which is, however, addressed by my own patchset https://patchwork.ozlabs.org/project/linux-um/list/?series=374492 though it doesn't apply completely cleanly any more. johannes
On 01/07/2024 17:35, Johannes Berg wrote: > On Mon, 2024-07-01 at 17:56 +0200, Johannes Berg wrote: >> Hmm. I also see a ton of this: >> >> BUG: sleeping function called from invalid context at kernel/locking/rwsem.c:1525 >> in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 538, name: chan-switch.sh >> preempt_count: 2, expected: 0 >> RCU nest depth: 0, expected: 0 >> no locks held by chan-switch.sh/538. >> irq event stamp: 0 >> hardirqs last enabled at (0): [<0000000000000000>] 0x0 >> hardirqs last disabled at (0): [<000000006004b961>] copy_process+0xbb7/0x2790 >> softirqs last enabled at (0): [<000000006004b961>] copy_process+0xbb7/0x2790 >> softirqs last disabled at (0): [<0000000000000000>] 0x0 >> Preemption disabled at: >> [<0000000060084b04>] preempt_count_add+0x11f/0x126 >> CPU: 0 PID: 538 Comm: chan-switch.sh Tainted: G W O 6.10.0-rc2 #11 >> Stack: >> 714c7e40 605f2a9d 00000000 00000000 >> ffffff00 607d2edc 00000000 60db7a48 >> 714c7e70 60620823 0000022c 606161d1 >> Call Trace: >> [<606161d1>] ? _printk+0x0/0x94 >> [<6002fa2f>] show_stack+0xfe/0x159 >> [<605f2a9d>] ? dump_stack_print_info+0xe1/0xf0 >> [<60620823>] dump_stack_lvl+0x77/0xbe >> [<606161d1>] ? _printk+0x0/0x94 >> [<60084b04>] ? preempt_count_add+0x11f/0x126 >> [<60620884>] dump_stack+0x1a/0x1c >> [<60085b82>] __might_resched+0x3b0/0x3d2 >> [<60085ca0>] __might_sleep+0xfc/0x107 >> [<60627212>] down_read+0x34/0x224 >> [<60032256>] force_flush_all+0x71/0x10d > > Which is, however, addressed by my own patchset > > https://patchwork.ozlabs.org/project/linux-um/list/?series=374492 > > though it doesn't apply completely cleanly any more. > > johannes > As Benjamin pointed out we no longer need this, so we can junk this patch at this point. If there will be any need to save/restore FPU in kernel context, I will revisit it.
On Mon, 2024-07-01 at 19:55 +0100, Anton Ivanov wrote: > > > As Benjamin pointed out we no longer need this, so we can junk this > patch at this point. If there will be any need to save/restore FPU in > kernel context, I will revisit it. I think we still want/need the parts with the preempt_disable/enable, and Kconfig though? johannes
On 01/07/2024 20:09, Johannes Berg wrote: > On Mon, 2024-07-01 at 19:55 +0100, Anton Ivanov wrote: >> As Benjamin pointed out we no longer need this, so we can junk this >> patch at this point. If there will be any need to save/restore FPU in >> kernel context, I will revisit it. > I think we still want/need the parts with the preempt_disable/enable, > and Kconfig though? Good point. I can do a cut down version. > > johannes > >
diff --git a/arch/um/Kconfig b/arch/um/Kconfig index 93a5a8999b07..5b2a8991f6c8 100644 --- a/arch/um/Kconfig +++ b/arch/um/Kconfig @@ -11,7 +11,7 @@ config UML select ARCH_HAS_KCOV select ARCH_HAS_STRNCPY_FROM_USER select ARCH_HAS_STRNLEN_USER - select ARCH_NO_PREEMPT + select ARCH_NO_PREEMPT_DYNAMIC select HAVE_ARCH_AUDITSYSCALL select HAVE_ARCH_KASAN if X86_64 select HAVE_ARCH_KASAN_VMALLOC if HAVE_ARCH_KASAN diff --git a/arch/um/include/asm/fpu/api.h b/arch/um/include/asm/fpu/api.h index 71bfd9ef3938..9e7680bf48f0 100644 --- a/arch/um/include/asm/fpu/api.h +++ b/arch/um/include/asm/fpu/api.h @@ -4,12 +4,15 @@ /* Copyright (c) 2020 Cambridge Greys Ltd * Copyright (c) 2020 Red Hat Inc. - * A set of "dummy" defines to allow the direct inclusion - * of x86 optimized copy, xor, etc routines into the - * UML code tree. */ + */ +#if defined(CONFIG_PREEMPT) || defined(CONFIG_PREEMPT_VOLUNTARY) +extern void kernel_fpu_begin(void); +extern void kernel_fpu_end(void); +#else #define kernel_fpu_begin() (void)0 #define kernel_fpu_end() (void)0 +#endif static inline bool irq_fpu_usable(void) { diff --git a/arch/um/include/asm/thread_info.h b/arch/um/include/asm/thread_info.h index c7b4b49826a2..8f5f9b51eb75 100644 --- a/arch/um/include/asm/thread_info.h +++ b/arch/um/include/asm/thread_info.h @@ -23,8 +23,8 @@ struct thread_info { int preempt_count; /* 0 => preemptable, <0 => BUG */ struct thread_info *real_thread; /* Points to non-IRQ stack */ - unsigned long aux_fp_regs[FP_SIZE]; /* auxiliary fp_regs to save/restore - them out-of-band */ + /* auxiliary fp_regs to save/restore them out-of-band */ + unsigned long aux_fp_regs[FP_SIZE] __aligned(64); }; #define INIT_THREAD_INFO(tsk) \ diff --git a/arch/um/kernel/Makefile b/arch/um/kernel/Makefile index 811188be954c..c616e884a488 100644 --- a/arch/um/kernel/Makefile +++ b/arch/um/kernel/Makefile @@ -26,9 +26,13 @@ obj-$(CONFIG_OF) += dtb.o obj-$(CONFIG_EARLY_PRINTK) += early_printk.o obj-$(CONFIG_STACKTRACE) += stacktrace.o obj-$(CONFIG_GENERIC_PCI_IOMAP) += ioport.o +obj-$(CONFIG_PREEMPT) += fpu.o +obj-$(CONFIG_PREEMPT_VOLUNTARY) += fpu.o USER_OBJS := config.o +CFLAGS_fpu.o += -mxsave -mxsaveopt + include $(srctree)/arch/um/scripts/Makefile.rules targets := config.c config.tmp capflags.c diff --git a/arch/um/kernel/fpu.c b/arch/um/kernel/fpu.c new file mode 100644 index 000000000000..70c5e19f66e3 --- /dev/null +++ b/arch/um/kernel/fpu.c @@ -0,0 +1,71 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Copyright (C) 2023 Cambridge Greys Ltd + * Copyright (C) 2023 Red Hat Inc + */ + +#include <linux/cpu.h> +#include <linux/init.h> +#include <asm/fpu/api.h> +#include <asm/cpufeature.h> + +/* + * The critical section between kernel_fpu_begin() and kernel_fpu_end() + * is non-reentrant. It is the caller's responsibility to avoid reentrance. + */ + +static DEFINE_PER_CPU(bool, in_kernel_fpu); + +/* UML and driver code it pulls out of the x86 tree knows about 387 features + * up to and including AVX512. TILE, etc are not yet supported. + */ + +#define KNOWN_387_FEATURES 0xFF + +void kernel_fpu_begin(void) +{ + preempt_disable(); + + WARN_ON(this_cpu_read(in_kernel_fpu)); + + this_cpu_write(in_kernel_fpu, true); + +#ifdef CONFIG_64BIT + if (likely(cpu_has(&boot_cpu_data, X86_FEATURE_XSAVEOPT))) + __builtin_ia32_xsaveopt64(¤t_thread_info()->aux_fp_regs, KNOWN_387_FEATURES); + else if (likely(cpu_has(&boot_cpu_data, X86_FEATURE_XSAVE))) + __builtin_ia32_xsave64(¤t_thread_info()->aux_fp_regs, KNOWN_387_FEATURES); + else + __builtin_ia32_fxsave64(¤t_thread_info()->aux_fp_regs); +#else + if (likely(cpu_has(&boot_cpu_data, X86_FEATURE_XSAVEOPT))) + __builtin_ia32_xsaveopt(¤t->aux_fp_regs, KNOWN_387_FEATURES); + else if (likely(cpu_has(&boot_cpu_data, X86_FEATURE_XSAVE))) + __builtin_ia32_xsave(¤t->aux_fp_regs, KNOWN_387_FEATURES); + else + __builtin_ia32_fxsave(¤t->aux_fp_regs); +#endif +} +EXPORT_SYMBOL_GPL(kernel_fpu_begin); + +void kernel_fpu_end(void) +{ + WARN_ON(!this_cpu_read(in_kernel_fpu)); + +#ifdef CONFIG_64BIT + if (likely(cpu_has(&boot_cpu_data, X86_FEATURE_XSAVE))) + __builtin_ia32_xrstor64(¤t_thread_info()->aux_fp_regs, KNOWN_387_FEATURES); + else + __builtin_ia32_fxrstor64(¤t_thread_info()->aux_fp_regs); +#else + if (likely(cpu_has(&boot_cpu_data, X86_FEATURE_XSAVE))) + __builtin_ia32_xrstor(¤t_tread_info()->aux_fp_regs, KNOWN_387_FEATURES); + else + __builtin_ia32_fxrstor(¤t_thread_info()->aux_fp_regs); +#endif + this_cpu_write(in_kernel_fpu, false); + + preempt_enable(); +} +EXPORT_SYMBOL_GPL(kernel_fpu_end); + diff --git a/arch/um/kernel/irq.c b/arch/um/kernel/irq.c index 635d44606bfe..c02525da45df 100644 --- a/arch/um/kernel/irq.c +++ b/arch/um/kernel/irq.c @@ -195,7 +195,9 @@ static void _sigio_handler(struct uml_pt_regs *regs, void sigio_handler(int sig, struct siginfo *unused_si, struct uml_pt_regs *regs) { + preempt_disable(); _sigio_handler(regs, irqs_suspended); + preempt_enable(); } static struct irq_entry *get_irq_entry_by_fd(int fd) diff --git a/arch/um/kernel/tlb.c b/arch/um/kernel/tlb.c index 247396b732e5..61c56917361e 100644 --- a/arch/um/kernel/tlb.c +++ b/arch/um/kernel/tlb.c @@ -322,6 +322,8 @@ static void fix_range_common(struct mm_struct *mm, unsigned long start_addr, unsigned long addr = start_addr, next; int ret = 0, userspace = 1; + preempt_disable(); + hvc = INIT_HVC(mm, force, userspace); pgd = pgd_offset(mm, addr); do { @@ -346,6 +348,7 @@ static void fix_range_common(struct mm_struct *mm, unsigned long start_addr, "process: %d\n", task_tgid_vnr(current)); mm_idp->kill = 1; } + preempt_enable(); } static int flush_tlb_kernel_range_common(unsigned long start, unsigned long end) @@ -362,6 +365,9 @@ static int flush_tlb_kernel_range_common(unsigned long start, unsigned long end) mm = &init_mm; hvc = INIT_HVC(mm, force, userspace); + + preempt_disable(); + for (addr = start; addr < end;) { pgd = pgd_offset(mm, addr); if (!pgd_present(*pgd)) { @@ -449,6 +455,9 @@ static int flush_tlb_kernel_range_common(unsigned long start, unsigned long end) if (err < 0) panic("flush_tlb_kernel failed, errno = %d\n", err); + + preempt_enable(); + return updated; } @@ -466,6 +475,8 @@ void flush_tlb_page(struct vm_area_struct *vma, unsigned long address) address &= PAGE_MASK; + preempt_disable(); + pgd = pgd_offset(mm, address); if (!pgd_present(*pgd)) goto kill; @@ -520,9 +531,11 @@ void flush_tlb_page(struct vm_area_struct *vma, unsigned long address) *pte = pte_mkuptodate(*pte); + preempt_enable(); return; kill: + preempt_enable(); printk(KERN_ERR "Failed to flush page for address 0x%lx\n", address); force_sig(SIGKILL); }