Message ID: 20180129202020.8515-3-mathieu.desnoyers@efficios.com (mailing list archive)
State: Not Applicable
Series: None
* Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:

> 
> +config ARCH_HAS_MEMBARRIER_HOOKS
> +	bool

Yeah, so I have renamed this to ARCH_HAS_MEMBARRIER_CALLBACKS, and propagated it
through the rest of the patches. "Callback" is the canonical name, and I also
cringe every time I see 'hook'.

Please let me know if there are any big objections against this minor cleanup.

Thanks,

	Ingo
----- On Feb 5, 2018, at 3:22 PM, Ingo Molnar mingo@kernel.org wrote:

> * Mathieu Desnoyers <mathieu.desnoyers@efficios.com> wrote:
> 
>> 
>> +config ARCH_HAS_MEMBARRIER_HOOKS
>> +	bool
> 
> Yeah, so I have renamed this to ARCH_HAS_MEMBARRIER_CALLBACKS, and propagated it
> through the rest of the patches. "Callback" is the canonical name, and I also
> cringe every time I see 'hook'.
> 
> Please let me know if there are any big objections against this minor cleanup.

No objection at all,

Thanks!

Mathieu

> 
> Thanks,
> 
> 	Ingo
Le 29/01/2018 à 21:20, Mathieu Desnoyers a écrit :
> Allow PowerPC to skip the full memory barrier in switch_mm(), and
> only issue the barrier when scheduling into a task belonging to a
> process that has registered to use expedited private.
> 
> Threads targeting the same VM but which belong to different thread
> groups is a tricky case. It has a few consequences:
> 
> It turns out that we cannot rely on get_nr_threads(p) to count the
> number of threads using a VM. We can use
> (atomic_read(&mm->mm_users) == 1 && get_nr_threads(p) == 1)
> instead to skip the synchronize_sched() for cases where the VM only has
> a single user, and that user only has a single thread.
> 
> It also turns out that we cannot use for_each_thread() to set
> thread flags in all threads using a VM, as it only iterates on the
> thread group.
> 
> Therefore, test the membarrier state variable directly rather than
> relying on thread flags. This means
> membarrier_register_private_expedited() needs to set the
> MEMBARRIER_STATE_PRIVATE_EXPEDITED flag, issue synchronize_sched(), and
> only then set MEMBARRIER_STATE_PRIVATE_EXPEDITED_READY which allows
> private expedited membarrier commands to succeed.
> membarrier_arch_switch_mm() now tests for the
> MEMBARRIER_STATE_PRIVATE_EXPEDITED flag.

Looking at switch_mm_irqs_off(), I found it more complex than expected, and
found that this patch is the reason for that complexity.

Before the patch (ie in kernel 4.14), we have:

00000000 <switch_mm_irqs_off>:
   0:	81 24 01 c8 	lwz     r9,456(r4)
   4:	71 29 00 01 	andi.   r9,r9,1
   8:	40 82 00 1c 	bne     24 <switch_mm_irqs_off+0x24>
   c:	39 24 01 c8 	addi    r9,r4,456
  10:	39 40 00 01 	li      r10,1
  14:	7d 00 48 28 	lwarx   r8,0,r9
  18:	7d 08 53 78 	or      r8,r8,r10
  1c:	7d 00 49 2d 	stwcx.  r8,0,r9
  20:	40 c2 ff f4 	bne-    14 <switch_mm_irqs_off+0x14>
  24:	7c 04 18 40 	cmplw   r4,r3
  28:	81 24 00 24 	lwz     r9,36(r4)
  2c:	91 25 04 4c 	stw     r9,1100(r5)
  30:	4d 82 00 20 	beqlr
  34:	48 00 00 00 	b       34 <switch_mm_irqs_off+0x34>
			34: R_PPC_REL24	switch_mmu_context

After the patch (ie in 5.13-rc6), that now is:

00000000 <switch_mm_irqs_off>:
   0:	81 24 02 18 	lwz     r9,536(r4)
   4:	71 29 00 01 	andi.   r9,r9,1
   8:	41 82 00 24 	beq     2c <switch_mm_irqs_off+0x2c>
   c:	7c 04 18 40 	cmplw   r4,r3
  10:	81 24 00 24 	lwz     r9,36(r4)
  14:	91 25 04 d0 	stw     r9,1232(r5)
  18:	4d 82 00 20 	beqlr
  1c:	81 24 00 28 	lwz     r9,40(r4)
  20:	71 29 00 0a 	andi.   r9,r9,10
  24:	40 82 00 34 	bne     58 <switch_mm_irqs_off+0x58>
  28:	48 00 00 00 	b       28 <switch_mm_irqs_off+0x28>
			28: R_PPC_REL24	switch_mmu_context
  2c:	39 24 02 18 	addi    r9,r4,536
  30:	39 40 00 01 	li      r10,1
  34:	7d 00 48 28 	lwarx   r8,0,r9
  38:	7d 08 53 78 	or      r8,r8,r10
  3c:	7d 00 49 2d 	stwcx.  r8,0,r9
  40:	40 a2 ff f4 	bne     34 <switch_mm_irqs_off+0x34>
  44:	7c 04 18 40 	cmplw   r4,r3
  48:	81 24 00 24 	lwz     r9,36(r4)
  4c:	91 25 04 d0 	stw     r9,1232(r5)
  50:	4d 82 00 20 	beqlr
  54:	48 00 00 00 	b       54 <switch_mm_irqs_off+0x54>
			54: R_PPC_REL24	switch_mmu_context
  58:	2c 03 00 00 	cmpwi   r3,0
  5c:	41 82 ff cc 	beq     28 <switch_mm_irqs_off+0x28>
  60:	48 00 00 00 	b       60 <switch_mm_irqs_off+0x60>
			60: R_PPC_REL24	switch_mmu_context

In particular, the comparison of 'prev' against 0 is pointless, as both cases
end up simply branching to 'switch_mmu_context'.

I don't understand all that complexity to just replace a simple
'smp_mb__after_unlock_lock()':

#define smp_mb__after_unlock_lock()	smp_mb()
#define smp_mb()	barrier()
# define barrier() __asm__ __volatile__("": : :"memory")

Am I missing some subtlety?

Thanks
Christophe
----- On Jun 18, 2021, at 1:13 PM, Christophe Leroy christophe.leroy@csgroup.eu wrote:

[...]
> 
> I don't understand all that complexity to just replace a simple
> 'smp_mb__after_unlock_lock()':
> 
> #define smp_mb__after_unlock_lock()	smp_mb()
> #define smp_mb()	barrier()
> # define barrier() __asm__ __volatile__("": : :"memory")
> 
> Am I missing some subtlety?

On powerpc CONFIG_SMP, smp_mb() is actually defined as:

#define smp_mb()	__smp_mb()
#define __smp_mb()	mb()
#define mb()   __asm__ __volatile__ ("sync" : : : "memory")

So the original motivation here was to skip a "sync" instruction whenever
switching between threads which are part of the same process. But based on
recent discussions, I suspect my implementation may be doing so incorrectly.

Thanks,

Mathieu

> 
> Thanks
> Christophe
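(For reference, the conditional-barrier idea under discussion boils down to roughly the following sketch, mirroring membarrier_arch_switch_mm() from the patch at the bottom of this page; the function name here is invented, while the field and flag names are the ones introduced by this series.)

/*
 * Sketch only: pay for the heavyweight "sync" (smp_mb() on SMP powerpc)
 * only when the next mm has registered for private expedited membarrier.
 */
static inline void membarrier_switch_mm_sketch(struct mm_struct *prev,
					       struct mm_struct *next)
{
	if (likely(!(atomic_read(&next->membarrier_state) &
		     MEMBARRIER_STATE_PRIVATE_EXPEDITED) || !prev))
		return;		/* no registered user of this mm: skip the sync */

	smp_mb();		/* expands to "sync" on SMP powerpc */
}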
Le 18/06/2021 à 19:26, Mathieu Desnoyers a écrit :
> ----- On Jun 18, 2021, at 1:13 PM, Christophe Leroy christophe.leroy@csgroup.eu wrote:
> [...]
>> 
>> I don't understand all that complexity to just replace a simple
>> 'smp_mb__after_unlock_lock()':
>> 
>> #define smp_mb__after_unlock_lock()	smp_mb()
>> #define smp_mb()	barrier()
>> # define barrier() __asm__ __volatile__("": : :"memory")
>> 
>> Am I missing some subtlety?
> 
> On powerpc CONFIG_SMP, smp_mb() is actually defined as:
> 
> #define smp_mb()	__smp_mb()
> #define __smp_mb()	mb()
> #define mb()   __asm__ __volatile__ ("sync" : : : "memory")
> 
> So the original motivation here was to skip a "sync" instruction whenever
> switching between threads which are part of the same process. But based on
> recent discussions, I suspect my implementation may be doing so incorrectly.
> 

I see.

Then, if you think a 'sync' is a concern, shouldn't we try and remove the
forest of 'sync' in the I/O accessors?

I can't really understand why we need all those 'sync', 'isync' and 'twi'
around the accesses, whereas I/O memory is usually mapped as 'Guarded', so
memory access ordering is already guaranteed.

I'm sure we'll save a lot with that.

Christophe
On Sat, Jun 19, 2021 at 11:35:34AM +0200, Christophe Leroy wrote:
> Le 18/06/2021 à 19:26, Mathieu Desnoyers a écrit :
> > ----- On Jun 18, 2021, at 1:13 PM, Christophe Leroy
> > christophe.leroy@csgroup.eu wrote:
> > [...]
> > > 
> > > I don't understand all that complexity to just replace a simple
> > > 'smp_mb__after_unlock_lock()':
> > > 
> > > #define smp_mb__after_unlock_lock()	smp_mb()
> > > #define smp_mb()	barrier()
> > > # define barrier() __asm__ __volatile__("": : :"memory")
> > > 
> > > Am I missing some subtlety?
> > 
> > On powerpc CONFIG_SMP, smp_mb() is actually defined as:
> > 
> > #define smp_mb()	__smp_mb()
> > #define __smp_mb()	mb()
> > #define mb()   __asm__ __volatile__ ("sync" : : : "memory")
> > 
> > So the original motivation here was to skip a "sync" instruction whenever
> > switching between threads which are part of the same process. But based on
> > recent discussions, I suspect my implementation may be doing so incorrectly.
> 
> I see.
> 
> Then, if you think a 'sync' is a concern, shouldn't we try and remove the
> forest of 'sync' in the I/O accessors?
> 
> I can't really understand why we need all those 'sync', 'isync' and 'twi'
> around the accesses, whereas I/O memory is usually mapped as 'Guarded', so
> memory access ordering is already guaranteed.
> 
> I'm sure we'll save a lot with that.

The point of the twi in the I/O accessors was to make things easier to
debug if the accesses fail: for the twi insn to complete, the load will
have to have completed as well. On a correctly working system you should
never need this (until something fails ;-) )

Without the twi you might still need to enforce ordering in some cases.
The twi is a very heavy hammer, but some of what it gives us is no doubt
actually needed.


Segher
Le 19/06/2021 à 17:02, Segher Boessenkool a écrit :
> On Sat, Jun 19, 2021 at 11:35:34AM +0200, Christophe Leroy wrote:
>> Le 18/06/2021 à 19:26, Mathieu Desnoyers a écrit :
>>> ----- On Jun 18, 2021, at 1:13 PM, Christophe Leroy
>>> christophe.leroy@csgroup.eu wrote:
>>> [...]
>>>> 
>>>> I don't understand all that complexity to just replace a simple
>>>> 'smp_mb__after_unlock_lock()':
>>>> 
>>>> #define smp_mb__after_unlock_lock()	smp_mb()
>>>> #define smp_mb()	barrier()
>>>> # define barrier() __asm__ __volatile__("": : :"memory")
>>>> 
>>>> Am I missing some subtlety?
>>> 
>>> On powerpc CONFIG_SMP, smp_mb() is actually defined as:
>>> 
>>> #define smp_mb()	__smp_mb()
>>> #define __smp_mb()	mb()
>>> #define mb()   __asm__ __volatile__ ("sync" : : : "memory")
>>> 
>>> So the original motivation here was to skip a "sync" instruction whenever
>>> switching between threads which are part of the same process. But based on
>>> recent discussions, I suspect my implementation may be doing so incorrectly.
>>> 
>> 
>> I see.
>> 
>> Then, if you think a 'sync' is a concern, shouldn't we try and remove the
>> forest of 'sync' in the I/O accessors?
>> 
>> I can't really understand why we need all those 'sync', 'isync' and 'twi'
>> around the accesses, whereas I/O memory is usually mapped as 'Guarded', so
>> memory access ordering is already guaranteed.
>> 
>> I'm sure we'll save a lot with that.
> 
> The point of the twi in the I/O accessors was to make things easier to
> debug if the accesses fail: for the twi insn to complete, the load will
> have to have completed as well. On a correctly working system you should
> never need this (until something fails ;-) )
> 
> Without the twi you might still need to enforce ordering in some cases.
> The twi is a very heavy hammer, but some of what it gives us is no doubt
> actually needed.
> 

Well, I've always been quite perplexed about that. According to the
documentation of the 8xx, if a bus error or something happens on an I/O
access, the exception will be accounted on the instruction which does the
access. But based on the following function, I understand that some versions
of powerpc generate the trap on the instruction which was being executed at
the time the I/O access failed, not on the instruction that does the access
itself?

/*
 * I/O accesses can cause machine checks on powermacs.
 * Check if the NIP corresponds to the address of a sync
 * instruction for which there is an entry in the exception
 * table.
 *  -- paulus.
 */
static inline int check_io_access(struct pt_regs *regs)
{
#ifdef CONFIG_PPC32
	unsigned long msr = regs->msr;
	const struct exception_table_entry *entry;
	unsigned int *nip = (unsigned int *)regs->nip;

	if (((msr & 0xffff0000) == 0 || (msr & (0x80000 | 0x40000)))
	    && (entry = search_exception_tables(regs->nip)) != NULL) {
		/*
		 * Check that it's a sync instruction, or somewhere
		 * in the twi; isync; nop sequence that inb/inw/inl uses.
		 * As the address is in the exception table
		 * we should be able to read the instr there.
		 * For the debug message, we look at the preceding
		 * load or store.
		 */
		if (*nip == PPC_INST_NOP)
			nip -= 2;
		else if (*nip == PPC_INST_ISYNC)
			--nip;
		if (*nip == PPC_INST_SYNC || (*nip >> 26) == OP_TRAP) {
			unsigned int rb;

			--nip;
			rb = (*nip >> 11) & 0x1f;
			printk(KERN_DEBUG "%s bad port %lx at %p\n",
			       (*nip & 0x100)? "OUT to": "IN from",
			       regs->gpr[rb] - _IO_BASE, nip);
			regs->msr |= MSR_RI;
			regs->nip = extable_fixup(entry);
			return 1;
		}
	}
#endif /* CONFIG_PPC32 */
	return 0;
}

Am I right?
It is not only the twi which bothers me in the I/O accessors, but also the
sync/isync and such.

A write typically is:

	sync
	stw

A read is:

	sync
	lwz
	twi
	isync

Taking into account that HW ordering is guaranteed by the fact that __iomem
is guarded, isn't the 'memory' clobber enough as a barrier?

Thanks
Christophe
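(For concreteness, the shape of such a read accessor can be sketched as inline asm like the following. This is only an illustration of where the sync, twi and isync sit relative to the load and the "memory" clobber, not the kernel's actual accessor, which is generated from macros in arch/powerpc/include/asm/io.h; the function name is invented and the pointer stands in for an __iomem-annotated MMIO address.)

/* Illustration only, not the kernel's real accessor. */
static inline unsigned int sketch_in_be32(const volatile unsigned int *addr)
{
	unsigned int val;

	__asm__ __volatile__(
		"sync\n\t"		/* order prior accesses vs. this load      */
		"lwz %0,0(%1)\n\t"	/* the MMIO load itself                    */
		"twi 0,%0,0\n\t"	/* consumes the loaded value, so later     */
		"isync"			/* instructions wait for the load to land  */
		: "=r" (val)
		: "b" (addr)
		: "memory");		/* compiler-level ordering only            */
	return val;
}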
Hi!

On Mon, Jun 21, 2021 at 04:11:04PM +0200, Christophe Leroy wrote:
> Le 19/06/2021 à 17:02, Segher Boessenkool a écrit :
> > The point of the twi in the I/O accessors was to make things easier to
> > debug if the accesses fail: for the twi insn to complete, the load will
> > have to have completed as well. On a correctly working system you should
> > never need this (until something fails ;-) )
> > 
> > Without the twi you might still need to enforce ordering in some cases.
> > The twi is a very heavy hammer, but some of what it gives us is no doubt
> > actually needed.
> 
> Well, I've always been quite perplexed about that. According to the
> documentation of the 8xx, if a bus error or something happens on an I/O
> access, the exception will be accounted on the instruction which does the
> access. But based on the following function, I understand that some versions
> of powerpc generate the trap on the instruction which was being executed
> at the time the I/O access failed, not on the instruction that does the
> access itself?

Trap instructions are never speculated (this may not be architectural, but
it is true on all existing implementations). So the instructions after the
twi;isync will not execute until the twi itself has finished, and that
cannot happen before the preceding load has (because it uses the loaded
register).

Now, some I/O accesses can cause machine checks. Machine checks are
asynchronous and can be hard to correlate to specific load insns, and
worse, you may not even have the address loaded from in architected
registers anymore. Since I/O accesses often take *long* (tens or even
hundreds of cycles is not unusual), this can be a challenge. To recover
from machine checks you typically need special debug hardware and/or
software. For the Apple machines those are not so easy to come by. This
"twi after loads" thing made it pretty easy to figure out where your code
was going wrong. And it isn't as slow as it may sound: typically you
really need the result of the load before you can go on and do useful
work anyway, and loads from I/O are slow non-posted things.

> /*
>  * I/O accesses can cause machine checks on powermacs.
>  * Check if the NIP corresponds to the address of a sync
>  * instruction for which there is an entry in the exception
>  * table.
>  *  -- paulus.
>  */

I suspect this is from before the twi thing was added?

> It is not only the twi which bothers me in the I/O accessors, but also
> the sync/isync and such.
> 
> A write typically is:
> 
> 	sync
> 	stw
> 
> A read is:
> 
> 	sync
> 	lwz
> 	twi
> 	isync
> 
> Taking into account that HW ordering is guaranteed by the fact that
> __iomem is guarded,

Yes. But machine checks are asynchronous :-)

> isn't the 'memory' clobber enough as a barrier?

A "memory" clobber isn't a barrier of any kind. "Compiler barriers" do not
exist. The only thing such a clobber does is tell the compiler that this
inline asm can access some memory, without saying at what address, so the
compiler cannot reorder this asm with other memory accesses. It has no
other effects, no magical effects, and it is not comparable to actual
barrier instructions (which tell the hardware that some certain ordering
is required).

"Compiler barrier" is a harmful misnomer: language shapes thoughts, and
using misleading names causes misguided thoughts.

Anyway :-) The isync is simply to make sure the code after it does not
start before the code before it has completed. The sync before, I am not
sure about.


Segher
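(To make that distinction concrete, a minimal sketch of the two: the first emits no instruction at all and only constrains the compiler, which is what barrier() is; the second is what mb()/smp_mb() expand to on SMP powerpc. The function names are invented for illustration.)

/* Compiler-level ordering only: no instruction is emitted, so the CPU may
 * still reorder the accesses; the compiler just may not move them across
 * this asm. */
static inline void compiler_ordering_only(void)
{
	__asm__ __volatile__("" : : : "memory");
}

/* Hardware ordering: "sync" orders the storage accesses themselves, and
 * the clobber additionally keeps the compiler from moving accesses across
 * it. */
static inline void hardware_full_barrier(void)
{
	__asm__ __volatile__("sync" : : : "memory");
}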
diff --git a/MAINTAINERS b/MAINTAINERS
index 845fc25812f1..34c1ecd5a5d1 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -8929,6 +8929,7 @@ L:	linux-kernel@vger.kernel.org
 S:	Supported
 F:	kernel/sched/membarrier.c
 F:	include/uapi/linux/membarrier.h
+F:	arch/powerpc/include/asm/membarrier.h
 
 MEMORY MANAGEMENT
 L:	linux-mm@kvack.org
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 2ed525a44734..09b02180b8a0 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -140,6 +140,7 @@ config PPC
 	select ARCH_HAS_FORTIFY_SOURCE
 	select ARCH_HAS_GCOV_PROFILE_ALL
 	select ARCH_HAS_PMEM_API                if PPC64
+	select ARCH_HAS_MEMBARRIER_HOOKS
 	select ARCH_HAS_SCALED_CPUTIME		if VIRT_CPU_ACCOUNTING_NATIVE
 	select ARCH_HAS_SG_CHAIN
 	select ARCH_HAS_TICK_BROADCAST		if GENERIC_CLOCKEVENTS_BROADCAST
diff --git a/arch/powerpc/include/asm/membarrier.h b/arch/powerpc/include/asm/membarrier.h
new file mode 100644
index 000000000000..98ff4f1fcf2b
--- /dev/null
+++ b/arch/powerpc/include/asm/membarrier.h
@@ -0,0 +1,26 @@
+#ifndef _ASM_POWERPC_MEMBARRIER_H
+#define _ASM_POWERPC_MEMBARRIER_H
+
+static inline void membarrier_arch_switch_mm(struct mm_struct *prev,
+					     struct mm_struct *next,
+					     struct task_struct *tsk)
+{
+	/*
+	 * Only need the full barrier when switching between processes.
+	 * Barrier when switching from kernel to userspace is not
+	 * required here, given that it is implied by mmdrop(). Barrier
+	 * when switching from userspace to kernel is not needed after
+	 * store to rq->curr.
+	 */
+	if (likely(!(atomic_read(&next->membarrier_state) &
+			MEMBARRIER_STATE_PRIVATE_EXPEDITED) || !prev))
+		return;
+
+	/*
+	 * The membarrier system call requires a full memory barrier
+	 * after storing to rq->curr, before going back to user-space.
+	 */
+	smp_mb();
+}
+
+#endif /* _ASM_POWERPC_MEMBARRIER_H */
diff --git a/arch/powerpc/mm/mmu_context.c b/arch/powerpc/mm/mmu_context.c
index d60a62bf4fc7..0ab297c4cfad 100644
--- a/arch/powerpc/mm/mmu_context.c
+++ b/arch/powerpc/mm/mmu_context.c
@@ -12,6 +12,7 @@
 
 #include <linux/mm.h>
 #include <linux/cpu.h>
+#include <linux/sched/mm.h>
 
 #include <asm/mmu_context.h>
 
@@ -58,6 +59,10 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 		 *
 		 * On the read side the barrier is in pte_xchg(), which orders
 		 * the store to the PTE vs the load of mm_cpumask.
+		 *
+		 * This full barrier is needed by membarrier when switching
+		 * between processes after store to rq->curr, before user-space
+		 * memory accesses.
 		 */
 		smp_mb();
 
@@ -80,6 +85,8 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 
 	if (new_on_cpu)
 		radix_kvm_prefetch_workaround(next);
+	else
+		membarrier_arch_switch_mm(prev, next, tsk);
 
 	/*
 	 * The actual HW switching method differs between the various
diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index 3d49b91b674d..1754396795f6 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -215,14 +215,25 @@ static inline void memalloc_noreclaim_restore(unsigned int flags)
 #ifdef CONFIG_MEMBARRIER
 enum {
 	MEMBARRIER_STATE_PRIVATE_EXPEDITED_READY	= (1U << 0),
-	MEMBARRIER_STATE_SWITCH_MM			= (1U << 1),
+	MEMBARRIER_STATE_PRIVATE_EXPEDITED		= (1U << 1),
 };
 
+#ifdef CONFIG_ARCH_HAS_MEMBARRIER_HOOKS
+#include <asm/membarrier.h>
+#endif
+
 static inline void membarrier_execve(struct task_struct *t)
 {
 	atomic_set(&t->mm->membarrier_state, 0);
 }
 #else
+#ifdef CONFIG_ARCH_HAS_MEMBARRIER_HOOKS
+static inline void membarrier_arch_switch_mm(struct mm_struct *prev,
+					     struct mm_struct *next,
+					     struct task_struct *tsk)
+{
+}
+#endif
 static inline void membarrier_execve(struct task_struct *t)
 {
 }
diff --git a/init/Kconfig b/init/Kconfig
index a9a2e2c86671..2d118b6adee2 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1412,6 +1412,9 @@ config USERFAULTFD
 	  Enable the userfaultfd() system call that allows to intercept and
 	  handle page faults in userland.
 
+config ARCH_HAS_MEMBARRIER_HOOKS
+	bool
+
 config EMBEDDED
 	bool "Embedded system"
 	option allnoconfig_y
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a7bf32aabfda..c7e06dfa804b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2653,16 +2653,6 @@ static struct rq *finish_task_switch(struct task_struct *prev)
 	prev_state = prev->state;
 	vtime_task_switch(prev);
 	perf_event_task_sched_in(prev, current);
-	/*
-	 * The membarrier system call requires a full memory barrier
-	 * after storing to rq->curr, before going back to user-space.
-	 *
-	 * TODO: This smp_mb__after_unlock_lock can go away if PPC end
-	 * up adding a full barrier to switch_mm(), or we should figure
-	 * out if a smp_mb__after_unlock_lock is really the proper API
-	 * to use.
-	 */
-	smp_mb__after_unlock_lock();
 	finish_lock_switch(rq, prev);
 	finish_arch_post_lock_switch();
 
diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c
index 9bcbacba82a8..678577267a9a 100644
--- a/kernel/sched/membarrier.c
+++ b/kernel/sched/membarrier.c
@@ -118,6 +118,14 @@ static void membarrier_register_private_expedited(void)
 	if (atomic_read(&mm->membarrier_state)
 			& MEMBARRIER_STATE_PRIVATE_EXPEDITED_READY)
 		return;
+	atomic_or(MEMBARRIER_STATE_PRIVATE_EXPEDITED, &mm->membarrier_state);
+	if (!(atomic_read(&mm->mm_users) == 1 && get_nr_threads(p) == 1)) {
+		/*
+		 * Ensure all future scheduler executions will observe the
+		 * new thread flag state for this process.
+		 */
+		synchronize_sched();
+	}
 	atomic_or(MEMBARRIER_STATE_PRIVATE_EXPEDITED_READY,
 			&mm->membarrier_state);
 }
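(As a usage sketch, not part of the patch: roughly how a userspace process would register for private expedited membarrier and then issue one, once a kernel with this support is running. The command names are those found in include/uapi/linux/membarrier.h on kernels that provide explicit registration; the in-kernel registration path in this series is the membarrier_register_private_expedited() shown above.)

#include <linux/membarrier.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <stdio.h>

/* Thin wrapper: libc does not generally provide a membarrier() wrapper. */
static int sys_membarrier(int cmd, int flags)
{
	return syscall(__NR_membarrier, cmd, flags);
}

int main(void)
{
	/* Register this process; later context switches into its threads
	 * will then issue the full barrier discussed in this thread. */
	if (sys_membarrier(MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED, 0))
		perror("MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED");

	/* Expedited barrier across all running threads of this process. */
	if (sys_membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED, 0))
		perror("MEMBARRIER_CMD_PRIVATE_EXPEDITED");

	return 0;
}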