Message ID | 20160309101349.GJ6344@twins.programming.kicks-ass.net
---|---
State | Superseded
On Wed, Mar 09, 2016 at 11:13:49AM +0100, Peter Zijlstra wrote:
> ---
> Subject: bitops: Do not default to __clear_bit() for __clear_bit_unlock()
>
> __clear_bit_unlock() is a special little snowflake. While it carries the
> non-atomic '__' prefix, it is specifically documented to pair with
> test_and_set_bit() and therefore should be 'somewhat' atomic.
>
> Therefore the generic implementation of __clear_bit_unlock() cannot use
> the fully non-atomic __clear_bit() as a default.
>
> If an arch is able to do better; it must provide an implementation of
> __clear_bit_unlock() itself.
>

FWIW, we could probably undo this if we unified all the spinlock based
atomic ops implementations (there's a whole bunch doing essentially the
same), and special cased __clear_bit_unlock() for that.

Collapsing them is probably a good idea anyway, just a fair bit of
non-trivial work to figure out all the differences and if they matter
etc..
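As a rough illustration of the "spinlock based atomic ops implementations" Peter means here: on a !LLSC architecture the "atomic" bit ops serialize their read-modify-write through an internal lock (often hashed from the address), while the '__' variants skip the lock entirely. The sketch below is hypothetical user-space C (bitlock_hash, lock_for() and the function names are made up; C11 atomics stand in for the arch spinlock), not any particular arch's code:

#include <stdatomic.h>

#define NR_BITLOCKS 16

/* hypothetical hashed locks standing in for the arch's internal spinlocks */
static _Atomic int bitlock_hash[NR_BITLOCKS];

static _Atomic int *lock_for(volatile unsigned long *addr)
{
	return &bitlock_hash[((unsigned long)addr >> 4) % NR_BITLOCKS];
}

/* "atomic" clear_bit(): the read-modify-write is serialized by the hashed lock */
static void clear_bit_spinlocked(int nr, volatile unsigned long *addr)
{
	_Atomic int *l = lock_for(addr);

	while (atomic_exchange_explicit(l, 1, memory_order_acquire))
		;					/* spin */
	*addr &= ~(1UL << nr);				/* protected RMW */
	atomic_store_explicit(l, 0, memory_order_release);
}

/* plain __clear_bit(): unprotected RMW, can lose a concurrent test_and_set_bit() */
static void clear_bit_plain(int nr, volatile unsigned long *addr)
{
	*addr &= ~(1UL << nr);
}

Unifying those per-arch copies would make it possible to special case __clear_bit_unlock() inside one shared implementation, which is the cleanup Peter alludes to.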
On Wednesday 09 March 2016 03:43 PM, Peter Zijlstra wrote:
>>> If you take the lock in __bit_spin_unlock
>>> then the race cannot happen.
>>
>> Of course it won't but that means we penalize all non atomic callers of the API
>> with a superfluous spinlock which is not required in the first place given the
>> definition of API.
>
> Quite. _However_, your arch is still broken, but not by your fault. It's
> the generic-asm code that is wrong.
>
> The thing is that __bit_spin_unlock() uses __clear_bit_unlock(), which
> defaults to __clear_bit(). Which is wrong.
>
> ---
> Subject: bitops: Do not default to __clear_bit() for __clear_bit_unlock()
>
> __clear_bit_unlock() is a special little snowflake. While it carries the
> non-atomic '__' prefix, it is specifically documented to pair with
> test_and_set_bit() and therefore should be 'somewhat' atomic.
>
> Therefore the generic implementation of __clear_bit_unlock() cannot use
> the fully non-atomic __clear_bit() as a default.
>
> If an arch is able to do better; it must provide an implementation of
> __clear_bit_unlock() itself.
>
> Reported-by: Vineet Gupta <Vineet.Gupta1@synopsys.com>

This needs to be CCed stable as it fixes a real bug for ARC.

> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>

Tested-by: Vineet Gupta <Vineet.Gupta1@synopsys.com>

FWIW, could we add some background to the commit log, specifically what prompted
this. Something like below...

---->8------

This came up as a result of hackbench livelock'ing in slab_lock() on ARC with
SMP + SLUB + !LLSC.

The issue was incorrect pairing of atomic ops.

slab_lock() -> bit_spin_lock() -> test_and_set_bit()
slab_unlock() -> __bit_spin_unlock() -> __clear_bit()

The non-serializing __clear_bit() was getting "lost"

80543b8e:	ld_s       r2,[r13,0]	<--- (A) Finds PG_locked is set
80543b90:	or         r3,r2,1	<--- (B) other core unlocks right here
80543b94:	st_s       r3,[r13,0]	<--- (C) sets PG_locked (overwrites unlock)

Fixes ARC STAR 9000817404

---->8------

> ---
>  include/asm-generic/bitops/lock.h | 14 +++++++-------
>  1 file changed, 7 insertions(+), 7 deletions(-)
>
> diff --git a/include/asm-generic/bitops/lock.h b/include/asm-generic/bitops/lock.h
> index c30266e94806..8ef0ccbf8167 100644
> --- a/include/asm-generic/bitops/lock.h
> +++ b/include/asm-generic/bitops/lock.h
> @@ -29,16 +29,16 @@ do {					\
>   * @nr: the bit to set
>   * @addr: the address to start counting from
>   *
> - * This operation is like clear_bit_unlock, however it is not atomic.
> - * It does provide release barrier semantics so it can be used to unlock
> - * a bit lock, however it would only be used if no other CPU can modify
> - * any bits in the memory until the lock is released (a good example is
> - * if the bit lock itself protects access to the other bits in the word).
> + * A weaker form of clear_bit_unlock() as used by __bit_lock_unlock(). If all
> + * the bits in the word are protected by this lock some archs can use weaker
> + * ops to safely unlock.
> + *
> + * See for example x86's implementation.
>   */

To be able to override/use-generic don't we need #ifndef ....

>  #define __clear_bit_unlock(nr, addr)	\
>  do {					\
> -	smp_mb();			\
> -	__clear_bit(nr, addr);		\
> +	smp_mb__before_atomic();	\
> +	clear_bit(nr, addr);		\
>  } while (0)
>
>  #endif /* _ASM_GENERIC_BITOPS_LOCK_H_ */
>
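For readers without the ARC disassembly above fresh in mind, here is the same lost-update window rendered as plain C (a purely illustrative sketch; page_flags and the function names are made up): the lock word is read at (A), the other core's plain-store unlock slips in at (B), and the store at (C) writes back a stale value with PG_locked still set, so the unlock is lost and the next bit_spin_lock() spins forever.

#include <stdatomic.h>

static _Atomic unsigned long page_flags;	/* stands in for page->flags */

/* the load/or/store at the heart of test_and_set_bit() on !LLSC ARC; the
 * arch's internal spinlock cannot help because the unlock below never takes it */
static void set_lock_bit_broken(void)
{
	unsigned long v = atomic_load_explicit(&page_flags, memory_order_relaxed); /* (A) */
	/* (B) can happen here: the other core's __clear_bit() store lands */
	atomic_store_explicit(&page_flags, v | 1UL, memory_order_relaxed);         /* (C) */
}

/* __clear_bit()-style unlock: also an unserialized read-modify-write */
static void clear_lock_bit_broken(void)
{
	unsigned long v = atomic_load_explicit(&page_flags, memory_order_relaxed);
	atomic_store_explicit(&page_flags, v & ~1UL, memory_order_relaxed);
}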
On Wednesday 09 March 2016 04:01 PM, Peter Zijlstra wrote:
> On Wed, Mar 09, 2016 at 11:13:49AM +0100, Peter Zijlstra wrote:
>> ---
>> Subject: bitops: Do not default to __clear_bit() for __clear_bit_unlock()
>>
>> __clear_bit_unlock() is a special little snowflake. While it carries the
>> non-atomic '__' prefix, it is specifically documented to pair with
>> test_and_set_bit() and therefore should be 'somewhat' atomic.
>>
>> Therefore the generic implementation of __clear_bit_unlock() cannot use
>> the fully non-atomic __clear_bit() as a default.
>>
>> If an arch is able to do better; it must provide an implementation of
>> __clear_bit_unlock() itself.
>>
>
> FWIW, we could probably undo this if we unified all the spinlock based
> atomic ops implementations (there's a whole bunch doing essentially the
> same), and special cased __clear_bit_unlock() for that.
>
> Collapsing them is probably a good idea anyway, just a fair bit of
> non-trivial work to figure out all the differences and if they matter
> etc..

Indeed I thought about this when we first did the SMP port. The only issue was
somewhat more generated code with the hashed spinlocks (vs. my dumb 2 spin locks -
which as I see now will also cause false sharing - likely ending up in the same
cache line), but I was more of a micro-optimization freak then than I'm now :-)

So yeah I agree !
On Wednesday 09 March 2016 03:43 PM, Peter Zijlstra wrote:
>> There is clearly a problem in slub code that it is pairing a test_and_set_bit()
>> with a __clear_bit(). Latter can obviously clobber former if they are not a single
>> instruction each unlike x86 or they use llock/scond kind of instructions where the
>> interim store from other core is detected and causes a retry of whole llock/scond
>> sequence.
>
> Yes, test_and_set_bit() + __clear_bit() is broken.

But in SLUB: bit_spin_lock() + __bit_spin_unlock() is acceptable ? How so
(ignoring the performance thing for discussion sake, which is a side effect of
this implementation).

So despite the comment below in bit_spinlock.h I don't quite comprehend how this
is allowable. And if say, by deduction, this is fine for LLSC or lock prefixed
cases, then isn't this true in general for lot more cases in kernel, i.e. pairing
atomic lock with non-atomic unlock ? I'm missing something !

| /*
|  * bit-based spin_unlock()
|  * non-atomic version, which can be used eg. if the bit lock itself is
|  * protecting the rest of the flags in the word.
|  */
| static inline void __bit_spin_unlock(int bitnum, unsigned long *addr)
On Wed, Mar 09, 2016 at 06:52:45PM +0530, Vineet Gupta wrote:
> On Wednesday 09 March 2016 03:43 PM, Peter Zijlstra wrote:
> >> There is clearly a problem in slub code that it is pairing a test_and_set_bit()
> >> with a __clear_bit(). Latter can obviously clobber former if they are not a single
> >> instruction each unlike x86 or they use llock/scond kind of instructions where the
> >> interim store from other core is detected and causes a retry of whole llock/scond
> >> sequence.
> >
> > Yes, test_and_set_bit() + __clear_bit() is broken.
>
> But in SLUB: bit_spin_lock() + __bit_spin_unlock() is acceptable ? How so
> (ignoring the performance thing for discussion sake, which is a side effect of
> this implementation).

The short answer is: Per definition. They are defined to work together,
which is what makes __clear_bit_unlock() such a special function.

> So despite the comment below in bit_spinlock.h I don't quite comprehend how this
> is allowable. And if say, by deduction, this is fine for LLSC or lock prefixed
> cases, then isn't this true in general for lot more cases in kernel, i.e. pairing
> atomic lock with non-atomic unlock ? I'm missing something !

x86 (and others) do in fact use non-atomic instructions for
spin_unlock(). But as this is all arch specific, we can make these
assumptions. It's just that generic code cannot rely on it.


So let me try and explain.


The problem as identified is:

	CPU0					CPU1

	bit_spin_lock()				__bit_spin_unlock()
1:
	/* fetch_or, r1 holds the old value */
	spin_lock
	load	r1, addr
						load	r1, addr
						bclr	r2, r1, 1
						store	r2, addr
	or	r2, r1, 1
	store	r2, addr	/* lost the store from CPU1 */
	spin_unlock

	and	r1, 1
	bnz	2 /* it was set, go wait */
	ret

2:
	load	r1, addr
	and	r1, 1
	bnz	2 /* wait until its not set */

	b	1 /* try again */


For LL/SC we replace:

	spin_lock
	load	r1, addr

	...

	store	r2, addr
	spin_unlock

With the (obvious):

1:
	load-locked	r1, addr

	...

	store-cond	r2, addr
	bnz	1 /* or whatever branch instruction is required to retry */


In this case the failure cannot happen, because the store from CPU1
would have invalidated the lock from CPU0 and caused the
store-cond to fail and retry the loop, observing the new value.
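A hedged C11 analogue of the LL/SC loop above (a sketch, not taken from the thread): the conditional store fails whenever anything else wrote the word in between, so CPU1's unlocking store can never be silently overwritten; the fetch_or simply retries with the fresh value.

#include <stdatomic.h>

/* fetch_or built the LL/SC way: retry until the store succeeds unmolested */
static unsigned long fetch_or_llsc_style(_Atomic unsigned long *addr,
					 unsigned long mask)
{
	unsigned long old = atomic_load_explicit(addr, memory_order_relaxed);

	/* a failed compare_exchange refreshes 'old', like a failed store-cond */
	while (!atomic_compare_exchange_weak_explicit(addr, &old, old | mask,
						      memory_order_acquire,
						      memory_order_relaxed))
		;
	return old;
}

On a !LLSC arch the patch gives the same guarantee by other means: clear_bit() and test_and_set_bit() both go through the arch's internal lock, so neither read-modify-write can tear the other.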
On Wednesday 09 March 2016 08:21 PM, Peter Zijlstra wrote:
>> But in SLUB: bit_spin_lock() + __bit_spin_unlock() is acceptable ? How so
>> (ignoring the performance thing for discussion sake, which is a side effect of
>> this implementation).
>
> The short answer is: Per definition. They are defined to work together,
> which is what makes __clear_bit_unlock() such a special function.
>
>> So despite the comment below in bit_spinlock.h I don't quite comprehend how this
>> is allowable. And if say, by deduction, this is fine for LLSC or lock prefixed
>> cases, then isn't this true in general for lot more cases in kernel, i.e. pairing
>> atomic lock with non-atomic unlock ? I'm missing something !
>
> x86 (and others) do in fact use non-atomic instructions for
> spin_unlock(). But as this is all arch specific, we can make these
> assumptions. It's just that generic code cannot rely on it.

OK despite being obvious now, I was not seeing the similarity between spin_*lock()
and bit_spin*lock() :-(

ARC also uses standard ST for spin_unlock() so by analogy __bit_spin_unlock() (for
LLSC case) would be correctly paired with bit_spin_lock().

But then why would anyone need bit_spin_unlock() at all? Specially after this
patch from you which tightens __bit_spin_unlock() even more for the general case.

Thing is if the API exists majority of people would use the more
conservative version w/o understanding all these nuances. Can we pursue the path of
moving bit_spin_unlock() over to __bit_spin_unlock(): first changing the backend
only and if proven stable replacing the call-sites themselves.

> So let me try and explain.
>
>
> The problem as identified is:
>
> 	CPU0					CPU1
>
> 	bit_spin_lock()				__bit_spin_unlock()
> 1:
> 	/* fetch_or, r1 holds the old value */
> 	spin_lock
> 	load	r1, addr
> 						load	r1, addr
> 						bclr	r2, r1, 1
> 						store	r2, addr
> 	or	r2, r1, 1
> 	store	r2, addr	/* lost the store from CPU1 */
> 	spin_unlock
>
> 	and	r1, 1
> 	bnz	2 /* it was set, go wait */
> 	ret
>
> 2:
> 	load	r1, addr
> 	and	r1, 1
> 	bnz	2 /* wait until its not set */
>
> 	b	1 /* try again */
>
>
> For LL/SC we replace:
>
> 	spin_lock
> 	load	r1, addr
>
> 	...
>
> 	store	r2, addr
> 	spin_unlock
>
> With the (obvious):
>
> 1:
> 	load-locked	r1, addr
>
> 	...
>
> 	store-cond	r2, addr
> 	bnz	1 /* or whatever branch instruction is required to retry */
>
>
> In this case the failure cannot happen, because the store from CPU1
> would have invalidated the lock from CPU0 and caused the
> store-cond to fail and retry the loop, observing the new value.

You did it again, A picture is worth a thousand words !

Thx,
-Vineet
On Thu, Mar 10, 2016 at 11:21:21AM +0530, Vineet Gupta wrote:
> On Wednesday 09 March 2016 08:21 PM, Peter Zijlstra wrote:
> >> But in SLUB: bit_spin_lock() + __bit_spin_unlock() is acceptable ? How so
> >> (ignoring the performance thing for discussion sake, which is a side effect of
> >> this implementation).
> >
> > The short answer is: Per definition. They are defined to work together,
> > which is what makes __clear_bit_unlock() such a special function.
> >
> >> So despite the comment below in bit_spinlock.h I don't quite comprehend how this
> >> is allowable. And if say, by deduction, this is fine for LLSC or lock prefixed
> >> cases, then isn't this true in general for lot more cases in kernel, i.e. pairing
> >> atomic lock with non-atomic unlock ? I'm missing something !
> >
> > x86 (and others) do in fact use non-atomic instructions for
> > spin_unlock(). But as this is all arch specific, we can make these
> > assumptions. It's just that generic code cannot rely on it.
>
> OK despite being obvious now, I was not seeing the similarity between spin_*lock()
> and bit_spin*lock() :-(
>
> ARC also uses standard ST for spin_unlock() so by analogy __bit_spin_unlock() (for
> LLSC case) would be correctly paired with bit_spin_lock().
>
> But then why would anyone need bit_spin_unlock() at all? Specially after this
> patch from you which tightens __bit_spin_unlock() even more for the general case.
>
> Thing is if the API exists majority of people would use the more
> conservative version w/o understanding all these nuances. Can we pursue the path of
> moving bit_spin_unlock() over to __bit_spin_unlock(): first changing the backend
> only and if proven stable replacing the call-sites themselves.

So the thing is, __bit_spin_unlock() is not safe if other bits in that
word can have concurrent modifications. Only if the bitlock locks the
whole word (or something else ensures no other bits will change) can
you use __bit_spin_unlock() to clear the lock bit.
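A small sketch of the rule Peter states, using the real bit_spinlock API on a made-up structure (obj_meta and meta_inc() are purely illustrative): every writer of the non-lock bits takes the bit lock first, so the word is fully owned by the lock holder and the weaker __bit_spin_unlock() is fine. If any path modified those bits without holding bit 0, only the fully atomic bit_spin_unlock() would be safe.

#include <linux/bit_spinlock.h>

#define META_LOCK_BIT	0

struct obj_meta {
	unsigned long word;	/* bit 0: lock bit, bits 1..N: a small counter */
};

static void meta_inc(struct obj_meta *m)
{
	bit_spin_lock(META_LOCK_BIT, &m->word);
	m->word += 2;		/* bump the counter bits; bit 0 stays set */
	__bit_spin_unlock(META_LOCK_BIT, &m->word);	/* whole word is lock-protected */
}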
diff --git a/include/asm-generic/bitops/lock.h b/include/asm-generic/bitops/lock.h
index c30266e94806..8ef0ccbf8167 100644
--- a/include/asm-generic/bitops/lock.h
+++ b/include/asm-generic/bitops/lock.h
@@ -29,16 +29,16 @@ do {					\
  * @nr: the bit to set
  * @addr: the address to start counting from
  *
- * This operation is like clear_bit_unlock, however it is not atomic.
- * It does provide release barrier semantics so it can be used to unlock
- * a bit lock, however it would only be used if no other CPU can modify
- * any bits in the memory until the lock is released (a good example is
- * if the bit lock itself protects access to the other bits in the word).
+ * A weaker form of clear_bit_unlock() as used by __bit_lock_unlock(). If all
+ * the bits in the word are protected by this lock some archs can use weaker
+ * ops to safely unlock.
+ *
+ * See for example x86's implementation.
  */
 #define __clear_bit_unlock(nr, addr)	\
 do {					\
-	smp_mb();			\
-	__clear_bit(nr, addr);		\
+	smp_mb__before_atomic();	\
+	clear_bit(nr, addr);		\
 } while (0)
 
 #endif /* _ASM_GENERIC_BITOPS_LOCK_H_ */
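Regarding the "#ifndef" review question above: a common way to let an architecture override a generic helper is for the arch's asm/bitops.h to define the macro and for the generic header to guard its fallback. A sketch of that shape follows (the guard placement and the hypothetical arch body are illustrative, not something this patch itself adds; x86's real __clear_bit_unlock() is roughly a compiler barrier plus __clear_bit()):

/* arch/foo/include/asm/bitops.h: hypothetical arch where the lock bit's word
 * is never modified by other CPUs between lock and unlock */
#define __clear_bit_unlock(nr, addr)	\
do {					\
	barrier();			\
	__clear_bit(nr, addr);		\
} while (0)

/* include/asm-generic/bitops/lock.h: fallback used only when the arch did
 * not provide its own definition */
#ifndef __clear_bit_unlock
#define __clear_bit_unlock(nr, addr)	\
do {					\
	smp_mb__before_atomic();	\
	clear_bit(nr, addr);		\
} while (0)
#endif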