| Message ID | CA+55aFzP60Ucn0snFchHNw2tOnS6HYp-6n+f+9nxrbTPw8fTfQ@mail.gmail.com (mailing list archive) |
|---|---|
| State | Not Applicable |
On Thu, 2014-03-20 at 09:41 -0700, Linus Torvalds wrote:
> On Wed, Mar 19, 2014 at 10:56 PM, Davidlohr Bueso <davidlohr@hp.com> wrote:
> >
> > This problem suggests that we missed a wakeup for a task that was adding
> > itself to the queue in a wait path. And the only place that can happen
> > is with the hb spinlock check for any pending waiters.
>
> Ok, so thinking about hb_waiters_pending():
>
>  - spin_is_locked() test
>  - read barrier
>  - plist_head_empty() test
>
> seems to be broken after all.
>
> The race is against futex_wait() that does
>
>  - futex_wait_setup():
>    - queue_lock()
>    - get_futex_value_locked()
>  - futex_wait_queue_me()
>    - queue_me()
>      - plist_add()
>
> right?

Yep.

> It strikes me that the "spin_is_locked()" test has no barriers wrt the
> writing of the new futex value on the wake path. And the read barrier
> obviously does nothing wrt the write either. Or am I missing
> something? So the write that actually released the futex might be
> almost arbitrarily delayed on the waking side. So the waiting side may
> not see the new value, even though the waker assumes it does due to
> the ordering of it doing the write first.

Aha, that would certainly violate the ordering guarantees. I feared
_something_ like that when we originally discussed your suggestion as
opposed to the atomics one, but didn't have any case for it either.

> So maybe we need a memory barrier in hb_waiters_pending() just to make
> sure that the new value is written.
>
> But at that point, I suspect that Davidlohr's original patch that just
> had explicit waiting counts is just as well. The whole point with the
> head empty test was to emulate that "do we have waiters" without
> having to actually add the atomics, but a memory barrier is really no
> worse.
>
> The attached is a TOTALLY UNTESTED interdiff that adds back Davidlohr's
> atomic count. It may be terminally broken, I literally didn't test it
> at all.

Comparing with the patch I sent earlier this morning, looks equivalent,
and fwiw, passes my initial qemu bootup, which is the first way of
detecting anything stupid going on.

So, Srikar, please try this patch out, as opposed to mine, you don't
have to first revert the commit in question.
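(For readers following along, a condensed sketch of the two paths being lined up against each other, based on the check that the interdiff at the end of this thread removes; simplified, not a literal copy of futex.c:)

```c
/* Wake side: does futex_wake() need to take hb->lock at all? */
static inline bool hb_waiters_pending(struct futex_hash_bucket *hb)
{
	/*
	 * A waiter already holding (or spinning on) hb->lock is assumed
	 * to be about to plist_add() itself, so a locked bucket counts
	 * as "waiters pending".
	 */
	if (spin_is_locked(&hb->lock))
		return true;
	smp_rmb();	/* check the lock state before the list */

	return !plist_head_empty(&hb->chain);
}

/*
 * Wait side, in call order (the list quoted above):
 *
 *	futex_wait_setup()
 *		queue_lock()			-> spin_lock(&hb->lock)
 *		get_futex_value_locked()	-> re-check the futex word
 *	futex_wait_queue_me()
 *		queue_me()
 *			plist_add(&q->list, &hb->chain)
 */
```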
On Thu, Mar 20, 2014 at 10:18 AM, Davidlohr Bueso <davidlohr@hp.com> wrote:
>> It strikes me that the "spin_is_locked()" test has no barriers wrt the
>> writing of the new futex value on the wake path. And the read barrier
>> obviously does nothing wrt the write either. Or am I missing
>> something? So the write that actually released the futex might be
>> almost arbitrarily delayed on the waking side. So the waiting side may
>> not see the new value, even though the waker assumes it does due to
>> the ordering of it doing the write first.
>
> Aha, that would certainly violate the ordering guarantees. I feared
> _something_ like that when we originally discussed your suggestion as
> opposed to the atomics one, but didn't have any case for it either.

Actually, looking closer, we have the memory barrier in
get_futex_key_refs() (called by "get_futex_key()") so that's not it.
In fact, your "atomic_read(&hb->waiters)" doesn't have any more
serialization than the spin_is_locked() test had.

But the spin_is_locked() and queue-empty tests are two separate memory
reads, and maybe there is some ordering wrt those two that we missed,
so the "waiters" patch is worth trying anyway.

I do still dislike how the "waiters" thing adds an atomic update, but
whatever..

Linus
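(For context, the barrier Linus is pointing at: get_futex_key() ends by calling get_futex_key_refs(), whose key-reference bump is documented as implying a full barrier. The mm-shared flavour is visible in the context lines of the diff further down; a condensed sketch, with the surrounding key-type dispatch elided:)

```c
/*
 * Condensed from the futex.c of that era: the reference taken on the
 * futex key doubles as a full memory barrier, so the waker's earlier
 * store to the futex word is visible before it checks for waiters.
 */
static inline void futex_get_mm(union futex_key *key)
{
	atomic_inc(&key->private.mm->mm_count);
	/*
	 * Full barrier (B), paired with MB (A) taken on the wait side.
	 */
	smp_mb__after_atomic_inc();
}
```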
On Thu, 2014-03-20 at 10:42 -0700, Linus Torvalds wrote:
> On Thu, Mar 20, 2014 at 10:18 AM, Davidlohr Bueso <davidlohr@hp.com> wrote:
> >> It strikes me that the "spin_is_locked()" test has no barriers wrt the
> >> writing of the new futex value on the wake path. And the read barrier
> >> obviously does nothing wrt the write either. Or am I missing
> >> something? So the write that actually released the futex might be
> >> almost arbitrarily delayed on the waking side. So the waiting side may
> >> not see the new value, even though the waker assumes it does due to
> >> the ordering of it doing the write first.
> >
> > Aha, that would certainly violate the ordering guarantees. I feared
> > _something_ like that when we originally discussed your suggestion as
> > opposed to the atomics one, but didn't have any case for it either.
>
> Actually, looking closer, we have the memory barrier in
> get_futex_key_refs() (called by "get_futex_key()") so that's not it.
> In fact, your "atomic_read(&hb->waiters)" doesn't have any more
> serialization than the spin_is_locked() test had.
>
> But the spin_is_locked() and queue-empty tests are two separate memory
> reads, and maybe there is some ordering wrt those two that we missed,
> so the "waiters" patch is worth trying anyway.

Well, imho we would have seen something wrong much much earlier. This
patch has been very heavily tested (including with the java workload
used by Srikar).

I still wonder about ppc and spinlocks (no ticketing!!) ... sure the
"waiters" patch might fix the problem just because we explicitly count
the members of the plist. And I guess if we cannot rely on all archs
having an equivalent spinlock implementation, we simply cannot use this
technique for futexes.

Thanks,
Davidlohr
On Thu, Mar 20, 2014 at 11:03 AM, Davidlohr Bueso <davidlohr@hp.com> wrote:
>
> I still wonder about ppc and spinlocks (no ticketing!!) ... sure the
> "waiters" patch might fix the problem just because we explicitly count
> the members of the plist. And I guess if we cannot rely on all archs
> having an equivalent spinlock implementation, we simply cannot use this
> technique for futexes.

So the ticketing part means that on x86 we see pending waiters even
when a previous one does "spin_unlock()". I agree that that is a
fundamental difference between x86 and powerpc, and it does seem to be
the most likely culprit.

And dammit, I *liked* my "don't use an explicit waiter count" approach,
so I'd love to be able to do it. But we've never really guaranteed that
"spin_is_locked()" shows whether there are spinners, so I don't know
how to do that. I guess we could expose some interface for the spinlock
code to say whether it supports that or not, and then switch between
the two algorithms. But that just feels very very ugly to me.

But let's see if the explicit waiter count version even solves the
thing on powerpc. Maybe it's something else, and we'll have to revert
entirely for now.

Linus
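(To picture the difference Linus describes: a ticket lock keeps looking "locked" while spinners hold unserved tickets, whereas a plain test-and-set lock looks free the instant the owner releases it, no matter how many CPUs are still spinning for it. A hypothetical, simplified user-space sketch, not the actual arch code:)

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Ticket lock (x86-style): the lock word encodes waiters, not just the owner. */
struct ticket_lock {
	atomic_uint head;	/* ticket currently being served */
	atomic_uint tail;	/* next ticket to be handed out  */
};

static bool ticket_is_locked(struct ticket_lock *l)
{
	/*
	 * head != tail as long as *any* ticket is outstanding, so a CPU
	 * spinning for the lock keeps this true even across an
	 * intervening unlock by the previous owner.
	 */
	return atomic_load(&l->head) != atomic_load(&l->tail);
}

/* Simple test-and-set lock: only the current owner is visible. */
struct tas_lock {
	atomic_bool held;
};

static bool tas_is_locked(struct tas_lock *l)
{
	/*
	 * Goes false the moment the owner stores 0, regardless of how
	 * many CPUs are still spinning to acquire the lock -- so
	 * "is the lock held?" says nothing about pending waiters here.
	 */
	return atomic_load(&l->held);
}
```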
On Thu, Mar 20, 2014 at 10:18 AM, Davidlohr Bueso <davidlohr@hp.com> wrote:
>
> Comparing with the patch I sent earlier this morning, looks equivalent,
> and fwiw, passes my initial qemu bootup, which is the first way of
> detecting anything stupid going on.
>
> So, Srikar, please try this patch out, as opposed to mine, you don't
> have to first revert the commit in question.

Ok, so it boots for me too, so hopefully it isn't totally broken.

However, since it's just closing a race, and since getting the counts
wrong should easily result in it *working* but always taking the slow
path (for example), I'd really like people to also verify that it fixes
the actual performance issue (ie assuming it fixes powerpc behavior for
Srikar, I'd like to get it double-checked that it also avoids the
spinlock in the common case).

Because if the increment/decrement pairings end up being wrong, we
could have a situation where the waiter count just ends up bogus, and
it all works from a correctness standpoint but not from the intended
performance optimization.

No way I can test that sanely on my single-socket machine. Davidlohr?

Linus
On Thu, 2014-03-20 at 11:36 -0700, Linus Torvalds wrote:
> On Thu, Mar 20, 2014 at 10:18 AM, Davidlohr Bueso <davidlohr@hp.com> wrote:
> >
> > Comparing with the patch I sent earlier this morning, looks equivalent,
> > and fwiw, passes my initial qemu bootup, which is the first way of
> > detecting anything stupid going on.
> >
> > So, Srikar, please try this patch out, as opposed to mine, you don't
> > have to first revert the commit in question.
>
> Ok, so it boots for me too, so hopefully it isn't totally broken.
>
> However, since it's just closing a race, and since getting the counts
> wrong should easily result in it *working* but always taking the slow
> path (for example), I'd really like people to also verify that it fixes
> the actual performance issue (ie assuming it fixes powerpc behavior for
> Srikar, I'd like to get it double-checked that it also avoids the
> spinlock in the common case).

Oh, it does. This atomics technique was tested at a customer's site and
ready for upstream.

To refresh, we were originally seeing massive contention on the
hb->lock and an enormous amount of 0 returns from futex_wake,
indicating that spinners were piling up just to realize that the plist
was empty! While I don't have any official numbers, I can confirm that
perf showed that this issue was addressed with the atomics variant.
Yes, such pathological behavior shows problems in the userspace locking
primitives design/implementation, but allowing the kernel not to be
affected by suboptimal uses of futexes is definitely a plus.

As tglx suggested at the time, I also made sure that adding the
barriers when doing the key refcounting didn't impose any serious
restrictions to performance either.

Now, what required re-testing everything at the time was when you
suggested replacing this approach with the more elegant spin-is-locked
test. Both approaches showed pretty much identical performance (and
correctness, at least on x86), and to this day it shows *significantly*
less time spent in kernel space dealing with futexes.

> Because if the increment/decrement pairings end up being wrong, we
> could have a situation where the waiter count just ends up bogus, and
> it all works from a correctness standpoint but not from the intended
> performance optimization.
>
> No way I can test that sanely on my single-socket machine. Davidlohr?

Not this patch, no :( -- we could never blindly reproduce the
customer's workload. The only patch that I was able to create test
cases for is the larger hash table one, which simply alleviates
collisions. This is now part of perf-bench.

Thanks,
Davidlohr
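(To picture the pathology being described: before the waiter check existed, every FUTEX_WAKE call took the bucket lock just to discover there was nobody to wake. A heavily condensed, hypothetical sketch of that wake path; key matching and error handling are omitted, and the function name is illustrative, not the real futex.c entry point:)

```c
/* Hypothetical condensation of the old wake path, for illustration only. */
static int futex_wake_sketch(struct futex_hash_bucket *hb, int nr_wake)
{
	struct futex_q *this, *next;
	int ret = 0;

	spin_lock(&hb->lock);		/* every waker serializes here...   */

	plist_for_each_entry_safe(this, next, &hb->chain, list) {
		wake_futex(this);	/* ...even when the plist is empty  */
		if (++ret >= nr_wake)
			break;
	}

	spin_unlock(&hb->lock);
	return ret;			/* 0 == "nobody woken", the symptom */
}
```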
On Thu, Mar 20, 2014 at 12:08 PM, Davidlohr Bueso <davidlohr@hp.com> wrote:
>
> Oh, it does. This atomics technique was tested at a customer's site and
> ready for upstream.

I'm not worried about the *original* patch. I'm worried about the
incremental one.

Your original patch never applied to my tree - I think it was based on
-mm or something. So I couldn't verify my "let's go back to the
explicit 'waiters'" incremental patch against reverting and re-applying
the original patch.

So I'd like you to re-verify that that incremental patch really is
solid, and does what your original one did.

Linus
On Thu, 2014-03-20 at 12:25 -0700, Linus Torvalds wrote:
> On Thu, Mar 20, 2014 at 12:08 PM, Davidlohr Bueso <davidlohr@hp.com> wrote:
> >
> > Oh, it does. This atomics technique was tested at a customer's site and
> > ready for upstream.
>
> I'm not worried about the *original* patch. I'm worried about the
> incremental one.
>
> Your original patch never applied to my tree - I think it was based on
> -mm or something. So I couldn't verify my "let's go back to the
> explicit 'waiters'" incremental patch against reverting and
> re-applying the original patch.

Ok, so a big reason why this patch doesn't apply cleanly after
reverting is because *most* of the changes were done at the top of the
file with regards to documenting the ordering guarantees, the actual
code changes are quite minimal.

I reverted commits 99b60ce6 (documentation) and b0c29f79 (the offending
commit), and then I cleanly applied the equivalent ones from v3 of the
series (which was already *tested* and ready for upstream until you
suggested looking into the alternative spinlock approach):

https://lkml.org/lkml/2013/12/19/624
https://lkml.org/lkml/2013/12/19/630

Assuming the atomics solves the issue, would you be willing to take
this path? Any pending documentation fixes can be added afterwards. The
important thing is that the actual code is well tested.

Thanks,
Davidlohr
On Thu, Mar 20, 2014 at 1:20 PM, Davidlohr Bueso <davidlohr@hp.com> wrote:
>
> I reverted commits 99b60ce6 (documentation) and b0c29f79 (the offending
> commit), and then I cleanly applied the equivalent ones from v3 of the
> series (which was already *tested* and ready for upstream until you
> suggested looking into the alternative spinlock approach):
>
> https://lkml.org/lkml/2013/12/19/624
> https://lkml.org/lkml/2013/12/19/630
>
> Assuming the atomics solves the issue, would you be willing to take
> this path? Any pending documentation fixes can be added afterwards. The
> important thing is that the actual code is well tested.

So my preference would be to do that "tested code" thing, but then edit
out the comment changes and boil it down to just the minimal code
changes. So that you can see what the patch actually *does*, without it
being hidden by the bulk of the patch just being the reverts of the
comment fixups.

In fact, I hope that if you do that, you end up with the patch I just
created by hand, and then we'd have come to the same situation two
different ways independently, and I'd be doubly happy for that extra
cross-checking of what went on.

And I would *not* want to do this as "two reverts and one patch to
re-do things like we used to", because that just makes the actual
change even harder to see. And that's partly because if we eventually
do decide that "hey, if we can do this using the ticket lock as a
counter, let's do it that way", then this *small* fixup patch ends up
showing the actual real differences between the two approaches.

Of course, right now we don't even have confirmation from Srikar that
the explicit "waiters" counter even fixes things on powerpc, so.. All
the testing that original patch had was also on x86, so if it's some
subtle memory ordering issue that we haven't figured out now, rather
than the ticket lock thing, all this discussion about which way to go
turns out to be entirely premature.

Linus
> Ok, so a big reason why this patch doesn't apply cleanly after
> reverting is because *most* of the changes were done at the top of the
> file with regards to documenting the ordering guarantees, the actual
> code changes are quite minimal.
>
> I reverted commits 99b60ce6 (documentation) and b0c29f79 (the offending
> commit), and then I cleanly applied the equivalent ones from v3 of the
> series (which was already *tested* and ready for upstream until you
> suggested looking into the alternative spinlock approach):
>
> https://lkml.org/lkml/2013/12/19/624
> https://lkml.org/lkml/2013/12/19/630

I reverted commits 99b60ce6 and b0c29f79. Then applied the patches in
the above URLs. The last one had a reject but it was pretty
straightforward to resolve. After this, specjbb completes.

So reverting and applying v3 3/4 and 4/4 patches works for me.
On Thu, Mar 20, 2014 at 9:55 PM, Srikar Dronamraju
<srikar@linux.vnet.ibm.com> wrote:
>
> I reverted commits 99b60ce6 and b0c29f79. Then applied the patches in
> the above URLs. The last one had a reject but it was pretty
> straightforward to resolve. After this, specjbb completes.
>
> So reverting and applying v3 3/4 and 4/4 patches works for me.

Ok, I verified that the above ends up resulting in the same tree as the
minimal patch I sent out, modulo (a) some comments and (b) an #ifdef
CONFIG_SMP in futex_get_mm() that doesn't really matter.

So I committed the minimal patch with your tested-by.

Linus
> > So reverting and applying v3 3/4 and 4/4 patches works for me.
>
> Ok, I verified that the above ends up resulting in the same tree as the
> minimal patch I sent out, modulo (a) some comments and (b) an #ifdef
> CONFIG_SMP in futex_get_mm() that doesn't really matter.
>
> So I committed the minimal patch with your tested-by.

Just to be sure, I have verified with the latest mainline, with HEAD at
commit 08edb33c4 ("Merge branch 'drm-fixes' of
git://people.freedesktop.org/~airlied/linux"), which also has commit
11d4616bd0 ("futex: revert back to the explicit waiter counting code").
On Sat, 2014-03-22 at 07:57 +0530, Srikar Dronamraju wrote:
> > > So reverting and applying v3 3/4 and 4/4 patches works for me.
> >
> > Ok, I verified that the above ends up resulting in the same tree as the
> > minimal patch I sent out, modulo (a) some comments and (b) an #ifdef
> > CONFIG_SMP in futex_get_mm() that doesn't really matter.
> >
> > So I committed the minimal patch with your tested-by.
>
> Just to be sure, I have verified with the latest mainline, with HEAD at
> commit 08edb33c4 ("Merge branch 'drm-fixes' of
> git://people.freedesktop.org/~airlied/linux"), which also has commit
> 11d4616bd0 ("futex: revert back to the explicit waiter counting code").

Thanks for double checking.
diff --git a/kernel/futex.c b/kernel/futex.c
index 44a1261cb9ff..08ec814ad9d2 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -234,6 +234,7 @@ static const struct futex_q futex_q_init = {
  * waiting on a futex.
  */
 struct futex_hash_bucket {
+	atomic_t waiters;
 	spinlock_t lock;
 	struct plist_head chain;
 } ____cacheline_aligned_in_smp;
@@ -253,22 +254,37 @@ static inline void futex_get_mm(union futex_key *key)
 	smp_mb__after_atomic_inc();
 }
 
-static inline bool hb_waiters_pending(struct futex_hash_bucket *hb)
+/*
+ * Reflects a new waiter being added to the waitqueue.
+ */
+static inline void hb_waiters_inc(struct futex_hash_bucket *hb)
 {
 #ifdef CONFIG_SMP
+	atomic_inc(&hb->waiters);
 	/*
-	 * Tasks trying to enter the critical region are most likely
-	 * potential waiters that will be added to the plist. Ensure
-	 * that wakers won't miss to-be-slept tasks in the window between
-	 * the wait call and the actual plist_add.
+	 * Full barrier (A), see the ordering comment above.
 	 */
-	if (spin_is_locked(&hb->lock))
-		return true;
-	smp_rmb(); /* Make sure we check the lock state first */
+	smp_mb__after_atomic_inc();
+#endif
+}
+
+/*
+ * Reflects a waiter being removed from the waitqueue by wakeup
+ * paths.
+ */
+static inline void hb_waiters_dec(struct futex_hash_bucket *hb)
+{
+#ifdef CONFIG_SMP
+	atomic_dec(&hb->waiters);
+#endif
+}
 
-	return !plist_head_empty(&hb->chain);
+static inline int hb_waiters_pending(struct futex_hash_bucket *hb)
+{
+#ifdef CONFIG_SMP
+	return atomic_read(&hb->waiters);
 #else
-	return true;
+	return 1;
 #endif
 }
 
@@ -954,6 +970,7 @@ static void __unqueue_futex(struct futex_q *q)
 
 	hb = container_of(q->lock_ptr, struct futex_hash_bucket, lock);
 	plist_del(&q->list, &hb->chain);
+	hb_waiters_dec(hb);
 }
 
 /*
@@ -1257,7 +1274,9 @@ void requeue_futex(struct futex_q *q, struct futex_hash_bucket *hb1,
 	 */
 	if (likely(&hb1->chain != &hb2->chain)) {
 		plist_del(&q->list, &hb1->chain);
+		hb_waiters_dec(hb1);
 		plist_add(&q->list, &hb2->chain);
+		hb_waiters_inc(hb2);
 		q->lock_ptr = &hb2->lock;
 	}
 	get_futex_key_refs(key2);
@@ -1600,6 +1619,17 @@ static inline struct futex_hash_bucket *queue_lock(struct futex_q *q)
 	struct futex_hash_bucket *hb;
 
 	hb = hash_futex(&q->key);
+
+	/*
+	 * Increment the counter before taking the lock so that
+	 * a potential waker won't miss a to-be-slept task that is
+	 * waiting for the spinlock. This is safe as all queue_lock()
+	 * users end up calling queue_me(). Similarly, for housekeeping,
+	 * decrement the counter at queue_unlock() when some error has
+	 * occurred and we don't end up adding the task to the list.
+	 */
+	hb_waiters_inc(hb);
+
 	q->lock_ptr = &hb->lock;
 
 	spin_lock(&hb->lock); /* implies MB (A) */
@@ -1611,6 +1641,7 @@ queue_unlock(struct futex_hash_bucket *hb)
 	__releases(&hb->lock)
 {
 	spin_unlock(&hb->lock);
+	hb_waiters_dec(hb);
 }
 
 /**
@@ -2342,6 +2373,7 @@ int handle_early_requeue_pi_wakeup(struct futex_hash_bucket *hb,
 	 * Unqueue the futex_q and determine which it was.
 	 */
 	plist_del(&q->list, &hb->chain);
+	hb_waiters_dec(hb);
 
 	/* Handle spurious wakeups gracefully */
 	ret = -EWOULDBLOCK;
@@ -2875,6 +2907,7 @@ static int __init futex_init(void)
 		futex_cmpxchg_enabled = 1;
 
 	for (i = 0; i < futex_hashsize; i++) {
+		atomic_set(&futex_queues[i].waiters, 0);
 		plist_head_init(&futex_queues[i].chain);
 		spin_lock_init(&futex_queues[i].lock);
 	}
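(Not shown in the interdiff is the wake-side consumer of hb_waiters_pending(); roughly, futex_wake() in that kernel checks it before ever touching the bucket lock. A condensed sketch of the relevant lines from the futex.c of that era:)

```c
	hb = hash_futex(&key);

	/* Make sure we really have tasks to wakeup */
	if (!hb_waiters_pending(hb))
		goto out_put_key;

	spin_lock(&hb->lock);
```

With the explicit counter, a waker whose bucket reports zero waiters skips hb->lock entirely, which is the fast path the thread is trying to preserve while fixing the race visible on powerpc.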