Message ID:   1580841629-7102-1-git-send-email-cai@lca.pw
State:        Accepted
Delegated to: David Miller
Series:       [v3] skbuff: fix a data race in skb_queue_len()
From: Qian Cai <cai@lca.pw>
Date: Tue, 4 Feb 2020 13:40:29 -0500

> sk_buff.qlen can be accessed concurrently as noticed by KCSAN,
>
>  BUG: KCSAN: data-race in __skb_try_recv_from_queue / unix_dgram_sendmsg
>
>  read to 0xffff8a1b1d8a81c0 of 4 bytes by task 5371 on cpu 96:
>   unix_dgram_sendmsg+0x9a9/0xb70 include/linux/skbuff.h:1821
>   net/unix/af_unix.c:1761
>   ____sys_sendmsg+0x33e/0x370
>   ___sys_sendmsg+0xa6/0xf0
>   __sys_sendmsg+0x69/0xf0
>   __x64_sys_sendmsg+0x51/0x70
>   do_syscall_64+0x91/0xb47
>   entry_SYSCALL_64_after_hwframe+0x49/0xbe
>
>  write to 0xffff8a1b1d8a81c0 of 4 bytes by task 1 on cpu 99:
>   __skb_try_recv_from_queue+0x327/0x410 include/linux/skbuff.h:2029
>   __skb_try_recv_datagram+0xbe/0x220
>   unix_dgram_recvmsg+0xee/0x850
>   ____sys_recvmsg+0x1fb/0x210
>   ___sys_recvmsg+0xa2/0xf0
>   __sys_recvmsg+0x66/0xf0
>   __x64_sys_recvmsg+0x51/0x70
>   do_syscall_64+0x91/0xb47
>   entry_SYSCALL_64_after_hwframe+0x49/0xbe
>
> Since only the read is operating as lockless, it could introduce a logic
> bug in unix_recvq_full() due to the load tearing. Fix it by adding
> a lockless variant of skb_queue_len() and unix_recvq_full() where
> READ_ONCE() is on the read while WRITE_ONCE() is on the write similar to
> the commit d7d16a89350a ("net: add skb_queue_empty_lockless()").
>
> Signed-off-by: Qian Cai <cai@lca.pw>

Applied, thank you.
Hi Eric,

On Tue, Feb 04, 2020 at 01:40:29PM -0500, Qian Cai wrote:
> -	list->qlen--;
> +	WRITE_ONCE(list->qlen, list->qlen - 1);

Sorry I'm a bit late to the party here, but this immediately jumped out.
This generates worse code with a bigger race in some sense:

list->qlen-- is:

   0:	83 6f 10 01          	subl   $0x1,0x10(%rdi)

whereas WRITE_ONCE(list->qlen, list->qlen - 1) is:

   0:	8b 47 10             	mov    0x10(%rdi),%eax
   3:	83 e8 01             	sub    $0x1,%eax
   6:	89 47 10             	mov    %eax,0x10(%rdi)

Are you sure that's what we want?

Jason
On 2/6/20 8:38 AM, Jason A. Donenfeld wrote:
> Hi Eric,
>
> On Tue, Feb 04, 2020 at 01:40:29PM -0500, Qian Cai wrote:
>> -	list->qlen--;
>> +	WRITE_ONCE(list->qlen, list->qlen - 1);
>
> Sorry I'm a bit late to the party here, but this immediately jumped out.
> This generates worse code with a bigger race in some sense:
>
> list->qlen-- is:
>
>    0:	83 6f 10 01          	subl   $0x1,0x10(%rdi)
>
> whereas WRITE_ONCE(list->qlen, list->qlen - 1) is:
>
>    0:	8b 47 10             	mov    0x10(%rdi),%eax
>    3:	83 e8 01             	sub    $0x1,%eax
>    6:	89 47 10             	mov    %eax,0x10(%rdi)
>
> Are you sure that's what we want?
>
> Jason

Unfortunately we do not have ADD_ONCE() or something like that.

Sure, on x86 we could get much better code generation.

If we agree a READ_ONCE() was needed at the read side,
then a WRITE_ONCE() is needed as well on write sides.

If we believe load-tearing and/or write-tearing must not ever happen,
then we must document this.
On Thu, Feb 6, 2020 at 6:10 PM Eric Dumazet <eric.dumazet@gmail.com> wrote:
> Unfortunately we do not have ADD_ONCE() or something like that.

I guess normally this is called "atomic_add", unless you're thinking
instead about something like this, which generates the same
inefficient code as WRITE_ONCE:

#define ADD_ONCE(d, s) *(volatile typeof(d) *)&(d) += (s)
On 2/6/20 10:12 AM, Jason A. Donenfeld wrote:
> On Thu, Feb 6, 2020 at 6:10 PM Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> Unfortunately we do not have ADD_ONCE() or something like that.
>
> I guess normally this is called "atomic_add", unless you're thinking
> instead about something like this, which generates the same
> inefficient code as WRITE_ONCE:
>
> #define ADD_ONCE(d, s) *(volatile typeof(d) *)&(d) += (s)

Dmitry Vyukov had a nice suggestion few months back how to implement this.

https://lkml.org/lkml/2019/10/5/6
On Thu, Feb 06, 2020 at 10:22:02AM -0800, Eric Dumazet wrote:
> On 2/6/20 10:12 AM, Jason A. Donenfeld wrote:
> > On Thu, Feb 6, 2020 at 6:10 PM Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > >> Unfortunately we do not have ADD_ONCE() or something like that.
> >
> > I guess normally this is called "atomic_add", unless you're thinking
> > instead about something like this, which generates the same
> > inefficient code as WRITE_ONCE:
> >
> > #define ADD_ONCE(d, s) *(volatile typeof(d) *)&(d) += (s)
>
> Dmitry Vyukov had a nice suggestion few months back how to implement this.
>
> https://lkml.org/lkml/2019/10/5/6

That trick appears to work well in clang but not gcc:

#define ADD_ONCE(d, i) ({ \
	typeof(d) *__p = &(d); \
	__atomic_store_n(__p, (i) + __atomic_load_n(__p, __ATOMIC_RELAXED), __ATOMIC_RELAXED); \
})

gcc 9.2 gives:

   0:	8b 47 10             	mov    0x10(%rdi),%eax
   3:	83 e8 01             	sub    $0x1,%eax
   6:	89 47 10             	mov    %eax,0x10(%rdi)

clang 9.0.1 gives:

   0:	81 47 10 ff ff ff ff 	addl   $0xffffffff,0x10(%rdi)

But actually, clang does equally as well with:

#define ADD_ONCE(d, i) *(volatile typeof(d) *)&(d) += (i)

And testing further back, it generates the same code with your original
WRITE_ONCE.

If clang's optimization here is technically correct, maybe we should go
talk to the gcc people about catching this case?
On Thu, 6 Feb 2020 at 19:43, Jason A. Donenfeld <Jason@zx2c4.com> wrote:
>
> On Thu, Feb 06, 2020 at 10:22:02AM -0800, Eric Dumazet wrote:
> > On 2/6/20 10:12 AM, Jason A. Donenfeld wrote:
> > > On Thu, Feb 6, 2020 at 6:10 PM Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > >> Unfortunately we do not have ADD_ONCE() or something like that.
> > >
> > > I guess normally this is called "atomic_add", unless you're thinking
> > > instead about something like this, which generates the same
> > > inefficient code as WRITE_ONCE:
> > >
> > > #define ADD_ONCE(d, s) *(volatile typeof(d) *)&(d) += (s)
> >
> > Dmitry Vyukov had a nice suggestion few months back how to implement this.
> >
> > https://lkml.org/lkml/2019/10/5/6
>
> That trick appears to work well in clang but not gcc:
>
> #define ADD_ONCE(d, i) ({ \
> 	typeof(d) *__p = &(d); \
> 	__atomic_store_n(__p, (i) + __atomic_load_n(__p, __ATOMIC_RELAXED), __ATOMIC_RELAXED); \
> })
>
> gcc 9.2 gives:
>
>    0:	8b 47 10             	mov    0x10(%rdi),%eax
>    3:	83 e8 01             	sub    $0x1,%eax
>    6:	89 47 10             	mov    %eax,0x10(%rdi)
>
> clang 9.0.1 gives:
>
>    0:	81 47 10 ff ff ff ff 	addl   $0xffffffff,0x10(%rdi)
>
> But actually, clang does equally as well with:
>
> #define ADD_ONCE(d, i) *(volatile typeof(d) *)&(d) += (i)

I feel that ADD_ONCE conveys that it adds actually once (atomically),
that is, if there are concurrent ADD_ONCE, all of them will succeed.
This is not the case with the above variants and the 'ONCE' can turn
into a 'MAYBE', and since we probably want to avoid making this more
expensive on e.g. x86 that would need a LOCK-prefix.

In the case here, what we actually want is something that safely
increments/decrements if there are only concurrent readers (concurrent
writers disallowed). So 'add_exclusive(var, val)' (all-caps or not)
might be more appropriate.

[As an aside, recent changes to KCSAN would also allow us to assert for
something like 'add_exclusive()' that there are in fact no other writers
but only concurrent readers, even if all accesses are marked.]

If the single-writer constraint isn't wanted, but should still not be
atomic, maybe 'add_lossy()'?

Thanks,
-- Marco

> And testing further back, it generates the same code with your original
> WRITE_ONCE.
>
> If clang's optimization here is technically correct, maybe we should go
> talk to the gcc people about catching this case?
Useful playground: https://godbolt.org/z/i7JFRW
On Thu, 6 Feb 2020 at 18:10, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
> On 2/6/20 8:38 AM, Jason A. Donenfeld wrote:
> > Hi Eric,
> >
> > On Tue, Feb 04, 2020 at 01:40:29PM -0500, Qian Cai wrote:
> >> -	list->qlen--;
> >> +	WRITE_ONCE(list->qlen, list->qlen - 1);
> >
> > Sorry I'm a bit late to the party here, but this immediately jumped out.
> > This generates worse code with a bigger race in some sense:
> >
> > list->qlen-- is:
> >
> >    0:	83 6f 10 01          	subl   $0x1,0x10(%rdi)
> >
> > whereas WRITE_ONCE(list->qlen, list->qlen - 1) is:
> >
> >    0:	8b 47 10             	mov    0x10(%rdi),%eax
> >    3:	83 e8 01             	sub    $0x1,%eax
> >    6:	89 47 10             	mov    %eax,0x10(%rdi)
> >
> > Are you sure that's what we want?
> >
> > Jason
>
> Unfortunately we do not have ADD_ONCE() or something like that.
>
> Sure, on x86 we could get much better code generation.
>
> If we agree a READ_ONCE() was needed at the read side,
> then a WRITE_ONCE() is needed as well on write sides.
>
> If we believe load-tearing and/or write-tearing must not ever happen,
> then we must document this.

Just FYI, forgot to mention:

Recent KCSAN by default will forgive unannotated aligned writes up to
word-size, making the assumptions these are safe. This would include
things like 'var++' if there is only a single writer. This was added
because of kernel-wide preferences we were told about. Since I cannot
verify if this assumption is always correct, I would still prefer to
mark writes if at all possible. In the end it's up to maintainers.

Thanks,
-- Marco
Jason A. Donenfeld <Jason@zx2c4.com> wrote:
> Hi Eric,
>
> On Tue, Feb 04, 2020 at 01:40:29PM -0500, Qian Cai wrote:
>> -	list->qlen--;
>> +	WRITE_ONCE(list->qlen, list->qlen - 1);
>
> Sorry I'm a bit late to the party here, but this immediately jumped out.
> This generates worse code with a bigger race in some sense:
>
> list->qlen-- is:
>
>    0:	83 6f 10 01          	subl   $0x1,0x10(%rdi)
>
> whereas WRITE_ONCE(list->qlen, list->qlen - 1) is:
>
>    0:	8b 47 10             	mov    0x10(%rdi),%eax
>    3:	83 e8 01             	sub    $0x1,%eax
>    6:	89 47 10             	mov    %eax,0x10(%rdi)
>
> Are you sure that's what we want?

Fixing these KCSAN warnings is actively making the kernel worse.

Why are we still doing this?

Thanks,
On 2/17/20, Herbert Xu <herbert@gondor.apana.org.au> wrote:
> Jason A. Donenfeld <Jason@zx2c4.com> wrote:
>> Hi Eric,
>>
>> On Tue, Feb 04, 2020 at 01:40:29PM -0500, Qian Cai wrote:
>>> -	list->qlen--;
>>> +	WRITE_ONCE(list->qlen, list->qlen - 1);
>>
>> Sorry I'm a bit late to the party here, but this immediately jumped out.
>> This generates worse code with a bigger race in some sense:
>>
>> list->qlen-- is:
>>
>>    0:	83 6f 10 01          	subl   $0x1,0x10(%rdi)
>>
>> whereas WRITE_ONCE(list->qlen, list->qlen - 1) is:
>>
>>    0:	8b 47 10             	mov    0x10(%rdi),%eax
>>    3:	83 e8 01             	sub    $0x1,%eax
>>    6:	89 47 10             	mov    %eax,0x10(%rdi)
>>
>> Are you sure that's what we want?
>
> Fixing these KCSAN warnings is actively making the kernel worse.
>
> Why are we still doing this?

Not necessarily a big fan of this either, but just for the record here
in case it helps, while you might complain about instruction size
blowing up a bit, cycle-wise these wind up being about the same anyway.
On x86, one instruction != one cycle.
On Mon, Feb 17, 2020 at 08:39:45AM +0100, Jason A. Donenfeld wrote:
>
> Not necessarily a big fan of this either, but just for the record here
> in case it helps, while you might complain about instruction size
> blowing up a bit, cycle-wise these wind up being about the same
> anyway. On x86, one instruction != one cycle.

I don't care about that.  My problem is with the mindless patches
that started this thread.  Look at the patch:

commit 86b18aaa2b5b5bb48e609cd591b3d2d0fdbe0442
Author: Qian Cai <cai@lca.pw>
Date:   Tue Feb 4 13:40:29 2020 -0500

    skbuff: fix a data race in skb_queue_len()

It's utter garbage.  Why on earth did it change that one instance of
unix_recvq_full?  In fact you can see how stupid it is because right
after the call that got changed we again call into unix_recvq_full
which surely would trigger the same warning.

So far the vast majority of the KCSAN patches that have caught my
attention have been of this mindless kind that does not add any value
to the kernel.  If anything they could be hiding real bugs that would
now be harder to find.

Cheers,
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 3d13a4b717e9..ca8806b69388 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1822,6 +1822,18 @@ static inline __u32 skb_queue_len(const struct sk_buff_head *list_)
 }
 
 /**
+ * skb_queue_len_lockless	- get queue length
+ * @list_: list to measure
+ *
+ * Return the length of an &sk_buff queue.
+ * This variant can be used in lockless contexts.
+ */
+static inline __u32 skb_queue_len_lockless(const struct sk_buff_head *list_)
+{
+	return READ_ONCE(list_->qlen);
+}
+
+/**
  * __skb_queue_head_init - initialize non-spinlock portions of sk_buff_head
  * @list: queue to initialize
  *
@@ -2026,7 +2038,7 @@ static inline void __skb_unlink(struct sk_buff *skb, struct sk_buff_head *list)
 {
 	struct sk_buff *next, *prev;
 
-	list->qlen--;
+	WRITE_ONCE(list->qlen, list->qlen - 1);
 	next	   = skb->next;
 	prev	   = skb->prev;
 	skb->next  = skb->prev = NULL;
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 321af97c7bbe..62c12cb5763e 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -189,11 +189,17 @@ static inline int unix_may_send(struct sock *sk, struct sock *osk)
 	return unix_peer(osk) == NULL || unix_our_peer(sk, osk);
 }
 
-static inline int unix_recvq_full(struct sock const *sk)
+static inline int unix_recvq_full(const struct sock *sk)
 {
 	return skb_queue_len(&sk->sk_receive_queue) > sk->sk_max_ack_backlog;
 }
 
+static inline int unix_recvq_full_lockless(const struct sock *sk)
+{
+	return skb_queue_len_lockless(&sk->sk_receive_queue) >
+		READ_ONCE(sk->sk_max_ack_backlog);
+}
+
 struct sock *unix_peer_get(struct sock *s)
 {
 	struct sock *peer;
@@ -1758,7 +1764,8 @@ static int unix_dgram_sendmsg(struct socket *sock, struct msghdr *msg,
 	 *	   - unix_peer(sk) == sk by time of get but disconnected before lock
 	 */
 	if (other != sk &&
-	    unlikely(unix_peer(other) != sk && unix_recvq_full(other))) {
+	    unlikely(unix_peer(other) != sk &&
+		     unix_recvq_full_lockless(other))) {
 		if (timeo) {
 			timeo = unix_wait_for_peer(other, timeo);
sk_buff.qlen can be accessed concurrently as noticed by KCSAN,

 BUG: KCSAN: data-race in __skb_try_recv_from_queue / unix_dgram_sendmsg

 read to 0xffff8a1b1d8a81c0 of 4 bytes by task 5371 on cpu 96:
  unix_dgram_sendmsg+0x9a9/0xb70 include/linux/skbuff.h:1821
  net/unix/af_unix.c:1761
  ____sys_sendmsg+0x33e/0x370
  ___sys_sendmsg+0xa6/0xf0
  __sys_sendmsg+0x69/0xf0
  __x64_sys_sendmsg+0x51/0x70
  do_syscall_64+0x91/0xb47
  entry_SYSCALL_64_after_hwframe+0x49/0xbe

 write to 0xffff8a1b1d8a81c0 of 4 bytes by task 1 on cpu 99:
  __skb_try_recv_from_queue+0x327/0x410 include/linux/skbuff.h:2029
  __skb_try_recv_datagram+0xbe/0x220
  unix_dgram_recvmsg+0xee/0x850
  ____sys_recvmsg+0x1fb/0x210
  ___sys_recvmsg+0xa2/0xf0
  __sys_recvmsg+0x66/0xf0
  __x64_sys_recvmsg+0x51/0x70
  do_syscall_64+0x91/0xb47
  entry_SYSCALL_64_after_hwframe+0x49/0xbe

Since only the read is operating as lockless, it could introduce a logic
bug in unix_recvq_full() due to the load tearing. Fix it by adding
a lockless variant of skb_queue_len() and unix_recvq_full() where
READ_ONCE() is on the read while WRITE_ONCE() is on the write similar to
the commit d7d16a89350a ("net: add skb_queue_empty_lockless()").

Signed-off-by: Qian Cai <cai@lca.pw>
---
v3: fix minor issues thanks to Eric.
v2: add lockless variant helpers and WRITE_ONCE().

 include/linux/skbuff.h | 14 +++++++++++++-
 net/unix/af_unix.c     | 11 +++++++++--
 2 files changed, 22 insertions(+), 3 deletions(-)