
[v3] skbuff: fix a data race in skb_queue_len()

Message ID 1580841629-7102-1-git-send-email-cai@lca.pw
State Accepted
Delegated to: David Miller
Series [v3] skbuff: fix a data race in skb_queue_len()

Commit Message

Qian Cai Feb. 4, 2020, 6:40 p.m. UTC
sk_buff.qlen can be accessed concurrently as noticed by KCSAN,

 BUG: KCSAN: data-race in __skb_try_recv_from_queue / unix_dgram_sendmsg

 read to 0xffff8a1b1d8a81c0 of 4 bytes by task 5371 on cpu 96:
  unix_dgram_sendmsg+0x9a9/0xb70 include/linux/skbuff.h:1821
				 net/unix/af_unix.c:1761
  ____sys_sendmsg+0x33e/0x370
  ___sys_sendmsg+0xa6/0xf0
  __sys_sendmsg+0x69/0xf0
  __x64_sys_sendmsg+0x51/0x70
  do_syscall_64+0x91/0xb47
  entry_SYSCALL_64_after_hwframe+0x49/0xbe

 write to 0xffff8a1b1d8a81c0 of 4 bytes by task 1 on cpu 99:
  __skb_try_recv_from_queue+0x327/0x410 include/linux/skbuff.h:2029
  __skb_try_recv_datagram+0xbe/0x220
  unix_dgram_recvmsg+0xee/0x850
  ____sys_recvmsg+0x1fb/0x210
  ___sys_recvmsg+0xa2/0xf0
  __sys_recvmsg+0x66/0xf0
  __x64_sys_recvmsg+0x51/0x70
  do_syscall_64+0x91/0xb47
  entry_SYSCALL_64_after_hwframe+0x49/0xbe

Since only the read is lockless, load tearing could introduce a logic
bug in unix_recvq_full(). Fix it by adding lockless variants of
skb_queue_len() and unix_recvq_full(), where READ_ONCE() is used on the
read side and WRITE_ONCE() on the write side, similar to commit
d7d16a89350a ("net: add skb_queue_empty_lockless()").

Signed-off-by: Qian Cai <cai@lca.pw>
---

v3: fix minor issues thanks to Eric.
v2: add lockless variant helpers and WRITE_ONCE().

 include/linux/skbuff.h | 14 +++++++++++++-
 net/unix/af_unix.c     | 11 +++++++++--
 2 files changed, 22 insertions(+), 3 deletions(-)
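
For readers less familiar with the pattern, a condensed sketch of the
pairing the patch introduces (distilled from the full diff at the bottom
of this page; the comments are added here purely for explanation):

	/* Reader side: may run without sk_receive_queue.lock held, so the
	 * load of qlen must not be torn by the compiler.
	 */
	static inline __u32 skb_queue_len_lockless(const struct sk_buff_head *list_)
	{
		return READ_ONCE(list_->qlen);
	}

	/* Writer side (__skb_unlink): runs under the queue lock, but still
	 * uses WRITE_ONCE() so lockless readers never observe a torn store.
	 */
	WRITE_ONCE(list->qlen, list->qlen - 1);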

Comments

David Miller Feb. 6, 2020, 12:59 p.m. UTC | #1
From: Qian Cai <cai@lca.pw>
Date: Tue,  4 Feb 2020 13:40:29 -0500

> sk_buff.qlen can be accessed concurrently as noticed by KCSAN,
> 
>  BUG: KCSAN: data-race in __skb_try_recv_from_queue / unix_dgram_sendmsg
> 
>  read to 0xffff8a1b1d8a81c0 of 4 bytes by task 5371 on cpu 96:
>   unix_dgram_sendmsg+0x9a9/0xb70 include/linux/skbuff.h:1821
> 				 net/unix/af_unix.c:1761
>   ____sys_sendmsg+0x33e/0x370
>   ___sys_sendmsg+0xa6/0xf0
>   __sys_sendmsg+0x69/0xf0
>   __x64_sys_sendmsg+0x51/0x70
>   do_syscall_64+0x91/0xb47
>   entry_SYSCALL_64_after_hwframe+0x49/0xbe
> 
>  write to 0xffff8a1b1d8a81c0 of 4 bytes by task 1 on cpu 99:
>   __skb_try_recv_from_queue+0x327/0x410 include/linux/skbuff.h:2029
>   __skb_try_recv_datagram+0xbe/0x220
>   unix_dgram_recvmsg+0xee/0x850
>   ____sys_recvmsg+0x1fb/0x210
>   ___sys_recvmsg+0xa2/0xf0
>   __sys_recvmsg+0x66/0xf0
>   __x64_sys_recvmsg+0x51/0x70
>   do_syscall_64+0x91/0xb47
>   entry_SYSCALL_64_after_hwframe+0x49/0xbe
> 
> Since only the read is lockless, load tearing could introduce a logic
> bug in unix_recvq_full(). Fix it by adding lockless variants of
> skb_queue_len() and unix_recvq_full(), where READ_ONCE() is used on the
> read side and WRITE_ONCE() on the write side, similar to commit
> d7d16a89350a ("net: add skb_queue_empty_lockless()").
> 
> Signed-off-by: Qian Cai <cai@lca.pw>

Applied, thank you.
Jason A. Donenfeld Feb. 6, 2020, 4:38 p.m. UTC | #2
Hi Eric,

On Tue, Feb 04, 2020 at 01:40:29PM -0500, Qian Cai wrote:
> -	list->qlen--;
> +	WRITE_ONCE(list->qlen, list->qlen - 1);

Sorry I'm a bit late to the party here, but this immediately jumped out.
This generates worse code with a bigger race in some sense:

list->qlen-- is:

   0:   83 6f 10 01             subl   $0x1,0x10(%rdi)

whereas WRITE_ONCE(list->qlen, list->qlen - 1) is:

   0:   8b 47 10                mov    0x10(%rdi),%eax
   3:   83 e8 01                sub    $0x1,%eax
   6:   89 47 10                mov    %eax,0x10(%rdi)

Are you sure that's what we want?

Jason
Eric Dumazet Feb. 6, 2020, 5:10 p.m. UTC | #3
On 2/6/20 8:38 AM, Jason A. Donenfeld wrote:
> Hi Eric,
> 
> On Tue, Feb 04, 2020 at 01:40:29PM -0500, Qian Cai wrote:
>> -	list->qlen--;
>> +	WRITE_ONCE(list->qlen, list->qlen - 1);
> 
> Sorry I'm a bit late to the party here, but this immediately jumped out.
> This generates worse code with a bigger race in some sense:
> 
> list->qlen-- is:
> 
>    0:   83 6f 10 01             subl   $0x1,0x10(%rdi)
> 
> whereas WRITE_ONCE(list->qlen, list->qlen - 1) is:
> 
>    0:   8b 47 10                mov    0x10(%rdi),%eax
>    3:   83 e8 01                sub    $0x1,%eax
>    6:   89 47 10                mov    %eax,0x10(%rdi)
> 
> Are you sure that's what we want?
> 
> Jason
> 


Unfortunately we do not have ADD_ONCE() or something like that.

Sure, on x86 we could get much better code generation.

If we agree a READ_ONCE() was needed at the read side,
then a WRITE_ONCE() is needed as well on write sides.

If we believe load-tearing and/or write-tearing must not ever happen,
then we must document this.
Jason A. Donenfeld Feb. 6, 2020, 6:12 p.m. UTC | #4
On Thu, Feb 6, 2020 at 6:10 PM Eric Dumazet <eric.dumazet@gmail.com> wrote:
> Unfortunately we do not have ADD_ONCE() or something like that.

I guess normally this is called "atomic_add", unless you're thinking
instead about something like this, which generates the same
inefficient code as WRITE_ONCE:

#define ADD_ONCE(d, s) *(volatile typeof(d) *)&(d) += (s)
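
For concreteness, at the __skb_unlink() site from the patch that would
read something like the following (hypothetical, not from any posted
patch):

	-	WRITE_ONCE(list->qlen, list->qlen - 1);
	+	ADD_ONCE(list->qlen, -1);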
Eric Dumazet Feb. 6, 2020, 6:22 p.m. UTC | #5
On 2/6/20 10:12 AM, Jason A. Donenfeld wrote:
> On Thu, Feb 6, 2020 at 6:10 PM Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> Unfortunately we do not have ADD_ONCE() or something like that.
> 
> I guess normally this is called "atomic_add", unless you're thinking
> instead about something like this, which generates the same
> inefficient code as WRITE_ONCE:
> 
> #define ADD_ONCE(d, s) *(volatile typeof(d) *)&(d) += (s)
> 

Dmitry Vyukov had a nice suggestion a few months back on how to implement this.

https://lkml.org/lkml/2019/10/5/6
Jason A. Donenfeld Feb. 6, 2020, 6:43 p.m. UTC | #6
On Thu, Feb 06, 2020 at 10:22:02AM -0800, Eric Dumazet wrote:
> On 2/6/20 10:12 AM, Jason A. Donenfeld wrote:
> > On Thu, Feb 6, 2020 at 6:10 PM Eric Dumazet <eric.dumazet@gmail.com> wrote:
> >> Unfortunately we do not have ADD_ONCE() or something like that.
> > 
> > I guess normally this is called "atomic_add", unless you're thinking
> > instead about something like this, which generates the same
> > inefficient code as WRITE_ONCE:
> > 
> > #define ADD_ONCE(d, s) *(volatile typeof(d) *)&(d) += (s)
> > 
> 
> Dmitry Vyukov had a nice suggestion a few months back on how to implement this.
> 
> https://lkml.org/lkml/2019/10/5/6

That trick appears to work well in clang but not gcc:

#define ADD_ONCE(d, i) ({ \
       typeof(d) *__p = &(d); \
       __atomic_store_n(__p, (i) + __atomic_load_n(__p, __ATOMIC_RELAXED), __ATOMIC_RELAXED); \
})

gcc 9.2 gives:

  0:   8b 47 10                mov    0x10(%rdi),%eax
  3:   83 e8 01                sub    $0x1,%eax
  6:   89 47 10                mov    %eax,0x10(%rdi)

clang 9.0.1 gives:

   0:   81 47 10 ff ff ff ff    addl   $0xffffffff,0x10(%rdi)

But actually, clang does equally as well with:

#define ADD_ONCE(d, i) *(volatile typeof(d) *)&(d) += (i)

And testing further back, it generates the same code with your original
WRITE_ONCE.

If clang's optimization here is technically correct, maybe we should go
talk to the gcc people about catching this case?
Marco Elver Feb. 6, 2020, 7:29 p.m. UTC | #7
On Thu, 6 Feb 2020 at 19:43, Jason A. Donenfeld <Jason@zx2c4.com> wrote:
>
> On Thu, Feb 06, 2020 at 10:22:02AM -0800, Eric Dumazet wrote:
> > On 2/6/20 10:12 AM, Jason A. Donenfeld wrote:
> > > On Thu, Feb 6, 2020 at 6:10 PM Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > >> Unfortunately we do not have ADD_ONCE() or something like that.
> > >
> > > I guess normally this is called "atomic_add", unless you're thinking
> > > instead about something like this, which generates the same
> > > inefficient code as WRITE_ONCE:
> > >
> > > #define ADD_ONCE(d, s) *(volatile typeof(d) *)&(d) += (s)
> > >
> >
> > Dmitry Vyukov had a nice suggestion a few months back on how to implement this.
> >
> > https://lkml.org/lkml/2019/10/5/6
>
> That trick appears to work well in clang but not gcc:
>
> #define ADD_ONCE(d, i) ({ \
>        typeof(d) *__p = &(d); \
>        __atomic_store_n(__p, (i) + __atomic_load_n(__p, __ATOMIC_RELAXED), __ATOMIC_RELAXED); \
> })
>
> gcc 9.2 gives:
>
>   0:   8b 47 10                mov    0x10(%rdi),%eax
>   3:   83 e8 01                sub    $0x1,%eax
>   6:   89 47 10                mov    %eax,0x10(%rdi)
>
> clang 9.0.1 gives:
>
>    0:   81 47 10 ff ff ff ff    addl   $0xffffffff,0x10(%rdi)
>
> But actually, clang does equally as well with:
>
> #define ADD_ONCE(d, i) *(volatile typeof(d) *)&(d) += (i)

I feel that ADD_ONCE conveys that the add actually happens once
(atomically), that is, if there are concurrent ADD_ONCE calls, all of
them will succeed. This is not the case with the above variants, where
the 'ONCE' can turn into a 'MAYBE'; and we probably want to avoid
making this truly atomic, since on e.g. x86 that would need a LOCK
prefix.

In the case here, what we actually want is something that safely
increments/decrements if there are only concurrent readers (concurrent
writers disallowed). So 'add_exclusive(var, val)' (all-caps or not)
might be more appropriate. [As an aside, recent changes to KCSAN would
also allow us to assert for something like 'add_exclusive()' that
there are in fact no other writers but only concurrent readers, even
if all accesses are marked.]

If the single-writer constraint isn't wanted, but the operation should
still not be atomic, maybe 'add_lossy()'?

Thanks,
-- Marco


> And testing further back, it generates the same code with your original
> WRITE_ONCE.
>
> If clang's optimization here is technically correct, maybe we should go
> talk to the gcc people about catching this case?
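
A sketch of the naming suggested above (these helpers are hypothetical
and exist nowhere in the tree; both expand to the same code, and the
point of the distinct names would be the documented contract, plus what
KCSAN could be told to assert):

	/* Single concurrent writer only; lockless readers allowed. */
	#define add_exclusive(var, val)	WRITE_ONCE(var, READ_ONCE(var) + (val))

	/* Unsynchronized writers tolerated, but concurrent updates may be
	 * lost; use atomic_add() when every update must land.
	 */
	#define add_lossy(var, val)	WRITE_ONCE(var, READ_ONCE(var) + (val))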
Jason A. Donenfeld Feb. 6, 2020, 9:55 p.m. UTC | #8
Useful playground: https://godbolt.org/z/i7JFRW
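
For reproducing the comparison locally rather than on godbolt, a
self-contained sketch (user-space stand-ins only; the struct merely
mimics sk_buff_head's layout so qlen lands at offset 0x10, matching the
disassembly quoted earlier in the thread):

	/* add_once_cmp.c -- build with "gcc -O2 -c add_once_cmp.c" and
	 * inspect with "objdump -d add_once_cmp.o" (same for clang), then
	 * compare the code generated for each variant below.
	 */
	#include <stdint.h>

	struct fake_skb_head {
		void *next;
		void *prev;
		uint32_t qlen;		/* offset 0x10, as in sk_buff_head */
	};

	/* user-space stand-in for the kernel's WRITE_ONCE() */
	#define WRITE_ONCE_(x, val)	(*(volatile typeof(x) *)&(x) = (val))

	/* the two ADD_ONCE variants discussed in this thread */
	#define ADD_ONCE_VOLATILE(d, i)	(*(volatile typeof(d) *)&(d) += (i))
	#define ADD_ONCE_RELAXED(d, i) ({ \
		typeof(d) *__p = &(d); \
		__atomic_store_n(__p, (i) + __atomic_load_n(__p, __ATOMIC_RELAXED), \
				 __ATOMIC_RELAXED); \
	})

	void dec_plain(struct fake_skb_head *l)        { l->qlen--; }
	void dec_write_once(struct fake_skb_head *l)   { WRITE_ONCE_(l->qlen, l->qlen - 1); }
	void dec_add_volatile(struct fake_skb_head *l) { ADD_ONCE_VOLATILE(l->qlen, -1); }
	void dec_add_relaxed(struct fake_skb_head *l)  { ADD_ONCE_RELAXED(l->qlen, -1); }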
Marco Elver Feb. 7, 2020, 10:35 a.m. UTC | #9
On Thu, 6 Feb 2020 at 18:10, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
>
>
> On 2/6/20 8:38 AM, Jason A. Donenfeld wrote:
> > Hi Eric,
> >
> > On Tue, Feb 04, 2020 at 01:40:29PM -0500, Qian Cai wrote:
> >> -    list->qlen--;
> >> +    WRITE_ONCE(list->qlen, list->qlen - 1);
> >
> > Sorry I'm a bit late to the party here, but this immediately jumped out.
> > This generates worse code with a bigger race in some sense:
> >
> > list->qlen-- is:
> >
> >    0:   83 6f 10 01             subl   $0x1,0x10(%rdi)
> >
> > whereas WRITE_ONCE(list->qlen, list->qlen - 1) is:
> >
> >    0:   8b 47 10                mov    0x10(%rdi),%eax
> >    3:   83 e8 01                sub    $0x1,%eax
> >    6:   89 47 10                mov    %eax,0x10(%rdi)
> >
> > Are you sure that's what we want?
> >
> > Jason
> >
>
>
> Unfortunately we do not have ADD_ONCE() or something like that.
>
> Sure, on x86 we could get much better code generation.
>
> If we agree a READ_ONCE() was needed at the read side,
> then a WRITE_ONCE() is needed as well on write sides.
>
> If we believe load-tearing and/or write-tearing must not ever happen,
> then we must document this.

Just FYI, I forgot to mention: recent KCSAN will, by default, forgive
unannotated aligned writes up to word size, on the assumption that
these are safe. This would include things like 'var++' if there is
only a single writer. This default was added because of kernel-wide
preferences we were told about.

Since I cannot verify if this assumption is always correct, I would
still prefer to mark writes if at all possible.  In the end it's up to
maintainers.

Thanks,
-- Marco
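
The default described above appears to correspond to the
KCSAN_ASSUME_PLAIN_WRITES_ATOMIC Kconfig option (option name given from
memory, so treat it as an assumption); to have KCSAN flag plain aligned
word-sized writes again, the build config would need something along
the lines of:

	CONFIG_KCSAN=y
	# CONFIG_KCSAN_ASSUME_PLAIN_WRITES_ATOMIC is not set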
Herbert Xu Feb. 17, 2020, 3:24 a.m. UTC | #10
Jason A. Donenfeld <Jason@zx2c4.com> wrote:
> Hi Eric,
> 
> On Tue, Feb 04, 2020 at 01:40:29PM -0500, Qian Cai wrote:
>> -     list->qlen--;
>> +     WRITE_ONCE(list->qlen, list->qlen - 1);
> 
> Sorry I'm a bit late to the party here, but this immediately jumped out.
> This generates worse code with a bigger race in some sense:
> 
> list->qlen-- is:
> 
>   0:   83 6f 10 01             subl   $0x1,0x10(%rdi)
> 
> whereas WRITE_ONCE(list->qlen, list->qlen - 1) is:
> 
>   0:   8b 47 10                mov    0x10(%rdi),%eax
>   3:   83 e8 01                sub    $0x1,%eax
>   6:   89 47 10                mov    %eax,0x10(%rdi)
> 
> Are you sure that's what we want?

Fixing these KCSAN warnings is actively making the kernel worse.

Why are we still doing this?

Thanks,
Jason A. Donenfeld Feb. 17, 2020, 7:39 a.m. UTC | #11
On 2/17/20, Herbert Xu <herbert@gondor.apana.org.au> wrote:
> Jason A. Donenfeld <Jason@zx2c4.com> wrote:
>> Hi Eric,
>>
>> On Tue, Feb 04, 2020 at 01:40:29PM -0500, Qian Cai wrote:
>>> -     list->qlen--;
>>> +     WRITE_ONCE(list->qlen, list->qlen - 1);
>>
>> Sorry I'm a bit late to the party here, but this immediately jumped out.
>> This generates worse code with a bigger race in some sense:
>>
>> list->qlen-- is:
>>
>>   0:   83 6f 10 01             subl   $0x1,0x10(%rdi)
>>
>> whereas WRITE_ONCE(list->qlen, list->qlen - 1) is:
>>
>>   0:   8b 47 10                mov    0x10(%rdi),%eax
>>   3:   83 e8 01                sub    $0x1,%eax
>>   6:   89 47 10                mov    %eax,0x10(%rdi)
>>
>> Are you sure that's what we want?
>
> Fixing these KCSAN warnings is actively making the kernel worse.
>
> Why are we still doing this?
>
I'm not necessarily a big fan of this either, but just for the record,
in case it helps: while you might complain about the instruction count
blowing up a bit, cycle-wise these wind up being about the same
anyway. On x86, one instruction != one cycle.
Herbert Xu Feb. 17, 2020, 10:20 a.m. UTC | #12
On Mon, Feb 17, 2020 at 08:39:45AM +0100, Jason A. Donenfeld wrote:
>
> Not necessarily a big fan of this either, but just for the record here
> in case it helps, while you might complain about instruction size
> blowing up a bit, cycle-wise these wind up being about the same
> anyway. On x86, one instruction != one cycle.

I don't care about that.  My problem is with the mindless patches
that started this thread.  Look at the patch:

commit 86b18aaa2b5b5bb48e609cd591b3d2d0fdbe0442
Author: Qian Cai <cai@lca.pw>
Date:   Tue Feb 4 13:40:29 2020 -0500

    skbuff: fix a data race in skb_queue_len()

It's utter garbage.  Why on earth did it change that one instance
of unix_recvq_full()? In fact, you can see how stupid it is: right
after the call that got changed, we call into unix_recvq_full()
again, which surely would trigger the same warning.

So far the vast majority of the KCSAN patches that have caught
my attention have been of this mindless kind that does not add
any value to the kernel.  If anything they could be hiding real
bugs that would now be harder to find.

Cheers,

Patch

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 3d13a4b717e9..ca8806b69388 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1822,6 +1822,18 @@  static inline __u32 skb_queue_len(const struct sk_buff_head *list_)
 }
 
 /**
+ *	skb_queue_len_lockless	- get queue length
+ *	@list_: list to measure
+ *
+ *	Return the length of an &sk_buff queue.
+ *	This variant can be used in lockless contexts.
+ */
+static inline __u32 skb_queue_len_lockless(const struct sk_buff_head *list_)
+{
+	return READ_ONCE(list_->qlen);
+}
+
+/**
  *	__skb_queue_head_init - initialize non-spinlock portions of sk_buff_head
  *	@list: queue to initialize
  *
@@ -2026,7 +2038,7 @@  static inline void __skb_unlink(struct sk_buff *skb, struct sk_buff_head *list)
 {
 	struct sk_buff *next, *prev;
 
-	list->qlen--;
+	WRITE_ONCE(list->qlen, list->qlen - 1);
 	next	   = skb->next;
 	prev	   = skb->prev;
 	skb->next  = skb->prev = NULL;
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 321af97c7bbe..62c12cb5763e 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -189,11 +189,17 @@  static inline int unix_may_send(struct sock *sk, struct sock *osk)
 	return unix_peer(osk) == NULL || unix_our_peer(sk, osk);
 }
 
-static inline int unix_recvq_full(struct sock const *sk)
+static inline int unix_recvq_full(const struct sock *sk)
 {
 	return skb_queue_len(&sk->sk_receive_queue) > sk->sk_max_ack_backlog;
 }
 
+static inline int unix_recvq_full_lockless(const struct sock *sk)
+{
+	return skb_queue_len_lockless(&sk->sk_receive_queue) >
+		READ_ONCE(sk->sk_max_ack_backlog);
+}
+
 struct sock *unix_peer_get(struct sock *s)
 {
 	struct sock *peer;
@@ -1758,7 +1764,8 @@  static int unix_dgram_sendmsg(struct socket *sock, struct msghdr *msg,
 	 * - unix_peer(sk) == sk by time of get but disconnected before lock
 	 */
 	if (other != sk &&
-	    unlikely(unix_peer(other) != sk && unix_recvq_full(other))) {
+	    unlikely(unix_peer(other) != sk &&
+	    unix_recvq_full_lockless(other))) {
 		if (timeo) {
 			timeo = unix_wait_for_peer(other, timeo);