Message ID | 20190927165221.2391541-1-kafai@fb.com
---|---
State | Changes Requested
Delegated to: | BPF Maintainers
Series | [bpf] bpf: Fix a race in reuseport_array_free()
On 9/27/19 9:52 AM, Martin KaFai Lau wrote:
> In reuseport_array_free(), the rcu_read_lock() cannot ensure sk is still
> valid. It is because bpf_sk_reuseport_detach() can be called from
> __sk_destruct() which is invoked through call_rcu(..., __sk_destruct).

We could question why reuseport_detach_sock(sk) is called from __sk_destruct()
(after the rcu grace period) instead of sk_destruct() ?

> This patch takes the reuseport_lock in reuseport_array_free() which
> is not the fast path. The lock is taken inside the loop in case
> that the bpf map is big.
>
> Signed-off-by: Martin KaFai Lau <kafai@fb.com>

Fixes: 5dc4c4b7d4e8 ("bpf: Introduce BPF_MAP_TYPE_REUSEPORT_SOCKARRAY")
On Fri, Sep 27, 2019 at 10:24:49AM -0700, Eric Dumazet wrote:
> On 9/27/19 9:52 AM, Martin KaFai Lau wrote:
> > In reuseport_array_free(), the rcu_read_lock() cannot ensure sk is still
> > valid. It is because bpf_sk_reuseport_detach() can be called from
> > __sk_destruct() which is invoked through call_rcu(..., __sk_destruct).
>
> We could question why reuseport_detach_sock(sk) is called from __sk_destruct()
> (after the rcu grace period) instead of sk_destruct() ?
Agree. It is another way to fix it.

In this patch, I chose to avoid the need to single out a special treatment
for reuseport_detach_sock() in sk_destruct().

I am happy either way. What do you think?

> > This patch takes the reuseport_lock in reuseport_array_free() which
> > is not the fast path. The lock is taken inside the loop in case
> > that the bpf map is big.
> >
> > Signed-off-by: Martin KaFai Lau <kafai@fb.com>
>
> Fixes: 5dc4c4b7d4e8 ("bpf: Introduce BPF_MAP_TYPE_REUSEPORT_SOCKARRAY")
Ah...missed that. Thanks!
On 9/27/19 11:17 AM, Martin Lau wrote:
> On Fri, Sep 27, 2019 at 10:24:49AM -0700, Eric Dumazet wrote:
>> On 9/27/19 9:52 AM, Martin KaFai Lau wrote:
>>> In reuseport_array_free(), the rcu_read_lock() cannot ensure sk is still
>>> valid. It is because bpf_sk_reuseport_detach() can be called from
>>> __sk_destruct() which is invoked through call_rcu(..., __sk_destruct).
>>
>> We could question why reuseport_detach_sock(sk) is called from __sk_destruct()
>> (after the rcu grace period) instead of sk_destruct() ?
> Agree. It is another way to fix it.
>
> In this patch, I chose to avoid the need to single out a special treatment
> for reuseport_detach_sock() in sk_destruct().
>
> I am happy either way. What do you think?

It seems that since we call reuseport_detach_sock() after the rcu grace
period, another cpu could catch the sk pointer in the reuse->socks[] array
and use it right before our cpu frees the socket.

RCU rules are not properly applied here, I think.

The rules for deletion are:

1) unpublish the object from various lists/arrays/hashes.
2) wait an rcu grace period.
3) free the object.

If we fix the unpublish (we need to anyway to make the data path safe),
then your patch is not needed?
What about (totally untested, might be horribly wrong)

diff --git a/net/core/sock.c b/net/core/sock.c
index 07863edbe6fc4842e47ebebf00bc21bc406d9264..d31a4b094797f73ef89110c954aa0a164879362d 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1700,8 +1700,6 @@ static void __sk_destruct(struct rcu_head *head)
 		sk_filter_uncharge(sk, filter);
 		RCU_INIT_POINTER(sk->sk_filter, NULL);
 	}
-	if (rcu_access_pointer(sk->sk_reuseport_cb))
-		reuseport_detach_sock(sk);
 
 	sock_disable_timestamp(sk, SK_FLAGS_TIMESTAMP);
 
@@ -1728,7 +1726,13 @@ static void __sk_destruct(struct rcu_head *head)
 
 void sk_destruct(struct sock *sk)
 {
-	if (sock_flag(sk, SOCK_RCU_FREE))
+	bool use_call_rcu = sock_flag(sk, SOCK_RCU_FREE);
+
+	if (rcu_access_pointer(sk->sk_reuseport_cb)) {
+		reuseport_detach_sock(sk);
+		use_call_rcu = true;
+	}
+	if (use_call_rcu)
 		call_rcu(&sk->sk_rcu, __sk_destruct);
 	else
 		__sk_destruct(&sk->sk_rcu);
On Fri, Sep 27, 2019 at 01:47:32PM -0700, Eric Dumazet wrote:
> On 9/27/19 11:17 AM, Martin Lau wrote:
> > On Fri, Sep 27, 2019 at 10:24:49AM -0700, Eric Dumazet wrote:
> >> On 9/27/19 9:52 AM, Martin KaFai Lau wrote:
> >>> In reuseport_array_free(), the rcu_read_lock() cannot ensure sk is still
> >>> valid. It is because bpf_sk_reuseport_detach() can be called from
> >>> __sk_destruct() which is invoked through call_rcu(..., __sk_destruct).
> >>
> >> We could question why reuseport_detach_sock(sk) is called from __sk_destruct()
> >> (after the rcu grace period) instead of sk_destruct() ?
> > Agree. It is another way to fix it.
> >
> > In this patch, I chose to avoid the need to single out a special treatment
> > for reuseport_detach_sock() in sk_destruct().
> >
> > I am happy either way. What do you think?
>
> It seems that since we call reuseport_detach_sock() after the rcu grace period,
> another cpu could catch the sk pointer in reuse->socks[] array and use
> it right before our cpu frees the socket.
>
> RCU rules are not properly applied here I think.
>
> The rules for deletion are :
>
> 1) unpublish object from various lists/arrays/hashes.
Thanks for the analysis. Agreed. Indeed, there is an issue in reuse->socks[],
which is shared with other sockets, and they may pick up the destructed sk
from reuse->socks[].

> 2) rcu_grace_period
> 3) free the object.
>
> If we fix the unpublish (we need to anyway to make the data path safe),
> then your patch is not needed ?
Correct, not needed.

> What about (totally untested, might be horribly wrong)
I had something similar in mind also. I will take a closer look and re-spin v2.
diff --git a/kernel/bpf/reuseport_array.c b/kernel/bpf/reuseport_array.c
index 50c083ba978c..9e593ac31ad7 100644
--- a/kernel/bpf/reuseport_array.c
+++ b/kernel/bpf/reuseport_array.c
@@ -103,29 +103,11 @@ static void reuseport_array_free(struct bpf_map *map)
 	 * array now. Hence, this function only races with
 	 * bpf_sk_reuseport_detach() which was triggerred by
 	 * close() or disconnect().
-	 *
-	 * This function and bpf_sk_reuseport_detach() are
-	 * both removing sk from "array". Who removes it
-	 * first does not matter.
-	 *
-	 * The only concern here is bpf_sk_reuseport_detach()
-	 * may access "array" which is being freed here.
-	 * bpf_sk_reuseport_detach() access this "array"
-	 * through sk->sk_user_data _and_ with sk->sk_callback_lock
-	 * held which is enough because this "array" is not freed
-	 * until all sk->sk_user_data has stopped referencing this "array".
-	 *
-	 * Hence, due to the above, taking "reuseport_lock" is not
-	 * needed here.
 	 */
-
-	/*
-	 * Since reuseport_lock is not taken, sk is accessed under
-	 * rcu_read_lock()
-	 */
-	rcu_read_lock();
 	for (i = 0; i < map->max_entries; i++) {
-		sk = rcu_dereference(array->ptrs[i]);
+		spin_lock_bh(&reuseport_lock);
+		sk = rcu_dereference_protected(array->ptrs[i],
+					       lockdep_is_held(&reuseport_lock));
 		if (sk) {
 			write_lock_bh(&sk->sk_callback_lock);
 			/*
@@ -137,8 +119,9 @@ static void reuseport_array_free(struct bpf_map *map)
 			write_unlock_bh(&sk->sk_callback_lock);
 			RCU_INIT_POINTER(array->ptrs[i], NULL);
 		}
+		spin_unlock_bh(&reuseport_lock);
+		cond_resched();
 	}
-	rcu_read_unlock();
 
 	/*
 	 * Once reaching here, all sk->sk_user_data is not
In reuseport_array_free(), the rcu_read_lock() cannot ensure sk is still
valid. It is because bpf_sk_reuseport_detach() can be called from
__sk_destruct() which is invoked through call_rcu(..., __sk_destruct).

This patch takes the reuseport_lock in reuseport_array_free() which
is not the fast path. The lock is taken inside the loop in case
that the bpf map is big.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 kernel/bpf/reuseport_array.c | 27 +++++----------------------
 1 file changed, 5 insertions(+), 22 deletions(-)