Message ID: 20201201144418.35045-6-kuniyu@amazon.co.jp
State:      Superseded
Series:     Socket migration for SO_REUSEPORT.
On 12/1/20 3:44 PM, Kuniyuki Iwashima wrote:
> This patch renames reuseport_select_sock() to __reuseport_select_sock() and
> adds two wrapper function of it to pass the migration type defined in the
> previous commit.
>
>   reuseport_select_sock          : BPF_SK_REUSEPORT_MIGRATE_NO
>   reuseport_select_migrated_sock : BPF_SK_REUSEPORT_MIGRATE_REQUEST
>
> As mentioned before, we have to select a new listener for TCP_NEW_SYN_RECV
> requests at receiving the final ACK or sending a SYN+ACK. Therefore, this
> patch also changes the code to call reuseport_select_migrated_sock() even
> if the listening socket is TCP_CLOSE. If we can pick out a listening socket
> from the reuseport group, we rewrite request_sock.rsk_listener and resume
> processing the request.
>
> Reviewed-by: Benjamin Herrenschmidt <benh@amazon.com>
> Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
> ---
>  include/net/inet_connection_sock.h | 12 +++++++++++
>  include/net/request_sock.h         | 13 ++++++++++++
>  include/net/sock_reuseport.h       |  8 +++----
>  net/core/sock_reuseport.c          | 34 ++++++++++++++++++++++++------
>  net/ipv4/inet_connection_sock.c    | 13 ++++++++++--
>  net/ipv4/tcp_ipv4.c                |  9 ++++++--
>  net/ipv6/tcp_ipv6.c                |  9 ++++++--
>  7 files changed, 81 insertions(+), 17 deletions(-)
>
> diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
> index 2ea2d743f8fc..1e0958f5eb21 100644
> --- a/include/net/inet_connection_sock.h
> +++ b/include/net/inet_connection_sock.h
> @@ -272,6 +272,18 @@ static inline void inet_csk_reqsk_queue_added(struct sock *sk)
>  	reqsk_queue_added(&inet_csk(sk)->icsk_accept_queue);
>  }
>
> +static inline void inet_csk_reqsk_queue_migrated(struct sock *sk,
> +						 struct sock *nsk,
> +						 struct request_sock *req)
> +{
> +	reqsk_queue_migrated(&inet_csk(sk)->icsk_accept_queue,
> +			     &inet_csk(nsk)->icsk_accept_queue,
> +			     req);
> +	sock_put(sk);
> +	sock_hold(nsk);

This looks racy to me. nsk refcount might be zero at this point.

If you think it can _not_ be zero, please add a big comment here,
because this would mean something has been done before reaching this
function, and this sock_hold() would be not needed in the first place.

There is a good reason reqsk_alloc() is using refcount_inc_not_zero().

> +	req->rsk_listener = nsk;
> +}
> +

Honestly, this patch series looks quite complex, and finding a bug in
the very first function I am looking at is not really a good sign...
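The refcounting pattern Eric alludes to can be sketched in userspace C11 (a hypothetical refcount_t, not the kernel's refcount.h): an increment that refuses to run on a zero count cannot resurrect an object that a concurrent final put has already committed to freeing, while a plain sock_hold()-style increment can.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for the kernel's refcount_t. */
typedef struct { atomic_int refs; } refcount_t;

/* Take a reference only if the object is still live (count > 0).
 * Returns false when the count has already dropped to zero, i.e.
 * the object is on its way to being freed and must not be revived. */
static bool refcount_inc_not_zero(refcount_t *r)
{
	int old = atomic_load(&r->refs);

	do {
		if (old == 0)
			return false;	/* too late; freeing has begun */
	} while (!atomic_compare_exchange_weak(&r->refs, &old, old + 1));

	return true;
}
```

A plain `atomic_fetch_add(&r->refs, 1)` would happily turn 0 into 1 after the freeing path has already won, which is exactly the race Eric is flagging for sock_hold(nsk).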
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 1 Dec 2020 16:13:39 +0100
> On 12/1/20 3:44 PM, Kuniyuki Iwashima wrote:
> > +static inline void inet_csk_reqsk_queue_migrated(struct sock *sk,
> > +						 struct sock *nsk,
> > +						 struct request_sock *req)
> > +{
> > +	reqsk_queue_migrated(&inet_csk(sk)->icsk_accept_queue,
> > +			     &inet_csk(nsk)->icsk_accept_queue,
> > +			     req);
> > +	sock_put(sk);
> > +	sock_hold(nsk);
>
> This looks racy to me. nsk refcount might be zero at this point.
>
> If you think it can _not_ be zero, please add a big comment here,
> because this would mean something has been done before reaching this
> function, and this sock_hold() would be not needed in the first place.
>
> There is a good reason reqsk_alloc() is using refcount_inc_not_zero().

Exactly, I will fix this in the next spin like below. Thank you.

---8<---
diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
index 1e0958f5eb21..d8c3be31e987 100644
--- a/include/net/inet_connection_sock.h
+++ b/include/net/inet_connection_sock.h
@@ -280,7 +280,6 @@ static inline void inet_csk_reqsk_queue_migrated(struct sock *sk,
 			     &inet_csk(nsk)->icsk_accept_queue,
 			     req);
 	sock_put(sk);
-	sock_hold(nsk);
 	req->rsk_listener = nsk;
 }

diff --git a/net/core/sock_reuseport.c b/net/core/sock_reuseport.c
index 6b475897b496..4d07bddcf678 100644
--- a/net/core/sock_reuseport.c
+++ b/net/core/sock_reuseport.c
@@ -386,7 +386,14 @@ EXPORT_SYMBOL(reuseport_select_sock);
 struct sock *reuseport_select_migrated_sock(struct sock *sk, u32 hash,
 					    struct sk_buff *skb)
 {
-	return __reuseport_select_sock(sk, hash, skb, 0, BPF_SK_REUSEPORT_MIGRATE_REQUEST);
+	struct sock *nsk;
+
+	nsk = __reuseport_select_sock(sk, hash, skb, 0, BPF_SK_REUSEPORT_MIGRATE_REQUEST);
+	if (IS_ERR_OR_NULL(nsk) ||
+	    unlikely(!refcount_inc_not_zero(&nsk->sk_refcnt)))
+		return NULL;
+
+	return nsk;
 }
 EXPORT_SYMBOL(reuseport_select_migrated_sock);
---8<---

> Honestly, this patch series looks quite complex, and finding a bug in
> the very first function I am looking at is not really a good sign...

I also think this issue is quite complex, but it might be easier to fix
than it was discussed in 2015 [1], thanks to some of your refactoring.

[1] https://lore.kernel.org/netdev/1443313848-751-1-git-send-email-tolga.ceylan@gmail.com/
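For reference, the IS_ERR_OR_NULL() check used in the fix above follows the kernel's error-pointer convention; a rough userspace sketch (with simplified constants, not the kernel's err.h) looks like this:

```c
#include <stdbool.h>
#include <stdint.h>

/* Userspace sketch of the kernel's ERR_PTR convention: small negative
 * errno values are encoded in the last page of the address space, so a
 * returned pointer is either valid, NULL, or an errno in disguise. */
#define MAX_ERRNO 4095

static void *ERR_PTR(long error)
{
	return (void *)error;
}

static bool IS_ERR_OR_NULL(const void *ptr)
{
	/* NULL, or within MAX_ERRNO of the top of the address space. */
	return !ptr || (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}
```

This is why __reuseport_select_sock() callers can get back a BPF error result, NULL, or a real socket, and a single IS_ERR_OR_NULL() test covers the first two.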
On Tue, Dec 01, 2020 at 11:44:12PM +0900, Kuniyuki Iwashima wrote:

[ ... ]

> +static inline void inet_csk_reqsk_queue_migrated(struct sock *sk,
> +						 struct sock *nsk,
> +						 struct request_sock *req)
> +{
> +	reqsk_queue_migrated(&inet_csk(sk)->icsk_accept_queue,
> +			     &inet_csk(nsk)->icsk_accept_queue,
> +			     req);
> +	sock_put(sk);

not sure if it is safe to do here.
IIUC, when the req->rsk_refcnt is held, it also holds a refcnt
to req->rsk_listener such that sock_hold(req->rsk_listener) is
safe because its sk_refcnt is not zero.

> +	sock_hold(nsk);
> +	req->rsk_listener = nsk;
> +}
> +

[ ... ]

> diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
> index 992cbf3eb9e3..ff11f3c0cb96 100644
> --- a/net/ipv6/tcp_ipv6.c
> +++ b/net/ipv6/tcp_ipv6.c
> @@ -1635,8 +1635,13 @@ INDIRECT_CALLABLE_SCOPE int tcp_v6_rcv(struct sk_buff *skb)
>  			goto csum_error;
>  		}
>  		if (unlikely(sk->sk_state != TCP_LISTEN)) {
> -			inet_csk_reqsk_queue_drop_and_put(sk, req);
> -			goto lookup;
> +			nsk = reuseport_select_migrated_sock(sk, req_to_sk(req)->sk_hash, skb);
> +			if (!nsk) {
> +				inet_csk_reqsk_queue_drop_and_put(sk, req);
> +				goto lookup;
> +			}
> +			inet_csk_reqsk_queue_migrated(sk, nsk, req);
> +			sk = nsk;
>  		}
>  		sock_hold(sk);

For example, this sock_hold(sk). sk here is req->rsk_listener.

>  		refcounted = true;
> --
> 2.17.2 (Apple Git-113)
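The invariant Martin relies on here is an ownership chain: a request_sock pins its listener, so any core holding the req knows the listener's count is at least 1 and a plain increment cannot observe zero. A minimal userspace sketch (hypothetical types, C11 atomics standing in for the kernel's atomic_t):

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical object with a reference count, standing in for a sock. */
struct obj { atomic_int refs; };

/* sock_hold()-style plain increment: only safe while the caller can
 * prove the count is nonzero, e.g. via an object that pins this one. */
static void hold(struct obj *o)
{
	atomic_fetch_add(&o->refs, 1);
}

/* sock_put()-style decrement; returns true when this was the last
 * reference, i.e. the point at which the object would be freed. */
static bool put(struct obj *o)
{
	return atomic_fetch_sub(&o->refs, 1) == 1;
}
```

The migration code breaks this chain: it calls sock_put() on the old listener while other cores may still be holding the req and assuming that pin exists, which is the hazard the next reply spells out.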
From: Martin KaFai Lau <kafai@fb.com>
Date: Wed, 9 Dec 2020 16:07:07 -0800

[ ... ]

> > +	reqsk_queue_migrated(&inet_csk(sk)->icsk_accept_queue,
> > +			     &inet_csk(nsk)->icsk_accept_queue,
> > +			     req);
> > +	sock_put(sk);
>
> not sure if it is safe to do here.
> IIUC, when the req->rsk_refcnt is held, it also holds a refcnt
> to req->rsk_listener such that sock_hold(req->rsk_listener) is
> safe because its sk_refcnt is not zero.

I think it is safe to call sock_put() for the old listener here.

Without this patchset, at receiving the final ACK or retransmitting
SYN+ACK, if sk_state == TCP_CLOSE, sock_put(req->rsk_listener) is done
by calling reqsk_put() twice in inet_csk_reqsk_queue_drop_and_put().
And then, we do `goto lookup;` and overwrite the sk.

In the v2 patchset, refcount_inc_not_zero() is done for the new listener in
reuseport_select_migrated_sock(), so we have to call sock_put() for the old
listener instead to free it properly.

---8<---
+struct sock *reuseport_select_migrated_sock(struct sock *sk, u32 hash,
+					    struct sk_buff *skb)
+{
+	struct sock *nsk;
+
+	nsk = __reuseport_select_sock(sk, hash, skb, 0, BPF_SK_REUSEPORT_MIGRATE_REQUEST);
+	if (nsk && likely(refcount_inc_not_zero(&nsk->sk_refcnt)))
+		return nsk;
+
+	return NULL;
+}
+EXPORT_SYMBOL(reuseport_select_migrated_sock);
---8<---
https://lore.kernel.org/netdev/20201207132456.65472-8-kuniyu@amazon.co.jp/

> > +	sock_hold(nsk);
> > +	req->rsk_listener = nsk;
> > +}
> > +
>
> [ ... ]
>
> > 		if (unlikely(sk->sk_state != TCP_LISTEN)) {
> > -			inet_csk_reqsk_queue_drop_and_put(sk, req);
> > -			goto lookup;
> > +			nsk = reuseport_select_migrated_sock(sk, req_to_sk(req)->sk_hash, skb);
> > +			if (!nsk) {
> > +				inet_csk_reqsk_queue_drop_and_put(sk, req);
> > +				goto lookup;
> > +			}
> > +			inet_csk_reqsk_queue_migrated(sk, nsk, req);
> > +			sk = nsk;
> > 		}
> > 		sock_hold(sk);
>
> For example, this sock_hold(sk). sk here is req->rsk_listener.

After migration, this is for the new listener and it is safe because
refcount_inc_not_zero() for the new listener is called in
reuseport_select_migrated_sock().

> > 	refcounted = true;
> > --
> > 2.17.2 (Apple Git-113)
On Thu, Dec 10, 2020 at 02:15:38PM +0900, Kuniyuki Iwashima wrote:
> From: Martin KaFai Lau <kafai@fb.com>
> Date: Wed, 9 Dec 2020 16:07:07 -0800
>
> > > +	sock_put(sk);
> >
> > not sure if it is safe to do here.
> > IIUC, when the req->rsk_refcnt is held, it also holds a refcnt
> > to req->rsk_listener such that sock_hold(req->rsk_listener) is
> > safe because its sk_refcnt is not zero.
>
> I think it is safe to call sock_put() for the old listener here.
>
> Without this patchset, at receiving the final ACK or retransmitting
> SYN+ACK, if sk_state == TCP_CLOSE, sock_put(req->rsk_listener) is done
> by calling reqsk_put() twice in inet_csk_reqsk_queue_drop_and_put().

Note that in your example (final ACK), sock_put(req->rsk_listener) is
_only_ called when reqsk_put() can get
refcount_dec_and_test(&req->rsk_refcnt) to reach zero.

Here in this patch, it does sock_put(req->rsk_listener) without
req->rsk_refcnt reaching zero.

Let's say there are two cores holding a refcnt to req (one for each
core) by looking up the req from ehash. One of the cores does this
migrate and sock_put(req->rsk_listener). The other core does
sock_hold(req->rsk_listener).

	Core1					Core2
	sock_put(req->rsk_listener)
						sock_hold(req->rsk_listener)

> And then, we do `goto lookup;` and overwrite the sk.
>
> In the v2 patchset, refcount_inc_not_zero() is done for the new listener in
> reuseport_select_migrated_sock(), so we have to call sock_put() for the old
> listener instead to free it properly.
>
> ---8<---
> +struct sock *reuseport_select_migrated_sock(struct sock *sk, u32 hash,
> +					    struct sk_buff *skb)
> +{
> +	struct sock *nsk;
> +
> +	nsk = __reuseport_select_sock(sk, hash, skb, 0, BPF_SK_REUSEPORT_MIGRATE_REQUEST);
> +	if (nsk && likely(refcount_inc_not_zero(&nsk->sk_refcnt)))

There is another potential issue here. The TCP_LISTEN nsk is protected
by rcu. refcount_inc_not_zero(&nsk->sk_refcnt) cannot be done if it is
not under rcu_read_lock(). The receive path may be ok as it is in rcu.
You may need to check for others.

> +		return nsk;
> +
> +	return NULL;
> +}
> +EXPORT_SYMBOL(reuseport_select_migrated_sock);
> ---8<---
> https://lore.kernel.org/netdev/20201207132456.65472-8-kuniyu@amazon.co.jp/
>
> > > +	sock_hold(nsk);
> > > +	req->rsk_listener = nsk;

It looks like there is another race here. What if multiple cores try to
update req->rsk_listener?
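The multiple-writer concern in the last question can be illustrated with a small userspace sketch (hypothetical fake_req type; C11 compare-and-swap standing in for the kernel's cmpxchg): with a plain store, two concurrent migrations can each pair their sock_hold() with the other's pointer, while a CAS lets exactly one migration win.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a request_sock with an atomic listener
 * pointer (the real field is a plain pointer, which is the problem). */
struct fake_req { _Atomic(void *) rsk_listener; };

/* Install nsk only if this core still sees the old listener; the core
 * whose CAS fails knows another migration already happened and must
 * back out (drop its reference on its own nsk instead). */
static bool migrate_listener(struct fake_req *req, void *old_sk, void *nsk)
{
	return atomic_compare_exchange_strong(&req->rsk_listener,
					      &old_sk, nsk);
}
```

This is only a sketch of one possible resolution of the race Martin raises, not what the patch series does; the series discussion continues in later revisions.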
diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
index 2ea2d743f8fc..1e0958f5eb21 100644
--- a/include/net/inet_connection_sock.h
+++ b/include/net/inet_connection_sock.h
@@ -272,6 +272,18 @@ static inline void inet_csk_reqsk_queue_added(struct sock *sk)
 	reqsk_queue_added(&inet_csk(sk)->icsk_accept_queue);
 }
 
+static inline void inet_csk_reqsk_queue_migrated(struct sock *sk,
+						 struct sock *nsk,
+						 struct request_sock *req)
+{
+	reqsk_queue_migrated(&inet_csk(sk)->icsk_accept_queue,
+			     &inet_csk(nsk)->icsk_accept_queue,
+			     req);
+	sock_put(sk);
+	sock_hold(nsk);
+	req->rsk_listener = nsk;
+}
+
 static inline int inet_csk_reqsk_queue_len(const struct sock *sk)
 {
 	return reqsk_queue_len(&inet_csk(sk)->icsk_accept_queue);
diff --git a/include/net/request_sock.h b/include/net/request_sock.h
index 29e41ff3ec93..d18ba0b857cc 100644
--- a/include/net/request_sock.h
+++ b/include/net/request_sock.h
@@ -226,6 +226,19 @@ static inline void reqsk_queue_added(struct request_sock_queue *queue)
 	atomic_inc(&queue->qlen);
 }
 
+static inline void reqsk_queue_migrated(struct request_sock_queue *old_accept_queue,
+					struct request_sock_queue *new_accept_queue,
+					const struct request_sock *req)
+{
+	atomic_dec(&old_accept_queue->qlen);
+	atomic_inc(&new_accept_queue->qlen);
+
+	if (req->num_timeout == 0) {
+		atomic_dec(&old_accept_queue->young);
+		atomic_inc(&new_accept_queue->young);
+	}
+}
+
 static inline int reqsk_queue_len(const struct request_sock_queue *queue)
 {
 	return atomic_read(&queue->qlen);
diff --git a/include/net/sock_reuseport.h b/include/net/sock_reuseport.h
index 09a1b1539d4c..a48259a974be 100644
--- a/include/net/sock_reuseport.h
+++ b/include/net/sock_reuseport.h
@@ -32,10 +32,10 @@ extern int reuseport_alloc(struct sock *sk, bool bind_inany);
 extern int reuseport_add_sock(struct sock *sk, struct sock *sk2,
 			      bool bind_inany);
 extern struct sock *reuseport_detach_sock(struct sock *sk);
-extern struct sock *reuseport_select_sock(struct sock *sk,
-					  u32 hash,
-					  struct sk_buff *skb,
-					  int hdr_len);
+extern struct sock *reuseport_select_sock(struct sock *sk, u32 hash,
+					  struct sk_buff *skb, int hdr_len);
+extern struct sock *reuseport_select_migrated_sock(struct sock *sk, u32 hash,
+						   struct sk_buff *skb);
 extern int reuseport_attach_prog(struct sock *sk, struct bpf_prog *prog);
 extern int reuseport_detach_prog(struct sock *sk);
diff --git a/net/core/sock_reuseport.c b/net/core/sock_reuseport.c
index 60d7c1f28809..b4fe0829c9ab 100644
--- a/net/core/sock_reuseport.c
+++ b/net/core/sock_reuseport.c
@@ -202,7 +202,7 @@ int reuseport_add_sock(struct sock *sk, struct sock *sk2, bool bind_inany)
 	}
 
 	reuse->socks[reuse->num_socks] = sk;
-	/* paired with smp_rmb() in reuseport_select_sock() */
+	/* paired with smp_rmb() in __reuseport_select_sock() */
 	smp_wmb();
 	reuse->num_socks++;
 	rcu_assign_pointer(sk->sk_reuseport_cb, reuse);
@@ -313,12 +313,13 @@ static struct sock *run_bpf_filter(struct sock_reuseport *reuse, u16 socks,
  * @hdr_len: BPF filter expects skb data pointer at payload data. If
  * the skb does not yet point at the payload, this parameter represents
  * how far the pointer needs to advance to reach the payload.
+ * @migration: represents if it is selecting a listener for SYN or
+ * migrating ESTABLISHED/SYN_RECV sockets or NEW_SYN_RECV socket.
  * Returns a socket that should receive the packet (or NULL on error).
  */
-struct sock *reuseport_select_sock(struct sock *sk,
-				   u32 hash,
-				   struct sk_buff *skb,
-				   int hdr_len)
+struct sock *__reuseport_select_sock(struct sock *sk, u32 hash,
+				     struct sk_buff *skb, int hdr_len,
+				     u8 migration)
 {
 	struct sock_reuseport *reuse;
 	struct bpf_prog *prog;
@@ -332,13 +333,19 @@ struct sock *reuseport_select_sock(struct sock *sk,
 	if (!reuse)
 		goto out;
 
-	prog = rcu_dereference(reuse->prog);
 	socks = READ_ONCE(reuse->num_socks);
 	if (likely(socks)) {
 		/* paired with smp_wmb() in reuseport_add_sock() */
 		smp_rmb();
 
-		if (!prog || !skb)
+		prog = rcu_dereference(reuse->prog);
+		if (!prog)
+			goto select_by_hash;
+
+		if (migration)
+			goto out;
+
+		if (!skb)
 			goto select_by_hash;
 
 		if (prog->type == BPF_PROG_TYPE_SK_REUSEPORT)
@@ -367,8 +374,21 @@ struct sock *reuseport_select_sock(struct sock *sk,
 	rcu_read_unlock();
 	return sk2;
 }
+
+struct sock *reuseport_select_sock(struct sock *sk, u32 hash,
+				   struct sk_buff *skb, int hdr_len)
+{
+	return __reuseport_select_sock(sk, hash, skb, hdr_len, BPF_SK_REUSEPORT_MIGRATE_NO);
+}
 EXPORT_SYMBOL(reuseport_select_sock);
 
+struct sock *reuseport_select_migrated_sock(struct sock *sk, u32 hash,
+					    struct sk_buff *skb)
+{
+	return __reuseport_select_sock(sk, hash, skb, 0, BPF_SK_REUSEPORT_MIGRATE_REQUEST);
+}
+EXPORT_SYMBOL(reuseport_select_migrated_sock);
+
 int reuseport_attach_prog(struct sock *sk, struct bpf_prog *prog)
 {
 	struct sock_reuseport *reuse;
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index 361efe55b1ad..e71653c6eae2 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -743,8 +743,17 @@ static void reqsk_timer_handler(struct timer_list *t)
 	struct request_sock_queue *queue = &icsk->icsk_accept_queue;
 	int max_syn_ack_retries, qlen, expire = 0, resend = 0;
 
-	if (inet_sk_state_load(sk_listener) != TCP_LISTEN)
-		goto drop;
+	if (inet_sk_state_load(sk_listener) != TCP_LISTEN) {
+		sk_listener = reuseport_select_migrated_sock(sk_listener,
+							     req_to_sk(req)->sk_hash, NULL);
+		if (!sk_listener) {
+			sk_listener = req->rsk_listener;
+			goto drop;
+		}
+		inet_csk_reqsk_queue_migrated(req->rsk_listener, sk_listener, req);
+		icsk = inet_csk(sk_listener);
+		queue = &icsk->icsk_accept_queue;
+	}
 
 	max_syn_ack_retries = icsk->icsk_syn_retries ? : net->ipv4.sysctl_tcp_synack_retries;
 	/* Normally all the openreqs are young and become mature
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index e4b31e70bd30..9a9aa27c6069 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1973,8 +1973,13 @@ int tcp_v4_rcv(struct sk_buff *skb)
 			goto csum_error;
 		}
 		if (unlikely(sk->sk_state != TCP_LISTEN)) {
-			inet_csk_reqsk_queue_drop_and_put(sk, req);
-			goto lookup;
+			nsk = reuseport_select_migrated_sock(sk, req_to_sk(req)->sk_hash, skb);
+			if (!nsk) {
+				inet_csk_reqsk_queue_drop_and_put(sk, req);
+				goto lookup;
+			}
+			inet_csk_reqsk_queue_migrated(sk, nsk, req);
+			sk = nsk;
 		}
 		/* We own a reference on the listener, increase it again
 		 * as we might lose it too soon.
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 992cbf3eb9e3..ff11f3c0cb96 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1635,8 +1635,13 @@ INDIRECT_CALLABLE_SCOPE int tcp_v6_rcv(struct sk_buff *skb)
 			goto csum_error;
 		}
 		if (unlikely(sk->sk_state != TCP_LISTEN)) {
-			inet_csk_reqsk_queue_drop_and_put(sk, req);
-			goto lookup;
+			nsk = reuseport_select_migrated_sock(sk, req_to_sk(req)->sk_hash, skb);
+			if (!nsk) {
+				inet_csk_reqsk_queue_drop_and_put(sk, req);
+				goto lookup;
+			}
+			inet_csk_reqsk_queue_migrated(sk, nsk, req);
+			sk = nsk;
 		}
 		sock_hold(sk);
 		refcounted = true;
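The accounting half of the migration, reqsk_queue_migrated() in the patch above, can be mimicked in userspace to show what moves between the two accept queues (hypothetical fake_queue type; C11 atomics stand in for the kernel's atomic_t):

```c
#include <stdatomic.h>

/* Hypothetical stand-in for request_sock_queue's counters. */
struct fake_queue { atomic_int qlen, young; };

/* Migrating a request moves its qlen accounting from the old listener's
 * accept queue to the new one; a request that has never hit its SYN+ACK
 * timer (num_timeout == 0) also moves its "young" accounting. */
static void queue_migrated(struct fake_queue *old_q, struct fake_queue *new_q,
			   int num_timeout)
{
	atomic_fetch_sub(&old_q->qlen, 1);
	atomic_fetch_add(&new_q->qlen, 1);

	if (num_timeout == 0) {
		atomic_fetch_sub(&old_q->young, 1);
		atomic_fetch_add(&new_q->young, 1);
	}
}
```

The qlen/young split matters because the kernel's SYN+ACK retransmit policy treats "young" (never-retransmitted) requests more leniently, so the new listener must inherit the request's age class along with the request itself.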