Message ID: 20210511133659.29982-2-fw@strlen.de
State: Superseded, archived
Delegated to: Paolo Abeni
Series: add cmsg support to receive path
On Tue, 2021-05-11 at 15:36 +0200, Florian Westphal wrote:
> The setting is only relevant for the msk socket.
> While at it, also handle rcvlowat/rcvtimeo this way.
>
> Signed-off-by: Florian Westphal <fw@strlen.de>
> ---
>  net/mptcp/protocol.c | 7 +++++++
>  net/mptcp/sockopt.c  | 8 ++++++++
>  2 files changed, 15 insertions(+)
>
> diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
> index 652e55a0c6e8..86e599eb4403 100644
> --- a/net/mptcp/protocol.c
> +++ b/net/mptcp/protocol.c
> @@ -12,6 +12,7 @@
>  #include <linux/sched/signal.h>
>  #include <linux/atomic.h>
>  #include <net/sock.h>
> +#include <net/busy_poll.h>
>  #include <net/inet_common.h>
>  #include <net/inet_hashtables.h>
>  #include <net/protocol.h>
> @@ -1982,6 +1983,12 @@ static int mptcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
>  	if (unlikely(flags & MSG_ERRQUEUE))
>  		return inet_recv_error(sk, msg, len, addr_len);
>
> +	if (sk_can_busy_loop(sk) &&
> +	    skb_queue_empty_lockless(&msk->receive_queue) &&
> +	    skb_queue_empty_lockless(&sk->sk_receive_queue) &&
> +	    inet_sk_state_load(sk) == TCP_ESTABLISHED)
> +		sk_busy_loop(sk, nonblock);

If I understand the code correctly, here we need a valid
sk->sk_napi_id. That field is set by the TCP input path in
tcp_finish_connect()/tcp_v{4,6}_do_rcv()/tcp_child_process() for the
subflow ssk.

I think sk_napi_id is not currently available for the msk?!?
Copying it from the subflow requires some care, as the msk could end up
busy polling on the 'wrong' napi instance ?!?

/P
Paolo Abeni <pabenit@redhat.com> wrote:
> On Tue, 2021-05-11 at 15:36 +0200, Florian Westphal wrote:
> > The setting is only relevant for the msk socket.
> > While at it, also handle rcvlowat/rcvtimeo this way.
> >
> > Signed-off-by: Florian Westphal <fw@strlen.de>
> > ---
> >  net/mptcp/protocol.c | 7 +++++++
> >  net/mptcp/sockopt.c  | 8 ++++++++
> >  2 files changed, 15 insertions(+)
> >
> > diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
> > index 652e55a0c6e8..86e599eb4403 100644
> > --- a/net/mptcp/protocol.c
> > +++ b/net/mptcp/protocol.c
> > @@ -12,6 +12,7 @@
> >  #include <linux/sched/signal.h>
> >  #include <linux/atomic.h>
> >  #include <net/sock.h>
> > +#include <net/busy_poll.h>
> >  #include <net/inet_common.h>
> >  #include <net/inet_hashtables.h>
> >  #include <net/protocol.h>
> > @@ -1982,6 +1983,12 @@ static int mptcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
> >  	if (unlikely(flags & MSG_ERRQUEUE))
> >  		return inet_recv_error(sk, msg, len, addr_len);
> >
> > +	if (sk_can_busy_loop(sk) &&
> > +	    skb_queue_empty_lockless(&msk->receive_queue) &&
> > +	    skb_queue_empty_lockless(&sk->sk_receive_queue) &&
> > +	    inet_sk_state_load(sk) == TCP_ESTABLISHED)
> > +		sk_busy_loop(sk, nonblock);
>
> If I understand the code correctly, here we need a valid
> sk->sk_napi_id. That field is set by the TCP input path
> in tcp_finish_connect()/tcp_v{4,6}_do_rcv()/tcp_child_process() for the
> subflow ssk.

Can this patch just be dropped from the series?
Paolo Abeni <pabeni@redhat.com> wrote:
> On Tue, 2021-05-11 at 15:36 +0200, Florian Westphal wrote:
> > +	if (sk_can_busy_loop(sk) &&
> > +	    skb_queue_empty_lockless(&msk->receive_queue) &&
> > +	    skb_queue_empty_lockless(&sk->sk_receive_queue) &&
> > +	    inet_sk_state_load(sk) == TCP_ESTABLISHED)
> > +		sk_busy_loop(sk, nonblock);
>
> If I understand the code correctly, here we need a valid
> sk->sk_napi_id. That field is set by the TCP input path
> in tcp_finish_connect()/tcp_v{4,6}_do_rcv()/tcp_child_process() for the
> subflow ssk.
>
> I think sk_napi_id is not currently available for the msk?!?
> Copying it from the subflow requires some care, as the msk could end up
> busy polling on the 'wrong' napi instance ?!?

What about:

diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -303,6 +303,7 @@ static bool __mptcp_move_skb(struct mptcp_sock *msk, struct sock *ssk,
 	if (tail && mptcp_try_coalesce(sk, tail, skb))
 		return true;
 
+	sk_mark_napi_id_once(sk, skb);
 	skb_set_owner_r(skb, sk);
 	__skb_queue_tail(&sk->sk_receive_queue, skb);
 	return true;

This sets the napi id to the one that provided the last mptcp-in-sequence skb.
On Wed, 2021-05-12 at 12:04 +0200, Florian Westphal wrote:
> Paolo Abeni <pabeni@redhat.com> wrote:
> > On Tue, 2021-05-11 at 15:36 +0200, Florian Westphal wrote:
> > > +	if (sk_can_busy_loop(sk) &&
> > > +	    skb_queue_empty_lockless(&msk->receive_queue) &&
> > > +	    skb_queue_empty_lockless(&sk->sk_receive_queue) &&
> > > +	    inet_sk_state_load(sk) == TCP_ESTABLISHED)
> > > +		sk_busy_loop(sk, nonblock);
> >
> > If I understand the code correctly, here we need a valid
> > sk->sk_napi_id. That field is set by the TCP input path
> > in tcp_finish_connect()/tcp_v{4,6}_do_rcv()/tcp_child_process() for the
> > subflow ssk.
> >
> > I think sk_napi_id is not currently available for the msk?!?
> > Copying it from the subflow requires some care, as the msk could end up
> > busy polling on the 'wrong' napi instance ?!?
>
> What about:
>
> diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
> --- a/net/mptcp/protocol.c
> +++ b/net/mptcp/protocol.c
> @@ -303,6 +303,7 @@ static bool __mptcp_move_skb(struct mptcp_sock *msk, struct sock *ssk,
> 	if (tail && mptcp_try_coalesce(sk, tail, skb))
> 		return true;
>
> +	sk_mark_napi_id_once(sk, skb);
> 	skb_set_owner_r(skb, sk);
> 	__skb_queue_tail(&sk->sk_receive_queue, skb);
> 	return true;
>
> This sets the napi id to the one that provided the last mptcp-in-sequence skb.

I *think* the above sets the napi id to the one provided by the first
received skb. On the flip side, updating it at every packet would
probably cause contention on a 'read mostly' sock cacheline.

What about something as crazy as the following? Not even build tested,
but I hope it can illustrate the idea better than words. Then we could
call mptcp_busy_loop() instead of sk_busy_loop().

We could also try to be smarter, e.g. fetch all the napi_ids in a
single mptcp_for_each_subflow() traversal, and store them in a small,
fixed-size array, avoiding duplicate values. Thinking again of it, this
latter strategy looks much better than what is described below...
---
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 2d21a4793d9d..f5d71aae38cc 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -1960,6 +1960,43 @@ static bool __mptcp_move_skbs(struct mptcp_sock *msk)
 	return !skb_queue_empty(&msk->receive_queue);
 }
 
+/* return -1 if nr > avail subflow */
+static int napi_id_by_subflow_nr(struct sock *sk, int nr)
+{
+	bool slow = lock_sock_fast(sk);
+	int id = -1, i = 0;
+
+	mptcp_for_each_subflow(mptcp_sk(sk), subflow) {
+		if (i++ == nr) {
+			id = READ_ONCE(mptcp_subflow_tcp_sock(subflow)->sk_napi_id);
+			break;
+		}
+	}
+
+	unlock_sock_fast(sk, slow);
+	return id;
+}
+
+static void mptcp_busy_loop(struct sock *sk, bool nonblock)
+{
+	unsigned long start_time = busy_loop_current_time();
+	struct mptcp_subflow_context *subflow;
+
+	for (;;) {
+		int nr, napi_id = napi_id_by_subflow_nr(sk, 0);
+
+		for (nr = 0; napi_id >= 0; napi_id = napi_id_by_subflow_nr(sk, ++nr)) {
+			napi_busy_loop(napi_id, NULL, NULL,
+				       READ_ONCE(sk->sk_prefer_busy_poll),
+				       READ_ONCE(sk->sk_busy_poll_budget) ?: BUSY_POLL_BUDGET);
+			if (skb_queue_empty_lockless(&sk->sk_receive_queue))
+				return;
+		}
+		if (nonblock || sk_busy_loop_end(sk, start_time))
+			return;
+	}
+}
+
 static int mptcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
 			 int nonblock, int flags, int *addr_len)
 {
Paolo Abeni <pabeni@redhat.com> wrote:
> On Wed, 2021-05-12 at 12:04 +0200, Florian Westphal wrote:
> > Paolo Abeni <pabeni@redhat.com> wrote:
> > > On Tue, 2021-05-11 at 15:36 +0200, Florian Westphal wrote:
> > > > +	if (sk_can_busy_loop(sk) &&
> > > > +	    skb_queue_empty_lockless(&msk->receive_queue) &&
> > > > +	    skb_queue_empty_lockless(&sk->sk_receive_queue) &&
> > > > +	    inet_sk_state_load(sk) == TCP_ESTABLISHED)
> > > > +		sk_busy_loop(sk, nonblock);
> > >
> > > If I understand the code correctly, here we need a valid
> > > sk->sk_napi_id. That field is set by the TCP input path
> > > in tcp_finish_connect()/tcp_v{4,6}_do_rcv()/tcp_child_process() for the
> > > subflow ssk.
> > >
> > > I think sk_napi_id is not currently available for the msk?!?
> > > Copying it from the subflow requires some care, as the msk could end up
> > > busy polling on the 'wrong' napi instance ?!?
> >
> > What about:
> >
> > diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
> > --- a/net/mptcp/protocol.c
> > +++ b/net/mptcp/protocol.c
> > @@ -303,6 +303,7 @@ static bool __mptcp_move_skb(struct mptcp_sock *msk, struct sock *ssk,
> > 	if (tail && mptcp_try_coalesce(sk, tail, skb))
> > 		return true;
> >
> > +	sk_mark_napi_id_once(sk, skb);
> > 	skb_set_owner_r(skb, sk);
> > 	__skb_queue_tail(&sk->sk_receive_queue, skb);
> > 	return true;
> >
> > This sets the napi id to the one that provided the last mptcp-in-sequence skb.
>
> I *think* the above sets the napi id to the one provided by the first
> received skb.

Depends on whether the receive queue is empty or not.

It's the most recent subflow that had activity.

> What about something as crazy as the following? Not even build tested,
> but I hope it can illustrate the idea better than words.

I'm no longer convinced busypolling makes any sense with mptcp sockets,
see below.
> +	for (;;) {
> +		int nr, napi_id = napi_id_by_subflow_nr(sk, 0);
> +
> +		for (nr = 0; napi_id >= 0; napi_id = napi_id_by_subflow_nr(sk, ++nr)) {
> +			napi_busy_loop(napi_id, NULL, NULL,
> +				       READ_ONCE(sk->sk_prefer_busy_poll),
> +				       READ_ONCE(sk->sk_busy_poll_budget) ?: BUSY_POLL_BUDGET);
> +			if (skb_queue_empty_lockless(&sk->sk_receive_queue))
> +				return;

We would need a new conditional to early-terminate the busypoll loop.

Otherwise we might be busypolling on napi instance 1 while another
instance just made a packet available.

Furthermore, we'd have to divide the budget among the napi instances to
avoid excess usage.

The way I see it, busypoll should never increase latency, but with the
above, afaics, it would, since we don't have a 1:1 mapping of mptcp
socket to a napi instance that we expect packets on.

The way I see it there are two options:
1. Ignore busypoll requests, or
2. Set the napi instance ID from __mptcp_move_skb().
   As soon as the instance changes (another subflow on a different path),
   disable busypoll for that mptcp socket.

1) is simpler, 2) would still support busypoll for fallback mode and
for '1 flow' resp. 'multiple flows on same napi instance'. The latter
could happen if we assume a server with a single link that serves
clients with multiple uplinks.
On Fri, 2021-05-14 at 11:57 +0200, Florian Westphal wrote:
> Paolo Abeni <pabeni@redhat.com> wrote:
> > On Wed, 2021-05-12 at 12:04 +0200, Florian Westphal wrote:
> > > Paolo Abeni <pabeni@redhat.com> wrote:
> > > > On Tue, 2021-05-11 at 15:36 +0200, Florian Westphal wrote:
> > > > > +	if (sk_can_busy_loop(sk) &&
> > > > > +	    skb_queue_empty_lockless(&msk->receive_queue) &&
> > > > > +	    skb_queue_empty_lockless(&sk->sk_receive_queue) &&
> > > > > +	    inet_sk_state_load(sk) == TCP_ESTABLISHED)
> > > > > +		sk_busy_loop(sk, nonblock);
> > > >
> > > > If I understand the code correctly, here we need a valid
> > > > sk->sk_napi_id. That field is set by the TCP input path
> > > > in tcp_finish_connect()/tcp_v{4,6}_do_rcv()/tcp_child_process() for the
> > > > subflow ssk.
> > > >
> > > > I think sk_napi_id is not currently available for the msk?!?
> > > > Copying it from the subflow requires some care, as the msk could end up
> > > > busy polling on the 'wrong' napi instance ?!?
> > >
> > > What about:
> > >
> > > diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
> > > --- a/net/mptcp/protocol.c
> > > +++ b/net/mptcp/protocol.c
> > > @@ -303,6 +303,7 @@ static bool __mptcp_move_skb(struct mptcp_sock *msk, struct sock *ssk,
> > > 	if (tail && mptcp_try_coalesce(sk, tail, skb))
> > > 		return true;
> > >
> > > +	sk_mark_napi_id_once(sk, skb);
> > > 	skb_set_owner_r(skb, sk);
> > > 	__skb_queue_tail(&sk->sk_receive_queue, skb);
> > > 	return true;
> > >
> > > This sets the napi id to the one that provided the last mptcp-in-sequence skb.
> >
> > I *think* the above sets the napi id to the one provided by the first
> > received skb.
>
> Depends on whether the receive queue is empty or not.
>
> It's the most recent subflow that had activity.

I mean: sk_mark_napi_id_once() is called on the most recent subflow
that had activity, but AFAICS it only sets sk->sk_napi_id - the msk
napi id, in this case - if the current value is 0 - that is, if such
field has not been initialized yet.
> > +	for (;;) {
> > +		int nr, napi_id = napi_id_by_subflow_nr(sk, 0);
> > +
> > +		for (nr = 0; napi_id >= 0; napi_id = napi_id_by_subflow_nr(sk, ++nr)) {
> > +			napi_busy_loop(napi_id, NULL, NULL,
> > +				       READ_ONCE(sk->sk_prefer_busy_poll),
> > +				       READ_ONCE(sk->sk_busy_poll_budget) ?: BUSY_POLL_BUDGET);
> > +			if (skb_queue_empty_lockless(&sk->sk_receive_queue))
> > +				return;
>
> We would need a new conditional to early-terminate the busypoll loop.
>
> Otherwise we might be busypolling on napi instance 1 while another
> instance just made a packet available.
>
> Furthermore, we'd have to divide the budget among the napi instances to
> avoid excess usage.

With the above code, in the inner loop, we will do a single ->poll()
invocation on each napi_id: passing a NULL 'loop_end' argument to
napi_busy_loop() causes the latter to do a single poll invocation.
That should also produce fair budget usage.

> The way I see it, busypoll should never increase latency, but with the
> above, afaics, it would, since we don't have a 1:1 mapping of mptcp
> socket to a napi instance that we expect packets on.

We will have an N:1 mapping (where N is the number of subflows using
the code above, and could be "optimized" to the number of different
napi_ids in use by such subflows with the changes discussed in the last
part of my previous email).

But it still should not introduce additional latency. If e.g. we are
polling on napi_id_1 and packets arrive on napi_id_2, since the msk
socket lock is not held, the rx subflow input path will deliver them to
the msk receive queue, and the:

	'if (skb_queue_empty_lockless(&sk->sk_receive_queue))'

check will terminate the loop as soon as polling on napi_id_1
completes.

Well, that could actually introduce some latency, e.g. if napi_id_1 is
under heavy traffic sent from some different peer. Not 110% sure
anyway: BP reduces latency mostly by avoiding tripping on the process
scheduler, and we will get that part right.
Assuming this latter consideration is correct, we could eventually
consider always doing busy polling on a single napi instance (belonging
to an active, non-backup subflow) even in the presence of multiple
active subflows, and that should still reduce latency.

I guess a little experimentation will help ;)

> The way I see it there are two options:
> 1. Ignore busypoll requests, or
> 2. Set the napi instance ID from __mptcp_move_skb().
>    As soon as the instance changes (another subflow on a different path),
>    disable busypoll for that mptcp socket.
>
> 1) is simpler, 2) would still support busypoll for fallback mode and
> for '1 flow' resp. 'multiple flows on same napi instance'.

Both sound reasonable.

> The latter could happen if we assume a server with a single
> link that serves clients with multiple uplinks.

... and such link/NIC has a single napi instance, which is quite
uncommon on server H/W, I think.

I'm open to any option. Could you please review the above stuff
regarding multiple-napi BP? Does that make any sense?

Thanks!

Paolo
On Fri, 2021-05-14 at 12:35 +0200, Paolo Abeni wrote:
> We will have an N:1 mapping (where N is the number of subflows using
> the code above, and could be "optimized" to the number of different
> napi_ids in use by such subflows with the changes discussed in the
> last part of my previous email).
>
> But it still should not introduce additional latency. If e.g. we are
> polling on napi_id_1 and packets arrive on napi_id_2, since the msk
> socket lock is not held, the rx subflow input path will deliver them
> to the msk receive queue, and the:
>
> 	'if (skb_queue_empty_lockless(&sk->sk_receive_queue))'
>
> check will terminate the loop as soon as polling on napi_id_1
> completes.
>
> Well, that could actually introduce some latency, e.g. if napi_id_1 is
> under heavy traffic sent from some different peer. Not 110% sure
> anyway: BP reduces latency mostly by avoiding tripping on the process
> scheduler, and we will get that part right.
>
> Assuming this latter consideration is correct, we could eventually
> consider always doing busy polling on a single napi instance
> (belonging to an active, non-backup subflow) even in the presence of
> multiple active subflows, and that should still reduce latency.

If I read correctly, this is roughly the approach used by mptcp.org.

> I guess a little experimentation will help ;)

@Christoph, do you have info/paper/data for BP && mptcp && multiple
napi ids? Or (half-seriously) some university PhD willing to do some
academic research on it? ;)))

/P
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 652e55a0c6e8..86e599eb4403 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -12,6 +12,7 @@
 #include <linux/sched/signal.h>
 #include <linux/atomic.h>
 #include <net/sock.h>
+#include <net/busy_poll.h>
 #include <net/inet_common.h>
 #include <net/inet_hashtables.h>
 #include <net/protocol.h>
@@ -1982,6 +1983,12 @@ static int mptcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
 	if (unlikely(flags & MSG_ERRQUEUE))
 		return inet_recv_error(sk, msg, len, addr_len);
 
+	if (sk_can_busy_loop(sk) &&
+	    skb_queue_empty_lockless(&msk->receive_queue) &&
+	    skb_queue_empty_lockless(&sk->sk_receive_queue) &&
+	    inet_sk_state_load(sk) == TCP_ESTABLISHED)
+		sk_busy_loop(sk, nonblock);
+
 	mptcp_lock_sock(sk, __mptcp_splice_receive_queue(sk));
 	if (unlikely(sk->sk_state == TCP_LISTEN)) {
 		copied = -ENOTCONN;
diff --git a/net/mptcp/sockopt.c b/net/mptcp/sockopt.c
index a79798189599..ee55e4145d7e 100644
--- a/net/mptcp/sockopt.c
+++ b/net/mptcp/sockopt.c
@@ -254,6 +254,14 @@ static int mptcp_setsockopt_sol_socket(struct mptcp_sock *msk, int optname,
 		return mptcp_setsockopt_sol_socket_int(msk, optname, optval, optlen);
 	case SO_LINGER:
 		return mptcp_setsockopt_sol_socket_linger(msk, optval, optlen);
+	case SO_RCVLOWAT:
+	case SO_RCVTIMEO_OLD:
+	case SO_RCVTIMEO_NEW:
+	case SO_BUSY_POLL:
+	case SO_PREFER_BUSY_POLL:
+	case SO_BUSY_POLL_BUDGET:
+		/* No need to copy: only relevant for msk */
+		break;
 	case SO_NO_CHECK:
 	case SO_DONTROUTE:
 	case SO_BROADCAST:
The setting is only relevant for the msk socket.
While at it, also handle rcvlowat/rcvtimeo this way.

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 net/mptcp/protocol.c | 7 +++++++
 net/mptcp/sockopt.c  | 8 ++++++++
 2 files changed, 15 insertions(+)