From patchwork Fri Apr 27 18:50:37 2018
X-Patchwork-Submitter: Soheil Hassas Yeganeh
X-Patchwork-Id: 905914
X-Patchwork-Delegate: davem@davemloft.net
From: Soheil Hassas Yeganeh
To: davem@davemloft.net, netdev@vger.kernel.org
Cc: ycheng@google.com, ncardwell@google.com, edumazet@google.com, willemb@google.com, Soheil Hassas Yeganeh
Subject: [PATCH net-next 1/2] tcp: send in-queue bytes in cmsg upon read
Date: Fri, 27 Apr 2018 14:50:37 -0400
Message-Id: <20180427185038.32714-1-soheil.kdev@gmail.com>
X-Mailing-List: netdev@vger.kernel.org

From: Soheil Hassas Yeganeh

Applications with many concurrent connections, high variance in receive queue length, and tight memory bounds cannot allocate worst-case buffer sizes to drain sockets. Knowing the length of the receive queue, applications can optimize how they allocate buffers to read from the socket.
The number of bytes pending on the socket is directly available through ioctl(FIONREAD/SIOCINQ) and can be approximated using getsockopt(MEMINFO) (rmem_alloc includes skb overheads in addition to application data). But both of these options add an extra syscall per recvmsg. Moreover, ioctl(FIONREAD/SIOCINQ) takes the socket lock.

Add the TCP_INQ socket option to TCP. When this socket option is set, recvmsg() relays the number of bytes available on the socket for reading to the application via the TCP_CM_INQ control message.

Calculate the number of bytes after releasing the socket lock to include the processed backlog, if any. To avoid an extra branch in the hot path of recvmsg() for this new control message, move all cmsg processing inside an existing branch for processing receive timestamps.

Since the socket lock is not held when calculating the size of the receive queue, TCP_INQ is a hint. For example, it can overestimate the queue size by one byte, if FIN is received.

With this method, applications can start reading from the socket using a small buffer, and then use larger buffers based on the remaining data when needed.

Signed-off-by: Soheil Hassas Yeganeh
Signed-off-by: Yuchung Cheng
Signed-off-by: Willem de Bruijn
Reviewed-by: Eric Dumazet
Reviewed-by: Neal Cardwell
---
 include/linux/tcp.h      |  2 +-
 include/net/tcp.h        |  8 ++++++++
 include/uapi/linux/tcp.h |  3 +++
 net/ipv4/tcp.c           | 27 +++++++++++++++++++++++----
 4 files changed, 35 insertions(+), 5 deletions(-)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 20585d5c4e1c3..807776928cb86 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -228,7 +228,7 @@ struct tcp_sock {
 		unused:2;
 	u8	nonagle     : 4,/* Disable Nagle algorithm?             */
 		thin_lto    : 1,/* Use linear timeouts for thin streams */
-		unused1	    : 1,
+		recvmsg_inq : 1,/* Indicate # of bytes in queue upon recvmsg */
 		repair      : 1,
 		frto	    : 1;/* F-RTO (RFC5682) activated in CA_Loss */
 	u8	repair_queue;
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 833154e3df173..0986836b5df5b 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1951,6 +1951,14 @@ static inline int tcp_inq(struct sock *sk)
 	return answ;
 }
 
+static inline int tcp_inq_hint(const struct sock *sk)
+{
+	const struct tcp_sock *tp = tcp_sk(sk);
+
+	return max_t(int, 0,
+		     READ_ONCE(tp->rcv_nxt) - READ_ONCE(tp->copied_seq));
+}
+
 int tcp_peek_len(struct socket *sock);
 
 static inline void tcp_segs_in(struct tcp_sock *tp, const struct sk_buff *skb)
diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h
index 379b08700a542..d4cdd25a7bd48 100644
--- a/include/uapi/linux/tcp.h
+++ b/include/uapi/linux/tcp.h
@@ -122,6 +122,9 @@ enum {
 #define TCP_MD5SIG_EXT		32	/* TCP MD5 Signature with extensions */
 #define TCP_FASTOPEN_KEY	33	/* Set the key for Fast Open (cookie) */
 #define TCP_FASTOPEN_NO_COOKIE	34	/* Enable TFO without a TFO cookie */
+#define TCP_INQ			35	/* Notify bytes available to read as a cmsg on read */
+
+#define TCP_CM_INQ		TCP_INQ
 
 struct tcp_repair_opt {
 	__u32	opt_code;
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index dfd090ea54ad4..5a7056980f730 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1910,13 +1910,14 @@ int tcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, int nonblock,
 	u32 peek_seq;
 	u32 *seq;
 	unsigned long used;
-	int err;
+	int err, inq;
 	int target;		/* Read at least this many bytes */
 	long timeo;
 	struct sk_buff *skb, *last;
 	u32 urg_hole = 0;
 	struct scm_timestamping tss;
 	bool has_tss = false;
+	bool has_cmsg;
 
 	if (unlikely(flags & MSG_ERRQUEUE))
 		return inet_recv_error(sk, msg, len, addr_len);
@@ -1931,6 +1932,7 @@ int tcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, int nonblock,
 	if (sk->sk_state == TCP_LISTEN)
 		goto out;
 
+	has_cmsg = tp->recvmsg_inq;
 	timeo = sock_rcvtimeo(sk, nonblock);
 
 	/* Urgent data needs to be handled specially. */
@@ -2117,6 +2119,7 @@ int tcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, int nonblock,
 		if (TCP_SKB_CB(skb)->has_rxtstamp) {
 			tcp_update_recv_tstamps(skb, &tss);
 			has_tss = true;
+			has_cmsg = true;
 		}
 		if (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN)
 			goto found_fin_ok;
@@ -2136,13 +2139,20 @@ int tcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, int nonblock,
 	 * on connected socket. I was just happy when found this 8) --ANK
 	 */
 
-	if (has_tss)
-		tcp_recv_timestamp(msg, sk, &tss);
-
 	/* Clean up data we have read: This will do ACK frames. */
 	tcp_cleanup_rbuf(sk, copied);
 
 	release_sock(sk);
+
+	if (has_cmsg) {
+		if (has_tss)
+			tcp_recv_timestamp(msg, sk, &tss);
+		if (tp->recvmsg_inq) {
+			inq = tcp_inq_hint(sk);
+			put_cmsg(msg, SOL_TCP, TCP_CM_INQ, sizeof(inq), &inq);
+		}
+	}
+
 	return copied;
 
 out:
@@ -3011,6 +3021,12 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
 		tp->notsent_lowat = val;
 		sk->sk_write_space(sk);
 		break;
+	case TCP_INQ:
+		if (val > 1 || val < 0)
+			err = -EINVAL;
+		else
+			tp->recvmsg_inq = val;
+		break;
 	default:
 		err = -ENOPROTOOPT;
 		break;
@@ -3436,6 +3452,9 @@ static int do_tcp_getsockopt(struct sock *sk, int level,
 	case TCP_NOTSENT_LOWAT:
 		val = tp->notsent_lowat;
 		break;
+	case TCP_INQ:
+		val = tp->recvmsg_inq;
+		break;
 	case TCP_SAVE_SYN:
 		val = tp->save_syn;
 		break;