From patchwork Mon Sep 14 21:52:09 2020
X-Patchwork-Submitter: Soheil Hassas Yeganeh
X-Patchwork-Id: 1363969
X-Patchwork-Delegate: davem@davemloft.net
From: Soheil Hassas Yeganeh
To: davem@davemloft.net, netdev@vger.kernel.org
Cc: edumazet@google.com, Soheil Hassas Yeganeh
Subject: [PATCH net-next 1/2] tcp: return EPOLLOUT from tcp_poll only when notsent_bytes is half the limit
Date: Mon, 14 Sep 2020 17:52:09 -0400
Message-Id: <20200914215210.2288109-1-soheil.kdev@gmail.com>

From: Soheil Hassas Yeganeh

If there is any event available on the TCP socket, tcp_poll() is called
to retrieve all of those events.  In tcp_poll(), we call
sk_stream_is_writeable(), which returns true as long as we are at least
one byte below notsent_lowat.  This results in quite a few spurious
EPOLLOUT events and, consequently, frequent tiny sendmsg() calls.

Similar to sk_stream_write_space(), use __sk_stream_is_writeable() with
a wake value of 1, so that we set EPOLLOUT only if half the space is
available for write.

Signed-off-by: Soheil Hassas Yeganeh
Signed-off-by: Eric Dumazet
---
 net/ipv4/tcp.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index d3781b6087cb..48c351804efc 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -564,7 +564,7 @@ __poll_t tcp_poll(struct file *file, struct socket *sock, poll_table *wait)
 			mask |= EPOLLIN | EPOLLRDNORM;

 		if (!(sk->sk_shutdown & SEND_SHUTDOWN)) {
-			if (sk_stream_is_writeable(sk)) {
+			if (__sk_stream_is_writeable(sk, 1)) {
 				mask |= EPOLLOUT | EPOLLWRNORM;
 			} else {  /* send SIGIO later */
 				sk_set_bit(SOCKWQ_ASYNC_NOSPACE, sk);
@@ -576,7 +576,7 @@ __poll_t tcp_poll(struct file *file, struct socket *sock, poll_table *wait)
 				 * pairs with the input side.
 				 */
 				smp_mb__after_atomic();
-				if (sk_stream_is_writeable(sk))
+				if (__sk_stream_is_writeable(sk, 1))
 					mask |= EPOLLOUT | EPOLLWRNORM;
 			}
 		} else
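For readers unfamiliar with the wake argument: with wake set to 1, the
writability test used by tcp_poll() only passes once the not-yet-sent
bytes have dropped below roughly half of the notsent_lowat budget.  A
minimal standalone model of that check follows; it is not the kernel
code, and the struct and helper names are illustrative stand-ins only.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Simplified model of the check reached via __sk_stream_is_writeable(sk, 1)
 * after this patch.  The struct and helper are illustrative stand-ins, not
 * kernel definitions.
 */
struct model_sock {
	uint32_t write_seq;	/* last byte queued by the application */
	uint32_t snd_nxt;	/* next byte the stack will transmit */
	uint32_t notsent_lowat;	/* TCP_NOTSENT_LOWAT budget */
};

/* wake == 0: plain lowat test; wake == 1: require half the budget free. */
static bool model_stream_writeable(const struct model_sock *sk, int wake)
{
	uint32_t notsent = sk->write_seq - sk->snd_nxt;

	return (notsent << wake) < sk->notsent_lowat;
}

int main(void)
{
	struct model_sock sk = { .write_seq = 1000, .snd_nxt = 400,
				 .notsent_lowat = 1024 };

	/* 600 unsent bytes: below lowat, but not below half of it. */
	printf("wake=0 -> writable=%d\n", model_stream_writeable(&sk, 0)); /* 1 */
	printf("wake=1 -> writable=%d\n", model_stream_writeable(&sk, 1)); /* 0 */
	return 0;
}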
From patchwork Mon Sep 14 21:52:10 2020
X-Patchwork-Submitter: Soheil Hassas Yeganeh
X-Patchwork-Id: 1363970
X-Patchwork-Delegate: davem@davemloft.net
From: Soheil Hassas Yeganeh
To: davem@davemloft.net, netdev@vger.kernel.org
Cc: edumazet@google.com, Soheil Hassas Yeganeh
Subject: [PATCH net-next 2/2] tcp: schedule EPOLLOUT after a partial sendmsg
Date: Mon, 14 Sep 2020 17:52:10 -0400
Message-Id: <20200914215210.2288109-2-soheil.kdev@gmail.com>
In-Reply-To: <20200914215210.2288109-1-soheil.kdev@gmail.com>
References: <20200914215210.2288109-1-soheil.kdev@gmail.com>

From: Soheil Hassas Yeganeh

For EPOLLET, applications must keep calling sendmsg until they get
EAGAIN; otherwise, there is no guarantee that EPOLLOUT is sent if a
memory allocation failed.  As a result, on high-speed NICs, userspace
observes multiple small sendmsgs after a partial sendmsg until EAGAIN,
since TCP can send 1-2 TSOs in between two sendmsg syscalls:

  // One large partial send due to memory allocation failure.
  sendmsg(20MB) = 2MB
  // Many small sends until EAGAIN.
  sendmsg(18MB) = 64KB
  sendmsg(17.9MB) = 128KB
  sendmsg(17.8MB) = 64KB
  ...
  sendmsg(...) = EAGAIN
  // At this point, userspace can assume an EPOLLOUT.

To fix this, set the SOCK_NOSPACE flag in all partial sendmsg scenarios
to guarantee that we send EPOLLOUT after a partial sendmsg.  After this
commit, userspace can assume that it will receive an EPOLLOUT after the
first partial sendmsg.  This EPOLLOUT will benefit from the
sk_stream_write_space() logic delaying the EPOLLOUT until significant
space is available in the write queue.
Signed-off-by: Eric Dumazet
Signed-off-by: Soheil Hassas Yeganeh
---
 net/ipv4/tcp.c | 20 +++++++++-----------
 1 file changed, 9 insertions(+), 11 deletions(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 48c351804efc..65057744fac8 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1004,12 +1004,12 @@ ssize_t do_tcp_sendpages(struct sock *sk, struct page *page, int offset,
 		    !tcp_skb_can_collapse_to(skb)) {
 new_segment:
 			if (!sk_stream_memory_free(sk))
-				goto wait_for_sndbuf;
+				goto wait_for_space;

 			skb = sk_stream_alloc_skb(sk, 0, sk->sk_allocation,
 					tcp_rtx_and_write_queues_empty(sk));
 			if (!skb)
-				goto wait_for_memory;
+				goto wait_for_space;

 #ifdef CONFIG_TLS_DEVICE
 			skb->decrypted = !!(flags & MSG_SENDPAGE_DECRYPTED);
@@ -1028,7 +1028,7 @@ ssize_t do_tcp_sendpages(struct sock *sk, struct page *page, int offset,
 			goto new_segment;
 		}
 		if (!sk_wmem_schedule(sk, copy))
-			goto wait_for_memory;
+			goto wait_for_space;

 		if (can_coalesce) {
 			skb_frag_size_add(&skb_shinfo(skb)->frags[i - 1], copy);
@@ -1069,9 +1069,8 @@ ssize_t do_tcp_sendpages(struct sock *sk, struct page *page, int offset,
 			tcp_push_one(sk, mss_now);
 		continue;

-wait_for_sndbuf:
+wait_for_space:
 		set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
-wait_for_memory:
 		tcp_push(sk, flags & ~MSG_MORE, mss_now,
 			 TCP_NAGLE_PUSH, size_goal);

@@ -1282,7 +1281,7 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)

 new_segment:
 			if (!sk_stream_memory_free(sk))
-				goto wait_for_sndbuf;
+				goto wait_for_space;

 			if (unlikely(process_backlog >= 16)) {
 				process_backlog = 0;
@@ -1293,7 +1292,7 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
 			skb = sk_stream_alloc_skb(sk, 0, sk->sk_allocation,
 						  first_skb);
 			if (!skb)
-				goto wait_for_memory;
+				goto wait_for_space;

 			process_backlog++;
 			skb->ip_summed = CHECKSUM_PARTIAL;
@@ -1326,7 +1325,7 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
 			struct page_frag *pfrag = sk_page_frag(sk);

 			if (!sk_page_frag_refill(sk, pfrag))
-				goto wait_for_memory;
+				goto wait_for_space;

 			if (!skb_can_coalesce(skb, i, pfrag->page,
 					      pfrag->offset)) {
@@ -1340,7 +1339,7 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
 			copy = min_t(int, copy, pfrag->size - pfrag->offset);

 			if (!sk_wmem_schedule(sk, copy))
-				goto wait_for_memory;
+				goto wait_for_space;

 			err = skb_copy_to_page_nocache(sk, &msg->msg_iter, skb,
 						       pfrag->page,
@@ -1393,9 +1392,8 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
 			tcp_push_one(sk, mss_now);
 		continue;

-wait_for_sndbuf:
+wait_for_space:
 		set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
-wait_for_memory:
 		if (copied)
 			tcp_push(sk, flags & ~MSG_MORE, mss_now,
 				 TCP_NAGLE_PUSH, size_goal);
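As a userspace illustration of the rule stated in the commit message
(with EPOLLET, keep calling sendmsg until EAGAIN), here is a hedged
sketch.  It is not part of the patch; the function name is made up for
illustration, and fd is assumed to be a connected non-blocking TCP
socket registered for EPOLLOUT | EPOLLET.

#include <errno.h>
#include <sys/types.h>
#include <sys/socket.h>

/*
 * Illustrative edge-triggered writer (not part of the patch).  After an
 * EPOLLOUT wakeup, keep calling send() until it returns EAGAIN or all data
 * is queued; stopping after a single partial send risks a stall, because
 * EPOLLET only reports edges.  Returns 1 when everything is queued, 0 when
 * the caller should wait for the next EPOLLOUT, and -1 on error.
 */
static int drain_until_eagain(int fd, const char *buf, size_t len, size_t *off)
{
	while (*off < len) {
		ssize_t n = send(fd, buf + *off, len - *off, MSG_NOSIGNAL);

		if (n > 0) {
			*off += (size_t)n;
			continue;
		}
		if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
			return 0;	/* wait for the next EPOLLOUT edge */
		return -1;		/* hard error */
	}
	return 1;
}

A caller would register fd with epoll_ctl() for EPOLLOUT | EPOLLET and
re-run this loop on every EPOLLOUT event; with this patch applied, such
an event is guaranteed to follow the first partial sendmsg.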