From patchwork Wed Oct 17 00:16:44 2018
Subject: [PATCH net-next 1/2] tcp_bbr: adjust TCP BBR for departure time pacing
From: Neal Cardwell
To: David Miller
Cc: netdev@vger.kernel.org, Neal Cardwell, Yuchung Cheng, Eric Dumazet
Date: Tue, 16 Oct 2018 20:16:44 -0400
Message-Id: <20181017001645.261770-2-ncardwell@google.com>
In-Reply-To: <20181017001645.261770-1-ncardwell@google.com>
X-Patchwork-Id: 985070

Adjust TCP BBR for the new departure time pacing model in the recent
commit
ab408b6dc7449 ("tcp: switch tcp and sch_fq to new earliest departure
time model").

With TSQ and pacing at lower layers, there are often several skbs
queued in the pacing layer, and thus there is less data "in the
network" than "in flight". With departure time pacing at lower layers
(e.g. fq or potential future NICs), the data in the pacing layer now
has a pre-scheduled ("baked-in") departure time that cannot be
changed, even if the congestion control algorithm decides to use a new
pacing rate.

This means that there can be a non-trivial lag between when BBR makes
a pacing rate change and when the inter-skb pacing delays change.
After a pacing rate change, the number of packets in the network can
gradually evolve to be higher or lower, depending on whether the
sending rate is higher or lower than the delivery rate. Thus ignoring
this lag can cause significant overshoot, with the flow ending up with
too many or too few packets in the network.

This commit changes BBR to adapt its pacing rate based on the amount
of data in the network that it estimates has already been "baked in"
by previous departure time decisions. We estimate the number of our
packets that will be in the network at the earliest departure time
(EDT) for the next skb scheduled as:

   in_network_at_edt = inflight_at_edt - (EDT - now) * bw

If we're increasing the amount of data in the network ("in_network"),
then we want to know if the transmit of the EDT skb will push
in_network above the target, so our answer includes
bbr_tso_segs_goal() from the skb departing at EDT. If we're decreasing
in_network, then we want to know if in_network will sink too low just
before the EDT transmit, so our answer does not include the segments
from the skb departing at EDT.

Why do we treat the pacing_gain > 1.0 case and the pacing_gain < 1.0
case differently? The in_network curve is a step function: in_network
goes up on transmits, and down on ACKs. To accurately predict when
in_network will cross our target value, we must look at different
events, depending on whether we're concerned about in_network
potentially going too high or too low:

 o if pushing in_network up (pacing_gain > 1.0),
   then in_network goes above target upon a transmit event

 o if pushing in_network down (pacing_gain < 1.0),
   then in_network goes below target upon an ACK event

This commit changes the BBR state machine to use this estimated
"packets in network" value to make its decisions.
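To make the arithmetic concrete, here is a rough standalone sketch of
the estimator (an illustration only, not the kernel code: plain
integers replace the kernel's BW_SCALE fixed point, and the constant 2
is a stand-in for bbr_tso_segs_goal()):

  #include <stdio.h>

  /* Estimate how many packets are still in the network at the EDT of
   * the next scheduled skb: inflight minus what the network should
   * deliver (ACK) between now and the EDT.
   */
  static unsigned int packets_in_net_at_edt(unsigned int inflight_now,
                                            unsigned long edt_minus_now_us,
                                            unsigned long bw_pkts_per_ms,
                                            int gain_above_unit)
  {
          unsigned long delivered = edt_minus_now_us * bw_pkts_per_ms / 1000;
          unsigned long inflight_at_edt = inflight_now;

          if (gain_above_unit)           /* probing up: count the EDT skb */
                  inflight_at_edt += 2;  /* stand-in for bbr_tso_segs_goal() */
          if (delivered >= inflight_at_edt)
                  return 0;
          return inflight_at_edt - delivered;
  }

  int main(void)
  {
          /* 300 packets in flight, EDT 2000 us out, bw = 50 pkt/ms:
           * 100 packets drain from the network before the EDT transmit.
           */
          printf("%u\n", packets_in_net_at_edt(300, 2000, 50, 1)); /* 202 */
          printf("%u\n", packets_in_net_at_edt(300, 2000, 50, 0)); /* 200 */
          return 0;
  }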
Signed-off-by: Neal Cardwell
Signed-off-by: Yuchung Cheng
Signed-off-by: Eric Dumazet
---
 net/ipv4/tcp_bbr.c | 37 +++++++++++++++++++++++++++++++++++--
 1 file changed, 35 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp_bbr.c b/net/ipv4/tcp_bbr.c
index b88081285fd17..4cc2223d2cd54 100644
--- a/net/ipv4/tcp_bbr.c
+++ b/net/ipv4/tcp_bbr.c
@@ -369,6 +369,39 @@ static u32 bbr_target_cwnd(struct sock *sk, u32 bw, int gain)
 	return cwnd;
 }
 
+/* With pacing at lower layers, there's often less data "in the network" than
+ * "in flight". With TSQ and departure time pacing at lower layers (e.g. fq),
+ * we often have several skbs queued in the pacing layer with a pre-scheduled
+ * earliest departure time (EDT). BBR adapts its pacing rate based on the
+ * inflight level that it estimates has already been "baked in" by previous
+ * departure time decisions. We calculate a rough estimate of the number of our
+ * packets that might be in the network at the earliest departure time for the
+ * next skb scheduled:
+ *   in_network_at_edt = inflight_at_edt - (EDT - now) * bw
+ * If we're increasing inflight, then we want to know if the transmit of the
+ * EDT skb will push inflight above the target, so inflight_at_edt includes
+ * bbr_tso_segs_goal() from the skb departing at EDT. If decreasing inflight,
+ * then estimate if inflight will sink too low just before the EDT transmit.
+ */
+static u32 bbr_packets_in_net_at_edt(struct sock *sk, u32 inflight_now)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+	struct bbr *bbr = inet_csk_ca(sk);
+	u64 now_ns, edt_ns, interval_us;
+	u32 interval_delivered, inflight_at_edt;
+
+	now_ns = tp->tcp_clock_cache;
+	edt_ns = max(tp->tcp_wstamp_ns, now_ns);
+	interval_us = div_u64(edt_ns - now_ns, NSEC_PER_USEC);
+	interval_delivered = (u64)bbr_bw(sk) * interval_us >> BW_SCALE;
+	inflight_at_edt = inflight_now;
+	if (bbr->pacing_gain > BBR_UNIT)              /* increasing inflight */
+		inflight_at_edt += bbr_tso_segs_goal(sk);  /* include EDT skb */
+	if (interval_delivered >= inflight_at_edt)
+		return 0;
+	return inflight_at_edt - interval_delivered;
+}
+
 /* An optimization in BBR to reduce losses: On the first round of recovery, we
  * follow the packet conservation principle: send P packets per P packets acked.
  * After that, we slow-start and send at most 2*P packets per P packets acked.
@@ -460,7 +493,7 @@ static bool bbr_is_next_cycle_phase(struct sock *sk,
 	if (bbr->pacing_gain == BBR_UNIT)
 		return is_full_length;		/* just use wall clock time */
 
-	inflight = rs->prior_in_flight;  /* what was in-flight before ACK? */
+	inflight = bbr_packets_in_net_at_edt(sk, rs->prior_in_flight);
 	bw = bbr_max_bw(sk);
 
 	/* A pacing_gain > 1.0 probes for bw by trying to raise inflight to at
@@ -741,7 +774,7 @@ static void bbr_check_drain(struct sock *sk, const struct rate_sample *rs)
 			bbr_target_cwnd(sk, bbr_max_bw(sk), BBR_UNIT);
 	}	/* fall through to check if in-flight is already small: */
 	if (bbr->mode == BBR_DRAIN &&
-	    tcp_packets_in_flight(tcp_sk(sk)) <=
+	    bbr_packets_in_net_at_edt(sk, tcp_packets_in_flight(tcp_sk(sk))) <=
 	    bbr_target_cwnd(sk, bbr_max_bw(sk), BBR_UNIT))
 		bbr_reset_probe_bw_mode(sk);  /* we estimate queue is drained */
 }
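As a quick sanity check of the BW_SCALE fixed-point step in
bbr_packets_in_net_at_edt() above, a minimal userspace calculation
(assuming tcp_bbr.c's convention of bw stored as packets per usec
scaled by 2^BW_SCALE, with BW_SCALE = 24; the rate and interval are
made-up example numbers):

  #include <stdint.h>
  #include <stdio.h>

  #define BW_SCALE 24

  int main(void)
  {
          /* ~10 Gbit/s with 1500-byte packets is ~0.833 pkt/us. */
          uint64_t bw = (uint64_t)(0.833 * (1 << BW_SCALE));
          uint64_t interval_us = 2000;   /* EDT is 2 ms in the future */
          uint32_t delivered = bw * interval_us >> BW_SCALE;

          /* ~1666 packets expected to drain before the EDT transmit. */
          printf("%u packets drain before EDT\n", (unsigned)delivered);
          return 0;
  }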
From patchwork Wed Oct 17 00:16:45 2018
Subject: [PATCH net-next 2/2] tcp_bbr: centralize code to set gains
From: Neal Cardwell
To: David Miller
Cc: netdev@vger.kernel.org, Neal Cardwell, Yuchung Cheng,
 Soheil Hassas Yeganeh, Priyaranjan Jha, Eric Dumazet
Date: Tue, 16 Oct 2018 20:16:45 -0400
Message-Id: <20181017001645.261770-3-ncardwell@google.com>
In-Reply-To: <20181017001645.261770-1-ncardwell@google.com>
X-Patchwork-Id: 985071

Centralize the code that sets gains used for computing cwnd and pacing
rate. This simplifies the code and makes it easier to change the state
machine or (in the future) dynamically change the gain values and
ensure that the correct gain values are always used.

Signed-off-by: Neal Cardwell
Signed-off-by: Yuchung Cheng
Signed-off-by: Soheil Hassas Yeganeh
Signed-off-by: Priyaranjan Jha
Signed-off-by: Eric Dumazet
---
 net/ipv4/tcp_bbr.c | 40 ++++++++++++++++++++++++++++++----------
 1 file changed, 30 insertions(+), 10 deletions(-)

diff --git a/net/ipv4/tcp_bbr.c b/net/ipv4/tcp_bbr.c
index 4cc2223d2cd54..9277abdd822a0 100644
--- a/net/ipv4/tcp_bbr.c
+++ b/net/ipv4/tcp_bbr.c
@@ -521,8 +521,6 @@ static void bbr_advance_cycle_phase(struct sock *sk)
 
 	bbr->cycle_idx = (bbr->cycle_idx + 1) & (CYCLE_LEN - 1);
 	bbr->cycle_mstamp = tp->delivered_mstamp;
-	bbr->pacing_gain = bbr->lt_use_bw ? BBR_UNIT :
-					    bbr_pacing_gain[bbr->cycle_idx];
 }
 
 /* Gain cycling: cycle pacing gain to converge to fair share of available bw. */
@@ -540,8 +538,6 @@ static void bbr_reset_startup_mode(struct sock *sk)
 	struct bbr *bbr = inet_csk_ca(sk);
 
 	bbr->mode = BBR_STARTUP;
-	bbr->pacing_gain = bbr_high_gain;
-	bbr->cwnd_gain	 = bbr_high_gain;
 }
 
 static void bbr_reset_probe_bw_mode(struct sock *sk)
@@ -549,8 +545,6 @@ static void bbr_reset_probe_bw_mode(struct sock *sk)
 	struct bbr *bbr = inet_csk_ca(sk);
 
 	bbr->mode = BBR_PROBE_BW;
-	bbr->pacing_gain = BBR_UNIT;
-	bbr->cwnd_gain = bbr_cwnd_gain;
 	bbr->cycle_idx = CYCLE_LEN - 1 - prandom_u32_max(bbr_cycle_rand);
 	bbr_advance_cycle_phase(sk);	/* flip to next phase of gain cycle */
 }
@@ -768,8 +762,6 @@ static void bbr_check_drain(struct sock *sk, const struct rate_sample *rs)
 	if (bbr->mode == BBR_STARTUP && bbr_full_bw_reached(sk)) {
 		bbr->mode = BBR_DRAIN;	/* drain queue we created */
-		bbr->pacing_gain = bbr_drain_gain;	/* pace slow to drain */
-		bbr->cwnd_gain = bbr_high_gain;	/* maintain cwnd */
 		tcp_sk(sk)->snd_ssthresh =
 				bbr_target_cwnd(sk, bbr_max_bw(sk), BBR_UNIT);
 	}	/* fall through to check if in-flight is already small: */
@@ -831,8 +823,6 @@ static void bbr_update_min_rtt(struct sock *sk, const struct rate_sample *rs)
 	if (bbr_probe_rtt_mode_ms > 0 && filter_expired &&
 	    !bbr->idle_restart && bbr->mode != BBR_PROBE_RTT) {
 		bbr->mode = BBR_PROBE_RTT;  /* dip, drain queue */
-		bbr->pacing_gain = BBR_UNIT;
-		bbr->cwnd_gain = BBR_UNIT;
 		bbr_save_cwnd(sk);  /* note cwnd so we can restore it */
 		bbr->probe_rtt_done_stamp = 0;
 	}
@@ -860,6 +850,35 @@ static void bbr_update_min_rtt(struct sock *sk, const struct rate_sample *rs)
 	bbr->idle_restart = 0;
 }
 
+static void bbr_update_gains(struct sock *sk)
+{
+	struct bbr *bbr = inet_csk_ca(sk);
+
+	switch (bbr->mode) {
+	case BBR_STARTUP:
+		bbr->pacing_gain = bbr_high_gain;
+		bbr->cwnd_gain	 = bbr_high_gain;
+		break;
+	case BBR_DRAIN:
+		bbr->pacing_gain = bbr_drain_gain;	/* slow, to drain */
+		bbr->cwnd_gain	 = bbr_high_gain;	/* keep cwnd */
+		break;
+	case BBR_PROBE_BW:
+		bbr->pacing_gain = (bbr->lt_use_bw ?
+				    BBR_UNIT :
+				    bbr_pacing_gain[bbr->cycle_idx]);
+		bbr->cwnd_gain	 = bbr_cwnd_gain;
+		break;
+	case BBR_PROBE_RTT:
+		bbr->pacing_gain = BBR_UNIT;
+		bbr->cwnd_gain	 = BBR_UNIT;
+		break;
+	default:
+		WARN_ONCE(1, "BBR bad mode: %u\n", bbr->mode);
+		break;
+	}
+}
+
 static void bbr_update_model(struct sock *sk, const struct rate_sample *rs)
 {
 	bbr_update_bw(sk, rs);
@@ -867,6 +886,7 @@ static void bbr_update_model(struct sock *sk, const struct rate_sample *rs)
 	bbr_check_full_bw_reached(sk, rs);
 	bbr_check_drain(sk, rs);
 	bbr_update_min_rtt(sk, rs);
+	bbr_update_gains(sk);
 }
 
 static void bbr_main(struct sock *sk, const struct rate_sample *rs)
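The structure of bbr_update_gains() above amounts to deriving the
gains from the mode in one place, rather than setting them at each
mode transition. The same pattern in a miniature userspace sketch (the
values are plain doubles approximating tcp_bbr.c's fixed-point
constants, bbr_high_gain ~ 2.89, bbr_drain_gain ~ 1/2.89,
bbr_cwnd_gain = 2, and PROBE_BW is simplified to a single pacing gain
instead of the bbr_pacing_gain[] cycle):

  #include <stdio.h>

  enum mode { STARTUP, DRAIN, PROBE_BW, PROBE_RTT };

  struct gains { double pacing; double cwnd; };

  /* Single source of truth: gains are a function of the mode. */
  static struct gains gains_for(enum mode m)
  {
          switch (m) {
          case STARTUP:   return (struct gains){ 2.89, 2.89 };
          case DRAIN:     return (struct gains){ 1 / 2.89, 2.89 };
          case PROBE_BW:  return (struct gains){ 1.0, 2.0 };
          case PROBE_RTT: return (struct gains){ 1.0, 1.0 };
          }
          return (struct gains){ 1.0, 1.0 };  /* unreachable */
  }

  int main(void)
  {
          struct gains g = gains_for(DRAIN);

          printf("pacing %.3f cwnd %.2f\n", g.pacing, g.cwnd);
          return 0;
  }

With transitions only setting bbr->mode and the gains recomputed once
per model update, a transition can never leave a stale gain behind,
which is the consistency property the commit message describes.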