From patchwork Wed Oct 17 00:16:44 2018
Subject: [PATCH net-next 1/2] tcp_bbr: adjust TCP BBR for departure time pacing
From: Neal Cardwell
To: David Miller
Cc: netdev@vger.kernel.org, Neal Cardwell, Yuchung Cheng, Eric Dumazet
Date: Tue, 16 Oct 2018 20:16:44 -0400
Message-Id: <20181017001645.261770-2-ncardwell@google.com>
In-Reply-To: <20181017001645.261770-1-ncardwell@google.com>
X-Patchwork-Id: 985070

Adjust TCP BBR for the new departure time pacing model in the recent
commit
ab408b6dc7449 ("tcp: switch tcp and sch_fq to new earliest departure
time model").

With TSQ and pacing at lower layers, there are often several skbs
queued in the pacing layer, and thus there is less data "in the
network" than "in flight". With departure time pacing at lower layers
(e.g. fq or potential future NICs), the data in the pacing layer now
has a pre-scheduled ("baked-in") departure time that cannot be
changed, even if the congestion control algorithm decides to use a new
pacing rate.

This means that there can be a non-trivial lag between when BBR makes
a pacing rate change and when the inter-skb pacing delays change.
After a pacing rate change, the number of packets in the network can
gradually evolve to be higher or lower, depending on whether the
sending rate is higher or lower than the delivery rate. Thus ignoring
this lag can cause significant overshoot, with the flow ending up with
too many or too few packets in the network.

This commit changes BBR to adapt its pacing rate based on the amount
of data in the network that it estimates has already been "baked in"
by previous departure time decisions. We estimate the number of our
packets that will be in the network at the earliest departure time
(EDT) for the next skb scheduled as:

   in_network_at_edt = inflight_at_edt - (EDT - now) * bw

If we're increasing the amount of data in the network ("in_network"),
then we want to know if the transmit of the EDT skb will push
in_network above the target, so our answer includes
bbr_tso_segs_goal() from the skb departing at EDT. If we're decreasing
in_network, then we want to know if in_network will sink too low just
before the EDT transmit, so our answer does not include the segments
from the skb departing at EDT.

Why do we treat the pacing_gain > 1.0 case and the pacing_gain < 1.0
case differently? The in_network curve is a step function: in_network
goes up on transmits, and down on ACKs. To accurately predict when
in_network will cross our target value, we must look at different
events, depending on whether we're concerned about in_network
potentially going too high or too low:

 o if pushing in_network up (pacing_gain > 1.0),
   then in_network goes above target upon a transmit event

 o if pushing in_network down (pacing_gain < 1.0),
   then in_network goes below target upon an ACK event

This commit changes the BBR state machine to use this estimated
"packets in network" value to make its decisions.
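To make the arithmetic concrete, here is a rough standalone sketch of
the estimator (an illustration only, not the kernel code: plain
integers replace the kernel's BW_SCALE fixed point, and the constant 2
is a stand-in for bbr_tso_segs_goal()):

  #include <stdio.h>

  /* Estimate how many packets are still in the network at the EDT of
   * the next scheduled skb: inflight minus what the network should
   * deliver (ACK) between now and the EDT.
   */
  static unsigned int packets_in_net_at_edt(unsigned int inflight_now,
                                            unsigned long edt_minus_now_us,
                                            unsigned long bw_pkts_per_ms,
                                            int gain_above_unit)
  {
          unsigned long delivered = edt_minus_now_us * bw_pkts_per_ms / 1000;
          unsigned long inflight_at_edt = inflight_now;

          if (gain_above_unit)           /* probing up: count the EDT skb */
                  inflight_at_edt += 2;  /* stand-in for bbr_tso_segs_goal() */
          if (delivered >= inflight_at_edt)
                  return 0;
          return inflight_at_edt - delivered;
  }

  int main(void)
  {
          /* 300 packets in flight, EDT 2000 us out, bw = 50 pkt/ms:
           * 100 packets drain from the network before the EDT transmit.
           */
          printf("%u\n", packets_in_net_at_edt(300, 2000, 50, 1)); /* 202 */
          printf("%u\n", packets_in_net_at_edt(300, 2000, 50, 0)); /* 200 */
          return 0;
  }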
Signed-off-by: Neal Cardwell
Signed-off-by: Yuchung Cheng
Signed-off-by: Eric Dumazet
---
 net/ipv4/tcp_bbr.c | 37 +++++++++++++++++++++++++++++++++++--
 1 file changed, 35 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp_bbr.c b/net/ipv4/tcp_bbr.c
index b88081285fd17..4cc2223d2cd54 100644
--- a/net/ipv4/tcp_bbr.c
+++ b/net/ipv4/tcp_bbr.c
@@ -369,6 +369,39 @@ static u32 bbr_target_cwnd(struct sock *sk, u32 bw, int gain)
 	return cwnd;
 }
 
+/* With pacing at lower layers, there's often less data "in the network" than
+ * "in flight". With TSQ and departure time pacing at lower layers (e.g. fq),
+ * we often have several skbs queued in the pacing layer with a pre-scheduled
+ * earliest departure time (EDT). BBR adapts its pacing rate based on the
+ * inflight level that it estimates has already been "baked in" by previous
+ * departure time decisions. We calculate a rough estimate of the number of our
+ * packets that might be in the network at the earliest departure time for the
+ * next skb scheduled:
+ *   in_network_at_edt = inflight_at_edt - (EDT - now) * bw
+ * If we're increasing inflight, then we want to know if the transmit of the
+ * EDT skb will push inflight above the target, so inflight_at_edt includes
+ * bbr_tso_segs_goal() from the skb departing at EDT. If decreasing inflight,
+ * then estimate if inflight will sink too low just before the EDT transmit.
+ */
+static u32 bbr_packets_in_net_at_edt(struct sock *sk, u32 inflight_now)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+	struct bbr *bbr = inet_csk_ca(sk);
+	u64 now_ns, edt_ns, interval_us;
+	u32 interval_delivered, inflight_at_edt;
+
+	now_ns = tp->tcp_clock_cache;
+	edt_ns = max(tp->tcp_wstamp_ns, now_ns);
+	interval_us = div_u64(edt_ns - now_ns, NSEC_PER_USEC);
+	interval_delivered = (u64)bbr_bw(sk) * interval_us >> BW_SCALE;
+	inflight_at_edt = inflight_now;
+	if (bbr->pacing_gain > BBR_UNIT)              /* increasing inflight */
+		inflight_at_edt += bbr_tso_segs_goal(sk);  /* include EDT skb */
+	if (interval_delivered >= inflight_at_edt)
+		return 0;
+	return inflight_at_edt - interval_delivered;
+}
+
 /* An optimization in BBR to reduce losses: On the first round of recovery, we
  * follow the packet conservation principle: send P packets per P packets acked.
  * After that, we slow-start and send at most 2*P packets per P packets acked.
@@ -460,7 +493,7 @@ static bool bbr_is_next_cycle_phase(struct sock *sk,
 	if (bbr->pacing_gain == BBR_UNIT)
 		return is_full_length;		/* just use wall clock time */
 
-	inflight = rs->prior_in_flight;  /* what was in-flight before ACK? */
+	inflight = bbr_packets_in_net_at_edt(sk, rs->prior_in_flight);
 	bw = bbr_max_bw(sk);
 
 	/* A pacing_gain > 1.0 probes for bw by trying to raise inflight to at
@@ -741,7 +774,7 @@ static void bbr_check_drain(struct sock *sk, const struct rate_sample *rs)
 			bbr_target_cwnd(sk, bbr_max_bw(sk), BBR_UNIT);
 	}	/* fall through to check if in-flight is already small: */
 	if (bbr->mode == BBR_DRAIN &&
-	    tcp_packets_in_flight(tcp_sk(sk)) <=
+	    bbr_packets_in_net_at_edt(sk, tcp_packets_in_flight(tcp_sk(sk))) <=
 	    bbr_target_cwnd(sk, bbr_max_bw(sk), BBR_UNIT))
 		bbr_reset_probe_bw_mode(sk);  /* we estimate queue is drained */
 }
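As a quick sanity check of the BW_SCALE fixed-point step in
bbr_packets_in_net_at_edt() above, a minimal userspace calculation
(assuming tcp_bbr.c's convention of bw stored as packets per usec
scaled by 2^BW_SCALE, with BW_SCALE = 24; the rate and interval are
made-up example numbers):

  #include <stdint.h>
  #include <stdio.h>

  #define BW_SCALE 24

  int main(void)
  {
          /* ~10 Gbit/s with 1500-byte packets is ~0.833 pkt/us. */
          uint64_t bw = (uint64_t)(0.833 * (1 << BW_SCALE));
          uint64_t interval_us = 2000;   /* EDT is 2 ms in the future */
          uint32_t delivered = bw * interval_us >> BW_SCALE;

          /* ~1666 packets expected to drain before the EDT transmit. */
          printf("%u packets drain before EDT\n", (unsigned)delivered);
          return 0;
  }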
From patchwork Wed Oct 17 00:16:45 2018
Subject: [PATCH net-next 2/2] tcp_bbr: centralize code to set gains
From: Neal Cardwell
To: David Miller
Cc: netdev@vger.kernel.org, Neal Cardwell, Yuchung Cheng,
 Soheil Hassas Yeganeh, Priyaranjan Jha, Eric Dumazet
Date: Tue, 16 Oct 2018 20:16:45 -0400
Message-Id: <20181017001645.261770-3-ncardwell@google.com>
In-Reply-To: <20181017001645.261770-1-ncardwell@google.com>
X-Patchwork-Id: 985071

Centralize the code that sets gains used for computing cwnd and pacing
rate. This simplifies the code and makes it easier to change the state
machine or (in the future) dynamically change the gain values and
ensure that the correct gain values are always used.

Signed-off-by: Neal Cardwell
Signed-off-by: Yuchung Cheng
Signed-off-by: Soheil Hassas Yeganeh
Signed-off-by: Priyaranjan Jha
Signed-off-by: Eric Dumazet
---
 net/ipv4/tcp_bbr.c | 40 ++++++++++++++++++++++++++++++----------
 1 file changed, 30 insertions(+), 10 deletions(-)

diff --git a/net/ipv4/tcp_bbr.c b/net/ipv4/tcp_bbr.c
index 4cc2223d2cd54..9277abdd822a0 100644
--- a/net/ipv4/tcp_bbr.c
+++ b/net/ipv4/tcp_bbr.c
@@ -521,8 +521,6 @@ static void bbr_advance_cycle_phase(struct sock *sk)
 
 	bbr->cycle_idx = (bbr->cycle_idx + 1) & (CYCLE_LEN - 1);
 	bbr->cycle_mstamp = tp->delivered_mstamp;
-	bbr->pacing_gain = bbr->lt_use_bw ? BBR_UNIT :
-					    bbr_pacing_gain[bbr->cycle_idx];
 }
 
 /* Gain cycling: cycle pacing gain to converge to fair share of available bw. */
@@ -540,8 +538,6 @@ static void bbr_reset_startup_mode(struct sock *sk)
 	struct bbr *bbr = inet_csk_ca(sk);
 
 	bbr->mode = BBR_STARTUP;
-	bbr->pacing_gain = bbr_high_gain;
-	bbr->cwnd_gain	 = bbr_high_gain;
 }
 
 static void bbr_reset_probe_bw_mode(struct sock *sk)
@@ -549,8 +545,6 @@ static void bbr_reset_probe_bw_mode(struct sock *sk)
 	struct bbr *bbr = inet_csk_ca(sk);
 
 	bbr->mode = BBR_PROBE_BW;
-	bbr->pacing_gain = BBR_UNIT;
-	bbr->cwnd_gain = bbr_cwnd_gain;
 	bbr->cycle_idx = CYCLE_LEN - 1 - prandom_u32_max(bbr_cycle_rand);
 	bbr_advance_cycle_phase(sk);	/* flip to next phase of gain cycle */
 }
@@ -768,8 +762,6 @@ static void bbr_check_drain(struct sock *sk, const struct rate_sample *rs)
 	if (bbr->mode == BBR_STARTUP && bbr_full_bw_reached(sk)) {
 		bbr->mode = BBR_DRAIN;	/* drain queue we created */
-		bbr->pacing_gain = bbr_drain_gain;	/* pace slow to drain */
-		bbr->cwnd_gain = bbr_high_gain;	/* maintain cwnd */
 		tcp_sk(sk)->snd_ssthresh =
 				bbr_target_cwnd(sk, bbr_max_bw(sk), BBR_UNIT);
 	}	/* fall through to check if in-flight is already small: */
@@ -831,8 +823,6 @@ static void bbr_update_min_rtt(struct sock *sk, const struct rate_sample *rs)
 	if (bbr_probe_rtt_mode_ms > 0 && filter_expired &&
 	    !bbr->idle_restart && bbr->mode != BBR_PROBE_RTT) {
 		bbr->mode = BBR_PROBE_RTT;  /* dip, drain queue */
-		bbr->pacing_gain = BBR_UNIT;
-		bbr->cwnd_gain = BBR_UNIT;
 		bbr_save_cwnd(sk);  /* note cwnd so we can restore it */
 		bbr->probe_rtt_done_stamp = 0;
 	}
@@ -860,6 +850,35 @@ static void bbr_update_min_rtt(struct sock *sk, const struct rate_sample *rs)
 	bbr->idle_restart = 0;
 }
 
+static void bbr_update_gains(struct sock *sk)
+{
+	struct bbr *bbr = inet_csk_ca(sk);
+
+	switch (bbr->mode) {
+	case BBR_STARTUP:
+		bbr->pacing_gain = bbr_high_gain;
+		bbr->cwnd_gain	 = bbr_high_gain;
+		break;
+	case BBR_DRAIN:
+		bbr->pacing_gain = bbr_drain_gain;	/* slow, to drain */
+		bbr->cwnd_gain	 = bbr_high_gain;	/* keep cwnd */
+		break;
+	case BBR_PROBE_BW:
+		bbr->pacing_gain = (bbr->lt_use_bw ?
+				    BBR_UNIT :
+				    bbr_pacing_gain[bbr->cycle_idx]);
+		bbr->cwnd_gain	 = bbr_cwnd_gain;
+		break;
+	case BBR_PROBE_RTT:
+		bbr->pacing_gain = BBR_UNIT;
+		bbr->cwnd_gain	 = BBR_UNIT;
+		break;
+	default:
+		WARN_ONCE(1, "BBR bad mode: %u\n", bbr->mode);
+		break;
+	}
+}
+
 static void bbr_update_model(struct sock *sk, const struct rate_sample *rs)
 {
 	bbr_update_bw(sk, rs);
@@ -867,6 +886,7 @@ static void bbr_update_model(struct sock *sk, const struct rate_sample *rs)
 	bbr_check_full_bw_reached(sk, rs);
 	bbr_check_drain(sk, rs);
 	bbr_update_min_rtt(sk, rs);
+	bbr_update_gains(sk);
 }
 
 static void bbr_main(struct sock *sk, const struct rate_sample *rs)
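The structure of bbr_update_gains() above amounts to deriving the
gains from the mode in one place, rather than setting them at each
mode transition. The same pattern in a miniature userspace sketch (the
values are plain doubles approximating tcp_bbr.c's fixed-point
constants, bbr_high_gain ~ 2.89, bbr_drain_gain ~ 1/2.89,
bbr_cwnd_gain = 2, and PROBE_BW is simplified to a single pacing gain
instead of the bbr_pacing_gain[] cycle):

  #include <stdio.h>

  enum mode { STARTUP, DRAIN, PROBE_BW, PROBE_RTT };

  struct gains { double pacing; double cwnd; };

  /* Single source of truth: gains are a function of the mode. */
  static struct gains gains_for(enum mode m)
  {
          switch (m) {
          case STARTUP:   return (struct gains){ 2.89, 2.89 };
          case DRAIN:     return (struct gains){ 1 / 2.89, 2.89 };
          case PROBE_BW:  return (struct gains){ 1.0, 2.0 };
          case PROBE_RTT: return (struct gains){ 1.0, 1.0 };
          }
          return (struct gains){ 1.0, 1.0 };  /* unreachable */
  }

  int main(void)
  {
          struct gains g = gains_for(DRAIN);

          printf("pacing %.3f cwnd %.2f\n", g.pacing, g.cwnd);
          return 0;
  }

With transitions only setting bbr->mode and the gains recomputed once
per model update, a transition can never leave a stale gain behind,
which is the consistency property the commit message describes.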