
[bisected] xfrm: TCP connection initiating PMTU discovery stalls on v3.

Message ID 2630078.LdJXSsPB40@h2o.as.studentenwerk.mhn.de
State RFC, archived
Delegated to: David Miller

Commit Message

Wolfgang Walter Dec. 12, 2014, 4:58 p.m. UTC
Hello Eric,

my plan to use e114a710aa5058c0ba4aa1dfb105132aefeb5e04 to make it easier to apply tcp-refine-TSO-autosizing was a bad idea as that needs further patches :-). Backporting

tcp-refine-TSO-autosizing
https://patchwork.ozlabs.org/patch/418506/

seems to be the easier solution.

I tested whether 3.18 and 3.17.6 are affected, and they are. I applied tcp-refine-TSO-autosizing and the one-liner below, and then both work fine.

Attached is

1) my backport of tcp-refine-TSO-autosizing for 3.14.26
2) my backport of tcp-refine-TSO-autosizing for 3.17.6

For 3.18 tcp-refine-TSO-autosizing just applies fine.

I tested 3.14.26 + patches, 3.17.6 + patches and 3.18.2 + patches.

And here again the one-liner:

Comments

Eric Dumazet Dec. 12, 2014, 5:27 p.m. UTC | #1
On Fri, 2014-12-12 at 17:58 +0100, Wolfgang Walter wrote:

> This fixes hangs of local tcp connections over ipsec tunnels where
> pmtu is lower than the mtu of the interface.

Again, this is a workaround; the one-liner is not a clean patch.

Something is wrong; we should fix the GSO path instead.

Why? Because we can queue a GSO packet in a qdisc, then later call
sk_setup_caps() (too late)


Wolfgang Walter Dec. 12, 2014, 8:31 p.m. UTC | #2
On Friday, 12 December 2014, 09:27:05, Eric Dumazet wrote:
> On Fri, 2014-12-12 at 17:58 +0100, Wolfgang Walter wrote:
> > This fixes hangs of local tcp connections over ipsec tunnels where
> > pmtu is lower than the mtu of the interface.
> 
> Again, this is a workaround; the one-liner is not a clean patch.
> 
> Something is wrong; we should fix the GSO path instead.
> 
> Why? Because we can queue a GSO packet in a qdisc, then later call
> sk_setup_caps() (too late)

Hmm. Do you think that sk_setup_caps should not disable GSO later at all? Or 
only that it should not influence already queued packets?

In the case of an ipsec tunnel, GSO gets disabled later (for TCP connections)
because dst->header_len is zero at first and non-zero later. TCP connections
initiated not too long afterwards start with GSO disabled from the beginning.


dst->header_len being non-zero for ipsec tunnels seems to be the case with or
without an adjustment via PMTU. It seems that routing just does not know it
from the beginning (for incoming connections) :-).

Why does a non-zero dst->header_len disable GSO? Is it just an optimization, or
does GSO not work in that case? If dst->header_len being non-zero signals that
ipsec tunnel mode can't handle GSO at all, then the problem would not lie in
the GSO path.
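
For reference, this seems to be what sk_setup_caps() in net/core/sock.c does in
kernels of this era; a simplified sketch from my reading of the sources, so
please double-check against the actual tree:

void sk_setup_caps(struct sock *sk, struct dst_entry *dst)
{
	sk_dst_set(sk, dst);
	sk->sk_route_caps = dst->dev->features;
	if (sk->sk_route_caps & NETIF_F_GSO)
		sk->sk_route_caps |= NETIF_F_GSO_SOFTWARE;
	sk->sk_route_caps &= ~sk->sk_route_nocaps;
	if (sk_can_gso(sk)) {
		if (dst->header_len) {
			/* non-zero header_len (e.g. xfrm tunnel): drop GSO */
			sk->sk_route_caps &= ~NETIF_F_GSO_MASK;
		} else {
			sk->sk_route_caps |= NETIF_F_SG | NETIF_F_HW_CSUM;
			sk->sk_gso_max_size = dst->dev->gso_max_size;
			sk->sk_gso_max_segs = dst->dev->gso_max_segs;
		}
	}
}

So a non-zero dst->header_len simply clears the GSO capability bits on the
socket; it does not touch packets that are already queued.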


There is one thing which indicates that there is also a problem in the GSO
path, though. As my tests demonstrated, disabling GSO on the physical interface
serving the ipsec tunnel does not prevent the hangs. It does prevent them,
though, if sk_setup_caps() sets sk->sk_gso_max_segs = 1 when sk_can_gso(sk)
returns false. This was the first variation of the patch, where
sk->sk_gso_max_segs was not set if dst->header_len is non-zero.


Concerning qdisc: an ipsec tunnel doesn't have a "tunnel device", so it does
not have a qdisc? The ipsec tunnel transforms packets, and the results are (in
case of an ipsec ESP tunnel) ESP packets which cannot be resegmented. Only
these transformed packets are finally queued to a qdisc of a network device.


Regards,
Thomas Jarosch Dec. 12, 2014, 9:30 p.m. UTC | #3
Wolfgang,

On 12/12/2014 09:31 PM, Wolfgang Walter wrote:
> There is one thing which indicates that there is also a problem in the GSO
> path, though. As my tests demonstrated, disabling GSO on the physical
> interface serving the ipsec tunnel does not prevent the hangs. It does
> prevent them, though, if sk_setup_caps() sets sk->sk_gso_max_segs = 1 when
> sk_can_gso(sk) returns false. This was the first variation of the patch,
> where sk->sk_gso_max_segs was not set if dst->header_len is non-zero.

if it's any help: disabling TX checksumming prevents the hang even on an
unpatched 3.14.x kernel. With your debug statements in place, you could compare
the path of the packets with and without TX checksumming.

HTH,
Thomas

Eric Dumazet Dec. 12, 2014, 10:31 p.m. UTC | #4
On Fri, 2014-12-12 at 22:30 +0100, Thomas Jarosch wrote:

> if it's any help: disabling TX checksumming prevents the hang even on an
> unpatched 3.14.x kernel. With your debug statements in place, you could
> compare the path of the packets with and without TX checksumming.

Disabling TX checksum automatically disables GSO.

Disabling TSO/GSO is the real 'cure' for the time being, you can keep TX
checksums.
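
(Roughly, the feature dependencies are handled in netdev_fix_features() in
net/core/dev.c; the excerpt below is a simplified sketch and the exact clauses
vary by kernel version, so treat it as illustration only.)

	/* TSO requires that SG is present as well. */
	if ((features & NETIF_F_ALL_TSO) && !(features & NETIF_F_SG)) {
		netdev_dbg(dev, "Dropping TSO features since no SG feature.\n");
		features &= ~NETIF_F_ALL_TSO;
	}

	/* TSO also requires checksum offload for the protocol in question. */
	if ((features & NETIF_F_TSO) && !(features & NETIF_F_HW_CSUM) &&
					!(features & NETIF_F_IP_CSUM)) {
		netdev_dbg(dev, "Dropping TSO features since no CSUM feature.\n");
		features &= ~NETIF_F_TSO;
		features &= ~NETIF_F_TSO_ECN;
	}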


Wolfgang Walter Dec. 12, 2014, 11:47 p.m. UTC | #5
On Friday, 12 December 2014, 14:31:49, you wrote:
> On Fri, 2014-12-12 at 22:30 +0100, Thomas Jarosch wrote:
> > if it's any help: disabling TX checksumming prevents the hang even on an
> > unpatched 3.14.x kernel. With your debug statements in place, you could
> > compare the path of the packets with and without TX checksumming.

I can't disable it as the driver will not allow it:
# ethtool -K eth0 tx off
Cannot change tx-checksumming
Could not change any device features

> Disabling TX checksum automatically disables GSO.
> 
> Disabling TSO/GSO is the real 'cure' for the time being, you can keep TX
> checksums.

This does not help here. With GSO disabled (for the network device serving the
ipsec traffic, here e.g. eth0) the hang still occurs. tcp_set_skb_tso_segs()
gets called and the else branch of the skb->len <= mss_now check is taken.

I tested this several times.
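
For context, tcp_set_skb_tso_segs() in net/ipv4/tcp_output.c looks roughly like
this in 3.14 (simplified from my reading of the sources, so take it with a
grain of salt):

static void tcp_set_skb_tso_segs(const struct sock *sk, struct sk_buff *skb,
				 unsigned int mss_now)
{
	if (skb->len <= mss_now || !sk_can_gso(sk) ||
	    skb->ip_summed == CHECKSUM_NONE) {
		/* plain packet: one segment, no gso_size */
		skb_shinfo(skb)->gso_segs = 1;
		skb_shinfo(skb)->gso_size = 0;
		skb_shinfo(skb)->gso_type = 0;
	} else {
		/* mark the skb as a GSO packet at the current MSS */
		skb_shinfo(skb)->gso_segs = DIV_ROUND_UP(skb->len, mss_now);
		skb_shinfo(skb)->gso_size = mss_now;
		skb_shinfo(skb)->gso_type = sk->sk_gso_type;
	}
}

Note that the decision here depends on sk_can_gso(sk) and skb->ip_summed, not
on the features of the device the packet eventually leaves through.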

Regards,
Eric Dumazet Dec. 13, 2014, 12:15 a.m. UTC | #6
On Sat, 2014-12-13 at 00:47 +0100, Wolfgang Walter wrote:

> I can't disable it as the driver will not allow it:
> # ethtool -K eth0 tx off
> Cannot change tx-checksumming
> Could not change any device features

Sounds like a bug in itself :(


Wolfgang Walter Dec. 13, 2014, 12:43 a.m. UTC | #7
On Friday, 12 December 2014, 16:15:40, Eric Dumazet wrote:
> On Sat, 2014-12-13 at 00:47 +0100, Wolfgang Walter wrote:
> > I can't disable it as the driver will not allow it:
> > # ethtool -K eth0 tx off
> > Cannot change tx-checksumming
> > Could not change any device features
> 
> Sounds like a bug in itself :(
> 

Yes, that is what I meant in my email: there also seems to be a bug in the GSO
path.


I don't understand how tcp_sendmsg() marks gso/non-gso skbs.

tcp_send_mss() calls tcp_xmit_size_goal(), and there is a check for GSO
(sk_can_gso(sk)). If sk_can_gso() returns false, then tcp_xmit_size_goal()
returns mss_now. gso_segs seems to be set to zero. In tcp_output.c,
gso_segs == 0 seems to allow GSO.
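
(tcp_send_mss() itself is tiny; roughly, from my reading of the 3.14 sources:)

static int tcp_send_mss(struct sock *sk, int *size_goal, int flags)
{
	int mss_now;

	mss_now = tcp_current_mss(sk);
	*size_goal = tcp_xmit_size_goal(sk, mss_now, !(flags & MSG_OOB));

	return mss_now;
}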

What if the MSS later shrinks? Couldn't it be that tcp_set_skb_tso_segs() then
gets called and the GSO branch is taken?

Regards,
Wolfgang Walter Dec. 15, 2014, 6:04 p.m. UTC | #8
Hello Eric!

On Friday, 12 December 2014, 16:15:40, Eric Dumazet wrote:
> On Sat, 2014-12-13 at 00:47 +0100, Wolfgang Walter wrote:
> > I can't disable it as the driver will not allow it:
> > # ethtool -K eth0 tx off
> > Cannot change tx-checksumming
> > Could not change any device features
> 
> Sounds like a bug in itself :(

I think the "gso disabled for interface" case is broken because:

tcp_sendmsg() sets gso_segs to zero

tcp_sendmsg() calls tcp_push_one() or __tcp_push_pending_frames()

tcp_push_one() and  __tcp_push_pending_frames() call tcp_write_xmit()

tcp_write_xmit() calls tcp_init_tso_segs()

tcp_init_tso_segs() calls tcp_set_skb_tso_segs() because !tso_segs is true

tcp_set_skb_tso_segs() creates a GSO packet if the packet size is larger than
mss_now.


I think it was assumed that tcp_init_tso_segs() checks whether the socket
supports GSO.

I didn't check the other callers of tcp_init_tso_segs().
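
For reference, tcp_init_tso_segs() is roughly the following (again simplified
from my reading of the 3.14 sources):

static int tcp_init_tso_segs(const struct sock *sk, struct sk_buff *skb,
			     unsigned int mss_now)
{
	int tso_segs = tcp_skb_pcount(skb);

	if (!tso_segs || (tso_segs > 1 && tcp_skb_mss(skb) != mss_now)) {
		tcp_set_skb_tso_segs(sk, skb, mss_now);
		tso_segs = tcp_skb_pcount(skb);
	}
	return tso_segs;
}

So with gso_segs still zero it always falls through to tcp_set_skb_tso_segs(),
and whether the skb ends up marked as a GSO packet is decided entirely there.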


Regards,

Patch

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -131,7 +131,7 @@  struct tcp_sock {
 	/* inet_connection_sock has to be the first member of tcp_sock */
 	struct inet_connection_sock	inet_conn;
 	u16	tcp_header_len;	/* Bytes of tcp header to send		*/
-	u16	xmit_size_goal_segs; /* Goal for segmenting output packets */
+	u16	gso_segs;	/* Max number of segs per GSO packet	*/
 
 /*
  *	Header prediction flags
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -836,47 +836,29 @@  static unsigned int tcp_xmit_size_goal(struct sock *sk, u32 mss_now,
 				       int large_allowed)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
-	u32 xmit_size_goal, old_size_goal;
-
-	xmit_size_goal = mss_now;
-
-	if (large_allowed && sk_can_gso(sk)) {
-		u32 gso_size, hlen;
-
-		/* Maybe we should/could use sk->sk_prot->max_header here ? */
-		hlen = inet_csk(sk)->icsk_af_ops->net_header_len +
-		       inet_csk(sk)->icsk_ext_hdr_len +
-		       tp->tcp_header_len;
-
-		/* Goal is to send at least one packet per ms,
-		 * not one big TSO packet every 100 ms.
-		 * This preserves ACK clocking and is consistent
-		 * with tcp_tso_should_defer() heuristic.
-		 */
-		gso_size = sk->sk_pacing_rate / (2 * MSEC_PER_SEC);
-		gso_size = max_t(u32, gso_size,
-				 sysctl_tcp_min_tso_segs * mss_now);
-
-		xmit_size_goal = min_t(u32, gso_size,
-				       sk->sk_gso_max_size - 1 - hlen);
-
-		xmit_size_goal = tcp_bound_to_half_wnd(tp, xmit_size_goal);
-
-		/* We try hard to avoid divides here */
-		old_size_goal = tp->xmit_size_goal_segs * mss_now;
-
-		if (likely(old_size_goal <= xmit_size_goal &&
-			   old_size_goal + mss_now > xmit_size_goal)) {
-			xmit_size_goal = old_size_goal;
-		} else {
-			tp->xmit_size_goal_segs =
-				min_t(u16, xmit_size_goal / mss_now,
-				      sk->sk_gso_max_segs);
-			xmit_size_goal = tp->xmit_size_goal_segs * mss_now;
-		}
+	u32 new_size_goal, size_goal, hlen;
+
+	if (!large_allowed || !sk_can_gso(sk))
+		return mss_now;
+
+	/* Maybe we should/could use sk->sk_prot->max_header here ? */
+	hlen = inet_csk(sk)->icsk_af_ops->net_header_len +
+	       inet_csk(sk)->icsk_ext_hdr_len +
+	       tp->tcp_header_len;
+
+	new_size_goal = sk->sk_gso_max_size - 1 - hlen;
+	new_size_goal = tcp_bound_to_half_wnd(tp, new_size_goal);
+
+	/* We try hard to avoid divides here */
+	size_goal = tp->gso_segs * mss_now;
+	if (unlikely(new_size_goal < size_goal ||
+		     new_size_goal >= size_goal + mss_now)) {
+		tp->gso_segs = min_t(u16, new_size_goal / mss_now,
+				     sk->sk_gso_max_segs);
+		size_goal = tp->gso_segs * mss_now;
 	}
 
-	return max(xmit_size_goal, mss_now);
+	return max(size_goal, mss_now);
 }
 
 static int tcp_send_mss(struct sock *sk, int *size_goal, int flags)
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1484,6 +1484,27 @@  static bool tcp_nagle_check(bool partial, const struct tcp_sock *tp,
 		((nonagle & TCP_NAGLE_CORK) ||
 		 (!nonagle && tp->packets_out && tcp_minshall_check(tp)));
 }
+
+/* Return how many segs we'd like on a TSO packet,
+ * to send one TSO packet per ms
+ */
+static u32 tcp_tso_autosize(const struct sock *sk, unsigned int mss_now)
+{
+	u32 bytes, segs;
+
+	bytes = min(sk->sk_pacing_rate >> 10,
+		    sk->sk_gso_max_size - 1 - MAX_TCP_HEADER);
+
+	/* Goal is to send at least one packet per ms,
+	 * not one big TSO packet every 100 ms.
+	 * This preserves ACK clocking and is consistent
+	 * with tcp_tso_should_defer() heuristic.
+	 */
+	segs = max_t(u32, bytes / mss_now, sysctl_tcp_min_tso_segs);
+
+	return min_t(u32, segs, sk->sk_gso_max_segs);
+}
+
 /* Returns the portion of skb which can be sent right away */
 static unsigned int tcp_mss_split_point(const struct sock *sk,
 					const struct sk_buff *skb,
@@ -1687,7 +1708,7 @@  static int tso_fragment(struct sock *sk, struct sk_buff *skb, unsigned int len,
  * This algorithm is from John Heffner.
  */
 static bool tcp_tso_should_defer(struct sock *sk, struct sk_buff *skb,
-				 bool *is_cwnd_limited)
+				 bool *is_cwnd_limited, u32 max_segs)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 	const struct inet_connection_sock *icsk = inet_csk(sk);
@@ -1717,8 +1738,7 @@  static bool tcp_tso_should_defer(struct sock *sk, struct sk_buff *skb,
 	limit = min(send_win, cong_win);
 
 	/* If a full-sized TSO skb can be sent, do it. */
-	if (limit >= min_t(unsigned int, sk->sk_gso_max_size,
-			   tp->xmit_size_goal_segs * tp->mss_cache))
+	if (limit >= max_segs * tp->mss_cache)
 		goto send_now;
 
 	/* Middle in queue won't get any more data, full sendable already? */
@@ -1915,6 +1935,7 @@  static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 	int cwnd_quota;
 	int result;
 	bool is_cwnd_limited = false;
+	u32 max_segs;
 
 	sent_pkts = 0;
 
@@ -1928,6 +1949,7 @@  static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 		}
 	}
 
+	max_segs = tcp_tso_autosize(sk, mss_now);
 	while ((skb = tcp_send_head(sk))) {
 		unsigned int limit;
 
@@ -1960,10 +1982,23 @@  static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 				break;
 		} else {
 			if (!push_one &&
-			    tcp_tso_should_defer(sk, skb, &is_cwnd_limited))
+			    tcp_tso_should_defer(sk, skb, &is_cwnd_limited,
+						 max_segs))
 				break;
 		}
 
+		limit = mss_now;
+		if (tso_segs > 1 && !tcp_urg_mode(tp))
+			limit = tcp_mss_split_point(sk, skb, mss_now,
+						    min_t(unsigned int,
+							  cwnd_quota,
+							  max_segs),
+						    nonagle);
+
+		if (skb->len > limit &&
+		    unlikely(tso_fragment(sk, skb, limit, mss_now, gfp)))
+			break;
+
 		/* TCP Small Queues :
 		 * Control number of packets in qdisc/devices to two packets / or ~1 ms.
 		 * This allows for :
@@ -1974,8 +2009,8 @@  static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 		 * of queued bytes to ensure line rate.
 		 * One example is wifi aggregation (802.11 AMPDU)
 		 */
-		limit = max_t(unsigned int, sysctl_tcp_limit_output_bytes,
-			      sk->sk_pacing_rate >> 10);
+		limit = max(2 * skb->truesize, sk->sk_pacing_rate >> 10);
+		limit = min_t(u32, limit, sysctl_tcp_limit_output_bytes);
 
 		if (atomic_read(&sk->sk_wmem_alloc) > limit) {
 			set_bit(TSQ_THROTTLED, &tp->tsq_flags);
@@ -1988,18 +2023,6 @@  static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 				break;
 		}
 
-		limit = mss_now;
-		if (tso_segs > 1 && !tcp_urg_mode(tp))
-			limit = tcp_mss_split_point(sk, skb, mss_now,
-						    min_t(unsigned int,
-							  cwnd_quota,
-							  sk->sk_gso_max_segs),
-						    nonagle);
-
-		if (skb->len > limit &&
-		    unlikely(tso_fragment(sk, skb, limit, mss_now, gfp)))
-			break;
-
 		TCP_SKB_CB(skb)->when = tcp_time_stamp;
 
 		if (unlikely(tcp_transmit_skb(sk, skb, 1, gfp)))
-- 
1.7.10.4