Message ID: 2630078.LdJXSsPB40@h2o.as.studentenwerk.mhn.de
State: RFC, archived
Delegated to: David Miller
On Fri, 2014-12-12 at 17:58 +0100, Wolfgang Walter wrote:
> This fixes hangs of local tcp connections over ipsec tunnels where
> pmtu is lower than the mtu of the interface.

Again, this is a workaround; the one-liner patch is not a clean patch.

Something is wrong, we should fix the GSO path instead.

Why? Because we can queue a GSO packet in a qdisc, then later call
sk_setup_caps() (too late).

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On Friday, December 12, 2014, 09:27:05, Eric Dumazet wrote:
> On Fri, 2014-12-12 at 17:58 +0100, Wolfgang Walter wrote:
> > This fixes hangs of local tcp connections over ipsec tunnels where
> > pmtu is lower than the mtu of the interface.
>
> Again this is a work around, the one liner patch is not a clean patch.
>
> Something is wrong, we should fix GSO path instead.
>
> Why ? Because we can queue a GSO packet in a qdisc, then later call
> sk_setup_caps() (too late)

Hmm. Do you think that sk_setup_caps() should not disable GSO later at all?
Or only that it should not influence already queued packets?

In the case of an ipsec tunnel, GSO gets disabled later (for tcp
connections) because dst->header_len is zero at first and non-zero later.
tcp connections initiated not too long afterwards start with GSO disabled
from the beginning.

dst->header_len being non-zero for ipsec tunnels seems to be the case with
or without an adjustment via PMTU. It seems that routing just does not know
it from the beginning (for incoming connections) :-).

Why does a non-zero dst->header_len disable GSO? Is it just an
optimization, or does GSO not work in that case? If dst->header_len being
non-zero signals that ipsec tunnel mode can't handle GSO at all, then the
problem would not lie in the GSO path.

There is one thing which indicates that there is also a problem in the GSO
path, though. As my tests demonstrated, disabling GSO on the physical
interface serving the ipsec tunnel does not prevent the hangs. It does,
though, if sk->sk_gso_max_segs = 1 is set in sk_setup_caps() when
sk_can_gso(sk) returns false. This was the first variation of the patch,
where sk->sk_gso_max_segs was not set if dst->header_len is non-zero.

Concerning qdisc: an ipsec tunnel doesn't have a "tunnel device", so it
does not have a qdisc? The ipsec tunnel transforms packets and the results
are (in the case of an ipsec ESP tunnel) ESP packets which cannot be
resegmented. Only these transformed packets are finally queued to a qdisc
of a network device.

Regards,
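[Editorial note: the late-GSO-disable behavior discussed above can be modeled in a few lines. This is a userspace sketch with made-up flag constants, not the kernel's actual sk_setup_caps(); only the shape of the logic (non-zero dst->header_len strips the GSO capability bits) is taken from the thread.]

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative flag values, standing in for the kernel's NETIF_F_* bits. */
#define F_GSO      (1u << 0)
#define F_GSO_MASK (F_GSO)

/* Toy model of sk_setup_caps(): the socket inherits the device features,
 * but once the route carries a non-zero dst->header_len (as an ipsec
 * tunnel route does once the xfrm state is attached), every GSO
 * capability bit is cleared -- possibly after packets were already queued. */
static uint32_t route_caps(uint32_t dev_features, unsigned int dst_header_len)
{
	uint32_t caps = dev_features;

	if ((caps & F_GSO_MASK) && dst_header_len)
		caps &= ~F_GSO_MASK;	/* ipsec tunnel: GSO disabled late */
	return caps;
}
```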
Wolfgang,

On 12/12/2014 09:31 PM, Wolfgang Walter wrote:
> There is one thing which indicates that there is also a problem in the
> GSO path, though. As my tests demonstrated disabling GSO on the physical
> interface serving the ipsec tunnel does not prevent the hangs. It does,
> though, if sk->sk_gso_max_segs = 1 in sk_setup_caps() if sk_can_gso(sk)
> returns false. This was the first variation of the patch where
> sk->sk_gso_max_segs was not set if dst->header_len is non zero.

If it's any help: disabling TX checksumming prevents the hang even on an
unpatched 3.14.x kernel. You could check, with your debug statements in
place, the path of the packets with and without TX checksumming.

HTH,
Thomas
On Fri, 2014-12-12 at 22:30 +0100, Thomas Jarosch wrote:
> if it's any help: Disabling TX checksumming prevents the hang
> even on an unpatched 3.14.x kernel. You could check with your debug
> statements in place the path of the packets with and without TX
> checksumming.

Disabling TX checksum automatically disables GSO.

Disabling TSO/GSO is the real 'cure' for the time being; you can keep TX
checksums.
On Friday, December 12, 2014, 14:31:49, you wrote:
> On Fri, 2014-12-12 at 22:30 +0100, Thomas Jarosch wrote:
> > if it's any help: Disabling TX checksumming prevents the hang
> > even on an unpatched 3.14.x kernel. You could check with your debug
> > statements in place the path of the packets with and without TX
> > checksumming.

I can't disable it as the driver will not allow it:

# ethtool -K eth0 tx off
Cannot change tx-checksumming
Could not change any device features

> Disabling TX checksum automatically disables GSO.
>
> Disabling TSO/GSO is the real 'cure' for the time being, you can keep TX
> checksums.

This does not help here. With GSO disabled (for the network device serving
the ipsec traffic, here e.g. eth0) the hang still occurs.
tcp_set_skb_tso_segs() gets called and the else-branch for
skb->len <= mss_now is taken. I tested this several times.

Regards,
On Sat, 2014-12-13 at 00:47 +0100, Wolfgang Walter wrote:
> I can't disable it as the driver will not allow it:
> # ethtool -K eth0 tx off
> Cannot change tx-checksumming
> Could not change any device features

Sounds like a bug in itself :(
On Friday, December 12, 2014, 16:15:40, Eric Dumazet wrote:
> On Sat, 2014-12-13 at 00:47 +0100, Wolfgang Walter wrote:
> > I can't disable it as the driver will not allow it:
> > # ethtool -K eth0 tx off
> > Cannot change tx-checksumming
> > Could not change any device features
>
> Sounds a bug in itself :(

Yes, that is what I meant in my email: there also seems to be a bug in the
GSO path.

I don't understand how tcp_sendmsg() marks gso/non-gso skbs.
tcp_sendmsg() calls tcp_send_mss(), which calls tcp_xmit_size_goal(), and
there is a check for GSO (sk_can_gso(sk)). If sk_can_gso() returns false,
then tcp_xmit_size_goal() returns mss_now, and gso_segs seems to be set to
zero. But in tcp_output.c, gso_segs == 0 seems to allow GSO.

What if mss later shrinks? Couldn't it be that tcp_set_skb_tso_segs() then
gets called and the gso branch is taken?

Regards,
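[Editorial note: the "what if mss later shrinks" scenario above can be made concrete. The following is a userspace sketch of just the segment-count arithmetic in tcp_set_skb_tso_segs(), not the kernel function itself (the real code also stores gso_size and checksum state in skb_shinfo).]

```c
#include <assert.h>

/* Toy model of the segment count tcp_set_skb_tso_segs() derives: an skb
 * that fit into one MSS when it was written stays a plain packet, but if
 * mss_now later shrinks (e.g. after a PMTU update on the ipsec route),
 * the very same queued skb suddenly takes the GSO branch. */
static unsigned int tso_segs_for(unsigned int skb_len, unsigned int mss_now)
{
	if (skb_len <= mss_now)
		return 1;				/* non-GSO packet */
	return (skb_len + mss_now - 1) / mss_now;	/* DIV_ROUND_UP */
}
```

With an mss of 1448 a 1400-byte skb is a single segment; drop the mss to 1200 (PMTU shrink) and the same skb becomes a 2-segment GSO packet.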
Hello Eric!

On Friday, December 12, 2014, 16:15:40, Eric Dumazet wrote:
> On Sat, 2014-12-13 at 00:47 +0100, Wolfgang Walter wrote:
> > I can't disable it as the driver will not allow it:
> > # ethtool -K eth0 tx off
> > Cannot change tx-checksumming
> > Could not change any device features
>
> Sounds a bug in itself :(

I think the "gso disabled for interface" case is broken because:

tcp_sendmsg() sets tcp_gso_segs to zero
tcp_sendmsg() calls tcp_push_one() or __tcp_push_pending_frames()
tcp_push_one() and __tcp_push_pending_frames() call tcp_write_xmit()
tcp_write_xmit() calls tcp_init_tso_segs()
tcp_init_tso_segs() calls tcp_set_skb_tso_segs() because !tso_segs is true
tcp_set_skb_tso_segs() creates a gso packet if the packet size is larger
than mss_now.

I think tcp_write_xmit() assumed that tcp_init_tso_segs() checks if the
socket supports gso. I didn't check the other callers of
tcp_init_tso_segs().

Regards,
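[Editorial note: the call chain above can be sketched as follows. This is a simplified userspace model, not the kernel's tcp_init_tso_segs() (which reads gso_segs/gso_size out of skb_shinfo); it illustrates the point that nothing in this path consults sk_can_gso() -- gso_segs == 0 only means "not yet computed", so a large skb gets marked GSO even on a socket whose route cannot do GSO.]

```c
#include <assert.h>

/* Toy model of tcp_init_tso_segs(): recompute the segment count when it
 * is unset (0) or stale (mss changed).  Note the absence of any
 * "does this socket support GSO?" check. */
static unsigned int init_tso_segs(unsigned int cur_gso_segs,
				  unsigned int cur_gso_size,
				  unsigned int skb_len, unsigned int mss_now)
{
	if (!cur_gso_segs || (cur_gso_segs > 1 && cur_gso_size != mss_now)) {
		/* tcp_set_skb_tso_segs() */
		if (skb_len <= mss_now)
			return 1;
		return (skb_len + mss_now - 1) / mss_now;
	}
	return cur_gso_segs;
}
```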
diff --git a/include/linux/tcp.h b/include/linux/tcp.h
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -131,7 +131,7 @@ struct tcp_sock {
 	/* inet_connection_sock has to be the first member of tcp_sock */
 	struct inet_connection_sock	inet_conn;
 	u16	tcp_header_len;	/* Bytes of tcp header to send */
-	u16	xmit_size_goal_segs; /* Goal for segmenting output packets */
+	u16	gso_segs;	/* Max number of segs per GSO packet */
 
 /*
  *	Header prediction flags
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -836,47 +836,29 @@ static unsigned int tcp_xmit_size_goal(struct sock *sk, u32 mss_now,
 				       int large_allowed)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
-	u32 xmit_size_goal, old_size_goal;
-
-	xmit_size_goal = mss_now;
-
-	if (large_allowed && sk_can_gso(sk)) {
-		u32 gso_size, hlen;
-
-		/* Maybe we should/could use sk->sk_prot->max_header here ? */
-		hlen = inet_csk(sk)->icsk_af_ops->net_header_len +
-		       inet_csk(sk)->icsk_ext_hdr_len +
-		       tp->tcp_header_len;
-
-		/* Goal is to send at least one packet per ms,
-		 * not one big TSO packet every 100 ms.
-		 * This preserves ACK clocking and is consistent
-		 * with tcp_tso_should_defer() heuristic.
-		 */
-		gso_size = sk->sk_pacing_rate / (2 * MSEC_PER_SEC);
-		gso_size = max_t(u32, gso_size,
-				 sysctl_tcp_min_tso_segs * mss_now);
-
-		xmit_size_goal = min_t(u32, gso_size,
-				       sk->sk_gso_max_size - 1 - hlen);
-
-		xmit_size_goal = tcp_bound_to_half_wnd(tp, xmit_size_goal);
-
-		/* We try hard to avoid divides here */
-		old_size_goal = tp->xmit_size_goal_segs * mss_now;
-
-		if (likely(old_size_goal <= xmit_size_goal &&
-			   old_size_goal + mss_now > xmit_size_goal)) {
-			xmit_size_goal = old_size_goal;
-		} else {
-			tp->xmit_size_goal_segs =
-				min_t(u16, xmit_size_goal / mss_now,
-				      sk->sk_gso_max_segs);
-			xmit_size_goal = tp->xmit_size_goal_segs * mss_now;
-		}
+	u32 new_size_goal, size_goal, hlen;
+
+	if (!large_allowed || !sk_can_gso(sk))
+		return mss_now;
+
+	/* Maybe we should/could use sk->sk_prot->max_header here ? */
+	hlen = inet_csk(sk)->icsk_af_ops->net_header_len +
+	       inet_csk(sk)->icsk_ext_hdr_len +
+	       tp->tcp_header_len;
+
+	new_size_goal = sk->sk_gso_max_size - 1 - hlen;
+	new_size_goal = tcp_bound_to_half_wnd(tp, new_size_goal);
+
+	/* We try hard to avoid divides here */
+	size_goal = tp->gso_segs * mss_now;
+	if (unlikely(new_size_goal < size_goal ||
+		     new_size_goal >= size_goal + mss_now)) {
+		tp->gso_segs = min_t(u16, new_size_goal / mss_now,
+				     sk->sk_gso_max_segs);
+		size_goal = tp->gso_segs * mss_now;
 	}
 
-	return max(xmit_size_goal, mss_now);
+	return max(size_goal, mss_now);
 }
 
 static int tcp_send_mss(struct sock *sk, int *size_goal, int flags)
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1484,6 +1484,27 @@ static bool tcp_nagle_check(bool partial, const struct tcp_sock *tp,
 		((nonagle & TCP_NAGLE_CORK) ||
 		 (!nonagle && tp->packets_out && tcp_minshall_check(tp))));
 }
+
+/* Return how many segs we'd like on a TSO packet,
+ * to send one TSO packet per ms
+ */
+static u32 tcp_tso_autosize(const struct sock *sk, unsigned int mss_now)
+{
+	u32 bytes, segs;
+
+	bytes = min(sk->sk_pacing_rate >> 10,
+		    sk->sk_gso_max_size - 1 - MAX_TCP_HEADER);
+
+	/* Goal is to send at least one packet per ms,
+	 * not one big TSO packet every 100 ms.
+	 * This preserves ACK clocking and is consistent
+	 * with tcp_tso_should_defer() heuristic.
+	 */
+	segs = max_t(u32, bytes / mss_now, sysctl_tcp_min_tso_segs);
+
+	return min_t(u32, segs, sk->sk_gso_max_segs);
+}
+
 /* Returns the portion of skb which can be sent right away */
 static unsigned int tcp_mss_split_point(const struct sock *sk,
 					const struct sk_buff *skb,
@@ -1687,7 +1708,7 @@ static int tso_fragment(struct sock *sk, struct sk_buff *skb, unsigned int len,
  * This algorithm is from John Heffner.
  */
 static bool tcp_tso_should_defer(struct sock *sk, struct sk_buff *skb,
-				 bool *is_cwnd_limited)
+				 bool *is_cwnd_limited, u32 max_segs)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 	const struct inet_connection_sock *icsk = inet_csk(sk);
@@ -1717,8 +1738,7 @@ static bool tcp_tso_should_defer(struct sock *sk, struct sk_buff *skb,
 	limit = min(send_win, cong_win);
 
 	/* If a full-sized TSO skb can be sent, do it. */
-	if (limit >= min_t(unsigned int, sk->sk_gso_max_size,
-			   tp->xmit_size_goal_segs * tp->mss_cache))
+	if (limit >= max_segs * tp->mss_cache)
 		goto send_now;
 
 	/* Middle in queue won't get any more data, full sendable already? */
@@ -1915,6 +1935,7 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 	int cwnd_quota;
 	int result;
 	bool is_cwnd_limited = false;
+	u32 max_segs;
 
 	sent_pkts = 0;
 
@@ -1928,6 +1949,7 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 		}
 	}
 
+	max_segs = tcp_tso_autosize(sk, mss_now);
 	while ((skb = tcp_send_head(sk))) {
 		unsigned int limit;
 
@@ -1960,10 +1982,23 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 				break;
 		} else {
 			if (!push_one &&
-			    tcp_tso_should_defer(sk, skb, &is_cwnd_limited))
+			    tcp_tso_should_defer(sk, skb, &is_cwnd_limited,
+						 max_segs))
 				break;
 		}
 
+		limit = mss_now;
+		if (tso_segs > 1 && !tcp_urg_mode(tp))
+			limit = tcp_mss_split_point(sk, skb, mss_now,
+						    min_t(unsigned int,
+							  cwnd_quota,
+							  max_segs),
+						    nonagle);
+
+		if (skb->len > limit &&
+		    unlikely(tso_fragment(sk, skb, limit, mss_now, gfp)))
+			break;
+
 		/* TCP Small Queues :
 		 * Control number of packets in qdisc/devices to two packets /
 		 * or ~1 ms.
 		 * This allows for :
@@ -1974,8 +2009,8 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 		 *  of queued bytes to ensure line rate.
 		 * One example is wifi aggregation (802.11 AMPDU)
 		 */
-		limit = max_t(unsigned int, sysctl_tcp_limit_output_bytes,
-			      sk->sk_pacing_rate >> 10);
+		limit = max(2 * skb->truesize, sk->sk_pacing_rate >> 10);
+		limit = min_t(u32, limit, sysctl_tcp_limit_output_bytes);
 
 		if (atomic_read(&sk->sk_wmem_alloc) > limit) {
 			set_bit(TSQ_THROTTLED, &tp->tsq_flags);
@@ -1988,18 +2023,6 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 			break;
 		}
 
-		limit = mss_now;
-		if (tso_segs > 1 && !tcp_urg_mode(tp))
-			limit = tcp_mss_split_point(sk, skb, mss_now,
-						    min_t(unsigned int,
-							  cwnd_quota,
-							  sk->sk_gso_max_segs),
-						    nonagle);
-
-		if (skb->len > limit &&
-		    unlikely(tso_fragment(sk, skb, limit, mss_now, gfp)))
-			break;
-
 		TCP_SKB_CB(skb)->when = tcp_time_stamp;
 
 		if (unlikely(tcp_transmit_skb(sk, skb, 1, gfp)))
--
1.7.10.4
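[Editorial note: the tcp_tso_autosize() helper added by the patch above can be exercised standalone. The following is a userspace re-implementation; MODEL_MAX_TCP_HEADER is an illustrative stand-in for the kernel's MAX_TCP_HEADER, and the kernel types/macros are replaced by plain C.]

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_MAX_TCP_HEADER 320u	/* illustrative, not the real value */

/* Re-implementation of tcp_tso_autosize(): size TSO bursts to roughly
 * 1 ms of the pacing rate (rate >> 10 is ~bytes per ms), capped by the
 * device's gso_max_size and clamped between min_tso_segs and
 * gso_max_segs. */
static uint32_t tso_autosize(uint32_t pacing_rate, uint32_t gso_max_size,
			     uint32_t mss_now, uint32_t min_tso_segs,
			     uint32_t gso_max_segs)
{
	uint32_t bytes = pacing_rate >> 10;	/* approx. bytes per ms */
	uint32_t cap = gso_max_size - 1 - MODEL_MAX_TCP_HEADER;
	uint32_t segs;

	if (bytes > cap)
		bytes = cap;
	segs = bytes / mss_now;
	if (segs < min_tso_segs)
		segs = min_tso_segs;
	return segs < gso_max_segs ? segs : gso_max_segs;
}
```

At a pacing rate of 1 MB/s this bottoms out at the min_tso_segs floor; at 100 MB/s it yields about 1 ms worth of 1448-byte segments, so a slow flow no longer builds one huge TSO packet every 100 ms.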