Message ID | 20171101211001.57901-1-cpaasch@apple.com |
---|---|
State | Superseded, archived |
Delegated to: | David Miller |
Headers | show |
Series | [net] tcp: Always cleanup skb before sending | expand |
On Wed, Nov 1, 2017 at 2:10 PM, Christoph Paasch <cpaasch@apple.com> wrote: > Since commit 971f10eca186 ("tcp: better TCP_SKB_CB layout to reduce cache > line misses") IPCB/IP6CB is no more at the beginning of the tcp_skb_cb. > This means that on the output path, we need to make sure that it has > been correctly initialized to 0, as is done in tcp_transmit_skb. > > However, when going through the other code-path in TCP that can send an > skb (e.g., through tcp_v6_send_synack), we end up in a situation where > IP6CB has some of its fields set to unexpected values. Depending on the > layout of tcp_skb_cb across the different kernel-versions this can be > lastopt, flags,... Or not use tcp_init_nondata_skb() on non fast clones, since it adds unnecessary writes and clears. tcp_make_synack() really has no business using tcp_init_nondata_skb() and could simply set th->seq = htonl(tcp_rsk(req)->snt_isn);
On Wed, 2017-11-01 at 14:32 -0700, Eric Dumazet wrote: > On Wed, Nov 1, 2017 at 2:10 PM, Christoph Paasch <cpaasch@apple.com> wrote: > > Since commit 971f10eca186 ("tcp: better TCP_SKB_CB layout to reduce cache > > line misses") IPCB/IP6CB is no more at the beginning of the tcp_skb_cb. > > This means that on the output path, we need to make sure that it has > > been correctly initialized to 0, as is done in tcp_transmit_skb. > > > > However, when going through the other code-path in TCP that can send an > > skb (e.g., through tcp_v6_send_synack), we end up in a situation where > > IP6CB has some of its fields set to unexpected values. Depending on the > > layout of tcp_skb_cb across the different kernel-versions this can be > > lastopt, flags,... > > Or not use tcp_init_nondata_skb() on non fast clones, since it adds > unnecessary writes and clears. > > tcp_make_synack() really has no business using tcp_init_nondata_skb() > and could simply set th->seq = htonl(tcp_rsk(req)->snt_isn); Something like : diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index 69cfdead0cb49e4365158048a0d1a9bbdd55fa83..5502abc5307f0ce1de610d4b70f3a59c4d5383c5 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -3399,13 +3399,8 @@ struct sk_buff *tcp_make_synack(const struct sock *sk, struct dst_entry *dst, tcp_ecn_make_synack(req, th); th->source = htons(ireq->ir_num); th->dest = ireq->ir_rmt_port; - /* Setting of flags are superfluous here for callers (and ECE is - * not even correctly set) - */ - tcp_init_nondata_skb(skb, tcp_rsk(req)->snt_isn, - TCPHDR_SYN | TCPHDR_ACK); - - th->seq = htonl(TCP_SKB_CB(skb)->seq); + skb->ip_summed = CHECKSUM_PARTIAL; + th->seq = htonl(tcp_rsk(req)->snt_isn); /* XXX data is queued and acked as is. No buffer/window check */ th->ack_seq = htonl(tcp_rsk(req)->rcv_nxt);
On 01/11/17 - 14:53:38, Eric Dumazet wrote: > On Wed, 2017-11-01 at 14:32 -0700, Eric Dumazet wrote: > > On Wed, Nov 1, 2017 at 2:10 PM, Christoph Paasch <cpaasch@apple.com> wrote: > > > Since commit 971f10eca186 ("tcp: better TCP_SKB_CB layout to reduce cache > > > line misses") IPCB/IP6CB is no more at the beginning of the tcp_skb_cb. > > > This means that on the output path, we need to make sure that it has > > > been correctly initialized to 0, as is done in tcp_transmit_skb. > > > > > > However, when going through the other code-path in TCP that can send an > > > skb (e.g., through tcp_v6_send_synack), we end up in a situation where > > > IP6CB has some of its fields set to unexpected values. Depending on the > > > layout of tcp_skb_cb across the different kernel-versions this can be > > > lastopt, flags,... > > > > Or not use tcp_init_nondata_skb() on non fast clones, since it adds > > unnecessary writes and clears. > > > > tcp_make_synack() really has no business using tcp_init_nondata_skb() > > and could simply set th->seq = htonl(tcp_rsk(req)->snt_isn); > > Something like : > > diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c > index 69cfdead0cb49e4365158048a0d1a9bbdd55fa83..5502abc5307f0ce1de610d4b70f3a59c4d5383c5 100644 > --- a/net/ipv4/tcp_output.c > +++ b/net/ipv4/tcp_output.c > @@ -3399,13 +3399,8 @@ struct sk_buff *tcp_make_synack(const struct sock *sk, struct dst_entry *dst, > tcp_ecn_make_synack(req, th); > th->source = htons(ireq->ir_num); > th->dest = ireq->ir_rmt_port; > - /* Setting of flags are superfluous here for callers (and ECE is > - * not even correctly set) > - */ > - tcp_init_nondata_skb(skb, tcp_rsk(req)->snt_isn, > - TCPHDR_SYN | TCPHDR_ACK); > - > - th->seq = htonl(TCP_SKB_CB(skb)->seq); > + skb->ip_summed = CHECKSUM_PARTIAL; > + th->seq = htonl(tcp_rsk(req)->snt_isn); > /* XXX data is queued and acked as is. No buffer/window check */ > th->ack_seq = htonl(tcp_rsk(req)->rcv_nxt); Yes, that looks good to me. Thanks! But we still need to clean up the skb in tcp_v4_send_reset and tcp_v4_send_ack, as the fields (e.g., tcp_skb_cb->seq) have been set when coming from tcp_v4_rcv. Christoph
On Wed, 2017-11-01 at 17:10 -0700, Christoph Paasch wrote: > Yes, that looks good to me. Thanks! > > But we still need to clean up the skb in tcp_v4_send_reset and > tcp_v4_send_ack, as the fields (e.g., tcp_skb_cb->seq) have been set when > coming from tcp_v4_rcv. You might be confused : ip_send_unicast_reply() does not send back the incoming skb. A fresh skb is allocated, then appended/sent. And commit 24a2d43d8886f5a29c did the changes to provide to __ip_options_echo() the proper IPCB header location.
On Wed, 2017-11-01 at 18:00 -0700, Eric Dumazet wrote: > On Wed, 2017-11-01 at 17:10 -0700, Christoph Paasch wrote: > > > Yes, that looks good to me. Thanks! > > > > But we still need to clean up the skb in tcp_v4_send_reset and > > tcp_v4_send_ack, as the fields (e.g., tcp_skb_cb->seq) have been set when > > coming from tcp_v4_rcv. > > You might be confused : ip_send_unicast_reply() does not send back the > incoming skb. > > A fresh skb is allocated, then appended/sent. > > And commit 24a2d43d8886f5a29c did the changes to provide to > __ip_options_echo() the proper IPCB header location. > More details : Fields written by tcp_init_nondata_skb() on the synack packet : ->seq (32bits) at offset 0 of skb->cb[] ->end_seq (32bits) at offset 4 of skb->cb[] ->tcp_gso_segs (16bits) at offset 8 ->tcp_flags (8bits) at offset 12 value (TCPHDR_SYN | TCPHDR_ACK -> 0x12) ->sacked (8bits) at offset 13 IPCB fields sharing these 14 bytes : iif /* 32bits, offset 0 */ opt.faddr (32bits) offset 4 opt.nexthop (32bits) offset 8 /* value 1 */ opt.optlen (8bits) offset 12 /* value 0x12 */ opt.srr (8bits) offset 13 IP6CB fields sharing these 14 bytes : iif /* 32bits, offset 0 */ ra /* 16 bits, offset 4 */ dst0 /* 16 bits offset 6 */ srcrt /* 16 bits offset 8 */ -> 0x0001 dst1 /* 16 bits offset 10 */ (not mangled -> 0) lastopt /* 16 bits offset 12 */ -> 0x12 At xmit : IPV4 uses ip_build_and_send_pkt() to transmit the SYNACK, so skb->cb[] is not used. IPv6 uses other fields. So I really wonder what exact issue you observed, please share your drugs ;) Thanks !
On 01/11/17 - 18:00:10, Eric Dumazet wrote: > On Wed, 2017-11-01 at 17:10 -0700, Christoph Paasch wrote: > > > Yes, that looks good to me. Thanks! > > > > But we still need to clean up the skb in tcp_v4_send_reset and > > tcp_v4_send_ack, as the fields (e.g., tcp_skb_cb->seq) have been set when > > coming from tcp_v4_rcv. > > You might be confused : ip_send_unicast_reply() does not send back the > incoming skb. > > A fresh skb is allocated, then appended/sent. > > And commit 24a2d43d8886f5a29c did the changes to provide to > __ip_options_echo() the proper IPCB header location. Yes, sorry I misunderstood ip_send_unicast_reply(). Didn't see the allocation of the new skb. Christoph
Hi Eric, On 01/11/17 - 19:21:31, Eric Dumazet wrote: > On Wed, 2017-11-01 at 18:00 -0700, Eric Dumazet wrote: > > On Wed, 2017-11-01 at 17:10 -0700, Christoph Paasch wrote: > > > > > Yes, that looks good to me. Thanks! > > > > > > But we still need to clean up the skb in tcp_v4_send_reset and > > > tcp_v4_send_ack, as the fields (e.g., tcp_skb_cb->seq) have been set when > > > coming from tcp_v4_rcv. > > > > You might be confused : ip_send_unicast_reply() does not send back the > > incoming skb. > > > > A fresh skb is allocated, then appended/sent. > > > > And commit 24a2d43d8886f5a29c did the changes to provide to > > __ip_options_echo() the proper IPCB header location. > > > > More details : > > Fields written by tcp_init_nondata_skb() on the synack packet : > > ->seq (32bits) at offset 0 of skb->cb[] > ->end_seq (32bits) at offset 4 of skb->cb[] > ->tcp_gso_segs (16bits) at offset 8 > ->tcp_flags (8bits) at offset 12 value (TCPHDR_SYN | TCPHDR_ACK -> > 0x12) In my build I see tcp_flags at offset 16, because ktime_t is 64 bits. > ->sacked (8bits) at offset 13 > > IPCB fields sharing these 14 bytes : > > iif /* 32bits, offset 0 */ > opt.faddr (32bits) offset 4 > opt.nexthop (32bits) offset 8 /* value 1 */ > opt.optlen (8bits) offset 12 /* value 0x12 */ > opt.srr (8bits) offset 13 > > IP6CB fields sharing these 14 bytes : > > iif /* 32bits, offset 0 */ > ra /* 16 bits, offset 4 */ > dst0 /* 16 bits offset 6 */ > srcrt /* 16 bits offset 8 */ -> 0x0001 > dst1 /* 16 bits offset 10 */ (not mangled -> 0) > lastopt /* 16 bits offset 12 */ -> 0x12 Thus, because what I mention above, we are writing here into flags which sits at offset 16. So, IP6SKB_FORWARDED and IP6SKB_FRAGMENTED will be set on the outgoing skb. Especially for IP6SKB_FORWARDED, which is checked at ip6_finish_output2() I am concerned. Even in general, having these fields set looks brittle to me. What do you think? > At xmit : > > IPV4 uses ip_build_and_send_pkt() to transmit the SYNACK, so skb->cb[] > is not used. > > IPv6 uses other fields. > > So I really wonder what exact issue you observed, please share your > drugs ;) No drugs, only belgian beer :) Christoph
On Thu, Nov 2, 2017 at 11:24 AM, Christoph Paasch <cpaasch@apple.com> wrote: > Hi Eric, > > On 01/11/17 - 19:21:31, Eric Dumazet wrote: >> On Wed, 2017-11-01 at 18:00 -0700, Eric Dumazet wrote: >> > On Wed, 2017-11-01 at 17:10 -0700, Christoph Paasch wrote: >> > >> > > Yes, that looks good to me. Thanks! >> > > >> > > But we still need to clean up the skb in tcp_v4_send_reset and >> > > tcp_v4_send_ack, as the fields (e.g., tcp_skb_cb->seq) have been set when >> > > coming from tcp_v4_rcv. >> > >> > You might be confused : ip_send_unicast_reply() does not send back the >> > incoming skb. >> > >> > A fresh skb is allocated, then appended/sent. >> > >> > And commit 24a2d43d8886f5a29c did the changes to provide to >> > __ip_options_echo() the proper IPCB header location. >> > >> >> More details : >> >> Fields written by tcp_init_nondata_skb() on the synack packet : >> >> ->seq (32bits) at offset 0 of skb->cb[] >> ->end_seq (32bits) at offset 4 of skb->cb[] >> ->tcp_gso_segs (16bits) at offset 8 >> ->tcp_flags (8bits) at offset 12 value (TCPHDR_SYN | TCPHDR_ACK -> >> 0x12) > > In my build I see tcp_flags at offset 16, because ktime_t is 64 bits. Yes, but this ktime was removed in net-next for my rb-tree work. > >> ->sacked (8bits) at offset 13 >> >> IPCB fields sharing these 14 bytes : >> >> iif /* 32bits, offset 0 */ >> opt.faddr (32bits) offset 4 >> opt.nexthop (32bits) offset 8 /* value 1 */ >> opt.optlen (8bits) offset 12 /* value 0x12 */ >> opt.srr (8bits) offset 13 >> >> IP6CB fields sharing these 14 bytes : >> >> iif /* 32bits, offset 0 */ >> ra /* 16 bits, offset 4 */ >> dst0 /* 16 bits offset 6 */ >> srcrt /* 16 bits offset 8 */ -> 0x0001 >> dst1 /* 16 bits offset 10 */ (not mangled -> 0) >> lastopt /* 16 bits offset 12 */ -> 0x12 > > Thus, because what I mention above, we are writing here into flags which sits > at offset 16. > > So, IP6SKB_FORWARDED and IP6SKB_FRAGMENTED will be set on the outgoing skb. > Especially for IP6SKB_FORWARDED, which is checked at ip6_finish_output2() I > am concerned. Even in general, having these fields set looks brittle to me. > > What do you think? I think I will submit my patch, which should clear the issue, without breaking IPv4 options handling as your patch did ;) Thanks !
On 02/11/17 - 11:26:21, Eric Dumazet wrote: > On Thu, Nov 2, 2017 at 11:24 AM, Christoph Paasch <cpaasch@apple.com> wrote: > > Hi Eric, > > > > On 01/11/17 - 19:21:31, Eric Dumazet wrote: > >> On Wed, 2017-11-01 at 18:00 -0700, Eric Dumazet wrote: > >> > On Wed, 2017-11-01 at 17:10 -0700, Christoph Paasch wrote: > >> > > >> > > Yes, that looks good to me. Thanks! > >> > > > >> > > But we still need to clean up the skb in tcp_v4_send_reset and > >> > > tcp_v4_send_ack, as the fields (e.g., tcp_skb_cb->seq) have been set when > >> > > coming from tcp_v4_rcv. > >> > > >> > You might be confused : ip_send_unicast_reply() does not send back the > >> > incoming skb. > >> > > >> > A fresh skb is allocated, then appended/sent. > >> > > >> > And commit 24a2d43d8886f5a29c did the changes to provide to > >> > __ip_options_echo() the proper IPCB header location. > >> > > >> > >> More details : > >> > >> Fields written by tcp_init_nondata_skb() on the synack packet : > >> > >> ->seq (32bits) at offset 0 of skb->cb[] > >> ->end_seq (32bits) at offset 4 of skb->cb[] > >> ->tcp_gso_segs (16bits) at offset 8 > >> ->tcp_flags (8bits) at offset 12 value (TCPHDR_SYN | TCPHDR_ACK -> > >> 0x12) > > > > In my build I see tcp_flags at offset 16, because ktime_t is 64 bits. > > Yes, but this ktime was removed in net-next for my rb-tree work. > > > > >> ->sacked (8bits) at offset 13 > >> > >> IPCB fields sharing these 14 bytes : > >> > >> iif /* 32bits, offset 0 */ > >> opt.faddr (32bits) offset 4 > >> opt.nexthop (32bits) offset 8 /* value 1 */ > >> opt.optlen (8bits) offset 12 /* value 0x12 */ > >> opt.srr (8bits) offset 13 > >> > >> IP6CB fields sharing these 14 bytes : > >> > >> iif /* 32bits, offset 0 */ > >> ra /* 16 bits, offset 4 */ > >> dst0 /* 16 bits offset 6 */ > >> srcrt /* 16 bits offset 8 */ -> 0x0001 > >> dst1 /* 16 bits offset 10 */ (not mangled -> 0) > >> lastopt /* 16 bits offset 12 */ -> 0x12 > > > > Thus, because what I mention above, we are writing here into flags which sits > > at offset 16. > > > > So, IP6SKB_FORWARDED and IP6SKB_FRAGMENTED will be set on the outgoing skb. > > Especially for IP6SKB_FORWARDED, which is checked at ip6_finish_output2() I > > am concerned. Even in general, having these fields set looks brittle to me. > > > > What do you think? > > I think I will submit my patch, which should clear the issue, without > breaking IPv4 options handling as your patch did ;) Sounds good! :) Thanks for your feedback. Christoph
diff --git a/include/net/tcp.h b/include/net/tcp.h index e6d0002a1b0b..a375ee8fc534 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -2032,6 +2032,8 @@ static inline void tcp_listendrop(const struct sock *sk) enum hrtimer_restart tcp_pace_kick(struct hrtimer *timer); +void tcp_skb_cleanup(struct sk_buff *skb); + /* * Interface for adding Upper Level Protocols over TCP */ diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c index 5b027c69cbc5..db7dd65b1f19 100644 --- a/net/ipv4/tcp_ipv4.c +++ b/net/ipv4/tcp_ipv4.c @@ -709,6 +709,9 @@ static void tcp_v4_send_reset(const struct sock *sk, struct sk_buff *skb) arg.tos = ip_hdr(skb)->tos; arg.uid = sock_net_uid(net, sk && sk_fullsock(sk) ? sk : NULL); + + tcp_skb_cleanup(skb); + local_bh_disable(); ip_send_unicast_reply(*this_cpu_ptr(net->ipv4.tcp_sk), skb, &TCP_SKB_CB(skb)->header.h4.opt, @@ -795,6 +798,9 @@ static void tcp_v4_send_ack(const struct sock *sk, arg.bound_dev_if = oif; arg.tos = tos; arg.uid = sock_net_uid(net, sk_fullsock(sk) ? sk : NULL); + + tcp_skb_cleanup(skb); + local_bh_disable(); ip_send_unicast_reply(*this_cpu_ptr(net->ipv4.tcp_sk), skb, &TCP_SKB_CB(skb)->header.h4.opt, diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index 823003eef3a2..6935a68d449b 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -973,6 +973,16 @@ static void tcp_internal_pacing(struct sock *sk, const struct sk_buff *skb) HRTIMER_MODE_ABS_PINNED); } +void tcp_skb_cleanup(struct sk_buff *skb) +{ + /* Our usage of tstamp should remain private */ + skb->tstamp = 0; + + /* Cleanup our debris for IP stacks */ + memset(skb->cb, 0, max(sizeof(struct inet_skb_parm), + sizeof(struct inet6_skb_parm))); +} + /* This routine actually transmits TCP packets queued in by * tcp_do_sendmsg(). This is used by both the initial * transmission and possible later retransmissions. @@ -1115,12 +1125,7 @@ static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it, skb_shinfo(skb)->gso_segs = tcp_skb_pcount(skb); skb_shinfo(skb)->gso_size = tcp_skb_mss(skb); - /* Our usage of tstamp should remain private */ - skb->tstamp = 0; - - /* Cleanup our debris for IP stacks */ - memset(skb->cb, 0, max(sizeof(struct inet_skb_parm), - sizeof(struct inet6_skb_parm))); + tcp_skb_cleanup(skb); err = icsk->icsk_af_ops->queue_xmit(sk, skb, &inet->cork.fl); @@ -3204,8 +3209,7 @@ struct sk_buff *tcp_make_synack(const struct sock *sk, struct dst_entry *dst, rcu_read_unlock(); #endif - /* Do not fool tcpdump (if any), clean our debris */ - skb->tstamp = 0; + tcp_skb_cleanup(skb); return skb; } EXPORT_SYMBOL(tcp_make_synack);
Since commit 971f10eca186 ("tcp: better TCP_SKB_CB layout to reduce cache line misses") IPCB/IP6CB is no more at the beginning of the tcp_skb_cb. This means that on the output path, we need to make sure that it has been correctly initialized to 0, as is done in tcp_transmit_skb. However, when going through the other code-path in TCP that can send an skb (e.g., through tcp_v6_send_synack), we end up in a situation where IP6CB has some of its fields set to unexpected values. Depending on the layout of tcp_skb_cb across the different kernel-versions this can be lastopt, flags,... This patch makes sure that IPCB/IP6CB is always set to 0 before sending. Cc: Eric Dumazet <edumazet@google.com> Fixes: 971f10eca186 ("tcp: better TCP_SKB_CB layout to reduce cache line misses") Signed-off-by: Christoph Paasch <cpaasch@apple.com> --- include/net/tcp.h | 2 ++ net/ipv4/tcp_ipv4.c | 6 ++++++ net/ipv4/tcp_output.c | 20 ++++++++++++-------- 3 files changed, 20 insertions(+), 8 deletions(-)