diff mbox series

[net] tcp: Always cleanup skb before sending

Message ID 20171101211001.57901-1-cpaasch@apple.com
State Superseded, archived
Delegated to: David Miller
Headers show
Series [net] tcp: Always cleanup skb before sending | expand

Commit Message

Christoph Paasch Nov. 1, 2017, 9:10 p.m. UTC
Since commit 971f10eca186 ("tcp: better TCP_SKB_CB layout to reduce cache
line misses") IPCB/IP6CB is no more at the beginning of the tcp_skb_cb.
This means that on the output path, we need to make sure that it has
been correctly initialized to 0, as is done in tcp_transmit_skb.

However, when going through the other code-path in TCP that can send an
skb (e.g., through tcp_v6_send_synack), we end up in a situation where
IP6CB has some of its fields set to unexpected values. Depending on the
layout of tcp_skb_cb across the different kernel-versions this can be
lastopt, flags,...

This patch makes sure that IPCB/IP6CB is always set to 0 before sending.

Cc: Eric Dumazet <edumazet@google.com>
Fixes: 971f10eca186 ("tcp: better TCP_SKB_CB layout to reduce cache line misses")
Signed-off-by: Christoph Paasch <cpaasch@apple.com>
---
 include/net/tcp.h     |  2 ++
 net/ipv4/tcp_ipv4.c   |  6 ++++++
 net/ipv4/tcp_output.c | 20 ++++++++++++--------
 3 files changed, 20 insertions(+), 8 deletions(-)

Comments

Eric Dumazet Nov. 1, 2017, 9:32 p.m. UTC | #1
On Wed, Nov 1, 2017 at 2:10 PM, Christoph Paasch <cpaasch@apple.com> wrote:
> Since commit 971f10eca186 ("tcp: better TCP_SKB_CB layout to reduce cache
> line misses") IPCB/IP6CB is no more at the beginning of the tcp_skb_cb.
> This means that on the output path, we need to make sure that it has
> been correctly initialized to 0, as is done in tcp_transmit_skb.
>
> However, when going through the other code-path in TCP that can send an
> skb (e.g., through tcp_v6_send_synack), we end up in a situation where
> IP6CB has some of its fields set to unexpected values. Depending on the
> layout of tcp_skb_cb across the different kernel-versions this can be
> lastopt, flags,...

Or not use tcp_init_nondata_skb() on non fast clones, since it adds
unnecessary writes and clears.

tcp_make_synack() really has no business using tcp_init_nondata_skb()
and could simply set th->seq = htonl(tcp_rsk(req)->snt_isn);
Eric Dumazet Nov. 1, 2017, 9:53 p.m. UTC | #2
On Wed, 2017-11-01 at 14:32 -0700, Eric Dumazet wrote:
> On Wed, Nov 1, 2017 at 2:10 PM, Christoph Paasch <cpaasch@apple.com> wrote:
> > Since commit 971f10eca186 ("tcp: better TCP_SKB_CB layout to reduce cache
> > line misses") IPCB/IP6CB is no more at the beginning of the tcp_skb_cb.
> > This means that on the output path, we need to make sure that it has
> > been correctly initialized to 0, as is done in tcp_transmit_skb.
> >
> > However, when going through the other code-path in TCP that can send an
> > skb (e.g., through tcp_v6_send_synack), we end up in a situation where
> > IP6CB has some of its fields set to unexpected values. Depending on the
> > layout of tcp_skb_cb across the different kernel-versions this can be
> > lastopt, flags,...
> 
> Or not use tcp_init_nondata_skb() on non fast clones, since it adds
> unnecessary writes and clears.
> 
> tcp_make_synack() really has no business using tcp_init_nondata_skb()
> and could simply set th->seq = htonl(tcp_rsk(req)->snt_isn);

Something like :

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 69cfdead0cb49e4365158048a0d1a9bbdd55fa83..5502abc5307f0ce1de610d4b70f3a59c4d5383c5 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -3399,13 +3399,8 @@ struct sk_buff *tcp_make_synack(const struct sock *sk, struct dst_entry *dst,
        tcp_ecn_make_synack(req, th);
        th->source = htons(ireq->ir_num);
        th->dest = ireq->ir_rmt_port;
-       /* Setting of flags are superfluous here for callers (and ECE is
-        * not even correctly set)
-        */
-       tcp_init_nondata_skb(skb, tcp_rsk(req)->snt_isn,
-                            TCPHDR_SYN | TCPHDR_ACK);
-
-       th->seq = htonl(TCP_SKB_CB(skb)->seq);
+       skb->ip_summed = CHECKSUM_PARTIAL;
+       th->seq = htonl(tcp_rsk(req)->snt_isn);
        /* XXX data is queued and acked as is. No buffer/window check */
        th->ack_seq = htonl(tcp_rsk(req)->rcv_nxt);
Christoph Paasch Nov. 2, 2017, 12:10 a.m. UTC | #3
On 01/11/17 - 14:53:38, Eric Dumazet wrote:
> On Wed, 2017-11-01 at 14:32 -0700, Eric Dumazet wrote:
> > On Wed, Nov 1, 2017 at 2:10 PM, Christoph Paasch <cpaasch@apple.com> wrote:
> > > Since commit 971f10eca186 ("tcp: better TCP_SKB_CB layout to reduce cache
> > > line misses") IPCB/IP6CB is no more at the beginning of the tcp_skb_cb.
> > > This means that on the output path, we need to make sure that it has
> > > been correctly initialized to 0, as is done in tcp_transmit_skb.
> > >
> > > However, when going through the other code-path in TCP that can send an
> > > skb (e.g., through tcp_v6_send_synack), we end up in a situation where
> > > IP6CB has some of its fields set to unexpected values. Depending on the
> > > layout of tcp_skb_cb across the different kernel-versions this can be
> > > lastopt, flags,...
> > 
> > Or not use tcp_init_nondata_skb() on non fast clones, since it adds
> > unnecessary writes and clears.
> > 
> > tcp_make_synack() really has no business using tcp_init_nondata_skb()
> > and could simply set th->seq = htonl(tcp_rsk(req)->snt_isn);
> 
> Something like :
> 
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index 69cfdead0cb49e4365158048a0d1a9bbdd55fa83..5502abc5307f0ce1de610d4b70f3a59c4d5383c5 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -3399,13 +3399,8 @@ struct sk_buff *tcp_make_synack(const struct sock *sk, struct dst_entry *dst,
>         tcp_ecn_make_synack(req, th);
>         th->source = htons(ireq->ir_num);
>         th->dest = ireq->ir_rmt_port;
> -       /* Setting of flags are superfluous here for callers (and ECE is
> -        * not even correctly set)
> -        */
> -       tcp_init_nondata_skb(skb, tcp_rsk(req)->snt_isn,
> -                            TCPHDR_SYN | TCPHDR_ACK);
> -
> -       th->seq = htonl(TCP_SKB_CB(skb)->seq);
> +       skb->ip_summed = CHECKSUM_PARTIAL;
> +       th->seq = htonl(tcp_rsk(req)->snt_isn);
>         /* XXX data is queued and acked as is. No buffer/window check */
>         th->ack_seq = htonl(tcp_rsk(req)->rcv_nxt);

Yes, that looks good to me. Thanks!

But we still need to clean up the skb in tcp_v4_send_reset and
tcp_v4_send_ack, as the fields (e.g., tcp_skb_cb->seq) have been set when
coming from tcp_v4_rcv.


Christoph
Eric Dumazet Nov. 2, 2017, 1 a.m. UTC | #4
On Wed, 2017-11-01 at 17:10 -0700, Christoph Paasch wrote:

> Yes, that looks good to me. Thanks!
> 
> But we still need to clean up the skb in tcp_v4_send_reset and
> tcp_v4_send_ack, as the fields (e.g., tcp_skb_cb->seq) have been set when
> coming from tcp_v4_rcv.

You might be confused : ip_send_unicast_reply() does not send back the
incoming skb.

A fresh skb is allocated, then appended/sent.

And commit 24a2d43d8886f5a29c did the changes to provide to
__ip_options_echo() the proper IPCB header location.
Eric Dumazet Nov. 2, 2017, 2:21 a.m. UTC | #5
On Wed, 2017-11-01 at 18:00 -0700, Eric Dumazet wrote:
> On Wed, 2017-11-01 at 17:10 -0700, Christoph Paasch wrote:
> 
> > Yes, that looks good to me. Thanks!
> > 
> > But we still need to clean up the skb in tcp_v4_send_reset and
> > tcp_v4_send_ack, as the fields (e.g., tcp_skb_cb->seq) have been set when
> > coming from tcp_v4_rcv.
> 
> You might be confused : ip_send_unicast_reply() does not send back the
> incoming skb.
> 
> A fresh skb is allocated, then appended/sent.
> 
> And commit 24a2d43d8886f5a29c did the changes to provide to
> __ip_options_echo() the proper IPCB header location.
> 

More details :

Fields written by tcp_init_nondata_skb() on the synack packet :

->seq          (32bits) at offset 0 of skb->cb[]
->end_seq      (32bits) at offset 4 of skb->cb[]
->tcp_gso_segs (16bits) at offset 8
->tcp_flags    (8bits) at offset 12 value (TCPHDR_SYN | TCPHDR_ACK ->
0x12)
->sacked       (8bits) at offset 13

IPCB fields sharing these 14 bytes :

iif  /* 32bits, offset 0 */
opt.faddr    (32bits) offset 4
opt.nexthop  (32bits) offset 8 /* value 1 */
opt.optlen   (8bits) offset 12 /* value 0x12 */
opt.srr      (8bits) offset 13

IP6CB fields sharing these 14 bytes :

iif   /* 32bits, offset 0 */
ra    /* 16 bits, offset 4 */
dst0  /* 16 bits offset 6 */
srcrt /* 16 bits offset 8 */  -> 0x0001
dst1  /* 16 bits offset 10 */ (not mangled -> 0)
lastopt /* 16 bits offset 12 */  -> 0x12


At xmit :

IPV4 uses ip_build_and_send_pkt() to transmit the SYNACK, so skb->cb[]
is not used.

IPv6 uses other fields.

So I really wonder what exact issue you observed, please share your
drugs ;)

Thanks !
Christoph Paasch Nov. 2, 2017, 6:16 p.m. UTC | #6
On 01/11/17 - 18:00:10, Eric Dumazet wrote:
> On Wed, 2017-11-01 at 17:10 -0700, Christoph Paasch wrote:
> 
> > Yes, that looks good to me. Thanks!
> > 
> > But we still need to clean up the skb in tcp_v4_send_reset and
> > tcp_v4_send_ack, as the fields (e.g., tcp_skb_cb->seq) have been set when
> > coming from tcp_v4_rcv.
> 
> You might be confused : ip_send_unicast_reply() does not send back the
> incoming skb.
> 
> A fresh skb is allocated, then appended/sent.
> 
> And commit 24a2d43d8886f5a29c did the changes to provide to
> __ip_options_echo() the proper IPCB header location.

Yes, sorry I misunderstood ip_send_unicast_reply(). Didn't see the
allocation of the new skb.


Christoph
Christoph Paasch Nov. 2, 2017, 6:24 p.m. UTC | #7
Hi Eric,

On 01/11/17 - 19:21:31, Eric Dumazet wrote:
> On Wed, 2017-11-01 at 18:00 -0700, Eric Dumazet wrote:
> > On Wed, 2017-11-01 at 17:10 -0700, Christoph Paasch wrote:
> > 
> > > Yes, that looks good to me. Thanks!
> > > 
> > > But we still need to clean up the skb in tcp_v4_send_reset and
> > > tcp_v4_send_ack, as the fields (e.g., tcp_skb_cb->seq) have been set when
> > > coming from tcp_v4_rcv.
> > 
> > You might be confused : ip_send_unicast_reply() does not send back the
> > incoming skb.
> > 
> > A fresh skb is allocated, then appended/sent.
> > 
> > And commit 24a2d43d8886f5a29c did the changes to provide to
> > __ip_options_echo() the proper IPCB header location.
> > 
> 
> More details :
> 
> Fields written by tcp_init_nondata_skb() on the synack packet :
> 
> ->seq          (32bits) at offset 0 of skb->cb[]
> ->end_seq      (32bits) at offset 4 of skb->cb[]
> ->tcp_gso_segs (16bits) at offset 8
> ->tcp_flags    (8bits) at offset 12 value (TCPHDR_SYN | TCPHDR_ACK ->
> 0x12)

In my build I see tcp_flags at offset 16, because ktime_t is 64 bits.

> ->sacked       (8bits) at offset 13
> 
> IPCB fields sharing these 14 bytes :
> 
> iif  /* 32bits, offset 0 */
> opt.faddr    (32bits) offset 4
> opt.nexthop  (32bits) offset 8 /* value 1 */
> opt.optlen   (8bits) offset 12 /* value 0x12 */
> opt.srr      (8bits) offset 13
> 
> IP6CB fields sharing these 14 bytes :
> 
> iif   /* 32bits, offset 0 */
> ra    /* 16 bits, offset 4 */
> dst0  /* 16 bits offset 6 */
> srcrt /* 16 bits offset 8 */  -> 0x0001
> dst1  /* 16 bits offset 10 */ (not mangled -> 0)
> lastopt /* 16 bits offset 12 */  -> 0x12

Thus, because what I mention above, we are writing here into flags which sits
at offset 16.

So, IP6SKB_FORWARDED and IP6SKB_FRAGMENTED will be set on the outgoing skb.
Especially for IP6SKB_FORWARDED, which is checked at ip6_finish_output2() I
am concerned. Even in general, having these fields set looks brittle to me.

What do you think?

> At xmit :
> 
> IPV4 uses ip_build_and_send_pkt() to transmit the SYNACK, so skb->cb[]
> is not used.
> 
> IPv6 uses other fields.
> 
> So I really wonder what exact issue you observed, please share your
> drugs ;)

No drugs, only belgian beer :)


Christoph
Eric Dumazet Nov. 2, 2017, 6:26 p.m. UTC | #8
On Thu, Nov 2, 2017 at 11:24 AM, Christoph Paasch <cpaasch@apple.com> wrote:
> Hi Eric,
>
> On 01/11/17 - 19:21:31, Eric Dumazet wrote:
>> On Wed, 2017-11-01 at 18:00 -0700, Eric Dumazet wrote:
>> > On Wed, 2017-11-01 at 17:10 -0700, Christoph Paasch wrote:
>> >
>> > > Yes, that looks good to me. Thanks!
>> > >
>> > > But we still need to clean up the skb in tcp_v4_send_reset and
>> > > tcp_v4_send_ack, as the fields (e.g., tcp_skb_cb->seq) have been set when
>> > > coming from tcp_v4_rcv.
>> >
>> > You might be confused : ip_send_unicast_reply() does not send back the
>> > incoming skb.
>> >
>> > A fresh skb is allocated, then appended/sent.
>> >
>> > And commit 24a2d43d8886f5a29c did the changes to provide to
>> > __ip_options_echo() the proper IPCB header location.
>> >
>>
>> More details :
>>
>> Fields written by tcp_init_nondata_skb() on the synack packet :
>>
>> ->seq          (32bits) at offset 0 of skb->cb[]
>> ->end_seq      (32bits) at offset 4 of skb->cb[]
>> ->tcp_gso_segs (16bits) at offset 8
>> ->tcp_flags    (8bits) at offset 12 value (TCPHDR_SYN | TCPHDR_ACK ->
>> 0x12)
>
> In my build I see tcp_flags at offset 16, because ktime_t is 64 bits.

Yes, but this ktime was removed in net-next for my rb-tree work.

>
>> ->sacked       (8bits) at offset 13
>>
>> IPCB fields sharing these 14 bytes :
>>
>> iif  /* 32bits, offset 0 */
>> opt.faddr    (32bits) offset 4
>> opt.nexthop  (32bits) offset 8 /* value 1 */
>> opt.optlen   (8bits) offset 12 /* value 0x12 */
>> opt.srr      (8bits) offset 13
>>
>> IP6CB fields sharing these 14 bytes :
>>
>> iif   /* 32bits, offset 0 */
>> ra    /* 16 bits, offset 4 */
>> dst0  /* 16 bits offset 6 */
>> srcrt /* 16 bits offset 8 */  -> 0x0001
>> dst1  /* 16 bits offset 10 */ (not mangled -> 0)
>> lastopt /* 16 bits offset 12 */  -> 0x12
>
> Thus, because what I mention above, we are writing here into flags which sits
> at offset 16.
>
> So, IP6SKB_FORWARDED and IP6SKB_FRAGMENTED will be set on the outgoing skb.
> Especially for IP6SKB_FORWARDED, which is checked at ip6_finish_output2() I
> am concerned. Even in general, having these fields set looks brittle to me.
>
> What do you think?

I think I will submit my patch, which should clear the issue, without
breaking IPv4 options handling as your patch did ;)

Thanks !
Christoph Paasch Nov. 2, 2017, 6:31 p.m. UTC | #9
On 02/11/17 - 11:26:21, Eric Dumazet wrote:
> On Thu, Nov 2, 2017 at 11:24 AM, Christoph Paasch <cpaasch@apple.com> wrote:
> > Hi Eric,
> >
> > On 01/11/17 - 19:21:31, Eric Dumazet wrote:
> >> On Wed, 2017-11-01 at 18:00 -0700, Eric Dumazet wrote:
> >> > On Wed, 2017-11-01 at 17:10 -0700, Christoph Paasch wrote:
> >> >
> >> > > Yes, that looks good to me. Thanks!
> >> > >
> >> > > But we still need to clean up the skb in tcp_v4_send_reset and
> >> > > tcp_v4_send_ack, as the fields (e.g., tcp_skb_cb->seq) have been set when
> >> > > coming from tcp_v4_rcv.
> >> >
> >> > You might be confused : ip_send_unicast_reply() does not send back the
> >> > incoming skb.
> >> >
> >> > A fresh skb is allocated, then appended/sent.
> >> >
> >> > And commit 24a2d43d8886f5a29c did the changes to provide to
> >> > __ip_options_echo() the proper IPCB header location.
> >> >
> >>
> >> More details :
> >>
> >> Fields written by tcp_init_nondata_skb() on the synack packet :
> >>
> >> ->seq          (32bits) at offset 0 of skb->cb[]
> >> ->end_seq      (32bits) at offset 4 of skb->cb[]
> >> ->tcp_gso_segs (16bits) at offset 8
> >> ->tcp_flags    (8bits) at offset 12 value (TCPHDR_SYN | TCPHDR_ACK ->
> >> 0x12)
> >
> > In my build I see tcp_flags at offset 16, because ktime_t is 64 bits.
> 
> Yes, but this ktime was removed in net-next for my rb-tree work.
> 
> >
> >> ->sacked       (8bits) at offset 13
> >>
> >> IPCB fields sharing these 14 bytes :
> >>
> >> iif  /* 32bits, offset 0 */
> >> opt.faddr    (32bits) offset 4
> >> opt.nexthop  (32bits) offset 8 /* value 1 */
> >> opt.optlen   (8bits) offset 12 /* value 0x12 */
> >> opt.srr      (8bits) offset 13
> >>
> >> IP6CB fields sharing these 14 bytes :
> >>
> >> iif   /* 32bits, offset 0 */
> >> ra    /* 16 bits, offset 4 */
> >> dst0  /* 16 bits offset 6 */
> >> srcrt /* 16 bits offset 8 */  -> 0x0001
> >> dst1  /* 16 bits offset 10 */ (not mangled -> 0)
> >> lastopt /* 16 bits offset 12 */  -> 0x12
> >
> > Thus, because what I mention above, we are writing here into flags which sits
> > at offset 16.
> >
> > So, IP6SKB_FORWARDED and IP6SKB_FRAGMENTED will be set on the outgoing skb.
> > Especially for IP6SKB_FORWARDED, which is checked at ip6_finish_output2() I
> > am concerned. Even in general, having these fields set looks brittle to me.
> >
> > What do you think?
> 
> I think I will submit my patch, which should clear the issue, without
> breaking IPv4 options handling as your patch did ;)

Sounds good! :)

Thanks for your feedback.


Christoph
diff mbox series

Patch

diff --git a/include/net/tcp.h b/include/net/tcp.h
index e6d0002a1b0b..a375ee8fc534 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -2032,6 +2032,8 @@  static inline void tcp_listendrop(const struct sock *sk)
 
 enum hrtimer_restart tcp_pace_kick(struct hrtimer *timer);
 
+void tcp_skb_cleanup(struct sk_buff *skb);
+
 /*
  * Interface for adding Upper Level Protocols over TCP
  */
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 5b027c69cbc5..db7dd65b1f19 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -709,6 +709,9 @@  static void tcp_v4_send_reset(const struct sock *sk, struct sk_buff *skb)
 
 	arg.tos = ip_hdr(skb)->tos;
 	arg.uid = sock_net_uid(net, sk && sk_fullsock(sk) ? sk : NULL);
+
+	tcp_skb_cleanup(skb);
+
 	local_bh_disable();
 	ip_send_unicast_reply(*this_cpu_ptr(net->ipv4.tcp_sk),
 			      skb, &TCP_SKB_CB(skb)->header.h4.opt,
@@ -795,6 +798,9 @@  static void tcp_v4_send_ack(const struct sock *sk,
 		arg.bound_dev_if = oif;
 	arg.tos = tos;
 	arg.uid = sock_net_uid(net, sk_fullsock(sk) ? sk : NULL);
+
+	tcp_skb_cleanup(skb);
+
 	local_bh_disable();
 	ip_send_unicast_reply(*this_cpu_ptr(net->ipv4.tcp_sk),
 			      skb, &TCP_SKB_CB(skb)->header.h4.opt,
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 823003eef3a2..6935a68d449b 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -973,6 +973,16 @@  static void tcp_internal_pacing(struct sock *sk, const struct sk_buff *skb)
 		      HRTIMER_MODE_ABS_PINNED);
 }
 
+void tcp_skb_cleanup(struct sk_buff *skb)
+{
+	/* Our usage of tstamp should remain private */
+	skb->tstamp = 0;
+
+	/* Cleanup our debris for IP stacks */
+	memset(skb->cb, 0, max(sizeof(struct inet_skb_parm),
+			       sizeof(struct inet6_skb_parm)));
+}
+
 /* This routine actually transmits TCP packets queued in by
  * tcp_do_sendmsg().  This is used by both the initial
  * transmission and possible later retransmissions.
@@ -1115,12 +1125,7 @@  static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
 	skb_shinfo(skb)->gso_segs = tcp_skb_pcount(skb);
 	skb_shinfo(skb)->gso_size = tcp_skb_mss(skb);
 
-	/* Our usage of tstamp should remain private */
-	skb->tstamp = 0;
-
-	/* Cleanup our debris for IP stacks */
-	memset(skb->cb, 0, max(sizeof(struct inet_skb_parm),
-			       sizeof(struct inet6_skb_parm)));
+	tcp_skb_cleanup(skb);
 
 	err = icsk->icsk_af_ops->queue_xmit(sk, skb, &inet->cork.fl);
 
@@ -3204,8 +3209,7 @@  struct sk_buff *tcp_make_synack(const struct sock *sk, struct dst_entry *dst,
 	rcu_read_unlock();
 #endif
 
-	/* Do not fool tcpdump (if any), clean our debris */
-	skb->tstamp = 0;
+	tcp_skb_cleanup(skb);
 	return skb;
 }
 EXPORT_SYMBOL(tcp_make_synack);