diff mbox

[2/2,net-next] tcp: sk_add_backlog() is too agressive for TCP

Message ID alpine.DEB.2.00.1204241131580.735@wel-95.cs.helsinki.fi
State RFC, archived
Delegated to: David Miller
Headers show

Commit Message

Ilpo Järvinen April 24, 2012, 8:40 a.m. UTC
On Tue, 24 Apr 2012, David Miller wrote:

> From: "Ilpo Järvinen" <ilpo.jarvinen@helsinki.fi>
> Date: Tue, 24 Apr 2012 11:21:18 +0300 (EEST)
> 
> > On Tue, 24 Apr 2012, David Miller wrote:
> > 
> >> That makes this a non-starter since we must therefore remember all of
> >> the SACK boundaries in the original packets.
> > 
> > GRO works because TCP tends to use rather constant MSS, right? ...Since 
> > ACKs and SACKs are nothing more than reflection of those MSS boundaries of 
> > the opposite direction I don't find that as impossible as you do because 
> > the same kind of "mss" assumption can be applied there. But GRO has made 
> > this somewhat messier now because the receiver doesn't any more generate 
> > ACK per MSS or ACK per 2*MSS but that could be "fixed" by offloading the 
> > ACK sending when responding to a GROed packet.
> 
> We're talking about accumulating ACKs on GRO not data packets.

So am I... :-). ...Code speaks more than thousands of words:



...Obviously Data and ACK couldn't be GROed at the same time practically 
(would allow reusing the gso_size field for ack_size). ...But why exactly 
you think this is not possible or viable solution if fully implemented?

And the problem I mentioned in the previous mail (in the terms of this 
code fragment) is that ackdiff is no longer MSS or 2*MSS because of GRO 
for the opposite direction doesn't trigger all those ACKs a non-GRO 
receiver would.

Comments

David Miller April 24, 2012, 8:48 a.m. UTC | #1
From: "Ilpo Järvinen" <ilpo.jarvinen@helsinki.fi>
Date: Tue, 24 Apr 2012 11:40:27 +0300 (EEST)

> And the problem I mentioned in the previous mail (in the terms of this 
> code fragment) is that ackdiff is no longer MSS or 2*MSS because of GRO 
> for the opposite direction doesn't trigger all those ACKs a non-GRO 
> receiver would.

This is the ACK counting Eric mentioned, which we need to preserve
for the purposes of CWND growth.

And in any event, you cannot depend upon the ACK boundaries in any way
shape or form, 2 * MSS is just one valid mode of ACK'ing.

Whether we get a GRO stretch ACK or a normal 2 * MSS one, it has to
work properly either way.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet April 24, 2012, 8:56 a.m. UTC | #2
On Tue, 2012-04-24 at 11:40 +0300, Ilpo Järvinen wrote:

> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 8bb6ade..33b87b2 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -2820,7 +2820,11 @@ found:
>  	flush |= (__force int)(flags & TCP_FLAG_CWR);
>  	flush |= (__force int)((flags ^ tcp_flag_word(th2)) &
>  		  ~(TCP_FLAG_CWR | TCP_FLAG_FIN | TCP_FLAG_PSH));
> -	flush |= (__force int)(th->ack_seq ^ th2->ack_seq);
> +
> +	ackgap = skb_shinfo(p)->ack_size;
> +	ackdiff = th2->ack_seq - th->ack_seq;
> +	flush |= (ackdiff - 1) >= ackgap;
> +
>  	for (i = sizeof(*th); i < thlen; i += 4)
>  		flush |= *(u32 *)((u8 *)th + i) ^
>  			 *(u32 *)((u8 *)th2 + i);
> 

See how we duplicate tcp stack in GRO layer ;)

Really I have many doubts about GRO today.



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Ilpo Järvinen April 24, 2012, 10:32 a.m. UTC | #3
On Tue, 24 Apr 2012, David Miller wrote:

> From: "Ilpo Järvinen" <ilpo.jarvinen@helsinki.fi>
> Date: Tue, 24 Apr 2012 11:40:27 +0300 (EEST)
> 
> > And the problem I mentioned in the previous mail (in the terms of this 
> > code fragment) is that ackdiff is no longer MSS or 2*MSS because of GRO 
> > for the opposite direction doesn't trigger all those ACKs a non-GRO 
> > receiver would.
> 
> And in any event, you cannot depend upon the ACK boundaries in any way
> shape or form, 2 * MSS is just one valid mode of ACK'ing.
> 
> Whether we get a GRO stretch ACK or a normal 2 * MSS one, it has to
> work properly either way.

I don't see how any of this is that much different from the data side. x 
bytes, x bytes, x bytes, ... (typically x=MSS) is just one valid mode of 
data sending, and yet the GRO is currently based on this one mode only. 
...It of course works properly also when that is not true but won't work 
as efficiently as when the size remains the same.
Eric Dumazet April 26, 2012, 8:10 a.m. UTC | #4
On Tue, 2012-04-24 at 10:56 +0200, Eric Dumazet wrote:

> Really I have many doubts about GRO today.

Particularly if a NIC driver provides linear skbs :
Resulting gro skb is a list of skbs, no memory savings, and many cache
line misses when copying to userland.

Maybe we could have a new kind of skb head allocation/setup, pointing to
a frag of 2048 bytes instead of a kmalloc() blob.

Right now, a fragged skb used 3 blocks of memory :

1) struct sk_buff
2) a bloc of 512 or 1024 or 2048 bytes of memory (skb->head)
3) a frag of 2048 (or PAGE_SIZE/2 or PAGE_SIZE)

While a linear skb has :

1) struct sk_buff
2) a bloc of 512 or 1024 or 2048 bytes of memory (skb->head) from
kmalloc()


Idea would have :

1) struct sk_buff
2) skb->head points to frag (aliasing, no memory allocation)
3) frag of 2048 (or PAGE_SIZE/2 or PAGE_SIZE)

Or the reverse (no frag so that skb is considered as linea), but special
code to 'allow' this skb head be considered as a frag when needed
(splice() code, or GRO merge, or TCP coalescing)


That would make GRO (and TCP coalescing) much more efficient, since the
resulting aggregated skb would be :
1) struct sk_buff
2) skb->head points to 1st frag (aliasing, no memory allocation)
3) array of [1..16] frags



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller April 26, 2012, 8:36 a.m. UTC | #5
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 26 Apr 2012 10:10:54 +0200

> Idea would have :
> 
> 1) struct sk_buff
> 2) skb->head points to frag (aliasing, no memory allocation)
> 3) frag of 2048 (or PAGE_SIZE/2 or PAGE_SIZE)

What would you set skb->data_len and skb->len to?

How would you handle tailroom?  We have the rule that you can't append
to the tail of a fragmented SKB (I mean append, the way that IPSEC
wants to write data to the end of an SKB when encrypting, for
example).  This is only allowed on fully linear SKBs.

So, for this reason and others, you can't pretend that this new
construction is linear in any way.  It would break a bunch of things.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet April 26, 2012, 9:10 a.m. UTC | #6
On Thu, 2012-04-26 at 04:36 -0400, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Thu, 26 Apr 2012 10:10:54 +0200
> 
> > Idea would have :
> > 
> > 1) struct sk_buff
> > 2) skb->head points to frag (aliasing, no memory allocation)
> > 3) frag of 2048 (or PAGE_SIZE/2 or PAGE_SIZE)
> 
> What would you set skb->data_len and skb->len to?
> 
> How would you handle tailroom?  We have the rule that you can't append
> to the tail of a fragmented SKB (I mean append, the way that IPSEC
> wants to write data to the end of an SKB when encrypting, for
> example).  This is only allowed on fully linear SKBs.
> 
> So, for this reason and others, you can't pretend that this new
> construction is linear in any way.  It would break a bunch of things.


The 'frag' would have a known size : 2048 bytes

But the end of it would be used by struct skb_shared_info

so data_len would be 0 in fact.

This would look like a regular linear skb.

Just a bit set in skb to say : Warning, skb->head was not kmalloced :
replace kfree(head) by put_page(...)

And this bit would be tested in GRO or tcp merge to 'upgrade' this
skb->head to proper page/frag


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller April 26, 2012, 9:18 a.m. UTC | #7
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 26 Apr 2012 11:10:02 +0200

> The 'frag' would have a known size : 2048 bytes
> 
> But the end of it would be used by struct skb_shared_info
> 
> so data_len would be 0 in fact.
> 
> This would look like a regular linear skb.
> 
> Just a bit set in skb to say : Warning, skb->head was not kmalloced :
> replace kfree(head) by put_page(...)
> 
> And this bit would be tested in GRO or tcp merge to 'upgrade' this
> skb->head to proper page/frag

And what happens if this ends up in a piece of code which wants to
append a page frag?

Or a piece of code which copies an SKB, including page frag parts?

All I'm saying is that the number of tests necessary to make this work
properly might become prohibitive.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet April 26, 2012, 9:22 a.m. UTC | #8
On Thu, 2012-04-26 at 05:18 -0400, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Thu, 26 Apr 2012 11:10:02 +0200
> 
> > The 'frag' would have a known size : 2048 bytes
> > 
> > But the end of it would be used by struct skb_shared_info
> > 
> > so data_len would be 0 in fact.
> > 
> > This would look like a regular linear skb.
> > 
> > Just a bit set in skb to say : Warning, skb->head was not kmalloced :
> > replace kfree(head) by put_page(...)
> > 
> > And this bit would be tested in GRO or tcp merge to 'upgrade' this
> > skb->head to proper page/frag
> 
> And what happens if this ends up in a piece of code which wants to
> append a page frag?
> 
> Or a piece of code which copies an SKB, including page frag parts?
> 
> All I'm saying is that the number of tests necessary to make this work
> properly might become prohibitive.

I'll cook a patch, I believe not so many parts assume skb->head can be
kmalloc()/kfree(). This should be contained in net/core/skbuff.c only.

Elsewhere we use skb->head as a pointer to memory. It will stay the
same.



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Maciej Żenczykowski April 26, 2012, 10:09 a.m. UTC | #9
it would be very useful if there was an api for freeing skb->head memory...

I've fooled around with a single memory buffer for both the head and
the data and seen huge performance wins under heavy packet load (half
the amount of malloc/free), but could never quite get it to be 100%
stable.

On Thu, Apr 26, 2012 at 2:22 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Thu, 2012-04-26 at 05:18 -0400, David Miller wrote:
>> From: Eric Dumazet <eric.dumazet@gmail.com>
>> Date: Thu, 26 Apr 2012 11:10:02 +0200
>>
>> > The 'frag' would have a known size : 2048 bytes
>> >
>> > But the end of it would be used by struct skb_shared_info
>> >
>> > so data_len would be 0 in fact.
>> >
>> > This would look like a regular linear skb.
>> >
>> > Just a bit set in skb to say : Warning, skb->head was not kmalloced :
>> > replace kfree(head) by put_page(...)
>> >
>> > And this bit would be tested in GRO or tcp merge to 'upgrade' this
>> > skb->head to proper page/frag
>>
>> And what happens if this ends up in a piece of code which wants to
>> append a page frag?
>>
>> Or a piece of code which copies an SKB, including page frag parts?
>>
>> All I'm saying is that the number of tests necessary to make this work
>> properly might become prohibitive.
>
> I'll cook a patch, I believe not so many parts assume skb->head can be
> kmalloc()/kfree(). This should be contained in net/core/skbuff.c only.
>
> Elsewhere we use skb->head as a pointer to memory. It will stay the
> same.
>
>
>
Eric Dumazet April 26, 2012, 12:32 p.m. UTC | #10
On Thu, 2012-04-26 at 03:09 -0700, Maciej Żenczykowski wrote:
> it would be very useful if there was an api for freeing skb->head memory...
> 
> I've fooled around with a single memory buffer for both the head and
> the data and seen huge performance wins under heavy packet load (half
> the amount of malloc/free), but could never quite get it to be 100%
> stable.

My patch is almost ready, and seems very good.

I added a param to skb_build() to tell the memory comes from a (sub)page

tg3 uses the new thing.

GRO does the correct merging.

(by the way, I noticed GRO lies about skb truesize... since it
accumulates frag lengths instead of allocated space... with a 1448/2048
mismatch for MTU=1500)

And this apparently solves my slub performance issue I had on my dual
quad core machine, since I no longer hit slub slow path.



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 8bb6ade..33b87b2 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2820,7 +2820,11 @@  found:
 	flush |= (__force int)(flags & TCP_FLAG_CWR);
 	flush |= (__force int)((flags ^ tcp_flag_word(th2)) &
 		  ~(TCP_FLAG_CWR | TCP_FLAG_FIN | TCP_FLAG_PSH));
-	flush |= (__force int)(th->ack_seq ^ th2->ack_seq);
+
+	ackgap = skb_shinfo(p)->ack_size;
+	ackdiff = th2->ack_seq - th->ack_seq;
+	flush |= (ackdiff - 1) >= ackgap;
+
 	for (i = sizeof(*th); i < thlen; i += 4)
 		flush |= *(u32 *)((u8 *)th + i) ^
 			 *(u32 *)((u8 *)th2 + i);