Message ID | alpine.DEB.2.00.1204241131580.735@wel-95.cs.helsinki.fi |
---|---|
State | RFC, archived |
Delegated to: | David Miller |
Headers | show |
From: "Ilpo Järvinen" <ilpo.jarvinen@helsinki.fi> Date: Tue, 24 Apr 2012 11:40:27 +0300 (EEST) > And the problem I mentioned in the previous mail (in the terms of this > code fragment) is that ackdiff is no longer MSS or 2*MSS because of GRO > for the opposite direction doesn't trigger all those ACKs a non-GRO > receiver would. This is the ACK counting Eric mentioned, which we need to preserve for the purposes of CWND growth. And in any event, you cannot depend upon the ACK boundaries in any way shape or form, 2 * MSS is just one valid mode of ACK'ing. Whether we get a GRO stretch ACK or a normal 2 * MSS one, it has to work properly either way. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, 2012-04-24 at 11:40 +0300, Ilpo Järvinen wrote: > diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c > index 8bb6ade..33b87b2 100644 > --- a/net/ipv4/tcp.c > +++ b/net/ipv4/tcp.c > @@ -2820,7 +2820,11 @@ found: > flush |= (__force int)(flags & TCP_FLAG_CWR); > flush |= (__force int)((flags ^ tcp_flag_word(th2)) & > ~(TCP_FLAG_CWR | TCP_FLAG_FIN | TCP_FLAG_PSH)); > - flush |= (__force int)(th->ack_seq ^ th2->ack_seq); > + > + ackgap = skb_shinfo(p)->ack_size; > + ackdiff = th2->ack_seq - th->ack_seq; > + flush |= (ackdiff - 1) >= ackgap; > + > for (i = sizeof(*th); i < thlen; i += 4) > flush |= *(u32 *)((u8 *)th + i) ^ > *(u32 *)((u8 *)th2 + i); > See how we duplicate tcp stack in GRO layer ;) Really I have many doubts about GRO today. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, 24 Apr 2012, David Miller wrote: > From: "Ilpo Järvinen" <ilpo.jarvinen@helsinki.fi> > Date: Tue, 24 Apr 2012 11:40:27 +0300 (EEST) > > > And the problem I mentioned in the previous mail (in the terms of this > > code fragment) is that ackdiff is no longer MSS or 2*MSS because of GRO > > for the opposite direction doesn't trigger all those ACKs a non-GRO > > receiver would. > > And in any event, you cannot depend upon the ACK boundaries in any way > shape or form, 2 * MSS is just one valid mode of ACK'ing. > > Whether we get a GRO stretch ACK or a normal 2 * MSS one, it has to > work properly either way. I don't see how any of this is that much different from the data side. x bytes, x bytes, x bytes, ... (typically x=MSS) is just one valid mode of data sending, and yet the GRO is currently based on this one mode only. ...It of course works properly also when that is not true but won't work as efficiently as when the size remains the same.
On Tue, 2012-04-24 at 10:56 +0200, Eric Dumazet wrote:
> Really I have many doubts about GRO today.
Particularly if a NIC driver provides linear skbs :
Resulting gro skb is a list of skbs, no memory savings, and many cache
line misses when copying to userland.
Maybe we could have a new kind of skb head allocation/setup, pointing to
a frag of 2048 bytes instead of a kmalloc() blob.
Right now, a fragged skb used 3 blocks of memory :
1) struct sk_buff
2) a bloc of 512 or 1024 or 2048 bytes of memory (skb->head)
3) a frag of 2048 (or PAGE_SIZE/2 or PAGE_SIZE)
While a linear skb has :
1) struct sk_buff
2) a bloc of 512 or 1024 or 2048 bytes of memory (skb->head) from
kmalloc()
Idea would have :
1) struct sk_buff
2) skb->head points to frag (aliasing, no memory allocation)
3) frag of 2048 (or PAGE_SIZE/2 or PAGE_SIZE)
Or the reverse (no frag so that skb is considered as linea), but special
code to 'allow' this skb head be considered as a frag when needed
(splice() code, or GRO merge, or TCP coalescing)
That would make GRO (and TCP coalescing) much more efficient, since the
resulting aggregated skb would be :
1) struct sk_buff
2) skb->head points to 1st frag (aliasing, no memory allocation)
3) array of [1..16] frags
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
From: Eric Dumazet <eric.dumazet@gmail.com> Date: Thu, 26 Apr 2012 10:10:54 +0200 > Idea would have : > > 1) struct sk_buff > 2) skb->head points to frag (aliasing, no memory allocation) > 3) frag of 2048 (or PAGE_SIZE/2 or PAGE_SIZE) What would you set skb->data_len and skb->len to? How would you handle tailroom? We have the rule that you can't append to the tail of a fragmented SKB (I mean append, the way that IPSEC wants to write data to the end of an SKB when encrypting, for example). This is only allowed on fully linear SKBs. So, for this reason and others, you can't pretend that this new construction is linear in any way. It would break a bunch of things. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, 2012-04-26 at 04:36 -0400, David Miller wrote: > From: Eric Dumazet <eric.dumazet@gmail.com> > Date: Thu, 26 Apr 2012 10:10:54 +0200 > > > Idea would have : > > > > 1) struct sk_buff > > 2) skb->head points to frag (aliasing, no memory allocation) > > 3) frag of 2048 (or PAGE_SIZE/2 or PAGE_SIZE) > > What would you set skb->data_len and skb->len to? > > How would you handle tailroom? We have the rule that you can't append > to the tail of a fragmented SKB (I mean append, the way that IPSEC > wants to write data to the end of an SKB when encrypting, for > example). This is only allowed on fully linear SKBs. > > So, for this reason and others, you can't pretend that this new > construction is linear in any way. It would break a bunch of things. The 'frag' would have a known size : 2048 bytes But the end of it would be used by struct skb_shared_info so data_len would be 0 in fact. This would look like a regular linear skb. Just a bit set in skb to say : Warning, skb->head was not kmalloced : replace kfree(head) by put_page(...) And this bit would be tested in GRO or tcp merge to 'upgrade' this skb->head to proper page/frag -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
From: Eric Dumazet <eric.dumazet@gmail.com> Date: Thu, 26 Apr 2012 11:10:02 +0200 > The 'frag' would have a known size : 2048 bytes > > But the end of it would be used by struct skb_shared_info > > so data_len would be 0 in fact. > > This would look like a regular linear skb. > > Just a bit set in skb to say : Warning, skb->head was not kmalloced : > replace kfree(head) by put_page(...) > > And this bit would be tested in GRO or tcp merge to 'upgrade' this > skb->head to proper page/frag And what happens if this ends up in a piece of code which wants to append a page frag? Or a piece of code which copies an SKB, including page frag parts? All I'm saying is that the number of tests necessary to make this work properly might become prohibitive. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, 2012-04-26 at 05:18 -0400, David Miller wrote: > From: Eric Dumazet <eric.dumazet@gmail.com> > Date: Thu, 26 Apr 2012 11:10:02 +0200 > > > The 'frag' would have a known size : 2048 bytes > > > > But the end of it would be used by struct skb_shared_info > > > > so data_len would be 0 in fact. > > > > This would look like a regular linear skb. > > > > Just a bit set in skb to say : Warning, skb->head was not kmalloced : > > replace kfree(head) by put_page(...) > > > > And this bit would be tested in GRO or tcp merge to 'upgrade' this > > skb->head to proper page/frag > > And what happens if this ends up in a piece of code which wants to > append a page frag? > > Or a piece of code which copies an SKB, including page frag parts? > > All I'm saying is that the number of tests necessary to make this work > properly might become prohibitive. I'll cook a patch, I believe not so many parts assume skb->head can be kmalloc()/kfree(). This should be contained in net/core/skbuff.c only. Elsewhere we use skb->head as a pointer to memory. It will stay the same. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
it would be very useful if there was an api for freeing skb->head memory... I've fooled around with a single memory buffer for both the head and the data and seen huge performance wins under heavy packet load (half the amount of malloc/free), but could never quite get it to be 100% stable. On Thu, Apr 26, 2012 at 2:22 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote: > On Thu, 2012-04-26 at 05:18 -0400, David Miller wrote: >> From: Eric Dumazet <eric.dumazet@gmail.com> >> Date: Thu, 26 Apr 2012 11:10:02 +0200 >> >> > The 'frag' would have a known size : 2048 bytes >> > >> > But the end of it would be used by struct skb_shared_info >> > >> > so data_len would be 0 in fact. >> > >> > This would look like a regular linear skb. >> > >> > Just a bit set in skb to say : Warning, skb->head was not kmalloced : >> > replace kfree(head) by put_page(...) >> > >> > And this bit would be tested in GRO or tcp merge to 'upgrade' this >> > skb->head to proper page/frag >> >> And what happens if this ends up in a piece of code which wants to >> append a page frag? >> >> Or a piece of code which copies an SKB, including page frag parts? >> >> All I'm saying is that the number of tests necessary to make this work >> properly might become prohibitive. > > I'll cook a patch, I believe not so many parts assume skb->head can be > kmalloc()/kfree(). This should be contained in net/core/skbuff.c only. > > Elsewhere we use skb->head as a pointer to memory. It will stay the > same. > > >
On Thu, 2012-04-26 at 03:09 -0700, Maciej Żenczykowski wrote: > it would be very useful if there was an api for freeing skb->head memory... > > I've fooled around with a single memory buffer for both the head and > the data and seen huge performance wins under heavy packet load (half > the amount of malloc/free), but could never quite get it to be 100% > stable. My patch is almost ready, and seems very good. I added a param to skb_build() to tell the memory comes from a (sub)page tg3 uses the new thing. GRO does the correct merging. (by the way, I noticed GRO lies about skb truesize... since it accumulates frag lengths instead of allocated space... with a 1448/2048 mismatch for MTU=1500) And this apparently solves my slub performance issue I had on my dual quad core machine, since I no longer hit slub slow path. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index 8bb6ade..33b87b2 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -2820,7 +2820,11 @@ found: flush |= (__force int)(flags & TCP_FLAG_CWR); flush |= (__force int)((flags ^ tcp_flag_word(th2)) & ~(TCP_FLAG_CWR | TCP_FLAG_FIN | TCP_FLAG_PSH)); - flush |= (__force int)(th->ack_seq ^ th2->ack_seq); + + ackgap = skb_shinfo(p)->ack_size; + ackdiff = th2->ack_seq - th->ack_seq; + flush |= (ackdiff - 1) >= ackgap; + for (i = sizeof(*th); i < thlen; i += 4) flush |= *(u32 *)((u8 *)th + i) ^ *(u32 *)((u8 *)th2 + i);