[2/6] net: pad skb data and shinfo as a whole rather than individually

Message ID: 1325783624-14601-2-git-send-email-ian.campbell@citrix.com
State: Changes Requested, archived
Delegated to: David Miller

Commit Message

Ian Campbell Jan. 5, 2012, 5:13 p.m. UTC
We already push the shinfo to the very end of the actual allocation which is
already cache-line aligned.

This reduces the minimum overhead required for this allocation such that the
shinfo can be grown in the following patch without overflowing 2048 bytes for a
1500 byte frame. Reducing this overhead means that sometimes the tail end of
the data can end up in the same cache line as the beginning of the shinfo but
in many cases the allocation slop means that there is no overlap.

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
---
 include/linux/skbuff.h |    4 ++--
 net/core/skbuff.c      |    3 +--
 2 files changed, 3 insertions(+), 4 deletions(-)
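
To make the space saving concrete, here is a small stand-alone sketch of the
arithmetic (not part of the patch; the 328-byte shinfo and the 1578-byte
request below are purely illustrative, assuming 64-byte cache lines):

#include <stdio.h>

#define SMP_CACHE_BYTES 64			/* assumed cache line size */
#define SKB_DATA_ALIGN(X) (((X) + (SMP_CACHE_BYTES - 1)) & \
			   ~(SMP_CACHE_BYTES - 1))

int main(void)
{
	unsigned int shinfo = 328;	/* illustrative sizeof(struct skb_shared_info) */
	unsigned int size   = 1578;	/* e.g. NET_SKB_PAD + 1500 + ethernet header */

	/* old scheme: round the data and the shinfo separately */
	unsigned int old = SKB_DATA_ALIGN(size) + SKB_DATA_ALIGN(shinfo);
	/* new scheme: round the sum once */
	unsigned int new = SKB_DATA_ALIGN(size + shinfo);

	printf("old: %u, new: %u\n", old, new);	/* 1984 vs 1920 */
	/* The up-to-one-cache-line saving is what lets the following patch
	 * grow shinfo without the 1500 byte case spilling past the
	 * 2048 byte kmalloc bucket. */
	return 0;
}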

Comments

Eric Dumazet Jan. 5, 2012, 5:46 p.m. UTC | #1
Le jeudi 05 janvier 2012 à 17:13 +0000, Ian Campbell a écrit :
> We already push the shinfo to the very end of the actual allocation which is
> already cache-line aligned.
> 
> This reduces the minimum overhead required for this allocation such that the
> shinfo can be grown in the following patch without overflowing 2048 bytes for a
> 1500 byte frame. Reducing this overhead means that sometimes the tail end of
> the data can end up in the same cache line as the beginning of the shinfo but
> in many cases the allocation slop means that there is no overlap.
> 
> Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
> ---
>  include/linux/skbuff.h |    4 ++--
>  net/core/skbuff.c      |    3 +--
>  2 files changed, 3 insertions(+), 4 deletions(-)
> 
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index 50db9b0..f04333b 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -41,7 +41,7 @@
>  #define SKB_DATA_ALIGN(X)	(((X) + (SMP_CACHE_BYTES - 1)) & \
>  				 ~(SMP_CACHE_BYTES - 1))
>  #define SKB_WITH_OVERHEAD(X)	\
> -	((X) - SKB_DATA_ALIGN(sizeof(struct skb_shared_info)))
> +	((X) - sizeof(struct skb_shared_info))
>  #define SKB_MAX_ORDER(X, ORDER) \
>  	SKB_WITH_OVERHEAD((PAGE_SIZE << (ORDER)) - (X))
>  #define SKB_MAX_HEAD(X)		(SKB_MAX_ORDER((X), 0))
> @@ -50,7 +50,7 @@
>  /* return minimum truesize of one skb containing X bytes of data */
>  #define SKB_TRUESIZE(X) ((X) +						\
>  			 SKB_DATA_ALIGN(sizeof(struct sk_buff)) +	\
> -			 SKB_DATA_ALIGN(sizeof(struct skb_shared_info)))
> +			 sizeof(struct skb_shared_info))
>  
>  /* A. Checksumming of received packets by device.
>   *
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index da0c97f..b6de604 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -189,8 +189,7 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
>  	 * aligned memory blocks, unless SLUB/SLAB debug is enabled.
>  	 * Both skb->head and skb_shared_info are cache line aligned.
>  	 */
> -	size = SKB_DATA_ALIGN(size);
> -	size += SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
> +	size = SKB_DATA_ALIGN(size + sizeof(struct skb_shared_info));
>  	data = kmalloc_node_track_caller(size, gfp_mask, node);
>  	if (!data)
>  		goto nodata;

Unfortunately this means the frequently used part of skb_shared_info is
no longer contained in a single cache line.

This will slow down many operations, such as kfree_skb() of cold skbs
(TX completion, for example).
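
To make the concern concrete, a rough worked example using purely
illustrative sizes (64-byte cache lines, a hypothetical 328-byte
skb_shared_info):

	/* combined rounding: the end of the area is cache-line aligned */
	end    = SKB_DATA_ALIGN(1578 + 328) = 1920	/* multiple of 64 */
	/* shinfo is pushed to the very end of the allocation */
	shinfo = end - 328                  = 1592	/* 1592 % 64 == 56 */

With the shinfo at the very end, its start is cache-line aligned only when
sizeof(struct skb_shared_info) happens to be a multiple of SMP_CACHE_BYTES;
otherwise the fields at the front of shinfo straddle a cache-line boundary
and can share a line with the tail of the packet data.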

David Miller Jan. 5, 2012, 7:13 p.m. UTC | #2
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 05 Jan 2012 18:46:47 +0100

>> @@ -189,8 +189,7 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
>>  	 * aligned memory blocks, unless SLUB/SLAB debug is enabled.
>>  	 * Both skb->head and skb_shared_info are cache line aligned.
>>  	 */
>> -	size = SKB_DATA_ALIGN(size);
>> -	size += SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
>> +	size = SKB_DATA_ALIGN(size + sizeof(struct skb_shared_info));
>>  	data = kmalloc_node_track_caller(size, gfp_mask, node);
>>  	if (!data)
>>  		goto nodata;
> 
> Unfortunately this means the frequently used part of skb_shared_info is
> no longer contained in a single cache line.
> 
> This will slow down many operations, such as kfree_skb() of cold skbs
> (TX completion, for example).

My concerns match Eric's here.

Ian, you state that you attempted to look into schemes that put the
destructor info somewhere other than skb_shared_info() and that none
of them worked well.

But what kind of issues do you run into if you add this into struct
page itself?  Other subsystems could then use this facility too, and
if I recall there was even some talk of this.

Ian Campbell Jan. 5, 2012, 8 p.m. UTC | #3
On Thu, 2012-01-05 at 19:13 +0000, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Thu, 05 Jan 2012 18:46:47 +0100
> 
> >> @@ -189,8 +189,7 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
> >>  	 * aligned memory blocks, unless SLUB/SLAB debug is enabled.
> >>  	 * Both skb->head and skb_shared_info are cache line aligned.
> >>  	 */
> >> -	size = SKB_DATA_ALIGN(size);
> >> -	size += SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
> >> +	size = SKB_DATA_ALIGN(size + sizeof(struct skb_shared_info));
> >>  	data = kmalloc_node_track_caller(size, gfp_mask, node);
> >>  	if (!data)
> >>  		goto nodata;
> > 
> > Unfortunately this means the frequently used part of skb_shared_info is
> > no longer contained in a single cache line.
> > 
> > This will slow down many operations, such as kfree_skb() of cold skbs
> > (TX completion, for example).
> 
> My concerns match Eric's here.
> 
> Ian, you state that you attempted to look into schemes that put the
> destructor info somewhere other than skb_shared_info() and that none
> of them worked well.
> 
> But what kind of issues do you run into if you add this into struct
> page itself?  Other subsystems could then use this facility too, and
> if I recall there was even some talk of this.

This is what the old out-of-tree Xen stuff used to do, and my
understanding is that it was nacked by the mm guys, although that was a
long while ago and I wasn't really involved at the time. I think space
in struct page is pretty tight, and any addition is multiplied over the
number of pages in the system, which makes it undesirable. AIUI there
were also complaints about adding a hook on the page free path, which
is quite a hot one.

I could run it by them again but...

Although this scheme works fine for Xen's netback I don't think it works
for other use cases like the NFS one (or basically any kernel_sendpage
usage). In those cases you don't want to wait for the last core mm ref
on the page to get dropped, you just want to wait for the last ref due
to the particular sendpage. If you use the core page_count() reference
count then you end up waiting for the process to exit (and drop the core
reference count) before the destructor fires and you can complete the
write, which is obviously not what is desired!

There are also issues with things like two threads simultaneously doing
I/O from the same page. If one lot of I/O is to NFS and the other to
iSCSI (assuming they both use this facility in the future) then they
will clash over the use of the struct page field. In fact even if they
were both to NFS I bet nothing good would happen...

Cheers,
Ian.
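
To make the refcounting problem concrete, a purely illustrative
user-space sketch (hypothetical types and names, not kernel code): a
destructor hung off the page's own reference count only fires when
*every* user of the page is done, whereas a sendpage-style user needs a
count covering only the references it handed to the network stack.

#include <stdio.h>

struct page_like {
	int count;			/* core refcount: process mapping, other I/O, ... */
};

struct sendpage_ref {			/* hypothetical per-operation tracker */
	struct page_like *page;
	int refs;			/* refs handed out by this sendpage only */
	void (*destroy)(struct page_like *);
};

static void write_done(struct page_like *p)
{
	printf("sendpage I/O complete, ack the write (page count still %d)\n",
	       p->count);
}

static void put_sendpage_ref(struct sendpage_ref *r)
{
	if (--r->refs == 0)		/* last ref from *this* operation */
		r->destroy(r->page);	/* fires even though page->count > 0 */
}

int main(void)
{
	struct page_like p = { .count = 2 };	/* process mapping + our I/O */
	struct sendpage_ref op = { &p, 1, write_done };

	/* If the destructor were tied to p.count instead, it would not run
	 * until the process also dropped its mapping, stalling the NFS or
	 * iSCSI completion that only cares about op.refs -- and two
	 * subsystems doing I/O on the same page would fight over the one
	 * field in struct page. */
	put_sendpage_ref(&op);
	return 0;
}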

David Miller Jan. 7, 2012, 6:18 p.m. UTC | #4
From: Ian Campbell <Ian.Campbell@citrix.com>
Date: Thu, 5 Jan 2012 20:00:58 +0000

> Although this scheme works fine for Xen's netback I don't think it works
> for other use cases like the NFS one (or basically any kernel_sendpage
> usage). In those cases you don't want to wait for the last core mm ref
> on the page to get dropped; you just want to wait for the last ref due
> to the particular sendpage. If you use the core page_count() reference
> count then you end up waiting for the process to exit (and drop the core
> reference count) before the destructor fires and you can complete the
> write, which is obviously not what is desired!
> 
> There are also issues with things like two threads simultaneously doing
> I/O from the same page. If one lot of I/O is to NFS and the other to
> iSCSI (assuming they both use this facility in the future) then they
> will clash over the use of the struct page field. In fact even if they
> were both to NFS I bet nothing good would happen...

Ok, struct page doesn't seem the best place for this stuff then.

Thanks.

Patch

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 50db9b0..f04333b 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -41,7 +41,7 @@ 
 #define SKB_DATA_ALIGN(X)	(((X) + (SMP_CACHE_BYTES - 1)) & \
 				 ~(SMP_CACHE_BYTES - 1))
 #define SKB_WITH_OVERHEAD(X)	\
-	((X) - SKB_DATA_ALIGN(sizeof(struct skb_shared_info)))
+	((X) - sizeof(struct skb_shared_info))
 #define SKB_MAX_ORDER(X, ORDER) \
 	SKB_WITH_OVERHEAD((PAGE_SIZE << (ORDER)) - (X))
 #define SKB_MAX_HEAD(X)		(SKB_MAX_ORDER((X), 0))
@@ -50,7 +50,7 @@ 
 /* return minimum truesize of one skb containing X bytes of data */
 #define SKB_TRUESIZE(X) ((X) +						\
 			 SKB_DATA_ALIGN(sizeof(struct sk_buff)) +	\
-			 SKB_DATA_ALIGN(sizeof(struct skb_shared_info)))
+			 sizeof(struct skb_shared_info))
 
 /* A. Checksumming of received packets by device.
  *
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index da0c97f..b6de604 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -189,8 +189,7 @@  struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
 	 * aligned memory blocks, unless SLUB/SLAB debug is enabled.
 	 * Both skb->head and skb_shared_info are cache line aligned.
 	 */
-	size = SKB_DATA_ALIGN(size);
-	size += SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
+	size = SKB_DATA_ALIGN(size + sizeof(struct skb_shared_info));
 	data = kmalloc_node_track_caller(size, gfp_mask, node);
 	if (!data)
 		goto nodata;