
[net-next] xen-netfront: try linearizing SKB if it occupies too many slots

Message ID 53763CD1.6060500@citrix.com
State RFC, archived
Delegated to: David Miller

Commit Message

Zoltan Kiss May 16, 2014, 4:29 p.m. UTC
On 16/05/14 16:34, Wei Liu wrote:
> On Fri, May 16, 2014 at 08:22:19AM -0700, Eric Dumazet wrote:
>> On Fri, 2014-05-16 at 15:36 +0100, Wei Liu wrote:
>>> On Fri, May 16, 2014 at 07:21:08AM -0700, Eric Dumazet wrote:
>>>> On Fri, 2014-05-16 at 14:11 +0100, Wei Liu wrote:
>>>>
>>>>> It's not that common to trigger this, I only saw a few reports. In fact
>>>>> Stefan's report is the first one that comes with a method to reproduce
>>>>> it.
>>>>>
>>>>> I tested with redis-benchmark on a guest with 256MB RAM and only saw a
>>>>> few "failed to linearize", never saw a single one with 1GB guest.
>>>>
>>>> Well, I am just saying. This is asking for order-5 allocations, and yes,
>>>> this is going to fail after a few days of uptime, no matter what you try.
>>>>
>>>
>>> Hmm... I see what you mean -- memory fragmentation leads to allocation
>>> failure. Thanks.
>>
>> In the mean time, have you tried to lower gso_max_size ?
>>
>> Setting it with netif_set_gso_max_size() to something like 56000 might
>> avoid the problem.
>>
>> (Not sure if it is applicable in your case)
>>
> 
> It works, at least in this Redis testcase. Could you explain a bit where
> this 56000 magic number comes from? :-)
> 
> Presumably I can derive it from some constant in core network code?

I guess it just makes it less likely to get packets with a problematic layout. But the following packet would still fail:
linear buffer: 80 bytes, on 2 pages
17 frags, 80 bytes each, each spanning a page boundary.
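
Counting slots the way xennet_start_xmit() does (pages spanned by the linear buffer plus pages spanned by each frag), such a packet already needs double the 18-slot limit even though it only carries 1440 bytes. A stand-alone sketch of that arithmetic, with 4 KiB pages and an arbitrary 40-byte in-page offset assumed:

#include <stdio.h>

#define PAGE_SIZE     4096u
#define MAX_SKB_FRAGS 17u
#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

int main(void)
{
	/* Linear buffer: 80 bytes straddling a page boundary
	 * (the last 40 bytes of one page, the first 40 of the next). */
	unsigned int off = PAGE_SIZE - 40;
	unsigned int slots = DIV_ROUND_UP(off + 80, PAGE_SIZE);        /* 2 */

	/* 17 frags, 80 bytes each, every one straddling a page boundary:
	 * each costs two slots, as in xennet_count_skb_frag_slots(). */
	for (unsigned int i = 0; i < MAX_SKB_FRAGS; i++)
		slots += DIV_ROUND_UP(off + 80, PAGE_SIZE);             /* +2 */

	/* 2 + 17 * 2 = 36 slots for a 1440-byte packet, against the
	 * MAX_SKB_FRAGS + 1 = 18 slot limit. */
	printf("slots = %u, limit = %u\n", slots, MAX_SKB_FRAGS + 1);
	return 0;
}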

I just had an idea: a modified version of the xenvif_handle_frag_list function from netback would be useful for us here. It recreates the frags array on fully utilized 4k pages. Plus we can use pskb_expand_head to reduce the number of pages backing the linear buffer (although it might not work, see my comment in the patch).
The worst case linear buffer then spans N+1 pages and has N*PAGE_SIZE+1 bytes. Then the frags after this coalescing should have 16*PAGE_SIZE - (N*PAGE_SIZE+2) bytes at most, which is 16-N pages. Altogether that's 16+1 pages, which should definitely fit!
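
A concrete instance of that bound, with N = 2 and 4 KiB pages assumed (illustrative numbers only, not taken from a real skb):

#include <stdio.h>

#define PAGE_SIZE 4096u
#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

int main(void)
{
	unsigned int n = 2;                             /* example value of N */
	unsigned int head_slots = n + 1;                /* head spans N+1 pages */

	/* Worst case: the head holds N*PAGE_SIZE+1 bytes, so at most
	 * 16*PAGE_SIZE - (N*PAGE_SIZE+2) bytes remain for the frags, each of
	 * which starts at offset 0 after coalescing. */
	unsigned int frag_bytes = 16 * PAGE_SIZE - (n * PAGE_SIZE + 2);
	unsigned int frag_slots = DIV_ROUND_UP(frag_bytes, PAGE_SIZE); /* 16 - N = 14 */

	/* 3 + 14 = 17 slots, safely under the 18-slot limit. */
	printf("slots = %u\n", head_slots + frag_slots);
	return 0;
}
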
This is what I mean:

8<--------------
--

Comments

Eric Dumazet May 16, 2014, 4:47 p.m. UTC | #1
On Fri, 2014-05-16 at 17:29 +0100, Zoltan Kiss wrote:
> On 16/05/14 16:34, Wei Liu wrote:
> > 
> > It works, at least in this Redis testcase. Could you explain a bit where
> > this 56000 magic number comes from? :-)
> > 
> > Presumably I can derive it from some constant in core network code?
> 
> I guess it just makes it less likely to get packets with a problematic layout. But the following packet would still fail:
> linear buffer: 80 bytes, on 2 pages
> 17 frags, 80 bytes each, each spanning a page boundary.

How would you build such skbs? It's _very_ difficult, you have to be
very, very smart to hit this.

Also reducing gso_max_size made sure order-5 allocations would not be
attempted in this unlikely case.

56000 + overhead is below 65536 -> order-4 allocations at most.
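
As a rough check of where the order-4/order-5 boundary falls (a stand-alone sketch; the ~2 KiB of header plus skb_shared_info overhead is an assumed figure, the point is only which side of 64 KiB the linearized buffer lands on):

#include <stdio.h>

#define PAGE_SIZE 4096u

/* Smallest order such that (1 << order) pages cover 'bytes',
 * mirroring the kernel's get_order(). */
static unsigned int order_for(unsigned int bytes)
{
	unsigned int order = 0;

	while ((PAGE_SIZE << order) < bytes)
		order++;
	return order;
}

int main(void)
{
	unsigned int overhead = 2048;   /* assumed headers + skb_shared_info */

	/* gso_max_size = 56000: linearizing still fits in one 64 KiB slab. */
	printf("56000 + overhead -> order %u\n", order_for(56000 + overhead));  /* 4 */

	/* A full ~64 KiB GSO payload plus the same overhead crosses 64 KiB. */
	printf("65536 + overhead -> order %u\n", order_for(65536 + overhead));  /* 5 */
	return 0;
}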





--
Zoltan Kiss May 16, 2014, 4:51 p.m. UTC | #2
On 16/05/14 17:47, Eric Dumazet wrote:
> On Fri, 2014-05-16 at 17:29 +0100, Zoltan Kiss wrote:
>> On 16/05/14 16:34, Wei Liu wrote:
>>>
>>> It works, at least in this Redis testcase. Could you explain a bit where
>>> this 56000 magic number comes from? :-)
>>>
>>> Presumably I can derive it from some constant in core network code?
>>
>> I guess it just makes it less likely to get packets with a problematic layout. But the following packet would still fail:
>> linear buffer: 80 bytes, on 2 pages
>> 17 frags, 80 bytes each, each spanning a page boundary.
>
> How would you build such skbs? It's _very_ difficult, you have to be
> very, very smart to hit this.
I wouldn't build such skbs; I would expect the network stack to create
such weird things sometimes :)
The goal here is to prepare for and handle the worst-case scenarios as well.
>
> Also reducing gso_max_size made sure order-5 allocations would not be
> attempted in this unlikely case.
But reducing the gso_max_size would have a bad impact on the general 
network throughput, right?
--
Wei Liu May 16, 2014, 4:54 p.m. UTC | #3
On Fri, May 16, 2014 at 05:29:05PM +0100, Zoltan Kiss wrote:
> On 16/05/14 16:34, Wei Liu wrote:
> > On Fri, May 16, 2014 at 08:22:19AM -0700, Eric Dumazet wrote:
> >> On Fri, 2014-05-16 at 15:36 +0100, Wei Liu wrote:
> >>> On Fri, May 16, 2014 at 07:21:08AM -0700, Eric Dumazet wrote:
> >>>> On Fri, 2014-05-16 at 14:11 +0100, Wei Liu wrote:
> >>>>
> >>>>> It's not that common to trigger this, I only saw a few reports. In fact
> >>>>> Stefan's report is the first one that comes with a method to reproduce
> >>>>> it.
> >>>>>
> >>>>> I tested with redis-benchmark on a guest with 256MB RAM and only saw a
> >>>>> few "failed to linearize", never saw a single one with 1GB guest.
> >>>>
> >>>> Well, I am just saying. This is asking for order-5 allocations, and yes,
> >>>> this is going to fail after a few days of uptime, no matter what you try.
> >>>>
> >>>
> >>> Hmm... I see what you mean -- memory fragmentation leads to allocation
> >>> failure. Thanks.
> >>
> >> In the mean time, have you tried to lower gso_max_size ?
> >>
> >> Setting it with netif_set_gso_max_size() to something like 56000 might
> >> avoid the problem.
> >>
> >> (Not sure if it is applicable in your case)
> >>
> > 
> > It works, at least in this Redis testcase. Could you explain a bit where
> > this 56000 magic number comes from? :-)
> > 
> > Presumably I can derive it from some constant in core network code?
> 
> I guess it just makes it less likely to get packets with a problematic layout. But the following packet would still fail:
> linear buffer: 80 bytes, on 2 pages
> 17 frags, 80 bytes each, each spanning a page boundary.
> 

Presumably the max GSO size affects the packet layout; that's what I was trying
to figure out.

> I just had an idea: a modified version of the xenvif_handle_frag_list function from netback would be useful for us here. It recreates the frags array on fully utilized 4k pages. Plus we can use pskb_expand_head to reduce the number of pages backing the linear buffer (although it might not work, see my comment in the patch).
> The worst case linear buffer then spans N+1 pages and has N*PAGE_SIZE+1 bytes. Then the frags after this coalescing should have 16*PAGE_SIZE - (N*PAGE_SIZE+2) bytes at most, which is 16-N pages. Altogether that's 16+1 pages, which should definitely fit!
> This is what I mean:
> 
> 8<--------------
> diff --git a/drivers/net/xen-netfront.c b/drivers/net/xen-netfront.c
> index 158b5e6..b1133d6 100644
> --- a/drivers/net/xen-netfront.c
> +++ b/drivers/net/xen-netfront.c
> @@ -544,6 +544,73 @@ static int xennet_count_skb_frag_slots(struct sk_buff *skb)
>  	return pages;
>  }
>  
> +int xenvif_reduce_pages(struct sk_buff *skb, int target)
> +{
> +	unsigned int offset = skb_headlen(skb);
> +	skb_frag_t frags[MAX_SKB_FRAGS];
> +	int newfrags, oldfrags;
> +	unsigned int pages, optimal;
> +
> +	BUG_ON(!target);
> +
> +	pages = DIV_ROUND_UP(offset_in_page(skb->data) + skb_headlen(skb), PAGE_SIZE);
> +	optimal = DIV_ROUND_UP(skb_headlen(skb), PAGE_SIZE);
> +	if (pages - optimal) {
> +		int err;
> +/* FIXME: we should check if pskb_expand_head really allocates on page boundary,
> + *  otherwise we can still have suboptimal page layout */
> +		if (unlikely(err = pskb_expand_head(skb, 0, 0, GFP_ATOMIC)))

I'm a bit lost. What do you expect from the call to pskb_expand_head?

I'm sorry, I cannot see the intended effect from the comments on
pskb_expand_head. If you call it with nhead and ntail equal to 0 it creates an
identical copy, but I don't see any guarantee on page alignment. Did I miss
something?

> +			return err;
> +		target -= pages - optimal;
> +		if (!target)
> +			return 0;
> +	}
> +
> +	/* Subtract frags size, we will correct it later */
> +	skb->truesize -= skb->data_len;
> +
> +	/* Create a brand new frags array and coalesce there */
> +	for (newfrags = 0; offset < skb->len; newfrags++) {
> +		struct page *page;
> +		unsigned int len;
> +
> +		BUG_ON(newfrags >= MAX_SKB_FRAGS);
> +		page = alloc_page(GFP_ATOMIC);

And the amount of memory allocation is a bit overkill, I think (though
it's still better than the order-5 allocation in skb_linearize). Can you
not just memmove all the paged data into the first few frags and release
the other frags?

Anyway, this method might still work, just a bit overkill IMHO.

Wei.

> +		if (!page) {
> +			int j;
> +			skb->truesize += skb->data_len;
> +			for (j = 0; j < newfrags; j++)
> +				put_page(frags[j].page.p);
> +			return -ENOMEM;
> +		}
> +
> +		if (offset + PAGE_SIZE < skb->len)
> +			len = PAGE_SIZE;
> +		else
> +			len = skb->len - offset;
> +		if (skb_copy_bits(skb, offset, page_address(page), len))
> +			BUG();
> +
> +		offset += len;
> +		frags[newfrags].page.p = page;
> +		frags[newfrags].page_offset = 0;
> +		skb_frag_size_set(&frags[newfrags], len);
> +	}
> +
> +	/* Drop the original buffers */
> +	for (oldfrags = 0; oldfrags < skb_shinfo(skb)->nr_frags; oldfrags++)
> +		skb_frag_unref(skb, oldfrags);
> +
> +	/* Swap the new frags array with the old one */
> +	memcpy(skb_shinfo(skb)->frags,
> +	       frags,
> +	       newfrags * sizeof(skb_frag_t));
> +	skb_shinfo(skb)->nr_frags = newfrags;
> +	/* Correct truesize */
> +	skb->truesize += newfrags * PAGE_SIZE;
> +	return 0;
> +}
> +
>  static int xennet_start_xmit(struct sk_buff *skb, struct net_device *dev)
>  {
>  	unsigned short id;
> @@ -573,11 +640,21 @@ static int xennet_start_xmit(struct sk_buff *skb, struct net_device *dev)
>  	slots = DIV_ROUND_UP(offset + len, PAGE_SIZE) +
>  		xennet_count_skb_frag_slots(skb);
>  	if (unlikely(slots > MAX_SKB_FRAGS + 1)) {
> -		net_alert_ratelimited(
> -			"xennet: skb rides the rocket: %d slots\n", slots);
> -		goto drop;
> +		if (unlikely(xenvif_reduce_pages(skb, slots - MAX_SKB_FRAGS - 1))) {
> +			net_alert_ratelimited(
> +				"xennet: couldn't reduce slot number from %d\n", slots);
> +			goto drop;
> +		}
> +		slots = DIV_ROUND_UP(offset_in_page(data) + len, PAGE_SIZE) +
> +			xennet_count_skb_frag_slots(skb);
> +		if (unlikely(slots > MAX_SKB_FRAGS + 1)) {
> +			net_alert_ratelimited(
> +				"xennet: slot reduction doesn't work, slots: %d\n", slots);
> +			goto drop;
> +		}
>  	}
>  
> +
>  	spin_lock_irqsave(&np->tx_lock, flags);
>  
>  	if (unlikely(!netif_carrier_ok(dev) ||
--
Eric Dumazet May 16, 2014, 5 p.m. UTC | #4
On Fri, 2014-05-16 at 17:51 +0100, Zoltan Kiss wrote:

> But reducing the gso_max_size would have a bad impact on the general 
> network throughput, right?

Not really.

Sending TSO packets with 44 MSS in them instead of 45 might in fact help
a bit. I never could detect any difference.

LRO/GRO receivers aggregate at most 17 or 22 segments anyway.

I said 56000 only as a quick way to test the thing.

If you want an explicit formula, it might be:

SKB_WITH_OVERHEAD(65535 - LL_MAX_HEADER) or something like that...
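
Applied in a driver, that formula could look roughly like the sketch below; the helper name and the idea of calling it from netfront's device setup path are illustrative assumptions, not something proposed in this thread:

#include <linux/netdevice.h>
#include <linux/skbuff.h>

/* Cap the advertised GSO size so that a fallback linearization never has
 * to ask for more than an order-4 allocation. */
static void xennet_cap_gso_size(struct net_device *dev)
{
	unsigned int cap = SKB_WITH_OVERHEAD(65535 - LL_MAX_HEADER);

	netif_set_gso_max_size(dev, cap);
}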



--
Zoltan Kiss May 19, 2014, 4:47 p.m. UTC | #5
On 16/05/14 17:54, Wei Liu wrote:
>> I just had an idea: a modified version of the xenvif_handle_frag_list function from netback would be useful for us here. It recreates the frags array on fully utilized 4k pages. Plus we can use pskb_expand_head to reduce the number of pages backing the linear buffer (although it might not work, see my comment in the patch).
>> The worst case linear buffer then spans N+1 pages and has N*PAGE_SIZE+1 bytes. Then the frags after this coalescing should have 16*PAGE_SIZE - (N*PAGE_SIZE+2) bytes at most, which is 16-N pages. Altogether that's 16+1 pages, which should definitely fit!
>> This is what I mean:
>>
>> 8<--------------
>> diff --git a/drivers/net/xen-netfront.c b/drivers/net/xen-netfront.c
>> index 158b5e6..b1133d6 100644
>> --- a/drivers/net/xen-netfront.c
>> +++ b/drivers/net/xen-netfront.c
>> @@ -544,6 +544,73 @@ static int xennet_count_skb_frag_slots(struct sk_buff *skb)
>>   	return pages;
>>   }
>>
>> +int xenvif_reduce_pages(struct sk_buff *skb, int target)
>> +{
>> +	unsigned int offset = skb_headlen(skb);
>> +	skb_frag_t frags[MAX_SKB_FRAGS];
>> +	int newfrags, oldfrags;
>> +	unsigned int pages, optimal;
>> +
>> +	BUG_ON(!target);
>> +
>> +	pages = DIV_ROUND_UP(offset_in_page(skb->data) + skb_headlen(skb), PAGE_SIZE);
>> +	optimal = DIV_ROUND_UP(skb_headlen(skb), PAGE_SIZE);
>> +	if (pages - optimal) {
>> +		int err;
>> +/* FIXME: we should check if pskb_expand_head really allocates on page boundary,
>> + *  otherwise we can still have suboptimal page layout */
>> +		if (unlikely(err = pskb_expand_head(skb, 0, 0, GFP_ATOMIC)))
>
> I'm a bit lost. What do you expect from the call to pskb_expand_head?
>
> I'm sorry I cannot see immediate result from the comment of
> pskb_expand_head. If you call with nhead and ntail equal to 0 it creates
> identical copy, but I don't see guarantee on page alignment. Did I miss
> something?
Yep, indeed, it doesn't directly guarantee that the new allocation won't
span page boundaries unnecessarily. But actually we are still OK,
as the skb shouldn't take more than 18 slots, so netback should be able to
handle that.

>
>> +			return err;
>> +		target -= pages - optimal;
>> +		if (!target)
>> +			return 0;
>> +	}
>> +
>> +	/* Subtract frags size, we will correct it later */
>> +	skb->truesize -= skb->data_len;
>> +
>> +	/* Create a brand new frags array and coalesce there */
>> +	for (newfrags = 0; offset < skb->len; newfrags++) {
>> +		struct page *page;
>> +		unsigned int len;
>> +
>> +		BUG_ON(newfrags >= MAX_SKB_FRAGS);
>> +		page = alloc_page(GFP_ATOMIC);
>
> And the amount of memory allocation is a bit overkill, I think (though
> it's still better than the order-5 allocation in skb_linearize). Can you
> not just memmove all the paged data into the first few frags and release
> the other frags?
>
> Anyway, this method might still work, just a bit overkill IMHO.

Yep, it's quite suboptimal, and anyone can come up with a better (and
probably more complex) solution. However:
- this should be a rarely used path, so performance doesn't matter that
much at the moment (though who knows under which workload you end up with
skbs fragmented so badly that you see this function called often...)
- it would be good to create a fix for this soon, and let it be backported
to major distro kernels where compound pages are enabled

Zoli

--

Patch

diff --git a/drivers/net/xen-netfront.c b/drivers/net/xen-netfront.c
index 158b5e6..b1133d6 100644
--- a/drivers/net/xen-netfront.c
+++ b/drivers/net/xen-netfront.c
@@ -544,6 +544,73 @@  static int xennet_count_skb_frag_slots(struct sk_buff *skb)
 	return pages;
 }
 
+int xenvif_reduce_pages(struct sk_buff *skb, int target)
+{
+	unsigned int offset = skb_headlen(skb);
+	skb_frag_t frags[MAX_SKB_FRAGS];
+	int newfrags, oldfrags;
+	unsigned int pages, optimal;
+
+	BUG_ON(!target);
+
+	pages = DIV_ROUND_UP(offset_in_page(skb->data) + skb_headlen(skb), PAGE_SIZE);
+	optimal = DIV_ROUND_UP(skb_headlen(skb), PAGE_SIZE);
+	if (pages - optimal) {
+		int err;
+/* FIXME: we should check if pskb_expand_head really allocates on page boundary,
+ *  otherwise we can still have suboptimal page layout */
+		if (unlikely(err = pskb_expand_head(skb, 0, 0, GFP_ATOMIC)))
+			return err;
+		target -= pages - optimal;
+		if (!target)
+			return 0;
+	}
+
+	/* Subtract frags size, we will correct it later */
+	skb->truesize -= skb->data_len;
+
+	/* Create a brand new frags array and coalesce there */
+	for (newfrags = 0; offset < skb->len; newfrags++) {
+		struct page *page;
+		unsigned int len;
+
+		BUG_ON(newfrags >= MAX_SKB_FRAGS);
+		page = alloc_page(GFP_ATOMIC);
+		if (!page) {
+			int j;
+			skb->truesize += skb->data_len;
+			for (j = 0; j < newfrags; j++)
+				put_page(frags[j].page.p);
+			return -ENOMEM;
+		}
+
+		if (offset + PAGE_SIZE < skb->len)
+			len = PAGE_SIZE;
+		else
+			len = skb->len - offset;
+		if (skb_copy_bits(skb, offset, page_address(page), len))
+			BUG();
+
+		offset += len;
+		frags[newfrags].page.p = page;
+		frags[newfrags].page_offset = 0;
+		skb_frag_size_set(&frags[newfrags], len);
+	}
+
+	/* Drop the original buffers */
+	for (oldfrags = 0; oldfrags < skb_shinfo(skb)->nr_frags; oldfrags++)
+		skb_frag_unref(skb, oldfrags);
+
+	/* Swap the new frags array with the old one */
+	memcpy(skb_shinfo(skb)->frags,
+	       frags,
+	       newfrags * sizeof(skb_frag_t));
+	skb_shinfo(skb)->nr_frags = newfrags;
+	/* Correct truesize */
+	skb->truesize += newfrags * PAGE_SIZE;
+	return 0;
+}
+
 static int xennet_start_xmit(struct sk_buff *skb, struct net_device *dev)
 {
 	unsigned short id;
@@ -573,11 +640,21 @@  static int xennet_start_xmit(struct sk_buff *skb, struct net_device *dev)
 	slots = DIV_ROUND_UP(offset + len, PAGE_SIZE) +
 		xennet_count_skb_frag_slots(skb);
 	if (unlikely(slots > MAX_SKB_FRAGS + 1)) {
-		net_alert_ratelimited(
-			"xennet: skb rides the rocket: %d slots\n", slots);
-		goto drop;
+		if (unlikely(xenvif_reduce_pages(skb, slots - MAX_SKB_FRAGS - 1))) {
+			net_alert_ratelimited(
+				"xennet: couldn't reduce slot number from %d\n", slots);
+			goto drop;
+		}
+		slots = DIV_ROUND_UP(offset_in_page(data) + len, PAGE_SIZE) +
+			xennet_count_skb_frag_slots(skb);
+		if (unlikely(slots > MAX_SKB_FRAGS + 1)) {
+			net_alert_ratelimited(
+				"xennet: slot reduction doesn't work, slots: %d\n", slots);
+			goto drop;
+		}
 	}
 
+
 	spin_lock_irqsave(&np->tx_lock, flags);
 
 	if (unlikely(!netif_carrier_ok(dev) ||