Message ID | 53763CD1.6060500@citrix.com
---|---
State | RFC, archived |
Delegated to: | David Miller |
On Fri, 2014-05-16 at 17:29 +0100, Zoltan Kiss wrote:
> On 16/05/14 16:34, Wei Liu wrote:
> >
> > It works, at least in this Redis testcase. Could you explain a bit where
> > this 56000 magic number comes from? :-)
> >
> > Presumably I can derive it from some constant in core network code?
>
> I guess it just makes it more unlikely to have packets with a problematic layout. But the following packet would still fail:
> linear buffer: 80 bytes, on 2 pages
> 17 frags, 80 bytes each, each spanning over a page boundary.

How would you build such skbs? It's _very_ difficult; you have to be very, very smart to hit this.

Also, reducing gso_max_size made sure order-5 allocations would not be attempted in this unlikely case. 56000 + overhead is below 65536 -> order-4 allocations at most.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On 16/05/14 17:47, Eric Dumazet wrote:
> On Fri, 2014-05-16 at 17:29 +0100, Zoltan Kiss wrote:
> > On 16/05/14 16:34, Wei Liu wrote:
> > >
> > > It works, at least in this Redis testcase. Could you explain a bit where
> > > this 56000 magic number comes from? :-)
> > >
> > > Presumably I can derive it from some constant in core network code?
> >
> > I guess it just makes it more unlikely to have packets with a problematic layout. But the following packet would still fail:
> > linear buffer: 80 bytes, on 2 pages
> > 17 frags, 80 bytes each, each spanning over a page boundary.
>
> How would you build such skbs? It's _very_ difficult; you have to be
> very, very smart to hit this.

I wouldn't build such skbs; I would expect the network stack to create such weird things sometimes :) The goal here is to prepare for and handle the worst case scenarios as well.

> Also, reducing gso_max_size made sure order-5 allocations would not be
> attempted in this unlikely case.

But reducing the gso_max_size would have a bad impact on general network throughput, right?
On Fri, May 16, 2014 at 05:29:05PM +0100, Zoltan Kiss wrote:
> On 16/05/14 16:34, Wei Liu wrote:
> > On Fri, May 16, 2014 at 08:22:19AM -0700, Eric Dumazet wrote:
> > > On Fri, 2014-05-16 at 15:36 +0100, Wei Liu wrote:
> > > > On Fri, May 16, 2014 at 07:21:08AM -0700, Eric Dumazet wrote:
> > > > > On Fri, 2014-05-16 at 14:11 +0100, Wei Liu wrote:
> > > > >
> > > > > > It's not that common to trigger this, I only saw a few reports. In fact
> > > > > > Stefan's report is the first one that comes with a method to reproduce
> > > > > > it.
> > > > > >
> > > > > > I tested with redis-benchmark on a guest with 256MB RAM and only saw a
> > > > > > few "failed to linearize", never saw a single one with 1GB guest.
> > > > >
> > > > > Well, I am just saying. This is asking for order-5 allocations, and yes,
> > > > > this is going to fail after a few days of uptime, no matter what you try.
> > > >
> > > > Hmm... I see what you mean -- memory fragmentation leads to allocation
> > > > failure. Thanks.
> > >
> > > In the meantime, have you tried to lower gso_max_size?
> > >
> > > Setting it with netif_set_gso_max_size() to something like 56000 might
> > > avoid the problem.
> > >
> > > (Not sure if it is applicable in your case)
> >
> > It works, at least in this Redis testcase. Could you explain a bit where
> > this 56000 magic number comes from? :-)
> >
> > Presumably I can derive it from some constant in core network code?
>
> I guess it just makes it more unlikely to have packets with a problematic layout. But the following packet would still fail:
> linear buffer: 80 bytes, on 2 pages
> 17 frags, 80 bytes each, each spanning over a page boundary.

Presumably max GSO size affects packet layout; that's what I was trying to figure out.

> I just had an idea: a modified version of the xenvif_handle_frag_list function from netback would be useful for us here. It recreates the frags array on fully utilized 4k pages. Plus we can use pskb_expand_head to reduce the page number on the linear buffer (although it might not work, see my comment in the patch)
> The worst case linear buffer then spans N+1 pages, and has N*PAGE_SIZE+1 bytes. Then the frags after this coalescing should have 16*PAGE_SIZE - (N*PAGE_SIZE+2) bytes at most, which is 16-N pages. Altogether that's 16+1 pages, which should definitely fit!
> This is what I mean:
>
> 8<--------------
> diff --git a/drivers/net/xen-netfront.c b/drivers/net/xen-netfront.c
> index 158b5e6..b1133d6 100644
> --- a/drivers/net/xen-netfront.c
> +++ b/drivers/net/xen-netfront.c
> @@ -544,6 +544,73 @@ static int xennet_count_skb_frag_slots(struct sk_buff *skb)
>  	return pages;
>  }
>
> +int xenvif_reduce_pages(struct sk_buff *skb, int target)
> +{
> +	unsigned int offset = skb_headlen(skb);
> +	skb_frag_t frags[MAX_SKB_FRAGS];
> +	int newfrags, oldfrags;
> +	unsigned int pages, optimal;
> +
> +	BUG_ON(!target);
> +
> +	pages = DIV_ROUND_UP(offset_in_page(skb->data) + skb_headlen(skb), PAGE_SIZE);
> +	optimal = DIV_ROUND_UP(skb_headlen(skb), PAGE_SIZE);
> +	if (pages - optimal) {
> +		int err;
> +/* FIXME: we should check if pskb_expand_head really allocates on page boundary,
> + * otherwise we can still have suboptimal page layout */
> +		if (unlikely(err = pskb_expand_head(skb, 0, 0, GFP_ATOMIC)))

I'm a bit lost. What do you expect from the call to pskb_expand_head?

I'm sorry, I cannot see the immediate result from the comment of
pskb_expand_head. If you call it with nhead and ntail equal to 0 it creates
an identical copy, but I don't see a guarantee on page alignment. Did I miss
something?

> +			return err;
> +		target -= pages - optimal;
> +		if (!target)
> +			return 0;
> +	}
> +
> +	/* Subtract frags size, we will correct it later */
> +	skb->truesize -= skb->data_len;
> +
> +	/* Create a brand new frags array and coalesce there */
> +	for (newfrags = 0; offset < skb->len; newfrags++) {
> +		struct page *page;
> +		unsigned int len;
> +
> +		BUG_ON(newfrags >= MAX_SKB_FRAGS);
> +		page = alloc_page(GFP_ATOMIC);

And the amount of memory allocation is a bit overkill, I think (though
it's still better than the order-5 allocation in skb_linearize). Can you
not just memmove all paged data to the first few frags and release the
other frags?

Anyway, this method might still work, just a bit overkill IMHO.

Wei.

> +		if (!page) {
> +			int j;
> +			skb->truesize += skb->data_len;
> +			for (j = 0; j < newfrags; j++)
> +				put_page(frags[j].page.p);
> +			return -ENOMEM;
> +		}
> +
> +		if (offset + PAGE_SIZE < skb->len)
> +			len = PAGE_SIZE;
> +		else
> +			len = skb->len - offset;
> +		if (skb_copy_bits(skb, offset, page_address(page), len))
> +			BUG();
> +
> +		offset += len;
> +		frags[newfrags].page.p = page;
> +		frags[newfrags].page_offset = 0;
> +		skb_frag_size_set(&frags[newfrags], len);
> +	}
> +
> +	/* Drop the original buffers */
> +	for (oldfrags = 0; oldfrags < skb_shinfo(skb)->nr_frags; oldfrags++)
> +		skb_frag_unref(skb, oldfrags);
> +
> +	/* Swap the new frags array with the old one */
> +	memcpy(skb_shinfo(skb)->frags,
> +	       frags,
> +	       newfrags * sizeof(skb_frag_t));
> +	skb_shinfo(skb)->nr_frags = newfrags;
> +	/* Correct truesize */
> +	skb->truesize += newfrags * PAGE_SIZE;
> +	return 0;
> +}
> +
>  static int xennet_start_xmit(struct sk_buff *skb, struct net_device *dev)
>  {
>  	unsigned short id;
> @@ -573,11 +640,21 @@ static int xennet_start_xmit(struct sk_buff *skb, struct net_device *dev)
>  	slots = DIV_ROUND_UP(offset + len, PAGE_SIZE) +
>  		xennet_count_skb_frag_slots(skb);
>  	if (unlikely(slots > MAX_SKB_FRAGS + 1)) {
> -		net_alert_ratelimited(
> -			"xennet: skb rides the rocket: %d slots\n", slots);
> -		goto drop;
> +		if (unlikely(xenvif_reduce_pages(skb, slots - MAX_SKB_FRAGS - 1))) {
> +			net_alert_ratelimited(
> +				"xennet: couldn't reduce slot number from %d\n", slots);
> +			goto drop;
> +		}
> +		slots = DIV_ROUND_UP(offset_in_page(data) + len, PAGE_SIZE) +
> +			xennet_count_skb_frag_slots(skb);
> +		if (unlikely(slots > MAX_SKB_FRAGS + 1)) {
> +			net_alert_ratelimited(
> +				"xennet: slot reduction doesn't work, slots: %d\n", slots);
> +			goto drop;
> +		}
>  	}
>
> +
>  	spin_lock_irqsave(&np->tx_lock, flags);
>
>  	if (unlikely(!netif_carrier_ok(dev) ||
On Fri, 2014-05-16 at 17:51 +0100, Zoltan Kiss wrote:
> But reducing the gso_max_size would have a bad impact on general
> network throughput, right?

Not really. Sending TSO packets with 44 MSS in them instead of 45 might in fact help a bit. I never could detect any difference. LRO/GRO receivers aggregate at most 17 or 22 segments anyway.

I said 56000 only as a quick way to test the thing. If you want an explicit formula, it might be:

SKB_WITH_OVERHEAD(65535 - LL_MAX_HEADER)

or something like that...
On 16/05/14 17:54, Wei Liu wrote:
> > I just had an idea: a modified version of the xenvif_handle_frag_list function from netback would be useful for us here. It recreates the frags array on fully utilized 4k pages. Plus we can use pskb_expand_head to reduce the page number on the linear buffer (although it might not work, see my comment in the patch)
> > The worst case linear buffer then spans N+1 pages, and has N*PAGE_SIZE+1 bytes. Then the frags after this coalescing should have 16*PAGE_SIZE - (N*PAGE_SIZE+2) bytes at most, which is 16-N pages. Altogether that's 16+1 pages, which should definitely fit!
> > This is what I mean:
> >
> > 8<--------------
> > diff --git a/drivers/net/xen-netfront.c b/drivers/net/xen-netfront.c
> > index 158b5e6..b1133d6 100644
> > --- a/drivers/net/xen-netfront.c
> > +++ b/drivers/net/xen-netfront.c
> > @@ -544,6 +544,73 @@ static int xennet_count_skb_frag_slots(struct sk_buff *skb)
> >  	return pages;
> >  }
> >
> > +int xenvif_reduce_pages(struct sk_buff *skb, int target)
> > +{
> > +	unsigned int offset = skb_headlen(skb);
> > +	skb_frag_t frags[MAX_SKB_FRAGS];
> > +	int newfrags, oldfrags;
> > +	unsigned int pages, optimal;
> > +
> > +	BUG_ON(!target);
> > +
> > +	pages = DIV_ROUND_UP(offset_in_page(skb->data) + skb_headlen(skb), PAGE_SIZE);
> > +	optimal = DIV_ROUND_UP(skb_headlen(skb), PAGE_SIZE);
> > +	if (pages - optimal) {
> > +		int err;
> > +/* FIXME: we should check if pskb_expand_head really allocates on page boundary,
> > + * otherwise we can still have suboptimal page layout */
> > +		if (unlikely(err = pskb_expand_head(skb, 0, 0, GFP_ATOMIC)))
>
> I'm a bit lost. What do you expect from the call to pskb_expand_head?
>
> I'm sorry, I cannot see the immediate result from the comment of
> pskb_expand_head. If you call it with nhead and ntail equal to 0 it creates
> an identical copy, but I don't see a guarantee on page alignment. Did I miss
> something?

Yep, indeed, it doesn't directly guarantee that the new allocation won't span page boundaries unnecessarily. And actually we are still OK then, as the skb shouldn't take more than 18 slots, so netback should be able to handle that.

> > +			return err;
> > +		target -= pages - optimal;
> > +		if (!target)
> > +			return 0;
> > +	}
> > +
> > +	/* Subtract frags size, we will correct it later */
> > +	skb->truesize -= skb->data_len;
> > +
> > +	/* Create a brand new frags array and coalesce there */
> > +	for (newfrags = 0; offset < skb->len; newfrags++) {
> > +		struct page *page;
> > +		unsigned int len;
> > +
> > +		BUG_ON(newfrags >= MAX_SKB_FRAGS);
> > +		page = alloc_page(GFP_ATOMIC);
>
> And the amount of memory allocation is a bit overkill, I think (though
> it's still better than the order-5 allocation in skb_linearize). Can you
> not just memmove all paged data to the first few frags and release the
> other frags?
>
> Anyway, this method might still work, just a bit overkill IMHO.

Yep, it's quite suboptimal, and anyone can come up with a better (and probably more complex) solution, however:
- this should be a rarely used path, so performance doesn't matter that much at the moment (although who knows under which workload you can end up with skbs fragmented so badly that this function gets called often...)
- it would be good to create a fix for this soon, and have it backported to major distro kernels where compound pages are enabled

Zoli
diff --git a/drivers/net/xen-netfront.c b/drivers/net/xen-netfront.c
index 158b5e6..b1133d6 100644
--- a/drivers/net/xen-netfront.c
+++ b/drivers/net/xen-netfront.c
@@ -544,6 +544,73 @@ static int xennet_count_skb_frag_slots(struct sk_buff *skb)
 	return pages;
 }
 
+int xenvif_reduce_pages(struct sk_buff *skb, int target)
+{
+	unsigned int offset = skb_headlen(skb);
+	skb_frag_t frags[MAX_SKB_FRAGS];
+	int newfrags, oldfrags;
+	unsigned int pages, optimal;
+
+	BUG_ON(!target);
+
+	pages = DIV_ROUND_UP(offset_in_page(skb->data) + skb_headlen(skb), PAGE_SIZE);
+	optimal = DIV_ROUND_UP(skb_headlen(skb), PAGE_SIZE);
+	if (pages - optimal) {
+		int err;
+/* FIXME: we should check if pskb_expand_head really allocates on page boundary,
+ * otherwise we can still have suboptimal page layout */
+		if (unlikely(err = pskb_expand_head(skb, 0, 0, GFP_ATOMIC)))
+			return err;
+		target -= pages - optimal;
+		if (!target)
+			return 0;
+	}
+
+	/* Subtract frags size, we will correct it later */
+	skb->truesize -= skb->data_len;
+
+	/* Create a brand new frags array and coalesce there */
+	for (newfrags = 0; offset < skb->len; newfrags++) {
+		struct page *page;
+		unsigned int len;
+
+		BUG_ON(newfrags >= MAX_SKB_FRAGS);
+		page = alloc_page(GFP_ATOMIC);
+		if (!page) {
+			int j;
+			skb->truesize += skb->data_len;
+			for (j = 0; j < newfrags; j++)
+				put_page(frags[j].page.p);
+			return -ENOMEM;
+		}
+
+		if (offset + PAGE_SIZE < skb->len)
+			len = PAGE_SIZE;
+		else
+			len = skb->len - offset;
+		if (skb_copy_bits(skb, offset, page_address(page), len))
+			BUG();
+
+		offset += len;
+		frags[newfrags].page.p = page;
+		frags[newfrags].page_offset = 0;
+		skb_frag_size_set(&frags[newfrags], len);
+	}
+
+	/* Drop the original buffers */
+	for (oldfrags = 0; oldfrags < skb_shinfo(skb)->nr_frags; oldfrags++)
+		skb_frag_unref(skb, oldfrags);
+
+	/* Swap the new frags array with the old one */
+	memcpy(skb_shinfo(skb)->frags,
+	       frags,
+	       newfrags * sizeof(skb_frag_t));
+	skb_shinfo(skb)->nr_frags = newfrags;
+	/* Correct truesize */
+	skb->truesize += newfrags * PAGE_SIZE;
+	return 0;
+}
+
 static int xennet_start_xmit(struct sk_buff *skb, struct net_device *dev)
 {
 	unsigned short id;
@@ -573,11 +640,21 @@ static int xennet_start_xmit(struct sk_buff *skb, struct net_device *dev)
 	slots = DIV_ROUND_UP(offset + len, PAGE_SIZE) +
 		xennet_count_skb_frag_slots(skb);
 	if (unlikely(slots > MAX_SKB_FRAGS + 1)) {
-		net_alert_ratelimited(
-			"xennet: skb rides the rocket: %d slots\n", slots);
-		goto drop;
+		if (unlikely(xenvif_reduce_pages(skb, slots - MAX_SKB_FRAGS - 1))) {
+			net_alert_ratelimited(
+				"xennet: couldn't reduce slot number from %d\n", slots);
+			goto drop;
+		}
+		slots = DIV_ROUND_UP(offset_in_page(data) + len, PAGE_SIZE) +
+			xennet_count_skb_frag_slots(skb);
+		if (unlikely(slots > MAX_SKB_FRAGS + 1)) {
+			net_alert_ratelimited(
+				"xennet: slot reduction doesn't work, slots: %d\n", slots);
+			goto drop;
+		}
 	}
 
+
 	spin_lock_irqsave(&np->tx_lock, flags);
 
 	if (unlikely(!netif_carrier_ok(dev) ||