Message ID | 1406726730-17994-1-git-send-email-zoltan.kiss@citrix.com |
---|---|
State | Rejected, archived |
Delegated to: | David Miller |
From: Zoltan Kiss <zoltan.kiss@citrix.com>
Date: Wed, 30 Jul 2014 14:25:30 +0100

> There is a long known problem with the netfront/netback interface: if the guest
> tries to send a packet which constitutes more than MAX_SKB_FRAGS + 1 ring slots,
> it gets dropped. The reason is that netback maps these slots to a frag in the
> frags array, which is limited by size. Having so many slots can occur since
> compound pages were introduced, as the ring protocol slices them up into
> individual (non-compound) page aligned slots. The theoretical worst case
> scenario looks like this (note, skbs are limited to 64 KB here):
> linear buffer: at most PAGE_SIZE - 17 * 2 bytes, overlapping page boundary,
> using 2 slots
> first 15 frags: 1 + PAGE_SIZE + 1 bytes long, first and last bytes are at the
> end and the beginning of a page, therefore they use 3 * 15 = 45 slots
> last 2 frags: 1 + 1 bytes, overlapping page boundary, 2 * 2 = 4 slots
> Although I don't think this 51 slots skb can really happen, we need a solution
> which can deal with every scenario. In real life there are only a few slots
> overdue, but usually it causes the TCP stream to be blocked, as the retry will
> most likely have the same buffer layout.
> This patch solves this problem by slicing up the skb itself with the help of
> skb_segment, and calling xennet_start_xmit again on the resulting packets. It
> also works with the theoretical worst case, where there is a 3 level recursion.
> The good thing is that skb_segment only copies the header part, the frags will
> be just referenced again.
>
> Signed-off-by: Zoltan Kiss <zoltan.kiss@citrix.com>

This is a really scary change :-)

I definitely see some potential problems here.

First of all, even in cases where it might "work", such as TCP, you are modifying the data stream. The sizes are changing, the packet counts are different, and all of this will have side effects such as potentially harming TCP performance.

Secondly, for something like UDP you can't just split the packet up like this, or for any other datagram protocol for that matter.

I know you're in a difficult situation, but I just can't see this being an acceptable approach to solving the problem right now.

Where does the MAX_SKB_FRAGS + 1 limit really come from, the size of the TX queue?

If you were to have a 64-slot TX queue, you ought to be able to handle this theoretical 51 slot SKB.

And I don't think it's so theoretical, a carefully crafted sequence of sendfile() calls during a TCP_CORK sequence should be able to do it.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Jul 31, 2014 at 01:25:20PM -0700, David Miller wrote:
> From: Zoltan Kiss <zoltan.kiss@citrix.com>
> Date: Wed, 30 Jul 2014 14:25:30 +0100
[...]
> Secondly, for something like UDP you can't just split the packet up
> like this, or for any other datagram protocol for that matter.
>
> I know you're in a difficult situation, but I just can't see this
> being an acceptable approach to solving the problem right now.
>
> Where does the MAX_SKB_FRAGS + 1 limit really come from, the size of
> the TX queue?
>

It stems from the implicit transmit protocol that has been in place since the inception of netfront / netback. Sigh.

> If you were to have a 64-slot TX queue, you ought to be able to handle
> this theoretical 51 slot SKB.
>

There are two problems:
1. IIRC a single page ring has 256 slots, so allowing 64-slot packets yields only 4 in-flight packets in the worst case.
2. Older netback could not handle this large a number of slots and is likely to deem the frontend malicious.

For #1, we don't actually care that much if the guest screws itself by generating 64-slot packets. #2 is more concerning.

Wei.

> And I don't think it's so theoretical, a carefully crafted sequence of
> sendfile() calls during a TCP_CORK sequence should be able to do it.
From: Wei Liu <wei.liu2@citrix.com> Date: Fri, 1 Aug 2014 12:02:46 +0100 > On Thu, Jul 31, 2014 at 01:25:20PM -0700, David Miller wrote: >> If you were to have a 64-slot TX queue, you ought to be able to handle >> this theoretical 51 slot SKB. > > There's two problems: > 1. IIRC a single page ring has 256 slots, allowing 64 slots packet > yields 4 in-flight packets in worst case. > 2. Older netback could not handle this large number of slots and it's > likely to deem the frontend malicious. > > For #1, we don't actually care that much if guest screws itself by > generating 64 slot packets. #2 is more concerning. How many slots can the older netback handle? -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Sat, Aug 02, 2014 at 03:33:37PM -0700, David Miller wrote:
> From: Wei Liu <wei.liu2@citrix.com>
> Date: Fri, 1 Aug 2014 12:02:46 +0100
>
> > On Thu, Jul 31, 2014 at 01:25:20PM -0700, David Miller wrote:
> >> If you were to have a 64-slot TX queue, you ought to be able to handle
> >> this theoretical 51 slot SKB.
> >
> > There's two problems:
> > 1. IIRC a single page ring has 256 slots, allowing 64 slots packet
> > yields 4 in-flight packets in worst case.
> > 2. Older netback could not handle this large number of slots and it's
> > likely to deem the frontend malicious.
> >
> > For #1, we don't actually care that much if guest screws itself by
> > generating 64 slot packets. #2 is more concerning.
>
> How many slots can the older netback handle?

I listed those two problems in the context "if we were to lift this limit in the latest net-next tree", so "older netback" actually refers to netback from 3.10 to 3.16.

The current implementation classifies a packet by its number of slots X:
1. X <= 18: valid packet
2. 18 < X < fatal_slot_count: dropped
3. X >= fatal_slot_count: malicious frontend

fatal_slot_count has a default value of 20.

Wei.
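The three cases listed in the message above can be written down as a small user-space C sketch. This is only an illustration of the thresholds quoted in the thread (18 valid slots, fatal_slot_count defaulting to 20); the enum, macro and function names are hypothetical and are not taken from the netback source.

```c
#include <assert.h>

/* Hypothetical names, for illustration only. */
enum slot_verdict {
	SLOT_VALID,	/* packet is accepted */
	SLOT_DROPPED,	/* packet is dropped, frontend keeps running */
	SLOT_FATAL	/* frontend is treated as malicious */
};

#define VALID_SLOT_MAX			18u	/* "X <= 18, valid packet" */
#define FATAL_SLOT_COUNT_DEFAULT	20u	/* default quoted in the thread */

/* Classify a packet by the number of ring slots it occupies. */
static enum slot_verdict classify_slots(unsigned int slots,
					unsigned int fatal_slot_count)
{
	/* Assumption of this sketch: the fatal threshold lies above the valid limit. */
	assert(fatal_slot_count > VALID_SLOT_MAX);

	if (slots <= VALID_SLOT_MAX)
		return SLOT_VALID;
	if (slots < fatal_slot_count)
		return SLOT_DROPPED;
	return SLOT_FATAL;
}
```

With the default threshold, a 19-slot packet is merely dropped, while the theoretical 51-slot skb would land in the "malicious frontend" case, which is why problem #2 is the more concerning one.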
On 31/07/14 21:25, David Miller wrote: > From: Zoltan Kiss <zoltan.kiss@citrix.com> > Date: Wed, 30 Jul 2014 14:25:30 +0100 > >> There is a long known problem with the netfront/netback interface: if the guest >> tries to send a packet which constitues more than MAX_SKB_FRAGS + 1 ring slots, >> it gets dropped. The reason is that netback maps these slots to a frag in the >> frags array, which is limited by size. Having so many slots can occur since >> compound pages were introduced, as the ring protocol slice them up into >> individual (non-compound) page aligned slots. The theoretical worst case >> scenario looks like this (note, skbs are limited to 64 Kb here): >> linear buffer: at most PAGE_SIZE - 17 * 2 bytes, overlapping page boundary, >> using 2 slots >> first 15 frags: 1 + PAGE_SIZE + 1 bytes long, first and last bytes are at the >> end and the beginning of a page, therefore they use 3 * 15 = 45 slots >> last 2 frags: 1 + 1 bytes, overlapping page boundary, 2 * 2 = 4 slots >> Although I don't think this 51 slots skb can really happen, we need a solution >> which can deal with every scenario. In real life there is only a few slots >> overdue, but usually it causes the TCP stream to be blocked, as the retry will >> most likely have the same buffer layout. >> This patch solves this problem by slicing up the skb itself with the help of >> skb_segment, and calling xennet_start_xmit again on the resulting packets. It >> also works with the theoretical worst case, where there is a 3 level recursion. >> The good thing is that skb_segment only copies the header part, the frags will >> be just referenced again. >> >> Signed-off-by: Zoltan Kiss <zoltan.kiss@citrix.com> > > This is a really scary change :-) I admit that :) > > I definitely see some potential problem here. > > First of all, even in cases where it might "work", such as TCP, you > are modifying the data stream. 
The sizes are changing, the packet
> counts are different, and all of this will have side effects such as
> potentially harming TCP performance.
>
> Secondly, for something like UDP you can't just split the packet up
> like this, or for any other datagram protocol for that matter.

The netback/netfront interface currently only supports TSO and TSO6. That's why I did the pktgen TCP patch.

> I know you're in a difficult situation, but I just can't see this
> being an acceptable approach to solving the problem right now.
>
> Where does the MAX_SKB_FRAGS + 1 limit really come from, the size of
> the TX queue?
>
> If you were to have a 64-slot TX queue, you ought to be able to handle
> this theoretical 51 slot SKB.

Let me step back a bit to explain the situation:

There is a shared ring buffer between netfront and netback. The frontend posts requests with grant references plus offset-size pairs. A grant reference points to a page, so each request is limited by PAGE_SIZE. The frontend slices up the skb's linear buffer and frags array into "slots", each of which is the triplet mentioned above (grant reference, offset, size). If the linear buffer or a frag is on a compound page and crosses a page boundary, it is posted as separate buffers. E.g. if it starts at offset 4000 with a size of 400 bytes, it will consume 2 slots.

Unfortunately the grant mapping interface can't map compound pages into another domain. The main problem is that those pages are only adjacent in the frontend's memory space, but not in the backend or DMA space, so even if you map them to adjacent backend pages (which would need a lot of changes), you either need SWIOTLB (expensive, and the backend pays the cost) or an IOMMU (which still doesn't work).

Currently netback limits each skb sent through to 18 slots, because it has to map every grant ref to a frag. There was an idea to handle this problem by removing this limit and letting the backend coalesce the scattered buffers into a brand new buffer, but then the backend would pay the price, and it would be huge as most of the packet would have to be copied.

We haven't seen this problem very often, and it's also a bit hard to reproduce (hence my frag offset-size pktgen patches), but we can't afford the assumption that it won't happen very often. Also, it is required that the guest should pay the price if it sends packets in such buffers, not the backend.

The main concept in this solution is that if it turns out in start_xmit that the packet needs too many slots, we pretend that netfront is not GSO capable and fall back to software segmentation, which will result in packets which can fit. It mimics going back to dev_hard_start_xmit, to the place where it calls dev_gso_segment(), but with gso_size temporarily set to (skb->len / 2 + 1) to avoid creating too many packets. It can also happen recursively, if the resulting packets are still too big slotwise.

As far as I know it's not really possible to push an skb back to the QDisc from start_xmit. If it is, that would be a more elegant solution for this problem.

> And I don't think it's so theoretical, a carefully crafted sequence of
> sendfile() calls during a TCP_CORK sequence should be able to do it.
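The slot arithmetic described in the message above can be checked with a short user-space C sketch. It assumes a 4096-byte page and models only the "one slot per page touched" rule; SKETCH_PAGE_SIZE, buf_slots() and worst_case_slots() are hypothetical names for this illustration, not functions from xen-netfront.c.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumption: 4 KiB pages */

/*
 * Number of page-sized slots consumed by a buffer that starts `offset`
 * bytes into its first page (larger offsets are reduced modulo the page
 * size) and is `len` bytes long.
 */
static unsigned int buf_slots(size_t offset, size_t len)
{
	size_t in_page = offset % SKETCH_PAGE_SIZE;

	if (len == 0)
		return 0;
	return (unsigned int)((in_page + len + SKETCH_PAGE_SIZE - 1) /
			      SKETCH_PAGE_SIZE);
}

/*
 * The theoretical worst-case layout from the patch description: every
 * buffer starts on the last byte of a page.
 */
static unsigned int worst_case_slots(void)
{
	const size_t last_byte = SKETCH_PAGE_SIZE - 1;
	const size_t linear_len = SKETCH_PAGE_SIZE - 17 * 2;
	const size_t big_frag_len = 1 + SKETCH_PAGE_SIZE + 1;
	const size_t small_frag_len = 1 + 1;
	unsigned int slots;

	/* The layout adds up to exactly 64 KiB of payload. */
	assert(linear_len + 15 * big_frag_len + 2 * small_frag_len ==
	       (size_t)64 * 1024);

	slots = buf_slots(last_byte, linear_len);		/* 2 slots */
	slots += 15 * buf_slots(last_byte, big_frag_len);	/* 15 * 3 slots */
	slots += 2 * buf_slots(last_byte, small_frag_len);	/* 2 * 2 slots */
	return slots;
}
```

Under these assumptions buf_slots(4000, 400) evaluates to 2, matching the example in the message, and worst_case_slots() evaluates to 51, matching the figure in the patch description.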
On Mon, Aug 04, 2014 at 06:29:34PM +0100, Zoltan Kiss wrote: > On 31/07/14 21:25, David Miller wrote: > >From: Zoltan Kiss <zoltan.kiss@citrix.com> > >Date: Wed, 30 Jul 2014 14:25:30 +0100 > > > >>There is a long known problem with the netfront/netback interface: if the guest > >>tries to send a packet which constitues more than MAX_SKB_FRAGS + 1 ring slots, > >>it gets dropped. The reason is that netback maps these slots to a frag in the > >>frags array, which is limited by size. Having so many slots can occur since > >>compound pages were introduced, as the ring protocol slice them up into > >>individual (non-compound) page aligned slots. The theoretical worst case > >>scenario looks like this (note, skbs are limited to 64 Kb here): > >>linear buffer: at most PAGE_SIZE - 17 * 2 bytes, overlapping page boundary, > >>using 2 slots > >>first 15 frags: 1 + PAGE_SIZE + 1 bytes long, first and last bytes are at the > >>end and the beginning of a page, therefore they use 3 * 15 = 45 slots > >>last 2 frags: 1 + 1 bytes, overlapping page boundary, 2 * 2 = 4 slots > >>Although I don't think this 51 slots skb can really happen, we need a solution > >>which can deal with every scenario. In real life there is only a few slots > >>overdue, but usually it causes the TCP stream to be blocked, as the retry will > >>most likely have the same buffer layout. > >>This patch solves this problem by slicing up the skb itself with the help of > >>skb_segment, and calling xennet_start_xmit again on the resulting packets. It > >>also works with the theoretical worst case, where there is a 3 level recursion. > >>The good thing is that skb_segment only copies the header part, the frags will > >>be just referenced again. > >> > >>Signed-off-by: Zoltan Kiss <zoltan.kiss@citrix.com> > > > >This is a really scary change :-) > I admit that :) > > > >I definitely see some potential problem here. 
> > > >First of all, even in cases where it might "work", such as TCP, you > >are modifying the data stream. The sizes are changing, the packet > >counts are different, and all of this will have side effects such as > >potentially harming TCP performance. > > > >Secondly, for something like UDP you can't just split the packet up > >like this, or for any other datagram protocol for that matter. > The netback/netfront interface currently only supports TSO and TSO6. That's > why I did the pktgen TCP patch IMO if this approach is known to be broken in the future (say if we want to support UFO) we'd better avoid it. Wei. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
From: Wei Liu <wei.liu2@citrix.com> Date: Sun, 3 Aug 2014 10:11:10 +0100 > On Sat, Aug 02, 2014 at 03:33:37PM -0700, David Miller wrote: >> From: Wei Liu <wei.liu2@citrix.com> >> Date: Fri, 1 Aug 2014 12:02:46 +0100 >> >> > On Thu, Jul 31, 2014 at 01:25:20PM -0700, David Miller wrote: >> >> If you were to have a 64-slot TX queue, you ought to be able to handle >> >> this theoretical 51 slot SKB. >> > >> > There's two problems: >> > 1. IIRC a single page ring has 256 slots, allowing 64 slots packet >> > yields 4 in-flight packets in worst case. >> > 2. Older netback could not handle this large number of slots and it's >> > likely to deem the frontend malicious. >> > >> > For #1, we don't actually care that much if guest screws itself by >> > generating 64 slot packets. #2 is more concerning. >> >> How many slots can the older netback handle? > > I listed those two problems in the context "if we were to lift this > limit in the latest net-next tree", so "older netback" actually refers > to netback from 3.10 to 3.16. > > The current implementation allows the number of slots X: > 1. X <= 18, valid packet > 2. 18 < X < fatal_slot_count, dropped > 3. X >= fatal_slot_count, malicious frontend > > fatal_slot_count has default value of 20. Given what I've seen so far, I think the only option is to linearize the packet. BTW, we do have a netdev->gso_max_segs tunable drivers can set, but it might not cover all of the cases you need to handle. Maybe we can create a similar tunable which triggers skb_needs_linearize() in the transmit path. The advantage of such a tunable is that this can be worked with inside of TCP to avoid creating such packets in the first place. For example, all of the MAX_SKB_FRAGS checks you see in net/ipv4/tcp.c could be replaced with tests against this new tunable in struct netdevice. 
On Mon, Aug 04, 2014 at 03:24:11PM -0700, David Miller wrote: > From: Wei Liu <wei.liu2@citrix.com> > Date: Sun, 3 Aug 2014 10:11:10 +0100 > > > On Sat, Aug 02, 2014 at 03:33:37PM -0700, David Miller wrote: > >> From: Wei Liu <wei.liu2@citrix.com> > >> Date: Fri, 1 Aug 2014 12:02:46 +0100 > >> > >> > On Thu, Jul 31, 2014 at 01:25:20PM -0700, David Miller wrote: > >> >> If you were to have a 64-slot TX queue, you ought to be able to handle > >> >> this theoretical 51 slot SKB. > >> > > >> > There's two problems: > >> > 1. IIRC a single page ring has 256 slots, allowing 64 slots packet > >> > yields 4 in-flight packets in worst case. > >> > 2. Older netback could not handle this large number of slots and it's > >> > likely to deem the frontend malicious. > >> > > >> > For #1, we don't actually care that much if guest screws itself by > >> > generating 64 slot packets. #2 is more concerning. > >> > >> How many slots can the older netback handle? > > > > I listed those two problems in the context "if we were to lift this > > limit in the latest net-next tree", so "older netback" actually refers > > to netback from 3.10 to 3.16. > > > > The current implementation allows the number of slots X: > > 1. X <= 18, valid packet > > 2. 18 < X < fatal_slot_count, dropped > > 3. X >= fatal_slot_count, malicious frontend > > > > fatal_slot_count has default value of 20. > > Given what I've seen so far, I think the only option is to linearize > the packet. > > BTW, we do have a netdev->gso_max_segs tunable drivers can set, but > it might not cover all of the cases you need to handle. > > Maybe we can create a similar tunable which triggers > skb_needs_linearize() in the transmit path. > > The advantage of such a tunable is that this can be worked with > inside of TCP to avoid creating such packets in the first place. > > For example, all of the MAX_SKB_FRAGS checks you see in net/ipv4/tcp.c > could be replaced with tests against this new tunable in struct netdevice. +1 for this. 
Avoiding generating such packets in transmit path in the first place is even better. Wei. > -- > To unsubscribe from this list: send the line "unsubscribe netdev" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 04/08/14 23:24, David Miller wrote: > From: Wei Liu <wei.liu2@citrix.com> > Date: Sun, 3 Aug 2014 10:11:10 +0100 > >> On Sat, Aug 02, 2014 at 03:33:37PM -0700, David Miller wrote: >>> From: Wei Liu <wei.liu2@citrix.com> >>> Date: Fri, 1 Aug 2014 12:02:46 +0100 >>> >>>> On Thu, Jul 31, 2014 at 01:25:20PM -0700, David Miller wrote: >>>>> If you were to have a 64-slot TX queue, you ought to be able to handle >>>>> this theoretical 51 slot SKB. >>>> >>>> There's two problems: >>>> 1. IIRC a single page ring has 256 slots, allowing 64 slots packet >>>> yields 4 in-flight packets in worst case. >>>> 2. Older netback could not handle this large number of slots and it's >>>> likely to deem the frontend malicious. >>>> >>>> For #1, we don't actually care that much if guest screws itself by >>>> generating 64 slot packets. #2 is more concerning. >>> >>> How many slots can the older netback handle? >> >> I listed those two problems in the context "if we were to lift this >> limit in the latest net-next tree", so "older netback" actually refers >> to netback from 3.10 to 3.16. >> >> The current implementation allows the number of slots X: >> 1. X <= 18, valid packet >> 2. 18 < X < fatal_slot_count, dropped >> 3. X >= fatal_slot_count, malicious frontend >> >> fatal_slot_count has default value of 20. > > Given what I've seen so far, I think the only option is to linearize > the packet. I think that would have more performance penalty than calling skb_gso_segment, but maybe I'm wrong. > > BTW, we do have a netdev->gso_max_segs tunable drivers can set, but > it might not cover all of the cases you need to handle. Indeed. Even a packet with one frag can be too scattered for us. > > Maybe we can create a similar tunable which triggers > skb_needs_linearize() in the transmit path. > > The advantage of such a tunable is that this can be worked with > inside of TCP to avoid creating such packets in the first place. 
> > For example, all of the MAX_SKB_FRAGS checks you see in net/ipv4/tcp.c > could be replaced with tests against this new tunable in struct netdevice. You would need to implement xennet_count_skb_frag_slots and count the slots for every skb heading to a device with this tunable set. And not just for TCP, but for any packet source. I think it would be better to check for that tunable in dev_hard_start_xmit, and mask out the GSO bits in 'features' to force segmentation there. That would do essentially the same as this patch, but not in the netfront's start_xmit. One minor flaw is that it does one round of segmentation only, which doesn't handle the theoretical worst case. Zoli -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
From: Zoltan Kiss <zoltan.kiss@citrix.com>
Date: Mon, 4 Aug 2014 18:29:34 +0100

> On 31/07/14 21:25, David Miller wrote:
>> Secondly, for something like UDP you can't just split the packet up
>> like this, or for any other datagram protocol for that matter.
> The netback/netfront interface currently only supports TSO and
> TSO6. That's why I did the pktgen TCP patch

Do a sendfile() with MSG_MORE over UDP, I bet you can construct a sequence that violates your constraints too.

It doesn't make sense to focus on TSO, it's a fundamental issue.

Packets can come from anywhere, and you have to be prepared to generically handle a MAX_SKB_FRAGS loaded SKB with arbitrary start/end/length fragment configurations.

> Currently netback limits each skb sent through to 18 slots, because it
> has to map every grant ref to a frag. There was an idea to handle this
> problem by removing this limit and let the backend coalesce the
> scattered buffers into a brand new piece, but then the backend would
> pay the price, and it would be huge as most of the packet should be
> copied.

18 slots means that even with linearization the maximum SKB size you can support is 64K. (16 * 4096) == 64K, plus one extra slot on each side for potential partial pages, gives us 18.

> We haven't seen this problem very often, and it's also a bit hard to
> reproduce (hence my frag offset-size pktgen patches), but we can't
> afford the assumption that it won't happen very often.

It's trivial to reproduce, I've already shown how one could trigger it _without_ TSO being involved at all.

I'll state it again:

1. Set TCP_CORK, or use MSG_MORE on the socket.
2. Do a sequence of many 1 byte sendfile() requests over a file, skipping around the offset on every call in order to prevent coalescing.
3. Clear TCP_CORK or MSG_MORE; you should see a MAX_SKB_FRAGS skb end up in the driver transmit function.

> The main concept in this solution is that if it turns out the packet
> needs too many slots in start_xmit, pretend that netfront is not GSO
> capable, and fall back to the software segmentation, which will result
> in packets which can fit.

This is the fundamental issue with your solution.

It is not a GSO problem.

You therefore have to fully linearize the packet when you encounter this situation.
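The "16 full pages plus one partial page on each side" arithmetic in the message above can be sketched in user-space C. This is only a model of a contiguous (linearized) buffer under an assumed 4096-byte page; SKETCH_PAGE_SIZE, linear_bound_slots() and linear_worst_slots() are hypothetical names, not kernel functions.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumption: 4 KiB pages */

/*
 * Upper bound as argued in the message: the number of full pages in
 * `max_len`, plus one extra slot on each side for partial pages.
 */
static unsigned int linear_bound_slots(size_t max_len)
{
	return (unsigned int)(max_len / SKETCH_PAGE_SIZE) + 2;
}

/*
 * Brute-force check: the largest number of pages a contiguous buffer of
 * `len` bytes can touch, over every possible starting offset in a page.
 */
static unsigned int linear_worst_slots(size_t len)
{
	unsigned int worst = 0;
	size_t off;

	assert(len > 0);
	for (off = 0; off < SKETCH_PAGE_SIZE; off++) {
		unsigned int slots = (unsigned int)
			((off + len + SKETCH_PAGE_SIZE - 1) / SKETCH_PAGE_SIZE);

		if (slots > worst)
			worst = slots;
	}
	return worst;
}
```

In this model linear_bound_slots(64 * 1024) evaluates to 18, and the brute-force search never exceeds that bound for a 64K buffer, so a fully linearized skb always fits within the 18 slots netback accepts.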
From: Zoltan Kiss <zoltan.kiss@citrix.com>
Date: Tue, 5 Aug 2014 14:00:30 +0100

> On 04/08/14 23:24, David Miller wrote:
>> Given what I've seen so far, I think the only option is to linearize
>> the packet.
> I think that would have more performance penalty than calling
> skb_gso_segment, but maybe I'm wrong.

We have firmly established that you cannot legally use GSO segmenting as an option, so it doesn't matter if it's faster or not.
diff --git a/drivers/net/xen-netfront.c b/drivers/net/xen-netfront.c
index 055222b..0398240 100644
diff --git a/drivers/net/xen-netfront.c b/drivers/net/xen-netfront.c
index 055222b..9ce1b62 100644
--- a/drivers/net/xen-netfront.c
+++ b/drivers/net/xen-netfront.c
@@ -625,12 +625,37 @@ static int xennet_start_xmit(struct sk_buff *skb, struct net_device *dev)
 		goto drop;
 	}
 
+	/* WARNING: this function should be reentrant up until this point, as in
+	 * the below if branch it could be called recursively
+	 */
 	slots = DIV_ROUND_UP(offset + len, PAGE_SIZE) +
 		xennet_count_skb_frag_slots(skb);
 	if (unlikely(slots > MAX_SKB_FRAGS + 1)) {
-		net_alert_ratelimited(
-			"xennet: skb rides the rocket: %d slots\n", slots);
-		goto drop;
+		struct sk_buff *segs, *nskb;
+		unsigned short gso_size_orig = skb_shinfo(skb)->gso_size;
+		unsigned short gso_type_orig = skb_shinfo(skb)->gso_type;
+
+		net_dbg_ratelimited(
+			"xennet: skb rides the rocket: %d slots, %d bytes\n",
+			slots, skb->len);
+		netdev_features_t features =
+			netif_skb_features(skb) & ~NETIF_F_GSO_MASK;
+		/* Segment this into two pieces, most probably it will fit */
+		skb_shinfo(skb)->gso_size = skb->len / 2 + 1;
+		segs = skb_gso_segment(skb, features);
+		if (unlikely(!segs || IS_ERR(segs)))
+			goto drop;
+		do {
+			nskb = segs;
+			segs = nskb->next;
+			nskb->next = NULL;
+			skb_shinfo(nskb)->gso_size = gso_size_orig;
+			skb_shinfo(nskb)->gso_type = gso_type_orig;
+			xennet_start_xmit(nskb, dev);
+		} while (segs);
+
+		dev_kfree_skb_any(skb);
+		return NETDEV_TX_OK;
 	}
 
 	spin_lock_irqsave(&queue->tx_lock, flags);
There is a long known problem with the netfront/netback interface: if the guest
tries to send a packet which constitutes more than MAX_SKB_FRAGS + 1 ring slots,
it gets dropped. The reason is that netback maps each of these slots to a frag in
the frags array, which is limited in size. Having so many slots can occur since
compound pages were introduced, as the ring protocol slices them up into
individual (non-compound) page aligned slots. The theoretical worst case
scenario looks like this (note, skbs are limited to 64 KB here):

linear buffer: at most PAGE_SIZE - 17 * 2 bytes, overlapping page boundary,
using 2 slots
first 15 frags: 1 + PAGE_SIZE + 1 bytes long, first and last bytes are at the
end and the beginning of a page, therefore they use 3 * 15 = 45 slots
last 2 frags: 1 + 1 bytes, overlapping page boundary, 2 * 2 = 4 slots

Although I don't think this 51 slot skb can really happen, we need a solution
which can deal with every scenario. In real life the packet is only a few slots
over the limit, but usually it causes the TCP stream to be blocked, as the retry
will most likely have the same buffer layout.

This patch solves this problem by slicing up the skb itself with the help of
skb_segment, and calling xennet_start_xmit again on the resulting packets. It
also works with the theoretical worst case, where there is a 3 level recursion.
The good thing is that skb_segment only copies the header part; the frags will
just be referenced again.

Signed-off-by: Zoltan Kiss <zoltan.kiss@citrix.com>
Cc: Wei Liu <wei.liu2@citrix.com>
Cc: Ian Campbell <Ian.Campbell@citrix.com>
Cc: Paul Durrant <paul.durrant@citrix.com>
Cc: netdev@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: xen-devel@lists.xenproject.org
---