| Message ID | 1471867570-1406-1-git-send-email-shmulik.ladkani@gmail.com |
|---|---|
| State | RFC, archived |
| Delegated to | David Miller |
Shmulik Ladkani <shmulik.ladkani@gmail.com> wrote:
> There are cases where gso skbs (which originate from an ingress
> interface) have a gso_size value that exceeds the output dst mtu:
>
> - ipv4 forwarding middlebox having in/out interfaces with different mtus
>   addressed by fe6cc55f3a 'net: ip, ipv6: handle gso skbs in forwarding path'
> - bridge having a tunnel member interface stacked over a device with small mtu
>   addressed by b8247f095e 'net: ip_finish_output_gso: If skb_gso_network_seglen exceeds MTU, allow segmentation for local udp tunneled skbs'
>
> In both cases, such skbs are identified, then go through early software
> segmentation+fragmentation as part of ip_finish_output_gso.
>
> Another approach is to shrink the gso_size to a value such that the
> resulting segments are smaller than the dst mtu, as suggested by Eric
> Dumazet (as part of [1]) and Florian Westphal (as part of [2]).
>
> This obviates the need for software segmentation/fragmentation at
> ip_finish_output_gso, thus significantly improving throughput and
> lowering cpu load.
>
> This RFC patch attempts to implement this gso_size clamping.
>
> [1] https://patchwork.ozlabs.org/patch/314327/
> [2] https://patchwork.ozlabs.org/patch/644724/
>
> Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
> Cc: Eric Dumazet <edumazet@google.com>
> Cc: Florian Westphal <fw@strlen.de>
>
> Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
> ---
>
> Comments welcome.
>
> Few questions embedded in the patch.
>
> Florian, in fe6cc55f you described a BUG due to gso_size decrease.
> I've tested both bridged and routed cases, but in my setups failed to
> hit the issue; appreciate if you can provide some hints.

Still get the BUG; I applied this patch on top of net-next.
On hypervisor:
10.0.0.2 via 192.168.7.10 dev tap0 mtu lock 1500
ssh root@10.0.0.2 'cat > /dev/null' < /dev/zero

On vm1 (which dies instantly, see below):
eth0 mtu 1500 (192.168.7.10)
eth1 mtu 1280 (10.0.0.1)

On vm2:
eth0 mtu 1280 (10.0.0.2)

Normal ipv4 routing via vm1, no iptables etc. present, so we have:

hypervisor 1500 -> 1500 VM1 1280 -> 1280 VM2

Turning off gro avoids this problem.

------------[ cut here ]------------
kernel BUG at net-next/net/core/skbuff.c:3210!
invalid opcode: 0000 [#1] SMP
CPU: 3 PID: 0 Comm: swapper/3 Not tainted 4.8.0-rc2+ #1842
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.2-0-g33fbe13 by qemu-project.org 04/01/2014
task: ffff88013b100000 task.stack: ffff88013b0fc000
RIP: 0010:[<ffffffff8135ab44>]  [<ffffffff8135ab44>] skb_segment+0x964/0xb20
RSP: 0018:ffff88013fd838d0  EFLAGS: 00010212
RAX: 00000000000005a8 RBX: ffff88013a9f9900 RCX: ffff88013b1cf500
RDX: 0000000000006612 RSI: 0000000000000494 RDI: 0000000000000114
RBP: ffff88013fd839a8 R08: 00000000000069ca R09: ffff88013b1cf400
R10: 0000000000000011 R11: 0000000000006612 R12: 00000000000064fe
R13: ffff8801394c7300 R14: ffff88013937ad80 R15: 0000000000000011
FS:  0000000000000000(0000) GS:ffff88013fd80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f059fc3b2b0 CR3: 0000000001806000 CR4: 00000000000006a0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Stack:
 000000000000003b ffffffffffffffbe fffffff400000000 ffff88013b1cf400
 0000000000000000 0000000000000042 0000000000000040 0000000000000001
 0000000000000042 ffff88013b1cf600 0000000000000000 ffff8801000004cc
Call Trace:
 <IRQ>
 [<ffffffff8123bacf>] ? swiotlb_map_page+0x5f/0x120
 [<ffffffff813eda00>] tcp_gso_segment+0x100/0x480
 [<ffffffff813eddb3>] tcp4_gso_segment+0x33/0x90
 [<ffffffff813fda7a>] inet_gso_segment+0x12a/0x3b0
 [<ffffffff81368c00>] ? dev_hard_start_xmit+0x20/0x110
 [<ffffffff813684f0>] skb_mac_gso_segment+0x90/0xf0
 [<ffffffff81368601>] __skb_gso_segment+0xb1/0x140
 [<ffffffff81368a7f>] validate_xmit_skb+0x14f/0x2b0
 [<ffffffff81368d2e>] validate_xmit_skb_list+0x3e/0x60
 [<ffffffff8138cb6a>] sch_direct_xmit+0x10a/0x1a0
 [<ffffffff81369199>] __dev_queue_xmit+0x369/0x5d0
 [<ffffffff8136940b>] dev_queue_xmit+0xb/0x10
 [<ffffffff813c8f47>] ip_finish_output2+0x247/0x310
 [<ffffffff813cac10>] ip_finish_output+0x1c0/0x250
 [<ffffffff813cadea>] ip_output+0x3a/0x40
 [<ffffffff813c751c>] ip_forward+0x36c/0x410
 [<ffffffff813c5b06>] ip_rcv+0x2e6/0x630
 [<ffffffff81364d5f>] __netif_receive_skb_core+0x2cf/0x940
 [<ffffffff813189bd>] ? e1000_alloc_rx_buffers+0x1bd/0x490
 [<ffffffff813653e8>] __netif_receive_skb+0x18/0x60
 [<ffffffff81365728>] netif_receive_skb_internal+0x28/0x90
 [<ffffffff813ee3b0>] ? tcp4_gro_complete+0x80/0x90
 [<ffffffff8136580a>] napi_gro_complete+0x7a/0xa0
 [<ffffffff813697e5>] napi_gro_flush+0x55/0x70
 [<ffffffff81369d06>] napi_complete_done+0x66/0xb0
 [<ffffffff81319810>] e1000_clean+0x380/0x900
 [<ffffffff81368c65>] ? dev_hard_start_xmit+0x85/0x110
 [<ffffffff81369ef3>] net_rx_action+0x1a3/0x2b0
 [<ffffffff81049c22>] __do_softirq+0xe2/0x1d0
 [<ffffffff81049f09>] irq_exit+0x89/0x90
 [<ffffffff810199bf>] do_IRQ+0x4f/0xd0
 [<ffffffff81498882>] common_interrupt+0x82/0x82
 <EOI>
 [<ffffffff81035bd6>] ? native_safe_halt+0x6/0x10
 [<ffffffff8101ff49>] default_idle+0x9/0x10
 [<ffffffff8102052a>] arch_cpu_idle+0xa/0x10
 [<ffffffff810791ce>] default_idle_call+0x2e/0x30
 [<ffffffff8107933f>] cpu_startup_entry+0x16f/0x220
 [<ffffffff8102d6f5>] start_secondary+0x105/0x130
Code: 00 08 02 48 89 df 44 89 44 24 18 83 e6 c0 e8 04 c7 ff ff 85 c0 0f 85 02 01 00 00 8b 83 b8 00 00 00 44 8b 44 24 18 e9 cc fe ff ff <0f> 0b 0f 0b 0f 0b 8b 4b 74 85 c9 0f 85 ce 00 00 00 48 8b 83 c0
RIP  [<ffffffff8135ab44>] skb_segment+0x964/0xb20
 RSP <ffff88013fd838d0>
---[ end trace 924612451efe8dce ]---
Kernel panic - not syncing: Fatal exception in interrupt
Kernel Offset: disabled
---[ end Kernel panic - not syncing: Fatal exception in interrupt
Hi,

On Mon, 22 Aug 2016 14:58:42 +0200, fw@strlen.de wrote:
> > Florian, in fe6cc55f you described a BUG due to gso_size decrease.
> > I've tested both bridged and routed cases, but in my setups failed to
> > hit the issue; appreciate if you can provide some hints.
>
> Still get the BUG, I applied this patch on top of net-next.
>
> On hypervisor:
> 10.0.0.2 via 192.168.7.10 dev tap0 mtu lock 1500
> ssh root@10.0.0.2 'cat > /dev/null' < /dev/zero
>
> On vm1 (which dies instantly, see below):
> eth0 mtu 1500 (192.168.7.10)
> eth1 mtu 1280 (10.0.0.1)
>
> On vm2:
> eth0 mtu 1280 (10.0.0.2)
>
> Normal ipv4 routing via vm1, no iptables etc. present, so we have:
>
> hypervisor 1500 -> 1500 VM1 1280 -> 1280 VM2
>
> Turning off gro avoids this problem.

Thanks Florian, will dive into this.
Hi,

On Mon, 22 Aug 2016 14:58:42 +0200, fw@strlen.de wrote:
> > Florian, in fe6cc55f you described a BUG due to gso_size decrease.
> > I've tested both bridged and routed cases, but in my setups failed to
> > hit the issue; appreciate if you can provide some hints.
>
> Still get the BUG, I applied this patch on top of net-next.
>
> On hypervisor:
> 10.0.0.2 via 192.168.7.10 dev tap0 mtu lock 1500
> ssh root@10.0.0.2 'cat > /dev/null' < /dev/zero
>
> On vm1 (which dies instantly, see below):
> eth0 mtu 1500 (192.168.7.10)
> eth1 mtu 1280 (10.0.0.1)
>
> On vm2:
> eth0 mtu 1280 (10.0.0.2)
>
> Normal ipv4 routing via vm1, no iptables etc. present, so we have:
>
> hypervisor 1500 -> 1500 VM1 1280 -> 1280 VM2
>
> Turning off gro avoids this problem.

I hit the BUG only when VM2's mtu is not set to 1280 (kept at the 1500
default).

Otherwise, the Hypervisor's TCP stack (sender) uses the TCP MSS advertised
by VM2 (which is 1240 if VM2's mtu is properly configured), so GRO taking
place on VM1's eth0 is based on the arriving segments (sized 1240).
Meaning, the "ingress" gso_size is actually 1240, and no "gso clamping"
occurs.

Only if VM2 has an mtu of 1500 is the MSS seen by the Hypervisor during
the handshake 1460, so GRO acting on VM1's eth0 is based on 1460-byte
segments. This leads to "gso clamping" taking place, with the BUG in
skb_segment (which, btw, seems sensitive to a change in gso_size only if
GRO was merging into the frag_list).

Can you please acknowledge our setup and reproduction are aligned?

Thanks,
Shmulik
Shmulik Ladkani <shmulik.ladkani@gmail.com> wrote:
> > Normal ipv4 routing via vm1, no iptables etc. present, so we have:
> >
> > hypervisor 1500 -> 1500 VM1 1280 -> 1280 VM2
> >
> > Turning off gro avoids this problem.
>
> I hit the BUG only when VM2's mtu is not set to 1280 (kept at the 1500
> default).

Right,

> Otherwise, the Hypervisor's TCP stack (sender) uses the TCP MSS
> advertised by VM2 (which is 1240 if VM2's mtu is properly configured),
> so GRO taking place on VM1's eth0 is based on the arriving segments
> (sized 1240).

True.

> Only if VM2 has an mtu of 1500 is the MSS seen by the Hypervisor during
> the handshake 1460, so GRO acting on VM1's eth0 is based on 1460-byte
> segments. This leads to "gso clamping" taking place, with the BUG in
> skb_segment (which, btw, seems sensitive to a change in gso_size only
> if GRO was merging into the frag_list).
>
> Can you please acknowledge our setup and reproduction are aligned?

Yes, seems the setups are identical.
Hi,

On Mon, 22 Aug 2016 14:58:42 +0200, fw@strlen.de wrote:
> Shmulik Ladkani <shmulik.ladkani@gmail.com> wrote:
> > There are cases where gso skbs (which originate from an ingress
> > interface) have a gso_size value that exceeds the output dst mtu:
> >
> > - ipv4 forwarding middlebox having in/out interfaces with different mtus
> >   addressed by fe6cc55f3a 'net: ip, ipv6: handle gso skbs in forwarding path'
> > - bridge having a tunnel member interface stacked over a device with small mtu
> >   addressed by b8247f095e 'net: ip_finish_output_gso: If skb_gso_network_seglen exceeds MTU, allow segmentation for local udp tunneled skbs'
> >
> > In both cases, such skbs are identified, then go through early software
> > segmentation+fragmentation as part of ip_finish_output_gso.
> >
> > Another approach is to shrink the gso_size to a value such that the
> > resulting segments are smaller than the dst mtu, as suggested by Eric
> > Dumazet (as part of [1]) and Florian Westphal (as part of [2]).
> >
> > This obviates the need for software segmentation/fragmentation at
> > ip_finish_output_gso, thus significantly improving throughput and
> > lowering cpu load.
> >
> > This RFC patch attempts to implement this gso_size clamping.
> >
> > [1] https://patchwork.ozlabs.org/patch/314327/
> > [2] https://patchwork.ozlabs.org/patch/644724/
> >
> > Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
> > ---
> >
> > Florian, in fe6cc55f you described a BUG due to gso_size decrease.
> > I've tested both bridged and routed cases, but in my setups failed to
> > hit the issue; appreciate if you can provide some hints.
>
> Still get the BUG, I applied this patch on top of net-next.

The BUG occurs when GRO takes place on the ingress, and only if GRO
merges skbs into the frag_list (OTOH when segments are only placed into
frags[] of a single skb, skb_segment succeeds even if gso_size was
altered).
This is due to an assumption that the frag_list members terminate on
exact MSS boundaries (which no longer holds during gso_size clamping).

The assumption dates back to the original support of 'frag_list' in
skb_segment:

  89319d3801 net: Add frag_list support to skb_segment

    This patch adds limited support for handling frag_list packets in
    skb_segment. The intention is to support GRO (Generic Receive
    Offload) packets which will be constructed by chaining normal
    packets using frag_list.

    As such we require all frag_list members terminate on exact MSS
    boundaries. This is checked using BUG_ON. As there should only be
    one producer in the kernel of such packets, namely GRO, this
    requirement should not be difficult to maintain.

We have a few alternatives for gso_size clamping:

1. Fix 'skb_segment' arithmetic to support inputs that do not match
   the "frag_list members terminate on exact MSS" assumption.

2. Perform gso_size clamping in 'ip_finish_output_gso' for non-GROed
   skbs. Other usecases will still benefit: (a) packets arriving from
   virtualized interfaces, e.g. tap and friends; (b) packets arriving
   from a veth of other namespaces (packets locally generated by the
   TCP stack on a different netns).

3. Ditch the idea, again ;)

We can pursue (1), especially if there are other benefits to doing so.
OTOH, due to the current complexity of 'skb_segment', this is a bit
risky.

Going with (2) could be reasonable too, as it brings value for the
virtualized environments that need gso_size clamping, while presenting
minimal risk.

Appreciate your opinions.

Regards,
Shmulik
On Thu, Aug 25, 2016 at 12:05:33PM +0300, Shmulik Ladkani wrote:
>
> We have a few alternatives for gso_size clamping:
>
> 1. Fix 'skb_segment' arithmetic to support inputs that do not match
>    the "frag_list members terminate on exact MSS" assumption.
>
> 2. Perform gso_size clamping in 'ip_finish_output_gso' for non-GROed
>    skbs. Other usecases will still benefit: (a) packets arriving from
>    virtualized interfaces, e.g. tap and friends; (b) packets arriving
>    from a veth of other namespaces (packets locally generated by the
>    TCP stack on a different netns).
>
> 3. Ditch the idea, again ;)
>
> We can pursue (1), especially if there are other benefits to doing so.
> OTOH, due to the current complexity of 'skb_segment', this is a bit
> risky.
>
> Going with (2) could be reasonable too, as it brings value for the
> virtualized environments that need gso_size clamping, while presenting
> minimal risk.

Please remember the overarching rule for GRO, and that is that it must
be completely transparent, i.e., the output must be as if GRO didn't
exist.

So if this is a DF packet then we must generate ICMP packets rather
than downsizing the packets. If it's not DF then we must fragment only
within each frag_list skb.

Cheers,
On Thu, 25 Aug 2016 12:05:33 +0300 Shmulik Ladkani <shmulik.ladkani@gmail.com> wrote:
> The BUG occurs when GRO takes place on the ingress, and only if GRO
> merges skbs into the frag_list (OTOH when segments are only placed
> into frags[] of a single skb, skb_segment succeeds even if gso_size
> was altered).
>
> This is due to an assumption that the frag_list members terminate on
> exact MSS boundaries (which no longer holds during gso_size clamping).
>
> We have a few alternatives for gso_size clamping:
>
> 1. Fix 'skb_segment' arithmetic to support inputs that do not match
>    the "frag_list members terminate on exact MSS" assumption.
>
> 2. Perform gso_size clamping in 'ip_finish_output_gso' for non-GROed
>    skbs. Other usecases will still benefit: (a) packets arriving from
>    virtualized interfaces, e.g. tap and friends; (b) packets arriving
>    from a veth of other namespaces (packets locally generated by the
>    TCP stack on a different netns).
>
> 3. Ditch the idea, again ;)
>
> We can pursue (1), especially if there are other benefits to doing so.
> OTOH, due to the current complexity of 'skb_segment', this is a bit
> risky.
>
> Going with (2) could be reasonable too, as it brings value for the
> virtualized environments that need gso_size clamping, while presenting
> minimal risk.

Summarizing actions taken, in case someone refers to this thread.

- Re (1): Spent a short while massaging skb_segment(). The code is not
  prepared to support arbitrary gso_size inputs. The main issue is that
  if nskb's frags[] get exhausted (but the original frag_skb's frags[]
  are not yet fully traversed), there is no generation of a new skb:
  the code expects iteration over both nskb's frags[] and frag_skb's
  frags[] to terminate together, and the next allocated skb is always a
  clone of the next frag_skb in the original head_skb. Supporting
  arbitrary gso_size inputs would require an intrusive rewrite.
- Re (2): There's no easy way for ip_finish_output_gso() to detect that
  the skb is safe for "gso_size clamping" while preserving GSO/GRO
  transparency: we can know an skb is "gso_size clamping safe" PER SKB,
  but that doesn't suffice; to preserve the GRO transparency rule, we
  must know the skb arrived from a code flow that is ALWAYS safe for
  gso_size clamping.

  So I ended up identifying the relevant code-flow of the use-case I'm
  interested in, verified it is indeed safe for altering gso_size
  (while taking a slight risk that this might not hold true in the
  future), and used that mark as the criteria for safe "gso_size
  clamping" in 'ip_finish_output_gso'.

  Yep, not too elegant.

Regards,
Shmulik
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index dde37fb..b911b43 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -216,6 +216,40 @@ static int ip_finish_output2(struct net *net, struct sock *sk, struct sk_buff *s
 	return -EINVAL;
 }
 
+static inline int skb_gso_clamp(struct sk_buff *skb, unsigned int network_len)
+{
+	struct skb_shared_info *shinfo = skb_shinfo(skb);
+	unsigned short gso_size;
+	unsigned int seglen;
+
+	if (shinfo->gso_size == GSO_BY_FRAGS)
+		return -EINVAL;
+
+	seglen = skb_gso_network_seglen(skb);
+
+	/* Decrease gso_size to fit specified network length */
+	gso_size = shinfo->gso_size - (seglen - network_len);
+	if (shinfo->gso_type & SKB_GSO_UDP)
+		gso_size &= ~7;
+
+	if (!(shinfo->gso_type & SKB_GSO_DODGY)) {
+		/* Recalculate gso_segs for skbs of trusted sources.
+		 * Untrusted skbs will have their gso_segs calculated by
+		 * skb_gso_segment.
+		 */
+		unsigned int hdr_len, payload_len;
+
+		hdr_len = seglen - shinfo->gso_size;
+		payload_len = skb->len - hdr_len;
+		shinfo->gso_segs = DIV_ROUND_UP(payload_len, gso_size);
+
+		// Q: need to verify gso_segs <= dev->gso_max_segs?
+		//    seems to be protected by netif_skb_features
+	}
+	shinfo->gso_size = gso_size;
+	return 0;
+}
+
 static int ip_finish_output_gso(struct net *net, struct sock *sk,
 				struct sk_buff *skb, unsigned int mtu)
 {
@@ -237,6 +271,16 @@ static int ip_finish_output_gso(struct net *net, struct sock *sk,
 	 * 2) skb arrived via virtio-net, we thus get TSO/GSO skbs directly
 	 * from host network stack.
 	 */
+
+	/* Attempt to clamp gso_size to avoid segmenting and fragmenting.
+	 *
+	 * Q: policy needed? per device?
+	 */
+	if (sysctl_ip_output_gso_clamp) {
+		if (!skb_gso_clamp(skb, mtu))
+			return ip_finish_output2(net, sk, skb);
+	}
+
 	features = netif_skb_features(skb);
 	BUILD_BUG_ON(sizeof(*IPCB(skb)) > SKB_SGO_CB_OFFSET);
 	segs = skb_gso_segment(skb, features & ~NETIF_F_GSO_MASK);
There are cases where gso skbs (which originate from an ingress
interface) have a gso_size value that exceeds the output dst mtu:

- ipv4 forwarding middlebox having in/out interfaces with different mtus
  addressed by fe6cc55f3a 'net: ip, ipv6: handle gso skbs in forwarding path'
- bridge having a tunnel member interface stacked over a device with small mtu
  addressed by b8247f095e 'net: ip_finish_output_gso: If skb_gso_network_seglen exceeds MTU, allow segmentation for local udp tunneled skbs'

In both cases, such skbs are identified, then go through early software
segmentation+fragmentation as part of ip_finish_output_gso.

Another approach is to shrink the gso_size to a value such that the
resulting segments are smaller than the dst mtu, as suggested by Eric
Dumazet (as part of [1]) and Florian Westphal (as part of [2]).

This obviates the need for software segmentation/fragmentation at
ip_finish_output_gso, thus significantly improving throughput and
lowering cpu load.

This RFC patch attempts to implement this gso_size clamping.

[1] https://patchwork.ozlabs.org/patch/314327/
[2] https://patchwork.ozlabs.org/patch/644724/

Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Florian Westphal <fw@strlen.de>

Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
---

Comments welcome.

Few questions embedded in the patch.

Florian, in fe6cc55f you described a BUG due to gso_size decrease.
I've tested both bridged and routed cases, but in my setups failed to
hit the issue; appreciate if you can provide some hints.

 net/ipv4/ip_output.c | 44 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 44 insertions(+)