
[RFC] net: ip_finish_output_gso: Attempt gso_size clamping if segments exceed mtu

Message ID 1471867570-1406-1-git-send-email-shmulik.ladkani@gmail.com
State RFC, archived
Delegated to: David Miller

Commit Message

Shmulik Ladkani Aug. 22, 2016, 12:06 p.m. UTC
There are cases where gso skbs (which originate from an ingress
interface) have a gso_size value that exceeds the output dst mtu:

 - ipv4 forwarding middlebox having in/out interfaces with different mtus
   addressed by fe6cc55f3a 'net: ip, ipv6: handle gso skbs in forwarding path'
 - bridge having a tunnel member interface stacked over a device with small mtu
   addressed by b8247f095e 'net: ip_finish_output_gso: If skb_gso_network_seglen exceeds MTU, allow segmentation for local udp tunneled skbs'

In both cases, such skbs are identified, then go through early software
segmentation+fragmentation as part of ip_finish_output_gso.

Another approach is to shrink the gso_size to a value small enough that
the resulting segments are smaller than the dst mtu, as suggested by
Eric Dumazet (as part of [1]) and Florian Westphal (as part of [2]).

This avoids the need for software segmentation/fragmentation at
ip_finish_output_gso, thus significantly improving throughput and
lowering cpu load.
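
As an illustration (the numbers are chosen only for this description):
a GROed skb built from 1460-byte segments that must leave a 1280-mtu
device has a skb_gso_network_seglen() of 1500; clamping shrinks
gso_size by (1500 - 1280) = 220, i.e. from 1460 to 1240, so every
resulting segment fits the egress mtu without software segmentation.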

This RFC patch attempts to implement this gso_size clamping.

[1] https://patchwork.ozlabs.org/patch/314327/
[2] https://patchwork.ozlabs.org/patch/644724/

Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Florian Westphal <fw@strlen.de>

Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
---

 Comments welcome.

 A few questions are embedded in the patch.

 Florian, in fe6cc55f you described a BUG due to a gso_size decrease.
 I've tested both bridged and routed cases, but failed to hit the issue
 in my setups; I'd appreciate any hints on reproducing it.

 net/ipv4/ip_output.c | 44 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 44 insertions(+)

Comments

Florian Westphal Aug. 22, 2016, 12:58 p.m. UTC | #1
Shmulik Ladkani <shmulik.ladkani@gmail.com> wrote:
> There are cases where gso skbs (which originate from an ingress
> interface) have a gso_size value that exceeds the output dst mtu:
> 
>  - ipv4 forwarding middlebox having in/out interfaces with different mtus
>    addressed by fe6cc55f3a 'net: ip, ipv6: handle gso skbs in forwarding path'
>  - bridge having a tunnel member interface stacked over a device with small mtu
>    addressed by b8247f095e 'net: ip_finish_output_gso: If skb_gso_network_seglen exceeds MTU, allow segmentation for local udp tunneled skbs'
> 
> In both cases, such skbs are identified, then go through early software
> segmentation+fragmentation as part of ip_finish_output_gso.
> 
> Another approach is to shrink the gso_size to a value small enough that
> the resulting segments are smaller than the dst mtu, as suggested by
> Eric Dumazet (as part of [1]) and Florian Westphal (as part of [2]).
> 
> This avoids the need for software segmentation/fragmentation at
> ip_finish_output_gso, thus significantly improving throughput and
> lowering cpu load.
> 
> This RFC patch attempts to implement this gso_size clamping.
> 
> [1] https://patchwork.ozlabs.org/patch/314327/
> [2] https://patchwork.ozlabs.org/patch/644724/
> 
> Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
> Cc: Eric Dumazet <edumazet@google.com>
> Cc: Florian Westphal <fw@strlen.de>
> 
> Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
> ---
> 
>  Comments welcome.
> 
>  A few questions are embedded in the patch.
> 
>  Florian, in fe6cc55f you described a BUG due to a gso_size decrease.
>  I've tested both bridged and routed cases, but failed to hit the issue
>  in my setups; I'd appreciate any hints on reproducing it.

Still get the BUG, I applied this patch on top of net-next.

On hypervisor:
10.0.0.2 via 192.168.7.10 dev tap0 mtu lock 1500
ssh root@10.0.0.2 'cat > /dev/null' < /dev/zero

On vm1 (which dies instantly, see below):
eth0 mtu 1500 (192.168.7.10)
eth1 mtu 1280 (10.0.0.1)

On vm2
eth0 mtu 1280 (10.0.0.2)

Normal ipv4 routing via vm1, no iptables etc. present, so

we have  hypervisor 1500 -> 1500 VM1 1280 -> 1280 VM2

Turning off gro avoids this problem.

------------[ cut here ]------------
kernel BUG at net-next/net/core/skbuff.c:3210!
invalid opcode: 0000 [#1] SMP
CPU: 3 PID: 0 Comm: swapper/3 Not tainted 4.8.0-rc2+ #1842
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.2-0-g33fbe13 by qemu-project.org 04/01/2014
task: ffff88013b100000 task.stack: ffff88013b0fc000
RIP: 0010:[<ffffffff8135ab44>]  [<ffffffff8135ab44>] skb_segment+0x964/0xb20
RSP: 0018:ffff88013fd838d0  EFLAGS: 00010212
RAX: 00000000000005a8 RBX: ffff88013a9f9900 RCX: ffff88013b1cf500
RDX: 0000000000006612 RSI: 0000000000000494 RDI: 0000000000000114
RBP: ffff88013fd839a8 R08: 00000000000069ca R09: ffff88013b1cf400
R10: 0000000000000011 R11: 0000000000006612 R12: 00000000000064fe
R13: ffff8801394c7300 R14: ffff88013937ad80 R15: 0000000000000011
FS:  0000000000000000(0000) GS:ffff88013fd80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f059fc3b2b0 CR3: 0000000001806000 CR4: 00000000000006a0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Stack:
 000000000000003b ffffffffffffffbe fffffff400000000 ffff88013b1cf400
 0000000000000000 0000000000000042 0000000000000040 0000000000000001
 0000000000000042 ffff88013b1cf600 0000000000000000 ffff8801000004cc
Call Trace:
 <IRQ> 
 [<ffffffff8123bacf>] ? swiotlb_map_page+0x5f/0x120
 [<ffffffff813eda00>] tcp_gso_segment+0x100/0x480
 [<ffffffff813eddb3>] tcp4_gso_segment+0x33/0x90
 [<ffffffff813fda7a>] inet_gso_segment+0x12a/0x3b0
 [<ffffffff81368c00>] ? dev_hard_start_xmit+0x20/0x110
 [<ffffffff813684f0>] skb_mac_gso_segment+0x90/0xf0
 [<ffffffff81368601>] __skb_gso_segment+0xb1/0x140
 [<ffffffff81368a7f>] validate_xmit_skb+0x14f/0x2b0
 [<ffffffff81368d2e>] validate_xmit_skb_list+0x3e/0x60
 [<ffffffff8138cb6a>] sch_direct_xmit+0x10a/0x1a0
 [<ffffffff81369199>] __dev_queue_xmit+0x369/0x5d0
 [<ffffffff8136940b>] dev_queue_xmit+0xb/0x10
 [<ffffffff813c8f47>] ip_finish_output2+0x247/0x310
 [<ffffffff813cac10>] ip_finish_output+0x1c0/0x250
 [<ffffffff813cadea>] ip_output+0x3a/0x40
 [<ffffffff813c751c>] ip_forward+0x36c/0x410
 [<ffffffff813c5b06>] ip_rcv+0x2e6/0x630
 [<ffffffff81364d5f>] __netif_receive_skb_core+0x2cf/0x940
 [<ffffffff813189bd>] ? e1000_alloc_rx_buffers+0x1bd/0x490
 [<ffffffff813653e8>] __netif_receive_skb+0x18/0x60
 [<ffffffff81365728>] netif_receive_skb_internal+0x28/0x90
 [<ffffffff813ee3b0>] ? tcp4_gro_complete+0x80/0x90
 [<ffffffff8136580a>] napi_gro_complete+0x7a/0xa0
 [<ffffffff813697e5>] napi_gro_flush+0x55/0x70
 [<ffffffff81369d06>] napi_complete_done+0x66/0xb0
 [<ffffffff81319810>] e1000_clean+0x380/0x900
 [<ffffffff81368c65>] ? dev_hard_start_xmit+0x85/0x110
 [<ffffffff81369ef3>] net_rx_action+0x1a3/0x2b0
 [<ffffffff81049c22>] __do_softirq+0xe2/0x1d0
 [<ffffffff81049f09>] irq_exit+0x89/0x90
 [<ffffffff810199bf>] do_IRQ+0x4f/0xd0
 [<ffffffff81498882>] common_interrupt+0x82/0x82
 <EOI> 
 [<ffffffff81035bd6>] ? native_safe_halt+0x6/0x10
 [<ffffffff8101ff49>] default_idle+0x9/0x10
 [<ffffffff8102052a>] arch_cpu_idle+0xa/0x10
 [<ffffffff810791ce>] default_idle_call+0x2e/0x30
 [<ffffffff8107933f>] cpu_startup_entry+0x16f/0x220
 [<ffffffff8102d6f5>] start_secondary+0x105/0x130
Code: 00 08 02 48 89 df 44 89 44 24 18 83 e6 c0 e8 04 c7 ff ff 85 c0 0f 85 02 01 00 00 8b 83 b8 00 00 00 44 8b 44 24 18 e9 cc fe ff ff <0f> 0b 0f 0b 0f 0b 8b 4b 74 85 c9 0f 85 ce 00 00 00 48 8b 83 c0 
RIP  [<ffffffff8135ab44>] skb_segment+0x964/0xb20
 RSP <ffff88013fd838d0>
---[ end trace 924612451efe8dce ]---
Kernel panic - not syncing: Fatal exception in interrupt
Kernel Offset: disabled
---[ end Kernel panic - not syncing: Fatal exception in interrupt
Shmulik Ladkani Aug. 22, 2016, 1:05 p.m. UTC | #2
Hi,

On Mon, 22 Aug 2016 14:58:42 +0200, fw@strlen.de wrote:
> > 
> >  Florian, in fe6cc55f you described a BUG due to a gso_size decrease.
> >  I've tested both bridged and routed cases, but failed to hit the issue
> >  in my setups; I'd appreciate any hints on reproducing it.
> 
> Still get the BUG, I applied this patch on top of net-next.
> 
> On hypervisor:
> 10.0.0.2 via 192.168.7.10 dev tap0 mtu lock 1500
> ssh root@10.0.0.2 'cat > /dev/null' < /dev/zero
> 
> On vm1 (which dies instantly, see below):
> eth0 mtu 1500 (192.168.7.10)
> eth1 mtu 1280 (10.0.0.1)
> 
> On vm2
> eth0 mtu 1280 (10.0.0.2)
> 
> Normal ipv4 routing via vm1, no iptables etc. present, so
> 
> we have  hypervisor 1500 -> 1500 VM1 1280 -> 1280 VM2
> 
> Turning off gro avoids this problem.

Thanks Florian, will dive into this.
Shmulik Ladkani Aug. 24, 2016, 2:53 p.m. UTC | #3
Hi,

On Mon, 22 Aug 2016 14:58:42 +0200, fw@strlen.de wrote:
> >  Florian, in fe6cc55f you described a BUG due to a gso_size decrease.
> >  I've tested both bridged and routed cases, but failed to hit the issue
> >  in my setups; I'd appreciate any hints on reproducing it.
> 
> Still get the BUG, I applied this patch on top of net-next.
> 
> On hypervisor:
> 10.0.0.2 via 192.168.7.10 dev tap0 mtu lock 1500
> ssh root@10.0.0.2 'cat > /dev/null' < /dev/zero
> 
> On vm1 (which dies instantly, see below):
> eth0 mtu 1500 (192.168.7.10)
> eth1 mtu 1280 (10.0.0.1)
> 
> On vm2
> eth0 mtu 1280 (10.0.0.2)
> 
> Normal ipv4 routing via vm1, no iptables etc. present, so
> 
> we have  hypervisor 1500 -> 1500 VM1 1280 -> 1280 VM2
> 
> Turning off gro avoids this problem.

I hit the BUG only when VM2's mtu is not set to 1280 (kept to the 1500
default).

Otherwise, the Hypervisor's TCP stack (sender) uses the TCP MSS
advertised by VM2 (which is 1240 if VM2's mtu is properly configured),
so GRO taking place on VM1's eth0 operates on arriving segments sized
1240. Meaning, the "ingress" gso_size is actually 1240, and no "gso
clamping" occurs.

Only if VM2 has an mtu of 1500 is the MSS seen by the Hypervisor during
the handshake 1460, so GRO acting on VM1's eth0 operates on 1460-byte
segments. This leads to "gso clamping" taking place, with the BUG in
skb_segment (which, btw, seems sensitive to a change in gso_size only
if GRO was merging into the frag_list).
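
(The MSS values follow from the mtu minus 40 bytes of IPv4+TCP headers:
1280 - 40 = 1240 and 1500 - 40 = 1460.)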

Can you please confirm that our setups and reproduction steps are aligned?

Thanks,
Shmulik
Florian Westphal Aug. 24, 2016, 2:59 p.m. UTC | #4
Shmulik Ladkani <shmulik.ladkani@gmail.com> wrote:
> > Normal ipv4 routing via vm1, no iptables etc. present, so
> > 
> > we have  hypervisor 1500 -> 1500 VM1 1280 -> 1280 VM2
> > 
> > Turning off gro avoids this problem.
> 
> I hit the BUG only when VM2's mtu is not set to 1280 (kept to the 1500
> default).

Right, 

> Otherwise, the Hypervisor's TCP stack (sender) uses the TCP MSS
> advertised by VM2 (which is 1240 if VM2's mtu is properly configured),
> so GRO taking place on VM1's eth0 operates on arriving segments sized 1240.

True.

> Only if VM2 has an mtu of 1500 is the MSS seen by the Hypervisor during
> the handshake 1460, so GRO acting on VM1's eth0 operates on 1460-byte
> segments. This leads to "gso clamping" taking place, with the BUG in
> skb_segment (which, btw, seems sensitive to a change in gso_size only
> if GRO was merging into the frag_list).
> 
> Can you please confirm that our setups and reproduction steps are aligned?

Yes, seems setups are identical.
Shmulik Ladkani Aug. 25, 2016, 9:05 a.m. UTC | #5
Hi,

On Mon, 22 Aug 2016 14:58:42 +0200, fw@strlen.de wrote:
> Shmulik Ladkani <shmulik.ladkani@gmail.com> wrote:
> > There are cases where gso skbs (which originate from an ingress
> > interface) have a gso_size value that exceeds the output dst mtu:
> > 
> >  - ipv4 forwarding middlebox having in/out interfaces with different mtus
> >    addressed by fe6cc55f3a 'net: ip, ipv6: handle gso skbs in forwarding path'
> >  - bridge having a tunnel member interface stacked over a device with small mtu
> >    addressed by b8247f095e 'net: ip_finish_output_gso: If skb_gso_network_seglen exceeds MTU, allow segmentation for local udp tunneled skbs'
> > 
> > In both cases, such skbs are identified, then go through early software
> > segmentation+fragmentation as part of ip_finish_output_gso.
> > 
> > Another approach is to shrink the gso_size to a value small enough that
> > the resulting segments are smaller than the dst mtu, as suggested by
> > Eric Dumazet (as part of [1]) and Florian Westphal (as part of [2]).
> > 
> > This avoids the need for software segmentation/fragmentation at
> > ip_finish_output_gso, thus significantly improving throughput and
> > lowering cpu load.
> > 
> > This RFC patch attempts to implement this gso_size clamping.
> > 
> > [1] https://patchwork.ozlabs.org/patch/314327/
> > [2] https://patchwork.ozlabs.org/patch/644724/
> > 
> > Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
> > ---
> > 
> >  Florian, in fe6cc55f you described a BUG due to a gso_size decrease.
> >  I've tested both bridged and routed cases, but failed to hit the issue
> >  in my setups; I'd appreciate any hints on reproducing it.
> 
> Still get the BUG, I applied this patch on top of net-next.

The BUG triggers when GRO has taken place on ingress, and only if GRO
merged skbs into the frag_list (OTOH, when segments are only placed
into the frags[] of a single skb, skb_segment succeeds even if gso_size
was altered).

This is due to the assumption that frag_list members terminate on exact
MSS boundaries (which no longer holds once gso_size is clamped).

The assumption dates back to the original frag_list support in
skb_segment:

    89319d3801 net: Add frag_list support to skb_segment

    This patch adds limited support for handling frag_list packets in
    skb_segment.  The intention is to support GRO (Generic Receive Offload)
    packets which will be constructed by chaining normal packets using
    frag_list.
    
    As such we require all frag_list members terminate on exact MSS
    boundaries.  This is checked using BUG_ON.
    
    As there should only be one producer in the kernel of such packets,
    namely GRO, this requirement should not be difficult to maintain.
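
To make the mismatch concrete, here is a small standalone sketch (plain
userspace C, not kernel code; the sizes below are only an example that
matches the setup discussed above). With frag_list members sized to the
original MSS, a reduced gso_size makes the new segment boundaries fall
inside the members, violating the assumption skb_segment relies on:

    #include <stdio.h>

    int main(void)
    {
            unsigned int orig_mss = 1460;   /* MSS the GROed skb was built from */
            unsigned int new_mss  = 1240;   /* gso_size after clamping to a 1280 mtu */
            unsigned int members  = 4;      /* number of GROed frag_list members */
            unsigned int payload  = members * orig_mss;
            unsigned int off;

            /* Each frag_list member ends at a multiple of orig_mss;
             * report where the new segment boundaries land.
             */
            for (off = new_mss; off < payload; off += new_mss)
                    if (off % orig_mss)
                            printf("boundary at %u falls inside a frag_list member\n",
                                   off);

            return 0;
    }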

We have a few alternatives for gso_size clamping:

1 Fix 'skb_segment' arithmetic to support inputs that do not match
  the "frag_list members terminate on exact MSS" assumption.

2 Perform gso_size clamping in 'ip_finish_output_gso' for non-GROed skbs.
  Other use cases will still benefit: (a) packets arriving from
  virtualized interfaces, e.g. tap and friends; (b) packets arriving from
  veth of other namespaces (packets are locally generated by the TCP
  stack in a different netns).

3 Ditch the idea, again ;)

We can pursue (1), especially if there are other benefits in doing so.
OTOH, due to the current complexity of 'skb_segment', this is a bit
risky.

Going with (2) could be reasonable too, as it brings value for the
virtualized environments that need gso_size clamping, while presenting
minimal risk.

Appreciate your opinions.

Regards,
Shmulik
Herbert Xu Aug. 26, 2016, 11:19 a.m. UTC | #6
On Thu, Aug 25, 2016 at 12:05:33PM +0300, Shmulik Ladkani wrote:
>
> We have a few alternatives for gso_size clamping:
> 
> 1 Fix 'skb_segment' arithmetic to support inputs that do not match
>   the "frag_list members terminate on exact MSS" assumption.
> 
> 2 Perform gso_size clamping in 'ip_finish_output_gso' for non-GROed skbs.
>   Other use cases will still benefit: (a) packets arriving from
>   virtualized interfaces, e.g. tap and friends; (b) packets arriving from
>   veth of other namespaces (packets are locally generated by the TCP
>   stack in a different netns).
> 
> 3 Ditch the idea, again ;)
> 
> We can pursue (1), especially if there are other benefits in doing so.
> OTOH, due to the current complexity of 'skb_segment', this is a bit
> risky.
> 
> Going with (2) could be reasonable too, as it brings value for the
> virtualized environments that need gso_size clamping, while presenting
> minimal risk.

Please remember the overarching rule for GRO: it must be completely
transparent, i.e., the output must be as if GRO didn't exist.

So if this is a DF packet, then we must generate ICMP packets rather
than downsizing the packets.  If it's not DF, then we must fragment
only within each frag_list skb.
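
To spell out the DF branch of that rule, a rough sketch (illustrative
only, not part of this patch; it mirrors the ICMP_FRAG_NEEDED handling
that ip_forward()/ip_fragment() already perform, with 'mtu' standing
for the egress mtu):

    if (ip_hdr(skb)->frag_off & htons(IP_DF)) {
            /* DF is set and the segments exceed the mtu: answer with
             * ICMP_FRAG_NEEDED instead of silently resegmenting to a
             * smaller gso_size.
             */
            icmp_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED,
                      htonl(mtu));
            return -EMSGSIZE;
    }
    /* !DF: any fragmentation must stay within each frag_list skb. */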

Cheers,
Shmulik Ladkani Sept. 9, 2016, 5:48 a.m. UTC | #7
On Thu, 25 Aug 2016 12:05:33 +0300 Shmulik Ladkani <shmulik.ladkani@gmail.com> wrote:
> The BUG triggers when GRO has taken place on ingress, and only if GRO
> merged skbs into the frag_list (OTOH, when segments are only placed
> into the frags[] of a single skb, skb_segment succeeds even if gso_size
> was altered).
> 
> This is due to the assumption that frag_list members terminate on exact
> MSS boundaries (which no longer holds once gso_size is clamped).
> 
> We have a few alternatives for gso_size clamping:
> 
> 1 Fix 'skb_segment' arithmetic to support inputs that do not match
>   the "frag_list members terminate on exact MSS" assumption.
> 
> 2 Perform gso_size clamping in 'ip_finish_output_gso' for non-GROed skbs.
>   Other use cases will still benefit: (a) packets arriving from
>   virtualized interfaces, e.g. tap and friends; (b) packets arriving from
>   veth of other namespaces (packets are locally generated by the TCP
>   stack in a different netns).
> 
> 3 Ditch the idea, again ;)
> 
> We can pursue (1), especially if there are other benefits in doing so.
> OTOH, due to the current complexity of 'skb_segment', this is a bit
> risky.
> 
> Going with (2) could be reasonable too, as it brings value for the
> virtualized environments that need gso_size clamping, while presenting
> minimal risk.

Summarizing actions taken, in case someone refers to this thread.

- Re (1): Spent a short while massaging skb_segment().
  The code is not prepared to support arbitrary gso_size inputs.
  The main issue is that if nskb's frags[] get exhausted (but the
  original frag_skb's frags[] are not yet fully traversed), no new skb
  is generated. The code expects iteration over nskb's frags[] and
  frag_skb's frags[] to terminate together; the next allocated skb is
  always a clone of the next frag_skb in the original head_skb.
  Supporting arbitrary gso_size inputs would require an intrusive
  rewrite.

- Re (2): There's no easy way for ip_finish_output_gso() to detect that
  an skb is safe for "gso_size clamping" while preserving GSO/GRO
  transparency:
  We can tell whether an skb is "gso_size clamping safe" PER SKB, but
  that doesn't suffice; to preserve the GRO transparency rule, we must
  know the skb arrived via a code flow that is ALWAYS safe for gso_size
  clamping.

So I ended up identifying the relevant code flow of the use case I'm
interested in, and verified it is indeed safe for altering gso_size
(while taking a slight risk that this might not hold in the future).
I've used that mark as the criterion for safe "gso_size clamping" in
'ip_finish_output_gso'. Yep, not too elegant.

Regards,
Shmulik

Patch

diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index dde37fb..b911b43 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -216,6 +216,40 @@  static int ip_finish_output2(struct net *net, struct sock *sk, struct sk_buff *s
 	return -EINVAL;
 }
 
+static inline int skb_gso_clamp(struct sk_buff *skb, unsigned int network_len)
+{
+	struct skb_shared_info *shinfo = skb_shinfo(skb);
+	unsigned short gso_size;
+	unsigned int seglen;
+
+	if (shinfo->gso_size == GSO_BY_FRAGS)
+		return -EINVAL;
+
+	seglen = skb_gso_network_seglen(skb);
+
+	/* Decrease gso_size to fit specified network length */
+	gso_size = shinfo->gso_size - (seglen - network_len);
+	if (shinfo->gso_type & SKB_GSO_UDP)
+		gso_size &= ~7;
+
+	if (!(shinfo->gso_type & SKB_GSO_DODGY)) {
+		/* Recalculate gso_segs for skbs of trusted sources.
+		 * Untrusted skbs will have their gso_segs calculated by
+		 * skb_gso_segment.
+		 */
+		unsigned int hdr_len, payload_len;
+
+		hdr_len = seglen - shinfo->gso_size;
+		payload_len = skb->len - hdr_len;
+		shinfo->gso_segs = DIV_ROUND_UP(payload_len, gso_size);
+
+		// Q: need to verify gso_segs <= dev->gso_max_segs?
+		//    seems to be protected by netif_skb_features
+	}
+	shinfo->gso_size = gso_size;
+	return 0;
+}
+
 static int ip_finish_output_gso(struct net *net, struct sock *sk,
 				struct sk_buff *skb, unsigned int mtu)
 {
@@ -237,6 +271,16 @@  static int ip_finish_output_gso(struct net *net, struct sock *sk,
 	 * 2) skb arrived via virtio-net, we thus get TSO/GSO skbs directly
 	 * from host network stack.
 	 */
+
+	/* Attempt to clamp gso_size to avoid segmenting and fragmenting.
+	 *
+	 * Q: policy needed? per device?
+	 */
+	if (sysctl_ip_output_gso_clamp) {
+		if (!skb_gso_clamp(skb, mtu))
+			return ip_finish_output2(net, sk, skb);
+	}
+
 	features = netif_skb_features(skb);
 	BUILD_BUG_ON(sizeof(*IPCB(skb)) > SKB_SGO_CB_OFFSET);
 	segs = skb_gso_segment(skb, features & ~NETIF_F_GSO_MASK);