From patchwork Mon Jan 5 06:02:58 2015
X-Patchwork-Submitter: FengYu LeiDian
X-Patchwork-Id: 425231
X-Patchwork-Delegate: davem@davemloft.net
Message-ID: <54AA2912.6090903@gmail.com>
Date: Mon, 05 Jan 2015 14:02:58 +0800
From: Fan Du
To: "Du, Fan", Thomas Graf, davem@davemloft.net, jesse@nicira.com
Cc: "Michael S. Tsirkin", Jason Wang, netdev@vger.kernel.org, fw@strlen.de,
 dev@openvswitch.org, pshelar@nicira.com
Subject: Re: [PATCH net] gso: do GSO for local skb with size bigger than MTU
In-Reply-To: <5A90DA2E42F8AE43BC4A093BF0678848DEDFDB@SHSMSX104.ccr.corp.intel.com>
X-Mailing-List: netdev@vger.kernel.org

On 2014-12-03 10:31, Du, Fan wrote:
>
>
>> -----Original Message-----
>> From: Thomas Graf [mailto:tgr@infradead.org] On Behalf Of Thomas Graf
>> Sent: Wednesday, December 3, 2014 1:42 AM
>> To: Michael S. Tsirkin
>> Cc: Du, Fan; 'Jason Wang'; netdev@vger.kernel.org; davem@davemloft.net;
>> fw@strlen.de; dev@openvswitch.org; jesse@nicira.com; pshelar@nicira.com
>> Subject: Re: [PATCH net] gso: do GSO for local skb with size bigger than MTU
>>
>> On 12/02/14 at 07:34pm, Michael S.
Tsirkin wrote:
>>> On Tue, Dec 02, 2014 at 05:09:27PM +0000, Thomas Graf wrote:
>>>> On 12/02/14 at 01:48pm, Flavio Leitner wrote:
>>>>> What about containers or any other virtualization environment that
>>>>> doesn't use Virtio?
>>>>
>>>> The host can dictate the MTU in that case for both veth or OVS
>>>> internal which would be primary container plumbing techniques.
>>>
>>> It typically can't do this easily for VMs with emulated devices:
>>> real ethernet uses a fixed MTU.
>>>
>>> IMHO it's confusing to suggest MTU as a fix for this bug, it's an
>>> unrelated optimization.
>>> ICMP_DEST_UNREACH/ICMP_FRAG_NEEDED is the right fix here.
>>
>> PMTU discovery only resolves the issue if an actual IP stack is running
>> inside the VM. This may not be the case at all.
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
> Some thoughts here:
>
> Looking at it the other way around: the host stack should forge an
> ICMP_DEST_UNREACH/ICMP_FRAG_NEEDED message from the _inner_ skb network
> and transport headers, apply whatever type of encapsulation is in use,
> and then push that packet back up to the guest/container. To the guest
> it looks as if an intermediate node or the peer sent the message, so
> PMTU discovery can be expected to work correctly.
> The same behavior could be shared by any other encapsulation technology
> that suffers from this problem.

Hi David, Jesse and Thomas,

As discussed here:
https://www.marc.info/?l=linux-netdev&m=141764712631150&w=4
quoting Jesse:

  My proposal would be something like this:
  * For L2, reduce the VM MTU to the lowest common denominator on the
    segment.
  * For L3, use path MTU discovery or fragment inner packet (i.e. normal
    routing behavior).
  * As a last resort (such as if using an old version of virtio in the
    guest), fragment the tunnel packet.

For L2, this is an administrative action.
For L3, the PMTU approach looks better: once the sender has been alerted to
the reduced MTU, the packet size after encapsulation will no longer exceed
the physical MTU, so no additional fragmentation effort is needed.
For "As a last resort ...
fragment the tunnel packet", the original patch:
https://www.marc.info/?l=linux-netdev&m=141715655024090&w=4
did the job, but it seems it was not welcome. The raw patch below adopts the
PMTU approach; please review. Any comments or suggestions are welcome.

diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index e9f81d4..4d1b221 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -1771,6 +1771,130 @@ static void vxlan_xmit_one(struct sk_buff *skb, struct net_device *dev,
 	tos = ip_tunnel_ecn_encap(tos, old_iph, skb);
 	ttl = ttl ? : ip4_dst_hoplimit(&rt->dst);
 
+	if (skb_is_gso(skb)) {
+		unsigned int inner_l234_hdrlen;
+		unsigned int outer_l34_hdrlen;
+		unsigned int gso_seglen;
+		struct net_device *phy_dev = rt->dst.dev;
+
+		inner_l234_hdrlen = skb_transport_header(skb) - skb_mac_header(skb);
+		if (skb_shinfo(skb)->gso_type & (SKB_GSO_TCPV4 | SKB_GSO_TCPV6))
+			inner_l234_hdrlen += tcp_hdrlen(skb);
+		if (skb_shinfo(skb)->gso_type & SKB_GSO_UDP)
+			inner_l234_hdrlen += sizeof(struct udphdr);
+
+		outer_l34_hdrlen = sizeof(struct iphdr) + sizeof(struct udphdr) + sizeof(struct vxlanhdr);
+		/* gso_seglen is the GSO-ed skb packet len, adjust gso_size
+		 * to fit into physical netdev MTU
+		 */
+		gso_seglen = outer_l34_hdrlen + inner_l234_hdrlen + skb_shinfo(skb)->gso_size;
+		if (gso_seglen > phy_dev->mtu) {
+			struct sk_buff *reply;
+			struct ethhdr *orig_eth;
+			struct ethhdr *new_eth;
+			struct ethhdr *tnl_eth;
+			struct iphdr *orig_ip;
+			struct iphdr *new_ip;
+			struct iphdr *tnl_ip;
+			struct icmphdr *new_icmp;
+			unsigned int room;
+			unsigned int data_len;
+			unsigned int reply_l234_hdrlen;
+			unsigned int vxlan_tnl_hdrlen;
+			struct vxlanhdr *vxh;
+			struct udphdr *uh;
+			__wsum csum;
+
+			/* How much room to store the original message */
+			room = (skb->len > 576) ? 576 : skb->len;
+			room -= sizeof(struct iphdr) + sizeof(struct icmphdr);
+
+			/* Ethernet payload len */
+			data_len = skb->len - skb_network_offset(skb);
+			if (data_len > room)
+				data_len = room;
+
+			reply_l234_hdrlen = LL_RESERVED_SPACE(phy_dev) + phy_dev->needed_tailroom +
+					    sizeof(struct iphdr) + sizeof(struct icmphdr);
+			vxlan_tnl_hdrlen = LL_RESERVED_SPACE(phy_dev) + phy_dev->needed_tailroom +
+					   sizeof(struct iphdr) + sizeof(struct udphdr) + sizeof(struct vxlanhdr);
+
+			reply = alloc_skb(vxlan_tnl_hdrlen + reply_l234_hdrlen + data_len, GFP_ATOMIC);
+			reply->dev = phy_dev;
+			skb_reserve(reply, vxlan_tnl_hdrlen + reply_l234_hdrlen);
+
+			new_icmp = (struct icmphdr *)__skb_push(reply, sizeof(struct icmphdr));
+			new_icmp->type = ICMP_DEST_UNREACH;
+			new_icmp->code = ICMP_FRAG_NEEDED;
+			new_icmp->un.frag.mtu = htons(phy_dev->mtu - outer_l34_hdrlen);
+			new_icmp->checksum = 0;
+
+			new_ip = (struct iphdr *)__skb_push(reply, sizeof(struct iphdr));
+			orig_ip = ip_hdr(skb);
+			new_ip->ihl = 5;
+			new_ip->version = 4;
+			new_ip->ttl = 32;
+			new_ip->tos = 1;
+			new_ip->protocol = IPPROTO_ICMP;
+			new_ip->saddr = orig_ip->daddr;
+			new_ip->daddr = orig_ip->saddr;
+			new_ip->frag_off = 0;
+			new_ip->tot_len = htons(sizeof(struct iphdr) + sizeof(struct icmphdr) + data_len);
+			ip_send_check(new_ip);
+
+			new_eth = (struct ethhdr *)__skb_push(reply, sizeof(struct ethhdr));
+			orig_eth = eth_hdr(skb);
+			ether_addr_copy(new_eth->h_dest, orig_eth->h_source);
+			ether_addr_copy(new_eth->h_source, orig_eth->h_dest);
+			new_eth->h_proto = htons(ETH_P_IP);
+			reply->ip_summed = CHECKSUM_UNNECESSARY;
+			reply->pkt_type = PACKET_HOST;
+			reply->protocol = htons(ETH_P_IP);
+			memcpy(skb_put(reply, data_len), skb_network_header(skb), data_len);
+			new_icmp->checksum = csum_fold(csum_partial(new_icmp, sizeof(struct icmphdr) + data_len, 0));
+
+			/* vxlan encapsulation */
+			vxh = (struct vxlanhdr *)__skb_push(reply, sizeof(*vxh));
+			vxh->vx_flags = htonl(VXLAN_FLAGS);
+			vxh->vx_vni = htonl(vni << 8);
+
+			__skb_push(reply, sizeof(*uh));
+			skb_reset_transport_header(reply);
+			uh = udp_hdr(reply);
+			uh->dest = dst_port;
+			uh->source = src_port;
+			uh->len = htons(reply->len);
+			uh->check = 0;
+			csum = skb_checksum(reply, 0, reply->len, 0);
+			uh->check = udp_v4_check(reply->len, fl4.saddr, dst->sin.sin_addr.s_addr, csum);
+
+			tnl_ip = (struct iphdr *)__skb_push(reply, sizeof(struct iphdr));
+			skb_reset_network_header(reply);
+			tnl_ip->ihl = 5;
+			tnl_ip->version = 4;
+			tnl_ip->ttl = 32;
+			tnl_ip->tos = 1;
+			tnl_ip->protocol = IPPROTO_UDP;
+			tnl_ip->saddr = dst->sin.sin_addr.s_addr;
+			tnl_ip->daddr = fl4.saddr;
+			tnl_ip->frag_off = 0;
+			tnl_ip->tot_len = htons(reply->len);
+			ip_send_check(tnl_ip);
+
+			/* fill with a nonsense mac header */
+			tnl_eth = (struct ethhdr *)__skb_push(reply, sizeof(struct ethhdr));
+			skb_reset_mac_header(reply);
+			orig_eth = eth_hdr(skb);
+			ether_addr_copy(tnl_eth->h_dest, orig_eth->h_source);
+			ether_addr_copy(tnl_eth->h_source, orig_eth->h_dest);
+			tnl_eth->h_proto = htons(ETH_P_IP);
+			__skb_pull(reply, skb_network_offset(reply));
+
+			/* push the encapsulated ICMP message back to the sender */
+			netif_rx_ni(reply);
+		}
+	}
 	err = vxlan_xmit_skb(vxlan->vn_sock, rt, skb,
 			     fl4.saddr, dst->sin.sin_addr.s_addr,
 			     tos, ttl, df, src_port, dst_port,