Message ID | 5419F3E5.4050708@oracle.com |
---|---|
State | Superseded, archived |
Delegated to: | David Miller |
Headers | show |
Hello. On 9/17/2014 11:49 PM, David L Stevens wrote: > This patch sends ICMP and ICMPv6 messages for Path MTU Discovery when a remote > port MTU is smaller than the device MTU. This allows mixing newer VIO protocol > devices that support MTU negotiation with older devices that do not on the > same vswitch. It also allows Linux-Linux LDOMs to use 64K-1 data packets even > though Solaris vswitch is limited to <16K MTU. > Signed-off-by: David L Stevens <david.stevens@oracle.com> > --- > drivers/net/ethernet/sun/sunvnet.c | 37 +++++++++++++++++++++++++++++++++++- > 1 files changed, 36 insertions(+), 1 deletions(-) > diff --git a/drivers/net/ethernet/sun/sunvnet.c b/drivers/net/ethernet/sun/sunvnet.c > index 5e64e60..3c4ee18 100644 > --- a/drivers/net/ethernet/sun/sunvnet.c > +++ b/drivers/net/ethernet/sun/sunvnet.c [...] > @@ -791,8 +798,36 @@ static int vnet_start_xmit(struct sk_buff *skb, struct net_device *dev) > if (unlikely(!port)) > goto out_dropped; > > - if (skb->len > port->rmtu) > + if (skb->len > port->rmtu) { > + unsigned long localmtu = port->rmtu - ETH_HLEN; > + > + if (vio_version_after_eq(&port->vio, 1, 3)) > + localmtu -= VLAN_HLEN; > + > + if (skb->protocol == htons(ETH_P_IP)) { > + struct flowi4 fl4; > + struct rtable *rt = NULL; > + > + memset(&fl4, 0, sizeof(fl4)); > + fl4.flowi4_oif = dev->ifindex; > + fl4.flowi4_tos = RT_TOS(ip_hdr(skb)->tos); > + fl4.daddr = ip_hdr(skb)->daddr; > + fl4.saddr = ip_hdr(skb)->saddr; > + > + rt = ip_route_output_key(dev_net(dev), &fl4); > + if (!IS_ERR(rt)) { > + skb_dst_set(skb, &rt->dst); > + icmp_send(skb, ICMP_DEST_UNREACH, > + ICMP_FRAG_NEEDED, > + htonl(localmtu)); > + } > + } > +#if IS_ENABLED(CONFIG_IPV6) This #if could be avoided by extending the *if* statement below, no? > + else if (skb->protocol == htons(ETH_P_IPV6)) > + icmpv6_send(skb, ICMPV6_PKT_TOOBIG, 0, localmtu); > +#endif > goto out_dropped; > + } WBR, Sergei -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 09/17/2014 05:13 PM, Sergei Shtylyov wrote: >> + } >> +#if IS_ENABLED(CONFIG_IPV6) > > This #if could be avoided by extending the *if* statement below, no? > >> + else if (skb->protocol == htons(ETH_P_IPV6)) >> + icmpv6_send(skb, ICMPV6_PKT_TOOBIG, 0, localmtu); >> +#endif >> goto out_dropped; >> + } I'm not sure I understand your point, but the #if must be there to avoid a reference to icmpv6_send() which will not be defined if CONFIG_IPV6=n. +-DLS -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On (09/17/14 16:49), David L Stevens wrote: > + > + rt = ip_route_output_key(dev_net(dev), &fl4); > + if (!IS_ERR(rt)) { As I've mentioned before, this layering violation makes me uneasy, so its benefits should be evaluated carefully. You will typically not be able to find an rt for packets coming here from any application that does not itself use/update the FIB, e.g., uspace based packet-injectors (PF_PACKET-based applications, intel dpdk-based uspace stacks etc.) --Sowmini -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 09/17/2014 06:43 PM, Sowmini Varadhan wrote: > On (09/17/14 16:49), David L Stevens wrote: >> + >> + rt = ip_route_output_key(dev_net(dev), &fl4); >> + if (!IS_ERR(rt)) { > > As I've mentioned before, this layering violation makes me uneasy, > so its benefits should be evaluated carefully. You will typically not be > able to find an rt for packets coming here from any application > that does not itself use/update the FIB, e.g., uspace based packet-injectors > (PF_PACKET-based applications, intel dpdk-based uspace stacks etc.) I think the configurations where it doesn't work are reasonably rare, and the default on boot is a 1500-byte MTU for everyone and none of these ICMP errors will be triggered if that is where all hosts on the vswitch leave it. You don't have to ever mix MTUs. The alternative in all cases where we send an ICMP error to make it work is to instead silently drop those packets, all packets of the same size or larger that we get after. It does nothing different whatsoever for any configuration that works today. It only allows other configurations to work also. A pair of Linux LDOMs can get 8X throughput improvement by raising the MTU to 64K, but many packets will be *silently* dropped if they go to any other destination that does not support 64K MTU. Those destinations that don't support 64K MTU include any legacy Linux running the pre-jumbo code and all Solaris hosts, including the current releases. With the ICMP errors, the new linux code can interoperate properly with all of them, and do so at much higher throughput (4-8X) with those that can support higher MTUs. Also, I wouldn't call it a layering violation. icmp_send() is the external API for triggering ICMP errors, and we are sending them at the point where we know the next-hop MTU. It is exactly equivalent to an Ethernet device connected to a switch where the switch sends useful layer-3 packets (like IGMP queries). In this case, that useful layer 3 info is remote link MTU data; something not available in ordinary Ethernet. Also, any PF_PACKET or other applications that bypass the routing tables can still (and should) receive and process PMTUD packets. If you're sending raw Ethernet frames that are IP, you should. If you mean non-routable protocols, none of those can be delivered over any link where these ICMP errors would be sent, anyway. If you try to send an 18000-byte non-IP packet with a 1500-byte MTU, it will be dropped today, and still dropped with this patchset. Large broadcast frames (your comment from another mail) will not work, just as they won't work today. If you need to do that, you should leave the MTU of the sender below the lowest MTU attached to the same switch, the way you would for any other Ethernet, and they'll be fragmented and reassembled. On the other hand, if you want your TCP and UDP traffic over IPv4 and IPv6 to be 4-8 times faster than it is today without any other change, you can accept the deficiencies in the otherwise not-allowed configuration and set the device MTU to 64K even when it's mixed with other devices that can't do it. If all of them support the higher MTU, none of this code comes into play. If some of them don't, this code allows them to interoperate; without this code, all those packets are simply silently dropped and your network can only function at the level of the least capable attached LDOM. +-DLS -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 09/17/2014 07:41 PM, David L Stevens wrote: > > > On 09/17/2014 06:43 PM, Sowmini Varadhan wrote: >> On (09/17/14 16:49), David L Stevens wrote: >>> + >>> + rt = ip_route_output_key(dev_net(dev), &fl4); >>> + if (!IS_ERR(rt)) { >> >> As I've mentioned before, this layering violation makes me uneasy, >> so its benefits should be evaluated carefully. You will typically not be >> able to find an rt for packets coming here from any application >> that does not itself use/update the FIB, e.g., uspace based packet-injectors >> (PF_PACKET-based applications, intel dpdk-based uspace stacks etc.) > > A pair of Linux LDOMs can get 8X throughput improvement by raising the MTU to 64K, but > many packets will be *silently* dropped if they go to any other destination that does > not support 64K MTU. Those destinations that don't support 64K MTU include any legacy > Linux running the pre-jumbo code and all Solaris hosts, including the current releases. by now I am actually quite confused by what the Administrator will see. If I do "ifconfig -a" or "ip addr", what is the reported mtu of the interface? > Also, I wouldn't call it a layering violation. icmp_send() is the external API for > triggering ICMP errors, and we are sending them at the point where we know the next-hop MTU. > It is exactly equivalent to an Ethernet device connected to a switch where the switch > sends useful layer-3 packets (like IGMP queries). In this case, that useful layer 3 info > is remote link MTU data; something not available in ordinary Ethernet. Interesting. So if the Administrator sets up ICMP filters for outbound/inbound (at the IP layer), what will be the observed behavior? --Sowmini -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 09/18/2014 03:23 PM, Sowmini Varadhan wrote: > by now I am actually quite confused by what the Administrator will see. > If I do "ifconfig -a" or "ip addr", what is the reported mtu of the interface? The interface MTU is whatever the admin sets it to, between 68 bytes (the IPv4 min) and 64K-1 (the IPv4 max). In cases where packets of interface MTU size cannot be delivered because the LDC MTU is smaller, instead of silently dropping them, we send the ICMP errors which allow PMTUD updates per-destination. Subsequent packets will be segmented or fragmented at that (lower) value for that destination, and use other MTUs up to the interface MTU for other destinations. > Interesting. So if the Administrator sets up ICMP filters for outbound/inbound (at the IP layer), what will be the observed behavior? If an administrator drops PMTUD packets, then TCP won't work, even without this patch set, for any destinations that cause PMTUD. It's explicitly not optional in IPv6; in IPv4, fragmenting TCP packets could hide it as long as IP_DF is not set, but the only thing this code could do for packets too big is to drop them -- exactly what we'd do whether or not we send the ICMP error to tell the sender what MTU we can send. +-DLS -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/drivers/net/ethernet/sun/sunvnet.c b/drivers/net/ethernet/sun/sunvnet.c index 5e64e60..3c4ee18 100644 --- a/drivers/net/ethernet/sun/sunvnet.c +++ b/drivers/net/ethernet/sun/sunvnet.c @@ -17,6 +17,13 @@ #include <linux/mutex.h> #include <linux/if_vlan.h> +#if IS_ENABLED(CONFIG_IPV6) +#include <linux/icmpv6.h> +#endif + +#include <net/icmp.h> +#include <net/route.h> + #include <asm/vio.h> #include <asm/ldc.h> @@ -791,8 +798,36 @@ static int vnet_start_xmit(struct sk_buff *skb, struct net_device *dev) if (unlikely(!port)) goto out_dropped; - if (skb->len > port->rmtu) + if (skb->len > port->rmtu) { + unsigned long localmtu = port->rmtu - ETH_HLEN; + + if (vio_version_after_eq(&port->vio, 1, 3)) + localmtu -= VLAN_HLEN; + + if (skb->protocol == htons(ETH_P_IP)) { + struct flowi4 fl4; + struct rtable *rt = NULL; + + memset(&fl4, 0, sizeof(fl4)); + fl4.flowi4_oif = dev->ifindex; + fl4.flowi4_tos = RT_TOS(ip_hdr(skb)->tos); + fl4.daddr = ip_hdr(skb)->daddr; + fl4.saddr = ip_hdr(skb)->saddr; + + rt = ip_route_output_key(dev_net(dev), &fl4); + if (!IS_ERR(rt)) { + skb_dst_set(skb, &rt->dst); + icmp_send(skb, ICMP_DEST_UNREACH, + ICMP_FRAG_NEEDED, + htonl(localmtu)); + } + } +#if IS_ENABLED(CONFIG_IPV6) + else if (skb->protocol == htons(ETH_P_IPV6)) + icmpv6_send(skb, ICMPV6_PKT_TOOBIG, 0, localmtu); +#endif goto out_dropped; + } spin_lock_irqsave(&port->vio.lock, flags);
This patch sends ICMP and ICMPv6 messages for Path MTU Discovery when a remote port MTU is smaller than the device MTU. This allows mixing newer VIO protocol devices that support MTU negotiation with older devices that do not on the same vswitch. It also allows Linux-Linux LDOMs to use 64K-1 data packets even though Solaris vswitch is limited to <16K MTU. Signed-off-by: David L Stevens <david.stevens@oracle.com> --- drivers/net/ethernet/sun/sunvnet.c | 37 +++++++++++++++++++++++++++++++++++- 1 files changed, 36 insertions(+), 1 deletions(-)