Message ID | 54146A37.5010108@oracle.com |
---|---|
State | Changes Requested, archived |
Delegated to: | David Miller |
Headers | show |
From: David L Stevens <david.stevens@oracle.com> Date: Sat, 13 Sep 2014 12:00:55 -0400 > This patch allows an admin to set the MTU on a sunvnet device to arbitrary > values between the minimum (68) and maximum (65535) IPv4 packet sizes. > > Signed-off-by: David L Stevens <david.stevens@oracle.com> I personally find this scheme where we pretend that the device can have an arbitrary MTU, when in fact the effective MTU is a product of the sub-ports, quite ugly. In fact, that ugly ICMP stuff in the next patch is absolutely required to avoid bogus behavior possible after this patch. You have to combine #2 and #3 otherwise you are adding an intermediate regression. Logic wise, at the very least you should limit the MTU setting to the largest MTU of all of the individual ports. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 09/13/2014 04:21 PM, David Miller wrote: > I personally find this scheme where we pretend that the device can > have an arbitrary MTU, when in fact the effective MTU is a product of > the sub-ports, quite ugly. I wouldn't say I like it, either, but the problem is that without it, we are tied to the least common denominator. Anything that doesn't support v1.6 of the VIO protocol is stuck at the low MTU and low throughput, and since Solaris itself is limited to 16000, Linux, which can do 64K-1, is also limited to 16000. On my hardware, the original we'd be tied to is about 1Gbps, the 16000 is about 5.4Gbps, and the full linux-linux is about 8Gbps. So, a big penalty. I think of it as an Ethernet connected to a virtual switch, and the ICMP errors are for PMTUD are analogous to IGMP snooping. This is not an Ethernet device alone-- those don't negotiate per-destination link MTUs. But nothing forces anyone to mix MTUs; the ICMP errors simply allow it. > In fact, that ugly ICMP stuff in the next patch is absolutely required > to avoid bogus behavior possible after this patch. You have to > combine #2 and #3 otherwise you are adding an intermediate regression. I disagree here. It's not any more bogus for the admin to set an MTU value of what s/he wants when the others have not been. It *always* happens that way. Ordinary Ethernet comes up at 1500 and one of them must be increased first. At that time, the others don't match, and it is the admin's responsibility to make sure they match. > Logic wise, at the very least you should limit the MTU setting to the > largest MTU of all of the individual ports. We can't directly do that, because the MTU for the port is negotiated at probe time. That'll be 1500 IP data (always) and we have to raise one of them first, so one of them has to be set at a higher value than the negotiated MTU at some point, at least until it is reset and re-negotiated. But we don't know until we try a higher value if all the links can use it, and we can't prevent another link from joining later that has a lower MTU, but we can't then lower our on MTU for the whole device. I think in ordinary Ethernet, there is nothing at all enforcing a particular MTU-- it is set to what the admin wants, regardless of what other hosts use. That's the effect we ought to have here, despite the one-to-many p2p links where we can know in advance what the link MTUs are, and that's what patch #2 does. I don't think we should try too hard to prevent a value an admin wants -- it will just get in the way of the admin, where it doesn't in ordinary Ethernet. On the other hand, if the link MTU is lower, we shouldn't quietly drop packets, thus the ICMP errors that allow both. +-DLS -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 09/13/2014 10:15 PM, David L Stevens wrote: > I wouldn't say I like it, either, but the problem is that without it, we > are tied to the least common denominator. Anything that doesn't support > v1.6 of the VIO protocol is stuck at the low MTU and low throughput, and To put things in perspective, in practice its only legacy linux today that will do the v1.0, and administrators are likely to want to upgrade to the later version, so encumbering the code with legacy version support may end up becoming hard-to-maintain code? > I think of it as an Ethernet connected to a virtual switch, and the ICMP > errors are for PMTUD are analogous to IGMP snooping. This is not an Ethernet > device alone-- those don't negotiate per-destination link MTUs. But nothing forces > anyone to mix MTUs; the ICMP errors simply allow it. As I understand it, this method of sending ICMP from the driver will not work for L2 (non-IP) packets, and it will not even work for IP packets that are coming to us, from, say, openvswitch, right? So in practice it actually has limited usability? --Sowmini -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 09/14/2014 08:21 AM, Sowmini Varadhan wrote: > To put things in perspective, in practice its only legacy linux today that will do the v1.0, and administrators are likely to want to upgrade > to the later version, so encumbering the code with legacy version support may end up becoming hard-to-maintain code? No, v1.8 Solaris would force us to a 1/3 drop in performance between linux LDOMs because of its 16000 byte MTU limit. I don't think it's particularly hard to maintain -- it's virtually a literal translation of the text in the VIO protocol document. Everything that's there should stay there; only new revisions of the protocol would cause new changes, presumably in other areas of the code where those new features are implemented. And I don't think reverse compatibility is optional. > As I understand it, this method of sending ICMP from the driver will not > work for L2 (non-IP) packets, and it will not even work for IP packets that are coming to us, from, say, openvswitch, right? So in practice it > actually has limited usability? It wouldn't work for a bridged L2 network with no local IP address, because there would be no valid return IP address for the ICMP error we generate (in IPv4 -- IPv6 will always have a valid link-local address). Everything else, including openvswitch as far as I can tell, should make use of the standard pmtud routing information that these update. What I come back to, as before, is the simple notion that nothing forces an administrator to the otherwise unusual circumstance of setting different MTUs on directly-attached common networks. If you want to bridge L2 traffic, make your MTU 1500 and it'll work exactly as before. If you, instead, are using IPv4 or IPv6 and ordinary routed traffic, you can have 8X performance improvement between hosts that can support it, even if other hosts on the same vswitch and outside your control cannot. You can talk to all hosts on the vswitch, with a performance that matches the capabilities of each peer. I don't see any way that's not better. +-DLS -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/drivers/net/ethernet/sun/sunvnet.c b/drivers/net/ethernet/sun/sunvnet.c index 04c58b5..3826b36 100644 --- a/drivers/net/ethernet/sun/sunvnet.c +++ b/drivers/net/ethernet/sun/sunvnet.c @@ -754,6 +754,9 @@ static int vnet_start_xmit(struct sk_buff *skb, struct net_device *dev) if (unlikely(!port)) goto out_dropped; + if (skb->len > port->vio.rmtu) + goto out_dropped; + spin_lock_irqsave(&port->vio.lock, flags); dr = &port->vio.drings[VIO_DRIVER_TX_RING]; @@ -975,7 +978,7 @@ static void vnet_set_rx_mode(struct net_device *dev) static int vnet_change_mtu(struct net_device *dev, int new_mtu) { - if (new_mtu != ETH_DATA_LEN) + if (new_mtu < 68 || new_mtu > 65535) return -EINVAL; dev->mtu = new_mtu; diff --git a/drivers/net/ethernet/sun/sunvnet.h b/drivers/net/ethernet/sun/sunvnet.h index 243ae69..fcf0129 100644 --- a/drivers/net/ethernet/sun/sunvnet.h +++ b/drivers/net/ethernet/sun/sunvnet.h @@ -11,7 +11,7 @@ */ #define VNET_TX_TIMEOUT (5 * HZ) -#define VNET_MAXPACKET 1518ULL /* ETH_FRAMELEN + VLAN_HDR */ +#define VNET_MAXPACKET 65553ULL /* 64K-1 +ETH HDR +VLAN HDR*/ #define VNET_TX_RING_SIZE 512 #define VNET_TX_WAKEUP_THRESH(dr) ((dr)->pending / 4)
This patch allows an admin to set the MTU on a sunvnet device to arbitrary values between the minimum (68) and maximum (65535) IPv4 packet sizes. Signed-off-by: David L Stevens <david.stevens@oracle.com> --- drivers/net/ethernet/sun/sunvnet.c | 5 ++++- drivers/net/ethernet/sun/sunvnet.h | 2 +- 2 files changed, 5 insertions(+), 2 deletions(-)