[RFC,net-next,00/11] Nested VLANs - decimate flags and add another

Message ID 20200527212512.17901-1-edwin.peer@broadcom.com

Message

Edwin Peer May 27, 2020, 9:25 p.m. UTC
This series began life as a modest attempt to fix two issues pertaining
to VLANs nested inside Geneve tunnels and snowballed from there. The
first issue, addressed by a simple one-liner, is that GSO is not enabled
for upper VLAN devices on top of Geneve. The second issue, addressed by
the balance of the series, deals largely with MTU handling. VLAN devices
above L2 in L3 tunnels inherit the MTU of the underlying device. This
causes IP fragmentation because the inner L2 cannot be expanded within
the same maximum L3 size to accommodate the additional VLAN tag.
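
To put numbers on the second issue, here is a minimal sketch of the arithmetic, assuming an IPv4 underlay with a 1500-byte MTU and Geneve without options. The helper and constant names are made up for illustration and are not taken from the patches:

```c
#include <stdbool.h>

/* Header sizes in bytes. Assumptions: IPv4 underlay without IP options,
 * Geneve base header without Geneve options. */
enum {
	HDR_IPV4   = 20,
	HDR_UDP    = 8,
	HDR_GENEVE = 8,
	HDR_ETH    = 14,	/* inner Ethernet header */
	HDR_VLAN   = 4,		/* one 802.1Q tag */
};

/* Size of the outer IPv4 packet needed to carry an inner L3 packet of
 * inner_len bytes inside an inner Ethernet frame with nr_tags VLAN tags. */
static unsigned int outer_l3_len(unsigned int inner_len, unsigned int nr_tags)
{
	return HDR_IPV4 + HDR_UDP + HDR_GENEVE +
	       HDR_ETH + nr_tags * HDR_VLAN + inner_len;
}

/* True if the encapsulated packet no longer fits the underlay MTU. */
static bool needs_fragmentation(unsigned int underlay_mtu,
				unsigned int inner_len, unsigned int nr_tags)
{
	return outer_l3_len(inner_len, nr_tags) > underlay_mtu;
}
```

Under these assumptions the Geneve device MTU is 1450. An untagged inner packet of 1450 bytes fits exactly (1500 bytes on the outside), but the same packet sent through a VLAN device that inherited the 1450 MTU needs 1504 bytes and gets fragmented.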

As a first attempt, a new flag was introduced to generalize what was
already being done for MACsec devices. This flag was unconditionally
set for all devices that have a size-constrained L2, such as is the
case for Geneve and VXLAN tunnel devices. This doesn't quite do the
right thing, however, if the underlying device MTU happens to be
configured to a lower MTU than is supported. Thus, the approach was
further refined to set IFF_NO_VLAN_ROOM when changing MTU, based on
whether the underlying device L2 still has room for VLAN tags, but
stopping short of registering device notifiers to update upper device
MTU whenever a lower device changes. VLAN devices will thus do the
sensible thing if they are applied to an already configured device,
but will not dynamically update whenever the underlying device's MTU
is subsequently changed (this seemed a bridge too far).
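
As a rough illustration of the refined approach (not the actual patch code), the decision can be sketched in user-space C. The struct, its fields and the function names below are hypothetical stand-ins, and the exact comparison is an assumption about what "still has room" means:

```c
#include <stdbool.h>

#define VLAN_TAG_LEN 4	/* one 802.1Q tag, in bytes */

/* Hypothetical stand-in for a net device; not struct net_device. */
struct sketch_dev {
	unsigned int mtu;	/* currently configured MTU */
	unsigned int max_mtu;	/* largest MTU the device's L2 can carry */
	bool no_vlan_room;	/* stands in for IFF_NO_VLAN_ROOM */
};

/* On MTU change, re-evaluate whether a VLAN tag still fits in the L2. */
static void sketch_change_mtu(struct sketch_dev *dev, unsigned int new_mtu)
{
	dev->mtu = new_mtu;
	dev->no_vlan_room = new_mtu + VLAN_TAG_LEN > dev->max_mtu;
}

/* MTU a VLAN device would pick when it is created on top of lower.
 * It is computed once; later changes to lower are not tracked. */
static unsigned int sketch_vlan_mtu(const struct sketch_dev *lower)
{
	if (lower->no_vlan_room && lower->mtu > VLAN_TAG_LEN)
		return lower->mtu - VLAN_TAG_LEN;
	return lower->mtu;
}
```

Because the VLAN MTU is only derived at creation time in this sketch, it mirrors the limitation described above: a later MTU change on the lower device updates the flag but not any existing upper VLAN device.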

Aggregate devices presented the next challenge. Transitively propagating
IFF_NO_VLAN_ROOM via bonds, teams and the like seemed similar in
principle to the handling of IFF_XMIT_DST_RELEASE (only the opposite),
but IFF_XMIT_DST_RELEASE_PERM evaded understanding. Ultimately this flag
failed to justify its existence, allowing the new flag to take its place
and avoid taking up the last bit in the enum.

Finally, an audit of the other net devices in the tree was conducted to
discover where else this new behavior may be appropriate. At this point
it was also discovered that GRE devices would happily allow VLANs to be
added even when L3 is being tunneled in L3, hence the patch restricting
VLANs to ARPHRD_ETHER devices. Between ARPHRD_ETHER and IFF_NO_VLAN_ROOM,
it now seemed only a hop and a skip to eliminate NETIF_F_VLAN_CHALLENGED
too, but
alas there are still a few holdouts that would appear to require more of
a moonshot to address.

Edwin Peer (11):
  net: geneve: enable vlan offloads
  net: do away with the IFF_XMIT_DST_RELEASE_PERM flag
  net: vlan: add IFF_NO_VLAN_ROOM to constrain MTU
  net: geneve: constrain upper VLAN MTU using IFF_NO_VLAN_ROOM
  net: vxlan: constrain upper VLAN MTU using IFF_NO_VLAN_ROOM
  net: l2tp_eth: constrain upper VLAN MTU using IFF_NO_VLAN_ROOM
  net: gre: constrain upper VLAN MTU using IFF_NO_VLAN_ROOM
  net: distribute IFF_NO_VLAN_ROOM into aggregate devs
  net: ntb_netdev: support VLAN using IFF_NO_VLAN_ROOM
  net: vlan: disallow non-Ethernet devices
  net: leverage IFF_NO_VLAN_ROOM to limit NETIF_F_VLAN_CHALLENGED

 Documentation/networking/netdev-features.rst  |   4 +-
 drivers/infiniband/ulp/ipoib/ipoib_main.c     |   3 +-
 drivers/net/bonding/bond_main.c               |  15 ++-
 drivers/net/ethernet/intel/e100.c             |  15 ++-
 .../net/ethernet/mellanox/mlxsw/switchx2.c    |  52 ++++++--
 drivers/net/ethernet/toshiba/ps3_gelic_net.c  |  12 +-
 drivers/net/ethernet/wiznet/w5100.c           |   6 +-
 drivers/net/ethernet/wiznet/w5300.c           |   6 +-
 drivers/net/ethernet/xilinx/ll_temac_main.c   |   1 -
 drivers/net/geneve.c                          |  17 ++-
 drivers/net/ifb.c                             |   4 +
 drivers/net/loopback.c                        |   1 -
 drivers/net/macsec.c                          |   6 +-
 drivers/net/net_failover.c                    |  31 +++--
 drivers/net/ntb_netdev.c                      |   8 +-
 drivers/net/rionet.c                          |   3 +
 drivers/net/sb1000.c                          |   1 +
 drivers/net/team/team.c                       |  16 +--
 drivers/net/vrf.c                             |   4 +-
 drivers/net/vxlan.c                           |  10 +-
 drivers/net/wimax/i2400m/netdev.c             |   5 +-
 drivers/s390/net/qeth_l2_main.c               |  12 +-
 include/linux/if_vlan.h                       |  48 ++++++++
 include/linux/netdevice.h                     |  12 +-
 net/8021q/vlan.c                              |   2 +-
 net/8021q/vlan_dev.c                          |   9 ++
 net/8021q/vlan_netlink.c                      |   2 +
 net/core/dev.c                                |   2 +-
 net/ipv4/ip_tunnel.c                          |   2 +
 net/ipv6/ip6_gre.c                            |   4 +-
 net/l2tp/l2tp_eth.c                           | 114 ++++++++++--------
 net/sched/sch_teql.c                          |   3 +
 32 files changed, 290 insertions(+), 140 deletions(-)

Comments

Michał Mirosław May 28, 2020, 12:15 a.m. UTC | #1
On Wed, May 27, 2020 at 02:25:01PM -0700, Edwin Peer wrote:
> This series began life as a modest attempt to fix two issues pertaining
> to VLANs nested inside Geneve tunnels and snowballed from there. The
> first issue, addressed by a simple one-liner, is that GSO is not enabled
> for upper VLAN devices on top of Geneve. The second issue, addressed by
> the balance of the series, deals largely with MTU handling. VLAN devices
> above L2 in L3 tunnels inherit the MTU of the underlying device. This
> causes IP fragmentation because the inner L2 cannot be expanded within
> the same maximum L3 size to accommodate the additional VLAN tag.
> 
> As a first attempt, a new flag was introduced to generalize what was
> already being done for MACsec devices. This flag was unconditionally
> set for all devices that have a size-constrained L2, such as is the
> case for Geneve and VXLAN tunnel devices. This doesn't quite do the
> right thing, however, if the underlying device MTU happens to be
> configured to a lower MTU than is supported. Thus, the approach was
> further refined to set IFF_NO_VLAN_ROOM when changing MTU, based on
> whether the underlying device L2 still has room for VLAN tags, but
> stopping short of registering device notifiers to update upper device
> MTU whenever a lower device changes. VLAN devices will thus do the
> sensible thing if they are applied to an already configured device,
> but will not dynamically update whenever the underlying device's MTU
> is subsequently changed (this seemed a bridge too far).
[...]

Hi!

Good to see someone taking on the VLAN MTU mess.  :-)

Have you considered adding a 'vlan_headroom' field (or another name)
to a netdev instead of a flag? This would lend itself easily to device
aggregation (just take the min over the slaves) and would also handle
nested VLANs gracefully (subtracting for every layer).
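
A rough user-space sketch of that idea, with hypothetical names throughout (nothing here is existing kernel API, and shrinking the MTU once the headroom runs out is an assumption about how the field would be used):

```c
#include <stddef.h>

#define VLAN_TAG_LEN 4	/* one 802.1Q tag, in bytes */

/* Hypothetical per-device state: MTU plus spare L2 bytes for VLAN tags. */
struct vlan_layer {
	unsigned int mtu;
	unsigned int vlan_headroom;
};

/* Aggregation (bond, team): the headroom is the minimum over the slaves.
 * Returns 0 when there are no slaves. */
static unsigned int aggregate_vlan_headroom(const unsigned int *slave_headroom,
					    size_t n)
{
	unsigned int min = 0;

	for (size_t i = 0; i < n; i++)
		if (i == 0 || slave_headroom[i] < min)
			min = slave_headroom[i];
	return min;
}

/* Stacking one VLAN consumes one tag's worth of headroom; whatever the
 * headroom cannot cover is taken out of the MTU instead. */
static struct vlan_layer stack_vlan(struct vlan_layer lower)
{
	struct vlan_layer upper = lower;

	if (lower.vlan_headroom >= VLAN_TAG_LEN) {
		upper.vlan_headroom = lower.vlan_headroom - VLAN_TAG_LEN;
	} else {
		unsigned int shortfall = VLAN_TAG_LEN - lower.vlan_headroom;

		upper.vlan_headroom = 0;
		upper.mtu = lower.mtu > shortfall ? lower.mtu - shortfall : 0;
	}
	return upper;
}
```

For example, a lower device with MTU 1500 and 4 bytes of headroom keeps MTU 1500 for the first VLAN layer, and a second nested layer drops to 1496.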

In patch 3 you seem to assume that if the lower device reduces its MTU
below its max, then it's OK to push it up with VLAN headers. I don't think
this is appropriate when the MTU is reduced because of, e.g., a PMTU limit
for a tunnel.

Best Regards,
Michał Mirosław

Edwin Peer May 28, 2020, 12:39 a.m. UTC | #2
On Wed, May 27, 2020 at 5:15 PM Michał Mirosław <mirq-linux@rere.qmqm.pl> wrote:

> Have you considered adding a 'vlan_headroom' field (or another name)
> to a netdev instead of a flag? This would lend itself easily to device
> aggregation (just take the min over the slaves) and would also handle
> nested VLANs gracefully (subtracting for every layer).

Great idea, I think that would be much nicer.

> In patch 3 you seem to assume that if the lower device reduces its MTU
> below its max, then it's OK to push it up with VLAN headers. I don't think
> this is appropriate when the MTU is reduced because of, e.g., a PMTU limit
> for a tunnel.

Indeed. For non-tunnel devices I think this behavior is still correct,
because past the 1st hop (where device MTU should be appropriate), all
of L2, including any VLANs, has been replaced by something else. But
yes, tunnels probably do need to unconditionally reduce MTU, because
PMTU is something more dynamic. I guess I kind of half thought about
this for gre6, where this is what I did because PMTU is so much more
in your face for IPv6.

Regards,
Edwin Peer