Message ID | 1519934923-39372-3-git-send-email-sridhar.samudrala@intel.com |
---|---|
State | Changes Requested, archived |
Delegated to: | David Miller |
Headers | show |
Series | Enable virtio_net to act as a backup for a passthru device | expand |
Thu, Mar 01, 2018 at 09:08:43PM CET, sridhar.samudrala@intel.com wrote: >This patch enables virtio_net to switch over to a VF datapath when a VF >netdev is present with the same MAC address. It allows live migration >of a VM with a direct attached VF without the need to setup a bond/team >between a VF and virtio net device in the guest. > >The hypervisor needs to enable only one datapath at any time so that >packets don't get looped back to the VM over the other datapath. When a VF >is plugged, the virtio datapath link state can be marked as down. The >hypervisor needs to unplug the VF device from the guest on the source host >and reset the MAC filter of the VF to initiate failover of datapath to >virtio before starting the migration. After the migration is completed, >the destination hypervisor sets the MAC filter on the VF and plugs it back >to the guest to switch over to VF datapath. > >When BACKUP feature is enabled, an additional netdev(bypass netdev) is >created that acts as a master device and tracks the state of the 2 lower >netdevs. The original virtio_net netdev is marked as 'backup' netdev and a >passthru device with the same MAC is registered as 'active' netdev. > >This patch is based on the discussion initiated by Jesse on this thread. >https://marc.info/?l=linux-virtualization&m=151189725224231&w=2 > >Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com> >Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> >Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com> >--- > drivers/net/virtio_net.c | 683 ++++++++++++++++++++++++++++++++++++++++++++++- As I wrote to the discussion of the other version of this patchset, I strongly believe you need to have a common part in net/core/ that will be shared by both virtio_net and netvsc. There there should be explicit limits for this in-driver bonding solution, like max 1 vf slave, etc.
On Fri, Mar 2, 2018 at 12:36 AM, Jiri Pirko <jiri@resnulli.us> wrote: > Thu, Mar 01, 2018 at 09:08:43PM CET, sridhar.samudrala@intel.com wrote: >>This patch enables virtio_net to switch over to a VF datapath when a VF >>netdev is present with the same MAC address. It allows live migration >>of a VM with a direct attached VF without the need to setup a bond/team >>between a VF and virtio net device in the guest. >> >>The hypervisor needs to enable only one datapath at any time so that >>packets don't get looped back to the VM over the other datapath. When a VF >>is plugged, the virtio datapath link state can be marked as down. The >>hypervisor needs to unplug the VF device from the guest on the source host >>and reset the MAC filter of the VF to initiate failover of datapath to >>virtio before starting the migration. After the migration is completed, >>the destination hypervisor sets the MAC filter on the VF and plugs it back >>to the guest to switch over to VF datapath. >> >>When BACKUP feature is enabled, an additional netdev(bypass netdev) is >>created that acts as a master device and tracks the state of the 2 lower >>netdevs. The original virtio_net netdev is marked as 'backup' netdev and a >>passthru device with the same MAC is registered as 'active' netdev. >> >>This patch is based on the discussion initiated by Jesse on this thread. >>https://marc.info/?l=linux-virtualization&m=151189725224231&w=2 >> >>Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com> >>Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> >>Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com> >>--- >> drivers/net/virtio_net.c | 683 ++++++++++++++++++++++++++++++++++++++++++++++- > > As I wrote to the discussion of the other version of this patchset, > I strongly believe you need to have a common part in net/core/ that will > be shared by both virtio_net and netvsc. There there should be explicit > limits for this in-driver bonding solution, like max 1 vf slave, etc. Yeah, this code essentially calls out the "shareable" code with a comment at the start and end of the section what defines the virtio_bypass functionality. It would just be a matter of mostly cutting and pasting to put it into a separate driver module. The design limits things to a 1:1 relationship since we just have the child and backup pointers, but I don't think I am seeing exception handling to prevent us from overwriting the child pointers so there may be a leak there. Thanks. - Alex
Fri, Mar 02, 2018 at 04:26:25PM CET, alexander.duyck@gmail.com wrote: >On Fri, Mar 2, 2018 at 12:36 AM, Jiri Pirko <jiri@resnulli.us> wrote: >> Thu, Mar 01, 2018 at 09:08:43PM CET, sridhar.samudrala@intel.com wrote: >>>This patch enables virtio_net to switch over to a VF datapath when a VF >>>netdev is present with the same MAC address. It allows live migration >>>of a VM with a direct attached VF without the need to setup a bond/team >>>between a VF and virtio net device in the guest. >>> >>>The hypervisor needs to enable only one datapath at any time so that >>>packets don't get looped back to the VM over the other datapath. When a VF >>>is plugged, the virtio datapath link state can be marked as down. The >>>hypervisor needs to unplug the VF device from the guest on the source host >>>and reset the MAC filter of the VF to initiate failover of datapath to >>>virtio before starting the migration. After the migration is completed, >>>the destination hypervisor sets the MAC filter on the VF and plugs it back >>>to the guest to switch over to VF datapath. >>> >>>When BACKUP feature is enabled, an additional netdev(bypass netdev) is >>>created that acts as a master device and tracks the state of the 2 lower >>>netdevs. The original virtio_net netdev is marked as 'backup' netdev and a >>>passthru device with the same MAC is registered as 'active' netdev. >>> >>>This patch is based on the discussion initiated by Jesse on this thread. >>>https://marc.info/?l=linux-virtualization&m=151189725224231&w=2 >>> >>>Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com> >>>Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> >>>Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com> >>>--- >>> drivers/net/virtio_net.c | 683 ++++++++++++++++++++++++++++++++++++++++++++++- >> >> As I wrote to the discussion of the other version of this patchset, >> I strongly believe you need to have a common part in net/core/ that will >> be shared by both virtio_net and netvsc. There there should be explicit >> limits for this in-driver bonding solution, like max 1 vf slave, etc. > >Yeah, this code essentially calls out the "shareable" code with a >comment at the start and end of the section what defines the >virtio_bypass functionality. It would just be a matter of mostly >cutting and pasting to put it into a separate driver module. Please put it there and unite the use of it with netvsc. > >The design limits things to a 1:1 relationship since we just have the >child and backup pointers, but I don't think I am seeing exception >handling to prevent us from overwriting the child pointers so there >may be a leak there. Ok.
On Fri, Mar 2, 2018 at 8:37 AM, Samudrala, Sridhar <sridhar.samudrala@intel.com> wrote: > > > On 3/2/2018 8:20 AM, Jiri Pirko wrote: > > Fri, Mar 02, 2018 at 04:26:25PM CET, alexander.duyck@gmail.com wrote: > > On Fri, Mar 2, 2018 at 12:36 AM, Jiri Pirko <jiri@resnulli.us> wrote: > > Thu, Mar 01, 2018 at 09:08:43PM CET, sridhar.samudrala@intel.com wrote: > > This patch enables virtio_net to switch over to a VF datapath when a VF > netdev is present with the same MAC address. It allows live migration > of a VM with a direct attached VF without the need to setup a bond/team > between a VF and virtio net device in the guest. > > The hypervisor needs to enable only one datapath at any time so that > packets don't get looped back to the VM over the other datapath. When a VF > is plugged, the virtio datapath link state can be marked as down. The > hypervisor needs to unplug the VF device from the guest on the source host > and reset the MAC filter of the VF to initiate failover of datapath to > virtio before starting the migration. After the migration is completed, > the destination hypervisor sets the MAC filter on the VF and plugs it back > to the guest to switch over to VF datapath. > > When BACKUP feature is enabled, an additional netdev(bypass netdev) is > created that acts as a master device and tracks the state of the 2 lower > netdevs. The original virtio_net netdev is marked as 'backup' netdev and a > passthru device with the same MAC is registered as 'active' netdev. > > This patch is based on the discussion initiated by Jesse on this thread. > https://marc.info/?l=linux-virtualization&m=151189725224231&w=2 > > Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com> > Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> > Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com> > --- > drivers/net/virtio_net.c | 683 > ++++++++++++++++++++++++++++++++++++++++++++++- > > As I wrote to the discussion of the other version of this patchset, > I strongly believe you need to have a common part in net/core/ that will > be shared by both virtio_net and netvsc. There there should be explicit > limits for this in-driver bonding solution, like max 1 vf slave, etc. > > Yeah, this code essentially calls out the "shareable" code with a > comment at the start and end of the section what defines the > virtio_bypass functionality. It would just be a matter of mostly > cutting and pasting to put it into a separate driver module. > > Please put it there and unite the use of it with netvsc. > > Based on Stephen's responses, it is not clear if netvsc would be OK to move > to > the 3 netdev solution at this time. When another paravirtual driver would > like > to use this solution, we can do refactoring at that time. > > The design limits things to a 1:1 relationship since we just have the > child and backup pointers, but I don't think I am seeing exception > handling to prevent us from overwriting the child pointers so there > may be a leak there. > > > We only allow one active netdev registered as a child. There is a check to > make sure that if an active netdev is registered, another one with same MAC > is not allowed. I don't think there is a leak. > > vbi = netdev_priv(dev); > backup = (child_netdev->dev.parent == dev->dev.parent); > if (backup ? rtnl_dereference(vbi->backup_netdev) : > rtnl_dereference(vbi->active_netdev)) { > netdev_info(dev, > "%s attempting to join bypass dev when %s > already present\n", > child_netdev->name, backup ? "backup" : > "active"); > return NOTIFY_DONE; > } > > /* Avoid non pci devices as active netdev */ > if (!backup && (!child_netdev->dev.parent || > !dev_is_pci(child_netdev->dev.parent))) > return NOTIFY_DONE; > > Thanks > Sridhar I overlooked that part. Thanks. - Alex
On Fri, Mar 02, 2018 at 07:26:25AM -0800, Alexander Duyck wrote: > The design limits things to a 1:1 relationship since we just have the > child and backup pointers, but I don't think I am seeing exception > handling to prevent us from overwriting the child pointers so there > may be a leak there. > > Thanks. > > - Alex In fact maintaining a list in that case would be nicer, and just use an arbitrary one. E.g. one can see how a user wanting to swap device 1 for device 2 might first add device 2 with same MAC then drop device 1.
On Fri, Mar 02, 2018 at 05:20:17PM +0100, Jiri Pirko wrote: > >Yeah, this code essentially calls out the "shareable" code with a > >comment at the start and end of the section what defines the > >virtio_bypass functionality. It would just be a matter of mostly > >cutting and pasting to put it into a separate driver module. > > Please put it there and unite the use of it with netvsc. Surely, adding this to other drivers (e.g. might this be handy for xen too?) can be left for a separate patchset. Let's get one device merged first.
On 3/2/2018 11:41 AM, Michael S. Tsirkin wrote: > On Fri, Mar 02, 2018 at 07:26:25AM -0800, Alexander Duyck wrote: >> The design limits things to a 1:1 relationship since we just have the >> child and backup pointers, but I don't think I am seeing exception >> handling to prevent us from overwriting the child pointers so there >> may be a leak there. >> >> Thanks. >> >> - Alex > In fact maintaining a list in that case would be nicer, and > just use an arbitrary one. > E.g. one can see how a user wanting to swap device 1 for device 2 > might first add device 2 with same MAC then drop device 1. It should be possible to swap VF1 with VF2 by - enabling virtio link - unplugging VF1 - plugging VF2 - disabling virtio link
On Fri, Mar 02, 2018 at 11:52:27AM -0800, Samudrala, Sridhar wrote: > > > On 3/2/2018 11:41 AM, Michael S. Tsirkin wrote: > > On Fri, Mar 02, 2018 at 07:26:25AM -0800, Alexander Duyck wrote: > > > The design limits things to a 1:1 relationship since we just have the > > > child and backup pointers, but I don't think I am seeing exception > > > handling to prevent us from overwriting the child pointers so there > > > may be a leak there. > > > > > > Thanks. > > > > > > - Alex > > In fact maintaining a list in that case would be nicer, and > > just use an arbitrary one. > > E.g. one can see how a user wanting to swap device 1 for device 2 > > might first add device 2 with same MAC then drop device 1. > > It should be possible to swap VF1 with VF2 by > 1.- enabling virtio link > 2.- unplugging VF1 > 3.- plugging VF2 > 4.- disabling virtio link > True, but it isn't hard to avoid breakage if user swapped steps 2 and 3. No need to make it more fragile that it has to be.
On Fri, Mar 2, 2018 at 12:10 PM, Michael S. Tsirkin <mst@redhat.com> wrote: > On Fri, Mar 02, 2018 at 11:52:27AM -0800, Samudrala, Sridhar wrote: >> >> >> On 3/2/2018 11:41 AM, Michael S. Tsirkin wrote: >> > On Fri, Mar 02, 2018 at 07:26:25AM -0800, Alexander Duyck wrote: >> > > The design limits things to a 1:1 relationship since we just have the >> > > child and backup pointers, but I don't think I am seeing exception >> > > handling to prevent us from overwriting the child pointers so there >> > > may be a leak there. >> > > >> > > Thanks. >> > > >> > > - Alex >> > In fact maintaining a list in that case would be nicer, and >> > just use an arbitrary one. >> > E.g. one can see how a user wanting to swap device 1 for device 2 >> > might first add device 2 with same MAC then drop device 1. >> >> It should be possible to swap VF1 with VF2 by >> 1.- enabling virtio link >> 2.- unplugging VF1 >> 3.- plugging VF2 >> 4.- disabling virtio link >> > > True, but it isn't hard to avoid breakage if user > swapped steps 2 and 3. No need to make it more > fragile that it has to be. The migration case, VF2 is associated with another PF on another machine (destination), I wonder how it is possible. Even with local plugging of VF2 on the same PF, the MAC address requirement (VF1's == VF2's) would fail the MAC address assignment on VF2. -Siwei > > -- > MST > > --------------------------------------------------------------------- > To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org > For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org >
On Fri, Mar 2, 2018 at 11:42 AM, Michael S. Tsirkin <mst@redhat.com> wrote: > On Fri, Mar 02, 2018 at 05:20:17PM +0100, Jiri Pirko wrote: >> >Yeah, this code essentially calls out the "shareable" code with a >> >comment at the start and end of the section what defines the >> >virtio_bypass functionality. It would just be a matter of mostly >> >cutting and pasting to put it into a separate driver module. >> >> Please put it there and unite the use of it with netvsc. > > Surely, adding this to other drivers (e.g. might this be handy for xen > too?) can be left for a separate patchset. Let's get one device merged > first. Agreed. -Siwei > > -- > MST > > --------------------------------------------------------------------- > To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org > For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org >
On 3/2/2018 12:44 PM, Siwei Liu wrote: > On Fri, Mar 2, 2018 at 12:10 PM, Michael S. Tsirkin <mst@redhat.com> wrote: >> On Fri, Mar 02, 2018 at 11:52:27AM -0800, Samudrala, Sridhar wrote: >>> >>> On 3/2/2018 11:41 AM, Michael S. Tsirkin wrote: >>>> On Fri, Mar 02, 2018 at 07:26:25AM -0800, Alexander Duyck wrote: >>>>> The design limits things to a 1:1 relationship since we just have the >>>>> child and backup pointers, but I don't think I am seeing exception >>>>> handling to prevent us from overwriting the child pointers so there >>>>> may be a leak there. >>>>> >>>>> Thanks. >>>>> >>>>> - Alex >>>> In fact maintaining a list in that case would be nicer, and >>>> just use an arbitrary one. >>>> E.g. one can see how a user wanting to swap device 1 for device 2 >>>> might first add device 2 with same MAC then drop device 1. >>> It should be possible to swap VF1 with VF2 by >>> 1.- enabling virtio link >>> 2.- unplugging VF1 >>> 3.- plugging VF2 >>> 4.- disabling virtio link >>> >> True, but it isn't hard to avoid breakage if user >> swapped steps 2 and 3. No need to make it more >> fragile that it has to be. > The migration case, VF2 is associated with another PF on another > machine (destination), I wonder how it is possible. > > Even with local plugging of VF2 on the same PF, the MAC address > requirement (VF1's == VF2's) would fail the MAC address assignment on > VF2. > > I didn't include updating the MAC filter step in the above sequence. So definitely plugging 2 VFs with the same MAC address will be an issue. Here is the more complete sequence of steps that are required to enable live migration. Source Hypervisor - Bring up the virtio link - Hot Unplug VF from the VM - Delete FDB entry for the MAC on the Bridge associated with virtio/tap - Remove the MAC filter associated with the VF - Migrate VM to destination Destination Hypervisor (after migration is completed) - Set MAC filter for the VF - Hot Plug VF to the VM - Bring down the virtio link Thanks Sridhar
On Thu, Mar 1, 2018 at 12:08 PM, Sridhar Samudrala <sridhar.samudrala@intel.com> wrote: > This patch enables virtio_net to switch over to a VF datapath when a VF > netdev is present with the same MAC address. It allows live migration > of a VM with a direct attached VF without the need to setup a bond/team > between a VF and virtio net device in the guest. > > The hypervisor needs to enable only one datapath at any time so that > packets don't get looped back to the VM over the other datapath. When a VF > is plugged, the virtio datapath link state can be marked as down. The > hypervisor needs to unplug the VF device from the guest on the source host > and reset the MAC filter of the VF to initiate failover of datapath to > virtio before starting the migration. After the migration is completed, > the destination hypervisor sets the MAC filter on the VF and plugs it back > to the guest to switch over to VF datapath. > > When BACKUP feature is enabled, an additional netdev(bypass netdev) is > created that acts as a master device and tracks the state of the 2 lower > netdevs. The original virtio_net netdev is marked as 'backup' netdev and a > passthru device with the same MAC is registered as 'active' netdev. > > This patch is based on the discussion initiated by Jesse on this thread. > https://marc.info/?l=linux-virtualization&m=151189725224231&w=2 > > Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com> > Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> > Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com> > --- > drivers/net/virtio_net.c | 683 ++++++++++++++++++++++++++++++++++++++++++++++- > 1 file changed, 682 insertions(+), 1 deletion(-) > > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c > index bcd13fe906ca..f2860d86c952 100644 > --- a/drivers/net/virtio_net.c > +++ b/drivers/net/virtio_net.c > @@ -30,6 +30,8 @@ > #include <linux/cpu.h> > #include <linux/average.h> > #include <linux/filter.h> > +#include <linux/netdevice.h> > +#include <linux/pci.h> > #include <net/route.h> > #include <net/xdp.h> > > @@ -206,6 +208,9 @@ struct virtnet_info { > u32 speed; > > unsigned long guest_offloads; > + > + /* upper netdev created when BACKUP feature enabled */ > + struct net_device *bypass_netdev; > }; > > struct padded_vnet_hdr { > @@ -2236,6 +2241,22 @@ static int virtnet_xdp(struct net_device *dev, struct netdev_bpf *xdp) > } > } > > +static int virtnet_get_phys_port_name(struct net_device *dev, char *buf, > + size_t len) > +{ > + struct virtnet_info *vi = netdev_priv(dev); > + int ret; > + > + if (!virtio_has_feature(vi->vdev, VIRTIO_NET_F_BACKUP)) > + return -EOPNOTSUPP; > + > + ret = snprintf(buf, len, "_bkup"); > + if (ret >= len) > + return -EOPNOTSUPP; > + > + return 0; > +} > + What if the systemd/udevd is not new enough to enforce the n<phys_port_name> naming? Would virtio_bypass get a different name than the original virtio_net? Should we detect this earlier and fall back to legacy mode without creating the bypass netdev and ensalving the VF? > static const struct net_device_ops virtnet_netdev = { > .ndo_open = virtnet_open, > .ndo_stop = virtnet_close, > @@ -2253,6 +2274,7 @@ static const struct net_device_ops virtnet_netdev = { > .ndo_xdp_xmit = virtnet_xdp_xmit, > .ndo_xdp_flush = virtnet_xdp_flush, > .ndo_features_check = passthru_features_check, > + .ndo_get_phys_port_name = virtnet_get_phys_port_name, > }; > > static void virtnet_config_changed_work(struct work_struct *work) > @@ -2647,6 +2669,653 @@ static int virtnet_validate(struct virtio_device *vdev) > return 0; > } > > +/* START of functions supporting VIRTIO_NET_F_BACKUP feature. > + * When BACKUP feature is enabled, an additional netdev(bypass netdev) > + * is created that acts as a master device and tracks the state of the > + * 2 lower netdevs. The original virtio_net netdev is registered as > + * 'backup' netdev and a passthru device with the same MAC is registered > + * as 'active' netdev. > + */ > + > +/* bypass state maintained when BACKUP feature is enabled */ > +struct virtnet_bypass_info { > + /* passthru netdev with same MAC */ > + struct net_device __rcu *active_netdev; > + > + /* virtio_net netdev */ > + struct net_device __rcu *backup_netdev; > + > + /* active netdev stats */ > + struct rtnl_link_stats64 active_stats; > + > + /* backup netdev stats */ > + struct rtnl_link_stats64 backup_stats; > + > + /* aggregated stats */ > + struct rtnl_link_stats64 bypass_stats; > + > + /* spinlock while updating stats */ > + spinlock_t stats_lock; > +}; > + > +static void virtnet_bypass_child_open(struct net_device *dev, > + struct net_device *child_netdev) > +{ > + int err = dev_open(child_netdev); > + > + if (err) > + netdev_warn(dev, "unable to open slave: %s: %d\n", > + child_netdev->name, err); > +} > + > +static int virtnet_bypass_open(struct net_device *dev) > +{ > + struct virtnet_bypass_info *vbi = netdev_priv(dev); > + struct net_device *child_netdev; > + > + netif_carrier_off(dev); > + netif_tx_wake_all_queues(dev); > + > + child_netdev = rtnl_dereference(vbi->active_netdev); > + if (child_netdev) > + virtnet_bypass_child_open(dev, child_netdev); > + > + child_netdev = rtnl_dereference(vbi->backup_netdev); > + if (child_netdev) > + virtnet_bypass_child_open(dev, child_netdev); > + > + return 0; > +} > + > +static int virtnet_bypass_close(struct net_device *dev) > +{ > + struct virtnet_bypass_info *vi = netdev_priv(dev); > + struct net_device *child_netdev; > + > + netif_tx_disable(dev); > + > + child_netdev = rtnl_dereference(vi->active_netdev); > + if (child_netdev) > + dev_close(child_netdev); > + > + child_netdev = rtnl_dereference(vi->backup_netdev); > + if (child_netdev) > + dev_close(child_netdev); > + > + return 0; > +} > + > +static netdev_tx_t virtnet_bypass_drop_xmit(struct sk_buff *skb, > + struct net_device *dev) > +{ > + atomic_long_inc(&dev->tx_dropped); > + dev_kfree_skb_any(skb); > + return NETDEV_TX_OK; > +} > + > +static bool virtnet_bypass_xmit_ready(struct net_device *dev) > +{ > + return netif_running(dev) && netif_carrier_ok(dev); > +} > + > +static netdev_tx_t virtnet_bypass_start_xmit(struct sk_buff *skb, > + struct net_device *dev) > +{ > + struct virtnet_bypass_info *vbi = netdev_priv(dev); > + struct net_device *xmit_dev; > + > + /* Try xmit via active netdev followed by backup netdev */ > + xmit_dev = rcu_dereference_bh(vbi->active_netdev); > + if (!xmit_dev || !virtnet_bypass_xmit_ready(xmit_dev)) { > + xmit_dev = rcu_dereference_bh(vbi->backup_netdev); > + if (!xmit_dev || !virtnet_bypass_xmit_ready(xmit_dev)) > + return virtnet_bypass_drop_xmit(skb, dev); > + } > + > + skb->dev = xmit_dev; > + skb->queue_mapping = qdisc_skb_cb(skb)->slave_dev_queue_mapping; > + > + return dev_queue_xmit(skb); > +} > + > +static u16 virtnet_bypass_select_queue(struct net_device *dev, > + struct sk_buff *skb, void *accel_priv, > + select_queue_fallback_t fallback) > +{ > + /* This helper function exists to help dev_pick_tx get the correct > + * destination queue. Using a helper function skips a call to > + * skb_tx_hash and will put the skbs in the queue we expect on their > + * way down to the bonding driver. > + */ > + u16 txq = skb_rx_queue_recorded(skb) ? skb_get_rx_queue(skb) : 0; > + > + /* Save the original txq to restore before passing to the driver */ > + qdisc_skb_cb(skb)->slave_dev_queue_mapping = skb->queue_mapping; > + > + if (unlikely(txq >= dev->real_num_tx_queues)) { > + do { > + txq -= dev->real_num_tx_queues; > + } while (txq >= dev->real_num_tx_queues); > + } > + > + return txq; > +} > + > +/* fold stats, assuming all rtnl_link_stats64 fields are u64, but > + * that some drivers can provide 32bit values only. > + */ > +static void virtnet_bypass_fold_stats(struct rtnl_link_stats64 *_res, > + const struct rtnl_link_stats64 *_new, > + const struct rtnl_link_stats64 *_old) > +{ > + const u64 *new = (const u64 *)_new; > + const u64 *old = (const u64 *)_old; > + u64 *res = (u64 *)_res; > + int i; > + > + for (i = 0; i < sizeof(*_res) / sizeof(u64); i++) { > + u64 nv = new[i]; > + u64 ov = old[i]; > + s64 delta = nv - ov; > + > + /* detects if this particular field is 32bit only */ > + if (((nv | ov) >> 32) == 0) > + delta = (s64)(s32)((u32)nv - (u32)ov); > + > + /* filter anomalies, some drivers reset their stats > + * at down/up events. > + */ > + if (delta > 0) > + res[i] += delta; > + } > +} > + > +static void virtnet_bypass_get_stats(struct net_device *dev, > + struct rtnl_link_stats64 *stats) > +{ > + struct virtnet_bypass_info *vbi = netdev_priv(dev); > + const struct rtnl_link_stats64 *new; > + struct rtnl_link_stats64 temp; > + struct net_device *child_netdev; > + > + spin_lock(&vbi->stats_lock); > + memcpy(stats, &vbi->bypass_stats, sizeof(*stats)); > + > + rcu_read_lock(); > + > + child_netdev = rcu_dereference(vbi->active_netdev); > + if (child_netdev) { > + new = dev_get_stats(child_netdev, &temp); > + virtnet_bypass_fold_stats(stats, new, &vbi->active_stats); > + memcpy(&vbi->active_stats, new, sizeof(*new)); > + } > + > + child_netdev = rcu_dereference(vbi->backup_netdev); > + if (child_netdev) { > + new = dev_get_stats(child_netdev, &temp); > + virtnet_bypass_fold_stats(stats, new, &vbi->backup_stats); > + memcpy(&vbi->backup_stats, new, sizeof(*new)); > + } > + > + rcu_read_unlock(); > + > + memcpy(&vbi->bypass_stats, stats, sizeof(*stats)); > + spin_unlock(&vbi->stats_lock); > +} > + > +static int virtnet_bypass_change_mtu(struct net_device *dev, int new_mtu) > +{ > + struct virtnet_bypass_info *vbi = netdev_priv(dev); > + struct net_device *child_netdev; > + int ret = 0; > + > + child_netdev = rcu_dereference(vbi->active_netdev); > + if (child_netdev) { > + ret = dev_set_mtu(child_netdev, new_mtu); > + if (ret) > + return ret; > + } > + > + child_netdev = rcu_dereference(vbi->backup_netdev); > + if (child_netdev) { > + ret = dev_set_mtu(child_netdev, new_mtu); > + if (ret) > + netdev_err(child_netdev, > + "Unexpected failure to set mtu to %d\n", > + new_mtu); Shouldn't we unwind the MTU config on active_netdev if failing to set it on backup_netdev? > + } > + > + dev->mtu = new_mtu; > + return 0; > +} > + > +static void virtnet_bypass_set_rx_mode(struct net_device *dev) > +{ > + struct virtnet_bypass_info *vbi = netdev_priv(dev); > + struct net_device *child_netdev; > + > + rcu_read_lock(); > + > + child_netdev = rcu_dereference(vbi->active_netdev); > + if (child_netdev) { > + dev_uc_sync_multiple(child_netdev, dev); > + dev_mc_sync_multiple(child_netdev, dev); > + } > + > + child_netdev = rcu_dereference(vbi->backup_netdev); > + if (child_netdev) { > + dev_uc_sync_multiple(child_netdev, dev); > + dev_mc_sync_multiple(child_netdev, dev); > + } > + If VF comes up later than when set_rx_mode is called where do you sync up the unicast and multicast address? The rest looks good. Thanks, -Siwei > + rcu_read_unlock(); > +} > + > +static const struct net_device_ops virtnet_bypass_netdev_ops = { > + .ndo_open = virtnet_bypass_open, > + .ndo_stop = virtnet_bypass_close, > + .ndo_start_xmit = virtnet_bypass_start_xmit, > + .ndo_select_queue = virtnet_bypass_select_queue, > + .ndo_get_stats64 = virtnet_bypass_get_stats, > + .ndo_change_mtu = virtnet_bypass_change_mtu, > + .ndo_set_rx_mode = virtnet_bypass_set_rx_mode, > + .ndo_validate_addr = eth_validate_addr, > + .ndo_features_check = passthru_features_check, > +}; > + > +static int > +virtnet_bypass_ethtool_get_link_ksettings(struct net_device *dev, > + struct ethtool_link_ksettings *cmd) > +{ > + struct virtnet_bypass_info *vbi = netdev_priv(dev); > + struct net_device *child_netdev; > + > + child_netdev = rtnl_dereference(vbi->active_netdev); > + if (!child_netdev || !virtnet_bypass_xmit_ready(child_netdev)) { > + child_netdev = rtnl_dereference(vbi->backup_netdev); > + if (!child_netdev || !virtnet_bypass_xmit_ready(child_netdev)) { > + cmd->base.duplex = DUPLEX_UNKNOWN; > + cmd->base.port = PORT_OTHER; > + cmd->base.speed = SPEED_UNKNOWN; > + > + return 0; > + } > + } > + > + return __ethtool_get_link_ksettings(child_netdev, cmd); > +} > + > +#define BYPASS_DRV_NAME "virtnet_bypass" > +#define BYPASS_DRV_VERSION "0.1" > + > +static void virtnet_bypass_ethtool_get_drvinfo(struct net_device *dev, > + struct ethtool_drvinfo *drvinfo) > +{ > + strlcpy(drvinfo->driver, BYPASS_DRV_NAME, sizeof(drvinfo->driver)); > + strlcpy(drvinfo->version, BYPASS_DRV_VERSION, sizeof(drvinfo->version)); > +} > + > +static const struct ethtool_ops virtnet_bypass_ethtool_ops = { > + .get_drvinfo = virtnet_bypass_ethtool_get_drvinfo, > + .get_link = ethtool_op_get_link, > + .get_link_ksettings = virtnet_bypass_ethtool_get_link_ksettings, > +}; > + > +static struct net_device *get_virtnet_bypass_bymac(struct net *net, > + const u8 *mac) > +{ > + struct net_device *dev; > + > + ASSERT_RTNL(); > + > + for_each_netdev(net, dev) { > + if (dev->netdev_ops != &virtnet_bypass_netdev_ops) > + continue; /* not a virtnet_bypass device */ > + > + if (ether_addr_equal(mac, dev->perm_addr)) > + return dev; > + } > + > + return NULL; > +} > + > +static struct net_device * > +get_virtnet_bypass_byref(struct net_device *child_netdev) > +{ > + struct net *net = dev_net(child_netdev); > + struct net_device *dev; > + > + ASSERT_RTNL(); > + > + for_each_netdev(net, dev) { > + struct virtnet_bypass_info *vbi; > + > + if (dev->netdev_ops != &virtnet_bypass_netdev_ops) > + continue; /* not a virtnet_bypass device */ > + > + vbi = netdev_priv(dev); > + > + if ((rtnl_dereference(vbi->active_netdev) == child_netdev) || > + (rtnl_dereference(vbi->backup_netdev) == child_netdev)) > + return dev; /* a match */ > + } > + > + return NULL; > +} > + > +/* Called when child dev is injecting data into network stack. > + * Change the associated network device from lower dev to virtio. > + * note: already called with rcu_read_lock > + */ > +static rx_handler_result_t virtnet_bypass_handle_frame(struct sk_buff **pskb) > +{ > + struct sk_buff *skb = *pskb; > + struct net_device *ndev = rcu_dereference(skb->dev->rx_handler_data); > + > + skb->dev = ndev; > + > + return RX_HANDLER_ANOTHER; > +} > + > +static int virtnet_bypass_register_child(struct net_device *child_netdev) > +{ > + struct virtnet_bypass_info *vbi; > + struct net_device *dev; > + bool backup; > + int ret; > + > + if (child_netdev->addr_len != ETH_ALEN) > + return NOTIFY_DONE; > + > + /* We will use the MAC address to locate the virtnet_bypass netdev > + * to associate with the child netdev. If we don't find a matching > + * bypass netdev, move on. > + */ > + dev = get_virtnet_bypass_bymac(dev_net(child_netdev), > + child_netdev->perm_addr); > + if (!dev) > + return NOTIFY_DONE; > + > + vbi = netdev_priv(dev); > + backup = (child_netdev->dev.parent == dev->dev.parent); > + if (backup ? rtnl_dereference(vbi->backup_netdev) : > + rtnl_dereference(vbi->active_netdev)) { > + netdev_info(dev, > + "%s attempting to join bypass dev when %s already present\n", > + child_netdev->name, backup ? "backup" : "active"); > + return NOTIFY_DONE; > + } > + > + /* Avoid non pci devices as active netdev */ > + if (!backup && (!child_netdev->dev.parent || > + !dev_is_pci(child_netdev->dev.parent))) > + return NOTIFY_DONE; > + > + ret = netdev_rx_handler_register(child_netdev, > + virtnet_bypass_handle_frame, dev); > + if (ret != 0) { > + netdev_err(child_netdev, > + "can not register bypass receive handler (err = %d)\n", > + ret); > + goto rx_handler_failed; > + } > + > + ret = netdev_upper_dev_link(child_netdev, dev, NULL); > + if (ret != 0) { > + netdev_err(child_netdev, > + "can not set master device %s (err = %d)\n", > + dev->name, ret); > + goto upper_link_failed; > + } > + > + child_netdev->flags |= IFF_SLAVE; > + > + if (netif_running(dev)) { > + ret = dev_open(child_netdev); > + if (ret && (ret != -EBUSY)) { > + netdev_err(dev, "Opening child %s failed ret:%d\n", > + child_netdev->name, ret); > + goto err_interface_up; > + } > + } > + > + /* Align MTU of child with master */ > + ret = dev_set_mtu(child_netdev, dev->mtu); > + if (ret) { > + netdev_err(dev, > + "unable to change mtu of %s to %u register failed\n", > + child_netdev->name, dev->mtu); > + goto err_set_mtu; > + } > + > + call_netdevice_notifiers(NETDEV_JOIN, child_netdev); > + > + netdev_info(dev, "registering %s\n", child_netdev->name); > + > + dev_hold(child_netdev); > + if (backup) { > + rcu_assign_pointer(vbi->backup_netdev, child_netdev); > + dev_get_stats(vbi->backup_netdev, &vbi->backup_stats); > + } else { > + rcu_assign_pointer(vbi->active_netdev, child_netdev); > + dev_get_stats(vbi->active_netdev, &vbi->active_stats); > + dev->min_mtu = child_netdev->min_mtu; > + dev->max_mtu = child_netdev->max_mtu; > + } > + > + return NOTIFY_OK; > + > +err_set_mtu: > + dev_close(child_netdev); > +err_interface_up: > + netdev_upper_dev_unlink(child_netdev, dev); > + child_netdev->flags &= ~IFF_SLAVE; > +upper_link_failed: > + netdev_rx_handler_unregister(child_netdev); > +rx_handler_failed: > + return NOTIFY_DONE; > +} > + > +static int virtnet_bypass_unregister_child(struct net_device *child_netdev) > +{ > + struct virtnet_bypass_info *vbi; > + struct net_device *dev, *backup; > + > + dev = get_virtnet_bypass_byref(child_netdev); > + if (!dev) > + return NOTIFY_DONE; > + > + vbi = netdev_priv(dev); > + > + netdev_info(dev, "unregistering %s\n", child_netdev->name); > + > + netdev_rx_handler_unregister(child_netdev); > + netdev_upper_dev_unlink(child_netdev, dev); > + child_netdev->flags &= ~IFF_SLAVE; > + > + if (child_netdev->dev.parent == dev->dev.parent) { > + RCU_INIT_POINTER(vbi->backup_netdev, NULL); > + } else { > + RCU_INIT_POINTER(vbi->active_netdev, NULL); > + backup = rtnl_dereference(vbi->backup_netdev); > + if (backup) { > + dev->min_mtu = backup->min_mtu; > + dev->max_mtu = backup->max_mtu; > + } > + } > + > + dev_put(child_netdev); > + > + return NOTIFY_OK; > +} > + > +static int virtnet_bypass_update_link(struct net_device *child_netdev) > +{ > + struct net_device *dev, *active, *backup; > + struct virtnet_bypass_info *vbi; > + > + dev = get_virtnet_bypass_byref(child_netdev); > + if (!dev || !netif_running(dev)) > + return NOTIFY_DONE; > + > + vbi = netdev_priv(dev); > + > + active = rtnl_dereference(vbi->active_netdev); > + backup = rtnl_dereference(vbi->backup_netdev); > + > + if ((active && virtnet_bypass_xmit_ready(active)) || > + (backup && virtnet_bypass_xmit_ready(backup))) { > + netif_carrier_on(dev); > + netif_tx_wake_all_queues(dev); > + } else { > + netif_carrier_off(dev); > + netif_tx_stop_all_queues(dev); > + } > + > + return NOTIFY_OK; > +} > + > +static int virtnet_bypass_event(struct notifier_block *this, > + unsigned long event, void *ptr) > +{ > + struct net_device *event_dev = netdev_notifier_info_to_dev(ptr); > + > + /* Skip our own events */ > + if (event_dev->netdev_ops == &virtnet_bypass_netdev_ops) > + return NOTIFY_DONE; > + > + /* Avoid non-Ethernet type devices */ > + if (event_dev->type != ARPHRD_ETHER) > + return NOTIFY_DONE; > + > + /* Avoid Vlan dev with same MAC registering as child dev */ > + if (is_vlan_dev(event_dev)) > + return NOTIFY_DONE; > + > + /* Avoid Bonding master dev with same MAC registering as child dev */ > + if ((event_dev->priv_flags & IFF_BONDING) && > + (event_dev->flags & IFF_MASTER)) > + return NOTIFY_DONE; > + > + switch (event) { > + case NETDEV_REGISTER: > + return virtnet_bypass_register_child(event_dev); > + case NETDEV_UNREGISTER: > + return virtnet_bypass_unregister_child(event_dev); > + case NETDEV_UP: > + case NETDEV_DOWN: > + case NETDEV_CHANGE: > + return virtnet_bypass_update_link(event_dev); > + default: > + return NOTIFY_DONE; > + } > +} > + > +static struct notifier_block virtnet_bypass_notifier = { > + .notifier_call = virtnet_bypass_event, > +}; > + > +static int virtnet_bypass_create(struct virtnet_info *vi) > +{ > + struct net_device *backup_netdev = vi->dev; > + struct device *dev = &vi->vdev->dev; > + struct net_device *bypass_netdev; > + int res; > + > + /* Alloc at least 2 queues, for now we are going with 16 assuming > + * that most devices being bonded won't have too many queues. > + */ > + bypass_netdev = alloc_etherdev_mq(sizeof(struct virtnet_bypass_info), > + 16); > + if (!bypass_netdev) { > + dev_err(dev, "Unable to allocate bypass_netdev!\n"); > + return -ENOMEM; > + } > + > + dev_net_set(bypass_netdev, dev_net(backup_netdev)); > + SET_NETDEV_DEV(bypass_netdev, dev); > + > + bypass_netdev->netdev_ops = &virtnet_bypass_netdev_ops; > + bypass_netdev->ethtool_ops = &virtnet_bypass_ethtool_ops; > + > + /* Initialize the device options */ > + bypass_netdev->flags |= IFF_MASTER; > + bypass_netdev->priv_flags |= IFF_BONDING | IFF_UNICAST_FLT | > + IFF_NO_QUEUE; > + bypass_netdev->priv_flags &= ~(IFF_XMIT_DST_RELEASE | > + IFF_TX_SKB_SHARING); > + > + /* don't acquire bypass netdev's netif_tx_lock when transmitting */ > + bypass_netdev->features |= NETIF_F_LLTX; > + > + /* Don't allow bypass devices to change network namespaces. */ > + bypass_netdev->features |= NETIF_F_NETNS_LOCAL; > + > + bypass_netdev->hw_features = NETIF_F_HW_CSUM | NETIF_F_SG | > + NETIF_F_FRAGLIST | NETIF_F_ALL_TSO | > + NETIF_F_HIGHDMA | NETIF_F_LRO; > + > + bypass_netdev->hw_features |= NETIF_F_GSO_ENCAP_ALL; > + bypass_netdev->features |= bypass_netdev->hw_features; > + > + /* For now treat bypass netdev as VLAN challenged since we > + * cannot assume VLAN functionality with a VF > + */ > + bypass_netdev->features |= NETIF_F_VLAN_CHALLENGED; > + > + memcpy(bypass_netdev->dev_addr, backup_netdev->dev_addr, > + bypass_netdev->addr_len); > + > + bypass_netdev->min_mtu = backup_netdev->min_mtu; > + bypass_netdev->max_mtu = backup_netdev->max_mtu; > + > + res = register_netdev(bypass_netdev); > + if (res < 0) { > + dev_err(dev, "Unable to register bypass_netdev!\n"); > + free_netdev(bypass_netdev); > + return res; > + } > + > + netif_carrier_off(bypass_netdev); > + > + vi->bypass_netdev = bypass_netdev; > + > + return 0; > +} > + > +static void virtnet_bypass_destroy(struct virtnet_info *vi) > +{ > + struct net_device *bypass_netdev = vi->bypass_netdev; > + struct virtnet_bypass_info *vbi; > + struct net_device *child_netdev; > + > + /* no device found, nothing to free */ > + if (!bypass_netdev) > + return; > + > + vbi = netdev_priv(bypass_netdev); > + > + netif_device_detach(bypass_netdev); > + > + rtnl_lock(); > + > + child_netdev = rtnl_dereference(vbi->active_netdev); > + if (child_netdev) > + virtnet_bypass_unregister_child(child_netdev); > + > + child_netdev = rtnl_dereference(vbi->backup_netdev); > + if (child_netdev) > + virtnet_bypass_unregister_child(child_netdev); > + > + unregister_netdevice(bypass_netdev); > + > + rtnl_unlock(); > + > + free_netdev(bypass_netdev); > +} > + > +/* END of functions supporting VIRTIO_NET_F_BACKUP feature. */ > + > static int virtnet_probe(struct virtio_device *vdev) > { > int i, err = -ENOMEM; > @@ -2797,10 +3466,15 @@ static int virtnet_probe(struct virtio_device *vdev) > > virtnet_init_settings(dev); > > + if (virtio_has_feature(vdev, VIRTIO_NET_F_BACKUP)) { > + if (virtnet_bypass_create(vi) != 0) > + goto free_vqs; > + } > + > err = register_netdev(dev); > if (err) { > pr_debug("virtio_net: registering device failed\n"); > - goto free_vqs; > + goto free_bypass; > } > > virtio_device_ready(vdev); > @@ -2837,6 +3511,8 @@ static int virtnet_probe(struct virtio_device *vdev) > vi->vdev->config->reset(vdev); > > unregister_netdev(dev); > +free_bypass: > + virtnet_bypass_destroy(vi); > free_vqs: > cancel_delayed_work_sync(&vi->refill); > free_receive_page_frags(vi); > @@ -2871,6 +3547,8 @@ static void virtnet_remove(struct virtio_device *vdev) > > unregister_netdev(vi->dev); > > + virtnet_bypass_destroy(vi); > + > remove_vq_common(vi); > > free_netdev(vi->dev); > @@ -2968,6 +3646,8 @@ static __init int virtio_net_driver_init(void) > ret = register_virtio_driver(&virtio_net_driver); > if (ret) > goto err_virtio; > + > + register_netdevice_notifier(&virtnet_bypass_notifier); > return 0; > err_virtio: > cpuhp_remove_multi_state(CPUHP_VIRT_NET_DEAD); > @@ -2980,6 +3660,7 @@ module_init(virtio_net_driver_init); > > static __exit void virtio_net_driver_exit(void) > { > + unregister_netdevice_notifier(&virtnet_bypass_notifier); > unregister_virtio_driver(&virtio_net_driver); > cpuhp_remove_multi_state(CPUHP_VIRT_NET_DEAD); > cpuhp_remove_multi_state(virtionet_online); > -- > 2.14.3 >
On Fri, Mar 02, 2018 at 12:44:56PM -0800, Siwei Liu wrote: > On Fri, Mar 2, 2018 at 12:10 PM, Michael S. Tsirkin <mst@redhat.com> wrote: > > On Fri, Mar 02, 2018 at 11:52:27AM -0800, Samudrala, Sridhar wrote: > >> > >> > >> On 3/2/2018 11:41 AM, Michael S. Tsirkin wrote: > >> > On Fri, Mar 02, 2018 at 07:26:25AM -0800, Alexander Duyck wrote: > >> > > The design limits things to a 1:1 relationship since we just have the > >> > > child and backup pointers, but I don't think I am seeing exception > >> > > handling to prevent us from overwriting the child pointers so there > >> > > may be a leak there. > >> > > > >> > > Thanks. > >> > > > >> > > - Alex > >> > In fact maintaining a list in that case would be nicer, and > >> > just use an arbitrary one. > >> > E.g. one can see how a user wanting to swap device 1 for device 2 > >> > might first add device 2 with same MAC then drop device 1. > >> > >> It should be possible to swap VF1 with VF2 by > >> 1.- enabling virtio link > >> 2.- unplugging VF1 > >> 3.- plugging VF2 > >> 4.- disabling virtio link > >> > > > > True, but it isn't hard to avoid breakage if user > > swapped steps 2 and 3. No need to make it more > > fragile that it has to be. > > The migration case, VF2 is associated with another PF on another > machine (destination), I wonder how it is possible. E.g. you want to remove the PF so you unplug the VF then add another VF of the same PF. > Even with local plugging of VF2 on the same PF, the MAC address > requirement (VF1's == VF2's) would fail the MAC address assignment on > VF2. > > -Siwei Why would it fail? These are separate cards. > > > > -- > > MST > > > > --------------------------------------------------------------------- > > To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org > > For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org > >
On Fri, Mar 02, 2018 at 12:56:21PM -0800, Samudrala, Sridhar wrote: > > > On 3/2/2018 12:44 PM, Siwei Liu wrote: > > On Fri, Mar 2, 2018 at 12:10 PM, Michael S. Tsirkin <mst@redhat.com> wrote: > > > On Fri, Mar 02, 2018 at 11:52:27AM -0800, Samudrala, Sridhar wrote: > > > > > > > > On 3/2/2018 11:41 AM, Michael S. Tsirkin wrote: > > > > > On Fri, Mar 02, 2018 at 07:26:25AM -0800, Alexander Duyck wrote: > > > > > > The design limits things to a 1:1 relationship since we just have the > > > > > > child and backup pointers, but I don't think I am seeing exception > > > > > > handling to prevent us from overwriting the child pointers so there > > > > > > may be a leak there. > > > > > > > > > > > > Thanks. > > > > > > > > > > > > - Alex > > > > > In fact maintaining a list in that case would be nicer, and > > > > > just use an arbitrary one. > > > > > E.g. one can see how a user wanting to swap device 1 for device 2 > > > > > might first add device 2 with same MAC then drop device 1. > > > > It should be possible to swap VF1 with VF2 by > > > > 1.- enabling virtio link > > > > 2.- unplugging VF1 > > > > 3.- plugging VF2 > > > > 4.- disabling virtio link > > > > > > > True, but it isn't hard to avoid breakage if user > > > swapped steps 2 and 3. No need to make it more > > > fragile that it has to be. > > The migration case, VF2 is associated with another PF on another > > machine (destination), I wonder how it is possible. > > > > Even with local plugging of VF2 on the same PF, the MAC address > > requirement (VF1's == VF2's) would fail the MAC address assignment on > > VF2. > > > > > I didn't include updating the MAC filter step in the above sequence. > So definitely plugging 2 VFs with the same MAC address will be an issue. If these are two separate PFs then I don't see why - each has its own MAC filter. > Here is the more complete sequence of steps that are required to > enable live migration. Replacing VF1 by VF2 is not about migration. It's to remove PF from host e.g. for service.
On Fri, Mar 02, 2018 at 01:11:56PM -0800, Siwei Liu wrote: > On Thu, Mar 1, 2018 at 12:08 PM, Sridhar Samudrala > <sridhar.samudrala@intel.com> wrote: > > This patch enables virtio_net to switch over to a VF datapath when a VF > > netdev is present with the same MAC address. It allows live migration > > of a VM with a direct attached VF without the need to setup a bond/team > > between a VF and virtio net device in the guest. > > > > The hypervisor needs to enable only one datapath at any time so that > > packets don't get looped back to the VM over the other datapath. When a VF > > is plugged, the virtio datapath link state can be marked as down. The > > hypervisor needs to unplug the VF device from the guest on the source host > > and reset the MAC filter of the VF to initiate failover of datapath to > > virtio before starting the migration. After the migration is completed, > > the destination hypervisor sets the MAC filter on the VF and plugs it back > > to the guest to switch over to VF datapath. > > > > When BACKUP feature is enabled, an additional netdev(bypass netdev) is > > created that acts as a master device and tracks the state of the 2 lower > > netdevs. The original virtio_net netdev is marked as 'backup' netdev and a > > passthru device with the same MAC is registered as 'active' netdev. > > > > This patch is based on the discussion initiated by Jesse on this thread. > > https://marc.info/?l=linux-virtualization&m=151189725224231&w=2 > > > > Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com> > > Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> > > Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com> > > --- > > drivers/net/virtio_net.c | 683 ++++++++++++++++++++++++++++++++++++++++++++++- > > 1 file changed, 682 insertions(+), 1 deletion(-) > > > > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c > > index bcd13fe906ca..f2860d86c952 100644 > > --- a/drivers/net/virtio_net.c > > +++ b/drivers/net/virtio_net.c > > @@ -30,6 +30,8 @@ > > #include <linux/cpu.h> > > #include <linux/average.h> > > #include <linux/filter.h> > > +#include <linux/netdevice.h> > > +#include <linux/pci.h> > > #include <net/route.h> > > #include <net/xdp.h> > > > > @@ -206,6 +208,9 @@ struct virtnet_info { > > u32 speed; > > > > unsigned long guest_offloads; > > + > > + /* upper netdev created when BACKUP feature enabled */ > > + struct net_device *bypass_netdev; > > }; > > > > struct padded_vnet_hdr { > > @@ -2236,6 +2241,22 @@ static int virtnet_xdp(struct net_device *dev, struct netdev_bpf *xdp) > > } > > } > > > > +static int virtnet_get_phys_port_name(struct net_device *dev, char *buf, > > + size_t len) > > +{ > > + struct virtnet_info *vi = netdev_priv(dev); > > + int ret; > > + > > + if (!virtio_has_feature(vi->vdev, VIRTIO_NET_F_BACKUP)) > > + return -EOPNOTSUPP; > > + > > + ret = snprintf(buf, len, "_bkup"); > > + if (ret >= len) > > + return -EOPNOTSUPP; > > + > > + return 0; > > +} > > + > > What if the systemd/udevd is not new enough to enforce the > n<phys_port_name> naming? Would virtio_bypass get a different name > than the original virtio_net? You mean people using ethX names? Any hardware config change breaks these, I don't think that can be helped. > Should we detect this earlier and fall > back to legacy mode without creating the bypass netdev and ensalving > the VF? I don't think we can do this with existing kernel/userspace APIs.
On Fri, Mar 2, 2018 at 1:31 PM, Michael S. Tsirkin <mst@redhat.com> wrote: > On Fri, Mar 02, 2018 at 12:44:56PM -0800, Siwei Liu wrote: >> On Fri, Mar 2, 2018 at 12:10 PM, Michael S. Tsirkin <mst@redhat.com> wrote: >> > On Fri, Mar 02, 2018 at 11:52:27AM -0800, Samudrala, Sridhar wrote: >> >> >> >> >> >> On 3/2/2018 11:41 AM, Michael S. Tsirkin wrote: >> >> > On Fri, Mar 02, 2018 at 07:26:25AM -0800, Alexander Duyck wrote: >> >> > > The design limits things to a 1:1 relationship since we just have the >> >> > > child and backup pointers, but I don't think I am seeing exception >> >> > > handling to prevent us from overwriting the child pointers so there >> >> > > may be a leak there. >> >> > > >> >> > > Thanks. >> >> > > >> >> > > - Alex >> >> > In fact maintaining a list in that case would be nicer, and >> >> > just use an arbitrary one. >> >> > E.g. one can see how a user wanting to swap device 1 for device 2 >> >> > might first add device 2 with same MAC then drop device 1. >> >> >> >> It should be possible to swap VF1 with VF2 by >> >> 1.- enabling virtio link >> >> 2.- unplugging VF1 >> >> 3.- plugging VF2 >> >> 4.- disabling virtio link >> >> >> > >> > True, but it isn't hard to avoid breakage if user >> > swapped steps 2 and 3. No need to make it more >> > fragile that it has to be. >> >> The migration case, VF2 is associated with another PF on another >> machine (destination), I wonder how it is possible. > > E.g. you want to remove the PF so you unplug the VF > then add another VF of the same PF. > >> Even with local plugging of VF2 on the same PF, the MAC address >> requirement (VF1's == VF2's) would fail the MAC address assignment on >> VF2. >> >> -Siwei > > Why would it fail? These are separate cards. OK. I realized that you may talk about assigning a VF on a diffferent PF (VF1 on PF1 while VF2 on PF2). And we might assign a pass-through device rather than a VF. Yes, it's indeed possible that may happen but I take it as a further step down (another patch maybe) as it would involve changes to notify the network with gratuituious ARP and/or unsolicited ND advertisement of the MAC address association with the new port. -Siwei > >> > >> > -- >> > MST >> > >> > --------------------------------------------------------------------- >> > To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org >> > For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org >> >
On 3/2/2018 1:11 PM, Siwei Liu wrote: > On Thu, Mar 1, 2018 at 12:08 PM, Sridhar Samudrala > <sridhar.samudrala@intel.com> wrote: >> This patch enables virtio_net to switch over to a VF datapath when a VF >> netdev is present with the same MAC address. It allows live migration >> of a VM with a direct attached VF without the need to setup a bond/team >> between a VF and virtio net device in the guest. >> >> The hypervisor needs to enable only one datapath at any time so that >> packets don't get looped back to the VM over the other datapath. When a VF >> is plugged, the virtio datapath link state can be marked as down. The >> hypervisor needs to unplug the VF device from the guest on the source host >> and reset the MAC filter of the VF to initiate failover of datapath to >> virtio before starting the migration. After the migration is completed, >> the destination hypervisor sets the MAC filter on the VF and plugs it back >> to the guest to switch over to VF datapath. >> >> When BACKUP feature is enabled, an additional netdev(bypass netdev) is >> created that acts as a master device and tracks the state of the 2 lower >> netdevs. The original virtio_net netdev is marked as 'backup' netdev and a >> passthru device with the same MAC is registered as 'active' netdev. >> >> This patch is based on the discussion initiated by Jesse on this thread. >> https://marc.info/?l=linux-virtualization&m=151189725224231&w=2 >> >> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com> >> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> >> Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com> >> --- >> drivers/net/virtio_net.c | 683 ++++++++++++++++++++++++++++++++++++++++++++++- >> 1 file changed, 682 insertions(+), 1 deletion(-) >> >> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c >> index bcd13fe906ca..f2860d86c952 100644 >> --- a/drivers/net/virtio_net.c >> +++ b/drivers/net/virtio_net.c >> @@ -30,6 +30,8 @@ >> #include <linux/cpu.h> >> #include <linux/average.h> >> #include <linux/filter.h> >> +#include <linux/netdevice.h> >> +#include <linux/pci.h> >> #include <net/route.h> >> #include <net/xdp.h> >> >> @@ -206,6 +208,9 @@ struct virtnet_info { >> u32 speed; >> >> unsigned long guest_offloads; >> + >> + /* upper netdev created when BACKUP feature enabled */ >> + struct net_device *bypass_netdev; >> }; >> >> struct padded_vnet_hdr { >> @@ -2236,6 +2241,22 @@ static int virtnet_xdp(struct net_device *dev, struct netdev_bpf *xdp) >> } >> } >> >> +static int virtnet_get_phys_port_name(struct net_device *dev, char *buf, >> + size_t len) >> +{ >> + struct virtnet_info *vi = netdev_priv(dev); >> + int ret; >> + >> + if (!virtio_has_feature(vi->vdev, VIRTIO_NET_F_BACKUP)) >> + return -EOPNOTSUPP; >> + >> + ret = snprintf(buf, len, "_bkup"); >> + if (ret >= len) >> + return -EOPNOTSUPP; >> + >> + return 0; >> +} >> + > What if the systemd/udevd is not new enough to enforce the > n<phys_port_name> naming? Would virtio_bypass get a different name > than the original virtio_net? Should we detect this earlier and fall > back to legacy mode without creating the bypass netdev and ensalving > the VF? If udev doesn't support renaming of the devices, then the upper bypass device should get the original name and the lower virtio netdev will get the next name. Hopefully the distros updating the kernel will also move to the new systemd/udev. > >> static const struct net_device_ops virtnet_netdev = { >> .ndo_open = virtnet_open, >> .ndo_stop = virtnet_close, >> @@ -2253,6 +2274,7 @@ static const struct net_device_ops virtnet_netdev = { >> .ndo_xdp_xmit = virtnet_xdp_xmit, >> .ndo_xdp_flush = virtnet_xdp_flush, >> .ndo_features_check = passthru_features_check, >> + .ndo_get_phys_port_name = virtnet_get_phys_port_name, >> }; >> >> static void virtnet_config_changed_work(struct work_struct *work) >> @@ -2647,6 +2669,653 @@ static int virtnet_validate(struct virtio_device *vdev) >> return 0; >> } >> >> +/* START of functions supporting VIRTIO_NET_F_BACKUP feature. >> + * When BACKUP feature is enabled, an additional netdev(bypass netdev) >> + * is created that acts as a master device and tracks the state of the >> + * 2 lower netdevs. The original virtio_net netdev is registered as >> + * 'backup' netdev and a passthru device with the same MAC is registered >> + * as 'active' netdev. >> + */ >> + >> +/* bypass state maintained when BACKUP feature is enabled */ >> +struct virtnet_bypass_info { >> + /* passthru netdev with same MAC */ >> + struct net_device __rcu *active_netdev; >> + >> + /* virtio_net netdev */ >> + struct net_device __rcu *backup_netdev; >> + >> + /* active netdev stats */ >> + struct rtnl_link_stats64 active_stats; >> + >> + /* backup netdev stats */ >> + struct rtnl_link_stats64 backup_stats; >> + >> + /* aggregated stats */ >> + struct rtnl_link_stats64 bypass_stats; >> + >> + /* spinlock while updating stats */ >> + spinlock_t stats_lock; >> +}; >> + >> +static void virtnet_bypass_child_open(struct net_device *dev, >> + struct net_device *child_netdev) >> +{ >> + int err = dev_open(child_netdev); >> + >> + if (err) >> + netdev_warn(dev, "unable to open slave: %s: %d\n", >> + child_netdev->name, err); >> +} >> + >> +static int virtnet_bypass_open(struct net_device *dev) >> +{ >> + struct virtnet_bypass_info *vbi = netdev_priv(dev); >> + struct net_device *child_netdev; >> + >> + netif_carrier_off(dev); >> + netif_tx_wake_all_queues(dev); >> + >> + child_netdev = rtnl_dereference(vbi->active_netdev); >> + if (child_netdev) >> + virtnet_bypass_child_open(dev, child_netdev); >> + >> + child_netdev = rtnl_dereference(vbi->backup_netdev); >> + if (child_netdev) >> + virtnet_bypass_child_open(dev, child_netdev); >> + >> + return 0; >> +} >> + >> +static int virtnet_bypass_close(struct net_device *dev) >> +{ >> + struct virtnet_bypass_info *vi = netdev_priv(dev); >> + struct net_device *child_netdev; >> + >> + netif_tx_disable(dev); >> + >> + child_netdev = rtnl_dereference(vi->active_netdev); >> + if (child_netdev) >> + dev_close(child_netdev); >> + >> + child_netdev = rtnl_dereference(vi->backup_netdev); >> + if (child_netdev) >> + dev_close(child_netdev); >> + >> + return 0; >> +} >> + >> +static netdev_tx_t virtnet_bypass_drop_xmit(struct sk_buff *skb, >> + struct net_device *dev) >> +{ >> + atomic_long_inc(&dev->tx_dropped); >> + dev_kfree_skb_any(skb); >> + return NETDEV_TX_OK; >> +} >> + >> +static bool virtnet_bypass_xmit_ready(struct net_device *dev) >> +{ >> + return netif_running(dev) && netif_carrier_ok(dev); >> +} >> + >> +static netdev_tx_t virtnet_bypass_start_xmit(struct sk_buff *skb, >> + struct net_device *dev) >> +{ >> + struct virtnet_bypass_info *vbi = netdev_priv(dev); >> + struct net_device *xmit_dev; >> + >> + /* Try xmit via active netdev followed by backup netdev */ >> + xmit_dev = rcu_dereference_bh(vbi->active_netdev); >> + if (!xmit_dev || !virtnet_bypass_xmit_ready(xmit_dev)) { >> + xmit_dev = rcu_dereference_bh(vbi->backup_netdev); >> + if (!xmit_dev || !virtnet_bypass_xmit_ready(xmit_dev)) >> + return virtnet_bypass_drop_xmit(skb, dev); >> + } >> + >> + skb->dev = xmit_dev; >> + skb->queue_mapping = qdisc_skb_cb(skb)->slave_dev_queue_mapping; >> + >> + return dev_queue_xmit(skb); >> +} >> + >> +static u16 virtnet_bypass_select_queue(struct net_device *dev, >> + struct sk_buff *skb, void *accel_priv, >> + select_queue_fallback_t fallback) >> +{ >> + /* This helper function exists to help dev_pick_tx get the correct >> + * destination queue. Using a helper function skips a call to >> + * skb_tx_hash and will put the skbs in the queue we expect on their >> + * way down to the bonding driver. >> + */ >> + u16 txq = skb_rx_queue_recorded(skb) ? skb_get_rx_queue(skb) : 0; >> + >> + /* Save the original txq to restore before passing to the driver */ >> + qdisc_skb_cb(skb)->slave_dev_queue_mapping = skb->queue_mapping; >> + >> + if (unlikely(txq >= dev->real_num_tx_queues)) { >> + do { >> + txq -= dev->real_num_tx_queues; >> + } while (txq >= dev->real_num_tx_queues); >> + } >> + >> + return txq; >> +} >> + >> +/* fold stats, assuming all rtnl_link_stats64 fields are u64, but >> + * that some drivers can provide 32bit values only. >> + */ >> +static void virtnet_bypass_fold_stats(struct rtnl_link_stats64 *_res, >> + const struct rtnl_link_stats64 *_new, >> + const struct rtnl_link_stats64 *_old) >> +{ >> + const u64 *new = (const u64 *)_new; >> + const u64 *old = (const u64 *)_old; >> + u64 *res = (u64 *)_res; >> + int i; >> + >> + for (i = 0; i < sizeof(*_res) / sizeof(u64); i++) { >> + u64 nv = new[i]; >> + u64 ov = old[i]; >> + s64 delta = nv - ov; >> + >> + /* detects if this particular field is 32bit only */ >> + if (((nv | ov) >> 32) == 0) >> + delta = (s64)(s32)((u32)nv - (u32)ov); >> + >> + /* filter anomalies, some drivers reset their stats >> + * at down/up events. >> + */ >> + if (delta > 0) >> + res[i] += delta; >> + } >> +} >> + >> +static void virtnet_bypass_get_stats(struct net_device *dev, >> + struct rtnl_link_stats64 *stats) >> +{ >> + struct virtnet_bypass_info *vbi = netdev_priv(dev); >> + const struct rtnl_link_stats64 *new; >> + struct rtnl_link_stats64 temp; >> + struct net_device *child_netdev; >> + >> + spin_lock(&vbi->stats_lock); >> + memcpy(stats, &vbi->bypass_stats, sizeof(*stats)); >> + >> + rcu_read_lock(); >> + >> + child_netdev = rcu_dereference(vbi->active_netdev); >> + if (child_netdev) { >> + new = dev_get_stats(child_netdev, &temp); >> + virtnet_bypass_fold_stats(stats, new, &vbi->active_stats); >> + memcpy(&vbi->active_stats, new, sizeof(*new)); >> + } >> + >> + child_netdev = rcu_dereference(vbi->backup_netdev); >> + if (child_netdev) { >> + new = dev_get_stats(child_netdev, &temp); >> + virtnet_bypass_fold_stats(stats, new, &vbi->backup_stats); >> + memcpy(&vbi->backup_stats, new, sizeof(*new)); >> + } >> + >> + rcu_read_unlock(); >> + >> + memcpy(&vbi->bypass_stats, stats, sizeof(*stats)); >> + spin_unlock(&vbi->stats_lock); >> +} >> + >> +static int virtnet_bypass_change_mtu(struct net_device *dev, int new_mtu) >> +{ >> + struct virtnet_bypass_info *vbi = netdev_priv(dev); >> + struct net_device *child_netdev; >> + int ret = 0; >> + >> + child_netdev = rcu_dereference(vbi->active_netdev); >> + if (child_netdev) { >> + ret = dev_set_mtu(child_netdev, new_mtu); >> + if (ret) >> + return ret; >> + } >> + >> + child_netdev = rcu_dereference(vbi->backup_netdev); >> + if (child_netdev) { >> + ret = dev_set_mtu(child_netdev, new_mtu); >> + if (ret) >> + netdev_err(child_netdev, >> + "Unexpected failure to set mtu to %d\n", >> + new_mtu); > Shouldn't we unwind the MTU config on active_netdev if failing to set > it on backup_netdev? backup_netdev is virtio_net and it is not expected to fail as it supports MTU all the way upto 64K. > >> + } >> + >> + dev->mtu = new_mtu; >> + return 0; >> +} >> + >> +static void virtnet_bypass_set_rx_mode(struct net_device *dev) >> +{ >> + struct virtnet_bypass_info *vbi = netdev_priv(dev); >> + struct net_device *child_netdev; >> + >> + rcu_read_lock(); >> + >> + child_netdev = rcu_dereference(vbi->active_netdev); >> + if (child_netdev) { >> + dev_uc_sync_multiple(child_netdev, dev); >> + dev_mc_sync_multiple(child_netdev, dev); >> + } >> + >> + child_netdev = rcu_dereference(vbi->backup_netdev); >> + if (child_netdev) { >> + dev_uc_sync_multiple(child_netdev, dev); >> + dev_mc_sync_multiple(child_netdev, dev); >> + } >> + > If VF comes up later than when set_rx_mode is called where do you sync > up the unicast and multicast address? I could include sync up of unicast and multicast list along with MTU when the child is registered. > > The rest looks good. > > Thanks, > -Siwei > >
On Fri, Mar 2, 2018 at 1:36 PM, Michael S. Tsirkin <mst@redhat.com> wrote: > On Fri, Mar 02, 2018 at 01:11:56PM -0800, Siwei Liu wrote: >> On Thu, Mar 1, 2018 at 12:08 PM, Sridhar Samudrala >> <sridhar.samudrala@intel.com> wrote: >> > This patch enables virtio_net to switch over to a VF datapath when a VF >> > netdev is present with the same MAC address. It allows live migration >> > of a VM with a direct attached VF without the need to setup a bond/team >> > between a VF and virtio net device in the guest. >> > >> > The hypervisor needs to enable only one datapath at any time so that >> > packets don't get looped back to the VM over the other datapath. When a VF >> > is plugged, the virtio datapath link state can be marked as down. The >> > hypervisor needs to unplug the VF device from the guest on the source host >> > and reset the MAC filter of the VF to initiate failover of datapath to >> > virtio before starting the migration. After the migration is completed, >> > the destination hypervisor sets the MAC filter on the VF and plugs it back >> > to the guest to switch over to VF datapath. >> > >> > When BACKUP feature is enabled, an additional netdev(bypass netdev) is >> > created that acts as a master device and tracks the state of the 2 lower >> > netdevs. The original virtio_net netdev is marked as 'backup' netdev and a >> > passthru device with the same MAC is registered as 'active' netdev. >> > >> > This patch is based on the discussion initiated by Jesse on this thread. >> > https://marc.info/?l=linux-virtualization&m=151189725224231&w=2 >> > >> > Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com> >> > Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> >> > Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com> >> > --- >> > drivers/net/virtio_net.c | 683 ++++++++++++++++++++++++++++++++++++++++++++++- >> > 1 file changed, 682 insertions(+), 1 deletion(-) >> > >> > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c >> > index bcd13fe906ca..f2860d86c952 100644 >> > --- a/drivers/net/virtio_net.c >> > +++ b/drivers/net/virtio_net.c >> > @@ -30,6 +30,8 @@ >> > #include <linux/cpu.h> >> > #include <linux/average.h> >> > #include <linux/filter.h> >> > +#include <linux/netdevice.h> >> > +#include <linux/pci.h> >> > #include <net/route.h> >> > #include <net/xdp.h> >> > >> > @@ -206,6 +208,9 @@ struct virtnet_info { >> > u32 speed; >> > >> > unsigned long guest_offloads; >> > + >> > + /* upper netdev created when BACKUP feature enabled */ >> > + struct net_device *bypass_netdev; >> > }; >> > >> > struct padded_vnet_hdr { >> > @@ -2236,6 +2241,22 @@ static int virtnet_xdp(struct net_device *dev, struct netdev_bpf *xdp) >> > } >> > } >> > >> > +static int virtnet_get_phys_port_name(struct net_device *dev, char *buf, >> > + size_t len) >> > +{ >> > + struct virtnet_info *vi = netdev_priv(dev); >> > + int ret; >> > + >> > + if (!virtio_has_feature(vi->vdev, VIRTIO_NET_F_BACKUP)) >> > + return -EOPNOTSUPP; >> > + >> > + ret = snprintf(buf, len, "_bkup"); >> > + if (ret >= len) >> > + return -EOPNOTSUPP; >> > + >> > + return 0; >> > +} >> > + >> >> What if the systemd/udevd is not new enough to enforce the >> n<phys_port_name> naming? Would virtio_bypass get a different name >> than the original virtio_net? > > You mean people using ethX names? Any hardware config change breaks > these, I don't think that can be helped. I don't like the way to rely on .ndo_get_phys_port_name - it's fragile and it does not completely solve the problem it tries to address. Imagine what can end up with if getting an old udevd, or users already have exsiting explicit udev rules around phys_port_name. It does not give you the an ack in saying "yes, I know you're the bypass and you're the backup, please continue and I will give you both correct names", or an unacknowlegment saying "no, I don't know what these extra interfaces are, please go back and leave the VF device alone". We need new udev API for both feature negotiation and naming, or may even completely hide the lower interfaces. > >> Should we detect this earlier and fall >> back to legacy mode without creating the bypass netdev and ensalving >> the VF? > > I don't think we can do this with existing kernel/userspace APIs. That's why I ever said to make udev aware of this new type of combined device instead of doing hacks here and there around. Regards, -Siwei > > -- > MST
On Fri, Mar 2, 2018 at 3:12 PM, Samudrala, Sridhar <sridhar.samudrala@intel.com> wrote: > On 3/2/2018 1:11 PM, Siwei Liu wrote: >> >> On Thu, Mar 1, 2018 at 12:08 PM, Sridhar Samudrala >> <sridhar.samudrala@intel.com> wrote: >>> >>> This patch enables virtio_net to switch over to a VF datapath when a VF >>> netdev is present with the same MAC address. It allows live migration >>> of a VM with a direct attached VF without the need to setup a bond/team >>> between a VF and virtio net device in the guest. >>> >>> The hypervisor needs to enable only one datapath at any time so that >>> packets don't get looped back to the VM over the other datapath. When a >>> VF >>> is plugged, the virtio datapath link state can be marked as down. The >>> hypervisor needs to unplug the VF device from the guest on the source >>> host >>> and reset the MAC filter of the VF to initiate failover of datapath to >>> virtio before starting the migration. After the migration is completed, >>> the destination hypervisor sets the MAC filter on the VF and plugs it >>> back >>> to the guest to switch over to VF datapath. >>> >>> When BACKUP feature is enabled, an additional netdev(bypass netdev) is >>> created that acts as a master device and tracks the state of the 2 lower >>> netdevs. The original virtio_net netdev is marked as 'backup' netdev and >>> a >>> passthru device with the same MAC is registered as 'active' netdev. >>> >>> This patch is based on the discussion initiated by Jesse on this thread. >>> https://marc.info/?l=linux-virtualization&m=151189725224231&w=2 >>> >>> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com> >>> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> >>> Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com> >>> --- >>> drivers/net/virtio_net.c | 683 >>> ++++++++++++++++++++++++++++++++++++++++++++++- >>> 1 file changed, 682 insertions(+), 1 deletion(-) >>> >>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c >>> index bcd13fe906ca..f2860d86c952 100644 >>> --- a/drivers/net/virtio_net.c >>> +++ b/drivers/net/virtio_net.c >>> @@ -30,6 +30,8 @@ >>> #include <linux/cpu.h> >>> #include <linux/average.h> >>> #include <linux/filter.h> >>> +#include <linux/netdevice.h> >>> +#include <linux/pci.h> >>> #include <net/route.h> >>> #include <net/xdp.h> >>> >>> @@ -206,6 +208,9 @@ struct virtnet_info { >>> u32 speed; >>> >>> unsigned long guest_offloads; >>> + >>> + /* upper netdev created when BACKUP feature enabled */ >>> + struct net_device *bypass_netdev; >>> }; >>> >>> struct padded_vnet_hdr { >>> @@ -2236,6 +2241,22 @@ static int virtnet_xdp(struct net_device *dev, >>> struct netdev_bpf *xdp) >>> } >>> } >>> >>> +static int virtnet_get_phys_port_name(struct net_device *dev, char *buf, >>> + size_t len) >>> +{ >>> + struct virtnet_info *vi = netdev_priv(dev); >>> + int ret; >>> + >>> + if (!virtio_has_feature(vi->vdev, VIRTIO_NET_F_BACKUP)) >>> + return -EOPNOTSUPP; >>> + >>> + ret = snprintf(buf, len, "_bkup"); >>> + if (ret >= len) >>> + return -EOPNOTSUPP; >>> + >>> + return 0; >>> +} >>> + >> >> What if the systemd/udevd is not new enough to enforce the >> n<phys_port_name> naming? Would virtio_bypass get a different name >> than the original virtio_net? Should we detect this earlier and fall >> back to legacy mode without creating the bypass netdev and ensalving >> the VF? > > > If udev doesn't support renaming of the devices, then the upper bypass > device > should get the original name and the lower virtio netdev will get the next > name. If you got two virtio-net's (say e.g. eth0 and eth1) before the update, the virtio-bypass interface on the first virtio gets eth0 and the backup netdev would get eth1? Then the IP address originally on eth1 gets configurd on the backup interface? > Hopefully the distros updating the kernel will also move to the new > systemd/udev. This is not reliable. I'd opt for a new udev API for this, and fall back to what it was if don't see fit. > > > >> >>> static const struct net_device_ops virtnet_netdev = { >>> .ndo_open = virtnet_open, >>> .ndo_stop = virtnet_close, >>> @@ -2253,6 +2274,7 @@ static const struct net_device_ops virtnet_netdev = >>> { >>> .ndo_xdp_xmit = virtnet_xdp_xmit, >>> .ndo_xdp_flush = virtnet_xdp_flush, >>> .ndo_features_check = passthru_features_check, >>> + .ndo_get_phys_port_name = virtnet_get_phys_port_name, >>> }; >>> >>> static void virtnet_config_changed_work(struct work_struct *work) >>> @@ -2647,6 +2669,653 @@ static int virtnet_validate(struct virtio_device >>> *vdev) >>> return 0; >>> } >>> >>> +/* START of functions supporting VIRTIO_NET_F_BACKUP feature. >>> + * When BACKUP feature is enabled, an additional netdev(bypass netdev) >>> + * is created that acts as a master device and tracks the state of the >>> + * 2 lower netdevs. The original virtio_net netdev is registered as >>> + * 'backup' netdev and a passthru device with the same MAC is registered >>> + * as 'active' netdev. >>> + */ >>> + >>> +/* bypass state maintained when BACKUP feature is enabled */ >>> +struct virtnet_bypass_info { >>> + /* passthru netdev with same MAC */ >>> + struct net_device __rcu *active_netdev; >>> + >>> + /* virtio_net netdev */ >>> + struct net_device __rcu *backup_netdev; >>> + >>> + /* active netdev stats */ >>> + struct rtnl_link_stats64 active_stats; >>> + >>> + /* backup netdev stats */ >>> + struct rtnl_link_stats64 backup_stats; >>> + >>> + /* aggregated stats */ >>> + struct rtnl_link_stats64 bypass_stats; >>> + >>> + /* spinlock while updating stats */ >>> + spinlock_t stats_lock; >>> +}; >>> + >>> +static void virtnet_bypass_child_open(struct net_device *dev, >>> + struct net_device *child_netdev) >>> +{ >>> + int err = dev_open(child_netdev); >>> + >>> + if (err) >>> + netdev_warn(dev, "unable to open slave: %s: %d\n", >>> + child_netdev->name, err); >>> +} >>> + >>> +static int virtnet_bypass_open(struct net_device *dev) >>> +{ >>> + struct virtnet_bypass_info *vbi = netdev_priv(dev); >>> + struct net_device *child_netdev; >>> + >>> + netif_carrier_off(dev); >>> + netif_tx_wake_all_queues(dev); >>> + >>> + child_netdev = rtnl_dereference(vbi->active_netdev); >>> + if (child_netdev) >>> + virtnet_bypass_child_open(dev, child_netdev); >>> + >>> + child_netdev = rtnl_dereference(vbi->backup_netdev); >>> + if (child_netdev) >>> + virtnet_bypass_child_open(dev, child_netdev); >>> + >>> + return 0; >>> +} >>> + >>> +static int virtnet_bypass_close(struct net_device *dev) >>> +{ >>> + struct virtnet_bypass_info *vi = netdev_priv(dev); >>> + struct net_device *child_netdev; >>> + >>> + netif_tx_disable(dev); >>> + >>> + child_netdev = rtnl_dereference(vi->active_netdev); >>> + if (child_netdev) >>> + dev_close(child_netdev); >>> + >>> + child_netdev = rtnl_dereference(vi->backup_netdev); >>> + if (child_netdev) >>> + dev_close(child_netdev); >>> + >>> + return 0; >>> +} >>> + >>> +static netdev_tx_t virtnet_bypass_drop_xmit(struct sk_buff *skb, >>> + struct net_device *dev) >>> +{ >>> + atomic_long_inc(&dev->tx_dropped); >>> + dev_kfree_skb_any(skb); >>> + return NETDEV_TX_OK; >>> +} >>> + >>> +static bool virtnet_bypass_xmit_ready(struct net_device *dev) >>> +{ >>> + return netif_running(dev) && netif_carrier_ok(dev); >>> +} >>> + >>> +static netdev_tx_t virtnet_bypass_start_xmit(struct sk_buff *skb, >>> + struct net_device *dev) >>> +{ >>> + struct virtnet_bypass_info *vbi = netdev_priv(dev); >>> + struct net_device *xmit_dev; >>> + >>> + /* Try xmit via active netdev followed by backup netdev */ >>> + xmit_dev = rcu_dereference_bh(vbi->active_netdev); >>> + if (!xmit_dev || !virtnet_bypass_xmit_ready(xmit_dev)) { >>> + xmit_dev = rcu_dereference_bh(vbi->backup_netdev); >>> + if (!xmit_dev || !virtnet_bypass_xmit_ready(xmit_dev)) >>> + return virtnet_bypass_drop_xmit(skb, dev); >>> + } >>> + >>> + skb->dev = xmit_dev; >>> + skb->queue_mapping = qdisc_skb_cb(skb)->slave_dev_queue_mapping; >>> + >>> + return dev_queue_xmit(skb); >>> +} >>> + >>> +static u16 virtnet_bypass_select_queue(struct net_device *dev, >>> + struct sk_buff *skb, void >>> *accel_priv, >>> + select_queue_fallback_t fallback) >>> +{ >>> + /* This helper function exists to help dev_pick_tx get the >>> correct >>> + * destination queue. Using a helper function skips a call to >>> + * skb_tx_hash and will put the skbs in the queue we expect on >>> their >>> + * way down to the bonding driver. >>> + */ >>> + u16 txq = skb_rx_queue_recorded(skb) ? skb_get_rx_queue(skb) : 0; >>> + >>> + /* Save the original txq to restore before passing to the driver >>> */ >>> + qdisc_skb_cb(skb)->slave_dev_queue_mapping = skb->queue_mapping; >>> + >>> + if (unlikely(txq >= dev->real_num_tx_queues)) { >>> + do { >>> + txq -= dev->real_num_tx_queues; >>> + } while (txq >= dev->real_num_tx_queues); >>> + } >>> + >>> + return txq; >>> +} >>> + >>> +/* fold stats, assuming all rtnl_link_stats64 fields are u64, but >>> + * that some drivers can provide 32bit values only. >>> + */ >>> +static void virtnet_bypass_fold_stats(struct rtnl_link_stats64 *_res, >>> + const struct rtnl_link_stats64 >>> *_new, >>> + const struct rtnl_link_stats64 >>> *_old) >>> +{ >>> + const u64 *new = (const u64 *)_new; >>> + const u64 *old = (const u64 *)_old; >>> + u64 *res = (u64 *)_res; >>> + int i; >>> + >>> + for (i = 0; i < sizeof(*_res) / sizeof(u64); i++) { >>> + u64 nv = new[i]; >>> + u64 ov = old[i]; >>> + s64 delta = nv - ov; >>> + >>> + /* detects if this particular field is 32bit only */ >>> + if (((nv | ov) >> 32) == 0) >>> + delta = (s64)(s32)((u32)nv - (u32)ov); >>> + >>> + /* filter anomalies, some drivers reset their stats >>> + * at down/up events. >>> + */ >>> + if (delta > 0) >>> + res[i] += delta; >>> + } >>> +} >>> + >>> +static void virtnet_bypass_get_stats(struct net_device *dev, >>> + struct rtnl_link_stats64 *stats) >>> +{ >>> + struct virtnet_bypass_info *vbi = netdev_priv(dev); >>> + const struct rtnl_link_stats64 *new; >>> + struct rtnl_link_stats64 temp; >>> + struct net_device *child_netdev; >>> + >>> + spin_lock(&vbi->stats_lock); >>> + memcpy(stats, &vbi->bypass_stats, sizeof(*stats)); >>> + >>> + rcu_read_lock(); >>> + >>> + child_netdev = rcu_dereference(vbi->active_netdev); >>> + if (child_netdev) { >>> + new = dev_get_stats(child_netdev, &temp); >>> + virtnet_bypass_fold_stats(stats, new, >>> &vbi->active_stats); >>> + memcpy(&vbi->active_stats, new, sizeof(*new)); >>> + } >>> + >>> + child_netdev = rcu_dereference(vbi->backup_netdev); >>> + if (child_netdev) { >>> + new = dev_get_stats(child_netdev, &temp); >>> + virtnet_bypass_fold_stats(stats, new, >>> &vbi->backup_stats); >>> + memcpy(&vbi->backup_stats, new, sizeof(*new)); >>> + } >>> + >>> + rcu_read_unlock(); >>> + >>> + memcpy(&vbi->bypass_stats, stats, sizeof(*stats)); >>> + spin_unlock(&vbi->stats_lock); >>> +} >>> + >>> +static int virtnet_bypass_change_mtu(struct net_device *dev, int >>> new_mtu) >>> +{ >>> + struct virtnet_bypass_info *vbi = netdev_priv(dev); >>> + struct net_device *child_netdev; >>> + int ret = 0; >>> + >>> + child_netdev = rcu_dereference(vbi->active_netdev); >>> + if (child_netdev) { >>> + ret = dev_set_mtu(child_netdev, new_mtu); >>> + if (ret) >>> + return ret; >>> + } >>> + >>> + child_netdev = rcu_dereference(vbi->backup_netdev); >>> + if (child_netdev) { >>> + ret = dev_set_mtu(child_netdev, new_mtu); >>> + if (ret) >>> + netdev_err(child_netdev, >>> + "Unexpected failure to set mtu to >>> %d\n", >>> + new_mtu); >> >> Shouldn't we unwind the MTU config on active_netdev if failing to set >> it on backup_netdev? > > > backup_netdev is virtio_net and it is not expected to fail as it supports > MTU all the way upto 64K. Hmmm, then why do you need to check up the return of setting mtu on backup_netdev here? And it's not completely impossible setting MTU fails due to some other reason at the lower level. We should not take assumption that way. > > > >> >>> + } >>> + >>> + dev->mtu = new_mtu; >>> + return 0; >>> +} >>> + >>> +static void virtnet_bypass_set_rx_mode(struct net_device *dev) >>> +{ >>> + struct virtnet_bypass_info *vbi = netdev_priv(dev); >>> + struct net_device *child_netdev; >>> + >>> + rcu_read_lock(); >>> + >>> + child_netdev = rcu_dereference(vbi->active_netdev); >>> + if (child_netdev) { >>> + dev_uc_sync_multiple(child_netdev, dev); >>> + dev_mc_sync_multiple(child_netdev, dev); >>> + } >>> + >>> + child_netdev = rcu_dereference(vbi->backup_netdev); >>> + if (child_netdev) { >>> + dev_uc_sync_multiple(child_netdev, dev); >>> + dev_mc_sync_multiple(child_netdev, dev); >>> + } >>> + >> >> If VF comes up later than when set_rx_mode is called where do you sync >> up the unicast and multicast address? > > > I could include sync up of unicast and multicast list along with > MTU when the child is registered. > > > >> >> The rest looks good. >> >> Thanks, >> -Siwei >> >> >
Fri, Mar 02, 2018 at 08:42:47PM CET, mst@redhat.com wrote: >On Fri, Mar 02, 2018 at 05:20:17PM +0100, Jiri Pirko wrote: >> >Yeah, this code essentially calls out the "shareable" code with a >> >comment at the start and end of the section what defines the >> >virtio_bypass functionality. It would just be a matter of mostly >> >cutting and pasting to put it into a separate driver module. >> >> Please put it there and unite the use of it with netvsc. > >Surely, adding this to other drivers (e.g. might this be handy for xen >too?) can be left for a separate patchset. Let's get one device merged >first. Why? Let's do the generic infra alongside with the driver. I see no good reason to rush into merging driver and only later, if ever, to convert it to generic solution. On contrary. That would lead into multiple approaches and different behavious in multiple drivers. That is plain wrong.
On Sat, Mar 3, 2018 at 3:31 AM, Jiri Pirko <jiri@resnulli.us> wrote: > Fri, Mar 02, 2018 at 08:42:47PM CET, mst@redhat.com wrote: >>On Fri, Mar 02, 2018 at 05:20:17PM +0100, Jiri Pirko wrote: >>> >Yeah, this code essentially calls out the "shareable" code with a >>> >comment at the start and end of the section what defines the >>> >virtio_bypass functionality. It would just be a matter of mostly >>> >cutting and pasting to put it into a separate driver module. >>> >>> Please put it there and unite the use of it with netvsc. >> >>Surely, adding this to other drivers (e.g. might this be handy for xen >>too?) can be left for a separate patchset. Let's get one device merged >>first. > > Why? Let's do the generic infra alongside with the driver. I see no good > reason to rush into merging driver and only later, if ever, to convert > it to generic solution. On contrary. That would lead into multiple > approaches and different behavious in multiple drivers. That is plain > wrong. If nothing else it doesn't hurt to do this in one driver in a generic way, and once it has been proven to address all the needs of that one driver we can then start moving other drivers to it. The current solution is quite generic, that was my contribution to this patch set as I didn't like how invasive it was being to virtio and thought it would be best to keep this as minimally invasive as possible. My preference would be to give this a release or two in virtio to mature before we start pushing it onto other drivers. It shouldn't take much to cut/paste this into a new driver file once we decide it is time to start extending it out to other drivers. - Alex
Sat, Mar 03, 2018 at 07:04:57PM CET, alexander.duyck@gmail.com wrote: >On Sat, Mar 3, 2018 at 3:31 AM, Jiri Pirko <jiri@resnulli.us> wrote: >> Fri, Mar 02, 2018 at 08:42:47PM CET, mst@redhat.com wrote: >>>On Fri, Mar 02, 2018 at 05:20:17PM +0100, Jiri Pirko wrote: >>>> >Yeah, this code essentially calls out the "shareable" code with a >>>> >comment at the start and end of the section what defines the >>>> >virtio_bypass functionality. It would just be a matter of mostly >>>> >cutting and pasting to put it into a separate driver module. >>>> >>>> Please put it there and unite the use of it with netvsc. >>> >>>Surely, adding this to other drivers (e.g. might this be handy for xen >>>too?) can be left for a separate patchset. Let's get one device merged >>>first. >> >> Why? Let's do the generic infra alongside with the driver. I see no good >> reason to rush into merging driver and only later, if ever, to convert >> it to generic solution. On contrary. That would lead into multiple >> approaches and different behavious in multiple drivers. That is plain >> wrong. > >If nothing else it doesn't hurt to do this in one driver in a generic >way, and once it has been proven to address all the needs of that one >driver we can then start moving other drivers to it. The current >solution is quite generic, that was my contribution to this patch set >as I didn't like how invasive it was being to virtio and thought it >would be best to keep this as minimally invasive as possible. > >My preference would be to give this a release or two in virtio to >mature before we start pushing it onto other drivers. It shouldn't >take much to cut/paste this into a new driver file once we decide it >is time to start extending it out to other drivers. I'm not talking about cut/paste and in fact that is what I'm worried about. I'm talking about common code in net/core/ or somewhere that would take care of this in-driver bonding. Each driver, like virtio_net, netvsc would just register some ops to it and the core would do all logic. I believe it is essential take this approach from the start.
On Sat, Mar 3, 2018 at 1:25 PM, Jiri Pirko <jiri@resnulli.us> wrote: > Sat, Mar 03, 2018 at 07:04:57PM CET, alexander.duyck@gmail.com wrote: >>On Sat, Mar 3, 2018 at 3:31 AM, Jiri Pirko <jiri@resnulli.us> wrote: >>> Fri, Mar 02, 2018 at 08:42:47PM CET, mst@redhat.com wrote: >>>>On Fri, Mar 02, 2018 at 05:20:17PM +0100, Jiri Pirko wrote: >>>>> >Yeah, this code essentially calls out the "shareable" code with a >>>>> >comment at the start and end of the section what defines the >>>>> >virtio_bypass functionality. It would just be a matter of mostly >>>>> >cutting and pasting to put it into a separate driver module. >>>>> >>>>> Please put it there and unite the use of it with netvsc. >>>> >>>>Surely, adding this to other drivers (e.g. might this be handy for xen >>>>too?) can be left for a separate patchset. Let's get one device merged >>>>first. >>> >>> Why? Let's do the generic infra alongside with the driver. I see no good >>> reason to rush into merging driver and only later, if ever, to convert >>> it to generic solution. On contrary. That would lead into multiple >>> approaches and different behavious in multiple drivers. That is plain >>> wrong. >> >>If nothing else it doesn't hurt to do this in one driver in a generic >>way, and once it has been proven to address all the needs of that one >>driver we can then start moving other drivers to it. The current >>solution is quite generic, that was my contribution to this patch set >>as I didn't like how invasive it was being to virtio and thought it >>would be best to keep this as minimally invasive as possible. >> >>My preference would be to give this a release or two in virtio to >>mature before we start pushing it onto other drivers. It shouldn't >>take much to cut/paste this into a new driver file once we decide it >>is time to start extending it out to other drivers. > > I'm not talking about cut/paste and in fact that is what I'm worried > about. I'm talking about common code in net/core/ or somewhere that > would take care of this in-driver bonding. Each driver, like virtio_net, > netvsc would just register some ops to it and the core would do all > logic. I believe it is essential take this approach from the start. Sorry, I didn't mean cut/paste into another driver, I meant to make it a driver of its own. My thought was to eventually create a shared/core driver module that is then used by the other drivers. My concern right now is that Stephen has indicated he doesn't want this approach taken with netvsc, and most of the community doesn't want the netvsc approach applied to virtio. Until that impasse can be resolved there isn't much value in trying to split this up so it is available to other drivers. In addition I would imagine it would make it a pain for others to back-port into distros since it would break legacy netvsc driver behavior. Patches are always welcome. Once this is in you are free to try fighting to get this made into a generic module and applied to both drivers, but we have already spent close to 3 months on this and it seems like there has been significantly more time spent arguing over the number of interfaces and/or drivers than spent writing/reviewing actual code. - Alex
On Fri, Mar 02, 2018 at 02:26:48PM -0800, Siwei Liu wrote: > On Fri, Mar 2, 2018 at 1:31 PM, Michael S. Tsirkin <mst@redhat.com> wrote: > > On Fri, Mar 02, 2018 at 12:44:56PM -0800, Siwei Liu wrote: > >> On Fri, Mar 2, 2018 at 12:10 PM, Michael S. Tsirkin <mst@redhat.com> wrote: > >> > On Fri, Mar 02, 2018 at 11:52:27AM -0800, Samudrala, Sridhar wrote: > >> >> > >> >> > >> >> On 3/2/2018 11:41 AM, Michael S. Tsirkin wrote: > >> >> > On Fri, Mar 02, 2018 at 07:26:25AM -0800, Alexander Duyck wrote: > >> >> > > The design limits things to a 1:1 relationship since we just have the > >> >> > > child and backup pointers, but I don't think I am seeing exception > >> >> > > handling to prevent us from overwriting the child pointers so there > >> >> > > may be a leak there. > >> >> > > > >> >> > > Thanks. > >> >> > > > >> >> > > - Alex > >> >> > In fact maintaining a list in that case would be nicer, and > >> >> > just use an arbitrary one. > >> >> > E.g. one can see how a user wanting to swap device 1 for device 2 > >> >> > might first add device 2 with same MAC then drop device 1. > >> >> > >> >> It should be possible to swap VF1 with VF2 by > >> >> 1.- enabling virtio link > >> >> 2.- unplugging VF1 > >> >> 3.- plugging VF2 > >> >> 4.- disabling virtio link > >> >> > >> > > >> > True, but it isn't hard to avoid breakage if user > >> > swapped steps 2 and 3. No need to make it more > >> > fragile that it has to be. > >> > >> The migration case, VF2 is associated with another PF on another > >> machine (destination), I wonder how it is possible. > > > > E.g. you want to remove the PF so you unplug the VF > > then add another VF of the same PF. > > > >> Even with local plugging of VF2 on the same PF, the MAC address > >> requirement (VF1's == VF2's) would fail the MAC address assignment on > >> VF2. > >> > >> -Siwei > > > > Why would it fail? These are separate cards. > > OK. I realized that you may talk about assigning a VF on a diffferent > PF (VF1 on PF1 while VF2 on PF2). And we might assign a pass-through > device rather than a VF. Yes, it's indeed possible that may happen but > I take it as a further step down (another patch maybe) as it would > involve changes to notify the network with gratuituious ARP and/or > unsolicited ND advertisement of the MAC address association with the > new port. > > -Siwei Interesting point. I guess that's a limitation in the curent patch then: virtio and PT device must be connected to the same physical NIC. Worth documenting. > > > >> > > >> > -- > >> > MST > >> > > >> > --------------------------------------------------------------------- > >> > To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org > >> > For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org > >> >
On Fri, Mar 02, 2018 at 03:56:31PM -0800, Siwei Liu wrote: > On Fri, Mar 2, 2018 at 1:36 PM, Michael S. Tsirkin <mst@redhat.com> wrote: > > On Fri, Mar 02, 2018 at 01:11:56PM -0800, Siwei Liu wrote: > >> On Thu, Mar 1, 2018 at 12:08 PM, Sridhar Samudrala > >> <sridhar.samudrala@intel.com> wrote: > >> > This patch enables virtio_net to switch over to a VF datapath when a VF > >> > netdev is present with the same MAC address. It allows live migration > >> > of a VM with a direct attached VF without the need to setup a bond/team > >> > between a VF and virtio net device in the guest. > >> > > >> > The hypervisor needs to enable only one datapath at any time so that > >> > packets don't get looped back to the VM over the other datapath. When a VF > >> > is plugged, the virtio datapath link state can be marked as down. The > >> > hypervisor needs to unplug the VF device from the guest on the source host > >> > and reset the MAC filter of the VF to initiate failover of datapath to > >> > virtio before starting the migration. After the migration is completed, > >> > the destination hypervisor sets the MAC filter on the VF and plugs it back > >> > to the guest to switch over to VF datapath. > >> > > >> > When BACKUP feature is enabled, an additional netdev(bypass netdev) is > >> > created that acts as a master device and tracks the state of the 2 lower > >> > netdevs. The original virtio_net netdev is marked as 'backup' netdev and a > >> > passthru device with the same MAC is registered as 'active' netdev. > >> > > >> > This patch is based on the discussion initiated by Jesse on this thread. > >> > https://marc.info/?l=linux-virtualization&m=151189725224231&w=2 > >> > > >> > Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com> > >> > Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> > >> > Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com> > >> > --- > >> > drivers/net/virtio_net.c | 683 ++++++++++++++++++++++++++++++++++++++++++++++- > >> > 1 file changed, 682 insertions(+), 1 deletion(-) > >> > > >> > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c > >> > index bcd13fe906ca..f2860d86c952 100644 > >> > --- a/drivers/net/virtio_net.c > >> > +++ b/drivers/net/virtio_net.c > >> > @@ -30,6 +30,8 @@ > >> > #include <linux/cpu.h> > >> > #include <linux/average.h> > >> > #include <linux/filter.h> > >> > +#include <linux/netdevice.h> > >> > +#include <linux/pci.h> > >> > #include <net/route.h> > >> > #include <net/xdp.h> > >> > > >> > @@ -206,6 +208,9 @@ struct virtnet_info { > >> > u32 speed; > >> > > >> > unsigned long guest_offloads; > >> > + > >> > + /* upper netdev created when BACKUP feature enabled */ > >> > + struct net_device *bypass_netdev; > >> > }; > >> > > >> > struct padded_vnet_hdr { > >> > @@ -2236,6 +2241,22 @@ static int virtnet_xdp(struct net_device *dev, struct netdev_bpf *xdp) > >> > } > >> > } > >> > > >> > +static int virtnet_get_phys_port_name(struct net_device *dev, char *buf, > >> > + size_t len) > >> > +{ > >> > + struct virtnet_info *vi = netdev_priv(dev); > >> > + int ret; > >> > + > >> > + if (!virtio_has_feature(vi->vdev, VIRTIO_NET_F_BACKUP)) > >> > + return -EOPNOTSUPP; > >> > + > >> > + ret = snprintf(buf, len, "_bkup"); > >> > + if (ret >= len) > >> > + return -EOPNOTSUPP; > >> > + > >> > + return 0; > >> > +} > >> > + > >> > >> What if the systemd/udevd is not new enough to enforce the > >> n<phys_port_name> naming? Would virtio_bypass get a different name > >> than the original virtio_net? > > > > You mean people using ethX names? Any hardware config change breaks > > these, I don't think that can be helped. > > I don't like the way to rely on .ndo_get_phys_port_name - it's fragile > and it does not completely solve the problem it tries to address. > Imagine what can end up with if getting an old udevd, or users already > have exsiting explicit udev rules around phys_port_name. It does not > give you the an ack in saying "yes, I know you're the bypass and > you're the backup, please continue and I will give you both correct > names", or an unacknowlegment saying "no, I don't know what these > extra interfaces are, please go back and leave the VF device alone". > We need new udev API for both feature negotiation and naming, or may > even completely hide the lower interfaces. Go ahead and try to make this happen, but I won't hold my breath. > > > >> Should we detect this earlier and fall > >> back to legacy mode without creating the bypass netdev and ensalving > >> the VF? > > > > I don't think we can do this with existing kernel/userspace APIs. > > That's why I ever said to make udev aware of this new type of combined > device instead of doing hacks here and there around. > > Regards, > -Siwei We can add new interfaces on top but the main purpose here is to make old userspace do new tricks. > > > > -- > > MST
Sun, Mar 04, 2018 at 01:26:53AM CET, alexander.duyck@gmail.com wrote: >On Sat, Mar 3, 2018 at 1:25 PM, Jiri Pirko <jiri@resnulli.us> wrote: >> Sat, Mar 03, 2018 at 07:04:57PM CET, alexander.duyck@gmail.com wrote: >>>On Sat, Mar 3, 2018 at 3:31 AM, Jiri Pirko <jiri@resnulli.us> wrote: >>>> Fri, Mar 02, 2018 at 08:42:47PM CET, mst@redhat.com wrote: >>>>>On Fri, Mar 02, 2018 at 05:20:17PM +0100, Jiri Pirko wrote: >>>>>> >Yeah, this code essentially calls out the "shareable" code with a >>>>>> >comment at the start and end of the section what defines the >>>>>> >virtio_bypass functionality. It would just be a matter of mostly >>>>>> >cutting and pasting to put it into a separate driver module. >>>>>> >>>>>> Please put it there and unite the use of it with netvsc. >>>>> >>>>>Surely, adding this to other drivers (e.g. might this be handy for xen >>>>>too?) can be left for a separate patchset. Let's get one device merged >>>>>first. >>>> >>>> Why? Let's do the generic infra alongside with the driver. I see no good >>>> reason to rush into merging driver and only later, if ever, to convert >>>> it to generic solution. On contrary. That would lead into multiple >>>> approaches and different behavious in multiple drivers. That is plain >>>> wrong. >>> >>>If nothing else it doesn't hurt to do this in one driver in a generic >>>way, and once it has been proven to address all the needs of that one >>>driver we can then start moving other drivers to it. The current >>>solution is quite generic, that was my contribution to this patch set >>>as I didn't like how invasive it was being to virtio and thought it >>>would be best to keep this as minimally invasive as possible. >>> >>>My preference would be to give this a release or two in virtio to >>>mature before we start pushing it onto other drivers. It shouldn't >>>take much to cut/paste this into a new driver file once we decide it >>>is time to start extending it out to other drivers. >> >> I'm not talking about cut/paste and in fact that is what I'm worried >> about. I'm talking about common code in net/core/ or somewhere that >> would take care of this in-driver bonding. Each driver, like virtio_net, >> netvsc would just register some ops to it and the core would do all >> logic. I believe it is essential take this approach from the start. > >Sorry, I didn't mean cut/paste into another driver, I meant to make it >a driver of its own. My thought was to eventually create a shared/core >driver module that is then used by the other drivers. > >My concern right now is that Stephen has indicated he doesn't want >this approach taken with netvsc, and most of the community doesn't IIUC, he only does not like the extra netdev. Is there anything else? >want the netvsc approach applied to virtio. Until that impasse can be >resolved there isn't much value in trying to split this up so it is >available to other drivers. In addition I would imagine it would make >it a pain for others to back-port into distros since it would break >legacy netvsc driver behavior. Patches are always welcome. Once this >is in you are free to try fighting to get this made into a generic >module and applied to both drivers, but we have already spent close to >3 months on this and it seems like there has been significantly more Alex, time is never a good argument for poor design and shortcuts. >time spent arguing over the number of interfaces and/or drivers than >spent writing/reviewing actual code. > >- Alex
On Sat, Mar 3, 2018 at 11:13 PM, Jiri Pirko <jiri@resnulli.us> wrote: > Sun, Mar 04, 2018 at 01:26:53AM CET, alexander.duyck@gmail.com wrote: >>On Sat, Mar 3, 2018 at 1:25 PM, Jiri Pirko <jiri@resnulli.us> wrote: >>> Sat, Mar 03, 2018 at 07:04:57PM CET, alexander.duyck@gmail.com wrote: >>>>On Sat, Mar 3, 2018 at 3:31 AM, Jiri Pirko <jiri@resnulli.us> wrote: >>>>> Fri, Mar 02, 2018 at 08:42:47PM CET, mst@redhat.com wrote: >>>>>>On Fri, Mar 02, 2018 at 05:20:17PM +0100, Jiri Pirko wrote: >>>>>>> >Yeah, this code essentially calls out the "shareable" code with a >>>>>>> >comment at the start and end of the section what defines the >>>>>>> >virtio_bypass functionality. It would just be a matter of mostly >>>>>>> >cutting and pasting to put it into a separate driver module. >>>>>>> >>>>>>> Please put it there and unite the use of it with netvsc. >>>>>> >>>>>>Surely, adding this to other drivers (e.g. might this be handy for xen >>>>>>too?) can be left for a separate patchset. Let's get one device merged >>>>>>first. >>>>> >>>>> Why? Let's do the generic infra alongside with the driver. I see no good >>>>> reason to rush into merging driver and only later, if ever, to convert >>>>> it to generic solution. On contrary. That would lead into multiple >>>>> approaches and different behavious in multiple drivers. That is plain >>>>> wrong. >>>> >>>>If nothing else it doesn't hurt to do this in one driver in a generic >>>>way, and once it has been proven to address all the needs of that one >>>>driver we can then start moving other drivers to it. The current >>>>solution is quite generic, that was my contribution to this patch set >>>>as I didn't like how invasive it was being to virtio and thought it >>>>would be best to keep this as minimally invasive as possible. >>>> >>>>My preference would be to give this a release or two in virtio to >>>>mature before we start pushing it onto other drivers. It shouldn't >>>>take much to cut/paste this into a new driver file once we decide it >>>>is time to start extending it out to other drivers. >>> >>> I'm not talking about cut/paste and in fact that is what I'm worried >>> about. I'm talking about common code in net/core/ or somewhere that >>> would take care of this in-driver bonding. Each driver, like virtio_net, >>> netvsc would just register some ops to it and the core would do all >>> logic. I believe it is essential take this approach from the start. >> >>Sorry, I didn't mean cut/paste into another driver, I meant to make it >>a driver of its own. My thought was to eventually create a shared/core >>driver module that is then used by the other drivers. >> >>My concern right now is that Stephen has indicated he doesn't want >>this approach taken with netvsc, and most of the community doesn't > > IIUC, he only does not like the extra netdev. Is there anything else? Nope that is pretty much it. It doesn't seem like a big deal for virtio, but for netvsc it is significant since they don't have any "backup" bit feature differentiation, so they would likely be stuck with 2 netdevs even in their basic setup. >>want the netvsc approach applied to virtio. Until that impasse can be >>resolved there isn't much value in trying to split this up so it is >>available to other drivers. In addition I would imagine it would make >>it a pain for others to back-port into distros since it would break >>legacy netvsc driver behavior. Patches are always welcome. Once this >>is in you are free to try fighting to get this made into a generic >>module and applied to both drivers, but we have already spent close to >>3 months on this and it seems like there has been significantly more > > Alex, time is never a good argument for poor design and shortcuts. I'm not saying we should go with a poor design due to time. But expecting us to implement something where the maintainer of said driver has not agreed to is pointless, and I don't see it as a design shortcut to implement something in one driver with the expectation that we will then make it core later once it has proven itself and has use elsewhere. In the meantime I would imagine it also makes it easier for things like backports and such for us to do it this way since we are only impacting one driver. You are telling us to do something that not everyone has agreed to. Currently we only have agreement from Michael on taking this code, as such we are working with virtio only for now. When the time comes that we can get other maintainers, specifically Stephen, to agree to it then we can cut/paste this code into a core file or into a module of its own. Alternatively I suppose we could take this up to Dave if you can't get Stephen to agree. If you can get Dave to say we need to change netvsc then we will go ahead with it, but generally I prefer to respect when the maintainer of something says they don't want us modifying their code in some way.
Sun, Mar 04, 2018 at 07:24:12PM CET, alexander.duyck@gmail.com wrote: >On Sat, Mar 3, 2018 at 11:13 PM, Jiri Pirko <jiri@resnulli.us> wrote: >> Sun, Mar 04, 2018 at 01:26:53AM CET, alexander.duyck@gmail.com wrote: >>>On Sat, Mar 3, 2018 at 1:25 PM, Jiri Pirko <jiri@resnulli.us> wrote: >>>> Sat, Mar 03, 2018 at 07:04:57PM CET, alexander.duyck@gmail.com wrote: >>>>>On Sat, Mar 3, 2018 at 3:31 AM, Jiri Pirko <jiri@resnulli.us> wrote: >>>>>> Fri, Mar 02, 2018 at 08:42:47PM CET, mst@redhat.com wrote: >>>>>>>On Fri, Mar 02, 2018 at 05:20:17PM +0100, Jiri Pirko wrote: >>>>>>>> >Yeah, this code essentially calls out the "shareable" code with a >>>>>>>> >comment at the start and end of the section what defines the >>>>>>>> >virtio_bypass functionality. It would just be a matter of mostly >>>>>>>> >cutting and pasting to put it into a separate driver module. >>>>>>>> >>>>>>>> Please put it there and unite the use of it with netvsc. >>>>>>> >>>>>>>Surely, adding this to other drivers (e.g. might this be handy for xen >>>>>>>too?) can be left for a separate patchset. Let's get one device merged >>>>>>>first. >>>>>> >>>>>> Why? Let's do the generic infra alongside with the driver. I see no good >>>>>> reason to rush into merging driver and only later, if ever, to convert >>>>>> it to generic solution. On contrary. That would lead into multiple >>>>>> approaches and different behavious in multiple drivers. That is plain >>>>>> wrong. >>>>> >>>>>If nothing else it doesn't hurt to do this in one driver in a generic >>>>>way, and once it has been proven to address all the needs of that one >>>>>driver we can then start moving other drivers to it. The current >>>>>solution is quite generic, that was my contribution to this patch set >>>>>as I didn't like how invasive it was being to virtio and thought it >>>>>would be best to keep this as minimally invasive as possible. >>>>> >>>>>My preference would be to give this a release or two in virtio to >>>>>mature before we start pushing it onto other drivers. It shouldn't >>>>>take much to cut/paste this into a new driver file once we decide it >>>>>is time to start extending it out to other drivers. >>>> >>>> I'm not talking about cut/paste and in fact that is what I'm worried >>>> about. I'm talking about common code in net/core/ or somewhere that >>>> would take care of this in-driver bonding. Each driver, like virtio_net, >>>> netvsc would just register some ops to it and the core would do all >>>> logic. I believe it is essential take this approach from the start. >>> >>>Sorry, I didn't mean cut/paste into another driver, I meant to make it >>>a driver of its own. My thought was to eventually create a shared/core >>>driver module that is then used by the other drivers. >>> >>>My concern right now is that Stephen has indicated he doesn't want >>>this approach taken with netvsc, and most of the community doesn't >> >> IIUC, he only does not like the extra netdev. Is there anything else? > >Nope that is pretty much it. It doesn't seem like a big deal for >virtio, but for netvsc it is significant since they don't have any >"backup" bit feature differentiation, so they would likely be stuck >with 2 netdevs even in their basic setup. Okay. If that is a strict "no-go" for netvsc, this should be just a flag passed down to the in-driver bond code. > >>>want the netvsc approach applied to virtio. Until that impasse can be >>>resolved there isn't much value in trying to split this up so it is >>>available to other drivers. In addition I would imagine it would make >>>it a pain for others to back-port into distros since it would break >>>legacy netvsc driver behavior. Patches are always welcome. Once this >>>is in you are free to try fighting to get this made into a generic >>>module and applied to both drivers, but we have already spent close to >>>3 months on this and it seems like there has been significantly more >> >> Alex, time is never a good argument for poor design and shortcuts. > >I'm not saying we should go with a poor design due to time. But >expecting us to implement something where the maintainer of said >driver has not agreed to is pointless, and I don't see it as a design He just does not like the third netdev, not the fact that the code for in-driver bonding would be shared. >shortcut to implement something in one driver with the expectation >that we will then make it core later once it has proven itself and has >use elsewhere. In the meantime I would imagine it also makes it easier >for things like backports and such for us to do it this way since we >are only impacting one driver. When you are working on upstream kernel, you should not care about backports. That leads to poor design. Not an argument. > >You are telling us to do something that not everyone has agreed to. Who did not? >Currently we only have agreement from Michael on taking this code, as >such we are working with virtio only for now. When the time comes that If you do duplication of netvsc in-driver bonding in virtio_net, it will stay there forever. So what you say is: "We will do it halfway now and promise to fix it later". That later will never happen, I'm pretty sure. That is why I push for in-driver bonding shared code as a part of this patchset. + if you would be pushing first driver to do this, I would understand. But the first driver is already in. You are pushing second. This is the time to do the sharing, unification of behaviour. Next time is too late. >we can get other maintainers, specifically Stephen, to agree to it >then we can cut/paste this code into a core file or into a module of >its own. Alternatively I suppose we could take this up to Dave if you >can't get Stephen to agree. If you can get Dave to say we need to >change netvsc then we will go ahead with it, but generally I prefer to >respect when the maintainer of something says they don't want us >modifying their code in some way.
On 3/4/2018 10:50 AM, Jiri Pirko wrote: > Sun, Mar 04, 2018 at 07:24:12PM CET, alexander.duyck@gmail.com wrote: >> On Sat, Mar 3, 2018 at 11:13 PM, Jiri Pirko <jiri@resnulli.us> wrote: >>> Sun, Mar 04, 2018 at 01:26:53AM CET, alexander.duyck@gmail.com wrote: >>>> On Sat, Mar 3, 2018 at 1:25 PM, Jiri Pirko <jiri@resnulli.us> wrote: >>>>> Sat, Mar 03, 2018 at 07:04:57PM CET, alexander.duyck@gmail.com wrote: >>>>>> On Sat, Mar 3, 2018 at 3:31 AM, Jiri Pirko <jiri@resnulli.us> wrote: >>>>>>> Fri, Mar 02, 2018 at 08:42:47PM CET, mst@redhat.com wrote: >>>>>>>> On Fri, Mar 02, 2018 at 05:20:17PM +0100, Jiri Pirko wrote: >>>>>>>>>> Yeah, this code essentially calls out the "shareable" code with a >>>>>>>>>> comment at the start and end of the section what defines the >>>>>>>>>> virtio_bypass functionality. It would just be a matter of mostly >>>>>>>>>> cutting and pasting to put it into a separate driver module. >>>>>>>>> Please put it there and unite the use of it with netvsc. >>>>>>>> Surely, adding this to other drivers (e.g. might this be handy for xen >>>>>>>> too?) can be left for a separate patchset. Let's get one device merged >>>>>>>> first. >>>>>>> Why? Let's do the generic infra alongside with the driver. I see no good >>>>>>> reason to rush into merging driver and only later, if ever, to convert >>>>>>> it to generic solution. On contrary. That would lead into multiple >>>>>>> approaches and different behavious in multiple drivers. That is plain >>>>>>> wrong. >>>>>> If nothing else it doesn't hurt to do this in one driver in a generic >>>>>> way, and once it has been proven to address all the needs of that one >>>>>> driver we can then start moving other drivers to it. The current >>>>>> solution is quite generic, that was my contribution to this patch set >>>>>> as I didn't like how invasive it was being to virtio and thought it >>>>>> would be best to keep this as minimally invasive as possible. >>>>>> >>>>>> My preference would be to give this a release or two in virtio to >>>>>> mature before we start pushing it onto other drivers. It shouldn't >>>>>> take much to cut/paste this into a new driver file once we decide it >>>>>> is time to start extending it out to other drivers. >>>>> I'm not talking about cut/paste and in fact that is what I'm worried >>>>> about. I'm talking about common code in net/core/ or somewhere that >>>>> would take care of this in-driver bonding. Each driver, like virtio_net, >>>>> netvsc would just register some ops to it and the core would do all >>>>> logic. I believe it is essential take this approach from the start. >>>> Sorry, I didn't mean cut/paste into another driver, I meant to make it >>>> a driver of its own. My thought was to eventually create a shared/core >>>> driver module that is then used by the other drivers. >>>> >>>> My concern right now is that Stephen has indicated he doesn't want >>>> this approach taken with netvsc, and most of the community doesn't >>> IIUC, he only does not like the extra netdev. Is there anything else? >> Nope that is pretty much it. It doesn't seem like a big deal for >> virtio, but for netvsc it is significant since they don't have any >> "backup" bit feature differentiation, so they would likely be stuck >> with 2 netdevs even in their basic setup. > Okay. If that is a strict "no-go" for netvsc, this should be > just a flag passed down to the in-driver bond code. This results in a 3 driver model (virtio/netvsc, vf & bypass) with 2 netdevs created when bypass is based on netvsc and 3 netdevs created when the bypass is based on virtio_net. Unless we agree on a common netdev model between netvsc and virtio_net, i am not sure if it is useful to commonize the code into a separate driver. -Sridhar
On Sun, Mar 4, 2018 at 10:50 AM, Jiri Pirko <jiri@resnulli.us> wrote: > Sun, Mar 04, 2018 at 07:24:12PM CET, alexander.duyck@gmail.com wrote: >>On Sat, Mar 3, 2018 at 11:13 PM, Jiri Pirko <jiri@resnulli.us> wrote: >>> Sun, Mar 04, 2018 at 01:26:53AM CET, alexander.duyck@gmail.com wrote: >>>>On Sat, Mar 3, 2018 at 1:25 PM, Jiri Pirko <jiri@resnulli.us> wrote: >>>>> Sat, Mar 03, 2018 at 07:04:57PM CET, alexander.duyck@gmail.com wrote: >>>>>>On Sat, Mar 3, 2018 at 3:31 AM, Jiri Pirko <jiri@resnulli.us> wrote: >>>>>>> Fri, Mar 02, 2018 at 08:42:47PM CET, mst@redhat.com wrote: >>>>>>>>On Fri, Mar 02, 2018 at 05:20:17PM +0100, Jiri Pirko wrote: >>>>>>>>> >Yeah, this code essentially calls out the "shareable" code with a >>>>>>>>> >comment at the start and end of the section what defines the >>>>>>>>> >virtio_bypass functionality. It would just be a matter of mostly >>>>>>>>> >cutting and pasting to put it into a separate driver module. >>>>>>>>> >>>>>>>>> Please put it there and unite the use of it with netvsc. >>>>>>>> >>>>>>>>Surely, adding this to other drivers (e.g. might this be handy for xen >>>>>>>>too?) can be left for a separate patchset. Let's get one device merged >>>>>>>>first. >>>>>>> >>>>>>> Why? Let's do the generic infra alongside with the driver. I see no good >>>>>>> reason to rush into merging driver and only later, if ever, to convert >>>>>>> it to generic solution. On contrary. That would lead into multiple >>>>>>> approaches and different behavious in multiple drivers. That is plain >>>>>>> wrong. >>>>>> >>>>>>If nothing else it doesn't hurt to do this in one driver in a generic >>>>>>way, and once it has been proven to address all the needs of that one >>>>>>driver we can then start moving other drivers to it. The current >>>>>>solution is quite generic, that was my contribution to this patch set >>>>>>as I didn't like how invasive it was being to virtio and thought it >>>>>>would be best to keep this as minimally invasive as possible. >>>>>> >>>>>>My preference would be to give this a release or two in virtio to >>>>>>mature before we start pushing it onto other drivers. It shouldn't >>>>>>take much to cut/paste this into a new driver file once we decide it >>>>>>is time to start extending it out to other drivers. >>>>> >>>>> I'm not talking about cut/paste and in fact that is what I'm worried >>>>> about. I'm talking about common code in net/core/ or somewhere that >>>>> would take care of this in-driver bonding. Each driver, like virtio_net, >>>>> netvsc would just register some ops to it and the core would do all >>>>> logic. I believe it is essential take this approach from the start. >>>> >>>>Sorry, I didn't mean cut/paste into another driver, I meant to make it >>>>a driver of its own. My thought was to eventually create a shared/core >>>>driver module that is then used by the other drivers. >>>> >>>>My concern right now is that Stephen has indicated he doesn't want >>>>this approach taken with netvsc, and most of the community doesn't >>> >>> IIUC, he only does not like the extra netdev. Is there anything else? >> >>Nope that is pretty much it. It doesn't seem like a big deal for >>virtio, but for netvsc it is significant since they don't have any >>"backup" bit feature differentiation, so they would likely be stuck >>with 2 netdevs even in their basic setup. > > Okay. If that is a strict "no-go" for netvsc, this should be > just a flag passed down to the in-driver bond code. Are you serious? We might as well just do a per-driver bond then if that is what you want. Once you go back to the "2 netdev" model for this the bond becomes tightly woven into the driver and becomes a separate beast entirely. At that point sharing kind of goes out the window since you have to be tightly coupled into all of the per-driver ops. I would argue there is no way to do the "2 netdev" model generically. It is kind of inherent to the "2 netdev" model in the first place since you can't have a third driver pop up so now everything is pulled into the paravirtual interface unless you invert everything and require the netvsc driver to provide the driver with a set of function pointers allowing it to call back into it. In addition you suddenly have to deal with all the qdisc and Tx queue locking mess. So the 3 netdev model let the driver be lockless and run with no queue disc. Are you telling us you expect our solution to run in both modes or are you pushing the qdisc overhead and Tx queue locking into the 3 netdev model? What it ultimately comes down to is how do you create a new netdev without exposing a new netdev? In the 3 netdev model this all makes sense as we can leave the paravirtual interface in tact. Now you are telling us that based on a flag we either have to embed ourselves into the paravirtual interface without exposing our operations, or we have to embed the paravirtual interface into our device without letting it be visible. The sheer overhead of that will end up more then doubling the code size for just the core bits of this patch. >> >>>>want the netvsc approach applied to virtio. Until that impasse can be >>>>resolved there isn't much value in trying to split this up so it is >>>>available to other drivers. In addition I would imagine it would make >>>>it a pain for others to back-port into distros since it would break >>>>legacy netvsc driver behavior. Patches are always welcome. Once this >>>>is in you are free to try fighting to get this made into a generic >>>>module and applied to both drivers, but we have already spent close to >>>>3 months on this and it seems like there has been significantly more >>> >>> Alex, time is never a good argument for poor design and shortcuts. >> >>I'm not saying we should go with a poor design due to time. But >>expecting us to implement something where the maintainer of said >>driver has not agreed to is pointless, and I don't see it as a design > > He just does not like the third netdev, not the fact that the code > for in-driver bonding would be shared. Yes, but losing the third/master netdev suddenly steps up the complexity several fold. I would rather not support it at all. In addition Jakub and others were strongly opposed the "abomination" as I believe you guys called it, and quite frankly I think it is a bad design decision made for cosmetic reasons. >>shortcut to implement something in one driver with the expectation >>that we will then make it core later once it has proven itself and has >>use elsewhere. In the meantime I would imagine it also makes it easier >>for things like backports and such for us to do it this way since we >>are only impacting one driver. > > When you are working on upstream kernel, you should not care about > backports. That leads to poor design. Not an argument. I disagree. When you have to think about maintenance of things it is quite useful to think about backports. Otherwise older versions of your code become unmaintainable. >> >>You are telling us to do something that not everyone has agreed to. > > Who did not? As far as I know you are the only person asking that this has to support both models. At least with us only supporting 3 netdevs, it was only the 2 netdev crowd that was unhappy with it. I don't know who is going to be happy with us doing an either/or setup that will ultimately more confusing from a user experience perspective since now you are saying this thing is either a hybrid slave/master, or just a master depending on who is using it. >>Currently we only have agreement from Michael on taking this code, as >>such we are working with virtio only for now. When the time comes that > > If you do duplication of netvsc in-driver bonding in virtio_net, it will > stay there forever. So what you say is: "We will do it halfway now > and promise to fix it later". That later will never happen, I'm pretty > sure. That is why I push for in-driver bonding shared code as a part of > this patchset. You want this new approach and a copy of netvsc moved into either core or some module of its own. I say pick an architecture. We are looking at either 2 netdevs or 3. We are not going to support both because that will ultimately lead to a terrible user experience and make things quite confusing. > + if you would be pushing first driver to do this, I would understand. > But the first driver is already in. You are pushing second. This is the > time to do the sharing, unification of behaviour. Next time is too late. That is great, if we want to share then lets share. But what you are essentially telling us is that we need to fork this solution and maintain two code paths, one for 2 netdevs, and another for 3. At that point what is the point in merging them together?
Sun, Mar 04, 2018 at 10:58:34PM CET, alexander.duyck@gmail.com wrote: >On Sun, Mar 4, 2018 at 10:50 AM, Jiri Pirko <jiri@resnulli.us> wrote: >> Sun, Mar 04, 2018 at 07:24:12PM CET, alexander.duyck@gmail.com wrote: >>>On Sat, Mar 3, 2018 at 11:13 PM, Jiri Pirko <jiri@resnulli.us> wrote: [...] > >>>Currently we only have agreement from Michael on taking this code, as >>>such we are working with virtio only for now. When the time comes that >> >> If you do duplication of netvsc in-driver bonding in virtio_net, it will >> stay there forever. So what you say is: "We will do it halfway now >> and promise to fix it later". That later will never happen, I'm pretty >> sure. That is why I push for in-driver bonding shared code as a part of >> this patchset. > >You want this new approach and a copy of netvsc moved into either core >or some module of its own. I say pick an architecture. We are looking >at either 2 netdevs or 3. We are not going to support both because >that will ultimately lead to a terrible user experience and make >things quite confusing. > >> + if you would be pushing first driver to do this, I would understand. >> But the first driver is already in. You are pushing second. This is the >> time to do the sharing, unification of behaviour. Next time is too late. > >That is great, if we want to share then lets share. But what you are >essentially telling us is that we need to fork this solution and >maintain two code paths, one for 2 netdevs, and another for 3. At that >point what is the point in merging them together? Of course, I vote for the same behaviour for netvsc and virtio_net. That is my point from the very beginning. Stephen, what do you think? Could we please make virtio_net and netvsc behave the same and to use a single code with well-defined checks and restrictions for this feature?
On Mon, 5 Mar 2018 10:21:18 +0100 Jiri Pirko <jiri@resnulli.us> wrote: > Sun, Mar 04, 2018 at 10:58:34PM CET, alexander.duyck@gmail.com wrote: > >On Sun, Mar 4, 2018 at 10:50 AM, Jiri Pirko <jiri@resnulli.us> wrote: > >> Sun, Mar 04, 2018 at 07:24:12PM CET, alexander.duyck@gmail.com wrote: > >>>On Sat, Mar 3, 2018 at 11:13 PM, Jiri Pirko <jiri@resnulli.us> wrote: > > [...] > > > > >>>Currently we only have agreement from Michael on taking this code, as > >>>such we are working with virtio only for now. When the time comes that > >> > >> If you do duplication of netvsc in-driver bonding in virtio_net, it will > >> stay there forever. So what you say is: "We will do it halfway now > >> and promise to fix it later". That later will never happen, I'm pretty > >> sure. That is why I push for in-driver bonding shared code as a part of > >> this patchset. > > > >You want this new approach and a copy of netvsc moved into either core > >or some module of its own. I say pick an architecture. We are looking > >at either 2 netdevs or 3. We are not going to support both because > >that will ultimately lead to a terrible user experience and make > >things quite confusing. > > > >> + if you would be pushing first driver to do this, I would understand. > >> But the first driver is already in. You are pushing second. This is the > >> time to do the sharing, unification of behaviour. Next time is too late. > > > >That is great, if we want to share then lets share. But what you are > >essentially telling us is that we need to fork this solution and > >maintain two code paths, one for 2 netdevs, and another for 3. At that > >point what is the point in merging them together? > > Of course, I vote for the same behaviour for netvsc and virtio_net. That > is my point from the very beginning. > > Stephen, what do you think? Could we please make virtio_net and netvsc > behave the same and to use a single code with well-defined checks and > restrictions for this feature? Eventually, yes both could share common code routines. In reality, the failover stuff is only a very small part of either driver so it is not worth stretching to try and cover too much. If you look, the failover code is just using routines that already exist for use by bonding, teaming, etc. There will always be two drivers, the ring buffers and buffering are very different between vmbus and virtio. It would help to address some of the awkward stuff like queue selection and offload handling in a common way. Don't worry too much about backports. The backport can use the old code if necessary.
Mon, Mar 05, 2018 at 05:11:32PM CET, stephen@networkplumber.org wrote: >On Mon, 5 Mar 2018 10:21:18 +0100 >Jiri Pirko <jiri@resnulli.us> wrote: > >> Sun, Mar 04, 2018 at 10:58:34PM CET, alexander.duyck@gmail.com wrote: >> >On Sun, Mar 4, 2018 at 10:50 AM, Jiri Pirko <jiri@resnulli.us> wrote: >> >> Sun, Mar 04, 2018 at 07:24:12PM CET, alexander.duyck@gmail.com wrote: >> >>>On Sat, Mar 3, 2018 at 11:13 PM, Jiri Pirko <jiri@resnulli.us> wrote: >> >> [...] >> >> > >> >>>Currently we only have agreement from Michael on taking this code, as >> >>>such we are working with virtio only for now. When the time comes that >> >> >> >> If you do duplication of netvsc in-driver bonding in virtio_net, it will >> >> stay there forever. So what you say is: "We will do it halfway now >> >> and promise to fix it later". That later will never happen, I'm pretty >> >> sure. That is why I push for in-driver bonding shared code as a part of >> >> this patchset. >> > >> >You want this new approach and a copy of netvsc moved into either core >> >or some module of its own. I say pick an architecture. We are looking >> >at either 2 netdevs or 3. We are not going to support both because >> >that will ultimately lead to a terrible user experience and make >> >things quite confusing. >> > >> >> + if you would be pushing first driver to do this, I would understand. >> >> But the first driver is already in. You are pushing second. This is the >> >> time to do the sharing, unification of behaviour. Next time is too late. >> > >> >That is great, if we want to share then lets share. But what you are >> >essentially telling us is that we need to fork this solution and >> >maintain two code paths, one for 2 netdevs, and another for 3. At that >> >point what is the point in merging them together? >> >> Of course, I vote for the same behaviour for netvsc and virtio_net. That >> is my point from the very beginning. >> >> Stephen, what do you think? Could we please make virtio_net and netvsc >> behave the same and to use a single code with well-defined checks and >> restrictions for this feature? > >Eventually, yes both could share common code routines. In reality, >the failover stuff is only a very small part of either driver so >it is not worth stretching to try and cover too much. If you look, >the failover code is just using routines that already exist for >use by bonding, teaming, etc. Yeah, we consern was also about the code that processes the netdev notifications and does auto-enslave and all related stuff. > >There will always be two drivers, the ring buffers and buffering >are very different between vmbus and virtio. It would help to address >some of the awkward stuff like queue selection and offload handling >in a common way. Agreed. > >Don't worry too much about backports. The backport can use the >old code if necessary.
On Mon, Mar 5, 2018 at 2:30 PM, Jiri Pirko <jiri@resnulli.us> wrote: > Mon, Mar 05, 2018 at 05:11:32PM CET, stephen@networkplumber.org wrote: >>On Mon, 5 Mar 2018 10:21:18 +0100 >>Jiri Pirko <jiri@resnulli.us> wrote: >> >>> Sun, Mar 04, 2018 at 10:58:34PM CET, alexander.duyck@gmail.com wrote: >>> >On Sun, Mar 4, 2018 at 10:50 AM, Jiri Pirko <jiri@resnulli.us> wrote: >>> >> Sun, Mar 04, 2018 at 07:24:12PM CET, alexander.duyck@gmail.com wrote: >>> >>>On Sat, Mar 3, 2018 at 11:13 PM, Jiri Pirko <jiri@resnulli.us> wrote: >>> >>> [...] >>> >>> > >>> >>>Currently we only have agreement from Michael on taking this code, as >>> >>>such we are working with virtio only for now. When the time comes that >>> >> >>> >> If you do duplication of netvsc in-driver bonding in virtio_net, it will >>> >> stay there forever. So what you say is: "We will do it halfway now >>> >> and promise to fix it later". That later will never happen, I'm pretty >>> >> sure. That is why I push for in-driver bonding shared code as a part of >>> >> this patchset. >>> > >>> >You want this new approach and a copy of netvsc moved into either core >>> >or some module of its own. I say pick an architecture. We are looking >>> >at either 2 netdevs or 3. We are not going to support both because >>> >that will ultimately lead to a terrible user experience and make >>> >things quite confusing. >>> > >>> >> + if you would be pushing first driver to do this, I would understand. >>> >> But the first driver is already in. You are pushing second. This is the >>> >> time to do the sharing, unification of behaviour. Next time is too late. >>> > >>> >That is great, if we want to share then lets share. But what you are >>> >essentially telling us is that we need to fork this solution and >>> >maintain two code paths, one for 2 netdevs, and another for 3. At that >>> >point what is the point in merging them together? >>> >>> Of course, I vote for the same behaviour for netvsc and virtio_net. That >>> is my point from the very beginning. >>> >>> Stephen, what do you think? Could we please make virtio_net and netvsc >>> behave the same and to use a single code with well-defined checks and >>> restrictions for this feature? >> >>Eventually, yes both could share common code routines. In reality, >>the failover stuff is only a very small part of either driver so >>it is not worth stretching to try and cover too much. If you look, >>the failover code is just using routines that already exist for >>use by bonding, teaming, etc. > > Yeah, we consern was also about the code that processes the netdev > notifications and does auto-enslave and all related stuff. The concern was the driver model. If we expose 3 netdevs or 2 with the VF driver present. Somehow this is turning into a "merge netvsc into virtio" think and that isn't the subject that was being asked. Ideally we want one model for this. Either 3 netdevs or 2. The problem is 2 causes issues in terms of performance and will limit features of virtio, but 2 is the precedent set by netvsc. We need to figure out the path forward for this. There is talk about "sharing" but it is hard to make these two approaches share code when they are doing two very different setups and end up presenting themselves as two very different driver models. >> >>There will always be two drivers, the ring buffers and buffering >>are very different between vmbus and virtio. It would help to address >>some of the awkward stuff like queue selection and offload handling >>in a common way. > > Agreed. There are going to end up being three drivers by the time we are done. We will end up with netvsc, virtio, and some shared block of functionality that is used between the two of them. At least that is the assumption if the two are going to share code. I don't know if everyone will want to take on the extra overhead for the code shared between these two drivers being a part of the core net code.
On Mon, 5 Mar 2018 14:47:20 -0800 Alexander Duyck <alexander.duyck@gmail.com> wrote: > On Mon, Mar 5, 2018 at 2:30 PM, Jiri Pirko <jiri@resnulli.us> wrote: > > Mon, Mar 05, 2018 at 05:11:32PM CET, stephen@networkplumber.org wrote: > >>On Mon, 5 Mar 2018 10:21:18 +0100 > >>Jiri Pirko <jiri@resnulli.us> wrote: > >> > >>> Sun, Mar 04, 2018 at 10:58:34PM CET, alexander.duyck@gmail.com wrote: > >>> >On Sun, Mar 4, 2018 at 10:50 AM, Jiri Pirko <jiri@resnulli.us> wrote: > >>> >> Sun, Mar 04, 2018 at 07:24:12PM CET, alexander.duyck@gmail.com wrote: > >>> >>>On Sat, Mar 3, 2018 at 11:13 PM, Jiri Pirko <jiri@resnulli.us> wrote: > >>> > >>> [...] > >>> > >>> > > >>> >>>Currently we only have agreement from Michael on taking this code, as > >>> >>>such we are working with virtio only for now. When the time comes that > >>> >> > >>> >> If you do duplication of netvsc in-driver bonding in virtio_net, it will > >>> >> stay there forever. So what you say is: "We will do it halfway now > >>> >> and promise to fix it later". That later will never happen, I'm pretty > >>> >> sure. That is why I push for in-driver bonding shared code as a part of > >>> >> this patchset. > >>> > > >>> >You want this new approach and a copy of netvsc moved into either core > >>> >or some module of its own. I say pick an architecture. We are looking > >>> >at either 2 netdevs or 3. We are not going to support both because > >>> >that will ultimately lead to a terrible user experience and make > >>> >things quite confusing. > >>> > > >>> >> + if you would be pushing first driver to do this, I would understand. > >>> >> But the first driver is already in. You are pushing second. This is the > >>> >> time to do the sharing, unification of behaviour. Next time is too late. > >>> > > >>> >That is great, if we want to share then lets share. But what you are > >>> >essentially telling us is that we need to fork this solution and > >>> >maintain two code paths, one for 2 netdevs, and another for 3. At that > >>> >point what is the point in merging them together? > >>> > >>> Of course, I vote for the same behaviour for netvsc and virtio_net. That > >>> is my point from the very beginning. > >>> > >>> Stephen, what do you think? Could we please make virtio_net and netvsc > >>> behave the same and to use a single code with well-defined checks and > >>> restrictions for this feature? > >> > >>Eventually, yes both could share common code routines. In reality, > >>the failover stuff is only a very small part of either driver so > >>it is not worth stretching to try and cover too much. If you look, > >>the failover code is just using routines that already exist for > >>use by bonding, teaming, etc. > > > > Yeah, we consern was also about the code that processes the netdev > > notifications and does auto-enslave and all related stuff. > > The concern was the driver model. If we expose 3 netdevs or 2 with the > VF driver present. Somehow this is turning into a "merge netvsc into > virtio" think and that isn't the subject that was being asked. > > Ideally we want one model for this. Either 3 netdevs or 2. The problem > is 2 causes issues in terms of performance and will limit features of > virtio, but 2 is the precedent set by netvsc. We need to figure out > the path forward for this. There is talk about "sharing" but it is > hard to make these two approaches share code when they are doing two > very different setups and end up presenting themselves as two very > different driver models. I appreciate this discussion, and it has helped a lot. Netvsc is stuck with 2 netdev model for the foreseeable future. We already failed once with the bonding model, and that created a lot of pain. The current model is working well and have convinced the major distros to support the two netdev model and don't want to back. Very open to optimizations and ways to smooth out the rough edges.
On Mon, Mar 5, 2018 at 7:15 PM, Stephen Hemminger <stephen@networkplumber.org> wrote: > On Mon, 5 Mar 2018 14:47:20 -0800 > Alexander Duyck <alexander.duyck@gmail.com> wrote: > >> On Mon, Mar 5, 2018 at 2:30 PM, Jiri Pirko <jiri@resnulli.us> wrote: >> > Mon, Mar 05, 2018 at 05:11:32PM CET, stephen@networkplumber.org wrote: >> >>On Mon, 5 Mar 2018 10:21:18 +0100 >> >>Jiri Pirko <jiri@resnulli.us> wrote: >> >> >> >>> Sun, Mar 04, 2018 at 10:58:34PM CET, alexander.duyck@gmail.com wrote: >> >>> >On Sun, Mar 4, 2018 at 10:50 AM, Jiri Pirko <jiri@resnulli.us> wrote: >> >>> >> Sun, Mar 04, 2018 at 07:24:12PM CET, alexander.duyck@gmail.com wrote: >> >>> >>>On Sat, Mar 3, 2018 at 11:13 PM, Jiri Pirko <jiri@resnulli.us> wrote: >> >>> >> >>> [...] >> >>> >> >>> > >> >>> >>>Currently we only have agreement from Michael on taking this code, as >> >>> >>>such we are working with virtio only for now. When the time comes that >> >>> >> >> >>> >> If you do duplication of netvsc in-driver bonding in virtio_net, it will >> >>> >> stay there forever. So what you say is: "We will do it halfway now >> >>> >> and promise to fix it later". That later will never happen, I'm pretty >> >>> >> sure. That is why I push for in-driver bonding shared code as a part of >> >>> >> this patchset. >> >>> > >> >>> >You want this new approach and a copy of netvsc moved into either core >> >>> >or some module of its own. I say pick an architecture. We are looking >> >>> >at either 2 netdevs or 3. We are not going to support both because >> >>> >that will ultimately lead to a terrible user experience and make >> >>> >things quite confusing. >> >>> > >> >>> >> + if you would be pushing first driver to do this, I would understand. >> >>> >> But the first driver is already in. You are pushing second. This is the >> >>> >> time to do the sharing, unification of behaviour. Next time is too late. >> >>> > >> >>> >That is great, if we want to share then lets share. But what you are >> >>> >essentially telling us is that we need to fork this solution and >> >>> >maintain two code paths, one for 2 netdevs, and another for 3. At that >> >>> >point what is the point in merging them together? >> >>> >> >>> Of course, I vote for the same behaviour for netvsc and virtio_net. That >> >>> is my point from the very beginning. >> >>> >> >>> Stephen, what do you think? Could we please make virtio_net and netvsc >> >>> behave the same and to use a single code with well-defined checks and >> >>> restrictions for this feature? >> >> >> >>Eventually, yes both could share common code routines. In reality, >> >>the failover stuff is only a very small part of either driver so >> >>it is not worth stretching to try and cover too much. If you look, >> >>the failover code is just using routines that already exist for >> >>use by bonding, teaming, etc. >> > >> > Yeah, we consern was also about the code that processes the netdev >> > notifications and does auto-enslave and all related stuff. >> >> The concern was the driver model. If we expose 3 netdevs or 2 with the >> VF driver present. Somehow this is turning into a "merge netvsc into >> virtio" think and that isn't the subject that was being asked. >> >> Ideally we want one model for this. Either 3 netdevs or 2. The problem >> is 2 causes issues in terms of performance and will limit features of >> virtio, but 2 is the precedent set by netvsc. We need to figure out >> the path forward for this. There is talk about "sharing" but it is >> hard to make these two approaches share code when they are doing two >> very different setups and end up presenting themselves as two very >> different driver models. > > I appreciate this discussion, and it has helped a lot. > > Netvsc is stuck with 2 netdev model for the foreseeable future. > We already failed once with the bonding model, and that created a lot of > pain. The current model is working well and have convinced the major distros > to support the two netdev model and don't want to back. > > Very open to optimizations and ways to smooth out the rough edges. Thank you for clarifying this Stephen. Okay. So with things defined such that we are doing a 2 netdev model for netvsc, and a 3 netdev model for virtio, is it still in our interest for us to try making a shared library between the two? In my mind, the virtnet_bypass becomes the way we go forward for any future solutions. I say we treat the netvsc approach as a "legacy" approach and avoid creating any new libraries or drivers to support it, and instead just focus on the 3 netdev approach as the way this is to be done going forward. That way we avoid anyone else trying to implement something like the 2 netdev solution in the future. So getting back to the code here. Should we split the virtnet_bypass code out into a separate module? My preference would be to let this incubate as a part of virtio_net until either there is another user, or it becomes big enough that it needs to be moved. That said, there can be arguments for either keeping it in, or making it a module on its own depending on where we see this going in the future. - Alex
Tue, Mar 06, 2018 at 08:08:21PM CET, alexander.duyck@gmail.com wrote: >On Mon, Mar 5, 2018 at 7:15 PM, Stephen Hemminger ><stephen@networkplumber.org> wrote: >> On Mon, 5 Mar 2018 14:47:20 -0800 >> Alexander Duyck <alexander.duyck@gmail.com> wrote: >> >>> On Mon, Mar 5, 2018 at 2:30 PM, Jiri Pirko <jiri@resnulli.us> wrote: >>> > Mon, Mar 05, 2018 at 05:11:32PM CET, stephen@networkplumber.org wrote: >>> >>On Mon, 5 Mar 2018 10:21:18 +0100 >>> >>Jiri Pirko <jiri@resnulli.us> wrote: >>> >> >>> >>> Sun, Mar 04, 2018 at 10:58:34PM CET, alexander.duyck@gmail.com wrote: >>> >>> >On Sun, Mar 4, 2018 at 10:50 AM, Jiri Pirko <jiri@resnulli.us> wrote: >>> >>> >> Sun, Mar 04, 2018 at 07:24:12PM CET, alexander.duyck@gmail.com wrote: >>> >>> >>>On Sat, Mar 3, 2018 at 11:13 PM, Jiri Pirko <jiri@resnulli.us> wrote: >>> >>> >>> >>> [...] >>> >>> >>> >>> > >>> >>> >>>Currently we only have agreement from Michael on taking this code, as >>> >>> >>>such we are working with virtio only for now. When the time comes that >>> >>> >> >>> >>> >> If you do duplication of netvsc in-driver bonding in virtio_net, it will >>> >>> >> stay there forever. So what you say is: "We will do it halfway now >>> >>> >> and promise to fix it later". That later will never happen, I'm pretty >>> >>> >> sure. That is why I push for in-driver bonding shared code as a part of >>> >>> >> this patchset. >>> >>> > >>> >>> >You want this new approach and a copy of netvsc moved into either core >>> >>> >or some module of its own. I say pick an architecture. We are looking >>> >>> >at either 2 netdevs or 3. We are not going to support both because >>> >>> >that will ultimately lead to a terrible user experience and make >>> >>> >things quite confusing. >>> >>> > >>> >>> >> + if you would be pushing first driver to do this, I would understand. >>> >>> >> But the first driver is already in. You are pushing second. This is the >>> >>> >> time to do the sharing, unification of behaviour. Next time is too late. >>> >>> > >>> >>> >That is great, if we want to share then lets share. But what you are >>> >>> >essentially telling us is that we need to fork this solution and >>> >>> >maintain two code paths, one for 2 netdevs, and another for 3. At that >>> >>> >point what is the point in merging them together? >>> >>> >>> >>> Of course, I vote for the same behaviour for netvsc and virtio_net. That >>> >>> is my point from the very beginning. >>> >>> >>> >>> Stephen, what do you think? Could we please make virtio_net and netvsc >>> >>> behave the same and to use a single code with well-defined checks and >>> >>> restrictions for this feature? >>> >> >>> >>Eventually, yes both could share common code routines. In reality, >>> >>the failover stuff is only a very small part of either driver so >>> >>it is not worth stretching to try and cover too much. If you look, >>> >>the failover code is just using routines that already exist for >>> >>use by bonding, teaming, etc. >>> > >>> > Yeah, we consern was also about the code that processes the netdev >>> > notifications and does auto-enslave and all related stuff. >>> >>> The concern was the driver model. If we expose 3 netdevs or 2 with the >>> VF driver present. Somehow this is turning into a "merge netvsc into >>> virtio" think and that isn't the subject that was being asked. >>> >>> Ideally we want one model for this. Either 3 netdevs or 2. The problem >>> is 2 causes issues in terms of performance and will limit features of >>> virtio, but 2 is the precedent set by netvsc. We need to figure out >>> the path forward for this. There is talk about "sharing" but it is >>> hard to make these two approaches share code when they are doing two >>> very different setups and end up presenting themselves as two very >>> different driver models. >> >> I appreciate this discussion, and it has helped a lot. >> >> Netvsc is stuck with 2 netdev model for the foreseeable future. >> We already failed once with the bonding model, and that created a lot of >> pain. The current model is working well and have convinced the major distros >> to support the two netdev model and don't want to back. >> >> Very open to optimizations and ways to smooth out the rough edges. > >Thank you for clarifying this Stephen. > >Okay. So with things defined such that we are doing a 2 netdev model >for netvsc, and a 3 netdev model for virtio, is it still in our >interest for us to try making a shared library between the two? In my >mind, the virtnet_bypass becomes the way we go forward for any future >solutions. I say we treat the netvsc approach as a "legacy" approach >and avoid creating any new libraries or drivers to support it, and >instead just focus on the 3 netdev approach as the way this is to be >done going forward. That way we avoid anyone else trying to implement >something like the 2 netdev solution in the future. > >So getting back to the code here. Should we split the virtnet_bypass >code out into a separate module? My preference would be to let this >incubate as a part of virtio_net until either there is another user, >or it becomes big enough that it needs to be moved. That said, there What do you mean by "big enough"? I hope that is will never be extended by anything. If you need to do anything more complex, you should use bong/team. I definitelly vote for a separate common shared code for both netvsc and virtio_net - even if you use 2 and 3 netdev model, you could share the common code. Strict checks and limitation should be in place.
On Tue, Mar 6, 2018 at 2:59 PM, Jiri Pirko <jiri@resnulli.us> wrote: > Tue, Mar 06, 2018 at 08:08:21PM CET, alexander.duyck@gmail.com wrote: >>On Mon, Mar 5, 2018 at 7:15 PM, Stephen Hemminger >><stephen@networkplumber.org> wrote: >>> On Mon, 5 Mar 2018 14:47:20 -0800 >>> Alexander Duyck <alexander.duyck@gmail.com> wrote: >>> >>>> On Mon, Mar 5, 2018 at 2:30 PM, Jiri Pirko <jiri@resnulli.us> wrote: >>>> > Mon, Mar 05, 2018 at 05:11:32PM CET, stephen@networkplumber.org wrote: >>>> >>On Mon, 5 Mar 2018 10:21:18 +0100 >>>> >>Jiri Pirko <jiri@resnulli.us> wrote: >>>> >> >>>> >>> Sun, Mar 04, 2018 at 10:58:34PM CET, alexander.duyck@gmail.com wrote: >>>> >>> >On Sun, Mar 4, 2018 at 10:50 AM, Jiri Pirko <jiri@resnulli.us> wrote: >>>> >>> >> Sun, Mar 04, 2018 at 07:24:12PM CET, alexander.duyck@gmail.com wrote: >>>> >>> >>>On Sat, Mar 3, 2018 at 11:13 PM, Jiri Pirko <jiri@resnulli.us> wrote: >>>> >>> >>>> >>> [...] >>>> >>> >>>> >>> > >>>> >>> >>>Currently we only have agreement from Michael on taking this code, as >>>> >>> >>>such we are working with virtio only for now. When the time comes that >>>> >>> >> >>>> >>> >> If you do duplication of netvsc in-driver bonding in virtio_net, it will >>>> >>> >> stay there forever. So what you say is: "We will do it halfway now >>>> >>> >> and promise to fix it later". That later will never happen, I'm pretty >>>> >>> >> sure. That is why I push for in-driver bonding shared code as a part of >>>> >>> >> this patchset. >>>> >>> > >>>> >>> >You want this new approach and a copy of netvsc moved into either core >>>> >>> >or some module of its own. I say pick an architecture. We are looking >>>> >>> >at either 2 netdevs or 3. We are not going to support both because >>>> >>> >that will ultimately lead to a terrible user experience and make >>>> >>> >things quite confusing. >>>> >>> > >>>> >>> >> + if you would be pushing first driver to do this, I would understand. >>>> >>> >> But the first driver is already in. You are pushing second. This is the >>>> >>> >> time to do the sharing, unification of behaviour. Next time is too late. >>>> >>> > >>>> >>> >That is great, if we want to share then lets share. But what you are >>>> >>> >essentially telling us is that we need to fork this solution and >>>> >>> >maintain two code paths, one for 2 netdevs, and another for 3. At that >>>> >>> >point what is the point in merging them together? >>>> >>> >>>> >>> Of course, I vote for the same behaviour for netvsc and virtio_net. That >>>> >>> is my point from the very beginning. >>>> >>> >>>> >>> Stephen, what do you think? Could we please make virtio_net and netvsc >>>> >>> behave the same and to use a single code with well-defined checks and >>>> >>> restrictions for this feature? >>>> >> >>>> >>Eventually, yes both could share common code routines. In reality, >>>> >>the failover stuff is only a very small part of either driver so >>>> >>it is not worth stretching to try and cover too much. If you look, >>>> >>the failover code is just using routines that already exist for >>>> >>use by bonding, teaming, etc. >>>> > >>>> > Yeah, we consern was also about the code that processes the netdev >>>> > notifications and does auto-enslave and all related stuff. >>>> >>>> The concern was the driver model. If we expose 3 netdevs or 2 with the >>>> VF driver present. Somehow this is turning into a "merge netvsc into >>>> virtio" think and that isn't the subject that was being asked. >>>> >>>> Ideally we want one model for this. Either 3 netdevs or 2. The problem >>>> is 2 causes issues in terms of performance and will limit features of >>>> virtio, but 2 is the precedent set by netvsc. We need to figure out >>>> the path forward for this. There is talk about "sharing" but it is >>>> hard to make these two approaches share code when they are doing two >>>> very different setups and end up presenting themselves as two very >>>> different driver models. >>> >>> I appreciate this discussion, and it has helped a lot. >>> >>> Netvsc is stuck with 2 netdev model for the foreseeable future. >>> We already failed once with the bonding model, and that created a lot of >>> pain. The current model is working well and have convinced the major distros >>> to support the two netdev model and don't want to back. >>> >>> Very open to optimizations and ways to smooth out the rough edges. >> >>Thank you for clarifying this Stephen. >> >>Okay. So with things defined such that we are doing a 2 netdev model >>for netvsc, and a 3 netdev model for virtio, is it still in our >>interest for us to try making a shared library between the two? In my >>mind, the virtnet_bypass becomes the way we go forward for any future >>solutions. I say we treat the netvsc approach as a "legacy" approach >>and avoid creating any new libraries or drivers to support it, and >>instead just focus on the 3 netdev approach as the way this is to be >>done going forward. That way we avoid anyone else trying to implement >>something like the 2 netdev solution in the future. >> >>So getting back to the code here. Should we split the virtnet_bypass >>code out into a separate module? My preference would be to let this >>incubate as a part of virtio_net until either there is another user, >>or it becomes big enough that it needs to be moved. That said, there > > What do you mean by "big enough"? I hope that is will never be extended > by anything. If you need to do anything more complex, you should use > bong/team. I wasn't thinking so much extra features as enough workaround for discovered issues. Really this should be more-or-less feature locked. > I definitelly vote for a separate common shared code for both netvsc and > virtio_net - even if you use 2 and 3 netdev model, you could share the > common code. Strict checks and limitation should be in place. Noted. But as I also mentioned there isn't that much "common" code between the two models. I think if anything we could probably look at peeling out a few bits such as "get_<iface>_bymac" which really would become dev_get_by_mac_and_ops in order to find the device for the notifiers. I probably wouldn't even put that in our driver and would instead put it in the core code since it almost makes more sense there. Beyond that sharing becomes much more challenging due to the differences in the Rx and Tx paths that build out of the difference between the 2 driver and 3 driver models.
On Tue, Mar 06, 2018 at 03:27:46PM -0800, Alexander Duyck wrote: > > I definitelly vote for a separate common shared code for both netvsc and > > virtio_net - even if you use 2 and 3 netdev model, you could share the > > common code. Strict checks and limitation should be in place. > > Noted. But as I also mentioned there isn't that much "common" code > between the two models. I think if anything we could probably look at > peeling out a few bits such as "get_<iface>_bymac" which really would > become dev_get_by_mac_and_ops in order to find the device for the > notifiers. I probably wouldn't even put that in our driver and would > instead put it in the core code since it almost makes more sense > there. Beyond that sharing becomes much more challenging due to the > differences in the Rx and Tx paths that build out of the difference > between the 2 driver and 3 driver models. At this point it might be worth it to articulate the advantages of the 3 netdev model. If they are compelling, why wouldn't netvsc users want them? Alex, I think you were one of the strongest proponents of this model, you should be well placed to provide a summary.
On Tue, Mar 6, 2018 at 6:38 PM, Michael S. Tsirkin <mst@redhat.com> wrote: > On Tue, Mar 06, 2018 at 03:27:46PM -0800, Alexander Duyck wrote: >> > I definitelly vote for a separate common shared code for both netvsc and >> > virtio_net - even if you use 2 and 3 netdev model, you could share the >> > common code. Strict checks and limitation should be in place. >> >> Noted. But as I also mentioned there isn't that much "common" code >> between the two models. I think if anything we could probably look at >> peeling out a few bits such as "get_<iface>_bymac" which really would >> become dev_get_by_mac_and_ops in order to find the device for the >> notifiers. I probably wouldn't even put that in our driver and would >> instead put it in the core code since it almost makes more sense >> there. Beyond that sharing becomes much more challenging due to the >> differences in the Rx and Tx paths that build out of the difference >> between the 2 driver and 3 driver models. > > At this point it might be worth it to articulate the advantages > of the 3 netdev model. The main advantages are performance on the lower devices. Specifically we can run LLTX and IFF_NOQUEUE on the upper device, meaning the VF doesn't take a performance hit when we start transmitting through it. In addition the paravirtual device is able to fully expose it's features, so for example virtio_net can maintain in-driver XDP, and we can provide generic XDP on the upper device. We can also have an asymmetric configuration where the number of queues or feature sets can be different between the ports. Basically what it comes down to is in the 3 netdev model we never have to deal with the issues of going from being a standard netdev to being a master netdev. The role of each netdev is clearly defined. As far as doing a generic solution it is the preferred way to go as well since we don't have to rewrite functionality of the lower device. Currently the only changes really needed are to add a phys_port_name operation so that you can distinguish between the bypass interface and the original paravirtual one. As such you have no confusion about what you are running. > If they are compelling, why wouldn't netvsc users want them? I believe part of the logic for Stephen's choice is that netvsc doesn't have a "BACKUP" bit like what we have virtio_net. As a result they have no way of knowing if a VF will ever show up or not. That makes the 3 netdev solution less appealing as they always end up in the bonding mode. In addition they have intentionally limited themselves to avoid features such as XDP that would cause issues with the 2 netdev model. > Alex, I think you were one of the strongest proponents of this model, > you should be well placed to provide a summary. Hopefully the information provided helps. In my mind the issue is that netvsc was not designed correctly, and as such it is using the 2 netdev model as a bit of a crutch to handle the case where they wanted to add a VF bypass. At this point it has become a part of their architecture so it would be confusing for their user base to deal with having 2 netdevs spawn every time their driver sees a new device. In the case of virtio we only spawn 2 netdevs if the backup bit is set which implies a specific use case. When the bit isn't set we are only spawning one device, and as a result we can get much better performance out of the resultant combination of devices, and can expose functionality such as in-driver XDP in virtio.
On Wed, 7 Mar 2018 09:50:50 -0800 Alexander Duyck <alexander.duyck@gmail.com> wrote: > On Tue, Mar 6, 2018 at 6:38 PM, Michael S. Tsirkin <mst@redhat.com> wrote: > > On Tue, Mar 06, 2018 at 03:27:46PM -0800, Alexander Duyck wrote: > >> > I definitelly vote for a separate common shared code for both netvsc and > >> > virtio_net - even if you use 2 and 3 netdev model, you could share the > >> > common code. Strict checks and limitation should be in place. > >> > >> Noted. But as I also mentioned there isn't that much "common" code > >> between the two models. I think if anything we could probably look at > >> peeling out a few bits such as "get_<iface>_bymac" which really would > >> become dev_get_by_mac_and_ops in order to find the device for the > >> notifiers. I probably wouldn't even put that in our driver and would > >> instead put it in the core code since it almost makes more sense > >> there. Beyond that sharing becomes much more challenging due to the > >> differences in the Rx and Tx paths that build out of the difference > >> between the 2 driver and 3 driver models. > > > > At this point it might be worth it to articulate the advantages > > of the 3 netdev model. > > The main advantages are performance on the lower devices. Specifically > we can run LLTX and IFF_NOQUEUE on the upper device, meaning the VF > doesn't take a performance hit when we start transmitting through it. > In addition the paravirtual device is able to fully expose it's > features, so for example virtio_net can maintain in-driver XDP, and we > can provide generic XDP on the upper device. We can also have an > asymmetric configuration where the number of queues or feature sets > can be different between the ports. > > Basically what it comes down to is in the 3 netdev model we never have > to deal with the issues of going from being a standard netdev to being > a master netdev. The role of each netdev is clearly defined. > > As far as doing a generic solution it is the preferred way to go as > well since we don't have to rewrite functionality of the lower device. > Currently the only changes really needed are to add a phys_port_name > operation so that you can distinguish between the bypass interface and > the original paravirtual one. As such you have no confusion about what > you are running. > > > If they are compelling, why wouldn't netvsc users want them? > > I believe part of the logic for Stephen's choice is that netvsc > doesn't have a "BACKUP" bit like what we have virtio_net. As a result > they have no way of knowing if a VF will ever show up or not. That > makes the 3 netdev solution less appealing as they always end up in > the bonding mode. In addition they have intentionally limited > themselves to avoid features such as XDP that would cause issues with > the 2 netdev model. > > > Alex, I think you were one of the strongest proponents of this model, > > you should be well placed to provide a summary. > > Hopefully the information provided helps. In my mind the issue is that > netvsc was not designed correctly, and as such it is using the 2 > netdev model as a bit of a crutch to handle the case where they wanted > to add a VF bypass. At this point it has become a part of their > architecture so it would be confusing for their user base to deal with > having 2 netdevs spawn every time their driver sees a new device. In > the case of virtio we only spawn 2 netdevs if the backup bit is set > which implies a specific use case. When the bit isn't set we are only > spawning one device, and as a result we can get much better > performance out of the resultant combination of devices, and can > expose functionality such as in-driver XDP in virtio. I have experimented with toggling IFF_NOQUEUE and/or LLTX on the netvsc device when doing passthrough. It didn't help performance much and there are corner cases that make it painful, like what if qdisc was placed by user on the netvsc device? Should the qdisc then be moved back and forth to the VF device when it arrives or is removed? As far as XDP support, it is on the plan to support XDP on the netvsc device since the receive buffers do have to be copied. But there has been no customer demand for it; and the VF model on Azure has limitations which make a transparent XDP model pretty much impossible.
On Wed, Mar 7, 2018 at 10:06 AM, Stephen Hemminger <stephen@networkplumber.org> wrote: > On Wed, 7 Mar 2018 09:50:50 -0800 > Alexander Duyck <alexander.duyck@gmail.com> wrote: > >> On Tue, Mar 6, 2018 at 6:38 PM, Michael S. Tsirkin <mst@redhat.com> wrote: >> > On Tue, Mar 06, 2018 at 03:27:46PM -0800, Alexander Duyck wrote: >> >> > I definitelly vote for a separate common shared code for both netvsc and >> >> > virtio_net - even if you use 2 and 3 netdev model, you could share the >> >> > common code. Strict checks and limitation should be in place. >> >> >> >> Noted. But as I also mentioned there isn't that much "common" code >> >> between the two models. I think if anything we could probably look at >> >> peeling out a few bits such as "get_<iface>_bymac" which really would >> >> become dev_get_by_mac_and_ops in order to find the device for the >> >> notifiers. I probably wouldn't even put that in our driver and would >> >> instead put it in the core code since it almost makes more sense >> >> there. Beyond that sharing becomes much more challenging due to the >> >> differences in the Rx and Tx paths that build out of the difference >> >> between the 2 driver and 3 driver models. >> > >> > At this point it might be worth it to articulate the advantages >> > of the 3 netdev model. >> >> The main advantages are performance on the lower devices. Specifically >> we can run LLTX and IFF_NOQUEUE on the upper device, meaning the VF >> doesn't take a performance hit when we start transmitting through it. >> In addition the paravirtual device is able to fully expose it's >> features, so for example virtio_net can maintain in-driver XDP, and we >> can provide generic XDP on the upper device. We can also have an >> asymmetric configuration where the number of queues or feature sets >> can be different between the ports. >> >> Basically what it comes down to is in the 3 netdev model we never have >> to deal with the issues of going from being a standard netdev to being >> a master netdev. The role of each netdev is clearly defined. >> >> As far as doing a generic solution it is the preferred way to go as >> well since we don't have to rewrite functionality of the lower device. >> Currently the only changes really needed are to add a phys_port_name >> operation so that you can distinguish between the bypass interface and >> the original paravirtual one. As such you have no confusion about what >> you are running. >> >> > If they are compelling, why wouldn't netvsc users want them? >> >> I believe part of the logic for Stephen's choice is that netvsc >> doesn't have a "BACKUP" bit like what we have virtio_net. As a result >> they have no way of knowing if a VF will ever show up or not. That >> makes the 3 netdev solution less appealing as they always end up in >> the bonding mode. In addition they have intentionally limited >> themselves to avoid features such as XDP that would cause issues with >> the 2 netdev model. >> >> > Alex, I think you were one of the strongest proponents of this model, >> > you should be well placed to provide a summary. >> >> Hopefully the information provided helps. In my mind the issue is that >> netvsc was not designed correctly, and as such it is using the 2 >> netdev model as a bit of a crutch to handle the case where they wanted >> to add a VF bypass. At this point it has become a part of their >> architecture so it would be confusing for their user base to deal with >> having 2 netdevs spawn every time their driver sees a new device. In >> the case of virtio we only spawn 2 netdevs if the backup bit is set >> which implies a specific use case. When the bit isn't set we are only >> spawning one device, and as a result we can get much better >> performance out of the resultant combination of devices, and can >> expose functionality such as in-driver XDP in virtio. > > > I have experimented with toggling IFF_NOQUEUE and/or LLTX on the netvsc > device when doing passthrough. It didn't help performance much and there > are corner cases that make it painful, like what if qdisc was placed > by user on the netvsc device? Should the qdisc then be moved back > and forth to the VF device when it arrives or is removed? The advantages of IFF_NOQUEUE and LLTX would mostly be seen on small packets, and I wouldn't risk toggling them on/off as doing so with packets in flight might lead to odd results. That was one of the reasons why I wanted to just stick with the 3 netdev model for us, and leave you guys running with the 2 netdev model. > As far as XDP support, it is on the plan to support XDP on the netvsc > device since the receive buffers do have to be copied. But there has > been no customer demand for it; and the VF model on Azure has limitations > which make a transparent XDP model pretty much impossible. I wasn't saying you needed to implement XDP in netvsc. My point was that in virtio we already have it and it would be confusing for us to have a virtio driver that had to drop support for it in order to support the 2 netdev model.
On Wed, Mar 07, 2018 at 10:06:30AM -0800, Stephen Hemminger wrote: > On Wed, 7 Mar 2018 09:50:50 -0800 > Alexander Duyck <alexander.duyck@gmail.com> wrote: > > > On Tue, Mar 6, 2018 at 6:38 PM, Michael S. Tsirkin <mst@redhat.com> wrote: > > > On Tue, Mar 06, 2018 at 03:27:46PM -0800, Alexander Duyck wrote: > > >> > I definitelly vote for a separate common shared code for both netvsc and > > >> > virtio_net - even if you use 2 and 3 netdev model, you could share the > > >> > common code. Strict checks and limitation should be in place. > > >> > > >> Noted. But as I also mentioned there isn't that much "common" code > > >> between the two models. I think if anything we could probably look at > > >> peeling out a few bits such as "get_<iface>_bymac" which really would > > >> become dev_get_by_mac_and_ops in order to find the device for the > > >> notifiers. I probably wouldn't even put that in our driver and would > > >> instead put it in the core code since it almost makes more sense > > >> there. Beyond that sharing becomes much more challenging due to the > > >> differences in the Rx and Tx paths that build out of the difference > > >> between the 2 driver and 3 driver models. > > > > > > At this point it might be worth it to articulate the advantages > > > of the 3 netdev model. > > > > The main advantages are performance on the lower devices. Specifically > > we can run LLTX and IFF_NOQUEUE on the upper device, meaning the VF > > doesn't take a performance hit when we start transmitting through it. > > In addition the paravirtual device is able to fully expose it's > > features, so for example virtio_net can maintain in-driver XDP, and we > > can provide generic XDP on the upper device. We can also have an > > asymmetric configuration where the number of queues or feature sets > > can be different between the ports. > > > > Basically what it comes down to is in the 3 netdev model we never have > > to deal with the issues of going from being a standard netdev to being > > a master netdev. The role of each netdev is clearly defined. > > > > As far as doing a generic solution it is the preferred way to go as > > well since we don't have to rewrite functionality of the lower device. > > Currently the only changes really needed are to add a phys_port_name > > operation so that you can distinguish between the bypass interface and > > the original paravirtual one. As such you have no confusion about what > > you are running. > > > > > If they are compelling, why wouldn't netvsc users want them? > > > > I believe part of the logic for Stephen's choice is that netvsc > > doesn't have a "BACKUP" bit like what we have virtio_net. As a result > > they have no way of knowing if a VF will ever show up or not. That > > makes the 3 netdev solution less appealing as they always end up in > > the bonding mode. In addition they have intentionally limited > > themselves to avoid features such as XDP that would cause issues with > > the 2 netdev model. > > > > > Alex, I think you were one of the strongest proponents of this model, > > > you should be well placed to provide a summary. > > > > Hopefully the information provided helps. In my mind the issue is that > > netvsc was not designed correctly, and as such it is using the 2 > > netdev model as a bit of a crutch to handle the case where they wanted > > to add a VF bypass. At this point it has become a part of their > > architecture so it would be confusing for their user base to deal with > > having 2 netdevs spawn every time their driver sees a new device. In > > the case of virtio we only spawn 2 netdevs if the backup bit is set > > which implies a specific use case. When the bit isn't set we are only > > spawning one device, and as a result we can get much better > > performance out of the resultant combination of devices, and can > > expose functionality such as in-driver XDP in virtio. > > > I have experimented with toggling IFF_NOQUEUE and/or LLTX on the netvsc > device when doing passthrough. It didn't help performance much and there > are corner cases that make it painful, like what if qdisc was placed > by user on the netvsc device? Should the qdisc then be moved back > and forth to the VF device when it arrives or is removed? > > As far as XDP support, it is on the plan to support XDP on the netvsc > device since the receive buffers do have to be copied. But there has > been no customer demand for it; and the VF model on Azure has limitations > which make a transparent XDP model pretty much impossible. Jiri, does that answer your question? Even though there is some similarity, the needs of netvsc and virtio appear significantly different. We do not yet know whether other PV devices will be virtio-like or netvsc-like. Do you agree?
On 3/7/2018 12:11 PM, Michael S. Tsirkin wrote: > On Wed, Mar 07, 2018 at 10:06:30AM -0800, Stephen Hemminger wrote: >> On Wed, 7 Mar 2018 09:50:50 -0800 >> Alexander Duyck <alexander.duyck@gmail.com> wrote: >> >>> On Tue, Mar 6, 2018 at 6:38 PM, Michael S. Tsirkin <mst@redhat.com> wrote: >>>> On Tue, Mar 06, 2018 at 03:27:46PM -0800, Alexander Duyck wrote: >>>>>> I definitelly vote for a separate common shared code for both netvsc and >>>>>> virtio_net - even if you use 2 and 3 netdev model, you could share the >>>>>> common code. Strict checks and limitation should be in place. >>>>> Noted. But as I also mentioned there isn't that much "common" code >>>>> between the two models. I think if anything we could probably look at >>>>> peeling out a few bits such as "get_<iface>_bymac" which really would >>>>> become dev_get_by_mac_and_ops in order to find the device for the >>>>> notifiers. I probably wouldn't even put that in our driver and would >>>>> instead put it in the core code since it almost makes more sense >>>>> there. Beyond that sharing becomes much more challenging due to the >>>>> differences in the Rx and Tx paths that build out of the difference >>>>> between the 2 driver and 3 driver models. >>>> At this point it might be worth it to articulate the advantages >>>> of the 3 netdev model. >>> The main advantages are performance on the lower devices. Specifically >>> we can run LLTX and IFF_NOQUEUE on the upper device, meaning the VF >>> doesn't take a performance hit when we start transmitting through it. >>> In addition the paravirtual device is able to fully expose it's >>> features, so for example virtio_net can maintain in-driver XDP, and we >>> can provide generic XDP on the upper device. We can also have an >>> asymmetric configuration where the number of queues or feature sets >>> can be different between the ports. >>> >>> Basically what it comes down to is in the 3 netdev model we never have >>> to deal with the issues of going from being a standard netdev to being >>> a master netdev. The role of each netdev is clearly defined. >>> >>> As far as doing a generic solution it is the preferred way to go as >>> well since we don't have to rewrite functionality of the lower device. >>> Currently the only changes really needed are to add a phys_port_name >>> operation so that you can distinguish between the bypass interface and >>> the original paravirtual one. As such you have no confusion about what >>> you are running. >>> >>>> If they are compelling, why wouldn't netvsc users want them? >>> I believe part of the logic for Stephen's choice is that netvsc >>> doesn't have a "BACKUP" bit like what we have virtio_net. As a result >>> they have no way of knowing if a VF will ever show up or not. That >>> makes the 3 netdev solution less appealing as they always end up in >>> the bonding mode. In addition they have intentionally limited >>> themselves to avoid features such as XDP that would cause issues with >>> the 2 netdev model. >>> >>>> Alex, I think you were one of the strongest proponents of this model, >>>> you should be well placed to provide a summary. >>> Hopefully the information provided helps. In my mind the issue is that >>> netvsc was not designed correctly, and as such it is using the 2 >>> netdev model as a bit of a crutch to handle the case where they wanted >>> to add a VF bypass. At this point it has become a part of their >>> architecture so it would be confusing for their user base to deal with >>> having 2 netdevs spawn every time their driver sees a new device. In >>> the case of virtio we only spawn 2 netdevs if the backup bit is set >>> which implies a specific use case. When the bit isn't set we are only >>> spawning one device, and as a result we can get much better >>> performance out of the resultant combination of devices, and can >>> expose functionality such as in-driver XDP in virtio. >> >> I have experimented with toggling IFF_NOQUEUE and/or LLTX on the netvsc >> device when doing passthrough. It didn't help performance much and there >> are corner cases that make it painful, like what if qdisc was placed >> by user on the netvsc device? Should the qdisc then be moved back >> and forth to the VF device when it arrives or is removed? >> >> As far as XDP support, it is on the plan to support XDP on the netvsc >> device since the receive buffers do have to be copied. But there has >> been no customer demand for it; and the VF model on Azure has limitations >> which make a transparent XDP model pretty much impossible. > > Jiri, does that answer your question? Even though there is some > similarity, the needs of netvsc and virtio appear significantly > different. > > We do not yet know whether other PV devices will be virtio-like > or netvsc-like. > > Do you agree? Michael, At this point can we agree to go with 3 netdev model for virtio and as this is not compatible with netvsc's 2 netdev model, the scope for any common code between virtio and netvsc is very limited. Should i submit a v5 version of this patch series with some minor fixes? Or can you pull in this series and i can submit patches on top of that? Thanks Sridhar
Thu, Mar 01, 2018 at 09:08:43PM CET, sridhar.samudrala@intel.com wrote: >This patch enables virtio_net to switch over to a VF datapath when a VF >netdev is present with the same MAC address. It allows live migration >of a VM with a direct attached VF without the need to setup a bond/team >between a VF and virtio net device in the guest. > >The hypervisor needs to enable only one datapath at any time so that >packets don't get looped back to the VM over the other datapath. When a VF >is plugged, the virtio datapath link state can be marked as down. The >hypervisor needs to unplug the VF device from the guest on the source host >and reset the MAC filter of the VF to initiate failover of datapath to >virtio before starting the migration. After the migration is completed, >the destination hypervisor sets the MAC filter on the VF and plugs it back >to the guest to switch over to VF datapath. > >When BACKUP feature is enabled, an additional netdev(bypass netdev) is >created that acts as a master device and tracks the state of the 2 lower >netdevs. The original virtio_net netdev is marked as 'backup' netdev and a >passthru device with the same MAC is registered as 'active' netdev. > >This patch is based on the discussion initiated by Jesse on this thread. >https://marc.info/?l=linux-virtualization&m=151189725224231&w=2 > >Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com> >Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> >Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com> [...] >+static int virtnet_bypass_register_child(struct net_device *child_netdev) >+{ >+ struct virtnet_bypass_info *vbi; >+ struct net_device *dev; >+ bool backup; >+ int ret; >+ >+ if (child_netdev->addr_len != ETH_ALEN) >+ return NOTIFY_DONE; >+ >+ /* We will use the MAC address to locate the virtnet_bypass netdev >+ * to associate with the child netdev. If we don't find a matching >+ * bypass netdev, move on. >+ */ >+ dev = get_virtnet_bypass_bymac(dev_net(child_netdev), >+ child_netdev->perm_addr); >+ if (!dev) >+ return NOTIFY_DONE; >+ >+ vbi = netdev_priv(dev); >+ backup = (child_netdev->dev.parent == dev->dev.parent); >+ if (backup ? rtnl_dereference(vbi->backup_netdev) : >+ rtnl_dereference(vbi->active_netdev)) { >+ netdev_info(dev, >+ "%s attempting to join bypass dev when %s already present\n", >+ child_netdev->name, backup ? "backup" : "active"); >+ return NOTIFY_DONE; >+ } >+ >+ /* Avoid non pci devices as active netdev */ >+ if (!backup && (!child_netdev->dev.parent || >+ !dev_is_pci(child_netdev->dev.parent))) >+ return NOTIFY_DONE; >+ >+ ret = netdev_rx_handler_register(child_netdev, >+ virtnet_bypass_handle_frame, dev); >+ if (ret != 0) { >+ netdev_err(child_netdev, >+ "can not register bypass receive handler (err = %d)\n", >+ ret); >+ goto rx_handler_failed; >+ } >+ >+ ret = netdev_upper_dev_link(child_netdev, dev, NULL); >+ if (ret != 0) { >+ netdev_err(child_netdev, >+ "can not set master device %s (err = %d)\n", >+ dev->name, ret); >+ goto upper_link_failed; >+ } >+ >+ child_netdev->flags |= IFF_SLAVE; >+ >+ if (netif_running(dev)) { >+ ret = dev_open(child_netdev); >+ if (ret && (ret != -EBUSY)) { >+ netdev_err(dev, "Opening child %s failed ret:%d\n", >+ child_netdev->name, ret); >+ goto err_interface_up; >+ } >+ } Much of this function is copy of netvsc_vf_join, should be shared with netvsc. >+ >+ /* Align MTU of child with master */ >+ ret = dev_set_mtu(child_netdev, dev->mtu); >+ if (ret) { >+ netdev_err(dev, >+ "unable to change mtu of %s to %u register failed\n", >+ child_netdev->name, dev->mtu); >+ goto err_set_mtu; >+ } >+ >+ call_netdevice_notifiers(NETDEV_JOIN, child_netdev); >+ >+ netdev_info(dev, "registering %s\n", child_netdev->name); >+ >+ dev_hold(child_netdev); >+ if (backup) { >+ rcu_assign_pointer(vbi->backup_netdev, child_netdev); >+ dev_get_stats(vbi->backup_netdev, &vbi->backup_stats); >+ } else { >+ rcu_assign_pointer(vbi->active_netdev, child_netdev); >+ dev_get_stats(vbi->active_netdev, &vbi->active_stats); >+ dev->min_mtu = child_netdev->min_mtu; >+ dev->max_mtu = child_netdev->max_mtu; >+ } >+ >+ return NOTIFY_OK; >+ >+err_set_mtu: >+ dev_close(child_netdev); >+err_interface_up: >+ netdev_upper_dev_unlink(child_netdev, dev); >+ child_netdev->flags &= ~IFF_SLAVE; >+upper_link_failed: >+ netdev_rx_handler_unregister(child_netdev); >+rx_handler_failed: >+ return NOTIFY_DONE; >+} >+ >+static int virtnet_bypass_unregister_child(struct net_device *child_netdev) >+{ >+ struct virtnet_bypass_info *vbi; >+ struct net_device *dev, *backup; >+ >+ dev = get_virtnet_bypass_byref(child_netdev); >+ if (!dev) >+ return NOTIFY_DONE; >+ >+ vbi = netdev_priv(dev); >+ >+ netdev_info(dev, "unregistering %s\n", child_netdev->name); >+ >+ netdev_rx_handler_unregister(child_netdev); >+ netdev_upper_dev_unlink(child_netdev, dev); >+ child_netdev->flags &= ~IFF_SLAVE; >+ >+ if (child_netdev->dev.parent == dev->dev.parent) { >+ RCU_INIT_POINTER(vbi->backup_netdev, NULL); >+ } else { >+ RCU_INIT_POINTER(vbi->active_netdev, NULL); >+ backup = rtnl_dereference(vbi->backup_netdev); >+ if (backup) { >+ dev->min_mtu = backup->min_mtu; >+ dev->max_mtu = backup->max_mtu; >+ } >+ } >+ >+ dev_put(child_netdev); >+ >+ return NOTIFY_OK; >+} >+ >+static int virtnet_bypass_update_link(struct net_device *child_netdev) >+{ >+ struct net_device *dev, *active, *backup; >+ struct virtnet_bypass_info *vbi; >+ >+ dev = get_virtnet_bypass_byref(child_netdev); >+ if (!dev || !netif_running(dev)) >+ return NOTIFY_DONE; >+ >+ vbi = netdev_priv(dev); >+ >+ active = rtnl_dereference(vbi->active_netdev); >+ backup = rtnl_dereference(vbi->backup_netdev); >+ >+ if ((active && virtnet_bypass_xmit_ready(active)) || >+ (backup && virtnet_bypass_xmit_ready(backup))) { >+ netif_carrier_on(dev); >+ netif_tx_wake_all_queues(dev); >+ } else { >+ netif_carrier_off(dev); >+ netif_tx_stop_all_queues(dev); >+ } >+ >+ return NOTIFY_OK; >+} >+ >+static int virtnet_bypass_event(struct notifier_block *this, >+ unsigned long event, void *ptr) >+{ >+ struct net_device *event_dev = netdev_notifier_info_to_dev(ptr); >+ >+ /* Skip our own events */ >+ if (event_dev->netdev_ops == &virtnet_bypass_netdev_ops) >+ return NOTIFY_DONE; >+ >+ /* Avoid non-Ethernet type devices */ >+ if (event_dev->type != ARPHRD_ETHER) >+ return NOTIFY_DONE; >+ >+ /* Avoid Vlan dev with same MAC registering as child dev */ >+ if (is_vlan_dev(event_dev)) >+ return NOTIFY_DONE; >+ >+ /* Avoid Bonding master dev with same MAC registering as child dev */ >+ if ((event_dev->priv_flags & IFF_BONDING) && >+ (event_dev->flags & IFF_MASTER)) >+ return NOTIFY_DONE; >+ >+ switch (event) { >+ case NETDEV_REGISTER: >+ return virtnet_bypass_register_child(event_dev); >+ case NETDEV_UNREGISTER: >+ return virtnet_bypass_unregister_child(event_dev); >+ case NETDEV_UP: >+ case NETDEV_DOWN: >+ case NETDEV_CHANGE: >+ return virtnet_bypass_update_link(event_dev); >+ default: >+ return NOTIFY_DONE; >+ } >+} For example this function is 1:1 copy of netvsc, even with comments and bugs (like IFF_BODING check). This is also something that should be shared with netvsc. >+ >+static struct notifier_block virtnet_bypass_notifier = { >+ .notifier_call = virtnet_bypass_event, >+}; >+ >+static int virtnet_bypass_create(struct virtnet_info *vi) >+{ >+ struct net_device *backup_netdev = vi->dev; >+ struct device *dev = &vi->vdev->dev; >+ struct net_device *bypass_netdev; >+ int res; >+ >+ /* Alloc at least 2 queues, for now we are going with 16 assuming >+ * that most devices being bonded won't have too many queues. >+ */ >+ bypass_netdev = alloc_etherdev_mq(sizeof(struct virtnet_bypass_info), >+ 16); >+ if (!bypass_netdev) { >+ dev_err(dev, "Unable to allocate bypass_netdev!\n"); >+ return -ENOMEM; >+ } >+ >+ dev_net_set(bypass_netdev, dev_net(backup_netdev)); >+ SET_NETDEV_DEV(bypass_netdev, dev); >+ >+ bypass_netdev->netdev_ops = &virtnet_bypass_netdev_ops; >+ bypass_netdev->ethtool_ops = &virtnet_bypass_ethtool_ops; >+ >+ /* Initialize the device options */ >+ bypass_netdev->flags |= IFF_MASTER; >+ bypass_netdev->priv_flags |= IFF_BONDING | IFF_UNICAST_FLT | No clue why you set IFF_BONDING here...
On 3/12/2018 1:12 PM, Jiri Pirko wrote: > Thu, Mar 01, 2018 at 09:08:43PM CET, sridhar.samudrala@intel.com wrote: >> This patch enables virtio_net to switch over to a VF datapath when a VF >> netdev is present with the same MAC address. It allows live migration >> of a VM with a direct attached VF without the need to setup a bond/team >> between a VF and virtio net device in the guest. >> >> The hypervisor needs to enable only one datapath at any time so that >> packets don't get looped back to the VM over the other datapath. When a VF >> is plugged, the virtio datapath link state can be marked as down. The >> hypervisor needs to unplug the VF device from the guest on the source host >> and reset the MAC filter of the VF to initiate failover of datapath to >> virtio before starting the migration. After the migration is completed, >> the destination hypervisor sets the MAC filter on the VF and plugs it back >> to the guest to switch over to VF datapath. >> >> When BACKUP feature is enabled, an additional netdev(bypass netdev) is >> created that acts as a master device and tracks the state of the 2 lower >> netdevs. The original virtio_net netdev is marked as 'backup' netdev and a >> passthru device with the same MAC is registered as 'active' netdev. >> >> This patch is based on the discussion initiated by Jesse on this thread. >> https://marc.info/?l=linux-virtualization&m=151189725224231&w=2 >> >> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com> >> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> >> Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com> > [...] > > >> +static int virtnet_bypass_register_child(struct net_device *child_netdev) >> +{ >> + struct virtnet_bypass_info *vbi; >> + struct net_device *dev; >> + bool backup; >> + int ret; >> + >> + if (child_netdev->addr_len != ETH_ALEN) >> + return NOTIFY_DONE; >> + >> + /* We will use the MAC address to locate the virtnet_bypass netdev >> + * to associate with the child netdev. If we don't find a matching >> + * bypass netdev, move on. >> + */ >> + dev = get_virtnet_bypass_bymac(dev_net(child_netdev), >> + child_netdev->perm_addr); >> + if (!dev) >> + return NOTIFY_DONE; >> + >> + vbi = netdev_priv(dev); >> + backup = (child_netdev->dev.parent == dev->dev.parent); >> + if (backup ? rtnl_dereference(vbi->backup_netdev) : >> + rtnl_dereference(vbi->active_netdev)) { >> + netdev_info(dev, >> + "%s attempting to join bypass dev when %s already present\n", >> + child_netdev->name, backup ? "backup" : "active"); >> + return NOTIFY_DONE; >> + } >> + >> + /* Avoid non pci devices as active netdev */ >> + if (!backup && (!child_netdev->dev.parent || >> + !dev_is_pci(child_netdev->dev.parent))) >> + return NOTIFY_DONE; >> + >> + ret = netdev_rx_handler_register(child_netdev, >> + virtnet_bypass_handle_frame, dev); >> + if (ret != 0) { >> + netdev_err(child_netdev, >> + "can not register bypass receive handler (err = %d)\n", >> + ret); >> + goto rx_handler_failed; >> + } >> + >> + ret = netdev_upper_dev_link(child_netdev, dev, NULL); >> + if (ret != 0) { >> + netdev_err(child_netdev, >> + "can not set master device %s (err = %d)\n", >> + dev->name, ret); >> + goto upper_link_failed; >> + } >> + >> + child_netdev->flags |= IFF_SLAVE; >> + >> + if (netif_running(dev)) { >> + ret = dev_open(child_netdev); >> + if (ret && (ret != -EBUSY)) { >> + netdev_err(dev, "Opening child %s failed ret:%d\n", >> + child_netdev->name, ret); >> + goto err_interface_up; >> + } >> + } > Much of this function is copy of netvsc_vf_join, should be shared with > netvsc. Yes. This is based on netvsc_register_vf() and netvsc_vf_join(). But modified to handle registration of 2 child netdevs(backup virtio and active vf). In case of netvsc, they only register VF. There is no backup netdev. I think trying to make this code shareable with netvsc will introduce additional checks and make it convoluted. > > >> + >> + /* Align MTU of child with master */ >> + ret = dev_set_mtu(child_netdev, dev->mtu); >> + if (ret) { >> + netdev_err(dev, >> + "unable to change mtu of %s to %u register failed\n", >> + child_netdev->name, dev->mtu); >> + goto err_set_mtu; >> + } >> + >> + call_netdevice_notifiers(NETDEV_JOIN, child_netdev); >> + >> + netdev_info(dev, "registering %s\n", child_netdev->name); >> + >> + dev_hold(child_netdev); >> + if (backup) { >> + rcu_assign_pointer(vbi->backup_netdev, child_netdev); >> + dev_get_stats(vbi->backup_netdev, &vbi->backup_stats); >> + } else { >> + rcu_assign_pointer(vbi->active_netdev, child_netdev); >> + dev_get_stats(vbi->active_netdev, &vbi->active_stats); >> + dev->min_mtu = child_netdev->min_mtu; >> + dev->max_mtu = child_netdev->max_mtu; >> + } >> + >> + return NOTIFY_OK; >> + >> +err_set_mtu: >> + dev_close(child_netdev); >> +err_interface_up: >> + netdev_upper_dev_unlink(child_netdev, dev); >> + child_netdev->flags &= ~IFF_SLAVE; >> +upper_link_failed: >> + netdev_rx_handler_unregister(child_netdev); >> +rx_handler_failed: >> + return NOTIFY_DONE; >> +} >> + >> +static int virtnet_bypass_unregister_child(struct net_device *child_netdev) >> +{ >> + struct virtnet_bypass_info *vbi; >> + struct net_device *dev, *backup; >> + >> + dev = get_virtnet_bypass_byref(child_netdev); >> + if (!dev) >> + return NOTIFY_DONE; >> + >> + vbi = netdev_priv(dev); >> + >> + netdev_info(dev, "unregistering %s\n", child_netdev->name); >> + >> + netdev_rx_handler_unregister(child_netdev); >> + netdev_upper_dev_unlink(child_netdev, dev); >> + child_netdev->flags &= ~IFF_SLAVE; >> + >> + if (child_netdev->dev.parent == dev->dev.parent) { >> + RCU_INIT_POINTER(vbi->backup_netdev, NULL); >> + } else { >> + RCU_INIT_POINTER(vbi->active_netdev, NULL); >> + backup = rtnl_dereference(vbi->backup_netdev); >> + if (backup) { >> + dev->min_mtu = backup->min_mtu; >> + dev->max_mtu = backup->max_mtu; >> + } >> + } >> + >> + dev_put(child_netdev); >> + >> + return NOTIFY_OK; >> +} >> + >> +static int virtnet_bypass_update_link(struct net_device *child_netdev) >> +{ >> + struct net_device *dev, *active, *backup; >> + struct virtnet_bypass_info *vbi; >> + >> + dev = get_virtnet_bypass_byref(child_netdev); >> + if (!dev || !netif_running(dev)) >> + return NOTIFY_DONE; >> + >> + vbi = netdev_priv(dev); >> + >> + active = rtnl_dereference(vbi->active_netdev); >> + backup = rtnl_dereference(vbi->backup_netdev); >> + >> + if ((active && virtnet_bypass_xmit_ready(active)) || >> + (backup && virtnet_bypass_xmit_ready(backup))) { >> + netif_carrier_on(dev); >> + netif_tx_wake_all_queues(dev); >> + } else { >> + netif_carrier_off(dev); >> + netif_tx_stop_all_queues(dev); >> + } >> + >> + return NOTIFY_OK; >> +} >> + >> +static int virtnet_bypass_event(struct notifier_block *this, >> + unsigned long event, void *ptr) >> +{ >> + struct net_device *event_dev = netdev_notifier_info_to_dev(ptr); >> + >> + /* Skip our own events */ >> + if (event_dev->netdev_ops == &virtnet_bypass_netdev_ops) >> + return NOTIFY_DONE; >> + >> + /* Avoid non-Ethernet type devices */ >> + if (event_dev->type != ARPHRD_ETHER) >> + return NOTIFY_DONE; >> + >> + /* Avoid Vlan dev with same MAC registering as child dev */ >> + if (is_vlan_dev(event_dev)) >> + return NOTIFY_DONE; >> + >> + /* Avoid Bonding master dev with same MAC registering as child dev */ >> + if ((event_dev->priv_flags & IFF_BONDING) && >> + (event_dev->flags & IFF_MASTER)) >> + return NOTIFY_DONE; >> + >> + switch (event) { >> + case NETDEV_REGISTER: >> + return virtnet_bypass_register_child(event_dev); >> + case NETDEV_UNREGISTER: >> + return virtnet_bypass_unregister_child(event_dev); >> + case NETDEV_UP: >> + case NETDEV_DOWN: >> + case NETDEV_CHANGE: >> + return virtnet_bypass_update_link(event_dev); >> + default: >> + return NOTIFY_DONE; >> + } >> +} > For example this function is 1:1 copy of netvsc, even with comments > and bugs (like IFF_BODING check). > > This is also something that should be shared with netvsc. The problem is with the calls that are made from this function. Sharing would have been possible if both virtio and netvsc went with same netdev model. > > >> + >> +static struct notifier_block virtnet_bypass_notifier = { >> + .notifier_call = virtnet_bypass_event, >> +}; >> + >> +static int virtnet_bypass_create(struct virtnet_info *vi) >> +{ >> + struct net_device *backup_netdev = vi->dev; >> + struct device *dev = &vi->vdev->dev; >> + struct net_device *bypass_netdev; >> + int res; >> + >> + /* Alloc at least 2 queues, for now we are going with 16 assuming >> + * that most devices being bonded won't have too many queues. >> + */ >> + bypass_netdev = alloc_etherdev_mq(sizeof(struct virtnet_bypass_info), >> + 16); >> + if (!bypass_netdev) { >> + dev_err(dev, "Unable to allocate bypass_netdev!\n"); >> + return -ENOMEM; >> + } >> + >> + dev_net_set(bypass_netdev, dev_net(backup_netdev)); >> + SET_NETDEV_DEV(bypass_netdev, dev); >> + >> + bypass_netdev->netdev_ops = &virtnet_bypass_netdev_ops; >> + bypass_netdev->ethtool_ops = &virtnet_bypass_ethtool_ops; >> + >> + /* Initialize the device options */ >> + bypass_netdev->flags |= IFF_MASTER; >> + bypass_netdev->priv_flags |= IFF_BONDING | IFF_UNICAST_FLT | > No clue why you set IFF_BONDING here... This is to indicate to the userspace that this is a master bond device. May be it is sufficient to just set IFF_MASTER. > > > > >
Mon, Mar 12, 2018 at 09:58:06PM CET, sridhar.samudrala@intel.com wrote: > > >On 3/12/2018 1:12 PM, Jiri Pirko wrote: >> Thu, Mar 01, 2018 at 09:08:43PM CET, sridhar.samudrala@intel.com wrote: >> > This patch enables virtio_net to switch over to a VF datapath when a VF >> > netdev is present with the same MAC address. It allows live migration >> > of a VM with a direct attached VF without the need to setup a bond/team >> > between a VF and virtio net device in the guest. >> > >> > The hypervisor needs to enable only one datapath at any time so that >> > packets don't get looped back to the VM over the other datapath. When a VF >> > is plugged, the virtio datapath link state can be marked as down. The >> > hypervisor needs to unplug the VF device from the guest on the source host >> > and reset the MAC filter of the VF to initiate failover of datapath to >> > virtio before starting the migration. After the migration is completed, >> > the destination hypervisor sets the MAC filter on the VF and plugs it back >> > to the guest to switch over to VF datapath. >> > >> > When BACKUP feature is enabled, an additional netdev(bypass netdev) is >> > created that acts as a master device and tracks the state of the 2 lower >> > netdevs. The original virtio_net netdev is marked as 'backup' netdev and a >> > passthru device with the same MAC is registered as 'active' netdev. >> > >> > This patch is based on the discussion initiated by Jesse on this thread. >> > https://marc.info/?l=linux-virtualization&m=151189725224231&w=2 >> > >> > Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com> >> > Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> >> > Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com> >> [...] >> >> >> > +static int virtnet_bypass_register_child(struct net_device *child_netdev) >> > +{ >> > + struct virtnet_bypass_info *vbi; >> > + struct net_device *dev; >> > + bool backup; >> > + int ret; >> > + >> > + if (child_netdev->addr_len != ETH_ALEN) >> > + return NOTIFY_DONE; >> > + >> > + /* We will use the MAC address to locate the virtnet_bypass netdev >> > + * to associate with the child netdev. If we don't find a matching >> > + * bypass netdev, move on. >> > + */ >> > + dev = get_virtnet_bypass_bymac(dev_net(child_netdev), >> > + child_netdev->perm_addr); >> > + if (!dev) >> > + return NOTIFY_DONE; >> > + >> > + vbi = netdev_priv(dev); >> > + backup = (child_netdev->dev.parent == dev->dev.parent); >> > + if (backup ? rtnl_dereference(vbi->backup_netdev) : >> > + rtnl_dereference(vbi->active_netdev)) { >> > + netdev_info(dev, >> > + "%s attempting to join bypass dev when %s already present\n", >> > + child_netdev->name, backup ? "backup" : "active"); >> > + return NOTIFY_DONE; >> > + } >> > + >> > + /* Avoid non pci devices as active netdev */ >> > + if (!backup && (!child_netdev->dev.parent || >> > + !dev_is_pci(child_netdev->dev.parent))) >> > + return NOTIFY_DONE; >> > + >> > + ret = netdev_rx_handler_register(child_netdev, >> > + virtnet_bypass_handle_frame, dev); >> > + if (ret != 0) { >> > + netdev_err(child_netdev, >> > + "can not register bypass receive handler (err = %d)\n", >> > + ret); >> > + goto rx_handler_failed; >> > + } >> > + >> > + ret = netdev_upper_dev_link(child_netdev, dev, NULL); >> > + if (ret != 0) { >> > + netdev_err(child_netdev, >> > + "can not set master device %s (err = %d)\n", >> > + dev->name, ret); >> > + goto upper_link_failed; >> > + } >> > + >> > + child_netdev->flags |= IFF_SLAVE; >> > + >> > + if (netif_running(dev)) { >> > + ret = dev_open(child_netdev); >> > + if (ret && (ret != -EBUSY)) { >> > + netdev_err(dev, "Opening child %s failed ret:%d\n", >> > + child_netdev->name, ret); >> > + goto err_interface_up; >> > + } >> > + } >> Much of this function is copy of netvsc_vf_join, should be shared with >> netvsc. > >Yes. This is based on netvsc_register_vf() and netvsc_vf_join(). But modified >to handle registration of 2 child netdevs(backup virtio and active vf). In case >of netvsc, they only register VF. There is no backup netdev. >I think trying to make this code shareable with netvsc will introduce additional >checks and make it convoluted. No problem. > > >> >> >> > + >> > + /* Align MTU of child with master */ >> > + ret = dev_set_mtu(child_netdev, dev->mtu); >> > + if (ret) { >> > + netdev_err(dev, >> > + "unable to change mtu of %s to %u register failed\n", >> > + child_netdev->name, dev->mtu); >> > + goto err_set_mtu; >> > + } >> > + >> > + call_netdevice_notifiers(NETDEV_JOIN, child_netdev); >> > + >> > + netdev_info(dev, "registering %s\n", child_netdev->name); >> > + >> > + dev_hold(child_netdev); >> > + if (backup) { >> > + rcu_assign_pointer(vbi->backup_netdev, child_netdev); >> > + dev_get_stats(vbi->backup_netdev, &vbi->backup_stats); >> > + } else { >> > + rcu_assign_pointer(vbi->active_netdev, child_netdev); >> > + dev_get_stats(vbi->active_netdev, &vbi->active_stats); >> > + dev->min_mtu = child_netdev->min_mtu; >> > + dev->max_mtu = child_netdev->max_mtu; >> > + } >> > + >> > + return NOTIFY_OK; >> > + >> > +err_set_mtu: >> > + dev_close(child_netdev); >> > +err_interface_up: >> > + netdev_upper_dev_unlink(child_netdev, dev); >> > + child_netdev->flags &= ~IFF_SLAVE; >> > +upper_link_failed: >> > + netdev_rx_handler_unregister(child_netdev); >> > +rx_handler_failed: >> > + return NOTIFY_DONE; >> > +} >> > + >> > +static int virtnet_bypass_unregister_child(struct net_device *child_netdev) >> > +{ >> > + struct virtnet_bypass_info *vbi; >> > + struct net_device *dev, *backup; >> > + >> > + dev = get_virtnet_bypass_byref(child_netdev); >> > + if (!dev) >> > + return NOTIFY_DONE; >> > + >> > + vbi = netdev_priv(dev); >> > + >> > + netdev_info(dev, "unregistering %s\n", child_netdev->name); >> > + >> > + netdev_rx_handler_unregister(child_netdev); >> > + netdev_upper_dev_unlink(child_netdev, dev); >> > + child_netdev->flags &= ~IFF_SLAVE; >> > + >> > + if (child_netdev->dev.parent == dev->dev.parent) { >> > + RCU_INIT_POINTER(vbi->backup_netdev, NULL); >> > + } else { >> > + RCU_INIT_POINTER(vbi->active_netdev, NULL); >> > + backup = rtnl_dereference(vbi->backup_netdev); >> > + if (backup) { >> > + dev->min_mtu = backup->min_mtu; >> > + dev->max_mtu = backup->max_mtu; >> > + } >> > + } >> > + >> > + dev_put(child_netdev); >> > + >> > + return NOTIFY_OK; >> > +} >> > + >> > +static int virtnet_bypass_update_link(struct net_device *child_netdev) >> > +{ >> > + struct net_device *dev, *active, *backup; >> > + struct virtnet_bypass_info *vbi; >> > + >> > + dev = get_virtnet_bypass_byref(child_netdev); >> > + if (!dev || !netif_running(dev)) >> > + return NOTIFY_DONE; >> > + >> > + vbi = netdev_priv(dev); >> > + >> > + active = rtnl_dereference(vbi->active_netdev); >> > + backup = rtnl_dereference(vbi->backup_netdev); >> > + >> > + if ((active && virtnet_bypass_xmit_ready(active)) || >> > + (backup && virtnet_bypass_xmit_ready(backup))) { >> > + netif_carrier_on(dev); >> > + netif_tx_wake_all_queues(dev); >> > + } else { >> > + netif_carrier_off(dev); >> > + netif_tx_stop_all_queues(dev); >> > + } >> > + >> > + return NOTIFY_OK; >> > +} >> > + >> > +static int virtnet_bypass_event(struct notifier_block *this, >> > + unsigned long event, void *ptr) >> > +{ >> > + struct net_device *event_dev = netdev_notifier_info_to_dev(ptr); >> > + >> > + /* Skip our own events */ >> > + if (event_dev->netdev_ops == &virtnet_bypass_netdev_ops) >> > + return NOTIFY_DONE; >> > + >> > + /* Avoid non-Ethernet type devices */ >> > + if (event_dev->type != ARPHRD_ETHER) >> > + return NOTIFY_DONE; >> > + >> > + /* Avoid Vlan dev with same MAC registering as child dev */ >> > + if (is_vlan_dev(event_dev)) >> > + return NOTIFY_DONE; >> > + >> > + /* Avoid Bonding master dev with same MAC registering as child dev */ >> > + if ((event_dev->priv_flags & IFF_BONDING) && >> > + (event_dev->flags & IFF_MASTER)) >> > + return NOTIFY_DONE; >> > + >> > + switch (event) { >> > + case NETDEV_REGISTER: >> > + return virtnet_bypass_register_child(event_dev); >> > + case NETDEV_UNREGISTER: >> > + return virtnet_bypass_unregister_child(event_dev); >> > + case NETDEV_UP: >> > + case NETDEV_DOWN: >> > + case NETDEV_CHANGE: >> > + return virtnet_bypass_update_link(event_dev); >> > + default: >> > + return NOTIFY_DONE; >> > + } >> > +} >> For example this function is 1:1 copy of netvsc, even with comments >> and bugs (like IFF_BODING check). >> >> This is also something that should be shared with netvsc. > >The problem is with the calls that are made from this function. >Sharing would have been possible if both virtio and netvsc went with >same netdev model. ops? You can still share. > >> >> >> > + >> > +static struct notifier_block virtnet_bypass_notifier = { >> > + .notifier_call = virtnet_bypass_event, >> > +}; >> > + >> > +static int virtnet_bypass_create(struct virtnet_info *vi) >> > +{ >> > + struct net_device *backup_netdev = vi->dev; >> > + struct device *dev = &vi->vdev->dev; >> > + struct net_device *bypass_netdev; >> > + int res; >> > + >> > + /* Alloc at least 2 queues, for now we are going with 16 assuming >> > + * that most devices being bonded won't have too many queues. >> > + */ >> > + bypass_netdev = alloc_etherdev_mq(sizeof(struct virtnet_bypass_info), >> > + 16); >> > + if (!bypass_netdev) { >> > + dev_err(dev, "Unable to allocate bypass_netdev!\n"); >> > + return -ENOMEM; >> > + } >> > + >> > + dev_net_set(bypass_netdev, dev_net(backup_netdev)); >> > + SET_NETDEV_DEV(bypass_netdev, dev); >> > + >> > + bypass_netdev->netdev_ops = &virtnet_bypass_netdev_ops; >> > + bypass_netdev->ethtool_ops = &virtnet_bypass_ethtool_ops; >> > + >> > + /* Initialize the device options */ >> > + bypass_netdev->flags |= IFF_MASTER; >> > + bypass_netdev->priv_flags |= IFF_BONDING | IFF_UNICAST_FLT | >> No clue why you set IFF_BONDING here... > >This is to indicate to the userspace that this is a master bond device. This is not a master bond device! If you say this, it makes me really worried.... >May be it is sufficient to just set IFF_MASTER. > >> >> >> >> >> >
On Sat, Mar 3, 2018 at 8:04 PM, Michael S. Tsirkin <mst@redhat.com> wrote: > On Fri, Mar 02, 2018 at 03:56:31PM -0800, Siwei Liu wrote: >> On Fri, Mar 2, 2018 at 1:36 PM, Michael S. Tsirkin <mst@redhat.com> wrote: >> > On Fri, Mar 02, 2018 at 01:11:56PM -0800, Siwei Liu wrote: >> >> On Thu, Mar 1, 2018 at 12:08 PM, Sridhar Samudrala >> >> <sridhar.samudrala@intel.com> wrote: >> >> > This patch enables virtio_net to switch over to a VF datapath when a VF >> >> > netdev is present with the same MAC address. It allows live migration >> >> > of a VM with a direct attached VF without the need to setup a bond/team >> >> > between a VF and virtio net device in the guest. >> >> > >> >> > The hypervisor needs to enable only one datapath at any time so that >> >> > packets don't get looped back to the VM over the other datapath. When a VF >> >> > is plugged, the virtio datapath link state can be marked as down. The >> >> > hypervisor needs to unplug the VF device from the guest on the source host >> >> > and reset the MAC filter of the VF to initiate failover of datapath to >> >> > virtio before starting the migration. After the migration is completed, >> >> > the destination hypervisor sets the MAC filter on the VF and plugs it back >> >> > to the guest to switch over to VF datapath. >> >> > >> >> > When BACKUP feature is enabled, an additional netdev(bypass netdev) is >> >> > created that acts as a master device and tracks the state of the 2 lower >> >> > netdevs. The original virtio_net netdev is marked as 'backup' netdev and a >> >> > passthru device with the same MAC is registered as 'active' netdev. >> >> > >> >> > This patch is based on the discussion initiated by Jesse on this thread. >> >> > https://marc.info/?l=linux-virtualization&m=151189725224231&w=2 >> >> > >> >> > Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com> >> >> > Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> >> >> > Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com> >> >> > --- >> >> > drivers/net/virtio_net.c | 683 ++++++++++++++++++++++++++++++++++++++++++++++- >> >> > 1 file changed, 682 insertions(+), 1 deletion(-) >> >> > >> >> > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c >> >> > index bcd13fe906ca..f2860d86c952 100644 >> >> > --- a/drivers/net/virtio_net.c >> >> > +++ b/drivers/net/virtio_net.c >> >> > @@ -30,6 +30,8 @@ >> >> > #include <linux/cpu.h> >> >> > #include <linux/average.h> >> >> > #include <linux/filter.h> >> >> > +#include <linux/netdevice.h> >> >> > +#include <linux/pci.h> >> >> > #include <net/route.h> >> >> > #include <net/xdp.h> >> >> > >> >> > @@ -206,6 +208,9 @@ struct virtnet_info { >> >> > u32 speed; >> >> > >> >> > unsigned long guest_offloads; >> >> > + >> >> > + /* upper netdev created when BACKUP feature enabled */ >> >> > + struct net_device *bypass_netdev; >> >> > }; >> >> > >> >> > struct padded_vnet_hdr { >> >> > @@ -2236,6 +2241,22 @@ static int virtnet_xdp(struct net_device *dev, struct netdev_bpf *xdp) >> >> > } >> >> > } >> >> > >> >> > +static int virtnet_get_phys_port_name(struct net_device *dev, char *buf, >> >> > + size_t len) >> >> > +{ >> >> > + struct virtnet_info *vi = netdev_priv(dev); >> >> > + int ret; >> >> > + >> >> > + if (!virtio_has_feature(vi->vdev, VIRTIO_NET_F_BACKUP)) >> >> > + return -EOPNOTSUPP; >> >> > + >> >> > + ret = snprintf(buf, len, "_bkup"); >> >> > + if (ret >= len) >> >> > + return -EOPNOTSUPP; >> >> > + >> >> > + return 0; >> >> > +} >> >> > + >> >> >> >> What if the systemd/udevd is not new enough to enforce the >> >> n<phys_port_name> naming? Would virtio_bypass get a different name >> >> than the original virtio_net? >> > >> > You mean people using ethX names? Any hardware config change breaks >> > these, I don't think that can be helped. >> >> I don't like the way to rely on .ndo_get_phys_port_name - it's fragile >> and it does not completely solve the problem it tries to address. >> Imagine what can end up with if getting an old udevd, or users already >> have exsiting explicit udev rules around phys_port_name. It does not >> give you the an ack in saying "yes, I know you're the bypass and >> you're the backup, please continue and I will give you both correct >> names", or an unacknowlegment saying "no, I don't know what these >> extra interfaces are, please go back and leave the VF device alone". >> We need new udev API for both feature negotiation and naming, or may >> even completely hide the lower interfaces. > > Go ahead and try to make this happen, but I won't hold my > breath. Heads-up: I have some working code to hide the lower devices, upon which the new udev/sysfs interace for feature negotiation and naming will be built. I will get it posted for review in the next few days once ready. Hopefully this could end the fight between the 3-netdev and 2-netdev driver model, as I see most of the problems being argued in this thread can be addressed by just hiding the lower devices from the 3-netdev model while keeping its merits still. Regards, -Siwei > >> > >> >> Should we detect this earlier and fall >> >> back to legacy mode without creating the bypass netdev and ensalving >> >> the VF? >> > >> > I don't think we can do this with existing kernel/userspace APIs. >> >> That's why I ever said to make udev aware of this new type of combined >> device instead of doing hacks here and there around. >> >> Regards, >> -Siwei > > We can add new interfaces on top but the main purpose here is to > make old userspace do new tricks. > >> > >> > -- >> > MST
Apologies, still some comments going. Please see inline. On Thu, Mar 1, 2018 at 12:08 PM, Sridhar Samudrala <sridhar.samudrala@intel.com> wrote: > This patch enables virtio_net to switch over to a VF datapath when a VF > netdev is present with the same MAC address. It allows live migration > of a VM with a direct attached VF without the need to setup a bond/team > between a VF and virtio net device in the guest. > > The hypervisor needs to enable only one datapath at any time so that > packets don't get looped back to the VM over the other datapath. When a VF > is plugged, the virtio datapath link state can be marked as down. The > hypervisor needs to unplug the VF device from the guest on the source host > and reset the MAC filter of the VF to initiate failover of datapath to > virtio before starting the migration. After the migration is completed, > the destination hypervisor sets the MAC filter on the VF and plugs it back > to the guest to switch over to VF datapath. > > When BACKUP feature is enabled, an additional netdev(bypass netdev) is > created that acts as a master device and tracks the state of the 2 lower > netdevs. The original virtio_net netdev is marked as 'backup' netdev and a > passthru device with the same MAC is registered as 'active' netdev. > > This patch is based on the discussion initiated by Jesse on this thread. > https://marc.info/?l=linux-virtualization&m=151189725224231&w=2 > > Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com> > Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> > Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com> > --- > drivers/net/virtio_net.c | 683 ++++++++++++++++++++++++++++++++++++++++++++++- > 1 file changed, 682 insertions(+), 1 deletion(-) > > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c > index bcd13fe906ca..f2860d86c952 100644 > --- a/drivers/net/virtio_net.c > +++ b/drivers/net/virtio_net.c > @@ -30,6 +30,8 @@ > #include <linux/cpu.h> > #include <linux/average.h> > #include <linux/filter.h> > +#include <linux/netdevice.h> > +#include <linux/pci.h> > #include <net/route.h> > #include <net/xdp.h> > > @@ -206,6 +208,9 @@ struct virtnet_info { > u32 speed; > > unsigned long guest_offloads; > + > + /* upper netdev created when BACKUP feature enabled */ > + struct net_device *bypass_netdev; > }; > > struct padded_vnet_hdr { > @@ -2236,6 +2241,22 @@ static int virtnet_xdp(struct net_device *dev, struct netdev_bpf *xdp) > } > } > > +static int virtnet_get_phys_port_name(struct net_device *dev, char *buf, > + size_t len) > +{ > + struct virtnet_info *vi = netdev_priv(dev); > + int ret; > + > + if (!virtio_has_feature(vi->vdev, VIRTIO_NET_F_BACKUP)) > + return -EOPNOTSUPP; > + > + ret = snprintf(buf, len, "_bkup"); > + if (ret >= len) > + return -EOPNOTSUPP; > + > + return 0; > +} > + > static const struct net_device_ops virtnet_netdev = { > .ndo_open = virtnet_open, > .ndo_stop = virtnet_close, > @@ -2253,6 +2274,7 @@ static const struct net_device_ops virtnet_netdev = { > .ndo_xdp_xmit = virtnet_xdp_xmit, > .ndo_xdp_flush = virtnet_xdp_flush, > .ndo_features_check = passthru_features_check, > + .ndo_get_phys_port_name = virtnet_get_phys_port_name, > }; > > static void virtnet_config_changed_work(struct work_struct *work) > @@ -2647,6 +2669,653 @@ static int virtnet_validate(struct virtio_device *vdev) > return 0; > } > > +/* START of functions supporting VIRTIO_NET_F_BACKUP feature. > + * When BACKUP feature is enabled, an additional netdev(bypass netdev) > + * is created that acts as a master device and tracks the state of the > + * 2 lower netdevs. The original virtio_net netdev is registered as > + * 'backup' netdev and a passthru device with the same MAC is registered > + * as 'active' netdev. > + */ > + > +/* bypass state maintained when BACKUP feature is enabled */ > +struct virtnet_bypass_info { > + /* passthru netdev with same MAC */ > + struct net_device __rcu *active_netdev; > + > + /* virtio_net netdev */ > + struct net_device __rcu *backup_netdev; > + > + /* active netdev stats */ > + struct rtnl_link_stats64 active_stats; > + > + /* backup netdev stats */ > + struct rtnl_link_stats64 backup_stats; > + > + /* aggregated stats */ > + struct rtnl_link_stats64 bypass_stats; > + > + /* spinlock while updating stats */ > + spinlock_t stats_lock; > +}; > + > +static void virtnet_bypass_child_open(struct net_device *dev, > + struct net_device *child_netdev) > +{ > + int err = dev_open(child_netdev); > + > + if (err) > + netdev_warn(dev, "unable to open slave: %s: %d\n", > + child_netdev->name, err); > +} Why ignoring the error?? i.e. virtnet_bypass_child_open should have return value. I don't believe the caller is in a safe context to guarantee the dev_open always succeeds. > + > +static int virtnet_bypass_open(struct net_device *dev) > +{ > + struct virtnet_bypass_info *vbi = netdev_priv(dev); > + struct net_device *child_netdev; > + > + netif_carrier_off(dev); > + netif_tx_wake_all_queues(dev); > + > + child_netdev = rtnl_dereference(vbi->active_netdev); > + if (child_netdev) > + virtnet_bypass_child_open(dev, child_netdev); Handle the error? > + > + child_netdev = rtnl_dereference(vbi->backup_netdev); > + if (child_netdev) > + virtnet_bypass_child_open(dev, child_netdev); Handle the error and unwind? > + > + return 0; > +} > + > +static int virtnet_bypass_close(struct net_device *dev) > +{ > + struct virtnet_bypass_info *vi = netdev_priv(dev); > + struct net_device *child_netdev; > + > + netif_tx_disable(dev); > + > + child_netdev = rtnl_dereference(vi->active_netdev); > + if (child_netdev) > + dev_close(child_netdev); > + > + child_netdev = rtnl_dereference(vi->backup_netdev); > + if (child_netdev) > + dev_close(child_netdev); > + > + return 0; > +} > + > +static netdev_tx_t virtnet_bypass_drop_xmit(struct sk_buff *skb, > + struct net_device *dev) > +{ > + atomic_long_inc(&dev->tx_dropped); > + dev_kfree_skb_any(skb); > + return NETDEV_TX_OK; > +} > + > +static bool virtnet_bypass_xmit_ready(struct net_device *dev) > +{ > + return netif_running(dev) && netif_carrier_ok(dev); > +} > + > +static netdev_tx_t virtnet_bypass_start_xmit(struct sk_buff *skb, > + struct net_device *dev) > +{ > + struct virtnet_bypass_info *vbi = netdev_priv(dev); > + struct net_device *xmit_dev; > + > + /* Try xmit via active netdev followed by backup netdev */ > + xmit_dev = rcu_dereference_bh(vbi->active_netdev); > + if (!xmit_dev || !virtnet_bypass_xmit_ready(xmit_dev)) { > + xmit_dev = rcu_dereference_bh(vbi->backup_netdev); > + if (!xmit_dev || !virtnet_bypass_xmit_ready(xmit_dev)) > + return virtnet_bypass_drop_xmit(skb, dev); > + } > + > + skb->dev = xmit_dev; > + skb->queue_mapping = qdisc_skb_cb(skb)->slave_dev_queue_mapping; > + > + return dev_queue_xmit(skb); > +} > + > +static u16 virtnet_bypass_select_queue(struct net_device *dev, > + struct sk_buff *skb, void *accel_priv, > + select_queue_fallback_t fallback) > +{ > + /* This helper function exists to help dev_pick_tx get the correct > + * destination queue. Using a helper function skips a call to > + * skb_tx_hash and will put the skbs in the queue we expect on their > + * way down to the bonding driver. > + */ > + u16 txq = skb_rx_queue_recorded(skb) ? skb_get_rx_queue(skb) : 0; > + > + /* Save the original txq to restore before passing to the driver */ > + qdisc_skb_cb(skb)->slave_dev_queue_mapping = skb->queue_mapping; > + > + if (unlikely(txq >= dev->real_num_tx_queues)) { > + do { > + txq -= dev->real_num_tx_queues; > + } while (txq >= dev->real_num_tx_queues); > + } > + > + return txq; > +} > + > +/* fold stats, assuming all rtnl_link_stats64 fields are u64, but > + * that some drivers can provide 32bit values only. > + */ > +static void virtnet_bypass_fold_stats(struct rtnl_link_stats64 *_res, > + const struct rtnl_link_stats64 *_new, > + const struct rtnl_link_stats64 *_old) > +{ > + const u64 *new = (const u64 *)_new; > + const u64 *old = (const u64 *)_old; > + u64 *res = (u64 *)_res; > + int i; > + > + for (i = 0; i < sizeof(*_res) / sizeof(u64); i++) { > + u64 nv = new[i]; > + u64 ov = old[i]; > + s64 delta = nv - ov; > + > + /* detects if this particular field is 32bit only */ > + if (((nv | ov) >> 32) == 0) > + delta = (s64)(s32)((u32)nv - (u32)ov); > + > + /* filter anomalies, some drivers reset their stats > + * at down/up events. > + */ > + if (delta > 0) > + res[i] += delta; > + } > +} > + > +static void virtnet_bypass_get_stats(struct net_device *dev, > + struct rtnl_link_stats64 *stats) > +{ > + struct virtnet_bypass_info *vbi = netdev_priv(dev); > + const struct rtnl_link_stats64 *new; > + struct rtnl_link_stats64 temp; > + struct net_device *child_netdev; > + > + spin_lock(&vbi->stats_lock); > + memcpy(stats, &vbi->bypass_stats, sizeof(*stats)); > + > + rcu_read_lock(); > + > + child_netdev = rcu_dereference(vbi->active_netdev); > + if (child_netdev) { > + new = dev_get_stats(child_netdev, &temp); > + virtnet_bypass_fold_stats(stats, new, &vbi->active_stats); > + memcpy(&vbi->active_stats, new, sizeof(*new)); > + } > + > + child_netdev = rcu_dereference(vbi->backup_netdev); > + if (child_netdev) { > + new = dev_get_stats(child_netdev, &temp); > + virtnet_bypass_fold_stats(stats, new, &vbi->backup_stats); > + memcpy(&vbi->backup_stats, new, sizeof(*new)); > + } > + > + rcu_read_unlock(); > + > + memcpy(&vbi->bypass_stats, stats, sizeof(*stats)); > + spin_unlock(&vbi->stats_lock); > +} > + > +static int virtnet_bypass_change_mtu(struct net_device *dev, int new_mtu) > +{ > + struct virtnet_bypass_info *vbi = netdev_priv(dev); > + struct net_device *child_netdev; > + int ret = 0; > + > + child_netdev = rcu_dereference(vbi->active_netdev); > + if (child_netdev) { > + ret = dev_set_mtu(child_netdev, new_mtu); > + if (ret) > + return ret; > + } > + > + child_netdev = rcu_dereference(vbi->backup_netdev); > + if (child_netdev) { > + ret = dev_set_mtu(child_netdev, new_mtu); > + if (ret) > + netdev_err(child_netdev, > + "Unexpected failure to set mtu to %d\n", > + new_mtu); > + } > + > + dev->mtu = new_mtu; > + return 0; > +} > + > +static void virtnet_bypass_set_rx_mode(struct net_device *dev) > +{ > + struct virtnet_bypass_info *vbi = netdev_priv(dev); > + struct net_device *child_netdev; > + > + rcu_read_lock(); > + > + child_netdev = rcu_dereference(vbi->active_netdev); > + if (child_netdev) { > + dev_uc_sync_multiple(child_netdev, dev); > + dev_mc_sync_multiple(child_netdev, dev); > + } > + > + child_netdev = rcu_dereference(vbi->backup_netdev); > + if (child_netdev) { > + dev_uc_sync_multiple(child_netdev, dev); > + dev_mc_sync_multiple(child_netdev, dev); > + } > + > + rcu_read_unlock(); > +} > + > +static const struct net_device_ops virtnet_bypass_netdev_ops = { > + .ndo_open = virtnet_bypass_open, > + .ndo_stop = virtnet_bypass_close, > + .ndo_start_xmit = virtnet_bypass_start_xmit, > + .ndo_select_queue = virtnet_bypass_select_queue, > + .ndo_get_stats64 = virtnet_bypass_get_stats, > + .ndo_change_mtu = virtnet_bypass_change_mtu, > + .ndo_set_rx_mode = virtnet_bypass_set_rx_mode, > + .ndo_validate_addr = eth_validate_addr, > + .ndo_features_check = passthru_features_check, > +}; > + > +static int > +virtnet_bypass_ethtool_get_link_ksettings(struct net_device *dev, > + struct ethtool_link_ksettings *cmd) > +{ > + struct virtnet_bypass_info *vbi = netdev_priv(dev); > + struct net_device *child_netdev; > + > + child_netdev = rtnl_dereference(vbi->active_netdev); > + if (!child_netdev || !virtnet_bypass_xmit_ready(child_netdev)) { > + child_netdev = rtnl_dereference(vbi->backup_netdev); > + if (!child_netdev || !virtnet_bypass_xmit_ready(child_netdev)) { > + cmd->base.duplex = DUPLEX_UNKNOWN; > + cmd->base.port = PORT_OTHER; > + cmd->base.speed = SPEED_UNKNOWN; > + > + return 0; > + } > + } > + > + return __ethtool_get_link_ksettings(child_netdev, cmd); > +} > + > +#define BYPASS_DRV_NAME "virtnet_bypass" > +#define BYPASS_DRV_VERSION "0.1" > + > +static void virtnet_bypass_ethtool_get_drvinfo(struct net_device *dev, > + struct ethtool_drvinfo *drvinfo) > +{ > + strlcpy(drvinfo->driver, BYPASS_DRV_NAME, sizeof(drvinfo->driver)); > + strlcpy(drvinfo->version, BYPASS_DRV_VERSION, sizeof(drvinfo->version)); > +} > + > +static const struct ethtool_ops virtnet_bypass_ethtool_ops = { > + .get_drvinfo = virtnet_bypass_ethtool_get_drvinfo, > + .get_link = ethtool_op_get_link, > + .get_link_ksettings = virtnet_bypass_ethtool_get_link_ksettings, > +}; > + > +static struct net_device *get_virtnet_bypass_bymac(struct net *net, > + const u8 *mac) > +{ > + struct net_device *dev; > + > + ASSERT_RTNL(); > + > + for_each_netdev(net, dev) { > + if (dev->netdev_ops != &virtnet_bypass_netdev_ops) > + continue; /* not a virtnet_bypass device */ > + > + if (ether_addr_equal(mac, dev->perm_addr)) > + return dev; > + } > + > + return NULL; > +} > + > +static struct net_device * > +get_virtnet_bypass_byref(struct net_device *child_netdev) > +{ > + struct net *net = dev_net(child_netdev); > + struct net_device *dev; > + > + ASSERT_RTNL(); > + > + for_each_netdev(net, dev) { > + struct virtnet_bypass_info *vbi; > + > + if (dev->netdev_ops != &virtnet_bypass_netdev_ops) > + continue; /* not a virtnet_bypass device */ > + > + vbi = netdev_priv(dev); > + > + if ((rtnl_dereference(vbi->active_netdev) == child_netdev) || > + (rtnl_dereference(vbi->backup_netdev) == child_netdev)) > + return dev; /* a match */ > + } > + > + return NULL; > +} > + > +/* Called when child dev is injecting data into network stack. > + * Change the associated network device from lower dev to virtio. > + * note: already called with rcu_read_lock > + */ > +static rx_handler_result_t virtnet_bypass_handle_frame(struct sk_buff **pskb) > +{ > + struct sk_buff *skb = *pskb; > + struct net_device *ndev = rcu_dereference(skb->dev->rx_handler_data); > + > + skb->dev = ndev; > + > + return RX_HANDLER_ANOTHER; > +} > + > +static int virtnet_bypass_register_child(struct net_device *child_netdev) > +{ > + struct virtnet_bypass_info *vbi; > + struct net_device *dev; > + bool backup; > + int ret; > + > + if (child_netdev->addr_len != ETH_ALEN) > + return NOTIFY_DONE; > + > + /* We will use the MAC address to locate the virtnet_bypass netdev > + * to associate with the child netdev. If we don't find a matching > + * bypass netdev, move on. > + */ > + dev = get_virtnet_bypass_bymac(dev_net(child_netdev), > + child_netdev->perm_addr); > + if (!dev) > + return NOTIFY_DONE; > + > + vbi = netdev_priv(dev); > + backup = (child_netdev->dev.parent == dev->dev.parent); > + if (backup ? rtnl_dereference(vbi->backup_netdev) : > + rtnl_dereference(vbi->active_netdev)) { > + netdev_info(dev, > + "%s attempting to join bypass dev when %s already present\n", > + child_netdev->name, backup ? "backup" : "active"); > + return NOTIFY_DONE; > + } > + > + /* Avoid non pci devices as active netdev */ > + if (!backup && (!child_netdev->dev.parent || > + !dev_is_pci(child_netdev->dev.parent))) > + return NOTIFY_DONE; > + There's a problem here in terms of error (particularly on the active slave, e.g. VF), see below: > + ret = netdev_rx_handler_register(child_netdev, > + virtnet_bypass_handle_frame, dev); > + if (ret != 0) { > + netdev_err(child_netdev, > + "can not register bypass receive handler (err = %d)\n", > + ret); > + goto rx_handler_failed; > + } > + > + ret = netdev_upper_dev_link(child_netdev, dev, NULL); > + if (ret != 0) { > + netdev_err(child_netdev, > + "can not set master device %s (err = %d)\n", > + dev->name, ret); > + goto upper_link_failed; > + } > + > + child_netdev->flags |= IFF_SLAVE; > + > + if (netif_running(dev)) { > + ret = dev_open(child_netdev); > + if (ret && (ret != -EBUSY)) { > + netdev_err(dev, "Opening child %s failed ret:%d\n", > + child_netdev->name, ret); > + goto err_interface_up; > + } > + } > + > + /* Align MTU of child with master */ > + ret = dev_set_mtu(child_netdev, dev->mtu); > + if (ret) { > + netdev_err(dev, > + "unable to change mtu of %s to %u register failed\n", > + child_netdev->name, dev->mtu); > + goto err_set_mtu; > + } If any of the above calls returns non-zero, the current code steps back to undo what's being done on that spefic slave previously. For instance, if the netdev_rx_handler_register returns non-zero because of busy rx handler, this register_child function would give up enslaving the VF and leave the upper virtio_bypass interface behind once it returns. I am not sure if it's a good idea to leave the virtio_bypass around if running into failure: the guest is not migratable as the VF doesn't have a backup path, And perhaps the worse part is that, it now has two interfaces with identical MAC address but one of them is invalid (user cannot use the virtio interface as it has a dampened datapath). IMHO the virtio_bypass and its lower netdev should be destroyed at all when it fails to bind the VF, and technically, there should be some way to propogate the failure status to the hypervisor/backend, indicating that the VM is not migratable because of guest software errors (maybe by clearing out the backup feature from the guest virtio driver so host can see/learn it). Thanks, -Siwei > + > + call_netdevice_notifiers(NETDEV_JOIN, child_netdev); > + > + netdev_info(dev, "registering %s\n", child_netdev->name); > + > + dev_hold(child_netdev); > + if (backup) { > + rcu_assign_pointer(vbi->backup_netdev, child_netdev); > + dev_get_stats(vbi->backup_netdev, &vbi->backup_stats); > + } else { > + rcu_assign_pointer(vbi->active_netdev, child_netdev); > + dev_get_stats(vbi->active_netdev, &vbi->active_stats); > + dev->min_mtu = child_netdev->min_mtu; > + dev->max_mtu = child_netdev->max_mtu; > + } > + > + return NOTIFY_OK; > + > +err_set_mtu: > + dev_close(child_netdev); > +err_interface_up: > + netdev_upper_dev_unlink(child_netdev, dev); > + child_netdev->flags &= ~IFF_SLAVE; > +upper_link_failed: > + netdev_rx_handler_unregister(child_netdev); > +rx_handler_failed: > + return NOTIFY_DONE; > +} > + > +static int virtnet_bypass_unregister_child(struct net_device *child_netdev) > +{ > + struct virtnet_bypass_info *vbi; > + struct net_device *dev, *backup; > + > + dev = get_virtnet_bypass_byref(child_netdev); > + if (!dev) > + return NOTIFY_DONE; > + > + vbi = netdev_priv(dev); > + > + netdev_info(dev, "unregistering %s\n", child_netdev->name); > + > + netdev_rx_handler_unregister(child_netdev); > + netdev_upper_dev_unlink(child_netdev, dev); > + child_netdev->flags &= ~IFF_SLAVE; > + > + if (child_netdev->dev.parent == dev->dev.parent) { > + RCU_INIT_POINTER(vbi->backup_netdev, NULL); > + } else { > + RCU_INIT_POINTER(vbi->active_netdev, NULL); > + backup = rtnl_dereference(vbi->backup_netdev); > + if (backup) { > + dev->min_mtu = backup->min_mtu; > + dev->max_mtu = backup->max_mtu; > + } > + } > + > + dev_put(child_netdev); > + > + return NOTIFY_OK; > +} > + > +static int virtnet_bypass_update_link(struct net_device *child_netdev) > +{ > + struct net_device *dev, *active, *backup; > + struct virtnet_bypass_info *vbi; > + > + dev = get_virtnet_bypass_byref(child_netdev); > + if (!dev || !netif_running(dev)) > + return NOTIFY_DONE; > + > + vbi = netdev_priv(dev); > + > + active = rtnl_dereference(vbi->active_netdev); > + backup = rtnl_dereference(vbi->backup_netdev); > + > + if ((active && virtnet_bypass_xmit_ready(active)) || > + (backup && virtnet_bypass_xmit_ready(backup))) { > + netif_carrier_on(dev); > + netif_tx_wake_all_queues(dev); > + } else { > + netif_carrier_off(dev); > + netif_tx_stop_all_queues(dev); > + } > + > + return NOTIFY_OK; > +} > + > +static int virtnet_bypass_event(struct notifier_block *this, > + unsigned long event, void *ptr) > +{ > + struct net_device *event_dev = netdev_notifier_info_to_dev(ptr); > + > + /* Skip our own events */ > + if (event_dev->netdev_ops == &virtnet_bypass_netdev_ops) > + return NOTIFY_DONE; > + > + /* Avoid non-Ethernet type devices */ > + if (event_dev->type != ARPHRD_ETHER) > + return NOTIFY_DONE; > + > + /* Avoid Vlan dev with same MAC registering as child dev */ > + if (is_vlan_dev(event_dev)) > + return NOTIFY_DONE; > + > + /* Avoid Bonding master dev with same MAC registering as child dev */ > + if ((event_dev->priv_flags & IFF_BONDING) && > + (event_dev->flags & IFF_MASTER)) > + return NOTIFY_DONE; > + > + switch (event) { > + case NETDEV_REGISTER: > + return virtnet_bypass_register_child(event_dev); > + case NETDEV_UNREGISTER: > + return virtnet_bypass_unregister_child(event_dev); > + case NETDEV_UP: > + case NETDEV_DOWN: > + case NETDEV_CHANGE: > + return virtnet_bypass_update_link(event_dev); > + default: > + return NOTIFY_DONE; > + } > +} > + > +static struct notifier_block virtnet_bypass_notifier = { > + .notifier_call = virtnet_bypass_event, > +}; > + > +static int virtnet_bypass_create(struct virtnet_info *vi) > +{ > + struct net_device *backup_netdev = vi->dev; > + struct device *dev = &vi->vdev->dev; > + struct net_device *bypass_netdev; > + int res; > + > + /* Alloc at least 2 queues, for now we are going with 16 assuming > + * that most devices being bonded won't have too many queues. > + */ > + bypass_netdev = alloc_etherdev_mq(sizeof(struct virtnet_bypass_info), > + 16); > + if (!bypass_netdev) { > + dev_err(dev, "Unable to allocate bypass_netdev!\n"); > + return -ENOMEM; > + } > + > + dev_net_set(bypass_netdev, dev_net(backup_netdev)); > + SET_NETDEV_DEV(bypass_netdev, dev); > + > + bypass_netdev->netdev_ops = &virtnet_bypass_netdev_ops; > + bypass_netdev->ethtool_ops = &virtnet_bypass_ethtool_ops; > + > + /* Initialize the device options */ > + bypass_netdev->flags |= IFF_MASTER; > + bypass_netdev->priv_flags |= IFF_BONDING | IFF_UNICAST_FLT | > + IFF_NO_QUEUE; > + bypass_netdev->priv_flags &= ~(IFF_XMIT_DST_RELEASE | > + IFF_TX_SKB_SHARING); > + > + /* don't acquire bypass netdev's netif_tx_lock when transmitting */ > + bypass_netdev->features |= NETIF_F_LLTX; > + > + /* Don't allow bypass devices to change network namespaces. */ > + bypass_netdev->features |= NETIF_F_NETNS_LOCAL; > + > + bypass_netdev->hw_features = NETIF_F_HW_CSUM | NETIF_F_SG | > + NETIF_F_FRAGLIST | NETIF_F_ALL_TSO | > + NETIF_F_HIGHDMA | NETIF_F_LRO; > + > + bypass_netdev->hw_features |= NETIF_F_GSO_ENCAP_ALL; > + bypass_netdev->features |= bypass_netdev->hw_features; > + > + /* For now treat bypass netdev as VLAN challenged since we > + * cannot assume VLAN functionality with a VF > + */ > + bypass_netdev->features |= NETIF_F_VLAN_CHALLENGED; > + > + memcpy(bypass_netdev->dev_addr, backup_netdev->dev_addr, > + bypass_netdev->addr_len); > + > + bypass_netdev->min_mtu = backup_netdev->min_mtu; > + bypass_netdev->max_mtu = backup_netdev->max_mtu; > + > + res = register_netdev(bypass_netdev); > + if (res < 0) { > + dev_err(dev, "Unable to register bypass_netdev!\n"); > + free_netdev(bypass_netdev); > + return res; > + } > + > + netif_carrier_off(bypass_netdev); > + > + vi->bypass_netdev = bypass_netdev; > + > + return 0; > +} > + > +static void virtnet_bypass_destroy(struct virtnet_info *vi) > +{ > + struct net_device *bypass_netdev = vi->bypass_netdev; > + struct virtnet_bypass_info *vbi; > + struct net_device *child_netdev; > + > + /* no device found, nothing to free */ > + if (!bypass_netdev) > + return; > + > + vbi = netdev_priv(bypass_netdev); > + > + netif_device_detach(bypass_netdev); > + > + rtnl_lock(); > + > + child_netdev = rtnl_dereference(vbi->active_netdev); > + if (child_netdev) > + virtnet_bypass_unregister_child(child_netdev); > + > + child_netdev = rtnl_dereference(vbi->backup_netdev); > + if (child_netdev) > + virtnet_bypass_unregister_child(child_netdev); > + > + unregister_netdevice(bypass_netdev); > + > + rtnl_unlock(); > + > + free_netdev(bypass_netdev); > +} > + > +/* END of functions supporting VIRTIO_NET_F_BACKUP feature. */ > + > static int virtnet_probe(struct virtio_device *vdev) > { > int i, err = -ENOMEM; > @@ -2797,10 +3466,15 @@ static int virtnet_probe(struct virtio_device *vdev) > > virtnet_init_settings(dev); > > + if (virtio_has_feature(vdev, VIRTIO_NET_F_BACKUP)) { > + if (virtnet_bypass_create(vi) != 0) > + goto free_vqs; > + } > + > err = register_netdev(dev); > if (err) { > pr_debug("virtio_net: registering device failed\n"); > - goto free_vqs; > + goto free_bypass; > } > > virtio_device_ready(vdev); > @@ -2837,6 +3511,8 @@ static int virtnet_probe(struct virtio_device *vdev) > vi->vdev->config->reset(vdev); > > unregister_netdev(dev); > +free_bypass: > + virtnet_bypass_destroy(vi); > free_vqs: > cancel_delayed_work_sync(&vi->refill); > free_receive_page_frags(vi); > @@ -2871,6 +3547,8 @@ static void virtnet_remove(struct virtio_device *vdev) > > unregister_netdev(vi->dev); > > + virtnet_bypass_destroy(vi); > + > remove_vq_common(vi); > > free_netdev(vi->dev); > @@ -2968,6 +3646,8 @@ static __init int virtio_net_driver_init(void) > ret = register_virtio_driver(&virtio_net_driver); > if (ret) > goto err_virtio; > + > + register_netdevice_notifier(&virtnet_bypass_notifier); > return 0; > err_virtio: > cpuhp_remove_multi_state(CPUHP_VIRT_NET_DEAD); > @@ -2980,6 +3660,7 @@ module_init(virtio_net_driver_init); > > static __exit void virtio_net_driver_exit(void) > { > + unregister_netdevice_notifier(&virtnet_bypass_notifier); > unregister_virtio_driver(&virtio_net_driver); > cpuhp_remove_multi_state(CPUHP_VIRT_NET_DEAD); > cpuhp_remove_multi_state(virtionet_online); > -- > 2.14.3 >
On 3/12/2018 3:44 PM, Siwei Liu wrote: > Apologies, still some comments going. Please see inline. > > On Thu, Mar 1, 2018 at 12:08 PM, Sridhar Samudrala > <sridhar.samudrala@intel.com> wrote: >> This patch enables virtio_net to switch over to a VF datapath when a VF >> netdev is present with the same MAC address. It allows live migration >> of a VM with a direct attached VF without the need to setup a bond/team >> between a VF and virtio net device in the guest. >> >> The hypervisor needs to enable only one datapath at any time so that >> packets don't get looped back to the VM over the other datapath. When a VF >> is plugged, the virtio datapath link state can be marked as down. The >> hypervisor needs to unplug the VF device from the guest on the source host >> and reset the MAC filter of the VF to initiate failover of datapath to >> virtio before starting the migration. After the migration is completed, >> the destination hypervisor sets the MAC filter on the VF and plugs it back >> to the guest to switch over to VF datapath. >> >> When BACKUP feature is enabled, an additional netdev(bypass netdev) is >> created that acts as a master device and tracks the state of the 2 lower >> netdevs. The original virtio_net netdev is marked as 'backup' netdev and a >> passthru device with the same MAC is registered as 'active' netdev. >> >> This patch is based on the discussion initiated by Jesse on this thread. >> https://marc.info/?l=linux-virtualization&m=151189725224231&w=2 >> >> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com> >> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> >> Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com> >> --- >> drivers/net/virtio_net.c | 683 ++++++++++++++++++++++++++++++++++++++++++++++- >> 1 file changed, 682 insertions(+), 1 deletion(-) >> >> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c >> index bcd13fe906ca..f2860d86c952 100644 >> --- a/drivers/net/virtio_net.c >> +++ b/drivers/net/virtio_net.c >> @@ -30,6 +30,8 @@ >> #include <linux/cpu.h> >> #include <linux/average.h> >> #include <linux/filter.h> >> +#include <linux/netdevice.h> >> +#include <linux/pci.h> >> #include <net/route.h> >> #include <net/xdp.h> >> >> @@ -206,6 +208,9 @@ struct virtnet_info { >> u32 speed; >> >> unsigned long guest_offloads; >> + >> + /* upper netdev created when BACKUP feature enabled */ >> + struct net_device *bypass_netdev; >> }; >> >> struct padded_vnet_hdr { >> @@ -2236,6 +2241,22 @@ static int virtnet_xdp(struct net_device *dev, struct netdev_bpf *xdp) >> } >> } >> >> +static int virtnet_get_phys_port_name(struct net_device *dev, char *buf, >> + size_t len) >> +{ >> + struct virtnet_info *vi = netdev_priv(dev); >> + int ret; >> + >> + if (!virtio_has_feature(vi->vdev, VIRTIO_NET_F_BACKUP)) >> + return -EOPNOTSUPP; >> + >> + ret = snprintf(buf, len, "_bkup"); >> + if (ret >= len) >> + return -EOPNOTSUPP; >> + >> + return 0; >> +} >> + >> static const struct net_device_ops virtnet_netdev = { >> .ndo_open = virtnet_open, >> .ndo_stop = virtnet_close, >> @@ -2253,6 +2274,7 @@ static const struct net_device_ops virtnet_netdev = { >> .ndo_xdp_xmit = virtnet_xdp_xmit, >> .ndo_xdp_flush = virtnet_xdp_flush, >> .ndo_features_check = passthru_features_check, >> + .ndo_get_phys_port_name = virtnet_get_phys_port_name, >> }; >> >> static void virtnet_config_changed_work(struct work_struct *work) >> @@ -2647,6 +2669,653 @@ static int virtnet_validate(struct virtio_device *vdev) >> return 0; >> } >> >> +/* START of functions supporting VIRTIO_NET_F_BACKUP feature. >> + * When BACKUP feature is enabled, an additional netdev(bypass netdev) >> + * is created that acts as a master device and tracks the state of the >> + * 2 lower netdevs. The original virtio_net netdev is registered as >> + * 'backup' netdev and a passthru device with the same MAC is registered >> + * as 'active' netdev. >> + */ >> + >> +/* bypass state maintained when BACKUP feature is enabled */ >> +struct virtnet_bypass_info { >> + /* passthru netdev with same MAC */ >> + struct net_device __rcu *active_netdev; >> + >> + /* virtio_net netdev */ >> + struct net_device __rcu *backup_netdev; >> + >> + /* active netdev stats */ >> + struct rtnl_link_stats64 active_stats; >> + >> + /* backup netdev stats */ >> + struct rtnl_link_stats64 backup_stats; >> + >> + /* aggregated stats */ >> + struct rtnl_link_stats64 bypass_stats; >> + >> + /* spinlock while updating stats */ >> + spinlock_t stats_lock; >> +}; >> + >> +static void virtnet_bypass_child_open(struct net_device *dev, >> + struct net_device *child_netdev) >> +{ >> + int err = dev_open(child_netdev); >> + >> + if (err) >> + netdev_warn(dev, "unable to open slave: %s: %d\n", >> + child_netdev->name, err); >> +} > Why ignoring the error?? i.e. virtnet_bypass_child_open should have > return value. I don't believe the caller is in a safe context to > guarantee the dev_open always succeeds. OK. Will handle this in the next revision. > >> + >> +static int virtnet_bypass_open(struct net_device *dev) >> +{ >> + struct virtnet_bypass_info *vbi = netdev_priv(dev); >> + struct net_device *child_netdev; >> + >> + netif_carrier_off(dev); >> + netif_tx_wake_all_queues(dev); >> + >> + child_netdev = rtnl_dereference(vbi->active_netdev); >> + if (child_netdev) >> + virtnet_bypass_child_open(dev, child_netdev); > Handle the error? > >> + >> + child_netdev = rtnl_dereference(vbi->backup_netdev); >> + if (child_netdev) >> + virtnet_bypass_child_open(dev, child_netdev); > Handle the error and unwind? Sure. <snip> > + > +static int virtnet_bypass_register_child(struct net_device *child_netdev) > +{ > + struct virtnet_bypass_info *vbi; > + struct net_device *dev; > + bool backup; > + int ret; > + > + if (child_netdev->addr_len != ETH_ALEN) > + return NOTIFY_DONE; > + > + /* We will use the MAC address to locate the virtnet_bypass netdev > + * to associate with the child netdev. If we don't find a matching > + * bypass netdev, move on. > + */ > + dev = get_virtnet_bypass_bymac(dev_net(child_netdev), > + child_netdev->perm_addr); > + if (!dev) > + return NOTIFY_DONE; > + > + vbi = netdev_priv(dev); > + backup = (child_netdev->dev.parent == dev->dev.parent); > + if (backup ? rtnl_dereference(vbi->backup_netdev) : > + rtnl_dereference(vbi->active_netdev)) { > + netdev_info(dev, > + "%s attempting to join bypass dev when %s already present\n", > + child_netdev->name, backup ? "backup" : "active"); > + return NOTIFY_DONE; > + } > + > + /* Avoid non pci devices as active netdev */ > + if (!backup && (!child_netdev->dev.parent || > + !dev_is_pci(child_netdev->dev.parent))) > + return NOTIFY_DONE; > + > There's a problem here in terms of error (particularly on the active > slave, e.g. VF), see below: > >> + ret = netdev_rx_handler_register(child_netdev, >> + virtnet_bypass_handle_frame, dev); >> + if (ret != 0) { >> + netdev_err(child_netdev, >> + "can not register bypass receive handler (err = %d)\n", >> + ret); >> + goto rx_handler_failed; >> + } >> + >> + ret = netdev_upper_dev_link(child_netdev, dev, NULL); >> + if (ret != 0) { >> + netdev_err(child_netdev, >> + "can not set master device %s (err = %d)\n", >> + dev->name, ret); >> + goto upper_link_failed; >> + } >> + >> + child_netdev->flags |= IFF_SLAVE; >> + >> + if (netif_running(dev)) { >> + ret = dev_open(child_netdev); >> + if (ret && (ret != -EBUSY)) { >> + netdev_err(dev, "Opening child %s failed ret:%d\n", >> + child_netdev->name, ret); >> + goto err_interface_up; >> + } >> + } >> + >> + /* Align MTU of child with master */ >> + ret = dev_set_mtu(child_netdev, dev->mtu); >> + if (ret) { >> + netdev_err(dev, >> + "unable to change mtu of %s to %u register failed\n", >> + child_netdev->name, dev->mtu); >> + goto err_set_mtu; >> + } > If any of the above calls returns non-zero, the current code steps > back to undo what's being done on that spefic slave previously. For > instance, if the netdev_rx_handler_register returns non-zero because > of busy rx handler, this register_child function would give up > enslaving the VF and leave the upper virtio_bypass interface behind > once it returns. virtio_bypass interface is the upper master netdev and it is always present when BACKUP feature is enabled. If there is failure with enslaving VF, there will be 2 netdevs, master virtio_bypass and slave virtio_net and the VM can be migrated. > I am not sure if it's a good idea to leave the > virtio_bypass around if running into failure: the guest is not > migratable as the VF doesn't have a backup path, Are you talking about a failure when registering backup netdev? This should not happen, but i guess we can improve error handling in such scenario. > And perhaps the worse > part is that, it now has two interfaces with identical MAC address but > one of them is invalid (user cannot use the virtio interface as it has > a dampened datapath). IMHO the virtio_bypass and its lower netdev > should be destroyed at all when it fails to bind the VF, and > technically, there should be some way to propogate the failure status > to the hypervisor/backend, indicating that the VM is not migratable > because of guest software errors (maybe by clearing out the backup > feature from the guest virtio driver so host can see/learn it). In BACKUP mode, user can only use the upper virtio_bypass netdev and that will always be there. Any failure to enslave VF netdev is not fatal, but i will see if we can improve the error handling of failure to enslave backup netdev. Also, i don't think the BACKUP feature bit is negotiable with the host. Thanks Sridhar
On 3/12/2018 2:08 PM, Jiri Pirko wrote: > Mon, Mar 12, 2018 at 09:58:06PM CET, sridhar.samudrala@intel.com wrote: >> >> On 3/12/2018 1:12 PM, Jiri Pirko wrote: >>> Thu, Mar 01, 2018 at 09:08:43PM CET, sridhar.samudrala@intel.com wrote: >>>> This patch enables virtio_net to switch over to a VF datapath when a VF >>>> netdev is present with the same MAC address. It allows live migration >>>> of a VM with a direct attached VF without the need to setup a bond/team >>>> between a VF and virtio net device in the guest. >>>> >>>> The hypervisor needs to enable only one datapath at any time so that >>>> packets don't get looped back to the VM over the other datapath. When a VF >>>> is plugged, the virtio datapath link state can be marked as down. The >>>> hypervisor needs to unplug the VF device from the guest on the source host >>>> and reset the MAC filter of the VF to initiate failover of datapath to >>>> virtio before starting the migration. After the migration is completed, >>>> the destination hypervisor sets the MAC filter on the VF and plugs it back >>>> to the guest to switch over to VF datapath. >>>> >>>> When BACKUP feature is enabled, an additional netdev(bypass netdev) is >>>> created that acts as a master device and tracks the state of the 2 lower >>>> netdevs. The original virtio_net netdev is marked as 'backup' netdev and a >>>> passthru device with the same MAC is registered as 'active' netdev. >>>> >>>> This patch is based on the discussion initiated by Jesse on this thread. >>>> https://marc.info/?l=linux-virtualization&m=151189725224231&w=2 >>>> >>>> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com> >>>> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> >>>> Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com> >>> [...] >>> >>> >>>> +static int virtnet_bypass_register_child(struct net_device *child_netdev) >>>> +{ >>>> + struct virtnet_bypass_info *vbi; >>>> + struct net_device *dev; >>>> + bool backup; >>>> + int ret; >>>> + >>>> + if (child_netdev->addr_len != ETH_ALEN) >>>> + return NOTIFY_DONE; >>>> + >>>> + /* We will use the MAC address to locate the virtnet_bypass netdev >>>> + * to associate with the child netdev. If we don't find a matching >>>> + * bypass netdev, move on. >>>> + */ >>>> + dev = get_virtnet_bypass_bymac(dev_net(child_netdev), >>>> + child_netdev->perm_addr); >>>> + if (!dev) >>>> + return NOTIFY_DONE; >>>> + >>>> + vbi = netdev_priv(dev); >>>> + backup = (child_netdev->dev.parent == dev->dev.parent); >>>> + if (backup ? rtnl_dereference(vbi->backup_netdev) : >>>> + rtnl_dereference(vbi->active_netdev)) { >>>> + netdev_info(dev, >>>> + "%s attempting to join bypass dev when %s already present\n", >>>> + child_netdev->name, backup ? "backup" : "active"); >>>> + return NOTIFY_DONE; >>>> + } >>>> + >>>> + /* Avoid non pci devices as active netdev */ >>>> + if (!backup && (!child_netdev->dev.parent || >>>> + !dev_is_pci(child_netdev->dev.parent))) >>>> + return NOTIFY_DONE; >>>> + >>>> + ret = netdev_rx_handler_register(child_netdev, >>>> + virtnet_bypass_handle_frame, dev); >>>> + if (ret != 0) { >>>> + netdev_err(child_netdev, >>>> + "can not register bypass receive handler (err = %d)\n", >>>> + ret); >>>> + goto rx_handler_failed; >>>> + } >>>> + >>>> + ret = netdev_upper_dev_link(child_netdev, dev, NULL); >>>> + if (ret != 0) { >>>> + netdev_err(child_netdev, >>>> + "can not set master device %s (err = %d)\n", >>>> + dev->name, ret); >>>> + goto upper_link_failed; >>>> + } >>>> + >>>> + child_netdev->flags |= IFF_SLAVE; >>>> + >>>> + if (netif_running(dev)) { >>>> + ret = dev_open(child_netdev); >>>> + if (ret && (ret != -EBUSY)) { >>>> + netdev_err(dev, "Opening child %s failed ret:%d\n", >>>> + child_netdev->name, ret); >>>> + goto err_interface_up; >>>> + } >>>> + } >>> Much of this function is copy of netvsc_vf_join, should be shared with >>> netvsc. >> Yes. This is based on netvsc_register_vf() and netvsc_vf_join(). But modified >> to handle registration of 2 child netdevs(backup virtio and active vf). In case >> of netvsc, they only register VF. There is no backup netdev. >> I think trying to make this code shareable with netvsc will introduce additional >> checks and make it convoluted. > No problem. Not clear what you meant here? Want to confirm that you are agreeing that it is OK to not share this function with netvsc. > > >> >>> >>>> + >>>> + /* Align MTU of child with master */ >>>> + ret = dev_set_mtu(child_netdev, dev->mtu); >>>> + if (ret) { >>>> + netdev_err(dev, >>>> + "unable to change mtu of %s to %u register failed\n", >>>> + child_netdev->name, dev->mtu); >>>> + goto err_set_mtu; >>>> + } >>>> + >>>> + call_netdevice_notifiers(NETDEV_JOIN, child_netdev); >>>> + >>>> + netdev_info(dev, "registering %s\n", child_netdev->name); >>>> + >>>> + dev_hold(child_netdev); >>>> + if (backup) { >>>> + rcu_assign_pointer(vbi->backup_netdev, child_netdev); >>>> + dev_get_stats(vbi->backup_netdev, &vbi->backup_stats); >>>> + } else { >>>> + rcu_assign_pointer(vbi->active_netdev, child_netdev); >>>> + dev_get_stats(vbi->active_netdev, &vbi->active_stats); >>>> + dev->min_mtu = child_netdev->min_mtu; >>>> + dev->max_mtu = child_netdev->max_mtu; >>>> + } >>>> + >>>> + return NOTIFY_OK; >>>> + >>>> +err_set_mtu: >>>> + dev_close(child_netdev); >>>> +err_interface_up: >>>> + netdev_upper_dev_unlink(child_netdev, dev); >>>> + child_netdev->flags &= ~IFF_SLAVE; >>>> +upper_link_failed: >>>> + netdev_rx_handler_unregister(child_netdev); >>>> +rx_handler_failed: >>>> + return NOTIFY_DONE; >>>> +} >>>> + >>>> +static int virtnet_bypass_unregister_child(struct net_device *child_netdev) >>>> +{ >>>> + struct virtnet_bypass_info *vbi; >>>> + struct net_device *dev, *backup; >>>> + >>>> + dev = get_virtnet_bypass_byref(child_netdev); >>>> + if (!dev) >>>> + return NOTIFY_DONE; >>>> + >>>> + vbi = netdev_priv(dev); >>>> + >>>> + netdev_info(dev, "unregistering %s\n", child_netdev->name); >>>> + >>>> + netdev_rx_handler_unregister(child_netdev); >>>> + netdev_upper_dev_unlink(child_netdev, dev); >>>> + child_netdev->flags &= ~IFF_SLAVE; >>>> + >>>> + if (child_netdev->dev.parent == dev->dev.parent) { >>>> + RCU_INIT_POINTER(vbi->backup_netdev, NULL); >>>> + } else { >>>> + RCU_INIT_POINTER(vbi->active_netdev, NULL); >>>> + backup = rtnl_dereference(vbi->backup_netdev); >>>> + if (backup) { >>>> + dev->min_mtu = backup->min_mtu; >>>> + dev->max_mtu = backup->max_mtu; >>>> + } >>>> + } >>>> + >>>> + dev_put(child_netdev); >>>> + >>>> + return NOTIFY_OK; >>>> +} >>>> + >>>> +static int virtnet_bypass_update_link(struct net_device *child_netdev) >>>> +{ >>>> + struct net_device *dev, *active, *backup; >>>> + struct virtnet_bypass_info *vbi; >>>> + >>>> + dev = get_virtnet_bypass_byref(child_netdev); >>>> + if (!dev || !netif_running(dev)) >>>> + return NOTIFY_DONE; >>>> + >>>> + vbi = netdev_priv(dev); >>>> + >>>> + active = rtnl_dereference(vbi->active_netdev); >>>> + backup = rtnl_dereference(vbi->backup_netdev); >>>> + >>>> + if ((active && virtnet_bypass_xmit_ready(active)) || >>>> + (backup && virtnet_bypass_xmit_ready(backup))) { >>>> + netif_carrier_on(dev); >>>> + netif_tx_wake_all_queues(dev); >>>> + } else { >>>> + netif_carrier_off(dev); >>>> + netif_tx_stop_all_queues(dev); >>>> + } >>>> + >>>> + return NOTIFY_OK; >>>> +} >>>> + >>>> +static int virtnet_bypass_event(struct notifier_block *this, >>>> + unsigned long event, void *ptr) >>>> +{ >>>> + struct net_device *event_dev = netdev_notifier_info_to_dev(ptr); >>>> + >>>> + /* Skip our own events */ >>>> + if (event_dev->netdev_ops == &virtnet_bypass_netdev_ops) >>>> + return NOTIFY_DONE; >>>> + >>>> + /* Avoid non-Ethernet type devices */ >>>> + if (event_dev->type != ARPHRD_ETHER) >>>> + return NOTIFY_DONE; >>>> + >>>> + /* Avoid Vlan dev with same MAC registering as child dev */ >>>> + if (is_vlan_dev(event_dev)) >>>> + return NOTIFY_DONE; >>>> + >>>> + /* Avoid Bonding master dev with same MAC registering as child dev */ >>>> + if ((event_dev->priv_flags & IFF_BONDING) && >>>> + (event_dev->flags & IFF_MASTER)) >>>> + return NOTIFY_DONE; >>>> + >>>> + switch (event) { >>>> + case NETDEV_REGISTER: >>>> + return virtnet_bypass_register_child(event_dev); >>>> + case NETDEV_UNREGISTER: >>>> + return virtnet_bypass_unregister_child(event_dev); >>>> + case NETDEV_UP: >>>> + case NETDEV_DOWN: >>>> + case NETDEV_CHANGE: >>>> + return virtnet_bypass_update_link(event_dev); >>>> + default: >>>> + return NOTIFY_DONE; >>>> + } >>>> +} >>> For example this function is 1:1 copy of netvsc, even with comments >>> and bugs (like IFF_BODING check). >>> >>> This is also something that should be shared with netvsc. >> The problem is with the calls that are made from this function. >> Sharing would have been possible if both virtio and netvsc went with >> same netdev model. > ops? You can still share. OK. looks like you want to see atleast some code shared between netvsc and virtio_net even when they are using 2 different netdev models. Will try to add a new file under net/core and move some functions that can be shared between netvsc and virtio_net in BACKUP mode. > > >>> >>>> + >>>> +static struct notifier_block virtnet_bypass_notifier = { >>>> + .notifier_call = virtnet_bypass_event, >>>> +}; >>>> + >>>> +static int virtnet_bypass_create(struct virtnet_info *vi) >>>> +{ >>>> + struct net_device *backup_netdev = vi->dev; >>>> + struct device *dev = &vi->vdev->dev; >>>> + struct net_device *bypass_netdev; >>>> + int res; >>>> + >>>> + /* Alloc at least 2 queues, for now we are going with 16 assuming >>>> + * that most devices being bonded won't have too many queues. >>>> + */ >>>> + bypass_netdev = alloc_etherdev_mq(sizeof(struct virtnet_bypass_info), >>>> + 16); >>>> + if (!bypass_netdev) { >>>> + dev_err(dev, "Unable to allocate bypass_netdev!\n"); >>>> + return -ENOMEM; >>>> + } >>>> + >>>> + dev_net_set(bypass_netdev, dev_net(backup_netdev)); >>>> + SET_NETDEV_DEV(bypass_netdev, dev); >>>> + >>>> + bypass_netdev->netdev_ops = &virtnet_bypass_netdev_ops; >>>> + bypass_netdev->ethtool_ops = &virtnet_bypass_ethtool_ops; >>>> + >>>> + /* Initialize the device options */ >>>> + bypass_netdev->flags |= IFF_MASTER; >>>> + bypass_netdev->priv_flags |= IFF_BONDING | IFF_UNICAST_FLT | >>> No clue why you set IFF_BONDING here... >> This is to indicate to the userspace that this is a master bond device. > This is not a master bond device! If you say this, it makes me really > worried.... > Will remove IFF_BONDING in the next revision. > >> May be it is sufficient to just set IFF_MASTER. >> >>> >>> >>> >>>
On Tue, Mar 13, 2018 at 05:28:07PM -0700, Samudrala, Sridhar wrote: > > I am not sure if it's a good idea to leave the > > virtio_bypass around if running into failure: the guest is not > > migratable as the VF doesn't have a backup path, > > Are you talking about a failure when registering backup netdev? This should not > happen, but i guess we can improve error handling in such scenario. A nice way to do this would be to clear the backup feature bit. > > > And perhaps the worse > > part is that, it now has two interfaces with identical MAC address but > > one of them is invalid (user cannot use the virtio interface as it has > > a dampened datapath). IMHO the virtio_bypass and its lower netdev > > should be destroyed at all when it fails to bind the VF, and > > technically, there should be some way to propogate the failure status > > to the hypervisor/backend, indicating that the VM is not migratable > > because of guest software errors (maybe by clearing out the backup > > feature from the guest virtio driver so host can see/learn it). > > In BACKUP mode, user can only use the upper virtio_bypass netdev and that will > always be there. Any failure to enslave VF netdev is not fatal, but i will see > if we can improve the error handling of failure to enslave backup netdev. > Also, i don't think the BACKUP feature bit is negotiable with the host. > > Thanks > Sridhar All bits are negotiable. It's up to the host whether to support a device with this bit clear, or to fail negotiation.
On Tue, 13 Mar 2018 17:36:23 -0700 "Samudrala, Sridhar" <sridhar.samudrala@intel.com> wrote: > OK. looks like you want to see atleast some code shared between netvsc and virtio_net > even when they are using 2 different netdev models. > Will try to add a new file under net/core and move some functions that can be shared > between netvsc and virtio_net in BACKUP mode. Make it work first (on both) then workout how to share. That method worked quite well for tunnels and other netdev like objects.
On Tue, Mar 13, 2018 at 5:28 PM, Samudrala, Sridhar <sridhar.samudrala@intel.com> wrote: > On 3/12/2018 3:44 PM, Siwei Liu wrote: >> >> Apologies, still some comments going. Please see inline. >> >> On Thu, Mar 1, 2018 at 12:08 PM, Sridhar Samudrala >> <sridhar.samudrala@intel.com> wrote: >>> >>> This patch enables virtio_net to switch over to a VF datapath when a VF >>> netdev is present with the same MAC address. It allows live migration >>> of a VM with a direct attached VF without the need to setup a bond/team >>> between a VF and virtio net device in the guest. >>> >>> The hypervisor needs to enable only one datapath at any time so that >>> packets don't get looped back to the VM over the other datapath. When a >>> VF >>> is plugged, the virtio datapath link state can be marked as down. The >>> hypervisor needs to unplug the VF device from the guest on the source >>> host >>> and reset the MAC filter of the VF to initiate failover of datapath to >>> virtio before starting the migration. After the migration is completed, >>> the destination hypervisor sets the MAC filter on the VF and plugs it >>> back >>> to the guest to switch over to VF datapath. >>> >>> When BACKUP feature is enabled, an additional netdev(bypass netdev) is >>> created that acts as a master device and tracks the state of the 2 lower >>> netdevs. The original virtio_net netdev is marked as 'backup' netdev and >>> a >>> passthru device with the same MAC is registered as 'active' netdev. >>> >>> This patch is based on the discussion initiated by Jesse on this thread. >>> https://marc.info/?l=linux-virtualization&m=151189725224231&w=2 >>> >>> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com> >>> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> >>> Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com> >>> --- >>> drivers/net/virtio_net.c | 683 >>> ++++++++++++++++++++++++++++++++++++++++++++++- >>> 1 file changed, 682 insertions(+), 1 deletion(-) >>> >>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c >>> index bcd13fe906ca..f2860d86c952 100644 >>> --- a/drivers/net/virtio_net.c >>> +++ b/drivers/net/virtio_net.c >>> @@ -30,6 +30,8 @@ >>> #include <linux/cpu.h> >>> #include <linux/average.h> >>> #include <linux/filter.h> >>> +#include <linux/netdevice.h> >>> +#include <linux/pci.h> >>> #include <net/route.h> >>> #include <net/xdp.h> >>> >>> @@ -206,6 +208,9 @@ struct virtnet_info { >>> u32 speed; >>> >>> unsigned long guest_offloads; >>> + >>> + /* upper netdev created when BACKUP feature enabled */ >>> + struct net_device *bypass_netdev; >>> }; >>> >>> struct padded_vnet_hdr { >>> @@ -2236,6 +2241,22 @@ static int virtnet_xdp(struct net_device *dev, >>> struct netdev_bpf *xdp) >>> } >>> } >>> >>> +static int virtnet_get_phys_port_name(struct net_device *dev, char *buf, >>> + size_t len) >>> +{ >>> + struct virtnet_info *vi = netdev_priv(dev); >>> + int ret; >>> + >>> + if (!virtio_has_feature(vi->vdev, VIRTIO_NET_F_BACKUP)) >>> + return -EOPNOTSUPP; >>> + >>> + ret = snprintf(buf, len, "_bkup"); >>> + if (ret >= len) >>> + return -EOPNOTSUPP; >>> + >>> + return 0; >>> +} >>> + >>> static const struct net_device_ops virtnet_netdev = { >>> .ndo_open = virtnet_open, >>> .ndo_stop = virtnet_close, >>> @@ -2253,6 +2274,7 @@ static const struct net_device_ops virtnet_netdev = >>> { >>> .ndo_xdp_xmit = virtnet_xdp_xmit, >>> .ndo_xdp_flush = virtnet_xdp_flush, >>> .ndo_features_check = passthru_features_check, >>> + .ndo_get_phys_port_name = virtnet_get_phys_port_name, >>> }; >>> >>> static void virtnet_config_changed_work(struct work_struct *work) >>> @@ -2647,6 +2669,653 @@ static int virtnet_validate(struct virtio_device >>> *vdev) >>> return 0; >>> } >>> >>> +/* START of functions supporting VIRTIO_NET_F_BACKUP feature. >>> + * When BACKUP feature is enabled, an additional netdev(bypass netdev) >>> + * is created that acts as a master device and tracks the state of the >>> + * 2 lower netdevs. The original virtio_net netdev is registered as >>> + * 'backup' netdev and a passthru device with the same MAC is registered >>> + * as 'active' netdev. >>> + */ >>> + >>> +/* bypass state maintained when BACKUP feature is enabled */ >>> +struct virtnet_bypass_info { >>> + /* passthru netdev with same MAC */ >>> + struct net_device __rcu *active_netdev; >>> + >>> + /* virtio_net netdev */ >>> + struct net_device __rcu *backup_netdev; >>> + >>> + /* active netdev stats */ >>> + struct rtnl_link_stats64 active_stats; >>> + >>> + /* backup netdev stats */ >>> + struct rtnl_link_stats64 backup_stats; >>> + >>> + /* aggregated stats */ >>> + struct rtnl_link_stats64 bypass_stats; >>> + >>> + /* spinlock while updating stats */ >>> + spinlock_t stats_lock; >>> +}; >>> + >>> +static void virtnet_bypass_child_open(struct net_device *dev, >>> + struct net_device *child_netdev) >>> +{ >>> + int err = dev_open(child_netdev); >>> + >>> + if (err) >>> + netdev_warn(dev, "unable to open slave: %s: %d\n", >>> + child_netdev->name, err); >>> +} >> >> Why ignoring the error?? i.e. virtnet_bypass_child_open should have >> return value. I don't believe the caller is in a safe context to >> guarantee the dev_open always succeeds. > > > OK. Will handle this in the next revision. > > >> >>> + >>> +static int virtnet_bypass_open(struct net_device *dev) >>> +{ >>> + struct virtnet_bypass_info *vbi = netdev_priv(dev); >>> + struct net_device *child_netdev; >>> + >>> + netif_carrier_off(dev); >>> + netif_tx_wake_all_queues(dev); >>> + >>> + child_netdev = rtnl_dereference(vbi->active_netdev); >>> + if (child_netdev) >>> + virtnet_bypass_child_open(dev, child_netdev); >> >> Handle the error? >> >>> + >>> + child_netdev = rtnl_dereference(vbi->backup_netdev); >>> + if (child_netdev) >>> + virtnet_bypass_child_open(dev, child_netdev); >> >> Handle the error and unwind? > > > Sure. > > <snip> > > >> + >> +static int virtnet_bypass_register_child(struct net_device *child_netdev) >> +{ >> + struct virtnet_bypass_info *vbi; >> + struct net_device *dev; >> + bool backup; >> + int ret; >> + >> + if (child_netdev->addr_len != ETH_ALEN) >> + return NOTIFY_DONE; >> + >> + /* We will use the MAC address to locate the virtnet_bypass netdev >> + * to associate with the child netdev. If we don't find a matching >> + * bypass netdev, move on. >> + */ >> + dev = get_virtnet_bypass_bymac(dev_net(child_netdev), >> + child_netdev->perm_addr); >> + if (!dev) >> + return NOTIFY_DONE; >> + >> + vbi = netdev_priv(dev); >> + backup = (child_netdev->dev.parent == dev->dev.parent); >> + if (backup ? rtnl_dereference(vbi->backup_netdev) : >> + rtnl_dereference(vbi->active_netdev)) { >> + netdev_info(dev, >> + "%s attempting to join bypass dev when %s >> already present\n", >> + child_netdev->name, backup ? "backup" : >> "active"); >> + return NOTIFY_DONE; >> + } >> + >> + /* Avoid non pci devices as active netdev */ >> + if (!backup && (!child_netdev->dev.parent || >> + !dev_is_pci(child_netdev->dev.parent))) >> + return NOTIFY_DONE; >> + >> There's a problem here in terms of error (particularly on the active >> slave, e.g. VF), see below: >> >>> + ret = netdev_rx_handler_register(child_netdev, >>> + virtnet_bypass_handle_frame, >>> dev); >>> + if (ret != 0) { >>> + netdev_err(child_netdev, >>> + "can not register bypass receive handler (err >>> = %d)\n", >>> + ret); >>> + goto rx_handler_failed; >>> + } >>> + >>> + ret = netdev_upper_dev_link(child_netdev, dev, NULL); >>> + if (ret != 0) { >>> + netdev_err(child_netdev, >>> + "can not set master device %s (err = %d)\n", >>> + dev->name, ret); >>> + goto upper_link_failed; >>> + } >>> + >>> + child_netdev->flags |= IFF_SLAVE; >>> + >>> + if (netif_running(dev)) { >>> + ret = dev_open(child_netdev); >>> + if (ret && (ret != -EBUSY)) { >>> + netdev_err(dev, "Opening child %s failed >>> ret:%d\n", >>> + child_netdev->name, ret); >>> + goto err_interface_up; >>> + } >>> + } >>> + >>> + /* Align MTU of child with master */ >>> + ret = dev_set_mtu(child_netdev, dev->mtu); >>> + if (ret) { >>> + netdev_err(dev, >>> + "unable to change mtu of %s to %u register >>> failed\n", >>> + child_netdev->name, dev->mtu); >>> + goto err_set_mtu; >>> + } >> >> If any of the above calls returns non-zero, the current code steps >> back to undo what's being done on that spefic slave previously. For >> instance, if the netdev_rx_handler_register returns non-zero because >> of busy rx handler, this register_child function would give up >> enslaving the VF and leave the upper virtio_bypass interface behind >> once it returns. > > > virtio_bypass interface is the upper master netdev and it is always present > when > BACKUP feature is enabled. > If there is failure with enslaving VF, there will be 2 netdevs, > master virtio_bypass and slave virtio_net and the VM can be migrated. Failure in enslaving the VF does not mean the netdev would disappear. That means there will be 3 seperate netdevs left behind, a master virtio_bypass with slave virtio_net, plus a standalone VF. In this case the VF are still useable as the datapath is still there, while the datapath for virtio_* is made standby. Which interface you expect user to continue to use in case of failure? If it's the VF interface you choose, there should be some detection for that condition in virtio_net, hide the virtio interfaces from user by unregistering those two virtio netdevs (master virtio_bypass with slave virtio_net), presumably. If not so user has to use virtio_bypass, how do you prevent users from using the VF, or consequently, indicate/propogate the failure to backend and switch the host datapath to virtio? I don't see either of the above being handled on your patch. Unless I miss something obvious, I don't think this patch is close to achieving the goal in terms of completeness. > > >> I am not sure if it's a good idea to leave the >> virtio_bypass around if running into failure: the guest is not >> migratable as the VF doesn't have a backup path, > > > Are you talking about a failure when registering backup netdev? This should No, see above. I was mainly concerned with the failure in enslaving VF (active netdev). If users continue to use raw VF interface within the VM, the guest cannot be migrated for sure. > not > happen, but i guess we can improve error handling in such scenario. > > >> And perhaps the worse >> part is that, it now has two interfaces with identical MAC address but >> one of them is invalid (user cannot use the virtio interface as it has >> a dampened datapath). IMHO the virtio_bypass and its lower netdev >> should be destroyed at all when it fails to bind the VF, and >> technically, there should be some way to propogate the failure status >> to the hypervisor/backend, indicating that the VM is not migratable >> because of guest software errors (maybe by clearing out the backup >> feature from the guest virtio driver so host can see/learn it). > > > In BACKUP mode, user can only use the upper virtio_bypass netdev and that > will > always be there. Any failure to enslave VF netdev is not fatal, but i will > see > if we can improve the error handling of failure to enslave backup netdev. > Also, i don't think the BACKUP feature bit is negotiable with the host. I don't believe that is the case. In any event, you should come up with whatever means to indicate to the hypervisor that some VF are still left unbound with a virtio, so QEMU knows the VM cannot live migrate when host user attempts to do so. Or, alternatively, propoate the failure to host, plug out the VF and switch the host datapath to virtio. Please work out a complete solution to address the error case. Regards, -Siwei > > Thanks > Sridhar > >
Wed, Mar 14, 2018 at 01:36:23AM CET, sridhar.samudrala@intel.com wrote: >On 3/12/2018 2:08 PM, Jiri Pirko wrote: >> Mon, Mar 12, 2018 at 09:58:06PM CET, sridhar.samudrala@intel.com wrote: >> > >> > On 3/12/2018 1:12 PM, Jiri Pirko wrote: >> > > Thu, Mar 01, 2018 at 09:08:43PM CET, sridhar.samudrala@intel.com wrote: >> > > > This patch enables virtio_net to switch over to a VF datapath when a VF >> > > > netdev is present with the same MAC address. It allows live migration >> > > > of a VM with a direct attached VF without the need to setup a bond/team >> > > > between a VF and virtio net device in the guest. >> > > > >> > > > The hypervisor needs to enable only one datapath at any time so that >> > > > packets don't get looped back to the VM over the other datapath. When a VF >> > > > is plugged, the virtio datapath link state can be marked as down. The >> > > > hypervisor needs to unplug the VF device from the guest on the source host >> > > > and reset the MAC filter of the VF to initiate failover of datapath to >> > > > virtio before starting the migration. After the migration is completed, >> > > > the destination hypervisor sets the MAC filter on the VF and plugs it back >> > > > to the guest to switch over to VF datapath. >> > > > >> > > > When BACKUP feature is enabled, an additional netdev(bypass netdev) is >> > > > created that acts as a master device and tracks the state of the 2 lower >> > > > netdevs. The original virtio_net netdev is marked as 'backup' netdev and a >> > > > passthru device with the same MAC is registered as 'active' netdev. >> > > > >> > > > This patch is based on the discussion initiated by Jesse on this thread. >> > > > https://marc.info/?l=linux-virtualization&m=151189725224231&w=2 >> > > > >> > > > Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com> >> > > > Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> >> > > > Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com> >> > > [...] >> > > >> > > >> > > > +static int virtnet_bypass_register_child(struct net_device *child_netdev) >> > > > +{ >> > > > + struct virtnet_bypass_info *vbi; >> > > > + struct net_device *dev; >> > > > + bool backup; >> > > > + int ret; >> > > > + >> > > > + if (child_netdev->addr_len != ETH_ALEN) >> > > > + return NOTIFY_DONE; >> > > > + >> > > > + /* We will use the MAC address to locate the virtnet_bypass netdev >> > > > + * to associate with the child netdev. If we don't find a matching >> > > > + * bypass netdev, move on. >> > > > + */ >> > > > + dev = get_virtnet_bypass_bymac(dev_net(child_netdev), >> > > > + child_netdev->perm_addr); >> > > > + if (!dev) >> > > > + return NOTIFY_DONE; >> > > > + >> > > > + vbi = netdev_priv(dev); >> > > > + backup = (child_netdev->dev.parent == dev->dev.parent); >> > > > + if (backup ? rtnl_dereference(vbi->backup_netdev) : >> > > > + rtnl_dereference(vbi->active_netdev)) { >> > > > + netdev_info(dev, >> > > > + "%s attempting to join bypass dev when %s already present\n", >> > > > + child_netdev->name, backup ? "backup" : "active"); >> > > > + return NOTIFY_DONE; >> > > > + } >> > > > + >> > > > + /* Avoid non pci devices as active netdev */ >> > > > + if (!backup && (!child_netdev->dev.parent || >> > > > + !dev_is_pci(child_netdev->dev.parent))) >> > > > + return NOTIFY_DONE; >> > > > + >> > > > + ret = netdev_rx_handler_register(child_netdev, >> > > > + virtnet_bypass_handle_frame, dev); >> > > > + if (ret != 0) { >> > > > + netdev_err(child_netdev, >> > > > + "can not register bypass receive handler (err = %d)\n", >> > > > + ret); >> > > > + goto rx_handler_failed; >> > > > + } >> > > > + >> > > > + ret = netdev_upper_dev_link(child_netdev, dev, NULL); >> > > > + if (ret != 0) { >> > > > + netdev_err(child_netdev, >> > > > + "can not set master device %s (err = %d)\n", >> > > > + dev->name, ret); >> > > > + goto upper_link_failed; >> > > > + } >> > > > + >> > > > + child_netdev->flags |= IFF_SLAVE; >> > > > + >> > > > + if (netif_running(dev)) { >> > > > + ret = dev_open(child_netdev); >> > > > + if (ret && (ret != -EBUSY)) { >> > > > + netdev_err(dev, "Opening child %s failed ret:%d\n", >> > > > + child_netdev->name, ret); >> > > > + goto err_interface_up; >> > > > + } >> > > > + } >> > > Much of this function is copy of netvsc_vf_join, should be shared with >> > > netvsc. >> > Yes. This is based on netvsc_register_vf() and netvsc_vf_join(). But modified >> > to handle registration of 2 child netdevs(backup virtio and active vf). In case >> > of netvsc, they only register VF. There is no backup netdev. >> > I think trying to make this code shareable with netvsc will introduce additional >> > checks and make it convoluted. >> No problem. > >Not clear what you meant here? >Want to confirm that you are agreeing that it is OK to not share this function >with netvsc. Please share as much as possible. And this is possible. netdev_upper_dev_link IFF_SLAVE flag set netdev_rx_handler_register, checks etc. > > >> >> >> > >> > > >> > > > + >> > > > + /* Align MTU of child with master */ >> > > > + ret = dev_set_mtu(child_netdev, dev->mtu); >> > > > + if (ret) { >> > > > + netdev_err(dev, >> > > > + "unable to change mtu of %s to %u register failed\n", >> > > > + child_netdev->name, dev->mtu); >> > > > + goto err_set_mtu; >> > > > + } >> > > > + >> > > > + call_netdevice_notifiers(NETDEV_JOIN, child_netdev); >> > > > + >> > > > + netdev_info(dev, "registering %s\n", child_netdev->name); >> > > > + >> > > > + dev_hold(child_netdev); >> > > > + if (backup) { >> > > > + rcu_assign_pointer(vbi->backup_netdev, child_netdev); >> > > > + dev_get_stats(vbi->backup_netdev, &vbi->backup_stats); >> > > > + } else { >> > > > + rcu_assign_pointer(vbi->active_netdev, child_netdev); >> > > > + dev_get_stats(vbi->active_netdev, &vbi->active_stats); >> > > > + dev->min_mtu = child_netdev->min_mtu; >> > > > + dev->max_mtu = child_netdev->max_mtu; >> > > > + } >> > > > + >> > > > + return NOTIFY_OK; >> > > > + >> > > > +err_set_mtu: >> > > > + dev_close(child_netdev); >> > > > +err_interface_up: >> > > > + netdev_upper_dev_unlink(child_netdev, dev); >> > > > + child_netdev->flags &= ~IFF_SLAVE; >> > > > +upper_link_failed: >> > > > + netdev_rx_handler_unregister(child_netdev); >> > > > +rx_handler_failed: >> > > > + return NOTIFY_DONE; >> > > > +} >> > > > + >> > > > +static int virtnet_bypass_unregister_child(struct net_device *child_netdev) >> > > > +{ >> > > > + struct virtnet_bypass_info *vbi; >> > > > + struct net_device *dev, *backup; >> > > > + >> > > > + dev = get_virtnet_bypass_byref(child_netdev); >> > > > + if (!dev) >> > > > + return NOTIFY_DONE; >> > > > + >> > > > + vbi = netdev_priv(dev); >> > > > + >> > > > + netdev_info(dev, "unregistering %s\n", child_netdev->name); >> > > > + >> > > > + netdev_rx_handler_unregister(child_netdev); >> > > > + netdev_upper_dev_unlink(child_netdev, dev); >> > > > + child_netdev->flags &= ~IFF_SLAVE; >> > > > + >> > > > + if (child_netdev->dev.parent == dev->dev.parent) { >> > > > + RCU_INIT_POINTER(vbi->backup_netdev, NULL); >> > > > + } else { >> > > > + RCU_INIT_POINTER(vbi->active_netdev, NULL); >> > > > + backup = rtnl_dereference(vbi->backup_netdev); >> > > > + if (backup) { >> > > > + dev->min_mtu = backup->min_mtu; >> > > > + dev->max_mtu = backup->max_mtu; >> > > > + } >> > > > + } >> > > > + >> > > > + dev_put(child_netdev); >> > > > + >> > > > + return NOTIFY_OK; >> > > > +} >> > > > + >> > > > +static int virtnet_bypass_update_link(struct net_device *child_netdev) >> > > > +{ >> > > > + struct net_device *dev, *active, *backup; >> > > > + struct virtnet_bypass_info *vbi; >> > > > + >> > > > + dev = get_virtnet_bypass_byref(child_netdev); >> > > > + if (!dev || !netif_running(dev)) >> > > > + return NOTIFY_DONE; >> > > > + >> > > > + vbi = netdev_priv(dev); >> > > > + >> > > > + active = rtnl_dereference(vbi->active_netdev); >> > > > + backup = rtnl_dereference(vbi->backup_netdev); >> > > > + >> > > > + if ((active && virtnet_bypass_xmit_ready(active)) || >> > > > + (backup && virtnet_bypass_xmit_ready(backup))) { >> > > > + netif_carrier_on(dev); >> > > > + netif_tx_wake_all_queues(dev); >> > > > + } else { >> > > > + netif_carrier_off(dev); >> > > > + netif_tx_stop_all_queues(dev); >> > > > + } >> > > > + >> > > > + return NOTIFY_OK; >> > > > +} >> > > > + >> > > > +static int virtnet_bypass_event(struct notifier_block *this, >> > > > + unsigned long event, void *ptr) >> > > > +{ >> > > > + struct net_device *event_dev = netdev_notifier_info_to_dev(ptr); >> > > > + >> > > > + /* Skip our own events */ >> > > > + if (event_dev->netdev_ops == &virtnet_bypass_netdev_ops) >> > > > + return NOTIFY_DONE; >> > > > + >> > > > + /* Avoid non-Ethernet type devices */ >> > > > + if (event_dev->type != ARPHRD_ETHER) >> > > > + return NOTIFY_DONE; >> > > > + >> > > > + /* Avoid Vlan dev with same MAC registering as child dev */ >> > > > + if (is_vlan_dev(event_dev)) >> > > > + return NOTIFY_DONE; >> > > > + >> > > > + /* Avoid Bonding master dev with same MAC registering as child dev */ >> > > > + if ((event_dev->priv_flags & IFF_BONDING) && >> > > > + (event_dev->flags & IFF_MASTER)) >> > > > + return NOTIFY_DONE; >> > > > + >> > > > + switch (event) { >> > > > + case NETDEV_REGISTER: >> > > > + return virtnet_bypass_register_child(event_dev); >> > > > + case NETDEV_UNREGISTER: >> > > > + return virtnet_bypass_unregister_child(event_dev); >> > > > + case NETDEV_UP: >> > > > + case NETDEV_DOWN: >> > > > + case NETDEV_CHANGE: >> > > > + return virtnet_bypass_update_link(event_dev); >> > > > + default: >> > > > + return NOTIFY_DONE; >> > > > + } >> > > > +} >> > > For example this function is 1:1 copy of netvsc, even with comments >> > > and bugs (like IFF_BODING check). >> > > >> > > This is also something that should be shared with netvsc. >> > The problem is with the calls that are made from this function. >> > Sharing would have been possible if both virtio and netvsc went with >> > same netdev model. >> ops? You can still share. > >OK. looks like you want to see atleast some code shared between netvsc and virtio_net >even when they are using 2 different netdev models. Yes please. >Will try to add a new file under net/core and move some functions that can be shared >between netvsc and virtio_net in BACKUP mode. > > >> >> >> > > >> > > > + >> > > > +static struct notifier_block virtnet_bypass_notifier = { >> > > > + .notifier_call = virtnet_bypass_event, >> > > > +}; >> > > > + >> > > > +static int virtnet_bypass_create(struct virtnet_info *vi) >> > > > +{ >> > > > + struct net_device *backup_netdev = vi->dev; >> > > > + struct device *dev = &vi->vdev->dev; >> > > > + struct net_device *bypass_netdev; >> > > > + int res; >> > > > + >> > > > + /* Alloc at least 2 queues, for now we are going with 16 assuming >> > > > + * that most devices being bonded won't have too many queues. >> > > > + */ >> > > > + bypass_netdev = alloc_etherdev_mq(sizeof(struct virtnet_bypass_info), >> > > > + 16); >> > > > + if (!bypass_netdev) { >> > > > + dev_err(dev, "Unable to allocate bypass_netdev!\n"); >> > > > + return -ENOMEM; >> > > > + } >> > > > + >> > > > + dev_net_set(bypass_netdev, dev_net(backup_netdev)); >> > > > + SET_NETDEV_DEV(bypass_netdev, dev); >> > > > + >> > > > + bypass_netdev->netdev_ops = &virtnet_bypass_netdev_ops; >> > > > + bypass_netdev->ethtool_ops = &virtnet_bypass_ethtool_ops; >> > > > + >> > > > + /* Initialize the device options */ >> > > > + bypass_netdev->flags |= IFF_MASTER; >> > > > + bypass_netdev->priv_flags |= IFF_BONDING | IFF_UNICAST_FLT | >> > > No clue why you set IFF_BONDING here... >> > This is to indicate to the userspace that this is a master bond device. >> This is not a master bond device! If you say this, it makes me really >> worried.... >> >Will remove IFF_BONDING in the next revision. Also please remove IFF_MASTER setting. Thanks > > >> >> > May be it is sufficient to just set IFF_MASTER. >> > >> > > >> > > >> > > >> > > >
diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c index bcd13fe906ca..f2860d86c952 100644 --- a/drivers/net/virtio_net.c +++ b/drivers/net/virtio_net.c @@ -30,6 +30,8 @@ #include <linux/cpu.h> #include <linux/average.h> #include <linux/filter.h> +#include <linux/netdevice.h> +#include <linux/pci.h> #include <net/route.h> #include <net/xdp.h> @@ -206,6 +208,9 @@ struct virtnet_info { u32 speed; unsigned long guest_offloads; + + /* upper netdev created when BACKUP feature enabled */ + struct net_device *bypass_netdev; }; struct padded_vnet_hdr { @@ -2236,6 +2241,22 @@ static int virtnet_xdp(struct net_device *dev, struct netdev_bpf *xdp) } } +static int virtnet_get_phys_port_name(struct net_device *dev, char *buf, + size_t len) +{ + struct virtnet_info *vi = netdev_priv(dev); + int ret; + + if (!virtio_has_feature(vi->vdev, VIRTIO_NET_F_BACKUP)) + return -EOPNOTSUPP; + + ret = snprintf(buf, len, "_bkup"); + if (ret >= len) + return -EOPNOTSUPP; + + return 0; +} + static const struct net_device_ops virtnet_netdev = { .ndo_open = virtnet_open, .ndo_stop = virtnet_close, @@ -2253,6 +2274,7 @@ static const struct net_device_ops virtnet_netdev = { .ndo_xdp_xmit = virtnet_xdp_xmit, .ndo_xdp_flush = virtnet_xdp_flush, .ndo_features_check = passthru_features_check, + .ndo_get_phys_port_name = virtnet_get_phys_port_name, }; static void virtnet_config_changed_work(struct work_struct *work) @@ -2647,6 +2669,653 @@ static int virtnet_validate(struct virtio_device *vdev) return 0; } +/* START of functions supporting VIRTIO_NET_F_BACKUP feature. + * When BACKUP feature is enabled, an additional netdev(bypass netdev) + * is created that acts as a master device and tracks the state of the + * 2 lower netdevs. The original virtio_net netdev is registered as + * 'backup' netdev and a passthru device with the same MAC is registered + * as 'active' netdev. + */ + +/* bypass state maintained when BACKUP feature is enabled */ +struct virtnet_bypass_info { + /* passthru netdev with same MAC */ + struct net_device __rcu *active_netdev; + + /* virtio_net netdev */ + struct net_device __rcu *backup_netdev; + + /* active netdev stats */ + struct rtnl_link_stats64 active_stats; + + /* backup netdev stats */ + struct rtnl_link_stats64 backup_stats; + + /* aggregated stats */ + struct rtnl_link_stats64 bypass_stats; + + /* spinlock while updating stats */ + spinlock_t stats_lock; +}; + +static void virtnet_bypass_child_open(struct net_device *dev, + struct net_device *child_netdev) +{ + int err = dev_open(child_netdev); + + if (err) + netdev_warn(dev, "unable to open slave: %s: %d\n", + child_netdev->name, err); +} + +static int virtnet_bypass_open(struct net_device *dev) +{ + struct virtnet_bypass_info *vbi = netdev_priv(dev); + struct net_device *child_netdev; + + netif_carrier_off(dev); + netif_tx_wake_all_queues(dev); + + child_netdev = rtnl_dereference(vbi->active_netdev); + if (child_netdev) + virtnet_bypass_child_open(dev, child_netdev); + + child_netdev = rtnl_dereference(vbi->backup_netdev); + if (child_netdev) + virtnet_bypass_child_open(dev, child_netdev); + + return 0; +} + +static int virtnet_bypass_close(struct net_device *dev) +{ + struct virtnet_bypass_info *vi = netdev_priv(dev); + struct net_device *child_netdev; + + netif_tx_disable(dev); + + child_netdev = rtnl_dereference(vi->active_netdev); + if (child_netdev) + dev_close(child_netdev); + + child_netdev = rtnl_dereference(vi->backup_netdev); + if (child_netdev) + dev_close(child_netdev); + + return 0; +} + +static netdev_tx_t virtnet_bypass_drop_xmit(struct sk_buff *skb, + struct net_device *dev) +{ + atomic_long_inc(&dev->tx_dropped); + dev_kfree_skb_any(skb); + return NETDEV_TX_OK; +} + +static bool virtnet_bypass_xmit_ready(struct net_device *dev) +{ + return netif_running(dev) && netif_carrier_ok(dev); +} + +static netdev_tx_t virtnet_bypass_start_xmit(struct sk_buff *skb, + struct net_device *dev) +{ + struct virtnet_bypass_info *vbi = netdev_priv(dev); + struct net_device *xmit_dev; + + /* Try xmit via active netdev followed by backup netdev */ + xmit_dev = rcu_dereference_bh(vbi->active_netdev); + if (!xmit_dev || !virtnet_bypass_xmit_ready(xmit_dev)) { + xmit_dev = rcu_dereference_bh(vbi->backup_netdev); + if (!xmit_dev || !virtnet_bypass_xmit_ready(xmit_dev)) + return virtnet_bypass_drop_xmit(skb, dev); + } + + skb->dev = xmit_dev; + skb->queue_mapping = qdisc_skb_cb(skb)->slave_dev_queue_mapping; + + return dev_queue_xmit(skb); +} + +static u16 virtnet_bypass_select_queue(struct net_device *dev, + struct sk_buff *skb, void *accel_priv, + select_queue_fallback_t fallback) +{ + /* This helper function exists to help dev_pick_tx get the correct + * destination queue. Using a helper function skips a call to + * skb_tx_hash and will put the skbs in the queue we expect on their + * way down to the bonding driver. + */ + u16 txq = skb_rx_queue_recorded(skb) ? skb_get_rx_queue(skb) : 0; + + /* Save the original txq to restore before passing to the driver */ + qdisc_skb_cb(skb)->slave_dev_queue_mapping = skb->queue_mapping; + + if (unlikely(txq >= dev->real_num_tx_queues)) { + do { + txq -= dev->real_num_tx_queues; + } while (txq >= dev->real_num_tx_queues); + } + + return txq; +} + +/* fold stats, assuming all rtnl_link_stats64 fields are u64, but + * that some drivers can provide 32bit values only. + */ +static void virtnet_bypass_fold_stats(struct rtnl_link_stats64 *_res, + const struct rtnl_link_stats64 *_new, + const struct rtnl_link_stats64 *_old) +{ + const u64 *new = (const u64 *)_new; + const u64 *old = (const u64 *)_old; + u64 *res = (u64 *)_res; + int i; + + for (i = 0; i < sizeof(*_res) / sizeof(u64); i++) { + u64 nv = new[i]; + u64 ov = old[i]; + s64 delta = nv - ov; + + /* detects if this particular field is 32bit only */ + if (((nv | ov) >> 32) == 0) + delta = (s64)(s32)((u32)nv - (u32)ov); + + /* filter anomalies, some drivers reset their stats + * at down/up events. + */ + if (delta > 0) + res[i] += delta; + } +} + +static void virtnet_bypass_get_stats(struct net_device *dev, + struct rtnl_link_stats64 *stats) +{ + struct virtnet_bypass_info *vbi = netdev_priv(dev); + const struct rtnl_link_stats64 *new; + struct rtnl_link_stats64 temp; + struct net_device *child_netdev; + + spin_lock(&vbi->stats_lock); + memcpy(stats, &vbi->bypass_stats, sizeof(*stats)); + + rcu_read_lock(); + + child_netdev = rcu_dereference(vbi->active_netdev); + if (child_netdev) { + new = dev_get_stats(child_netdev, &temp); + virtnet_bypass_fold_stats(stats, new, &vbi->active_stats); + memcpy(&vbi->active_stats, new, sizeof(*new)); + } + + child_netdev = rcu_dereference(vbi->backup_netdev); + if (child_netdev) { + new = dev_get_stats(child_netdev, &temp); + virtnet_bypass_fold_stats(stats, new, &vbi->backup_stats); + memcpy(&vbi->backup_stats, new, sizeof(*new)); + } + + rcu_read_unlock(); + + memcpy(&vbi->bypass_stats, stats, sizeof(*stats)); + spin_unlock(&vbi->stats_lock); +} + +static int virtnet_bypass_change_mtu(struct net_device *dev, int new_mtu) +{ + struct virtnet_bypass_info *vbi = netdev_priv(dev); + struct net_device *child_netdev; + int ret = 0; + + child_netdev = rcu_dereference(vbi->active_netdev); + if (child_netdev) { + ret = dev_set_mtu(child_netdev, new_mtu); + if (ret) + return ret; + } + + child_netdev = rcu_dereference(vbi->backup_netdev); + if (child_netdev) { + ret = dev_set_mtu(child_netdev, new_mtu); + if (ret) + netdev_err(child_netdev, + "Unexpected failure to set mtu to %d\n", + new_mtu); + } + + dev->mtu = new_mtu; + return 0; +} + +static void virtnet_bypass_set_rx_mode(struct net_device *dev) +{ + struct virtnet_bypass_info *vbi = netdev_priv(dev); + struct net_device *child_netdev; + + rcu_read_lock(); + + child_netdev = rcu_dereference(vbi->active_netdev); + if (child_netdev) { + dev_uc_sync_multiple(child_netdev, dev); + dev_mc_sync_multiple(child_netdev, dev); + } + + child_netdev = rcu_dereference(vbi->backup_netdev); + if (child_netdev) { + dev_uc_sync_multiple(child_netdev, dev); + dev_mc_sync_multiple(child_netdev, dev); + } + + rcu_read_unlock(); +} + +static const struct net_device_ops virtnet_bypass_netdev_ops = { + .ndo_open = virtnet_bypass_open, + .ndo_stop = virtnet_bypass_close, + .ndo_start_xmit = virtnet_bypass_start_xmit, + .ndo_select_queue = virtnet_bypass_select_queue, + .ndo_get_stats64 = virtnet_bypass_get_stats, + .ndo_change_mtu = virtnet_bypass_change_mtu, + .ndo_set_rx_mode = virtnet_bypass_set_rx_mode, + .ndo_validate_addr = eth_validate_addr, + .ndo_features_check = passthru_features_check, +}; + +static int +virtnet_bypass_ethtool_get_link_ksettings(struct net_device *dev, + struct ethtool_link_ksettings *cmd) +{ + struct virtnet_bypass_info *vbi = netdev_priv(dev); + struct net_device *child_netdev; + + child_netdev = rtnl_dereference(vbi->active_netdev); + if (!child_netdev || !virtnet_bypass_xmit_ready(child_netdev)) { + child_netdev = rtnl_dereference(vbi->backup_netdev); + if (!child_netdev || !virtnet_bypass_xmit_ready(child_netdev)) { + cmd->base.duplex = DUPLEX_UNKNOWN; + cmd->base.port = PORT_OTHER; + cmd->base.speed = SPEED_UNKNOWN; + + return 0; + } + } + + return __ethtool_get_link_ksettings(child_netdev, cmd); +} + +#define BYPASS_DRV_NAME "virtnet_bypass" +#define BYPASS_DRV_VERSION "0.1" + +static void virtnet_bypass_ethtool_get_drvinfo(struct net_device *dev, + struct ethtool_drvinfo *drvinfo) +{ + strlcpy(drvinfo->driver, BYPASS_DRV_NAME, sizeof(drvinfo->driver)); + strlcpy(drvinfo->version, BYPASS_DRV_VERSION, sizeof(drvinfo->version)); +} + +static const struct ethtool_ops virtnet_bypass_ethtool_ops = { + .get_drvinfo = virtnet_bypass_ethtool_get_drvinfo, + .get_link = ethtool_op_get_link, + .get_link_ksettings = virtnet_bypass_ethtool_get_link_ksettings, +}; + +static struct net_device *get_virtnet_bypass_bymac(struct net *net, + const u8 *mac) +{ + struct net_device *dev; + + ASSERT_RTNL(); + + for_each_netdev(net, dev) { + if (dev->netdev_ops != &virtnet_bypass_netdev_ops) + continue; /* not a virtnet_bypass device */ + + if (ether_addr_equal(mac, dev->perm_addr)) + return dev; + } + + return NULL; +} + +static struct net_device * +get_virtnet_bypass_byref(struct net_device *child_netdev) +{ + struct net *net = dev_net(child_netdev); + struct net_device *dev; + + ASSERT_RTNL(); + + for_each_netdev(net, dev) { + struct virtnet_bypass_info *vbi; + + if (dev->netdev_ops != &virtnet_bypass_netdev_ops) + continue; /* not a virtnet_bypass device */ + + vbi = netdev_priv(dev); + + if ((rtnl_dereference(vbi->active_netdev) == child_netdev) || + (rtnl_dereference(vbi->backup_netdev) == child_netdev)) + return dev; /* a match */ + } + + return NULL; +} + +/* Called when child dev is injecting data into network stack. + * Change the associated network device from lower dev to virtio. + * note: already called with rcu_read_lock + */ +static rx_handler_result_t virtnet_bypass_handle_frame(struct sk_buff **pskb) +{ + struct sk_buff *skb = *pskb; + struct net_device *ndev = rcu_dereference(skb->dev->rx_handler_data); + + skb->dev = ndev; + + return RX_HANDLER_ANOTHER; +} + +static int virtnet_bypass_register_child(struct net_device *child_netdev) +{ + struct virtnet_bypass_info *vbi; + struct net_device *dev; + bool backup; + int ret; + + if (child_netdev->addr_len != ETH_ALEN) + return NOTIFY_DONE; + + /* We will use the MAC address to locate the virtnet_bypass netdev + * to associate with the child netdev. If we don't find a matching + * bypass netdev, move on. + */ + dev = get_virtnet_bypass_bymac(dev_net(child_netdev), + child_netdev->perm_addr); + if (!dev) + return NOTIFY_DONE; + + vbi = netdev_priv(dev); + backup = (child_netdev->dev.parent == dev->dev.parent); + if (backup ? rtnl_dereference(vbi->backup_netdev) : + rtnl_dereference(vbi->active_netdev)) { + netdev_info(dev, + "%s attempting to join bypass dev when %s already present\n", + child_netdev->name, backup ? "backup" : "active"); + return NOTIFY_DONE; + } + + /* Avoid non pci devices as active netdev */ + if (!backup && (!child_netdev->dev.parent || + !dev_is_pci(child_netdev->dev.parent))) + return NOTIFY_DONE; + + ret = netdev_rx_handler_register(child_netdev, + virtnet_bypass_handle_frame, dev); + if (ret != 0) { + netdev_err(child_netdev, + "can not register bypass receive handler (err = %d)\n", + ret); + goto rx_handler_failed; + } + + ret = netdev_upper_dev_link(child_netdev, dev, NULL); + if (ret != 0) { + netdev_err(child_netdev, + "can not set master device %s (err = %d)\n", + dev->name, ret); + goto upper_link_failed; + } + + child_netdev->flags |= IFF_SLAVE; + + if (netif_running(dev)) { + ret = dev_open(child_netdev); + if (ret && (ret != -EBUSY)) { + netdev_err(dev, "Opening child %s failed ret:%d\n", + child_netdev->name, ret); + goto err_interface_up; + } + } + + /* Align MTU of child with master */ + ret = dev_set_mtu(child_netdev, dev->mtu); + if (ret) { + netdev_err(dev, + "unable to change mtu of %s to %u register failed\n", + child_netdev->name, dev->mtu); + goto err_set_mtu; + } + + call_netdevice_notifiers(NETDEV_JOIN, child_netdev); + + netdev_info(dev, "registering %s\n", child_netdev->name); + + dev_hold(child_netdev); + if (backup) { + rcu_assign_pointer(vbi->backup_netdev, child_netdev); + dev_get_stats(vbi->backup_netdev, &vbi->backup_stats); + } else { + rcu_assign_pointer(vbi->active_netdev, child_netdev); + dev_get_stats(vbi->active_netdev, &vbi->active_stats); + dev->min_mtu = child_netdev->min_mtu; + dev->max_mtu = child_netdev->max_mtu; + } + + return NOTIFY_OK; + +err_set_mtu: + dev_close(child_netdev); +err_interface_up: + netdev_upper_dev_unlink(child_netdev, dev); + child_netdev->flags &= ~IFF_SLAVE; +upper_link_failed: + netdev_rx_handler_unregister(child_netdev); +rx_handler_failed: + return NOTIFY_DONE; +} + +static int virtnet_bypass_unregister_child(struct net_device *child_netdev) +{ + struct virtnet_bypass_info *vbi; + struct net_device *dev, *backup; + + dev = get_virtnet_bypass_byref(child_netdev); + if (!dev) + return NOTIFY_DONE; + + vbi = netdev_priv(dev); + + netdev_info(dev, "unregistering %s\n", child_netdev->name); + + netdev_rx_handler_unregister(child_netdev); + netdev_upper_dev_unlink(child_netdev, dev); + child_netdev->flags &= ~IFF_SLAVE; + + if (child_netdev->dev.parent == dev->dev.parent) { + RCU_INIT_POINTER(vbi->backup_netdev, NULL); + } else { + RCU_INIT_POINTER(vbi->active_netdev, NULL); + backup = rtnl_dereference(vbi->backup_netdev); + if (backup) { + dev->min_mtu = backup->min_mtu; + dev->max_mtu = backup->max_mtu; + } + } + + dev_put(child_netdev); + + return NOTIFY_OK; +} + +static int virtnet_bypass_update_link(struct net_device *child_netdev) +{ + struct net_device *dev, *active, *backup; + struct virtnet_bypass_info *vbi; + + dev = get_virtnet_bypass_byref(child_netdev); + if (!dev || !netif_running(dev)) + return NOTIFY_DONE; + + vbi = netdev_priv(dev); + + active = rtnl_dereference(vbi->active_netdev); + backup = rtnl_dereference(vbi->backup_netdev); + + if ((active && virtnet_bypass_xmit_ready(active)) || + (backup && virtnet_bypass_xmit_ready(backup))) { + netif_carrier_on(dev); + netif_tx_wake_all_queues(dev); + } else { + netif_carrier_off(dev); + netif_tx_stop_all_queues(dev); + } + + return NOTIFY_OK; +} + +static int virtnet_bypass_event(struct notifier_block *this, + unsigned long event, void *ptr) +{ + struct net_device *event_dev = netdev_notifier_info_to_dev(ptr); + + /* Skip our own events */ + if (event_dev->netdev_ops == &virtnet_bypass_netdev_ops) + return NOTIFY_DONE; + + /* Avoid non-Ethernet type devices */ + if (event_dev->type != ARPHRD_ETHER) + return NOTIFY_DONE; + + /* Avoid Vlan dev with same MAC registering as child dev */ + if (is_vlan_dev(event_dev)) + return NOTIFY_DONE; + + /* Avoid Bonding master dev with same MAC registering as child dev */ + if ((event_dev->priv_flags & IFF_BONDING) && + (event_dev->flags & IFF_MASTER)) + return NOTIFY_DONE; + + switch (event) { + case NETDEV_REGISTER: + return virtnet_bypass_register_child(event_dev); + case NETDEV_UNREGISTER: + return virtnet_bypass_unregister_child(event_dev); + case NETDEV_UP: + case NETDEV_DOWN: + case NETDEV_CHANGE: + return virtnet_bypass_update_link(event_dev); + default: + return NOTIFY_DONE; + } +} + +static struct notifier_block virtnet_bypass_notifier = { + .notifier_call = virtnet_bypass_event, +}; + +static int virtnet_bypass_create(struct virtnet_info *vi) +{ + struct net_device *backup_netdev = vi->dev; + struct device *dev = &vi->vdev->dev; + struct net_device *bypass_netdev; + int res; + + /* Alloc at least 2 queues, for now we are going with 16 assuming + * that most devices being bonded won't have too many queues. + */ + bypass_netdev = alloc_etherdev_mq(sizeof(struct virtnet_bypass_info), + 16); + if (!bypass_netdev) { + dev_err(dev, "Unable to allocate bypass_netdev!\n"); + return -ENOMEM; + } + + dev_net_set(bypass_netdev, dev_net(backup_netdev)); + SET_NETDEV_DEV(bypass_netdev, dev); + + bypass_netdev->netdev_ops = &virtnet_bypass_netdev_ops; + bypass_netdev->ethtool_ops = &virtnet_bypass_ethtool_ops; + + /* Initialize the device options */ + bypass_netdev->flags |= IFF_MASTER; + bypass_netdev->priv_flags |= IFF_BONDING | IFF_UNICAST_FLT | + IFF_NO_QUEUE; + bypass_netdev->priv_flags &= ~(IFF_XMIT_DST_RELEASE | + IFF_TX_SKB_SHARING); + + /* don't acquire bypass netdev's netif_tx_lock when transmitting */ + bypass_netdev->features |= NETIF_F_LLTX; + + /* Don't allow bypass devices to change network namespaces. */ + bypass_netdev->features |= NETIF_F_NETNS_LOCAL; + + bypass_netdev->hw_features = NETIF_F_HW_CSUM | NETIF_F_SG | + NETIF_F_FRAGLIST | NETIF_F_ALL_TSO | + NETIF_F_HIGHDMA | NETIF_F_LRO; + + bypass_netdev->hw_features |= NETIF_F_GSO_ENCAP_ALL; + bypass_netdev->features |= bypass_netdev->hw_features; + + /* For now treat bypass netdev as VLAN challenged since we + * cannot assume VLAN functionality with a VF + */ + bypass_netdev->features |= NETIF_F_VLAN_CHALLENGED; + + memcpy(bypass_netdev->dev_addr, backup_netdev->dev_addr, + bypass_netdev->addr_len); + + bypass_netdev->min_mtu = backup_netdev->min_mtu; + bypass_netdev->max_mtu = backup_netdev->max_mtu; + + res = register_netdev(bypass_netdev); + if (res < 0) { + dev_err(dev, "Unable to register bypass_netdev!\n"); + free_netdev(bypass_netdev); + return res; + } + + netif_carrier_off(bypass_netdev); + + vi->bypass_netdev = bypass_netdev; + + return 0; +} + +static void virtnet_bypass_destroy(struct virtnet_info *vi) +{ + struct net_device *bypass_netdev = vi->bypass_netdev; + struct virtnet_bypass_info *vbi; + struct net_device *child_netdev; + + /* no device found, nothing to free */ + if (!bypass_netdev) + return; + + vbi = netdev_priv(bypass_netdev); + + netif_device_detach(bypass_netdev); + + rtnl_lock(); + + child_netdev = rtnl_dereference(vbi->active_netdev); + if (child_netdev) + virtnet_bypass_unregister_child(child_netdev); + + child_netdev = rtnl_dereference(vbi->backup_netdev); + if (child_netdev) + virtnet_bypass_unregister_child(child_netdev); + + unregister_netdevice(bypass_netdev); + + rtnl_unlock(); + + free_netdev(bypass_netdev); +} + +/* END of functions supporting VIRTIO_NET_F_BACKUP feature. */ + static int virtnet_probe(struct virtio_device *vdev) { int i, err = -ENOMEM; @@ -2797,10 +3466,15 @@ static int virtnet_probe(struct virtio_device *vdev) virtnet_init_settings(dev); + if (virtio_has_feature(vdev, VIRTIO_NET_F_BACKUP)) { + if (virtnet_bypass_create(vi) != 0) + goto free_vqs; + } + err = register_netdev(dev); if (err) { pr_debug("virtio_net: registering device failed\n"); - goto free_vqs; + goto free_bypass; } virtio_device_ready(vdev); @@ -2837,6 +3511,8 @@ static int virtnet_probe(struct virtio_device *vdev) vi->vdev->config->reset(vdev); unregister_netdev(dev); +free_bypass: + virtnet_bypass_destroy(vi); free_vqs: cancel_delayed_work_sync(&vi->refill); free_receive_page_frags(vi); @@ -2871,6 +3547,8 @@ static void virtnet_remove(struct virtio_device *vdev) unregister_netdev(vi->dev); + virtnet_bypass_destroy(vi); + remove_vq_common(vi); free_netdev(vi->dev); @@ -2968,6 +3646,8 @@ static __init int virtio_net_driver_init(void) ret = register_virtio_driver(&virtio_net_driver); if (ret) goto err_virtio; + + register_netdevice_notifier(&virtnet_bypass_notifier); return 0; err_virtio: cpuhp_remove_multi_state(CPUHP_VIRT_NET_DEAD); @@ -2980,6 +3660,7 @@ module_init(virtio_net_driver_init); static __exit void virtio_net_driver_exit(void) { + unregister_netdevice_notifier(&virtnet_bypass_notifier); unregister_virtio_driver(&virtio_net_driver); cpuhp_remove_multi_state(CPUHP_VIRT_NET_DEAD); cpuhp_remove_multi_state(virtionet_online);