diff mbox series

[RFC,net-next,v6,4/4] netvsc: refactor notifier/event handling code to use the bypass framework

Message ID 1523386790-12396-5-git-send-email-sridhar.samudrala@intel.com
State RFC, archived
Delegated to: David Miller
Headers show
Series Enable virtio_net to act as a backup for a passthru device | expand

Commit Message

Samudrala, Sridhar April 10, 2018, 6:59 p.m. UTC
Use the registration/notification framework supported by the generic
bypass infrastructure.

Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
---
 drivers/net/hyperv/Kconfig      |   1 +
 drivers/net/hyperv/hyperv_net.h |   2 +
 drivers/net/hyperv/netvsc_drv.c | 208 ++++++++++------------------------------
 3 files changed, 55 insertions(+), 156 deletions(-)

Comments

Stephen Hemminger April 10, 2018, 9:26 p.m. UTC | #1
On Tue, 10 Apr 2018 11:59:50 -0700
Sridhar Samudrala <sridhar.samudrala@intel.com> wrote:

> Use the registration/notification framework supported by the generic
> bypass infrastructure.
> 
> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
> ---

Thanks for doing this.  Your current version has couple show stopper
issues.

First, the slave device is instantly taking over the slave.
This doesn't allow udev/systemd to do its device rename of the slave
device. Netvsc uses a delayed work to workaround this.

Secondly, the select queue needs to call queue selection in VF.
The bonding/teaming logic doesn't work well for UDP flows.
Commit b3bf5666a510 ("hv_netvsc: defer queue selection to VF")
fixed this performance problem.

Lastly, more indirection is bad in current climate.

I am not completely adverse to this but it needs to be fast, simple
and completely transparent.
Samudrala, Sridhar April 10, 2018, 10:56 p.m. UTC | #2
On 4/10/2018 2:26 PM, Stephen Hemminger wrote:
> On Tue, 10 Apr 2018 11:59:50 -0700
> Sridhar Samudrala <sridhar.samudrala@intel.com> wrote:
>
>> Use the registration/notification framework supported by the generic
>> bypass infrastructure.
>>
>> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>> ---
> Thanks for doing this.  Your current version has couple show stopper
> issues.
>
> First, the slave device is instantly taking over the slave.
> This doesn't allow udev/systemd to do its device rename of the slave
> device. Netvsc uses a delayed work to workaround this.

OK. I guess you are referring to the dev_set_mtu() and dev_open() calls that are
made in bypass_slave_register() and you want to defer them to be done after
a delay.  I could avoid these calls in case of netvsc based on bypass_ops.


>
> Secondly, the select queue needs to call queue selection in VF.
> The bonding/teaming logic doesn't work well for UDP flows.
> Commit b3bf5666a510 ("hv_netvsc: defer queue selection to VF")
> fixed this performance problem.

netvsc should not be using bypass_select_queue() as  that ndo op gets used
only with 3-netdev model.
Anyway, will look into updating bypass_select_queue() based on your fix.

>
> Lastly, more indirection is bad in current climate.
>
> I am not completely adverse to this but it needs to be fast, simple
> and completely transparent.

Not sure we can avoid this indirection if we want to commonize the code,  but use
different models for virtio-net and netvsc.

On the other hand, these patches avoid calls to get_netvsc_bymac() and
get_netvsc_by_ref() that go through all the devices for all the netdev events.
netvsc lookups should be much faster.

Thanks
Sridhar
Michael S. Tsirkin April 10, 2018, 11:28 p.m. UTC | #3
On Tue, Apr 10, 2018 at 02:26:08PM -0700, Stephen Hemminger wrote:
> On Tue, 10 Apr 2018 11:59:50 -0700
> Sridhar Samudrala <sridhar.samudrala@intel.com> wrote:
> 
> > Use the registration/notification framework supported by the generic
> > bypass infrastructure.
> > 
> > Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
> > ---
> 
> Thanks for doing this.  Your current version has couple show stopper
> issues.
> 
> First, the slave device is instantly taking over the slave.
> This doesn't allow udev/systemd to do its device rename of the slave
> device. Netvsc uses a delayed work to workaround this.

Interesting. Does this mean udev must act within a specific time window
then?

> Secondly, the select queue needs to call queue selection in VF.
> The bonding/teaming logic doesn't work well for UDP flows.
> Commit b3bf5666a510 ("hv_netvsc: defer queue selection to VF")
> fixed this performance problem.
> 
> Lastly, more indirection is bad in current climate.
> 
> I am not completely adverse to this but it needs to be fast, simple
> and completely transparent.
Siwei Liu April 10, 2018, 11:44 p.m. UTC | #4
On Tue, Apr 10, 2018 at 4:28 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Tue, Apr 10, 2018 at 02:26:08PM -0700, Stephen Hemminger wrote:
>> On Tue, 10 Apr 2018 11:59:50 -0700
>> Sridhar Samudrala <sridhar.samudrala@intel.com> wrote:
>>
>> > Use the registration/notification framework supported by the generic
>> > bypass infrastructure.
>> >
>> > Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>> > ---
>>
>> Thanks for doing this.  Your current version has couple show stopper
>> issues.
>>
>> First, the slave device is instantly taking over the slave.
>> This doesn't allow udev/systemd to do its device rename of the slave
>> device. Netvsc uses a delayed work to workaround this.
>
> Interesting. Does this mean udev must act within a specific time window
> then?

Sighs, lots of hacks. Why propgating this from driver to a common
module. We really need a clean solution.

-Siwei


>
>> Secondly, the select queue needs to call queue selection in VF.
>> The bonding/teaming logic doesn't work well for UDP flows.
>> Commit b3bf5666a510 ("hv_netvsc: defer queue selection to VF")
>> fixed this performance problem.
>>
>> Lastly, more indirection is bad in current climate.
>>
>> I am not completely adverse to this but it needs to be fast, simple
>> and completely transparent.
Stephen Hemminger April 10, 2018, 11:59 p.m. UTC | #5
On Tue, 10 Apr 2018 16:44:47 -0700
Siwei Liu <loseweigh@gmail.com> wrote:

> On Tue, Apr 10, 2018 at 4:28 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> > On Tue, Apr 10, 2018 at 02:26:08PM -0700, Stephen Hemminger wrote:  
> >> On Tue, 10 Apr 2018 11:59:50 -0700
> >> Sridhar Samudrala <sridhar.samudrala@intel.com> wrote:
> >>  
> >> > Use the registration/notification framework supported by the generic
> >> > bypass infrastructure.
> >> >
> >> > Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
> >> > ---  
> >>
> >> Thanks for doing this.  Your current version has couple show stopper
> >> issues.
> >>
> >> First, the slave device is instantly taking over the slave.
> >> This doesn't allow udev/systemd to do its device rename of the slave
> >> device. Netvsc uses a delayed work to workaround this.  
> >
> > Interesting. Does this mean udev must act within a specific time window
> > then?  
> 
> Sighs, lots of hacks. Why propgating this from driver to a common
> module. We really need a clean solution.
> 

I had a patch to wait for udev to do the rename and go from there
but davem rejected it.
Michael S. Tsirkin April 11, 2018, 1:21 a.m. UTC | #6
On Tue, Apr 10, 2018 at 02:26:08PM -0700, Stephen Hemminger wrote:
> On Tue, 10 Apr 2018 11:59:50 -0700
> Sridhar Samudrala <sridhar.samudrala@intel.com> wrote:
> 
> > Use the registration/notification framework supported by the generic
> > bypass infrastructure.
> > 
> > Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
> > ---
> 
> Thanks for doing this.  Your current version has couple show stopper
> issues.
> 
> First, the slave device is instantly taking over the slave.
> This doesn't allow udev/systemd to do its device rename of the slave
> device. Netvsc uses a delayed work to workaround this.
> 
> Secondly, the select queue needs to call queue selection in VF.
> The bonding/teaming logic doesn't work well for UDP flows.
> Commit b3bf5666a510 ("hv_netvsc: defer queue selection to VF")
> fixed this performance problem.
> 
> Lastly, more indirection is bad in current climate.

Well right now netvsc does an indirect call to the PT device,
does it not? If you really want max performance when PT
is in use you need to do the reverse and have PT forward to netvsc.

> I am not completely adverse to this but it needs to be fast, simple
> and completely transparent.
Jiri Pirko April 11, 2018, 7:50 a.m. UTC | #7
Wed, Apr 11, 2018 at 01:28:51AM CEST, mst@redhat.com wrote:
>On Tue, Apr 10, 2018 at 02:26:08PM -0700, Stephen Hemminger wrote:
>> On Tue, 10 Apr 2018 11:59:50 -0700
>> Sridhar Samudrala <sridhar.samudrala@intel.com> wrote:
>> 
>> > Use the registration/notification framework supported by the generic
>> > bypass infrastructure.
>> > 
>> > Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>> > ---
>> 
>> Thanks for doing this.  Your current version has couple show stopper
>> issues.
>> 
>> First, the slave device is instantly taking over the slave.
>> This doesn't allow udev/systemd to do its device rename of the slave
>> device. Netvsc uses a delayed work to workaround this.
>
>Interesting. Does this mean udev must act within a specific time window
>then?

Yeah. That is scarry. Also, wrong.


>
>> Secondly, the select queue needs to call queue selection in VF.
>> The bonding/teaming logic doesn't work well for UDP flows.
>> Commit b3bf5666a510 ("hv_netvsc: defer queue selection to VF")
>> fixed this performance problem.
>> 
>> Lastly, more indirection is bad in current climate.
>> 
>> I am not completely adverse to this but it needs to be fast, simple
>> and completely transparent.
Jiri Pirko April 11, 2018, 7:53 a.m. UTC | #8
Tue, Apr 10, 2018 at 11:26:08PM CEST, stephen@networkplumber.org wrote:
>On Tue, 10 Apr 2018 11:59:50 -0700
>Sridhar Samudrala <sridhar.samudrala@intel.com> wrote:
>
>> Use the registration/notification framework supported by the generic
>> bypass infrastructure.
>> 
>> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>> ---
>
>Thanks for doing this.  Your current version has couple show stopper
>issues.
>
>First, the slave device is instantly taking over the slave.
>This doesn't allow udev/systemd to do its device rename of the slave
>device. Netvsc uses a delayed work to workaround this.

Wait. Why the fact a device is enslaved has to affect the udev in any
way? If it does, smells like a bug in udev.


>
>Secondly, the select queue needs to call queue selection in VF.
>The bonding/teaming logic doesn't work well for UDP flows.
>Commit b3bf5666a510 ("hv_netvsc: defer queue selection to VF")
>fixed this performance problem.
>
>Lastly, more indirection is bad in current climate.
>
>I am not completely adverse to this but it needs to be fast, simple
>and completely transparent.
Siwei Liu Feb. 22, 2019, 1:14 a.m. UTC | #9
Sorry for replying to this ancient thread. There was some remaining
issue that I don't think the initial net_failover patch got addressed
cleanly, see:

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1815268

The renaming of 'eth0' to 'ens4' fails because the udev userspace was
not specifically writtten for such kernel automatic enslavement.
Specifically, if it is a bond or team, the slave would typically get
renamed *before* virtual device gets created, that's what udev can
control (without getting netdev opened early by the other part of
kernel) and other userspace components for e.g. initramfs,
init-scripts can coordinate well in between. The in-kernel
auto-enslavement of net_failover breaks this userspace convention,
which don't provides a solution if user care about consistent naming
on the slave netdevs specifically.

Previously this issue had been specifically called out when IFF_HIDDEN
and the 1-netdev was proposed, but no one gives out a solution to this
problem ever since. Please share your mind how to proceed and solve
this userspace issue if netdev does not welcome a 1-netdev model.

On Wed, Apr 11, 2018 at 12:53 AM Jiri Pirko <jiri@resnulli.us> wrote:
>
> Tue, Apr 10, 2018 at 11:26:08PM CEST, stephen@networkplumber.org wrote:
> >On Tue, 10 Apr 2018 11:59:50 -0700
> >Sridhar Samudrala <sridhar.samudrala@intel.com> wrote:
> >
> >> Use the registration/notification framework supported by the generic
> >> bypass infrastructure.
> >>
> >> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
> >> ---
> >
> >Thanks for doing this.  Your current version has couple show stopper
> >issues.
> >
> >First, the slave device is instantly taking over the slave.
> >This doesn't allow udev/systemd to do its device rename of the slave
> >device. Netvsc uses a delayed work to workaround this.
>
> Wait. Why the fact a device is enslaved has to affect the udev in any
> way? If it does, smells like a bug in udev.

See above for clarifications.

Thanks,


>
>
> >
> >Secondly, the select queue needs to call queue selection in VF.
> >The bonding/teaming logic doesn't work well for UDP flows.
> >Commit b3bf5666a510 ("hv_netvsc: defer queue selection to VF")
> >fixed this performance problem.
> >
> >Lastly, more indirection is bad in current climate.
> >
> >I am not completely adverse to this but it needs to be fast, simple
> >and completely transparent.
Michael S. Tsirkin Feb. 22, 2019, 1:39 a.m. UTC | #10
On Thu, Feb 21, 2019 at 05:14:44PM -0800, Siwei Liu wrote:
> Sorry for replying to this ancient thread. There was some remaining
> issue that I don't think the initial net_failover patch got addressed
> cleanly, see:
> 
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1815268
> 
> The renaming of 'eth0' to 'ens4' fails because the udev userspace was
> not specifically writtten for such kernel automatic enslavement.
> Specifically, if it is a bond or team, the slave would typically get
> renamed *before* virtual device gets created, that's what udev can
> control (without getting netdev opened early by the other part of
> kernel) and other userspace components for e.g. initramfs,
> init-scripts can coordinate well in between. The in-kernel
> auto-enslavement of net_failover breaks this userspace convention,
> which don't provides a solution if user care about consistent naming
> on the slave netdevs specifically.
> 
> Previously this issue had been specifically called out when IFF_HIDDEN
> and the 1-netdev was proposed, but no one gives out a solution to this
> problem ever since. Please share your mind how to proceed and solve
> this userspace issue if netdev does not welcome a 1-netdev model.

Above says:

	there's no motivation in the systemd/udevd community at
	this point to refactor the rename logic and make it work well with
	3-netdev.

What would the fix be? Skip slave devices?
Si-Wei Liu Feb. 22, 2019, 3:33 a.m. UTC | #11
On 2/21/2019 5:39 PM, Michael S. Tsirkin wrote:
> On Thu, Feb 21, 2019 at 05:14:44PM -0800, Siwei Liu wrote:
>> Sorry for replying to this ancient thread. There was some remaining
>> issue that I don't think the initial net_failover patch got addressed
>> cleanly, see:
>>
>> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1815268
>>
>> The renaming of 'eth0' to 'ens4' fails because the udev userspace was
>> not specifically writtten for such kernel automatic enslavement.
>> Specifically, if it is a bond or team, the slave would typically get
>> renamed *before* virtual device gets created, that's what udev can
>> control (without getting netdev opened early by the other part of
>> kernel) and other userspace components for e.g. initramfs,
>> init-scripts can coordinate well in between. The in-kernel
>> auto-enslavement of net_failover breaks this userspace convention,
>> which don't provides a solution if user care about consistent naming
>> on the slave netdevs specifically.
>>
>> Previously this issue had been specifically called out when IFF_HIDDEN
>> and the 1-netdev was proposed, but no one gives out a solution to this
>> problem ever since. Please share your mind how to proceed and solve
>> this userspace issue if netdev does not welcome a 1-netdev model.
> Above says:
>
> 	there's no motivation in the systemd/udevd community at
> 	this point to refactor the rename logic and make it work well with
> 	3-netdev.
>
> What would the fix be? Skip slave devices?
>
There's nothing user can get if just skipping slave devices - the name 
is still unchanged and unpredictable e.g. eth0, or eth1 the next reboot, 
while the rest may conform to the naming scheme (ens3 and such). There's 
no way one can fix this in userspace alone - when the failover is 
created the enslaved netdev was opened by the kernel earlier than the 
userspace is made aware of, and there's no negotiation protocol for 
kernel to know when userspace has done initial renaming of the 
interface. I would expect netdev list should at least provide the 
direction in general for how this can be solved...

-Siwei
Si-Wei Liu Feb. 22, 2019, 7:55 a.m. UTC | #12
On 2/21/2019 11:00 PM, Samudrala, Sridhar wrote:
>
>
> On 2/21/2019 7:33 PM, si-wei liu wrote:
>>
>>
>> On 2/21/2019 5:39 PM, Michael S. Tsirkin wrote:
>>> On Thu, Feb 21, 2019 at 05:14:44PM -0800, Siwei Liu wrote:
>>>> Sorry for replying to this ancient thread. There was some remaining
>>>> issue that I don't think the initial net_failover patch got addressed
>>>> cleanly, see:
>>>>
>>>> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1815268
>>>>
>>>> The renaming of 'eth0' to 'ens4' fails because the udev userspace was
>>>> not specifically writtten for such kernel automatic enslavement.
>>>> Specifically, if it is a bond or team, the slave would typically get
>>>> renamed *before* virtual device gets created, that's what udev can
>>>> control (without getting netdev opened early by the other part of
>>>> kernel) and other userspace components for e.g. initramfs,
>>>> init-scripts can coordinate well in between. The in-kernel
>>>> auto-enslavement of net_failover breaks this userspace convention,
>>>> which don't provides a solution if user care about consistent naming
>>>> on the slave netdevs specifically.
>>>>
>>>> Previously this issue had been specifically called out when IFF_HIDDEN
>>>> and the 1-netdev was proposed, but no one gives out a solution to this
>>>> problem ever since. Please share your mind how to proceed and solve
>>>> this userspace issue if netdev does not welcome a 1-netdev model.
>>> Above says:
>>>
>>>     there's no motivation in the systemd/udevd community at
>>>     this point to refactor the rename logic and make it work well with
>>>     3-netdev.
>>>
>>> What would the fix be? Skip slave devices?
>>>
>> There's nothing user can get if just skipping slave devices - the 
>> name is still unchanged and unpredictable e.g. eth0, or eth1 the next 
>> reboot, while the rest may conform to the naming scheme (ens3 and 
>> such). There's no way one can fix this in userspace alone - when the 
>> failover is created the enslaved netdev was opened by the kernel 
>> earlier than the userspace is made aware of, and there's no 
>> negotiation protocol for kernel to know when userspace has done 
>> initial renaming of the interface. I would expect netdev list should 
>> at least provide the direction in general for how this can be solved...
>>
> Is there an issue if slave device names are not predictable? The user/admin scripts are expected
> to only work with the master failover device.
Where does this expectation come from?

Admin users may have ethtool or tc configurations that need to deal with 
predictable interface name. Third-party app which was built upon 
specifying certain interface name can't be modified to chase dynamic names.

Specifically, we have pre-canned image that uses ethtool to fine tune VF 
offload settings post boot for specific workload. Those images won't 
work well if the name is constantly changing just after couple rounds of 
live migration.

> Moreover, you were suggesting hiding the lower slave devices anyway. There was some discussion
> about moving them to a hidden network namespace so that they are not visible from the default namespace.
> I looked into this sometime back, but did not find the right kernel api to create a network namespace within
> kernel. If so, we could use this mechanism to simulate a 1-netdev model.
Yes, that's one possible implementation (IMHO the key is to make 
1-netdev model as much transparent to a real NIC as possible, while a 
hidden netns is just the vehicle). However, I recall there was 
resistance around this discussion that even the concept of hiding itself 
is a taboo for Linux netdev. I would like to summon potential 
alternatives before concluding 1-netdev is the only solution too soon.

Thanks,
-Siwei

>
>> -Siwei
>>
>>
Rob Miller Feb. 22, 2019, 12:58 p.m. UTC | #13
I don’t know enough about how they get named, but is it possible for
user space to suggest its interface name, such that the interface name
would we as unique as the VM name itself. and is limited to scope to
be within the network boundry of an organization?

In other words, as a company, i decided to name my VM co-vm-1 through
co-vm-xxx, i would leave off location of vm b/c that will change. My
interfaces then would be named, co-vm-1.0 through co-vm-1.x.

Just thinking out loud.

Sent from my iPhone

> On Feb 22, 2019, at 2:55 AM, si-wei liu <si-wei.liu@oracle.com> wrote:
>
>
>
>> On 2/21/2019 11:00 PM, Samudrala, Sridhar wrote:
>>
>>
>>> On 2/21/2019 7:33 PM, si-wei liu wrote:
>>>
>>>
>>>> On 2/21/2019 5:39 PM, Michael S. Tsirkin wrote:
>>>>> On Thu, Feb 21, 2019 at 05:14:44PM -0800, Siwei Liu wrote:
>>>>> Sorry for replying to this ancient thread. There was some remaining
>>>>> issue that I don't think the initial net_failover patch got addressed
>>>>> cleanly, see:
>>>>>
>>>>> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1815268
>>>>>
>>>>> The renaming of 'eth0' to 'ens4' fails because the udev userspace was
>>>>> not specifically writtten for such kernel automatic enslavement.
>>>>> Specifically, if it is a bond or team, the slave would typically get
>>>>> renamed *before* virtual device gets created, that's what udev can
>>>>> control (without getting netdev opened early by the other part of
>>>>> kernel) and other userspace components for e.g. initramfs,
>>>>> init-scripts can coordinate well in between. The in-kernel
>>>>> auto-enslavement of net_failover breaks this userspace convention,
>>>>> which don't provides a solution if user care about consistent naming
>>>>> on the slave netdevs specifically.
>>>>>
>>>>> Previously this issue had been specifically called out when IFF_HIDDEN
>>>>> and the 1-netdev was proposed, but no one gives out a solution to this
>>>>> problem ever since. Please share your mind how to proceed and solve
>>>>> this userspace issue if netdev does not welcome a 1-netdev model.
>>>> Above says:
>>>>
>>>>    there's no motivation in the systemd/udevd community at
>>>>    this point to refactor the rename logic and make it work well with
>>>>    3-netdev.
>>>>
>>>> What would the fix be? Skip slave devices?
>>>>
>>> There's nothing user can get if just skipping slave devices - the name is still unchanged and unpredictable e.g. eth0, or eth1 the next reboot, while the rest may conform to the naming scheme (ens3 and such). There's no way one can fix this in userspace alone - when the failover is created the enslaved netdev was opened by the kernel earlier than the userspace is made aware of, and there's no negotiation protocol for kernel to know when userspace has done initial renaming of the interface. I would expect netdev list should at least provide the direction in general for how this can be solved...
>>>
>> Is there an issue if slave device names are not predictable? The user/admin scripts are expected
>> to only work with the master failover device.
> Where does this expectation come from?
>
> Admin users may have ethtool or tc configurations that need to deal with predictable interface name. Third-party app which was built upon specifying certain interface name can't be modified to chase dynamic names.
>
> Specifically, we have pre-canned image that uses ethtool to fine tune VF offload settings post boot for specific workload. Those images won't work well if the name is constantly changing just after couple rounds of live migration.
>
>> Moreover, you were suggesting hiding the lower slave devices anyway. There was some discussion
>> about moving them to a hidden network namespace so that they are not visible from the default namespace.
>> I looked into this sometime back, but did not find the right kernel api to create a network namespace within
>> kernel. If so, we could use this mechanism to simulate a 1-netdev model.
> Yes, that's one possible implementation (IMHO the key is to make 1-netdev model as much transparent to a real NIC as possible, while a hidden netns is just the vehicle). However, I recall there was resistance around this discussion that even the concept of hiding itself is a taboo for Linux netdev. I would like to summon potential alternatives before concluding 1-netdev is the only solution too soon.
>
> Thanks,
> -Siwei
>
>>
>>> -Siwei
>>>
>>>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
> For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org
>
Michael S. Tsirkin Feb. 22, 2019, 3:14 p.m. UTC | #14
On Thu, Feb 21, 2019 at 11:55:11PM -0800, si-wei liu wrote:
> 
> 
> On 2/21/2019 11:00 PM, Samudrala, Sridhar wrote:
> > 
> > 
> > On 2/21/2019 7:33 PM, si-wei liu wrote:
> > > 
> > > 
> > > On 2/21/2019 5:39 PM, Michael S. Tsirkin wrote:
> > > > On Thu, Feb 21, 2019 at 05:14:44PM -0800, Siwei Liu wrote:
> > > > > Sorry for replying to this ancient thread. There was some remaining
> > > > > issue that I don't think the initial net_failover patch got addressed
> > > > > cleanly, see:
> > > > > 
> > > > > https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1815268
> > > > > 
> > > > > The renaming of 'eth0' to 'ens4' fails because the udev userspace was
> > > > > not specifically writtten for such kernel automatic enslavement.
> > > > > Specifically, if it is a bond or team, the slave would typically get
> > > > > renamed *before* virtual device gets created, that's what udev can
> > > > > control (without getting netdev opened early by the other part of
> > > > > kernel) and other userspace components for e.g. initramfs,
> > > > > init-scripts can coordinate well in between. The in-kernel
> > > > > auto-enslavement of net_failover breaks this userspace convention,
> > > > > which don't provides a solution if user care about consistent naming
> > > > > on the slave netdevs specifically.
> > > > > 
> > > > > Previously this issue had been specifically called out when IFF_HIDDEN
> > > > > and the 1-netdev was proposed, but no one gives out a solution to this
> > > > > problem ever since. Please share your mind how to proceed and solve
> > > > > this userspace issue if netdev does not welcome a 1-netdev model.
> > > > Above says:
> > > > 
> > > >     there's no motivation in the systemd/udevd community at
> > > >     this point to refactor the rename logic and make it work well with
> > > >     3-netdev.
> > > > 
> > > > What would the fix be? Skip slave devices?
> > > > 
> > > There's nothing user can get if just skipping slave devices - the
> > > name is still unchanged and unpredictable e.g. eth0, or eth1 the
> > > next reboot, while the rest may conform to the naming scheme (ens3
> > > and such). There's no way one can fix this in userspace alone - when
> > > the failover is created the enslaved netdev was opened by the kernel
> > > earlier than the userspace is made aware of, and there's no
> > > negotiation protocol for kernel to know when userspace has done
> > > initial renaming of the interface. I would expect netdev list should
> > > at least provide the direction in general for how this can be
> > > solved...


I was just wondering what did you mean when you said
"refactor the rename logic and make it work well with 3-netdev" -
was there a proposal udev rejected?

Anyway, can we write a time diagram for what happens in which order that
leads to failure?  That would help look for triggers that we can tie
into, or add new ones.






> > > 
> > Is there an issue if slave device names are not predictable? The user/admin scripts are expected
> > to only work with the master failover device.
> Where does this expectation come from?
> 
> Admin users may have ethtool or tc configurations that need to deal with
> predictable interface name. Third-party app which was built upon specifying
> certain interface name can't be modified to chase dynamic names.
> 
> Specifically, we have pre-canned image that uses ethtool to fine tune VF
> offload settings post boot for specific workload. Those images won't work
> well if the name is constantly changing just after couple rounds of live
> migration.

It should be possible to specify the ethtool configuration on the
master and have it automatically propagated to the slave.

BTW this is something we should look at IMHO.

> > Moreover, you were suggesting hiding the lower slave devices anyway. There was some discussion
> > about moving them to a hidden network namespace so that they are not visible from the default namespace.
> > I looked into this sometime back, but did not find the right kernel api to create a network namespace within
> > kernel. If so, we could use this mechanism to simulate a 1-netdev model.
> Yes, that's one possible implementation (IMHO the key is to make 1-netdev
> model as much transparent to a real NIC as possible, while a hidden netns is
> just the vehicle). However, I recall there was resistance around this
> discussion that even the concept of hiding itself is a taboo for Linux
> netdev. I would like to summon potential alternatives before concluding
> 1-netdev is the only solution too soon.
> 
> Thanks,
> -Siwei

Your scripts would not work at all then, right?


> > 
> > > -Siwei
> > > 
> > >
Si-Wei Liu Feb. 26, 2019, 12:58 a.m. UTC | #15
On 2/22/2019 7:14 AM, Michael S. Tsirkin wrote:
> On Thu, Feb 21, 2019 at 11:55:11PM -0800, si-wei liu wrote:
>>
>> On 2/21/2019 11:00 PM, Samudrala, Sridhar wrote:
>>>
>>> On 2/21/2019 7:33 PM, si-wei liu wrote:
>>>>
>>>> On 2/21/2019 5:39 PM, Michael S. Tsirkin wrote:
>>>>> On Thu, Feb 21, 2019 at 05:14:44PM -0800, Siwei Liu wrote:
>>>>>> Sorry for replying to this ancient thread. There was some remaining
>>>>>> issue that I don't think the initial net_failover patch got addressed
>>>>>> cleanly, see:
>>>>>>
>>>>>> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1815268
>>>>>>
>>>>>> The renaming of 'eth0' to 'ens4' fails because the udev userspace was
>>>>>> not specifically writtten for such kernel automatic enslavement.
>>>>>> Specifically, if it is a bond or team, the slave would typically get
>>>>>> renamed *before* virtual device gets created, that's what udev can
>>>>>> control (without getting netdev opened early by the other part of
>>>>>> kernel) and other userspace components for e.g. initramfs,
>>>>>> init-scripts can coordinate well in between. The in-kernel
>>>>>> auto-enslavement of net_failover breaks this userspace convention,
>>>>>> which don't provides a solution if user care about consistent naming
>>>>>> on the slave netdevs specifically.
>>>>>>
>>>>>> Previously this issue had been specifically called out when IFF_HIDDEN
>>>>>> and the 1-netdev was proposed, but no one gives out a solution to this
>>>>>> problem ever since. Please share your mind how to proceed and solve
>>>>>> this userspace issue if netdev does not welcome a 1-netdev model.
>>>>> Above says:
>>>>>
>>>>>      there's no motivation in the systemd/udevd community at
>>>>>      this point to refactor the rename logic and make it work well with
>>>>>      3-netdev.
>>>>>
>>>>> What would the fix be? Skip slave devices?
>>>>>
>>>> There's nothing user can get if just skipping slave devices - the
>>>> name is still unchanged and unpredictable e.g. eth0, or eth1 the
>>>> next reboot, while the rest may conform to the naming scheme (ens3
>>>> and such). There's no way one can fix this in userspace alone - when
>>>> the failover is created the enslaved netdev was opened by the kernel
>>>> earlier than the userspace is made aware of, and there's no
>>>> negotiation protocol for kernel to know when userspace has done
>>>> initial renaming of the interface. I would expect netdev list should
>>>> at least provide the direction in general for how this can be
>>>> solved...
>
> I was just wondering what did you mean when you said
> "refactor the rename logic and make it work well with 3-netdev" -
> was there a proposal udev rejected?
No. I never believed this particular issue can be fixed in userspace 
alone. Previously someone had said it could be, but I never see any work 
or relevant discussion ever happened in various userspace communities 
(for e.g. dracut, initramfs-tools, systemd, udev, and NetworkManager). 
IMHO the root of the issue derives from the kernel, it makes more sense 
to start from netdev, work out and decide on a solution: see what can be 
done in the kernel in order to fix it, then after that engage userspace 
community for the feasibility...

> Anyway, can we write a time diagram for what happens in which order that
> leads to failure?  That would help look for triggers that we can tie
> into, or add new ones.
>

See attached diagram.

>
>
>
>
>>> Is there an issue if slave device names are not predictable? The user/admin scripts are expected
>>> to only work with the master failover device.
>> Where does this expectation come from?
>>
>> Admin users may have ethtool or tc configurations that need to deal with
>> predictable interface name. Third-party app which was built upon specifying
>> certain interface name can't be modified to chase dynamic names.
>>
>> Specifically, we have pre-canned image that uses ethtool to fine tune VF
>> offload settings post boot for specific workload. Those images won't work
>> well if the name is constantly changing just after couple rounds of live
>> migration.
> It should be possible to specify the ethtool configuration on the
> master and have it automatically propagated to the slave.
>
> BTW this is something we should look at IMHO.
I was elaborating a few examples that the expectation and assumption 
that user/admin scripts only deal with master failover device is 
incorrect. It had never been taken good care of, although I did try to 
emphasize it from the very beginning.

Basically what you said about propagating the ethtool configuration down 
to the slave is the key pursuance of 1-netdev model. However, what I am 
seeking now is any alternative that can also fix the specific udev 
rename problem, before concluding that 1-netdev is the only solution. 
Generally a 1-netdev scheme would take time to implement, while I'm 
trying to find a way out to fix this particular naming problem under 
3-netdev.

>
>>> Moreover, you were suggesting hiding the lower slave devices anyway. There was some discussion
>>> about moving them to a hidden network namespace so that they are not visible from the default namespace.
>>> I looked into this sometime back, but did not find the right kernel api to create a network namespace within
>>> kernel. If so, we could use this mechanism to simulate a 1-netdev model.
>> Yes, that's one possible implementation (IMHO the key is to make 1-netdev
>> model as much transparent to a real NIC as possible, while a hidden netns is
>> just the vehicle). However, I recall there was resistance around this
>> discussion that even the concept of hiding itself is a taboo for Linux
>> netdev. I would like to summon potential alternatives before concluding
>> 1-netdev is the only solution too soon.
>>
>> Thanks,
>> -Siwei
> Your scripts would not work at all then, right?
At this point we don't claim images with such usage as SR-IOV live 
migrate-able. We would flag it as live migrate-able until this ethtool 
config issue is fully addressed and a transparent live migration 
solution emerges in upstream eventually.


Thanks,
-Siwei
>
>
>>>> -Siwei
>>>>
>>>>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
> For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org
>
net_failover(kernel)                            |    network.service (user)    |          systemd-udevd (user)
--------------------------------------------------+------------------------------+--------------------------------------------
(standby virtio-net and net_failover              |                              |
devices created and initialized,                  |                              |
i.e. virtnet_probe()->                            |                              |
       net_failover_create()                      |                              |
was done.)                                        |                              |
                                                  |                              |
                                                  |  runs `ifup ens3' ->         |
                                                  |    ip link set dev ens3 up   |
net_failover_open()                               |                              |
  dev_open(virtnet_dev)                           |                              |
    virtnet_open(virtnet_dev)                     |                              |
  netif_carrier_on(failover_dev)                  |                              |
  ...                                             |                              |
                                                  |                              |
(VF hot plugged in)                               |                              |
ixgbevf_probe()                                   |                              |
 register_netdev(ixgbevf_netdev)                  |                              |
  netdev_register_kobject(ixgbevf_netdev)         |                              |
   kobject_add(ixgbevf_dev)                       |                              |
    device_add(ixgbevf_dev)                       |                              |
     kobject_uevent(&ixgbevf_dev->kobj, KOBJ_ADD) |                              |
      netlink_broadcast()                         |                              |
  ...                                             |                              |
  call_netdevice_notifiers(NETDEV_REGISTER)       |                              |
   failover_event(..., NETDEV_REGISTER, ...)      |                              |
    failover_slave_register(ixgbevf_netdev)       |                              |
     net_failover_slave_register(ixgbevf_netdev)  |                              |
      dev_open(ixgbevf_netdev)                    |                              |
                                                  |                              |
                                                  |                              |
                                                  |                              |   received ADD uevent from netlink fd
                                                  |                              |   ...
                                                  |                              |   udev-builtin-net_id.c:dev_pci_slot()
                                                  |                              |   (decided to renamed 'eth0' )
                                                  |                              |     ip link set dev eth0 name ens4
(dev_change_name() returns -EBUSY as              |                              |
ixgbevf_netdev->flags has IFF_UP)                 |                              |
                                                  |                              |
Stephen Hemminger Feb. 26, 2019, 1:39 a.m. UTC | #16
On Mon, 25 Feb 2019 16:58:07 -0800
si-wei liu <si-wei.liu@oracle.com> wrote:

> On 2/22/2019 7:14 AM, Michael S. Tsirkin wrote:
> > On Thu, Feb 21, 2019 at 11:55:11PM -0800, si-wei liu wrote:  
> >>
> >> On 2/21/2019 11:00 PM, Samudrala, Sridhar wrote:  
> >>>
> >>> On 2/21/2019 7:33 PM, si-wei liu wrote:  
> >>>>
> >>>> On 2/21/2019 5:39 PM, Michael S. Tsirkin wrote:  
> >>>>> On Thu, Feb 21, 2019 at 05:14:44PM -0800, Siwei Liu wrote:  
> >>>>>> Sorry for replying to this ancient thread. There was some remaining
> >>>>>> issue that I don't think the initial net_failover patch got addressed
> >>>>>> cleanly, see:
> >>>>>>
> >>>>>> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1815268
> >>>>>>
> >>>>>> The renaming of 'eth0' to 'ens4' fails because the udev userspace was
> >>>>>> not specifically writtten for such kernel automatic enslavement.
> >>>>>> Specifically, if it is a bond or team, the slave would typically get
> >>>>>> renamed *before* virtual device gets created, that's what udev can
> >>>>>> control (without getting netdev opened early by the other part of
> >>>>>> kernel) and other userspace components for e.g. initramfs,
> >>>>>> init-scripts can coordinate well in between. The in-kernel
> >>>>>> auto-enslavement of net_failover breaks this userspace convention,
> >>>>>> which don't provides a solution if user care about consistent naming
> >>>>>> on the slave netdevs specifically.
> >>>>>>
> >>>>>> Previously this issue had been specifically called out when IFF_HIDDEN
> >>>>>> and the 1-netdev was proposed, but no one gives out a solution to this
> >>>>>> problem ever since. Please share your mind how to proceed and solve
> >>>>>> this userspace issue if netdev does not welcome a 1-netdev model.  
> >>>>> Above says:
> >>>>>
> >>>>>      there's no motivation in the systemd/udevd community at
> >>>>>      this point to refactor the rename logic and make it work well with
> >>>>>      3-netdev.
> >>>>>
> >>>>> What would the fix be? Skip slave devices?
> >>>>>  
> >>>> There's nothing user can get if just skipping slave devices - the
> >>>> name is still unchanged and unpredictable e.g. eth0, or eth1 the
> >>>> next reboot, while the rest may conform to the naming scheme (ens3
> >>>> and such). There's no way one can fix this in userspace alone - when
> >>>> the failover is created the enslaved netdev was opened by the kernel
> >>>> earlier than the userspace is made aware of, and there's no
> >>>> negotiation protocol for kernel to know when userspace has done
> >>>> initial renaming of the interface. I would expect netdev list should
> >>>> at least provide the direction in general for how this can be
> >>>> solved...  
> >
> > I was just wondering what did you mean when you said
> > "refactor the rename logic and make it work well with 3-netdev" -
> > was there a proposal udev rejected?  
> No. I never believed this particular issue can be fixed in userspace 
> alone. Previously someone had said it could be, but I never see any work 
> or relevant discussion ever happened in various userspace communities 
> (for e.g. dracut, initramfs-tools, systemd, udev, and NetworkManager). 
> IMHO the root of the issue derives from the kernel, it makes more sense 
> to start from netdev, work out and decide on a solution: see what can be 
> done in the kernel in order to fix it, then after that engage userspace 
> community for the feasibility...
> 
> > Anyway, can we write a time diagram for what happens in which order that
> > leads to failure?  That would help look for triggers that we can tie
> > into, or add new ones.
> >  
> 
> See attached diagram.
> 
> >
> >
> >
> >  
> >>> Is there an issue if slave device names are not predictable? The user/admin scripts are expected
> >>> to only work with the master failover device.  
> >> Where does this expectation come from?
> >>
> >> Admin users may have ethtool or tc configurations that need to deal with
> >> predictable interface name. Third-party app which was built upon specifying
> >> certain interface name can't be modified to chase dynamic names.
> >>
> >> Specifically, we have pre-canned image that uses ethtool to fine tune VF
> >> offload settings post boot for specific workload. Those images won't work
> >> well if the name is constantly changing just after couple rounds of live
> >> migration.  
> > It should be possible to specify the ethtool configuration on the
> > master and have it automatically propagated to the slave.
> >
> > BTW this is something we should look at IMHO.  
> I was elaborating a few examples that the expectation and assumption 
> that user/admin scripts only deal with master failover device is 
> incorrect. It had never been taken good care of, although I did try to 
> emphasize it from the very beginning.
> 
> Basically what you said about propagating the ethtool configuration down 
> to the slave is the key pursuance of 1-netdev model. However, what I am 
> seeking now is any alternative that can also fix the specific udev 
> rename problem, before concluding that 1-netdev is the only solution. 
> Generally a 1-netdev scheme would take time to implement, while I'm 
> trying to find a way out to fix this particular naming problem under 
> 3-netdev.
> 
> >  
> >>> Moreover, you were suggesting hiding the lower slave devices anyway. There was some discussion
> >>> about moving them to a hidden network namespace so that they are not visible from the default namespace.
> >>> I looked into this sometime back, but did not find the right kernel api to create a network namespace within
> >>> kernel. If so, we could use this mechanism to simulate a 1-netdev model.  
> >> Yes, that's one possible implementation (IMHO the key is to make 1-netdev
> >> model as much transparent to a real NIC as possible, while a hidden netns is
> >> just the vehicle). However, I recall there was resistance around this
> >> discussion that even the concept of hiding itself is a taboo for Linux
> >> netdev. I would like to summon potential alternatives before concluding
> >> 1-netdev is the only solution too soon.
> >>
> >> Thanks,
> >> -Siwei  
> > Your scripts would not work at all then, right?  
> At this point we don't claim images with such usage as SR-IOV live 
> migrate-able. We would flag it as live migrate-able until this ethtool 
> config issue is fully addressed and a transparent live migration 
> solution emerges in upstream eventually.

The hyper-v netvsc with 1-dev model uses a timeout to allow  udev to do its rename.
I proposed a patch to key state change off of the udev rename, but that patch was
rejected.
Michael S. Tsirkin Feb. 26, 2019, 2:05 a.m. UTC | #17
On Mon, Feb 25, 2019 at 05:39:12PM -0800, Stephen Hemminger wrote:
> > >>> Moreover, you were suggesting hiding the lower slave devices anyway. There was some discussion
> > >>> about moving them to a hidden network namespace so that they are not visible from the default namespace.
> > >>> I looked into this sometime back, but did not find the right kernel api to create a network namespace within
> > >>> kernel. If so, we could use this mechanism to simulate a 1-netdev model.  
> > >> Yes, that's one possible implementation (IMHO the key is to make 1-netdev
> > >> model as much transparent to a real NIC as possible, while a hidden netns is
> > >> just the vehicle). However, I recall there was resistance around this
> > >> discussion that even the concept of hiding itself is a taboo for Linux
> > >> netdev. I would like to summon potential alternatives before concluding
> > >> 1-netdev is the only solution too soon.
> > >>
> > >> Thanks,
> > >> -Siwei  
> > > Your scripts would not work at all then, right?  
> > At this point we don't claim images with such usage as SR-IOV live 
> > migrate-able. We would flag it as live migrate-able until this ethtool 
> > config issue is fully addressed and a transparent live migration 
> > solution emerges in upstream eventually.
> 
> The hyper-v netvsc with 1-dev model uses a timeout to allow  udev to do its rename.
> I proposed a patch to key state change off of the udev rename, but that patch was
> rejected.

Of course that would mean nothing works without udev - was
that the objection? Could you help me find that discussion pls?
Michael S. Tsirkin Feb. 26, 2019, 2:08 a.m. UTC | #18
On Mon, Feb 25, 2019 at 04:58:07PM -0800, si-wei liu wrote:
> 
> 
> On 2/22/2019 7:14 AM, Michael S. Tsirkin wrote:
> > On Thu, Feb 21, 2019 at 11:55:11PM -0800, si-wei liu wrote:
> > > 
> > > On 2/21/2019 11:00 PM, Samudrala, Sridhar wrote:
> > > > 
> > > > On 2/21/2019 7:33 PM, si-wei liu wrote:
> > > > > 
> > > > > On 2/21/2019 5:39 PM, Michael S. Tsirkin wrote:
> > > > > > On Thu, Feb 21, 2019 at 05:14:44PM -0800, Siwei Liu wrote:
> > > > > > > Sorry for replying to this ancient thread. There was some remaining
> > > > > > > issue that I don't think the initial net_failover patch got addressed
> > > > > > > cleanly, see:
> > > > > > > 
> > > > > > > https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1815268
> > > > > > > 
> > > > > > > The renaming of 'eth0' to 'ens4' fails because the udev userspace was
> > > > > > > not specifically writtten for such kernel automatic enslavement.
> > > > > > > Specifically, if it is a bond or team, the slave would typically get
> > > > > > > renamed *before* virtual device gets created, that's what udev can
> > > > > > > control (without getting netdev opened early by the other part of
> > > > > > > kernel) and other userspace components for e.g. initramfs,
> > > > > > > init-scripts can coordinate well in between. The in-kernel
> > > > > > > auto-enslavement of net_failover breaks this userspace convention,
> > > > > > > which don't provides a solution if user care about consistent naming
> > > > > > > on the slave netdevs specifically.
> > > > > > > 
> > > > > > > Previously this issue had been specifically called out when IFF_HIDDEN
> > > > > > > and the 1-netdev was proposed, but no one gives out a solution to this
> > > > > > > problem ever since. Please share your mind how to proceed and solve
> > > > > > > this userspace issue if netdev does not welcome a 1-netdev model.
> > > > > > Above says:
> > > > > > 
> > > > > >      there's no motivation in the systemd/udevd community at
> > > > > >      this point to refactor the rename logic and make it work well with
> > > > > >      3-netdev.
> > > > > > 
> > > > > > What would the fix be? Skip slave devices?
> > > > > > 
> > > > > There's nothing user can get if just skipping slave devices - the
> > > > > name is still unchanged and unpredictable e.g. eth0, or eth1 the
> > > > > next reboot, while the rest may conform to the naming scheme (ens3
> > > > > and such). There's no way one can fix this in userspace alone - when
> > > > > the failover is created the enslaved netdev was opened by the kernel
> > > > > earlier than the userspace is made aware of, and there's no
> > > > > negotiation protocol for kernel to know when userspace has done
> > > > > initial renaming of the interface. I would expect netdev list should
> > > > > at least provide the direction in general for how this can be
> > > > > solved...
> > 
> > I was just wondering what did you mean when you said
> > "refactor the rename logic and make it work well with 3-netdev" -
> > was there a proposal udev rejected?
> No. I never believed this particular issue can be fixed in userspace alone.
> Previously someone had said it could be, but I never see any work or
> relevant discussion ever happened in various userspace communities (for e.g.
> dracut, initramfs-tools, systemd, udev, and NetworkManager). IMHO the root
> of the issue derives from the kernel, it makes more sense to start from
> netdev, work out and decide on a solution: see what can be done in the
> kernel in order to fix it, then after that engage userspace community for
> the feasibility...
> 
> > Anyway, can we write a time diagram for what happens in which order that
> > leads to failure?  That would help look for triggers that we can tie
> > into, or add new ones.
> > 
> 
> See attached diagram.
> 
> > 
> > 
> > 
> > 
> > > > Is there an issue if slave device names are not predictable? The user/admin scripts are expected
> > > > to only work with the master failover device.
> > > Where does this expectation come from?
> > > 
> > > Admin users may have ethtool or tc configurations that need to deal with
> > > predictable interface name. Third-party app which was built upon specifying
> > > certain interface name can't be modified to chase dynamic names.
> > > 
> > > Specifically, we have pre-canned image that uses ethtool to fine tune VF
> > > offload settings post boot for specific workload. Those images won't work
> > > well if the name is constantly changing just after couple rounds of live
> > > migration.
> > It should be possible to specify the ethtool configuration on the
> > master and have it automatically propagated to the slave.
> > 
> > BTW this is something we should look at IMHO.
> I was elaborating a few examples that the expectation and assumption that
> user/admin scripts only deal with master failover device is incorrect. It
> had never been taken good care of, although I did try to emphasize it from
> the very beginning.
> 
> Basically what you said about propagating the ethtool configuration down to
> the slave is the key pursuance of 1-netdev model. However, what I am seeking
> now is any alternative that can also fix the specific udev rename problem,
> before concluding that 1-netdev is the only solution. Generally a 1-netdev
> scheme would take time to implement, while I'm trying to find a way out to
> fix this particular naming problem under 3-netdev.
> 
> > 
> > > > Moreover, you were suggesting hiding the lower slave devices anyway. There was some discussion
> > > > about moving them to a hidden network namespace so that they are not visible from the default namespace.
> > > > I looked into this sometime back, but did not find the right kernel api to create a network namespace within
> > > > kernel. If so, we could use this mechanism to simulate a 1-netdev model.
> > > Yes, that's one possible implementation (IMHO the key is to make 1-netdev
> > > model as much transparent to a real NIC as possible, while a hidden netns is
> > > just the vehicle). However, I recall there was resistance around this
> > > discussion that even the concept of hiding itself is a taboo for Linux
> > > netdev. I would like to summon potential alternatives before concluding
> > > 1-netdev is the only solution too soon.
> > > 
> > > Thanks,
> > > -Siwei
> > Your scripts would not work at all then, right?
> At this point we don't claim images with such usage as SR-IOV live
> migrate-able. We would flag it as live migrate-able until this ethtool
> config issue is fully addressed and a transparent live migration solution
> emerges in upstream eventually.
> 
> 
> Thanks,
> -Siwei
> > 
> > 
> > > > > -Siwei
> > > > > 
> > > > > 
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
> > For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org
> > 
> 

> 
>   net_failover(kernel)                            |    network.service (user)    |          systemd-udevd (user)
> --------------------------------------------------+------------------------------+--------------------------------------------
> (standby virtio-net and net_failover              |                              |
> devices created and initialized,                  |                              |
> i.e. virtnet_probe()->                            |                              |
>        net_failover_create()                      |                              |
> was done.)                                        |                              |
>                                                   |                              |
>                                                   |  runs `ifup ens3' ->         |
>                                                   |    ip link set dev ens3 up   |
> net_failover_open()                               |                              |
>   dev_open(virtnet_dev)                           |                              |
>     virtnet_open(virtnet_dev)                     |                              |
>   netif_carrier_on(failover_dev)                  |                              |
>   ...                                             |                              |
>                                                   |                              |
> (VF hot plugged in)                               |                              |
> ixgbevf_probe()                                   |                              |
>  register_netdev(ixgbevf_netdev)                  |                              |
>   netdev_register_kobject(ixgbevf_netdev)         |                              |
>    kobject_add(ixgbevf_dev)                       |                              |
>     device_add(ixgbevf_dev)                       |                              |
>      kobject_uevent(&ixgbevf_dev->kobj, KOBJ_ADD) |                              |
>       netlink_broadcast()                         |                              |
>   ...                                             |                              |
>   call_netdevice_notifiers(NETDEV_REGISTER)       |                              |
>    failover_event(..., NETDEV_REGISTER, ...)      |                              |
>     failover_slave_register(ixgbevf_netdev)       |                              |
>      net_failover_slave_register(ixgbevf_netdev)  |                              |
>       dev_open(ixgbevf_netdev)                    |                              |
>                                                   |                              |
>                                                   |                              |
>                                                   |                              |   received ADD uevent from netlink fd
>                                                   |                              |   ...
>                                                   |                              |   udev-builtin-net_id.c:dev_pci_slot()
>                                                   |                              |   (decided to renamed 'eth0' )
>                                                   |                              |     ip link set dev eth0 name ens4
> (dev_change_name() returns -EBUSY as              |                              |
> ixgbevf_netdev->flags has IFF_UP)                 |                              |
>                                                   |                              |
> 

Given renaming slaves does not work anyway: would it work if we just
hard-coded slave names instead?

E.g.
1. fail slave renames
2. rename of failover to XX automatically renames standby to XXnsby
   and primary to XXnpry
Si-Wei Liu Feb. 27, 2019, 12:17 a.m. UTC | #19
On 2/25/2019 6:08 PM, Michael S. Tsirkin wrote:
> On Mon, Feb 25, 2019 at 04:58:07PM -0800, si-wei liu wrote:
>>
>> On 2/22/2019 7:14 AM, Michael S. Tsirkin wrote:
>>> On Thu, Feb 21, 2019 at 11:55:11PM -0800, si-wei liu wrote:
>>>> On 2/21/2019 11:00 PM, Samudrala, Sridhar wrote:
>>>>> On 2/21/2019 7:33 PM, si-wei liu wrote:
>>>>>> On 2/21/2019 5:39 PM, Michael S. Tsirkin wrote:
>>>>>>> On Thu, Feb 21, 2019 at 05:14:44PM -0800, Siwei Liu wrote:
>>>>>>>> Sorry for replying to this ancient thread. There was some remaining
>>>>>>>> issue that I don't think the initial net_failover patch got addressed
>>>>>>>> cleanly, see:
>>>>>>>>
>>>>>>>> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1815268
>>>>>>>>
>>>>>>>> The renaming of 'eth0' to 'ens4' fails because the udev userspace was
>>>>>>>> not specifically writtten for such kernel automatic enslavement.
>>>>>>>> Specifically, if it is a bond or team, the slave would typically get
>>>>>>>> renamed *before* virtual device gets created, that's what udev can
>>>>>>>> control (without getting netdev opened early by the other part of
>>>>>>>> kernel) and other userspace components for e.g. initramfs,
>>>>>>>> init-scripts can coordinate well in between. The in-kernel
>>>>>>>> auto-enslavement of net_failover breaks this userspace convention,
>>>>>>>> which don't provides a solution if user care about consistent naming
>>>>>>>> on the slave netdevs specifically.
>>>>>>>>
>>>>>>>> Previously this issue had been specifically called out when IFF_HIDDEN
>>>>>>>> and the 1-netdev was proposed, but no one gives out a solution to this
>>>>>>>> problem ever since. Please share your mind how to proceed and solve
>>>>>>>> this userspace issue if netdev does not welcome a 1-netdev model.
>>>>>>> Above says:
>>>>>>>
>>>>>>>       there's no motivation in the systemd/udevd community at
>>>>>>>       this point to refactor the rename logic and make it work well with
>>>>>>>       3-netdev.
>>>>>>>
>>>>>>> What would the fix be? Skip slave devices?
>>>>>>>
>>>>>> There's nothing user can get if just skipping slave devices - the
>>>>>> name is still unchanged and unpredictable e.g. eth0, or eth1 the
>>>>>> next reboot, while the rest may conform to the naming scheme (ens3
>>>>>> and such). There's no way one can fix this in userspace alone - when
>>>>>> the failover is created the enslaved netdev was opened by the kernel
>>>>>> earlier than the userspace is made aware of, and there's no
>>>>>> negotiation protocol for kernel to know when userspace has done
>>>>>> initial renaming of the interface. I would expect netdev list should
>>>>>> at least provide the direction in general for how this can be
>>>>>> solved...
>>> I was just wondering what did you mean when you said
>>> "refactor the rename logic and make it work well with 3-netdev" -
>>> was there a proposal udev rejected?
>> No. I never believed this particular issue can be fixed in userspace alone.
>> Previously someone had said it could be, but I never see any work or
>> relevant discussion ever happened in various userspace communities (for e.g.
>> dracut, initramfs-tools, systemd, udev, and NetworkManager). IMHO the root
>> of the issue derives from the kernel, it makes more sense to start from
>> netdev, work out and decide on a solution: see what can be done in the
>> kernel in order to fix it, then after that engage userspace community for
>> the feasibility...
>>
>>> Anyway, can we write a time diagram for what happens in which order that
>>> leads to failure?  That would help look for triggers that we can tie
>>> into, or add new ones.
>>>
>> See attached diagram.
>>
>>>
>>>
>>>
>>>>> Is there an issue if slave device names are not predictable? The user/admin scripts are expected
>>>>> to only work with the master failover device.
>>>> Where does this expectation come from?
>>>>
>>>> Admin users may have ethtool or tc configurations that need to deal with
>>>> predictable interface name. Third-party app which was built upon specifying
>>>> certain interface name can't be modified to chase dynamic names.
>>>>
>>>> Specifically, we have pre-canned image that uses ethtool to fine tune VF
>>>> offload settings post boot for specific workload. Those images won't work
>>>> well if the name is constantly changing just after couple rounds of live
>>>> migration.
>>> It should be possible to specify the ethtool configuration on the
>>> master and have it automatically propagated to the slave.
>>>
>>> BTW this is something we should look at IMHO.
>> I was elaborating a few examples that the expectation and assumption that
>> user/admin scripts only deal with master failover device is incorrect. It
>> had never been taken good care of, although I did try to emphasize it from
>> the very beginning.
>>
>> Basically what you said about propagating the ethtool configuration down to
>> the slave is the key pursuance of 1-netdev model. However, what I am seeking
>> now is any alternative that can also fix the specific udev rename problem,
>> before concluding that 1-netdev is the only solution. Generally a 1-netdev
>> scheme would take time to implement, while I'm trying to find a way out to
>> fix this particular naming problem under 3-netdev.
>>
>>>>> Moreover, you were suggesting hiding the lower slave devices anyway. There was some discussion
>>>>> about moving them to a hidden network namespace so that they are not visible from the default namespace.
>>>>> I looked into this sometime back, but did not find the right kernel api to create a network namespace within
>>>>> kernel. If so, we could use this mechanism to simulate a 1-netdev model.
>>>> Yes, that's one possible implementation (IMHO the key is to make 1-netdev
>>>> model as much transparent to a real NIC as possible, while a hidden netns is
>>>> just the vehicle). However, I recall there was resistance around this
>>>> discussion that even the concept of hiding itself is a taboo for Linux
>>>> netdev. I would like to summon potential alternatives before concluding
>>>> 1-netdev is the only solution too soon.
>>>>
>>>> Thanks,
>>>> -Siwei
>>> Your scripts would not work at all then, right?
>> At this point we don't claim images with such usage as SR-IOV live
>> migrate-able. We would flag it as live migrate-able until this ethtool
>> config issue is fully addressed and a transparent live migration solution
>> emerges in upstream eventually.
>>
>>
>> Thanks,
>> -Siwei
>>>
>>>>>> -Siwei
>>>>>>
>>>>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
>>> For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org
>>>
>>    net_failover(kernel)                            |    network.service (user)    |          systemd-udevd (user)
>> --------------------------------------------------+------------------------------+--------------------------------------------
>> (standby virtio-net and net_failover              |                              |
>> devices created and initialized,                  |                              |
>> i.e. virtnet_probe()->                            |                              |
>>         net_failover_create()                      |                              |
>> was done.)                                        |                              |
>>                                                    |                              |
>>                                                    |  runs `ifup ens3' ->         |
>>                                                    |    ip link set dev ens3 up   |
>> net_failover_open()                               |                              |
>>    dev_open(virtnet_dev)                           |                              |
>>      virtnet_open(virtnet_dev)                     |                              |
>>    netif_carrier_on(failover_dev)                  |                              |
>>    ...                                             |                              |
>>                                                    |                              |
>> (VF hot plugged in)                               |                              |
>> ixgbevf_probe()                                   |                              |
>>   register_netdev(ixgbevf_netdev)                  |                              |
>>    netdev_register_kobject(ixgbevf_netdev)         |                              |
>>     kobject_add(ixgbevf_dev)                       |                              |
>>      device_add(ixgbevf_dev)                       |                              |
>>       kobject_uevent(&ixgbevf_dev->kobj, KOBJ_ADD) |                              |
>>        netlink_broadcast()                         |                              |
>>    ...                                             |                              |
>>    call_netdevice_notifiers(NETDEV_REGISTER)       |                              |
>>     failover_event(..., NETDEV_REGISTER, ...)      |                              |
>>      failover_slave_register(ixgbevf_netdev)       |                              |
>>       net_failover_slave_register(ixgbevf_netdev)  |                              |
>>        dev_open(ixgbevf_netdev)                    |                              |
>>                                                    |                              |
>>                                                    |                              |
>>                                                    |                              |   received ADD uevent from netlink fd
>>                                                    |                              |   ...
>>                                                    |                              |   udev-builtin-net_id.c:dev_pci_slot()
>>                                                    |                              |   (decided to renamed 'eth0' )
>>                                                    |                              |     ip link set dev eth0 name ens4
>> (dev_change_name() returns -EBUSY as              |                              |
>> ixgbevf_netdev->flags has IFF_UP)                 |                              |
>>                                                    |                              |
>>
> Given renaming slaves does not work anyway:
I was actually thinking what if we relieve the rename restriction just 
for the failover slave? What the impact would be? I think users don't 
care about slave being renamed when it's in use, especially the initial 
rename. Thoughts?

>   would it work if we just
> hard-coded slave names instead?
>
> E.g.
> 1. fail slave renames
> 2. rename of failover to XX automatically renames standby to XXnsby
>     and primary to XXnpry
That wouldn't help. The time when the failover master gets renamed, the 
VF may not be present. I don't like the idea to delay exposing failover 
master until VF is hot plugged in (probably subject to various failures) 
later.

Thanks,
-Siwei

>
>
Si-Wei Liu Feb. 27, 2019, 12:49 a.m. UTC | #20
On 2/25/2019 6:05 PM, Michael S. Tsirkin wrote:
> On Mon, Feb 25, 2019 at 05:39:12PM -0800, Stephen Hemminger wrote:
>>>>>> Moreover, you were suggesting hiding the lower slave devices anyway. There was some discussion
>>>>>> about moving them to a hidden network namespace so that they are not visible from the default namespace.
>>>>>> I looked into this sometime back, but did not find the right kernel api to create a network namespace within
>>>>>> kernel. If so, we could use this mechanism to simulate a 1-netdev model.
>>>>> Yes, that's one possible implementation (IMHO the key is to make 1-netdev
>>>>> model as much transparent to a real NIC as possible, while a hidden netns is
>>>>> just the vehicle). However, I recall there was resistance around this
>>>>> discussion that even the concept of hiding itself is a taboo for Linux
>>>>> netdev. I would like to summon potential alternatives before concluding
>>>>> 1-netdev is the only solution too soon.
>>>>>
>>>>> Thanks,
>>>>> -Siwei
>>>> Your scripts would not work at all then, right?
>>> At this point we don't claim images with such usage as SR-IOV live
>>> migrate-able. We would flag it as live migrate-able until this ethtool
>>> config issue is fully addressed and a transparent live migration
>>> solution emerges in upstream eventually.
>> The hyper-v netvsc with 1-dev model uses a timeout to allow  udev to do its rename.
>> I proposed a patch to key state change off of the udev rename, but that patch was
>> rejected.
> Of course that would mean nothing works without udev - was
> that the objection? Could you help me find that discussion pls?
Yeah, the kernel should work with and without udev rename - typically 
the kernel is agnostic of upcoming rename. User may opt out for kernel 
provided names (particularly on older distros) then no rename would ever 
happen.

I ever thought about this approach but didn't think it would fit. But, 
what is the historical reason that prevents slave from being renamed 
after being opened? Could we specialize a code path for this kernel 
created device, as net_failover shouldn't carry over any history burden?


Thanks,
-Siwei
Stephen Hemminger Feb. 27, 2019, 9:57 p.m. UTC | #21
On Tue, 26 Feb 2019 16:17:21 -0800
si-wei liu <si-wei.liu@oracle.com> wrote:

> On 2/25/2019 6:08 PM, Michael S. Tsirkin wrote:
> > On Mon, Feb 25, 2019 at 04:58:07PM -0800, si-wei liu wrote:  
> >>
> >> On 2/22/2019 7:14 AM, Michael S. Tsirkin wrote:  
> >>> On Thu, Feb 21, 2019 at 11:55:11PM -0800, si-wei liu wrote:  
> >>>> On 2/21/2019 11:00 PM, Samudrala, Sridhar wrote:  
> >>>>> On 2/21/2019 7:33 PM, si-wei liu wrote:  
> >>>>>> On 2/21/2019 5:39 PM, Michael S. Tsirkin wrote:  
> >>>>>>> On Thu, Feb 21, 2019 at 05:14:44PM -0800, Siwei Liu wrote:  
> >>>>>>>> Sorry for replying to this ancient thread. There was some remaining
> >>>>>>>> issue that I don't think the initial net_failover patch got addressed
> >>>>>>>> cleanly, see:
> >>>>>>>>
> >>>>>>>> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1815268
> >>>>>>>>
> >>>>>>>> The renaming of 'eth0' to 'ens4' fails because the udev userspace was
> >>>>>>>> not specifically writtten for such kernel automatic enslavement.
> >>>>>>>> Specifically, if it is a bond or team, the slave would typically get
> >>>>>>>> renamed *before* virtual device gets created, that's what udev can
> >>>>>>>> control (without getting netdev opened early by the other part of
> >>>>>>>> kernel) and other userspace components for e.g. initramfs,
> >>>>>>>> init-scripts can coordinate well in between. The in-kernel
> >>>>>>>> auto-enslavement of net_failover breaks this userspace convention,
> >>>>>>>> which don't provides a solution if user care about consistent naming
> >>>>>>>> on the slave netdevs specifically.
> >>>>>>>>
> >>>>>>>> Previously this issue had been specifically called out when IFF_HIDDEN
> >>>>>>>> and the 1-netdev was proposed, but no one gives out a solution to this
> >>>>>>>> problem ever since. Please share your mind how to proceed and solve
> >>>>>>>> this userspace issue if netdev does not welcome a 1-netdev model.  
> >>>>>>> Above says:
> >>>>>>>
> >>>>>>>       there's no motivation in the systemd/udevd community at
> >>>>>>>       this point to refactor the rename logic and make it work well with
> >>>>>>>       3-netdev.
> >>>>>>>
> >>>>>>> What would the fix be? Skip slave devices?
> >>>>>>>  
> >>>>>> There's nothing user can get if just skipping slave devices - the
> >>>>>> name is still unchanged and unpredictable e.g. eth0, or eth1 the
> >>>>>> next reboot, while the rest may conform to the naming scheme (ens3
> >>>>>> and such). There's no way one can fix this in userspace alone - when
> >>>>>> the failover is created the enslaved netdev was opened by the kernel
> >>>>>> earlier than the userspace is made aware of, and there's no
> >>>>>> negotiation protocol for kernel to know when userspace has done
> >>>>>> initial renaming of the interface. I would expect netdev list should
> >>>>>> at least provide the direction in general for how this can be
> >>>>>> solved...  
> >>> I was just wondering what did you mean when you said
> >>> "refactor the rename logic and make it work well with 3-netdev" -
> >>> was there a proposal udev rejected?  
> >> No. I never believed this particular issue can be fixed in userspace alone.
> >> Previously someone had said it could be, but I never see any work or
> >> relevant discussion ever happened in various userspace communities (for e.g.
> >> dracut, initramfs-tools, systemd, udev, and NetworkManager). IMHO the root
> >> of the issue derives from the kernel, it makes more sense to start from
> >> netdev, work out and decide on a solution: see what can be done in the
> >> kernel in order to fix it, then after that engage userspace community for
> >> the feasibility...
> >>  
> >>> Anyway, can we write a time diagram for what happens in which order that
> >>> leads to failure?  That would help look for triggers that we can tie
> >>> into, or add new ones.
> >>>  
> >> See attached diagram.
> >>  
> >>>
> >>>
> >>>  
> >>>>> Is there an issue if slave device names are not predictable? The user/admin scripts are expected
> >>>>> to only work with the master failover device.  
> >>>> Where does this expectation come from?
> >>>>
> >>>> Admin users may have ethtool or tc configurations that need to deal with
> >>>> predictable interface name. Third-party app which was built upon specifying
> >>>> certain interface name can't be modified to chase dynamic names.
> >>>>
> >>>> Specifically, we have pre-canned image that uses ethtool to fine tune VF
> >>>> offload settings post boot for specific workload. Those images won't work
> >>>> well if the name is constantly changing just after couple rounds of live
> >>>> migration.  
> >>> It should be possible to specify the ethtool configuration on the
> >>> master and have it automatically propagated to the slave.
> >>>
> >>> BTW this is something we should look at IMHO.  
> >> I was elaborating a few examples that the expectation and assumption that
> >> user/admin scripts only deal with master failover device is incorrect. It
> >> had never been taken good care of, although I did try to emphasize it from
> >> the very beginning.
> >>
> >> Basically what you said about propagating the ethtool configuration down to
> >> the slave is the key pursuance of 1-netdev model. However, what I am seeking
> >> now is any alternative that can also fix the specific udev rename problem,
> >> before concluding that 1-netdev is the only solution. Generally a 1-netdev
> >> scheme would take time to implement, while I'm trying to find a way out to
> >> fix this particular naming problem under 3-netdev.
> >>  
> >>>>> Moreover, you were suggesting hiding the lower slave devices anyway. There was some discussion
> >>>>> about moving them to a hidden network namespace so that they are not visible from the default namespace.
> >>>>> I looked into this sometime back, but did not find the right kernel api to create a network namespace within
> >>>>> kernel. If so, we could use this mechanism to simulate a 1-netdev model.  
> >>>> Yes, that's one possible implementation (IMHO the key is to make 1-netdev
> >>>> model as much transparent to a real NIC as possible, while a hidden netns is
> >>>> just the vehicle). However, I recall there was resistance around this
> >>>> discussion that even the concept of hiding itself is a taboo for Linux
> >>>> netdev. I would like to summon potential alternatives before concluding
> >>>> 1-netdev is the only solution too soon.
> >>>>
> >>>> Thanks,
> >>>> -Siwei  
> >>> Your scripts would not work at all then, right?  
> >> At this point we don't claim images with such usage as SR-IOV live
> >> migrate-able. We would flag it as live migrate-able until this ethtool
> >> config issue is fully addressed and a transparent live migration solution
> >> emerges in upstream eventually.
> >>
> >>
> >> Thanks,
> >> -Siwei  
> >>>  
> >>>>>> -Siwei
> >>>>>>
> >>>>>>  
> >>> ---------------------------------------------------------------------
> >>> To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
> >>> For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org
> >>>  
> >>    net_failover(kernel)                            |    network.service (user)    |          systemd-udevd (user)
> >> --------------------------------------------------+------------------------------+--------------------------------------------
> >> (standby virtio-net and net_failover              |                              |
> >> devices created and initialized,                  |                              |
> >> i.e. virtnet_probe()->                            |                              |
> >>         net_failover_create()                      |                              |
> >> was done.)                                        |                              |
> >>                                                    |                              |
> >>                                                    |  runs `ifup ens3' ->         |
> >>                                                    |    ip link set dev ens3 up   |
> >> net_failover_open()                               |                              |
> >>    dev_open(virtnet_dev)                           |                              |
> >>      virtnet_open(virtnet_dev)                     |                              |
> >>    netif_carrier_on(failover_dev)                  |                              |
> >>    ...                                             |                              |
> >>                                                    |                              |
> >> (VF hot plugged in)                               |                              |
> >> ixgbevf_probe()                                   |                              |
> >>   register_netdev(ixgbevf_netdev)                  |                              |
> >>    netdev_register_kobject(ixgbevf_netdev)         |                              |
> >>     kobject_add(ixgbevf_dev)                       |                              |
> >>      device_add(ixgbevf_dev)                       |                              |
> >>       kobject_uevent(&ixgbevf_dev->kobj, KOBJ_ADD) |                              |
> >>        netlink_broadcast()                         |                              |
> >>    ...                                             |                              |
> >>    call_netdevice_notifiers(NETDEV_REGISTER)       |                              |
> >>     failover_event(..., NETDEV_REGISTER, ...)      |                              |
> >>      failover_slave_register(ixgbevf_netdev)       |                              |
> >>       net_failover_slave_register(ixgbevf_netdev)  |                              |
> >>        dev_open(ixgbevf_netdev)                    |                              |
> >>                                                    |                              |
> >>                                                    |                              |
> >>                                                    |                              |   received ADD uevent from netlink fd
> >>                                                    |                              |   ...
> >>                                                    |                              |   udev-builtin-net_id.c:dev_pci_slot()
> >>                                                    |                              |   (decided to renamed 'eth0' )
> >>                                                    |                              |     ip link set dev eth0 name ens4
> >> (dev_change_name() returns -EBUSY as              |                              |
> >> ixgbevf_netdev->flags has IFF_UP)                 |                              |
> >>                                                    |                              |
> >>  
> > Given renaming slaves does not work anyway:  
> I was actually thinking what if we relieve the rename restriction just 
> for the failover slave? What the impact would be? I think users don't 
> care about slave being renamed when it's in use, especially the initial 
> rename. Thoughts?
> 
> >   would it work if we just
> > hard-coded slave names instead?
> >
> > E.g.
> > 1. fail slave renames
> > 2. rename of failover to XX automatically renames standby to XXnsby
> >     and primary to XXnpry  
> That wouldn't help. The time when the failover master gets renamed, the 
> VF may not be present. I don't like the idea to delay exposing failover 
> master until VF is hot plugged in (probably subject to various failures) 
> later.


What netvsc does now is wait 2 seconds (to allow udev to do rename)
before bringing the VF link up. This works, has had no problems even
with slow distributions and is widely used.

A patch to allow ending the timeout after rename was proposed but
rejected.

https://lore.kernel.org/netdev/20171220223323.21125-1-sthemmin@microsoft.com/

Allow network devices to change name when up is too risky. There are things
like netfilter rules and other state in and out of the kernel that may break.
Userspace does not like it when the rules change.
Si-Wei Liu Feb. 27, 2019, 10:30 p.m. UTC | #22
On 2/27/2019 1:57 PM, Stephen Hemminger wrote:
> On Tue, 26 Feb 2019 16:17:21 -0800
> si-wei liu <si-wei.liu@oracle.com> wrote:
>
>> On 2/25/2019 6:08 PM, Michael S. Tsirkin wrote:
>>> On Mon, Feb 25, 2019 at 04:58:07PM -0800, si-wei liu wrote:
>>>> On 2/22/2019 7:14 AM, Michael S. Tsirkin wrote:
>>>>> On Thu, Feb 21, 2019 at 11:55:11PM -0800, si-wei liu wrote:
>>>>>> On 2/21/2019 11:00 PM, Samudrala, Sridhar wrote:
>>>>>>> On 2/21/2019 7:33 PM, si-wei liu wrote:
>>>>>>>> On 2/21/2019 5:39 PM, Michael S. Tsirkin wrote:
>>>>>>>>> On Thu, Feb 21, 2019 at 05:14:44PM -0800, Siwei Liu wrote:
>>>>>>>>>> Sorry for replying to this ancient thread. There was some remaining
>>>>>>>>>> issue that I don't think the initial net_failover patch got addressed
>>>>>>>>>> cleanly, see:
>>>>>>>>>>
>>>>>>>>>> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1815268
>>>>>>>>>>
>>>>>>>>>> The renaming of 'eth0' to 'ens4' fails because the udev userspace was
>>>>>>>>>> not specifically writtten for such kernel automatic enslavement.
>>>>>>>>>> Specifically, if it is a bond or team, the slave would typically get
>>>>>>>>>> renamed *before* virtual device gets created, that's what udev can
>>>>>>>>>> control (without getting netdev opened early by the other part of
>>>>>>>>>> kernel) and other userspace components for e.g. initramfs,
>>>>>>>>>> init-scripts can coordinate well in between. The in-kernel
>>>>>>>>>> auto-enslavement of net_failover breaks this userspace convention,
>>>>>>>>>> which don't provides a solution if user care about consistent naming
>>>>>>>>>> on the slave netdevs specifically.
>>>>>>>>>>
>>>>>>>>>> Previously this issue had been specifically called out when IFF_HIDDEN
>>>>>>>>>> and the 1-netdev was proposed, but no one gives out a solution to this
>>>>>>>>>> problem ever since. Please share your mind how to proceed and solve
>>>>>>>>>> this userspace issue if netdev does not welcome a 1-netdev model.
>>>>>>>>> Above says:
>>>>>>>>>
>>>>>>>>>        there's no motivation in the systemd/udevd community at
>>>>>>>>>        this point to refactor the rename logic and make it work well with
>>>>>>>>>        3-netdev.
>>>>>>>>>
>>>>>>>>> What would the fix be? Skip slave devices?
>>>>>>>>>   
>>>>>>>> There's nothing user can get if just skipping slave devices - the
>>>>>>>> name is still unchanged and unpredictable e.g. eth0, or eth1 the
>>>>>>>> next reboot, while the rest may conform to the naming scheme (ens3
>>>>>>>> and such). There's no way one can fix this in userspace alone - when
>>>>>>>> the failover is created the enslaved netdev was opened by the kernel
>>>>>>>> earlier than the userspace is made aware of, and there's no
>>>>>>>> negotiation protocol for kernel to know when userspace has done
>>>>>>>> initial renaming of the interface. I would expect netdev list should
>>>>>>>> at least provide the direction in general for how this can be
>>>>>>>> solved...
>>>>> I was just wondering what did you mean when you said
>>>>> "refactor the rename logic and make it work well with 3-netdev" -
>>>>> was there a proposal udev rejected?
>>>> No. I never believed this particular issue can be fixed in userspace alone.
>>>> Previously someone had said it could be, but I never see any work or
>>>> relevant discussion ever happened in various userspace communities (for e.g.
>>>> dracut, initramfs-tools, systemd, udev, and NetworkManager). IMHO the root
>>>> of the issue derives from the kernel, it makes more sense to start from
>>>> netdev, work out and decide on a solution: see what can be done in the
>>>> kernel in order to fix it, then after that engage userspace community for
>>>> the feasibility...
>>>>   
>>>>> Anyway, can we write a time diagram for what happens in which order that
>>>>> leads to failure?  That would help look for triggers that we can tie
>>>>> into, or add new ones.
>>>>>   
>>>> See attached diagram.
>>>>   
>>>>>
>>>>>   
>>>>>>> Is there an issue if slave device names are not predictable? The user/admin scripts are expected
>>>>>>> to only work with the master failover device.
>>>>>> Where does this expectation come from?
>>>>>>
>>>>>> Admin users may have ethtool or tc configurations that need to deal with
>>>>>> predictable interface name. Third-party app which was built upon specifying
>>>>>> certain interface name can't be modified to chase dynamic names.
>>>>>>
>>>>>> Specifically, we have pre-canned image that uses ethtool to fine tune VF
>>>>>> offload settings post boot for specific workload. Those images won't work
>>>>>> well if the name is constantly changing just after couple rounds of live
>>>>>> migration.
>>>>> It should be possible to specify the ethtool configuration on the
>>>>> master and have it automatically propagated to the slave.
>>>>>
>>>>> BTW this is something we should look at IMHO.
>>>> I was elaborating a few examples that the expectation and assumption that
>>>> user/admin scripts only deal with master failover device is incorrect. It
>>>> had never been taken good care of, although I did try to emphasize it from
>>>> the very beginning.
>>>>
>>>> Basically what you said about propagating the ethtool configuration down to
>>>> the slave is the key pursuance of 1-netdev model. However, what I am seeking
>>>> now is any alternative that can also fix the specific udev rename problem,
>>>> before concluding that 1-netdev is the only solution. Generally a 1-netdev
>>>> scheme would take time to implement, while I'm trying to find a way out to
>>>> fix this particular naming problem under 3-netdev.
>>>>   
>>>>>>> Moreover, you were suggesting hiding the lower slave devices anyway. There was some discussion
>>>>>>> about moving them to a hidden network namespace so that they are not visible from the default namespace.
>>>>>>> I looked into this sometime back, but did not find the right kernel api to create a network namespace within
>>>>>>> kernel. If so, we could use this mechanism to simulate a 1-netdev model.
>>>>>> Yes, that's one possible implementation (IMHO the key is to make 1-netdev
>>>>>> model as much transparent to a real NIC as possible, while a hidden netns is
>>>>>> just the vehicle). However, I recall there was resistance around this
>>>>>> discussion that even the concept of hiding itself is a taboo for Linux
>>>>>> netdev. I would like to summon potential alternatives before concluding
>>>>>> 1-netdev is the only solution too soon.
>>>>>>
>>>>>> Thanks,
>>>>>> -Siwei
>>>>> Your scripts would not work at all then, right?
>>>> At this point we don't claim images with such usage as SR-IOV live
>>>> migrate-able. We would flag it as live migrate-able until this ethtool
>>>> config issue is fully addressed and a transparent live migration solution
>>>> emerges in upstream eventually.
>>>>
>>>>
>>>> Thanks,
>>>> -Siwei
>>>>>   
>>>>>>>> -Siwei
>>>>>>>>
>>>>>>>>   
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
>>>>> For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org
>>>>>   
>>>>     net_failover(kernel)                            |    network.service (user)    |          systemd-udevd (user)
>>>> --------------------------------------------------+------------------------------+--------------------------------------------
>>>> (standby virtio-net and net_failover              |                              |
>>>> devices created and initialized,                  |                              |
>>>> i.e. virtnet_probe()->                            |                              |
>>>>          net_failover_create()                      |                              |
>>>> was done.)                                        |                              |
>>>>                                                     |                              |
>>>>                                                     |  runs `ifup ens3' ->         |
>>>>                                                     |    ip link set dev ens3 up   |
>>>> net_failover_open()                               |                              |
>>>>     dev_open(virtnet_dev)                           |                              |
>>>>       virtnet_open(virtnet_dev)                     |                              |
>>>>     netif_carrier_on(failover_dev)                  |                              |
>>>>     ...                                             |                              |
>>>>                                                     |                              |
>>>> (VF hot plugged in)                               |                              |
>>>> ixgbevf_probe()                                   |                              |
>>>>    register_netdev(ixgbevf_netdev)                  |                              |
>>>>     netdev_register_kobject(ixgbevf_netdev)         |                              |
>>>>      kobject_add(ixgbevf_dev)                       |                              |
>>>>       device_add(ixgbevf_dev)                       |                              |
>>>>        kobject_uevent(&ixgbevf_dev->kobj, KOBJ_ADD) |                              |
>>>>         netlink_broadcast()                         |                              |
>>>>     ...                                             |                              |
>>>>     call_netdevice_notifiers(NETDEV_REGISTER)       |                              |
>>>>      failover_event(..., NETDEV_REGISTER, ...)      |                              |
>>>>       failover_slave_register(ixgbevf_netdev)       |                              |
>>>>        net_failover_slave_register(ixgbevf_netdev)  |                              |
>>>>         dev_open(ixgbevf_netdev)                    |                              |
>>>>                                                     |                              |
>>>>                                                     |                              |
>>>>                                                     |                              |   received ADD uevent from netlink fd
>>>>                                                     |                              |   ...
>>>>                                                     |                              |   udev-builtin-net_id.c:dev_pci_slot()
>>>>                                                     |                              |   (decided to renamed 'eth0' )
>>>>                                                     |                              |     ip link set dev eth0 name ens4
>>>> (dev_change_name() returns -EBUSY as              |                              |
>>>> ixgbevf_netdev->flags has IFF_UP)                 |                              |
>>>>                                                     |                              |
>>>>   
>>> Given renaming slaves does not work anyway:
>> I was actually thinking what if we relieve the rename restriction just
>> for the failover slave? What the impact would be? I think users don't
>> care about slave being renamed when it's in use, especially the initial
>> rename. Thoughts?
>>
>>>    would it work if we just
>>> hard-coded slave names instead?
>>>
>>> E.g.
>>> 1. fail slave renames
>>> 2. rename of failover to XX automatically renames standby to XXnsby
>>>      and primary to XXnpry
>> That wouldn't help. The time when the failover master gets renamed, the
>> VF may not be present. I don't like the idea to delay exposing failover
>> master until VF is hot plugged in (probably subject to various failures)
>> later.
>
> What netvsc does now is wait 2 seconds (to allow udev to do rename)
> before bringing the VF link up. This works, has had no problems even
> with slow distributions and is widely used.
>
> A patch to allow ending the timeout after rename was proposed but
> rejected.
>
> https://lore.kernel.org/netdev/20171220223323.21125-1-sthemmin@microsoft.com/
>
> Allow network devices to change name when up is too risky.
I understand the concern in general, the thread above referenced this patch:

https://patchwork.ozlabs.org/patch/799646/

That was in the context of netvsc without a proper framework (net_failover).

What I was saying is that we should consider opening up the rename 
restriction for  IFF_FAILOVER_SLAVE. It looks to me that all the 
userspace usage are trying to ignore the slave instead of operating it 
directly. The netfilter rules and what mentioned below can/should be 
applied to on top of the master if I'm not mistaken. The current 
userspace doesn't speak the net_failover way, and it is already broken 
since its introduction. If anything, those userspace can be fixed up to 
listen for rename events to track name changes. Whatever those cases are 
it should not affect current use cases.

Thanks,
-Siwei



> There are things
> like netfilter rules and other state in and out of the kernel that may break.
> Userspace does not like it when the rules change.
Michael S. Tsirkin Feb. 27, 2019, 10:38 p.m. UTC | #23
On Tue, Feb 26, 2019 at 04:17:21PM -0800, si-wei liu wrote:
> 
> 
> On 2/25/2019 6:08 PM, Michael S. Tsirkin wrote:
> > On Mon, Feb 25, 2019 at 04:58:07PM -0800, si-wei liu wrote:
> > > 
> > > On 2/22/2019 7:14 AM, Michael S. Tsirkin wrote:
> > > > On Thu, Feb 21, 2019 at 11:55:11PM -0800, si-wei liu wrote:
> > > > > On 2/21/2019 11:00 PM, Samudrala, Sridhar wrote:
> > > > > > On 2/21/2019 7:33 PM, si-wei liu wrote:
> > > > > > > On 2/21/2019 5:39 PM, Michael S. Tsirkin wrote:
> > > > > > > > On Thu, Feb 21, 2019 at 05:14:44PM -0800, Siwei Liu wrote:
> > > > > > > > > Sorry for replying to this ancient thread. There was some remaining
> > > > > > > > > issue that I don't think the initial net_failover patch got addressed
> > > > > > > > > cleanly, see:
> > > > > > > > > 
> > > > > > > > > https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1815268
> > > > > > > > > 
> > > > > > > > > The renaming of 'eth0' to 'ens4' fails because the udev userspace was
> > > > > > > > > not specifically writtten for such kernel automatic enslavement.
> > > > > > > > > Specifically, if it is a bond or team, the slave would typically get
> > > > > > > > > renamed *before* virtual device gets created, that's what udev can
> > > > > > > > > control (without getting netdev opened early by the other part of
> > > > > > > > > kernel) and other userspace components for e.g. initramfs,
> > > > > > > > > init-scripts can coordinate well in between. The in-kernel
> > > > > > > > > auto-enslavement of net_failover breaks this userspace convention,
> > > > > > > > > which don't provides a solution if user care about consistent naming
> > > > > > > > > on the slave netdevs specifically.
> > > > > > > > > 
> > > > > > > > > Previously this issue had been specifically called out when IFF_HIDDEN
> > > > > > > > > and the 1-netdev was proposed, but no one gives out a solution to this
> > > > > > > > > problem ever since. Please share your mind how to proceed and solve
> > > > > > > > > this userspace issue if netdev does not welcome a 1-netdev model.
> > > > > > > > Above says:
> > > > > > > > 
> > > > > > > >       there's no motivation in the systemd/udevd community at
> > > > > > > >       this point to refactor the rename logic and make it work well with
> > > > > > > >       3-netdev.
> > > > > > > > 
> > > > > > > > What would the fix be? Skip slave devices?
> > > > > > > > 
> > > > > > > There's nothing user can get if just skipping slave devices - the
> > > > > > > name is still unchanged and unpredictable e.g. eth0, or eth1 the
> > > > > > > next reboot, while the rest may conform to the naming scheme (ens3
> > > > > > > and such). There's no way one can fix this in userspace alone - when
> > > > > > > the failover is created the enslaved netdev was opened by the kernel
> > > > > > > earlier than the userspace is made aware of, and there's no
> > > > > > > negotiation protocol for kernel to know when userspace has done
> > > > > > > initial renaming of the interface. I would expect netdev list should
> > > > > > > at least provide the direction in general for how this can be
> > > > > > > solved...
> > > > I was just wondering what did you mean when you said
> > > > "refactor the rename logic and make it work well with 3-netdev" -
> > > > was there a proposal udev rejected?
> > > No. I never believed this particular issue can be fixed in userspace alone.
> > > Previously someone had said it could be, but I never see any work or
> > > relevant discussion ever happened in various userspace communities (for e.g.
> > > dracut, initramfs-tools, systemd, udev, and NetworkManager). IMHO the root
> > > of the issue derives from the kernel, it makes more sense to start from
> > > netdev, work out and decide on a solution: see what can be done in the
> > > kernel in order to fix it, then after that engage userspace community for
> > > the feasibility...
> > > 
> > > > Anyway, can we write a time diagram for what happens in which order that
> > > > leads to failure?  That would help look for triggers that we can tie
> > > > into, or add new ones.
> > > > 
> > > See attached diagram.
> > > 
> > > > 
> > > > 
> > > > 
> > > > > > Is there an issue if slave device names are not predictable? The user/admin scripts are expected
> > > > > > to only work with the master failover device.
> > > > > Where does this expectation come from?
> > > > > 
> > > > > Admin users may have ethtool or tc configurations that need to deal with
> > > > > predictable interface name. Third-party app which was built upon specifying
> > > > > certain interface name can't be modified to chase dynamic names.
> > > > > 
> > > > > Specifically, we have pre-canned image that uses ethtool to fine tune VF
> > > > > offload settings post boot for specific workload. Those images won't work
> > > > > well if the name is constantly changing just after couple rounds of live
> > > > > migration.
> > > > It should be possible to specify the ethtool configuration on the
> > > > master and have it automatically propagated to the slave.
> > > > 
> > > > BTW this is something we should look at IMHO.
> > > I was elaborating a few examples that the expectation and assumption that
> > > user/admin scripts only deal with master failover device is incorrect. It
> > > had never been taken good care of, although I did try to emphasize it from
> > > the very beginning.
> > > 
> > > Basically what you said about propagating the ethtool configuration down to
> > > the slave is the key pursuance of 1-netdev model. However, what I am seeking
> > > now is any alternative that can also fix the specific udev rename problem,
> > > before concluding that 1-netdev is the only solution. Generally a 1-netdev
> > > scheme would take time to implement, while I'm trying to find a way out to
> > > fix this particular naming problem under 3-netdev.
> > > 
> > > > > > Moreover, you were suggesting hiding the lower slave devices anyway. There was some discussion
> > > > > > about moving them to a hidden network namespace so that they are not visible from the default namespace.
> > > > > > I looked into this sometime back, but did not find the right kernel api to create a network namespace within
> > > > > > kernel. If so, we could use this mechanism to simulate a 1-netdev model.
> > > > > Yes, that's one possible implementation (IMHO the key is to make 1-netdev
> > > > > model as much transparent to a real NIC as possible, while a hidden netns is
> > > > > just the vehicle). However, I recall there was resistance around this
> > > > > discussion that even the concept of hiding itself is a taboo for Linux
> > > > > netdev. I would like to summon potential alternatives before concluding
> > > > > 1-netdev is the only solution too soon.
> > > > > 
> > > > > Thanks,
> > > > > -Siwei
> > > > Your scripts would not work at all then, right?
> > > At this point we don't claim images with such usage as SR-IOV live
> > > migrate-able. We would flag it as live migrate-able until this ethtool
> > > config issue is fully addressed and a transparent live migration solution
> > > emerges in upstream eventually.
> > > 
> > > 
> > > Thanks,
> > > -Siwei
> > > > 
> > > > > > > -Siwei
> > > > > > > 
> > > > > > > 
> > > > ---------------------------------------------------------------------
> > > > To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
> > > > For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org
> > > > 
> > >    net_failover(kernel)                            |    network.service (user)    |          systemd-udevd (user)
> > > --------------------------------------------------+------------------------------+--------------------------------------------
> > > (standby virtio-net and net_failover              |                              |
> > > devices created and initialized,                  |                              |
> > > i.e. virtnet_probe()->                            |                              |
> > >         net_failover_create()                      |                              |
> > > was done.)                                        |                              |
> > >                                                    |                              |
> > >                                                    |  runs `ifup ens3' ->         |
> > >                                                    |    ip link set dev ens3 up   |
> > > net_failover_open()                               |                              |
> > >    dev_open(virtnet_dev)                           |                              |
> > >      virtnet_open(virtnet_dev)                     |                              |
> > >    netif_carrier_on(failover_dev)                  |                              |
> > >    ...                                             |                              |
> > >                                                    |                              |
> > > (VF hot plugged in)                               |                              |
> > > ixgbevf_probe()                                   |                              |
> > >   register_netdev(ixgbevf_netdev)                  |                              |
> > >    netdev_register_kobject(ixgbevf_netdev)         |                              |
> > >     kobject_add(ixgbevf_dev)                       |                              |
> > >      device_add(ixgbevf_dev)                       |                              |
> > >       kobject_uevent(&ixgbevf_dev->kobj, KOBJ_ADD) |                              |
> > >        netlink_broadcast()                         |                              |
> > >    ...                                             |                              |
> > >    call_netdevice_notifiers(NETDEV_REGISTER)       |                              |
> > >     failover_event(..., NETDEV_REGISTER, ...)      |                              |
> > >      failover_slave_register(ixgbevf_netdev)       |                              |
> > >       net_failover_slave_register(ixgbevf_netdev)  |                              |
> > >        dev_open(ixgbevf_netdev)                    |                              |
> > >                                                    |                              |
> > >                                                    |                              |
> > >                                                    |                              |   received ADD uevent from netlink fd
> > >                                                    |                              |   ...
> > >                                                    |                              |   udev-builtin-net_id.c:dev_pci_slot()
> > >                                                    |                              |   (decided to renamed 'eth0' )
> > >                                                    |                              |     ip link set dev eth0 name ens4
> > > (dev_change_name() returns -EBUSY as              |                              |
> > > ixgbevf_netdev->flags has IFF_UP)                 |                              |
> > >                                                    |                              |
> > > 
> > Given renaming slaves does not work anyway:
> I was actually thinking what if we relieve the rename restriction just for
> the failover slave? What the impact would be? I think users don't care about
> slave being renamed when it's in use, especially the initial rename.
> Thoughts?
> 
> >   would it work if we just
> > hard-coded slave names instead?
> > 
> > E.g.
> > 1. fail slave renames
> > 2. rename of failover to XX automatically renames standby to XXnsby
> >     and primary to XXnpry
> That wouldn't help. The time when the failover master gets renamed, the VF
> may not be present.

In this scheme if VF is not there it will be renamed immediately after registration.

> I don't like the idea to delay exposing failover master
> until VF is hot plugged in (probably subject to various failures) later.
> 
> Thanks,
> -Siwei


I agree, this was not what I meant.

> > 
> >
Si-Wei Liu Feb. 27, 2019, 11:34 p.m. UTC | #24
On 2/27/2019 2:38 PM, Michael S. Tsirkin wrote:
> On Tue, Feb 26, 2019 at 04:17:21PM -0800, si-wei liu wrote:
>>
>> On 2/25/2019 6:08 PM, Michael S. Tsirkin wrote:
>>> On Mon, Feb 25, 2019 at 04:58:07PM -0800, si-wei liu wrote:
>>>> On 2/22/2019 7:14 AM, Michael S. Tsirkin wrote:
>>>>> On Thu, Feb 21, 2019 at 11:55:11PM -0800, si-wei liu wrote:
>>>>>> On 2/21/2019 11:00 PM, Samudrala, Sridhar wrote:
>>>>>>> On 2/21/2019 7:33 PM, si-wei liu wrote:
>>>>>>>> On 2/21/2019 5:39 PM, Michael S. Tsirkin wrote:
>>>>>>>>> On Thu, Feb 21, 2019 at 05:14:44PM -0800, Siwei Liu wrote:
>>>>>>>>>> Sorry for replying to this ancient thread. There was some remaining
>>>>>>>>>> issue that I don't think the initial net_failover patch got addressed
>>>>>>>>>> cleanly, see:
>>>>>>>>>>
>>>>>>>>>> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1815268
>>>>>>>>>>
>>>>>>>>>> The renaming of 'eth0' to 'ens4' fails because the udev userspace was
>>>>>>>>>> not specifically writtten for such kernel automatic enslavement.
>>>>>>>>>> Specifically, if it is a bond or team, the slave would typically get
>>>>>>>>>> renamed *before* virtual device gets created, that's what udev can
>>>>>>>>>> control (without getting netdev opened early by the other part of
>>>>>>>>>> kernel) and other userspace components for e.g. initramfs,
>>>>>>>>>> init-scripts can coordinate well in between. The in-kernel
>>>>>>>>>> auto-enslavement of net_failover breaks this userspace convention,
>>>>>>>>>> which don't provides a solution if user care about consistent naming
>>>>>>>>>> on the slave netdevs specifically.
>>>>>>>>>>
>>>>>>>>>> Previously this issue had been specifically called out when IFF_HIDDEN
>>>>>>>>>> and the 1-netdev was proposed, but no one gives out a solution to this
>>>>>>>>>> problem ever since. Please share your mind how to proceed and solve
>>>>>>>>>> this userspace issue if netdev does not welcome a 1-netdev model.
>>>>>>>>> Above says:
>>>>>>>>>
>>>>>>>>>        there's no motivation in the systemd/udevd community at
>>>>>>>>>        this point to refactor the rename logic and make it work well with
>>>>>>>>>        3-netdev.
>>>>>>>>>
>>>>>>>>> What would the fix be? Skip slave devices?
>>>>>>>>>
>>>>>>>> There's nothing user can get if just skipping slave devices - the
>>>>>>>> name is still unchanged and unpredictable e.g. eth0, or eth1 the
>>>>>>>> next reboot, while the rest may conform to the naming scheme (ens3
>>>>>>>> and such). There's no way one can fix this in userspace alone - when
>>>>>>>> the failover is created the enslaved netdev was opened by the kernel
>>>>>>>> earlier than the userspace is made aware of, and there's no
>>>>>>>> negotiation protocol for kernel to know when userspace has done
>>>>>>>> initial renaming of the interface. I would expect netdev list should
>>>>>>>> at least provide the direction in general for how this can be
>>>>>>>> solved...
>>>>> I was just wondering what did you mean when you said
>>>>> "refactor the rename logic and make it work well with 3-netdev" -
>>>>> was there a proposal udev rejected?
>>>> No. I never believed this particular issue can be fixed in userspace alone.
>>>> Previously someone had said it could be, but I never see any work or
>>>> relevant discussion ever happened in various userspace communities (for e.g.
>>>> dracut, initramfs-tools, systemd, udev, and NetworkManager). IMHO the root
>>>> of the issue derives from the kernel, it makes more sense to start from
>>>> netdev, work out and decide on a solution: see what can be done in the
>>>> kernel in order to fix it, then after that engage userspace community for
>>>> the feasibility...
>>>>
>>>>> Anyway, can we write a time diagram for what happens in which order that
>>>>> leads to failure?  That would help look for triggers that we can tie
>>>>> into, or add new ones.
>>>>>
>>>> See attached diagram.
>>>>
>>>>>
>>>>>
>>>>>>> Is there an issue if slave device names are not predictable? The user/admin scripts are expected
>>>>>>> to only work with the master failover device.
>>>>>> Where does this expectation come from?
>>>>>>
>>>>>> Admin users may have ethtool or tc configurations that need to deal with
>>>>>> predictable interface name. Third-party app which was built upon specifying
>>>>>> certain interface name can't be modified to chase dynamic names.
>>>>>>
>>>>>> Specifically, we have pre-canned image that uses ethtool to fine tune VF
>>>>>> offload settings post boot for specific workload. Those images won't work
>>>>>> well if the name is constantly changing just after couple rounds of live
>>>>>> migration.
>>>>> It should be possible to specify the ethtool configuration on the
>>>>> master and have it automatically propagated to the slave.
>>>>>
>>>>> BTW this is something we should look at IMHO.
>>>> I was elaborating a few examples that the expectation and assumption that
>>>> user/admin scripts only deal with master failover device is incorrect. It
>>>> had never been taken good care of, although I did try to emphasize it from
>>>> the very beginning.
>>>>
>>>> Basically what you said about propagating the ethtool configuration down to
>>>> the slave is the key pursuance of 1-netdev model. However, what I am seeking
>>>> now is any alternative that can also fix the specific udev rename problem,
>>>> before concluding that 1-netdev is the only solution. Generally a 1-netdev
>>>> scheme would take time to implement, while I'm trying to find a way out to
>>>> fix this particular naming problem under 3-netdev.
>>>>
>>>>>>> Moreover, you were suggesting hiding the lower slave devices anyway. There was some discussion
>>>>>>> about moving them to a hidden network namespace so that they are not visible from the default namespace.
>>>>>>> I looked into this sometime back, but did not find the right kernel api to create a network namespace within
>>>>>>> kernel. If so, we could use this mechanism to simulate a 1-netdev model.
>>>>>> Yes, that's one possible implementation (IMHO the key is to make 1-netdev
>>>>>> model as much transparent to a real NIC as possible, while a hidden netns is
>>>>>> just the vehicle). However, I recall there was resistance around this
>>>>>> discussion that even the concept of hiding itself is a taboo for Linux
>>>>>> netdev. I would like to summon potential alternatives before concluding
>>>>>> 1-netdev is the only solution too soon.
>>>>>>
>>>>>> Thanks,
>>>>>> -Siwei
>>>>> Your scripts would not work at all then, right?
>>>> At this point we don't claim images with such usage as SR-IOV live
>>>> migrate-able. We would flag it as live migrate-able until this ethtool
>>>> config issue is fully addressed and a transparent live migration solution
>>>> emerges in upstream eventually.
>>>>
>>>>
>>>> Thanks,
>>>> -Siwei
>>>>>>>> -Siwei
>>>>>>>>
>>>>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
>>>>> For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org
>>>>>
>>>>     net_failover(kernel)                            |    network.service (user)    |          systemd-udevd (user)
>>>> --------------------------------------------------+------------------------------+--------------------------------------------
>>>> (standby virtio-net and net_failover              |                              |
>>>> devices created and initialized,                  |                              |
>>>> i.e. virtnet_probe()->                            |                              |
>>>>          net_failover_create()                      |                              |
>>>> was done.)                                        |                              |
>>>>                                                     |                              |
>>>>                                                     |  runs `ifup ens3' ->         |
>>>>                                                     |    ip link set dev ens3 up   |
>>>> net_failover_open()                               |                              |
>>>>     dev_open(virtnet_dev)                           |                              |
>>>>       virtnet_open(virtnet_dev)                     |                              |
>>>>     netif_carrier_on(failover_dev)                  |                              |
>>>>     ...                                             |                              |
>>>>                                                     |                              |
>>>> (VF hot plugged in)                               |                              |
>>>> ixgbevf_probe()                                   |                              |
>>>>    register_netdev(ixgbevf_netdev)                  |                              |
>>>>     netdev_register_kobject(ixgbevf_netdev)         |                              |
>>>>      kobject_add(ixgbevf_dev)                       |                              |
>>>>       device_add(ixgbevf_dev)                       |                              |
>>>>        kobject_uevent(&ixgbevf_dev->kobj, KOBJ_ADD) |                              |
>>>>         netlink_broadcast()                         |                              |
>>>>     ...                                             |                              |
>>>>     call_netdevice_notifiers(NETDEV_REGISTER)       |                              |
>>>>      failover_event(..., NETDEV_REGISTER, ...)      |                              |
>>>>       failover_slave_register(ixgbevf_netdev)       |                              |
>>>>        net_failover_slave_register(ixgbevf_netdev)  |                              |
>>>>         dev_open(ixgbevf_netdev)                    |                              |
>>>>                                                     |                              |
>>>>                                                     |                              |
>>>>                                                     |                              |   received ADD uevent from netlink fd
>>>>                                                     |                              |   ...
>>>>                                                     |                              |   udev-builtin-net_id.c:dev_pci_slot()
>>>>                                                     |                              |   (decided to renamed 'eth0' )
>>>>                                                     |                              |     ip link set dev eth0 name ens4
>>>> (dev_change_name() returns -EBUSY as              |                              |
>>>> ixgbevf_netdev->flags has IFF_UP)                 |                              |
>>>>                                                     |                              |
>>>>
>>> Given renaming slaves does not work anyway:
>> I was actually thinking what if we relieve the rename restriction just for
>> the failover slave? What the impact would be? I think users don't care about
>> slave being renamed when it's in use, especially the initial rename.
>> Thoughts?
>>
>>>    would it work if we just
>>> hard-coded slave names instead?
>>>
>>> E.g.
>>> 1. fail slave renames
>>> 2. rename of failover to XX automatically renames standby to XXnsby
>>>      and primary to XXnpry
>> That wouldn't help. The time when the failover master gets renamed, the VF
>> may not be present.
> In this scheme if VF is not there it will be renamed immediately after registration.
Who will be responsible to rename the slave, the kernel? Note the 
master's name may or may not come from the userspace. If it comes from 
the userspace, should the userspace daemon change their expectation not 
to name/rename _any_ slaves (today there's no distinction)? How do users 
know which name to trust, depending on which wins the race more often? 
Say if kernel wants a ens3npry name while userspace wants it named as ens4.

-Siwei

>
>> I don't like the idea to delay exposing failover master
>> until VF is hot plugged in (probably subject to various failures) later.
>>
>> Thanks,
>> -Siwei
>
> I agree, this was not what I meant.
>
>>>
Michael S. Tsirkin Feb. 27, 2019, 11:50 p.m. UTC | #25
On Wed, Feb 27, 2019 at 03:34:56PM -0800, si-wei liu wrote:
> 
> 
> On 2/27/2019 2:38 PM, Michael S. Tsirkin wrote:
> > On Tue, Feb 26, 2019 at 04:17:21PM -0800, si-wei liu wrote:
> > > 
> > > On 2/25/2019 6:08 PM, Michael S. Tsirkin wrote:
> > > > On Mon, Feb 25, 2019 at 04:58:07PM -0800, si-wei liu wrote:
> > > > > On 2/22/2019 7:14 AM, Michael S. Tsirkin wrote:
> > > > > > On Thu, Feb 21, 2019 at 11:55:11PM -0800, si-wei liu wrote:
> > > > > > > On 2/21/2019 11:00 PM, Samudrala, Sridhar wrote:
> > > > > > > > On 2/21/2019 7:33 PM, si-wei liu wrote:
> > > > > > > > > On 2/21/2019 5:39 PM, Michael S. Tsirkin wrote:
> > > > > > > > > > On Thu, Feb 21, 2019 at 05:14:44PM -0800, Siwei Liu wrote:
> > > > > > > > > > > Sorry for replying to this ancient thread. There was some remaining
> > > > > > > > > > > issue that I don't think the initial net_failover patch got addressed
> > > > > > > > > > > cleanly, see:
> > > > > > > > > > > 
> > > > > > > > > > > https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1815268
> > > > > > > > > > > 
> > > > > > > > > > > The renaming of 'eth0' to 'ens4' fails because the udev userspace was
> > > > > > > > > > > not specifically writtten for such kernel automatic enslavement.
> > > > > > > > > > > Specifically, if it is a bond or team, the slave would typically get
> > > > > > > > > > > renamed *before* virtual device gets created, that's what udev can
> > > > > > > > > > > control (without getting netdev opened early by the other part of
> > > > > > > > > > > kernel) and other userspace components for e.g. initramfs,
> > > > > > > > > > > init-scripts can coordinate well in between. The in-kernel
> > > > > > > > > > > auto-enslavement of net_failover breaks this userspace convention,
> > > > > > > > > > > which don't provides a solution if user care about consistent naming
> > > > > > > > > > > on the slave netdevs specifically.
> > > > > > > > > > > 
> > > > > > > > > > > Previously this issue had been specifically called out when IFF_HIDDEN
> > > > > > > > > > > and the 1-netdev was proposed, but no one gives out a solution to this
> > > > > > > > > > > problem ever since. Please share your mind how to proceed and solve
> > > > > > > > > > > this userspace issue if netdev does not welcome a 1-netdev model.
> > > > > > > > > > Above says:
> > > > > > > > > > 
> > > > > > > > > >        there's no motivation in the systemd/udevd community at
> > > > > > > > > >        this point to refactor the rename logic and make it work well with
> > > > > > > > > >        3-netdev.
> > > > > > > > > > 
> > > > > > > > > > What would the fix be? Skip slave devices?
> > > > > > > > > > 
> > > > > > > > > There's nothing user can get if just skipping slave devices - the
> > > > > > > > > name is still unchanged and unpredictable e.g. eth0, or eth1 the
> > > > > > > > > next reboot, while the rest may conform to the naming scheme (ens3
> > > > > > > > > and such). There's no way one can fix this in userspace alone - when
> > > > > > > > > the failover is created the enslaved netdev was opened by the kernel
> > > > > > > > > earlier than the userspace is made aware of, and there's no
> > > > > > > > > negotiation protocol for kernel to know when userspace has done
> > > > > > > > > initial renaming of the interface. I would expect netdev list should
> > > > > > > > > at least provide the direction in general for how this can be
> > > > > > > > > solved...
> > > > > > I was just wondering what did you mean when you said
> > > > > > "refactor the rename logic and make it work well with 3-netdev" -
> > > > > > was there a proposal udev rejected?
> > > > > No. I never believed this particular issue can be fixed in userspace alone.
> > > > > Previously someone had said it could be, but I never see any work or
> > > > > relevant discussion ever happened in various userspace communities (for e.g.
> > > > > dracut, initramfs-tools, systemd, udev, and NetworkManager). IMHO the root
> > > > > of the issue derives from the kernel, it makes more sense to start from
> > > > > netdev, work out and decide on a solution: see what can be done in the
> > > > > kernel in order to fix it, then after that engage userspace community for
> > > > > the feasibility...
> > > > > 
> > > > > > Anyway, can we write a time diagram for what happens in which order that
> > > > > > leads to failure?  That would help look for triggers that we can tie
> > > > > > into, or add new ones.
> > > > > > 
> > > > > See attached diagram.
> > > > > 
> > > > > > 
> > > > > > 
> > > > > > > > Is there an issue if slave device names are not predictable? The user/admin scripts are expected
> > > > > > > > to only work with the master failover device.
> > > > > > > Where does this expectation come from?
> > > > > > > 
> > > > > > > Admin users may have ethtool or tc configurations that need to deal with
> > > > > > > predictable interface name. Third-party app which was built upon specifying
> > > > > > > certain interface name can't be modified to chase dynamic names.
> > > > > > > 
> > > > > > > Specifically, we have pre-canned image that uses ethtool to fine tune VF
> > > > > > > offload settings post boot for specific workload. Those images won't work
> > > > > > > well if the name is constantly changing just after couple rounds of live
> > > > > > > migration.
> > > > > > It should be possible to specify the ethtool configuration on the
> > > > > > master and have it automatically propagated to the slave.
> > > > > > 
> > > > > > BTW this is something we should look at IMHO.
> > > > > I was elaborating a few examples that the expectation and assumption that
> > > > > user/admin scripts only deal with master failover device is incorrect. It
> > > > > had never been taken good care of, although I did try to emphasize it from
> > > > > the very beginning.
> > > > > 
> > > > > Basically what you said about propagating the ethtool configuration down to
> > > > > the slave is the key pursuance of 1-netdev model. However, what I am seeking
> > > > > now is any alternative that can also fix the specific udev rename problem,
> > > > > before concluding that 1-netdev is the only solution. Generally a 1-netdev
> > > > > scheme would take time to implement, while I'm trying to find a way out to
> > > > > fix this particular naming problem under 3-netdev.
> > > > > 
> > > > > > > > Moreover, you were suggesting hiding the lower slave devices anyway. There was some discussion
> > > > > > > > about moving them to a hidden network namespace so that they are not visible from the default namespace.
> > > > > > > > I looked into this sometime back, but did not find the right kernel api to create a network namespace within
> > > > > > > > kernel. If so, we could use this mechanism to simulate a 1-netdev model.
> > > > > > > Yes, that's one possible implementation (IMHO the key is to make 1-netdev
> > > > > > > model as much transparent to a real NIC as possible, while a hidden netns is
> > > > > > > just the vehicle). However, I recall there was resistance around this
> > > > > > > discussion that even the concept of hiding itself is a taboo for Linux
> > > > > > > netdev. I would like to summon potential alternatives before concluding
> > > > > > > 1-netdev is the only solution too soon.
> > > > > > > 
> > > > > > > Thanks,
> > > > > > > -Siwei
> > > > > > Your scripts would not work at all then, right?
> > > > > At this point we don't claim images with such usage as SR-IOV live
> > > > > migrate-able. We would flag it as live migrate-able until this ethtool
> > > > > config issue is fully addressed and a transparent live migration solution
> > > > > emerges in upstream eventually.
> > > > > 
> > > > > 
> > > > > Thanks,
> > > > > -Siwei
> > > > > > > > > -Siwei
> > > > > > > > > 
> > > > > > > > > 
> > > > > > ---------------------------------------------------------------------
> > > > > > To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
> > > > > > For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org
> > > > > > 
> > > > >     net_failover(kernel)                            |    network.service (user)    |          systemd-udevd (user)
> > > > > --------------------------------------------------+------------------------------+--------------------------------------------
> > > > > (standby virtio-net and net_failover              |                              |
> > > > > devices created and initialized,                  |                              |
> > > > > i.e. virtnet_probe()->                            |                              |
> > > > >          net_failover_create()                      |                              |
> > > > > was done.)                                        |                              |
> > > > >                                                     |                              |
> > > > >                                                     |  runs `ifup ens3' ->         |
> > > > >                                                     |    ip link set dev ens3 up   |
> > > > > net_failover_open()                               |                              |
> > > > >     dev_open(virtnet_dev)                           |                              |
> > > > >       virtnet_open(virtnet_dev)                     |                              |
> > > > >     netif_carrier_on(failover_dev)                  |                              |
> > > > >     ...                                             |                              |
> > > > >                                                     |                              |
> > > > > (VF hot plugged in)                               |                              |
> > > > > ixgbevf_probe()                                   |                              |
> > > > >    register_netdev(ixgbevf_netdev)                  |                              |
> > > > >     netdev_register_kobject(ixgbevf_netdev)         |                              |
> > > > >      kobject_add(ixgbevf_dev)                       |                              |
> > > > >       device_add(ixgbevf_dev)                       |                              |
> > > > >        kobject_uevent(&ixgbevf_dev->kobj, KOBJ_ADD) |                              |
> > > > >         netlink_broadcast()                         |                              |
> > > > >     ...                                             |                              |
> > > > >     call_netdevice_notifiers(NETDEV_REGISTER)       |                              |
> > > > >      failover_event(..., NETDEV_REGISTER, ...)      |                              |
> > > > >       failover_slave_register(ixgbevf_netdev)       |                              |
> > > > >        net_failover_slave_register(ixgbevf_netdev)  |                              |
> > > > >         dev_open(ixgbevf_netdev)                    |                              |
> > > > >                                                     |                              |
> > > > >                                                     |                              |
> > > > >                                                     |                              |   received ADD uevent from netlink fd
> > > > >                                                     |                              |   ...
> > > > >                                                     |                              |   udev-builtin-net_id.c:dev_pci_slot()
> > > > >                                                     |                              |   (decided to renamed 'eth0' )
> > > > >                                                     |                              |     ip link set dev eth0 name ens4
> > > > > (dev_change_name() returns -EBUSY as              |                              |
> > > > > ixgbevf_netdev->flags has IFF_UP)                 |                              |
> > > > >                                                     |                              |
> > > > > 
> > > > Given renaming slaves does not work anyway:
> > > I was actually thinking what if we relieve the rename restriction just for
> > > the failover slave? What the impact would be? I think users don't care about
> > > slave being renamed when it's in use, especially the initial rename.
> > > Thoughts?
> > > 
> > > >    would it work if we just
> > > > hard-coded slave names instead?
> > > > 
> > > > E.g.
> > > > 1. fail slave renames
> > > > 2. rename of failover to XX automatically renames standby to XXnsby
> > > >      and primary to XXnpry
> > > That wouldn't help. The time when the failover master gets renamed, the VF
> > > may not be present.
> > In this scheme if VF is not there it will be renamed immediately after registration.
> Who will be responsible to rename the slave, the kernel?

That's the idea.

> Note the master's
> name may or may not come from the userspace. If it comes from the userspace,
> should the userspace daemon change their expectation not to name/rename
> _any_ slaves (today there's no distinction)?

Yes the idea would be to fail renaming slaves.

> How do users know which name to
> trust, depending on which wins the race more often? Say if kernel wants a
> ens3npry name while userspace wants it named as ens4.
> 
> -Siwei

With this approach kernel will deny attempts by userspace to rename
slaves.  Slaves will always be named XXXnsby and XXnpry. Master renames
will rename both slaves.

It seems pretty solid to me, the only issue is that in theory userspace
can use a name like XXXnsby for something else. But this seems unlikely.


> > 
> > > I don't like the idea to delay exposing failover master
> > > until VF is hot plugged in (probably subject to various failures) later.
> > > 
> > > Thanks,
> > > -Siwei
> > 
> > I agree, this was not what I meant.
> > 
> > > >
Liran Alon Feb. 28, 2019, midnight UTC | #26
> On 28 Feb 2019, at 1:50, Michael S. Tsirkin <mst@redhat.com> wrote:
> 
> On Wed, Feb 27, 2019 at 03:34:56PM -0800, si-wei liu wrote:
>> 
>> 
>> On 2/27/2019 2:38 PM, Michael S. Tsirkin wrote:
>>> On Tue, Feb 26, 2019 at 04:17:21PM -0800, si-wei liu wrote:
>>>> 
>>>> On 2/25/2019 6:08 PM, Michael S. Tsirkin wrote:
>>>>> On Mon, Feb 25, 2019 at 04:58:07PM -0800, si-wei liu wrote:
>>>>>> On 2/22/2019 7:14 AM, Michael S. Tsirkin wrote:
>>>>>>> On Thu, Feb 21, 2019 at 11:55:11PM -0800, si-wei liu wrote:
>>>>>>>> On 2/21/2019 11:00 PM, Samudrala, Sridhar wrote:
>>>>>>>>> On 2/21/2019 7:33 PM, si-wei liu wrote:
>>>>>>>>>> On 2/21/2019 5:39 PM, Michael S. Tsirkin wrote:
>>>>>>>>>>> On Thu, Feb 21, 2019 at 05:14:44PM -0800, Siwei Liu wrote:
>>>>>>>>>>>> Sorry for replying to this ancient thread. There was some remaining
>>>>>>>>>>>> issue that I don't think the initial net_failover patch got addressed
>>>>>>>>>>>> cleanly, see:
>>>>>>>>>>>> 
>>>>>>>>>>>> https://urldefense.proofpoint.com/v2/url?u=https-3A__bugs.launchpad.net_ubuntu_-2Bsource_linux_-2Bbug_1815268&d=DwIBAg&c=RoP1YumCXCgaWHvlZYR8PZh8Bv7qIrMUB65eapI_JnE&r=Jk6Q8nNzkQ6LJ6g42qARkg6ryIDGQr-yKXPNGZbpTx0&m=aL-QfUoSYx8r0XCOBkcDtF8f-cYxrJI3skYLFTb8XJE&s=yk6Nqv3a6_JMzyrXKY67h00FyNrDJyQ-PYMFffDSTXM&e=
>>>>>>>>>>>> 
>>>>>>>>>>>> The renaming of 'eth0' to 'ens4' fails because the udev userspace was
>>>>>>>>>>>> not specifically writtten for such kernel automatic enslavement.
>>>>>>>>>>>> Specifically, if it is a bond or team, the slave would typically get
>>>>>>>>>>>> renamed *before* virtual device gets created, that's what udev can
>>>>>>>>>>>> control (without getting netdev opened early by the other part of
>>>>>>>>>>>> kernel) and other userspace components for e.g. initramfs,
>>>>>>>>>>>> init-scripts can coordinate well in between. The in-kernel
>>>>>>>>>>>> auto-enslavement of net_failover breaks this userspace convention,
>>>>>>>>>>>> which don't provides a solution if user care about consistent naming
>>>>>>>>>>>> on the slave netdevs specifically.
>>>>>>>>>>>> 
>>>>>>>>>>>> Previously this issue had been specifically called out when IFF_HIDDEN
>>>>>>>>>>>> and the 1-netdev was proposed, but no one gives out a solution to this
>>>>>>>>>>>> problem ever since. Please share your mind how to proceed and solve
>>>>>>>>>>>> this userspace issue if netdev does not welcome a 1-netdev model.
>>>>>>>>>>> Above says:
>>>>>>>>>>> 
>>>>>>>>>>>       there's no motivation in the systemd/udevd community at
>>>>>>>>>>>       this point to refactor the rename logic and make it work well with
>>>>>>>>>>>       3-netdev.
>>>>>>>>>>> 
>>>>>>>>>>> What would the fix be? Skip slave devices?
>>>>>>>>>>> 
>>>>>>>>>> There's nothing user can get if just skipping slave devices - the
>>>>>>>>>> name is still unchanged and unpredictable e.g. eth0, or eth1 the
>>>>>>>>>> next reboot, while the rest may conform to the naming scheme (ens3
>>>>>>>>>> and such). There's no way one can fix this in userspace alone - when
>>>>>>>>>> the failover is created the enslaved netdev was opened by the kernel
>>>>>>>>>> earlier than the userspace is made aware of, and there's no
>>>>>>>>>> negotiation protocol for kernel to know when userspace has done
>>>>>>>>>> initial renaming of the interface. I would expect netdev list should
>>>>>>>>>> at least provide the direction in general for how this can be
>>>>>>>>>> solved...
>>>>>>> I was just wondering what did you mean when you said
>>>>>>> "refactor the rename logic and make it work well with 3-netdev" -
>>>>>>> was there a proposal udev rejected?
>>>>>> No. I never believed this particular issue can be fixed in userspace alone.
>>>>>> Previously someone had said it could be, but I never see any work or
>>>>>> relevant discussion ever happened in various userspace communities (for e.g.
>>>>>> dracut, initramfs-tools, systemd, udev, and NetworkManager). IMHO the root
>>>>>> of the issue derives from the kernel, it makes more sense to start from
>>>>>> netdev, work out and decide on a solution: see what can be done in the
>>>>>> kernel in order to fix it, then after that engage userspace community for
>>>>>> the feasibility...
>>>>>> 
>>>>>>> Anyway, can we write a time diagram for what happens in which order that
>>>>>>> leads to failure?  That would help look for triggers that we can tie
>>>>>>> into, or add new ones.
>>>>>>> 
>>>>>> See attached diagram.
>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>>>> Is there an issue if slave device names are not predictable? The user/admin scripts are expected
>>>>>>>>> to only work with the master failover device.
>>>>>>>> Where does this expectation come from?
>>>>>>>> 
>>>>>>>> Admin users may have ethtool or tc configurations that need to deal with
>>>>>>>> predictable interface name. Third-party app which was built upon specifying
>>>>>>>> certain interface name can't be modified to chase dynamic names.
>>>>>>>> 
>>>>>>>> Specifically, we have pre-canned image that uses ethtool to fine tune VF
>>>>>>>> offload settings post boot for specific workload. Those images won't work
>>>>>>>> well if the name is constantly changing just after couple rounds of live
>>>>>>>> migration.
>>>>>>> It should be possible to specify the ethtool configuration on the
>>>>>>> master and have it automatically propagated to the slave.
>>>>>>> 
>>>>>>> BTW this is something we should look at IMHO.
>>>>>> I was elaborating a few examples that the expectation and assumption that
>>>>>> user/admin scripts only deal with master failover device is incorrect. It
>>>>>> had never been taken good care of, although I did try to emphasize it from
>>>>>> the very beginning.
>>>>>> 
>>>>>> Basically what you said about propagating the ethtool configuration down to
>>>>>> the slave is the key pursuance of 1-netdev model. However, what I am seeking
>>>>>> now is any alternative that can also fix the specific udev rename problem,
>>>>>> before concluding that 1-netdev is the only solution. Generally a 1-netdev
>>>>>> scheme would take time to implement, while I'm trying to find a way out to
>>>>>> fix this particular naming problem under 3-netdev.
>>>>>> 
>>>>>>>>> Moreover, you were suggesting hiding the lower slave devices anyway. There was some discussion
>>>>>>>>> about moving them to a hidden network namespace so that they are not visible from the default namespace.
>>>>>>>>> I looked into this sometime back, but did not find the right kernel api to create a network namespace within
>>>>>>>>> kernel. If so, we could use this mechanism to simulate a 1-netdev model.
>>>>>>>> Yes, that's one possible implementation (IMHO the key is to make 1-netdev
>>>>>>>> model as much transparent to a real NIC as possible, while a hidden netns is
>>>>>>>> just the vehicle). However, I recall there was resistance around this
>>>>>>>> discussion that even the concept of hiding itself is a taboo for Linux
>>>>>>>> netdev. I would like to summon potential alternatives before concluding
>>>>>>>> 1-netdev is the only solution too soon.
>>>>>>>> 
>>>>>>>> Thanks,
>>>>>>>> -Siwei
>>>>>>> Your scripts would not work at all then, right?
>>>>>> At this point we don't claim images with such usage as SR-IOV live
>>>>>> migrate-able. We would flag it as live migrate-able until this ethtool
>>>>>> config issue is fully addressed and a transparent live migration solution
>>>>>> emerges in upstream eventually.
>>>>>> 
>>>>>> 
>>>>>> Thanks,
>>>>>> -Siwei
>>>>>>>>>> -Siwei
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>> ---------------------------------------------------------------------
>>>>>>> To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
>>>>>>> For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org
>>>>>>> 
>>>>>>    net_failover(kernel)                            |    network.service (user)    |          systemd-udevd (user)
>>>>>> --------------------------------------------------+------------------------------+--------------------------------------------
>>>>>> (standby virtio-net and net_failover              |                              |
>>>>>> devices created and initialized,                  |                              |
>>>>>> i.e. virtnet_probe()->                            |                              |
>>>>>>         net_failover_create()                      |                              |
>>>>>> was done.)                                        |                              |
>>>>>>                                                    |                              |
>>>>>>                                                    |  runs `ifup ens3' ->         |
>>>>>>                                                    |    ip link set dev ens3 up   |
>>>>>> net_failover_open()                               |                              |
>>>>>>    dev_open(virtnet_dev)                           |                              |
>>>>>>      virtnet_open(virtnet_dev)                     |                              |
>>>>>>    netif_carrier_on(failover_dev)                  |                              |
>>>>>>    ...                                             |                              |
>>>>>>                                                    |                              |
>>>>>> (VF hot plugged in)                               |                              |
>>>>>> ixgbevf_probe()                                   |                              |
>>>>>>   register_netdev(ixgbevf_netdev)                  |                              |
>>>>>>    netdev_register_kobject(ixgbevf_netdev)         |                              |
>>>>>>     kobject_add(ixgbevf_dev)                       |                              |
>>>>>>      device_add(ixgbevf_dev)                       |                              |
>>>>>>       kobject_uevent(&ixgbevf_dev->kobj, KOBJ_ADD) |                              |
>>>>>>        netlink_broadcast()                         |                              |
>>>>>>    ...                                             |                              |
>>>>>>    call_netdevice_notifiers(NETDEV_REGISTER)       |                              |
>>>>>>     failover_event(..., NETDEV_REGISTER, ...)      |                              |
>>>>>>      failover_slave_register(ixgbevf_netdev)       |                              |
>>>>>>       net_failover_slave_register(ixgbevf_netdev)  |                              |
>>>>>>        dev_open(ixgbevf_netdev)                    |                              |
>>>>>>                                                    |                              |
>>>>>>                                                    |                              |
>>>>>>                                                    |                              |   received ADD uevent from netlink fd
>>>>>>                                                    |                              |   ...
>>>>>>                                                    |                              |   udev-builtin-net_id.c:dev_pci_slot()
>>>>>>                                                    |                              |   (decided to renamed 'eth0' )
>>>>>>                                                    |                              |     ip link set dev eth0 name ens4
>>>>>> (dev_change_name() returns -EBUSY as              |                              |
>>>>>> ixgbevf_netdev->flags has IFF_UP)                 |                              |
>>>>>>                                                    |                              |
>>>>>> 
>>>>> Given renaming slaves does not work anyway:
>>>> I was actually thinking what if we relieve the rename restriction just for
>>>> the failover slave? What the impact would be? I think users don't care about
>>>> slave being renamed when it's in use, especially the initial rename.
>>>> Thoughts?
>>>> 
>>>>>   would it work if we just
>>>>> hard-coded slave names instead?
>>>>> 
>>>>> E.g.
>>>>> 1. fail slave renames
>>>>> 2. rename of failover to XX automatically renames standby to XXnsby
>>>>>     and primary to XXnpry
>>>> That wouldn't help. The time when the failover master gets renamed, the VF
>>>> may not be present.
>>> In this scheme if VF is not there it will be renamed immediately after registration.
>> Who will be responsible to rename the slave, the kernel?
> 
> That's the idea.
> 
>> Note the master's
>> name may or may not come from the userspace. If it comes from the userspace,
>> should the userspace daemon change their expectation not to name/rename
>> _any_ slaves (today there's no distinction)?
> 
> Yes the idea would be to fail renaming slaves.
> 
>> How do users know which name to
>> trust, depending on which wins the race more often? Say if kernel wants a
>> ens3npry name while userspace wants it named as ens4.
>> 
>> -Siwei
> 
> With this approach kernel will deny attempts by userspace to rename
> slaves.  Slaves will always be named XXXnsby and XXnpry. Master renames
> will rename both slaves.
> 
> It seems pretty solid to me, the only issue is that in theory userspace
> can use a name like XXXnsby for something else. But this seems unlikely.

I’m fond of this idea and I have similar opinion.
I think it simplifies the issue here.
I don’t see a real reason for customer to define udev rule to rename a net-failover slave to have different postfix.

-Liran

> 
> 
>>> 
>>>> I don't like the idea to delay exposing failover master
>>>> until VF is hot plugged in (probably subject to various failures) later.
>>>> 
>>>> Thanks,
>>>> -Siwei
>>> 
>>> I agree, this was not what I meant.
>>> 
>>>>>
Stephen Hemminger Feb. 28, 2019, 12:03 a.m. UTC | #27
On Wed, 27 Feb 2019 18:50:44 -0500
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Wed, Feb 27, 2019 at 03:34:56PM -0800, si-wei liu wrote:
> > 
> > 
> > On 2/27/2019 2:38 PM, Michael S. Tsirkin wrote:  
> > > On Tue, Feb 26, 2019 at 04:17:21PM -0800, si-wei liu wrote:  
> > > > 
> > > > On 2/25/2019 6:08 PM, Michael S. Tsirkin wrote:  
> > > > > On Mon, Feb 25, 2019 at 04:58:07PM -0800, si-wei liu wrote:  
> > > > > > On 2/22/2019 7:14 AM, Michael S. Tsirkin wrote:  
> > > > > > > On Thu, Feb 21, 2019 at 11:55:11PM -0800, si-wei liu wrote:  
> > > > > > > > On 2/21/2019 11:00 PM, Samudrala, Sridhar wrote:  
> > > > > > > > > On 2/21/2019 7:33 PM, si-wei liu wrote:  
> > > > > > > > > > On 2/21/2019 5:39 PM, Michael S. Tsirkin wrote:  
> > > > > > > > > > > On Thu, Feb 21, 2019 at 05:14:44PM -0800, Siwei Liu wrote:  
> > > > > > > > > > > > Sorry for replying to this ancient thread. There was some remaining
> > > > > > > > > > > > issue that I don't think the initial net_failover patch got addressed
> > > > > > > > > > > > cleanly, see:
> > > > > > > > > > > > 
> > > > > > > > > > > > https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1815268
> > > > > > > > > > > > 
> > > > > > > > > > > > The renaming of 'eth0' to 'ens4' fails because the udev userspace was
> > > > > > > > > > > > not specifically writtten for such kernel automatic enslavement.
> > > > > > > > > > > > Specifically, if it is a bond or team, the slave would typically get
> > > > > > > > > > > > renamed *before* virtual device gets created, that's what udev can
> > > > > > > > > > > > control (without getting netdev opened early by the other part of
> > > > > > > > > > > > kernel) and other userspace components for e.g. initramfs,
> > > > > > > > > > > > init-scripts can coordinate well in between. The in-kernel
> > > > > > > > > > > > auto-enslavement of net_failover breaks this userspace convention,
> > > > > > > > > > > > which don't provides a solution if user care about consistent naming
> > > > > > > > > > > > on the slave netdevs specifically.
> > > > > > > > > > > > 
> > > > > > > > > > > > Previously this issue had been specifically called out when IFF_HIDDEN
> > > > > > > > > > > > and the 1-netdev was proposed, but no one gives out a solution to this
> > > > > > > > > > > > problem ever since. Please share your mind how to proceed and solve
> > > > > > > > > > > > this userspace issue if netdev does not welcome a 1-netdev model.  
> > > > > > > > > > > Above says:
> > > > > > > > > > > 
> > > > > > > > > > >        there's no motivation in the systemd/udevd community at
> > > > > > > > > > >        this point to refactor the rename logic and make it work well with
> > > > > > > > > > >        3-netdev.
> > > > > > > > > > > 
> > > > > > > > > > > What would the fix be? Skip slave devices?
> > > > > > > > > > >   
> > > > > > > > > > There's nothing user can get if just skipping slave devices - the
> > > > > > > > > > name is still unchanged and unpredictable e.g. eth0, or eth1 the
> > > > > > > > > > next reboot, while the rest may conform to the naming scheme (ens3
> > > > > > > > > > and such). There's no way one can fix this in userspace alone - when
> > > > > > > > > > the failover is created the enslaved netdev was opened by the kernel
> > > > > > > > > > earlier than the userspace is made aware of, and there's no
> > > > > > > > > > negotiation protocol for kernel to know when userspace has done
> > > > > > > > > > initial renaming of the interface. I would expect netdev list should
> > > > > > > > > > at least provide the direction in general for how this can be
> > > > > > > > > > solved...  
> > > > > > > I was just wondering what did you mean when you said
> > > > > > > "refactor the rename logic and make it work well with 3-netdev" -
> > > > > > > was there a proposal udev rejected?  
> > > > > > No. I never believed this particular issue can be fixed in userspace alone.
> > > > > > Previously someone had said it could be, but I never see any work or
> > > > > > relevant discussion ever happened in various userspace communities (for e.g.
> > > > > > dracut, initramfs-tools, systemd, udev, and NetworkManager). IMHO the root
> > > > > > of the issue derives from the kernel, it makes more sense to start from
> > > > > > netdev, work out and decide on a solution: see what can be done in the
> > > > > > kernel in order to fix it, then after that engage userspace community for
> > > > > > the feasibility...
> > > > > >   
> > > > > > > Anyway, can we write a time diagram for what happens in which order that
> > > > > > > leads to failure?  That would help look for triggers that we can tie
> > > > > > > into, or add new ones.
> > > > > > >   
> > > > > > See attached diagram.
> > > > > >   
> > > > > > > 
> > > > > > >   
> > > > > > > > > Is there an issue if slave device names are not predictable? The user/admin scripts are expected
> > > > > > > > > to only work with the master failover device.  
> > > > > > > > Where does this expectation come from?
> > > > > > > > 
> > > > > > > > Admin users may have ethtool or tc configurations that need to deal with
> > > > > > > > predictable interface name. Third-party app which was built upon specifying
> > > > > > > > certain interface name can't be modified to chase dynamic names.
> > > > > > > > 
> > > > > > > > Specifically, we have pre-canned image that uses ethtool to fine tune VF
> > > > > > > > offload settings post boot for specific workload. Those images won't work
> > > > > > > > well if the name is constantly changing just after couple rounds of live
> > > > > > > > migration.  
> > > > > > > It should be possible to specify the ethtool configuration on the
> > > > > > > master and have it automatically propagated to the slave.
> > > > > > > 
> > > > > > > BTW this is something we should look at IMHO.  
> > > > > > I was elaborating a few examples that the expectation and assumption that
> > > > > > user/admin scripts only deal with master failover device is incorrect. It
> > > > > > had never been taken good care of, although I did try to emphasize it from
> > > > > > the very beginning.
> > > > > > 
> > > > > > Basically what you said about propagating the ethtool configuration down to
> > > > > > the slave is the key pursuance of 1-netdev model. However, what I am seeking
> > > > > > now is any alternative that can also fix the specific udev rename problem,
> > > > > > before concluding that 1-netdev is the only solution. Generally a 1-netdev
> > > > > > scheme would take time to implement, while I'm trying to find a way out to
> > > > > > fix this particular naming problem under 3-netdev.
> > > > > >   
> > > > > > > > > Moreover, you were suggesting hiding the lower slave devices anyway. There was some discussion
> > > > > > > > > about moving them to a hidden network namespace so that they are not visible from the default namespace.
> > > > > > > > > I looked into this sometime back, but did not find the right kernel api to create a network namespace within
> > > > > > > > > kernel. If so, we could use this mechanism to simulate a 1-netdev model.  
> > > > > > > > Yes, that's one possible implementation (IMHO the key is to make 1-netdev
> > > > > > > > model as much transparent to a real NIC as possible, while a hidden netns is
> > > > > > > > just the vehicle). However, I recall there was resistance around this
> > > > > > > > discussion that even the concept of hiding itself is a taboo for Linux
> > > > > > > > netdev. I would like to summon potential alternatives before concluding
> > > > > > > > 1-netdev is the only solution too soon.
> > > > > > > > 
> > > > > > > > Thanks,
> > > > > > > > -Siwei  
> > > > > > > Your scripts would not work at all then, right?  
> > > > > > At this point we don't claim images with such usage as SR-IOV live
> > > > > > migrate-able. We would flag it as live migrate-able until this ethtool
> > > > > > config issue is fully addressed and a transparent live migration solution
> > > > > > emerges in upstream eventually.
> > > > > > 
> > > > > > 
> > > > > > Thanks,
> > > > > > -Siwei  
> > > > > > > > > > -Siwei
> > > > > > > > > > 
> > > > > > > > > >   
> > > > > > > ---------------------------------------------------------------------
> > > > > > > To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
> > > > > > > For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org
> > > > > > >   
> > > > > >     net_failover(kernel)                            |    network.service (user)    |          systemd-udevd (user)
> > > > > > --------------------------------------------------+------------------------------+--------------------------------------------
> > > > > > (standby virtio-net and net_failover              |                              |
> > > > > > devices created and initialized,                  |                              |
> > > > > > i.e. virtnet_probe()->                            |                              |
> > > > > >          net_failover_create()                      |                              |
> > > > > > was done.)                                        |                              |
> > > > > >                                                     |                              |
> > > > > >                                                     |  runs `ifup ens3' ->         |
> > > > > >                                                     |    ip link set dev ens3 up   |
> > > > > > net_failover_open()                               |                              |
> > > > > >     dev_open(virtnet_dev)                           |                              |
> > > > > >       virtnet_open(virtnet_dev)                     |                              |
> > > > > >     netif_carrier_on(failover_dev)                  |                              |
> > > > > >     ...                                             |                              |
> > > > > >                                                     |                              |
> > > > > > (VF hot plugged in)                               |                              |
> > > > > > ixgbevf_probe()                                   |                              |
> > > > > >    register_netdev(ixgbevf_netdev)                  |                              |
> > > > > >     netdev_register_kobject(ixgbevf_netdev)         |                              |
> > > > > >      kobject_add(ixgbevf_dev)                       |                              |
> > > > > >       device_add(ixgbevf_dev)                       |                              |
> > > > > >        kobject_uevent(&ixgbevf_dev->kobj, KOBJ_ADD) |                              |
> > > > > >         netlink_broadcast()                         |                              |
> > > > > >     ...                                             |                              |
> > > > > >     call_netdevice_notifiers(NETDEV_REGISTER)       |                              |
> > > > > >      failover_event(..., NETDEV_REGISTER, ...)      |                              |
> > > > > >       failover_slave_register(ixgbevf_netdev)       |                              |
> > > > > >        net_failover_slave_register(ixgbevf_netdev)  |                              |
> > > > > >         dev_open(ixgbevf_netdev)                    |                              |
> > > > > >                                                     |                              |
> > > > > >                                                     |                              |
> > > > > >                                                     |                              |   received ADD uevent from netlink fd
> > > > > >                                                     |                              |   ...
> > > > > >                                                     |                              |   udev-builtin-net_id.c:dev_pci_slot()
> > > > > >                                                     |                              |   (decided to renamed 'eth0' )
> > > > > >                                                     |                              |     ip link set dev eth0 name ens4
> > > > > > (dev_change_name() returns -EBUSY as              |                              |
> > > > > > ixgbevf_netdev->flags has IFF_UP)                 |                              |
> > > > > >                                                     |                              |
> > > > > >   
> > > > > Given renaming slaves does not work anyway:  
> > > > I was actually thinking what if we relieve the rename restriction just for
> > > > the failover slave? What the impact would be? I think users don't care about
> > > > slave being renamed when it's in use, especially the initial rename.
> > > > Thoughts?
> > > >   
> > > > >    would it work if we just
> > > > > hard-coded slave names instead?
> > > > > 
> > > > > E.g.
> > > > > 1. fail slave renames
> > > > > 2. rename of failover to XX automatically renames standby to XXnsby
> > > > >      and primary to XXnpry  
> > > > That wouldn't help. The time when the failover master gets renamed, the VF
> > > > may not be present.  
> > > In this scheme if VF is not there it will be renamed immediately after registration.  
> > Who will be responsible to rename the slave, the kernel?  
> 
> That's the idea.
> 
> > Note the master's
> > name may or may not come from the userspace. If it comes from the userspace,
> > should the userspace daemon change their expectation not to name/rename
> > _any_ slaves (today there's no distinction)?  
> 
> Yes the idea would be to fail renaming slaves.
> 
> > How do users know which name to
> > trust, depending on which wins the race more often? Say if kernel wants a
> > ens3npry name while userspace wants it named as ens4.
> > 
> > -Siwei  
> 
> With this approach kernel will deny attempts by userspace to rename
> slaves.  Slaves will always be named XXXnsby and XXnpry. Master renames
> will rename both slaves.
> 
> It seems pretty solid to me, the only issue is that in theory userspace
> can use a name like XXXnsby for something else. But this seems unlikely.

Similar schemes (with kernel providing naming) were also previously rejected
upstream. It has been a consistent theme that the kernel should not be in
the renaming business. It will certainly break userspace.
Si-Wei Liu Feb. 28, 2019, 12:38 a.m. UTC | #28
On 2/27/2019 3:50 PM, Michael S. Tsirkin wrote:
> On Wed, Feb 27, 2019 at 03:34:56PM -0800, si-wei liu wrote:
>>
>> On 2/27/2019 2:38 PM, Michael S. Tsirkin wrote:
>>> On Tue, Feb 26, 2019 at 04:17:21PM -0800, si-wei liu wrote:
>>>> On 2/25/2019 6:08 PM, Michael S. Tsirkin wrote:
>>>>> On Mon, Feb 25, 2019 at 04:58:07PM -0800, si-wei liu wrote:
>>>>>> On 2/22/2019 7:14 AM, Michael S. Tsirkin wrote:
>>>>>>> On Thu, Feb 21, 2019 at 11:55:11PM -0800, si-wei liu wrote:
>>>>>>>> On 2/21/2019 11:00 PM, Samudrala, Sridhar wrote:
>>>>>>>>> On 2/21/2019 7:33 PM, si-wei liu wrote:
>>>>>>>>>> On 2/21/2019 5:39 PM, Michael S. Tsirkin wrote:
>>>>>>>>>>> On Thu, Feb 21, 2019 at 05:14:44PM -0800, Siwei Liu wrote:
>>>>>>>>>>>> Sorry for replying to this ancient thread. There was some remaining
>>>>>>>>>>>> issue that I don't think the initial net_failover patch got addressed
>>>>>>>>>>>> cleanly, see:
>>>>>>>>>>>>
>>>>>>>>>>>> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1815268
>>>>>>>>>>>>
>>>>>>>>>>>> The renaming of 'eth0' to 'ens4' fails because the udev userspace was
>>>>>>>>>>>> not specifically writtten for such kernel automatic enslavement.
>>>>>>>>>>>> Specifically, if it is a bond or team, the slave would typically get
>>>>>>>>>>>> renamed *before* virtual device gets created, that's what udev can
>>>>>>>>>>>> control (without getting netdev opened early by the other part of
>>>>>>>>>>>> kernel) and other userspace components for e.g. initramfs,
>>>>>>>>>>>> init-scripts can coordinate well in between. The in-kernel
>>>>>>>>>>>> auto-enslavement of net_failover breaks this userspace convention,
>>>>>>>>>>>> which don't provides a solution if user care about consistent naming
>>>>>>>>>>>> on the slave netdevs specifically.
>>>>>>>>>>>>
>>>>>>>>>>>> Previously this issue had been specifically called out when IFF_HIDDEN
>>>>>>>>>>>> and the 1-netdev was proposed, but no one gives out a solution to this
>>>>>>>>>>>> problem ever since. Please share your mind how to proceed and solve
>>>>>>>>>>>> this userspace issue if netdev does not welcome a 1-netdev model.
>>>>>>>>>>> Above says:
>>>>>>>>>>>
>>>>>>>>>>>         there's no motivation in the systemd/udevd community at
>>>>>>>>>>>         this point to refactor the rename logic and make it work well with
>>>>>>>>>>>         3-netdev.
>>>>>>>>>>>
>>>>>>>>>>> What would the fix be? Skip slave devices?
>>>>>>>>>>>
>>>>>>>>>> There's nothing user can get if just skipping slave devices - the
>>>>>>>>>> name is still unchanged and unpredictable e.g. eth0, or eth1 the
>>>>>>>>>> next reboot, while the rest may conform to the naming scheme (ens3
>>>>>>>>>> and such). There's no way one can fix this in userspace alone - when
>>>>>>>>>> the failover is created the enslaved netdev was opened by the kernel
>>>>>>>>>> earlier than the userspace is made aware of, and there's no
>>>>>>>>>> negotiation protocol for kernel to know when userspace has done
>>>>>>>>>> initial renaming of the interface. I would expect netdev list should
>>>>>>>>>> at least provide the direction in general for how this can be
>>>>>>>>>> solved...
>>>>>>> I was just wondering what did you mean when you said
>>>>>>> "refactor the rename logic and make it work well with 3-netdev" -
>>>>>>> was there a proposal udev rejected?
>>>>>> No. I never believed this particular issue can be fixed in userspace alone.
>>>>>> Previously someone had said it could be, but I never see any work or
>>>>>> relevant discussion ever happened in various userspace communities (for e.g.
>>>>>> dracut, initramfs-tools, systemd, udev, and NetworkManager). IMHO the root
>>>>>> of the issue derives from the kernel, it makes more sense to start from
>>>>>> netdev, work out and decide on a solution: see what can be done in the
>>>>>> kernel in order to fix it, then after that engage userspace community for
>>>>>> the feasibility...
>>>>>>
>>>>>>> Anyway, can we write a time diagram for what happens in which order that
>>>>>>> leads to failure?  That would help look for triggers that we can tie
>>>>>>> into, or add new ones.
>>>>>>>
>>>>>> See attached diagram.
>>>>>>
>>>>>>>
>>>>>>>>> Is there an issue if slave device names are not predictable? The user/admin scripts are expected
>>>>>>>>> to only work with the master failover device.
>>>>>>>> Where does this expectation come from?
>>>>>>>>
>>>>>>>> Admin users may have ethtool or tc configurations that need to deal with
>>>>>>>> predictable interface name. Third-party app which was built upon specifying
>>>>>>>> certain interface name can't be modified to chase dynamic names.
>>>>>>>>
>>>>>>>> Specifically, we have pre-canned image that uses ethtool to fine tune VF
>>>>>>>> offload settings post boot for specific workload. Those images won't work
>>>>>>>> well if the name is constantly changing just after couple rounds of live
>>>>>>>> migration.
>>>>>>> It should be possible to specify the ethtool configuration on the
>>>>>>> master and have it automatically propagated to the slave.
>>>>>>>
>>>>>>> BTW this is something we should look at IMHO.
>>>>>> I was elaborating a few examples that the expectation and assumption that
>>>>>> user/admin scripts only deal with master failover device is incorrect. It
>>>>>> had never been taken good care of, although I did try to emphasize it from
>>>>>> the very beginning.
>>>>>>
>>>>>> Basically what you said about propagating the ethtool configuration down to
>>>>>> the slave is the key pursuance of 1-netdev model. However, what I am seeking
>>>>>> now is any alternative that can also fix the specific udev rename problem,
>>>>>> before concluding that 1-netdev is the only solution. Generally a 1-netdev
>>>>>> scheme would take time to implement, while I'm trying to find a way out to
>>>>>> fix this particular naming problem under 3-netdev.
>>>>>>
>>>>>>>>> Moreover, you were suggesting hiding the lower slave devices anyway. There was some discussion
>>>>>>>>> about moving them to a hidden network namespace so that they are not visible from the default namespace.
>>>>>>>>> I looked into this sometime back, but did not find the right kernel api to create a network namespace within
>>>>>>>>> kernel. If so, we could use this mechanism to simulate a 1-netdev model.
>>>>>>>> Yes, that's one possible implementation (IMHO the key is to make 1-netdev
>>>>>>>> model as much transparent to a real NIC as possible, while a hidden netns is
>>>>>>>> just the vehicle). However, I recall there was resistance around this
>>>>>>>> discussion that even the concept of hiding itself is a taboo for Linux
>>>>>>>> netdev. I would like to summon potential alternatives before concluding
>>>>>>>> 1-netdev is the only solution too soon.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> -Siwei
>>>>>>> Your scripts would not work at all then, right?
>>>>>> At this point we don't claim images with such usage as SR-IOV live
>>>>>> migrate-able. We would flag it as live migrate-able until this ethtool
>>>>>> config issue is fully addressed and a transparent live migration solution
>>>>>> emerges in upstream eventually.
>>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>> -Siwei
>>>>>>>>>> -Siwei
>>>>>>>>>>
>>>>>>>>>>
>>>>>>> ---------------------------------------------------------------------
>>>>>>> To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
>>>>>>> For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org
>>>>>>>
>>>>>>      net_failover(kernel)                            |    network.service (user)    |          systemd-udevd (user)
>>>>>> --------------------------------------------------+------------------------------+--------------------------------------------
>>>>>> (standby virtio-net and net_failover              |                              |
>>>>>> devices created and initialized,                  |                              |
>>>>>> i.e. virtnet_probe()->                            |                              |
>>>>>>           net_failover_create()                      |                              |
>>>>>> was done.)                                        |                              |
>>>>>>                                                      |                              |
>>>>>>                                                      |  runs `ifup ens3' ->         |
>>>>>>                                                      |    ip link set dev ens3 up   |
>>>>>> net_failover_open()                               |                              |
>>>>>>      dev_open(virtnet_dev)                           |                              |
>>>>>>        virtnet_open(virtnet_dev)                     |                              |
>>>>>>      netif_carrier_on(failover_dev)                  |                              |
>>>>>>      ...                                             |                              |
>>>>>>                                                      |                              |
>>>>>> (VF hot plugged in)                               |                              |
>>>>>> ixgbevf_probe()                                   |                              |
>>>>>>     register_netdev(ixgbevf_netdev)                  |                              |
>>>>>>      netdev_register_kobject(ixgbevf_netdev)         |                              |
>>>>>>       kobject_add(ixgbevf_dev)                       |                              |
>>>>>>        device_add(ixgbevf_dev)                       |                              |
>>>>>>         kobject_uevent(&ixgbevf_dev->kobj, KOBJ_ADD) |                              |
>>>>>>          netlink_broadcast()                         |                              |
>>>>>>      ...                                             |                              |
>>>>>>      call_netdevice_notifiers(NETDEV_REGISTER)       |                              |
>>>>>>       failover_event(..., NETDEV_REGISTER, ...)      |                              |
>>>>>>        failover_slave_register(ixgbevf_netdev)       |                              |
>>>>>>         net_failover_slave_register(ixgbevf_netdev)  |                              |
>>>>>>          dev_open(ixgbevf_netdev)                    |                              |
>>>>>>                                                      |                              |
>>>>>>                                                      |                              |
>>>>>>                                                      |                              |   received ADD uevent from netlink fd
>>>>>>                                                      |                              |   ...
>>>>>>                                                      |                              |   udev-builtin-net_id.c:dev_pci_slot()
>>>>>>                                                      |                              |   (decided to renamed 'eth0' )
>>>>>>                                                      |                              |     ip link set dev eth0 name ens4
>>>>>> (dev_change_name() returns -EBUSY as              |                              |
>>>>>> ixgbevf_netdev->flags has IFF_UP)                 |                              |
>>>>>>                                                      |                              |
>>>>>>
>>>>> Given renaming slaves does not work anyway:
>>>> I was actually thinking what if we relieve the rename restriction just for
>>>> the failover slave? What the impact would be? I think users don't care about
>>>> slave being renamed when it's in use, especially the initial rename.
>>>> Thoughts?
>>>>
>>>>>     would it work if we just
>>>>> hard-coded slave names instead?
>>>>>
>>>>> E.g.
>>>>> 1. fail slave renames
>>>>> 2. rename of failover to XX automatically renames standby to XXnsby
>>>>>       and primary to XXnpry
>>>> That wouldn't help. The time when the failover master gets renamed, the VF
>>>> may not be present.
>>> In this scheme if VF is not there it will be renamed immediately after registration.
>> Who will be responsible to rename the slave, the kernel?
> That's the idea.
>
>> Note the master's
>> name may or may not come from the userspace. If it comes from the userspace,
>> should the userspace daemon change their expectation not to name/rename
>> _any_ slaves (today there's no distinction)?
> Yes the idea would be to fail renaming slaves.
No I was asking about the userspace expectation: whether it should track 
and detect the lifecycle events of failover slaves and decide what to 
do. How does it get back to the user specified name if VF is not 
enslaved (say someone unloads the virtio-net module)?

As this scheme adds much complexity to the kernel naming convention 
(currently it's just ethX names) that no userspace can understand. Will 
the change break userspace further?

-Siwei

>
>> How do users know which name to
>> trust, depending on which wins the race more often? Say if kernel wants a
>> ens3npry name while userspace wants it named as ens4.
>>
>> -Siwei
> With this approach kernel will deny attempts by userspace to rename
> slaves.  Slaves will always be named XXXnsby and XXnpry. Master renames
> will rename both slaves.
>
> It seems pretty solid to me, the only issue is that in theory userspace
> can use a name like XXXnsby for something else. But this seems unlikely.
>
>
>>>> I don't like the idea to delay exposing failover master
>>>> until VF is hot plugged in (probably subject to various failures) later.
>>>>
>>>> Thanks,
>>>> -Siwei
>>> I agree, this was not what I meant.
>>>
Michael S. Tsirkin Feb. 28, 2019, 12:38 a.m. UTC | #29
On Wed, Feb 27, 2019 at 04:03:42PM -0800, Stephen Hemminger wrote:
> > With this approach kernel will deny attempts by userspace to rename
> > slaves.  Slaves will always be named XXXnsby and XXnpry. Master renames
> > will rename both slaves.
> > 
> > It seems pretty solid to me, the only issue is that in theory userspace
> > can use a name like XXXnsby for something else. But this seems unlikely.
> 
> Similar schemes (with kernel providing naming) were also previously rejected
> upstream.

Links?
I'm inclined to try and see what happens.

> It has been a consistent theme that the kernel should not be in
> the renaming business.

In this case it's not in renaming business per se. The only reason
we even have the original name is due to the ways internal APIs
work. You can look at it as simply having slaves names being
part of master.

> It will certainly break userspace.

That's a strong claim. What is it based on?  It so happens that
userspace renaming slaves is already broken on virtio. So we can fix it
any way we like :)

And yes it won't help netvsc because netvsc wants compatibility with old
scripts but then netvsc uses a 2 device model anyway.
Michael S. Tsirkin Feb. 28, 2019, 12:41 a.m. UTC | #30
On Wed, Feb 27, 2019 at 04:38:00PM -0800, si-wei liu wrote:
> 
> 
> On 2/27/2019 3:50 PM, Michael S. Tsirkin wrote:
> > On Wed, Feb 27, 2019 at 03:34:56PM -0800, si-wei liu wrote:
> > > 
> > > On 2/27/2019 2:38 PM, Michael S. Tsirkin wrote:
> > > > On Tue, Feb 26, 2019 at 04:17:21PM -0800, si-wei liu wrote:
> > > > > On 2/25/2019 6:08 PM, Michael S. Tsirkin wrote:
> > > > > > On Mon, Feb 25, 2019 at 04:58:07PM -0800, si-wei liu wrote:
> > > > > > > On 2/22/2019 7:14 AM, Michael S. Tsirkin wrote:
> > > > > > > > On Thu, Feb 21, 2019 at 11:55:11PM -0800, si-wei liu wrote:
> > > > > > > > > On 2/21/2019 11:00 PM, Samudrala, Sridhar wrote:
> > > > > > > > > > On 2/21/2019 7:33 PM, si-wei liu wrote:
> > > > > > > > > > > On 2/21/2019 5:39 PM, Michael S. Tsirkin wrote:
> > > > > > > > > > > > On Thu, Feb 21, 2019 at 05:14:44PM -0800, Siwei Liu wrote:
> > > > > > > > > > > > > Sorry for replying to this ancient thread. There was some remaining
> > > > > > > > > > > > > issue that I don't think the initial net_failover patch got addressed
> > > > > > > > > > > > > cleanly, see:
> > > > > > > > > > > > > 
> > > > > > > > > > > > > https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1815268
> > > > > > > > > > > > > 
> > > > > > > > > > > > > The renaming of 'eth0' to 'ens4' fails because the udev userspace was
> > > > > > > > > > > > > not specifically writtten for such kernel automatic enslavement.
> > > > > > > > > > > > > Specifically, if it is a bond or team, the slave would typically get
> > > > > > > > > > > > > renamed *before* virtual device gets created, that's what udev can
> > > > > > > > > > > > > control (without getting netdev opened early by the other part of
> > > > > > > > > > > > > kernel) and other userspace components for e.g. initramfs,
> > > > > > > > > > > > > init-scripts can coordinate well in between. The in-kernel
> > > > > > > > > > > > > auto-enslavement of net_failover breaks this userspace convention,
> > > > > > > > > > > > > which don't provides a solution if user care about consistent naming
> > > > > > > > > > > > > on the slave netdevs specifically.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Previously this issue had been specifically called out when IFF_HIDDEN
> > > > > > > > > > > > > and the 1-netdev was proposed, but no one gives out a solution to this
> > > > > > > > > > > > > problem ever since. Please share your mind how to proceed and solve
> > > > > > > > > > > > > this userspace issue if netdev does not welcome a 1-netdev model.
> > > > > > > > > > > > Above says:
> > > > > > > > > > > > 
> > > > > > > > > > > >         there's no motivation in the systemd/udevd community at
> > > > > > > > > > > >         this point to refactor the rename logic and make it work well with
> > > > > > > > > > > >         3-netdev.
> > > > > > > > > > > > 
> > > > > > > > > > > > What would the fix be? Skip slave devices?
> > > > > > > > > > > > 
> > > > > > > > > > > There's nothing user can get if just skipping slave devices - the
> > > > > > > > > > > name is still unchanged and unpredictable e.g. eth0, or eth1 the
> > > > > > > > > > > next reboot, while the rest may conform to the naming scheme (ens3
> > > > > > > > > > > and such). There's no way one can fix this in userspace alone - when
> > > > > > > > > > > the failover is created the enslaved netdev was opened by the kernel
> > > > > > > > > > > earlier than the userspace is made aware of, and there's no
> > > > > > > > > > > negotiation protocol for kernel to know when userspace has done
> > > > > > > > > > > initial renaming of the interface. I would expect netdev list should
> > > > > > > > > > > at least provide the direction in general for how this can be
> > > > > > > > > > > solved...
> > > > > > > > I was just wondering what did you mean when you said
> > > > > > > > "refactor the rename logic and make it work well with 3-netdev" -
> > > > > > > > was there a proposal udev rejected?
> > > > > > > No. I never believed this particular issue can be fixed in userspace alone.
> > > > > > > Previously someone had said it could be, but I never see any work or
> > > > > > > relevant discussion ever happened in various userspace communities (for e.g.
> > > > > > > dracut, initramfs-tools, systemd, udev, and NetworkManager). IMHO the root
> > > > > > > of the issue derives from the kernel, it makes more sense to start from
> > > > > > > netdev, work out and decide on a solution: see what can be done in the
> > > > > > > kernel in order to fix it, then after that engage userspace community for
> > > > > > > the feasibility...
> > > > > > > 
> > > > > > > > Anyway, can we write a time diagram for what happens in which order that
> > > > > > > > leads to failure?  That would help look for triggers that we can tie
> > > > > > > > into, or add new ones.
> > > > > > > > 
> > > > > > > See attached diagram.
> > > > > > > 
> > > > > > > > 
> > > > > > > > > > Is there an issue if slave device names are not predictable? The user/admin scripts are expected
> > > > > > > > > > to only work with the master failover device.
> > > > > > > > > Where does this expectation come from?
> > > > > > > > > 
> > > > > > > > > Admin users may have ethtool or tc configurations that need to deal with
> > > > > > > > > predictable interface name. Third-party app which was built upon specifying
> > > > > > > > > certain interface name can't be modified to chase dynamic names.
> > > > > > > > > 
> > > > > > > > > Specifically, we have pre-canned image that uses ethtool to fine tune VF
> > > > > > > > > offload settings post boot for specific workload. Those images won't work
> > > > > > > > > well if the name is constantly changing just after couple rounds of live
> > > > > > > > > migration.
> > > > > > > > It should be possible to specify the ethtool configuration on the
> > > > > > > > master and have it automatically propagated to the slave.
> > > > > > > > 
> > > > > > > > BTW this is something we should look at IMHO.
> > > > > > > I was elaborating a few examples that the expectation and assumption that
> > > > > > > user/admin scripts only deal with master failover device is incorrect. It
> > > > > > > had never been taken good care of, although I did try to emphasize it from
> > > > > > > the very beginning.
> > > > > > > 
> > > > > > > Basically what you said about propagating the ethtool configuration down to
> > > > > > > the slave is the key pursuance of 1-netdev model. However, what I am seeking
> > > > > > > now is any alternative that can also fix the specific udev rename problem,
> > > > > > > before concluding that 1-netdev is the only solution. Generally a 1-netdev
> > > > > > > scheme would take time to implement, while I'm trying to find a way out to
> > > > > > > fix this particular naming problem under 3-netdev.
> > > > > > > 
> > > > > > > > > > Moreover, you were suggesting hiding the lower slave devices anyway. There was some discussion
> > > > > > > > > > about moving them to a hidden network namespace so that they are not visible from the default namespace.
> > > > > > > > > > I looked into this sometime back, but did not find the right kernel api to create a network namespace within
> > > > > > > > > > kernel. If so, we could use this mechanism to simulate a 1-netdev model.
> > > > > > > > > Yes, that's one possible implementation (IMHO the key is to make 1-netdev
> > > > > > > > > model as much transparent to a real NIC as possible, while a hidden netns is
> > > > > > > > > just the vehicle). However, I recall there was resistance around this
> > > > > > > > > discussion that even the concept of hiding itself is a taboo for Linux
> > > > > > > > > netdev. I would like to summon potential alternatives before concluding
> > > > > > > > > 1-netdev is the only solution too soon.
> > > > > > > > > 
> > > > > > > > > Thanks,
> > > > > > > > > -Siwei
> > > > > > > > Your scripts would not work at all then, right?
> > > > > > > At this point we don't claim images with such usage as SR-IOV live
> > > > > > > migrate-able. We would flag it as live migrate-able until this ethtool
> > > > > > > config issue is fully addressed and a transparent live migration solution
> > > > > > > emerges in upstream eventually.
> > > > > > > 
> > > > > > > 
> > > > > > > Thanks,
> > > > > > > -Siwei
> > > > > > > > > > > -Siwei
> > > > > > > > > > > 
> > > > > > > > > > > 
> > > > > > > > ---------------------------------------------------------------------
> > > > > > > > To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
> > > > > > > > For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org
> > > > > > > > 
> > > > > > >      net_failover(kernel)                            |    network.service (user)    |          systemd-udevd (user)
> > > > > > > --------------------------------------------------+------------------------------+--------------------------------------------
> > > > > > > (standby virtio-net and net_failover              |                              |
> > > > > > > devices created and initialized,                  |                              |
> > > > > > > i.e. virtnet_probe()->                            |                              |
> > > > > > >           net_failover_create()                      |                              |
> > > > > > > was done.)                                        |                              |
> > > > > > >                                                      |                              |
> > > > > > >                                                      |  runs `ifup ens3' ->         |
> > > > > > >                                                      |    ip link set dev ens3 up   |
> > > > > > > net_failover_open()                               |                              |
> > > > > > >      dev_open(virtnet_dev)                           |                              |
> > > > > > >        virtnet_open(virtnet_dev)                     |                              |
> > > > > > >      netif_carrier_on(failover_dev)                  |                              |
> > > > > > >      ...                                             |                              |
> > > > > > >                                                      |                              |
> > > > > > > (VF hot plugged in)                               |                              |
> > > > > > > ixgbevf_probe()                                   |                              |
> > > > > > >     register_netdev(ixgbevf_netdev)                  |                              |
> > > > > > >      netdev_register_kobject(ixgbevf_netdev)         |                              |
> > > > > > >       kobject_add(ixgbevf_dev)                       |                              |
> > > > > > >        device_add(ixgbevf_dev)                       |                              |
> > > > > > >         kobject_uevent(&ixgbevf_dev->kobj, KOBJ_ADD) |                              |
> > > > > > >          netlink_broadcast()                         |                              |
> > > > > > >      ...                                             |                              |
> > > > > > >      call_netdevice_notifiers(NETDEV_REGISTER)       |                              |
> > > > > > >       failover_event(..., NETDEV_REGISTER, ...)      |                              |
> > > > > > >        failover_slave_register(ixgbevf_netdev)       |                              |
> > > > > > >         net_failover_slave_register(ixgbevf_netdev)  |                              |
> > > > > > >          dev_open(ixgbevf_netdev)                    |                              |
> > > > > > >                                                      |                              |
> > > > > > >                                                      |                              |
> > > > > > >                                                      |                              |   received ADD uevent from netlink fd
> > > > > > >                                                      |                              |   ...
> > > > > > >                                                      |                              |   udev-builtin-net_id.c:dev_pci_slot()
> > > > > > >                                                      |                              |   (decided to renamed 'eth0' )
> > > > > > >                                                      |                              |     ip link set dev eth0 name ens4
> > > > > > > (dev_change_name() returns -EBUSY as              |                              |
> > > > > > > ixgbevf_netdev->flags has IFF_UP)                 |                              |
> > > > > > >                                                      |                              |
> > > > > > > 
> > > > > > Given renaming slaves does not work anyway:
> > > > > I was actually thinking what if we relieve the rename restriction just for
> > > > > the failover slave? What the impact would be? I think users don't care about
> > > > > slave being renamed when it's in use, especially the initial rename.
> > > > > Thoughts?
> > > > > 
> > > > > >     would it work if we just
> > > > > > hard-coded slave names instead?
> > > > > > 
> > > > > > E.g.
> > > > > > 1. fail slave renames
> > > > > > 2. rename of failover to XX automatically renames standby to XXnsby
> > > > > >       and primary to XXnpry
> > > > > That wouldn't help. The time when the failover master gets renamed, the VF
> > > > > may not be present.
> > > > In this scheme if VF is not there it will be renamed immediately after registration.
> > > Who will be responsible to rename the slave, the kernel?
> > That's the idea.
> > 
> > > Note the master's
> > > name may or may not come from the userspace. If it comes from the userspace,
> > > should the userspace daemon change their expectation not to name/rename
> > > _any_ slaves (today there's no distinction)?
> > Yes the idea would be to fail renaming slaves.
> No I was asking about the userspace expectation: whether it should track and
> detect the lifecycle events of failover slaves and decide what to do. How
> does it get back to the user specified name if VF is not enslaved (say
> someone unloads the virtio-net module)?

When virtio net is removed VF will shortly be removed too.

> As this scheme adds much complexity to the kernel naming convention
> (currently it's just ethX names) that no userspace can understand.

Anything that pokes at slaves needs to be specially designed anyway.
Naming seems like a minor issue.

> Will the
> change break userspace further?
> 
> -Siwei

Didn't you show userspace is already broken. You can't "further
break it", rename already fails.

> > 
> > > How do users know which name to
> > > trust, depending on which wins the race more often? Say if kernel wants a
> > > ens3npry name while userspace wants it named as ens4.
> > > 
> > > -Siwei
> > With this approach kernel will deny attempts by userspace to rename
> > slaves.  Slaves will always be named XXXnsby and XXnpry. Master renames
> > will rename both slaves.
> > 
> > It seems pretty solid to me, the only issue is that in theory userspace
> > can use a name like XXXnsby for something else. But this seems unlikely.
> > 
> > 
> > > > > I don't like the idea to delay exposing failover master
> > > > > until VF is hot plugged in (probably subject to various failures) later.
> > > > > 
> > > > > Thanks,
> > > > > -Siwei
> > > > I agree, this was not what I meant.
> > > >
Jakub Kicinski Feb. 28, 2019, 12:52 a.m. UTC | #31
On Wed, 27 Feb 2019 19:41:32 -0500, Michael S. Tsirkin wrote:
> > As this scheme adds much complexity to the kernel naming convention
> > (currently it's just ethX names) that no userspace can understand.  
> 
> Anything that pokes at slaves needs to be specially designed anyway.
> Naming seems like a minor issue.

Can the users who care about the naming put net_failover into
"user space will do the bond enslavement" mode, and do the bond
creation/management themselves from user space (in systemd/ 
Network Manager) based on the failover flag?
Michael S. Tsirkin Feb. 28, 2019, 1:26 a.m. UTC | #32
On Wed, Feb 27, 2019 at 04:52:05PM -0800, Jakub Kicinski wrote:
> On Wed, 27 Feb 2019 19:41:32 -0500, Michael S. Tsirkin wrote:
> > > As this scheme adds much complexity to the kernel naming convention
> > > (currently it's just ethX names) that no userspace can understand.  
> > 
> > Anything that pokes at slaves needs to be specially designed anyway.
> > Naming seems like a minor issue.
> 
> Can the users who care about the naming put net_failover into
> "user space will do the bond enslavement" mode, and do the bond
> creation/management themselves from user space (in systemd/ 
> Network Manager) based on the failover flag?

Putting issues of compatibility aside (userspace tends to be confused if
you give it two devices with same MAC), how would you have it work in
practice? Timer based hacks like netvsc where if userspace didn't
respond within X seconds we assume it won't and do everything ourselves?
Jakub Kicinski Feb. 28, 2019, 1:52 a.m. UTC | #33
On Wed, 27 Feb 2019 20:26:02 -0500, Michael S. Tsirkin wrote:
> On Wed, Feb 27, 2019 at 04:52:05PM -0800, Jakub Kicinski wrote:
> > On Wed, 27 Feb 2019 19:41:32 -0500, Michael S. Tsirkin wrote:  
> > > > As this scheme adds much complexity to the kernel naming convention
> > > > (currently it's just ethX names) that no userspace can understand.    
> > > 
> > > Anything that pokes at slaves needs to be specially designed anyway.
> > > Naming seems like a minor issue.  
> > 
> > Can the users who care about the naming put net_failover into
> > "user space will do the bond enslavement" mode, and do the bond
> > creation/management themselves from user space (in systemd/ 
> > Network Manager) based on the failover flag?  
> 
> Putting issues of compatibility aside (userspace tends to be confused if
> you give it two devices with same MAC), how would you have it work in
> practice? Timer based hacks like netvsc where if userspace didn't
> respond within X seconds we assume it won't and do everything ourselves?

Well, what I'm saying is basically if user space knows how to deal with
the auto-bonding, we can put aside net_failover for the most part.  It
can either be blacklisted or it can have some knob which will
effectively disable the auto-enslavement.

Auto-bonding capable user space can do the renames, spawn the bond,
etc. all by itself.  I'm basically going back to my initial proposal
here :)  There is a RedHat bugzilla for the NetworkManager team to do
this, but we merged net_failover before those folks got around to
implementing it.

IOW if NM/systemd is capable of doing the auto-bonding itself it can
disable the kernel mechanism and take care of it all.  If kernel is
booted with an old user space which doesn't have capable NM/systemd -
net_failover will kick in and do its best.
Michael S. Tsirkin Feb. 28, 2019, 4:47 a.m. UTC | #34
On Wed, Feb 27, 2019 at 05:52:18PM -0800, Jakub Kicinski wrote:
> On Wed, 27 Feb 2019 20:26:02 -0500, Michael S. Tsirkin wrote:
> > On Wed, Feb 27, 2019 at 04:52:05PM -0800, Jakub Kicinski wrote:
> > > On Wed, 27 Feb 2019 19:41:32 -0500, Michael S. Tsirkin wrote:  
> > > > > As this scheme adds much complexity to the kernel naming convention
> > > > > (currently it's just ethX names) that no userspace can understand.    
> > > > 
> > > > Anything that pokes at slaves needs to be specially designed anyway.
> > > > Naming seems like a minor issue.  
> > > 
> > > Can the users who care about the naming put net_failover into
> > > "user space will do the bond enslavement" mode, and do the bond
> > > creation/management themselves from user space (in systemd/ 
> > > Network Manager) based on the failover flag?  
> > 
> > Putting issues of compatibility aside (userspace tends to be confused if
> > you give it two devices with same MAC), how would you have it work in
> > practice? Timer based hacks like netvsc where if userspace didn't
> > respond within X seconds we assume it won't and do everything ourselves?
> 
> Well, what I'm saying is basically if user space knows how to deal with
> the auto-bonding, we can put aside net_failover for the most part.  It
> can either be blacklisted or it can have some knob which will
> effectively disable the auto-enslavement.

OK I guess we could add a module parameter to skip this.
Is this what you mean?

> Auto-bonding capable user space can do the renames, spawn the bond,
> etc. all by itself.  I'm basically going back to my initial proposal
> here :)  There is a RedHat bugzilla for the NetworkManager team to do
> this, but we merged net_failover before those folks got around to
> implementing it.

In particular because there's no policy involved whatsoever
here so it's just mechanism being pushed up to userspace.

> IOW if NM/systemd is capable of doing the auto-bonding itself it can
> disable the kernel mechanism and take care of it all.  If kernel is
> booted with an old user space which doesn't have capable NM/systemd -
> net_failover will kick in and do its best.

Sure - it's just 2 lines of code, see below.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

But I don't intend to bother until there's actual interest from
userspace developers to bother. In particular it is not just NM/systemd
even on Fedora - e.g. you will need to teach dracut to somehow detect
and handle this - right now it gets confused if there are two devices
with same MAC addresses.

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 955b3e76eb8d..dd2b2c370003 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -43,6 +43,7 @@ static bool csum = true, gso = true, napi_tx;
 module_param(csum, bool, 0444);
 module_param(gso, bool, 0444);
 module_param(napi_tx, bool, 0644);
+module_param(disable_failover, bool, 0644);
 
 /* FIXME: MTU in config. */
 #define GOOD_PACKET_LEN (ETH_HLEN + VLAN_HLEN + ETH_DATA_LEN)
@@ -3163,6 +3164,7 @@ static int virtnet_probe(struct virtio_device *vdev)
 	virtnet_init_settings(dev);
 
-	if (virtio_has_feature(vdev, VIRTIO_NET_F_STANDBY)) {
+	if (virtio_has_feature(vdev, VIRTIO_NET_F_STANDBY) &&
+		!disable_failover) {
 		vi->failover = net_failover_create(vi->dev);
 		if (IS_ERR(vi->failover)) {
 			err = PTR_ERR(vi->failover);
Si-Wei Liu Feb. 28, 2019, 9:32 a.m. UTC | #35
On 2/27/2019 4:41 PM, Michael S. Tsirkin wrote:
> On Wed, Feb 27, 2019 at 04:38:00PM -0800, si-wei liu wrote:
>>
>> On 2/27/2019 3:50 PM, Michael S. Tsirkin wrote:
>>> On Wed, Feb 27, 2019 at 03:34:56PM -0800, si-wei liu wrote:
>>>> On 2/27/2019 2:38 PM, Michael S. Tsirkin wrote:
>>>>> On Tue, Feb 26, 2019 at 04:17:21PM -0800, si-wei liu wrote:
>>>>>> On 2/25/2019 6:08 PM, Michael S. Tsirkin wrote:
>>>>>>> On Mon, Feb 25, 2019 at 04:58:07PM -0800, si-wei liu wrote:
>>>>>>>> On 2/22/2019 7:14 AM, Michael S. Tsirkin wrote:
>>>>>>>>> On Thu, Feb 21, 2019 at 11:55:11PM -0800, si-wei liu wrote:
>>>>>>>>>> On 2/21/2019 11:00 PM, Samudrala, Sridhar wrote:
>>>>>>>>>>> On 2/21/2019 7:33 PM, si-wei liu wrote:
>>>>>>>>>>>> On 2/21/2019 5:39 PM, Michael S. Tsirkin wrote:
>>>>>>>>>>>>> On Thu, Feb 21, 2019 at 05:14:44PM -0800, Siwei Liu wrote:
>>>>>>>>>>>>>> Sorry for replying to this ancient thread. There was some remaining
>>>>>>>>>>>>>> issue that I don't think the initial net_failover patch got addressed
>>>>>>>>>>>>>> cleanly, see:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1815268
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The renaming of 'eth0' to 'ens4' fails because the udev userspace was
>>>>>>>>>>>>>> not specifically writtten for such kernel automatic enslavement.
>>>>>>>>>>>>>> Specifically, if it is a bond or team, the slave would typically get
>>>>>>>>>>>>>> renamed *before* virtual device gets created, that's what udev can
>>>>>>>>>>>>>> control (without getting netdev opened early by the other part of
>>>>>>>>>>>>>> kernel) and other userspace components for e.g. initramfs,
>>>>>>>>>>>>>> init-scripts can coordinate well in between. The in-kernel
>>>>>>>>>>>>>> auto-enslavement of net_failover breaks this userspace convention,
>>>>>>>>>>>>>> which don't provides a solution if user care about consistent naming
>>>>>>>>>>>>>> on the slave netdevs specifically.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Previously this issue had been specifically called out when IFF_HIDDEN
>>>>>>>>>>>>>> and the 1-netdev was proposed, but no one gives out a solution to this
>>>>>>>>>>>>>> problem ever since. Please share your mind how to proceed and solve
>>>>>>>>>>>>>> this userspace issue if netdev does not welcome a 1-netdev model.
>>>>>>>>>>>>> Above says:
>>>>>>>>>>>>>
>>>>>>>>>>>>>          there's no motivation in the systemd/udevd community at
>>>>>>>>>>>>>          this point to refactor the rename logic and make it work well with
>>>>>>>>>>>>>          3-netdev.
>>>>>>>>>>>>>
>>>>>>>>>>>>> What would the fix be? Skip slave devices?
>>>>>>>>>>>>>
>>>>>>>>>>>> There's nothing user can get if just skipping slave devices - the
>>>>>>>>>>>> name is still unchanged and unpredictable e.g. eth0, or eth1 the
>>>>>>>>>>>> next reboot, while the rest may conform to the naming scheme (ens3
>>>>>>>>>>>> and such). There's no way one can fix this in userspace alone - when
>>>>>>>>>>>> the failover is created the enslaved netdev was opened by the kernel
>>>>>>>>>>>> earlier than the userspace is made aware of, and there's no
>>>>>>>>>>>> negotiation protocol for kernel to know when userspace has done
>>>>>>>>>>>> initial renaming of the interface. I would expect netdev list should
>>>>>>>>>>>> at least provide the direction in general for how this can be
>>>>>>>>>>>> solved...
>>>>>>>>> I was just wondering what did you mean when you said
>>>>>>>>> "refactor the rename logic and make it work well with 3-netdev" -
>>>>>>>>> was there a proposal udev rejected?
>>>>>>>> No. I never believed this particular issue can be fixed in userspace alone.
>>>>>>>> Previously someone had said it could be, but I never see any work or
>>>>>>>> relevant discussion ever happened in various userspace communities (for e.g.
>>>>>>>> dracut, initramfs-tools, systemd, udev, and NetworkManager). IMHO the root
>>>>>>>> of the issue derives from the kernel, it makes more sense to start from
>>>>>>>> netdev, work out and decide on a solution: see what can be done in the
>>>>>>>> kernel in order to fix it, then after that engage userspace community for
>>>>>>>> the feasibility...
>>>>>>>>
>>>>>>>>> Anyway, can we write a time diagram for what happens in which order that
>>>>>>>>> leads to failure?  That would help look for triggers that we can tie
>>>>>>>>> into, or add new ones.
>>>>>>>>>
>>>>>>>> See attached diagram.
>>>>>>>>
>>>>>>>>>>> Is there an issue if slave device names are not predictable? The user/admin scripts are expected
>>>>>>>>>>> to only work with the master failover device.
>>>>>>>>>> Where does this expectation come from?
>>>>>>>>>>
>>>>>>>>>> Admin users may have ethtool or tc configurations that need to deal with
>>>>>>>>>> predictable interface name. Third-party app which was built upon specifying
>>>>>>>>>> certain interface name can't be modified to chase dynamic names.
>>>>>>>>>>
>>>>>>>>>> Specifically, we have pre-canned image that uses ethtool to fine tune VF
>>>>>>>>>> offload settings post boot for specific workload. Those images won't work
>>>>>>>>>> well if the name is constantly changing just after couple rounds of live
>>>>>>>>>> migration.
>>>>>>>>> It should be possible to specify the ethtool configuration on the
>>>>>>>>> master and have it automatically propagated to the slave.
>>>>>>>>>
>>>>>>>>> BTW this is something we should look at IMHO.
>>>>>>>> I was elaborating a few examples that the expectation and assumption that
>>>>>>>> user/admin scripts only deal with master failover device is incorrect. It
>>>>>>>> had never been taken good care of, although I did try to emphasize it from
>>>>>>>> the very beginning.
>>>>>>>>
>>>>>>>> Basically what you said about propagating the ethtool configuration down to
>>>>>>>> the slave is the key pursuance of 1-netdev model. However, what I am seeking
>>>>>>>> now is any alternative that can also fix the specific udev rename problem,
>>>>>>>> before concluding that 1-netdev is the only solution. Generally a 1-netdev
>>>>>>>> scheme would take time to implement, while I'm trying to find a way out to
>>>>>>>> fix this particular naming problem under 3-netdev.
>>>>>>>>
>>>>>>>>>>> Moreover, you were suggesting hiding the lower slave devices anyway. There was some discussion
>>>>>>>>>>> about moving them to a hidden network namespace so that they are not visible from the default namespace.
>>>>>>>>>>> I looked into this sometime back, but did not find the right kernel api to create a network namespace within
>>>>>>>>>>> kernel. If so, we could use this mechanism to simulate a 1-netdev model.
>>>>>>>>>> Yes, that's one possible implementation (IMHO the key is to make 1-netdev
>>>>>>>>>> model as much transparent to a real NIC as possible, while a hidden netns is
>>>>>>>>>> just the vehicle). However, I recall there was resistance around this
>>>>>>>>>> discussion that even the concept of hiding itself is a taboo for Linux
>>>>>>>>>> netdev. I would like to summon potential alternatives before concluding
>>>>>>>>>> 1-netdev is the only solution too soon.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> -Siwei
>>>>>>>>> Your scripts would not work at all then, right?
>>>>>>>> At this point we don't claim images with such usage as SR-IOV live
>>>>>>>> migrate-able. We would flag it as live migrate-able until this ethtool
>>>>>>>> config issue is fully addressed and a transparent live migration solution
>>>>>>>> emerges in upstream eventually.
>>>>>>>>
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> -Siwei
>>>>>>>>>>>> -Siwei
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>> To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
>>>>>>>>> For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org
>>>>>>>>>
>>>>>>>>       net_failover(kernel)                            |    network.service (user)    |          systemd-udevd (user)
>>>>>>>> --------------------------------------------------+------------------------------+--------------------------------------------
>>>>>>>> (standby virtio-net and net_failover              |                              |
>>>>>>>> devices created and initialized,                  |                              |
>>>>>>>> i.e. virtnet_probe()->                            |                              |
>>>>>>>>            net_failover_create()                      |                              |
>>>>>>>> was done.)                                        |                              |
>>>>>>>>                                                       |                              |
>>>>>>>>                                                       |  runs `ifup ens3' ->         |
>>>>>>>>                                                       |    ip link set dev ens3 up   |
>>>>>>>> net_failover_open()                               |                              |
>>>>>>>>       dev_open(virtnet_dev)                           |                              |
>>>>>>>>         virtnet_open(virtnet_dev)                     |                              |
>>>>>>>>       netif_carrier_on(failover_dev)                  |                              |
>>>>>>>>       ...                                             |                              |
>>>>>>>>                                                       |                              |
>>>>>>>> (VF hot plugged in)                               |                              |
>>>>>>>> ixgbevf_probe()                                   |                              |
>>>>>>>>      register_netdev(ixgbevf_netdev)                  |                              |
>>>>>>>>       netdev_register_kobject(ixgbevf_netdev)         |                              |
>>>>>>>>        kobject_add(ixgbevf_dev)                       |                              |
>>>>>>>>         device_add(ixgbevf_dev)                       |                              |
>>>>>>>>          kobject_uevent(&ixgbevf_dev->kobj, KOBJ_ADD) |                              |
>>>>>>>>           netlink_broadcast()                         |                              |
>>>>>>>>       ...                                             |                              |
>>>>>>>>       call_netdevice_notifiers(NETDEV_REGISTER)       |                              |
>>>>>>>>        failover_event(..., NETDEV_REGISTER, ...)      |                              |
>>>>>>>>         failover_slave_register(ixgbevf_netdev)       |                              |
>>>>>>>>          net_failover_slave_register(ixgbevf_netdev)  |                              |
>>>>>>>>           dev_open(ixgbevf_netdev)                    |                              |
>>>>>>>>                                                       |                              |
>>>>>>>>                                                       |                              |
>>>>>>>>                                                       |                              |   received ADD uevent from netlink fd
>>>>>>>>                                                       |                              |   ...
>>>>>>>>                                                       |                              |   udev-builtin-net_id.c:dev_pci_slot()
>>>>>>>>                                                       |                              |   (decided to renamed 'eth0' )
>>>>>>>>                                                       |                              |     ip link set dev eth0 name ens4
>>>>>>>> (dev_change_name() returns -EBUSY as              |                              |
>>>>>>>> ixgbevf_netdev->flags has IFF_UP)                 |                              |
>>>>>>>>                                                       |                              |
>>>>>>>>
>>>>>>> Given renaming slaves does not work anyway:
>>>>>> I was actually thinking what if we relieve the rename restriction just for
>>>>>> the failover slave? What the impact would be? I think users don't care about
>>>>>> slave being renamed when it's in use, especially the initial rename.
>>>>>> Thoughts?
>>>>>>
>>>>>>>      would it work if we just
>>>>>>> hard-coded slave names instead?
>>>>>>>
>>>>>>> E.g.
>>>>>>> 1. fail slave renames
>>>>>>> 2. rename of failover to XX automatically renames standby to XXnsby
>>>>>>>        and primary to XXnpry
>>>>>> That wouldn't help. The time when the failover master gets renamed, the VF
>>>>>> may not be present.
>>>>> In this scheme if VF is not there it will be renamed immediately after registration.
>>>> Who will be responsible to rename the slave, the kernel?
>>> That's the idea.
>>>
>>>> Note the master's
>>>> name may or may not come from the userspace. If it comes from the userspace,
>>>> should the userspace daemon change their expectation not to name/rename
>>>> _any_ slaves (today there's no distinction)?
>>> Yes the idea would be to fail renaming slaves.
>> No I was asking about the userspace expectation: whether it should track and
>> detect the lifecycle events of failover slaves and decide what to do. How
>> does it get back to the user specified name if VF is not enslaved (say
>> someone unloads the virtio-net module)?
> When virtio net is removed VF will shortly be removed too.
>
>> As this scheme adds much complexity to the kernel naming convention
>> (currently it's just ethX names) that no userspace can understand.
> Anything that pokes at slaves needs to be specially designed anyway.
> Naming seems like a minor issue.
>
>> Will the
>> change break userspace further?
>>
>> -Siwei
> Didn't you show userspace is already broken. You can't "further
> break it", rename already fails.
It's a race, userspace tends to give slave a user(space) desired name 
but sometimes may fail due to this race. Today if failover master is not 
up, rename would succeed anyway. While what you proposed prohibits user 
from providing a name in all circumstances if I understand you 
correctly. That's what I meant of breaking userspace further. On the 
other hand, you seem to tighten the kernel default naming to udev 
predictable names, which is derived from only recent systemd-udevd, 
while there exists many possible userspace naming schemes out of that. 
Users today who deliberately chooses to disable predictable naming 
(net.ifnames=0 biosdevname=0) and fall back to kernel provided names 
would expect the ethX pattern, with this change admin/user scripts which 
matches the ethX pattern could potentially break.

IMHO that change is more risky than allow userspace to change the name 
for failover slave in any case. I would refresh everyone's mind that the 
target users of net_failover is very specific to the live migration 
scenario, who typically don't have profound knowledge to fiddle with the 
low level plumbing but just expect to operate on master device directly. 
I don't have much concern over the slave netfilter rule brokenness or 
whatsoever if just lifting up the rename restriction: the failover slave 
naming itself is already unreliable, how can we break those apps relying 
on consistent naming further without fixing it in the first place? It 
could be just simply two lines of code change, if any net_failover user, 
who may break due to this change, would have come here and complained 
about the naming issue earlier. IOW at the very least, the change below 
shouldn't make the current situation any worse.

Signed-off-by: Si-Wei Liu <si-wei.liu@oracle.com>

--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1127,7 +1127,8 @@ int dev_change_name(struct net_device *dev, const 
char *newname)
         BUG_ON(!dev_net(dev));

         net = dev_net(dev);
-       if (dev->flags & IFF_UP)
+       if (dev->flags & IFF_UP &&
+           !(dev->priv_flags & IFF_FAILOVER_SLAVE))
                 return -EBUSY;

         write_seqcount_begin(&devnet_rename_seq);

Thanks,
-Siwei


>
>>>> How do users know which name to
>>>> trust, depending on which wins the race more often? Say if kernel wants a
>>>> ens3npry name while userspace wants it named as ens4.
>>>>
>>>> -Siwei
>>> With this approach kernel will deny attempts by userspace to rename
>>> slaves.  Slaves will always be named XXXnsby and XXnpry. Master renames
>>> will rename both slaves.
>>>
>>> It seems pretty solid to me, the only issue is that in theory userspace
>>> can use a name like XXXnsby for something else. But this seems unlikely.
>>>
>>>
>>>>>> I don't like the idea to delay exposing failover master
>>>>>> until VF is hot plugged in (probably subject to various failures) later.
>>>>>>
>>>>>> Thanks,
>>>>>> -Siwei
>>>>> I agree, this was not what I meant.
>>>>>
Michael S. Tsirkin Feb. 28, 2019, 2:26 p.m. UTC | #36
On Thu, Feb 28, 2019 at 01:32:12AM -0800, si-wei liu wrote:
> > > Will the
> > > change break userspace further?
> > > 
> > > -Siwei
> > Didn't you show userspace is already broken. You can't "further
> > break it", rename already fails.
> It's a race, userspace tends to give slave a user(space) desired name but
> sometimes may fail due to this race. Today if failover master is not up,
> rename would succeed anyway. While what you proposed prohibits user from
> providing a name in all circumstances if I understand you correctly. That's
> what I meant of breaking userspace further. On the other hand, you seem to
> tighten the kernel default naming to udev predictable names, which is
> derived from only recent systemd-udevd, while there exists many possible
> userspace naming schemes out of that. Users today who deliberately chooses
> to disable predictable naming (net.ifnames=0 biosdevname=0) and fall back to
> kernel provided names would expect the ethX pattern, with this change
> admin/user scripts which matches the ethX pattern could potentially break.

Whatever crashes with a name not matching ethX will crash on the
standby interface *anyway*.

So I think what you are saying is that someone might have already
written scripts and gotten them to work on v4.17 when STANDBY was
included and these scripts rely on ethX. Now these scripts
will break.

Maybe it is still early enough (just half a year passed) that the
number of these users would be small.  So how about a kernel config
option and maybe a module parameter to rename the primary?  People can
then opt in to the old broken behaviour.
Jakub Kicinski Feb. 28, 2019, 6:13 p.m. UTC | #37
On Wed, 27 Feb 2019 23:47:33 -0500, Michael S. Tsirkin wrote:
> On Wed, Feb 27, 2019 at 05:52:18PM -0800, Jakub Kicinski wrote:
> > > > Can the users who care about the naming put net_failover into
> > > > "user space will do the bond enslavement" mode, and do the bond
> > > > creation/management themselves from user space (in systemd/ 
> > > > Network Manager) based on the failover flag?    
> > > 
> > > Putting issues of compatibility aside (userspace tends to be confused if
> > > you give it two devices with same MAC), how would you have it work in
> > > practice? Timer based hacks like netvsc where if userspace didn't
> > > respond within X seconds we assume it won't and do everything ourselves?  
> > 
> > Well, what I'm saying is basically if user space knows how to deal with
> > the auto-bonding, we can put aside net_failover for the most part.  It
> > can either be blacklisted or it can have some knob which will
> > effectively disable the auto-enslavement.  
> 
> OK I guess we could add a module parameter to skip this.
> Is this what you mean?

Yup.

> > Auto-bonding capable user space can do the renames, spawn the bond,
> > etc. all by itself.  I'm basically going back to my initial proposal
> > here :)  There is a RedHat bugzilla for the NetworkManager team to do
> > this, but we merged net_failover before those folks got around to
> > implementing it.  
> 
> In particular because there's no policy involved whatsoever
> here so it's just mechanism being pushed up to userspace.
> 
> > IOW if NM/systemd is capable of doing the auto-bonding itself it can
> > disable the kernel mechanism and take care of it all.  If kernel is
> > booted with an old user space which doesn't have capable NM/systemd -
> > net_failover will kick in and do its best.  
> 
> Sure - it's just 2 lines of code, see below.
> 
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> 
> But I don't intend to bother until there's actual interest from
> userspace developers to bother. In particular it is not just NM/systemd
> even on Fedora - e.g. you will need to teach dracut to somehow detect
> and handle this - right now it gets confused if there are two devices
> with same MAC addresses.

It is a bit of a the chicken or the egg situation ;)  But users can
just blacklist, too.  Anyway, I think this is far better than module
parameters for twiddling kernel-based interface naming policy.. :S

> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 955b3e76eb8d..dd2b2c370003 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -43,6 +43,7 @@ static bool csum = true, gso = true, napi_tx;
>  module_param(csum, bool, 0444);
>  module_param(gso, bool, 0444);
>  module_param(napi_tx, bool, 0644);
> +module_param(disable_failover, bool, 0644);
>  
>  /* FIXME: MTU in config. */
>  #define GOOD_PACKET_LEN (ETH_HLEN + VLAN_HLEN + ETH_DATA_LEN)
> @@ -3163,6 +3164,7 @@ static int virtnet_probe(struct virtio_device *vdev)
>  	virtnet_init_settings(dev);
>  
> -	if (virtio_has_feature(vdev, VIRTIO_NET_F_STANDBY)) {
> +	if (virtio_has_feature(vdev, VIRTIO_NET_F_STANDBY) &&
> +		!disable_failover) {
>  		vi->failover = net_failover_create(vi->dev);
>  		if (IS_ERR(vi->failover)) {
>  			err = PTR_ERR(vi->failover);
>
Michael S. Tsirkin Feb. 28, 2019, 7:36 p.m. UTC | #38
On Thu, Feb 28, 2019 at 10:13:56AM -0800, Jakub Kicinski wrote:
> On Wed, 27 Feb 2019 23:47:33 -0500, Michael S. Tsirkin wrote:
> > On Wed, Feb 27, 2019 at 05:52:18PM -0800, Jakub Kicinski wrote:
> > > > > Can the users who care about the naming put net_failover into
> > > > > "user space will do the bond enslavement" mode, and do the bond
> > > > > creation/management themselves from user space (in systemd/ 
> > > > > Network Manager) based on the failover flag?    
> > > > 
> > > > Putting issues of compatibility aside (userspace tends to be confused if
> > > > you give it two devices with same MAC), how would you have it work in
> > > > practice? Timer based hacks like netvsc where if userspace didn't
> > > > respond within X seconds we assume it won't and do everything ourselves?  
> > > 
> > > Well, what I'm saying is basically if user space knows how to deal with
> > > the auto-bonding, we can put aside net_failover for the most part.  It
> > > can either be blacklisted or it can have some knob which will
> > > effectively disable the auto-enslavement.  
> > 
> > OK I guess we could add a module parameter to skip this.
> > Is this what you mean?
> 
> Yup.
> 
> > > Auto-bonding capable user space can do the renames, spawn the bond,
> > > etc. all by itself.  I'm basically going back to my initial proposal
> > > here :)  There is a RedHat bugzilla for the NetworkManager team to do
> > > this, but we merged net_failover before those folks got around to
> > > implementing it.  
> > 
> > In particular because there's no policy involved whatsoever
> > here so it's just mechanism being pushed up to userspace.
> > 
> > > IOW if NM/systemd is capable of doing the auto-bonding itself it can
> > > disable the kernel mechanism and take care of it all.  If kernel is
> > > booted with an old user space which doesn't have capable NM/systemd -
> > > net_failover will kick in and do its best.  
> > 
> > Sure - it's just 2 lines of code, see below.
> > 
> > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > 
> > But I don't intend to bother until there's actual interest from
> > userspace developers to bother. In particular it is not just NM/systemd
> > even on Fedora - e.g. you will need to teach dracut to somehow detect
> > and handle this - right now it gets confused if there are two devices
> > with same MAC addresses.
> 
> It is a bit of a the chicken or the egg situation ;)  But users can
> just blacklist, too.  Anyway, I think this is far better than module
> parameters

Sorry I'm a bit confused. What is better than what?

> for twiddling kernel-based interface naming policy.. :S

I see your point. But my point is slave names don't really matter, only
master name matters.  So I am not sure there's any policy worth talking
about here.

> > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > index 955b3e76eb8d..dd2b2c370003 100644
> > --- a/drivers/net/virtio_net.c
> > +++ b/drivers/net/virtio_net.c
> > @@ -43,6 +43,7 @@ static bool csum = true, gso = true, napi_tx;
> >  module_param(csum, bool, 0444);
> >  module_param(gso, bool, 0444);
> >  module_param(napi_tx, bool, 0644);
> > +module_param(disable_failover, bool, 0644);
> >  
> >  /* FIXME: MTU in config. */
> >  #define GOOD_PACKET_LEN (ETH_HLEN + VLAN_HLEN + ETH_DATA_LEN)
> > @@ -3163,6 +3164,7 @@ static int virtnet_probe(struct virtio_device *vdev)
> >  	virtnet_init_settings(dev);
> >  
> > -	if (virtio_has_feature(vdev, VIRTIO_NET_F_STANDBY)) {
> > +	if (virtio_has_feature(vdev, VIRTIO_NET_F_STANDBY) &&
> > +		!disable_failover) {
> >  		vi->failover = net_failover_create(vi->dev);
> >  		if (IS_ERR(vi->failover)) {
> >  			err = PTR_ERR(vi->failover);
> >
Jakub Kicinski Feb. 28, 2019, 7:56 p.m. UTC | #39
On Thu, 28 Feb 2019 14:36:56 -0500, Michael S. Tsirkin wrote:
> > It is a bit of a the chicken or the egg situation ;)  But users can
> > just blacklist, too.  Anyway, I think this is far better than module
> > parameters  
> 
> Sorry I'm a bit confused. What is better than what?

I mean that blacklist net_failover or module param to disable
net_failover and handle in user space are better than trying to solve
the renaming at kernel level (either by adding module params that make
the kernel rename devices or letting user space change names of running
devices if they are slaves).

> > for twiddling kernel-based interface naming policy.. :S  
> 
> I see your point. But my point is slave names don't really matter, only
> master name matters.  So I am not sure there's any policy worth talking
> about here.

Oh yes, I don't disagree with you, but others seems to want to rename
the auto-bonded lower devices.  Which can be done trivially if it was 
a daemon in user space instantiating the auto-bond.  We are just
providing a basic version of auto-bonding in the kernel.  If there are
extra requirements on policy, or naming - the whole thing is better
solved in user space.
Michael S. Tsirkin Feb. 28, 2019, 8:14 p.m. UTC | #40
On Thu, Feb 28, 2019 at 11:56:41AM -0800, Jakub Kicinski wrote:
> On Thu, 28 Feb 2019 14:36:56 -0500, Michael S. Tsirkin wrote:
> > > It is a bit of a the chicken or the egg situation ;)  But users can
> > > just blacklist, too.  Anyway, I think this is far better than module
> > > parameters  
> > 
> > Sorry I'm a bit confused. What is better than what?
> 
> I mean that blacklist net_failover or module param to disable
> net_failover and handle in user space are better than trying to solve
> the renaming at kernel level (either by adding module params that make
> the kernel rename devices or letting user space change names of running
> devices if they are slaves).
> 
> > > for twiddling kernel-based interface naming policy.. :S  
> > 
> > I see your point. But my point is slave names don't really matter, only
> > master name matters.  So I am not sure there's any policy worth talking
> > about here.
> 
> Oh yes, I don't disagree with you, but others seems to want to rename
> the auto-bonded lower devices.  Which can be done trivially if it was 
> a daemon in user space instantiating the auto-bond.  We are just
> providing a basic version of auto-bonding in the kernel.  If there are
> extra requirements on policy, or naming - the whole thing is better
> solved in user space.

OK so it seems that you would be happy with a combination of the module
parameter disabling failover completely and renaming primary in kernel?
Did I get it right?
Jakub Kicinski Feb. 28, 2019, 11:31 p.m. UTC | #41
On Thu, 28 Feb 2019 15:14:55 -0500, Michael S. Tsirkin wrote:
> On Thu, Feb 28, 2019 at 11:56:41AM -0800, Jakub Kicinski wrote:
> > On Thu, 28 Feb 2019 14:36:56 -0500, Michael S. Tsirkin wrote:  
> > > > It is a bit of a the chicken or the egg situation ;)  But users can
> > > > just blacklist, too.  Anyway, I think this is far better than module
> > > > parameters    
> > > 
> > > Sorry I'm a bit confused. What is better than what?  
> > 
> > I mean that blacklist net_failover or module param to disable
> > net_failover and handle in user space are better than trying to solve
> > the renaming at kernel level (either by adding module params that make
> > the kernel rename devices or letting user space change names of running
> > devices if they are slaves).
> >   
> > > > for twiddling kernel-based interface naming policy.. :S    
> > > 
> > > I see your point. But my point is slave names don't really matter, only
> > > master name matters.  So I am not sure there's any policy worth talking
> > > about here.  
> > 
> > Oh yes, I don't disagree with you, but others seems to want to rename
> > the auto-bonded lower devices.  Which can be done trivially if it was 
> > a daemon in user space instantiating the auto-bond.  We are just
> > providing a basic version of auto-bonding in the kernel.  If there are
> > extra requirements on policy, or naming - the whole thing is better
> > solved in user space.  
> 
> OK so it seems that you would be happy with a combination of the module
> parameter disabling failover completely and renaming primary in kernel?
> Did I get it right?

Not 100%, I'm personally not convinced that renaming primary in the
kernel is okay.
Siwei Liu March 1, 2019, 12:20 a.m. UTC | #42
On Thu, Feb 28, 2019 at 11:56 AM Jakub Kicinski <kubakici@wp.pl> wrote:
>
> On Thu, 28 Feb 2019 14:36:56 -0500, Michael S. Tsirkin wrote:
> > > It is a bit of a the chicken or the egg situation ;)  But users can
> > > just blacklist, too.  Anyway, I think this is far better than module
> > > parameters
> >
> > Sorry I'm a bit confused. What is better than what?
>
> I mean that blacklist net_failover or module param to disable
> net_failover and handle in user space are better than trying to solve
> the renaming at kernel level (either by adding module params that make
> the kernel rename devices or letting user space change names of running
> devices if they are slaves).

Before I was aksed to revive this old mail thread, I knew the
discussion could end up with something like this. Yes, theoretically
there's a point - basically you don't believe kernel should take risk
in fixing the issue, so you push back the hope to something in
hypothesis that actually wasn't done and hard to get done in reality.
It's not too different than saying "hey, what you're asking for is
simply wrong, don't do it! Go back to modify userspace to create a
bond or team instead!" FWIW I want to emphasize that the debate for
what should be the right place to implement this failover facility:
userspace versus kernel, had been around for almost a decade, and no
real work ever happened in userspace to "standardize" this in the
Linux world.  The truth is that it's quite amount of complex work to
get it implemented right at userspace in reality: what Michael
mentions about making dracut auto-bonding aware is just tip of the
iceberg. Basically one would need to modify all the existing network
config tools to treat them well with this new auto-bonding concept:
handle duplicate MACs, differentiate it with regular bond/team, fix
boot time dependency of network boot and etc. Moreover, it's not a
single distro's effort from cloud provider's perspective, at least not
as simple as to say just move it to a daemon systemd/NM then work is
done. We (Oracle) had done extensive work in the past year to help
align various userspace components and work with distro vendors to
patch shipped packages to make them work with the failover 3-netdev
model. The work that needs to be done with userspace auto-bonding
would be more involved than just that, with quite trivial value (just
naming?) in turn that I suspect any developer in userspace could be
motivated.

So, simply put, no, we have zero interest in this direction. If
upstream believes this is the final conclusion, I think we can stop
discussing.

Thanks,
-Siwei
>
> > > for twiddling kernel-based interface naming policy.. :S
> >
> > I see your point. But my point is slave names don't really matter, only
> > master name matters.  So I am not sure there's any policy worth talking
> > about here.
>
> Oh yes, I don't disagree with you, but others seems to want to rename
> the auto-bonded lower devices.  Which can be done trivially if it was
> a daemon in user space instantiating the auto-bond.  We are just
> providing a basic version of auto-bonding in the kernel.  If there are
> extra requirements on policy, or naming - the whole thing is better
> solved in user space.
Jakub Kicinski March 1, 2019, 1:05 a.m. UTC | #43
On Thu, 28 Feb 2019 16:20:28 -0800, Siwei Liu wrote:
> On Thu, Feb 28, 2019 at 11:56 AM Jakub Kicinski wrote:
> > On Thu, 28 Feb 2019 14:36:56 -0500, Michael S. Tsirkin wrote:  
> > > > It is a bit of a the chicken or the egg situation ;)  But users can
> > > > just blacklist, too.  Anyway, I think this is far better than module
> > > > parameters  
> > >
> > > Sorry I'm a bit confused. What is better than what?  
> >
> > I mean that blacklist net_failover or module param to disable
> > net_failover and handle in user space are better than trying to solve
> > the renaming at kernel level (either by adding module params that make
> > the kernel rename devices or letting user space change names of running
> > devices if they are slaves).  
> 
> Before I was aksed to revive this old mail thread, I knew the
> discussion could end up with something like this. Yes, theoretically
> there's a point - basically you don't believe kernel should take risk
> in fixing the issue, so you push back the hope to something in
> hypothesis that actually wasn't done and hard to get done in reality.
> It's not too different than saying "hey, what you're asking for is
> simply wrong, don't do it! Go back to modify userspace to create a
> bond or team instead!" FWIW I want to emphasize that the debate for
> what should be the right place to implement this failover facility:
> userspace versus kernel, had been around for almost a decade, and no
> real work ever happened in userspace to "standardize" this in the
> Linux world.

Let me offer you my very subjective opinion of why "no real work ever
happened in user space".  The actors who have primary interest to get
the auto-bonding working are HW vendors trying to either convince
customers to use SR-IOV, or being pressured by customers to make SR-IOV
easier to consume.  HW vendors hire driver developers, not user space
developers.  So the solution we arrive at is in the kernel for a non
technical reason (Conway's law, sort of).

$ cd NetworkManager/
$ git log --pretty=format:"%ae" | \
    grep '\(mellanox\|intel\|broadcom\|netronome\)' | sort | uniq -c
     81 andrew.zaborowski@intel.com
      2 David.Woodhouse@intel.com
      2 ismo.puustinen@intel.com
      1 michael.i.doherty@intel.com

Andrew works on WiFi.

I have asked the NetworkManager folks to implement this feature last
year when net_failover got dangerously close to getting merged, and
they said they were never approached with this request before, much less
offered code that solve it.  Unfortunately before they got around to it
net_failover was merged already, and they didn't proceed.  

So to my knowledge nobody ever tried to solve this in user space.
I don't think net_failover is particularly terrible, or that renaming
of primary in the kernel is the end of the world, but I'd appreciate if
you could point me to efforts to solve it upstream in user space
components, or acknowledge that nobody actually tried that.
Si-Wei Liu March 1, 2019, 1:30 a.m. UTC | #44
On 2/28/2019 6:26 AM, Michael S. Tsirkin wrote:
> On Thu, Feb 28, 2019 at 01:32:12AM -0800, si-wei liu wrote:
>>>> Will the
>>>> change break userspace further?
>>>>
>>>> -Siwei
>>> Didn't you show userspace is already broken. You can't "further
>>> break it", rename already fails.
>> It's a race, userspace tends to give slave a user(space) desired name but
>> sometimes may fail due to this race. Today if failover master is not up,
>> rename would succeed anyway. While what you proposed prohibits user from
>> providing a name in all circumstances if I understand you correctly. That's
>> what I meant of breaking userspace further. On the other hand, you seem to
>> tighten the kernel default naming to udev predictable names, which is
>> derived from only recent systemd-udevd, while there exists many possible
>> userspace naming schemes out of that. Users today who deliberately chooses
>> to disable predictable naming (net.ifnames=0 biosdevname=0) and fall back to
>> kernel provided names would expect the ethX pattern, with this change
>> admin/user scripts which matches the ethX pattern could potentially break.
> Whatever crashes with a name not matching ethX will crash on the
> standby interface *anyway*.
With udev predictable naming disabled they should not. It's not hard for 
user to look for device attribute to persistent the name well, in a 
consistent and reliable way.

>
> So I think what you are saying is that someone might have already
> written scripts and gotten them to work on v4.17 when STANDBY was
> included and these scripts rely on ethX. Now these scripts
> will break.
The controversial part is the new kernel naming pattern. Initially I 
thought there shouldn't be such crazy scripts relying on the pattern, 
but when I worked on cloud-init it I realized that there's already a lot 
of software taking assumption around the 'eth0' name. In the past I've 
seen random scripts that parses the ethX name assumes (incorrectly) the 
name ends up with digits, or even the digits and name are 1:1 mapped. Of 
course, you can say these are bugs in scripts themselves.

Anyway, I'll let others in the netdev to comment on this new scheme, 
maybe that's the concern of merely myself. The good part of your 
proposal is that we can get consistent slave name, which still plays its 
role until we move towards making slave names less relevant, i.e. 
ideally a 1-netdev model. I think we both agree that the master matters 
more than the slave names.

>
> Maybe it is still early enough (just half a year passed) that the
> number of these users would be small.  So how about a kernel config
> option and maybe a module parameter to rename the primary?  People can
> then opt in to the old broken behaviour.
Were I could I would ask  why a similar opt-in (kernel config or module 
parameter) couldn't be implemented to open up the rename restriction on 
slave, net_failover in particular. What I felt about this rename 
restriction was more because of historical reason than anything else, 
while net_failover is comparatively a new type of link that we are now 
designing proper use case it should support, and can get it shaped to 
whatever it fits. My personal view is that the slave can't be renamed 
when master is running is just implementation details that got 
incorrectly exposed to userspace apps for many years. It's old behavior 
with historical reason for sure, but I don't think this applies to 
net_failover.

(FWIW as one previous bond maintainer for another OS, we relieved the 
rename restriction slaves 13 year ago, while no single complaint or 
issue was ever raised because of this change over the years, neither 
from the customers of tens of millions of installation base, nor the 
FOSS software running atop. Of course, Linux is different so that 
experience doesn't count.)

Thanks,
-Siwei
Michael S. Tsirkin March 1, 2019, 1:27 p.m. UTC | #45
On Thu, Feb 28, 2019 at 05:30:56PM -0800, si-wei liu wrote:
> 
> 
> On 2/28/2019 6:26 AM, Michael S. Tsirkin wrote:
> > On Thu, Feb 28, 2019 at 01:32:12AM -0800, si-wei liu wrote:
> > > > > Will the
> > > > > change break userspace further?
> > > > > 
> > > > > -Siwei
> > > > Didn't you show userspace is already broken. You can't "further
> > > > break it", rename already fails.
> > > It's a race, userspace tends to give slave a user(space) desired name but
> > > sometimes may fail due to this race. Today if failover master is not up,
> > > rename would succeed anyway. While what you proposed prohibits user from
> > > providing a name in all circumstances if I understand you correctly. That's
> > > what I meant of breaking userspace further. On the other hand, you seem to
> > > tighten the kernel default naming to udev predictable names, which is
> > > derived from only recent systemd-udevd, while there exists many possible
> > > userspace naming schemes out of that. Users today who deliberately chooses
> > > to disable predictable naming (net.ifnames=0 biosdevname=0) and fall back to
> > > kernel provided names would expect the ethX pattern, with this change
> > > admin/user scripts which matches the ethX pattern could potentially break.
> > Whatever crashes with a name not matching ethX will crash on the
> > standby interface *anyway*.
> With udev predictable naming disabled they should not. It's not hard for
> user to look for device attribute to persistent the name well, in a
> consistent and reliable way.

Well that's special code for failover already. So far we just
taught userspace to skip renaming slave interfaces.

> > 
> > So I think what you are saying is that someone might have already
> > written scripts and gotten them to work on v4.17 when STANDBY was
> > included and these scripts rely on ethX. Now these scripts
> > will break.
> The controversial part is the new kernel naming pattern. Initially I thought
> there shouldn't be such crazy scripts relying on the pattern, but when I
> worked on cloud-init it I realized that there's already a lot of software
> taking assumption around the 'eth0' name. In the past I've seen random
> scripts that parses the ethX name assumes (incorrectly) the name ends up
> with digits, or even the digits and name are 1:1 mapped. Of course, you can
> say these are bugs in scripts themselves.

No what I say is that they will crash on rename of standby too.

> Anyway, I'll let others in the netdev to comment on this new scheme, maybe
> that's the concern of merely myself. The good part of your proposal is that
> we can get consistent slave name, which still plays its role until we move
> towards making slave names less relevant, i.e. ideally a 1-netdev model. I
> think we both agree that the master matters more than the slave names.
> > 
> > Maybe it is still early enough (just half a year passed) that the
> > number of these users would be small.  So how about a kernel config
> > option and maybe a module parameter to rename the primary?  People can
> > then opt in to the old broken behaviour.
> Were I could I would ask  why a similar opt-in (kernel config or module
> parameter) couldn't be implemented to open up the rename restriction on
> slave, net_failover in particular. What I felt about this rename restriction
> was more because of historical reason than anything else, while net_failover
> is comparatively a new type of link that we are now designing proper use
> case it should support, and can get it shaped to whatever it fits. My
> personal view is that the slave can't be renamed when master is running is
> just implementation details that got incorrectly exposed to userspace apps
> for many years. It's old behavior with historical reason for sure, but I
> don't think this applies to net_failover.
> 
> (FWIW as one previous bond maintainer for another OS, we relieved the rename
> restriction slaves 13 year ago, while no single complaint or issue was ever
> raised because of this change over the years, neither from the customers of
> tens of millions of installation base, nor the FOSS software running atop.
> Of course, Linux is different so that experience doesn't count.)
> 
> Thanks,
> -Siwei
>
Si-Wei Liu March 1, 2019, 8:55 p.m. UTC | #46
On 3/1/2019 5:27 AM, Michael S. Tsirkin wrote:
> On Thu, Feb 28, 2019 at 05:30:56PM -0800, si-wei liu wrote:
>>
>> On 2/28/2019 6:26 AM, Michael S. Tsirkin wrote:
>>> On Thu, Feb 28, 2019 at 01:32:12AM -0800, si-wei liu wrote:
>>>>>> Will the
>>>>>> change break userspace further?
>>>>>>
>>>>>> -Siwei
>>>>> Didn't you show userspace is already broken. You can't "further
>>>>> break it", rename already fails.
>>>> It's a race, userspace tends to give slave a user(space) desired name but
>>>> sometimes may fail due to this race. Today if failover master is not up,
>>>> rename would succeed anyway. While what you proposed prohibits user from
>>>> providing a name in all circumstances if I understand you correctly. That's
>>>> what I meant of breaking userspace further. On the other hand, you seem to
>>>> tighten the kernel default naming to udev predictable names, which is
>>>> derived from only recent systemd-udevd, while there exists many possible
>>>> userspace naming schemes out of that. Users today who deliberately chooses
>>>> to disable predictable naming (net.ifnames=0 biosdevname=0) and fall back to
>>>> kernel provided names would expect the ethX pattern, with this change
>>>> admin/user scripts which matches the ethX pattern could potentially break.
>>> Whatever crashes with a name not matching ethX will crash on the
>>> standby interface *anyway*.
>> With udev predictable naming disabled they should not. It's not hard for
>> user to look for device attribute to persistent the name well, in a
>> consistent and reliable way.
> Well that's special code for failover already. So far we just
> taught userspace to skip renaming slave interfaces.
I think today kernel provided names never collapse, e.g. master gets 
eth0 then standby will get eth1. It's the userspace specified name that 
suffers name clashing, mostly the default predictable naming pattern 
from systemd-udevd.

Kernel should not assume there's only one naming pattern in userspace. 
Users can customize naming with udev rules in /etc which do not conform 
to the default udevd pattern at all. It's pretty legitimate use case.


>
>>> So I think what you are saying is that someone might have already
>>> written scripts and gotten them to work on v4.17 when STANDBY was
>>> included and these scripts rely on ethX. Now these scripts
>>> will break.
>> The controversial part is the new kernel naming pattern. Initially I thought
>> there shouldn't be such crazy scripts relying on the pattern, but when I
>> worked on cloud-init it I realized that there's already a lot of software
>> taking assumption around the 'eth0' name. In the past I've seen random
>> scripts that parses the ethX name assumes (incorrectly) the name ends up
>> with digits, or even the digits and name are 1:1 mapped. Of course, you can
>> say these are bugs in scripts themselves.
> No what I say is that they will crash on rename of standby too.
What do you mean crashing on standby rename? First off, if master is not 
up, rename on both standby and primary should not fail. If master is up, 
the standby should be named before userspace brings up the master, so 
what's the issue you talked about?

Thanks,
-Siwei

>
>> Anyway, I'll let others in the netdev to comment on this new scheme, maybe
>> that's the concern of merely myself. The good part of your proposal is that
>> we can get consistent slave name, which still plays its role until we move
>> towards making slave names less relevant, i.e. ideally a 1-netdev model. I
>> think we both agree that the master matters more than the slave names.
>>> Maybe it is still early enough (just half a year passed) that the
>>> number of these users would be small.  So how about a kernel config
>>> option and maybe a module parameter to rename the primary?  People can
>>> then opt in to the old broken behaviour.
>> Were I could I would ask  why a similar opt-in (kernel config or module
>> parameter) couldn't be implemented to open up the rename restriction on
>> slave, net_failover in particular. What I felt about this rename restriction
>> was more because of historical reason than anything else, while net_failover
>> is comparatively a new type of link that we are now designing proper use
>> case it should support, and can get it shaped to whatever it fits. My
>> personal view is that the slave can't be renamed when master is running is
>> just implementation details that got incorrectly exposed to userspace apps
>> for many years. It's old behavior with historical reason for sure, but I
>> don't think this applies to net_failover.
>>
>> (FWIW as one previous bond maintainer for another OS, we relieved the rename
>> restriction slaves 13 year ago, while no single complaint or issue was ever
>> raised because of this change over the years, neither from the customers of
>> tens of millions of installation base, nor the FOSS software running atop.
>> Of course, Linux is different so that experience doesn't count.)
>>
>> Thanks,
>> -Siwei
>>
Siwei Liu March 2, 2019, 12:30 a.m. UTC | #47
On Thu, Feb 28, 2019 at 5:05 PM Jakub Kicinski <kubakici@wp.pl> wrote:
>
> On Thu, 28 Feb 2019 16:20:28 -0800, Siwei Liu wrote:
> > On Thu, Feb 28, 2019 at 11:56 AM Jakub Kicinski wrote:
> > > On Thu, 28 Feb 2019 14:36:56 -0500, Michael S. Tsirkin wrote:
> > > > > It is a bit of a the chicken or the egg situation ;)  But users can
> > > > > just blacklist, too.  Anyway, I think this is far better than module
> > > > > parameters
> > > >
> > > > Sorry I'm a bit confused. What is better than what?
> > >
> > > I mean that blacklist net_failover or module param to disable
> > > net_failover and handle in user space are better than trying to solve
> > > the renaming at kernel level (either by adding module params that make
> > > the kernel rename devices or letting user space change names of running
> > > devices if they are slaves).
> >
> > Before I was aksed to revive this old mail thread, I knew the
> > discussion could end up with something like this. Yes, theoretically
> > there's a point - basically you don't believe kernel should take risk
> > in fixing the issue, so you push back the hope to something in
> > hypothesis that actually wasn't done and hard to get done in reality.
> > It's not too different than saying "hey, what you're asking for is
> > simply wrong, don't do it! Go back to modify userspace to create a
> > bond or team instead!" FWIW I want to emphasize that the debate for
> > what should be the right place to implement this failover facility:
> > userspace versus kernel, had been around for almost a decade, and no
> > real work ever happened in userspace to "standardize" this in the
> > Linux world.
>
> Let me offer you my very subjective opinion of why "no real work ever
> happened in user space".  The actors who have primary interest to get
> the auto-bonding working are HW vendors trying to either convince
> customers to use SR-IOV, or being pressured by customers to make SR-IOV
> easier to consume.  HW vendors hire driver developers, not user space
> developers.  So the solution we arrive at is in the kernel for a non
> technical reason (Conway's law, sort of).
>
> $ cd NetworkManager/
> $ git log --pretty=format:"%ae" | \
>     grep '\(mellanox\|intel\|broadcom\|netronome\)' | sort | uniq -c
>      81 andrew.zaborowski@intel.com
>       2 David.Woodhouse@intel.com
>       2 ismo.puustinen@intel.com
>       1 michael.i.doherty@intel.com
>
> Andrew works on WiFi.
>

I'm sorry, but we don't use NetworkManager in our cloud images at all.
We sufferd from lots of problems when booting from remote iSCSI disk
with NetworkManager enabled, and it looks like those issues are still
there while that's not (my subjective impression) a network config
tool mainly targeting desktop and WiFi users ever cares about. At
least a sign of lack of sufficient testing was made there.

From cloud service provider perspective, we always prefer single
central solution than speak to various distro vendors with their own
network daemons/config tools thus different solutions. It's hard to
coordicate all efforts in one place. From my personal perspetive, the
in-kernel auto-slave solution is nothing technically inferior than any
userspace implementation, and every major OS/cloud providers choose to
implement this in-kernel model for the same reason. I don't want to
argue more if there's value or not for net_failover to be in Linux
kernel, given that it's already there I think it's better to move on.

We have done extensive work in reporting (actually, fix them
internally before posting) issues to the dracut, udev,
initramfs-tools, and cloud-init community. Although as claimed the
3-netdev should be transparent to userspace in general, the reality is
opposite: the effort is nothing differenet than bring up a new type of
virutal bond than any existing userspace tool would otherwise expect
for a regular physical netdev. If there's ever concern about breaking
userspace, I bet no one ever tries to start using it. If they did they
know what I am saying. The dup MAC address setting and plugging order
are totally new to userspace that none of userspace tools fail to know
how to plumb failover interface in a proper way, if without fixing
them one or another.

-Siwei

> I have asked the NetworkManager folks to implement this feature last
> year when net_failover got dangerously close to getting merged, and
> they said they were never approached with this request before, much less
> offered code that solve it.  Unfortunately before they got around to it
> net_failover was merged already, and they didn't proceed.
>
> So to my knowledge nobody ever tried to solve this in user space.
> I don't think net_failover is particularly terrible, or that renaming
> of primary in the kernel is the end of the world, but I'd appreciate if
> you could point me to efforts to solve it upstream in user space
> components, or acknowledge that nobody actually tried that.
diff mbox series

Patch

diff --git a/drivers/net/hyperv/Kconfig b/drivers/net/hyperv/Kconfig
index 936968d23559..cc3a721baa18 100644
--- a/drivers/net/hyperv/Kconfig
+++ b/drivers/net/hyperv/Kconfig
@@ -1,5 +1,6 @@ 
 config HYPERV_NET
 	tristate "Microsoft Hyper-V virtual network driver"
 	depends on HYPERV
+	depends on MAY_USE_BYPASS
 	help
 	  Select this option to enable the Hyper-V virtual network driver.
diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h
index 960f06141472..5f8137bc5c1c 100644
--- a/drivers/net/hyperv/hyperv_net.h
+++ b/drivers/net/hyperv/hyperv_net.h
@@ -768,6 +768,8 @@  struct net_device_context {
 	u32 vf_alloc;
 	/* Serial number of the VF to team with */
 	u32 vf_serial;
+
+	struct bypass_master *bypass_master;
 };
 
 /* Per channel data */
diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c
index ecc84954c511..87c2a276e62f 100644
--- a/drivers/net/hyperv/netvsc_drv.c
+++ b/drivers/net/hyperv/netvsc_drv.c
@@ -43,6 +43,7 @@ 
 #include <net/pkt_sched.h>
 #include <net/checksum.h>
 #include <net/ip6_checksum.h>
+#include <net/bypass.h>
 
 #include "hyperv_net.h"
 
@@ -1763,46 +1764,6 @@  static void netvsc_link_change(struct work_struct *w)
 	rtnl_unlock();
 }
 
-static struct net_device *get_netvsc_bymac(const u8 *mac)
-{
-	struct net_device *dev;
-
-	ASSERT_RTNL();
-
-	for_each_netdev(&init_net, dev) {
-		if (dev->netdev_ops != &device_ops)
-			continue;	/* not a netvsc device */
-
-		if (ether_addr_equal(mac, dev->perm_addr))
-			return dev;
-	}
-
-	return NULL;
-}
-
-static struct net_device *get_netvsc_byref(struct net_device *vf_netdev)
-{
-	struct net_device *dev;
-
-	ASSERT_RTNL();
-
-	for_each_netdev(&init_net, dev) {
-		struct net_device_context *net_device_ctx;
-
-		if (dev->netdev_ops != &device_ops)
-			continue;	/* not a netvsc device */
-
-		net_device_ctx = netdev_priv(dev);
-		if (!rtnl_dereference(net_device_ctx->nvdev))
-			continue;	/* device is removed */
-
-		if (rtnl_dereference(net_device_ctx->vf_netdev) == vf_netdev)
-			return dev;	/* a match */
-	}
-
-	return NULL;
-}
-
 /* Called when VF is injecting data into network stack.
  * Change the associated network device from VF to netvsc.
  * note: already called with rcu_read_lock
@@ -1829,39 +1790,15 @@  static int netvsc_vf_join(struct net_device *vf_netdev,
 			  struct net_device *ndev)
 {
 	struct net_device_context *ndev_ctx = netdev_priv(ndev);
-	int ret;
-
-	ret = netdev_rx_handler_register(vf_netdev,
-					 netvsc_vf_handle_frame, ndev);
-	if (ret != 0) {
-		netdev_err(vf_netdev,
-			   "can not register netvsc VF receive handler (err = %d)\n",
-			   ret);
-		goto rx_handler_failed;
-	}
-
-	ret = netdev_upper_dev_link(vf_netdev, ndev, NULL);
-	if (ret != 0) {
-		netdev_err(vf_netdev,
-			   "can not set master device %s (err = %d)\n",
-			   ndev->name, ret);
-		goto upper_link_failed;
-	}
-
-	/* set slave flag before open to prevent IPv6 addrconf */
-	vf_netdev->flags |= IFF_SLAVE;
 
 	schedule_delayed_work(&ndev_ctx->vf_takeover, VF_TAKEOVER_INT);
 
-	call_netdevice_notifiers(NETDEV_JOIN, vf_netdev);
-
 	netdev_info(vf_netdev, "joined to %s\n", ndev->name);
-	return 0;
 
-upper_link_failed:
-	netdev_rx_handler_unregister(vf_netdev);
-rx_handler_failed:
-	return ret;
+	dev_hold(vf_netdev);
+	rcu_assign_pointer(ndev_ctx->vf_netdev, vf_netdev);
+
+	return 0;
 }
 
 static void __netvsc_vf_setup(struct net_device *ndev,
@@ -1914,85 +1851,82 @@  static void netvsc_vf_setup(struct work_struct *w)
 	rtnl_unlock();
 }
 
-static int netvsc_register_vf(struct net_device *vf_netdev)
+static int netvsc_vf_pre_register(struct net_device *vf_netdev,
+				  struct net_device *ndev)
 {
-	struct net_device *ndev;
 	struct net_device_context *net_device_ctx;
 	struct netvsc_device *netvsc_dev;
 
-	if (vf_netdev->addr_len != ETH_ALEN)
-		return NOTIFY_DONE;
-
-	/*
-	 * We will use the MAC address to locate the synthetic interface to
-	 * associate with the VF interface. If we don't find a matching
-	 * synthetic interface, move on.
-	 */
-	ndev = get_netvsc_bymac(vf_netdev->perm_addr);
-	if (!ndev)
-		return NOTIFY_DONE;
-
 	net_device_ctx = netdev_priv(ndev);
 	netvsc_dev = rtnl_dereference(net_device_ctx->nvdev);
 	if (!netvsc_dev || rtnl_dereference(net_device_ctx->vf_netdev))
-		return NOTIFY_DONE;
-
-	if (netvsc_vf_join(vf_netdev, ndev) != 0)
-		return NOTIFY_DONE;
+		return -EEXIST;
 
 	netdev_info(ndev, "VF registering: %s\n", vf_netdev->name);
 
-	dev_hold(vf_netdev);
-	rcu_assign_pointer(net_device_ctx->vf_netdev, vf_netdev);
-	return NOTIFY_OK;
+	return 0;
 }
 
 /* VF up/down change detected, schedule to change data path */
-static int netvsc_vf_changed(struct net_device *vf_netdev)
+static int netvsc_vf_changed(struct net_device *vf_netdev,
+			     struct net_device *ndev)
 {
 	struct net_device_context *net_device_ctx;
 	struct netvsc_device *netvsc_dev;
-	struct net_device *ndev;
 	bool vf_is_up = netif_running(vf_netdev);
 
-	ndev = get_netvsc_byref(vf_netdev);
-	if (!ndev)
-		return NOTIFY_DONE;
-
 	net_device_ctx = netdev_priv(ndev);
 	netvsc_dev = rtnl_dereference(net_device_ctx->nvdev);
 	if (!netvsc_dev)
-		return NOTIFY_DONE;
+		return -EINVAL;
 
 	netvsc_switch_datapath(ndev, vf_is_up);
 	netdev_info(ndev, "Data path switched %s VF: %s\n",
 		    vf_is_up ? "to" : "from", vf_netdev->name);
 
-	return NOTIFY_OK;
+	return 0;
 }
 
-static int netvsc_unregister_vf(struct net_device *vf_netdev)
+static int netvsc_vf_release(struct net_device *vf_netdev,
+			     struct net_device *ndev)
 {
-	struct net_device *ndev;
 	struct net_device_context *net_device_ctx;
 
-	ndev = get_netvsc_byref(vf_netdev);
-	if (!ndev)
-		return NOTIFY_DONE;
-
 	net_device_ctx = netdev_priv(ndev);
-	cancel_delayed_work_sync(&net_device_ctx->vf_takeover);
+	if (vf_netdev != rtnl_dereference(net_device_ctx->vf_netdev))
+		return -EINVAL;
 
-	netdev_info(ndev, "VF unregistering: %s\n", vf_netdev->name);
+	cancel_delayed_work_sync(&net_device_ctx->vf_takeover);
 
-	netdev_rx_handler_unregister(vf_netdev);
-	netdev_upper_dev_unlink(vf_netdev, ndev);
 	RCU_INIT_POINTER(net_device_ctx->vf_netdev, NULL);
 	dev_put(vf_netdev);
 
-	return NOTIFY_OK;
+	return 0;
 }
 
+static int netvsc_vf_pre_unregister(struct net_device *vf_netdev,
+				    struct net_device *ndev)
+{
+	struct net_device_context *net_device_ctx;
+
+	net_device_ctx = netdev_priv(ndev);
+	if (vf_netdev != rtnl_dereference(net_device_ctx->vf_netdev))
+		return -EINVAL;
+
+	netdev_info(ndev, "VF unregistering: %s\n", vf_netdev->name);
+
+	return 0;
+}
+
+static struct bypass_ops netvsc_bypass_ops = {
+	.slave_pre_register	= netvsc_vf_pre_register,
+	.slave_join		= netvsc_vf_join,
+	.slave_pre_unregister	= netvsc_vf_pre_unregister,
+	.slave_release		= netvsc_vf_release,
+	.slave_link_change	= netvsc_vf_changed,
+	.handle_frame		= netvsc_vf_handle_frame,
+};
+
 static int netvsc_probe(struct hv_device *dev,
 			const struct hv_vmbus_device_id *dev_id)
 {
@@ -2082,8 +2016,15 @@  static int netvsc_probe(struct hv_device *dev,
 		goto register_failed;
 	}
 
+	ret = bypass_master_register(net, &netvsc_bypass_ops,
+				     &net_device_ctx->bypass_master);
+	if (ret != 0)
+		goto err_bypass;
+
 	return ret;
 
+err_bypass:
+	unregister_netdev(net);
 register_failed:
 	rndis_filter_device_remove(dev, nvdev);
 rndis_failed:
@@ -2124,13 +2065,15 @@  static int netvsc_remove(struct hv_device *dev)
 	rtnl_lock();
 	vf_netdev = rtnl_dereference(ndev_ctx->vf_netdev);
 	if (vf_netdev)
-		netvsc_unregister_vf(vf_netdev);
+		bypass_slave_unregister(vf_netdev);
 
 	if (nvdev)
 		rndis_filter_device_remove(dev, nvdev);
 
 	unregister_netdevice(net);
 
+	bypass_master_unregister(ndev_ctx->bypass_master);
+
 	rtnl_unlock();
 	rcu_read_unlock();
 
@@ -2157,54 +2100,8 @@  static struct  hv_driver netvsc_drv = {
 	.remove = netvsc_remove,
 };
 
-/*
- * On Hyper-V, every VF interface is matched with a corresponding
- * synthetic interface. The synthetic interface is presented first
- * to the guest. When the corresponding VF instance is registered,
- * we will take care of switching the data path.
- */
-static int netvsc_netdev_event(struct notifier_block *this,
-			       unsigned long event, void *ptr)
-{
-	struct net_device *event_dev = netdev_notifier_info_to_dev(ptr);
-
-	/* Skip our own events */
-	if (event_dev->netdev_ops == &device_ops)
-		return NOTIFY_DONE;
-
-	/* Avoid non-Ethernet type devices */
-	if (event_dev->type != ARPHRD_ETHER)
-		return NOTIFY_DONE;
-
-	/* Avoid Vlan dev with same MAC registering as VF */
-	if (is_vlan_dev(event_dev))
-		return NOTIFY_DONE;
-
-	/* Avoid Bonding master dev with same MAC registering as VF */
-	if ((event_dev->priv_flags & IFF_BONDING) &&
-	    (event_dev->flags & IFF_MASTER))
-		return NOTIFY_DONE;
-
-	switch (event) {
-	case NETDEV_REGISTER:
-		return netvsc_register_vf(event_dev);
-	case NETDEV_UNREGISTER:
-		return netvsc_unregister_vf(event_dev);
-	case NETDEV_UP:
-	case NETDEV_DOWN:
-		return netvsc_vf_changed(event_dev);
-	default:
-		return NOTIFY_DONE;
-	}
-}
-
-static struct notifier_block netvsc_netdev_notifier = {
-	.notifier_call = netvsc_netdev_event,
-};
-
 static void __exit netvsc_drv_exit(void)
 {
-	unregister_netdevice_notifier(&netvsc_netdev_notifier);
 	vmbus_driver_unregister(&netvsc_drv);
 }
 
@@ -2224,7 +2121,6 @@  static int __init netvsc_drv_init(void)
 	if (ret)
 		return ret;
 
-	register_netdevice_notifier(&netvsc_netdev_notifier);
 	return 0;
 }