Message ID | 50160EEF.6050406@parallels.com |
---|---|
State | Changes Requested, archived |
Delegated to: | David Miller |
Pavel Emelyanov <xemul@parallels.com> writes:

> Currently the RTM_NEWLINK results in -EOPNOTSUPP if the ifinfomsg->ifi_index
> is not zero. I propose to allow requesting ifindices on link creation. This
> is required by the checkpoint-restore to correctly restore a net namespace
> (i.e. -- a container). The question what to do with pre-created devices such
> as lo or sit fbdev is open, but for manually created devices this can be
> solved by this patch.

Have you walked through and found the locations where we still rely on
ifindex being globally unique?

Last time I was working in this area there were several places where
things were indexed by just the interface index.

I suspect it might be easier to generate hotplug events at restart time
saying someone removed and added an identical set of network devices.
Certainly for physical hardware that needs to happen, because things
like mac addresses will change.

Eric

> Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
>
> ---
>
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 0ebaea1..5966e2f 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -5533,7 +5533,12 @@ int register_netdevice(struct net_device *dev)
>  		}
>  	}
>  
> -	dev->ifindex = dev_new_index(net);
> +	ret = -EBUSY;
> +	if (!dev->ifindex)
> +		dev->ifindex = dev_new_index(net);
> +	else if (__dev_get_by_index(net, dev->ifindex))
> +		goto err_uninit;
> +
>  	if (dev->iflink == -1)
>  		dev->iflink = dev->ifindex;
>  
> diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
> index 334b930..76e19aa 100644
> --- a/net/core/rtnetlink.c
> +++ b/net/core/rtnetlink.c
> @@ -1801,8 +1801,6 @@ replay:
>  			return -ENODEV;
>  		}
>  
> -		if (ifm->ifi_index)
> -			return -EOPNOTSUPP;
>  		if (tb[IFLA_MAP] || tb[IFLA_MASTER] || tb[IFLA_PROTINFO])
>  			return -EOPNOTSUPP;
>  
> @@ -1828,10 +1826,14 @@ replay:
>  			return PTR_ERR(dest_net);
>  
>  		dev = rtnl_create_link(net, dest_net, ifname, ops, tb);
> -
> -		if (IS_ERR(dev))
> +		if (IS_ERR(dev)) {
>  			err = PTR_ERR(dev);
> -		else if (ops->newlink)
> +			goto out;
> +		}
> +
> +		dev->ifindex = ifm->ifi_index;
> +
> +		if (ops->newlink)
>  			err = ops->newlink(net, dev, tb, data);
>  		else
>  			err = register_netdevice(dev);

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

ebiederm@xmission.com (Eric W. Biederman) writes:

> Pavel Emelyanov <xemul@parallels.com> writes:
>
>> Currently the RTM_NEWLINK results in -EOPNOTSUPP if the ifinfomsg->ifi_index
>> is not zero. I propose to allow requesting ifindices on link creation. This
>> is required by the checkpoint-restore to correctly restore a net namespace
>> (i.e. -- a container). The question what to do with pre-created devices such
>> as lo or sit fbdev is open, but for manually created devices this can be
>> solved by this patch.
>
> Have you walked through and found the locations where we still rely on
> ifindex being globally unique?
>
> Last time I was working in this area there were several places where
> things were indexed by just the interface index.

If it is really safe to make ifindex per network namespace at this
point you can make dev_new_index() have a per network namespace base
counter, and that will fix your problems with the loopback device.

Unless you have done the work to root out the last of the dependencies on
ifindex being globally unique I think you will run into some operational
problems.

Eric

On Mon, 2012-07-30 at 03:49 -0700, Eric W. Biederman wrote:
> Pavel Emelyanov <xemul@parallels.com> writes:
>
> > Currently the RTM_NEWLINK results in -EOPNOTSUPP if the ifinfomsg->ifi_index
> > is not zero. I propose to allow requesting ifindices on link creation. This
> > is required by the checkpoint-restore to correctly restore a net namespace
> > (i.e. -- a container). The question what to do with pre-created devices such
> > as lo or sit fbdev is open, but for manually created devices this can be
> > solved by this patch.
>
> Have you walked through and found the locations where we still rely on
> ifindex being globally unique?
>
> Last time I was working in this area there were several places where
> things were indexed by just the interface index.

Really? This would be very strange.

AFAIK dev_new_index() is always called, even in the
dev_change_net_namespace() case if there is a conflict.

And dev_new_index() could use a pernet net->ifindex instead of a
shared/static one.

Eric Dumazet <eric.dumazet@gmail.com> writes:

> On Mon, 2012-07-30 at 03:49 -0700, Eric W. Biederman wrote:
>> Pavel Emelyanov <xemul@parallels.com> writes:
>>
>> > Currently the RTM_NEWLINK results in -EOPNOTSUPP if the ifinfomsg->ifi_index
>> > is not zero. I propose to allow requesting ifindices on link creation. This
>> > is required by the checkpoint-restore to correctly restore a net namespace
>> > (i.e. -- a container). The question what to do with pre-created devices such
>> > as lo or sit fbdev is open, but for manually created devices this can be
>> > solved by this patch.
>>
>> Have you walked through and found the locations where we still rely on
>> ifindex being globally unique?
>>
>> Last time I was working in this area there were several places where
>> things were indexed by just the interface index.
>
> Really? This would be very strange.

There at least were places that used oif or iff without being pernet
last time I was working on this. It was never code that I understood
particularly well, so my memory of what that code is, is unfortunately
fuzzy.

> AFAIK dev_new_index() is always called, even in the
> dev_change_net_namespace() case if there is a conflict.

Except we never have a conflict because it takes an absurd number of
network devices to cause a 32bit counter to wrap.

> And dev_new_index() could use a pernet net->ifindex instead of a
> shared/static one.

Yes. I made all of the core changes, and held back on making
dev_new_index() use a pernet net->ifindex because of a couple of problem
cases. It has been a long time and those cases might have been fixed.

I'm not seeing anything obvious in the network stack with a quick skim,
but before we start relying on the property that interface indices are
not globally unique I expect a good hard look at the networking stack
to see if any of those cases where there were problems still exist.

Eric

On 07/30/2012 02:56 PM, Eric W. Biederman wrote:
> ebiederm@xmission.com (Eric W. Biederman) writes:
>
>> Pavel Emelyanov <xemul@parallels.com> writes:
>>
>>> Currently the RTM_NEWLINK results in -EOPNOTSUPP if the ifinfomsg->ifi_index
>>> is not zero. I propose to allow requesting ifindices on link creation. This
>>> is required by the checkpoint-restore to correctly restore a net namespace
>>> (i.e. -- a container). The question what to do with pre-created devices such
>>> as lo or sit fbdev is open, but for manually created devices this can be
>>> solved by this patch.
>>
>> Have you walked through and found the locations where we still rely on
>> ifindex being globally unique?
>>
>> Last time I was working in this area there were several places where
>> things were indexed by just the interface index.
>
> If it is really safe to make ifindex per network namespace at this
> point you can make dev_new_index() have a per network namespace base
> counter, and that will fix your problems with the loopback device.

No, it's not so, unfortunately :(

First, let's imagine that on host A the loopback device got registered as
the first device, but on host B for some reason some other device got
registered first. In that case after migration from A to B the lo on B will
have an index equal to 2. And there is no strict requirement that lo's per
net operations are registered first. Please, correct me if I'm wrong.

Next. In fact, lo is not the only problem. Look at e.g. the sit versus ipgre
fallback devices. Both get created on netns creation and obtain whatever
ifindices are generated for them. Even if we make ifindex per netns, the
chances that sit gets registered _strictly_ before ipgre equal zero, since
they are both modules.

> Unless you have done the work to root out the last of dependencies on
> ifindex being globally unique I think you will run into some operational
> problems.

I totally agree with that. Before doing this patch I revisited the ancient
attempt to make ifindices per netns and checked the issues Dave and you
discussed then -- I have looked through how the ifindices are used in the
networking code and found no places where the system-wide uniqueness is still
required. That's why I proposed this patch for inclusion. If you know the
places I've missed, please let me know, I will work on it.

> Eric
>
> .
>

Thanks,
Pavel

> I'm not seeing anything obvious in the network stack with a quick skim,
> but before we start relying on the property that interface indices are
> not globally unique I expect a good hard look at the networking stack
> to see if any of those cases where there were problems still exist.

Just an idea -- is it worth moving the possibility to have ifindices intersect
under CONFIG_<SOMETHING> (EXPERT/CHECKPOINT_RESTORE) to let a wider audience
check the code in real life?

> Eric
>
> .
>

Thanks,
Pavel

Pavel Emelyanov <xemul@parallels.com> writes:

> On 07/30/2012 02:56 PM, Eric W. Biederman wrote:
>> ebiederm@xmission.com (Eric W. Biederman) writes:
>>
>>> Pavel Emelyanov <xemul@parallels.com> writes:
>>>
>>>> Currently the RTM_NEWLINK results in -EOPNOTSUPP if the ifinfomsg->ifi_index
>>>> is not zero. I propose to allow requesting ifindices on link creation. This
>>>> is required by the checkpoint-restore to correctly restore a net namespace
>>>> (i.e. -- a container). The question what to do with pre-created devices such
>>>> as lo or sit fbdev is open, but for manually created devices this can be
>>>> solved by this patch.
>>>
>>> Have you walked through and found the locations where we still rely on
>>> ifindex being globally unique?
>>>
>>> Last time I was working in this area there were several places where
>>> things were indexed by just the interface index.
>>
>> If it is really safe to make ifindex per network namespace at this
>> point you can make dev_new_index() have a per network namespace base
>> counter, and that will fix your problems with the loopback device.
>
> No, it's not so, unfortunately :(
>
> First, let's imagine that on host A the loopback device got registered as
> the first device, but on host B for some reason some other device got
> registered first. In that case after migration from A to B the lo on B will
> have an index equal to 2. And there is no strict requirement that lo's per
> net operations are registered first. Please, correct me if I'm wrong.

Actually there is a hard requirement that the loopback device be the
last device in a network namespace to be unregistered. We meet that
requirement by registering the loopback device first
"net/core/dev.c:net_dev_init()".

> Next. In fact, lo is not the only problem. Look at e.g. the sit versus ipgre
> fallback devices. Both get created on netns creation and obtain whatever
> ifindices are generated for them. Even if we make ifindex per netns, the
> chances that sit gets registered _strictly_ before ipgre equal zero, since
> they are both modules.

True. However those fallback devices should no longer be needed,
and even if they are I think you can delete and recreate them.

Making lo the particularly interesting case.

>> Unless you have done the work to root out the last of dependencies on
>> ifindex being globally unique I think you will run into some operational
>> problems.
>
> I totally agree with that. Before doing this patch I revisited the ancient
> attempt to make ifindices per netns and checked the issues Dave and you
> discussed then -- I have looked through how the ifindices are used in the
> networking code and found no places where the system-wide uniqueness is still
> required. That's why I proposed this patch for inclusion. If you know the
> places I've missed, please let me know, I will work on it.

I took a quick look and I did not see anything. I saw places under
net/sched/ that looked a bit suspicious, and of course there are places
where we use oif and iff in some of the routing code that make me wonder
a bit. But if you have looked and if I have looked I think we are ok.

> Just an idea -- is it worth moving the possibility to have ifindices intersect
> under CONFIG_<SOMETHING> (EXPERT/CHECKPOINT_RESTORE) to let a wider audience
> check the code in real life?

I think the best testing we are going to get diversity wise is to create
a per netns counter into dev_new_index() when net-next opens up.

Having an ifindex that we can only set at netdevice creation time seems
reasonable.

Eric

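The per-namespace counter discussed above can be pictured with a small user-space model. This is only a sketch under stated assumptions: `struct toy_net`, `toy_dev_new_index()` and `toy_register()` are hypothetical stand-ins invented for illustration, loosely modelled on the idea of moving the counter behind dev_new_index() into the namespace; it is not the kernel implementation.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_DEVS 16

/* Hypothetical stand-in for struct net: each namespace owns its counter. */
struct toy_net {
	int ifindex;              /* last index handed out in this namespace */
	int used[TOY_MAX_DEVS];   /* indices of currently registered devices */
	size_t ndevs;
};

static bool toy_index_in_use(const struct toy_net *net, int ifindex)
{
	for (size_t i = 0; i < net->ndevs; i++)
		if (net->used[i] == ifindex)
			return true;
	return false;
}

/* Pick the next free index, counting per namespace instead of globally. */
static int toy_dev_new_index(struct toy_net *net)
{
	for (;;) {
		/* Wrap back to 1 rather than overflowing; 0 is never handed out. */
		if (net->ifindex == INT_MAX)
			net->ifindex = 0;
		net->ifindex++;
		if (!toy_index_in_use(net, net->ifindex))
			return net->ifindex;
	}
}

/* Register one device with an automatically chosen index.
 * Returns the index, or -1 if the toy table is full. */
static int toy_register(struct toy_net *net)
{
	int idx;

	assert(net != NULL);
	if (net->ndevs == TOY_MAX_DEVS)
		return -1;
	idx = toy_dev_new_index(net);
	net->used[net->ndevs++] = idx;
	return idx;
}
```

In this model the first device registered in any namespace gets index 1, so if lo is always registered first it ends up with the same index on every host, which is the property the loopback discussion relies on.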
>> First, let's imagine that on host A the loopback device got registered as
>> the first device, but on host B for some reason some other device got
>> registered first. In that case after migration from A to B the lo on B will
>> have an index equal to 2. And there is no strict requirement that lo's per
>> net operations are registered first. Please, correct me if I'm wrong.
>
> Actually there is a hard requirement that the loopback device be the
> last device in a network namespace to be unregistered. We meet that
> requirement by registering the loopback device first
> "net/core/dev.c:net_dev_init()".

Hm... Indeed, and this is good news!

>> Next. In fact, lo is not the only problem. Look at e.g. the sit versus ipgre
>> fallback devices. Both get created on netns creation and obtain whatever
>> ifindices are generated for them. Even if we make ifindex per netns, the
>> chances that sit gets registered _strictly_ before ipgre equal zero, since
>> they are both modules.
>
> True. However those fallback devices should no longer be needed,
> and even if they are I think you can delete and recreate them.

Good idea! I will look in that direction.

> Making lo the particularly interesting case.

Yup, provided we can manually recreate those auto-created devices this
solves the issue.

>> Just an idea -- is it worth moving the possibility to have ifindices intersect
>> under CONFIG_<SOMETHING> (EXPERT/CHECKPOINT_RESTORE) to let a wider audience
>> check the code in real life?
>
> I think the best testing we are going to get diversity wise is to create
> a per netns counter into dev_new_index() when net-next opens up.
>
> Having an ifindex that we can only set at netdevice creation time seems
> reasonable.

OK, thank you, Eric.

> Eric
> .
>

Thanks,
Pavel

diff --git a/net/core/dev.c b/net/core/dev.c
index 0ebaea1..5966e2f 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5533,7 +5533,12 @@ int register_netdevice(struct net_device *dev)
 		}
 	}
 
-	dev->ifindex = dev_new_index(net);
+	ret = -EBUSY;
+	if (!dev->ifindex)
+		dev->ifindex = dev_new_index(net);
+	else if (__dev_get_by_index(net, dev->ifindex))
+		goto err_uninit;
+
 	if (dev->iflink == -1)
 		dev->iflink = dev->ifindex;
 
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index 334b930..76e19aa 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -1801,8 +1801,6 @@ replay:
 			return -ENODEV;
 		}
 
-		if (ifm->ifi_index)
-			return -EOPNOTSUPP;
 		if (tb[IFLA_MAP] || tb[IFLA_MASTER] || tb[IFLA_PROTINFO])
 			return -EOPNOTSUPP;
 
@@ -1828,10 +1826,14 @@ replay:
 			return PTR_ERR(dest_net);
 
 		dev = rtnl_create_link(net, dest_net, ifname, ops, tb);
-
-		if (IS_ERR(dev))
+		if (IS_ERR(dev)) {
 			err = PTR_ERR(dev);
-		else if (ops->newlink)
+			goto out;
+		}
+
+		dev->ifindex = ifm->ifi_index;
+
+		if (ops->newlink)
 			err = ops->newlink(net, dev, tb, data);
 		else
 			err = register_netdevice(dev);

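The decision the first hunk adds to register_netdevice() can be traced with a toy user-space model: an index of zero means "allocate one", and a non-zero index is accepted only if no device in the namespace already holds it, otherwise the call fails with -EBUSY. All `model_*` names below are hypothetical and exist only for this sketch; the real function does much more work around this check.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define MODEL_MAX_DEVS 8

struct model_dev {
	int ifindex;   /* 0 means "let the allocator choose" */
};

/* Hypothetical stand-in for one network namespace. */
struct model_net {
	struct model_dev *devs[MODEL_MAX_DEVS];
	size_t ndevs;
	int last_index;
};

/* Plays the role of __dev_get_by_index(): find a device by index. */
static struct model_dev *model_get_by_index(struct model_net *net, int ifindex)
{
	for (size_t i = 0; i < net->ndevs; i++)
		if (net->devs[i]->ifindex == ifindex)
			return net->devs[i];
	return NULL;
}

/* Plays the role of dev_new_index(): next index not yet in use. */
static int model_new_index(struct model_net *net)
{
	do {
		net->last_index++;
	} while (model_get_by_index(net, net->last_index));
	return net->last_index;
}

/* Mirrors the patched logic: allocate when unset, refuse duplicates. */
static int model_register(struct model_net *net, struct model_dev *dev)
{
	assert(net != NULL && dev != NULL);
	if (net->ndevs == MODEL_MAX_DEVS)
		return -ENOMEM;   /* limit of the toy table, not a kernel rule */
	if (!dev->ifindex)
		dev->ifindex = model_new_index(net);
	else if (model_get_by_index(net, dev->ifindex))
		return -EBUSY;
	net->devs[net->ndevs++] = dev;
	return 0;
}
```

Registering a device with `ifindex` left at 0 and then one that requests index 5 both succeed in this model, while a second request for index 5 (or for the automatically assigned index) returns -EBUSY.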
Currently the RTM_NEWLINK results in -EOPNOTSUPP if the ifinfomsg->ifi_index
is not zero. I propose to allow requesting ifindices on link creation. This
is required by the checkpoint-restore to correctly restore a net namespace
(i.e. -- a container). The question what to do with pre-created devices such
as lo or sit fbdev is open, but for manually created devices this can be
solved by this patch.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>

---
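
From user space, the request described in this changelog is an ordinary RTM_NEWLINK message whose ifinfomsg->ifi_index is non-zero. The sketch below only builds such a message in memory and does not send it; `struct newlink_req`, `req_add_attr()` and `build_newlink()` are hypothetical helper names, and the device name, kind and index passed in are arbitrary illustration values. Actually sending it would need a NETLINK_ROUTE socket and sufficient privileges, and per the description above a kernel without this change answers with -EOPNOTSUPP.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <linux/netlink.h>
#include <linux/rtnetlink.h>
#include <linux/if_link.h>

/* Hypothetical request buffer: header, ifinfomsg, then room for attributes. */
struct newlink_req {
	struct nlmsghdr  nh;
	struct ifinfomsg ifm;
	char             attrs[256];
};

/* Append one attribute at the current end of the message. */
static struct rtattr *req_add_attr(struct newlink_req *req, unsigned short type,
				   const void *data, unsigned short len)
{
	struct rtattr *rta =
		(struct rtattr *)((char *)req + NLMSG_ALIGN(req->nh.nlmsg_len));

	rta->rta_type = type;
	rta->rta_len = RTA_LENGTH(len);
	if (len)
		memcpy(RTA_DATA(rta), data, len);
	req->nh.nlmsg_len = NLMSG_ALIGN(req->nh.nlmsg_len) + RTA_ALIGN(rta->rta_len);
	return rta;
}

/* Fill in an RTM_NEWLINK request that asks for a specific ifindex. */
static void build_newlink(struct newlink_req *req, const char *name,
			  const char *kind, int ifindex)
{
	struct rtattr *linkinfo;

	/* Keep the toy buffer from overflowing. */
	assert(strlen(name) < 16 && strlen(kind) < 32);

	memset(req, 0, sizeof(*req));
	req->nh.nlmsg_len = NLMSG_LENGTH(sizeof(req->ifm));
	req->nh.nlmsg_type = RTM_NEWLINK;
	req->nh.nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK | NLM_F_CREATE | NLM_F_EXCL;
	req->ifm.ifi_family = AF_UNSPEC;
	req->ifm.ifi_index = ifindex;   /* non-zero: the index being requested */

	req_add_attr(req, IFLA_IFNAME, name, (unsigned short)(strlen(name) + 1));

	/* IFLA_LINKINFO is a nested attribute that carries IFLA_INFO_KIND. */
	linkinfo = req_add_attr(req, IFLA_LINKINFO, NULL, 0);
	req_add_attr(req, IFLA_INFO_KIND, kind, (unsigned short)strlen(kind));
	linkinfo->rta_len =
		(unsigned short)((char *)req + req->nh.nlmsg_len - (char *)linkinfo);
}
```

A caller would build the request with, for example, `build_newlink(&req, "veth9", "veth", 10)` and then pass `req.nh.nlmsg_len` bytes starting at `&req` to `send()` on a netlink socket.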