Message ID | 20091009210909.GA9836@auslistsprd01.us.dell.com |
---|---|
State | RFC, archived |
Delegated to: | David Miller |
On Fri, 9 Oct 2009 16:09:09 -0500 Matt Domsch <Matt_Domsch@dell.com> wrote: > On Fri, Oct 09, 2009 at 09:00:01AM -0500, Narendra K wrote: > > On Fri, Oct 09, 2009 at 07:12:07PM +0530, K, Narendra wrote: > > > > example udev config: > > > > SUBSYSTEM=="net", > > > SYMLINK+="net/by-mac/$sysfs{ifindex}.$sysfs{address}" > > > > > > work as well. But coupling the ifindex to the MAC address like this > > > doesn't work. (In general, coupling any two unrelated attributes when > > > trying to do persistent names doesn't work.) > > > > > Attaching the latest patch incorporating review comments. > > Same patch, rebased to linux-next. > > By creating character devices for every network device, we can use > udev to maintain alternate naming policies for devices, including > additional names for the same device, without interfering with the > name that the kernel assigns a device. > > This is conditionalized on CONFIG_NET_CDEV. If enabled (the default), > device nodes will automatically be created in /dev/netdev/ for each > network device. (/dev/net/ is already populated by the tun device.) > > These device nodes are not functional at the moment - open() returns > -ENOSYS. Their only purpose is to provide userspace with a kernel > name to ifindex mapping, in a form that udev can easily manage. > > Signed-off-by: Jordan Hargrave <Jordan_Hargrave@dell.com> > Signed-off-by: Narendra K <Narendra_K@dell.com> > Signed-off-by: Matt Domsch <Matt_Domsch@dell.com> Maybe I'm dense but can't see why having a useless /dev/net/ symlinks is a good interface choice. Perhaps you should explain the race between PCI scan and udev in more detail, and why solving it in either of those places won't work. As it stands you are proposing yet another wart to the already complex set of network interface API's which has implications for security as well as increasing the number of possible bugs.
On Fri, Oct 09, 2009 at 07:44:01PM -0700, Stephen Hemminger wrote: > Maybe I'm dense but can't see why having a useless /dev/net/ symlinks > is a good interface choice. Perhaps you should explain the race between > PCI scan and udev in more detail, and why solving it in either of those > places won't work. As it stands you are proposing yet another wart to > the already complex set of network interface API's which has implications > for security as well as increasing the number of possible bugs. The fundamental challenge is that system administrators, particularly those of server-class hardware with multiple network ports present (some on the motherboard, some on add-in cards), have the not-so-unreasonable expectation that there is a deterministic mapping between those ports and the name one uses to address those ports. The fundamental roadblock to this is that enumeration != naming, except that it is for network devices, and we keep changing the enumeration order. Today, port naming is completely nondeterministic. If you have but one NIC, there are few chances to get the name wrong (it'll be eth0). If you have >1 NIC, chances increase to get it wrong. The complexity arises at multiple levels. First, device driver load order. In the 2.4 kernel days, and even mostly early 2.6 kernel days, the order in which network drivers loaded played a role in determining the name of the device. Drivers loaded first would get their devices named first. If I have two types of devices, say an e100-driven NIC and a tg3-driven NIC, I could figure out that the names would be eth0=e100 and eth1=tg3 by setting the load order in /etc/modules.conf (now modprobe.conf). If I wanted the other order, fine, just switch it around in modules.conf and reboot. 
OS installers, being the first running instance of Linux, before modprobe.conf existed to set that ordering, had to have other mechanisms to load drivers (often manually, or if programmatically such as in a kickstart or autoyast file, was still somewhat fixed). With the advent of modaliases + udev, now modprobe.conf doesn't contain this ordering anymore, and udev loads the drivers. So while it wasn't perfect, it was better than nothing, and that's gone now. It gets even worse as, to speed up boot time, modprobes can be run in parallel, and even within individual drivers, the NICs get initialized (and named) in parallel. Further confusing things, some devices need firmware loaded into them before getting names assigned, which is done from userspace, and they race. Second, PCI device list order. In the 2.4 kernel days, the PCI device list was scanned "breadth-first" (for each bus; for each device; for each function; do load...). FWIW, Windows still does this. It gives BIOS, which assigns PCI bus numbers, a chance to put LOMs at a lower bus number than add-in cards. Module load order still mattered, but at least if you had say 2 e1000 ports as LOMs, and 2 e1000 ports on add-in cards, you pretty much knew the ordering would be eth0 as lowest bdf on the motherboard, eth1 as next bdf on the motherboard, and eth2 and 3 as the add-in cards in ascending slot order. With the advent of PCI hot plug in the 2.5 kernel series, the breadth-first ordering became depth-first. (for each bus; for each device; if the device is a bridge, scan the busses behind it.). This caused NICs on bus 0 device 5, and bus 1 device 3, (eth0 and 1 respectively) to be enumerated differently due to the a bridge from bus 0 to bus 1 at 0:4. My crude hack of pci=bfsort, with some dmi strings to match and auto-enable, at least reverted this back to the ordering the 2.4 kernel and Windows used. 
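The difference between the two scan orders described above can be sketched with a toy model. This is only an illustration: the `TOPOLOGY` table and the function names are hypothetical, and the real kernel PCI scan is considerably more involved. The topology mirrors the example in the text, with a bridge at 0:4 leading to bus 1, a NIC at 0:5, and a NIC at 1:3.

```python
from collections import deque

# Hypothetical toy topology: bus number -> list of (device, kind, child bus).
TOPOLOGY = {
    0: [(4, "bridge", 1), (5, "nic", None)],
    1: [(3, "nic", None)],
}


def breadth_first(topology, root=0):
    """Visit every device on a bus before looking behind any bridge."""
    found, queue = [], deque([root])
    while queue:
        bus = queue.popleft()
        for dev, kind, child in sorted(topology[bus]):
            if kind == "nic":
                found.append((bus, dev))
            elif kind == "bridge":
                queue.append(child)  # scan the child bus later
    return found


def depth_first(topology, bus=0):
    """Descend behind a bridge as soon as it is encountered."""
    found = []
    for dev, kind, child in sorted(topology[bus]):
        if kind == "nic":
            found.append((bus, dev))
        elif kind == "bridge":
            found.extend(depth_first(topology, child))  # scan child bus now
    return found


def names(order):
    """Assign eth0, eth1, ... in discovery order, as 'bus:device'."""
    return {f"eth{i}": f"{bus}:{dev}" for i, (bus, dev) in enumerate(order)}


print("breadth-first:", names(breadth_first(TOPOLOGY)))
print("depth-first:  ", names(depth_first(TOPOLOGY)))
```

With this topology the breadth-first scan names 0:5 as eth0 and 1:3 as eth1, while the depth-first scan swaps them, which is the reordering the message describes.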
Now we have to keep adding systems to this DMI list (Dell has a number of systems on this list today; HP has even more). And it doesn't completely solve the problem, just masks it. So, to address the ordering problem, I placed a constraint on our server hardware teams, forcing them to lay out their boards and assign PCIe lanes and bus numbers, such that at least the designed "first" LOM would get found first in either depth-first or breadth-first order. Our 10G and 11G servers have this restriction in place, though it wasn't easy. And it's gotten even harder, as the PCIe switches expand the number of lanes available. We no longer have the traditional tiered buses architecture, but the PCI layer for this purpose thinks we do. I need to remove this constraint on the hardware teams - it's gotten to be impossible for the chipset lanes to be laid out efficiently with this constraint. All of the above just papered over the enumeration != naming problem. Third, stateless computing is becoming more and more commonplace. The Field Replaceable Unit is the server itself. Got a bad server? Pull it out, move the disks to an identical unit, insert the new server, and go. Fix the bad server offline and bring it back. In this model, having MAC addresses as the mechanism that is providing the determinism (/etc/mactab or udev persistent naming rules) breaks, because the MAC addresses of the ports on the new server won't be the same as on the old server. HP even has a technology to solve _this_ problem (in their blade chassis) - Virtual Connect. The MACs get assigned by the chassis to the blades at POST, and are fixed to the slot. Slick, and Dell has an even more flexible similar feature FlexAddress. This doesn't solve the OS installer problem of "which of these NICs should I use to do an install?" but it does recognize the problem space and tries to overcome it. Fourth, for OS installers, choosing which NIC to use at installtime, when all the NICs are plugged in, can be difficult. 
PXE environments, using pxelinux and its IPAPPEND 2 option, will append "BOOTIF=xx:xx:xx:xx:xx:xx" to the kernel command line, that containing the MAC address of the NIC used for PXE. Neat trick. Yes, we then had to teach the OS installers to recognize and use this. But it only works if you PXE boot, and only for that one NIC. Fifth, network devices can have only a single name. eth0. If we look at disks, we see udev manages a tree of symlinks for /dev/disk/by-label, /dev/disk/by-path, /dev/disk/by-uuid. And as a system admin, if I wanted to also create a udev rule for /dev/disk/by-function (boot, swap, mattsstorage), it's trivial to do so. Why can't we have this flexibility for network devices too? So, how do we get deterministic naming for all the NICs in a system? That's what I'm going for. Picture a network switch, with several blades, and several ports on each blade. The network admin addresses each port as say 1/16 (the 16th port on blade 1, clearly labeled). The parallel on servers is the chassis label printed on the outside (say, "Gb1"). But due the above, there is no guarantee, and in fact little chance, that Gb1 will be consistently named eth0 - it may vary from boot to boot. That's full of fail. For a concrete example, the 4 bnx2 chips in my PowerEdge R610 with a current 2.6 kernel, loading only one driver, the ports get assigned names in nondeterministic order on each boot. Given that the ifcfg-eth* rules, netfilter rules, and the rest all expect deterministic naming, massive failure ensues unless some form of determinism is brought back in. The idea to use a character device node to expose the ifindex value, and udev to manage a tree of symlinks to it, really follows the model used today for disks. It allows us to get deterministic names for devices (albeit, the names are symlinks), and multiple names for devices (through multiple symlink rules). 
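As an illustration of the BOOTIF trick mentioned above, here is a minimal sketch of how an installer might pick the PXE NIC from the kernel command line. The function names are hypothetical. The message shows the value colon-separated; to my recollection pxelinux actually emits it hyphen-separated with a leading hardware-type octet (for example `01-` for Ethernet), so the sketch accepts both forms. Treat that detail as an assumption to check against the pxelinux documentation.

```python
import re


def parse_bootif(cmdline):
    """Return the BOOTIF MAC as aa:bb:cc:dd:ee:ff, or None if absent/invalid."""
    match = re.search(r"(?:^|\s)BOOTIF=(\S+)", cmdline)
    if not match:
        return None
    octets = re.split(r"[:-]", match.group(1).lower())
    if len(octets) == 7:
        # Assumed pxelinux form: drop the leading hardware-type octet.
        octets = octets[1:]
    if len(octets) != 6:
        return None
    if not all(re.fullmatch(r"[0-9a-f]{2}", o) for o in octets):
        return None
    return ":".join(octets)


def find_boot_interface(cmdline, mac_by_name):
    """Map the BOOTIF MAC to an interface name.

    mac_by_name is a dict such as {"eth0": "00:11:22:33:44:55"}; on a real
    system it could be filled from /sys/class/net/<name>/address.
    """
    mac = parse_bootif(cmdline)
    for name, addr in sorted(mac_by_name.items()):
        if addr.lower() == mac:
            return name
    return None
```

As the message notes, this identifies only the one NIC that was used for PXE; it says nothing about the remaining ports.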
That some people want to use the char device to call ioctl() and read/write, as is possible on the BSDs, would just be gravy IMHO. It does require a change in behavior for a system administrator. Instead of hard-coding 'eth0' into her scripts, she uses '/dev/net/by-function/boot' or somesuch. But then that name is guaranteed to always refer to the "right" NIC. Every admin I've spoken to is willing to make this kind of change, as long as they get the consistent, deterministic naming they expect but don't have today. And it does require patching userspace apps to take both a kernel device name, or a path, and to resolve the path to device name or ifindex. We wrote libnetdevname (really, one function), and have patches for several userspace apps to use it, to prove it can be done. One alternative would be to do something using the sysfs ifindex value already exported. e.g. /sys/devices/pci0000:00/0000:00:05.0/0000:05:00.0/0000:06:07.0/net/eth0/ifindex but we have never had symlinks from /dev into /sys before (doesn't mean we couldn't though). In that case, udev would grow to manage /dev/net/by-chassis-label/Embedded_NIC_1 -> /sys/devices/.../net/eth0, and libnetdevname would be used to follow the symlink in applications. This approach could solve my problem without (many or any?) kernel changes needed, but wouldn't help those who want to do ioctl/read/write to a devnode. Given the problem, I really do need a solution. I've proposed one method, and an alternative, but I can't afford to let the problem stay unaddressed any longer, and need a clear direction to be chosen. The char device gives me what I need, and others what they want also. Thanks for listening to the diatribe. For more examples and workarounds that we've been telling our customers for several years, check out http://linux.dell.com/papers.shtml for the Network Interface Card Naming whitepaper.
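The sysfs-symlink alternative described above can be sketched as follows. This is not libnetdevname's actual code or API; `resolve_netdev` and its `sys_class_net` parameter are hypothetical names, and the sketch assumes the symlink target is a sysfs net device directory containing an `ifindex` file, as in the example path given in the message.

```python
import os


def resolve_netdev(name_or_path, sys_class_net="/sys/class/net"):
    """Return (kernel_name, ifindex) for a kernel name or a symlink path.

    A bare name such as "eth0" is looked up under sys_class_net; anything
    containing a slash is treated as a path (for example a udev-managed
    symlink) and followed to the device directory it points at.
    """
    if "/" in name_or_path:
        target = name_or_path
    else:
        target = os.path.join(sys_class_net, name_or_path)
    real = os.path.realpath(target)  # follow symlinks to the device directory
    with open(os.path.join(real, "ifindex")) as f:
        ifindex = int(f.read().strip())
    # The last path component of the device directory is the kernel name.
    return os.path.basename(real), ifindex
```

An application patched this way could accept either `eth0` or a path like `/dev/net/by-chassis-label/Embedded_NIC_1` on its command line and end up with the same kernel name and ifindex.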
On Fri, Oct 09, 2009 at 11:40:57PM -0500, Matt Domsch wrote: > The fundamental roadblock to this is that enumeration != naming, > except that it is for network devices, and we keep changing the > enumeration order. No, the hardware changes the enumeration order, it places _no_ guarantees on what order stuff will be found in. So this is not the kernel changing, just to be clear. Again, I have a machine here that likes to reorder PCI devices every 4th or so boot times, and that's fine according to the PCI spec. Yeah, it's a crappy BIOS, but the manufacturer rightly pointed out that it is not in violation of anything. > Today, port naming is completely nondeterministic. If you have but > one NIC, there are few chances to get the name wrong (it'll be eth0). > If you have >1 NIC, chances increase to get it wrong. That is why all distros name network devices based on the only deterministic thing they have today, the MAC address. I still fail to see why you do not like this solution, it is honestly the only way to properly name network devices in a sane manner. All distros also provide a way to easily rename the network devices, to place a specific name on a specific MAC address, so again, this should all be solved already. No matter how badly your BIOS teams mess up the PCI enumeration order :) thanks, greg k-h -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Greg,

> No, the hardware changes the enumeration order, it places _no_
> guarantees on what order stuff will be found in. So this is not the
> kernel changing, just to be clear.
> Again, I have a machine here that likes to reorder PCI devices every 4th
> or so boot times, and that's fine according to the PCI spec. Yeah, it's
> a crappy BIOS, but the manufacturer rightly pointed out that it is not
> in violation of anything.
>

I think the open call should be implemented then. By the patch very little knowledge is being shared on type of network implementation it is trying to do. Also it is messing with core data structures and procedures. This seems to be simplified by implementing the other operations like poll().

> That is why all distros name network devices based on the only
> deterministic thing they have today, the MAC address. I still fail to
> see why you do not like this solution, it is honestly the only way to
> properly name network devices in a sane manner.

This is a feature that needs to be implemented. As per the rules followed.

>
> All distros also provide a way to easily rename the network devices, to
> place a specific name on a specific MAC address, so again, this should
> all be solved already.
>
> No matter how badly your BIOS teams mess up the PCI enumeration order :)

This is a problem, but I think this can be solved by implementing some of the routines in the network device.
On Fri, Oct 09, 2009 at 10:23:08PM -0700, Greg KH wrote:
> On Fri, Oct 09, 2009 at 11:40:57PM -0500, Matt Domsch wrote:
> > The fundamental roadblock to this is that enumeration != naming,
> > except that it is for network devices, and we keep changing the
> > enumeration order.
>
> No, the hardware changes the enumeration order, it places _no_
> guarantees on what order stuff will be found in. So this is not the
> kernel changing, just to be clear.

Over time the kernel has changed its enumeration mechanisms, and introduced parallelism into the process (which is a good thing), which, from a user perspective, makes names nondeterministic. Yes, fixing this up by hard-coding MAC addresses after install has been the traditional mechanism to address this. I think there's a better way.

> Again, I have a machine here that likes to reorder PCI devices every 4th
> or so boot times, and that's fine according to the PCI spec. Yeah, it's
> a crappy BIOS, but the manufacturer rightly pointed out that it is not
> in violation of anything.

I haven't encountered this myself, but yes, it's valid but annoying.

> > Today, port naming is completely nondeterministic. If you have but
> > one NIC, there are few chances to get the name wrong (it'll be eth0).
> > If you have >1 NIC, chances increase to get it wrong.
>
> That is why all distros name network devices based on the only
> deterministic thing they have today, the MAC address. I still fail to
> see why you do not like this solution, it is honestly the only way to
> properly name network devices in a sane manner.
>
> All distros also provide a way to easily rename the network devices, to
> place a specific name on a specific MAC address, so again, this should
> all be solved already.

It's not the only way, it introduces state where there's a desire for a stateless solution, it's useless for getting all the names right at initial OS install time, and it restricts us to a single "name" for a given device.
We can get additional information from BIOS. SMBIOS 2.6 (types 9 and 41) has the fields to let us get a "label" for a device at a given b/d/f. On my PowerEdge R610, I see "Embedded NIC 1" .. "Embedded NIC 4" for the 4 LOMs. These labels have a clear correlation to the labels on the back of the chassis at these ports. biosdevname can parse and report this. HP made a similar vendor-specific extension to SMBIOS for their platforms, which biosdevname also parses. Even if BIOS decides they need to renumber the busses on every boot, it can keep this table correct. (insert general mistrust of BIOS authors rant; that's not the point here.)

biosdevname can be used in udev rules to create multiple names for a given device. Rules such as:

PROGRAM="/sbin/biosdevname --policy=all_names -i %k", SYMLINK+="net/by-slot-name/%c", OPTIONS+="string_escape=replace"
PROGRAM="/sbin/biosdevname --policy=smbios_names -i %k", SYMLINK+="net/by-chassis-label/%c", OPTIONS+="string_escape=replace"

SMBIOS has its own problems, specifically that it's not hot-plug aware (it's a static table created during POST). And if a better way is found (perhaps through the PCI SIG or ACPI), great, biosdevname can be extended to use it. But, without at least a change in udev or the kernel, it doesn't do any good.

> No matter how badly your BIOS teams mess up the PCI enumeration
> order :)

In my case, the BIOS for a given system always configures the ports the same way, and assigns b/d/f the same way. With no change in the BIOS or hardware, I still see the ports enumerated differently on each boot. :-(
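To make the effect of the chassis-label rule concrete, here is a rough sketch of the mapping it is meant to produce. Everything in it is illustrative: the `LABELS` table and its b/d/f values are made up (on a real system the labels would come from SMBIOS via biosdevname), and `escape_label` is only an approximation of what udev's `string_escape=replace` does, namely replacing characters it considers unsafe with underscores.

```python
import re

# Hypothetical label table keyed by PCI bus:device.function.
LABELS = {
    "01:00.0": "Embedded NIC 1",
    "01:00.1": "Embedded NIC 2",
}


def escape_label(label):
    """Approximate string_escape=replace: unsafe characters become '_'."""
    return re.sub(r"[^A-Za-z0-9#+\-.:=@_]", "_", label)


def symlinks_for(bdf_by_kernel_name, labels=LABELS):
    """Return {symlink path: kernel name} for every device with a label."""
    links = {}
    for kname, bdf in bdf_by_kernel_name.items():
        label = labels.get(bdf)
        if label:
            path = "/dev/net/by-chassis-label/" + escape_label(label)
            links[path] = kname
    return links


# Two boots in which the kernel handed out names in a different order.
boot_a = {"eth0": "01:00.0", "eth1": "01:00.1"}
boot_b = {"eth0": "01:00.1", "eth1": "01:00.0"}
print(symlinks_for(boot_a))
print(symlinks_for(boot_b))
```

In both boots the `Embedded_NIC_1` link refers to the device at the same b/d/f, even though the kernel name behind it differs, which is the property the symlink tree is meant to provide.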
On Sat, Oct 10, 2009 at 07:47:32AM -0500, Matt Domsch wrote: > On Fri, Oct 09, 2009 at 10:23:08PM -0700, Greg KH wrote: > > On Fri, Oct 09, 2009 at 11:40:57PM -0500, Matt Domsch wrote: > > > The fundamental roadblock to this is that enumeration != naming, > > > except that it is for network devices, and we keep changing the > > > enumeration order. > > > > No, the hardware changes the enumeration order, it places _no_ > > guarantees on what order stuff will be found in. So this is not the > > kernel changing, just to be clear. > > Over time the kernel has changed its enumeration mechanisms, and > introduced parallelism into the process (which is a good thing), > which, from a user perspective, makes names nondeterministic. Yes, > fixing this up by hard-coding MAC addresses after install has been > the traditional mechanism to address this. I think there's a better > way. Ok, but that way can be done in userspace, without the need for this char device, right? > > > Today, port naming is completely nondeterministic. If you have but > > > one NIC, there are few chances to get the name wrong (it'll be eth0). > > > If you have >1 NIC, chances increase to get it wrong. > > > > That is why all distros name network devices based on the only > > deterministic thing they have today, the MAC address. I still fail to > > see why you do not like this solution, it is honestly the only way to > > properly name network devices in a sane manner. > > > > All distros also provide a way to easily rename the network devices, to > > place a specific name on a specific MAC address, so again, this should > > all be solved already. > > It's not the only way, it introduces state where there's a desire for > a stateless solution, it's useless for getting all the names right at > initial OS install time, and it restricts us to a single "name" for a > given device. > > We can get additional information from BIOS. 
SMBIOS 2.6 (types 9 and > 41) has the fields to let us get a "label" for an device at a given > b/d/f. On my PowerEdge R610, I see "Embedded NIC 1" .. "Embedded NIC > 4" for the 4 LOMs. These labels have a clear correlation to the > labels on the back of the chassis at these ports. biosdevname can > parse and report this. HP made a similar vendor-specific extension to > SMBIOS for their platforms, which biosdevname also parses. Even if > BIOS decides they need to renumber the busses on every boot, it can > keep this table correct. (insert general mistrust of BIOS authors > rant; that's not the point here.) > > biosdevname can be used in udev rules to create multiple names for a > given device. Rules such as: Yes, if you want multiple ways to name a network device, then you need the char nodes. But without that, you can just pick "always use the biosdevname" type option from your distro setup screen and go with that. Then you have everything always working properly from the very beginning. > > No matter how badly your BIOS teams mess up the PCI enumeration > > order :) > > In my case, the BIOS for a given system always configures the ports > the same way, and assigns b/d/f the same way. With no change in the > BIOS or hardware, I still see the ports enumerated differently on each > boot. :-( Again, that's legal from a PCI standpoint :) So you really want this for multiple ways to name the same network device. That's a choice the network developers are going to have to make, as to if that is going to be a legal thing to have happen or not. But this code is not a requirement to "solve" the fact that network devices can show up in different order, that problem can be solved as long as the user picks a single way to name the devices, using tools that are already present today in distros. 
thanks, greg k-h
On Sat, Oct 10, 2009 at 01:47:39PM +0530, Sujit K M wrote:
> Greg,
>
> > No, the hardware changes the enumeration order, it places _no_
> > guarantees on what order stuff will be found in. So this is not the
> > kernel changing, just to be clear.
> > Again, I have a machine here that likes to reorder PCI devices every 4th
> > or so boot times, and that's fine according to the PCI spec. Yeah, it's
> > a crappy BIOS, but the manufacturer rightly pointed out that it is not
> > in violation of anything.
> >
>
> I think the open call should be implemented then. By the patch very little
> knowledge is being shared on type of network implementation it is trying to
> do.

What would open() accomplish? What good would the file descriptor be? What could you use it for?

> Also it is messing with core data structures and procedures. This seems
> to be simplified by implementing the other operations like poll().

I don't understand.

> > That is why all distros name network devices based on the only
> > deterministic thing they have today, the MAC address. I still fail to
> > see why you do not like this solution, it is honestly the only way to
> > properly name network devices in a sane manner.
>
> This is a feature that needs to be implemented. As per the rules followed.

This feature is already implemented today, all distros have it.

> > All distros also provide a way to easily rename the network devices, to
> > place a specific name on a specific MAC address, so again, this should
> > all be solved already.
> >
> > No matter how badly your BIOS teams mess up the PCI enumeration order :)
>
> This is a problem, but I think this can be solved by implementing some of the
> routines in the network device.

I don't, see the rules that your distro ships today for persistent network devices, it's already there, no need to change the kernel at all.
thanks, greg k-h
Greg KH wrote: > On Sat, Oct 10, 2009 at 07:47:32AM -0500, Matt Domsch wrote: >> On Fri, Oct 09, 2009 at 10:23:08PM -0700, Greg KH wrote: >>> On Fri, Oct 09, 2009 at 11:40:57PM -0500, Matt Domsch wrote: >>>> The fundamental roadblock to this is that enumeration != >>>> naming, except that it is for network devices, and we keep >>>> changing the enumeration order. >>> No, the hardware changes the enumeration order, it places _no_ >>> guarantees on what order stuff will be found in. So this is not >>> the kernel changing, just to be clear. >> Over time the kernel has changed its enumeration mechanisms, and >> introduced parallelism into the process (which is a good thing), >> which, from a user perspective, makes names nondeterministic. Yes, >> fixing this up by hard-coding MAC addresses after install has been >> the traditional mechanism to address this. I think there's a >> better way. > > Ok, but that way can be done in userspace, without the need for this > char device, right? For the record -- when I tried to send a patch that did exactly this (provided an option to use by-path persistence for network drivers), it was rejected because "that doesn't work for USB". True, it doesn't. But by-mac (what we have today) doesn't work for replacing motherboards in a random home system (that can't override the MAC address in the BIOS), either. So why not provide both alternatives? As you say below, it's up to the network devs whether this should be allowed... >> biosdevname can be used in udev rules to create multiple names for >> a given device. Rules such as: > > Yes, if you want multiple ways to name a network device, then you > need the char nodes. But without that, you can just pick "always use > the biosdevname" type option from your distro setup screen and go > with that. Then you have everything always working properly from the > very beginning. *If* biosdevname works on your system. It doesn't on mine: this SMBIOS extension doesn't exist. 
:-) > So you really want this for multiple ways to name the same network > device. That's a choice the network developers are going to have to > make, as to if that is going to be a legal thing to have happen or > not. Yes. So do I, actually (for what little that's worth)... > But this code is not a requirement to "solve" the fact that network > devices can show up in different order, that problem can be solved as > long as the user picks a single way to name the devices, using tools > that are already present today in distros. This code is not a requirement, no. But -- as you say -- it does provide a halfway-decent way to assign multiple names to a NIC. And that provides admins the choice to use a couple different persistence schemes, depending on how they expect their hardware to work. (It *may* even be possible to use some kind of layer-2 traffic to see what else is on the connected network and provide symlinks based on that. IPv6 autoconfig type of thing, maybe. That's probably a *lot* more complicated, and may be impossible, but would be even closer to what I think Dell customers are asking for based on Matt's posts.)
On Fri, 9 Oct 2009, Greg KH wrote: > On Fri, Oct 09, 2009 at 11:40:57PM -0500, Matt Domsch wrote: > > The fundamental roadblock to this is that enumeration != naming, > > except that it is for network devices, and we keep changing the > > enumeration order. > > No, the hardware changes the enumeration order, it places _no_ > guarantees on what order stuff will be found in. So this is not the > kernel changing, just to be clear. > > Again, I have a machine here that likes to reorder PCI devices every 4th > or so boot times, and that's fine according to the PCI spec. Yeah, it's > a crappy BIOS, but the manufacturer rightly pointed out that it is not > in violation of anything. > > > Today, port naming is completely nondeterministic. If you have but > > one NIC, there are few chances to get the name wrong (it'll be eth0). > > If you have >1 NIC, chances increase to get it wrong. > > That is why all distros name network devices based on the only > deterministic thing they have today, the MAC address. I still fail to > see why you do not like this solution, it is honestly the only way to > properly name network devices in a sane manner. > > All distros also provide a way to easily rename the network devices, to > place a specific name on a specific MAC address, so again, this should > all be solved already. > > No matter how badly your BIOS teams mess up the PCI enumeration order :) No comment on the specific implementation decision, but I am in the process of setting up a large number of test systems with identical hardware configurations, and using a master disk image to clone all the test systems. The biggest pain in this process is identiying the MAC addresses for each of the six or more network interfaces in each test system (we want eth0...ethN to always reference the same physical port on the test systems), and then having to modify the 70-persistent-net.rules udev file and the HWADDR entry for all the ifcfg-ethX files to reflect the correct MAC addresses. 
It would be fantastic if there were some mechanism for making this part of the process unnecessary. -Bill
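For readers unfamiliar with the manual step Bill describes, the sketch below generates the two pieces of per-machine state from an ordered list of MAC addresses. It is an illustration only: the function names are hypothetical, the rule keys are modeled on typical udev persistent-net rules and may need adjusting for a given distribution, and the MAC addresses shown are made up. It automates the edit rather than removing the need for it, so it does not address the underlying complaint.

```python
def persistent_net_rules(macs):
    """Build udev rule lines binding eth0..ethN to MACs in the given order.

    The order of `macs` is the desired port order (first entry -> eth0).
    """
    lines = []
    for index, mac in enumerate(macs):
        lines.append(
            'SUBSYSTEM=="net", ACTION=="add", '
            f'ATTR{{address}}=="{mac.lower()}", NAME="eth{index}"'
        )
    return "\n".join(lines) + "\n"


def ifcfg_hwaddr_line(mac):
    """Build the HWADDR line for an ifcfg-ethX file."""
    return f"HWADDR={mac.upper()}"


# Made-up addresses for one cloned system, listed in physical port order.
example_macs = ["00:1A:2B:3C:4D:5E", "00:1A:2B:3C:4D:5F"]
print(persistent_net_rules(example_macs), end="")
print(ifcfg_hwaddr_line(example_macs[0]))
```

The per-system list of MACs in physical port order still has to be collected somehow, which is exactly the information a label-based or path-based naming scheme would make unnecessary.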
On Fri, 9 Oct 2009 23:40:57 -0500 Matt Domsch <Matt_Domsch@dell.com> wrote: > On Fri, Oct 09, 2009 at 07:44:01PM -0700, Stephen Hemminger wrote: > > Maybe I'm dense but can't see why having a useless /dev/net/ symlinks > > is a good interface choice. Perhaps you should explain the race between > > PCI scan and udev in more detail, and why solving it in either of those > > places won't work. As it stands you are proposing yet another wart to > > the already complex set of network interface API's which has implications > > for security as well as increasing the number of possible bugs. > > The fundamental challenge is that system administrators, particularly > those of server-class hardware with multiple network ports present > (some on the motherboard, some on add-in cards), have the > not-so-unreasonable expectation that there is a deterministic mapping > between those ports and the name one uses to address those ports. > > The fundamental roadblock to this is that enumeration != naming, > except that it is for network devices, and we keep changing the > enumeration order. > > Today, port naming is completely nondeterministic. If you have but > one NIC, there are few chances to get the name wrong (it'll be eth0). > If you have >1 NIC, chances increase to get it wrong. > > The complexity arises at multiple levels. > > First, device driver load order. In the 2.4 kernel days, and even > mostly early 2.6 kernel days, the order in which network drivers > loaded played a role in determining the name of the device. Drivers > loaded first would get their devices named first. If I have two types > of devices, say an e100-driven NIC and a tg3-driven NIC, I could > figure out that the names would be eth0=e100 and eth1=tg3 by setting > the load order in /etc/modules.conf (now modprobe.conf). If I wanted > the other order, fine, just switch it around in modules.conf and > reboot. 
OS installers, being the first running instance of Linux, > before modprobe.conf existed to set that ordering, had to have other > mechanisms to load drivers (often manually, or if programmatically > such as in a kickstart or autoyast file, was still somewhat fixed). > > With the advent of modaliases + udev, now modprobe.conf doesn't > contain this ordering anymore, and udev loads the drivers. So while > it wasn't perfect, it was better than nothing, and that's gone now. > > It gets even worse as, to speed up boot time, modprobes can be run in > parallel, and even within individual drivers, the NICs get initialized > (and named) in parallel. Further confusing things, some devices need > firmware loaded into them before getting names assigned, which is done > from userspace, and they race. > > Second, PCI device list order. In the 2.4 kernel days, the PCI device > list was scanned "breadth-first" (for each bus; for each device; for > each function; do load...). FWIW, Windows still does this. It gives > BIOS, which assigns PCI bus numbers, a chance to put LOMs at a lower > bus number than add-in cards. Module load order still mattered, but > at least if you had say 2 e1000 ports as LOMs, and 2 e1000 ports on > add-in cards, you pretty much knew the ordering would be eth0 as > lowest bdf on the motherboard, eth1 as next bdf on the motherboard, > and eth2 and 3 as the add-in cards in ascending slot order. > > With the advent of PCI hot plug in the 2.5 kernel series, the > breadth-first ordering became depth-first. (for each bus; for each > device; if the device is a bridge, scan the busses behind it.). This > caused NICs on bus 0 device 5, and bus 1 device 3, (eth0 and 1 > respectively) to be enumerated differently due to the a bridge from > bus 0 to bus 1 at 0:4. My crude hack of pci=bfsort, with some dmi > strings to match and auto-enable, at least reverted this back to the > ordering the 2.4 kernel and Windows used. 
Now we have to keep adding > systems to this DMI list (Dell has a number of systems on this list > today; HP has even more). And it doesn't completely solve the > problem, just masks it. > > So, to address the ordering problem, I placed a constraint on our > server hardware teams, forcing them to lay out their boards and assign > PCIe lanes and bus numbers, such that at least the designed "first" > LOM would get found first in either depth-first or breadth-first > order. Our 10G and 11G servers have this restriction in place, though > it wasn't easy. And it's gotten even harder, as the PCIe switches > expand the number of lanes available. We no longer have the > traditional tiered buses architecture, but the PCI layer for this > purpose thinks we do. I need to remove this constraint on the > hardware teams - it's gotten to be impossible for the chipset lanes to > be laid out efficiently with this constraint. > > All of the above just papered over the enumeration != naming problem. > > Third, stateless computing is becoming more and more commonplace. The > Field Replaceable Unit is the server itself. Got a bad server? Pull > it out, move the disks to an identical unit, insert the new server, > and go. Fix the bad server offline and bring it back. In this model, > having MAC addresses as the mechanism that is providing the > determinism (/etc/mactab or udev persistent naming rules) breaks, > because the MAC addresses of the ports on the new server won't be the > same as on the old server. HP even has a technology to solve _this_ > problem (in their blade chassis) - Virtual Connect. The MACs get > assigned by the chassis to the blades at POST, and are fixed to the > slot. Slick, and Dell has an even more flexible similar feature > FlexAddress. This doesn't solve the OS installer problem of "which of > these NICs should I use to do an install?" but it does recognize the > problem space and tries to overcome it. 
> > Fourth, for OS installers, choosing which NIC to use at install time, > when all the NICs are plugged in, can be difficult. PXE environments, > using pxelinux and its IPAPPEND 2 option, will append > "BOOTIF=xx:xx:xx:xx:xx:xx" to the kernel command line, that > containing the MAC address of the NIC used for PXE. Neat trick. Yes, > we then had to teach the OS installers to recognize and use this. But > it only works if you PXE boot, and only for that one NIC. > > Fifth, network devices can have only a single name. eth0. If we look > at disks, we see udev manages a tree of symlinks for > /dev/disk/by-label, /dev/disk/by-path, /dev/disk/by-uuid. And as a > system admin, if I wanted to also create a udev rule for > /dev/disk/by-function (boot, swap, mattsstorage), it's trivial to do > so. Why can't we have this flexibility for network devices too? > > So, how do we get deterministic naming for all the NICs in a system? > That's what I'm going for. Picture a network switch, with several > blades, and several ports on each blade. The network admin addresses > each port as say 1/16 (the 16th port on blade 1, clearly labeled). > The parallel on servers is the chassis label printed on the outside > (say, "Gb1"). But due to the above, there is no guarantee, and in fact > little chance, that Gb1 will be consistently named eth0 - it may vary > from boot to boot. That's full of fail. > > For a concrete example, the 4 bnx2 chips in my PowerEdge R610 with a > current 2.6 kernel, loading only one driver, the ports get assigned > names in nondeterministic order on each boot. Given that the > ifcfg-eth* rules, netfilter rules, and the rest all expect > deterministic naming, massive failure ensues unless some form of > determinism is brought back in. > > The idea to use a character device node to expose the ifindex value, > and udev to manage a tree of symlinks to it, really follows the model > used today for disks. 
It allows us to get deterministic names for > devices (albeit, the names are symlinks), and multiple names for > devices (through multiple symlink rules). That some people want to > use the char device to call ioctl() and read/write, as is possible on > the BSDs, would just be gravy IMHO. > > It does require a change in behavior for a system administrator. > Instead of hard-coding 'eth0' into her scripts, she uses > '/dev/net/by-function/boot' or somesuch. But then that name is > guaranteed to always refer to the "right" NIC. Every admin I've > spoken to is willing to make this kind of change, as long as they get > the consistent, deterministic naming they expect but don't have > today. And it does require patching userspace apps to take both a > kernel device name, or a path, and to resolve the path to device name > or ifindex. We wrote libnetdevname (really, one function), and have > patches for several userspace apps to use it, to prove it can be done. > > One alternative would be to do something using the sysfs ifindex value > already exported. e.g. > /sys/devices/pci0000:00/0000:00:05.0/0000:05:00.0/0000:06:07.0/net/eth0/ifindex > > but we have never had symlinks from /dev into /sys before (doesn't > mean we couldn't though). In that case, udev would grow to manage > /dev/net/by-chassis-label/Embedded_NIC_1 -> /sys/devices/.../net/eth0, > and libnetdevname would be used to follow the symlink in applications. > This approach could solve my problem without (many or any?) kernel > changes needed, but wouldn't help those who want to do > ioctl/read/write to a devnode. > > Given the problem, I really do need a solution. I've proposed one > method, and an alternative, but I can't afford to let the problem stay > unaddressed any longer, and need a clear direction to be chosen. The > char device gives me what I need, and others what they want also. > > Thanks for listening to the diatribe. 
For more examples and > workarounds that we've been telling our customers for several years, > check out http://linux.dell.com/papers.shtml for the Network Interface > Card Naming whitepaper. > > Why isn't what is available through sysfs enough? If it isn't, why not add the necessary attributes there? BTW, for our distro, we are looking into device renaming based on PCI slot because that is what router OSes do. Customers expect that if they replace the card in slot 0, it will come back with the same name. This is not what server customers expect. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
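The sysfs alternative Matt describes above (a udev-managed symlink such as /dev/net/by-chassis-label/Embedded_NIC_1 pointing at /sys/devices/.../net/eth0, followed by a small library call in each application) could look roughly like the sketch below. This is not libnetdevname's actual code: the function name `resolve_netdev` and its `sysfs_net` parameter are hypothetical, chosen for illustration. The only kernel interface it assumes is the per-device `ifindex` attribute that sysfs already exports.

```python
import os


def resolve_netdev(name_or_path, sysfs_net="/sys/class/net"):
    """Return (kernel_name, ifindex) for a kernel name or a symlink path.

    Hypothetical helper: anything containing a slash is treated as a
    symlink that ends in a sysfs net device directory; anything else is
    treated as a kernel device name such as "eth0".
    """
    if "/" in name_or_path:
        target = os.path.realpath(name_or_path)
    else:
        target = os.path.realpath(os.path.join(sysfs_net, name_or_path))

    # Each net device directory in sysfs carries an "ifindex" attribute.
    with open(os.path.join(target, "ifindex")) as f:
        ifindex = int(f.read().strip())

    # The last path component of the sysfs directory is the kernel name.
    return os.path.basename(target), ifindex
```

A tool that accepts either form would call `resolve_netdev(arg)` once at startup and use the returned kernel name or ifindex from then on.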
On Sat, Oct 10, 2009 at 20:11, Bill Fink <billfink@mindspring.com> wrote:
> No comment on the specific implementation decision, but I am in the
> process of setting up a large number of test systems with identical
> hardware configurations, and using a master disk image to clone all the
> test systems. The biggest pain in this process is identifying the MAC
> addresses for each of the six or more network interfaces in each test
> system (we want eth0...ethN to always reference the same physical port
> on the test systems), and then having to modify the 70-persistent-net.rules
> udev file and the HWADDR entry for all the ifcfg-ethX files to reflect
> the correct MAC addresses. It would be fantastic if there were some
> mechanism for making this part of the process unnecessary.

Udev creates the persistent rules only if no other rule has set a name.
Adding something like:

  SUBSYSTEM=="net", KERNEL=="eth*", NAME="eth%n"

in any rules file that runs before the udev-generated one will skip all
of the automatic udev rule creation.

Kay
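For the chore Bill describes (finding the MAC address behind each ethN on every cloned machine), a short script can at least collect the data. This is only a sketch under assumptions: the function name `collect_macs` is hypothetical, and it relies on the `address` attribute that sysfs exports for each net device under /sys/class/net.

```python
import os


def collect_macs(sysfs_net="/sys/class/net"):
    """Return a dict mapping interface name to MAC address string.

    Hypothetical helper: reads the "address" attribute of every entry
    under sysfs_net and skips entries that do not have one.
    """
    macs = {}
    for name in sorted(os.listdir(sysfs_net)):
        path = os.path.join(sysfs_net, name, "address")
        try:
            with open(path) as f:
                macs[name] = f.read().strip().lower()
        except OSError:
            # Not a net device directory, or no readable address.
            continue
    return macs
```

Run on each test system, the resulting mapping could be fed into whatever generates the per-machine rules and ifcfg files. It does not remove the need for them.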
On Sat, 2009-10-10 at 09:27 -0700, Greg KH wrote:
> On Sat, Oct 10, 2009 at 01:47:39PM +0530, Sujit K M wrote:
> > Greg,
> >
> > > No, the hardware changes the enumeration order, it places _no_
> > > guarantees on what order stuff will be found in. So this is not the
> > > kernel changing, just to be clear.
> > > Again, I have a machine here that likes to reorder PCI devices every 4th
> > > or so boot times, and that's fine according to the PCI spec. Yeah, it's
> > > a crappy BIOS, but the manufacturer rightly pointed out that it is not
> > > in violation of anything.
> >
> > I think the open call should be implemented then. By the patch very little
> > knowledge is being shared on type of network implementation it is trying to
> > do.
>
> What would open() accomplish? What good would the file descriptor be?
> What could you use it for?

Currently all net device ioctls are carried out through arbitrary
sockets and identify the device by name (aside from one to look up the
name by ifindex). Ever since it became possible to rename net devices,
it has been possible for a sequence of ioctls intended for one device to
race with renaming of that device. Adding open() and ioctl() to the
character device (which seems reasonably easy) would provide a way to
avoid this.

On the other hand, the netlink configuration APIs already use ifindex so
it may be better just to say that the device ioctls are deprecated and
applications should use netlink.

> > Also it is messing with core datastructure and procedures. This seems
> > to be simplified by changing implementing the other operations like poll().
>
> I don't understand.
>
> > > That is why all distros name network devices based on the only
> > > deterministic thing they have today, the MAC address. I still fail to
> > > see why you do not like this solution, it is honestly the only way to
> > > properly name network devices in a sane manner.
> >
> > This is feature that needs to be implemented.
> > As per the rules followed.
>
> This feature is already implemented today, all distros have it.

No, see below.

> > > All distros also provide a way to easily rename the network devices, to
> > > place a specific name on a specific MAC address, so again, this should
> > > all be solved already.
> > >
> > > No matter how badly your BIOS teams mess up the PCI enumeration order :)
> >
> > This is an problem, But I think this can be solved by implementing some of the
> > routines in the network device.
>
> I don't, see the rules that your distro ships today for persistent
> network devices, it's already there, no need to change the kernel at
> all.

The udev persistent net rules work tolerably well for a single system
with a stable set of net devices.

They do not solve the problem Matt's talking about, which is lack of
consistency between multiple systems, because the initial enumeration
order is not predictable.

They also result in name changes when a NIC (or motherboard) is swapped.
For some users, that's fine; for others, it's not.

The ability to specify NICs by port name or PCI address should solve
these problems.

Ben.
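Ben's point about name-based ioctls racing with renames can be illustrated from userspace: look the name up once, keep the ifindex, and translate back to a name only when one is needed for display. The sketch below uses Python's standard `socket.if_nametoindex` and `socket.if_indextoname`; the wrapper names `pin_interface` and `current_name` are hypothetical. It does not close the race for the initial lookup, it only shows why an ifindex is the more stable handle afterwards.

```python
import socket


def pin_interface(name):
    """Resolve a kernel interface name to its ifindex once.

    The ifindex stays the same if the device is renamed later, so it is
    the value to pass on to index-based APIs such as rtnetlink.
    """
    return socket.if_nametoindex(name)


def current_name(ifindex):
    """Return the device's current name, or None if the index is gone."""
    try:
        return socket.if_indextoname(ifindex)
    except OSError:
        return None
```

A configuration tool would call `pin_interface("eth0")` at startup and carry the integer around, calling `current_name()` only for log messages.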
On Sat, Oct 10, 2009 at 11:32:19AM -0700, Stephen Hemminger wrote:
> BTW, for our distro, we are looking into device renaming based on PCI slot
> because that is what router OS's do. Customers expect if they replace the card
> in slot 0, it will come back with the same name. This is not what server
> customers expect.

If your BIOS exposes the PCI slots to userspace (through the proper ACPI
namespace), doing this type of naming should be trivial with some simple
udev rules; no additional kernel infrastructure is needed.

thanks,

greg k-h
On Sat, Oct 10, 2009 at 08:00:30PM +0100, Ben Hutchings wrote:
> On the other hand, the netlink configuration APIs already use ifindex so
> it may be better just to say that the device ioctls are deprecated and
> applications should use netlink.

I thought that is what was already encouraged to happen.

> > > > That is why all distros name network devices based on the only
> > > > deterministic thing they have today, the MAC address. I still fail to
> > > > see why you do not like this solution, it is honestly the only way to
> > > > properly name network devices in a sane manner.
> > >
> > > This is feature that needs to be implemented. As per the rules followed.
> >
> > This feature is already implemented today, all distros have it.
>
> No, see below.

Yes, if not, file a bug in your distro, all of the infrastructure is
already in place, and the udev rules and scripts are already written.

> > > > All distros also provide a way to easily rename the network devices, to
> > > > place a specific name on a specific MAC address, so again, this should
> > > > all be solved already.
> > > >
> > > > No matter how badly your BIOS teams mess up the PCI enumeration order :)
> > >
> > > This is an problem, But I think this can be solved by implementing some of the
> > > routines in the network device.
> >
> > I don't, see the rules that your distro ships today for persistent
> > network devices, it's already there, no need to change the kernel at
> > all.
>
> The udev persistent net rules work tolerably well for a single system
> with a stable set of net devices.
>
> They do not solve the problem Matt's talking about, which is lack of
> consistency between multiple systems, because the initial enumeration
> order is not predictable.

Again, you name the device as a MAC address. Or something else that the
BIOS exports in a unique manner (PCI slot name, etc.). That is
consistent. If not, then fix the BIOS.

> They also result in name changes when a NIC (or motherboard) is swapped.
> For some users, that's fine; for others, it's not.
>
> The ability to specify NICs by port name or PCI address should solve
> these problems.

That can be done today quite easily. But note that PCI addresses are
not guaranteed to be stable, as has been known to happen on lots of
machines.

Again, none of this requires any kernel changes today at all, let alone
adding dummy char devices for network devices.

thanks,

greg k-h
On Sat, Oct 10, 2009 at 10:34:16AM -0700, Bryan Kadzban wrote: > Greg KH wrote: > > On Sat, Oct 10, 2009 at 07:47:32AM -0500, Matt Domsch wrote: > >> On Fri, Oct 09, 2009 at 10:23:08PM -0700, Greg KH wrote: > >>> On Fri, Oct 09, 2009 at 11:40:57PM -0500, Matt Domsch wrote: > >>>> The fundamental roadblock to this is that enumeration != > >>>> naming, except that it is for network devices, and we keep > >>>> changing the enumeration order. > >>> No, the hardware changes the enumeration order, it places _no_ > >>> guarantees on what order stuff will be found in. So this is not > >>> the kernel changing, just to be clear. > >> Over time the kernel has changed its enumeration mechanisms, and > >> introduced parallelism into the process (which is a good thing), > >> which, from a user perspective, makes names nondeterministic. Yes, > >> fixing this up by hard-coding MAC addresses after install has been > >> the traditional mechanism to address this. I think there's a > >> better way. > > > > Ok, but that way can be done in userspace, without the need for this > > char device, right? > > For the record -- when I tried to send a patch that did exactly this > (provided an option to use by-path persistence for network drivers), it > was rejected because "that doesn't work for USB". > > True, it doesn't. But by-mac (what we have today) doesn't work for > replacing motherboards in a random home system (that can't override the > MAC address in the BIOS), either. If you replace a motherboard, you honestly expect no configuration to be needed to be changed? If so, then don't use the MAC naming scheme for your systems. > > But this code is not a requirement to "solve" the fact that network > > devices can show up in different order, that problem can be solved as > > long as the user picks a single way to name the devices, using tools > > that are already present today in distros. > > This code is not a requirement, no. 
> But -- as you say -- it does
> provide a halfway-decent way to assign multiple names to a NIC. And
> that provides admins the choice to use a couple different persistence
> schemes, depending on how they expect their hardware to work.

But the names then need to be resolved back to a "real" kernel name in
order to do anything with that network connection, as the char devices
are not real ones. So that adds an additional layer of complexity on
all of the system configuration tools.

thanks,

greg k-h
On Oct 10, Matt Domsch <Matt_Domsch@dell.com> wrote:
> It does require a change in behavior for a system administrator.
> Instead of hard-coding 'eth0' into her scripts, she uses
> '/dev/net/by-function/boot' or somesuch. But then that name is
> guaranteed to always refer to the "right" NIC. Every admin I've
> spoken to is willing to make this kind of change, as long as they get
> the consistent, deterministic naming they expect but don't have
> today. And it does require patching userspace apps to take both a
> kernel device name, or a path, and to resolve the path to device name
> or ifindex. We wrote libnetdevname (really, one function), and have
> patches for several userspace apps to use it, to prove it can be done.

For the record, before being a distribution developer I am a system
administrator (who designed and manages many firewalls with multiple
network interfaces), and I am still unconvinced that what you are
proposing is a practical solution and that its benefits justify the
significant changes both in software and in system administration
practices that it requires.

The first issue which greatly concerns me is the need to modify *every*
userspace application and kernel tool (what about iptables? What about
the kernel logs?): from a user-experience point of view it would be very
annoying if different applications used different names to refer to the
same network device.

I am also concerned with the practical implications of trying to use
such long (and unusual) names: IFNAMSIZ is 16, so user interfaces tend
to assume both short names and that they match something like
/^[a-z0-9]+$/.
What about e.g. distribution scripts which use the interface name as a
file system path component? Do you already have a (standard) scheme to
losslessly convert the names to a form without slashes?
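The two constraints raised here (names must fit in IFNAMSIZ and tend to be assumed to match a simple pattern, and path-style aliases contain slashes) can be made concrete with a short sketch. Nothing below is a standard or proposed scheme from this thread: percent-encoding is simply one lossless, slash-free encoding picked for illustration, and the helper names are hypothetical.

```python
import re
from urllib.parse import quote, unquote

IFNAMSIZ = 16  # kernel name buffer size, including the terminating NUL
SIMPLE_NAME = re.compile(r"[a-z0-9]+")  # the pattern tools tend to assume


def fits_kernel_name(name):
    """True if name fits in IFNAMSIZ and matches the simple pattern."""
    return (len(name.encode()) < IFNAMSIZ
            and SIMPLE_NAME.fullmatch(name) is not None)


def to_path_component(alias):
    """Encode an alias so it contains no slash (illustrative scheme)."""
    return quote(alias, safe="")


def from_path_component(component):
    """Reverse to_path_component()."""
    return unquote(component)
```

With these, `fits_kernel_name("eth0")` holds while `fits_kernel_name("/dev/net/by-function/boot")` does not, and the encoded form of the latter can be used as a single file name and decoded back unchanged.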
On Sat, 2009-10-10 at 09:25 -0700, Greg KH wrote:
> Ok, but that way can be done in userspace, without the need for this
> char device, right?

It might actually be nice to have a device file anyway since you can use
existing udev infrastructure to adjust permissions (e.g. chown it to the
netdev group) and add ACLs. This would allow running some software as an
unprivileged user instead of uid 0.

David
On Sun, Oct 11, 2009 at 12:40:18PM -0400, David Zeuthen wrote:
> On Sat, 2009-10-10 at 09:25 -0700, Greg KH wrote:
> > Ok, but that way can be done in userspace, without the need for this
> > char device, right?
>
> It might actually be nice to have a device file anyway since you can use
> existing udev infrastructure to adjust permissions (e.g. chown it to the
> netdev group) and add ACLs. This would allow running some software as an
> unprivileged user instead of uid 0.

But as the char nodes would not actually control access to anything, how
would this help? Remember, these device nodes are "dummies" with
nothing behind them (open() fails).

thanks,

greg k-h
On Sat, Oct 10, 2009 at 12:23 AM, Greg KH <greg@kroah.com> wrote: > On Fri, Oct 09, 2009 at 11:40:57PM -0500, Matt Domsch wrote: >> The fundamental roadblock to this is that enumeration != naming, >> except that it is for network devices, and we keep changing the >> enumeration order. > > No, the hardware changes the enumeration order, it places _no_ > guarantees on what order stuff will be found in. So this is not the > kernel changing, just to be clear. > > Again, I have a machine here that likes to reorder PCI devices every 4th > or so boot times, and that's fine according to the PCI spec. Yeah, it's > a crappy BIOS, but the manufacturer rightly pointed out that it is not > in violation of anything. > >> Today, port naming is completely nondeterministic. If you have but >> one NIC, there are few chances to get the name wrong (it'll be eth0). >> If you have >1 NIC, chances increase to get it wrong. > > That is why all distros name network devices based on the only > deterministic thing they have today, the MAC address. I still fail to > see why you do not like this solution, it is honestly the only way to > properly name network devices in a sane manner. > > All distros also provide a way to easily rename the network devices, to > place a specific name on a specific MAC address, so again, this should > all be solved already. > > No matter how badly your BIOS teams mess up the PCI enumeration order :) > > thanks, > > greg k-h > -- > To unsubscribe from this list: send the line "unsubscribe netdev" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > So when an add-in PCI NIC has a lower MAC than the motherboard NICs, the add-in cards will come before the motherboard NICs. i don't like it. But please whatever is done, make sure ping and tracert still works when telling it to use a ethX source interface: eth0 = 4.3.2.8, the default gateway is thru eth1. 
  ping -I eth0 208.67.222.222            FAILS
  ping -I 4.3.2.8 208.67.222.222         WORKS
  tracert -i eth0 -I 208.67.222.222      FAILS
  tracert -s 4.3.2.8 -I 208.67.222.222   WORKS
  tracert -i eth0 208.67.222.222         FAILS
  tracert -s 4.3.2.8 208.67.222.222      WORKS
On Sun, Oct 11, 2009 at 04:10:03PM -0500, Rob Townley wrote:
> So when an add-in PCI NIC has a lower MAC than the motherboard NICs,
> the add-in cards will come before the motherboard NICs. i don't like it.

Actually, MAC address has nothing to do with device naming/ordering at
all. Often systems will have onboard NICs in ascending MAC address
order, but that's not a requirement, and I've seen systems not do that.
And once you get to add-in vs onboard, BIOS wouldn't be able to enforce
such an ordering anyhow (in general).

But yes, you raise the point that, without using MAC-assigned names or
another naming mechanism designed to cope with this, adding or removing
a card can cause a difference in device enumeration, and thus name.
On Sun, Oct 11, 2009 at 04:10:03PM -0500, Rob Townley wrote:
> So when an add-in PCI NIC has a lower MAC than the motherboard NICs,
> the add-in cards will come before the motherboard NICs. i don't like it.

Huh? Have you used the MAC persistent rules? If you add a new card,
what does it pick for it?

> But please whatever is done, make sure ping and tracert still works when
> telling it to use a ethX source interface:
>
> eth0 = 4.3.2.8, the default gateway is thru eth1.
>   ping -I eth0 208.67.222.222            FAILS
>   ping -I 4.3.2.8 208.67.222.222         WORKS
>   tracert -i eth0 -I 208.67.222.222      FAILS
>   tracert -s 4.3.2.8 -I 208.67.222.222   WORKS
>   tracert -i eth0 208.67.222.222         FAILS
>   tracert -s 4.3.2.8 208.67.222.222      WORKS

Again, is what we currently have broken? I am confused as to what this
is referring to.

greg k-h
Greg KH wrote: > On Sat, Oct 10, 2009 at 10:34:16AM -0700, Bryan Kadzban wrote: >> Greg KH wrote: >>> On Sat, Oct 10, 2009 at 07:47:32AM -0500, Matt Domsch wrote: >>>> On Fri, Oct 09, 2009 at 10:23:08PM -0700, Greg KH wrote: >>>>> On Fri, Oct 09, 2009 at 11:40:57PM -0500, Matt Domsch wrote: >>>>>> The fundamental roadblock to this is that enumeration != >>>>>> naming, except that it is for network devices, and we keep >>>>>> changing the enumeration order. >>>>> No, the hardware changes the enumeration order, it places >>>>> _no_ guarantees on what order stuff will be found in. So >>>>> this is not the kernel changing, just to be clear. >>>> Over time the kernel has changed its enumeration mechanisms, >>>> and introduced parallelism into the process (which is a good >>>> thing), which, from a user perspective, makes names >>>> nondeterministic. Yes, fixing this up by hard-coding MAC >>>> addresses after install has been the traditional mechanism to >>>> address this. I think there's a better way. >>> Ok, but that way can be done in userspace, without the need for >>> this char device, right? >> For the record -- when I tried to send a patch that did exactly >> this (provided an option to use by-path persistence for network >> drivers), it was rejected because "that doesn't work for USB". >> >> True, it doesn't. But by-mac (what we have today) doesn't work for >> replacing motherboards in a random home system (that can't override >> the MAC address in the BIOS), either. > > If you replace a motherboard, you honestly expect no configuration to > be needed to be changed? If so, then don't use the MAC naming scheme > for your systems. What else is there? biosdevname doesn't work with this BIOS. It looks like at least path_id has been updated to work with NICs now, so that might work, with a bit of custom rule hacking. Or at least, it won't work any more poorly than for disks, which seem to work pretty well... 
:-)

>>> But this code is not a requirement to "solve" the fact that
>>> network devices can show up in different order, that problem can
>>> be solved as long as the user picks a single way to name the
>>> devices, using tools that are already present today in distros.
>> This code is not a requirement, no. But -- as you say -- it does
>> provide a halfway-decent way to assign multiple names to a NIC.
>> And that provides admins the choice to use a couple different
>> persistence schemes, depending on how they expect their hardware to
>> work.
>
> But the names need to then be resolved back to a "real" kernel name
> in order to do anything with that network connection, as the char
> devices are not real ones. So that adds an additional layer of
> complexity on all of the system configuration tools.

Yes, that is true -- and no, this change isn't perfect. But it lets me
have multiple "names" per interface, and have "names" that are longer
than IFNAMSIZ, which is why I like it.

(Now, if open() would return effectively a netlink socket bound to that
ifindex already, such that the program didn't need to fill in the
various ifindex fields for e.g. rtnetlink... but it's probably really
hard to do that, so this isn't a serious suggestion.)
On Sat, Oct 10, 2009 at 11:32:19AM -0700, Stephen Hemminger wrote:
> > On Fri, Oct 09, 2009 at 07:44:01PM -0700, Stephen Hemminger wrote:
[...]
>
> Why isn't what is available through sysfs enough? If it isn't, why not
> add the necessary attributes there?

True. If sysfs is not sufficient, what exact naming scheme could be
applied that the chardev based naming could use?

> [...]

Kurt
Bryan Kadzban wrote:
> (Now, if open() would return effectively a netlink socket bound to
> that ifindex already, such that the program didn't need to fill in
> the various ifindex fields for e.g. rtnetlink... but it's probably
> really hard to do that, so this isn't a serious suggestion.)

Wait, scratch that. It's not "really hard", it's "almost impossible".
At open() time, you have no idea which netlink family the program wants
to communicate with. bind() is also hard. (In theory, you could
support bind() on this new FD -- but then why is userspace using a file
in the first place, and not a socket?)

So this is even less of a serious suggestion now. I'd still like to be
able to refer to NICs by multiple names though, if we can find a way
that works...
Greg KH (greg@kroah.com) said:
> > Today, port naming is completely nondeterministic. If you have but
> > one NIC, there are few chances to get the name wrong (it'll be eth0).
> > If you have >1 NIC, chances increase to get it wrong.
>
> That is why all distros name network devices based on the only
> deterministic thing they have today, the MAC address. I still fail to
> see why you do not like this solution, it is honestly the only way to
> properly name network devices in a sane manner.
>
> All distros also provide a way to easily rename the network devices, to
> place a specific name on a specific MAC address, so again, this should
> all be solved already.

No, it's not solved. Even if you have persistent names once you install,
if you ever re-image, you're likely to get *different* persistent names;
the first load will always be non-deterministic.

The only way around this would be to have some sort of screen like:

  Would you like your network devices to be enumerated by

  [ ] MAC address
  [ ] PCI device order
  [ ] Driver name
  [ ] Other

which is just all sorts of fail in and of itself. Especially since
once you get to the point where you can coherently ask this in a
native installer, the drivers have already loaded.

Bill
On Mon, Oct 12, 2009 at 01:45:28PM -0400, Bill Nottingham wrote:
> Greg KH (greg@kroah.com) said:
> > > Today, port naming is completely nondeterministic. If you have but
> > > one NIC, there are few chances to get the name wrong (it'll be eth0).
> > > If you have >1 NIC, chances increase to get it wrong.
> >
> > That is why all distros name network devices based on the only
> > deterministic thing they have today, the MAC address. I still fail to
> > see why you do not like this solution, it is honestly the only way to
> > properly name network devices in a sane manner.
> >
> > All distros also provide a way to easily rename the network devices, to
> > place a specific name on a specific MAC address, so again, this should
> > all be solved already.
>
> No, it's not solved. Even if you have persistent names once you install,
> if you ever re-image, you're likely to get *different* persistent names;
> the first load will always be non-deterministic.
>
> The only way around this would be to have some sort of screen like:
>
>   Would you like your network devices to be enumerated by
>
>   [ ] MAC address
>   [ ] PCI device order
>   [ ] Driver name
>   [ ] Other

    [ ] PCI slot name

That's one that modern systems are now reporting, and should solve
Matt's problem as well, right?

> which is just all sorts of fail in and of itself. Especially since
> once you get to the point where you can coherently ask this in a
> native installer, the drivers have already loaded.

No, the driver load order doesn't determine this, you need the drivers
loaded first before you can rename anything :)

And I don't see how Matt's proposed patch helps resolve this type of
issue any better than what we currently have today, do you?

thanks,

greg k-h
Greg KH (greg@kroah.com) said:
> > No, it's not solved. Even if you have persistent names once you install,
> > if you ever re-image, you're likely to get *different* persistent names;
> > the first load will always be non-deterministic.
> >
> > The only way around this would be to have some sort of screen like:
> >
> >    Would you like your network devices to be enumerated by
> >
> >    [ ] MAC address
> >    [ ] PCI device order
> >    [ ] Driver name
> >    [ ] Other
>
>      [ ] PCI slot name
>
> That's one that modern systems are now reporting, and should solve
> Matt's problem as well, right?

... maybe. On my laptop, the first 'slot' enumerated appears to be
the cardbus bridge, before the on-board ethernet. And on the desktop
next to me, the slot driver shows nothing.

> And I don't see how Matt's proposed patch helps resolve this type of
> issue any better than what we currently have today, do you?

It allows multiple addressing schemes to be active at once, which
can allow the admin to choose post-install without making an
active choice at installation. This is an improvement, even if
it doesn't solve the world.

Bill
On Mon, Oct 12, 2009 at 02:07:42PM -0400, Bill Nottingham wrote:
> Greg KH (greg@kroah.com) said:
> > > No, it's not solved. Even if you have persistent names once you install,
> > > if you ever re-image, you're likely to get *different* persistent names;
> > > the first load will always be non-deterministic.
> > >
> > > The only way around this would be to have some sort of screen like:
> > >
> > >    Would you like your network devices to be enumerated by
> > >
> > >    [ ] MAC address
> > >    [ ] PCI device order
> > >    [ ] Driver name
> > >    [ ] Other
> >
> >      [ ] PCI slot name
> >
> > That's one that modern systems are now reporting, and should solve
> > Matt's problem as well, right?
>
> ... maybe. On my laptop, the first 'slot' enumerated appears to be
> the cardbus bridge, before the on-board ethernet. And on the desktop
> next to me, the slot driver shows nothing.

On servers, where this matters (multiple ethernet pci devices), this
should all be present if the manufacturer wants it to be, as it is just
an ACPI table entry.

> > And I don't see how Matt's proposed patch helps resolve this type of
> > issue any better than what we currently have today, do you?
>
> It allows multiple addressing schemes to be active at once, which
> can allow the admin to choose post-install without making an
> active choice at installation. This is an improvement, even if
> it doesn't solve the world.

But these different names cannot be used by the networking stack, or
in scripts, as others have pointed out.  Which seems to be the big
problem here.

thanks,

greg k-h
On Sun, Oct 11, 2009 at 10:00 PM, Greg KH <greg@kroah.com> wrote:
> On Sun, Oct 11, 2009 at 04:10:03PM -0500, Rob Townley wrote:
>> So when an add-in PCI NIC has a lower MAC than the motherboard NICs,
>> the add-in cards will come before the motherboard NICs.  i don't like it.
>
> Huh?  Have you used the MAC persistent rules?  If you add a new card,
> what does it pick for it?

I have an hp-dl360 (two NICs) with a fibre optic add-in NIC.  On a fresh
install, the add-in is eth0.  I didn't like it, but ran it for years.

>
>> But please whatever is done, make sure ping and tracert still works when
>> telling it to use a ethX source interface:
>>
>> eth0 = 4.3.2.8, the default gateway is thru eth1.
>> ping -I eth0 208.67.222.222               FAILS
>> ping -I 4.3.2.8 208.67.222.222            WORKS
>> tracert -i eth0 -I 208.67.222.222         FAILS
>> tracert -s 4.3.2.8 -I 208.67.222.222      WORKS
>> tracert -i eth0 208.67.222.222            FAILS
>> tracert -s 4.3.2.8 208.67.222.222         WORKS
>
> Again, is what we currently have broken?  I am confused as to what this
> is referring to.

Yes, ping and traceroute are broken at least on Fedora, CentOS, and busybox.
On a multi-NIC, multi-gateway machine, passing ethX instead of the IP
address gives the false result "Destination Host Unreachable"
when the machine's default gateway is reached through the other NIC.  In
the following example, the default gateway is through eth1, not eth0.
Pay attention to the text between the '*****'.

# ping -c 1 -B -I eth0 208.67.222.222
PING 208.67.222.222 (208.67.222.222) from ***** 4.3.2.8 eth0*****: 56(84) bytes of data.
From 4.3.2.8 icmp_seq=1 Destination Host Unreachable

# ping -c 1 -B -I 4.3.2.8 208.67.222.222
PING 208.67.222.222 (208.67.222.222) from ***** 4.3.2.8 *****: 56(84) bytes of data.
64 bytes from 208.67.222.222: icmp_seq=1 ttl=55 time=562 ms

>
> greg k-h
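The two invocations above differ only in what is handed to -I: an interface name in the failing case, a source address in the working one. The sketch below is not taken from ping or from this thread; it only illustrates, under the assumption that a tool tells the two apart by trying to parse the argument as an IP address first, how such an argument might be classified. The names classify_source and src_kind are hypothetical.

```c
#include <assert.h>
#include <arpa/inet.h>
#include <net/if.h>
#include <string.h>

/* Hypothetical classification of a "-I <arg>" style option. */
enum src_kind { SRC_ADDRESS, SRC_DEVICE, SRC_INVALID };

/*
 * Treat the argument as a source address if it parses as IPv4 or IPv6,
 * otherwise as a device name if it fits in an interface name buffer.
 * This is an assumed policy for illustration, not ping's actual code.
 */
static enum src_kind classify_source(const char *arg)
{
	struct in_addr v4;
	struct in6_addr v6;

	assert(arg != NULL);

	if (inet_pton(AF_INET, arg, &v4) == 1 ||
	    inet_pton(AF_INET6, arg, &v6) == 1)
		return SRC_ADDRESS;

	/* IF_NAMESIZE includes the terminating NUL byte. */
	if (arg[0] != '\0' && strlen(arg) < IF_NAMESIZE)
		return SRC_DEVICE;

	return SRC_INVALID;
}
```

With this policy, "4.3.2.8" would select a source address and leave routing to the kernel, while "eth0" would pin the traffic to one device, which is consistent with the different results reported above.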
On Mon, Oct 12, 2009 at 01:35:25PM -0500, Rob Townley wrote:
> > Again, is what we currently have broken?  I am confused as to what this
> > is referring to.
>
> Yes, ping and traceroute are broken at least on Fedora, CentOS, and busybox.
> On a multinic, multigatewayed machine, passing ethX instead of the IP
> address will give the false result: "Destination Host Unreachable"
> when the machine's default gateway is reached thru the other nic.  In
> the following example, the default gateway is thru eth1, not eth0.

Unrelated to this thread.  We're having a hard enough time making sure
this conversation accurately reflects the views and needs of everyone
involved.  Please let's not throw in another tangent.

Thanks,
Matt
On Sat, 2009-10-10 at 14:06 -0700, Greg KH wrote:
> On Sat, Oct 10, 2009 at 11:32:19AM -0700, Stephen Hemminger wrote:
> >
> > BTW, for our distro, we are looking into device renaming based on PCI slot
> > because that is what router OS's do. Customers expect if they replace the card
> > in slot 0, it will come back with the same name. This is not what server
> > customers expect.
>
> If your bios exposes the PCI slots to userspace (through the proper ACPI
> namespace), doing this type of naming should be trivial with some simple
> udev rules, no additional kernel infrastructure is needed.

By and large, the people that care most about persistent network device
names based on *location in the machine* are server users.  This allows
hotswap of cards or single-image-multiple-machine without needing
configuration changes, which is nice.

Those users can reasonably be expected to choose hardware whose BIOS
supports the ACPI tables that (mostly) guarantee to provide actual,
stable names for their hardware.

If there's even a 10% chance that on consumer-level systems the names
won't be stable on a given boot (and I can't see how, without BIOS
support, we can guarantee 100% stability) then it's a worthless
guarantee.

If the BIOS support exists, it is trivial to use udev to create the
correct naming mechanism for your machine, either using MAC address or
BIOS-provided slot naming.  No kernel patch is required.

If the BIOS support does not exist, you are not guaranteed a stable
naming mechanism except by MAC address, because the BIOS may randomly
change enumeration based on the time of day, or it may not.  A 90 or 95%
stability guarantee is not a guarantee at all.

Third, USB enumeration will always be unstable.  Thus we have an
unsolvable discrepancy in behavior between PCI and USB.

Is this correct?

Dan
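For readers unfamiliar with what a MAC-based naming rule keys on: the kernel exposes per-interface attributes such as the MAC address under /sys/class/net/<name>/, and udev rules match against those. The sketch below is only an illustration of reading such an attribute from userspace; the helper names netdev_sysfs_path and netdev_read_attr are hypothetical and are not part of udev or of the proposed patch.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Build "/sys/class/net/<ifname>/<attr>" into buf.
 * Returns 0 on success, -1 if the result would not fit.
 */
static int netdev_sysfs_path(char *buf, size_t len,
			     const char *ifname, const char *attr)
{
	int n = snprintf(buf, len, "/sys/class/net/%s/%s", ifname, attr);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Read the first line of a sysfs attribute (for example "address")
 * into out, without the trailing newline.
 * Returns 0 on success, -1 if the attribute cannot be read.
 */
static int netdev_read_attr(const char *ifname, const char *attr,
			    char *out, size_t len)
{
	char path[256];
	FILE *f;

	assert(len > 0);

	if (netdev_sysfs_path(path, sizeof(path), ifname, attr) < 0)
		return -1;

	f = fopen(path, "r");
	if (!f)
		return -1;

	if (!fgets(out, (int)len, f)) {
		fclose(f);
		return -1;
	}
	fclose(f);

	/* Strip the newline that sysfs appends. */
	out[strcspn(out, "\n")] = '\0';
	return 0;
}
```

Calling netdev_read_attr("eth0", "address", buf, sizeof(buf)) on a machine that has an eth0 would fill buf with that interface's MAC address, which is the stable key the MAC-based rules rely on.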
>If the BIOS support exists, it is trivial to use udev to
>create the correct naming mechanism for your machine, either
>using MAC address or BIOS-provided slot naming.  No kernel
>patch is required.
>

Yes, in case we want to rename only once. MAC addresses or slot names do
provide persistent naming. They help in retaining whatever names are
assigned at install time, which is the first instantiation of the OS.
But these names may not be as expected (for example, the first on-board
network interface is expected to be "eth0", which is not always the case
and might not reflect what is written on the chassis label as "Gb1",
"Gb2", etc.), which would cause unattended installs to break.
Also, image-based deployments run into problems once state such as the
MAC address is introduced.

With regards,
Narendra K
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index b332eef..a2f23b4 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -44,6 +44,7 @@
 #include <linux/workqueue.h>

 #include <linux/ethtool.h>
+#include <linux/cdev.h>
 #include <net/net_namespace.h>
 #include <net/dsa.h>
 #ifdef CONFIG_DCB
@@ -916,6 +917,9 @@ struct net_device
 	/* max exchange id for FCoE LRO by ddp */
 	unsigned int		fcoe_ddp_xid;
 #endif
+#ifdef CONFIG_NET_CDEV
+	struct cdev cdev;
+#endif
 };

 #define to_net_dev(d) container_of(d, struct net_device, dev)
diff --git a/net/Kconfig b/net/Kconfig
index 041c35e..bdc5bd7 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -43,6 +43,16 @@ config COMPAT_NETLINK_MESSAGES
 	  Newly written code should NEVER need this option but do
 	  compat-independent messages instead!

+config NET_CDEV
+	bool "/dev files for network devices"
+	default y
+	help
+	  This option causes /dev entries to be created for each
+	  network device.  This allows the use of udev to create
+	  alternate device naming policies.
+
+	  If unsure, say Y.
+
 menu "Networking options"

 source "net/packet/Kconfig"
diff --git a/net/core/Makefile b/net/core/Makefile
index 796f46e..0b40d2c 100644
--- a/net/core/Makefile
+++ b/net/core/Makefile
@@ -19,4 +19,5 @@ obj-$(CONFIG_NET_DMA) += user_dma.o
 obj-$(CONFIG_FIB_RULES) += fib_rules.o
 obj-$(CONFIG_TRACEPOINTS) += net-traces.o
 obj-$(CONFIG_NET_DROP_MONITOR) += drop_monitor.o
+obj-$(CONFIG_NET_CDEV) += cdev.o
diff --git a/net/core/cdev.c b/net/core/cdev.c
new file mode 100644
index 0000000..1f36076
--- /dev/null
+++ b/net/core/cdev.c
@@ -0,0 +1,42 @@
+#include <linux/fs.h>
+#include <linux/cdev.h>
+#include <linux/netdevice.h>
+#include <linux/device.h>
+
+/* Used for network dynamic major number */
+static dev_t netdev_devt;
+
+static int netdev_cdev_open(struct inode *inode, struct file *filep)
+{
+	/* no operations on this device are implemented */
+	return -ENOSYS;
+}
+
+static const struct file_operations netdev_cdev_fops = {
+	.owner = THIS_MODULE,
+	.open = netdev_cdev_open,
+};
+
+void netdev_cdev_alloc(void)
+{
+	alloc_chrdev_region(&netdev_devt, 0, 1<<20, "net");
+}
+
+void netdev_cdev_init(struct net_device *dev)
+{
+	cdev_init(&dev->cdev, &netdev_cdev_fops);
+	cdev_add(&dev->cdev, MKDEV(MAJOR(netdev_devt), dev->ifindex), 1);
+
+}
+
+void netdev_cdev_del(struct net_device *dev)
+{
+	if (dev->cdev.dev)
+		cdev_del(&dev->cdev);
+}
+
+void netdev_cdev_kobj_init(struct device *dev, struct net_device *net)
+{
+	if (net->cdev.dev)
+		dev->devt = net->cdev.dev;
+}
diff --git a/net/core/cdev.h b/net/core/cdev.h
new file mode 100644
index 0000000..9cf5a90
--- /dev/null
+++ b/net/core/cdev.h
@@ -0,0 +1,13 @@
+#include <linux/netdevice.h>
+
+#ifdef CONFIG_NET_CDEV
+void netdev_cdev_alloc(void);
+void netdev_cdev_init(struct net_device *dev);
+void netdev_cdev_del(struct net_device *dev);
+void netdev_cdev_kobj_init(struct device *dev, struct net_device *net);
+#else
+static inline void netdev_cdev_alloc(void) {}
+static inline void netdev_cdev_init(struct net_device *dev) {}
+static inline void netdev_cdev_del(struct net_device *dev) {}
+static inline void netdev_cdev_kobj_init(struct device *dev, struct net_device *net) {}
+#endif
diff --git a/net/core/dev.c b/net/core/dev.c
index a74c8fd..d771438 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -129,6 +129,7 @@
 #include <trace/events/napi.h>

 #include "net-sysfs.h"
+#include "cdev.h"

 /* Instead of increasing this, you should create a hash table. */
 #define MAX_GRO_SKBS 8
@@ -4684,6 +4685,7 @@ static void rollback_registered(struct net_device *dev)

 	/* Remove entries from kobject tree */
 	netdev_unregister_kobject(dev);
+	netdev_cdev_del(dev);

 	synchronize_net();

@@ -4835,6 +4837,8 @@ int register_netdevice(struct net_device *dev)
 	if (dev->features & NETIF_F_SG)
 		dev->features |= NETIF_F_GSO;

+	netdev_cdev_init(dev);
+
 	netdev_initialize_kobject(dev);
 	ret = call_netdevice_notifiers(NETDEV_POST_INIT, dev);
@@ -4870,6 +4874,7 @@ out:
 	return ret;

 err_uninit:
+	netdev_cdev_del(dev);
 	if (dev->netdev_ops->ndo_uninit)
 		dev->netdev_ops->ndo_uninit(dev);
 	goto out;
@@ -5377,6 +5382,7 @@ int dev_change_net_namespace(struct net_device *dev, struct net *net, const char
 	dev_addr_discard(dev);

 	netdev_unregister_kobject(dev);
+	netdev_cdev_del(dev);

 	/* Actually switch the network namespace */
 	dev_net_set(dev, net);
@@ -5393,6 +5399,8 @@ int dev_change_net_namespace(struct net_device *dev, struct net *net, const char
 			dev->iflink = dev->ifindex;
 	}

+	netdev_cdev_init(dev);
+
 	/* Fixup kobjects */
 	err = netdev_register_kobject(dev);
 	WARN_ON(err);
@@ -5626,6 +5634,8 @@ static int __init net_dev_init(void)

 	BUG_ON(!dev_boot_phase);

+	netdev_cdev_alloc();
+
 	if (dev_proc_init())
 		goto out;

diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 753c420..f4ee557 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -19,6 +19,7 @@
 #include <net/wext.h>

 #include "net-sysfs.h"
+#include "cdev.h"

 #ifdef CONFIG_SYSFS
 static const char fmt_hex[] = "%#x\n";
@@ -501,6 +502,14 @@ static void netdev_release(struct device *d)
 	kfree((char *)dev - dev->padded);
 }

+#ifdef CONFIG_NET_CDEV
+static char *netdev_devnode(struct device *d, mode_t *mode)
+{
+	struct net_device *dev = to_net_dev(d);
+	return kasprintf(GFP_KERNEL, "netdev/%s", dev->name);
+}
+#endif
+
 static struct class net_class = {
 	.name = "net",
 	.dev_release = netdev_release,
@@ -510,6 +519,9 @@ static struct class net_class = {
 #ifdef CONFIG_HOTPLUG
 	.dev_uevent = netdev_uevent,
 #endif
+#ifdef CONFIG_NET_CDEV
+	.devnode = netdev_devnode,
+#endif
 };

 /* Delete sysfs entries but hold kobject reference until after all
@@ -536,6 +548,7 @@ int netdev_register_kobject(struct net_device *net)
 	dev->class = &net_class;
 	dev->platform_data = net;
 	dev->groups = groups;
+	netdev_cdev_kobj_init(dev, net);

 	dev_set_name(dev, "%s", net->name);
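In the patch above, netdev_cdev_init() registers each device at MKDEV(MAJOR(netdev_devt), dev->ifindex), so the minor number of a /dev/netdev/ node would carry the interface's ifindex. The following userspace sketch illustrates that encoding with the glibc makedev()/minor() macros. It is an illustration under assumptions, not part of the patch: the helper names are hypothetical, the major number used in any example is made up (the patch allocates it dynamically), and the userspace dev_t bit layout is not the same as the in-kernel one even though the round trip behaves the same way.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/sysmacros.h>

/* The patch reserves 1<<20 minors in alloc_chrdev_region(). */
#define NETDEV_MINOR_COUNT (1u << 20)

/*
 * Hypothetical helper: the device number a node for this ifindex
 * would have, given the dynamically allocated major.
 */
static dev_t netdev_devt_for(unsigned int major_nr, unsigned int ifindex)
{
	/* An ifindex outside the reserved minor range cannot be encoded. */
	assert(ifindex < NETDEV_MINOR_COUNT);
	return makedev(major_nr, ifindex);
}

/*
 * Hypothetical helper: recover the ifindex from a device number,
 * for example the st_rdev field returned by stat() on the node.
 */
static unsigned int ifindex_from_devt(dev_t devt)
{
	return minor(devt);
}
```

A udev-managed symlink to such a node could then be resolved with stat() and ifindex_from_devt(st.st_rdev), which is the "kernel name to ifindex mapping" the patch description refers to.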