Message ID | 20100419191807.10423.84600.stgit@savbu-pc100.cisco.com |
---|---|
State | Changes Requested, archived |
Delegated to: | David Miller |
On Monday 19 April 2010, Scott Feldman wrote: > IOV netlink (IOVNL) adds I/O Virtualization control support to a master > device (MD) netdev interface. The MD (e.g. SR-IOV PF) will set/get > control settings on behalf of a slave netdevice (e.g. SR-IOV VF). The > design allows for the case where master and slave are the > same netdev interface. What is the reason for controlling the slave device through the master, rather than talking to the slave directly? The kernel always knows the master for each slave, so it seems to me that this information is redundant. Is this new interface only for the case that you have a switch integrated in the NIC, or also for the case where you do an LLDP and EDP exchange with an adjacent bridge and put the device into VEPA mode? > One control setting example is MAC/VLAN settings for a VF. Another > example control setting is a port-profile for a VF. A port-profile is an > identifier that defines policy-based settings on the network port > backing the VF. The network port settings examples are VLAN membership, > QoS settings, and L2 security settings, typical of a data center network. > > This patch adds the iovnl interface definitions and an iovnl module. How does this relate to the existing DCB netlink interface? My feeling is that there is some overlap in how it would get used, and some parts that are very distinct. In particular, I'd guess that you'd want to be able to set DCB parameters for each VF, but not all DCB adapters would support SR-IOV. Did you consider making this code an extension to the DCB interface instead of a separate one? What was the reason for your decision to keep it separate? Also, do you expect your interface to be supported by dcbd/lldpad, or is there a good reason to create a new tool for iovnl? 
> + * @IOV_ATTR_IFNAME: interface name of master (PF) net device (NLA_NUL_STRING) > + * @IOV_ATTR_VF_IFNAME: interface name of target VF device (NLA_NUL_STRING) As mentioned above, why not drop one of these, and just pass the VF's IFNAME? > + * @IOV_ATTR_PORT_PROFILE: port-profile name to assign to device > + * (NLA_NUL_STRING) How does the definition of the port profile get into the NIC's switch? Is there any way to list the available port profiles? > + * @IOV_ATTR_CLIENT_NAME: client name (NLA_NUL_STRING) > + * @IOV_ATTR_HOST_UUID: host UUID (NLA_NUL_STRING) Can you elaborate more on what these do? Who is the 'client' and the 'host' in this case, and why do you need to identify them? > + * @IOV_ATTR_MAC_ADDR: device station MAC address (NLA_U8[6]) Just one mac address? What happens if we want to assign multiple mac addresses to the VF later? Also, how is this defined specifically? Will a SIOCSIFHWADDR with a different MAC address on the VF fail later, or is this just the default value? > + * @IOV_ATTR_VLAN: device 8021q VLAN ID (NLA_U16) Same here: Should you be able to set multiple MAC addresses, or trunk mode? Can the VF override it? Also, for the new multi-channel VEPA, I'd guess that you also need to supply an 802.1ad S-VLAN ID. Arnd -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
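To make the attribute types quoted above concrete, here is a minimal sketch of how NUL-terminated string, 6-byte MAC and u16 VLAN values are laid out as netlink attributes (a 4-byte `nla_len`/`nla_type` header, then the payload, padded to a 4-byte boundary). The numeric attribute types below are hypothetical placeholders, since the patch's enum values are not shown in this thread, and the helper names are made up for illustration.

```python
import struct

NLA_HDRLEN = 4   # struct nlattr: u16 nla_len, u16 nla_type
NLA_ALIGNTO = 4  # attributes are padded to a 4-byte boundary

# Hypothetical type numbers; the real ones come from the patch's enum.
IOV_ATTR_VF_IFNAME = 2
IOV_ATTR_MAC_ADDR = 6
IOV_ATTR_VLAN = 7


def nla_align(length: int) -> int:
    """Round length up to the next multiple of NLA_ALIGNTO."""
    return (length + NLA_ALIGNTO - 1) & ~(NLA_ALIGNTO - 1)


def pack_attr(attr_type: int, payload: bytes) -> bytes:
    """Encode one attribute; nla_len covers header + payload, not padding."""
    nla_len = NLA_HDRLEN + len(payload)
    padding = b"\x00" * (nla_align(nla_len) - nla_len)
    # Netlink uses host byte order, hence "=" rather than "!".
    return struct.pack("=HH", nla_len, attr_type) + payload + padding


def pack_nul_string(attr_type: int, value: str) -> bytes:
    """NLA_NUL_STRING: the payload includes the terminating NUL."""
    return pack_attr(attr_type, value.encode("ascii") + b"\x00")


def pack_mac(attr_type: int, mac: str) -> bytes:
    """Six raw bytes parsed from a colon-separated MAC address."""
    raw = bytes(int(octet, 16) for octet in mac.split(":"))
    if len(raw) != 6:
        raise ValueError("MAC address must have exactly 6 octets")
    return pack_attr(attr_type, raw)


def pack_u16(attr_type: int, value: int) -> bytes:
    """NLA_U16 payload in host byte order."""
    return pack_attr(attr_type, struct.pack("=H", value))
```

A single MAC or VLAN attribute, as encoded here, carries exactly one value. The questions about multiple MAC addresses and trunk mode would need either repeated attributes or a nested list, which this sketch does not attempt.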
* Arnd Bergmann (arnd@arndb.de) wrote: > On Monday 19 April 2010, Scott Feldman wrote: > > > IOV netlink (IOVNL) adds I/O Virtualization control support to a master > > device (MD) netdev interface. The MD (e.g. SR-IOV PF) will set/get > > control settings on behalf of a slave netdevice (e.g. SR-IOV VF). The > > design allows for the case where master and slave are the > > same netdev interface. > > What is the reason for controlling the slave device through the master, > rather than talking to the slave directly? The kernel always knows > the master for each slave, so it seems to me that this information > is redundant. Not all devices have this relationship explicit (i.e. not all are pure sr-iov devices). If there's always a way to discover the master from the device, then I agree we only need the slave. > Is this new interface only for the case that you have a switch integrated > in the NIC, or also for the case where you do an LLDP and EDP exchange > with an adjacent bridge and put the device into VEPA mode? It should be useful for both. That's part of the reason for using netlink, a userspace daemon running the VDP state machine (like lldpad) can listen for these messages and see a set_port_profile request when the user starts up a VM. > > One control setting example is MAC/VLAN settings for a VF. Another > > example control setting is a port-profile for a VF. A port-profile is an > > identifier that defines policy-based settings on the network port > > backing the VF. The network port settings examples are VLAN membership, > > QoS settings, and L2 security settings, typical of a data center network. > > > > This patch adds the iovnl interface definitions and an iovnl module. > > How does this relate to the existing DCB netlink interface? My feeling > is that there is some overlap in how it would get used, and some parts > that are very distinct. 
> In particular, I'd guess that you'd want to
> be able to set DCB parameters for each VF, but not all DCB adapters
> would support SR-IOV.
>
> Did you consider making this code an extension to the DCB interface
> instead of a separate one? What was the reason for your decision
> to keep it separate?

Well, aside from the fact that DCB and VDP have some low-level similarities
in the PDU and they are both communication between the host and the switch,
they are doing different things.

> Also, do you expect your interface to be supported by dcbd/lldpad,
> or is there a good reason to create a new tool for iovnl?

lldpad would listen, I don't see why iproute2 couldn't send, and libvirt
will send as well.

> > + * @IOV_ATTR_IFNAME: interface name of master (PF) net device (NLA_NUL_STRING)
> > + * @IOV_ATTR_VF_IFNAME: interface name of target VF device (NLA_NUL_STRING)
>
> As mentioned above, why not drop one of these, and just pass the VF's IFNAME?
>
> > + * @IOV_ATTR_PORT_PROFILE: port-profile name to assign to device
> > + * (NLA_NUL_STRING)
>
> How does the definition of the port profile get into the NIC's switch?
> Is there any way to list the available port profiles?

The port profile is a concept external to the NIC's switch. It's a value
that exists in the external physical layer 2 switching infrastructure.
So an admin knows this value and is informing the adjacent switch that a
new virtual interface is coming up and needs some particular port profile.

> > + * @IOV_ATTR_CLIENT_NAME: client name (NLA_NUL_STRING)
> > + * @IOV_ATTR_HOST_UUID: host UUID (NLA_NUL_STRING)
>
> Can you elaborate more on what these do? Who is the 'client' and the 'host'
> in this case, and why do you need to identify them?
>
> > + * @IOV_ATTR_MAC_ADDR: device station MAC address (NLA_U8[6])
>
> Just one mac address? What happens if we want to assign multiple mac
> addresses to the VF later? Also, how is this defined specifically?
> Will a SIOCSIFHWADDR with a different MAC address on the VF fail
> later, or is this just the default value?
>
> > + * @IOV_ATTR_VLAN: device 8021q VLAN ID (NLA_U16)
>
> Same here: Should you be able to set multiple MAC addresses, or
> trunk mode? Can the VF override it?
> Also, for the new multi-channel VEPA, I'd guess that you also need
> to supply an 802.1ad S-VLAN ID.

Something like set_port_profile() would initiate the negotiation for the
s-vlan id for a particular channel; I'm not sure whether it's needed as part
of the netlink interface.

thanks,
-chris
On Tuesday 20 April 2010, Chris Wright wrote:
> * Arnd Bergmann (arnd@arndb.de) wrote:
> > On Monday 19 April 2010, Scott Feldman wrote:
> >
> > What is the reason for controlling the slave device through the master,
> > rather than talking to the slave directly? The kernel always knows
> > the master for each slave, so it seems to me that this information
> > is redundant.
>
> Not all devices have this relationship explicit (i.e. not all are pure
> sr-iov devices). If there's always a way to discover the master from
> the device, then I agree we only need the slave.

Hmm, is there an actual example of a card where the relationship is not
known to the kernel?

> > Is this new interface only for the case that you have a switch integrated
> > in the NIC, or also for the case where you do an LLDP and EDP exchange
> > with an adjacent bridge and put the device into VEPA mode?
>
> It should be useful for both. That's part of the reason for using
> netlink, a userspace daemon running the VDP state machine (like lldpad)
> can listen for these messages and see a set_port_profile request when
> the user starts up a VM.

After thinking some more about this case, I now believe we should do
it the other way around, and have lldpad in control of this interface
from the user space side, and letting user programs (lldptool, libvirt,
...) talk to lldpad in order to set it up.

> > Also, do you expect your interface to be supported by dcbd/lldpad,
> > or is there a good reason to create a new tool for iovnl?
>
> lldpad would listen, I don't see why iproute2 couldn't send, and libvirt
> will send as well.

Not sure. We need lldpad to do this exchange for the case of VEPA with
VDP, so always using lldpad would let us unify the user interface for
both cases. We can of course have iproute2 talk to lldpad, in the
same way that libvirt does.

> > > + * @IOV_ATTR_PORT_PROFILE: port-profile name to assign to device
> > > + * (NLA_NUL_STRING)
> >
> > How does the definition of the port profile get into the NIC's switch?
> > Is there any way to list the available port profiles?
>
> The port profile is a concept external to the NIC's switch. It's a value
> that exists in the external physical layer 2 switching infrastructure.
> So an admin knows this value and is informing the adjacent switch that a
> new virtual interface is coming up and needs some particular port profile.

But that's only the case if the NIC itself is in VEPA mode. If that
were the case, there would be no need for a kernel interface at all,
because then we could just drive the port profile selection from user
space.

The proposed interface only seems to make sense if you use it to
configure the NIC itself! Why should it care about the port profile
otherwise?

> > Same here: Should you be able to set multiple MAC addresses, or
> > trunk mode? Can the VF override it?
> > Also, for the new multi-channel VEPA, I'd guess that you also need
> > to supply an 802.1ad S-VLAN ID.
>
> Something like set_port_profile() would initiate the negotiation for the
> s-vlan id for a particular channel, not sure it's needed as part of the
> netlink interface or not.

Well, you have to set up the s-vlan ID in order to have something to
set the port profile in.

	Arnd
* Arnd Bergmann (arnd@arndb.de) wrote: > On Tuesday 20 April 2010, Chris Wright wrote: > > * Arnd Bergmann (arnd@arndb.de) wrote: > > > On Monday 19 April 2010, Scott Feldman wrote: > > > > > > What is the reason for controlling the slave device through the master, > > > rather than talking to the slave directly? The kernel always knows > > > the master for each slave, so it seems to me that this information > > > is redundant. > > > > Not all devices have this relationship explicit (i.e. not all are pure > > sr-iov devices). If there's always a way to discover the master from > > the device, then I agree we only need the slave. > > Hmm, is there an actual example of a card where the relationship is not > known to the kernel? > > > > Is this new interface only for the case that you have a switch integrated > > > in the NIC, or also for the case where you do an LLDP and EDP exchange > > > with an adjacent bridge and put the device into VEPA mode? > > > > It should be useful for both. That's part of the reason for using > > netlink, a userspace daemon running the VDP state machine (like lldpad) > > can listen for these messages and see a set_port_profile request when > > the user starts up a VM. > > After thinking some more about this case, I now believe we should do > it the other way around, and have lldpad in control of this interface > from the user space side, and letting user programs (lldptool, libvirt, > ...) talk to lldpad in order to set it up. lldpad won't be involved in all cases, yet a mgmt tool like libvirt will. so this seems backwards. > > > Also, do you expect your interface to be supported by dcbd/lldpad, > > > or is there a good reason to create a new tool for iovnl? > > > > lldpad would listen, I don't see why iproute2 couldn't send, and libvirt > > will send as well. > > Not sure. We need lldpad to do this exchange for the case of VEPA with > VDP, so always using lldpad would let us unify the user interface for > both cases. 
> We can of course have iproute2 talk to lldpad, in the
> same way that libvirt does.
>
> > > > + * @IOV_ATTR_PORT_PROFILE: port-profile name to assign to device
> > > > + * (NLA_NUL_STRING)
> > >
> > > How does the definition of the port profile get into the NIC's switch?
> > > Is there any way to list the available port profiles?
> >
> > The port profile is a concept external to the NIC's switch. It's a value
> > that exists in the external physical layer 2 switching infrastructure.
> > So an admin knows this value and is informing the adjacent switch that a
> > new virtual interface is coming up and needs some particular port profile.
>
> But that's only the case if the NIC itself is in VEPA mode. If that
> were the case, there would be no need for a kernel interface at all,
> because then we could just drive the port profile selection from user
> space.
>
> The proposed interface only seems to make sense if you use it to
> configure the NIC itself! Why should it care about the port profile
> otherwise?

In the case of devices that can do adjacent switch negotiations directly.

> > > Same here: Should you be able to set multiple MAC addresses, or
> > > trunk mode? Can the VF override it?
> > > Also, for the new multi-channel VEPA, I'd guess that you also need
> > > to supply an 802.1ad S-VLAN ID.
> >
> > Something like set_port_profile() would initiate the negotiation for the
> > s-vlan id for a particular channel, not sure it's needed as part of the
> > netlink interface or not.
>
> Well, you have to set up the s-vlan ID in order to have something to
> set the port profile in.

Right, it depends on whether they use the port profile to establish the
channel and negotiate the s-vlan ID. I don't recall the order there.

thanks,
-chris
On Tuesday 20 April 2010, Chris Wright wrote:
> * Arnd Bergmann (arnd@arndb.de) wrote:
> > On Tuesday 20 April 2010, Chris Wright wrote:
> >
> > After thinking some more about this case, I now believe we should do
> > it the other way around, and have lldpad in control of this interface
> > from the user space side, and letting user programs (lldptool, libvirt,
> > ...) talk to lldpad in order to set it up.
>
> lldpad won't be involved in all cases, yet a mgmt tool like libvirt will.
> so this seems backwards.

Well, that part is still the matter of this discussion, as far as I can
tell ;-)

> > But that's only the case if the NIC itself is in VEPA mode. If that
> > were the case, there would be no need for a kernel interface at all,
> > because then we could just drive the port profile selection from user
> > space.
> >
> > The proposed interface only seems to make sense if you use it to
> > configure the NIC itself! Why should it care about the port profile
> > otherwise?
>
> In the case of devices that can do adjacent switch negotiations directly.

I thought the idea to deal with those devices was to beat sense into
the respective developers until they do the negotiation in software 8-)

> > > > Same here: Should you be able to set multiple MAC addresses, or
> > > > trunk mode? Can the VF override it?
> > > > Also, for the new multi-channel VEPA, I'd guess that you also need
> > > > to supply an 802.1ad S-VLAN ID.
> > >
> > > Something like set_port_profile() would initiate the negotiation for the
> > > s-vlan id for a particular channel, not sure it's needed as part of the
> > > netlink interface or not.
> >
> > Well, you have to set up the s-vlan ID in order to have something to
> > set the port profile in.
>
> Right, it depends on whether they use the port profile to establish the
> channel and negotiate the s-vlan ID. I don't recall the order there.

I'm pretty sure that setting up the channel (for 802.1Qbg) is done before
any port profile comes in.

	Arnd
On 4/20/10 6:48 AM, "Arnd Bergmann" <arnd@arndb.de> wrote:

> On Monday 19 April 2010, Scott Feldman wrote:
>
>> IOV netlink (IOVNL) adds I/O Virtualization control support to a master
>> device (MD) netdev interface. The MD (e.g. SR-IOV PF) will set/get
>> control settings on behalf of a slave netdevice (e.g. SR-IOV VF). The
>> design allows for the case where master and slave are the
>> same netdev interface.
>
> What is the reason for controlling the slave device through the master,
> rather than talking to the slave directly? The kernel always knows
> the master for each slave, so it seems to me that this information
> is redundant.

The interface would allow talking to the slave directly. In fact, that's the
example with enic port-profile in patch 2/2. But it would be nice not to
rule out the case where the master proxies slave control and the master is
exclusively controlled by the hypervisor.

> Is this new interface only for the case that you have a switch integrated
> in the NIC, or also for the case where you do an LLDP and EDP exchange
> with an adjacent bridge and put the device into VEPA mode?

All of the above. Basing this on netlink gives us flexibility to work with
user-space mgmt tools or directly with kernel netdev as in the enic case.
I'm not trying to make assumptions about where (user-space, kernel) the
netlink msg is sourced or sunk, or by which entity.

>> One control setting example is MAC/VLAN settings for a VF. Another
>> example control setting is a port-profile for a VF. A port-profile is an
>> identifier that defines policy-based settings on the network port
>> backing the VF. The network port settings examples are VLAN membership,
>> QoS settings, and L2 security settings, typical of a data center network.
>>
>> This patch adds the iovnl interface definitions and an iovnl module.
>
> How does this relate to the existing DCB netlink interface?
My feeling > is that there is some overlap in how it would get used, and some parts > that are very distinct. In particular, I'd guess that you'd want to > be able to set DCB parameters for each VF, but not all DCB adapters > would support SR-IOV. > > Did you consider making this code an extension to the DCB interface > instead of a separate one? What was the reason for your decision > to keep it separate? Considered it but DCB interface is well defined for DCB and it didn't seem right gluing on interfaces not specified within DCB. I agree that there is some overlap in the sense that both interface are used to configure a netdev with some properties interesting for the data center, but the DCB interface is for local setting of the properties on the host whereas iovnl is about pushing the setting of those properties to the network for policy-based control. > Also, do you expect your interface to be supported by dcbd/lldpad, > or is there a good reason to create a new tool for iovnl? Lldpad supporting this interface would seem right, for those cases where lldpad is responsible for configuring the netdev. >> + * @IOV_ATTR_CLIENT_NAME: client name (NLA_NUL_STRING) >> + * @IOV_ATTR_HOST_UUID: host UUID (NLA_NUL_STRING) > > Can you elaborate more on what these do? Who is the 'client' and the 'host' > in this case, and why do you need to identify them? Those are optional and useful, for example, by the network mgmt tool for presenting a view such as: - blade 1/2 // know by host uuid - vm-rhel5-eth0 // client name - port-profile: xyz Something like that. >> + * @IOV_ATTR_MAC_ADDR: device station MAC address (NLA_U8[6]) > > Just one mac address? What happens if we want to assign multiple mac > addresses to the VF later? Also, how is this defined specifically? > Will a SIOCSIFHWADDR with a different MAC address on the VF fail > later, or is this just the default value? Depends on how the VF wants to handle this. 
For our use-case with enic we only need the port-profile op, so I'm not sure
what the best design is for mac+vlan on a VF. Looking for advice from folks
like yourself. If it's not needed, let's scratch it.

-scott
On 4/20/10 9:19 AM, "Arnd Bergmann" <arnd@arndb.de> wrote:

>>> But that's only the case if the NIC itself is in VEPA mode. If that
>>> were the case, there would be no need for a kernel interface at all,
>>> because then we could just drive the port profile selection from user
>>> space.
>>>
>>> The proposed interface only seems to make sense if you use it to
>>> configure the NIC itself! Why should it care about the port profile
>>> otherwise?
>>
>> In the case of devices that can do adjacent switch negotiations directly.
>
> I thought the idea to deal with those devices was to beat sense into
> the respective developers until they do the negotiation in software 8-)

When the device can do the negotiation directly with the switch, why does it
make sense to bypass that and use software on the host? I don't think we'd
want to give up on link speed/duplex auto-negotiation and punt those settings
back to the user/host like in the old days.

-scott
On Tuesday 20 April 2010, Scott Feldman wrote:
> On 4/20/10 6:48 AM, "Arnd Bergmann" <arnd@arndb.de> wrote:
>
> > On Monday 19 April 2010, Scott Feldman wrote:
> >
> >> IOV netlink (IOVNL) adds I/O Virtualization control support to a master
> >> device (MD) netdev interface. The MD (e.g. SR-IOV PF) will set/get
> >> control settings on behalf of a slave netdevice (e.g. SR-IOV VF). The
> >> design allows for the case where master and slave are the
> >> same netdev interface.
> >
> > What is the reason for controlling the slave device through the master,
> > rather than talking to the slave directly? The kernel always knows
> > the master for each slave, so it seems to me that this information
> > is redundant.
>
> The interface would allow talking to the slave directly. In fact, that's the
> example with enic port-profile in patch 2/2. But it would be nice not to
> rule out the case where the master proxies slave control and the master is
> exclusively controlled by the hypervisor.

Not sure I understand. Do you mean the case where this code runs in the
hypervisor (e.g. KVM), or a different scenario with the setup being done
in a guest driver? So far, I have assumed that we would always do the
setup on the host side, which always has access to both the master and
a slave proxy.

In particular, your interface requires access to the slave AFAICT, because
otherwise the VF IFNAME does not have any significance. Take the case
where you use network namespaces and put the VF into a separate namespace.
With your interface, the PF is still in the root namespace, but passing
both interface names in this interface won't help you because they
are never visible in the same namespace (e.g. both might be named eth0
in their respective containers).

> > Is this new interface only for the case that you have a switch integrated
> > in the NIC, or also for the case where you do an LLDP and EDP exchange
> > with an adjacent bridge and put the device into VEPA mode?
> > All of the above. Basing this on netlink give us flexibility to work with > user-space mgmt tools or directly with kernel netdev as in the enic case. > Not trying to make assumptions about where (user-space, kernel) and by which > entity sources or sinks the netlink msg. ok. > > Did you consider making this code an extension to the DCB interface > > instead of a separate one? What was the reason for your decision > > to keep it separate? > > Considered it but DCB interface is well defined for DCB and it didn't seem > right gluing on interfaces not specified within DCB. I agree that there is > some overlap in the sense that both interface are used to configure a netdev > with some properties interesting for the data center, but the DCB interface > is for local setting of the properties on the host whereas iovnl is about > pushing the setting of those properties to the network for policy-based > control. > > > Also, do you expect your interface to be supported by dcbd/lldpad, > > or is there a good reason to create a new tool for iovnl? > > Lldpad supporting this interface would seem right, for those cases where > lldpad is responsible for configuring the netdev. I believe we meant different things here, because I misunderstood the intention of the code. My question was whether lldpad would send the netlink messages to iovnl, but from what you and Chris write, the real idea was that both lldpad and kernel/iovnl can receive the same messages, right? > >> + * @IOV_ATTR_CLIENT_NAME: client name (NLA_NUL_STRING) > >> + * @IOV_ATTR_HOST_UUID: host UUID (NLA_NUL_STRING) > > > > Can you elaborate more on what these do? Who is the 'client' and the 'host' > > in this case, and why do you need to identify them? > > Those are optional and useful, for example, by the network mgmt tool for > presenting a view such as: > > - blade 1/2 // know by host uuid > - vm-rhel5-eth0 // client name > - port-profile: xyz > > Something like that. 
Hmm, but how do they get from the device driver to the network
management tool then? Also, these are similar to the attributes
that are passed in the 802.1Qbg VDP protocol, but not compatible.
If the idea is to use the same netlink protocol for both your internal
representation and for the standard based protocol, I think we should
make them compatible.

Instead of a string identifying the port profile, this needs to pass
a four byte field for a VSI type (3 bytes) and VSI manager ID (1 byte).
There is also a UUID in VDP, but it identifies the guest, not the host,
so this is really confusing.

VDP also needs a list of MAC addresses and VLAN IDs (normally only
one of each), but that would be separate from what you tell the adapter,
see below:

> >> + * @IOV_ATTR_MAC_ADDR: device station MAC address (NLA_U8[6])
> >
> > Just one mac address? What happens if we want to assign multiple mac
> > addresses to the VF later? Also, how is this defined specifically?
> > Will a SIOCSIFHWADDR with a different MAC address on the VF fail
> > later, or is this just the default value?
>
> Depends on how the VF wants to handle this. For our use-case with enic we
> only need the port-profile op so I'm not sure what the best design is for
> mac+vlan on a VF. Looking for advice from folks like yourself. If it's not
> needed, let's scratch it.

In order to make VEPA work, it's absolutely required to impose a hard limit
on what MAC+VLAN IDs are visible to the VF, because the switch identifies
the guest by those and forwards any frames to/from that address according
to the VSI type.

However, I feel that we should strictly separate the steps of configuring
the adapter from talking to the switch. When we do the VDP association
in user land, we still need to set up the VLAN and MAC configuration for
the VF through a kernel interface. If we ignore the port profile stuff
for a moment, your netlink interface looks like a good fit for that.
Since it seems what you really want to do is to do the exchange with the
switch from here, maybe the hardware configuration part should be moved
to the DCB interface?

	Arnd
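The four-byte field mentioned above (a 3-byte VSI type plus a 1-byte VSI manager ID) could be packed roughly as follows. This is only a sketch: the field order (manager ID first) and the big-endian encoding of the type are assumptions, not taken from this thread, and should be checked against the 802.1Qbg draft. The function names are hypothetical.

```python
def pack_vsi(manager_id: int, type_id: int) -> bytes:
    """Pack a VSI manager ID (1 byte) and VSI type (3 bytes) into 4 bytes.

    Assumed layout: manager ID first, then the type in big-endian order.
    """
    if not 0 <= manager_id <= 0xFF:
        raise ValueError("VSI manager ID must fit in 1 byte")
    if not 0 <= type_id <= 0xFFFFFF:
        raise ValueError("VSI type must fit in 3 bytes")
    return bytes([manager_id]) + type_id.to_bytes(3, "big")


def unpack_vsi(field: bytes) -> tuple:
    """Inverse of pack_vsi: returns (manager_id, type_id)."""
    if len(field) != 4:
        raise ValueError("VSI field must be exactly 4 bytes")
    return field[0], int.from_bytes(field[1:], "big")
```

A fixed-width binary field like this cannot carry an arbitrary port-profile name, which is why the string attribute in the patch and the VDP representation are not interchangeable as they stand.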
I am noticing a strange and a troublesome behavior with tcp md5 checksums. Some selective packets are going out with invalid md5 checksums. The only thing that is changing is the ack number (between the packets with valid and invalid md5 checksums), so while most packets have correct md5 checksums few 1 in 1000s have md5 checksums errors. I am on 2.6.26 and I know that there have been significant changes since this version in this area. I have gone thru them but none of issues they address seem like the cause for this problem. I have the scatter/gather and tcp segmentation disabled in the card. The packet captures are attached. Bijay
On Tuesday 20 April 2010, Scott Feldman wrote:
> On 4/20/10 9:19 AM, "Arnd Bergmann" <arnd@arndb.de> wrote:
>
> >> In the case of devices that can do adjacent switch negotiations directly.
> >
> > I thought the idea to deal with those devices was to beat sense into
> > the respective developers until they do the negotiation in software 8-)
>
> When the device can do the negotiation directly with the switch, why does it
> make sense to bypass that and use software on the host? I don't think we'd
> want to give up on link speed/duplex auto-negotiation and punt those setting
> back to the user/host like in the old days.

For the link negotiation, the card is the right place because it's necessary
to get the link working before the OS can talk to the switch.
For VDP, that's different because the hypervisor needs to talk to the switch
before the guest can communicate, so there is no interdependency.

More importantly, the card cannot possibly do the protocol by itself, because
the information that gets exchanged is specific to the hypervisor and the
guest, not to the hardware. What you have implemented is another protocol
between the hypervisor and the NIC that exchanges the exact same data that
then gets sent to the switch. We already need to have an implementation that
sends this data to the switch from user space for all cards that don't do it
in firmware, so doing an alternative path in the adapter really creates more
work for the users, and means that when we fix bugs or add features to the
common code, you don't get them ;-).

	Arnd
* Arnd Bergmann (arnd@arndb.de) wrote: > On Tuesday 20 April 2010, Scott Feldman wrote: > > On 4/20/10 6:48 AM, "Arnd Bergmann" <arnd@arndb.de> wrote: > > > Also, do you expect your interface to be supported by dcbd/lldpad, > > > or is there a good reason to create a new tool for iovnl? > > > > Lldpad supporting this interface would seem right, for those cases where > > lldpad is responsible for configuring the netdev. > > I believe we meant different things here, because I misunderstood the > intention of the code. My question was whether lldpad would send the > netlink messages to iovnl, but from what you and Chris write, the > real idea was that both lldpad and kernel/iovnl can receive the > same messages, right? Correct. An example set of steps for initiating host to switch negotiation and subsequently launching a VM would be (expect user below to be a mgmt tool like libvirt): 1) user sends netlink message w/ relevant host interface and port profile id 2) recipient picks this up (enic, lldpad, whatever) 3) recipient does negotiation w/ adjacent switch 4) user creates macvtap associated w/ relevant host interface 5) user launches guest > > >> + * @IOV_ATTR_CLIENT_NAME: client name (NLA_NUL_STRING) > > >> + * @IOV_ATTR_HOST_UUID: host UUID (NLA_NUL_STRING) > > > > > > Can you elaborate more on what these do? Who is the 'client' and the 'host' > > > in this case, and why do you need to identify them? > > > > Those are optional and useful, for example, by the network mgmt tool for > > presenting a view such as: > > > > - blade 1/2 // know by host uuid > > - vm-rhel5-eth0 // client name > > - port-profile: xyz > > > > Something like that. > > Hmm, but how do they get from the device driver to the the network > management tool then? Also, these are similar to the attributes > that are passed in the 802.1Qbg VDP protocol, but not compatible. 
> If the idea is to use the same netlink protocol for both your internal
> representation and for the standard based protocol, I think we should
> make them compatible.

Indeed, that's my expectation.

> Instead of a string identifying the port profile, this needs to pass
> a four byte field for a VSI type (3 bytes) and VSI manager ID (1 byte).

I think we just need a u8 array, 4 bytes for VDP, some maxlen that is
at least as large as enic expects.

> There is also a UUID in VDP, but it identifies the guest, not the host,
> so this is really confusing.

Yes, I had same confusion. I expected guest, enic wants to send host as
well.

> VDP also needs a list of MAC addresses and VLAN IDs (normally only
> one of each), but that would be separate from what you tell the adapter,
> see below:
>
> > >> + * @IOV_ATTR_MAC_ADDR: device station MAC address (NLA_U8[6])
> > >
> > > Just one mac address? What happens if we want to assign multiple mac
> > > addresses to the VF later? Also, how is this defined specifically?
> > > Will a SIOCSIFHWADDR with a different MAC address on the VF fail
> > > later, or is this just the default value?
> >
> > Depends on how the VF wants to handle this. For our use-case with enic we
> > only need the port-profile op so I'm not sure what the best design is for
> > mac+vlan on a VF. Looking for advice from folks like yourself. If it's not
> > needed, let's scratch it.
>
> In order to make VEPA work, it's absolutely required to impose a hard limit
> on what MAC+VLAN IDs are visible to the VF, because the switch identifies
> the guest by those and forwards any frames to/from that address according
> to the VSI type.
>
> However, I feel that we should strictly separate the steps of configuring
> the adapter from talking to the switch. When we do the VDP association
> in user land, we still need to set up the VLAN and MAC configuration for
> the VF through a kernel interface. If we ignore the port profile stuff
> for a moment, your netlink interface looks like a good fit for that.
>
> Since it seems what you really want to do is to do the exchange with the
> switch from here, maybe the hardware configuration part should be moved
> to the DCB interface?

I suppose this would work (although it's a bit odd being out of scope
of DCB spec). I don't expect mgmt app to care about the implementation
specifics of an adapter, so it will always send this and iovnl message
too. All as part of same setup.

thanks,
-chris
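To make the "four byte field for a VSI type (3 bytes) and VSI manager ID (1 byte)" idea from this message concrete, here is a minimal userspace sketch of carrying it in a fixed-size u8 array. The byte layout (manager ID first, then the VSI type in big-endian order), the struct, and the helper names are all assumptions made for illustration; none of them come from the patch or from the 802.1Qbg draft.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical: 1 byte manager ID + 3 bytes VSI type. */
#define VSI_FIELD_LEN 4

/* Hypothetical decoded form of the 4-byte field. */
struct vsi_id {
	uint8_t  mgr_id;	/* VSI manager ID (1 byte) */
	uint32_t type_id;	/* VSI type, only the low 24 bits are used */
};

/* Pack into a u8 array. Assumed layout: mgr_id, then type_id big-endian. */
static void vsi_pack(const struct vsi_id *in, uint8_t out[VSI_FIELD_LEN])
{
	assert(in && out);
	assert(in->type_id <= 0xFFFFFF);	/* must fit in 3 bytes */
	out[0] = in->mgr_id;
	out[1] = (uint8_t)(in->type_id >> 16);
	out[2] = (uint8_t)(in->type_id >> 8);
	out[3] = (uint8_t)in->type_id;
}

/* Inverse of vsi_pack(), using the same assumed layout. */
static void vsi_unpack(const uint8_t in[VSI_FIELD_LEN], struct vsi_id *out)
{
	assert(in && out);
	out->mgr_id = in[0];
	out->type_id = ((uint32_t)in[1] << 16) | ((uint32_t)in[2] << 8) | in[3];
}
```

A fixed-length binary attribute like this would sit alongside, rather than replace, a longer string attribute if enic still needs a port-profile name; which of the two a given driver consumes is left open in the thread.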
On 4/21/10 6:17 AM, "Arnd Bergmann" <arnd@arndb.de> wrote:
> On Tuesday 20 April 2010, Scott Feldman wrote:
>> On 4/20/10 9:19 AM, "Arnd Bergmann" <arnd@arndb.de> wrote:
>>
>>>> In the case of devices that can do adjacent switch negotiations directly.
>>>
>>> I thought the idea to deal with those devices was to beat sense into
>>> the respective developers until they do the negotiation in software 8-)
>>
>> When the device can do the negotiation directly with the switch, why does it
>> make sense to bypass that and use software on the host? I don't think we'd
>> want to give up on link speed/duplex auto-negotiation and punt those settings
>> back to the user/host like in the old days.
>
> For the link negotiation, the card is the right place because it's necessary
> to get the link working before the OS can talk to the switch.
> For VDP, that's different because the hypervisor needs to talk to the switch
> before the guest can communicate, so there is no interdependency.
>
> More importantly, the card cannot possibly do the protocol by itself,
> because the information that gets exchanged is specific to the hypervisor and
> the guest, not to the hardware. What you have implemented is another protocol
> between the hypervisor and the NIC that exchanges the exact same data that
> then gets sent to the switch. We already need to have an implementation that
> sends this data to the switch from user space for all cards that don't do
> it in firmware, so doing an alternative path in the adapter really creates
> more work for the users, and means that when we fix bugs or add features
> to the common code, you don't get them ;-).

But the point of iovnl was to provide a single mechanism for both types of
adapters (w/ or w/o firmware assist) to exchange this data with the switch,
therefore making the difference in the adapters transparent to the user. So
I'm missing your point about more work for the users.
-scott
On Wednesday 21 April 2010, Chris Wright wrote:
> * Arnd Bergmann (arnd@arndb.de) wrote:
> > On Tuesday 20 April 2010, Scott Feldman wrote:
> > I believe we meant different things here, because I misunderstood the
> > intention of the code. My question was whether lldpad would send the
> > netlink messages to iovnl, but from what you and Chris write, the
> > real idea was that both lldpad and kernel/iovnl can receive the
> > same messages, right?
>
> Correct. An example set of steps for initiating host to switch
> negotiation and subsequently launching a VM would be (expect user below
> to be a mgmt tool like libvirt):
>
> 1) user sends netlink message w/ relevant host interface and port profile id
> 2) recipient picks this up (enic, lldpad, whatever)
> 3) recipient does negotiation w/ adjacent switch
> 4) user creates macvtap associated w/ relevant host interface
> 5) user launches guest

I'd move point 4 before 1, but otherwise it makes sense and it would still
work either way.

> > If the idea is to use the same netlink protocol for both your internal
> > representation and for the standard based protocol, I think we should
> > make them compatible.
>
> Indeed, that's my expectation.
>
> [...]
>
> > Instead of a string identifying the port profile, this needs to pass
> > a four byte field for a VSI type (3 bytes) and VSI manager ID (1 byte).
>
> I think we just need a u8 array, 4 bytes for VDP, some maxlen that is
> at least as large as enic expects.
>
> > There is also a UUID in VDP, but it identifies the guest, not the host,
> > so this is really confusing.
>
> Yes, I had same confusion. I expected guest, enic wants to send host as
> well.

So given all these differences, how compatible can we make them? With the
current definition, most of the fields are at least slightly different.
The differences seem to stem mostly from the fact that Cisco switches use
a nonstandard protocol, rather than the difference between the firmware and
userland implementations of the protocol, and of course we shouldn't confuse
the two.

> > In order to make VEPA work, it's absolutely required to impose a hard limit
> > on what MAC+VLAN IDs are visible to the VF, because the switch identifies
> > the guest by those and forwards any frames to/from that address according
> > to the VSI type.
> >
> > However, I feel that we should strictly separate the steps of configuring
> > the adapter from talking to the switch. When we do the VDP association
> > in user land, we still need to set up the VLAN and MAC configuration for
> > the VF through a kernel interface. If we ignore the port profile stuff
> > for a moment, your netlink interface looks like a good fit for that.
> >
> > Since it seems what you really want to do is to do the exchange with the
> > switch from here, maybe the hardware configuration part should be moved
> > to the DCB interface?
>
> I suppose this would work (although it's a bit odd being out of scope
> of DCB spec).

It could be anywhere, it doesn't have to be the DCB interface, but could be
anything ranging from ethtool to iplink I guess. And we should define it in a
way that works for any SR-IOV card, whether it's using Cisco's protocol in
firmware, 802.1Qbg VDP in firmware, lldpad to do VDP or none of the above and
just provides an internal switch like all the existing NICs.

> I don't expect mgmt app to care about the implementation
> specifics of an adapter, so it will always send this and iovnl message
> too. All as part of same setup.

Why? I really see these things as separate. Obviously a management tool like
libvirt would need to do both these things eventually, but each of them
has multiple options that can be combined in various ways:

1) Setting up the slave device
 a) create an SR-IOV VF to assign to a guest
 b) create a macvtap device to pass to qemu or vhost
 c) attach a tap device to a bridge
 d) create a macvlan device and put it into a container
 e) create a virtual interface for a VMDq adapter

2) Registering the slave with the switch
 a) use Cisco protocol in enic firmware (see patch 2/2)
 b) use standard VDP in lldpad
 c) use reverse-engineered cisco protocol in some user tool for
    non-enic adapters.
 d) use standard VDP in firmware (hopefully this never happens)
 e) do nothing at all (as we do today)

Some of the cases can be treated identically, e.g. 1d) and 1e), or 2a) and
2c), but in general the management app needs to have some idea of which
combination it's going to set up.

Arnd
On Wednesday 21 April 2010, Scott Feldman wrote:
> On 4/21/10 6:17 AM, "Arnd Bergmann" <arnd@arndb.de> wrote:
> > More importantly, the card cannot possibly do the protocol by itself,
> > because the information that gets exchanged is specific to the hypervisor and
> > the guest, not to the hardware. What you have implemented is another protocol
> > between the hypervisor and the NIC that exchanges the exact same data that
> > then gets sent to the switch. We already need to have an implementation that
> > sends this data to the switch from user space for all cards that don't do
> > it in firmware, so doing an alternative path in the adapter really creates
> > more work for the users, and means that when we fix bugs or add features
> > to the common code, you don't get them ;-).
>
> But the point of iovnl was to provide a single mechanism for both types of
> adapters (w/ or w/o firmware assist) to exchange this data with the switch,
> therefore making the difference in the adapters transparent to the user. So
> I'm missing your point about more work for the users.

It creates an extra step: Normally we'd simply implement the network protocol
in user space, e.g. in lldpad and have other code use the lldptool command
line interface to start the negotiation. Now we have a user protocol based on
netlink that is about as complex as the wire protocol itself, at least if you
want to implement both the standard VDP and the Cisco variant, and do all the
interesting parts like guest migration and synchronously waiting for the
negotiation to complete.

Arnd
On Tuesday 20 April 2010, Arnd Bergmann wrote:
> > + * @IOV_ATTR_IFNAME: interface name of master (PF) net device (NLA_NUL_STRING)
> > + * @IOV_ATTR_VF_IFNAME: interface name of target VF device (NLA_NUL_STRING)
>
> As mentioned above, why not drop one of these, and just pass the VF's IFNAME?
>

Coming back to this point, I now think it would be ideal if we could actually
leave out IOV_ATTR_VF_IFNAME and just pass the master IFNAME and the slave
MAC address. Since we're not actually doing anything with the slave itself
but really talking to the switch, it should not be needed at all. That would
solve all problems with the slave having moved to another namespace already,
and make it totally clear that this is not about configuring the slave but
about registering it.

Scott, would that still work with your driver?

Arnd
From: Scott Feldman <scofeldm@cisco.com>
Date: Mon, 19 Apr 2010 12:18:07 -0700

> +#define IOVNL_PROTO_VERSION 1
> +

Please delete this in the final version, the macro isn't even
used by the code.

We don't do protocol versioning in netlink. Instead we get the
base stuff solid from the beginning, and then if something needs
fixing up we handle this using new attributes in a way which is
both backward and forward compatible.

Thanks.
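The point about extending a protocol with new attributes instead of a version number can be illustrated with a small userspace sketch. It is only in the spirit of netlink's type-length-value attributes and does not use the kernel's nla_* API; the `demo_*` names, the header layout, and the attribute numbers are hypothetical. The idea shown is that a parser records the attribute types it knows and skips over any it does not, so an old receiver keeps working when a newer sender adds attributes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical attribute types, for illustration only. */
enum { DEMO_ATTR_UNSPEC, DEMO_ATTR_VLAN, DEMO_ATTR_MAX = DEMO_ATTR_VLAN };

/* Simplified TLV header: total length (header + payload) and type. */
struct demo_attr {
	uint16_t len;
	uint16_t type;
};

/* Attributes are padded to a 4-byte boundary. */
#define DEMO_ALIGN(n) (((size_t)(n) + 3u) & ~(size_t)3u)

/*
 * Walk a buffer of attributes and record payload pointers for the types
 * this parser knows. Unknown (newer) types are skipped, not rejected.
 * Returns 0 on success, -1 on a malformed buffer.
 */
static int demo_parse(const uint8_t *buf, size_t len,
		      const uint8_t *tb[DEMO_ATTR_MAX + 1])
{
	size_t off = 0, step;
	int i;

	assert(buf && tb);
	for (i = 0; i <= DEMO_ATTR_MAX; i++)
		tb[i] = NULL;

	while (len - off >= sizeof(struct demo_attr)) {
		struct demo_attr a;

		memcpy(&a, buf + off, sizeof(a));
		if (a.len < sizeof(a) || a.len > len - off)
			return -1;
		if (a.type <= DEMO_ATTR_MAX)
			tb[a.type] = buf + off + sizeof(a);
		/* Advance past payload and padding without overrunning. */
		step = DEMO_ALIGN(a.len);
		off = (step > len - off) ? len : off + step;
	}
	return 0;
}
```

With this shape, adding a new attribute later only requires a new enum value on the sender side; receivers built against the old enum ignore it, which is the backward and forward compatibility described in the message.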
From: Scott Feldman <scofeldm@cisco.com>
Date: Mon, 19 Apr 2010 12:18:07 -0700

> +	if (tb[IOV_ATTR_VF_IFNAME])
> +		vf_dev = dev_get_by_name(&init_net,
> +			nla_data(tb[IOV_ATTR_VF_IFNAME]));

It's probably best to check this for NULL and notify
the user with an error in that case (don't forget to
put 'dev' in that error path :-)

As things stand it looks like if we can't find vf_dev,
we'll just send NULL down to the vf_dev arg of the various
operations and possibly silently succeed. That's not
desirable, semantically.
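The error path requested here can be sketched in plain userspace C. This is not kernel code: `mock_dev`, `mock_get_by_name()` and `mock_put()` are hypothetical stand-ins for `struct net_device`, `dev_get_by_name()` and `dev_put()`, and the choice of `-ENODEV` is an assumption (the patch itself returns `-EINVAL` elsewhere). The sketch only shows the control flow: a VF name that does not resolve is reported as an error, and the reference already held on the master device is dropped on that path.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Mock stand-in for struct net_device with a visible reference count. */
struct mock_dev {
	const char *name;
	int refcnt;
};

static struct mock_dev mock_devs[] = { { "eth0", 0 }, { "eth0_vf1", 0 } };

/* Mock of dev_get_by_name(): takes a reference on success. */
static struct mock_dev *mock_get_by_name(const char *name)
{
	size_t i;

	for (i = 0; i < sizeof(mock_devs) / sizeof(mock_devs[0]); i++) {
		if (strcmp(mock_devs[i].name, name) == 0) {
			mock_devs[i].refcnt++;
			return &mock_devs[i];
		}
	}
	return NULL;
}

/* Mock of dev_put(): drops one reference. */
static void mock_put(struct mock_dev *dev)
{
	assert(dev && dev->refcnt > 0);
	dev->refcnt--;
}

/* Look up the master and, if a VF name was supplied, the VF as well. */
static int demo_doit(const char *ifname, const char *vf_ifname)
{
	struct mock_dev *dev, *vf_dev = NULL;
	int ret = 0;

	dev = mock_get_by_name(ifname);
	if (!dev)
		return -ENODEV;

	if (vf_ifname) {
		vf_dev = mock_get_by_name(vf_ifname);
		if (!vf_dev) {
			/* Named VF not found: fail loudly, do not pass NULL on. */
			ret = -ENODEV;
			goto out;
		}
	}

	/* ... a real handler would dispatch to the driver op here ... */

	if (vf_dev)
		mock_put(vf_dev);
out:
	mock_put(dev);	/* master reference is dropped on every path */
	return ret;
}
```

Keeping a single exit label for the master device makes it hard to forget the put when more failure cases are added later.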
On Thursday 22 April 2010, David Miller wrote:
> From: Scott Feldman <scofeldm@cisco.com>
> Date: Mon, 19 Apr 2010 12:18:07 -0700
>
> > +	if (tb[IOV_ATTR_VF_IFNAME])
> > +		vf_dev = dev_get_by_name(&init_net,
> > +			nla_data(tb[IOV_ATTR_VF_IFNAME]));
>
> It's probably best to check this for NULL and notify
> the user with an error in that case (don't forget to
> put 'dev' in that error path :-)

Since you brought up that hunk: shouldn't the namespace better
be current->nsproxy->net_ns instead of init_net? If the sender
is confined in a separate network namespace, I would expect
that it should be able to modify devices in its own namespace
but none that are in the root namespace.

Arnd
From: Arnd Bergmann <arnd@arndb.de>
Date: Thu, 22 Apr 2010 12:53:11 +0200

> On Thursday 22 April 2010, David Miller wrote:
>> From: Scott Feldman <scofeldm@cisco.com>
>> Date: Mon, 19 Apr 2010 12:18:07 -0700
>>
>> > +	if (tb[IOV_ATTR_VF_IFNAME])
>> > +		vf_dev = dev_get_by_name(&init_net,
>> > +			nla_data(tb[IOV_ATTR_VF_IFNAME]));
>>
>> It's probably best to check this for NULL and notify
>> the user with an error in that case (don't forget to
>> put 'dev' in that error path :-)
>
> Since you brought up that hunk: shouldn't the namespace better
> be current->nsproxy->net_ns instead of init_net? If the sender
> is confined in a separate network namespace, I would expect
> that it should be able to modify devices in its own namespace
> but none that are in the root namespace.

Yes, the namespace needs to be handled better.

But reading other parts of the discussion it seems that
IOV_ATTR_VF_IFNAME and some other bits will likely be
removed in the initial implementation of this stuff.
On Thursday 22 April 2010, David Miller wrote:
> But reading other parts of the discussion it seems that
> IOV_ATTR_VF_IFNAME and some other bits will likely be
> removed in the initial implementation of this stuff.

That's what I suggested, yes. However, I'm still waiting for a reply from
Scott whether it's actually possible to remove it based on the way that the
enic firmware works.

Arnd
diff --git a/include/linux/iovnl.h b/include/linux/iovnl.h new file mode 100644 index 0000000..ac5fcd3 --- /dev/null +++ b/include/linux/iovnl.h @@ -0,0 +1,124 @@ +/* + * Copyright 2010 Cisco Systems, Inc. All rights reserved. + * + * This program is free software; you may redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; version 2 of the License. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + */ + +#ifndef __LINUX_IOVNL_H__ +#define __LINUX_IOVNL_H__ + +#include <linux/types.h> + +#define IOVNL_PROTO_VERSION 1 + +/** + * IOV netlink (IOVNL) adds I/O Virtualization control support to a master + * device (MD) netdev interface. The MD (e.g. SR-IOV PF) will set/get + * control settings on behalf of a slave netdevice (e.g. SR-IOV VF). The + * design allows for the degenerative case where master and slave are the + * same netdev interface. + * + * One control setting example is MAC/VLAN settings for a VF. Another + * example control setting is a port-profile for a VF. A port-profile is an + * identifier that defines policy-based settings on the network port + * backing the VF. The network port settings examples are VLAN membership, + * QoS settings, and L2 security settings, typical of a data center network. + * + * This file defines an rtnetlink interface to allow setting of IOVNL + * on capable netdev devices. 
+ */ + +struct iovnlmsg { + __u8 family; + __u8 cmd; + __u16 pad; +}; + +/** + * enum iovnl_cmds - supported IOV commands + * + * @IOV_CMD_UNDEFINED: unspecified command to catch errors + * @IOV_CMD_SET_PORT_PROFILE: set the port-profile on the device + * @IOV_CMD_UNSET_PORT_PROFILE: clear port-profile on the device + * @IOV_CMD_GET_PORT_PROFILE_STATUS: return status of last + * IOV_CMD_SET_PORT_PROFILE command + * @IOV_CMD_SET_MAC_VLAN: Set the MAC address and VLAN on the device + */ +enum iovnl_cmds { + IOV_CMD_UNDEFINED, + + IOV_CMD_SET_PORT_PROFILE, + IOV_CMD_UNSET_PORT_PROFILE, + IOV_CMD_GET_PORT_PROFILE_STATUS, + + IOV_CMD_SET_MAC_VLAN, + + __IOV_CMD_ENUM_MAX, + IOV_CMD_MAX = __IOV_CMD_ENUM_MAX - 1, +}; + +/** + * enum iovnl_attrs - IOV top-level netlink attributes + * + * @IOV_ATTR_UNDEFINED: unspecified attribute to catch errors + * @IOV_ATTR_IFNAME: interface name of master (PF) net device (NLA_NUL_STRING) + * @IOV_ATTR_VF_IFNAME: interface name of target VF device (NLA_NUL_STRING) + * @IOV_ATTR_PORT_PROFILE: port-profile name to assign to device + * (NLA_NUL_STRING) + * @IOV_ATTR_CLIENT_NAME: client name (NLA_NUL_STRING) + * @IOV_ATTR_HOST_UUID: host UUID (NLA_NUL_STRING) + * @IOV_ATTR_PORT_PROFILE_STATUS: status of last IOV_CMD_SET_PORT_PROFILE + * command (NLA_U8) + * @IOV_ATTR_MAC_ADDR: device station MAC address (NLA_U8[6]) + * @IOV_ATTR_VLAN: device 8021q VLAN ID (NLA_U16) + * @IOV_ATTR_STATUS: cmd return status code + */ +enum iovnl_attrs { + IOV_ATTR_UNDEFINED, + + IOV_ATTR_IFNAME, + IOV_ATTR_VF_IFNAME, + + IOV_ATTR_PORT_PROFILE, + IOV_ATTR_CLIENT_NAME, + IOV_ATTR_HOST_UUID, + IOV_ATTR_PORT_PROFILE_STATUS, + + IOV_ATTR_MAC_ADDR, + IOV_ATTR_VLAN, + + IOV_ATTR_STATUS, + + __IOV_ATTR_ENUM_MAX, + IOV_ATTR_MAX = __IOV_ATTR_ENUM_MAX - 1, +}; + +/** + * enum iovnl_port_profile_status - IOV_ATTR_PORT_PROFILE_STATUS status + * return codes + * + * @IOV_PORT_PROFILE_STATUS_UNKNOWN: unspecified to catch errors + * @IOV_PORT_PROFILE_STATUS_SUCCESS: port-profile 
applied successfully + * @IOV_PORT_PROFILE_STATUS_ERROR: port-profile setting had error + * @IOV_PORT_PROFILE_STATUS_INPROGRESS: port-profile setting in-progress + */ +enum iovnl_port_profile_status { + IOV_PORT_PROFILE_STATUS_UNKNOWN, + IOV_PORT_PROFILE_STATUS_SUCCESS, + IOV_PORT_PROFILE_STATUS_ERROR, + IOV_PORT_PROFILE_STATUS_INPROGRESS, +}; + +#endif /* __LINUX_IOVNL_H__ */ diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index 649a025..b531b0d 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -50,6 +50,7 @@ #ifdef CONFIG_DCB #include <net/dcbnl.h> #endif +#include <net/iovnl.h> struct vlan_group; struct netpoll_info; @@ -1048,6 +1049,9 @@ struct net_device { const struct dcbnl_rtnl_ops *dcbnl_ops; #endif + /* IOV netlink ops */ + const struct iovnl_ops *iovnl_ops; + #if defined(CONFIG_FCOE) || defined(CONFIG_FCOE_MODULE) /* max exchange id for FCoE LRO by ddp */ unsigned int fcoe_ddp_xid; diff --git a/include/linux/rtnetlink.h b/include/linux/rtnetlink.h index d1c7c90..aafadf7 100644 --- a/include/linux/rtnetlink.h +++ b/include/linux/rtnetlink.h @@ -113,6 +113,11 @@ enum { RTM_SETDCB, #define RTM_SETDCB RTM_SETDCB + RTM_GETIOV = 82, +#define RTM_GETIOV RTM_GETIOV + RTM_SETIOV, +#define RTM_SETIOV RTM_SETIOV + __RTM_MAX, #define RTM_MAX (((__RTM_MAX + 3) & ~3) - 1) }; diff --git a/include/net/iovnl.h b/include/net/iovnl.h new file mode 100644 index 0000000..c353eee --- /dev/null +++ b/include/net/iovnl.h @@ -0,0 +1,36 @@ +/* + * Copyright 2010 Cisco Systems, Inc. All rights reserved. + * + * This program is free software; you may redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; version 2 of the License. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. + * + */ + +#ifndef __NET_IOVNL_H__ +#define __NET_IOVNL_H__ + +/* + * Ops struct for the netlink callbacks. Used by IOVNL-enabled drivers through + * the netdevice struct. + */ +struct iovnl_ops { + int (*set_port_profile)(struct net_device *, struct net_device *, + char *, u8 *, char *, char *); + int (*unset_port_profile)(struct net_device *, struct net_device *); + int (*get_port_profile_status)(struct net_device *, + struct net_device *); + int (*set_mac_vlan)(struct net_device *, struct net_device *, + u8 *, u16); +}; + +#endif /* __NET_IOVNL_H__ */ diff --git a/net/Kconfig b/net/Kconfig index 0d68b40..aca5de0 100644 --- a/net/Kconfig +++ b/net/Kconfig @@ -203,6 +203,7 @@ source "net/phonet/Kconfig" source "net/ieee802154/Kconfig" source "net/sched/Kconfig" source "net/dcb/Kconfig" +source "net/iovnl/Kconfig" config RPS boolean diff --git a/net/Makefile b/net/Makefile index cb7bdc1..23589e9 100644 --- a/net/Makefile +++ b/net/Makefile @@ -61,6 +61,9 @@ obj-$(CONFIG_CAIF) += caif/ ifneq ($(CONFIG_DCB),) obj-y += dcb/ endif +ifneq ($(CONFIG_IOVNL),) +obj-y += iovnl/ +endif obj-y += ieee802154/ ifeq ($(CONFIG_NET),y) diff --git a/net/iovnl/Kconfig b/net/iovnl/Kconfig new file mode 100644 index 0000000..4548417 --- /dev/null +++ b/net/iovnl/Kconfig @@ -0,0 +1,10 @@ +config IOVNL + tristate "IOV rtnetlink support" + default n + ---help--- + This enables support for configuring IOV + on Ethernet adapters via rtnetlink. Say 'Y' + if you have an Ethernet adapter which supports network + configuration using IOV rtnetlink. + + If unsure, say N. 
diff --git a/net/iovnl/Makefile b/net/iovnl/Makefile new file mode 100644 index 0000000..9256d01 --- /dev/null +++ b/net/iovnl/Makefile @@ -0,0 +1 @@ +obj-$(CONFIG_IOVNL) += iovnl.o diff --git a/net/iovnl/iovnl.c b/net/iovnl/iovnl.c new file mode 100644 index 0000000..ce9db50 --- /dev/null +++ b/net/iovnl/iovnl.c @@ -0,0 +1,260 @@ +/* + * Copyright 2010 Cisco Systems, Inc. All rights reserved. + * + * This program is free software; you may redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; version 2 of the License. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE + * SOFTWARE. 
+ * + */ + +#include <linux/netdevice.h> +#include <linux/netlink.h> +#include <linux/rtnetlink.h> +#include <linux/iovnl.h> +#include <net/netlink.h> +#include <net/rtnetlink.h> +#include <net/iovnl.h> +#include <net/sock.h> + +MODULE_AUTHOR("Roopa Prabhu <roprabhu@cisco.com>, " + "Scott Feldman <scofeldm@cisco.com>"); +MODULE_DESCRIPTION("IOV netlink"); +MODULE_LICENSE("GPL"); + +/* IOVNL netlink attributes policy */ +static const struct nla_policy iovnl_rtnl_policy[IOV_ATTR_MAX + 1] = { + [IOV_ATTR_IFNAME] = { .type = NLA_NUL_STRING, .len = IFNAMSIZ - 1 }, + [IOV_ATTR_VF_IFNAME] = { .type = NLA_NUL_STRING, .len = IFNAMSIZ - 1 }, + [IOV_ATTR_PORT_PROFILE] = { .type = NLA_NUL_STRING, .len = 32 }, + [IOV_ATTR_CLIENT_NAME] = { .type = NLA_NUL_STRING, .len = 32 }, + [IOV_ATTR_HOST_UUID] = { .type = NLA_NUL_STRING, .len = 64 }, + [IOV_ATTR_PORT_PROFILE_STATUS] = { .type = NLA_U8 }, + [IOV_ATTR_MAC_ADDR] = { .len = 6 }, + [IOV_ATTR_VLAN] = { .type = NLA_U16 }, + [IOV_ATTR_STATUS] = { .type = NLA_U8 }, +}; + +/* standard netlink reply call */ +static int iovnl_reply(u8 value, u8 event, u8 cmd, u8 attr, u32 pid, + u32 seq, u16 flags) +{ + struct sk_buff *skb; + struct iovnlmsg *iov; + struct nlmsghdr *nlh; + int ret = -EINVAL; + + skb = nlmsg_new(NLMSG_DEFAULT_SIZE, GFP_KERNEL); + if (!skb) + return ret; + + nlh = NLMSG_NEW(skb, pid, seq, event, sizeof(*iov), flags); + + iov = NLMSG_DATA(nlh); + iov->family = AF_UNSPEC; + iov->cmd = cmd; + iov->pad = 0; + + ret = nla_put_u8(skb, attr, value); + if (ret) + goto err; + + /* end the message, assign the nlmsg_len. 
*/ + nlmsg_end(skb, nlh); + ret = rtnl_unicast(skb, &init_net, pid); + if (ret) + return -EINVAL; + + return 0; +nlmsg_failure: +err: + kfree_skb(skb); + return ret; +} + +static int iovnl_get_port_profile_status(struct net_device *dev, + struct net_device *vf_dev, u32 pid, u32 seq, u16 flags) +{ + int ret; + + if (!dev->iovnl_ops->get_port_profile_status) + return -EINVAL; + + ret = dev->iovnl_ops->get_port_profile_status(dev, vf_dev); + + return iovnl_reply(ret, RTM_GETIOV, + IOV_CMD_GET_PORT_PROFILE_STATUS, IOV_ATTR_PORT_PROFILE_STATUS, + pid, seq, flags); +} + + +static int iovnl_set_port_profile(struct net_device *dev, + struct net_device *vf_dev, struct nlattr **tb, + u32 pid, u32 seq, u16 flags) +{ + int i, ret; + char *port_profile = NULL; + u8 *mac_addr = NULL; + char *client_name = NULL; + char *host_uuid = NULL; + + if (!tb[IOV_ATTR_PORT_PROFILE] || !dev->iovnl_ops->set_port_profile) + return -EINVAL; + + for (i = 0; i <= IOV_ATTR_MAX; i++) { + if (!tb[i]) + continue; + switch (tb[i]->nla_type) { + case IOV_ATTR_PORT_PROFILE: + port_profile = nla_data(tb[i]); + break; + case IOV_ATTR_MAC_ADDR: + mac_addr = nla_data(tb[i]); + break; + case IOV_ATTR_CLIENT_NAME: + client_name = nla_data(tb[i]); + break; + case IOV_ATTR_HOST_UUID: + host_uuid = nla_data(tb[i]); + break; + } + } + + ret = dev->iovnl_ops->set_port_profile(dev, vf_dev, + port_profile, mac_addr, client_name, host_uuid); + + return iovnl_reply(ret, RTM_SETIOV, IOV_CMD_SET_PORT_PROFILE, + IOV_ATTR_STATUS, pid, seq, flags); +} + +static int iovnl_set_mac_vlan(struct net_device *dev, + struct net_device *vf_dev, struct nlattr **tb, + u32 pid, u32 seq, u16 flags) +{ + int i, ret; + u8 *mac_addr = NULL; + u16 vlan = 0; + + if (!dev->iovnl_ops->set_mac_vlan) + return -EINVAL; + + for (i = 0; i <= IOV_ATTR_MAX; i++) { + if (!tb[i]) + continue; + switch (tb[i]->nla_type) { + case IOV_ATTR_MAC_ADDR: + mac_addr = nla_data(tb[i]); + break; + case IOV_ATTR_VLAN: + vlan = nla_get_u16(tb[i]); + break; + } + } 
+ + ret = dev->iovnl_ops->set_mac_vlan(dev, vf_dev, + mac_addr, vlan); + + return iovnl_reply(ret, RTM_SETIOV, IOV_CMD_SET_MAC_VLAN, + IOV_ATTR_STATUS, pid, seq, flags); +} + +static int iovnl_unset_port_profile(struct net_device *dev, + struct net_device *vf_dev, struct nlattr **tb, + u32 pid, u32 seq, u16 flags) +{ + int ret; + + if (!dev->iovnl_ops->unset_port_profile) + return -EINVAL; + + ret = dev->iovnl_ops->unset_port_profile(dev, vf_dev); + + return iovnl_reply(ret, RTM_SETIOV, IOV_CMD_UNSET_PORT_PROFILE, + IOV_ATTR_STATUS, pid, seq, flags); +} + +static int iovnl_doit(struct sk_buff *skb, struct nlmsghdr *nlh, void *arg) +{ + struct net *net = sock_net(skb->sk); + struct net_device *dev; + struct net_device *vf_dev = NULL; + struct iovnlmsg *iov = (struct iovnlmsg *)NLMSG_DATA(nlh); + struct nlattr *tb[IOV_ATTR_MAX + 1]; + u32 pid = skb ? NETLINK_CB(skb).pid : 0; + int ret; + + if (!net_eq(net, &init_net)) + return -EINVAL; + + ret = nlmsg_parse(nlh, sizeof(*iov), tb, IOV_ATTR_MAX, + iovnl_rtnl_policy); + if (ret < 0) + return ret; + + if (!tb[IOV_ATTR_IFNAME]) + return -EINVAL; + + dev = dev_get_by_name(&init_net, nla_data(tb[IOV_ATTR_IFNAME])); + if (!dev) + return -EINVAL; + + if (tb[IOV_ATTR_VF_IFNAME]) + vf_dev = dev_get_by_name(&init_net, + nla_data(tb[IOV_ATTR_VF_IFNAME])); + + if (!dev->iovnl_ops) + goto errout; + + switch (iov->cmd) { + case IOV_CMD_SET_PORT_PROFILE: + ret = iovnl_set_port_profile(dev, vf_dev, + tb, pid, nlh->nlmsg_seq, nlh->nlmsg_flags); + goto out; + case IOV_CMD_UNSET_PORT_PROFILE: + ret = iovnl_unset_port_profile(dev, vf_dev, + tb, pid, nlh->nlmsg_seq, nlh->nlmsg_flags); + goto out; + case IOV_CMD_GET_PORT_PROFILE_STATUS: + ret = iovnl_get_port_profile_status(dev, vf_dev, + pid, nlh->nlmsg_seq, nlh->nlmsg_flags); + goto out; + case IOV_CMD_SET_MAC_VLAN: + ret = iovnl_set_mac_vlan(dev, vf_dev, + tb, pid, nlh->nlmsg_seq, nlh->nlmsg_flags); + goto out; + default: + goto errout; + } +errout: + ret = -EINVAL; +out: + dev_put(dev); 
+ if (vf_dev) + dev_put(vf_dev); + + return ret; +} + +static int __init iovnl_init(void) +{ + rtnl_register(PF_UNSPEC, RTM_GETIOV, iovnl_doit, NULL); + rtnl_register(PF_UNSPEC, RTM_SETIOV, iovnl_doit, NULL); + + return 0; +} +module_init(iovnl_init); + +static void __exit iovnl_exit(void) +{ + rtnl_unregister(PF_UNSPEC, RTM_GETIOV); + rtnl_unregister(PF_UNSPEC, RTM_SETIOV); +} +module_exit(iovnl_exit);