Message ID | 1236761624.2567.442.camel@ymzhang
---|---
State | RFC, archived
Delegated to | David Miller
"Zhang, Yanmin" <yanmin_zhang@linux.intel.com> writes: > I got some comments. Special thanks to Stephen Hemminger for teaching me on > what reorder is and some other comments. Also thank other guys who raised comments. > > v2 has some improvements. > 1) Add new sysfs interface /sys/class/net/ethXXX/rx_queueXXX/processing_cpu. Admin > could use it to configure the binding between RX and cpu number. So it's convenient > for drivers to use the new capability. Seems very inconvenient to have to configure this by hand. How about auto selecting one that shares the same LLC or somesuch? Passing data to anything with the same LLC should be cheap enough. BTW the standard idea to balance processing over multiple CPUs was to use MSI-X to multiple CPUs and just use the hash function on the NIC. Have you considered this for forwarding too? The trick here would be to try to avoid reordering inside streams as far as possible, but since the NIC hash should work on a flow basis that should be ok. -Andi
On Wed, 2009-03-11 at 12:13 +0100, Andi Kleen wrote: > "Zhang, Yanmin" <yanmin_zhang@linux.intel.com> writes: > > > I got some comments. Special thanks to Stephen Hemminger for teaching me on > > what reorder is and some other comments. Also thank other guys who raised comments. > > > > > > v2 has some improvements. > > 1) Add new sysfs interface /sys/class/net/ethXXX/rx_queueXXX/processing_cpu. Admin > > could use it to configure the binding between RX and cpu number. So it's convenient > > for drivers to use the new capability. > > Seems very inconvenient to have to configure this by hand. A little, but not too much, especially when we consider there is interrupt binding. > How about > auto selecting one that shares the same LLC or somesuch? There are 2 kinds of LLC sharing here. 1) RX/TX share the LLC; 2) All RX share the LLC of some cpus and TX share the LLC of other cpus. Item 1) is important, but sometimes item 2) is also important when the sending speed is very high and huge data is in flight, which flushes the cpu cache quickly. It's hard to distinguish the 2 different scenarios automatically. > Passing > data to anything with the same LLC should be cheap enough. Yes, when the data isn't huge. My forwarding test currently reaches about 270M bytes per second on Nehalem, and I expect higher if I could get the latest NICs. > BTW the standard idea to balance processing over multiple CPUs was to > use MSI-X to multiple CPUs. Yes. My method still depends on MSI-X and multi-queue. One difference is that I need fewer than CPU_NUM interrupt vectors, as only some cpus work on packet receiving. > and just use the hash function on the > NIC. Sorry, I don't understand what the hash function of the NIC is. Perhaps NIC hardware has something like a hash function to decide the RX queue number based on SRC/DST? > Have you considered this for forwarding too? Yes. 
Originally, I planned to add a tx_num under the same sysfs directory, so the admin could define that all packets received from a RX queue should be sent out from a specific TX queue. So struct sk_buff->queue_mapping would be a union of 2 sub-members, rx_num and tx_num. But sk_buff->queue_mapping is just a u16, which is a small type. We might use the most-significant bit of sk_buff->queue_mapping as a flag, as rx_num and tx_num wouldn't exist at the same time. > The trick here would > be to try to avoid reordering inside streams as far as possible, It's not meant to solve the reorder issue. The starting point is that a 10G NIC is very fast. We need some cpus dedicated to packet receiving. If they work on other things, the NIC might drop packets quickly. The sysfs interface is just to facilitate NIC drivers. Without the sysfs interface, driver developers would need to implement it with module parameters, which is painful. > but > since the NIC hash should work on flow basis that should be ok. Yes, hardware is good at preventing reorder. My method doesn't change the order in the software layer. Thanks Andi. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
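The u16 encoding Yanmin describes — storing either an rx_num or a tx_num in queue_mapping, distinguished by the most-significant bit — can be sketched in userspace C. The helper names here are hypothetical, not from the actual patch:

```c
#include <stdint.h>

/* Hypothetical encoding of sk_buff->queue_mapping as described above:
 * the top bit of the u16 flags whether the low 15 bits hold a TX or
 * an RX queue number, since the two never coexist in the same skb. */
#define QM_TX_FLAG 0x8000u

static inline uint16_t qm_set_rx(uint16_t rx_num)
{
    return rx_num & 0x7fffu;                 /* flag clear: RX number */
}

static inline uint16_t qm_set_tx(uint16_t tx_num)
{
    return (tx_num & 0x7fffu) | QM_TX_FLAG;  /* flag set: TX number */
}

static inline int qm_is_tx(uint16_t qm)
{
    return (qm & QM_TX_FLAG) != 0;
}

static inline uint16_t qm_num(uint16_t qm)
{
    return qm & 0x7fffu;                     /* strip the flag bit */
}
```

A driver would store qm_set_rx(queue) at receive time, and the forwarding path could later test qm_is_tx() to decide how to interpret the stored value.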
On Thu, 2009-03-12 at 16:16 +0800, Zhang, Yanmin wrote: > On Wed, 2009-03-11 at 12:13 +0100, Andi Kleen wrote: [...] > > and just use the hash function on the > > NIC. > Sorry. I can't understand what the hash function of NIC is. Perhaps NIC hardware has something > like hash function to decide the RX queue number based on SRC/DST? Yes, that's exactly what they do. This feature is sometimes called Receive-Side Scaling (RSS) which is Microsoft's name for it. Microsoft requires Windows drivers performing RSS to provide the hash value to the networking stack, so Linux drivers for the same hardware should be able to do so too. > > Have you considered this for forwarding too? > Yes. originally, I plan to add a tx_num under the same sysfs directory, so admin could > define that all packets received from a RX queue should be sent out from a specific TX queue. The choice of TX queue can be based on the RX hash so that configuration is usually unnecessary. > So struct sk_buff->queue_mapping would be a union of 2 sub-members, rx_num and tx_num. But > sk_buff->queue_mapping is just a u16 which is a small type. We might use the most-significant > bit of sk_buff->queue_mapping as a flag as rx_num and tx_num wouldn't exist at the > same time. > > > The trick here would > > be to try to avoid reordering inside streams as far as possible, > It's not to solve reorder issue. The start point is 10G NIC is very fast. We need some cpu > work on packet receiving dedicately. If they work on other things, NIC might drop packets > quickly. Aggressive power-saving causes far greater latency than context-switching under Linux. I believe most 10G NICs have large RX FIFOs to mitigate this. Ethernet flow control also helps to prevent packet loss. > The sysfs interface is just to facilitate NIC drivers. If there is no the sysfs interface, > driver developers need implement it with parameters which are painful. [...] 
Or through the ethtool API, which already has some multiqueue control operations. Ben.
On Thu, Mar 12, 2009 at 04:16:32PM +0800, Zhang, Yanmin wrote: > > > Seems very inconvenient to have to configure this by hand. > A little, but not too much, especially when we consider there is interrupt binding. Interrupt binding is something popular for benchmarks, but most users don't (and shouldn't need to) care. Having it work well out of the box without special configuration is very important. > > > How about > > auto selecting one that shares the same LLC or somesuch? > There are 2 kinds of LLC sharing here. > 1) RX/TX share the LLC; > 2) All RX share the LLC of some cpus and TX share the LLC of other cpus. > > Item 1) is important, but sometimes item 2) is also important when the sending speed is > very high and huge data is on flight which flushes cpu cache quickly. > It's hard to distinguish the 2 different scenarioes automatically. Why is it hard if you know the CPUs? > > and just use the hash function on the > > NIC. > Sorry. I can't understand what the hash function of NIC is. Perhaps NIC hardware has something > like hash function to decide the RX queue number based on SRC/DST? There's a Microsoft spec for a standard hash function that does this on NICs and all the serious ones support it these days. The hash is normally used to select an MSI-X target based on the input header. I think if that works your manual target shouldn't be necessary. > > The trick here would > > be to try to avoid reordering inside streams as far as possible, > It's not to solve reorder issue. The start point is 10G NIC is very fast. We need some cpu Point was that any solution shouldn't add more reordering. But when an RSS hash is used there is no reordering on a stream basis. -Andi
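The hash Andi refers to is the Toeplitz hash from Microsoft's RSS specification. A minimal userspace sketch of the per-bit computation (the function name and key handling are illustrative, not any driver's actual code):

```c
#include <stddef.h>
#include <stdint.h>

/* The default 40-byte secret key published in Microsoft's RSS spec. */
static const uint8_t rss_default_key[40] = {
    0x6d, 0x5a, 0x56, 0xda, 0x25, 0x5b, 0x0e, 0xc2,
    0x41, 0x67, 0x25, 0x3d, 0x43, 0xa3, 0x8f, 0xb0,
    0xd0, 0xca, 0x2b, 0xcb, 0xae, 0x7b, 0x30, 0xb4,
    0x77, 0xcb, 0x2d, 0xa3, 0x80, 0x30, 0xf2, 0x0c,
    0x6a, 0x42, 0xb7, 0x3b, 0xbe, 0xac, 0x01, 0xfa,
};

/* Toeplitz hash as used by RSS: for every set bit of the input,
 * XOR in the 32-bit window of the key aligned with that bit,
 * sliding the window one bit per input bit.
 * `key` must be at least len + 4 bytes long. */
static uint32_t toeplitz_hash(const uint8_t *key,
                              const uint8_t *input, size_t len)
{
    uint32_t hash = 0;
    uint32_t window = ((uint32_t)key[0] << 24) | ((uint32_t)key[1] << 16) |
                      ((uint32_t)key[2] << 8)  |  (uint32_t)key[3];

    for (size_t i = 0; i < len; i++) {
        for (int bit = 7; bit >= 0; bit--) {
            if (input[i] & (1u << bit))
                hash ^= window;
            /* slide the key window right by one bit */
            window = (window << 1) | ((key[i + 4] >> bit) & 1u);
        }
    }
    return hash;
}
```

For the spec's IPv4 2-tuple test vector (source 66.9.149.187, destination 161.142.100.80, concatenated source-first) this should produce 0x323e8fc2; the NIC typically uses the low bits of the result to pick an RX queue or MSI-X vector.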
On Thu, 2009-03-12 at 14:08 +0000, Ben Hutchings wrote: > On Thu, 2009-03-12 at 16:16 +0800, Zhang, Yanmin wrote: > > On Wed, 2009-03-11 at 12:13 +0100, Andi Kleen wrote: > [...] > > > and just use the hash function on the > > > NIC. > > Sorry. I can't understand what the hash function of NIC is. Perhaps NIC hardware has something > > like hash function to decide the RX queue number based on SRC/DST? > > Yes, that's exactly what they do. This feature is sometimes called > Receive-Side Scaling (RSS) which is Microsoft's name for it. Microsoft > requires Windows drivers performing RSS to provide the hash value to the > networking stack, so Linux drivers for the same hardware should be able > to do so too. Oh, I didn't know the background. I need to study the network stack more. Thanks for explaining it. > > > > Have you considered this for forwarding too? > > Yes. originally, I plan to add a tx_num under the same sysfs directory, so admin could > > define that all packets received from a RX queue should be sent out from a specific TX queue. > > The choice of TX queue can be based on the RX hash so that configuration > is usually unnecessary. I agree. I double-checked the latest code in the net-next-2.6 tree, and the function skb_tx_hash is enough. > > > So struct sk_buff->queue_mapping would be a union of 2 sub-members, rx_num and tx_num. But > > sk_buff->queue_mapping is just a u16 which is a small type. We might use the most-significant > > bit of sk_buff->queue_mapping as a flag as rx_num and tx_num wouldn't exist at the > > same time. > > > > > The trick here would > > > be to try to avoid reordering inside streams as far as possible, > > It's not to solve reorder issue. The start point is 10G NIC is very fast. We need some cpu > > work on packet receiving dedicately. If they work on other things, NIC might drop packets > > quickly. > > Aggressive power-saving causes far greater latency than context- > switching under Linux. Yes, when the NIC is mostly free. 
When the NIC is busy, it wouldn't enter power-saving mode. For performance testing we usually turn off all power-saving modes. :) > I believe most 10G NICs have large RX FIFOs to > mitigate against this. Ethernet flow control also helps to prevent > packet loss. I guess the NIC might allocate resources evenly for all queues, at least by default. Considering a packet-sending burst with the same SRC/DST, a specific queue might fill up quickly. I instrumented the driver and kernel to print out packet receiving and forwarding. As the latest IXGBE driver gets a packet and forwards it immediately, I think most packets are dropped by hardware because the cpu doesn't collect packets quickly enough when the specific receiving queue is full. By comparing the sending speed and forwarding speed, we could get the dropping rate easily. My experiment shows the receiving cpu is more than 50% idle and often collects all packets until the specific queue is empty. I think that's because pktgen switches to a new SRC/DST to produce another burst to fill other queues quickly. It's hard to say the cpu is slower than the NIC because they work on different parts of the full receiving/processing procedure. But we need the cpu to collect packets ASAP. > > The sysfs interface is just to facilitate NIC drivers. If there is no the sysfs interface, > > driver developers need implement it with parameters which are painful. > [...] > > Or through the ethtool API, which already has some multiqueue control > operations. That's an alternative way to configure it. Checking the sample driver patch, the change is very small. Thanks for your kind comments. Yanmin
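The skb_tx_hash approach Yanmin refers to maps a flow hash onto the device's TX queues; a simplified userspace sketch of the scaling step (not the kernel code verbatim) looks like this:

```c
#include <stdint.h>

/* Scale a 32-bit flow hash into [0, num_tx_queues): a multiply-shift
 * in the style of the kernel's skb_tx_hash(), which avoids a modulo
 * while keeping the distribution across queues even. */
static uint16_t tx_queue_for_hash(uint32_t hash, uint16_t num_tx_queues)
{
    return (uint16_t)(((uint64_t)hash * num_tx_queues) >> 32);
}
```

All packets of one flow share a hash, so they keep landing on the same TX queue and the queue choice itself cannot reorder a stream.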
On Thu, 2009-03-12 at 15:34 +0100, Andi Kleen wrote: > On Thu, Mar 12, 2009 at 04:16:32PM +0800, Zhang, Yanmin wrote: > > > > > Seems very inconvenient to have to configure this by hand. > > A little, but not too much, especially when we consider there is interrupt binding. > > Interrupt binding is something popular for benchmarks, but most users > don't (and shouldn't need to) care. Having it work well out of the box > without special configuration is very important. Thanks Andi. That's true. Now I understand why David Miller is working on auto TX selection. One thing I want to clarify is that, with the default configuration, the processing path still uses the current automatic selection. That means my method has little impact on the current automatic selection with the default configuration, except a small cache-miss cost. Another exception is that IXGBE prefers getting one packet and sending it out immediately instead of backlogging. Even when turning on the new capability to separate packet receiving and packet processing, TX selection still follows the current automatic selection. The difference is that we use a different cpu. The driver could still record the RX number into the skb, which is used when sending out. > > > > > > How about > > > auto selecting one that shares the same LLC or somesuch? > > There are 2 kinds of LLC sharing here. > > 1) RX/TX share the LLC; > > 2) All RX share the LLC of some cpus and TX share the LLC of other cpus. > > > > Item 1) is important, but sometimes item 2) is also important when the sending speed is > > very high and huge data is on flight which flushes cpu cache quickly. > > It's hard to distinguish the 2 different scenarioes automatically. > > Why is it hard if you know the CPUs? RX binding depends totally on interrupt binding. If the MSI-X interrupt is sent to cpu A, cpu A will collect the packets on the RX queue. By default, interrupts aren't bound. Software knows the LLC sharing of cpu A. 
If cpu A receives the interrupt, it couldn't just throw packets to other cpus which share its LLC, because it doesn't know whether those cpus are collecting packets from other RX queues at the moment. > > > > and just use the hash function on the > > > NIC. > > Sorry. I can't understand what the hash function of NIC is. Perhaps NIC hardware has something > > like hash function to decide the RX queue number based on SRC/DST? > > There's a Microsoft spec for a standard hash function that does this > on NICs and all the serious ones support it these days. The hash > is normally used to select a MSI-X target based on the input header. Thanks for the explanation. The capability defined by the spec is to choose an MSI-X vector and provide a hint when sending a cloned packet out. Does the NIC know how busy the cpu is? I assume not. So the hash is trying to distribute packets into RX queues evenly while also avoiding reorder. We might say irqbalance could balance the workload, so we expect the cpu workload to be even. My testing shows such an even distribution of packets across all cpus isn't good for performance. > > I think if that works your manual target shouldn't be necessary. There are 2 targets with my method: one is the packet-collecting cpu and the other is the packet-processing cpu. As the NIC doesn't know how busy the cpu is, why can't we separate the processing? > > > > The trick here would > > > be to try to avoid reordering inside streams as far as possible, > > It's not to solve reorder issue. The start point is 10G NIC is very fast. We need some cpu > > Point was that any solution shouldn't add more reordering. But when a RSS > hash is used there is no reordering on stream basis. Yes. Thanks again. Yanmin
On Thu, Mar 12, 2009 at 11:43 PM, Zhang, Yanmin <yanmin_zhang@linux.intel.com> wrote: > > On Thu, 2009-03-12 at 14:08 +0000, Ben Hutchings wrote: > > On Thu, 2009-03-12 at 16:16 +0800, Zhang, Yanmin wrote: > > > On Wed, 2009-03-11 at 12:13 +0100, Andi Kleen wrote: > > [...] > > > > and just use the hash function on the > > > > NIC. > > > Sorry. I can't understand what the hash function of NIC is. Perhaps NIC hardware has something > > > like hash function to decide the RX queue number based on SRC/DST? > > > > Yes, that's exactly what they do. This feature is sometimes called > > Receive-Side Scaling (RSS) which is Microsoft's name for it. Microsoft > > requires Windows drivers performing RSS to provide the hash value to the > > networking stack, so Linux drivers for the same hardware should be able > > to do so too. > Oh, I didn't know the background. I need study more about network. > Thanks for explain it. > You'll definitely want to look at the hardware provided hash. We've been using a 10G NIC which provides a Toeplitz hash (the one defined by Microsoft) and a software RSS-like capability to move packets from an interrupting CPU to another for processing. The hash could be used to index to a set of CPUs, but we also use the hash as a connection identifier to key into a lookup table to steer packets to the CPU where the application is running, based on the running CPU of the last recvmsg. Using the device provided hash in this manner is a HUGE win, as opposed to taking cache misses to get the 4-tuple from the packet itself to compute a hash. I posted some patches a while back on our work if you're interested. We are also using multiple RX queues of the 10G device in concert, with pretty good results. We have noticed that the interrupt overheads substantially diminish the benefits. In fact, I would say the software packet steering has provided the greater benefit (and it's very useful on our many 1G NICs that don't have multiq!). 
Tom
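Tom's steering scheme — using the device hash as a connection identifier into a table that remembers where the application last ran — can be sketched as follows (names and table size are illustrative, not Google's actual patches):

```c
#include <stdint.h>

#define FLOW_TABLE_SIZE 4096u   /* power of two; illustrative size */

/* flow_cpu[h] remembers the cpu of the last recvmsg() for flows
 * hashing to h. Entries store cpu + 1 so that zero-initialized
 * slots mean "no hint recorded yet". */
static int flow_cpu[FLOW_TABLE_SIZE];

/* Called from the recvmsg path: record where the consumer runs. */
static void flow_record(uint32_t rxhash, int cpu)
{
    flow_cpu[rxhash & (FLOW_TABLE_SIZE - 1)] = cpu + 1;
}

/* Called from the RX path: steer to the consumer's cpu if known,
 * otherwise stay on the interrupting cpu. */
static int flow_steer(uint32_t rxhash, int interrupting_cpu)
{
    int entry = flow_cpu[rxhash & (FLOW_TABLE_SIZE - 1)];
    return entry ? entry - 1 : interrupting_cpu;
}
```

Because the steering key is the hash the NIC already computed, the RX path never touches packet headers to make its decision, which is exactly the cache-miss saving Tom describes.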
From: Tom Herbert <therbert@google.com> Date: Fri, 13 Mar 2009 10:06:56 -0700 > You'll definitely want to look at the hardware provided hash. We've > been using a 10G NIC which provides a Toeplitz hash (the one defined > by Microsoft) and a software RSS-like capability to move packets from > an interrupting CPU to another for processing. The hash could be used > to index to a set of CPUs, but we also use the hash as a connection > identifier to key into a lookup table to steer packets to the CPU > where the application is running based on the running CPU of the last > recvmsg. Using the device provided hash in this manner is a HUGE win, > as opposed to taking cache misses to get 4-tuple from packet itself to > compute a hash. I posted some patches a while back on our work if > you're interested. I never understood this. If you don't let the APIC move the interrupt around, the individual MSI-X interrupts will steer packets to individual specific CPUs and as a result the scheduler will migrate tasks over to those cpus since the wakeup events keep occurring there.
On Fri, Mar 13, 2009 at 11:51 AM, David Miller <davem@davemloft.net> wrote: > > From: Tom Herbert <therbert@google.com> > Date: Fri, 13 Mar 2009 10:06:56 -0700 > > > You'll definitely want to look at the hardware provided hash. We've > > been using a 10G NIC which provides a Toeplitz hash (the one defined > > by Microsoft) and a software RSS-like capability to move packets from > > an interrupting CPU to another for processing. The hash could be used > > to index to a set of CPUs, but we also use the hash as a connection > > identifier to key into a lookup table to steer packets to the CPU > > where the application is running based on the running CPU of the last > > recvmsg. Using the device provided hash in this manner is a HUGE win, > > as opposed to taking cache misses to get 4-tuple from packet itself to > > compute a hash. I posted some patches a while back on our work if > > you're interested. > > I never understood this. > > If you don't let the APIC move the interrupt around, the individual > MSI-X interrupts will steer packets to individual specific CPUS and as > a result the scheduler will migrate tasks over to those cpus since the > wakeup events keep occuring there. We are trying to follow the scheduler's decisions as opposed to leading them. This works on very loaded systems, with applications binding to cpusets, with threads that are receiving on multiple sockets. I suppose it might be compelling if a NIC could steer packets per flow, instead of by a hash...
From: Tom Herbert <therbert@google.com> Date: Fri, 13 Mar 2009 13:58:53 -0700 > We are trying to follow the decisions scheduler as opposed to > leading it. This works on very loaded systems, with applications > binding to cpusets, with threads that are receiving on multiple > sockets. I suppose it might be compelling if a NIC could steer > packets per flow, instead of by a hash... If the hash is good it will distribute the load properly. If the NIC is sophisticated enough (Sun's Neptune chipset is) you can even group interrupt distribution by traffic type and even bind specific ports to interrupt groups. I really detest all of these software hacks that add overhead to solve problems the hardware can solve for us.
> > If the hash is good is will distribute the load properly. > > If the NIC is sophisticated enough (Sun's Neptune chipset is) > you can even group interrupt distribution by traffic type > and even bind specific ports to interrupt groups. > > I really detest all of these software hacks that add overhead > to solve problems the hardware can solve for us. > I appreciate this philosophy, but unfortunately I don't have the luxury of working with a NIC that solves these problems. The reality may be that we're trying to squeeze performance out of crappy hardware to scale on multi-core. Left alone we couldn't get the stack to scale, but with these "destable hacks" we've gotten 3X or so improvement in packets per second across both our dumb 1G and 10G NICs. These gains have translated into tangible application performance gains, so we'll probably continue to have interest in this area of development at least for the foreseeable future.
On Fri, 2009-03-13 at 14:01 -0700, Tom Herbert wrote: > On Fri, Mar 13, 2009 at 11:51 AM, David Miller <davem@davemloft.net> wrote: > > > > From: Tom Herbert <therbert@google.com> > > Date: Fri, 13 Mar 2009 10:06:56 -0700 > > > > > You'll definitely want to look at the hardware provided hash. We've > > > been using a 10G NIC which provides a Toeplitz hash (the one defined > > > by Microsoft) and a software RSS-like capability to move packets from > > > an interrupting CPU to another for processing. The hash could be used > > > to index to a set of CPUs, but we also use the hash as a connection > > > identifier to key into a lookup table to steer packets to the CPU > > > where the application is running based on the running CPU of the last > > > recvmsg. Using the device provided hash in this manner is a HUGE win, > > > as opposed to taking cache misses to get 4-tuple from packet itself to > > > compute a hash. I posted some patches a while back on our work if > > > you're interested. > > > > I never understood this. > > > > If you don't let the APIC move the interrupt around, the individual > > MSI-X interrupts will steer packets to individual specific CPUS and as > > a result the scheduler will migrate tasks over to those cpus since the > > wakeup events keep occuring there. > > We are trying to follow the decisions scheduler as opposed to leading > it. This works on very loaded systems, with applications binding to > cpusets, with threads that are receiving on multiple sockets. I > suppose it might be compelling if a NIC could steer packets per flow, > instead of by a hash... Depending on the NIC, RX queue selection may be done using a large number of bits of the hash value and an indirection table or by matching against specific values in the headers. The SFC4000 supports both of these, though limited to TCP/IPv4 and UDP/IPv4. I think Neptune may be more flexible. 
Of course, both indirection table entries and filter table entries will be limited resources in any NIC, so allocating these wholly automatically is an interesting challenge. Ben.
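The hash-plus-indirection-table RX queue selection Ben describes can be sketched like this (the 128-entry table mirrors common hardware, but the fill policy and names are illustrative):

```c
#include <stdint.h>

#define INDIR_TABLE_SIZE 128u   /* a typical RSS indirection table size */

/* The driver fills this table to spread hash buckets across RX
 * queues; here, a simple round-robin fill over num_queues. */
static uint8_t indir_table[INDIR_TABLE_SIZE];

static void indir_init(unsigned int num_queues)
{
    for (unsigned int i = 0; i < INDIR_TABLE_SIZE; i++)
        indir_table[i] = (uint8_t)(i % num_queues);
}

/* Hardware-style RX queue selection: the low bits of the flow
 * hash index the indirection table. */
static unsigned int rx_queue_for_hash(uint32_t hash)
{
    return indir_table[hash & (INDIR_TABLE_SIZE - 1)];
}
```

Rewriting table entries changes which queue (and hence which cpu) a hash bucket lands on without disturbing per-flow ordering, which is what makes the table a natural knob for the automatic allocation problem Ben raises.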
On Fri, 13 Mar 2009 22:10:59 +0000 Ben Hutchings <bhutchings@solarflare.com> wrote: > On Fri, 2009-03-13 at 14:01 -0700, Tom Herbert wrote: > > On Fri, Mar 13, 2009 at 11:51 AM, David Miller <davem@davemloft.net> wrote: > > > > > > From: Tom Herbert <therbert@google.com> > > > Date: Fri, 13 Mar 2009 10:06:56 -0700 > > > > > > > You'll definitely want to look at the hardware provided hash. We've > > > > been using a 10G NIC which provides a Toeplitz hash (the one defined > > > > by Microsoft) and a software RSS-like capability to move packets from > > > > an interrupting CPU to another for processing. The hash could be used > > > > to index to a set of CPUs, but we also use the hash as a connection > > > > identifier to key into a lookup table to steer packets to the CPU > > > > where the application is running based on the running CPU of the last > > > > recvmsg. Using the device provided hash in this manner is a HUGE win, > > > > as opposed to taking cache misses to get 4-tuple from packet itself to > > > > compute a hash. I posted some patches a while back on our work if > > > > you're interested. > > > > > > I never understood this. > > > > > > If you don't let the APIC move the interrupt around, the individual > > > MSI-X interrupts will steer packets to individual specific CPUS and as > > > a result the scheduler will migrate tasks over to those cpus since the > > > wakeup events keep occuring there. > > > > We are trying to follow the decisions scheduler as opposed to leading > > it. This works on very loaded systems, with applications binding to > > cpusets, with threads that are receiving on multiple sockets. I > > suppose it might be compelling if a NIC could steer packets per flow, > > instead of by a hash... > > Depending on the NIC, RX queue selection may be done using a large > number of bits of the hash value and an indirection table or by matching > against specific values in the headers. 
The SFC4000 supports both of > these, though limited to TCP/IPv4 and UDP/IPv4. I think Neptune may be > more flexible. Of course, both indirection table entries and filter > table entries will be limited resources in any NIC, so allocating these > wholly automatically is an interesting challenge. > > Ben. > The problem is that without hardware support, handing off the packet may take more effort than processing it, especially when a cache line has to bounce to another CPU and when trying to keep up with DoS attacks. It all depends on how much processing is required, and the architecture of the system. The tradeoff would change over time based on processing speed and optimizing the receive/firewall code.
From: Tom Herbert <therbert@google.com> Date: Fri, 13 Mar 2009 14:59:55 -0700 > I appreciate this philosophy, but unfortunately I don't have the > luxury of working with a NIC that solves these problems. The reality > may be that we're trying to squeeze performance out of crappy hardware > to scale on multi-core. Left alone we couldn't get the stack to > scale, but with these "destable hacks" we've gotten 3X or so ^^^^^^^^ Spelling. > improvement in packets per second across both our dumb 1G and 10G > NICs Do these NICs at least support multiqueue?
On Fri, Mar 13, 2009 at 03:19:13PM -0700, David Miller wrote: > > > improvement in packets per second across both our dumb 1G and 10G > > NICs > > Do these NICs at least support multiqueue? I don't think they do. See the last paragraph in Tom's first email. I think we all agree that hacks such as these are only useful for NICs that either don't support mq or if the number of rx queues is too small. The question is how much do we love these NICs :) Cheers,
>> I appreciate this philosophy, but unfortunately I don't have the >> luxury of working with a NIC that solves these problems. The reality >> may be that we're trying to squeeze performance out of crappy hardware >> to scale on multi-core. Left alone we couldn't get the stack to >> scale, but with these "destable hacks" we've gotten 3X or so > ^^^^^^^^ > > Spelling. > >> improvement in packets per second across both our dumb 1G and 10G >> NICs > > Do these NICs at least support multiqueue? > Yes, we are using a 10G NIC that supports multi-queue. The number of RX queues supported is half the number of cores on our platform, so that is going to limit the parallelism. With multi-queue turned on we do see about 4X improvement in pps over just using a single queue; this is about the same improvement we see using a single queue with our software steering techniques (this particular device provides the Toeplitz hash). Enabling HW multi-queue has somewhat higher CPU utilization, though; the extra device interrupt load is not coming for free. We actually use the HW multi-queue in conjunction with our software steering to get maximum pps (about 20% more).
> We are trying to follow the decisions scheduler as opposed to leading it. > This works on very loaded systems, with applications binding to cpusets, One possible solution would then be to just not bind to cpusets and give the scheduler the freedom it needs instead? -Andi
> Yes, we are using a 10G NIC that supports multi-queue. The number of > RX queues supported is half the number of cores on our platform, so > that is going to limit the parallelism. With multi-queue turned on we The standard wisdom is that you don't necessarily need to transmit to each core, but rather to each shared mid-level or last-level cache. Once the data is cache-hot (or cache-near), distributing it further in software is comparatively cheap. So this means you don't necessarily need as many queues as cores, but rather as many as big caches. -Andi
From: Tom Herbert <therbert@google.com> Date: Fri, 13 Mar 2009 17:24:10 -0700 > Enabling HW multi-queue has somewhat higher CPU > utilization though, the extra device interrupt load is not coming for > free. We actually use the HW multi-queue in conjunction with our > software steering to get maximum pps (about 20% more). This is a non-intuitive observation. Using HW multiqueue should be cheaper than doing it in software, right?
On Fri, Mar 13, 2009 at 07:19:51PM -0700, David Miller wrote: > From: Tom Herbert <therbert@google.com> > Date: Fri, 13 Mar 2009 17:24:10 -0700 > > > Enabling HW multi-queue has somewhat higher CPU > > utilization though, the extra device interrupt load is not coming for > > free. We actually use the HW multi-queue in conjunction with our > > software steering to get maximum pps (about 20% more). > > This is a non-intuitive observation. Using HW multiqueue should be > cheaper than doing it in software, right? Shared caches can play games with the numbers, we need to look at this a bit more. Cheers,
On Fri, Mar 13, 2009 at 7:19 PM, David Miller <davem@davemloft.net> wrote: > From: Tom Herbert <therbert@google.com> > Date: Fri, 13 Mar 2009 17:24:10 -0700 > >> Enabling HW multi-queue has somewhat higher CPU >> utilization though, the extra device interrupt load is not coming for >> free. We actually use the HW multi-queue in conjunction with our >> software steering to get maximum pps (about 20% more). > > This is a non-intuitive observation. Using HW multiqueue should be > cheaper than doing it in software, right? > I suppose it may be counter-intuitive, but I am not making a general claim. I would only suggest that these software hacks could be a very good approximation or substitute for hardware functionality. This is a generic way to get more performance out of deficient or lower end NICs.
From: Tom Herbert <therbert@google.com> Date: Sat, 14 Mar 2009 11:15:21 -0700 > I suppose it may be counter-intuitive, but I am not making a general > claim. I would only suggest that these software hacks could be a very > good approximation or substitute for hardware functionality. This is > a generic way to get more performance out of deficient or lower end > NICs. They certainly could. Why don't you post the current version of your patches so we have something concrete to discuss?
On Fri, 2009-03-13 at 10:06 -0700, Tom Herbert wrote: > On Thu, Mar 12, 2009 at 11:43 PM, Zhang, Yanmin > <yanmin_zhang@linux.intel.com> wrote: > > > > On Thu, 2009-03-12 at 14:08 +0000, Ben Hutchings wrote: > > > On Thu, 2009-03-12 at 16:16 +0800, Zhang, Yanmin wrote: > > > > On Wed, 2009-03-11 at 12:13 +0100, Andi Kleen wrote: > > > Yes, that's exactly what they do. This feature is sometimes called > > > Receive-Side Scaling (RSS), which is Microsoft's name for it. Microsoft > > > requires Windows drivers performing RSS to provide the hash value to the > > > networking stack, so Linux drivers for the same hardware should be able > > > to do so too. > > Oh, I didn't know the background. I need to study more about networking. > > Thanks for explaining it. > > > > You'll definitely want to look at the hardware-provided hash. We've > been using a 10G NIC which provides a Toeplitz hash (the one defined > by Microsoft) and a software RSS-like capability to move packets from > an interrupting CPU to another for processing. The hash could be used > to index into a set of CPUs, but we also use the hash as a connection > identifier to key into a lookup table to steer packets to the CPU > where the application is running, based on the running CPU of the last > recvmsg. Your scenario is different from mine. My case is ip_forward, which happens in the kernel, and there is no application participating in the forwarding. I might test application communication on a 10G NIC with my method later.
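For reference, the Toeplitz hash Ben and Tom mention is straightforward to compute in software: every set bit of the input tuple XORs in a sliding 32-bit window of a secret key. A sketch using the well-known 40-byte default key published in Microsoft's RSS specification:

```c
#include <stddef.h>
#include <stdint.h>

/* Toeplitz hash as specified for RSS: for every set bit i of the
 * input, XOR in the 32-bit window of the key starting at bit i. */
static uint32_t toeplitz_hash(const uint8_t *key, size_t keylen,
                              const uint8_t *data, size_t len)
{
    uint32_t hash = 0;
    uint32_t window = ((uint32_t)key[0] << 24) | ((uint32_t)key[1] << 16) |
                      ((uint32_t)key[2] << 8)  |  (uint32_t)key[3];
    size_t kbit = 32;  /* index of the next key bit to shift in */

    for (size_t i = 0; i < len; i++) {
        for (int b = 7; b >= 0; b--) {
            if (data[i] & (1u << b))
                hash ^= window;
            /* Slide the key window left by one bit. */
            window <<= 1;
            if (kbit < keylen * 8 &&
                (key[kbit / 8] & (0x80u >> (kbit % 8))))
                window |= 1;
            kbit++;
        }
    }
    return hash;
}

/* Default RSS key from the Microsoft specification. */
static const uint8_t rss_key[40] = {
    0x6d, 0x5a, 0x56, 0xda, 0x25, 0x5b, 0x0e, 0xc2,
    0x41, 0x67, 0x25, 0x3d, 0x43, 0xa3, 0x8f, 0xb0,
    0xd0, 0xca, 0x2b, 0xcb, 0xae, 0x7b, 0x30, 0xb4,
    0x77, 0xcb, 0x2d, 0xa3, 0x80, 0x30, 0xf2, 0x0c,
    0x6a, 0x42, 0xb7, 0x3b, 0xbe, 0xac, 0x01, 0xfa,
};
```

Feeding it the tuple bytes in network order (source address, destination address, then ports for the 4-tuple variant) reproduces the verification values published in the spec, which is how NIC RSS implementations are typically validated.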
On Sat, Mar 14, 2009 at 11:45 AM, David Miller <davem@davemloft.net> wrote: > > From: Tom Herbert <therbert@google.com> > Date: Sat, 14 Mar 2009 11:15:21 -0700 > > > I suppose it may be counter-intuitive, but I am not making a general > > claim. I would only suggest that these software hacks could be a very > > good approximation or substitute for hardware functionality. This is > > a generic way to get more performance out of deficient or lower end > > NICs. > > They certainly could. Why don't you post the current version > of your patches so we have something concrete to discuss? I'll do that.
--- linux-2.6.29-rc7/include/linux/netdevice.h 2009-03-09 15:20:49.000000000 +0800 +++ linux-2.6.29-rc7_backlog/include/linux/netdevice.h 2009-03-11 10:17:08.000000000 +0800 @@ -1119,6 +1119,9 @@ static inline int unregister_gifconf(uns /* * Incoming packets are placed on per-cpu queues so that * no locking is needed. + * To speed up fast network, sometimes place incoming packets + * to other cpu queues. Use input_pkt_alien_queue.lock to + * protect input_pkt_alien_queue. */ struct softnet_data { @@ -1127,6 +1130,7 @@ struct softnet_data struct list_head poll_list; struct sk_buff *completion_queue; + struct sk_buff_head input_pkt_alien_queue; struct napi_struct backlog; }; @@ -1368,6 +1372,8 @@ extern void dev_kfree_skb_irq(struct sk_ extern void dev_kfree_skb_any(struct sk_buff *skb); #define HAVE_NETIF_RX 1 +extern int raise_netif_irq(int cpu, + struct sk_buff_head *skb_queue); extern int netif_rx(struct sk_buff *skb); extern int netif_rx_ni(struct sk_buff *skb); #define HAVE_NETIF_RECEIVE_SKB 1 --- linux-2.6.29-rc7/net/core/dev.c 2009-03-09 15:20:50.000000000 +0800 +++ linux-2.6.29-rc7_backlog/net/core/dev.c 2009-03-11 10:27:57.000000000 +0800 @@ -1997,6 +1997,114 @@ int netif_rx_ni(struct sk_buff *skb) EXPORT_SYMBOL(netif_rx_ni); +static void net_drop_skb(struct sk_buff_head *skb_queue) +{ + struct sk_buff *skb = __skb_dequeue(skb_queue); + + while (skb) { + __get_cpu_var(netdev_rx_stat).dropped++; + kfree_skb(skb); + skb = __skb_dequeue(skb_queue); + } +} + +static int net_backlog_local_merge(struct sk_buff_head *skb_queue) +{ + struct softnet_data *queue; + unsigned long flags; + + queue = &__get_cpu_var(softnet_data); + if (queue->input_pkt_queue.qlen + skb_queue->qlen <= + netdev_max_backlog) { + + local_irq_save(flags); + if (!queue->input_pkt_queue.qlen) + napi_schedule(&queue->backlog); + skb_queue_splice_tail_init(skb_queue, &queue->input_pkt_queue); + local_irq_restore(flags); + + return 0; + } else { + net_drop_skb(skb_queue); + return 1; + } +} + 
+static void net_napi_backlog(void *data) +{ + struct softnet_data *queue = &__get_cpu_var(softnet_data); + + napi_schedule(&queue->backlog); + kfree(data); +} + +static int net_backlog_notify_cpu(int cpu) +{ + struct call_single_data *data; + + data = kmalloc(sizeof(struct call_single_data), GFP_ATOMIC); + if (!data) + return -1; + + data->func = net_napi_backlog; + data->info = data; + data->flags = 0; + __smp_call_function_single(cpu, data); + + return 0; +} + +int raise_netif_irq(int cpu, struct sk_buff_head *skb_queue) +{ + unsigned long flags; + struct softnet_data *queue; + int retval, need_notify=0; + + if (!skb_queue || skb_queue_empty(skb_queue)) + return 0; + + /* + * If cpu is offline, we queue skb back to + * the queue on current cpu. + */ + if ((unsigned)cpu >= nr_cpu_ids || + !cpu_online(cpu) || + cpu == smp_processor_id()) { + net_backlog_local_merge(skb_queue); + return 0; + } + + queue = &per_cpu(softnet_data, cpu); + if (queue->input_pkt_alien_queue.qlen > netdev_max_backlog) + goto failed1; + + spin_lock_irqsave(&queue->input_pkt_alien_queue.lock, flags); + if (skb_queue_empty(&queue->input_pkt_alien_queue)) + need_notify = 1; + skb_queue_splice_tail_init(skb_queue, + &queue->input_pkt_alien_queue); + spin_unlock_irqrestore(&queue->input_pkt_alien_queue.lock, + flags); + + if (need_notify) { + retval = net_backlog_notify_cpu(cpu); + if (unlikely(retval)) + goto failed2; + } + + return 0; + +failed2: + spin_lock_irqsave(&queue->input_pkt_alien_queue.lock, flags); + skb_queue_splice_tail_init(&queue->input_pkt_alien_queue, skb_queue); + spin_unlock_irqrestore(&queue->input_pkt_alien_queue.lock, + flags); +failed1: + net_drop_skb(skb_queue); + + return 1; +} + static void net_tx_action(struct softirq_action *h) { struct softnet_data *sd = &__get_cpu_var(softnet_data); @@ -2336,6 +2444,13 @@ static void flush_backlog(void *arg) struct net_device *dev = arg; struct softnet_data *queue = &__get_cpu_var(softnet_data); struct sk_buff *skb, *tmp; + 
unsigned long flags; + + spin_lock_irqsave(&queue->input_pkt_alien_queue.lock, flags); + skb_queue_splice_tail_init( + &queue->input_pkt_alien_queue, + &queue->input_pkt_queue ); + spin_unlock_irqrestore(&queue->input_pkt_alien_queue.lock, flags); skb_queue_walk_safe(&queue->input_pkt_queue, skb, tmp) if (skb->dev == dev) { @@ -2594,9 +2709,19 @@ static int process_backlog(struct napi_s local_irq_disable(); skb = __skb_dequeue(&queue->input_pkt_queue); if (!skb) { - __napi_complete(napi); - local_irq_enable(); - break; + if (!skb_queue_empty(&queue->input_pkt_alien_queue)) { + spin_lock(&queue->input_pkt_alien_queue.lock); + skb_queue_splice_tail_init( + &queue->input_pkt_alien_queue, + &queue->input_pkt_queue ); + spin_unlock(&queue->input_pkt_alien_queue.lock); + + skb = __skb_dequeue(&queue->input_pkt_queue); + } else { + __napi_complete(napi); + local_irq_enable(); + break; + } } local_irq_enable(); @@ -4985,6 +5110,11 @@ static int dev_cpu_callback(struct notif local_irq_enable(); /* Process offline CPU's input_pkt_queue */ + spin_lock(&oldsd->input_pkt_alien_queue.lock); + skb_queue_splice_tail_init(&oldsd->input_pkt_alien_queue, + &oldsd->input_pkt_queue); + spin_unlock(&oldsd->input_pkt_alien_queue.lock); + while ((skb = __skb_dequeue(&oldsd->input_pkt_queue))) netif_rx(skb); @@ -5184,10 +5314,13 @@ static int __init net_dev_init(void) struct softnet_data *queue; queue = &per_cpu(softnet_data, i); + skb_queue_head_init(&queue->input_pkt_queue); queue->completion_queue = NULL; INIT_LIST_HEAD(&queue->poll_list); + skb_queue_head_init(&queue->input_pkt_alien_queue); + queue->backlog.poll = process_backlog; queue->backlog.weight = weight_p; queue->backlog.gro_list = NULL; @@ -5247,6 +5380,7 @@ EXPORT_SYMBOL(netdev_set_master); EXPORT_SYMBOL(netdev_state_change); EXPORT_SYMBOL(netif_receive_skb); EXPORT_SYMBOL(netif_rx); +EXPORT_SYMBOL(raise_netif_irq); EXPORT_SYMBOL(register_gifconf); EXPORT_SYMBOL(register_netdevice); EXPORT_SYMBOL(register_netdevice_notifier);
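The need_notify logic in raise_netif_irq() above is the classic wake-only-on-transition optimization: the remote CPU is sent an IPI only when the alien queue goes from empty to non-empty, so a burst of splices while the target is still draining costs a single notification. A hypothetical userspace analogue (a counter stands in for skbs, another for IPIs):

```c
#include <pthread.h>

/* Userspace sketch of the need_notify pattern in raise_netif_irq():
 * the producer takes the queue lock, and only an empty->non-empty
 * transition pays for a wakeup (an IPI, via net_backlog_notify_cpu(),
 * in the patch). */
struct alien_queue {
    pthread_mutex_t lock;
    int len;            /* stand-in for queued skbs */
    int notifications;  /* stand-in for IPIs sent */
};

static void enqueue(struct alien_queue *q, int nr_skbs)
{
    int need_notify;

    pthread_mutex_lock(&q->lock);
    need_notify = (q->len == 0);   /* was the queue empty? */
    q->len += nr_skbs;             /* splice the batch on */
    pthread_mutex_unlock(&q->lock);

    if (need_notify)
        q->notifications++;        /* the expensive cross-CPU poke */
}

static void drain(struct alien_queue *q)
{
    /* Consumer splices everything away in one locked operation,
     * like process_backlog() does with input_pkt_alien_queue. */
    pthread_mutex_lock(&q->lock);
    q->len = 0;
    pthread_mutex_unlock(&q->lock);
}
```

The same shape appears in many producer/consumer rings: batching the notification is what keeps the per-packet cost of cross-CPU handoff low when traffic is heavy.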