Message ID: 1235525270.2604.483.camel@ymzhang
State: RFC, archived
Delegated to: David Miller
On Wed, 25 Feb 2009 09:27:49 +0800 "Zhang, Yanmin" <yanmin_zhang@linux.intel.com> wrote:

> Subject: hand off skb list to other cpu to submit to upper layer
> From: Zhang Yanmin <yanmin.zhang@linux.intel.com>
>
> Recently, I am investigating an ip_forward performance issue with a 10G IXGBE NIC.
> I run the test on 2 machines. Every machine has 2 10G NICs. The 1st one sends
> packets by pktgen. The 2nd receives the packets from one NIC and forwards them out
> from the 2nd NIC. As the NICs support multi-queue, I bind the queues to different logical
> cpus of different physical cpus while considering cache sharing carefully.
>
> Compared with the sending speed on the 1st machine, the forward speed is not good, only
> about 60% of the sending speed. As a matter of fact, the IXGBE driver starts NAPI when an
> interrupt arrives. When ip_forward=1, the receiver collects a packet and forwards it out
> immediately. So although IXGBE collects packets with NAPI, the forwarding really has much
> impact on collection. As IXGBE runs very fast, it drops packets quickly. The better approach
> for the receiving cpu is to do nothing but collect packets.
>
> Currently the kernel has backlog to support a similar capability, but process_backlog still
> runs on the receiving cpu. I enhance backlog by adding a new input_pkt_alien_queue to
> softnet_data. The receiving cpu collects packets and links them into an skb list, then delivers
> the list to the input_pkt_alien_queue of the other cpu. process_backlog picks up the skb list
> from input_pkt_alien_queue when input_pkt_queue is empty.
>
> A NIC driver could use this capability with the steps below in its NAPI RX cleanup function.
> 1) Initialize a local var struct sk_buff_head skb_head;
> 2) In the packet collection loop, just call netif_rx_queue or __skb_queue_tail(skb_head, skb)
>    to add the skb to the list;
> 3) Before exiting, call raise_netif_irq to submit the skb list to the specific cpu.
>
> Enlarge /proc/sys/net/core/netdev_max_backlog and netdev_budget before testing.
>
> I tested my patch on top of 2.6.28.5. The improvement is about 43%.
>
> Signed-off-by: Zhang Yanmin <yanmin.zhang@linux.intel.com>
>
> ---

You can't safely put packets on another CPU's queue without adding a spinlock.
And if you add the spinlock, you drop the performance back down for your
device and all the other devices. Also, you will end up reordering packets,
which hurts single-stream TCP performance.

Is this all because the hardware doesn't do MSI-X, or are you testing only
a single flow?

--
To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
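[The three driver-side steps in the patch description above might look like the following sketch of a NAPI RX cleanup routine. netif_rx_queue() and raise_netif_irq() are the interfaces proposed by the patch under discussion; the surrounding names (example_clean_rx_irq, example_next_rx_skb, target_cpu) are illustrative only, not real driver code.]

```c
/* Sketch only: how a NAPI rx cleanup routine might use the proposed
 * interfaces.  netif_rx_queue()/raise_netif_irq() come from the patch
 * being discussed; everything else here is hypothetical. */
static int example_clean_rx_irq(struct napi_struct *napi, int budget)
{
	struct sk_buff_head skb_head;	/* step 1: local skb list */
	struct sk_buff *skb;
	int work = 0;

	__skb_queue_head_init(&skb_head);

	while (work < budget && (skb = example_next_rx_skb()) != NULL) {
		/* step 2: queue locally instead of netif_receive_skb();
		 * no lock needed, the list is on our stack */
		netif_rx_queue(skb, &skb_head);
		work++;
	}

	/* step 3: hand the whole list to the configured remote cpu */
	raise_netif_irq(target_cpu, &skb_head);

	return work;
}
```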
On Tue, 2009-02-24 at 18:11 -0800, Stephen Hemminger wrote:
> On Wed, 25 Feb 2009 09:27:49 +0800
> "Zhang, Yanmin" <yanmin_zhang@linux.intel.com> wrote:
> > Subject: hand off skb list to other cpu to submit to upper layer

Thanks for your comments.

> You can't safely put packets on another CPU queue without adding a spinlock.

input_pkt_alien_queue is a struct sk_buff_head, which has a spinlock. We use
that lock to protect the queue.

> And if you add the spinlock, you drop the performance back down for your
> device and all the other devices.

My testing shows a 43% improvement. As multi-core machines are becoming
popular, we can allocate some cores for packet collection only.

I use the spinlock carefully. The delivery cpu locks it only when input_pkt_queue
is empty, and just merges the list into input_pkt_queue. Later skb dequeues needn't
hold the spinlock. On the other hand, the original receiving cpu dispatches a batch
of skbs (64 packets with the IXGBE default) while holding the lock once.

> Also, you will end up reordering
> packets which hurts single stream TCP performance.

Would you like to elaborate on the scenario? Do you mean multi-queue
also hurts single-stream TCP performance when we bind the multi-queue interrupts
to different cpus?

> Is this all because the hardware doesn't do MSI-X

IXGBE supports MSI-X and I enable it when testing. The receiver has 16 queues,
so 16 irq numbers. I bind 2 irq numbers per logical cpu of one physical cpu.

> or are you testing only
> a single flow.

What does a single flow mean here? One sender? I do start one sender for testing
because I couldn't get enough hardware.

In addition, my patch doesn't change the old interface, so there would be no
performance hurt to old drivers.

yanmin
On Wed, 25 Feb 2009 10:35:43 +0800 "Zhang, Yanmin" <yanmin_zhang@linux.intel.com> wrote:

> > You can't safely put packets on another CPU queue without adding a spinlock.
> input_pkt_alien_queue is a struct sk_buff_head which has a spinlock. We use
> that lock to protect the queue.

I was reading netif_rx_queue() and you have it using __skb_queue_tail(), which
has no locking.

> > or are you testing only
> > a single flow.
> What does a single flow mean here? One sender? I do start one sender for testing
> because I couldn't get enough hardware.

Multiple receive queues only give a performance gain if the packets are being
sent with different SRC/DST address pairs. That is how the hardware is supposed
to break them into queues.

Reordering is what happens when packets that are sent as [ 0, 1, 2, 3, 4 ]
get received as [ 0, 1, 4, 3, 2 ] because your receive processing happened on
different CPUs. You really need to test this with some program like 'iperf' to
see the effect it has on TCP. Older Juniper routers used to have hardware that
did this and it caused performance loss. Do some google searches and you will
see it is an active research topic whether reordering is okay or not. Existing
multiqueue is safe because it doesn't reorder inside a single flow; it only
changes order between flows: [ A1, A2, B1, B2 ] => [ A1, B1, A2, B2 ]

> In addition, my patch doesn't change the old interface, so there would be no
> performance hurt to old drivers.
>
> yanmin

Isn't this a problem:

> +int netif_rx_queue(struct sk_buff *skb, struct sk_buff_head *skb_queue)
> {
> 	struct softnet_data *queue;
> 	unsigned long flags;
> +	int this_cpu;
>
> 	/* if netpoll wants it, pretend we never saw it */
> 	if (netpoll_rx(skb))
> @@ -1943,24 +1946,31 @@ int netif_rx(struct sk_buff *skb)
> 	if (!skb->tstamp.tv64)
> 		net_timestamp(skb);
>
> +	if (skb_queue)
> +		this_cpu = 0;
> +	else
> +		this_cpu = 1;

Why bother with a special boolean? Instead just test for skb_queue != NULL.

> +
> 	/*
> 	 * The code is rearranged so that the path is the most
> 	 * short when CPU is congested, but is still operating.
> 	 */
> 	local_irq_save(flags);
> +
> 	queue = &__get_cpu_var(softnet_data);
> +	if (!skb_queue)
> +		skb_queue = &queue->input_pkt_queue;
>
> 	__get_cpu_var(netdev_rx_stat).total++;
> -	if (queue->input_pkt_queue.qlen <= netdev_max_backlog) {
> -		if (queue->input_pkt_queue.qlen) {
> -enqueue:
> -			__skb_queue_tail(&queue->input_pkt_queue, skb);
> -			local_irq_restore(flags);
> -			return NET_RX_SUCCESS;
> +
> +	if (skb_queue->qlen <= netdev_max_backlog) {
> +		if (!skb_queue->qlen && this_cpu) {
> +			napi_schedule(&queue->backlog);
> 		}

Won't this break if skb_queue is NULL (non NAPI case)?
On Tue, 2009-02-24 at 21:18 -0800, Stephen Hemminger wrote:
> > input_pkt_alien_queue is a struct sk_buff_head which has a spinlock. We use
> > that lock to protect the queue.
>
> I was reading netif_rx_queue() and you have it using __skb_queue_tail() which
> has no locking.

Sorry, I need to add some comments to function netif_rx_queue. Parameter
skb_queue points to a local var, or is NULL. If it points to a local var, just
like in function ixgbe_clean_rx_irq of IXGBE, we needn't protect it when using
__skb_queue_tail to add a new skb. If skb_queue is NULL, the line below

	skb_queue = &queue->input_pkt_queue;

makes it point to the local input_pkt_queue, which is protected by
local_irq_save.

> Multiple receive queues only give a performance gain if the packets are being
> sent with different SRC/DST address pairs. That is how the hardware is supposed
> to break them into queues.

Thanks for your explanation.

> Reordering is what happens when packets that are sent as [ 0, 1, 2, 3, 4 ]
> get received as [ 0, 1, 4, 3, 2 ] because your receive processing happened on
> different CPUs. Existing multiqueue is safe because it doesn't reorder inside
> a single flow; it only changes order between flows:
> [ A1, A2, B1, B2 ] => [ A1, B1, A2, B2 ]

Thanks. Your explanation is very clear. My patch might cause reordering, but
very rarely, because reordering only happens when there is a failover in
function raise_netif_irq. Perhaps I need to replace the failover with just
dropping the packet? I will try iperf.

> Isn't this a problem:
> > +	if (skb_queue)
> > +		this_cpu = 0;
> > +	else
> > +		this_cpu = 1;
>
> Why bother with a special boolean? Instead just test for skb_queue != NULL.

Var this_cpu is used for napi_schedule later. Although the logic has no
problem, this_cpu seems confusing. Let me check if there is a better way for
the late napi_schedule.

> > 	queue = &__get_cpu_var(softnet_data);
> > +	if (!skb_queue)
> > +		skb_queue = &queue->input_pkt_queue;

When skb_queue is NULL, we redirect it to queue->input_pkt_queue.

> > +	if (skb_queue->qlen <= netdev_max_backlog) {
> > +		if (!skb_queue->qlen && this_cpu) {
> > +			napi_schedule(&queue->backlog);
> > 		}
>
> Won't this break if skb_queue is NULL (non NAPI case)?

So skb_queue isn't NULL here. Another idea is just to delete function
netif_rx_queue; drivers could use __skb_queue_tail directly. The difference is
that netif_rx_queue has a queue length check while __skb_queue_tail hasn't. But
mostly skb_queue is far smaller than queue->input_pkt_queue.qlen and
queue->input_pkt_alien_queue.qlen.

Yanmin
Zhang, Yanmin <yanmin_zhang@linux.intel.com> wrote:
> So although IXGBE collects packets with NAPI, the forwarding really has much
> impact on collection. As IXGBE runs very fast, it drops packets quickly. The
> better approach for the receiving cpu is to do nothing but collect packets.

This doesn't make sense. With multiqueue RX, every core should be
working to receive its fraction of the traffic and forward it
out. So you shouldn't have any idle cores to begin with. The fact
that you do means that multiqueue RX hasn't maximised its utility,
so you should tackle that instead of trying to redirect traffic away
from the cores that are receiving.

Of course for NICs that don't support multiqueue RX, or where the
number of RX queues is less than the number of cores, a scheme
like yours may be useful.

Cheers,
On Wed, 2009-02-25 at 14:36 +0800, Herbert Xu wrote:
> This doesn't make sense. With multiqueue RX, every core should be
> working to receive its fraction of the traffic and forwarding them
> out.

Thanks for your comments.

I never said the core can't receive and forward packets at the same time. I
mean the performance isn't good.

> So you shouldn't have any idle cores to begin with. The fact
> that you do means that multiqueue RX hasn't maximised its utility,
> so you should tackle that instead of trying redirect traffic away
> from the cores that are receiving.

From Stephen's explanation, the packets are being sent with different SRC/DST
address pairs, by which the hardware delivers packets to different queues. We
couldn't expect the NIC to always put packets into the queues evenly.

The behavior is that IXGBE is very fast and the cpu can't collect packets in
time if it collects packets and forwards them at the same time. That causes
IXGBE to drop packets.

> Of course for NICs that don't support multiqueue RX, or where the
> number of RX queues is less than the number of cores, then a scheme
> like yours may be useful.

The IXGBE NIC does support a large number of RX queues. By default, it creates
CPU_NUM queues. But the performance is not good when we bind queues to cpus
evenly. One reason is cache miss/ping-pong.

The forwarder machine has 2 physical cpus and every cpu has 8 logical threads.
All 8 logical cpus share the last level cache. With my ip_forward testing
driven by pktgen, binding queues to the 8 logical cpus of one physical cpu
gives about 40% improvement over binding queues to all 16 logical cpus. So the
optimization scenario just needs the IXGBE driver to create 8 queues.

If the machines have a couple of NICs and every NIC has CPU_NUM queues,
binding them evenly might cause more cache miss/ping-pong. I didn't test the
multiple receiving NICs scenario as I couldn't get enough hardware.

Yanmin
From: "Zhang, Yanmin" <yanmin_zhang@linux.intel.com>
Date: Wed, 25 Feb 2009 15:20:23 +0800

> If the machines have a couple of NICs and every NIC has CPU_NUM queues,
> binding them evenly might cause more cache miss/ping-pong. I didn't test the
> multiple receiving NICs scenario as I couldn't get enough hardware.

In the net-next-2.6 tree, since we mark incoming packets with
skb_record_rx_queue() properly, we'll make a more favorable choice of
TX queue.

You may want to figure out why that isn't behaving well in your case.

I don't think we should do any kind of software spreading for such
capable hardware; it defeats the whole point of supporting the
multiqueue features.
On Tue, 2009-02-24 at 23:31 -0800, David Miller wrote:
> In the net-next-2.6 tree, since we mark incoming packets with
> skb_record_rx_queue() properly, we'll make a more favorable choice of
> TX queue.

Thanks for your pointer. I cloned the net-next-2.6 tree. skb_record_rx_queue
is a smart idea to implement auto TX selection.

There is no NIC multi-queue standard or RFC available; at least I didn't find
one by google. Both the new skb_record_rx_queue and the current kernel have an
assumption on multi-queue. The assumption is that it's best to send out packets
from the TX queue with the same number as the RX queue if the received packets
are related to the outgoing packets. Put more directly, we should send packets
on the same cpu on which we receive them. The starting point is that this could
reduce skb and data cache misses.

With a slow NIC, the assumption is right. But with a high-speed NIC, especially
a 10G NIC, the assumption seems not ok. Here is a simple calculation with real
testing data on a Nehalem machine and a Bensley machine. There are 2 machines
with the testing driven by pktgen.

	       send packets
  Machine A ==============> Machine B
	    <==============
	   forward pkts back

With the Nehalem machines, I can get 4 million pps (packets per second), and
every packet consists of 60 bytes. So the speed is about 240MBytes/s. Nehalem
has 2 sockets and every socket has 4 cores and 8 logical cpus. All 8 logical
cpus share the 8MBytes last level cache. That means every physical cpu receives
120MBytes per second, which is 15 times the last level cache size.

With the Bensley machine, I can get 1.2M pps, or 72MBytes/s. That machine has
2 sockets and every socket has a quad-core cpu. Every dual-core pair shares a
6MByte last level cache. That means every dual-core pair gets 18MBytes per
second, which is 3 times the last level cache size.

So with both Bensley and Nehalem, the cache is flushed very quickly in 10G NIC
testing. Some other kinds of machines might have bigger caches. For example, my
Montvale Itanium has 2 sockets, and every socket has a quad-core cpu plus
multi-threading. Every dual-core pair shares a 12MByte last level cache. But
the cache is still flushed at least twice per second.

If we check NIC drivers, we can find that drivers touch very limited fields of
sk_buff when collecting packets from the NIC. It is said 20G and 30G NICs are
in production. So with a high-speed 10G NIC, the old assumption seems not to
hold.

On the other hand, which part causes the most cache footprint and cache
misses? I don't think the drivers do, because the receiving cpu only touches
some fields of sk_buff before sending it to the upper layer. My patch throws
packets to a specific cpu controlled by configuration, which doesn't cause much
cache ping-pong. After the receiving cpu throws packets to the 2nd cpu, it
doesn't need them again. The 2nd cpu has cache misses, but they don't cause
cache ping-pong.

My patch doesn't always disagree with skb_record_rx_queue.
1) It can be configured by the admin;
2) We can call skb_record_rx_queue or similar functions on the 2nd cpu (the
real cpu that processes the packets in process_backlog), so later on the cache
footprint won't be wasted when forwarding packets out.

> You may want to figure out why that isn't behaving well in your
> case.

I did check the kernel, including slab tuning (I tried slab/slub/slqb and use
slub now), and instrumented the IXGBE driver. Besides careful
multi-queue/interrupt binding, another way is just to use my patch to improve
speed by more than 40% on both Nehalem and Bensley.

> I don't think we should do any kind of software spreading for such
> capable hardware; it defeats the whole point of supporting the
> multiqueue features.

There is no NIC multi-queue standard or RFC. Jesse is worried that we might
allocate free cores for packet collection while a real environment keeps all
cpus busy. I added more pressure on the sending machine, and got better
performance on the forwarding machine, and the forwarding machine's cpus are
busier than before; some logical cpus' idle time is near 0. But I only have a
couple of 10G NICs, and couldn't add enough pressure to make all cpus busy.

Thanks again for your comments and patience.

Yanmin
From: "Zhang, Yanmin" <yanmin_zhang@linux.intel.com>
Date: Wed, 04 Mar 2009 17:27:48 +0800

> Both the new skb_record_rx_queue and the current kernel have an
> assumption on multi-queue. The assumption is that it's best to send
> out packets from the TX queue with the same number as the RX queue if
> the received packets are related to the outgoing packets. Put more
> directly, we should send packets on the same cpu on which we receive
> them. The starting point is that this could reduce skb and data cache
> misses.

We have to use the same TX queue for all packets of the same
connection flow (same src/dst IP addresses and ports), otherwise
we introduce reordering.

Herbert brought this up, now I have explicitly brought this up,
and you cannot ignore this issue.

You must not knowingly reorder packets, and using different TX
queues for packets within the same flow does that.
On Wed, 2009-03-04 at 01:39 -0800, David Miller wrote:
> From: "Zhang, Yanmin" <yanmin_zhang@linux.intel.com>
> Date: Wed, 04 Mar 2009 17:27:48 +0800
>
> > Both the new skb_record_rx_queue and the current kernel make an
> > assumption about multi-queue: that it is best to send packets out
> > on the TX queue with the same number as the RX queue, if the
> > received packets are related to the outgoing ones. Put more
> > directly, we need to send packets on the same cpu on which we
> > receive them. The starting point is that this reduces skb and data
> > cache misses.
>
> We have to use the same TX queue for all packets of the same
> connection flow (same src/dst IP address and ports); otherwise
> we introduce reordering.
> Herbert brought this up, now I have explicitly brought it up,
> and you cannot ignore this issue.

Thanks. Stephen Hemminger brought it up and explained what reordering
is. I answered in a reply (sorry for not being clear) that mostly we
need to spread packets among RX/TX in a 1:1 or N:1 mapping. For
example, all packets received from RX 8 will always be spread to TX 0.

> You must not knowingly reorder packets, and using different TX
> queues for packets within the same flow does that.

Thanks for your explanation, which is really consistent with Stephen's.
On Thu, 2009-03-05 at 09:04 +0800, Zhang, Yanmin wrote:
> On Wed, 2009-03-04 at 01:39 -0800, David Miller wrote:
> > From: "Zhang, Yanmin" <yanmin_zhang@linux.intel.com>
> > Date: Wed, 04 Mar 2009 17:27:48 +0800
> >
> > > Both the new skb_record_rx_queue and the current kernel make an
> > > assumption about multi-queue: that it is best to send packets out
> > > on the TX queue with the same number as the RX queue, if the
> > > received packets are related to the outgoing ones. Put more
> > > directly, we need to send packets on the same cpu on which we
> > > receive them. The starting point is that this reduces skb and
> > > data cache misses.
> >
> > We have to use the same TX queue for all packets of the same
> > connection flow (same src/dst IP address and ports); otherwise
> > we introduce reordering.
> > Herbert brought this up, now I have explicitly brought it up,
> > and you cannot ignore this issue.
>
> Thanks. Stephen Hemminger brought it up and explained what reordering
> is. I answered in a reply (sorry for not being clear) that mostly we
> need to spread packets among RX/TX in a 1:1 or N:1 mapping. For
> example, all packets received from RX 8 will always be spread to
> TX 0.

To make it clearer, I used a 1:1 mapping binding when running the tests
on Bensley (4*2 cores) and Nehalem (2*4*2 logical cpus), so there is no
reordering issue. I also worked out a new patch for the failover path
that just drops packets when qlen is bigger than netdev_max_backlog, so
the failover path won't cause reordering.

> > You must not knowingly reorder packets, and using different TX
> > queues for packets within the same flow does that.
> Thanks for your explanation, which is really consistent with
> Stephen's.
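[Editor's note] The 1:1 or N:1 RX-to-TX mapping described above can be pictured as a fixed, admin-configured table. This is a hypothetical user-space sketch (names and the example table are invented for illustration): since the table never changes at runtime and every flow always arrives on one RX queue, all packets of a flow leave on one TX queue and no reordering is introduced.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_RXQ 8

/* Example mapping table: entries 0..6 are 1:1; RX queues 6 and 7 both
 * map to TX 0, showing an N:1 case. In a real setup this would be
 * filled in from configuration. */
static const uint16_t rx_to_tx[NUM_RXQ] = { 0, 1, 2, 3, 4, 5, 0, 0 };

/* Deterministic per-RX-queue TX selection: no per-packet variation,
 * hence no reordering within a flow. */
static uint16_t map_txq(uint16_t rxq)
{
        return rx_to_tx[rxq % NUM_RXQ];
}
```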
2009/3/5, Zhang, Yanmin <yanmin_zhang@linux.intel.com>:
> On Thu, 2009-03-05 at 09:04 +0800, Zhang, Yanmin wrote:
> > On Wed, 2009-03-04 at 01:39 -0800, David Miller wrote:
> > > From: "Zhang, Yanmin" <yanmin_zhang@linux.intel.com>
> > > Date: Wed, 04 Mar 2009 17:27:48 +0800
> > >
> > > > Both the new skb_record_rx_queue and the current kernel make an
> > > > assumption about multi-queue: that it is best to send packets
> > > > out on the TX queue with the same number as the RX queue, if
> > > > the received packets are related to the outgoing ones. Put more
> > > > directly, we need to send packets on the same cpu on which we
> > > > receive them. The starting point is that this reduces skb and
> > > > data cache misses.
> > >
> > > We have to use the same TX queue for all packets of the same
> > > connection flow (same src/dst IP address and ports); otherwise
> > > we introduce reordering.
> > > Herbert brought this up, now I have explicitly brought it up,
> > > and you cannot ignore this issue.
> >
> > Thanks. Stephen Hemminger brought it up and explained what
> > reordering is. I answered in a reply (sorry for not being clear)
> > that mostly we need to spread packets among RX/TX in a 1:1 or N:1
> > mapping. For example, all packets received from RX 8 will always be
> > spread to TX 0.
>
> To make it clearer, I used a 1:1 mapping binding when running the
> tests on Bensley (4*2 cores) and Nehalem (2*4*2 logical cpus), so
> there is no reordering issue. I also worked out a new patch for the
> failover path that just drops packets when qlen is bigger than
> netdev_max_backlog, so the failover path won't cause reordering.
>

We have not seen this problem in our testing.

We do keep the skb processing on the same CPU from RX to TX.
This is done by setting affinity for queues and using a custom
select_queue.
+static u16 select_queue(struct net_device *dev, struct sk_buff *skb)
+{
+	if (dev->real_num_tx_queues && skb_rx_queue_recorded(skb))
+		return skb_get_rx_queue(skb) % dev->real_num_tx_queues;
+
+	return smp_processor_id() % dev->real_num_tx_queues;
+}
+

The hash-based default for selecting the TX queue generates an uneven
spread that is hard to follow with correct affinity.

We have not been able to generate quite as much traffic from the
sender.

Sender: (64 byte pkts)
eth5      4.5 k bit/s      3 pps  1233.9 M bit/s  2.632 M pps

Router:
eth0  1077.2 M bit/s  2.298 M pps     1.7 k bit/s      1 pps
eth1      744   bit/s      1 pps  1076.3 M bit/s  2.296 M pps

I'm not sure I like the proposed concept, since it decouples RX
processing from receiving. There is no point collecting lots of
packets just to drop them later in the qdisc. In fact this is bad for
performance; we just consume cpu for nothing. It is important to have
as strong a correlation as possible between RX and TX so we don't
receive more packets than we can handle. Better to drop on the
interface.

We might start thinking of a way for userland to set the policy for
multiqueue mapping.

Cheers,
Jens Låås

> > > You must not knowingly reorder packets, and using different TX
> > > queues for packets within the same flow does that.
> > Thanks for your explanation, which is really consistent with
> > Stephen's.
On Thu, 2009-03-05 at 08:32 +0100, Jens Låås wrote:
> 2009/3/5, Zhang, Yanmin <yanmin_zhang@linux.intel.com>:
> > On Thu, 2009-03-05 at 09:04 +0800, Zhang, Yanmin wrote:
> > > On Wed, 2009-03-04 at 01:39 -0800, David Miller wrote:
> > > > From: "Zhang, Yanmin" <yanmin_zhang@linux.intel.com>
> > > > Date: Wed, 04 Mar 2009 17:27:48 +0800
> > > >
> > > > > Both the new skb_record_rx_queue and the current kernel make
> > > > > an assumption about multi-queue: that it is best to send
> > > > > packets out on the TX queue with the same number as the RX
> > > > > queue, if the received packets are related to the outgoing
> > > > > ones. Put more directly, we need to send packets on the same
> > > > > cpu on which we receive them. The starting point is that this
> > > > > reduces skb and data cache misses.
> > > >
> > > > We have to use the same TX queue for all packets of the same
> > > > connection flow (same src/dst IP address and ports); otherwise
> > > > we introduce reordering.
> > > > Herbert brought this up, now I have explicitly brought it up,
> > > > and you cannot ignore this issue.
> > >
> > > Thanks. Stephen Hemminger brought it up and explained what
> > > reordering is. I answered in a reply (sorry for not being clear)
> > > that mostly we need to spread packets among RX/TX in a 1:1 or N:1
> > > mapping. For example, all packets received from RX 8 will always
> > > be spread to TX 0.
> >
> > To make it clearer, I used a 1:1 mapping binding when running the
> > tests on Bensley (4*2 cores) and Nehalem (2*4*2 logical cpus), so
> > there is no reordering issue. I also worked out a new patch for the
> > failover path that just drops packets when qlen is bigger than
> > netdev_max_backlog, so the failover path won't cause reordering.
> >
>
> We have not seen this problem in our testing.

Thanks for your valuable input. We need more data on high-speed NICs.

> We do keep the skb processing on the same CPU from RX to TX.

That's the usual approach.
I did so when I began to investigate why the forwarding speed is far
slower than the sending speed with 10G NICs.

> This is done by setting affinity for queues and using a custom
> select_queue.
>
> +static u16 select_queue(struct net_device *dev, struct sk_buff *skb)
> +{
> +	if (dev->real_num_tx_queues && skb_rx_queue_recorded(skb))
> +		return skb_get_rx_queue(skb) % dev->real_num_tx_queues;
> +
> +	return smp_processor_id() % dev->real_num_tx_queues;
> +}
> +

Yes, with this function, and with every NIC having CPU_NUM queues, an
skb is processed by the same cpu from RX to TX.

> The hash-based default for selecting the TX queue generates an uneven
> spread that is hard to follow with correct affinity.
>
> We have not been able to generate quite as much traffic from the
> sender.

pktgen in the latest kernel supports multiple threads on the same
device. If you start just one thread, the speed is limited. Could you
try 4 or 8 threads? Perhaps the speed could double.

> Sender: (64 byte pkts)
> eth5      4.5 k bit/s      3 pps  1233.9 M bit/s  2.632 M pps

I'm a little confused by the data. Do the first 2 columns mean IN and
the last 2 mean OUT? What kind of NICs and machines are these? How big
is the last-level cache of the cpus?

> Router:
> eth0  1077.2 M bit/s  2.298 M pps     1.7 k bit/s      1 pps
> eth1      744   bit/s      1 pps  1076.3 M bit/s  2.296 M pps

The forwarding speed is quite close to the sending speed of the
sender. It seems your machine doesn't need my patch. In my original
case the sending speed was 1.4M pps with careful cpu binding
considering cpu cache sharing. With my patch, the result becomes 2M pps
while the sending speed is 2.36M pps. The NICs I am using are not the
latest.

> I'm not sure I like the proposed concept, since it decouples RX
> processing from receiving.
> There is no point collecting lots of packets just to drop them later
> in the qdisc.
> In fact this is bad for performance; we just consume cpu for nothing.
Yes, if the skb-processing cpu is very busy and we choose to drop skbs
there instead of in the driver or NIC hardware, performance might be
worse. A small change to my patch and the driver could reduce that
possibility: check qlen before collecting the 64 packets (assuming the
driver collects 64 packets per NAPI loop). If qlen is larger than
netdev_max_backlog, the driver could just return without any real
collection. We need data to judge whether this is good or bad.

> It is important to have as strong a correlation as possible between
> RX and TX so we don't receive more packets than we can handle. Better
> to drop on the interface.

With the small change above, the interface would drop packets.

> We might start thinking of a way for userland to set the policy for
> multiqueue mapping.

I also think so.

I did more testing with different slab allocators, as slab has a big
impact on performance. SLQB behaves very differently from SLUB. It
seems SLQB (try2) needs improved NUMA allocation/free. At least, I use
slub_min_objects=64 and slub_max_order=6 to get the best result on my
machine.

Thanks for your comments.

> > > > You must not knowingly reorder packets, and using different TX
> > > > queues for packets within the same flow does that.
> > > Thanks for your explanation, which is really consistent with
> > > Stephen's.
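[Editor's note] The early-drop idea — check the target backlog length before pulling a NAPI budget off the NIC, so drops happen at the interface rather than after collection — can be sketched in user space. Everything here (names, the budget of 64, the backlog limit standing in for netdev_max_backlog) is a hypothetical illustration, not driver code:

```c
#include <assert.h>
#include <stddef.h>

#define NAPI_BUDGET  64    /* packets collected per NAPI poll */
#define MAX_BACKLOG  300   /* stand-in for netdev_max_backlog */

/* Simulated NAPI poll: if the target cpu's queue is already over the
 * limit, return without collecting, leaving the packets in the RX
 * ring so the NIC drops them instead of the backlog. Otherwise
 * pretend to drain a full budget. Returns packets collected. */
static int napi_poll_sketch(size_t target_qlen, size_t *collected)
{
        if (target_qlen > MAX_BACKLOG) {
                *collected = 0;   /* nothing pulled off the NIC */
                return 0;
        }
        *collected = NAPI_BUDGET;
        return NAPI_BUDGET;
}
```

The design point is where cpu time is spent: dropping before collection costs almost nothing, while dropping after collection wastes the whole per-packet receive path.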
--- linux-2.6.29-rc2/include/linux/netdevice.h	2009-01-20 14:20:45.000000000 +0800
+++ linux-2.6.29-rc2_napi_rcv/include/linux/netdevice.h	2009-02-23 13:32:48.000000000 +0800
@@ -1119,6 +1119,9 @@ static inline int unregister_gifconf(uns
 /*
  * Incoming packets are placed on per-cpu queues so that
  * no locking is needed.
+ * To speed up fast networks, incoming packets may sometimes be placed
+ * on another cpu's queue. Use input_pkt_alien_queue.lock to
+ * protect input_pkt_alien_queue.
  */
 struct softnet_data
 {
@@ -1127,6 +1130,7 @@ struct softnet_data
 	struct list_head	poll_list;
 	struct sk_buff		*completion_queue;
 
+	struct sk_buff_head	input_pkt_alien_queue;
 	struct napi_struct	backlog;
 };
 
@@ -1368,6 +1372,10 @@ extern void dev_kfree_skb_irq(struct sk_
 extern void		dev_kfree_skb_any(struct sk_buff *skb);
 
 #define HAVE_NETIF_RX 1
+extern int		netif_rx_queue(struct sk_buff *skb,
+				       struct sk_buff_head *skb_queue);
+extern int		raise_netif_irq(int cpu,
+					struct sk_buff_head *skb_queue);
 extern int		netif_rx(struct sk_buff *skb);
 extern int		netif_rx_ni(struct sk_buff *skb);
 #define HAVE_NETIF_RECEIVE_SKB 1
--- linux-2.6.29-rc2/net/core/dev.c	2009-01-20 14:20:45.000000000 +0800
+++ linux-2.6.29-rc2_napi_rcv/net/core/dev.c	2009-02-24 13:53:02.000000000 +0800
@@ -1917,8 +1917,10 @@ DEFINE_PER_CPU(struct netif_rx_stats, ne
 
 /**
- *	netif_rx	-	post buffer to the network code
+ *	netif_rx_queue	-	post buffer to the network code
  *	@skb: buffer to post
+ *	@skb_queue: the queue to keep the skb. It may be NULL or point
+ *	to a local variable.
  *
  *	This function receives a packet from a device driver and queues it for
  *	the upper (protocol) levels to process.  It always succeeds.  The buffer
@@ -1931,10 +1933,11 @@ DEFINE_PER_CPU(struct netif_rx_stats, ne
  *
  */
 
-int netif_rx(struct sk_buff *skb)
+int netif_rx_queue(struct sk_buff *skb, struct sk_buff_head *skb_queue)
 {
 	struct softnet_data *queue;
 	unsigned long flags;
+	int this_cpu;
 
 	/* if netpoll wants it, pretend we never saw it */
 	if (netpoll_rx(skb))
@@ -1943,24 +1946,31 @@ int netif_rx(struct sk_buff *skb)
 	if (!skb->tstamp.tv64)
 		net_timestamp(skb);
 
+	if (skb_queue)
+		this_cpu = 0;
+	else
+		this_cpu = 1;
+
 	/*
 	 * The code is rearranged so that the path is the most
 	 * short when CPU is congested, but is still operating.
 	 */
 	local_irq_save(flags);
+
 	queue = &__get_cpu_var(softnet_data);
+	if (!skb_queue)
+		skb_queue = &queue->input_pkt_queue;
 
 	__get_cpu_var(netdev_rx_stat).total++;
-	if (queue->input_pkt_queue.qlen <= netdev_max_backlog) {
-		if (queue->input_pkt_queue.qlen) {
-enqueue:
-			__skb_queue_tail(&queue->input_pkt_queue, skb);
-			local_irq_restore(flags);
-			return NET_RX_SUCCESS;
+
+	if (skb_queue->qlen <= netdev_max_backlog) {
+		if (!skb_queue->qlen && this_cpu) {
+			napi_schedule(&queue->backlog);
 		}
-		napi_schedule(&queue->backlog);
-		goto enqueue;
+		__skb_queue_tail(skb_queue, skb);
+		local_irq_restore(flags);
+		return NET_RX_SUCCESS;
 	}
 
 	__get_cpu_var(netdev_rx_stat).dropped++;
@@ -1970,6 +1980,11 @@ enqueue:
 	return NET_RX_DROP;
 }
 
+int netif_rx(struct sk_buff *skb)
+{
+	return netif_rx_queue(skb, NULL);
+}
+
 int netif_rx_ni(struct sk_buff *skb)
 {
 	int err;
@@ -1985,6 +2000,79 @@ int netif_rx_ni(struct sk_buff *skb)
 
 EXPORT_SYMBOL(netif_rx_ni);
 
+static void net_drop_skb(struct sk_buff_head *skb_queue)
+{
+	struct sk_buff *skb = __skb_dequeue(skb_queue);
+
+	while (skb) {
+		__get_cpu_var(netdev_rx_stat).dropped++;
+		kfree_skb(skb);
+		skb = __skb_dequeue(skb_queue);
+	}
+}
+
+static void net_napi_backlog(void *data)
+{
+	struct softnet_data *queue = &__get_cpu_var(softnet_data);
+
+	napi_schedule(&queue->backlog);
+	kfree(data);
+}
+
+int raise_netif_irq(int cpu, struct sk_buff_head *skb_queue)
+{
+	unsigned long flags;
+	struct softnet_data *queue;
+
+	if (skb_queue_empty(skb_queue))
+		return 0;
+
+	if ((unsigned)cpu < nr_cpu_ids &&
+	    cpu_online(cpu) &&
+	    cpu != smp_processor_id()) {
+
+		struct call_single_data *data;
+
+		queue = &per_cpu(softnet_data, cpu);
+
+		if (queue->input_pkt_alien_queue.qlen > netdev_max_backlog)
+			goto failover;
+
+		data = kmalloc(sizeof(struct call_single_data), GFP_ATOMIC);
+		if (!data)
+			goto failover;
+
+		spin_lock_irqsave(&queue->input_pkt_alien_queue.lock, flags);
+		skb_queue_splice_tail_init(skb_queue,
+					   &queue->input_pkt_alien_queue);
+		spin_unlock_irqrestore(&queue->input_pkt_alien_queue.lock,
+				       flags);
+
+		data->func = net_napi_backlog;
+		data->info = data;
+		data->flags = 0;
+
+		__smp_call_function_single(cpu, data);
+
+		return 0;
+	}
+
+failover:
+	/* If cpu is offline, we queue the skbs back to the queue on the
+	 * current cpu */
+	queue = &__get_cpu_var(softnet_data);
+	if (queue->input_pkt_queue.qlen + skb_queue->qlen <=
+	    netdev_max_backlog) {
+		local_irq_save(flags);
+		skb_queue_splice_tail_init(skb_queue, &queue->input_pkt_queue);
+		napi_schedule(&queue->backlog);
+		local_irq_restore(flags);
+	} else {
+		net_drop_skb(skb_queue);
+	}
+
+	return 1;
+}
+
 static void net_tx_action(struct softirq_action *h)
 {
 	struct softnet_data *sd = &__get_cpu_var(softnet_data);
@@ -2324,6 +2412,13 @@ static void flush_backlog(void *arg)
 	struct net_device *dev = arg;
 	struct softnet_data *queue = &__get_cpu_var(softnet_data);
 	struct sk_buff *skb, *tmp;
+	unsigned long flags;
+
+	spin_lock_irqsave(&queue->input_pkt_alien_queue.lock, flags);
+	skb_queue_splice_tail_init(&queue->input_pkt_alien_queue,
+				   &queue->input_pkt_queue);
+	spin_unlock_irqrestore(&queue->input_pkt_alien_queue.lock, flags);
 
 	skb_queue_walk_safe(&queue->input_pkt_queue, skb, tmp)
 		if (skb->dev == dev) {
@@ -2575,9 +2670,19 @@ static int process_backlog(struct napi_s
 		local_irq_disable();
 		skb = __skb_dequeue(&queue->input_pkt_queue);
 		if (!skb) {
-			__napi_complete(napi);
-			local_irq_enable();
-			break;
+			if (!skb_queue_empty(&queue->input_pkt_alien_queue)) {
+				spin_lock(&queue->input_pkt_alien_queue.lock);
+				skb_queue_splice_tail_init(
+					&queue->input_pkt_alien_queue,
+					&queue->input_pkt_queue);
+				spin_unlock(&queue->input_pkt_alien_queue.lock);
+
+				skb = __skb_dequeue(&queue->input_pkt_queue);
+			} else {
+				__napi_complete(napi);
+				local_irq_enable();
+				break;
+			}
 		}
 		local_irq_enable();
@@ -4966,6 +5071,11 @@ static int dev_cpu_callback(struct notif
 	local_irq_enable();
 
 	/* Process offline CPU's input_pkt_queue */
+	spin_lock(&oldsd->input_pkt_alien_queue.lock);
+	skb_queue_splice_tail_init(&oldsd->input_pkt_alien_queue,
+				   &oldsd->input_pkt_queue);
+	spin_unlock(&oldsd->input_pkt_alien_queue.lock);
+
 	while ((skb = __skb_dequeue(&oldsd->input_pkt_queue)))
 		netif_rx(skb);
 
@@ -5165,10 +5275,13 @@ static int __init net_dev_init(void)
 		struct softnet_data *queue;
 
 		queue = &per_cpu(softnet_data, i);
+
 		skb_queue_head_init(&queue->input_pkt_queue);
 		queue->completion_queue = NULL;
 		INIT_LIST_HEAD(&queue->poll_list);
 
+		skb_queue_head_init(&queue->input_pkt_alien_queue);
+
 		queue->backlog.poll = process_backlog;
 		queue->backlog.weight = weight_p;
 		queue->backlog.gro_list = NULL;
@@ -5227,7 +5340,9 @@ EXPORT_SYMBOL(netdev_boot_setup_check);
 EXPORT_SYMBOL(netdev_set_master);
 EXPORT_SYMBOL(netdev_state_change);
 EXPORT_SYMBOL(netif_receive_skb);
+EXPORT_SYMBOL(netif_rx_queue);
 EXPORT_SYMBOL(netif_rx);
+EXPORT_SYMBOL(raise_netif_irq);
 EXPORT_SYMBOL(register_gifconf);
 EXPORT_SYMBOL(register_netdevice);
 EXPORT_SYMBOL(register_netdevice_notifier);
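[Editor's note] The handoff flow in the patch — the RX cpu queues skbs on a private list (netif_rx_queue with a local head), splices the whole list onto the target cpu's alien queue in one locked operation (raise_netif_irq), and the target cpu's process_backlog later splices the alien queue into its own input queue — can be simulated in user space. This is a minimal sketch under stated assumptions: queues are modeled as bare counters because only the splice accounting is being illustrated, and all names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

struct queue {
        size_t qlen;    /* stands in for a struct sk_buff_head */
};

/* RX cpu, collection loop: append one packet to the private list
 * (no lock needed; the list is local to this cpu) */
static void local_enqueue(struct queue *local)
{
        local->qlen++;
}

/* RX cpu, end of NAPI poll: hand the whole list over in one shot,
 * like skb_queue_splice_tail_init() under the alien queue's lock */
static void splice_to_alien(struct queue *local, struct queue *alien)
{
        alien->qlen += local->qlen;
        local->qlen = 0;        /* local list is re-initialized */
}

/* target cpu: process_backlog refills its input queue from the alien
 * queue once input_pkt_queue runs empty */
static void backlog_refill(struct queue *alien, struct queue *input)
{
        input->qlen += alien->qlen;
        alien->qlen = 0;
}
```

The design point mirrors the patch: the cross-cpu lock is taken once per NAPI poll for a whole batch, not once per packet, which keeps the handoff cheap on the receiving cpu.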