Message ID: 1271424065.4606.31.camel@bigi
State: RFC, archived
Delegated to: David Miller
On Fri, Apr 16, 2010 at 9:21 PM, jamal <hadi@cyberus.ca> wrote:
> On Fri, 2010-04-16 at 07:18 +0200, Eric Dumazet wrote:
>>
>> A kernel module might do this; this could be integrated in perf bench so
>> that we can regression-test upcoming kernels.
>
> Perf would be good - but even softnet_stat, cleaner than the nasty
> hack I use (attached), would be a good start; the ping with and without
> RPS gives me a ballpark number.
>
> IPI is important to me because I tried it before and it failed
> miserably. I was thinking the improvement may be due to the hardware
> used, but I am having a hard time getting people to tell me what
> hardware they used! I am old school - I need data ;-> The RFS patch
> commit seems to have more info but is still vague, for example:
> "The benefits of RFS are dependent on cache hierarchy, application
> load, and other factors"
> Also, what does a "simple" or "complex" benchmark mean? ;->
> I think it is only fair to get this info, no?
>
> Please don't consider what I say above as being anti-RPS.
> 5 microseconds of extra latency is not bad if it can be amortized.
> Unfortunately, the best traffic I could generate was < 20Kpps of
> ping, which still manages to get 1 IPI/packet on Nehalem. I am going
> to write up some app (lots of cycles available tomorrow). I still think
> it is valuable.

+	seq_printf(seq, "%08x %08x %08x %08x %08x %08x %08x %08x %08x %08x %08x\n",
		   s->total, s->dropped, s->time_squeeze, 0,
		   0, 0, 0, 0, /* was fastroute */
-		   s->cpu_collision, s->received_rps);
+		   s->cpu_collision, s->received_rps, s->ipi_rps);

Do you mean that received_rps is equal to ipi_rps? received_rps is the
number of IPIs used by RPS, and ipi_rps is the number of IPIs sent by
the function generic_exec_single(). If there is no other user of
generic_exec_single(), received_rps should be equal to ipi_rps.

@@ -158,7 +159,10 @@ void generic_exec_single(int cpu, struct call_single_data *data, int wait)
	 * equipped to do the right thing...
	 */
	if (ipi)
+	{
		arch_send_call_function_single_ipi(cpu);
+		__get_cpu_var(netdev_rx_stat).ipi_rps++;
+	}
On Fri, 2010-04-16 at 21:34 +0800, Changli Gao wrote:
>
> +	seq_printf(seq, "%08x %08x %08x %08x %08x %08x %08x %08x %08x %08x %08x\n",
> 		   s->total, s->dropped, s->time_squeeze, 0,
> 		   0, 0, 0, 0, /* was fastroute */
> -		   s->cpu_collision, s->received_rps);
> +		   s->cpu_collision, s->received_rps, s->ipi_rps);
>
> Do you mean that received_rps is equal to ipi_rps? received_rps is the
> number of IPIs used by RPS, and ipi_rps is the number of IPIs sent by
> the function generic_exec_single(). If there is no other user of
> generic_exec_single(), received_rps should be equal to ipi_rps.

my observation is:
s->total is the sum of all packets received by a cpu (some directly from ethernet)
s->received_rps was the count a receiver cpu saw of incoming packets, if they were sent by another cpu.
s->ipi_rps is the number of times we tried to enqueue to a remote cpu, found its queue empty, and had to send an IPI.
ipi_rps can be < received_rps if we receive > 1 packet without generating an IPI. What did I miss?

cheers,
jamal
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, Apr 16, 2010 at 9:49 PM, jamal <hadi@cyberus.ca> wrote:
> On Fri, 2010-04-16 at 21:34 +0800, Changli Gao wrote:
>
> my observation is:
> s->total is the sum of all packets received by a cpu (some directly from
> ethernet)

It is meaningless currently. If RPS is enabled, it may be twice the number of packets received, because one packet may be counted twice: once in enqueue_to_backlog(), and again in __netif_receive_skb(). I posted a patch to solve this problem:

http://patchwork.ozlabs.org/patch/50217/

If you don't apply my patch, you'd better refer to /proc/net/dev for the total number.

> s->received_rps was the count a receiver cpu saw of incoming packets, if
> they were sent by another cpu.

Maybe its name confused you.

/* Called from hardirq (IPI) context */
static void trigger_softirq(void *data)
{
	struct softnet_data *queue = data;

	__napi_schedule(&queue->backlog);
	__get_cpu_var(netdev_rx_stat).received_rps++;
}

The function above is called in the IRQ handler of the IPI. It counts the number of IPIs received. It is actually the ipi_rps you need.

> s->ipi_rps is the number of times we tried to enqueue to a remote cpu,
> found its queue empty, and had to send an IPI.
> ipi_rps can be < received_rps if we receive > 1 packet without
> generating an IPI. What did I miss?
On Fri, 2010-04-16 at 22:10 +0800, Changli Gao wrote:
> On Fri, Apr 16, 2010 at 9:49 PM, jamal <hadi@cyberus.ca> wrote:
>> my observation is:
>> s->total is the sum of all packets received by a cpu (some directly from
>> ethernet)
>
> It is meaningless currently. If RPS is enabled, it may be twice the
> number of packets received, because one packet may be counted twice:
> once in enqueue_to_backlog(), and again in __netif_receive_skb().

You are probably right - you made me look at my collected data ;->
I will look closely later, but it seems they are accounting for different cpus, no?
Example: attached are some of the stats I captured when I was running the tests redirecting 1M packets from CPU0 to CPU1 at about 20Kpps (just cut to the first and last two columns):

cpu  | Total    | rps_recv | rps_ipi
-----+----------+----------+---------
cpu0 | 002dc7f1 | 00000000 | 000f4246
cpu1 | 002dc804 | 000f4240 | 00000000

So: cpu0 received 0x2dc7f1 pkts accumulated over time and redirected them to cpu1 (mostly; the extra 5 may be leftover, since I clear the data), and for this test it generated an IPI 0xf4246 times. It can be seen that the running total for CPU1 is 0x2dc804, but in this one run it received 1M packets (0xf4240). I.e. I don't see the double accounting..
cheers,
jamal

002dc7f1 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 000f4246
002dc804 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 000f4240 00000000
00000004 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000004 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000006 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
0000003c 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000004 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
0000003e 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
On Fri, Apr 16, 2010 at 10:43 PM, jamal <hadi@cyberus.ca> wrote:
> On Fri, 2010-04-16 at 22:10 +0800, Changli Gao wrote:
>
> cpu  | Total    | rps_recv | rps_ipi
> -----+----------+----------+---------
> cpu0 | 002dc7f1 | 00000000 | 000f4246
> cpu1 | 002dc804 | 000f4240 | 00000000
>
> So: cpu0 received 0x2dc7f1 pkts accumulated over time and redirected
> them to cpu1 (mostly; the extra 5 may be leftover, since I clear the
> data), and for this test it generated an IPI 0xf4246 times. It can be
> seen that the running total for CPU1 is 0x2dc804, but in this one run
> it received 1M packets (0xf4240).

I remember you redirected all the traffic from cpu0 to cpu1, and the data shows: about 0x2dc7f1 packets were processed, and about 0xf4240 IPIs were generated.

> i.e. I don't see the double accounting..

A single packet is counted twice, by CPU0 and CPU1. If you change the RPS setting with:

echo 1 > ..../rps_cpus

you will find the total numbers are doubled.
Le vendredi 16 avril 2010 à 09:21 -0400, jamal a écrit :
> On Fri, 2010-04-16 at 07:18 +0200, Eric Dumazet wrote:
>>
>> A kernel module might do this; this could be integrated in perf bench so
>> that we can regression-test upcoming kernels.
>
> Perf would be good - but even softnet_stat, cleaner than the nasty
> hack I use (attached), would be a good start; the ping with and without
> RPS gives me a ballpark number.
> [...]
> Unfortunately, the best traffic I could generate was < 20Kpps of
> ping, which still manages to get 1 IPI/packet on Nehalem. I am going
> to write up some app (lots of cycles available tomorrow). I still think
> it is valuable.

I did some tests on a dual quad-core machine (E5450 @ 3.00GHz), not Nehalem - so a 3-4 year old design.

For all tests, I use the best time of 3 runs of "ping -f -q -c 100000 192.168.0.2". Yes, ping is not very good, but it's available ;)

Note: I make sure all 8 cpus of the target are busy, eating cpu cycles in user land. I don't want to tweak acpi or whatever smart power-saving mechanisms.

When RPS off:
100000 packets transmitted, 100000 received, 0% packet loss, time 4160ms

RPS on, but directed at cpu0, which handles the device interrupts (tg3, napi):
(echo 01 > /sys/class/net/eth3/queues/rx-0/rps_cpus)
100000 packets transmitted, 100000 received, 0% packet loss, time 4234ms

So the cost of queueing the packet onto our own queue (netif_receive_skb -> enqueue_to_backlog) is about 0.74 us (74 ms / 100000).

I personally think we should process the packet instead of queueing it, but Tom disagrees with me.

RPS on, directed at cpu1 (other socket):
(echo 02 > /sys/class/net/eth3/queues/rx-0/rps_cpus)
100000 packets transmitted, 100000 received, 0% packet loss, time 4542ms

So the extra cost to enqueue to a remote cpu's queue - IPI, softirq handling... - is about 3 us. Note this cost is for the case where we receive a single packet.

I suspect the IPI itself is in the 1.5 us range, not very far from the queueing-to-ourself case.

For me the RPS use cases are:

1) Value-added apps handling lots of TCP data, where the cost of cache misses in the TCP stack easily justifies spending 3 us to gain much more.

2) Network appliances, where a single cpu is filled 100% handling one device's hardware and software/RPS interrupts, delegating all higher-level work to a pool of cpus.

I'll try to do these tests on a Nehalem target.
> So the cost of queueing the packet onto our own queue (netif_receive_skb
> -> enqueue_to_backlog) is about 0.74 us (74 ms / 100000).
>
> I personally think we should process the packet instead of queueing it,
> but Tom disagrees with me.

You could do that, but then the packet processing becomes HOL blocking for all the packets being sent to other queues for processing - remember the IPI is only sent at the end of the NAPI poll. So unless the upper-stack processing is <0.74us in your case, I think processing packets directly on the local queue would improve best-case latency, but would increase average latency, and even more likely worst-case latency, on loads with multiple flows.

> RPS on, directed at cpu1 (other socket):
> (echo 02 > /sys/class/net/eth3/queues/rx-0/rps_cpus)
> 100000 packets transmitted, 100000 received, 0% packet loss, time 4542ms
> [...]
> I'll try to do these tests on a Nehalem target.
Le samedi 17 avril 2010 à 01:43 -0700, Tom Herbert a écrit :
> You could do that, but then the packet processing becomes HOL blocking
> for all the packets being sent to other queues for processing -
> remember the IPI is only sent at the end of the NAPI poll. So unless
> the upper-stack processing is <0.74us in your case, I think processing
> packets directly on the local queue would improve best-case latency,
> but would increase average latency, and even more likely worst-case
> latency, on loads with multiple flows.

Anyway, a big part of this 0.74 us overhead comes from get_rps_cpu() itself, computing skb->rxhash and all.

We should review how many cache lines we exchange per skb, and try to reduce that number.
Le samedi 17 avril 2010 à 11:23 +0200, Eric Dumazet a écrit :
> Le samedi 17 avril 2010 à 01:43 -0700, Tom Herbert a écrit :
>> You could do that, but then the packet processing becomes HOL blocking
>> for all the packets being sent to other queues for processing -
>> remember the IPI is only sent at the end of the NAPI poll.
>> [...]

Tom, I am not sure what you describe is even respected for NAPI devices. (I hope you use NAPI devices in your company ;) )

If we enqueue an skb to the backlog, we also link our backlog napi into our poll_list, if it is not already there.

So the loop in net_rx_action() will make us handle our backlog napi a bit after this network device's napi (if the time limit of 2 jiffies has not elapsed) and *before* sending IPIs to remote cpus anyway.
> Tom, I am not sure what you describe is even respected for NAPI devices.
> (I hope you use NAPI devices in your company ;) )
>
> If we enqueue an skb to the backlog, we also link our backlog napi into
> our poll_list, if it is not already there.
>
> So the loop in net_rx_action() will make us handle our backlog napi a
> bit after this network device's napi (if the time limit of 2 jiffies has
> not elapsed) and *before* sending IPIs to remote cpus anyway.

Then I think that's a bug you've identified ;-)
On Sat, 2010-04-17 at 09:35 +0200, Eric Dumazet wrote:

> I did some tests on a dual quad-core machine (E5450 @ 3.00GHz), not
> Nehalem - so a 3-4 year old design.

Eric, I thank you, kind sir, for going out of your way to do this - it is certainly a good processor to compare against.

> For all tests, I use the best time of 3 runs of "ping -f -q -c 100000
> 192.168.0.2". Yes, ping is not very good, but it's available ;)

It is a reasonable quick test, no fancy setup required ;->

> Note: I make sure all 8 cpus of the target are busy, eating cpu cycles
> in user land.

I didn't keep the cpus busy. I should re-run with such a setup; any specific app that you used to keep them busy? Keeping them busy could have consequences; I am speculating you probably ended up having a greater-than-one packet/IPI ratio, i.e. an amortization benefit..

> I don't want to tweak acpi or whatever smart power-saving mechanisms.

I should mention I turned off acpi in the bios as well; it was consuming more cpu cycles than net-processing and was interfering with my tests.

> When RPS off:
> 100000 packets transmitted, 100000 received, 0% packet loss, time 4160ms
>
> RPS on, but directed at cpu0, which handles the device interrupts (tg3, napi):
> (echo 01 > /sys/class/net/eth3/queues/rx-0/rps_cpus)
> 100000 packets transmitted, 100000 received, 0% packet loss, time 4234ms
>
> So the cost of queueing the packet onto our own queue (netif_receive_skb
> -> enqueue_to_backlog) is about 0.74 us (74 ms / 100000).

Excellent analysis.

> I personally think we should process the packet instead of queueing it,
> but Tom disagrees with me.

Sorry - I am gonna have to turn on some pedagogy and offer my Canadian 2 cents ;->
I would lean toward agreeing with Tom, but maybe go one step further (packet-reordering aside): we should never process packets up to the socket layer on the demuxing cpu. Enqueue everything you receive on a different cpu - so somehow the receiving cpu becomes part of the hashing decision...

The reason is derived from queueing theory - of which I know dangerously little - but I refer you to Mr. Little his-self[1] (pun fully intended ;->): i.e. a fixed serving time provides more predictable results, as opposed to an occasional spike as you receive packets destined for "our cpu". Queueing packets and later allocating cycles to processing them adds variability, but it is not as bad as processing to completion up to the socket layer.

> RPS on, directed at cpu1 (other socket):
> (echo 02 > /sys/class/net/eth3/queues/rx-0/rps_cpus)
> 100000 packets transmitted, 100000 received, 0% packet loss, time 4542ms

Good test - this should be the worst-case scenario. But there are two other scenarios which will give different results, in my opinion. On your setup I think each socket has two dies, each with two cores. So my feeling is you will get different numbers if you go within the same die and across dies within the same socket. If I am not mistaken, the mapping would be something like socket0/die0{core0/2}, socket0/die1{core4/6}, socket1/die0{core1/3}, socket1/die1{core5/7}. If you have cycles, can you try the same socket+die but different cores, and the same socket but different die?

> So the extra cost to enqueue to a remote cpu's queue - IPI, softirq
> handling... - is about 3 us. Note this cost is for the case where we
> receive a single packet.

Which is not too bad if amortized. Were you able to check if you processed one packet/IPI? One way to achieve that is just standard ping. On the Nehalem, my number for going to a different core was in the range of a 5 microsecond effect on RTT when the system was not busy. I think it would be higher going across QPI.

> I suspect the IPI itself is in the 1.5 us range, not very far from the
> queueing-to-ourself case.

Sounds about right - maybe 2 us in my case. I am still mystified by "what damage does an IPI do?" to the system harmony. I have to do some reading. Andi mentioned the APIC connection - but my gut feeling is you probably end up going to main memory and invalidating cache.

> For me the RPS use cases are:
>
> 1) Value-added apps handling lots of TCP data, where the cost of cache
> misses in the TCP stack easily justifies spending 3 us to gain much more.
>
> 2) Network appliances, where a single cpu is filled 100% handling one
> device's hardware and software/RPS interrupts, delegating all
> higher-level work to a pool of cpus.

Agreed on both. The caveats to note:
- what hardware would be reasonable
- within the same hardware, what setups would be good to use
- when it doesn't benefit even with everything correct (e.g. low tcp throughput)

> I'll try to do these tests on a Nehalem target.

Thanks again Eric.

cheers,
jamal

[1] http://en.wikipedia.org/wiki/Little's_law
Le samedi 17 avril 2010 à 13:31 -0400, jamal a écrit :
> On Sat, 2010-04-17 at 09:35 +0200, Eric Dumazet wrote:
>> Note: I make sure all 8 cpus of the target are busy, eating cpu cycles
>> in user land.
>
> I didn't keep the cpus busy. I should re-run with such a setup; any
> specific app that you used to keep them busy? Keeping them busy could
> have consequences; I am speculating you probably ended up having a
> greater-than-one packet/IPI ratio, i.e. an amortization benefit..

No, only one packet per IPI: since I set my tg3 coalescing parameters to the minimum value, I received one packet per interrupt.

The specific app is:

for f in `seq 1 8`; do while :; do :; done& done

> Good test - this should be the worst-case scenario. But there are two
> other scenarios which will give different results, in my opinion.
> On your setup I think each socket has two dies, each with two cores. So
> my feeling is you will get different numbers if you go within the same
> die and across dies within the same socket. If I am not mistaken, the
> mapping would be something like socket0/die0{core0/2},
> socket0/die1{core4/6}, socket1/die0{core1/3}, socket1/die1{core5/7}.
> If you have cycles, can you try the same socket+die but different cores,
> and the same socket but different die?
Sure, let's redo a full test, taking the lowest time of three ping runs. Every run was "100000 packets transmitted, 100000 received, 0% packet loss"; only the total time varied:

echo <mask> >/sys/class/net/eth3/queues/rx-0/rps_cpus

mask | time
-----+-------
 00  | 4151ms
 01  | 4254ms
 02  | 4563ms
 04  | 4458ms
 08  | 4563ms
 10  | 4327ms
 20  | 4571ms
 40  | 4472ms
 80  | 4568ms

# egrep "physical id|core|apicid" /proc/cpuinfo
(summarized; cpu cores : 4 on both packages)

cpu0: physical id 0, core id 0, apicid 0
cpu1: physical id 1, core id 0, apicid 4
cpu2: physical id 0, core id 2, apicid 2
cpu3: physical id 1, core id 2, apicid 6
cpu4: physical id 0, core id 1, apicid 1
cpu5: physical id 1, core id 1, apicid 5
cpu6: physical id 0, core id 3, apicid 3
cpu7: physical id 1, core id 3, apicid 7
Le dimanche 18 avril 2010 à 11:39 +0200, Eric Dumazet a écrit :
> No, only one packet per IPI: since I set my tg3 coalescing parameters
> to the minimum value, I received one packet per interrupt.
>
> The specific app is:
>
> for f in `seq 1 8`; do while :; do :; done& done

Another interesting user-land app would be a cpu _and_ memory cruncher, because of the cache misses we'll get.

$ cat nloop.c
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define SZ (4*1024*1024)

int main(int argc, char *argv[])
{
	int nproc = 8;
	char *buffer;

	if (argc > 1)
		nproc = atoi(argv[1]);
	while (nproc > 1) {
		if (fork() == 0)
			break;
		nproc--;
	}
	buffer = malloc(SZ);
	while (1)
		memset(buffer, 0x55, SZ);
}

$ ./nloop 8 &

echo <mask> >/sys/class/net/eth3/queues/rx-0/rps_cpus

mask | time
-----+-------
 00  | 4861ms
 01  | 4981ms
 02  | 7191ms
 04  | 7128ms
 08  | 7107ms
 10  | 5505ms
 20  | 7125ms
 40  | 7022ms
 80  | 7157ms

Maximum overhead is (7191 - 4861) ms / 100000 = 23.3 us per packet.
Thanks Eric. I tried to visualize your results - attached. There are 2-3 odd numbers (labelled with *), but other than that the results are as expected...

I did run some experiments with a udp sink server, and I saw the IPIs amortized; unfortunately the sky2 hardware proved to be the bottleneck (at > 750Kpps incoming it started dropping, and wasn't recording the drops, so I had to slow things down). I need to digest my results a little more - but it seems I was getting better throughput results with RPS (i.e. it was able to sink more packets)..

cheers,
jamal
Sorry, I didn't respond to you - I was busy setting up before trying to think a little more about this..

On Fri, 2010-04-16 at 22:58 +0800, Changli Gao wrote:
>> cpu  | Total    | rps_recv | rps_ipi
>> -----+----------+----------+---------
>> cpu0 | 002dc7f1 | 00000000 | 000f4246
>> cpu1 | 002dc804 | 000f4240 | 00000000
>>
>> So: cpu0 received 0x2dc7f1 pkts accumulated over time and redirected
>> them to cpu1 (mostly; the extra 5 may be leftover, since I clear the
>> data), and for this test it generated an IPI 0xf4246 times. It can be
>> seen that the running total for CPU1 is 0x2dc804, but in this one run
>> it received 1M packets (0xf4240).
>
> I remember you redirected all the traffic from cpu0 to cpu1, and the
> data shows: about 0x2dc7f1 packets were processed, and about 0xf4240
> IPIs were generated.

If you look at the patch, I am zeroing those stats - so 0xf4240 is only one test (decimal 1M). I think there is something to what you are saying; rps_ipi on cpu0 is ambiguous because it counts both the number of times cpu0's softirq was scheduled and the number of times cpu0 scheduled other cpus. The extra six for cpu0 turn out to be the times an ethernet interrupt scheduled the cpu0 softirq.

> a single packet is counted twice, by CPU0 and CPU1.

Well, the counts have different meanings; rps_ipi applies to source-cpu activity and rps_recv applies to the destination. For example, if cpu0 in total found some destination cpu's queue empty 6 times, 2 of those on each of cpu1, cpu2 and cpu3, then:

cpu0: ipi_rps = 6
cpu1: rps_recv = 2
cpu2: rps_recv = 2
cpu3: rps_recv = 2

> If you change the RPS setting with:
>
> echo 1 > ..../rps_cpus
>
> you will find the total numbers are doubled.

This is true, but IMO it deserves to be double-counted; it is just more fine-grained accounting. IOW, I am not sure we need your patch, because we would lose the fine-grained accounting - and mine requires more work to be less ambiguous.

cheers,
jamal
folks,

Thanks to everybody (Eric stands out) for your patience. I ended up mostly validating what has already been said. I have a lot of data and can describe in detail how I tested etc., but it would require patience in reading, so I will spare you ;-> If you are interested, let me know and I will be happy to share.

Summary:
- RPS good: gives higher throughput for apps.
- RPS not so good: latency is worse, but it gets better with a higher input rate or an increasing number of flows (which translates to higher pps).
- RPS works well with newer hardware that has better cache structures. [It gives great results on my test machine, a single-processor Nehalem: 4 cores, each with two SMT threads, with a shared L2 between threads and a shared L3 between cores.] Your choice of demux cpu and of where the target cpus are is an influencing factor in the latency results. If you have a system with multiple sockets, you should get better numbers if you stay within the same socket rather than going across sockets.
- RPS does a better job of helping schedule apps on the same cpu, thus localizing the app. The throughput results with RPS are very consistent and better, whereas in the non-RPS case the variance is _high_.

My next step is to do some forwarding tests - probably next week. I am concerned here because I expect the cache misses to be higher than in the app scenario (the netdev structure and attributes could be touched by many cpus).

cheers,
jamal
On Tue, 2010-04-20 at 08:02 -0400, jamal wrote:
> folks,
>
> Thanks to everybody (Eric stands out) for your patience.
> I ended mostly validating whats already been said. I have a lot of data
> and can describe in details how i tested etc but it would require
> patience in reading, so i will spare you;-> If you are interested let me
> know and i will be happy to share.
>
> Summary is:
> -rps good, gives higher throughput for apps
> -rps not so good, latency worse but gets better with higher input rate
> or increasing number of flows (which translates to higher pps)
> -rps works well with newer hardware that has better cache structures.
> [Gives great results on my test machine a Nehalem single processor, 4
> cores each with two SMT threads that has a shared L2 between threads and
> a shared L3 between cores].
> Your selection of what the demux cpu is and where the target cpus are is
> an influencing factor in the latency results. If you have a system with
> multiple sockets, you should get better numbers if you stay within the
> same socket relative to going across sockets.
> -rps does a better job at helping schedule apps on same cpu thus
> localizing the app. The throughput results with rps are very consistent
> and better whereas in non-rps case, variance is _high_.
>
> My next step is to do some forwarding tests - probably next week. I am
> concerned here because i expect the cache misses to be higher than the
> app scenario (netdev structure and attributes could be touched by many
> cpus)

Hi Jamal

I think your tests are very interesting; maybe you could publish them
somehow? (I forgot to thank you for the previous report and nice graph.)

perf reports would be good too, to help spot hot points.
On Wed, 2010-04-21 at 08:39 -0400, jamal wrote:
> On Tue, 2010-04-20 at 15:13 +0200, Eric Dumazet wrote:
> >
> > I think your tests are very interesting, maybe could you publish them
> > somehow ? (I forgot to thank you about the previous report and nice
> > graph)
> > perf reports would be good too to help to spot hot points.
>
> Ok ;->
> Let me explain my test setup (which some app types may gasp at;->):
>
> SUT (system under test) was a Nehalem single processor (4 cores, 2 SMT
> threads per core).
> The SUT runs a udp sink server i wrote (with apologies to Rick
> Jones[1]) which forks at most a process per detected cpu and binds to
> a different udp port on each processor.
> The traffic generator sent the SUT up to 750Kpps of udp packets
> round-robin and varied the destination port to select a different flow
> on each of the outgoing packets. I could further increase the number
> of flows by varying the source address and source port number, but in
> the end i settled on fixed srcip/srcport/destination ip and just
> varied the port number in order to simplify results collection.
> For rps i selected mask "ee" and bound the interrupt to cpu0. ee
> leaves out cpu0 and cpu4 from the set of target cpus. Because Nehalem
> has SMT threads, cpu0 and cpu4 are SMT threads that reside on core0
> and steal execution cycles from each other - so i didn't want that to
> happen and instead tried to have as many of those cycles as possible
> for demuxing incoming packets.
>
> Overall, in the best case scenario rps had 5-7% better throughput than
> the non-rps setup. It had up to 10% more cpu use and about 2-5% more
> latency.
> I am attaching some visualization of the way 8 flows were distributed
> around the different cpus. The diagrams show some samples - but what
> you see there was a good reflection of what i saw in many runs of the
> tests.
> Essentially, localization is better with rps, which gets better if you
> can somehow map the target cpus as selected by rps to what the app
> binds to.
> I've also attached a small annotated perf output - sorry i didn't have
> time to dig deeper into the code; maybe later this week. I think my
> biggest problem in this setup was the sky2 driver's or hardware's poor
> ability to handle lots of traffic.
>
> cheers,
> jamal
>
> [1] I want to hump on the SUT with tons of traffic and count packets;
> too complex to do with netperf

Thanks a lot Jamal, this is really useful.

A drawback of using a fixed src ip from your generator is that all flows
share the same struct dst entry on the SUT. This might explain some
glitches you noticed (ip_route_input + ip_rcv at a high level on
slave/application cpus).
Also note your test is one way. If some data was replied we would see
much more use of the 'flows'.

I notice epoll_ctl() is used a lot; are you re-arming epoll each time
you receive a datagram?

I see slave/application cpus hit _raw_spin_lock_irqsave() and
_raw_spin_unlock_irqrestore().

Maybe a ring buffer could help (instead of a doubly linked queue) for
the backlog, or the double queue trick, if Changli wants to respin his
patch.
> Let me explain my test setup (which some app types may gasp at;->):
>
> SUT (system under test) was a nehalem single processor (4 cores, 2 SMT
> threads per core).
> SUT runs a udp sink server i wrote (with apologies to Rick Jones[1])
> ...
>
> [1] I want to hump on the SUT with tons of traffic and count packets;
> too complex to do with netperf

No need to apologize; if you like, I'd be happy to discuss netperf usage
tips offline. That offer stands for everyone.

happy benchmarking,

rick jones
On Thu, Apr 22, 2010 at 3:01 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
> Thanks a lot Jamal, this is really useful
>
> Drawback of using a fixed src ip from your generator is that all flows
> share the same struct dst entry on SUT. This might explain some glitches
> you noticed (ip_route_input + ip_rcv at high level on slave/application
> cpus)
> Also note your test is one way. If some data was replied we would see
> much use of the 'flows'
>
> I notice epoll_ctl() used a lot, are you re-arming epoll each time you
> receive a datagram ?
>
> I see slave/application cpus hit _raw_spin_lock_irqsave() and
> _raw_spin_unlock_irqrestore().
>
> Maybe a ring buffer could help (instead of a double linked queue) for
> backlog, or the double queue trick, if Changli wants to respin his
> patch.

OK, I'll post a new patch against the current tree, so Jamal can have a
try. I am sorry, but I don't have a suitable computer for benchmarking.
On Wed, 2010-04-21 at 21:01 +0200, Eric Dumazet wrote:

> Drawback of using a fixed src ip from your generator is that all flows
> share the same struct dst entry on SUT. This might explain some glitches
> you noticed (ip_route_input + ip_rcv at high level on slave/application
> cpus)

yes, that would explain it ;-> I could have had flows going to each cpu
generating different unique dsts. It is good i didn't ;->

> Also note your test is one way. If some data was replied we would see
> much use of the 'flows'

In my next step i wanted to "route" these packets at app level; for this
stage of testing i just wanted to sink the data to reduce experiment
variables. Reason: the netdev structure would hit a lot of cache misses
if i started using it to both send and receive, since lots of things are
shared on tx/rx (for example, napi tx pruning could happen on either the
tx or the receive path); same thing with the qdisc path, which is at
netdev granularity.. I think there may be room for interesting
improvements in this area..

> I notice epoll_ctl() used a lot, are you re-arming epoll each time you
> receive a datagram ?

I am using the default libevent on debian. It looks very old and maybe
buggy. I will try to upgrade first and, if i still see the same,
investigate.

> I see slave/application cpus hit _raw_spin_lock_irqsave() and
> _raw_spin_unlock_irqrestore().
>
> Maybe a ring buffer could help (instead of a double linked queue) for
> backlog, or the double queue trick, if Changli wants to respin his
> patch.

Ok, I will have some cycles later today/tomorrow or for sure on the
weekend. My setup is still intact - so i can test.

cheers,
jamal
On Thu, Apr 22, 2010 at 8:12 PM, jamal <hadi@cyberus.ca> wrote:
>
>> I see slave/application cpus hit _raw_spin_lock_irqsave() and
>> _raw_spin_unlock_irqrestore().
>>
>> Maybe a ring buffer could help (instead of a double linked queue) for
>> backlog, or the double queue trick, if Changli wants to respin his
>> patch.
>>
>
> Ok, I will have some cycles later today/tommorow or for sure on weekend.
> My setup is still intact - so i can test.

I read the code again, and found that we don't use spin_lock_irqsave();
we use local_irq_save() and spin_lock() instead, so
_raw_spin_lock_irqsave() and _raw_spin_unlock_irqrestore() should not be
related to the backlog. The lock is maybe sk_receive_queue.lock.

Jamal, did you use a single socket to serve all the clients?

BTW: completion_queue and output_queue in softnet_data are both LIFO
queues. For completion_queue, FIFO is better, as the last used skb is
more likely in cache and should be used first. Since slab always caches
the last used memory at the head, we'd better free the skbs in FIFO
order. For output_queue, FIFO is good for fairness among qdiscs.
On Sun, 2010-04-25 at 10:31 +0800, Changli Gao wrote:

> I read the code again, and find that we don't use spin_lock_irqsave(),
> and we use local_irq_save() and spin_lock() instead, so
> _raw_spin_lock_irqsave() and _raw_spin_lock_irqrestore() should not be
> related to backlog. the lock maybe sk_receive_queue.lock.

Possible. I am wondering if there's a way we can precisely nail down
where that is happening? Is lockstat any use?
Fixing _raw_spin_lock_irqsave and friends is the lowest hanging fruit.

So looking at your patch now, i see it is likely there was an
improvement made for the non-rps case (moving some irq_enable etc out of
the loop), i.e. my results may not be crazy after adding your patch and
seeing an improvement for the non-rps case.
However, whatever your patch did, it did not help the rps case:
call_function_single_interrupt() comes out higher in the profile, and
the # of IPIs seems to have gone up (although i did not measure this, I
can see the interrupts/second went up by almost 50-60%).

> Jamal, did you use a single socket to serve all the clients?

A socket per detected cpu.

> BTW: completion_queue and output_queue in softnet_data both are LIFO
> queues. For completion_queue, FIFO is better, as the last used skb is
> more likely in cache, and should be used first. Since slab has always
> cache the last used memory at the head, we'd better free the skb in
> FIFO manner. For output_queue, FIFO is good for fairness among qdiscs.

I think it will depend on how many of those skbs are sitting in the
completion queue, cache warmth etc. LIFO is always safest; you have a
higher probability of finding a cached skb in front.

cheers,
jamal
On Mon, Apr 26, 2010 at 7:35 PM, jamal <hadi@cyberus.ca> wrote:
> On Sun, 2010-04-25 at 10:31 +0800, Changli Gao wrote:
>
>> I read the code again, and find that we don't use spin_lock_irqsave(),
>> and we use local_irq_save() and spin_lock() instead, so
>> _raw_spin_lock_irqsave() and _raw_spin_lock_irqrestore() should not be
>> related to backlog. the lock maybe sk_receive_queue.lock.
>
> Possible.
> I am wondering if there's a way we can precisely nail where that is
> happening? is lockstat any use?
> Fixing _raw_spin_lock_irqsave and friend is the lowest hanging fruit.

Maybe lockstat can help in this case.

> So looking at your patch now i see it is likely there was an improvement
> made for non-rps case (moving out of loop some irq_enable etc).
> i.e my results may not be crazy after adding your patch and seeing an
> improvement for non-rps case.
> However, whatever your patch did - it did not help the rps case:
> call_function_single_interrupt() comes out higher in the profile,
> and # of IPIs seems to have gone up (although i did not measure this, I
> can see the interrupts/second went up by almost 50-60%)

Did you apply the patch from Eric? It would reduce the number of
local_irq_disable() calls but increase the number of IPIs.

>> Jamal, did you use a single socket to serve all the clients?
>
> Socket per detected cpu.

Ignore it. I made a mistake here.

>> BTW: completion_queue and output_queue in softnet_data both are LIFO
>> queues. For completion_queue, FIFO is better, as the last used skb is
>> more likely in cache, and should be used first. Since slab has always
>> cache the last used memory at the head, we'd better free the skb in
>> FIFO manner. For output_queue, FIFO is good for fairness among qdiscs.
>
> I think it will depend on how many of those skbs are sitting in the
> completion queue, cache warmth etc. LIFO is always safest, you have
> higher probability of finding a cached skb infront.
We call kfree_skb() to release skbs to the slab allocator, and the slab
allocator stores them in a LIFO queue. If the completion queue is also a
LIFO queue, the latest unused skb will be at the front of the queue and
will be released to the slab allocator first. The next time we call
alloc_skb(), the memory used by the skb at the end of the completion
queue will be returned instead of the hot one. However, as Eric said,
new drivers don't rely on the completion queue, so it isn't a real
problem, especially in your test case.
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index d1a21b5..f8267fc 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -224,6 +224,7 @@ struct netif_rx_stats {
 	unsigned time_squeeze;
 	unsigned cpu_collision;
 	unsigned received_rps;
+	unsigned ipi_rps;
 };
 
 DECLARE_PER_CPU(struct netif_rx_stats, netdev_rx_stat);
diff --git a/kernel/smp.c b/kernel/smp.c
index 9867b6b..8c5dcb7 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -11,6 +11,7 @@
 #include <linux/init.h>
 #include <linux/smp.h>
 #include <linux/cpu.h>
+#include <linux/netdevice.h>
 
 static struct {
 	struct list_head queue;
@@ -158,7 +159,10 @@ void generic_exec_single(int cpu, struct call_single_data *data, int wait)
 	 * equipped to do the right thing...
 	 */
 	if (ipi)
+{
 		arch_send_call_function_single_ipi(cpu);
+		__get_cpu_var(netdev_rx_stat).ipi_rps++;
+}
 
 	if (wait)
 		csd_lock_wait(data);
diff --git a/net/core/dev.c b/net/core/dev.c
index b98ddc6..0bbbdcf 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3563,10 +3563,12 @@ static int softnet_seq_show(struct seq_file *seq, void *v)
 {
 	struct netif_rx_stats *s = v;
 
-	seq_printf(seq, "%08x %08x %08x %08x %08x %08x %08x %08x %08x %08x\n",
+	seq_printf(seq, "%08x %08x %08x %08x %08x %08x %08x %08x %08x %08x %08x\n",
 		   s->total, s->dropped, s->time_squeeze, 0,
 		   0, 0, 0, 0, /* was fastroute */
-		   s->cpu_collision, s->received_rps);
+		   s->cpu_collision, s->received_rps, s->ipi_rps);
+	s->ipi_rps = 0;
+	s->received_rps = 0;
 
 	return 0;
 }