Message ID:   200905231124.28925.mb@bu3sch.de
State:        Not Applicable, archived
Delegated to: David Miller
Hi!

Michael Buesch wrote:
> Thanks a lot, Dave! This fixes the issue on my chip.

Yep, it's stable here as well. And even a little faster than pci=nomsi.
The only strangeness I observed is that the throughput (measured with
iperf and a single TCP connection) varies:

[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec    667 MBytes    559 Mbits/sec
[  3] 10.0-20.0 sec    803 MBytes    673 Mbits/sec
[  3] 20.0-30.0 sec    802 MBytes    673 Mbits/sec
[  3] 30.0-40.0 sec    714 MBytes    599 Mbits/sec
[  3] 40.0-50.0 sec    669 MBytes    561 Mbits/sec
[  3] 50.0-60.0 sec    791 MBytes    663 Mbits/sec
[  3]  0.0-60.0 sec   4.34 GBytes    622 Mbits/sec

In gkrellm, you can see that it actually alternates between two values.

With pci=nomsi (and without the patch) I get lower but more consistent
results:

[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec    511 MBytes    429 Mbits/sec
[  3] 10.0-20.0 sec    465 MBytes    390 Mbits/sec
[  3] 20.0-30.0 sec    481 MBytes    404 Mbits/sec
[  3] 30.0-40.0 sec    466 MBytes    391 Mbits/sec
[  3] 40.0-50.0 sec    465 MBytes    390 Mbits/sec
[  3] 50.0-60.0 sec    463 MBytes    389 Mbits/sec
[  3]  0.0-60.0 sec   2.78 GBytes    399 Mbits/sec

I suppose it's a side effect of the MSI acknowledgement loop. But who am
I to complain about higher average throughput? ;-)

> You can add my:
>
> Tested-by: Michael Buesch <mb@bu3sch.de>

Tested-by: Michael Riepe <michael.riepe@googlemail.com>
On Saturday 23 May 2009 16:35:28 Michael Riepe wrote:
> Hi!
>
> Michael Buesch wrote:
>
> > Thanks a lot, Dave! This fixes the issue on my chip.
>
> Yep, it's stable here as well. And even a little faster than pci=nomsi.
> The only strangeness I observed is that the throughput (measured with
> iperf and a single TCP connection) varies:
>
> [ ID] Interval       Transfer     Bandwidth
> [  3]  0.0-10.0 sec    667 MBytes    559 Mbits/sec
> [  3] 10.0-20.0 sec    803 MBytes    673 Mbits/sec
> [  3] 20.0-30.0 sec    802 MBytes    673 Mbits/sec
> [  3] 30.0-40.0 sec    714 MBytes    599 Mbits/sec
> [  3] 40.0-50.0 sec    669 MBytes    561 Mbits/sec
> [  3] 50.0-60.0 sec    791 MBytes    663 Mbits/sec
> [  3]  0.0-60.0 sec   4.34 GBytes    622 Mbits/sec

Are you running the iperf server or client on the r8169?

I'm running the iperf server on the r8169 and get the following results:

mb@homer:~$ iperf -c 192.168.2.50 -t120 -i10
------------------------------------------------------------
Client connecting to 192.168.2.50, TCP port 5001
TCP window size: 16.0 KByte (default)
------------------------------------------------------------
[  3] local 192.168.2.205 port 34739 connected with 192.168.2.50 port 5001
[ ID] Interval        Transfer     Bandwidth
[  3]   0.0- 10.0 sec    789 MBytes    662 Mbits/sec
[  3]  10.0- 20.0 sec    791 MBytes    664 Mbits/sec
[  3]  20.0- 30.0 sec    799 MBytes    670 Mbits/sec
[  3]  30.0- 40.0 sec    794 MBytes    666 Mbits/sec
[  3]  40.0- 50.0 sec    795 MBytes    667 Mbits/sec
[  3]  50.0- 60.0 sec    793 MBytes    665 Mbits/sec
[  3]  60.0- 70.0 sec    800 MBytes    671 Mbits/sec
[  3]  70.0- 80.0 sec    795 MBytes    667 Mbits/sec
[  3]  80.0- 90.0 sec    800 MBytes    672 Mbits/sec
[  3]  90.0-100.0 sec    797 MBytes    669 Mbits/sec
[  3] 100.0-110.0 sec    789 MBytes    662 Mbits/sec
[  3] 110.0-120.0 sec    791 MBytes    663 Mbits/sec
[  3]   0.0-120.0 sec   9.31 GBytes    666 Mbits/sec

Looks acceptable to me. The iperf client device is a BCM5780.
On Sat, 2009-05-23 at 16:35 +0200, Michael Riepe wrote:
> Hi!
>
> Michael Buesch wrote:
>
> > Thanks a lot, Dave! This fixes the issue on my chip.
>
> Yep, it's stable here as well. And even a little faster than pci=nomsi.
> The only strangeness I observed is that the throughput (measured with
> iperf and a single TCP connection) varies:
>
> [ ID] Interval       Transfer     Bandwidth
> [  3]  0.0-10.0 sec    667 MBytes    559 Mbits/sec
> [  3] 10.0-20.0 sec    803 MBytes    673 Mbits/sec
> [  3] 20.0-30.0 sec    802 MBytes    673 Mbits/sec
> [  3] 30.0-40.0 sec    714 MBytes    599 Mbits/sec
> [  3] 40.0-50.0 sec    669 MBytes    561 Mbits/sec
> [  3] 50.0-60.0 sec    791 MBytes    663 Mbits/sec
> [  3]  0.0-60.0 sec   4.34 GBytes    622 Mbits/sec
> [snip]
> I suppose it's a side effect of the MSI acknowledgement loop. But who am
> I to complain about higher average throughput? ;-)

I wonder if that is the TCP sawtooth pattern -- run up until we drop
packets, drop off, repeat. I thought newer congestion algorithms would
help with that, but I've not kept up; this may be another red herring --
like the bisection into genirq.

A tcpdump may answer the question -- wireshark can do an analysis and
show whether the connection ramps up until it starts dropping packets.

Or it may be the loop, but I wouldn't expect it to make such a big
difference, or to be as variable if it does.

Also, what does it look like with multiple streams?

Thanks for testing, guys -- I'm glad it works for a sample size > 1!

Dave
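For anyone who wants to reproduce the check Dave suggests, a rough sketch of
the kind of commands involved (the interface name eth0 and the server address
are taken from later messages in this thread; the capture file name is
arbitrary):

  # Capture the transmit side while the single-stream test runs, then look at
  # the trace in wireshark (e.g. the TCP stream graphs, or filter on
  # tcp.analysis.retransmission) to see whether it ramps up until packets
  # are dropped.
  tcpdump -i eth0 -s 96 -w /tmp/iperf-tx.pcap &
  iperf -c 192.168.178.204 -t 60 -i 10
  kill %1

  # And the multi-stream comparison:
  iperf -c 192.168.178.204 -t 60 -i 10 -P 4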
Hi!

Michael Buesch wrote:
> On Saturday 23 May 2009 16:35:28 Michael Riepe wrote:
>
>> Hi!
>>
>> Michael Buesch wrote:
>>
>>> Thanks a lot, Dave! This fixes the issue on my chip.
>>
>> Yep, it's stable here as well. And even a little faster than pci=nomsi.
>> The only strangeness I observed is that the throughput (measured with
>> iperf and a single TCP connection) varies:
>>
>> [ ID] Interval       Transfer     Bandwidth
>> [  3]  0.0-10.0 sec    667 MBytes    559 Mbits/sec
>> [  3] 10.0-20.0 sec    803 MBytes    673 Mbits/sec
>> [  3] 20.0-30.0 sec    802 MBytes    673 Mbits/sec
>> [  3] 30.0-40.0 sec    714 MBytes    599 Mbits/sec
>> [  3] 40.0-50.0 sec    669 MBytes    561 Mbits/sec
>> [  3] 50.0-60.0 sec    791 MBytes    663 Mbits/sec
>> [  3]  0.0-60.0 sec   4.34 GBytes    622 Mbits/sec
>
> Are you running the iperf server or client on the r8169?

The client, since I wanted to measure write throughput. The server is a
Lenovo Thinkpad T60 (e1000e driver).

> I'm running the iperf server on the r8169 and get the following results:
>
> mb@homer:~$ iperf -c 192.168.2.50 -t120 -i10

In that case, the r8169 is receiving data (which never was a problem, at
least for me - it only lost interrupts in TX mode).
Hi!

David Dillow wrote:
> I wonder if that is the TCP sawtooth pattern -- run up until we drop
> packets, drop off, repeat. I thought newer congestion algorithms would
> help with that, but I've not kept up; this may be another red herring --
> like the bisection into genirq.

Actually, I just found out that things are much stranger. A freshly
booted system (I'm using 2.6.29.2 + the r8169 patch sent by Michael
Buesch, by the way) behaves like this:

[  3] local 192.168.178.206 port 44090 connected with 192.168.178.204 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec    483 MBytes    405 Mbits/sec
[  3] 10.0-20.0 sec    472 MBytes    396 Mbits/sec
[  3] 20.0-30.0 sec    482 MBytes    404 Mbits/sec
[  3] 30.0-40.0 sec    483 MBytes    405 Mbits/sec
[  3] 40.0-50.0 sec    480 MBytes    402 Mbits/sec
[  3] 50.0-60.0 sec    479 MBytes    402 Mbits/sec
[  3]  0.0-60.0 sec   2.81 GBytes    402 Mbits/sec

Then I've been running another test, something along the lines of

for dest in host1 host1 host2 host2
do
    ssh $dest dd of=/dev/null bs=8k count=10240000 </dev/zero &
done

After a while, I killed the ssh processes and ran iperf again. And this
time, I got:

[  3] local 192.168.178.206 port 58029 connected with 192.168.178.204 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec    634 MBytes    531 Mbits/sec
[  3] 10.0-20.0 sec    740 MBytes    621 Mbits/sec
[  3] 20.0-30.0 sec    641 MBytes    538 Mbits/sec
[  3] 30.0-40.0 sec    738 MBytes    619 Mbits/sec
[  3] 40.0-50.0 sec    742 MBytes    622 Mbits/sec
[  3] 50.0-60.0 sec    743 MBytes    623 Mbits/sec
[  3]  0.0-60.0 sec   4.14 GBytes    592 Mbits/sec

Obviously, the high-load ssh test (which would kill the device within a
few seconds without the patch) triggers something here.

A few observations later, however, I was convinced that it's not a TCP
congestion or driver issue. Actually, the throughput depends on the CPU
the benchmark is running on. You can see that in gkrellm - whenever the
process jumps to another CPU, the throughput changes. On the four
(virtual) CPUs of the Atom 330, I get these results:

CPU 0:  0.0-60.0 sec   2.65 GBytes    380 Mbits/sec
CPU 1:  0.0-60.0 sec   4.12 GBytes    590 Mbits/sec
CPU 2:  0.0-60.0 sec   3.79 GBytes    543 Mbits/sec
CPU 3:  0.0-60.0 sec   4.13 GBytes    592 Mbits/sec

CPU 0+2 are on the first core, 1+3 on the second.

If I use two connections (iperf -P2) and nail iperf to both threads of a
single core with taskset (the program is multi-threaded, just in case
you wonder), I get this:

CPU 0+2:  0.0-60.0 sec   4.65 GBytes    665 Mbits/sec
CPU 1+3:  0.0-60.0 sec   6.43 GBytes    920 Mbits/sec

That's quite a difference, isn't it?

Now I wonder what CPU 0 is doing...
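As a reference for the pinning described above, the runs can be reproduced
with something like the following (the CPU numbers assume the Atom 330 layout
mentioned, where 0+2 and 1+3 are the sibling threads of the two cores, and the
server address is the one from the iperf output above):

  # Both hyper-threads of the first core:
  taskset -c 0,2 iperf -c 192.168.178.204 -t 60 -P 2

  # Both hyper-threads of the second core:
  taskset -c 1,3 iperf -c 192.168.178.204 -t 60 -P 2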
On Saturday 23 May 2009 17:01:36 Michael Riepe wrote:
> The client, since I wanted to measure write throughput. The server is a
> Lenovo Thinkpad T60 (e1000e driver).
>
> > I'm running the iperf server on the r8169 and get the following results:
> >
> > mb@homer:~$ iperf -c 192.168.2.50 -t120 -i10
>
> In that case, the r8169 is receiving data (which never was a problem, at
> least for me - it only lost interrupts in TX mode).

Ok, I tested the other way around and it shows the problem:

mb@quimby:~/kernel$ iperf -c 192.168.2.205 -t120 -i10
------------------------------------------------------------
Client connecting to 192.168.2.205, TCP port 5001
TCP window size: 16.0 KByte (default)
------------------------------------------------------------
[  3] local 192.168.2.50 port 34788 connected with 192.168.2.205 port 5001
[ ID] Interval        Transfer     Bandwidth
[  3]   0.0- 10.0 sec    501 MBytes    420 Mbits/sec
[  3]  10.0- 20.0 sec    533 MBytes    447 Mbits/sec
[  3]  20.0- 30.0 sec    699 MBytes    587 Mbits/sec
[  3]  30.0- 40.0 sec    715 MBytes    600 Mbits/sec
[  3]  40.0- 50.0 sec    469 MBytes    394 Mbits/sec
[  3]  50.0- 60.0 sec    466 MBytes    391 Mbits/sec
[  3]  60.0- 70.0 sec    464 MBytes    389 Mbits/sec
[  3]  70.0- 80.0 sec    498 MBytes    418 Mbits/sec
[  3]  80.0- 90.0 sec    477 MBytes    400 Mbits/sec
[  3]  90.0-100.0 sec    601 MBytes    504 Mbits/sec
[  3] 100.0-110.0 sec    696 MBytes    584 Mbits/sec
[  3] 110.0-120.0 sec    510 MBytes    428 Mbits/sec
[  3]   0.0-120.0 sec   6.47 GBytes    463 Mbits/sec
On Saturday 23 May 2009 18:12:22 Michael Riepe wrote:
> CPU 0+2:  0.0-60.0 sec   4.65 GBytes    665 Mbits/sec
> CPU 1+3:  0.0-60.0 sec   6.43 GBytes    920 Mbits/sec
>
> That's quite a difference, isn't it?
>
> Now I wonder what CPU 0 is doing...

Is CPU0 the only one servicing interrupts?
On Sat, 2009-05-23 at 18:12 +0200, Michael Riepe wrote:
> If I use two connections (iperf -P2) and nail iperf to both threads of a
> single core with taskset (the program is multi-threaded, just in case
> you wonder), I get this:
>
> CPU 0+2:  0.0-60.0 sec   4.65 GBytes    665 Mbits/sec
> CPU 1+3:  0.0-60.0 sec   6.43 GBytes    920 Mbits/sec
>
> That's quite a difference, isn't it?
>
> Now I wonder what CPU 0 is doing...

Where does /proc/interrupts say the irqs are going?
On Saturday 23 May 2009 18:46:13 David Dillow wrote:
> On Sat, 2009-05-23 at 18:12 +0200, Michael Riepe wrote:
> > If I use two connections (iperf -P2) and nail iperf to both threads of a
> > single core with taskset (the program is multi-threaded, just in case
> > you wonder), I get this:
> >
> > CPU 0+2:  0.0-60.0 sec   4.65 GBytes    665 Mbits/sec
> > CPU 1+3:  0.0-60.0 sec   6.43 GBytes    920 Mbits/sec
> >
> > That's quite a difference, isn't it?
> >
> > Now I wonder what CPU 0 is doing...
>
> Where does /proc/interrupts say the irqs are going?

For me it looks like this:

 24:   52606581          0          0          0   PCI-MSI-edge      eth0

Since I have the same CPU as Michael Riepe, I think CPU0 (core 0)
servicing all the interrupts is related to the issue. I'm wondering,
however, whether this is expected behavior: iperf plus interrupt handling
will pretty much saturate one core of the Atom 330, so it looks like a
plain CPU bottleneck.
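To check whether the bottleneck follows the interrupts, the MSI can be steered
away from CPU0 via the IRQ affinity mask. A rough sketch, using the IRQ number
from the output above (the number differs per machine, and irqbalance, if it
is running, may rewrite the mask):

  # Where is eth0's interrupt going right now?
  grep eth0 /proc/interrupts

  # Restrict IRQ 24 to CPU1 and CPU3 (bitmask 0xa), then re-run the benchmark
  echo a > /proc/irq/24/smp_affinity

  # Watch the per-CPU counters to confirm the interrupt actually moved
  watch -d -n1 'grep eth0 /proc/interrupts'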
David Dillow wrote:
> On Sat, 2009-05-23 at 18:12 +0200, Michael Riepe wrote:
>
>> If I use two connections (iperf -P2) and nail iperf to both threads of a
>> single core with taskset (the program is multi-threaded, just in case
>> you wonder), I get this:
>>
>> CPU 0+2:  0.0-60.0 sec   4.65 GBytes    665 Mbits/sec
>> CPU 1+3:  0.0-60.0 sec   6.43 GBytes    920 Mbits/sec
>>
>> That's quite a difference, isn't it?
>>
>> Now I wonder what CPU 0 is doing...
>
> Where does /proc/interrupts say the irqs are going?

Oh well...

            CPU0       CPU1       CPU2       CPU3
   0:         66          0          0          0   IO-APIC-edge      timer
   1:          2          0          0          0   IO-APIC-edge      i8042
   7:          0          0          0          0   IO-APIC-edge      parport0
   8:          1          0          0          0   IO-APIC-edge      rtc0
   9:          0          0          0          0   IO-APIC-fasteoi   acpi
  12:          4          0          0          0   IO-APIC-edge      i8042
  14:          0          0          0          0   IO-APIC-edge      ata_piix
  15:          0          0          0          0   IO-APIC-edge      ata_piix
  16:          0          0          0          0   IO-APIC-fasteoi   uhci_hcd:usb4
  18:          0          0          0          0   IO-APIC-fasteoi   uhci_hcd:usb3
  19:       9696          0          0          0   IO-APIC-fasteoi   ata_piix, uhci_hcd:usb2
  22:        471          0          0          0   IO-APIC-fasteoi   HDA Intel
  23:        100          0          0          0   IO-APIC-fasteoi   uhci_hcd:usb1, ehci_hcd:usb5
  27:   48463995          0          0          0   PCI-MSI-edge      eth0
 NMI:          0          0          0          0   Non-maskable interrupts
 LOC:     460965     294620     324902     358771   Local timer interrupts
 RES:      30111     286150     113295     268387   Rescheduling interrupts
 CAL:       1140       1143         52         34   Function call interrupts
 TLB:       2223        952       1184       2573   TLB shootdowns
 SPU:          0          0          0          0   Spurious interrupts
 ERR:          0
 MIS:          0
On Sat, 2009-05-23 at 18:53 +0200, Michael Riepe wrote:
>
> David Dillow wrote:
> > On Sat, 2009-05-23 at 18:12 +0200, Michael Riepe wrote:
> >
> >> If I use two connections (iperf -P2) and nail iperf to both threads of a
> >> single core with taskset (the program is multi-threaded, just in case
> >> you wonder), I get this:
> >>
> >> CPU 0+2:  0.0-60.0 sec   4.65 GBytes    665 Mbits/sec
> >> CPU 1+3:  0.0-60.0 sec   6.43 GBytes    920 Mbits/sec
> >>
> >> That's quite a difference, isn't it?
> >>
> >> Now I wonder what CPU 0 is doing...
> >
> > Where does /proc/interrupts say the irqs are going?
>
> Oh well...
>
>   27:   48463995          0          0          0   PCI-MSI-edge      eth0

What does it look like if you move the iperf around the CPUs while using
pci=nomsi? I'm looking to make sure I didn't cause a terrible regression
in the cost of IRQ handling...
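A sketch of the comparison being asked for, assuming the machine is rebooted
with MSI disabled and the pinned runs are repeated (without MSI, eth0 will
show up on an IO-APIC line instead of PCI-MSI-edge in /proc/interrupts; the
server address is the one used earlier in the thread):

  # Append to the kernel command line in the boot loader:
  #   pci=nomsi

  # After rebooting, repeat the pinned runs and note the per-CPU interrupt
  # counts before and after each run:
  grep eth0 /proc/interrupts
  taskset -c 0 iperf -c 192.168.178.204 -t 60
  taskset -c 1 iperf -c 192.168.178.204 -t 60
  grep eth0 /proc/interrupts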
Index: linux-2.6.29/drivers/net/r8169.c
===================================================================
--- linux-2.6.29.orig/drivers/net/r8169.c	2009-05-23 11:06:22.000000000 +0200
+++ linux-2.6.29/drivers/net/r8169.c	2009-05-23 11:08:17.000000000 +0200
@@ -3554,54 +3554,64 @@
 	int handled = 0;
 	int status;
 
+	/* loop handling interrupts until we have no new ones or
+	 * we hit an invalid/hotplug case.
+	 */
 	status = RTL_R16(IntrStatus);
+	while (status && status != 0xffff) {
+		handled = 1;
 
-	/* hotplug/major error/no more work/shared irq */
-	if ((status == 0xffff) || !status)
-		goto out;
-
-	handled = 1;
+		/* Handle all of the error cases first. These will reset
+		 * the chip, so just exit the loop.
+		 */
+		if (unlikely(!netif_running(dev))) {
+			rtl8169_asic_down(ioaddr);
+			break;
+		}
 
-	if (unlikely(!netif_running(dev))) {
-		rtl8169_asic_down(ioaddr);
-		goto out;
-	}
+		/* Work around for rx fifo overflow */
+		if (unlikely(status & RxFIFOOver) &&
+		    (tp->mac_version == RTL_GIGA_MAC_VER_11)) {
+			netif_stop_queue(dev);
+			rtl8169_tx_timeout(dev);
+			break;
+		}
 
-	status &= tp->intr_mask;
-	RTL_W16(IntrStatus,
-		(status & RxFIFOOver) ? (status | RxOverflow) : status);
+		if (unlikely(status & SYSErr)) {
+			rtl8169_pcierr_interrupt(dev);
+			break;
+		}
 
-	if (!(status & tp->intr_event))
-		goto out;
+		if (status & LinkChg)
+			rtl8169_check_link_status(dev, tp, ioaddr);
 
-	/* Work around for rx fifo overflow */
-	if (unlikely(status & RxFIFOOver) &&
-	    (tp->mac_version == RTL_GIGA_MAC_VER_11)) {
-		netif_stop_queue(dev);
-		rtl8169_tx_timeout(dev);
-		goto out;
-	}
+		/* We need to see the latest version of tp->intr_mask to
+		 * avoid ignoring an MSI interrupt and having to wait for
+		 * another event which may never come.
+		 */
+		smp_rmb();
+		if (status & tp->intr_mask & tp->napi_event) {
+			RTL_W16(IntrMask, tp->intr_event & ~tp->napi_event);
+			tp->intr_mask = ~tp->napi_event;
+
+			if (likely(netif_rx_schedule_prep(&tp->napi)))
+				__netif_rx_schedule(&tp->napi);
+			else if (netif_msg_intr(tp)) {
+				printk(KERN_INFO "%s: interrupt %04x in poll\n",
+				       dev->name, status);
+			}
+		}
 
-	if (unlikely(status & SYSErr)) {
-		rtl8169_pcierr_interrupt(dev);
-		goto out;
+		/* We only get a new MSI interrupt when all active irq
+		 * sources on the chip have been acknowledged. So, ack
+		 * everything we've seen and check if new sources have become
+		 * active to avoid blocking all interrupts from the chip.
+		 */
+		RTL_W16(IntrStatus,
+			(status & RxFIFOOver) ? (status | RxOverflow) : status);
+		status = RTL_R16(IntrStatus);
 	}
 
-	if (status & LinkChg)
-		rtl8169_check_link_status(dev, tp, ioaddr);
-
-	if (status & tp->napi_event) {
-		RTL_W16(IntrMask, tp->intr_event & ~tp->napi_event);
-		tp->intr_mask = ~tp->napi_event;
-
-		if (likely(netif_rx_schedule_prep(&tp->napi)))
-			__netif_rx_schedule(&tp->napi);
-		else if (netif_msg_intr(tp)) {
-			printk(KERN_INFO "%s: interrupt %04x in poll\n",
-			       dev->name, status);
-		}
-	}
-out:
 	return IRQ_RETVAL(handled);
 }
 
@@ -3617,13 +3627,15 @@
 
 	if (work_done < budget) {
 		netif_rx_complete(napi);
-		tp->intr_mask = 0xffff;
-		/*
-		 * 20040426: the barrier is not strictly required but the
-		 * behavior of the irq handler could be less predictable
-		 * without it. Btw, the lack of flush for the posted pci
-		 * write is safe - FR
+
+		/* We need to force the visibility of tp->intr_mask
+		 * for other CPUs, as we can lose an MSI interrupt
+		 * and potentially wait for a retransmit timeout if we don't.
+		 * The posted write to IntrMask is safe, as it will
+		 * eventually make it to the chip and we won't lose anything
+		 * until it does.
 		 */
+		tp->intr_mask = 0xffff;
 		smp_wmb();
 		RTL_W16(IntrMask, tp->intr_event);
 	}
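For completeness, one way to apply and test the patch on a 2.6.29 tree might
look like the following (the patch file name is hypothetical, and r8169 is
assumed to be built as a module):

  # Apply on top of a vanilla 2.6.29 source tree
  cd linux-2.6.29
  patch -p1 < ../r8169-msi-loop.patch

  # Rebuild just the network drivers and reload the module
  make M=drivers/net modules
  rmmod r8169
  insmod drivers/net/r8169.ko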