From patchwork Tue May 5 18:50:26 2009
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Patchwork-Submitter: Eric Dumazet
X-Patchwork-Id: 26888
X-Patchwork-Delegate: davem@davemloft.net
Return-Path:
X-Original-To: patchwork-incoming@bilbo.ozlabs.org
Delivered-To: patchwork-incoming@bilbo.ozlabs.org
Received: from ozlabs.org (ozlabs.org [203.10.76.45])
	(using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits))
	(Client CN "mx.ozlabs.org", Issuer "CA Cert Signing Authority" (verified OK))
	by bilbo.ozlabs.org (Postfix) with ESMTPS id 53E14B707C
	for ; Wed, 6 May 2009 04:50:48 +1000 (EST)
Received: by ozlabs.org (Postfix)
	id 31F5EDDDFA; Wed, 6 May 2009 04:50:48 +1000 (EST)
Delivered-To: patchwork-incoming@ozlabs.org
Received: from vger.kernel.org (vger.kernel.org [209.132.176.167])
	by ozlabs.org (Postfix) with ESMTP id 338FCDDDB6
	for ; Wed, 6 May 2009 04:50:45 +1000 (EST)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1752816AbZEESuj (ORCPT ); Tue, 5 May 2009 14:50:39 -0400
Received: (majordomo@vger.kernel.org) by vger.kernel.org
	id S1752703AbZEESuh (ORCPT ); Tue, 5 May 2009 14:50:37 -0400
Received: from gw1.cosmosbay.com ([212.99.114.194]:36236 "EHLO
	gw1.cosmosbay.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1752345AbZEESuh convert rfc822-to-8bit (ORCPT );
	Tue, 5 May 2009 14:50:37 -0400
Received: from [127.0.0.1] (localhost [127.0.0.1])
	by gw1.cosmosbay.com (8.13.7/8.13.7) with ESMTP id n45IoQWa014972;
	Tue, 5 May 2009 20:50:29 +0200
Message-ID: <4A008A72.6030607@cosmosbay.com>
Date: Tue, 05 May 2009 20:50:26 +0200
From: Eric Dumazet
User-Agent: Thunderbird 2.0.0.21 (Windows/20090302)
MIME-Version: 1.0
To: Vladimir Ivashchenko
CC: netdev@vger.kernel.org
Subject: Re: bond + tc regression ?
References: <1241538358.27647.9.camel@hazard2.francoudi.com>
	<4A0069F3.5030607@cosmosbay.com>
	<20090505174135.GA29716@francoudi.com>
In-Reply-To: <20090505174135.GA29716@francoudi.com>
X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-1.6
	(gw1.cosmosbay.com [0.0.0.0]); Tue, 05 May 2009 20:50:30 +0200 (CEST)
Sender: netdev-owner@vger.kernel.org
Precedence: bulk
List-ID:
X-Mailing-List: netdev@vger.kernel.org

Vladimir Ivashchenko wrote:
>>> On both kernels, the system is running with at least 70% idle CPU.
>>> The network interrupts are distributed across the cores.
>> You should not distribute interrupts, but bind a NIC to one CPU
>
> Kernels 2.6.28 and 2.6.29 do this by default, so I thought it's correct.
> The defaults are wrong?

Yes they are, at least for forwarding setups.

>
> I have tried with IRQs bound to one CPU per NIC. Same result.

Did you check with "grep eth /proc/interrupts" that your affinity
settings were indeed taken into account ?

You should use the same CPU for eth0 and eth2 (bond0),
and another CPU for eth1 and eth3 (bond1).

Check how your CPUs are laid out:

egrep 'physical id|core id|processor' /proc/cpuinfo

You may have to experiment to find the best combination.

If you use 2.6.29, apply the following patch to get better system
accounting, so that you can check whether your CPUs are saturated by
hard/soft IRQs.

>
>>> I thought it was an e1000e driver issue, but tweaking e1000e ring buffers
>>> didn't help. I tried using e1000 on 2.6.28 by adding necessary PCI IDs,
>>> I tried running on a different server with bnx cards, I tried disabling
>>> NO_HZ and HRTICK, but still I have the same problem.
>>>
>>> However, if I don't utilize bond, but just apply rules on normal ethX
>>> interfaces, there is no packet loss with 2.6.28/29.
>>>
>>> So, the problem appears only when I use 2.6.28/29 + bond + classful tc
>>> combination.
>>>
>>> Any ideas ?
>>>
>> Yes, we need much more information :)
>> Is it a forwarding setup only ?
>
> Yes, the server is doing nothing else but forwarding, no iptables.
>
>> cat /proc/interrupts
>
>            CPU0       CPU1       CPU2       CPU3       CPU4       CPU5       CPU6       CPU7
>   0:        130          0          0          0          0          0          0          0  IO-APIC-edge timer
>   1:          2          0          0          0          0          0          0          0  IO-APIC-edge i8042
>   3:          0          0          0          1          0          1          0          0  IO-APIC-edge
>   4:          0          0          1          0          0          0          1          0  IO-APIC-edge
>   9:          0          0          0          0          0          0          0          0  IO-APIC-fasteoi acpi
>  12:          4          0          0          0          0          0          0          0  IO-APIC-edge i8042
>  14:          0          0          0          0          0          0          0          0  IO-APIC-edge ata_piix
>  15:          0          0          0          0          0          0          0          0  IO-APIC-edge ata_piix
>  17:      30901      31910      31446      30655      31618      30550      31543      30958  IO-APIC-fasteoi aacraid
>  20:          0          0          0          0          0          0          0          0  IO-APIC-fasteoi uhci_hcd:usb4
>  21:          0          0          0          0          0          0          0          0  IO-APIC-fasteoi uhci_hcd:usb5, ahci
>  22:     298387     297642     295508     294368     295533     295430     295275     296036  IO-APIC-fasteoi ehci_hcd:usb1, uhci_hcd:usb2
>  23:      10868      10926      10980      10738      10939      10615      10761      10909  IO-APIC-fasteoi uhci_hcd:usb3
>  57: 1486251823 1486835830 1486677250 1487105983 1488000303 1485941815 1487728317 1486624997  PCI-MSI-edge eth0
>  58: 1510676329 1509708161 1510347202 1509969755 1508599471 1511220118 1509094578 1509727616  PCI-MSI-edge eth1
>  59: 1482578890 1483618556 1482963700 1483164528 1484561615 1482130645 1484116749 1483557717  PCI-MSI-edge eth2
>  60: 1507341647 1506685822 1506862759 1506612818 1505689367 1507559672 1505911622 1506940613  PCI-MSI-edge eth3
> NMI:          0          0          0          0          0          0          0          0  Non-maskable interrupts
> LOC: 1020533656 1020535165 1020533613 1020534967 1020535173 1020534409 1020534985 1020534220  Local timer interrupts
> RES:      18605      21215      15957      18637      22429      19493      16649      15589  Rescheduling interrupts
> CAL:        160        214        186        185        199        205        190        180  Function call interrupts
> TLB:     259515     264126     309016     312222     263163     265601     306189     305430  TLB shootdowns
> TRM:          0          0          0          0          0          0          0          0  Thermal event interrupts
> SPU:          0          0          0          0          0          0          0          0  Spurious interrupts
> ERR:          0
> MIS:          0
>
>> tc -s -d qdisc
>
> For test sake, I just put "tc qdisc add dev $IFACE root handle 1: prio" and no filters at all.
> I get the same with HTB "tc qdisc add dev $IFACE root handle 1: htb default 99" and no subclasses.
>
> qdisc pfifo_fast 0: dev eth0 root bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
>  Sent 13287736273644 bytes 1263672018 pkt (dropped 0, overlimits 0 requeues 2928480094)
>  rate 0bit 0pps backlog 0b 0p requeues 2928480094
> qdisc pfifo_fast 0: dev eth1 root bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
>  Sent 40064376195000 bytes 1747026586 pkt (dropped 0, overlimits 0 requeues 463621814)
>  rate 0bit 0pps backlog 0b 0p requeues 463621814
> qdisc pfifo_fast 0: dev eth2 root bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
>  Sent 13350145517965 bytes 1350897201 pkt (dropped 0, overlimits 0 requeues 2930879507)
>  rate 0bit 0pps backlog 0b 0p requeues 2930879507
> qdisc pfifo_fast 0: dev eth3 root bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
>  Sent 40193456126884 bytes 1950653764 pkt (dropped 0, overlimits 0 requeues 465511120)
>  rate 0bit 0pps backlog 0b 0p requeues 465511120
> qdisc prio 1: dev bond0 root bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
>  Sent 985164834 bytes 2720991 pkt (dropped 241834, overlimits 0 requeues 0)
>  rate 0bit 0pps backlog 0b 0p requeues 0
> qdisc prio 1: dev bond1 root bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
>  Sent 2347118738 bytes 3089171 pkt (dropped 304601, overlimits 0 requeues 0)
>  rate 0bit 0pps backlog 0b 0p requeues 0
>
> ** Drops on bond0/bond1 are increasing by approximately 5000 per second:
>
> qdisc pfifo_fast 0: dev eth0 root bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
>  Sent 13287874353796 bytes 1264050808 pkt (dropped 0, overlimits 0 requeues 2928520779)
>  rate 0bit 0pps backlog 0b 0p requeues 2928520779
> qdisc pfifo_fast 0: dev eth1 root bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
>  Sent 40064706826018 bytes 1747459793 pkt (dropped 0, overlimits 0 requeues 463669610)
>  rate 0bit 0pps backlog 0b 0p requeues 463669610
> qdisc pfifo_fast 0: dev eth2 root bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
>  Sent 13350283202695 bytes 1351277761 pkt (dropped 0, overlimits 0 requeues 2930918488)
>  rate 0bit 0pps backlog 0b 0p requeues 2930918488
> qdisc pfifo_fast 0: dev eth3 root bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
>  Sent 40193784868074 bytes 1951084029 pkt (dropped 0, overlimits 0 requeues 465558015)
>  rate 0bit 0pps backlog 0b 0p requeues 465558015
> qdisc prio 1: dev bond0 root bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
>  Sent 1260929539 bytes 3480340 pkt (dropped 311145, overlimits 0 requeues 0)
>  rate 0bit 0pps backlog 0b 0p requeues 0
> qdisc prio 1: dev bond1 root bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
>  Sent 3006490946 bytes 3952643 pkt (dropped 396850, overlimits 0 requeues 0)
>  rate 0bit 0pps backlog 0b 0p requeues 0
>
> With same setup on 2.6.23, drops are increasing only by 50/sec or so.
>
> As soon as I do "tc qdisc del dev $IFACE root", packet loss stops.
>
>> cat /proc/net/bonding/bond0
>
> Ethernet Channel Bonding Driver: v3.5.0 (November 4, 2008)
>
> Bonding Mode: IEEE 802.3ad Dynamic link aggregation
> Transmit Hash Policy: layer3+4 (1)
> MII Status: up
> MII Polling Interval (ms): 80
> Up Delay (ms): 0
> Down Delay (ms): 0
>
> 802.3ad info
> LACP rate: slow
> Aggregator selection policy (ad_select): stable
> Active Aggregator Info:
>         Aggregator ID: 1
>         Number of ports: 2
>         Actor Key: 17
>         Partner Key: 4
>         Partner Mac Address: 00:19:e7:b2:07:80
>
> Slave Interface: eth0
> MII Status: up
> Link Failure Count: 1
> Permanent HW addr: 00:1b:24:bd:e9:cc
> Aggregator ID: 1
>
> Slave Interface: eth2
> MII Status: up
> Link Failure Count: 1
> Permanent HW addr: 00:1b:24:bd:e9:ce
> Aggregator ID: 1
>
>> cat /proc/net/bonding/bond1
>
> Ethernet Channel Bonding Driver: v3.5.0 (November 4, 2008)
>
> Bonding Mode: IEEE 802.3ad Dynamic link aggregation
> Transmit Hash Policy: layer3+4 (1)
> MII Status: up
> MII Polling Interval (ms): 80
> Up Delay (ms): 0
> Down Delay (ms): 0
>
> 802.3ad info
> LACP rate: slow
> Aggregator selection policy (ad_select): stable
> Active Aggregator Info:
>         Aggregator ID: 2
>         Number of ports: 2
>         Actor Key: 17
>         Partner Key: 5
>         Partner Mac Address: 00:19:e7:b2:07:80
>
> Slave Interface: eth1
> MII Status: up
> Link Failure Count: 1
> Permanent HW addr: 00:1b:24:bd:e9:cd
> Aggregator ID: 2
>
> Slave Interface: eth3
> MII Status: up
> Link Failure Count: 2
> Permanent HW addr: 00:1b:24:bd:e9:cf
> Aggregator ID: 2
>
>
>> mpstat -P ALL 10
>
> 08:04:36 PM  CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
> 08:04:46 PM  all    0.00    0.00    0.01    0.00    0.00    1.05    0.00   98.94  70525.73
> 08:04:46 PM    0    0.00    0.00    0.00    0.00    0.00    0.70    0.00   99.30   7814.41
> 08:04:46 PM    1    0.00    0.00    0.00    0.00    0.00    2.10    0.00   97.90   7814.41
> 08:04:46 PM    2    0.00    0.00    0.00    0.00    0.00    0.20    0.00   99.80   7814.41
> 08:04:46 PM    3    0.00    0.00    0.10    0.00    0.00    1.30    0.00   98.60   7814.51
> 08:04:46 PM    4    0.00    0.00    0.00    0.00    0.00    0.50    0.00   99.50   7814.41
> 08:04:46 PM    5    0.00    0.00    0.00    0.00    0.00    1.90    0.00   98.10   7814.41
> 08:04:46 PM    6    0.00    0.00    0.00    0.00    0.00    0.60    0.00   99.40   7814.41
> 08:04:46 PM    7    0.00    0.00    0.10    0.00    0.00    0.90    0.00   99.00   7814.51
> 08:04:46 PM    8    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00      0.00
>
> 08:04:46 PM  CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
> 08:04:56 PM  all    0.00    0.00    0.01    0.00    0.00    1.49    0.00   98.50  66429.30
> 08:04:56 PM    0    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00   7303.50
> 08:04:56 PM    1    0.00    0.00    0.00    0.00    0.00    1.60    0.00   98.40   7303.50
> 08:04:56 PM    2    0.00    0.00    0.00    0.00    0.00    1.20    0.00   98.80   7303.50
> 08:04:56 PM    3    0.00    0.00    0.00    0.00    0.00    3.20    0.00   96.80   7303.40
> 08:04:56 PM    4    0.00    0.00    0.00    0.00    0.00    1.90    0.00   98.10   7303.60
> 08:04:56 PM    5    0.00    0.00    0.00    0.00    0.00    1.20    0.00   98.80   7303.50
> 08:04:56 PM    6    0.00    0.00    0.10    0.00    0.00    1.80    0.00   98.10   7303.50
> 08:04:56 PM    7    0.00    0.00    0.00    0.00    0.00    1.20    0.00   98.80   7303.50
> 08:04:56 PM    8    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00      0.00
>
>> ifconfig -a
>
> bond0     Link encap:Ethernet  HWaddr 00:1B:24:BD:E9:CC
>           inet addr:xxx.xxx.135.44  Bcast:xxx.xxx.135.47  Mask:255.255.255.248
>           inet6 addr: fe80::21b:24ff:febd:e9cc/64 Scope:Link
>           UP BROADCAST RUNNING MASTER MULTICAST  MTU:1500  Metric:1
>           RX packets:436076190 errors:0 dropped:391250 overruns:0 frame:0
>           TX packets:2620156321 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:0
>           RX bytes:4210046233 (3.9 GiB)  TX bytes:2520272242 (2.3 GiB)
>
> bond1     Link encap:Ethernet  HWaddr 00:1B:24:BD:E9:CD
>           inet addr:xxx.xxx.70.156  Bcast:xxx.xxx.70.159  Mask:255.255.255.248
>           inet6 addr: fe80::21b:24ff:febd:e9cd/64 Scope:Link
>           UP BROADCAST RUNNING MASTER MULTICAST  MTU:1500  Metric:1
>           RX packets:239471641 errors:0 dropped:344 overruns:0 frame:0
>           TX packets:3704083902 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:0
>           RX bytes:2488754745 (2.3 GiB)  TX bytes:2685275089 (2.5 GiB)
>
> eth0      Link encap:Ethernet  HWaddr 00:1B:24:BD:E9:CC
>           UP BROADCAST RUNNING SLAVE MULTICAST  MTU:1500  Metric:1
>           RX packets:2235085582 errors:0 dropped:353786 overruns:0 frame:0
>           TX packets:1266449269 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:1000
>           RX bytes:3768096439 (3.5 GiB)  TX bytes:113363829 (108.1 MiB)
>           Memory:fc6e0000-fc700000
>
> eth1      Link encap:Ethernet  HWaddr 00:1B:24:BD:E9:CD
>           UP BROADCAST RUNNING SLAVE MULTICAST  MTU:1500  Metric:1
>           RX packets:4228974804 errors:0 dropped:344 overruns:0 frame:0
>           TX packets:1750216649 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:1000
>           RX bytes:3350270261 (3.1 GiB)  TX bytes:3358220645 (3.1 GiB)
>           Memory:fc6c0000-fc6e0000
>
> eth2      Link encap:Ethernet  HWaddr 00:1B:24:BD:E9:CC
>           UP BROADCAST RUNNING SLAVE MULTICAST  MTU:1500  Metric:1
>           RX packets:2495958020 errors:0 dropped:37464 overruns:0 frame:0
>           TX packets:1353707165 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:1000
>           RX bytes:442055526 (421.5 MiB)  TX bytes:2406943933 (2.2 GiB)
>           Memory:fcde0000-fce00000
>
> eth3      Link encap:Ethernet  HWaddr 00:1B:24:BD:E9:CD
>           UP BROADCAST RUNNING SLAVE MULTICAST  MTU:1500  Metric:1
>           RX packets:305464222 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:1953867360 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:1000
>           RX bytes:3433479245 (3.1 GiB)  TX bytes:3622113909 (3.3 GiB)
>           Memory:fcd80000-fcda0000
>
> lo        Link encap:Local Loopback
>           inet addr:127.0.0.1  Mask:255.0.0.0
>           inet6 addr: ::1/128 Scope:Host
>           UP LOOPBACK RUNNING  MTU:16436  Metric:1
>           RX packets:53537 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:53537 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:0
>           RX bytes:431006433 (411.0 MiB)  TX bytes:431006433 (411.0 MiB)
>
>
> NOTE: ifconfig drops on bond0/bond1 are *NOT* increasing. These drops are there from before.
>

---
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

--- linux-2.6.29/kernel/sched.c.orig	2009-05-05 20:46:49.000000000 +0200
+++ linux-2.6.29/kernel/sched.c	2009-05-05 20:47:19.000000000 +0200
@@ -4290,7 +4290,7 @@
 
 	if (user_tick)
 		account_user_time(p, one_jiffy, one_jiffy_scaled);
-	else if (p != rq->idle)
+	else if ((p != rq->idle) || (irq_count() != HARDIRQ_OFFSET))
 		account_system_time(p, HARDIRQ_OFFSET, one_jiffy,
 				    one_jiffy_scaled);
 	else
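
For the affinity advice earlier in this mail (same CPU for eth0 and eth2, another CPU for eth1 and eth3), here is a rough sketch of how the pinning could be scripted. The IRQ numbers 57 to 60 are taken from the /proc/interrupts output quoted above; picking CPU0 for bond0 and CPU1 for bond1 is only an assumption, and the function names (cpu_mask, pin_cmd, affinity_plan) are made up for illustration. The script only prints the commands, it does not write anything to /proc.

```shell
#!/bin/sh
# Sketch only. IRQ numbers come from the /proc/interrupts dump in this
# thread; the CPU choice (0 for bond0, 1 for bond1) is an assumption.

# cpu_mask N: print the hex smp_affinity bitmask selecting only CPU N.
cpu_mask() {
    printf '%x\n' $((1 << $1))
}

# pin_cmd IRQ CPU: print (do not run) the command that pins IRQ to CPU.
pin_cmd() {
    echo "echo $(cpu_mask "$2") > /proc/irq/$1/smp_affinity"
}

# affinity_plan: one command per NIC interrupt, grouped by bond.
affinity_plan() {
    pin_cmd 57 0   # eth0 (bond0) -> CPU0
    pin_cmd 59 0   # eth2 (bond0) -> CPU0
    pin_cmd 58 1   # eth1 (bond1) -> CPU1
    pin_cmd 60 1   # eth3 (bond1) -> CPU1
}

# Dry run: show what would be written.
affinity_plan
```

To actually apply it, the printed lines would have to be run as root (for example "affinity_plan | sh"), and "grep eth /proc/interrupts" should then show the counters growing on one CPU per NIC only. If a daemon such as irqbalance is running, it may rewrite these masks, so check that first.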