Message ID:   49A8FAFF.7060104@cosmosbay.com
State:        RFC, archived
Delegated to: David Miller
Eric Dumazet a écrit :
> Kenny Chang a écrit :
>> It's been a while since I updated this thread. We've been running
>> through the different suggestions and tabulating their effects, as
>> well as trying out an Intel card. The short story is that setting
>> affinity and MSI works to some extent, and the Intel card doesn't
>> seem to change things significantly. The results don't seem
>> consistent enough for us to be able to point to a smoking gun.
>>
>> It does look like the 2.6.29-rc4 kernel performs okay with the Intel
>> card, but this is not a real-time build and it's not likely to be in
>> a supported Ubuntu distribution real soon. We've reached the point
>> where we'd like to look for an expert dedicated to work on this
>> problem for a period of time, the final result being some sort of
>> solution that produces a realtime configuration with a reasonably
>> "aged" kernel (.24-.28) whose multicast performance is greater than
>> or equal to that of 2.6.15.
>>
>> If anybody is interested in devoting some compensated time to this
>> issue, we're offering up a bounty:
>> http://www.athenacr.com/bounties/multicast-performance/
>>
>> For completeness, here's the table of our experiment results:
>>
>> ====================  ===============  ============  ==============  ==============  ==============  ==============
>> Kernel                flavor           IRQ affinity  *4x mcasttest*  *5x mcasttest*  *6x mcasttest*  *Mtools2* [4]_
>> ====================  ===============  ============  ==============  ==============  ==============  ==============
>> Intel e1000e
>> --------------------  ---------------  ------------  --------------  --------------  --------------  --------------
>> 2.6.24.19             rt               any           OK              Maybe           X
>> 2.6.24.19             rt               CPU0          OK              OK              X
>> 2.6.24.19             generic          any           X
>> 2.6.24.19             generic          CPU0          OK
>> 2.6.29-rc3            vanilla-server   any           X
>> 2.6.29-rc3            vanilla-server   CPU0          OK
>> 2.6.29-rc4            vanilla-generic  any           X                                               OK
>> 2.6.29-rc4            vanilla-generic  CPU0          OK              OK              OK [5]_         OK
>> --------------------  ---------------  ------------  --------------  --------------  --------------  --------------
>> Broadcom BNX2
>> --------------------  ---------------  ------------  --------------  --------------  --------------  --------------
>> 2.6.24-19             rt               MSI any       OK              OK              X
>> 2.6.24-19             rt               MSI CPU0      OK              Maybe           X
>> 2.6.24-19             rt               APIC any      OK              OK              X
>> 2.6.24-19             rt               APIC CPU0     OK              Maybe           X
>> 2.6.24-19-bnx-latest  rt               APIC CPU0     OK              X
>> 2.6.24-19             server           MSI any       X
>> 2.6.24-19             server           MSI CPU0      OK
>> 2.6.24-19             generic          APIC any      X
>> 2.6.24-19             generic          APIC CPU0     OK
>> 2.6.27-11             generic          APIC any      X
>> 2.6.27-11             generic          APIC CPU0     OK              10% drop
>> 2.6.28-8              generic          APIC any      OK              X
>> 2.6.28-8              generic          APIC CPU0     OK              OK              0.5% drop
>> 2.6.29-rc3            vanilla-server   MSI any       X
>> 2.6.29-rc3            vanilla-server   MSI CPU0      X
>> 2.6.29-rc3            vanilla-server   APIC any      OK              X
>> 2.6.29-rc3            vanilla-server   APIC CPU0     OK              OK
>> 2.6.29-rc4            vanilla-generic  APIC any      X
>> 2.6.29-rc4            vanilla-generic  APIC CPU0     OK              3% drop         10% drop        X
>> ====================  ===============  ============  ==============  ==============  ==============  ==============
>>
>> * [4] MTools2 is a test from 29West: http://www.29west.com/docs/TestNet/
>> * [5] In 5 trials, 1 trial dropped 2% and 4 trials dropped nothing.
>>
>> Kenny
>
> Hi Kenny
>
> I am investigating how to reduce contention (and schedule() calls) on
> this workload.
> I bound the NIC (gigabit BNX2) irq to cpu 0, so that oprofile results
> on this cpu can show us where ksoftirqd is spending its time.

We can see the scheduler at work :)

Also, one thing to note is __copy_skb_header() : 9.49 % of cpu0 time.
The problem comes from dst_clone() (6.05 % total, so 2/3 of
__copy_skb_header()), touching a highly contended cache line (the
other cpus are doing the decrement of the dst refcounter).

CPU: Core 2, speed 3000.05 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a
unit mask of 0x00 (Unhalted core cycles) count 100000

Samples on CPU 0 (samples for other cpus 1..7 omitted)

samples  cum. samples  %       cum. %   symbol name
23750     23750        9.8159   9.8159  try_to_wake_up
22972     46722        9.4944  19.3103  __copy_skb_header
20217     66939        8.3557  27.6660  enqueue_task_fair
14565     81504        6.0197  33.6857  sock_def_readable
13454     94958        5.5606  39.2463  task_rq_lock
13381    108339        5.5304  44.7767  resched_task
13090    121429        5.4101  50.1868  udp_queue_rcv_skb
11441    132870        4.7286  54.9154  skb_queue_tail
10109    142979        4.1781  59.0935  sock_queue_rcv_skb
10024    153003        4.1429  63.2364  __wake_up_sync
 9952    162955        4.1132  67.3496  update_curr
 8761    171716        3.6209  70.9705  sched_clock_cpu
 7414    179130        3.0642  74.0347  rb_insert_color
 7381    186511        3.0506  77.0853  select_task_rq_fair
 6749    193260        2.7894  79.8747  __slab_alloc
 5881    199141        2.4306  82.3053  __wake_up_common
 5432    204573        2.2451  84.5504  __skb_clone
 4306    208879        1.7797  86.3300  kmem_cache_alloc
 3524    212403        1.4565  87.7865  place_entity
 2783    215186        1.1502  88.9367  skb_clone
 2576    217762        1.0647  90.0014  __udp4_lib_rcv
 2430    220192        1.0043  91.0057  bnx2_poll_work
 2184    222376        0.9027  91.9084  ipt_do_table
 2090    224466        0.8638  92.7722  ip_route_input
 1877    226343        0.7758  93.5479  __alloc_skb
 1495    227838        0.6179  94.1658  native_sched_clock
 1166    229004        0.4819  94.6477  __update_sched_clock
 1083    230087        0.4476  95.0953  netif_receive_skb
 1062    231149        0.4389  95.5343  activate_task
  644    231793        0.2662  95.8004  __kmalloc_track_caller
  638    232431        0.2637  96.0641  nf_iterate
  549    232980        0.2269  96.2910  skb_put
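The dst refcount bouncing described above is easy to reproduce in user
space. Here is a minimal sketch (not from the thread; thread counts and
sizes are arbitrary): all threads hammering one shared counter shows the
cache-line ping-pong cost, while per-thread padded counters avoid it.

/*
 * Cache-line ping-pong demo (editor's sketch, not kernel code).
 * Build: gcc -O2 -pthread pingpong.c -o pingpong
 * Compare: time ./pingpong        (shared counter, line bounces)
 *          time ./pingpong percpu (padded per-thread counters)
 */
#include <pthread.h>
#include <stdlib.h>

#define NTHREADS 8
#define ITERS    10000000L

/* one refcount shared by every thread: the contended case */
static int shared_ref __attribute__((aligned(64)));

/* one counter per thread, each padded out to its own cache line */
static struct { int v; char pad[60]; } percpu_ref[NTHREADS]
	__attribute__((aligned(64)));

static void *hammer_shared(void *arg)
{
	long i;

	(void)arg;
	for (i = 0; i < ITERS; i++) {
		__sync_add_and_fetch(&shared_ref, 1);	/* dst_clone()   */
		__sync_sub_and_fetch(&shared_ref, 1);	/* dst_release() */
	}
	return NULL;
}

static void *hammer_percpu(void *arg)
{
	int id = (int)(long)arg;
	long i;

	for (i = 0; i < ITERS; i++) {
		__sync_add_and_fetch(&percpu_ref[id].v, 1);
		__sync_sub_and_fetch(&percpu_ref[id].v, 1);
	}
	return NULL;
}

int main(int argc, char **argv)
{
	pthread_t th[NTHREADS];
	void *(*fn)(void *) = (argc > 1) ? hammer_percpu : hammer_shared;
	int i;

	for (i = 0; i < NTHREADS; i++)
		pthread_create(&th[i], NULL, fn, (void *)(long)i);
	for (i = 0; i < NTHREADS; i++)
		pthread_join(th[i], NULL);
	return 0;
}

On a multi-core box the shared-counter run is typically several times
slower, which is the same effect the profile attributes to dst_clone().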
From: Eric Dumazet <dada1@cosmosbay.com>
Date: Sat, 28 Feb 2009 09:51:11 +0100

> David, this is a preliminary work, not meant for inclusion as is,
> comments are welcome.
>
> [PATCH] net: sk_forward_alloc becomes an atomic_t
>
> Commit 95766fff6b9a78d11fc2d3812dd035381690b55d
> (UDP: Add memory accounting) introduced a regression for high rate
> UDP flows, because of the extra lock_sock() in udp_recvmsg().
>
> In order to reduce the need for lock_sock() in the UDP receive path,
> we might need to declare sk_forward_alloc as an atomic_t.
>
> udp_recvmsg() can then avoid a lock_sock()/release_sock() pair.
>
> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>

This adds new overhead for TCP, which has to hold the socket
lock for other reasons in these paths.

I don't get how an atomic_t operation is cheaper than a
lock_sock/release_sock. Is it the case that in many
executions of these paths only atomic_read()'s are necessary?

I actually think this scheme is racy. There is a reason we
have to hold the socket lock when doing memory scheduling.
Two threads can get in there and say "hey, I have enough space
already" even though only enough space is allocated for one
of their requests.

What did I miss? :)
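The race David describes can be shown in a few lines of user space code
mirroring the patch's lockless fast path (editor's sketch; the 4096-byte
"budget" stands in for one SK_MEM_QUANTUM of sk_forward_alloc):

/*
 * Check-then-charge without a lock: two threads both pass the test.
 * Build: gcc -O2 -pthread race.c -o race
 */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static int forward_alloc = 4096;	/* one quantum of budget      */
static int winners;			/* threads that "got" memory  */

static void *claim(void *arg)
{
	int size = 4096;

	(void)arg;
	/* the racy part: read the counter ... */
	if (size <= forward_alloc) {
		usleep(1000);		/* widen the race window */
		/* ... then charge it, believing the budget was ours */
		__sync_sub_and_fetch(&forward_alloc, size);
		__sync_add_and_fetch(&winners, 1);
	}
	return NULL;
}

int main(void)
{
	pthread_t a, b;

	pthread_create(&a, NULL, claim, NULL);
	pthread_create(&b, NULL, claim, NULL);
	pthread_join(a, NULL);
	pthread_join(b, NULL);
	/* typically prints: winners=2 forward_alloc=-4096 */
	printf("winners=%d forward_alloc=%d\n", winners, forward_alloc);
	return 0;
}

Both threads are told the quantum is theirs even though only one was
ever scheduled; under lock_sock() the second checker would have had to
wait and re-test.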
David Miller a écrit :
> From: Eric Dumazet <dada1@cosmosbay.com>
> Date: Sat, 28 Feb 2009 09:51:11 +0100
>
>> David, this is a preliminary work, not meant for inclusion as is,
>> comments are welcome.
>>
>> [PATCH] net: sk_forward_alloc becomes an atomic_t
>>
>> Commit 95766fff6b9a78d11fc2d3812dd035381690b55d
>> (UDP: Add memory accounting) introduced a regression for high rate
>> UDP flows, because of the extra lock_sock() in udp_recvmsg().
>>
>> In order to reduce the need for lock_sock() in the UDP receive path,
>> we might need to declare sk_forward_alloc as an atomic_t.
>>
>> udp_recvmsg() can then avoid a lock_sock()/release_sock() pair.
>>
>> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
>
> This adds new overhead for TCP, which has to hold the socket
> lock for other reasons in these paths.
>
> I don't get how an atomic_t operation is cheaper than a
> lock_sock/release_sock. Is it the case that in many
> executions of these paths only atomic_read()'s are necessary?
>
> I actually think this scheme is racy. There is a reason we
> have to hold the socket lock when doing memory scheduling.
> Two threads can get in there and say "hey, I have enough space
> already" even though only enough space is allocated for one
> of their requests.
>
> What did I miss? :)

I believe you are right, and in fact I was about to post a "don't look
at this patch" note, since it doesn't help multicast reception at all:
I redid the tests more carefully and got nothing but noise.

We have a cache line ping-pong mess here, and need more thinking.

I rewrote Kenny's program to use non-blocking sockets.

Receivers are doing:

	int delay = 50;

	fcntl(s, F_SETFL, O_NDELAY);	/* non-blocking reads */
	while (1) {
		struct sockaddr_in from;
		socklen_t fromlen = sizeof(from);

		res = recvfrom(s, buf, 1000, 0,
			       (struct sockaddr *)&from, &fromlen);
		if (res == -1) {	/* queue empty: back off a little */
			delay++;
			usleep(delay);
			continue;
		}
		if (delay > 40)		/* traffic: poll a little faster */
			delay--;
		++npackets;
	}

With this little user space change and 8 receivers on my dual quad
core, ksoftirqd only takes 8% of one cpu and there are no drops at all
(instead of 100% cpu and 30% drops).

So this is definitely a problem mixing scheduler cache line ping-pongs
with network stack cache line ping-pongs.

We could reorder fields so that fewer cache lines are touched by the
softirq processing; I tried this but still got packet drops.
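For anyone who wants to reproduce this, here is the loop above fleshed
out into a complete receiver. The multicast group, port, and socket
setup are editor's assumptions (the thread's mcasttest parameters are
not given); only the adaptive-delay loop is from the message.

/*
 * Polling multicast receiver (editor's reconstruction; error checks
 * omitted for brevity).  Build: gcc -O2 -o mcrecv mcrecv.c
 */
#include <arpa/inet.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
	struct sockaddr_in addr;
	struct ip_mreq mreq;
	char buf[1500];
	long npackets = 0;
	int delay = 50;
	int s = socket(AF_INET, SOCK_DGRAM, 0);

	memset(&addr, 0, sizeof(addr));
	addr.sin_family = AF_INET;
	addr.sin_addr.s_addr = htonl(INADDR_ANY);
	addr.sin_port = htons(12345);			/* assumed port  */
	bind(s, (struct sockaddr *)&addr, sizeof(addr));

	mreq.imr_multiaddr.s_addr = inet_addr("239.0.0.1"); /* assumed group */
	mreq.imr_interface.s_addr = htonl(INADDR_ANY);
	setsockopt(s, IPPROTO_IP, IP_ADD_MEMBERSHIP, &mreq, sizeof(mreq));

	fcntl(s, F_SETFL, O_NONBLOCK);		/* O_NDELAY in the original */
	while (1) {
		struct sockaddr_in from;
		socklen_t fromlen = sizeof(from);
		ssize_t res = recvfrom(s, buf, 1000, 0,
				       (struct sockaddr *)&from, &fromlen);

		if (res == -1) {	/* queue empty: back off a little */
			delay++;
			usleep(delay);
			continue;
		}
		if (delay > 40)		/* traffic: poll a little faster */
			delay--;
		++npackets;		/* report however you like */
	}
	return 0;
}

The point of the design is that a sleeping poller never enters the
try_to_wake_up()/enqueue_task_fair() path that dominated the cpu0
profile, which is why ksoftirqd load collapses.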
Eric Dumazet a écrit :
> David Miller a écrit :
>> From: Eric Dumazet <dada1@cosmosbay.com>
>> Date: Sat, 28 Feb 2009 09:51:11 +0100
>>
>>> David, this is a preliminary work, not meant for inclusion as is,
>>> comments are welcome.
>>>
>>> [PATCH] net: sk_forward_alloc becomes an atomic_t
>>>
>>> Commit 95766fff6b9a78d11fc2d3812dd035381690b55d
>>> (UDP: Add memory accounting) introduced a regression for high rate
>>> UDP flows, because of the extra lock_sock() in udp_recvmsg().
>>>
>>> In order to reduce the need for lock_sock() in the UDP receive path,
>>> we might need to declare sk_forward_alloc as an atomic_t.
>>>
>>> udp_recvmsg() can then avoid a lock_sock()/release_sock() pair.
>>>
>>> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
>>
>> This adds new overhead for TCP, which has to hold the socket
>> lock for other reasons in these paths.
>>
>> I don't get how an atomic_t operation is cheaper than a
>> lock_sock/release_sock. Is it the case that in many
>> executions of these paths only atomic_read()'s are necessary?
>>
>> I actually think this scheme is racy. There is a reason we
>> have to hold the socket lock when doing memory scheduling.
>> Two threads can get in there and say "hey, I have enough space
>> already" even though only enough space is allocated for one
>> of their requests.
>>
>> What did I miss? :)
>
> I believe you are right, and in fact I was about to post a "don't look
> at this patch" note, since it doesn't help multicast reception at all:
> I redid the tests more carefully and got nothing but noise.
>
> We have a cache line ping-pong mess here, and need more thinking.
>
> I rewrote Kenny's program to use non-blocking sockets.
>
> Receivers are doing:
>
> 	int delay = 50;
>
> 	fcntl(s, F_SETFL, O_NDELAY);	/* non-blocking reads */
> 	while (1) {
> 		struct sockaddr_in from;
> 		socklen_t fromlen = sizeof(from);
>
> 		res = recvfrom(s, buf, 1000, 0,
> 			       (struct sockaddr *)&from, &fromlen);
> 		if (res == -1) {	/* queue empty: back off a little */
> 			delay++;
> 			usleep(delay);
> 			continue;
> 		}
> 		if (delay > 40)		/* traffic: poll a little faster */
> 			delay--;
> 		++npackets;
> 	}
>
> With this little user space change and 8 receivers on my dual quad
> core, ksoftirqd only takes 8% of one cpu and there are no drops at all
> (instead of 100% cpu and 30% drops).
>
> So this is definitely a problem mixing scheduler cache line ping-pongs
> with network stack cache line ping-pongs.
>
> We could reorder fields so that fewer cache lines are touched by the
> softirq processing; I tried this but still got packet drops.

I have more questions:

What is the maximum latency you can afford on the delivery of the
packet(s)?

Are the user apps using real-time scheduling?

I had an idea: keep the cpu handling NIC interrupts only delivering
packets to socket queues, without messing with the scheduler. Do the
fast queueing there, then wake up a workqueue (on another cpu) to
perform the scheduler work. But that means some extra latency (on the
order of 2 or 3 us, I guess); a rough sketch follows.

We could enter this mode automatically if the NIC rx handler *sees*
more than N packets waiting in the NIC queue: in case of moderate or
light traffic, no extra latency would be necessary. This would mean
some changes in the NIC driver.

Hum. Then again, if the NIC rx handler is being run from ksoftirqd, we
already know we are in a stress situation, so maybe no driver changes
are necessary: just test whether we are running in ksoftirqd...
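To make the idea above concrete, here is a rough, untested sketch of
what the deferral could look like. This is not a working patch: the
names deferred_wakeup, wakeup_work, wakeup_fn and WAKEUP_CPU are
invented for illustration, and the per-socket bookkeeping (remembering
which sockets grew a queue during the rx loop) is left out entirely.

/*
 * Sketch only: rx cpu queues skbs, another cpu does the wakeups.
 * schedule_work_on() is real kernel API (2.6.27+); everything else
 * here is hypothetical.
 */
#include <linux/workqueue.h>
#include <net/sock.h>

#define WAKEUP_CPU 1		/* any cpu but the one taking NIC irqs */

static void wakeup_fn(struct work_struct *work);
static DECLARE_WORK(wakeup_work, wakeup_fn);

/* runs on WAKEUP_CPU: pays the try_to_wake_up()/enqueue cost there */
static void wakeup_fn(struct work_struct *work)
{
	/* for each socket remembered by the rx loop:
	 *	sk->sk_data_ready(sk, 0);
	 */
}

/* called from the rx path instead of waking the reader directly */
static void deferred_wakeup(struct sock *sk)
{
	/* remember sk somewhere (e.g. a per-cpu list), then kick the
	 * helper on another cpu */
	schedule_work_on(WAKEUP_CPU, &wakeup_work);
}

The trade-off is exactly the one stated above: the wakeup burns cycles
on a cpu that is not saturated by NIC interrupts, at the price of a few
microseconds of added delivery latency.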
diff --git a/include/net/sock.h b/include/net/sock.h
index 4bb1ff9..c4befb9 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -250,7 +250,7 @@ struct sock {
 	struct sk_buff_head	sk_async_wait_queue;
 #endif
 	int			sk_wmem_queued;
-	int			sk_forward_alloc;
+	atomic_t		sk_forward_alloc;
 	gfp_t			sk_allocation;
 	int			sk_route_caps;
 	int			sk_gso_type;
@@ -823,7 +823,7 @@ static inline int sk_wmem_schedule(struct sock *sk, int size)
 {
 	if (!sk_has_account(sk))
 		return 1;
-	return size <= sk->sk_forward_alloc ||
+	return size <= atomic_read(&sk->sk_forward_alloc) ||
 		__sk_mem_schedule(sk, size, SK_MEM_SEND);
 }
 
@@ -831,7 +831,7 @@ static inline int sk_rmem_schedule(struct sock *sk, int size)
 {
 	if (!sk_has_account(sk))
 		return 1;
-	return size <= sk->sk_forward_alloc ||
+	return size <= atomic_read(&sk->sk_forward_alloc) ||
 		__sk_mem_schedule(sk, size, SK_MEM_RECV);
 }
 
@@ -839,7 +839,7 @@ static inline void sk_mem_reclaim(struct sock *sk)
 {
 	if (!sk_has_account(sk))
 		return;
-	if (sk->sk_forward_alloc >= SK_MEM_QUANTUM)
+	if (atomic_read(&sk->sk_forward_alloc) >= SK_MEM_QUANTUM)
 		__sk_mem_reclaim(sk);
 }
 
@@ -847,7 +847,7 @@ static inline void sk_mem_reclaim_partial(struct sock *sk)
 {
 	if (!sk_has_account(sk))
 		return;
-	if (sk->sk_forward_alloc > SK_MEM_QUANTUM)
+	if (atomic_read(&sk->sk_forward_alloc) > SK_MEM_QUANTUM)
 		__sk_mem_reclaim(sk);
 }
 
@@ -855,14 +855,14 @@ static inline void sk_mem_charge(struct sock *sk, int size)
 {
 	if (!sk_has_account(sk))
 		return;
-	sk->sk_forward_alloc -= size;
+	atomic_sub(size, &sk->sk_forward_alloc);
 }
 
 static inline void sk_mem_uncharge(struct sock *sk, int size)
 {
 	if (!sk_has_account(sk))
 		return;
-	sk->sk_forward_alloc += size;
+	atomic_add(size, &sk->sk_forward_alloc);
 }
 
 static inline void sk_wmem_free_skb(struct sock *sk, struct sk_buff *skb)
diff --git a/net/core/sock.c b/net/core/sock.c
index 0620046..8489105 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1081,7 +1081,7 @@ struct sock *sk_clone(const struct sock *sk, const gfp_t priority)
 
 		newsk->sk_dst_cache	= NULL;
 		newsk->sk_wmem_queued	= 0;
-		newsk->sk_forward_alloc = 0;
+		atomic_set(&newsk->sk_forward_alloc, 0);
 		newsk->sk_send_head	= NULL;
 		newsk->sk_userlocks	= sk->sk_userlocks & ~SOCK_BINDPORT_LOCK;
 
@@ -1479,7 +1479,7 @@ int __sk_mem_schedule(struct sock *sk, int size, int kind)
 	int amt = sk_mem_pages(size);
 	int allocated;
 
-	sk->sk_forward_alloc += amt * SK_MEM_QUANTUM;
+	atomic_add(amt * SK_MEM_QUANTUM, &sk->sk_forward_alloc);
 	allocated = atomic_add_return(amt, prot->memory_allocated);
 
 	/* Under limit. */
@@ -1520,7 +1520,7 @@ int __sk_mem_schedule(struct sock *sk, int size, int kind)
 		if (prot->sysctl_mem[2] > alloc *
 		    sk_mem_pages(sk->sk_wmem_queued +
 				 atomic_read(&sk->sk_rmem_alloc) +
-				 sk->sk_forward_alloc))
+				 atomic_read(&sk->sk_forward_alloc)))
 			return 1;
 	}
 
@@ -1537,7 +1537,7 @@ suppress_allocation:
 	}
 
 	/* Alas. Undo changes. */
-	sk->sk_forward_alloc -= amt * SK_MEM_QUANTUM;
+	atomic_sub(amt * SK_MEM_QUANTUM, &sk->sk_forward_alloc);
 	atomic_sub(amt, prot->memory_allocated);
 	return 0;
 }
@@ -1551,14 +1551,21 @@ EXPORT_SYMBOL(__sk_mem_schedule);
 void __sk_mem_reclaim(struct sock *sk)
 {
 	struct proto *prot = sk->sk_prot;
-
-	atomic_sub(sk->sk_forward_alloc >> SK_MEM_QUANTUM_SHIFT,
-		   prot->memory_allocated);
-	sk->sk_forward_alloc &= SK_MEM_QUANTUM - 1;
-
-	if (prot->memory_pressure && *prot->memory_pressure &&
-	    (atomic_read(prot->memory_allocated) < prot->sysctl_mem[0]))
-		*prot->memory_pressure = 0;
+	int val = atomic_read(&sk->sk_forward_alloc);
+
+begin:
+	val = atomic_read(&sk->sk_forward_alloc);
+	if (val >= SK_MEM_QUANTUM) {
+		if (atomic_cmpxchg(&sk->sk_forward_alloc, val,
+				   val & (SK_MEM_QUANTUM - 1)) != val)
+			goto begin;
+		atomic_sub(val >> SK_MEM_QUANTUM_SHIFT,
+			   prot->memory_allocated);
+
+		if (prot->memory_pressure && *prot->memory_pressure &&
+		    (atomic_read(prot->memory_allocated) < prot->sysctl_mem[0]))
+			*prot->memory_pressure = 0;
+	}
 }
 EXPORT_SYMBOL(__sk_mem_reclaim);
diff --git a/net/core/stream.c b/net/core/stream.c
index 8727cea..4d04d28 100644
--- a/net/core/stream.c
+++ b/net/core/stream.c
@@ -198,7 +198,7 @@ void sk_stream_kill_queues(struct sock *sk)
 	sk_mem_reclaim(sk);
 
 	WARN_ON(sk->sk_wmem_queued);
-	WARN_ON(sk->sk_forward_alloc);
+	WARN_ON(atomic_read(&sk->sk_forward_alloc));
 
 	/* It is _impossible_ for the backlog to contain anything
 	 * when we get here.  All user references to this socket
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 627be4d..7a1475c 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -152,7 +152,7 @@ void inet_sock_destruct(struct sock *sk)
 	WARN_ON(atomic_read(&sk->sk_rmem_alloc));
 	WARN_ON(atomic_read(&sk->sk_wmem_alloc));
 	WARN_ON(sk->sk_wmem_queued);
-	WARN_ON(sk->sk_forward_alloc);
+	WARN_ON(atomic_read(&sk->sk_forward_alloc));
 
 	kfree(inet->opt);
 	dst_release(sk->sk_dst_cache);
diff --git a/net/ipv4/inet_diag.c b/net/ipv4/inet_diag.c
index 588a779..903ad66 100644
--- a/net/ipv4/inet_diag.c
+++ b/net/ipv4/inet_diag.c
@@ -158,7 +158,7 @@ static int inet_csk_diag_fill(struct sock *sk,
 	if (minfo) {
 		minfo->idiag_rmem = atomic_read(&sk->sk_rmem_alloc);
 		minfo->idiag_wmem = sk->sk_wmem_queued;
-		minfo->idiag_fmem = sk->sk_forward_alloc;
+		minfo->idiag_fmem = atomic_read(&sk->sk_forward_alloc);
 		minfo->idiag_tmem = atomic_read(&sk->sk_wmem_alloc);
 	}
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index a6961d7..5e08f37 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -5258,7 +5258,7 @@ int tcp_rcv_established(struct sock *sk, struct sk_buff *skb,
 				tcp_rcv_rtt_measure_ts(sk, skb);
 
-				if ((int)skb->truesize > sk->sk_forward_alloc)
+				if ((int)skb->truesize > atomic_read(&sk->sk_forward_alloc))
 					goto step5;
 
 				NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPHPHITS);
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 4bd178a..dcc246a 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -955,9 +955,7 @@ try_again:
 		err = ulen;
 
 out_free:
-	lock_sock(sk);
 	skb_free_datagram(sk, skb);
-	release_sock(sk);
 out:
 	return err;
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 84b1a29..582b80a 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -257,9 +257,7 @@ try_again:
 		err = ulen;
 
 out_free:
-	lock_sock(sk);
 	skb_free_datagram(sk, skb);
-	release_sock(sk);
 out:
 	return err;
diff --git a/net/sched/em_meta.c b/net/sched/em_meta.c
index 72cf86e..94d90b6 100644
--- a/net/sched/em_meta.c
+++ b/net/sched/em_meta.c
@@ -383,7 +383,7 @@ META_COLLECTOR(int_sk_wmem_queued)
 META_COLLECTOR(int_sk_fwd_alloc)
 {
 	SKIP_NONLOCAL(skb);
-	dst->value = skb->sk->sk_forward_alloc;
+	dst->value = atomic_read(&skb->sk->sk_forward_alloc);
 }
 
 META_COLLECTOR(int_sk_sndbuf)
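The interesting part of the patch is the lockless reclaim loop in
__sk_mem_reclaim(): snapshot the counter, try to cmpxchg in just the
sub-quantum remainder, and retry if another cpu moved the counter in
between. The same pattern rendered in user space (editor's sketch;
SK_MEM_QUANTUM is PAGE_SIZE in the kernel, assumed 4096 here):

/*
 * Lock-free quantum reclaim, modeled on the patch's __sk_mem_reclaim().
 * Build: gcc -O2 reclaim.c -o reclaim
 */
#include <stdio.h>

#define SK_MEM_QUANTUM       4096
#define SK_MEM_QUANTUM_SHIFT 12

static int forward_alloc = 3 * SK_MEM_QUANTUM + 100;
static int memory_allocated = 10;	/* global pool, in quanta */

static void mem_reclaim(void)
{
	int val;
begin:
	val = forward_alloc;		/* atomic_read() in the patch */
	if (val >= SK_MEM_QUANTUM) {
		/* return whole quanta to the pool, keep the remainder */
		if (__sync_val_compare_and_swap(&forward_alloc, val,
				val & (SK_MEM_QUANTUM - 1)) != val)
			goto begin;	/* lost a race: re-snapshot */
		__sync_sub_and_fetch(&memory_allocated,
				     val >> SK_MEM_QUANTUM_SHIFT);
	}
}

int main(void)
{
	mem_reclaim();
	/* prints: forward_alloc=100 memory_allocated=7 */
	printf("forward_alloc=%d memory_allocated=%d\n",
	       forward_alloc, memory_allocated);
	return 0;
}

The cmpxchg makes the reclaim itself safe against concurrent updates;
what it cannot do, per David's objection, is make the separate
check-then-charge sequence in sk_rmem_schedule() atomic.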