Message ID | 4A044BE7.3070308@cosmosbay.com |
---|---|
State | RFC, archived |
Delegated to: | David Miller |
From: Eric Dumazet <dada1@cosmosbay.com>
Date: Fri, 08 May 2009 17:12:39 +0200

> For example, we can avoid the dst_release() cache miss if this
> is done in start_xmit(), and not later in TX completion while freeing the skb.
> I tried various patches in the past, but unfortunately it seems the
> only safe way to do this is in the driver xmit itself, not in the core
> network stack. This would need many patches, one for each driver.

There might be a way around having to hit every driver.

The case we can't muck with is when the route will be used.
Devices which create this kind of situation can be marked with
a flag bit in struct net_device. If that flag bit isn't set,
you can drop the DST in dev_hard_start_xmit().

> [PATCH] net: reduce number of reference taken on sk_refcnt
>
> The current sk_wmem_alloc scheme takes one sk_refcnt reference for
> each packet in flight. This hurts some workloads at TX completion
> time, because sock_wfree() has at least three cache lines to touch
> (one for sk_wmem_alloc, one for testing sk_flags, one to decrement
> sk_refcnt).
>
> We could use only one reference count, taken only when sk_wmem_alloc
> changes from or to ZERO (i.e. one reference count for any number of
> in-flight packets).
>
> Not every atomic_add() must be changed to atomic_add_return(); where
> we know the current sk_wmem_alloc is already non-zero, plain
> atomic_add() can stay.
>
> This patch reduces by one the number of cache lines dirtied in
> sock_wfree(), and the number of atomic operations in some workloads.
>
> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>

I like this idea. Let me know when you have at least some
basic performance numbers and wish to submit this formally.
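[Editor's sketch, not code from this thread: roughly what David's suggestion might look like in the core transmit path. IFF_XMIT_NEEDS_DST is an invented flag name and bit value, and the helper is hypothetical; the dst_release()/skb->dst pair matches the era's API (later kernels wrap it in skb_dst_drop()).]

```c
#include <linux/netdevice.h>
#include <net/dst.h>

/* Invented flag: "this device still uses the route after xmit".
 * The bit value is illustrative and may collide with real flags. */
#define IFF_XMIT_NEEDS_DST	0x4000

static inline void dev_maybe_drop_dst(struct sk_buff *skb,
				      struct net_device *dev)
{
	/* Devices such as loopback would set the flag and keep the
	 * route; for everyone else the core releases the dst here,
	 * while it is still cache-hot, instead of at TX completion. */
	if (!(dev->priv_flags & IFF_XMIT_NEEDS_DST)) {
		dst_release(skb->dst);
		skb->dst = NULL;
	}
}
```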
David Miller wrote:
> From: Eric Dumazet <dada1@cosmosbay.com>
> Date: Fri, 08 May 2009 17:12:39 +0200
>
>> For example, we can avoid the dst_release() cache miss if this
>> is done in start_xmit(), and not later in TX completion while freeing the skb.
>> I tried various patches in the past, but unfortunately it seems the
>> only safe way to do this is in the driver xmit itself, not in the core
>> network stack. This would need many patches, one for each driver.
>
> There might be a way around having to hit every driver.
>
> The case we can't muck with is when the route will be used.
> Devices which create this kind of situation can be marked with
> a flag bit in struct net_device. If that flag bit isn't set,
> you can drop the DST in dev_hard_start_xmit().

Yes, this is a possibility, I'll think about it, thank you.
I'll have to recall which devices would need this flag (loopback for sure).

>> [PATCH] net: reduce number of reference taken on sk_refcnt
>>
>> The current sk_wmem_alloc scheme takes one sk_refcnt reference for
>> each packet in flight. This hurts some workloads at TX completion
>> time, because sock_wfree() has at least three cache lines to touch
>> (one for sk_wmem_alloc, one for testing sk_flags, one to decrement
>> sk_refcnt).
>>
>> We could use only one reference count, taken only when sk_wmem_alloc
>> changes from or to ZERO (i.e. one reference count for any number of
>> in-flight packets).
>>
>> Not every atomic_add() must be changed to atomic_add_return(); where
>> we know the current sk_wmem_alloc is already non-zero, plain
>> atomic_add() can stay.
>>
>> This patch reduces by one the number of cache lines dirtied in
>> sock_wfree(), and the number of atomic operations in some workloads.
>>
>> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
>
> I like this idea. Let me know when you have at least some
> basic performance numbers and wish to submit this formally.

Sure, but I am focusing right now on the opposite situation: many TCP
flows, but with small in/out traffic, where this patch has no impact,
since I only have (0 <--> !0) transitions.

BTW, oprofile for this kind of workload gives a surprising result
(timer stuff being *very* expensive).

The CPU doing the NAPI stuff has this profile (samples, cumulative
samples, %, cumulative %):

88688   88688  9.7805   9.7805  lock_timer_base
72692  161380  8.0165  17.7970  bnx2_poll_work
66958  228338  7.3842  25.1812  mod_timer
47980  276318  5.2913  30.4724  __wake_up
43312  319630  4.7765  35.2489  task_rq_lock
43193  362823  4.7633  40.0122  __slab_alloc
36388  399211  4.0129  44.0251  __alloc_skb
30285  429496  3.3398  47.3650  skb_release_data
29236  458732  3.2242  50.5891  ip_rcv
29219  487951  3.2223  53.8114  resched_task
29094  517045  3.2085  57.0199  __inet_lookup_established
28695  545740  3.1645  60.1844  tcp_v4_rcv
27479  573219  3.0304  63.2148  sock_wfree
26722  599941  2.9469  66.1617  ip_route_input
21401  621342  2.3601  68.5218  select_task_rq_fair
19390  640732  2.1383  70.6601  __kfree_skb
17763  658495  1.9589  72.6190  sched_clock_cpu
17565  676060  1.9371  74.5561  try_to_wake_up
17366  693426  1.9151  76.4712  __enqueue_entity
16174  709600  1.7837  78.2549  update_curr
14323  723923  1.5795  79.8345  __kmalloc_track_caller
14003  737926  1.5443  81.3787  enqueue_task_fair
12456  750382  1.3737  82.7524  __tcp_prequeue
12212  762594  1.3467  84.0991  __wake_up_common
11437  774031  1.2613  85.3604  kmem_cache_alloc
10927  784958  1.2050  86.5654  place_entity
10535  795493  1.1618  87.7272  netif_receive_skb
 9971  805464  1.0996  88.8268  ipt_do_table
 8551  814015  0.9430  89.7698  internal_add_timer
From: Eric Dumazet <dada1@cosmosbay.com>
Date: Sat, 09 May 2009 14:13:59 +0200

> BTW, oprofile for this kind of workload gives a surprising result
> (timer stuff being *very* expensive).
> The CPU doing the NAPI stuff has this profile:
>
> 88688   88688  9.7805   9.7805  lock_timer_base
> 72692  161380  8.0165  17.7970  bnx2_poll_work
> 66958  228338  7.3842  25.1812  mod_timer
> 47980  276318  5.2913  30.4724  __wake_up
> 43312  319630  4.7765  35.2489  task_rq_lock

To me this seems to indicate that timers keep getting scheduled on
different CPUs. It would probably be helped by Google's
packet-processing-locality patches, or some variant thereof.

Consider the case where we always send some message on CPU A and then
process the ACK on CPU B. We'll always be cancelling the timer on a
foreign CPU.
From: Eric Dumazet <dada1@cosmosbay.com>
Date: Sat, 09 May 2009 14:13:59 +0200

> David Miller wrote:
>> From: Eric Dumazet <dada1@cosmosbay.com>
>> Date: Fri, 08 May 2009 17:12:39 +0200
>>
>>> For example, we can avoid the dst_release() cache miss if this
>>> is done in start_xmit(), and not later in TX completion while freeing the skb.
>>> I tried various patches in the past, but unfortunately it seems the
>>> only safe way to do this is in the driver xmit itself, not in the core
>>> network stack. This would need many patches, one for each driver.
>>
>> There might be a way around having to hit every driver.
>>
>> The case we can't muck with is when the route will be used.
>> Devices which create this kind of situation can be marked with
>> a flag bit in struct net_device. If that flag bit isn't set,
>> you can drop the DST in dev_hard_start_xmit().
>
> Yes, this is a possibility, I'll think about it, thank you.
> I'll have to recall which devices would need this flag (loopback for sure).

Another idea is to invert the flag, and set it in places such as
alloc_etherdev*(). This might speed up getting coverage for the most
obvious cases.
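[Editor's sketch of the inverted arrangement, again illustrative rather than code from this thread. The flag is inverted so that set means "safe to drop the dst at xmit time"; mainline did later adopt a flag named IFF_XMIT_DST_RELEASE for this, but the value and wrapper below are invented. The allocation helper covers every driver going through alloc_etherdev*() in one change, while loopback and friends simply never get the bit, and the core check from the earlier sketch tests the flag the other way around.]

```c
#include <linux/etherdevice.h>

/* Inverted flag: set means "the core may drop the dst at xmit time".
 * The name follows the flag mainline later added; the value here is
 * illustrative only. */
#define IFF_XMIT_DST_RELEASE	0x8000

struct net_device *alloc_etherdev_covered(int sizeof_priv)
{
	struct net_device *dev = alloc_etherdev(sizeof_priv);

	/* Marking the device here covers the common Ethernet case at
	 * once; devices that really use the route after xmit (such as
	 * loopback) never set the bit, so they keep their dst. */
	if (dev)
		dev->priv_flags |= IFF_XMIT_DST_RELEASE;
	return dev;
}
```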
From: David Miller <davem@davemloft.net>
Date: Sat, 09 May 2009 13:34:54 -0700 (PDT)

> Consider the case where we always send some message on CPU A and then
> process the ACK on CPU B. We'll always be cancelling the timer on a
> foreign CPU.

I should also mention that TCP has a peculiar optimization of timers
that is likely being thwarted by your workload. It never deletes timers
under normal operation; it simply lets them expire, and the handler
notices that there is "nothing to do" and returns.

But when the connection does shut down, we have to purge all of these
timers.

That could be another part of why you see timers in your profile.
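[Editor's paraphrase of the pattern David describes, schematic rather than the exact tcp_timer.c source: the ACK path never calls del_timer(); the timer eventually fires and the handler notices nothing is pending. The function name is invented; the icsk fields and sk_reset_timer() are the real interfaces.]

```c
#include <net/inet_connection_sock.h>

/* Schematic paraphrase, not the exact kernel handler: since ACK
 * processing does not delete the timer, the handler must tolerate
 * firing with nothing to do. */
static void tcp_write_timer_sketch(unsigned long data)
{
	struct sock *sk = (struct sock *)data;
	struct inet_connection_sock *icsk = inet_csk(sk);

	if (!icsk->icsk_pending)
		return;	/* ACK already handled the event: nothing to do */

	if (time_after(icsk->icsk_timeout, jiffies)) {
		/* Deadline was pushed forward since the timer was armed:
		 * re-arm for the new expiry instead of acting. */
		sk_reset_timer(sk, &icsk->icsk_retransmit_timer,
			       icsk->icsk_timeout);
		return;
	}
	/* ... otherwise handle the real retransmit event ... */
}
```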
David Miller wrote:
> From: David Miller <davem@davemloft.net>
> Date: Sat, 09 May 2009 13:34:54 -0700 (PDT)
>
>> Consider the case where we always send some message on CPU A and then
>> process the ACK on CPU B. We'll always be cancelling the timer on a
>> foreign CPU.
>
> I should also mention that TCP has a peculiar optimization of timers
> that is likely being thwarted by your workload. It never deletes timers
> under normal operation; it simply lets them expire, and the handler
> notices that there is "nothing to do" and returns.

Yes, you refer to the INET_CSK_CLEAR_TIMERS condition, never set.

> But when the connection does shut down, we have to purge all of these
> timers.
>
> That could be another part of why you see timers in your profile.

Well, in my workload they should never expire, since the applications
exchange enough data in both directions, and there are no losses
(Gigabit LAN context).

On the machine acting as a server (the one I am focusing on, of course),
each incoming frame:

- Contains the ACK for the previously sent frame
- Contains data provided by the client
- Starts a timer for the delayed ACK

Then the server application reacts and sends a new payload, and the TCP
stack:

- Sends a frame including the ACK for the previously received frame
- Includes data provided by the server application in that frame
- Starts a timer for retransmitting this frame if no ACK is received
  later

So yes, each incoming and each outgoing frame is going to call
mod_timer().

The problem is that incoming processing is done by CPU 0 (the one
dedicated to NAPI processing because of the stress situation, CPU 100%
in softirq land), while outgoing processing is done by the other CPUs in
the machine.

offsetof(struct inet_connection_sock, icsk_retransmit_timer) = 0x208
offsetof(struct inet_connection_sock, icsk_delack_timer)     = 0x238

So there are cache line ping-pongs, but oprofile seems to point to a
spinlock contention in lock_timer_base(), and I don't know why...
Shouldn't (in my workload) the delack timers all belong to CPU 0, and
the retransmit timers to the other CPUs?

Or does mod_timer() never migrate an already established timer?

That would explain the lock contention on timer_base; we should take
care of it if possible.

Thanks David
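[Editor's note on the migration question: in the timer core of that era, __mod_timer() does migrate a timer to the current CPU's base whenever its callback isn't concurrently running, and lock_timer_base() has to chase and lock whichever per-CPU base the timer last landed on. The paraphrase below is simplified from kernel/timer.c (the flag bits stored in the base pointer are omitted); it is not the exact source. Two CPUs re-arming the same sock timers therefore bounce both the timer and its base lock between them, which fits the profile.]

```c
#include <linux/timer.h>
#include <linux/spinlock.h>

/* Simplified paraphrase of the era's kernel/timer.c, not exact source.
 * __mod_timer() calls this first, then may move the timer to the
 * current CPU's tvec_base, so the owning base can change under us. */
static struct tvec_base *lock_timer_base(struct timer_list *timer,
					 unsigned long *flags)
{
	struct tvec_base *base;

	for (;;) {
		base = timer->base;	/* real code masks flag bits here */
		spin_lock_irqsave(&base->lock, *flags);
		if (base == timer->base)
			return base;	/* still the owning base, lock held */
		/* Timer migrated to another CPU's base meanwhile:
		 * drop the lock and chase it. */
		spin_unlock_irqrestore(&base->lock, *flags);
	}
}
```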
Eric Dumazet wrote:
> David Miller wrote:
>> I should also mention that TCP has a peculiar optimization of timers
>> that is likely being thwarted by your workload. It never deletes timers
>> under normal operation; it simply lets them expire, and the handler
>> notices that there is "nothing to do" and returns.
>
> [...]
>
> So there are cache line ping-pongs, but oprofile seems to point to a
> spinlock contention in lock_timer_base(), and I don't know why...
> Shouldn't (in my workload) the delack timers all belong to CPU 0, and
> the retransmit timers to the other CPUs?
>
> Or does mod_timer() never migrate an already established timer?
>
> That would explain the lock contention on timer_base; we should take
> care of it if possible.

ftrace is my friend :)

The problem is that the application, when doing its recv() call, ends
up calling tcp_send_delayed_ack() too. So yes, CPUs are fighting over
icsk_delack_timer and their timer_base pretty hard.
2631.936051: finish_task_switch <-schedule
2631.936051: perf_counter_task_sched_in <-finish_task_switch
2631.936051: __perf_counter_sched_in <-perf_counter_task_sched_in
2631.936051: _spin_lock <-__perf_counter_sched_in
2631.936052: lock_sock_nested <-sk_wait_data
2631.936052: _spin_lock_bh <-lock_sock_nested
2631.936052: local_bh_disable <-_spin_lock_bh
2631.936052: local_bh_enable <-lock_sock_nested
2631.936052: finish_wait <-sk_wait_data
2631.936053: tcp_prequeue_process <-tcp_recvmsg
2631.936053: local_bh_disable <-tcp_prequeue_process
2631.936053: tcp_v4_do_rcv <-tcp_prequeue_process
2631.936053: tcp_rcv_established <-tcp_v4_do_rcv
2631.936054: local_bh_enable <-tcp_rcv_established
2631.936054: skb_copy_datagram_iovec <-tcp_rcv_established
2631.936054: memcpy_toiovec <-skb_copy_datagram_iovec
2631.936054: copy_to_user <-memcpy_toiovec
2631.936054: tcp_rcv_space_adjust <-tcp_rcv_established
2631.936055: local_bh_disable <-tcp_rcv_established
2631.936055: tcp_event_data_recv <-tcp_rcv_established
2631.936055: tcp_ack <-tcp_rcv_established
2631.936056: __kfree_skb <-tcp_ack
2631.936056: skb_release_head_state <-__kfree_skb
2631.936056: dst_release <-skb_release_head_state
2631.936056: skb_release_data <-__kfree_skb
2631.936056: put_page <-skb_release_data
2631.936057: kfree <-skb_release_data
2631.936057: kmem_cache_free <-__kfree_skb
2631.936057: tcp_valid_rtt_meas <-tcp_ack
2631.936058: bictcp_acked <-tcp_ack
2631.936058: bictcp_cong_avoid <-tcp_ack
2631.936058: tcp_is_cwnd_limited <-bictcp_cong_avoid
2631.936058: tcp_current_mss <-tcp_rcv_established
2631.936058: tcp_established_options <-tcp_current_mss
2631.936058: __tcp_push_pending_frames <-tcp_rcv_established
2631.936059: __tcp_ack_snd_check <-tcp_rcv_established
2631.936059: tcp_send_delayed_ack <-__tcp_ack_snd_check
2631.936059: sk_reset_timer <-tcp_send_delayed_ack
2631.936059: mod_timer <-sk_reset_timer
2631.936059: lock_timer_base <-mod_timer
2631.936059: _spin_lock_irqsave <-lock_timer_base
2631.936059: _spin_lock <-mod_timer
2631.936060: internal_add_timer <-mod_timer
2631.936064: _spin_unlock_irqrestore <-mod_timer
2631.936064: __kfree_skb <-tcp_rcv_established
2631.936064: skb_release_head_state <-__kfree_skb
2631.936064: dst_release <-skb_release_head_state
2631.936065: skb_release_data <-__kfree_skb
2631.936065: kfree <-skb_release_data
2631.936065: __slab_free <-kfree
2631.936065: add_partial <-__slab_free
2631.936065: _spin_lock <-add_partial
2631.936066: kmem_cache_free <-__kfree_skb
2631.936066: __slab_free <-kmem_cache_free
2631.936066: add_partial <-__slab_free
2631.936067: _spin_lock <-add_partial
2631.936067: local_bh_enable <-tcp_prequeue_process
2631.936067: tcp_cleanup_rbuf <-tcp_recvmsg
2631.936067: __tcp_select_window <-tcp_cleanup_rbuf
2631.936067: release_sock <-tcp_recvmsg
2631.936068: _spin_lock_bh <-release_sock
2631.936068: local_bh_disable <-_spin_lock_bh
2631.936068: _spin_unlock_bh <-release_sock
2631.936068: local_bh_enable_ip <-_spin_unlock_bh
2631.936068: fput <-sys_recvfrom
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1213,14 +1213,17 @@ static inline int skb_copy_to_page(struct sock *sk, char __user *from,
  *
  * Inlined as it's very short and called for pretty much every
  * packet ever received.
+ * We take a reference on the sock only if this is the first packet
+ * accounted for (one reference count for any number of in-flight packets).
  */
 
 static inline void skb_set_owner_w(struct sk_buff *skb, struct sock *sk)
 {
-	sock_hold(sk);
 	skb->sk = sk;
 	skb->destructor = sock_wfree;
-	atomic_add(skb->truesize, &sk->sk_wmem_alloc);
+	if (atomic_add_return(skb->truesize, &sk->sk_wmem_alloc) ==
+	    skb->truesize)
+		sock_hold(sk);
 }
 
 static inline void skb_set_owner_r(struct sk_buff *skb, struct sock *sk)
diff --git a/net/core/sock.c b/net/core/sock.c
index 7dbf3ff..bf10084 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1170,12 +1170,17 @@ void __init sk_init(void)
 void sock_wfree(struct sk_buff *skb)
 {
 	struct sock *sk = skb->sk;
+	int res;
 
 	/* In case it might be waiting for more memory. */
-	atomic_sub(skb->truesize, &sk->sk_wmem_alloc);
+	res = atomic_sub_return(skb->truesize, &sk->sk_wmem_alloc);
 	if (!sock_flag(sk, SOCK_USE_WRITE_QUEUE))
 		sk->sk_write_space(sk);
-	sock_put(sk);
+	/*
+	 * No more packets in flight, we must release the sock refcnt
+	 */
+	if (res == 0)
+		sock_put(sk);
 }
 
 /*
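[Editor's note on why the patch works: atomic_add_return() yielding exactly skb->truesize proves the counter was zero before the add (this is the first packet in flight), and atomic_sub_return() yielding zero marks the last packet leaving. A minimal userspace demonstration of that invariant follows; it is illustrative only, not kernel code, and the names stand in for the socket fields.]

```c
/* Demo of the 0 <-> !0 transition detection used by the patch:
 * add returning "size" means "was zero", sub returning 0 means
 * "now zero". Single-threaded, illustrative only. */
#include <assert.h>
#include <stdatomic.h>

static atomic_int wmem_alloc;	/* stands in for sk->sk_wmem_alloc */
static int refcnt;		/* stands in for the single sk_refcnt hold */

static void charge(int truesize)
{
	/* atomic_fetch_add returns the old value; adding truesize gives
	 * the new value, matching the kernel's atomic_add_return(). */
	if (atomic_fetch_add(&wmem_alloc, truesize) + truesize == truesize)
		refcnt++;	/* 0 -> !0 transition: take the reference */
}

static void uncharge(int truesize)
{
	if (atomic_fetch_sub(&wmem_alloc, truesize) - truesize == 0)
		refcnt--;	/* !0 -> 0 transition: drop the reference */
}

int main(void)
{
	charge(512);
	charge(256);		/* two packets in flight, one reference */
	assert(refcnt == 1);
	uncharge(512);
	uncharge(256);		/* freeing the last packet drops the ref */
	assert(refcnt == 0);
	return 0;
}
```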