
net: reduce the number of references taken on sk_refcnt

Message ID 4A044BE7.3070308@cosmosbay.com
State RFC, archived
Delegated to: David Miller

Commit Message

Eric Dumazet May 8, 2009, 3:12 p.m. UTC
Krzysztof Halasa wrote:
> Hi,
> 
> Could NAPI or something similar be used for TX in addition to RX?
> Perhaps some driver already does that?
> 
> I'm specifically thinking of IXP4xx Ethernet (and possibly WAN), those
> are 266-667 MHz ARM XScale CPUs and the interrupts handling just
> transmitted sk_buffs are a pain. Could I delay those interrupts
> somehow?

This is a pain, yes, I agree. But some improvements can be made
(even on drivers already using NAPI, especially if the same interrupt
is used for both the RX and TX queues).

For example, we can avoid the dst_release() cache miss if it is done
in start_xmit(), and not later in TX completion while freeing the skb.
I tried various patches in the past, but unfortunately it seems the
only safe way to do this is in the driver xmit itself, not in the core
network stack. This would need many patches, one for each driver.
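
As a rough sketch (driver and function names hypothetical, and assuming
nothing in the driver or stack looks at skb->dst after this point), the
per-driver change would look something like:

/* Sketch only: release the route while its cache line is still hot,
 * so that kfree_skb() at TX completion time does not take the miss.
 */
static int foo_start_xmit(struct sk_buff *skb, struct net_device *dev)
{
        if (skb->dst) {
                dst_release(skb->dst);
                skb->dst = NULL;        /* TX completion then skips the dst */
        }
        /* ... map the buffer, fill the TX descriptor, kick the NIC ... */
        return NETDEV_TX_OK;
}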

Then, if your workload is UDP packets, a recent patch
avoided an expensive wakeup at TX completion time
(check commit bf368e4e70cd4e0f880923c44e95a4273d725ab4

http://git2.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=bf368e4e70cd4e0f880923c44e95a4273d725ab4

Unfortunately this patch is only for 2.6.30, since it is based
on a new infrastructure from Davide Libenzi)

Then, additional work should be done to reduce cache line
misses in sock_wfree().

This function currently touches many different cache lines
(one for sk_wmem_alloc, one for testing sk_flags, one
 to decrement sk_refcnt). And on UDP, extra cache lines because
we call sock_def_write_space(). Yes, this sucks.
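
For reference, here is the current sock_wfree() (the same code the patch
below modifies), with the cache line touches annotated:

void sock_wfree(struct sk_buff *skb)
{
        struct sock *sk = skb->sk;

        /* In case it might be waiting for more memory. */
        atomic_sub(skb->truesize, &sk->sk_wmem_alloc);  /* 1: sk_wmem_alloc */
        if (!sock_flag(sk, SOCK_USE_WRITE_QUEUE))       /* 2: sk_flags */
                sk->sk_write_space(sk);   /* sock_def_write_space() on UDP */
        sock_put(sk);                                   /* 3: sk_refcnt */
}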

We could avoid touching sk_refcnt in several cases. We only
need one reference count for the whole set of packets attached to a socket,
provided we use atomic_add_return() and friends to detect transitions
to and from zero. That way, if a tcp stream always has some packets in flight,
TX completion would not have to decrement sk_refcnt.

The following patch (based on net-next-2.6) illustrates my point:

[PATCH] net: reduce the number of references taken on sk_refcnt

The current sk_wmem_alloc scheme takes one sk_refcnt reference for each packet
in flight. This hurts some workloads at TX completion time, because
sock_wfree() has at least three cache lines to touch
(one for sk_wmem_alloc, one for testing sk_flags, one
 to decrement sk_refcnt).

We could use only one reference count, taken only when sk_wmem_alloc
changes from or to ZERO (ie one reference count for any number
of in-flight packets).

Not every atomic_add() must be changed to atomic_add_return(): if we
know the current sk_wmem_alloc is already non-zero, a plain add is enough.

This patch reduces by one the number of cache lines dirtied in sock_wfree()
and the number of atomic operations in some workloads.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>

diff --cc include/net/tcp.h
index ac37228,87d210b..0000000
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
diff --git a/include/net/sock.h b/include/net/sock.h
index 4bb1ff9..d57e88b 100644


Comments

David Miller May 8, 2009, 9:48 p.m. UTC | #1
From: Eric Dumazet <dada1@cosmosbay.com>
Date: Fri, 08 May 2009 17:12:39 +0200

> For example, we can avoid the dst_release() cache miss if it is done
> in start_xmit(), and not later in TX completion while freeing the skb.
> I tried various patches in the past, but unfortunately it seems the
> only safe way to do this is in the driver xmit itself, not in the core
> network stack. This would need many patches, one for each driver.

There might be a way around having to hit every driver.

The case we can't muck with is when the route will be used.
Devices which create this kind of situation can be marked with
a flag bit in struct netdevice.  If that flag bit isn't set,
you can drop the DST in dev_hard_start_xmit().
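
Something along these lines, as a sketch only (the flag name is hypothetical
and the GSO handling is omitted):

int dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev,
                        struct netdev_queue *txq)
{
        const struct net_device_ops *ops = dev->netdev_ops;

        /* Devices that really use the route after this point set the flag;
         * everyone else gets the dst released early, on a hot cache line.
         */
        if (skb->dst && !(dev->priv_flags & IFF_TX_NEEDS_DST)) {
                dst_release(skb->dst);
                skb->dst = NULL;
        }
        /* ... existing GSO / segmentation handling ... */
        return ops->ndo_start_xmit(skb, dev);
}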

> [PATCH] net: reduce the number of references taken on sk_refcnt
> 
> The current sk_wmem_alloc scheme takes one sk_refcnt reference for each packet
> in flight. This hurts some workloads at TX completion time, because
> sock_wfree() has at least three cache lines to touch
> (one for sk_wmem_alloc, one for testing sk_flags, one
>  to decrement sk_refcnt).
> 
> We could use only one reference count, taken only when sk_wmem_alloc
> changes from or to ZERO (ie one reference count for any number
> of in-flight packets).
> 
> Not every atomic_add() must be changed to atomic_add_return(): if we
> know the current sk_wmem_alloc is already non-zero, a plain add is enough.
> 
> This patch reduces by one the number of cache lines dirtied in sock_wfree()
> and the number of atomic operations in some workloads.
> 
> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>

I like this idea.  Let me know when you have at least some
basic performance numbers and wish to submit this formally.
Eric Dumazet May 9, 2009, 12:13 p.m. UTC | #2
David Miller wrote:
> From: Eric Dumazet <dada1@cosmosbay.com>
> Date: Fri, 08 May 2009 17:12:39 +0200
> 
>> For example, we can avoid the dst_release() cache miss if it is done
>> in start_xmit(), and not later in TX completion while freeing the skb.
>> I tried various patches in the past, but unfortunately it seems the
>> only safe way to do this is in the driver xmit itself, not in the core
>> network stack. This would need many patches, one for each driver.
> 
> There might be a way around having to hit every driver.
> 
> The case we can't muck with is when the route will be used.
> Devices which create this kind of situation can be marked with
> a flag bit in struct netdevice.  If that flag bit isn't set,
> you can drop the DST in dev_hard_start_xmit().

Yes, this is a possibility, I'll think about it, thank you.
I'll have to recall which devices would need this flag (loopback, for sure)...


> 
>> [PATCH] net: reduce the number of references taken on sk_refcnt
>>
>> The current sk_wmem_alloc scheme takes one sk_refcnt reference for each packet
>> in flight. This hurts some workloads at TX completion time, because
>> sock_wfree() has at least three cache lines to touch
>> (one for sk_wmem_alloc, one for testing sk_flags, one
>>  to decrement sk_refcnt).
>>
>> We could use only one reference count, taken only when sk_wmem_alloc
>> changes from or to ZERO (ie one reference count for any number
>> of in-flight packets).
>>
>> Not every atomic_add() must be changed to atomic_add_return(): if we
>> know the current sk_wmem_alloc is already non-zero, a plain add is enough.
>>
>> This patch reduces by one the number of cache lines dirtied in sock_wfree()
>> and the number of atomic operations in some workloads.
>>
>> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
> 
> I like this idea.  Let me know when you have some at least
> basic performance numbers and wish to submit this formally.

Sure, but I am focusing right now on the opposite situation
(many tcp flows but with small in/out traffic), where this patch
has no impact, since I only have (0 <--> !0) transitions.

BTW, oprofile for this kind of workload gives a surprising result
(the timer stuff being *very* expensive). The CPU doing the NAPI stuff has
this profile (samples, cumulative samples, %, cumulative %, symbol):

88688    88688          9.7805   9.7805    lock_timer_base
72692    161380         8.0165  17.7970    bnx2_poll_work
66958    228338         7.3842  25.1812    mod_timer
47980    276318         5.2913  30.4724    __wake_up
43312    319630         4.7765  35.2489    task_rq_lock
43193    362823         4.7633  40.0122    __slab_alloc
36388    399211         4.0129  44.0251    __alloc_skb
30285    429496         3.3398  47.3650    skb_release_data
29236    458732         3.2242  50.5891    ip_rcv
29219    487951         3.2223  53.8114    resched_task
29094    517045         3.2085  57.0199    __inet_lookup_established
28695    545740         3.1645  60.1844    tcp_v4_rcv
27479    573219         3.0304  63.2148    sock_wfree
26722    599941         2.9469  66.1617    ip_route_input
21401    621342         2.3601  68.5218    select_task_rq_fair
19390    640732         2.1383  70.6601    __kfree_skb
17763    658495         1.9589  72.6190    sched_clock_cpu
17565    676060         1.9371  74.5561    try_to_wake_up
17366    693426         1.9151  76.4712    __enqueue_entity
16174    709600         1.7837  78.2549    update_curr
14323    723923         1.5795  79.8345    __kmalloc_track_caller
14003    737926         1.5443  81.3787    enqueue_task_fair
12456    750382         1.3737  82.7524    __tcp_prequeue
12212    762594         1.3467  84.0991    __wake_up_common
11437    774031         1.2613  85.3604    kmem_cache_alloc
10927    784958         1.2050  86.5654    place_entity
10535    795493         1.1618  87.7272    netif_receive_skb
9971     805464         1.0996  88.8268    ipt_do_table
8551     814015         0.9430  89.7698    internal_add_timer

David Miller May 9, 2009, 8:34 p.m. UTC | #3
From: Eric Dumazet <dada1@cosmosbay.com>
Date: Sat, 09 May 2009 14:13:59 +0200

> BTW, oprofile for this kind of workload gives a surprising result.
> (timer stuff being *very* expensive)
> CPU doing the NAPI stuff has this profile :
> 
> 88688    88688          9.7805   9.7805    lock_timer_base
> 72692    161380         8.0165  17.7970    bnx2_poll_work
> 66958    228338         7.3842  25.1812    mod_timer
> 47980    276318         5.2913  30.4724    __wake_up
> 43312    319630         4.7765  35.2489    task_rq_lock

To me this seems to indicate that timers keep getting scheduled on
different cpus.

Probably it would be helped by Google's packet-processing-locality
patches, or some variant thereof.

Consider the case where we always send some message on CPU A and
then process the ACK on CPU B.  We'll always be cancelling the
timer on a foreign cpu.
David Miller May 9, 2009, 8:36 p.m. UTC | #4
From: Eric Dumazet <dada1@cosmosbay.com>
Date: Sat, 09 May 2009 14:13:59 +0200

> David Miller wrote:
>> From: Eric Dumazet <dada1@cosmosbay.com>
>> Date: Fri, 08 May 2009 17:12:39 +0200
>> 
>>> For example, we can avoid the dst_release() cache miss if it is done
>>> in start_xmit(), and not later in TX completion while freeing the skb.
>>> I tried various patches in the past, but unfortunately it seems the
>>> only safe way to do this is in the driver xmit itself, not in the core
>>> network stack. This would need many patches, one for each driver.
>> 
>> There might be a way around having to hit every driver.
>> 
>> The case we can't muck with is when the route will be used.
>> Devices which create this kind of situation can be marked with
>> a flag bit in struct netdevice.  If that flag bit isn't set,
>> you can drop the DST in dev_hard_start_xmit().
> 
> Yes, this is a possibility, I'll think about it, thank you.
> I'll have to recall which devices would need this flag (loopback for sure)..

Another idea is to invert the flag, and set it in places such
as alloc_etherdev*().

This might speed up getting coverage for the most obvious cases.
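
As a sketch, with the inverted flag (name hypothetical) every Ethernet
device would get the behaviour by default, and a driver that really needs
skb->dst at TX completion time would simply clear the flag:

void ether_setup(struct net_device *dev)
{
        /* ... existing ether_setup() initialisation ... */
        dev->flags              = IFF_BROADCAST | IFF_MULTICAST;
        /* inverted flag: "this device is safe for early dst release" */
        dev->priv_flags         |= IFF_XMIT_DST_RELEASE;
}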
David Miller May 9, 2009, 8:40 p.m. UTC | #5
From: David Miller <davem@davemloft.net>
Date: Sat, 09 May 2009 13:34:54 -0700 (PDT)

> Consider the case where we always send some message on CPU A and
> then process the ACK on CPU B.  We'll always be cancelling the
> timer on a foreign cpu.

I should also mention that TCP has a peculiar optimization of timers
that is likely being thwarted by your workload.  It never deletes
timers under normal operation, it simply lets them still expire
and the handler notices that there is "nothing to do" and returns.

But when the connection does shut down, we have to purge all of
these timers.

That could be another part of why you see timers in your profile.
Eric Dumazet May 10, 2009, 7:09 a.m. UTC | #6
David Miller wrote:
> From: David Miller <davem@davemloft.net>
> Date: Sat, 09 May 2009 13:34:54 -0700 (PDT)
> 
>> Consider the case where we always send some message on CPU A and
>> then process the ACK on CPU B.  We'll always be cancelling the
>> timer on a foreign cpu.
> 
> I should also mention that TCP has a peculiar optimization of timers
> that is likely being thwarted by your workload.  It never deletes
> timers under normal operation, it simply lets them still expire
> and the handler notices that there is "nothing to do" and returns.

Yes, you are referring to the INET_CSK_CLEAR_TIMERS condition, which is never set.
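
For readers following along, this is roughly the mechanism in
include/net/inet_connection_sock.h (abridged): clearing a TCP timer
normally only clears the pending flag, and the timer itself is stopped
only if INET_CSK_CLEAR_TIMERS is defined, which it never is:

static inline void inet_csk_clear_xmit_timer(struct sock *sk, const int what)
{
        struct inet_connection_sock *icsk = inet_csk(sk);

        if (what == ICSK_TIME_RETRANS || what == ICSK_TIME_PROBE0) {
                icsk->icsk_pending = 0;
#ifdef INET_CSK_CLEAR_TIMERS
                sk_stop_timer(sk, &icsk->icsk_retransmit_timer);
#endif
        } else if (what == ICSK_TIME_DACK) {
                icsk->icsk_ack.blocked = icsk->icsk_ack.pending = 0;
#ifdef INET_CSK_CLEAR_TIMERS
                sk_stop_timer(sk, &icsk->icsk_delack_timer);
#endif
        }
        /* ... */
}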

> 
> But when the connection does shut down, we have to purge all of
> these timers.
> 
> That could be another part of why you see timers in your profile.
> 
> 

Well, in my workload they should never expire, since the application exchanges
enough data in both directions, and there are no losses (Gigabit LAN context).

On the machine acting as a server (the one I am focusing on, of course),
each incoming frame:

- Contains the ACK for the previously sent frame
- Contains data provided by the client
- Starts a timer for delayed ACK

Then the server application reacts and sends a new payload, and the TCP stack:
- Sends a frame including the ACK for the previously received frame
- Includes data provided by the server application
- Starts a timer for retransmitting this frame if no ACK is received later

So yes, each incoming and each outgoing frame is going to call mod_timer().

The problem is that incoming processing is done by CPU 0 (the one dedicated
to NAPI processing because of the stress situation, cpu 100% in softirq land),
and outgoing processing is done by the other cpus in the machine.

offsetof(struct inet_connection_sock, icsk_retransmit_timer)=0x208
offsetof(struct inet_connection_sock, icsk_delack_timer)=0x238

So there are cache line ping-pongs, but oprofile seems to point
to spinlock contention in lock_timer_base(), I don't know why...
Shouldn't (in my workload) all delack_timers belong to cpu 0, and
retransmit_timers to the other cpus?

Or does mod_timer never migrate an already established timer?

That would explain the lock contention on timer_base; we should
take care of it if possible.
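
For what it's worth, an abridged sketch of the kernel/timer.c path (as I
read the current code, so treat with care): mod_timer() first locks whatever
base the timer currently lives on, then re-homes a pending timer to the
local CPU's base unless its handler is running, so the two sides keep
locking each other's bases:

static struct tvec_base *lock_timer_base(struct timer_list *timer,
                                         unsigned long *flags)
{
        struct tvec_base *base;

        for (;;) {
                base = tbase_get_base(timer->base);
                if (likely(base != NULL)) {
                        spin_lock_irqsave(&base->lock, *flags);
                        if (likely(base == timer->base))
                                return base;
                        /* the timer migrated under us, retry */
                        spin_unlock_irqrestore(&base->lock, *flags);
                }
                cpu_relax();
        }
}

        /* later, in __mod_timer(), after lock_timer_base(): */
        new_base = __get_cpu_var(tvec_bases);
        if (base != new_base) {
                /* cannot re-home a timer whose handler is currently running */
                if (likely(base->running_timer != timer)) {
                        timer_set_base(timer, NULL);
                        spin_unlock(&base->lock);
                        base = new_base;
                        spin_lock(&base->lock);
                        timer_set_base(timer, base);
                }
        }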

Thanks David


Eric Dumazet May 10, 2009, 7:43 a.m. UTC | #7
Eric Dumazet wrote:
> David Miller wrote:
>> From: David Miller <davem@davemloft.net>
>> Date: Sat, 09 May 2009 13:34:54 -0700 (PDT)
>>
>>> Consider the case where we always send some message on CPU A and
>>> then process the ACK on CPU B.  We'll always be cancelling the
>>> timer on a foreign cpu.
>> I should also mention that TCP has a peculiar optimization of timers
>> that is likely being thwarted by your workload.  It never deletes
>> timers under normal operation, it simply lets them still expire
>> and the handler notices that there is "nothing to do" and returns.
> 
> Yes, you are referring to the INET_CSK_CLEAR_TIMERS condition, which is never set.
> 
>> But when the connection does shut down, we have to purge all of
>> these timers.
>>
>> That could be another part of why you see timers in your profile.
>>
>>
> 
> Well, in my workload they should never expire, since the application exchanges
> enough data in both directions, and there are no losses (Gigabit LAN context).
> 
> On the machine acting as a server (the one I am focusing on, of course),
> each incoming frame:
> 
> - Contains the ACK for the previously sent frame
> - Contains data provided by the client
> - Starts a timer for delayed ACK
> 
> Then the server application reacts and sends a new payload, and the TCP stack:
> - Sends a frame including the ACK for the previously received frame
> - Includes data provided by the server application
> - Starts a timer for retransmitting this frame if no ACK is received later
> 
> So yes, each incoming and each outgoing frame is going to call mod_timer().
> 
> The problem is that incoming processing is done by CPU 0 (the one dedicated
> to NAPI processing because of the stress situation, cpu 100% in softirq land),
> and outgoing processing is done by the other cpus in the machine.
> 
> offsetof(struct inet_connection_sock, icsk_retransmit_timer)=0x208
> offsetof(struct inet_connection_sock, icsk_delack_timer)=0x238
> 
> So there are cache line ping-pongs, but oprofile seems to point
> to spinlock contention in lock_timer_base(), I don't know why...
> Shouldn't (in my workload) all delack_timers belong to cpu 0, and
> retransmit_timers to the other cpus?
> 
> Or does mod_timer never migrate an already established timer?
> 
> That would explain the lock contention on timer_base; we should
> take care of it if possible.
> 

ftrace is my friend :)

The problem is that the application, when doing its recv() call,
ends up calling tcp_send_delayed_ack() too.

So yes, the cpus are fighting pretty hard over icsk_delack_timer and
their timer_bases.
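
The path in question is sk_reset_timer() from net/core/sock.c, shown here
for reference; the trace below reaches it from tcp_send_delayed_ack() on
the recv() side:

void sk_reset_timer(struct sock *sk, struct timer_list *timer,
                    unsigned long expires)
{
        /* mod_timer() returns 0 when the timer was not already pending;
         * only in that case do we take a socket reference for the timer.
         */
        if (!mod_timer(timer, expires))
                sock_hold(sk);
}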


2631.936051: finish_task_switch <-schedule
2631.936051: perf_counter_task_sched_in <-finish_task_switch
2631.936051: __perf_counter_sched_in <-perf_counter_task_sched_in
2631.936051: _spin_lock <-__perf_counter_sched_in
2631.936052: lock_sock_nested <-sk_wait_data
2631.936052: _spin_lock_bh <-lock_sock_nested
2631.936052: local_bh_disable <-_spin_lock_bh
2631.936052: local_bh_enable <-lock_sock_nested
2631.936052: finish_wait <-sk_wait_data
2631.936053: tcp_prequeue_process <-tcp_recvmsg
2631.936053: local_bh_disable <-tcp_prequeue_process
2631.936053: tcp_v4_do_rcv <-tcp_prequeue_process
2631.936053: tcp_rcv_established <-tcp_v4_do_rcv
2631.936054: local_bh_enable <-tcp_rcv_established
2631.936054: skb_copy_datagram_iovec <-tcp_rcv_established
2631.936054: memcpy_toiovec <-skb_copy_datagram_iovec
2631.936054: copy_to_user <-memcpy_toiovec
2631.936054: tcp_rcv_space_adjust <-tcp_rcv_established
2631.936055: local_bh_disable <-tcp_rcv_established
2631.936055: tcp_event_data_recv <-tcp_rcv_established
2631.936055: tcp_ack <-tcp_rcv_established
2631.936056: __kfree_skb <-tcp_ack
2631.936056: skb_release_head_state <-__kfree_skb
2631.936056: dst_release <-skb_release_head_state
2631.936056: skb_release_data <-__kfree_skb
2631.936056: put_page <-skb_release_data
2631.936057: kfree <-skb_release_data
2631.936057: kmem_cache_free <-__kfree_skb
2631.936057: tcp_valid_rtt_meas <-tcp_ack
2631.936058: bictcp_acked <-tcp_ack
2631.936058: bictcp_cong_avoid <-tcp_ack
2631.936058: tcp_is_cwnd_limited <-bictcp_cong_avoid
2631.936058: tcp_current_mss <-tcp_rcv_established
2631.936058: tcp_established_options <-tcp_current_mss
2631.936058: __tcp_push_pending_frames <-tcp_rcv_established
2631.936059: __tcp_ack_snd_check <-tcp_rcv_established
2631.936059: tcp_send_delayed_ack <-__tcp_ack_snd_check
2631.936059: sk_reset_timer <-tcp_send_delayed_ack
2631.936059: mod_timer <-sk_reset_timer
2631.936059: lock_timer_base <-mod_timer
2631.936059: _spin_lock_irqsave <-lock_timer_base
2631.936059: _spin_lock <-mod_timer
2631.936060: internal_add_timer <-mod_timer
2631.936064: _spin_unlock_irqrestore <-mod_timer
2631.936064: __kfree_skb <-tcp_rcv_established
2631.936064: skb_release_head_state <-__kfree_skb
2631.936064: dst_release <-skb_release_head_state
2631.936065: skb_release_data <-__kfree_skb
2631.936065: kfree <-skb_release_data
2631.936065: __slab_free <-kfree
2631.936065: add_partial <-__slab_free
2631.936065: _spin_lock <-add_partial
2631.936066: kmem_cache_free <-__kfree_skb
2631.936066: __slab_free <-kmem_cache_free
2631.936066: add_partial <-__slab_free
2631.936067: _spin_lock <-add_partial
2631.936067: local_bh_enable <-tcp_prequeue_process
2631.936067: tcp_cleanup_rbuf <-tcp_recvmsg
2631.936067: __tcp_select_window <-tcp_cleanup_rbuf
2631.936067: release_sock <-tcp_recvmsg
2631.936068: _spin_lock_bh <-release_sock
2631.936068: local_bh_disable <-_spin_lock_bh
2631.936068: _spin_unlock_bh <-release_sock
2631.936068: local_bh_enable_ip <-_spin_unlock_bh
2631.936068: fput <-sys_recvfrom


Patch

--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1213,14 +1213,17 @@  static inline int skb_copy_to_page(struct sock *sk, char __user *from,
  *
  * 	Inlined as it's very short and called for pretty much every
  *	packet ever received.
+ *	We take a reference on sock only if this is the first packet
+ *	accounted for. (one reference count for any number of in flight packets)
  */
 
 static inline void skb_set_owner_w(struct sk_buff *skb, struct sock *sk)
 {
-	sock_hold(sk);
 	skb->sk = sk;
 	skb->destructor = sock_wfree;
-	atomic_add(skb->truesize, &sk->sk_wmem_alloc);
+	if (atomic_add_return(skb->truesize, &sk->sk_wmem_alloc) ==
+	    skb->truesize)
+		sock_hold(sk);
 }
 
 static inline void skb_set_owner_r(struct sk_buff *skb, struct sock *sk)
diff --git a/net/core/sock.c b/net/core/sock.c
index 7dbf3ff..bf10084 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1170,12 +1170,17 @@  void __init sk_init(void)
 void sock_wfree(struct sk_buff *skb)
 {
 	struct sock *sk = skb->sk;
+	int res;
 
 	/* In case it might be waiting for more memory. */
-	atomic_sub(skb->truesize, &sk->sk_wmem_alloc);
+	res = atomic_sub_return(skb->truesize, &sk->sk_wmem_alloc);
 	if (!sock_flag(sk, SOCK_USE_WRITE_QUEUE))
 		sk->sk_write_space(sk);
-	sock_put(sk);
+	/*
+	 * No more packets in flight, we must release sock refcnt
+	 */
+	if (res == 0)
+		sock_put(sk);
 }
 
 /*