
net: reduce number of references taken on sk_refcnt

Message ID 4A06B064.9080801@cosmosbay.com
State RFC, archived
Delegated to: David Miller

Commit Message

Eric Dumazet May 10, 2009, 10:45 a.m. UTC
Eric Dumazet wrote:
> Eric Dumazet wrote:
>> David Miller wrote:
>>> From: David Miller <davem@davemloft.net>
>>> Date: Sat, 09 May 2009 13:34:54 -0700 (PDT)
>>>
>>>> Consider the case where we always send some message on CPU A and
>>>> then process the ACK on CPU B.  We'll always be cancelling the
>>>> timer on a foreign cpu.
>>> I should also mention that TCP has a peculiar optimization of timers
>>> that is likely being thwarted by your workload.  It never deletes
>>> timers under normal operation, it simply lets them still expire
>>> and the handler notices that there is "nothing to do" and returns.
>> Yes, you are referring to the INET_CSK_CLEAR_TIMERS condition, which is never set.
>>
>>> But when the connection does shut down, we have to purge all of
>>> these timers.
>>>
>>> That could be another part of why you see timers in your profile.
>>>
>>>
>> Well, in my workload they should never expire, since the application exchanges
>> enough data in both directions, and there are no losses (Gigabit LAN context).
>>
>> On the machine acting as a server (the one I am focusing on, of course),
>> each incoming frame:
>>
>> - Contains the ACK for the previously sent frame
>> - Contains data provided by the client
>> - Starts a timer for the delayed ACK
>>
>> Then the server application reacts and sends a new payload, and the TCP stack:
>> - Sends a frame including the ACK for the previously received frame
>> - Includes data provided by the server application
>> - Starts a timer for retransmitting this frame if no ACK is received later
>>
>> So yes, each incoming and each outgoing frame is going to call mod_timer()
>>
>> The problem is that incoming processing is done by CPU 0 (the one dedicated
>> to NAPI processing because of the stress situation, 100% CPU in softirq land),
>> while outgoing processing is done by the other CPUs in the machine.
>>
>> offsetof(struct inet_connection_sock, icsk_retransmit_timer)=0x208
>> offsetof(struct inet_connection_sock, icsk_delack_timer)=0x238
>>
>> So there are cache line ping-pongs, but oprofile seems to point
>> to spinlock contention in lock_timer_base(), and I don't know why...
>> shouldn't (in my workload) the delack_timers all belong to CPU 0, and
>> the retransmit_timers to the other CPUs?
>>
>> Or does mod_timer() never migrate an already established timer?
>>
>> That would explain the lock contention on timer_base; we should
>> take care of it if possible.
>>
> 
> ftrace is my friend :)
> 
> The problem is that the application, when doing its recv() call,
> is calling tcp_send_delayed_ack() too.
> 
> So yes, CPUs are fighting over icsk_delack_timer and their
> timer_base pretty hard.
> 
> 
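
For context, here is a simplified sketch (not the verbatim kernel source; names and
flag-bit handling are elided) of the locking pattern lock_timer_base() follows in
kernel/timer.c. It shows why two CPUs rearming the same timer serialize on the
owning CPU's timer base lock:

static struct tvec_base *lock_timer_base_sketch(struct timer_list *timer,
						unsigned long *flags)
{
	/*
	 * Any mod_timer() on icsk_delack_timer must first lock the per-CPU
	 * timer base the timer currently lives on, so the NAPI cpu (softirq)
	 * and the cpu running the task's recv()/tcp_send_delayed_ack() path
	 * end up spinning on the same spinlock.
	 */
	for (;;) {
		struct tvec_base *base = timer->base;

		if (likely(base != NULL)) {
			spin_lock_irqsave(&base->lock, *flags);
			if (likely(base == timer->base))
				return base;	/* still owned by this base */
			/* the timer migrated to another cpu's base, retry */
			spin_unlock_irqrestore(&base->lock, *flags);
		}
		cpu_relax();
	}
}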

Changing tcp_prequeue() to use the same delack timeout as tcp_send_delayed_ack()
helps a lot, but I don't know whether it is correct or not.

It was already changed recently in commit 0c266898b42fe4e4e2f9edfc9d3474c10f93aa6a

Author: Satoru SATOH <satoru.satoh@gmail.com>
Date:   Mon May 4 11:11:01 2009 -0700

    tcp: Fix tcp_prequeue() to get correct rto_min value

    tcp_prequeue() refers to the constant value (TCP_RTO_MIN) regardless of
    whether the actual value has been tuned. The following patches fix this and
    make tcp_prequeue() get the actual value returned from tcp_rto_min().

but since the reader will reset the timeout to an even smaller value, I wonder if
we should not select this smaller value sooner, to avoid a mod_timer() ping-pong.

TCP_RTO_MIN (200 ms) was too big, but (3 * tcp_rto_min(sk)) / 4 might
be too big too...
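
As a worked example, assuming the defaults (HZ=1000, no per-route rto_min override,
a flow whose ato has settled at its usual minimum), the values compare like this:

/*
 * tcp_rto_min(sk)              = TCP_RTO_MIN = HZ / 5  = 200 ms
 * (3 * tcp_rto_min(sk)) / 4                  = 150 ms
 * icsk_ack.ato (typical)       ~ TCP_ATO_MIN = HZ / 25 =  40 ms
 *
 * so min(icsk_ack.ato, (3 * tcp_rto_min(sk)) / 4) usually selects the
 * ~40 ms value the delayed ACK path would rearm to anyway, avoiding a
 * mod_timer() ping-pong between 150 ms and the smaller value.
 */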

New profile data for CPU 0 (no more timer stuff, and bandwidth improved by more
than 10%); back to the normal device driver and scheduler (wakeups) load.
Columns are samples, cumulative samples, %, cumulative %, symbol:

104452   104452        13.3363  13.3363    bnx2_poll_work
56103    160555         7.1631  20.4994    __wake_up
38650    199205         4.9348  25.4342    task_rq_lock
37947    237152         4.8450  30.2792    __slab_alloc
37229    274381         4.7533  35.0326    tcp_v4_rcv
34781    309162         4.4408  39.4734    __alloc_skb
27977    337139         3.5721  43.0454    skb_release_data
26516    363655         3.3855  46.4309    ip_rcv
26399    390054         3.3706  49.8015    resched_task
26059    416113         3.3272  53.1287    __inet_lookup_established
25518    441631         3.2581  56.3868    sock_wfree
23515    465146         3.0024  59.3892    ip_route_input
19811    484957         2.5294  61.9186    select_task_rq_fair
16901    501858         2.1579  64.0765    __kfree_skb
16466    518324         2.1024  66.1788    __enqueue_entity
16008    534332         2.0439  68.2227    sched_clock_cpu
15486    549818         1.9772  70.2000    try_to_wake_up
14641    564459         1.8693  72.0693    update_curr
13725    578184         1.7524  73.8217    enqueue_task_fair
13304    591488         1.6986  75.5203    __kmalloc_track_caller
13190    604678         1.6841  77.2044    kmem_cache_alloc
12016    616694         1.5342  78.7386    __tcp_prequeue
10989    627683         1.4031  80.1416    __wake_up_common
10623    638306         1.3563  81.4980    netif_receive_skb
10004    648310         1.2773  82.7753    place_entity


Patch follows for RFC only (no Signed-off-by...), and is based on net-next-2.6.



Comments

David Miller May 19, 2009, 4:58 a.m. UTC | #1
From: Eric Dumazet <dada1@cosmosbay.com>
Date: Sun, 10 May 2009 12:45:56 +0200

> Patch follows for RFC only (no Signed-off-by...), and is based on net-next-2.6.

Thanks for the analysis.

> @@ -922,10 +922,13 @@ static inline int tcp_prequeue(struct sock *sk, struct sk_buff *skb)
>  	} else if (skb_queue_len(&tp->ucopy.prequeue) == 1) {
>  		wake_up_interruptible_poll(sk->sk_sleep,
>  					   POLLIN | POLLRDNORM | POLLRDBAND);
> -		if (!inet_csk_ack_scheduled(sk))
> +		if (!inet_csk_ack_scheduled(sk)) {
> +			unsigned int delay = (3 * tcp_rto_min(sk)) / 4;
> +
> +			delay = min(inet_csk(sk)->icsk_ack.ato, delay);
>  			inet_csk_reset_xmit_timer(sk, ICSK_TIME_DACK,
> -						  (3 * tcp_rto_min(sk)) / 4,
> -						  TCP_RTO_MAX);
> +						  delay, TCP_RTO_MAX);
> +		}
>  	}
>  	return 1;

I think this code is trying to aggressively stretch the ACK when
prequeueing, in order to make sure there is enough time to get
the process onto the CPU and send a response, and thus piggyback
the ACK.
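
(As an illustration only, here is a simplified sketch of the path that makes the
piggyback work, reconstructed from net/ipv4/tcp_output.c of that era, so treat the
details as approximate: when the woken task answers before the delayed-ACK timer
fires, the ACK rides on the outgoing data segment and the pending DACK timer is
simply cleared.)

static void tcp_event_ack_sent_sketch(struct sock *sk, unsigned int pkts)
{
	/* an ACK just went out on a data segment: no delayed ACK needed */
	tcp_dec_quickack_mode(sk, pkts);
	inet_csk_clear_xmit_timer(sk, ICSK_TIME_DACK);
}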

If that turns out not to really matter, or matter less than your
problem, then we can make your change and I'm all for it.
Eric Dumazet May 21, 2009, 9:07 a.m. UTC | #2
David Miller wrote:
> From: Eric Dumazet <dada1@cosmosbay.com>
> Date: Sun, 10 May 2009 12:45:56 +0200
> 
>> Patch follows for RFC only (no Signed-off-by...), and is based on net-next-2.6.
> 
> Thanks for the analysis.
> 
>> @@ -922,10 +922,13 @@ static inline int tcp_prequeue(struct sock *sk, struct sk_buff *skb)
>>  	} else if (skb_queue_len(&tp->ucopy.prequeue) == 1) {
>>  		wake_up_interruptible_poll(sk->sk_sleep,
>>  					   POLLIN | POLLRDNORM | POLLRDBAND);
>> -		if (!inet_csk_ack_scheduled(sk))
>> +		if (!inet_csk_ack_scheduled(sk)) {
>> +			unsigned int delay = (3 * tcp_rto_min(sk)) / 4;
>> +
>> +			delay = min(inet_csk(sk)->icsk_ack.ato, delay);
>>  			inet_csk_reset_xmit_timer(sk, ICSK_TIME_DACK,
>> -						  (3 * tcp_rto_min(sk)) / 4,
>> -						  TCP_RTO_MAX);
>> +						  delay, TCP_RTO_MAX);
>> +		}
>>  	}
>>  	return 1;
> 
> I think this code is trying to aggressively stretch the ACK when
> prequeueing, in order to make sure there is enough time to get
> the process onto the CPU and send a response, and thus piggyback
> the ACK.
> 
> If that turns out not to really matter, or matter less than your
> problem, then we can make your change and I'm all for it.

This change gave me about a 15% increase in bandwidth in a multiflow
TCP benchmark. But this optimization only worked because tasks could be
woken up and send their answer within the same jiffy, 'rearming'
the xmit timer with the same value...

(135,000 messages received/sent per second in my benchmark, with 60 flows)

mod_timer() has a special heuristic to avoid calling __mod_timer():

int mod_timer(struct timer_list *timer, unsigned long expires)
{
        /*
         * This is a common optimization triggered by the
         * networking code - if the timer is re-modified
         * to be the same thing then just return:
         */
        if (timer->expires == expires && timer_pending(timer))
                return 1;

        return __mod_timer(timer, expires, false);
}

With HZ=1000 and real applications (needing more than 1 ms to process a request),
I suppose this kind of optimization is unlikely to trigger,
so we might extend the mod_timer() heuristic to avoid changing timer->expires
when the new value is almost the same as the previous one, not only "exactly the same value":

int mod_timer_unexact(struct timer_list *timer, unsigned long expires, long maxerror)
{
	/*
	 * This is a common optimization triggered by the
	 * networking code - if the timer is re-modified
	 * to be about the same thing then just return:
	 */
	if (timer_pending(timer)) {
		long delta = expires - timer->expires;

		if (delta <= maxerror && delta >= -maxerror)
			return 1;
	}
	return __mod_timer(timer, expires, false);
}
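
A hypothetical wrapper (sketch only: sk_reset_timer() is the existing helper in
net/core/sock.c, while the _unexact variant above is just this proposal) showing how
a socket timer user could opt into the tolerated error:

static inline void sk_reset_timer_unexact(struct sock *sk,
					  struct timer_list *timer,
					  unsigned long expires,
					  long maxerror)
{
	/* like sk_reset_timer(): take a socket reference only if the
	 * timer was not already pending */
	if (!mod_timer_unexact(timer, expires, maxerror))
		sock_hold(sk);
}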



But to be effective, prequeue needs a blocked reader for each flow, and
modern daemons prefer to use poll/epoll, so prequeue is not used.

Another possibility would be to use a separate timer for prequeue's
exclusive use, instead of sharing the xmit timer.
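
Something along these lines, as a purely hypothetical sketch (neither the structure
nor the helper exists; the expiry handler, which would drain tp->ucopy.prequeue much
like tcp_delack_timer() handles a pending ACK, is omitted):

/*
 * A timer reserved for prequeue, so the prequeue path never rearms
 * icsk_delack_timer and never fights the delayed-ACK machinery for
 * the same timer_base.
 */
struct tcp_prequeue {
	struct timer_list	timer;
};

static inline void tcp_prequeue_reset_timer(struct sock *sk,
					    struct tcp_prequeue *pq,
					    unsigned long timeout)
{
	/* sk_reset_timer() takes a socket reference if the timer
	 * was not already pending */
	sk_reset_timer(sk, &pq->timer, jiffies + timeout);
}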




Patch

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 19f4150..906f597 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -922,10 +922,13 @@  static inline int tcp_prequeue(struct sock *sk, struct sk_buff *skb)
 	} else if (skb_queue_len(&tp->ucopy.prequeue) == 1) {
 		wake_up_interruptible_poll(sk->sk_sleep,
 					   POLLIN | POLLRDNORM | POLLRDBAND);
-		if (!inet_csk_ack_scheduled(sk))
+		if (!inet_csk_ack_scheduled(sk)) {
+			unsigned int delay = (3 * tcp_rto_min(sk)) / 4;
+
+			delay = min(inet_csk(sk)->icsk_ack.ato, delay);
 			inet_csk_reset_xmit_timer(sk, ICSK_TIME_DACK,
-						  (3 * tcp_rto_min(sk)) / 4,
-						  TCP_RTO_MAX);
+						  delay, TCP_RTO_MAX);
+		}
 	}
 	return 1;
 }