From: Eric Dumazet
Date: Fri, 24 Apr 2009 01:07:06 +0200
Subject: Re: about latencies
To: "Brandeburg, Jesse"
Cc: Christoph Lameter, "David S. Miller", Linux Netdev List,
    Michael Chan, Ben Hutchings
Message-ID: <49F0F49A.1050609@cosmosbay.com>
References: <49F0E579.5030200@cosmosbay.com>

Brandeburg, Jesse wrote:
> On Thu, 23 Apr 2009, Eric Dumazet wrote:
>> Some time later, the NIC tells us TX was completed.
>> We free the skb.
>>
>> 1) dst_release() (might dirty one cache line whose refcount was
>>    increased by the application cpu)
>>
>> 2) and, more important... since UDP is now doing memory accounting...
>>
>>    sock_wfree()
>>      -> sock_def_write_space()
>>        -> _read_lock()
>>          -> __wake_up_sync_key()
>>
>>    and a lot of function calls to wake up the task, for nothing, since
>>    it will just schedule again. Lots of cache lines dirtied...
>>
>> We could improve this.
>>
>> 1) dst_release() at xmit time should save a cache line ping-pong in
>>    the general case
>> 2) sock_wfree() in advance, done at transmit time (generally by the
>>    thread/cpu doing the send)
>
> how much does this affect socket accounting? will the app then fill the
> hardware tx ring all the time because there is no application throttling
> due to delayed kfree?

The tx ring is limited to 256, 512 or 1024 elements, but yes, this might
defeat UDP memory accounting on the sending side, unless qdiscs are
used...

An alternative would be to separate the sleepers (those waiting for
input vs. those waiting for output) to avoid the extra wakeups. I am
pretty sure every network dev always wanted to do that eventually :)
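(For reference: releasing the socket state at transmit time is what the
generic skb_orphan() helper already does. A rough sketch of its 2.6.2x
definition from include/linux/skbuff.h, with the relevant effect noted:

static inline void skb_orphan(struct sk_buff *skb)
{
	/* run the destructor now, on the submitting cpu -- for a UDP send
	 * this is sock_wfree(), which returns the charged memory to the
	 * socket and wakes any writer -- so that the TX completion path
	 * later touches no socket cache lines at all */
	if (skb->destructor)
		skb->destructor(skb);
	skb->destructor = NULL;
	skb->sk = NULL;
}

Calling it unconditionally from a driver would drop write accounting for
every protocol, which is exactly the throttling concern above, so the
patch below is more conservative: it only handles the sock_wfree case
and keeps skb->sk.)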
>> 3) changing bnx2_poll_work() to first call bnx2_rx_int(), then
>>    bnx2_tx_int() to consume tx.
>
> at least all of the intel drivers that have a single vector (function)
> handling interrupts always call tx clean first, so that any tx buffers
> are free to be used immediately, because the NAPI calls can generate
> tx traffic (acks in the case of tcp, and full routed packet transmits
> in the case of forwarding)
>
> of course in the case of MSI-X (igb/ixgbe) the tx cleanup is most of
> the time handled independently (completely async) of rx.
>
>> What do you think ?
>
> you're running a latency sensitive test on a NOHZ kernel below, isn't
> that a bad idea?

I tried a worst case, to (eventually) match Christoph's data. I usually
do not use NOHZ, but what about linux distros?

> OT - the amount of timer code (*ns*) and spinlocks noted below seems
> generally disturbing.
>
>> function ftrace of one "tx completion, extra wakeup, incoming udp,
>> outgoing udp"
>
> thanks for posting this, very interesting to see the flow of calls. A
> ton of work is done to handle just two packets.

yes, it costs about 30000 cycles...

> might also be interesting to see what happens (how much shorter the
> call chain is) on a UP kernel.

Here is a preliminary patch that does this: not for inclusion, for
testing only, and comments are welcome. It saves more than 2 us in
preliminary tests (non-NOHZ kernel, CPU0 handling both the IRQ and the
application).

# udpping -n 10000 -l 40 192.168.20.110
udp ping 0.0.0.0:9001 -> 192.168.20.110:9000
10000 samples .... 742759.61us (70.86us/74.28us/480.32us)

BTW, UDP memory accounting was added in 2.6.25.

[RFC] bnx2: Optimizations

1) dst_release() at xmit time: should save a cache line ping-pong in
   the general case, where TX completion is done by another cpu.

2) sock_wfree() in advance, done at transmit time (generally by the
   thread/cpu doing the send) instead of at completion time, by another
   cpu.

This reduces the latency of a UDP receive/send pair by at least 2 us.

Signed-off-by: Eric Dumazet
---

diff --git a/drivers/net/bnx2.c b/drivers/net/bnx2.c
index d478391..1078c85 100644
--- a/drivers/net/bnx2.c
+++ b/drivers/net/bnx2.c
@@ -6168,7 +6168,13 @@ bnx2_start_xmit(struct sk_buff *skb, struct net_device *dev)
 
 	tx_buf = &txr->tx_buf_ring[ring_prod];
 	tx_buf->skb = skb;
-
+	dst_release(skb->dst);
+	skb->dst = NULL;
+	if (skb->destructor == sock_wfree) {
+		sock_wfree(skb);
+		skb->destructor = NULL;
+	}
+
 	txbd = &txr->tx_desc_ring[ring_prod];
 
 	txbd->tx_bd_haddr_hi = (u64) mapping >> 32;
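PS: point 3) is not part of the patch above. A sketch of what it could
look like in bnx2_poll_work(), assuming the current layout of that
function (untested, for discussion only):

	/* reap rx before tx: the incoming frame then wakes the task before
	 * the (useless) tx-completion wakeup fires. Note this trades away
	 * the "clean tx first so NAPI-generated traffic finds free tx
	 * descriptors" behaviour Jesse describes for the intel drivers. */
	if (bnx2_get_hw_rx_cons(bnapi) != rxr->rx_cons)
		work_done += bnx2_rx_int(bp, bnapi, budget - work_done);

	if (bnx2_get_hw_tx_cons(bnapi) != txr->hw_tx_cons)
		bnx2_tx_int(bp, bnapi, 0);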