From patchwork Fri May 8 15:12:39 2009
X-Patchwork-Submitter: Eric Dumazet
X-Patchwork-Id: 27016
X-Patchwork-Delegate: davem@davemloft.net
Message-ID: <4A044BE7.3070308@cosmosbay.com>
Date: Fri, 08 May 2009 17:12:39 +0200
From: Eric Dumazet
To: Krzysztof Halasa
CC: netdev@vger.kernel.org, "David S. Miller"
Subject: [PATCH] net: reduce number of reference taken on sk_refcnt
X-Mailing-List: netdev@vger.kernel.org

Krzysztof Halasa wrote:
> Hi,
>
> Could NAPI or something similar be used for TX in addition to RX?
> Perhaps some driver already does that?
>
> I'm specifically thinking of IXP4xx Ethernet (and possibly WAN), those
> are 266-667 MHz ARM XScale CPUs and the interrupts handling just
> transmitted sk_buffs are a pain. Could I delay those interrupts
> somehow?

This is a pain, yes, I agree. But some improvements can be made, even on
drivers already using NAPI, especially if the same interrupt is used for
both the RX and TX queues.

For example, we can avoid the dst_release() cache miss if it is done in
start_xmit(), and not later in TX completion while freeing the skb. I
tried various patches in the past, but unfortunately it seems the only
safe way to do this is in the driver xmit itself, not in the core
network stack. This would need many patches, one for each driver.

Then, if your workload is UDP packets, a recent patch avoided an
expensive wakeup at TX completion time (check commit
bf368e4e70cd4e0f880923c44e95a4273d725ab4,
http://git2.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=bf368e4e70cd4e0f880923c44e95a4273d725ab4
Unfortunately this patch is only for 2.6.30, since it is based on a new
infrastructure from Davide Libenzi.)

Then, additional work could be done to reduce cache line misses in
sock_wfree(). This function currently touches many different cache
lines: one for sk_wmem_alloc, one for testing sk_flags, one to decrement
sk_refcnt. And on UDP, extra cache lines because we call
sock_def_write_space(). Yes, this sucks. We could avoid touching
sk_refcnt in several cases.
We need a reference count only for the packets as a whole attached to a
socket, granted we use atomic_add_return() and such functions to detect
transitions to and from zero. That way, if a TCP stream always has some
packets in flight, TX completion would not have to decrement sk_refcnt.

The following patch (based on net-next-2.6) illustrates my point:

[PATCH] net: reduce number of reference taken on sk_refcnt

Current sk_wmem_alloc schema uses a sk_refcnt taken for each packet
in flight. This hurts some workloads at TX completion time, because
sock_wfree() has at least three cache lines to touch (one for
sk_wmem_alloc, one for testing sk_flags, one to decrement sk_refcnt).

We could use only one reference count, taken only when sk_wmem_alloc
is changed from or to the value zero (ie one reference count for any
number of in-flight packets).

Not every atomic_add() must be changed to atomic_add_return(), if we
know the current sk_wmem_alloc is already non null.

This patch reduces by one the number of cache lines dirtied in
sock_wfree(), and the number of atomic operations in some workloads.

Signed-off-by: Eric Dumazet

diff --git a/include/net/sock.h b/include/net/sock.h
index 4bb1ff9..d57e88b 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1213,14 +1213,17 @@ static inline int skb_copy_to_page(struct sock *sk, char __user *from,
  *
  * Inlined as it's very short and called for pretty much every
  * packet ever received.
+ * We take a reference on sock only if this is the first packet
+ * accounted for. (one reference count for any number of in flight packets)
  */

 static inline void skb_set_owner_w(struct sk_buff *skb, struct sock *sk)
 {
-	sock_hold(sk);
 	skb->sk = sk;
 	skb->destructor = sock_wfree;
-	atomic_add(skb->truesize, &sk->sk_wmem_alloc);
+	if (atomic_add_return(skb->truesize, &sk->sk_wmem_alloc) ==
+	    skb->truesize)
+		sock_hold(sk);
 }

 static inline void skb_set_owner_r(struct sk_buff *skb, struct sock *sk)
diff --git a/net/core/sock.c b/net/core/sock.c
index 7dbf3ff..bf10084 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1170,12 +1170,17 @@ void __init sk_init(void)
 void sock_wfree(struct sk_buff *skb)
 {
 	struct sock *sk = skb->sk;
+	int res;

 	/* In case it might be waiting for more memory. */
-	atomic_sub(skb->truesize, &sk->sk_wmem_alloc);
+	res = atomic_sub_return(skb->truesize, &sk->sk_wmem_alloc);
 	if (!sock_flag(sk, SOCK_USE_WRITE_QUEUE))
 		sk->sk_write_space(sk);
-	sock_put(sk);
+	/*
+	 * No more packets in flight, we must release sock refcnt
+	 */
+	if (res == 0)
+		sock_put(sk);
 }

 /*