From patchwork Fri May 8 15:12:39 2009
X-Patchwork-Submitter: Eric Dumazet
X-Patchwork-Id: 27016
X-Patchwork-Delegate: davem@davemloft.net
Message-ID: <4A044BE7.3070308@cosmosbay.com>
Date: Fri, 08 May 2009 17:12:39 +0200
From: Eric Dumazet
To: Krzysztof Halasa
CC: netdev@vger.kernel.org, "David S. Miller"
Subject: [PATCH] net: reduce number of reference taken on sk_refcnt
X-Mailing-List: netdev@vger.kernel.org

Krzysztof Halasa wrote:
> Hi,
>
> Could NAPI or something similar be used for TX in addition to RX?
> Perhaps some driver already does that?
>
> I'm specifically thinking of IXP4xx Ethernet (and possibly WAN), those
> are 266-667 MHz ARM XScale CPUs and the interrupts handling just
> transmitted sk_buffs are a pain. Could I delay those interrupts
> somehow?

This is a pain, yes, I agree. But some improvements can be made, even on
drivers already using NAPI, especially if the same interrupt is used for
both the RX and TX queues.

For example, we can avoid the dst_release() cache miss if it is done in
start_xmit(), and not later in TX completion while freeing the skb. I
tried various patches in the past, but unfortunately it seems the only
safe way to do this is in the driver xmit itself, not in the core
network stack. This would need many patches, one for each driver.

Then, if your workload is UDP packets, a recent patch avoided an
expensive wakeup at TX completion time (check commit
bf368e4e70cd4e0f880923c44e95a4273d725ab4,
http://git2.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=bf368e4e70cd4e0f880923c44e95a4273d725ab4
Unfortunately this patch is only for 2.6.30, since it is based on a new
infrastructure from Davide Libenzi.)

Then, additional work could be done to reduce cache line misses in
sock_wfree(). This function currently touches many different cache
lines: one for sk_wmem_alloc, one for testing sk_flags, one to decrement
sk_refcnt. And on UDP, extra cache lines because we call
sock_def_write_space(). Yes, this sucks. We could avoid touching
sk_refcnt in several cases.
We need a reference count only for the packets as a whole attached to a
socket, granted we use atomic_add_return() and such functions to detect
transitions to and from zero. That way, if a TCP stream always has some
packets in flight, TX completion would not have to decrement sk_refcnt.

The following patch (based on net-next-2.6) illustrates my point:

[PATCH] net: reduce number of reference taken on sk_refcnt

Current sk_wmem_alloc schema uses a sk_refcnt taken for each packet
in flight. This hurts some workloads at TX completion time, because
sock_wfree() has at least three cache lines to touch (one for
sk_wmem_alloc, one for testing sk_flags, one to decrement sk_refcnt).

We could use only one reference count, taken only when sk_wmem_alloc
is changed from or to the value zero (ie one reference count for any
number of in-flight packets).

Not every atomic_add() must be changed to atomic_add_return(), if we
know the current sk_wmem_alloc is already non null.

This patch reduces by one the number of cache lines dirtied in
sock_wfree(), and the number of atomic operations in some workloads.

Signed-off-by: Eric Dumazet

diff --git a/include/net/sock.h b/include/net/sock.h
index 4bb1ff9..d57e88b 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1213,14 +1213,17 @@ static inline int skb_copy_to_page(struct sock *sk, char __user *from,
  *
  * Inlined as it's very short and called for pretty much every
  * packet ever received.
+ * We take a reference on sock only if this is the first packet
+ * accounted for. (one reference count for any number of in flight packets)
  */

 static inline void skb_set_owner_w(struct sk_buff *skb, struct sock *sk)
 {
-	sock_hold(sk);
 	skb->sk = sk;
 	skb->destructor = sock_wfree;
-	atomic_add(skb->truesize, &sk->sk_wmem_alloc);
+	if (atomic_add_return(skb->truesize, &sk->sk_wmem_alloc) ==
+	    skb->truesize)
+		sock_hold(sk);
 }

 static inline void skb_set_owner_r(struct sk_buff *skb, struct sock *sk)
diff --git a/net/core/sock.c b/net/core/sock.c
index 7dbf3ff..bf10084 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1170,12 +1170,17 @@ void __init sk_init(void)
 void sock_wfree(struct sk_buff *skb)
 {
 	struct sock *sk = skb->sk;
+	int res;

 	/* In case it might be waiting for more memory. */
-	atomic_sub(skb->truesize, &sk->sk_wmem_alloc);
+	res = atomic_sub_return(skb->truesize, &sk->sk_wmem_alloc);
 	if (!sock_flag(sk, SOCK_USE_WRITE_QUEUE))
 		sk->sk_write_space(sk);
-	sock_put(sk);
+	/*
+	 * No more packets in flight, we must release sock refcnt
+	 */
+	if (res == 0)
+		sock_put(sk);
 }

 /*