From patchwork Wed Mar 25 18:41:27 2009
X-Patchwork-Submitter: Eric Dumazet
X-Patchwork-Id: 25098
X-Patchwork-Delegate: davem@davemloft.net
Message-ID: <49CA7AD7.8040401@cosmosbay.com>
Date: Wed, 25 Mar 2009 19:41:27 +0100
From: Eric Dumazet
To: Jarek Poplawski
CC: David Miller, netdev@vger.kernel.org
Subject: Re: [RFC] net: release dst entry in dev_queue_xmit()
References: <49C380A6.4000904@cosmosbay.com>
 <20090324.234354.43714160.davem@davemloft.net>
 <49C9D99A.2040900@cosmosbay.com>
 <20090325.001720.238121238.davem@davemloft.net>
 <49CA7658.4010400@gmail.com>
In-Reply-To: <49CA7658.4010400@gmail.com>

Jarek Poplawski wrote:
> David Miller wrote, On 03/25/2009 08:17 AM:
>
>> From: Eric Dumazet
>> Date: Wed, 25 Mar 2009 08:13:30 +0100
>>
>>> If done in dev_hard_start_xmit(), skb could be requeued (because of
>>> NETDEV_TX_BUSY). Then if requeued, maybe at this time, dst being
>>> NULL is not a problem ?
>> Usually it should be OK because the packet schedulers have
>> a sort-of one-behind state that allows them to reinsert
>> the SKB into their queue datastructures without reclassifying.
>
>
> Actually, since David has dumped requeuing there is no reinserting;
> this last one "requeued" skb is buffered at the top in q->gso_skb
> and waiting for better times.

Yes indeed, this is what I thought too, thanks Jarek.

I tested the following patch today on my machine, but obviously could not
try all possible quirks :)

[PATCH] net: release dst entry in dev_hard_start_xmit()

One point of contention under high network loads is the dst_release()
performed when a transmitted skb is freed. NIC tx completion frees the
skb long after the original call to dev_queue_xmit(skb), so the CPU cache
is cold by then and the atomic op in dst_release() stalls. On SMP this is
quite visible when one CPU spends 100% of its time handling softirqs for
a network device, since dst_clone() is done by the other cpus, causing
cache line ping-pongs.

I believe we can release the dst in dev_hard_start_xmit(), while the cpu
cache is still hot, since the caller of dev_queue_xmit() had to hold a
reference on the dst right before. This reduces the work to be done by
the softirq handler and decreases cache misses.
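For reference, this is roughly what the dst_release() being inlined here
looks like in current trees (a from-memory sketch of net/core/dst.c, not a
verbatim copy):

	void dst_release(struct dst_entry *dst)
	{
		if (dst) {
			int newrefcnt;

			smp_mb__before_atomic_dec();
			newrefcnt = atomic_dec_return(&dst->__refcnt);
			WARN_ON(newrefcnt < 0);
		}
	}

release_skb_dst() in the patch below performs the same decrement, but only
after checking skb->users == 1, i.e. only when nobody else can still be
looking at skb->dst.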
I also believe only pktgen can call dev_queue_xmit() with skbs that have
skb->users != 1, but pktgen skbs have a NULL dst entry. I added a
WARN_ON_ONCE() to catch other cases, and do not release skb->dst if
skb->users != 1.

Signed-off-by: Eric Dumazet
---
diff --git a/net/core/dev.c b/net/core/dev.c
index e3fe5c7..a622db6 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1664,6 +1664,26 @@ static int dev_gso_segment(struct sk_buff *skb)
 	return 0;
 }
 
+
+/*
+ * Release dst while its refcnt is likely hot in CPU cache, instead
+ * of waiting NIC tx completion.
+ * We inline dst_release() code for performance reason
+ */
+static void release_skb_dst(struct sk_buff *skb)
+{
+	if (likely(skb->dst)) {
+		if (!WARN_ON_ONCE(atomic_read(&skb->users) != 1)) {
+			int newrefcnt;
+
+			smp_mb__before_atomic_dec();
+			newrefcnt = atomic_dec_return(&skb->dst->__refcnt);
+			WARN_ON(newrefcnt < 0);
+			skb->dst = NULL;
+		}
+	}
+}
+
 int dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev,
 			struct netdev_queue *txq)
 {
@@ -1681,6 +1701,7 @@ int dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev,
 				goto gso;
 		}
 
+		release_skb_dst(skb);
 		return ops->ndo_start_xmit(skb, dev);
 	}
 
@@ -1691,6 +1712,7 @@ gso:
 
 		skb->next = nskb->next;
 		nskb->next = NULL;
+		release_skb_dst(nskb);
 		rc = ops->ndo_start_xmit(nskb, dev);
 		if (unlikely(rc)) {
 			nskb->next = skb->next;
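For context on the requeue point Jarek raised above: since the removal of
the qdisc ->requeue() path, a skb refused with NETDEV_TX_BUSY is simply
parked in q->gso_skb and retried later, roughly like this (a from-memory
sketch of net/sched/sch_generic.c, details may differ):

	static inline int dev_requeue_skb(struct sk_buff *skb, struct Qdisc *q)
	{
		/* park the skb; dequeue_skb() will hand it back out first */
		q->gso_skb = skb;
		__netif_schedule(q);

		return 0;
	}

Such a skb never goes back through ->enqueue(), so it is not reclassified,
and a NULL skb->dst at that point should not be a problem.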