From patchwork Sun Mar 8 16:46:13 2009
X-Patchwork-Submitter: Eric Dumazet
X-Patchwork-Id: 24179
X-Patchwork-Delegate: davem@davemloft.net
Message-ID: <49B3F655.6030308@cosmosbay.com>
Date: Sun, 08 Mar 2009 17:46:13 +0100
From: Eric Dumazet
To: kchang@athenacr.com
CC: David Miller, netdev@vger.kernel.org, cl@linux-foundation.org,
    Brian Bloniarz
Subject: Re: Multicast packet loss
In-Reply-To: <49B2266C.9050701@cosmosbay.com>

Eric Dumazet wrote:
>
> I have more questions:
>
> What is the maximum latency you can afford on the delivery of the packet(s)?
>
> Are the user apps using real-time scheduling?
>
> I had an idea: keep the CPU handling NIC interrupts busy only delivering
> packets to socket queues, not messing with the scheduler: fast queueing,
> then waking up a workqueue (on another CPU) to perform the scheduler work.
> But that means some extra latency (on the order of 2 or 3 us, I guess).
>
> We could enter this mode automatically if the NIC rx handler *sees* more
> than N packets waiting in the NIC queue: under moderate or light traffic,
> no extra latency would be added. This would mean some changes in the NIC
> driver.
>
> Hmm, then again, if the NIC rx handler is run from ksoftirqd, we already
> know we are in a stress situation, so maybe no driver changes are
> necessary: just test whether we are running in ksoftirqd...
>

Here is a patch that helps. It's still an RFC, of course, since it's somewhat
ugly :)

I am now able to run 8 receivers on my 8-CPU machine, with one multicast
packet every 10 us, without any loss (standard setup, no affinity games).

oprofile results show that the scheduler overhead has vanished; we are back
to a pure network profile :) (The first offender is now __copy_skb_header,
because of the atomic_inc on the dst refcount.)

CPU: Core 2, speed 3000.43 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit
mask of 0x00 (Unhalted core cycles) count 100000
samples  cum. samples  %        cum. %   symbol name
126329   126329       20.4296  20.4296  __copy_skb_header
 31395   157724        5.0771  25.5067  udp_queue_rcv_skb
 29191   186915        4.7207  30.2274  sock_def_readable
 26284   213199        4.2506  34.4780  sock_queue_rcv_skb
 26010   239209        4.2063  38.6842  kmem_cache_alloc
 20040   259249        3.2408  41.9251  __udp4_lib_rcv
 19570   278819        3.1648  45.0899  skb_queue_tail
 17799   296618        2.8784  47.9683  bnx2_poll_work
 17267   313885        2.7924  50.7606  skb_release_data
 14663   328548        2.3713  53.1319  __skb_recv_datagram
 14443   342991        2.3357  55.4676  __slab_alloc
 13248   356239        2.1424  57.6100  copy_to_user
 13138   369377        2.1246  59.7347  __sk_mem_schedule
 12004   381381        1.9413  61.6759  __skb_clone
 11924   393305        1.9283  63.6042  skb_clone
 11077   404382        1.7913  65.3956  lock_sock_nested
 10320   414702        1.6689  67.0645  ip_route_input
  9622   424324        1.5560  68.6205  dst_release
  8344   432668        1.3494  69.9699  __slab_free
  8124   440792        1.3138  71.2837  mwait_idle
  7066   447858        1.1427  72.4264  udp_recvmsg
  6652   454510        1.0757  73.5021  netif_receive_skb
  6386   460896        1.0327  74.5349  ipt_do_table
  6010   466906        0.9719  75.5068  release_sock
  6003   472909        0.9708  76.4776  memcpy_toiovec
  5697   478606        0.9213  77.3989  __alloc_skb
  5671   484277        0.9171  78.3160  copy_from_user
  5031   489308        0.8136  79.1296  sysenter_past_esp
  4753   494061        0.7686  79.8982  bnx2_interrupt
  4429   498490        0.7162  80.6145  sock_rfree

[PATCH] softirq: Introduce mechanism to defer wakeups

Some network workloads need to call the scheduler too many times. For
example, each received multicast frame can wake up many threads. ksoftirqd
is then unable to drain the NIC RX queues, and we get frame losses and high
latencies.

This patch adds an infrastructure to defer part of the work done in
sock_def_readable() to the end of do_softirq().

Signed-off-by: Eric Dumazet
---
 include/linux/interrupt.h |    9 +++++++++
 include/net/sock.h        |    1 +
 kernel/softirq.c          |   29 ++++++++++++++++++++++++++++-
 net/core/sock.c           |   21 +++++++++++++++++++--
 4 files changed, 57 insertions(+), 3 deletions(-)
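To make the intended behavior easy to see outside the kernel, here is a
minimal standalone sketch of the deferral idea (illustration only, not part
of the patch): the kernel's cmpxchg() is stood in for by the GCC
__sync_bool_compare_and_swap() builtin, the per-CPU list by a plain global,
and struct mock_sock, wakeup() and del_head are names invented for the
example. It shows the key property: several "readable" events for the same
socket within one softirq round collapse into a single deferred wakeup,
executed by softirqdel_exec() at the end.

/*
 * Userspace mock of the deferral mechanism (illustration only).
 * softirq_del() queues a callback at most once -- the CAS on ->func
 * coalesces duplicate requests -- and softirqdel_exec() runs all
 * queued callbacks, standing in for the hook at the end of do_softirq().
 */
#include <stdio.h>
#include <stddef.h>

struct softirq_del {
	struct softirq_del *next;
	void (*func)(struct softirq_del *);
};

static struct softirq_del *del_head;	/* per-CPU in the real patch */

static int softirq_del(struct softirq_del *sdel,
		       void (*func)(struct softirq_del *))
{
	/* Queue only if not already pending: CAS NULL -> func */
	if (__sync_bool_compare_and_swap(&sdel->func, NULL, func)) {
		sdel->next = del_head;
		del_head = sdel;
		return 1;
	}
	return 0;	/* already queued, caller skips the wakeup */
}

static void softirqdel_exec(void)
{
	struct softirq_del *sdel;

	while ((sdel = del_head) != NULL) {
		void (*func)(struct softirq_del *) = sdel->func;

		del_head = sdel->next;
		sdel->func = NULL;	/* re-arm for the next round */
		func(sdel);
	}
}

/* Invented for the example: a "socket" whose wakeup we defer. */
struct mock_sock {
	struct softirq_del sk_del;
	const char *name;
};

static void wakeup(struct softirq_del *sdel)
{
	struct mock_sock *sk = (struct mock_sock *)
		((char *)sdel - offsetof(struct mock_sock, sk_del));

	printf("waking readers of %s\n", sk->name);
}

int main(void)
{
	struct mock_sock a = { .name = "sock A" };

	/* Three frames for the same socket -> one deferred wakeup. */
	softirq_del(&a.sk_del, wakeup);
	softirq_del(&a.sk_del, wakeup);
	softirq_del(&a.sk_del, wakeup);

	softirqdel_exec();	/* prints the message exactly once */
	return 0;
}

Note that in the actual patch below, a successful softirq_del() also
transfers ownership of sk_callback_lock: sock_def_readable() returns with
the read lock still held, and sock_readable_defer() releases it after the
deferred wakeup.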
diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index 9127f6b..62caaae 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -295,6 +295,15 @@ extern void send_remote_softirq(struct call_single_data *cp, int cpu, int softir
 extern void __send_remote_softirq(struct call_single_data *cp, int cpu,
 				  int this_cpu, int softirq);
 
+/*
+ * delayed works : should be delayed at do_softirq() end
+ */
+struct softirq_del {
+	struct softirq_del	*next;
+	void			(*func)(struct softirq_del *);
+};
+int softirq_del(struct softirq_del *sdel, void (*func)(struct softirq_del *));
+
 /* Tasklets --- multithreaded analogue of BHs.
 
    Main feature differing them of generic softirqs: tasklet
diff --git a/include/net/sock.h b/include/net/sock.h
index eefeeaf..95841de 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -260,6 +260,7 @@ struct sock {
 	unsigned long		sk_lingertime;
 	struct sk_buff_head	sk_error_queue;
 	struct proto		*sk_prot_creator;
+	struct softirq_del	sk_del;
 	rwlock_t		sk_callback_lock;
 	int			sk_err,
 				sk_err_soft;
diff --git a/kernel/softirq.c b/kernel/softirq.c
index bdbe9de..40fe527 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -158,6 +158,33 @@ void local_bh_enable_ip(unsigned long ip)
 }
 EXPORT_SYMBOL(local_bh_enable_ip);
 
+
+static DEFINE_PER_CPU(struct softirq_del *, softirq_del_head);
+int softirq_del(struct softirq_del *sdel, void (*func)(struct softirq_del *))
+{
+	if (cmpxchg(&sdel->func, NULL, func) == NULL) {
+		sdel->next = __get_cpu_var(softirq_del_head);
+		__get_cpu_var(softirq_del_head) = sdel;
+		return 1;
+	}
+	return 0;
+}
+
+static void softirqdel_exec(void)
+{
+	struct softirq_del *sdel;
+	void (*func)(struct softirq_del *);
+
+	while ((sdel = __get_cpu_var(softirq_del_head)) != NULL) {
+		__get_cpu_var(softirq_del_head) = sdel->next;
+		func = sdel->func;
+		sdel->func = NULL;
+		(*func)(sdel);
+	}
+}
+
+
+
 /*
  * We restart softirq processing MAX_SOFTIRQ_RESTART times,
  * and we fall back to softirqd after that.
@@ -219,7 +246,7 @@ restart:
 
 	if (pending)
 		wakeup_softirqd();
-
+	softirqdel_exec();
 	trace_softirq_exit();
 
 	account_system_vtime(current);
diff --git a/net/core/sock.c b/net/core/sock.c
index 5f97caa..f9ee8dd 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1026,6 +1026,7 @@ struct sock *sk_clone(const struct sock *sk, const gfp_t priority)
 #endif
 
 		rwlock_init(&newsk->sk_dst_lock);
+		newsk->sk_del.func = NULL;
 		rwlock_init(&newsk->sk_callback_lock);
 		lockdep_set_class_and_name(&newsk->sk_callback_lock,
 				af_callback_keys + newsk->sk_family,
@@ -1634,12 +1635,27 @@ static void sock_def_error_report(struct sock *sk)
 	read_unlock(&sk->sk_callback_lock);
 }
 
+static void sock_readable_defer(struct softirq_del *sdel)
+{
+	struct sock *sk = container_of(sdel, struct sock, sk_del);
+
+	wake_up_interruptible_sync(sk->sk_sleep);
+	read_unlock(&sk->sk_callback_lock);
+}
+
 static void sock_def_readable(struct sock *sk, int len)
 {
 	read_lock(&sk->sk_callback_lock);
-	if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
-		wake_up_interruptible_sync(sk->sk_sleep);
 	sk_wake_async(sk, SOCK_WAKE_WAITD, POLL_IN);
+	if (sk->sk_sleep && waitqueue_active(sk->sk_sleep)) {
+		if (in_softirq()) {
+			if (!softirq_del(&sk->sk_del, sock_readable_defer))
+				goto unlock;
+			return;
+		}
+		wake_up_interruptible_sync(sk->sk_sleep);
+	}
+unlock:
 	read_unlock(&sk->sk_callback_lock);
 }
 
@@ -1720,6 +1736,7 @@ void sock_init_data(struct socket *sock, struct sock *sk)
 	sk->sk_sleep	=	NULL;
 
 	rwlock_init(&sk->sk_dst_lock);
+	sk->sk_del.func = NULL;
 	rwlock_init(&sk->sk_callback_lock);
 	lockdep_set_class_and_name(&sk->sk_callback_lock,
 			af_callback_keys + sk->sk_family,