Message ID: 49B3F655.6030308@cosmosbay.com
State: RFC, archived
Delegated to: David Miller
From: Eric Dumazet <dada1@cosmosbay.com>
Date: Sun, 08 Mar 2009 17:46:13 +0100

> +	if (sk->sk_sleep && waitqueue_active(sk->sk_sleep)) {
> +		if (in_softirq()) {
> +			if (!softirq_del(&sk->sk_del, sock_readable_defer))
> +				goto unlock;
> +			return;
> +		}

This is interesting.

I think you should make softirq_del() more flexible.  Make it the
socket's job to make sure it doesn't try to defer different functions,
and put the onus on locking there as well.

The cmpxchg() and all of this checking is just wasted work.

I'd really like to get rid of that callback lock too, then we'd really
be in business. :-)
David Miller wrote:
> From: Eric Dumazet <dada1@cosmosbay.com>
> Date: Sun, 08 Mar 2009 17:46:13 +0100
>
>> +	if (sk->sk_sleep && waitqueue_active(sk->sk_sleep)) {
>> +		if (in_softirq()) {
>> +			if (!softirq_del(&sk->sk_del, sock_readable_defer))
>> +				goto unlock;
>> +			return;
>> +		}
>
> This is interesting.
>
> I think you should make softirq_del() more flexible.  Make it the
> socket's job to make sure it doesn't try to defer different functions,
> and put the onus on locking there as well.
>
> The cmpxchg() and all of this checking is just wasted work.
>
> I'd really like to get rid of that callback lock too, then we'd really
> be in business. :-)

First, thanks for your review David.

I chose cmpxchg() because I needed some form of exclusion here.
I first added a spinlock inside "struct softirq_del", then I realized
I could use cmpxchg() instead and keep the structure small. As the
synchronization is only needed at queueing time, we could pass the
address of a spinlock to the softirq_del() call.

Also, when an event was queued for later invocation, I needed to keep
a reference on "struct socket" to make sure it doesn't disappear before
the invocation. Not all sockets are RCU guarded (we added RCU only for
some protocols: TCP, UDP, ...), so I found keeping the read_lock on the
callback lock the easiest thing to do. I now realize we might overflow
preempt_count, so special care is needed.

About your first point, maybe we should make softirq_del() (a poor name,
I confess) take only one argument (a pointer to struct softirq_del), and
initialize the function pointer at socket init time. That would ensure
"struct softirq_del" is associated with one callback only. The cmpxchg()
test would then have to be done on the "next" field (or use the spinlock
mentioned above).

I am not sure the output path needs such tricks, since threads rarely
block on output: we don't trigger 400,000 wakeups per second there.

Another point: I ran a tbench test and got 2517 MB/s with the patch,
instead of 2538 MB/s (using Linus' 2.6 git tree); that's a ~0.8 %
regression for this workload.
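For illustration, a rough sketch of that single-argument variant (hypothetical code, not part of the posted patch): the callback is bound once at socket init time, and the "already queued" test moves to the "next" field. SDEL_IDLE is an assumed sentinel, since NULL cannot distinguish "not queued" from "last element of the per-cpu list".

/*
 * Hypothetical sketch only (not the posted patch): softirq_del() with a
 * single argument.  The callback is installed once at init time, and the
 * "already queued" test is done on ->next using an assumed SDEL_IDLE
 * sentinel.
 */
#define SDEL_IDLE	((struct softirq_del *)1UL)

static DEFINE_PER_CPU(struct softirq_del *, softirq_del_head);

void softirq_del_init(struct softirq_del *sdel,
		      void (*func)(struct softirq_del *))
{
	sdel->func = func;	/* bound once, e.g. from sock_init_data() */
	sdel->next = SDEL_IDLE;	/* not queued */
}

/* Called from softirq context; returns 1 if queued, 0 if already pending. */
int softirq_del(struct softirq_del *sdel)
{
	if (cmpxchg(&sdel->next, SDEL_IDLE,
		    __get_cpu_var(softirq_del_head)) != SDEL_IDLE)
		return 0;
	__get_cpu_var(softirq_del_head) = sdel;
	return 1;
}

/* Run the pending callbacks at the end of do_softirq(). */
static void softirqdel_exec(void)
{
	struct softirq_del *sdel;

	while ((sdel = __get_cpu_var(softirq_del_head)) != NULL) {
		__get_cpu_var(softirq_del_head) = sdel->next;
		sdel->next = SDEL_IDLE;	/* mark "not queued" before invoking */
		sdel->func(sdel);
	}
}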
Eric Dumazet wrote:
> Here is a patch that helps. It's still an RFC of course, since it's somewhat ugly :)
Hi Eric,
I did some experimenting with this patch today -- we're users, not kernel hackers,
but the performance looks great. We see no loss with mcasttest, and no loss with
our internal test programs (which do much more user-space work). We're very
encouraged :)
One thing I'm curious about: previously, setting /proc/irq/<eth0>/smp_affinity
to one CPU made things perform better, but with this patch, performance is better
with smp_affinity == ff than with smp_affinity == 1. Do you know why that
is? Our tests are all with bnx2 msi_disable=1. I can investigate with oprofile
tomorrow.
Thank you for your continued help, we all deeply appreciate having someone
looking at this workload.
Thanks,
Brian Bloniarz
Brian Bloniarz wrote:
> Eric Dumazet wrote:
>> Here is a patch that helps. It's still an RFC of course, since it's
>> somewhat ugly :)
>
> Hi Eric,
>
> I did some experimenting with this patch today -- we're users, not
> kernel hackers, but the performance looks great. We see no loss with
> mcasttest, and no loss with our internal test programs (which do much
> more user-space work). We're very encouraged :)
>
> One thing I'm curious about: previously, setting
> /proc/irq/<eth0>/smp_affinity to one CPU made things perform better,
> but with this patch, performance is better with smp_affinity == ff
> than with smp_affinity == 1. Do you know why that is? Our tests are
> all with bnx2 msi_disable=1. I can investigate with oprofile tomorrow.
>

Well, smp_affinity could help, in my opinion, if you dedicate one cpu
to the NIC and the others to user apps, when the average work done per
packet is large. If the load is light, it's better to use the same cpu
to perform all the work, since no expensive bus traffic is needed
between cpus to exchange memory lines.

If you only change /proc/irq/<eth0>/smp_affinity and let the scheduler
choose any cpu for your user-space work, you can get long latencies,
and I would not expect better performance. Try to cpu-affine your tasks
to 0xFE to get better determinism.
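As an illustration of that split (hypothetical helper, not from the thread; it assumes an 8-cpu box with the NIC interrupt already routed to cpu 0 via /proc/irq/<eth0-irq>/smp_affinity), the receiving task can pin itself to cpus 1-7:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Hypothetical helper: pin the calling task to the cpus set in "mask"
 * (0xfe = cpus 1-7), leaving cpu 0 to the NIC interrupt.  Routing the
 * interrupt itself is still done separately through /proc.
 */
static void pin_to_mask(unsigned long mask)
{
	cpu_set_t set;
	int cpu;

	CPU_ZERO(&set);
	for (cpu = 0; cpu < (int)(8 * sizeof(mask)); cpu++)
		if (mask & (1UL << cpu))
			CPU_SET(cpu, &set);

	if (sched_setaffinity(0, sizeof(set), &set) != 0) {
		perror("sched_setaffinity");
		exit(1);
	}
}

int main(void)
{
	pin_to_mask(0xfe);	/* cpus 1-7 only, cpu 0 left for the NIC */
	/* ... the receive loop of the test program would run here ... */
	return 0;
}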
Hi Eric,

FYI: with your patch applied and lockdep enabled, I see:

[ 39.114628] ================================================
[ 39.121964] [ BUG: lock held when returning to user space! ]
[ 39.127704] ------------------------------------------------
[ 39.133461] msgtest/5242 is leaving the kernel with locks still held!
[ 39.140132] 1 lock held by msgtest/5242:
[ 39.144287] #0: (clock-AF_INET){-.-?}, at: [<ffffffff8041f5b9>] sock_def_readable+0x19/0xb0

I can't reproduce this with the mcasttest program yet; it was with an
internal test program which does some userspace processing on the
messages. I'll let you know if I find a way to reproduce it with a
simple program I can share.

> Well, smp_affinity could help, in my opinion, if you dedicate one cpu
> to the NIC and the others to user apps, when the average work done per
> packet is large. If the load is light, it's better to use the same cpu
> to perform all the work, since no expensive bus traffic is needed
> between cpus to exchange memory lines.

I tried this setup as well: an 8-core box with 4 userspace processes,
each affined to an individual CPU 1-4. The IRQ was on CPU0. On most
kernels, this setup loses fewer packets than the default affinity
(though they both lose some). With your patch enabled, the default
affinity loses 0 packets, and this setup loses some.

Thanks,
Brian Bloniarz
From: Eric Dumazet <dada1@cosmosbay.com>
Date: Mon, 09 Mar 2009 07:36:57 +0100

> I chose cmpxchg() because I needed some form of exclusion here.
> I first added a spinlock inside "struct softirq_del", then I realized
> I could use cmpxchg() instead and keep the structure small. As the
> synchronization is only needed at queueing time, we could pass the
> address of a spinlock to the softirq_del() call.

I don't understand why you need the mutual exclusion in the first
place. The function pointer always has the same value. And this
locking isn't protecting the list insertion either, as that isn't
even necessary.

It just looks like plain overhead to me.

> Also, when an event was queued for later invocation, I needed to keep
> a reference on "struct socket" to make sure it doesn't disappear before
> the invocation. Not all sockets are RCU guarded (we added RCU only for
> some protocols: TCP, UDP, ...), so I found keeping the read_lock on the
> callback lock the easiest thing to do. I now realize we might overflow
> preempt_count, so special care is needed.

You're using this in UDP so... make the rule that you can't use this
with a non-RCU-quiescent protocol.

> About your first point, maybe we should make softirq_del() (a poor name,
> I confess) take only one argument (a pointer to struct softirq_del), and
> initialize the function pointer at socket init time. That would ensure
> "struct softirq_del" is associated with one callback only. The cmpxchg()
> test would then have to be done on the "next" field (or use the spinlock
> mentioned above).

Why? You run this from softirq safe context; nothing can run other
softirqs on this cpu and corrupt the list, therefore.
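One hypothetical reading of that suggestion (sketch only, not code from the thread): if softirq_del() only ever runs in softirq context, where softirqs cannot nest on a cpu, and the caller guarantees that a given sdel is never queued concurrently from another cpu, the queueing needs no atomic operation at all.

/*
 * Hypothetical simplification under the assumptions above: a plain test
 * of ->func (non-NULL means "already pending", as in the posted patch)
 * replaces the cmpxchg(), and the per-cpu list insertion is unlocked.
 */
int softirq_del(struct softirq_del *sdel, void (*func)(struct softirq_del *))
{
	if (sdel->func != NULL)		/* a wakeup is already pending */
		return 0;
	sdel->func = func;
	sdel->next = __get_cpu_var(softirq_del_head);
	__get_cpu_var(softirq_del_head) = sdel;
	return 1;
}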
David Miller wrote:
> From: Eric Dumazet <dada1@cosmosbay.com>
> Date: Mon, 09 Mar 2009 07:36:57 +0100
>
>> I chose cmpxchg() because I needed some form of exclusion here.
>> I first added a spinlock inside "struct softirq_del", then I realized
>> I could use cmpxchg() instead and keep the structure small. As the
>> synchronization is only needed at queueing time, we could pass the
>> address of a spinlock to the softirq_del() call.
>
> I don't understand why you need the mutual exclusion in the first
> place. The function pointer always has the same value. And this
> locking isn't protecting the list insertion either, as that isn't
> even necessary.
>
> It just looks like plain overhead to me.

I was too lazy to check that all callers (all protocols) hold a lock on
the sock, and preferred safety. I was fooled by the read_lock(), and
thought several cpus could call this function in parallel.

>> Also, when an event was queued for later invocation, I needed to keep
>> a reference on "struct socket" to make sure it doesn't disappear before
>> the invocation. Not all sockets are RCU guarded (we added RCU only for
>> some protocols: TCP, UDP, ...), so I found keeping the read_lock on the
>> callback lock the easiest thing to do. I now realize we might overflow
>> preempt_count, so special care is needed.
>
> You're using this in UDP so... make the rule that you can't use this
> with a non-RCU-quiescent protocol.

UDP/TCP only? I thought many other protocols (not all of them using RCU)
were using sock_def_readable() too...
From: Eric Dumazet <dada1@cosmosbay.com>
Date: Fri, 13 Mar 2009 23:30:31 +0100

> David Miller wrote:
> >> Also, when an event was queued for later invocation, I needed to keep
> >> a reference on "struct socket" to make sure it doesn't disappear before
> >> the invocation. Not all sockets are RCU guarded (we added RCU only for
> >> some protocols: TCP, UDP, ...), so I found keeping the read_lock on the
> >> callback lock the easiest thing to do. I now realize we might overflow
> >> preempt_count, so special care is needed.
> >
> > You're using this in UDP so... make the rule that you can't use this
> > with a non-RCU-quiescent protocol.
>
> UDP/TCP only? I thought many other protocols (not all of them using RCU)
> were using sock_def_readable() too...

Maybe create an inet_def_readable() just for this purpose :-)
David Miller wrote:
> From: Eric Dumazet <dada1@cosmosbay.com>
> Date: Fri, 13 Mar 2009 23:30:31 +0100
>
>> David Miller wrote:
>>>> Also, when an event was queued for later invocation, I needed to keep
>>>> a reference on "struct socket" to make sure it doesn't disappear before
>>>> the invocation. Not all sockets are RCU guarded (we added RCU only for
>>>> some protocols: TCP, UDP, ...), so I found keeping the read_lock on the
>>>> callback lock the easiest thing to do. I now realize we might overflow
>>>> preempt_count, so special care is needed.
>>> You're using this in UDP so... make the rule that you can't use this
>>> with a non-RCU-quiescent protocol.
>> UDP/TCP only? I thought many other protocols (not all of them using RCU)
>> were using sock_def_readable() too...
>
> Maybe create an inet_def_readable() just for this purpose :-)

I must be tired; I should have had this idea before you :)

I'll post a new patch after some rest. I definitely should not still be
awake!
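For reference, a rough sketch of the direction being discussed (hypothetical code, not necessarily the patch that followed): sock_def_readable() stays as it is for every protocol, and RCU-protected inet sockets get their own sk_data_ready callback. The deferred wakeup runs before do_softirq() returns, so softirq context (an RCU read-side section) is assumed to keep the socket alive, and sk_callback_lock is no longer held across the deferral. It reuses the softirq_del() from the posted patch below.

static void inet_readable_defer(struct softirq_del *sdel)
{
	struct sock *sk = container_of(sdel, struct sock, sk_del);

	/* sk is assumed RCU-protected until do_softirq() completes;
	 * sk_sleep may have been cleared by a concurrent release. */
	if (sk->sk_sleep)
		wake_up_interruptible_sync(sk->sk_sleep);
}

static void inet_def_readable(struct sock *sk, int len)
{
	read_lock(&sk->sk_callback_lock);
	sk_wake_async(sk, SOCK_WAKE_WAITD, POLL_IN);

	if (sk->sk_sleep && waitqueue_active(sk->sk_sleep)) {
		if (in_softirq()) {
			/* queue at most one deferred wakeup per socket */
			softirq_del(&sk->sk_del, inet_readable_defer);
			goto unlock;
		}
		wake_up_interruptible_sync(sk->sk_sleep);
	}
unlock:
	read_unlock(&sk->sk_callback_lock);
}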
diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index 9127f6b..62caaae 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -295,6 +295,15 @@ extern void send_remote_softirq(struct call_single_data *cp, int cpu, int softir
 extern void __send_remote_softirq(struct call_single_data *cp, int cpu,
 				  int this_cpu, int softirq);
 
+/*
+ * delayed works : should be delayed at do_softirq() end
+ */
+struct softirq_del {
+	struct softirq_del *next;
+	void (*func)(struct softirq_del *);
+};
+int softirq_del(struct softirq_del *sdel, void (*func)(struct softirq_del *));
+
 /* Tasklets --- multithreaded analogue of BHs.
 
    Main feature differing them of generic softirqs: tasklet
diff --git a/include/net/sock.h b/include/net/sock.h
index eefeeaf..95841de 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -260,6 +260,7 @@ struct sock {
 	unsigned long		sk_lingertime;
 	struct sk_buff_head	sk_error_queue;
 	struct proto		*sk_prot_creator;
+	struct softirq_del	sk_del;
 	rwlock_t		sk_callback_lock;
 	int			sk_err,
 				sk_err_soft;
diff --git a/kernel/softirq.c b/kernel/softirq.c
index bdbe9de..40fe527 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -158,6 +158,33 @@ void local_bh_enable_ip(unsigned long ip)
 }
 EXPORT_SYMBOL(local_bh_enable_ip);
 
+
+static DEFINE_PER_CPU(struct softirq_del *, softirq_del_head);
+int softirq_del(struct softirq_del *sdel, void (*func)(struct softirq_del *))
+{
+	if (cmpxchg(&sdel->func, NULL, func) == NULL) {
+		sdel->next = __get_cpu_var(softirq_del_head);
+		__get_cpu_var(softirq_del_head) = sdel;
+		return 1;
+	}
+	return 0;
+}
+
+static void softirqdel_exec(void)
+{
+	struct softirq_del *sdel;
+	void (*func)(struct softirq_del *);
+
+	while ((sdel = __get_cpu_var(softirq_del_head)) != NULL) {
+		__get_cpu_var(softirq_del_head) = sdel->next;
+		func = sdel->func;
+		sdel->func = NULL;
+		(*func)(sdel);
+	}
+}
+
+
+
 /*
  * We restart softirq processing MAX_SOFTIRQ_RESTART times,
  * and we fall back to softirqd after that.
@@ -219,7 +246,7 @@ restart:
 
 	if (pending)
 		wakeup_softirqd();
-
+	softirqdel_exec();
 	trace_softirq_exit();
 
 	account_system_vtime(current);
diff --git a/net/core/sock.c b/net/core/sock.c
index 5f97caa..f9ee8dd 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1026,6 +1026,7 @@ struct sock *sk_clone(const struct sock *sk, const gfp_t priority)
 #endif
 
 		rwlock_init(&newsk->sk_dst_lock);
+		newsk->sk_del.func = NULL;
 		rwlock_init(&newsk->sk_callback_lock);
 		lockdep_set_class_and_name(&newsk->sk_callback_lock,
 				af_callback_keys + newsk->sk_family,
@@ -1634,12 +1635,27 @@ static void sock_def_error_report(struct sock *sk)
 	read_unlock(&sk->sk_callback_lock);
 }
 
+static void sock_readable_defer(struct softirq_del *sdel)
+{
+	struct sock *sk = container_of(sdel, struct sock, sk_del);
+
+	wake_up_interruptible_sync(sk->sk_sleep);
+	read_unlock(&sk->sk_callback_lock);
+}
+
 static void sock_def_readable(struct sock *sk, int len)
 {
 	read_lock(&sk->sk_callback_lock);
-	if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
-		wake_up_interruptible_sync(sk->sk_sleep);
 	sk_wake_async(sk, SOCK_WAKE_WAITD, POLL_IN);
+	if (sk->sk_sleep && waitqueue_active(sk->sk_sleep)) {
+		if (in_softirq()) {
+			if (!softirq_del(&sk->sk_del, sock_readable_defer))
+				goto unlock;
+			return;
+		}
+		wake_up_interruptible_sync(sk->sk_sleep);
+	}
+unlock:
 	read_unlock(&sk->sk_callback_lock);
 }
 
@@ -1720,6 +1736,7 @@ void sock_init_data(struct socket *sock, struct sock *sk)
 	sk->sk_sleep = NULL;
 
 	rwlock_init(&sk->sk_dst_lock);
+	sk->sk_del.func = NULL;
 	rwlock_init(&sk->sk_callback_lock);
 	lockdep_set_class_and_name(&sk->sk_callback_lock,
 			af_callback_keys + sk->sk_family,