Message ID:    49F71B63.8010503@cosmosbay.com
State:         Not Applicable, archived
Delegated to:  David Miller
For the udpping test load these patches have barely any effect:

git2p1 is the first version of the patch
git2p2 is the second version (this one)

Same CPU
Kernel             Test 1  Test 2  Test 3  Test 4  Average
2.6.22              83.03   82.90   82.89   82.92  82.935
2.6.23              83.35   82.81   82.83   82.86  82.9625
2.6.24              82.66   82.56   82.64   82.73  82.6475
2.6.25              84.28   84.29   84.37   84.30  84.31
2.6.26              84.72   84.38   84.41   84.68  84.5475
2.6.27              84.56   84.44   84.41   84.58  84.4975
2.6.28              84.70   84.43   84.47   84.48  84.52
2.6.29              84.91   84.67   84.69   84.75  84.755
2.6.30-rc2          84.94   84.72   84.69   84.93  84.82
2.6.30-rc3          84.88   84.70   84.73   84.89  84.80
2.6.30-rc3-git2p1   84.89   84.77   84.79   84.85  84.825
2.6.30-rc3-git2p2   84.91   84.79   84.78   84.80  84.82

Same Core
Kernel             Test 1  Test 2  Test 3  Test 4  Average
2.6.22              84.60   84.71   84.52   84.53  84.59
2.6.23              84.59   84.50   84.33   84.34  84.44
2.6.24              84.28   84.30   84.38   84.28  84.31
2.6.25              86.12   85.80   86.20   86.04  86.04
2.6.26              86.61   86.46   86.49   86.70  86.565
2.6.27              87.00   87.01   87.00   86.95  86.99
2.6.28              86.53   86.44   86.26   86.24  86.3675
2.6.29              85.88   85.94   86.10   85.69  85.9025
2.6.30-rc2          86.03   85.93   85.99   86.06  86.0025
2.6.30-rc3          85.73   85.88   85.67   85.94  85.805
2.6.30-rc3-git2p1   86.11   85.80   86.03   85.92  85.965
2.6.30-rc3-git2p2   86.04   85.96   85.89   86.04  85.9825

Same Socket
Kernel             Test 1  Test 2  Test 3  Test 4  Average
2.6.22              90.08   89.72   90.00   89.90  89.925
2.6.23              89.72   90.10   89.99   89.86  89.9175
2.6.24              89.18   89.28   89.25   89.22  89.2325
2.6.25              90.83   90.78   90.87   90.61  90.7725
2.6.26              90.51   91.25   91.80   91.69  91.3125
2.6.27              91.98   91.93   91.97   91.91  91.9475
2.6.28              91.72   91.70   91.84   91.75  91.7525
2.6.29              89.85   89.85   90.14   89.90  89.935
2.6.30-rc2          90.78   90.80   90.87   90.73  90.795
2.6.30-rc3          90.84   90.94   91.05   90.84  90.9175
2.6.30-rc3-git2p1   90.87   90.95   90.86   90.92  90.90
2.6.30-rc3-git2p2   91.09   91.01   90.97   91.06  91.0325

Different Socket
Kernel             Test 1  Test 2  Test 3  Test 4  Average
2.6.22              91.64   91.65   91.61   91.68  91.645
2.6.23              91.90   91.84   91.92   91.83  91.873
2.6.24              91.33   91.24   91.42   91.38  91.343
2.6.25              92.39   92.04   92.30   92.23  92.240
2.6.26              90.64   90.57   90.60   90.08  90.473
2.6.27              91.14   91.26   90.90   91.09  91.098
2.6.28              92.30   91.92   92.30   92.23  92.188
2.6.29              90.57   89.83   89.90   90.41  90.178
2.6.30-rc2          90.59   90.97   90.27   91.69  90.880
2.6.30-rc3          92.08   91.32   91.21   92.06  91.668
2.6.30-rc3-git2p1   91.46   91.38   91.92   91.03  91.448
2.6.30-rc3-git2p2   91.39   90.47   90.03   90.62  90.628

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Christoph Lameter wrote:
> For the udpping test load these patches have barely any effect:
>
> git2p1 is the first version of the patch
> git2p2 is the second version (this one)

But... udpping does *not* use poll() nor select(), unless I am mistaken?

If you really want to test this patch with udpping, you might add a
poll() call before recvfrom():

	while (1) {
+		struct pollfd pfd = { .fd = sock, .events = POLLIN };
+
+		poll(&pfd, 1, -1);
		nbytes = recvfrom(sock, msg, min(inblocksize, sizeof(msg)), 0,
				  &inad, &inadlen);
		if (nbytes < 0) {
			perror("recvfrom");
			break;
		}
		if (sendto(sock, msg, nbytes, 0, &inad, inadlen) < 0) {
			perror("sendto");
			break;
		}
	}

The part about recvfrom() wakeup avoidance is in David's net-2.6 tree,
and saves 2 us on udpping here. Is it what you call git2p1?
On Tue, 28 Apr 2009, Eric Dumazet wrote:

> The part about recvfrom() wakeup avoidance is in David's net-2.6 tree,
> and saves 2 us on udpping here.
>
> Is it what you call git2p1?

No, that is just the prior version of the poll/select improvements.
Which patch are you referring to?
On Tue, 28 Apr 2009 17:06:11 +0200 Eric Dumazet <dada1@cosmosbay.com> wrote:

> [PATCH] poll: Avoid extra wakeups in select/poll
>
> After introduction of the keyed wakeups Davide Libenzi did on epoll,
> we are able to avoid spurious wakeups in poll()/select() code too.
>
> For example, a typical use of poll()/select() is to wait for incoming
> network frames on many sockets. But TX completion for UDP/TCP frames
> calls sock_wfree(), which in turn schedules the thread.
>
> When scheduled, the thread does a full scan of all polled fds and can
> sleep again, because nothing is really available. If the number of fds
> is large, this causes significant load.
>
> This patch makes select()/poll() aware of keyed wakeups, so useless
> wakeups are avoided. This reduces the number of context switches by
> about 50% on some setups, and the work performed by softirq handlers.

Seems that this is a virtuous patch even though Christoph is struggling
a bit to test it?

>  fs/select.c          |   28 +++++++++++++++++++++++++---
>  include/linux/poll.h |    3 +++
>  2 files changed, 28 insertions(+), 3 deletions(-)
>
> diff --git a/fs/select.c b/fs/select.c
> index 0fe0e14..2708187 100644
> --- a/fs/select.c
> +++ b/fs/select.c
> @@ -168,7 +168,7 @@ static struct poll_table_entry *poll_get_entry(struct poll_wqueues *p)
>  	return table->entry++;
>  }
>
> -static int pollwake(wait_queue_t *wait, unsigned mode, int sync, void *key)
> +static int __pollwake(wait_queue_t *wait, unsigned mode, int sync, void *key)
>  {
>  	struct poll_wqueues *pwq = wait->private;
>  	DECLARE_WAITQUEUE(dummy_wait, pwq->polling_task);
> @@ -194,6 +194,16 @@ static int pollwake(wait_queue_t *wait, unsigned mode, int sync, void *key)
>  	return default_wake_function(&dummy_wait, mode, sync, key);
>  }
>
> +static int pollwake(wait_queue_t *wait, unsigned mode, int sync, void *key)
> +{
> +	struct poll_table_entry *entry;
> +
> +	entry = container_of(wait, struct poll_table_entry, wait);
> +	if (key && !((unsigned long)key & entry->key))
> +		return 0;
> +	return __pollwake(wait, mode, sync, key);
> +}
> +
>  /* Add a new entry */
>  static void __pollwait(struct file *filp, wait_queue_head_t *wait_address,
>  				poll_table *p)
> @@ -205,6 +215,7 @@ static void __pollwait(struct file *filp, wait_queue_head_t *wait_address,
>  	get_file(filp);
>  	entry->filp = filp;
>  	entry->wait_address = wait_address;
> +	entry->key = p->key;
>  	init_waitqueue_func_entry(&entry->wait, pollwake);
>  	entry->wait.private = pwq;
>  	add_wait_queue(wait_address, &entry->wait);
> @@ -418,8 +429,16 @@ int do_select(int n, fd_set_bits *fds, struct timespec *end_time)
>  			if (file) {
>  				f_op = file->f_op;
>  				mask = DEFAULT_POLLMASK;
> -				if (f_op && f_op->poll)
> +				if (f_op && f_op->poll) {
> +					if (wait) {
> +						wait->key = POLLEX_SET;
> +						if (in & bit)
> +							wait->key |= POLLIN_SET;
> +						if (out & bit)
> +							wait->key |= POLLOUT_SET;
> +					}
>  					mask = (*f_op->poll)(file, retval ? NULL : wait);
> +				}

<resizes xterm rather a lot>

Can we (and should we) avoid all that manipulation of wait->key if
`retval' is zero?

> --- a/include/linux/poll.h
> +++ b/include/linux/poll.h
> @@ -32,6 +32,7 @@ typedef void (*poll_queue_proc)(struct file *, wait_queue_head_t *, struct poll_
>
>  typedef struct poll_table_struct {
>  	poll_queue_proc qproc;
> +	unsigned long key;
>  } poll_table;
>
>  static inline void poll_wait(struct file * filp, wait_queue_head_t * wait_address, poll_table *p)
> @@ -43,10 +44,12 @@ static inline void poll_wait(struct file * filp, wait_queue_head_t * wait_addres
>  static inline void init_poll_funcptr(poll_table *pt, poll_queue_proc qproc)
>  {
>  	pt->qproc = qproc;
> +	pt->key = ~0UL; /* all events enabled */

I kind of prefer to use plain old -1 for the all-ones pattern. Because
it always just works, and doesn't send the reviewer off to check if the
type was really u64 or something.

It's a bit ugly though.

>  }
>
>  struct poll_table_entry {
>  	struct file *filp;
> +	unsigned long key;
>  	wait_queue_t wait;
>  	wait_queue_head_t *wait_address;
>  };
> Seems that this is a virtuous patch even though Christoph is struggling
> a bit to test it?

The main drawback is that the select/poll data structures will get
larger. That could cause a regression in theory, but I suspect the win
in some situations is still worth it. Of course, it would be nice if it
handled more situations (like multiple readers etc.).

-Andi
Andi Kleen wrote:
>> Seems that this is a virtuous patch even though Christoph is struggling
>> a bit to test it?
>
> The main drawback is that the select/poll data structures will get
> larger. That could cause a regression in theory, but I suspect the win
> in some situations is still worth it. Of course, it would be nice if it
> handled more situations (like multiple readers etc.).

At the poll()/select() interface we must wake up every poller, because we
don't know whether they will really consume the event:

	thread 1:	poll();
			<insert an exit() or something bad here>
			read();

	thread 2:	poll();	/* no return, because the event was
				 * 'grantedted' to thread 1 */
			read();

We could try to optimize read()/recvfrom(), because there we do know
whether the event is consumed, as it's a whole syscall.
Andrew Morton wrote:
> On Tue, 28 Apr 2009 17:06:11 +0200 Eric Dumazet <dada1@cosmosbay.com> wrote:
>
>> [PATCH] poll: Avoid extra wakeups in select/poll
>>
>> After introduction of the keyed wakeups Davide Libenzi did on epoll,
>> we are able to avoid spurious wakeups in poll()/select() code too.
>>
>> For example, a typical use of poll()/select() is to wait for incoming
>> network frames on many sockets. But TX completion for UDP/TCP frames
>> calls sock_wfree(), which in turn schedules the thread.
>>
>> When scheduled, the thread does a full scan of all polled fds and can
>> sleep again, because nothing is really available. If the number of fds
>> is large, this causes significant load.
>>
>> This patch makes select()/poll() aware of keyed wakeups, so useless
>> wakeups are avoided. This reduces the number of context switches by
>> about 50% on some setups, and the work performed by softirq handlers.
>
> Seems that this is a virtuous patch even though Christoph is struggling
> a bit to test it?
>
>>  fs/select.c          |   28 +++++++++++++++++++++++++---
>>  include/linux/poll.h |    3 +++
>>  2 files changed, 28 insertions(+), 3 deletions(-)
>>
>> diff --git a/fs/select.c b/fs/select.c
>> index 0fe0e14..2708187 100644
>> --- a/fs/select.c
>> +++ b/fs/select.c
>> @@ -168,7 +168,7 @@ static struct poll_table_entry *poll_get_entry(struct poll_wqueues *p)
>>  	return table->entry++;
>>  }
>>
>> -static int pollwake(wait_queue_t *wait, unsigned mode, int sync, void *key)
>> +static int __pollwake(wait_queue_t *wait, unsigned mode, int sync, void *key)
>>  {
>>  	struct poll_wqueues *pwq = wait->private;
>>  	DECLARE_WAITQUEUE(dummy_wait, pwq->polling_task);
>> @@ -194,6 +194,16 @@ static int pollwake(wait_queue_t *wait, unsigned mode, int sync, void *key)
>>  	return default_wake_function(&dummy_wait, mode, sync, key);
>>  }
>>
>> +static int pollwake(wait_queue_t *wait, unsigned mode, int sync, void *key)
>> +{
>> +	struct poll_table_entry *entry;
>> +
>> +	entry = container_of(wait, struct poll_table_entry, wait);
>> +	if (key && !((unsigned long)key & entry->key))
>> +		return 0;
>> +	return __pollwake(wait, mode, sync, key);
>> +}
>> +
>>  /* Add a new entry */
>>  static void __pollwait(struct file *filp, wait_queue_head_t *wait_address,
>>  				poll_table *p)
>> @@ -205,6 +215,7 @@ static void __pollwait(struct file *filp, wait_queue_head_t *wait_address,
>>  	get_file(filp);
>>  	entry->filp = filp;
>>  	entry->wait_address = wait_address;
>> +	entry->key = p->key;
>>  	init_waitqueue_func_entry(&entry->wait, pollwake);
>>  	entry->wait.private = pwq;
>>  	add_wait_queue(wait_address, &entry->wait);
>> @@ -418,8 +429,16 @@ int do_select(int n, fd_set_bits *fds, struct timespec *end_time)
>>  			if (file) {
>>  				f_op = file->f_op;
>>  				mask = DEFAULT_POLLMASK;
>> -				if (f_op && f_op->poll)
>> +				if (f_op && f_op->poll) {
>> +					if (wait) {
>> +						wait->key = POLLEX_SET;
>> +						if (in & bit)
>> +							wait->key |= POLLIN_SET;
>> +						if (out & bit)
>> +							wait->key |= POLLOUT_SET;
>> +					}
>>  					mask = (*f_op->poll)(file, retval ? NULL : wait);
>> +				}
>
> <resizes xterm rather a lot>
>
> Can we (and should we) avoid all that manipulation of wait->key if
> `retval' is zero?

Yes, we could set wait to NULL as soon as retval is incremented, and
then also do:

	mask = (*f_op->poll)(file, wait);

>> --- a/include/linux/poll.h
>> +++ b/include/linux/poll.h
>> @@ -32,6 +32,7 @@ typedef void (*poll_queue_proc)(struct file *, wait_queue_head_t *, struct poll_
>>
>>  typedef struct poll_table_struct {
>>  	poll_queue_proc qproc;
>> +	unsigned long key;
>>  } poll_table;
>>
>>  static inline void poll_wait(struct file * filp, wait_queue_head_t * wait_address, poll_table *p)
>> @@ -43,10 +44,12 @@ static inline void poll_wait(struct file * filp, wait_queue_head_t * wait_addres
>>  static inline void init_poll_funcptr(poll_table *pt, poll_queue_proc qproc)
>>  {
>>  	pt->qproc = qproc;
>> +	pt->key = ~0UL; /* all events enabled */
>
> I kind of prefer to use plain old -1 for the all-ones pattern. Because
> it always just works, and doesn't send the reviewer off to check if the
> type was really u64 or something.
>
> It's a bit ugly though.
>
>>  }
>>
>>  struct poll_table_entry {
>>  	struct file *filp;
>> +	unsigned long key;
>>  	wait_queue_t wait;
>>  	wait_queue_head_t *wait_address;
>>  };
* Eric Dumazet <dada1@cosmosbay.com> wrote:

> @@ -418,8 +429,16 @@ int do_select(int n, fd_set_bits *fds, struct timespec *end_time)
>  			if (file) {
>  				f_op = file->f_op;
>  				mask = DEFAULT_POLLMASK;
> -				if (f_op && f_op->poll)
> +				if (f_op && f_op->poll) {
> +					if (wait) {
> +						wait->key = POLLEX_SET;
> +						if (in & bit)
> +							wait->key |= POLLIN_SET;
> +						if (out & bit)
> +							wait->key |= POLLOUT_SET;
> +					}
>  					mask = (*f_op->poll)(file, retval ? NULL : wait);
> +				}
>  				fput_light(file, fput_needed);
>  				if ((mask & POLLIN_SET) && (in & bit)) {
>  					res_in |= bit;

Please factor this whole 'if (file)' branch out into a helper. Typical
indentation levels go from 1 to 3 tabs - 4 should be avoided if possible
and 5 is pretty excessive already. This goes to eight.

	Ingo
* Andi Kleen <andi@firstfloor.org> wrote:

> > Seems that this is a virtuous patch even though Christoph is
> > struggling a bit to test it?
>
> The main drawback is that the select/poll data structures will get
> larger. That could cause regression in theory. [...]

The current size of struct poll_table_entry is 0x38 on 64-bit kernels.
Adding the key will make it 0x40 - which is not only a power of two but
also matches the cache line size of most modern CPUs. So the size of
this structure is ideal now, and arithmetic on the poll table has become
simpler as well.

So the patch has my ack:

Acked-by: Ingo Molnar <mingo@elte.hu>

	Ingo
diff --git a/fs/select.c b/fs/select.c
index 0fe0e14..2708187 100644
--- a/fs/select.c
+++ b/fs/select.c
@@ -168,7 +168,7 @@ static struct poll_table_entry *poll_get_entry(struct poll_wqueues *p)
 	return table->entry++;
 }
 
-static int pollwake(wait_queue_t *wait, unsigned mode, int sync, void *key)
+static int __pollwake(wait_queue_t *wait, unsigned mode, int sync, void *key)
 {
 	struct poll_wqueues *pwq = wait->private;
 	DECLARE_WAITQUEUE(dummy_wait, pwq->polling_task);
@@ -194,6 +194,16 @@ static int pollwake(wait_queue_t *wait, unsigned mode, int sync, void *key)
 	return default_wake_function(&dummy_wait, mode, sync, key);
 }
 
+static int pollwake(wait_queue_t *wait, unsigned mode, int sync, void *key)
+{
+	struct poll_table_entry *entry;
+
+	entry = container_of(wait, struct poll_table_entry, wait);
+	if (key && !((unsigned long)key & entry->key))
+		return 0;
+	return __pollwake(wait, mode, sync, key);
+}
+
 /* Add a new entry */
 static void __pollwait(struct file *filp, wait_queue_head_t *wait_address,
 				poll_table *p)
@@ -205,6 +215,7 @@ static void __pollwait(struct file *filp, wait_queue_head_t *wait_address,
 	get_file(filp);
 	entry->filp = filp;
 	entry->wait_address = wait_address;
+	entry->key = p->key;
 	init_waitqueue_func_entry(&entry->wait, pollwake);
 	entry->wait.private = pwq;
 	add_wait_queue(wait_address, &entry->wait);
@@ -418,8 +429,16 @@ int do_select(int n, fd_set_bits *fds, struct timespec *end_time)
 			if (file) {
 				f_op = file->f_op;
 				mask = DEFAULT_POLLMASK;
-				if (f_op && f_op->poll)
+				if (f_op && f_op->poll) {
+					if (wait) {
+						wait->key = POLLEX_SET;
+						if (in & bit)
+							wait->key |= POLLIN_SET;
+						if (out & bit)
+							wait->key |= POLLOUT_SET;
+					}
 					mask = (*f_op->poll)(file, retval ? NULL : wait);
+				}
 				fput_light(file, fput_needed);
 				if ((mask & POLLIN_SET) && (in & bit)) {
 					res_in |= bit;
@@ -685,8 +704,11 @@ static inline unsigned int do_pollfd(struct pollfd *pollfd, poll_table *pwait)
 	mask = POLLNVAL;
 	if (file != NULL) {
 		mask = DEFAULT_POLLMASK;
-		if (file->f_op && file->f_op->poll)
+		if (file->f_op && file->f_op->poll) {
+			if (pwait)
+				pwait->key = pollfd->events | POLLERR | POLLHUP;
 			mask = file->f_op->poll(file, pwait);
+		}
 		/* Mask out unneeded events. */
 		mask &= pollfd->events | POLLERR | POLLHUP;
 		fput_light(file, fput_needed);
diff --git a/include/linux/poll.h b/include/linux/poll.h
index 8c24ef8..3327389 100644
--- a/include/linux/poll.h
+++ b/include/linux/poll.h
@@ -32,6 +32,7 @@ typedef void (*poll_queue_proc)(struct file *, wait_queue_head_t *, struct poll_
 
 typedef struct poll_table_struct {
 	poll_queue_proc qproc;
+	unsigned long key;
 } poll_table;
 
 static inline void poll_wait(struct file * filp, wait_queue_head_t * wait_address, poll_table *p)
@@ -43,10 +44,12 @@ static inline void poll_wait(struct file * filp, wait_queue_head_t * wait_addres
 static inline void init_poll_funcptr(poll_table *pt, poll_queue_proc qproc)
 {
 	pt->qproc = qproc;
+	pt->key = ~0UL; /* all events enabled */
 }
 
 struct poll_table_entry {
 	struct file *filp;
+	unsigned long key;
 	wait_queue_t wait;
 	wait_queue_head_t *wait_address;
 };