Message ID | 1417084026-12307-4-git-send-email-pl@kamp.de
State      | New
On 27/11/2014 11:27, Peter Lieven wrote:
> +static __thread struct CoRoutinePool {
> +    Coroutine *ptrs[POOL_MAX_SIZE];
> +    unsigned int size;
> +    unsigned int nextfree;
> +} CoPool;

The per-thread ring unfortunately didn't work well last time it was tested. Devices that do not use ioeventfd (not just the slow ones, even decently performing ones like ahci, nvme or megasas) will create the coroutine in the VCPU thread, and destroy it in the iothread. The result is that coroutines cannot be reused.

Can you check if this is still the case?

Paolo
Am 27.11.2014 um 17:40 schrieb Paolo Bonzini:
> On 27/11/2014 11:27, Peter Lieven wrote:
>> +static __thread struct CoRoutinePool {
>> +    Coroutine *ptrs[POOL_MAX_SIZE];
>> +    unsigned int size;
>> +    unsigned int nextfree;
>> +} CoPool;
>>
> The per-thread ring unfortunately didn't work well last time it was
> tested. Devices that do not use ioeventfd (not just the slow ones, even
> decently performing ones like ahci, nvme or megasas) will create the
> coroutine in the VCPU thread, and destroy it in the iothread. The
> result is that coroutines cannot be reused.
>
> Can you check if this is still the case?

I already tested, at least for IDE and for ioeventfd=off. The coroutine is created in the vCPU thread and destroyed in the I/O thread.

I also have a more complicated version which sets a per-thread coroutine pool only for dataplane, avoiding the lock for dedicated iothreads.

For those who want to take a look:

https://github.com/plieven/qemu/commit/325bc4ef5c7039337fa785744b145e2bdbb7b62e

Peter
Am 28.11.2014 um 11:28 schrieb Paolo Bonzini:
> On 28/11/2014 09:13, Peter Lieven wrote:
>> Am 27.11.2014 um 17:40 schrieb Paolo Bonzini:
>>> On 27/11/2014 11:27, Peter Lieven wrote:
>>>> +static __thread struct CoRoutinePool {
>>>> +    Coroutine *ptrs[POOL_MAX_SIZE];
>>>> +    unsigned int size;
>>>> +    unsigned int nextfree;
>>>> +} CoPool;
>>>>
>>> The per-thread ring unfortunately didn't work well last time it was
>>> tested. Devices that do not use ioeventfd (not just the slow ones, even
>>> decently performing ones like ahci, nvme or megasas) will create the
>>> coroutine in the VCPU thread, and destroy it in the iothread. The
>>> result is that coroutines cannot be reused.
>>>
>>> Can you check if this is still the case?
>> I already tested, at least for IDE and for ioeventfd=off. The coroutine
>> is created in the vCPU thread and destroyed in the I/O thread.
>>
>> I also have a more complicated version which sets a per-thread coroutine
>> pool only for dataplane, avoiding the lock for dedicated iothreads.
>>
>> For those who want to take a look:
>>
>> https://github.com/plieven/qemu/commit/325bc4ef5c7039337fa785744b145e2bdbb7b62e
> Can you test it against the patch I just sent in Kevin's linux-aio
> coroutine thread?

Was already doing it ;-) At least with test-coroutine.c...

master:
Run operation 40000000 iterations 12.851414 s, 3112K operations/s, 321ns per coroutine

paolo:
Run operation 40000000 iterations 11.951720 s, 3346K operations/s, 298ns per coroutine

plieven/perf_master2:
Run operation 40000000 iterations 9.013785 s, 4437K operations/s, 225ns per coroutine

plieven/perf_master:
Run operation 40000000 iterations 11.072883 s, 3612K operations/s, 276ns per coroutine

However, perf_master and perf_master2 seem to have a regression regarding nesting.

@Kevin: Could that be the reason why they perform badly in some scenarios?

Regarding the bypass that is discussed: if it is not just a benchmark thing but really necessary for some people's use cases, why not add a new aio mode like "bypass" and use it only then? If the performance is really needed, the user might trade it in for lost features like I/O throttling, filters etc.

Peter
> master:
> Run operation 40000000 iterations 12.851414 s, 3112K operations/s, 321ns per coroutine
>
> paolo:
> Run operation 40000000 iterations 11.951720 s, 3346K operations/s, 298ns per coroutine

Nice. :)

Can you please try "coroutine: Use __thread … " together, too? I still see 11% time spent in pthread_getspecific, and I get ~10% more indeed if I apply it here (my times are 191/160/145).

Paolo
Am 28.11.2014 um 12:14 schrieb Paolo Bonzini:
>> master:
>> Run operation 40000000 iterations 12.851414 s, 3112K operations/s, 321ns per coroutine
>>
>> paolo:
>> Run operation 40000000 iterations 11.951720 s, 3346K operations/s, 298ns per coroutine
> Nice. :)
>
> Can you please try "coroutine: Use __thread … " together, too? I still
> see 11% time spent in pthread_getspecific, and I get ~10% more indeed if
> I apply it here (my times are 191/160/145).

indeed:

Run operation 40000000 iterations 10.138684 s, 3945K operations/s, 253ns per coroutine
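[Editor's note: the speedup discussed above comes from replacing pthread_getspecific() lookups with a `__thread` variable, which the compiler can address directly through the TLS segment. A minimal standalone sketch of the per-thread isolation this relies on — variable and function names are illustrative, not from the patch:]

```c
#include <assert.h>
#include <pthread.h>

/* Each thread gets its own copy; access compiles to a direct
 * TLS-relative load instead of a pthread_getspecific() call. */
static __thread int pool_size;

static void *worker(void *arg)
{
    pool_size = 42;   /* writes this thread's copy only */
    return NULL;
}
```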
On 28/11/2014 12:21, Peter Lieven wrote:
> Am 28.11.2014 um 12:14 schrieb Paolo Bonzini:
>>> master:
>>> Run operation 40000000 iterations 12.851414 s, 3112K operations/s, 321ns per coroutine
>>>
>>> paolo:
>>> Run operation 40000000 iterations 11.951720 s, 3346K operations/s, 298ns per coroutine
>> Nice. :)
>>
>> Can you please try "coroutine: Use __thread … " together, too? I still
>> see 11% time spent in pthread_getspecific, and I get ~10% more indeed if
>> I apply it here (my times are 191/160/145).
>
> indeed:
>
> Run operation 40000000 iterations 10.138684 s, 3945K operations/s, 253ns per coroutine

Your perf_master2 uses the ring buffer unconditionally, right? I wonder if we can use a similar algorithm but with arrays instead of lists...

Paolo
Am 28.11.2014 um 12:23 schrieb Paolo Bonzini:
> On 28/11/2014 12:21, Peter Lieven wrote:
>> indeed:
>>
>> Run operation 40000000 iterations 10.138684 s, 3945K operations/s, 253ns per coroutine
> Your perf_master2 uses the ring buffer unconditionally, right? I wonder
> if we can use a similar algorithm but with arrays instead of lists...

You mean an algorithm similar to perf_master2 or to the current implementation?

The ring buffer seems to have a drawback when it comes to excessive coroutine nesting. My idea was that you do not throw away hot coroutines when the pool is full. However, I do not know if this is really a problem, since the pool is only full when there is not much I/O. Or is this assumption too easy?

Peter
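[Editor's note: the "keep the hot coroutines" policy discussed above can be shown with a generic ring pool. When the ring is full, the slot being reclaimed holds the oldest entry, so that is the one freed; gets hand back the most recently returned object first. This is an illustrative sketch (Obj, pool_put, pool_get are made-up names, not QEMU API), mirroring the index arithmetic of the patch:]

```c
#include <assert.h>
#include <stdlib.h>

#define POOL_MAX_SIZE 4   /* must be a power of two for the mask trick */

typedef struct Obj { int id; } Obj;

static struct {
    Obj *ptrs[POOL_MAX_SIZE];
    unsigned int size;
    unsigned int nextfree;
} pool;

/* Return an object to the pool. If the ring is full, the slot we are
 * about to overwrite holds the OLDEST entry, so that one is freed and
 * the recently used ("hot") objects stay cached. */
static void pool_put(Obj *o)
{
    pool.nextfree = (pool.nextfree + 1) & (POOL_MAX_SIZE - 1);
    if (pool.size == POOL_MAX_SIZE) {
        free(pool.ptrs[pool.nextfree]);   /* evict oldest */
    } else {
        pool.size++;
    }
    pool.ptrs[pool.nextfree] = o;
}

/* Take the most recently freed object (LIFO), or NULL if empty. */
static Obj *pool_get(void)
{
    Obj *o = NULL;
    if (pool.size) {
        o = pool.ptrs[pool.nextfree];
        pool.size--;
        pool.nextfree = (pool.nextfree - 1) & (POOL_MAX_SIZE - 1);
    }
    return o;
}
```

With a linked list inserted and removed at the head, a full pool instead discards the coroutine being freed — i.e. the newest one — which is exactly the behavior the commit message criticizes.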
On Thu, Nov 27, 2014 at 11:27:06AM +0100, Peter Lieven wrote:
> diff --git a/iothread.c b/iothread.c
> index 342a23f..b53529b 100644
> --- a/iothread.c
> +++ b/iothread.c
> @@ -15,6 +15,7 @@
>  #include "qom/object_interfaces.h"
>  #include "qemu/module.h"
>  #include "block/aio.h"
> +#include "block/coroutine.h"
>  #include "sysemu/iothread.h"
>  #include "qmp-commands.h"
>  #include "qemu/error-report.h"
> @@ -47,6 +48,8 @@ static void *iothread_run(void *opaque)
>          }
>          aio_context_release(iothread->ctx);
>      }
> +
> +    coroutine_pool_cleanup();
>      return NULL;
>  }

The assumption here is that iothread_run() is the only thread function that uses coroutines. If another thread uses coroutines the pool will leak :(. This is one of the challenges of thread-local storage.
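[Editor's note: one conventional way around the leak Stefan points out — sketched here as an alternative, not what the patch does — is to register the per-thread pool with pthread_key_create(); POSIX runs the key's destructor automatically when any thread with a non-NULL value exits, so cleanup no longer depends on each thread function remembering to call coroutine_pool_cleanup(). Pool, pool_destroy, and worker are illustrative names:]

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Illustrative stand-in for the per-thread coroutine pool. */
typedef struct Pool { int size; } Pool;

static pthread_key_t pool_key;
static volatile int destructor_ran;

/* Runs automatically at thread exit for every thread that stored a
 * non-NULL value. In QEMU this would drain the ring and call
 * qemu_coroutine_delete() on each pooled coroutine. */
static void pool_destroy(void *p)
{
    free(p);
    destructor_ran = 1;
}

static void pool_key_init(void)
{
    pthread_key_create(&pool_key, pool_destroy);
}

static void *worker(void *arg)
{
    Pool *p = calloc(1, sizeof(Pool));
    pthread_setspecific(pool_key, p);  /* destructor fires when this thread exits */
    return NULL;
}
```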
diff --git a/include/block/coroutine.h b/include/block/coroutine.h
index 1d9343d..f0b3a2d 100644
--- a/include/block/coroutine.h
+++ b/include/block/coroutine.h
@@ -96,7 +96,7 @@ Coroutine *coroutine_fn qemu_coroutine_self(void);
  */
 bool qemu_in_coroutine(void);
 
-
+void coroutine_pool_cleanup(void);
 
 /**
  * CoQueues are a mechanism to queue coroutines in order to continue executing
diff --git a/iothread.c b/iothread.c
index 342a23f..b53529b 100644
--- a/iothread.c
+++ b/iothread.c
@@ -15,6 +15,7 @@
 #include "qom/object_interfaces.h"
 #include "qemu/module.h"
 #include "block/aio.h"
+#include "block/coroutine.h"
 #include "sysemu/iothread.h"
 #include "qmp-commands.h"
 #include "qemu/error-report.h"
@@ -47,6 +48,8 @@ static void *iothread_run(void *opaque)
         }
         aio_context_release(iothread->ctx);
     }
+
+    coroutine_pool_cleanup();
     return NULL;
 }
diff --git a/qemu-coroutine.c b/qemu-coroutine.c
index 4708521..1900155 100644
--- a/qemu-coroutine.c
+++ b/qemu-coroutine.c
@@ -24,22 +24,22 @@ enum {
 };
 
 /** Free list to speed up creation */
-static QemuMutex pool_lock;
-static QSLIST_HEAD(, Coroutine) pool = QSLIST_HEAD_INITIALIZER(pool);
-static unsigned int pool_size;
+static __thread struct CoRoutinePool {
+    Coroutine *ptrs[POOL_MAX_SIZE];
+    unsigned int size;
+    unsigned int nextfree;
+} CoPool;
 
 Coroutine *qemu_coroutine_create(CoroutineEntry *entry)
 {
     Coroutine *co = NULL;
 
     if (CONFIG_COROUTINE_POOL) {
-        qemu_mutex_lock(&pool_lock);
-        co = QSLIST_FIRST(&pool);
-        if (co) {
-            QSLIST_REMOVE_HEAD(&pool, pool_next);
-            pool_size--;
+        if (CoPool.size) {
+            co = CoPool.ptrs[CoPool.nextfree];
+            CoPool.size--;
+            CoPool.nextfree--;
+            CoPool.nextfree &= (POOL_MAX_SIZE - 1);
         }
-        qemu_mutex_unlock(&pool_lock);
     }
 
     if (!co) {
@@ -54,36 +54,30 @@ Coroutine *qemu_coroutine_create(CoroutineEntry *entry)
 static void coroutine_delete(Coroutine *co)
 {
     if (CONFIG_COROUTINE_POOL) {
-        qemu_mutex_lock(&pool_lock);
-        if (pool_size < POOL_MAX_SIZE) {
-            QSLIST_INSERT_HEAD(&pool, co, pool_next);
-            co->caller = NULL;
-            pool_size++;
-            qemu_mutex_unlock(&pool_lock);
-            return;
+        CoPool.nextfree++;
+        CoPool.nextfree &= (POOL_MAX_SIZE - 1);
+        if (CoPool.size == POOL_MAX_SIZE) {
+            qemu_coroutine_delete(CoPool.ptrs[CoPool.nextfree]);
+        } else {
+            CoPool.size++;
         }
-        qemu_mutex_unlock(&pool_lock);
+        co->caller = NULL;
+        CoPool.ptrs[CoPool.nextfree] = co;
+    } else {
+        qemu_coroutine_delete(co);
     }
-
-    qemu_coroutine_delete(co);
 }
 
 static void __attribute__((constructor)) coroutine_pool_init(void)
 {
-    qemu_mutex_init(&pool_lock);
 }
 
-static void __attribute__((destructor)) coroutine_pool_cleanup(void)
+void __attribute__((destructor)) coroutine_pool_cleanup(void)
 {
-    Coroutine *co;
-    Coroutine *tmp;
-
-    QSLIST_FOREACH_SAFE(co, &pool, pool_next, tmp) {
-        QSLIST_REMOVE_HEAD(&pool, pool_next);
-        qemu_coroutine_delete(co);
+    printf("coroutine_pool_cleanup %lx pool %p\n", pthread_self(), &CoPool);
+    while (CoPool.size) {
+        qemu_coroutine_delete(qemu_coroutine_create(NULL));
     }
-
-    qemu_mutex_destroy(&pool_lock);
 }
 
 static void coroutine_swap(Coroutine *from, Coroutine *to)
This patch creates a ring structure for the coroutine pool instead of a linked list. The implementation with the list has the issue that it always throws away the latest coroutines instead of the oldest ones. This is a drawback, since the latest used coroutines are more likely to be cached than old ones.

Furthermore, this patch creates a coroutine pool per thread to remove the need for locking.

Signed-off-by: Peter Lieven <pl@kamp.de>
---
 include/block/coroutine.h |    2 +-
 iothread.c                |    3 +++
 qemu-coroutine.c          |   54 ++++++++++++++++++++-------------------
 3 files changed, 28 insertions(+), 31 deletions(-)