Message ID: 20240318183429.1039340-1-stefanha@redhat.com
State:      New
Series:     coroutine: cap per-thread local pool size
Am 18.03.2024 um 19:34 hat Stefan Hajnoczi geschrieben:
> The coroutine pool implementation can hit the Linux vm.max_map_count
> limit, causing QEMU to abort with "failed to allocate memory for stack"
> or "failed to set up stack guard page" during coroutine creation.
>
> This happens because per-thread pools can grow to tens of thousands of
> coroutines. Each coroutine causes 2 virtual memory areas to be created.
> Eventually vm.max_map_count is reached and memory-related syscalls fail.
> The per-thread pool sizes are non-uniform and depend on past coroutine
> usage in each thread, so it's possible for one thread to have a large
> pool while another thread's pool is empty.
>
> Switch to a new coroutine pool implementation with a global pool that
> grows to a maximum number of coroutines and per-thread local pools that
> are capped at a hardcoded small number of coroutines.
>
> This approach does not leave large numbers of coroutines pooled in a
> thread that may not use them again. In order to perform well it
> amortizes the cost of global pool accesses by working in batches of
> coroutines instead of individual coroutines.
>
> The global pool is a list. Threads donate batches of coroutines to it
> when they have too many and take batches from it when they have too few:
>
> .-----------------------------------.
> | Batch 1 | Batch 2 | Batch 3 | ... | global_pool
> `-----------------------------------'
>
> Each thread has up to 2 batches of coroutines:
>
> .-------------------.
> | Batch 1 | Batch 2 | per-thread local_pool (maximum 2 batches)
> `-------------------'
>
> The goal of this change is to reduce the excessive number of pooled
> coroutines that cause QEMU to abort when vm.max_map_count is reached
> without losing the performance of an adequately sized coroutine pool.
>
> Here are virtio-blk disk I/O benchmark results:
>
> RW       BLKSIZE IODEPTH    OLD    NEW CHANGE
> randread      4k       1 113725 117451  +3.3%
> randread      4k       8 192968 198510  +2.9%
> randread      4k      16 207138 209429  +1.1%
> randread      4k      32 212399 215145  +1.3%
> randread      4k      64 218319 221277  +1.4%
> randread    128k       1  17587  17535  -0.3%
> randread    128k       8  17614  17616  +0.0%
> randread    128k      16  17608  17609  +0.0%
> randread    128k      32  17552  17553  +0.0%
> randread    128k      64  17484  17484  +0.0%
>
> See files/{fio.sh,test.xml.j2} for the benchmark configuration:
> https://gitlab.com/stefanha/virt-playbooks/-/tree/coroutine-pool-fix-sizing
>
> Buglink: https://issues.redhat.com/browse/RHEL-28947
> Reported-by: Sanjay Rao <srao@redhat.com>
> Reported-by: Boaz Ben Shabat <bbenshab@redhat.com>
> Reported-by: Joe Mario <jmario@redhat.com>
> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>

Reviewed-by: Kevin Wolf <kwolf@redhat.com>

Though I do wonder if we can do something about the slight performance
degradation that Sanjay reported. We seem to stay well under the hard
limit, so the reduced global pool size shouldn't be the issue. Maybe
it's the locking?

Either way, even though it could be called a fix, I don't think this is
for 9.0, right?

Kevin
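The donate/take scheme described in the commit message can be sketched roughly as follows. This is a simplified, hypothetical model, not the actual QEMU code: the names (`CoroutineBatch`, `donate_batch`, `take_batch`), the batch size of 128, and the global max of 1024 are illustrative choices, not values from the patch.

```c
#include <stddef.h>
#include <pthread.h>

#define BATCH_SIZE 128            /* illustrative; QEMU's value may differ */
#define LOCAL_POOL_MAX_BATCHES 2  /* per-thread cap from the commit message */

typedef struct CoroutineBatch {
    struct CoroutineBatch *next;
    void *items[BATCH_SIZE];      /* pooled coroutine objects */
    size_t count;
} CoroutineBatch;

/* Global pool: a linked list of batches, protected by one mutex. */
static pthread_mutex_t global_pool_lock = PTHREAD_MUTEX_INITIALIZER;
static CoroutineBatch *global_pool;
static size_t global_pool_size;            /* counted in coroutines */
static size_t global_pool_max_size = 1024; /* illustrative hard max */

/* A thread donates a full batch when its local pool exceeds 2 batches. */
static int donate_batch(CoroutineBatch *batch)
{
    pthread_mutex_lock(&global_pool_lock);
    if (global_pool_size + batch->count > global_pool_max_size) {
        pthread_mutex_unlock(&global_pool_lock);
        return 0;       /* global pool full: caller frees the coroutines */
    }
    batch->next = global_pool;
    global_pool = batch;
    global_pool_size += batch->count;
    pthread_mutex_unlock(&global_pool_lock);
    return 1;
}

/* A thread takes a whole batch when its local pool runs dry. */
static CoroutineBatch *take_batch(void)
{
    pthread_mutex_lock(&global_pool_lock);
    CoroutineBatch *batch = global_pool;
    if (batch) {
        global_pool = batch->next;
        global_pool_size -= batch->count;
    }
    pthread_mutex_unlock(&global_pool_lock);
    return batch;       /* NULL means allocate fresh coroutines */
}
```

The point of moving whole batches is that the lock is taken once per `BATCH_SIZE` coroutines rather than once per coroutine, which is how the global pool amortizes its access cost.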
On Mon, Mar 18, 2024 at 02:34:29PM -0400, Stefan Hajnoczi wrote:
> The coroutine pool implementation can hit the Linux vm.max_map_count
> limit, causing QEMU to abort with "failed to allocate memory for stack"
> or "failed to set up stack guard page" during coroutine creation.
>
> This happens because per-thread pools can grow to tens of thousands of
> coroutines. Each coroutine causes 2 virtual memory areas to be created.

This sounds quite alarming. What usage scenario is justified in
creating so many coroutines?

IIUC, coroutine stack size is 1 MB, and so tens of thousands of
coroutines implies tens of GB of memory just on stacks alone.

> Eventually vm.max_map_count is reached and memory-related syscalls fail.

On my system max_map_count is 1048576, quite a lot higher than tens of
thousands. Hitting that would imply ~500,000 coroutines and ~500 GB of
stacks!

> diff --git a/util/qemu-coroutine.c b/util/qemu-coroutine.c
> index 5fd2dbaf8b..2790959eaf 100644
> --- a/util/qemu-coroutine.c
> +++ b/util/qemu-coroutine.c
> +static unsigned int get_global_pool_hard_max_size(void)
> +{
> +#ifdef __linux__
> +    g_autofree char *contents = NULL;
> +    int max_map_count;
> +
> +    /*
> +     * Linux processes can have up to max_map_count virtual memory areas
> +     * (VMAs). mmap(2), mprotect(2), etc fail with ENOMEM beyond this
> +     * limit. We must limit the coroutine pool to a safe size to avoid
> +     * running out of VMAs.
> +     */
> +    if (g_file_get_contents("/proc/sys/vm/max_map_count", &contents, NULL,
> +                            NULL) &&
> +        qemu_strtoi(contents, NULL, 10, &max_map_count) == 0) {
> +        /*
> +         * This is a conservative upper bound that avoids exceeding
> +         * max_map_count. Leave half for non-coroutine users like library
> +         * dependencies, vhost-user, etc. Each coroutine takes up 2 VMAs
> +         * so halve the amount again.
> +         */
> +        return max_map_count / 4;

That's 256,000 coroutines, which still sounds incredibly large to me.

> +    }
> +#endif
> +
> +    return UINT_MAX;

Why UINT_MAX as a default? If we can't read procfs, we should assume
some much smaller sane default IMHO, one that corresponds to what the
current Linux default max_map_count would be.

> +}
> +
> +static void __attribute__((constructor)) qemu_coroutine_init(void)
> +{
> +    qemu_mutex_init(&global_pool_lock);
> +    global_pool_hard_max_size = get_global_pool_hard_max_size();
> +}
> --
> 2.44.0

With regards,
Daniel
On Tue, Mar 19, 2024 at 02:32:06PM +0100, Kevin Wolf wrote:
> Am 18.03.2024 um 19:34 hat Stefan Hajnoczi geschrieben:
> > The coroutine pool implementation can hit the Linux vm.max_map_count
> > limit, causing QEMU to abort with "failed to allocate memory for stack"
> > or "failed to set up stack guard page" during coroutine creation.
> >
> > [...]
>
> Reviewed-by: Kevin Wolf <kwolf@redhat.com>
>
> Though I do wonder if we can do something about the slight performance
> degradation that Sanjay reported. We seem to stay well under the hard
> limit, so the reduced global pool size shouldn't be the issue. Maybe
> it's the locking?

I'm not sure if it's the lock. When writing the code I tried to avoid
thresholds that cause batches to bounce between the global and local
thread pools, because that is another way to lose performance. So maybe
it's something related to the algorithm.

> Either way, even though it could be called a fix, I don't think this is
> for 9.0, right?

There is a bug report for the max_map_count issue (RHEL-28947), so I
think this fix should go into QEMU 9.0.

Stefan
On Tue, Mar 19, 2024 at 9:32 AM Kevin Wolf <kwolf@redhat.com> wrote:
> Am 18.03.2024 um 19:34 hat Stefan Hajnoczi geschrieben:
> > The coroutine pool implementation can hit the Linux vm.max_map_count
> > limit, causing QEMU to abort with "failed to allocate memory for stack"
> > or "failed to set up stack guard page" during coroutine creation.
> >
> > [...]
>
> Reviewed-by: Kevin Wolf <kwolf@redhat.com>
>
> Though I do wonder if we can do something about the slight performance
> degradation that Sanjay reported. We seem to stay well under the hard
> limit, so the reduced global pool size shouldn't be the issue. Maybe
> it's the locking?

We are only seeing a slight fall-off from our much improved numbers with
the addition of iothreads. I am not very concerned. With database
workloads, there's always run-to-run variation, especially when there
are a lot of idle CPUs on the host. To reduce the run-to-run variation,
we use CPU/NUMA pinning and other methods like PCI passthrough. If I get
a chance, I will do some runs with CPU pinning to see what the numbers
look like.

> Either way, even though it could be called a fix, I don't think this is
> for 9.0, right?
>
> Kevin
Am 19.03.2024 um 14:43 hat Daniel P. Berrangé geschrieben:
> On Mon, Mar 18, 2024 at 02:34:29PM -0400, Stefan Hajnoczi wrote:
> > The coroutine pool implementation can hit the Linux vm.max_map_count
> > limit, causing QEMU to abort with "failed to allocate memory for stack"
> > or "failed to set up stack guard page" during coroutine creation.
> >
> > This happens because per-thread pools can grow to tens of thousands of
> > coroutines. Each coroutine causes 2 virtual memory areas to be created.
>
> This sounds quite alarming. What usage scenario is justified in
> creating so many coroutines?

Basically we try to allow pooling coroutines for as many requests as
there can be in flight at the same time. That is, adding a virtio-blk
device increases the maximum pool size by num_queues * queue_size. If
you have a guest with many CPUs, the default num_queues is relatively
large (the bug referenced by Stefan had 64), and queue_size is 256 by
default. That's 16k potential requests in flight per disk.

Another part of it is just that our calculation didn't make a lot of
sense. Instead of applying this number to the pool size of the iothread
that would actually get the requests, we applied it to _every_ iothread.
This is fixed with this patch; it's a global number applied to a global
pool now.

> IIUC, coroutine stack size is 1 MB, and so tens of thousands of
> coroutines implies tens of GB of memory just on stacks alone.

That's only virtual memory, though. Not sure how much of it is actually
used in practice.

> > Eventually vm.max_map_count is reached and memory-related syscalls fail.
>
> On my system max_map_count is 1048576, quite a lot higher than tens of
> thousands. Hitting that would imply ~500,000 coroutines and ~500 GB of
> stacks!

Did you change the configuration some time in the past, or is this just
a newer default? I get 65530, and that's the same default number I've
seen in the bug reports.

> > [...]
> > +        return max_map_count / 4;
>
> That's 256,000 coroutines, which still sounds incredibly large to me.

The whole purpose of the limitation is that you won't ever get -ENOMEM
back, which would likely crash your VM. Even if this hard limit is high,
that doesn't mean that it's fully used. Your setting of 1048576 probably
means that you would never have hit the crash anyway.

Even the benchmarks that used to hit the problem don't get close to this
hard limit any more because the actual number of coroutines stays much
smaller after applying this patch.

> > +    }
> > +#endif
> > +
> > +    return UINT_MAX;
>
> Why UINT_MAX as a default? If we can't read procfs, we should assume
> some much smaller sane default IMHO, one that corresponds to what the
> current Linux default max_map_count would be.

I don't think we should artificially limit the pool size, and with it
potentially limit performance, even if the host could do more if we only
allowed it to. If we can't read it from procfs, then it's your
responsibility as a user to make sure that it's large enough for your VM
configuration.

Kevin
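The arithmetic in this exchange can be made concrete. A small sketch (the helper names are illustrative, not from the patch; 64 queues and a queue size of 256 are the defaults discussed in the thread, and the /4 derivation mirrors the comment in the patch):

```c
/* Worst-case in-flight requests for one virtio-blk device:
 * one request descriptor per virtqueue slot. */
static unsigned int per_disk_requests(unsigned int num_queues,
                                      unsigned int queue_size)
{
    return num_queues * queue_size;
}

/* Hard cap derived from vm.max_map_count as in the patch:
 * leave half the VMAs for non-coroutine users, and each coroutine
 * (stack mapping + guard page) consumes 2 VMAs, so divide by 4. */
static unsigned int hard_max_coroutines(unsigned int max_map_count)
{
    return max_map_count / 4;
}
```

With the defaults from the bug report, `per_disk_requests(64, 256)` gives the 16k in-flight requests per disk Kevin mentions, and `hard_max_coroutines(1048576)` gives the ~256,000 figure Daniel finds surprisingly large.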
On Tue, Mar 19, 2024 at 05:54:38PM +0100, Kevin Wolf wrote:
> Am 19.03.2024 um 14:43 hat Daniel P. Berrangé geschrieben:
> > [...]
>
> Basically we try to allow pooling coroutines for as many requests as
> there can be in flight at the same time. That is, adding a virtio-blk
> device increases the maximum pool size by num_queues * queue_size. If
> you have a guest with many CPUs, the default num_queues is relatively
> large (the bug referenced by Stefan had 64), and queue_size is 256 by
> default. That's 16k potential requests in flight per disk.

If we have more than 1 virtio-blk device, does that scale up the max
coroutines too? E.g. would 32 virtio-blk devices imply 16k * 32 -> 512k
potential requests/coroutines?

> > IIUC, coroutine stack size is 1 MB, and so tens of thousands of
> > coroutines implies tens of GB of memory just on stacks alone.
>
> That's only virtual memory, though. Not sure how much of it is actually
> used in practice.

True, by default Linux wouldn't care too much about virtual memory. Only
if 'vm.overcommit_memory' is changed from its default, such that Linux
applies an overcommit ratio on RAM, would total virtual memory be
relevant.

> > > Eventually vm.max_map_count is reached and memory-related syscalls fail.
> >
> > On my system max_map_count is 1048576, quite a lot higher than tens
> > of thousands. Hitting that would imply ~500,000 coroutines and
> > ~500 GB of stacks!
>
> Did you change the configuration some time in the past, or is this just
> a newer default? I get 65530, and that's the same default number I've
> seen in the bug reports.

It turns out it is a Fedora change, rather than a kernel change:

  https://fedoraproject.org/wiki/Changes/IncreaseVmMaxMapCount

> > [...]
> > +        return max_map_count / 4;
> >
> > That's 256,000 coroutines, which still sounds incredibly large to me.
>
> The whole purpose of the limitation is that you won't ever get -ENOMEM
> back, which would likely crash your VM. Even if this hard limit is high,
> that doesn't mean that it's fully used. Your setting of 1048576 probably
> means that you would never have hit the crash anyway.
>
> Even the benchmarks that used to hit the problem don't get close to
> this hard limit any more because the actual number of coroutines stays
> much smaller after applying this patch.

I'm more thinking about the worst case behaviour that a malicious guest
can inflict on QEMU, causing unexpectedly high memory usage in the host.
ENOMEM is bad for a friendly VM, but there's also the risk to the host
from an unfriendly VM exploiting the high limits.

> > +    }
> > +#endif
> > +
> > +    return UINT_MAX;
> >
> > Why UINT_MAX as a default? If we can't read procfs, we should assume
> > some much smaller sane default IMHO, one that corresponds to what the
> > current Linux default max_map_count would be.
>
> I don't think we should artificially limit the pool size, and with it
> potentially limit performance, even if the host could do more if we only
> allowed it to. If we can't read it from procfs, then it's your
> responsibility as a user to make sure that it's large enough for your
> VM configuration.

With regards,
Daniel
Am 19.03.2024 um 18:10 hat Daniel P. Berrangé geschrieben:
> On Tue, Mar 19, 2024 at 05:54:38PM +0100, Kevin Wolf wrote:
> > [...]
> > That's 16k potential requests in flight per disk.
>
> If we have more than 1 virtio-blk device, does that scale up the max
> coroutines too? E.g. would 32 virtio-blk devices imply 16k * 32 ->
> 512k potential requests/coroutines?

Yes. This is the number of request descriptors that fit in the
virtqueues, and if you add another device with additional virtqueues,
then obviously that increases the number of theoretically possible
parallel requests.

The limits of what you can actually achieve in practice might be lower
because I/O might complete faster than the time we need to process all
of the queued requests, depending on how many vcpus are trying to
"compete" with how many iothreads. Of course, the practical limits in
five years might be different from today.

> > > IIUC, coroutine stack size is 1 MB, and so tens of thousands of
> > > coroutines implies tens of GB of memory just on stacks alone.
> >
> > That's only virtual memory, though. Not sure how much of it is
> > actually used in practice.
>
> True, by default Linux wouldn't care too much about virtual memory.
> Only if 'vm.overcommit_memory' is changed from its default, such that
> Linux applies an overcommit ratio on RAM, would total virtual memory
> be relevant.

That's a good point and one that I don't have a good answer for, short
of just replacing the whole QEMU block layer with rsd and switching to
stackless coroutines/futures this way.

> It turns out it is a Fedora change, rather than a kernel change:
>
>   https://fedoraproject.org/wiki/Changes/IncreaseVmMaxMapCount

Good to know, thanks.

> I'm more thinking about the worst case behaviour that a malicious
> guest can inflict on QEMU, causing unexpectedly high memory usage in
> the host. ENOMEM is bad for a friendly VM, but there's also the risk
> to the host from an unfriendly VM exploiting the high limits.

But from a QEMU perspective, what is the difference between a friendly
high-performance VM that exhausts the available bandwidth to do its job
as well and as fast as possible, and a malicious VM that does the same
just to waste host resources? I don't think QEMU can decide this; they
look the same.

If you want a VM not to send 16k requests in parallel, you can configure
its disk to expose fewer queues or a smaller queue size. The values I
mentioned above are only defaults that allow friendly VMs to perform
well out of the box; nothing prevents you from changing them to restrict
the amount of resources a VM can use.

Kevin
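The worst-case scaling debated above is simple back-of-the-envelope arithmetic. A sketch (the helper names are illustrative; the 2 VMAs and ~1 MB stack per coroutine are figures from this thread):

```c
/* Theoretical max parallel requests across all virtio-blk devices,
 * i.e. the total number of virtqueue slots. */
static unsigned long long worst_case_coroutines(unsigned int num_devices,
                                                unsigned int num_queues,
                                                unsigned int queue_size)
{
    return (unsigned long long)num_devices * num_queues * queue_size;
}

/* Each coroutine costs 2 VMAs (stack mapping + guard page). */
static unsigned long long vmas_needed(unsigned long long coroutines)
{
    return 2 * coroutines;
}

/* Each coroutine reserves ~1 MB of (virtual, mostly untouched) stack. */
static unsigned long long stack_mb(unsigned long long coroutines)
{
    return coroutines;
}
```

Plugging in Daniel's example of 32 devices with 64 queues of 256 entries gives 512k potential coroutines, i.e. over a million VMAs and ~512 GB of virtual stack, which is why the cap matters even on hosts with a raised max_map_count.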
On Tue, Mar 19, 2024 at 01:43:32PM +0000, Daniel P. Berrangé wrote:
> On Mon, Mar 18, 2024 at 02:34:29PM -0400, Stefan Hajnoczi wrote:
> > The coroutine pool implementation can hit the Linux vm.max_map_count
> > limit, causing QEMU to abort with "failed to allocate memory for stack"
> > or "failed to set up stack guard page" during coroutine creation.
> >
> > This happens because per-thread pools can grow to tens of thousands of
> > coroutines. Each coroutine causes 2 virtual memory areas to be created.
>
> This sounds quite alarming. What usage scenario is justified in
> creating so many coroutines?

The coroutine pool hides creation and deletion latency. The pool
initially has a modest size of 64, but each virtio-blk device increases
the pool size by num_queues * queue_size (256) / 2. The issue pops up
with large SMP guests (i.e. large num_queues) with multiple virtio-blk
devices.

> IIUC, coroutine stack size is 1 MB, and so tens of thousands of
> coroutines implies tens of GB of memory just on stacks alone.
>
> > Eventually vm.max_map_count is reached and memory-related syscalls fail.
>
> On my system max_map_count is 1048576, quite a lot higher than tens of
> thousands. Hitting that would imply ~500,000 coroutines and ~500 GB of
> stacks!

Fedora recently increased the limit to 1048576. Before that it was
65k-ish, and it still is on most other distros.

Regarding why QEMU might have 65k coroutines pooled: it's because the
existing coroutine pool algorithm is per-thread. So if the max pool size
is 15k but you have 4 IOThreads, then up to 4 x 15k total coroutines can
be sitting in pools. This patch addresses this by setting a small fixed
size on per-thread pools (256).

> > [...]
> > +        return max_map_count / 4;
>
> That's 256,000 coroutines, which still sounds incredibly large to me.

Any ideas for tweaking this heuristic?

> > +    }
> > +#endif
> > +
> > +    return UINT_MAX;
>
> Why UINT_MAX as a default? If we can't read procfs, we should assume
> some much smaller sane default IMHO, one that corresponds to what the
> current Linux default max_map_count would be.

This line is not Linux-specific. I don't know if other OSes have an
equivalent to max_map_count.

I agree with defaulting to 64k-ish on Linux.

Stefan
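Stefan's point about per-thread pools can be put into numbers with a small sketch comparing the two sizing models (illustrative helpers, not QEMU code; 15k, 4 IOThreads, and the 256 per-thread cap are the figures he gives):

```c
/* Old algorithm: every thread may independently pool up to the
 * full maximum, so the worst case scales with the thread count. */
static unsigned long long old_pooled_max(unsigned int num_threads,
                                         unsigned int max_pool_size)
{
    return (unsigned long long)num_threads * max_pool_size;
}

/* New algorithm: a small fixed per-thread pool plus one shared
 * global pool, so the worst case barely grows with thread count. */
static unsigned long long new_pooled_max(unsigned int num_threads,
                                         unsigned int local_pool_size,
                                         unsigned int global_pool_size)
{
    return (unsigned long long)num_threads * local_pool_size
           + global_pool_size;
}
```

With 4 IOThreads and a 15k budget, the old scheme could leave 60k coroutines idle in pools while the new scheme tops out at just over the shared budget, which is the whole point of capping the local pools.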
On Tue, Mar 19, 2024 at 01:55:10PM -0400, Stefan Hajnoczi wrote:
> On Tue, Mar 19, 2024 at 01:43:32PM +0000, Daniel P. Berrangé wrote:
> > On Mon, Mar 18, 2024 at 02:34:29PM -0400, Stefan Hajnoczi wrote:
> > > [...]
> > > +        /*
> > > +         * This is a conservative upper bound that avoids exceeding
> > > +         * max_map_count. Leave half for non-coroutine users like
> > > +         * library dependencies, vhost-user, etc. Each coroutine
> > > +         * takes up 2 VMAs so halve the amount again.

Leaving half for loaded libraries, etc. is quite conservative if
max_map_count is the small-ish 64k default. That reservation could
perhaps be a fixed number like 5,000?

> > > +         */
> > > +        return max_map_count / 4;
> >
> > That's 256,000 coroutines, which still sounds incredibly large to me.
>
> Any ideas for tweaking this heuristic?

The awkward thing about this limit is that it's hardcoded, and since it
is indeed a "heuristic", we know it is going to be sub-optimal for some
use cases / scenarios.

The worst case upper limit is num virtio-blk * num threads * num queues.

Reducing the number of devices isn't practical if the guest genuinely
needs that many volumes.

Reducing the threads or queues artificially limits the peak performance
of a single disk handled in isolation, while other disks are idle, so
that's not desirable. So there's no way to cap the worst case scenario
while still maximising the single-disk performance possibilities.

With large VMs with many CPUs and many disks, it could be reasonable not
to expect a real guest to need to maximise I/O on every disk at the same
time, and thus to want to put some cap there to control worst case
resource usage.

It feels like this leans towards being able to control the coroutine
pool limit explicitly, as a CLI option, to override this default
heuristic.

> > > +    }
> > > +#endif
> > > +
> > > +    return UINT_MAX;
> >
> > Why UINT_MAX as a default? If we can't read procfs, we should assume
> > some much smaller sane default IMHO, one that corresponds to what the
> > current Linux default max_map_count would be.
>
> This line is not Linux-specific. I don't know if other OSes have an
> equivalent to max_map_count.
>
> I agree with defaulting to 64k-ish on Linux.

With regards,
Daniel
On Tue, Mar 19, 2024 at 06:41:28PM +0100, Kevin Wolf wrote:
> Am 19.03.2024 um 18:10 hat Daniel P. Berrangé geschrieben:
> > On Tue, Mar 19, 2024 at 05:54:38PM +0100, Kevin Wolf wrote:
> > > Am 19.03.2024 um 14:43 hat Daniel P. Berrangé geschrieben:
> > > > On Mon, Mar 18, 2024 at 02:34:29PM -0400, Stefan Hajnoczi wrote:
> > > > > The coroutine pool implementation can hit the Linux vm.max_map_count
> > > > > limit, causing QEMU to abort with "failed to allocate memory for stack"
> > > > > or "failed to set up stack guard page" during coroutine creation.
> > > > >
> > > > > This happens because per-thread pools can grow to tens of thousands of
> > > > > coroutines. Each coroutine causes 2 virtual memory areas to be created.
> > > >
> > > > This sounds quite alarming. What usage scenario is justified in
> > > > creating so many coroutines?
> > >
> > > Basically we try to allow pooling coroutines for as many requests as
> > > there can be in flight at the same time. That is, adding a virtio-blk
> > > device increases the maximum pool size by num_queues * queue_size. If
> > > you have a guest with many CPUs, the default num_queues is relatively
> > > large (the bug referenced by Stefan had 64), and queue_size is 256 by
> > > default. That's 16k potential requests in flight per disk.
> >
> > If we have more than 1 virtio-blk device, does that scale up the max
> > coroutines too ?
> >
> > eg would 32 virtio-blk devices imply 16k * 32 -> 512k potential
> > requests/coroutines ?
>
> Yes. This is the number of request descriptors that fit in the
> virtqueues, and if you add another device with additional virtqueues,
> then obviously that increases the number of theoretically possible
> parallel requests.
>
> The limits of what you can actually achieve in practice might be lower
> because I/O might complete faster than the time we need to process all
> of the queued requests, depending on how many vcpus are trying to
> "compete" with how many iothreads. Of course, the practical limits in
> five years might be different from today.
>
> > > > IIUC, coroutine stack size is 1 MB, and so tens of thousands of
> > > > coroutines implies 10's of GB of memory just on stacks alone.
> > >
> > > That's only virtual memory, though. Not sure how much of it is actually
> > > used in practice.
> >
> > True, by default Linux wouldn't care too much about virtual memory.
> > Only if 'vm.overcommit_memory' is changed from its default, such
> > that Linux applies an overcommit ratio on RAM, then total virtual
> > memory would be relevant.
>
> That's a good point and one that I don't have a good answer for, short
> of just replacing the whole QEMU block layer with rsd and switching to
> stackless coroutines/futures this way.
>
> > > > > Eventually vm.max_map_count is reached and memory-related syscalls fail.
> > > >
> > > > On my system max_map_count is 1048576, quite a lot higher than
> > > > 10's of 1000's. Hitting that would imply ~500,000 coroutines and
> > > > ~500 GB of stacks !
> > >
> > > Did you change the configuration some time in the past, or is this just
> > > a newer default? I get 65530, and that's the same default number I've
> > > seen in the bug reports.
> >
> > It turns out it is a Fedora change, rather than a kernel change:
> >
> >   https://fedoraproject.org/wiki/Changes/IncreaseVmMaxMapCount
>
> Good to know, thanks.
>
> > > > > diff --git a/util/qemu-coroutine.c b/util/qemu-coroutine.c
> > > > > index 5fd2dbaf8b..2790959eaf 100644
> > > > > --- a/util/qemu-coroutine.c
> > > > > +++ b/util/qemu-coroutine.c
> > > >
> > > > > +static unsigned int get_global_pool_hard_max_size(void)
> > > > > +{
> > > > > +#ifdef __linux__
> > > > > +    g_autofree char *contents = NULL;
> > > > > +    int max_map_count;
> > > > > +
> > > > > +    /*
> > > > > +     * Linux processes can have up to max_map_count virtual memory areas
> > > > > +     * (VMAs). mmap(2), mprotect(2), etc fail with ENOMEM beyond this limit. We
> > > > > +     * must limit the coroutine pool to a safe size to avoid running out of
> > > > > +     * VMAs.
> > > > > +     */
> > > > > +    if (g_file_get_contents("/proc/sys/vm/max_map_count", &contents, NULL,
> > > > > +                            NULL) &&
> > > > > +        qemu_strtoi(contents, NULL, 10, &max_map_count) == 0) {
> > > > > +        /*
> > > > > +         * This is a conservative upper bound that avoids exceeding
> > > > > +         * max_map_count. Leave half for non-coroutine users like library
> > > > > +         * dependencies, vhost-user, etc. Each coroutine takes up 2 VMAs so
> > > > > +         * halve the amount again.
> > > > > +         */
> > > > > +        return max_map_count / 4;
> > > >
> > > > That's 256,000 coroutines, which still sounds incredibly large
> > > > to me.
> > >
> > > The whole purpose of the limitation is that you won't ever get -ENOMEM
> > > back, which will likely crash your VM. Even if this hard limit is high,
> > > that doesn't mean that it's fully used. Your setting of 1048576 probably
> > > means that you would never have hit the crash anyway.
> > >
> > > Even the benchmarks that used to hit the problem don't even get close to
> > > this hard limit any more because the actual number of coroutines stays
> > > much smaller after applying this patch.
> >
> > I'm more thinking about what's the worst case behaviour that a
> > malicious guest can inflict on QEMU, and cause unexpectedly high
> > memory usage in the host.
> >
> > ENOMEM is bad for a friendly VM, but there's also the risk to the host
> > from an unfriendly VM exploiting the high limits.
>
> But from a QEMU perspective, what is the difference between a friendly
> high-performance VM that exhausts the available bandwidth to do its job
> as good and fast as possible, and a malicious VM that does the same
> just to waste host resources? I don't think QEMU can decide this, they
> look the same.
>
> If you want a VM not to send 16k requests in parallel, you can configure
> its disk to expose less queues or a smaller queue size. The values I
> mentioned above are only defaults that allow friendly VMs to perform
> well out of the box, nothing prevents you from changing them to restrict
> the amount of resources a VM can use.

Reducing queues is a no-win scenario, as it limits the performance of
a single disk when used in isolation, in order to cap the worst case
when all disks are used concurrently :-(

It would be nice to allow a single disk to burst to a high level, and
only limit coroutines if many disks are all trying to concurrently
burst to a high level.

With regards,
Daniel
On Tue, Mar 19, 2024 at 08:10:49PM +0000, Daniel P. Berrangé wrote:
> On Tue, Mar 19, 2024 at 01:55:10PM -0400, Stefan Hajnoczi wrote:
> > On Tue, Mar 19, 2024 at 01:43:32PM +0000, Daniel P. Berrangé wrote:
> > > On Mon, Mar 18, 2024 at 02:34:29PM -0400, Stefan Hajnoczi wrote:
> > > > diff --git a/util/qemu-coroutine.c b/util/qemu-coroutine.c
> > > > index 5fd2dbaf8b..2790959eaf 100644
> > > > --- a/util/qemu-coroutine.c
> > > > +++ b/util/qemu-coroutine.c
> > >
> > > > +static unsigned int get_global_pool_hard_max_size(void)
> > > > +{
> > > > +#ifdef __linux__
> > > > +    g_autofree char *contents = NULL;
> > > > +    int max_map_count;
> > > > +
> > > > +    /*
> > > > +     * Linux processes can have up to max_map_count virtual memory areas
> > > > +     * (VMAs). mmap(2), mprotect(2), etc fail with ENOMEM beyond this limit. We
> > > > +     * must limit the coroutine pool to a safe size to avoid running out of
> > > > +     * VMAs.
> > > > +     */
> > > > +    if (g_file_get_contents("/proc/sys/vm/max_map_count", &contents, NULL,
> > > > +                            NULL) &&
> > > > +        qemu_strtoi(contents, NULL, 10, &max_map_count) == 0) {
> > > > +        /*
> > > > +         * This is a conservative upper bound that avoids exceeding
> > > > +         * max_map_count. Leave half for non-coroutine users like library
> > > > +         * dependencies, vhost-user, etc. Each coroutine takes up 2 VMAs so
> > > > +         * halve the amount again.
>
> Leaving half for loaded libraries, etc is quite conservative
> if max_map_count is the small-ish 64k default.
>
> That reservation could perhaps be a fixed number like 5,000?

While I don't want QEMU to abort, once this heuristic is in the code it
will be scary to make it more optimistic and we may never change it. So
now is the best time to try 5,000.

I'll send a follow-up patch that reserves 5,000 mappings. If that turns
out to be too optimistic we can increase the reservation.

> > > > +         */
> > > > +        return max_map_count / 4;
> > >
> > > That's 256,000 coroutines, which still sounds incredibly large
> > > to me.
> >
> > Any ideas for tweaking this heuristic?
>
> The awkward thing about this limit is that it's hardcoded, and
> since it is indeed a "heuristic", we know it is going to be
> sub-optimal for some use cases / scenarios.
>
> The worst case upper limit is
>
>   num virtio-blk * num threads * num queues
>
> Reducing the number of devices isn't practical if the guest
> genuinely needs that many volumes.
>
> Reducing the threads or queues artificially limits the peak
> performance of a single disk handled in isolation, while
> other disks are idle, so that's not desirable.
>
> So there's no way to cap the worst case scenario, while
> still maximising the single disk performance possibilities.
>
> With large VMs with many CPUs and many disks, it could be
> reasonable not to expect a real guest to need to maximise
> I/O on every disk at the same time, and thus want to put
> some cap there to control worst case resource usage.
>
> It feels like it leans towards being able to control the
> coroutine pool limit explicitly, as a CLI option, to override
> this default heuristic.
>
> > > > +    }
> > > > +#endif
> > > > +
> > > > +    return UINT_MAX;
> > >
> > > Why UINT_MAX as a default ? If we can't read procfs, we should
> > > assume some much smaller sane default IMHO, that corresponds to
> > > what current linux default max_map_count would be.
> >
> > This line is not Linux-specific. I don't know if other OSes have an
> > equivalent to max_map_count.
> >
> > I agree with defaulting to 64k-ish on Linux.
>
> With regards,
> Daniel
> --
> |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org -o- https://fstop138.berrange.com :|
> |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
>
On Wed, Mar 20, 2024 at 09:35:39AM -0400, Stefan Hajnoczi wrote:
> On Tue, Mar 19, 2024 at 08:10:49PM +0000, Daniel P. Berrangé wrote:
> > On Tue, Mar 19, 2024 at 01:55:10PM -0400, Stefan Hajnoczi wrote:
> > > On Tue, Mar 19, 2024 at 01:43:32PM +0000, Daniel P. Berrangé wrote:
> > > > On Mon, Mar 18, 2024 at 02:34:29PM -0400, Stefan Hajnoczi wrote:
> > > > > diff --git a/util/qemu-coroutine.c b/util/qemu-coroutine.c
> > > > > index 5fd2dbaf8b..2790959eaf 100644
> > > > > --- a/util/qemu-coroutine.c
> > > > > +++ b/util/qemu-coroutine.c
> > > >
> > > > > +static unsigned int get_global_pool_hard_max_size(void)
> > > > > +{
> > > > > +#ifdef __linux__
> > > > > +    g_autofree char *contents = NULL;
> > > > > +    int max_map_count;
> > > > > +
> > > > > +    /*
> > > > > +     * Linux processes can have up to max_map_count virtual memory areas
> > > > > +     * (VMAs). mmap(2), mprotect(2), etc fail with ENOMEM beyond this limit. We
> > > > > +     * must limit the coroutine pool to a safe size to avoid running out of
> > > > > +     * VMAs.
> > > > > +     */
> > > > > +    if (g_file_get_contents("/proc/sys/vm/max_map_count", &contents, NULL,
> > > > > +                            NULL) &&
> > > > > +        qemu_strtoi(contents, NULL, 10, &max_map_count) == 0) {
> > > > > +        /*
> > > > > +         * This is a conservative upper bound that avoids exceeding
> > > > > +         * max_map_count. Leave half for non-coroutine users like library
> > > > > +         * dependencies, vhost-user, etc. Each coroutine takes up 2 VMAs so
> > > > > +         * halve the amount again.
> >
> > Leaving half for loaded libraries, etc is quite conservative
> > if max_map_count is the small-ish 64k default.
> >
> > That reservation could perhaps be a fixed number like 5,000?
>
> While I don't want QEMU to abort, once this heuristic is in the code it
> will be scary to make it more optimistic and we may never change it. So
> now is the best time to try 5,000.
>
> I'll send a follow-up patch that reserves 5,000 mappings. If that turns
> out to be too optimistic we can increase the reservation.

BTW, I suggested 5,000, because I looked at a few QEMU processes I have
running on Fedora and saw just under 1,000 lines in /proc/$PID/maps,
of which only a subset is library mappings. So multiplying that x5 felt
like a fairly generous overhead for more complex build configurations.

With regards,
Daniel
Am 20.03.2024 um 15:09 hat Daniel P. Berrangé geschrieben:
> On Wed, Mar 20, 2024 at 09:35:39AM -0400, Stefan Hajnoczi wrote:
> > On Tue, Mar 19, 2024 at 08:10:49PM +0000, Daniel P. Berrangé wrote:
> > > On Tue, Mar 19, 2024 at 01:55:10PM -0400, Stefan Hajnoczi wrote:
> > > > On Tue, Mar 19, 2024 at 01:43:32PM +0000, Daniel P. Berrangé wrote:
> > > > > On Mon, Mar 18, 2024 at 02:34:29PM -0400, Stefan Hajnoczi wrote:
> > > > > > diff --git a/util/qemu-coroutine.c b/util/qemu-coroutine.c
> > > > > > index 5fd2dbaf8b..2790959eaf 100644
> > > > > > --- a/util/qemu-coroutine.c
> > > > > > +++ b/util/qemu-coroutine.c
> > > > >
> > > > > > +static unsigned int get_global_pool_hard_max_size(void)
> > > > > > +{
> > > > > > +#ifdef __linux__
> > > > > > +    g_autofree char *contents = NULL;
> > > > > > +    int max_map_count;
> > > > > > +
> > > > > > +    /*
> > > > > > +     * Linux processes can have up to max_map_count virtual memory areas
> > > > > > +     * (VMAs). mmap(2), mprotect(2), etc fail with ENOMEM beyond this limit. We
> > > > > > +     * must limit the coroutine pool to a safe size to avoid running out of
> > > > > > +     * VMAs.
> > > > > > +     */
> > > > > > +    if (g_file_get_contents("/proc/sys/vm/max_map_count", &contents, NULL,
> > > > > > +                            NULL) &&
> > > > > > +        qemu_strtoi(contents, NULL, 10, &max_map_count) == 0) {
> > > > > > +        /*
> > > > > > +         * This is a conservative upper bound that avoids exceeding
> > > > > > +         * max_map_count. Leave half for non-coroutine users like library
> > > > > > +         * dependencies, vhost-user, etc. Each coroutine takes up 2 VMAs so
> > > > > > +         * halve the amount again.
> > >
> > > Leaving half for loaded libraries, etc is quite conservative
> > > if max_map_count is the small-ish 64k default.
> > >
> > > That reservation could perhaps be a fixed number like 5,000?
> >
> > While I don't want QEMU to abort, once this heuristic is in the code it
> > will be scary to make it more optimistic and we may never change it. So
> > now is the best time to try 5,000.
> >
> > I'll send a follow-up patch that reserves 5,000 mappings. If that turns
> > out to be too optimistic we can increase the reservation.
>
> BTW, I suggested 5,000, because I looked at a few QEMU processes I have
> running on Fedora and saw just under 1,000 lines in /proc/$PID/maps,
> of which only a subset is library mappings. So multiplying that x5 felt
> like a fairly generous overhead for more complex build configurations.

On my system, the boring desktop VM with no special hardware or other
advanced configuration takes ~1500 mappings, most of which are
libraries. I'm not concerned about the library mappings, it's unlikely
that we'll double the number of libraries soon.

But I'm not sure about dynamic mappings outside of coroutines, maybe
when enabling features my simple desktop VM doesn't even use at all. If
we're sure that nothing else uses any number worth mentioning, fine with
me. But I couldn't tell.

Staying in the area we know reasonably well, how many libblkio bounce
buffers could be in use at the same time? I think each one is an
individual mmap(), right?

Kevin
On Thu, 21 Mar 2024 at 08:22, Kevin Wolf <kwolf@redhat.com> wrote:
>
> Am 20.03.2024 um 15:09 hat Daniel P. Berrangé geschrieben:
> > On Wed, Mar 20, 2024 at 09:35:39AM -0400, Stefan Hajnoczi wrote:
> > > On Tue, Mar 19, 2024 at 08:10:49PM +0000, Daniel P. Berrangé wrote:
> > > > On Tue, Mar 19, 2024 at 01:55:10PM -0400, Stefan Hajnoczi wrote:
> > > > > On Tue, Mar 19, 2024 at 01:43:32PM +0000, Daniel P. Berrangé wrote:
> > > > > > On Mon, Mar 18, 2024 at 02:34:29PM -0400, Stefan Hajnoczi wrote:
> > > > > > > diff --git a/util/qemu-coroutine.c b/util/qemu-coroutine.c
> > > > > > > index 5fd2dbaf8b..2790959eaf 100644
> > > > > > > --- a/util/qemu-coroutine.c
> > > > > > > +++ b/util/qemu-coroutine.c
> > > > > >
> > > > > > > +static unsigned int get_global_pool_hard_max_size(void)
> > > > > > > +{
> > > > > > > +#ifdef __linux__
> > > > > > > +    g_autofree char *contents = NULL;
> > > > > > > +    int max_map_count;
> > > > > > > +
> > > > > > > +    /*
> > > > > > > +     * Linux processes can have up to max_map_count virtual memory areas
> > > > > > > +     * (VMAs). mmap(2), mprotect(2), etc fail with ENOMEM beyond this limit. We
> > > > > > > +     * must limit the coroutine pool to a safe size to avoid running out of
> > > > > > > +     * VMAs.
> > > > > > > +     */
> > > > > > > +    if (g_file_get_contents("/proc/sys/vm/max_map_count", &contents, NULL,
> > > > > > > +                            NULL) &&
> > > > > > > +        qemu_strtoi(contents, NULL, 10, &max_map_count) == 0) {
> > > > > > > +        /*
> > > > > > > +         * This is a conservative upper bound that avoids exceeding
> > > > > > > +         * max_map_count. Leave half for non-coroutine users like library
> > > > > > > +         * dependencies, vhost-user, etc. Each coroutine takes up 2 VMAs so
> > > > > > > +         * halve the amount again.
> > > >
> > > > Leaving half for loaded libraries, etc is quite conservative
> > > > if max_map_count is the small-ish 64k default.
> > > >
> > > > That reservation could perhaps be a fixed number like 5,000?
> > >
> > > While I don't want QEMU to abort, once this heuristic is in the code it
> > > will be scary to make it more optimistic and we may never change it. So
> > > now is the best time to try 5,000.
> > >
> > > I'll send a follow-up patch that reserves 5,000 mappings. If that turns
> > > out to be too optimistic we can increase the reservation.
> >
> > BTW, I suggested 5,000, because I looked at a few QEMU processes I have
> > running on Fedora and saw just under 1,000 lines in /proc/$PID/maps,
> > of which only a subset is library mappings. So multiplying that x5 felt
> > like a fairly generous overhead for more complex build configurations.
>
> On my system, the boring desktop VM with no special hardware or other
> advanced configuration takes ~1500 mappings, most of which are
> libraries. I'm not concerned about the library mappings, it's unlikely
> that we'll double the number of libraries soon.
>
> But I'm not sure about dynamic mappings outside of coroutines, maybe
> when enabling features my simple desktop VM doesn't even use at all. If
> we're sure that nothing else uses any number worth mentioning, fine with
> me. But I couldn't tell.
>
> Staying in the area we know reasonably well, how many libblkio bounce
> buffers could be in use at the same time? I think each one is an
> individual mmap(), right?

libblkio's mapping requirements are similar to vhost-user. There is one
general-purpose bounce buffer mapping plus a mapping for each QEMU
RAMBlock. That means the number of in-flight I/Os does not directly
influence the number of mappings.

Stefan
diff --git a/util/qemu-coroutine.c b/util/qemu-coroutine.c
index 5fd2dbaf8b..2790959eaf 100644
--- a/util/qemu-coroutine.c
+++ b/util/qemu-coroutine.c
@@ -18,39 +18,200 @@
 #include "qemu/atomic.h"
 #include "qemu/coroutine_int.h"
 #include "qemu/coroutine-tls.h"
+#include "qemu/cutils.h"
 #include "block/aio.h"
 
-/**
- * The minimal batch size is always 64, coroutines from the release_pool are
- * reused as soon as there are 64 coroutines in it. The maximum pool size starts
- * with 64 and is increased on demand so that coroutines are not deleted even if
- * they are not immediately reused.
- */
 enum {
-    POOL_MIN_BATCH_SIZE = 64,
-    POOL_INITIAL_MAX_SIZE = 64,
+    COROUTINE_POOL_BATCH_MAX_SIZE = 128,
 };
 
-/** Free list to speed up creation */
-static QSLIST_HEAD(, Coroutine) release_pool = QSLIST_HEAD_INITIALIZER(pool);
-static unsigned int pool_max_size = POOL_INITIAL_MAX_SIZE;
-static unsigned int release_pool_size;
+/*
+ * Coroutine creation and deletion is expensive so a pool of unused coroutines
+ * is kept as a cache. When the pool has coroutines available, they are
+ * recycled instead of creating new ones from scratch. Coroutines are added to
+ * the pool upon termination.
+ *
+ * The pool is global but each thread maintains a small local pool to avoid
+ * global pool contention. Threads fetch and return batches of coroutines from
+ * the global pool to maintain their local pool. The local pool holds up to two
+ * batches whereas the maximum size of the global pool is controlled by the
+ * qemu_coroutine_inc_pool_size() API.
+ *
+ * .-----------------------------------.
+ * | Batch 1 | Batch 2 | Batch 3 | ... | global_pool
+ * `-----------------------------------'
+ *
+ * .-------------------.
+ * | Batch 1 | Batch 2 | per-thread local_pool (maximum 2 batches)
+ * `-------------------'
+ */
+typedef struct CoroutinePoolBatch {
+    /* Batches are kept in a list */
+    QSLIST_ENTRY(CoroutinePoolBatch) next;
 
-typedef QSLIST_HEAD(, Coroutine) CoroutineQSList;
-QEMU_DEFINE_STATIC_CO_TLS(CoroutineQSList, alloc_pool);
-QEMU_DEFINE_STATIC_CO_TLS(unsigned int, alloc_pool_size);
-QEMU_DEFINE_STATIC_CO_TLS(Notifier, coroutine_pool_cleanup_notifier);
+    /* This batch holds up to @COROUTINE_POOL_BATCH_MAX_SIZE coroutines */
+    QSLIST_HEAD(, Coroutine) list;
+    unsigned int size;
+} CoroutinePoolBatch;
 
-static void coroutine_pool_cleanup(Notifier *n, void *value)
+typedef QSLIST_HEAD(, CoroutinePoolBatch) CoroutinePool;
+
+/* Host operating system limit on number of pooled coroutines */
+static unsigned int global_pool_hard_max_size;
+
+static QemuMutex global_pool_lock; /* protects the following variables */
+static CoroutinePool global_pool = QSLIST_HEAD_INITIALIZER(global_pool);
+static unsigned int global_pool_size;
+static unsigned int global_pool_max_size = COROUTINE_POOL_BATCH_MAX_SIZE;
+
+QEMU_DEFINE_STATIC_CO_TLS(CoroutinePool, local_pool);
+QEMU_DEFINE_STATIC_CO_TLS(Notifier, local_pool_cleanup_notifier);
+
+static CoroutinePoolBatch *coroutine_pool_batch_new(void)
+{
+    CoroutinePoolBatch *batch = g_new(CoroutinePoolBatch, 1);
+
+    QSLIST_INIT(&batch->list);
+    batch->size = 0;
+    return batch;
+}
+
+static void coroutine_pool_batch_delete(CoroutinePoolBatch *batch)
 {
     Coroutine *co;
     Coroutine *tmp;
 
-    CoroutineQSList *alloc_pool = get_ptr_alloc_pool();
-    QSLIST_FOREACH_SAFE(co, alloc_pool, pool_next, tmp) {
-        QSLIST_REMOVE_HEAD(alloc_pool, pool_next);
+    QSLIST_FOREACH_SAFE(co, &batch->list, pool_next, tmp) {
+        QSLIST_REMOVE_HEAD(&batch->list, pool_next);
         qemu_coroutine_delete(co);
     }
+    g_free(batch);
+}
+
+static void local_pool_cleanup(Notifier *n, void *value)
+{
+    CoroutinePool *local_pool = get_ptr_local_pool();
+    CoroutinePoolBatch *batch;
+    CoroutinePoolBatch *tmp;
+
+    QSLIST_FOREACH_SAFE(batch, local_pool, next, tmp) {
+        QSLIST_REMOVE_HEAD(local_pool, next);
+        coroutine_pool_batch_delete(batch);
+    }
+}
+
+/* Ensure the atexit notifier is registered */
+static void local_pool_cleanup_init_once(void)
+{
+    Notifier *notifier = get_ptr_local_pool_cleanup_notifier();
+    if (!notifier->notify) {
+        notifier->notify = local_pool_cleanup;
+        qemu_thread_atexit_add(notifier);
+    }
+}
+
+/* Helper to get the next unused coroutine from the local pool */
+static Coroutine *coroutine_pool_get_local(void)
+{
+    CoroutinePool *local_pool = get_ptr_local_pool();
+    CoroutinePoolBatch *batch = QSLIST_FIRST(local_pool);
+    Coroutine *co;
+
+    if (unlikely(!batch)) {
+        return NULL;
+    }
+
+    co = QSLIST_FIRST(&batch->list);
+    QSLIST_REMOVE_HEAD(&batch->list, pool_next);
+    batch->size--;
+
+    if (batch->size == 0) {
+        QSLIST_REMOVE_HEAD(local_pool, next);
+        coroutine_pool_batch_delete(batch);
+    }
+    return co;
+}
+
+/* Get the next batch from the global pool */
+static void coroutine_pool_refill_local(void)
+{
+    CoroutinePool *local_pool = get_ptr_local_pool();
+    CoroutinePoolBatch *batch;
+
+    WITH_QEMU_LOCK_GUARD(&global_pool_lock) {
+        batch = QSLIST_FIRST(&global_pool);
+
+        if (batch) {
+            QSLIST_REMOVE_HEAD(&global_pool, next);
+            global_pool_size -= batch->size;
+        }
+    }
+
+    if (batch) {
+        QSLIST_INSERT_HEAD(local_pool, batch, next);
+        local_pool_cleanup_init_once();
+    }
+}
+
+/* Add a batch of coroutines to the global pool */
+static void coroutine_pool_put_global(CoroutinePoolBatch *batch)
+{
+    WITH_QEMU_LOCK_GUARD(&global_pool_lock) {
+        unsigned int max = MIN(global_pool_max_size,
+                               global_pool_hard_max_size);
+
+        if (global_pool_size < max) {
+            QSLIST_INSERT_HEAD(&global_pool, batch, next);
+
+            /* Overshooting the max pool size is allowed */
+            global_pool_size += batch->size;
+            return;
+        }
+    }
+
+    /* The global pool was full, so throw away this batch */
+    coroutine_pool_batch_delete(batch);
+}
+
+/* Get the next unused coroutine from the pool or return NULL */
+static Coroutine *coroutine_pool_get(void)
+{
+    Coroutine *co;
+
+    co = coroutine_pool_get_local();
+    if (!co) {
+        coroutine_pool_refill_local();
+        co = coroutine_pool_get_local();
+    }
+    return co;
+}
+
+static void coroutine_pool_put(Coroutine *co)
+{
+    CoroutinePool *local_pool = get_ptr_local_pool();
+    CoroutinePoolBatch *batch = QSLIST_FIRST(local_pool);
+
+    if (unlikely(!batch)) {
+        batch = coroutine_pool_batch_new();
+        QSLIST_INSERT_HEAD(local_pool, batch, next);
+        local_pool_cleanup_init_once();
+    }
+
+    if (unlikely(batch->size >= COROUTINE_POOL_BATCH_MAX_SIZE)) {
+        CoroutinePoolBatch *next = QSLIST_NEXT(batch, next);
+
+        /* Is the local pool full? */
+        if (next) {
+            QSLIST_REMOVE_HEAD(local_pool, next);
+            coroutine_pool_put_global(batch);
+        }
+
+        batch = coroutine_pool_batch_new();
+        QSLIST_INSERT_HEAD(local_pool, batch, next);
+    }
+
+    QSLIST_INSERT_HEAD(&batch->list, co, pool_next);
+    batch->size++;
 }
 
 Coroutine *qemu_coroutine_create(CoroutineEntry *entry, void *opaque)
@@ -58,31 +219,7 @@ Coroutine *qemu_coroutine_create(CoroutineEntry *entry, void *opaque)
     Coroutine *co = NULL;
 
     if (IS_ENABLED(CONFIG_COROUTINE_POOL)) {
-        CoroutineQSList *alloc_pool = get_ptr_alloc_pool();
-
-        co = QSLIST_FIRST(alloc_pool);
-        if (!co) {
-            if (release_pool_size > POOL_MIN_BATCH_SIZE) {
-                /* Slow path; a good place to register the destructor, too.  */
-                Notifier *notifier = get_ptr_coroutine_pool_cleanup_notifier();
-                if (!notifier->notify) {
-                    notifier->notify = coroutine_pool_cleanup;
-                    qemu_thread_atexit_add(notifier);
-                }
-
-                /* This is not exact; there could be a little skew between
-                 * release_pool_size and the actual size of release_pool. But
-                 * it is just a heuristic, it does not need to be perfect.
-                 */
-                set_alloc_pool_size(qatomic_xchg(&release_pool_size, 0));
-                QSLIST_MOVE_ATOMIC(alloc_pool, &release_pool);
-                co = QSLIST_FIRST(alloc_pool);
-            }
-        }
-        if (co) {
-            QSLIST_REMOVE_HEAD(alloc_pool, pool_next);
-            set_alloc_pool_size(get_alloc_pool_size() - 1);
-        }
+        co = coroutine_pool_get();
     }
 
     if (!co) {
@@ -100,19 +237,10 @@ static void coroutine_delete(Coroutine *co)
     co->caller = NULL;
 
     if (IS_ENABLED(CONFIG_COROUTINE_POOL)) {
-        if (release_pool_size < qatomic_read(&pool_max_size) * 2) {
-            QSLIST_INSERT_HEAD_ATOMIC(&release_pool, co, pool_next);
-            qatomic_inc(&release_pool_size);
-            return;
-        }
-        if (get_alloc_pool_size() < qatomic_read(&pool_max_size)) {
-            QSLIST_INSERT_HEAD(get_ptr_alloc_pool(), co, pool_next);
-            set_alloc_pool_size(get_alloc_pool_size() + 1);
-            return;
-        }
+        coroutine_pool_put(co);
+    } else {
+        qemu_coroutine_delete(co);
     }
-
-    qemu_coroutine_delete(co);
 }
 
 void qemu_aio_coroutine_enter(AioContext *ctx, Coroutine *co)
@@ -223,10 +351,46 @@ AioContext *qemu_coroutine_get_aio_context(Coroutine *co)
 
 void qemu_coroutine_inc_pool_size(unsigned int additional_pool_size)
 {
-    qatomic_add(&pool_max_size, additional_pool_size);
+    QEMU_LOCK_GUARD(&global_pool_lock);
+    global_pool_max_size += additional_pool_size;
 }
 
 void qemu_coroutine_dec_pool_size(unsigned int removing_pool_size)
 {
-    qatomic_sub(&pool_max_size, removing_pool_size);
+    QEMU_LOCK_GUARD(&global_pool_lock);
+    global_pool_max_size -= removing_pool_size;
+}
+
+static unsigned int get_global_pool_hard_max_size(void)
+{
+#ifdef __linux__
+    g_autofree char *contents = NULL;
+    int max_map_count;
+
+    /*
+     * Linux processes can have up to max_map_count virtual memory areas
+     * (VMAs). mmap(2), mprotect(2), etc fail with ENOMEM beyond this limit. We
+     * must limit the coroutine pool to a safe size to avoid running out of
+     * VMAs.
+     */
+    if (g_file_get_contents("/proc/sys/vm/max_map_count", &contents, NULL,
+                            NULL) &&
+        qemu_strtoi(contents, NULL, 10, &max_map_count) == 0) {
+        /*
+         * This is a conservative upper bound that avoids exceeding
+         * max_map_count. Leave half for non-coroutine users like library
+         * dependencies, vhost-user, etc. Each coroutine takes up 2 VMAs so
+         * halve the amount again.
+         */
+        return max_map_count / 4;
+    }
+#endif
+
+    return UINT_MAX;
+}
+
+static void __attribute__((constructor)) qemu_coroutine_init(void)
+{
+    qemu_mutex_init(&global_pool_lock);
+    global_pool_hard_max_size = get_global_pool_hard_max_size();
 }
The coroutine pool implementation can hit the Linux vm.max_map_count
limit, causing QEMU to abort with "failed to allocate memory for stack"
or "failed to set up stack guard page" during coroutine creation.

This happens because per-thread pools can grow to tens of thousands of
coroutines. Each coroutine causes 2 virtual memory areas to be created.
Eventually vm.max_map_count is reached and memory-related syscalls fail.
The per-thread pool sizes are non-uniform and depend on past coroutine
usage in each thread, so it's possible for one thread to have a large
pool while another thread's pool is empty.

Switch to a new coroutine pool implementation with a global pool that
grows to a maximum number of coroutines and per-thread local pools that
are capped at a hardcoded small number of coroutines.

This approach does not leave large numbers of coroutines pooled in a
thread that may not use them again. In order to perform well it
amortizes the cost of global pool accesses by working in batches of
coroutines instead of individual coroutines.

The global pool is a list. Threads donate batches of coroutines to it
when they have too many and take batches from it when they have too few:

  .-----------------------------------.
  | Batch 1 | Batch 2 | Batch 3 | ... | global_pool
  `-----------------------------------'

Each thread has up to 2 batches of coroutines:

  .-------------------.
  | Batch 1 | Batch 2 | per-thread local_pool (maximum 2 batches)
  `-------------------'

The goal of this change is to reduce the excessive number of pooled
coroutines that cause QEMU to abort when vm.max_map_count is reached
without losing the performance of an adequately sized coroutine pool.

Here are virtio-blk disk I/O benchmark results:

        RW BLKSIZE IODEPTH    OLD    NEW CHANGE
  randread      4k       1 113725 117451  +3.3%
  randread      4k       8 192968 198510  +2.9%
  randread      4k      16 207138 209429  +1.1%
  randread      4k      32 212399 215145  +1.3%
  randread      4k      64 218319 221277  +1.4%
  randread    128k       1  17587  17535  -0.3%
  randread    128k       8  17614  17616  +0.0%
  randread    128k      16  17608  17609  +0.0%
  randread    128k      32  17552  17553  +0.0%
  randread    128k      64  17484  17484  +0.0%

See files/{fio.sh,test.xml.j2} for the benchmark configuration:
https://gitlab.com/stefanha/virt-playbooks/-/tree/coroutine-pool-fix-sizing

Buglink: https://issues.redhat.com/browse/RHEL-28947
Reported-by: Sanjay Rao <srao@redhat.com>
Reported-by: Boaz Ben Shabat <bbenshab@redhat.com>
Reported-by: Joe Mario <jmario@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
This patch obsoletes "[PATCH v2] virtio-blk: iothread-vq-mapping
coroutine pool sizing" because the pool size is now global instead of
per thread:
https://lore.kernel.org/qemu-devel/20240312151204.412624-1-stefanha@redhat.com/
Please don't apply "[PATCH v2] virtio-blk: iothread-vq-mapping coroutine
pool sizing".

 util/qemu-coroutine.c | 282 +++++++++++++++++++++++++++++++++---------
 1 file changed, 223 insertions(+), 59 deletions(-)