[v0,0/4] backends/hostmem: add an ability to specify prealloc timeout

Message ID	20230120134749.550639-1-d-tatianin@yandex-team.ru
From: Daniil Tatianin <d-tatianin@yandex-team.ru>
To: Paolo Bonzini <pbonzini@redhat.com>
Cc: Daniil Tatianin <d-tatianin@yandex-team.ru>, qemu-devel@nongnu.org, Stefan Weil <sw@weilnetz.de>, David Hildenbrand <david@redhat.com>, Igor Mammedov <imammedo@redhat.com>, yc-core@yandex-team.ru
Subject: [PATCH v0 0/4] backends/hostmem: add an ability to specify prealloc timeout
Date: Fri, 20 Jan 2023 16:47:45 +0300
Series	backends/hostmem: add an ability to specify prealloc timeout

Daniil Tatianin Jan. 20, 2023, 1:47 p.m. UTC

This series introduces a new qemu_prealloc_mem_with_timeout() API,
which allows limiting the maximum amount of time to be spent on memory
preallocation. It also adds prealloc statistics collection that is
exposed via an optional timeout handler.

This new API is then utilized by hostmem for guest RAM preallocation
controlled via new object properties called 'prealloc-timeout' and
'prealloc-timeout-fatal'.

This is useful for limiting VM startup time on systems with
unpredictable page allocation delays due to memory fragmentation or the
backing storage. The timeout can be configured to either simply emit a
warning and continue VM startup without having preallocated the entire
guest RAM or just abort startup entirely if that is not acceptable for
a specific use case.

Daniil Tatianin (4):
  oslib: introduce new qemu_prealloc_mem_with_timeout() api
  backends/hostmem: move memory region preallocation logic into a helper
  backends/hostmem: add an ability to specify prealloc timeout
  backends/hostmem: add an ability to make prealloc timeout fatal

 backends/hostmem.c       | 112 +++++++++++++++++++++++++++++++-------
 include/qemu/osdep.h     |  19 +++++++
 include/sysemu/hostmem.h |   3 ++
 qapi/qom.json            |   8 +++
 util/oslib-posix.c       | 114 +++++++++++++++++++++++++++++++++++----
 util/oslib-win32.c       |   9 ++++
 6 files changed, 238 insertions(+), 27 deletions(-)

David Hildenbrand Jan. 23, 2023, 8:57 a.m. UTC | #1

On 20.01.23 14:47, Daniil Tatianin wrote:
> This series introduces new qemu_prealloc_mem_with_timeout() api,
> which allows limiting the maximum amount of time to be spent on memory
> preallocation. It also adds prealloc statistics collection that is
> exposed via an optional timeout handler.
> 
> This new api is then utilized by hostmem for guest RAM preallocation
> controlled via new object properties called 'prealloc-timeout' and
> 'prealloc-timeout-fatal'.
> 
> This is useful for limiting VM startup time on systems with
> unpredictable page allocation delays due to memory fragmentation or the
> backing storage. The timeout can be configured to either simply emit a
> warning and continue VM startup without having preallocated the entire
> guest RAM or just abort startup entirely if that is not acceptable for
> a specific use case.

The major use case for preallocation is memory resources that cannot be 
overcommitted (hugetlb, file blocks, ...), to avoid running out of such 
resources later, while the guest is already running, and crashing it.

Allocating only a fraction "because it takes too long" looks quite 
useless in that (main use-case) context. We shouldn't encourage QEMU 
users to play with fire in such a way. IOW, there should be no way 
around "prealloc-timeout-fatal". Either preallocation succeeded and the 
guest can run, or it failed, and the guest can't run.

... but then, management tools can simply start QEMU with "-S", start their
own timer, and zap QEMU if it didn't manage to come up in time, and
simply start a new QEMU instance without preallocation enabled.

The "good" thing about that approach is that it will also cover any
implicit memory preallocation, like using mlock() or VFIO, that doesn't
run in ordinary per-hostmem preallocation context. If setting QEMU up
takes too long, you might want to try on a different hypervisor in your
cluster instead.

I don't immediately see why we want to make our preallocation+hostmem
implementation in QEMU more complicated for such a use case.

Daniil Tatianin Jan. 23, 2023, 1:30 p.m. UTC | #2

On 1/23/23 11:57 AM, David Hildenbrand wrote:
> On 20.01.23 14:47, Daniil Tatianin wrote:
>> This series introduces new qemu_prealloc_mem_with_timeout() api,
>> which allows limiting the maximum amount of time to be spent on memory
>> preallocation. It also adds prealloc statistics collection that is
>> exposed via an optional timeout handler.
>>
>> This new api is then utilized by hostmem for guest RAM preallocation
>> controlled via new object properties called 'prealloc-timeout' and
>> 'prealloc-timeout-fatal'.
>>
>> This is useful for limiting VM startup time on systems with
>> unpredictable page allocation delays due to memory fragmentation or the
>> backing storage. The timeout can be configured to either simply emit a
>> warning and continue VM startup without having preallocated the entire
>> guest RAM or just abort startup entirely if that is not acceptable for
>> a specific use case.
> 
> The major use case for preallocation is memory resources that cannot be 
> overcommitted (hugetlb, file blocks, ...), to avoid running out of such 
> resources later, while the guest is already running, and crashing it.

Wouldn't you say that preallocating memory for the sake of speeding up 
guest kernel startup & runtime is a valid use case of prealloc? This way 
we can avoid expensive (for a multitude of reasons) page faults that 
will otherwise slow down the guest significantly at runtime and affect 
the user experience.

> Allocating only a fraction "because it takes too long" looks quite 
> useless in that (main use-case) context. We shouldn't encourage QEMU 
> users to play with fire in such a way. IOW, there should be no way 
> around "prealloc-timeout-fatal". Either preallocation succeeded and the 
> guest can run, or it failed, and the guest can't run.

Here we basically accept the fact that e.g. with fragmented memory the
kernel might take a while in a page fault handler especially for hugetlb 
because of page compaction that has to run for every fault.

This way we can prefault at least some number of pages and let the guest 
fault the rest on demand later on during runtime even if it's slow and 
would cause a noticeable lag.

> ... but then, management tools can simply start QEMU with "-S", start their
> own timer, and zap QEMU if it didn't manage to come up in time, and
> simply start a new QEMU instance without preallocation enabled.
> 
> The "good" thing about that approach is that it will also cover any 
> implicit memory preallocation, like using mlock() or VFIO, that doesn't
> run in ordinary per-hostmem preallocation context. If setting QEMU up
> takes too long, you might want to try on a different hypervisor in your
> cluster instead.

This approach definitely works too but again it assumes that we always 
want 'prealloc-timeout-fatal' to be on, which is, for the most part, only
the case for working around issues that might be caused by overcommit.

> 
> I don't immediately see why we want to make our preallocation+hostmem
> implementation in QEMU more complicated for such a use case.
>

Daniel P. Berrangé Jan. 23, 2023, 1:47 p.m. UTC | #3

On Mon, Jan 23, 2023 at 04:30:03PM +0300, Daniil Tatianin wrote:
> On 1/23/23 11:57 AM, David Hildenbrand wrote:
> > On 20.01.23 14:47, Daniil Tatianin wrote:
> > > This series introduces new qemu_prealloc_mem_with_timeout() api,
> > > which allows limiting the maximum amount of time to be spent on memory
> > > preallocation. It also adds prealloc statistics collection that is
> > > exposed via an optional timeout handler.
> > > 
> > > This new api is then utilized by hostmem for guest RAM preallocation
> > > controlled via new object properties called 'prealloc-timeout' and
> > > 'prealloc-timeout-fatal'.
> > > 
> > > This is useful for limiting VM startup time on systems with
> > > unpredictable page allocation delays due to memory fragmentation or the
> > > backing storage. The timeout can be configured to either simply emit a
> > > warning and continue VM startup without having preallocated the entire
> > > guest RAM or just abort startup entirely if that is not acceptable for
> > > a specific use case.
> > 
> > The major use case for preallocation is memory resources that cannot be
> > overcommitted (hugetlb, file blocks, ...), to avoid running out of such
> > resources later, while the guest is already running, and crashing it.
> 
> Wouldn't you say that preallocating memory for the sake of speeding up guest
> kernel startup & runtime is a valid use case of prealloc? This way we can
> avoid expensive (for a multitude of reasons) page faults that will otherwise
> slow down the guest significantly at runtime and affect the user experience.
> 
> > Allocating only a fraction "because it takes too long" looks quite
> > useless in that (main use-case) context. We shouldn't encourage QEMU
> > users to play with fire in such a way. IOW, there should be no way
> > around "prealloc-timeout-fatal". Either preallocation succeeded and the
> > guest can run, or it failed, and the guest can't run.
> 
> > Here we basically accept the fact that e.g. with fragmented memory the kernel
> might take a while in a page fault handler especially for hugetlb because of
> page compaction that has to run for every fault.
> 
> This way we can prefault at least some number of pages and let the guest
> fault the rest on demand later on during runtime even if it's slow and would
> cause a noticeable lag.

Rather than treat this as a problem that needs a timeout, can we
restate it as situations that need synchronous vs asynchronous
preallocation?

For the case where we need synchronous prealloc, current QEMU deals
with that. If it doesn't work quickly enough, mgmt can just kill
QEMU already today.

For the case where you would like some prealloc, but don't mind
if it runs without full prealloc, then why not just treat it as an
entirely asynchronous task ? Instead of calling qemu_prealloc_mem
and waiting for it to complete, just spawn a thread to run
qemu_prealloc_mem, so it doesn't block QEMU startup. This will
have minimal maint burden on the existing code, and will avoid
need for mgmt apps to think about what timeout value to give,
which is good because timeouts are hard to get right.

Most of the time that async background prealloc will still finish
before the guest even gets out of the firmware phase, but if it
takes longer it is no big deal. You don't need to quit the prealloc
job early, you just need it to not delay the guest OS boot IIUC.

This impl could be done with the 'prealloc' property turning from
a boolean on/off, to an enum on/async/off, where 'on' == sync
prealloc. Or add a separate 'prealloc-async' bool property.
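
The asynchronous variant described above can be sketched with a plain POSIX thread. This is only an illustration under stated assumptions: struct async_prealloc, async_prealloc_start(), async_prealloc_wait() and populate_placeholder() are hypothetical names, not existing QEMU functions, and QEMU would use its own thread helpers. The populate step is passed in as a callback so the sketch stays independent of how pages are actually faulted in.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical job states for the background preallocation. */
enum { PREALLOC_RUNNING, PREALLOC_DONE, PREALLOC_FAILED };

/* Hypothetical callback type: returns 0 on success, non-zero on failure. */
typedef int (*populate_fn)(void *area, size_t size);

struct async_prealloc {
    pthread_t thread;
    populate_fn populate;
    void *area;
    size_t size;
    atomic_int state;
};

/* Placeholder populate step that does nothing. A real one must not
 * modify memory that the guest may already be using. */
static int populate_placeholder(void *area, size_t size)
{
    (void)area;
    (void)size;
    return 0;
}

/* Thread body: run the populate step and record the outcome. */
static void *async_prealloc_worker(void *opaque)
{
    struct async_prealloc *job = opaque;
    int ret = job->populate(job->area, job->size);

    atomic_store(&job->state, ret == 0 ? PREALLOC_DONE : PREALLOC_FAILED);
    return NULL;
}

/* Start preallocation in the background; returns 0 if the thread started.
 * The caller continues with startup instead of waiting. */
static int async_prealloc_start(struct async_prealloc *job, populate_fn fn,
                                void *area, size_t size)
{
    job->populate = fn;
    job->area = area;
    job->size = size;
    atomic_init(&job->state, PREALLOC_RUNNING);
    return pthread_create(&job->thread, NULL, async_prealloc_worker, job);
}

/* Optional: wait for the job (e.g. at shutdown) and return its final state. */
static int async_prealloc_wait(struct async_prealloc *job)
{
    pthread_join(job->thread, NULL);
    return atomic_load(&job->state);
}
```

In this shape, 'on' would correspond to calling the populate step directly, and 'async' to calling async_prealloc_start() and letting the guest boot while the thread runs.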

With regards,
Daniel

David Hildenbrand Jan. 23, 2023, 1:56 p.m. UTC | #4

On 23.01.23 14:30, Daniil Tatianin wrote:
> On 1/23/23 11:57 AM, David Hildenbrand wrote:
>> On 20.01.23 14:47, Daniil Tatianin wrote:
>>> This series introduces new qemu_prealloc_mem_with_timeout() api,
>>> which allows limiting the maximum amount of time to be spent on memory
>>> preallocation. It also adds prealloc statistics collection that is
>>> exposed via an optional timeout handler.
>>>
>>> This new api is then utilized by hostmem for guest RAM preallocation
>>> controlled via new object properties called 'prealloc-timeout' and
>>> 'prealloc-timeout-fatal'.
>>>
>>> This is useful for limiting VM startup time on systems with
>>> unpredictable page allocation delays due to memory fragmentation or the
>>> backing storage. The timeout can be configured to either simply emit a
>>> warning and continue VM startup without having preallocated the entire
>>> guest RAM or just abort startup entirely if that is not acceptable for
>>> a specific use case.
>>
>> The major use case for preallocation is memory resources that cannot be
>> overcommitted (hugetlb, file blocks, ...), to avoid running out of such
>> resources later, while the guest is already running, and crashing it.
> 
> Wouldn't you say that preallocating memory for the sake of speeding up
> guest kernel startup & runtime is a valid use case of prealloc? This way
> we can avoid expensive (for a multitude of reasons) page faults that
> will otherwise slow down the guest significantly at runtime and affect
> the user experience.

With "ordinary" memory (anon/shmem/file), there is no such guarantee 
unless you effectively prevent swapping/writeback or run in an extremely 
controlled environment. With anon memory, you further have to disable 
KSM, because that could immediately de-duplicate the zeroed pages again.

For this reason, I am not aware of preallocation getting used for the 
use case you mentioned. Performance-sensitive workloads want 
determinism, and consequently usually use hugetlb + preallocation. Or 
mlockall() to effectively allocate all memory and lock it before 
starting the VM.

Regarding page faults: with THP, the guest will touch a 2 MiB range 
once, and you'll get a 2 MiB page populated, requiring no further write 
faults, which should already heavily reduce page faults when booting a 
guest.

Preallocating all guest memory to make a guest kernel boot up faster
sounds a bit weird to me. Preallocating "some random part of guest
memory" sounds weird, too: what if the guest uses exactly the
memory locations you didn't preallocate?

I'd suggest doing some measurements if there are actually cases where
"randomly preallocating some memory pages" is actually beneficial when
considering the overall startup time (setting up VM + starting the OS).

> 
>> Allocating only a fraction "because it takes too long" looks quite
>> useless in that (main use-case) context. We shouldn't encourage QEMU
>> users to play with fire in such a way. IOW, there should be no way
>> around "prealloc-timeout-fatal". Either preallocation succeeded and the
>> guest can run, or it failed, and the guest can't run.
> 
> Here we basically accept the fact that e.g. with fragmented memory the
> kernel might take a while in a page fault handler especially for hugetlb
> because of page compaction that has to run for every fault.
> 
> This way we can prefault at least some number of pages and let the guest
> fault the rest on demand later on during runtime even if it's slow and
> would cause a noticeable lag.

Sorry, I don't really see the value of this "preallocating a random
portion of guest memory".

In practice, Linux guests will only touch all memory once that memory is
required (e.g., allocated), not by default during bootup.

What you could do is start the VM from a shmem/hugetlb/... file, and
concurrently start preallocating all memory from a second process. The 
guest can boot up immediately and eventually you'll have all guest 
memory allocated. It won't work with anon memory (memory-backend-ram) 
and private mappings (shared=false), of course.

> 
>> ... but then, management tools can simply start QEMU with "-S", start their
>> own timer, and zap QEMU if it didn't manage to come up in time, and
>> simply start a new QEMU instance without preallocation enabled.
>>
>> The "good" thing about that approach is that it will also cover any
>> implicit memory preallocation, like using mlock() or VFIO, that doesn't
>> run in ordinary per-hostmem preallocation context. If setting QEMU up
>> takes too long, you might want to try on a different hypervisor in your
>> cluster instead.
> 
> This approach definitely works too but again it assumes that we always
> want 'prealloc-timeout-fatal' to be on, which is, for the most part, only
> the case for working around issues that might be caused by overcommit.

Can you elaborate? Thanks.

David Hildenbrand Jan. 23, 2023, 2:10 p.m. UTC | #5

On 23.01.23 14:47, Daniel P. Berrangé wrote:
> On Mon, Jan 23, 2023 at 04:30:03PM +0300, Daniil Tatianin wrote:
>> On 1/23/23 11:57 AM, David Hildenbrand wrote:
>>> On 20.01.23 14:47, Daniil Tatianin wrote:
>>>> This series introduces new qemu_prealloc_mem_with_timeout() api,
>>>> which allows limiting the maximum amount of time to be spent on memory
>>>> preallocation. It also adds prealloc statistics collection that is
>>>> exposed via an optional timeout handler.
>>>>
>>>> This new api is then utilized by hostmem for guest RAM preallocation
>>>> controlled via new object properties called 'prealloc-timeout' and
>>>> 'prealloc-timeout-fatal'.
>>>>
>>>> This is useful for limiting VM startup time on systems with
>>>> unpredictable page allocation delays due to memory fragmentation or the
>>>> backing storage. The timeout can be configured to either simply emit a
>>>> warning and continue VM startup without having preallocated the entire
>>>> guest RAM or just abort startup entirely if that is not acceptable for
>>>> a specific use case.
>>>
>>> The major use case for preallocation is memory resources that cannot be
>>> overcommitted (hugetlb, file blocks, ...), to avoid running out of such
>>> resources later, while the guest is already running, and crashing it.
>>
>> Wouldn't you say that preallocating memory for the sake of speeding up guest
>> kernel startup & runtime is a valid use case of prealloc? This way we can
>> avoid expensive (for a multitude of reasons) page faults that will otherwise
>> slow down the guest significantly at runtime and affect the user experience.
>>
>>> Allocating only a fraction "because it takes too long" looks quite
>>> useless in that (main use-case) context. We shouldn't encourage QEMU
>>> users to play with fire in such a way. IOW, there should be no way
>>> around "prealloc-timeout-fatal". Either preallocation succeeded and the
>>> guest can run, or it failed, and the guest can't run.
>>
>> Here we basically accept the fact that e.g. with fragmented memory the kernel
>> might take a while in a page fault handler especially for hugetlb because of
>> page compaction that has to run for every fault.
>>
>> This way we can prefault at least some number of pages and let the guest
>> fault the rest on demand later on during runtime even if it's slow and would
>> cause a noticeable lag.
> 
> Rather than treat this as a problem that needs a timeout, can we
> restate it as situations that need synchronous vs asynchronous
> preallocation?
> 
> For the case where we need synchronous prealloc, current QEMU deals
> with that. If it doesn't work quickly enough, mgmt can just kill
> QEMU already today.
> 
> For the case where you would like some prealloc, but don't mind
> if it runs without full prealloc, then why not just treat it as an
> entirely asynchronous task ? Instead of calling qemu_prealloc_mem
> and waiting for it to complete, just spawn a thread to run
> qemu_prealloc_mem, so it doesn't block QEMU startup. This will
> have minimal maint burden on the existing code, and will avoid
> need for mgmt apps to think about what timeout value to give,
> which is good because timeouts are hard to get right.
> 
> Most of the time that async background prealloc will still finish
> before the guest even gets out of the firmware phase, but if it
> takes longer it is no big deal. You don't need to quit the prealloc
> job early, you just need it to not delay the guest OS boot IIUC.
> 
> This impl could be done with the 'prealloc' property turning from
> a boolean on/off, to an enum on/async/off, where 'on' == sync
> prealloc. Or add a separate 'prealloc-async' bool property

That sounds better to me.

Daniil Tatianin Jan. 23, 2023, 2:14 p.m. UTC | #6

On 1/23/23 4:47 PM, Daniel P. Berrangé wrote:
> On Mon, Jan 23, 2023 at 04:30:03PM +0300, Daniil Tatianin wrote:
>> On 1/23/23 11:57 AM, David Hildenbrand wrote:
>>> On 20.01.23 14:47, Daniil Tatianin wrote:
>>>> This series introduces new qemu_prealloc_mem_with_timeout() api,
>>>> which allows limiting the maximum amount of time to be spent on memory
>>>> preallocation. It also adds prealloc statistics collection that is
>>>> exposed via an optional timeout handler.
>>>>
>>>> This new api is then utilized by hostmem for guest RAM preallocation
>>>> controlled via new object properties called 'prealloc-timeout' and
>>>> 'prealloc-timeout-fatal'.
>>>>
>>>> This is useful for limiting VM startup time on systems with
>>>> unpredictable page allocation delays due to memory fragmentation or the
>>>> backing storage. The timeout can be configured to either simply emit a
>>>> warning and continue VM startup without having preallocated the entire
>>>> guest RAM or just abort startup entirely if that is not acceptable for
>>>> a specific use case.
>>>
>>> The major use case for preallocation is memory resources that cannot be
>>> overcommitted (hugetlb, file blocks, ...), to avoid running out of such
>>> resources later, while the guest is already running, and crashing it.
>>
>> Wouldn't you say that preallocating memory for the sake of speeding up guest
>> kernel startup & runtime is a valid use case of prealloc? This way we can
>> avoid expensive (for a multitude of reasons) page faults that will otherwise
>> slow down the guest significantly at runtime and affect the user experience.
>>
>>> Allocating only a fraction "because it takes too long" looks quite
>>> useless in that (main use-case) context. We shouldn't encourage QEMU
>>> users to play with fire in such a way. IOW, there should be no way
>>> around "prealloc-timeout-fatal". Either preallocation succeeded and the
>>> guest can run, or it failed, and the guest can't run.
>>
>> Here we basically accept the fact that e.g. with fragmented memory the kernel
>> might take a while in a page fault handler especially for hugetlb because of
>> page compaction that has to run for every fault.
>>
>> This way we can prefault at least some number of pages and let the guest
>> fault the rest on demand later on during runtime even if it's slow and would
>> cause a noticeable lag.
> 
> Rather than treat this as a problem that needs a timeout, can we
> restate it as situations that need synchronous vs asynchronous
> preallocation?
> 
> For the case where we need synchronous prealloc, current QEMU deals
> with that. If it doesn't work quickly enough, mgmt can just kill
> QEMU already today.
> 
> For the case where you would like some prealloc, but don't mind
> if it runs without full prealloc, then why not just treat it as an
> entirely asynchronous task ? Instead of calling qemu_prealloc_mem
> and waiting for it to complete, just spawn a thread to run
> qemu_prealloc_mem, so it doesn't block QEMU startup. This will
> have minimal maint burden on the existing code, and will avoid
> need for mgmt apps to think about what timeout value to give,
> which is good because timeouts are hard to get right.
> 
> Most of the time that async background prealloc will still finish
> before the guest even gets out of the firmware phase, but if it
> takes longer it is no big deal. You don't need to quit the prealloc
> job early, you just need it to not delay the guest OS boot IIUC.
> 
> This impl could be done with the 'prealloc' property turning from
> a boolean on/off, to an enum on/async/off, where 'on' == sync
> prealloc. Or add a separate 'prealloc-async' bool property

I like this idea, but I'm not sure how we would go about writing to live 
guest memory. Is that something that can be done safely without racing 
with the guest?

> With regards,
> Daniel

David Hildenbrand Jan. 23, 2023, 2:16 p.m. UTC | #7

On 23.01.23 15:14, Daniil Tatianin wrote:
> On 1/23/23 4:47 PM, Daniel P. Berrangé wrote:
>> On Mon, Jan 23, 2023 at 04:30:03PM +0300, Daniil Tatianin wrote:
>>> On 1/23/23 11:57 AM, David Hildenbrand wrote:
>>>> On 20.01.23 14:47, Daniil Tatianin wrote:
>>>>> This series introduces new qemu_prealloc_mem_with_timeout() api,
>>>>> which allows limiting the maximum amount of time to be spent on memory
>>>>> preallocation. It also adds prealloc statistics collection that is
>>>>> exposed via an optional timeout handler.
>>>>>
>>>>> This new api is then utilized by hostmem for guest RAM preallocation
>>>>> controlled via new object properties called 'prealloc-timeout' and
>>>>> 'prealloc-timeout-fatal'.
>>>>>
>>>>> This is useful for limiting VM startup time on systems with
>>>>> unpredictable page allocation delays due to memory fragmentation or the
>>>>> backing storage. The timeout can be configured to either simply emit a
>>>>> warning and continue VM startup without having preallocated the entire
>>>>> guest RAM or just abort startup entirely if that is not acceptable for
>>>>> a specific use case.
>>>>
>>>> The major use case for preallocation is memory resources that cannot be
>>>> overcommitted (hugetlb, file blocks, ...), to avoid running out of such
>>>> resources later, while the guest is already running, and crashing it.
>>>
>>> Wouldn't you say that preallocating memory for the sake of speeding up guest
>>> kernel startup & runtime is a valid use case of prealloc? This way we can
>>> avoid expensive (for a multitude of reasons) page faults that will otherwise
>>> slow down the guest significantly at runtime and affect the user experience.
>>>
>>>> Allocating only a fraction "because it takes too long" looks quite
>>>> useless in that (main use-case) context. We shouldn't encourage QEMU
>>>> users to play with fire in such a way. IOW, there should be no way
>>>> around "prealloc-timeout-fatal". Either preallocation succeeded and the
>>>> guest can run, or it failed, and the guest can't run.
>>>
>>> Here we basically accept the fact that e.g. with fragmented memory the kernel
>>> might take a while in a page fault handler especially for hugetlb because of
>>> page compaction that has to run for every fault.
>>>
>>> This way we can prefault at least some number of pages and let the guest
>>> fault the rest on demand later on during runtime even if it's slow and would
>>> cause a noticeable lag.
>>
>> Rather than treat this as a problem that needs a timeout, can we
>> restate it as situations that need synchronous vs asynchronous
>> preallocation?
>>
>> For the case where we need synchronous prealloc, current QEMU deals
>> with that. If it doesn't work quickly enough, mgmt can just kill
>> QEMU already today.
>>
>> For the case where you would like some prealloc, but don't mind
>> if it runs without full prealloc, then why not just treat it as an
>> entirely asynchronous task ? Instead of calling qemu_prealloc_mem
>> and waiting for it to complete, just spawn a thread to run
>> qemu_prealloc_mem, so it doesn't block QEMU startup. This will
>> have minimal maint burden on the existing code, and will avoid
>> need for mgmt apps to think about what timeout value to give,
>> which is good because timeouts are hard to get right.
>>
>> Most of the time that async background prealloc will still finish
>> before the guest even gets out of the firmware phase, but if it
>> takes longer it is no big deal. You don't need to quit the prealloc
>> job early, you just need it to not delay the guest OS boot IIUC.
>>
>> This impl could be done with the 'prealloc' property turning from
>> a boolean on/off, to an enum on/async/off, where 'on' == sync
>> prealloc. Or add a separate 'prealloc-async' bool property
> 
> I like this idea, but I'm not sure how we would go about writing to live
> guest memory. Is that something that can be done safely without racing
> with the guest?

You can use MADV_POPULATE_WRITE safely, as it doesn't actually perform a 
write. We'd have to fail async=true if MADV_POPULATE_WRITE cannot be used.

Daniel P. Berrangé Jan. 23, 2023, 4:01 p.m. UTC | #8

On Mon, Jan 23, 2023 at 03:16:03PM +0100, David Hildenbrand wrote:
> On 23.01.23 15:14, Daniil Tatianin wrote:
> > On 1/23/23 4:47 PM, Daniel P. Berrangé wrote:
> > > On Mon, Jan 23, 2023 at 04:30:03PM +0300, Daniil Tatianin wrote:
> > > > On 1/23/23 11:57 AM, David Hildenbrand wrote:
> > > > > On 20.01.23 14:47, Daniil Tatianin wrote:
> > > > > > This series introduces new qemu_prealloc_mem_with_timeout() api,
> > > > > > which allows limiting the maximum amount of time to be spent on memory
> > > > > > preallocation. It also adds prealloc statistics collection that is
> > > > > > exposed via an optional timeout handler.
> > > > > > 
> > > > > > This new api is then utilized by hostmem for guest RAM preallocation
> > > > > > controlled via new object properties called 'prealloc-timeout' and
> > > > > > 'prealloc-timeout-fatal'.
> > > > > > 
> > > > > > This is useful for limiting VM startup time on systems with
> > > > > > unpredictable page allocation delays due to memory fragmentation or the
> > > > > > backing storage. The timeout can be configured to either simply emit a
> > > > > > warning and continue VM startup without having preallocated the entire
> > > > > > guest RAM or just abort startup entirely if that is not acceptable for
> > > > > > a specific use case.
> > > > > 
> > > > > The major use case for preallocation is memory resources that cannot be
> > > > > overcommitted (hugetlb, file blocks, ...), to avoid running out of such
> > > > > resources later, while the guest is already running, and crashing it.
> > > > 
> > > > Wouldn't you say that preallocating memory for the sake of speeding up guest
> > > > kernel startup & runtime is a valid use case of prealloc? This way we can
> > > > avoid expensive (for a multitude of reasons) page faults that will otherwise
> > > > slow down the guest significantly at runtime and affect the user experience.
> > > > 
> > > > > Allocating only a fraction "because it takes too long" looks quite
> > > > > useless in that (main use-case) context. We shouldn't encourage QEMU
> > > > > users to play with fire in such a way. IOW, there should be no way
> > > > > around "prealloc-timeout-fatal". Either preallocation succeeded and the
> > > > > guest can run, or it failed, and the guest can't run.
> > > > 
> > > > Here we basically accept the fact that e.g. with fragmented memory the kernel
> > > > might take a while in a page fault handler especially for hugetlb because of
> > > > page compaction that has to run for every fault.
> > > > 
> > > > This way we can prefault at least some number of pages and let the guest
> > > > fault the rest on demand later on during runtime even if it's slow and would
> > > > cause a noticeable lag.
> > > 
> > > Rather than treat this as a problem that needs a timeout, can we
> > > restate it as situations that need synchronous vs asynchronous
> > > preallocation?
> > > 
> > > For the case where we need synchronous prealloc, current QEMU deals
> > > with that. If it doesn't work quickly enough, mgmt can just kill
> > > QEMU already today.
> > > 
> > > For the case where you would like some prealloc, but don't mind
> > > if it runs without full prealloc, then why not just treat it as an
> > > entirely asynchronous task ? Instead of calling qemu_prealloc_mem
> > > and waiting for it to complete, just spawn a thread to run
> > > qemu_prealloc_mem, so it doesn't block QEMU startup. This will
> > > have minimal maint burden on the existing code, and will avoid
> > > need for mgmt apps to think about what timeout value to give,
> > > which is good because timeouts are hard to get right.
> > > 
> > > Most of the time that async background prealloc will still finish
> > > before the guest even gets out of the firmware phase, but if it
> > > takes longer it is no big deal. You don't need to quit the prealloc
> > > job early, you just need it to not delay the guest OS boot IIUC.
> > > 
> > > This impl could be done with the 'prealloc' property turning from
> > > a boolean on/off, to an enum on/async/off, where 'on' == sync
> > > prealloc. Or add a separate 'prealloc-async' bool property.
> > 
> > I like this idea, but I'm not sure how we would go about writing to live
> > guest memory. Is that something that can be done safely without racing
> > with the guest?
> 
> You can use MADV_POPULATE_WRITE safely, as it doesn't actually perform a
> write. We'd have to fail async=true if MADV_POPULATE_WRITE cannot be used.

Right, in the short term that means this feature would have limited
availability on our targeted OS platforms, but such issues tend to
fade into irrelevance quicker than we anticipate, as platforms move
forwards at such a fast pace.

With regards,
Daniel
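
For illustration, a minimal sketch of the MADV_POPULATE_WRITE approach
discussed above. It assumes Linux 5.14 or newer; populate_write() and
populate_write_supported() are hypothetical names, not existing QEMU
functions, and the fallback value 23 is the generic Linux definition
for builds against older headers. The probe shows one way an
async=true request could be rejected when the advice is unavailable.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MADV_POPULATE_WRITE
#define MADV_POPULATE_WRITE 23 /* assumption: generic Linux value, 5.14+ */
#endif

/*
 * Hypothetical helper: prefault [addr, addr + len) as writable without
 * changing its contents. Returns 0 on success or a positive errno value.
 */
static int populate_write(void *addr, size_t len)
{
    assert(addr != NULL && len > 0);
    return madvise(addr, len, MADV_POPULATE_WRITE) == 0 ? 0 : errno;
}

/*
 * Hypothetical probe: try the advice on one scratch page to decide
 * whether asynchronous preallocation can be offered at all.
 */
static bool populate_write_supported(void)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    void *p = mmap(NULL, page, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    int rc;

    if (p == MAP_FAILED) {
        return false;
    }
    rc = populate_write(p, page);
    munmap(p, page);
    return rc == 0;
}
```

Because the kernel only simulates a write fault, data the guest has
already stored in the range is left untouched, which is what would make
this usable while the guest is running.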

Valentin Sinitsyn Jan. 24, 2023, 6:57 a.m. UTC | #9

Hello,

On 23.01.2023 19:14, Daniil Tatianin wrote:
> On 1/23/23 4:47 PM, Daniel P. Berrangé wrote:
>> On Mon, Jan 23, 2023 at 04:30:03PM +0300, Daniil Tatianin wrote:
>>> On 1/23/23 11:57 AM, David Hildenbrand wrote:
>>>> On 20.01.23 14:47, Daniil Tatianin wrote:
>>>>> This series introduces new qemu_prealloc_mem_with_timeout() api,
>>>>> which allows limiting the maximum amount of time to be spent on memory
>>>>> preallocation. It also adds prealloc statistics collection that is
>>>>> exposed via an optional timeout handler.
>>>>>
>>>>> This new api is then utilized by hostmem for guest RAM preallocation
>>>>> controlled via new object properties called 'prealloc-timeout' and
>>>>> 'prealloc-timeout-fatal'.
>>>>>
>>>>> This is useful for limiting VM startup time on systems with
>>>>> unpredictable page allocation delays due to memory fragmentation or the
>>>>> backing storage. The timeout can be configured to either simply emit a
>>>>> warning and continue VM startup without having preallocated the entire
>>>>> guest RAM or just abort startup entirely if that is not acceptable for
>>>>> a specific use case.
>>>>
>>>> The major use case for preallocation is memory resources that cannot be
>>>> overcommitted (hugetlb, file blocks, ...), to avoid running out of such
>>>> resources later, while the guest is already running, and crashing it.
>>>
>>> Wouldn't you say that preallocating memory for the sake of speeding up guest
>>> kernel startup & runtime is a valid use case of prealloc? This way we can
>>> avoid expensive (for a multitude of reasons) page faults that will otherwise
>>> slow down the guest significantly at runtime and affect the user experience.
>>>
>>>> Allocating only a fraction "because it takes too long" looks quite
>>>> useless in that (main use-case) context. We shouldn't encourage QEMU
>>>> users to play with fire in such a way. IOW, there should be no way
>>>> around "prealloc-timeout-fatal". Either preallocation succeeded and the
>>>> guest can run, or it failed, and the guest can't run.
>>>
>>> Here we basically accept the fact that e.g. with fragmented memory the kernel
>>> might take a while in a page fault handler especially for hugetlb because of
>>> page compaction that has to run for every fault.
>>>
>>> This way we can prefault at least some number of pages and let the guest
>>> fault the rest on demand later on during runtime even if it's slow and would
>>> cause a noticeable lag.
>>
>> Rather than treat this as a problem that needs a timeout, can we
>> restate it as situations that need synchronous vs asynchronous
>> preallocation ?
>>
>> For the case where we need synchronous prealloc, current QEMU deals
>> with that. If it doesn't work quickly enough, mgmt can just kill
>> QEMU already today.
>>
>> For the case where you would like some prealloc, but don't mind
>> if it runs without full prealloc, then why not just treat it as an
>> entirely asynchronous task ? Instead of calling qemu_prealloc_mem
>> and waiting for it to complete, just spawn a thread to run
>> qemu_prealloc_mem, so it doesn't block QEMU startup. This will
>> have minimal maint burden on the existing code, and will avoid
>> need for mgmt apps to think about what timeout value to give,
>> which is good because timeouts are hard to get right.
>>
>> Most of the time that async background prealloc will still finish
>> before the guest even gets out of the firmware phase, but if it
>> takes longer it is no big deal. You don't need to quit the prealloc
>> job early, you just need it to not delay the guest OS boot IIUC.
>>
>> This impl could be done with the 'prealloc' property turning from
>> a boolean on/off, to an enum on/async/off, where 'on' == sync
>> prealloc. Or add a separate 'prealloc-async' bool property.
> 
> I like this idea, but I'm not sure how we would go about writing to live 
> guest memory. Is that something that can be done safely without racing 
> with the guest?

Don't forget that prealloc threads will need some CPUs to run, which 
would likely result in increased steal time during preallocation for the 
guest.

Likely not a big deal, but something to keep in mind.

Best,
Valentine
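
For illustration, a rough sketch of the background-thread variant
proposed in the quoted text: startup hands the region to a worker
thread and continues immediately, and the worker's CPU time is the cost
mentioned above. It assumes Linux 5.14 or newer and POSIX threads;
struct prealloc_job, prealloc_start_async() and prealloc_wait() are
hypothetical names, not QEMU APIs, and the fallback value 23 is the
generic Linux definition of MADV_POPULATE_WRITE.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stddef.h>
#include <sys/mman.h>

#ifndef MADV_POPULATE_WRITE
#define MADV_POPULATE_WRITE 23 /* assumption: generic Linux value, 5.14+ */
#endif

/* Hypothetical bookkeeping for one background preallocation. */
struct prealloc_job {
    void *addr;
    size_t len;
    int err;          /* 0 or errno from madvise(); valid after the join */
    pthread_t thread;
};

/* Worker: populate the range without modifying its contents. */
static void *prealloc_worker(void *opaque)
{
    struct prealloc_job *job = opaque;

    job->err = madvise(job->addr, job->len, MADV_POPULATE_WRITE) == 0
               ? 0 : errno;
    return NULL;
}

/* Start populating in the background; returns 0 or an errno value. */
static int prealloc_start_async(struct prealloc_job *job,
                                void *addr, size_t len)
{
    assert(job != NULL);
    job->addr = addr;
    job->len = len;
    job->err = 0;
    return pthread_create(&job->thread, NULL, prealloc_worker, job);
}

/* Wait for the worker; returns 0 or the first errno value seen. */
static int prealloc_wait(struct prealloc_job *job)
{
    int rc = pthread_join(job->thread, NULL);

    return rc ? rc : job->err;
}
```

A caller would invoke prealloc_start_async() where it blocks on
preallocation today and call prealloc_wait() only at teardown, or when
the region is about to be unmapped, so the worker never touches freed
memory.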
