[v2,0/7] hostmem: NUMA-aware memory preallocation using ThreadContext

Message ID 20221010091117.88603-1-david@redhat.com

Message

David Hildenbrand Oct. 10, 2022, 9:11 a.m. UTC
This is a follow-up on "util: NUMA aware memory preallocation" [1] by
Michal.

Setting the CPU affinity of threads from inside QEMU usually isn't
easily possible, because we don't want QEMU -- once started and running
guest code -- to be able to mess up the system. QEMU disallows relevant
syscalls using seccomp, such that any such invocation will fail.

Especially for memory preallocation in memory backends, the CPU affinity
of the preallocation threads can significantly affect guest startup time,
for example, when running large VMs backed by huge/gigantic pages, because
of NUMA effects. For NUMA-aware preallocation, we have to set the CPU
affinity. However:

(1) Once preallocation threads are created during preallocation, management
    tools can no longer intervene to change their affinity. These threads
    are created automatically on demand.
(2) QEMU cannot easily set the CPU affinity itself.
(3) The CPU affinity derived from the NUMA bindings of the memory backend
    might not necessarily be exactly the CPUs we actually want to use
    (e.g., CPU-less NUMA nodes, CPUs that are pinned/used for other VMs).

There is an easy "workaround": on Linux, a new thread inherits the CPU
affinity of the thread that creates it. So if we have a thread with the
right CPU affinity, we can simply create new threads on demand via that
prepared context. All we have to do is set up such a context ahead of
time and then configure preallocation to create new threads via that
environment (see the sketch below).

So, let's introduce a user-creatable "thread-context" object that
essentially consists of a context thread used to create new threads.
QEMU can either try setting the CPU affinity itself ("cpu-affinity",
"node-affinity" properties), or upper layers can extract the thread ID
("thread-id" property) to configure the affinity externally.

Make memory-backends consume a thread-context object
(via the "prealloc-context" property) and use it when preallocating to
create new threads with the desired CPU affinity. Further, to make it
easier to use, allow creation of "thread-context" objects, including
setting the CPU affinity directly from QEMU, before enabling the
sandbox option.
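
The same should be expressible at runtime via QMP (a sketch, assuming
"thread-context" objects can be hot-added like other user-creatable
objects; with "-sandbox enable=on,resourcecontrol=deny" active, setting
"node-affinity" from inside QEMU would fail, so the affinity would have
to be configured externally via "thread-id"):

    { "execute": "object-add",
      "arguments": { "qom-type": "thread-context", "id": "tc1",
                     "node-affinity": [ 0 ] } }
    { "execute": "object-add",
      "arguments": { "qom-type": "memory-backend-memfd", "id": "md1",
                     "size": 68719476736, "hugetlb": true,
                     "prealloc": true, "prealloc-threads": 12,
                     "prealloc-context": "tc1",
                     "host-nodes": [ 0 ], "policy": "bind" } }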


Quick test on a system with 2 NUMA nodes:

Without CPU affinity:
    time qemu-system-x86_64 \
        -object memory-backend-memfd,id=md1,hugetlb=on,hugetlbsize=2M,size=64G,prealloc-threads=12,prealloc=on,host-nodes=0,policy=bind \
        -nographic -monitor stdio

    real    0m5.383s
    real    0m3.499s
    real    0m5.129s
    real    0m4.232s
    real    0m5.220s
    real    0m4.288s
    real    0m3.582s
    real    0m4.305s
    real    0m5.421s
    real    0m4.502s

    -> It heavily depends on the scheduler's CPU selection

With CPU affinity:
    time qemu-system-x86_64 \
        -object thread-context,id=tc1,node-affinity=0 \
        -object memory-backend-memfd,id=md1,hugetlb=on,hugetlbsize=2M,size=64G,prealloc-threads=12,prealloc=on,host-nodes=0,policy=bind,prealloc-context=tc1 \
        -sandbox enable=on,resourcecontrol=deny \
        -nographic -monitor stdio

    real    0m1.959s
    real    0m1.942s
    real    0m1.943s
    real    0m1.941s
    real    0m1.948s
    real    0m1.964s
    real    0m1.949s
    real    0m1.948s
    real    0m1.941s
    real    0m1.937s

On reasonably large VMs, the speedup can be quite significant.

While this concept is currently only used for short-lived preallocation
threads, nothing major speaks against reusing the concept for other
threads that are harder to identify/configure -- except that
we need additional (idle) context threads that are otherwise left unused.

This series does not yet tackle concurrent preallocation of memory
backends. Memory backend objects are created and memory is preallocated one
memory backend at a time -- and there is currently no way to do
preallocation asynchronously.

[1] https://lkml.kernel.org/r/ffdcd118d59b379ede2b64745144165a40f6a813.1652165704.git.mprivozn@redhat.com

v1 -> v2:
* Fixed some minor style nits
* "util: Introduce ThreadContext user-creatable object"
 -> Improve documentation and patch description. [Markus]
* "util: Add write-only "node-affinity" property for ThreadContext"
 -> Improve documentation and patch description. [Markus]

RFC -> v1:
* "vl: Allow ThreadContext objects to be created before the sandbox option"
 -> Move parsing of the "name" property before object_create_pre_sandbox
* Added RB's

Cc: Michal Privoznik <mprivozn@redhat.com>
Cc: Igor Mammedov <imammedo@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: "Daniel P. Berrangé" <berrange@redhat.com>
Cc: Eduardo Habkost <eduardo@habkost.net>
Cc: Dr. David Alan Gilbert <dgilbert@redhat.com>
Cc: Eric Blake <eblake@redhat.com>
Cc: Markus Armbruster <armbru@redhat.com>
Cc: Richard Henderson <richard.henderson@linaro.org>
Cc: Stefan Weil <sw@weilnetz.de>

David Hildenbrand (7):
  util: Cleanup and rename os_mem_prealloc()
  util: Introduce qemu_thread_set_affinity() and
    qemu_thread_get_affinity()
  util: Introduce ThreadContext user-creatable object
  util: Add write-only "node-affinity" property for ThreadContext
  util: Make qemu_prealloc_mem() optionally consume a ThreadContext
  hostmem: Allow for specifying a ThreadContext for preallocation
  vl: Allow ThreadContext objects to be created before the sandbox
    option

 backends/hostmem.c            |  13 +-
 hw/virtio/virtio-mem.c        |   2 +-
 include/qemu/osdep.h          |  19 +-
 include/qemu/thread-context.h |  57 ++++++
 include/qemu/thread.h         |   4 +
 include/sysemu/hostmem.h      |   2 +
 meson.build                   |  16 ++
 qapi/qom.json                 |  28 +++
 softmmu/cpus.c                |   2 +-
 softmmu/vl.c                  |  36 +++-
 util/meson.build              |   1 +
 util/oslib-posix.c            |  39 ++--
 util/oslib-win32.c            |   8 +-
 util/qemu-thread-posix.c      |  70 +++++++
 util/qemu-thread-win32.c      |  12 ++
 util/thread-context.c         | 362 ++++++++++++++++++++++++++++++++++
 16 files changed, 641 insertions(+), 30 deletions(-)
 create mode 100644 include/qemu/thread-context.h
 create mode 100644 util/thread-context.c

Comments

Dr. David Alan Gilbert Oct. 10, 2022, 10:40 a.m. UTC | #1
* David Hildenbrand (david@redhat.com) wrote:
> This is a follow-up on "util: NUMA aware memory preallocation" [1] by
> Michal.
> 
> [...]
>
> This series does not yet tackle concurrent preallocation of memory
> backends. Memory backend objects are created and memory is preallocated one
> memory backend at a time -- and there is currently no way to do
> preallocation asynchronously.

Since you seem to have a full set of r-b's - do you intend to merge this
as-is or do the concurrent preallocation first?

Dave

David Hildenbrand Oct. 10, 2022, 11:18 a.m. UTC | #2
On 10.10.22 12:40, Dr. David Alan Gilbert wrote:
> * David Hildenbrand (david@redhat.com) wrote:
>> This is a follow-up on "util: NUMA aware memory preallocation" [1] by
>> Michal.
>>
>> [...]
>>
>> This series does not yet tackle concurrent preallocation of memory
>> backends. Memory backend objects are created and memory is preallocated one
>> memory backend at a time -- and there is currently no way to do
>> preallocation asynchronously.

Hi Dave,

> 
> Since you seem to have a full set of r-b's - do you intend to merge this
> as-is or do the concurrent preallocation first?

I intend to merge this as-is, as it provides a benefit as it stands, and
concurrent preallocation might not require user interface changes.

I do have some ideas on how to implement concurrent preallocation, but 
it needs more thought (and more importantly, time).
Dr. David Alan Gilbert Oct. 11, 2022, 9:02 a.m. UTC | #3
* David Hildenbrand (david@redhat.com) wrote:
> On 10.10.22 12:40, Dr. David Alan Gilbert wrote:
> > * David Hildenbrand (david@redhat.com) wrote:
> > > [...]
> 
> Hi Dave,
> 
> > 
> > Since you seem to have a full set of r-b's - do you intend to merge this
> > as-is or do the concurrent preallocation first?
> 
> I intend to merge this as-is, as it provides a benefit as it stands, and
> concurrent preallocation might not require user interface changes.

Yep, that's fair enough.

> I do have some ideas on how to implement concurrent preallocation, but it
> needs more thought (and more importantly, time).

Yep, it would be nice for the really huge VMs.

Dave

