
[3/3] backends/hostmem: Round up memory size for qemu_madvise() and mbind()

Message ID bd03706d336e9be360dd53cf125c27fbeb26acf7.1716912651.git.mprivozn@redhat.com
State New
Series backends/hostmem: Round up memory size for qemu_madvise() and mbind()

Commit Message

Michal Privoznik May 28, 2024, 4:15 p.m. UTC
Simple reproducer:
qemu.git $ ./build/qemu-system-x86_64 \
-m size=8389632k,slots=16,maxmem=25600000k \
-object '{"qom-type":"memory-backend-file","id":"ram-node0","mem-path":"/hugepages2M/","prealloc":true,"size":8590983168,"host-nodes":[0],"policy":"bind"}' \
-numa node,nodeid=0,cpus=0,memdev=ram-node0

With current master I get:

qemu-system-x86_64: cannot bind memory to host NUMA nodes: Invalid argument

The problem is that the memory size (8193 MiB) is not an integer
multiple of the underlying pagesize (2 MiB), which trips a check
inside mbind(), since we can't really set a policy on just a
fraction of a page. As qemu_madvise() has the same expectation,
round the size up to the underlying pagesize.
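
For reference, the arithmetic behind the EINVAL (not part of the original
report): 8590983168 bytes is 8193 MiB, i.e. 4096.5 hugepages of 2 MiB, so the
region ends half-way through a hugepage and mbind() rejects the length.
Rounding up gives ROUND_UP(8590983168, 2097152) = 8592031744 bytes (8194 MiB),
a whole number of hugepages.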

Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
---
 backends/hostmem.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

Comments

David Hildenbrand May 28, 2024, 4:47 p.m. UTC | #1
Am 28.05.24 um 18:15 schrieb Michal Privoznik:
> ./build/qemu-system-x86_64 \ -m size=8389632k,slots=16,maxmem=25600000k \ 
> -object 
> '{"qom-type":"memory-backend-file","id":"ram-node0","mem-path":"hugepages2M","prealloc":true,"size":8590983168,"host-nodes":[0],"policy":"bind"}' \ -numa node,nodeid=0,cpus=0,memdev=ram-node0

For DIMMs and friends we now (again) enforce that the size must be aligned to 
the page size:

commit 540a1abbf0b243e4cfb4333c5d30a041f7080ba4
Author: David Hildenbrand <david@redhat.com>
Date:   Wed Jan 17 14:55:54 2024 +0100

     memory-device: reintroduce memory region size check

     We used to check that the memory region size is multiples of the overall
     requested address alignment for the device memory address.

     We removed that check, because there are cases (i.e., hv-balloon) where
     devices unconditionally request an address alignment that has a very large
     alignment (i.e., 32 GiB), but the actual memory device size might not be
     multiples of that alignment.

     However, this change:

     (a) allows for some practically impossible DIMM sizes, like "1GB+1 byte".
     (b) allows for DIMMs that partially cover hugetlb pages, previously
         reported in [1].
...

Partial hugetlb pages do not particularly make sense; they just waste memory. Do we 
expect people to actually have a working setup that we might break when disallowing 
such configurations? Or would we actually help them identify that something 
non-sensical is happening?

When using memory-backend-memfd, we already do get a proper error:

  ./build/qemu-system-x86_64 -m 2047m \
    -object memory-backend-memfd,id=ram-node0,hugetlb=on,size=2047m,reserve=off \
    -numa node,nodeid=0,cpus=0,memdev=ram-node0 -S
qemu-system-x86_64: failed to resize memfd to 2146435072: Invalid argument

So this only applies to memory-backend-file, where we maybe should fail in a 
similar way?
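
A rough sketch of what failing in a similar way could look like for
memory-backend-file, assuming a check in backends/hostmem.c with the backend's
RAM block mapped and errp in scope (names and placement are illustrative, not a
posted patch):

    uint64_t pagesize = qemu_ram_pagesize(backend->mr.ram_block);

    if (!QEMU_IS_ALIGNED(sz, pagesize)) {
        /* refuse a size that would only partially cover a (huge)page */
        error_setg(errp, "size 0x%" PRIx64 " is not a multiple of the "
                   "page size 0x%" PRIx64, sz, pagesize);
        return;
    }
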
Richard Henderson May 28, 2024, 5:22 p.m. UTC | #2
On 5/28/24 09:15, Michal Privoznik wrote:
> +        sz = ROUND_UP(sz, qemu_ram_pagesize(backend->mr.ram_block));

Second argument is evaluated twice.
You probably don't want that to be a function call.


r~
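
A minimal sketch of the adjustment Richard suggests, hoisting the call out of
the macro so qemu_ram_pagesize() is evaluated only once (the local variable
name is illustrative):

    uint64_t pagesize = qemu_ram_pagesize(backend->mr.ram_block);

    sz = ROUND_UP(sz, pagesize);
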
Michal Privoznik May 29, 2024, 6:48 a.m. UTC | #3
On 5/28/24 18:47, David Hildenbrand wrote:
> Am 28.05.24 um 18:15 schrieb Michal Privoznik:
>> ./build/qemu-system-x86_64 \ -m
>> size=8389632k,slots=16,maxmem=25600000k \ -object
>> '{"qom-type":"memory-backend-file","id":"ram-node0","mem-path":"hugepages2M","prealloc":true,"size":8590983168,"host-nodes":[0],"policy":"bind"}' \ -numa node,nodeid=0,cpus=0,memdev=ram-node0
> 
> For DIMMs and friends we now (again) enforce that the size must be
> aligned to the page size:
> 
> commit 540a1abbf0b243e4cfb4333c5d30a041f7080ba4
> Author: David Hildenbrand <david@redhat.com>
> Date:   Wed Jan 17 14:55:54 2024 +0100
> 
>     memory-device: reintroduce memory region size check
> 
>     We used to check that the memory region size is multiples of the
> overall
>     requested address alignment for the device memory address.
> 
>     We removed that check, because there are cases (i.e., hv-balloon) where
>     devices unconditionally request an address alignment that has a very
> large
>     alignment (i.e., 32 GiB), but the actual memory device size might
> not be
>     multiples of that alignment.
> 
>     However, this change:
> 
>     (a) allows for some practically impossible DIMM sizes, like "1GB+1
> byte".
>     (b) allows for DIMMs that partially cover hugetlb pages, previously
>         reported in [1].
> ...
> 
> Partial hugetlb pages do not particularly make sense; wasting memory. Do
> we expect people to actually have a working setup that we might break when
> disallowing such configurations? Or would we actually help them identify
> that something non-sensical is happening?
> 
> When using memory-backend-memfd, we already do get a proper error:
> 
>  ./build/qemu-system-x86_64 -m 2047m -object
> memory-backend-memfd,id=ram-node0,hugetlb=on,size=2047m,reserve=off
> -numa node,nodeid=0,cpus=0,memdev=ram-node0 -S
> qemu-system-x86_64: failed to resize memfd to 2146435072: Invalid argument
> 
> So this only applies to memory-backend-file, where we maybe should fail
> in a similar way?
> 

Yeah, let's fail in that case. But non-aligned length is just one of
many reasons madvise()/mbind() can fail. What about the others? Should
we make them report an error or just a warning?

Michal
David Hildenbrand May 29, 2024, 12:30 p.m. UTC | #4
On 29.05.24 08:48, Michal Prívozník wrote:
> On 5/28/24 18:47, David Hildenbrand wrote:
>> Am 28.05.24 um 18:15 schrieb Michal Privoznik:
>>> ./build/qemu-system-x86_64 \ -m
>>> size=8389632k,slots=16,maxmem=25600000k \ -object
>>> '{"qom-type":"memory-backend-file","id":"ram-node0","mem-path":"hugepages2M","prealloc":true,"size":8590983168,"host-nodes":[0],"policy":"bind"}' \ -numa node,nodeid=0,cpus=0,memdev=ram-node0
>>
>> For DIMMs and friends we now (again) enforce that the size must be
>> aligned to the page size:
>>
>> commit 540a1abbf0b243e4cfb4333c5d30a041f7080ba4
>> Author: David Hildenbrand <david@redhat.com>
>> Date:   Wed Jan 17 14:55:54 2024 +0100
>>
>>      memory-device: reintroduce memory region size check
>>
>>      We used to check that the memory region size is multiples of the
>> overall
>>      requested address alignment for the device memory address.
>>
>>      We removed that check, because there are cases (i.e., hv-balloon) where
>>      devices unconditionally request an address alignment that has a very
>> large
>>      alignment (i.e., 32 GiB), but the actual memory device size might
>> not be
>>      multiples of that alignment.
>>
>>      However, this change:
>>
>>      (a) allows for some practically impossible DIMM sizes, like "1GB+1
>> byte".
>>      (b) allows for DIMMs that partially cover hugetlb pages, previously
>>          reported in [1].
>> ...
>>
>> Partial hugetlb pages do not particularly make sense; wasting memory. Do
>> we expect people to actually have a working setup that we might break when
>> disallowing such configurations? Or would we actually help them identify
>> that something non-sensical is happening?
>>
>> When using memory-backend-memfd, we already do get a proper error:
>>
>>   ./build/qemu-system-x86_64 -m 2047m -object
>> memory-backend-memfd,id=ram-node0,hugetlb=on,size=2047m,reserve=off
>> -numa node,nodeid=0,cpus=0,memdev=ram-node0 -S
>> qemu-system-x86_64: failed to resize memfd to 2146435072: Invalid argument
>>
>> So this only applies to memory-backend-file, where we maybe should fail
>> in a similar way?
>>
> 
> Yeah, let's fail in that case. But non-aligned length is just one of
> many reasons madvise()/mbind() can fail. What about the others? Should
> we make them report an error or just a warning?

Regarding madvise(), we should report at least a warning.

In qemu_ram_setup_dump() we print an error if QEMU_MADV_DONTDUMP failed.

But we swallow any errors from memory_try_enable_merging() ... in 
general, we likely have to distinguish the "not supported by the OS" 
case from the "actually supported but failed" case.

In the second patch, maybe we should really fail if something unexpected 
happens, instead of fake-changing the properties.
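
One errno-based way the "not supported by the OS" vs. "actually supported but
failed" distinction could look, as a minimal sketch only (errp is assumed to be
in scope; this is not code from the series):

    if (qemu_madvise(ptr, sz, QEMU_MADV_MERGEABLE) != 0) {
        if (errno == ENOSYS) {
            /* madvise() itself is unavailable on this host: keep it a warning */
            warn_report("MADV_MERGEABLE not supported: %s", strerror(errno));
        } else {
            /*
             * EINVAL and friends cover both "advice unknown to this kernel"
             * and genuine failures such as an unaligned length, so a real
             * error (or at least a loud warning) seems warranted here.
             */
            error_setg_errno(errp, errno, "madvise(MADV_MERGEABLE) failed");
        }
    }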

Patch

diff --git a/backends/hostmem.c b/backends/hostmem.c
index 1a6fd1c714..9b727699f6 100644
--- a/backends/hostmem.c
+++ b/backends/hostmem.c
@@ -179,6 +179,8 @@  static void host_memory_backend_set_merge(Object *obj, bool value, Error **errp)
         void *ptr = memory_region_get_ram_ptr(&backend->mr);
         uint64_t sz = memory_region_size(&backend->mr);
 
+        sz = ROUND_UP(sz, qemu_ram_pagesize(backend->mr.ram_block));
+
         if (qemu_madvise(ptr, sz,
                          value ? QEMU_MADV_MERGEABLE : QEMU_MADV_UNMERGEABLE)) {
             warn_report("Couldn't change property 'merge' on '%s': %s",
@@ -208,6 +210,8 @@  static void host_memory_backend_set_dump(Object *obj, bool value, Error **errp)
         void *ptr = memory_region_get_ram_ptr(&backend->mr);
         uint64_t sz = memory_region_size(&backend->mr);
 
+        sz = ROUND_UP(sz, qemu_ram_pagesize(backend->mr.ram_block));
+
         if (qemu_madvise(ptr, sz,
                          value ? QEMU_MADV_DODUMP : QEMU_MADV_DONTDUMP)) {
             warn_report("Couldn't change property 'dump' on '%s': %s",
@@ -344,6 +348,13 @@  host_memory_backend_memory_complete(UserCreatable *uc, Error **errp)
     ptr = memory_region_get_ram_ptr(&backend->mr);
     sz = memory_region_size(&backend->mr);
 
+    /*
+     * Round up the size to be an integer multiple of pagesize, because
+     * neither madvise() nor mbind() really likes setting advice/policy
+     * on just a fraction of a page.
+     */
+    sz = ROUND_UP(sz, qemu_ram_pagesize(backend->mr.ram_block));
+
     if (backend->merge &&
         qemu_madvise(ptr, sz, QEMU_MADV_MERGEABLE)) {
         warn_report("Couldn't set property 'merge' on '%s': %s",