Message ID: 20170426100701.21893-1-lvivier@redhat.com
State: New
On Wed, Apr 26, 2017 at 12:07:01PM +0200, Laurent Vivier wrote:
> When there is more nodes than memory available to put the minimum
> allowed memory by node, all the memory is put on the last node.
>
> This is because we put (ram_size / nb_numa_nodes) &
> ~((1 << mc->numa_mem_align_shift) - 1); on each node, and in this
> case the value is 0. This is particularly true with pseries,
> as the memory must be aligned to 256MB.
>
> To avoid this problem, this patch uses an error diffusion algorithm [1]
> to distribute equally the memory on nodes.

Nice.

But we need compat code to keep the previous behavior on older
machine-types. We can use either a new boolean MachineClass
field, or a MachineClass method (mc->auto_assign_ram(), maybe?)
that 2.9 machine-types could override.

>
> Example:
>
> qemu-system-ppc64 -S -nographic -nodefaults -monitor stdio -m 1G -smp 8 \
>     -numa node -numa node -numa node \
>     -numa node -numa node -numa node
>
> Before:
>
> (qemu) info numa
> 6 nodes
> node 0 cpus: 0 6
> node 0 size: 0 MB
> node 1 cpus: 1 7
> node 1 size: 0 MB
> node 2 cpus: 2
> node 2 size: 0 MB
> node 3 cpus: 3
> node 3 size: 0 MB
> node 4 cpus: 4
> node 4 size: 0 MB
> node 5 cpus: 5
> node 5 size: 1024 MB
>
> After:
>
> (qemu) info numa
> 6 nodes
> node 0 cpus: 0 6
> node 0 size: 0 MB
> node 1 cpus: 1 7
> node 1 size: 256 MB
> node 2 cpus: 2
> node 2 size: 0 MB
> node 3 cpus: 3
> node 3 size: 256 MB
> node 4 cpus: 4
> node 4 size: 256 MB
> node 5 cpus: 5
> node 5 size: 256 MB
>
> [1] https://en.wikipedia.org/wiki/Error_diffusion
>
> Signed-off-by: Laurent Vivier <lvivier@redhat.com>
> ---
>  numa.c | 10 +++++++---
>  1 file changed, 7 insertions(+), 3 deletions(-)
>
> diff --git a/numa.c b/numa.c
> index 6fc2393..bcf1c54 100644
> --- a/numa.c
> +++ b/numa.c
> @@ -336,15 +336,19 @@ void parse_numa_opts(MachineClass *mc)
>          }
>      }
>      if (i == nb_numa_nodes) {
> -        uint64_t usedmem = 0;
> +        uint64_t usedmem = 0, node_mem;
> +        uint64_t granularity = ram_size / nb_numa_nodes;
> +        uint64_t propagate = 0;
>
>          /* Align each node according to the alignment
>           * requirements of the machine class
>           */
>          for (i = 0; i < nb_numa_nodes - 1; i++) {
> -            numa_info[i].node_mem = (ram_size / nb_numa_nodes) &
> +            node_mem = (granularity + propagate) &
>                  ~((1 << mc->numa_mem_align_shift) - 1);
> -            usedmem += numa_info[i].node_mem;
> +            propagate = granularity + propagate - node_mem;
> +            numa_info[i].node_mem = node_mem;
> +            usedmem += node_mem;
>          }
>          numa_info[i].node_mem = ram_size - usedmem;
>      }
> --
> 2.9.3
On 26/04/2017 16:31, Eduardo Habkost wrote:
> On Wed, Apr 26, 2017 at 12:07:01PM +0200, Laurent Vivier wrote:
>> When there is more nodes than memory available to put the minimum
>> allowed memory by node, all the memory is put on the last node.
>>
>> This is because we put (ram_size / nb_numa_nodes) &
>> ~((1 << mc->numa_mem_align_shift) - 1); on each node, and in this
>> case the value is 0. This is particularly true with pseries,
>> as the memory must be aligned to 256MB.
>>
>> To avoid this problem, this patch uses an error diffusion algorithm [1]
>> to distribute equally the memory on nodes.
>
> Nice.
>
> But we need compat code to keep the previous behavior on older
> machine-types. We can use either a new boolean MachineClass
> field, or a MachineClass method (mc->auto_assign_ram(), maybe?)
> that 2.9 machine-types could override.

You're right. I'm going to introduce a "numa_auto_assign_ram()" function
in the MachineClass.

Thanks,
Laurent
diff --git a/numa.c b/numa.c
index 6fc2393..bcf1c54 100644
--- a/numa.c
+++ b/numa.c
@@ -336,15 +336,19 @@ void parse_numa_opts(MachineClass *mc)
         }
     }
     if (i == nb_numa_nodes) {
-        uint64_t usedmem = 0;
+        uint64_t usedmem = 0, node_mem;
+        uint64_t granularity = ram_size / nb_numa_nodes;
+        uint64_t propagate = 0;
 
         /* Align each node according to the alignment
          * requirements of the machine class
          */
         for (i = 0; i < nb_numa_nodes - 1; i++) {
-            numa_info[i].node_mem = (ram_size / nb_numa_nodes) &
+            node_mem = (granularity + propagate) &
                 ~((1 << mc->numa_mem_align_shift) - 1);
-            usedmem += numa_info[i].node_mem;
+            propagate = granularity + propagate - node_mem;
+            numa_info[i].node_mem = node_mem;
+            usedmem += node_mem;
         }
         numa_info[i].node_mem = ram_size - usedmem;
     }
When there are more nodes than memory available to put the minimum
allowed memory per node, all the memory is put on the last node.

This is because we put (ram_size / nb_numa_nodes) &
~((1 << mc->numa_mem_align_shift) - 1); on each node, and in this
case the value is 0. This is particularly true with pseries,
as the memory must be aligned to 256MB.

To avoid this problem, this patch uses an error diffusion algorithm [1]
to distribute the memory equally across nodes.

Example:

qemu-system-ppc64 -S -nographic -nodefaults -monitor stdio -m 1G -smp 8 \
    -numa node -numa node -numa node \
    -numa node -numa node -numa node

Before:

(qemu) info numa
6 nodes
node 0 cpus: 0 6
node 0 size: 0 MB
node 1 cpus: 1 7
node 1 size: 0 MB
node 2 cpus: 2
node 2 size: 0 MB
node 3 cpus: 3
node 3 size: 0 MB
node 4 cpus: 4
node 4 size: 0 MB
node 5 cpus: 5
node 5 size: 1024 MB

After:

(qemu) info numa
6 nodes
node 0 cpus: 0 6
node 0 size: 0 MB
node 1 cpus: 1 7
node 1 size: 256 MB
node 2 cpus: 2
node 2 size: 0 MB
node 3 cpus: 3
node 3 size: 256 MB
node 4 cpus: 4
node 4 size: 256 MB
node 5 cpus: 5
node 5 size: 256 MB

[1] https://en.wikipedia.org/wiki/Error_diffusion

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
---
 numa.c | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)