Message ID: 20230626111445.163573-4-pbonzini@redhat.com
State: New
Series: [PULL,01/18] build: further refine build.ninja rules
On Mon, 26 Jun 2023 at 12:15, Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> From: Gavin Shan <gshan@redhat.com>
>
> For some architectures like ARM64, multiple CPUs in one cluster can be
> associated with different NUMA nodes, which is irregular configuration
> because we shouldn't have this in baremetal environment. The irregular
> configuration causes Linux guest to misbehave, as the following warning
> messages indicate.
>
>   -smp 6,maxcpus=6,sockets=2,clusters=1,cores=3,threads=1 \
>   -numa node,nodeid=0,cpus=0-1,memdev=ram0 \
>   -numa node,nodeid=1,cpus=2-3,memdev=ram1 \
>   -numa node,nodeid=2,cpus=4-5,memdev=ram2 \

Hi. This new warning shows up a lot in "make check" output:

$ grep -c 'can cause OSes' /tmp/parn3ofA.par
44

Looks like this is all in the qtest-aarch64/numa-test test.

Please can you investigate and either:
(1) fix the test not to do the bad thing that's causing the warning
(2) change the warning so it doesn't show up in stderr when
    running a correct and passing test
?

thanks
-- PMM
On 7/20/23 23:10, Peter Maydell wrote:
> On Mon, 26 Jun 2023 at 12:15, Paolo Bonzini <pbonzini@redhat.com> wrote:
>>
>> From: Gavin Shan <gshan@redhat.com>
>>
>> For some architectures like ARM64, multiple CPUs in one cluster can be
>> associated with different NUMA nodes, which is irregular configuration
>> because we shouldn't have this in baremetal environment. The irregular
>> configuration causes Linux guest to misbehave, as the following warning
>> messages indicate.
>>
>>   -smp 6,maxcpus=6,sockets=2,clusters=1,cores=3,threads=1 \
>>   -numa node,nodeid=0,cpus=0-1,memdev=ram0 \
>>   -numa node,nodeid=1,cpus=2-3,memdev=ram1 \
>>   -numa node,nodeid=2,cpus=4-5,memdev=ram2 \
>
> Hi. This new warning shows up a lot in "make check" output:
>
> $ grep -c 'can cause OSes' /tmp/parn3ofA.par
> 44
>
> Looks like this is all in the qtest-aarch64/numa-test test.
>
> Please can you investigate and either:
> (1) fix the test not to do the bad thing that's causing the warning
> (2) change the warning so it doesn't show up in stderr when
>     running a correct and passing test
> ?
>

Yes, all the warning messages come from tests/qtest/numa-test.c. There are
3 configurations where the boundary of CPU cluster and NUMA node is
intentionally broken, as the test expects. I've sent a patch to disable the
validation for qtest.

https://lists.nongnu.org/archive/html/qemu-arm/2023-07/msg00440.html

With the patch applied, I didn't see similar warning messages from
"make -j 40 check-qtest".

Thanks,
Gavin
diff --git a/hw/core/machine.c b/hw/core/machine.c
index 1000406211f..46f8f9a2b04 100644
--- a/hw/core/machine.c
+++ b/hw/core/machine.c
@@ -1262,6 +1262,45 @@ static void machine_numa_finish_cpu_init(MachineState *machine)
     g_string_free(s, true);
 }
 
+static void validate_cpu_cluster_to_numa_boundary(MachineState *ms)
+{
+    MachineClass *mc = MACHINE_GET_CLASS(ms);
+    NumaState *state = ms->numa_state;
+    const CPUArchIdList *possible_cpus = mc->possible_cpu_arch_ids(ms);
+    const CPUArchId *cpus = possible_cpus->cpus;
+    int i, j;
+
+    if (state->num_nodes <= 1 || possible_cpus->len <= 1) {
+        return;
+    }
+
+    /*
+     * The Linux scheduling domain can't be parsed when the multiple CPUs
+     * in one cluster have been associated with different NUMA nodes. However,
+     * it's fine to associate one NUMA node with CPUs in different clusters.
+     */
+    for (i = 0; i < possible_cpus->len; i++) {
+        for (j = i + 1; j < possible_cpus->len; j++) {
+            if (cpus[i].props.has_socket_id &&
+                cpus[i].props.has_cluster_id &&
+                cpus[i].props.has_node_id &&
+                cpus[j].props.has_socket_id &&
+                cpus[j].props.has_cluster_id &&
+                cpus[j].props.has_node_id &&
+                cpus[i].props.socket_id == cpus[j].props.socket_id &&
+                cpus[i].props.cluster_id == cpus[j].props.cluster_id &&
+                cpus[i].props.node_id != cpus[j].props.node_id) {
+                warn_report("CPU-%d and CPU-%d in socket-%" PRId64 "-cluster-%" PRId64
+                            " have been associated with node-%" PRId64 " and node-%" PRId64
+                            " respectively. It can cause OSes like Linux to"
+                            " misbehave", i, j, cpus[i].props.socket_id,
+                            cpus[i].props.cluster_id, cpus[i].props.node_id,
+                            cpus[j].props.node_id);
+            }
+        }
+    }
+}
+
 MemoryRegion *machine_consume_memdev(MachineState *machine,
                                      HostMemoryBackend *backend)
 {
@@ -1355,6 +1394,9 @@ void machine_run_board_init(MachineState *machine, const char *mem_path, Error *
     numa_complete_configuration(machine);
     if (machine->numa_state->num_nodes) {
         machine_numa_finish_cpu_init(machine);
+        if (machine_class->cpu_cluster_has_numa_boundary) {
+            validate_cpu_cluster_to_numa_boundary(machine);
+        }
     }
 }
 
diff --git a/include/hw/boards.h b/include/hw/boards.h
index a385010909d..6b267c21ce7 100644
--- a/include/hw/boards.h
+++ b/include/hw/boards.h
@@ -274,6 +274,7 @@ struct MachineClass {
     bool nvdimm_supported;
     bool numa_mem_supported;
     bool auto_enable_numa;
+    bool cpu_cluster_has_numa_boundary;
     SMPCompatProps smp_props;
     const char *default_ram_id;