Message ID | 20190830161345.22436-1-lvivier@redhat.com |
---|---|
State | New |
Series | pseries: do not allow memory-less/cpu-less NUMA node |
On Fri, Aug 30, 2019 at 06:13:45PM +0200, Laurent Vivier wrote:
> When we hotplug a CPU on memory-less/cpu-less node, the linux kernel
> crashes.
>
> This happens because linux kernel needs to know the NUMA topology at
> start to be able to initialize the distance lookup table.
>
> On pseries, the topology is provided by the firmware via the existing
> CPUs and memory information. Thus a node without memory and CPU cannot be
> discovered by the kernel.
>
> To avoid the kernel crash, do not allow to start pseries with empty
> nodes.

This describes one possible guest OS. Is there any reasonable chance
that a non-Linux guest might be able to handle this situation correctly,
or do you expect any guest to have the same restriction ?

> Signed-off-by: Laurent Vivier <lvivier@redhat.com>
> ---
> hw/ppc/spapr.c | 33 +++++++++++++++++++++++++++++++++
> 1 file changed, 33 insertions(+)

Regards,
Daniel

On Fri, 30 Aug 2019 17:34:13 +0100
Daniel P. Berrangé <berrange@redhat.com> wrote:

> On Fri, Aug 30, 2019 at 06:13:45PM +0200, Laurent Vivier wrote:
> > When we hotplug a CPU on memory-less/cpu-less node, the linux kernel
> > crashes.
> >
> > This happens because linux kernel needs to know the NUMA topology at
> > start to be able to initialize the distance lookup table.
> >
> > On pseries, the topology is provided by the firmware via the existing
> > CPUs and memory information. Thus a node without memory and CPU cannot be
> > discovered by the kernel.
> >
> > To avoid the kernel crash, do not allow to start pseries with empty
> > nodes.
>
> This describes one possible guest OS. Is there any reasonable chance
> that a non-Linux guest might be able to handle this situation correctly,
> or do you expect any guest to have the same restriction ?
>

I can try to grab an AIX image and give a try, but anyway this looks like
a very big hammer to me... :-\

> > Signed-off-by: Laurent Vivier <lvivier@redhat.com>
> > ---
> > hw/ppc/spapr.c | 33 +++++++++++++++++++++++++++++++++
> > 1 file changed, 33 insertions(+)
>
> Regards,
> Daniel

On Fri, Aug 30, 2019 at 07:45:43PM +0200, Greg Kurz wrote:
> On Fri, 30 Aug 2019 17:34:13 +0100
> Daniel P. Berrangé <berrange@redhat.com> wrote:
>
> > On Fri, Aug 30, 2019 at 06:13:45PM +0200, Laurent Vivier wrote:
> > > When we hotplug a CPU on memory-less/cpu-less node, the linux kernel
> > > crashes.
> > >
> > > This happens because linux kernel needs to know the NUMA topology at
> > > start to be able to initialize the distance lookup table.
> > >
> > > On pseries, the topology is provided by the firmware via the existing
> > > CPUs and memory information. Thus a node without memory and CPU cannot be
> > > discovered by the kernel.
> > >
> > > To avoid the kernel crash, do not allow to start pseries with empty
> > > nodes.
> >
> > This describes one possible guest OS. Is there any reasonable chance
> > that a non-Linux guest might be able to handle this situation correctly,
> > or do you expect any guest to have the same restriction ?

That's... a more complicated question than you'd think.

The problem here is it's not really obvious in PAPR how topology
information for nodes without memory should be described in the device
tree (which is the only way we give that information to the guest).

It's possible there's some way to encode this information that would
make AIX happy and we just need to fix Linux to cope with that, but
it's not really clear what it would be.

> I can try to grab an AIX image and give a try, but anyway this looks like
> a very big hammer to me... :-\

I'm not really sure why everyone seems to think losing zero-memory
node capability is such a big deal. It's never worked in practice on
POWER and we can always put it back if we figure out a sensible way to
do it.

On Mon, Sep 02, 2019 at 04:27:18PM +1000, David Gibson wrote:
> On Fri, Aug 30, 2019 at 07:45:43PM +0200, Greg Kurz wrote:
> > On Fri, 30 Aug 2019 17:34:13 +0100
> > Daniel P. Berrangé <berrange@redhat.com> wrote:
> >
> > > On Fri, Aug 30, 2019 at 06:13:45PM +0200, Laurent Vivier wrote:
> > > > When we hotplug a CPU on memory-less/cpu-less node, the linux kernel
> > > > crashes.
> > > >
> > > > This happens because linux kernel needs to know the NUMA topology at
> > > > start to be able to initialize the distance lookup table.
> > > >
> > > > On pseries, the topology is provided by the firmware via the existing
> > > > CPUs and memory information. Thus a node without memory and CPU cannot be
> > > > discovered by the kernel.
> > > >
> > > > To avoid the kernel crash, do not allow to start pseries with empty
> > > > nodes.
> > >
> > > This describes one possible guest OS. Is there any reasonable chance
> > > that a non-Linux guest might be able to handle this situation correctly,
> > > or do you expect any guest to have the same restriction ?
>
> That's... a more complicated question than you'd think.
>
> The problem here is it's not really obvious in PAPR how topology
> information for nodes without memory should be described in the device
> tree (which is the only way we give that information to the guest).
>
> It's possible there's some way to encode this information that would
> make AIX happy and we just need to fix Linux to cope with that, but
> it's not really clear what it would be.
>
> > I can try to grab an AIX image and give a try, but anyway this looks like
> > a very big hammer to me... :-\
>
> I'm not really sure why everyone seems to think losing zero-memory
> node capability is such a big deal. It's never worked in practice on
> POWER and we can always put it back if we figure out a sensible way to
> do it.

I'm not that bothered - I just wanted to double check that we were not
intentionally breaking a non-Linux guest OS that was known to work today.

Regards,
Daniel

On Mon, Sep 02, 2019 at 09:57:36AM +0100, Daniel P. Berrangé wrote:
> On Mon, Sep 02, 2019 at 04:27:18PM +1000, David Gibson wrote:
> > On Fri, Aug 30, 2019 at 07:45:43PM +0200, Greg Kurz wrote:
> > > On Fri, 30 Aug 2019 17:34:13 +0100
> > > Daniel P. Berrangé <berrange@redhat.com> wrote:
> > >
> > > > On Fri, Aug 30, 2019 at 06:13:45PM +0200, Laurent Vivier wrote:
> > > > > When we hotplug a CPU on memory-less/cpu-less node, the linux kernel
> > > > > crashes.
> > > > >
> > > > > This happens because linux kernel needs to know the NUMA topology at
> > > > > start to be able to initialize the distance lookup table.
> > > > >
> > > > > On pseries, the topology is provided by the firmware via the existing
> > > > > CPUs and memory information. Thus a node without memory and CPU cannot be
> > > > > discovered by the kernel.
> > > > >
> > > > > To avoid the kernel crash, do not allow to start pseries with empty
> > > > > nodes.
> > > >
> > > > This describes one possible guest OS. Is there any reasonable chance
> > > > that a non-Linux guest might be able to handle this situation correctly,
> > > > or do you expect any guest to have the same restriction ?
> >
> > That's... a more complicated question than you'd think.
> >
> > The problem here is it's not really obvious in PAPR how topology
> > information for nodes without memory should be described in the device
> > tree (which is the only way we give that information to the guest).
> >
> > It's possible there's some way to encode this information that would
> > make AIX happy and we just need to fix Linux to cope with that, but
> > it's not really clear what it would be.
> >
> > > I can try to grab an AIX image and give a try, but anyway this looks like
> > > a very big hammer to me... :-\
> >
> > I'm not really sure why everyone seems to think losing zero-memory
> > node capability is such a big deal. It's never worked in practice on
> > POWER and we can always put it back if we figure out a sensible way to
> > do it.
>
> I'm not that bothered - I just wanted to double check that we were not
> intentionally breaking a non-Linux guest OS that was known to work today.

There are no non-Linux guests that are known to work today, unless you
count the kvm-unit-tests micro-OS. AIX support is coming along, but
it's by no means established.

On Fri, Aug 30, 2019 at 06:13:45PM +0200, Laurent Vivier wrote:
> When we hotplug a CPU on memory-less/cpu-less node, the linux kernel
> crashes.
>
> This happens because linux kernel needs to know the NUMA topology at
> start to be able to initialize the distance lookup table.
>
> On pseries, the topology is provided by the firmware via the existing
> CPUs and memory information. Thus a node without memory and CPU cannot be
> discovered by the kernel.
>
> To avoid the kernel crash, do not allow to start pseries with empty
> nodes.
>
> Signed-off-by: Laurent Vivier <lvivier@redhat.com>

Applied to ppc-for-4.2.

On Mon, 2 Sep 2019 16:27:18 +1000
David Gibson <david@gibson.dropbear.id.au> wrote:

> On Fri, Aug 30, 2019 at 07:45:43PM +0200, Greg Kurz wrote:
> > On Fri, 30 Aug 2019 17:34:13 +0100
> > Daniel P. Berrangé <berrange@redhat.com> wrote:
> >
> > > On Fri, Aug 30, 2019 at 06:13:45PM +0200, Laurent Vivier wrote:
> > > > When we hotplug a CPU on memory-less/cpu-less node, the linux kernel
> > > > crashes.
> > > >
> > > > This happens because linux kernel needs to know the NUMA topology at
> > > > start to be able to initialize the distance lookup table.
> > > >
> > > > On pseries, the topology is provided by the firmware via the existing
> > > > CPUs and memory information. Thus a node without memory and CPU cannot be
> > > > discovered by the kernel.
> > > >
> > > > To avoid the kernel crash, do not allow to start pseries with empty
> > > > nodes.
> > >
> > > This describes one possible guest OS. Is there any reasonable chance
> > > that a non-Linux guest might be able to handle this situation correctly,
> > > or do you expect any guest to have the same restriction ?
>
> That's... a more complicated question than you'd think.
>
> The problem here is it's not really obvious in PAPR how topology
> information for nodes without memory should be described in the device
> tree (which is the only way we give that information to the guest).
>

The reported issue is to have a node without memory AND without cpu.

> It's possible there's some way to encode this information that would
> make AIX happy and we just need to fix Linux to cope with that, but
> it's not really clear what it would be.
>
> > I can try to grab an AIX image and give a try, but anyway this looks like
> > a very big hammer to me... :-\
>
> I'm not really sure why everyone seems to think losing zero-memory
> node capability is such a big deal. It's never worked in practice on
> POWER and we can always put it back if we figure out a sensible way to
> do it.
>

It isn't really about losing the memory-less/cpu-less node capability, but
more about finding the appropriate fix. The changelog doesn't give many
clues on what's happening exactly: QEMU command line ? linux call stack ?

For example, I could hit a crash with the following command line:

-smp 1,maxcpus=2 \
-object memory-backend-ram,size=512M,id=node0 \
-numa node,nodeid=0,memdev=node0 \
-numa node,nodeid=1

(qemu) info numa
2 nodes
node 0 cpus: 0
node 0 size: 512 MB
node 0 plugged: 0 MB
node 1 cpus:
node 1 size: 0 MB
node 1 plugged: 0 MB
(qemu) device_add host-spapr-cpu-core,core-id=1
[ 24.507552] Built 1 zonelists, mobility grouping on. Total pages: 7656
[ 24.507592] Policy zone: Normal
[ 24.553481] WARNING: workqueue cpumask: online intersect > possible intersect
[ 24.608814] BUG: Unable to handle kernel data access at 0x14e13da04c5bc37e
[ 24.608875] Faulting instruction address: 0xc000000000175650
[ 24.608931] Oops: Kernel access of bad area, sig: 11 [#1]
[ 24.608976] LE PAGE_SIZE=64K MMU=Radix MMU=Hash SMP NR_CPUS=1024 NUMA pSeries
[ 24.609042] Modules linked in: virtio_net vmx_crypto net_failover failover crct10dif_vpmsum ip_tables xfs libcrc32c crc32c_vpmsum virtio_blk kvm rpadlpar_io rpaphp 9p fscache 9pnet_virtio 9pnet
[ 24.609222] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.1.17-300.fc30.ppc64le #1
[ 24.609286] NIP: c000000000175650 LR: c000000000175310 CTR: 0000000000000000
[ 24.609351] REGS: c00000001e597210 TRAP: 0380 Not tainted (5.1.17-300.fc30.ppc64le)
[ 24.609414] MSR: 8000000000009033 <SF,EE,ME,IR,DR,RI,LE> CR: 44444248 XER: 00000000
[ 24.609482] CFAR: c000000000175528 IRQMASK: 0
[ 24.609482] GPR00: c000000000175310 c00000001e5974a0 c0000000015fc400 0000000000000002
[ 24.609482] GPR04: 0000000000000001 0000000000000001 0000000000000001 0000000000000400
[ 24.609482] GPR08: 14e13da04c5bc37e 0000000000000000 0000000000000000 0000000000000000
[ 24.609482] GPR12: 0000000024022248 c00000000fffee00 0000000000000007 c00000001e0e8fb0
[ 24.609482] GPR16: c00000000162dc70 0000000000000008 c00000001e5976d8 0000000020000000
[ 24.609482] GPR20: 0000000100000003 0000000000000001 0000000000000000 14e13da04c5bc35e
[ 24.609482] GPR24: c000000001630164 0000000000000010 14e13da04c5bc37e 0000000000000000
[ 24.609482] GPR28: 0000000000000002 c0000000142a0e00 c00000001ff25d80 c00000001e5975a8
[ 24.610052] NIP [c000000000175650] find_busiest_group+0x510/0xe10
[ 24.610107] LR [c000000000175310] find_busiest_group+0x1d0/0xe10
[ 24.610169] Call Trace:
[ 24.610203] [c00000001e5974a0] [c000000000175310] find_busiest_group+0x1d0/0xe10 (unreliable)
[ 24.610304] [c00000001e597680] [c000000000176110] load_balance+0x1c0/0xe80
[ 24.610377] [c00000001e5977d0] [c000000000176ff8] rebalance_domains+0x228/0x380
[ 24.610467] [c00000001e597880] [c000000000c7c170] __do_softirq+0x170/0x404
[ 24.610542] [c00000001e597980] [c000000000124368] irq_exit+0xd8/0x110
[ 24.610617] [c00000001e5979a0] [c000000000028778] timer_interrupt+0x128/0x2e0
[ 24.610706] [c00000001e597a00] [c000000000009314] decrementer_common+0x154/0x160
[ 24.610799] --- interrupt: 901 at plpar_hcall_norets+0x1c/0x28
[ 24.610799] LR = check_and_cede_processor+0x48/0x60
[ 24.610915] [c00000001e597d00] [c00000001e597d60] 0xc00000001e597d60 (unreliable)
[ 24.611004] [c00000001e597d60] [c0000000009e22a8] shared_cede_loop+0x68/0x180
[ 24.611096] [c00000001e597da0] [c0000000009dec64] cpuidle_enter_state+0xa4/0x660
[ 24.611191] [c00000001e597e30] [c0000000001647a0] call_cpuidle+0x50/0xa0
[ 24.611270] [c00000001e597e50] [c000000000164d6c] do_idle+0x2cc/0x3b0
[ 24.611350] [c00000001e597ec0] [c00000000016508c] cpu_startup_entry+0x3c/0x50
[ 24.611445] [c00000001e597ef0] [c000000000051dd0] start_secondary+0x630/0x660
[ 24.611539] [c00000001e597f90] [c00000000000b25c] start_secondary_prolog+0x10/0x14
[ 24.611632] Instruction dump:
[ 24.611680] 7c374800 41820234 e8920016 3b570020 8152002c 7c893670 7d290194 548506be
[ 24.611775] 788606a0 7d2907b4 79291f24 7d1a4a14 <7cfa482a> 7ce72c36 78e707e0 2d270000
[ 24.611871] ---[ end trace 0e5e3ed14d31f59d ]---
[ 24.617852]
[ 25.617885] Kernel panic - not syncing: Aiee, killing interrupt handler!
(qemu) info numa
2 nodes
node 0 cpus: 0
node 0 size: 512 MB
node 0 plugged: 0 MB
node 1 cpus: 1
node 1 size: 0 MB
node 1 plugged: 0 MB

but the crash doesn't occur with:

-smp 1,maxcpus=2 \
-object memory-backend-ram,size=512M,id=node0 \
-numa node,nodeid=0,memdev=node0 \
-numa node,nodeid=1 \
-device spapr-pci-host-bridge,index=1,id=phb1,numa_node=1

(qemu) info numa
2 nodes
node 0 cpus: 0
node 0 size: 512 MB
node 0 plugged: 0 MB
node 1 cpus:
node 1 size: 0 MB
node 1 plugged: 0 MB
(qemu) device_add host-spapr-cpu-core,core-id=1
[ 154.637304] Policy zone: Normal
[ 154.665463] WARNING: workqueue cpumask: online intersect > possible intersect
(qemu) info numa
2 nodes
node 0 cpus: 0
node 0 size: 512 MB
node 0 plugged: 0 MB
node 1 cpus: 1
node 1 size: 0 MB
node 1 plugged: 0 MB

nor with:

-smp 1,maxcpus=2 \
-object memory-backend-ram,size=512M,id=node0 \
-numa node,nodeid=0,memdev=node0,cpus=0 \
-numa node,nodeid=1

qemu-system-ppc64: warning: CPU(s) not present in any NUMA nodes: CPU 1 [core-id: 1]
qemu-system-ppc64: warning: All CPU(s) up to maxcpus should be described in NUMA config, ability to start up with partial NUMA mappings is obsoleted and will be removed in future
(qemu) device_add host-spapr-cpu-core,core-id=1
(qemu) info numa
2 nodes
node 0 cpus: 0 1
node 0 size: 512 MB
node 0 plugged: 0 MB
node 1 cpus:
node 1 size: 0 MB
node 1 plugged: 0 MB

so I don't know why linux crashes, but it isn't exactly because of having
a cpu-less/memory-less node, and this patch catches the non-crashing cases
anyway.

diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index baedadf20b8c..8be738901cf9 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -2847,6 +2847,39 @@ static void spapr_machine_init(MachineState *machine)
     /* init CPUs */
     spapr_init_cpus(spapr);
 
+    /*
+     * check we don't have a memory-less/cpu-less NUMA node
+     * Firmware relies on the existing memory/cpu topology to provide the
+     * NUMA topology to the kernel.
+     * And the linux kernel needs to know the NUMA topology at start
+     * to be able to hotplug CPUs later.
+     */
+    if (nb_numa_nodes) {
+        for (i = 0; i < nb_numa_nodes; ++i) {
+            /* check for memory-less node */
+            if (numa_info[i].node_mem == 0) {
+                CPUState *cs;
+                int found = 0;
+                /* check for cpu-less node */
+                CPU_FOREACH(cs) {
+                    PowerPCCPU *cpu = POWERPC_CPU(cs);
+                    if (cpu->node_id == i) {
+                        found = 1;
+                        break;
+                    }
+                }
+                /* memory-less and cpu-less node */
+                if (!found) {
+                    error_report(
+                       "Memory-less/cpu-less nodes are not supported (node %d)",
+                                 i);
+                    exit(1);
+                }
+            }
+        }
+
+    }
+
     if ((!kvm_enabled() || kvmppc_has_cap_mmu_radix()) &&
         ppc_type_check_compat(machine->cpu_type, CPU_POWERPC_LOGICAL_3_00, 0,
                               spapr->max_compat_pvr)) {
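For readers following the patch, the check it adds can be sketched as a standalone C function. This is an illustrative model only, under stated assumptions: `struct node_info`, `struct cpu_info` and `first_empty_node()` are hypothetical stand-ins for QEMU's `numa_info[]` array and `CPU_FOREACH()` iteration and are not QEMU APIs. Only the decision logic (a node is rejected when it has no memory and no boot-time CPU) is taken from the diff.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one entry of QEMU's numa_info[] array. */
struct node_info {
    uint64_t node_mem;   /* bytes of RAM assigned to the node at start */
};

/* Hypothetical stand-in for one CPU present at machine init. */
struct cpu_info {
    int node_id;         /* NUMA node the CPU belongs to */
};

/*
 * Return the index of the first node that has neither memory nor a
 * boot-time CPU, or -1 if every node has at least one of the two.
 */
int first_empty_node(const struct node_info *nodes, int nb_nodes,
                     const struct cpu_info *cpus, size_t nb_cpus)
{
    assert(nb_nodes >= 0);

    for (int i = 0; i < nb_nodes; ++i) {
        int found = 0;

        if (nodes[i].node_mem != 0) {
            continue;    /* node has memory, so it is not empty */
        }
        /* memory-less node: look for a CPU that belongs to it */
        for (size_t c = 0; c < nb_cpus; ++c) {
            if (cpus[c].node_id == i) {
                found = 1;
                break;
            }
        }
        if (!found) {
            return i;    /* memory-less and cpu-less */
        }
    }
    return -1;
}
```

With the boot-time topology of the first command line in the thread (node 0 holding 512M and CPU 0, node 1 holding nothing), `first_empty_node()` returns 1, the node the patch would name in its error message. The same holds for the two non-crashing command lines in the thread, since node 1 has no memory and no CPU at start in those as well.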
When we hotplug a CPU on memory-less/cpu-less node, the linux kernel
crashes.

This happens because linux kernel needs to know the NUMA topology at
start to be able to initialize the distance lookup table.

On pseries, the topology is provided by the firmware via the existing
CPUs and memory information. Thus a node without memory and CPU cannot be
discovered by the kernel.

To avoid the kernel crash, do not allow to start pseries with empty
nodes.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
---
 hw/ppc/spapr.c | 33 +++++++++++++++++++++++++++++++++
 1 file changed, 33 insertions(+)
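The reasoning in the changelog (a node is only visible if some CPU or memory region refers to it) can be illustrated with a deliberately simplified C model. It is a sketch under stated assumptions, not the firmware's or the Linux kernel's actual code: `struct resource`, `MAX_NODES` and `discover_nodes()` are hypothetical names, and each resource is reduced to the node id it advertises.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_NODES 8   /* hypothetical upper bound for this sketch */

/* Hypothetical, simplified view of one CPU or memory region at boot. */
struct resource {
    int node_id;      /* NUMA node the resource says it belongs to */
};

/*
 * Mark a node as known only if some advertised resource refers to it.
 * Returns the number of distinct nodes discovered.
 */
int discover_nodes(const struct resource *res, size_t nb_res,
                   bool known[MAX_NODES])
{
    int count = 0;

    for (int n = 0; n < MAX_NODES; ++n) {
        known[n] = false;
    }
    for (size_t r = 0; r < nb_res; ++r) {
        int id = res[r].node_id;

        assert(id >= 0 && id < MAX_NODES);
        if (!known[id]) {
            known[id] = true;
            ++count;
        }
    }
    return count;
}
```

If the only boot-time resources are CPU 0 and one memory region, both on node 0, the model discovers a single node and leaves node 1 unknown. In this model, nothing at start tells the guest that node 1 exists, which matches the changelog's statement that such a node cannot be discovered by the kernel.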