Message ID | 20210305173845.451158-1-danielhb413@gmail.com (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
Series | [1/1] hotplug-cpu.c: show 'last online CPU' error in dlpar_cpu_remove() | expand |
Context | Check | Description |
---|---|---|
snowpatch_ozlabs/apply_patch | success | Successfully applied on branch powerpc/merge (91966823812efbd175f904599e5cf2a854b39809) |
snowpatch_ozlabs/build-ppc64le | success | Build succeeded |
snowpatch_ozlabs/build-ppc64be | success | Build succeeded |
snowpatch_ozlabs/build-ppc64e | success | Build succeeded |
snowpatch_ozlabs/build-pmac32 | success | Build succeeded |
snowpatch_ozlabs/checkpatch | success | total: 0 errors, 0 warnings, 0 checks, 18 lines checked |
snowpatch_ozlabs/needsstable | success | Patch has no Fixes tags |
Ping

On 3/5/21 2:38 PM, Daniel Henrique Barboza wrote:
> Of all the reasons that dlpar_cpu_remove() can fail, the 'last online
> CPU' is one that can be caused directly by the user offlining CPUs
> in a partition/virtual machine that has hotplugged CPUs. Trying to
> reclaim a hotplugged CPU can fail if the CPU is now the last online in
> the system. This is easily reproduced using QEMU [1].
>
> Throwing a more specific error message for this case, instead of just
> "Failed to offline CPU", makes it clearer that the error is in fact a
> known error situation instead of other generic/unknown cause.
>
> [1] https://bugzilla.redhat.com/1911414
>
> Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
> ---
>  arch/powerpc/platforms/pseries/hotplug-cpu.c | 12 +++++++++++-
>  1 file changed, 11 insertions(+), 1 deletion(-)
>
> diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c b/arch/powerpc/platforms/pseries/hotplug-cpu.c
> index 12cbffd3c2e3..134f393f09e1 100644
> --- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
> +++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
> @@ -514,7 +514,17 @@ static ssize_t dlpar_cpu_remove(struct device_node *dn, u32 drc_index)
>
>  	rc = dlpar_offline_cpu(dn);
>  	if (rc) {
> -		pr_warn("Failed to offline CPU %pOFn, rc: %d\n", dn, rc);
> +		/* dlpar_offline_cpu will return -EBUSY from cpu_down() (via
> +		 * device_offline()) in 2 cases: cpu_hotplug_disable is true or
> +		 * there is only one CPU left. Warn the user about the second
> +		 * since this can happen with user offlining CPUs and then
> +		 * attempting hotunplugs.
> +		 */
> +		if (rc == -EBUSY && num_online_cpus() == 1)
> +			pr_warn("Unable to remove last online CPU %pOFn\n", dn);
> +		else
> +			pr_warn("Failed to offline CPU %pOFn, rc: %d\n", dn, rc);
> +
>  		return -EINVAL;
>  	}
Daniel Henrique Barboza <danielhb413@gmail.com> writes:
> Ping
>
> On 3/5/21 2:38 PM, Daniel Henrique Barboza wrote:
>> Of all the reasons that dlpar_cpu_remove() can fail, the 'last online
>> CPU' is one that can be caused directly by the user offlining CPUs
>> in a partition/virtual machine that has hotplugged CPUs. Trying to
>> reclaim a hotplugged CPU can fail if the CPU is now the last online in
>> the system. This is easily reproduced using QEMU [1].

Sorry, I saw this earlier and never got around to replying.

I'm wondering if we need to catch it earlier, ie. in
dlpar_offline_cpu().

By the time we return to dlpar_cpu_remove() we've dropped the
cpu_add_remove_lock (cpu_maps_update_done), so num_online_cpus() could
change out from under us, meaning the num_online_cpus() == 1 check might
trigger incorrectly (or vice versa).

Something like the patch below (completely untested :D)

And writing that patch makes me wonder, is == 1 the right check?

In most cases we'll remove all but one thread of the core, but we'll
fail on the last thread. Leaving that core effectively stuck in SMT1.
Is that useful behaviour? Should we instead check at the start that we
can remove all threads of the core without going to zero online CPUs?

cheers

diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c b/arch/powerpc/platforms/pseries/hotplug-cpu.c
index 12cbffd3c2e3..498c22331ac8 100644
--- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
+++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
@@ -271,6 +271,12 @@ static int dlpar_offline_cpu(struct device_node *dn)
 		if (!cpu_online(cpu))
 			break;
 
+		if (num_online_cpus() == 1) {
+			pr_warn("Unable to remove last online CPU %pOFn\n", dn);
+			rc = -EBUSY;
+			goto out_unlock;
+		}
+
 		cpu_maps_update_done();
 		rc = device_offline(get_cpu_device(cpu));
 		if (rc)
@@ -283,6 +289,7 @@ static int dlpar_offline_cpu(struct device_node *dn)
 				thread);
 		}
 	}
+out_unlock:
 	cpu_maps_update_done();
 
 out:
On 3/19/21 8:26 AM, Michael Ellerman wrote:
> Daniel Henrique Barboza <danielhb413@gmail.com> writes:
>> Ping
>>
>> On 3/5/21 2:38 PM, Daniel Henrique Barboza wrote:
>>> Of all the reasons that dlpar_cpu_remove() can fail, the 'last online
>>> CPU' is one that can be caused directly by the user offlining CPUs
>>> in a partition/virtual machine that has hotplugged CPUs. Trying to
>>> reclaim a hotplugged CPU can fail if the CPU is now the last online in
>>> the system. This is easily reproduced using QEMU [1].
>
> Sorry, I saw this earlier and never got around to replying.

No problem. Thanks for the review!

> I'm wondering if we need to catch it earlier, ie. in
> dlpar_offline_cpu().
>
> By the time we return to dlpar_cpu_remove() we've dropped the
> cpu_add_remove_lock (cpu_maps_update_done), so num_online_cpus() could
> change out from under us, meaning the num_online_cpus() == 1 check might
> trigger incorrectly (or vice versa).
>
> Something like the patch below (completely untested :D)

Makes sense. I'll try it out to see if it works.

> And writing that patch makes me wonder, is == 1 the right check?
>
> In most cases we'll remove all but one thread of the core, but we'll
> fail on the last thread. Leaving that core effectively stuck in SMT1.
> Is that useful behaviour? Should we instead check at the start that we
> can remove all threads of the core without going to zero online CPUs?

I think it's ok to allow SMT1 cores, speaking from a QEMU perspective.
QEMU does not have a "core hotunplug" operation where the whole core is
hotunplugged at once. The CPU hotplug/unplug operations are done as
single CPU thread add/removal. If the user wants to run 4 cores, all of
them with SMT1, QEMU will allow it.

Libvirt does not operate with core granularity either - you can specify
the amount of vcpus the guest should run with, and Libvirt will send
hotplug/unplug requests to QEMU to match the desired value. It doesn't
bother with how many threads of a core were offlined or not.

Thanks,

DHB

> cheers
>
> diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c b/arch/powerpc/platforms/pseries/hotplug-cpu.c
> index 12cbffd3c2e3..498c22331ac8 100644
> --- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
> +++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
> @@ -271,6 +271,12 @@ static int dlpar_offline_cpu(struct device_node *dn)
>  		if (!cpu_online(cpu))
>  			break;
> 
> +		if (num_online_cpus() == 1) {
> +			pr_warn("Unable to remove last online CPU %pOFn\n", dn);
> +			rc = -EBUSY;
> +			goto out_unlock;
> +		}
> +
>  		cpu_maps_update_done();
>  		rc = device_offline(get_cpu_device(cpu));
>  		if (rc)
> @@ -283,6 +289,7 @@ static int dlpar_offline_cpu(struct device_node *dn)
>  				thread);
>  		}
>  	}
> +out_unlock:
>  	cpu_maps_update_done();
> 
>  out:
diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c b/arch/powerpc/platforms/pseries/hotplug-cpu.c
index 12cbffd3c2e3..134f393f09e1 100644
--- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
+++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
@@ -514,7 +514,17 @@ static ssize_t dlpar_cpu_remove(struct device_node *dn, u32 drc_index)
 
 	rc = dlpar_offline_cpu(dn);
 	if (rc) {
-		pr_warn("Failed to offline CPU %pOFn, rc: %d\n", dn, rc);
+		/* dlpar_offline_cpu will return -EBUSY from cpu_down() (via
+		 * device_offline()) in 2 cases: cpu_hotplug_disable is true or
+		 * there is only one CPU left. Warn the user about the second
+		 * since this can happen with user offlining CPUs and then
+		 * attempting hotunplugs.
+		 */
+		if (rc == -EBUSY && num_online_cpus() == 1)
+			pr_warn("Unable to remove last online CPU %pOFn\n", dn);
+		else
+			pr_warn("Failed to offline CPU %pOFn, rc: %d\n", dn, rc);
+
 		return -EINVAL;
 	}
Of all the reasons that dlpar_cpu_remove() can fail, the 'last online
CPU' is one that can be caused directly by the user offlining CPUs in a
partition/virtual machine that has hotplugged CPUs. Trying to reclaim a
hotplugged CPU can fail if the CPU is now the last online in the
system. This is easily reproduced using QEMU [1].

Throwing a more specific error message for this case, instead of just
"Failed to offline CPU", makes it clearer that the error is in fact a
known error situation rather than some other generic/unknown cause.

[1] https://bugzilla.redhat.com/1911414

Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
---
 arch/powerpc/platforms/pseries/hotplug-cpu.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)