Message ID: 20170821134951.18848-1-lvivier@redhat.com (mailing list archive)
State: Not Applicable
On Mon, Aug 21, 2017 at 03:49:50PM +0200, Laurent Vivier wrote:
> In wq_numa_init() a list of NUMA nodes with their list of possible CPUs
> is built.
>
> Unfortunately, on powerpc, the Firmware is only able to provide the
> node of a CPU if the CPU is present. So, in our case (possible CPU)
> CPU ids are known, but as the CPU is not present, the node id is
> unknown and all the unplugged CPUs are attached to node 0.

This is something powerpc needs to fix.

Workqueue isn't the only one making this assumption. mm as a whole
assumes that CPU <-> node mapping is stable regardless of hotplug
events. Powerpc people know about the issue and AFAIK are working on
it.

Thanks.
Tejun Heo <tj@kernel.org> writes:
> On Mon, Aug 21, 2017 at 03:49:50PM +0200, Laurent Vivier wrote:
>> In wq_numa_init() a list of NUMA nodes with their list of possible CPUs
>> is built.
>>
>> Unfortunately, on powerpc, the Firmware is only able to provide the
>> node of a CPU if the CPU is present. So, in our case (possible CPU)
>> CPU ids are known, but as the CPU is not present, the node id is
>> unknown and all the unplugged CPUs are attached to node 0.
>
> This is something powerpc needs to fix.

There is no way for us to fix it. At boot, for possible but not
present CPUs, we have no way of knowing the CPU <-> node mapping,
firmware simply doesn't tell us.

> Workqueue isn't the only one making this assumption. mm as a whole
> assumes that CPU <-> node mapping is stable regardless of hotplug
> events.

At least in this case I don't think the mapping changes, it's just we
don't know the mapping at boot.

Currently we have to report possible but not present CPUs as belonging
to node 0, because otherwise we trip this helpful piece of code:

	for_each_possible_cpu(cpu) {
		node = cpu_to_node(cpu);
		if (WARN_ON(node == NUMA_NO_NODE)) {
			pr_warn("workqueue: NUMA node mapping not available for cpu%d, disabling NUMA support\n", cpu);
			/* happens iff arch is bonkers, let's just proceed */
			return;
		}

But if we remove that, we could then accurately report NUMA_NO_NODE at
boot, and then update the mapping when the CPU is hotplugged.

cheers
Hello, Michael.

On Tue, Aug 22, 2017 at 11:41:41AM +1000, Michael Ellerman wrote:
>> This is something powerpc needs to fix.
>
> There is no way for us to fix it.

I don't think that's true. The CPU id used in kernel doesn't have to
match the physical one and arch code should be able to pre-map CPU IDs
to nodes and use the matching one when hotplugging CPUs. I'm not
saying that's the best way to solve the problem tho.

It could be that the best way forward is making cpu <-> node mapping
dynamic and properly synchronized. However, please note that that
does mean we mess up node affinity for things like per-cpu memory
which are allocated before the cpu comes up, so there's some inherent
benefits to keeping the mapping static even if that involves
indirection.

>> Workqueue isn't the only one making this assumption. mm as a whole
>> assumes that CPU <-> node mapping is stable regardless of hotplug
>> events.
>
> At least in this case I don't think the mapping changes, it's just we
> don't know the mapping at boot.
>
> Currently we have to report possible but not present CPUs as belonging
> to node 0, because otherwise we trip this helpful piece of code:
>
> 	for_each_possible_cpu(cpu) {
> 		node = cpu_to_node(cpu);
> 		if (WARN_ON(node == NUMA_NO_NODE)) {
> 			pr_warn("workqueue: NUMA node mapping not available for cpu%d, disabling NUMA support\n", cpu);
> 			/* happens iff arch is bonkers, let's just proceed */
> 			return;
> 		}
>
> But if we remove that, we could then accurately report NUMA_NO_NODE at
> boot, and then update the mapping when the CPU is hotplugged.

If you think that making this dynamic is the right way to go, I have
no objection but we should be doing this properly instead of patching
up what seems to be crashing right now. What synchronization and
notification mechanisms do we need to make cpu <-> node mapping
dynamic? Do we need any synchronization in memory allocation paths?
If not, why would it be safe?

Thanks.
Hi Tejun,

Tejun Heo <tj@kernel.org> writes:
> Hello, Michael.
>
> On Tue, Aug 22, 2017 at 11:41:41AM +1000, Michael Ellerman wrote:
>>> This is something powerpc needs to fix.
>>
>> There is no way for us to fix it.
>
> I don't think that's true. The CPU id used in kernel doesn't have to
> match the physical one and arch code should be able to pre-map CPU IDs
> to nodes and use the matching one when hotplugging CPUs. I'm not
> saying that's the best way to solve the problem tho.

We already virtualise the CPU numbers, but not the node IDs. And it's
the node IDs that are really the problem.

So yeah I guess we might be able to make that work, but I'd have to
think about it a bit more.

> It could be that the best way forward is making cpu <-> node mapping
> dynamic and properly synchronized.

We don't need it to be dynamic (at least for this bug).

Laurent is booting Qemu with a fixed CPU <-> Node mapping, it's just
that because some CPUs aren't present at boot we don't know what the
node mapping is. (Correct me if I'm wrong Laurent).

So all we need is:
 - the workqueue code to cope with CPUs that are possible but not
   online having NUMA_NO_NODE to begin with.
 - a way to update the workqueue cpumask when the CPU comes online.

Which seems reasonable to me?

cheers
On 23/08/2017 13:00, Michael Ellerman wrote:
> Hi Tejun,
>
> Tejun Heo <tj@kernel.org> writes:
>> Hello, Michael.
>>
>> On Tue, Aug 22, 2017 at 11:41:41AM +1000, Michael Ellerman wrote:
>>>> This is something powerpc needs to fix.
>>>
>>> There is no way for us to fix it.
>>
>> I don't think that's true. The CPU id used in kernel doesn't have to
>> match the physical one and arch code should be able to pre-map CPU IDs
>> to nodes and use the matching one when hotplugging CPUs. I'm not
>> saying that's the best way to solve the problem tho.
>
> We already virtualise the CPU numbers, but not the node IDs. And it's
> the node IDs that are really the problem.
>
> So yeah I guess we might be able to make that work, but I'd have to
> think about it a bit more.
>
>> It could be that the best way forward is making cpu <-> node mapping
>> dynamic and properly synchronized.
>
> We don't need it to be dynamic (at least for this bug).
>
> Laurent is booting Qemu with a fixed CPU <-> Node mapping, it's just
> that because some CPUs aren't present at boot we don't know what the
> node mapping is. (Correct me if I'm wrong Laurent).

You're correct. Qemu is started with:

	-numa node,cpus=0-1 -numa node,cpus=2-3 \
	-smp 2,maxcpus=4,cores=4,threads=1,sockets=1

Which means we have 2 nodes with cpu ids 0 and 1 on node 0, and cpu
ids 2 and 3 on node 1, but at boot only 2 CPUs are present.

The problem I try to fix with this series is when we hotplug a third
CPU, to node 1 with id 2.

Thanks,
Laurent
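[Editor's note: the boot-time situation Laurent describes can be sketched in plain userspace C. This is a hypothetical simulation, not kernel code — the names (`firmware_node`, `cpu_to_node_tbl`, `boot_init`, `hotplug_cpu`) are invented for illustration. It models a fixed 4-CPU/2-node topology where firmware only reports a node for present CPUs, so possible-but-not-present CPUs fall back to node 0 until they are hotplugged.]

```c
#include <assert.h>

#define NR_CPUS      4
#define NUMA_NO_NODE (-1)

/* Firmware's true mapping: cpus 0-1 -> node 0, cpus 2-3 -> node 1
 * (matches the qemu -numa configuration above). */
static const int firmware_node[NR_CPUS] = { 0, 0, 1, 1 };

/* What the kernel can see: only present CPUs report a node. */
static int cpu_to_node_tbl[NR_CPUS];

static void boot_init(int nr_present)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		/* not-present CPUs end up attached to node 0 */
		cpu_to_node_tbl[cpu] = (cpu < nr_present) ? firmware_node[cpu] : 0;
}

/* Hotplug: firmware now reveals the real node of the new CPU. */
static void hotplug_cpu(int cpu)
{
	cpu_to_node_tbl[cpu] = firmware_node[cpu];
}
```

Any subsystem (such as the workqueue code) that snapshots `cpu_to_node_tbl` between `boot_init()` and `hotplug_cpu()` records the wrong node for CPU 2 — which is exactly the inconsistency the series addresses.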
Hello, Michael.

On Wed, Aug 23, 2017 at 09:00:39PM +1000, Michael Ellerman wrote:
>> I don't think that's true. The CPU id used in kernel doesn't have to
>> match the physical one and arch code should be able to pre-map CPU IDs
>> to nodes and use the matching one when hotplugging CPUs. I'm not
>> saying that's the best way to solve the problem tho.
>
> We already virtualise the CPU numbers, but not the node IDs. And it's
> the node IDs that are really the problem.

Yeah, it just needs to match up new cpus to the cpu ids assigned to
the right node.

>> It could be that the best way forward is making cpu <-> node mapping
>> dynamic and properly synchronized.
>
> We don't need it to be dynamic (at least for this bug).

The node mapping for that cpu id changes *dynamically* while the
system is running and that can race with node-affinity sensitive
operations such as memory allocations.

> Laurent is booting Qemu with a fixed CPU <-> Node mapping, it's just
> that because some CPUs aren't present at boot we don't know what the
> node mapping is. (Correct me if I'm wrong Laurent).
>
> So all we need is:
>  - the workqueue code to cope with CPUs that are possible but not
>    online having NUMA_NO_NODE to begin with.
>  - a way to update the workqueue cpumask when the CPU comes online.
>
> Which seems reasonable to me?

Please take a step back and think through the problem again. You
can't bandaid it this way.

Thanks.
On 23/08/2017 15:26, Tejun Heo wrote:
> Hello, Michael.
>
> On Wed, Aug 23, 2017 at 09:00:39PM +1000, Michael Ellerman wrote:
>>> I don't think that's true. The CPU id used in kernel doesn't have to
>>> match the physical one and arch code should be able to pre-map CPU IDs
>>> to nodes and use the matching one when hotplugging CPUs. I'm not
>>> saying that's the best way to solve the problem tho.
>>
>> We already virtualise the CPU numbers, but not the node IDs. And it's
>> the node IDs that are really the problem.
>
> Yeah, it just needs to match up new cpus to the cpu ids assigned to
> the right node.

We are not able to assign the cpu ids to the right node before the CPU
is present, because firmware doesn't provide the CPU <-> node id
mapping before that.

>>> It could be that the best way forward is making cpu <-> node mapping
>>> dynamic and properly synchronized.
>>
>> We don't need it to be dynamic (at least for this bug).
>
> The node mapping for that cpu id changes *dynamically* while the
> system is running and that can race with node-affinity sensitive
> operations such as memory allocations.

Memory is mapped to its node through its own firmware entry, so I
don't think a cpu id change can affect memory affinity, and before we
know the node id of the CPU, the CPU is not present and thus can't use
memory.

>> Laurent is booting Qemu with a fixed CPU <-> Node mapping, it's just
>> that because some CPUs aren't present at boot we don't know what the
>> node mapping is. (Correct me if I'm wrong Laurent).
>>
>> So all we need is:
>>  - the workqueue code to cope with CPUs that are possible but not
>>    online having NUMA_NO_NODE to begin with.
>>  - a way to update the workqueue cpumask when the CPU comes online.
>>
>> Which seems reasonable to me?
>
> Please take a step back and think through the problem again. You
> can't bandaid it this way.

Could you give some ideas, proposals?

As the firmware doesn't provide the information before the CPU is
really plugged, I really don't know how to manage this problem.

Thanks,
Laurent
Hello, Laurent.

On Thu, Aug 24, 2017 at 02:10:31PM +0200, Laurent Vivier wrote:
>> Yeah, it just needs to match up new cpus to the cpu ids assigned to
>> the right node.
>
> We are not able to assign the cpu ids to the right node before the CPU
> is present, because firmware doesn't provide the CPU <-> node id
> mapping before that.

What I meant was to assign the matching CPU ID when the CPU becomes
present - ie. have CPU IDs available for different nodes and allocate
them to the new CPU according to its node mapping when it actually
comes up. Please note that I'm not saying this is the way to go, just
that it is a solvable problem from the arch code.

>> The node mapping for that cpu id changes *dynamically* while the
>> system is running and that can race with node-affinity sensitive
>> operations such as memory allocations.
>
> Memory is mapped to its node through its own firmware entry, so I
> don't think a cpu id change can affect memory affinity, and before we
> know the node id of the CPU, the CPU is not present and thus can't use
> memory.

The latter part isn't true. For example, percpu memory gets allocated
for all possible CPUs according to their node affinity, so the memory
node association change which happens when the CPU comes up for the
first time can race against such allocations. I don't know whether
that's actually problematic but we don't have *any* synchronization
around it. If you think it's safe to have such races, please explain
why that is.

>> Please take a step back and think through the problem again. You
>> can't bandaid it this way.
>
> Could you give some ideas, proposals?
>
> As the firmware doesn't provide the information before the CPU is
> really plugged, I really don't know how to manage this problem.

There are two possible approaches, I think.

1. Make physical cpu -> logical cpu mapping indirect so that the
   kernel's cpu ID assignment is always on the right numa node. This
   may mean that the kernel might have to keep more possible CPUs
   around than necessary but it does have the benefit that all memory
   allocations are affine to the right node.

2. Make cpu <-> node mapping properly dynamic. Identify what sort of
   synchronization we'd need around the mapping changing dynamically.
   Note that we might not need much but it'll most likely need some.
   Build synchronization and notification infrastructure around it.

Thanks.
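[Editor's note: Tejun's approach 1 — pre-partitioning logical CPU IDs per node so `cpu_to_node()` stays static — can be sketched in userspace C. This is a hypothetical illustration of the idea, not proposed kernel code; the names (`cpu_id_to_node`, `assign_cpu_id`, `IDS_PER_NODE`) are invented. Each node reserves a fixed range of logical IDs, and a hotplugged physical CPU is simply handed the next free ID from its node's range.]

```c
#include <assert.h>

#define NR_NODES     2
#define IDS_PER_NODE 4   /* worst-case CPUs per node, reserved up front */

/* Next unused logical ID offset within each node's range. */
static int node_next_id[NR_NODES];

/* Static by construction: the node is encoded in the ID range. */
static int cpu_id_to_node(int id)
{
	return id / IDS_PER_NODE;
}

/* When a physical CPU appears on `node`, give it the next logical ID
 * reserved for that node; -1 if the node's ID pool is exhausted. */
static int assign_cpu_id(int node)
{
	if (node_next_id[node] >= IDS_PER_NODE)
		return -1;
	return node * IDS_PER_NODE + node_next_id[node]++;
}
```

The cost is the one Tejun notes: the kernel must carry `NR_NODES * IDS_PER_NODE` possible CPUs even if far fewer ever show up, but per-cpu allocations made before hotplug land on the right node because the mapping never changes.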
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 8d3320562c70..abc533146514 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -964,6 +964,7 @@ void start_secondary(void *unused)
 	set_numa_node(numa_cpu_lookup_table[cpu]);
 	set_numa_mem(local_memory_node(numa_cpu_lookup_table[cpu]));
 
+	wq_numa_add_possible_cpu(cpu);
 	smp_wmb();
 	notify_cpu_starting(cpu);
diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index db6dc9dc0482..4ef030dae22c 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -644,6 +644,7 @@ int workqueue_online_cpu(unsigned int cpu);
 int workqueue_offline_cpu(unsigned int cpu);
 #endif
 
+void wq_numa_add_possible_cpu(unsigned int cpu);
 int __init workqueue_init_early(void);
 int __init workqueue_init(void);
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index ca937b0c3a96..d1a99e25d5da 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -5486,6 +5486,27 @@ static inline void wq_watchdog_init(void) { }
 #endif	/* CONFIG_WQ_WATCHDOG */
 
+void wq_numa_add_possible_cpu(unsigned int cpu)
+{
+	int node;
+
+	if (num_possible_nodes() <= 1)
+		return;
+
+	if (wq_disable_numa)
+		return;
+
+	if (!wq_numa_enabled)
+		return;
+
+	node = cpu_to_node(cpu);
+	if (node == NUMA_NO_NODE)
+		return;
+
+	cpumask_set_cpu(cpu, wq_numa_possible_cpumask[node]);
+}
+EXPORT_SYMBOL_GPL(wq_numa_add_possible_cpu);
+
 static void __init wq_numa_init(void)
 {
 	cpumask_var_t *tbl;
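[Editor's note: the effect of the patch can be modeled in userspace C, which also makes Tejun's objection concrete. This is a simplified, hypothetical simulation — `unsigned int` bitmasks stand in for `cpumask_t`, and the `_sim` names are invented. Note that, like `cpumask_set_cpu()` in the patch, the update only *adds* the CPU to its real node's mask; the stale bit left in node 0's mask is part of what the discussion is about.]

```c
#include <assert.h>

#define NR_CPUS  4
#define NR_NODES 2

/* Per-node possible-CPU masks, as wq_numa_init() would build them
 * when every not-yet-present CPU reports node 0. */
static unsigned int node_cpumask[NR_NODES];

static void wq_numa_init_sim(const int *cpu_node)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		node_cpumask[cpu_node[cpu]] |= 1u << cpu;
}

/* Mirrors the patch: when the CPU comes up and its real node is
 * finally known, add it to that node's mask. */
static void wq_numa_add_possible_cpu_sim(int cpu, int real_node)
{
	node_cpumask[real_node] |= 1u << cpu;
}
```

After the fix-up, an unbound workqueue consulting node 1's mask would finally see CPU 2; whether patching the mask after the fact is safe against concurrent users of `wq_numa_possible_cpumask` is exactly the synchronization question Tejun raises.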
In wq_numa_init() a list of NUMA nodes with their list of possible CPUs
is built.

Unfortunately, on powerpc, the Firmware is only able to provide the
node of a CPU if the CPU is present. So, in our case (possible CPU)
CPU ids are known, but as the CPU is not present, the node id is
unknown and all the unplugged CPUs are attached to node 0.

When we hotplug a CPU, there can therefore be an inconsistency between
the node id known by the workqueue and the real node id of the CPU.

This patch adds a hook so that a hotplugged CPU updates its node's
entry in the wq_numa_possible_cpumask array.

Signed-off-by: Laurent Vivier <lvivier@redhat.com>
---
 arch/powerpc/kernel/smp.c | 1 +
 include/linux/workqueue.h | 1 +
 kernel/workqueue.c        | 21 +++++++++++++++++++++
 3 files changed, 23 insertions(+)