[v2,0/3] Updates to powerpc for robust CPU online/offline

Message ID 20210821102535.169643-1-srikar@linux.vnet.ibm.com (mailing list archive)

Message

Srikar Dronamraju Aug. 21, 2021, 10:25 a.m. UTC
The scheduler expects the number of unique node distances to be
available at boot. It uses node distance to calculate these unique node
distances. On Power servers, node distances for offline nodes are not
available. However, Power servers already know the unique possible node
distances. Fake the offline nodes' distance_lookup_table entries so
that all possible node distances are updated.

For example, distance info from numactl on a fully populated 8-node
system at boot may look like this:

node distances:
node   0   1   2   3   4   5   6   7
  0:  10  20  40  40  40  40  40  40
  1:  20  10  40  40  40  40  40  40
  2:  40  40  10  20  40  40  40  40
  3:  40  40  20  10  40  40  40  40
  4:  40  40  40  40  10  20  40  40
  5:  40  40  40  40  20  10  40  40
  6:  40  40  40  40  40  40  10  20
  7:  40  40  40  40  40  40  20  10

However, when the same system has only two nodes online at boot, the
distance info from numactl will look like this:
node distances:
node   0   1
  0:  10  20
  1:  20  10

With the faked NUMA distance at boot, the node distance table will look
like this:
node   0   1   2
  0:  10  20  40
  1:  20  10  40
  2:  40  40  10

The actual distance will be populated once the nodes are onlined.
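
As an illustration of the tables above, here is a minimal user-space
sketch (not kernel code) that counts the distinct distance values in a
table. The names count_unique_distances, two_nodes_online and
with_faked_node are hypothetical and exist only for this example. With
only two nodes online the table exposes two distinct distances (10 and
20); with the faked entry it exposes all three (10, 20 and 40).

```c
#include <assert.h>

#define MAX_NODES 8

/*
 * Count the distinct values in the top-left n x n corner of a
 * distance table. Hypothetical helper, for illustration only.
 */
int count_unique_distances(int n, const int dist[][MAX_NODES])
{
	int seen[MAX_NODES * MAX_NODES];
	int nr_seen = 0;

	assert(n > 0 && n <= MAX_NODES);

	for (int i = 0; i < n; i++) {
		for (int j = 0; j < n; j++) {
			int k;

			/* Linear search: the tables are tiny. */
			for (k = 0; k < nr_seen; k++)
				if (seen[k] == dist[i][j])
					break;
			if (k == nr_seen)
				seen[nr_seen++] = dist[i][j];
		}
	}
	return nr_seen;
}

/* Table seen at boot when only nodes 0 and 1 are online. */
const int two_nodes_online[MAX_NODES][MAX_NODES] = {
	{ 10, 20 },
	{ 20, 10 },
};

/* Same system with one faked entry for an offline node. */
const int with_faked_node[MAX_NODES][MAX_NODES] = {
	{ 10, 20, 40 },
	{ 20, 10, 40 },
	{ 40, 40, 10 },
};
```

Calling count_unique_distances(2, two_nodes_online) and
count_unique_distances(3, with_faked_node) shows the difference in the
number of distances visible at boot.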

Also, when simultaneously running CPU online/offline with CPU
add/remove in a loop, we see a WARNING message.

WARNING: CPU: 13 PID: 1142 at kernel/sched/topology.c:898 build_sched_domains+0xd48/0x1720
Modules linked in: rpadlpar_io rpaphp mptcp_diag xsk_diag tcp_diag udp_diag
raw_diag inet_diag unix_diag af_packet_diag netlink_diag bonding tls
nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet
nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat
nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set rfkill nf_tables nfnetlink
pseries_rng xts vmx_crypto uio_pdrv_genirq uio binfmt_misc ip_tables xfs
libcrc32c dm_service_time sd_mod t10_pi sg ibmvfc scsi_transport_fc ibmveth
dm_multipath dm_mirror dm_region_hash dm_log dm_mod fuse
CPU: 13 PID: 1142 Comm: kworker/13:2 Not tainted 5.13.0-rc6+ #28
Workqueue: events cpuset_hotplug_workfn
NIP:  c0000000001caac8 LR: c0000000001caac4 CTR: 00000000007088ec
REGS: c00000005596f220 TRAP: 0700   Not tainted  (5.13.0-rc6+)
MSR:  8000000000029033 <SF,EE,ME,IR,DR,RI,LE>  CR: 48828222  XER: 00000009
CFAR: c0000000001ea698 IRQMASK: 0
GPR00: c0000000001caac4 c00000005596f4c0 c000000001c4a400 0000000000000036
GPR04: 00000000fffdffff c00000005596f1d0 0000000000000027 c0000018cfd07f90
GPR08: 0000000000000023 0000000000000001 0000000000000027 c0000018fe68ffe8
GPR12: 0000000000008000 c00000001e9d1880 c00000013a047200 0000000000000800
GPR16: c000000001d3c7d0 0000000000000240 0000000000000048 c000000010aacd18
GPR20: 0000000000000001 c000000010aacc18 c00000013a047c00 c000000139ec2400
GPR24: 0000000000000280 c000000139ec2520 c000000136c1b400 c000000001c93060
GPR28: c00000013a047c20 c000000001d3c6c0 c000000001c978a0 000000000000000d
NIP [c0000000001caac8] build_sched_domains+0xd48/0x1720
LR [c0000000001caac4] build_sched_domains+0xd44/0x1720
Call Trace:
[c00000005596f4c0] [c0000000001caac4] build_sched_domains+0xd44/0x1720 (unreliable)
[c00000005596f670] [c0000000001cc5ec] partition_sched_domains_locked+0x3ac/0x4b0
[c00000005596f710] [c0000000002804e4] rebuild_sched_domains_locked+0x404/0x9e0
[c00000005596f810] [c000000000283e60] rebuild_sched_domains+0x40/0x70
[c00000005596f840] [c000000000284124] cpuset_hotplug_workfn+0x294/0xf10
[c00000005596fc60] [c000000000175040] process_one_work+0x290/0x590
[c00000005596fd00] [c0000000001753c8] worker_thread+0x88/0x620
[c00000005596fda0] [c000000000181704] kthread+0x194/0x1a0
[c00000005596fe10] [c00000000000ccec] ret_from_kernel_thread+0x5c/0x70
Instruction dump:
485af049 60000000 2fa30800 409e0028 80fe0000 e89a00f8 e86100e8 38da0120
7f88e378 7ce53b78 4801fb91 60000000 <0fe00000> 39000000 38e00000 38c00000

This was because cpu_cpu_mask() was not updated on CPU online/offline;
it was only updated on CPU add/remove. Other cpumasks get updated on
both CPU online/offline and add/remove. Update cpu_cpu_mask() on CPU
online/offline too.
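
A rough user-space model of the intended behaviour, assuming the mask
behind cpu_cpu_mask() can be pictured as one bitmask of CPUs per node.
All names below (node_online_cpus, model_cpu_online, model_cpu_offline,
model_cpu_in_node_mask) are hypothetical and are not the kernel's API;
the sketch only shows the mask being kept in step with online/offline
events rather than with add/remove alone.

```c
#include <assert.h>
#include <stdint.h>

#define NR_NODES 2
#define NR_CPUS  16

/* Hypothetical stand-in for the per-node CPU mask. */
uint32_t node_online_cpus[NR_NODES];

/* Set the CPU's bit in its node's mask when the CPU comes online. */
void model_cpu_online(int cpu, int node)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	assert(node >= 0 && node < NR_NODES);
	node_online_cpus[node] |= UINT32_C(1) << cpu;
}

/* Clear the CPU's bit in its node's mask when the CPU goes offline. */
void model_cpu_offline(int cpu, int node)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	assert(node >= 0 && node < NR_NODES);
	node_online_cpus[node] &= ~(UINT32_C(1) << cpu);
}

/* Return 1 if the CPU is currently present in the node's mask. */
int model_cpu_in_node_mask(int cpu, int node)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	assert(node >= 0 && node < NR_NODES);
	return (int)((node_online_cpus[node] >> cpu) & 1);
}
```

In this model, a CPU that goes offline disappears from its node's mask
immediately, so a later rebuild of the scheduler domains does not see a
stale entry.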

Cc: linuxppc-dev@lists.ozlabs.org
Cc: Nathan Lynch <nathanl@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Geetika Moolchandani <Geetika.Moolchandani1@ibm.com>
Cc: Laurent Dufour <ldufour@linux.ibm.com>

Srikar Dronamraju (3):
  powerpc/numa: Print debug statements only when required
  powerpc/numa: Update cpu_cpu_map on CPU online/offline
  powerpc/numa: Fill distance_lookup_table for offline nodes

 arch/powerpc/include/asm/topology.h | 12 ++++
 arch/powerpc/kernel/smp.c           |  3 +
 arch/powerpc/mm/numa.c              | 88 +++++++++++++++++++++++++----
 3 files changed, 92 insertions(+), 11 deletions(-)

Comments

Peter Zijlstra Aug. 23, 2021, 8:33 a.m. UTC | #1
How did you want all this merged? I picked up Valentin's patch, do you
want me to pick up these PowerPC patches in the same tree, or do you
want to route them separately?

Srikar Dronamraju Aug. 23, 2021, 9:34 a.m. UTC | #2
* Peter Zijlstra <peterz@infradead.org> [2021-08-23 10:33:30]:

> How did you want all this merged? I picked up Valentin's patch, do you
> want me to pick up these PowerPC patches in the same tree, or do you
> want to route them separately?

While both (the patch you accepted and this series) together help solve the
problem, I think there is no hard dependency between the two. Hence I would
think it should be okay to go through the powerpc tree.

Peter Zijlstra Aug. 23, 2021, 9:37 a.m. UTC | #3
On Mon, Aug 23, 2021 at 03:04:37PM +0530, Srikar Dronamraju wrote:
> While both (the patch you accepted and this series) together help solve the
> problem, I think there is no hard dependency between the two. Hence I would
> think it should be okay to go through the powerpc tree.
> 

OK, works for me, thanks!