Message ID | 20200317092624.GB22538@in.ibm.com |
---|---|
State | Not Applicable |
Series | Slub: Increased mem consumption on cpu,mem-less node powerpc guest |
Context | Check | Description |
---|---|---|
snowpatch_ozlabs/apply_patch | success | Successfully applied on branch powerpc/merge (ab326587bb5fb91cc97df9b9f48e9e1469f04621) |
snowpatch_ozlabs/build-ppc64le | success | Build succeeded |
snowpatch_ozlabs/build-ppc64be | success | Build succeeded |
snowpatch_ozlabs/build-ppc64e | success | Build succeeded |
snowpatch_ozlabs/build-pmac32 | success | Build succeeded |
snowpatch_ozlabs/checkpatch | warning | total: 1 errors, 1 warnings, 0 checks, 11 lines checked |
snowpatch_ozlabs/needsstable | success | Patch has no Fixes tags |
On Tue, Mar 17, 2020 at 02:56:28PM +0530, Bharata B Rao wrote:
> Case 1: 2 node NUMA, node0 empty
> ================================
> # numactl -H
> available: 2 nodes (0-1)
> node 0 cpus:
> node 0 size: 0 MB
> node 0 free: 0 MB
> node 1 cpus: 0 1 2 3 4 5 6 7
> node 1 size: 16294 MB
> node 1 free: 15453 MB
> node distances:
> node   0   1
>   0:  10  40
>   1:  40  10
>
> diff --git a/mm/slub.c b/mm/slub.c
> index 17dc00e33115..888e4d245444 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -1971,10 +1971,8 @@ static void *get_partial(struct kmem_cache *s, gfp_t flags, int node,
>  	void *object;
>  	int searchnode = node;
>
> -	if (node == NUMA_NO_NODE)
> +	if (node == NUMA_NO_NODE || !node_present_pages(node))
>  		searchnode = numa_mem_id();
> -	else if (!node_present_pages(node))
> -		searchnode = node_to_mem_node(node);

For the above topology, I see this:

node_to_mem_node(1) = 1
node_to_mem_node(0) = 0
node_to_mem_node(NUMA_NO_NODE) = 0

Looks like the last two cases (returning memory-less node 0) are the
problem here?

Regards,
Bharata.
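(For readers who want to reproduce the numbers above: they can be dumped with a
throwaway initcall along these lines. This sketch is not part of the thread;
the helpers are existing kernel APIs, but the function name is made up.)

#include <linux/init.h>
#include <linux/mmzone.h>
#include <linux/nodemask.h>
#include <linux/printk.h>
#include <linux/topology.h>

/*
 * Illustrative only: print node_present_pages() and node_to_mem_node() for
 * every online node, to see which nodes are memoryless and where
 * node_to_mem_node() redirects them.
 */
static int __init dump_mem_node_map(void)
{
        int nid;

        for_each_online_node(nid)
                pr_info("node %d: present pages %lu, node_to_mem_node() = %d\n",
                        nid, node_present_pages(nid), node_to_mem_node(nid));
        return 0;
}
late_initcall(dump_mem_node_map);

On the topology above this would be expected to report 0 present pages for
node 0 together with node_to_mem_node(0) = 0, matching what Bharata sees.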
On 3/17/20 12:53 PM, Bharata B Rao wrote:
> On Tue, Mar 17, 2020 at 02:56:28PM +0530, Bharata B Rao wrote:
>> Case 1: 2 node NUMA, node0 empty
>> ================================
>> # numactl -H
>> available: 2 nodes (0-1)
>> node 0 cpus:
>> node 0 size: 0 MB
>> node 0 free: 0 MB
>> node 1 cpus: 0 1 2 3 4 5 6 7
>> node 1 size: 16294 MB
>> node 1 free: 15453 MB
>> node distances:
>> node   0   1
>>   0:  10  40
>>   1:  40  10
>>
>> diff --git a/mm/slub.c b/mm/slub.c
>> index 17dc00e33115..888e4d245444 100644
>> --- a/mm/slub.c
>> +++ b/mm/slub.c
>> @@ -1971,10 +1971,8 @@ static void *get_partial(struct kmem_cache *s, gfp_t flags, int node,
>>  	void *object;
>>  	int searchnode = node;
>>
>> -	if (node == NUMA_NO_NODE)
>> +	if (node == NUMA_NO_NODE || !node_present_pages(node))
>>  		searchnode = numa_mem_id();
>> -	else if (!node_present_pages(node))
>> -		searchnode = node_to_mem_node(node);
>
> For the above topology, I see this:
>
> node_to_mem_node(1) = 1
> node_to_mem_node(0) = 0
> node_to_mem_node(NUMA_NO_NODE) = 0
>
> Looks like the last two cases (returning memory-less node 0) are the
> problem here?

I wonder why you get a memory leak while Sachin in the same situation [1]
gets a crash? I don't understand anything anymore.

[1]
https://lore.kernel.org/linux-next/3381CD91-AB3D-4773-BA04-E7A072A63968@linux.vnet.ibm.com/

> Regards,
> Bharata.
* Vlastimil Babka <vbabka@suse.cz> [2020-03-17 16:56:04]:

>
> I wonder why you get a memory leak while Sachin in the same situation [1]
> gets a crash? I don't understand anything anymore.

Sachin was testing on linux-next which has Kirill's patch which modifies
slub to use kmalloc_node instead of kmalloc. While Bharata is testing on
upstream, which doesn't have this.

>
> [1]
> https://lore.kernel.org/linux-next/3381CD91-AB3D-4773-BA04-E7A072A63968@linux.vnet.ibm.com/
>
On 3/17/20 5:25 PM, Srikar Dronamraju wrote:
> * Vlastimil Babka <vbabka@suse.cz> [2020-03-17 16:56:04]:
>
>>
>> I wonder why you get a memory leak while Sachin in the same situation [1]
>> gets a crash? I don't understand anything anymore.
>
> Sachin was testing on linux-next which has Kirill's patch which modifies
> slub to use kmalloc_node instead of kmalloc. While Bharata is testing on
> upstream, which doesn't have this.

Yes, that Kirill's patch was about the memcg shrinker map allocation. But from
the patch hunk that Bharata posted as a "hack" that fixes the problem, it
follows that there has to be something else that calls kmalloc_node(node)
where node is one that doesn't have present pages.

He mentions alloc_fair_sched_group() which has:

        for_each_possible_cpu(i) {
                cfs_rq = kzalloc_node(sizeof(struct cfs_rq),
                                      GFP_KERNEL, cpu_to_node(i));
                ...
                se = kzalloc_node(sizeof(struct sched_entity),
                                  GFP_KERNEL, cpu_to_node(i));

I assume one of these structs is 1k and the other 512 bytes (rounded) and that
for some possible cpus cpu_to_node(i) will be 0, which has no present pages.
And as Bharata pasted, node_to_mem_node(0) = 0.
So this looks like the same scenario, but it doesn't crash? Is the node 0
actually online here, and/or does it have N_NORMAL_MEMORY state?

>>
>> [1]
>> https://lore.kernel.org/linux-next/3381CD91-AB3D-4773-BA04-E7A072A63968@linux.vnet.ibm.com/
>>
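(Not from the thread, but the question at the end can be answered mechanically
with the existing node-state helpers; a minimal sketch with a made-up function
name:)

#include <linux/init.h>
#include <linux/nodemask.h>
#include <linux/printk.h>

/* Report whether node 0 is online and whether it carries normal memory. */
static int __init report_node0_state(void)
{
        pr_info("node 0: online=%d N_NORMAL_MEMORY=%d N_MEMORY=%d\n",
                node_online(0),
                node_state(0, N_NORMAL_MEMORY),
                node_state(0, N_MEMORY));
        return 0;
}
late_initcall(report_node0_state);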
* Vlastimil Babka <vbabka@suse.cz> [2020-03-17 17:45:15]:

> On 3/17/20 5:25 PM, Srikar Dronamraju wrote:
> > * Vlastimil Babka <vbabka@suse.cz> [2020-03-17 16:56:04]:
> >
> >>
> >> I wonder why you get a memory leak while Sachin in the same situation [1]
> >> gets a crash? I don't understand anything anymore.
> >
> > Sachin was testing on linux-next which has Kirill's patch which modifies
> > slub to use kmalloc_node instead of kmalloc. While Bharata is testing on
> > upstream, which doesn't have this.
>
> Yes, that Kirill's patch was about the memcg shrinker map allocation. But from
> the patch hunk that Bharata posted as a "hack" that fixes the problem, it
> follows that there has to be something else that calls kmalloc_node(node)
> where node is one that doesn't have present pages.
>
> He mentions alloc_fair_sched_group() which has:
>
>         for_each_possible_cpu(i) {
>                 cfs_rq = kzalloc_node(sizeof(struct cfs_rq),
>                                       GFP_KERNEL, cpu_to_node(i));
>                 ...
>                 se = kzalloc_node(sizeof(struct sched_entity),
>                                   GFP_KERNEL, cpu_to_node(i));
>

Sachin's experiment:
Upstream-next / memcg /
possible nodes were 0-31
online nodes were 0-1
kmalloc_node called for_each_node / for_each_possible_node.
This would crash while allocating slab from !N_ONLINE nodes.

Bharata's experiment:
Upstream
possible nodes were 0-1
online nodes were 0-1
kmalloc_node called for_each_online_node / for_each_possible_cpu,
i.e. kmalloc is called for N_ONLINE nodes.
So it wouldn't crash.

Even if his possible nodes were 0-256, I don't think we have kmalloc_node
being called in !N_ONLINE nodes. Hence it's not crashing.
If we see the above code that you quote, kzalloc_node is using cpu_to_node
which in Bharata's case will always return 1.

> I assume one of these structs is 1k and the other 512 bytes (rounded) and that
> for some possible cpus cpu_to_node(i) will be 0, which has no present pages.
> And as Bharata pasted, node_to_mem_node(0) = 0.
> So this looks like the same scenario, but it doesn't crash? Is the node 0
> actually online here, and/or does it have N_NORMAL_MEMORY state?

I still don't have any clue on the leak though.
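(An aside, not taken from either caller: the difference Srikar describes is
essentially which iterator picks the node id passed to kzalloc_node(). A
minimal sketch, with made-up function names and error handling omitted:)

#include <linux/cpumask.h>
#include <linux/nodemask.h>
#include <linux/slab.h>
#include <linux/topology.h>

/*
 * Pattern in Sachin's case: iterate possible nodes, so kzalloc_node() can be
 * asked for a node that is not even online (e.g. nodes 2-31 above).
 */
static void alloc_per_node_example(void *table[], size_t size)
{
        int nid;

        for_each_node(nid)
                table[nid] = kzalloc_node(size, GFP_KERNEL, nid);
}

/*
 * Pattern in Bharata's case: iterate possible CPUs, so the node id always
 * comes from cpu_to_node(), which here only returns online nodes.
 */
static void alloc_per_cpu_example(void *table[], size_t size)
{
        int cpu;

        for_each_possible_cpu(cpu)
                table[cpu] = kzalloc_node(size, GFP_KERNEL, cpu_to_node(cpu));
}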
On Wed, Mar 18, 2020 at 08:50:44AM +0530, Srikar Dronamraju wrote:
> * Vlastimil Babka <vbabka@suse.cz> [2020-03-17 17:45:15]:
>
> > On 3/17/20 5:25 PM, Srikar Dronamraju wrote:
> > > * Vlastimil Babka <vbabka@suse.cz> [2020-03-17 16:56:04]:
> > >
> > >>
> > >> I wonder why you get a memory leak while Sachin in the same situation [1]
> > >> gets a crash? I don't understand anything anymore.
> > >
> > > Sachin was testing on linux-next which has Kirill's patch which modifies
> > > slub to use kmalloc_node instead of kmalloc. While Bharata is testing on
> > > upstream, which doesn't have this.
> >
> > Yes, that Kirill's patch was about the memcg shrinker map allocation. But from
> > the patch hunk that Bharata posted as a "hack" that fixes the problem, it
> > follows that there has to be something else that calls kmalloc_node(node)
> > where node is one that doesn't have present pages.
> >
> > He mentions alloc_fair_sched_group() which has:
> >
> >         for_each_possible_cpu(i) {
> >                 cfs_rq = kzalloc_node(sizeof(struct cfs_rq),
> >                                       GFP_KERNEL, cpu_to_node(i));
> >                 ...
> >                 se = kzalloc_node(sizeof(struct sched_entity),
> >                                   GFP_KERNEL, cpu_to_node(i));
> >
>
> Sachin's experiment:
> Upstream-next / memcg /
> possible nodes were 0-31
> online nodes were 0-1
> kmalloc_node called for_each_node / for_each_possible_node.
> This would crash while allocating slab from !N_ONLINE nodes.
>
> Bharata's experiment:
> Upstream
> possible nodes were 0-1
> online nodes were 0-1
> kmalloc_node called for_each_online_node / for_each_possible_cpu,
> i.e. kmalloc is called for N_ONLINE nodes.
> So it wouldn't crash.
>
> Even if his possible nodes were 0-256, I don't think we have kmalloc_node
> being called in !N_ONLINE nodes. Hence it's not crashing.
> If we see the above code that you quote, kzalloc_node is using cpu_to_node
> which in Bharata's case will always return 1.
>
> > I assume one of these structs is 1k and the other 512 bytes (rounded) and that
> > for some possible cpus cpu_to_node(i) will be 0, which has no present pages.
> > And as Bharata pasted, node_to_mem_node(0) = 0.

Correct, these two kzalloc_node() calls for all possible cpus are causing
increased slab memory consumption in my case.

> > So this looks like the same scenario, but it doesn't crash? Is the node 0
> > actually online here, and/or does it have N_NORMAL_MEMORY state?
>

Node 0 is online, but N_NORMAL_MEMORY state is empty. In fact the memory leak
goes away if I insert the below check/assignment in the slab alloc code path:

+	if (!node_isset(node, node_states[N_NORMAL_MEMORY]))
+		node = NUMA_NO_NODE;

Regards,
Bharata.
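(The exact spot where Bharata inserted his check is not shown; as a sketch of
the idea, it could be expressed as a small helper applied early in the slab
allocation path, e.g. at the top of get_partial() or ___slab_alloc(). The
helper name and the NUMA_NO_NODE guard are additions for this sketch, so the
node_states[] lookup is never done with -1:)

#include <linux/nodemask.h>
#include <linux/numa.h>

/*
 * Fall back to "no node preference" whenever the requested node has no
 * normal memory, so the allocator never insists on a memoryless node.
 */
static inline int memless_node_to_any_node(int node)
{
        if (node != NUMA_NO_NODE &&
            !node_isset(node, node_states[N_NORMAL_MEMORY]))
                return NUMA_NO_NODE;
        return node;
}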
On 3/18/20 4:20 AM, Srikar Dronamraju wrote:
> * Vlastimil Babka <vbabka@suse.cz> [2020-03-17 17:45:15]:
>>
>> Yes, that Kirill's patch was about the memcg shrinker map allocation. But from
>> the patch hunk that Bharata posted as a "hack" that fixes the problem, it
>> follows that there has to be something else that calls kmalloc_node(node)
>> where node is one that doesn't have present pages.
>>
>> He mentions alloc_fair_sched_group() which has:
>>
>>         for_each_possible_cpu(i) {
>>                 cfs_rq = kzalloc_node(sizeof(struct cfs_rq),
>>                                       GFP_KERNEL, cpu_to_node(i));
>>                 ...
>>                 se = kzalloc_node(sizeof(struct sched_entity),
>>                                   GFP_KERNEL, cpu_to_node(i));
>>
>
> Sachin's experiment:
> Upstream-next / memcg /
> possible nodes were 0-31
> online nodes were 0-1
> kmalloc_node called for_each_node / for_each_possible_node.
> This would crash while allocating slab from !N_ONLINE nodes.

So you're saying the crash was actually for allocation on e.g. node 2, not
node 0? But I believe it was on node 0, because init_kmem_cache_nodes() will
only allocate kmem_cache_node on nodes with N_NORMAL_MEMORY (which doesn't
include 0), and slab_mem_going_online_callback() was probably not called for
node 0 (it was not dynamically onlined). Also if node 0 was fine,
node_to_mem_node(2-31) (not initialized explicitly) would have returned 0 and
thus not crashed as well.

> Bharata's experiment:
> Upstream
> possible nodes were 0-1
> online nodes were 0-1
> kmalloc_node called for_each_online_node / for_each_possible_cpu,
> i.e. kmalloc is called for N_ONLINE nodes.
> So it wouldn't crash.
>
> Even if his possible nodes were 0-256, I don't think we have kmalloc_node
> being called in !N_ONLINE nodes. Hence it's not crashing.
> If we see the above code that you quote, kzalloc_node is using cpu_to_node
> which in Bharata's case will always return 1.

Are you sure that for each cpu from for_each_possible_cpu(), cpu_to_node()
will be 1? Are all of them properly initialized or is there a similar issue
as with node_to_mem_node(), that some were not initialized and thus
cpu_to_node() will return 0? Because AFAICS, if kzalloc_node() was always
called with 1, then node_present_pages(1) is true, and the "hack" that
Bharata reports to work in his original mail would make no functional
difference.

>
>> I assume one of these structs is 1k and the other 512 bytes (rounded) and that
>> for some possible cpus cpu_to_node(i) will be 0, which has no present pages.
>> And as Bharata pasted, node_to_mem_node(0) = 0.
>> So this looks like the same scenario, but it doesn't crash? Is the node 0
>> actually online here, and/or does it have N_NORMAL_MEMORY state?
>
> I still don't have any clue on the leak though.

Let's assume that kzalloc_node() was called with 0 for some of the possible
CPUs. I still wonder why it won't crash, but let's assume kmem_cache_node
does exist for node 0 here.
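(For reference, the init_kmem_cache_nodes() behaviour referred to above looks
roughly like this in mm/slub.c of that era; the listing is paraphrased and
abridged, shown only for the N_NORMAL_MEMORY iteration. Since only nodes with
normal memory get a struct kmem_cache_node, get_node(s, 0) stays NULL for a
memoryless node 0 that was never onlined with memory, which is why an actual
slab allocation attempted on such a node can crash.)

static int init_kmem_cache_nodes(struct kmem_cache *s)
{
        int node;

        /* Only nodes with normal memory get a struct kmem_cache_node. */
        for_each_node_state(node, N_NORMAL_MEMORY) {
                struct kmem_cache_node *n;

                if (slab_state == DOWN) {
                        early_kmem_cache_node_alloc(node);
                        continue;
                }
                n = kmem_cache_alloc_node(kmem_cache_node, GFP_KERNEL, node);
                if (!n) {
                        free_kmem_cache_nodes(s);
                        return 0;
                }

                init_kmem_cache_node(n);
                s->node[node] = n;      /* s->node[] stays NULL elsewhere */
        }
        return 1;
}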
So the execution AFAICS goes like this:

slab_alloc_node(0)
    c = raw_cpu_ptr(s->cpu_slab);
    object = c->freelist;
    page = c->page;
    if (unlikely(!object || !node_match(page, node))) {
        // whatever we have in the per-cpu cache must be from node 1
        // because node 0 has no memory, so there's no node_match and thus
        __slab_alloc(node == 0)
          ___slab_alloc(node == 0)
            page = c->page;
          redo:
            if (unlikely(!node_match(page, node))) {   // still no match
                int searchnode = node;

                if (node != NUMA_NO_NODE && !node_present_pages(node))
                    // true && true for node 0
                    searchnode = node_to_mem_node(node);
                    // searchnode is 0, not 1

                if (unlikely(!node_match(page, searchnode))) {
                    // page is still from node 1, searchnode is 0, no match
                    stat(s, ALLOC_NODE_MISMATCH);
                    deactivate_slab(s, page, c->freelist, c);
                    // we removed the slab from the cpu's cache
                    goto new_slab;
                }

          new_slab:
            if (slub_percpu_partial(c)) {
                page = c->page = slub_percpu_partial(c);
                slub_set_percpu_partial(c, page);
                stat(s, CPU_PARTIAL_ALLOC);
                goto redo;
                // huh, so with CONFIG_SLUB_CPU_PARTIAL
                // this can become an infinite loop actually?
            }
            // Bharata's slub stats don't include cpu_partial_alloc so I assume
            // CONFIG_SLUB_CPU_PARTIAL is not enabled and we don't loop

            freelist = new_slab_objects(s, gfpflags, node, &c);
              // inside new_slab_objects() -> get_partial():
                if (node == NUMA_NO_NODE)              // false, it's 0
                else if (!node_present_pages(node))    // true for 0
                    searchnode = node_to_mem_node(node);  // still 0

                object = get_partial_node(s, get_node(s, searchnode), ...);
                // object is NULL as node 0 has nothing
                // but we have node == 0 so we return the NULL
                if (object || node != NUMA_NO_NODE)
                    return object;
                // and we don't fall back to get_any_partial() which would
                // have found e.g. the slab we deactivated earlier
                return get_any_partial(s, flags, c);

              // back in new_slab_objects(), since get_partial() returned NULL:
              page = new_slab(s, flags, node);
              // we attempt to allocate a new slab on node 0, but it will come
              // from node 1

So that explains the leak I think. We keep throwing away slabs from node 1
only to allocate new ones on node 1. Effectively each cfs_rq object and each
sched_entity object will get a new (high-order?) page for a possible cpu
where cpu_to_node() is 0.
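(An illustrative reproducer, not from the thread: per the trace above, every
kzalloc_node() call that insists on a memoryless node can deactivate the
current per-cpu slab and pull in a fresh page, so a module like the sketch
below, with made-up names and sizes, would be expected to inflate Slab /
SUnreclaim in /proc/meminfo far beyond the ~2 MB of objects it actually
allocates on an affected kernel.)

#include <linux/init.h>
#include <linux/module.h>
#include <linux/slab.h>

#define REPRO_NR_OBJECTS  4096
#define REPRO_OBJECT_SIZE 512
#define REPRO_NODE        0     /* assumed memoryless on this topology */

static void *repro_objs[REPRO_NR_OBJECTS];

static int __init slub_memless_repro_init(void)
{
        int i;

        for (i = 0; i < REPRO_NR_OBJECTS; i++) {
                /* Each call targets the memoryless node explicitly. */
                repro_objs[i] = kzalloc_node(REPRO_OBJECT_SIZE, GFP_KERNEL,
                                             REPRO_NODE);
                if (!repro_objs[i])
                        break;
        }
        pr_info("slub_memless_repro: allocated %d objects of %d bytes on node %d\n",
                i, REPRO_OBJECT_SIZE, REPRO_NODE);
        return 0;
}

static void __exit slub_memless_repro_exit(void)
{
        int i;

        for (i = 0; i < REPRO_NR_OBJECTS; i++)
                kfree(repro_objs[i]);
}

module_init(slub_memless_repro_init);
module_exit(slub_memless_repro_exit);
MODULE_LICENSE("GPL");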
diff --git a/mm/slub.c b/mm/slub.c
index 17dc00e33115..888e4d245444 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1971,10 +1971,8 @@ static void *get_partial(struct kmem_cache *s, gfp_t flags, int node,
 	void *object;
 	int searchnode = node;
 
-	if (node == NUMA_NO_NODE)
+	if (node == NUMA_NO_NODE || !node_present_pages(node))
 		searchnode = numa_mem_id();
-	else if (!node_present_pages(node))
-		searchnode = node_to_mem_node(node);
 
 	object = get_partial_node(s, get_node(s, searchnode), c, flags);
 	if (object || node != NUMA_NO_NODE)
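(For reference, with the hunk above applied get_partial() reads roughly as
follows; the body is reconstructed from the diff, and the two closing lines
beyond the shown context follow the upstream function. The effect is that a
request for a node without present pages is redirected, like NUMA_NO_NODE, to
numa_mem_id(), the nearest node with memory for the current CPU, so the
partial slabs that actually exist there can be reused instead of a fresh slab
being allocated on every node mismatch.)

static void *get_partial(struct kmem_cache *s, gfp_t flags, int node,
                         struct kmem_cache_cpu *c)
{
        void *object;
        int searchnode = node;

        /*
         * NUMA_NO_NODE and nodes without present pages are both redirected
         * to the node the current CPU gets its memory from.
         */
        if (node == NUMA_NO_NODE || !node_present_pages(node))
                searchnode = numa_mem_id();

        object = get_partial_node(s, get_node(s, searchnode), c, flags);
        if (object || node != NUMA_NO_NODE)
                return object;

        return get_any_partial(s, flags, c);
}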