
Slub: Increased mem consumption on cpu,mem-less node powerpc guest

Message ID 20200317092624.GB22538@in.ibm.com (mailing list archive)
State Not Applicable

Checks

Context                           Check    Description
snowpatch_ozlabs/apply_patch      success  Successfully applied on branch powerpc/merge (ab326587bb5fb91cc97df9b9f48e9e1469f04621)
snowpatch_ozlabs/build-ppc64le    success  Build succeeded
snowpatch_ozlabs/build-ppc64be    success  Build succeeded
snowpatch_ozlabs/build-ppc64e     success  Build succeeded
snowpatch_ozlabs/build-pmac32     success  Build succeeded
snowpatch_ozlabs/checkpatch       warning  total: 1 errors, 1 warnings, 0 checks, 11 lines checked
snowpatch_ozlabs/needsstable      success  Patch has no Fixes tags

Commit Message

Bharata B Rao March 17, 2020, 9:26 a.m. UTC
Hi,

We are seeing increased slab memory consumption on a PowerPC guest
LPAR (on PowerVM) with an uncommon topology where one NUMA node has neither
CPUs nor memory and the other node has all the CPUs and memory. Though
QEMU prevents such topologies for KVM guests, I hacked QEMU to allow such a
topology to get some slab numbers. Here is the comparison of such
a KVM guest with a single-node KVM guest with the same number of CPUs and
the same amount of memory.

Case 1: 2 node NUMA, node0 empty
================================
# numactl -H
available: 2 nodes (0-1)
node 0 cpus:
node 0 size: 0 MB
node 0 free: 0 MB
node 1 cpus: 0 1 2 3 4 5 6 7
node 1 size: 16294 MB
node 1 free: 15453 MB
node distances:
node   0   1 
  0:  10  40 
  1:  40  10 

Case 2: Single node
===================
# numactl -H
available: 1 nodes (0)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 16294 MB
node 0 free: 15675 MB
node distances:
node   0 
  0:  10 

Here is how the total slab memory consumption compares right after boot:
# grep -i slab /proc/meminfo

Case 1: 442560 kB
Case 2: 195904 kB

A closer look at the individual slabs suggests that most of the increased
slab consumption in Case 1 can be attributed to kmalloc-N slabs. In
particular, the following two caches account for most of the increase.

Case 1:
# ./slabinfo -S | grep -e kmalloc-1k -e kmalloc-512 
kmalloc-1k                2869    1024          101.5M    1549/1540/0   32 0  99   2 U
kmalloc-512               3302     512          100.2M    1530/1522/0   64 0  99   1 U

Case 2:
# ./slabinfo -S | grep -e kmalloc-1k -e kmalloc-512 
kmalloc-1k                2811    1024            6.1M        94/29/0   32 0  30  46 U
kmalloc-512               3207     512            3.5M        54/13/0   64 0  24  46 U
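
As a rough cross-check of the slabinfo figures above, the arithmetic can be
sketched in user-space C. This is not kernel code. It assumes that the column
after the slab counts (32 for kmalloc-1k) is objects per slab and that each
slab is a single 64 KiB page; both are inferences from the numbers, not
something stated in the report. The helper names are hypothetical.

```c
#include <assert.h>

/* Hypothetical helper: bytes of backing memory for `slabs` slabs. */
static unsigned long slab_footprint_bytes(unsigned long slabs,
					  unsigned long slab_bytes)
{
	return slabs * slab_bytes;
}

/* Hypothetical helper: percentage of object slots in use, rounded down. */
static unsigned long slab_utilization_pct(unsigned long objects,
					  unsigned long slabs,
					  unsigned long objs_per_slab)
{
	unsigned long capacity = slabs * objs_per_slab;

	assert(capacity > 0);
	return objects * 100 / capacity;
}
```

Under those assumptions, kmalloc-1k in Case 1 occupies 1549 * 65536 =
101515264 bytes (the 101.5M shown) with about 5% of its slots in use, while
Case 2 occupies 94 * 65536 = 6160384 bytes with about 93% in use.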

Here is the list of slub stats that differ significantly between the two cases:

Case 1:
------
alloc_from_partial 6333 C0=1506 C1=525 C2=774 C3=478 C4=413 C5=1036 C6=698 C7=903
alloc_slab 3350 C0=757 C1=336 C2=120 C3=72 C4=120 C5=912 C6=600 C7=433
alloc_slowpath 9792 C0=2329 C1=861 C2=916 C3=571 C4=533 C5=1948 C6=1298 C7=1336
cmpxchg_double_fail 31 C1=3 C2=2 C3=7 C4=3 C5=4 C6=2 C7=10
deactivate_full 38 C0=14 C1=2 C2=13 C5=3 C6=2 C7=4
deactivate_remote_frees 1 C7=1
deactivate_to_head 10092 C0=2654 C1=859 C2=903 C3=571 C4=533 C5=1945 C6=1296 C7=1331
deactivate_to_tail 1 C7=1
free_add_partial 29 C0=7 C2=1 C3=5 C4=3 C5=6 C6=2 C7=5
free_frozen 32 C0=4 C1=3 C2=4 C3=3 C4=7 C5=3 C6=7 C7=1
free_remove_partial 1799 C0=408 C1=47 C2=54 C4=231 C5=752 C6=110 C7=197
free_slab 1799 C0=408 C1=47 C2=54 C4=231 C5=752 C6=110 C7=197
free_slowpath 7415 C0=2014 C1=486 C2=433 C3=525 C4=814 C5=1707 C6=586 C7=850
objects 2875 N1=2875
objects_partial 2587 N1=2587
partial 1542 N1=1542
slabs 1551 N1=1551
total_objects 49632 N1=49632

# cat alloc_calls (truncated)
   1952 alloc_fair_sched_group+0x114/0x240 age=147813/152837/153714 pid=1-1074 cpus=0-2,5-7 nodes=1

# cat free_calls (truncated) 
   2671 <not-available> age=4295094831 pid=0 cpus=0 nodes=1
      2 free_fair_sched_group+0xa0/0x120 age=156576/156850/157125 pid=0 cpus=0,5 nodes=1

Case 2:
------
alloc_from_partial 9231 C0=435 C1=2349 C2=2386 C3=1807 C4=882 C5=367 C6=559 C7=446
alloc_slab 114 C0=12 C1=41 C2=28 C3=15 C4=9 C5=1 C6=1 C7=7
alloc_slowpath 9415 C0=448 C1=2390 C2=2414 C3=1891 C4=891 C5=368 C6=560 C7=453
cmpxchg_double_fail 22 C0=1 C1=1 C3=3 C4=8 C5=1 C6=5 C7=3
deactivate_full 512 C0=13 C1=143 C2=147 C3=147 C4=22 C5=10 C6=6 C7=24
deactivate_remote_frees 1 C4=1
deactivate_to_head 9099 C0=437 C1=2247 C2=2267 C3=1937 C4=870 C5=358 C6=554 C7=429
deactivate_to_tail 1 C4=1
free_add_partial 447 C0=21 C1=140 C2=164 C3=60 C4=22 C5=16 C6=14 C7=10
free_frozen 22 C0=3 C2=3 C3=2 C4=1 C5=6 C6=6 C7=1
free_remove_partial 20 C1=5 C2=5 C4=3 C6=7
free_slab 20 C1=5 C2=5 C4=3 C6=7
free_slowpath 6953 C0=194 C1=2123 C2=1729 C3=850 C4=466 C5=725 C6=520 C7=346
objects 2812 N0=2812
objects_partial 733 N0=733
partial 29 N0=29
slabs 94 N0=94
total_objects 3008 N0=3008

# cat alloc_calls (truncated)
   1952 alloc_fair_sched_group+0x114/0x240 age=43957/46225/46802 pid=1-1059 cpus=1-5,7

# cat free_calls (truncated) 
   1516 <not-available> age=4294987281 pid=0 cpus=0
    647 free_fair_sched_group+0xa0/0x120 age=48798/49142/49628 pid=0-954 cpus=1-2

We see a significant difference in the number of partial slabs and
the resulting total_objects between the two cases. I was trying to
see if this has anything to do with the way the node value is
arrived at in different slub routines. I haven't yet understood the slub
code well enough to say anything conclusively, but the following hack in
the slub code eliminates the increased slab consumption for Case 1 and
makes it very similar to Case 2.


Regards,
Bharata.

Comments

Bharata B Rao March 17, 2020, 11:53 a.m. UTC | #1
On Tue, Mar 17, 2020 at 02:56:28PM +0530, Bharata B Rao wrote:
> Case 1: 2 node NUMA, node0 empty
> ================================
> # numactl -H
> available: 2 nodes (0-1)
> node 0 cpus:
> node 0 size: 0 MB
> node 0 free: 0 MB
> node 1 cpus: 0 1 2 3 4 5 6 7
> node 1 size: 16294 MB
> node 1 free: 15453 MB
> node distances:
> node   0   1 
>   0:  10  40 
>   1:  40  10 
> 
> diff --git a/mm/slub.c b/mm/slub.c
> index 17dc00e33115..888e4d245444 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -1971,10 +1971,8 @@ static void *get_partial(struct kmem_cache *s, gfp_t flags, int node,
>  	void *object;
>  	int searchnode = node;
>  
> -	if (node == NUMA_NO_NODE)
> +	if (node == NUMA_NO_NODE || !node_present_pages(node))
>  		searchnode = numa_mem_id();
> -	else if (!node_present_pages(node))
> -		searchnode = node_to_mem_node(node);

For the above topology, I see this:

node_to_mem_node(1) = 1
node_to_mem_node(0) = 0
node_to_mem_node(NUMA_NO_NODE) = 0

Looks like the last two cases (returning memory-less node 0) are the
problem here?

Regards,
Bharata.
Vlastimil Babka March 17, 2020, 3:56 p.m. UTC | #2
On 3/17/20 12:53 PM, Bharata B Rao wrote:
> On Tue, Mar 17, 2020 at 02:56:28PM +0530, Bharata B Rao wrote:
>> Case 1: 2 node NUMA, node0 empty
>> ================================
>> # numactl -H
>> available: 2 nodes (0-1)
>> node 0 cpus:
>> node 0 size: 0 MB
>> node 0 free: 0 MB
>> node 1 cpus: 0 1 2 3 4 5 6 7
>> node 1 size: 16294 MB
>> node 1 free: 15453 MB
>> node distances:
>> node   0   1 
>>   0:  10  40 
>>   1:  40  10 
>> 
>> diff --git a/mm/slub.c b/mm/slub.c
>> index 17dc00e33115..888e4d245444 100644
>> --- a/mm/slub.c
>> +++ b/mm/slub.c
>> @@ -1971,10 +1971,8 @@ static void *get_partial(struct kmem_cache *s, gfp_t flags, int node,
>>  	void *object;
>>  	int searchnode = node;
>>  
>> -	if (node == NUMA_NO_NODE)
>> +	if (node == NUMA_NO_NODE || !node_present_pages(node))
>>  		searchnode = numa_mem_id();
>> -	else if (!node_present_pages(node))
>> -		searchnode = node_to_mem_node(node);
> 
> For the above topology, I see this:
> 
> node_to_mem_node(1) = 1
> node_to_mem_node(0) = 0
> node_to_mem_node(NUMA_NO_NODE) = 0
> 
> Looks like the last two cases (returning memory-less node 0) is the
> problem here?

I wonder why you get a memory leak while Sachin in the same situation [1]
gets a crash? I don't understand anything anymore.

[1]
https://lore.kernel.org/linux-next/3381CD91-AB3D-4773-BA04-E7A072A63968@linux.vnet.ibm.com/

> Regards,
> Bharata.
> 
>
Srikar Dronamraju March 17, 2020, 4:25 p.m. UTC | #3
* Vlastimil Babka <vbabka@suse.cz> [2020-03-17 16:56:04]:

> 
> I wonder why do you get a memory leak while Sachin in the same situation [1]
> gets a crash? I don't understand anything anymore.

Sachin was testing on linux-next, which has Kirill's patch that modifies
slub to use kmalloc_node instead of kmalloc, while Bharata is testing on
upstream, which doesn't have it.

> 
> [1]
> https://lore.kernel.org/linux-next/3381CD91-AB3D-4773-BA04-E7A072A63968@linux.vnet.ibm.com/
>
Vlastimil Babka March 17, 2020, 4:45 p.m. UTC | #4
On 3/17/20 5:25 PM, Srikar Dronamraju wrote:
> * Vlastimil Babka <vbabka@suse.cz> [2020-03-17 16:56:04]:
> 
>> 
>> I wonder why do you get a memory leak while Sachin in the same situation [1]
>> gets a crash? I don't understand anything anymore.
> 
> Sachin was testing on linux-next which has Kirill's patch which modifies
> slub to use kmalloc_node instead of kmalloc. While Bharata is testing on
> upstream, which doesn't have this. 

Yes, Kirill's patch was about the memcg shrinker map allocation. But from the
patch hunk that Bharata posted as a "hack" that fixes the problem, it follows
that there has to be something else that calls kmalloc_node(node) where node is
one that doesn't have present pages.

He mentions alloc_fair_sched_group() which has:

        for_each_possible_cpu(i) {
                cfs_rq = kzalloc_node(sizeof(struct cfs_rq),
                                      GFP_KERNEL, cpu_to_node(i));
...
                se = kzalloc_node(sizeof(struct sched_entity),
                                  GFP_KERNEL, cpu_to_node(i));

I assume one of these structs is 1k and the other 512 bytes (rounded) and that
for some possible CPUs cpu_to_node(i) will be 0, which has no present pages.
And as Bharata pasted, node_to_mem_node(0) = 0.
So this looks like the same scenario, but it doesn't crash? Is node 0
actually online here, and/or does it have N_NORMAL_MEMORY state?
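
To make the assumption above concrete, here is a small user-space sketch (not
kernel code) that counts how many per-CPU allocations would name a node with
no present pages. The function name and both input tables are hypothetical
stand-ins for cpu_to_node() and node_present_pages(); which possible CPUs
actually map to node 0 on Bharata's guest is not established at this point.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Count the per-CPU allocation requests that would target a memoryless node.
 * cpu_node[i] stands in for cpu_to_node(i), present_pages[n] stands in for
 * node_present_pages(n). Both tables are made-up inputs for illustration.
 */
static size_t count_memoryless_requests(const int *cpu_node, size_t nr_cpus,
					const unsigned long *present_pages,
					size_t nr_nodes)
{
	size_t i, hits = 0;

	for (i = 0; i < nr_cpus; i++) {
		int node = cpu_node[i];

		assert(node >= 0 && (size_t)node < nr_nodes);
		if (present_pages[node] == 0)
			hits++;	/* this request names a node without memory */
	}
	return hits;
}
```

For example, with eight CPUs on node 1 and two made-up possible CPUs left at
node 0, the function reports 2 requests aimed at the memoryless node.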

>> 
>> [1]
>> https://lore.kernel.org/linux-next/3381CD91-AB3D-4773-BA04-E7A072A63968@linux.vnet.ibm.com/
>> 
>
Srikar Dronamraju March 18, 2020, 3:20 a.m. UTC | #5
* Vlastimil Babka <vbabka@suse.cz> [2020-03-17 17:45:15]:

> On 3/17/20 5:25 PM, Srikar Dronamraju wrote:
> > * Vlastimil Babka <vbabka@suse.cz> [2020-03-17 16:56:04]:
> > 
> >> 
> >> I wonder why do you get a memory leak while Sachin in the same situation [1]
> >> gets a crash? I don't understand anything anymore.
> > 
> > Sachin was testing on linux-next which has Kirill's patch which modifies
> > slub to use kmalloc_node instead of kmalloc. While Bharata is testing on
> > upstream, which doesn't have this. 
> 
> Yes, that Kirill's patch was about the memcg shrinker map allocation. But the
> patch hunk that Bharata posted as a "hack" that fixes the problem, it follows
> that there has to be something else that calls kmalloc_node(node) where node is
> one that doesn't have present pages.
> 
> He mentions alloc_fair_sched_group() which has:
> 
>         for_each_possible_cpu(i) {
>                 cfs_rq = kzalloc_node(sizeof(struct cfs_rq),
>                                       GFP_KERNEL, cpu_to_node(i));
> ...
>                 se = kzalloc_node(sizeof(struct sched_entity),
>                                   GFP_KERNEL, cpu_to_node(i));
> 


Sachin's experiment.
Upstream-next/ memcg /
possible nodes were 0-31
online nodes were 0-1
kmalloc_node called for_each_node / for_each_possible_node.
This would crash while allocating slab from !N_ONLINE nodes.

Bharata's experiment.
Upstream
possible nodes were 0-1
online nodes were 0-1
kmalloc_node called for_each_online_node/ for_each_possible_cpu
i.e. kmalloc is called for N_ONLINE nodes.
So it wouldn't crash.

Even if his possible nodes were 0-256, I don't think we have kmalloc_node
being called on !N_ONLINE nodes. Hence it's not crashing.
If we look at the code that you quote above, kzalloc_node is using cpu_to_node,
which in Bharata's case will always return 1.


> I assume one of these structs is 1k and other 512 bytes (rounded) and that for
> some possible cpu's cpu_to_node(i) will be 0, which has no present pages. And as
> Bharata pasted, node_to_mem_node(0) = 0
> So this looks like the same scenario, but it doesn't crash? Is the node 0
> actually online here, and/or does it have N_NORMAL_MEMORY state?

I still don't have any clue about the leak though.
Bharata B Rao March 18, 2020, 4:46 a.m. UTC | #6
On Wed, Mar 18, 2020 at 08:50:44AM +0530, Srikar Dronamraju wrote:
> * Vlastimil Babka <vbabka@suse.cz> [2020-03-17 17:45:15]:
> 
> > On 3/17/20 5:25 PM, Srikar Dronamraju wrote:
> > > * Vlastimil Babka <vbabka@suse.cz> [2020-03-17 16:56:04]:
> > > 
> > >> 
> > >> I wonder why do you get a memory leak while Sachin in the same situation [1]
> > >> gets a crash? I don't understand anything anymore.
> > > 
> > > Sachin was testing on linux-next which has Kirill's patch which modifies
> > > slub to use kmalloc_node instead of kmalloc. While Bharata is testing on
> > > upstream, which doesn't have this. 
> > 
> > Yes, that Kirill's patch was about the memcg shrinker map allocation. But the
> > patch hunk that Bharata posted as a "hack" that fixes the problem, it follows
> > that there has to be something else that calls kmalloc_node(node) where node is
> > one that doesn't have present pages.
> > 
> > He mentions alloc_fair_sched_group() which has:
> > 
> >         for_each_possible_cpu(i) {
> >                 cfs_rq = kzalloc_node(sizeof(struct cfs_rq),
> >                                       GFP_KERNEL, cpu_to_node(i));
> > ...
> >                 se = kzalloc_node(sizeof(struct sched_entity),
> >                                   GFP_KERNEL, cpu_to_node(i));
> > 
> 
> 
> Sachin's experiment.
> Upstream-next/ memcg /
> possible nodes were 0-31
> online nodes were 0-1
> kmalloc_node called for_each_node / for_each_possible_node.
> This would crash while allocating slab from !N_ONLINE nodes.
> 
> Bharata's experiment.
> Upstream
> possible nodes were 0-1
> online nodes were 0-1
> kmalloc_node called for_each_online_node/ for_each_possible_cpu
> i.e kmalloc is called for N_ONLINE nodes.
> So wouldn't crash
> 
> Even if his possible nodes were 0-256. I don't think we have kmalloc_node
> being called in !N_ONLINE nodes. Hence its not crashing.
> If we see the above code that you quote, kzalloc_node is using cpu_to_node
> which in Bharata's case will always return 1.
> 
> 
> > I assume one of these structs is 1k and other 512 bytes (rounded) and that for
> > some possible cpu's cpu_to_node(i) will be 0, which has no present pages. And as
> > Bharata pasted, node_to_mem_node(0) = 0

Correct, these two kzalloc_node() calls for all possible CPUs are
causing the increased slab memory consumption in my case.

> > So this looks like the same scenario, but it doesn't crash? Is the node 0
> > actually online here, and/or does it have N_NORMAL_MEMORY state?
> 

Node 0 is online, but it does not have the N_NORMAL_MEMORY state. In fact,
the memory leak goes away if I insert the check/assignment below in the slab
alloc code path:

+       if (!node_isset(node, node_states[N_NORMAL_MEMORY]))
+               node = NUMA_NO_NODE;
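
The effect of that check can be sketched in user-space C as follows. This is a
simplified model, not the kernel code: the bitmask is a stand-in for
node_states[N_NORMAL_MEMORY], and the names sanitize_node and NO_NODE are
hypothetical (NO_NODE mirrors NUMA_NO_NODE, which is -1 in the kernel).

```c
#include <assert.h>

#define NO_NODE (-1)	/* stand-in for the kernel's NUMA_NO_NODE */

/*
 * normal_memory_mask models node_states[N_NORMAL_MEMORY]:
 * bit n is set when node n has regular memory.
 */
static int sanitize_node(int node, unsigned long normal_memory_mask)
{
	if (node == NO_NODE)
		return NO_NODE;

	assert(node >= 0 && node < (int)(8 * sizeof(unsigned long)));

	/* A node without normal memory is treated as "no preference". */
	if (!(normal_memory_mask & (1UL << node)))
		return NO_NODE;

	return node;
}
```

With only node 1 in the mask (0x2), a request for node 0 is turned into
NO_NODE while a request for node 1 is left alone.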

Regards,
Bharata.
Vlastimil Babka March 18, 2020, 10:18 a.m. UTC | #7
On 3/18/20 4:20 AM, Srikar Dronamraju wrote:
> * Vlastimil Babka <vbabka@suse.cz> [2020-03-17 17:45:15]:
>> 
>> Yes, that Kirill's patch was about the memcg shrinker map allocation. But the
>> patch hunk that Bharata posted as a "hack" that fixes the problem, it follows
>> that there has to be something else that calls kmalloc_node(node) where node is
>> one that doesn't have present pages.
>> 
>> He mentions alloc_fair_sched_group() which has:
>> 
>>         for_each_possible_cpu(i) {
>>                 cfs_rq = kzalloc_node(sizeof(struct cfs_rq),
>>                                       GFP_KERNEL, cpu_to_node(i));
>> ...
>>                 se = kzalloc_node(sizeof(struct sched_entity),
>>                                   GFP_KERNEL, cpu_to_node(i));
>> 
> 
> 
> Sachin's experiment.
> Upstream-next/ memcg /
> possible nodes were 0-31
> online nodes were 0-1
> kmalloc_node called for_each_node / for_each_possible_node.
> This would crash while allocating slab from !N_ONLINE nodes.

So you're saying the crash was actually for allocation on e.g. node 2, not node 0?
But I believe it was on node 0, because init_kmem_cache_nodes() will only
allocate kmem_cache_node on nodes with N_NORMAL_MEMORY (which doesn't include
0), and slab_mem_going_online_callback() was probably not called for node 0 (it
was not dynamically onlined).
Also if node 0 was fine, node_to_mem_node(2-31) (not initialized explicitly)
would have returned 0 and thus not crash as well.

> Bharata's experiment.
> Upstream
> possible nodes were 0-1
> online nodes were 0-1
> kmalloc_node called for_each_online_node/ for_each_possible_cpu
> i.e kmalloc is called for N_ONLINE nodes.
> So wouldn't crash
> 
> Even if his possible nodes were 0-256. I don't think we have kmalloc_node
> being called in !N_ONLINE nodes. Hence its not crashing.
> If we see the above code that you quote, kzalloc_node is using cpu_to_node
> which in Bharata's case will always return 1.

Are you sure that under for_each_possible_cpu(), cpu_to_node() will always be
1? Are all of them properly initialized or is there a similar issue as with
node_to_mem_node(), that some were not initialized and thus cpu_to_node() will
return 0?

Because AFAICS, if kzalloc_node() was always called with node 1, then
node_present_pages(1) is true, and the "hack" that Bharata reports to work in
his original mail would make no functional difference.

> 
>> I assume one of these structs is 1k and other 512 bytes (rounded) and that for
>> some possible cpu's cpu_to_node(i) will be 0, which has no present pages. And as
>> Bharata pasted, node_to_mem_node(0) = 0
>> So this looks like the same scenario, but it doesn't crash? Is the node 0
>> actually online here, and/or does it have N_NORMAL_MEMORY state?
> 
> I still dont have any clue on the leak though.

Let's assume that kzalloc_node() was called with 0 for some of the possible
CPUs. I still wonder why it won't crash, but let's assume kmem_cache_node does
exist for node 0 here.
So the execution AFAICS goes like this:

slab_alloc_node(0)
  c = raw_cpu_ptr(s->cpu_slab);
  object = c->freelist;
  page = c->page;
  if (unlikely(!object || !node_match(page, node))) {
  // whatever we have in the per-cpu cache must be from node 1
  // because node 0 has no memory, so there's no node_match and thus
   __slab_alloc(node == 0)
    ___slab_alloc(node == 0)
      page = c->page;
     redo:
      if (unlikely(!node_match(page, node))) { // still no match
        int searchnode = node;

        if (node != NUMA_NO_NODE && !node_present_pages(node))
	                   //  true && true for node 0
          searchnode = node_to_mem_node(node);
          // searchnode is 0, not 1

          if (unlikely(!node_match(page, searchnode))) {
          // page still from node 1, searchnode is 0, no match
	
            stat(s, ALLOC_NODE_MISMATCH);
            deactivate_slab(s, page, c->freelist, c);
            // we removed the slab from cpu's cache
            goto new_slab;
          }

     new_slab:
      if (slub_percpu_partial(c)) {
        page = c->page = slub_percpu_partial(c);
        slub_set_percpu_partial(c, page);
        stat(s, CPU_PARTIAL_ALLOC);
        goto redo;
        // huh, so with CONFIG_SLUB_CPU_PARTIAL
        // this can become an infinite loop actually?
      }
// Bharata's slub stats don't include cpu_partial_alloc so I assume
// CONFIG_SLUB_CPU_PARTIAL is not enabled and we don't loop
      freelist = new_slab_objects(s, gfpflags, node, &c);
        freelist = get_partial(s, flags, node, c);

         if (node == NUMA_NO_NODE) // false, it's 0
         else if (!node_present_pages(node)) // true for 0
            searchnode = node_to_mem_node(node); // still 0

         object = get_partial_node(s, get_node(s, searchnode),...);
         // object is NULL as node 0 has nothing
         // but we have node == 0 so we return the NULL
         if (object || node != NUMA_NO_NODE)
                return object;
         // and we don't fallback to get_any_partial which would
         // have found e.g. the slab we deactivated earlier
         return get_any_partial(s, flags, c);

       page = new_slab(s, flags, node);
       // we attempt to allocate new slab on node 0, but it will come
       // from node 1

So that explains the leak I think. We keep throwing away slabs from node 1 only
to allocate new ones on node 1. Effectively each cfs_rq object and each
sched_entity object will get a new (high-order?) page
for a possible cpu where cpu_to_node() is 0.
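
The walk-through above can be turned into a toy user-space simulation. This is
a deliberately simplified sketch, not the kernel's logic: it models one CPU,
ignores frees, assumes the per-CPU slab always comes from node 1, and assumes
32 objects per slab. All names (slub_sim, sim_alloc, sim_run) are hypothetical.
The search_local flag models looking at the local node's partial list for a
memoryless node, as the posted hack does.

```c
#include <assert.h>

#define OBJS_PER_SLAB 32	/* assumed objects per slab */
#define MAX_PARTIAL   4096	/* arbitrary bound for this model */

struct slub_sim {
	int cpu_free;			/* free objects in the per-CPU slab, -1 if none */
	int partial[MAX_PARTIAL];	/* free counts of slabs on node 1's partial list */
	int nr_partial;
	int nr_slabs;			/* slabs taken from the page allocator */
};

static void sim_alloc(struct slub_sim *s, int node, int search_local)
{
	/* The per-CPU slab is from node 1, so a request for node 0 never matches. */
	int mismatch = (node == 0);

	if (s->cpu_free > 0 && !mismatch) {
		s->cpu_free--;		/* fast path: take from the per-CPU slab */
		return;
	}

	/* Deactivate the per-CPU slab; park it if it still has free objects. */
	if (s->cpu_free > 0) {
		assert(s->nr_partial < MAX_PARTIAL);
		s->partial[s->nr_partial++] = s->cpu_free;
	}
	s->cpu_free = -1;

	/*
	 * Without search_local, a mismatching request only looks at node 0's
	 * partial list, which is always empty, and does not fall back.
	 */
	if ((!mismatch || search_local) && s->nr_partial > 0) {
		s->cpu_free = s->partial[--s->nr_partial];
	} else {
		s->nr_slabs++;		/* new slab, which lands on node 1 anyway */
		s->cpu_free = OBJS_PER_SLAB;
	}
	s->cpu_free--;			/* hand out one object */
}

/* Run nr_allocs requests for `node` and return the number of slabs used. */
static int sim_run(int nr_allocs, int node, int search_local)
{
	struct slub_sim s = { .cpu_free = -1 };
	int i;

	for (i = 0; i < nr_allocs; i++)
		sim_alloc(&s, node, search_local);
	return s.nr_slabs;
}
```

In this model, 100 requests for node 0 consume 100 slabs, one per object,
whereas the same requests with search_local set, or 100 requests for node 1,
consume 4 slabs.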

Patch

diff --git a/mm/slub.c b/mm/slub.c
index 17dc00e33115..888e4d245444 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1971,10 +1971,8 @@  static void *get_partial(struct kmem_cache *s, gfp_t flags, int node,
 	void *object;
 	int searchnode = node;
 
-	if (node == NUMA_NO_NODE)
+	if (node == NUMA_NO_NODE || !node_present_pages(node))
 		searchnode = numa_mem_id();
-	else if (!node_present_pages(node))
-		searchnode = node_to_mem_node(node);
 
 	object = get_partial_node(s, get_node(s, searchnode), c, flags);
 	if (object || node != NUMA_NO_NODE)
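
To illustrate what the hunk changes, here is a user-space sketch (not kernel
code) of the searchnode selection before and after the patch, using a made-up
topology that mirrors Case 1. The tables and function names are hypothetical;
mem_node_of[] reuses the node_to_mem_node() values reported in the thread, and
local_mem_node = 1 is an assumption about what numa_mem_id() returns on a CPU
of node 1.

```c
#include <assert.h>

#define NO_NODE (-1)	/* stand-in for the kernel's NUMA_NO_NODE */

/* Made-up topology: node 0 is empty, node 1 holds all the memory. */
static const unsigned long present_pages[2] = { 0, 1 };
static const int mem_node_of[2] = { 0, 1 };	/* reported node_to_mem_node() values */
static const int local_mem_node = 1;		/* assumed numa_mem_id() result */

/* Selection as in the lines removed by the patch. */
static int searchnode_before(int node)
{
	assert(node >= NO_NODE && node < 2);
	if (node == NO_NODE)
		return local_mem_node;
	if (!present_pages[node])
		return mem_node_of[node];
	return node;
}

/* Selection as in the lines added by the patch. */
static int searchnode_after(int node)
{
	assert(node >= NO_NODE && node < 2);
	if (node == NO_NODE || !present_pages[node])
		return local_mem_node;
	return node;
}
```

Under these assumptions, a request for node 0 searches the empty node 0 before
the patch and node 1 after it; requests for node 1 or NO_NODE are unaffected.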