Message ID | alpine.LSU.2.11.1705301151090.2133@eggly.anvils (mailing list archive) |
---|---|
State | Not Applicable |
Hugh Dickins <hughd@google.com> writes:

> Since f6eedbba7a26 ("powerpc/mm/hash: Increase VA range to 128TB")
> I find that swapping loads on ppc64 on G5 with 4k pages are failing:
>
> SLUB: Unable to allocate memory on node -1, gfp=0x14000c0(GFP_KERNEL)
>   cache: pgtable-2^12, object size: 32768, buffer size: 65536, default order: 4, min order: 4
>   pgtable-2^12 debugging increased min order, use slub_debug=O to disable.
>   node 0: slabs: 209, objs: 209, free: 8
> gcc: page allocation failure: order:4, mode:0x16040c0(GFP_KERNEL|__GFP_COMP|__GFP_NOTRACK), nodemask=(null)
> CPU: 1 PID: 6225 Comm: gcc Not tainted 4.12.0-rc2 #1
> Call Trace:
> [c00000000090b5c0] [c0000000004f8478] .dump_stack+0xa0/0xcc (unreliable)
> [c00000000090b650] [c0000000000eb194] .warn_alloc+0xf0/0x178
> [c00000000090b710] [c0000000000ebc9c] .__alloc_pages_nodemask+0xa04/0xb00
> [c00000000090b8b0] [c00000000013921c] .new_slab+0x234/0x608
> [c00000000090b980] [c00000000013b59c] .___slab_alloc.constprop.64+0x3dc/0x564
> [c00000000090bad0] [c0000000004f5a84] .__slab_alloc.isra.61.constprop.63+0x54/0x70
> [c00000000090bb70] [c00000000013b864] .kmem_cache_alloc+0x140/0x288
> [c00000000090bc30] [c00000000004d934] .mm_init.isra.65+0x128/0x1c0
> [c00000000090bcc0] [c000000000157810] .do_execveat_common.isra.39+0x294/0x690
> [c00000000090bdb0] [c000000000157e70] .SyS_execve+0x28/0x38
> [c00000000090be30] [c00000000000a118] system_call+0x38/0xfc
>
> I did try booting with slub_debug=O as the message suggested, but that
> made no difference: it still hoped for but failed on order:4 allocations.
>
> I wanted to try removing CONFIG_SLUB_DEBUG, but didn't succeed in that:
> it seemed to be a hard requirement for something, but I didn't find what.
>
> I did try CONFIG_SLAB=y instead of SLUB: that lowers these allocations to
> the expected order:3, which then results in OOM-killing rather than direct
> allocation failure, because of the PAGE_ALLOC_COSTLY_ORDER 3 cutoff. But
> makes no real difference to the outcome: swapping loads still abort early.
>
> Relying on order:3 or order:4 allocations is just too optimistic: ppc64
> with 4k pages would do better not to expect to support a 128TB userspace.
>
> I tried the obvious partial revert below, but it's not good enough:
> the system did not boot beyond
>
> Starting init: /sbin/init exists but couldn't execute it (error -7)
> Starting init: /bin/sh exists but couldn't execute it (error -7)
> Kernel panic - not syncing: No working init found. ...

Ouch, sorry. I boot test a G5 with 4K pages, but I don't stress test it
much so I didn't notice this.

I think making 128TB depend on 64K pages makes sense, Aneesh is going to
try and do a patch for that.

cheers
On Tue, 30 May 2017, Hugh Dickins wrote:

> I wanted to try removing CONFIG_SLUB_DEBUG, but didn't succeed in that:
> it seemed to be a hard requirement for something, but I didn't find what.

CONFIG_SLUB_DEBUG does not enable debugging. It only includes the code to
be able to enable it at runtime.

> I did try CONFIG_SLAB=y instead of SLUB: that lowers these allocations to
> the expected order:3, which then results in OOM-killing rather than direct
> allocation failure, because of the PAGE_ALLOC_COSTLY_ORDER 3 cutoff. But
> makes no real difference to the outcome: swapping loads still abort early.

SLAB uses order 3 and SLUB order 4??? That needs to be tracked down.

Why are the slab allocators used to create slab caches for large object
sizes?

> Relying on order:3 or order:4 allocations is just too optimistic: ppc64
> with 4k pages would do better not to expect to support a 128TB userspace.

I thought you had these huge 64k page sizes?
On Wed, 31 May 2017, Michael Ellerman wrote:

> > SLUB: Unable to allocate memory on node -1, gfp=0x14000c0(GFP_KERNEL)
> >   cache: pgtable-2^12, object size: 32768, buffer size: 65536, default order: 4, min order: 4
> >   pgtable-2^12 debugging increased min order, use slub_debug=O to disable.

Ahh. Ok debugging increased the object size to an order 4. This should be
order 3 without debugging.

> > I did try booting with slub_debug=O as the message suggested, but that
> > made no difference: it still hoped for but failed on order:4 allocations.

I am curious as to what is going on there. Do you have the output from
these failed allocations?
[ Merging two mails into one response ]

On Wed, 31 May 2017, Christoph Lameter wrote:
> On Tue, 30 May 2017, Hugh Dickins wrote:
> > SLUB: Unable to allocate memory on node -1, gfp=0x14000c0(GFP_KERNEL)
> >   cache: pgtable-2^12, object size: 32768, buffer size: 65536, default order: 4, min order: 4
> >   pgtable-2^12 debugging increased min order, use slub_debug=O to disable.
>
> > I did try booting with slub_debug=O as the message suggested, but that
> > made no difference: it still hoped for but failed on order:4 allocations.
>
> I am curious as to what is going on there. Do you have the output from
> these failed allocations?

I thought the relevant output was in my mail. I did skip the Mem-Info
dump, since that just seemed noise in this case: we know memory can get
fragmented. What more output are you looking for?

> > I wanted to try removing CONFIG_SLUB_DEBUG, but didn't succeed in that:
> > it seemed to be a hard requirement for something, but I didn't find what.
>
> CONFIG_SLUB_DEBUG does not enable debugging. It only includes the code to
> be able to enable it at runtime.

Yes, I thought so.

> > I did try CONFIG_SLAB=y instead of SLUB: that lowers these allocations to
> > the expected order:3, which then results in OOM-killing rather than direct
> > allocation failure, because of the PAGE_ALLOC_COSTLY_ORDER 3 cutoff. But
> > makes no real difference to the outcome: swapping loads still abort early.
>
> SLAB uses order 3 and SLUB order 4??? That needs to be tracked down.
>
> Ahh. Ok debugging increased the object size to an order 4. This should be
> order 3 without debugging.

But it was still order 4 when booted with slub_debug=O, which surprised me.
And that surprises you too? If so, then we ought to dig into it further.

> Why are the slab allocators used to create slab caches for large object
> sizes?

There may be more optimal ways to allocate, but I expect that when
the ppc guys are writing the code to handle both 4k and 64k page sizes,
kmem caches offer the best span of possibility without complication.

> > Relying on order:3 or order:4 allocations is just too optimistic: ppc64
> > with 4k pages would do better not to expect to support a 128TB userspace.
>
> I thought you had these huge 64k page sizes?

ppc64 does support 64k page sizes, and they've been the default for years;
but since 4k pages are still supported, I choose to use those (I doubt
I could ever get the same load going with 64k pages).

Hugh
On Wed, May 31, 2017 at 8:44 PM, Hugh Dickins <hughd@google.com> wrote:
[...]
> ppc64 does support 64k page sizes, and they've been the default for years;
> but since 4k pages are still supported, I choose to use those (I doubt
> I could ever get the same load going with 64k pages).

4k is pretty much required on ppc64 when it comes to nouveau:
https://bugs.freedesktop.org/show_bug.cgi?id=94757

2cts
> > I am curious as to what is going on there. Do you have the output from
> > these failed allocations?
>
> I thought the relevant output was in my mail. I did skip the Mem-Info
> dump, since that just seemed noise in this case: we know memory can get
> fragmented. What more output are you looking for?

The output for the failing allocations when you disable debugging. For
that I would think you need to remove(!) the slub_debug statement on the
kernel command line. You can verify that debug is off by inspecting the
values in /sys/kernel/slab/<yourcache>/<debug option>

> But it was still order 4 when booted with slub_debug=O, which surprised me.
> And that surprises you too? If so, then we ought to dig into it further.

No, it no longer does. I don't think slub_debug=O disables debugging
(frankly I am not sure what it does). Please do not specify any debug
options.
On Thu, 1 Jun 2017, Christoph Lameter wrote:

> > > I am curious as to what is going on there. Do you have the output from
> > > these failed allocations?
> >
> > I thought the relevant output was in my mail. I did skip the Mem-Info
> > dump, since that just seemed noise in this case: we know memory can get
> > fragmented. What more output are you looking for?
>
> The output for the failing allocations when you disable debugging. For
> that I would think you need to remove(!) the slub_debug statement on the
> kernel command line. You can verify that debug is off by inspecting the
> values in /sys/kernel/slab/<yourcache>/<debug option>

The output was with debugging disabled. Except when I tried adding that
slub_debug=O on the kernel command line, as the warning suggested, I did
not have any slub_debug statement on the command line; and did not have
CONFIG_SLUB_DEBUG_ON=y. My SLAB|SLUB config options are

CONFIG_SLUB_DEBUG=y
# CONFIG_SLUB_MEMCG_SYSFS_ON is not set
# CONFIG_SLAB is not set
CONFIG_SLUB=y
# CONFIG_SLAB_FREELIST_RANDOM is not set
CONFIG_SLUB_CPU_PARTIAL=y
CONFIG_SLABINFO=y
# CONFIG_SLUB_DEBUG_ON is not set
CONFIG_SLUB_STATS=y

> > But it was still order 4 when booted with slub_debug=O, which surprised me.
> > And that surprises you too? If so, then we ought to dig into it further.
>
> No, it no longer does. I don't think slub_debug=O disables debugging
> (frankly I am not sure what it does). Please do not specify any debug
> options.

But I think you are now surprised, when I say no slub_debug options
were on. Here's the output from /sys/kernel/slab/pgtable-2^12/*
(before I tried the new kernel with Aneesh's fix patch)
in case they tell you anything...
pgtable-2^12/aliases:0
pgtable-2^12/align:32768
grep: pgtable-2^12/alloc_calls: Function not implemented
pgtable-2^12/alloc_fastpath:5847 C0=1587 C1=1449 C2=1392 C3=1419
pgtable-2^12/alloc_from_partial:12637 C0=3292 C1=3020 C2=3051 C3=3274
pgtable-2^12/alloc_node_mismatch:0
pgtable-2^12/alloc_refill:41038 C0=10600 C1=10025 C2=10191 C3=10222
pgtable-2^12/alloc_slab:517 C0=148 C1=110 C2=105 C3=154
pgtable-2^12/alloc_slowpath:54203 C0=14041 C1=13157 C2=13349 C3=13656
pgtable-2^12/cache_dma:0
pgtable-2^12/cmpxchg_double_cpu_fail:0
pgtable-2^12/cmpxchg_double_fail:0
pgtable-2^12/cpu_partial:2
pgtable-2^12/cpu_partial_alloc:25894 C0=6719 C1=6334 C2=6288 C3=6553
pgtable-2^12/cpu_partial_drain:8441 C0=2035 C1=2211 C2=2268 C3=1927
pgtable-2^12/cpu_partial_free:38987 C0=9642 C1=10042 C2=10132 C3=9171
pgtable-2^12/cpu_partial_node:12237 C0=3183 C1=2928 C2=2961 C3=3165
pgtable-2^12/cpu_slabs:11
pgtable-2^12/cpuslab_flush:17 C0=5 C2=4 C3=8
pgtable-2^12/ctor:pgd_ctor+0x0/0x18
pgtable-2^12/deactivate_bypass:39027 C0=10153 C1=9463 C2=9439 C3=9972
pgtable-2^12/deactivate_empty:446 C0=98 C1=118 C2=123 C3=107
pgtable-2^12/deactivate_full:16 C0=5 C2=3 C3=8
pgtable-2^12/deactivate_remote_frees:0
pgtable-2^12/deactivate_to_head:1 C2=1
pgtable-2^12/deactivate_to_tail:0
pgtable-2^12/destroy_by_rcu:0
pgtable-2^12/free_add_partial:24877 C0=6007 C1=6515 C2=6681 C3=5674
grep: pgtable-2^12/free_calls: Function not implemented
pgtable-2^12/free_fastpath:5849 C0=1587 C1=1449 C2=1394 C3=1419
pgtable-2^12/free_frozen:15145 C0=3989 C1=3701 C2=3683 C3=3772
pgtable-2^12/free_remove_partial:0
pgtable-2^12/free_slab:446 C0=98 C1=118 C2=123 C3=107
pgtable-2^12/free_slowpath:54132 C0=13631 C1=13743 C2=13815 C3=12943
pgtable-2^12/hwcache_align:0
pgtable-2^12/min_partial:8
pgtable-2^12/object_size:32768
pgtable-2^12/objects:67
pgtable-2^12/objects_partial:0
pgtable-2^12/objs_per_slab:1
pgtable-2^12/order:4
pgtable-2^12/order_fallback:13 C0=2 C1=1 C2=5 C3=5
pgtable-2^12/partial:4
pgtable-2^12/poison:0
pgtable-2^12/reclaim_account:0
pgtable-2^12/red_zone:0
pgtable-2^12/reserved:0
pgtable-2^12/sanity_checks:0
pgtable-2^12/slab_size:65536
pgtable-2^12/slabs:71
pgtable-2^12/slabs_cpu_partial:7(7) C0=1(1) C1=3(3) C2=1(1) C3=2(2)
pgtable-2^12/store_user:0
pgtable-2^12/total_objects:71
pgtable-2^12/trace:0

Hugh
On Thu, 1 Jun 2017, Hugh Dickins wrote:

> CONFIG_SLUB_DEBUG_ON=y. My SLAB|SLUB config options are
>
> CONFIG_SLUB_DEBUG=y
> # CONFIG_SLUB_MEMCG_SYSFS_ON is not set
> # CONFIG_SLAB is not set
> CONFIG_SLUB=y
> # CONFIG_SLAB_FREELIST_RANDOM is not set
> CONFIG_SLUB_CPU_PARTIAL=y
> CONFIG_SLABINFO=y
> # CONFIG_SLUB_DEBUG_ON is not set
> CONFIG_SLUB_STATS=y

That's fine.

> But I think you are now surprised, when I say no slub_debug options
> were on. Here's the output from /sys/kernel/slab/pgtable-2^12/*
> (before I tried the new kernel with Aneesh's fix patch)
> in case they tell you anything...
>
> pgtable-2^12/poison:0
> pgtable-2^12/red_zone:0
> pgtable-2^12/reserved:0
> pgtable-2^12/sanity_checks:0
> pgtable-2^12/store_user:0

Ok, so debugging was off, but the slab cache has a ctor callback, which
mandates that the free pointer cannot use the free object space when the
object is not in use. Thus the size of the object must be increased to
accommodate the free pointer.
On Thu, 1 Jun 2017, Christoph Lameter wrote:
>
> Ok, so debugging was off, but the slab cache has a ctor callback, which
> mandates that the free pointer cannot use the free object space when the
> object is not in use. Thus the size of the object must be increased to
> accommodate the free pointer.

Thanks a lot for working that out. Makes sense, fully understood now,
nothing to worry about (though makes one wonder whether it's efficient
to use ctors on high-alignment caches; or whether an internal "zero-me"
ctor would be useful).

Hugh
Hugh Dickins <hughd@google.com> writes:
> On Thu, 1 Jun 2017, Christoph Lameter wrote:
>>
>> Ok, so debugging was off, but the slab cache has a ctor callback, which
>> mandates that the free pointer cannot use the free object space when the
>> object is not in use. Thus the size of the object must be increased to
>> accommodate the free pointer.
>
> Thanks a lot for working that out. Makes sense, fully understood now,
> nothing to worry about (though makes one wonder whether it's efficient
> to use ctors on high-alignment caches; or whether an internal "zero-me"
> ctor would be useful).

Or should we just be using kmem_cache_zalloc() when we allocate from
those slabs?

Given all the ctors do is memset to 0.

cheers
On Fri, 2 Jun 2017, Michael Ellerman wrote:

> Hugh Dickins <hughd@google.com> writes:
> > Thanks a lot for working that out. Makes sense, fully understood now,
> > nothing to worry about (though makes one wonder whether it's efficient
> > to use ctors on high-alignment caches; or whether an internal "zero-me"
> > ctor would be useful).
>
> Or should we just be using kmem_cache_zalloc() when we allocate from
> those slabs?
>
> Given all the ctors do is memset to 0.

I'm not sure. From a memory-utilization point of view, with SLUB,
using kmem_cache_zalloc() there would certainly be better.

But you may be forgetting that the constructor is applied only when a
new slab of objects is allocated, not each time an object is allocated
from that slab (and the user of those objects agrees to free objects
back to the cache in a reusable state: zeroed in this case). So from a
cpu-utilization point of view, it's better to use the ctor: it's saving
you lots of redundant memsets.

SLUB versus SLAB, cpu versus memory? Since someone has taken the
trouble to write it with ctors in the past, I didn't feel on firm
enough ground to recommend such a change. But it may be obvious
to someone else that your suggestion would be better (or worse).

Hugh
On Thu, 1 Jun 2017, Hugh Dickins wrote:

> Thanks a lot for working that out. Makes sense, fully understood now,
> nothing to worry about (though makes one wonder whether it's efficient
> to use ctors on high-alignment caches; or whether an internal "zero-me"
> ctor would be useful).

Use kzalloc to zero it.

And here is another example of using slab allocations for page frames.
Use the page allocator for this? The page allocator is there for
allocating page frames. The slab allocator's main purpose is to allocate
small objects....
On Thu, 1 Jun 2017, Hugh Dickins wrote:

> SLUB versus SLAB, cpu versus memory? Since someone has taken the
> trouble to write it with ctors in the past, I didn't feel on firm
> enough ground to recommend such a change. But it may be obvious
> to someone else that your suggestion would be better (or worse).

Umm, how about using alloc_pages() for page frames?
Hugh Dickins <hughd@google.com> writes:
> On Fri, 2 Jun 2017, Michael Ellerman wrote:
>> Or should we just be using kmem_cache_zalloc() when we allocate from
>> those slabs?
>>
>> Given all the ctors do is memset to 0.
>
> I'm not sure. From a memory-utilization point of view, with SLUB,
> using kmem_cache_zalloc() there would certainly be better.
>
> But you may be forgetting that the constructor is applied only when a
> new slab of objects is allocated, not each time an object is allocated
> from that slab (and the user of those objects agrees to free objects
> back to the cache in a reusable state: zeroed in this case).

Ah yes, I was "forgetting" that :) - ie. didn't know it.

> So from a cpu-utilization point of view, it's better to use the ctor:
> it's saving you lots of redundant memsets.

OK. Presumably we guarantee (somewhere) that the page tables are zeroed
before we free them, which is a natural result of tearing down all
mappings?

But then I see other arches (x86, arm64 at least), which don't use a
constructor, and use __GFP_ZERO (via PGALLOC_GFP) at allocation time.

eg. arm64:

	pgd_cache = kmem_cache_create("pgd_cache", PGD_SIZE, PGD_SIZE,
				      SLAB_PANIC, NULL);
	...
	return kmem_cache_alloc(pgd_cache, PGALLOC_GFP);

So that's a bit puzzling.

cheers
Christoph Lameter <cl@linux.com> writes:
> On Thu, 1 Jun 2017, Hugh Dickins wrote:
>
>> Thanks a lot for working that out. Makes sense, fully understood now,
>> nothing to worry about (though makes one wonder whether it's efficient
>> to use ctors on high-alignment caches; or whether an internal "zero-me"
>> ctor would be useful).
>
> Use kzalloc to zero it.

But that's changing a per-slab-creation memset into a per-object-allocation
memset, isn't it?

> And here is another example of using slab allocations for page frames.
> Use the page allocator for this? The page allocator is there for
> allocating page frames. The slab allocator's main purpose is to allocate
> small objects....

Well usually they are small (< PAGE_SIZE), because we have 64K pages.

But we could rework the code to use the page allocator on 4K configs.

cheers
--- 4.12-rc2/arch/powerpc/include/asm/book3s/64/hash-4k.h
+++ linux/arch/powerpc/include/asm/book3s/64/hash-4k.h
@@ -8,7 +8,7 @@
 #define H_PTE_INDEX_SIZE  9
 #define H_PMD_INDEX_SIZE  7
 #define H_PUD_INDEX_SIZE  9
-#define H_PGD_INDEX_SIZE  12
+#define H_PGD_INDEX_SIZE  9
 
 #ifndef __ASSEMBLY__
 #define H_PTE_TABLE_SIZE	(sizeof(pte_t) << H_PTE_INDEX_SIZE)
--- 4.12-rc2/arch/powerpc/include/asm/processor.h
+++ linux/arch/powerpc/include/asm/processor.h
@@ -110,7 +110,7 @@ void release_thread(struct task_struct *
 #define TASK_SIZE_128TB (0x0000800000000000UL)
 #define TASK_SIZE_512TB (0x0002000000000000UL)
 
-#ifdef CONFIG_PPC_BOOK3S_64
+#if defined(CONFIG_PPC_BOOK3S_64) && defined(CONFIG_PPC_64K_PAGES)
 /*
  * Max value currently used:
  */