Message ID | alpine.LSU.2.11.1705301151090.2133@eggly.anvils (mailing list archive) |
---|---|
State | Not Applicable |
Hugh Dickins <hughd@google.com> writes:

> Since f6eedbba7a26 ("powerpc/mm/hash: Increase VA range to 128TB")
> I find that swapping loads on ppc64 on G5 with 4k pages are failing:
>
> SLUB: Unable to allocate memory on node -1, gfp=0x14000c0(GFP_KERNEL)
>   cache: pgtable-2^12, object size: 32768, buffer size: 65536, default order: 4, min order: 4
>   pgtable-2^12 debugging increased min order, use slub_debug=O to disable.
>   node 0: slabs: 209, objs: 209, free: 8
> gcc: page allocation failure: order:4, mode:0x16040c0(GFP_KERNEL|__GFP_COMP|__GFP_NOTRACK), nodemask=(null)
> CPU: 1 PID: 6225 Comm: gcc Not tainted 4.12.0-rc2 #1
> Call Trace:
> [c00000000090b5c0] [c0000000004f8478] .dump_stack+0xa0/0xcc (unreliable)
> [c00000000090b650] [c0000000000eb194] .warn_alloc+0xf0/0x178
> [c00000000090b710] [c0000000000ebc9c] .__alloc_pages_nodemask+0xa04/0xb00
> [c00000000090b8b0] [c00000000013921c] .new_slab+0x234/0x608
> [c00000000090b980] [c00000000013b59c] .___slab_alloc.constprop.64+0x3dc/0x564
> [c00000000090bad0] [c0000000004f5a84] .__slab_alloc.isra.61.constprop.63+0x54/0x70
> [c00000000090bb70] [c00000000013b864] .kmem_cache_alloc+0x140/0x288
> [c00000000090bc30] [c00000000004d934] .mm_init.isra.65+0x128/0x1c0
> [c00000000090bcc0] [c000000000157810] .do_execveat_common.isra.39+0x294/0x690
> [c00000000090bdb0] [c000000000157e70] .SyS_execve+0x28/0x38
> [c00000000090be30] [c00000000000a118] system_call+0x38/0xfc
>
> I did try booting with slub_debug=O as the message suggested, but that
> made no difference: it still hoped for but failed on order:4 allocations.
>
> I wanted to try removing CONFIG_SLUB_DEBUG, but didn't succeed in that:
> it seemed to be a hard requirement for something, but I didn't find what.
>
> I did try CONFIG_SLAB=y instead of SLUB: that lowers these allocations to
> the expected order:3, which then results in OOM-killing rather than direct
> allocation failure, because of the PAGE_ALLOC_COSTLY_ORDER 3 cutoff. But
> makes no real difference to the outcome: swapping loads still abort early.
>
> Relying on order:3 or order:4 allocations is just too optimistic: ppc64
> with 4k pages would do better not to expect to support a 128TB userspace.
>
> I tried the obvious partial revert below, but it's not good enough:
> the system did not boot beyond
>
> Starting init: /sbin/init exists but couldn't execute it (error -7)
> Starting init: /bin/sh exists but couldn't execute it (error -7)
> Kernel panic - not syncing: No working init found. ...

Ouch, sorry. I boot test a G5 with 4K pages, but I don't stress test it
much so I didn't notice this.

I think making 128TB depend on 64K pages makes sense, Aneesh is going to
try and do a patch for that.

cheers
On Tue, 30 May 2017, Hugh Dickins wrote:

> I wanted to try removing CONFIG_SLUB_DEBUG, but didn't succeed in that:
> it seemed to be a hard requirement for something, but I didn't find what.

CONFIG_SLUB_DEBUG does not enable debugging. It only includes the code to
be able to enable it at runtime.

> I did try CONFIG_SLAB=y instead of SLUB: that lowers these allocations to
> the expected order:3, which then results in OOM-killing rather than direct
> allocation failure, because of the PAGE_ALLOC_COSTLY_ORDER 3 cutoff. But
> makes no real difference to the outcome: swapping loads still abort early.

SLAB uses order 3 and SLUB order 4??? That needs to be tracked down.

Why are the slab allocators used to create slab caches for large object
sizes?

> Relying on order:3 or order:4 allocations is just too optimistic: ppc64
> with 4k pages would do better not to expect to support a 128TB userspace.

I thought you had these huge 64k page sizes?
On Wed, 31 May 2017, Michael Ellerman wrote:

> > SLUB: Unable to allocate memory on node -1, gfp=0x14000c0(GFP_KERNEL)
> >   cache: pgtable-2^12, object size: 32768, buffer size: 65536, default order: 4, min order: 4
> >   pgtable-2^12 debugging increased min order, use slub_debug=O to disable.

Ahh. Ok debugging increased the object size to an order 4. This should be
order 3 without debugging.

> > I did try booting with slub_debug=O as the message suggested, but that
> > made no difference: it still hoped for but failed on order:4 allocations.

I am curious as to what is going on there. Do you have the output from
these failed allocations?
[ Merging two mails into one response ]

On Wed, 31 May 2017, Christoph Lameter wrote:
> On Tue, 30 May 2017, Hugh Dickins wrote:
> > SLUB: Unable to allocate memory on node -1, gfp=0x14000c0(GFP_KERNEL)
> >   cache: pgtable-2^12, object size: 32768, buffer size: 65536, default order: 4, min order: 4
> >   pgtable-2^12 debugging increased min order, use slub_debug=O to disable.
>
> > I did try booting with slub_debug=O as the message suggested, but that
> > made no difference: it still hoped for but failed on order:4 allocations.
>
> I am curious as to what is going on there. Do you have the output from
> these failed allocations?

I thought the relevant output was in my mail. I did skip the Mem-Info
dump, since that just seemed noise in this case: we know memory can get
fragmented. What more output are you looking for?

> > I wanted to try removing CONFIG_SLUB_DEBUG, but didn't succeed in that:
> > it seemed to be a hard requirement for something, but I didn't find what.
>
> CONFIG_SLUB_DEBUG does not enable debugging. It only includes the code to
> be able to enable it at runtime.

Yes, I thought so.

> > I did try CONFIG_SLAB=y instead of SLUB: that lowers these allocations to
> > the expected order:3, which then results in OOM-killing rather than direct
> > allocation failure, because of the PAGE_ALLOC_COSTLY_ORDER 3 cutoff. But
> > makes no real difference to the outcome: swapping loads still abort early.
>
> SLAB uses order 3 and SLUB order 4??? That needs to be tracked down.
>
> Ahh. Ok debugging increased the object size to an order 4. This should be
> order 3 without debugging.

But it was still order 4 when booted with slub_debug=O, which surprised me.
And that surprises you too? If so, then we ought to dig into it further.

> Why are the slab allocators used to create slab caches for large object
> sizes?

There may be more optimal ways to allocate, but I expect that when
the ppc guys are writing the code to handle both 4k and 64k page sizes,
kmem caches offer the best span of possibility without complication.

> > Relying on order:3 or order:4 allocations is just too optimistic: ppc64
> > with 4k pages would do better not to expect to support a 128TB userspace.
>
> I thought you had these huge 64k page sizes?

ppc64 does support 64k page sizes, and they've been the default for years;
but since 4k pages are still supported, I choose to use those (I doubt
I could ever get the same load going with 64k pages).

Hugh
On Wed, May 31, 2017 at 8:44 PM, Hugh Dickins <hughd@google.com> wrote:
[...]
> ppc64 does support 64k page sizes, and they've been the default for years;
> but since 4k pages are still supported, I choose to use those (I doubt
> I could ever get the same load going with 64k pages).

4k is pretty much required on ppc64 when it comes to nouveau:
https://bugs.freedesktop.org/show_bug.cgi?id=94757

2cts
> > I am curious as to what is going on there. Do you have the output from
> > these failed allocations?
>
> I thought the relevant output was in my mail. I did skip the Mem-Info
> dump, since that just seemed noise in this case: we know memory can get
> fragmented. What more output are you looking for?

The output for the failing allocations when you disable debugging. For
that I would think you need to remove(!) the slub_debug statement on the
kernel command line. You can verify that debug is off by inspecting the
values in /sys/kernel/slab/<yourcache>/<debug option>

> But it was still order 4 when booted with slub_debug=O, which surprised me.
> And that surprises you too? If so, then we ought to dig into it further.

No, it no longer does. I don't think slub_debug=O disables debugging
(frankly I am not sure what it does). Please do not specify any debug
options.
On Thu, 1 Jun 2017, Christoph Lameter wrote:

> > > I am curious as to what is going on there. Do you have the output from
> > > these failed allocations?
> >
> > I thought the relevant output was in my mail. I did skip the Mem-Info
> > dump, since that just seemed noise in this case: we know memory can get
> > fragmented. What more output are you looking for?
>
> The output for the failing allocations when you disable debugging. For
> that I would think you need to remove(!) the slub_debug statement on the
> kernel command line. You can verify that debug is off by inspecting the
> values in /sys/kernel/slab/<yourcache>/<debug option>

The output was with debugging disabled. Except when I tried adding that
slub_debug=O on the kernel command line, as the warning suggested, I did
not have any slub_debug statement on the command line; and did not have
CONFIG_SLUB_DEBUG_ON=y. My SLAB|SLUB config options are

CONFIG_SLUB_DEBUG=y
# CONFIG_SLUB_MEMCG_SYSFS_ON is not set
# CONFIG_SLAB is not set
CONFIG_SLUB=y
# CONFIG_SLAB_FREELIST_RANDOM is not set
CONFIG_SLUB_CPU_PARTIAL=y
CONFIG_SLABINFO=y
# CONFIG_SLUB_DEBUG_ON is not set
CONFIG_SLUB_STATS=y

> > But it was still order 4 when booted with slub_debug=O, which surprised me.
> > And that surprises you too? If so, then we ought to dig into it further.
>
> No, it no longer does. I don't think slub_debug=O disables debugging
> (frankly I am not sure what it does). Please do not specify any debug
> options.

But I think you are now surprised, when I say no slub_debug options
were on. Here's the output from /sys/kernel/slab/pgtable-2^12/*
(before I tried the new kernel with Aneesh's fix patch)
in case they tell you anything...
pgtable-2^12/aliases:0
pgtable-2^12/align:32768
grep: pgtable-2^12/alloc_calls: Function not implemented
pgtable-2^12/alloc_fastpath:5847 C0=1587 C1=1449 C2=1392 C3=1419
pgtable-2^12/alloc_from_partial:12637 C0=3292 C1=3020 C2=3051 C3=3274
pgtable-2^12/alloc_node_mismatch:0
pgtable-2^12/alloc_refill:41038 C0=10600 C1=10025 C2=10191 C3=10222
pgtable-2^12/alloc_slab:517 C0=148 C1=110 C2=105 C3=154
pgtable-2^12/alloc_slowpath:54203 C0=14041 C1=13157 C2=13349 C3=13656
pgtable-2^12/cache_dma:0
pgtable-2^12/cmpxchg_double_cpu_fail:0
pgtable-2^12/cmpxchg_double_fail:0
pgtable-2^12/cpu_partial:2
pgtable-2^12/cpu_partial_alloc:25894 C0=6719 C1=6334 C2=6288 C3=6553
pgtable-2^12/cpu_partial_drain:8441 C0=2035 C1=2211 C2=2268 C3=1927
pgtable-2^12/cpu_partial_free:38987 C0=9642 C1=10042 C2=10132 C3=9171
pgtable-2^12/cpu_partial_node:12237 C0=3183 C1=2928 C2=2961 C3=3165
pgtable-2^12/cpu_slabs:11
pgtable-2^12/cpuslab_flush:17 C0=5 C2=4 C3=8
pgtable-2^12/ctor:pgd_ctor+0x0/0x18
pgtable-2^12/deactivate_bypass:39027 C0=10153 C1=9463 C2=9439 C3=9972
pgtable-2^12/deactivate_empty:446 C0=98 C1=118 C2=123 C3=107
pgtable-2^12/deactivate_full:16 C0=5 C2=3 C3=8
pgtable-2^12/deactivate_remote_frees:0
pgtable-2^12/deactivate_to_head:1 C2=1
pgtable-2^12/deactivate_to_tail:0
pgtable-2^12/destroy_by_rcu:0
pgtable-2^12/free_add_partial:24877 C0=6007 C1=6515 C2=6681 C3=5674
grep: pgtable-2^12/free_calls: Function not implemented
pgtable-2^12/free_fastpath:5849 C0=1587 C1=1449 C2=1394 C3=1419
pgtable-2^12/free_frozen:15145 C0=3989 C1=3701 C2=3683 C3=3772
pgtable-2^12/free_remove_partial:0
pgtable-2^12/free_slab:446 C0=98 C1=118 C2=123 C3=107
pgtable-2^12/free_slowpath:54132 C0=13631 C1=13743 C2=13815 C3=12943
pgtable-2^12/hwcache_align:0
pgtable-2^12/min_partial:8
pgtable-2^12/object_size:32768
pgtable-2^12/objects:67
pgtable-2^12/objects_partial:0
pgtable-2^12/objs_per_slab:1
pgtable-2^12/order:4
pgtable-2^12/order_fallback:13 C0=2 C1=1 C2=5 C3=5
pgtable-2^12/partial:4
pgtable-2^12/poison:0
pgtable-2^12/reclaim_account:0
pgtable-2^12/red_zone:0
pgtable-2^12/reserved:0
pgtable-2^12/sanity_checks:0
pgtable-2^12/slab_size:65536
pgtable-2^12/slabs:71
pgtable-2^12/slabs_cpu_partial:7(7) C0=1(1) C1=3(3) C2=1(1) C3=2(2)
pgtable-2^12/store_user:0
pgtable-2^12/total_objects:71
pgtable-2^12/trace:0

Hugh
On Thu, 1 Jun 2017, Hugh Dickins wrote:

> CONFIG_SLUB_DEBUG_ON=y. My SLAB|SLUB config options are
>
> CONFIG_SLUB_DEBUG=y
> # CONFIG_SLUB_MEMCG_SYSFS_ON is not set
> # CONFIG_SLAB is not set
> CONFIG_SLUB=y
> # CONFIG_SLAB_FREELIST_RANDOM is not set
> CONFIG_SLUB_CPU_PARTIAL=y
> CONFIG_SLABINFO=y
> # CONFIG_SLUB_DEBUG_ON is not set
> CONFIG_SLUB_STATS=y

That's fine.

> But I think you are now surprised, when I say no slub_debug options
> were on. Here's the output from /sys/kernel/slab/pgtable-2^12/*
> (before I tried the new kernel with Aneesh's fix patch)
> in case they tell you anything...
>
> pgtable-2^12/poison:0
> pgtable-2^12/red_zone:0
> pgtable-2^12/reserved:0
> pgtable-2^12/sanity_checks:0
> pgtable-2^12/store_user:0

Ok, so debugging was off, but the slab cache has a ctor callback, which
mandates that the free pointer cannot use the free object space when the
object is not in use. Thus the size of the object must be increased to
accommodate the free pointer.
On Thu, 1 Jun 2017, Christoph Lameter wrote:
>
> Ok, so debugging was off, but the slab cache has a ctor callback, which
> mandates that the free pointer cannot use the free object space when the
> object is not in use. Thus the size of the object must be increased to
> accommodate the free pointer.

Thanks a lot for working that out. Makes sense, fully understood now,
nothing to worry about (though makes one wonder whether it's efficient
to use ctors on high-alignment caches; or whether an internal "zero-me"
ctor would be useful).

Hugh
Hugh Dickins <hughd@google.com> writes:
> On Thu, 1 Jun 2017, Christoph Lameter wrote:
>>
>> Ok, so debugging was off, but the slab cache has a ctor callback, which
>> mandates that the free pointer cannot use the free object space when the
>> object is not in use. Thus the size of the object must be increased to
>> accommodate the free pointer.
>
> Thanks a lot for working that out. Makes sense, fully understood now,
> nothing to worry about (though makes one wonder whether it's efficient
> to use ctors on high-alignment caches; or whether an internal "zero-me"
> ctor would be useful).

Or should we just be using kmem_cache_zalloc() when we allocate from
those slabs?

Given all the ctors do is memset to 0.

cheers
On Fri, 2 Jun 2017, Michael Ellerman wrote:

> Hugh Dickins <hughd@google.com> writes:
> > Thanks a lot for working that out. Makes sense, fully understood now,
> > nothing to worry about (though makes one wonder whether it's efficient
> > to use ctors on high-alignment caches; or whether an internal "zero-me"
> > ctor would be useful).
>
> Or should we just be using kmem_cache_zalloc() when we allocate from
> those slabs?
>
> Given all the ctors do is memset to 0.

I'm not sure. From a memory-utilization point of view, with SLUB,
using kmem_cache_zalloc() there would certainly be better.

But you may be forgetting that the constructor is applied only when a
new slab of objects is allocated, not each time an object is allocated
from that slab (and the user of those objects agrees to free objects
back to the cache in a reusable state: zeroed in this case). So from a
cpu-utilization point of view, it's better to use the ctor: it's saving
you lots of redundant memsets.

SLUB versus SLAB, cpu versus memory? Since someone has taken the
trouble to write it with ctors in the past, I didn't feel on firm
enough ground to recommend such a change. But it may be obvious
to someone else that your suggestion would be better (or worse).

Hugh
On Thu, 1 Jun 2017, Hugh Dickins wrote:

> Thanks a lot for working that out. Makes sense, fully understood now,
> nothing to worry about (though makes one wonder whether it's efficient
> to use ctors on high-alignment caches; or whether an internal "zero-me"
> ctor would be useful).

Use kzalloc to zero it.

And here is another example of using slab allocations for page frames.
Use the page allocator for this? The page allocator is there for
allocating page frames. The slab allocator's main purpose is to allocate
small objects....
On Thu, 1 Jun 2017, Hugh Dickins wrote:

> SLUB versus SLAB, cpu versus memory? Since someone has taken the
> trouble to write it with ctors in the past, I didn't feel on firm
> enough ground to recommend such a change. But it may be obvious
> to someone else that your suggestion would be better (or worse).

Umm, how about using alloc_pages() for page frames?
Hugh Dickins <hughd@google.com> writes:
> On Fri, 2 Jun 2017, Michael Ellerman wrote:
>> Or should we just be using kmem_cache_zalloc() when we allocate from
>> those slabs?
>>
>> Given all the ctors do is memset to 0.
>
> I'm not sure. From a memory-utilization point of view, with SLUB,
> using kmem_cache_zalloc() there would certainly be better.
>
> But you may be forgetting that the constructor is applied only when a
> new slab of objects is allocated, not each time an object is allocated
> from that slab (and the user of those objects agrees to free objects
> back to the cache in a reusable state: zeroed in this case).

Ah yes, I was "forgetting" that :) - ie. didn't know it.

> So from a cpu-utilization point of view, it's better to use the ctor:
> it's saving you lots of redundant memsets.

OK. Presumably we guarantee (somewhere) that the page tables are zeroed
before we free them, which is a natural result of tearing down all
mappings?

But then I see other arches (x86, arm64 at least), which don't use a
constructor, and use __GFP_ZERO (via PGALLOC_GFP) at allocation time.

eg. arm64:

	pgd_cache = kmem_cache_create("pgd_cache", PGD_SIZE, PGD_SIZE,
				      SLAB_PANIC, NULL);
	...
	return kmem_cache_alloc(pgd_cache, PGALLOC_GFP);

So that's a bit puzzling.

cheers
Christoph Lameter <cl@linux.com> writes:
> On Thu, 1 Jun 2017, Hugh Dickins wrote:
>
>> Thanks a lot for working that out. Makes sense, fully understood now,
>> nothing to worry about (though makes one wonder whether it's efficient
>> to use ctors on high-alignment caches; or whether an internal "zero-me"
>> ctor would be useful).
>
> Use kzalloc to zero it.

But that's changing a per-slab-creation memset into a per-object-allocation
memset, isn't it?

> And here is another example of using slab allocations for page frames.
> Use the page allocator for this? The page allocator is there for
> allocating page frames. The slab allocator's main purpose is to allocate
> small objects....

Well usually they are small (< PAGE_SIZE), because we have 64K pages.

But we could rework the code to use the page allocator on 4K configs.

cheers
--- 4.12-rc2/arch/powerpc/include/asm/book3s/64/hash-4k.h
+++ linux/arch/powerpc/include/asm/book3s/64/hash-4k.h
@@ -8,7 +8,7 @@
 #define H_PTE_INDEX_SIZE  9
 #define H_PMD_INDEX_SIZE  7
 #define H_PUD_INDEX_SIZE  9
-#define H_PGD_INDEX_SIZE  12
+#define H_PGD_INDEX_SIZE  9
 
 #ifndef __ASSEMBLY__
 #define H_PTE_TABLE_SIZE	(sizeof(pte_t) << H_PTE_INDEX_SIZE)
--- 4.12-rc2/arch/powerpc/include/asm/processor.h
+++ linux/arch/powerpc/include/asm/processor.h
@@ -110,7 +110,7 @@ void release_thread(struct task_struct *
 #define TASK_SIZE_128TB (0x0000800000000000UL)
 #define TASK_SIZE_512TB (0x0002000000000000UL)
 
-#ifdef CONFIG_PPC_BOOK3S_64
+#if defined(CONFIG_PPC_BOOK3S_64) && defined(CONFIG_PPC_64K_PAGES)
 /*
  * Max value currently used:
  */