Message ID: 1488327283-177710-2-git-send-email-pasha.tatashin@oracle.com
State: Not Applicable
Delegated to: David Miller
Pavel Tatashin <pasha.tatashin@oracle.com> writes:
>
> While investigating how to improve initialization time of dentry_hashtable
> which is 8G long on M6 ldom with 7T of main memory, I noticed that memset()

I don't think an 8G dentry (or other kernel) hash table makes much sense.
I would rather fix the hash table sizing algorithm to have some reasonable
upper limit than optimize the zeroing.

I believe there are already boot options for it, but it would be better if
it worked out of the box.

-Andi

--
To unsubscribe from this list: send the line "unsubscribe sparclinux" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On 2017-02-28 19:24, Andi Kleen wrote:
> Pavel Tatashin <pasha.tatashin@oracle.com> writes:
>>
>> While investigating how to improve initialization time of dentry_hashtable
>> which is 8G long on M6 ldom with 7T of main memory, I noticed that memset()
>
> I don't think a 8G dentry (or other kernel) hash table makes much
> sense. I would rather fix the hash table sizing algorithm to have some
> reasonable upper limit than to optimize the zeroing.
>
> I believe there are already boot options for it, but it would be better
> if it worked out of the box.
>
> -Andi

Hi Andi,

I agree that there should be some smarter cap for the maximum hash table
sizes, and as you said it is already possible to set the limits via
parameters. I still think, however, that this HASH_ZERO patch makes sense
for the following reasons:

- Even if the default maximum size is reduced, the size of these tables
  should still be tunable, as it really depends on the way the machine is
  used, and it is possible that for some use patterns large hash tables
  are necessary.

- Most of them are initialized before the smp_init() call. The time from
  bootloader to smp_init() should be minimized, as parallelization is not
  available yet. For example, the LDOM domain on which I tested this patch
  (760 CPUs and 7T of memory) takes 8.5 seconds to get from grub to
  smp_init() with a few more optimizations; out of these 8.5 seconds, 3.1s
  (vs. 11.8s before this patch) are spent initializing these hash tables.

So, even 3.1s is still significant and should be improved further by
changing the default maximums, but that should be a different patch.
Thank you,
Pasha
> - Even if the default maximum size is reduced the size of these
> tables should still be tunable, as it really depends on the way
> machine is used, and in it is possible that for some use patterns
> large hash tables are necessary.

I consider it very unlikely that an 8G dentry hash table ever makes sense.
I cannot even imagine a workload where you would have that many active
files. It's just a bad configuration that should be avoided.

And when the tables are small enough you don't need these hacks.

-Andi
Hi Andi,

Thank you for your comment. I am thinking of limiting the default maximum
hash table size to 512M. If it is bigger than 512M, we would still need my
patch to improve the performance, because initialization of the hash tables
would otherwise still take over 1s out of the 6s bootloader-to-smp_init()
interval on larger machines.

I am not sure HASH_ZERO is a hack: if you look at the way pv_lock_hash is
allocated, it assumes that the memory is already zeroed since it provides
the HASH_EARLY flag. It quietly assumes that the memblock boot allocator
zeroes the memory for us. On the other hand, in other places where
HASH_EARLY is specified we still explicitly zero the hashes. At least with
the HASH_ZERO flag this becomes a defined interface, and in the future, if
the memblock allocator is changed to zero memory only on demand (as it
really should), the HASH_ZERO flag can be passed there the same way it is
passed to vmalloc() in my patch.

Does something like this look OK to you? If yes, I will send out a new
patch.

index 1b0f7a4..5ddf741 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -79,6 +79,12 @@ EXPORT_PER_CPU_SYMBOL(numa_node);
 #endif
 
+/*
+ * This is the default maximum number of entries system hashes can have, the
+ * value can be overwritten by setting hash table sizes via kernel parameters.
+ */
+#define SYSTEM_HASH_MAX_ENTRIES	(1 << 26)
+
 #ifdef CONFIG_HAVE_MEMORYLESS_NODES
 /*
  * N.B., Do NOT reference the '_numa_mem_' per cpu variable directly.
@@ -7154,6 +7160,11 @@ static unsigned long __init arch_reserved_kernel_pages(void)
 	if (PAGE_SHIFT < 20)
 		numentries = round_up(numentries, (1<<20)/PAGE_SIZE);
 
+	/* Limit default maximum number of entries */
+	if (numentries > SYSTEM_HASH_MAX_ENTRIES) {
+		numentries = SYSTEM_HASH_MAX_ENTRIES;
+	}
+
 	/* limit to 1 bucket per 2^scale bytes of low memory */
 	if (scale > PAGE_SHIFT)
 		numentries >>= (scale - PAGE_SHIFT);

Thank you,
Pasha

On 2017-03-01 10:19, Andi Kleen wrote:
>> - Even if the default maximum size is reduced the size of these
>> tables should still be tunable, as it really depends on the way
>> machine is used, and in it is possible that for some use patterns
>> large hash tables are necessary.
>
> I consider it very unlikely that a 8G dentry hash table ever makes
> sense. I cannot even imagine a workload where you would have that
> many active files. It's just a bad configuration that should be avoided.
>
> And when the tables are small enough you don't need these hacks.
>
> -Andi
On Wed, Mar 01, 2017 at 11:34:10AM -0500, Pasha Tatashin wrote:
> Hi Andi,
>
> Thank you for your comment, I am thinking to limit the default
> maximum hash tables sizes to 512M.
>
> If it is bigger than 512M, we would still need my patch to improve

Even 512MB seems too large. I wouldn't go larger than a few tens of MB,
maybe 32MB. Also you would need to cover all the big hashes.

The most critical ones are likely the network hash tables; these may be a
bit larger (but certainly also not 0.5TB).

-Andi
Hi Andi,

After thinking some more about this issue, I figured that I would not want
to set default maximums.

Currently, the defaults are scaled with system memory size, which seems
like the right thing to do to me. They are set to size the hash tables at
one entry per page and, if a scale argument is provided, to scale them down
to 1/2, 1/4, 1/8 entry per page, etc.

So, in some cases the scale argument may be wrong, and dentry, inode, or
some other client of alloc_large_system_hash() should be adjusted. For
example, I am pretty sure that the scale value in most places should be
changed from a literal value (inode scale = 14, dentry scale = 13, etc.)
to (PAGE_SHIFT + value): the inode scale would become (PAGE_SHIFT + 2),
the dentry scale would become (PAGE_SHIFT + 1), etc. This is because we
want 1/4 inode and 1/2 dentry entries for every page in the system. In
alloc_large_system_hash() we have basically this:

	numentries = nr_kernel_pages >> (scale - PAGE_SHIFT);

Using literal scales here is basically a bug; fixing it would not change
the theory, but I am sure that changing scales without at least some
theoretical backup is not a good idea and would most likely lead to
regressions, especially on some smaller configurations.

Therefore, in my opinion, having one fast way to zero hash tables, as this
patch tries to do, is a good thing. In the next patch revision I can go
ahead and change the scales to be (PAGE_SHIFT + val) instead of the
current literals.

Thank you,
Pasha

On 2017-03-01 12:31, Andi Kleen wrote:
> On Wed, Mar 01, 2017 at 11:34:10AM -0500, Pasha Tatashin wrote:
>> Hi Andi,
>>
>> Thank you for your comment, I am thinking to limit the default
>> maximum hash tables sizes to 512M.
>>
>> If it is bigger than 512M, we would still need my patch to improve
>
> Even 512MB seems too large. I wouldn't go larger than a few tens
> of MB, maybe 32MB.
>
> Also you would need to cover all the big hashes.
> The most critical ones are likely the network hash tables; these
> may be a bit larger (but certainly also not 0.5TB)
>
> -Andi
> For example, I am pretty sure that scale value in most places should
> be changed from literal value (inode scale = 14, dentry scale = 13,
> etc to: (PAGE_SHIFT + value): inode scale would become (PAGE_SHIFT +
> 2), dentry scale would become (PAGE_SHIFT + 1), etc. This is because
> we want 1/4 inodes and 1/2 dentries per every page in the system.

This is still far too much for a large system. The algorithm simply was
not designed for TB systems. You are unlikely to have anywhere near that
many small files active; it's better to use the memory for something that
is actually useful. Also, even a few hops in the open hash table are
normally not a problem for dentry/inode; file lookups are not that
critical. For networking the picture may be different, but I suspect GBs
worth of hash tables are still overkill there (Dave et al. may have
stronger opinions on this).

I think an upper size (with the user override which already exists) is
fine, but if you really don't want to do that, then scale the factor down
very aggressively for larger sizes, so that we don't end up with more than
a few tens of MB.

> This is basically a bug, and would not change the theory, but I am
> sure that changing scales without at least some theoretical backup

One dentry per page would only make sense if the files were zero sized.
If a file has even one byte, it already needs more than one page just to
cache its contents (even ignoring inodes and other caches). With larger
files that need multiple pages it makes even less sense. So clearly the
one-dentry-per-page theory is nonsense if the files are actually used.

There is the "make find / + stat fast" case (where only the dentries and
inodes are cached). But even there it is unlikely that the TB system has a
much larger file system with more files than the 100GB system, so once a
reasonable plateau is reached I don't see why you would want to exceed it.
Also, the reason to make hash tables big is to minimize collisions, but we
have fairly good hash functions, and a few hops in the worst case are
likely not a problem for an already expensive file access or open.

BTW, the other option would be to switch all the large system hashes to a
rhashtable and do the resizing only when it is actually needed. But that
would be more work than just adding a reasonable upper limit.

-Andi
On Wed, Mar 01, 2017 at 04:20:28PM -0500, Pasha Tatashin wrote:
> Hi Andi,
>
> After thinking some more about this issue, I figured that I would not want
> to set default maximums.
>
> Currently, the defaults are scaled with system memory size, which seems like
> the right thing to do to me. They are set to size hash tables one entry per
> page and, if a scale argument is provided, scale them down to 1/2, 1/4, 1/8
> entry per page etc.

I disagree that it's the right thing to do. You want your dentry cache to
scale with the number of dentries in use. Scaling with memory size is a
reasonable approximation for smaller memory sizes, but allocating 8GB of
*hash table entries* for dentries is plainly ridiculous, no matter how
much memory you have. You won't have half a billion dentries active in
most uses of such a large machine.
Hi Andi,

> I think a upper size (with user override which already exists) is fine,
> but if you really don't want to do it then scale the factor down
> very aggressively for larger sizes, so that we don't end up with more
> than a few tens of MB.

I have scaled it; I do not think setting a default upper limit is a
future-proof strategy.

Thank you,
Pasha
diff --git a/arch/sparc/lib/NG4memset.S b/arch/sparc/lib/NG4memset.S
index 41da4bd..e7c2e70 100644
--- a/arch/sparc/lib/NG4memset.S
+++ b/arch/sparc/lib/NG4memset.S
@@ -13,14 +13,14 @@
 	.globl	NG4memset
 NG4memset:
 	andcc	%o1, 0xff, %o4
-	be,pt	%icc, 1f
+	be,pt	%xcc, 1f
 	 mov	%o2, %o1
 	sllx	%o4, 8, %g1
 	or	%g1, %o4, %o2
 	sllx	%o2, 16, %g1
 	or	%g1, %o2, %o2
 	sllx	%o2, 32, %g1
-	ba,pt	%icc, 1f
+	ba,pt	%xcc, 1f
 	 or	%g1, %o2, %o4
 	.size	NG4memset,.-NG4memset
@@ -29,7 +29,7 @@ NG4memset:
 NG4bzero:
 	clr	%o4
 1:	cmp	%o1, 16
-	ble	%icc, .Ltiny
+	ble	%xcc, .Ltiny
 	 mov	%o0, %o3
 	sub	%g0, %o0, %g1
 	and	%g1, 0x7, %g1
@@ -37,7 +37,7 @@ NG4bzero:
 	sub	%o1, %g1, %o1
 1:	stb	%o4, [%o0 + 0x00]
 	subcc	%g1, 1, %g1
-	bne,pt	%icc, 1b
+	bne,pt	%xcc, 1b
 	 add	%o0, 1, %o0
 .Laligned8:
 	cmp	%o1, 64 + (64 - 8)
@@ -48,7 +48,7 @@ NG4bzero:
 	sub	%o1, %g1, %o1
 1:	stx	%o4, [%o0 + 0x00]
 	subcc	%g1, 8, %g1
-	bne,pt	%icc, 1b
+	bne,pt	%xcc, 1b
 	 add	%o0, 0x8, %o0
 .Laligned64:
 	andn	%o1, 64 - 1, %g1
@@ -58,30 +58,30 @@ NG4bzero:
 1:	stxa	%o4, [%o0 + %g0] ASI_BLK_INIT_QUAD_LDD_P
 	subcc	%g1, 0x40, %g1
 	stxa	%o4, [%o0 + %g2] ASI_BLK_INIT_QUAD_LDD_P
-	bne,pt	%icc, 1b
+	bne,pt	%xcc, 1b
 	 add	%o0, 0x40, %o0
.Lpostloop:
 	cmp	%o1, 8
-	bl,pn	%icc, .Ltiny
+	bl,pn	%xcc, .Ltiny
 	 membar	#StoreStore|#StoreLoad
.Lmedium:
 	andn	%o1, 0x7, %g1
 	sub	%o1, %g1, %o1
 1:	stx	%o4, [%o0 + 0x00]
 	subcc	%g1, 0x8, %g1
-	bne,pt	%icc, 1b
+	bne,pt	%xcc, 1b
 	 add	%o0, 0x08, %o0
 	andcc	%o1, 0x4, %g1
-	be,pt	%icc, .Ltiny
+	be,pt	%xcc, .Ltiny
 	 sub	%o1, %g1, %o1
 	stw	%o4, [%o0 + 0x00]
 	add	%o0, 0x4, %o0
.Ltiny:
 	cmp	%o1, 0
-	be,pn	%icc, .Lexit
+	be,pn	%xcc, .Lexit
 1:	subcc	%o1, 1, %o1
 	stb	%o4, [%o0 + 0x00]
-	bne,pt	%icc, 1b
+	bne,pt	%xcc, 1b
 	 add	%o0, 1, %o0
.Lexit:
 	retl
@@ -99,7 +99,7 @@ NG4bzero:
 	stxa	%o4, [%o0 + %g2] ASI_BLK_INIT_QUAD_LDD_P
 	stxa	%o4, [%o0 + %g3] ASI_BLK_INIT_QUAD_LDD_P
 	stxa	%o4, [%o0 + %o5] ASI_BLK_INIT_QUAD_LDD_P
-	bne,pt	%icc, 1b
+	bne,pt	%xcc, 1b
 	 add	%o0, 0x30, %o0
-	ba,a,pt	%icc, .Lpostloop
+	ba,a,pt	%xcc, .Lpostloop
 	.size	NG4bzero,.-NG4bzero