Message ID | 20230328152258.54844-2-cupertino.miranda@oracle.com
---|---
State | New
Series | *** Created tunable to force small pages on stack allocation.
On 28/03/23 12:22, Cupertino Miranda via Libc-alpha wrote:

> Created tunable glibc.pthread.stack_hugetlb to control when hugepages
> can be used for stack allocation.
> In case THP is enabled and glibc.pthread.stack_hugetlb is set to
> 0, glibc will madvise the kernel not to allow hugepages for stack
> allocations.
>
> Changes from v1:
>  - Removed the __malloc_thp_mode calls to check if hugetlb is
>    enabled.
>
> Changes from v2:
>  - Added entry in manual/tunables.texi.
>  - Fixed tunable default in description.
>  - Code style corrections.
>
> Changes from v3:
>  - Improved tunables.texi.
>
> Changes from v4:
>  - Improved text in tunables.texi as suggested by Adhemerval.

Florian has raised a concern [1] that the reported RSS increase is not
technically correct, because once the kernel needs to split a huge page
it does not need to keep all of the resulting small pages (only the one
that actually generated the soft fault).

However, this is not what I see using the previous testcase, which
creates a lot of threads to force THP usage and checks
/proc/self/smaps_rollup.  The resulting 'Private_Dirty' still accounts
for *all* of the default small pages once the kernel decides to split
the huge page, and a recent OpenJDK thread [2] reports the same outcome.

As far as I understand, the kernel does not keep track of which small
pages within the THP have already been touched when the guard page is
mprotect'ed (which forces the split), so when it reverts back to using
default pages it keeps all of them.  A recent kernel discussion [3]
reached a similar conclusion.

So this patch is LGTM, and I will install it shortly.

I also discussed on the same call whether it would be better to make the
madvise the *default* behavior when pthread stack usage will always end
up requiring the kernel to split the huge pages back into default pages,
i.e. when:

  1. THP (/sys/kernel/mm/transparent_hugepage/enabled) is set to
     'always'.

  2. The stack size is a multiple of the THP size
     (/sys/kernel/mm/transparent_hugepage/hpage_pmd_size).

  3. The stack size minus the guard size is still a multiple of the THP
     size ((stack_size - guard_size) % thp_size == 0).

It does not mean that the stack will automatically be backed by THP, but
it does mean that, depending on the process VMAs, it might generate some
RSS waste once the kernel decides to use THP for the stack.  It would
also make the tunable unnecessary.

[1] https://sourceware.org/glibc/wiki/PatchworkReviewMeetings
[2] https://bugs.openjdk.org/browse/JDK-8303215?page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel&showAll=true
[3] https://lore.kernel.org/linux-mm/278ec047-4c5d-ab71-de36-094dbed4067c@redhat.com/T/

> ---
>  manual/tunables.texi          | 15 +++++++++++++++
>  nptl/allocatestack.c          |  6 ++++++
>  nptl/nptl-stack.c             |  1 +
>  nptl/nptl-stack.h             |  3 +++
>  nptl/pthread_mutex_conf.c     |  8 ++++++++
>  sysdeps/nptl/dl-tunables.list |  6 ++++++
>  6 files changed, 39 insertions(+)
>
> diff --git a/manual/tunables.texi b/manual/tunables.texi
> index 70dd2264c5..130f94b2bc 100644
> --- a/manual/tunables.texi
> +++ b/manual/tunables.texi
> @@ -459,6 +459,21 @@ registration on behalf of the application.
>  Restartable sequences are a Linux-specific extension.
>  @end deftp
>
> +@deftp Tunable glibc.pthread.stack_hugetlb
> +This tunable controls whether to use Huge Pages in the stacks created by
> +@code{pthread_create}.  This tunable only affects the stacks created by
> +@theglibc{}, it has no effect on stack assigned with
> +@code{pthread_attr_setstack}.
> +
> +The default is @samp{1} where the system default value is used.  Setting
> +its value to @code{0} enables the use of @code{madvise} with
> +@code{MADV_NOHUGEPAGE} after stack creation with @code{mmap}.
> +
> +This is a memory utilization optimization, since internal glibc setup of either
> +the thread descriptor and the guard page might force the kernel to move the
> +thread stack originally backup by Huge Pages to default pages.
> +@end deftp
> +
>  @node Hardware Capability Tunables
>  @section Hardware Capability Tunables
>  @cindex hardware capability tunables
> diff --git a/nptl/allocatestack.c b/nptl/allocatestack.c
> index c7adbccd6f..f9d8cdfd08 100644
> --- a/nptl/allocatestack.c
> +++ b/nptl/allocatestack.c
> @@ -369,6 +369,12 @@ allocate_stack (const struct pthread_attr *attr, struct pthread **pdp,
>        if (__glibc_unlikely (mem == MAP_FAILED))
>  	return errno;
>
> +      /* Do madvise in case the tunable glibc.pthread.stack_hugetlb is
> +	 set to 0, disabling hugetlb.  */
> +      if (__glibc_unlikely (__nptl_stack_hugetlb == 0)
> +	  && __madvise (mem, size, MADV_NOHUGEPAGE) != 0)
> +	return errno;
> +
>        /* SIZE is guaranteed to be greater than zero.
>  	 So we can never get a null pointer back from mmap.  */
>        assert (mem != NULL);
> diff --git a/nptl/nptl-stack.c b/nptl/nptl-stack.c
> index 5eb7773575..e829711cb5 100644
> --- a/nptl/nptl-stack.c
> +++ b/nptl/nptl-stack.c
> @@ -21,6 +21,7 @@
>  #include <pthreadP.h>
>
>  size_t __nptl_stack_cache_maxsize = 40 * 1024 * 1024;
> +int32_t __nptl_stack_hugetlb = 1;
>
>  void
>  __nptl_stack_list_del (list_t *elem)
> diff --git a/nptl/nptl-stack.h b/nptl/nptl-stack.h
> index 34f8bbb15e..cf90b27c2b 100644
> --- a/nptl/nptl-stack.h
> +++ b/nptl/nptl-stack.h
> @@ -27,6 +27,9 @@
>  /* Maximum size of the cache, in bytes.  40 MiB by default.  */
>  extern size_t __nptl_stack_cache_maxsize attribute_hidden;
>
> +/* Should allow stacks to use hugetlb.  (1) is default.  */
> +extern int32_t __nptl_stack_hugetlb;
> +
>  /* Check whether the stack is still used or not.  */
>  static inline bool
>  __nptl_stack_in_use (struct pthread *pd)
> diff --git a/nptl/pthread_mutex_conf.c b/nptl/pthread_mutex_conf.c
> index 329c4cbb8f..60ef9095aa 100644
> --- a/nptl/pthread_mutex_conf.c
> +++ b/nptl/pthread_mutex_conf.c
> @@ -45,6 +45,12 @@ TUNABLE_CALLBACK (set_stack_cache_size) (tunable_val_t *valp)
>    __nptl_stack_cache_maxsize = valp->numval;
>  }
>
> +static void
> +TUNABLE_CALLBACK (set_stack_hugetlb) (tunable_val_t *valp)
> +{
> +  __nptl_stack_hugetlb = (int32_t) valp->numval;
> +}
> +
>  void
>  __pthread_tunables_init (void)
>  {
> @@ -52,5 +58,7 @@ __pthread_tunables_init (void)
>  	       TUNABLE_CALLBACK (set_mutex_spin_count));
>    TUNABLE_GET (stack_cache_size, size_t,
>  	       TUNABLE_CALLBACK (set_stack_cache_size));
> +  TUNABLE_GET (stack_hugetlb, int32_t,
> +	       TUNABLE_CALLBACK (set_stack_hugetlb));
>  }
>  #endif
> diff --git a/sysdeps/nptl/dl-tunables.list b/sysdeps/nptl/dl-tunables.list
> index bd1ddb121d..4cde9500b6 100644
> --- a/sysdeps/nptl/dl-tunables.list
> +++ b/sysdeps/nptl/dl-tunables.list
> @@ -33,5 +33,11 @@ glibc {
>        maxval: 1
>        default: 1
>      }
> +    stack_hugetlb {
> +      type: INT_32
> +      minval: 0
> +      maxval: 1
> +      default: 1
> +    }
>    }
>  }
Hi Adhemerval, everyone,

Thanks for the approval, the detailed analysis, and the time spent on the topic.

Best regards,
Cupertino

Adhemerval Zanella Netto writes:

> On 28/03/23 12:22, Cupertino Miranda via Libc-alpha wrote:
>> Created tunable glibc.pthread.stack_hugetlb to control when hugepages
>> can be used for stack allocation.
>> [...]
>
> [...]
>
> So this patch is LGTM, and I will install this shortly.
>
> [...]
On 12/04/23 05:53, Cupertino Miranda wrote:

>> So this patch is LGTM, and I will install this shortly.
>>
>> I also discussed on the same call whether it would be better to make
>> the madvise the *default* behavior when pthread stack usage will
>> always end up requiring the kernel to split the huge pages back into
>> default pages [...]

I implemented my idea above, which should cover the issue you raised
without the need for the extra tunable.  It seems that, if the kernel
cannot keep track of the touched 'subpages' once THP is used for the
stack allocation, it should always be an improvement to madvise
(MADV_NOHUGEPAGE).

What do you think?

---

[PATCH] nptl: Disable THP on thread stack if it incurs large RSS usage

If Transparent Huge Pages (THP) are set to 'always' and the resulting
address and stack size are multiples of the THP size, the kernel may
use THP for the thread stack.  However, if the guard page size is not
a multiple of the THP size, once it is mprotect'ed the allocated range
can no longer be served with THP and the kernel reverts back to using
default page sizes.

However, the kernel might also not keep track of the offsets within
the THP that have been touched and need to reside in memory.  It will
then keep all the small pages, thus using much more memory than
required.  In this scenario, it is better to just madvise that huge
pages not be used and avoid the memory bloat.

__malloc_default_thp_pagesize and __malloc_thp_mode now cache the
obtained value to avoid reading and parsing the kernel information on
each thread creation (if the system changes its setting, the process
will not be able to adjust to it).

Checked on x86_64-linux-gnu.
---
 nptl/allocatestack.c                       | 32 +++++++++++++++
 sysdeps/generic/malloc-hugepages.h         |  1 +
 sysdeps/unix/sysv/linux/malloc-hugepages.c | 46 ++++++++++++++++++----
 3 files changed, 72 insertions(+), 7 deletions(-)

diff --git a/nptl/allocatestack.c b/nptl/allocatestack.c
index c7adbccd6f..d197edf2e9 100644
--- a/nptl/allocatestack.c
+++ b/nptl/allocatestack.c
@@ -33,6 +33,7 @@
 #include <nptl-stack.h>
 #include <libc-lock.h>
 #include <tls-internal.h>
+#include <malloc-hugepages.h>
 
 /* Default alignment of stack.  */
 #ifndef STACK_ALIGN
@@ -206,6 +207,33 @@ advise_stack_range (void *mem, size_t size, uintptr_t pd, size_t guardsize)
 #endif
 }
 
+/* If the Transparent Huge Page (THP) is set as 'always', the resulting
+   address and the stack size are multiple of THP size, kernel may use THP for
+   the thread stack.  However, if the guard page size is not multiple of THP,
+   once it is mprotect the allocate range could no longer be served with THP
+   and then kernel will revert back using default page sizes.
+
+   However, the kernel might also not keep track of the offsets within the THP
+   that has been touched and need to reside on the memory.  It will then keep
+   all the small pages, thus using much more memory than required.  In this
+   scenario, it is better to just madvise that not use huge pages and avoid
+   the memory bloat.  */
+static __always_inline int
+advise_thp (void *mem, size_t size, size_t guardsize)
+{
+  enum malloc_thp_mode_t thpmode = __malloc_thp_mode ();
+  if (thpmode != malloc_thp_mode_always)
+    return 0;
+
+  unsigned long int thpsize = __malloc_default_thp_pagesize ();
+  if ((uintptr_t) mem % thpsize != 0
+      || size % thpsize != 0
+      || (size - guardsize) % thpsize != 0)
+    return 0;
+
+  return __madvise (mem, size, MADV_NOHUGEPAGE);
+}
+
 /* Returns a usable stack for a new thread either by allocating a
    new stack or reusing a cached stack of sufficient size.
    ATTR must be non-NULL and point to a valid pthread_attr.
@@ -373,6 +401,10 @@ allocate_stack (const struct pthread_attr *attr, struct pthread **pdp,
      So we can never get a null pointer back from mmap.  */
   assert (mem != NULL);
 
+  int r = advise_thp (mem, size, guardsize);
+  if (r != 0)
+    return r;
+
   /* Place the thread descriptor at the end of the stack.  */
 #if TLS_TCB_AT_TP
   pd = (struct pthread *) ((((uintptr_t) mem + size)
diff --git a/sysdeps/generic/malloc-hugepages.h b/sysdeps/generic/malloc-hugepages.h
index d68b85630c..21d4844bc4 100644
--- a/sysdeps/generic/malloc-hugepages.h
+++ b/sysdeps/generic/malloc-hugepages.h
@@ -26,6 +26,7 @@ unsigned long int __malloc_default_thp_pagesize (void) attribute_hidden;
 
 enum malloc_thp_mode_t
 {
+  malloc_thp_mode_unknown,
   malloc_thp_mode_always,
   malloc_thp_mode_madvise,
   malloc_thp_mode_never,
diff --git a/sysdeps/unix/sysv/linux/malloc-hugepages.c b/sysdeps/unix/sysv/linux/malloc-hugepages.c
index 683d68c327..5954dd13f6 100644
--- a/sysdeps/unix/sysv/linux/malloc-hugepages.c
+++ b/sysdeps/unix/sysv/linux/malloc-hugepages.c
@@ -22,19 +22,33 @@
 #include <not-cancel.h>
 #include <sys/mman.h>
 
+/* The __malloc_thp_mode is called only in single-thread mode, either in
+   malloc initialization or pthread creation.  */
+static unsigned long int thp_pagesize = -1;
+
 unsigned long int
 __malloc_default_thp_pagesize (void)
 {
+  unsigned long int size = atomic_load_relaxed (&thp_pagesize);
+  if (size != -1)
+    return size;
+
   int fd = __open64_nocancel (
     "/sys/kernel/mm/transparent_hugepage/hpage_pmd_size", O_RDONLY);
   if (fd == -1)
-    return 0;
+    {
+      atomic_store_relaxed (&thp_pagesize, 0);
+      return 0;
+    }
 
   char str[INT_BUFSIZE_BOUND (unsigned long int)];
   ssize_t s = __read_nocancel (fd, str, sizeof (str));
   __close_nocancel (fd);
   if (s < 0)
-    return 0;
+    {
+      atomic_store_relaxed (&thp_pagesize, 0);
+      return 0;
+    }
 
   unsigned long int r = 0;
   for (ssize_t i = 0; i < s; i++)
@@ -44,16 +58,28 @@ __malloc_default_thp_pagesize (void)
       r *= 10;
       r += str[i] - '0';
     }
+  atomic_store_relaxed (&thp_pagesize, r);
   return r;
 }
 
+/* The __malloc_thp_mode is called only in single-thread mode, either in
+   malloc initialization or pthread creation.  */
+static enum malloc_thp_mode_t thp_mode = malloc_thp_mode_unknown;
+
 enum malloc_thp_mode_t
 __malloc_thp_mode (void)
 {
+  enum malloc_thp_mode_t mode = atomic_load_relaxed (&thp_mode);
+  if (mode != malloc_thp_mode_unknown)
+    return mode;
+
   int fd = __open64_nocancel ("/sys/kernel/mm/transparent_hugepage/enabled",
			      O_RDONLY);
   if (fd == -1)
-    return malloc_thp_mode_not_supported;
+    {
+      atomic_store_relaxed (&thp_mode, malloc_thp_mode_not_supported);
+      return malloc_thp_mode_not_supported;
+    }
 
   static const char mode_always[]  = "[always] madvise never\n";
   static const char mode_madvise[] = "always [madvise] never\n";
@@ -66,13 +92,19 @@ __malloc_thp_mode (void)
   if (s == sizeof (mode_always) - 1)
     {
       if (strcmp (str, mode_always) == 0)
-	return malloc_thp_mode_always;
+	mode = malloc_thp_mode_always;
       else if (strcmp (str, mode_madvise) == 0)
-	return malloc_thp_mode_madvise;
+	mode = malloc_thp_mode_madvise;
       else if (strcmp (str, mode_never) == 0)
-	return malloc_thp_mode_never;
+	mode = malloc_thp_mode_never;
+      else
+	mode = malloc_thp_mode_not_supported;
     }
-  return malloc_thp_mode_not_supported;
+  else
+    mode = malloc_thp_mode_not_supported;
+
+  atomic_store_relaxed (&thp_mode, mode);
+  return mode;
 }
 
 static size_t
Hi Adhemerval,

Sorry for taking a while to reply to you.

I compiled your patch with the particular example that led me to propose
the tunable, and it does not seem to optimize that case.

Looking at the code, I realized that the condition would most likely
never allow the madvise unless guardsize == thpsize, which appears
unrealistic.

+  if ((uintptr_t) mem % thpsize != 0          (is not aligned)
+      || size % thpsize != 0                  (1)
+      || (size - guardsize) % thpsize != 0)   (2)
+    return 0;

If (1) is not true, then (2) is likely to be.

I agree that for the scenarios where the huge pages would never be used,
we should advise the kernel not to bloat the memory.  In any case, I am
not fully sure you can always predict that.

Lastly, I would say that the tunable would still be useful.  For
example, when the application creates short-lived threads but allocates
hugepage-sized stacks to control alignment.

Regards,
Cupertino

Adhemerval Zanella Netto writes:

> On 12/04/23 05:53, Cupertino Miranda wrote:
> [...]
>
> I implemented my idea above, which should cover the issue you brought
> without the need of the extra tunable.  [...]
>
> What do you think?
>
> [...]
On 12/04/23 05:53, Cupertino Miranda wrote:
>
> Hi Adhemerval, everyone,
>
> Thanks the approval, detailed analysis and time spent on the topic.
>
> Best regards,
> Cupertino

Cupertino, could you also add a NEWS entry?  I recall that every tunable
is also mentioned in NEWS.
Hi Adhemerval,

I came up with the following text in the patch below.
Do you want me to send a new version, or is this good enough?

Regards,
Cupertino

diff --git a/NEWS b/NEWS
index c54af824e0..941808049a 100644
--- a/NEWS
+++ b/NEWS
@@ -21,6 +21,9 @@ Major new features:
 
 * PRIb* and PRIB* macros from C2X have been added to <inttypes.h>.
 
+* A new tunable, glibc.pthread.stack_hugetlb, can be used to disable
+  Transparent Huge Pages (THP) in pthread_create stack allocation.
+
 Deprecated and removed features, and other changes affecting compatibility:
 
 * In the Linux kernel for the hppa/parisc architecture some of the

Adhemerval Zanella Netto writes:

> Cupertino, could you also add a NEWS entry?  I recall that every
> tunable is also mentioned on NEWS.
On 14/04/23 09:27, Cupertino Miranda wrote:
>
> Hi Adhemerval,
>
> I came up with the following text in the patch below.
> Do you want me to do a new version or is this good enough?

Yes, so I can install it by pulling from patchwork.

>
> Regards,
> Cupertino
>
>
> diff --git a/NEWS b/NEWS
> index c54af824e0..941808049a 100644
> --- a/NEWS
> +++ b/NEWS
> @@ -21,6 +21,9 @@ Major new features:
>
>  * PRIb* and PRIB* macros from C2X have been added to <inttypes.h>.
>
> +* A new tunable, glibc.pthread.stack_hugetlb, can be used to disable
> +  Transparent Huge Pages (THP) in pthread_create stack allocation.
> +

Ok.

> Deprecated and removed features, and other changes affecting compatibility:
>
> * In the Linux kernel for the hppa/parisc architecture some of the
>
>
> Adhemerval Zanella Netto writes:
>
>> On 12/04/23 05:53, Cupertino Miranda wrote:
>>>
>>> Hi Adhemerval, everyone,
>>>
>>> Thanks for the approval, detailed analysis and time spent on the topic.
>>>
>>> Best regards,
>>> Cupertino
>>>
>>
>> Cupertino, could you also add a NEWS entry? I recall that every tunable
>> is also mentioned in NEWS.
Sent!

Adhemerval Zanella Netto writes:

> On 14/04/23 09:27, Cupertino Miranda wrote:
>>
>> Hi Adhemerval,
>>
>> I came up with the following text in the patch below.
>> Do you want me to do a new version or is this good enough?
>
> Yes, so I can install it by pulling from patchwork.
>
>>
>> Regards,
>> Cupertino
>>
>>
>> diff --git a/NEWS b/NEWS
>> index c54af824e0..941808049a 100644
>> --- a/NEWS
>> +++ b/NEWS
>> @@ -21,6 +21,9 @@ Major new features:
>>
>>  * PRIb* and PRIB* macros from C2X have been added to <inttypes.h>.
>>
>> +* A new tunable, glibc.pthread.stack_hugetlb, can be used to disable
>> +  Transparent Huge Pages (THP) in pthread_create stack allocation.
>> +
>
> Ok.
>
>> Deprecated and removed features, and other changes affecting compatibility:
>>
>> * In the Linux kernel for the hppa/parisc architecture some of the
>>
>>
>> Adhemerval Zanella Netto writes:
>>
>>> On 12/04/23 05:53, Cupertino Miranda wrote:
>>>>
>>>> Hi Adhemerval, everyone,
>>>>
>>>> Thanks for the approval, detailed analysis and time spent on the topic.
>>>>
>>>> Best regards,
>>>> Cupertino
>>>>
>>>
>>> Cupertino, could you also add a NEWS entry? I recall that every tunable
>>> is also mentioned in NEWS.
diff --git a/manual/tunables.texi b/manual/tunables.texi
index 70dd2264c5..130f94b2bc 100644
--- a/manual/tunables.texi
+++ b/manual/tunables.texi
@@ -459,6 +459,21 @@ registration on behalf of the application.  Restartable sequences are a
 Linux-specific extension.
 @end deftp

+@deftp Tunable glibc.pthread.stack_hugetlb
+This tunable controls whether to use Huge Pages in the stacks created by
+@code{pthread_create}.  This tunable only affects the stacks created by
+@theglibc{}; it has no effect on stacks assigned with
+@code{pthread_attr_setstack}.
+
+The default is @samp{1}, where the system default value is used.  Setting
+its value to @code{0} enables the use of @code{madvise} with
+@code{MADV_NOHUGEPAGE} after stack creation with @code{mmap}.
+
+This is a memory utilization optimization, since the internal glibc setup of
+either the thread descriptor or the guard page might force the kernel to move
+the thread stack, originally backed by Huge Pages, to default pages.
+@end deftp
+
 @node Hardware Capability Tunables
 @section Hardware Capability Tunables
 @cindex hardware capability tunables
diff --git a/nptl/allocatestack.c b/nptl/allocatestack.c
index c7adbccd6f..f9d8cdfd08 100644
--- a/nptl/allocatestack.c
+++ b/nptl/allocatestack.c
@@ -369,6 +369,12 @@ allocate_stack (const struct pthread_attr *attr, struct pthread **pdp,
       if (__glibc_unlikely (mem == MAP_FAILED))
 	return errno;

+      /* Do madvise in case the tunable glibc.pthread.stack_hugetlb is
+	 set to 0, disabling hugetlb.  */
+      if (__glibc_unlikely (__nptl_stack_hugetlb == 0)
+	  && __madvise (mem, size, MADV_NOHUGEPAGE) != 0)
+	return errno;
+
       /* SIZE is guaranteed to be greater than zero.
 	 So we can never get a null pointer back from mmap.  */
       assert (mem != NULL);
diff --git a/nptl/nptl-stack.c b/nptl/nptl-stack.c
index 5eb7773575..e829711cb5 100644
--- a/nptl/nptl-stack.c
+++ b/nptl/nptl-stack.c
@@ -21,6 +21,7 @@
 #include <pthreadP.h>

 size_t __nptl_stack_cache_maxsize = 40 * 1024 * 1024;
+int32_t __nptl_stack_hugetlb = 1;

 void
 __nptl_stack_list_del (list_t *elem)
diff --git a/nptl/nptl-stack.h b/nptl/nptl-stack.h
index 34f8bbb15e..cf90b27c2b 100644
--- a/nptl/nptl-stack.h
+++ b/nptl/nptl-stack.h
@@ -27,6 +27,9 @@
 /* Maximum size of the cache, in bytes.  40 MiB by default.  */
 extern size_t __nptl_stack_cache_maxsize attribute_hidden;

+/* Should allow stacks to use hugetlb.  (1) is default.  */
+extern int32_t __nptl_stack_hugetlb;
+
 /* Check whether the stack is still used or not.  */
 static inline bool
 __nptl_stack_in_use (struct pthread *pd)
diff --git a/nptl/pthread_mutex_conf.c b/nptl/pthread_mutex_conf.c
index 329c4cbb8f..60ef9095aa 100644
--- a/nptl/pthread_mutex_conf.c
+++ b/nptl/pthread_mutex_conf.c
@@ -45,6 +45,12 @@ TUNABLE_CALLBACK (set_stack_cache_size) (tunable_val_t *valp)
   __nptl_stack_cache_maxsize = valp->numval;
 }

+static void
+TUNABLE_CALLBACK (set_stack_hugetlb) (tunable_val_t *valp)
+{
+  __nptl_stack_hugetlb = (int32_t) valp->numval;
+}
+
 void
 __pthread_tunables_init (void)
 {
@@ -52,5 +58,7 @@ __pthread_tunables_init (void)
                TUNABLE_CALLBACK (set_mutex_spin_count));
   TUNABLE_GET (stack_cache_size, size_t,
                TUNABLE_CALLBACK (set_stack_cache_size));
+  TUNABLE_GET (stack_hugetlb, int32_t,
+               TUNABLE_CALLBACK (set_stack_hugetlb));
 }
 #endif
diff --git a/sysdeps/nptl/dl-tunables.list b/sysdeps/nptl/dl-tunables.list
index bd1ddb121d..4cde9500b6 100644
--- a/sysdeps/nptl/dl-tunables.list
+++ b/sysdeps/nptl/dl-tunables.list
@@ -33,5 +33,11 @@ glibc {
       maxval: 1
       default: 1
     }
+    stack_hugetlb {
+      type: INT_32
+      minval: 0
+      maxval: 1
+      default: 1
+    }
   }
 }