Message ID: PAWPR08MB8982002BD1E5CE0E74F4F19B830A9@PAWPR08MB8982.eurprd08.prod.outlook.com
State: New
Series: malloc: Use correct C11 atomics for fastbin
* Wilco Dijkstra:

> Fix memory ordering issues in the fastbin implementation: in REMOVE_FB the
> next pointer is read before any MO synchronization, however in the C11 atomic
> model this is only correct after a load acquire. Refactor the fastbin code
> and add a dedicated fastbin_push/pop implementation. The number of acquire
> or release atomics remains the same, and the new functions are inlined, so
> performance is unaffected.
>
> Passes regress, OK for commit?

Did you actually observe any problems resulting from this? We
previously concluded that the dependency chain would make this valid
(outside the C11 memory model).

Thanks,
Florian
Hi Florian,

>> Fix memory ordering issues in the fastbin implementation: in REMOVE_FB the
>> next pointer is read before any MO synchronization, however in the C11 atomic
>> model this is only correct after a load acquire. Refactor the fastbin code
>> and add a dedicated fastbin_push/pop implementation. The number of acquire
>> or release atomics remains the same, and the new functions are inlined, so
>> performance is unaffected.
>>
>> Passes regress, OK for commit?
>
> Did you actually observe any problems resulting from this? We
> previously concluded that the dependency chain would make this valid
> (outside the C11 memory model).

No, I believe it works in most weak memory models (except for Alpha).
However the code always looked weird with the acquire done after several
non-atomic accesses... The key is that we need to do the acquire/release
only once per iteration in the CAS loop, and that avoids regressions.

Cheers,
Wilco
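To see the ordering question concretely, here is a minimal standalone
sketch of the pop side in C11 <stdatomic.h>; the node type and function
name are illustrative, not glibc's internal atomic_* macros or the patch
itself:

#include <stdatomic.h>
#include <stddef.h>

struct node { struct node *next; };

static struct node *
pop_acquire (_Atomic (struct node *) *list)
{
  struct node *head, *tail;
  do
    {
      /* The acquire load pairs with the release CAS on the push side, so
         the pushing thread's store to head->next is visible below.  With
         a relaxed load, the head->next read is unordered in the C11
         model, even though the address dependency makes it work on most
         hardware (Alpha being the classic exception).  */
      head = atomic_load_explicit (list, memory_order_acquire);
      if (head == NULL)
        return NULL;
      tail = head->next;
    }
  /* Relaxed suffices for the CAS itself: the acquire above already
     ordered the head->next read, and on failure the loop reloads with
     acquire anyway.  */
  while (!atomic_compare_exchange_weak_explicit (list, &head, tail,
                                                 memory_order_relaxed,
                                                 memory_order_relaxed));
  return head;
}

This is the "acquire/release only once per iteration" shape the patch
adopts: one acquire load per loop rather than an acquire CAS after
non-atomic reads.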
* Wilco Dijkstra:

> Hi Florian,
>
>>> Fix memory ordering issues in the fastbin implementation: in REMOVE_FB the
>>> next pointer is read before any MO synchronization, however in the C11 atomic
>>> model this is only correct after a load acquire. Refactor the fastbin code
>>> and add a dedicated fastbin_push/pop implementation. The number of acquire
>>> or release atomics remains the same, and the new functions are inlined, so
>>> performance is unaffected.
>>>
>>> Passes regress, OK for commit?
>>
>> Did you actually observe any problems resulting from this? We
>> previously concluded that the dependency chain would make this valid
>> (outside the C11 memory model).
>
> No, I believe it works in most weak memory models (except for Alpha).
> However the code always looked weird with the acquire done after several
> non-atomic accesses... The key is that we need to do the acquire/release
> only once per iteration in the CAS loop, and that avoids regressions.

Sounds good, but I haven't looked at the changes in detail. Maybe DJ
can review it?

Thanks,
Florian
Wilco Dijkstra via Libc-alpha <libc-alpha@sourceware.org> writes:
> +/* Atomically pop from the fastbin list.  The arena lock must be held to
> +   block other threads removing entries, avoiding the ABA issue.  */

If the arena lock must be held anyway, why go through the compare and
exchange overhead? We know we're the only thread accessing it.
* DJ Delorie:

> Wilco Dijkstra via Libc-alpha <libc-alpha@sourceware.org> writes:
>> +/* Atomically pop from the fastbin list.  The arena lock must be held to
>> +   block other threads removing entries, avoiding the ABA issue.  */
>
> If the arena lock must be held anyway, why go through the compare and
> exchange overhead? We know we're the only thread accessing it.

Other threads are adding entries without the arena lock.

Thanks,
Florian
Hi,

>> Wilco Dijkstra via Libc-alpha <libc-alpha@sourceware.org> writes:
>>> +/* Atomically pop from the fastbin list.  The arena lock must be held to
>>> +   block other threads removing entries, avoiding the ABA issue.  */
>>
>> If the arena lock must be held anyway, why go through the compare and
>> exchange overhead? We know we're the only thread accessing it.
>
> Other threads are adding entries without the arena lock.

Yes, malloc is blocking but free isn't and accesses the freelist
concurrently. It's a really weird design. Splitting the free list into a
local one and a shared one would be far better - no atomics when you have
the malloc lock, and if the local free list is empty it takes one atomic
to copy all shared entries.

Cheers,
Wilco
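For illustration, a hypothetical sketch of that local/shared split - the
names are invented, and this is not the current glibc design or part of
the patch:

#include <stdatomic.h>
#include <stddef.h>

struct chunk { struct chunk *next; };

struct bin
{
  struct chunk *local;               /* Arena lock held: plain accesses.  */
  _Atomic (struct chunk *) shared;   /* Lock-free pushes from free ().  */
};

/* Caller holds the arena lock, so no atomics are needed on the local
   list.  When it runs dry, a single atomic exchange grabs the entire
   shared chain at once.  */
static struct chunk *
bin_pop (struct bin *b)
{
  if (b->local == NULL)
    b->local = atomic_exchange_explicit (&b->shared, NULL,
                                         memory_order_acquire);
  struct chunk *p = b->local;
  if (p != NULL)
    b->local = p->next;
  return p;
}

This mirrors what malloc_consolidate already does once per consolidation
with atomic_exchange_acquire, applied per refill instead.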
* Wilco Dijkstra:

> Hi,
>
>>> Wilco Dijkstra via Libc-alpha <libc-alpha@sourceware.org> writes:
>>>> +/* Atomically pop from the fastbin list.  The arena lock must be held to
>>>> +   block other threads removing entries, avoiding the ABA issue.  */
>>>
>>> If the arena lock must be held anyway, why go through the compare and
>>> exchange overhead? We know we're the only thread accessing it.
>>
>> Other threads are adding entries without the arena lock.
>
> Yes, malloc is blocking but free isn't and accesses the freelist
> concurrently. It's a really weird design. Splitting the free list
> into a local one and a shared one would be far better - no atomics
> when you have the malloc lock, and if the local free list is empty it
> takes one atomic to copy all shared entries.

The local free list is in the tcache. I think DJ said that removing the
fastbins still resulted in a performance loss.

The tcache also has the benefit that the chain length is bounded.

Thanks,
Florian
Hi Florian,

>> Yes, malloc is blocking but free isn't and accesses the freelist
>> concurrently. It's a really weird design. Splitting the free list
>> into a local one and a shared one would be far better - no atomics
>> when you have the malloc lock, and if the local free list is empty it
>> takes one atomic to copy all shared entries.
>
> The local free list is in the tcache. I think DJ said that removing the
> fastbins still resulted in a performance loss.
>
> The tcache also has the benefit that the chain length is bounded.

That's because tcache has hardly any effect on programs with a high rate
of (de)allocations. It's so small that you usually end up using the
fastbin code anyway. All the single-threaded optimizations in fastbins
are absolutely essential - that's the code that shows up in profiles.

If we want to make tcache actually work, it will have to support far more
allocations, particularly for smaller sizes.

Cheers,
Wilco
Wilco Dijkstra <Wilco.Dijkstra@arm.com> writes:
> If we want to make tcache actually work, it will have to support far more
> allocations, particularly for smaller sizes.

You can test that with a tunable; the max count per bin is runtime
tunable.

But yeah, the point of tcache is to have a few of many sizes for fast
allocations. Fastbins hold a lot of a few small sizes. Testing showed
that both were required for best performance with the average
application.
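(The cap DJ refers to is the glibc.malloc.tcache_count tunable; assuming
a tcache-enabled build, an experiment like
GLIBC_TUNABLES=glibc.malloc.tcache_count=50 ./app raises the per-bin
count from the default of 7 at run time, without rebuilding.)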
Florian Weimer <fweimer@redhat.com> writes:
>> Wilco Dijkstra via Libc-alpha <libc-alpha@sourceware.org> writes:
>>> +/* Atomically pop from the fastbin list.  The arena lock must be held to
>>> +   block other threads removing entries, avoiding the ABA issue.  */
>>
>> If the arena lock must be held anyway, why go through the compare and
>> exchange overhead? We know we're the only thread accessing it.
>
> Other threads are adding entries without the arena lock.

Ah. That should be in the comment then.
On 2022-12-02 1:55 PM, DJ Delorie via Libc-alpha wrote:
> Wilco Dijkstra <Wilco.Dijkstra@arm.com> writes:
>> If we want to make tcache actually work, it will have to support far more
>> allocations, particularly for smaller sizes.
>
> You can test that with a tunable; the max count per bin is runtime
> tunable.
>
> But yeah, the point of tcache is to have a few of many sizes for fast
> allocations. Fastbins hold a lot of a few small sizes. Testing showed
> that both were required for best performance with the average
> application.

Every time we start talking about fastbins vs tcache again I start
wondering, again, what's stopping us from replacing the entire malloc
implementation with jemalloc, or any other implementation designed less
than 20 years ago.

zw
Hi,

> On 2022-12-02 1:55 PM, DJ Delorie via Libc-alpha wrote:
>> Wilco Dijkstra <Wilco.Dijkstra@arm.com> writes:
>>> If we want to make tcache actually work, it will have to support far more
>>> allocations, particularly for smaller sizes.
>>
>> You can test that with a tunable; the max count per bin is runtime
>> tunable.

Yes you can use tunables, but how many applications actually use more
optimized settings? It's the default that is the problem.

>> But yeah, the point of tcache is to have a few of many sizes for fast
>> allocations. Fastbins hold a lot of a few small sizes. Testing showed
>> that both were required for best performance with the average
>> application.

Yes, that's due to the tiny size of tcache (let's call it tiny-cache!).
Once exhausted, you mostly end up using the fastbins.

> Every time we start talking about fastbins vs tcache again I start
> wondering, again, what's stopping us from replacing the entire malloc
> implementation with jemalloc, or any other implementation designed less
> than 20 years ago.

I can't see any technical reason why not. It's either that or completely
rewriting the current implementation and getting rid of decades of
accumulated cruft...

Modern allocators are not only much faster than GLIBC in their default
settings but also have lower memory usage. The two best allocators seem
to be mimalloc and jemalloc.

Cheers,
Wilco
On 06/12/22 10:29, Wilco Dijkstra via Libc-alpha wrote:
> Hi,
>
>> On 2022-12-02 1:55 PM, DJ Delorie via Libc-alpha wrote:
>>> Wilco Dijkstra <Wilco.Dijkstra@arm.com> writes:
>>>> If we want to make tcache actually work, it will have to support far more
>>>> allocations, particularly for smaller sizes.
>>>
>>> You can test that with a tunable; the max count per bin is runtime
>>> tunable.
>
> Yes you can use tunables, but how many applications actually use more
> optimized settings? It's the default that is the problem.
>
>>> But yeah, the point of tcache is to have a few of many sizes for fast
>>> allocations. Fastbins hold a lot of a few small sizes. Testing showed
>>> that both were required for best performance with the average
>>> application.
>
> Yes, that's due to the tiny size of tcache (let's call it tiny-cache!).
> Once exhausted, you mostly end up using the fastbins.
>
>> Every time we start talking about fastbins vs tcache again I start
>> wondering, again, what's stopping us from replacing the entire malloc
>> implementation with jemalloc, or any other implementation designed less
>> than 20 years ago.
>
> I can't see any technical reason why not. It's either that or completely
> rewriting the current implementation and getting rid of decades of
> accumulated cruft...
>
> Modern allocators are not only much faster than GLIBC in their default
> settings but also have lower memory usage. The two best allocators seem
> to be mimalloc and jemalloc.

Do they have a compatible license so we can just import/adapt the code?
On Tue, Dec 6, 2022, at 8:37 AM, Adhemerval Zanella Netto via Libc-alpha wrote:
>>> Every time we start talking about fastbins vs tcache again I start
>>> wondering, again, what's stopping us from replacing the entire malloc
>>> implementation with jemalloc, or any other implementation designed less
>>> than 20 years ago.
>>
>> I can't see any technical reason why not. It's either that or completely
>> rewriting the current implementation and getting rid of decades of
>> accumulated cruft...
>>
>> Modern allocators are not only much faster than GLIBC in their default
>> settings but also have lower memory usage. The two best allocators seem
>> to be mimalloc and jemalloc.
>
> Do they have a compatible license so we can just import/adapt the code?

Per https://github.com/jemalloc/jemalloc/blob/dev/COPYING jemalloc is
2-clause BSD. I don't know where to find mimalloc off the top of my head.

zw
Zack Weinberg via Libc-alpha <libc-alpha@sourceware.org> writes:
> Every time we start talking about fastbins vs tcache again I start
> wondering, again, what's stopping us from replacing the entire malloc
> implementation with jemalloc,

There are a couple of things, but they're technical details, not
political ones:

* There's a copy of a "mini-malloc" in the dynamic loader that needs to
  be somewhat compatible with the full copy in libc.so so that data
  allocated in the loader can be used later.

* The current malloc knows a lot about glibc internals like pthreads and
  fork, so as to manage those boundaries properly (thread start/stop,
  etc).

* Benchmarks have shown that the various mallocs each have their strong
  points and weak points, so expect some complaints about performance
  loss when switching mallocs. Your favorite allocator may perform
  poorly in someone else's favorite application.

* Testing, testing, testing. Libc's malloc is used in all Linux distros
  *and* Hurd, which means it's known to work with tens if not hundreds
  of thousands of programs. To propose a new system-wide allocator, you
  need to replicate this level of effective testing.

> or any other implementation designed less than 20 years ago.

Comments like this are not appreciated. Age does not imply quality.
Please stick to technical points.
Wilco Dijkstra <Wilco.Dijkstra@arm.com> writes:
>>> You can test that with a tunable; the max count per bin is runtime
>>> tunable.
>
> Yes you can use tunables, but how many applications actually use more
> optimized settings? It's the default that is the problem.

If testing shows that a larger default makes sense across a wide range
of applications, then we can change the default. I'm a bit opposed to
changing anything because "it makes sense". Back it up with data.

> Yes, that's due to the tiny size of tcache (let's call it
> tiny-cache!). Once exhausted, you mostly end up using the fastbins.

tcache is wide but shallow, fastbins are narrow but deep (tcache caches
much larger chunks than fastbins do). Benchmarks show they both provide
a speed boost separately, and more so when combined.

>> Modern allocators are not only much faster than GLIBC in their default
>> settings
>
Back this up with data please. Tcache invalidated most online
benchmarks.

>> but also have lower memory usage.
>
This is a project we've got on our to-do list.
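(For scale, assuming current 64-bit glibc defaults: tcache spans 64 bin
sizes up to roughly 1 KiB but holds at most 7 chunks per bin, while
fastbins only cover sizes up to 128 bytes yet have no depth limit -
hence "wide but shallow" versus "narrow but deep".)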
DJ Delorie <dj@redhat.com> writes:
> Zack Weinberg via Libc-alpha <libc-alpha@sourceware.org> writes:
>> Every time we start talking about fastbins vs tcache again I start
>> wondering, again, what's stopping us from replacing the entire malloc
>> implementation with jemalloc,
>
> There are a couple of things, but they're technical details, not
> political ones:
>
> * There's a copy of a "mini-malloc" in the dynamic loader that needs to
>   be somewhat compatible with the full copy in libc.so so that data
>   allocated in the loader can be used later.

Hmm. Why does the dynamic loader need to be able to allocate memory at
all, and how does it manage to play nice with LD_PRELOADed malloc
replacements if it does this?

> * The current malloc knows a lot about glibc internals like pthreads and
>   fork, so as to manage those boundaries properly (thread start/stop,
>   etc).

Similarly I am wondering how _this_ manages to work with LD_PRELOADed
malloc replacements. (I confess I haven’t ever dug into this part of
glibc much at all.)

> * Benchmarks have shown that the various mallocs each have their strong
>   points and weak points, so expect some complaints about performance
>   loss when switching mallocs. Your favorite allocator may perform
>   poorly in someone else's favorite application.
>
> * Testing, testing, testing. Libc's malloc is used in all Linux distros
>   *and* Hurd, which means it's known to work with tens if not hundreds
>   of thousands of programs. To propose a new system-wide allocator, you
>   need to replicate this level of effective testing.

It seems to me that this sets the bar for _any_ major change to glibc
malloc so high that it won’t ever happen, and given that every benchmark
_I’ve_ been pointed at shows glibc malloc coming in somewhere between
“mediocre” and “dead last”, that’s a shame.

zw
* DJ Delorie via Libc-alpha:

> Zack Weinberg via Libc-alpha <libc-alpha@sourceware.org> writes:
>> Every time we start talking about fastbins vs tcache again I start
>> wondering, again, what's stopping us from replacing the entire malloc
>> implementation with jemalloc,
>
> There are a couple of things, but they're technical details, not
> political ones:
>
> * There's a copy of a "mini-malloc" in the dynamic loader that needs to
>   be somewhat compatible with the full copy in libc.so so that data
>   allocated in the loader can be used later.

This is not actually true; no compatibility is required. We have manual
tracking for allocations and avoid freeing them with the wrong allocator.
Otherwise interposed mallocs wouldn't work.

Thanks,
Florian
* Zack Weinberg via Libc-alpha:

>> * The current malloc knows a lot about glibc internals like pthreads and
>>   fork, so as to manage those boundaries properly (thread start/stop,
>>   etc).
>
> Similarly I am wondering how _this_ manages to work with LD_PRELOADed
> malloc replacements. (I confess I haven’t ever dug into this part of
> glibc much at all.)

We call malloc as part of pthread_create, which ensures single-threaded
initialization of replacement mallocs. Otherwise, it's up to the
replacements to register atfork handlers as needed.

Thanks,
Florian
Hi DJ,

> If testing shows that a larger default makes sense across a wide range
> of applications, then we can change the default. I'm a bit opposed to
> changing anything because "it makes sense". Back it up with data.

Larger values result in major speedups. Even a very conservative increase
to 50 speeds up xalancbmk by 5% and omnetpp by 1.3%. A value of 2000
gives 17.5% and 6% respectively, but performance increases continue well
beyond that.

> tcache is wide but shallow, fastbins are narrow but deep (tcache caches
> much larger chunks than fastbins do). Benchmarks show they both provide
> a speed boost separately, and more so when combined.

If tcache was deeper, even if only for smaller sizes, it would make
fastbins completely redundant and with it all the complex consolidation.
As a result, there would be less fragmentation since adjacent free blocks
will be merged together more often.

>> Modern allocators are not only much faster than GLIBC in their default
>> settings
>
> Back this up with data please. Tcache invalidated most online
> benchmarks.

Here is a recent paper that covers GLIBC with tcache - it still loses
both on performance and memory usage:

https://arxiv.org/pdf/1905.01135.pdf

>> but also have lower memory usage.
>
> This is a project we've got on our to-do list.

I don't believe you can fix that without a major rewrite.

Cheers,
Wilco
diff --git a/malloc/malloc.c b/malloc/malloc.c
index 2a61c8b5ee38f33c42c72f3c5ad441e26ed0e701..a3e7053d2db81cce211dd8c3cff43af0f59a1b71 100644
--- a/malloc/malloc.c
+++ b/malloc/malloc.c
@@ -1786,6 +1786,73 @@ get_max_fast (void)
   return global_max_fast;
 }
 
+
+/* Atomically push onto the fastbin list.  Return the previous head of the
+   list, however it can only be referenced if the arena lock is acquired.  */
+static __always_inline mchunkptr
+fastbin_push_chunk (mchunkptr *fastbin, mchunkptr p)
+{
+  mchunkptr head = atomic_load_relaxed (fastbin);
+
+  /* Check that the top of the bin is not the record we are going to
+     add (i.e., double free).  */
+  if (__builtin_expect (head == p, 0))
+    malloc_printerr ("double free or corruption (fasttop)");
+
+  if (SINGLE_THREAD_P)
+    {
+      p->fd = PROTECT_PTR (&p->fd, head);
+      *fastbin = p;
+      return head;
+    }
+
+  /* Atomically update the fastbin head.  Use release MO to synchronize
+     with acquire reads in fastbin_pop_chunk and malloc_consolidate (this
+     ensures the next pointer update is valid).  */
+  do
+    {
+      p->fd = PROTECT_PTR (&p->fd, head);
+    }
+  while (!atomic_compare_exchange_weak_release (fastbin, &head, p));
+
+  return head;
+}
+
+
+/* Atomically pop from the fastbin list.  The arena lock must be held to
+   block other threads removing entries, avoiding the ABA issue.  */
+static __always_inline mchunkptr
+fastbin_pop_chunk (mchunkptr *fastbin)
+{
+  mchunkptr head, tail;
+
+  if (SINGLE_THREAD_P)
+    {
+      head = *fastbin;
+      if (head == NULL)
+        return head;
+      if (__glibc_unlikely (misaligned_chunk (head)))
+        malloc_printerr ("malloc(): unaligned fastbin chunk detected");
+      *fastbin = REVEAL_PTR (head->fd);
+      return head;
+    }
+
+  do
+    {
+      /* Synchronize with release MO CAS in fastbin_push_chunk - this
+         ensures the next pointer is valid.  */
+      head = atomic_load_acquire (fastbin);
+      if (head == NULL)
+        return head;
+      if (__glibc_unlikely (head != NULL && misaligned_chunk (head)))
+        malloc_printerr ("malloc(): unaligned fastbin chunk detected");
+      tail = REVEAL_PTR (head->fd);
+    }
+  while (!atomic_compare_exchange_weak_relaxed (fastbin, &head, tail));
+
+  return head;
+}
+
 /* ----------- Internal state representation and initialization ----------- */
 
@@ -3798,71 +3865,37 @@ _int_malloc (mstate av, size_t bytes)
      can try it without checking, which saves some time on this fast path.
    */
 
-#define REMOVE_FB(fb, victim, pp)                       \
-  do                                                    \
-    {                                                   \
-      victim = pp;                                      \
-      if (victim == NULL)                               \
-        break;                                          \
-      pp = REVEAL_PTR (victim->fd);                                     \
-      if (__glibc_unlikely (pp != NULL && misaligned_chunk (pp)))       \
-        malloc_printerr ("malloc(): unaligned fastbin chunk detected"); \
-    }                                                   \
-  while ((pp = catomic_compare_and_exchange_val_acq (fb, pp, victim))  \
-         != victim);                                    \
-
   if ((unsigned long) (nb) <= (unsigned long) (get_max_fast ()))
     {
       idx = fastbin_index (nb);
       mfastbinptr *fb = &fastbin (av, idx);
-      mchunkptr pp;
-      victim = *fb;
+      victim = fastbin_pop_chunk (fb);
 
       if (victim != NULL)
         {
-          if (__glibc_unlikely (misaligned_chunk (victim)))
-            malloc_printerr ("malloc(): unaligned fastbin chunk detected 2");
-
-          if (SINGLE_THREAD_P)
-            *fb = REVEAL_PTR (victim->fd);
-          else
-            REMOVE_FB (fb, pp, victim);
-          if (__glibc_likely (victim != NULL))
-            {
-              size_t victim_idx = fastbin_index (chunksize (victim));
-              if (__builtin_expect (victim_idx != idx, 0))
-                malloc_printerr ("malloc(): memory corruption (fast)");
-              check_remalloced_chunk (av, victim, nb);
+          size_t victim_idx = fastbin_index (chunksize (victim));
+          if (__builtin_expect (victim_idx != idx, 0))
+            malloc_printerr ("malloc(): memory corruption (fast)");
+          check_remalloced_chunk (av, victim, nb);
 #if USE_TCACHE
-              /* While we're here, if we see other chunks of the same size,
-                 stash them in the tcache.  */
-              size_t tc_idx = csize2tidx (nb);
-              if (tcache && tc_idx < mp_.tcache_bins)
+          /* While we're here, if we see other chunks of the same size,
+             stash them in the tcache.  */
+          size_t tc_idx = csize2tidx (nb);
+          if (tcache && tc_idx < mp_.tcache_bins)
+            {
+              /* While bin not empty and tcache not full, copy chunks.  */
+              while (tcache->counts[tc_idx] < mp_.tcache_count)
                 {
-                  mchunkptr tc_victim;
-
-                  /* While bin not empty and tcache not full, copy chunks.  */
-                  while (tcache->counts[tc_idx] < mp_.tcache_count
-                         && (tc_victim = *fb) != NULL)
-                    {
-                      if (__glibc_unlikely (misaligned_chunk (tc_victim)))
-                        malloc_printerr ("malloc(): unaligned fastbin chunk detected 3");
-                      if (SINGLE_THREAD_P)
-                        *fb = REVEAL_PTR (tc_victim->fd);
-                      else
-                        {
-                          REMOVE_FB (fb, pp, tc_victim);
-                          if (__glibc_unlikely (tc_victim == NULL))
-                            break;
-                        }
-                      tcache_put (tc_victim, tc_idx);
-                    }
+                  mchunkptr tc_victim = fastbin_pop_chunk (fb);
+                  if (tc_victim == NULL)
+                    break;
+                  tcache_put (tc_victim, tc_idx);
                 }
-#endif
-              void *p = chunk2mem (victim);
-              alloc_perturb (p, bytes);
-              return p;
             }
+#endif
+          void *p = chunk2mem (victim);
+          alloc_perturb (p, bytes);
+          return p;
         }
     }
 
@@ -4505,29 +4538,7 @@ _int_free (mstate av, mchunkptr p, int have_lock)
     fb = &fastbin (av, idx);
 
     /* Atomically link P to its fastbin: P->FD = *FB; *FB = P;  */
-    mchunkptr old = *fb, old2;
-
-    if (SINGLE_THREAD_P)
-      {
-        /* Check that the top of the bin is not the record we are going to
-           add (i.e., double free).  */
-        if (__builtin_expect (old == p, 0))
-          malloc_printerr ("double free or corruption (fasttop)");
-        p->fd = PROTECT_PTR (&p->fd, old);
-        *fb = p;
-      }
-    else
-      do
-        {
-          /* Check that the top of the bin is not the record we are going to
-             add (i.e., double free).  */
-          if (__builtin_expect (old == p, 0))
-            malloc_printerr ("double free or corruption (fasttop)");
-          old2 = old;
-          p->fd = PROTECT_PTR (&p->fd, old);
-        }
-      while ((old = catomic_compare_and_exchange_val_rel (fb, p, old2))
-             != old2);
+    mchunkptr old = fastbin_push_chunk (fb, p);
 
     /* Check that size of fastbin chunk at the top is the same as
        size of the chunk that we are adding.  We can dereference OLD
       only if we have the lock, otherwise it might have already been
       allocated again.  */
 
@@ -4718,6 +4729,7 @@ static void malloc_consolidate(mstate av)
   maxfb = &fastbin (av, NFASTBINS - 1);
   fb = &fastbin (av, 0);
   do {
+    /* Synchronize with release MO CAS in fastbin_push_chunk.  */
    p = atomic_exchange_acquire (fb, NULL);
    if (p != 0) {
      do {