| Message ID | DB6PR0801MB2053938869AF403F0D95BC5F83600@DB6PR0801MB2053.eurprd08.prod.outlook.com |
|---|---|
| State | New |
| Series | [malloc] Avoid atomics in have_fastchunks |
On 09/19/2017 10:32 AM, Wilco Dijkstra wrote:
> Currently free typically uses 2 atomic operations per call. The have_fastchunks
> flag indicates whether there are recently freed blocks in the fastbins. This
> is purely an optimization to avoid calling malloc_consolidate too often and
> avoiding the overhead of walking all fast bins even if all are empty during a
> sequence of allocations. However, using atomics to update the flag is completely
> unnecessary since it can be changed into a simple boolean. There is no change
> in multi-threaded behaviour given the flag is already approximate (it may be
> set when there are no blocks in any fast bins, or it may be clear when there
> are free blocks that could be consolidated).
>
> Performance of malloc/free improves by 27% on a simple benchmark on AArch64
> (both single- and multi-threaded). The number of load/store-exclusive
> instructions is reduced by 33%. Bench-malloc-thread speeds up by ~3% in all
> cases.
>
> Passes GLIBC tests. OK to commit?
>
> ChangeLog:
> 2017-09-19  Wilco Dijkstra  <wdijkstr@arm.com>
>
> 	* malloc/malloc.c (FASTCHUNKS_BIT): Remove.
> 	(have_fastchunks): Remove.
> 	(clear_fastchunks): Remove.
> 	(set_fastchunks): Remove.
> 	(malloc_state): Add have_fastchunks.
> 	(malloc_init_state): Use have_fastchunks.
> 	(do_check_malloc_state): Remove incorrect invariant checks.
> 	(_int_malloc): Use have_fastchunks.
> 	(_int_free): Likewise.
> 	(malloc_consolidate): Likewise.

Quick review. I think we'll need a v2 before we're ready to commit, or at
least a discussion around the use of relaxed MO loads and stores.

High level:

It is great to see someone looking at the details of malloc at an
atomic-by-atomic cost analysis. I know we have looked briefly at fastbins and
the tradeoff between taking the arena lock (one atomic) and the CAS required
to put the fastbin back in the list. I'm happy to see this paying dividends
for you, e.g. ~3% global gains. I like the direction this fix takes and the
process you used to verify it.
Implementation level:

You use unadorned loads and stores of the variable av->have_fastchunks, and
this constitutes a data race, which is undefined behaviour in C11. For
example, in _int_free have_lock could be 0 and we would write to
av->have_fastchunks while any other thread might be reading it. Please use
relaxed MO loads and stores if that is what we need.

After you add the relaxed MO loads and stores, the comment for
have_fastchunks will need a little more explicit language about why the
relaxed MO loads and stores are OK from a P&C perspective.

Details:

Does this patch change the number of times malloc_consolidate might be
called? Do you have any figures on this? That would be a user-visible change
(and require a bug #).
DJ Delorie wrote:
> I think the key here is to look at all accesses to that variable, and
> consider "what's the worst that could happen if the wrong value is there
> forever". If the worst case doesn't counter the benefits, it's a net
> win.
>
> But my biggest concern is wondering how long a "wrong value" can persist
> and cause problems.

Note that the "wrong value" case exists in current GLIBC, even
single-threaded. The single-threaded case is a free that adds a chunk to the
fastbin (which sets have_fastchunks), then a malloc that reuses it (which
leaves have_fastchunks unchanged). If the next malloc is large, it will call
malloc_consolidate unnecessarily when there are no actual chunks to
consolidate. This is fine, as have_fastchunks is an approximation.

However, since free is lock-free, there is no ordering constraint, and all 4
possible combinations of have_fastchunks and the fastbins being empty or not
are possible in the multithreaded case. This could happen when some threads
execute free while another does malloc, so we end up racing the multiple
writes to have_fastchunks as well as the insertion into and emptying of the
fastbins.

I guess the worst-case scenario would be several blocks being added to the
fast bins while have_fastchunks remains false. Then those blocks would not be
consolidated in subsequent mallocs until a later free sets have_fastchunks,
potentially increasing fragmentation. The best-case scenario is another call
to malloc_consolidate which has no effect. Again, this is what the existing
code does before my patch - using the most constrained memory order for the
updates to have_fastchunks would not change that.

Wilco
Carlos O'Donell wrote:
> It is great to see someone looking at the details of malloc at a atomic by
> atomic cost analysis. I know we have looked briefly at fastbins and the
> tradeoff between taking the arena lock (one atomic) and CAS required to put
> the fastbin back in the list.

Yes, it looks like doing free lock-free works fine overall. I wonder whether
malloc can do the same for the fastbin paths since it already has to deal
with free updating the fastbins concurrently.

> You use unadorned loads and stores of the variable av->have_fastchunks, and
> this constitutes a data race which is undefined behaviour in C11. ...
> Please use relaxed MO loads and stores if that is what we need.

I'll do that.

> After you add the relaxed MO loads and stores the comment for have_fastchunks
> will need a little more explicit language about why the relaxed MO loads and
> stores are OK from a P&C perspective.

That's easy given multithreaded interleaving already allows all possible
combinations before even considering memory ordering - see my reply to DJ for
the long version...

> Does this patch change the number of times malloc_consolidate might
> be called? Do you have any figures on this? That would be a user visible
> change (and require a bug #).

The number of calls isn't fixed already. I'll have a go at hacking the malloc
test to see how much variation there is and whether my patch changes it.

Btw, what is your opinion on how to add generic single-threaded optimizations
that work for all targets? Rather than doing more target hacks, I'd like to
add something similar to what we did with stdio getc/putc, i.e. add a
high-level check for the single-threaded case that uses a different code path
(with no/relaxed atomics and no locks for the common cases). To give an idea
how much this helps, creating a dummy thread that does nothing slows down x64
malloc/free by 2x (it has jumps that skip the 1-byte lock prefix...).
An alternative would be to move all the fastbin handling into the t-cache -
but then I bet it's much easier just to write a fast modern allocator...

Wilco
DJ Delorie wrote:
> tcache is a type of fastbin already. I tested removing fastbins since
> tcache did something similar, but it turns out that doing both tcache
> and fastbins is still faster than doing just either.

Given tcache works in a limited set of circumstances, that's not a surprise!
In particular, the default setting only allows 7 entries, which doesn't help
any workloads I've tried. Gains appear from >100 entries, while 250-1000
looks quite good. So that's the kind of setting we need before fastbins could
be considered for removal. Also, while tcache would need consolidation
support to limit fragmentation, it would need to be less aggressive than it
is today. It appears a fair amount of the tcache gains are due to better
spatial locality.

> Many have said "just replace glibc's malloc with a fast modern
> allocator" but glibc's malloc is about as fast as a modern allocator
> now, and what we're lacking is a way to *prove* a new malloc is faster
> for all the use cases glibc needs to cater to. I posted a note about
> the trace-enabled glibc hoping that folks would start collecting the
> workloads we want to benchmark. Having a large library of
> benchmark-able things is somewhat of a prerequisite before we worry
> about "which malloc is faster for glibc's audience".
>
> I'm not saying we should never replace glibc's malloc, I'm saying we
> need solid metrics to justify it. Remember that glibc's allocator is
> the default allocator for all of Linux, aside from the few apps that
> have an app-specific allocator.
>
> Thus, "I bet it's much easier" is not such a sure bet ;-)

I agree we need some benchmarks and workloads. But it's not only about
getting a few percent speedup. The current implementation is without a doubt
ancient and extremely hard to understand. There are about 100 if statements
in the fast paths, 90% of which are completely redundant. It even manages to
check for the exact same special case several times in a row (e.g. mmap).
The sort of questions my simple patch raises is another example of how
complex and nontrivial it has become. So there are very solid reasons to look
beyond the current implementation. I would find it hard to believe there
aren't multiple more advanced implementations available that not only improve
performance but have a significantly simpler and smaller codebase as well...

Wilco
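For reference, the per-bin entry limit Wilco mentions (default 7) became
adjustable via the malloc tunables that shipped alongside tcache, so the
>100-entry experiments could plausibly be reproduced along these lines
(`./my_benchmark` is a placeholder for whatever workload is being measured):

```shell
# Hypothetical experiment: raise the tcache per-bin entry limit from the
# default 7 to 250 for one run of a benchmark.
GLIBC_TUNABLES=glibc.malloc.tcache_count=250 ./my_benchmark
```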
On 09/19/2017 03:11 PM, Wilco Dijkstra wrote:
> Carlos O'Donell wrote:
>> Does this patch change the number of times malloc_consolidate might
>> be called? Do you have any figures on this? That would be a user visible
>> change (and require a bug #).
>
> The number of calls isn't fixed already. I'll have a go at hacking the malloc
> test to see how much variation there is and whether my patch changes it.

I'm only curious; this isn't a requirement on the submission of this patch.

> Btw what is your opinion on how to add generic single-threaded optimizations
> that work for all targets? Rather than doing more target hacks, I'd like to add
> something similar like we did with stdio getc/putc, ie. add a high-level check for
> the single-threaded case that uses a different code path (with no/relaxed atomics
> and no locks for the common cases).

I don't have an opinion on this for malloc. I haven't thought much about this
kind of optimization. I'm mostly thinking about how to speed up the
multi-threaded case.

> To give an idea how much this helps, creating a dummy thread that does nothing
> slows down x64 malloc/free by 2x (it has jumps that skip the 1-byte lock prefix...).

You might go even further down this path. If each thread has its own arena,
then we don't need to do any locking in the arena, except for the arena
selection path?

> An alternative would be to move all the fastbin handling into the t-cache - but
> then I bet it's much easier just to write a fast modern allocator...

Writing a fast modern allocator is not easy :-)
On 09/19/2017 04:39 PM, Wilco Dijkstra wrote:
> I agree we need some benchmarks and workloads. But it's not only about
> getting a few percent speedup. The current implementation is without a
> doubt ancient and extremely hard to understand. There are about 100
> if statements in the fast paths, 90% of which are completely redundant.
> It even manages to check for the exact same special case several times in
> a row (eg. mmap).
>
> The sort of questions my simple patch raises is another example of how
> complex and nontrivial it has become. So there are very solid reasons to
> look beyond the current implementation. I would find it hard to believe
> there aren't multiple more advanced implementations available that not
> only improve performance but have a significantly simpler and smaller
> codebase as well...

You are looking at this from the wrong way around. Changing allocators in
your application is easy. Go grab any allocator from the internet and
LD_PRELOAD it. We go to great lengths to allow this.

What matters to *us* as glibc developers is not "What modern allocator should
we change to?" but "Why should we change to a different allocator?", since
there will always be a new allocator to pick from. In fact we want to be able
to tell users which allocator may work better for their workload!

Eventually all allocators get crufty, and glibc overall has suffered from a
lack of investment and repair of technical debt. We have turned that trend
around in many subsystems of glibc. In 2013 [1] I argued we must own the
malloc we have, whatever that means, and turn it into what we need, and that
can include using newer allocators. Since then we have *really* cleaned up
malloc. You should have seen it in 2013!

The key turning point is providing objective engineering proof that another
allocator is better on a given workload for some measure of better that
includes performance, VmRSS, VmSize, and security.
I will continue to push DJ's trace/simulation to get it ready for upstream
integration as another way to further our engineering rigor in this area of
system benchmarking. The trace/simulation is allocator-neutral - it's just an
API trace - and we've used it to compare glibc, jemalloc, and tcmalloc across
several workloads.

In summary:

1) We need more engineering rigor and objective data-driven decisions before
   we can decide which allocator will really be the best system default
   allocator.

2) In the meantime we should own the existing glibc malloc and make any
   changes we need to improve the situation incrementally.

3) We will continue to work diligently to ensure that users can interpose a
   new application-specific allocator without issue.

Your patch today contributes to (2). Thank you! :-)
On 2017.09.19 at 17:30 -0400, DJ Delorie wrote:
>
> Wilco Dijkstra <Wilco.Dijkstra@arm.com> writes:
> > An alternative would be to move all the fastbin handling into the
> > t-cache - but then I bet it's much easier just to write a fast modern
> > allocator...
>
> tcache is a type of fastbin already. I tested removing fastbins since
> tcache did something similar, but it turns out that doing both tcache
> and fastbins is still faster than doing just either.
>
> Many have said "just replace glibc's malloc with a fast modern
> allocator" but glibc's malloc is about as fast as a modern allocator
> now, and what we're lacking is a way to *prove* a new malloc is faster
> for all the use cases glibc needs to cater to. I posted a note about
> the trace-enabled glibc hoping that folks would start collecting the
> workloads we want to benchmark. Having a large library of
> benchmark-able things is somewhat of a prerequisite before we worry
> about "which malloc is faster for glibc's audience".

The reason why nobody uses your trace/simulation patches is because they
generate way too much data and are just too complex/invasive for most users.
And someone would have to analyze all this data. So it is natural to look for
other metrics like code complexity instead.
Markus Trippelsdorf wrote:
> The reason why nobody uses your trace/simulation patches is because they
> generate way too much data and are just too complex/invasive for most
> users. And someone would have to analyze all this data.
> So it is natural to look for other metrics like code complexity instead.

Indeed, I can generate hundreds of gigabytes of malloc traces in a few
hours... But what's the point? Traces are many orders of magnitude too large
to share (let alone commit), and unlike the workloads for the math functions,
it is difficult to reduce a huge trace and yet remain 100% representative.

So I think we'll need to add microbenchmarks that test various aspects of
memory allocation. Extracting small representative kernels from existing
applications/benchmarks should be easier than dealing with huge traces...

Wilco
On 09/20/2017 12:53 AM, Markus Trippelsdorf wrote:
> On 2017.09.19 at 17:30 -0400, DJ Delorie wrote:
>>
>> Wilco Dijkstra <Wilco.Dijkstra@arm.com> writes:
>>> An alternative would be to move all the fastbin handling into the
>>> t-cache - but then I bet it's much easier just to write a fast modern
>>> allocator...
>>
>> tcache is a type of fastbin already. I tested removing fastbins since
>> tcache did something similar, but it turns out that doing both tcache
>> and fastbins is still faster than doing just either.
>>
>> Many have said "just replace glibc's malloc with a fast modern
>> allocator" but glibc's malloc is about as fast as a modern allocator
>> now, and what we're lacking is a way to *prove* a new malloc is faster
>> for all the use cases glibc needs to cater to. I posted a note about
>> the trace-enabled glibc hoping that folks would start collecting the
>> workloads we want to benchmark. Having a large library of
>> benchmark-able things is somewhat of a prerequisite before we worry
>> about "which malloc is faster for glibc's audience".
>
> The reason why nobody uses your trace/simulation patches is because they
> generate way too much data and are just too complex/invasive for most
> users. And someone would have to analyze all this data.
> So it is natural to look for other metrics like code complexity instead.

Yes, it is a hard problem. We don't want anyone to use the *patches*; we want
them to use a pre-compiled instrumented glibc to gather data at the behest of
glibc developers, with cooperation from the distributions. As luck would have
it, I support Fedora's glibc and so does DJ, so we have specially compiled
Fedora patches to gather this information. The goal is to eventually get
trace into upstream so we can do better analysis of all of our APIs, and so
that distros can turn this on without the effort of patching. Users just use
LD_PRELOAD with a new glibc.
I don't see that as too complex (it might be invasive from a trace-cost
perspective, but it's less than any other trace infrastructure I've ever
seen). Yes, they do generate a lot of data, but we need that data, and it's
just a file that users can send which has no real information about their
systems. We have had real customers work with us to gather this data, and
it's on our long list to possibly switch to HDF5 to do compressed data as an
option (after trace capture). Lastly, users don't look at code complexity at
all.

In summary:

- Users should get packages from their distro to enable trace capture.
- Files are large, but that hasn't stopped people from getting us short
  workloads.
- Analysis is an issue, and we continue to work on anything beyond the
  metrics the simulator provides on x86_64 (instruction counts, VmRSS,
  VmSize).

What else can we do to make this better? To further the engineering rigor we
use?
On 9/19/2017 6:57 PM, Carlos O'Donell wrote:
>
>> An alternative would be to move all the fastbin handling into the t-cache - but
>> then I bet it's much easier just to write a fast modern allocator...
>
> Writing a fast modern allocator is not easy :-)

And since glibc 2.26, glibc's malloc() is actually no longer losing against
some of those alternative allocators; for example, see
https://www.phoronix.com/scan.php?page=news_item&px=Glibc-2.26-Redis-Test
diff --git a/malloc/malloc.c b/malloc/malloc.c
index 1c2a0b05b78c84cea60ee998108180d51b1f1ddf..61bea1eb321f4abbc6a843abc49b2e5b1fef9f9c 100644
--- a/malloc/malloc.c
+++ b/malloc/malloc.c
@@ -1604,27 +1604,6 @@ typedef struct malloc_chunk *mfastbinptr;
 #define FASTBIN_CONSOLIDATION_THRESHOLD  (65536UL)
 
 /*
-   Since the lowest 2 bits in max_fast don't matter in size comparisons,
-   they are used as flags.
- */
-
-/*
-   FASTCHUNKS_BIT held in max_fast indicates that there are probably
-   some fastbin chunks. It is set true on entering a chunk into any
-   fastbin, and cleared only in malloc_consolidate.
-
-   The truth value is inverted so that have_fastchunks will be true
-   upon startup (since statics are zero-filled), simplifying
-   initialization checks.
- */
-
-#define FASTCHUNKS_BIT        (1U)
-
-#define have_fastchunks(M)     (((M)->flags & FASTCHUNKS_BIT) == 0)
-#define clear_fastchunks(M)    catomic_or (&(M)->flags, FASTCHUNKS_BIT)
-#define set_fastchunks(M)      catomic_and (&(M)->flags, ~FASTCHUNKS_BIT)
-
-/*
    NONCONTIGUOUS_BIT indicates that MORECORE does not return contiguous
    regions.  Otherwise, contiguity is exploited in merging together,
    when possible, results from consecutive MORECORE calls.
@@ -1672,6 +1651,16 @@ get_max_fast (void)
    ----------- Internal state representation and initialization -----------
  */
 
+/*
+   have_fastchunks indicates that there are probably some fastbin chunks.
+   It is set true on entering a chunk into any fastbin, and cleared early in
+   malloc_consolidate.  The value is approximate since it may be set when there
+   are no fastbin chunks, or it may be clear even if there are fastbin chunks
+   available.  Given its sole purpose is to reduce number of redundant calls to
+   malloc_consolidate, it does not affect correctness.
+ */
+
+
 struct malloc_state
 {
   /* Serialize access.  */
@@ -1680,6 +1669,9 @@ struct malloc_state
   /* Flags (formerly in max_fast).  */
   int flags;
 
+  /* Set if the fastbin chunks contain recently inserted free blocks.  */
+  bool have_fastchunks;
+
   /* Fastbins */
   mfastbinptr fastbinsY[NFASTBINS];
 
@@ -1823,7 +1815,7 @@ malloc_init_state (mstate av)
   set_noncontiguous (av);
   if (av == &main_arena)
     set_max_fast (DEFAULT_MXFAST);
-  av->flags |= FASTCHUNKS_BIT;
+  av->have_fastchunks = false;
 
   av->top = initial_top (av);
 }
@@ -2179,11 +2171,6 @@ do_check_malloc_state (mstate av)
         }
     }
 
-  if (total != 0)
-    assert (have_fastchunks (av));
-  else if (!have_fastchunks (av))
-    assert (total == 0);
-
   /* check normal bins */
   for (i = 1; i < NBINS; ++i)
     {
@@ -3650,7 +3637,7 @@ _int_malloc (mstate av, size_t bytes)
   else
     {
       idx = largebin_index (nb);
-      if (have_fastchunks (av))
+      if (av->have_fastchunks)
         malloc_consolidate (av);
     }
 
@@ -4058,7 +4045,7 @@ _int_malloc (mstate av, size_t bytes)
 
       /* When we are using atomic ops to free fast chunks we can get
          here for all block sizes.  */
-      else if (have_fastchunks (av))
+      else if (av->have_fastchunks)
         {
           malloc_consolidate (av);
           /* restore original bin index */
@@ -4163,7 +4150,7 @@ _int_free (mstate av, mchunkptr p, int have_lock)
 
     free_perturb (chunk2mem(p), size - 2 * SIZE_SZ);
 
-    set_fastchunks(av);
+    av->have_fastchunks = true;
     unsigned int idx = fastbin_index(size);
     fb = &fastbin (av, idx);
 
@@ -4291,7 +4278,7 @@ _int_free (mstate av, mchunkptr p, int have_lock)
     */
     if ((unsigned long)(size) >= FASTBIN_CONSOLIDATION_THRESHOLD)
       {
-        if (have_fastchunks(av))
+        if (av->have_fastchunks)
           malloc_consolidate(av);
         if (av == &main_arena)
           {
@@ -4360,7 +4347,7 @@ static void malloc_consolidate(mstate av)
   */
 
   if (get_max_fast () != 0)
    {
-      clear_fastchunks(av);
+      av->have_fastchunks = false;
 
      unsorted_bin = unsorted_chunks(av);