Message ID | UVNjyu5K3y4Pxi6Rwr1-pohsKhwE3nlbYOnMjmOuSvqGnpUGISJu00igzozfDrHYX26FyZQ1t0PRMc_6XBViXLGIL6YhNmr9gHtz5_g7eXk=@proton.me
---|---
State | New
Series | [v2] malloc: send freed small chunks to smallbin
(See the attachment for the files mentioned below.)

# The Rationale

The benefit of adding large chunks to the unsorted bin is clear: we can potentially skip adding them to their bins. This is useful since large chunks are kept sorted, and sorting takes time.

Currently, malloc also adds small chunks to the unsorted bin (if they do not fit into the fastbins and the tcache is full). The reasoning for this is unclear and the benefits are dubious:

1. The only time saving one might get when a small chunk is taken from the unsorted bin rather than its small bin is that mark_bin() does not need to be called. However, mark_bin() is very fast, and in any case much faster than unlink_chunk(), which will be performed twice if the small chunk does not get taken from unsorted.
2. There is a clear performance loss in traversing the unsorted bin when it holds more chunks.

# The Implementation

To implement this, we need an if-statement that checks whether a chunk is small or large, so it can be added to the appropriate bin (small or unsorted). It goes after consolidation is performed in _int_free (more specifically, in _int_free_create_chunk), as we do not wish to cause any unnecessary fragmentation.

Now let's talk about the variants of the patch that were used to compare possible approaches, and the reasoning for the chosen one. The models used are:

1. original (current malloc on the master branch).
2. mod ("modified", the basic implementation).
3. mod_unlikely (mod with __glibc_unlikely here: https://elixir.bootlin.com/glibc/latest/source/malloc/malloc.c#L4161). malloc_consolidate() and split chunks can still put small chunks into the unsorted bin, but it is much rarer now. There are significant performance advantages to doing this.
4. only_unlikely (original with just the __glibc_unlikely change). Used to contrast with mod_unlikely.
5. mod_unlikely_inverted (mod_unlikely, except the if-statement is inverted so that its first branch handles large chunks). This is the proposed approach. It is important for the first branch to be the large-chunk one: when doing sequential frees, consolidation causes many more large than small chunks to hit this if, so ordering the branches this way gives a light hint to branch prediction without any guarantees. (Benchmarking shows this is better than simply putting __glibc_unlikely in this if.)

# Performance analysis

All benchmarks are done with glibc compiled with no special flags, just (make -j, make install -j).

The biggest hit to performance is essentially the existence of the additional if-statement in _int_free. The benefits are these:

(1.) We gain the significant performance benefit of __glibc_unlikely when taking chunks out of unsorted.
(2.) We gain performance any time a small chunk would otherwise have been added to the unsorted bin.

This manifests in the following ways. If we free a lot of sequential memory, those chunks merge into a large chunk which keeps growing and being re-added to unsorted. This makes our if-statement predictable, so its overhead stays low, while we eventually gain performance on the malloc calls via (1.). If we free a lot of memory at random locations, small chunks are unlikely to be merged into a large chunk, so they reach our if-statement while still small and we gain performance via (2.). Essentially, the worst case for our algorithm is short runs of sequential frees, where our if-statement incurs overhead without gaining much performance back.

One might think that code which triggers a lot of malloc_consolidate() calls is bad for us, as it would defeat the __glibc_unlikely hint often; this turns out not to be the case.
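To make the two free patterns discussed above concrete, here is a minimal, self-contained sketch. It is not one of the attached benchmark files; the constants N and SZ are made up for illustration, and the comments assume default tcache and fastbin settings.

```c
#include <stdlib.h>

#define N  1000
#define SZ 256   /* past the default fastbin limit, inside the smallbin range */

int main (void)
{
  void *a[N];

  for (int i = 0; i < N; ++i)
    a[i] = malloc (SZ);

  /* Sequential frees: once the tcache bin for this size is full, each
     further free borders the previously freed memory and consolidates
     into one growing chunk.  That chunk soon exceeds the smallbin limit,
     so the new if-statement almost always takes the large branch and the
     merged chunk keeps being re-added to the unsorted bin (benefit (1.)).  */
  for (int i = 0; i < N / 2; ++i)
    free (a[i]);

  /* Scattered frees: every other chunk is freed, so its neighbours stay
     in use and nothing consolidates.  Each chunk reaches the if-statement
     while still small and, with this patch, goes straight to its smallbin
     instead of the unsorted bin (benefit (2.)).  */
  for (int i = N / 2 + 1; i < N - 1; i += 2)
    free (a[i]);

  /* The remaining chunks are deliberately left allocated so the two
     patterns stay isolated from each other.  */
  return 0;
}
```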
# Benchmarks

## uns_filler

The uns_filler.c file tries to fill the unsorted bin with small chunks (which, with our algorithm, actually go to their small bins). The figure showing the benchmark can be seen in benchmark/_plots/uns_filler.png.

We are looking at the time it takes to perform 10,000 allocations (and frees), averaged over three runs of the program. This is the metric we will be using from now on.

Our algorithm clearly outperforms current malloc. To get some numbers:

original / mod_unlikely_inverted: 1.186
original - mod_unlikely_inverted: 3.351 * 10^-5

## unlikely_hit

The next benchmark is unlikely_hit.c, which tries to trick __glibc_unlikely as often as possible. There are two variants, _little and _many, that differ only in the constants. The plots can be seen in benchmark/_plots/unlikely_hit_little.png and benchmark/_plots/unlikely_hit_many.png. As seen from the plots, our algorithm remains largely unaffected by this attack (even seemingly performing better than original).

unlikely_hit_little:
original / mod_unlikely_inverted: 1.00476
original - mod_unlikely_inverted: 0.09045 * 10^-5

unlikely_hit_many:
original / mod_unlikely_inverted: 1.00520
original - mod_unlikely_inverted: 0.1732 * 10^-5

## simple-malloc

Next we will look at the benchmark provided by glibc: bench-malloc-simple.c. As a tl;dr, it pretty much does:

```c
for (int i = 0; i < AMOUNT; ++i)
  a[i] = malloc (SIZE);
for (int i = 0; i < AMOUNT; ++i)
  free (a[i]);
```

for AMOUNT in [25, 100, 400, 1600] and SIZE in [8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096]. It tests single-threaded, single-threaded but with (SINGLE_THREAD_P == false), and multithreaded operation. I wrote a similar benchmark (simple.c) which produces plots of the same shape but gives me nicer numbers to work with (I'm having a hard time interpreting the "*_time" values from bench-malloc-simple, as * 200000 / 10000 * 1000 (iterations, 10k allocations, milliseconds) doesn't seem to give me comparable numbers).

We can ignore the sizes [8, 16, 32, 64], since they belong to the fastbins and those allocations never reach our code.

SIZE=128 with AMOUNT=25 is pretty much the worst possible case for our algorithm: it doesn't give us enough time to benefit from either (1.) or (2.), the unsorted bin is still wastefully scanned for small allocations, and our if-statement goes to the small branch eight times and then the large branch seventeen times (so no help from branch prediction), so we only incur the if-statement overhead. The plot (from my simple.c) is benchmark/_plots/simple_alloc128_nb25.png and the numbers are:

original / mod_unlikely_inverted: 0.9126
original - mod_unlikely_inverted: -2.3112 * 10^-5

In this case, our algorithm performs worse than the original. That being said, as we increase AMOUNT we approach the original, even beating it in one round (the error is large). benchmark/_plots/simple_alloc128_nb1600.png:

original / mod_unlikely_inverted: 0.9962
original - mod_unlikely_inverted: -0.21777 * 10^-5

Now let's look at SIZE=512, the largest size in this benchmark that still allocates small chunks (though it is only about halfway through the small-chunk size range).
With AMOUNT=25 (benchmark/_plots/simple_alloc512_nb25.png) we start off ten times better than before:

original / mod_unlikely_inverted: 0.9843
original - mod_unlikely_inverted: -0.374843 * 10^-5

With AMOUNT=400 (benchmark/_plots/simple_alloc512_nb400.png) we soon even overtake original:

original / mod_unlikely_inverted: 1.0024
original - mod_unlikely_inverted: 0.252267 * 10^-5

And the lead grows further with AMOUNT=1600 (benchmark/_plots/simple_alloc512_nb1600.png), with tight error margins:

original / mod_unlikely_inverted: 1.0042
original - mod_unlikely_inverted: 0.76221 * 10^-5

Getting to large chunks gives us better branch prediction, and we benefit from (2.) as well. Let's use SIZE=2048.

AMOUNT=25 (benchmark/_plots/simple_alloc2048_nb25.png):
original / mod_unlikely_inverted: 0.98313
original - mod_unlikely_inverted: -0.468283 * 10^-5

AMOUNT=100 (benchmark/_plots/simple_alloc2048_nb100.png):
original / mod_unlikely_inverted: 1.0026
original - mod_unlikely_inverted: 0.84184 * 10^-5

AMOUNT=1600 (benchmark/_plots/simple_alloc2048_nb1600.png):
original / mod_unlikely_inverted: 1.00248
original - mod_unlikely_inverted: 1.75042 * 10^-5

That was single-threaded only. For multithreading I didn't rewrite the benchmarks, so I can't compare the raw numbers, but I can generate the graphs:

benchmark/_plots/simple_thread_alloc128_nb25.png
benchmark/_plots/simple_thread_alloc128_nb1600.png
benchmark/_plots/simple_thread_alloc512_nb100.png
benchmark/_plots/simple_thread_alloc2048_nb25.png
benchmark/_plots/simple_thread_alloc2048_nb1600.png

Interestingly, this version seems to perform better than original in most of these cases.

## malloc-thread

This benchmark is provided by glibc and shows how our algorithm fares with multiple threads (1, 8, 16, 32) and random (in both size and location) allocations:

benchmark/_plots/malloc_thread_1.png
benchmark/_plots/malloc_thread_8.png
benchmark/_plots/malloc_thread_16.png
benchmark/_plots/malloc_thread_32.png

We can clearly see that mod_unlikely_inverted vastly outperforms original at every thread count.

# Security considerations

Currently there is no check that the smallbin is uncorrupted when small chunks are added to it from the unsorted bin. With this patch, most freed small chunks have to go through

```c
if (__glibc_unlikely (fwd->bk != bck))
  malloc_printerr ("free(): chunks in smallbin corrupted");
```

which is a net benefit for security and can hinder heap exploitation attacks.

# Conclusion

All in all, while the proposed version does perform slightly worse in some cases, that dip is bounded. Furthermore, the cases where it performs better than current malloc are numerous and cover many common cases, so I hope it will be taken into consideration!
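For reviewers who want to poke at the new free path by hand, here is a minimal, hypothetical smoke test. It is not part of the submission; the constants are arbitrary and only chosen so that, with default tcache and fastbin settings, frees of a smallbin-sized chunk overflow the tcache and reach _int_free without consolidating.

```c
#include <stdlib.h>

#define COUNT   64   /* enough frees of one size to overflow a tcache bin (7 entries) */
#define REQUEST 512  /* above the default fastbin limit, inside the smallbin range    */

int main (void)
{
  void *a[COUNT];

  for (int i = 0; i < COUNT; ++i)
    a[i] = malloc (REQUEST);

  /* Free every other chunk so nothing consolidates.  Once the tcache bin
     is full, these frees reach _int_free_create_chunk () as small chunks
     and, with the patch, are linked straight into their smallbin.  */
  for (int i = 0; i < COUNT; i += 2)
    free (a[i]);

  /* Allocating the same size again is served from the tcache first and
     then from that smallbin, without having to walk chunks parked in the
     unsorted bin.  */
  for (int i = 0; i < COUNT; i += 2)
    a[i] = malloc (REQUEST);

  for (int i = 0; i < COUNT; ++i)
    free (a[i]);

  return 0;
}
```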
diff --git a/malloc/malloc.c b/malloc/malloc.c
index bcb6e5b83c..ad77cd083e 100644
--- a/malloc/malloc.c
+++ b/malloc/malloc.c
@@ -4156,9 +4156,9 @@ _int_malloc (mstate av, size_t bytes)
 #endif
         }
 
-      /* place chunk in bin */
-
-      if (in_smallbin_range (size))
+      /* Place chunk in bin. Only malloc_consolidate() and splitting can put
+         small chunks into the unsorted bin. */
+      if (__glibc_unlikely (in_smallbin_range (size)))
         {
           victim_index = smallbin_index (size);
           bck = bin_at (av, victim_index);
@@ -4723,23 +4723,45 @@ _int_free_create_chunk (mstate av, mchunkptr p, INTERNAL_SIZE_T size,
       } else
         clear_inuse_bit_at_offset(nextchunk, 0);
 
-      /*
-        Place the chunk in unsorted chunk list. Chunks are
-        not placed into regular bins until after they have
-        been given one chance to be used in malloc.
-      */
+      mchunkptr bck, fwd;
 
-      mchunkptr bck = unsorted_chunks (av);
-      mchunkptr fwd = bck->fd;
-      if (__glibc_unlikely (fwd->bk != bck))
-        malloc_printerr ("free(): corrupted unsorted chunks");
-      p->fd = fwd;
+      if(!in_smallbin_range (size))
+        {
+          /*
+            Place large chunks in unsorted chunk list. Large chunks are
+            not placed into regular bins until after they have
+            been given one chance to be used in malloc.
+
+            This branch is first in the if-statement to help branch
+            prediction on consecutive adjacent frees.
+          */
+
+          bck = unsorted_chunks (av);
+          fwd = bck->fd;
+          if (__glibc_unlikely (fwd->bk != bck))
+            malloc_printerr ("free(): corrupted unsorted chunks");
+          p->fd_nextsize = NULL;
+          p->bk_nextsize = NULL;
+        }
+      else
+        {
+          /*
+            Place small chunks directly in their smallbin, so they
+            don't pollute the unsorted bin.
+          */
+
+          int chunk_index = smallbin_index (size);
+          bck = bin_at (av, chunk_index);
+          fwd = bck->fd;
+
+          if (__glibc_unlikely (fwd->bk != bck))
+            malloc_printerr ("free(): chunks in smallbin corrupted");
+
+          mark_bin (av, chunk_index);
+        }
+
       p->bk = bck;
-      if (!in_smallbin_range(size))
-        {
-          p->fd_nextsize = NULL;
-          p->bk_nextsize = NULL;
-        }
+      p->fd = fwd;
       bck->fd = p;
       fwd->bk = p;
 
@@ -4748,7 +4770,6 @@ _int_free_create_chunk (mstate av, mchunkptr p, INTERNAL_SIZE_T size,
 
       check_free_chunk(av, p);
     }
-
   else
     {
       /* If the chunk borders the current high end of memory,